CVE-2022-31696: An Analysis of a VMware ESXi TCP Socket Keepalive Type Confusion LPE

22 June 2023 at 16:00

Last year we published our patch gap analysis of ESXi’s TCP/IP stack, which is forked from FreeBSD 8.2. While our focus was mainly on missing FreeBSD patches in ESXi, we also came across a type confusion bug in code introduced by VMware. This blog post details a vulnerability I discovered in ESXi’s implementation of the setsockopt system call that could lead to a sandbox escape. The vulnerability was assigned CVE-2022-31696 and disclosed as part of the advisory VMSA-2022-003. Additionally, I also explore ESXi’s kernel heap allocator and weaknesses in existing kernel mitigations.

For information regarding the initial analysis of the TCP/IP kernel module, VMkernel debug symbols, and porting type information from FreeBSD to ESXi, it is recommend to read our earlier analysis.

Comparing setsockopt in FreeBSD vs ESXi

First, let’s take a look at how ESXi 6.7 build 19195723’s setsockopt implementation differs from that of FreeBSD. Of particular note are differences in the handling of the SO_KEEPALIVE socket option. This option enables keep-alive messages on connection-oriented sockets.

*Figure 1 - FreeBSD 8.2 code (left) vs ESXi 6.7 19195723 IDA Pro decompiled code (right)*

In BSD systems, the TCP timer functions are registered and executed through the callout facility. ESXi added code here to check if there is an active callout for the keep-alive, by calling tcp_timer_active. If so, it resets the TCP keepidle to a newer value using tcp_timer_activate. The keepidle value determines how long TCP should wait before sending out the first keep-alive probe.

Type confusion vulnerability in SO_KEEPALIVE handling

What’s the issue with this newly added code? To understand this better, let’s take another look at the decompiled code with type information added.

*Figure 2 - Protocol PCB type casted without validation*

The Internet PCB structure inpcb has a pointer inp_ppcb that can point to either a TCP PCB (tcpcb) or a UDP PCB (udpcb) structure depending on the protocol. The vulnerable code shown here always type casts the pointer to tcpcb irrespective of the socket type. If the SO_KEEPALIVE option is set for a UDP socket, inp_ppcb is a pointer to a udpcb structure, but here it is casted to tcpcb structure due to the lack of validation. When the code further accesses the tcp_timer structure variable t_timers at offset 0x20, the access is out of bounds because the udpcb structure is only 0x10 bytes in size.

*Figure 3 - The Kernel Data Structures for TCP and UDP Protocol Control Blocks as seen in FreeBSD*

Triggering the Vulnerable Code and PSOD

In order to trigger the vulnerable code path, we need to create a UDP socket and then manipulate the socket using the setsockopt system call. Specifically, it is necessary to set the SO_KEEPALIVE option. Since ESXi does not package any build tools, we must compile the PoC statically in a Linux machine and then transfer the binary to ESXi for execution. Running the PoC will immediately trigger the Purple Screen of Death (PSOD). To trigger the bug from a sandboxed process, an attacker must be able to invoke the setsockopt system call on an existing UDP socket descriptor or create a new one for that purpose. Below is the PoC to trigger the bug:

The resulting PSOD:

*Figure 4 - ESXi PSOD on TCP Timers code when running the PoC*

Kernel Debug Setup for ESXi

While ESXi supports a local VMkernel debugger, VMKDBG, which can be used to inspect the PSOD, it is not as flexible as GDB. The GDB setup detailed in Attacking VMware NSX (Slides 34 – 37 in the PDF) is an excellent reference for getting started with ESXi kernel debugging. In summary, we used the GDB stubs feature provided by VMware to debug ESXi running as a guest VM on Fusion. We also disabled kASLR for ease of debugging. Since the ESXi kernel modules have symbols, it is possible to use GDB’s add-symbol-file command to load symbol information given an executable file and its base address in memory. The module base address and the path information required for add-symbol-file can be fetched using the esxcfg-info command as seen below:

While the file path to the tcpip module can be seen in the output, there is no file path entry for the VMkernel module. The VMkernel module with symbol information is found as a gzip-compressed file k.b00 within the bootbank directory of ESXi. Alternatively, to obtain the VMkernel executable with not only symbols but also type information, one can download it from the VMware WorkBench. However, in this case, the VMware WorkBench does not have debug information for the version of ESXi currently under analysis.

Once the kernel modules are available and their base addresses are known, connect the debugger and run the PoC to trigger the crash. The exception triggered may not be caught by GDB. In that case, ESXi will continue running, executing the handler for Interrupt 13 - General Protection Fault (GP), which is responsible for collecting fault information and core dumps. Should this occur, wait for the PSOD and then hit “Control + C” (SIGINT) to break into GDB. In the debug session shown below, you can see the symbolized stack trace obtained using GDB’s add-symbol-file command. tcp_timer_active was the last function to be executed before calling the interrupt handler. Therefore, choose the relevant frame (12 in this case) and inspect the program state. The register RAX was found to be loaded with some garbage value, leading to an invalid memory access during the execution of the mov eax,DWORD PTR [rax+0x38] instruction.

Analyzing the Exploitability of the Type Confusion Bug

Since the debug setup with symbols is now ready, let’s take another look at the crash by setting breakpoints and stepping through the code. The tcpcb structure can be inspected during the call to tcp_timer_active function, which takes it as the first argument. However, the type information is still missing within GDB. As a workaround, it is possible to use the type information from the FreeBSD kernel for debugging ESXi’s tcpip kernel module. Though some of the structure definitions vary somewhat between the FreeBSD and ESXi TCP/IP stacks, they have substantial similarities. Once again, GDB’s add-symbol-file command comes in handy. To import all structure definitions, use the add-symbol-file command but with address set to 0. Similarly, type information for VMkernel can be imported from an older version of ESXi vmkernel-visor (6.7-14320388) available through the VMware WorkBench.

Unlike the previous debug session, where the crash happened when accessing a garbage pointer, this time the t_timers variable is pointing to NULL and will result in a NULL pointer dereference. To better understand this behavior, it is necessary to examine the heap allocator used by ESXi. After some analysis on the vmkernel-visor executable, it was noticed that ESXi’s kernel heap allocator is based on Doug Lea's Malloc:

In dlmalloc, the malloc chunk headers are 32 bytes in size. The structure definition is as follows:

*Figure 5 - Chunk header of Doug Lea's Malloc*

The prev_foot field holds the size of previous chunk if free, whereas the head field holds the size of the current chunk. In addition to the size, the head field also holds two flag bits: PINUSE_BIT and CINUSE_BIT. The PINUSE_BIT (lowest order bit) marks if the previous chunk is in use. The CINUSE_BIT (second lowest bit) marks if the current chunk is in use. The forward fd and backward bk pointer fields are used only when the chunk is free. Otherwise the chunk data starts immediately after the head field. Now, looking back at the memory pointed to by RDI, it can be inferred that it is the data region of a dlmalloc chunk of size 32 bytes, which can hold 16 bytes of data (the udpcb structure).

As explained above, when fetching the t_timers pointer from offset +0x20, it accesses data from the adjacent chunk. This is because the allocated udpcb structure is smaller than the offset of t_timers in the tcpcb structure. Since the adjacent chunk may hold unrelated data, its contents are unpredictable (unless greater care is taken to first groom the heap). That is why the PoC crash will sometimes manifest as a NULL pointer deference and sometimes as a different kind of invalid access. Here is what the access of t_timers looks like:

*Figure 6 - State of heap memory during type confusion*

Assuming control of the t_timers pointer, it is possible to corrupt arbitrary memory during the write operations within the callout_stop or callout_reset functions. Alternatively, if there is control over the memory pointed to by the t_timers pointer, it is possible to control the subsequent access of the tcp_timer structure. Specifically, tcp_timer contains a callout substructure scheduled for execution by tcp_timer_activate. By targeting the c_func function pointer we can gain control of the instruction pointer. Since ESXi does not support Supervisor Mode Access Prevention (SMAP), t_timers could in fact point to user space memory instead of controlled memory in kernel space.

*Figure 7 - TCP timers and Callout data structures*

Note that structures such as tcpcb, tcp_timer and callout in ESXi are slightly different from the corresponding structures in FreeBSD. By comparing the decompiled ESXi code against FreeBSD 8.2, I identified new structure elements and adjusted the offsets of existing fields. For example, some global variables in FreeBSD such as tcp_keepidle, tcp_keepintvl and tcp_keepcnt were turned into fields of the tcp_timer structure in ESXi. This can be recognized by analyzing the tcp_timer_keep callout function.

In addition to lack of support for SMAP, the kASLR of kernel modules was also found to be weak. While the text base address showed significant randomization, the data segment base address did not, with as little as 1 bit of entropy in some cases. Here are the load addresses of the tcpip kernel module across multiple reboots:

Patch Analysis

To understand the fix for the type confusion bug, a patch diff was performed against ESXi 6.7 Build 20497097 (now at end-of-life). Instead of setting up the newer version of ESXi, you can just download the relevant VIB (vSphere Installation Bundle) from the ESXi Patch Tracker. In the case of tcpip, the kernel module is found within the ESXi base system esx-base VIB. This information can be queried using the esxcli command:

The diff between tcpip kernel modules from build 19195723 and 20497097 revealed an additional check added to sosetopt function. The code now checks whether the socket protocol is IPPROTO_TCP before proceeding with TCP timers. There is no explicit check to prevent a raw socket from entering the code path, but inp_ppcb is initialized only for socket types SOCK_STREAM and SOCK_DGRAM but not for type SOCK_RAW. Therefore, the timer code is reachable only when the socket type is SOCK_STREAM and the protocol is IPPROTO_TCP.

*Figure 8 - Vulnerable code (left) vs Fixed code (right)*

Interestingly, in 2012, the Linux kernel fixed a very similar issue in the handling of RAW sockets - CVE-2012-6657 Kernel: net: guard tcp_set_keepalive against crash:

*Figure 9 - Linux patch for CVE-2012-6657*

Conclusion

Historically, kernel privilege escalation vulnerabilities in ESXi have not been frequently seen. ESXi has no login shell for low-privileged users, so that entry point is eliminated. On the other hand, user-mode daemons such as SLPD run with the highest privileges (i.e., superDom), so in the case of compromise of a daemon, there is no need for further escalation. For these reasons, ESXi kernel bugs have not been a popular topic of discussion, at least not publicly. However, the situation is changing. SLP is no longer enabled by default, and ESXi is now sandboxing more and more user-mode processes. This makes us believe ESXi kernel bugs will become important in the coming years. For anyone interested, I hope this blog post will give some ideas to get started on the topic, and I’ll continue blogging about any significant findings in the future. Until then, you can follow me @renorobertr and follow the team on Twitter, Mastodon, LinkedIn, or Instagram for the latest in exploit techniques and security patches.

Normal view