CVE-2022-31696: AN ANALYSIS OF A VMWARE ESXI TCP SOCKET KEEPALIVE TYPE CONFUSION LPE
Last year we published our patch gap analysis of ESXi’s TCP/IP stack, which is forked from FreeBSD 8.2. While our focus was mainly on missing FreeBSD patches in ESXi, we also came across a type confusion bug in code introduced by VMware. This blog post details a vulnerability I discovered in ESXi’s implementation of the setsockopt system call that could lead to a sandbox escape. The vulnerability was assigned CVE-2022-31696 and disclosed as part of the advisory VMSA-2022-0030. I also explore ESXi’s kernel heap allocator and weaknesses in existing kernel mitigations.
For information regarding the initial analysis of the TCP/IP kernel module, VMkernel debug symbols, and porting type information from FreeBSD to ESXi, we recommend reading our earlier analysis.
Comparing setsockopt in FreeBSD vs ESXi
First, let’s take a look at how ESXi 6.7 build 19195723’s setsockopt implementation differs from that of FreeBSD. Of particular note are differences in the handling of the SO_KEEPALIVE socket option. This option enables keep-alive messages on connection-oriented sockets.
In BSD systems, the TCP timer functions are registered and executed through the callout facility. ESXi added code here to check if there is an active callout for the keep-alive by calling tcp_timer_active. If so, it resets the TCP keepidle to a newer value using tcp_timer_activate. The keepidle value determines how long TCP should wait before sending out the first keep-alive probe.
Type confusion vulnerability in SO_KEEPALIVE handling
What’s the issue with this newly added code? To understand this better, let’s take another look at the decompiled code with type information added.
The Internet PCB structure inpcb has a pointer inp_ppcb that can point to either a TCP PCB (tcpcb) or a UDP PCB (udpcb) structure depending on the protocol. The vulnerable code shown here always casts the pointer to tcpcb irrespective of the socket type. If the SO_KEEPALIVE option is set for a UDP socket, inp_ppcb is a pointer to a udpcb structure, but here it is cast to a tcpcb structure due to the lack of validation. When the code then accesses the tcp_timer structure pointer t_timers at offset 0x20, the access is out of bounds because the udpcb structure is only 0x10 bytes in size.
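To make the out-of-bounds access concrete, here is a minimal user-space sketch in C. The two structures are stand-ins, not the real ESXi definitions: only the 0x10-byte size of udpcb and the 0x20 offset of t_timers come from the analysis above, and the `_model` names, the `placeholder` fields, and the helper function are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the UDP PCB: only its total size (0x10) comes from the text. */
struct udpcb_model {
    uint64_t placeholder[2];
};

/* Stand-in for the TCP PCB: only the offset of t_timers (0x20) comes from the text. */
struct tcpcb_model {
    uint8_t  placeholder[0x20];
    uint64_t t_timers;   /* pointer-sized slot that would hold the tcp_timer pointer */
};

_Static_assert(sizeof(struct udpcb_model) == 0x10, "udpcb model is 0x10 bytes");
_Static_assert(offsetof(struct tcpcb_model, t_timers) == 0x20, "t_timers is at +0x20");

/* Bytes by which an 8-byte read of t_timers runs past a udpcb-sized object. */
size_t t_timers_overrun(void)
{
    size_t read_end = offsetof(struct tcpcb_model, t_timers) + sizeof(uint64_t);
    return read_end - sizeof(struct udpcb_model);
}
```

Under these assumptions the read of t_timers starts 0x10 bytes past the end of the udpcb object and ends 0x18 bytes past it, so every byte it returns belongs to whatever follows the allocation.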
Triggering the Vulnerable Code and PSOD
In order to trigger the vulnerable code path, we need to create a UDP socket and then manipulate the socket using the setsockopt system call. Specifically, it is necessary to set the SO_KEEPALIVE option. Since ESXi does not package any build tools, we must compile the PoC statically on a Linux machine and then transfer the binary to ESXi for execution. Running the PoC will immediately trigger the Purple Screen of Death (PSOD). To trigger the bug from a sandboxed process, an attacker must be able to invoke the setsockopt system call on an existing UDP socket descriptor or create a new one for that purpose. Below is the PoC to trigger the bug:
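A minimal sketch of the two calls described above, written against the standard BSD sockets API. It is a reconstruction from the description, not the original listing, and the function name is hypothetical.

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Create a UDP socket and enable SO_KEEPALIVE on it.
 * Returns 0 if setsockopt succeeded, -1 otherwise.
 */
int set_keepalive_on_udp_socket(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
    if (fd < 0)
        return -1;

    int enable = 1;
    int rc = setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &enable, sizeof(enable));

    close(fd);
    return rc == 0 ? 0 : -1;
}
```

On Linux and other unaffected systems the call simply succeeds and sets a flag on the socket. Only on a vulnerable ESXi build does the same request reach the TCP timer code with a udpcb behind inp_ppcb.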
The resulting PSOD:
Kernel Debug Setup for ESXi
While ESXi supports a local VMkernel debugger, VMKDBG, which can be used to inspect the PSOD, it is not as flexible as GDB. The GDB setup detailed in Attacking VMware NSX (Slides 34 – 37 in the PDF) is an excellent reference for getting started with ESXi kernel debugging. In summary, we used the GDB stubs feature provided by VMware to debug ESXi running as a guest VM on Fusion. We also disabled kASLR for ease of debugging. Since the ESXi kernel modules have symbols, it is possible to use GDB’s add-symbol-file command to load symbol information given an executable file and its base address in memory. The module base address and the path information required for add-symbol-file can be fetched using the esxcfg-info command as seen below:
While the file path to the tcpip module can be seen in the output, there is no file path entry for the VMkernel module. The VMkernel module with symbol information is found as a gzip-compressed file k.b00 within the bootbank directory of ESXi. Alternatively, to obtain the VMkernel executable with not only symbols but also type information, one can download it from the VMware WorkBench. However, in this case, the VMware WorkBench does not have debug information for the version of ESXi currently under analysis.
Once the kernel modules are available and their base addresses are known, connect the debugger and run the PoC to trigger the crash. The exception triggered may not be caught by GDB. In that case, ESXi will continue running, executing the handler for Interrupt 13 – General Protection Fault (GP), which is responsible for collecting fault information and core dumps. Should this occur, wait for the PSOD and then hit “Control + C” (SIGINT) to break into GDB. In the debug session shown below, you can see the symbolized stack trace obtained using GDB’s
tcp_timer_active was the last function to be executed before calling the interrupt handler. Therefore, choose the relevant frame (12 in this case) and inspect the program state. The register RAX was found to be loaded with some garbage value, leading to an invalid memory access during the execution of the mov eax,DWORD PTR [rax+0x38] instruction.
Analyzing the Exploitability of the Type Confusion Bug
Since the debug setup with symbols is now ready, let’s take another look at the crash by setting breakpoints and stepping through the code. The tcpcb structure can be inspected during the call to the tcp_timer_active function, which takes it as the first argument. However, the type information is still missing within GDB. As a workaround, it is possible to use the type information from the FreeBSD kernel for debugging ESXi’s tcpip kernel module. Though some of the structure definitions vary somewhat between the FreeBSD and ESXi TCP/IP stacks, they have substantial similarities. Once again, GDB’s add-symbol-file command comes in handy. To import all structure definitions, use the add-symbol-file command but with the address set to 0. Similarly, type information for VMkernel can be imported from an older version of ESXi vmkernel-visor (6.7-14320388) available through the VMware WorkBench.
Unlike the previous debug session, where the crash happened when accessing a garbage pointer, this time the t_timers variable is NULL, which results in a NULL pointer dereference. To better understand this behavior, it is necessary to examine the heap allocator used by ESXi. Analysis of the vmkernel-visor executable showed that ESXi’s kernel heap allocator is based on Doug Lea’s Malloc (dlmalloc). In dlmalloc, the malloc chunk headers are 32 bytes in size. The structure definition is as follows:
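A minimal sketch of that header, modeled on stock dlmalloc rather than taken from ESXi. The fields are written as fixed 64-bit integers so the layout is the same on any build machine (stock dlmalloc declares fd and bk as chunk pointers), and the three-bit flag mask and the helper functions are assumptions added for illustration.

```c
#include <stdint.h>

/* Chunk header with 64-bit fields: 4 x 8 = 32 bytes. */
struct malloc_chunk {
    uint64_t prev_foot;  /* size of the previous chunk, if that chunk is free */
    uint64_t head;       /* size of this chunk plus flag bits */
    uint64_t fd;         /* forward link, meaningful only while the chunk is free */
    uint64_t bk;         /* backward link, meaningful only while the chunk is free */
};

#define PINUSE_BIT 0x1ULL  /* previous chunk is in use */
#define CINUSE_BIT 0x2ULL  /* current chunk is in use */
#define FLAG_MASK  0x7ULL  /* stock dlmalloc reserves the low three bits of head */

/* Chunk size with the flag bits stripped. */
uint64_t chunk_size(uint64_t head)  { return head & ~FLAG_MASK; }

int prev_in_use(uint64_t head)      { return (head & PINUSE_BIT) != 0; }
int curr_in_use(uint64_t head)      { return (head & CINUSE_BIT) != 0; }

/* Usable bytes when data starts right after head (prev_foot + head = 16 bytes). */
uint64_t chunk_payload(uint64_t head) { return chunk_size(head) - 16; }
```

For example, a head value of 0x23 decodes to a 0x20-byte chunk with both in-use bits set, leaving 16 bytes for data.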
The prev_foot field holds the size of the previous chunk if it is free, whereas the head field holds the size of the current chunk. In addition to the size, the head field also holds two flag bits: PINUSE_BIT (lowest-order bit) marks whether the previous chunk is in use, and CINUSE_BIT (second-lowest bit) marks whether the current chunk is in use. The forward (fd) and backward (bk) pointer fields are used only when the chunk is free. Otherwise the chunk data starts immediately after the head field. Now, looking back at the memory pointed to by RDI, it can be inferred that it is the data region of a dlmalloc chunk of size 32 bytes, which can hold 16 bytes of data (the size of the udpcb structure).
As explained above, when the code fetches the t_timers pointer from offset +0x20, it reads data from the adjacent chunk. This is because the allocated udpcb structure is smaller than the offset of t_timers in the tcpcb structure. Since the adjacent chunk may hold unrelated data, its contents are unpredictable (unless greater care is taken to first groom the heap). That is why the PoC crash will sometimes manifest as a NULL pointer dereference and sometimes as a different kind of invalid access. Here is what the access of t_timers looks like:
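The arithmetic can be checked with a small user-space simulation. It assumes the layout described above (a 32-byte chunk whose data starts 0x10 bytes in) and additionally assumes that the neighboring chunk follows immediately and has the same layout. The buffer, constants, and function name are illustrative only.

```c
#include <stdint.h>
#include <string.h>

#define CHUNK_SIZE      0x20  /* 32-byte chunk holding the udpcb */
#define DATA_OFFSET     0x10  /* data begins after prev_foot and head */
#define T_TIMERS_OFFSET 0x20  /* offset of t_timers inside tcpcb */

/*
 * Lay out two adjacent 32-byte chunks in a local buffer, store
 * neighbor_value at the start of the second chunk's data, then read
 * 8 bytes at udpcb + 0x20 the way the confused code reads t_timers.
 */
uint64_t read_confused_t_timers(uint64_t neighbor_value)
{
    uint8_t heap[2 * CHUNK_SIZE];
    memset(heap, 0, sizeof(heap));

    uint8_t *udpcb    = heap + DATA_OFFSET;               /* first chunk's data */
    uint8_t *neighbor = heap + CHUNK_SIZE + DATA_OFFSET;  /* second chunk's data */
    memcpy(neighbor, &neighbor_value, sizeof(neighbor_value));

    uint64_t t_timers;
    memcpy(&t_timers, udpcb + T_TIMERS_OFFSET, sizeof(t_timers));
    return t_timers;
}
```

Whatever the neighbor stores in its first 8 bytes comes back as t_timers: zero reproduces the NULL dereference, and any other value reproduces the garbage-pointer crash.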
Assuming control of the t_timers pointer, it is possible to corrupt arbitrary memory during the write operations within the callout_stop or callout_reset functions. Alternatively, if there is control over the memory pointed to by the t_timers pointer, it is possible to control the subsequent access of the tcp_timer structure. Specifically, tcp_timer contains a callout substructure scheduled for execution by tcp_timer_activate. By targeting the c_func function pointer we can gain control of the instruction pointer. Since ESXi does not support Supervisor Mode Access Prevention (SMAP), t_timers could in fact point to user space memory instead of controlled memory in kernel space.
Note that structures such as callout in ESXi are slightly different from the corresponding structures in FreeBSD. By comparing the decompiled ESXi code against FreeBSD 8.2, I identified new structure elements and adjusted the offsets of existing fields. For example, some global variables in FreeBSD such as tcp_keepcnt were turned into fields of the tcp_timer structure in ESXi. This can be recognized by analyzing the tcp_timer_keep callout function.
In addition to the lack of support for SMAP, the kASLR of kernel modules was also found to be weak. While the text base address showed significant randomization, the data segment base address did not, with as little as 1 bit of entropy in some cases. Here are the load addresses of the tcpip kernel module across multiple reboots:
To understand the fix for the type confusion bug, a patch diff was performed against ESXi 6.7 Build 20497097 (now at end-of-life). Instead of setting up the newer version of ESXi, you can just download the relevant VIB (vSphere Installation Bundle) from the ESXi Patch Tracker. In the case of tcpip, the kernel module is found within the ESXi base system esx-base VIB. This information can be queried using the
The diff between the tcpip kernel modules from builds 19195723 and 20497097 revealed an additional check added to the sosetopt function. The code now checks whether the socket protocol is IPPROTO_TCP before proceeding with TCP timers. There is no explicit check to prevent a raw socket from entering the code path, but inp_ppcb is initialized only for sockets of type SOCK_STREAM and SOCK_DGRAM, not for type SOCK_RAW. Therefore, the timer code is reachable only when the socket type is SOCK_STREAM and the protocol is IPPROTO_TCP.
Interestingly, in 2012, the Linux kernel fixed a very similar issue in the handling of RAW sockets – CVE-2012-6657 Kernel: net: guard tcp_set_keepalive against crash:
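The net effect of both fixes can be expressed as a single predicate: keep-alive timers are touched only for a stream socket whose protocol is TCP. The following is a user-space sketch of that condition, not code from either kernel, and the function name is made up for illustration.

```c
#include <netinet/in.h>
#include <stdbool.h>
#include <sys/socket.h>

/* True only for the one combination that owns TCP keep-alive timers. */
bool keepalive_timers_apply(int so_type, int protocol)
{
    return so_type == SOCK_STREAM && protocol == IPPROTO_TCP;
}
```

With this guard in place, a UDP socket (SOCK_DGRAM, IPPROTO_UDP) and a raw socket opened with IPPROTO_TCP are both rejected before any TCP-specific state is dereferenced.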
Historically, kernel privilege escalation vulnerabilities in ESXi have not been frequently seen. ESXi has no login shell for low-privileged users, so that entry point is eliminated. On the other hand, user-mode daemons such as SLPD run with the highest privileges (i.e., superDom), so in the case of compromise of a daemon, there is no need for further escalation. For these reasons, ESXi kernel bugs have not been a popular topic of discussion, at least not publicly. However, the situation is changing. SLP is no longer enabled by default, and ESXi is now sandboxing more and more user-mode processes. This makes us believe ESXi kernel bugs will become important in the coming years. For anyone interested, I hope this blog post will give some ideas to get started on the topic, and I’ll continue blogging about any significant findings in the future. Until then, you can follow me @renorobertr and follow the team on Twitter, Mastodon, LinkedIn, or Instagram for the latest in exploit techniques and security patches.