Normal view

There are new articles available, click to refresh the page.
Before yesterdayGoogle Project Zero

Zooming in on Zero-click Exploits

By: Ryan
18 January 2022 at 17:28

Posted by Natalie Silvanovich, Project Zero


Zoom is a video conferencing platform that has gained popularity throughout the pandemic. Unlike other video conferencing systems that I have investigated, where one user initiates a call that other users must immediately accept or reject, Zoom calls are typically scheduled in advance and joined via an email invitation. In the past, I hadn’t prioritized reviewing Zoom because I believed that any attack against a Zoom client would require multiple clicks from a user. However, a zero-click attack against the Windows Zoom client was recently revealed at Pwn2Own, showing that it does indeed have a fully remote attack surface. The following post details my investigation into Zoom.

This analysis resulted in two vulnerabilities being reported to Zoom. One was a buffer overflow that affected both Zoom clients and MMR servers, and one was an info leak that is only useful to attackers on MMR servers. Both of these vulnerabilities were fixed on November 24, 2021.

Zoom Attack Surface Overview

Zoom’s main feature is multi-user conference calls called meetings that support a variety of features including audio, video, screen sharing and in-call text messages. There are several ways that users can join Zoom meetings. To start, Zoom provides full-featured installable clients for many platforms, including Windows, Mac, Linux, Android and iPhone. Users can also join Zoom meetings using a browser link, but they are able to use fewer features of Zoom. Finally, users can join a meeting by dialing phone numbers provided in the invitation on a touch-tone phone, but this only allows access to the audio stream of a meeting. This research focused on the Zoom client software, as the other methods of joining calls use existing device features.

Zoom clients support several communication features other than meetings that are available to a user’s Zoom Contacts. A Zoom Contact is a user that another user has added as a contact using the Zoom user interface. Both users must consent before they become Zoom Contacts. Afterwards, the users can send text messages to one another outside of meetings and start channels for persistent group conversations. Also, if either user hosts a meeting, they can invite the other user in a manner that is similar to a phone call: the other user is immediately notified and they can join the meeting with a single click. These features represent the zero-click attack surface of Zoom. Note that this attack surface is only available to attackers that have convinced their target to accept them as a contact. Likewise, meetings are part of the one-click attack surface only for Zoom Contacts, as other users need to click several times to enter a meeting.

That said, it’s likely not that difficult for a dedicated attacker to convince a target to join a Zoom call even if it takes multiple clicks, and the way some organizations use Zoom presents interesting attack scenarios. For example, many groups host public Zoom meetings, and Zoom supports a paid Webinar feature where large groups of unknown attendees can join a one-way video conference. It could be possible for an attacker to join a public meeting and target other attendees. Zoom also relies on a server to transmit audio and video streams, and end-to-end encryption is off by default. It could be possible for an attacker to compromise Zoom’s servers and gain access to meeting data.

Zoom Messages

I started out by looking at the zero-click attack surface of Zoom. Loading the Linux client into IDA, it appeared that a great deal of its server communication occurred over XMPP. Based on strings in the binary, it was clear that XMPP parsing was performed using a library called gloox. I fuzzed this library using AFL and other coverage-guided fuzzers, but did not find any vulnerabilities. I then looked at how Zoom uses the data provided over XMPP.

XMPP traffic seemed to be sent over SSL, so I located the SSL_write function in the binary based on log strings, and hooked it using Frida. The output contained many XMPP stanzas (messages) as well as other network traffic, which I analyzed to determine how XMPP is used by Zoom. XMPP is used for most communication between Zoom clients outside of meetings, such as messages and channels, and is also used for signaling (call set-up) when a Zoom Contact invites another Zoom Contact to a meeting.

I spent some time going through the client binary trying to determine how the client processes XMPP, for example, if a stanza contains a text message, how is that message extracted and displayed in the client. Even though the Zoom client contains many log strings, this was challenging, and I eventually asked my teammate Ned Williamson for help locating symbols for the client. He discovered that several old versions of the Android Zoom SDK contained symbols. While these versions are roughly five years old, and do not present a complete view of the client as they only include some libraries that it uses, they were immensely helpful in understanding how Zoom uses XMPP.

Application-defined tags can be added to gloox’s XMPP parser by extending the class StanzaExtension and implementing the method newInstance to define how the tag is converted into a C++ object. Parsed XMPP stanzas are then processed using the MessageHandler class. Application developers extend this class, implementing the method handleMessage with code that performs application functionality based on the contents of the stanza received. Zoom implements its XMPP handling in CXmppIMSession::handleMessage, which is a large function that is an entrypoint to most messaging and calling features. The final processing stage of many XMPP tags is in the class ns_zoom_messager::CZoomMMXmppWrapper, which contains many methods starting with ‘On’ that handle specific events. I spent a fair amount of time analyzing these code paths, but didn’t find any bugs. Interestingly, Thijs Alkemade and Daan Keuper released a write-up of their Pwn2Own bug after I completed this research, and it involved a vulnerability in this area.

RTP Processing

Afterwards, I investigated how Zoom clients process audio and video content. Like all other video conferencing systems that I have analyzed, it uses Real-time Transport Protocol (RTP) to transport this data. Based on log strings included in the Linux client binary, Zoom appears to use a branch of WebRTC for audio. Since I have looked at this library a great deal in previous posts, I did not investigate it further. For video, Zoom implements its own RTP processing and uses a custom underlying codec named Zealot (libzlt).

Analyzing the Linux client in IDA, I found what I believed to be the video RTP entrypoint, and fuzzed it using afl-qemu. This resulted in several crashes, mostly in RTP extension processing. I tried modifying the RTP sent by a client to reproduce these bugs, but it was not received by the device on the other side and I suspected the server was filtering it. I tried to get around this by enabling end-to-end encryption, but Zoom does not encrypt RTP headers, only the contents of RTP packets (as is typical of most RTP implementations).

Curious about how Zoom server filtering works, I decided to set up Zoom On-Premises Deployment. This is a Zoom product that allows customers to set up on-site servers to process their organization’s Zoom calls. This required a fair amount of configuration, and I ended up reaching out to the Zoom Security Team for assistance. They helped me get it working, and I greatly appreciate their contribution to this research.

Zoom On-Premises Deployments consist of two hosts: the controller and the Multimedia Router (MMR). Analyzing the traffic to each server, it became clear that the MMR is the host that transmits audio and video content between Zoom clients. Loading the code for the MMR process into IDA, I located where RTP is processed, and it indeed parses the extensions as a part of its forwarding logic and verifies them correctly, dropping any RTP packets that are malformed.

The code that processes RTP on the MMR appeared different than the code that I fuzzed on the device, so I set up fuzzing on the server code as well. This was challenging, as the code was in the MMR binary, which was not compiled as a relocatable binary (more on this later). This meant that I couldn’t load it as a library and call into specific offsets in the binary as I usually do to fuzz binaries that don’t have source code available. Instead, I compiled my own fuzzing stub that called the function I wanted to fuzz as a relocatable that defined fopen, and loaded it using LD_PRELOAD when executing the MMR binary. Then my code would take control of execution the first time that the MMR binary called fopen, and was able to call the function being fuzzed.

This approach has a lot of downsides, the biggest being that the fuzzing stub can’t accept command line parameters, execution is fairly slow and a lot of fuzzing tools don’t honor LD_PRELOAD on the target. That said, I was able to fuzz with code coverage using Mateusz Jurczyk’s excellent DrSanCov, with no results.

Packet Processing

When analyzing RTP traffic, I noticed that both Zoom clients and the MMR server process a great deal of packets that didn’t appear to be RTP or XMPP. Looking at the SDK with symbols, one library appeared to do a lot of serialization: libssb_sdk.so. This library contains a great deal of classes with the methods load_from and save_to defined with identical declarations, so it is likely that they all implement the same virtual class.

One parameter to the load_from methods is an object of class msg_db_t,  which implements a buffer that supports reading different data types. Deserialization is performed by load_from methods by reading needed data from the msg_db_t object, and serialization is performed by save_to methods by writing to it.

After hooking a few save_to methods with Frida and comparing the written output to data sent with SSL_write, it became clear that these serialization classes are part of the remote attack surface of Zoom. Reviewing each load_from method, several contained code similar to the following (from ssb::conf_send_msg_req::load_from).

ssb::i_stream_t<ssb::msg_db_t,ssb::bytes_convertor>::operator>>(

msg_db, &this->str_len, consume_bytes, error_out);

  str_len = this->str_len;

  if ( str_len )

  {

    mem = operator new[](str_len);

    out_len = 0;

    this->str_mem = mem;

    ssb::i_stream_t<ssb::msg_db_t,ssb::bytes_convertor>::

read_str_with_len(msg_db, mem, &out_len);

read_str_with_len is defined as follows.

int __fastcall ssb::i_stream_t<ssb::msg_db_t,ssb::bytes_convertor>::

read_str_with_len(msg_db_t* msg, signed __int8 *mem,

unsigned int *len)

{

  if ( !msg->invalid )

  {

ssb::i_stream_t<ssb::msg_db_t,ssb::bytes_convertor>::operator>>(msg, len, (int)len, 0);

    if ( !msg->invalid )

    {

      if ( *len )

        ssb::i_stream_t<ssb::msg_db_t,ssb::bytes_convertor>::

read(msg, mem, *len, 0);

    }

  }

  return msg;

}

Note that the string buffer is allocated based on a length read from the msg_db_t buffer, but then a second length is read from the buffer and used as the length of the string that is read. This means that if an attacker could manipulate the contents of the msg_db_t buffer, they could specify the length of the buffer allocated, and overwrite it with any length of data (up to a limit of 0x1FFF bytes, not shown in the code snippet above).

I tested this bug by hooking SSL_write with Frida, and sending the malformed packet, and it caused the Zoom client to crash on a variety of platforms. This vulnerability was assigned CVE-2021-34423 and fixed on November 24, 2021.

Looking at the code for the MMR server, I noticed that ssb::conf_send_msg_req::load_from, the class the vulnerability occurs in was also present on the MMR server. Since the MMR forwards Zoom meeting traffic from one client to another, it makes sense that it might also deserialize this packet type. I analyzed the MMR code in IDA, and found that deserialization of this class only occurs during Zoom Webinars. I purchased a Zoom Webinar license, and was able to crash my own Zoom MMR server by sending this packet. I was not willing to test a vulnerability of this type on Zoom’s public MMR servers, but it seems reasonably likely that the same code was also in Zoom’s public servers.

Looking further at deserialization, I noticed that all deserialized objects contain an optional field of type ssb::dyna_para_table_t, which is basically a properties table that allows a map of name strings to variant objects to be included in the deserialized object. The variants in the table are implemented by the structure ssb::variant_t, as follows.

struct variant{

char type;

short length;

var_data data;

};

union var_data{

        char i8;

        char* i8_ptr;

        short i16;

        short* i16_ptr;

        int i32;

        int* i32_ptr;

        long long i64;

        long long i64*;

};

The value of the type field corresponds to the width of the variant data (1 for 8-bit, 2 for 16-bit, 3 for 32-bit and 4 four 64-bit). The length field specifies whether the variant is an array and its length. If it has the value 0, the variant is not an array, and a numeric value is read from the data field based on its type. If the length field has any other value, the data field is cast to a pointer, an array of that size is read.

My immediate concern with this implementation was that it could be prone to type confusion. One possibility is that a numeric value could be confused with an array pointer, which would allow an attacker to create a variant with a pointer that they specify. However, both the client and MMR perform very aggressive type checks on variants they treat as arrays. Another possibility is that a pointer could be confused with a numeric value. This could allow an attacker to determine the address of a buffer they control if the value is ever returned to the attacker. I found a few locations in the MMR code where a pointer is converted to a numeric value in this way and logged, but nowhere that an attacker could obtain the incorrectly cast value. Finally, I looked at how array data is handled, and I found that there are several locations where byte array variants are converted to strings, however not all of them checked that the byte array has a null terminator. This meant that if these variants were converted to strings, the string could contain the contents of uninitialized memory.

Most of the time, packets sent to the MMR by one user are immediately forwarded to other users without being deserialized by the server. For some bugs, this is a useful feature, for example, it is what allows CVE-2021-34423 discussed earlier to be triggered on a client. However, an information leak in variants needs to occur on the server to be useful to an attacker. When a client deserializes an incoming packet, it is for use on the device, so even if a deserialized string contains sensitive information, it is unlikely that this information will be transmitted off the device. Meanwhile, the MMR exists expressly to transmit information from one user to another, so if a string gets deserialized, there is a reasonable chance that it gets sent to another user, or alters server behavior in an observable way. So, I tried to find a way to get the server to deserialize a variant and convert it to a string. I eventually figured out that when a user logs into Zoom in a browser, the browser can’t process serialized packets, so the MMR must convert them to strings so they can be accessed through web requests. Indeed, I found that if I removed the null terminator from the user_name variant, it would be converted to a string and sent to the browser as the user’s display name.

The vulnerability was assigned CVE-2021-34424 and fixed on November 24, 2021. I tested it on my own MMR as well as Zoom’s public MMR, and it worked and returned pointer data in both cases.

Exploit Attempt

I attempted to exploit my local MMR server with these vulnerabilities, and while I had success with portions of the exploit, I was not able to get it working. I started off by investigating the possibility of creating a client that could trigger each bug outside of the Zoom client, but client authentication appeared complex and I lacked symbols for this part of the code, so I didn’t pursue this as I suspected it would be very time-consuming. Instead, I analyzed the exploitability of the bugs by triggering them from a Linux Zoom client hooked with Frida.

I started off by investigating the impact of heap corruption on the MMR process. MMR servers run on CentOS 7, which uses a modern glibc heap, so exploiting heap unlinking did not seem promising. I looked into overwriting the vtable of a C++ object allocated on the heap instead.

 

I wrote several Frida scripts that hooked malloc on the server, and used them to monitor how incoming traffic affects allocation. It turned out that there are not many ways for an attacker to control memory allocation on an MMR server that are useful for exploiting this vulnerability. There are several packet types that an attacker can send to the server that cause memory to be allocated on the heap and then freed when processing is finished, but not as many where the attacker can trigger both allocation and freeing. Moreover, the MMR server performs different types of processing in separate threads that use unique heap arenas, so many areas of the code where this type of allocation is likely to occur, such as connection management, allocate memory in a different heap arena than the thread where the bug occurs. The only such allocations I could find that were made in the same arena were related to meeting set-up: when a user joins a meeting, certain objects are allocated on the heap, which are then freed when they leave the meeting. Unfortunately, these allocations are difficult to automate as they require many unique users accounts in order for the allocation to be performed repeatedly, and allocation takes an observable amount of time (seconds).

I eventually wrote Frida scripts that looked for free chunks of unusual sizes that bordered C++ objects with vtables during normal MMR operation. There were a few allocation sizes that met this criteria, and since CVE-2021-34423 allows for the size of the buffer that is overflowed to be specified by the attacker, I was able to corrupt the memory of the adjacent object. Unfortunately, heap verification was very robust, so in most cases, the MMR process would crash due to a heap verification error before a virtual call was made on the corrupted object. I eventually got around this by focusing on allocation sizes that are small enough to be stored in fastbins by the heap, as heap chunks that are stored in fastbins do not contain verifiable heap metadata. Chunks of size 58 turned out to be the best choice, and by triggering the bug with an allocation of that size, I was able to control the pointer of a virtual call about one in ten times I triggered the bug.

The next step was to figure out where to point the pointer I could control, and this turned out to be more challenging than I expected. The MMR process did not have ASLR enabled when I did this research (it was enabled in version 4.6.20211128.136, which was released on November 28, 2021), so I was hoping to find a series of locations in the binary that this call could be directed to that would eventually end in a call to execv with controllable parameters, as the MMR initialization code contains many calls to this function. However, there were a few features of the server that made this difficult. First, only the MMR binary was loaded at a fixed location. The heap and system libraries were not, so only the actual MMR code was available without bypassing ASLR. Second, if the MMR crashes, it has an exponential backoff which culminates in it respawning every hour on the hour. This limits how many exploit attempts an attacker has. It is realistic that an attacker might spend days or even weeks trying to exploit a server, but this still limits them to hundreds of attempts. This means that any exploit of an MMR server would need to be at least somewhat reliable, so certain techniques that require a lot of attempts, such as allocating a large buffer on the heap and trying to guess its location were not practical.

I eventually decided that it would be helpful to allocate a buffer on the heap with controlled contents and determine its location. This would make the exploit fairly reliable in the case that the overflow successfully leads to a virtual call, as the buffer could be used as a fake vtable, and also contain strings that could be used as parameters to execv. I tried using CVE-2021-34424 to leak such an address, but wasn’t able to get this working.

This bug allows the attacker to provide a string of any size, which then gets copied out of bounds up until a null character is encountered in memory, and then returned. It is possible for CVE-2021-34424 to return a heap pointer, as the MMR maps the heap that gets corrupted at a low address that does not usually contain null bytes, however, I could not find a way to force a specific heap pointer to be allocated next to the string buffer that gets copied out of bounds. C++ objects used by the MMR tend to be virtual objects, so the first 64 bits of most object allocations are a vtable which contains null bytes, ending the copy. Other allocated structures, especially larger ones, tend to contain non-pointer data. I was able to get this bug to return heap pointers by specifying a string that was less than 64 bits long, so the nearby allocations were sometimes the pointers themselves, but allocations of this size are so frequent it was not possible to ascertain what heap data they pointed to with any accuracy.

One last idea I had was to use another type confusion bug to leak a pointer to a controllable buffer. There is one such bug in the processing of deserialized ssb::kv_update_req objects. This object’s ssb::dyna_para_table_t table contains a variant named nodeid which represents the specific Zoom client that the message refers to. If an attacker changes this variant to be of type array instead of a 32-bit integer, the address of the pointer to this array will be logged as a string. I tried to combine CVE-2021-34424 with this bug, hoping that it might be possible for the leaked data to be this log string that contains pointer information. Unfortunately, I wasn’t able to get this to work because of timing: the log entry needs to be logged at almost exactly the same time as the bug is triggered so that the log data is still in memory, and I wasn't able to send packets fast enough. I suspect it might be possible for this to work with improved automation, as I was relying on clients hooked with Frida and browsers to interact with the Zoom server, but I decided not to pursue this as it would require tooling that would take substantial effort to develop.

Conclusion

I performed a security analysis of Zoom and reported two vulnerabilities. One was a buffer overflow that affected both Zoom clients and MMR servers, and one was an info leak that is only useful to attackers on MMR servers. Both of these vulnerabilities were fixed on November 24, 2021.

The vulnerabilities in Zoom’s MMR server are especially concerning, as this server processes meeting audio and video content, so a compromise could allow an attacker to monitor any Zoom meetings that do not have end-to-end encryption enabled. While I was not successful in exploiting these vulnerabilities, I was able to use them to perform many elements of exploitation, and I believe that an attacker would be able to exploit them with sufficient investment. The lack of ASLR in the Zoom MMR process greatly increased the risk that an attacker could compromise it, and it is positive that Zoom has recently enabled it. That said, if vulnerabilities similar to the ones that I reported still exist in the MMR server, it is likely that an attacker could bypass it, so it is also important that Zoom continue to improve the robustness of the MMR code.

It is also important to note that this research was possible because Zoom allows customers to set up their own servers, meanwhile no other video conferencing solution with proprietary servers that I have investigated allows this, so it is unclear how these results compare to other video conferencing platforms.

Overall, while the client bugs that were discovered during this research were comparable to what Project Zero has found in other videoconferencing platforms, the server bugs were surprising, especially when the server lacked ASLR and supports modes of operation that are not end-to-end encrypted.

There are a few factors that commonly lead to security problems in videoconferencing applications that contributed to these bugs in Zoom. One is the huge amount of code included in Zoom. There were large portions of code that I couldn’t determine the functionality of, and many of the classes that could be deserialized didn’t appear to be commonly used. This both increases the difficulty of security research and increases the attack surface by making more code that could potentially contain vulnerabilities available to attackers. In addition, Zoom uses many proprietary formats and protocols which meant that understanding the attack surface of the platform and creating the tooling to manipulate specific interfaces was very time consuming. Using the features we tested also required paying roughly $1500 USD in licensing fees. These barriers to security research likely mean that Zoom is not investigated as often as it could be, potentially leading to simple bugs going undiscovered.  

Still, my largest concern in this assessment was the lack of ASLR in the Zoom MMR server. ASLR is arguably the most important mitigation in preventing exploitation of memory corruption, and most other mitigations rely on it on some level to be effective. There is no good reason for it to be disabled in the vast majority of software. There has recently been a push to reduce the susceptibility of software to memory corruption vulnerabilities by moving to memory-safe languages and implementing enhanced memory mitigations, but this relies on vendors using the security measures provided by the platforms they write software for. All software written for platforms that support ASLR should have it (and other basic memory mitigations) enabled.

The closed nature of Zoom also impacted this analysis greatly. Most video conferencing systems use open-source software, either WebRTC or PJSIP. While these platforms are not free of problems, it’s easier for researchers, customers and vendors alike to verify their security properties and understand the risk they present because they are open. Closed-source software presents unique security challenges, and Zoom could do more to make their platform accessible to security researchers and others who wish to evaluate it. While the Zoom Security Team helped me access and configure server software, it is not clear that support is available to other researchers, and licensing the software was still expensive. Zoom, and other companies that produce closed-source security-sensitive software should consider how to make their software accessible to security researchers.

A walk through Project Zero metrics

By: Ryan
10 February 2022 at 16:58

Posted by Ryan Schoen, Project Zero

tl;dr

  • In 2021, vendors took an average of 52 days to fix security vulnerabilities reported from Project Zero. This is a significant acceleration from an average of about 80 days 3 years ago.
  • In addition to the average now being well below the 90-day deadline, we have also seen a dropoff in vendors missing the deadline (or the additional 14-day grace period). In 2021, only one bug exceeded its fix deadline, though 14% of bugs required the grace period.
  • Differences in the amount of time it takes a vendor/product to ship a fix to users reflects their product design, development practices, update cadence, and general processes towards security reports. We hope that this comparison can showcase best practices, and encourage vendors to experiment with new policies.
  • This data aggregation and analysis is relatively new for Project Zero, but we hope to do it more in the future. We encourage all vendors to consider publishing aggregate data on their time-to-fix and time-to-patch for externally reported vulnerabilities, as well as more data sharing and transparency in general.

Overview

For nearly ten years, Google’s Project Zero has been working to make it more difficult for bad actors to find and exploit security vulnerabilities, significantly improving the security of the Internet for everyone. In that time, we have partnered with folks across industry to transform the way organizations prioritize and approach fixing security vulnerabilities and updating people’s software.

To help contextualize the shifts we are seeing the ecosystem make, we looked back at the set of vulnerabilities Project Zero has been reporting, how a range of vendors have been responding to them, and then attempted to identify trends in this data, such as how the industry as a whole is patching vulnerabilities faster.

For this post, we look at fixed bugs that were reported between January 2019 and December 2021 (2019 is the year we made changes to our disclosure policies and also began recording more detailed metrics on our reported bugs). The data we'll be referencing is publicly available on the Project Zero Bug Tracker, and on various open source project repositories (in the case of the data used below to track the timeline of open-source browser bugs).

There are a number of caveats with our data, the largest being that we'll be looking at a small number of samples, so differences in numbers may or may not be statistically significant. Also, the direction of Project Zero's research is almost entirely influenced by the choices of individual researchers, so changes in our research targets could shift metrics as much as changes in vendor behaviors could. As much as possible, this post is designed to be an objective presentation of the data, with additional subjective analysis included at the end.

The data!

Between 2019 and 2021, Project Zero reported 376 issues to vendors under our standard 90-day deadline. 351 (93.4%) of these bugs have been fixed, while 14 (3.7%) have been marked as WontFix by the vendors. 11 (2.9%) other bugs remain unfixed, though at the time of this writing 8 have passed their deadline to be fixed; the remaining 3 are still within their deadline to be fixed. Most of the vulnerabilities are clustered around a few vendors, with 96 bugs (26%) being reported to Microsoft, 85 (23%) to Apple, and 60 (16%) to Google.

Deadline adherence

Once a vendor receives a bug report under our standard deadline, they have 90 days to fix it and ship a patched version to the public. The vendor can also request a 14-day grace period if the vendor confirms they plan to release the fix by the end of that total 104-day window.

In this section, we'll be taking a look at how often vendors are able to hit these deadlines. The table below includes all bugs that have been reported to the vendor under the 90-day deadline since January 2019 and have since been fixed, for vendors with the most bug reports in the window.

Deadline adherence and fix time 2019-2021, by bug report volume

Vendor

Total bugs

Fixed by day 90

Fixed during
grace period

Exceeded deadline

& grace period

Avg days to fix

Apple

84

73 (87%)

7 (8%)

4 (5%)

69

Microsoft

80

61 (76%)

15 (19%)

4 (5%)

83

Google

56

53 (95%)

2 (4%)

1 (2%)

44

Linux

25

24 (96%)

0 (0%)

1 (4%)

25

Adobe

19

15 (79%)

4 (21%)

0 (0%)

65

Mozilla

10

9 (90%)

1 (10%)

0 (0%)

46

Samsung

10

8 (80%)

2 (20%)

0 (0%)

72

Oracle

7

3 (43%)

0 (0%)

4 (57%)

109

Others*

55

48 (87%)

3 (5%)

4 (7%)

44

TOTAL

346

294 (84%)

34 (10%)

18 (5%)

61

* For completeness, the vendors included in the "Others" bucket are Apache, ASWF, Avast, AWS, c-ares, Canonical, F5, Facebook, git, Github, glibc, gnupg, gnutls, gstreamer, haproxy, Hashicorp, insidesecure, Intel, Kubernetes, libseccomp, libx264, Logmein, Node.js, opencontainers, QT, Qualcomm, RedHat, Reliance, SCTPLabs, Signal, systemd, Tencent, Tor, udisks, usrsctp, Vandyke, VietTel, webrtc, and Zoom.

Overall, the data show that almost all of the big vendors here are coming in under 90 days, on average. The bulk of fixes during a grace period come from Apple and Microsoft (22 out of 34 total).

Vendors have exceeded the deadline and grace period about 5% of the time over this period. In this slice, Oracle has exceeded at the highest rate, but admittedly with a relatively small sample size of only about 7 bugs. The next-highest rate is Microsoft, having exceeded 4 of their 80 deadlines.

Average number of days to fix bugs across all vendors is 61 days. Zooming in on just that stat, we can break it out by year:

Bug fix time 2019-2021, by bug report volume

Vendor

Bugs in 2019

(avg days to fix)

Bugs in 2020

(avg days to fix)

Bugs in 2021

(avg days to fix)

Apple

61 (71)

13 (63)

11 (64)

Microsoft

46 (85)

18 (87)

16 (76)

Google

26 (49)

13 (22)

17 (53)

Linux

12 (32)

8 (22)

5 (15)

Others*

54 (63)

35 (54)

14 (29)

TOTAL

199 (67)

87 (54)

63 (52)

* For completeness, the vendors included in the "Others" bucket are Adobe, Apache, ASWF, Avast, AWS, c-ares, Canonical, F5, Facebook, git, Github, glibc, gnupg, gnutls, gstreamer, haproxy, Hashicorp, insidesecure, Intel, Kubernetes, libseccomp, libx264, Logmein, Mozilla, Node.js, opencontainers, Oracle, QT, Qualcomm, RedHat, Reliance, Samsung, SCTPLabs, Signal, systemd, Tencent, Tor, udisks, usrsctp, Vandyke, VietTel, webrtc, and Zoom.

From this, we can see a few things: first of all, the overall time to fix has consistently been decreasing, but most significantly between 2019 and 2020. Microsoft, Apple, and Linux overall have reduced their time to fix during the period, whereas Google sped up in 2020 before slowing down again in 2021. Perhaps most impressively, the others not represented on the chart have collectively cut their time to fix in more than half, though it's possible this represents a change in research targets rather than a change in practices for any particular vendor.

Finally, focusing on just 2021, we see:

  • Only 1 deadline exceeded, versus an average of 9 per year in the other two years
  • The grace period used 9 times (notably with half being by Microsoft), versus the slightly lower average of 12.5 in the other years

Mobile phones

Since the products in the previous table span a range of types (desktop operating systems, mobile operating systems, browsers), we can also focus on a particular, hopefully more apples-to-apples comparison: mobile phone operating systems.

Vendor

Total bugs

Avg fix time

iOS

76

70

Android (Samsung)

10

72

Android (Pixel)

6

72

The first thing to note is that it appears that iOS received remarkably more bug reports from Project Zero than any flavor of Android did during this time period, but rather than an imbalance in research target selection, this is more a reflection of how Apple ships software. Security updates for "apps" such as iMessage, Facetime, and Safari/WebKit are all shipped as part of the OS updates, so we include those in the analysis of the operating system. On the other hand, security updates for standalone apps on Android happen through the Google Play Store, so they are not included here in this analysis.

Despite that, all three vendors have an extraordinarily similar average time to fix. With the data we have available, it's hard to determine how much time is spent on each part of the vulnerability lifecycle (e.g. triage, patch authoring, testing, etc). However, open-source products do provide a window into where time is spent.

Browsers

For most software, we aren't able to dig into specifics of the timeline. Specifically: after a vendor receives a report of a security issue, how much of the "time to fix" is spent between the bug report and landing the fix, and how much time is spent between landing that fix and releasing a build with the fix? The one window we do have is into open-source software, and specific to the type of vulnerability research that Project Zero does, open-source browsers.

Fix time analysis for open-source browsers, by bug volume

Browser

Bugs

Avg days from bug report to public patch

Avg days from public patch to release

Avg days from bug report to release

Chrome

40

5.3

24.6

29.9

WebKit

27

11.6

61.1

72.7

Firefox

8

16.6

21.1

37.8

Total

75

8.8

37.3

46.1

We can also take a look at the same data, but with each bug spread out in a histogram. In particular, the histogram of the amount of time from a fix being landed in public to that fix being shipped to users shows a clear story (in the table above, this corresponds to "Avg days from public patch to release" column:

Histogram showing the distributions of time from a fix landing in public to a fix shipping for Firefox, Webkit, and Chrome. The fact that Webkit is still on the higher end of the histogram tells us that most of their time is spent shipping the fixed build after the fix has landed.

The table and chart together tell us a few things:

Chrome is currently the fastest of the three browsers, with time from bug report to releasing a fix in the stable channel in 30 days. The time to patch is very fast here, with just an average of 5 days between the bug report and the patch landing in public. The time for that patch to be released to the public is the bulk of the overall time window, though overall we still see the Chrome (blue) bars of the histogram toward the left side of the histogram. (Important note: despite being housed within the same company, Project Zero follows the same policies and procedures with Chrome that an external security researcher would follow. More information on that is available in our Vulnerability Disclosure FAQ.)

Firefox comes in second in this analysis, though with a relatively small number of data points to analyze. Firefox releases a fix on average in 38 days. A little under half of that is time for the fix to land in public, though it's important to note that Firefox intentionally delays committing security patches to reduce the amount of exposure before the fix is released. Once the patch has been made public, it releases the fixed build on average a few days faster than Chrome – with the vast majority of the fixes shipping 10-15 days after their public patch.

WebKit is the outlier in this analysis, with the longest number of days to release a patch at 73 days. Their time to land the fix publicly is in the middle between Chrome and Firefox, but unfortunately this leaves a very long amount of time for opportunistic attackers to find the patch and exploit it prior to the fix being made available to users. This can be seen by the Apple (red) bars of the second histogram mostly being on the right side of the graph, and every one of them except one being past the 30-day mark.

Analysis, hopes, and dreams

Overall, we see a number of promising trends emerging from the data. Vendors are fixing almost all of the bugs that they receive, and they generally do it within the 90-day deadline plus the 14-day grace period when needed. Over the past three years vendors have, for the most part, accelerated their patch effectively reducing the overall average time to fix to about 52 days. In 2021, there was only one 90-day deadline exceeded. We suspect that this trend may be due to the fact that responsible disclosure policies have become the de-facto standard in the industry, and vendors are more equipped to react rapidly to reports with differing deadlines. We also suspect that vendors have learned best practices from each other, as there has been increasing transparency in the industry.

One important caveat: we are aware that reports from Project Zero may be outliers compared to other bug reports, in that they may receive faster action as there is a tangible risk of public disclosure (as the team will disclose if deadline conditions are not met) and Project Zero is a trusted source of reliable bug reports. We encourage vendors to release metrics, even if they are high level, to give a better overall picture of how quickly security issues are being fixed across the industry, and continue to encourage other security researchers to share their experiences.

For Google, and in particular Chrome, we suspect that the quick turnaround time on security bugs is in part due to their rapid release cycle, as well as their additional stable releases for security updates. We're encouraged by Chrome's recent switch from a 6-week release cycle to a 4-week release cycle. On the Android side, we see the Pixel variant of Android releasing fixes about on par with the Samsung variants as well as iOS. Even so, we encourage the Android team to look for additional ways to speed up the application of security updates and push that segment of the industry further.

For Apple, we're pleased with the acceleration of patches landing, as well as the recent lack of use of grace periods as well as lack of missed deadlines. For WebKit in particular, we hope to see a reduction in the amount of time it takes between landing a patch and shipping it out to users, especially since WebKit security affects all browsers used in iOS, as WebKit is the only browser engine permitted on the iOS platform.

For Microsoft, we suspect that the high time to fix and Microsoft's reliance on the grace period are consequences of the monthly cadence of Microsoft's "patch Tuesday" updates, which can make it more difficult for development teams to meet a disclosure deadline. We hope that Microsoft might consider implementing a more frequent patch cadence for security issues, or finding ways to further streamline their internal processes to land and ship code quicker.

Moving forward

This post represents some number-crunching we've done of our own public data, and we hope to continue this going forward. Now that we've established a baseline over the past few years, we plan to continue to publish an annual update to better understand how the trends progress.

To that end, we'd love to have even more insight into the processes and timelines of our vendors. We encourage all vendors to consider publishing aggregate data on their time-to-fix and time-to-patch for externally reported vulnerabilities. Through more transparency, information sharing, and collaboration across the industry, we believe we can learn from each other's best practices, better understand existing difficulties and hopefully make the internet a safer place for all.

Racing against the clock -- hitting a tiny kernel race window

By: Ryan
24 March 2022 at 20:51

TL;DR:

How to make a tiny kernel race window really large even on kernels without CONFIG_PREEMPT:

  • use a cache miss to widen the race window a little bit
  • make a timerfd expire in that window (which will run in an interrupt handler - in other words, in hardirq context)
  • make sure that the wakeup triggered by the timerfd has to churn through 50000 waitqueue items created by epoll

Racing one thread against a timer also avoids accumulating timing variations from two threads in each race attempt - hence the title. On the other hand, it also means you now have to deal with how hardware timers actually work, which introduces its own flavors of weird timing variations.

Introduction

I recently discovered a race condition (https://crbug.com/project-zero/2247) in the Linux kernel. (While trying to explain to someone how the fix for CVE-2021-0920 worked - I was explaining why the Unix GC is now safe, and then got confused because I couldn't actually figure out why it's safe after that fix, eventually realizing that it actually isn't safe.) It's a fairly narrow race window, so I was wondering whether it could be hit with a small number of attempts - especially on kernels that aren't built with CONFIG_PREEMPT, which would make it possible to preempt a thread with another thread, as I described at LSSEU2019.

This is a writeup of how I managed to hit the race on a normal Linux desktop kernel, with a hit rate somewhere around 30% if the proof of concept has been tuned for the specific machine. I didn't do a full exploit though, I stopped at getting evidence of use-after-free (UAF) accesses (with the help of a very large file descriptor table and userfaultfd, which might not be available to normal users depending on system configuration) because that's the part I was curious about.

This also demonstrates that even very small race conditions can still be exploitable if someone sinks enough time into writing an exploit, so be careful if you dismiss very small race windows as unexploitable or don't treat such issues as security bugs.

The UAF reproducer is in our bugtracker.

The bug

In the UNIX domain socket garbage collection code (which is needed to deal with reference loops formed by UNIX domain sockets that use SCM_RIGHTS file descriptor passing), the kernel tries to figure out whether it can account for all references to some file by comparing the file's refcount with the number of references from inflight SKBs (socket buffers). If they are equal, it assumes that the UNIX domain sockets subsystem effectively has exclusive access to the file because it owns all references.

(The same pattern also appears for files as an optimization in __fdget_pos(), see this LKML thread.)

The problem is that struct file can also be referenced from an RCU read-side critical section (which you can't detect by looking at the refcount), and such an RCU reference can be upgraded into a refcounted reference using get_file_rcu() / get_file_rcu_many() by __fget_files() as long as the refcount is non-zero. For example, when this happens in the dup() syscall, the resulting reference will then be installed in the FD table and be available for subsequent syscalls.

When the garbage collector (GC) believes that it has exclusive access to a file, it will perform operations on that file that violate the locking rules used in normal socket-related syscalls such as recvmsg() - unix_stream_read_generic() assumes that queued SKBs can only be removed under the ->iolock mutex, but the GC removes queued SKBs without using that mutex. (Thanks to Xingyu Jin for explaining that to me.)

One way of looking at this bug is that the GC is working correctly - here's a state diagram showing some of the possible states of a struct file, with more specific states nested under less specific ones and with the state transition in the GC marked:

All relevant states are RCU-accessible. An RCU-accessible object can have either a zero refcount or a positive refcount. Objects with a positive refcount can be either live or owned by the garbage collector. When the GC attempts to grab a file, it transitions from the state "live" to the state "owned by GC" by getting exclusive ownership of all references to the file.

While __fget_files() is making an incorrect assumption about the state of the struct file while it is trying to narrow down its possible states - it checks whether get_file_rcu() / get_file_rcu_many() succeeds, which narrows the file's state down a bit but not far enough:

__fget_files() first uses get_file_rcu() to conditionally narrow the state of a file from "any RCU-accessible state" to "any refcounted state". Then it has to narrow the state from "any refcounted state" to "live", but instead it just assumes that they are equivalent.

And this directly leads to how the bug was fixed (there's another follow-up patch, but that one just tries to clarify the code and recoup some of the resulting performance loss) - the fix adds another check in __fget_files() to properly narrow down the state of the file such that the file is guaranteed to be live:

The fix is to properly narrow the state from "any refcounted state" to "live" by checking whether the file is still referenced by a file descriptor table entry.

The fix ensures that a live reference can only be derived from another live reference by comparing with an FD table entry, which is guaranteed to point to a live object.

[Sidenote: This scheme is similar to the one used for struct page - gup_pte_range() also uses the "grab pointer, increment refcount, recheck pointer" pattern for locklessly looking up a struct page from a page table entry while ensuring that new refcounted references can't be created without holding an existing reference. This is really important for struct page because a page can be given back to the page allocator and reused while gup_pte_range() holds an uncounted reference to it - freed pages still have their struct page, so there's no need to delay freeing of the page - so if this went wrong, you'd get a page UAF.]

My initial suggestion was to instead fix the issue by changing how unix_gc() ensures that it has exclusive access, letting it set the file's refcount to zero to prevent turning RCU references into refcounted ones; this would have avoided adding any code in the hot __fget_files() path, but it would have only fixed unix_gc(), not the __fdget_pos() case I discovered later, so it's probably a good thing this isn't how it was fixed:

[Sidenote: In my original bug report I wrote that you'd have to wait an RCU grace period in the GC for this, but that wouldn't be necessary as long as the GC ensures that a reaped socket's refcount never becomes non-zero again.]

The race

There are multiple race conditions involved in exploiting this bug, but by far the trickiest to hit is that we have to race an operation into the tiny race window in the middle of __fget_files() (which can e.g. be reached via dup()), between the file descriptor table lookup and the refcount increment:

static struct file *__fget_files(struct files_struct *files, unsigned int fd,

                                 fmode_t mask, unsigned int refs)

{

        struct file *file;

        rcu_read_lock();

loop:

        file = files_lookup_fd_rcu(files, fd); // race window start

        if (file) {

                /* File object ref couldn't be taken.

                 * dup2() atomicity guarantee is the reason

                 * we loop to catch the new file (or NULL pointer)

                 */

                if (file->f_mode & mask)

                        file = NULL;

                else if (!get_file_rcu_many(file, refs)) // race window end

                        goto loop;

        }

        rcu_read_unlock();

        return file;

}

In this race window, the file descriptor must be closed (to drop the FD's reference to the file) and a unix_gc() run must get past the point where it checks the file's refcount ("total_refs = file_count(u->sk.sk_socket->file)").

In the Debian 5.10.0-9-amd64 kernel at version 5.10.70-1, that race window looks as follows:

<__fget_files+0x1e> cmp    r10,rax

<__fget_files+0x21> sbb    rax,rax

<__fget_files+0x24> mov    rdx,QWORD PTR [r11+0x8]

<__fget_files+0x28> and    eax,r8d

<__fget_files+0x2b> lea    rax,[rdx+rax*8]

<__fget_files+0x2f> mov    r12,QWORD PTR [rax] ; RACE WINDOW START

; r12 now contains file*

<__fget_files+0x32> test   r12,r12

<__fget_files+0x35> je     ffffffff812e3df7 <__fget_files+0x77>

<__fget_files+0x37> mov    eax,r9d

<__fget_files+0x3a> and    eax,DWORD PTR [r12+0x44] ; LOAD (for ->f_mode)

<__fget_files+0x3f> jne    ffffffff812e3df7 <__fget_files+0x77>

<__fget_files+0x41> mov    rax,QWORD PTR [r12+0x38] ; LOAD (for ->f_count)

<__fget_files+0x46> lea    rdx,[r12+0x38]

<__fget_files+0x4b> test   rax,rax

<__fget_files+0x4e> je     ffffffff812e3def <__fget_files+0x6f>

<__fget_files+0x50> lea    rcx,[rsi+rax*1]

<__fget_files+0x54> lock cmpxchg QWORD PTR [rdx],rcx ; RACE WINDOW END (on cmpxchg success)

As you can see, the race window is fairly small - around 12 instructions, assuming that the cmpxchg succeeds.

Missing some cache

Luckily for us, the race window contains the first few memory accesses to the struct file; therefore, by making sure that the struct file is not present in the fastest CPU caches, we can widen the race window by as much time as the memory accesses take. The standard way to do this is to use an eviction pattern / eviction set; but instead we can also make the cache line dirty on another core (see Anders Fogh's blogpost for more detail). (I'm not actually sure about the intricacies of how much latency this adds on different manufacturers' CPU cores, or on different CPU generations - I've only tested different versions of my proof-of-concept on Intel Skylake and Tiger Lake. Differences in cache coherency protocols or snooping might make a big difference.)

For the cache line containing the flags and refcount of a struct file, this can be done by, on another CPU, temporarily bumping its refcount up and then changing it back down, e.g. with close(dup(fd)) (or just by accessing the FD in pretty much any way from a multithreaded process).

However, when we're trying to hit the race in __fget_files() via dup(), we don't want any cache misses to occur before we hit the race window - that would slow us down and probably make us miss the race. To prevent that from happening, we can call dup() with a different FD number for a warm-up run shortly before attempting the race. Because we also want the relevant cache line in the FD table to be hot, we should choose the FD number for the warm-up run such that it uses the same cache line of the file descriptor table.

An interruption

Okay, a cache miss might be something like a few dozen or maybe hundred nanoseconds or so - that's better, but it's not great. What else can we do to make this tiny piece of code much slower to execute?

On Android, kernels normally set CONFIG_PREEMPT, which would've allowed abusing the scheduler to somehow interrupt the execution of this code. The way I've done this in the past was to give the victim thread a low scheduler priority and pin it to a specific CPU core together with another high-priority thread that is blocked on a read() syscall on an empty pipe (or eventfd); when data is written to the pipe from another CPU core, the pipe becomes readable, so the high-priority thread (which is registered on the pipe's waitqueue) becomes schedulable, and an inter-processor interrupt (IPI) is sent to the victim's CPU core to force it to enter the scheduler immediately.

One problem with that approach, aside from its reliance on CONFIG_PREEMPT, is that any timing variability in the kernel code involved in sending the IPI makes it harder to actually preempt the victim thread in the right spot.

(Thanks to the Xen security team - I think the first time I heard the idea of using an interrupt to widen a race window might have been from them.)

Setting an alarm

A better way to do this on an Android phone would be to trigger the scheduler not from an IPI, but from an expiring high-resolution timer on the same core, although I didn't get it to work (probably because my code was broken in unrelated ways).

High-resolution timers (hrtimers) are exposed through many userspace APIs. Even the timeout of select()/pselect() uses an hrtimer, although this is an hrtimer that normally has some slack applied to it to allow batching it with timers that are scheduled to expire a bit later. An example of a non-hrtimer-based API is the timeout used for reading from a UNIX domain socket (and probably also other types of sockets?), which can be set via SO_RCVTIMEO.

The thing that makes hrtimers "high-resolution" is that they don't just wait for the next periodic clock tick to arrive; instead, the expiration time of the next hrtimer on the CPU core is programmed into a hardware timer. So we could set an absolute hrtimer for some time in the future via something like timer_settime() or timerfd_settime(), and then at exactly the programmed time, the hardware will raise an interrupt! We've made the timing behavior of the OS irrelevant for the second side of the race, the only thing that matters is the hardware! Or... well, almost...

[Sidenote] Absolute timers: Not quite absolute

So we pick some absolute time at which we want to be interrupted, and tell the kernel using a syscall that accepts an absolute time, in nanoseconds. And then when that timer is the next one scheduled, the OS converts the absolute time to whatever clock base/scale the hardware timer is based on, and programs it into hardware. And the hardware usually supports programming timers with absolute time - e.g. on modern X86 (with X86_FEATURE_TSC_DEADLINE_TIMER), you can simply write an absolute Time Stamp Counter(TSC) deadline into MSR_IA32_TSC_DEADLINE, and when that deadline is reached, you get an interrupt. The situation on arm64 is similar, using the timer's comparator register (CVAL).

However, on both X86 and arm64, even though the clockevent subsystem is theoretically able to give absolute timestamps to clockevent drivers (via ->set_next_ktime()), the drivers instead only implement ->set_next_event(), which takes a relative time as argument. This means that the absolute timestamp has to be converted into a relative one, only to be converted back to absolute a short moment later. The delay between those two operations is essentially added to the timer's expiration time.

Luckily this didn't really seem to be a problem for me; if it was, I would have tried to repeatedly call timerfd_settime() shortly before the planned expiry time to ensure that the last time the hardware timer is programmed, the relevant code path is hot in the caches. (I did do some experimentation on arm64, where this seemed to maybe help a tiny bit, but I didn't really analyze it properly.)

A really big list of things to do

Okay, so all the stuff I said above would be helpful on an Android phone with CONFIG_PREEMPT, but what if we're trying to target a normal desktop/server kernel that doesn't have that turned on?

Well, we can still trigger hrtimer interrupts the same way - we just can't use them to immediately enter the scheduler and preempt the thread anymore. But instead of using the interrupt for preemption, we could just try to make the interrupt handler run for a really long time.

Linux has the concept of a "timerfd", which is a file descriptor that refers to a timer. You can e.g. call read() on a timerfd, and that operation will block until the timer has expired. Or you can monitor the timerfd using epoll, and it will show up as readable when the timer expires.

When a timerfd becomes ready, all the timerfd's waiters (including epoll watches), which are queued up in a linked list, are woken up via the wake_up() path - just like when e.g. a pipe becomes readable. Therefore, if we can make the list of waiters really long, the interrupt handler will have to spend a lot of time iterating over that list.

And for any waitqueue that is wired up to a file descriptor, it is fairly easy to add a ton of entries thanks to epoll. Epoll ties its watches to specific FD numbers, so if you duplicate an FD with hundreds of dup() calls, you can then use a single epoll instance to install hundreds of waiters on the file. Additionally, a single process can have lots of epoll instances. I used 500 epoll instances and 100 duplicate FDs, resulting in 50 000 waitqueue items.

Measuring race outcomes

A nice aspect of this race condition is that if you only hit the difficult race (close() the FD and run unix_gc() while dup() is preempted between FD table lookup and refcount increment), no memory corruption happens yet, but you can observe that the GC has incorrectly removed a socket buffer (SKB) from the victim socket. Even better, if the race fails, you can also see in which direction it failed, as long as no FDs below the victim FD are unused:

  • If dup() returns -1, it was called too late / the interrupt happened too soon: The file* was already gone from the FD table when __fget_files() tried to load it.
  • If dup() returns a file descriptor:
  • If it returns an FD higher than the victim FD, this implies that the victim FD was only closed after dup() had already elevated the refcount and allocated a new FD. This means dup() was called too soon / the interrupt happened too late.
  • If it returns the old victim FD number:
  • If recvmsg() on the FD returned by dup() returns no data, it means the race succeeded: The GC wrongly removed the queued SKB.
  • If recvmsg() returns data, the interrupt happened between the refcount increment and the allocation of a new FD. dup() was called a little bit too soon / the interrupt happened a little bit too late.

Based on this, I repeatedly tested different timing offsets, using a spinloop with a variable number of iterations to skew the timing, and plotted what outcomes the race attempts had depending on the timing skew.

Results: Debian kernel, on Tiger Lake

I tested this on a Tiger Lake laptop, with the same kernel as shown in the disassembly. Note that "0" on the X axis is offset -300 ns relative to the timer's programmed expiry.

This graph shows histograms of race attempt outcomes (too early, success, or too late), with the timing offset at which the outcome occurred on the X axis. The graph shows that depending on the timing offset, up to around 1/3 of race attempts succeeded.

Results: Other kernel, on Skylake

This graph shows similar histograms for a Skylake processor. The exact distribution is different, but again, depending on the timing offset, around 1/3 of race attempts succeeded.

These measurements are from an older laptop with a Skylake CPU, running a different kernel. Here "0" on the X axis is offset -1 us relative to the timer. (These timings are from a system that's running a different kernel from the one shown above, but I don't think that makes a difference.)

The exact timings of course look different between CPUs, and they probably also change based on CPU frequency scaling? But still, if you know what the right timing is (or measure the machine's timing before attempting to actually exploit the bug), you could hit this narrow race with a success rate of about 30%!

How important is the cache miss?

The previous section showed that with the right timing, the race succeeds with a probability around 30% - but it doesn't show whether the cache miss is actually important for that, or whether the race would still work fine without it. To verify that, I patched my test code to try to make the file's cache line hot (present in the cache) instead of cold (not present in the cache):

@@ -312,8 +312,10 @@

     }

 

+#if 0

     // bounce socket's file refcount over to other cpu

     pin_to(2);

     close(SYSCHK(dup(RESURRECT_FD+1-1)));

     pin_to(1);

+#endif

 

     //printf("setting timer\n");

@@ -352,5 +354,5 @@

     close(loop_root);

     while (ts_is_in_future(spin_stop))

-      close(SYSCHK(dup(FAKE_RESURRECT_FD)));

+      close(SYSCHK(dup(RESURRECT_FD)));

     while (ts_is_in_future(my_launch_ts)) /*spin*/;

With that patch, the race outcomes look like this on the Tiger Lake laptop:

This graph is a histogram of race outcomes depending on timing offset; it looks similar to the previous graphs, except that almost no race attempts succeed anymore.

But wait, those graphs make no sense!

If you've been paying attention, you may have noticed that the timing graphs I've been showing are really weird. If we were deterministically hitting the race in exactly the same way every time, the timing graph should look like this (looking just at the "too-early" and "too-late" cases for simplicity):

A sketch of a histogram of race outcomes where the "too early" outcome suddenly drops from 100% probability to 0% probability, and a bit afterwards, the "too late" outcome jumps from 0% probability to 100%

Sure, maybe there is some microarchitectural state that is different between runs, causing timing variations - cache state, branch predictor state, frequency scaling, or something along those lines -, but a small number of discrete events that haven't been accounted for should be adding steps to the graph. (If you're mathematically inclined, you can model that as the result of a convolution of the ideal timing graph with the timing delay distributions of individual discrete events.) For two unaccounted events, that might look like this:

A sketch of a histogram of race outcomes where the "too early" outcome drops from 100% probability to 0% probability in multiple discrete steps, and overlapping that, the "too late" outcome goes up from 0% probability to 100% in multiple discrete steps

But what the graphs are showing is more of a smooth, linear transition, like this:

A sketch of a histogram of race outcomes where the "too early" outcome's share linearly drops while the "too late" outcome's share linearly rises

And that seems to me like there's still something fundamentally wrong. Sure, if there was a sufficiently large number of discrete events mixed together, the curve would eventually just look like a smooth smear - but it seems unlikely to me that there is such a large number of somewhat-evenly distributed random discrete events. And sure, we do get a small amount of timing inaccuracy from sampling the clock in a spinloop, but that should be bounded to the execution time of that spinloop, and the timing smear is far too big for that.

So it looks like there is a source of randomness that isn't a discrete event, but something that introduces a random amount of timing delay within some window. So I became suspicious of the hardware timer. The kernel is using MSR_IA32_TSC_DEADLINE, and the Intel SDM tells us that that thing is programmed with a TSC value, which makes it look as if the timer has very high granularity. But MSR_IA32_TSC_DEADLINE is a newer mode of the LAPIC timer, and the older LAPIC timer modes were instead programmed in units of the APIC timer frequency. According to the Intel SDM, Volume 3A, section 10.5.4 "APIC Timer", that is "the processor’s bus clock or core crystal clock frequency (when TSC/core crystal clock ratio is enumerated in CPUID leaf 0x15) divided by the value specified in the divide configuration register". This frequency is significantly lower than the TSC frequency. So perhaps MSR_IA32_TSC_DEADLINE is actually just a front-end to the same old APIC timer?

I tried to measure the difference between the programmed TSC value and when execution was actually interrupted (not when the interrupt handler starts running, but when the old execution context is interrupted - you can measure that if the interrupted execution context is just running RDTSC in a loop); that looks as follows:

A graph showing noise. Delays from deadline TSC to last successful TSC read before interrupt look essentially random, in the range from around -130 to around 10.

As you can see, the expiry of the hardware timer indeed adds a bunch of noise. The size of the timing difference is also very close to the crystal clock frequency - the TSC/core crystal clock ratio on this machine is 117. So I tried plotting the absolute TSC values at which execution was interrupted, modulo the TSC / core crystal clock ratio, and got this:

A graph showing a clear grouping around 0, roughly in the range -20 to 10, with some noise scattered over the rest of the graph.

This confirms that MSR_IA32_TSC_DEADLINE is (apparently) an interface that internally converts the specified TSC value into less granular bus clock / core crystal clock time, at least on some Intel CPUs.

But there's still something really weird here: The TSC values at which execution seems to be interrupted were at negative offsets relative to the programmed expiry time, as if the timeouts were rounded down to the less granular clock, or something along those lines. To get a better idea of how timer interrupts work, I measured on yet another system (an old Haswell CPU) with a patched kernel when execution is interrupted and when the interrupt handler starts executing relative to the programmed expiry time (and also plotted the difference between the two):

A graph showing that the skid from programmed interrupt time to execution interruption is around -100 to -30 cycles, the skid to interrupt entry is around 360 to 420 cycles, and the time from execution interruption to interrupt entry has much less timing variance and is at around 440 cycles.

So it looks like the CPU starts handling timer interrupts a little bit before the programmed expiry time, but interrupt handler entry takes so long (~450 TSC clock cycles?) that by the time the CPU starts executing the interrupt handler, the timer expiry time has long passed.

Anyway, the important bit for us is that when the CPU interrupts execution due to timer expiry, it's always at a LAPIC timer edge; and LAPIC timer edges happen when the TSC value is a multiple of the TSC/LAPIC clock ratio. An exploit that doesn't take that into account and wrongly assumes that MSR_IA32_TSC_DEADLINE has TSC granularity will have its timing smeared by one LAPIC clock period, which can be something like 40ns.

The ~30% accuracy we could achieve with the existing PoC with the right timing is already not terrible; but if we control for the timer's weirdness, can we do better?

The problem is that we are effectively launching the race with two timers that behave differently: One timer based on calling clock_gettime() in a loop (which uses the high-resolution TSC to compute a time), the other a hardware timer based on the lower-resolution LAPIC clock. I see two options to fix this:

  1. Try to ensure that the second timer is set at the start of a LAPIC clock period - that way, the second timer should hopefully behave exactly like the first (or have an additional fixed offset, but we can compensate for that).
  2. Shift the first timer's expiry time down according to the distance from the second timer to the previous LAPIC clock period.

(One annoyance with this is that while we can grab information on how wall/monotonic time is calculated from TSC from the vvar mapping used by the vDSO, the clock is subject to minuscule additional corrections at every clock tick, which occur every 4ms on standard distro kernels (with CONFIG_HZ=250) as long as any core is running.)

I tried to see whether the timing graph would look nicer if I accounted for this LAPIC clock rounding and also used a custom kernel to cheat and control for possible skid introduced by the absolute-to-relative-and-back conversion of the expiry time (see further up), but that still didn't help all that much.

(No) surprise: clock speed matters

Something I should've thought about way earlier is that of course, clock speed matters. On newer Intel CPUs with P-states, the CPU is normally in control of its own frequency, and dynamically adjusts it as it sees fit; the OS just provides some hints.

Linux has an interface that claims to tell you the "current frequency" of each CPU core in /sys/devices/system/cpu/cpufreq/policy<n>/scaling_cur_freq, but when I tried using that, I got a different "frequency" every time I read that file, which seemed suspicious.

Looking at the implementation, it turns out that the value shown there is calculated in arch_freq_get_on_cpu() and its callees - the value is calculated on demand when the file is read, with results cached for around 10 milliseconds. The value is determined as the ratio between the deltas of MSR_IA32_APERF and MSR_IA32_MPERF between the last read and the current one. So if you have some tool that is polling these values every few seconds and wants to show average clock frequency over that time, it's probably a good way of doing things; but if you actually want the current clock frequency, it's not a good fit.

I hacked a helper into my kernel that samples both MSRs twice in quick succession, and that gives much cleaner results. When I measure the clock speeds and timing offsets at which the race succeeds, the result looks like this (showing just two clock speeds; the Y axis is the number of race successes at the clock offset specified on the X axis and the frequency scaling specified by the color):

A graph showing that the timing of successful race attempts depends on the CPU's performance setting - at 11/28 performance, most successful race attempts occur around clock offset -1200 (in TSC units), while at 14/28 performance, most successful race attempts occur around clock offset -1000.

So clearly, dynamic frequency scaling has a huge impact on the timing of the race - I guess that's to be expected, really.

But even accounting for all this, the graph still looks kind of smooth, so clearly there is still something more that I'm missing - oh well. I decided to stop experimenting with the race's timing at this point, since I didn't want to sink too much time into it. (Or perhaps I actually just stopped because I got distracted by newer and shinier things?)

Causing a UAF

Anyway, I could probably spend much more time trying to investigate the timing variations (and probably mostly bang my head against a wall because details of execution timing are really difficult to understand in detail, and to understand it completely, it might be necessary to use something like Gamozo Labs' "Sushi Roll" and then go through every single instruction in detail and compare the observations to the internal architecture of the CPU). Let's not do that, and get back to how to actually exploit this bug!

To turn this bug into memory corruption, we have to abuse that the recvmsg() path assumes that SKBs on the receive queue are protected from deletion by the socket mutex while the GC actually deletes SKBs from the receive queue without touching the socket mutex. For that purpose, while the unix GC is running, we have to start a recvmsg() call that looks up the victim SKB, block until the unix GC has freed the SKB, and then let recvmsg() continue operating on the freed SKB. This is fairly straightforward - while it is a race, we can easily slow down unix_gc() for multiple milliseconds by creating lots of sockets that are not directly referenced from the FD table and have many tiny SKBs queued up - here's a graph showing the unix GC execution time on my laptop, depending on the number of queued SKBs that the GC has to scan through:

A graph showing the time spent per GC run depending on the number of queued SKBs. The relationship is roughly linear.

To turn this into a UAF, it's also necessary to get past the following check near the end of unix_gc():

       /* All candidates should have been detached by now. */

        BUG_ON(!list_empty(&gc_candidates));

gc_candidates is a list that previously contained all sockets that were deemed to be unreachable by the GC. Then, the GC attempted to free all those sockets by eliminating their mutual references. If we manage to keep a reference to one of the sockets that the GC thought was going away, the GC detects that with the BUG_ON().

But we don't actually need the victim SKB to reference a socket that the GC thinks is going away; in scan_inflight(), the GC targets any SKB with a socket that is marked UNIX_GC_CANDIDATE, meaning it just had to be a candidate for being scanned by the GC. So by making the victim SKB hold a reference to a socket that is not directly referenced from a file descriptor table, but is indirectly referenced by a file descriptor table through another socket, we can ensure that the BUG_ON() won't trigger.

I extended my reproducer with this trick and some userfaultfd trickery to make recv() run with the right timing. Nowadays you don't necessarily get full access to userfaultfd as a normal user, but since I'm just trying to show the concept, and there are alternatives to userfaultfd (using FUSE or just slow disk access), that's good enough for this blogpost.

When a normal distro kernel is running normally, the UAF reproducer's UAF accesses won't actually be noticeable; but if you add the kernel command line flag slub_debug=FP (to enable SLUB's poisoning and sanity checks), the reproducer quickly crashes twice, first with a poison dereference and then a poison overwrite detection, showing that one byte of the poison was incremented:

general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b6b: 0000 [#1] SMP NOPTI

CPU: 1 PID: 2655 Comm: hardirq_loop Not tainted 5.10.0-9-amd64 #1 Debian 5.10.70-1

[...]

RIP: 0010:unix_stream_read_generic+0x72b/0x870

Code: fe ff ff 31 ff e8 85 87 91 ff e9 a5 fe ff ff 45 01 77 44 8b 83 80 01 00 00 85 c0 0f 89 10 01 00 00 49 8b 47 38 48 85 c0 74 23 <0f> bf 00 66 85 c0 0f 85 20 01 00 00 4c 89 fe 48 8d 7c 24 58 44 89

RSP: 0018:ffffb789027f7cf0 EFLAGS: 00010202

RAX: 6b6b6b6b6b6b6b6b RBX: ffff982d1d897b40 RCX: 0000000000000000

RDX: 6a0fe1820359dce8 RSI: ffffffffa81f9ba0 RDI: 0000000000000246

RBP: ffff982d1d897ea8 R08: 0000000000000000 R09: 0000000000000000

R10: 0000000000000000 R11: ffff982d2645c900 R12: ffffb789027f7dd0

R13: ffff982d1d897c10 R14: 0000000000000001 R15: ffff982d3390e000

FS:  00007f547209d740(0000) GS:ffff98309fa40000(0000) knlGS:0000000000000000

CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

CR2: 00007f54722cd000 CR3: 00000001b61f4002 CR4: 0000000000770ee0

PKRU: 55555554

Call Trace:

[...]

 unix_stream_recvmsg+0x53/0x70

[...]

 __sys_recvfrom+0x166/0x180

[...]

 __x64_sys_recvfrom+0x25/0x30

 do_syscall_64+0x33/0x80

 entry_SYSCALL_64_after_hwframe+0x44/0xa9

[...]

---[ end trace 39a81eb3a52e239c ]---

=============================================================================

BUG skbuff_head_cache (Tainted: G      D          ): Poison overwritten

-----------------------------------------------------------------------------

INFO: 0x00000000d7142451-0x00000000d7142451 @offset=68. First byte 0x6c instead of 0x6b

INFO: Slab 0x000000002f95c13c objects=32 used=32 fp=0x0000000000000000 flags=0x17ffffc0010200

INFO: Object 0x00000000ef9c59c8 @offset=0 fp=0x00000000100a3918

Object   00000000ef9c59c8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   0000000097454be8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   0000000035f1d791: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   00000000af71b907: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   000000000d2d371e: 6b 6b 6b 6b 6c 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkklkkkkkkkkkkk

Object   0000000000744b35: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   00000000794f2935: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   000000006dc06746: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   000000005fb18682: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   0000000072eb8dd2: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   00000000b5b572a9: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   0000000085d6850b: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   000000006346150b: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   000000000ddd1ced: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5  kkkkkkkkkkkkkkk.

Padding  00000000e00889a7: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ

Padding  00000000d190015f: 5a 5a 5a 5a 5a 5a 5a 5a                          ZZZZZZZZ

CPU: 7 PID: 1641 Comm: gnome-shell Tainted: G    B D           5.10.0-9-amd64 #1 Debian 5.10.70-1

[...]

Call Trace:

 dump_stack+0x6b/0x83

 check_bytes_and_report.cold+0x79/0x9a

 check_object+0x217/0x260

[...]

 alloc_debug_processing+0xd5/0x130

 ___slab_alloc+0x511/0x570

[...]

 __slab_alloc+0x1c/0x30

 kmem_cache_alloc_node+0x1f3/0x210

 __alloc_skb+0x46/0x1f0

 alloc_skb_with_frags+0x4d/0x1b0

 sock_alloc_send_pskb+0x1f3/0x220

[...]

 unix_stream_sendmsg+0x268/0x4d0

 sock_sendmsg+0x5e/0x60

 ____sys_sendmsg+0x22e/0x270

[...]

 ___sys_sendmsg+0x75/0xb0

[...]

 __sys_sendmsg+0x59/0xa0

 do_syscall_64+0x33/0x80

 entry_SYSCALL_64_after_hwframe+0x44/0xa9

[...]

FIX skbuff_head_cache: Restoring 0x00000000d7142451-0x00000000d7142451=0x6b

FIX skbuff_head_cache: Marking all objects used

RIP: 0010:unix_stream_read_generic+0x72b/0x870

Code: fe ff ff 31 ff e8 85 87 91 ff e9 a5 fe ff ff 45 01 77 44 8b 83 80 01 00 00 85 c0 0f 89 10 01 00 00 49 8b 47 38 48 85 c0 74 23 <0f> bf 00 66 85 c0 0f 85 20 01 00 00 4c 89 fe 48 8d 7c 24 58 44 89

RSP: 0018:ffffb789027f7cf0 EFLAGS: 00010202

RAX: 6b6b6b6b6b6b6b6b RBX: ffff982d1d897b40 RCX: 0000000000000000

RDX: 6a0fe1820359dce8 RSI: ffffffffa81f9ba0 RDI: 0000000000000246

RBP: ffff982d1d897ea8 R08: 0000000000000000 R09: 0000000000000000

R10: 0000000000000000 R11: ffff982d2645c900 R12: ffffb789027f7dd0

R13: ffff982d1d897c10 R14: 0000000000000001 R15: ffff982d3390e000

FS:  00007f547209d740(0000) GS:ffff98309fa40000(0000) knlGS:0000000000000000

CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

CR2: 00007f54722cd000 CR3: 00000001b61f4002 CR4: 0000000000770ee0

PKRU: 55555554

Conclusion(s)

Hitting a race can become easier if, instead of racing two threads against each other, you race one thread against a hardware timer to create a gigantic timing window for the other thread. Hence the title! On the other hand, it introduces extra complexity because now you have to think about how timers actually work, and turns out, time is a complicated concept...

This shows how at least some really tight races can still be hit and we should treat them as security bugs, even if it seems like they'd be very hard to hit at first glance.

Also, precisely timing races is hard, and the details of how long it actually takes the CPU to get from one point to another are mysterious. (As not only exploit writers know, but also anyone who's ever wanted to benchmark a performance-relevant change...)

Appendix: How impatient are interrupts?

I did also play around with this stuff on arm64 a bit, and I was wondering: At what points do interrupts actually get delivered? Does an incoming interrupt force the CPU to drop everything immediately, or do inflight operations finish first? This gets particularly interesting on phones that contain two or three different types of CPUs mixed together.

On a Pixel 4 (which has 4 slow in-order cores, 3 fast cores, and 1 faster core), I tried firing an interval timer at 100Hz (using timer_create()), with a signal handler that logs the PC register, while running this loop:

  400680:        91000442         add        x2, x2, #0x1

  400684:        91000421         add        x1, x1, #0x1

  400688:        9ac20820         udiv        x0, x1, x2

  40068c:        91006800         add        x0, x0, #0x1a

  400690:        91000400         add        x0, x0, #0x1

  400694:        91000442         add        x2, x2, #0x1

  400698:        91000421         add        x1, x1, #0x1

  40069c:        91000442         add        x2, x2, #0x1

  4006a0:        91000421         add        x1, x1, #0x1

  4006a4:        9ac20820         udiv        x0, x1, x2

  4006a8:        91006800         add        x0, x0, #0x1a

  4006ac:        91000400         add        x0, x0, #0x1

  4006b0:        91000442         add        x2, x2, #0x1

  4006b4:        91000421         add        x1, x1, #0x1

  4006b8:        91000442         add        x2, x2, #0x1

  4006bc:        91000421         add        x1, x1, #0x1

  4006c0:        17fffff0         b        400680 <main+0xe0>

The logged interrupt PCs had the following distribution on a slow in-order core:

A histogram of PC register values, where most instructions in the loop have roughly equal frequency, the instructions after udiv instructions have twice the frequency, and two other instructions have zero frequency.

and this distribution on a fast out-of-order core:

A histogram of PC register values, where the first instruction of the loop has very high frequency, the following 4 instructions have near-zero frequency, and the following instructions have low frequencies

As always, out-of-order (OOO) cores make everything weird, and the start of the loop seems to somehow "provide cover" for the following instructions; but on the in-order core, we can see that more interrupts arrive after the slow udiv instructions. So apparently, when one of those is executing while an interrupt arrives, it continues executing and doesn't get aborted somehow?

With the following loop, which has a LDR instruction mixed in that accesses a memory location that is constantly being modified by another thread:

  4006a0:        91000442         add        x2, x2, #0x1

  4006a4:        91000421         add        x1, x1, #0x1

  4006a8:        9ac20820         udiv        x0, x1, x2

  4006ac:        91006800         add        x0, x0, #0x1a

  4006b0:        91000400         add        x0, x0, #0x1

  4006b4:        91000442         add        x2, x2, #0x1

  4006b8:        91000421         add        x1, x1, #0x1

  4006bc:        91000442         add        x2, x2, #0x1

  4006c0:        91000421         add        x1, x1, #0x1

  4006c4:        9ac20820         udiv        x0, x1, x2

  4006c8:        91006800         add        x0, x0, #0x1a

  4006cc:        91000400         add        x0, x0, #0x1

  4006d0:        91000442         add        x2, x2, #0x1

  4006d4:        f9400061         ldr        x1, [x3]

  4006d8:        91000421         add        x1, x1, #0x1

  4006dc:        91000442         add        x2, x2, #0x1

  4006e0:        91000421         add        x1, x1, #0x1

  4006e4:        17ffffef         b        4006a0 <main+0x100>

the cache-missing loads obviously have a large influence on the timing. On the in-order core:

A histogram of interrupt instruction pointers, showing that most interrupts are delivered with PC pointing to the instruction after the high-latency load instruction.

On the OOO core:

A similar histogram as the previous one, except that an even larger fraction of interrupt PCs are after the high-latency load instruction.

What is interesting to me here is that the timer interrupts seem to again arrive after the slow load - implying that if an interrupt arrives while a slow memory access is in progress, the interrupt handler may not get to execute until the memory access has finished? (Unless maybe on the OOO core the interrupt handler can start speculating already? I wouldn't really expect that, but could imagine it.)

On an X86 Skylake CPU, we can do a similar test:

    11b8:        48 83 c3 01                  add    $0x1,%rbx

    11bc:        48 83 c0 01                  add    $0x1,%rax

    11c0:        48 01 d8                     add    %rbx,%rax

    11c3:        48 83 c3 01                  add    $0x1,%rbx

    11c7:        48 83 c0 01                  add    $0x1,%rax

    11cb:        48 01 d8                     add    %rbx,%rax

    11ce:        48 03 02                     add    (%rdx),%rax

    11d1:        48 83 c0 01                  add    $0x1,%rax

    11d5:        48 83 c3 01                  add    $0x1,%rbx

    11d9:        48 01 d8                     add    %rbx,%rax

    11dc:        48 83 c3 01                  add    $0x1,%rbx

    11e0:        48 83 c0 01                  add    $0x1,%rax

    11e4:        48 01 d8                     add    %rbx,%rax

    11e7:        eb cf                        jmp    11b8 <main+0xf8>

with a similar result:

A histogram of interrupt instruction pointers, showing that almost all interrupts were delivered with RIP pointing to the instruction after the high-latency load.

This means that if the first access to the file terminated our race window (which is not the case), we probably wouldn't be able to win the race by making the access to the file slow - instead we'd have to slow down one of the operations before that. (But note that I have only tested simple loads, not stores or read-modify-write operations here.)

❌
❌