Windows OS Platform Blog articles

Hotpatching on Windows

20 November 2021 at 03:10

Introduction

A core priority of the Windows Kernel team is to keep the operating system, applications, and users secure. Like many operating systems, Windows has a large codebase, a driver ecosystem, and a complex set of dependencies. Every day, many malicious actors attempt to find vulnerabilities. To fix these vulnerabilities, Microsoft has historically combined a group of security fixes into what is known as a security patch.

Updates on Windows

Traditionally, security patches have been deployed on the second Tuesday of every month, known as Patch Tuesday. These patches are developed by feature teams to fix various security vulnerabilities in the OS. By providing these security patches, we aim to make the Windows OS more secure and eliminate the opportunity for malicious actors to exploit vulnerabilities. Within each patch, both user mode (application) and kernel mode (system) binaries can be updated, and typically this requires a reboot.

Some scenarios require continuous or near-continuous availability. For example, the instances of Windows Server that power the Azure fleet are required to be highly available. However, we also require these operating system instances to be secure. While technologies like Kernel Soft Reboot and VM-preserving host updates already exist to minimize VM downtime while changing major OS releases, security patches are applied frequently enough that even these techniques cause noticeable downtime.

Why do updates require rebooting?

Usually, many binaries from all over the system are accessed and changed when a patch is applied. A reboot is almost always required because a binary that must be updated is usually actively mapped in one or more processes, so its code may be currently executing. Certain kernel and user-mode binaries, like win32k.sys or ntdll.dll, are always loaded into memory, and some others, like Explorer.exe, are loaded when there is an active user session. When binaries such as these are patched as part of an update, a restart is required for the patch to be successfully installed. When an update targets the NT kernel or additional core components, a restart is always required because it is not possible to unload those binaries while their code is executing. Traditionally, even if only one fix within the entire patch required a reboot and none of the other fixes did, the machine would still have to reboot to successfully install the patch.

Current security issues with delayed patching

Security patches are intended to be applied to the Windows OS as soon as they are released from Microsoft. Often, users and system administrators will delay the installation of a patch because of the reboot that is frequently required upon completing the installation. This delay in patching, while seemingly convenient, is actually a security issue. The FireEye Mandiant Threat Intelligence report shows that in 2018 and 2019 the exploitation of 42% of vulnerabilities occurred after a patch was already released. Furthermore, internal MSRC data shows that in 2020, around 75% of public proof-of-concept vulnerabilities were exploited after a patch had already been released. By limiting or eliminating the time between when a patch is issued and when it is applied, there is a substantial opportunity to reduce the total number of exploited vulnerabilities.

What is Hotpatching?

Hotpatching is the capability of an operating system to modify, on the fly, code that may be currently executed by another entity (application or driver). The hotpatching process should be invisible to the application, library or driver that is executing the code. This implies that the hotpatch engine must respect some constraints, which will be explained later in this post. Hotpatching allows the OS to install security patches without requiring a reboot, ensuring a level of increased security without sacrificing the availability of the machine. By utilizing techniques in the Windows Kernel, updates can be applied without a direct impact to the user. In Server scenarios, hotpatching allows administrators to update their guest VMs without needing to reboot them, leading to reduced downtime. Hotpatching is one of the first techniques geared toward bringing users a reboot-less security update future.

While hotpatch is a new feature for our customers, it has been in use in Azure Host OS for a while. Internal Azure administrators have been providing rebootless security updates to Azure Host machines for long enough to collect data and improve hotpatching itself. Hotpatching is a battle-tested method of updating binaries on a system without the need to reboot.

The Hotpatch architecture

Hotpatch is implemented in various parts of the NT kernel, Secure Kernel and Ntdll module. Before peeking at the engine’s architecture, we should explain how the system is able to dynamically patch a binary.

Hotpatching works at the function level, which means that functions are individually patched and not individual files or components. Function level hotpatching works by redirecting all invocations of an un-patched function belonging to a base image to a patched function belonging to a hotpatch image. Many types of binaries can be patched using this technique, including usermode executables (EXEs and DLLs), system drivers, and even the Hypervisor and Secure Kernel binaries. Note that hotpatch images are considered cumulative, which means that each hotpatch image includes the changes from all other previous hotpatch images targeting the same base image. Multiple hotpatch images can be applied to the same base image and can be rolled back in a similar manner. The latest version of Hotpatch supports both x64 and ARM64 architectures, including 32-bit code running under WOW64.

Patch images, shown in Figure 1, are standard PE (Portable Executable) images, but they contain special information. In particular, the Hotpatch Table (indexed by the Image load configuration directory) contains all the information that describes the patch image, like the expected engine version, the size of the patch table, patch sequence number, and an array of compatible base image descriptors.

Figure 1. Hotpatch image format.

Each patch image is designed for a specific base image. The compatible base image is identified through a checksum and a time-date stamp. The patch engine will refuse to apply the patch if the base image’s checksum and time-date stamp do not match those of any descriptor. In this case the patch will be added to an internal list and applied only when the correct base image is loaded later (this procedure is called “Deferred application”).
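The matching rule described here can be sketched in a few lines of C. This is only an illustration: the struct layouts, field names and the `decide_patch` helper are hypothetical, since the post does not give the real hotpatch table format. It only shows the idea of comparing checksum and time-date stamp against every compatible descriptor and deferring when nothing matches.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor of one compatible base image (not the real layout). */
typedef struct {
    uint32_t checksum;
    uint32_t time_date_stamp;
} base_image_descriptor;

/* Hypothetical summary of an image that is currently loaded. */
typedef struct {
    uint32_t checksum;
    uint32_t time_date_stamp;
} loaded_image;

typedef enum { PATCH_APPLY_NOW, PATCH_DEFERRED } patch_decision;

/* True when the loaded image matches at least one descriptor exactly. */
bool is_compatible_base(const loaded_image *img,
                        const base_image_descriptor *descs, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (descs[i].checksum == img->checksum &&
            descs[i].time_date_stamp == img->time_date_stamp) {
            return true;
        }
    }
    return false;
}

/* img == NULL models "no candidate base image is loaded yet". */
patch_decision decide_patch(const loaded_image *img,
                            const base_image_descriptor *descs, size_t count)
{
    if (img != NULL && is_compatible_base(img, descs, count)) {
        return PATCH_APPLY_NOW;
    }
    /* No match: the patch would be queued and retried on a later image load. */
    return PATCH_DEFERRED;
}
```

Both fields must match the same descriptor; a matching checksum with a different time-date stamp is treated as incompatible.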

The operations that are performed by the engine for applying a patch are described by an array of hotpatch descriptors. A hotpatch descriptor tells the engine what type of patch each record specifies (function patch, global symbol patch, indirect call, CFG call target and so on). It is composed of a header and one or more hotpatch records. Each record specifies the patch’s parameters, which depend on the type of the descriptor, like the source and target functions’ RVAs and the original opcode bytes.

The Hotpatch engine

The Hotpatch engine is implemented in various parts of the operating system, mostly in the NT and Secure kernel. The engine, as introduced in the previous paragraph, supports different kinds of images: Hypervisor, Secure Kernel and its modules, NT Kernel drivers and User-mode processes. The hotpatch engine requires the Secure Kernel to be running.

To apply a patch to an image, the NT kernel takes several steps that start in the MiLoadHotPatch internal function, which temporarily maps the patch image in the system address space and performs the initial analysis to locate and verify the hotpatch information contained in the PE data structures (shown in Figure 1). After the checksum and timestamp of the target image for which the patch has been designed are located, the NT kernel determines whether the corresponding base image is loaded in the system (the base image can also be a secure image, like the Hypervisor or the Secure Kernel, so this step also needs to invoke the Secure Kernel).

When a compatible image is detected, the NT kernel begins to apply the patch to the target base image using a procedure that differs slightly depending on the type of the base image (user-mode library or process, kernel driver or a secure image). In general, the hotpatch engine maps the patch image in the same address space as the base image (as shown in Figure 2): for user-mode patches, the patch image will be mapped in each process that has the base image loaded.

Note that the hotpatch engine also supports session drivers. A session driver is a driver that lives in a kernel-mode address space that is tied to the user logon session (note that the session address space is generated by one particular root page table entry, which is switched on demand by the Memory manager depending on the active session). This means that a particular session can have a driver mapped which does not exist in another session. The Hotpatch engine is able to attach to all sessions in the system thanks to the “HotPatch” process created in phase 1 of the NT Kernel initialization. This minimal process does not belong to any session. The hotpatch engine can thus use that process to temporarily attach to any session in the system and apply the patch only in the sessions where the driver is currently loaded.

Figure 2. Various address spaces supported by hotpatching on Windows.

Once the hotpatch image is mapped, the patch engine within the kernel starts to apply the patch by performing Backward patch application as described by the hotpatch records:

Patches all callees of patched functions in the patch image to jump to the corresponding functions in the base image. The reason for this is to ensure that all the unpatched code executes from the original base image. For example, if function A calls function B in the original base image and the patch image patches function A but not function B, then the patch engine will update function B in the patch image to jump to function B in the original base image.
Patches the necessary references to global variables in hotpatch functions to point to the corresponding global variables in the original base image.
Patches the necessary import address table (IAT) references in the hotpatch image by copying the corresponding IAT entries from the original base image.

It then performs the Forward patch application by patching the necessary functions in the original base image to jump to the corresponding functions in the patch image. Once this is done for any given function in the original base image, all new invocations of that function will execute the new patched function code from the hotpatch image. Once the hotpatched function returns, it will return to the caller of the original function.

The described procedure, which, for kernel drivers, is executed by the Secure Kernel, has been highly simplified. Note that the hotpatching process requires proper synchronization: no processor should be able to execute the original instructions while a patch is being applied to them. Note that the Secure Kernel is also able to interact with HyperGuard. This allows protected PatchGuard images to be correctly patched.

The Hotpatch Address Table (HPAT)

When applying a patch to a function, the Hotpatch engine should be able to store the trampoline needed for transferring the code execution from the base to the patched function. The trampoline can’t be stored in the old un-patched function for various reasons: currently running code may hit invalid instructions and there is also no guarantee that enough space exists in the old function’s code. Furthermore, the patch engine supports both the application and the revert (undo) of a patch, which means that the original replaced bytes would have to be stored somewhere. Trampoline code to transfer execution to the target function is placed in the Hotpatch Address table code page (abbreviated as HPAT).

When the system initially boots, the Windows loader determines the size of the HPAT area, which is composed of a combination of data and code pages (to support ARM64 and scenarios where Retpoline is enabled on x64). When HotPatch is enabled, each boot driver is loaded in memory by reserving the HPAT pages at the end of the PE image, before the Retpoline code page (further information about Retpoline on Windows is available here: Mitigating Spectre variant 2 with Retpoline on Windows - Microsoft Tech Community). Note that the term “reserved” means that no actual physical memory is consumed. This is handled similarly for user-mode binaries.

When a patch is applied to a base image, the HPAT pages for both the base and the patch images are mapped to valid physical pages. When a function is patched for the first time, the patch engine allocates an HPAT entry for it and fills the code and data slot with the trampoline code and the target address. Subsequent patches for a function only update the target address. Only a single instruction is replaced in the prologue of the original function’s code. The overwritten opcode is saved in the Undo table to be replaced if the patch is reverted. Figure 3 summarizes this process:
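As a rough mental model of the bookkeeping described above, the following C sketch simulates an HPAT with a data slot per patched function and an undo table. Every name here (`sim_hpat`, `sim_function`, `SIM_JUMP_OPCODE` and so on) is invented for illustration, and a single byte stands in for the one replaced instruction; the real engine rewrites machine code under processor synchronization, which this sketch does not attempt.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_FUNCS 8
#define NO_SLOT (-1)
#define SIM_JUMP_OPCODE 0xE9  /* placeholder marker for "jump to HPAT entry" */

/* Simulated base-image function: one byte stands in for its first instruction. */
typedef struct {
    uint8_t first_opcode;
    int     hpat_slot;      /* NO_SLOT until the function is patched once */
} sim_function;

/* Simulated HPAT: target addresses (data slots) plus the undo table. */
typedef struct {
    uintptr_t target[MAX_FUNCS];
    uint8_t   undo_opcode[MAX_FUNCS];
    int       used;
} sim_hpat;

/* First patch: allocate an entry, save the original opcode, redirect.
   Later patches: only the target address in the data slot changes. */
bool sim_apply_patch(sim_hpat *hpat, sim_function *fn, uintptr_t new_target)
{
    if (fn->hpat_slot == NO_SLOT) {
        if (hpat->used >= MAX_FUNCS) {
            return false;               /* simulated HPAT area is full */
        }
        int slot = hpat->used++;
        hpat->undo_opcode[slot] = fn->first_opcode;
        hpat->target[slot] = new_target;
        fn->first_opcode = SIM_JUMP_OPCODE;
        fn->hpat_slot = slot;
        return true;
    }
    hpat->target[fn->hpat_slot] = new_target;
    return true;
}

/* Revert: restore the saved opcode. The slot is not recycled in this sketch. */
void sim_revert_patch(sim_hpat *hpat, sim_function *fn)
{
    if (fn->hpat_slot == NO_SLOT) {
        return;
    }
    fn->first_opcode = hpat->undo_opcode[fn->hpat_slot];
    fn->hpat_slot = NO_SLOT;
}
```

Applying a second patch to the same function leaves the function body and the undo table untouched, which mirrors the statement that subsequent patches only update the target address.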

Figure 3. Code flow for a hotpatched function.

Windows Server 2022 - New Hotpatch features

The upcoming Windows Server 2022 release includes the following improvements which make hotpatching applicable to a wider set of changes:

Patch images can now import new functions from other binaries.
The Hotpatch engine now supports ARM64 as well.
The patch engine now supports a patch callback, exported in the patch image through the “__PatchMainCallout__” function. The callback allows the patch image to perform initialization steps (like allocating memory, initializing new globals and so on) after one or both phases of the patch application (described previously) have completed.
HotPatch is compatible with Retpoline. A new Retpoline dispatch function (internally called “__guard_retpoline_jump_hpat”) is invoked from the HPAT code entry and can safely transfer the code execution to the target patch function without being vulnerable to Spectre v2 side channel attacks.

Conclusion

Hotpatch is a powerful feature used by the Azure Fleet and Windows Server Azure Edition to eliminate downtime when applying security patches or even adding small features to the OS. Although some limitations in the functions being patched still exist (for example, function signatures can never be changed), most of them have been addressed in the new version of the Engine.

How can you get access to the hotpatch feature?

Hotpatch-based security updates are available to customers running Windows Server 2019 and Windows Server 2022 Azure Edition images in the Azure cloud within the automanage framework. Documentation is provided on this page. We are working on bringing hotpatch-based security updates to a wider set of Windows customers.

Andrea Allievi & Hotpatch Team.

Getting to Know ARM64EC: #Defines and Intrinsic Functions

By: Mehmet_Iyigun

18 November 2021 at 08:55

Earlier this year, we announced ARM64EC, a new ABI that will make it easier than ever to build native apps for Windows on ARM. With the Windows 11 SDK and Visual Studio Preview, you can start using the preview of ARM64EC tools to add ARM64EC to your own apps or build new ARM64EC projects. For developers looking to dive in and get started, we'll be sharing more details and things to know in this and upcoming blogs.

Today, we'll be diving into one key detail of the environment to know: when compiling ARM64EC, the _M_AMD64 preprocessor macro is defined and _M_ARM64 is not. There is also a new preprocessor macro, _M_ARM64EC, that is set only when building ARM64EC.

Preprocessor macros defined for each target by MSVC:

x64	ARM64EC	ARM64
_M_X64, _M_AMD64	_M_X64, _M_AMD64, _M_ARM64EC	_M_ARM64

If you include windows.h in your project, you’ll also see that _AMD64_ and _ARM64EC_ are both defined when building ARM64EC code.

This combination may seem counterintuitive at first, but it's key to the fundamental promise of ARM64EC being interoperable with x64 code even within the same binary. Windows 11 takes care of seamlessly transitioning between code running natively in the CPU and under emulation. To do so, it makes sure that data flows transparently between ARM64EC and x64 including data pointers and function pointers (i.e. callbacks). For this to work, datatype definitions must be the same when compiling ARM64EC code as when compiling x64.

The defined preprocessor macros for ARM64EC mean that your project compiling as ARM64EC will use definitions from x64, not ones from ARM64. This ensures that datatype definitions are the same when compiling for x64 and ARM64EC and that passing parameters, either by value or by reference, will not generate a mismatch.
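One practical consequence is that a check for ARM64EC has to come before a check for x64, because an ARM64EC build also defines the x64 macros. A minimal sketch follows; the function name `build_target_name` is made up for this example, and on non-MSVC compilers none of these macros are expected to be defined, so the fallback branch is taken.

```c
/* Reports which target this translation unit was compiled for, based only
   on the MSVC architecture macros discussed in the post. */
const char *build_target_name(void)
{
#if defined(_M_ARM64EC)
    /* Must be tested first: _M_AMD64 and _M_X64 are defined here too. */
    return "ARM64EC";
#elif defined(_M_AMD64) || defined(_M_X64)
    return "x64";
#elif defined(_M_ARM64)
    return "ARM64";
#else
    return "other (non-MSVC compiler or a different architecture)";
#endif
}
```

Swapping the first two branches would make an ARM64EC build report "x64", which is the behavior existing x64-only code relies on.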

Another common use of #define statements in code is platform specific instructions, usually exposed to C/C++ code in the form of intrinsic functions. Intrinsic functions are functions internally defined by the compiler, which allow C/C++ code to tap into architecture-specific instructions and get the best possible performance without the need for direct use of assembly. Knowing that ARM64EC projects will follow x64 codepaths, you may ask -- what about any intrinsic functions?

When compiling ARM64EC, x64 intrinsic functions are supported and will be translated to ARM64EC code automatically. As a result, taking an x64 project and building for ARM64EC, even one that uses intrinsic functions for performance, can easily yield an ARM64EC app with good performance without source changes.

You also have the option to further optimize the processor-specific code in your project by using ARM64 intrinsic functions in your ARM64EC project. The _M_ARM64EC preprocessor macro allows you to differentiate ARM64EC from x64 and take ARM-specific code paths rather than x64. For example, if you have code that already handles choosing the best intrinsic functions for x64 and ARM64, you can key off _M_ARM64EC or _M_ARM64 to use the ARM intrinsic functions, as below:

Before:

#include <intrin.h>

void func()
{
#if defined(_M_AMD64)
    __m128i vec;
    vec = _mm_setzero_si128();
#elif defined(_M_ARM64)
    __n128 vec;
    vec = vdupq_n_u32(0);
#endif
}

After:

#include <intrin.h>

void func()
{
#if defined(_M_AMD64) && !defined(_M_ARM64EC)
    __m128i vec;
    vec = _mm_setzero_si128();
#elif defined(_M_ARM64) || defined(_M_ARM64EC)
    __n128 vec;
    vec = vdupq_n_u32(0);
#endif
}

The architecture #defines set by the compiler when building ARM64EC may be somewhat surprising at first but make more sense when considering that ARM64EC and x64 are interoperable. These settings, and the automatic translation of intrinsics, enable code to be ported to ARM64EC with the least amount of effort, while still enabling ARM64EC specific fine-tuning and optimization.

Marc Sweetgall, Pedro Justo

Introducing Kernel Data Protection, a new security technology for preventing data corruption

By: Mehmet_Iyigun

12 December 2022 at 19:08

Kernel Data Protection (KDP) is a new technology that prevents data corruption attacks by protecting parts of the Windows kernel and drivers through virtualization-based security (VBS). KDP is a set of APIs that provide the ability to mark some kernel memory as read-only, preventing attackers from ever modifying protected memory.

KDP uses technologies that are supported by default on Secured-core PCs, which implement a specific set of device requirements that apply the security best practices of isolation and minimal trust to the technologies that underpin the Windows operating system. KDP enhances the security provided by the features that make up Secured-core PCs by adding another layer of protection for sensitive system configuration data.

KDP is implemented in two parts:

Static KDP enables software running in kernel mode to statically protect a section of its own image from being tampered with by any other entity in VTL0.
Dynamic KDP helps kernel-mode software to allocate and release read-only memory from a “secure pool”. The memory returned from the pool can be initialized only once.

The concept of protecting kernel memory as read-only has valuable applications for the Windows kernel, inbox components, security products, and even third-party drivers like anti-cheat and digital rights management (DRM) software.

Learn more about Kernel Data Protection, how it is implemented on Windows 10, and more applications in this blog: Introducing Kernel Data Protection, a new platform security technology for preventing data corruption.

Enjoy!

Memory management & security core team (Andrea Allievi, Matthew Woolman, Jon Lange, Eugene Bak, Mehmet Iyigun)

Mitigating Spectre variant 2 with Retpoline on Windows

By: Mehmet_Iyigun

14 May 2019 at 19:43

Updated May 14, 2019: We're happy to announce that today we've updated Retpoline cloud configuration to enable it for all supported devices!* In addition, with the May 14 Patch Tuesday update, we've removed the dependence on cloud configuration such that even those customers who may not be receiving cloud configuration updates can experience Retpoline performance gains.

*Note: Retpoline is enabled by default on devices running Windows 10, version 1809 and Windows Server 2019 or newer and which meet the following conditions:

Spectre, Variant 2 (CVE-2017-5715) mitigation is enabled.
- For Client SKUs, Spectre Variant 2 mitigation is enabled by default
- For Server SKUs, Spectre Variant 2 mitigation is disabled by default. To realize the benefits of Retpoline, IT Admins can enable it on servers following this guidance.
Supported microcode/firmware updates are applied to the machine.

Updated March 1, 2019: The post below outlines the performance benefits of using Retpoline against the Spectre variant 2 (CVE-2017-5715) attack—as observed with 64-bit Windows Insider Preview Builds 18272 and later. ~~While Retpoline is currently disabled by default on production Windows 10 client devices~~, we have backported the OS modifications needed to support Retpoline so that it can be used with Windows 10, version 1809 and have those modifications in the March 1, 2019 update (KB4482887).

~~Over the coming months, we will enable Retpoline as part of phased rollout via cloud configuration.~~ Due to the complexity of the implementation and changes involved, we are only enabling Retpoline performance benefits for Windows 10, version 1809 and later releases.

Updated March 5, 2019: ~~While the phased rollout is in progress, customers who would like to manually enable Retpoline on their machines can do so with the following registry configuration updates:~~

On Client SKUs:

reg add "HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management" /v FeatureSettingsOverride /t REG_DWORD /d 0x400
reg add "HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management" /v FeatureSettingsOverrideMask /t REG_DWORD /d 0x400
Reboot

On Server SKUs:

reg add "HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management" /v FeatureSettingsOverride /t REG_DWORD /d 0x400
reg add "HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management" /v FeatureSettingsOverrideMask /t REG_DWORD /d 0x401
Reboot

Note: The above registry configurations are for customers running with default mitigation settings. In particular, for Server SKUs, these settings will enable Spectre variant 2 mitigations (which are enabled by default on Client SKUs). If it's desirable to enable additional security mitigations on top of Retpoline, then the feature settings values for those features need to be bitwise OR'd into FeatureSettingsOverride and FeatureSettingsOverrideMask.

Example: Feature settings values for enabling SSBD (speculative store bypass) system wide:
FeatureSettingsOverride = 0x8 and FeatureSettingsOverrideMask = 0
To add Retpoline, feature settings value for Retpoline (0x400) should be bitwise OR'd:
FeatureSettingsOverride = 0x408 and FeatureSettingsOverrideMask = 0x400
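The arithmetic in this example can be checked with a short C sketch. The constant names `FEATURE_RETPOLINE` and `FEATURE_SSBD`, the `feature_settings` struct and the `add_retpoline` helper are labels invented here; only the numeric values come from the post. The sketch computes registry values and does not touch the registry.

```c
#include <stdint.h>

#define FEATURE_RETPOLINE 0x400u  /* Retpoline value quoted in the post */
#define FEATURE_SSBD      0x8u    /* SSBD system-wide example value from the post */

/* Hypothetical pair mirroring the two registry values. */
typedef struct {
    uint32_t override_value;  /* FeatureSettingsOverride */
    uint32_t override_mask;   /* FeatureSettingsOverrideMask */
} feature_settings;

/* OR the Retpoline bit into both values, leaving other bits untouched. */
feature_settings add_retpoline(feature_settings current)
{
    current.override_value |= FEATURE_RETPOLINE;
    current.override_mask  |= FEATURE_RETPOLINE;
    return current;
}
```

Starting from the SSBD example (value 0x8, mask 0) this yields 0x408 and 0x400. Starting from value 0 and mask 0x1 it yields 0x400 and 0x401, which are the Server SKU values listed earlier.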

The Get-SpeculationControlSettings PowerShell cmdlet can be used to verify Retpoline status. Here’s an example output showing Retpoline and import optimization enabled:

Speculation control settings for CVE-2017-5715 [branch target injection] 
 
Hardware support for branch target injection mitigation is present: True  
Windows OS support for branch target injection mitigation is present: True 
Windows OS support for branch target injection mitigation is enabled: True 
… 
BTIKernelRetpolineEnabled           : True 
BTIKernelImportOptimizationEnabled  : True 
...

Since Retpoline is a performance optimization for Spectre Variant 2, it requires hardware and OS support for branch target injection mitigation to be present and enabled. Skylake and later generations of Intel processors are not compatible with Retpoline, so only Import Optimization will be enabled on these processors.

In January 2018, Microsoft released an advisory and security updates related to a newly discovered class of hardware vulnerabilities involving speculative execution side channels (known as Spectre and Meltdown) that affect AMD, ARM, and Intel CPUs to varying degrees. If you haven’t had a chance to learn about these issues, we recommend watching The Case of Spectre and Meltdown by the team at TU Graz from BlueHat Israel or reading the blog post by Jann Horn (@tehjh) of Google Project Zero.

We have also had multiple posts detailing the internals of our implementation to handle these side-channel attacks.

For today’s post, we have kernel developers Andrea Allievi and Chris Kleynhans describing our design and implementation of retpoline for Windows which improves performance of Spectre variant 2 mitigations (CVE-2017-5715) to noise-level for most scenarios. These improvements are available today in Windows Insider Builds (builds 18272 or newer, x64-only).

Introduction

At a high level, the Spectre variant 2 attack exploits indirect branches to steal secrets located in higher privilege contexts (e.g. kernel-mode vs user-mode). Indirect branches are instructions where the target of the branch is not contained in the instruction itself, such as when the destination address is stored in a CPU register.

Describing the full Spectre attack is outside the scope of this article. Details are in the links above or in this whitepaper from Intel.

Our original mitigations for Spectre variant 2 made use of new capabilities exposed by CPU microcode updates to restrict indirect branch speculation when executing within kernel mode (IBRS and IBPB). While this was an effective mitigation from a security standpoint, it resulted in a larger performance degradation than we’d like on certain processors and workloads.

For this reason, starting in early 2018, we investigated alternatives and found promise in an approach developed by Google called retpoline. A full description of retpoline can be found here, but in short, retpoline works by replacing all indirect calls or jumps in kernel-mode binaries with an indirect branch sequence that has safe speculation behavior.

This sequence, shown below in Figure 1, effects a safe control transfer to the target address by performing a function call, modifying the return address and then returning.

RP0:  call RP2                 ; push address of RP1 onto the stack and jump to RP2
RP1:  int 3                    ; breakpoint to capture speculation
RP2:  mov [rsp], <Jump Target> ; overwrite return address on the stack to desired target
RP3:  ret                      ; return

While this construct is not as fast as a regular indirect call or jump, it has the side effect of preventing the processor from unsafe speculative execution. This proves to be much faster than running all of kernel mode code with branch speculation restricted (IBRS set to 1). However, this construct is only safe to use on processors where the RET instruction does not speculate based on the contents of the indirect branch predictor. Those processors are all AMD processors as well as Intel processors codenamed Broadwell and earlier according to Intel’s whitepaper. Retpoline is not applicable to Skylake and later processors from Intel.

Windows requirements for Retpoline

Traditionally the transformation of indirect calls and jumps into retpolines is performed when a binary is built by the compiler. However, there are several functional requirements in Windows that make a purely compile-time implementation insufficient.

These key requirements are:

Single binary: Windows releases are long-lived and must support a wide variety of hardware with a single set of binaries. On some hardware retpoline is not a complete mitigation because of alternate behavior of the ret instruction and retpoline must not be used. Further, future hardware may eliminate the need for retpoline entirely. Therefore, a Windows implementation of retpoline must allow the feature to be enabled and disabled at boot time using a single set of binaries, based on whether the underlying hardware is vulnerable, compatible and whether Spectre variant 2 mitigations are enabled on the system. Further, the runtime overhead of retpoline support should be minimal when the feature is disabled.
3rd party device drivers: A lot of the code that runs in kernel mode is not part of Windows and consists of 3rd party device driver code. Traditional retpoline would only be secure if all these drivers were recompiled with a new version of the compiler. Given the breadth of Windows 3rd party driver ecosystem, it is not realistic to expect all non-inbox 3rd party drivers to be recompiled and released to customers at the same time. Therefore, a Windows implementation of retpoline must be able to support a mixed environment, providing high performance when running drivers that have been updated, but allowing for graceful fallback to hardware-based mitigations upon entering a non-retpoline driver to preserve security.
Driver portability: Windows drivers are not bound to a specific release of Windows; many drivers that are built today for Windows 10 will also support older versions of the operating system. Therefore, a Windows implementation of retpoline must ensure that drivers compiled with retpoline support can run on a version of Windows that does not support retpoline.

General Architecture

To satisfy requirements 1 and 3, we decided that binaries would ship in a non-retpolined state and then be transformed into a retpolined state by rewriting the code sequences for all indirect calls. This ensures that systems that do not use retpoline can use the binaries as compiled without needing any support for retpoline and with minimal runtime cost.

However, performing the transformation at runtime does lead to one problem. How do we know what transformations need to be applied? Disassembling and analyzing driver machine code to locate all indirect calls is not practical.

Dynamic Value Relocation Table (DVRT)

To solve this problem, we collaborated with the compiler team in Visual Studio to develop a system whereby the compiler can emit a new type of metadata into driver binaries describing each indirect call or jump in the system. This metadata takes the form of new relocation entries in the Dynamic Value Relocation Table (DVRT).

The DVRT was originally introduced back in the Windows 10 Creators Update to improve kernel address space layout randomization (KASLR). It allowed the memory manager’s page frame number (PFN) database and page table self-map to be assigned dynamic addresses at runtime. The DVRT is stored directly in the binary and contains a series of relocation entries for each symbol (i.e. address) that is to be relocated. The relocation entries are themselves arranged in a hierarchical fashion grouped first by symbol and then by containing page to allow for a compact description of all locations in the binary that reference a relocatable symbol.

At build time, the compiler keeps track of all references to these special symbols and fills out the DVRT. Then at runtime the kernel will parse the DVRT and update each symbol reference with the correct dynamically assigned address. Importantly, the kernel will skip over any DVRT entries it does not recognize (i.e. those with an unknown symbol) so adding new symbols to the DVRT does not break older versions of Windows.
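To make the skip-unknown-symbols behavior concrete, here is a minimal Python sketch. It is only a toy model: the table layout (a list of symbols, each with pages and in-page offsets), the function name, and the symbol ids are assumptions made for illustration and do not reflect the real on-disk DVRT format.

```python
import struct


def apply_dynamic_relocations(image, table, known_symbols):
    """Patch every reference to a known symbol and skip unknown symbols.

    image         -- bytearray modelling the mapped binary (hypothetical)
    table         -- [(symbol_id, [(page_rva, [offset_in_page, ...]), ...]), ...]
    known_symbols -- dict: symbol_id -> 64-bit address assigned at runtime
    Returns the number of locations that were patched.
    """
    patched = 0
    for symbol_id, pages in table:
        if symbol_id not in known_symbols:
            # A kernel that does not recognize this symbol ignores its entries.
            continue
        value = known_symbols[symbol_id]
        for page_rva, offsets in pages:
            for offset in offsets:
                # Overwrite the 8-byte reference with the runtime address.
                struct.pack_into("<Q", image, page_rva + offset, value)
                patched += 1
    return patched
```

Because unknown symbols are skipped rather than rejected, a binary carrying a new kind of entry (such as retpoline metadata) would still load in this model on a kernel that only knows the older symbols.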

These properties meant the DVRT was a perfect place to store our retpoline metadata. However, the existing DVRT format needed to be extended to support retpoline.

Based on Windows requirements, we classified indirect calls/jumps into three distinct forms and each of these forms has its own type of retpoline relocation and corresponding runtime fixup.

Import calls/jumps
Switchtable jumps
Generic indirect calls/jumps

Let’s talk a little about each of these types of calls.

Import Calls/Jumps

Import calls/jumps are, as the name implies, used for calls/jumps made by a binary to functions that have been imported from another binary. When compiling with retpoline, the compiler ensures that all such calls conform to the following form:

48 FF 15 XX XX XX XX     call qword ptr [_imp_<function>]
0F 1F 44 00 00           nop

The call or jmp instruction always directly references the import address table (IAT) and has 5 bytes of additional padding (to be used by the retpoline fixup).
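As an illustration of why this fixed shape is convenient, the following Python sketch recognizes the 12-byte call form shown above and computes which IAT slot it references. The function name and the use of RVAs are assumptions for the example. Only the call form is handled; the jmp form is left out.

```python
import struct

IMPORT_CALL_PREFIX = bytes.fromhex("48FF15")  # REX.W call qword ptr [rip+disp32]
NOP5 = bytes.fromhex("0F1F440000")            # 5-byte nop used as padding


def decode_import_call(code, offset, rva):
    """Return the RVA of the IAT slot used by the import call, or None.

    code   -- bytes of the code section
    offset -- index of the call instruction inside `code`
    rva    -- RVA of the call instruction (hypothetical input)
    """
    site = code[offset:offset + 12]
    if len(site) != 12 or site[:3] != IMPORT_CALL_PREFIX or site[7:] != NOP5:
        return None
    (disp,) = struct.unpack_from("<i", site, 3)
    # RIP-relative operands are relative to the end of the 7-byte call.
    return rva + 7 + disp
```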

Switchtable Jumps

Switchtable jumps are used for jumps made to other locations within the same function and are so-named because of their usage in implementing C/C++ switch statements. When compiling with retpoline support, the compiler ensures that such jumps are always made through a register and take the following form:

FF E0                    jmp rax
CC CC CC                 int 3

Generic Indirect Calls/Jumps

All other indirect calls/jumps fall into the generic type. To simplify the retpoline relocation format and the corresponding fixup logic, the compiler ensures that all such indirect calls/jumps provide their target address in the RAX register. The exact format of the call/jump instruction however differs depending on whether it is protected by control flow guard (CFG).

Loading binaries at runtime

Now that we have a way to identify all the indirect calls/jumps in the binary, we need to apply the fixups.

The NT memory manager has long had infrastructure to apply fixups to binaries at runtime. This infrastructure was extended to understand retpoline relocations and their corresponding fixups.

But what exactly do these fixups look like? As mentioned earlier, the Windows implementation needs to support mixed environments in which some drivers are not compiled with retpoline support. This means that we cannot simply replace every indirect call with a retpoline sequence like the example shown in the introduction. We need to ensure that the kernel gets the opportunity to inspect the target of the call or jump so that it can apply appropriate mitigations if the target does not support retpoline.

For this reason, we transform every indirect call or jump into a direct call or jump to a kernel provided “retpoline stub function”. For example, an indirect call to an imported function that looks like this:

call qword ptr [_imp_ExAllocatePoolWithTag]     ; Target address located at a REL32 offset
nop                                             ; Padding

Will be replaced at runtime with a direct call to the retpoline import stub:

mov r10, qword ptr [_imp_ExAllocatePoolWithTag] ; R10 = target address
call _guard_retpoline_import_r10                ; Direct REL32 call to the stub function
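The byte-level effect of this fixup can be sketched in Python. This is a simplified model, not the kernel's code: the function name and the RVA parameters are hypothetical, and the encodings used (4C 8B 15 for mov r10, [rip+disp32] and E8 for call rel32) are the standard x64 ones, assumed here for illustration.

```python
import struct


def rewrite_import_call(site, call_rva, stub_rva):
    """Turn the 12-byte 'call [rip+disp32]; nop' into 'mov r10, [...]; call stub'.

    site     -- the original 12 bytes at the call site
    call_rva -- RVA of the first byte of the call site (hypothetical input)
    stub_rva -- RVA of the import stub function (hypothetical input)
    """
    if len(site) != 12 or site[:3] != bytes.fromhex("48FF15"):
        raise ValueError("not an import call site")
    # The mov is also 7 bytes long, so the RIP-relative disp32 is reused as-is.
    disp = site[3:7]
    # rel32 is measured from the end of the new 5-byte call (site start + 12).
    rel = stub_rva - (call_rva + 12)
    if not -2**31 <= rel < 2**31:
        raise ValueError("stub is out of rel32 range")
    return bytes.fromhex("4C8B15") + disp + b"\xE8" + struct.pack("<i", rel)
```

The rewritten sequence occupies exactly the 12 bytes reserved by the original call plus its padding, which is why the 5-byte nop is emitted at compile time.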

There are several retpoline stub functions each of which is specialized to the type of call/jump it handles. However, each function generally performs the following steps:

Check if the target binary supports retpoline

Prior to transferring control to the target address, the function must determine whether the target address belongs to a driver that supports retpoline. To determine this, the kernel maintains a sparse bitmap of the entire kernel-mode address space with each bit describing a 64 KB region of the address space. Bits in this bitmap are set to 1 if and only if their corresponding region of address space belongs to a kernel-mode binary that fully supports retpoline.
If the bitmap check determines that the target address does not belong to a retpolined binary, the stub function has to fall back to the hardware-based Spectre variant 2 mitigation (by setting IBRS to restrict branch speculation) and then perform a regular indirect call/jmp. Otherwise, the kernel does not need to set IBRS. On processors that do not support IBRS, retpoline will, instead, perform IBPB if user-to-kernel protection is enabled as described here.
Since the target of a switch table jump is always in the same binary as the source (and therefore the target is guaranteed to support retpoline), this bitmap check is omitted from the switchtable jump stub functions.

Check if the target address is a valid CFG target

For CFG instrumented indirect calls/jumps the retpoline stub function is responsible for checking the kernel-mode CFG bitmap to verify that the target address given is a valid CFG call target. If this check fails, then the stub function will bugcheck the system to prevent any exploit that attempts an indirect control transfer to an invalid address.

Transfer control to the target using a retpoline.
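The three steps above can be modelled with a short Python sketch. Everything here is a hypothetical simplification: the class, the function, and the returned action strings are invented for illustration, the sparse bitmap is modelled as a set of 64 KB region indexes, and the CFG bitmap is modelled as a plain set of valid target addresses.

```python
REGION_SHIFT = 16  # each bit of the bitmap covers a 64 KB (2**16 byte) region


class RetpolineBitmap:
    """Sparse bitmap model: only regions whose bit is 1 are stored."""

    def __init__(self):
        self._regions = set()

    def mark_image(self, base, size):
        # Set the bit of every 64 KB region covered by a retpolined image.
        first = base >> REGION_SHIFT
        last = (base + size - 1) >> REGION_SHIFT
        self._regions.update(range(first, last + 1))

    def supports_retpoline(self, address):
        return (address >> REGION_SHIFT) in self._regions


def plan_indirect_transfer(bitmap, target, cfg_valid_targets=None):
    """Return the action a generic stub would take for `target`."""
    # Step 1: does the target belong to a retpolined binary?
    retpolined = bitmap.supports_retpoline(target)
    # Step 2: for CFG-instrumented calls, reject invalid targets.
    if cfg_valid_targets is not None and target not in cfg_valid_targets:
        return "bugcheck"
    # Step 3: transfer control, falling back to IBRS for non-retpolined targets.
    return "retpoline" if retpolined else "ibrs_fallback"
```

A switchtable jump stub would skip step 1 in this model, since its target is always in the same binary as the source.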

The usage of these stub functions ensures that we can satisfy the requirement to support mixed environments; however, they do introduce one additional problem. The x64 direct call/jump instruction can only encode a target address within 2 GB of the call-site (since the target is specified by a signed 32-bit offset). Since the retpoline stub functions are implemented in the NT kernel binary, this would generally mean that drivers would have to be loaded within 2 GB of the kernel binary.

To work around this requirement, all retpoline stub functions are contained within a single section of the NT kernel binary and have been carefully written to take no dependencies on their position relative to the rest of the binary. This allows us to map the physical memory pages backing the retpoline stub functions immediately after every driver in the system, giving each driver its own “copy” of the retpoline stub functions that is guaranteed to be within 2 GB of every indirect call/jump.
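The arithmetic behind this workaround can be checked with a small Python sketch. The addresses in the usage example are made up, and the helper name is hypothetical; the only fact relied on is that a direct call encodes a signed 32-bit displacement measured from the end of the instruction.

```python
def rel32_displacement(call_site, instr_len, target):
    """Return the signed 32-bit displacement to `target`, or None if unreachable.

    call_site -- address of the first byte of the direct call/jump
    instr_len -- length of that instruction (5 for 'call rel32')
    target    -- address the instruction should transfer control to
    """
    disp = target - (call_site + instr_len)
    return disp if -2**31 <= disp <= 2**31 - 1 else None


# Example with invented addresses: a driver loaded far away from the kernel.
kernel_stub = 0xFFFFF80000200000   # stub inside the NT kernel image
driver_base = 0xFFFFF88000000000   # driver image base
driver_size = 0x40000
call_site = driver_base + 0x1234

far = rel32_displacement(call_site, 5, kernel_stub)                 # unreachable
near = rel32_displacement(call_site, 5, driver_base + driver_size)  # mapped copy
```

In this model the stub inside the kernel image is out of range, while the copy mapped immediately after the driver is always reachable as long as the driver itself is smaller than 2 GB.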

Import optimization

Indirect calls due to imported functions are by far the most common form of indirect control transfers in kernel-mode. The import call targets are determined at driver load time by processing the import address table (IAT) and remain constant throughout the driver’s lifetime. This means that most of the work provided by the retpoline import stub is unnecessary because we know at driver load time exactly where each of these calls will end up going and we know whether the target binary supports retpoline or not. Hence, we can use a much faster calling sequence.

With import optimization, we use the retpoline fixup infrastructure to replace eligible import calls with direct calls to the imported function. This eliminates the overhead of the retpoline import call stub as well as the guaranteed branch prediction miss due to retpoline itself. To be eligible for import optimization, a call must meet the following requirements:

The call/jump must be from a retpolined binary to another retpolined binary.

This is necessary to maintain the security guarantees of retpoline because once we’ve rewritten the indirect call into a direct call the kernel no longer gets a chance to observe the target address and enable IBRS.

The target of the call must be within 2 GB of the call site.

This is because, as mentioned above, direct call/jump instructions on x64 can only encode a 32-bit offset.
In order to virtually guarantee that import optimization can be applied to all retpolined modules, the OS loader and kernel make sure that all kernel-mode modules are packed tightly in the address space while maintaining address space layout randomization (ASLR).

Here is an example of how the code generation for the call is modified.

Original code sequence

call [__imp_<Function>]                   ; Call to an imported function
nop                                       ; 5-byte nop

Import Optimized code sequence

mov r10, [__imp_<Function>]               ; R10 = target address (normal transformation)
call <Function>                           ; Direct REL32 call to target
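The eligibility rules above can be summarized as a small decision function. This Python sketch is an illustrative model only: the function name, its parameters, and the returned tuple are assumptions, and it covers just the two requirements listed in this section.

```python
def import_call_destination(call_end, target, stub,
                            caller_retpolined, target_retpolined):
    """Pick where the rewritten direct call should go.

    call_end -- address just past the rewritten 'call rel32' instruction
    target   -- address of the imported function (read from the IAT)
    stub     -- address of the driver's copy of the retpoline import stub
    Returns (kind, rel32), where kind is 'direct' or 'stub'.
    """
    rel = target - call_end
    in_range = -2**31 <= rel <= 2**31 - 1
    if caller_retpolined and target_retpolined and in_range:
        # Import optimization: call the imported function directly.
        return ("direct", rel)
    # Otherwise keep going through the stub so the kernel can inspect the target.
    return ("stub", stub - call_end)
```

Packing kernel-mode modules tightly in the address space makes the in-range condition hold in almost every case, leaving the retpoline status of the two binaries as the deciding factor.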

Import optimization turned out to be a big performance win! Hence, even on processors where retpoline cannot be used due to alternate return instruction behavior, we still use import optimization.

Conclusion

Retpoline has significantly improved the performance of the Spectre variant 2 mitigations on Windows. When all relevant kernel-mode binaries are compiled with retpoline, we’ve measured ~25% speedup in Office app launch times and up to 1.5-2x improved throughput in the Diskspd (storage) and NTttcp (networking) benchmarks on Broadwell CPUs in our lab. It is enabled by default in the latest Windows Client Insider Fast builds (for builds 18272 and higher on machines exposing compatible speculation control capabilities) and is targeted to ship with 19H1.

To check if retpoline and import optimizations are enabled, you can use the PowerShell cmdlet Get-SpeculationControlSettings. You can also use NtQuerySystemInformation to programmatically query retpoline status.

For a more in-depth look, here is a talk by Andrea Allievi at BlueHat 2018 talking about retpoline on Windows.

Give the latest builds a try and let us know your experience!