
Azure Host OS Update with Hypervisor Hot Restart

6 September 2023 at 16:48

Azure is Microsoft’s cloud computing offering, which provides IaaS (infrastructure as a service) virtual machines (VMs), PaaS (platform as a service) containers, and many other SaaS services (e.g., Azure Storage, Networking, etc.). As one of the largest cloud service providers, Azure hosts millions of customer virtual machines (VMs) in our data centers. The operating system that runs on these hosts is a modified version of Windows called Cloud Host. I talked about this and the Azure Host OS architecture (incl. the root OS and the hypervisor) in an earlier blog post. In this blog post we will talk about how we update the operating system that runs on those hosts, in particular a new advancement we made in updating the hypervisor called “Hypervisor Hot Restart” (HHR).

 

Azure Host OS Updates Overview

Ensuring the security of the Azure hosts is critical to maintaining our customers’ trust, as their applications run in a public cloud where customers have limited control over infrastructure updates. We ensure the security of the Azure host by patching it and keeping it up to date with all the latest applicable security updates. These patches are typically rolled out every month with no disruption to customer workloads. In addition to security updates, we also update the Azure Host OS to provide new features and functionality to customer VMs, e.g., support for new hardware generations or new features such as Azure confidential computing.

 

Note: this blog focuses on internal Azure Host OS technical details and does not talk about Azure customer facing VM updates and control mechanisms. Those maintenance controls or scheduled events are documented for our customers on Azure’s website.

 

Different Azure Host OS update mechanisms

These are the most common Azure Host OS update technologies used in the Azure fleet.

| Update Tech | Performance | Purpose |
| --- | --- | --- |
| Hot Patching | Best – in milliseconds; not visible to customer VMs | Typically used for monthly security updates (e.g., MSRC). More detailed blog on internals here. |
| VM PHU | Typically 30 secs | Updates the entire Azure Host OS. Paper in EuroSys 2021 for tech details. |
| Live Migration [1] | Multiple seconds | Migrates the VM to a different node, potentially emptying the node for Host OS updates or other needs. |
| Hypervisor Hot Restart | Under a second | Updates the entire hypervisor. Useful when updating to a new version with the latest features. |

 

[1] Future post on Live Migration internals

 

Introducing Hypervisor Hot Restart

With that introduction to Azure Host OS update technologies, we are going to do a deep dive into our latest and most advanced update technology: Hypervisor Hot Restart (HHR). HHR allows us to update and replace the hypervisor on a running system with sub-second blackout time for customer VMs and, importantly, without dropping any packets. With Hypervisor Hot Restart, we can deploy new hypervisor features or fixes easily, providing enormous customer value. This is especially important in today's world, where security threats are becoming more prevalent and sophisticated.

 

Hypervisor Hot Restart in Action

This is a demonstration of Hypervisor Hot Restart in action. It showcases four VMs that continue to run while the hypervisor is fully replaced underneath them. The network connection remains stable throughout the process and no packets are lost. It also shows the speed of the restart, with a maximum packet delay of 600 milliseconds. (Apologies for the low-quality GIF; the blogging platform has a small size limit. The original videos are attached to this blog post for offline viewing.)

 

HHRDemo.gif

 

How Does Hypervisor Hot Restart Work?

On an Azure node there will be one active hypervisor running the host operating system and the guest VMs. When we are ready to update the hypervisor, this active hypervisor creates a service partition where the new updated or latest hypervisor is initialized. The other partitions hosting the customer VMs continue to run normally, uninterrupted.

 

Once the new-hypervisor initialization is complete, it is ready, and the active hypervisor can now call into the new-hypervisor. Next, the active hypervisor creates a mirroring thread for each active partition, which replicates all state associated with the partition to the new-hypervisor. All partitions remain running while the mirroring threads reflect important state changes from the active hypervisor to the new-hypervisor. This mirrored state includes information such as memory ownership, partition lifecycle changes, device ownership, and so on.

 

All partitions are then temporarily suspended, and their state is saved into an internal hypervisor buffer to capture any state that has not already been mirrored. This phase is known as the "blackout" period, during which neither the host OS nor any guest VMs are running. Control of the physical machine is then passed to the new-hypervisor, which becomes the new active hypervisor. This time is well under a second as you see in the demo.

 

Finally, the active hypervisor restores the host OS and guest VMs, and their virtual processors resume execution. We can then reclaim memory that was used by the old hypervisor but is no longer needed by the new hypervisor. This allows us to perform repeated HHR operations without exhausting system resources.
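The sequence above (initialize, mirror, blackout, handoff, resume, reclaim) can be sketched as an ordered pipeline. This is a purely illustrative Python sketch, not the real hypervisor code; all phase and partition names here are invented.

```python
# Illustrative sketch of the Hypervisor Hot Restart phases described above.
# The real implementation runs inside the hypervisor itself; names are invented.

def hot_restart(partitions):
    """Return the ordered phases of one Hypervisor Hot Restart operation."""
    events = ["init-new-hv"]                  # new hypervisor boots in a service partition
    for p in partitions:                      # mirroring threads replicate state
        events.append(f"mirror:{p}")          # ...while all partitions keep running
    events.append("blackout-start")           # suspend partitions, save residual state
    events.append("handoff")                  # new hypervisor becomes the active one
    for p in partitions:                      # host OS and guest VMs resume
        events.append(f"resume:{p}")
    events.append("reclaim-old-hv-memory")    # free memory held by the old hypervisor
    return events
```

Note that mirroring happens per partition while everything is still running; only the short window between "blackout-start" and the resumes is visible to guests.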

 

To help visualize this process, we have created an animation that demonstrates Hypervisor Hot Restart. (Apologies for the low-quality GIF; the blogging platform has a small size limit. The original videos are attached to this blog post for offline viewing.)

 

HHR.gif

 

The development of Hypervisor Hot Restart enables easy deployment of new hypervisor versions with new features and capabilities, without VM downtime. For example, we used Hypervisor Hot Restart to mitigate Retbleed, a side-channel vulnerability that can compromise data security in virtualized environments. Deploying the latest hypervisor with HyperClear to protect against Retbleed marked our first use of Hypervisor Hot Restart in the Azure fleet, and we were able to roll HyperClear out across the fleet with sub-second blackouts.

 

With that we conclude our look into the internals of Azure Host OS updates with the latest Hypervisor Hot Restart technology. Expect to see more of Azure Host and Windows internals in future blogs.  

 

Cheers,

Meghna, Hari, Bruce (on behalf of the entire Core OS Team)

Confidential VMs on Azure

Microsoft’s virtualization stack, which powers the Microsoft Azure Cloud, is made up of the Microsoft hypervisor and the Azure Host OS. Security is foundational for the Azure Cloud, and Microsoft is committed to the highest levels of trust, transparency, and regulatory compliance (see Azure Security and Hypervisor Security to learn more). Confidential VMs are a new security offering that allows customers to protect their most sensitive data in use and during computation in the Azure Cloud.

 

In this blog we’ll describe the Confidential VM model and share how Microsoft built the Confidential VM capabilities by leveraging confidential hardware platforms (we refer to the hardware platform as the combination of the hardware and architecture specific firmware/software supplied by the hardware vendor). We will give an overview of our goals and our design approach and then explain how we took steps to enable confidential VMs to protect their memory, as well as to provide them secure devices like a vTPM, to protect their execution state and their firmware, and lastly to allow them to verify their environment through remote attestation.

What is Confidential VM?

The Confidential Computing Consortium defines confidential computing as “the protection of data in use by performing computation in a hardware-based, attested Trusted Execution Environment (TEE)”, with three primary attributes for what constitutes a TEE: data integrity, data confidentiality, and code integrity 1. A Confidential VM is a VM executed inside a TEE, “whereby code and data within the entire VM image is protected from the hypervisor and the host OS” 1. As crazy as this sounds – a VM that runs protected from the underlying software that makes its very existence possible – a growing community is coming together to build the technologies to make this possible.

 

For confidentiality, a Confidential VM requires CPU state protection and private memory to hold contents that cannot be seen in clear text by the virtualization stack. To achieve this, the hardware platform protects the VM’s CPU state and encrypts its private memory with a key unique to that VM. The platform further ensures that the VM’s encryption key remains a secret by storing it in a special register which is inaccessible to the virtualization stack. Finally, the platform ensures that VM private memory is never in clear text outside of the CPU complex, preventing certain physical attacks such as memory bus snooping, cold-boot attacks, etc. 

 

For integrity, a Confidential VM requires protection to ensure its memory can only be modified by that VM. To achieve this, the hardware platform both protects the contents of the VM’s memory against software-based integrity attacks and verifies address translation. The latter ensures that the address space layout (memory view) of the VM can only be changed with the cooperation and agreement of the VM.

 

Caroline_Perezvargas_0-1685559443296.png

An overview of our approach to Confidential VMs

As a type 1 hypervisor, Microsoft’s hypervisor runs directly on the hardware and all operating systems, including the host OS, run on top of it. The hypervisor virtualizes resources for guests and controls capabilities that manage memory translations. The host OS provides functionality for VM virtualization including memory management (i.e., providing guest VMs interfaces for accessing memory), and device virtualization (i.e., providing virtual devices to guest VMs) to run and manage guest VMs.

 

Caroline_Perezvargas_0-1690312022225.png

 

Since they take care of virtualizing and assigning resources to guest VMs, all virtualization stacks until recently assumed full access to guest VM state. We completely evolved Microsoft's virtualization stack to break those assumptions and support running Confidential VMs. We set a boundary to protect the guest VM from our virtualization stack, and we leverage different hardware platforms to enforce this boundary and help guarantee the Confidential VM’s attributes. Azure Confidential VMs today leverage the following hardware platforms: AMD SEV-SNP (generally available) and Intel TDX (in preview).

 

We wanted to enable customers to lift and shift workloads with little or no effort, so one of our design goals was to support running mostly unmodified guest operating systems inside Confidential VMs. Operating systems were not designed with confidentiality in mind, so we created the guest paravisor to bridge between the classic OS architecture and the need for confidential computing. The guest paravisor implements the TEE enlightenments on behalf of the guest OS so the guest OS can run mostly unmodified, even across hardware platforms. This can be viewed as the “TEE Shim” 1. You can think of the guest paravisor as a firmware layer that acts like a nested hypervisor. A guest OS that is fully enlightened (modified) to run as a Confidential guest can run without a guest paravisor (this is also supported in Azure today but out of scope for this blog).

 

Virtualization-based Security (VBS) lets a VM have multiple layers of software at different privilege levels. We decided to extend VBS to run the guest paravisor in a more privileged mode than the guest OS in a hardware platform-agnostic manner, using the hardware platform's confidential guest privilege levels. Because of this privilege separation, the guest OS and UEFI don’t have access to secrets held in guest paravisor memory. Our model allows the paravisor to provide a Confidential VM with privilege-separated features, including Secure Boot and its own dedicated vTPM 2.0. We’ll cover the benefits of these features in the attestation section.

 

On the guest OS side, we evolved the device drivers (and other components) to enable both Windows and Linux guest operating systems to run inside Confidential VMs on Azure. For the Linux guest OS, we collaborated with the Linux kernel community and with Linux distros such as Ubuntu and SUSE Linux. A Confidential VM turns the threat model for a VM upside-down: a core pillar of any virtualization stack is protecting the host from the guest, but with Confidential VMs there is a focus on protecting the guest from the host as well. This means that guest-host interfaces must also evolve to break previous assumptions. We therefore analyzed and updated these interfaces, including the guest device drivers themselves, to take a defensive posture that examines and validates data items coming from the host (e.g., ring buffer messages). This was one of the design principles for our paravisor, as it allowed us to move logic from the host into the guest to simplify the guest-host interfaces to be better able to protect the guest. We continue to further evolve these interfaces to continuously improve the defensive model of Confidential VMs.

Caroline_Perezvargas_1-1690312223864.png

Memory Protections

Two types of memory exist for Confidential VMs: private memory, where computation is done by default, and shared memory, used to communicate with the virtualization stack for any purpose, e.g., device I/O. Any page of memory can be either private or shared, and we call this its page visibility. A Confidential VM should be configured appropriately to make sure that data placed in shared memory is protected (via TLS, BitLocker, dm-crypt, etc.).

 

All accesses use private pages by default, so when a guest wants to use a shared page, it must explicitly do this by managing page visibility. We evolved the necessary components (including the virtual device drivers in the guest that communicate with the virtual devices on the host via VMbus channels) to enable them to use our mechanism to manage page visibility. This mechanism essentially has the guest maintain a pool of shared memory. When a guest wants to send data to the host, for example to send a packet via networking, it allocates a bounce buffer from that pool and then copies that data from its private memory into the bounce buffer. The I/O operation is initiated against the bounce buffer, and the virtual device on the host can read the data.
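The bounce-buffer flow just described can be sketched roughly as follows. This is a toy Python model under assumed names, not Azure code; the real mechanism lives in the guest's kernel-mode virtual device drivers.

```python
# Toy model (not Azure code) of the bounce-buffer flow: private data is copied
# into a buffer from the shared pool before host-visible I/O is started.
# All class and function names here are invented for illustration.

class SharedPool:
    """A pool of guest pages whose visibility has been flipped to 'shared'."""
    def __init__(self, num_buffers, buf_size):
        self.free = [bytearray(buf_size) for _ in range(num_buffers)]

    def alloc(self):
        return self.free.pop()      # real code would convert more pages on demand

    def release(self, buf):
        self.free.append(buf)

def send_to_host(pool, private_data: bytes) -> bytes:
    buf = pool.alloc()                          # 1. grab a bounce buffer
    buf[:len(private_data)] = private_data      # 2. copy out of private memory
    host_view = bytes(buf[:len(private_data)])  # 3. host device reads the shared page
    pool.release(buf)
    return host_view
```

The host only ever sees the shared buffer, never the private page the data originated from.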

 

The size of the pool of shared memory is not fixed; our memory conversion model is dynamic so that the guest can adapt to the needs of the workload. The memory used by the guest can be converted between private and shared as needed to support all the IO flows. Guest operating systems can choose to make use of our dynamic memory conversion model. The confidential computing platform prevents the hypervisor or anything other than the code inside the Confidential VM from making pages shared. Converting memory is therefore always initiated by the guest, and it kicks off both guest actions and host actions so that our virtualization stack can coordinate with the hardware platform to grant or deny the host access to the page.

 

Caroline_Perezvargas_2-1690312562881.png

Protecting Emulated Devices

For compatibility, the Azure Host OS provides emulated devices to closely mirror hardware, and our hypervisor reads and modifies a VM’s CPU state to emulate the exact behavior of the device. This allows a guest to use an unmodified device driver to interact with its emulated device. Since a Confidential VM’s CPU state is protected from our virtualization stack, it cannot use the emulated devices on the host OS anymore. As part of our goal to enable customers to run a mostly unmodified guest OS inside a Confidential VM, we didn’t want to eliminate emulated devices. Therefore, we decided to evolve our virtualization stack to support device emulation operating inside the guest instead of the host OS, so we moved emulated devices to the guest paravisor.

 

Caroline_Perezvargas_3-1690312783466.png

Protecting Runtime State

We believe a Confidential VM requires assurances about the integrity of its execution state, so we took steps to provide this protection to Confidential VM workloads. To support this, the hardware platform provides mechanisms to protect a Confidential VM from being vulnerable to unexpected interrupts or exceptions during its execution. Normally, our hypervisor emulates the interrupt controller (APIC) to generate emulated interrupts for guests. To help a guest OS running inside a Confidential VM handle interrupts defensively, the guest paravisor performs interrupt proxying, validating an interrupt coming from the hypervisor before re-injecting it into the guest OS. Additionally, with Confidential VMs there is a new exception type that needs to be handled by the guest VM instead of the virtualization stack. This exception is generated only by the hardware and is hardware platform specific. The paravisor can handle this exception on behalf of the guest OS for compatibility purposes as well.
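Interrupt proxying can be pictured as a validate-then-reinject filter in the paravisor. The sketch below is conceptual only; the vector policy is an invented assumption, not the actual validation logic.

```python
# Conceptual sketch of interrupt proxying by the guest paravisor. The policy
# (which vectors are acceptable) is an assumption for illustration only.

ALLOWED_VECTORS = set(range(32, 256))   # e.g., reject CPU-exception vectors 0-31

def proxy_interrupt(vector: int, inject_into_guest) -> bool:
    """Validate an interrupt from the untrusted hypervisor, then re-inject it."""
    if vector not in ALLOWED_VECTORS:
        return False                    # drop anything the guest should not receive
    inject_into_guest(vector)
    return True
```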

Caroline_Perezvargas_4-1690313109072.png

Protecting Firmware

A guest VM normally relies on the host OS to store and provide its firmware state and attributes (UEFI variables). For Confidential VMs, we evolved guest firmware to get trusted UEFI attributes from a new VM Guest State file instead of from the host. This file is packaged as a VHD and can be encrypted before VM deployment to provide a Confidential VM access to persistent storage that is inaccessible to the host. The host only interacts with the encrypted VHD (E2E process for its encryption is out of scope for this blog).

 

In the same way, a guest VM normally relies on the host OS to implement authenticated UEFI variables, ensuring that secure UEFI variables are isolated from the guest OS and cannot be modified without detection. To provide a Confidential VM authenticated UEFI variables that are secure from the host as well as the guest OS, the guest paravisor (running in the guest VM but isolated from the guest OS) manages authenticated UEFI variables. When a Confidential VM uses UEFI runtime services to write a variable, the guest paravisor processes the authenticated variable write and persists that data in the VM Guest State file. Our design allows a Confidential VM to persistently store and access VM guest state and guest secrets (e.g., UEFI state and vTPM state) that are secure from the host as well as the guest OS.

 

Caroline_Perezvargas_5-1690315183673.png

Remote Attestation

According to the industry definition of confidential computing, any TEE deployment “should provide a mechanism to allow validation of an assertion that it is running in a TEE instance” through “the validation of a hardware signed attestation report of the measurements of the TCB” 1. An attestation report is composed of hardware and software claims about a system, and these claims can be validated by any attestation service. Attestation for a Confidential VM “conceptually authenticates the VM and/or the virtual firmware used to launch the VM” 1.

 

A Confidential VM cannot solely rely on the host to guarantee that it was launched on a confidential computing capable platform and with the right virtual firmware, so a remote attestation service must be responsible for attesting to its launch. Therefore, a Confidential VM on Azure always validates that it was launched with secure, unmodified firmware and on a confidential computing platform via remote attestation with a hardware root of trust. In addition to this, it validates its guest boot configuration thanks to Secure Boot and vTPM capabilities.

 

Once the partition for a Confidential VM on Azure is created and the VM is started, the hardware seals the partition from modification by the host, and a measurement about the launch context of the guest is provided by the hardware platform. The guest paravisor boots and performs attestation on behalf of the guest OS, requesting an attestation report from the hardware platform and sending this report to an attestation service. Any failures in this attestation verification process will result in the VM not booting.
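The boot-time attestation handshake above roughly follows this shape. This is a toy sketch: the report format, the "signature" scheme (a keyed hash standing in for a hardware signature), and all function names are assumptions, not Azure's actual protocol.

```python
# Toy sketch of the boot-time attestation flow: hardware produces a signed
# launch measurement, an attestation service verifies it, and boot only
# continues on success. Not a real protocol; names are invented.

import hashlib

def hardware_report(launch_measurement: bytes, hw_key: bytes) -> dict:
    # The hardware platform produces a signed measurement of the launch context.
    sig = hashlib.sha256(hw_key + launch_measurement).hexdigest()
    return {"measurement": launch_measurement.hex(), "signature": sig}

def attestation_service_verify(report: dict, expected: bytes, hw_key: bytes) -> bool:
    good_sig = hashlib.sha256(hw_key + expected).hexdigest()
    return (report["measurement"] == expected.hex()
            and report["signature"] == good_sig)

def paravisor_boot(expected_measurement: bytes, hw_key: bytes) -> str:
    report = hardware_report(expected_measurement, hw_key)
    if not attestation_service_verify(report, expected_measurement, hw_key):
        raise RuntimeError("attestation failed: VM will not boot")
    return "boot-continues"
```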

 

After this, the guest paravisor transfers control to UEFI. During this phase Secure Boot verifies the startup components, checking their signatures before they are loaded. In addition, thanks to Measured Boot, as these startup components are loaded, UEFI accumulates their measurements into the vTPM. If the OS disk is encrypted, the vTPM will only release the key to decrypt it if the VM’s firmware code and configuration, original boot sequence, and boot components are unaltered.
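The "accumulates their measurements" step uses the standard TPM 2.0 extend operation, a hash chain in which both the components and their order matter. A minimal sketch:

```python
# Minimal sketch of the TPM 2.0 "extend" hash chain used by Measured Boot.
# Each component's measurement is folded into the PCR, so the final value
# depends on every component and on their order.

import hashlib

def extend(pcr: bytes, component: bytes) -> bytes:
    # new PCR = H(old PCR || H(event data)), per the TPM 2.0 model
    return hashlib.sha256(pcr + hashlib.sha256(component).digest()).digest()

pcr = bytes(32)  # PCRs start zeroed at boot
for stage in (b"uefi", b"bootloader", b"kernel"):
    pcr = extend(pcr, stage)
# A vTPM can seal the disk-decryption key to this value: change any stage,
# or the boot order, and the final PCR (and thus the unseal) is different.
```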

 

Caroline_Perezvargas_7-1685559443316.png

Call to Action

In this blog we described how in collaboration with our industry partners, we evolved our entire virtualization stack to empower customers to lift and shift their most sensitive Windows and Linux workloads into Confidential VMs. We also gave a deep technical overview of how we protect our customers’ workloads in these Confidential VMs. This is an innovative new area, and we want to share that journey with our customers who also want to move into this new Confidential Computing world. As you use Confidential VMs on Azure, we would love to hear about your usage experiences or any other feedback, especially as you think of other scenarios (in the enterprise or cloud).

 

- the Core OS Platform team.

 

References:

1Common Terminology for Confidential Computing, December 2022, Confidential Computing Consortium.

Azure Host OS – Cloud Host

Azure Host OS – Cloud Host

One Windows

 

Windows is a versatile and flexible operating system, running on a variety of machine architectures and available in multiple SKUs. It currently supports x86, x64, and ARM architectures, and it even used to support Itanium, PowerPC, Alpha, and MIPS (wiki entry). Windows also runs in a multitude of environments: from data centers, laptops, and phones to embedded devices such as ATMs.

 

Even with all of this support, the core of Windows remains virtually unchanged across all these architectures and SKUs. Windows dynamically scales up, depending on the architecture and the processor it runs on, to exploit the full power of the hardware. The same applies to Microsoft Azure. So, if you have ever wondered how Windows runs Azure nodes in the data center, read ahead!

 

As Satya says, “we are building Azure as the world’s computer”, and powering the world’s computer shows the ability of Windows to scale up and scale out. To demonstrate this scale, here is a snapshot of taskmgr running directly on the Azure host in an M-series machine (which hosts some of the largest VMs available in Azure, showing 896 logical processors) in the data center.

 

Hari_Pulapaka_0-1672935896309.png

M-series taskmgr

 

In this post, we will talk about the internals of the Azure Host OS which powers the Azure hosts in the data center.

 

Cloud Host – Azure Host Operating System

Azure, of course, is Microsoft’s cloud computing service, which provides IaaS (infrastructure as a service) virtual machines (VMs), PaaS (platform as a service) containers, and many other SaaS services (e.g., Azure Storage, Networking, etc.). For the IaaS and PaaS services, all customer code eventually ends up running in a virtual machine. Hence, at the core platform layer, the main purpose of the Azure Host operating system is to manage virtual machines, and manage them really well! Managing VMs includes launching, shutting down, live migrating, updating them, etc.

 

Since Azure uses Windows as the operating system, all these VMs run as guests of Microsoft Hyper-V, which is our hypervisor. Microsoft Hyper-V is a type 1 hypervisor, so when I say Azure Host operating system, it's technically the root operating system. This is the OS that has full control of the hardware and provides virtualization facilities to run guest VMs.

 

Keep in mind that the hypervisor we use is the same hypervisor that we use on Windows Client and Windows Server across all our millions of customer machines. We will have upcoming blog posts explaining some of the key features of Microsoft Hyper-V that allow Azure to securely and reliably manage guest VMs.

 

Cloud Host 

As I mentioned, the goal of the Azure Host OS is to be very good at managing the lifecycle of VMs. This means that Windows (aka the Azure Host OS) doesn’t need much of the functionality typically associated with Windows to perform this role. Hence, we created a specially crafted console-only (no GUI, some also call it headless) edition of Windows called Cloud Host.

 

This is a OneCore-based edition of Windows. OneCore is the base layer upon which all the families of Windows SKUs (or editions) build their functionality. It is a set of components (executables, DLLs, etc.) that are needed by all editions of Windows (PC, Windows Server, Xbox, or IoT). For a programming analogy, it is the base class from which all the Windows classes inherit (e.g., Object). If you look inside OneCore to see what functionality it provides, you can see API sets which provide core functionality such as the kernel, hypervisor, file system support, networking, security, Win32 APIs, etc. OneCoreUAP, called out in the picture below, is another example of a slightly higher layer that is used to build client PC editions; it includes the UWP programming surface, the GUI stack, and higher-level components such as the media stack and Wi-Fi.

Hari_Pulapaka_0-1672937725585.png

Overview of some representative components available in OneCore

 

How did we build Cloud Host?

There is a minimal amount of code that needs to run on the Azure host to integrate with the control plane as well as monitor and manage containers/VMs. Based on an analysis of the dependency set of this code, we identified the set of functionality (DLLs and API sets) that Azure needs on top of OneCore. This handful of binaries (tens of binaries) was then added to OneCore to form the OS for the Azure host.

 

To add these DLLs, we created a brand-new SKU called Cloud Host and added all these binaries to it. You can think of Cloud Host as a “child class” of OneCore. Note that we had to create the new “Cloud Host” SKU because we needed to add new binaries to OneCore. We could have added them to OneCore directly, but it's cleaner to create a purpose-built SKU/edition while keeping OneCore unmodified. In other words, Cloud Host is a special-purpose SKU designed and built to run the Azure host nodes in the data center. You may be more familiar with other Windows SKUs, often referred to as editions, such as Pro, Enterprise, etc. [wiki]. Cloud Host is a similar SKU/edition, one that is used only for Azure nodes in the data center.
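The dependency analysis mentioned earlier (start from the Azure host code and walk its imports to a fixed point) can be sketched as a transitive-closure walk. The binary names and the dependency map below are entirely invented for illustration.

```python
# Hypothetical sketch of the dependency analysis used to decide which binaries
# go on top of OneCore: walk each binary's imports to a fixed point. The names
# and the dependency map are invented, not the real Azure host dependency set.

deps = {
    "azhost_agent.exe": ["vmms.dll", "kernel32.dll"],
    "vmms.dll": ["vmcompute.dll", "kernel32.dll"],
    "vmcompute.dll": ["kernel32.dll"],
    "kernel32.dll": [],
}

def closure(roots, deps):
    """Transitive closure of binaries reachable from the root set."""
    needed, stack = set(), list(roots)
    while stack:
        binary = stack.pop()
        if binary not in needed:
            needed.add(binary)
            stack.extend(deps.get(binary, []))
    return needed
```

Everything in the closure that is not already in OneCore is what the new SKU has to carry.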

 

With that explanation, let’s look at Cloud Host. Here is a picture of the Cloud Host WIM file (a WIM file is like a zip file that stores the Windows image to boot from). You can see its size is 280 MB, which is more than 10 times smaller than a typical PC WIM file.

 

Hari_Pulapaka_0-1672937889367.png

 

That is significantly smaller than any Windows you use on your PC; a typical client Enterprise WIM file is around 3.6 GB in size.

 

Hari_Pulapaka_1-1672937889369.png

 

Cloud Host boots into a console shell and the experience would typically be similar to Windows Server Core. Here is a picture of a Cloud Host session, from one of our test machines.

 

(Keep in mind, we do NOT typically log onto Azure Host Nodes, this is purely for demo purposes)

 

Hari_Pulapaka_0-1672938022848.png

Cloud Host with cmd shell, taskmgr and Regedit

 

Another thing you may have noticed is that the taskmgr or even regedit does not look the same as you would see on Windows 11. This is because as I mentioned, Cloud Host is built on OneCore and it is headless (or console based), hence, it doesn’t contain any of the GUI pieces of Windows. We have a special taskmgr and regedit version that doesn’t link with all the modern GUI functionality available in Windows 11, which gives them the “old style” look.

 

API: What kind of code can run on Azure Host nodes?

We can run C++, Python, and even Rust code on the Azure Host. The main thing to keep in mind is that, as a developer building code to run on the Azure Host (which is only our internal developers), you can only link against the OneCore SDK (onecore.lib). We have documented the API surface available in OneCore here, along with info on building against OneCore here.

 

Hari_Pulapaka_0-1672938173121.png

 

With that look into the internals of Azure Cloud Host, future blog posts will continue into the code and design internals of updating the Azure Host (e.g., Tardigrade, VM PHU, Hypervisor Hot Restart, and Live Migration), kernel/virtualization features, security and many more areas in the operating system platform.

 

Cheers,

 

Hari (on behalf of the entire Core OS Team)

Windows OS Platform

Welcome to the Windows Core OS platform team blog. We own the core of the Windows operating system: primarily the virtualization platform, silicon support (Intel, AMD, and Arm), kernel, storage file system, and hyperconverged storage infrastructure including Storage Spaces Direct. In addition, we also build and deploy a specially crafted operating system based on Windows for the Azure Cloud. We migrated our old blog, which focused purely on the Windows kernel, to incorporate the full stack of tech owned by the team.

 

The team will continue to use this blog to talk about the internals of upcoming features in the Windows Core OS Platform space, very similar to my previous blog on Windows Kernel Internals. For our first blog we will be talking about how we use all our technologies to provide a special-purpose host operating system for the Azure Cloud, called Cloud Host.

 

Cheers

 

Hari

Multi-Key Total Memory Encryption on Windows 11 22H2

By: Jin_Lin
28 November 2022 at 19:56

The security and privacy of customer data is a core priority for Azure and Windows. Encrypting data across different layers of device and transport is a universal technique to prevent exploits from accessing plaintext data. In Azure, we have a multitude of offerings that provide different levels of data confidentiality, encryption, and isolation across workload types (Azure Confidential Computing – Protect Data In Use | Microsoft Azure). One such offering is VM memory encryption with Intel’s Total Memory Encryption – Multi-Key (TME-MK), which provides hardware-accelerated encryption of DRAM. With the latest Intel 12th Gen Core CPUs (Alder Lake) offering this capability, we are delighted to extend support for TME-MK in Windows 11 22H2.

 

End-to-end Encryption

Encryption has long been an established mechanism to keep data from prying eyes. By encrypting data while it is at rest, in transit, and in use – we can prevent unexpected parties from getting access to sensitive information for the lifetime of data.

 

Data-at-rest is protected through (a plethora of) disk encryption technologies and data-in-transit is protected through network encryption protocols (SSL/TLS/HTTPS), both used in modern workloads for many years. Data-in-use protection has recently become available through the latest generation hardware in Azure, providing an end-to-end encryption schema. Memory encryption technology innovations are now becoming available in client CPUs.

 

PC Encryption Landscape

Windows introduced BitLocker (first shipped with Windows Vista) to encrypt data while it resides in persistent storage, ensuring that a stolen laptop does not result in exposure of the customer’s files saved on disk. Attackers continually get more sophisticated and mount physical attacks to retrieve data from volatile memory (i.e., DRAM). One example is cryogenically freezing memory, which enables its contents to persist for long periods of time. Another is installing an interposer that sits between the DRAM chip and the DRAM slot.

It is logical to extend cryptographic protection to data while it is in memory, but doing so entirely in software is expensive. Modern CPUs provide hardware-accelerated capabilities (Intel Total Memory Encryption) where the memory controller encrypts data before it is committed to the DIMMs and decrypts it when it needs to be computed on. Memory controller-accelerated encryption also has the nice property that workloads do not need to be specially modified to take advantage of it; the operating system and hardware handle these operations transparently.

 

Memory controller-based encryption prevents attackers who have physical access to DRAM from being able to read in-memory contents in plaintext. TME-MK extends that paradigm by enabling different VMs (partitions) to have unique memory encryption keys.

 


 

Total Memory Encryption – Multi Key (TME-MK)

TME-MK is available in Intel 3rd Generation Xeon server processors and Intel 12th Generation Core client processors. Azure, Azure Stack HCI, and now the Windows 11 22H2 operating system take advantage of this new generation of hardware. TME-MK is compatible with Generation 2 VMs, version 10 and newer (see the list of guest OSes supported in Generation 2 VMs).

 

On Azure, customers that use DCsv3 and DCdsv3-series Azure Virtual Machines benefit from TME-MK.

 

TME-MK capabilities are also available starting with Azure Stack HCI 21H2 and Windows 11 22H2. Go to the Azure Stack HCI catalog and filter on “VM memory encryption” to find Azure Stack HCI solutions that support TME-MK.

 


 

To boot a new VM with TME-MK protection (assigning it its own encryption key, distinct from other partitions), use the following PowerShell cmdlet:

Set-VMMemory -VMName <name> -MemoryEncryptionPolicy EnabledIfSupported

 

To verify that a running VM is using TME-MK for memory encryption, use the following PowerShell cmdlet:

Get-VMMemory -VMName <name> | fl *

 

The following return value would describe a TME-MK protected VM:

MemoryEncryptionPolicy  : EnabledIfSupported

MemoryEncryptionEnabled : True

 

To learn more about the syntax and parameters for creating VMs with PowerShell, see New-VM (Hyper-V) | Microsoft Learn.

Under the hood, the operating system requests that the CPU generate an ephemeral key that lives for the duration of the VM’s lifetime. This key never leaves the CPU and is not visible even to the operating system or hypervisor. The hypervisor then sets the associated bits in the second-level address translation (SLAT) page tables describing the physical addresses associated with the VM, so that the memory controller encrypts data with that key as it moves to and from memory.
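The flow above can be modeled in miniature. The sketch below is a toy simulation, not the real hardware: each page carries a key ID the way a tagged SLAT entry would, a stand-in "memory controller" XORs data with the per-partition key on the way to and from "DRAM" (XOR stands in for AES), and every name here (mc_write, mc_read, the key table) is invented for illustration.

```c
#include <stdint.h>

/* Toy TME-MK model: one ephemeral key per partition, selected by  */
/* a per-page key ID. XOR stands in for AES; illustration only.    */

#define NUM_KEYS 4

static uint8_t keys[NUM_KEYS] = { 0x00, 0xA5, 0x3C, 0x7E };
static uint8_t dram[16];          /* "physical" memory: always ciphertext */
static uint8_t page_key_id = 1;   /* key ID tagged on this page           */

/* The memory controller encrypts on the way to DRAM...            */
void mc_write(unsigned off, uint8_t plain, uint8_t key_id) {
    dram[off] = plain ^ keys[key_id];
}

/* ...and decrypts on the way back, using the supplied key ID.     */
uint8_t mc_read(unsigned off, uint8_t key_id) {
    return dram[off] ^ keys[key_id];
}
```

A partition reading with its own key ID gets its plaintext back; reading with another partition's key ID, or inspecting DRAM directly (a physical attack), yields only ciphertext.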

 

Conclusion

The privacy and security of customer data is top of mind for Windows 11. Windows will continue to evolve and adopt modern defense-in-depth capabilities to protect our customers. For more information on Intel TME-MK, read Intel’s whitepaper: https://www.intel.com/content/www/us/en/architecture-and-technology/vpro/hardware-shield/total-memory-encryption-multi-key-white-paper.html

 

Windows OS Platform (Hyper-V Security) Team 

Jin Lin, Alexander Grest, Bruce Sherwin

 

Hotpatching on Windows

20 November 2021 at 03:10

Introduction

A core priority of the Windows Kernel team is to keep the operating system, applications, and users secure. Like many operating systems, Windows has a large codebase, a driver ecosystem, and a complex set of dependencies. Every day, many malicious actors attempt to find vulnerabilities. To fix these vulnerabilities, Microsoft has historically combined a group of security fixes into what is known as a security patch.

 

Updates on Windows

Traditionally, security patches have been deployed on the second Tuesday of every month, known as Patch Tuesday. These patches are developed by feature teams as a fix for various security vulnerabilities in the OS. By providing these security patches, we aim to make the Windows OS more secure and eliminate the opportunity of malicious actors to exploit vulnerabilities. Within each patch, both user mode (application) and kernel mode (system) binaries can be updated, and typically this requires a reboot.

 

Some scenarios require continuous or near-continuous availability. For example, the instances of Windows Server that power the Azure fleet are required to be highly available. However, we also require these operating system instances to be secure. While technologies like Kernel Soft Reboot and VM preserving host updates already exist to minimize VM downtime while changing major OS releases, security patches are applied frequently enough that even this technique impacts downtime.

 

Why do updates require rebooting?

Usually, many binaries from all over the system are accessed and changed when a patch is applied. The reason a reboot is almost always required is that a binary needing an update is usually actively mapped in one or more processes, so its code may be currently executing. Certain kernel and user-mode binaries, like win32k.sys or ntdll.dll, are always loaded into memory, and others, like Explorer.exe, are loaded whenever there is an active user session. When binaries such as these are patched as part of an update, a restart is required for the patch to be successfully installed. When an update targets the NT kernel or other core components, a restart is always required because it is not possible to unload those binaries while their code is executing. Traditionally, even if only one fix within the entire patch required a reboot, the machine would still have to reboot to successfully install the patch.

 

Current security issues with delayed patching

Security patches are intended to be applied to the Windows OS as soon as they are released by Microsoft. Often, users and system administrators delay the installation of a patch because of the reboot that is frequently required to complete the installation. This delay in patching, while seemingly convenient, is actually a security issue. The FireEye Mandiant Threat Intelligence report shows that in 2018 and 2019, the exploitation of 42% of vulnerabilities occurred after a patch was already released. Furthermore, internal MSRC data shows that in 2020, around 75% of public proof-of-concept vulnerabilities were exploited after a patch had already been released. By limiting or eliminating the time between when a patch is issued and when it is applied, there is a substantial opportunity to reduce the total number of exploited vulnerabilities.

 

What is Hotpatching?

Hotpatching is the capability of an operating system to modify, on the fly, code that may currently be executing in another entity (an application or driver). The hotpatching process should be invisible to the application, library, or driver that is executing the code. This implies that the hotpatch engine must respect some constraints, which will be explained later in this post. Hotpatching allows the OS to install security patches without requiring a reboot, ensuring increased security without sacrificing the availability of the machine. By utilizing techniques in the Windows kernel, updates can be applied without a direct impact to the user. In server scenarios, hotpatching allows administrators to update their guest VMs without rebooting them, leading to reduced downtime. Hotpatching is one of the first techniques geared toward bringing users a reboot-less security update future.

 

While hotpatch is a new feature for our customers, it has been in use in the Azure Host OS for a while. Internal Azure administrators have been providing rebootless security updates to Azure host machines for long enough to collect data and improve hotpatching itself. Hotpatching is a battle-tested method of updating binaries on a system without the need to reboot.

 

The Hotpatch architecture

Hotpatch is implemented in various parts of the NT kernel, Secure Kernel and Ntdll module. Before peeking at the engine’s architecture, we should explain how the system is able to dynamically patch a binary.

 

Hotpatching works at the function level, which means that functions are individually patched and not individual files or components. Function level hotpatching works by redirecting all invocations of an un-patched function belonging to a base image to a patched function belonging to a hotpatch image. Many types of binaries can be patched using this technique, including usermode executables (EXEs and DLLs), system drivers, and even the Hypervisor and Secure Kernel binaries. Note that hotpatch images are considered cumulative, which means that each hotpatch image includes the changes from all other previous hotpatch images targeting the same base image. Multiple hotpatch images can be applied to the same base image and can be rolled back in a similar manner. The latest version of Hotpatch supports both x64 and ARM64 architectures, including 32-bit code running under WOW64.
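Because patch images are cumulative and carry sequence numbers, a newer patch supersedes every older one for the same base image. The toy C sketch below simulates that policy with a function-pointer dispatch slot standing in for the real jump patch; all names (apply_patch, dispatch_add) are invented for illustration, and this is not the engine's actual logic.

```c
#include <stddef.h>

/* Toy model of cumulative, function-level patching: a "patch"     */
/* carries a sequence number and (being cumulative) the complete   */
/* set of redirect targets, superseding all earlier patches.       */

static int base_add(int a, int b) { return a + b - 1; }  /* buggy   */
static int fix1_add(int a, int b) { return a + b; }      /* patch 1 */
static int fix2_add(int a, int b) { return a + b; }      /* patch 2 */

typedef int (*fn_t)(int, int);
static fn_t dispatch_add = base_add;   /* forward-patch slot        */
static int applied_sequence = 0;

/* Apply a cumulative patch only if it is newer than what's applied. */
int apply_patch(int sequence, fn_t new_add) {
    if (sequence <= applied_sequence)
        return 0;                      /* stale or duplicate: rejected */
    dispatch_add = new_add;
    applied_sequence = sequence;
    return 1;
}

int call_add(int a, int b) { return dispatch_add(a, b); }
```

Applying patch 1 redirects callers to the fixed function; re-applying the same sequence number is rejected, while a higher sequence number wins.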

 

Patch images, shown in Figure 1, are standard PE (Portable Executable) images, but they contain special information. In particular, the Hotpatch Table (indexed by the Image load configuration directory) contains all the information that describes the patch image, like the expected engine version, the size of the patch table, patch sequence number, and an array of compatible base image descriptors.

 


Figure 1. Hotpatch image format.

 

 

Each patch image is designed for a specific base image. The compatible base image is identified through a checksum and a time-date stamp. The patch engine will refuse to apply the patch if the base image’s checksum and time-date stamp do not match any of the descriptors. In this case, the patch is added to an internal list and applied only when the correct base image is loaded later (a procedure called “deferred application”).

 

The operations that are performed by the engine when applying a patch are described by an array of hotpatch descriptors. A hotpatch descriptor tells the engine what type of patch each record specifies (function patch, global symbol patch, indirect call, CFG call target, and so on). It is composed of a header and one or more hotpatch records. Each record specifies the patch’s parameters, which depend on the type of the descriptor, like the source and target functions’ RVAs and the original opcode bytes.
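The descriptor layout can be pictured roughly as below. This is an illustrative C sketch, not the actual on-disk format: the field names, widths, and the fixed-size record array are all invented to mirror the description (a header naming the patch type and record count, followed by records carrying RVAs and saved opcode bytes).

```c
#include <stdint.h>

/* Illustrative (NOT actual) layout mirroring the description: a   */
/* descriptor header naming the patch type, followed by records    */
/* that carry source/target RVAs and the original opcode bytes.    */

typedef enum {
    PATCH_FUNCTION,       /* redirect a function           */
    PATCH_GLOBAL_SYMBOL,  /* fix up a global reference     */
    PATCH_INDIRECT_CALL,  /* fix up an indirect call site  */
    PATCH_CFG_CALL_TARGET /* register a CFG call target    */
} patch_kind_t;

typedef struct {
    uint32_t source_rva;       /* where in the base image   */
    uint32_t target_rva;       /* where in the patch image  */
    uint8_t  original_code[8]; /* bytes saved for undo      */
} hotpatch_record_t;

typedef struct {
    patch_kind_t kind;
    uint32_t     record_count;
    hotpatch_record_t records[4]; /* a real format would use a
                                     variable-length array  */
} hotpatch_descriptor_t;

/* A descriptor is plausible if it has a known kind and >=1 record. */
int descriptor_ok(const hotpatch_descriptor_t *d) {
    return d->kind <= PATCH_CFG_CALL_TARGET
        && d->record_count >= 1 && d->record_count <= 4;
}
```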

 

The Hotpatch engine

The Hotpatch engine is implemented in various parts of the operating system, mostly in the NT and Secure kernel. The engine, as introduced in the previous paragraph, supports different kinds of images: Hypervisor, Secure Kernel and its modules, NT Kernel drivers and User-mode processes. The hotpatch engine requires the Secure Kernel to be running.

 

To apply a patch to an image, the NT kernel takes several steps that start in the MiLoadHotPatch internal function, which temporarily maps the patch image into the system address space and performs the initial analysis, with the goal of locating and verifying the hotpatch information contained in the PE data structures (shown in Figure 1). After the checksum and timestamp of the target image for which the patch was designed are located, the NT kernel determines whether the corresponding base image is loaded in the system (the base image can also be a secure image, like the Hypervisor or the Secure Kernel, so this step may also need to invoke the Secure Kernel).

When a compatible image is detected, the NT kernel begins to apply the patch to the target base image, using a procedure that differs slightly depending on the type of the base image (user-mode library or process, kernel driver, or a secure image). In general, the hotpatch engine maps the patch image into the same address space as the base image (as shown in Figure 2): for user-mode patches, the patch image is mapped into each process that has the base image loaded.

 

Note that the hotpatch engine also supports session drivers. A session driver lives in a kernel-mode address space tied to the user logon session (the session address space is generated by one particular root page table entry, which the Memory Manager switches on demand depending on the active session). This means that a particular session can have a driver mapped that does not exist in another session. The hotpatch engine is able to attach to all sessions in the system thanks to the “HotPatch” process created in phase 1 of NT kernel initialization. This minimal process does not belong to any session, so the hotpatch engine can use it to temporarily attach to any session in the system and apply the patch only in the sessions where the driver is currently loaded.

 


Figure 2. Various address spaces supported by hotpatching on Windows.

 

Once the hotpatch image is mapped, the patch engine within the kernel starts to apply the patch by performing Backward patch application as described by the hotpatch records:

  • Patches all callees of patched functions in the patch image to jump to the corresponding functions in the base image. The reason for this is to ensure that all the unpatched code executes from the original base image. For example, if function A calls function B in the original base image and the patch image patches function A but not function B, then the patch engine will update function B in the patch image to jump to function B in the original base image.
  • Patches the necessary references to global variables in hotpatch functions to point to the corresponding global variables in the original base image.
  • Patches the necessary import address table (IAT) references in the hotpatch image by copying the corresponding IAT entries from the original base image.

It then performs the Forward patch application by patching the necessary functions in the original base image to jump to the corresponding functions in the patch image. Once this is done for any given function in the original base image, all new invocations of that function will execute the new patched function code from the hotpatch image. Once the hotpatched function returns, it will return to the caller of the original function.
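The backward-then-forward sequence above can be simulated with plain function-pointer tables standing in for the jump instructions the real engine writes. This is a hedged toy model: the function names and the apply() helper are invented, and pointer slots replace actual code patching.

```c
/* Toy backward/forward patch application. Function-pointer slots  */
/* stand in for the jump patches the real engine writes.           */

static int base_B(void) { return 10; }          /* B is unpatched  */
static int (*base_tbl_B)(void) = base_B;

/* Patch image: A is patched; its copy of B must NOT be executed.  */
static int patch_B_stub(void) { return -999; }  /* must never run  */
static int (*patch_tbl_B)(void) = patch_B_stub;

static int base_A(void)  { return 1 + base_B(); }      /* old A    */
static int patch_A(void) { return 2 + patch_tbl_B(); } /* new A    */
static int (*base_tbl_A)(void) = base_A;

void apply(void) {
    /* Backward: unpatched callees in the patch image jump back to */
    /* the base image, so only patched code runs from the patch.   */
    patch_tbl_B = base_tbl_B;
    /* Forward: the base image's A now jumps to the patched A.     */
    base_tbl_A = patch_A;
}

int call_A(void) { return base_tbl_A(); }
```

Before apply(), callers get the old A; afterward they get the patched A, which still executes the original, unpatched B from the base image.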

 

The procedure described above, which for kernel drivers is executed by the Secure Kernel, has been highly simplified. Note that the hotpatching process requires proper synchronization: no processor should be able to execute the original instructions while a patch application is in progress. The Secure Kernel is also able to interact with HyperGuard, which allows PatchGuard-protected images to be correctly patched.

 

The Hotpatch Address Table (HPAT)

When applying a patch to a function, the hotpatch engine must store the trampoline that transfers execution from the base function to the patched function. The trampoline can’t be stored in the old, un-patched function for several reasons: currently running code might hit invalid instructions, and there is no guarantee that enough space exists in the old function’s code. Furthermore, the patch engine supports both applying and reverting (undoing) a patch, which means the original replaced bytes must be stored somewhere. Trampoline code that transfers execution to the target function is therefore placed in the Hotpatch Address Table code page (abbreviated HPAT).

 

When the system initially boots, the Windows loader determines the size of the HPAT area, which is composed of a combination of data and code pages (to support ARM64, and scenarios where Retpoline is enabled on x64). When hotpatch is enabled, each boot driver is loaded into memory with the HPAT pages reserved at the end of the PE image, before the Retpoline code page. (Further information about Retpoline on Windows is available here: Mitigating Spectre variant 2 with Retpoline on Windows - Microsoft Tech Community.) Note that “reserved” means that no actual physical memory is consumed. User-mode binaries are handled similarly.

 

When a patch is applied to a base image, the HPAT pages for both the base and the patch images are mapped to valid physical pages. When a function is patched for the first time, the patch engine allocates an HPAT entry for it and fills the code and data slot with the trampoline code and the target address. Subsequent patches for a function only update the target address. Only a single instruction is replaced in the prologue of the original function’s code. The overwritten opcode is saved in the Undo table to be replaced if the patch is reverted. Figure 3 summarizes this process:

 


Figure 3. Code flow for a hotpatched function.
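The save/patch/revert cycle can be sketched in miniature. In this hedged toy model, a function-pointer slot stands in for the single replaced prologue instruction, and a one-entry undo table saves the original value so the patch can be reverted; the names (hpat_apply, prologue_slot) are invented for illustration.

```c
/* Toy HPAT: the "one replaced prologue instruction" is modeled as */
/* a function-pointer slot; the undo table saves the original      */
/* value so the patch can be reverted.                             */

static int v1(int x) { return x; }        /* original function     */
static int v2(int x) { return x + 1; }    /* patched function      */

typedef int (*fn_t)(int);
static fn_t prologue_slot = v1;   /* stands in for the prologue    */
static fn_t undo_table[1];        /* saved original "opcode"       */
static int patched = 0;

void hpat_apply(fn_t target) {
    if (!patched) {
        undo_table[0] = prologue_slot;  /* save the original once  */
        patched = 1;
    }
    prologue_slot = target;  /* later patches only update target   */
}

void hpat_revert(void) {
    if (patched) {
        prologue_slot = undo_table[0];
        patched = 0;
    }
}

int call(int x) { return prologue_slot(x); }
```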

 

Windows Server 2022 - New Hotpatch features

The upcoming Windows Server 2022 release includes the following improvements which make hotpatching applicable to a wider set of changes:

  • Patch images can now import new functions from other binaries.
  • The hotpatch engine now supports ARM64 as well.
  • The patch engine now supports a patch callback, exported from the patch image through the “__PatchMainCallout__” function. The callback allows the patch image to perform initialization steps (like allocating memory, initializing new globals, and so on) after one or both phases of the patch application (described previously) have completed.
  • Hotpatch is compatible with Retpoline. A new Retpoline dispatch function (internally called “__guard_retpoline_jump_hpat”) is invoked from the HPAT code entry and can safely transfer execution to the target patch function without being vulnerable to Spectre v2 side-channel attacks.

 

Conclusion

Hotpatch is a powerful feature used by the Azure fleet and Windows Server Azure Edition to eliminate downtime when applying security patches or even adding small features to the OS. Although some limitations on the functions being patched still exist (for example, function signatures can never be changed), most of them have been addressed in the new version of the engine.

 

How can you get access to the hotpatch feature?

Hotpatch-based security updates are available to customers running Windows Server 2019 and Windows Server 2022 Azure Edition images in the Azure cloud within the automanage framework. Documentation is provided on this page. We are working on bringing hotpatch-based security updates to a wider set of Windows customers.

 

Andrea Allievi & Hotpatch Team.

 

Getting to Know ARM64EC: #Defines and Intrinsic Functions

18 November 2021 at 08:55

Earlier this year, we announced ARM64EC, a new ABI that will make it easier than ever to build native apps for Windows on ARM.  With the Windows 11 SDK and Visual Studio Preview, you can start using the preview of ARM64EC tools to add ARM64EC to your own apps or build new ARM64EC projects.  For developers looking to dive in and get started, we'll be sharing more details and things to know in this and upcoming blogs. 

 

Today, we'll be diving into one key detail of the environment to know: when compiling ARM64EC, the _M_AMD64 preprocessor macro is defined and _M_ARM64 is not.  There is also a new preprocessor macro, _M_ARM64EC, that is set only when building ARM64EC. 

 

Preprocessor macros defined for each target by MSVC:

  x64:      _M_X64, _M_AMD64
  ARM64EC:  _M_X64, _M_AMD64, _M_ARM64EC
  ARM64:    _M_ARM64

 

If you include windows.h in your project, you’ll also see that _AMD64_ and _ARM64EC_ are both defined when building ARM64EC code. 

 

This combination may seem counterintuitive at first, but it's key to the fundamental promise of ARM64EC being interoperable with x64 code even within the same binary.  Windows 11 takes care of seamlessly transitioning between code running natively in the CPU and under emulation. To do so, it makes sure that data flows transparently between ARM64EC and x64 including data pointers and function pointers (i.e. callbacks).  For this to work, datatype definitions must be the same when compiling ARM64EC code as when compiling x64. 

 

The defined preprocessor macros for ARM64EC mean that your project compiling as ARM64EC will use definitions from x64, not ones from ARM64. This ensures that datatype definitions are the same when compiling for x64 and ARM64EC and that passing parameters, either by value or by reference, will not generate a mismatch. 

 

Another common use of #define statements in code is platform specific instructions, usually exposed to C/C++ code in the form of intrinsic functions.  Intrinsic functions are functions internally defined by the compiler, which allow C/C++ code to tap into architecture-specific instructions and get the best possible performance without the need for direct use of assembly.   Knowing that ARM64EC projects will follow x64 codepaths, you may ask -- what about any intrinsic functions? 

 

When compiling ARM64EC, x64 intrinsic functions are supported and will be translated to ARM64EC code automatically.  As a result, taking an x64 project and building for ARM64EC, even one that uses intrinsic functions for performance, can easily yield an ARM64EC app with good performance without source changes. 

 

You also have the option to further optimize the processor-specific code in your project by using ARM64 intrinsic functions in your ARM64EC project.  The _M_ARM64EC preprocessor macro allows you to differentiate ARM64EC from x64 and take ARM-specific code paths rather than x64.  For example, if you have code that already handles choosing the best intrinsic functions for x64 and ARM64, you can key off _M_ARM64EC or _M_ARM64 to use the ARM intrinsic functions, as below: 

 

Before:

#include <intrin.h>
void func() {
#if defined(_M_AMD64)
    __m128i vec;
    vec = _mm_setzero_si128();
#elif defined(_M_ARM64)
    __n128 vec;
    vec = vdupq_n_u32(0);
#endif
}

After:

#include <intrin.h>
void func() {
#if defined(_M_AMD64) && !defined(_M_ARM64EC)
    __m128i vec;
    vec = _mm_setzero_si128();
#elif defined(_M_ARM64) || defined(_M_ARM64EC)
    __n128 vec;
    vec = vdupq_n_u32(0);
#endif
}

 

The architecture #defines set by the compiler when building ARM64EC may be somewhat surprising at first but make more sense when considering that ARM64EC and x64 are interoperable.  These settings, and the automatic translation of intrinsics, enable code to be ported to ARM64EC with the least amount of effort, while still enabling ARM64EC specific fine-tuning and optimization. 

 

Marc Sweetgall, Pedro Justo

 

Developer Guidance for Hardware-enforced Stack Protection

By: Jin_Lin
12 December 2022 at 19:08

In March 2020, we shared some preliminary information about a new security feature in Windows called Hardware-enforced Stack Protection based on Intel’s Control-flow Enforcement Technology (CET). Today, we are excited to share the next level of details with our developer community around protecting user-mode applications with this feature. Please see requirements section for hardware and OS requirements to take advantage of Hardware-enforced Stack Protection.

 

Starting with the 11C latest cumulative update for the 20H1 (19041) and 20H2 (19042) versions of Windows 10, we’ve enabled user-mode Hardware-enforced Stack Protection on supported hardware. This exploit mitigation protects the return address and works with other Windows mitigations to prevent exploit techniques that aim to achieve arbitrary code execution. When attackers find a vulnerability that allows them to overwrite values on the stack, a common exploit technique is to overwrite return addresses with attacker-defined locations to build a malicious payload. This technique is known as return-oriented programming (ROP). More details on ROP and hardware shadow stacks are in this kernel blog.

 

For user mode applications, this mitigation is opt-in, and the following details are intended to aid developers in understanding how to build protected applications. We will describe in detail the two policies in Hardware-enforced Stack Protection: 1) shadow stack 2) instruction pointer validation. Shadow stack hardens the return address and instruction pointer validation protects exception handling targets.

 

Shadow Stack

Shadow stack is a hardware-enforced, read-only memory region that helps keep a record of the intended control flow of the program. On supported hardware, call instructions push the return address onto both stacks, and return instructions compare the two values, issuing a CPU exception if there is a return address mismatch. Because of these required hardware capabilities, only newer processors have this feature.

 

To enable shadow stack enforcement on an application, you only need to recompile the application with the /CETCOMPAT linker flag (available in Visual Studio 2019 16.7 Preview 4).

 

[Figure: /CETCOMPAT linker property page in Visual Studio]

 

Generally, code changes are not needed, and the only modification to the binary is a bit in the PE header. However, if your code modifies return addresses on the stack (which results in a mismatch with the shadow stack), then that hijacking code must be removed.

 

Applications can also choose to dynamically enable shadow stack enforcement by using the PROC_THREAD_ATTRIBUTE_MITIGATION_POLICY attribute in CreateProcess. This allows a program that ships multiple executables with the same name to enable enforcement for specific processes.

 

Shadow stack enforcement is in compatibility mode by default. Compatibility mode provides a more flexible enforcement of shadow stacks, at module granularity. When a return address mismatch occurs in this mode, the faulting address is checked to see whether 1) it is not in an image binary (i.e., it is dynamic code) or 2) it is in a module that was not compiled with /CETCOMPAT. If either holds true, execution is allowed to continue. This way, you can gradually increase the coverage of the mitigation by compiling more modules with /CETCOMPAT at your own pace. To protect dynamic code in compatibility mode, there is a new API, SetProcessDynamicEnforcedCetCompatibleRanges, that allows you to specify a range of virtual addresses on which to enforce this mitigation. Note that, for security purposes, by default this API can only be called from outside the target process.

 

Note that all native 64-bit Windows DLLs are compiled with /CETCOMPAT.

 

Strict mode, by definition, strictly enforces shadow stack protections and will terminate the process if the intended return address is also not on the shadow stack.
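The two modes' handling of a return address mismatch can be summarized as decision logic. The C sketch below is a hedged model, not the kernel's actual code: the module_info_t record and the on_mismatch name are invented, and it only captures the policy described above (strict mode always terminates; compatibility mode forgives dynamic code and non-/CETCOMPAT modules).

```c
#include <stdbool.h>

/* Toy decision logic for a shadow-stack mismatch (#CP exception). */
/* The module record and mode flag are invented for illustration.  */

typedef struct {
    bool in_image;   /* address belongs to an image binary         */
    bool cetcompat;  /* that module was compiled with /CETCOMPAT   */
} module_info_t;

typedef enum { ALLOW, TERMINATE } verdict_t;

verdict_t on_mismatch(bool strict_mode, const module_info_t *m) {
    if (strict_mode)
        return TERMINATE;      /* strict: no forgiveness            */
    if (!m->in_image)
        return ALLOW;          /* dynamic code: forgiven            */
    if (!m->cetcompat)
        return ALLOW;          /* module not /CETCOMPAT yet         */
    return TERMINATE;          /* compiled compatible: fatal        */
}
```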

 

Today, we recommend enabling your application in compatibility mode, as third-party DLLs may be injected into your process and subsequently perform return address hijacking. We are working with our ecosystem developers to clean up this behavior.

 

The following diagram illustrates how the system behaves under shadow stack. When a return address mismatch occurs, the CPU raises a #CP exception:

 

[Figure: Shadow stack compatibility mode behavior]

 

[Figure: Shadow stack strict mode behavior]

 

As you can see, return address mismatches cause a trap to the kernel, which comes with a performance hit even if the mismatch is forgiven and execution is allowed to continue.

 

Instruction Pointer Validation

With the presence of shadow stacks, one of the next exploit techniques attackers may use to hijack control flow is corrupting the instruction pointer value inside the CONTEXT structure passed into system calls that redirect the execution of a thread, such as NtContinue and SetThreadContext. To provide a comprehensive control-flow integrity mitigation, Hardware-enforced Stack Protection includes an additional mitigation to validate the instruction pointer during exception handling. It is important to keep this mitigation in mind as well when testing for compatibility.

 

When shadow stacks are enabled for an application, SetThreadContext is enlightened to validate the user-supplied instruction pointer. Calls are allowed to proceed only if the value is found on the shadow stack (otherwise the call will fail).

 

For structured exception handling, RtlRestoreContext/NtContinue is hardened by a parallel mitigation, EH Continuation Metadata (EHCONT), by using the /guard:ehcont flag.

 

[Figure: /guard:ehcont compiler property page in Visual Studio]

 

When this flag is specified, the compiler will include metadata in the binary that has a table of valid exception handling continuation targets. The list of continuation targets is generated by the linker for compiled code. For dynamic code, continuation targets should be specified using SetProcessDynamicEHContinuationTargets (similarly can only be called from outside the target process by default). With this feature enabled, the user-supplied instruction pointer will be checked to see if it is 1) on the shadow stack or 2) in the EH continuation data, before allowing the call to proceed (otherwise the call will fail). Note that if the binary does not contain EHCONT data (legacy binary), then the call is allowed to proceed.
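The validation rules above reduce to a small check. The C sketch below is a hedged model of that logic only, with invented names and flat arrays standing in for the shadow stack and the EHCONT target table: the instruction pointer is allowed if it is a recorded return address, a declared continuation target, or if the binary is legacy and carries no EHCONT data.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of instruction-pointer validation during exception    */
/* handling. Tables and names are invented for illustration.       */

static bool contains(const uint64_t *t, int n, uint64_t v) {
    for (int i = 0; i < n; i++)
        if (t[i] == v) return true;
    return false;
}

/* Allowed if the IP is a recorded return address (shadow stack),  */
/* a declared EH continuation target, or the image is a legacy     */
/* binary with no EHCONT data at all.                              */
bool validate_ip(uint64_t ip,
                 const uint64_t *shadow, int n_shadow,
                 const uint64_t *ehcont, int n_ehcont,
                 bool has_ehcont_data) {
    if (contains(shadow, n_shadow, ip)) return true;
    if (!has_ehcont_data) return true;    /* legacy binary          */
    return contains(ehcont, n_ehcont, ip);
}
```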

 

Additionally, an application can be compiled for EHCONT even without shadow stack protection, in which the user-supplied instruction pointer must be present in the EH continuation data.

 

Common Violations

To properly build your application for Hardware-enforced Stack Protection, make sure you have a good understanding of how these security mitigations are enforced. Since shadow stacks are present throughout the lifetime of the process, the test matrix consists of enabling the above mitigations and ensuring that all code paths honor the security guarantees. The goal is to ensure that existing application code does not perform any behavior deemed insecure under these mitigations.

 

Here are some examples of behaviors that violate shadow stacks:

 

Certain code obfuscation techniques will not automatically work with shadow stacks. As mentioned above, CALL and RET are enlightened to push onto the shadow stack and perform return address comparisons. Instruction combinations like PUSH/RET will not work with shadow stacks, because the corresponding return address is not present on the shadow stack when the RET executes. One recommendation is to use a (Control Flow Guard-protected) JMP instruction instead.

 

Additionally, techniques that manually return to a previous call frame that is not the preceding call frame will also need to be shadow stack aware. In this case, it is recommended to use the _incsspq intrinsic to pop return addresses off the shadow stack so that it is in sync with the call stack.
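The need for the _incsspq adjustment can be seen in a toy model. The sketch below is a simulation with invented names, not real CET: calls push to both a call stack and a shadow stack, returns compare the two, a manual unwind pops frames from the call stack only, and an incssp-style pop is needed to keep the shadow stack in sync before the next return.

```c
#include <stdint.h>

/* Toy shadow stack: calls push the return address on both stacks; */
/* returns must match. A manual unwind that skips frames needs an  */
/* _incsspq-style pop to keep the shadow stack in sync.            */

static uint64_t stack[16], shadow[16];
static int sp = 0, ssp = 0;

void do_call(uint64_t ret_addr) {
    stack[sp++]   = ret_addr;
    shadow[ssp++] = ret_addr;
}

/* Returns 1 on a clean return, 0 on a #CP-style mismatch.         */
int do_ret(void) {
    uint64_t a = stack[--sp];
    uint64_t b = shadow[--ssp];
    return a == b;
}

/* Skip 'n' frames on the call stack only (a manual unwind).       */
void unwind_frames(int n) { sp -= n; }

/* The _incsspq analogue: pop 'n' shadow stack entries too.        */
void incssp(int n) { ssp -= n; }
```

Unwinding two frames and calling incssp(2) lets the next return succeed; unwinding without the shadow-stack pop produces a mismatch.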

 

User Interfaces

There are some user interfaces to help you understand the state of enforcement of processes on the machine. In Task Manager, adding the “Hardware-enforced Stack Protection” column in the “Details” tab will indicate whether processes are shadow stack protected, and whether they are in compatibility mode (compatible modules only) or strict mode (all modules).

 

[Figure: Task Manager Details tab with the Hardware-enforced Stack Protection column]

 

Additionally, this mitigation can be controlled similar to other exploit protections, including toggling the enforcement using the Windows Defender UI, Group Policy, PowerShell, and other facilities. Use UserShadowStack and UserShadowStackStrictMode as the parameter keyword to manually toggle enforcement in compatibility and strict mode, respectively. Use AuditUserShadowStack to enable audit mode.

 

[Figure: Windows Security (Defender) exploit protection settings]

 

Requirements

You can begin building and testing your application to support Hardware-enforced Stack Protection today, by ensuring you have the following:

 

Hardware: 11th Gen Intel Core Mobile processors and AMD Zen 3 Core (and newer)

Hardware-enforced Stack Protection capable OS: 19041.622 or 19042.622 and newer versions

 

Conclusion

We will continue to strive towards investing in exploit mitigations to make Windows 10 the most secure operating system. These mitigations will help proactively prohibit an attacker’s ability to hijack your program in the event a vulnerability is discovered. Note in the current release, this mitigation is only supported in 64-bit code. There is no support for 32-bit code, WoW64, or in Guest Virtual Machines at the moment.

 

In the latest canary builds of Edge (version 90), Hardware-enforced Stack Protection is enabled in compatibility mode on the browser and a few non-sandboxed processes. In upcoming releases, there will be continued investments in expanding the list of processes protected by this mitigation. Please try it out and provide your feedback. You can send related questions to [email protected].

 

 

Kernel Protection team - Jin Lin, Jason Lin, Matthew Woolman

 

 

 

Introducing Kernel Data Protection, a new security technology for preventing data corruption

12 December 2022 at 19:08

Kernel Data Protection (KDP) is a new technology that prevents data corruption attacks by protecting parts of the Windows kernel and drivers through virtualization-based security (VBS). KDP is a set of APIs that provide the ability to mark some kernel memory as read-only, preventing attackers from ever modifying protected memory. 

 

KDP uses technologies that are supported by default on Secured-core PCs, which implement a specific set of device requirements that apply the security best practices of isolation and minimal trust to the technologies that underpin the Windows operating system. KDP enhances the security provided by the features that make up Secured-core PCs by adding another layer of protection for sensitive system configuration data.  

 

KDP is implemented in two parts: 

  • Static KDP enables software running in kernel mode to statically protect a section of its own image from being tampered with from any other entity in VTL0. 
  • Dynamic KDP helps kernel-mode software to allocate and release read-only memory from a “secure pool”. The memory returned from the pool can be initialized only once. 

The concept of protecting kernel memory as read-only has valuable applications for the Windows kernel, inbox components, security products, and even third-party drivers like anti-cheat and digital rights management (DRM) software. 
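As a hedged, kernel-mode C sketch of the static KDP flow (the section name is arbitrary and the MmProtectDriverSection parameters follow the usage shown in the linked deep-dive post; this only builds with the WDK and is not a definitive implementation):

```c
#include <ntddk.h>

/* Place the data to protect in its own image section (name is arbitrary). */
#pragma section("PROTDATA", read, write)
__declspec(allocate("PROTDATA")) static ULONG g_SecurityPolicy;

NTSTATUS InitProtectedPolicy(void)
{
    g_SecurityPolicy = 1;  /* initialize once, before sealing */

    /* Ask the secure kernel to make the containing section read-only from
       VTL0; subsequent writes from normal kernel mode are then blocked. */
    return MmProtectDriverSection(&g_SecurityPolicy, 0, 0);
}
```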

 

Learn more about Kernel Data Protection, how it is implemented on Windows 10, and more applications in this blog: Introducing Kernel Data Protection, a new platform security technology for preventing data corruption. 

 

Enjoy!

Memory management & security core team (Andrea Allievi, Matthew Woolman, Jon Lange, Eugene Bak, Mehmet Iyigun)

 

Securely donate CPU time with Windows Sandbox

12 December 2022 at 19:08

With Windows Sandbox, you can run any win32 desktop application you wish with a pristine configuration every time you start it. It allows you to do virtually whatever you want within a secure isolated desktop environment without requiring any cleanup after the fact.

 

For example, Windows Sandbox allows you to contribute time on your Windows 10 PC towards fighting COVID-19. Here is how it works: using Windows Sandbox you can run the open-source Folding@Home app to help simulate protein dynamics. Folding@Home is one of the most popular distributed computing projects bringing together citizen scientists who volunteer to run simulations of protein dynamics on their personal computers to fight COVID-19 and other diseases. For more information about the project itself, please visit the Folding@Home Knowledge Base.

 

[Figure: Folding@Home running in Windows Sandbox]

 

To do this we have provided a simple PowerShell script that automatically downloads the latest Folding@Home client and launches it in Windows Sandbox. If Windows Sandbox is not enabled on your system, the script will enable the feature and reboot your system. After the reboot, just launch the script again and it will start Windows Sandbox to run the Folding@Home client. The PowerShell script can be downloaded from our GitHub repository here.

 

[Figure: PowerShell script launching Folding@Home in Windows Sandbox]

 

How to Get Involved  

 

We have also created a GitHub open-source repository to store this script and allow you to submit your own ideas for running applications in Windows Sandbox.

 

Have a suggestion for Windows Sandbox or encountering issues? We welcome your feedback, which can be submitted through Feedback Hub here.

 

Cheers,

Brandon Smith, Margarit Chenchev, Paul Bozzay, Hari Pulapaka, Judy Liu & Erick Smith

Understanding Hardware-enforced Stack Protection

12 December 2022 at 19:08

We aim to make Windows 10 one of the most secure operating systems for our customers and to do that we are investing in a multitude of security features. Our philosophy is to build features that mitigate broad classes of vulnerabilities, ideally without having the app change its code. In other words, getting an updated version of Windows 10 should make the customer and the app safer to use. This comprehensive MSDN document shows all of the security focused technologies we have built into Windows over the years and how it keeps our customers safe. Here is another presentation by Matt Miller and David Weston that goes deeper into our security philosophy for further reading.

 

We are now exploring security features with deep hardware integration to further raise the bar against attacks. By integrating Windows and its kernel deeply with hardware, we make it difficult and expensive for attackers to mount large scale attacks.

 

ROP (Return Oriented Programming) based control flow attacks have become a common form of attack based on our own and the external research community’s investigations (Evolution of CFI attacks, Joe Bialek). Hence, they are the next logical point of focus for proactive, built-in Windows security mitigation technologies. In this post, we will describe our efforts to harden control flow integrity in Windows 10 through Hardware-enforced stack protection.

 

Memory safety vulnerabilities

 

The most common class of vulnerability found in systems software is memory safety vulnerabilities. This class includes buffer overruns, dangling pointers, uninitialized variables, and others.

 

A canonical example of a stack buffer overrun is copying data from one buffer to another without bounds checking (i.e. strcpy). If an attacker controls the data and size of the source buffer, the destination buffer and other important components of the stack (i.e. return addresses) can be corrupted to point to attacker-desired code.

 

[Figure: Stack buffer overrun corrupting the return address]

 

Dangling pointers occur when memory referenced by a pointer is de-allocated but a pointer to that memory still exists. In use-after-free exploits, the attacker can read/write through the dangling pointer that now points to memory the programmer did not intend to.

 

Uninitialized variables exist in some languages where variables can be declared without value, memory in this case is initialized with junk data. If an attacker can read or write to these contents, this will also lead to unintended program behavior.

 

These are popular techniques attackers can utilize to gain control and run arbitrary native code on target machines.

 

Arbitrary Code Execution

 

We frame our strategy for mitigating arbitrary code execution in the form of four pillars:

 

[Figure: The four pillars of our arbitrary code execution mitigation strategy]

 

Code Integrity Guard (CIG) prevents arbitrary code generation by enforcing signature requirements for loading binaries.

 

Arbitrary Code Guard (ACG) ensures signed pages are immutable and dynamic code cannot be generated, thus guaranteeing the integrity of binaries loaded.

 

With the introduction of CIG/ACG, attackers increasingly resort to control flow hijacking via indirect calls and returns, known as call/jump oriented programming (COP/JOP) and return oriented programming (ROP).

 

We shipped Control Flow Guard (CFG) in Windows 10 to enforce integrity on indirect calls (forward-edge CFI). Hardware-enforced Stack Protection will enforce integrity on return addresses on the stack (backward-edge CFI), via Shadow Stacks.

 

The ROP problem

 

In systems software, if an attacker finds a memory safety vulnerability in code, the return address can be hijacked to target an attacker defined address. It is difficult from here to directly execute a malicious payload in Windows thanks to existing mitigations including Data Execution Prevention (DEP) and Address Space Layout Randomization (ASLR), but control can be transferred to snippets of code (gadgets) in executable memory. Attackers can find gadgets that end with the RET instruction (or other branches), and chain multiple gadgets to perform a malicious action (turn off a mitigation), with the end goal of running arbitrary native code.

 

[Figure: ROP attack chaining gadgets that end in RET instructions]

 

Hardware-enforced stack protection in Windows 10 

 

Keep in mind, Hardware-enforced stack protection will only work on chipsets with support for hardware shadow stacks, Intel’s Control-flow Enforcement Technology (CET) or AMD shadow stacks. Here is an Intel whitepaper with more information on CET.

 

In this post, we will describe only the relevant parts of the Windows 10 implementation. This technology provides parity with program call stacks, by keeping a record of all the return addresses via a Shadow Stack. On every CALL instruction, return addresses are pushed onto both the call stack and shadow stack, and on RET instructions, a comparison is made to ensure integrity is not compromised.

 

If the addresses do not match, the processor issues a control protection (#CP) exception. This traps into the kernel and we terminate the process to guarantee security.

 

[Figure: Call stack and shadow stack, with return addresses compared on RET]

 

Shadow stacks store only return addresses, which helps minimize the additional memory overhead.

 

Control-flow Enforcement Technology (CET) Shadow Stacks

 

Shadow stack compliant hardware provides extensions to the architecture by adding instructions to manage shadow stacks and hardware protection of shadow stack pages.

 

Hardware will have a new register SSP, which holds the Shadow Stack Pointer address. The hardware will also have page table extensions to identify shadow stack pages and protect those pages against attacks.

 

New instructions are added for management of shadow stack pages, including:

  • INCSSP – increment SSP (i.e. to unwind shadow stack)
  • RDSSP – read SSP into general purpose register
  • SAVEPREVSSP/RSTORSSP – save/restore shadow stack (i.e. thread switching)

The full hardware implementation is documented in Intel’s CET manual.

 

Compiling for Hardware-enforced Stack Protection

 

In order to receive Hardware-enforced stack protection on your application, there is a new linker flag which sets a bit in the PE header to request protection from the kernel for the executable.
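At the time of this post the flag was only available in preview toolchains; in released MSVC toolchains the switch is /CETCOMPAT. A sketch of a command-line build (assuming the MSVC toolchain; app.c is a placeholder):

```shell
cl /c app.c
link app.obj /CETCOMPAT /OUT:app.exe
```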

 

If the application sets this bit and is running on a supported Windows build and shadow stack-compliant hardware, the Kernel will maintain shadow stacks throughout the runtime of the program. If your Windows version or the hardware does not support shadow stacks, then the PE bit is ignored.

 

By making this an opt-in feature of Windows, we are allowing developers to first validate and test their app with hardware-enforced stack protection, before releasing their app.  

 

Hardware-enforced Stack Protection feature is under development and an early preview is available in Windows 10 Insider previews builds (fast ring). If you have Intel CET capable hardware, you can enable the above linker flag on your application to test with the latest Windows 10 insider builds.

 

Conclusion

 

Hardware-enforced Stack Protection offers robust protection against ROP exploits since it maintains a record of the intended execution flow of a program. To ensure smooth ecosystem adoption and application compatibility, Windows will offer this protection as an opt-in model, so developers can adopt this protection at their own pace.

 

We will provide ongoing guidance on how to re-build your application to be shadow stack compliant. In our next post, we will dig deeper into best practices, as well as provide technical documentation. This protection will be a major step forward in our continuous efforts to make Windows 10 one of the most secure operating systems for our customers.

 

Kernel protection team - Jin Lin, Jason Lin, Niraj Majmudar and Greg Colombo

 

DTrace on Windows – 20H1 updates

12 December 2022 at 19:08

We first released DTrace on Windows as a preview with the Windows 10 May 2019 Update. The feedback and reaction from our community was very gratifying. Thank you for taking the time to use DTrace on Windows and providing us with valuable feedback.

 

We have been quiet since the initial preview release, and today we are ready to talk about the updates we have made to DTrace on Windows. All of these changes are available in the latest Windows 10 Insider Preview (20H1) build, starting with 19041.21.

 

With these changes, we are now positioned to have customers broadly use DTrace on Windows.

 

Key resources

  1. DTrace on Windows developer docs
  2. GitHub for source code and sample scripts
  3. DTrace MSI

 

Removed kernel debugger requirement

This was the biggest hindrance to using DTrace on Windows internally and externally. We knew going in that we needed to solve this, but we also knew that it would take time to solve it correctly. In 20H1, we have now removed the kernel debugger requirement. The Windows kernel now relies on Virtualization-based Security (VBS) to securely insert dynamic trace points into kernel code. By relying on VBS, we can now safely and securely insert dynamic tracepoints in the kernel without disabling PatchGuard (enabling the kernel debugger disables PatchGuard).

 

Note: Because we made the change to rely on VBS for DTrace on Windows, the installer from 19H1 will only work on 19H1. For Windows 10 Insider Preview (post 19H1) builds, please use the updated installer linked in this post. This installer will NOT install on previous Windows 10 releases.

 

Let's get into how to set up and use DTrace on Windows.

 

Prerequisites for using the feature:

 

  • Windows 10 insider build 19041.21 or higher

Detailed instructions to install DTrace are available in our documentation. At a high level, these are:

 

  1. Enable boot option to turn on DTrace
  2. Download and install the DTrace MSI.
  3. Ensure VBS is turned on  
  4. Optional: Update the PATH environment variable to include C:\Program Files\DTrace
    • set PATH=%PATH%;"C:\Program Files\DTrace"
  5. Setup symbol path
    • Create a new directory for caching symbols locally. Example: mkdir c:\symbols
    • Set _NT_SYMBOL_PATH=srv*C:\symbols*https://msdl.microsoft.com/download/symbols
    • DTrace automatically downloads the symbols necessary from the symbol server and caches to the local path.
  6. Reboot machine

To check whether VBS is enabled, look at the System Summary tab in the Microsoft System Information tool (msinfo32.exe).

 

[Figure: msinfo32 showing Virtualization-based security status]

ARM64 preview

Yes, that’s right! DTrace now supports ARM64 in preview mode. The ARM64 MSI is available in the download link listed above.

 

You can use it on your Surface Pro X running the latest Windows 10 Insider Preview (20H1) build, starting with 19041.21.

 

[Figure: DTrace running on Surface Pro X]

User mode Stackwalk

In the 19H1 preview, the stackwalk facility in DTrace was limited to kernel mode (stack). This update adds support for a user-mode stackwalk facility (ustack). Like stack, the ustack facility is fully compatible with the open-source DTrace specification. It can be invoked in three ways: with frames (depth) and size (ignored for now), with frames only, or with no arguments.

 

  • ustack(nframes, size)
  • ustack(nframes)
  • ustack()

While ustack() can determine the address of the calling frame when the probe fires, the stack frames will not be translated into symbols until the ustack() action is processed in user mode by the DTrace consumer. Symbol download can slow down the output, so it’s better to use this facility with locally cached symbols, as below.

 

 

dtrace -n "profile-1ms /arg1/ {ustack(50, 0); exit(0);} " -y C:\symbols
dtrace: description 'profile-1ms ' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0   3802                    :profile-1ms
              ntdll`ZwAllocateVirtualMemory+0x14
              ntdll`RtlAllocateHeap+0x3ded
              ntdll`RtlAllocateHeap+0x763
              ucrtbase`malloc_base+0x44

 

 

Live dump support

Windows commonly uses live dumps to help quickly diagnose issues. Live dumps help with troubleshooting issues involving multiple processes, or system-wide issues, without downtime. In 20H1, DTrace on Windows can be used to capture a live dump from inside a D script using the lkd() DTrace facility. A common use case of this facility is to instrument an error path (e.g., a return code indicating failure) and capture a live dump right at the failure point for advanced diagnostics. For more information on live dump support, see DTrace Live Dump

 

 

dtrace -wn "syscall:::return { if (arg0 != 0xc0000001UL) { lkd(0); printf(\" Triggering Live dump \n \");exit(0); }}"
dtrace: description 'syscall:::return ' matched 1411 probes
dtrace: allowing destructive actions
CPU     ID                    FUNCTION:NAME
  0    181  NtDeviceIoControlFile:return   Triggering Live dump

dir c:\Windows\LiveKernelReports
 Volume in drive C has no label.
 Volume Serial Number is 70F4-B9F6

 Directory of c:\Windows\LiveKernelReports

11/05/2019  05:20 PM    <DIR>          .
11/05/2019  05:20 PM    <DIR>          ..
11/05/2019  05:19 PM    <DIR>          DTRACE
11/05/2019  05:20 PM        53,395,456 DTRACE-20191105-1720.dmp

 

 

ETW Tracing

ETW tracing is the most frequently used tool for debugging on Windows. In the DTrace on Windows 19H1 preview, we added support for instrumenting TraceLogging and manifest-based events using the ETW provider.

 

In 20H1, we further enhanced this facility to create new ETW events on the fly from inside a D-script using the ETW_Trace() facility. This helps in situations where existing ETW events are insufficient and you would like to add additional ETW trace points without modifying production code.

 

For more information about ETW_Trace facility and ETW provider, see DTrace ETW

 

 

/* Running the GitHub ETW provider sample (link below) to print the node memory info event.
   https://github.com/microsoft/DTrace-on-Windows/blob/master/samples/windows/etw/numamemstats.d */

dtrace -qs numamemstats.d

Partition ID: 0
Count: 1
Node number: 1
m_nodeinfo {
    uint64_t TotalPageCount = 0x1fb558
    uint64_t SmallFreePageCount = 0x41
    uint64_t SmallZeroPageCount = 0
    uint64_t MediumFreePageCount = 0
    uint64_t MediumZeroPageCount = 0
    uint64_t LargeFreePageCount = 0
    uint64_t LargeZeroPageCount = 0
    uint64_t HugeFreePageCount = 0
    uint64_t HugeZeroPageCount = 0
}

 

 

 

This concludes a tour of some of our key updates to DTrace on Windows for 20H1.

 

You can get started by downloading & installing the DTrace MSI package on the latest 20H1 client/server insider build - 19041.21+.

 

You can also visit our GitHub page for contributing code and samples. We have several advanced scripts in GitHub to help users learn and use DTrace on Windows.

 

How to file feedback?

As always, we rely on feedback from our users to help improve the product. If you hit any problems or bugs, please use Feedback hub to let us know:

 

  1. Launch feedback hub by clicking this link
  2. Select Add new feedback.
  3. Please provide a detailed description of the issue.
  4. Currently, we do not automatically collect any debug traces, so your verbatim feedback is crucial for understanding and reproducing the issue. Pass on any verbose logs.
  5. You can also set DTRACE_DEBUG environment variable to 1 to collect verbose DTrace logs.
  6. Submit

 

We are excited to rollout these changes and look forward to working with the community to continue improving DTrace experience.

 

DTrace team (Andrey Shedel, Gopikrishna Kannan, Max Renke, Hari Pulapaka)

DTrace on Windows

12 December 2022 at 19:08

Here at Microsoft, we are always looking to engage with open source communities to produce better solutions for the community and our customers. One of the more useful debugging advances that have arrived in the last decade is DTrace. DTrace of course needs no introduction: it’s a dynamic tracing framework that allows an admin or developer to get a real-time look into a system either in user or kernel mode. DTrace has a C-style, high-level and powerful programming language that allows you to dynamically insert trace points. Using these dynamically inserted trace points, you can filter on conditions or errors, write code to analyze lock patterns, detect deadlocks, etc. ETW, while powerful, is static and does not provide the ability to programmatically insert trace points at runtime.

 

There are a lot of websites and resources from the community to learn about DTrace. One of the most comprehensive is the Dynamic Tracing Guide HTML book available on the dtrace.org website. This ebook describes DTrace in detail and is the authoritative guide for DTrace. We also have Windows-specific examples below which will provide more info.

 

Starting in 2016, the OpenDTrace effort began on GitHub, aiming to provide a portable implementation of DTrace across different operating systems. We decided to add support for DTrace on Windows using this OpenDTrace port.

 

We have created a Windows branch for “DTrace on Windows” under the OpenDTrace project on GitHub. All our changes made to support DTrace on Windows are available here. Over the next few months, we plan to work with the OpenDTrace community to merge our changes. All our source code is also available at the 3rd party sources website maintained by Microsoft.   

 

Without further ado, let’s get into how to setup and use DTrace on Windows.

 

Install and Run DTrace

Prerequisites for using the feature

  • Windows 10 insider build 18342 or higher
  • Only available on x64 Windows and captures tracing info only for 64-bit processes
  • Windows Insider Program is enabled and configured with valid Windows Insider Account
    • Visit Settings->Update & Security->Windows Insider Program for details

Instructions:

  1. BCD configuration set:
    1. bcdedit /set dtrace on
    2. Note, you need to set the bcdedit option again, if you upgrade to a new Insider build
  2. Download and install the DTrace package from download center.
    1. This installs the user mode components, drivers and additional feature on demand packages necessary for DTrace to be functional.
  3. Optional: Update the PATH environment variable to include C:\Program Files\DTrace
    1. set PATH=%PATH%;"C:\Program Files\DTrace"
  4. Setup symbol path
    1. Create a new directory for caching symbols locally. Example: mkdir c:\symbols
    2. Set _NT_SYMBOL_PATH=srv*C:\symbols*https://msdl.microsoft.com/download/symbols
    3. DTrace automatically downloads the symbols necessary from the symbol server and caches to the local path.
  5. Optional: Setup Kernel debugger connection to the target machine (MSDN link). This is only required if you want to trace Kernel events using FBT or other providers.
    1. Note that you will need to disable Secure Boot and BitLocker on C: (if enabled) if you want to set up a kernel debugger.
  6. Reboot target machine

 

Running DTrace

Launch CMD prompt in administrator mode

 

Get started with sample one-liners:

 

# Syscall summary by program for 5 seconds: 
dtrace -Fn "tick-5sec { exit(0);} syscall:::entry{ @num[pid,execname] = count();} "
 
# Summarize timer set/cancel program for 3 seconds: 
dtrace -Fn "tick-3sec { exit(0);} syscall::Nt*Timer*:entry { @[probefunc, execname, pid] = count();}"
 
# Dump System Process kernel structure: (requires symbol path to be set)
dtrace -n "BEGIN{print(*(struct nt`_EPROCESS *) nt`PsInitialSystemProcess);exit(0);}"
 
# Tracing paths through NTFS when running notepad.exe (requires KD attach): Run below command and launch notepad.exe
dtrace -Fn "fbt:ntfs::/execname==\"notepad.exe\"/{}"

 

The command dtrace -lvn syscall::: will list all the probes and their parameters available from the syscall provider.

 

The following are some of the providers available on Windows and what they instrument.

  • syscall – NTOS system calls
  • fbt (Function Boundary Tracing) – Kernel function entry and returns
  • pid – User-mode process tracing. Like kernel-mode FBT, but also allowing the instrumentation of arbitrary function offsets.
  • etw (Event Tracing for Windows) – Allows probes to be defined for ETW events. This provider helps to leverage existing operating system instrumentation in DTrace.
    • This is one addition we have done to DTrace to allow it to expose and gain all the information that Windows already provides in ETW.

We have more Windows sample scripts applicable for Windows scenarios in the samples directory of the source.

 

How to file feedback?

DTrace on Windows is very different from our typical features on Windows and we are going to rely on our Insider community to guide us. If you hit any problems or bugs, please use Feedback hub to let us know.

 

  1. Launch feedback hub by clicking this link
  2. Select Add new feedback.
  3. Please provide a detailed description of the issue or suggestion.
    1. Currently, we do not automatically collect any debug traces, so your verbatim feedback is crucial for understanding and reproducing the issue. Pass on any verbose logs.
    2. You can set DTRACE_DEBUG environment variable to 1 to collect verbose dtrace logs.
  4. Submit

 

DTrace Architecture

Let’s talk a little about the internals and architecture of how we supported DTrace. As mentioned, DTrace on Windows is a port of OpenDTrace and reuses much of its user-mode components and architecture. Users interact with DTrace through the dtrace command, which is a generic front-end to the DTrace engine. D scripts get compiled to an intermediate format (DIF) in user space and sent to the DTrace kernel component for execution, sometimes called the DIF Virtual Machine. This runs in the dtrace.sys driver.

 

Traceext.sys (trace extension) is a new kernel extension driver we added, which allows Windows to expose functionality that DTrace relies on to provide tracing. The Windows kernel provides callouts during stackwalk or memory accesses which are then implemented by the trace extension.

 

All APIs and functionality used by dtrace.sys are documented calls.

[Figure: DTrace on Windows architecture]

Security

Security of Windows is key for our customers, and the security model of DTrace makes it ideally suited to Windows. The DTrace guide, linked above, talks about DTrace security and performance impact; it would be useful for anyone interested in this space to read that section. At a high level, DTrace uses an intermediate form which is validated for safety and runs in its own execution environment (think C# or Java). This execution environment also handles any run-time errors to avoid crashing the system. In addition, the cost of having a probe is minimal and should not visibly affect system performance unless you enable too many probes in performance-sensitive paths.

 

DTrace on Windows also leverages the Windows security model in useful ways to enhance its security for our customers.

 

  1. To connect to the DTrace trace engine, your account needs to be part of the admin or LocalSystem group
  2. Events originating from kernel mode (FBT, syscalls with ‘kernel’ previous mode, etc.), are only traceable if Kernel debugger is attached
  3. To read kernel-mode memory (probe parameters for kernel-mode originated events, kernel-mode global variables, etc.), the following must be true:
    1. DTrace session security context has either TCB or LoadDriver privilege enabled.
    2. Secure Boot is not active.
  4. To trace a user-mode process, the user needs to have:
    1. Debug privilege
    2. DEBUG access to the target process.

 

Script signing

In addition, we have also updated DTrace on Windows to support signing of D scripts. We follow the same model as PowerShell to support signing of scripts.

 

There is a system wide DTrace script signing policy knob which controls whether to check for signing or not for DTrace scripts. This policy knob is controlled by the Registry.

 

By default, we do NOT check for signature on DTrace scripts.

 

Use the following registry keys to enforce policy at machine or user level.

  • User Scope: HKCU\Software\OpenDTrace\Dtrace, ExecutionPolicy, REG_SZ
  • Machine Scope: HKLM\Software\OpenDTrace\Dtrace, ExecutionPolicy, REG_SZ

 

Policy Values:

The DTrace policy takes the following values.

 

  • "Bypass": Do not perform signature checks. This is the default policy. Only set the registry key if you want to deviate from this policy.
  • "Unrestricted": Do not perform checks on local files, allow user's consent to use unsigned remote files.
  • "RemoteSigned": Do not perform checks on local files, requires a valid and trusted signature for remote files.
  • "AllSigned": Require valid and trusted signature for all files.
  • "Restricted": Script file must be installed as a system component and have a signature from the trusted source.

You can also set policy by defining the environment variable DTRACE_EXECUTION_POLICY to the required value.
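For example, to enforce the "AllSigned" policy machine-wide from an elevated command prompt (a sketch using the key and value names listed above):

```shell
reg add "HKLM\Software\OpenDTrace\Dtrace" /v ExecutionPolicy /t REG_SZ /d "AllSigned" /f
```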

 

Conclusion

We are very excited to release the first version of DTrace on Windows. We look forward to feedback from the Windows Insider community.

 

Cheers,

DTrace Team (Andrey Shedel, Gopikrishna Kannan, & Hari Pulapaka)

 

Windows Sandbox - Config Files

12 December 2022 at 19:07

Since the initial announcement of Windows Sandbox, we have received overwhelmingly positive feedback. Thank you for your support! We are glad that this feature resonates with the Windows community. 

 

One of the most requested features from our customers is the ability to automatically launch an app or script in the sandbox. Coincidentally, this also aligned with our feature roadmap and is now available in Windows Insider builds. 

 

Windows Sandbox now has support for simple configuration files (.wsb file extension), which provide minimal scripting support. You can use this feature in the latest Windows Insider build 18342.  

 

As always, we rely on your feedback to build features allowing our users to achieve more. 

 

NOTE: Please note that this functionality is still in development and subject to change.  

 

Overview

Sandbox configuration files are formatted as XML, and are associated with Windows Sandbox via the .wsb file extension. A configuration file allows the user to control the following aspects of Windows Sandbox:

 

  1. vGPU (virtualized GPU)
    • Enable or Disable the virtualized GPU. If vGPU is disabled, Sandbox will use WARP (software rasterizer).
  2. Networking
    • Enable or Disable network access to the Sandbox.
  3. Shared folders
    • Share folders from the host with read or write permissions. Note that exposing host directories may allow malicious software to affect your system or steal data.
  4. Startup script
    • Logon action for the sandbox.

 

[Figure: A sample Windows Sandbox configuration file]

 

As demonstrated in the examples below, configuration files can be used to granularly control Windows Sandbox for enhanced isolation.

 

Double click a config file to open it in Windows Sandbox, or invoke it via the command line as shown:

 

C:\Temp> MyConfigFile.wsb

 

Keywords, values and limits

 

VGpu

Enables or disables GPU sharing.

 

<VGpu>value</VGpu> 

 

Supported values:

  • Disable – disables vGPU support in the sandbox. If this value is set Windows Sandbox will use software rendering, which can be slower than virtualized GPU.
  • Default – this is the default value for vGPU support; currently this means vGPU is enabled.

Note: Enabling virtualized GPU can potentially increase the attack surface of the sandbox.

 

Networking

Enables or disables networking in the sandbox. Disabling network access can be used to decrease the attack surface exposed by the Sandbox.

 

<Networking>value</Networking>

 

Supported values:

  • Disable – disables networking in the sandbox.
  • Default – this is the default value for networking support. This enables networking by creating a virtual switch on the host, and connects the sandbox to it via a virtual NIC.

 Note: Enabling networking can expose untrusted applications to your internal network.

 

MappedFolders

Wraps a list of MappedFolder objects.

 

<MappedFolders>
list of MappedFolder objects
</MappedFolders>

 

Note: Files and folders mapped in from the host can be compromised by apps in the Sandbox or potentially affect the host. 

 

MappedFolder 

Specifies a single folder on the host machine which will be shared on the container desktop. Apps in the Sandbox are run under the user account “WDAGUtilityAccount”. Hence, all folders are mapped under the following path: C:\Users\WDAGUtilityAccount\Desktop.

 

E.g. “C:\Test” will be mapped as “C:\users\WDAGUtilityAccount\Desktop\Test”.
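The mapping rule above can be sketched as a small helper. This is illustrative only; the `WDAGUtilityAccount` desktop path comes from the text above, and the function name is invented for this example:

```python
import ntpath

# Sandbox desktop path under which all mapped folders appear (from the text above)
SANDBOX_DESKTOP = r"C:\Users\WDAGUtilityAccount\Desktop"

def sandbox_path(host_folder: str) -> str:
    """Return the path at which a mapped host folder appears inside the sandbox."""
    # Only the final component of the host path is used; the folder is
    # placed directly on the sandbox user's desktop.
    return ntpath.join(SANDBOX_DESKTOP, ntpath.basename(host_folder))

print(sandbox_path(r"C:\Test"))  # C:\Users\WDAGUtilityAccount\Desktop\Test
```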

 

<MappedFolder>
    <HostFolder>path to the host folder</HostFolder>
    <ReadOnly>value</ReadOnly>
</MappedFolder>

 

HostFolder: Specifies the folder on the host machine to share into the sandbox. Note that the folder must already exist on the host; otherwise the container will fail to start.

 

ReadOnly: If true, enforces read-only access to the shared folder from within the container. Supported values: true/false.

 

Note: Files and folders mapped in from the host can be compromised by apps in the Sandbox or potentially affect the host.

 

LogonCommand

Specifies a single Command which will be invoked automatically after the container logs on.

 

<LogonCommand>
   <Command>command to be invoked</Command>
</LogonCommand>

 

Command: A path to an executable or script inside of the container that will be executed after login.

 

Note: Although very simple commands will work (launching an executable or script), more complicated scenarios involving multiple steps should be placed into a script file. This script file may be mapped into the container via a shared folder, and then executed via the LogonCommand directive.

 

Example 1:

The following config file can be used to easily test downloaded files inside of the sandbox. To achieve this, the configuration disables vGPU and networking, and restricts the shared Downloads folder to read-only access in the container. For convenience, the logon command opens the Downloads folder inside the container when it is started.

 

Downloads.wsb

<Configuration>
<VGpu>Disable</VGpu>
<Networking>Disable</Networking>
<MappedFolders>
   <MappedFolder>
     <HostFolder>C:\Users\Public\Downloads</HostFolder>
     <ReadOnly>true</ReadOnly>
   </MappedFolder>
</MappedFolders>
<LogonCommand>
   <Command>explorer.exe C:\users\WDAGUtilityAccount\Desktop\Downloads</Command>
</LogonCommand>
</Configuration>

 

Example 2

The following config file installs Visual Studio Code in the container, which requires a slightly more complicated LogonCommand setup.

 

Two folders are mapped into the container; the first (SandboxScripts) contains VSCodeInstall.cmd, which will install and run VSCode. The second folder (CodingProjects) is assumed to contain project files that the developer wants to modify using VSCode.

 

With the VSCode installer script already mapped into the container, the LogonCommand can reference it.

 

VSCodeInstall.cmd

REM Download VSCode
curl -L "https://update.code.visualstudio.com/latest/win32-x64-user/stable" --output C:\users\WDAGUtilityAccount\Desktop\vscode.exe
 
REM Install and run VSCode
C:\users\WDAGUtilityAccount\Desktop\vscode.exe /verysilent /suppressmsgboxes

 

VSCode.wsb

<Configuration>
<MappedFolders>
   <MappedFolder>
     <HostFolder>C:\SandboxScripts</HostFolder>
     <ReadOnly>true</ReadOnly>
   </MappedFolder>
   <MappedFolder>
     <HostFolder>C:\CodingProjects</HostFolder>
     <ReadOnly>false</ReadOnly>
   </MappedFolder>
</MappedFolders>
<LogonCommand>
   <Command>C:\users\wdagutilityaccount\desktop\SandboxScripts\VSCodeInstall.cmd</Command>
</LogonCommand>
</Configuration>

 

Conclusion

We look forward to your feedback.

 

Cheers,

Margarit Chenchev, Erick Smith, Paul Bozzay, Deepti Bhardwaj & Hari Pulapaka

(Windows Sandbox team) 

Windows Sandbox

12 December 2022 at 19:07

Windows Sandbox is a new lightweight desktop environment tailored for safely running applications in isolation.

 

How many times have you downloaded an executable file, but were afraid to run it? Have you ever been in a situation which required a clean installation of Windows, but didn’t want to set up a virtual machine?

 

At Microsoft we regularly encounter these situations, so we developed Windows Sandbox: an isolated, temporary, desktop environment where you can run untrusted software without the fear of lasting impact to your PC. Any software installed in Windows Sandbox stays only in the sandbox and cannot affect your host. Once Windows Sandbox is closed, all the software with all its files and state are permanently deleted.

 

Windows Sandbox has the following properties:

  • Part of Windows – everything required for this feature ships with Windows 10 Pro and Enterprise. No need to download a VHD!
  • Pristine – every time Windows Sandbox runs, it’s as clean as a brand-new installation of Windows
  • Disposable – nothing persists on the device; everything is discarded after you close the application
  • Secure – uses hardware-based virtualization for kernel isolation, which relies on Microsoft’s hypervisor to run a separate kernel that isolates Windows Sandbox from the host
  • Efficient – uses integrated kernel scheduler, smart memory management, and virtual GPU

 

Prerequisites for using the feature

  • Windows 10 Pro or Enterprise Insider build 18305 or later
  • AMD64 architecture
  • Virtualization capabilities enabled in BIOS
  • At least 4GB of RAM (8GB recommended)
  • At least 1 GB of free disk space (SSD recommended)
  • At least 2 CPU cores (4 cores with hyperthreading recommended)

 

Quick start

  1. Install Windows 10 Pro or Enterprise, Insider build 18305 or newer
  2. Enable virtualization:
    • If you are using a physical machine, ensure virtualization capabilities are enabled in the BIOS.
    • If you are using a virtual machine, enable nested virtualization with this PowerShell cmdlet:
    • Set-VMProcessor -VMName <VMName> -ExposeVirtualizationExtensions $true
  3. Open Windows Features, and then select Windows Sandbox. Select OK to install Windows Sandbox. You might be asked to restart the computer.
  4. Using the Start menu, find Windows Sandbox, run it, and allow the elevation
  5. Copy an executable file from the host
  6. Paste the executable file into the Windows Sandbox window (on the Windows desktop)
  7. Run the executable in Windows Sandbox; if it is an installer, go ahead and install it
  8. Run the application and use it as you normally do
  9. When you’re done experimenting, simply close the Windows Sandbox application. All sandbox content will be discarded and permanently deleted
  10. Confirm that the host does not have any of the modifications that you made in Windows Sandbox.

[Screenshot: Windows Sandbox running on the host desktop]

 

Windows Sandbox respects the host diagnostic data settings. All other privacy settings are set to their default values.

 

Windows Sandbox internals

Since this is the Windows Kernel Internals blog, let’s go under the hood. Windows Sandbox builds on the technologies used within Windows Containers. Windows containers were designed to run in the cloud. We took that technology, added integration with Windows 10, and built features that make it more suitable to run on devices and laptops without requiring the full power of Windows Server.

 

Some of the key enhancements we have made include:

 

Dynamically generated Image

At its core Windows Sandbox is a lightweight virtual machine, so it needs an operating system image to boot from. One of the key enhancements we have made for Windows Sandbox is the ability to use a copy of the Windows 10 installed on your computer, instead of downloading a new VHD image as you would have to do with an ordinary virtual machine.

 

We want to always present a clean environment, but the challenge is that some operating system files can change. Our solution is to construct what we refer to as a “dynamic base image”: an operating system image that has clean copies of the files that can change, but links to the files that cannot change in the Windows image that already exists on the host. Because the majority of the files are links to immutable files, the image is small (~100 MB) for a full operating system. We call this instance the “base image” for Windows Sandbox, using Windows Container parlance.

 

When Windows Sandbox is not installed, we keep the dynamic base image in a compressed package that is only about 25 MB. Once installed, the dynamic base package occupies about 100 MB of disk space.

[Diagram: dynamic base image construction]

Smart memory management

Memory management is another area where we have integrated with the Windows Kernel. Microsoft’s hypervisor allows a single physical machine to be carved up into multiple virtual machines which share the same physical hardware. While that approach works well for traditional server workloads, it isn't as well suited to devices with more limited resources. We designed Windows Sandbox in such a way that the host can reclaim memory from the sandbox if needed.

 

Additionally, since Windows Sandbox is basically running the same operating system image as the host, we also allow Windows Sandbox to use the same physical memory pages as the host for operating system binaries via a technology we refer to as “direct map”. In other words, the same executable pages of ntdll are mapped into the sandbox as on the host. We take care to ensure this is done in a secure manner and no secrets are shared.

[Diagram: direct map of host operating system pages into the sandbox]

Integrated kernel scheduler

With ordinary virtual machines, Microsoft’s hypervisor controls the scheduling of the virtual processors running in the VMs. However, for Windows Sandbox we use a new technology called “integrated scheduler” which allows the host to decide when the sandbox runs. 

 

For Windows Sandbox we employ a unique scheduling policy that allows the virtual processors of the sandbox to be scheduled in the same way as threads would be scheduled for a process. High-priority tasks on the host can preempt less important work in the sandbox. The benefit of using the integrated scheduler is that the host manages Windows Sandbox as a process rather than a virtual machine which results in a much more responsive host, similar to Linux KVM.

 

The whole goal here is to treat the Sandbox like an app but with the security guarantees of a Virtual Machine. 

 

Snapshot and clone

As stated above, Windows Sandbox uses Microsoft’s hypervisor. We're essentially running another copy of Windows which needs to be booted, and this can take some time. So rather than paying the full cost of booting the sandbox operating system every time we start Windows Sandbox, we use two other technologies: “snapshot” and “clone.”

 

Snapshot allows us to boot the sandbox environment once and preserve the memory, CPU, and device state to disk. When we need a new instance of Windows Sandbox, we restore the environment from disk into memory rather than booting it. This significantly improves the start time of Windows Sandbox.

 

Graphics virtualization

Hardware accelerated rendering is key to a smooth and responsive user experience, especially for graphics-intense or media-heavy use cases. However, virtual machines are isolated from their hosts and unable to access advanced devices like GPUs. The role of graphics virtualization technologies, therefore, is to bridge this gap and provide hardware acceleration in virtualized environments; e.g. Microsoft RemoteFX.

 

More recently, Microsoft has worked with our graphics ecosystem partners to integrate modern graphics virtualization capabilities directly into DirectX and WDDM, the driver model used by display drivers on Windows.

 

At a high level, this form of graphics virtualization works as follows:

  • Apps running in a Hyper-V VM use graphics APIs as normal.
  • Graphics components in the VM, which have been enlightened to support virtualization, coordinate across the VM boundary with the host to execute graphics workloads.
  • The host allocates and schedules graphics resources among apps in the VM alongside the apps running natively. Conceptually they behave as one pool of graphics clients.

This process is illustrated below:

 

[Diagram: GPU virtualization for Windows Sandbox]

 

This enables the Windows Sandbox VM to benefit from hardware accelerated rendering, with Windows dynamically allocating graphics resources where they are needed across the host and guest. The result is improved performance and responsiveness for apps running in Windows Sandbox, as well as improved battery life for graphics-heavy use cases.

 

To take advantage of these benefits, you’ll need a system with a compatible GPU and graphics drivers (WDDM 2.5 or newer). Incompatible systems will render apps in Windows Sandbox with Microsoft’s CPU-based rendering technology.

 

Battery pass-through

Windows Sandbox is also aware of the host’s battery state, which allows it to optimize power consumption. This is critical for a technology that will be used on laptops, where not wasting battery is important to the user.

 

Filing bugs and suggestions

As with any new technology, there may be bugs. Please file them so that we can continually improve this feature. 

 

File bugs and suggestions at Windows Sandbox's Feedback Hub (select Add new feedback), or follow these steps:

  1. Open the Feedback Hub
  2. Select Report a problem or Suggest a feature.
  3. Fill in the Summarize your feedback and Explain in more details boxes with a detailed description of the issue or suggestion.
  4. Select an appropriate category and subcategory using the dropdown menus. There is a dedicated option in Feedback Hub for Windows Sandbox bugs and feedback, located under the "Security and Privacy" category, "Windows Sandbox" subcategory.
  5. Select Next
  6. If necessary, collect traces for the issue: select the Recreate my problem tile, select Start capture, reproduce the issue, and then select Stop capture.
  7. Attach any relevant screenshots or files for the problem.
  8. Submit

Conclusion

We look forward to you using this feature and receiving your feedback!

 

Cheers, 

Hari Pulapaka, Margarit Chenchev, Erick Smith, & Paul Bozzay

(Windows Sandbox team)

Mitigating Spectre variant 2 with Retpoline on Windows

Updated May 14, 2019:  We're happy to announce that today we've updated Retpoline cloud configuration to enable it for all supported devices!* In addition, with the May 14 Patch Tuesday update, we've removed the dependence on cloud configuration such that even those customers who may not be receiving cloud configuration updates can experience Retpoline performance gains.

*Note: Retpoline is enabled by default on devices running Windows 10, version 1809 and Windows Server 2019 or newer and which meet the following conditions:

  • Spectre, Variant 2 (CVE-2017-5715) mitigation is enabled.
    • For Client SKUs, Spectre Variant 2 mitigation is enabled by default
    • For Server SKUs, Spectre Variant 2 mitigation is disabled by default. To realize the benefits of Retpoline, IT Admins can enable it on servers following this guidance.
  • Supported microcode/firmware updates are applied to the machine.

 

Updated March 1, 2019:  The post below outlines the performance benefits of using Retpoline against the Spectre variant 2 (CVE-2017-5715) attack—as observed with 64-bit Windows Insider Preview Builds 18272 and later. While Retpoline is currently disabled by default on production Windows 10 client devices, we have backported the OS modifications needed to support Retpoline so that it can be used with Windows 10, version 1809 and have those modifications in the March 1, 2019 update (KB4482887).

Over the coming months, we will enable Retpoline as part of phased rollout via cloud configuration. Due to the complexity of the implementation and changes involved, we are only enabling Retpoline performance benefits for Windows 10, version 1809 and later releases.

 

Updated March 5, 2019:   While the phased rollout is in progress, customers who would like to manually enable Retpoline on their machines can do so with the following registry configuration updates:

 

On Client SKUs:

  1. reg add "HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management" /v FeatureSettingsOverride /t REG_DWORD /d 0x400
  2. reg add "HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management" /v FeatureSettingsOverrideMask /t REG_DWORD /d 0x400
  3. Reboot

On Server SKUs:

  1. reg add "HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management" /v FeatureSettingsOverride /t REG_DWORD /d 0x400
  2. reg add "HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management" /v FeatureSettingsOverrideMask /t REG_DWORD /d 0x401
  3. Reboot

 

Note: The above registry configurations are for customers running with default mitigation settings. In particular, for Server SKUs, these settings will enable Spectre variant 2 mitigations (which are enabled by default on Client SKUs). If it's desirable to enable additional security mitigations on top of Retpoline, then the feature settings values for those features need to be bitwise OR'd into FeatureSettingsOverride and FeatureSettingsOverrideMask.

Example: Feature settings values for enabling SSBD (speculative store bypass) system wide:
FeatureSettingsOverride = 0x8 and FeatureSettingsOverrideMask = 0
To add Retpoline, feature settings value for Retpoline (0x400) should be bitwise OR'd:
FeatureSettingsOverride = 0x408 and FeatureSettingsOverrideMask = 0x400
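The bit arithmetic in the example above can be checked directly. This is just an illustration of the bitwise OR described in the note; the values come from the text:

```python
RETPOLINE = 0x400  # feature settings value for Retpoline (from the text)
SSBD = 0x8         # feature settings value for SSBD (from the text)

# Starting point: SSBD enabled system wide
override, mask = SSBD, 0x0

# Bitwise OR the Retpoline bit into both values, as the note describes
override |= RETPOLINE
mask |= RETPOLINE

print(hex(override), hex(mask))  # 0x408 0x400
```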

The Get-SpeculationControlSettings PowerShell cmdlet can be used to verify Retpoline status. Here’s an example output showing Retpoline and import optimization enabled:

Speculation control settings for CVE-2017-5715 [branch target injection] 
 
Hardware support for branch target injection mitigation is present: True  
Windows OS support for branch target injection mitigation is present: True 
Windows OS support for branch target injection mitigation is enabled: True 
… 
BTIKernelRetpolineEnabled           : True 
BTIKernelImportOptimizationEnabled  : True 
...

Since Retpoline is a performance optimization for Spectre variant 2, it requires that hardware and OS support for branch target injection be present and enabled. Skylake and later generations of Intel processors are not compatible with Retpoline, so only Import Optimization will be enabled on these processors.


In January 2018, Microsoft released an advisory and security updates related to a newly discovered class of hardware vulnerabilities involving speculative execution side channels (known as Spectre and Meltdown) that affect AMD, ARM, and Intel CPUs to varying degrees. If you haven’t had a chance to learn about these issues, we recommend watching The Case of Spectre and Meltdown by the team at TU Graz from BlueHat Israel and reading the blog post by Jann Horn (@tehjh) of Google Project Zero.

 

We have also had multiple posts detailing the internals of our implementation to handle these side-channel attacks.

  1. Mitigating speculative execution side channel hardware vulnerabilities
  2. KVA Shadow: Mitigating Meltdown on Windows
  3. Analysis and mitigation of L1 Terminal Fault (L1TF)

For today’s post, we have kernel developers Andrea Allievi and Chris Kleynhans describing our design and implementation of retpoline for Windows which improves performance of Spectre variant 2 mitigations (CVE-2017-5715) to noise-level for most scenarios. These improvements are available today in Windows Insider Builds (builds 18272 or newer, x64-only).

 

Introduction

At a high level, the Spectre variant 2 attack exploits indirect branches to steal secrets located in higher privilege contexts (e.g. kernel-mode vs user-mode). Indirect branches are instructions where the target of the branch is not contained in the instruction itself, such as when the destination address is stored in a CPU register.

 

Describing the full Spectre attack is outside the scope of this article. Details are in the links above or in this whitepaper from Intel.

 

Our original mitigations for Spectre variant 2 made use of new capabilities exposed by CPU microcode updates to restrict indirect branch speculation when executing within kernel mode (IBRS and IBPB). While this was an effective mitigation from a security standpoint, it resulted in a larger performance degradation than we’d like on certain processors and workloads.

 

For this reason, starting in early 2018, we investigated alternatives and found promise in an approach developed by Google called retpoline. A full description of retpoline can be found here, but in short, retpoline works by replacing all indirect call or jumps in kernel-mode binaries with an indirect branch sequence that has safe speculation behavior.

 

This sequence, shown below in Figure 1, effects a safe control transfer to the target address by performing a function call, modifying the return address and then returning.

RP0:  call RP2                 ; push address of RP1 onto the stack and jump to RP2
RP1:  int 3                    ; breakpoint to capture speculation
RP2:  mov [rsp], <Jump Target> ; overwrite return address on the stack to desired target
RP3:  ret                      ; return

While this construct is not as fast as a regular indirect call or jump, it has the side effect of preventing the processor from unsafe speculative execution. This proves to be much faster than running all of kernel mode code with branch speculation restricted (IBRS set to 1). However, this construct is only safe to use on processors where the RET instruction does not speculate based on the contents of the indirect branch predictor. Those processors are all AMD processors as well as Intel processors codenamed Broadwell and earlier according to Intel’s whitepaper. Retpoline is not applicable to Skylake and later processors from Intel.

 

Windows requirements for Retpoline

Traditionally the transformation of indirect calls and jumps into retpolines is performed when a binary is built by the compiler. However, there are several functional requirements in Windows that make a purely compile-time implementation insufficient.

 

These key requirements are:

  1. Single binary: Windows releases are long-lived and must support a wide variety of hardware with a single set of binaries. On some hardware retpoline is not a complete mitigation because of alternate behavior of the ret instruction and retpoline must not be used. Further, future hardware may eliminate the need for retpoline entirely. Therefore, a Windows implementation of retpoline must allow the feature to be enabled and disabled at boot time using a single set of binaries, based on whether the underlying hardware is vulnerable, compatible and whether Spectre variant 2 mitigations are enabled on the system. Further, the runtime overhead of retpoline support should be minimal when the feature is disabled.
  2. 3rd party device drivers: A lot of the code that runs in kernel mode is not part of Windows and consists of 3rd party device driver code. Traditional retpoline would only be secure if all these drivers were recompiled with a new version of the compiler. Given the breadth of Windows 3rd party driver ecosystem, it is not realistic to expect all non-inbox 3rd party drivers to be recompiled and released to customers at the same time. Therefore, a Windows implementation of retpoline must be able to support a mixed environment, providing high performance when running drivers that have been updated, but allowing for graceful fallback to hardware-based mitigations upon entering a non-retpoline driver to preserve security.
  3. Driver portability: Windows drivers are not bound to a specific release of Windows, many drivers that are built today for Windows 10 will also support older versions of the operating system. Therefore, a Windows implementation of retpoline must ensure that drivers compiled with retpoline support can run on a version of Windows that does not support retpoline.

General Architecture

To satisfy requirements 1 and 3, we decided that binaries would ship in a non-retpolined state and then be transformed into a retpolined state at runtime by rewriting the code sequences for all indirect calls. This ensures that systems that do not use retpoline can use the binaries as compiled, without needing any support for retpoline and with minimal runtime cost.

 

However, performing the transformation at runtime does lead to one problem. How do we know what transformations need to be applied? Disassembling and analyzing driver machine code to locate all indirect calls is not practical.

 

Dynamic Value Relocation Table (DVRT)

To solve this problem, we collaborated with the compiler team in Visual Studio to develop a system whereby the compiler can emit a new type of metadata into driver binaries describing each indirect call or jump in the system. This metadata takes the form of new relocation entries in the Dynamic Value Relocation Table (DVRT).

 

The DVRT was originally introduced back in the Windows 10 Creators Update to improve kernel address space layout randomization (KASLR). It allowed the memory manager’s page frame number (PFN) database and page table self-map to be assigned dynamic addresses at runtime. The DVRT is stored directly in the binary and contains a series of relocation entries for each symbol (i.e. address) that is to be relocated. The relocation entries are themselves arranged in a hierarchical fashion grouped first by symbol and then by containing page to allow for a compact description of all locations in the binary that reference a relocatable symbol.

 

At build time, the compiler keeps track of all references to these special symbols and fills out the DVRT. Then at runtime the kernel will parse the DVRT and update each symbol reference with the correct dynamically assigned address. Importantly, the kernel will skip over any DVRT entries it does not recognize (i.e. those with an unknown symbol) so adding new symbols to the DVRT does not break older versions of Windows.
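The "skip unknown symbols" behavior described above can be sketched as follows. Note that this is a deliberately simplified model: the symbol names and the entry layout here are invented for illustration and do not match the on-disk PE format, which groups real DVRT entries by symbol and then by containing page inside the load configuration.

```python
# Hypothetical symbol table: symbols this kernel version understands.
KNOWN_SYMBOLS = {"PFN_DATABASE": 0xFFFFFA8000000000,
                 "PTE_SELF_MAP": 0xFFFFF68000000000}

def apply_dvrt(dvrt, fixup):
    """Walk a (symbol, pages) list; apply fixups for known symbols,
    silently skip entries with unrecognized symbols (forward compat)."""
    applied = skipped = 0
    for symbol, pages in dvrt:            # grouped first by symbol...
        if symbol not in KNOWN_SYMBOLS:   # unknown symbol: skip, don't fail
            skipped += 1
            continue
        for page_rva, offsets in pages:   # ...then by containing page
            for off in offsets:
                fixup(page_rva + off, KNOWN_SYMBOLS[symbol])
                applied += 1
    return applied, skipped

dvrt = [("PFN_DATABASE", [(0x1000, [0x10, 0x28])]),
        ("FUTURE_SYMBOL", [(0x2000, [0x8])])]   # unknown to this "kernel"
print(apply_dvrt(dvrt, lambda va, addr: None))  # (2, 1)
```

A newer compiler can therefore add new relocation types (such as the retpoline entries) without breaking an older loader, which simply skips them.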

 

These properties meant the DVRT was a perfect place to store our retpoline metadata, however the existing DVRT format needed to be extended to support retpoline.

 

Based on Windows requirements, we classified indirect calls/jumps into three distinct forms and each of these forms has its own type of retpoline relocation and corresponding runtime fixup.

  1. Import calls/jumps
  2. Switchtable jumps
  3. Generic indirect calls/jumps

Let’s talk a little about each of these types of calls.

 

Import Calls/Jumps

Import calls/jumps are, as the name implies, used for calls/jumps made by a binary to functions that have been imported from another binary. When compiling with retpoline, the compiler ensures that all such calls conform to the following form:

48 FF 15 XX XX XX XX     call qword ptr [_imp_<function>]
0F 1F 44 00 00           nop

The call or jmp instruction always directly references the import address table (IAT) and has 5 bytes of additional padding (to be used by the retpoline fixup).
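One reason for the 5-byte nop padding: the padded sequence occupies exactly as many bytes as the two-instruction replacement that the retpoline fixup (described later in this post) writes over it, so the rewrite happens in place without moving code. The instruction lengths below are standard x64 encodings; the arithmetic is only a sanity check:

```python
# x64 instruction lengths (bytes) for the sequences shown in the text
CALL_RIP_REL = 7    # 48 FF 15 xx xx xx xx : call qword ptr [_imp_<function>]
NOP5 = 5            # 0F 1F 44 00 00       : 5-byte nop padding

MOV_R10_RIPREL = 7  # e.g. 4C 8B 15 xx xx xx xx : mov r10, qword ptr [_imp_...]
CALL_REL32 = 5      # E8 xx xx xx xx           : direct call to the stub

# Original padded import call and its retpoline replacement are the same size,
# so the fixup is a simple in-place overwrite.
print(CALL_RIP_REL + NOP5, MOV_R10_RIPREL + CALL_REL32)  # 12 12
```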

 

Switchtable Jumps

Switchtable jumps are used for jumps made to other locations within the same function and are so-named because of their usage in implementing C/C++ switch statements. When compiling with retpoline support the compiler ensures that such calls are always made through a register and take the following form:

FF E0                    jmp rax
CC CC CC                 int 3

Generic Indirect Calls/Jumps

All other indirect calls/jumps fall into the generic type. To simplify the retpoline relocation format and the corresponding fixup logic, the compiler ensures that all such indirect calls/jumps provide their target address in the RAX register. The exact format of the call/jump instruction however differs depending on whether it is protected by control flow guard (CFG).

 

Loading binaries at runtime

Now that we have a way to identify all the indirect calls/jumps in the binary, we need to apply the fixups.

 

The NT memory manager has long had infrastructure to apply fixups to binaries at runtime. This infrastructure was extended to understand retpoline relocations and their corresponding fixups.

 

But what exactly do these fixups look like? As mentioned earlier, the Windows implementation needs to support mixed environments in which some drivers are not compiled with retpoline support. This means that we cannot simply replace every indirect call with a retpoline sequence like the example shown in the introduction. We need to ensure that the kernel gets the opportunity to inspect the target of the call or jump so that it can apply appropriate mitigations if the target does not support retpoline.

 

For this reason, we transform every indirect call or jump into a direct call or jump to a kernel provided “retpoline stub function”. For example, an indirect call to an imported function that looks like this:

call qword ptr [_imp_ExAllocatePoolWithTag]     ; Target address located at a REL32 offset
nop                                             ; Padding

Will be replaced at runtime with a direct call to the retpoline import stub:

mov r10, qword ptr [_imp_ExAllocatePoolWithTag] ; R10 = target address
call _guard_retpoline_import_r10                ; Direct REL32 call to the stub function

There are several retpoline stub functions each of which is specialized to the type of call/jump it handles. However, each function generally performs the following steps:

  1. Check if the target binary supports retpoline
  • Prior to transferring control to the target address, the function must determine whether the target address belongs to a driver that supports retpoline. To determine this, the kernel maintains a sparse bitmap of the entire kernel-mode address space with each bit describing a 64 KB region of the address space. Bits in this bitmap are set to 1 if and only if their corresponding region of address space belongs to a kernel-mode binary that fully supports retpoline.
  • If the bitmap check determines that the target address does not belong to a retpolined binary, the stub function has to fall back to the hardware-based Spectre variant 2 mitigation (by setting IBRS to restrict branch speculation) and then perform a regular indirect call/jmp. Otherwise, the kernel does not need to set IBRS. On processors that do not support IBRS, retpoline will, instead, perform IBPB if user-to-kernel protection is enabled as described here.
  • Since the target of a switch table jump is always in the same binary as the source (and therefore the target is guaranteed to support retpoline), this bitmap check is omitted from the switchtable jump stub functions.
  • Check if the target address is a valid CFG target
    • For CFG instrumented indirect calls/jumps the retpoline stub function is responsible for checking the kernel-mode CFG bitmap to verify that the target address given is a valid CFG call target. If this check fails, then the stub function will bugcheck the system to prevent any exploit that attempts an indirect control transfer to an invalid address.
  • Transfer control to the target using a retpoline.
  • The usage of these stub functions ensures that we can satisfy the requirement to support mixed environments, however they do introduce one additional problem. The x64 direct call/jump instruction can only encode a target address within 2 GB of the call-site (since the target is specified by a signed 16- or 32-bit offset). Since the retpoline stub functions are implemented in the NT kernel binary this would generally mean that drivers would have to be loaded within 2 GB of the kernel binary.

     

    To work around this requirement, all retpoline stub functions are contained within a single section of the NT kernel binary and have been carefully written to take no dependencies on their position relative to the rest of the binary. This allows us to map the physical memory pages backing the retpoline stub functions immediately after every driver in the system, giving each driver its own “copy” of the retpoline stub functions that is guaranteed to be within 2 GB of every indirect call/jump.
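The per-64 KB bitmap check described above can be sketched as follows. This is a simplified flat bitmap standing in for the kernel's sparse structure, with a toy address-space size; the names and sizes are illustrative, not the kernel's:

```c
#include <stdint.h>

#define REGION_SHIFT 16      /* one bit per 64 KB region of address space */
#define NUM_REGIONS  4096    /* toy address space: 4096 * 64 KB = 256 MB */

static uint64_t retpoline_bitmap[NUM_REGIONS / 64];

/* Called at driver load: mark [base, base + size) as belonging to a
   binary that fully supports retpoline. */
static void mark_retpolined(uint64_t base, uint64_t size)
{
    for (uint64_t r = base >> REGION_SHIFT;
         r <= (base + size - 1) >> REGION_SHIFT; r++)
        retpoline_bitmap[r / 64] |= 1ull << (r % 64);
}

/* The check a stub performs before choosing retpoline vs. IBRS fallback. */
static int target_is_retpolined(uint64_t target)
{
    uint64_t r = target >> REGION_SHIFT;
    return (retpoline_bitmap[r / 64] >> (r % 64)) & 1;
}
```

A region's bit is set only when the whole backing binary supports retpoline, so a clear bit forces the conservative IBRS path.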

     

    Import optimization

    Indirect calls due to imported functions are by far the most common form of indirect control transfers in kernel-mode. The import call targets are determined at driver load time by processing the import address table (IAT) and remain constant throughout the driver’s lifetime. This means that most of the work provided by the retpoline import stub is unnecessary because we know at driver load time exactly where each of these calls will end up going and we know whether the target binary supports retpoline or not. Hence, we can use a much faster calling sequence.

     

    With import optimization, we use the retpoline fixup infrastructure to replace eligible import calls with direct calls to the imported function. This eliminates the overhead of the retpoline import call stub as well as the guaranteed branch prediction miss due to retpoline itself. To be eligible for import optimization, a call must meet the following requirements:

    1. The call/jump must be from a retpolined binary to another retpolined binary.
      • This is necessary to maintain the security guarantees of retpoline, because once we’ve rewritten the indirect call into a direct call the kernel no longer gets a chance to observe the target address and enable IBRS.
    2. The target of the call must be within 2 GB of the call site.
      • This is because, as mentioned above, direct call/jump instructions on x64 can only encode a signed 32-bit offset.
      • In order to virtually guarantee that import optimization can be applied to all retpolined binaries, the OS loader and kernel make sure that all kernel-mode modules are packed tightly in the address space while maintaining address space layout randomization (ASLR).
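The 2 GB restriction boils down to checking whether the displacement fits in the signed 32-bit field of a REL32 call; a minimal sketch:

```c
#include <stdint.h>

/* An x64 E8 rel32 call is 5 bytes; the displacement is measured from the
   end of the instruction and must fit in a signed 32-bit integer. */
static int rel32_reachable(uint64_t call_site, uint64_t target)
{
    int64_t disp = (int64_t)target - (int64_t)(call_site + 5);
    return disp >= INT32_MIN && disp <= INT32_MAX;
}
```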

    Here is an example of how the code generation for the call is modified.

    Original code sequence

    call [__imp_<Function>]                   ; Call to an imported function
    nop                                       ; 5-byte nop

    Import Optimized code sequence

    mov r10, [__imp_<Function>]               ; R10 = target address (normal transformation)
    call <Function>                           ; Direct REL32 call to target

    Import optimization turned out to be a big performance win! Hence, even on processors where retpoline cannot be used due to alternate return instruction behavior, we still use import optimization.

     

    Conclusion

    Retpoline has significantly improved the performance of the Spectre variant 2 mitigations on Windows. When all relevant kernel-mode binaries are compiled with retpoline, we’ve measured ~25% speedup in Office app launch times and up to 1.5-2x improved throughput in the Diskspd (storage) and NTttcp (networking) benchmarks on Broadwell CPUs in our lab. It is enabled by default in the latest Windows Client Insider Fast builds (for builds 18272 and higher on machines exposing compatible speculation control capabilities) and is targeted to ship with 19H1.

     

    To check if retpoline and import optimizations are enabled, you can use the PowerShell cmdlet Get-SpeculationControlSettings. You can also use NtQuerySystemInformation to programmatically query retpoline status.

     

    For a more in-depth look, here is Andrea Allievi’s BlueHat 2018 talk about retpoline on Windows.

     

    Give the latest builds a try and let us know your experience!

     

    One Windows Kernel

    12 December 2022 at 19:06

    Windows is one of the most versatile and flexible operating systems out there, running on a variety of machine architectures and available in multiple SKUs. It currently supports the x86, x64, ARM and ARM64 architectures. Windows used to support Itanium, PowerPC, DEC Alpha, and MIPS (wiki entry). In addition, Windows supports a variety of SKUs that run in a multitude of environments: from data centers, laptops, Xbox and phones to embedded IoT devices such as ATMs.

     

    The most amazing aspect of all this is that the core of Windows, its kernel, remains virtually unchanged on all these architectures and SKUs. The Windows kernel scales dynamically depending on the architecture and the processor that it’s run on to exploit the full power of the hardware. There is of course some architecture specific code in the Windows kernel, however this is kept to a minimum to allow Windows to run on a variety of architectures.

     

    In this blog post, I will talk about the evolution of the core pieces of the Windows kernel that allow it to transparently scale across a low-power NVIDIA Tegra chip on the Surface RT from 2012, to the giant behemoths that power Azure data centers today.

     

    This is a picture of Windows taskmgr running on a pre-release Windows DataCenter class machine with 896 cores supporting 1792 logical processors and 2TB of RAM!

     

    Task Manager showing 1792 logical processors

    Evolution of one kernel

    Before we talk about the details of the Windows kernel, I am going to take a small detour to talk about something called Windows refactoring. Windows refactoring plays a key part in increasing the reuse of Windows components across different SKUs, and platforms (e.g. client, server and phone). The basic idea of Windows refactoring is to allow the same DLL to be reused in different SKUs but support minor modifications tailored to the SKU without renaming the DLL and breaking apps.

     

    The base technology used for Windows refactoring is a lightly documented (entirely by design) technology called API sets. API sets are a mechanism that allows Windows to decouple a DLL’s name from where its implementation is located. For example, API sets allow Win32 apps to continue to link against kernel32.dll while the implementations of the APIs live in a different DLL. These implementation DLLs can also differ depending on your SKU. You can see API sets in action if you launch Dependency Walker on a traditional Windows DLL, e.g. kernel32.dll.
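Conceptually, API set resolution is a lookup from a contract name to the host DLL that implements it for the current SKU. A toy illustration follows; the table entries are examples only, and the real, SKU-specific schema is provided by apisetschema.dll and consumed by the OS loader:

```c
#include <stddef.h>
#include <string.h>

struct apiset_entry { const char *contract; const char *host; };

/* Illustrative entries only; not the actual schema contents. */
static const struct apiset_entry schema[] = {
    { "api-ms-win-core-processthreads-l1-1-0", "kernelbase.dll" },
    { "api-ms-win-core-file-l1-1-0",           "kernelbase.dll" },
};

/* Resolve an API set contract name to its host DLL, or NULL if the
   name is not a contract and should be loaded as a regular DLL. */
static const char *resolve_apiset(const char *contract)
{
    for (size_t i = 0; i < sizeof(schema) / sizeof(schema[0]); i++)
        if (strcmp(schema[i].contract, contract) == 0)
            return schema[i].host;
    return NULL;
}
```

Because apps import the contract name rather than the host DLL, the same binary keeps working when a SKU moves an implementation elsewhere.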

     

    Dependency Walker

    With that detour into how Windows is built to maximize code reuse and sharing, let’s go into the technical depths of the kernel starting with the scheduler which is key to the scaling of Windows.

     

    Kernel Components

    Windows NT is like a microkernel in the sense that it has a core kernel (KE) that does very little and uses the Executive layer (EX) to implement all the higher-level policy. Note that EX still runs in kernel mode, so it’s not a true microkernel. The kernel is responsible for thread dispatching, multiprocessor synchronization, hardware exception handling, and the implementation of low-level machine-dependent functions. The EX layer contains various subsystems which provide the bulk of the functionality traditionally thought of as the kernel, such as I/O, the Object Manager, the Memory Manager, the Process subsystem, etc.

     

    [Figure: Windows kernel architecture (arch.png)]

     

    To get a better idea of the size of the components, here is a rough breakdown of the number of lines of code in a few key directories in the Windows kernel source tree (counting comments). There is a lot more to the kernel than what is shown in this table.

     

    Kernel subsystem       Lines of code
    --------------------   -------------
    Memory Manager         501,000
    Registry               211,000
    Power                  238,000
    Executive              157,000
    Security               135,000
    Kernel                 339,000
    Process sub-system     116,000

     

    For more information on the architecture of Windows, the “Windows Internals” series of books are a good reference.

     

    Scheduler

    With that background, let's talk a little bit about the scheduler, its evolution and how Windows kernel can scale across so many different architectures with so many processors.

     

    A thread is the basic unit that runs program code, and it is this unit that is scheduled by the Windows scheduler. The Windows scheduler uses the thread priority to decide which thread to run, and in theory the highest priority thread on the system always gets to run, even if that entails preempting a lower priority thread.

     

    As a thread runs and experiences quantum end (minimum amount of time a thread gets to run), its dynamic priority decays, so that a high priority CPU bound thread doesn’t run forever starving everyone else. When another waiting thread is awakened to run, it is given a priority boost based on the importance of the event that caused the wait to be satisfied (e.g. a large boost is for a foreground UI thread vs. a smaller one for completing disk I/O). A thread therefore runs at a high priority as long as it’s interactive. When it becomes CPU (compute) bound, its priority decays, and it is considered only after other, higher priority threads get their time on the CPU. In addition, the kernel arbitrarily boosts the priority of ready threads that haven't received any processor time for a given period of time to prevent starvation and correct priority inversions.
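This decay-and-boost behavior can be sketched as a tiny model; the structure, function names, and boost values below are illustrative, not the kernel's actual implementation:

```c
struct thread_prio {
    int base;      /* static base priority */
    int current;   /* dynamic priority the scheduler actually uses */
};

/* Quantum end: a boosted thread decays one level back toward its base,
   so a CPU-bound thread cannot stay boosted forever. */
static void on_quantum_end(struct thread_prio *t)
{
    if (t->current > t->base)
        t->current--;
}

/* Wait satisfied: boost by an amount reflecting the importance of the
   waking event (e.g. larger for foreground UI, smaller for disk I/O). */
static void on_wait_satisfied(struct thread_prio *t, int boost)
{
    if (t->base + boost > t->current)
        t->current = t->base + boost;
}
```

An interactive thread keeps re-earning boosts at each wake, while a compute-bound thread decays back to its base and yields to higher-priority work.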

     

    The Windows scheduler initially had a single ready queue from which it picked the next highest priority thread to run. However, as Windows started supporting more and more processors, the single ready queue turned out to be a bottleneck, and around Windows Server 2003 the scheduler changed to one ready queue per processor. As Windows moved to multiple per-processor queues, it avoided having a single global lock protecting all the queues and allowed the scheduler to make locally optimal decisions. This means that at any point the single highest priority thread in the system runs, but that doesn’t necessarily mean that the top N (where N is the number of cores) priority threads are running. This proved to be good enough until Windows started moving to low-power CPUs, e.g. in laptops and tablets. On these systems, not running a high priority thread (such as the foreground UI thread) caused noticeable glitches in the UI. And so, in Windows 8.1, the scheduler changed to a hybrid model with per-processor ready queues for affinitized (tied to a processor) work and shared ready queues between processors. This did not cause a noticeable impact on performance because of other architectural changes in the scheduler, such as the dispatcher database lock refactoring which we will talk about later.

     

    Windows 7 introduced the Dynamic Fair Share Scheduler, a feature aimed primarily at terminal servers. The problem it tried to solve was that one terminal server session running a CPU-intensive workload could impact the threads in other terminal server sessions. Since the scheduler didn’t consider sessions and simply used priority as the key to schedule threads, users in different sessions could impact the user experience of others by starving their threads. It also unfairly advantaged sessions (users) with many threads, because sessions with more threads got more opportunities to be scheduled and thus received more CPU time. This feature added policy to the scheduler so that each session was treated fairly and roughly the same amount of CPU was available to each session. Similar functionality is available in Linux as well, with its Completely Fair Scheduler.

    In Windows 8, this concept was generalized as scheduler groups and added to the Windows scheduler, with each session in an independent scheduler group. In addition to thread priority, the scheduler uses the scheduler groups as a second-level index to decide which thread should run next. On a terminal server, all the scheduler groups are weighted equally, so all sessions (scheduler groups) receive the same amount of CPU regardless of the number or priorities of the threads in the scheduler groups. Beyond terminal server sessions, scheduler groups are also used for fine-grained control of a process at runtime. In Windows 8, Job objects were enhanced to support CPU rate control. Using the CPU rate control APIs, one can decide how much CPU a process can use, whether the limit should be a hard cap or a soft cap, and receive notifications when a process meets those CPU limits. This is like the resource control features available with cgroups on Linux.
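The fairness property can be illustrated with a toy calculation (purely illustrative, not the kernel's actual accounting): with equal group weights, a session's CPU share no longer depends on how many threads it has:

```c
/* With scheduler groups weighted equally, each session's CPU share
   depends only on the number of sessions, not its thread count. */
static double group_share_percent(int num_groups)
{
    return num_groups > 0 ? 100.0 / num_groups : 0.0;
}

/* Without groups, a purely per-thread split lets a session with many
   threads crowd out the others. */
static double naive_session_percent(int session_threads, int total_threads)
{
    return total_threads > 0 ? 100.0 * session_threads / total_threads : 0.0;
}
```

For four sessions where one owns 30 of 40 runnable threads, the naive split gives that session 75% of the CPU, while group scheduling holds every session to 25%.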

     

    Starting with Windows 7, Windows Server began supporting more than 64 logical processors in a single machine. To support that many processors, Windows internally introduced a new entity called a “processor group”. A group is a static set of up to 64 logical processors that is treated as a single scheduling entity. The kernel determines at boot time which processor belongs to which group, and for machines with fewer than 64 cores the overhead of the group structure indirection is mostly not noticeable. While a single process can span groups (such as a SQL Server instance), an individual thread can only execute within a single processor group at a time.
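The group bookkeeping can be sketched as a mapping from a flat processor index to a group number plus a 64-bit mask, similar in spirit to the Windows GROUP_AFFINITY structure (the sketch below is illustrative, not the kernel's code):

```c
#include <stdint.h>

#define PROCESSORS_PER_GROUP 64

/* Mirrors the shape of GROUP_AFFINITY: a group number plus a
   64-bit affinity mask within that group. */
struct group_affinity {
    uint16_t group;
    uint64_t mask;
};

static struct group_affinity flat_to_group(unsigned flat_index)
{
    struct group_affinity g;
    g.group = (uint16_t)(flat_index / PROCESSORS_PER_GROUP);
    g.mask  = 1ull << (flat_index % PROCESSORS_PER_GROUP);
    return g;
}
```

With this representation, all pre-existing 64-bit affinity masks keep working unchanged within a group, which is why sub-64-core machines see no difference.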

     

    However, on machines with more than 64 cores, Windows started showing bottlenecks that prevented high-performance applications such as SQL Server from scaling their performance linearly with the number of processor cores. Thus, even if you added more cores and memory, the benchmarks wouldn’t show much increase in performance. One of the main causes of this lack of performance was contention around the dispatcher database lock. The dispatcher database lock protected access to those objects that needed to be dispatched, i.e. scheduled. Examples of objects protected by this lock included threads, timers, I/O completion ports, and other waitable kernel objects (events, semaphores, mutants, etc.). Thus, in Windows 7, driven by the impetus of greater-than-64-processor support, work was done to eliminate the dispatcher database lock and replace it with fine-grained locks, such as per-object locks. This allowed benchmarks such as SQL TPC-C to show a 290% improvement, compared to Windows 7 with a dispatcher database lock, on certain machine configurations. This was one of the biggest performance boosts seen in Windows history due to a single feature.

     

    Windows 10 brought us another innovation in the scheduler space with CPU Sets. CPU Sets allow a process to partition the system such that it can take over a group of processors, preventing any other process or the system from running threads on those processors. The Windows kernel even steers interrupts from devices away from the processors in your CPU set. This ensures that even devices cannot run their code on the processors which have been partitioned off by CPU Sets for your app or process. Think of this as a low-tech virtual machine. As you can imagine, this is a powerful capability, and hence there are a lot of safeguards built in to prevent an app developer from making the wrong choices with the API. CPU Sets functionality is used by customers when they use Game Mode to run their games.
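At its core, a CPU set partition is bookkeeping over which logical processors a process has claimed exclusively and which remain available to everyone else. A toy bitmask sketch of that bookkeeping follows (illustrative only; the real mechanism is the Win32 CPU Sets API, and real systems track this per group):

```c
#include <stdint.h>

/* Logical processors still available to the system; bit N set = CPU N free. */
static uint64_t system_available = ~0ull;

/* Claim a set of CPUs exclusively for one process. Other threads (and
   device interrupts) are steered to whatever remains.
   Returns 0 if any requested CPU is already claimed. */
static int claim_cpu_set(uint64_t requested)
{
    if ((system_available & requested) != requested)
        return 0;
    system_available &= ~requested;
    return 1;
}

/* The mask the rest of the system is left to run on. */
static uint64_t cpus_for_everyone_else(void)
{
    return system_available;
}
```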

     

    Finally, this brings us to ARM64 support with Windows 10 on ARM. The ARM ecosystem supports big.LITTLE, a heterogeneous architecture where the “big” cores run fast and consume more power, while the “LITTLE” cores run slowly and consume less power. The idea here is that you run unimportant tasks on the LITTLE cores to save battery. To support the big.LITTLE architecture and provide great battery life with Windows 10 on ARM, the Windows scheduler added support for heterogeneous scheduling, which takes into account app intent when scheduling on big.LITTLE architectures.

     

    By app intent, I mean Windows tries to provide a quality of service for apps by tracking threads which are running in the foreground (or are starved of CPU) and ensuring those threads always run on a big core, whereas background tasks, services, and other ancillary threads in the system run on the LITTLE cores. (As an aside, you can also programmatically mark your thread as unimportant, which will make it run on a LITTLE core.)

     

    Work on behalf: In Windows, a lot of work for the foreground is done by other services running in the background. For example, in Outlook, when you search for a mail, the search is conducted by a background service (the Indexer). If we simply ran all the services on the LITTLE cores, the experience and performance of the foreground app would be affected. To ensure that these scenarios are not slow on big.LITTLE architectures, Windows tracks when an app calls into another process to do work on its behalf. When this happens, we donate the foreground priority to the service thread and force the thread in the service to run on a big core.
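The donation described above can be modeled as copying the caller's QoS (importance and core placement) onto the service thread for the duration of the request. This is an illustrative model, not the kernel's actual structures:

```c
enum core_class { LITTLE_CORE, BIG_CORE };

struct sched_thread {
    int importance;             /* foreground threads rank higher */
    enum core_class placement;  /* which class of core it runs on */
};

/* When a foreground thread asks a background service to do work on its
   behalf, temporarily give the service thread the caller's QoS so the
   request is not stuck on a LITTLE core behind background work. */
static void donate_qos(const struct sched_thread *caller,
                       struct sched_thread *service)
{
    if (caller->importance > service->importance) {
        service->importance = caller->importance;
        service->placement  = caller->placement;
    }
}
```

On completion of the request, the service thread would revert to its own background QoS.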

     

    That concludes our first (huge?) One Windows Kernel post, giving you an overview of the Windows Kernel Scheduler. We will have more similarly technical posts about the internals of the Windows Kernel. 

     

    Hari Pulapaka

    (Windows Kernel Team)

    Welcome to Windows Kernel Team Blog

    12 December 2022 at 19:06

    Welcome all, 

     

    We are the Windows Kernel team and we will be starting a new series of blog posts talking about the internals of the Windows kernel. We realize there is a general dearth of information regarding the internals of the Windows kernel, other than the excellent Windows Internals book series by Mark Russinovich. 

     

    Over the next few months, we will be having deep technical posts about things like the Kernel Scheduler, Memory Management and many other unexpected features in the Kernel. 

     

    Cheers, 

    Hari Pulapaka

    Group Program Manager for Windows Kernel 
