Windows OS Platform Blog articles

Azure Host OS Update with Hypervisor Hot Restart

6 September 2023 at 16:48

Azure is Microsoft’s cloud computing offering, providing IaaS (infrastructure as a service) virtual machines (VMs), PaaS (platform as a service) containers, and many other services (e.g., Azure Storage, Networking, etc.). As one of the largest cloud service providers, Azure hosts millions of customer virtual machines (VMs) in our data centers. The operating system that runs on these hosts is a modified version of Windows called Cloud Host. I talked about this and the Azure Host OS architecture (including the root OS and the hypervisor) in an earlier blog post. In this blog post we will talk about how we update the operating system that runs on those hosts, in particular a new advancement we made in updating the hypervisor called “Hypervisor Hot Restart (HHR)”.

 

Azure Host OS Updates Overview

Ensuring the security of the Azure hosts is critical to maintaining our customers’ trust, as their applications run in a public cloud where customers have limited control over infrastructure updates. We ensure the security of the Azure host by patching it and keeping it up to date with all the latest applicable security updates. These patches are typically rolled out every month with no disruption to customer workloads. In addition to security updates, we also update the Azure Host OS to provide new features and functionality to customer VMs, e.g., support for new hardware generations or new features such as Azure confidential computing.

 

Note: this blog focuses on internal Azure Host OS technical details and does not cover Azure customer-facing VM updates and control mechanisms. Those maintenance controls and scheduled events are documented for our customers on Azure’s website.

 

Different Azure Host OS update mechanisms

These are the most common types of Azure Host OS update technologies used in the Azure fleet.

 

Update Tech | Performance | Purpose
Hot Patching | Best – in milliseconds; not visible to customer VMs | Typically used for monthly security updates (e.g., MSRC). More detailed blog on internals here.
VM PHU | Typically ~30 seconds | Updates the entire Azure Host OS. Paper in EuroSys 2021 for technical details.
Live Migration [1] | Multiple seconds | Migrates the VM to a different node, potentially emptying the node for Host OS updates or other needs.
Hypervisor Hot Restart | Under a second | Updates the entire hypervisor. Useful when updating to a new version with the latest features.

 

[1] Future post on Live Migration internals

 

Introducing Hypervisor Hot Restart

With that introduction to Azure Host OS update technologies, we are going to do a deep dive into our latest and most advanced update technology: Hypervisor Hot Restart (HHR). HHR allows us to update and replace the hypervisor on a running system with sub-second blackout time for customer VMs and, importantly, without dropping any packets. With Hypervisor Hot Restart, we can easily deploy new hypervisor features or fixes, providing enormous customer value. This is especially important in today's world, where security threats are becoming more prevalent and sophisticated.

 

Hypervisor Hot Restart in Action

Here is a demonstration of Hypervisor Hot Restart in action. It showcases four VMs (virtual machines) that continue to run while the hypervisor is fully replaced underneath them. The network connection remains stable throughout the process and no packets are lost. We also showcase the speed of the restart process, with a maximum packet delay of 600 milliseconds. (Apologies for the low-quality GIF; the blogging platform has a low size limit. The original videos are attached to this blog post for offline viewing.)

 

HHRDemo.gif

 

How Does Hypervisor Hot Restart Work?

On an Azure node, there is one active hypervisor running the host operating system and the guest VMs. When we are ready to update the hypervisor, the active hypervisor creates a service partition in which the new (updated) hypervisor is initialized. The other partitions, hosting the customer VMs, continue to run normally, uninterrupted.

 

Once the new-hypervisor initialization is complete, the active hypervisor can call into the new-hypervisor. Next, the active hypervisor creates a mirroring thread for each active partition, which replicates all state associated with the partition to the new-hypervisor. All partitions remain running while the mirroring threads reflect important state changes from the active hypervisor to the new-hypervisor. This mirrored state includes information such as memory ownership, partition lifecycle changes, device ownership, and so on.

 

All partitions are then temporarily suspended, and their state is saved into an internal hypervisor buffer to capture any state that has not already been mirrored. This phase is known as the "blackout" period, during which neither the host OS nor any guest VMs are running. Control of the physical machine is then passed to the new-hypervisor, which becomes the new active hypervisor. This blackout is well under a second, as you can see in the demo.

 

Finally, the active hypervisor restores the host OS and guest VMs, and their virtual processors resume execution. We can then reclaim memory that was used by the old hypervisor but is no longer needed by the new hypervisor. This allows us to perform repeated HHR operations without exhausting system resources.
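
To summarize the sequence, here is a purely illustrative C sketch of the phases described above. Every function is a made-up stub standing in for work that happens inside the Azure hypervisor; this is not the actual implementation, which is not public.

#include <stdio.h>

/* Illustrative stubs only: each stands in for a phase described in this post. */
static void init_new_hypervisor_in_service_partition(void)  { puts("service partition: new hypervisor initialized"); }
static void mirror_partition_state_to_new_hypervisor(void)  { puts("mirroring memory/device/lifecycle state"); }
static void suspend_partitions_and_save_residual_state(void){ puts("blackout: partitions suspended, residual state saved"); }
static void transfer_control_to_new_hypervisor(void)        { puts("new hypervisor is now the active hypervisor"); }
static void resume_host_and_guest_partitions(void)          { puts("host OS and guest VMs resumed"); }
static void reclaim_old_hypervisor_memory(void)             { puts("old hypervisor memory reclaimed"); }

int main(void)
{
    init_new_hypervisor_in_service_partition();      // step 1: new hypervisor boots in a service partition
    mirror_partition_state_to_new_hypervisor();      // step 2: mirroring threads run while VMs keep executing
    suspend_partitions_and_save_residual_state();    // step 3: start of the sub-second blackout
    transfer_control_to_new_hypervisor();            // step 4: control handoff
    resume_host_and_guest_partitions();              // step 5: end of blackout
    reclaim_old_hypervisor_memory();                 // step 6: allows repeated HHR operations
    return 0;
}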

 

To help visualize this process, we have created an animation that demonstrates Hypervisor Hot Restart. (Apologies for the low-quality GIF, the blogging platform has a low size limit. The original videos are attached to this blog post for offline viewing.)

 

HHR.gif

 

The development of Hypervisor Hot Restart enables easy deployment of new hypervisor versions with new features and capabilities without VM downtime. For example, we used Hypervisor Hot Restart to mitigate Retbleed, a side-channel vulnerability that can compromise data security in virtualized environments. We deployed the latest hypervisor with HyperClear to protect against Retbleed, marking the first use of Hypervisor Hot Restart in the Azure fleet, and rolled it out across the fleet with sub-second blackouts.

 

With that we conclude our look into the internals of Azure Host OS updates with the latest Hypervisor Hot Restart technology. Expect to see more of Azure Host and Windows internals in future blogs.  

 

Cheers,

Meghna, Hari, Bruce (on behalf of the entire Core OS Team)

Azure Host OS – Cloud Host

One Windows

 

Windows is a versatile and flexible operating system, running on a variety of machine architectures and available in multiple SKUs. It currently supports the x86, x64, and ARM architectures, and it even used to support Itanium, PowerPC, Alpha, and MIPS (wiki entry). Windows also runs in a multitude of environments: from data centers, laptops, and phones to embedded devices such as ATMs.

 

Even with all of this support, the core of Windows remains virtually unchanged across all these architectures and SKUs. Windows scales dynamically, depending on the architecture and the processor it runs on, to exploit the full power of the hardware. The same applies to Microsoft Azure. So, if you have ever wondered how Windows runs the Azure nodes in the data center, read ahead!

 

As Satya says, “we are building Azure as the world’s computer”, and powering the world’s computer shows the ability of Windows to scale up and scale out. To demonstrate this scale, here is a snapshot of taskmgr running directly on the Azure host of an M-series machine (the hardware behind some of the largest VMs available in Azure) in the data center, showing 896 logical processors.

 

Hari_Pulapaka_0-1672935896309.png

M-series taskmgr

 

In this post, we will talk about the internals of the Azure Host OS which powers the Azure hosts in the data center.

 

Cloud Host – Azure Host Operating System

Azure, of course, is Microsoft’s cloud computing service, providing IaaS (infrastructure as a service) virtual machines (VMs), PaaS (platform as a service) containers, and many other services (e.g., Azure Storage, Networking, etc.). For the IaaS and PaaS services, all customer code eventually ends up running in a virtual machine. Hence, at the core platform layer, the main purpose of the Azure Host operating system is to manage virtual machines, and to manage them really well! Managing VMs includes launching, shutting down, live migrating, and updating them, among other things.

 

Since Azure uses Windows as the operating system, all these VMs run as guests of Microsoft Hyper-V, our hypervisor. Hyper-V is a type 1 hypervisor, so when I say Azure Host operating system, it is technically the root operating system. This is the OS that has full control of the hardware and provides virtualization facilities to run guest VMs.

 

Keep in mind that the hypervisor we use is the same hypervisor that we use on Windows Client and Windows Server across millions of customer machines. We will have upcoming blog posts explaining some of the key features of Microsoft Hyper-V that allow Azure to securely and reliably manage guest VMs.

 

Cloud Host 

As I mentioned, the goal of the Azure Host OS is to be very good at managing the lifecycle of VMs. This means that Windows (aka the Azure Host OS) doesn’t need much of the functionality typically associated with Windows to do that job. Hence, we created a specially crafted, console-only (no GUI; some also call it headless) edition of Windows called Cloud Host.

 

This is a OneCore-based edition of Windows. OneCore is the base layer upon which all the families of Windows SKUs (or editions) build their functionality. It is the set of components (executables, DLLs, etc.) needed by all editions of Windows (PC, Windows Server, Xbox, or IoT). For a programming analogy, it is the base class from which all the Windows classes inherit (e.g., Object). If you look inside OneCore to see what functionality it provides, you can see API sets which provide core functionality such as the kernel, hypervisor, file system support, networking, security, and Win32 APIs. OneCoreUAP, called out in the picture below, is an example of a slightly higher layer used to build client PC editions; it adds the UWP programming surface, the GUI stack, and higher-level components such as the media stack and WiFi.

Hari_Pulapaka_0-1672937725585.png

Overview of some representative components available in OneCore

 

How did we build Cloud Host?

There is a minimal amount of code that needs to run on the Azure host to integrate with the control plane and to monitor and manage containers/VMs. Based on an analysis of the dependency set of this code, we identified the set of functionality (DLLs and API sets) that Azure needs on top of OneCore. This handful of binaries (tens of binaries) was then added to OneCore to produce the OS for the Azure host.

 

To add these DLLs, we created a brand-new SKU called Cloud Host and added all these binaries to it. You can think of Cloud Host as a “child class” of OneCore. Note that we had to create a new SKU, “Cloud Host”, because we needed to add new binaries on top of OneCore. We could have added them to OneCore directly, but it is cleaner to create a purpose-built SKU/edition while keeping OneCore unmodified. In other words, Cloud Host is a special-purpose SKU designed and built to run the Azure host nodes in the data center. You may be more familiar with other Windows SKUs, often referred to as editions, such as Pro, Enterprise, etc. [wiki]. Cloud Host is a similar SKU/edition, one that is used only for Azure nodes in the data center.

 

With that explanation, let’s look at Cloud Host itself. Here is a picture of the Cloud Host WIM file (a WIM file is like a zip file that stores the Windows image a machine boots from). You can see its size is 280 MB, more than 10 times smaller than a typical PC WIM file.

 

Hari_Pulapaka_0-1672937889367.png

 

That is significantly smaller than any Windows you use on your PC; a typical client Enterprise WIM file is about 3.6 GB in size.

 

Hari_Pulapaka_1-1672937889369.png

 

Cloud Host boots into a console shell, and the experience is similar to Windows Server Core. Here is a picture of a Cloud Host session from one of our test machines.

 

(Keep in mind, we do NOT typically log onto Azure Host Nodes, this is purely for demo purposes)

 

Hari_Pulapaka_0-1672938022848.png

Cloud Host with cmd shell, taskmgr and Regedit

 

Another thing you may have noticed is that taskmgr and regedit do not look the same as they do on Windows 11. This is because, as I mentioned, Cloud Host is built on OneCore and is headless (console based), so it doesn’t contain any of the GUI pieces of Windows. We have special taskmgr and regedit versions that don’t link against all the modern GUI functionality available in Windows 11, which gives them the “old style” look.

 

API: What kind of code can run on Azure Host nodes?

We can run C++, Python, and even Rust code on the Azure host. The main thing to keep in mind is that, as a developer building code to run on the Azure host (which only our internal developers do), you can only link against the OneCore SDK (onecore.lib). We have documented the API surface available in OneCore here, along with info on building against OneCore here.
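
To make that concrete, here is a trivial, hypothetical C example of the kind of code that can run on the Azure host: it sticks to core Win32 APIs that are part of the OneCore API surface and, per the docs referenced above, would be linked against onecore.lib rather than the usual per-DLL import libraries. The program itself is made up purely for illustration.

#include <windows.h>
#include <stdio.h>

int main(void)
{
    // GetNativeSystemInfo and GetTickCount64 are core APIs available through
    // the OneCore API sets; nothing here depends on the GUI stack.
    SYSTEM_INFO si;
    GetNativeSystemInfo(&si);

    printf("Logical processors: %lu\n", si.dwNumberOfProcessors);
    printf("System uptime (ms): %llu\n", (unsigned long long)GetTickCount64());
    return 0;
}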

 

Hari_Pulapaka_0-1672938173121.png

 

With that look into the internals of the Azure Cloud Host, future blog posts will dig into the code and design internals of updating the Azure Host (e.g., Tardigrade, VM PHU, Hypervisor Hot Restart, and Live Migration), kernel/virtualization features, security, and many more areas of the operating system platform.

 

Cheers,

 

Hari (on behalf of the entire Core OS Team)

Windows OS Platform

Welcome to the Windows Core OS platform team blog. We own the core of the Windows operating system: primarily the virtualization platform, silicon support (Intel, AMD, and Arm), the kernel, the storage file system, and hyperconverged storage infrastructure including Storage Spaces Direct. In addition, we build and deploy a specially crafted operating system based on Windows for the Azure cloud. We migrated our old blog, which focused purely on the Windows kernel, to cover the full stack of technology owned by the team.

 

The team will continue to use this blog to talk about the internals of upcoming features in the Windows Core OS platform space, very similar to my previous blog on Windows Kernel Internals. For our first post, we will talk about how we use all our technologies to provide a special-purpose host operating system for the Azure cloud, called Cloud Host.

 

Cheers

 

Hari

Securely donate CPU time with Windows Sandbox

12 December 2022 at 19:08

With Windows Sandbox, you can run any Win32 desktop application you wish with a pristine configuration every time you start it. It allows you to do virtually whatever you want within a secure, isolated desktop environment without requiring any cleanup afterward.

 

For example, Windows Sandbox allows you to contribute time on your Windows 10 PC towards fighting COVID-19. Here is how it works: using Windows Sandbox you can run the open-source Folding@Home app to help simulate protein dynamics. Folding@Home is one of the most popular distributed computing projects bringing together citizen scientists who volunteer to run simulations of protein dynamics on their personal computers to fight COVID-19 and other diseases. For more information about the project itself, please visit the Folding@Home Knowledge Base.

 

Folding@Home in Windows Sandbox

 

To do this, we have provided a simple PowerShell script that automatically downloads the latest Folding@Home client and launches it in Windows Sandbox. If Windows Sandbox is not enabled on your system, the script will enable the feature and reboot your system. After the reboot, launch the script again and it will start Windows Sandbox and run the Folding@Home client. The PowerShell script can be downloaded from our GitHub repository here.

 

PowerShell script

 

How to Get Involved  

 

We have also created a GitHub open-source repository to store this script and allow you to submit your own ideas for running applications in Windows Sandbox.

 

Have a suggestion for Windows Sandbox, or encountering issues? We welcome your feedback, which can be submitted through the Feedback Hub here.

 

Cheers,

Brandon Smith, Margarit Chenchev, Paul Bozzay, Hari Pulapaka, Judy Liu & Erick Smith

Understanding Hardware-enforced Stack Protection

12 December 2022 at 19:08

We aim to make Windows 10 one of the most secure operating systems for our customers, and to do that we are investing in a multitude of security features. Our philosophy is to build features that mitigate broad classes of vulnerabilities, ideally without requiring apps to change their code. In other words, getting an updated version of Windows 10 should make both the customer and the app safer. This comprehensive MSDN document shows all of the security-focused technologies we have built into Windows over the years and how they keep our customers safe. For further reading, here is a presentation by Matt Miller and David Weston that goes deeper into our security philosophy.

 

We are now exploring security features with deep hardware integration to further raise the bar against attacks. By integrating Windows and its kernel deeply with hardware, we make it difficult and expensive for attackers to mount large scale attacks.

 

ROP (return-oriented programming) based control-flow attacks have become a common form of attack, based on our own and the external research community’s investigations (Evolution of CFI attacks, Joe Bialek). Hence, they are the next logical point of focus for proactive, built-in Windows security mitigation technologies. In this post, we will describe our efforts to harden control-flow integrity in Windows 10 through hardware-enforced stack protection.

 

Memory safety vulnerabilities

 

The most common class of vulnerabilities found in systems software is memory safety vulnerabilities. This class includes buffer overruns, dangling pointers, uninitialized variables, and others.

 

A canonical example of a stack buffer overrun is copying data from one buffer to another without bounds checking (e.g., strcpy). If an attacker controls the data and size of the source buffer, the destination buffer and other important components of the stack (e.g., return addresses) can be corrupted to point to attacker-controlled code.

 

Buffer Overrun.PNG
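
To make the overrun concrete, here is a minimal, deliberately vulnerable C sketch (hypothetical code, for illustration only):

#include <string.h>
#include <stdio.h>

// Deliberately vulnerable: 'name' may be longer than 'buffer', and strcpy
// performs no bounds checking, so the copy can run past the end of 'buffer'
// and overwrite adjacent stack contents, including the saved return address.
static void greet(const char *name)
{
    char buffer[16];
    strcpy(buffer, name);            // stack buffer overrun if strlen(name) >= 16
    printf("Hello, %s\n", buffer);
}

int main(int argc, char **argv)
{
    greet(argc > 1 ? argv[1] : "world");
    return 0;
}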

 

Dangling pointers occur when the memory referenced by a pointer is de-allocated but a pointer to that memory still exists. In use-after-free exploits, the attacker can read or write through the dangling pointer, which now points to memory the programmer did not intend it to.
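
A minimal, hypothetical C sketch of a use-after-free follows; the structure and function names are made up for illustration:

#include <stdlib.h>
#include <stdio.h>

typedef struct {
    void (*on_close)(void);   // a code pointer: a high-value target for attackers
    char  label[56];
} widget_t;

static void safe_close(void) { puts("closing widget"); }

int main(void)
{
    widget_t *w = malloc(sizeof(*w));
    if (!w) return 1;
    w->on_close = safe_close;

    free(w);                   // the object is gone, but 'w' still holds its address

    // Use-after-free: if the freed allocation is reused (for example, for an
    // attacker-controlled buffer), 'on_close' may no longer point to safe_close,
    // and this indirect call transfers control wherever the attacker chose.
    // This is undefined behavior, shown purely to illustrate the class of bug.
    w->on_close();
    return 0;
}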

 

Uninitialized variables exist in some languages where variables can be declared without a value; in that case the memory holds whatever junk data was there before. If an attacker can read or write those contents, this also leads to unintended program behavior.
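
And a minimal, hypothetical C illustration of an uninitialized variable influencing control flow:

#include <stdio.h>

int main(void)
{
    int is_admin;              // declared but never initialized: holds leftover stack data

    // Bug: the branch depends on whatever junk happens to be in 'is_admin'.
    // If an attacker can influence earlier stack contents, they may be able to
    // steer this check. (Reading an uninitialized variable is undefined behavior.)
    if (is_admin) {
        puts("granting elevated access");
    } else {
        puts("normal access");
    }
    return 0;
}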

 

These are popular techniques attackers can utilize to gain control and run arbitrary native code on target machines.

 

Arbitrary Code Execution

 

We frame our strategy for mitigating arbitrary code execution in the form of four pillars:

 

Arbitrary Code Execution Strategy.jpg

 

Code Integrity Guard (CIG) prevents arbitrary code generation by enforcing signature requirements for loading binaries.

 

Arbitrary Code Guard (ACG) ensures signed pages are immutable and dynamic code cannot be generated, thus guaranteeing the integrity of binaries loaded.

 

With the introduction of CIG/ACG, attackers increasingly resort to control flow hijacking via indirect calls and returns, known as call/jump oriented programming (COP/JOP) and return oriented programming (ROP).

 

We shipped Control Flow Guard (CFG) in Windows 10 to enforce integrity on indirect calls (forward-edge CFI). Hardware-enforced Stack Protection will enforce integrity on return addresses on the stack (backward-edge CFI), via Shadow Stacks.

 

The ROP problem

 

In systems software, if an attacker finds a memory safety vulnerability in code, the return address can be hijacked to target an attacker-defined address. From here it is difficult to directly execute a malicious payload on Windows, thanks to existing mitigations including Data Execution Prevention (DEP) and Address Space Layout Randomization (ASLR), but control can be transferred to snippets of code (gadgets) already present in executable memory. Attackers find gadgets that end with the RET instruction (or other branches) and chain multiple gadgets to perform a malicious action (such as turning off a mitigation), with the end goal of running arbitrary native code.

 

Return Oriented Programming.PNG

 

Hardware-enforced stack protection in Windows 10 

 

Keep in mind, Hardware-enforced stack protection will only work on chipsets with support for hardware shadow stacks, Intel’s Control-flow Enforcement Technology (CET) or AMD shadow stacks. Here is an Intel whitepaper with more information on CET.

 

In this post, we will describe only the relevant parts of the Windows 10 implementation. This technology mirrors the program call stack by keeping a record of all return addresses on a shadow stack. On every CALL instruction, the return address is pushed onto both the call stack and the shadow stack, and on RET instructions, a comparison is made to ensure integrity is not compromised.

 

If the addresses do not match, the processor issues a control protection (#CP) exception. This traps into the kernel and we terminate the process to guarantee security.

 

Shadow Stacks.PNG

 

Shadow stacks store only return addresses, which helps minimize the additional memory overhead.

 

Control-flow Enforcement Technology (CET) Shadow Stacks

 

Shadow stack compliant hardware provides extensions to the architecture by adding instructions to manage shadow stacks and hardware protection of shadow stack pages.

 

The hardware has a new register, SSP, which holds the Shadow Stack Pointer address. The hardware also has page table extensions to identify shadow stack pages and protect those pages against attacks.

 

New instructions are added for management of shadow stack pages, including:

  • INCSSP – increment SSP (i.e. to unwind shadow stack)
  • RDSSP – read SSP into general purpose register
  • SAVEPREVSSP/RSTORSSP – save/restore shadow stack (i.e. thread switching)

The full hardware implementation is documented in Intel’s CET manual.

 

Compiling for Hardware-enforced Stack Protection

 

To receive hardware-enforced stack protection for your application, there is a new linker flag which sets a bit in the PE header to request protection from the kernel for the executable.

 

If the application sets this bit and is running on a supported Windows build and shadow stack-compliant hardware, the Kernel will maintain shadow stacks throughout the runtime of the program. If your Windows version or the hardware does not support shadow stacks, then the PE bit is ignored.
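
As a side note, a process can check at runtime whether user-mode shadow stacks are active for it. The following is a hedged sketch that assumes a Windows SDK recent enough to define ProcessUserShadowStackPolicy and PROCESS_MITIGATION_USER_SHADOW_STACK_POLICY; it is not part of the opt-in mechanism itself (which is the PE bit set at link time):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    // Assumes an SDK that defines the user shadow stack mitigation policy.
    PROCESS_MITIGATION_USER_SHADOW_STACK_POLICY policy = { 0 };

    if (GetProcessMitigationPolicy(GetCurrentProcess(),
                                   ProcessUserShadowStackPolicy,
                                   &policy, sizeof(policy))) {
        printf("User shadow stack enabled: %s\n",
               policy.EnableUserShadowStack ? "yes" : "no");
    } else {
        // Older OS builds do not recognize this policy class.
        printf("Shadow stack policy not supported on this OS (error %lu)\n",
               GetLastError());
    }
    return 0;
}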

 

By making this an opt-in feature of Windows, we are allowing developers to first validate and test their app with hardware-enforced stack protection before releasing it.

 

The Hardware-enforced Stack Protection feature is under development, and an early preview is available in Windows 10 Insider Preview builds (Fast ring). If you have Intel CET-capable hardware, you can enable the above linker flag on your application and test with the latest Windows 10 Insider builds.

 

Conclusion

 

Hardware-enforced Stack Protection offers robust protection against ROP exploits since it maintains a record of the intended execution flow of a program. To ensure smooth ecosystem adoption and application compatibility, Windows will offer this protection as an opt-in model, so developers can adopt it at their own pace.

 

We will provide ongoing guidance on how to re-build your application to be shadow stack compliant. In our next post, we will dig deeper into best practices, as well as provide technical documentation. This protection will be a major step forward in our continuous efforts to make Windows 10 one of the most secure operating systems for our customers.

 

Kernel protection team - Jin Lin, Jason Lin, Niraj Majmudar and Greg Colombo

 

DTrace on Windows – 20H1 updates

12 December 2022 at 19:08

We first released DTrace on Windows as a preview with the Windows 10 May 2019 Update. The feedback and reaction from our community was very gratifying. Thank you for taking the time to use DTrace on Windows and providing us with valuable feedback.

 

We have been quiet since the initial preview release, and today we are ready to talk about the updates we have made to DTrace on Windows. All of these changes are available in the latest Windows 10 Insider Preview (20H1) build, starting with 19041.21.

 

With these changes, we are now positioned to have customers broadly use DTrace on Windows.

 

Key resources

  1. DTrace on Windows developer docs
  2. GitHub for source code and sample scripts
  3. DTrace MSI

 

Removed kernel debugger requirement

This was the biggest hindrance to using DTrace on Windows, both internally and externally. We knew going in that we needed to solve this, but we also knew that it would take time to solve it correctly. In 20H1, we have removed the kernel debugger requirement. The Windows kernel now relies on Virtualization-based Security (VBS) to securely insert dynamic trace points into kernel code. By relying on VBS, we can safely and securely insert dynamic tracepoints in the kernel without disabling PatchGuard (enabling the kernel debugger disables PatchGuard).

 

Note: Because we made the change to rely on VBS for DTrace on Windows, the installer from 19H1 will only work on 19H1. For Windows 10 Insider Preview (post 19H1) builds, please use the updated installer linked in this post. This installer will NOT install on previous Windows 10 releases.

 

Let’s get into how to set up and use DTrace on Windows.

 

Prerequisites for using the feature:

 

  • Windows 10 insider build 19041.21 or higher

Detailed instructions to install DTrace is available in our documentation. At a high-level, these are:

 

  1. Enable boot option to turn on DTrace
  2. Download and install the DTrace MSI.
  3. Ensure VBS is turned on  
  4. Optional: Update the PATH environment variable to include C:\Program Files\DTrace
    • set PATH=%PATH%;"C:\Program Files\DTrace"
  5. Setup symbol path
    • Create a new directory for caching symbols locally. Example: mkdir c:\symbols
    • Set _NT_SYMBOL_PATH=srv*C:\symbols*https://msdl.microsoft.com/download/symbols
    • DTrace automatically downloads the symbols necessary from the symbol server and caches to the local path.
  6. Reboot machine

To check whether VBS is enabled, look at the System Summary tab in the Microsoft System Information tool (msinfo32.exe).

 

Msinfo32

ARM64 preview

Yes, that’s right! DTrace now supports ARM64 in preview mode. The ARM64 MSI is available in the download link listed above.

 

You can use it on your Surface Pro X running the latest Windows 10 Insider Preview (20H1) build, starting with 19041.21.

 

DTrace on Surface Pro X

User mode Stackwalk

In the preview, the stackwalk facility in DTrace was limited to kernel mode (stack). This update adds support for a user-mode stackwalk facility (ustack). Like stack, the ustack facility is fully compatible with the open-source DTrace specification. It can be invoked in three ways: by specifying frames (depth) and size (currently ignored), or with no arguments.

 

  • ustack(nframes, size)
  • ustack(nframes)
  • ustack()

While ustack() can determine the address of each calling frame when the probe fires, the stack frames are not translated into symbols until the ustack() action is processed in user mode by the DTrace consumer. Symbol download can slow down the output, so it’s better to use this facility with locally cached symbols, as shown below.

 

 

dtrace -n "profile-1ms /arg1/ {ustack(50, 0); exit(0);} " -y C:\symbols dtrace: description 'profile-1ms ' matched 1 probe CPU ID FUNCTION:NAME 0 3802 :profile-1ms ntdll`ZwAllocateVirtualMemory+0x14 ntdll`RtlAllocateHeap+0x3ded ntdll`RtlAllocateHeap+0x763 ucrtbase`malloc_base+0x44

 

 

Live dump support

Windows commonly uses something called a live dump to help quickly diagnose issues. Live dumps help with troubleshooting issues involving multiple processes, or system-wide issues, without downtime. In 20H1, DTrace on Windows can be used to capture a live dump from inside a D-script using the lkd() DTrace facility. A common use case of this facility is to instrument an error path (e.g., a return code indicating failure) and capture a live dump right at the failure point for advanced diagnostics. For more information on live dump support, see DTrace Live Dump.

 

 

dtrace -wn "syscall:::return { if (arg0 != 0xc0000001UL) { lkd(0); printf(\" Triggering Live dump \n \");exit(0); }}" dtrace: description 'syscall:::return ' matched 1411 probes dtrace: allowing destructive actions CPU ID FUNCTION:NAME 0 181 NtDeviceIoControlFile:return Triggering Live dump dir c:\Windows\LiveKernelReports Volume in drive C has no label. Volume Serial Number is 70F4-B9F6 Directory of c:\Windows\LiveKernelReports 11/05/2019 05:20 PM <DIR> . 11/05/2019 05:20 PM <DIR> .. 11/05/2019 05:19 PM <DIR> DTRACE 11/05/2019 05:20 PM 53,395,456 DTRACE-20191105-1720.dmp

 

 

ETW Tracing

ETW tracing is the most frequently used tool for debugging on Windows. In the DTrace on Windows 19H1 preview, we added support for instrumenting TraceLogging and manifested events using the ETW provider.

 

In 20H1, we further enhanced this facility to create new ETW events on the fly from inside a D-script using the ETW_Trace() facility. This helps in situations where existing ETW events are insufficient and you would like to add additional ETW trace points without modifying production code.

 

For more information about ETW_Trace facility and ETW provider, see DTrace ETW

 

 

/* Running the GitHub ETW provider sample (link below) to print a node memory info event.
   https://github.com/microsoft/DTrace-on-Windows/blob/master/samples/windows/etw/numamemstats.d */

dtrace -qs numamemstats.d

Partition ID: 0
Count: 1
Node number: 1
m_nodeinfo {
    uint64_t TotalPageCount = 0x1fb558
    uint64_t SmallFreePageCount = 0x41
    uint64_t SmallZeroPageCount = 0
    uint64_t MediumFreePageCount = 0
    uint64_t MediumZeroPageCount = 0
    uint64_t LargeFreePageCount = 0
    uint64_t LargeZeroPageCount = 0
    uint64_t HugeFreePageCount = 0
    uint64_t HugeZeroPageCount = 0
}

 

 

 

This concludes a tour of some of our key updates to DTrace on Windows for 20H1.

 

You can get started by downloading & installing the DTrace MSI package on the latest 20H1 client/server insider build - 19041.21+.

 

You can also visit our GitHub page for contributing code and samples. We have several advanced scripts in GitHub to help users learn and use DTrace on Windows.

 

How to file feedback?

As always, we rely on feedback from our users to help improve the product. If you hit any problems or bugs, please use Feedback hub to let us know:

 

  1. Launch feedback hub by clicking this link
  2. Select Add new feedback.
  3. Please provide a detailed description of the issue.
  4. Currently, we do not automatically collect any debug traces, so your verbatim feedback is crucial for understanding and reproducing the issue. Pass on any verbose logs.
  5. You can also set the DTRACE_DEBUG environment variable to 1 to collect verbose DTrace logs.
  6. Submit

 

We are excited to roll out these changes and look forward to working with the community to continue improving the DTrace experience.

 

DTrace team (Andrey Shedel, Gopikrishna Kannan, Max Renke, Hari Pulapaka)

DTrace on Windows

12 December 2022 at 19:08

Here at Microsoft, we are always looking to engage with open-source communities to produce better solutions for the community and our customers. One of the more useful debugging advances to arrive in the last decade is DTrace. DTrace of course needs no introduction: it’s a dynamic tracing framework that allows an admin or developer to get a real-time look into a system in either user or kernel mode. DTrace has a C-style, high-level, and powerful programming language that allows you to dynamically insert trace points. Using these dynamically inserted trace points, you can filter on conditions or errors, write code to analyze lock patterns, detect deadlocks, etc. ETW, while powerful, is static and does not provide the ability to programmatically insert trace points at runtime.

 

There are a lot of websites and resources from the community for learning about DTrace. One of the most comprehensive is the Dynamic Tracing Guide HTML book available on the dtrace.org website. This ebook describes DTrace in detail and is the authoritative guide for DTrace. We also have Windows-specific examples below which provide more info.

 

Starting in 2016, the OpenDTrace effort began on GitHub, aiming to ensure a portable implementation of DTrace for different operating systems. We decided to add support for DTrace on Windows using this OpenDTrace port.

 

We have created a Windows branch for “DTrace on Windows” under the OpenDTrace project on GitHub. All the changes we made to support DTrace on Windows are available there. Over the next few months, we plan to work with the OpenDTrace community to merge our changes. All our source code is also available at the third-party sources website maintained by Microsoft.

 

Without further ado, let’s get into how to setup and use DTrace on Windows.

 

Install and Run DTrace

Prerequisites for using the feature

  • Windows 10 insider build 18342 or higher
  • Only available on x64 Windows and captures tracing info only for 64-bit processes
  • Windows Insider Program is enabled and configured with valid Windows Insider Account
    • Visit Settings->Update & Security->Windows Insider Program for details

Instructions:

  1. BCD configuration set:
    1. bcdedit /set dtrace on
    2. Note, you need to set the bcdedit option again, if you upgrade to a new Insider build
  2. Download and install the DTrace package from download center.
    1. This installs the user mode components, drivers and additional feature on demand packages necessary for DTrace to be functional.
  3. Optional: Update the PATH environment variable to include C:\Program Files\DTrace
    1. set PATH=%PATH%;"C:\Program Files\DTrace"
  4. Setup symbol path
    1. Create a new directory for caching symbols locally. Example: mkdir c:\symbols
    2. Set _NT_SYMBOL_PATH=srv*C:\symbols*https://msdl.microsoft.com/download/symbols
    3. DTrace automatically downloads the symbols necessary from the symbol server and caches to the local path.
  5. Optional: Setup Kernel debugger connection to the target machine (MSDN link). This is only required if you want to trace Kernel events using FBT or other providers.
    1. Note that you will need to disable Secureboot and Bitlocker on C:, (if enabled), if you want to setup a kernel debugger. 
  6. Reboot target machine

 

Running DTrace

Launch CMD prompt in administrator mode

 

Get started with sample one-liners:

 

# Syscall summary by program for 5 seconds: 
dtrace -Fn "tick-5sec { exit(0);} syscall:::entry{ @num[pid,execname] = count();} "
 
# Summarize timer set/cancel program for 3 seconds: 
dtrace -Fn "tick-3sec { exit(0);} syscall::Nt*Timer*:entry { @[probefunc, execname, pid] = count();}"
 
# Dump System Process kernel structure: (requires symbol path to be set)
dtrace -n "BEGIN{print(*(struct nt`_EPROCESS *) nt`PsInitialSystemProcess);exit(0);}"
 
# Tracing paths through NTFS when running notepad.exe (requires KD attach): Run below command and launch notepad.exe
dtrace -Fn "fbt:ntfs::/execname==\"notepad.exe\"/{}"

 

The command dtrace -lvn syscall::: will list all the probes and their parameters available from the syscall provider.

 

The following are some of the providers available on Windows and what they instrument.

  • syscall – NTOS system calls
  • fbt (Function Boundary Tracing) – Kernel function entry and returns
  • pid – User-mode process tracing. Like kernel-mode FBT, but also allowing the instrumentation of arbitrary function offsets.
  • etw (Event Tracing for Windows) – Allows probes to be defined for ETW events. This provider helps to leverage existing operating system instrumentation in DTrace.
    • This is one addition we have done to DTrace to allow it to expose and gain all the information that Windows already provides in ETW.

We have more sample scripts applicable to Windows scenarios in the samples directory of the source.

 

How to file feedback?

DTrace on Windows is very different from our typical features on Windows and we are going to rely on our Insider community to guide us. If you hit any problems or bugs, please use Feedback hub to let us know.

 

  1. Launch feedback hub by clicking this link
  2. Select Add new feedback.
  3. Please provide a detailed description of the issue or suggestion.
    1. Currently, we do not automatically collect any debug traces, so your verbatim feedback is crucial for understanding and reproducing the issue. Pass on any verbose logs.
    2. You can set the DTRACE_DEBUG environment variable to 1 to collect verbose DTrace logs.
  4. Submit

 

DTrace Architecture

Let’s talk a little about the internals and architecture of how we support DTrace. As mentioned, DTrace on Windows is a port of OpenDTrace and reuses much of its user-mode components and architecture. Users interact with DTrace through the dtrace command, which is a generic front-end to the DTrace engine. D scripts get compiled to an intermediate format (DIF) in user space and sent to the DTrace kernel component for execution, sometimes called the DIF virtual machine. This runs in the dtrace.sys driver.

 

Traceext.sys (trace extension) is a new kernel extension driver we added, which allows Windows to expose functionality that DTrace relies on to provide tracing. The Windows kernel provides callouts during stackwalks or memory accesses, which are then implemented by the trace extension.

 

All APIs and functionality used by dtrace.sys are documented calls.

dtrace.png

Security

Security of Windows is key for our customers, and the security model of DTrace makes it ideally suited to Windows. The DTrace guide linked above talks about DTrace security and performance impact, and that section is useful reading for anyone interested in this space. At a high level, DTrace uses an intermediate form which is validated for safety and runs in its own execution environment (think C# or Java). This execution environment also handles any run-time errors to avoid crashing the system. In addition, the cost of having a probe is minimal and should not visibly affect system performance unless you enable too many probes in performance-sensitive paths.

 

DTrace on Windows also leverages the Windows security model in useful ways to enhance its security for our customers.

 

  1. To connect to the DTrace trace engine, your account needs to be part of the admin or LocalSystem group
  2. Events originating from kernel mode (FBT, syscalls with ‘kernel’ previous mode, etc.), are only traceable if Kernel debugger is attached
  3. To read kernel-mode memory (probe parameters for kernel-mode originated events, kernel-mode global variables, etc.), the following must be true:
    1. DTrace session security context has either TCB or LoadDriver privilege enabled.
    2. Secure Boot is not active.
  4. To trace a user-mode process, the user needs to have:
    1. Debug privilege
    2. DEBUG access to the target process.

 

Script signing

In addition, we have also updated DTrace on Windows to support signing of D scripts. We follow the same model as PowerShell for script signing.

 

There is a system-wide DTrace script signing policy knob which controls whether signatures are checked on DTrace scripts. This policy knob is controlled through the registry.

 

By default, we do NOT check for signatures on DTrace scripts.

 

Use the following registry keys to enforce policy at machine or user level.

  • User Scope: HKCU\Software\OpenDTrace\Dtrace, ExecutionPolicy, REG_SZ
  • Machine Scope: HKLM\Software\OpenDTrace\Dtrace, ExecutionPolicy, REG_SZ

 

Policy Values:

DTrace policy takes the following values.

 

  • "Bypass": Do not perform signature checks. This is the default policy. Only set the registry key if you want to deviate from this policy.
  • "Unrestricted": Do not perform checks on local files; allow the user's consent to use unsigned remote files.
  • "RemoteSigned": Do not perform checks on local files; require a valid and trusted signature for remote files.
  • "AllSigned": Require a valid and trusted signature for all files.
  • "Restricted": The script file must be installed as a system component and have a signature from a trusted source.

You can also set policy by defining the environment variable DTRACE_EXECUTION_POLICY to the required value.

 

Conclusion

We are very excited to release the first version of DTrace on Windows. We look forward to feedback from the Windows Insider community.

 

Cheers,

DTrace Team (Andrey Shedel, Gopikrishna Kannan, & Hari Pulapaka)

 

Windows Sandbox - Config Files

12 December 2022 at 19:07

Since the initial announcement of Windows Sandbox, we have received overwhelmingly positive feedback. Thank you for your support! We are glad that this feature resonates with the Windows community. 

 

One of the most requested features from our customers is the ability to automatically launch an app or script in the sandbox. Coincidentally, this also aligned with our feature roadmap and is now available in Windows Insider builds. 

 

Windows Sandbox now has support for simple configuration files (.wsb file extension), which provide minimal scripting support. You can use this feature in the latest Windows Insider build 18342.  

 

As always, we rely on your feedback to build features allowing our users to achieve more. 

 

NOTE: This functionality is still in development and subject to change.

 

Overview

Sandbox configuration files are formatted as XML, and are associated with Windows Sandbox via the .wsb file extension. A configuration file allows the user to control the following aspects of Windows Sandbox:

 

  1. vGPU (virtualized GPU)
    • Enable or Disable the virtualized GPU. If vGPU is disabled, Sandbox will use WARP (software rasterizer).
  2. Networking
    • Enable or Disable network access to the Sandbox.
  3. Shared folders
    • Share folders from the host with read or write permissions. Note that exposing host directories may allow malicious software to affect your system or steal data.
  4. Startup script
    • Logon action for the sandbox.

 

SandboxConfigFile.png

 

As demonstrated in the examples below, configuration files can be used to granularly control Windows Sandbox for enhanced isolation.

 

Double click a config file to open it in Windows Sandbox, or invoke it via the command line as shown:

 

C:\Temp> MyConfigFile.wsb

 

Keywords, values and limits

 

VGpu

Enables or disables GPU sharing.

 

<VGpu>value</VGpu> 

 

Supported values:

  • Disable – disables vGPU support in the sandbox. If this value is set, Windows Sandbox will use software rendering, which can be slower than a virtualized GPU.
  • Default – this is the default value for vGPU support; currently this means vGPU is enabled.

Note: Enabling virtualized GPU can potentially increase the attack surface of the sandbox.

 

Networking

Enables or disables networking in the sandbox. Disabling network access can be used to decrease the attack surface exposed by the Sandbox.

 

<Networking>value</Networking>

 

Supported values:

  • Disable – disables networking in the sandbox.
  • Default – this is the default value for networking support. This enables networking by creating a virtual switch on the host and connecting the sandbox to it via a virtual NIC.

 Note: Enabling networking can expose untrusted applications to your internal network.

 

MappedFolders

Wraps a list of MappedFolder objects.

 

<MappedFolders>
list of MappedFolder objects
</MappedFolders>

 

Note: Files and folders mapped in from the host can be compromised by apps in the Sandbox or potentially affect the host. 

 

MappedFolder 

Specifies a single folder on the host machine which will be shared on the container desktop. Apps in the Sandbox are run under the user account “WDAGUtilityAccount”. Hence, all folders are mapped under the following path: C:\Users\WDAGUtilityAccount\Desktop.

 

E.g. “C:\Test” will be mapped as “C:\users\WDAGUtilityAccount\Desktop\Test”.

 

<MappedFolder>
    <HostFolder>path to the host folder</HostFolder>
    <ReadOnly>value</ReadOnly>
</MappedFolder>

 

HostFolder: Specifies the folder on the host machine to share into the sandbox. Note that the folder must already exist on the host; the container will fail to start if the folder is not found.

 

ReadOnly: If true, enforces read-only access to the shared folder from within the container. Supported values: true/false.

 

Note: Files and folders mapped in from the host can be compromised by apps in the Sandbox or potentially affect the host.

 

LogonCommand

Specifies a single Command which will be invoked automatically after the container logs on.

 

<LogonCommand>
   <Command>command to be invoked</Command>
</LogonCommand>

 

Command: A path to an executable or script inside of the container that will be executed after login.

 

Note: Although very simple commands will work (launching an executable or script), more complicated scenarios involving multiple steps should be placed into a script file. This script file may be mapped into the container via a shared folder, and then executed via the LogonCommand directive.

 

Example 1:

The following config file can be used to easily test downloaded files inside of the sandbox. To achieve this, the configuration disables networking and vGPU, and restricts the shared downloads folder to read-only access in the container. For convenience, the logon command opens the downloads folder inside of the container when it is started.

 

Downloads.wsb

<Configuration>
<VGpu>Disable</VGpu>
<Networking>Disable</Networking>
<MappedFolders>
   <MappedFolder>
     <HostFolder>C:\Users\Public\Downloads</HostFolder>
     <ReadOnly>true</ReadOnly>
   </MappedFolder>
</MappedFolders>
<LogonCommand>
   <Command>explorer.exe C:\users\WDAGUtilityAccount\Desktop\Downloads</Command>
</LogonCommand>
</Configuration>

 

Example 2

The following config file installs Visual Studio Code in the container, which requires a slightly more complicated LogonCommand setup.

 

Two folders are mapped into the container; the first (SandboxScripts) contains VSCodeInstall.cmd, which will install and run VSCode. The second folder (CodingProjects) is assumed to contain project files that the developer wants to modify using VSCode.

 

With the VSCode installer script already mapped into the container, the LogonCommand can reference it.

 

VSCodeInstall.cmd

REM Download VSCode
curl -L "https://update.code.visualstudio.com/latest/win32-x64-user/stable" --output C:\users\WDAGUtilityAccount\Desktop\vscode.exe
 
REM Install and run VSCode
C:\users\WDAGUtilityAccount\Desktop\vscode.exe /verysilent /suppressmsgboxes

 

VSCode.wsb

<Configuration>
<MappedFolders>
   <MappedFolder>
     <HostFolder>C:\SandboxScripts</HostFolder>
     <ReadOnly>true</ReadOnly>
   </MappedFolder>
   <MappedFolder>
     <HostFolder>C:\CodingProjects</HostFolder>
     <ReadOnly>false</ReadOnly>
   </MappedFolder>
</MappedFolders>
<LogonCommand>
   <Command>C:\users\wdagutilityaccount\desktop\SandboxScripts\VSCodeInstall.cmd</Command>
</LogonCommand>
</Configuration>

 

Conclusion

We look forward to your feedback.

 

Cheers,

Margarit Chenchev, Erick Smith, Paul Bozzay, Deepti Bhardwaj & Hari Pulapaka

(Windows Sandbox team) 

Windows Sandbox

12 December 2022 at 19:07

Windows Sandbox is a new lightweight desktop environment tailored for safely running applications in isolation.

 

How many times have you downloaded an executable file, but were afraid to run it? Have you ever been in a situation which required a clean installation of Windows, but didn’t want to set up a virtual machine?

 

At Microsoft we regularly encounter these situations, so we developed Windows Sandbox: an isolated, temporary, desktop environment where you can run untrusted software without the fear of lasting impact to your PC. Any software installed in Windows Sandbox stays only in the sandbox and cannot affect your host. Once Windows Sandbox is closed, all the software with all its files and state are permanently deleted.

 

Windows Sandbox has the following properties:

  • Part of Windows – everything required for this feature ships with Windows 10 Pro and Enterprise. No need to download a VHD!
  • Pristine – every time Windows Sandbox runs, it’s as clean as a brand-new installation of Windows
  • Disposable – nothing persists on the device; everything is discarded after you close the application
  • Secure – uses hardware-based virtualization for kernel isolation, which relies on Microsoft’s hypervisor to run a separate kernel that isolates Windows Sandbox from the host
  • Efficient – uses integrated kernel scheduler, smart memory management, and virtual GPU

 

Prerequisites for using the feature

  • Windows 10 Pro or Enterprise Insider build 18305 or later
  • AMD64 architecture
  • Virtualization capabilities enabled in BIOS
  • At least 4GB of RAM (8GB recommended)
  • At least 1 GB of free disk space (SSD recommended)
  • At least 2 CPU cores (4 cores with hyperthreading recommended)

 

Quick start

  1. Install Windows 10 Pro or Enterprise, Insider build 18305 or newer
  2. Enable virtualization:
    • If you are using a physical machine, ensure virtualization capabilities are enabled in the BIOS.
    • If you are using a virtual machine, enable nested virtualization with this PowerShell cmdlet:
    • Set-VMProcessor -VMName <VMName> -ExposeVirtualizationExtensions $true
  3. Open Windows Features, and then select Windows Sandbox. Select OK to install Windows Sandbox. You might be asked to restart the computer.
  4. Using the Start menu, find Windows Sandbox, run it and allow the elevation
  5. Copy an executable file from the host
  6. Paste the executable file in the window of Windows Sandbox (on the Windows desktop)
  7. Run the executable in the Windows Sandbox; if it is an installer go ahead and install it
  8. Run the application and use it as you normally do
  9. When you’re done experimenting, you can simply close the Windows Sandbox application. All sandbox content will be discarded and permanently deleted
  10. Confirm that the host does not have any of the modifications that you made in Windows Sandbox.

 Windows Sandbox Screenshot - open.jpg

 

Windows Sandbox respects the host diagnostic data settings. All other privacy settings are set to their default values.

 

Windows Sandbox internals

Since this is the Windows Kernel Internals blog, let’s go under the hood. Windows Sandbox builds on the technologies used within Windows Containers. Windows containers were designed to run in the cloud. We took that technology, added integration with Windows 10, and built features that make it more suitable to run on devices and laptops without requiring the full power of Windows Server.

 

Some of the key enhancements we have made include:

 

Dynamically generated Image

At its core Windows Sandbox is a lightweight virtual machine, so it needs an operating system image to boot from. One of the key enhancements we have made for Windows Sandbox is the ability to use a copy of the Windows 10 installed on your computer, instead of downloading a new VHD image as you would have to do with an ordinary virtual machine.

 

We want to always present a clean environment, but the challenge is that some operating system files can change. Our solution is to construct what we refer to as a “dynamic base image”: an operating system image that has clean copies of files that can change, but links to the files that cannot change in the Windows image that already exists on the host. The majority of the files are links to immutable files, which is why the image is so small (~100 MB) for a full operating system. We call this instance the “base image” for Windows Sandbox, using Windows Container parlance.

 

When Windows Sandbox is not installed, we keep the dynamic base image in a compressed package which is only 25 MB. When installed, the dynamic base package occupies about 100 MB of disk space.

 Dynamic Image.PNG

Smart memory management

Memory management is another area where we have integrated with the Windows kernel. Microsoft’s hypervisor allows a single physical machine to be carved up into multiple virtual machines which share the same physical hardware. While that approach works well for traditional server workloads, it isn’t as well suited to running on devices with more limited resources. We designed Windows Sandbox in such a way that the host can reclaim memory from the sandbox if needed.

 

Additionally, since Windows Sandbox is basically running the same operating system image as the host, we also allow Windows Sandbox to use the same physical memory pages as the host for operating system binaries, via a technology we refer to as “direct map”. In other words, the same executable pages of ntdll are mapped into the sandbox as on the host. We take care to ensure this is done in a secure manner and no secrets are shared.

 Direct Map.PNG

Integrated kernel scheduler

With ordinary virtual machines, Microsoft’s hypervisor controls the scheduling of the virtual processors running in the VMs. However, for Windows Sandbox we use a new technology called “integrated scheduler” which allows the host to decide when the sandbox runs. 

 

For Windows Sandbox we employ a unique scheduling policy that allows the virtual processors of the sandbox to be scheduled in the same way as threads would be scheduled for a process. High-priority tasks on the host can preempt less important work in the sandbox. The benefit of using the integrated scheduler is that the host manages Windows Sandbox as a process rather than a virtual machine which results in a much more responsive host, similar to Linux KVM.

 

The whole goal here is to treat the Sandbox like an app but with the security guarantees of a Virtual Machine. 

 

Snapshot and clone

As stated above, Windows Sandbox uses Microsoft’s hypervisor. We’re essentially running another copy of Windows which needs to be booted, and this can take some time. So rather than paying the full cost of booting the sandbox operating system every time we start Windows Sandbox, we use two other technologies: “snapshot” and “clone.”

 

Snapshot allows us to boot the sandbox environment once and preserve the memory, CPU, and device state to disk. Then, when we need a new instance of Windows Sandbox, we can restore the sandbox environment from disk into memory rather than booting it. This significantly improves the start time of Windows Sandbox.

 

Graphics virtualization

Hardware accelerated rendering is key to a smooth and responsive user experience, especially for graphics-intense or media-heavy use cases. However, virtual machines are isolated from their hosts and unable to access advanced devices like GPUs. The role of graphics virtualization technologies, therefore, is to bridge this gap and provide hardware acceleration in virtualized environments; e.g. Microsoft RemoteFX.

 

More recently, Microsoft has worked with our graphics ecosystem partners to integrate modern graphics virtualization capabilities directly into DirectX and WDDM, the driver model used by display drivers on Windows.

 

At a high level, this form of graphics virtualization works as follows:

  • Apps running in a Hyper-V VM use graphics APIs as normal.
  • Graphics components in the VM, which have been enlightened to support virtualization, coordinate across the VM boundary with the host to execute graphics workloads.
  • The host allocates and schedules graphics resources among apps in the VM alongside the apps running natively. Conceptually they behave as one pool of graphics clients.

This process is illustrated below:

 

GPU virtualization for Sandbox - diagram.png 

 

This enables the Windows Sandbox VM to benefit from hardware accelerated rendering, with Windows dynamically allocating graphics resources where they are needed across the host and guest. The result is improved performance and responsiveness for apps running in Windows Sandbox, as well as improved battery life for graphics-heavy use cases.

 

To take advantage of these benefits, you’ll need a system with a compatible GPU and graphics drivers (WDDM 2.5 or newer). Incompatible systems will render apps in Windows Sandbox with Microsoft’s CPU-based rendering technology.

 

Battery pass-through

Windows Sandbox is also aware of the host’s battery state, which allows it to optimize power consumption. This is critical for a technology that will be used on laptops, where conserving battery is important to the user.

 

Filing bugs and suggestions

As with any new technology, there may be bugs. Please file them so that we can continually improve this feature. 

 

File bugs and suggestions at Windows Sandbox's Feedback Hub (select Add new feedback), or follow these steps:

  1. Open the Feedback Hub.
  2. Select Report a problem or Suggest a feature.
  3. Fill in the Summarize your feedback and Explain in more details boxes with a detailed description of the issue or suggestion.
  4. Select an appropriate category and subcategory by using the dropdown menus. There is a dedicated option in Feedback Hub to file "Windows Sandbox" bugs and feedback; it is located under the "Security and Privacy" category, subcategory "Windows Sandbox".
  5. Select Next.
  6. If necessary, you can collect traces for the issue as follows: select the Recreate my problem tile, then select Start capture, reproduce the issue, and then select Stop capture.
  7. Attach any relevant screenshots or files for the problem.
  8. Select Submit.

Conclusion

We look forward to you using this feature and receiving your feedback!

 

Cheers, 

Hari Pulapaka, Margarit Chenchev, Erick Smith, & Paul Bozzay

(Windows Sandbox team)

One Windows Kernel

12 December 2022 at 19:06

Windows is one of the most versatile and flexible operating systems out there, running on a variety of machine architectures and available in multiple SKUs. It currently supports x86, x64, ARM and ARM64 architectures. Windows used to support Itanium, PowerPC, DEC Alpha, and MIPS (wiki entry). In addition, Windows supports a variety of SKUs that run in a multitude of environments; from data centers, laptops, Xbox, and phones to embedded IoT devices such as ATMs.

 

The most amazing aspect of all this is that the core of Windows, its kernel, remains virtually unchanged on all these architectures and SKUs. The Windows kernel scales dynamically depending on the architecture and the processor that it’s run on to exploit the full power of the hardware. There is of course some architecture specific code in the Windows kernel, however this is kept to a minimum to allow Windows to run on a variety of architectures.

 

In this blog post, I will talk about the evolution of the core pieces of the Windows kernel that allow it to transparently scale across a low-power NVIDIA Tegra chip on the Surface RT from 2012, to the giant behemoths that power Azure data centers today.

 

This is a picture of Windows taskmgr running on a pre-release Windows DataCenter class machine with 896 cores supporting 1792 logical processors and 2TB of RAM!

 

Task Manager showing 1792 logical processors

Evolution of one kernel

Before we talk about the details of the Windows kernel, I am going to take a small detour to talk about something called Windows refactoring. Windows refactoring plays a key part in increasing the reuse of Windows components across different SKUs, and platforms (e.g. client, server and phone). The basic idea of Windows refactoring is to allow the same DLL to be reused in different SKUs but support minor modifications tailored to the SKU without renaming the DLL and breaking apps.

 

The base technology used for Windows refactoring is a lightly documented technology (entirely by design) called API sets. API sets are a mechanism that allows Windows to decouple a DLL from where its implementation is located. For example, API sets allow Win32 apps to continue to use kernel32.dll while the implementation of all the APIs lives in a different DLL. These implementation DLLs can also be different depending on your SKU. You can see API sets in action if you launch Dependency Walker on a traditional Windows DLL, e.g. kernel32.dll.

 

Dependency walker
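
To make the indirection concrete, here is a minimal sketch (assuming a Windows 10 machine and omitting most error handling) that resolves a function through an API set contract name rather than through kernel32.dll directly; the loader redirects the contract name to whichever implementation DLL backs it on that SKU.

```c
// Minimal sketch: resolve CreateFileW through an API set contract name.
// The loader consults the API set schema, so no file with this exact name
// needs to exist on disk.
#include <windows.h>
#include <stdio.h>

typedef HANDLE (WINAPI *CreateFileW_t)(LPCWSTR, DWORD, DWORD,
                                       LPSECURITY_ATTRIBUTES, DWORD, DWORD, HANDLE);

int main(void)
{
    HMODULE apiset = LoadLibraryW(L"api-ms-win-core-file-l1-1-0.dll");
    if (apiset == NULL) {
        printf("API set not available on this system\n");
        return 1;
    }

    CreateFileW_t pCreateFileW =
        (CreateFileW_t)GetProcAddress(apiset, "CreateFileW");
    printf("CreateFileW resolved through the API set at %p\n", (void *)pCreateFileW);

    FreeLibrary(apiset);
    return 0;
}
```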

With that detour into how Windows is built to maximize code reuse and sharing, let’s go into the technical depths of the kernel starting with the scheduler which is key to the scaling of Windows.

 

Kernel Components

Windows NT is like a microkernel in the sense that it has a core Kernel (KE) that does very little and uses the Executive layer (Ex) to perform all the higher-level policy. Note that the Ex layer still runs in kernel mode, so it's not a true microkernel. The kernel is responsible for thread dispatching, multiprocessor synchronization, hardware exception handling, and the implementation of low-level machine-dependent functions. The Ex layer contains various subsystems which provide the bulk of the functionality traditionally thought of as the kernel, such as I/O, the Object Manager, the Memory Manager, the Process subsystem, etc.

 

arch.png 

 

To get a better idea of the size of the components, here is a rough breakdown of the number of lines of code in a few key directories in the Windows kernel source tree (counting comments). There is a lot more to the kernel than is shown in this table.

 

Kernel subsystem          Lines of code
Memory Manager            501,000
Registry                  211,000
Power                     238,000
Executive                 157,000
Security                  135,000
Kernel                    339,000
Process sub-system        116,000

 

For more information on the architecture of Windows, the “Windows Internals” series of books are a good reference.

 

Scheduler

With that background, let's talk a little bit about the scheduler, its evolution and how Windows kernel can scale across so many different architectures with so many processors.

 

A thread is the basic unit that runs program code, and it is this unit that is scheduled by the Windows scheduler. The Windows scheduler uses the thread priority to decide which thread to run, and in theory the highest priority thread on the system always gets to run, even if that entails preempting a lower priority thread.

 

As a thread runs and experiences quantum end (the minimum amount of time a thread gets to run), its dynamic priority decays, so that a high priority CPU-bound thread doesn’t run forever starving everyone else. When another waiting thread is awakened to run, it is given a priority boost based on the importance of the event that caused the wait to be satisfied (e.g. a large boost for a foreground UI thread vs. a smaller one for completing disk I/O). A thread therefore runs at a high priority as long as it’s interactive. When it becomes CPU (compute) bound, its priority decays, and it is considered only after other, higher priority threads get their time on the CPU. In addition, the kernel arbitrarily boosts the priority of ready threads that haven't received any processor time for a given period of time to prevent starvation and correct priority inversions.
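
To illustrate the base priority that these boosts and decays operate on, here is a minimal sketch using the documented Win32 thread priority APIs; the kernel still applies its own dynamic boosts and decay on top of whatever base priority an application sets.

```c
// Minimal sketch: raise a worker thread's base priority within the process's
// priority class. Error handling is trimmed for brevity.
#include <windows.h>
#include <stdio.h>

static DWORD WINAPI Worker(LPVOID unused)
{
    UNREFERENCED_PARAMETER(unused);
    // CPU-bound work would go here; its dynamic priority decays at quantum
    // end so it cannot starve other threads indefinitely.
    Sleep(1000);
    return 0;
}

int main(void)
{
    HANDLE thread = CreateThread(NULL, 0, Worker, NULL, 0, NULL);
    if (thread == NULL) return 1;

    SetThreadPriority(thread, THREAD_PRIORITY_ABOVE_NORMAL);
    printf("Worker base priority: %d\n", GetThreadPriority(thread));

    WaitForSingleObject(thread, INFINITE);
    CloseHandle(thread);
    return 0;
}
```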

 

The Windows scheduler initially had a single ready queue from which it picked the next highest priority thread to run on a processor. However, as Windows started supporting more and more processors, the single ready queue turned out to be a bottleneck, and around Windows Server 2003 the scheduler changed to one ready queue per processor. As Windows moved to multiple per-processor queues, it avoided having a single global lock protecting all the queues and allowed the scheduler to make locally optimal decisions. This means that at any point the single highest priority thread in the system runs, but that doesn’t necessarily mean that the top N (where N is the number of cores) priority threads on the system are running. This proved to be good enough until Windows started moving to low-power CPUs, e.g. in laptops and tablets. On these systems, not running a high priority thread (such as the foreground UI thread) caused the system to have noticeable glitches in the UI. And so, in Windows 8.1, the scheduler changed to a hybrid model with per-processor ready queues for affinitized (tied to a processor) work and shared ready queues between processors. This did not cause a noticeable impact on performance because of other architectural changes in the scheduler, such as the dispatcher database lock refactoring which we will talk about later.

 

Windows 7 introduced something called the Dynamic Fair Share Scheduler; this feature was introduced primarily for terminal servers. The problem that this feature tried to solve was that one terminal server session with a CPU-intensive workload could impact the threads in other terminal server sessions. Since the scheduler didn’t consider sessions and simply used priority as the key to schedule threads, users in different sessions could impact the user experience of others by starving their threads. It also unfairly advantaged sessions (users) that had a lot of threads, because sessions with more threads got more opportunities to be scheduled and receive CPU time. This feature added policy to the scheduler such that each session was treated fairly and roughly the same amount of CPU was available to each session. Similar functionality is available in Linux as well, with its Completely Fair Scheduler. In Windows 8, this concept was generalized as a scheduler group and added to the Windows scheduler, with each session in an independent scheduler group. In addition to the thread priority, the scheduler uses the scheduler groups as a second-level index to decide which thread should run next. In a terminal server, all the scheduler groups are weighted equally and hence all sessions (scheduler groups) receive the same amount of CPU regardless of the number or priorities of the threads in the scheduler groups. In addition to their utility in a terminal server session, scheduler groups are also used to gain fine-grained control over a process at runtime. In Windows 8, Job objects were enhanced to support CPU rate control. Using the CPU rate control APIs, one can decide how much CPU a process can use, whether it should be a hard cap or a soft cap, and receive notifications when a process meets those CPU limits. This is like the resource control features available in cgroups on Linux.
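
As a rough sketch of those CPU rate control APIs (assuming Windows 8 or later; error handling trimmed), the following caps a process at roughly 20% of total CPU time through a Job object:

```c
// Minimal sketch: hard-cap the current process at ~20% CPU via Job object
// CPU rate control. CpuRate is expressed in 1/100ths of a percent.
#include <windows.h>

int main(void)
{
    HANDLE job = CreateJobObjectW(NULL, NULL);
    if (job == NULL) return 1;

    JOBOBJECT_CPU_RATE_CONTROL_INFORMATION rate = {0};
    rate.ControlFlags = JOB_OBJECT_CPU_RATE_CONTROL_ENABLE |
                        JOB_OBJECT_CPU_RATE_CONTROL_HARD_CAP;
    rate.CpuRate = 2000; // 2000 == 20.00%

    if (!SetInformationJobObject(job, JobObjectCpuRateControlInformation,
                                 &rate, sizeof(rate))) {
        return 1;
    }

    // Place the current process under the cap; this can fail if the process
    // is already in a job that does not allow nesting.
    AssignProcessToJobObject(job, GetCurrentProcess());

    // ... CPU-bound work here now competes under the 20% cap ...
    return 0;
}
```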

 

Starting with Windows 7, Windows Server started supporting more than 64 logical processors in a single machine. To add support for so many processors, Windows internally introduced a new entity called a “processor group”. A group is a static set of up to 64 logical processors that is treated as a single scheduling entity. The kernel determines at boot time which processor belongs to which group, and for machines with fewer than 64 logical processors the overhead of the group structure indirection is mostly not noticeable. While a single process can span groups (such as a SQL Server instance), an individual thread can only execute within a single scheduling group at a time.
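
Processor groups are visible to applications through documented Win32 APIs; a minimal sketch that enumerates them might look like the following (machines with 64 or fewer logical processors will typically report a single group):

```c
// Minimal sketch: enumerate the processor groups the kernel created at boot.
#include <windows.h>
#include <stdio.h>

int main(void)
{
    WORD groupCount = GetActiveProcessorGroupCount();
    printf("Active processor groups: %u\n", groupCount);

    for (WORD group = 0; group < groupCount; group++) {
        DWORD processors = GetActiveProcessorCount(group);
        printf("  group %u: %lu logical processors\n", group, processors);
    }

    // ALL_PROCESSOR_GROUPS sums the count across every group.
    printf("Total logical processors: %lu\n",
           GetActiveProcessorCount(ALL_PROCESSOR_GROUPS));
    return 0;
}
```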

 

However, on machines with more than 64 cores, Windows started showing bottlenecks that prevented high-performance applications such as SQL Server from scaling their performance linearly with the number of processor cores. Thus, even if you added more cores and memory, the benchmarks wouldn’t show much increase in performance. One of the main problems that caused this lack of scaling was contention around the dispatcher database lock. The dispatcher database lock protected access to those objects that needed to be dispatched, i.e. scheduled. Examples of objects protected by this lock included threads, timers, I/O completion ports, and other waitable kernel objects (events, semaphores, mutants, etc.). Thus, in Windows 7, driven by the impetus of greater-than-64-processor support, work was done to eliminate the dispatcher database lock and replace it with fine-grained locks such as per-object locks. This allowed benchmarks such as SQL TPC-C to show a 290% improvement on certain machine configurations when compared to Windows 7 with a dispatcher database lock. This was one of the biggest performance boosts seen in Windows history due to a single feature.

 

Windows 10 brought us another innovation in the scheduler space with CPU Sets. CPU Sets allow a process to partition the system such that it can take over a group of processors and not allow any other process, or even the system, to run threads on those processors. The Windows kernel even steers interrupts from devices away from the processors that are part of your CPU set. This ensures that even devices cannot run their code on the processors that have been partitioned off by CPU Sets for your app or process. Think of this as a low-tech virtual machine. As you can imagine this is a powerful capability, and hence there are a lot of safeguards built in to prevent an app developer from making the wrong choice within the API. CPU Sets functionality is used, for example, when customers use Game Mode to run their games.
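
A minimal sketch of the user-mode side of this API (assuming Windows 10 and a recent SDK; a real application would check errors carefully and choose CPU sets deliberately rather than simply taking the first two it finds):

```c
// Minimal sketch: query the system CPU sets and restrict this process to the
// first two of them via the default CPU set for the process.
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    ULONG length = 0;
    GetSystemCpuSetInformation(NULL, 0, &length, GetCurrentProcess(), 0);

    PSYSTEM_CPU_SET_INFORMATION info =
        (PSYSTEM_CPU_SET_INFORMATION)malloc(length);
    if (info == NULL ||
        !GetSystemCpuSetInformation(info, length, &length,
                                    GetCurrentProcess(), 0)) {
        return 1;
    }

    // Collect the IDs of the first two CPU sets (one entry per logical processor).
    ULONG ids[2];
    ULONG count = 0;
    for (PSYSTEM_CPU_SET_INFORMATION entry = info;
         (BYTE *)entry < (BYTE *)info + length && count < 2;
         entry = (PSYSTEM_CPU_SET_INFORMATION)((BYTE *)entry + entry->Size)) {
        if (entry->Type == CpuSetInformation) {
            ids[count++] = entry->CpuSet.Id;
        }
    }

    // Threads of this process will now prefer only these CPU sets.
    if (count == 2 && SetProcessDefaultCpuSets(GetCurrentProcess(), ids, count)) {
        printf("Process constrained to CPU set IDs %lu and %lu\n", ids[0], ids[1]);
    }

    free(info);
    return 0;
}
```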

 

Finally, this brings us to ARM64 support with Windows 10 on ARM. The ARM architecture supports big.LITTLE, a heterogeneous architecture where the “big” core runs fast and consumes more power, while the “LITTLE” core runs slow and consumes less power. The idea is that you run unimportant tasks on the LITTLE core, saving battery. To support the big.LITTLE architecture and provide great battery life with Windows 10 on ARM, the Windows scheduler added support for heterogeneous scheduling, which takes into account the app’s intent when scheduling on big.LITTLE architectures.

 

By app intent, I mean Windows tries to provide a quality of service for apps by tracking threads which are running in the foreground (or starved of CPU) and ensuring those threads always run on the big core, whereas background tasks, services, and other ancillary threads in the system run on the LITTLE cores. (As an aside, you can also programmatically mark your thread as unimportant, which will make it run on the LITTLE core.)
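
One documented way to hint that a thread prefers efficiency over performance is the thread power throttling API; on heterogeneous systems this tends to steer the thread toward the efficiency cores. A minimal sketch follows (assuming Windows 10 or later; whether this is the exact mechanism referred to above is an assumption):

```c
// Minimal sketch: mark the current thread as preferring power efficiency,
// letting the scheduler run it slower and/or on a LITTLE core.
#include <windows.h>

int main(void)
{
    THREAD_POWER_THROTTLING_STATE state = {0};
    state.Version = THREAD_POWER_THROTTLING_CURRENT_VERSION;
    state.ControlMask = THREAD_POWER_THROTTLING_EXECUTION_SPEED;
    state.StateMask = THREAD_POWER_THROTTLING_EXECUTION_SPEED; // opt in to throttling

    // Hint to the scheduler that this thread is unimportant.
    SetThreadInformation(GetCurrentThread(), ThreadPowerThrottling,
                         &state, sizeof(state));

    // ... background or bookkeeping work here ...
    return 0;
}
```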

 

Work on behalf: In Windows, a lot of work for the foreground is done by other services running in the background. For example, in Outlook, when you search for mail, the search is conducted by a background service (the Indexer). If we simply ran all the services on the LITTLE core, then the experience and performance of the foreground app would be affected. To ensure that these scenarios are not slow on big.LITTLE architectures, Windows actually tracks when an app calls into another process to do work on its behalf. When this happens, we donate the foreground priority to the service thread and force the thread in the service to run on the big core.

 

That concludes our first (huge?) One Windows Kernel post, giving you an overview of the Windows Kernel Scheduler. We will have more similarly technical posts about the internals of the Windows Kernel. 

 

Hari Pulapaka

(Windows Kernel Team)

Welcome to Windows Kernel Team Blog

12 December 2022 at 19:06

Welcome all, 

 

We are the Windows Kernel team, and we will be starting a new series of blog posts where we will be talking about the internals of the Windows kernel. We realize there is a general dearth of information regarding the internals of the Windows kernel, other than the excellent Windows Internals book series by Mark Russinovich.

 

Over the next few months, we will be having deep technical posts about things like the Kernel Scheduler, Memory Management and many other unexpected features in the Kernel. 

 

Cheers, 

Hari Pulapaka

Group Program Manager for Windows Kernel 
