Microsoft has a great track record of maintaining support for legacy software running under Windows. There is an entire compatibility layer baked into the OS that is dedicated to fixing issues with decades-old software running on modern iterations of Windows. To learn more about this application compatibility infrastructure, I'd recommend swinging over to Alex Ionescu's blog. He has a great set of posts describing the technical details of how user (and even kernel) mode shimming is implemented.
With all of that said, it's an understatement to say that Microsoft takes backwards compatibility seriously. Occasionally, the humans at Microsoft make mistakes. Usually, though, they're very quick to address these problems.
This blog post will go over a bug, introduced in Windows 8, that affects a documented Win32 API and has gone unnoticed. At the time of this post, the bug is still present in Windows 10 (Creators Update) and has been around for over 5 years.
Forgotten Win32 APIs
There is a set of Win32 APIs that were introduced in Windows XP to monitor the working set of a process. A process' working set is the collection of pages (chunks of memory) that are currently in RAM (physical memory) and are accessible to that process without inducing a page fault. In particular, the APIs of interest for us are InitializeProcessForWsWatch, GetWsChanges, and GetWsChangesEx.
After reading the MSDN documentation, it's easy to discover what the intended use for these APIs was: they profile the page faults that occur within a process' address space.
What's a page fault? A quick recap.
There are 3 general categories of page faults.
A hard page fault occurs when memory is accessed that's not currently in RAM (physical). In situations like this, the OS will need to retrieve the memory from disk (e.g. pagefile.sys) and make it accessible to the faulting process.
A soft page fault occurs when memory is in RAM (physical), but not currently accessible to the process that induced the fault. This memory might be shared amongst multiple processes and the process that caused the page fault might not have it mapped into its working set. These types of page faults are much more performant than hard page faults as there is no disk I/O conducted.
The last and final type of page fault is known formally as an invalid fault. These can also be referred to as access violations. This can be caused when a program, for example, tries to access unallocated memory or tries to write to memory that's marked read-only.
Paging is necessary to make modern operating systems work. You probably have many processes running on your system, but not nearly enough RAM to hold the entire contents of every process in physical memory. To learn more about paging, I strongly recommend this article posted by my colleague.
The best way to illustrate what's broken is through an example. I created two simple programs.
The first application, WorkingSetWatch.exe, implements the InitializeProcessForWsWatch and GetWsChangeEx APIs. This application logs when a specific memory region is paged into our process' working set:
The second application, ReadProcessMemory.exe, implements reading of an arbitrary memory blob from another target process' memory space:
The basic idea is to use ReadProcessMemory.exe to read from the monitored memory address inside of WorkingSetWatch.exe. This will induce a page fault.
Windows 7: Build 7601 (SP1)
The WorkingSetWatch.exe application works as expected. We're able to read any (valid) sized buffer using ReadProcessMemory.exe and log it.
Windows 10: Build 15063 (Creators Update)
Unfortunately, WorkingSetWatch.exe does not seem to log the page fault that occurs when our remote application, ReadProcessMemory.exe, reads a buffer greater than or equal to 512 bytes; however, it does seem to work as expected when a read occurs that's less than 512 bytes.
This renders these working set APIs useless for profiling reasons on Windows 8+.
What went wrong?
To determine what went wrong, we'll need to reverse engineer parts of Windows and see exactly how the implementation changed in Windows 8+ from Windows 7.
All disassembly and pseudo-source is reconstructed from system files that are provided with Windows x64 10.0.15063 (Creators Update).
Enabling process working set logging
To enable working set logging for a process, we need to call InitializeProcessForWsWatch. From the MSDN documentation, we're told that on newer versions of Windows this API is exported as K32InitializeProcessForWsWatch. Our analysis begins there:
This function is very simple. It invokes an import from another library. In this case, it executes a function of the same name (K32InitializeProcessForWsWatch), but contained within a different library, api-ms-win-core-psapi-l1-1-0.dll. This library doesn't exist on disk, but rather resolves to an API Set mapping corresponding to kernelbase.dll (which does exist on disk) for this version of Windows. A look into kernelbase.dll's implementation shows that a call to NtSetInformationProcess is performed without any parameter marshalling:
Our next target is NtSetInformationProcess:
This is just a simplistic syscall stub that will eventually make its way into the implementation contained within ntoskrnl.exe, the Windows kernel. nt!NtSetInformationProcess is a massive function that contains a huge switch statement covering all the different PROCESSINFOCLASS values that can be passed to it.
We're interested in the ProcessWorkingSetWatch PROCESSINFOCLASS. This is case 15 (0xF). A snippet of the relevant parts (with the cleaned-up disassembly):
It's interesting to note that you're able to start monitoring a process' working set with a class of either ProcessWorkingSetWatch (15) or ProcessWorkingSetWatchEx (42). This can be achieved by invoking nt!NtSetInformationProcess directly instead of going through the documented route with kernel32!InitializeProcessForWsWatch. The latter utilizes only the ProcessWorkingSetWatch class.
The actual logic of nt!NtSetInformationProcess is pretty trivial to understand. A blob of memory is allocated per process that we're monitoring. This blob of memory is a _PAGEFAULT_HISTORY structure and contains up to 1024 _PROCESS_WS_WATCH_INFORMATION structures internally. Each _PROCESS_WS_WATCH_INFORMATION structure is an entry that describes a page fault. These entries will be cycled through as the array fills up. Recall from the MSDN documentation (the "Remarks" section) that you must call GetWsChanges/Ex with enough frequency to avoid record loss. This makes perfect sense because we can see that there are a fixed number of these records (1024) allocated. I took the liberty of documenting these structures:
The union at the beginning of the _PAGEFAULT_HISTORY structure may be a little confusing, but it'll be explained later.
On successful execution of this routine, the monitored process object will have an internal member (_EPROCESS.WorkingSetWatch) updated to point to this recently allocated _PAGEFAULT_HISTORY structure. Additionally, the PsWatchEnabled global will be set. This value informs the system to track page faults for processes. It will remain set until the system reboots (even if there are no processes running that have working sets tracked). There are only 2 references to PsWatchEnabled and we've already inspected the one in nt!NtSetInformationProcess. Our investigation leads us to nt!KiPageFault.
Logging a page fault
When a page fault occurs, the CPU transfers execution to nt!KiPageFault:
If the PsWatchEnabled global is set, that means we've enabled working set logging for processes on the system and execution is passed to nt!PsWatchWorkingSet. This function is documented below:
As I mentioned above, there are 3 types of page faults. Access violations are not logged to our process' working set due to an early out by nt!MmAccessFault in nt!KiPageFault. Since this function is executed for the other 2 types of page faults (hard and soft) on the system, it will be accessed heavily by the operating system. Luckily, one of the first things the routine does is check whether or not a working set watch was enabled on the process where the page fault occurred. If there is no working set watch on the process, the routine completes.
As per the documentation, nt!PsWatchWorkingSet will not function while records are being processed (EntrySelector.Busy). We'll describe this part in depth at a later time. Since higher priority interrupts can preempt our working set monitor, most of the logic in this routine needs to have adequate sanity (safety) checks and complete as atomically (Interlocked*** operations) as possible. The first part of the function will safely select a free index in the _PAGEFAULT_HISTORY.WatchInfo array that it can use for logging purposes. If the array is full (there can be at most 1024 entries), a "miss" is recorded (_PAGEFAULT_HISTORY.MissingRecords) and the routine completes. If everything is successful, a page fault event is recorded in a free slot in the _PAGEFAULT_HISTORY.WatchInfo array. An interesting (and undocumented) feature changes the entry's _PROCESS_WS_WATCH_INFORMATION.FaultingVa least significant bit to 0 if a hard page fault occurred and 1 if a soft page fault occurred.
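The logging behaviour described above can be approximated with a small model. This is only a sketch: the type names, field names, and layout below are hypothetical stand-ins rather than the real kernel structures, and the model is single-threaded, so it leaves out the interlocked operations the real routine depends on. What it does show is the fixed capacity of 1024 records, the miss counter, the early out while a query holds the buffer, and the fault type carried in the least significant bit of the faulting address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WS_WATCH_CAPACITY 1024  /* fixed number of records per process */

/* Hypothetical, simplified stand-ins; NOT the real kernel layouts. */
typedef struct {
    uintptr_t faulting_pc;  /* instruction that caused the fault */
    uintptr_t faulting_va;  /* address that faulted; bit 0 carries the fault type */
} ws_watch_record;

typedef struct {
    int busy;                /* set while a query is draining the array */
    size_t current_index;    /* next free slot in watch_info */
    size_t missing_records;  /* faults that could not be logged */
    ws_watch_record watch_info[WS_WATCH_CAPACITY];
} pagefault_history;

/* Records one fault. Returns 1 if it was logged, 0 if it was counted as a miss. */
int log_page_fault(pagefault_history *h, uintptr_t pc, uintptr_t va, int is_soft)
{
    assert(h != NULL);

    /* A busy buffer or a full array both turn the fault into a miss. */
    if (h->busy || h->current_index >= WS_WATCH_CAPACITY) {
        h->missing_records++;
        return 0;
    }

    ws_watch_record *r = &h->watch_info[h->current_index++];
    r->faulting_pc = pc;
    /* Least significant bit: 0 for a hard fault, 1 for a soft fault. */
    r->faulting_va = is_soft ? (va | (uintptr_t)1) : (va & ~(uintptr_t)1);
    return 1;
}

/* Decodes the fault type stored in a record. */
int record_is_soft(const ws_watch_record *r)
{
    return (int)(r->faulting_va & (uintptr_t)1);
}
```

Logging 1025 faults into a fresh model leaves 1024 records and one miss, which is why the retrieval functions have to be called frequently.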
Ultimately, there don't seem to be any apparent bugs in this code. Additionally, it closely matches the Windows 7 version, which we know works. Our investigation leads us to the working set watch retrieval functions: GetWsChanges/Ex.
Querying working set logging
For article brevity, I'll give a quick summary of the call-flow of kernel32!GetWsChanges (kernel32!K32GetWsChanges) and kernel32!GetWsChangesEx (kernel32!K32GetWsChangesEx). These functions will call into their kernelbase.dll variants. From there, they will branch into kernelbase!GetWsChangesInternal which will invoke ntdll!NtQueryInformationProcess with the appropriate PROCESSINFOCLASS. In particular, the ProcessWorkingSetWatch class will be used for the GetWsChanges family of functions and ProcessWorkingSetWatchEx will be used for the others. From ntdll!NtQueryInformationProcess, a syscall will be made. This makes it to the implementation of NtQueryInformationProcess within the kernel. A massive switch statement awaits:
The part that interests us resides one level deeper within nt!PspQueryWorkingSetWatch:
There's some input validation (e.g. alignment checks) and a safety check (nt!ExIsRestrictedCaller) to avoid kernel pointer leaks in low integrity processes. After that, the process object is retrieved from the supplied process handle. The operating system checks to see that the _EPROCESS.WorkingSetWatch member is set. Just like the documentation states, at most one query can access a process' working set buffer at a time (EntrySelector.Busy). Additionally, while the buffer is being accessed, logging (by nt!PsWatchWorkingSet) will produce misses.
As long as there's enough space in the user supplied buffer, the operating system will copy over the entry array to the user supplied buffer. The data will be structured in the appropriate way for the appropriate PROCESSINFOCLASS. The last entry in the user supplied buffer (PSAPI_WS_WATCH_INFORMATION/EX) will be terminated with a FaultingPc member of NULL. Additionally, the number of "misses" will be recorded in the FaultingVa member of the last entry.
Finally, the _PAGEFAULT_HISTORY.WatchInfo array of the _EPROCESS.WorkingSetWatch will be reset after a successful call.
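Here is a rough sketch of how a caller might walk the returned buffer, based on the terminator convention described above. The ws_watch_info type is a local mirror of the two-pointer record (FaultingPc, FaultingVa) so that the example builds without the Windows headers, and summarize_ws_changes is a hypothetical helper rather than part of any API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the two-pointer record; declared here so the sketch
   builds without the Windows headers. */
typedef struct {
    void *FaultingPc;
    void *FaultingVa;
} ws_watch_info;

typedef struct {
    size_t records;  /* entries found before the terminator */
    size_t misses;   /* count carried in the terminator's FaultingVa */
} ws_watch_summary;

/* Hypothetical helper: walks the buffer until the entry whose FaultingPc
   is NULL and reads the miss count out of that entry's FaultingVa. */
ws_watch_summary summarize_ws_changes(const ws_watch_info *buf, size_t capacity)
{
    ws_watch_summary s = {0, 0};
    assert(buf != NULL);

    for (size_t i = 0; i < capacity; i++) {
        if (buf[i].FaultingPc == NULL) {
            s.misses = (size_t)(uintptr_t)buf[i].FaultingVa;
            break;
        }
        s.records++;
    }
    return s;
}
```

A non-zero miss count tells the caller that the history is incomplete for that interval, even though every returned record is valid.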
These APIs are surprisingly finicky. There are many odd restrictions and caveats that make it difficult for developers to retrieve the complete set of page faults that occurred within a process.
There is a very good chance that you will run into situations where records wind up missing, especially in a multi-processor and multi-threaded environment. For example, if a thread is querying the working set of a process, but a page fault occurs on another thread within that same process, a miss could be recorded since the _PAGEFAULT_HISTORY.Busy member will be acquired by nt!PspQueryWorkingSetWatch. This will block the page fault logging logic in nt!PsWatchWorkingSet. Functionally, this weakens the usability of the API for profiling purposes. To compound this problem, only 1024 entries can be stored in the array between calls of GetWsChanges/Ex. That's at most 4 MB (1024*PAGE_SIZE) of page fault history. This really isn't enough for modern applications, which can be very complex.
In our specific situation, we ran our tests on a VM that had 1 processor allocated to it. Furthermore, our application was simple enough that it had 1 thread. This mitigates the chance of page fault "misses". Additionally, after a thorough investigation of the working set APIs, we've concluded that we've still not discovered where the bug is. In particular, why does the buffer size play a role in the success of these APIs? In our demo, we were unable to log page faults on Windows 10 when the buffer size was greater than or equal to 512 bytes. Is it possible that the bug is not within WorkingSetWatch.exe, but rather ReadProcessMemory.exe?
To continue our investigation, we need to turn to ReadProcessMemory.exe.
The ReadProcessMemory.exe application is simple enough to understand. We know that we're not logging a page fault when we're reading a buffer that is greater than or equal to 512 bytes. Since there is no apparent bug in the working set APIs, the problem most likely resides in kernel32!ReadProcessMemory.
I'll step past the irrelevant details, but the same strategy is applied as in the previous parts. In particular, kernel32!ReadProcessMemory calls into kernelbase!ReadProcessMemory. These functions do nothing special and more-or-less directly issue a system call by invoking ntdll!NtReadVirtualMemory. This takes us to the implementation of nt!NtReadVirtualMemory in the kernel:
This function just invokes nt!MiReadWriteVirtualMemory. On some versions of ntoskrnl, this routine may just be inlined into the caller's body.
Aside from a check that prevents reading and writing to protected processes (ProcessObject->Pcb.SecurePid), this function is nearly identical to the one in the Windows 7 kernel. We need to go deeper. We traverse into nt!MmCopyVirtualMemory:
This function is massive. It contains many subfunctions that have been inlined. For article brevity, only the important parts of nt!MmCopyVirtualMemory will be highlighted. One of the first things that this routine does is search for the VAD entries that correspond to the input addresses (FromAddress). The idea is to leverage the "region size" information for memory, but this isn't really relevant to our bug. We'll leave the discussion of the VAD (Virtual Address Descriptor) to another time. nt!MmCopyVirtualMemory's next task is to determine the input buffer's length. In particular, there are a couple of checks that compare the buffer length against the value 512. This is significant to us because we know the bug only seems to manifest when the buffer size is greater than or equal to 512 bytes.
Basically, it seems that if the buffer is greater than or equal to 512 bytes, nt!MmCopyVirtualMemory will utilize nt!MmProbeAndLockPages followed by a memcpy to clone over memory.
If the buffer is less than 512 bytes, nt!MmCopyVirtualMemory will just leverage memcpy directly by using a buffer on the stack or a buffer allocated in dynamic memory (based on buffer size) via nt!ExAllocatePoolWithTag.
This is probably done for performance reasons. Larger memory copies probably benefit from direct mapping instead of memory pool copying. If we do leverage memory pool copying (buffers that are less than 512 bytes in size), we trigger a page fault and the event is logged by our WorkingSetWatch.exe application. On the other hand, if we leverage a direct mapping to copy memory, we do not trigger a page fault.
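The size check can be summarized with a small sketch. The names below (copy_strategy, choose_copy_strategy, read_is_logged_on_windows10) are hypothetical and only encode the behaviour observed in this post. The sketch does not reproduce the kernel routine, and it does not model the separate choice between a stack buffer and a pool buffer on the small path.

```c
#include <assert.h>
#include <stddef.h>

#define DIRECT_MAP_THRESHOLD 512  /* the value the length checks compare against */

typedef enum {
    COPY_BUFFERED,     /* memcpy through an intermediate stack or pool buffer */
    COPY_LOCK_AND_MAP  /* MmProbeAndLockPages, then memcpy */
} copy_strategy;

/* Hypothetical helper: picks the copy path from the requested length. */
copy_strategy choose_copy_strategy(size_t length)
{
    return (length >= DIRECT_MAP_THRESHOLD) ? COPY_LOCK_AND_MAP : COPY_BUFFERED;
}

/* Encodes the Windows 10 observation from the demo: only reads that take
   the buffered path show up in the working set watch. */
int read_is_logged_on_windows10(size_t length)
{
    assert(length > 0);
    return choose_copy_strategy(length) == COPY_BUFFERED;
}
```

Under this model a 511-byte read is logged and a 512-byte read is not, which matches what the two test programs showed.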
It would be incorrect to assume that this optimization did not exist on Windows 7. On the contrary, there is very similar logic inside of the older version of nt!MmCopyVirtualMemory. However, something did change, otherwise we would not have any discrepancies with our WorkingSetWatch program. Our investigation leads us into nt!MmProbeAndLockPages.
The bug: an optimization in nt!MmProbeAndLockPages
The implementation of nt!MmProbeAndLockPages underwent drastic changes between Windows 7 and now. If you looked at these two functions side-by-side, you'd quickly notice that the Windows 7 implementation was in some ways much simpler.
The purpose of nt!MmProbeAndLockPages (per the documentation) is to ensure that the specified virtual pages (described by the MemoryDescriptorList argument) are backed by physical memory. Additionally, there is a series of permission checks to ensure that the virtual pages permit the user-specified access rights. In Windows 7, to perform this access check, the routine actually "probed" the memory by directly accessing it. This would induce a page fault in the context of the correct process and therefore we'd be able to log it using our WorkingSetWatch.exe application.
On Windows 10, this process was optimized. Instead of accessing the memory directly, a PTE (Page Table Entry) walk is performed to ensure that the correct permissions exist. This change makes the process more efficient especially since the PTEs are leveraged to lock the memory into physical pages anyway.
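The difference between the two probing approaches can be illustrated with a toy model. Everything in it is hypothetical: toy_pte and toy_watch are invented for illustration and bear no relation to the real PTE format or to the kernel's implementation. The point is only that both probes reach the same verdict, while only the one that touches memory produces a fault for a working set watch to record.

```c
#include <assert.h>
#include <stddef.h>

/* Toy page-table entry with hypothetical fields. */
typedef struct {
    int in_working_set;  /* 0 means touching the page would fault */
    int writable;
} toy_pte;

/* Toy stand-in for what a working set watch would observe. */
typedef struct {
    size_t logged_faults;
} toy_watch;

/* Windows 7 style: access the page. A non-resident page faults, the fault
   is logged, and the page becomes resident. Returns 1 if access is allowed. */
int probe_by_touch(toy_pte *pte, int want_write, toy_watch *watch)
{
    assert(pte != NULL && watch != NULL);
    if (want_write && !pte->writable) {
        return 0;  /* access violation: never logged */
    }
    if (!pte->in_working_set) {
        watch->logged_faults++;
        pte->in_working_set = 1;
    }
    return 1;
}

/* Windows 10 style: inspect the entry instead of touching the page.
   Same verdict and same end state, but no fault is ever taken. */
int probe_by_pte_walk(toy_pte *pte, int want_write, toy_watch *watch)
{
    assert(pte != NULL && watch != NULL);
    if (want_write && !pte->writable) {
        return 0;
    }
    pte->in_working_set = 1;
    return 1;
}
```

Running both probes against identical non-resident pages leaves the first with one logged fault and the second with none.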
OS development isn't easy
One seemingly inconspicuous change can break functionality in an entirely unrelated part of the operating system. In our case, an optimization in the underlying logic of how nt!MmProbeAndLockPages functioned broke backwards compatibility of the working set APIs. This bug seems to be entirely unnoticed, but it unfortunately renders the performance profiling nature of the GetWsChanges/Ex APIs useless.
A potential fix for Microsoft is to simply raise a page fault for "invalid" pages if the PsWatchEnabled global is set or, more granularly, if a process' _EPROCESS.WorkingSetWatch is set.