
Applied Reverse Engineering: Basic Architecture

By: Daax Rynd

Overview

Thanks for joining me in my newest series, Applied Reverse Engineering. I decided to write this new series concurrently with the EPT series, except I pushed out the first five for this one and haven’t started the other. Typical. Anyways, I have to give a little preface to the article and series as well as a disclaimer. This article is going to cover the basics of microarchitecture – namely the things that apply when reverse engineering something. We’ll cover a few pieces of the microarchitecture that will make learning assembly a little less confusing. This includes the general purpose registers, the processor state flag register, the ISA, virtual memory, and a quick overview of the execution of a process on the Intel 64 architecture.

Here’s the disclaimer: I’m assuming you have programming experience in a compiled language such as Rust, C, C++, and so on. If you don’t, but are interested in following this series then I encourage you to take the time to learn the fundamentals of one of those languages. Understanding the high level constructs will help you identify them in a low level environment. We’re going to be working directly with assembly right from the start, so if you’re squeamish with details this may not be for you. There’s a lot to learn, so I’ve broken this series up into many parts in an order I feel appropriate for learning about the architecture and applying it to software reverse engineering.

All projects and examples given in this series (and article) are written to run on Windows 10 x64 (Version 1903 Build 18362). The architecture referenced is the Intel 64 architecture, though most everything still applies if you’re on AMD64. If you’re on a different architecture or operating system make sure to consult the proper specifications to learn about them. You can still take what you learn here and apply it to other systems. Be sure to consult the recommended reading section when confused, or looking for more information!

All that being said, let’s put the rubber on the road and get goin’.

High Level Introduction

If you’ve worked in a high level language and taken some form of computer systems course at a formal institution then you may be familiar with the compilation process and how executables actually run. However, if you haven’t more than dabbled with assembly or heard its name then we’re going to cover the process of a simple C program and how it executes on the processor. For the breakdown of the C program I’ve disabled all optimizations, enabled full debug information, and disabled some other settings. I’ll provide the link to the repo that all of the future projects for this series will be posted to. You can pull the solutions, pop them in Visual Studio 2019, compile, and follow along. We’re going to take a simple C program I wrote which calls a Windows API to get the computer name, implements a custom strlen function, and prints out the computer name and resulting name length. We’re not focused on the complexity of it – I want it to be as simple as possible so that when we break it down to the spooky assembly nobody runs screaming for the hills. (If you see SIMD instructions in assembly, it makes you want to do that sometimes.)

Let’s cover the compilation process briefly.

— The Compilation Process

The C compiler’s job is to preprocess the project, compile, and link the executable. This means that include files, preprocessor directives (macros), and other conditional compilation instructions are handled. This is its first pass. The compilation process typically involves 4 major stages and uses a variety of tools – notably, the compiler, assembler, and linker. The second pass of the process is compilation. It takes the output of the preprocessing phase along with the source and generates assembler source. It’s worth noting that some compilers use an integrated assembler which generates machine code directly rather than emitting an intermediate representation and then invoking an assembler. Which leads us to the next part of the compilation process: assembly. During this stage, an assembler is used to translate assembly instructions into object code. The output is the set of instructions that are run directly on the target processor – all of which are (or should be) part of the ISA. The final part of the compilation process is linking. Now that we have object code generated we have to reorder pieces of the program to produce an executable program that functions properly. The linker arranges the various parts of the object code so that functions can invoke functions in different regions. This stage also links the libraries used in a program so that the program can make use of those library functions. In the case of our C program, kernel32.lib will be added to the object code by the linker to invoke GetComputerNameA.
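
If you’d rather watch these stages happen than read about them, they map roughly onto separate MSVC invocations. Below is a sketch assuming cl.exe and link.exe from a Visual Studio developer prompt, and a source file named main.c (the file name is just for illustration):

cl /P main.c                  (preprocess only: writes main.i)
cl /c /FA main.c              (compile and assemble: writes main.obj plus a main.asm listing)
link main.obj kernel32.lib    (link: resolves GetComputerNameA, producing main.exe)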

Here’s a visual representation of the compilation process.

Image taken from stackoverflow.com

Now that we’ve covered the compilation process we’re going to take our C program, generate some assembly listings, and take a peek. If you find yourself interested in compilation/compilers, there’s more detailed reading in the recommended reading section.

— Compiled Binary Breakdown

The compiled binary we’re going to break down is the C program mentioned above. It’s nothing special; this is just to get a taste of what lies underneath the source of the C program.
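
The original article showed the program as an image. Here’s a minimal reconstruction based on its description – the helper name my_strlen and the buffer sizing are my own choices, not necessarily the author’s exact 9-line listing:

#include <windows.h>
#include <stdio.h>

// Custom strlen implementation, as described in the article.
static size_t my_strlen(const char* str)
{
    size_t length = 0;
    while (str[length] != '\0')
        length++;
    return length;
}

int main(void)
{
    char name[MAX_COMPUTERNAME_LENGTH + 1] = { 0 };
    DWORD size = sizeof(name);

    // GetComputerNameA fills the buffer and reports the number of bytes copied.
    if (GetComputerNameA(name, &size))
        printf("%s (%zu)\n", name, my_strlen(name));

    return 0;
}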

The above is the C program. It gets our physical computer name and prints it out along with our computer name length (which is just the number of bytes copied back into our name buffer). Now let’s compile this, but with some changes to the project settings so we can generate an assembly listing. To make things easier to understand we’re going to disable all optimizations and remove all debug information. To enable the generation of an assembly listing we’ll go to our Project Settings > C/C++ > Output Files and change the following:

This assembler output will be placed in our project directory. Let’s hit F7 and build this then open up the assembler output. The output shown below may be quite unfamiliar if you’re new to assembly. You may also recognize some instructions and mnemonics (just a representation to identify operations). Let’s take a look…

The listing that generates is a bit longer, so I went ahead and picked out the piece we wanted: the main function. Looking at this may seem like Chinese, even with the various identifiers that were added. In the above assembly excerpt, every one of the lines is an instruction (excluding the comments and main PROC/ENDP). In the x86 architecture there are hundreds of instructions and sometimes tens of variations of those instructions. Those blue keywords are what we refer to as mnemonics. Upon looking quickly we see groupings of instructions and some pretty generic mnemonics like mov (a store operation), and call (a function invocation). Let’s look at some of the patterns. You’ll see a bunch of operations (each line is a single operation), some reusing the first operands of others. These operands are called registers, but more on that in a minute. Each one of the lines in this excerpt is a single operation to load, store, or modify data; or call a function. You don’t need to know what all of these mean or what their function is, we’ll cover that in due time. For now, just realize that the 9-line C program we wrote translated to over 25 lines of assembly that are then run directly on your processor, which performs billions of operations per second. Interesting, right?

What you just looked at was your (maybe) first glimpse of x64 assembly. In most reverse engineering projects you won’t have the luxury of having prenamed functions and references to strings. We’ll learn how to deal with that as well in this series. However, now that you’ve had your first taste of low level code, let’s learn about some of the fundamentals of the architecture that will help you more easily understand the excerpt above.

The Microarchitecture

You’ve probably heard the word microarchitecture tossed around before with varying degrees of understanding, but to more formally define it for this series a microarchitecture is all the digital logic that allows an instruction set to be executed. It’s the combination of memory, ALUs, registers, logic gates, and so on. When you combine all of these components you wind up with a processor – the digital unit responsible for performing basic arithmetic, input/output operations, and many others. In any processor, even the most basic, you’ll have a register file, an ALU, some form of close to processor memory (a cache), and a unit that allows the processor to make decisions based on an instruction it’s executing (branch predictor). The component we need to cover first on the journey through the architecture is the register file.

If you recall, a lot of the operands of those instructions are what are known as registers. Don’t know what I mean? After this section you will.

A quick side note, operands refer to the data being operated on. Some instructions have one, two, or three operands. They’re always referred to from left to right. Take line 19 in the assembly excerpt – xor eax, eax – the two operands are eax (operand 1) and eax (operand 2). Both of those operands also happen to be CPU registers.

Anyways, let’s keep moving so that more of this stuff starts making sense.

— The Register File

Every processor has to perform operations on data and that data usually has to be stored temporarily. This is the purpose of a processor’s register file. The register file is an array (or bank) of processor registers used to store information and subsequently operate on that information. If you’ve taken a computer systems course or read literature regarding system memory versus on-chip memory then you know the latter is much faster. Typically, the processor will retrieve information from memory that is relevant to an instruction sequence and store it in a register to operate on that data. If it had to reach out to physical memory for each operation, modern systems would be orders of magnitude slower. If you don’t know what a register is, think of it as an empty slot with an identifier that’s stored in SRAM on your processor. Each slot is filled with data and various instructions can perform operations on that slot before writing it back to memory or storing it in another register.
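
To make that concrete, here’s a tiny sketch of the load–operate–store pattern the paragraph describes (the label value_in_memory is hypothetical):

mov rax, qword ptr [value_in_memory]    ; load from memory into a register
add rax, 1                              ; operate on the value inside the register file
mov qword ptr [value_in_memory], rax    ; write the result back to memory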

For this series, we’re only concerned with the registers relevant to our target architecture (Intel 64). On the Intel 64 architecture the register file contains 16 general purpose registers, each register being 64 bits in size. There are various other registers worth noting, but not until later in this series. The sizes of these registers are usually referred to using terms such as word, doubleword, quadword, etc. A word on the Intel architecture is 16 bits, a doubleword is 32 bits, and a quadword is 64 bits. Their sizes can also be denoted in bytes: 2, 4, and 8 bytes, respectively. To be thorough, there are two bytes in a word, commonly referred to as the high byte and the low byte. We’ll be referencing a lot of this terminology in the next subsection covering these general purpose registers.
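
As a quick sketch of those sizes in assembly (the immediate values are arbitrary, chosen to fill each width):

mov al, 0FFh                    ; byte - 8 bits
mov ax, 0FFFFh                  ; word - 16 bits
mov eax, 0FFFFFFFFh             ; doubleword - 32 bits
mov rax, 0FFFFFFFFFFFFFFFFh     ; quadword - 64 bits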

— Register Fundamentals

In the previous subsection I mentioned 16 general purpose registers. These general purpose registers are used by the microarchitecture to perform basic data movement, control flow operations, string operations, and so on. You’ll encounter them every time you look at a dead-listing (static disassembly) of an object or work in a debugger. If you recall, we looked at an excerpt of an assembly listing which performed quite a few operations for the simplicity of the application, but more importantly it referenced general purpose registers on almost every line.

Well, you know what a register file is, and that each slot (register) has an identifier. What are these identifiers? I put together a table of the general purpose registers, and if you’re unfamiliar and it looks more confusing than my explanation don’t worry – I’ll break it down as much as necessary.

The image displayed above is a table of the 64-bit general purpose registers and their layout. You might recognize some of the register names from our assembly excerpt. To explain this: there are 16 general purpose registers, and each register on the 64-bit architecture is 64 bits wide. However, on 32-bit architectures there were only 8 general purpose registers. Those registers are the low 32-bit sections of the 64-bit general purpose registers. For instance, on 32-bit architectures, RAX is reduced to a 32-bit general purpose register – EAX. To maintain compatibility with 32-bit architectures the 32-bit general purpose registers were extended to 64 bits. In addition to this size extension the 64-bit architecture added 8 more general purpose registers – those being R8 to R15. You can still, and will frequently, access the lower portions of registers. This can be confusing for a first timer, but back in the old days of 16-bit architectures there wasn’t an RAX or EAX. It was just AX.

If you recall, the sizes of the data types we’re concerned with go byte (8 bits), word (16 bits), doubleword (32 bits), and quadword (64 bits). AX in the example just mentioned is a register with the size of a word. In the 64-bit architecture we’re able to use these register mnemonics to access specific portions of the whole register. If we have an operation on EAX such as xor eax, 10000539h (assuming EAX held zero beforehand) and only want to look at the low word of the register value following the xor, we could use AX to see 0539h.

Example:

xor eax, 10000539h    ; assuming eax was 0: eax = 0x10000539
mov var, ax           ; var = 0x0539 (the low word of eax)

On the other hand, if we looked at RAX the value would be zero extended to 64 bits (which means all bits above bit 31 would be set to 0). All this just means that the different sized portions of a general purpose register can be accessed using the mnemonic devices shown in the image. For the additional general purpose registers introduced in the 64-bit architecture (R8-R15) you’ll use the register names shown for the R8 register, substituting the number in the diagram for the target register number.

Note: Accesses that reference the byte or word portions of these 64-bit general purpose registers do not affect the respective upper bits – a store to the low word of one leaves the upper 48 bits unchanged. A store to the low doubleword, however, zero-extends into the upper 32 bits, as noted above.
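
A short sketch of that behavior (the starting value is arbitrary):

mov rax, 1122334455667788h
mov ax, 9999h         ; word store: rax = 1122334455669999h, upper 48 bits untouched
mov eax, 9999h        ; doubleword store: rax = 0000000000009999h, zero-extended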

You might have also noticed a register not mentioned: RIP. This register is referred to as the instruction pointer register. It contains the offset in the current code segment for the next instruction to be executed. It increments after each execution by the size of the previous instruction (or from one instruction boundary to the next). Some instructions can determine whether RIP will move forward or backward – these instructions are called conditional instructions. We’ll cover them in the future; for now it’s just important to understand that RIP points to the next instruction and moves based on the type of instruction executed.

You can read a more technical description of the general purpose registers in the Intel SDM Vol. 1, Chapter 3.4.1 or the AMD64 Architecture Programmer’s Manual. To recap, what’s important to know is that the processor uses them to temporarily store data to operate on, and we will frequently access these registers. Now that they’ve been covered, a special processor register needs to be addressed.

— Processor State Flag Register

Commonly called the EFLAGS register, in 64-bit mode it’s often called the RFLAGS register. You may have also heard it called the current program status register (CPSR), though strictly speaking that’s the name of the analogous register on ARM. This is a 32-bit register that contains a number of flags related to the state of the processor while executing the current program. Some flags are used to control where branching instructions go, and some are used to control OS related operations. A small group of status flags are affected by the results of arithmetic operations like addition, subtraction, and so on. I’m only going to cover the main status flags we’ll run into. The rest of the flags have definitions in their respective manuals, and I advise anyone looking to fully understand these topics to go through the recommended reading. Anyways, depicted below is a figure from the Intel SDM of the layout of the EFLAGS register.

The above is the layout of the EFLAGS register. We’re going to quickly cover the status flags (indicated by S) and we’ll conclude with a few examples of how these relate to code at a high level.

The Zero Flag (ZF – Bit 6)

The zero flag is a status flag that is only set if the result of an arithmetic operation is 0. There are certain conditional instructions (meaning they are based on the state of status flags) that will only be taken or perform an operation if the zero flag is set. We introduce all of these in the Accelerated Assembly part of this series.

To provide a high level example take the following code:

int Integer0, Integer1;

Integer0 = 510;
Integer1 = 511;

if ((Integer0 - Integer1) == 0)
    printf("ZF set, execute this block.\n");
else
    printf("ZF not set, execute this block.\n");

We have two integers, one set to 510 and one to 511. If we subtract the two we wind up with the obvious answer of -1. To start easing you into thinking about things in terms of assembly, consider these two integers being stored in some general purpose register. For this exercise, they can’t be in the same register. Now, we know we have two registers, one with 510 stored and one with 511. We’re going to perform a subtraction on them and then compare the result to 0. Since the result of the subtraction isn’t discarded we’re going to use it to determine which block of code to execute (the printf’s). In this instance the processor will execute the respective instructions to complete the operation, set the zero flag in the status register accordingly, and then execute some conditional instruction that will pick the appropriate block to execute based on the status of the zero flag!

Let me translate this to assembly to help you wrap your head around it.

    mov rax, 510
    mov rbx, 511
    sub rax, rbx
    jnz zf_not_set
    lea rcx, zf_set_string
    call printf
    jmp end
zf_not_set:
    lea rcx, zf_not_set_string
    call printf
end:
    ret

Alright, don’t run just yet. This is a lot simpler than it looks. If we recap the logic I walked through above, you’ll remember we put 510 and 511 in general purpose registers – I chose RAX and RBX for simplicity. Then we perform a subtraction using the sub instruction. Now here’s what’s interesting, assuming you know very little if anything about assembly: some instructions will set the status flags based on their result, saving the need for some compare instruction like cmp. If the result of the sub instruction is 0 it will set ZF in the EFLAGS register. Neat! Now, the jnz instruction might be obvious to you – the mnemonic simply expands to jump if NOT zero. This means that if the zero flag is 0, or not set, the jump will be taken. The label zf_not_set is what’s called the jump target. This means that if ZF is not set (the result of the sub was not 0) the instruction pointer (RIP) will be set to the first instruction underneath our label zf_not_set. Otherwise, continuing execution, we load the address of a string into rcx using lea, call our printf function, and then return from the function.

I realize there are a number of things not yet covered here, but I’m hoping as I trickle information to you that when the details are covered later you’ll be able to draw back to our examples and clear up any confusion you may have! We haven’t covered the stack, or calling conventions (ex: lea rcx, some_string), but the article following this one covers both in great detail and explains how various instructions affect the stack and how arguments are passed to functions on invocation.

Speaking of instructions and things unknown, there’s a reference for all the instructions in the Intel 64 and IA-32 architectures. This manual includes a 60-page rundown of the instruction format (you don’t have to read that if you don’t want), and an alphabetized reference of every instruction with its various forms, a description, pseudo-code, flags it affects (if any), and exceptions it can generate. This is your bible for this series. If we encounter an instruction you’re unfamiliar with I urge you to open the instruction manual and look it up, read the description and relevant information, and you’ll begin to develop an understanding of assembly where you’ll know which flags are affected like the back of your hand.

I’ll cover key instruction sequences once we get into the disassembly and debugging sections, so for any details I may leave out: consult the instruction manual.

If you haven’t noticed, learning assembly is very hands-on and learn-as-you-go. There’s no way to learn every possible instruction prior to working with it, so just know that if there are things you don’t know, there are still things people who have been doing this for a decade don’t know. Before we switch gears to a whole other topic entirely, let’s cover the last two flags that we’re going to be concerned with.

The Sign Flag (SF – Bit 7)

The sign flag is used in signed operations, and will be set equal to the most significant bit of the result – which happens to be the sign bit of a signed data type: 0 is positive and 1 is negative. I put together another example with the assembly translation below. Remember, your exercise is to start thinking of things in terms of assembly.

int Integer0, Integer1;

Integer0 = 1;
Integer1 = 1000;

if ((Integer0 - Integer1) < 0)
    printf("SF set, execute this block.\n");
else
    printf("SF not set, execute this block.\n");

return 0;

In this example we’re taking two signed integers and subtracting the larger one from the smaller integer to change the sign. I constructed the condition in the if statement to massage the compiler to place a jump that will be based on the sign flag. As in the previous example, if the SF flag is set that means that the result of this operation is negative (because 1 in the most significant bit of a signed integer indicates negative.) Below is the assembly translation – walk through it.

    mov rax, 1
    mov rbx, 1000
    sub rax, rbx
    jge sf_not_set
    lea rcx, sf_set_string
    call printf
    jmp end
sf_not_set:
    lea rcx, sf_not_set_string
    call printf
end:
    xor eax, eax
    ret

Much like the other translation we see two registers used, we subtract the larger value from the smaller – which causes the SF flag to be set if the result is negative – and then a jge is encountered. This instruction expands to jump if greater or equal. If the result is greater than or equal to zero we go to the target of the jge instruction; otherwise we execute the instructions directly following it and perform a jmp (unconditional jump – meaning always taken) to the return sequence.

Note: There are about 40+ jump instructions that are based on the status of different flags. These are referred to as 'jump if condition is met' instructions, or Jcc instructions for short. We'll be calling them Jcc instructions from here on.

The Carry Flag (CF – Bit 0)

The carry flag is set if the arithmetic operation generates a carry or borrow out of the most significant bit of the result. The flag indicates some overflow condition for unsigned integer operations, such as when you add 1 to the maximum supported value.
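
As a tiny sketch of that add-one-to-max case (using a 32-bit register for brevity):

mov eax, 0FFFFFFFFh   ; the maximum unsigned 32-bit value
add eax, 1            ; eax = 0, CF = 1 (carry out of the most significant bit)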

Here’s a challenge for you: write a C program that executes a block if an overflow has occurred, and generate an assembly listing to check and see if you accomplished this. To determine if the Jcc instruction generated is based on the carry flag you’ll have to consult the instruction manual.

Once you’ve done that, keep reading on to the next article of this series where we cover virtual memory, and the architectural details of memory addressing in the Intel 64 architecture.

Conclusion

In this article we’ve gone over the compilation process, albeit in very little detail, as well as learned where the registers actually come from and what those general purpose registers are. You had the chance to look at an assembly excerpt from a simple C program and see the madness that is assembly. Following the general purpose register discussion I detailed a few status flags in the EFLAGS register that will be important and consulted often during RE projects. I know you may not be familiar with some of these concepts, which is why I’ve introduced them. By the end of this section of the series teaching the fundamentals of the architecture and operating system we’re working under, you’ll be able to take on a variety of projects and not be overwhelmed by the load of information poured on your head. You’ll be well equipped to deal with whatever comes at you.

Learning should be fun, but also informative, so in this series I will explain where and why things happen (much like discussing the register file before introducing registers) because I believe that knowing why things are the way they are can greatly increase understanding versus me just stating what is what with no explanation. I find that style of teaching or exposition annoying; I want the details and I want readers to have the details. If you’re new to this you don’t need to worry, because the articles are only going to get longer and provide more detail than you’re most likely willing to put up with. Like vegetables, the details may suck but they’re good for you. Read ’em.

I encourage you to create your own assembly listings for simple C/C++/Rust programs, and dig through them using the instruction manual and try to understand the logic. You don’t have to do anything too fancy just enough to get a taste for assembly. After all, assembly is the language you’re primarily going to be working with until we start developing tools to simplify our reverse engineering process.

Check the recommended reading section and then use the sidebar to navigate to the next part of this series!

Recommended Reading


Applied Reverse Engineering: The Stack

By: Daax Rynd

Overview

This article is written for new reverse engineers who have a hard time understanding the stack, its layout, how it works, and the various requirements for proper function. It can be a confusing concept to wrap your head around at first, but after reading this article you should have a very deep understanding of how stacks work and their usage in a 64-bit architecture. Knowing how the stack works is a topic fundamental to reverse engineering. Various types of obfuscation are stack-based and can be daunting to deal with if the operator doesn’t understand it. It’s also useful for circumventing checks that malware may perform such as return address checks (to validate that a call came from a trusted source). Overall, you’ll find learning about the stack invaluable – even if only reviewing your understanding.

This article is written to cover stacks on the 64-bit Intel architecture, and the calling conventions used in the x64 architecture. The calling convention explored is the Microsoft ABI. If you’re not sure what an ABI is, we’ll cover it in this article. There are a variety of different conventions depending on platform, so be sure to validate against your platform’s documentation. All examples were created and analyzed on Windows 10 x64 Version 1903 Build 18362.

Note: If you're unfamiliar with memory or how memory is organized I'd suggest consulting the recommended reading for more information. Understanding memory will help in understanding this article.

The Stack

— What is the stack?

In general, a stack is a contiguous array of memory. It’s also sometimes referred to as a structure based on the last-in-first-out (LIFO) principle. A contiguous array is simply a sequence of objects in a linear structure format, accessible one after the other. This stack structure is bounded at the bottom, meaning that all operations are performed on the top. There’s a simple analogy to remember this – a weapon magazine. While the analogy has limitations we’ll discuss, it goes like this:

  1. Bullets are inserted from the top of the magazine. (LIFO)
  2. Only the top bullet is accessible to the operator. (top of stack)
  3. To load a new bullet you have to push a new bullet into the magazine, thus the new bullet is the new top. (push)
  4. To remove the top element you have to shoot the weapon. (pop)
  5. You can check if the magazine is empty. (check if stack is empty)
  6. You can use a new magazine, or reuse the same one. (creating, adding elements back)

This analogy is rather interesting since the usual one is a stack of plates. You can visualize how the stack is laid out; however, there are some issues with this analogy. The first of which is that in modern systems you can access certain stack locations (in memory) if you know the offset from the current stack pointer. There are a few others, but we haven’t covered the relevant material so I’m not going to confuse the example. It’s a good representation, regardless.

Now that you have a general idea of how the stack works, let’s get into the dirty details regarding its layout and the various registers that control the structure.

— Stack Layout

Before we begin building a view of the stack it’s important to know how it’s managed. If you recall from the previous article, when you learned about general-purpose registers there was a register named RSP. This register, called the stack pointer, manages the current stack in 64-bit architectures. The stack is also managed by a segment register called the stack segment, or SS for short. The processor will reference SS for all stack operations (which will be discussed in just a bit). The stack is an awkward structure to think about because it grows down in memory when items are added, and shrinks up when items are removed. The stack pointer will always point to the top of the stack, unless by some annoying trickery a tool (maybe a form of obfuscation) uses it as a counter or some other generic use. If that’s confusing, don’t worry – the diagrams below will help you get a better idea of what is going on.

Below is a diagram that we’ll build upon as we discuss different topics that are related to the stack, for now we just know it’s a contiguous array of memory where RSP points to the top and it’s always referenced through the SS register. (If you’re not familiar with segment registers, check recommended reading – worthwhile to know.)

Alright, so we have this graphic. It’s just an empty stack, and if you recall I said the stack grows down in memory when items are added. Take a look at the image: you’ll see that our RSP points to the top of the stack at the highest address where the stack was allocated (it can be located anywhere in a process’s address space, just know that it’s at the upper boundary of that allocation). To add to this illustration, let’s talk about the two instructions that affect the stack: push and pop. When software needs to place an item on the stack it performs a push, so let’s adjust our graphic after placing two values onto the stack with two consecutive push operations.

As you can see our stack has two new values. If you remember from the analogy earlier, the stack is a LIFO structure. This means the last element pushed onto the stack is the first to be removed by its opposite operation. This also means that the 4 was pushed first and 12 next. The corresponding assembly would have looked like this:

push 4
push 12
...

You’ll also see that the RSP register has been adjusted. This is because when items are pushed onto the stack the processor decrements the RSP register and writes the data to the new top of stack. This is an example of the stack growing down as items are added. This also means that to access either of those values you could do one of two things: offset from RSP, or pop the items off the stack. Let me illustrate both operations and give some details on the pop instruction.

To access the stack elements by offsetting from RSP you have to know how the stack works. Elements pushed earlier sit at higher addresses (adding elements makes the stack grow down), meaning we’ll have to add an offset to our stack pointer to acquire that information. Let me adjust the diagram to show how we can do this.

To access the element on top of the stack – the value 12, the last one pushed – we’d have to offset 0 bytes from RSP. If you’re wondering why 0 bytes: we’re on a 64-bit architecture, so the push and pop instructions decrement/increment the stack using a 64-bit width. For example, if I want to store the value 12 in a general-purpose register like RBX by offsetting from RSP I’d write something like this in assembly:

mov rbx, qword ptr ss:[rsp]

That’s a very specific line. I mentioned earlier that all stack references are done through the stack segment register (SS). That’s exactly what this code is doing: performing a store into RBX from the stack at RSP+0h – which is the value 12. We use the ss:[...] to tell the processor “hey, this is a stack reference.” The same operation would apply for retrieving the value 4 from one element deeper in the stack, just using a different offset – the offset would be 8. We’ll cover why this is important to understand when we get into function frames and usage of the base pointer register.
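
Putting both accesses side by side (a sketch matching the diagram above):

mov rbx, qword ptr ss:[rsp]      ; rbx = 12, the element at the top of the stack
mov rcx, qword ptr ss:[rsp+8]    ; rcx = 4, the element one quadword deeper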

That’s one way we could retrieve the values off the stack; however, the simpler and faster way is to use the pop instruction. There are different scenarios where usage of one method is preferred over another, but for the sake of this example it is simpler and faster to use pop to get these values off the stack. I’ll adjust the graphic to demonstrate how acquiring the values from the stack could be done using pop. We want to store the value in a general-purpose register as well, so for the sake of consistency we’ll reuse RBX. To do this we’ll have to execute pop twice, and the view of the stack will be quite different.

pop rbx		; rbx = 12
pop rbx		; rbx = 4

We have to pop twice since the stack is LIFO and the value 4 was the first element pushed onto the stack. On the first pop we specify a register that the value on the top of the stack will be placed in – RBX. After executing the first pop instruction the RBX register holds the value 12. After the second, RBX equals 4. That’s what we wanted! When we perform a pop the topmost element of the stack is removed, so what does the stack look like now?

It’s empty! And the stack pointer now points to the top of the stack, as it did in the very beginning. This is because when items are popped from the stack the stack shrinks up – toward higher addresses. To adjust RSP the processor reads the item off the top of the stack, places it in the location specified, and then increments the stack pointer. What’s interesting to note about the instructions that operate on the stack is that they’re not limited to immediate values like 4 or 12; they can use general-purpose and segment registers, or memory operands. We won’t demonstrate those right now, but we’ll come in contact with them throughout the series.

Before we move on I have to clarify that a program or operating system can set up many stacks. The limitation is based on the maximum number of segments and available physical memory, but you likely won’t encounter more than one stack per task in the system. This means every process and thread can have more than one stack, though most usually have only one. However, only one stack is available at a time regardless of how many exist, and the current stack is the stack referenced by the SS register. In addition to push and pop there are a few other instructions that operate on the stack that we’re going to cover next. Those instructions are call and ret. We’ll cover these instructions, compare and contrast high-level/low-level examples, and then move on to discuss calling conventions. These next few sections are detail-heavy, but required knowledge for reverse engineering.

Calling and Returning

We’ve encountered the call instruction before, in the first article, as part of the assembly excerpt generated by our example program. We saw some operations before it, and some operations after it. We also ran into the ret instruction as well at the very end of the function excerpt. These two instructions will be encountered innumerable times when reverse engineering, and they operate on the stack, but how does each interact with the stack? Both of them operate differently, but use some of the same components such as the stack pointer. We’re going to address everything you need to know about both of these starting with the call instruction.

— The Call Instruction

If you’ve taken a look at the Intel Instruction Manual and attempted to decipher the meaning of the instructions in the previous article’s excerpt, then you’ve likely run into this instruction in the manual. Its description and the various opcodes and nuances about prefetching instructions, etc., may have been confusing. I’m only going to cover what’s relevant to the 64-bit architecture regarding the call instruction, with some details about its operation in the x86 architecture that carried over.

Before we do that I need to make something known. I mentioned segment registers a section above, and if you’re familiar with segmentation, it doesn’t operate the same for processors in 64-bit mode. All segment registers are zero based except for the GS and FS segment registers. The FS and GS segment registers can still have non-zero base addresses because they may be used for critical operating system structures, and on Windows 10 they are: GS on x64 Windows points to the Thread Environment Block. The FS segment is used for thread local storage or canary-based protection; it could also be configured to point to other data. We’ll encounter the GS segment register quite often in x64 projects since a lot of information can be extracted for usage in anti-debugging or integrity checks. Anyways, in 64-bit mode segmentation is effectively disabled. I say this because when describing how the call instruction works I will make references to segment registers. These segment registers are based at 0 and the limit is ignored (in 64-bit architectures). The reason for this is the memory model used for the 64-bit architecture. If you’re interested in learning about that (and I recommend it) be sure to read the reference in the recommended reading section.

The call instruction has 4 type classifications:

  • Near Call
  • Far Call
  • Inter-Privilege-Level Far Call
  • Task Switch

However, in 64-bit mode, we’re mostly going to be concerned with the near call. It has two opcodes associated, those being E8 and FF, and it’s described as a call to a target relative to the next instruction. The difference between a near call and a far call is pretty straightforward: a near call doesn’t modify the code segment register (CS), but a far call changes the value of CS. We aren’t going to be concerned with far calls since 64-bit operating systems use what’s called the flat memory model, where all memory can be accessed from the same segment. This means there’s no reason to change the value of CS. You remember when I stated that all segment registers are based at 0 (save for FS and GS)? This is how a flat memory model is implemented.

Knowing this actually simplifies the call types we have to learn about. The two sub-types of near calls are near relative calls (E8) and near absolute (FF). Near relative calls are pretty simple to think about. Relative means that the call target address will be relative to the address of the next instruction. To demonstrate this I picked apart a call instruction I found in ntoskrnl.

We know this is a relative near call since its opcode is E8. The following 4 bytes are the call target relative to the next instruction. Let’s suss out how this calculation is performed. We’ll have to extract the target address from the instruction, which turns out to be FF A4 62 96. If you’re wondering why we went backwards, it’s because Intel stores information in little endian. Little endian simply means storing the “little end” first; to rebuild the actual value in big endian, or “big end” first – the normal way to think about the number – we just start from the last byte and work our way forward. Anyways, we should be able to add that relative address to the address of the next instruction and arrive at MiProcessLoaderEntry. What happens is we get a number that isn’t in our address space – what the hell happened? Take a look at that call target again: it starts with FF – it’s negative. To successfully extract this target we’ll take our relative address and sign extend it – meaning, using the sign of the value, extend it to the maximum width (64 bits). The actual relative address is FF FF FF FF FF A4 62 96. If we take that and add it to 1406FB38E (remember, it’s relative to the instruction after the call) we get 140141624. And take a look at this:

We wind up at the entry point for MiProcessLoaderEntry. That’s how near relative calls work! You can extract the target of any call instruction, and that’ll become very useful to us in the future.
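
To make the arithmetic concrete, here’s a small C sketch of the same calculation. The bytes and the call-site address mirror the example above (the call-site address is inferred from the article’s next-instruction address); the variable names are mine:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    // The call as encoded in memory: E8 followed by the little-endian
    // rel32 96 62 A4 FF (i.e. 0xFFA46296).
    uint8_t call_bytes[5] = { 0xE8, 0x96, 0x62, 0xA4, 0xFF };
    uint64_t call_address = 0x1406FB389; // address of the call instruction itself

    // rel32 is a signed 32-bit displacement; loading it into an int32_t and
    // widening to 64 bits performs the sign extension for us.
    int32_t rel32;
    memcpy(&rel32, &call_bytes[1], sizeof(rel32));

    // Target = address of the next instruction + sign-extended displacement.
    uint64_t next_instruction = call_address + sizeof(call_bytes);
    uint64_t target = next_instruction + (uint64_t)(int64_t)rel32;

    printf("call target: %llx\n", (unsigned long long)target); // prints 140141624
    return 0;
}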

The simplest way to identify a near relative call is by looking at the first opcode and its mnemonic. It will always look like this: call some_function. For near absolute calls, where a target address is specified indirectly, we’d see something like call [rbx]. An indirect call specifies the call target in a register or some memory store. A direct call has the call target specified as part of the instruction. This means that near relative calls, as given above, are direct calls and near absolute calls are indirect! It’s a simple way to remember them and identify them, and also how to pull their targets out. That was a lot, I’m sure. Let’s take a break from overloading with disassembly and get back to how the call instruction utilizes the stack.
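
A few sketched forms side by side (the register and displacement are chosen arbitrarily):

call some_function          ; near relative (direct): opcode E8 + rel32
call rbx                    ; near absolute (indirect): opcode FF, target in a register
call qword ptr [rbx+10h]    ; near absolute (indirect): opcode FF, target read from memory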

— Call Stack Operations

Up to this point you learned a lot more than you might’ve been willing to about the call instruction… Good. In this subsection we’re going to get detailed with how the call instruction utilizes a few registers, and the stack to effectively put a bookmark at its location. When referring to a call we are always talking about near calls – we won’t be using far calls at all.

Continuing, when we execute a near relative call the processor does a few things for us. First, it pushes the value of the instruction pointer (RIP) onto the stack. It does this because RIP contains the offset of the instruction following the call. If you need a refresher on what RIP holds, check the previous article. This new stack value is used as the return-instruction pointer (not to be confused with RIP). The processor then branches to the call target address specified by the operand, and if we use our example that operand value was FF FF FF FF FF A4 62 96. This relative offset is encoded as a signed 32-bit immediate value (lots of terms – don’t worry, we’ll cover new ones) that is sign extended to 64 bits and added to the RIP register. This sign extension to 64 bits only occurs in 64-bit mode; if you’re operating in a 16- or 32-bit environment the relative offset is encoded as a signed 16- or 32-bit immediate. The target operands are always 64 bits in 64-bit mode. Remember that, otherwise if you calculate targets by hand you may wind up with wonky numbers.

Similarly, with near absolute calls most everything is the same except that the absolute offset is specified indirectly in a general-purpose register or memory location. That absolute offset is loaded directly into the RIP register – no addition to RIP necessary. Using our old stack graphic, let’s illustrate what the stack looks like after executing a call instruction.

As we can see, the instruction pointer was pushed onto the stack by the processor. RIP, in this example, would point to test eax, eax. This is because RIP always points to the next instruction to be executed.

Note: If you find that I'm repeating myself it's because I want to make sure this sticks with you. It's easy to get confused, so the more you read it the better you remember.

RSP is decremented because when we push items on the stack we grow down in memory (towards lower addresses). Not too difficult, right? Well, how do we get back to the function that just called func? Remember, we’re executing in func after the processor branches to the target. That’s done by executing the ret instruction. I briefly mentioned that the RIP value pushed onto the stack would be later used as the return-instruction pointer, so let’s dig into the return instruction and go full circle.

— The Return Instruction

Return from procedure, the return instruction, or simply ret, is the instruction that transfers program control to a return address located on the top of the stack. That return address was pushed onto the stack by the call instruction, and the return brings us to the instruction following the call in our caller function. Two terms here: the function that executes the call instruction is often referred to as the caller, and the target function is cited as the callee.

The return instruction has a few different opcodes; the majority of the time in 64-bit targets we’ll just see the C3 opcode should we look at the instruction bytes. However, it’s very possible that you’ll encounter C2 as well – that form takes a 16-bit immediate operand specifying the number of bytes to release from the stack after the return address is popped. Let’s talk about the generic return instruction, ret. This instruction performs a near return (to pair with our near call) to a calling procedure. The near return instruction, when executed, pops the return instruction pointer off the top of the stack and into RIP, and resumes execution at that instruction pointer. It’s really that simple. Remember, we can’t directly modify the instruction pointer, but the processor can. As with the call instructions in 64-bit mode, the operation size (meaning the width of memory) for this instruction is 64 bits, or 8 bytes. We’ll talk about what happens if there are issues with the top of the stack when we get to the stack faults section. For now let’s use our stack diagrams of the call and see what happens following a return.
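
The two encodings side by side (the 20h operand is an arbitrary example):

ret        ; C3 - pop the return address off the stack into RIP
ret 20h    ; C2 - pop the return address, then add 20h to RSP (callee releases argument space)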

The diagram above shows the execution of a call instruction and then the state of the stack just before the return instruction. Let’s walk through from left to right. First, we land on the call func instruction. This is a near relative call, and if you recall the RIP prior to being pushed on the stack points to the next instruction (where the address is highlighted in blue.) RIP is pushed onto the stack for use later when a ret instruction is encountered. The current RIP on stack holds the value 5 (the address of test eax, eax). Then the processor branches to the callee (func) where we execute 2 instructions of no interest, and land on our ret instruction. Notice the RIP value, it’s the address of the return instruction. Upon executing this return instruction the processor pops the old RIP value from the stack (labeled as the return address) into the RIP register. Below is what the stack looks like after executing the return instruction.

When the return instruction is executed the top of the stack is popped off and put into the RIP register. We can see that by looking at the RIP value highlighted in blue. It points to the instruction following the call func instruction, as it should. The ret transfers control back to the caller at the specified RIP value, and resumes execution. We can see that the stack is clear again, RSP was incremented since pops shrink the stack up toward higher addresses, and the RIP after control transfer points to the next instruction to be executed.

Wow, that’s a lot of information for just two instructions! It turns out this is just the tip of the iceberg. These examples were very trivialized to help you understand what happens when the two branching instructions execute. I put together a high-level example, so you can see what similar code would look like in C. This is somewhat reduced, but the logic still holds.

int func()
{
              // xor eax, eax
    return 0; // ret
}

int main(int argc, char **argv, char **envp)
{
    int res = 1;

    res = func(); // call func
    if (res == 0) // test eax, eax
                  // jz over_there
                  // ...
                  // .over_there:
        printf("return was 0.");

    return 0;
}

This is sort of what the code used above would look like translated to a high-level language. There are definitely some instructions and terms that you encountered that may not be clear, but remember to consult the Intel Instruction Manual when in doubt, and keep reading. If I’m not explaining it just yet, it’s not vital to know right this second. I just wanted to provide a look at how calls and returns work from the low level. At this point you’ve encountered a variety of assembly instructions that correspond to high-level operations, and I’m hoping that the overall thought process of breaking high-level code down is beginning to set in. Remember to read all the recommended reading sections to maximize your level of understanding.

That being said, we have to continue on and cover calling conventions. We’ll explain from a high-level and then get low. If the above was confusing take a moment to read the recommended reading sections relevant to the topics we’ve covered and then continue on. If you’re feeling good, and understanding everything then read on.

— Calling Conventions and the Microsoft ABI

A calling convention is the method a compiler uses to set up a function’s access to a subroutine. It specifies how arguments are passed to a function, and how return values are – well – returned from the function. It also determines how that function is invoked and the way it creates and manages its stack and stack frames. In short, it’s the way a function call in the compiled language is converted into assembly, and we’re going to look at how the most prominent calling convention – fastcall – does these things. Originally, there were three calling conventions that could be used with C on 32-bit x86 processors – those being stdcall, cdecl, and fastcall; thiscall was later introduced in C++ to support member (including virtual) function invocation. On x64 processors, a 64-bit operating system – notably Windows in this series – simply uses fastcall for all 64-bit code. If you run a process in compatibility mode under WoW64 you’ll encounter the predecessor calling conventions mentioned above. We’re only focused on fastcall since we’re going to be operating with 64-bit targets.

If you’ve been programming for a while you know how a function is declared and defined, and how arguments are passed. Let me create a function that does some simple arithmetic and then we’ll get a little more technical with the calling convention.
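
The function in question, reproduced here as text since the original article showed it as an image (it also reappears in the full example below):

int __fastcall sub(int a, int b)
{
    return a - b;
}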

This function returns the difference between the two arguments. You’ll notice the __fastcall keyword used, this is to explicitly declare the calling convention. When compiling a 64-bit program with MSVC, it’s implied and always used. I just put it there to be explicit. It’s important to note that this calling convention is not standardized across all compilers, some may use different methods of passing arguments to the function or managing the stack and frames. This brings us to our next discussion, the Microsoft ABI.

An application binary interface, or ABI, is the interface between a program and the OS/platform. It provides a set of conventions and details such as data types, their sizes, and alignment requirements; calling conventions; object file format; etc. The ABI is platform dependent, meaning it can vary to some degree from compiler to compiler. The ABI is a primary component in how the generated assembly operates, meaning that code generation (part of the compilation process) must know the standards of the ABI. What we’re going to consider from the ABI today is the layout of the stack frame for a function call, how arguments are passed, and how stack cleanup is performed. This is all implemented by the assembly instructions that reserve space, store certain registers to create a “frame”, and copy values into that reserved space. If this is new to you, don’t sweat it. When you’re writing in a high-level language such as C or C++ you don’t really need to know about the ABI. However, when you begin to work with and analyze assembly it’s important to use the correct ABI, or be able to identify the ABI, for the components of interest.

Programs compiled for a 64-bit Windows OS will use its x64 ABI. This ABI uses a four-register __fastcall calling convention by default. We’re going to break the entire convention down and determine how it affects our program’s stack during function calls. We’ll be using our sub example above.

— Fast-call Calling Convention

The __fastcall convention uses four general-purpose registers to pass integer arguments to the callee. The registers are rcx, rdx, r8, and r9; in order. If you need a refresher on the general-purpose registers go to the previous article and save the diagram of registers. Using our subtraction example above this means that when sub is called the generated assembly instructions will place the a value into rcx, and the b value into rdx. The other two, in this instance, are unused. Let’s take our small program and translate it to assembly to help tie this idea together.

int __fastcall sub(int a, int b)
{
    return a - b;
}

int main()
{
    sub(8, 4);

    return 0;
}

The first thing we do in main is call sub. The two arguments are 8, and 4. Let’s get the generated assembly and take a look at it.

//
// Assembly listing for main()
//
mov qword ptr[rsp + 24], r8
mov qword ptr[rsp + 16], rdx
mov dword ptr[rsp + 8], ecx
sub rsp, 40
mov edx, 4          ; second argument
mov ecx, 8          ; first argument
call sub            ; function call
xor eax, eax
add rsp, 40
ret

//
// Assembly listing for sub(a,b)
//
mov	eax, edx
sub	ecx, eax
mov	eax, ecx
ret

I’ve reduced the disassembly to be a little simpler, but not by much. Let’s ignore the first 4 lines of the main listing and start analysis at mov edx, 4. I mentioned before that arguments are passed in rcx, rdx, r8, and r9. They’re passed from left to right as well, meaning that the first parameter will always be in rcx, and so on. Prior to our function call where sub is executed we see that our values from the C program, 8 and 4, are placed in their respective registers as part of the calling convention. 8 is placed in ecx and 4 in edx – their 32-bit counterparts, since the full 64 bits aren’t required when the data type (an int) is 32 bits wide. If the type were unsigned long long then the values would’ve been placed in the register partition that matches the width, rcx and rdx.

Following the storage of our arguments in the registers used by the convention, we execute the call instruction. The call instruction pushes the instruction pointer – which is pointing to the instruction after the call – onto the stack and sets RIP to the call target. Jump down to the assembly listing for the sub function; you’ll immediately see a mov performed to store the value of edx in eax. The next instruction performs the subtraction (this is not the same as calling our function sub – this executes the sub instruction, part of the ISA). The sub instruction subtracts the second operand from the first and stores the result in the first operand. Thus, we see sub ecx, eax which translates to ecx = ecx - eax. If you’re wondering, we could’ve removed the store of edx into eax and used edx in place of eax.

For the x64 ABI, and most others, the return result is passed back to the caller through the rax (eax, in this instance) register. Remember the result is stored in ecx upon completion of the sub instruction therefore to return the result back to the caller, as our C program specifies, we store ecx in eax and return. The return instruction pops the return-instruction pointer (the address of instruction following the call) into the current RIP register, and transfers control back to the calling function at xor eax, eax – because that was the instruction following our call sub. The instruction afterwards performs a stack clean-up, which we’ll cover in just a minute, and then returns.

How was that? Simple enough, right? We’re gonna add a little more complexity now by further describing the convention.


When passing integer arguments we go through the four registers previously specified. Though we don’t always pass integers to functions – sometimes we pass floating point arguments, structures, and so on. Any argument that doesn’t fit into a supported size of 1, 2, 4, or 8 bytes has to be passed by reference; this is because an argument is never split across multiple registers. All floating point arguments and operations are done using the XMM registers. We didn’t talk about those, but they’re just like the general-purpose registers – there are 16 XMM registers and computations are performed on them using instructions defined in the ISA. The XMM registers are named with their index as XMM0-XMM15. We will cover these when necessary, as we won’t encounter them much until we get to our game reversing project.
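
A quick sketch of how this plays out for a mixed signature (the function below is hypothetical, not from the article). On this convention the register slot is chosen by argument position, so an integer in position one travels in rcx and a floating point value in position two travels in XMM1:

// Hypothetical prototype for illustration:
double scale(int count, double factor);

// A call like scale(8, 2.5) would be set up roughly as:
//   mov ecx, 8              ; argument 1 (integer) -> rcx slot
//   movsd xmm1, [factor]    ; argument 2 (floating point) -> xmm1 slot
//   call scale              ; the double result comes back in xmm0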

So what does any of this have to do with the stack? Well, prior to executing a call instruction and as part of the convention, it is the job of the caller to allocate space on the stack for the callee to save the registers used to pass arguments. This space that’s allocated by the caller is known as the shadow store, spill space, home space, or shadow space. To be rigorous with our terminology we will always refer to it as the shadow store. The space allocated is strictly the maximum size supported (8 bytes) times the number of registers used to pass arguments (4).

If you look at our main assembly listing above you’ll notice the instruction sub rsp, 40. This is the allocation of our shadow store plus a little something else, yet 8 * 4 = 32; so, what gives? Well, the stack must always be aligned on a 16-byte boundary, meaning the address of the top of the stack must be a multiple of 16. You might be thinking that 32 is a multiple of 16, but remember that an 8-byte return address was pushed onto the stack, so with sub rsp, 32 our stack would have 40 bytes allocated. 40 is not a multiple of 16, so to combat this we allocate an additional 8 bytes, thus giving us sub rsp, 40.

To simplify: prior to a function call the stack must always be aligned on a 16-byte boundary.

Let me reuse the main assembly listing above to illustrate what I just addressed since it can be somewhat confusing.

//
// Assembly listing for main()
//
sub rsp, 40         ; 32 (shadow store) + 8 (alignment pad) => 40; this way the 8-byte return address pushed by call keeps the stack aligned
mov edx, 4          ; second argument
mov ecx, 8          ; first argument
call sub            ; function call
xor eax, eax
add rsp, 40
ret 0

To reiterate, we allocate space on the stack for the registers used (32 bytes) plus an extra 8 to make sure that the stack is aligned when the call instruction is executed and the return address (another 8 bytes) is pushed on the stack. Also, if you’re wondering why we use sub rsp, X to allocate space on the stack remember that the stack grows down in memory – toward lower addresses. To reclaim this allocation when the function finishes execution we use add rsp, X to shrink the stack up to its original state prior to the call. The reclamation of stack space must be the same size as the allocation, otherwise you wind up with a misaligned stack and invariably a crashing program. If this is still confusing for you, I made a graphic to illustrate this process.

This shows what happens when we only allocate space for our shadow store prior to a function call. We wind up with a misaligned stack. The solution is to add 8 bytes to our allocation as alignment padding, ensuring that the stack is 16-byte aligned prior to execution transfer.
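If you like sanity-checking this arithmetic while reversing, here’s the same reasoning as a one-line helper; the macro is mine and purely illustrative, not something the compiler emits:

// Round x up to the next multiple of a (a must be a power of two).
#define ALIGN_UP(x, a)  (((x) + ((a) - 1)) & ~((a) - 1))

// 32-byte shadow store + the 8-byte return address, rounded up to 16,
// minus the return address the call instruction pushes for us:
// ALIGN_UP(32 + 8, 16) - 8 == 40, the operand of sub rsp, 40.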

I’m sure you’re tired of my diagrams at this point, but unfortunately there’s a little bit more to cover. We have yet to cover stack frames, and how data larger than 8 bytes is passed. If you’ve made it this far, keep going. You’ll have a better understanding of the stack than most just starting out, and that’s what I’m going for.

— Stack Frames

So far we’ve seen how the calling convention passes arguments, how it maintains stack alignment across function calls, and how it allocates space for register storage for the callee. Now we’re going to break down how a stack frame is created and used. A stack frame is simply a frame of data that gets placed on the stack. In our example, we’re talking about a call stack frame, which represents a function call and its argument data. An important distinction is that the shadow store allocated is not part of the call stack frame. The call stack frame starts with the return address being pushed onto the stack, followed by storage of the base pointer, and then space is allocated for local variables. In some instances, when a function is small enough and no locals are used, we wind up not needing a stack frame and instead opt to use registers to perform a quick calculation, such as in the sub function. A good majority of the time you’ll encounter a stack frame, but it’s good to know that they’re not always required.

I’ve constructed a more in-depth example that generates a listing that creates a stack frame and uses it to address local variables and perform some modifications. It’s a bit more involved, but I’m sure you’ll be able to catch on.

#include <stdio.h>

void do_math(void)
{
    int x = 10;
    int y = 44;
    int z = 36;
    int w = 109;
    int a[4] = { 1, 2, 3, 4 };

    a[0] = x * a[0];
    a[1] = y * x;
    a[2] = a[1] * z;
    a[3] = w * a[2];

    printf("%d\n", a[3]);
}

It’s just an arbitrary amount of math on an array, and some locals. No significance. I just needed to massage the compiler into giving me the assembly listing I wanted. Speaking of which, it’s a bit of a mess, but we’ll work through it.

//
// Assembly listing of main()
//
mov qword ptr [rsp+24], r8
mov qword ptr [rsp+16], rdx
mov dword ptr [rsp+8], ecx
sub rsp, 40
call do_math
xor eax, eax
add rsp, 40
ret 0

//
// Assembly listing of do_math()
//
push rbp
mov rbp, rsp
sub rsp, 60
mov rax, qword ptr ss:[rbp+30]
mov qword ptr ss:[rbp-40], rax
mov qword ptr ss:[rbp+18], r9
mov qword ptr ss:[rbp+28], r8
mov qword ptr ss:[rbp+10], rdx
mov qword ptr ss:[rbp+20], rcx
test rdx, rdx
jne 7FF691A34607
call 7FF691A36294
mov dword ptr ds:[rax], 16
call 7FF691A36174
or eax, FFFFFFFF
jmp 7FF691A34651
test r8, r8
je 7FF691A345F2
lea rax, qword ptr ss:[rbp+10]
mov qword ptr ss:[rbp-38], rdx
mov qword ptr ss:[rbp-28], rax
lea r9, qword ptr ss:[rbp-38]
lea rax, qword ptr ss:[rbp+18]
mov qword ptr ss:[rbp-30], rdx
mov qword ptr ss:[rbp-20], rax
lea r8, qword ptr ss:[rbp-28]
lea rax, qword ptr ss:[rbp+20]
mov qword ptr ss:[rbp-18], rax
lea rdx, qword ptr ss:[rbp-30]
lea rax, qword ptr ss:[rbp+28]
mov qword ptr ss:[rbp-10], rax
lea rcx, qword ptr ss:[rbp+30]
lea rax, qword ptr ss:[rbp-40]
mov qword ptr ss:[rbp-8], rax 
call printf
add rsp, 60
pop rbp
ret

The listing for main is easy enough. It’s actually storing the arguments for main in its shadow store. You can identify storage in the shadow store by looking for calling convention registers being stored in [rsp+8] or higher. You won’t see it at [rsp] since that’s where the return address (what ret pops into RIP) is stored. Modifying that can cause a lot of issues. Alright, we already covered what happens before we call a function, and right after we transfer control to that function; so now we’re going to look at how the compiler builds stack frames to allow for local storage in functions. Let’s look at the assembly listing of do_math.

The first line pushes a general-purpose register onto the stack, rbp. This register is referred to as the base pointer, and its purpose is normally for use in stack frames and addressing local variables in a function. It’s pushed onto the stack to preserve its value; most pushes you find preceding actual function code are there to preserve register values. We’ll talk about why this is important soon. The next line stores the value of the stack pointer in the base pointer register. This means that both rbp and rsp point to the top of the stack. Then the function allocates stack space using sub rsp, 60h; this is our local variable storage space. Note that the operands in these instructions are in hexadecimal, so 60h is 96 bytes, and 96 is a multiple of 16, keeping the stack aligned. If we look at our C excerpt you’ll notice that we have a total of 8 integer variables, as well as a 4-character format string for printf, meaning the locals themselves only require 36 bytes. The compiler allocates more than that because it must round the allocation up to preserve 16-byte alignment and, within this same allocation, it reserves the 32-byte shadow store for the upcoming call to printf. The remainder is padding that simply goes unused.

This sequence of instructions actually has a name: the function prolog. Any function that allocates stack space, calls other functions, or preserves registers will have a prolog. The epilog is the sequence of instructions that cleans up any stack allocations and restores preserved registers prior to returning. Anyways, the reason for storing rsp in the base pointer register is so that the base pointer can be used to address values in the stack storage designated for local variables. A visual really helps solidify this concept; you already have a good idea of how the stack looks up to this point, so here’s the stack’s state following the function prolog.

That’s quite an allocation. I switched the padding location because in reality it doesn’t matter; it’s just part of the allocation for the shadow store and can live anywhere in that region. I put a label for where rbp points after the mov rbp, rsp instruction. Then we perform the sub rsp, 60h, which allocates space for 12 8-byte stack slots. The brackets around the labels for those cells indicate that a dereference of rbp minus that offset will access that slot. It makes sense: rbp is rsp before the allocation, the stack grows down so the allocation takes rsp toward lower addresses, and to access those lower addresses we take rbp and subtract. We’re gonna take another look at our assembly listing for the do_math function, except I’ve trimmed the fat so we can make a point.

push rbp
mov rbp, rsp
sub rsp, 60
mov rax, qword ptr ss:[rbp+30]
mov qword ptr ss:[rbp-40], rax
mov qword ptr ss:[rbp+18], r9
mov qword ptr ss:[rbp+28], r8
mov qword ptr ss:[rbp+10], rdx
mov qword ptr ss:[rbp+20], rcx

......

call printf
add rsp, 60
pop rbp
ret

At this point you know what rbp is used for. It’s the frame base pointer, meaning we use it to index into the stack to store local variables. The line following our stack allocation is mov rax, qword ptr ss:[rbp+30]. That’s a mouthful, but we can immediately identify a few things: it’s referencing the stack, ss:; it’s using rbp to index into a location; and it’s storing the dereferenced value in rax. Unfortunately, the value it’s dereferencing isn’t shown in our diagram; it’s actually at a higher address than pictured. But we can identify where the next thing is stored: mov qword ptr ss:[rbp-40], rax. If you look at the diagram above, we store the value of rax in the local stack space at [rbp-40].

Note: Positive offsets from RBP access arguments passed on the stack. Negative offsets from RBP access local variables.

The above note applies to normal accesses using RBP while executing a function. This brings me to something new: if the number of arguments is greater than four, the fifth argument onward is passed on the stack. An example is provided below.

fnc(int a1, int a2, int a3, int a4, int a5, int a6);

// x64 calling convention passes args as such:
rcx = a1
rdx = a2
r8 = a3
r9 = a4
a5 stored at [rsp+20h] and a6 at [rsp+28h], the stack slots just above the shadow store at the call site

You’ll be introduced to the various tricks and optimizations that are applied throughout this series. Once complete with the basics you should be able to identify stack uses, and prologues that don’t necessarily follow convention. For the time being though, they will follow convention. To start wrapping things up we’re going to quickly talk about how arguments that are larger than the maximum supported element size are passed.

— Passing Large Arguments

Large arguments don’t necessarily have to be an abstract data structure. In fact, most of the time they’re just strings. Before pulling the example from the first article of the series where you had a brief look at an assembly listing, let’s recall some rules enforced by the ABI. Arguments not of size 1, 2, 4, or 8 bytes are passed by reference. That’s done much as you might expect. Take printf for example: the string could easily be larger than 8 bytes in size since each character is a single byte. When we call printf with a format string and some value, the format string is passed to the callee by reference through rcx, and the value is passed through rdx. Let’s break down a simple example.

printf("Elapsed Time = %u\n", ElapsedTime);

The string clearly holds more than 8 characters, so it is definitely greater than 8 bytes; we’ll have to pass it by reference. ElapsedTime is just some unsigned integer value, so we’ll pass it normally through rdx. What this winds up breaking down to in assembly is this:

mov rdx, ElapsedTime
lea rcx, offset elapsed_string
call printf

You’ve seen the mov instruction before, and call, but lea is new. The lea mnemonic stands for load effective address; the instruction computes the effective address of the source operand and stores it in the destination operand (rcx), which is always a general-purpose register. To think about this in high-level terms, it’s similar to constructing a string and passing the string by reference to a function. The reference to this string will point to the address of the first character in its character array, and printf has code to parse that string and perform whatever operations are necessary to fill in the formatting components. It’s really that simple. If you see lea you’re most likely seeing a reference to some data larger than the supported size for stack elements. Most of the time it’s strings, but you’ll learn as you progress that sometimes it’s data structures.

Conclusion

In this article, you learned a lot about the stack, its purpose, how certain instructions affect it, and how certain interfaces utilize it to generate code in assembly that matches the semantics of your high-level program. We covered quite a bit of material, but there’s still so much more. If you’re interested in reading ahead and learning more about the stack, the calling convention, volatile and nonvolatile registers (what that even means), and so on then check the recommended reading section. The next article will cover exceptions and I plan to batch publish it with the accelerated assembly section. We’ll address the basics of exceptions, how software and hardware generated exceptions occur, the most common exceptions you’ll encounter; structured exception handling; vectored exception handling; and the role the OS plays. The accelerated assembly article will use a hands on approach to teach you a good portion of the x86 instruction set. You’ll encounter conditional jumps, compares, bit shifting, and more stack based operations.

All that being said, this concludes the introduction to the stack. As always feedback, questions, and comments are welcome. If you can’t reach me in the comments here my DM’s are open on twitter!

Recommended Reading


✇Reverse Engineering

Applied Reverse Engineering: Exceptions and Interrupts

By: Daax Rynd

Overview

To continue learning important topics within the OS and architecture, and before diving into the deep end of the application, we’re going to cover a topic that is relevant to reverse engineering and development in general: exceptions and interrupts. In this article, you’ll learn about exceptions and interrupts from the ground up: what they are, the differences in types of exceptions, interrupt delivery, how they’re used to debug, and how we can leverage a variety of exceptions when reverse engineering. As usual, it is assumed that the reader has a background in a compiled programming language like C, C++, Rust, et al. However, if you have experience in Java or some other object-oriented language and are familiar with the concept of handling software-based exceptions you should be able to pick up on this as well. We’ll be referencing the Intel and AMD software development manuals often. It’s important to remember that this series serves as a guide to reverse engineering on a Windows OS, and how to think about reverse engineering. All skills learned can be taken and applied to other systems.

All demos are performed on Windows 10 Version 2004; Build 19035. (This build is not required. Having Windows 10 will be sufficient.)

Disclaimer

All projects are written with Visual Studio 2019, and compiled using Intel C++ Compiler. All optimizations are turned off to reduce the number of obscure assembly listings due to compiler optimizations complicating comprehension. The software exception handling mechanisms (SEH, VEH) researched and documented in later sections are only present on this OS. If you’re on Linux, there will be related links to read about exception handling in the recommended reading section. You will be able to apply the same logic to those of other operating systems.

All that being said, let’s get into it…

Exceptions

— What is an exception?

An exception is defined as an event generated by the processor when one or more errors are encountered while executing a program. There are exceptions predefined in both hardware and software, and both the delivery and handling of these exceptions vary based on the level at which they’re encountered. Speaking from a high level, we know that when an exception is generated the software stops execution of the application and signals that an error condition has been hit. If you’ve programmed in C/C++ you’re maybe familiar with the __try/__except or try/catch blocks. These are the language constructs that utilize the underlying exception handling mechanisms provided by the OS and hardware to handle errors. The code checks for an error condition, throws an exception if the condition exists, and then does a few things to process the exception. That handler can be used in a variety of ways, and it’s useful to know how to reverse engineer them since many malicious actors take advantage of their obscurity in disassembly to perform “covert” operations in the handler. To be able to work backward from a piece of code we have no prior knowledge of, it’s best to understand the mechanisms and how they operate. We’re going to look at the different types of exceptions, as well as interrupts, their delivery, and processing at both a high and low level. Then we’ll break down examples in the OS-specific portion of this article and reverse engineer them to hijack exception handlers and disable anti-debugging mechanisms nested within them. Along the way, we’ll explore how both exception handling mechanisms work, the differences, and their usefulness for various types of indirect operations.

We first need to address some technical details before learning about the higher-level constructs.

— Interrupts

In most modern architectures there are two different methods of interrupting a program during runtime: one is the interrupt, the other is the exception. An interrupt is, at a high level, an asynchronous event that’s usually generated by some external device. On Windows, various interrupts will occur during the execution of a program; exceptions, by contrast, you typically shouldn’t encounter many of during runtime. Interrupts and exceptions are handled in much the same way, in that the current processor stops executing the program and begins execution in the specific event handler. Things begin to get a little hairy when you discuss interrupt and exception handlers, and it’s important to differentiate between the high-level exception handlers used in SEH/VEH versus the OS handlers in the Interrupt Descriptor Table (IDT). I find it’s easier to start with the mind-melting stuff first and then move to the higher-level functionality. Let’s speak generally about interrupts and exceptions and their delivery mechanism – all abstractions aside.

Imagine you’re running a text editor on your computer. You’re taking notes for a meeting, and as you hit each individual key new characters appear in the file. I’m sure you’re aware of how a keyboard works, and how it writes to the actual editor, but have you thought about how the computer knows what key you pressed, that a key was pressed, and how it communicates with the keyboard to say that X was pressed – or being held down in combination with a key? We’re going to take this standard everyday task and break down what is going on under the hood to better understand interrupts.

On each keypress, your keyboard controller – which is just a device that links the computer and keyboard – generates an interrupt. This interrupt is commonly called the keyboard interrupt and is used to signal to the processor that a key has been pressed. The processor stops execution of the task, accesses an entry in the Interrupt Descriptor Table (IDT) and executes the handler associated with that entry in the IDT. Once the execution is complete and the interrupt has been handled properly control is restored to the interrupted task. Remember, that interrupts are typically driven by an external device hence the keyboard example. Before we continue with this example we need to address what the IDT is, how it’s structured, and how the OS leverages it to service these types of interrupts. To do that we have to discuss a few types of tables. This is going to be a little bit of a headache, but if you can manage you’ll come out on the other side a wiser engineer.

— Descriptor Tables

We’re only going to cover the necessary tables, but if you’re interested in learning the ins and outs of every type of descriptor and table associated with the architecture you can refer to the recommended reading. In the Intel/AMD architecture, there are two kinds of descriptor tables. The first is the Global Descriptor Table, and the second is the Local Descriptor Table. We’re going to focus on the first, since LDTs are not typically used within the scope of this discussion. So what is the Global Descriptor Table (GDT)? The GDT is a table that is defined for every system (it’s a requirement) and is used for all programs in the system. It’s simply an array of system descriptor entries. These system descriptor entries can vary in type: there is the call-gate descriptor, the IDT-gate descriptor, and some other types like the LDT and TSS descriptors. The IDT-gate descriptor is the one we’re interested in. To help visualize the GDT’s structure I refactored a diagram from the Intel SDM. Recall that a GDT is an array/table of system descriptors, one of which is the IDT descriptor.

As mentioned above, we have the GDT displayed as a table of system descriptor entries. Each of those descriptor entries is 16-bytes in size while operating in 64-bit mode. To be technically correct I included the unused entry in the GDT, which is the first entry of every GDT you’ll encounter. It’s often referred to as the null descriptor, but its purpose is a lesson for a different post. If we continue analyzing the diagram you’ll notice the box with GDTR pointing to the GDT. This is the GDT register, which contains the base address and limit of the GDT on your system. The limit of the GDT is a multiple of the size of the descriptors inside of it and in 64-bit operation, these descriptors are expanded to 16-bytes, so the limit is 16(N)-1 where N is the number of entries in the GDT. There are differences in the structure and size of descriptors in different processor modes, but that is for reading on your own. The most important thing to note here is the segment descriptor at +48. The interrupt descriptor table that is created for the system has a field in its definition called the segment selector that points to the segment descriptor in the GDT (think of it as an index into the array). The interrupt gate present in the IDT (not pictured) provides an offset into the linear address space where the interrupt service routine exists. This interrupt service routine is the procedure that executes to properly handle the generated interrupt. The IDT is similar to the GDT in that it’s a system descriptor table, and is an array of 16-byte descriptors in 64-bit mode.

There are some minor things to note when we’re thinking of 64-bit processor operation, and they’ll be mentioned below the next diagram.

This diagram shows how the GDT and IDT are used to locate the proper interrupt service routine. You’ll notice there is an interrupt vector referencing the interrupt gate in the IDT. This is because the IDT associates each exception/interrupt with a number referred to as the interrupt vector. This vector is associated with a gate descriptor for the interrupt service routine for that specific interrupt. If you recall, I mentioned the GDT and IDT are similar in structure; however, the IDT uses the exception/interrupt number to index into the table and access the gate descriptor. If we look at this diagram we want to read left to right. We start with an interrupt vector from a generated interrupt being delivered to the IDT; the vector is scaled by 16 to index into the array and get the gate descriptor. After getting the interrupt gate descriptor we use its segment selector to index into the GDT, get the segment descriptor, and pull the base address out. In 64-bit mode, segments are all based at 0, as segmentation is essentially disabled to create a flat linear address space. This makes sense, as each interrupt gate in the IDT is 16 bytes and holds a 64-bit offset to the interrupt service routine. This offset is placed in the instruction pointer (RIP), which then leads execution to start in the interrupt handler.
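To make the 16-byte scaling and offset reconstruction concrete, here’s a hedged sketch of the 64-bit interrupt gate layout described in the Intel SDM (Vol. 3A, Chapter 6); the field and function names here are mine:

#include <stdint.h>

// 64-bit mode interrupt gate descriptor: 16 bytes, matching the scaling above.
typedef struct {
    uint16_t offset_low;        // handler offset bits 15:0
    uint16_t segment_selector;  // selects a code segment descriptor in the GDT
    uint16_t attributes;        // IST index, gate type, DPL, present bit
    uint16_t offset_mid;        // handler offset bits 31:16
    uint32_t offset_high;       // handler offset bits 63:32
    uint32_t reserved;
} idt_gate64_t;

// Address of the gate for a given vector: base from the IDTR plus
// vector * 16. For vector 34 that's idt_base + 544.
uint64_t gate_address(uint64_t idt_base, uint8_t vector)
{
    return idt_base + (uint64_t)vector * sizeof(idt_gate64_t);
}

// Reconstruct the 64-bit ISR offset the processor loads into RIP.
uint64_t isr_address(const idt_gate64_t *gate)
{
    return (uint64_t)gate->offset_low
         | ((uint64_t)gate->offset_mid  << 16)
         | ((uint64_t)gate->offset_high << 32);
}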

Terms to recognize!

I’m using IDT-gate, gate descriptor, and system descriptor interchangeably. Interrupt service routine (ISR) and interrupt handler are the same as well.

The IDT has a limit, and if you’ve read any form of operating system book or looked at interrupt vectors for various service routines for hardware (like a mouse or keyboard) you’ll notice that the IDT only supports 256 interrupt vectors. There can be fewer than 256 vectors, but no more. This is a design decision that has some history and will be in the recommended reading. Now, if all this gave you a slight headache don’t sweat it – we’re gonna go back to our example and walk through the process.

— Interrupt Example Continued

So we’re back, typing away in our text editor, and each time a key is pressed the keyboard controller generates an interrupt to inform the processor that a key is being pressed. Let’s say the interrupt vector associated with the keyboard is 34. We chose 34 because interrupts 0-31 are reserved for Intel/AMD, and this is hypothetical. Take a look at the diagram below to understand the routing of an interrupt from when it’s generated to when it’s serviced.

Let’s walk through this. On key press, a keyboard interrupt is generated and delivered via interrupt vector #34 (it’s just a number). The vector is scaled by 16 because that’s the size of each interrupt gate in the IDT (34*16), then used to index into the IDT to get the proper gate descriptor. The interrupt gate descriptor has a segment selector associated with it, which is used to index into the GDT to find the segment descriptor for this IDT-gate. The base address is 0 because we have a flat linear address space and no segmentation during 64-bit operation, so we take 0 plus the 64-bit offset maintained in the interrupt gate and set RIP to that result. RIP will then be pointing to the proper interrupt service routine (in this case #34), and the processor executes it. Once execution completes the processor restores the context of the interrupted task and resumes execution of that task. This all happens asynchronously and without the loss of program/task flow unless some sort of error was encountered in the service routine. As a general flow of execution, that’s all there is to it.

You’ve gone through the hardest part of this and that’s understanding the architectural layout of and function of the IDT, interrupt gate, and GDT. The rest will build on top of this knowledge, and the great thing is that exceptions operate through the same mechanism. In the next few subsections we’re going to cover the different types of exception classifications, the architecturally defined interrupts, and identify a few that you may already be familiar with but not know it. We’ll continue on by differentiating between sources and then address the OS facilities for exceptions like structured-exception handling and vectored-exception handling. After that, you’ll learn how to modify and access the different records for these facilities and use them to your advantage.

Did you know?

The top 4 bits of the IDT index are the current IRQL. The IRQL is the interrupt request level, and the processor will raise its IRQL if required to properly handle the interrupt.

— Exception Classifications

In an earlier section we described what an exception is and detailed how they and interrupts are delivered to the proper procedure, but we didn’t talk about the different types of exceptions. As we know, exceptions are events generated when the processor determines some error condition has been met while executing instructions. There are three types of exceptions, and their reporting and restoration mechanisms vary based on the type. We’re only concerned with two of the three and will only describe those two below. If you’re interested in the third, and for more details, please see Intel SDM Chapter 6.5.

The first type of exception we’re interested in is a trap. A trap is just an exception that is reported following the execution of a trapping instruction. As an example, let’s consider cpuid and pretend that on normal hardware it is a trapping instruction. This means that once cpuid is executed it will trap into the handler, execute the code in the trap handler, and resume execution on the instruction following the cpuid. Trapping can be sort of difficult to think about, so think of it quite literally as if you’re walking down a sidewalk (the instruction stream) and encounter a hole (the trapping instruction): you fall in and have to climb out on the other side of the hole (the trap handler), and now you’re on the other side of the hole (on the next instruction after the trapping instruction).

For your viewing pleasure I went ahead and illustrated how to think about trapping instructions, in case the description didn’t help. I don’t think the diagram does it justice, but writing is hard work. The next type of exception is a fault. A fault is an exception that requires correction to properly restore control flow. This type is much different from a trap in that when a fault is reported, the state of execution is restored to the state prior to the faulting instruction’s execution. That’s kind of backward to think about for some, so think of it as a game of hopscotch: you have a pattern you have to jump in order, and if you mess up you stop and go back to the start, as opposed to the next instruction like how a trap works. To give a realistic example, consider the following assembly:

mov rax, [rbx]
dec rax
lea rbx, [r9]
mov rax, [rbx] <--- fault

############## FAULT HANDLED ###############

mov rax, [rbx]
dec rax
lea rbx, [r9]
mov rax, [rbx] <--- resumes execution and restores state from beginning of this instruction

We have a series of instructions of no particular importance, but you see the instruction mov rax, [rbx] is causing a faulting exception. This means that after running the fault handler, execution will resume at the beginning of the faulting instruction with processor state restored to what it was beforehand, allowing the instruction to execute again. This is a very common type of exception thanks to a frequently occurring exception called a page fault; we’ll cover that in a bit, however. An interesting note about faulting instructions is that the return address for the fault handler points to the faulting instruction, and this is how control is restored to the erroring task.

Alright, so now that you know the two common exception classifications we can move on to the architecturally defined hardware exceptions and talk about a few that you’ve likely encountered while reverse engineering, debugging, or just running a modern OS.

— Architecturally Defined Exceptions and You

If you recall from earlier, there are a number of interrupt vectors predefined by the architecture; particularly, 0-31 are reserved for Intel/AMD. There is an excerpt below from a project of mine that lists the various exception and interrupt vectors that are architecturally defined. We’ll only discuss 3 of these exceptions in detail; the rest have details that can be found in the Intel SDM Chapter 6.2 Vol. 3A.

{ VEC_0,  DE,  "Divide Error",              Fault,      NO_EC   },
{ VEC_1,  DB,  "Debug Exception",           FaultTrap,  NO_EC   },
{ VEC_2,  NMI, "NMI Interrupt",             Interrupt,  NO_EC   },
{ VEC_3,  BP,  "Breakpoint",                Trap,       NO_EC   },
{ VEC_4,  OF,  "Overflow",                  Trap,       NO_EC   },
{ VEC_5,  BR,  "Bound Range Exceeded",      Fault,      NO_EC   },
{ VEC_6,  UD,  "Invalid Opcode",            Fault,      NO_EC   },
{ VEC_7,  NM,  "No Math Coprocessor",       Fault,      NO_EC   },
{ VEC_8,  DF,  "Double Fault",              Abort,      EC_ZERO },
{ VEC_9,  NA,  "Segment Overrun",           Fault,      NO_EC   },
{ VEC_10, TS,  "Invalid TSS",               Fault,      EC      },
{ VEC_11, NP,  "Segment Not Present",       Fault,      EC      },
{ VEC_12, SS,  "Stack Segment Fault",       Fault,      EC      },
{ VEC_13, GP,  "General Protection",        Fault,      EC      },
{ VEC_14, PF,  "Page Fault",                Fault,      EC      },
{ VEC_15, NA,  "Intel Reserved",            None,       NO_EC   },
{ VEC_16, MF,  "Math Fault",                Fault,      NO_EC   },
{ VEC_17, AC,  "Alignment Check",           Fault,      EC_ZERO },
{ VEC_18, MC,  "Machine Check",             Abort,      NO_EC   },
{ VEC_19, XM,  "SIMD FP Exception",         Fault,      NO_EC   },
{ VEC_20, VE,  "Virtualization Exception",  Fault,      NO_EC   },
{ VEC_21, CP,  "CP Exception",              Fault,      EC      },

The first member of these structure definitions is the vector number. Remember that 0 to 31 are reserved for Intel/AMD definition. The DE/DB/NMI/etc designations are the mnemonics for the exceptions, a shorthand way of identifying them. You’ll sometimes see #GP(0), which is a general-protection fault with error code (0); as you can see, that’s delivered via interrupt vector 13. If you’ve been reading through the list you might notice a few exceptions that sound familiar, most notably the Debug Exception (#DB), Breakpoint Exception (#BP), and possibly the Page-Fault Exception (#PF). We’ll talk about these in this order. If you’re unfamiliar with reverse engineering terminology or the concept of software breakpoints/hardware breakpoints it may be helpful to read this anyways, but you can skip it and come back later after we introduce the tools and their usage in the next article.

Interrupt Descriptor Table Usage

The IDT is used when a hardware interrupt, software interrupt, or processor exception is generated. All of these are noted as interrupts. Software exceptions (excluding INT N instructions) are handled by a high-level facility like SEH/VEH and do not use the IDT.

— Debug Exception (#DB)

This exception behaves differently based on the condition reported in one of the architectural debug registers (DR6). It can act as a fault or trap exception. This type of exception is typically the kind used for hardware debugging, or when enabling a hardware breakpoint on some condition. The conditions could be a data read or write, an instruction fetch, or the typical single step.

Debug Register Conditions

There are other conditions that can be used to generate this type of exception.

It’s interesting to note only two of the conditions result in fault-like behavior, and all the others behave in a trapping manner. The only two faulting conditions are breakpoint on instruction fetch and general-detect condition. Recall that a fault means that state is reverted back to when the faulting instruction was executing, and that a trap sets the state to the instruction after the trapping instruction. You will encounter hardware breakpoints as we dive deeper into RE targets, and this knowledge will come in handy.
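If you’re curious how a debugger on Windows arms one of these conditions, here’s a hedged sketch using the documented thread-context APIs; the helper name is mine and error handling is trimmed:

#include <windows.h>

// Arm a hardware breakpoint that raises #DB when 'address' is executed.
// Leaving the R/W0 and LEN0 bits of DR7 at zero selects the
// break-on-instruction-fetch condition. The thread handle needs
// THREAD_GET_CONTEXT | THREAD_SET_CONTEXT access.
BOOL arm_hw_breakpoint(HANDLE thread, DWORD64 address)
{
    CONTEXT ctx = { 0 };
    ctx.ContextFlags = CONTEXT_DEBUG_REGISTERS;
    if (!GetThreadContext(thread, &ctx))
        return FALSE;

    ctx.Dr0 = address;   // linear address to watch
    ctx.Dr7 |= 1;        // L0: locally enable breakpoint slot 0
    return SetThreadContext(thread, &ctx);
}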

 How could knowledge of interrupt delivery be helpful in a defensive/offensive system?

An interesting behavior in some open-source hypervisors is that they don’t deliver the #DB exception on the proper instruction boundary when CPUID is executed with the trap flag set in the EFLAGS register. The interrupt will be delivered on the instruction following the instruction after CPUID, thus giving a system the ability to detect a virtualized environment.

We’ll go over this exception again later, a brief overview is sufficient for now.

— Breakpoint Exception (#BP)

The breakpoint exception is very common, and if you’ve been programming for some time you’ve likely encountered it when debugging a misbehaving program. This exception has trap-like behavior and is used when a debugger sets a breakpoint. The breakpoint is enabled by replacing the first byte of an instruction with the int 3 instruction. This works because the int 3 instruction is one byte long, which makes replacing and restoring trivial. You’re well aware by now how a trapping exception behaves, but if you’re interested in visualizing this behavior, create a simple hello world project, place a breakpoint on the print statement, and observe how the debugger behaves when you break and resume. Look at the registers during the break and you’ll see that RIP points to the next instruction, not at the breaking instruction as it would for a fault.
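As a concrete, hedged sketch of that byte swap, here’s roughly what a debugger does with the documented Win32 memory APIs; the helper name is mine, error handling is omitted, and real debuggers also juggle page protections and restore the byte on continue:

#include <windows.h>

// Plant a software breakpoint at 'address' in a debuggee opened with
// PROCESS_VM_READ | PROCESS_VM_WRITE | PROCESS_VM_OPERATION rights.
BYTE set_breakpoint(HANDLE process, LPVOID address)
{
    BYTE original = 0;
    BYTE int3 = 0xCC;  // the one-byte encoding of int 3

    ReadProcessMemory(process, address, &original, 1, NULL);  // save for restore
    WriteProcessMemory(process, address, &int3, 1, NULL);     // swap in int 3
    FlushInstructionCache(process, address, 1);               // discard stale decoded bytes
    return original;  // caller writes this back before re-running the instruction
}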

— Page-Fault Exception (#PF)

Modern operating systems take advantage of a mechanism for memory management called the page-fault. This is an interesting exception because it occurs frequently without the user being aware. If you are unfamiliar with paging or virtual memory in a modern operating system I strongly suggest reading about them using the recommended reading links before continuing with this subsection. If you’re familiar with paging and virtual memory, but maybe not how page faults work, then read on! A page-fault is easily classified since the behavior is part of its description – it’s a fault type of exception. A page-fault exception delivers an error code along with it. This error code is placed on the stack and encodes specific information, such as whether the fault occurred because a permission bit was violated or because the present bit was 0, among other conditions. The most typical reason for a page-fault exception is when the processor detects that a memory access was performed on a page that is not present in physical memory. This could be either because the data was paged out to disk by the memory management facilities (typical), or because the page no longer contains data and was freed by the operating system and marked as not present (not typical).
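Since the error code encodes those conditions as individual bits, here’s a small hedged sketch decoding the three low bits per the Intel SDM (Vol. 3A, Section 4.7); the function is illustrative only:

#include <stdint.h>
#include <stdio.h>

// Decode the low bits of a #PF error code. More bits exist (RSVD, I/D, PK)
// but are omitted for brevity.
void describe_page_fault(uint64_t error_code)
{
    printf("present:   %llu\n", (error_code >> 0) & 1); // 0 = page not present, 1 = protection violation
    printf("write:     %llu\n", (error_code >> 1) & 1); // 0 = read access, 1 = write access
    printf("user-mode: %llu\n", (error_code >> 2) & 1); // 0 = supervisor access, 1 = user access
}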

The first scenario mentioned above occurs quite often during normal system operation. Some data was paged out to disk in an effort to free up physical memory for an active task; the task switches and attempts to access the paged-out memory; the address translation mechanism checks the P (present) bit; and if it is 0 the processor generates a #PF. Once the #PF is generated the processor performs the steps detailed further up when discussing the IDT and exception/interrupt delivery, calls the page-fault (#PF) handler, and brings that memory back into physical memory so that the task attempting to access it may read the data properly. If the data is not found, or the page-fault cannot be handled, your system will typically blue-screen and provide information about the error. A common issue is to encounter PAGE_FAULT_IN_NON_PAGED_AREA, which means that there was an attempt to read memory in a region that is exempt from paging where the memory is no longer resident. This results in a page-fault that can’t be handled, so the system saves what it can and performs a bug check (blue-screen). This will most likely not happen with the software we’ll be looking at in this series, but it can (and often does) happen with poorly designed device drivers. We’ll look more into drivers toward the end of this series, and will debug a #PF error in a hardware monitoring driver.

Windows and Exceptions

So far we’ve covered a lot of information, some relevant and some useful for future articles. In this final section, we’re going to discover how Windows implements SEH and VEH to do exception handling in software. This software could be a driver or a user-mode application like Skype. The next section will start off with SEH, the design choices, implementation, and some details from under the hood. We’ll cover VEH in the same way, and then see how they both link with exception internals to handle exceptions. Once we understand how these facilities are used we’re going to look at a few ways to abuse them in an effort to hijack control flow, but not before we tie back to the IDT discussion from earlier with some software interrupt examples. The end of the article will also add some interesting ways to mask behavior through interrupt gate abuse.

— Structured Exception Handling

To start off with structured exception handling we need to address that SEH is used primarily to release resources if the program experiences a loss of continuity. If you’ve done any sort of C/C++ programming you’ve likely used it to handle problems like potential access violations, bad allocations, or determining if an object was found. We’re looking at this as an extension of the C language since it’s specifically designed for C. However, you can use it in C++ but it’s recommended by various sources to use the ISO-standard C++ exception handling facilities. MSDN is a great source for learning more in-depth information about SEH. I’m going to assume you have used SEH to some extent in projects and know the use of the mechanisms __try/__except and __try/__finally. If you’re unfamiliar with what happens when an exception is encountered we’ll walk through that below.

Let’s take a look at an example and then some disassembly of what’s underneath. Don’t worry if you don’t have the tools or gadgets, this is just for a walkthrough and to get your gears turning. I’ll make a brief example using SEH, we’ll walk through the example then toss it in IDA Pro.

#include <stdio.h>
#include <windows.h>

__declspec( noinline ) void ThrowNullPointerDereferenceException( void )
{
    volatile int* ptr = 0x0;
    *ptr = 0x1337;
}

int main( int argc, char** argv )
{
    __try
    {
        ThrowNullPointerDereferenceException();
    }
    __except ( EXCEPTION_EXECUTE_HANDLER )
    {
        printf( "Caught Null Dereference.\n" );
    }

    return 0;
}

In this example, you can see we wrap our potentially exceptional function in a __try block and set our __except handler to catch all types of exceptions. In the ThrowNullPointerDereferenceException we create a pointer and point it at nothing, then dereference it and attempt to write 1337h to the null location. This will clearly generate an exception and the __except block will execute.

That’s typical behavior, and not very interesting. Underneath this high-level abstraction, there are complex processes at work. The most well-known term when thinking about exception handling is the process of stack unwinding. Let’s take a look at the example application in IDA Pro, and see if we can figure out what’s happening.

This is the main function pulled from our disassembler. Let’s go through and first identify everything so that we can begin to understand the assembly somewhat. We have our function start where main proc near is declared. Following that, we have our stack allocation. If you hark back to the previous article on the stack you’ll immediately recognize that sub rsp, 28h is allocating stack space for the shadow store plus return-address alignment. This is done to ensure proper stack alignment. The next instruction is a call to sub_1070. Since we have prior knowledge of the application, we know this is our ThrowNullPointerDereferenceException procedure. We’re going to follow it anyways to take a look.

This is our function that throws the exception. You can see that mov dword ptr ds:0, 1337h is the instruction that dereferences memory location ds:0 and attempts to store 1337h in that location. This line will immediately signal that there is an issue and raise an exception. But how does it know to call the proper handler? At this point, things get complicated as software exceptions take advantage of OS facilities provided via ntdll. The program raises an exception, and the SEH facilities go to work calling the appropriate API to resolve the error and locate the correct handler (if any). We’re going to use the IDA Pro Local Windows Debugger to trace the path of execution and figure out how this works.

We start the debugger and set a breakpoint on our first instruction in main so we can control things from the start. I’m going to step over until I execute the first call instruction, and we’ll see where we end up.

Immediately upon attempting to write to an invalid memory location our program generates an exception and our OS facilities go to work. The function we land in following the execution of the problem instruction is KiUserExceptionDispatcher. This function is responsible for invoking the user-mode SEH dispatcher. When an exception occurs in user-mode the kernel takes control briefly to determine whether the exception occurred in user-mode or not. If it occurred during user execution then the kernel modifies the trap frame that was pushed onto the stack so that when it returns from the interrupt/exception it winds up at KiUserExceptionDispatcher. In this way, software exceptions are trapping exceptions. If you’re not sure why, recall that faulting instructions attempt correction and then re-attempt the problem instruction; when software exceptions occur the kernel modifies the trap frame so that the program resumes execution in the user-mode SEH handling facilities.

Trap Frame Data

A trap frame is a structure passed to the kernel that contains state information of the currently executing program (registers, eflags, etc.) at the time of an exception that way control can be returned after the exception or interrupt has been serviced.

The kernel also places a CONTEXT parameter and an EXCEPTION_RECORD parameter that describe the state of the application when the exception was generated. This allows the handler to do things like read the error code, determine state information like what general-purpose register values were, and so forth. Once continuity is restored in KiUserExceptionDispatcher the exception is processed in RtlDispatchException. You can see a call to that function in the above image. This function is the internal implementation of the user-mode SEH dispatcher. The internals of RtlDispatchException are quite complex but to simplify things it uses the context and exception record parameters to locate what’s called a “function table entry” in a dynamic function table. A dynamic function table is used to store unwind information and the functions they’re associated with in an effort to help the OS properly unwind the call stack. A call stack is a list of functions that have been invoked in the current program. In this example, the call stack currently looks like this after entering RtlDispatchException.
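You don’t have to take the dispatcher’s word for it: an __except filter receives those same two structures through GetExceptionInformation(). Here’s a hedged sketch (the filter name is mine; x64 is assumed since CONTEXT has Rip there):

#include <windows.h>
#include <stdio.h>

// Inspect the EXCEPTION_RECORD and CONTEXT the dispatcher built for us.
LONG print_exception_filter(EXCEPTION_POINTERS *ep)
{
    printf("code=%08lx rip=%llx\n",
           ep->ExceptionRecord->ExceptionCode,  // e.g. 0xC0000005 for access violation
           ep->ContextRecord->Rip);             // where the exception occurred
    return EXCEPTION_EXECUTE_HANDLER;
}

// usage: __except ( print_exception_filter( GetExceptionInformation() ) ) { ... }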

The function that performs the lookup of this unwind information is RtlLookupFunctionEntry. It takes the instruction pointer from the context (the address where the exception occurred) and the image base of the application, and creates what’s called a history table. The details of the history table are a whole post in themselves, so the main takeaway is that this table is used in the next call, to RtlUnwindEx, which begins the unwinding of function call frames.
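That lookup isn’t reserved for the dispatcher; you can call it yourself. A hedged sketch querying the runtime function entry for an arbitrary code address (no history table, minimal error handling, helper name mine):

#include <windows.h>
#include <stdio.h>

void show_function_entry(DWORD64 control_pc)
{
    DWORD64 image_base = 0;

    // Passing NULL skips the history-table optimization described above.
    PRUNTIME_FUNCTION fn = RtlLookupFunctionEntry(control_pc, &image_base, NULL);
    if (fn != NULL)
        printf("begin=+%lx end=+%lx unwind=+%lx\n",   // RVAs from the image base
               fn->BeginAddress, fn->EndAddress, fn->UnwindData);
}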

To simplify a lot of these terms that may be unknown I’ll break this down. A call frame is a frame of information that is pushed onto the program stack that represents a call to a function and any parameter data supplied. From the previous article on the stack, we know that the return address is pushed onto the stack, followed by shadow store, and then there may be allocations made for local variables as well. This makes up the frame. The registers rbp and rsp are typically used to outline the size of the call frame where rbp is the base of the frame and rsp is the top.

Continuing our introspection of the SEH dispatcher, I want to keep things as simple as possible. Following the execution of RtlUnwindEx, a series of other calls are made to calculate the virtual address of the unwind information structure associated with the function. It then performs a call to the registered language handler for the call frame, most often _C_specific_handler, which internally traverses all exception handling records (the __try/__except structures). It uses RtlUnwindEx to find each frame of the unwind in the specific exception records associated with your application and, long story short, unwinds until it encounters an end frame identifying itself as the final point in the unwind operations. It then restores the execution context of the erroring task via RtlRestoreContext and jumps to the exception handler’s address by pushing it onto the stack along with EFLAGS and the current segment selector. It performs the jump by executing an iretq, which pops the handler address into rip, restores EFLAGS, and pops the selector into its respective register.

This is all very useful to know, and there is so much more on the internals of SEH that you can read about from the recommended reading. The most important takeaway is recognizing that these exception handlers are located in your application’s address space. The OS facilities have to get these records from somewhere; they aren’t stored in any magical location. Knowing where these functions look inside of your application will help lead you to their location (if obscured). In our case, there is a .pdata section, otherwise known as the runtime information array, that stores all the unwind info for a specific application. If we take a look at our application again in IDA (where I previously removed helpful comments) you’ll recognize some things I’ve mentioned in the explanation above.

IDA does a great job of providing useful information to us, and in this case, it helps us identify where the unwind information is located. If we follow the references of the __C_specific_handler we wind up in the .pdata section at our UNWIND_INFO structure for this specific function.

The above image displays the unwind information construct that is associated with our main function. There is a header, a structure that contains the offset where the __try block begins and the frame register, an RVA to the exception handler, and a scope table structure. The scope table structure defines the beginning of the try block, the beginning of the except, a handler value which is 1 for EXCEPTION_EXECUTE_HANDLER, and the target address where the except block begins and should be executed. So how does this help us?

— Taking Advantage of Exception Records

If a target application utilizes SEH in this manner and you can locate an area where an exception occurs, you can use what you know to locate the handler(s) and potentially hijack execution. It’s nothing fancy, but it can be used. SEH exploits have been used for ages, and this is one way to modify their targets or redirect execution to code an attacker wants to run. The method mentioned above would be useful for static modification, where an attacker appends code in an executable section of the application, calculates the RVA, and overwrites the target address in the scope table. I don’t want to introduce too many new topics in this post, but in our examples when reversing some low-level anti-cheats we will employ more advanced SEH exploits that involve the TEB and abuse of VEH.

Content Removed

I’ve since removed some information to maintain focus in this article to just the basics. We will cover more in future sections.

— Software Interrupts

At this point, we’ve covered a lot but there is something interesting to know about interrupts – software interrupts (aka “traps”) more specifically. When an interrupt is encountered we know that the CPU halts execution, saves state, and jumps to a predefined location where a handler routine is located. When handling is complete it resumes execution at the next instruction. So what happens when we perform a special software interrupt like int 3? This is commonly known as the debug breakpoint, or debug trap instruction. Pulling from the Intel SDM we note:

The INT 3 instruction generates a special one byte opcode (CC) that is intended for calling the debug exception handler. (This one byte form is valuable because it can be used to replace the first byte of any instruction with a breakpoint, including other one byte instructions, without over-writing other code).

The implementation of breakpoints is super simple now that we know how the IDT works. If an application encounters an int 3 instruction it issues an interrupt signal on vector 3 (breakpoint exception). The int instruction allows a user-mode process to issue signals on a few different vectors. The creation of the IDT and its interrupt gates must be done properly in order to prevent potentially problematic interrupts and exceptions being signaled from a user-mode process; we don’t really want unprivileged code being able to signal a #DF exception. There are specific fields in the interrupt gates that prevent this sort of behavior. The one to note is the descriptor privilege level (DPL) field. This field is checked against the current privilege level (CPL) by the processor to determine whether the interrupt is allowed; if not, the processor raises a general protection fault (#GP). Setting the DPL of an interrupt gate to 0 signifies that only code running at CPL 0 (kernel mode) can safely issue it. I bet you’ve guessed it, but for specific software interrupts like int 3, 2E, or 1, the DPL of their interrupt gates is equal to the DPL of user-mode: 3. When a user-mode process executes int 3 the processor performs the actions just as described in the earlier sections of this article.
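To watch that privilege arrangement from user mode, here’s a hedged sketch: int 3 is permitted at CPL 3 because its gate DPL is 3, and with no debugger attached it surfaces to SEH as EXCEPTION_BREAKPOINT:

#include <intrin.h>
#include <stdio.h>
#include <windows.h>

int main(void)
{
    __try
    {
        __debugbreak();  // MSVC intrinsic that emits the one-byte int 3 (CC)
    }
    __except ( GetExceptionCode() == EXCEPTION_BREAKPOINT
                   ? EXCEPTION_EXECUTE_HANDLER
                   : EXCEPTION_CONTINUE_SEARCH )
    {
        printf("caught int 3 via SEH\n");  // reached when no debugger claims it
    }

    return 0;
}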

You can read more about IDT implementation and the various mechanisms available to preserve system integrity in the recommended reading. For now, we’re done with our rundown of exceptions and interrupts.

Conclusion

In this article, we went over the architectural details of how interrupts and processor exceptions are handled as well as a brief overview of the differences between software exceptions and processor exceptions. You should be comfortable with the idea of the IDT, the classifications of exceptions on Intel and AMD, and the usage of software interrupts in debugging software. You’ve also learned how some software interrupts are delivered and the IDT utilizes specific fields in its descriptor to prevent unprivileged execution of certain interrupt vectors. The next article will be an accelerated introduction to assembly so that we can get going with the targeted reverse engineering projects. I plan to cover the most common instruction sequences you’ll encounter, demystify some of the obscure instructions, and provide many examples of their usage. A lot of the terminology used in the next article will be from the first article of the series, so be sure to brush up on the architecture fundamentals prior to digging into x64 assembly. I hope that this post taught you something new and interesting, and maybe gave you some ideas of your own to investigate. I highly suggest going through the recommended reading and absorbing as much detail as you can.

Thanks for reading, and as always, if any part was confusing, needs clarification, or I missed something in the slew of words, please don’t hesitate to reach out. Best of luck!

Twitter: @daax_rynd

Recommended Reading


✇Reverse Engineering

Applied Reverse Engineering: Accelerated Assembly [P1]

By: Daax Rynd

Overview

In this article you’ll be guided through a course on the x86 instruction set. It serves as a quick fix to the problem of not knowing where to start when learning assembly. We’ll cover the instruction format briefly, and then jump right into the instructions. This is like learning another language, and it may not make sense immediately, but rest assured: if you do this enough, reading assembly listings will become second nature. You’ll be able to decipher the functionality of a code block from a brief excerpt. This page will also serve as a reference in later articles, as all the instructions here are encountered often while reverse engineering some piece of software. If you forget what an instruction does, or the types of operands it’s compatible with, you can refer back to here or the Intel SDM Volume 2.

As always, it is assumed you, the reader, have some experience with a compiled programming language. Any language with the usual control-flow constructs (loops, comparisons, etc.) will count too. The instruction set to be analyzed is one of the most popular, the x86 ISA, and all examples are written for execution on Intel or AMD processors. Let’s not waste any time – there’s a lot to cover…

Introduction

Before continuing, it would be wise for those of you who may have forgotten about general purpose registers and their use to review the article on Basic Architecture. General purpose registers are used quite frequently in load/store operations and will be encountered all throughout our various examples. It’s important you know them off hand. Take a second to go back and read the section on general purpose registers, and then come back.

— Microcode versus Assembly

A common problem when reading reference material for assembly and low-level development is the misuse of terms – particularly, the terms microcode and machine code. Microcode sits a layer below machine code. For our purposes, the machine code we’ll be looking at is the x86 instruction set. What I mean by a layer below is that the CPU internally decodes machine code – the assembly instructions – into micro-operations that its execution units actually run. There are many reasons this is done; the main one is that it’s easier to build a complex processing unit that retains backwards compatibility. The x86 instruction set contains thousands of instructions for many different operations, some of them for loading and storing strings or floating-point values. Rather than wiring an explicit execution path for every one of these instructions, they’re decoded into microcode and executed on the CPU. This preserves backwards compatibility and gives way to faster, smaller processors.

It’s important to distinguish between these two for technical accuracy as well as understanding. In addition, microcode and machine code do not always have a 1:1 mapping. However, there is little public documentation on Intel’s or AMD’s microcode, so it’s hard to infer the internal architecture and the machine-code-to-microcode mapping.

As an example, take the instruction popf. This instruction pops the word on top of the stack into the FLAGS register (in 64-bit mode, popfq pops a quadword into RFLAGS). Before doing that, though, it performs checks on certain bits in the flags register, the current privilege level, and the I/O privilege level. These operations aren’t likely to be stuffed into a single internal operation: the processor has to check the flags, the current privilege level, and other things before writing the popped value. You could be looking at a number of micro-operations executed when this one instruction is decoded.

Note: Microcode is the layer below machine code. Machine code is the higher-level representation of these micro-operations.

— Instruction Simplification

We aren’t going to break down the entire format of an x86 instruction in this subsection since there is an entire chapter dedicated to that in the Intel SDM Volume 2; however, we need to address the general format.

Assembly instructions come in all different sizes (quite literally) but adhere to a similar shape. The format is typically an instruction prefix, the opcode, and the operand(s). There may not always be an instruction prefix (we’ll cover those in the future), but there will always be an opcode so long as the instruction is valid and supported. These opcodes map to a specific instruction in the instruction set, and some instructions have a number of opcodes that change based on the operands they act upon. For example, the logical AND instruction has an opcode for the form that takes the lower byte of the rax register, al, and performs a logical AND against an 8-bit immediate value. Recall that an immediate is just a numerical value. Below is a simple summary with the opcode and instruction mnemonic.

Logical AND [Opcode: 24h | Instruction: AND AL, imm8]

That’s a new term as well: mnemonic. In assembly, a mnemonic is a simple way to identify an instruction. It beats the alternative of reading a hex dump, determining instruction boundaries, and translating the opcodes by hand into a human-readable form. These mnemonics are devices that allow system programmers, hardware engineers, and reverse engineers like us to read and understand what some sequence of instructions is doing with relative ease. In the above example, the mnemonic for the logical AND operation is AND, followed by op1 and op2 – the operands. As a worked decoding of the form above: the two raw bytes 24 0F decode to and al, 0Fh, where 24h selects the AND AL, imm8 form and 0Fh is the immediate operand.

Note: It's pronounced like nehmonik, not memnomic. Maybe I'm just an idiot and am the only one who struggled to say it right.

All instructions follow this general format. If you want the nitty-gritty technical details then you’ll need to consult the Intel SDM. Otherwise, you know enough to begin learning and digesting the instructions you’ll encounter throughout this journey. We’re going to start off basic and gradually increase the difficulty of the instructions. If you struggle with understanding any portion of this text please drop me a line on Twitter or leave a comment and I’ll answer to the best of my ability.

Arithmetic Operations

In this section, we’ll cover the simple arithmetic instructions behind addition, subtraction, division, multiplication, and modulus. Following that we step it up a little bit and cover pointer arithmetic and how pointers are modified with assembly.

— Simple Math

When a mathematical expression is executed it usually breaks down into logically equivalent blocks. Take ((2 + 4) * 6) – this expression adds 2 to 4 and then multiplies the result by 6. The expression can be done in a single line in C, but in Assembly it will be broken down into a few loads and stores, then an addition instruction, and then a multiplication. Like I mentioned, logically equivalent blocks. I’ve constructed a few examples with progressively more complex expressions and provided their C and Assembly listings.

static __declspec( noinline ) uint32_t simple_math( void )
{
    volatile uint32_t x = 0;
    volatile uint32_t y = 0;
    volatile uint32_t z = 0;

    x = 4;
    y = 12;
    z = 7;

    return ( x + y ) * z;
}

This function is pretty trivial. I’ve told the compiler with the __declspec( noinline ) modifier to never inline this particular function. I did this primarily so that I can grab the assembly as it relates to the function and not have other instructions polluting the example. We see the use of volatile to prevent the local storage from being optimized out as well, and then we set our variables to random values. So what would this look like in assembly?

sub     rsp, 38h
xor     eax, eax
mov     [rsp+20h], eax
mov     [rsp+24h], eax
mov     [rsp+28h], eax
mov     dword ptr [rsp+20h], 4
mov     dword ptr [rsp+24h], 0Ch
mov     dword ptr [rsp+28h], 7
mov     eax, [rsp+20h]
mov     edx, [rsp+24h]
add     eax, edx
mov     ecx, [rsp+28h]
imul    eax, ecx
add     rsp, 38h
retn

This function first allocates space on the stack for our spill space (previously called shadow store) and our local storage. The spill space requires 32 bytes, and another 12 bytes cover our 3 local variables.

Why does the function allocate 56 bytes of stack instead of 44 bytes?

Under the Microsoft x64 calling convention the stack pointer must be 16-byte aligned at every call site. Because the call instruction pushes an 8-byte return address, rsp is off by 8 on function entry, so the local allocation N must satisfy N modulo 16 = 8 to restore alignment. We need 44 bytes (32 for spill space plus 12 for locals), and 44 modulo 16 is 12 – the stack would be misaligned. Adding 4 bytes brings us to 48, but 48 modulo 16 is 0, which still breaks the N modulo 16 = 8 rule. Adding another 8 bytes yields 56 (38h), and 56 modulo 16 = 8 – exactly what we need. If we were to make any sort of WinAPI call from a function with a misaligned stack it would most likely break execution.
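To sanity-check that arithmetic, here’s a tiny C helper – hypothetical, purely for illustration – that rounds a raw frame size up to the next value satisfying the N modulo 16 = 8 rule:

#include <stdint.h>
#include <stdio.h>

/* Round a raw frame size up so that, with the 8-byte return address
 * already on the stack, rsp ends up 16-byte aligned (N % 16 == 8). */
static uint32_t align_frame(uint32_t needed)
{
    return ((needed + 8 + 15) & ~15u) - 8;
}

int main(void)
{
    /* 32 bytes of spill space + 12 bytes of locals = 44 -> 56 (38h) */
    printf("%u\n", align_frame(44));   /* prints 56 */
    return 0;
}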

After the stack space is allocated we notice a xor instruction with both operands being the same 32-bit register, eax. This is a simple method of zeroing a register since any number xor’d with itself is 0. Now comes the part where remembering the information from the stack article will come in handy. We see three instructions that are 1:1 with the source. There are some details to mention before moving on, though. The mov instruction is a load/store instruction where the first operand is the target and the second operand – in this case eax – is the value to store. The brackets wrapping [rsp+offset] indicate a memory access: [rax] would mean access the memory contents at the address held in rax. The simplest way to think of it is as dereferencing a pointer in assembly.

*(cast*)(rsp+0x20) = eax

You might be wondering as well what the offset 20h means. The 20h is the offset from the top of the stack to the address where this variable’s storage is located. If we were to look at the stack of this application it would look like the diagram below.

The first thing pushed onto our stack prior to the stack space allocation is the caller’s return address, then space for our local storage and alignment padding is allocated. But why is the local region 18h (24) bytes when our three variables only need 12? The total allocation has to obey the N modulo 16 = 8 rule once the 32-byte spill space is included: 32 + 12 = 44 doesn’t satisfy it, while 32 + 24 = 56 does, so the remaining bytes are padding. You may also notice that x and y sit in the same stack element; that’s because these stack slots are 8 bytes in size and our variables are 4 bytes, so x and y fit into one slot. The same goes for our z variable – you’ll notice it goes padding then z storage, and that’s just the way I wanted to show it, since the upper 32 bits of [rsp+28h] are 0 and the lower 32 bits hold the value of z.

Strong understanding is important!

If you’re wondering why all the detail for this particular example: I want to cover the first one thoroughly so that in future examples you are well equipped to read and understand them. This will likely be the longest section because there is a lot to cover initially about assembly. Once we move forward, the other examples will just be a matter of understanding the nuances of each instruction.

Let’s continue and bring the assembly example back into view.

sub     rsp, 38h
xor     eax, eax
mov     [rsp+20h], eax
mov     [rsp+24h], eax
mov     [rsp+28h], eax
mov     dword ptr [rsp+20h], 4
mov     dword ptr [rsp+24h], 0Ch
mov     dword ptr [rsp+28h], 7
mov     eax, [rsp+20h]
mov     edx, [rsp+24h]
add     eax, edx
mov     ecx, [rsp+28h]
imul    eax, ecx
add     rsp, 38h
retn

We now know that the mov [rsp+20h], eax instruction is zeroing the storage where x is allocated. The same goes for y and z – they just have different offsets from rsp. We can see that y at [rsp+24h] and z at [rsp+28h] are being set to 0. The lines after that store the values we preset in the source. You probably notice that this mov is slightly different from the last, with a size specifier being used: dword ptr. The dword ptr specifier simply means that the target operand is 32 bits in size – the size of a doubleword. The write then only touches the lower 32 bits of the stack element. This is also what allows us to share a stack element between two 32-bit variables. The next two instructions are simple to understand now.

After storing our values to the appropriate stack elements we load those elements into registers to be used for computation.

Registers vs. Memory Accesses

Memory accesses are slow to execute because the instructions generate virtual addresses that must be translated by the MMU into physical addresses, and the processor must then reach out to main memory with the translated address. This is why the hierarchy of caches associated with the CPU is beneficial; however, using CPU registers that are part of the die is orders of magnitude faster than reaching out to main memory. Compilers will typically prefer registers when performing computations to favor execution speed.

We now know that x is loaded into eax and y into edx; immediately after, an add instruction is encountered with the operands eax and edx. The add instruction adds the second operand to the first and stores the result in the first. In this case, it is performing this:

x += y;

Simple enough. On the next line we see z loaded into ecx, and then imul executes with eax as the first operand and ecx as the second. This instruction multiplies the first operand by the second and stores the result in the first operand. That translates to:

x *= z;

The original source performs all of this in the return statement. There’s something peculiar here because we know the function returns an integer, but how? Through the use of eax. The general-purpose register rax is the return value register: under the Microsoft x64 calling convention, anything returned to the caller that fits in a register is placed in rax. The convention differs across architectures, but for x64 code on Intel and AMD it is rax. The instruction add rsp, 38h is how we reclaim the stack space allocated for our local storage. This leaves the return address of the calling function at the top of the stack, which means that when the last instruction, retn, executes, rip will be set to that address and the processor will jump there and continue executing.

That’s all there is to this function. As we continue on with the next forty-five million instructions I’ll only address details that can’t be deduced easily and explain new behavior. We’ve covered a lot in this first example, but it will make life much easier as we move forward. The next sections will go by quickly, but be sure to take note of the quirks and additional information dialogs. It’s important to understand this content fully.

Order of Operations

When evaluating mathematical expressions there is a set of rules followed in order to obtain the correct result. If you’ve taken a math class you’ve encountered order of operations. In this case, we have parentheses surrounding the first expression we want solved, which means it gets evaluated first. The compiler takes that into consideration – otherwise you would get an incorrect result. If you removed the parentheses from the source provided, the imul would take place before the add instruction, as sketched below. PEMDAS. Remember that.
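Here’s a minimal C illustration of that reordering; the instruction comments are a sketch of what a compiler might emit for each grouping, not an exact listing:

#include <stdint.h>

/* The parentheses change which operation must be evaluated first. */
uint32_t grouped(uint32_t x, uint32_t y, uint32_t z)
{
    return ( x + y ) * z;   /* add eax, edx ; imul eax, ecx */
}

uint32_t ungrouped(uint32_t x, uint32_t y, uint32_t z)
{
    return x + y * z;       /* imul edx, ecx ; add eax, edx */
}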

— Pointer Arithmetic

If you’ve written in C or C++ you’ve probably done some pointer arithmetic yourself. It’s confusing at a high-level sometimes and it certainly gets confusing when ripping away the abstractions of a high-level language. In this sub-section, we’re going to look at two examples of pointer arithmetic performed on two different data structures: an array, and a linked list. As mentioned previously, only important or new information will be addressed in this sub-section and the others so if you’re having trouble remembering certain things please refer to the above section. If it’s not mentioned now I’ve mentioned it before. We’re going to start off with another example in C which is just how array accesses can look in assembly.

static __declspec( noinline ) uint32_t pointers( void )
{
	uint64_t a[10];
	
    // looped access
    for ( volatile uint32_t it = 0; it < 10; it++ )
        a[ it ] = it + 2;

    // direct access
    a[ 0 ] = 1337;
    a[ 4 ] = 1995;
    
    // quik maffs
    *( uint64_t* ) ( a + 6 ) = 49;

    for ( volatile uint32_t it = 0; it < 10; it++ )
        printf( "%d\n", a[ it ] );

    return 0;
}

This example is pretty straightforward. The assembly? Not so much.

                sub     rsp, 78h
                pxor    xmm0, xmm0
                movdqu  xmmword ptr [rsp+20h], xmm0
                movdqu  xmmword ptr [rsp+30h], xmm0
                movdqu  xmmword ptr [rsp+40h], xmm0
                movdqu  xmmword ptr [rsp+50h], xmm0
                movdqu  xmmword ptr [rsp+60h], xmm0
                mov     dword ptr [rsp+70h], 0
                mov     eax, [rsp+70h]
                cmp     eax, 0Ah
                jnb     short loc_140001084

loc_140001067:                          
                mov     eax, [rsp+70h]
                mov     edx, [rsp+70h]
                add     eax, 2
                mov     [rsp+rdx*8+20h], rax
                inc     dword ptr [rsp+70h]
                mov     ecx, [rsp+70h]
                cmp     ecx, 0Ah
                jb      short loc_140001067

loc_140001084:                          
                mov     qword ptr [rsp+20h], 539h
                mov     qword ptr [rsp+40h], 7CBh
                mov     dword ptr [rsp+74h], 0
                mov     eax, [rsp+74h]
                mov     qword ptr [rsp+50h], 31h
                cmp     eax, 0Ah
                jnb     short loc_1400010D2

loc_1400010B0:                         
                mov     eax, [rsp+74h]
                lea     rcx, aD         ; "%d\n"
                mov     rdx, [rsp+rax*8+20h]
                call    sub_1400010E0
                inc     dword ptr [rsp+74h]
                mov     eax, [rsp+74h]
                cmp     eax, 0Ah
                jb      short loc_1400010B0

loc_1400010D2:                       
                xor     eax, eax
                add     rsp, 78h
                retn

Immediately we notice a significant difference in complexity from the last example. We want to get the hard stuff out of the way first, so why the hell not? You can probably guess what the first instruction does based on prior experience, and if you do the math to determine the proper size of the stack allocation the value makes sense. Spill space is four 8-byte elements and our array is 10 elements, so 10 * 8 = 80, and 80 + 32 = 112 bytes. 112 modulo 16 = 0, but the allocation must satisfy the N modulo 16 = 8 rule, so we add 8 bytes on and get 120, or 78h. 120 modulo 16 = 8! No problem. (Conveniently, those extra 8 bytes also end up holding the two 4-byte loop counters, at [rsp+70h] and [rsp+74h].)

The best way to approach complex or unknown disassembly is literally one line at a time, grouping together similar operations. Looking at the next instruction we see a pxor. This instruction is a logical exclusive OR for SIMD operands; it acts the same as the xor we saw previously but zeroes the 16-byte register xmm0. The XMM registers were added with the advent of SIMD instructions: they are 128-bit (16-byte) registers named xmm0 through xmm15, and you can read more about them in the recommended reading section. You might be wondering why these are even used when we haven’t performed any floating-point operations or used SSE anywhere. The compiler chose them because it wanted to yield the most performant code when it optimized our function. You’ll notice the movdqu instruction which, you guessed it, stores the value of xmm0 to that stack location. The xmmword ptr specifier is used like the earlier size specifiers and tells us we’re performing a write of 16 bytes of data at [rsp+20h]. The sequence of these 5 stores is a fast way to initialize our allocated stack space to 0. Think about it: 70h – 20h is 50h, which is 80 bytes in decimal, and our array is 10 elements each 8 bytes in size, so this sequence is a shortcut to zero the array’s memory. If you’re confused because you see 60h and not 70h, remember that the last store writes zeros from [rsp+60h] up through [rsp+6Fh] – 10h, or 16 bytes, the size of an xmm register. This means that everything up to 70h is zero!
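In C terms, the pxor plus the five movdqu stores boil down to a memset over the array’s 80 bytes – a minimal sketch, with the function and array names being ours rather than the compiler’s:

#include <string.h>
#include <stdint.h>

/* What the pxor + five 16-byte movdqu stores accomplish: */
void zero_array(uint64_t a[10])
{
    memset(a, 0, 10 * sizeof(uint64_t));   /* 80 bytes of zeros */
}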

Moving on, we notice a memory access initializing [rsp+70h] to 0, followed by a mov of [rsp+70h] into eax. What do we know about this sequence of instructions and its relation to our example? The first thing we should note is that it is using eax instead of rax (the 64-bit counterpart of eax). Where are we using a 32-bit variable? In our first for-loop, as the iterator! Right after that we notice a cmp instruction. The cmp instruction compares the first operand to the second, setting certain bits of the RFLAGS register to indicate the result – we’ll cover that in more detail in the next section. For now, just know it is comparing against 0Ah. This feels familiar… our for-loop construct does the same thing! A high-level view of the analysis we’ve done so far would look like this:

void func()
{
    uint128_t xmm_array[5] = { 0 };

    for(uint32_t rsp_70 = 0; rsp_70 < 10;) {}
}

Notice how I’m only using the assumptions I’ve made from analysis of the disassembly so far. I’m doing this so you, the reader, start to see how to build pseudo-code straight from disassembly. Now, the instruction following our comparison is a JCC instruction, otherwise known as a jump if condition is met. The jnb instruction means jump if not below – take the branch if the comparison indicates the first operand is not below the second. Like cmp, the details of these instructions will come later. This one jumps to the address 140001084 if our counter is greater than or equal to 10. So in terms of our reconstruction, how do we interpret this? Well, we know that a for-loop runs until its condition fails; once it does, execution breaks out of the loop and continues with the code that follows. This means that our jnb targets the address where code continues after our loop, so what follows the jnb when the jump isn’t taken is what happens inside the loop! We can also assume that the jnb target address marks the end of our loop. Let me bring into view the code between the jnb and 140001084.

mov     eax, [rsp+70h]
mov     edx, [rsp+70h]
add     eax, 2
mov     [rsp+rdx*8+20h], rax
inc     dword ptr [rsp+70h]
mov     ecx, [rsp+70h]
cmp     ecx, 0Ah
jb      short loc_140001067

This doesn’t look too daunting – we know most of these instructions. The first two load eax and edx with the value of our counter, then add 2 to eax. Now, the next access is a little confusing, but you might be able to figure it out on your own at this point – give it a try! If you weren’t able to, let’s break it down. We see mov, so it’s storing rax into a memory location calculated by some obscene combination of things. Jot down what you know from previous instructions.

eax = counter
edx = counter
eax += 2
rsp = top of stack (what's at top of stack?)
[] means we're writing to the memory at location inside braces

mov [rsp + counter * 8 + 20h], rax
8 bytes is the size of a 64-bit integer
20h is offset from stack where our xmm array starts

This is what we know. From here we can begin to understand what’s happening. The easiest thing to do is break down all of the details and make educated guesses about the information. Using what we know, we can make sense of the expression in the brackets: [rsp + (counter * sizeof(uint64_t)) + base_of_array] = rax. This is where previous experience in languages like C or C++ comes in handy. We know that you can index an array in C in a messier manner, like *(cast*)(array + index), and since this uses the base of our array we know it’s writing somewhere into the array. If we were to reorder this and write it like an array access in C we’d come up with something like this:

// tos = top of stack
*(uint64_t*)(tos + array_offset + (counter * sizeof(uint64_t))) = rax;
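To see that equivalence in runnable form, here’s a small C snippet mirroring the base + index*scale access from the disassembly; the names are ours, for illustration:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t a[10] = { 0 };
    uint32_t counter = 3;

    a[counter] = counter + 2;   /* plain indexing */

    /* byte-offset form, just like [rsp + rdx*8 + 20h] in the listing */
    *(uint64_t *)((uint8_t *)a + counter * sizeof(uint64_t)) = counter + 2;

    printf("%llu\n", (unsigned long long)a[3]);   /* prints 5 */
    return 0;
}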

It’s beginning to become more understandable. At this point, we can make the assumption that since we have a loop the counter is used to index into the array. Let’s take this low-level representation and combine it with our assumptions to add to our reconstruction.

func()
{
    uint128_t xmm_array[5] = { 0 };

    for(uint32_t rsp_70 = 0; rsp_70 < 10;)
    {
        xmm_array[rsp_70] = rsp_70 + 2;
    }
}

This looks a lot cleaner, but it doesn’t quite add up: we’d wind up with an index-out-of-bounds bug since the counter loops to 10 but we only have 5 16-byte elements in our array. If you look at the instructions again – particularly the mov – we saw that it indexes by the size of an unsigned __int64. This means our initial assumption of an array of 128-bit elements was wrong; it’s an array of 64-bit elements.

func()
{
    uint64_t u64_array[10] = { 0 };

    for(uint32_t rsp_70 = 0; rsp_70 < 10;)
    {
        u64_array[rsp_70] = rsp_70 + 2;
    }
}

This is much better. It makes sense with all the assembly we’ve read so far. Continuing our loop excerpt we’ll see that the instruction after the write to the array is inc.

mov     eax, [rsp+70h]
mov     edx, [rsp+70h]
add     eax, 2
mov     [rsp+rdx*8+20h], rax
inc     dword ptr [rsp+70h]    <---
mov     ecx, [rsp+70h]
cmp     ecx, 0Ah
jb      short loc_140001067

The inc instruction is the unary increment instruction: it adds 1 to its operand. Now we know that our loop increments our counter! Skim the rest of the sequence and you’ll notice our comparison again, followed by another JCC instruction, jb. If you go back and look at the original disassembly listing you’ll see where loc_140001067 is.

                jnb     short loc_140001084

loc_140001067:                          
                mov     eax, [rsp+70h]
                mov     edx, [rsp+70h]
                add     eax, 2
                mov     [rsp+rdx*8+20h], rax
                inc     dword ptr [rsp+70h]
                mov     ecx, [rsp+70h]
                cmp     ecx, 0Ah
                jb      short loc_140001067

That’s it, that’s our first loop! If we add to our reconstruction we will now have this:

func()
{
    uint64_t u64_array[10] = { 0 };

    for(uint32_t rsp_70 = 0; rsp_70 < 10; rsp_70++)
    {
        u64_array[rsp_70] = rsp_70 + 2;
    }
}

Awesome. Now it’s your turn. Review the rest of the disassembly and rebuild based on assumptions you make then compare with the original source code. Try to refrain from using the original source as a reference.

                sub     rsp, 78h
                pxor    xmm0, xmm0
                movdqu  xmmword ptr [rsp+20h], xmm0
                movdqu  xmmword ptr [rsp+30h], xmm0
                movdqu  xmmword ptr [rsp+40h], xmm0
                movdqu  xmmword ptr [rsp+50h], xmm0
                movdqu  xmmword ptr [rsp+60h], xmm0
                mov     dword ptr [rsp+70h], 0
                mov     eax, [rsp+70h]
                cmp     eax, 0Ah
                jnb     short end_first_loop

first_loop:                          
                mov     eax, [rsp+70h]
                mov     edx, [rsp+70h]
                add     eax, 2
                mov     [rsp+rdx*8+20h], rax
                inc     dword ptr [rsp+70h]
                mov     ecx, [rsp+70h]
                cmp     ecx, 0Ah
                jb      short first_loop

end_first_loop:                          
                mov     qword ptr [rsp+20h], 539h
                mov     qword ptr [rsp+40h], 7CBh
                mov     dword ptr [rsp+74h], 0
                mov     eax, [rsp+74h]
                mov     qword ptr [rsp+50h], 31h
                cmp     eax, 0Ah
                jnb     short loc_1400010D2

loc_1400010B0:                         
                mov     eax, [rsp+74h]
                lea     rcx, aD         ; "%d\n"
                mov     rdx, [rsp+rax*8+20h]
                call    sub_1400010E0
                inc     dword ptr [rsp+74h]
                mov     eax, [rsp+74h]
                cmp     eax, 0Ah
                jb      short loc_1400010B0

loc_1400010D2:                       
                xor     eax, eax
                add     rsp, 78h
                retn

Disassembly Tips

Certain access specifiers like dword ptr, qword ptr, and xmmword ptr are great hints as to the size of an operation and sometimes the size of the operand. And remember the sizes of different types and widths of registers (e.g. eax = 32-bits, rax = 64-bits, uint64_t = 64-bits).

Conditional Operations and Comparisons

This section covers conditional branching instructions and operations. There are a ton of flavors of similar instructions and we won’t be able to hit them all, but you’ll get a general idea and know where to look to learn more. We’ll also cover checking for error conditions, validating input, noticing when something is about to ruin your life, etc. These are not necessarily the easiest instructions; however, we’ll cover as many of the subtleties as we can. If you’ve made it to this section and successfully completed the challenge at the end of the last one, the majority of these examples will be straightforward.

— Comparing Two Operands

The comparison instruction, cmp, was encountered in the previous section. Its operation is to compare the first operand with the second; however, the result is not stored in either of the operands. Instead, the comparison sets status flags in the RFLAGS register indicating the result. If we take a look at the RFLAGS register diagram from the Intel SDM we’ll be able to discern which flags are typically affected.

The specific flags (bits in RFLAGS) we’re concerned with in comparisons or conditional operations are the six status flags:

  • Overflow Flag (OF)
  • Sign Flag (SF)
  • Zero Flag (ZF)
  • Auxiliary Carry Flag (AF)
  • Parity Flag (PF)
  • Carry Flag (CF)

(The direction flag, DF, lives in the same register but is a control flag for string operations, not a status flag.)

These are known as the status flags and are identified in the diagram above. The compare instruction can affect any of these status flags, and we’ll look at how they’re set and used as we move forward. First, we need to understand how the comparison is actually performed. With cmp, the two operands are compared by subtracting the second from the first – much like the sub instruction we’ve encountered often – and the result is discarded, keeping only the flags. The flag most often checked after a comparison is the zero flag (ZF), which is set when the result is 0. Let’s pull from our earlier examples: cmp rdx, 0Ah. In this instance, if rdx has a value of 6, the result of the subtraction is -4. Since -4 is not 0, the zero flag (ZF) stays clear. When the result of the subtraction is 0, the zero flag is set (e.g. rdx is 10).
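Here’s a rough C model of the two flags we care about most, using the unsigned semantics described above – purely illustrative, since the processor computes all the status flags at once:

#include <stdint.h>
#include <stdbool.h>

typedef struct { bool zf; bool cf; } cmp_flags_t;

/* cmp op1, op2 computes op1 - op2, throws the result away, and keeps
 * only the flags. For unsigned operands: */
cmp_flags_t cmp_unsigned(uint64_t op1, uint64_t op2)
{
    cmp_flags_t f;
    f.zf = (op1 == op2);   /* subtraction result is zero    */
    f.cf = (op1 <  op2);   /* subtraction required a borrow */
    return f;
}

/* cmp rdx, 0Ah with rdx = 6:  ZF clear, CF set (6 is below 10) */
/* cmp rdx, 0Ah with rdx = 10: ZF set,   CF clear               */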

Comparison and jnb

The comparison we encountered earlier, prior to the jnb – which, if you recall, jumped if the value was not below the second operand – relies on a different flag than ZF. The jnb instruction checks the carry flag (CF) and takes the jump when CF is clear. For unsigned operands, cmp sets CF when the subtraction requires a borrow – that is, when the first operand is below the second. (Signed overflow is tracked separately by the overflow flag, OF, and the sign bit of the result lands in the sign flag, SF; jnb itself only consults CF.)

This is why understanding the RFLAGS register is extraordinarily important as well as the conditions that are used to determine if a branch will occur. We’ll cover the JCC instructions soon, for now, that tidbit should just be in the back of your mind.

— Testing Two Operands

It’s not unusual to encounter the test instruction instead of cmp. The test instruction performs a bitwise AND of the two operands, sets SF, ZF, and PF according to the result, and then completely discards the result (CF and OF are simply cleared). You’ll typically see test used when the branching instruction that follows is decided by SF, ZF, or PF. The main differences between cmp and test are the method of evaluation – subtraction versus bitwise AND – and which flags carry meaningful results afterward.

Setting the sign flag

The sign flag is set when the most significant bit of the result is set. In signed integer arithmetic that bit is the sign bit and indicates whether a value is positive or negative; for unsigned values it is simply the most significant bit.

The test and cmp instructions are interchangeable when checking a register against zero – test reg, reg sets ZF exactly as cmp reg, 0 would – and test is typically preferred there because its encoding is shorter, needing no immediate operand. For general two-operand comparisons, though, they are not interchangeable: an AND and a subtraction set the flags very differently.
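A matching sketch for test – again a model, not something you’d write in application code:

#include <stdint.h>
#include <stdbool.h>

/* test op1, op2 performs op1 & op2 and discards the result. The common
 * idiom test reg, reg asks "is reg zero?" because reg & reg == reg. */
bool test_sets_zf(uint64_t op1, uint64_t op2)
{
    return (op1 & op2) == 0;   /* ZF after test op1, op2 */
}

/* test rax, rax ; jz taken  <=>  test_sets_zf(rax, rax) == true */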

— Conditional Branching (JCC Instructions)

It’s time to cover some of the JCC instructions. These are branching instructions that only take a branch when a condition is met. What do I mean by branch? If you’ve ever used a goto statement in C you’ve written what’s called an unconditional jump. The unconditional jump has the mnemonic jmp and is used to branch directly to an address. When a branch is taken, the instruction pointer is modified to the address of the target of our branching instruction, allowing execution to continue at that code block. JCC instructions also branch to a target, but only when their condition requirement is met. I put together a simple test function with a lot of branches and tried to use different conditions, but there are a lot of conditional branching instructions; if you want to learn more about them after this section, check the recommended reading. We’re only going to cover a few to give you an idea of how they work and what to look for when analyzing branches.

Here’s the example C application:

static __declspec( noinline ) uint32_t branching( uint64_t v1, uint64_t v2 )
{
    volatile uint64_t v3 = 916;
    volatile uint64_t v4 = 0xFFFFFFFFFFFFFFDD;

    volatile uint64_t r1 = 0;

    if ( v1 < v2 )
    {
        r1 = 1;

        if ( v3 != v2 )
        {
            r1 = 2;

            if ( v1 + v2 >= v3 )
            {
                r1 = 10;
                if ( v4 + v1 <= 1000 )
                {
                    r1 = 15;
                }
                else
                {
                    r1 = 9;
                }
            }
            else
            {
                r1 = 1;
            }
        }
        else
        {
            r1 = 0;
        }
    }
    else
    {
        r1 = 0;
    }

    return r1;
}

int main()
{
    printf( "ret = %d\n", branching( 3444, 3666 ) );

    return 0;
}

A little bit of a headache to follow, but it’s not uncommon to encounter nested conditions. Below is the disassembly listing of the function:

                sub     rsp, 38h
                mov     qword ptr [rsp+20h], 394h
                mov     qword ptr [rsp+28h], 0FFFFFFFFFFFFFFDDh
                mov     qword ptr [rsp+30h], 0
                cmp     rcx, rdx
                jnb     short loc_1400010DD
                mov     qword ptr [rsp+30h], 1
                mov     rax, [rsp+20h]
                cmp     rax, rdx
                jz      short loc_1400010DD
                mov     qword ptr [rsp+30h], 2
                add     rdx, rcx
                mov     rax, [rsp+20h]
                cmp     rdx, rax
                jb      short loc_1400010D2
                mov     qword ptr [rsp+30h], 0Ah
                mov     rax, [rsp+28h]
                add     rcx, rax
                cmp     rcx, 3E8h
                ja      short loc_1400010F0
                mov     qword ptr [rsp+30h], 0Fh
                jmp     short loc_1400010E6

loc_1400010D2:                          
                mov     qword ptr [rsp+30h], 1
                jmp     short loc_1400010E6

loc_1400010DD:                          
                                        
                mov     qword ptr [rsp+30h], 0

loc_1400010E6:                         
                                        
                mov     rax, [rsp+30h]
                add     rsp, 38h
                retn

loc_1400010F0:                          
                mov     qword ptr [rsp+30h], 9
                jmp     short loc_1400010E6

Right off the bat, we are already in familiar territory thanks to our earlier examples. Now that you’re probably more comfortable with some instructions and reading the listings, you can quickly skim the dead-listing – looking for patterns of instructions. You’ll notice there are 4 comparison instructions within the first code block. However, knowing that doesn’t immediately tell us these are nested blocks; we’ll have to walk through the code and look at the branch targets to build a high-level view of what’s going on. At this point, you should be able to read the first four instructions and know what they’re doing. At the end of the local storage initialization we see a comparison of rcx and rdx, but we never see those registers assigned in this function. This is because of the calling convention, fastcall. If you remember the first and second articles, there were details about the calling convention and how information is passed to functions. When invoking a procedure that follows the x64 fastcall convention, the first 4 arguments are passed through the registers rcx, rdx, r8, and r9, respectively.

Our function doesn’t have any mention of r8 or r9, so it’s safe to assume it only takes two arguments, through rcx and rdx. And just like that we already know what a basic function prototype may look like: <ret type> unk_fnc(uint64_t a1, uint64_t a2). Moving back to the comparison, we see it compares the two arguments and then jumps to 1400010DD if the condition is met. The condition is jump if rcx is not below rdx, which means the fall-through path – the body we’re about to read – executes when rcx is below rdx. That translates directly to an if statement like so:

if(a1 < a2) { ... }

This is a good time to talk about how to determine which is the if block and which is the else block (if there is one). When a comparison like the one above is performed, the code at the target of the conditional jump is typically the else block, since the instructions directly following the JCC execute when the jump’s condition is not met. We’ll see this as we move forward. The three instructions that follow our first branching instruction follow a similar pattern: mov, mov, cmp. This time a 1 was placed in the local storage at [rsp+30h], then rax was assigned the value of the contents at [rsp+20h]. If we look at the prologue of the function, the value 394h was placed in [rsp+20h], so we know that one of our local variables has a value of 394h. Then rax is compared against rdx (our second argument), followed by a jz instruction.

The jz instruction reads as jump if zero: the branch is taken when the zero flag is 1, meaning the result of the comparison was 0. Interestingly enough, the jump targets of the first two conditional branches point to the same place: 1400010DD. This comparison determines whether the two registers are equal and takes the jump if they are, which means the condition allowing continued execution is that rax and rdx are not equal. None of the other branch targets appear to match, so let’s put together a reconstruction of what we know so far.

<ret type> unk_func(uint64_t a1, uint64_t a2)
{
    uint64_t rsp_20 = 0x394;
    uint64_t rsp_28 = 0xFFFFFFFFFFFFFFDD;
    uint64_t rsp_30 = 0;
    
    if(a1 < a2)
    {
        rsp_30 = 1;
        rax = rsp_20;
        if(rax != a2)
        {
            
        }
        else
        {
            goto loc_10dd;
        }
    }
    else
    {
.loc_10dd:
        rsp_30 = 0;
    }
}

This is just a rough sketch of what you can assume without looking at the original source. There are three local variables, one of those locals is used in an if-statement as shown in the disassembly, and one of the local variables is set to 1 if the if block is executed. If this feels slow, don’t worry – you’ll get much faster as you gain experience. Let’s bring the dead-listing back into view and read from the jz branch we just analyzed.

                jz      short loc_1400010DD
                mov     qword ptr [rsp+30h], 2
                add     rdx, rcx
                mov     rax, [rsp+20h]
                cmp     rdx, rax
                jb      short loc_1400010D2
                mov     qword ptr [rsp+30h], 0Ah
                mov     rax, [rsp+28h]
                add     rcx, rax
                cmp     rcx, 3E8h
                ja      short loc_1400010F0
                mov     qword ptr [rsp+30h], 0Fh
                jmp     short loc_1400010E6

loc_1400010D2:                          
                mov     qword ptr [rsp+30h], 1
                jmp     short loc_1400010E6

loc_1400010DD:                          
                                        
                mov     qword ptr [rsp+30h], 0

loc_1400010E6:                         
                                        
                mov     rax, [rsp+30h]
                add     rsp, 38h
                retn

loc_1400010F0:                          
                mov     qword ptr [rsp+30h], 9
                jmp     short loc_1400010E6

Interesting, there are 3 more branching instructions, and one of them is unconditional. It’s becoming clear that these are nested conditions and that the if/else blocks store some number into [rsp+30h]. We need to determine what happens inside of the if-statement for our rax/rdx not equal condition. It stores 2 into [rsp+30h], adds arg1 to arg2, stores [rsp+20h] in rax, then compares rdx to rax, and jumps if rdx is below rax. The if block of our previous condition has a nested condition, and we can add to our pseudocode.

<ret type> unk_func(uint64_t a1, uint64_t a2)
{
    uint64_t rsp_20 = 0x394;
    uint64_t rsp_28 = 0xFFFFFFFFFFFFFFDD;
    uint64_t rsp_30 = 0;
    
    if(a1 < a2)
    {
        rsp_30 = 1;
        rax = [rsp_20];
        if(rax != a2)
        {
            rsp_30 = 2;
            uint64_t temp = a2 + a1;
            if(temp >= rax)
            {
                // fall-through: the sum is not below rax
            }
            else
            {
                // jb taken: the sum is below rax
            }
        }
        else
        {
            goto loc_10dd;
        }
    }
    else
    {
.loc_10dd:
        rsp_30 = 0;
    }
}

We’re beginning to see a pattern of nested conditions based on the two arguments and two of the local variables. If we look at the jb target, 1400010D2, we see that [rsp+30h] is set to 1 there. Now look at the code as if the branch was not taken: [rsp+30h] is referenced again, but a value greater than 1 is stored. We now know this isn’t a traditional boolean of true or false.

Deducing Return Type/Value

If you want to deduce what type is being returned or the value that is returned then locating the nearest return instruction in the function may provide information. There may be multiple return instructions within a function body, but the return type will match for all of them.

Understanding disassembly is important to your success!

The above tip is what I typically do to validate anything a disassembler is telling me. Some disassemblers like IDA Pro have a decompiler that generates a pseudo-C output, but it’s important that you’re able to read and validate that the output is correct. Sometimes it gets return types right, other times it’s wrong. Sometimes the calling convention is completely trashed and you have to modify it. This is why we’re going through the disassembly slowly and together – so that you get a good grasp of what to look for and what can go wrong. Also, it’s fun to see how your pseudo-C stacks up against original or commercial decompilation.

There will not be any more pseudo-C until we’ve completed the analysis of this target, so be sure to be doing it in your text editor as we go and then compare to mine. Make the changes we noted in the branch blocks so far including the above, and let’s move on. We’re going to move faster to save page space.

A similar pattern appears: a value store, a local moved into a register, an addition, and then a comparison against 3E8h (1000 in decimal). The comparison checks the result of the addition rcx += rax. Then a ja instruction follows. The ja instruction means jump if above (an unsigned greater-than), so these two instructions can be read as jump if rcx is greater than 1000. At this point, you should know where to look for the if/else portions of the condition. Build out your pseudocode, and continue reading.

mov     qword ptr [rsp+30h], 0Fh
jmp     short loc_1400010E6

The last two instructions are super easy, and are what’s inside the if block of the last condition. We store 0Fh (15) in [rsp+30h] and then unconditionally jump to 1400010E6. The target of that jump turns out to be our return sequence. The easiest way to recognize it is the retn instruction preceded by add rsp, N (to clean up the stack). Note that prior to the stack cleanup one of our locals is placed into rax, our return value register. We know that [rsp+30h] is 64 bits in size since it is read through rax rather than eax, ax, ah, or al. Now we can insert all of this information into our pseudo-C implementation; compare yours to mine and to the actual source.

uint64_t unk_func(uint64_t a1, uint64_t a2)
{
    uint64_t rsp_20 = 0x394;
    uint64_t rsp_28 = 0xFFFFFFFFFFFFFFDD;
    uint64_t rsp_30 = 0;
    
    if(a1 < a2)
    {
        rsp_30 = 1;
        rax = [rsp_20];
        if(rax != a2)
        {
            rsp_30 = 2;
            uint64_t temp = a2 + a1;
            rdx = temp;
            if(rdx >= rax)
            {
                rsp_30 = 10;
                uint64_t temp1 = rsp_28 + a1;
                if(temp1 <= 0x3E8)
                {
                    rsp_30 = 15;
                }
                else
                {
                    rsp_30 = 9;
                }
            }
            else
            {
                rsp_30 = 1;
            }
        }
        else
        {
            goto loc_10dd;
        }
    }
    else
    {
.loc_10dd:
        rsp_30 = 0;
    }
    
    return rsp_30;
}

Does your pseudo implementation stack up to mine? Or did you do it better? How about compared to the original source? There are many other conditional jump instructions – we’ve covered four in this example, plus the unconditional jmp. You’ll need to consult the Intel SDM Vol. 2 or AMD APM Vol. 3 to read about the other JCC instructions and the conditions that must be met for each. As we progress through this article you’ll probably start jumping ahead of what I’m detailing, and that’s perfectly fine. For those still beginning to grasp the concepts and learning to analyze program flow, be sure to refer back to earlier sections for details I assume knowledge of!

Section Challenge

Write a simple application with lots of conditional branches, have Visual Studio generate an assembly output, and build a pseudo-C implementation without referencing the source; then compare.

Load/Store Instructions

If you made it this far then the rest of this will be cake. Loading and storing data is a requirement of every application. Whether it’s storing data in a buffer to write to a file, or simply assigning a value to a variable the code underneath is performing a number of load and store operations. This is one of the most important sections since there are tons of ways to load and store data, and a lot of those ways will vary based on the type of data. Simple assignments will use a move instruction while the storage of a pointer to a string would use the load effective address instruction. You may not know what those are now, but you’ll never forget them after this section.

— Move Zero Extend

We know what a move is in assembly, but what is zero extension? Zero extension means the written value is widened by filling all of the remaining upper bits of the destination with zeros. Take a look at these brief examples.

Standard Move

mov rax, 0xFFFFFFFFFFFFFFFF
mov ax, 0xDDDD
mov rcx, rax				; rcx = 0xFFFFFFFFFFFFDDDD

This should make sense if you remember that ax is the lower 16-bit region of rax and can be assigned individually. We set the whole 64 bits of rax to FFFFFFFF`FFFFFFFF, then we set the lower 16 bits to DDDD – the upper 48 bits are left untouched. We store the value of rax in rcx, and if we were to look at what was in rcx we’d see what is shown in the assembly comment above. (One x64 quirk to keep in mind: writes to 8-bit and 16-bit registers preserve the rest of the register, but a write to a 32-bit register such as eax implicitly zeroes the upper 32 bits of rax.) What happens when we use move zero extend, movzx, instead of a standard move instruction?

Zero Extension

mov rax, 0xFFFFFFFFFFFFFFFF
mov cx, 0xDDDD
movzx rax, cx				; rax = 0x000000000000DDDD

A similar sequence, different results. movzx copies the source operand into the destination and fills every remaining bit of the destination with zeros. Note that movzx only accepts a register or memory location as its source – never an immediate. Here the destination is the 64-bit register rax, so the 16-bit value DDDD lands in the low word and the upper 48 bits become 0. It’s an inexpensive instruction, and you’ll see it constantly wherever a small integer type is widened – including in encryption and obfuscation routines.

— Move Sign Extend

Much like movzx, there is an instruction for move with sign extension. The movsx instruction does the same copying of the source operand to the destination; however, it fills the upper bits with copies of the sign bit instead of zeros, with the reach of the extension depending on the operand sizes. An example is provided below!

xor rcx, rcx
mov ax, 0xFFFF
movsx ecx, ax
mov rax, rcx

First, we zero out rcx and set ax to FFFF (65535 unsigned, or -1 as a signed 16-bit value), then perform a move with sign extension from ax (16 bits in width) to ecx (32 bits in width), and finally copy rcx into rax. This can be a little bit confusing since we know that rcx is 0 and ax is now FFFF, but what happens after the movsx executes isn’t exactly clear. Let’s put a pretend breakpoint on movsx ecx, ax and observe the contents of the registers.

rax = 00001ABCDEF0FFFF
rcx = 0000000000000000

ax = FFFF
eax = DEF0FFFF

I’ve placed some garbage value in rax for a little realism and to show that the write of FFFF to ax only wrote to the lower word of eax. Let’s execute the movsx instruction and observe the contents again.

rax = 00001ABCDEF0FFFF 
rcx = 00000000FFFFFFFF 

ax = FFFF 
eax = DEF0FFFF

We see that the value FFFF was copied to ecx, but an extra 16 bits were also written – this is the sign extension. FFFF read as a signed 16-bit value is -1, so its sign bit is replicated through all 32 bits of ecx, producing FFFFFFFF. It didn’t set the entire value of rcx to FFFFFFFF`FFFFFFFF because the extension only reaches the size of the destination operand, and a write to a 32-bit register implicitly zeroes the upper half of the 64-bit register. If you want the extension to reach all 64 bits, use a 64-bit destination register, which encodes the instruction with a 64-bit operand size (more on that later).

movsx rcx, ax

The above would sign-extend the value in ax through all 64 bits of rcx.
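The C casts below mirror the three flavors we just walked through; the values in the comments assume the same inputs as the examples above:

#include <stdint.h>

void extension_examples(void)
{
    uint64_t zx   = (uint64_t)(uint16_t)0xDDDD;          /* movzx rax, cx -> 0x000000000000DDDD */
    uint32_t sx32 = (uint32_t)(int32_t)(int16_t)0xFFFF;  /* movsx ecx, ax -> 0xFFFFFFFF (-1)    */
    uint64_t sx64 = (uint64_t)(int64_t)(int16_t)0xFFFF;  /* movsx rcx, ax -> 0xFFFFFFFFFFFFFFFF */
    (void)zx; (void)sx32; (void)sx64;
}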

— Load/Store Status Flags

We talked about status flags in a bit of detail earlier, and their importance isn’t to be dismissed. In practice, I’ve encountered some initially obscure instructions like lahf. If you’re a first-timer analyzing some target and encounter this, my first suggestion would be to look at the instruction manual – but I didn’t even know that was a thing when I first started. The lahf instruction loads most of the status flags into the ah register – the upper byte of the lower word of rax. The contents of ah will hold the sign flag, zero flag, auxiliary carry flag, parity flag, and carry flag (the overflow flag is not included). It’s not very often you’ll see this instruction, but it’s worth noting in the event you do. The sahf instruction is its counterpart: it takes the flags in ah and stores them back into their respective bits of RFLAGS.
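If you ever need to pick apart the byte lahf produces, the layout within ah is fixed – bit 7 down to bit 0 is SF, ZF, 0, AF, 0, PF, 1, CF – so a handful of masks covers it:

#include <stdint.h>

/* Bit positions of the status flags within ah after lahf executes. */
#define LAHF_CF  (1u << 0)   /* carry flag           */
#define LAHF_PF  (1u << 2)   /* parity flag          */
#define LAHF_AF  (1u << 4)   /* auxiliary carry flag */
#define LAHF_ZF  (1u << 6)   /* zero flag            */
#define LAHF_SF  (1u << 7)   /* sign flag            */

int zero_flag_set(uint8_t ah_value)
{
    return (ah_value & LAHF_ZF) != 0;
}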

— Load Effective Address

As opposed to the other instructions in this section, you’ll encounter load effective address quite often. It’s important to cover this instruction prior to the string operations section since lea (load effective address) is so frequently used when loading data offsets or pointers to objects. The instruction is quite simple in how it works and is for some reason over-complicated in discussions. It takes the first operand (a register destination) and stores in it the effective address of the second operand. What is the effective address? It’s… just the address of the data. The lea instruction takes the second operand’s addressing expression and, if necessary, performs the arithmetic to produce the address of the data – without ever touching memory.

So what’s the difference between mov and lea? mov copies the contents at an address into the destination operand, while lea loads the address itself – the pointer – into the destination. In some instances you can trivially replace lea with a mov and some arithmetic; I wouldn’t recommend it, though, because lea shines when the address combines multiple components (base, index, scale, and displacement).
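In C terms the distinction looks roughly like this, with fmt being a stand-in global used purely for illustration:

#include <stdint.h>

static const char fmt[] = "%d\n";

void mov_vs_lea(void)
{
    const char *ptr = fmt;                    /* lea rcx, [fmt] - the address    */
    uint32_t    raw = *(const uint32_t *)fmt; /* mov ecx, [fmt] - the contents:  */
    (void)ptr; (void)raw;                     /* '%','d','\n','\0' packed bytes  */
}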

Let’s take a look at a quick example:

uint64_t p1 = 0;
printf( "%d\n", p1 );

And the disassembly:

lea rcx, qword ptr ds:[fmt_string_address]
mov edx, eax
call printf

When lea executes it simply takes the address of the format string – which is stored in the .rdata section of the program – computes it, and stores it in rcx. If we replaced the lea with a mov, the contents of rcx would be the ASCII bytes of the format string’s characters, which for this example would be %d\n. Since printf dereferences what it receives as the pointer to the format string, handing it raw character bytes instead of an address would generate an access violation and crash the program. You will sometimes encounter lea used in more complex calculations, like this snippet I pulled out of a random disassembly:

push    rbp
sub     rsp, 40h
mov     [rsp+50h], rcx
lea     rax, [rsp+58h]
mov     [rax], rdx

In this case, it’s storing the address of the stack location [rsp+58h] in rax. This happens to be taken from the disassembly of the printf function, so after storing the stack address in rax it stores the second argument of printf into the storage pointed to by rax (which is rsp+58h). It may seem confusing at first, but once you finish the string operations section it’ll be quite obvious how lea works. And don’t be alarmed if you still mix it up – everyone confuses themselves once in a while.

String Operations

String processing instructions are some of the most common and confusing things to deal with when starting out. There’s a lot to them, and we cover the essentials in this section. If you get stumped in later articles when we do crackmes, it will most likely be on these string processing instructions – deciphering where the data is flowing and what operations are being performed on it. These string operations are essential to understand.

— String Example

Being able to identify how strings are copied intrinsically is super useful because sometimes functions like strcmp, memcpy, or a custom implementation will be inlined in code. We’re going to look at an example that copies one string to another, and we’ll encounter some familiar instructions along the way. The original source will not be provided for this example, and the pseudo-C will only be provided at the end of the analysis. Try it on your own this time!

                push    rbp
                sub     rsp, 50h
                lea     rbp, [rsp+20h]
                mov     [rbp+28h], rdi
                mov     [rbp+20h], rsi
                lea     rax, qword ptr ds:[unk1]
                mov     [rbp+8], rax
                lea     rax, qword ptr ds:[unk2] 
                mov     [rbp+10h], rax
                mov     rax, [rbp+8]
                mov     rdx, [rbp+10h]
                mov     rsi, rax
                mov     rdi, rdx
                mov     rax, rsi

loc_140001039:
                mov     dl, [rdi]
                inc     rdi
                mov     [rsi], dl
                inc     rsi
                test    dl, dl
                jnz     short loc_140001039
                mov     [rbp+18h], rax
                lea     rax, fmt
                mov     rdx, [rbp+8]
                mov     rcx, rax
                call    printf
                mov     [rbp+0], eax
                mov     eax, 0
                mov     rsi, [rbp+20h]
                mov     rdi, [rbp+28h]
                lea     rsp, [rbp+30h]
                pop     rbp
                retn

The first two instructions should set off a few bells. The value used for the stack allocation is 50h (80), and 80 modulo 16 is 0, which on its own would leave the stack misaligned (remember that the call instruction already pushed an 8-byte return address). Is it misaligned? If you thought no, you’re correct. It’s not, because prior to allocating stack space we pushed one of our registers, rbp, onto the stack, which made an 8-byte allocation. This means that once we perform sub rsp, 50h our function will have adjusted the stack by 88 bytes in total. 88 modulo 16 is 8, which cancels out the 8-byte return address and abides by the alignment requirement specified in the ABI. There are a few variations in function prologues, and this is one of the more common sequences of instructions you’ll encounter.
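If it helps, here’s the alignment bookkeeping step by step, assuming the ABI’s requirement that rsp is 16-byte aligned at the point of a call:

rsp at the call site          : rsp % 16 == 0   (ABI requirement)
after call (ret address, +8)  : rsp % 16 == 8
after push rbp (+8)           : rsp % 16 == 0
after sub rsp, 50h (+80)      : rsp % 16 == 0   -> aligned for the calls we make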

After the prologue we have our first new instruction: lea. It’s not immediately obvious what’s going on, but it’s storing the address of the stack location [rsp+20h] in rbp. Recall that rbp is commonly referred to as the base pointer, and here it points to a seemingly arbitrary stack location. How much space do we normally allocate for spill space? 32 bytes (20h). However, only 24 of those bytes come from the sub instruction, since push rbp already placed 8 bytes on the stack. So we have our typical 32 bytes accounted for, and then we store the address [rsp+20h] in rbp. This is what’s called setting up the stack frame.

The Stack Frame

A stack frame is an area of stack space that represents a procedure call and the data associated with it. By the time a call instruction executes, any stack-based arguments have already been placed by the caller; the call pushes the return address, and the callee then allocates space for saved registers and local storage.

It should be starting to make a little more sense. The spill space is allocated, as well as storage for our arguments and local variables. In this instance, the function we’re analyzing is the main entry point of our program – all the code executes in there. That function takes two arguments: the count of arguments and the command line argument array. Our main function has a different calling convention, and those arguments are passed through rsi and rdi, respectively. We now see that this sequence is setting up our stack frame for the function. The reason it uses [rsp+20h] as the base of the frame for the main function is that the last 32 bytes allocated (rsp through rsp+20h) are used to set up the stack for a call to another function. Different calling conventions have different stack-maintenance responsibilities. In this case, the calling convention of the function we’re analyzing is __cdecl, which is required to allocate stack space for any functions called inside of it. The [rsp+20h] is used since those remaining 32 bytes from rsp to rsp+20h are the spill space for printf. Knowing the differences between calling conventions is a must, and I encourage you to learn them from the direct links here or in the recommended reading section.

What that all means is that the lowest 32 bytes of the initial stack allocation aren’t used by our function itself, and we know that another 32 bytes are the spill space for our function. If we take 88 bytes (the total stack space allocated), subtract 32 bytes (the allocation for the callee), and then subtract another 32 bytes (our function’s spill space), we’re left with 24 bytes for local variables on the stack. 24 divided by 8 is 3, meaning there are 3 local variables used in this function. Now that we know how many locals are used, tracking variable movement is a lot easier. This also tells us that rbp marks the last stack slot used by our function – the base of the stack (or call) frame. So when we see rbp used with an offset, think of it as the top of the stack for the currently executing function.
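Summarizing the arithmetic:

 88 bytes  total adjustment (push rbp + sub rsp, 50h)
-32 bytes  allocation for the callee (printf’s spill space, rsp..rsp+1Fh)
-32 bytes  our own spill space
---------
 24 bytes  of local storage -> 24 / 8 = 3 qword locals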

Since our calling convention was noted as __cdecl the first two arguments are stored in rsi and rdi. Then those are stored in the spill space for our function.

mov [rbp+28h], rdi 
mov [rbp+20h], rsi

To understand how this would look in a stack view, see below.

[Figure: stack view of the frame setup]

The diagram above shows what each instruction of the opening sequence is referencing, and how they all work together. If you were to omit the frame pointer (rbp) and look at where the rdi and rsi registers store their values you’d see they wrote to [rsp+48h] and [rsp+40h]. Now you know how I deduced it was writing to spill space. Let’s bring our disassembly back into view.

                push    rbp
                sub     rsp, 50h
                lea     rbp, [rsp+20h]
                mov     [rbp+28h], rdi
                mov     [rbp+20h], rsi
                lea     rax, qword ptr ds:[unk1]
                mov     [rbp+8], rax
                lea     rax, qword ptr ds:[unk2] 
                mov     [rbp+10h], rax
                mov     rax, [rbp+8]
                mov     rdx, [rbp+10h]
                mov     rsi, rax
                mov     rdi, rdx
                mov     rax, rsi

loc_140001039:
                mov     dl, [rdi]
                inc     rdi
                mov     [rsi], dl
                inc     rsi
                test    dl, dl
                jnz     short loc_140001039
                mov     [rbp+18h], rax
                lea     rax, fmt
                mov     rdx, [rbp+8]
                mov     rcx, rax
                call    printf
                mov     [rbp+0], eax
                mov     eax, 0
                mov     rsi, [rbp+20h]
                mov     rdi, [rbp+28h]
                lea     rsp, [rbp+30h]
                pop     rbp
                retn

It gets a bit easier here once we’re past the details of the opening 5 instructions. We perform an lea to load the pointer of an item into rax – for this example, it’s obviously a string. Then we store rax into [rbp+8], or location 28 in our stack diagram. The same goes for the next two instructions, except they load the address of a different string. The next four instructions copy the contents of registers into other registers. This is where inlining has occurred. We know this because, from our discussion earlier, a function with the __cdecl calling convention takes 2 arguments through rsi and rdi, and at this point we see the two registers being loaded with the pointers to these strings – a call would follow here if the function weren’t inlined. We should make note of the mov rax, rsi instruction since it preserves the original pointer address of unk1.

Labels in Disassembly

When reading a disassembly listing any time you notice a label such as loc_x it should be in the back of your mind that there is a conditional somewhere else in the code that references it. It could be used in an error condition, a loop, an if/else, a goto, etc.

As soon as we see the loc_140001039 we need to make note of any reference to it that may be nearby. There is one, the jnz loc_140001039 only 5 instructions away. This is indicative of a loop. Let’s look at the code that’s looping.

loc_140001039:
                mov     dl, [rdi]
                inc     rdi
                mov     [rsi], dl
                inc     rsi
                test    dl, dl
                jnz     short loc_140001039

Let’s make some notes about this sequence.

dl is the lowest byte of rdx
rdi contains the pointer to unk2[0] (the base of the source string)
[rdi] accesses the contents at the address the pointer holds – the first character of unk2

After reading these notes we can analyze what’s going on. If you’ve programmed in C or a similar language, you know that a character in a string is one byte in size. The sub-register dl is also one byte in size. The instruction mov dl, [rdi], therefore, dereferences the address in rdi and copies one byte from that location into dl. Then rdi is incremented, which means it now points to the next character in the string, since arrays are allocated contiguously in memory. The character is copied into the location pointed to by rsi, and rsi is incremented so that it points to the next slot in its sequence. Then comes a test instruction – one of the instances where it decided to show itself. It performs a logical AND on the two operands dl and dl. This is common to see in string copy loops: ANDing a byte against itself yields the byte unchanged while setting the status flags based on the result. If the character is NULL, the result of the test is 0 and the zero flag is set, which means the jnz branch will not be taken – simply put, the loop finishes when a 0 byte is encountered.

We know that strings have a null terminator (null byte) appended to the end of their sequence, so this loops until the end of the source string is encountered. Once the loop ends, code execution continues on a linear path through the rest of the excerpt. The operation performed on these two strings should be clear at this point: this is a string copy! An unsafe one at that, since it copies until the end of the source string is hit – but what about the destination? It could keep overwriting data far beyond the length of that buffer.
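For reference, here’s the same loop expressed in C – the dst/src names are mine, and this is the shape of the classic inlined strcpy:

char *inline_copy(char *dst, const char *src)
{
    char *ret = dst;                    /* mov rax, rsi - preserve the destination  */
    while ((*dst++ = *src++) != '\0')   /* mov dl, [rdi] / mov [rsi], dl / inc both */
        ;                               /* test dl, dl / jnz - stop after the NUL   */
    return ret;
}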

Unsafe Copy Operations

There is a reason that unsafe copy operations are flagged by many compilers. Unsafe copies like the one depicted above are frequently used in buffer overflow exploits, and this one in particular could be weaponized to hijack the control flow of the program. This is another reason why keeping an eye out for sequences like this will help you when reverse engineering or building exploits.
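For contrast, here’s a sketch of a bounds-aware variant – dst_size is a hypothetical parameter the original code never had, and the function always NUL-terminates:

#include <stddef.h>

char *bounded_copy(char *dst, const char *src, size_t dst_size)
{
    size_t i = 0;

    if (dst_size == 0)
        return dst;
    while (i < dst_size - 1 && src[i] != '\0') {
        dst[i] = src[i];                /* copy at most dst_size - 1 characters */
        i++;
    }
    dst[i] = '\0';                      /* guarantee termination                */
    return dst;
}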

At this point I’ve decided you should be challenged to apply what you’ve learned and convert the disassembly to pseudocode. The pseudocode you should’ve constructed is available here, but I encourage you not to look until you’ve spent time attempting it yourself.

Conclusion

In this crash course on x64 assembly we have covered quite a lot, even with just simple examples. There’s no way to pack years of learning assembly – all the tricks and nuances of its instructions – into a single article, but I hope this first part has helped build a solid foundation for you to begin learning assembly. The content that belongs here could fill a book, and I intend to include as much as I can to make that foundation as solid as possible; however, this should not be seen as a one-stop shop for learning assembly. That being said, in the next part of Accelerated Assembly we will cover more advanced examples like bitmasking, bit rotation, string encryption, rolling encryption, and some examples that use a few instructions as anti-debugging mechanisms. We’ll tear down some built-out examples of authorization, encryption, and a game example.

As always, feel free to ask any questions, feedback, or otherwise, you may have! Thanks for reading!

Legal Notice: All of this information is intended for educational purposes only. I do not endorse using this knowledge for illegal activity.

Recommended Reading


✇Reverse Engineering

Applied Reverse Engineering: Accelerated Assembly [P2]

By: Daax Rynd

Overview

After reading feedback from the first part of the Accelerated Assembly guide, I’ve decided to take on a custom target and call back to high-level languages when we encounter obscure or new pieces in the assembly. I realize that the level of detail in my last article may have been cumbersome for some readers, but I plan to stick to covering what is necessary to understand the material on the page. That being said, this article is going to be about the same length, but only because the example was created by a friend, so I have no prior knowledge of the implementation details. We’ll go from black box to well-documented. Along the way I’ll be teaching you how I go about assessing a target and documenting functionality, as well as techniques I use to understand complex assembly listings. We’ll be referencing the Intel and AMD software development manuals often. It’s important to remember that this series serves as a guide to reverse engineering on a Windows OS, and how to think about reverse engineering. All skills learned can be taken and applied to other systems.

All demos are performed on Windows 10 Version 2004; Build 19035. (This build is not required. Having Windows 10 will be sufficient.)

Disclaimer

All examples and information provided in these articles are based on C/C++ applications. It is assumed you have programmed and have experience in a high-level language such as C, C++, or Rust. If you do not, the contents of this series may be difficult to follow. All author projects are written with Visual Studio 2019 and compiled using the Intel C++ 19.1 Compiler. All optimizations are turned off, since compiler optimizations produce obscure assembly listings that complicate comprehension. Some details are omitted to prevent diving down the rabbit hole even further. If you’re an avid reader and want to know more than is provided, see the recommended reading section at the end.

Addendum: I want to quickly take a moment to address my style of writing and teaching. I’m a firm believer in learning by doing as that’s the way I learned and how I continue learning in regards to reverse engineering or development of anything, for that matter. I realize some learn better through extensive breakdowns, simpler examples, live demonstrations, etc. As much as I wish I could I don’t have the means to cater to all learning styles, so writing is the best outlet I can give. I write so that interested readers, regardless of learning style, can come back without having to timestamp a video or scour many examples to find some piece of information they forgot or need. I hope that if your learning style is much different than mine that you still find value in these pieces and know that I’m always available to answer questions or help improve your understanding.

I want you to succeed at learning how to reverse engineer and apply it to the real world, but I’m not a teacher by any means so if there are gaps please bear with me, or let me know so that I can add it in! And thank you for your patience while I write these :).

— Omissions from Part 1

𝛿 Linked List/Doubly-Linked List Example (Intentional)

Target Acquired

Now that we’ve covered a few necessary examples, and you know the details of all the calling conventions – let’s get right into it. We’re going to assess a single target in this section. It’s going to be long, complicated, and probably mildly frustrating at times, but you will come out on the other side knowing much more than you did before!

— Robbing A Bank Requires A Blueprint

This first example is based on an authorization protocol I’ve seen used in the wild. It’s quite shoddy and broken, and yes – this is a rough recreation of it. There are multiple procedures used in this function; however, all license validation is performed locally, in the entry point of the application. The application itself was widely used and presented a number of attack vectors with regard to exploitation, which we’ll cover as we encounter them. The assembly is somewhat confusing if you’re just starting out, but we’re going to break it down piece by piece and establish a knowledge base of the target. We’ll note things like local variables used, potentially inlined functions, CRT procedures, and all the attack vectors. This example makes use of structures, and we’ll see how they’re used as well as how to deduce what the different members of the structure(s) are. There will be a lot of new things encountered, so make note of anything that may be confusing and be sure to review the breakdown afterward.

We will be starting without prior knowledge of the source code, and to ensure that I did that as well I had a friend write the application and then I reversed it ahead of time. I did this so that assumptions aren’t made with insider knowledge, so to speak. That way your results will be consistent with mine as we walk through it. I’ll provide the source code given at the end of this break down for you to compare your pseudocode with.

What about the tools?

The reason I haven’t introduced any of the tools yet is that, as you learn to reverse engineer, it’s important not to become dependent on the tools you may have, such as IDA, x64dbg, Hiew, etc. To become proficient at RE you need to be able to work from a plain disassembly listing and deduce as much as you can from that. There may not always be a tool that supports the architecture your target is running on, and at that point, if you’re dependent on tools, you become useless. We’re going to work from a standpoint of only having knowledge of basic instructions and the architecture, making deductions about behavior and program flow. Once we understand the flow of the program overall, we’ll dig into the details and hunt down what is used where. This will be slow, but it will allow you to move quickly in the future if you don’t have any tools at your disposal.

With that being said, here’s the listing from the entry point of the target program.

                push    rbp
                sub     rsp, 0B0h
                lea     rbp, [rsp+40h]
                mov     [rbp+68h], rdi
                mov     [rbp+60h], rsi
                mov     [rbp+58h], rbx
                mov     dword ptr [rbp+0], 0
                mov     eax, 105h
                mov     rcx, rax
                call    _unk_crt
                mov     [rbp+28h], rax
                mov     rax, [rbp+28h]
                mov     [rbp+30h], rax
                mov     dword ptr [rbp+4], 105h
                mov     dword ptr [rbp+8], 0
                mov     dword ptr [rbp+0Ch], 0
                mov     rax, cs:GetVolumeInformationW
                mov     [rbp+38h], rax
                mov     eax, 105h
                mov     rcx, rax
                call    _unk_crt
                mov     [rbp+40h], rax
                mov     rax, [rbp+40h]
                mov     [rbp+48h], rax
                lea     rax, unk_140028000
                mov     rdx, [rbp+48h]
                mov     rcx, rax
                call    sub_14000113C
                mov     [rbp+10h], eax
                mov     rax, [rbp+38h]
                mov     edx, 0
                mov     rcx, [rbp+30h]
                mov     ebx, [rbp+4]
                lea     rsi, [rbp+0]
                lea     rdi, [rbp+8]
                mov     [rsp+20h], rdi
                lea     rdi, [rbp+0Ch]
                mov     [rsp+28h], rdi
                mov     qword ptr [rsp+30h], 0
                mov     edi, 0FFFFFFFFh
                add     edi, [rbp+4]
                mov     [rsp+38h], edi
                mov     [rbp+50h], rcx
                mov     rcx, rdx
                mov     rdx, [rbp+50h]
                mov     r8d, ebx
                mov     r9, rsi
                call    rax
                mov     [rbp+14h], eax
                mov     eax, [rbp+0]
                mov     ecx, eax
                call    sub_14000136C
                mov     [rbp+18h], eax
                mov     eax, [rbp+18h]
                test    eax, eax
                jz      short loc_140001125
                mov     rax, [rbp+48h]
                mov     rcx, rax
                call    sub_1400014A4
                mov     [rbp+1Ch], eax
                mov     eax, [rbp+0]
                mov     ecx, eax
                call    sub_14000136C
                mov     [rbp+20h], eax
                mov     eax, [rbp+1Ch]
                mov     edx, [rbp+20h]
                cmp     eax, edx
                jnz     short loc_140001125
                lea     rax, unk1
                mov     rcx, rax
                call    sub_140001254
                mov     [rbp+24h], eax

loc_140001125:                          
                                        
                mov     eax, 0
                mov     rbx, [rbp+58h]
                mov     rsi, [rbp+60h]
                mov     rdi, [rbp+68h]
                lea     rsp, [rbp+70h]
                pop     rbp
                retn

Initially, this may appear quite daunting given the examples in the previous post. When you see long listings like this, your first instinct should be to break things off in chunks. On an initial pass we notice only two branching instructions, both leading to an area where cleanup is performed and the function returns. This means there are two conditionals inside this function, and we know that right off the bat, which is helpful. There are no error checks after each invocation (otherwise there would be branches after the call instructions), and the code path is linear until we hit the first instruction where we could potentially branch. Knowing this simplifies analysis. We can walk down the listing until the branching instruction.

Let’s start by pulling a section of this assembly out and walking through it.

push    rbp
sub     rsp, 0B0h
lea     rbp, [rsp+40h]
mov     [rbp+68h], rdi
mov     [rbp+60h], rsi
mov     [rbp+58h], rbx
mov     dword ptr [rbp+0], 0
mov     eax, 105h
mov     rcx, rax
call    _unk_crt
mov     [rbp+28h], rax

How did I know to pull out the code up to that point? I looked for the first call, saw where the return value in rax was stored, and pulled everything from the first instruction down to that store. We’re going to start with some simple math to attempt to determine how many local variables are present, and guess their types. In the snippet we see that we push rbp to save the previous function’s frame pointer, then we subtract 0B0h (176 bytes) from the stack. This is a total of 184 bytes, and 184 modulo 16 = 8. We can assume that when a compiler generates this code the stack will be aligned properly, but to infer local storage allocation we need to know how much is initially allocated. Then we see the function set up our stack frame using lea rbp, [rsp+40h]. With all this information, we take our initial allocation value B8h (184, incl. push rbp) and subtract 40h (64), which leaves us with 78h (120) bytes. That’s still a lot of space. Let’s then take the size of our shadow store (spill space) and subtract it from 78h: 78h (120) - 20h (32) = 58h (88). We have 88 bytes on the stack for local storage, and 88 modulo 16 = 8 – we’re aligned. That’s still a lot of storage, so let’s try something different. Refer back to the full function disassembly. We need to find the lowest and highest offsets from rbp, and note missing offsets when addressing and storing values.
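The same math, compactly:

 B8h (184)  total adjustment: push rbp (8) + sub rsp, 0B0h (176)
-40h  (64)  region below the frame base (lea rbp, [rsp+40h])
=78h (120)
-20h  (32)  shadow store (spill space)
=58h  (88)  candidate local storage, and 88 % 16 == 8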

The lowest offset is shown in our snippet at the 7th instruction, mov dword ptr [rbp+0], 0, so the lowest offset is 0. How about our highest? At the very end, lea rsp, [rbp+70h] restores the stack to its state before our function’s code ran. Now comes the tedious part: making note of all offsets used.

58, 60, 68, 70, 0, 28, 30, 4, 8, C, 40, 48, 10, 38, 20, 28, 50, 14, 18, 1C, 24

What can we do with these? Well, first we should sort them and identify the gaps. The offsets that are unused are typically used for alignment purposes when different sized variables are used. I ran this list through a sorting algorithm and removed duplicates, and this was the output.

00 04 08 0C 10 14 18 1C 20 24 28 30 38 40 48 50 58 60 68 70

We accidentally included the offsets that go into our shadow store, so let’s chop those off.

00 04 08 0C 10 14 18 1C 20 24 28 30 38 40 48 50

Now we’re getting somewhere. Simply glancing at the list tells you there are some 4-byte offsets indicating some 32-bit storage, and then it jumps to 8-byte offsets indicating some 64-bit storage area. We could take this information and assume this is how many local variables were used in this function. If we did that we’d wind up calculating that there are potentially:

32-bit variables = 10
64-bit variables = 5

We want to be more thorough than that. How could we determine the true number of local variables used? Remember that this example is unoptimized and the compiler tends to repeat operations that could be cut out (think back to the examples in last post). If you thought about counting uses of temporary storage you’re absolutely correct. Counting uses of temporary storage is as simple as looking for rbp+N offsets that are used to store register contents and then are followed by the copying of that value to a different register. Here’s an instance of it from the code above:

call    _unk_crt
mov     [rbp+28h], rax
mov     rax, [rbp+28h]
mov     [rbp+30h], rax

It takes the return value of _unk_crt and stores it in [rbp+28h], then copies [rbp+28h] back into rax, and finally stores rax into [rbp+30h]. Why didn’t it copy rax into [rbp+30h] to begin with? No optimization. The compiler was so lazy it made its own job more difficult, but this helps us! Look around in the code and you’ll see there are no other references to [rbp+28h]. Great – now we can mark it as temporary storage and take it out of our variable count. If we do the same for all instances of temporary storage, and even potentially wasted storage, we wind up with 4 32-bit local variables. By looking for unused storage, temporary storage, and unnecessary copies, we narrowed down the count of our 32-bit variables. What do I mean by unused storage?

mov     [rbp+10h], eax ; [rbp+10h] is not used anywhere else in function.

Okay, how about our 64-bit variables? Well, we see there are a few function calls where the return value is stored in a stack location, much like the first call to _unk_crt. For the second call, the value winds up in temporary storage at [rbp+40h] and then in [rbp+48h] finally. That’s one 64-bit variable. There are two calls to _unk_crt, so it’s safe to assume there are 2 64-bit variables at a minimum. If we look for copies to stack locations with the qword ptr specifier, we could possibly note those as 64-bit variables. But wait – the only move to a 64-bit location with this instruction encoding uses [rsp+30h] and isn’t offsetting from the current stack frame! This is generally indicative that the value you’re looking at is used as an argument to a function.

Calling Convention Matters

The default calling convention on Windows is fastcall. The first 4 arguments are passed through registers rcx, rdx, r8, and r9; respectively. Any other arguments are pushed onto the stack, or in this case, loaded into preallocated stack locations.
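As a concrete sketch, a hypothetical six-argument call f(1, 2, 3, 4, 5, 6) would be lowered under this convention roughly as follows (the function name is invented, and the first stack slot sits just above the 32-byte shadow store):

mov     ecx, 1                  ; argument 1
mov     edx, 2                  ; argument 2
mov     r8d, 3                  ; argument 3
mov     r9d, 4                  ; argument 4
mov     dword ptr [rsp+20h], 5  ; argument 5 - first stack slot
mov     dword ptr [rsp+28h], 6  ; argument 6
call    f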

This means that the 64-bit copy you see offsetting from rsp is loading that location with an argument for some function call further down the instruction stream. You can see this call happen a few instructions later: call rax. In this case, we don’t consider it a 64-bit local variable of our current function. If we continue to look around, you may also notice the address of GetVolumeInformationW being copied into rax, and rax then being stored in [rbp+38h] using the default operand size (64 bits). That’s 3 64-bit variables we’ve tracked so far. Scanning the excerpt further, I don’t see any other instructions that store a 64-bit value to a stack location. We’ve now determined with reasonable confidence that we have the following variables used in our function:

32-bit variables = 4
64-bit variables = 3

If you thought that was a lot of work to determine this, you’re right – it’s more than you’d normally have to do when using commercial tools like IDA or Binary Ninja. Learning as if you have no tools is the best way to become proficient, however, so we’re going to continue operating as if we don’t have them at our disposal.

Reversing Challenge

There is an easier way, in this example, to determine the number of 32-bit variables being used in this function. Can you identify how? There are hints in the above explanation.

# Returning to the Snippet

push    rbp
sub     rsp, 0B0h
lea     rbp, [rsp+40h]
mov     [rbp+68h], rdi
mov     [rbp+60h], rsi
mov     [rbp+58h], rbx
mov     dword ptr [rbp+0], 0
mov     eax, 105h
mov     rcx, rax
call    _unk_crt
mov     [rbp+28h], rax

After the stack allocation, stack frame setup, and the copying of our arguments to the shadow store, we see a 32-bit move to [rbp]. This is one of our local variables being initialized to 0. Notice the dword ptr specifier – this is the answer to the challenge above. Moving on, we see the value 105h (261) loaded into rcx (by way of rax) and then a call to _unk_crt. The 261 is the first argument, as per the calling convention. We don’t know what this function does, but we do know that its return value is 64 bits and is stored in a local variable. Let’s pick out the next snippet.

mov     rax, [rbp+28h]
mov     [rbp+30h], rax
mov     dword ptr [rbp+4], 105h
mov     dword ptr [rbp+8], 0
mov     dword ptr [rbp+0Ch], 0
mov     rax, cs:GetVolumeInformationW
mov     [rbp+38h], rax
mov     eax, 105h
mov     rcx, rax
call    _unk_crt
mov     [rbp+40h], rax
mov     rax, [rbp+40h]
mov     [rbp+48h], rax

The return value from _unk_crt is stored into rax, then rax into its final location. The next three instructions initialize 3 more 32-bit locals to the respective values shown. Make note of the 105h (261) constant again – it may be useful in the future. The next part is quite interesting and maybe not something you’ve seen before. It’s taking the address of GetVolumeInformationW and storing it in rax, which is then copied to the stack location [rbp+38h] – which we’ve already determined is 64 bits in size. If you’ve worked in C or C++ before, you might recognize this pattern as some form of function pointer, but that’s a little presumptive with the information we have now. Following that, we see the subroutine _unk_crt executed again with the same constant used in the beginning, and the temporary storage is used before the returned value is copied to its local variable’s stack location. The next snippet is a little bit longer and somewhat more confusing. Since we know which offsets are used for our local variables, let’s give those offsets aliases instead.

[rbp+0] => [rbp+v1]
[rbp+4] => [rbp+v2]
[rbp+8] => [rbp+v3]
[rbp+0Ch] => [rbp+v4]
[rbp+30h] => [rbp+unk_crt_ret_1]
[rbp+38h] => [rbp+gviw_address]
[rbp+48h] => [rbp+unk_crt_ret_2]

This will make identifying where certain things are used much simpler. To be clear v1 is simply an alias for 0, v2 for 4, and so on. We’re going to use these from now on and I’ve replaced the offsets in the snippets with their associated alias. Here’s the modified assembly of the next excerpt we have to analyze:

lea     rax, unk_140028000
mov     rdx, [rbp+unk_crt_ret_2]
mov     rcx, rax
call    sub_14000113C
mov     [rbp+10h], eax
mov     rax, [rbp+gviw_address]
mov     edx, 0
mov     rcx, [rbp+unk_crt_ret_1]
mov     ebx, [rbp+v2]
lea     rsi, [rbp+v1]
lea     rdi, [rbp+v3]
mov     [rsp+20h], rdi
lea     rdi, [rbp+v4]
mov     [rsp+28h], rdi
mov     qword ptr [rsp+30h], 0
mov     edi, 0FFFFFFFFh
add     edi, [rbp+v2]
mov     [rsp+38h], edi
mov     [rbp+50h], rcx
mov     rcx, rdx
mov     rdx, [rbp+50h]
mov     r8d, ebx
mov     r9, rsi
call    rax

We can quickly pinpoint where certain locals are being used now. First, it loads the address of some unknown object using lea. We don’t know what this object is, but it’s later placed in rcx to be used as an argument to call sub_14000113C. We also see that the return value from the second call to _unk_crt is used as the second argument. Alright – there may be some sort of string or memory operation being performed by sub_14000113C. We’ll get back to that. The next instruction is a garbage store, because rbp+10h is not used anywhere else in the assembly. Let’s start speeding this up and absorb what multiple instructions are doing at one time. We’re loading registers with the contents of local variables, then we load the addresses of three local variables – v1, v3, and v4. Notice the two instructions not copying to areas in our stack frame: they’re offsetting from rsp, so the addresses of these local variables are being used as function arguments! Below is a view of what using these locals would look like in a high-level language.

rand_func(.., .., .., &v1, &v3, &v4, .., etc);

Then there’s the mov qword ptr [rsp+30h], 0, which passes 0 as an argument to some function. Next, 0FFFFFFFFh (-1) is loaded into edi and the value of v2 is added to it – the same as computing v2 - 1. We see it copy one more argument, preserve the value of rcx in a stack location, and then load up our calling convention registers rcx, rdx, r8, and r9. Then the program executes call rax – wait, what? Recall that a few instructions above we loaded the address of GetVolumeInformationW into rax. This confirms that the local storing that address was a function pointer.
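The -1 trick, spelled out:

edi = 0FFFFFFFFh       ; -1 encoded as a 32-bit two's complement value
edi = edi + v2         ; 0xFFFFFFFF + 0x105 = 0x100000104
                       ; truncated to 32 bits -> 0x104, i.e. v2 - 1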

# Developing Pseudocode

So far we’ve uncovered the local variable initialization, two function calls where the return value is saved, two function calls where the return value is ignored or unused, and now a function call using almost all of the locals we initialized. At this point, we need to take a second and start developing the pseudocode of this function. Let’s bring the instructions up to the end of the last excerpt into view.

push    rbp
sub     rsp, 0B0h
lea     rbp, [rsp+40h]
mov     [rbp+68h], rdi
mov     [rbp+60h], rsi
mov     [rbp+58h], rbx
mov     dword ptr [rbp+v1], 0
mov     eax, 105h
mov     rcx, rax
call    _unk_crt
mov     [rbp+28h], rax
mov     rax, [rbp+28h]
mov     [rbp+unk_crt_ret_1], rax
mov     dword ptr [rbp+v2], 105h
mov     dword ptr [rbp+v3], 0
mov     dword ptr [rbp+v4], 0
mov     rax, cs:GetVolumeInformationW
mov     [rbp+gviw_address], rax
mov     eax, 105h
mov     rcx, rax
call    _unk_crt
mov     [rbp+40h], rax
mov     rax, [rbp+40h]
mov     [rbp+unk_crt_ret_2], rax
lea     rax, unk_140028000
mov     rdx, [rbp+unk_crt_ret_2]
mov     rcx, rax
call    sub_14000113C
mov     [rbp+10h], eax
mov     rax, [rbp+gviw_address]
mov     edx, 0
mov     rcx, [rbp+unk_crt_ret_1]
mov     ebx, [rbp+v2]
lea     rsi, [rbp+v1]
lea     rdi, [rbp+v3]
mov     [rsp+20h], rdi
lea     rdi, [rbp+v4]
mov     [rsp+28h], rdi
mov     qword ptr [rsp+30h], 0
mov     edi, 0FFFFFFFFh
add     edi, [rbp+v2]
mov     [rsp+38h], edi
mov     [rbp+50h], rcx
mov     rcx, rdx
mov     rdx, [rbp+50h]
mov     r8d, ebx
mov     r9, rsi
call    rax

Start on line 7, since the prologue is of no interest. We have a 32-bit variable initialized to 0, and we know the other three are 105h, 0, and 0, in that order.

int __cdecl main(int argc, char** argv)
{
    u32 v1 = 0;
    u32 v2 = 0x105;
    u32 v3 = 0;
    u32 v4 = 0;
    
    return 0;
}

We know it’s the main entry point, and we know it’s using the cdecl calling convention since it saves rdi and rsi into the shadow store. We also know there are three 64-bit variables: one is a function pointer to GetVolumeInformationW, and the other two store return values from _unk_crt, where the argument to each of those calls is the value 105h.

int __cdecl main(int argc, char** argv)
{
    u32 v1 = 0;
    u32 v2 = 0x105;
    u32 v3 = 0;
    u32 v4 = 0;
    u64 unk_crt_ret_1 = _unk_crt(0x105);
    u64 unk_crt_ret_2 = _unk_crt(0x105);
    
    // Create function pointer prototype and bind it.
    // 
    typedef int (__stdcall *gviw_t)( const char*, char*, u32, u32*, u32*, u32*, char*, u32 );
    gviw_t gviw = (gviw_t)GetVolumeInformationW;
    
    // Call GetVolumeInformationW indirectly.
    //
    
    return 0;
}

Let’s simplify this a little bit more. We can use v2 instead of the two constants for the arguments to _unk_crt.

int __cdecl main(int argc, char** argv)
{
    u32 v1 = 0;
    u32 v2 = 0x105;
    u32 v3 = 0;
    u32 v4 = 0;
    u64 unk_crt_ret_1 = _unk_crt(v2);
    u64 unk_crt_ret_2 = _unk_crt(v2);
    
    // Create function pointer prototype and bind it.
    // 
    typedef int (__stdcall *gviw_t)( const char*, char*, u32, u32*, u32*, u32*, char*, u32 );
    gviw_t gviw = (gviw_t)GetVolumeInformationW;
    
    // Call GetVolumeInformationW indirectly.
    //
    
    return 0;
}

Nice – we’re beginning to develop a clear picture of what’s going on. Now comes the tricky part, where it pays to know your calling conventions and how the stack is manipulated to pass arguments. Here are the instructions that prepare the arguments for the indirect call to GetVolumeInformationW.

mov     rax, [rbp+gviw_address]
mov     edx, 0
mov     rcx, [rbp+unk_crt_ret_1]
mov     ebx, [rbp+v2]
lea     rsi, [rbp+v1]
lea     rdi, [rbp+v3]
mov     [rsp+20h], rdi
lea     rdi, [rbp+v4]
mov     [rsp+28h], rdi
mov     qword ptr [rsp+30h], 0
mov     edi, 0FFFFFFFFh
add     edi, [rbp+v2]
mov     [rsp+38h], edi
mov     [rbp+50h], rcx
mov     rcx, rdx
mov     rdx, [rbp+50h]
mov     r8d, ebx
mov     r9, rsi
call    rax

There’s an easy way to do this and a hard way. We’re going to do the hard way first, of course.

# The Hard Way

Let’s begin with what we know. Parameters are passed from left to right through rcx, rdx, r8, and r9 – in that order. If the function has more than 4 arguments, the rest are placed on the stack from right to left, meaning the last argument will be at the highest offset from rsp. This means that the fifth argument of this procedure call is placed in [rsp+20h] and the final argument in [rsp+38h]. Remember that offsets from rbp reach into our stack frame, while the offsets from rsp here are placing arguments on the stack for the function call. Let’s quickly look at what our function call would look like without the variables:

rax(rcx, rdx, r8d, r9, [rsp+20h], [rsp+28h], [rsp+30h], [rsp+38h]);

In this pseudo-call, we see that rax is used as a function and the arguments are in their places as they would appear in a high-level language. How do we know that [rsp+38h] is the last argument of the function? Using context clues from the excerpt, and the fact that offsets from rsp are not used anywhere else, we infer that these are used in the function call. Let’s start by determining what is in each register and argument slot. The first instruction copies gviw_address into rax. Simple. Afterward, we load edx with 0 – to determine all uses without scanning, we can just search for other instances of edx in this chunk. We see it winds up getting loaded into rcx toward the end and is used nowhere else, so we know our first argument is 0. We’ll do this same process for the rest of the variables. You might notice that rcx gets loaded with [rbp+unk_crt_ret_1]; however, rcx copies its value into [rbp+50h], which is then stored in rdx prior to the call. Sweet – we know that our second argument is unk_crt_ret_1. If we continue doing this for all the arguments, you’ll wind up determining that the function call looks like this:

gviw(0, unk_crt_ret_1, v2, &v1, &v3, &v4, 0, v2 - 1);

Remember that when lea is used it loads the address of the location referenced, not the data inside. Thus our few lea instructions are supplying the addresses of the locals they reference.

# The Easy Way

We know that the function being called is GetVolumeInformationW – it’s just using a function pointer instead of a direct call to the API. If we open up the MSDN page for GetVolumeInformationW we can see the arguments that are used. This is why context clues are a world of help when reverse engineering a program. They will also help us determine which arguments are of what type, so we can properly write the types of our local variables.

Why didn’t we do this from the start?

Imagine you have a target that implements a custom CRT, doesn’t reference any documented API, and is a complete black box with minimal context clues. If you only learned how to take advantage of documented information, you would be lacking fundamental knowledge. It’s important to be able to determine what arguments are passed to a function without utilizing reference material. If it’s available, definitely use it. If it’s not, knowing how to do it the hard way will be advantageous.

The prototype of GetVolumeInformationW is this:

BOOL GetVolumeInformationW(
  LPCWSTR lpRootPathName,
  LPWSTR  lpVolumeNameBuffer,
  DWORD   nVolumeNameSize,
  LPDWORD lpVolumeSerialNumber,
  LPDWORD lpMaximumComponentLength,
  LPDWORD lpFileSystemFlags,
  LPWSTR  lpFileSystemNameBuffer,
  DWORD   nFileSystemNameSize
);

Let’s take this knowledge now and rename our locals and adjust our pseudocode implementation.

int __cdecl main(int argc, char** argv)
{
    u32 VolumeSerialNumber = 0;
    u32 VolumeNameSize = 0x105;
    u32 MaximumComponentLength = 0;
    u32 FileSystemFlags = 0;
    char* VolumeNameBuffer = (char*)_unk_crt(VolumeNameSize);
    u64 unk_crt_ret_2 = _unk_crt(VolumeNameSize);
    
    // Create function pointer prototype and bind it.
    // 
    typedef int (__stdcall *gviw_t)( const char*, char*, u32, u32*, u32*, u32*, char*, u32 );
    gviw_t gviw = (gviw_t)GetVolumeInformationW;
    
    // Call GetVolumeInformationW indirectly.
    //
    gviw(0, 
        VolumeNameBuffer, 
        VolumeNameSize, 
        &VolumeSerialNumber, 
        &MaximumComponentLength, 
        &FileSystemFlags, 
        0, 
        VolumeNameSize - 1);
    
    return 0;
}

This looks so much better now. We can also guess what _unk_crt is at this point: since VolumeNameBuffer is a pointer to the buffer returned by _unk_crt, and the function takes the size of the buffer as its argument, an educated guess would be malloc.

int __cdecl main(int argc, char** argv)
{
    u32 VolumeSerialNumber = 0;
    u32 VolumeNameSize = 0x105;
    u32 MaximumComponentLength = 0;
    u32 FileSystemFlags = 0;
    
    // Allocate buffers for two objects.
    //
    char* VolumeNameBuffer = (char*)malloc(VolumeNameSize);
    u64 unk_crt_ret_2 = (u64)malloc(VolumeNameSize);
    
    // Create function pointer prototype and bind it.
    // 
    typedef int (__stdcall *gviw_t)( const char*, char*, u32, u32*, u32*, u32*, char*, u32 );
    gviw_t gviw = (gviw_t)GetVolumeInformationW;
    
    // Call GetVolumeInformationW indirectly.
    //
    gviw(0, 
        VolumeNameBuffer, 
        VolumeNameSize, 
        &VolumeSerialNumber, 
        &MaximumComponentLength, 
        &FileSystemFlags, 
        0, 
        VolumeNameSize - 1);
    
    return 0;
}

# Completing Analysis

This is coming together nicely, but we’re not quite done with the main function. There’s one buffer we don’t know about and still a little more functionality to document before we get into the different calls. Let’s bring the disassembly from call rax to the end of the function.

                call    rax ; GetVolumeInformationW(...);
                mov     [rbp+14h], eax
                mov     eax, [rbp+VolumeSerialNumber]
                mov     ecx, eax
                call    sub_14000136C
                mov     [rbp+18h], eax
                mov     eax, [rbp+18h]
                test    eax, eax
                jz      short loc_140001125
                mov     rax, [rbp+unk_crt_ret_2]
                mov     rcx, rax
                call    sub_1400014A4
                mov     [rbp+1Ch], eax
                mov     eax, [rbp+VolumeSerialNumber]
                mov     ecx, eax
                call    sub_14000136C
                mov     [rbp+20h], eax
                mov     eax, [rbp+1Ch]
                mov     edx, [rbp+20h]
                cmp     eax, edx
                jnz     short loc_140001125
                lea     rax, unk1
                mov     rcx, rax
                call    sub_140001254
                mov     [rbp+24h], eax

loc_140001125:                          
                                        
                mov     eax, 0
                mov     rbx, [rbp+58h]
                mov     rsi, [rbp+60h]
                mov     rdi, [rbp+68h]
                lea     rsp, [rbp+70h]
                pop     rbp
                retn

Starting after our procedure call to GetVolumeInformationW, we see that the return value is stored in a stack location, but we already noted that this one is unused and essentially discarded. The VolumeSerialNumber is loaded into eax, then copied to ecx, and the program calls sub_14000136C. This means the call looks like sub_14000136C(VolumeSerialNumber). The return value is stored and then checked using test eax, eax. We’ve covered this sequence before and know it’s used to determine whether eax is 0 – if so, the following jump instruction will be taken. Why? The zero flag is set when the result is 0. This means that our function’s return value is used in a conditional statement.

if( sub_14000136C(VolumeSerialNumber) ) { }

The target of the branching instruction would be the else block, but that chunk just cleans up the stack and returns, so we can assume there is no else block and the if statement exists independently. The next call instruction, call sub_1400014A4, uses the unknown buffer unk_crt_ret_2. We don’t yet know what it does, but we’ll investigate soon. It then calls sub_14000136C with VolumeSerialNumber as the argument again, stores the two return values in stack locations, and moves those into registers prior to their comparison.

call sub_1400014A4 
mov [rbp+1Ch], eax 
mov eax, [rbp+VolumeSerialNumber] 
mov ecx, eax 
call sub_14000136C 
mov [rbp+20h], eax 
mov eax, [rbp+1Ch] 
mov edx, [rbp+20h] 
cmp eax, edx
jnz short loc_140001125

It then compares the results and, if they differ, jumps to loc_140001125 – similar to the first condition. Note that all of this code occurs within the if-block of the first condition, so we have a nested condition. If you’re confused about why, I recommend revisiting the first examples of the previous article. Our block will look like this:

if( sub_14000136C(VolumeSerialNumber) ) 
{ 
    if( sub_1400014A4(unk_crt_ret_2) == sub_14000136C(VolumeSerialNumber) )
    {
        
    }
}

I know that the comparison of the return values is occurring because of this sequence:

mov eax, [rbp+1Ch] 
mov edx, [rbp+20h] 
cmp eax, edx
jnz short loc_140001125

The code inside the nested if-block is pretty trivial with tools, but it’s not immediately obvious what sub_140001254 is without them.

lea rax, unk1 
mov rcx, rax 
call sub_140001254
mov [rbp+24h], eax

Since we’re operating as if we don’t have these tools, we’ll need to be creative. The unk1 object is globally available to the program, and we don’t yet have any idea where to look for it (you’ll learn about this in the PE File Format article). Given the structure of this program so far, and taking the conditional statements into account, we’re going to have to investigate the function. Below is the disassembly of sub_140001254.

push    rbp
sub     rsp, 50h
lea     rbp, [rsp+20h]
mov     [rbp+30h+var_10], rbx
mov     [rbp+30h+arg_0], rcx
mov     [rbp+30h+arg_8], rdx
mov     [rbp+30h+arg_10], r8
mov     [rbp+30h+arg_18], r9
mov     eax, 0
imul    eax, 8
movsxd  rax, eax
lea     rdx, [rbp+30h+arg_8]
add     rdx, rax
lea     rax, [rbp+30h+var_28]
mov     [rax], rdx
mov     eax, 1
mov     ecx, eax
call    __acrt_iob_func
mov     [rbp+30h+var_20], rax
mov     rax, [rbp+30h+var_20]
mov     rdx, [rbp+30h+arg_0]
mov     ecx, 0
mov     rbx, [rbp+30h+var_28]
mov     [rbp+30h+var_18], rcx
mov     rcx, rax
mov     rax, [rbp+30h+var_18]
mov     r8, rax
mov     r9, rbx
call    sub_1400011F4
mov     [rbp+30h+var_30], eax
mov     eax, [rbp+30h+var_30]
mov     [rbp+30h+var_2C], eax
mov     [rbp+30h+var_28], 0
mov     eax, [rbp+30h+var_2C]
mov     rbx, [rbp+30h+var_10]
lea     rsp, [rbp+30h]
pop     rbp
retn

We’re not going to break this down; we just want to quickly scan for any potential clues as to what this function is. Right smack-dab in the middle of the disassembly is a reference to __acrt_iob_func, and a quick google turns up tons of results – one of which is particularly helpful.

If we look at the argument supplied to __acrt_iob_func, we see that ecx is set to 1 – this is the index of our stdout file stream, which is what printf writes to! This function is almost guaranteed to be printf, and I know from my own experience that it is. You’ll learn to recognize CRT function patterns as you gain experience reversing applications. Now that we know this, if we go back to our main function, we know that unk1 is a string and sub_140001254 is printf. We are now one step closer to completing our analysis and pseudocode implementation.

lea rax, unk1 ; format string
mov rcx, rax 
call printf 
mov [rbp+24h], eax
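For reference, here’s the usual shape of this CRT plumbing in C. This is a sketch of the common UCRT pattern, assuming the documented FILE* __acrt_iob_func(unsigned) accessor declared by MSVC’s stdio.h, where index 1 maps to stdout (MSVC/UCRT-specific):

#include <stdarg.h>
#include <stdio.h>

int printf_like(const char *fmt, ...)
{
    va_list args;
    int ret;

    va_start(args, fmt);
    ret = vfprintf(__acrt_iob_func(1), fmt, args);  /* stream index 1 = stdout */
    va_end(args);
    return ret;
}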

The rest of the code for this main function is just clean-up and exit. So what do we know about this application? If we were unable to run it, we could only guess based on the flow of the program given the disassembly. However, for this example, we have the ability to execute it. Let’s do that and see what we get initially.

Running Targets

Being able to run a target application prior to analysis provides a lot of information or clues you can use to your advantage when reversing it. In this breakdown, I wanted to save it until the end to make sure you were able to make the connections to the disassembly and program overall. Normally, if available, I would run the target and look for information that could help identify certain constructs like menu items, login form labels, and so on.

Upon running the program, it waits for keyboard input. If you recall, there was a call to some function earlier in the disassembly, call sub_14000113C, that used an unknown object as well – unk_140028000. If I type some random characters and press enter, the application exits.

Hm, so what are some candidate functions in C that take user input? The first one that comes to mind is scanf. If we look at where this function is called it makes sense.

call    _unk_crt
mov     [rbp+40h], rax
mov     rax, [rbp+40h]
mov     [rbp+unk_crt_ret_2], rax
lea     rax, unk_140028000
mov     rdx, [rbp+unk_crt_ret_2]
mov     rcx, rax
call    sub_14000113C

We see that sub_14000113C takes the unknown object and the malloc-allocated buffer: sub_14000113C(unk_140028000, unk_crt_ret_2). Since this application asks for a login name, we know that if this is indeed scanf, then unk_crt_ret_2 is the allocated storage where the user input is stored. That also means unk_140028000 is the format string – likely %s, to specify string format. It fits the purpose of this application, so we’re going to go with it. Here is our first-pass pseudocode implementation of our main function:

int __cdecl main(int argc, char** argv)
{
    u32 VolumeSerialNumber = 0;
    u32 VolumeNameSize = 0x105;
    u32 MaximumComponentLength = 0;
    u32 FileSystemFlags = 0;
    
    // Allocate buffers for two objects.
    //
    char* VolumeNameBuffer = (char*)malloc(VolumeNameSize);
    char* user_name = (char*)malloc(VolumeNameSize);
    
    // Create function pointer prototype and bind it.
    // 
    typedef int (__stdcall *gviw_t)( const char*, char*, u32, u32*, u32*, u32*, char*, u32 );
    gviw_t gviw = (gviw_t)GetVolumeInformationW;
    
    // Call GetVolumeInformationW indirectly.
    //
    gviw(0, 
        VolumeNameBuffer, 
        VolumeNameSize, 
        &VolumeSerialNumber, 
        &MaximumComponentLength, 
        &FileSystemFlags, 
        0, 
        VolumeNameSize - 1);
    
    // Check if serial is valid? Make sure serial is not null?
    //
    if( sub_14000136C(VolumeSerialNumber) ) 
    { 
        // Compare some value based on user_name against serial number? I
        // did not get a print when entering random characters so we know
        // it's doing something else with these inputs.
        //
        if( sub_1400014A4(user_name) == sub_14000136C(VolumeSerialNumber) )
        {
            printf(unk1);
        }
    }
    
    return 0;
}

Reversing Challenge

If you wrote your own pseudocode implementation, compare it against the one provided above and see how similar it looks. What did you miss? What could be done better? Did you make any assumptions that misled you? If so, what were they?

Deeper Investigation

Unfortunately, we’re not done yet. We still have to investigate what those functions in the conditional statements do. We have some ideas based on the behavior when providing random input, but we can’t be certain. That being said, if we were looking to bypass this sort of authentication protocol, we could modify some of the conditionals through byte patching and bypass the nested comparison to reach the printf. Yes, we could do all of this without continuing our reversal, but we’re not going to – this is not meant to be a crackme on its own, but a way of teaching assembly through a “real” investigation. We’re going to start with analyzing sub_14000136C and then move to sub_1400014A4. These functions are much more confusing than the first one, given the lack of optimization and the pollution of useless operations, but as always we’ll walk through them to solidify the concepts you’ve learned so far.

# sub_14000136C

The disassembly you’re about to see is a jumbled nightmare, and we’re going to encounter some new instructions and learn how to simplify this unoptimized disassembly. Take a break if necessary because this section is going to be a long one.

                push    rbp
                sub     rsp, 150h
                lea     rbp, [rsp+20h]
                mov     [rbp+128h], rdi
                mov     [rbp+120h], rsi
                mov     [rbp+140h], ecx
                lea     rax, [rbp+0]
                mov     edx, 0
                mov     ecx, 104h
                mov     rdi, rax
                mov     eax, edx
                and     eax, 0FFFFh
                mov     ah, al
                mov     edx, eax
                shl     eax, 10h
                or      eax, edx
                mov     esi, ecx
                shr     rcx, 2
                rep stosd
                mov     ecx, esi
                and     ecx, 3
                rep stosb
                mov     byte ptr [rbp+104h], 0

loc_1400013F0:
                movzx   eax, byte ptr [rbp+104h]
                movzx   eax, al
                cmp     eax, 4
                jl      short loc_140001418
                jmp     loc_1400014B2

loc_140001404:
                movzx   eax, byte ptr [rbp+104h]
                movzx   eax, al
                inc     eax
                mov     [rbp+104h], al
                jmp     short loc_1400013F0

loc_140001418:
                mov     eax, [rbp+140h]
                lea     rdx, [rbp+0]
                mov     ecx, 0Ah
                mov     [rbp+118h], ecx
                mov     ecx, eax
                mov     eax, [rbp+118h]
                mov     r8d, eax
                call    _itoa
                mov     [rbp+110h], rax
                mov     rax, [rbp+110h]
                mov     rcx, rax
                call    hash
                mov     [rbp+108h], eax
                movzx   eax, byte ptr [rbp+104h]
                movzx   eax, al
                imul    rax, 8
                lea     rdx, dword_140024000
                add     rdx, 4
                add     rdx, rax
                mov     eax, [rdx]
                mov     edx, [rbp+108h]
                cmp     eax, edx
                jnz     short loc_140001404
                movzx   eax, byte ptr [rbp+104h]
                movzx   eax, al
                imul    rax, 8
                lea     rdx, dword_140024000
                add     rdx, rax
                mov     eax, [rdx]
                mov     rsi, [rbp+120h]
                mov     rdi, [rbp+128h]
                lea     rsp, [rbp+130h]
                pop     rbp
                retn

loc_1400014B2:
                mov     eax, 0
                mov     rsi, [rbp+120h]
                mov     rdi, [rbp+128h]
                lea     rsp, [rbp+130h]
                pop     rbp
                retn

This is quite painful to see at first glance. But we’re going to do it as we always have, byte-sized pieces. Really bad joke. My sanity is waning – I’m sure yours is too.

Let’s take a moment to quickly scan the disassembly and look for any hints as to the local variables used. The only immediate hint I see is this instruction: mov byte ptr [rbp+104h], 0. That’s a byte-sized local at [rbp+104h]. We don’t know what it’s used for yet, though. Now we have to do the dirty work and see what the function prologue is doing, quickly.

push    rbp
sub     rsp, 150h
lea     rbp, [rsp+20h]
mov     [rbp+128h], rdi
mov     [rbp+120h], rsi
mov     [rbp+140h], ecx
lea     rax, [rbp+0]

The function saves the old frame pointer, allocates 150h (336) bytes of space for our function, and creates a new stack frame based at [rsp+20h]. It stores rdi, rsi, and ecx into the spill space, then loads the address of stack location [rbp+0] into rax. This is a local variable of some sort, and the fact that its address is being taken is a hint that it’s some sort of data structure – most likely an array. The next section of assembly code is very different and uses a lot of bitwise instructions for some macro-operation. We’re going to encounter some new instructions here as well.
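Given the address-of pattern here and the byte-sized local we spotted a moment ago, one plausible layout for these locals is the following – the names and exact sizes are my own inference, not something recovered from the binary:

char          buffer[0x104];   /* [rbp+0]    - the location whose address is taken */
unsigned char byte_local;      /* [rbp+104h] - the byte-sized local noted above    */

Keep that hypothesis in mind while we walk the bitwise sequence.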

lea     rax, [rbp+0]
mov     edx, 0
mov     ecx, 104h
mov     rdi, rax
mov     eax, edx
and     eax, 0FFFFh
mov     ah, al
mov     edx, eax
shl     eax, 10h
or      eax, edx
mov     esi, ecx
shr     rcx, 2
rep stosd
mov     ecx, esi
and     ecx, 3
rep stosb
mov     byte ptr [rbp+104h], 0

The first few instructions following our local address load zero edx, copy 104h (260) into ecx, load the address in rax into rdi, and then move edx into eax. That’s a lot, so let’s log register states by hand.

rax = address_of([rbp+0])
edx = 0
ecx = 104h
rdi = rax
eax = edx

----- SIMPLIFY -----

edx = 0
ecx = 260
rdi = address_of([rbp+0])
eax = 0

We took the initial states and then simplified them so that they’re less confusing and require fewer lookups on our part. The next instruction is the and instruction. Since most readers have worked in C/C++ I expect familiarity with bitwise operations like and, or, xor, and so on. The instruction and eax, 0FFFFh performs a bitwise and on eax with the second operand. You might write this in a high-level language as eax & 0FFFFh. This may appear to be a pointless operation since the value of eax is 0 and the result will be 0; however, the and instruction affects a few status flags. The OF and CF status flags are cleared, and the SF, ZF, and PF status flags are modified according to the result of the operation. We’ll see why this happened further down.

The next instruction is mov ah, al. This is a pointless instruction inserted as the result of no optimization applied to the program – it’s just wasted cycles. As is the instruction after, mov edx, eax. The register edx is already 0. The instruction shl is a bitwise left-shift; in this case it doesn’t have any effect because 0 << 10h is still 0, but normally it takes the value in the first operand and shifts its bits to the left by the count in the second operand (10h here). I’m sure you can guess what the or instruction does, and it’s also just pollution in this case. We arrive at mov esi, ecx which stores the value 104h into esi. Our register states have now changed.

edx = 0 
ecx = 260 
rdi = address_of([rbp+0]) 
eax = 0
esi = 260

We then see shr rcx, 2 where the value of rcx is 260. This is an operation worth noting! Every bit shift to the right effectively divides the value by 2, so shifting by N divides it by 2 N times. As an example, shr rcx, 1 takes the value in rcx and shifts all bits one position to the right, which is equivalent to a divide by 2. The result would be 130 and stored in rcx. In this way, shr rcx, 2 is the same as taking the first operand and dividing it by 4. 260 divided by 4 is 65, so the value of rcx becomes 65 after this instruction executes. The opposite arithmetic operation applies to shl – multiply.

shr rcx, 1 => rcx / 2
shl rcx, 1 => rcx * 2
shr rcx, 4 => loop_4_Times:[rcx / 2]
shr rcx, N => loop_N_Times:[rcx / 2]
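
In C terms the same equivalences look like the quick sketch below. These hold exactly for unsigned values; for signed values the compiler emits sar (the arithmetic right shift) instead of shr:

unsigned int x = 260;
unsigned int q = x >> 2;    // 65  – same as x / 4, like shr rcx, 2
unsigned int h = x >> 1;    // 130 – same as x / 2, like shr rcx, 1
unsigned int m = x << 1;    // 520 – same as x * 2, like shl rcx, 1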

Using Bitwise Operations

Bitwise instructions are used by compilers to do many types of things. You might see them in place of standard arithmetic operations like multiply, divide, add, or subtract. This is because bitwise instructions have lower latency than the equivalent arithmetic instructions. There are other instructions on newer architectures that have faster execution times, but they’re primarily used when compiler optimizations are at their highest setting.

Observing our disassembly chunk again, let’s take a look at the instruction after the shift right: rep stosd.

lea rax, [rbp+0] 
mov edx, 0 
mov ecx, 104h 
mov rdi, rax 
mov eax, edx 
and eax, 0FFFFh 
mov ah, al 
mov edx, eax 
shl eax, 10h 
or eax, edx 
mov esi, ecx 
shr rcx, 2 
rep stosd 
mov ecx, esi 
and ecx, 3 
rep stosb 
mov byte ptr [rbp+104h], 0

This instruction is composed of two parts, as many x86 instructions can be. The first part is the instruction prefix: rep. The REP prefix means to repeat the string operation of the instruction it prefixes. It’s typically used in string operations, but you may also see it in memory copy operations. In this function, the whole instruction is rep stosd. This encoding instructs the processor to repeat a dword-sized store of the value in eax to the address in edi, ecx times. The size of the operation is indicated by the d appended to the stos mnemonic.

How’s that for a confusing mess? Let me put it in high-level terms, although this isn’t exactly what it would look like in translation.

rep_stosd()
{
    while( ecx > 0 )
    {
        *(dword*)edi = eax;
        edi += sizeof(dword);
        ecx--;
    }
}

Think of it as if the processor is looping ecx times, storing the value of eax at the address in edi on each iteration. The destination pointer advances by the size of a doubleword per store because the store operation is 32 bits wide, and ecx counts down by one each time. Hopefully, the instruction isn’t too confusing now. We just need to determine what these register states are at the time of execution to discover what it’s operating on. Let’s repost our register states:

edx = 0 
ecx = 65
rdi = address_of([rbp+0]) 
eax = 0 
esi = 260

Note that ecx changed as a result of the bitwise right shift that occurred. Knowing that rep stosd writes through rdi, we’ll define [rbp+0] as a data structure of size ecx * 4. I inferred the size because rep stosd loops ecx times and writes a doubleword (eax) into each doubleword-sized slot of the destination. This is an unoptimized form of memset, and is typically seen when a string or data structure is set to 0 using an initializer-list. This particular data structure is 260 bytes in length, so the code divided the length by 4 so that it can be zeroed in 32-bit chunks. Neat!
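
To make the whole sequence concrete, here’s a minimal C sketch of what the compiler emitted – the helper name is mine, not from the target program. It mirrors the shr rcx, 2 / rep stosd and the and ecx, 3 / rep stosb pair: zero in 4-byte chunks, then mop up any remainder one byte at a time:

void zero_buffer(unsigned char *dst, unsigned long long len)
{
    unsigned long long dwords = len >> 2;   // shr rcx, 2
    unsigned long long bytes  = len & 3;    // and ecx, 3

    while (dwords--)                        // rep stosd
    {
        *(unsigned int *)dst = 0;
        dst += 4;
    }

    while (bytes--)                         // rep stosb
        *dst++ = 0;
}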

Register Names

Ever wondered why it chose EDI over EBX? EDI is formally known as the Extended Destination Index. Where is our instruction write destination? EDI. Likewise, with ECX, this is also known as the Extended Counter Register. It’s typically used in looping/repeating instructions to keep track of iterations. The more you know!

We’ve determined that [rbp+0] is a data structure allocated on the stack with a size of 260 bytes. The next three instructions serve little purpose here: esi is 260, which AND’d with 3 yields 0, so ecx is now 0 and rep stosb has nothing to do (260 is evenly divisible by 4, leaving no remainder bytes). The termination condition (like when a loop completes) for the rep stosd/rep stosb instructions is rcx/ecx = 0. We can step over those and get our pseudocode started.

u32 __fastcall sub_14000136C(u32 VolumeSerialNumber)
{
    char buffer[260] = { 0 };
    
    return -1;
}

It looked like a lot, but once we acknowledged which instructions were wasted cycles and junk it wasn’t so bad! Moving on to the next section of assembly, there’s some behavior here you should recognize.

                mov     byte ptr [rbp+104h], 0

loc_1400013F0:								; ................. here
                movzx   eax, byte ptr [rbp+104h]
                movzx   eax, al
                cmp     eax, 4
                jl      short loc_140001418
                jmp     loc_1400014B2

loc_140001404:
                movzx   eax, byte ptr [rbp+104h]
                movzx   eax, al
                inc     eax
                mov     [rbp+104h], al
                jmp     short loc_1400013F0				; <---- loops back to ^

At first glance, we know there’s a loop because of the unconditional jump back to loc_1400013F0. But this code is a little messy with all the potential branches so let’s rename the ones we know of right now.

                mov     byte ptr [rbp+104h], 0

outer_loop:								; .................. here
                movzx   eax, byte ptr [rbp+104h]
                movzx   eax, al
                cmp     eax, 4
                jl      short loc_140001418
                jmp     loc_1400014B2

loc_140001404:
                movzx   eax, byte ptr [rbp+104h]
                movzx   eax, al
                inc     eax
                mov     [rbp+104h], al
                jmp     short outer_loop				; <---- loops back to ^

A little bit easier to follow. This is a pretty standard loop setup in assembly. The first instruction initializes a counter to 0 using mov byte ptr [rbp+104h], 0. Then we see it take that local variable, store it in eax, and compare it against 4. It will branch to loc_140001418 if eax is less than 4; otherwise, it will jump to loc_1400014B2. The jl instruction is jump if less than. Alright, so we know that [rbp+104h] is a local variable that’s a byte in width and this loop will execute 4 times. Before we analyze the jump targets let’s add it to our pseudocode.

u32 __fastcall sub_14000136C(u32 VolumeSerialNumber)
{
    char buffer[260] = { 0 };
    u8 counter = 0;
    
    while(counter < 4)
    {
        // do something
        counter++;
    }
    
    return -1;
}

Now, look at the jump target loc_140001418:

loc_140001418:
                mov     eax, [rbp+140h]
                lea     rdx, [rbp+buffer]
                mov     ecx, 0Ah
                mov     [rbp+118h], ecx
                mov     ecx, eax
                mov     eax, [rbp+118h]
                mov     r8d, eax
                call    _itoa
                mov     [rbp+110h], rax
                mov     rax, [rbp+110h]
                mov     rcx, rax
                call    sub_1400014A4
                mov     [rbp+108h], eax
                movzx   eax, byte ptr [rbp+counter]
                movzx   eax, al
                imul    rax, 8
                lea     rdx, dword_140024000
                add     rdx, 4
                add     rdx, rax
                mov     eax, [rdx]
                mov     edx, [rbp+108h]
                cmp     eax, edx
                jnz     short loc_140001404
                movzx   eax, byte ptr [rbp+counter]
                movzx   eax, al
                imul    rax, 8
                lea     rdx, dword_140024000
                add     rdx, rax
                mov     eax, [rdx]
                mov     rsi, [rbp+120h]
                mov     rdi, [rbp+128h]
                lea     rsp, [rbp+130h]
                pop     rbp
                retn

I’ve renamed some of the offsets as demonstrated before. We see a reference to a register that was preserved in the shadow store at [rbp+140h], which just so happens to be ecx – the function’s argument (per the calling convention). The argument is the VolumeSerialNumber that was acquired from our call to GetVolumeInformationW in the previous function. We know this from looking at the pseudocode we generated for the first function. The next instruction loads the base address of our local data structure into rdx. The value 0Ah (10) is then copied into ecx and then ecx into [rbp+118h]. We’re beginning to see a lot of temporary storage use, so we need to just skim and see where the final location is. It looks like 10, buffer, and our serial number are arguments to the _itoa function. The function _itoa is used to convert an integer to a string – if you don’t know the details of the arguments check the manual pages. The result of this function is then loaded into some temporary storage, then copied to rcx prior to a call to sub_1400014A4. So we know this function is likely used in a manner similar to this:

sub_1400014A4( _itoa( VolumeSerialNumber, buffer, 10 ) );

Scan the rest of the excerpt and you’ll notice a branching instruction – this means that there is another conditional statement involved. Let’s continue from the line designated below…

loc_140001418:
                mov     eax, [rbp+140h]
                lea     rdx, [rbp+buffer]
                mov     ecx, 0Ah
                mov     [rbp+118h], ecx
                mov     ecx, eax
                mov     eax, [rbp+118h]
                mov     r8d, eax
                call    _itoa
                mov     [rbp+110h], rax
                mov     rax, [rbp+110h]
                mov     rcx, rax
                call    sub_1400014A4
                mov     [rbp+108h], eax	; <------ continue from here
                movzx   eax, byte ptr [rbp+counter]
                movzx   eax, al
                imul    rax, 8
                lea     rdx, dword_140024000
                add     rdx, 4
                add     rdx, rax
                mov     eax, [rdx]
                mov     edx, [rbp+108h]
                cmp     eax, edx
                jnz     short loc_140001404
                movzx   eax, byte ptr [rbp+counter]
                movzx   eax, al
                imul    rax, 8
                lea     rdx, dword_140024000
                add     rdx, rax
                mov     eax, [rdx]
                mov     rsi, [rbp+120h]
                mov     rdi, [rbp+128h]
                lea     rsp, [rbp+130h]
                pop     rbp
                retn

The return value of sub_1400014A4 is stored on the stack, and we can see it’s used later down the line. We copy the counter to the eax register and zero the remaining bits. The following instruction movzx eax, al does nothing for us – ignore it. This is where things get a little dicey.

We multiply the value in rax by 8 and store the result in rax. This is called scaling: taking an index (the counter) and multiplying it by the size of an element – here, 8 bytes, the same width as a 64-bit address. You will commonly see this type of scaling used when accessing an array or similar data structure. Here’s an example to help:

uint64_t a[2] = { 0 };

//
// To access the second element of the array we can take the
// base address plus index * size of an element. (Cast to a byte
// pointer first so the pointer arithmetic isn't scaled twice.)
//
*(uint64_t*)((unsigned char*)a + (1 * sizeof(uint64_t))) = 10;

//
// The above is the same as doing this.
//
a[1] = 10;

Array Access in Assembly

[Diagram: an array laid out in memory, byte addresses 00 through 15]

Pretend the array is based from 0. If we wanted to access the first 8-byte element (an unsigned 64-bit integer) then we would just read from the base to 8-bytes ahead, so from address 00 to 07. Conversely, to read the second 8-byte element we’d need to offset 8 bytes from the base (to get the start of the second element) and read to address 15. The scaling is done because we have to offset the correct number of bytes to get the desired element. If you think about the math it makes sense.

a = base of array = 00 (like diagram)

//
// Access first element
//
*(u64*)(a + (0 * sizeof(u64))) => *(u64*)(a + 0) => read 64-bit integer @ base of 'a' (address 00)

//
// Access second element
//
*(u64*)(a + (1 * sizeof(u64))) => *(u64*)(a + 8) => read 64-bit integer @ base of 'a' + 8 (address 08)

This is how indexing into arrays works, but the array access in the code is a little more complicated. We’re going to have to understand how to index into a structure.

Structure Accesses in Assembly

The reason familiarity with a language like C or C++ is important is that sometimes you’ll run into assembly that accesses a data structure that isn’t an array, and its layout isn’t immediately obvious. Take the below structure for example:

struct _s
{
    u32 first;
    u32 second;
};

When a structure like this is allocated, whether on the stack or the heap, accessing it may not be intuitive. For the above example, it’s similar to the array accesses. Take note of a few things, however. This structure is not 32 bits in size; it is 64 bits, because it contains two 32-bit integers. So how would we go about accessing the first or second members of this structure _s? Let’s take a small program to help us.

struct _s temp;

temp.first = 0;
temp.second = 1;

printf( "%d %d\n", temp.first, temp.second ); // 0 1

In order to initialize the first member of the temp structure, we’d need the base of it. Once we have the base it’s very much like an array where the second member would be at the base address + sizeof(first member). For the _s structure both members are 32-bit integers so the offset would be 4. The instructions to initialize these two would look similar to this.

lea rdx, [temp]
mov dword ptr [rdx+0], 0
mov dword ptr [rdx+4], 1
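
If you ever want to sanity-check offsets like these, offsetof from <stddef.h> confirms the layout – a small standalone aside, not code from the target program:

#include <stddef.h>
#include <stdio.h>

typedef unsigned int u32;

struct _s
{
    u32 first;      // at offset 0
    u32 second;     // at offset 4
};

int main(void)
{
    // Prints "0 4" – the same +0 and +4 the mov instructions above use.
    printf("%zu %zu\n", offsetof(struct _s, first), offsetof(struct _s, second));
    return 0;
}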

Different Types

We know that structures allow the storage of different types in a packed data structure, so you can guess that accessing a member of a different type may require a different offset. This is not always true thanks to memory alignment requirements and the compiler. If you want to learn more about structure padding and how it affects memory accesses be sure to check the recommended reading.

Knowing how structures are accessed is sufficient for this example, but there’s something off about the assembly we’re looking at. I’ll bring it back into view.

movzx   eax, byte ptr [rbp+counter]
movzx   eax, al
imul    rax, 8
lea     rdx, dword_140024000
add     rdx, 4
add     rdx, rax
mov     eax, [rdx]
mov     edx, [rbp+108h]
cmp     eax, edx
jnz     short loc_140001404

You might’ve noticed the lea rdx, dword_140024000 instruction. This is quite confusing since our counter is currently 0: we’re multiplying the counter value by 8, loading rdx with the base of some data structure, adding 4 to the base, and then adding the scaled index in rax (0 on this iteration). When you encounter sequences like this, writing it out in generic terms helps. Let’s do that.

eax = 0
rax * 8 = 0
rdx = 140024000
<add rdx, 4>
rdx = 140024004
<add rdx, 0>
rdx = 140024004
<mov eax, [rdx]>
eax = *(u32*)140024004
edx = [rbp+108h]

RECALL: rbp+108h is the return value of sub_1400014A4

To me, it looks like a structure access followed by comparing one of the members to the return value of sub_1400014A4. We can observe a pattern similar to our structure access example here:

lea rdx, dword_140024000 
add rdx, 4

Then what is the scaling of the counter with 8 for? Great question. This is because this data structure is actually an array of structures! Something that you’ll see quite often in the wild. If you’re wondering what I mean by an array of structures picture the earlier example but as an array of _s structs. You’d recognize it in C – check it out.

static struct _s sarr[ N ] = {
    { 0, 1 },
    { 2, 3 },
    { 4, 5 },
    ...etc...
    { X, X+1},
};

The elements of this array are _s structures, initialized inside the static array using { }. This relates to our disassembly because these structures are 8 bytes in size, so our array of structures is operated on in memory as if it were an array of 64-bit integers. Remember: to get the next element of an array whose elements are 8 bytes in size, you have to add 8 * the index – just like in our target function:

movzx eax, byte ptr [rbp+counter] 
movzx eax, al 
imul rax, 8

On the first iteration, this scale value is 0 meaning that it will read from the first element in the array of structures! We load the base of the data structure into rdx:

lea     rdx, dword_140024000
add     rdx, 4
add     rdx, rax

Add 4 to rdx, which is the offset into the structure for the second member, and then add the scale value to rdx. Realistically these two add operations could be swapped and it would be more intuitive, but assembly isn’t always intuitive. If it helps, I put together a diagram to represent this array of structures. We know there are 4 structures in this array based on the condition of our function loop, that the size of these structures is 8 bytes, and, judging by the +4 index into each one, that the members are likely 32-bit integers. Study the illustration below and try to connect the dots.

The instructions on the left represent the instructions executed in each loop. I added comments that represent the state of the register in that specific loop iteration. For the first loop, we see the code attempt to scale, but 0 * 8 is 0, so add rdx, rax doesn’t modify the address accessed by the next instruction. The 8-byte scale is to index into the array of structures, since each structure contains two 4-byte integers. If we read from rdx on the first iteration, it yields the first structure in the array. Adding 4 to the address gives us the second member of the first structure. I’ll lay out the structure of the array based on how these accesses are occurring.

typedef struct _unk_struct
{
    u32 a;
    u32 b;
} unk_struct;

static unk_struct unk_array[ 4 ] = 
{
    { 0, 0 },
    { 0, 0 },
    { 0, 0 },
    { 0, 0 }
};
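
Assuming the layout above (unk_array is my stand-in name for whatever really lives at dword_140024000), the two loads we’re dissecting reduce to ordinary member accesses:

u32 get_b(const unk_struct *arr, u8 counter)
{
    // lea rdx, dword_140024000 / add rdx, 4 / add rdx, rax / mov eax, [rdx]
    return arr[counter].b;  // base + counter * 8 + 4
}

u32 get_a(const unk_struct *arr, u8 counter)
{
    // lea rdx, dword_140024000 / add rdx, rax / mov eax, [rdx]
    return arr[counter].a;  // base + counter * 8 + 0
}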

The disassembler labels it as an array of doublewords – hence dword_140024000 – because the accesses it observed are dword-sized. However, the code loads the 64-bit address of this data structure into rdx, so we assume it’s an array, and then use context clues from the indexing method to realize it’s an array of 8-byte structures. Now that we’ve covered all of that we just need to see what’s compared, check the jump target, and construct our pseudocode.

call    sub_1400014A4
mov     [rbp+ret_of_14a4], eax
movzx   eax, byte ptr [rbp+counter]
movzx   eax, al
imul    rax, 8
lea     rdx, dword_140024000
add     rdx, 4
add     rdx, rax
mov     eax, [rdx]
mov     edx, [rbp+ret_of_14a4]
cmp     eax, edx
jnz     short loc_140001404

So the second member in the specific structure in our array of structures is loaded into eax and compared against the return of sub_1400014A4. We see that it then jumps to loc_140001404 if the values are not equal. The target, loc_140001404, is presented below.

loc_140001404:
                movzx   eax, byte ptr [rbp+counter]
                movzx   eax, al
                inc     eax
                mov     [rbp+counter], al
                jmp     short outer_loop

Ahh, hey wait a second! We’ve seen this before – way back at the beginning of the analysis of this function. This block simply copies the counter value into eax, increments it, stores it back into the local variable, and jumps back to outer_loop. We knew there were nested conditions in the loop, but now we know the logic in them is pretty simple. This is an unoptimized look at a for-loop where the counter increment is put in its own jump block away from the rest of the looping code. Let’s bring the whole disassembly back into view and add our new pseudo-C.

                push    rbp
                sub     rsp, 150h
                lea     rbp, [rsp+20h]
                mov     [rbp+128h], rdi
                mov     [rbp+120h], rsi
                mov     [rbp+VolumeSerialNumber], ecx
                lea     rax, [rbp+buffer]
                mov     edx, 0
                mov     ecx, 104h
                mov     rdi, rax
                mov     eax, edx
                and     eax, 0FFFFh
                mov     ah, al
                mov     edx, eax
                shl     eax, 10h
                or      eax, edx
                mov     esi, ecx
                shr     rcx, 2
                rep stosd           ; zero our local buffer
                mov     ecx, esi
                and     ecx, 3
                rep stosb           ; ignore
                mov     byte ptr [rbp+counter], 0   ; zero our loop counter

outer_loop:
                movzx   eax, byte ptr [rbp+counter]
                movzx   eax, al
                cmp     eax, 4                      ; compare counter to 4
                jl      short loc_140001418         ; if (counter < 4) ? loc_140001418 : loc_1400014B2
                jmp     loc_1400014B2

loc_140001404:
                movzx   eax, byte ptr [rbp+counter]
                movzx   eax, al
                inc     eax                         ; counter++
                mov     [rbp+counter], al
                jmp     short outer_loop            ; for(...)

loc_140001418:
                mov     eax, [rbp+VolumeSerialNumber]
                lea     rdx, [rbp+buffer]
                mov     ecx, 0Ah
                mov     [rbp+118h], ecx
                mov     ecx, eax
                mov     eax, [rbp+118h]
                mov     r8d, eax
                call    _itoa                       ; _itoa(VolumeSerialNumber, buffer, 10)
                mov     [rbp+110h], rax
                mov     rax, [rbp+110h]
                mov     rcx, rax
                call    sub_1400014A4               ; sub_1400014A4( _itoa(VolumeSerialNumber, buffer, 10) )
                mov     [rbp+ret_of_14A4], eax      ; ret_of_14A4 = ^^
                movzx   eax, byte ptr [rbp+counter]
                movzx   eax, al
                imul    rax, 8
                lea     rdx, dword_140024000
                add     rdx, 4
                add     rdx, rax
                mov     eax, [rdx]                  ; eax = *(u32*)( (u64)&dword_140024000 + counter * 8 + 4 )
                                                    ; ----- SIMPLIFIED -----
                                                    ; eax = dword_140024000[ counter ].u32b
                mov     edx, [rbp+ret_of_14A4]
                cmp     eax, edx                    ; if( eax != ret_of_14A4 ) ? loc_140001404 : next_instr
                jnz     short loc_140001404
                movzx   eax, byte ptr [rbp+counter]
                movzx   eax, al
                imul    rax, 8
                lea     rdx, dword_140024000        ; dword_140024000[counter]
                add     rdx, rax                    ; no +4 this time – member at offset 0
                mov     eax, [rdx]                  ; eax = dword_140024000[counter].u32a
                mov     rsi, [rbp+120h]
                mov     rdi, [rbp+128h]
                lea     rsp, [rbp+130h]
                pop     rbp
                retn                                ; return dword_140024000[counter].u32a (eax)

loc_1400014B2:
                mov     eax, 0
                mov     rsi, [rbp+120h]
                mov     rdi, [rbp+128h]
                lea     rsp, [rbp+130h]
                pop     rbp
                retn                                ; return 0

And let’s add the pseudo-C we believe to be used in this function. I provided comments so you won’t have to scroll up and recall information from the explanation.

u32 __fastcall sub_14000136C(u32 VolumeSerialNumber)
{
    char buffer[260] = { 0 };
    
    for(u8 counter = 0; counter < 4; counter++)
    {
        if(dword_140024000[ counter ].u32b == sub_1400014A4(_itoa(VolumeSerialNumber, buffer, 10)))
            return dword_140024000[ counter ].u32a;
    }
    
    return 0;
}

Assembly Complexity

Complex or confusing assembly will not always come from the most extravagant functions, as seen above. The important takeaway is that the amount of indirection used in the above code greatly impacted the assembly we had to analyze. This will be true for many targets you’ll encounter in the real world. That’s why it’s so important to be able to do the hard parts by hand, and understand where the result came from. If all of this made sense to you, that’s awesome! If not, don’t feel down – it takes time to recognize these things. You will get it.

In order to determine what this function is comparing, you’re gonna have to analyze the last function. There’s good news though! It’s not nearly as complicated as the last two we’ve looked at. If you noticed, I said you will have to analyze it to uncover the functionality. Remember, this isn’t about cracking the program or authorization protocol, just documenting it. The breaking of it comes later.

Final Challenge

From the start of this series to the end of this section we’ve covered an enormous amount of information. It’s time for you to try to float in the deep end; lucky for you, the last function isn’t super complicated. Use the resources you have at your disposal to determine what’s going on – Google, references, anything. Once you document the functionality of this function you’ll be able to reconstruct the program and understand how it’s validating the user’s input. The actual source code is provided following the disassembly, but try to complete it without referencing it. Leave your solutions in the comments, and show off how close you got! Good luck!

sub_1400014A4

                push    rbp
                sub     rsp, 30h
                lea     rbp, [rsp+20h]
                mov     [rbp+20h], rcx
                mov     dword ptr [rbp+0], 1505h

loc_1400014E5:
                mov     rax, [rbp+20h]
                movzx   eax, byte ptr [rax]
                movzx   eax, al
                mov     [rbp+4], eax
                mov     eax, [rbp+4]
                mov     [rbp+8], eax
                mov     eax, [rbp+4]
                mov     [rbp+0Ch], eax
                mov     eax, 1
                add     rax, [rbp+20h]
                mov     [rbp+20h], rax
                mov     eax, [rbp+0Ch]
                test    eax, eax
                jz      short loc_140001523
                mov     eax, [rbp+0]
                shl     eax, 5
                add     eax, [rbp+0]
                add     eax, [rbp+8]
                mov     [rbp+0], eax
                jmp     short loc_1400014E5

loc_140001523:
                mov     eax, [rbp+0]
                lea     rsp, [rbp+10h]
                pop     rbp
                retn
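
Once you’ve made an honest attempt, here’s a hint for checking your work: the seed 1505h (5381) combined with the shl eax, 5 / add pair (a multiply by 33) is the signature of a very well-known string hash. A minimal C sketch of what this disassembly appears to implement, using the u32/u8 shorthand from earlier (the parameter name is mine):

u32 hash(const char *str)
{
    u32 h = 0x1505;                     // mov dword ptr [rbp+0], 1505h

    while (*str)                        // test eax, eax / jz loc_140001523
    {
        // h = h * 33 + c:  shl eax, 5 / add eax, [rbp+0] / add eax, [rbp+8]
        h = (h << 5) + h + (u8)*str++;
    }

    return h;                           // mov eax, [rbp+0] / retn
}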

Program Source Code | Daax’s Pseudocode

Conclusion

This behemoth of an article concludes the Accelerated Assembly saga of the Applied Reverse Engineering series. We’ve covered a ton of instructions, sequences of instructions, types of accesses and compiler-generated headaches. At this point, if you completed the final challenge with little to no reference to my pseudo or the actual source code, you are well on your way. If you still struggled I commend you for making it to the end of the article and following along. I sincerely hope you learned something about assembly and this didn’t intimidate you or put you off from continuing down this path. It’s not easy and the Accelerated Assembly articles are not meant to be a one-stop-shop but more of a jump in and sink or swim type approach.

Learning assembly is one of the harder tasks when it comes to reverse engineering, but if you can become proficient at reading listings like the ones I gave above you’ll excel quickly in any facet of reverse engineering or vulnerability research. I intended to do more than one example, but as you can see this one got lengthy on even a simple program. The others were a bit more complex and would require a series themselves. I don’t want to get away from the point of these articles too much so I stuck with one example!

The next article will step away from the architecture to an extent as we are introduced to OS constructs that track objects within the system. These objects are processes and threads. It will detail the ins and outs of the structures from how they’re accounted for by the operating system to their related structures that control operation in a normal operating environment. You’ll learn about a few techniques for analyzing thread and process states, and how standard code injection works. It will be MUCH shorter than these last two articles, but still just as in-depth.

I hope you were able to take away something from these assembly posts, and I look forward to providing more content in the next articles!

As always feel free to leave me a comment, question, feedback, or a coffee to keep me awake while writing these. Good luck and all the best to everyone reading! My DMs are open on Twitter as well: @daax_rynd

Recommended Reading


✇Reverse Engineering

Patchguard: Detection of Hypervisor Based Introspection [P1]

By: Nick Peterson

Errata Or Nah?

Over the last 2-3 years, Microsoft has inserted various methods of virtualization introspection detection (big brain words) into the workings of patchguard. It shouldn’t come as a surprise that this has happened, as subverting kernel patch protection is a breeze when the attacker’s code is running at a higher privilege level. While Windows obviously runs just fine under a hypervisor, and has an open paravirtualization interface, patchguard is looking for signs that the VMM is tampering with state that isn’t necessary for a functional virtual machine – for instance, attempting to hook system calls by hiding the true value of the MSRs that control their branch targets, or exploiting nested paging to gain execution at critical control paths.

While patchguard contains more mechanisms to detect these types of introspection than are presented in this post, the author has chosen his favorites because they are of a peculiar nature. It can be an exercise for the reader to find more 😉 It is the intention of this article to aid software interoperability between security, anti-virus, and introspection tools and kernel patch protection.

First on our list is KiErrata704Present. Upon first glance, the naming convention of these functions seems innocent, and to the untrained eye might actually look like it’s legitimately checking for some kind of meme errata. Let’s break this function down:

[Figure: disassembly of KiErrata704Present]

A little background: certain ancient forms of privilege transitioning, like SYSENTER and call gates, allowed the caller to essentially single step over the opcode. This wasn’t quite optimal because the single step #DB would be delivered after the branch is complete. The kernel would then need to keep note of this so it could IRET to the caller, to continue the single step operation after handling the system call. The introduction of SYSCALL/SYSRET addressed this problem with the FMASK MSR. This MSR lets OS developers have finer control over how SYSCALL handles RFLAGS when it’s executed. Any sane OS is going to ensure that IF and TF are masked off with this MSR. In addition, SYSRET was specially crafted so that if it loads an RFLAGS image with TF set, it raises the #DB on the following instruction boundary, as opposed to how IRET applies it to the boundary after its branch target. This allows for a smooth user-mode debugging experience when single stepping over the SYSCALL instruction. Now that we hopefully have a better understanding, we can see that the first thing KiErrata704Present does is save off the FMASK MSR contents and then set the MSR value such that TF will not be modified by the SYSCALL operation.
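
In intrinsic form, that first step looks something like the sketch below (the MSR number is architectural; the helper name is mine):

#include <intrin.h>

#define MSR_FMASK   0xC0000084  // IA32_FMASK
#define RFLAGS_TF   0x100       // trap flag, bit 8 of RFLAGS

//
// Save IA32_FMASK, then clear TF from the mask so SYSCALL no longer
// forces TF off – a single step now survives into the syscall handler.
//
unsigned __int64 ClearTfFromFmask(void)
{
    unsigned __int64 Original = __readmsr(MSR_FMASK);
    __writemsr(MSR_FMASK, Original & ~(unsigned __int64)RFLAGS_TF);
    return Original;    // the caller restores this once the check is done
}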

Next we see a sequence of PUSHFQ/POPFQ setting the trap flag and loading it back into the RFLAGS register. This, as you are likely aware, causes the following instruction to execute with TF set and, on its boundary, fires a #DB. Unless of course the instruction is of the software exception, software interrupt, or privileged software exception class, or the instruction generates a hardware exception.

You probably realize by now that once SYSCALL has finished its execution, a #DB will fire, just as it would if we stepped over any other branch instruction. Thus if the LSTAR target looked like the code sequence below:

0x40000: SWAPGS
0x40001: MOV GS:[0x8], RSP
0x40002: MOV RSP, GS:[0x10]

The #DB handler interrupt stack would contain 0x40000, because that is the syscall operation branch target, which hasn’t executed yet.

As you have probably already realized, patchguard can indirectly discover the true contents of the LSTAR MSR by inspecting the #DB generating IP in its interrupt handler. This serves as a way to discover if a malicious virtual machine might be exiting on RDMSR/WRMSR and giving the OS expected values.

Next up is my personal favorite, KiErrataSkx55Present, as it serves as a throwback to CVE-2018-8897 and was added to patchguard not long after that vulnerability was mitigated. In order to have a solid understanding of how this detection works under the hood, you should read the POP SS/MOV SS vulnerability whitepaper.

If you read the paper, then this almost speaks for itself. Thus, given the example SYSCALL handler above, this #DB will also have 0x40000 on its interrupt stack.

What’s a young hypervisor to do in this situation, since the guest code can now have wisdom beyond RDMSR/WRMSR? Simple really: set the exception bitmap such that we exit on #DB exceptions, and check the guest-state RIP to handle both of the possible instruction-boundary #DBs above. If it does not match, then it would be appropriate to reflect the #DB back to the guest via vectored event injection. It would be wise to check the exit qualification instead of just the TF set in guest state.
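
A rough sketch of that handling is below. All the names here – VmcsRead, VmcsWrite, InjectHardwareException, VECTOR_DB, and the VIRTUAL_CPU fields – are hypothetical, and this is one possible approach rather than a definitive implementation:

VMM_EVENT_STATUS VmmHandleDebugException(PVIRTUAL_CPU Vcpu)
{
    UINT64 GuestRip = VmcsRead(GUEST_RIP);

    //
    // Patchguard single-stepped over SYSCALL and the boundary #DB landed
    // on our hook handler: rewrite RIP so the #DB frame the guest builds
    // shows the LSTAR value it expects to see.
    //
    if (Vcpu->OriginalLSTAR && GuestRip == Vcpu->HookLSTAR) {
        VmcsWrite(GUEST_RIP, Vcpu->OriginalLSTAR);
    }

    //
    // Reflect the #DB back to the guest via vectored event injection.
    // As noted above, check the exit qualification too, not just TF.
    //
    InjectHardwareException(Vcpu, VECTOR_DB);
    return VMM_EVENT_CONTINUE;
}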

Let me tell you a story about a popular anti-virus hypervisor that failed to do this; when it injected the #DB back into the guest at the RIP of its secret syscall handler, the KiDebugTraps mitigation was none the wiser, and this hypervisor made your system vulnerable to CVE-2018-8897 all over again.

Finally, the icing on the cake: a solid check that can only blow your hypervisor up if you’re exiting on #DB exceptions – since, you kinda gotta, amiright? Enter KiErrata361Present.

[Figure: disassembly of KiErrata361Present]

There’s a bit going on here, so let me explain. Under normal circumstances, loading RFLAGS with TF via a POPF variant, followed by an SS load, causes the single step to be seen after the instruction boundary of the instruction following the SS load. The same goes for #DBs that fire from hitting armed debug registers while temporarily blocked by a load of SS. In the case above, an INTn (also known as a software interrupt) or the dedicated INT3 opcode (also known as a software exception) doesn’t care about the previously pending TF-based #DB, and it’s discarded no matter what.

This is the same natural behavior as ICEBP which, albeit undocumented, is the privileged software exception you see in your Intel manuals. In this case, the #DB won’t have DR6.BS set; even though it was pending, it was discarded due to the nature of how these opcodes operate natively. ICEBP actually carries this caveat with it when it induces a #DB VM-exit. Under normal architectural circumstances the BS bit would be set in the pending debug exceptions field in the VMCS, because that is the true state here; however, when the exit is induced by the privileged software exception, the bit is cleared.

As such, the state of the VMCS is not naturally resumable and will cause VMRESUME to fail, causing most hypervisors to shit themselves watery logs on the spot. The architecture requires that if the virtual CPU is in an interrupt shadow such that blocking by MOV SS/POP SS is enabled AND the TF bit is set, a pending BS-based #DB must exist, because there is no other way to acquire this machine state. The fix for this is also relatively simple: check for a privileged software exception on qualifying exits, and if blocking by MOV SS is indicated alongside TF==1, make sure BS is set in the pending debug exceptions field.
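
In VMX terms, the fix might look like this sketch (bit positions per the Intel SDM; VmcsRead/VmcsWrite are hypothetical helpers):

#define PENDING_DBG_BS              (1ULL << 14)    // single-step (BS)
#define INTR_STATE_BLOCKING_MOV_SS  (1ULL << 1)     // blocking by MOV SS
#define RFLAGS_TF_BIT               (1ULL << 8)     // trap flag

//
// On a #DB exit induced by the privileged software exception (ICEBP):
// if the guest sits in a MOV SS/POP SS shadow with TF set, VMRESUME
// demands that BS be set in the pending debug exceptions field.
//
void FixupPendingDebugExceptions(void)
{
    UINT64 IntrState = VmcsRead(GUEST_INTERRUPTIBILITY_STATE);
    UINT64 Rflags    = VmcsRead(GUEST_RFLAGS);

    if ((IntrState & INTR_STATE_BLOCKING_MOV_SS) && (Rflags & RFLAGS_TF_BIT)) {
        UINT64 Pending = VmcsRead(GUEST_PENDING_DEBUG_EXCEPTIONS);
        VmcsWrite(GUEST_PENDING_DEBUG_EXCEPTIONS, Pending | PENDING_DBG_BS);
    }
}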

The idea for KiErrata361Present was actually taken from the CVE-2018-1087 vulnerability, before it was publicly known that privileged software exception was indeed ICEBP, and showed up in patchguard not long after the vulnerability had been mitigated in KVM. The Intel SDM has since been updated to indicate what privileged software exception actually is, but still leaves out this edge case.

If this wasn’t too boring, continue onto Part 2 where we talk about another Patchguard detection and use some critical thinking to come up with our own neat tricks!


✇Reverse Engineering

Patchguard: Detection of Hypervisor Based Introspection [P2]

By: Aidan Khoury

No Errata For U!

If you haven’t already, read Part 1 which outlines three neat tricks used by Patchguard.

KiErrata420Present

The LSTAR MSR can be intercepted using a hypervisor to trap on reads and writes. It is the most common and efficient way to hook syscalls in most modern x86 operating systems. However, contrary to what I’ve read online, this unfortunately comes at the cost of many potential detection vectors for the hypervisor if not properly dealt with. Using a few clever tricks in privileged code, we can reliably determine whether a hook on the LSTAR MSR is present – that is, if proper precautions have not been implemented in the hypervisor. Starting in Windows 10 1903 build 18362, Microsoft added several LSTAR hook detection techniques.

One of the simpler LSTAR hook detections was not given the meme “errata” name – perhaps it was not good enough 🙁 – so let’s call it KiErrata420Present (possibly not that far off from what Microsoft calls it internally?).

I have outlined the detection below:

KiErrata420Present:
        cli                             ; disable interrupts
        mov     r9d, 0C0000082h         ;
        mov     ecx, r9d                ;
        rdmsr                           ; read LSTAR MSR value
        shl     rdx, 32                 ;
        or      rax, rdx                ; store LSTAR value in rax
        lea     rdx, [rdi+87Ah]         ; store temp LSTAR value in rdx read from pg context
        mov     rbx, rax                ; rbx = original LSTAR value
        mov     rax, rdx                ; rax = temp LSTAR value
        shr     rdx, 32                 ;
        wrmsr                           ; write temporary LSTAR MSR value
        mov     r14d, 20000h            ;
        lea     rax, [rdi+87Ch]         ; rax = stub to execute syscall
        mov     rsi, 0A3A03F5891C8B4E8h ; rsi = constant to obfuscate pg context pointer
        test    [rdi+994h], r14d        ; test if should store pg check data?
        jnz     short trigger_syscall   ; if nz, skip tracing

        mov     r8, gs:KPCR.CurrentPrcb ;
        lea     rdx, [rdi+rsi]          ;
        mov     rcx, [rdi+4C0h]         ;
        mov     [rcx], rdx              ;
        mov     rcx, [rdi+4C8h]         ; store pg check related data
        mov     [rcx], r8               ;
        mov     rcx, [rdi+4D0h]         ;
        mov     [rcx], r9               ;
        mov     rcx, [rdi+4D8h]         ;
        mov     qword ptr [rcx], 112h   ;

trigger_syscall:
        call    KeGuardDispatchICall    ; dispatch call to syscall instruction stub

        test    [rdi+994h], r14d        ; test if pg check should be traced?
        jnz     short restore_lstar     ; if nz, skip tracing

        mov     rax, [rdi+4C0h]         ;
        mov     [rax], rsi              ;
        mov     rax, [rdi+4C8h]         ;
        mov     [rax], r13              ; wipe pg check related data
        mov     rax, [rdi+4D0h]         ;
        mov     [rax], r13              ;
        mov     rax, [rdi+4D8h]         ;
        mov     [rax], r13              ;

restore_lstar:
        mov     rdx, rbx                ; restore original LSTAR value
        mov     rax, rbx                ;
        shr     rdx, 32                 ;
        mov     ecx, 0C0000082h         ;
        wrmsr                           ; write original LSTAR MSR value
        sti                             ; reenable interrupts

This check is indeed very simple. It temporarily overwrites the system’s LSTAR MSR value with its own temporary syscall handler, and restores the original LSTAR MSR value afterwards. How do I know this for sure? Let’s dig in further to find out.

First off let’s figure out what temporary value is written to the LSTAR MSR:

lea     rdx, [rdi+87Ah]         ; store temp LSTAR value in rdx read from pg context
mov     rbx, rax                ; rbx = original LSTAR value
mov     rax, rdx                ; rax = temp LSTAR value
shr     rdx, 32                 ;

As we can see, the temporary LSTAR value written is the address at RDI+0x87A. Knowing a little about the patchguard callback, we know that the RDI register holds the temporary address of the current “patchguard context”. Using this knowledge, we can easily determine where context+0x87A is written in the patchguard initialization routine:

mov     byte ptr [r14+87Ah], 0C3h ; store RET instruction

Great, this is the opcode of the return instruction, which is very interesting!

Next, let’s figure out what this call at KeGuardDispatchICall is. As you may already know, KeGuardDispatchICall works by branching to the instruction pointer given in RAX. So let’s check out where RAX comes from:

lea     rax, [rdi+87Ch]         ; rax = stub to execute syscall

Last step, determine where context+0x87C is written to in the patchguard initialization routine:

mov     eax, 050Fh
mov     [r14+87Ch], ax ; store SYSCALL instruction

What’s the meaning of this 050Fh we see? Why, that is the SYSCALL instruction opcode! I think we already know what is happening now. But let’s simplify this a little bit more using some pseudocode:

_disable();
OriginalSyscall64 = __readmsr(MSR_LSTAR);
__writemsr(MSR_LSTAR, &PgContext->DummySyscallHandler); // C3 -> ret
KeGuardDispatchICall(&PgContext->Syscall); // 0F 05 -> syscall
__writemsr(MSR_LSTAR, OriginalSyscall64);
_enable();

Neat! It simply executes the SYSCALL instruction and then immediately returns from the handler.

This is very effective against most hypervisors utilizing LSTAR hooks and is even better at annoying hypervisors that do their best to prevent the guest from tampering with the LSTAR MSR. In many naive LSTAR MSR hook implementations, developers will simply disallow writes to the LSTAR MSR altogether. That in turn causes a fault here, because the register context is not set up for a real system call before patchguard executes its SYSCALL instruction. An example of such an implementation is Hyperbone’s LSTAR MSR hook.

This becomes a frustrating issue for the hypervisor developer. They shouldn’t fret, however, since there is a rather simple solution to this simple problem. The solution is to let the guest overwrite the LSTAR MSR, and effectively shadow the original.

Well then that means the guest can just force us to unhook???

Yes, you’d be right. However, we can restore our hook afterwards in this case and in almost every other case, unless the guest creates their own syscall hook implementation. It is unfortunate for them that on Windows, patchguard has a separate check asserting that the value of the LSTAR MSR has not been tampered with. Therefore, realistically no piece of guest software is going to permanently overwrite your precious LSTAR MSR on Windows unless they have disabled patchguard – which is entirely possible, but also very easy to catch. Besides, these circumstances can all be monitored in the VMM and circumvented as needed.

For this case specifically we can circumvent this detection as such:

VMM_EVENT_STATUS
HVAPI
VmmHandleMsrRead(
    _In_ PVIRTUAL_CPU Vcpu
    )
{
    // ...

    //
    // Hide our LSTAR syscall hook handler address.
    //
    case MSR_LSTAR:
        if (Vcpu->OriginalLSTAR) {
            MsrValue = Vcpu->OriginalLSTAR;
        } else {
            MsrValue = __readmsr(MSR_LSTAR);
        }
        break;

    // ...
}

VMM_EVENT_STATUS
HVAPI
VmmHandleMsrWrite(
    _In_ PVIRTUAL_CPU Vcpu
    )
{
    // ...

    //
    // Let the guest overwrite our hook to avoid possible detection.
    //
    // If and only if the guest is writing the original LSTAR, we replace
    // the MSR value with the hook LSTAR value.
    //
    // N.B. We do this to get around one of PatchGuard's syscall hook
    //      detections which works like this:
    //
    //  _disable();
    //  OriginalSyscall64 = __readmsr(MSR_LSTAR);
    //  __writemsr(MSR_LSTAR, &PgCtx->PgSyscallDummy); // C3 -> ret
    //  KeGuardDispatchICall(&PgCtx->SyscallOpcode1); // 0F 05 -> syscall
    //  __writemsr(MSR_LSTAR, OriginalSyscall64);
    //  _enable();
    //
    case MSR_LSTAR:
        if (MsrValue == Vcpu->OriginalLSTAR) {
            MsrValue = Vcpu->HookLSTAR;
        }
        __writemsr(MSR_LSTAR, MsrValue);
        break;

    // ...
}

We effectively solve this problem completely by shadowing the original LSTAR value on reads and writes to the LSTAR MSR.

KiErrata1337Present

Using a bit of critical thinking, I came up with my own rather deviant LSTAR detection using some tricks I found derived from Patchguard. I call this one KiErrata1337Present, shamelessly derived from Microsoft’s meme “errata” naming scheme for their other cool patchguard checks.

Those who have looked into modern 64-bit system call handlers in Linux and/or Windows may have noticed they start and (sometimes) end with the SWAPGS instruction. The SWAPGS instruction exchanges the current GS base register (IA32_GS_BASE) value with the kernel GS base register value contained in MSR address C0000102H (IA32_KERNEL_GS_BASE).

The instruction immediately following the SWAPGS instruction in the syscall handler is a GS-segmented MOV. Here’s a peek at KiSystemCall64:

KiSystemCall64 proc near
        swapgs                                  ; swap GS base with IA32_KERNEL_GS_BASE
        mov     gs:KPCR.UserRsp, rsp            ; store user mode stack in processor control region
        mov     rsp, gs:KPCR.Prcb.RspBase       ; set the kernel stack from processor control region

Cool – knowing these couple of details, we know we can mess with the GS base to cause a page fault (#PF) inside the syscall handler. Wait, what?

WTF why would you want to purposely page fault???

You’d be sane thinking this. The reason we want to fault inside the syscall handler is so that we can read the REAL RIP of the syscall handler. This is a very important detail! With IA32_KERNEL_GS_BASE zeroed, the SWAPGS at the top of the real handler loads a GS base of zero, so the very next GS-relative MOV dereferences near-null memory and faults.

Alright, let’s try purposely generating a page fault (#PF) then:

KiErrata1337Present:
        swapgs                                  ; swapgs to emulate coming from user mode

        mov     ecx, 0C0000102h                 ;
        xor     eax, eax                        ; set KERNEL_GS_BASE MSR to zero
        xor     edx, edx                        ;
        wrmsr                                   ;

        syscall                                 ; execute the syscall instruction to trigger fault

        ret

Boom! We page faulted – that’s a good thing by the way!

However, there are a couple of problems right off the top: the original page fault handler in Windows is just going to BLOW up and bugcheck, and if we don’t restore the original GS base and kernel GS base values, the operating system is also going to BLOW up on the next context switch. So we need to temporarily hook the interrupt descriptor table (IDT), and back up the GS bases. Too easy!

Steps to temporarily hook the interrupt descriptor table (IDT) are as follows:

  1. Disable interrupts
  2. Save the original IDT
  3. Load our temporary IDT
  4. Do your thang
  5. Restore original IDT
  6. Re-enable interrupts

Here is some pseudo code implementing the above steps with the page fault #PF exception hook we need:

TempIdtr.Limit = sizeof(TempIdt) - 1;
TempIdtr.Base = (UINT64)&TempIdt[0];
for (IdtEntry in KPCR->IdtBase)
    TempIdt[i] = IdtEntry; // Fill in temporary IDT

_disable();             // Disable interrupts
__sidt(&OriginalIdtr);  // Backup original IDT
__lidt(&TempIdtr);      // Load our temporary hook IDT

// Hook page fault handler.
TempIdt[PF] = PageFaultHookHandler;

// Trigger syscall that will purposely page fault!
KiErrata1337Present();  // This must be lean enough not to timeout watchdog!

__lidt(&OriginalIdtr);  // Restore the original IDT.
_enable();              // Re-enable interrupts.

Our page fault handler doesn’t do anything but return from the interrupt for now while we test. We do this using the IRET instruction. Please read up on the IRET instruction if you are not already familiar with it – it’s very important you understand it later on, when we actually build the detection out of all this!

Here is our boring page fault hook handler:

PageFaultHookHandler:
        add     rsp, 8                  ; skip fault code on stack
        iretq                           ; return from interrupt

Now that we have a hook set up on the page fault handler, let’s fix our KiErrata1337Present routine to back up and restore the original GS bases:

KiErrata1337Present:
        mov     ecx, 0C0000101h         ; read original GS_BASE MSR
        rdmsr                           ;
        push    rdx                     ; backup original GS_BASE MSR
        push    rax                     ;
        mov     ecx, 0C0000102h         ; read original KERNEL_GS_BASE MSR
        rdmsr                           ;
        push    rdx                     ; backup original KERNEL_GS_BASE MSR
        push    rax                     ;

        swapgs                          ; swapgs to emulate coming from user mode

        xor     eax, eax                ;
        xor     edx, edx                ; set KERNEL_GS_BASE MSR to zero
        wrmsr                           ;

        syscall                         ; execute syscall instruction which executes swapgs immediately

        mov     ecx, 0C0000102h         ;
        pop     rax                     ;
        pop     rdx                     ; restore original KERNEL_GS_BASE MSR
        wrmsr                           ;
        mov     ecx, 0C0000101h         ;
        pop     rax                     ;
        pop     rdx                     ; restore original GS_BASE MSR
        wrmsr                           ;

        ret                             ; return back to caller

It works! Now for the juicy detection!

Like I mentioned before, the entire reason we want to cause a fault in the syscall handler is so that we can read the RIP from the machine trap frame upon faulting. That part is easy. But how do we jump back to our KiErrata1337Present routine if we are in the page fault handler? Well, lucky for us, the SYSCALL instruction saves a return address for us which is actually intended for its counterpart, SYSRET. When the SYSCALL instruction executes, it stores the address of the next instruction in the RCX register.

We can see the SYSCALL instruction operates as such:

RCX ← RIP; (* Will contain address of next instruction *)
RIP ← IA32_LSTAR;
R11 ← RFLAGS;
RFLAGS ← RFLAGS AND NOT(IA32_FMASK);
// .... memes

So how do we return? Simple: we override the RIP address on the machine frame. Hopefully you understand how IRET works now if you weren’t already familiar with it, because this is where we use its operation to wrap up our detection. Let’s be clever and get two birds stoned at once by using the XCHG instruction:

PageFaultHookHandler:
        add     rsp, 8                  ; skip fault code on stack
        xchg    qword [rsp], rcx        ; xchg trap frame RIP with syscall return address in RCX
        iretq                           ; return from interrupt

We use the XCHG instruction to our advantage to exchange the syscall return address in RCX, with the RIP in the trap frame. This allows us to effectively store the REAL syscall handler address in RCX and still branch back to the instruction immediately after our SYSCALL instruction.

That’s pretty much it. That was a nutty one, wasn’t it? Putting it all together looks something like this:

// detect.c

VOID
DoTheThing(
    VOID
    )
{
    KIDTENTRY64 TempIdt[19];
    X64_DESCRIPTOR TempIdtr;
    X64_DESCRIPTOR OriginalIdtr;
    PVOID SyscallHandler;

    TempIdtr.Limit = sizeof(TempIdt) - 1;
    TempIdtr.Base = (UINT64)&TempIdt[0];
    RtlCopyMemory(TempIdt, KeGetPcr()->IdtBase, TempIdtr.Limit + 1);

    _disable();             // Disable interrupts
    __sidt(&OriginalIdtr);  // Backup original IDT
    __lidt(&TempIdtr);      // Load our temporary hook IDT

    // Hook page fault handler.
    TempIdt[X86_TRAP_PF].OffsetLow = (UINT16)(UINTN)PageFaultHookHandler;
    TempIdt[X86_TRAP_PF].OffsetMiddle = (UINT16)((UINTN)PageFaultHookHandler >> 16);
    TempIdt[X86_TRAP_PF].OffsetHigh = (UINT32)((UINTN)PageFaultHookHandler >> 32);

    // Trigger syscall that will purposely page fault!
    SyscallHandler = KiErrata1337Present();

    __lidt(&OriginalIdtr);  // Restore the original IDT.
    _enable();              // Re-enable interrupts.

    LOG_INFO("REAL SYSCALL Handler = 0x%p", SyscallHandler);
}

; detect.asm

PageFaultHookHandler:
        add     rsp, 8                  ; skip fault code on stack
        xchg    qword [rsp], rcx        ; xchg trap frame RIP with syscall return address in RCX
        iretq

KiErrata1337Present:
        push    rbx                     ; backup RBX which is to be clobbered

        mov     ecx, 0C0000101h         ; read original GS_BASE MSR
        rdmsr                           ;
        push    rdx                     ; backup original GS_BASE MSR
        push    rax                     ;
        mov     ecx, 0C0000102h         ; read original KERNEL_GS_BASE MSR
        rdmsr                           ;
        push    rdx                     ; backup original KERNEL_GS_BASE MSR
        push    rax                     ;

        swapgs                          ; swapgs to emulate coming from user mode

        xor     eax, eax                ;
        xor     edx, edx                ; set KERNEL_GS_BASE MSR to zero
        wrmsr                           ;

        syscall                         ; execute syscall instruction which executes swapgs immediately
        mov     rbx, rcx                ; store result syscall handler address in RBX for now

        mov     ecx, 0C0000102h         ;
        pop     rax                     ;
        pop     rdx                     ; restore original KERNEL_GS_BASE MSR
        wrmsr                           ;
        mov     ecx, 0C0000101h         ;
        pop     rax                     ;
        pop     rdx                     ; restore original GS_BASE MSR
        wrmsr                           ;

        mov     rax, rbx                ; return result in RAX
        pop     rbx                     ; restore original RBX
        ret                             ; return back to caller

 

PoC||GTFO

This wouldn’t be a complete article without some easy to use paste would it?

You can find the full proof of concept implementation on my github at https://github.com/ajkhoury/Errata1337

The post Patchguard: Detection of Hypervisor Based Introspection [P2] appeared first on Reverse Engineering.

✇Reverse Engineering

MMU Virtualization via Intel EPT – Index

By: Daax Rynd

Overview

After receiving an abundance of requests to complete the EPT series, I've switched gears to write this five-part series on MMU virtualization using Intel EPT. The series is written so that it can be used in your own hypervisor project or in conjunction with the CPU virtualization series published a few months prior. I will be referencing things within the previous project since the majority of readers will be following along; however, the implementation will be relatively similar across all type-2 hypervisor projects. This is only meant for those running on Intel processors with the virtualization technology features available. The goal of this series is to allow the reader to learn the technical details of paging, extended page tables, the various translation mechanisms, and how to leverage those in their virtualization projects.

At the end of the series the reader will have a working EPT base and should be able to design and implement their own EPT infrastructure in future projects. All concepts for each article, their importance, and references to more detailed information will be linked throughout just like any other of my blog posts, followed by a recommended reading section at the end should your thirst for details and knowledge not be satisfied. There will also be required reading to fully understand certain mechanisms used in VMX address translation.

Note: This series is not meant for those interested in writing a hypervisor for AMD processors, however, it may offer good technical information to help when the AMD series is published. This hypervisor will be written for Intel x86-64 (64-bit) using C.

  • Part 0 – Technical Details
    • This part will introduce readers to the various sub-topics surrounding extended page tables. It will cover the motivation for Intel EPT, mechanisms to aid address translation, performance concerns, and an introduction to various caching components that are referenced often when learning about paging and paging structures.
  • Part 1 – Implementation – Structure Definitions and Initialization
    • In this article, the reader will get pre-fabricated structures and all the details regarding their purpose. These structures will be used in the main EPT implementation in this series. This article will also detail the initialization and passthrough procedures needed for EPT to function properly under VMX. At the end of this article, the reader will have EPT ready to run in their hypervisor.
  • Part 2 – Implementation – EPT Helpers, Page Walking, EPT Violations, and Teardown
    • The third part describes the various EPT-induced VM-exits and how to handle them. It also implements various teardown functions and routines for assisting guest-to-host address translation. The reader will write handlers and learn about the different types of misconfigurations, violations, and exceptions associated with Intel EPT.
  • Part 3 – Integration and Testing
    • This article will start with implementing the EPT initialization functions into the existing project from the CPU virtualization series followed by a test run to ensure EPT is running properly, and purposely generating violations to ensure we’re hitting proper handlers.
  • Part 4 – EPTP Switching and Page Hooks
    • As a bonus I've added this part since a good portion of readers are interested in security research. EPTP switching and page hooks can be used to hide information, hook otherwise protected functions, or protect information from being queried by an unwanted party. One example in this part will show a hook on a Windows kernel function that will spoof the code integrity information when queried. The other example will prevent an application from getting any useful information when attempting to view the contents of a protected application's address space.

As an aside, prior to any post in this series: if you intend to be proficient and knowledgeable on the subject of virtualization and the microarchitecture in general, it is strongly recommended that you do the recommended reading, all of it, and take notes and put the knowledge into practice. This will be repeated every post, and pushed in your face, because details matter. Other supplemental reading in each article will be based on the content of that day; you may find some tweets, blogs, or gists from other hypervisor authors. All will be credited when used!

I’d also like to thank Aidan Khoury for his helpful insights while working on various projects. A lot of neat tricks have been passed down to me from him that I look forward to sharing with the readers.

Thank you again for your interest and I hope you learn something new and valuable in this series.


I hope you enjoy the series! Leave me feedback, questions, comments, or recommendations in the comment section or contact me on twitter.

The post MMU Virtualization via Intel EPT – Index appeared first on Reverse Engineering.

✇Reverse Engineering

MMU Virtualization via Intel EPT: Technical Details

By: Daax Rynd

Overview

This article marks the first of 5 articles covering the virtualization of the memory management unit (MMU) using Intel EPT. This technology is used as additional support for the virtualization of physical memory and allows hypervisors to monitor memory activity. This article will address the motivation for extended page tables, the many performance concerns associated with them, and the different architectural components associated with MMU virtualization. The components are covered in some detail, but most information about them is in the Intel SDM. We will not be discussing anything OS-specific in this article – just the architectural details necessary to understand for proper implementation.

Disclaimer

Readers must have a foundational knowledge of virtual memory, paging, address translation, and page tables. This information is in §4.1.0 V-3A Intel SDM.

Memory and the MMU

In this rundown, we will cover some important concepts related to the memory management unit and paging. This is by no means a full discourse on paging and virtual memory for the Intel architecture, but more of an abstract overview to help the reader connect the dots a little better.

— Physical and Virtual Memory

Physical memory exists on physical cards like DIMM modules, while secondary storage lives on devices like hard disks. Assuming familiarity with computer science's fundamental concepts, recall that the executable must be mapped into physical memory before any process executes. Now, on modern systems, there is a secondary memory storage space called virtual memory. In a perfect world, data required to run programs would be mapped directly into RAM where it can be accessed quickly by the processor. Sadly, we do not live in a perfect world, and the system's main memory can become full. Enter stage right, virtual memory. This secondary form of memory utilizes a storage device like a hard drive to free up space in physical memory. Nevertheless, we are not concerned with virtual memory for the time being. When setting up EPT, we need to know some critical details about physical memory, first and foremost.

When a computer begins its boot sequence, the code executing on the bootstrap processor can access physical memory directly. This is because the processor is operating in real address mode – aptly named, since addresses in real mode correspond to their physical memory addresses. There are also several physical memory ranges available for use by the OS/bootloader at this point. If we were to breakpoint a system and dump the physical memory ranges present, we would be able to see what is called the system memory map. Below is an image of the physical memory ranges when a breakpoint was applied prior to MmInitSystem.

The image shows the physical memory ranges and their sizes. The first range, 1000h-A0000h, is available as general DRAM for OS consumption. This range of memory is also called low memory – sometimes DOS compatibility memory. So, what is the purpose of this drivel? During the boot sequence, the BIOS does many things, but the most relevant to this series is applying the different caching behaviors to physical memory ranges. The BIOS programs something called a memory-type range register (MTRR) to achieve this. These are a set of control registers that give the system control over how specific memory ranges are cached. The details of the caching requirements vary from system to system. For the sake of example, the physical memory range 1000h-9FFFFh is write-back, whereas the range A0000h-BFFFFh is write-combined or uncached.

If you’re wondering how MTRRs are relevant, do not worry. We will get to that…

𝛿 Memory Type Range Register (MTRR)

Physical memory has ranges, and each range has a cache-control policy applied during system initialization. Why is this important? For starters, applying the proper caching policies to memory regions is vital to ensure that system performance does not degrade. If a frequently accessed region of memory is uncached, frequent data fetches will significantly degrade system performance. This would happen because applications typically access data with high measures of locality. If data is not present in a cache, then the CPU will have to reach out to main memory to acquire it – and reaching out to main memory is slow! This matters because when allocating memory and initializing EPT we will have to build what's called an MTRR map. Fortunately for us, there is already an MTRR map of the current physical memory regions that we can use as a reference.

Figure 0. MTRR encoding table (Intel SDM)

Figure 1. MTRR map on physical machine.

From the image, you might notice the ranges are quite specific – this is due to Windows using fixed-range MTRRs and some variable-range MTRRs. Armed with this information, it’s clear that applying the appropriate caching policy to our extended page tables during initialization is imperative to preserving system performance. No need to worry either, modifying and creating an MTRR map for our VM is straightforward. We will go into more detail in the next article when we build our MTRR map. See the recommended reading if you’re eager to get ahead. With this addressed, let’s talk about the purpose of the MMU and page tables.
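Before moving on, here is a minimal sketch of how such a reference map can be read. The MSR addresses and bit positions come from the Intel SDM; the function name is my own, and the LOG_INFO macro is assumed from the code excerpts elsewhere in these articles. It only walks the variable-range MTRRs:

#include <intrin.h>

// IA32_MTRRCAP reports the variable-range count (VCNT) in bits 7:0; each
// variable range is described by a PHYSBASE/PHYSMASK MSR pair.
#define IA32_MTRRCAP            0x0FE
#define IA32_MTRR_PHYSBASE0     0x200
#define IA32_MTRR_PHYSMASK0     0x201

VOID DumpVariableRangeMtrrs(VOID)
{
    UINT64 Cap = __readmsr(IA32_MTRRCAP);
    UINT32 Count = (UINT32)(Cap & 0xFF);                // VCNT

    for (UINT32 Index = 0; Index < Count; Index++) {
        UINT64 Base = __readmsr(IA32_MTRR_PHYSBASE0 + Index * 2);
        UINT64 Mask = __readmsr(IA32_MTRR_PHYSMASK0 + Index * 2);

        if (!(Mask & (1ULL << 11)))                     // valid bit clear: range unused
            continue;

        LOG_INFO("MTRR[%u] base=0x%llx type=%u",
                 Index,
                 Base & ~0xFFFULL,                      // physical base, bits 12 and up
                 (UINT32)(Base & 0xFF));                // memory type (see Figure 0)
    }
}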

Page Attribute Table

In addition to MTRRs, an additional cache control called the Page Attribute Table (PAT) allows the OS to control caching policies at a finer granularity (the page level). This cache control is detailed more in the next article.

— The MMU

Most modern processors come with a memory management unit (MMU) that provides access protection and virtual-to-physical address translation. A virtual address is, simply put, an address that software uses; a physical address is an address that hardware outputs on the address lines of the data bus. Intel architectures divide virtual memory into 4KB pages (with support for other sizes) and physical memory into 4KB frames. An MMU will typically contain a translation lookaside buffer (TLB) and will perform operations on the page table such as hardware table walks. Some MMU architectures will not perform those operations; this gives the OS the freedom to implement its page table in whatever manner it desires. The MMU architecture specifies caching policies for the instruction and data caches, such as whether code is cacheable or non-cacheable, and whether data caching is write-back or write-through. These policies may also cover caching access rights.

MMU Split

In certain processors, the MMU can be split into an Instruction Memory Management Unit (IMMU) and a Data Memory Management Unit (DMMU). The former is activated by instruction fetches and the latter by data memory operations.

The MMU architecture for Intel 64 provides a 64-bit linear address space covering 16 EiB; however, only 2^57 bytes are addressable on current architectures with the new 5-level page table structure. That's still ~128 PiB of address space available. The short and “simple” for how an MMU works is this – the MMU gets a virtual address and uses it to index into a table (the TLB or page tables). The entries in the table provide a physical address plus some control signals that may include the caching policy and whether the entry is valid, invalid, protected, and so on. It may also receive signals as to whether the memory referenced by the entry was accessed/modified. If the entry is valid then the virtual address is translated into the physical address; the MMU will then use information from the control signals to determine what type of memory transaction is occurring. The tables mentioned are similar to a directory structure. The MMU will traverse the page tables to translate the virtual address to the physical address. On the x86-64 architecture, the MMU maps memory through a series of tables – 4 or 5 depending on software requirements.

 

Figure 2. Simplified diagram of address translation.

 

We will cover a bit about TLBs and their role in a virtualization context later. Since we now know the purpose of the MMU, let's start talking about Intel's EPT.

Extended Page Tables (EPT)

Intel’s Extended Page Table (EPT) technology, also referred to as Secondary Level Address Translation (SLAT), allows a VMM to configure a mapping between the physical memory as it is perceived by the guest and the real physical memory. It’s similar to the virtual page table in that EPT enables the hypervisor to specify access rights for a guest’s physical pages. This allows the hypervisor to generate an event called an EPT violation when a guest attempts to access a page that is either invalid or does not have appropriate access rights. This EPT violation is one of the events we will be taking advantage of throughout this series since it triggers a VM-exit.
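For orientation, here is a sketch of the EPT pointer (EPTP) layout as defined in the Intel SDM; the field names and the u64 type are my own, in the style of the structures used later in this series:

typedef union _eptp
{
    u64 value;
    struct
    {
        u64 memory_type : 3;        // caching type for the EPT structures (6 = write-back)
        u64 page_walk_length : 3;   // EPT page-walk length minus one (3 = 4-level)
        u64 accessed_dirty : 1;     // enable accessed/dirty flags for EPT
        u64 sss_control : 1;        // enable supervisor shadow-stack control
        u64 reserved : 4;           // must be zero
        u64 pml4_pfn : 52;          // page frame number of the EPT PML4 table
    } bits;
} eptp;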

Important Note

Virtualization of the IOMMU is performed by a complementary technology to EPT called VT-d. This will not be covered in this series.

This technology is extraordinarily useful. For instance, one can utilize EPT to protect the hypervisor's code and data from malicious code attempting to modify it. This would be done by setting the access rights of the VMM's code and data to read-only. In addition to that, if a VMM were to be used to whitelist certain applications, it could mark the remaining physical address space as read/write but non-executable. This would force a VM-exit on any execution attempt, allowing the hypervisor to validate the faulting page. Just a fun thought experiment.

Enough about the potential, let’s get into the motivations for EPT and address the other various components associated…

— Motivation

One of the main motivations for extending Intel's virtualization technology was to reduce the performance loss incurred on VM-exits. This was achieved by adding virtual-processor identifiers (VPID) to the Nehalem processors in 2008. It's known to many researchers in the field that the first generation of the technology forced a flush of the translation lookaside buffer on each VMX transition. This resulted in significant performance loss when going from VMX non-root to root operation. Now, if you're wondering what the TLB is or does, do not worry – we cover it briefly in a subsection below. This performance loss also extended to VM-entries if the VMM was emulating stores to specific control registers or utilizing the invlpg instruction.

This TLB entry invalidation occurs for moves to CR3 and CR4, along with other conditions related to process-context identifiers which we will address later. If you're not familiar with what TLBs are, I'd strongly suggest revisiting the address translation section in the Intel SDM; the next section briefly reviews the TLB as it relates to EPT.

— Translation Lookaside Buffer (TLB)

The translation lookaside buffer (TLB) is a cache that houses mappings for virtual to physical addresses – it follows the principle of locality to reduce the number of traversals of the paging structures that the CPU needs to make when translating a virtual address. For the sake of simplification let’s look at an example of what happens during a TLB fill, hit, and miss. This will make the later explanation easier to understand. Let’s say we are performing a virtual address lookup on virtual address 0x00001ABC. This is a simplified look at what would happen in the three scenarios.

𝛿 TLB Fill

When a lookup is required for a specific virtual address, the TLB is the first stop in any address translation. However, if the TLB is empty, a sequence of steps is required to ensure faster lookups in future translations. In this case, we're looking up the virtual address 0x00001ABC.

 

The first step (1) is that the translation unit checks the TLB to determine if a mapping for the virtual address is available. The translation unit determines that the PTE is not in the TLB and must proceed to step two (2), which loads the PTE from the page table in main memory. It uses the virtual page number, 0x00001, to index into the page table and locate the PTE. You can see at index 1 in the page table we have the value 0xA. This value represents the physical page number (PPN), which will be used to fill in the PPN field of the first TLB entry. Since the TLB is a cache of mappings from virtual to physical addresses, we use the virtual page number as the tag. This achieves the mapping requirement that VPN 0x1 -> PPN 0xA. Once the TLB entry is filled, we use the physical page number, 0xA, to complete the translation, giving us the physical address 0x0000AABC. This is a simplified example of the process for a TLB fill/TLB miss. The end result is below.

𝛿 TLB Miss + Eviction

Now, what happens when our TLB is full and our virtual address does not have a mapping cached in the TLB? This is called a TLB miss + eviction, or just TLB eviction. Using the same virtual address as before, but with a filled TLB, let’s take a look at the sequence of operations to complete the address translation.

 

The first step is the same as before – the translation unit goes to the TLB to see if a mapping is available for virtual page number 1 (1). However, the TLB is full and no entry corresponds to the virtual to physical mapping for the virtual address. This means the TLB will have to evict the oldest entry (2). Let’s assume that the address translation prior to this used virtual page number 3, so the eviction will occur on the second entry with tag 0x4.

Following the eviction, the translation will continue by traversing the page table in main memory and loading the PTE corresponding to the virtual page number 1 (3). After locating the PTE for VPN 1, the evicted TLB entry is replaced with the mapping for our current virtual address (4). The physical page number would be 0xA and the tag 0x1.

And finally, the address translation will use the physical page number to complete the translation, yielding the physical address 0x0000AABC. This does not seem like a difficult or cumbersome process, but remember that page table traversals are not this simple, and reaching out to main memory is slow! What happens if the virtual page number, in this example, is 0? If you guessed that a page-fault would occur you'd be correct, and page-faults are horrifically slow. If you take this diagram and add all the levels of tables required for address translation, you will see that TLB misses increase overhead substantially. Below is an image of address translation using a two-entry TLB taken from this book.
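If it helps to see the fill, hit, and eviction cases as code, below is a toy two-entry TLB model in C. It is purely illustrative: the page-table walk is stubbed out so that it reproduces the VPN 0x1 -> PPN 0xA mapping from the diagrams, and FIFO eviction stands in for whatever replacement policy real hardware uses.

#include <stdio.h>
#include <stdint.h>

// A toy, fully-associative TLB with FIFO eviction. Pages are 4KB, so the VPN
// is the address shifted right by 12 and the page offset is the low 12 bits.
#define TLB_ENTRIES 2
#define PAGE_SHIFT  12
#define PAGE_MASK   0xFFF

typedef struct
{
    int      valid;
    uint64_t tag;   // virtual page number (VPN) used as the tag
    uint64_t ppn;   // physical page number
} tlb_entry;

static tlb_entry tlb[TLB_ENTRIES];
static int next_victim; // FIFO pointer: oldest entry to evict

// Stand-in for the page-table walk; a real walk traverses 4 (or 5) levels.
static uint64_t walk_page_table(uint64_t vpn)
{
    return vpn + 9; // maps VPN 0x1 -> PPN 0xA, as in the diagrams above
}

static uint64_t translate(uint64_t va)
{
    uint64_t vpn = va >> PAGE_SHIFT;

    // (1) TLB lookup
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].tag == vpn)
            return (tlb[i].ppn << PAGE_SHIFT) | (va & PAGE_MASK); // TLB hit
    }

    // (2) miss: walk the page table in main memory (slow!)
    uint64_t ppn = walk_page_table(vpn);

    // (3) fill the TLB, evicting the oldest entry if every slot is valid
    tlb[next_victim] = (tlb_entry){ 1, vpn, ppn };
    next_victim = (next_victim + 1) % TLB_ENTRIES;

    // (4) complete the translation
    return (ppn << PAGE_SHIFT) | (va & PAGE_MASK);
}

int main(void)
{
    printf("0x%llx\n", (unsigned long long)translate(0x00001ABC)); // miss + fill -> 0xAABC
    printf("0x%llx\n", (unsigned long long)translate(0x00001ABC)); // hit
    return 0;
}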

So what does this have to do with EPT? Well, if you're in a virtualized environment utilizing EPT then there is an increased cost of TLB miss processing. This is because the number of operations to translate a guest-virtual address to a host-physical address dramatically increases. The worst-case scenario for memory references performed by the hardware translation unit can increase by 6 times over native execution. Because of this, it has become imperative for the virtualization community to reduce the frequency and cost of TLB misses as it pertains to Intel VT-x and EPT. There have been numerous research articles on reducing the length of 2-dimensional page table walks, page sharing, and so on – but that's a discussion for another time. Lucky for us, the technology has made leaps and new mechanisms have been introduced. One of which is the virtual-processor identifier (VPID).

— Virtual Processor Identifier (VPID)

As we learned previously, flushing the TLB is a knockout for performance. Intel engineers were aware of this issue, and in 2008 introduced virtual-processor identifiers in the Nehalem architecture. This virtual-processor identifier is used as a tag for each cached linear address translation (similar to the diagrams above). This provides a way for the processor to identify (tag) different address spaces for different virtual processors. Not to mention, when VPIDs are used no TLB flushes occur on VM-entries or VM-exits. This has significant performance implications: when a processor attempts to access a mapping whose VPID does not match the TLB entry tag, a TLB miss occurs – whether an eviction takes place depends on the replacement policy and TLB occupancy.

When EPT and VPID are active the logical processor may cache the physical page number the VPID-tagged entry translates to, as well as information about access rights and memory type information. The same applies to VPID-tagged paging-structure entries except the physical address points to the relevant paging structure instead of the physical page frame. It’s important to note briefly that each guest CPU obtains a unique VPID, and all host CPUs use the VPID 0x0000. We will also come across an instruction, invvpid, that is necessary for migrating a virtual CPU to a new physical CPU. This instruction can also be used for shadowing – such as when the guest page table is modified by the VMM or guest control registers are altered.
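For later reference, here is a sketch of the 128-bit INVVPID descriptor and the invalidation types as laid out in the Intel SDM. There is no compiler intrinsic for invvpid, so the asm_invvpid stub named below is hypothetical and would be implemented in an assembly file:

// INVVPID invalidation types (Intel SDM, Vol. 3C)
#define INVVPID_INDIVIDUAL_ADDRESS              0
#define INVVPID_SINGLE_CONTEXT                  1
#define INVVPID_ALL_CONTEXT                     2
#define INVVPID_SINGLE_CONTEXT_RETAIN_GLOBALS   3

typedef struct _invvpid_descriptor
{
    u64 vpid : 16;          // VPID whose cached translations are targeted
    u64 reserved : 48;      // must be zero
    u64 linear_address;     // used only by the individual-address type
} invvpid_descriptor;

// Hypothetical stub around the INVVPID instruction, implemented in .asm.
extern void asm_invvpid(u64 type, const invvpid_descriptor* descriptor);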

There is plenty of detail on exactly what may be cached when VPID/EPT is in use, as well as more on VPIDs, in the Intel SDM. The section numbers for this information are provided in the recommended reading section. These subsections are intended to briefly introduce you to terminology and features you will encounter throughout this series.

— Oh, how the Extended Page Tables

Understanding the paging structures and address translation without virtualization in the mix can be confusing. Once we introduce some form of SLAT, in this case – EPT, the complexity of the caching and translation process increases. This is why in the beginning of the article it was recommended you have some background with the translation process. Noting that, let’s look at what the typical translation process looks like on a system that is using 4-level paging.

In this image, you will see the usual process for translating a virtual address to a physical address without any form of SLAT. A virtual address is given to the MMU, and the TLB performs a look-up to determine if there is a VA→PA mapping. If a mapping exists we get a TLB hit, which results in the process detailed in the TLB section above. Otherwise, we have a TLB miss and are required to walk the paging structures to get the VA→PA translation. If we introduce EPT, our memory management constructs get more complicated.

As we can see from the above picture, processors with hardware support for MMU virtualization must have extended paging caches. These caches are part of a “master TLB”, so to speak, that caches both the GVA→GPA and GPA→HPA. The hardware is able to track both of these mappings using the TLB tagging we addressed earlier, the VPID. The result of using the VPID, as mentioned earlier, is that a VM-transition does not flush the TLB. This means that the entries of various VMs can coexist without conflict in the TLB. This master TLB eliminates the need for updating any sort of shadow page tables constantly. There is a downside to this, however. Using EPT makes the virtual to host physical address translation significantly more complex – most notably if we incur a miss in the TLB. Now, this diagram does not cover the complexity very well so let’s talk about how the translation works with EPT.

Concerning the Master TLB

This “master” TLB contains both the guest virtual to guest physical mapping and guest physical to host physical mapping. It also uses a virtual-processor identifier (VPID) to determine which TLB entry belongs to what VM.

For each of the steps in the guest translation, we have to do all the steps in the VMM. When EPT is in use, the addresses in the guest paging structures are not used as physical addresses to reference memory – they're treated as guest-physical addresses and are pushed through the set of EPT paging structures for translation to the real physical address. This means that when we do not have the proper TLB entry, the traversal requires 16 lookups as opposed to 4 in a non-virtualized environment, since each of the 4 guest paging-structure references requires its own 4-step walk of the EPT structures – yikes. This is why I wanted to drive the point home that TLB misses… are bad! It's also worth mentioning that the TLBs on modern CPUs are much bigger than in previous generations. There's more to this, as with everything, but I want this process introduced prior to implementation so you're not wildly confused in the next article. Before concluding this article, we need to address one more topic that is vital to increasing translation speed when emulating the MMU.

— Virtual TLBs (vTLB)

We know now that virtual-to-physical address translation in a virtualized environment is costly. One of the ways to curb this performance hit is to emulate the TLB hardware in software, which is what implementing a virtual TLB (vTLB) entails. We will not be implementing a virtual TLB in this series, but it's worth knowing that it is a possible solution. The virtual TLB is typically a complete linear lookup table. On Intel processors with vTLB support, we have to enable the virtual TLB scheme by modifying some VMCS fields: we have to trap on #PF exceptions, VM-exit on all CR3 writes, and enable invlpg exiting. However, it's noted in the Intel SDM that the combination of these may not yield the best performance. You can read more on utilizing the virtual TLB scheme in the Intel SDM §32.3.4 Volume 3C.
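As a rough sketch of what "modifying some VMCS fields" means here, the following shows the three controls involved, assuming simple vmread/vmwrite wrappers like the vmwrite primitive used later in this series; the field encodings and bit positions are from the SDM:

// VMCS field encodings (Intel SDM, Appendix B)
#define VMCS_EXCEPTION_BITMAP           0x4004
#define VMCS_PRIMARY_PROC_BASED_CTLS    0x4002

// Primary processor-based VM-execution control bits
#define PROC_CTL_INVLPG_EXITING         (1UL << 9)
#define PROC_CTL_CR3_LOAD_EXITING       (1UL << 15)

// Enable the vTLB scheme: exit on #PF (exception bitmap bit 14), on every
// guest CR3 write, and on every INVLPG executed by the guest.
vmwrite(vmcs, VMCS_EXCEPTION_BITMAP,
        vmread(vmcs, VMCS_EXCEPTION_BITMAP) | (1UL << 14));
vmwrite(vmcs, VMCS_PRIMARY_PROC_BASED_CTLS,
        vmread(vmcs, VMCS_PRIMARY_PROC_BASED_CTLS)
            | PROC_CTL_INVLPG_EXITING
            | PROC_CTL_CR3_LOAD_EXITING);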

If you’ve made it this far, you’re prepared to begin your journey through implementing EPT in your hypervisor.

Conclusion

In this article, we covered information regarding the caching policies applied to memory and how they will be utilized when allocating our paging structures. We addressed the MMU and its purpose along with some of its components that are vital to performant address translation. We also discussed the motivations for EPT and went into more detail than I anticipated on the hardware TLB. I wanted to introduce these topics in this article so that the breadth of the future articles in this series was not overwhelming. It’s important that you understand the technical details and purpose underlying these sub-topics. Particularly important is the virtual-processor identifier and TLB operations. The abstract overview in this article should be sufficient but be prepared for more details in the coming parts.

In the next article, we will dive right into building the MTRR map and cover the page attribute table (PAT). I will be providing prefabricated structures and explaining the initialization of EPT in your hypervisor. We will cover identity mapping, setting up the PML4/PML5 entries for our EPTP, allocating our various page directories, and how to implement 4KB pages versus 2MB pages. In addition to that, detail will be provided on EPT violations/EPT misconfigurations and how to implement their VM-exit handler. The easiest part will be inserting our EPTP into our VMCS. Unfortunately, the next article will only be configuration and initialization; the following article will provide different methods of monitoring memory activity.

Aside: The IOMMU virtualization using VT-d may be attached to this series at the end, or a brief implementation in a separate article.

I apologize in advance for the potentially erratic structuring of this article. As I was writing it I realized there was a lot that might’ve been missing and started trying to find a way to naturally cover the topic. It’s been a little bit so I have to stretch my writing muscle again. As always, thanks for reading – please feel free to reach out to me on Twitter or leave a comment on the article below.

Do the recommended reading!

Recommended Reading

The post MMU Virtualization via Intel EPT: Technical Details appeared first on Reverse Engineering.

✇Reverse Engineering

MMU Virtualization via Intel EPT: Implementation – Part 1

By: Daax Rynd

Overview

This article will cover the various requirements and features available for MMU virtualization via Intel Extended Page Tables. It's going to be a relatively long article as I want to cover all or most of the details concerning initialization and capability checking, MTRR setup, page splitting, and so on. We'll start with checking feature availability and what capabilities are supported on the latest Intel processors, restructuring some of the VMM constructs to support EPT, and then move into the allocation of the page tables. This article will use the Windows memory management API to allocate and track resources. It's highly recommended that the reader research and implement a custom memory allocator that doesn't rely on the OS for resource allocation, as these can be attack vectors for malicious third parties. However, we will be sticking to the most straightforward approach for simplicity. There is a lot of information to cover, so let's not waste much more time on this overview.

Disclaimer

Readers must have a foundational knowledge of virtual memory, paging, address translation, and page tables. This information is in §4.1.0 V-3A Intel SDM.

As always, the research and development of this project were performed on the latest Windows 10 Build 21343.1000. To ensure compatibility with all features, be aware that the author hosts an Intel i9-10850k (Comet Lake) that supports the most recent virtualization extensions. During capability/feature support checks, if your processor doesn’t show availability, do not worry — as long as it supports baseline EPT all is good.

Feature Availability

To start, we need to check a few things to make sure that we support EPT and the different EPT policies. This project has a function that sets all VMX capabilities before launch, if available – checking for the WB cache type, various processor controls, and, related to this article, EPT, VPID, and INVPCID support. These capabilities are inside the secondary processor controls, which we'll read from the IA32_VMX_PROCBASED_CTLS2 MSR. The lower 32 bits indicate the allowed 0-settings of these controls, and the upper 32 bits indicate the allowed 1-settings. You should already have an algorithm set up to check and enable the various control features. If not, please refer back to this article in the first series on CPU virtualization.
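As a quick sketch of the allowed 1-settings check (assuming the SDM's bit positions for the secondary controls: bit 1 for enable EPT, bit 5 for enable VPID; the function name is my own), it might look like this:

#define IA32_VMX_PROCBASED_CTLS2    0x048B

#define CTL2_ENABLE_EPT             (1UL << 1)
#define CTL2_ENABLE_VPID            (1UL << 5)

boolean_t are_ept_controls_allowed( void )
{
    UINT64 ctls2 = __readmsr(IA32_VMX_PROCBASED_CTLS2);
    UINT32 allowed1 = (UINT32)(ctls2 >> 32);    // upper 32 bits: allowed 1-settings

    // Both controls must be allowed to be 1 for EPT + VPID to be usable.
    return (allowed1 & CTL2_ENABLE_EPT) && (allowed1 & CTL2_ENABLE_VPID);
}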

Possible Incompatibility

If your processor doesn’t support secondary processor controls, you will be unable to implement EPT. The likelihood of this being an issue is slim unless you’re using a very old processor.

Once the capabilities and policies have been verified and enabled, we will enable EPT. However, there will be an information dump prior because it’s essential to understand extended paging as an extension of the existing paging mechanism and the structural changes to your hypervisor. We’ll need to allocate a data structure inside of our guest descriptor that will contain the EPTP. The design of your project will vary from mine, but the important thing is that each guest structure allocated has its EPTP – this will be a 64-bit physical address. Here is an example of my guest descriptor:

typedef struct gcpu_descriptor_t
{
    uint16_t                id;
    gcpu_handle_t           guest_list;
    crn_access_rights       cr0_ar;
    crn_access_rights       cr4_ar;
    uint64_t                eptp;

    //
    // ... irrelevant members ...
    //

    gcpu_descriptor_t*      next_gcpu;
} gcpu_descriptor_t;

Once you have an EPTP member set up, you'll need to write the value of this member into the VMCS_EPTP_ADDRESS field using whatever VMCS write primitive you have set up. Similar to this:

// EPTP Address (Field Encoding: 0x201A)
//
vmwrite(vmcs, VMCS_EPTP_ADDRESS, gcpu->eptp);

Before implementing the main portion of the code for EPT, let’s address some important technical details. It’s in your best interest to read the following sections thoroughly to ensure you understand why certain things are checked and why certain conditions are unsupported. Improper virtualization of the MMU can cause loads of issues as you build your project out, so it’s imperative to understand how everything works before extending. It’s also good to review so that confusion is minimized in future sections… and because details are cool.

Memory Virtualization

Virtual memory and paging are necessary abstractions in today’s working environments. They enable the modern computer system to efficiently utilize physical memory, isolate processes and execution contexts, and pass off the most complex parts of memory management to the OS. Before diving into the implementation of EPT, the reader (you) must have a decent understanding of virtual memory and paging; and address translation. There was a brief overview of the address translation performed in the previous article. We’ll go into more detail here to set the stage for allocating and maintaining your EPT page hierarchies.

— Virtual Memory and Paging

In modern systems, when paging is enabled, every process has its own dedicated virtual address space managed at a specific granularity. This granularity is usually 4kB, and if you've ever heard the term page-aligned, then you've worked with paging mechanisms. Page-aligned buffers (like your VMCS) are buffers aligned on a page boundary — since the system divides memory into granular chunks called pages, page-aligned means that the starting address of a buffer is at the beginning of a page. A simple way to verify if an address is aligned on a page boundary is to check that the lower 12 bits of the address are clear (zero). However, this is only true for 4kB pages; pages with different granularity, such as 2MB, 4MB, or 1GB, will have different alignment masks. For example, take the address FFFFD288`BD600000. This address is 4kB page-aligned (the lower 12 bits are clear), but it would not be aligned on a page boundary if the size of pages were 1GB. To check this, we perform a bitwise AND of the address against the complement of the size minus one, ~(size - 1) — equivalently, the two's complement of the size (4kB, 1MB, 2MB, 4MB, 1GB).

The macro might look something like this: PAGE_ALIGN_4KB(_ADDRESS)   ((UINTPTR)(_ADDRESS) & ~(0x1000 - 1)). For 1GB, the 0x1000 (4,096 in decimal) would be replaced by 0x40000000 (the size of a 1GB page). Give it a try yourself and look at the differences between the addresses when aligned on their respective granularity's boundary.
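Generalizing that macro for any granularity might look like the sketch below; the macro names are my own, and note that this mask trick is only a true alignment test when the size is a power of two:

// Align an address down to the given power-of-two granularity.
#define PAGE_ALIGN(_ADDRESS, _SIZE)         ((UINTPTR)(_ADDRESS) & ~((UINTPTR)(_SIZE) - 1))

// Non-zero if the address is aligned on the given power-of-two granularity.
#define IS_PAGE_ALIGNED(_ADDRESS, _SIZE)    (((UINTPTR)(_ADDRESS) & ((UINTPTR)(_SIZE) - 1)) == 0)

// IS_PAGE_ALIGNED(0xFFFFD288BD600000, 0x1000)     -> 1 (4kB-aligned)
// IS_PAGE_ALIGNED(0xFFFFD288BD600000, 0x200000)   -> 1 (2MB-aligned)
// IS_PAGE_ALIGNED(0xFFFFD288BD600000, 0x40000000) -> 0 (not 1GB-aligned)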

 Page Alignment Trivia

On a 4kB page size architecture, there are many page-aligned addresses other than 4,096 itself. Two of those are 12,288 (0x3000) and 413,696 (0x65000) — as you may notice, the lower 12 bits are clear in both. Checking against a multiple of the page granularity can appear to work as well: the expression (FFFFD288`BD600000 & ~(0x32000-1)) still results in the same address, since 0x32000 is a multiple of the page granularity. Be careful, though: the AND-mask trick is only a reliable alignment test when the size is a power of two; for a non-power-of-two multiple like 0x32000 it merely happens to leave this particular address unchanged.

So, how is this virtual memory managed and mapped to a physical page? The implementation details are specific to the OS doing the memory management; there is enough information for a whole book — luckily, a few well-written researchers have covered much of it in Windows Internals 7th Edition. The main thing to understand here is that all per-process mappings are stored in a page table which allows for virtual-to-physical address translation. In modern systems using virtual memory, for all load/store operations on a virtual address, the processor translates the virtual address to a physical address to access the data in memory. There are many different hardware facilities, like the Translation Lookaside Buffer (TLB), that expedite this address translation by caching the most recently used (MRU) page table entries (PTEs). This allows the system to leverage paging in a performant manner, since performing all the steps of address translation on every access would significantly reduce performance, as happens on TLB misses. The previous article briefly covered the TLB and the various conditions that may be encountered. It may be worth reviewing since it's been a bit since it was released…

  Overheads of Paging

As physical memory requirements grow, large workloads will experience higher latency due to paging on modern systems. This is in part due to the size of the TLB not keeping pace with memory demands, which is itself partly due to the TLB being on the processor's critical path for memory access. There are a few TLBs on modern systems, but most notably the L1 and L2 TLBs have begun to stagnate in size. You can read more about this problem, referred to as TLB reach limitation, in the recommended reading section if interested. There are also several papers on ResearchGate proposing solutions to increase TLB reach.

The reason for mentioning this is that how you design virtual memory managers is vital in preserving the many benefits of paging without tanking system performance. This is something to consider when adding an additional layer of address translation, such as in the case of EPT. So, what about the page table?

𝛿 Address Translation Visualized

As mentioned above, the page table is a per-process (or per-context) structure that contains all the virtual-to-physical mappings of a process. The OS manages it, and the hardware performs the page table walk; in some cases, the OS fetches the translation itself. This mapping of virtual to physical addresses occurs at the specified page granularity. So let's take a look at a diagram showing the process of translating a virtual address to a physical address and then walk through the process.

The above diagram features an abstract view that you've likely seen a few times throughout this series, but it's essential to keep it fresh in mind when walking through the actual address translation process. To address the abstract layout, we start with CR3, which contains the physical base address of the current task's topmost paging structure — in this case, the base of the PML4 table. The indexes into these different tables are determined by the linear address given for translation. A given PML4 entry (PML4E) will point to the base of a page directory pointer table (PDPT). At each step, the new physical address calculated is dereferenced to determine the base of the next paging structure. An offset into that table is added to the entry's physical address, and so on — down the chain. Let's walk through the process with a non-trivial linear address to get a more concrete example of this.

The linear address given is shown in the figure above; the CR3 value was determined by reading the _KPROCESS structure and pulling the address out of the DirectoryTableBase member, which was 13B7AA000. The first thing that must be done is to split the linear address into the parts required for address translation. The numbers above each block are the bit ranges that comprise that index. Bits 39 to 47, for instance, are the bits that will be used to determine the offset into the PML4 table to find the corresponding PML4E. If you want to follow along or try it out for yourself, you can use SpeedCrunch or WinDbg (with the .format command) on the linear address and split it up accordingly. I'd say this is somewhat straightforward, but for the sake of giving as many examples as possible, the code below presents a few C macros that are useful for address translation.

#define X64_PML4E_ADDRESS_BITS          48  // PML4 index lives in bits 47:39
#define X64_PDPTE_ADDRESS_BITS          39
#define X64_PDTE_ADDRESS_BITS           30
#define X64_PTE_ADDRESS_BITS            21
        
#define PT_SHIFT                        12
#define PDT_SHIFT                       21
#define PDPT_SHIFT                      30
#define PML4_SHIFT                      39
        
#define ENTRY_SHIFT                     3   // each table entry is 8 bytes (1 << 3)

#define X64_PX_MASK(_ADDRESS_BITS)      ((((UINT64)1) << (_ADDRESS_BITS)) - 1)

// Note the parentheses around the AND: `>>` binds tighter than `&` in C, so
// the mask must be applied to the address before the shift.
#define Pml4Index(Va)                   (((UINT64)(Va) & X64_PX_MASK(X64_PML4E_ADDRESS_BITS)) >> PML4_SHIFT)
#define PdptIndex(Va)                   (((UINT64)(Va) & X64_PX_MASK(X64_PDPTE_ADDRESS_BITS)) >> PDPT_SHIFT)
#define PdtIndex(Va)                    (((UINT64)(Va) & X64_PX_MASK(X64_PDTE_ADDRESS_BITS)) >> PDT_SHIFT)
#define PtIndex(Va)                     (((UINT64)(Va) & X64_PX_MASK(X64_PTE_ADDRESS_BITS)) >> PT_SHIFT)

// Returns the physical address of the PML4E mapping the provided virtual address.
//
#define GetPml4e(Cr3, Va)               ((PUINT64)((Cr3) + (Pml4Index(Va) << ENTRY_SHIFT)))

// Returns the physical address of the PDPTE which maps the provided virtual address.
//
#define GetPdpte(PdptAddress, Va)       ((PUINT64)((PdptAddress) + (PdptIndex(Va) << ENTRY_SHIFT)))

// Returns the physical address of the PDTE which maps the provided virtual address.
//
#define GetPdte(PdtAddress, Va)         ((PUINT64)((PdtAddress) + (PdtIndex(Va) << ENTRY_SHIFT)))

// Returns the physical address of the PTE which maps the provided virtual address.
//
#define GetPte(PtAddress, Va)           ((PUINT64)((PtAddress) + (PtIndex(Va) << ENTRY_SHIFT)))

There’s a lot of shifting and masking in the above; it can be quite daunting to those unfamiliar. There’s only one way to detail the bit shifting shenanigans, and that’s done pretty well in the Intel SDM Vol. 3A Chapter 4. This will be in the recommended reading as understanding paging and virtual memory in depth are necessary. However, circling back to our earlier example, I’ll explain how these macros, in conjunction with a simple algorithm, can be used to traverse the paging hierarchy quickly and efficiently.

  Important Note

If you attempt to traverse the paging structures yourself, you will find that the entries inside of each page table look something akin to 0a000001`33c1a867. This is normal; this is the format of the PTE data structure. On Windows, this is the structure type _MMPTE. If you cast the entry to this data structure, you'll see that it has a union specified and allows you to look at the individual bits set inside the hardware page structure, among other views. For instance, the example given – 0a000001`33c1a867 – is valid, dirty, allows writes, and has a PFN of 133c1a. The information you want for address translation is the page frame number (PFN).

Given the note above, we have to do two simple bitwise operations to get the page frame number (PFN) from the page table entry to feed these macros at each step. The first is to mask off the upper word (16 bits) of the entry — this will leave the page frame number and the additional information such as the valid, dirty, owner, and accessed bits, which make up the bottom portion (the 867). In this case, using the entry value 0a000001`33c1a867, we would perform a bitwise AND against a mask that retains the lower 48 bits (the maximum address size when 4-level paging is used). Such a mask can be constructed by setting bit position 48 and subtracting one, resulting in a mask with all bits below 48 set. The mask can be hard-coded or generated with this expression: ((1ULL << 48) - 1).

If we take our address and do the following:

u64 pdpe_address = ( 0x0a00000133c1a867 & ( ( 1ULL << 48 ) - 1 ) ) ... /* one more step necessary */

We would be left with the lower 48 bits, yielding the result 133c1a867. All that's left is to clear the lower 12 bits and then pass the result to the next step in our address translation sequence. The bottom 12 bits must be clear since the address of the next paging structure will always be page-aligned. This can be done by masking them off and completing the above expression to yield the next paging structure's address:

u64 pdpe_address = ( 0x0a00000133c1a867 & ( ( 1ULL << 48 ) - 1 ) ) & ~0xFFF;

The above is the same as doing 133c1a867 & 0x0000FFFFFFFFF000, but we want the cleanest solution possible. After this, the variable the result is assigned to holds the value 133c1a000, which is our PDPT base address in this example. These steps can be macro'd out, but I wanted to illustrate the actual entries being processed by hand so the logic became clear. As the below code excerpt demonstrates, the macros provided before this example are intended to be used.

// This is a brief example, not production ready code... it assumes physical
// memory is identity-mapped so the entries can be dereferenced directly.
//
// Assumed definition: mask retaining the lower 48 address bits.
#define X64_VIRTUAL_ADDRESS_BITS    ((1ULL << 48) - 1)

u64 DirectoryBase = 0x1b864d000;
u64 Va = 0x760715d000;

u64* Pml4e = GetPml4e( DirectoryBase, Va );

u64 PdptBase = ( *Pml4e & X64_VIRTUAL_ADDRESS_BITS ) & ~0xFFF;
u64* Pdpte = GetPdpte( PdptBase, Va );

u64 PdtBase = ( *Pdpte & X64_VIRTUAL_ADDRESS_BITS ) & ~0xFFF;
u64* Pde = GetPdte( PdtBase, Va );

/* ... etc ... */

Ideally, you would loop and decrement the level based on various conditions, subtracting 9 bits from the shift at each level, and check for certain bits and extensions in CR0 and CR4, among other things. We will cover a proper page walk in a later section of this article. This was intended to give a quick and dirty overview of the address translation process without checking for presence, large pages, access rights, etc. By now, hopefully, you have a decent idea of how virtual memory and address translation work. The next section will dive into the info about SLAT mechanisms, in this case the Extended Page Tables (EPT) feature on Intel processors.

— Extended Page Tables

Intel and other hardware manufacturers introduced virtualization extensions to allow multiple operating systems to execute on a single hardware setup. To perform better than the software virtualization solutions, many different facilities were introduced – one of them was EPT. This extension allows the host computer to fully virtualize memory through a level of indirection between the guest virtual address space (the VM's virtual address space; GVA) and the host physical address space (HPA), called the guest physical address space (GPA). The addition of this second level in the address translation process is where the acronym SLAT is derived from, and it also modifies the translation procedure. The procedure formerly was VA → PA but, with SLAT enabled, becomes GVA → GPA → HPA. Guest virtual address to guest physical address translation is done through a per-process guest page table, and guest physical address to host physical address translation is performed through the per-VM host page table.

 

Figure 2. Guest Virtual Address to Host Physical Address

This method of memory virtualization is commonly referred to as hardware-assisted nested paging. It is accomplished by allowing the processor to hold two page table pointers: one pointing to the guest page table and another to the host page table. As mentioned earlier, we know that address translation can negatively impact system performance if TLB misses are high. You can imagine this is doubly so with nested paging enabled: it multiplies overhead roughly 6-fold when a TLB miss occurs, since it requires a 2-dimensional page walk. I write 2-dimensional because native page walks only traverse one dimension of the page hierarchy, whereas with extended paging there are two dimensions because two page tables need to be traversed. Natively, a memory reference that causes a TLB miss requires 4 accesses to complete translation, whereas when virtualized it increases to a whopping 24 accesses: each of the 4 guest paging-structure references takes a 4-step EPT walk (4 × 4 = 16), plus the 4 guest entries themselves and 4 more EPT accesses to translate the final guest-physical address (16 + 4 + 4 = 24). This is where MMU caches and intermediate translations can improve the performance of memory accesses that result in a TLB miss – even when virtualized.

Anyways, enough of that, there will be some resources following the conclusion for those interested in reading about the page-walk caches and nested TLBs. I know you’re itching to initialize the EPT data for your project… so let’s get it goin’.

— EPT and Paging Data Structures

If you recall, in the first series on virtualization we had a single function that initialized the VMXON, VMCS, and other associated data structures. Prior to enabling VMX operation, but after allocating the regions for our VMXON and VMCS as well as any other host-associated structures, we're going to initialize our EPT resources. This will be done in the same function that runs for each virtual CPU. First and foremost, we need to check that the processor supports the features necessary for EPT. Where this check lives will depend on the structure of your project; I do it when checking the various VM-entry/VM-exit/VM-control structures for what bits are supported. Below are the data structure, function, and required references for checking if EPT features are available.

// EPT VPID Capability MSR Address
//
#define     IA32_VMX_EPT_VPID_CAP_MSR_ADDRESS                                   0x048C

// EPT VPID Capability MSR Bit Masks
//
#define     IA32_VMX_EPT_VPID_CAP_MSR_EXECUTE_ONLY                              (UINT64)(0x0000000000000001)
#define     IA32_VMX_EPT_VPID_CAP_MSR_PAGE_WALK_LENGTH_4                        (UINT64)(0x0000000000000040)
#define     IA32_VMX_EPT_VPID_CAP_MSR_UC_MEMORY_TYPE                            (UINT64)(0x0000000000000100)
#define     IA32_VMX_EPT_VPID_CAP_MSR_WB_MEMORY_TYPE                            (UINT64)(0x0000000000004000)
#define     IA32_VMX_EPT_VPID_CAP_MSR_PDE_2MB_PAGES                             (UINT64)(0x0000000000010000)
#define     IA32_VMX_EPT_VPID_CAP_MSR_PDPTE_1GB_PAGES                           (UINT64)(0x0000000000020000)
#define     IA32_VMX_EPT_VPID_CAP_MSR_INVEPT_SUPPORTED                          (UINT64)(0x0000000000100000)
#define     IA32_VMX_EPT_VPID_CAP_MSR_ACCESSED_DIRTY_FLAG                       (UINT64)(0x0000000000200000)
#define     IA32_VMX_EPT_VPID_CAP_MSR_EPT_VIOLATION_ADVANCED_EXIT_INFO          (UINT64)(0x0000000000400000)
#define     IA32_VMX_EPT_VPID_CAP_MSR_SUPERVISOR_SHADOW_STACK_CONTROL           (UINT64)(0x0000000000800000)
#define     IA32_VMX_EPT_VPID_CAP_MSR_SINGLE_CONTEXT_INVEPT                     (UINT64)(0x0000000002000000)
#define     IA32_VMX_EPT_VPID_CAP_MSR_ALL_CONTEXT_INVEPT                        (UINT64)(0x0000000004000000)
#define     IA32_VMX_EPT_VPID_CAP_MSR_INVVPID                                   (UINT64)(0x0000000100000000)
#define     IA32_VMX_EPT_VPID_CAP_MSR_INDIVIDUAL_ADDRESS_INVVPID                (UINT64)(0x0000010000000000)
#define     IA32_VMX_EPT_VPID_CAP_MSR_SINGLE_CONTEXT_INVVPID                    (UINT64)(0x0000020000000000)
#define     IA32_VMX_EPT_VPID_CAP_MSR_ALL_CONTEXT_INVVPID                       (UINT64)(0x0000040000000000)
#define     IA32_VMX_EPT_VPID_CAP_MSR_SINGLE_CONTEXT_GLOBAL_INVVPID             (UINT64)(0x0000080000000000)

typedef union _msr_vmx_ept_vpid_cap
{
    u64 value;
    struct
    {
        // RWX support
        //
        u64 ept_xo_support : 1;
        u64 ept_wo_support : 1;
        u64 ept_wxo_support : 1;
        
        // Guest address width support
        //
        u64 gaw_21 : 1;
        u64 gaw_30 : 1;
        u64 gaw_39 : 1;
        u64 gaw_48 : 1;
        u64 gaw_57 : 1;
        
        // Memory type support
        u64 uc_memory_type : 1;
        u64 wc_memory_type : 1;
        u64 rsvd0 : 2;
        u64 wt_memory_type : 1;
        u64 wp_memory_type : 1;
        u64 wb_memory_type : 1;
        u64 rsvd1 : 1;
        
        // Page size support
        u64 pde_2mb_pages : 1;
        u64 pdpte_1gb_pages : 1;
        u64 pxe_512gb_page : 1;
        u64 pxe_1tb_page : 1;
        
        // INVEPT support
        u64 invept_supported : 1;
        u64 ept_accessed_dirty_flags : 1;
        u64 ept_violation_advanced_information : 1;
        u64 supervisor_shadow_stack_control : 1;
        u64 individual_address_invept : 1;
        u64 single_context_invept : 1;
        u64 all_context_invept : 1;
        u64 rsvd2 : 5;
        
        // INVVPID support
        u64 invvpid_supported : 1;
        u64 rsvd7 : 7;
        u64 individual_address_invvpid : 1;
        u64 single_context_invvpid : 1;
        u64 all_context_invvpid : 1;
        u64 single_context_global_invvpid : 1;
        u64 rsvd8 : 20;
    } bits;
} msr_vmx_ept_vpid_cap;

boolean_t is_ept_available( void )
{
    msr_vmx_ept_vpid_cap cap_msr;
    cap_msr.value = __readmsr(IA32_VMX_EPT_VPID_CAP_MSR_ADDRESS);
    
    if( !cap_msr.bits.ept_xo_support             ||
        !cap_msr.bits.gaw_48                     ||
        !cap_msr.bits.wb_memory_type             ||
        !cap_msr.bits.pde_2mb_pages              ||
        !cap_msr.bits.pdpte_1gb_pages            ||
        !cap_msr.bits.invept_supported           ||
        !cap_msr.bits.single_context_invept      ||
        !cap_msr.bits.all_context_invept         ||
        !cap_msr.bits.invvpid_supported          ||
        !cap_msr.bits.individual_address_invvpid ||
        !cap_msr.bits.single_context_invvpid     ||
        !cap_msr.bits.all_context_invvpid        ||
        !cap_msr.bits.single_context_global_invvpid )
    {
        return FALSE;
    }
    
    return TRUE;
}

The above code is intended to be placed into your project based on your layout. I included the macros for the bitmasks in case using the structure to represent the MSR is not as clean as desired. The function is_ept_available is intended to be called prior to setting the processor controls in the primary and secondary controls. Though we won't get into handling CR3-load exiting in this article, the two controls of interest for now are enable_vpid and enable_ept in the secondary processor controls field. You should branch based on the result of the previous function. If all is well and the processor supports the required features (which can be adjusted at your discretion), we'll need to set up the EPT data structures. However, before we do that we have to take a little detour to explain the use of VPIDs.
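For completeness, the usual adjustment pattern for any VMX capability MSR might look like the sketch below; the helper name is my own, and the commented usage reuses the illustrative defines from earlier:

// Force on the bits the allowed 0-settings (low 32 bits) require to be 1, and
// force off the bits the allowed 1-settings (high 32 bits) require to be 0.
static UINT32 adjust_controls( UINT32 desired, UINT32 capability_msr )
{
    UINT64 cap = __readmsr(capability_msr);

    desired |= (UINT32)(cap & 0xFFFFFFFF);  // allowed 0-settings
    desired &= (UINT32)(cap >> 32);         // allowed 1-settings

    return desired;
}

// e.g. requesting EPT and VPID in the secondary processor controls:
// secondary_ctls = adjust_controls(CTL2_ENABLE_EPT | CTL2_ENABLE_VPID,
//                                  IA32_VMX_PROCBASED_CTLS2);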

— Virtual Processor Identifiers and Process-Context Identifiers

Back in 2008, Intel added a new cache hierarchy alongside some very important changes to the TLB hierarchy used to cache virtual-to-physical address mappings. There were more involved changes, but what is relevant for our purposes is that the Intel Nehalem microarchitecture introduced the virtual processor identifier (VPID). As we know from the previous article, the TLB caches virtual-to-physical address translations for pages. The mapping cached in the TLB is specific to a task and guest (VM). On older processors, the TLB would be flushed incessantly as the processor switched between the VM and VMM, which had a massive impact on performance. The VPID tracks which guest a given translation entry in the TLB is associated with, giving the hardware the ability to selectively invalidate caches on VM-exit and VM-entry and removing the requirement of flushing the TLB for coherence and isolation.

For example, if a process attempts to access a translation that it isn't associated with, the result is a TLB miss rather than an access violation when walking through the page tables. VPIDs were introduced to improve the performance of VM transitions. Coupled with EPT, which further reduced VM transition overhead (because the VMM no longer had to service every #PF itself), you begin to see a reduction in VM exits and a significant improvement in virtualization performance. This feature brought with it new instructions to allow software to invalidate TLB mappings associated with a VPID; the instruction is documented as invvpid. Similarly, EPT introduced the invept instruction, which allows software to invalidate cached information derived from the EPT paging structures. To review some other technical details, please refer to the previous article.

Alongside the VPID technology, a hardware feature known as the process-context identifier (PCID) was introduced. PCIDs enable the hardware to “cache information for multiple linear-address spaces.” This means a processor can maintain cached data when software switches to a different address space with a different PCID. This was added at the same time in order to mitigate the performance impacts of TLB flushes due to context switching, and in a similar fashion to VPIDs, the instruction invpcid was added so that software may invalidate cached mappings in the TLBs associated with a specific PCID.

The main takeaway is that these features allow software to skip flushing the TLB when performing a context switch. Without them, TLB flushes occur on VM entry and VM exit due to the address-space change (i.e., the reload of CR3). VPIDs support retention of TLB entries across VM switches and therefore provide a performance improvement. Prior to this hardware feature the TLB mapped linear address → physical address; utilizing VPID, the TLB maps {VPID, linear address} → physical address. Host software runs with a VPID of 0, and each guest is assigned a non-zero VPID by the VMM. Note that some VMM implementations running on modern hardware leave the guest with a VPID of 0, which means a TLB flush will occur on every VM entry and VM exit.
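
Assigning the guest’s VPID is a single VMCS write once the enable_vpid control is set. A minimal sketch — the macro name is my own, but 0000h is the documented encoding of the VPID field:

#define VMCS_CTRL_VIRTUAL_PROCESSOR_ID 0x0000

// VPID 0 is reserved for the host; give the guest any non-zero,
// per-guest-unique value.
__vmx_vmwrite( VMCS_CTRL_VIRTUAL_PROCESSOR_ID, 1 );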

  Regarding PCID and VPID

As noted in the Intel SDM, software can use PCIDs and VPIDs concurrently; for this project, we will not concern ourselves with the use of PCIDs. If you would like to tinker with this you can find details on how to enable PCIDs in §4.10.1 Vol. 3A of the Intel SDM.

For now, this is all that’s necessary to keep in the back of your mind. The next part is going to be pretty excerpt-heavy, with descriptions and reasoning for collecting the information. Let’s get on to MTRRs, and then we’ll finally be ready to set up our EPT context.

— MTRRs

Memory type range registers (MTRRs) were briefly discussed in the first article of this series. In the simplest sense, these registers are used to associate memory caching types with physical-address ranges in system memory. They’re initialized by the BIOS (usually) and are intended to optimize accesses to the various kinds of memory: RAM, ROM, frame buffers, MMIO, SMRAM, etc. These memory types are programmed through a series of model-specific registers which define the type of memory for a given range of physical memory. There are a handful of memory types, and if you’re familiar with the general theory of caching you’ll recall that there are 3 different levels of caches the processor may use for memory operations. The memory type specified for a region of system memory influences whether those locations are cached and what memory-ordering model applies to them. In this subsection, memory type and cache type refer to the same thing. We’ll address the individual memory types below.

PAT preference over MTRR

This section is a moderate overview of how the BIOS/UEFI firmware sets up MTRRs during boot, and it’s optional unless you’re interested in how the firmware determines memory types and programs the various MTRRs. It’s recommended that system developers use the Page Attribute Table (PAT) over the MTRRs. Feel free to skip ahead to the EPT hierarchies section.

𝛿 Strong Uncacheable (UC)

Any system memory marked as UC is not cached: every load and store to the region is passed through the memory-access path and executed in program order, without any reordering. No speculative memory accesses, page-table walks, or prefetches of speculated branch targets are made to the region. The memory controller performs the operation on DRAM at its default granularity (64 bytes is the typical minimum read size) but returns only the requested data to the processor, and nothing is propagated to any cache. Since accessing main memory (DRAM) is slow, using this memory type frivolously can significantly reduce the performance of the system. It’s typically used for MMIO device ranges and the BIOS region. The memory model for this memory type is referred to as strongly ordered.

𝛿 Uncacheable (UC- or UC Minus)

This memory type has the same properties as the UC type, except that it can be overridden by WC if the MTRRs are updated. It can only be selected through the page attribute table (PAT), which we will discuss following this section.

𝛿 Write Combine (WC)

This memory type is primarily used for GPU memory, frame buffers, and the like, because the order of writes isn’t important to the display of the data. It operates similarly to UC in that the memory locations aren’t cached and coherency isn’t enforced. For instance, if you use a GPU API to map a buffer or texture into memory, you can bet that memory will be marked write combine (WC). An interesting behavior is what happens when a read is performed: the read is treated as if it were performed on an uncached location — all write-combining buffers get flushed to main memory (oof), and then the read completes without any cache references. This means that frequent reads from WC memory will impact performance, much like with UC (because they behave as if the memory were UC).

There’s not really a great reason to read from WC memory, and reading back-buffers or constant buffers is usually advised against for this reason. If you want to write to WC memory, you need to make sure your compiler doesn’t reorder the writes (hint: volatile). You also don’t want to write to individual memory locations in WC memory – if you’re writing to a WC range, write the whole range. It’s better to issue one large write than a bunch of small ones — less of a performance impact when modifying WC memory. Alignment, access-width, and other rules may be in place – so whether Intel or AMD, check your manual.

(For those reading that like to make game hacks and have issues with the perf of your “hardware ESP”, maybe this will jog your brain.)
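
A minimal sketch of batching writes to a WC-mapped buffer with non-temporal stores — wc_dst is a hypothetical pointer into WC memory, and both pointers are assumed to be 16-byte aligned:

#include <stddef.h>
#include <emmintrin.h>

void wc_write_block( void* wc_dst, const void* src, size_t size )
{
    __m128i* d = ( __m128i* )wc_dst;
    const __m128i* s = ( const __m128i* )src;

    // Non-temporal stores combine nicely in the WC buffers and avoid
    // polluting the cache hierarchy.
    for( size_t i = 0; i < size / sizeof( __m128i ); i++ )
        _mm_stream_si128( &d[ i ], _mm_load_si128( &s[ i ] ) );

    // One fence at the end makes the combined writes globally visible.
    _mm_sfence();
}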

𝛿 Write Through (WT)

With this cache type, memory operations are cached: reads come from cache lines on a cache hit, and misses cause cache fills. You can see an explanation of read + fill in the previous article. The biggest thing to note about the write through (WT) type is that writes are propagated to the cache line and also written through to memory. This type enforces coherency between the caches and main memory.

𝛿 Write Back (WB)

This is the most common memory type throughout the ranges on your machine, as it is the most performant. Memory operations are cached, speculative operations are allowed, however, writes to a cache line are not forwarded to system memory; they’re propagated to the cache and the modified cache lines are written back to main memory when a write-back operation occurs. It enforces memory and cache coherency, and requires devices that may access memory on the system bus to be able to snoop memory accesses. This allows low latency and high throughput for write-intensive tasks.

  Bus Snooping

The term bus snooping used to mean a device was sniffing the bus (monitoring bus transactions) to be aware of changes that may have occurred when requesting a cache line. In modern systems, it’s a bit different. If you’re interested in how cache coherency is maintained on modern systems you can look at the recommended reading section, and/or the patents under the cache coherency classification here. Additionally, the Intel Patent linked here.

𝛿 Write Protected (WP)

This caching type propagates writes to the interconnect (shared bus) and causes the relevant cache lines on all processors to be invalidated, whereas reads fetch data from cache lines when available. This memory type is usually intended for caching ROM so that reads don’t have to reach out to the ROM itself.

Now that we’ve discussed the different memory types available to the system programmer, let’s implement our MTRR API so we can appropriately set our memory types when we begin allocating memory for EPT.

— MTRR Implementation

With MTRRs, whether programming them or reading them for information, we’re going to be using a number of model-specific registers (MSRs) that Intel documents. The main two of interest are IA32_MTRR_CAP_MSR and IA32_MTRR_DEF_TYPE_MSR. The MTRR capabilities MSR (IA32_MTRR_CAP_MSR) is used to gather information such as the number of variable-range MTRRs implemented by the hardware, whether fixed-range MTRRs are supported, and whether write-combining is supported. There are some other flags, but they aren’t of interest for this article. The structure for this MSR is given below.

typedef union _ia32_mtrrcap_msr
{
    u64 value;
    struct
    {
        u64 vcnt : 8;
        u64 fr_mtrr : 1;
        u64 rsvd0 : 1;
        u64 wc : 1;
        u64 smrr : 1;
        u64 prmrr : 1;
        u64 rsvd1 : 51;
    } bits;
} ia32_mtrrcap_msr;

The MTRR default type MSR (IA32_MTRR_DEF_TYPE_MSR) provides the default cache properties for physical memory that is not covered by the MTRRs. It also allows the software programming the MTRRs to determine whether MTRRs and the associated fixed ranges are enabled. Here is the structure I use.

typedef union _ia32_mtrr_def_type_msr
{
    u64 value;
    struct
    {
        u64 type : 8;
        u64 rsvd0 : 2;
        u64 fe : 1;
        u64 en : 1;
        u64 rsvd1 : 52;
    } bits;
} ia32_mtrr_def_type_msr;

MTRRs come in two flavors, fixed-range and variable-range. On Intel there are 11 fixed-range registers, each divided into eight 8-bit fields that specify the memory type for each sub-range the register covers. The table below depicts how each fixed-range MTRR is divided to cover its respective address ranges.

Figure 4. Bit-field layout for fixed-range MTRRs

Knowing the mapping for each of these type range registers allows us to develop an algorithm to determine which fixed range an address falls under, if any. We’ll achieve this by defining a few base points to compare the address against. As you can see, the first MTRR is named IA32_MTRR_FIX64K_00000 and, based on the address ranges covered by its bit-fields, maps 512 KiB from 00000h to 7FFFFh through eight 64 KiB sub-ranges (see the table above). The IA32_MTRR_FIX16K_80000 and IA32_MTRR_FIX16K_A0000 MTRRs each map a 128 KiB range through eight 16 KiB sub-ranges, together covering 80000h to BFFFFh. Finally, the eight FIX4K MTRRs each cover a 32 KiB range through eight 4 KiB sub-ranges, together mapping 256 KiB from C0000h to FFFFFh.

 MTRR Ranges

I’ve been unable to determine the exact reasoning for the layout of MTRRs, but my best guess would be the physical memory map left behind after the BIOS transfers control. The first two MTRRs cover the 640 KiB referred to as conventional memory in early PCs (real-mode IVT, BIOS data area, bootloader, and so on), while the remaining registers cover the upper 384 KiB traditionally used for ROM shadowing and device memory. For instance, the 64 KiB range A0000h to AFFFFh typically houses the graphics video memory, and the 32 KiB range C0000h to C7FFFh normally contains the VGA BIOS ROM / video ROM — though the sub-ranges may require different memory types.

With this in mind let’s define a few things like the MTRR MSRs, cache type encodings, and the start addresses for each range covered, which a given address will be compared against to determine if it falls within.

#define CACHE_MEMORY_TYPE_UC                 0x0000
#define CACHE_MEMORY_TYPE_WC                 0x0001
#define CACHE_MEMORY_TYPE_WT                 0x0004
#define CACHE_MEMORY_TYPE_WP                 0x0005
#define CACHE_MEMORY_TYPE_WB                 0x0006
#define CACHE_MEMORY_TYPE_UC_MINUS           0x0007
#define CACHE_MEMORY_TYPE_ERROR              0x00FE     /* user-defined */
#define CACHE_MEMORY_TYPE_RESERVED           0x00FF

#define IA32_MTRR_CAP_MSR                    0x00FE
#define IA32_MTRR_DEF_TYPE_MSR               0x02FF

#define IA32_MTRR_FIX64K_00000_MSR           0x0250
#define IA32_MTRR_FIX16K_80000_MSR           0x0258
#define IA32_MTRR_FIX16K_A0000_MSR           0x0259
#define IA32_MTRR_FIX4K_C0000_MSR            0x0268
#define IA32_MTRR_FIX4K_C8000_MSR            0x0269
#define IA32_MTRR_FIX4K_D0000_MSR            0x026A
#define IA32_MTRR_FIX4K_D8000_MSR            0x026B
#define IA32_MTRR_FIX4K_E0000_MSR            0x026C
#define IA32_MTRR_FIX4K_E8000_MSR            0x026D
#define IA32_MTRR_FIX4K_F0000_MSR            0x026E
#define IA32_MTRR_FIX4K_F8000_MSR            0x026F

#define MTRR_FIX64K_BASE                     0x00000
#define MTRR_FIX16K_BASE                     0x80000
#define MTRR_FIX4K_BASE                      0xC0000
#define MTRR_FIXED_MAXIMUM                   0xFFFFF

#define MTRR_FIXED_RANGE_ENTRIES_MAX         88
#define MTRR_VARIABLE_RANGE_ENTRIES_MAX      255

Now, let’s derive a function to get the memory type of an address that falls within a fixed-range.

static u8 mtrr_index_fixed_range( u32 msr_address, u32 idx )
{
    // Select the MSR covering this index (eight 8-bit fields per MSR),
    // then shift the proper field down and truncate to one byte.
    //
    u64 msr_val = __readmsr( msr_address + ( idx >> 3 ) );
    return ( u8 )( msr_val >> ( ( idx & 7 ) << 3 ) );
}

static u8 mtrr_get_fixed_range_type( u64 address, u64* size )
{
    ia32_mtrrcap_msr mtrrcap = { 0 };
    ia32_mtrr_def_type_msr mtrrdef = { 0 };
    
    mtrrcap.value = __readmsr( IA32_MTRR_CAP_MSR );
    mtrrdef.value = __readmsr( IA32_MTRR_DEF_TYPE_MSR );
    
    // Check that fixed-range MTRRs are supported and enabled, and that
    // the address is within the ranges covered by fixed-range MTRRs.
    //
    if( !( mtrrcap.bits.fr_mtrr ) || !( mtrrdef.bits.en ) ||
        !( mtrrdef.bits.fe )      || address > MTRR_FIXED_MAXIMUM )
        return CACHE_MEMORY_TYPE_RESERVED;
    
    // Check if address is within the FIX64K range.
    //
    if( address < MTRR_FIX16K_BASE ) 
    {
        *size = PAGE_SIZE << 4; /* 64KB */
        return mtrr_index_fixed_range( IA32_MTRR_FIX64K_00000_MSR, address / ( PAGE_SIZE << 4 ) );
    }
    
    // Check if address is within the FIX16K range.
    //
    if( address < MTRR_FIX4K_BASE ) 
    {
        address -= MTRR_FIX16K_BASE;
        *size = PAGE_SIZE << 2; /* 16 KB */
        return mtrr_index_fixed_range( IA32_MTRR_FIX16K_80000_MSR, address / ( PAGE_SIZE << 2 ) );
    }
    
    // If we're not in any of those ranges, we're in the FIX4K range.
    //
    address -= MTRR_FIX4K_BASE;
    *size = PAGE_SIZE;
    
    return mtrr_index_fixed_range( IA32_MTRR_FIX4K_C0000_MSR, address / PAGE_SIZE );
}

The function above uses the relevant MSRs and MTRRs to determine whether a given address falls within a fixed range. mtrr_get_fixed_range_type captures the current values of the MTRR capability and MTRR default type MSRs, then uses the bitfields from the structures defined earlier to verify that fixed-range MTRRs are supported and enabled, and that the address falls below the maximum address covered by the fixed ranges. It then compares the provided address against the start addresses of the ranges – MTRR_FIX16K_BASE, which starts at 80000h, for instance. The first expression checks whether the address falls within the 64K fixed range by checking if it’s less than 80000h, then sets the size of the range to 64K (or whatever the relevant size for the range is). Remember that the 64K range comprises eight 64 KiB sub-ranges. The helper function above takes the base MSR and an index that selects the bitfield from which to take the memory type. Let’s briefly walk through one conditional block and the helper, as the same logic applies to the others.

Given the address 81A00h passed through this function, we’ll wind up branching into this conditional block:

// Check if address is within the FIX16K range.
//
if( address < MTRR_FIX4K_BASE ) 
{
    address -= MTRR_FIX16K_BASE;
    *size = PAGE_SIZE << 2;
    return mtrr_index_fixed_range( IA32_MTRR_FIX16K_80000_MSR, address / ( PAGE_SIZE << 2 ) );
}

This is because the address 81A00h is less than the start address of the fixed 4K range, but not lower than the fixed 16K range start. Inside this conditional block, the base of the fixed range (MTRR_FIX16K_BASE) is subtracted from the address to determine the offset into the range. The size of the range is then set to PAGE_SIZE << 2, which is just PAGE_SIZE (1000h) * 4, yielding 16 KiB. We then use the fixed-range MSR for the first 16K MTRR, and divide the offset by the size of the range, which gives us the index into the MSR bitfield once read. The same index is also used to determine which of the consecutive MSRs should be read. The shifts will be explained as we go through the helper function.

static u8 mtrr_index_fixed_range( u32 msr_address, u32 idx )
{
    // Select the MSR covering this index (eight 8-bit fields per MSR),
    // then shift the proper field down and truncate to one byte.
    //
    u64 msr_val = __readmsr( msr_address + ( idx >> 3 ) );
    return ( u8 )( msr_val >> ( ( idx & 7 ) << 3 ) );
}

The helper function above reads from the MSR address, IA32_MTRR_FIX16K_80000_MSR in this case, after adding the index divided by 8 (so indices 8–15 select the next consecutive MSR). Here the index is derived from the expression in the conditional block – address / ( PAGE_SIZE << 2 ) – which expands to 1A00h / 4000h → 0. This means it reads from the MSR address given and indexes into that MSR’s bitfield (refer to the earlier diagram) using the value 0. This makes sense, as the address 81A00h falls within the first field (0th index) of the IA32_MTRR_FIX16K_80000 MTRR, which covers physical addresses 80000h to 83FFFh. It then takes the MSR value, which reads as 06060606`06060606h, and shifts it right by the field index multiplied by 8 – which is 0 here – meaning it takes the value 6h from the first byte. The memory type corresponding to the value 6h is CACHE_MEMORY_TYPE_WB per our earlier definitions. If this is confusing to follow, I’ve provided a diagram below using the same address, as well as an address that falls within a fixed 4K range.

Figure 5. Calculating memory type for physical address using MTRRs.
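
To tie it together, a quick usage sketch with the values from the worked example (the WB result reflects the example machine):

u64 range_size = 0;
u8 type = mtrr_get_fixed_range_type( 0x81A00, &range_size );

// type == CACHE_MEMORY_TYPE_WB (6h), range_size == 0x4000 (16 KiB)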

The fixed ranges are pretty straightforward, since they have easily indexable MSRs. Hopefully the example cleared up any potential confusion about how the memory type is calculated for these ranges. Now that we’ve gone over fixed-range MTRRs, we need to construct an algorithm for determining the memory type of a variable-range MTRR. And yes, there’s more to them… Each variable-range MTRR allows software to specify a memory type for a variable number of address ranges. This is done through a pair of MSRs for each range. How do we determine the number of variable ranges our platform supports? Recall the IA32_MTRRCAP_MSR structure.

typedef union _ia32_mtrrcap_msr
{
    u64 value;
    struct
    {
        u64 vcnt : 8;
        u64 fr_mtrr : 1;
        u64 rsvd0 : 1;
        u64 wc : 1;
        u64 smrr : 1;
        u64 prmrr : 1;
        u64 rsvd1 : 51;
    } bits;
} ia32_mtrrcap_msr;

The first 8 bits of the bitfield are allocated to the vcnt member, which indicates the number of variable ranges implemented on the processor. We’ll need to remember this for use in our function. It was mentioned that MSR pairs are provided for programming the memory type of these variable ranges – these are referred to as IA32_MTRR_PHYSBASEn and IA32_MTRR_PHYSMASKn, where “n” is a value in the range 0 to (vcnt - 1). The MSR addresses for these pairs are provided below.

#define IA32_MTRR_PHYSBASE0_MSR              0x0200
#define IA32_MTRR_PHYSMASK0_MSR              0x0201

#define IA32_MTRR_PHYSBASE1_MSR              0x0202
#define IA32_MTRR_PHYSMASK1_MSR              0x0203 
     
#define IA32_MTRR_PHYSBASE2_MSR              0x0204
#define IA32_MTRR_PHYSMASK2_MSR              0x0205
   
#define IA32_MTRR_PHYSBASE3_MSR              0x0206
#define IA32_MTRR_PHYSMASK3_MSR              0x0207 
          
#define IA32_MTRR_PHYSBASE4_MSR              0x0208
#define IA32_MTRR_PHYSMASK4_MSR              0x0209 
         
#define IA32_MTRR_PHYSBASE5_MSR              0x020A
#define IA32_MTRR_PHYSMASK5_MSR              0x020B

#define IA32_MTRR_PHYSBASE6_MSR              0x020C
#define IA32_MTRR_PHYSMASK6_MSR              0x020D

#define IA32_MTRR_PHYSBASE7_MSR              0x020E
#define IA32_MTRR_PHYSMASK7_MSR              0x020F
          
#define IA32_MTRR_PHYSBASE8_MSR              0x0210
#define IA32_MTRR_PHYSMASK8_MSR              0x0211
         
#define IA32_MTRR_PHYSBASE9_MSR              0x0212
#define IA32_MTRR_PHYSMASK9_MSR              0x0213

Each of these MSRs has a specific layout, both of them are defined below.

typedef union _ia32_mtrr_physbase_msr
{
    u64 value;
    struct
    {
        u64 type : 8;
        u64 rsvd0 : 4;
        u64 physbase_lo : 39;
        u64 rsvd1 : 13;
    } bits;
} ia32_mtrr_physbase_msr;

typedef union _ia32_mtrr_physmask_msr
{
    u64 value;
    struct
    {
        u64 rsvd0 : 11;
        u64 valid : 1;
        u64 physmask_lo : 39;
        u64 rsvd1 : 13;
    } bits;
} ia32_mtrr_physmask_msr;

  Overlapping Ranges

It’s possible for variable-range MTRRs to overlap an address range described by another variable-range MTRR. It’s important that the reader look over §11.11.4.1 MTRR Precedences (Intel SDM Vol. 3A) and ensure these rules are followed when determining the memory type of an address within a variable range. The sketch below points out where the precedence rules apply — make sure you understand why.
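
Here’s a minimal sketch of variable-range resolution under those rules, reusing the structures and defines from above. Keep in mind that the fixed ranges take precedence below 100000h, and a caller that receives CACHE_MEMORY_TYPE_RESERVED (no match) should fall back to the default type in IA32_MTRR_DEF_TYPE_MSR:

static u8 mtrr_get_variable_range_type( u64 address )
{
    ia32_mtrrcap_msr mtrrcap;
    ia32_mtrr_physbase_msr physbase;
    ia32_mtrr_physmask_msr physmask;
    u8 type = CACHE_MEMORY_TYPE_RESERVED;

    mtrrcap.value = __readmsr( IA32_MTRR_CAP_MSR );

    for( u32 n = 0; n < mtrrcap.bits.vcnt; n++ )
    {
        physbase.value = __readmsr( IA32_MTRR_PHYSBASE0_MSR + ( n * 2 ) );
        physmask.value = __readmsr( IA32_MTRR_PHYSMASK0_MSR + ( n * 2 ) );

        // Skip pairs that aren't marked valid.
        //
        if( !physmask.bits.valid )
            continue;

        // The base/mask fields hold bits 12+ of the physical address; a
        // range matches when (address & mask) == (base & mask).
        //
        u64 mask = physmask.bits.physmask_lo << 12;
        u64 base = physbase.bits.physbase_lo << 12;

        if( ( address & mask ) != ( base & mask ) )
            continue;

        u8 range_type = ( u8 )physbase.bits.type;

        // Precedence rules: UC wins outright; WT wins over WB; any
        // other conflicting overlap is undefined, so flag it.
        //
        if( range_type == CACHE_MEMORY_TYPE_UC )
            return CACHE_MEMORY_TYPE_UC;

        if( type == CACHE_MEMORY_TYPE_RESERVED || type == range_type )
            type = range_type;
        else if( ( type == CACHE_MEMORY_TYPE_WT && range_type == CACHE_MEMORY_TYPE_WB ) ||
                 ( type == CACHE_MEMORY_TYPE_WB && range_type == CACHE_MEMORY_TYPE_WT ) )
            type = CACHE_MEMORY_TYPE_WT;
        else
            type = CACHE_MEMORY_TYPE_ERROR;
    }

    return type;
}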

If you’re interested in how the variable-range MTRRs and PAT are initialized by the hardware/BIOS/firmware, I highly recommend the section of the manual referenced in the note above, or see the recommended reading for material on setting up memory types during the early boot stages. This section was initially going to cover the entire initialization, but since that’s out of scope for this series and using the PAT is recommended, I’ve cut the remainder to reduce the length of this article. If there’s interest in the process of setting them up, I could do a spin-off article about it. In any case, let’s move on to the EPT hierarchies and get our structures updated to facilitate EPT initialization.

— EPT Page Hierarchies

Once the features have been determined to be available we’re going to want to initialize our EPT pointer. This article will only cover the initialization of a single page hierarchy. In a future article, we will cover the initialization of multiple EPT pointers to allow for a switching method that utilizes numerous page hierarchies, as opposed to the standard page-switching that occurs upon EPT violations you may have read about.

There are a number of ways to design a hypervisor: some may choose to associate EPT data only with the vCPU structure, while others may take a more decoupled approach and keep a host-side EPT state structure that tracks all guest EPT states through a global linked list with accessors. For the sake of simplicity, this article stores the data structures in the vCPU structure, to be initialized during the MP-init phase of your hypervisor. The EPT data structure to be added to your vCPU structure is given below.

typedef struct _ept_state
{
    u64 eptp;
    p64 topmost_ps;
    u64 gaw;
} ept_state, *pept_state;

Only the members relevant to this article are presented; the structure can be extended in the future to support more than one EPTP and topmost paging structure. The gaw member is the guest address width value, which is important to know when performing a page walk over the EPT hierarchy. You’ll need to allocate this data structure, as you would any other, in your stand-up functions prior to vmxon. If you’re wondering why there are members for both the EPTP and the topmost paging structure: the EPT pointer has a specific format that contains the physical address of the topmost paging structure (in this case, the PML4) along with configuration information such as the memory type, walk length, and so on.

pept_state vcpu_ept_data = mem_allocate( sizeof( ept_state ) );
zeromemory_s(vcpu_ept_data, sizeof( ept_state ) );

//
// Initialization of the single EPT page hierarchy.
//
// ...
//

At this point, we need to allocate our EPT page hierarchy. This will require standing up our own PML4 table and initializing our EPTP properly. Allocation of our PML4 table is done just like it would be for any other page:

typedef union _physical_address
{
    struct
    {
        u32 low;
        i32 high;
    };
    struct
    {
        u32 low;
        i32 high;
    } upper;
    
    i64 quad;
} physical_address;

static p64 eptm_allocate_entry( physical_address* pa )
{
    p64 pxe = mem_allocate( PAGE_SIZE );
    
    if( !pxe )
        return NULL;
    
    zeromemory_s( pxe, PAGE_SIZE );
    
    // Translate allocated entry virtual address to physical.
    //
    *pa = mem_vtop( pxe );
    
    // Return virtual address of our new entry.
    //
    return pxe;
}

  Custom Address Translation

The mem_vtop function uses a custom address translation/page walker, however, it may be better suited for your first run through to use MmGetPhysicalAddress on the returned virtual address. Implementing your own address translation and page walker isn’t necessary for this basic setup utilizing EPT, but I will include it toward the end of the article as extra reading material.
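
For reference, a minimal stand-in for mem_vtop using that WDK routine might look like this, assuming the physical_address union defined above:

#include <ntddk.h>

static physical_address mem_vtop( void* va )
{
    physical_address pa;

    // MmGetPhysicalAddress returns a PHYSICAL_ADDRESS (LARGE_INTEGER).
    //
    pa.quad = MmGetPhysicalAddress( va ).QuadPart;
    return pa;
}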

Your ept_initialize function should look something like this at this point.

// Allocate and initialize prior to vmxon and after feature availability check.
//
pept_state vcpu_ept_data = mem_allocate( sizeof( ept_state ) ); 
zeromemory_s( vcpu_ept_data, sizeof( ept_state ) ); 

// Initialization of the single EPT page hierarchy. 
//
vcpu_ept_data->gaw = 3; /* 4-level EPT page walk; the EPTP field encodes (walk length - 1) */
ret = eptm_initialize_pt( vcpu_ept_data );

if( ret != 0 )
    eptm_release_resources( vcpu_ept_data );

vcpu->ept_state = vcpu_ept_data;

///////////////////////////////// eptm_initialize_pt definition below /////////////////////////////////

// Initialization of page tables associated with our EPTP.
//
vmm_status_t eptm_initialize_pt( pept_state ept_state )
{
    p64 ept_topmost;
    physical_address ept_topmost_pa;
    vmm_status_t ret;
    
    ret = 0;
    
    ept_topmost = eptm_allocate_entry( &ept_topmost_pa );
    if( !ept_topmost )
        return VMM_STATUS_MEM_ALLOC_FAILED;
    
    ept_state->topmost_ps = ept_topmost;
    
    // Initialize the EPT pointer and store it in our EPT state
    // structure.
    //
    // ...
    //
    
    //
    // Construct identity mapping for EPT page hierarchy w/ default
    // page size granularity (4kB).
    //
    // ...
    //
}

The next step is to construct our EPTP and store it in the ept_state structure for later insertion into the VMCS. We’ll first need the structure defined that represents the EPTP format.

typedef union _eptp_format
{
    u64 value;
    struct
    {
        u64 memory_type : 3;
        u64 guest_address_width : 3;
        u64 ad_flag_enable : 1;
        u64 ar_enforcement_ssp : 1;
        u64 rsvd0 : 4;
        u64 ept_pml4_pfn : 40;
        u64 rsvd1 : 12;
    } bits;
} eptp_format;

Once defined we’ll adjust the eptm_initialize_pt function and initialize our EPT pointer.

vmm_status_t eptm_initialize_pt( pept_state ept_state )
{
    p64 ept_topmost;
    physical_address ept_topmost_pa;
    eptp_format eptp;
    vmm_status_t ret;
    
    ret = 0;
    
    ept_topmost = eptm_allocate_entry( &ept_topmost_pa );
    if( !ept_topmost )
        return VMM_STATUS_MEM_ALLOC_FAILED;
    
    ept_state->topmost_ps = ept_topmost;
    
    // Initialize the EPT pointer and store it in our EPT state
    // structure.
    //
    eptp.value = ept_topmost_pa.quad;
    eptp.bits.memory_type = EPT_MEMORY_TYPE_WB;
    eptp.bits.guest_address_width = ept_state->gaw;
    eptp.bits.rsvd0 = 0;
    
    ept_state->eptp = eptp.value;
    
    //
    // Construct identity mapping for EPT page hierarchy w/ default
    // page size granularity (4kB).
    //
    // ...
    //
}

We’ve now successfully set up our topmost paging structure (the EPT PML4 table), and our EPT pointer is formatted for use. All that’s left is to construct the identity mapping permitting all page accesses for our EPT page hierarchy – however, this requires us to cover the differences between the normal paging structures and EPT paging structures.
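
Though VMCS setup is beyond the scope of this article, for reference the formatted EPTP eventually lands in the EPT_POINTER field during VMCS initialization; 201Ah is its documented encoding (the macro name is my own):

#define VMCS_CTRL_EPT_POINTER 0x201A

__vmx_vmwrite( VMCS_CTRL_EPT_POINTER, ept_state->eptp );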

— Paging Structure Differences

When utilizing EPT there are subtle changes in how things are structured. One of which is the differences in the page table entry structure. For every first-level page mapping structure (FL-PMEn), you’ll see a layout similar to this:

struct
{
    u64 present : 1;
    u64 rw : 1;
    u64 us : 1;
    u64 pwt : 1;
    u64 pcd : 1;
    u64 accessed : 1;
    u64 dirty : 1;
    u64 ps_pat : 1;
    u64 global : 1;
    u64 avl0 : 3;
    u64 pfn : 40;
    u64 avl1 : 7;
    u64 pkey : 4;
    u64 xd : 1;
} pte, pme;

Each field here is used by the page walker to perform address translation and to verify whether an operation on the page is valid. The fields are detailed in the Intel SDM Vol. 3A, Chapter 4 – this is just the definition used in my project, as I don’t fancy having masks everywhere for individual bits (so I use bitfields). The name pme simply means page mapping entry and is an internal term for my project, since all paging-structure entries follow a similar format. I use this structure for every table entry at all levels; the only difference is the reserved bits at each level, which you’ll either come to memorize or document yourself. Now, let’s take a look at what the page table entry structure looks like for EPT.

For each second-level page mapping entry (SL-PMEn), we see this layout:

struct
{
    u64 rd : 1;
    u64 wr : 1;
    u64 x : 1;
    u64 mt : 3;
    u64 ipat : 1;
    u64 avl0 : 1;
    u64 accessed : 1;
    u64 dirty : 1;
    u64 ex_um : 1;
    u64 avl1 : 1;
    u64 pfn : 39;
    u64 rsvd : 9;
    u64 sssp : 1;
    u64 sub_page_wr : 1;
    u64 avl2 : 1;
    u64 suppressed_ve : 1;
} epte, slpme;

The differences may not be immediately obvious, but the first three bits of this SL-PME control whether read, write, or execute (instruction fetch) accesses are allowed to the region it maps, whereas the first structure has bits for determining whether the page is present, whether read/write operations are allowed, and whether user-mode accesses are allowed. The differences become clear when we place the two formats atop one another, as below.

 

Figure 6. Format of a FL-PTE (top) and SL-PTE (bottom).

 

With this information, it’s helpful to derive a data structure to represent the two formats as this will make translation much easier later on. The data structure you create may look something like this:

typedef union _page_entry_t
{
    struct
    {
        u64 present : 1;
        u64 rw : 1;
        u64 us : 1;
        u64 pwt : 1;
        u64 pcd : 1;
        u64 accessed : 1;
        u64 dirty : 1;
        u64 ps_pat : 1;
        u64 global : 1;
        u64 avl0 : 3;
        u64 pfn : 40;
        u64 avl1 : 7;
        u64 pkey : 4;
        u64 xd : 1;
    } pte, flpme;
    
    struct
    {
        u64 rd : 1;
        u64 wr : 1;
        u64 x : 1;
        u64 mt : 3;
        u64 ipat : 1;
        u64 avl0 : 1;
        u64 accessed : 1;
        u64 dirty : 1;
        u64 ex_um : 1;
        u64 avl1 : 1;
        u64 pfn : 39;
        u64 rsvd : 9;
        u64 sssp : 1;
        u64 sub_page_wr : 1;
        u64 avl2 : 1;
        u64 suppressed_ve : 1;
    } epte, slpme;
    
    struct
    {
        u64 rd : 1;
        u64 wr : 1;
        u64 x : 1;
        u64 mt : 3;
        u64 ps_ipat : 1;
        u64 avl0 : 1;
        u64 accessed : 1;
        u64 dirty : 1;
        u64 avl1 : 1;
        u64 snoop : 1;
        u64 pa : 39;
        u64 rsvd : 13;
    } vtdpte;
} page_entry_t;

Using a union here allows me to cast once and reference the internal bitfield layout for whatever specific entry type is needed, as in the short example below. You will see this come into play as we initialize the remaining requirements for EPT in the next section.
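
For instance — table and idx here are hypothetical stand-ins for a paging-structure VA and a computed index:

page_entry_t* pxe = ( page_entry_t* )&table[ idx ];

// The second-level (EPT) view of the 64-bit entry...
//
if( pxe->epte.rd && pxe->epte.wr && pxe->epte.x )
{
    // entry permits all accesses
}

// ...versus the first-level view of an entry from the guest's own tables.
//
if( pxe->pte.present && !pxe->pte.xd )
{
    // page is present and executable
}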

  Requirements for First-Level and Second-Level Page Tables

Despite the differences in their entry formats, both require a top-level structure such as the PML4 or PML5, plus the respective sub-tables: PDPT, PDT, PT; or PML4, PDPT, PDT, PT (if PML5 is enabled).

— EPT Identity Mapping (4kB)

When it comes to paging there are a lot of interchanged terms, and identity mapping is one of them. It’s sometimes referred to as identity paging or direct mapping. I find the latter more confusing than the former, so throughout the remainder of this article identity mapping and identity paging refer to the same thing.

When a processor first enables paging, it must be executing code from an identity-mapped page — that is, software maps each virtual address to the same physical address. The identity mapping is achieved by initializing each page entry to point to its corresponding 4kB physical frame. It may be easier to understand through example, so here is the code for constructing the table and associated sub-tables for the guest with a 1:1 mapping to the host.

First, we’ll need a way to enumerate all available physical memory pages. We’re going to reference a global within ntoskrnl – nt!MmPhysicalMemoryBlock – which points to a physical memory descriptor (_PHYSICAL_MEMORY_DESCRIPTOR). The number of runs in this descriptor is given by the NumberOfRuns member, and there is an array under the Run member of type _PHYSICAL_MEMORY_RUN. Both of these structures are defined in the WDK headers; I’ve redefined them to fit the format of the other code.

typedef struct _physical_memory_run
{
    u64 base_page;
    u64 page_count;
} physical_memory_run, *pphysical_memory_run;

typedef struct _physical_memory_desc
{
    u32 num_runs;
    u64 num_pages;
    physical_memory_run run[1];
} physical_memory_desc, *pphysical_memory_desc;

pphysical_memory_desc mm_get_physical_memory_block( void )
{
    return get_global_poi( "nt!MmPhysicalMemoryBlock" );
}

The get_global_poi function is a helper that uses symbols to locate MmPhysicalMemoryBlock within ntoskrnl. Our objective now is to initialize EPT entries for all physical memory pages accounted for in this block. However, you may have noticed we’ve only allocated our top-level paging structure. To complete the above, we need a few more functions that acquire the additional paging structures (if they already exist) or allocate them. Recall that a page walk on a system with 4-level paging goes PML4 → PDPT → PDT → PT. We’ve allocated our PML4; now we need to determine whether an EPT entry already exists at each level or needs to be allocated. The logic is described in the diagram below, followed by the implementation of these functions with a brief explanation.

 

Figure 7. Flow of EPT hierarchy initialization.

As the diagram shows, parent functions drive initialization of the EPT hierarchy; referring back to eptm_initialize_pt from earlier, we’re going to complete the implementation by writing ept_create_mapping_4k and its associated functions. Within these you’ll see the traversal and validation of the additional paging structures: if the paging structure for the current level exists, we call mem_ptov on its physical address and operate on the virtual address returned; otherwise we construct a new EPT entry — lucky for us, we already have an allocation function defined. So, how will the other functions look? Let’s see them below (along with the helper macros and types they assume), and then how they fit into the bigger picture.
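
The snippets below lean on a few helpers that aren’t defined elsewhere in the article; minimal definitions, under the assumption of 4-level paging and p64 being a u64 pointer, might look like this:

typedef u64* p64;

// Hypothetical access-rights triplet for leaf EPTEs.
typedef struct _ept_access_rights
{
    u8 rd : 1;
    u8 wr : 1;
    u8 x  : 1;
} ept_access_rights;

// 9 bits of table index per level, starting at bit 39 for the PML4.
#define PML4_IDX( gpa )      ( ( ( u64 )( gpa ) >> 39 ) & 0x1FF )
#define PML3_IDX( gpa )      ( ( ( u64 )( gpa ) >> 30 ) & 0x1FF )
#define PML2_IDX( gpa )      ( ( ( u64 )( gpa ) >> 21 ) & 0x1FF )
#define PML1_IDX( gpa )      ( ( ( u64 )( gpa ) >> 12 ) & 0x1FF )

// Mask selecting the page-frame bits (12-51) of a table entry;
// PAGE_SHIFT and PAGE_SIZE come from the WDK headers.
#define X64_PFN_MASK         0x000FFFFFFFFFF000ull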

static p64 ept_map_page_table( p64 entry )
{
    p64 ptable = NULL;
    page_entry_t *pxe = ( page_entry_t* )entry;
    physical_address table_pa = { 0 };
    
    // Check if the EPT entry referenced is already populated
    //
    if( *entry != 0 )
    {
        table_pa.quad = ( i64 )( *entry & X64_PFN_MASK );
        ptable = mem_ptov( table_pa.quad );
        
        if( !ptable )
            return NULL;
    }
    else
    {
        // If allocation succeeds, construct the parent EPT entry so it
        // references the new table
        //
        ptable = eptm_allocate_entry( &table_pa );
        if( !ptable )
            return NULL;
        
        // Table entries permit all accesses (RWX = 7); the leaf EPTE
        // applies the final access rights
        //
        pxe->epte.rd = 1;
        pxe->epte.wr = 1;
        pxe->epte.x  = 1;
        
        // Point the entry at the new table's page frame
        //
        pxe->epte.pfn = ( u64 )table_pa.quad >> PAGE_SHIFT;
        
        // Memory type is only meaningful in leaf entries; leave it zero
        //
        pxe->epte.mt = 0x00;
    }
    
    return ptable;
}

p64 ept_create_mapping_4k( pept_state ept_state, ept_access_rights access, physical_address gpa, physical_address hpa )
{
    // Current page structure (virtual address)
    //
    p64 pmln = NULL;
    
    // Pointer to the entry within the current page structure
    //
    p64 ps_ptr = NULL;
    page_entry_t *pxe = NULL;
    
    // Get the topmost page table (PML4)
    //
    pmln = ept_state->topmost_ps;
    ps_ptr = &pmln[ PML4_IDX( gpa.quad ) ];

    // Check and validate next table exists, allocate if not (PDPT)
    //
    pmln = ept_map_page_table( ps_ptr );
    ps_ptr = &pmln[ PML3_IDX( gpa.quad ) ];

    // Check and validate PDT exists, allocate if not
    //
    pmln = ept_map_page_table( ps_ptr );
    ps_ptr = &pmln[ PML2_IDX( gpa.quad ) ];

    // Get PT if it exists, allocate if not
    //
    pmln = ept_map_page_table( ps_ptr );
    ps_ptr = &pmln[ PML1_IDX( gpa.quad ) ];
    
    // Verify page is aligned on 4KB boundary
    //
    if( PAGE_ALIGN_4KB( hpa.quad ) != hpa.quad )
        hpa.quad &= ( ~( PAGE_SIZE - 1 ) );

    pxe = ( page_entry_t* )ps_ptr;
    
    // Set access rights on the leaf EPTE
    //
    pxe->epte.rd = access.rd;
    pxe->epte.wr = access.wr;
    pxe->epte.x = access.x;
    
    // Set the page frame for the leaf EPTE
    //
    pxe->epte.pfn = ( u64 )hpa.quad >> PAGE_SHIFT;
    
    // Set memory type for the page using the MTRRs
    //
    pxe->epte.mt = hw_query_mtrr_memtype( gpa.quad );

    return ( p64 )pxe;
}

The functions above ensure that a table is constructed if it hasn’t been already; if it has, execution quickly falls through to the next check/allocation. There are some missing error checks (each ept_map_page_table return should be validated before use), but to save space I’ve kept only the main logic. With these functions, we can go back to eptm_initialize_pt and complete the implementation.

vmm_status_t eptm_initialize_pt( pept_state ept_state )
{
    p64 ept_topmost;
    p64 epte;
    physical_address ept_topmost_pa;
    physical_address pa;
    eptp_format eptp;
    
    ept_topmost = eptm_allocate_entry( &ept_topmost_pa );
    if( !ept_topmost )
        return VMM_STATUS_MEM_ALLOC_FAILED;
    
    ept_state->topmost_ps = ept_topmost;
    
    // Initialize the EPT pointer and store it in our EPT state
    // structure.
    //
    eptp.value = ept_topmost_pa.quad;
    eptp.bits.memory_type = EPT_MEMORY_TYPE_WB;
    eptp.bits.guest_address_width = ept_state->gaw;
    eptp.bits.rsvd0 = 0;
    
    ept_state->eptp = eptp.value;
    
    // Construct identity mapping for EPT page hierarchy w/ default
    // page size granularity (4kB).
    //
    physical_memory_desc* pmem_desc = ( physical_memory_desc* )mm_get_physical_memory_block();
    ept_access_rights epte_ar = { .rd = 1, .wr = 1, .x = 1 };
    
    for( u32 idx = 0; idx < pmem_desc->num_runs; idx++ )
    {
        physical_memory_run* pmem_run = &pmem_desc->run[ idx ];
        u64 base = ( pmem_run->base_page << PAGE_SHIFT );
        
        // For each physical page in the run, map a new EPT entry.
        //
        for( u64 pn = 0; pn < pmem_run->page_count; pn++ ) 
        {
            pa.quad = ( i64 )( base + ( pn << PAGE_SHIFT ) );
            epte = ept_create_mapping_4k( ept_state, epte_ar, pa, pa );
            if( !epte ) 
            {
                // Unmap each of the entries allocated in the table.
                //
                ept_teardown_tables( ept_state );
                return VMM_LARGE_ALLOCATION_FAILED;
            }
        }
    }
    
    return VMM_OPERATION_SUCCESS;
}

This completes the initialization of our extended page table hierarchy; however, we’re not quite out of the woods. We still need to implement our teardown functions to release all EPT resources and associated structures (unmap), EPT page-walk helpers, EPT splitting and merging methods, 2MB and 1GB page support, and GVA → GPA and GPA → HPA helpers. And of course, we can’t forget our EPT violation handler.

Conclusion

There’s still a bit of work to do, and now that I finally have time to resume writing I’m hoping to have the next part out in a few weeks. The next article will clear up any confusion and cover the residual requirements to get EPT functioning properly, including the details of the page-walking mechanisms present on the platform, the logic behind them, and how to implement our own walker that handles GVA → HPA translation smoothly. As you can see, the introduction of EPT adds a significant amount of background requirements; because of this, the next article will primarily be explanations of small snippets of source and the logic used when constructing the routines. It’s important that readers get familiar, if not already, with paging and address translation – the added layers of indirection add a lot of complexity that can confuse the reader. There will also be requirements that are not normally our concern, since the hardware/OS typically handles them when converting a guest virtual address to a guest physical address: checking reserved bits, the US flag, verifying page size, checking SMAP, the pkey, and so on. The page-walking method will be a large part of the next article, as it’s important to properly traverse the paging structures.

As always, be sure to check the recommended reading! And please excuse the cluster-f of an article that this is. I had been writing it for a long time and cut out various parts that were written and then deemed unnecessary. In the end, it was still long and I wanted to get a fresh start in a new article as opposed to mashing it all in one — you probably didn’t want that either.

Thanks to @ajkhoury for cleaner macros to help with the address translation explanation.

Recommended Reading
