Normal view

There are new articles available, click to refresh the page.
Before yesterdayde engineering

A deep dive into Processes, Threads, Fibers and Jobs on Windows.

By: Mr. Rc
3 August 2022 at 05:00

Learning how processes and threads work is a crucial part of understanding any Operating System as they are the building block on top of which almost all of the user-mode mechanisms work. Additionally, Windows offers us an elegant API that enables us to interact with them. Unsurprisingly enough, these topics can be a bit complicated to understand since Microsoft does not has a clear documentation of them and there are not a lot of resources that cover these topics clearly. Windows also provides us the fiber and job APIs which are built on top of the process and thread APIs to allow the developers to manage processes and threads “easily”.

Table of contents:

Processes

Many people assume that a program and a process are the same. However, a process is not same as a program. A program is simply a file containing code. On the other hand, a process is a container of threads and various resources that are required for the threads inside the process to execute.
A process is a kernel object.

Process resources

The resources that are required to run a process might differ for each process according to it’s need but these are the fundamental components that every (almost) process has:
Process Identifier: Process identifier (aka PID or process ID) is a unique identifier for each process on the system. While processes with same name can exist on the system, process with same process IDs can not.

Private Virtual Address Space: A specific amount of virtual addresses that a process can use. This amount is different for different systems. I’ve previously wrote a detailed post about Virtual Memory which can be found here.

Executable Code: This refers to the code that is mapped into the private virtual address space (“stored in process’s memory”) of the process from the program. Processes can and do exist without any executable code for special purposes.

Handle Table: A Handle table contains all the pointer to the actual objects in the kernel that are being used by the process. The handles returned by the APIs are essentially the indexes inside the handle table. This table can not be accessed from the user-mode, since it is stored in the kernel mode. Another thing to note here is that this handle table only consists of handles for kernel objects and not for any other category of object, i.e. GDI and user.

Access Token: Each process also has an access token that defines it’s security context which is used by the system to check identity information such as which process belongs to which user, what are the privileges that it has, etc.

Process Environment Block: PEB is a user-mode per process structure that contains quite a lot of information about a process, such as the arguments provided to this process, if it’s being debugged or not, list of loaded modules, etc.
This is how the PEB looks like:

struct _PEB {
    0x000 BYTE InheritedAddressSpace;
    0x001 BYTE ReadImageFileExecOptions;
    0x002 BYTE BeingDebugged;
    0x003 BYTE SpareBool;
    0x004 void* Mutant;
    0x008 void* ImageBaseAddress;
    0x00c _PEB_LDR_DATA* Ldr;
    0x010 _RTL_USER_PROCESS_PARAMETERS* ProcessParameters;
    0x014 void* SubSystemData;
    0x018 void* ProcessHeap;
    0x01c _RTL_CRITICAL_SECTION* FastPebLock;
    0x020 void* FastPebLockRoutine;
    0x024 void* FastPebUnlockRoutine;
    0x028 DWORD EnvironmentUpdateCount;
    0x02c void* KernelCallbackTable;
    0x030 DWORD SystemReserved[1];
    0x034 DWORD ExecuteOptions:2; // bit offset: 34, len=2
    0x034 DWORD SpareBits:30; // bit offset: 34, len=30
    0x038 _PEB_FREE_BLOCK* FreeList;
    0x03c DWORD TlsExpansionCounter;
    0x040 void* TlsBitmap;
    0x044 DWORD TlsBitmapBits[2];
    0x04c void* ReadOnlySharedMemoryBase;
    0x050 void* ReadOnlySharedMemoryHeap;
    0x054 void** ReadOnlyStaticServerData;
    0x058 void* AnsiCodePageData;
    0x05c void* OemCodePageData;
    0x060 void* UnicodeCaseTableData;
    0x064 DWORD NumberOfProcessors;
    0x068 DWORD NtGlobalFlag;
    0x070 _LARGE_INTEGER CriticalSectionTimeout;
    0x078 DWORD HeapSegmentReserve;
    0x07c DWORD HeapSegmentCommit;
    0x080 DWORD HeapDeCommitTotalFreeThreshold;
    0x084 DWORD HeapDeCommitFreeBlockThreshold;
    0x088 DWORD NumberOfHeaps;
    0x08c DWORD MaximumNumberOfHeaps;
    0x090 void** ProcessHeaps;
    0x094 void* GdiSharedHandleTable;
    0x098 void* ProcessStarterHelper;
    0x09c DWORD GdiDCAttributeList;
    0x0a0 void* LoaderLock;
    0x0a4 DWORD OSMajorVersion;
    0x0a8 DWORD OSMinorVersion;
    0x0ac WORD OSBuildNumber;
    0x0ae WORD OSCSDVersion;
    0x0b0 DWORD OSPlatformId;
    0x0b4 DWORD ImageSubsystem;
    0x0b8 DWORD ImageSubsystemMajorVersion;
    0x0bc DWORD ImageSubsystemMinorVersion;
    0x0c0 DWORD ImageProcessAffinityMask;
    0x0c4 DWORD GdiHandleBuffer[34];
    0x14c void (*PostProcessInitRoutine)();
    0x150 void* TlsExpansionBitmap;
    0x154 DWORD TlsExpansionBitmapBits[32];
    0x1d4 DWORD SessionId;
    0x1d8 _ULARGE_INTEGER AppCompatFlags;
    0x1e0 _ULARGE_INTEGER AppCompatFlagsUser;
    0x1e8 void* pShimData;
    0x1ec void* AppCompatInfo;
    0x1f0 _UNICODE_STRING CSDVersion;
    0x1f8 void* ActivationContextData;
    0x1fc void* ProcessAssemblyStorageMap;
    0x200 void* SystemDefaultActivationContextData;
    0x204 void* SystemAssemblyStorageMap;
    0x208 DWORD MinimumStackCommit;
);

Thread: Thread is the entity inside a process that executes code. Every process starts with at least one thread of execution, this thread is called the primary thread. A process without threads can exist, but again, it’s mostly of times it’s of no use since it is not running any code.

EPROCESS structure: The EPROCESS (Executive Process) data structure is the kernel’s representation of the process object. The structure is huge and it contains every possible bit of information related to a process, such as pointers to other data structure, values of different attributes, etc. This structure is not documented by Microsoft.
The structure is very big in size so I’m not including it but it can be found here

KPROCESS structure: One of the most interesting structure inside the EPROCESS data structure is the KPROCESS (Kernel Process) data structure. This data structure also contains a lot of information about the process, such as pointer to process’s page directory, how much time the threads of the process has consumed in the user and kernel-mode, etc. Just like it EPROCESS, this structure is also not documented.
The structure looks like this:

struct _KPROCESS {
  struct _DISPATCHER_HEADER Header;
  struct _LIST_ENTRY ProfileListHead;
  unsigned int DirectoryTableBase;
  unsigned long Asid;
  struct _LIST_ENTRY ThreadListHead;
  unsigned long ProcessLock;
  unsigned long Spare0;
  unsigned int DeepFreezeStartTime;
  struct _KAFFINITY_EX Affinity;
  struct _LIST_ENTRY ReadyListHead;
  struct _SINGLE_LIST_ENTRY SwapListEntry;
  struct _KAFFINITY_EX ActiveProcessors;
  long AutoAlignment : 1;
  long DisableBoost : 1;
  long DisableQuantum : 1;
  unsigned long DeepFreeze : 1;
  unsigned long TimerVirtualization : 1;
  unsigned long CheckStackExtents : 1;
  unsigned long SpareFlags0 : 2;
  unsigned long ActiveGroupsMask : 20;
  long ReservedFlags : 4;
  long ProcessFlags;
  char BasePriority;
  char QuantumReset;
  unsigned int Visited;
  union _KEXECUTE_OPTIONS Flags;
  unsigned long ThreadSeed[20];
  unsigned int IdealNode[20];
  unsigned int IdealGlobalNode;
  union _KSTACK_COUNT StackCount;
  struct _LIST_ENTRY ProcessListEntry;
  unsigned int CycleTime;
  unsigned int ContextSwitches;
  struct _KSCHEDULING_GROUP *SchedulingGroup;
  unsigned long FreezeCount;
  unsigned long KernelTime;
  unsigned long UserTime;
  void *InstrumentationCallback;
};

This diagram shows the components of a process:

Threads

Threads are the actual entities inside a process that are running code on the CPU. Threads can execute any part of the code. A process provides all the resources that threads require to complete their task. Without threads, a process can’t run any code. A process can have multiple threads and such processes are called multi-threaded processes.

Thread scheduling

When there are multiple threads on the system, the scheduler switches between different threads and creates an illusion that all the threads running in parallel. While what’s really happening is that the scheduler is switching between different threads so quickly that it appears that the threads are running in parallel.
The amount of time for which a thread can run on a CPU before it switches is called the thread’s quantum. This quantum is a value that is set by the scheduler. This is usually set to a value that is a multiple of the processor’s clock speed.
Windows uses priority based thread scheduling model where the scheduler uses the thread’s priority to determine which thread should run next. The priority of a thread is a value that is set by the thread’s creator or by the system.
Because this system is quite complex, I will not go over it in detail here.

Thread resources

While a process provides a fair amount of resources for threads to run, there are still a few things that threads need in order to execute, these include:

Context: Every thread has a context which is a user-mode per thread data structure (managed by kernel) that contains the state of all the registers from the time the thread was last executed on the CPU. This data structure is very important because there can’t be multiple threads running on a CPU, so Windows switches between different threads after a few moments and each time it switches a thread, it stores the current CPU registers’ state in the context. This context is loaded again into as the values of the registers when the thread resumes it’s execution on the CPU. Since this data structure stores information related to registers, it’s processor-specific.
This is the how the data structure looks like for x64 machines:

typedef struct _CONTEXT {
  DWORD64 P1Home;
  DWORD64 P2Home;
  DWORD64 P3Home;
  DWORD64 P4Home;
  DWORD64 P5Home;
  DWORD64 P6Home;
  DWORD   ContextFlags;
  DWORD   MxCsr;
  WORD    SegCs;
  WORD    SegDs;
  WORD    SegEs;
  WORD    SegFs;
  WORD    SegGs;
  WORD    SegSs;
  DWORD   EFlags;
  DWORD64 Dr0;
  DWORD64 Dr1;
  DWORD64 Dr2;
  DWORD64 Dr3;
  DWORD64 Dr6;
  DWORD64 Dr7;
  DWORD64 Rax;
  DWORD64 Rcx;
  DWORD64 Rdx;
  DWORD64 Rbx;
  DWORD64 Rsp;
  DWORD64 Rbp;
  DWORD64 Rsi;
  DWORD64 Rdi;
  DWORD64 R8;
  DWORD64 R9;
  DWORD64 R10;
  DWORD64 R11;
  DWORD64 R12;
  DWORD64 R13;
  DWORD64 R14;
  DWORD64 R15;
  DWORD64 Rip;
  union {
    XMM_SAVE_AREA32 FltSave;
    NEON128         Q[16];
    ULONGLONG       D[32];
    struct {
      M128A Header[2];
      M128A Legacy[8];
      M128A Xmm0;
      M128A Xmm1;
      M128A Xmm2;
      M128A Xmm3;
      M128A Xmm4;
      M128A Xmm5;
      M128A Xmm6;
      M128A Xmm7;
      M128A Xmm8;
      M128A Xmm9;
      M128A Xmm10;
      M128A Xmm11;
      M128A Xmm12;
      M128A Xmm13;
      M128A Xmm14;
      M128A Xmm15;
    } DUMMYSTRUCTNAME;
    DWORD           S[32];
  } DUMMYUNIONNAME;
  M128A   VectorRegister[26];
  DWORD64 VectorControl;
  DWORD64 DebugControl;
  DWORD64 LastBranchToRip;
  DWORD64 LastBranchFromRip;
  DWORD64 LastExceptionToRip;
  DWORD64 LastExceptionFromRip;
} CONTEXT, *PCONTEXT;

Two stacks: Every thread has two stacks, a user-mode stack and a kernel-mode stack. The user-mode stack is used for normal purposes, such as for storing the values of variables. Unsurprisingly, the kernel stack is not accessible from the user-mode and it’s used as security mechanism.
When a threads calls a syscall, all of the arguments provided to that syscall are copied from the thread’s user-mode stack to it’s kernel-mode stack. It is done this way because after a thread uses a syscall, the CPU switches to the kernel mode and then the kernel-mode code validates those arguments to see if all the pointers, structures, etc that are passed are valid or not and since this stack is not accessible from the user-mode, a thread can not manipulate the arguments after they have been validated and this way, having two stacks works as a strong security measure.

Thread Local Storage: The thread local storage is a data structure that is used to store data that is specific to each thread. This data is stored in the thread’s context and is not shared between threads.

Thread ID: Just like every process has a unique identifier, every thread also has a unique identifier called a thread ID (TID).

Thread Environment Block: Like processes, threads also have most of their information stored in a data structure called the Thread Environment Block (TEB). This structure contains information such as pointer to the TLS, the LastErrorValue (this has to be this way because if two threads called GetLastError and one thread gets the LastErrorValue of some other thread then it can lead to total chaos), pointer to PEB, etc. TEB is also not documented by Microsoft.
This structure can be found here

Affinity: Setting affinity for a thread forces Windows to run a thread only on a specific CPU. For example, let’s say your machine has for CPU and you set the affinity of process linux.exe to CPU 3 then that thread will only run on CPU 3 until it finishes execution or it’s affinity is changed.

ETHREAD structure: The ETHREAD structure (Executive Thread) is the kernel representation of the thread object. Similar to EPROCESS, this structure also contains every possible bit of information about a thread, such as a pointer to the PEB, LastErrorValue, if this thread is the initial thread (main thread) of the process or not, etc. This structure is also not documented by Microsoft.
This structure can be found here

KTHREAD structure: The KTHREAD data structure (Kernel Thread) is also one of the important data structure inside ETHREAD data structure. It includes information such as the pointer to the kernel stack, a lot of information about it’s scheduling (when and for how long this thread will run on the CPU), pointer to TEB, how much time the thread has spent in the user-mode, etc. This structure is also not documented by Microsoft.
This structure can be found here

This diagram shows the components of a process:

Using Threads

Using threads is very simple. We just need to create a thread using the CreateThread function. The thread will start executing at the address of the specified function.
The function that we want to run in the thread is called the thread’s entry point. The entry point is the function that is called when the thread is created.

Here’s the signature of the CreateThread function:

HANDLE CreateThread(
  [in, optional]  LPSECURITY_ATTRIBUTES   lpThreadAttributes,
  [in]            SIZE_T                  dwStackSize,
  [in]            LPTHREAD_START_ROUTINE  lpStartAddress,
  [in, optional]  __drv_aliasesMem LPVOID lpParameter,
  [in]            DWORD                   dwCreationFlags,
  [out, optional] LPDWORD                 lpThreadId
);

The first parameter is the security attributes. This is a pointer to a SECURITY_ATTRIBUTES structure that contains information about the security of the thread. This is optional and can be NULL for default security.
The second parameter is the stack size. This is the size of the stack that the thread will use. This is optional and can be 0 for default stack size.
The third parameter is the address of the function that will be executed in the thread. This is the entry point of the thread.
The fourth parameter is the parameter that will be passed to the thread. This is optional and can be NULL for no parameter.
The fifth parameter is the creation flags. This is a set of flags that determines how the thread will be created. This is optional and can be 0 if we want the thread to directly execute after being created.
The sixth parameter is the thread ID. This is a pointer to a variable that will receive thread ID after it’s created. This is optional and can be NULL if we do not want to store the thread’s ID.

Fibers

Fibers are unit of execution that allow us to manually schedule (define our own scheduling algorithm) them rather than being automatically scheduled by the scheduler. Fibers run in the context of the threads that created them. Every thread can have multiple fibers and a thread can run one fiber at a time (we decide which). Fibers are often called lightweight threads.
Fibers are invisible to the kernel as they are implemented in the user-mode in Kernel32.dll.

Using Fibers

The first step when using fiber is to convert our own thread into a fiber. This is done by calling the ConvertThreadToFiber function. This is the signature for the function:

LPVOID ConvertThreadToFiber(
  [in, optional] LPVOID lpParameter
);

This function returns the memory address of the fiber’s context that was created. This address is useful later when performing operations on the fiber. The fiber’s context is similar to than that of a thread but it has a few more elements than just registers, these include:

  • The value of lpParameter that was passed to ConvertThreadToFiber.
  • The top and bottom memory addresses of the fiber’s stack.
    and more.

After this function is called our thread gets converted into a fiber and it starts running on our thread. This fiber may exit either when it’s done executing or when it calls ExitThread (in this case, the thread and fiber both get terminated).

Now, to create a fiber, we need to call the CreateFiber function. This is the signature for the function:

LPVOID CreateFiber(
  [in]           SIZE_T                dwStackSize,
  [in]           LPFIBER_START_ROUTINE lpStartAddress,
  [in, optional] LPVOID                lpParameter
);

The first argument is used to specify the size of the fiber’s stack, generally, 0 is specified, which uses the default value and creates a stack that can scale grow up to 1 MB. The second argument is the address of the function that will be executed when the fiber is scheduled. The third argument is the parameter that will be passed to the function that will be executed.
This function also returns the memory address of the fiber’s context that was created with this context having one additional element: the address of the function that will be executed.
Remember that calling this function only creates the fiber and doesn’t start it. To start the fiber, we need to call the SwitchToFiber function. This is the signature for the function:

void SwitchToFiber(
  [in] LPVOID lpFiber
);

This function takes only one argument, the address of the fiber’s context that was previously returned by CreateFiber. This function actuall starts the execution of the fiber.

To destroy a fiber, we need to call the DeleteFiber function. This is the signature for the function:

void DeleteFiber(
  [in] LPVOID lpFiber
);

It only takes one argument, the address of the fiber’s context that we want to delete.

CreateProcess internals

Usually, when a thread wants to create another process, it calls the Windows API function CreateProcess and specifies the parameters accordingly to create a process with required attributes. This function is takes a lot of arguments and is quite flexible and can be used in almost all cases.
However, sometimes the capabilities of this functions are not enough so other functions (sometimes just a wrapper of this function) are used, here are some of them:

  • CreateProcessAsUser allows you to create a process on the behalf another user by allowing you to specify the handle to that user’s primary token.
  • CreateProcessWithTokenW gives you the same capabilities as the previous function but this one just requires a few different privileges.
  • CreateProcessWithLogonW allows you to provide the credentials of the user in whose context you want to create a process.
  • ShellExecute is a very unique function. All the previous functions that we talked about work with any valid Portable Executable (PE) file and they do not care about the file extension of the file that you specified, i.e, you can rename the original notepad.exe to notepad.txt and give it to any of those functions and they would still create a process from it.
    However, the ShellExecute and ShellExecuteEx are a bit different. These functions accept any file format and then they look inside the HKLM\SOFTWARE\Classes and HKCU\SOFTWARE\Classes registry keys to find the program which is associated with the file format of the file you gave it as an argument and then they eventually call the CreateProcess function with the appropriate executable name/path along with the file name appended, for example you can provide this function a txt file and it will launch notepad with the filename as an argument (notepad.exe filename.txt).

CreateProcess and CreateProcessAsUser both are exported by Kernel32.dll and both of them eventually call CreateProcessInternal (also exported by Kernel32.dll) which also ends up calling the NtCreateUserProcess function which is exported by ntdll.dll. NtCreateUserProcess is the last part of the user-mode code of all user-mode process creation functions, after this function is done with it’s work, it makes a syscall and transforms into kernel mode. Both CreateProcessInternal and NtCreateUserProcess are officially undocumented by Microsoft at the time of writing this post.
However, the CreateProcessWithTokenW and CreateProcessWithLogonW functions are exported by Advapi32.dll. Both of these functions make a Remote Procedure Call (RPC) to the Secondary Login Service (seclogon.dll hosted in svchost.exe), this service allows processes to be started with different user’s credentials and then Secondary Logon Service executes this call in its SlrCreateProcessWithLogon function which eventually calls CreateProcessAsUser.

Arguments

The arguments for all the CreateProcess* functions are almost completely similar with a only a few differences. The explanation of all the CreateProcess* functions would be tedious to write as well as very boring to read, so here is the brief overview of the description of different arguments:

  • The first argument for CreateProcessAsUser and CreateProcessWithTokenW are the handle to the token under which the process will be started. However, in the case of CreateProcessWithLogonW, the first arguments include the username, domain and password of the user on whose behalf the process will be started.
  • The next important argument lpApplicationName, which is the full path to the executable to run. This argument can be left NULL and instead the next argument can be used.
  • The next argument after lpApplicationName is lpCommandLine. This argument doesn’t require us to put the provide the full path of the executable we want create a process of (we can provide it full path but it’s optional), the reason behind this is that when we provide it an executable’s name without a path in this argument, the function searches through several pre-defined paths in an order to find that file’s path. This is the order defined in msdn:

  • The next important arguments are lpProcessAttributes and lpThreadAttributes. Both of them take a pointer to SECURITY_ATTRIBUTES structure and both of them can be NULL, and when this argument is specified NULL then the default security attributes are used. we can specify whether we want to make the handle of the process that is about to be created (in lpProcessAttributes) and it’s primary thread (in lpProcessAttributes) inheritable by all the other child processes that the caller of CreateProcess* creates or not in the bInheritHandle member of SECURITY_ATTRIBUTES.
  • The next important argument is bInheritHandles. This argument specifies whether we want the process that is about to be created to inherit all the inheritable handles from the handle table of the parent process or not.
  • The next important argument is dwCreationFlags. This argument allow us to specify different flags that affect the creation of the process, such as:
    • CREATE_SUSPENDED: The initial thread of the process being created is started in suspended state (paused state, it doesn’t directly run after it’s created). A call to ResumeThread can be used thereafter to resume the execution of the thread.
    • DEBUG_PROCESS: The calling process declares itself as a debugger and creates the process under it’s control.
  • The next argument is lpEnvironment. This argument is optional and is used to provide a pointer to the an environment variables’ block. Since it’s optional, we can specify it NULL and it will inherit it’s environment variables from it’s parent process.
  • The next argument is lpCurrentDirectory. This argument is also optional and is used if we want the process about to be created will have a different current directory than the parent process. If left NULL, the new process will use the current directory of the parent process.
  • The next argument is lpStartupInfo. This argument is used to specify a pointer to STARTUPINFO or STARTUPINFOEX structures. The STARTUPINFO structure contains some more configuration related for the new process. STARTUPINFOEX structure has an extra field which is used to specify some more attributes for the new process.
  • The last argument is lpProcessInformation. This argument is used to specify a pointer to PROCESS_INFORMATION structure. The CreateProcess* functions returned the information of the new process in this structure, this information includes the process id of the new process, the thread id of the primary thread, a handle to the new process, etc.

Classification of Processes

Windows provides some (almost) completely different attributes for processes that require extra security or have a special purpose. These processes are not launched like normal processes and they also have different attributes.

Protected Processes

The concept of protected processes was initially introduced to imply with Digital Rights Management (DRM) requirements which were imposed by the media industry for protection of content such as HD-DVD media.
Normally, threads of any process which as debug privilege (usually processes started by the administrator account) could read or write data and code into the memory of any process running on the system. This behavior is very useful in a lot of cases. However, this behavior violates the DRM requirements and for this reason, Windows uses protected processes.
These process exist with normal Windows process, but they provide with little to no access to other processes on the system (even the one’s running with administrator privileges).
These processes can be created by any application on the system with whatever rights they have, but for an executable to be able to run as a protected process, it must be signed with a special Windows Media Certificate. This certificate is a digital signature that is used to identify the executable as a protected process.
These process also only load DLLs that are signed with a special certificate and the data of these processes are only accessible to either kernel or other protected processes.
Examples of protected process are:

  • The Audio Graph Device process (Audiodg.exe) that is used by Windows to decode protected DRM audio content.
  • The Media Foundation Protected Media Path (Mfpmp.exe) process used by Windows to decode DRM video content.
  • The Windows Error Reporting (WER, Werfaultsecure.exe) for reporting crashes of protected apps. This the protected version of WER is required because the normal WER process executes as a normal process and therefore it can’t access the data inside the crashed protected processes.
  • The system process.

Protected Processes Light (PPL)

PPL is the extended version of Protected Processes introduced allow third party processes, such as Antivirus programs to have same privileges as protected processes. However, PPLs comes with a slight difference, i.e., how much protected a PPL will be depends upon it’s signature, which results in some PPLs having more or less protection than others.
Most system processes on Windows are PPL protected, such as smss.exe, csrss.exe, services.exe, etc.

Minimal Processes

These are essentially empty processes. These processes have empty user-mode address space, ntdll.dll or other subsystem DLLs are not loaded, no PEB or TEB or any related structure are created, no initial thread is created and no executable image is mapped. There processes are created and managed by the kernel and the kernel provides no way to create such processes from the user-mode since these are not meant to be used by the user, but rather by the system to perform special tasks.
These process can have threads and these threads are called minimal threads. These threads don’t have any Thread Environment Block (TEB) or stack.

An example of this is the memory compression process which stores compressed memory of active processes, this process is used to keep more processes memory without paging them out to the disk (this process is hidden from task manager because since it stores compressed memory, it has a lot of memory usage and average users used to get suspicious about this process). You can view this process in process explorer, if you just sort the processes by their working set (amount of physical memory currently being used), this process should appear on top (it might not, if you have some program eating so much of your ram). This process also has no threads or code.

Pico Processes

Windows introduced the concept of pico processes based on their research called the Project Drawbridge. These are minimal processes with a supporting driver called the pico provider. This driver can manage almost everything related to the execution of the pico process it’s managing and this property of pico providers allow them it can act like a separate kernel for that process without the process having any sense of the original system it’s running on, however, the management of memory, I/O and thread scheduling is still done by the original Windows kernel.
A pico provider is able to intercept all the operations that of the pico process that require any handling by the kernel, this includes things such as system calls, exceptions, etc and respond accordingly.
Pico processes can have pico threads (minimal threads for pico processes) and also normal threads. The pico threads also have a context which is stored in the PicoContext member of ETHREAD structure.

Windows Subsystem for Linux

The Windows Subsystem for Linux (WSL) is built on this idea of pico processes. WSL is able to run whole linux system nearly perfectly on Windows without having a single line of code from the linux kernel. This is made possible by the incredible control that pico providers allow.
The pico providers for WSL are lxss.sys and lxcore.sys. These drivers emulate the behavior of the linux kernel by converting all the linux syscalls made from the WSL pico process to NT APIs or by calling specific components that are implemented from scratch.
This implementation of WSL on Windows is a very interesting topic and complicated topic, I might cover it later in some other blog post!

Trustlets (Secure Processes)

Trustlets are another type of processes that provide strong security. Trustlets can not be directly created by the user. They are created by the Windows kernel when a user-mode application requests to create a secure process.
Trustlets use Virtual Trust Levels provided by Hyper-V Hypervisor to isolate themselves in the system. These levels are used to provide the security of the trustlet. The trustlet can only import DLLs that are signed with a certificate that is trusted by the system and other system trusted DLLs such as C/C++ runtime libraries, Kernelbase, Advapi, RPC runtime, CNG base Crypto, and other mathematical libraries that do no require any syscall to work.
The way truslets work is a bit complex as it requires the understanding of how Hypervisors work so I am not covering that here. However, you can read more about them here on msdn.

Jobs

Jobs are a Windows mechanism to group and manage processes together and have them share the same security context. This can be used to run a bunch of processes that are related to each other, for example if you want to manage multiple processes that are the part of same application.
Jobs are shareable, securable and nameable. Any change to the job will affect all the processes in the job. Jobs are used to impose limits on a set of processes, for example if you want to limit the number of processes of an application that can be running at the same time.
Once a process is assigned to a job, it can not leave that job. Child processes created by the processes inside a job will also be a part of that job unless CREATE_BREAKAWAY_FROM_JOB was specified to CreateProcess and the job itself allows processes to break out from it (a job can deny processes inside it from breaking out of it).

Job limits

Here are few of the limits that we can set in a job:

  • Max active processes: This is used to limit the amount of processes that can exist in a job. When this limit is reached, no new processes are allowed to assign to that job and creation of child processes is blocked.
  • Processor Affinity: This is used to limit all the processes inside a job to only on a specific CPU.
  • Priority Class: This is used to set the priority class for all the members of a job. If the thread of any process that is the member of a job that has it’s priority class set tries to increase it’s priority class, it’s request will be ignored and no error will be returned (to SetThreadPriority).
  • Virtual Memory Limit: This is used to restrict the maximum amount of virtual memory that can be committed by single processes or the entire job.
  • Clipboard R/W: This is used to disallow all the members of a job from accessing or writing to the clipboard.

API functions for working with Jobs

The Windows API provides us all the important functions that are required to manage and work with job objects. Here are few of the important functions:

  • CreateJobObject: Used to create a job object. It can also be used to open a job object.
  • OpenJobObject: Used to open an already existing job object.
  • AssignProcessToJobObject: Used to assign a process to a job object.
  • SetInformationJobObject: Used to set limits for the processes inside a job object.
  • QueryInformationJobObject: Used to retrieve information about the a job object.
  • TerminateJobObject: Used to terminate all the processes inside a job object.
  • IsProcessInJob: Used to check if a process is a member of a job object.

Using Jobs

Working with jobs is also quite simple. You can create a job object, assign processes to it and set limits on the processes inside the job. You can also use the API functions to query and set the limits on the processes inside the job.
To create a job object, you can use the CreateJobObject function. This function returns a handle to the job object. Here is the function signature:

HANDLE CreateJobObjectA(
  [in, optional] LPSECURITY_ATTRIBUTES lpJobAttributes,
  [in, optional] LPCSTR                lpName
);

The lpJobAttributes parameter is a pointer to a SECURITY_ATTRIBUTES structure that can be used to set the security attributes for the job object. It can be NULL if you want the job object to have default security attributes.
The lpName parameter is a pointer to a string that can be used to name the job object. This parameter can also be NULL which will result in the job object being unnamed. If the name matches the name of an existing mutex, file-mapping object or waitable object, the function will fail.

After creating an empty job object, you can assign processes to it. To do this, you can use the AssignProcessToJobObject function. Here is the function signature:

BOOL AssignProcessToJobObject(
  [in] HANDLE hJob,
  [in] HANDLE hProcess
);

The hJob parameter is a handle to the job object.
The hProcess parameter is a handle to the process that you want to assign to the job object.
To get the handle of the current process, you can use the GetCurrentProcess function.

To set the limits on the processes inside the job, you can use the SetInformationJobObject function. Here is the function signature:

BOOL SetInformationJobObject(
  [in] HANDLE             hJob,
  [in] JOBOBJECTINFOCLASS JobObjectInformationClass,
  [in] LPVOID             lpJobObjectInformation,
  [in] DWORD              cbJobObjectInformationLength
);

The hJob parameter is a handle to the job object.
The JobObjectInformationClass parameter is a value that specifies the type of information that you want to set. The next parameter is used to specify the actual information that you want to set.
The lpJobObjectInformation parameter is a pointer to the structure containing information that you want to set.

Code Examples

Now that you know the basics of working with processes, jobs, threads and fibers let’s see some code examples.

Creating a Process

Let’s start by looking at how to create a process.

#include <stdio.h>
#include <windows.h>

int main(){
    STARTUPINFOA si;
    PROCESS_INFORMATION pi;

    ZeroMemory( &si, sizeof(si) );
    si.cb = sizeof(si);
    ZeroMemory( &pi, sizeof(pi) );
    LPSTR lpCommandLine = "notepad.exe";

    // Start the child process. 
    if( !CreateProcessA( NULL,   // No module name (use command line)
        lpCommandLine,        // Command line
        0,           // Process handle not inheritable
        0,           // Thread handle not inheritable
        0,          // Set handle inheritance to FALSE
        0,              // No creation flags
        0,           // Use parent's environment block
        0,           // Use parent's starting directory 
        &si,            // Pointer to STARTUPINFO structure
        &pi )           // Pointer to PROCESS_INFORMATION structure
    ){
        printf( "CreateProcess failed (%d).\n", GetLastError() );
        return -1;
    }
    printf("Process Created!\n");

    // Sleep for 5 seconds
    Sleep(5000);

    // Close process and thread handles. 
    CloseHandle( pi.hProcess );
    CloseHandle( pi.hThread );

    return 0;
}

This code should open notepad.exe and exit after 5 seconds.
Process creation is pretty easy, you just need to know the name of the executable and the command line arguments if you want to pass any.

Creating a Thread

Now that you know how to create a process, let’s look at how to create a thread.

#include <stdio.h>
#include <windows.h>

// Function to create a thread
int EthicalFunction(LPVOID lpParam)
{
    // Print a message
    printf("Thread created\n");
    printf("For educational purposes only*\n");
    // Return success
    return 0;
}

int main()
{
    // Create a thread
    HANDLE hThread = CreateThread(NULL, 0, (LPTHREAD_START_ROUTINE)EthicalFunction, NULL, 0, NULL);
    // Wait for thread to finish
    WaitForSingleObject(hThread, INFINITE);
    printf("Thread returned\n");
    printf("Exiting...\n");
    // Close thread handle
    CloseHandle(hThread);
    // Return success
    return 0;
}

This code is creating a thread for the EthicalFunction function and waiting for it to finish and then exiting after printing a message.
You can create multiple threads for multiple functions like this, for example if you want to create a thread a background thread that runs in the background and does not block the main thread until it is finished with its work.

Creating a Fiber

Next, let’s look at how to create a fiber.

#include <stdio.h>
#include <windows.h>

// fiber function
void fiber_function(void* lpParam)
{
    // Print a message
    printf("Fiber created\n");
    printf("For educational purposes only*\n");
    // Converting back into the main thread as fiber will not return to the main thread by itself
    SwitchToFiber(lpParam);

}

// main function
int main()
{
    // Converting thread to fiber
    LPVOID Context = ConvertThreadToFiber(NULL);
    // Creating fiber
    LPVOID lpFiber = CreateFiber(0, (LPFIBER_START_ROUTINE)fiber_function, Context);
    // Switching to fiber (executing fiber function)
    SwitchToFiber(lpFiber);
    // Printing a message
    printf("Fiber returned\n");
    // Deleting fiber
    DeleteFiber(lpFiber);
    // Return success
    printf("Exiting...\n");
    return 0;
}

This code will create a fiber and execute it and then switch back to the main thread and the main thread will print a message and delete the fiber.

Creating a Job Object

Let’s look at how to create a job object and assign a processes to it.

#include <stdio.h>
#include <windows.h>

int main()
{
    // Creating a job with default security attributes
    HANDLE hJob = CreateJobObject(NULL, "Unemployed");
    // Setting the job to terminate when all processes in it terminate
    JOBOBJECT_EXTENDED_LIMIT_INFORMATION jeli = {0};
    jeli.BasicLimitInformation.LimitFlags = JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE;
    SetInformationJobObject(hJob, JobObjectExtendedLimitInformation, &jeli, sizeof(jeli));
    // Creating structures for notepad, cmd and powershell
    STARTUPINFOA si = {0};
    si.cb = sizeof(si);
    PROCESS_INFORMATION pi = {0};
    STARTUPINFOA si1 = {0};
    si1.cb = sizeof(si1);
    PROCESS_INFORMATION pi1 = {0};

    // Creating notepad, and Windows media player in suspended state and adding them to the job and checking for errors
    if (!CreateProcessA(NULL, (LPSTR)"notepad.exe", NULL, NULL, FALSE, CREATE_SUSPENDED, NULL, NULL, &si, &pi) || !CreateProcessA(NULL, (LPSTR)"dvdplay.exe", NULL, NULL, FALSE, CREATE_SUSPENDED, NULL, NULL, &si1, &pi1))
    {
        printf("Error creating processes\n");
        printf("Error code: %d\n", GetLastError());
        return 1;
    }
    
    AssignProcessToJobObject(hJob, pi.hProcess);
    AssignProcessToJobObject(hJob, pi1.hProcess);

    // Resuming processes
    ResumeThread(pi.hThread);
    ResumeThread(pi1.hThread);
    printf("Job created and processes added!\n");
    
    // SLeeping for 1 minutes to let the processes run
    Sleep(60000);
    // Terminating the job
    TerminateJobObject(hJob, 0);
    // Closing handles
    CloseHandle(pi.hProcess);
    CloseHandle(pi.hThread);
    CloseHandle(hJob);

    // Return success
    printf("Exiting...\n");
    return 0;
}

This code will create a job object named Unemployed and add two notepad and Windows Media Player processes to it and then terminate the job after 1 minutes.
To confirm that the processes are running inside the Unemployed job, you can use Process Explorer to view the Properties -> Job of either notepad.exe or wmplayer.exe (not dvdplay.exe as it immediately launches this as the child process, it can also be setup_wm.exe if you do not have your Windows Media Player setup).
Here’s an example image:

Summary

This post covered an overview of the internals of processes, threads, fibers and jobs as well the classification of processes into different types. We also looked at the different components of a process. Later, we looked at how to create a process, thread, fiber and job objects.
I hope you enjoyed this post and found it useful.
Thank you for reading!

Resources

Exploring the process of virtual memory address translation and structure of a page table entry.

By: Mr. Rc
15 April 2022 at 03:33

We learned about the fundamentals of virtual memory management in the last post, as well as two Windows API functions that allow us to allocate virtual memory (VirtualAlloc) and free it (VirtualFree).
In this blog, we’ll continue our exploration of virtual memory management in Windows by learning about the how does a virtual memory address translate to a physical address, the structure of a page table in memory (explained later), what information it contains, and how we can use Window API functions to query that information and some other internals regarding the workings of virtual memory in Windows.

Table of contents:

Translation of virtual memory address

When a virtual memory address gets translated, it goes through several different translation layers where each time it’s translated, it points to a new table (which can be thought of as a structure) which also points to another table and this process is repeated until it finally gets translated into an address in the actual physical memory (RAM). The translation of these pages is done by the Memory Management Unit (MMU) of the CPU and their management is done by the Memory Manager (a component of the Windows OS). On x64 Windows, there are four tables that do this job, namely:

  • Page Map Level 4 (PML4)
  • Page Directory Pointer Table (PDPT)
  • Page Directory Table (PDT)
  • Page Table (PT)

Each of these tables contain indexes that point to the start of the next paging structure. Each of these paging structures have 512 entries. These indexes are called Page Frame Numbers (PFN) and the entries themselves are called as PxE, where x is the name of the table and E means entry, so entries inside the PML4 will be called PML4E (x = ML4), for Page Tables it will be PTE and so on.
This can be visually understood by looking at this diagram:

Virtual Memory Translation on x64 Windows
This image is probably more confusing than what you read before watching this, but let me explain so you can feel cool and get some dopamine hits.
This is simply the translation process of a virtual memory address. On the top, you can see the distribution of 48 bits (0 to 47) into division of 9 bits with one exception of 12 bits (explained later), and since we already know that on x64 systems, the addressing only happens for 48 bits, this makes sense. This explains that this top part of this fancy looking image is basically showing you the distribution of those bits.
Below them are the tables that I just talked about, you can see how each of them are pointing to some other table in coordination with the information inside the virtual memory address to finally translate to a physical memory address.

You might now have a guess of where this is going and how does the address translation takes place. Different bits inside a virtual memory address are distributed into parts and those parts contains data that tells the MMU where to look for the next entry in the next table until it finds a physical page after looking at finding the entry in the Page Table.

Now, let’s look into this distribution of bits and understand it’s work.
The first division starts from the 39th bit to 47th bit, which is a index inside the Page Map Level 4 paging structure (the address of this structure is stored in a special register, will be deeply described in a later post) and the entry at that index contains a PFN that tells the MMU where PDPT is and similarly, the bits from 30th position to 38th position tells the MMU the index of the entry inside PDPT that points to the next paging structure and this process continues until we reach the Page Table.
Once the translation process has reached the point where it has found the entry inside the Page Table which points to the address of a physical page in the RAM, the left 12 bits are used to index a specific byte in the physical page to get the exact needed data that was requested.

Understanding the structure of a PTE

Each Page Table entry has some status and protection bits set, which store information regarding the page itself. These entries tell the MMU how these pages should be managed and what is their current status.
This is how a x64 PTE looks like:

A Page Table entry on x64 Windows
As you can see, there are multiple bits (some are grouped others are not) and each of have some information regarding the page itself or it’s status. Let us understand each of them one by one so we can have a clear understanding of a Page Table entry’s structure.

Hardware bits vs. Software bits in Page Table Entries

Before talking about these bits themselves, let us understand the types of bits that are inside a PTE.
Hardware bits: Hardware bits are the bits that the MMU actually takes in consideration while translating a virtual address into a physical address.
Software and Reserved bits: These are the bits that are totally ignored by the MMU and actually used by the Memory Manager to manage pages. If you look in the diagram, you will find that bit 9 to 11 are marked as Software bits which means they are used by the Memory Manager.

Understanding the bits

Valid bit: The bit at the 0th index is the Valid bit which tells the MMU that the page for which this page table entry is, actually exists somewhere in the physical RAM and it is not paged out (explained in part one of this blog). This bit is useful because as we know, Windows uses demand paging and since some pages might not be used by a process but they might still be allocated then it’s certain that the Memory Manager will page out the unused pages from the memory to the disk. This bit helps the Memory Manager to keep track of paged and non paged memory pages.

Write bit: The bit at the 1st index is the Write bit which tells the MMU that whether the page is writeable or not. When this bit is clear (set to 0), the page is read-only and when this bit is set, we are allowed to write to that page. You can relate this with the information from the last blog post, we used the flProtect argument of the VirtualAlloc function to specify the memory protections that we wanted while allocating a page and if we use any protection that allows writing of the page then this bit will be set to 1.

Owner bit: The bit at the 2nd index is the Owner bit which tells the MMU whether the page is allowed to be accessed from the user mode or if it’s access is limited to the kernel mode. If this bit is set in the pte of a page then that page will be accessible from the user mode and if it’s not set then that page will only be accessible in the kernel mode.

Write Through bit: The bit at the 3rd index is the Write Through bit which tells the MMU to enable write through on the page. Write through is a storage method in which data is written into the cache and the corresponding main memory location at the same time. The cached data allows for fast retrieval on demand, while the same data in main memory ensures that nothing will get lost if a crash, power failure, or other system disruption occurs.

Cache Disabled bit: The bit at the 4th index is the Cache Disabled bit which tells the MMU that this page should not be cached.

Accessed bit: The bit at the 5th index is the Accessed bit which tells the MMU that this page has been accessed at least once after being mapped.

Dirty bit: The bit at the 6th index is the Dirty bit which tells the MMU that this page has been written to (there has been a write operation on this page).

Large bit: The bit at the 7th index is the Large bit which tells the MMU that this page is a large page and it maps to a page that is larger than 4KB.

Global bit: The bit at the 8th index is the Global bit which tells the MMU that this page should not be flushed to the Translation Lookaside Buffer (a caching system for recently used pages).

Copy-on-write bit (Software): The bit at the 9th index is the Copy-on-write bit, which is a Software bit and it has a special purpose. When a thread tries to access a page that is read-only (has the write bit set to 0), a memory-management exception occurs. Along with this, the Memory Manager’s fault handler checks if the Copy-on-write bit is set, if it is set then it makes a copy of that page and gives that thread the access of that copy and this copy has write access enabled so that thread will now be able to write to that data but those writes won’t affect the original page which doesn’t has the write bit set. However, if a thread tries to access a read-only page and this bit is not set then it raises the access violation exception.

Prototype bit (Software): The bit at the 10th index is the Prototype bit, which is also a Software bit and this bit is used to mark a page as a “Prototype”. This is a bit complex concept and to better understand it, you can check the resources section.

Write bit (Software): The bit at the 11th index is the Write bit, which is the last Software bit in a x64 PTE and this bit also has a quite unique usage. This may feel strange to know after everything you have learned but actually, when a page is allocated, whether it was supposed to be writeable or not, the Memory Manager initially sets the write (hardware) bit to 0, which means that all the pages are not writeable on the time of initialization and the actual way the Memory Manager knows that if a page is writeable or not is by using the 11th bit (Software Write bit). Since, the hardware write bit is set 0, every time a thread tries to write to any page for the first time, a Memory Management exception occurs and the Memory Manager checks if the bit 11 (Software Write bit) is set, if it is then it gets to know that this page is actually writeable, then it sets the Dirty bit and Write hardware bit to 1 and updates some other Memory Management information and then it dismisses the exception and then the write operation happens normally. This happens only on the first write operation on a page as the hardware write bit gets set to 1 after this process is done.
The reason it is implemented in this way is related to the existence of multiprocessors and can be understood better by reading the “Address translation” section of the Windows Internals, Part 1 7th edition book.

PFN: The 36 bits from the 12th index to the 47th index are the page frame number that we talked about earlier.

Reserved: These bits from 47th index to 62nd index are completely ignored by the MMU and only used by the Memory Manager for special purposes.

NX bit: The last and 63rd bit in a pte is the NX bit. NX stands for “no-execute” and it tells the MMU whether this page can be executed or not.

Now, since you now have the knowledge of the translation process of a virtual memory address as well the structure of a hardware PTE and you know what information it stores, it’s time for you to learn about another Windows API function which allows us to query information about a page.

GetLastError


Before we start, I would like to introduce you to a function from the Windows API, it is GetLastError. It is used to get the error code of the last error that occurred and we can get more information about the error code by looking at the error code list which is available at msdn here : System Error Codes - Win32 apps
We will be using this function in the code examples to see if there are any errors in our code.

1. VirtualQuery


This function is used to query the information of a virtual memory region (page).

Function signature

This is the syntax for VirtualQuery function:

SIZE_T VirtualQuery(
  LPCVOID                   lpAddress,
  PMEMORY_BASIC_INFORMATION lpBuffer,
  SIZE_T                    dwLength
);

Arguments

The function’s return type is SIZE_T, it’s basically an unsigned int.

lpAddress: You might already know the use of this argument if you have read the part one of this blog, it’s basically the base address of Virtual Memory region that we allocated which is returned by VirtualAlloc.

lpBuffer: This argument is a pointer to a struct. The name of this struct is _MEMORY_BASIC_INFORMATION, it is defined in winint.h. Here is how it looks like:

typedef struct _MEMORY_BASIC_INFORMATION {
  PVOID  BaseAddress;
  PVOID  AllocationBase;
  DWORD  AllocationProtect;
  WORD   PartitionId;
  SIZE_T RegionSize;
  DWORD  State;
  DWORD  Protect;
  DWORD  Type;
} MEMORY_BASIC_INFORMATION, *PMEMORY_BASIC_INFORMATION;

I’ll explain it’s members later.

dwLength: This argument is the size of the struct from the last argument.

Return value

Instead of returning anything, the function just updates the struct that we had created.

Examples

As we have learned enough about the function, let’s take a look at some examples and see the function and it’s working in action.

Example #1

Now as we have done with understanding of the function, we’ll see some code examples of the function. We are going to make a program that will give us the information about a memory region that we’ll allocate using the functions that we learned about in the last blog post. Let me show you the code first, then I will explain it:

#include <Windows.h>
#include <stdio.h>

int main()
{
    MEMORY_BASIC_INFORMATION info; 
    int ret;
    int *vm = VirtualAlloc(NULL, 8, MEM_COMMIT, PAGE_READONLY); // 8 byte allocation.
    ret = VirtualQuery(vm, &info, sizeof(info));
    if (!ret) // error checking.
    {
        printf("VirtualQuery failed\n");
        printf("The error code for the last error was %d", GetLastError());
        return 1;
    }

    switch (info.AllocationProtect)
    {
        case PAGE_EXECUTE_READ:
            printf("Protection type : EXECUTE + READ\n");
            break;
        case PAGE_READWRITE:
            printf("Protection type : READ + WRITE\n");
            break;
        case PAGE_READONLY:
            printf("Protection type : READ\n");
            break;
        default:
            printf("Not found");
            break;
    }

    switch (info.State)
    {
        case MEM_COMMIT:
            printf("Region State : Committed");
            break;
        case MEM_FREE:
            printf("Region State : Free");
            break;
        case MEM_RESERVE:
            printf("Region State : Reserve");
            break;
        default:
            break;
    }
    VirtualFree(vm, 8, MEM_RELEASE); // free the allocated memory.
    return 0;
}

I have used Windows.h instead of using any other header file because Windows.h contains almost everything that we need for doing Windows API programming.
Let’s now understand the code.
First, we have declared a struct of type MEMORY_BASIC_INFORMATION, which is the struct that we talked about, then we committed eight bytes of virtual memory which is read-only.
After that, we have used VirtualQuery function to get information about that memory region.
We gave it the address of the allocated memory region as our first parameter, then we gave the address of the info struct that will hold all the returned data from this function, then we gave it the size of our info struct.
Then, we are doing a check if the function is failed, If it’s failed then the error code can be found by using the GetLastError function.
Then, we have a switch-case clause, where we are checking the value of AllocationProtect member of our info struct. This switch-case clause will check for the protection type of the virtual memory region that was specified as the first parameter.
The constants that are being used to compare in the switch-case clause are defined in the Windows.h header file that we included.
We are then checking the value of State member from our info struct. This switch-case clause is comparing the state of the allocated virtual memory region. Then, we are just printing information according to the statements. One thing to note is that we cannot compare the value with every type of protection type or every type of memory state, I have tried doing so but I was unsuccessful, so I am have just used the types that can be compared.
Then we just free the allocated memory.

Results #1

Here’s the output that I get after running the example:

$ ./vquery-example
Protection type : READ
Region State : Committed

The results are expected, we had hardcoded the page protection to be read-only and the page state to committed and the result by the function is precise.

Example #2

This example will be quite fun. Here, I am asking the user to select which page state and page protection they want for the page and then using VirtualQuery to query the information of the allocated page and then printing it to verify with the input user gave. Here’s the code for it:

#include <Windows.h>
#include <stdio.h>

int main()
{
    MEMORY_BASIC_INFORMATION info;
    int ret;

    char state;         // used for input
    char protection;    // used for input
    int MEM_STATE;
    int MEM_PROTECTION;

    printf("Choose the page state you want to use: \n");
    printf("1. MEM_COMMIT\n");
    printf("2. MEM_RESERVE\n");
    scanf("%c", &state);
    getchar();

    switch (state)      // checking user input.
    {
    case '1':
        MEM_STATE = MEM_COMMIT;  
        break;
    case '2':
        MEM_STATE = MEM_RESERVE;        
        break;
    default:
        printf("Invalid choice!");
        exit(-1);
    }
    
    printf("Choose the page protection you want to use: \n");
    printf("1. PAGE_READONLY\n");
    printf("2. PAGE_READWRITE\n");
    printf("3. PAGE_EXECUTE_READ\n");
    scanf("%c", &protection);

    switch (protection) 
    {
    case '1':
        MEM_PROTECTION = PAGE_READONLY;        
        break;
    case '2':
        MEM_PROTECTION = PAGE_READWRITE;        
        break;
    case '3':
        MEM_PROTECTION = PAGE_EXECUTE_READ;        
        break;
    default:
        printf("Invalid choice!");
        exit(-1);
    }

    // allocating memory.
    int *vm = VirtualAlloc(NULL, 8, MEM_STATE, MEM_PROTECTION);
    printf("Address of memory returned by VirtualAlloc is %lu\n", vm);

    //querying data about that memory.  
    ret = VirtualQuery(vm, &info, sizeof(info));
    
    // error checking.
    if (!ret)
    {
        printf("VirtualQuery failed\n");
        printf("The error code for the last error was %d", GetLastError());
        return 1;
    }

    printf("Protection type : ");
    
    switch (info.AllocationProtect) // comparing protection.
    {
        case PAGE_EXECUTE_READ:
            printf("EXECUTE + READ\n");
            break;
        case PAGE_READWRITE:
            printf("READ + WRITE\n");
            break;
        case PAGE_READONLY:
            printf("READ ONLY\n");
            break;
        case PAGE_GUARD:
            printf("Guard Page\n");
            break;
        default:
            printf("%x\n", info.AllocationProtect);
            break;
    }

    printf("Region State : ");
    switch (info.State) // comparing state.
    {
        case MEM_COMMIT:
            printf("Committed");
            break;
        case MEM_FREE:
            printf("Free");
            break;
        case MEM_RESERVE:
            printf("Reserve");
            break;
        default:
            printf("Unknown");
            break;
    }

    VirtualFree(vm, 8, MEM_DECOMMIT); // free the allocated memory.
    return 0;
}

Most part of the code is similar to the code from the last example, but there are some major changes.

First, we are asking the user to choose which page state they want to allocate, then we are storing their input in a character variable state, then we are taking that input variable state and comparing it in a switch-case clause to find out which page state the user asked for, then we are setting an integer variable MEM_STATE to the constant of the page state which the user asked for and then we did the same for page protection by using the protection character variable for input and MEM_PROTECTION for storing the constant.
Next, we are allocating memory using those variables (MEM_STATE and MEM_PROTECTION) as parameters for VirtualAlloc and then we are taking the address returned by VirtualAlloc and querying the information about it from VirtualQuery, then comparing it possible constants and printing it’s state and protection.

Result #2

Here’s the output of the program:

Choose the page state you want to use: 
1. MEM_COMMIT 
2. MEM_RESERVE
1
Choose the page protection you want to use: 
1. PAGE_READONLY
2. PAGE_READWRITE
3. PAGE_EXECUTE_READ
2
Address of memory returned by VirtualAlloc is 131072
Protection type : READ + WRITE
Region State : Committed

Cool!, it works as expected.

Summary

In this post, we have learned about a lot of complex things related to Windows Virtual Memory Management. We learned about the four paging structures that are used during the translation process of a virtual memory address and the process of translation itself, then we learned about the complex structure of a Page Table Entry and then finally we learned about how we can get the error code of the last error using the GetLastError function, then we learned about the VirtualQuery function and how we can use it to query the information of a virtual memory region and then we made two small projects to see that in action. I hope you enjoyed the blog post and learned something new!
Thank you for reading!

Resources

Exploring Virtual Memory and the Virtual Memory Management API.

By: Mr. Rc
26 January 2022 at 03:33

If you have ever explored Windows Internals or just the internal workings of an Operating System or Computer, you must have heard of the term “Virtual Memory” or “Paging” somewhere because these are some of the most important concepts of an Operating System and these are the concepts which we are going to explore in this blog post. Of course, I won’t be able to cover the whole concepts but I’ll try to give you basic understanding of every concept I talk about and I will also link to the resources that explain each concept in detail in the resources section.

Table of contents:

Virtual Memory

64 bit systems are capable of having addresses upto 64 bits in size but they only use 48 bits for addressing out of their total bits. Why?
Because, if all 64 bits are allowed, the system will be able to address up to 16 Exabyte (EB) of memory (1 EB = 1000000 Terabytes) and of course, none of the system today can use or need that much memory right now and another reason behind this is the memory manager can’t manage this much memory at the same time as 16 EB is a lot of memory. This is why only 48 bits of addressing is allowed. 48 bits can address up to 256 TB of memory, which is still a lot.
We generally use the term “memory” (in context of computers) to refer to the RAM or some data stored in the RAM but behind the scenes, there is a lot going on that actually makes memory a thing and one of the many component behind this is virtual memory.
If you are a bit familiar with pointers or assembly, you might already know how does a memory address look like. It’s like this:

0xFFFFDEADC0DE

yeah, like that. This is an example of a virtual memory address (or simply a virtual address). These virtual memory addresses don’t actually point to a place in the physical RAM installed on your computer, but they are basically a reference to the them. The addresses that point to memory in the Physical RAM are called physical memory address. Virtual memory address are translated (converted) into physical memory addresses by the combined workings of both the CPU and the Memory Manager (which is a part of Windows itself).

Paging

Implementation of virtual memory by the Memory Management Unit is known as paging. Windows uses two types of paging which are known as Disc Paging and Demand Paging with clustering.
Virtual memory and physical memory both are divided into 4KB chunks (regions/parts), these chunks are called Pages (virtual memory chunks) and page frames (physical memory chunks). There are also large pages and huge pages but I won’t talk about them in this blog post.
In disc paging, whenever there is not enough memory left in the physical memory (RAM), the memory manager (explained later) moves pages from the RAM to special files called page files and this process of moving data from RAM to disc is called paging out memory or swapping. Moving pages from RAM to page files frees the RAM and this free memory can be used by new processes for their work. But, what happens to the paged out memory?
Whenever some code (instruction) tries to access some data that is not in the physical memory but is paged out, the MMU generates a page fault which is then handled by the OS, the OS takes that page from the disk and moves it back into the physical memory and restarts the instruction that wanted to access that memory. However, in clustering, instead of bringing back only the page that the fault requested, the memory manager also brings the pages surrounding the page that the fault requested.
In demand paging, whenever a process tries to allocate memory, the memory manager doesn’t really allocate any memory but it still returns a pointer to some memory, which is actually not yet allocated, it gets allocated only when after it is accessed. Memory is not allocated -> Process accesses the non existent memory so page fault happens -> Windows allocates the memory and allows you to use it. This method is used because programs may allocate memory that they will never access or use and having this kind of pages in the memory will only waste the demand paging allows the system to save unused memory.
Each 64 bit process on Windows is allowed to use 256 TB of virtual memory addresses but this memory is divided into different sized regions, some of which is used by the system and some of it is allowed to be used by a process. Here is a diagram of the division:

Page states

A page can be in one of the three states:

Memory Manager in Windows

All the management of the virtual memory and virtual addresses is done by the Memory Manager, which is a part of the Windows executive (kernel component). Here are the specific tasks of the memory manager:

  • Telling the MMU how to translate a virtual memory address to a physical memory address.
  • Performing paging.
  • Allocation, Reservation, Freeing of virtual memory.
  • Handling page faults.
  • Managing page files.
  • Providing a userland API for allocation, reservation and freeing of virtual memory.

Memory-Mapped files

A memory-mapped file is a special region in virtual memory that contains the contents of a file, this allows processes to treat the the contents of a file like a normal region in the memory.
There are two types of memory-mapped files in Windows:

  • Persisted memory-mapped files: These are the files that are associated (connected) with an actual file on the disk. After the last process has done it’s work with the memory-mapped file, the mapped file is written to the original file to which the memory-mapped file was associated with.
  • Non-Persisted memory mapped files: These files are not associated with any file on the disk and are mostly used for inter-process communications (IPC). After the last process had done it’s work with the memory-mapped file, it’s content is lost.

Page sharing

There are pages that are shared with different processes and these pages are called shared pages. Shared pages are mostly used to share DLLs that most processes on Windows require which saves RAM as the system doesn’t have to allocate same DLLs for each process, an example of this is kernel32.dll. Shared pages are essentially just shared memory-mapped pages which are associated with DLLs or some other shareable data.

The Virtual Memory Management API

This API is provided by the memory manager of Windows. This API allows us to allocate, free, reserve and secure virtual memory pages. All the memory related functions in the Windows API reside under the memoryapi.h header file. In this particular post, we will see the VirtualAlloc and VirtualFree functions in depth.

1. VirtualAlloc

The VirtualAlloc function allows us to allocate private memory regions (blocks) and manage them, managing these regions means reserving, committing, changing their states (described later). The memory regions allocated by this function are called a “private memory regions” because they are only accessible (available) to the processes that allocate them. Memory regions allocated with this function are initialised to 0 by default.

Function signature

This is the function signature of this function:

LPVOID VirtualAlloc(
  LPVOID lpAddress,
  SIZE_T dwSize,
  DWORD  flAllocationType,
  DWORD  flProtect
);

Arguments

The return type of this function is LPVOID, which is basically a pointer to a void object. LPVOID is defined as typedef void* LPVOID in the Windef.h. In simple words, LPVOID is an alias for void *. LP in LPVOID stands for long pointer.

lpAddress: This argument is used to specify the starting address of the memory region (page) to allocate. This address can be provided either from the return value of the previous call to this function or it can be specified as an arbitrary address but if there is memory already allocated at this address, then the Memory manager will decide where it should allocate the memory. If we don’t know where to allocate memory (as if we have not called this function previously), we can simply specify NULL and the system will decide where to allocate the memory. If the address specified is from a memory region that is inaccessible or if it’s an invalid address to allocate memory from, the function will fail with ERROR_INVALID_ADDRESS error.

dwSize: This argument is used to specify the size of the memory region that we want to allocate in bytes. If the lpAddress argument was specified as NULL then this value will be rounded up to the next page boundary.

fAllocationType: This argument is used to specify which type of memory allocation we need to use. Here are some valid types as defined in the Microsoft documentation:

Valid types for fAllocation

If you are confused about the hex values which are written after every value, they are basically the real value of the constants (i.e. MEM_COMMIT, MEM_RESERVE, etc). For example, if we use MEM_COMMIT, then it will be converted to 0x00001000 and same with all other values.

What does committing memory actually means?

In the table of types and definitions, I have described MEM_COMMIT (which is used to commit virtual memory) terribly, so let me explain what committing memory actually means in a better way.
When you commit a region of memory using VirtualAlloc, due to the use of demand paging, the memory manager doesn’t actually allocate the memory region, neither on the physical disk nor in the Virtual Memory, but, when you try to access that memory address returned by the VirtualAlloc function, it causes a page fault which causes a series of events and eventually the system allocates that memory region and serves it to you. So, until there’s an access request to the memory, it’s not allocated, there’s just a guarantee by the memory manager that there exists some memory and you can use them whenever you want.

The types which are used rarely can be found here.

flProtect: This argument is used to specify the memory protection that we want to use for the memory region that we are allocating.
These are the supported parameters:

Some memory protection constants

These are only the most used memory protection constants, the full list can be found here.

Return value

If the function succeeds, it will return the starting address of the memory region that was modified or allocated. If the function fails, it will return NULL.

2. VirtualFree

This function is basically used to free the virtual memory that was allocated using VirtualAlloc.

Function signature

This is the syntax of VirtualFree:

BOOL VirtualFree(
  LPVOID lpAddress,
  SIZE_T dwSize,
  DWORD  dwFreeType
);

As you can see, the return type of this function is BOOL, it means that it will either return true (success) or false (fail).

Arguments

lpAddress: As we know, this argument is used to specify the starting address of the memory region (page) which we want to modify (free in this case), but unlike the first time, we cannot specify NULL as an argument because obviously, the function cannot free a memory region whose address it doesn’t know.

dwSize: We also know about this argument, it is used to pass the size in bytes of the memory region which we want to modify. Here, we will use it specify the size of the memory region that we want to free.

dwFreeType: This argument is used to specify the type which we want to use to free the memory. It may be a bit confusing to you but looking at these types and their definition will clear your confusion:
virtualfree() free types

Return value

If the function does its job successfully, it returns a nonzero value. If the function fails, it will return a zero (0).

Examples

As we have looked into all the explanation, now it’s time to write some code and clear the doubts.

Example #1

Let’s start with taking example of VirtualAlloc. We will write some code which will commit 8 bytes of virtual memory.
First we’ll start by including the needed libraries:

#include <stdio.h>
#include <memoryapi.h>

Now, we’ll define a main function that will use the VirtualAlloc function to commit 8 bytes of Virtual Memory which will be rounded up to 4KB as it is the nearest page boundary to 8 bytes. We will specify the lpAddress argument as NULL, so that the system will determine from where to allocate the memory. Here is how the code looks like:

#include <stdio.h>
#include <memoryapi.h>

int main(){
    int *pointer_to_memory = VirtualAlloc(NULL, 8, MEM_COMMIT, PAGE_READWRITE); // commit 4KB of virtual memory (8 byte is rounded up to 4KB) with read write permissions 
    printf("%x", pointer_to_memory); // print the pointer to the start of the region.
  return 0;
}

Do you think something is missing from the code?
It’s the VirtualFree function. Whenever we allocate any kind of memory, we have to free it so that it can be used by other processes on the system.

Now it’s time to implement the VirtualFree function, so here is it:

#include <stdio.h>
#include <memoryapi.h>

int main(){
    int *pointer_to_memory = VirtualAlloc(NULL, 8, MEM_COMMIT, PAGE_READWRITE); // commit 8 bytes of virtual memory with read write permissions. 
    printf("The base address of allocated memory is: %x", pointer_to_memory); // print the pointer to the start of the region.
    VirtualFree(pointer_to_memory, 8, MEM_DECOMMIT); // decommit the memory region.
    return 0;
}

Until this point, the working of the code must be clear to you, but if it’s not, here’s the line-by-line explanation of the code.
First, there’s a variable which is pointing to the memory address returned by VirtualAlloc. We have passed four parameters to the VirtualAlloc function.
The first parameter is NULL, by passing NULL as a parameter, we are telling the function that the starting point of the memory region should be decided by the system.
The second parameter is the size of the memory region that we want to allocate in bytes, which is 8 bytes.
The third parameter is the allocation type, we have specified that we want to commit the memory. After we commit a memory region, it is available to us for our use but it’s not actually allocated until we access it for the first time.
The last parameter is PAGE_READWRITE, which is telling it that we want the memory region to be readable and writeable.
The we are printing virtual memory address returned by VirtualAlloc function as a hex value.
In the end, we are decommitting the memory region that we allocated by using the VirtualFree function.
The first parameter is the base address of the memory region that we allocated.
The second parameter is the size of memory region in bytes, we specified 8 while allocating it so the we’ll specify 8 while deallocating it.
Then we have specified the type of deallocation. As we are using MEM_DECOMMIT, the memory region will be reserved after it gets decommitted, which means that any other function will not be able to use it after you decommit it until you use VirtualFree function again with MEM_RELEASE to release the memory region.

Results #1

As we are almost done with everything, let’s compile and run the code. I suggest you to write the code by yourself and see the result. This is the that result that I got after I ran it:

$ ./vmem-example.exe
The base address of allocated memory is: 61fe18

Cool, right?
We have just used the VirtualAlloc function to allocate 8 bytes of virtual memory and we freed it by ourselves. Now let’s add some data to the allocated virtual memory and print it.

Example #2

Now let’s save some data inside the virtual memory that we allocated:

#include <stdio.h>
#include <memoryapi.h>

int main(){
    int *pointer_to_memory = VirtualAlloc(NULL, 8, MEM_COMMIT, PAGE_READWRITE); // commit 8 bytes of virtual memory with read write permissions. 
    printf("The base address of allocated memory is: %x", pointer_to_memory); // print the pointer to the start of the region.
    memmove(pointer_to_memory, (const void*)"1337", 4); // move "1337" string into the allocated memory.
    printf("The data which is stored in the memory is %s", pointer_to_memory); // print the data from the memory.
    VirtualFree(pointer_to_memory, 8, MEM_DECOMMIT); // decommit the memory region.
    return 0;
}

The memmove function is used to move data from one destination to other. The first argument to this function is the destination memory address where you want to move the data and the second argument is the data that will be moved and the last and third argument is the size of data, which in this case is 5 (length of the string + null byte). Here, we have copied “1337” to the memory our virtually allocated memory. If you’re confused about the type conversion, it’s used because memmove takes second argument as a const void* and we can’t directly pass char array to it.

Results #2

Let’s compile and run the code. This is the output that we’ll get:

$ ./vmem-example.exe
The base address of allocated memory is: 61fe18
The data which is stored in the memory is 1337

looks even more cool :D!

Summary

We learned a lot about virtual memory in this post, we first looked at how it is basically “virtual” memory which points to “physical” memory then we learned about paging on windows and different paging schemes that Windows’ memory manager uses then we got to know that a page is basically a memory region of 4KB, then we had look at two memory management related functions which allow us to modify virtual memory by allowing us to allocate and free it. I hope you enjoyed the blog and it wasn’t boring, any suggestions and constructive criticism is welcome!
Thank you for reading!

Resources

Understanding the booting process of a computer and trying to write own operating system.

By: Mr. Rc
22 January 2022 at 00:00

In this post, we are going to learn how can we write our own Operating System. Although, it won’t be a fully-fleged Operating system (like the one you are using right now to read this post), but it will be a part of an Operating System that would be able to boot and it will give you a brief if not full understanding of the booting process of an Operating System. If you want to take this post seriously, I suggest you to take notes as there is a lot of information combined in this single post and can be uncomfortable to grasp at the same time.
If you find something difficult to understand from my explanation, you can always check the resources section to get a link to some alternative explanation of that topic.
I would start this post by introducing you to some important components of the booting process of an Computer.

Table of contents:

Firmware

Unless you live under a rock, you might have heard of the term “Firmware” several times, if you didn’t then let me introduce you to what a Firmware is.
The most well known example of firmwares are Basic Input/Output System (BIOS) and Unified Extensible Firmware Interface (UEFI).
The term itself is actually made up of two fancy words - FIRM softWARE. The word “FIRM” means “something that doesn’t change or something that is not likely to change” and I know you are a smart person and you know what a software is. The word is nice and all but you are here to learn about the cool technical stuff so let me explain the techincal part of it. The firmware is stored inside non-volatile memory devices (devices which store sort of permanent data that doesn’t change after a system restart) as instructions or data and it is the first thing that the CPU runs after the computer is powered on. Everything that we are learning in this blog post is specific to the BIOS firmware type. Modern Operating Systems do not use BIOS, however, that doesn’t mean that the knowledge in this article is of no use as concepts of BIOS are simpler to understand still relavent to learn.

In order to understand the importance and the uses of a firmware, you would need to understand the boot process (“boot” refers to “Bootstrap”) of a computer.

The boot process

The booting process is something like this:

  • Computer is powered on.
  • The Central Processing Unit (CPU) runs the firmware from a specific Read-Only Memory (ROM) chip on your motherboard. The ROM from which your CPU is going to read the firmware depends upon the CPU your system is having.
  • The firmware detects several (but not all) hardware components connected to the system, such as network interfaces, keyboards, mouse, and so on, and does some error checking (also known as Power-On Self Test or POST) before activating them.
  • The firmware doesn’t know what are the properties and details of the Operating System that is about to be going to be ran on the system, So, it transfers it’s control to the Operating System and lets it do it’s setup. It starts with searching through the available/connected storage devices or network interfaces in a pre-defined order (this order is known as the “boot device sequence” or “boot order”) and attempts to find a bootable disk. A bootable disk is a disk whose first sector (a subdivision of a HDD which can hold 512 bytes of user-accessible data) contains the magic number 0xAA55 (big-endian). This magic number is also called as the “boot signature”. In this sector the byte at index 511 should be 0xAA and the byte at index 512 should be 0x55. This first sector is called the Master Boot Record (MBR) and the program stored inside it is called the MBR bootloader or simply bootloader. Remember that this bootloader is a part of the Operating System, so technically, this is part of the process where we are actually booted in the Operating System. This whole process is done after the firmware calls the interrupt 0x19 (more about this later).
  • After the firmware has found the bootloader, it loads it into the address 0x7c00 in the RAM and hands over the control to it.
  • Now, the bootloader can do whatever it is programmed to do, it may print a nihilist quote and tell you that your life has no meaning or it may just do nothing if it is programmed that way. Jokes aside, while it can be programmed to do anything, the main work it is supposed to be doing is performing several tasks that sets up the environment for the loading of next part (the kernel) of the OS. After performing some tasks like the initialisation of some registers, tables and so on. It reads the kernel from the disk and loads it somewhere in the RAM and handles over the control to it.
  • Now, the kernel has the control over the system. Just like a bootloader, there is no pre-defined tasks for a kernel. Whatever it will do entirely depends upon what it has been programmed to do. For example, this can be seen in the Linux and Windows kernel, they are entirely different and what they will do is too entirely different but they will eventually start the User Interface and allow the user to have the control of the system. If you find this complex, here’s an example - Just like everyone in your company does different stuff after they wake up - they may reply drink a cup of chai, they may go for a walk or do anything they want but their end goal is to reach the office on time and start working, a kernel too has the end goal of successfully loading the easy-to-use User Interface part of the OS to the user. Note that this is not the only work of the kernel in the OS, the kernel is an essential part of an OS and also has a lot to do after it has served you the nice UI.

Environment setup

Before diving in, You should have nasm and qemu installed. I know you probably do not have any of them, so go ahead and install them. Both are available for Windows and Linux.

In linux nasm and qemu can be installed through a single command:

$ sudo apt install nasm; sudo apt install qemu-system-x86

Writing our bootloader

As writing a complete kernel from scratch and then writing our own user interface, software, compiler, etc. would be a lot of pain to write in single blog post and even for you to understand, I am going to not do it all in this post and instead of writing the whole OS, we would be only be writing a bootloader, and it actually worths trying to write it, as you will too learn a lot of new things related to bootloaders and Operating Systems.

For now, we will start by writing an endless loop which is not pointless (unlike your life). It will be a function that does nothing more than jumping to itself (looping endlessly).

loop:
    jmp loop 


Here’s how you assemble it:

$ nasm bootsector.asm -f bin -o bootloader.bin

The -f flag specifies the format which is bin (binary) in our case, and the -o flag is used to name the file in which we want our output to be saved.

hexdump of bootloader.bin:

00000000: ebfe                                     ..

The opcode or the hex representaion of these instructions is ebfe, it is an infinite loop in assembly, which is exactly what we wanted.

Adding some data to our bootloader

Now that we are done with our endless loop, we will continue to write some more instructions to our bootloader and will eventually make it bootable.

We will first start by writing some data to our bootloader, here’s how you do it:

loop:
    jmp loop

db 0x10

hexdump:

00000000: ebfe 10                                  ...

The db (data byte) instruction is used to put a byte “literally” in the executable, that’s why you can see 10 being stored in the executable.

Making our bootloader bootable

The first thing we need to do in order to make this an actual bootable device is to add the the magic bytes at the end of the our bootloader’s code (at 511 and 512 index), so that the firmware can actually know that this is a bootable device. This is how we do it:

loop:
    jmp loop						; endless loop
db 0x10  						; pointless data
db "You didn't chose to exist." 			; makes sense?

times 0x1fe-($-$$) db 0					; explained later. 0x1fe = 510 in decimal.
dw 0xaa55 						; the magic number.

The instruction times 0x1fe-($-$$) db 0 may look scary but it’s really easy to understand.
The instruction can be broken into two instructions: times 0x1fe-($-$$) and db 0. Let me explain the first one to you then you will be able to make sense of the second one too.

The times instruction

The times instruction tells the assembler (nasm in this case) to produce multiple (n) copies of a specified instruction. In order to understand this more clearly, let’s look at the syntax of times instruction:

times <n> <instruction> <operand> ...		; n = number of times.

One thing you should know is the number of operands depends on the instruction being used. Here’s a simpler use case example of the times instruction:

times 10 db '1337' 

Here, 10 is n, db is the instruction and '1337' is an operand. This instruction will tell the assembler to make 10 copies of the instruction db '1337'.
Here’s the hexdump of the code:

00000000  31 33 33 37 31 33 33 37  31 33 33 37 31 33 33 37  |1337133713371337|
00000010  31 33 33 37 31 33 33 37  31 33 33 37 31 33 33 37  |1337133713371337|
00000020  31 33 33 37 31 33 33 37                           |13371337|
00000028

As expected, we can notice the string '1337' repeated 10 times. It worked just fine.


Now, let’s move to the original instruction and try to understand the subtraction it’s doing.
Let’s start with the subtraction under the bracket ($-$$). The $ operator in assembly (nasm) denotes money the address of the current instruction and $$ operator denotes the address of the first instruction (beginning of the current section), which in this case, is the address of the definition of the endless loop and whose address would be 0x7C00 (as we know, firmware loads the bootloader at address 0x7C00).
It’s basically this:

addr_of_current_instruction - addr_of_first_instruction_0x7c00 

This subtraction will return the number of bytes from the start of the program to the current line, which is just the size of the program and it is getting substracted from 0x1fe (510 in decimal). Why are we doing this subtraction?
We are doing this to get the value of bytes that aren’t used so that we can fill them with zeros (db 0) and then we will successfully be having the magic bytes at 511 and 512 index.
It can be understood like this:

200 - (addr_of_current_instruction - addr_of_first_instruction_0x7c00) ; returns the no. of unused bytes.

This value will be passed to times instruction as n and it already has the instruction (db) and operand (0), so it will tell the assembler to fill the bytes aren’t used with 0 until the 510 index.
So, it will finally look like this:

times 200 -(addr_of_current_instruction - addr_of_first_instruction_0x7c00) db 0
; times 0x1fe-($-$$) db 0
; fills the unused bytes with 0

The only thing that is left is to actually put the magic number in the bootloader. It is done by using the dw 0xaa55 instruction (dw is same as db but dw is used for words and db is used for bytes).
Now, that we are done with the understanding of the bootloader, let’s assemble it and look at the hexdump to actually see the result.

00000000: ebfe 1059 6f75 2064 6964 6e27 7420 6368  ...You didn't ch
00000010: 6f73 6520 746f 2065 7869 7374 2e00 0000  ose to exist....
00000020: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000030: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000060: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000070: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000080: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000090: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000b0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000c0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000d0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000e0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000f0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000100: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000110: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000120: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000130: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000140: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000150: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000160: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000170: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000180: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000190: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000001a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000001b0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000001c0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000001d0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000001e0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000001f0: 0000 0000 0000 0000 0000 0000 0000 55aa  ..............U.

As expected, we have filled the unused bytes with zeros and the last two bytes with the magic number (the order is different due to endianness). Now our bootloader and actually a bootloader and ready to work.

Booting into our bootloader

To boot into it, make sure you have assembled your bootloader code with nasm.

Run this command:

qemu-system-x86_64 bootsector.bin

After you run this, if will see a window of qemu which has some initialization text and then it is blank it means your bootloader works perfectly because we just programmed it to loop so it just doing that. Here’s how the window looks like:

QEMU screeenshot 1.


The final code

We are finally at almost the end of the blog post, and we will now add the final features to our bootloader. These features are not going to be anything fancy, we are only going to make it display the text that we are entering.
Here’s the code for it:

[org 0x7c00]

mov bp, 0xffff
mov sp, bp

call set_video_mode
call get_char_input

jmp $

set_video_mode:
	mov ah, 0x00
	mov al, 0x03
	int 0x10
	ret

get_char_input:
	xor ah, ah 		; same as mov ah, 0x00
	int 0x16

	mov ah, 0x0e
	int 0x10

	jmp get_char_input

times 0x1fe-($-$$) db 0
dw 0xaa55

The org directive

The difference between an instruction an directive is that An instruction is directly translated to something the CPU can execute. A directive is something the assembler can interpret and use it while assembling, it does not produce any machine code.
The first line may look a bit complex because unlike other instructions, it has brackets around it, but there’s nothing to worry about, you can just forget about the brackets and focus on the actual directive. It is org 0x7C00. Here’s the explanation:
As we know, bootloaders get loaded at the memory address 0x7C00 but the assembler don’t know this, that is why we use the org directive to tell the assembler to assume that the address of beginning of our code (base address) is <operand>, which is 0x7C00 in this case. After the assembler knows the base address of the program, every address that the assembler use while assembling the code will be relative to the base address that we have defined. For example, if we do not use this directive, the assembler will assume that the base address to be 0x00 and the address of every function and instruction will be calculated like this:

0x00+relative_addr_of_function
; base_addr + relative_addr_of_function
; base_addr + relative_addr_of_instruction

and these address won’t work on the runtime of our bootloader as it will not be loaded at that address, that is why we need to use the org directive.
Visual comparison of effects of using and not using the org directive:

code without org directivecode with org directive

Setting up the registers.

The next thing we do is setting the correct values for registers.
The first register we set up is the bp (base pointer) register to the address 0xffff and then copy it to sp (stack pointer). Hold up!, Why this address?
In order to understand this, we first need to look at the memory layout of the system when it’s in the booting process. Here is how it looks like:

Memory layout of the system while booting Memory layout of the system while booting.

As you can see, the memory address that we are setting the base pointer is in the free memory that is after the memory address where our bootloader will be loaded (0x7e00) and before the other section of memory which starts at 0x9cf00. We have set it to 0xffff because if we had set it anywhere else (in some non-free memory) then it could possibly overwrite the other data that is around it as the stack increases it’s size whenever data is pushed into it. Note that the address 0xffff is arbitrary and you can use any address from the free space, just make sure that the address that you are choosing is not very closer to the boundaries of other regions inside memory because when you will put data inside your stack, it may expand (stack grows downwards) and overwrite the data inside those other regions.

Interrupts.

The next line of code after the setting up of registers is of a call instruction which is calling the function set_video_mode. Here’s the code of the function:

set_video_mode:
	mov al, 0x03
	mov ah, 0x00
	int 0x10
	ret

The first two lines are pretty basic, they are just moving the constant 0x03 and 0x00 into al and ah register but then we have a new instruction, which is the int instruction. The int instruction is used to generate a software interrupt. So, what is an interrupt?
Interrupts allow the CPU to temporarily halt (stop) what it is doing and run some other, higher-priority instructions before returning to the original task. An interrupt could be raised either by a software instruction (e.g. int 0x10) or by some hardware device that requires high-priority action (e.g. to read some incoming data from a network device.
Each interrupt has a different number assigned to it, which is an index in the Interrupt Vector Table (IVR) which is basically a table that stores these interrupts as indexes to vectors (memory address or pointers) which point to Interrupt Service Routines (ISR). ISRs are initialised by the firmware and they are basically machine code that run differently for each interrupts, they have a sort of a long switch case statement with code to be used differently for different arguments. You can think IVT as a simple hash table (dictionary) in which each index holds a memory address to a function. Here’s an example:

IVT = {
	1: 0x0...,
	2: 0x0...,
	3: 0x0...,
	4: 0x0...,
	5: 0x0...,
	6: 0x0...,
	7: 0x0...
	...
}

The most popular interrupt

If you have ever debugged a program, you might already know what a breakpoint is, it’s simply you asking the debugger to stop the program at some point while it’s running and the debugger does it’s job. But, How do debuggers even make the program stop at while it’s running?
They use the interrupt 3, which is specially made for debuggers to stop a running process.

int 3

How do they use this interrupt to pause a program?
Debuggers replace the opcode of the first opcode of the currently running instruction with the opcode of int 3 which is just a one-byte opcode cc.
Here’s an example:

int-3-instruction-usage

As int 3 has just a single byte opcode, it makes the task very fast and easy for debuggers. When the int 3 instruction is executed, it’s index is checked in the IVT and then it’s ISR is located and it starts running. The ISR then finds the process which needs to get paused, pauses it and notifies the debugger that the process has been stopped, and once the debugger gets this notification, it allows you to inspect the memory and the registers of the process which is getting debugged by the debugger. In order to allow the continuation of the process which was previously paused, the debugger replaces the cc opcode with the original opcode which it was replace with and the program continues from the place where it was stopped. Example:

int-3-instruction-reversed

I hope this section helped you understand the real world usage and implementation of an software interrupt, and now you also know how a debugger makes the breakpoint a thing.

int 0x10

Now, you have a good understanding of interrupts and you have also seen an real world example of it, let’s now understand the usage of the interrupt that is present in the set_video_mode function, the interrupt 0x10. The interrupt 0x10 has video/screen related modification functions. In order to use different functions, we set the ah and al registers together to different values. These are the values that to which the ah register can be set:

  • AH=0x00: Video mode.
  • AX=0x1003: Blinking mode.
  • AH=0x13: Write string.
  • AH=0x03: Get cursor position.
  • AH=0x0e: Write Character in TTY Mode.
set_video_mode:
	mov ah, 0x00
	mov al, 0x03
	int 0x10
	ret

Explanation: The mov instruction is setting the value of the ah register to 0x00, which is basically asking it’s ISR to set the video mode to a mode which is specified in the al register, and these are the supported video modes with the values for ah register:

  • AL=0x00 - text mode. 40x25. 16 colors.
  • AL=0x03 - text mode. 80x25. 16 colors.
  • AL=0x13 - graphical mode. 40x25. 256 colors. 320x200 pixels.

So, both registers combined are basically asking the ISR of interrupt 0x10 to set the video mode of the screen to text mode, which has the size 80x25 and supports 16 colors and that is the only motive of this function.

int 0x16.

The other function we are left with is get_char_input. In this function, we have another interrupt, which is interrupt 0x16.
The interrupt 0x16 is used for basic keyboard related function. These are the some values that can be set in the ah register to use different keyboard functions:

  • AH = 0x00 - Read key press.
  • AH = 0x01 - Get state of the keyboard buffer.
  • AH = 0x02 - Get the State of the keyboard.
  • AH = 0x03 - Establish repetition factor.
  • AH = 0x05 - Simulate a keystroke
  • AH = 0x0A - Get the ID of the keyboard.
Implementation of interrupts into something useful
get_char_input:
	xor ah, ah		; same as mov ah, 0x00
	int 0x16

	mov ah, 0x0e
	int 0x10

	jmp get_char_input

The first thing done in the function’s code is the xoring of the ah register with itself, which is basically the same as mov ah, 0x00 but xoring a register with itself is believed to be faster and less CPU expensive, so I used it.
After setting ah to zero, it will call the interrupt 0x16, whose ISR will then read the keystroke from the keyboard and store it into the al register.
After that, it sets the ah register to 0x0e and calls our good old interrupt 0x10, but this time it is not setting the video mode to something as the ah register is not set to 0x00. If you read the functions of the interrupt 0x10 again, you will find that ah = 0x0e asks it’s ISR to “write a character in tty mode” which basically means “write a character to the screen”. The character which this ISR will print will be taken from the al register. So, these two interrupts are together reading the character from the screen (using interrupt 0x10) and printing it onto the screen (using intterupt 0x16).
After this reading of character, the function is simply calling itself (like an infinite loop) to continue what it’s doing forever until it’s manually stopped.

Our bootloader in action

The final thing we are left with is to see our bootloader in action, so let’s do it. Assemble the code:

$ nasm bootsector.asm -f bin -o bootloder.bin

Run it with qemu:

qemu-system-x86_64 bootsector.bin

Now, you should have a blank window of qemu. You can now type anything and it’ll display it to the screen and that is all it has to it.

final-bootloader-screenshot

Summary

We started this blog post by understanding the boot process of a computer, then we learnt about some new and assembly instructions and then we learned about what interrupts, how they work and then we learnt about how debuggers implement breakpoints using interrupts and lastly we learnt how the interrupt 0x10 and interrupt 0x16 can be used and how can we implement them to read data from the screen and print it.

Author notes

This post really took me so much of my time, efforts and understanding of different aspects of an Operating System. I tried the best way to explain everything and I hope that you also learnt so many new things throughout this blog post.
If you think this thing feels fascinating to you and you want to build your own fully-fledged Operating system, then you can continue learning OS dev and to make your lazy life easier, I have linked to different places where you can learn OS dev in the resources section.

Resources

❌
❌