There are new articles available, click to refresh the page.
✇ Cisco Talos

An Azure Sphere kernel exploit — or how I learned to stop worrying and love the IoT

By: [email protected] (Jon Munshaw)
By Claudio Bozzato and Lilith [^.^];. As part of our continued research into Microsoft Azure Sphere, there are two vulnerabilities we discovered that we feel are particularly dangerous. For a full rundown of the 31 vulnerabilities we’ve discovered over the past year, check out our full recap...

[[ This is only the beginning! Please visit the blog for the complete entry ]]
✇ CrowdStrike

Nowhere to Hide: Detecting SILENT CHOLLIMA’s Custom Tooling

By: Falcon OverWatch Team

CrowdStrike Falcon OverWatch™ recently released its annual threat hunting report, detailing the interactive intrusion activity observed by hunters over the course of the past year. The tactics, techniques and procedures (TTPs) an adversary uses serve as key indicators to threat hunters of who might be behind an intrusion. OverWatch threat hunters uncovered an intrusion against a pharmaceuticals organization that bore all of the hallmarks of one of the Democratic People’s Republic of Korea (DPRK) threat actor group: SILENT CHOLLIMA. For further detail, download the CrowdStrike 2021 Threat Hunting Report today.

Threat Hunters Uncover SILENT CHOLLIMA’s Custom Tooling

OverWatch threat hunters detected a burst of suspicious reconnaissance activity in which the threat actor used the Smbexec tool under a Windows service account. Originally designed as a penetration testing tool, Smbexec enables covert execution by creating a Windows service that is then used to redirect a command shell operation to a remote location over Server Message Block (SMB) protocol. This approach is valuable to threat actors, as they can perform command execution under a semi-interactive shell and run commands remotely, ultimately making the activity less likely to trigger automated detections.

As OverWatch continued to investigate the reconnaissance activity, the threat actor used Smbexec to remotely copy low-prevalence executables to disk and execute them. The threat hunters quickly called on CrowdStrike Intelligence, who together were able to quickly determine the files were an updated variant of Export Control — a malware dropper unique to SILENT CHOLLIMA.

SILENT CHOLLIMA then proceeded to load two further custom tools. The first was an information stealer, named GifStealer, which runs a variety of host and network reconnaissance commands and archives the output within individual compressed files. The second was Valefor, a remote access tool (RAT) that uses Windows API functions and utilities to enable file transfer and data collection capabilities.

OverWatch Contains Adversary Activity

Throughout the investigation, OverWatch threat hunters alerted the victim organization to the malicious activity occurring in the environment. As the situation developed, OverWatch continued to alert the organization, eventually informing them of the emerging attribution of this activity to SILENT CHOLLIMA. 

Because this activity originated from a host without the CrowdStrike Falcon® sensor, OverWatch next worked with the organization to expand the rollout of the Falcon sensor so the full scope of threat actor activity could be assessed. Increasing the organization’s coverage and visibility into the intrusion, threat hunters identified six additional compromised hosts. Through further collaboration with the organization, OverWatch was able to relay their findings in a timely manner, empowering the organization to contain and remove SILENT CHOLLIMA from their network. 

OverWatch discovered a service creation event that was configured to execute the Export Control loader every time the system reboots, allowing the threat actor to maintain persistence if they temporarily lose connection.

sc create [REDACTED] type= own type= interact start= auto error=ignore binpath= "cmd /K start C:\Windows\Resources\[REDACTED].exe"

The threat actor was also mindful to evade detection by storing their Export Control droppers and archived reconnaissance data within legitimate local directories. By doing this, threat actors attempt to masquerade the files as benign activity. The threat actor continued its evasion techniques, removing traces of the collected GifStealer archives by deleting them and overwriting the GifStealer binary itself using the string below. This technique is another hallmark of SILENT CHOLLIMA activity.

"C:\Windows\system32\cmd.exe" /c ping -n 3 >NUL & echo EEEE > "C:\Windows\Temp\[REDACTED]"

Conclusions and Recommendations

The OverWatch team exposed multiple signs of malicious tradecraft in the early stages of this intrusion, which proved to be vital to the victim organization’s ability to successfully contain the campaign and remove the threat actor from its networks. In this instance, OverWatch worked with the organization to rapidly expand Falcon sensor coverage. Though the Falcon sensor can be deployed and operational in just seconds, OverWatch strongly recommends that defenders roll out endpoint protection consistently and comprehensively across their environment from the start to ensure maximum coverage and visibility for threat hunters. OverWatch routinely sees security blind spots become a safe haven from which adversaries can launch their intrusions.  The Falcon sensor was built with scalability in mind, allowing an organization to reach a strong security posture by protecting all enterprise endpoints in mere moments.

The expertise of OverWatch’s human threat hunters was pivotal in this instance, as it was the threat hunters ability to leverage their expertise that allowed them to discern the SMB activity was indeed malicious. 

For defenders concerned about this type of activity, OverWatch recommends monitoring: 

  • Service account activity, limiting access where possible
  • Service creation events within Windows event logs to hunt for malicious SMB commands
  • Remote users connecting to administrator shares, as well as other commands and tools that can be used to connect to network shares

Ultimately, threat hunting is a full time job. Defenders should also consider hiring a professional managed threat hunting service, like OverWatch, to secure their networks 24/7/365. 

Additional Resources

✇ NVISO Labs

Cobalt Strike: Decrypting DNS Traffic – Part 5

By: Didier Stevens

Cobalt Strike beacons can communicate over DNS. We show how to decode and decrypt DNS traffic in this blog post.

This series of blog posts describes different methods to decrypt Cobalt Strike traffic. In part 1 of this series, we revealed private encryption keys found in rogue Cobalt Strike packages. In part 2, we decrypted Cobalt Strike traffic starting with a private RSA key. In part 3, we explain how to decrypt Cobalt Strike traffic if you don’t know the private RSA key but do have a process memory dump. And in part 4, we deal with traffic obfuscated with malleable C2 data transforms.

In the first 4 parts of this series, we have always looked at traffic over HTTP (or HTTPS). A beacon can also be configured to communicate over DNS, by performing DNS requests for A, AAAA and/or TXT records. Data flowing from the beacon to the team server is encoded with hexadecimal digits that make up labels of the queried name, and data flowing from the team server to the beacon is contained in the answers of A, AAAA and/or TXT records.

The data needs to be extracted from DNS queries, and then it can be decrypted (with the same cryptographic methods as for traffic over HTTP).

DNS C2 protocol

We use a challenge from the 2021 edition of the Cyber Security Rumble to illustrate how Cobalt Strike DNS traffic looks like.

First we need to take a look at the beacon configuration with tool 1768.py:

Figure 1: configuration of a DNS beacon

Field “payload type” confirms that this is a DNS beacon, and the field “server” tells us what domain is used for the DNS queries: wallet[.]thedarkestside[.]org.

And then a third block of DNS configuration parameters is highlighted in figure 1: maxdns, DNS_idle, … We will explain them when they appear in the DNS traffic we are going to analyze.

Seen in Wireshark, that DNS traffic looks like this:

Figure 2: Wireshark view of Cobalt Strike DNS traffic

We condensed this information (field Info) into this textual representation of DNS queries and replies:

Figure 3: Textual representation of Cobalt Strike DNS traffic

Let’s start with the first set of queries:

Figure 4: DNS_beacon queries and replies

At regular intervals (determined by the sleep settings), the beacon issues an A record DNS query for name 19997cf2[.]wallet[.]thedarkestside[.]org. wallet[.]thedarkestside[.]org are the root labels of every query that this beacon will issue, and this is set inside the config. 19997cf2 is the hexadecimal representation of the beacon ID (bid) of this particular beacon instance. Each running beacon generates a 32-bit number, that is used to identify the beacon with the team server. It is different for each running beacon, even when the same beacon executable is started several times. All DNS request for this particular beacon, will have root labels 19997cf2[.]wallet[.]thedarkestside[.]org.

To determine the purpose of a set of DNS queries like above, we need to consult the configuration of the beacon:

Figure 5: zooming in on the DNS settings of the configuration of this beacon (Figure 1)

The following settings define the top label per type of query:

  1. DNS_beacon
  2. DNS_A
  4. DNS_TXT
  5. DNS_metadata
  6. DNS_output

Notice that the values seen in figure 5 for these settings, are the default Cobalt Strike profile settings.

For example, if DNS queries issued by this beacon have a name starting with http://www., then we know that these are queries to send the metadata to the team server.

In the configuration of our beacon, the value of DNS_beacon is (NULL …): that’s an empty string, and it means that no label is put in front of the root labels. Thus, with this, we know that queries with name 19997cf2[.]wallet[.]thedarkestside[.]org are DNS_beacon queries. DNS_beacon queries is what a beacon uses to inquire if the team server has tasks for the beacon in its queue. The reply to this A record DNS query is an IPv4 address, and that address instructs the beacon what to do. To understand what the instruction is, we first need to XOR this replied address with the value of setting DNS_Idle. In our beacon, that DNS_Idle value is (the default DNS_Idle value is

Looking at figure 4, we see that the replies to the first requests are These have to be XORed with DNS_Idle value thus the result is A reply equal to means that there are no tasks inside the team server queue for this beacon, and that it should sleep and check again later. So for the first 5 queries in figure 4, the beacon has to do nothing.

That changes with the 6th query: the reply is IPv4 address, and when we XOR that value with, we end up with Value instructs the beacon to check for tasks using TXT record queries.

Here are the possible values that determine how a beacon should interact with the team server:

Figure 6: possible DNS_Beacon replies

If the least significant bit is set, the beacon should do a checkin (with a DNS_metadata query).

If bits 4 to 2 are cleared, communication should be done with A records.

If bit 2 is set, communication should be done with TXT records.

And if bit 3 is set, communication should be done with AAAA records.

Value 242 is 11110010, thus no checkin has to be performed but tasks should be retrieved via TXT records.

The next set of DNS queries are performed by the beacon because of the instructions ( it received:

Figure 7: DNS_TXT queries

Notice that the names in these queries start with api., thus they are DNS_TXT queries, according to the configuration (see figure 5). And that is per the instruction of the team server (

Although DNS_TXT queries should use TXT records, the very first DNS query of a DNS_TXT query is an A record query. The reply, an IPv4 address, has to be XORed with the DNS_Idle value. So here in our example, XORed with gives This specifies the length (64 bytes) of the encrypted data that will be transmitted over TXT records. Notice that for DNS_A and DNS_AAAA queries, the first query will be an A record query too. It also encodes the length of the encrypted data to be received.

Next the beacon issues as many TXT record queries as necessary. The value of each TXT record is a BASE64 string, that has to be concatenated together before decoding. The beacon stops issuing TXT record requests once the decoded data has reached the length specified in the A record reply (64 bytes in our example).

Since the beacon can issue these TXT record queries very quickly (depending on the sleep settings), a mechanism is introduced to avoid that cached DNS results can interfere in the communication. This is done by making each name in the DNS queries unique. This is done with an extra hexadecimal label.

Notice that there is an hexadecimal label between the top label (api in our example) and the root labels (19997cf2[.]wallet[.]thedarkestside[.]org in our example). That hexadecimal label is 07311917 for the first DNS query and 17311917 for the second DNS query. That hexadecimal label consists of a counter and a random number: COUNTER + RANDOMNUMBER.

In our example, the random number is 7311917, and the counter always starts with 0 and increments with 1. That is how each query is made unique, and it also helps to process the replies in the correct order, in case the DNS replies arrive in disorder.

Thus, when all the DNS TXT replies have been received (there is only one in our example), the base 64 string (ZUZBozZmBi10KvISBcqS0nxp32b7h6WxUBw4n70cOLP13eN7PgcnUVOWdO+tDCbeElzdrp0b0N5DIEhB7eQ9Yg== in our example) is decoded and decrypted (we will do this with a tool at the end of this blog post).

This is how DNS beacons receive their instructions (tasks) from the team server. The encrypted bytes are transmitted via DNS A, DNS AAAA or DNS TXT record replies.

When the communication has to be done over DNS A records ( reply), the traffic looks like this:

Figure 8: DNS_A queries

cdn. is the top label for DNS_A requests (see config figure 5).

The first reply is, XORed with, this gives Thus 112 bytes of encrypted data have to be received.: that’s 112 / 4 = 28 DNS A record replies.

The encrypted data is just taken from the IPv4 addresses in the DNS A record replies. In our example, that’s: 19, 64, 240, 89, 241, 225, …

And for DNS_AAAA queries, the method is exactly the same, except that the top label is www6. in our example (see config figure 5) and that each IPv6 address contains 16 bytes of encrypted data.

The encrypted data transmitted via DNS records from the team server to the beacon (e.g., the tasks) has exactly the same format as the encrypted tasks transmitted with http or https. Thus the decryption process is exactly the same.

When the beacon has to transmit its results (output of the tasks) to the team server, is uses DNS_output queries. In our example, these queries start with top label post. Here is an example:

Figure 9: beacon sending results to the team server with DNS_output queries

Each name of a DNS query for a DNS_output query, has a unique hexadecimal counter, just like DNS_A, DNS_AAAA and DNS_TXT queries. The data to be transmitted, is encoded with hexadecimal digits in labels that are added to the name.

Let’s take the first DNS query (figure 9): post.140.09842910.19997cf2[.]wallet[.]thedarkestside.org.

This name breaks down into the following labels:

  • post: DNS_output query
  • 140: transmitted data
  • 09842910: counter + random number
  • 19997cf2: beacon ID
  • wallet[.]thedarkestside.org: domain chosen by the operator

The transmitted data of the first query is actually the length of the encrypted data to be transmitted. It has to be decoded as follows: 140 -> 1 40.

The first hexadecimal digit (1 in our example) is a counter that specifies the number of labels that are used to contain the hexadecimal data. Since a DNS label is limited to 63 characters, more than one label needs to be used when 32 bytes or more need to be encoded. That explains the use of a counter. 40 is the hexadecimal data, thus the length of the encrypted data is 64 bytes long.

The second DNS query (figure 9) is: post.2942880f933a45cf2d048b0c14917493df0cd10a0de26ea103d0eb1b3.4adf28c63a97deb5cbe4e20b26902d1ef427957323967835f7d18a42.19842910.19997cf2[.]wallet[.]thedarkestside[.]org.

The name in this query contains the encrypted data (partially) encoded with hexadecimal digits inside labels.

These are the transmitted data labels: 2942880f933a45cf2d048b0c14917493df0cd10a0de26ea103d0eb1b3.4adf28c63a97deb5cbe4e20b26902d1ef427957323967835f7d18a42

The first digit, 2, indicates that 2 labels were used to encode the encrypted data: 942880f933a45cf2d048b0c14917493df0cd10a0de26ea103d0eb1b3 and 4adf28c63a97deb5cbe4e20b26902d1ef427957323967835f7d18a42.

The third DNS query (figure 9) is: post.1debfa06ab4786477.29842910.19997cf2[.]wallet[.]thedarkestside[.]org.

The counter for the labels is 1, and the transmitted data is debfa06ab4786477.

Putting all these labels together in the right order, gives the following hexadecimal data:

942880f933a45cf2d048b0c14917493df0cd10a0de26ea103d0eb1b34adf28c63a97deb5cbe4e20b26902d1ef427957323967835f7d18a42debfa06ab4786477. That’s 128 hexadecimal digits long, or 64 bytes, exactly like specified by the length (40 hexadecimal) in the first query.

The hexadecimal data above, is the encrypted data transmitted via DNS records from the beacon to the team server (e.g., the task results or output) and it has almost the same format as the encrypted output transmitted with http or https. The difference is the following: with http or https traffic, the format starts with an unencrypted size field (size of the encrypted data). That size field is not present in the format of the DNS_output data.


We have developed a tool, cs-parse-traffic, that can decrypt and parse DNS traffic and HTTP(S). Similar to what we did with encrypted HTTP traffic, we will decode encrypted data from DNS queries, use it to find cryptographic keys inside the beacon’s process memory, and then decrypt the DNS traffic.

First we run the tool with an unknown key (-k unknown) to extract the encrypted data from the DNS queries and replies in the capture file:

Figure 10: extracting encrypted data from DNS queries

Option -f dns is required to process DNS traffic, and option -i is used to provided the DNS_Idle value. This value is needed to properly decode DNS replies (it is not needed for DNS queries).

The encrypted data (red rectangle) can then be used to find the AES and HMAC keys inside the process memory dump of the running beacon:

Figure 11: extracting cryptographic keys from process memory

That key can then be used to decrypt the DNS traffic:

Figure 12: decrypting DNS traffic

This traffic was used in a CTF challenge of the Cyber Security Rumble 2021. To find the flag, grep for CSR in the decrypted traffic:

Figure 13: finding the flag inside the decrypted traffic


The major difference between DNS Cobalt Strike traffic and HTTP Cobalt Strike traffic, is how the encrypted data is encoded. Once encrypted data is recovered, decrypting it is very similar for DNS and HTTP.

About the authors

Didier Stevens is a malware expert working for NVISO. Didier is a SANS Internet Storm Center senior handler and Microsoft MVP, and has developed numerous popular tools to assist with malware analysis. You can find Didier on Twitter and LinkedIn.

You can follow NVISO Labs on Twitter to stay up to date on all our future research and publications.

✇ Cisco Talos

Talos Takes Ep. #78: Attackers would love to buy you a non-existent PS5 this holiday season

By: [email protected] (Jon Munshaw)
By Jon Munshaw. The latest episode of Talos Takes is available now. Download this episode and subscribe to Talos Takes using the buttons below, or visit the Talos Takes page. We know this episode comes around every year, but people keep falling for scams, so we have to remind people how to...

[[ This is only the beginning! Please visit the blog for the complete entry ]]
✇ CrowdStrike

Shift Left Security: The Magic Elixir for Securing Cloud-Native Apps

By: David Puzas

Developing applications quickly has always been the goal of development teams. Traditionally, that often puts them at odds with the need for testing. Developers might code up to the last minute, leaving little time to find and fix vulnerabilities in time to meet deadlines. 

During the past decade, this historical push-pull between security and developers led many organizations to look to build security deeper into the application development lifecycle. This new approach, “shift-left security,” is a pivotal part of supporting the DevOps methodology. By focusing on finding and remediating vulnerabilities earlier, organizations can streamline the development process and improve velocity. 

Cloud computing empowers the adoption of DevOps. It offers DevOps teams a centralized platform for testing and deployment. But for DevOps teams to embrace the cloud, security has to be at the forefront of your considerations. For developers, that means making security a part of the continuous integration/continuous delivery (CI/CD) pipeline that forms the cornerstone of DevOps practices.

Out with the Old and In with the New

The CI/CD pipeline is vital to supporting DevOps through the automation of building, testing and deploying applications. It is not enough to just scan applications after they are live. A shift-left approach to security should start the same second that DevOps teams begin developing the application and provisioning infrastructure. By using APIs, developers can integrate security into their toolsets and enable security teams to find problems early. 

Speedy delivery of applications is not the enemy of security, though it can seem that way. Security is meant to be an enabler, an elixir that helps organizations use technology to reach their business goals. Making that a reality, however, requires making it a foundational part of the development process. 

In our Buyer’s Guide for Cloud Workload Protection Platforms, we provide a list of key features we believe organizations should look for to help secure their cloud environments. Automation is crucial. In research from CrowdStrike and Enterprise Strategy Group (ESG), 41% of respondents said that automating the introduction of controls and processes via integration with the software development lifecycle and CI/CD tools is a top priority. Using automation, organizations can keep pace with the elastic, dynamic nature of cloud-native applications and infrastructure.

Better Security, Better Apps

At CrowdStrike, we focus on integrating security into the CI/CD pipeline. As part of the functionality of CrowdStrike’s Falcon Cloud Workload Protection (CWP), customers have the ability to create verified image policies to ensure that only approved images are allowed to progress through the CI/CD pipeline and run in their hosts or Kubernetes clusters. 

The tighter the integration between security and the pipeline, the earlier threats can be identified, and the more the speed of delivery can be accelerated. By seamlessly integrating with Jenkins, Bamboo, GitLab and others, Falcon CWP allows DevOps teams to respond and remediate incidents even faster within the toolsets they use. 

Falcon CWP also continuously scans container images for known vulnerabilities, configuration issues, secrets/keys and OSS licensing issues, and streamlines visibility for security operations by providing insights and context for misconfigurations and compliance violations. It also uses reporting and dashboards to drive alignment across the security operations, DevOps and infrastructure teams. 

Hardening the CI/CD pipeline allows DevOps teams to move fast without sacrificing security. The automation and integration of security into the CI/CD pipeline transforms the DevOps culture into its close relative, DevSecOps, which extends the methodology of DevOps by focusing on building security into the process. As businesses continue to adopt cloud services and infrastructure, forgetting to keep security top of mind is not an option. The CI/CD pipeline represents an attractive target for threat actors. Its criticality means that a compromise could have a significant impact on business and IT operations. 

Baking security into the CI/CD pipeline enables businesses to pursue their digital initiatives with confidence and security. By shifting security left, organizations can identify misconfigurations and other security risks before they impact users. Given the role that cloud computing plays in enabling DevOps, protecting cloud environments and workloads will only take on a larger role in defending the CI/CD pipeline, your applications and, ultimately, your customers. 

To learn more about how to choose security solutions to protect your CI/CD pipeline, download the CrowdStrike Cloud Workload Protection Platform Buyers Guide.

Additional Resources

✇ CrowdStrike

Managing Dead Letter Messages: Three Best Practices to Effectively Capture, Investigate and Redrive Failed Messages

By: Chris Cannon

In a recent blog post, Sharding Kafka for Increased Scale and Reliability, the CrowdStrike Engineering Site and Reliability Team shared how it overcame scaling limitations within Apache Kafka so that they could quickly and effectively process trillions of events daily. In this post, we focus on the other side of this equation: What happens when one of those messages inevitably fails? 

When a message cannot be processed, it becomes what is known as a “dead letter.” The service attempts to process the message by normal means several times to eliminate intermittent failures. However, when all of those attempts fail, the message is ultimately “dead lettered.” In highly scalable systems, these failed messages must be dealt with so that processing can continue on subsequent messages. To retain the dead letter’s information and continue processing messages, the message is stored so that it can be later addressed manually or by an automated tool.

In Best Practices: Improving Fault-Tolerance in Apache Kafka Consumer, we go into great detail about the different failure types and techniques for recovery, which include redriving and dead letters. Here our aim is to solidify those terms and expound upon the processes surrounding these mechanisms. 

Processing dead letters can be a fairly time-consuming and error-prone process. So what can be done to expedite this task and improve its outcome? Here we explore three steps organizations can take to develop the code and infrastructure needed to more effectively and efficiently capture, investigate and redrive dead letter messages.

Dead Letter Basics
What is a message? A message is the record of any communication between two or more services.
Why does a message fail? Messages can fail for a variety of reasons, some of the most common being incompatible message format, unavailable dependent services, or a bug in the service processing the message.
Why does it matter if a message fails? In most cases, a message is being sent because it is sharing important information with another service. Without that knowledge, the service that should be receiving the message can have outdated or inaccurate information and make bad decisions or be completely unable to act.

Three Best Practices for Resolving Dead Letter Messages

1. Define the infrastructure and code to capture and redrive dead letters

As explained above, a dead letter occurs when a service cannot process a message. Most systems have some mechanism in place, such as a log or object storage, to capture the message, review it, identify the issue, resolve the issue and then retry the message once it’s more likely to succeed. This act of replaying the message is known as “redriving.” 

To enable the redrive process, organizations need two basic things: 1) the necessary infrastructure to capture and store the dead letter messages, and 2) the right code to redrive that message.

Since there could potentially be hundreds of millions of dead letters that need to be stored, we recommend using a storage option that meets these four criteria: low cost (especially critical as your data scales), abundant space (no concerns around running out of storage space), durability (no data loss or corruption) and availability (the data is available to restore during disaster recovery). We use Amazon S3. 

For short-term storage and alerting, we recommend using a message queue technology that allows the user to send messages to be processed at a later point. Then your service can be configured to read from the message queue to begin processing the redrive messages. We use Amazon SQS and Kafka as our message queues.

2. Put tooling in place to make remediation foolproof 

The process outlined above can be very error-prone when done manually, as it involves many steps: finding the message, copying its contents, pasting it into a new message and submitting that message to the queue. If the user misses even one character when copying the message, then it will fail again — and the process will need to be repeated. This process must be done for every failed message, making it potentially time-consuming as well. 

Since the process is the same for processing dead letters, it is possible to automate. To that end, organizations should develop a command-line tool to automate common actions with dead letters such as viewing the dead letter, putting the message in the redrive queue and having the service consume messages from the queue for reprocessing. Engineers will use this command-line tool to diagnose and resolve dead letters the same way — this, in turn, will help reduce the risk of human error.

3. Standardize and document the process to ensure ease-of-use 

Our third best practice is around standardization. Because not all engineers will be familiar with the process the organization has for dealing with dead letter messages, it is important to document all aspects of the procedure. Some basic questions your documentation should address include: 

  • How does the organization know when a dead letter message occurs? Is an alert set up? Will an email be sent?
  • How does the team investigate the root cause of the error? Is there a specific phrase they can search for in the logs to find the errors associated with a dead letter?
  • Once it has been investigated and a fix has been deployed, how is the message reprocessed or redrived?

Documenting and standardizing the process in this way ensures that anyone on the team can pick up, solve and redrive dead letters. Ideally, the documentation will be relatively short and intuitive, outlining the following steps:

  • How to read the content of the message and review the logs to help figure out what happened
  • How to run the commands for your dead letter tool
  • How to put the message in the redrive queue to be reprocessed
  • What to do if the message is rejected again

It’s important to have this “cradle-to-grave” mentality when dealing with dead letter messages — pun intended — since a disconnect anywhere within the process could prevent the organization from successfully reprocessing the message.


While many organizations focus on processing massive amounts of messages and scaling those capabilities, it is equally important to ensure errors are captured and solved efficiently and effectively. 

In this blog, we shared our three best practices for organizations to develop the infrastructure and tooling to ensure that any engineer can properly manage a dead letter. But we certainly have more to share! We would be happy to address any specific questions or explore related topics of interest to the community in future blog posts. 

Got a question, comment or idea? Feel free to share your thoughts for future posts on social media via @CrowdStrike.

✇ Cisco Talos

Attackers exploiting zero-day vulnerability in Windows Installer — Here’s what you need to know and Talos’ coverage

By: [email protected] (Jaeson Schultz)
Cisco Talos is releasing new SNORTⓇ rules to protect against the exploitation of a zero-day elevation of privilege vulnerability in Microsoft Windows Installer. This vulnerability allows an attacker with a limited user account to elevate their privileges to become an administrator. This...

[[ This is only the beginning! Please visit the blog for the complete entry ]]
✇ NVISO Labs

The digital operational resilience act (DORA): what you need to know about it, the requirements and challenges we see.

By: nicoameye

TL;DR – In this blogpost, we will give you an introduction to DORA, as well as how you can prepare yourself to be ready for it.

More specifically, throughout this blogpost we will try to formulate an answer to following questions:

  • What is DORA and what are the key requirements of DORA?
  • What are the biggest challenges that you might face in becoming “DORA compliant”?

This blog post is part of a series, keep an eye out for the following parts! In the following blogposts, we will further explore the requirements of DORA, as well as elaborate a self-assessment checklist for financial entities to start assessing their compliance.

What is DORA?

DORA stands for Digital Operational Resilience Act. DORA is the EU proposal to tackle digital risks and build operational resilience in the financial sector. 

The idea of DORA is that organizations are able to demonstrate that they can resist, respond and recover from the impacts of ICT incidents, while continuing to deliver critical functions and minimizing disruption for customers and for the financial system as a whole.

With the DORA, the EU aims to make sure financial organisations mitigate the risks arising from increasing reliance on ICT systems and third parties for critical operations. The risks will be mitigates through appropriate Risk Management, Incident Management, Digital Operational Resilience Testing, as well as Third-Party Risk Management.

Who is concerned?

DORA applies to financial entities, from banks i.e. credit institutions to investment & payment institutions,  electronic money institutions, pension, audit firms, credit rating agencies, insurance and reinsurance undertakings and intermediaries.

Beyond that it also applies to providers of digital and data services, including providers of cloud computing services, data analytics, & data centres.

Note that, while the scope of the DORA itself is proposed to encompass nearly the entire financial system, at the same time it allows for a proportionate application of requirements for financial entities that are micro enterprises.

Exploring DORA

What is operational resilience? Digital operational resilience is the ability to build, assure and review the technological operational integrity of an organisation. In a nutshell, operational resilience a way of thinking and working that emphasizes the hardening of systems so that when an organization is attacked, it has the means to respond, recover, learn, and adapt.

Organizations that do not adopt this mindset are likely to experience DORA as an almost impossibly long checklist of disconnected requirements. We will cover the requirements in the coming blogposts.

DORA introduces requirements across five pillars: 

  • ICT Risk Management
  • ICT-related Incidents Management, Classification and Reporting
  • Digital Operational Resilience Testing
  • ICT Third-Party Risk Management
  • Information and Intelligence Sharing

We have summarised the requirements and these key challenges to start addressing now for each of the 5 pillars. 

ICT Risk Management

DORA requires organizations to apply a strong risk-based approach in their digital operational resilience efforts. This approach is reflected in Chapter 2 of the regulation.

What is required?

ICT risk management requirements form a set of key principles revolving around specific functions (identification, protection and prevention, detection, response and recovery, learning and evolving and communication). Most of them are recognized by current technical standards and industry best practices, such as the NIST framework, and thus the DORA does not impose specific standardization itself.

What do we consider as potential challenges for most organizations?

As described in DORA, the structure does not significantly deviate from standard Information security risk management as defined in NIST Cyber Security Framework.

However, we foresee some elements that might rise additional complexity:

First, as we reviewed, the ICT risk management requirements are organised around:

  • Identifying business functions and the information assets supporting these.
  • Protecting and preventing these assets.
  • Detecting anomalous activities.
  • Developing response and recovery strategies and plans, including communication to customers and stakeholders.

We foresee several elements that might rise additional complexity:

1. Nowadays, we see many organizations struggling with adequate asset management. A first complexity might emerge from the fact the ICT risk management framework shall include the identification of critical and important functions as well as the mapping of the ICT assets that underpin them. This framework shall also include the assessment of all risks associated with the ICT-related business functions and information assets identified.

2. Protection and Prevention is also a challenge for most organizations. Based on the risk assessment, financial entities shall set up protection and prevention measures to ensure the resilience, continuity and availability of ICT systems. These shall include ICT  security  strategies, policies,  procedures and appropriate technologies to ensure the continuous monitoring and control of ICT systems and tools.

3. Most organizations also struggle with timely or prompt detection of anomalous activities. Some complexity might arise as financial entities shall have to ensure the prompt detection of anomalous activities, enforce multiple layers of control, as well as enable the identification of single points of failure.

4. However, while the first three of these will be fairly familiar to most firms, although implemented with various degrees of maturity, the latter (response and recovery) should focus minds. This will require financial entities to think carefully about substitutability, including investing in backup and restoration systems, as well as assess whether – and how – certain critical functions can operate through alternative systems or methods of delivery while primary systems are checked and brought back up.

5. On top of this, as part as the “Learning and Evolving” part of DORA’s Risk Management Framework, DORA not only introduces compulsory training on digital operational resilience for the management body but also for the whole staff, as part of their general training package. Getting all staff on-board might create additional complexity.

In a coming blogpost, we will be reviewing the requirements associated with the risk-based approach based on the ICT risk management framework of DORA, as well as elaborating a self-assessment checklist for financial entities to start assessing their compliance.

ICT-related Incidents Management, Classification and Reporting

DORA has its core in a strong evaluation and reporting process. This process is reflected in Chapter 3 of the regulation.

What is required?

What is required?

In the regulation, ICT-related incident reporting obliges financial entities to establish and implement a management process to monitor and log ICT-related incidents and to classify them based on specific criteria.

The ICT-related Incident Management requirements are organised around:

  • Implementation of an ICT-related incident management process
  • Classification of ICT-related incidents
  • Reporting of major ICT-related incidents

What do we consider as potential challenges for most organizations?

We foresee two elements that might rise additional complexity:

1. First, financial entities will need to review their incident classification methodology to fit with the requirements of the regulation. To help organisations prepare, we anticipate that the incident classification methodology will align with the ENISA Reference Incident Classification Taxonomy.  Indeed, this framework is referenced in the footnote of DORA. Other standards might be permissible, provided they meet the conditions set out in the Regulation but, when a standard or framework is especially called out, there is no downside to considering it.

2. Second, financial entities will also need to set up the right processes and channels to be able to notify the regulator fast in case a major incident occurs. Although firms will only need to report major incidents to their national regulator, this will need to be within strict deadlines. Moreover, based on what gets classified as “major”, this might happen frequently. 

In a coming blogpost, we will be reviewing the requirements associated with the ICT-related Incidents Management of DORA, as well as elaborating a self-assessment checklist for financial entities to start assessing their compliance.

Digital Operational Resilience Testing

DORA introduces the testing efficiency of the risk management framework and measures in place to respond to and recover from a wide range of ICT incident scenarios. This process is reflected in Chapter 4 of the regulation.

What is required?

The underlying rationale behind this part of the regulation would be that undetected vulnerabilities in financial entities could threaten the stability of the financial sector. In order to mitigate this risk, DORA introduces a comprehensive testing program with the aim to identify and explore possible ways in which financial entities could be compromised.

Digital operational resilience testing serves for the periodic testing of the ICT risk management framework for preparedness and identification of weaknesses, deficiencies or gaps, as well as the prompt adoption of corrective measures.

DORA also strongly recommends advanced testing of ICT tools, systems and processes based on threat led penetration testing (“TLPT”), carried out at least every 3 years. The technical standards to apply, when conducting intelligence-based penetration testing, are likely to be aligned with the TIBER-EU developed by the ECB.

The Digital Operational Resilience Testing requirements are therefore organised around:

  • Basic Testing of ICT tools and systems – Applicable to all financial entities
  • Advanced Testing of ICT tools, systems and processes (“TLPT”) – Only applicable to  financial entities identified as significant by competent authorities

What do we consider as potential challenges for most organizations?

We foresee two elements that might rise additional complexity:

1. First, from a cultural standpoint, a challenge might be that financial entities see or perceive Operational Resilience testing as BCP or DR testing. A caution has to be raised here as the objective of DORA with this requirements focuses more on penetration testing than the traditional Operational Resilience testing.

From another cultural standpoint, resilience testing programs should not be perceived as a single goal. It should not be perceive as a binary value concept (either it is in place or not). As stated, the underlying behind DORA is rather about identifying weaknesses, deficiencies or gaps, and admitting that a breach might happen or a vulnerability could go undetected. DORA is therefore more about preparing to withstand just such a possibility.

2. Second, as stated, significant financial entities (might be firms already in the scope of NIS regulation) will have to implement a threat-led penetration testing program and exercise. It is likely that this first exercise will have to be organized by the end of 2024. This might seem like a sufficient period to time for these tests to be conducted, however, consider that these types of tests will require a lot of preparation. First, all EU-based critical ICT third parties are required to be involved. This means that all of these third-parties should also be involved in the preparation of this exercise, which will require a lot of coordination and planning beforehand. Second, the scenario for these threat-led penetration testing exercises will have to be agreed by the regulator in advance. Significant financial entities should therefore start thinking about the scenario as soon as possible to enable validation with the regulator at least 2 years before the deadline.  

In a coming blogpost, we will be reviewing the requirements associated with the Resilience Testing of DORA, as well as elaborating a self-assessment checklist for financial entities to start assessing their compliance.

ICT Third-Party Risk Management

DORA introduces the governance of third-party service providers and the management of third-party risks. DORA states that financial entities should have appropriate level of controls and monitoring of their ICT third parties. This process is reflected in Chapter 5 of the regulation.

What is required?

Chapter 5 addresses the key principles for a sound management of ICT Third-Party risks. In a nutshell, the main requirements associated with these key principles could be described as the following:

  • Obligatory Contractual Provisions :
    • DORA introduces obligatory provisions that have to be present in any contract concluded between a financial institution and an ICT third-party provider.
  • ICT third-party risk strategy definition :
    • Firm shall define a multi-vendor ICT third-party risk strategy and policy owned by a member of the management body.
  • Maintenance of a Register of Information :
    • Firms shall define and maintain a register of information that contains the full view of all their ICT third-party providers, the services they provide and the functions they underpin according to the key contractual provisions.
  • Perform due diligence/assessments :
    • Firms shall assess ICT service providers according to certain criteria before entering into a contractual arrangement on the use of ICT services (e.g. security level, concentration risk, sub-outsourcing risks).

What do we consider as potential challenges for most organizations?

We foresee several elements that might rise additional complexity:

1. One of the main challenges that we foresee relates to the assembling and maintenance of the Register of Information. Financial entities will have to collect information on all ICT vendors (not only the most critical).  

This might create additional complexity as DORA states that this register shall be maintained at entity level and, at sub-consolidated and consolidated levels. DORA also states that this register shall include all contractual arrangements on the use of ICT services provided, identifying the services the third-party provided and the functions they underpin. 

This requirement could be considered as a challenge, on one hand, for large financial entities that rely on thousands of big and small providers, as well as on the other hand, for smaller, less mature financial institutions that will have to ensure that that register of information is complete and accurate.

Some other challenges also have to be foreseen.

2. Contracts with all ICT providers will probably need to be amended. For “EBA” critical contracts this will be covered through the EBA directive on this, however for others (if all ICT providers are affected) this will not be the case yet. Identifying those, and upgrading their contracts will be a challenge.

3. Regarding the Exit strategy, and following the same reasoning, for “EBA” critical contracts this will be covered through the EBA directive on this, however for others this might not be the case yet. Determining how to enforce this requirement in these contract will also have to be seen as creating additional complexity.

4. Determining a correct risk-based approach for performing assessments on the ICT providers will possibly add additional complexity as well. Performing assessments on all ICT providers is not feasible. ICT providers will have to be prioritized based on criticality criteria that will have to be defined.

In a coming blogpost, we will be reviewing the requirements associated with the ICT Third-Party Risk Management of DORA, as well as elaborating a self-assessment checklist for financial entities to start assessing their compliance.

Information and Intelligence Sharing

DORA promotes information-sharing arrangements on cyber threat information and intelligence. This process is reflected in Chapter 6 of the regulation.

What is required?

DORA introduces guidelines on setting up information sharing arrangements between firms to exchange among themselves cyber threat information and intelligence on tactics, techniques, procedures, alerts and configuration tools in a trusted environment.

What do we consider as potential challenges for most organizations?

While, many organisations already have such agreements in place, such challenges might still emerge as: 

  • How will you determine what information to share? There should be a balance between helping the community and ensuring alignment with laws and regulations, as well as not sharing sensitive information with competition
  • How will you share this information efficiently?
  • What processes will you set up to consume the shared information by other entities?

Preparing yourself

In order to be ready, we recommend organisations take the following steps in 2021 and 2022:

  • Conduct a maturity assessment against the DORA requirements and define a mitigation plan to reach compliance.
  • Start consolidating the register of information for all ICT third-party providers.
  • Start defining a potential scenario for the large-scale penetration test.

About the Author

Nicolas is a consultant in the Cyber Strategy & Culture team at NVISO. He taps into his technical hands-on experiences as well as his managerial academic background to help organisations build out their Cyber Security Strategy. He has a strong interest IT management, Digital Transformation, Information Security and Data Protection. In his personal life, he likes adventurous vacations. He hiked several 4000+ summits around the world, and secretly dreams about one day hiking all of the top summits. In his free time, he is an academic teacher who has been teaching for 7 years at both the Solvay Brussels School of Economics and Management and the Brussels School of Engineering. 

Find out more about Nicolas on Linkedin.

✇ VerSprite

What is PASTA Threat Modeling?

By: Tony UcedaVélez

PASTA threat modeling is a leading threat model methodology that allows you to realize an attacker's motivations, perform risk analysis, and prioritize risks based on their business impact. This article breaks down all seven stages of the process.

The post What is PASTA Threat Modeling? appeared first on VerSprite.

✇ CrowdStrike

Mean Time to Repair (MTTR) Explained

By: Humio Staff

This blog was originally published oct. 28, 2021 on humio.com. Humio is a CrowdStrike Company.

Definition of MTTR

Mean time to repair (MTTR) is a key performance indicator (KPI) that represents the average time required to restore a system to functionality after an incident. MTTR is used along with other incident metrics to assess the performance of DevOps and ITOps, gauge the effectiveness of security processes, evaluate the effectiveness of security solutions, and measure the maintainability of systems.

Service level agreements with third-party providers typically set expectations for MTTR, although repair times are not guaranteed because some incidents are more complex than others. Along the same lines, comparing the MTTR of different organizations is not fruitful because MTTR is highly dependent on unique factors relating to the size and type of the infrastructure and the size and skills of the ITOps and DevOps team. Every business has to determine which metrics will best serve its purposes and how it will put them into action in their unique environment.

Difference Between Common Failure Metrics

Modern enterprise systems are complicated and they can fail in numerous ways. For these reasons, there is no one set of incident metrics every business should use — but there are many to choose from, and the differences can be nuanced.

Mean Time to Detect (MTTD)

Also called mean time to discover, MTTD is the average time between the beginning of a system failure and its detection. As a KPI, MTTD is used to measure the effectiveness of the tools and processes used by DevOps teams.

To calculate MTTP, select a period of time, such as a month, and track the times between the beginning of system outages and their discovery, and then add up the total time and divide it by the number of incidents to find the average. MTTD should be low. If it continues to take longer to detect or or discover system failures (an upward trend), an immediate review should be conducted of the existing incident response management tools and processes.

Mean Time to Identify (MTTI)

This measurement tracks the number of business hours between the moment an alert is triggered and the moment the cybersecurity team begins to investigate that alert. MTTI is helpful in understanding if alert systems are effective and if cybersecurity teams are staffed to the necessary capacity. A high MTTI or an MTTI that is trending in the wrong direction can be an indicator that the cybersecurity team is suffering from alert fatigue.

Mean Time to Recovery (MTTR)

Mean time to recovery is the average time it takes in business hours between the start of an incident and the complete recovery back to normal operations. This incident metric is used to understand the effectiveness of the DevOps and ITOps teams and identify opportunities to improve their processes and capabilities.

Mean Time to Resolve (MTTR)

Mean time to resolve is the average time between the first alert through the post-incident analysis, including the time spent ensuring the failure will not re-occur. It is measured in business hours.

Mean Time Between Failures (MTBF)

Mean time between failures is a key performance metric that measures system reliability and availability. ITOps teams use MTBF to understand which systems or components are performing well and which need to be evaluated for repair or replacement. Knowing MTBF enables preventative maintenance, minimizes reactive maintenance, reduces total downtime and enables teams to prioritize their workload effectively. Historical MTBF data can be used to make better decisions about scheduling maintenance downtime and resource allocation.

MTBF is calculated by tracking the number of hours that elapse between system failures in the ordinary course of operations over a period of time and then finding the average.

Mean Time to Failure (MTTF)

Mean time to failure is a way of looking at uptime vs. downtime. Unlike MTBF, an incident metric that focuses on repairability, MTTF focuses on failures that cannot be repaired. It is used to predict the lifespan of systems. MTTF is not a good fit for every system. For example, systems with long lifespans, such as core banking systems or many industrial control systems, are not good subjects for MTTF metrics because they have such a long lifespan that when they are finally replaced, the replacement will be an entirely different type of system due to technological advances. In cases like that, MTTF is moot.

Conversely, tracking the MTTF of systems with more typical lifespans is a good way to gain insight into which brands perform best or which environmental factors most strongly influence a product’s durability.

MTTR is intended to reduce unplanned downtime and shorten breakout time. But its use also supports a better culture within ITOps teams.When incidents are repaired before users are impacted, DevOps and ITOps are seen as efficient and effective. Resilient system design is encouraged because when DevOps knows its performance will be measured by MTTR, the team will build apps that can be repaired faster, such as by developing apps that are populated by discrete web services so one service failure will not crash the entire app. MTTR, when done properly, includes post-incident analysis, which should be used to inform a feedback loop that leads to better software builds in the future and encourages the fixing of bugs early in the SDLC process.

How to Calculate Mean Time to Repair

The MTTR formula is straightforward: Simply add up the total unplanned repair time spent on a system within a certain time frame and divide the results by the total number of relevant incidents.

For example, if you have a system that fails four times in one workday and you spend an hour repairing each of those instances of failure, your MTTR would be 15 minutes (60 minutes / 4 = 15 minutes).

However, not all outages are equal. The time spent repairing a failed component or a customer-facing system that goes down during peak hours is more expensive in terms of lost sales, productivity or brand damage than time spent repairing a non-critical outage in the middle of the night. Organizations can establish an “error budget” that specifies that each minute spent repairing the most impactful systems is worth an hour of minutes spent repairing less impactful ones. This level of granularity will help expose the true costs of downtime and provide a better understanding of what MTTR means to the particular organization.

How to Reduce MTTR

There are three elements to reducing MTTR:

  1. Manage resolution process. The first is a defined strategy for managing the resolution process, which should include a post-incident analysis to capture lessons learned.
  2. Build defenses. Technology plays a crucial role, of course, and the best solution will provide visibility, monitoring and corrective maintenance to help root out problems and build defenses against future attacks.
  3. Mitigate the incident. Lastly, the skills necessary to mitigate the incident have to be available.

MTTR can be reduced by increasing budget or headcount, but that isn’t always realistic. Instead, deploy artificial intelligence (AI) and machine learning (ML) to automate as much of the repair process as possible. Those steps include rapid detection, minimization of false positives, smart escalation, and automated remediation that includes workflows that reduce MTTR.

MTTR can be a helpful metric to reduce downtime and streamline your DevOps and ITOps teams, but improving it shouldn’t be the end goal. After all, the point of using metrics is not simply improving numbers but, in this instance, the practical matter of keeping systems running and protecting the business and its customers. Use MTTR in a way that helps your teams protect customers and optimize system uptime.

Improve MTTR With a Modern Log Management Solution

Logs are invaluable for any kind of incident response. Humio’s platform enables complete observability for all streaming logs and event data to help IT organizations better prepare for the unknown and quickly find the root cause of any incident.

Humio leverages modern technologies, including data streaming, index-free architecture and hybrid deployments, to optimize compute resources and minimize storage costs. Because of this, Humio can collect structured and unstructured data in memory to make exploring and investigating data of any size blazing fast.

Humio Community Edition

With a modern log management platform, you can monitor and improve your MTTR. Try it out at no cost!

✇ SentinelLabs

GSOh No! Hunting for Vulnerabilities in VirtualBox Network Offloads

By: Max Van Amerongen


The Pwn2Own contest is like Christmas for me. It’s an exciting competition which involves rummaging around to find critical vulnerabilities in the most commonly used (and often the most difficult) software in the world. Back in March, I was preparing to have a pop at the Vancouver contest and had decided to take a break from writing browser fuzzers to try something different: VirtualBox.

Virtualization is an incredibly interesting target. The complexity involved in both emulating hardware devices and passing data safely to real hardware is astounding. And as the mantra goes: where there is complexity, there are bugs.

For Pwn2Own, it was a safe bet to target an emulated component. In my eyes, network hardware emulation seemed like the right (and usual) route to go. I started with a default component: the NAT emulation code in /src/VBox/Devices/Network/DrvNAT.cpp.

At the time, I just wanted to get a feel for the code, so there was no specific methodical approach to this other than scrolling through the file and reading various parts.

During my scrolling adventure, I landed on something that caught my eye:

#if 0 /* Assertion happens often to me after resuming a VM -- no time to investigate this now. */
   Assert(pThis->enmLinkState == PDMNETWORKLINKSTATE_UP);
   if (pThis->enmLinkState == PDMNETWORKLINKSTATE_UP)
       struct mbuf *m = (struct mbuf *)pSgBuf->pvAllocator;
       if (m)
            * A normal frame.
           pSgBuf->pvAllocator = NULL;
           slirp_input(pThis->pNATState, m, pSgBuf->cbUsed);
            * GSO frame, need to segment it.
           /** @todo Make the NAT engine grok large frames?  Could be more efficient... */
#if 0 /* this is for testing PDMNetGsoCarveSegmentQD. */
           uint8_t         abHdrScratch[256];
           uint8_t const  *pbFrame = (uint8_t const *)pSgBuf->aSegs[0].pvSeg;
           PCPDMNETWORKGSO pGso    = (PCPDMNETWORKGSO)pSgBuf->pvUser;
           uint32_t const  cSegs   = PDMNetGsoCalcSegmentCount(pGso, pSgBuf->cbUsed);  Assert(cSegs > 1);
           for (uint32_t iSeg = 0; iSeg pNATState, pGso->cbHdrsTotal + pGso->cbMaxSeg, &pvSeg, &cbSeg);
               if (!m)
#if 1
               uint32_t cbPayload, cbHdrs;
               uint32_t offPayload = PDMNetGsoCarveSegment(pGso, pbFrame, pSgBuf->cbUsed,
                                                           iSeg, cSegs, (uint8_t *)pvSeg, &cbHdrs, &cbPayload);
               memcpy((uint8_t *)pvSeg + cbHdrs, pbFrame + offPayload, cbPayload);
               slirp_input(pThis->pNATState, m, cbPayload + cbHdrs);

The function used for sending packets from the guest to the network contained a separate code path for Generic Segmentation Offload (GSO) frames and was using memcpy to combine pieces of data.

The next question was of course “How much of this can I control?” and after going through various code paths and writing a simple Python-based constraint solver for all the limiting factors, the answer was “More than I expected” when using the Paravirtualization Network device called VirtIO.

Paravirtualized Networking

An alternative to fully emulating a device is to use paravirtualization. Unlike full virtualization, in which the guest is entirely unaware that it is a guest, paravirtualization has the guest install drivers that are aware that they are running in a guest machine in order to work with the host to transfer data in a much faster and more efficient manner.

VirtIO is an interface that can be used to develop paravirtualized drivers. One such driver is virtio-net, which comes with the Linux source and is used for networking. VirtualBox, like a number of other virtualization software, supports this as a network adapter:

The Adapter Type options

Similarly to the e1000, VirtIO networking works by using ring buffers to transfer data between the guest and the host (In this case called Virtqueues, or VQueues). However, unlike the e1000, VirtIO doesn’t use a single ring with head and tail registers for transmitting but instead uses three separate arrays:

  • A Descriptor array that contains the following data per-descriptor:
    • Address – The physical address of the data being transferred.
    • Length – The length of data at the address.
    • Flags – Flags that determine whether the Next field is in-use and whether the buffer is read or write.
    • Next – Used when there is chaining.
  • An Available ring – An array that contains indexes into the Descriptor array that are in use and can be read by the host.
  • A Used ring – An array of indexes into the Descriptor array that have been read by the host.

This looks as so:

When the guest wishes to send packets to the network, it adds an entry to the descriptor table, adds the index of this descriptor to the Available ring, and then increments the Available Index pointer:

Once this is done, the guest ‘kicks’ the host by writing the VQueue index to the Queue Notify register. This triggers the host to begin handling descriptors in the available ring. Once a descriptor has been processed, it is added to the Used ring and the Used Index is incremented:

Generic Segmentation Offload

Next, some background on GSO is required. To understand the need for GSO, it’s important to understand the problem that it solves for network cards.

Originally the CPU would handle all of the heavy lifting when calculating transport layer checksums or segmenting them into smaller ethernet packet sizes. Since this process can be quite slow when dealing with a lot of outgoing network traffic, hardware manufacturers started implementing offloading for these operations, thus removing the strain on the operating system.

For segmentation, this meant that instead of the OS having to pass a number of much smaller packets through the network stack, the OS just passes a single packet once.

It was noticed that this optimization could be applied to other protocols (beyond TCP and UDP) without the need of hardware support by delaying segmentation until just before the network driver receives the message. This resulted in GSO being created.

Since VirtIO is a paravirtualized device, the driver is aware that it is in a guest machine and so GSO can be applied between the guest and host. GSO is implemented in VirtIO by adding a context descriptor header to the start of the network buffer. This header can be seen in the following struct:

struct VNetHdr
   uint8_t  u8Flags;
   uint8_t  u8GSOType;
   uint16_t u16HdrLen;
   uint16_t u16GSOSize;
   uint16_t u16CSumStart;
   uint16_t u16CSumOffset;

The VirtIO header can be thought of as a similar concept to the Context Descriptor in e1000.

When this header is received, the parameters are verified for some level of validity in vnetR3ReadHeader. Then the function vnetR3SetupGsoCtx is used to fill the standard GSO struct used by VirtualBox across all network devices:

typedef struct PDMNETWORKGSO
   /** The type of segmentation offloading we're performing (PDMNETWORKGSOTYPE). */
   uint8_t             u8Type;
   /** The total header size. */
   uint8_t             cbHdrsTotal;
   /** The max segment size (MSS) to apply. */
   uint16_t            cbMaxSeg;
   /** Offset of the first header (IPv4 / IPv6).  0 if not not needed. */
   uint8_t             offHdr1;
   /** Offset of the second header (TCP / UDP).  0 if not not needed. */
   uint8_t             offHdr2;
   /** The header size used for segmentation (equal to offHdr2 in UFO). */
   uint8_t             cbHdrsSeg;
   /** Unused. */
   uint8_t             u8Unused;

Once this has been constructed, the VirtIO code creates a scatter-gatherer to assemble the frame from the various descriptors:

          /* Assemble a complete frame. */
               for (unsigned int i = 1; i  0; i++)
                   unsigned int cbSegment = RT_MIN(uSize, elem.aSegsOut[i].cb);
                   PDMDevHlpPhysRead(pDevIns, elem.aSegsOut[i].addr,
                                     ((uint8_t*)pSgBuf->aSegs[0].pvSeg) + uOffset,
                   uOffset += cbSegment;
                   uSize -= cbSegment;

The frame is passed to the NAT code along with the new GSO structure, reaching the point that drew my interest originally.

Vulnerability Analysis

CVE-2021-2145 – Oracle VirtualBox NAT Integer Underflow Privilege Escalation Vulnerability

When the NAT code receives the GSO frame, it gets the full ethernet packet and passes it to Slirp (a library for TCP/IP emulation) as an mbuf message. In order to do this, VirtualBox allocates a new mbuf message and copies the packet to it. The allocation function takes a size and picks the next largest allocation size from three distinct buckets:

  1. MCLBYTES (0x800 bytes)
  2. MJUM9BYTES (0x2400 bytes)
  3. MJUM16BYTES (0x4000 bytes)
struct mbuf *slirp_ext_m_get(PNATState pData, size_t cbMin, void **ppvBuf, size_t *pcbBuf)
   struct mbuf *m;
   int size = MCLBYTES;
   LogFlowFunc(("ENTER: cbMin:%d, ppvBuf:%p, pcbBuf:%p\n", cbMin, ppvBuf, pcbBuf));
   if (cbMin 

If the supplied size is larger than MJUM16BYTES, an assertion is triggered. Unfortunately, this assertion is only compiled when the RT_STRICT macro is used, which is not the case in release builds. This means that execution will continue after this assertion is hit, resulting in a bucket size of 0x800 being selected for the allocation. Since the actual data size is larger, this results in a heap overflow when the user data is copied into the mbuf.

/** @def AssertMsgFailed
* An assertion failed print a message and a hit breakpoint.
* @param   a   printf argument list (in parenthesis).
#ifdef RT_STRICT
# define AssertMsgFailed(a)  \
   do { \
       RTAssertMsg1Weak((const char *)0, __LINE__, __FILE__, RT_GCC_EXTENSION __PRETTY_FUNCTION__); \
       RTAssertMsg2Weak a; \
       RTAssertPanic(); \
   } while (0)
# define AssertMsgFailed(a)     do { } while (0)

CVE-2021-2310 - Oracle VirtualBox NAT Heap-based Buffer Overflow Privilege Escalation Vulnerability

Throughout the code, a function called PDMNetGsoIsValid is used which verifies whether the GSO parameters supplied by the guest are valid. However, whenever it is used it is placed in an assertion. For example:

DECLINLINE(uint32_t) PDMNetGsoCalcSegmentCount(PCPDMNETWORKGSO pGso, size_t cbFrame)
   size_t cbPayload;
   Assert(PDMNetGsoIsValid(pGso, sizeof(*pGso), cbFrame));
   cbPayload = cbFrame - pGso->cbHdrsSeg;
   return (uint32_t)((cbPayload + pGso->cbMaxSeg - 1) / pGso->cbMaxSeg);

As mentioned before, assertions like these are not compiled in the release build. This results in invalid GSO parameters being allowed; a miscalculation can be caused for the size given to slirp_ext_m_get, making it less than the total copied amount by the memcpy in the for-loop. In my proof-of-concept, my parameters for the calculation of pGso->cbHdrsTotal + pGso->cbMaxSeg used for cbMin resulted in an allocation of 0x4000 bytes, but the calculation for cbPayload resulted in a memcpy call for 0x4065 bytes, overflowing the allocated region.

CVE-2021-2442 - Oracle VirtualBox NAT UDP Header Out-of-Bounds

The title of this post makes it seem like GSO is the only vulnerable offload mechanism in place here; however, another offload mechanism is vulnerable too: Checksum Offload.

Checksum offloading can be applied to various protocols that have checksums in their message headers. When emulating, VirtualBox supports this for both TCP and UDP.

In order to access this feature, the GSO frame needs to have the first bit of the u8Flags member set to indicate that the checksum offload is required. In the case of VirtualBox, this bit must always be set since it cannot handle GSO without performing the checksum offload. When VirtualBox handles UDP packets with GSO, it can end up in the function PDMNetGsoCarveSegmentQD in certain circumstances:

           if (iSeg == 0)
                                        pbSegHdrs, pbFrame, pGso->offHdr2);

The function pdmNetGsoUpdateUdpHdrUfo uses the offHdr2 to indicate where the UDP header is in the packet structure. Eventually this leads to a function called RTNetUDPChecksum:

RTDECL(uint16_t) RTNetUDPChecksum(uint32_t u32Sum, PCRTNETUDP pUdpHdr)
   bool fOdd;
   u32Sum = rtNetIPv4AddUDPChecksum(pUdpHdr, u32Sum);
   fOdd = false;
   u32Sum = rtNetIPv4AddDataChecksum(pUdpHdr + 1, RT_BE2H_U16(pUdpHdr->uh_ulen) - sizeof(*pUdpHdr), u32Sum, &fOdd);
   return rtNetIPv4FinalizeChecksum(u32Sum);

This is where the vulnerability is. In this function, the uh_ulen property is completely trusted without any validation, which results in either a size that is outside of the bounds of the buffer, or an integer underflow from the subtraction of sizeof(*pUdpHdr).

rtNetIPv4AddDataChecksum receives both the size value and the packet header pointer and proceeds to calculate the checksum:

   /* iterate the data. */
   while (cbData > 1)
       u32Sum += *pw;
       cbData -= 2;

From an exploitation perspective, adding large amounts of out of bounds data together may not seem particularly interesting. However, if the attacker is able to re-allocate the same heap location for consecutive UDP packets with the UDP size parameter being added two bytes at a time, it is possible to calculate the difference in each checksum and disclose the out of bounds data.

On top of this, it’s also possible to use this vulnerability to cause a denial-of-service against other VMs in the network:

Got another Virtualbox vuln fixed (CVE-2021-2442)

Works as both an OOB read in the host process, as well as an integer underflow. In some instances, it can also be used to remotely DoS other Virtualbox VMs! pic.twitter.com/Ir9YQgdZQ7

— maxpl0it (@maxpl0it) August 1, 2021


Offload support is commonplace in modern network devices so it’s only natural that virtualization software emulating devices does it as well. While most public research has been focused on their main components, such as ring buffers, offloads don’t appear to have had as much scrutiny. Unfortunately in this case I didn’t manage to get an exploit together in time for the Pwn2Own contest, so I ended up reporting the first two to the Zero Day Initiative and the checksum bug to Oracle directly.

✇ CrowdStrike

Securing the Application Lifecycle with Scale and Speed: Achieving Holistic Workload Security with CrowdStrike and Nutanix

By: Fiona Ing

With virtualization in the data center and further adoption of cloud infrastructure, it’s no wonder why IT, DevOps and security teams grapple with new and evolving security challenges. An increase in virtualized applications and desktops have caused organizations’ attack surfaces to expand quickly, enabling highly sophisticated attackers to take advantage of the minimal visibility and control these teams hold.

The question remains: How can your organization secure your production environments and cloud workloads to ensure that you can build and run apps at speed and with confidence? The answer: CrowdStrike Falcon® on the Nutanix Cloud Platform.

Delivered through CrowdStrike’s single lightweight Falcon agent, your team is enabled to take an adversary-focused approach when securing your Nutanix cloud workloads — all without impacting performance. With scalable and holistic security, your team can achieve comprehensive workload protection and visibility across virtual environments to meet compliance requirements and prevent breaches effectively and efficiently. 

Secure All of Your Cloud Workloads with CrowdStrike and Nutanix

By extending CrowdStrike’s world-class security capabilities into the Nutanix Cloud Platform, you can prevent attacks on virtualized workloads and endpoints on or off the network. The Nutanix-validated, cloud-native Falcon sensor enhances Nutanix’s native security posture for workloads running on Nutanix AHV without compromising your team’s output. By extending CrowdStrike protection to Nutanix deployments, including virtual machines and virtual desktop infrastructure (VDI), you get scalable and comprehensive workload and container breach protection to streamline operations and optimize performance.

CrowdStrike and Nutanix provide your DevOps and Security teams with layered security, so they can build, run and secure applications with confidence at every stage of the application lifecycle. Easily deploy and use the CrowdStrike Falcon sensor without hassle for your Nutanix AHV workloads and environment. 

CrowdStrike’s intelligent cloud-native Falcon agent is powered by the proprietary CrowdStrike Threat Graph®, which captures trillions of high-fidelity signals per day in real time from across the globe, fueling one of the world’s most advanced data platforms for security. The Falcon platform helps you gain real-time protection and visibility across your enterprise, preventing attacks on workloads on and off the network. 

Get Started and Secure Your Linux Workloads in the Cloud

With Nutanix and CrowdStrike, you can feel confident that your Linux workloads are secure on creation by using CrowdStrike’s Nutanix Terraform script built on Nutanix’s Terraform Provider. By deploying the CrowdStrike Falcon sensor during Linux instance creation, the lifecycle of building and securing workloads before they are operational in the cloud is made simple and secure, without operational friction. 

Get started with CrowdStrike and Nutanix by deploying Linux workloads securely with CrowdStrike’s Nutanix Terraform script.

Gain Holistic Security Coverage Without Compromising Performance

With CrowdStrike and Nutanix, you can seamlessly secure your end-to-end production environment, streamline operations and optimize application performance; easily manage storage and virtualization securely with CrowdStrike’s lightweight Falcon agent on the Nutanix Cloud Platform; and secure your Linux workloads with CrowdStrike’s Nutanix Terraform solution. Building, running and securing applications on the Nutanix Cloud Platform takes the burden of managing and securing your production environment off your team and ensures confidence.

Additional Resources 

✇ Cisco Talos

A review of Azure Sphere vulnerabilities: Unsigned code execs, kernel bugs, escalation chains and firmware downgrades

By: [email protected] (Jon Munshaw)
Summary of all the vulnerabilities reported by Cisco Talos in Microsoft Azure Sphere By Claudio Bozzato and Lilith [>_>]. In May 2020, Microsoft kicked off the Azure Sphere Security Research Challenge, a three-month initiative aimed at finding bugs in Azure Sphere. In the first three months,...

[[ This is only the beginning! Please visit the blog for the complete entry ]]
✇ Cisco Talos

Vulnerability Spotlight: PHP deserialize vulnerability in CloudLinux Imunity360 could lead to arbitrary code execution

By: [email protected] (Jon Munshaw)
Marcin “Icewall” Noga of Cisco Talos. Blog by Jon Munshaw.  Cisco Talos recently discovered a vulnerability in the Ai-Bolit functionality of CloudLinux Inc Imunify360 that could lead to arbitrary code execution.  Imunify360 is a security platform for web-hosting servers that allows users...

[[ This is only the beginning! Please visit the blog for the complete entry ]]
✇ Cisco Talos

Vulnerability Spotlight: Multiple vulnerabilities in Advantech R-SeeNet

By: [email protected] (Jon Munshaw)
Yuri Kramarz discovered these vulnerabilities. Blog by Jon Munshaw.  Cisco Talos recently discovered multiple vulnerabilities in the Advantech R-SeeNet monitoring software.  R-SeeNet is the software system used for monitoring Advantech routers. It continuously collects information from...

[[ This is only the beginning! Please visit the blog for the complete entry ]]
✇ Cisco Talos

Back from the dead: Emotet re-emerges, begins rebuilding to wrap up 2021

By: [email protected] (Unknown)
Executive summary Emotet has been one of the most widely distributed threats over the past several years. It has typically been observed being distributed via malicious spam email campaigns, and often leads to additional malware infections as it provides threat actors with an initial foothold in an...

[[ This is only the beginning! Please visit the blog for the complete entry ]]
✇ CrowdStrike

Introduction to the Humio Marketplace

By: Humio Staff

This blog was originally published Oct. 11, 2021 on humio.com. Humio is a CrowdStrike Company.

Humio is a powerful and super flexible platform that allows customers to log everything and answer anything. Users can choose how to ingest their data and choose how to create and manage their data with Humio. The goal of Humio’s marketplace is to provide a variety of packages that power our customers with faster and more convenient ways to get more from their data across a variety of use cases.

What is the Humio Marketplace?

The Humio Marketplace is a collection of prebuilt packages created by Humio, partners and customers that Humio customers can access within the Humio product interface.

These packages are relevant to popular log sources and typically contain a parser and some dashboards and/or saved queries. The package documentation includes advice and guidance on how to best ingest the data into Humio to start getting immediate value from logs.

What is a package?

The Marketplace contains prebuilt packages that are essentially YAML files that describe the Humio assets included in the package. A package can include any or all of: a parser, saved searches, alerts, dashboards, lookup files and labels. The package also includes YAML files for the metadata of the package (such as descriptions and tags, support status and author), and a README file which contains a full description and explanation of any prerequisites, etc.

Packages can be configured as either a Library type package — which means, once installed, the assets are available as templates to build from — or an Application package, which means, once installed, the assets are instantiated and are live immediately.

By creating prebuilt content that is quick and simple to install, we want to make it easier for customers to onboard new log sources to Humio to quickly get value from that data. With this prebuilt content, customers won’t have to work out the best way of ingesting the logs and won’t have to create parsers and dashboards from scratch.

How do I make a package?

Packages are a great way to mitigate manual work, whether that’s taking advantage of prebuilt packages or making your own packages so you don’t have to begin new processes all over.

Anyone can create a Humio package straight from Humio’s interface. We actively encourage customers and partners to create packages and submit those packages for inclusion in the Marketplace if they think they could benefit other customers. Humio will work with package creators to make sure the package meets our standards for inclusion in the Marketplace. By sharing your package with all Humio customers through the Marketplace, you are strengthening the community and allowing others to benefit from your expertise while you, likewise, benefit from others’ expertise.

For some customers, the package will be exactly what they want, but for others, it will be a useful starting point for further customization. All Humio packages are provided under an Apache 2.0 license, so customers are free to adapt and reuse the package as needed.

If I install a package, will it get updated?

Package creators can develop updates in response to changes in log formats or to introduce new functionality and improvements. Updates will be advertised as available in the Marketplace and users can choose to accept the update. The update process will check to see if any local changes have been made to assets installed from the package and, if so, will prompt the user to either overwrite the changes with the standard version from the updated package or to keep the local changes.

Are packages free?

Yes, all Humio packages in the Marketplace are free to use!

Can I use packages to manage my own private Humio content?

Absolutely! Packages are a convenient way for customers to manage their own private Humio content. Packages can be created in the Humio product interface and can be downloaded as a ZIP file and uploaded into a different Humio repository or a different instance of Humio (cloud or hybrid). Customers can also store their Humio packages in a code repository and use their CI/CD tools and the Humio API to deploy and manage Humio assets as they would their own code. This streamlines Humio support and operations and delivers a truly agile approach to log management.

Get started today

To get started with packages is simple. All you need is access to a Humio Cloud service, or if running Humio self-hosted, you need to be on V1.21 or later. To create and install packages, you need the “Change Packages” permission assigned to your Humio user role.

Access the Marketplace from within the Humio product UI (Go to Settings, Packages, then Marketplace to browse the available packages or to create your own package). Try creating a package and uploading it to a different repository. If you create a nice complex dashboard and want to recreate it in a different repository, you know what to do: Create a package; export/import it, and then you don’t need to spend time recreating it!

Let us know what else you want to see in the Marketplace by connecting with us at The Nest or emailing [email protected].

Additional Resources

✇ CrowdStrike

Ransomware (R)evolution Plagues Organizations, But CrowdStrike Protection Never Wavers

By: Thomas Moses - Sarang Sonawane - Liviu Arsene
  • ECrime activities dominate the threat landscape, with ransomware as the main driver
  • Ransomware operators constantly refine their code and the efficacy of their operations
  • CrowdStrike uses improved behavior-based detections to prevent ransomware from tampering with Volume Shadow Copies
  • Volume Shadow Copy Service (VSS) backup protection nullifies attackers’ deletion attempts, retaining snapshots in a recoverable state

Ransomware is dominating the eCrime landscape and is a significant concern for organizations, as it can cause major disruptions. ECrime accounted for over 75% of interactive intrusion activity from July 2020 to June 2021, according to the recent CrowdStrike 2021 Threat Hunting Report. The continually evolving big game hunting (BGH) business model has widespread adoption with access brokers facilitating access, with a major driver being dedicated leak sites to apply pressure for victim compliance. Ransomware continues to evolve, with threat actors implementing components and features that make it more difficult for victims to recover their data. 

Lockbit 2.0 Going for the Popularity Vote

The LockBit ransomware family has constantly been adding new capabilities, including tampering with Microsoft Server Volume Shadow Copy Service (VSS) by interacting with the legitimate vssadmin.exe Windows tool. Capabilities such as lateral movement or destruction of shadow copies are some of the most effective and pervasive tactics ransomware uses.

Figure 1. LockBit 2.0 ransom note (Click to enlarge)

The LockBit 2.0 ransomware has similar capabilities to other ransomware families, including the ability to bypass UAC (User Account Control), self-terminate or check the victim’s system language before encryption to ensure that it’s not in a Russian-speaking country. 

For example, LockBit 2.0 checks the default language of the system and the current user by using the Windows API calls GetSystemDefaultUILanguage and GetUserDefaultUILanguage. If the language code identifier matches the one specified, the program will exit. Figure 2 shows how the language validation is performed (function call 49B1C0).

Figure 2. LockBit 2.0 performing system language validation

LockBit can even perform a silent UAC bypass without triggering any alerts or the UAC popup, enabling it to encrypt silently. It first begins by checking if it’s running under Admin privileges. It does that by using specific API functions to get the process token (NTOpenProcessToken), create a SID identifier to check the permission level (CreateWellKnownSid), and then check whether the current process has sufficient admin privileges (CheckTokenMembership and ZwQueryInformationToken functions).

Figure 3. Group SID permissions for running process

If the process is not running under Admin, it will attempt to do so by initializing a COM object with elevation of the COM interface by using the elevation moniker COM initialization method with guid: Elevation:Administrator!new:{3E5FC7F9-9A51-4367-9063-A120244FBEC7}. A similar elevation trick has been used by DarkSide and REvil ransomware families in the past.

LockBit 2.0 also has lateral movement capabilities and can scan for other hosts to spread to other network machines. For example, it calls the GetLogicalDrives function to retrieve a bitmask of currently available drives to list all available drives on the system. If the found drive is a network share, it tries to identify the name of the resource and connect to it using API functions, such as WNetGetConnectionW, PathRemoveBackslashW, OpenThreadToken and DuplicateToken.

In essence, it’s no longer about targeting and compromising individual machines but entire networks. REvil and LockBit are just some of the recent ransomware families that feature this capability, while others such as Ryuk and WastedLocker share the same functionality. The CrowdStrike Falcon OverWatch™ team found that in 36% of intrusions, adversaries can move laterally to additional hosts in less than 30 minutes, according to the CrowdStrike 2021 Threat Hunting Report.

Another interesting feature of LockBit 2.0 is that it prints out the ransom note message on all connected printers found in the network, adding public shaming to its encryption and data exfiltration capabilities.

VSS Tampering: An Established Ransomware Tactic

The tampering and deletion of VSS shadow copies is a common tactic to prevent data recovery. Adversaries will often abuse legitimate Microsoft administrator tools to disable and remove VSS shadow copies. Common tools include Windows Management Instrumentation (WMI), BCDEdit (a command-line tool for managing Boot Configuration Data) and vssadmin.exe. LockBit 2.0 utilizes the following WMI command line for deleting shadow copies:

C:\Windows\System32\cmd.exe /c vssadmin delete shadows /all /quiet & wmic shadowcopy delete & bcdedit /set {default} bootstatuspolicy ignoreallfailures & bcdedit /set {default} recoveryenabled no

The use of preinstalled operating system tools, such as WMI, is not new. Still, adversaries have started abusing them as part of the initial access tactic to perform tasks without requiring a malicious executable file to be run or written to the disk on the compromised system. Adversaries have moved beyond malware by using increasingly sophisticated and stealthy techniques tailor-made to evade autonomous detections, as revealed by CrowdStrike Threat Graph®, which showed that 68% of detections indexed in April-June 2021 were malware-free.

VSS Protection with CrowdStrike

CrowdStrike Falcon takes a layered approach to detecting and preventing ransomware by using behavior-based indicators of attack (IOAs) and advanced machine learning, among other capabilities. We are committed to continually improving the efficacy of our technologies against known and unknown threats and adversaries. 

CrowdStrike’s enhanced IOA detections accurately distinguish malicious behavior from benign, resulting in high-confidence detections. This is especially important when ransomware shares similar capabilities with legitimate software, like backup solutions. Both can enumerate directories and write files that on the surface may seem inconsequential, but when correlated with other indicators on the endpoint, can identify a legitimate attack. Correlating seemingly ordinary behaviors allows us to identify opportunities for coverage across a wide range of malware families. For example, a single IOA can provide coverage for multiple families and previously unseen ones.

CrowdStrike’s recent innovation involves protecting shadow copies from being tampered with, adding another protection layer to mitigate ransomware attacks. Protecting shadow copies helps potentially compromised systems restore encrypted data with much less time and effort. Ultimately, this helps reduce operational costs associated with person-hours spent spinning up encrypted systems post-compromise.

The Falcon platform can prevent suspicious processes from tampering with shadow copies and performing actions such as changing file size to render the backup useless. For instance, should a LockBit 2.0 ransomware infection occur and attempt to use the legitimate Microsoft administrator tool (vssadmin.exe) to manipulate shadow copies, Falcon immediately detects this behavior and prevents the ransomware from deleting or tampering with them, as shown in Figure 4.

Figure 4. Falcon detects and blocks vssadmin.exe manipulation by LockBit 2.0 ransomware (Click to enlarge)

In essence, while a ransomware infection might be able to encrypt files on a compromised endpoint, Falcon can prevent ransomware from tampering with shadow copies and potentially expedite data recovery for your organization.

Figure 5. Falcon alert on detected and blocked ransomware activity for deleting VSS shadow copies (Click to enlarge)

Shown below is Lockbit 2.0 executing on a system without Falcon protections. Here, vssadmin is used to list the shadow copies. Notice the shadow copy has been deleted after execution.

Below is the same Lockbit 2.0 execution, now with Falcon and VSS protection enabled. The shadow copy is not deleted even though the ransomware has run successfully. Please note, we specifically allowed the ransomware to run during this demonstration.

CrowdStrike prevents the destruction and tampering of shadow copies with volume shadow service backup protection, retaining the snapshots in a recoverable state regardless of threat actors using traditional or new novel techniques. This allows for instant recovery of live systems post-attack through direct snapshot tools or system recovery.

VSS shadow copy protection is just one of the new improvements added to CrowdStrike’s layered approach. We remain committed to our mission to stop breaches, and constantly improving our machine learning and behavior-based detection and protection technologies enables the Falcon platform to identify and protect against tactics, techniques and procedures associated with sophisticated adversaries and threats.

CrowdStrike’s Layered Approach Provides Best-in-Class Protection

The Falcon platform unifies intelligence, technology and expertise to successfully detect and protect against ransomware. Artificial intelligence (AI)-powered machine learning and behavioral IOAs, fueled by a massive data set of trillions of events per week and threat actor intelligence, can identify and block ransomware. Coupled with expert threat hunters that proactively see and stop even the stealthiest of attacks, the Falcon platform uses a layered approach to protect the things that matter most to your organization from ransomware and other threats.

CrowdStrike Falcon endpoint protection packages unify the comprehensive technologies, intelligence and expertise needed to successfully stop breaches. For fully managed detection and response (MDR), Falcon Complete™ seasoned security professionals deliver 403% ROI and 100% confidence.

Indicators of Compromise (IOCs)

File SHA256
LockBit 2.0 0545f842ca2eb77bcac0fd17d6d0a8c607d7dbc8669709f3096e5c1828e1c049

Additional Resources

✇ CrowdStrike

Unexpected Adventures in JSON Marshaling

By: Dylan Bourque

Recently, one of our engineering teams encountered what seemed like a fairly straightforward issue: When they attempted to store UUID values to a database, it produced an error claiming that the value was invalid. With a few tweaks to one of our internal libraries, our team was able to resolve the issue. Or did they?

Fast forward one month later, and a different team noticed a peculiar problem. After deploying a new release, their service began logging strange errors alerting the team that the UUID values from the redrive queue could not be read.

So what went wrong? What we soon realized is that when we added a new behavior to our UUID library to solve our first problem, we inadvertently created a new one. In this blog post, we explore how adding seemingly benign new methods can actually be a breaking change, especially when working with JSON support in Go.  We will explore what we did wrong and how we were able to dig our way out of it. We’ll also outline some best practices for managing this type of change, along with some thoughts on how to avoid breaking things in the first place.

When Closing a Functional Gap Turns Into a Bug

This all started when one of our engineering teams added a new PostgreSQL database and ran into issues. They were attempting to store UUID values in a JSONB column in the PostgreSQL database using our internal csuuid library, which wraps a UUID value and adds some additional functionality specific to our systems. Strangely, the generated SQL being sent to the database always contained an empty string for that column, which is an invalid value.

INSERT INTO table (id, uuid_val) VALUES (42, '');

ERROR: invalid input syntax for type json

Checking the code, we saw that there was no specific logic for supporting database persistence.  Conveniently, the Go standard library already provides the scaffolding for making types compatible with database drivers in the form of the database/sql.Scanner and database/sql/driver.Valuer interfaces. The former is used when reading data from a database driver and the latter for writing values to the driver. Each interface is a single method and, since a csuuid.UUID wraps a github.com/gofrs/uuid.UUID value that already provides the correct implementations, extending the code was straightforward.

With this change, the team was now able to successfully store and retrieve csuuid.UUID values in the database.

Free Wins

As often happens, the temptation of “As long as we’re updating things …” crept in. We noticed that csuuid.UUID also did not include any explicit support for JSON marshaling. Like with the database driver support, the underlying github.com/gofrs/uuid.UUID type already provided the necessary functionality, so extending csuuid.UUID for this feature felt like a free win.

If a type can be represented as a string in a JSON document, then you can satisfy the encoding.TextMarshaler and encoding.TextUnmarshaler interfaces to convert your Go struct to/from a JSON string, rather than satisfying the potentially more complex Marshaler and Unmarshaler interfaces from the encoding/json package.

The excerpt from the documentation for the Go standard library’s json.Marshal() function below (emphasis mine) calls out this behavior:

Marshal traverses the value v recursively. If an encountered value implements the Marshaler interface and is not a nil pointer, Marshal calls its MarshalJSON method to produce JSON. If no MarshalJSON method is present but the value implements encoding.TextMarshaler instead, Marshal calls its MarshalText method and encodes the result as a JSON string. The nil pointer exception is not strictly necessary but mimics a similar, necessary exception in the behavior of UnmarshalJSON.

A UUID is a 128-bit value that can easily be represented as a 32-character string of hex digits; that string format is the typical way they are stored in JSON. Armed with this knowledge, extending csuuid.UUID to “correctly” support converting to/from JSON was another simple bit of code.

Other than a bit of logic to account for the pointer field within csuuid.UUID, these two new methods only had to delegate things to the inner github.com/gofrs/uuid.UUID value.

At this point, we felt like we had solved the original issue and gotten a clear bonus win. We danced a little jig and moved on to the next set of problems.

Celebrations all around!

A Trap Awaits

Unfortunately, all was not well in JSON Land. Several months after applying these changes, we deployed a new release of another of our services and started seeing errors logged about it not being able to read in values from its AWS Simple Queue Service (SQS) queue.  For system stability, we always do canary deployments of new services before rolling out changes to the entire fleet.  The new error logs started when the canary for this service was deployed.

Below are examples of the log messages:

From the new instances:
[ERROR] ..../sqs_client.go:42 - error unmarshaling Message from SQS: json: cannot unmarshal object into Go struct field event.trace_id of type *csuuid.UUID error='json: cannot unmarshal object into Go struct field event.trace_id of type *csuuid.UUID'

From both old and new instances:
[ERROR] ..../sqs_client.go:1138 - error unmarshaling Message from SQS: json: cannot unmarshal string into Go struct field event.trace_id of type csuuid.UUID error='json: cannot unmarshal string into Go struct field event.trace_id of type csuuid.UUID'

After some investigation, we were able to determine that the error was happening because we had inadvertently introduced an incompatibility in the JSON marshaling logic for csuuid.UUID. When one of the old instances wrote a message to the SQS queue and one of the new ones processed it, or vice versa, the code would fail to read in the JSON data, thus logging one of the above messages.

json.Marshal() and json.Unmarshal() Work, Even If by Accident

The hint that unlocked the mystery was noticing the slight difference in the two log messages. Some showed “cannot unmarshal object into Go struct field” and the others showed “cannot unmarshal string into Go struct field.” This difference triggered a memory of that “free win” we celebrated earlier.

The root cause of the bug was that, in prior versions of the csuuid module, the csuuid.UUID type contained only unexported fields, and it had no explicit support for converting to/from JSON. In this case, the fallback behavior of json.Marshal() is to output an empty JSON object, {}. Conversely, in the old code, json.Unmarshal() was able to use reflection to convert that same {} into an empty csuuid.UUID value.

The below example Go program displays this behavior:

With the new code, we were trying to read that empty JSON object {} (which was produced by the old code on another node) as a string containing the hex digits of a UUID. This was because json.Unmarshal() was calling our new UnmarshalText() method and failing, which generated the log messages shown above. Similarly, the new code was producing a string of hex digits where the old code, without the new UnmarshalText() method, expected to get a JSON object.

We encountered a bit of serendipity here, though, because we accidentally discovered that the updated service had been losing those trace ID values called out in the logs for messages that went through the redrive logic. Fortunately, this hidden bug hadn’t caused any actual issues for us.

The snippet below highlights the behavior of the prior versions.

With this bug identified, we were in a quandary. The new code is correct and even fixes the data loss bug illustrated above. However, it  was unable to read in JSON data produced by the old code. As a result, it was dropping those events from the service’s SQS queue, which was not an acceptable option. Additionally, this same issue could be extant in many other services.

A Way Out Presents Itself

Since a Big Bang, deploy-everything-at-once-and-lose-data solution wasn’t tenable, we needed to find a way for csuuid.UUID to support both the existing, invalid JSON data and the new, correct format.

Going back to the documentation for JSON marshaling, UnmarshalText() is the second option for converting from JSON. If a type satisfies encoding/json.Unmarshaler, by providing UnmarshalJSON([]byte) error, then json.Unmarshal() will call that method, passing in the bytes of the JSON data. By implementing that method and using a json.Decoder to process the raw bytes of the JSON stream, we were able to accomplish what we needed.

The core of the solution relied on taking advantage of the previously unknown bug where the prior versions of csuuid.UUID always generated an empty JSON object when serialized. Using that knowledge, we created a json.Decoder to inspect the contents of the raw bytes before populating the csuuid.UUID value.

With this code in place, we were able to: 

  1. Confirm that the service could successfully queue and process messages across versions 
  2. Ensure any csuuid.UUID values are “correctly” marshaled to JSON as hex strings
  3. Write csuuid.UUID values to a database and read them back

Time to celebrate!

Lessons for the Future

Now that our team has resolved this issue, and all is well once again in JSON Land, let’s review a few lessons that we learned from our adventure:

  1. Normally, adding new methods to a type would not be a breaking change, as no consumers would be affected. Unfortunately, some special methods, like those that are involved in JSON marshaling, can generate breaking behavioral changes despite not breaking the consumer-facing API. This is something we overlooked when we got excited about our “free win.”
  2. Even if you don’t do it yourself, future consumers that you never thought of may decide to write values of your type to JSON. If you don’t consider what that representation should look like, the default behavior of Go’s encoding/json package may well do something that is deterministic but most definitely wrong , as was the case when  generating {} as the JSON value for our csuuid.UUID type. Take some time to think about what your type should look like when written to JSON, especially if the type is exported outside of the local module/package.
  3. Don’t forget that the simple, straightforward solutions are not the only ones available. In this scenario, introducing the new MarshalText()/UnmarshalText() methods was the simple, well documented way to correctly support converting csuuid.UUID values to/from JSON. However, doing the simple thing is what introduced the bug. By switching to the lower-level json.Decoder we were able to extend csuuid.UUID to be backwards compatible with the previous  code while also providing the “correct” behavior going forward.

Do you love solving technical challenges and want to embark on exciting engineering adventures? Browse our Engineering job listings and hear from some of the world’s most talented engineers.