✇ CrowdStrike

Shift Left Security: The Magic Elixir for Securing Cloud-Native Apps

By: David Puzas

Developing applications quickly has always been the goal of development teams. Traditionally, that often puts them at odds with the need for testing. Developers might code up to the last minute, leaving little time to find and fix vulnerabilities before deadlines.

During the past decade, this historical push-pull between security and developers led many organizations to look to build security deeper into the application development lifecycle. This new approach, “shift-left security,” is a pivotal part of supporting the DevOps methodology. By focusing on finding and remediating vulnerabilities earlier, organizations can streamline the development process and improve velocity. 

Cloud computing empowers the adoption of DevOps. It offers DevOps teams a centralized platform for testing and deployment. But for DevOps teams to embrace the cloud, security has to be at the forefront of their considerations. For developers, that means making security a part of the continuous integration/continuous delivery (CI/CD) pipeline that forms the cornerstone of DevOps practices.

Out with the Old and In with the New

The CI/CD pipeline is vital to supporting DevOps through the automation of building, testing and deploying applications. It is not enough to just scan applications after they are live. A shift-left approach to security should start the same second that DevOps teams begin developing the application and provisioning infrastructure. By using APIs, developers can integrate security into their toolsets and enable security teams to find problems early. 

Speedy delivery of applications is not the enemy of security, though it can seem that way. Security is meant to be an enabler, an elixir that helps organizations use technology to reach their business goals. Making that a reality, however, requires making it a foundational part of the development process. 

In our Buyer’s Guide for Cloud Workload Protection Platforms, we provide a list of key features we believe organizations should look for to help secure their cloud environments. Automation is crucial. In research from CrowdStrike and Enterprise Strategy Group (ESG), 41% of respondents said that automating the introduction of controls and processes via integration with the software development lifecycle and CI/CD tools is a top priority. Using automation, organizations can keep pace with the elastic, dynamic nature of cloud-native applications and infrastructure.

Better Security, Better Apps

At CrowdStrike, we focus on integrating security into the CI/CD pipeline. As part of the functionality of CrowdStrike’s Falcon Cloud Workload Protection (CWP), customers have the ability to create verified image policies to ensure that only approved images are allowed to progress through the CI/CD pipeline and run in their hosts or Kubernetes clusters. 

The tighter the integration between security and the pipeline, the earlier threats can be identified, and the more the speed of delivery can be accelerated. By seamlessly integrating with Jenkins, Bamboo, GitLab and others, Falcon CWP allows DevOps teams to respond and remediate incidents even faster within the toolsets they use. 

Falcon CWP also continuously scans container images for known vulnerabilities, configuration issues, secrets/keys and OSS licensing issues, and streamlines visibility for security operations by providing insights and context for misconfigurations and compliance violations. It also uses reporting and dashboards to drive alignment across the security operations, DevOps and infrastructure teams. 

Hardening the CI/CD pipeline allows DevOps teams to move fast without sacrificing security. The automation and integration of security into the CI/CD pipeline transforms the DevOps culture into its close relative, DevSecOps, which extends the methodology of DevOps by focusing on building security into the process. As businesses continue to adopt cloud services and infrastructure, forgetting to keep security top of mind is not an option. The CI/CD pipeline represents an attractive target for threat actors. Its criticality means that a compromise could have a significant impact on business and IT operations. 

Baking security into the CI/CD pipeline enables businesses to pursue their digital initiatives with confidence and security. By shifting security left, organizations can identify misconfigurations and other security risks before they impact users. Given the role that cloud computing plays in enabling DevOps, protecting cloud environments and workloads will only take on a larger role in defending the CI/CD pipeline, your applications and, ultimately, your customers. 

To learn more about how to choose security solutions to protect your CI/CD pipeline, download the CrowdStrike Cloud Workload Protection Platform Buyers Guide.


✇ CrowdStrike

Managing Dead Letter Messages: Three Best Practices to Effectively Capture, Investigate and Redrive Failed Messages

By: Chris Cannon

In a recent blog post, Sharding Kafka for Increased Scale and Reliability, the CrowdStrike Engineering Site Reliability Team shared how it overcame scaling limitations within Apache Kafka so that it could quickly and effectively process trillions of events daily. In this post, we focus on the other side of this equation: What happens when one of those messages inevitably fails?

When a message cannot be processed, it becomes what is known as a “dead letter.” The service attempts to process the message by normal means several times to eliminate intermittent failures. However, when all of those attempts fail, the message is ultimately “dead lettered.” In highly scalable systems, these failed messages must be dealt with so that processing can continue on subsequent messages. To retain the dead letter’s information and continue processing messages, the message is stored so that it can be later addressed manually or by an automated tool.

In Best Practices: Improving Fault-Tolerance in Apache Kafka Consumer, we go into great detail about the different failure types and techniques for recovery, which include redriving and dead letters. Here our aim is to solidify those terms and expound upon the processes surrounding these mechanisms. 

Processing dead letters can be a fairly time-consuming and error-prone process. So what can be done to expedite this task and improve its outcome? Here we explore three steps organizations can take to develop the code and infrastructure needed to more effectively and efficiently capture, investigate and redrive dead letter messages.

Dead Letter Basics

  • What is a message? A message is the record of any communication between two or more services.
  • Why does a message fail? Messages can fail for a variety of reasons, some of the most common being an incompatible message format, unavailable dependent services, or a bug in the service processing the message.
  • Why does it matter if a message fails? In most cases, a message is being sent because it is sharing important information with another service. Without that knowledge, the service that should be receiving the message can have outdated or inaccurate information and make bad decisions or be completely unable to act.

Three Best Practices for Resolving Dead Letter Messages

1. Define the infrastructure and code to capture and redrive dead letters

As explained above, a dead letter occurs when a service cannot process a message. Most systems have some mechanism in place, such as a log or object storage, to capture the message, review it, identify the issue, resolve the issue and then retry the message once it’s more likely to succeed. This act of replaying the message is known as “redriving.” 

To enable the redrive process, organizations need two basic things: 1) the necessary infrastructure to capture and store the dead letter messages, and 2) the right code to redrive that message.

Since there could potentially be hundreds of millions of dead letters that need to be stored, we recommend using a storage option that meets these four criteria: low cost (especially critical as your data scales), abundant space (no concerns around running out of storage space), durability (no data loss or corruption) and availability (the data is available to restore during disaster recovery). We use Amazon S3. 

For short-term storage and alerting, we recommend using a message queue technology that allows the user to send messages to be processed at a later point. Then your service can be configured to read from the message queue to begin processing the redrive messages. We use Amazon SQS and Kafka as our message queues.
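
To make this concrete, here is a minimal Go sketch of a redrive consumer: it reads dead letters from an SQS redrive queue, hands them to the service's normal handler and deletes each message only when reprocessing succeeds. This is an illustration rather than our production code; the queue URL and the process function are placeholders, and the example assumes the aws-sdk-go library.

package main

import (
    "log"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/sqs"
)

// process stands in for the service's normal message handler.
func process(body string) error {
    log.Printf("reprocessing message: %s", body)
    return nil
}

func main() {
    // Placeholder URL for the redrive queue.
    const queueURL = "https://sqs.us-east-1.amazonaws.com/123456789012/dead-letter-redrive"

    svc := sqs.New(session.Must(session.NewSession()))

    out, err := svc.ReceiveMessage(&sqs.ReceiveMessageInput{
        QueueUrl:            aws.String(queueURL),
        MaxNumberOfMessages: aws.Int64(10),
        WaitTimeSeconds:     aws.Int64(20), // long polling
    })
    if err != nil {
        log.Fatalf("receive failed: %v", err)
    }

    for _, msg := range out.Messages {
        if err := process(aws.StringValue(msg.Body)); err != nil {
            // Leave the message on the queue so it can be investigated again.
            log.Printf("redrive attempt failed: %v", err)
            continue
        }
        // Delete the message only after it has been successfully reprocessed.
        if _, err := svc.DeleteMessage(&sqs.DeleteMessageInput{
            QueueUrl:      aws.String(queueURL),
            ReceiptHandle: msg.ReceiptHandle,
        }); err != nil {
            log.Printf("delete failed: %v", err)
        }
    }
}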

2. Put tooling in place to make remediation foolproof 

The process outlined above can be very error-prone when done manually, as it involves many steps: finding the message, copying its contents, pasting it into a new message and submitting that message to the queue. If the user misses even one character when copying the message, then it will fail again — and the process will need to be repeated. This process must be done for every failed message, making it potentially time-consuming as well. 

Since the process is the same for processing dead letters, it is possible to automate. To that end, organizations should develop a command-line tool to automate common actions with dead letters such as viewing the dead letter, putting the message in the redrive queue and having the service consume messages from the queue for reprocessing. Engineers will use this command-line tool to diagnose and resolve dead letters the same way — this, in turn, will help reduce the risk of human error.
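
As an illustration only, a bare-bones version of such a tool in Go might look like the following; fetchDeadLetter and submitToRedriveQueue are hypothetical stand-ins for the storage and queue integrations described above.

package main

import (
    "fmt"
    "log"
    "os"
)

// fetchDeadLetter is a placeholder for reading a stored dead letter (for example, from S3).
func fetchDeadLetter(id string) (string, error) {
    return fmt.Sprintf("<body of message %s>", id), nil
}

// submitToRedriveQueue is a placeholder for sending the message to the redrive queue (for example, SQS or Kafka).
func submitToRedriveQueue(body string) error {
    return nil
}

func main() {
    if len(os.Args) != 3 {
        log.Fatal("usage: dlq-tool (view|redrive) <message-id>")
    }
    cmd, id := os.Args[1], os.Args[2]

    body, err := fetchDeadLetter(id)
    if err != nil {
        log.Fatalf("could not fetch dead letter %s: %v", id, err)
    }

    switch cmd {
    case "view":
        fmt.Println(body)
    case "redrive":
        if err := submitToRedriveQueue(body); err != nil {
            log.Fatalf("redrive failed: %v", err)
        }
        fmt.Println("message queued for reprocessing")
    default:
        log.Fatalf("unknown command %q", cmd)
    }
}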

3. Standardize and document the process to ensure ease-of-use 

Our third best practice is around standardization. Because not all engineers will be familiar with the process the organization has for dealing with dead letter messages, it is important to document all aspects of the procedure. Some basic questions your documentation should address include: 

  • How does the organization know when a dead letter message occurs? Is an alert set up? Will an email be sent?
  • How does the team investigate the root cause of the error? Is there a specific phrase they can search for in the logs to find the errors associated with a dead letter?
  • Once it has been investigated and a fix has been deployed, how is the message reprocessed or redrived?

Documenting and standardizing the process in this way ensures that anyone on the team can pick up, solve and redrive dead letters. Ideally, the documentation will be relatively short and intuitive, outlining the following steps:

  • How to read the content of the message and review the logs to help figure out what happened
  • How to run the commands for your dead letter tool
  • How to put the message in the redrive queue to be reprocessed
  • What to do if the message is rejected again

It’s important to have this “cradle-to-grave” mentality when dealing with dead letter messages — pun intended — since a disconnect anywhere within the process could prevent the organization from successfully reprocessing the message.

Conclusion

While many organizations focus on processing massive amounts of messages and scaling those capabilities, it is equally important to ensure errors are captured and solved efficiently and effectively. 

In this blog, we shared our three best practices for organizations to develop the infrastructure and tooling to ensure that any engineer can properly manage a dead letter. But we certainly have more to share! We would be happy to address any specific questions or explore related topics of interest to the community in future blog posts. 

Got a question, comment or idea? Feel free to share your thoughts for future posts on social media via @CrowdStrike.

✇ CrowdStrike

Mean Time to Repair (MTTR) Explained

By: Humio Staff

This blog was originally published Oct. 28, 2021 on humio.com. Humio is a CrowdStrike Company.

Definition of MTTR

Mean time to repair (MTTR) is a key performance indicator (KPI) that represents the average time required to restore a system to functionality after an incident. MTTR is used along with other incident metrics to assess the performance of DevOps and ITOps, gauge the effectiveness of security processes, evaluate the effectiveness of security solutions, and measure the maintainability of systems.

Service level agreements with third-party providers typically set expectations for MTTR, although repair times are not guaranteed because some incidents are more complex than others. Along the same lines, comparing the MTTR of different organizations is not fruitful because MTTR is highly dependent on unique factors relating to the size and type of the infrastructure and the size and skills of the ITOps and DevOps team. Every business has to determine which metrics will best serve its purposes and how it will put them into action in its unique environment.

Difference Between Common Failure Metrics

Modern enterprise systems are complicated and they can fail in numerous ways. For these reasons, there is no one set of incident metrics every business should use — but there are many to choose from, and the differences can be nuanced.

Mean Time to Detect (MTTD)

Also called mean time to discover, MTTD is the average time between the beginning of a system failure and its detection. As a KPI, MTTD is used to measure the effectiveness of the tools and processes used by DevOps teams.

To calculate MTTD, select a period of time, such as a month, and track the times between the beginning of system outages and their discovery; then add up the total time and divide it by the number of incidents to find the average. MTTD should be low. If it continues to take longer to detect or discover system failures (an upward trend), an immediate review should be conducted of the existing incident response management tools and processes.

Mean Time to Identify (MTTI)

This measurement tracks the number of business hours between the moment an alert is triggered and the moment the cybersecurity team begins to investigate that alert. MTTI is helpful in understanding if alert systems are effective and if cybersecurity teams are staffed to the necessary capacity. A high MTTI or an MTTI that is trending in the wrong direction can be an indicator that the cybersecurity team is suffering from alert fatigue.

Mean Time to Recovery (MTTR)

Mean time to recovery is the average time it takes in business hours between the start of an incident and the complete recovery back to normal operations. This incident metric is used to understand the effectiveness of the DevOps and ITOps teams and identify opportunities to improve their processes and capabilities.

Mean Time to Resolve (MTTR)

Mean time to resolve is the average time between the first alert through the post-incident analysis, including the time spent ensuring the failure will not re-occur. It is measured in business hours.

Mean Time Between Failures (MTBF)

Mean time between failures is a key performance metric that measures system reliability and availability. ITOps teams use MTBF to understand which systems or components are performing well and which need to be evaluated for repair or replacement. Knowing MTBF enables preventative maintenance, minimizes reactive maintenance, reduces total downtime and enables teams to prioritize their workload effectively. Historical MTBF data can be used to make better decisions about scheduling maintenance downtime and resource allocation.

MTBF is calculated by tracking the number of hours that elapse between system failures in the ordinary course of operations over a period of time and then finding the average.

Mean Time to Failure (MTTF)

Mean time to failure is a way of looking at uptime vs. downtime. Unlike MTBF, an incident metric that focuses on repairability, MTTF focuses on failures that cannot be repaired. It is used to predict the lifespan of systems. MTTF is not a good fit for every system. For example, systems with long lifespans, such as core banking systems or many industrial control systems, are not good subjects for MTTF metrics: by the time they are finally replaced, the replacement will be an entirely different type of system due to technological advances. In cases like that, MTTF is moot.

Conversely, tracking the MTTF of systems with more typical lifespans is a good way to gain insight into which brands perform best or which environmental factors most strongly influence a product’s durability.

MTTR is intended to reduce unplanned downtime and shorten breakout time, but its use also supports a better culture within ITOps teams. When incidents are repaired before users are impacted, DevOps and ITOps are seen as efficient and effective. Resilient system design is also encouraged: when DevOps knows its performance will be measured by MTTR, the team will build apps that can be repaired faster, for example by composing apps from discrete web services so that a single service failure will not crash the entire app. MTTR, when done properly, includes post-incident analysis, which should inform a feedback loop that leads to better software builds in the future and encourages fixing bugs early in the SDLC.

How to Calculate Mean Time to Repair

The MTTR formula is straightforward: Simply add up the total unplanned repair time spent on a system within a certain time frame and divide the results by the total number of relevant incidents.

For example, if a system fails four times in one workday and you spend a total of one hour repairing those failures, your MTTR is 15 minutes (60 minutes / 4 = 15 minutes).
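
Expressed in code, the formula is just an average of repair durations. The Go sketch below uses illustrative numbers that add up to the one hour in the example above.

package main

import (
    "fmt"
    "time"
)

// mttr returns total unplanned repair time divided by the number of incidents.
func mttr(repairs []time.Duration) time.Duration {
    if len(repairs) == 0 {
        return 0
    }
    var total time.Duration
    for _, r := range repairs {
        total += r
    }
    return total / time.Duration(len(repairs))
}

func main() {
    // Four failures in one workday, one hour of total repair time.
    repairs := []time.Duration{
        20 * time.Minute,
        10 * time.Minute,
        25 * time.Minute,
        5 * time.Minute,
    }
    fmt.Println(mttr(repairs)) // 15m0s
}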

However, not all outages are equal. The time spent repairing a failed component or a customer-facing system that goes down during peak hours is more expensive in terms of lost sales, productivity or brand damage than time spent repairing a non-critical outage in the middle of the night. Organizations can establish an “error budget” that weights each minute spent repairing the most impactful systems as heavily as an hour spent repairing less impactful ones. This level of granularity helps expose the true costs of downtime and provides a better understanding of what MTTR means to the particular organization.

How to Reduce MTTR

There are three elements to reducing MTTR:

  1. Manage resolution process. The first is a defined strategy for managing the resolution process, which should include a post-incident analysis to capture lessons learned.
  2. Build defenses. Technology plays a crucial role, of course, and the best solution will provide visibility, monitoring and corrective maintenance to help root out problems and build defenses against future attacks.
  3. Mitigate the incident. Lastly, the skills necessary to mitigate the incident have to be available.

MTTR can be reduced by increasing budget or headcount, but that isn’t always realistic. Instead, deploy artificial intelligence (AI) and machine learning (ML) to automate as much of the repair process as possible. Those steps include rapid detection, minimization of false positives, smart escalation, and automated remediation that includes workflows that reduce MTTR.

MTTR can be a helpful metric to reduce downtime and streamline your DevOps and ITOps teams, but improving it shouldn’t be the end goal. After all, the point of using metrics is not simply improving numbers but, in this instance, the practical matter of keeping systems running and protecting the business and its customers. Use MTTR in a way that helps your teams protect customers and optimize system uptime.

Improve MTTR With a Modern Log Management Solution

Logs are invaluable for any kind of incident response. Humio’s platform enables complete observability for all streaming logs and event data to help IT organizations better prepare for the unknown and quickly find the root cause of any incident.

Humio leverages modern technologies, including data streaming, index-free architecture and hybrid deployments, to optimize compute resources and minimize storage costs. Because of this, Humio can collect structured and unstructured data in memory to make exploring and investigating data of any size blazing fast.

Humio Community Edition

With a modern log management platform, you can monitor and improve your MTTR. Try it out at no cost!

✇ CrowdStrike

Securing the Application Lifecycle with Scale and Speed: Achieving Holistic Workload Security with CrowdStrike and Nutanix

By: Fiona Ing

With virtualization in the data center and further adoption of cloud infrastructure, it’s no wonder that IT, DevOps and security teams grapple with new and evolving security challenges. An increase in virtualized applications and desktops has caused organizations’ attack surfaces to expand quickly, enabling highly sophisticated attackers to take advantage of the minimal visibility and control these teams hold.

The question remains: How can your organization secure your production environments and cloud workloads to ensure that you can build and run apps at speed and with confidence? The answer: CrowdStrike Falcon® on the Nutanix Cloud Platform.

Delivered through CrowdStrike’s single lightweight Falcon agent, your team is enabled to take an adversary-focused approach when securing your Nutanix cloud workloads — all without impacting performance. With scalable and holistic security, your team can achieve comprehensive workload protection and visibility across virtual environments to meet compliance requirements and prevent breaches effectively and efficiently. 

Secure All of Your Cloud Workloads with CrowdStrike and Nutanix

By extending CrowdStrike’s world-class security capabilities into the Nutanix Cloud Platform, you can prevent attacks on virtualized workloads and endpoints on or off the network. The Nutanix-validated, cloud-native Falcon sensor enhances Nutanix’s native security posture for workloads running on Nutanix AHV without compromising your team’s output. By extending CrowdStrike protection to Nutanix deployments, including virtual machines and virtual desktop infrastructure (VDI), you get scalable and comprehensive workload and container breach protection to streamline operations and optimize performance.

CrowdStrike and Nutanix provide your DevOps and Security teams with layered security, so they can build, run and secure applications with confidence at every stage of the application lifecycle. Easily deploy and use the CrowdStrike Falcon sensor without hassle for your Nutanix AHV workloads and environment. 

CrowdStrike’s intelligent cloud-native Falcon agent is powered by the proprietary CrowdStrike Threat Graph®, which captures trillions of high-fidelity signals per day in real time from across the globe, fueling one of the world’s most advanced data platforms for security. The Falcon platform helps you gain real-time protection and visibility across your enterprise, preventing attacks on workloads on and off the network. 

Get Started and Secure Your Linux Workloads in the Cloud

With Nutanix and CrowdStrike, you can feel confident that your Linux workloads are secure on creation by using CrowdStrike’s Nutanix Terraform script built on Nutanix’s Terraform Provider. By deploying the CrowdStrike Falcon sensor during Linux instance creation, the lifecycle of building and securing workloads before they are operational in the cloud is made simple and secure, without operational friction. 

Get started with CrowdStrike and Nutanix by deploying Linux workloads securely with CrowdStrike’s Nutanix Terraform script.

Gain Holistic Security Coverage Without Compromising Performance

With CrowdStrike and Nutanix, you can seamlessly secure your end-to-end production environment, streamline operations and optimize application performance; easily manage storage and virtualization securely with CrowdStrike’s lightweight Falcon agent on the Nutanix Cloud Platform; and secure your Linux workloads with CrowdStrike’s Nutanix Terraform solution. Building, running and securing applications on the Nutanix Cloud Platform takes the burden of managing and securing your production environment off your team and ensures confidence.


✇ CrowdStrike

Introduction to the Humio Marketplace

By: Humio Staff

This blog was originally published Oct. 11, 2021 on humio.com. Humio is a CrowdStrike Company.

Humio is a powerful and super flexible platform that allows customers to log everything and answer anything. Users can choose how to ingest their data and how to create and manage it within Humio. The goal of the Humio Marketplace is to give our customers faster and more convenient ways to get more from their data across a variety of use cases.

What is the Humio Marketplace?

The Humio Marketplace is a collection of prebuilt packages created by Humio, partners and customers that Humio customers can access within the Humio product interface.

These packages are relevant to popular log sources and typically contain a parser and some dashboards and/or saved queries. The package documentation includes advice and guidance on how to best ingest the data into Humio to start getting immediate value from logs.

What is a package?

The Marketplace contains prebuilt packages that are essentially YAML files that describe the Humio assets included in the package. A package can include any or all of: a parser, saved searches, alerts, dashboards, lookup files and labels. The package also includes YAML files for the metadata of the package (such as descriptions and tags, support status and author), and a README file which contains a full description and explanation of any prerequisites, etc.

Packages can be configured as either a Library type package — which means, once installed, the assets are available as templates to build from — or an Application package, which means, once installed, the assets are instantiated and are live immediately.

By creating prebuilt content that is quick and simple to install, we want to make it easier for customers to onboard new log sources to Humio to quickly get value from that data. With this prebuilt content, customers won’t have to work out the best way of ingesting the logs and won’t have to create parsers and dashboards from scratch.

How do I make a package?

Packages are a great way to reduce manual work, whether that means taking advantage of prebuilt packages or creating your own so you don’t have to start from scratch each time.

Anyone can create a Humio package straight from Humio’s interface. We actively encourage customers and partners to create packages and submit those packages for inclusion in the Marketplace if they think they could benefit other customers. Humio will work with package creators to make sure the package meets our standards for inclusion in the Marketplace. By sharing your package with all Humio customers through the Marketplace, you are strengthening the community and allowing others to benefit from your expertise while you, likewise, benefit from others’ expertise.

For some customers, the package will be exactly what they want, but for others, it will be a useful starting point for further customization. All Humio packages are provided under an Apache 2.0 license, so customers are free to adapt and reuse the package as needed.

If I install a package, will it get updated?

Package creators can develop updates in response to changes in log formats or to introduce new functionality and improvements. Updates will be advertised as available in the Marketplace and users can choose to accept the update. The update process will check to see if any local changes have been made to assets installed from the package and, if so, will prompt the user to either overwrite the changes with the standard version from the updated package or to keep the local changes.

Are packages free?

Yes, all Humio packages in the Marketplace are free to use!

Can I use packages to manage my own private Humio content?

Absolutely! Packages are a convenient way for customers to manage their own private Humio content. Packages can be created in the Humio product interface and can be downloaded as a ZIP file and uploaded into a different Humio repository or a different instance of Humio (cloud or hybrid). Customers can also store their Humio packages in a code repository and use their CI/CD tools and the Humio API to deploy and manage Humio assets as they would their own code. This streamlines Humio support and operations and delivers a truly agile approach to log management.

Get started today

Getting started with packages is simple. All you need is access to a Humio Cloud service or, if running Humio self-hosted, to be on V1.21 or later. To create and install packages, you need the “Change Packages” permission assigned to your Humio user role.

Access the Marketplace from within the Humio product UI (Go to Settings, Packages, then Marketplace to browse the available packages or to create your own package). Try creating a package and uploading it to a different repository. If you create a nice complex dashboard and want to recreate it in a different repository, you know what to do: Create a package; export/import it, and then you don’t need to spend time recreating it!

Let us know what else you want to see in the Marketplace by connecting with us at The Nest or emailing [email protected].


✇ CrowdStrike

Ransomware (R)evolution Plagues Organizations, But CrowdStrike Protection Never Wavers

By: Thomas Moses - Sarang Sonawane - Liviu Arsene
  • ECrime activities dominate the threat landscape, with ransomware as the main driver
  • Ransomware operators constantly refine their code and the efficacy of their operations
  • CrowdStrike uses improved behavior-based detections to prevent ransomware from tampering with Volume Shadow Copies
  • Volume Shadow Copy Service (VSS) backup protection nullifies attackers’ deletion attempts, retaining snapshots in a recoverable state

Ransomware is dominating the eCrime landscape and is a significant concern for organizations, as it can cause major disruptions. ECrime accounted for over 75% of interactive intrusion activity from July 2020 to June 2021, according to the recent CrowdStrike 2021 Threat Hunting Report. The continually evolving big game hunting (BGH) business model has seen widespread adoption, with access brokers facilitating access and dedicated leak sites driving victim compliance through public pressure. Ransomware continues to evolve, with threat actors implementing components and features that make it more difficult for victims to recover their data.

LockBit 2.0 Going for the Popularity Vote

The LockBit ransomware family has constantly been adding new capabilities, including tampering with Microsoft Server Volume Shadow Copy Service (VSS) by interacting with the legitimate vssadmin.exe Windows tool. Capabilities such as lateral movement or destruction of shadow copies are some of the most effective and pervasive tactics ransomware uses.

Figure 1. LockBit 2.0 ransom note (Click to enlarge)

The LockBit 2.0 ransomware has similar capabilities to other ransomware families, including the ability to bypass UAC (User Account Control), self-terminate or check the victim’s system language before encryption to ensure that it’s not in a Russian-speaking country. 

For example, LockBit 2.0 checks the default language of the system and the current user by using the Windows API calls GetSystemDefaultUILanguage and GetUserDefaultUILanguage. If the language code identifier matches the one specified, the program will exit. Figure 2 shows how the language validation is performed (function call 49B1C0).

Figure 2. LockBit 2.0 performing system language validation
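
For illustration, the short Go sketch below performs an equivalent check by calling the same two Windows APIs through kernel32.dll. It is a defensive demonstration rather than a reproduction of LockBit’s code, and the LANGID it compares against (0x0419, Russian) is only an example.

// This example builds on Windows only.
package main

import (
    "fmt"
    "syscall"
)

func main() {
    kernel32 := syscall.NewLazyDLL("kernel32.dll")
    getSystemUILang := kernel32.NewProc("GetSystemDefaultUILanguage")
    getUserUILang := kernel32.NewProc("GetUserDefaultUILanguage")

    // Each call returns a LANGID in the low 16 bits of the result.
    sysLang, _, _ := getSystemUILang.Call()
    userLang, _, _ := getUserUILang.Call()

    const ruRU = 0x0419 // LANGID for Russian (Russia), used here as an example
    if uint16(sysLang) == ruRU || uint16(userLang) == ruRU {
        fmt.Println("language matched; malware performing this check would exit here")
        return
    }
    fmt.Println("language did not match; execution would continue")
}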

LockBit can even perform a silent UAC bypass without triggering any alerts or the UAC popup, enabling it to encrypt silently. It begins by checking whether it’s running under Admin privileges, using specific API functions to get the process token (NTOpenProcessToken), create a SID identifier to check the permission level (CreateWellKnownSid), and then check whether the current process has sufficient admin privileges (CheckTokenMembership and ZwQueryInformationToken).

Figure 3. Group SID permissions for running process

If the process is not running under Admin, it will attempt to do so by initializing a COM object with elevation of the COM interface by using the elevation moniker COM initialization method with guid: Elevation:Administrator!new:{3E5FC7F9-9A51-4367-9063-A120244FBEC7}. A similar elevation trick has been used by DarkSide and REvil ransomware families in the past.

LockBit 2.0 also has lateral movement capabilities and can scan for other hosts to spread to other network machines. For example, it calls the GetLogicalDrives function to retrieve a bitmask of currently available drives to list all available drives on the system. If the found drive is a network share, it tries to identify the name of the resource and connect to it using API functions, such as WNetGetConnectionW, PathRemoveBackslashW, OpenThreadToken and DuplicateToken.

In essence, it’s no longer about targeting and compromising individual machines but entire networks. REvil and LockBit are just some of the recent ransomware families that feature this capability, while others such as Ryuk and WastedLocker share the same functionality. The CrowdStrike Falcon OverWatch™ team found that in 36% of intrusions, adversaries can move laterally to additional hosts in less than 30 minutes, according to the CrowdStrike 2021 Threat Hunting Report.

Another interesting feature of LockBit 2.0 is that it prints out the ransom note message on all connected printers found in the network, adding public shaming to its encryption and data exfiltration capabilities.

VSS Tampering: An Established Ransomware Tactic

The tampering and deletion of VSS shadow copies is a common tactic to prevent data recovery. Adversaries will often abuse legitimate Microsoft administrator tools to disable and remove VSS shadow copies. Common tools include Windows Management Instrumentation (WMI), BCDEdit (a command-line tool for managing Boot Configuration Data) and vssadmin.exe. LockBit 2.0 utilizes the following WMI command line for deleting shadow copies:

C:\Windows\System32\cmd.exe /c vssadmin delete shadows /all /quiet & wmic shadowcopy delete & bcdedit /set {default} bootstatuspolicy ignoreallfailures & bcdedit /set {default} recoveryenabled no

The use of preinstalled operating system tools, such as WMI, is not new. Still, adversaries have started abusing them as part of the initial access tactic to perform tasks without requiring a malicious executable file to be run or written to the disk on the compromised system. Adversaries have moved beyond malware by using increasingly sophisticated and stealthy techniques tailor-made to evade autonomous detections, as revealed by CrowdStrike Threat Graph®, which showed that 68% of detections indexed in April-June 2021 were malware-free.

VSS Protection with CrowdStrike

CrowdStrike Falcon takes a layered approach to detecting and preventing ransomware by using behavior-based indicators of attack (IOAs) and advanced machine learning, among other capabilities. We are committed to continually improving the efficacy of our technologies against known and unknown threats and adversaries. 

CrowdStrike’s enhanced IOA detections accurately distinguish malicious behavior from benign, resulting in high-confidence detections. This is especially important when ransomware shares similar capabilities with legitimate software, like backup solutions. Both can enumerate directories and write files that on the surface may seem inconsequential, but when correlated with other indicators on the endpoint, can identify a legitimate attack. Correlating seemingly ordinary behaviors allows us to identify opportunities for coverage across a wide range of malware families. For example, a single IOA can provide coverage for multiple families and previously unseen ones.

CrowdStrike’s recent innovation involves protecting shadow copies from being tampered with, adding another protection layer to mitigate ransomware attacks. Protecting shadow copies helps potentially compromised systems restore encrypted data with much less time and effort. Ultimately, this helps reduce operational costs associated with person-hours spent spinning up encrypted systems post-compromise.

The Falcon platform can prevent suspicious processes from tampering with shadow copies and performing actions such as changing file size to render the backup useless. For instance, should a LockBit 2.0 ransomware infection occur and attempt to use the legitimate Microsoft administrator tool (vssadmin.exe) to manipulate shadow copies, Falcon immediately detects this behavior and prevents the ransomware from deleting or tampering with them, as shown in Figure 4.

Figure 4. Falcon detects and blocks vssadmin.exe manipulation by LockBit 2.0 ransomware (Click to enlarge)

In essence, while a ransomware infection might be able to encrypt files on a compromised endpoint, Falcon can prevent ransomware from tampering with shadow copies and potentially expedite data recovery for your organization.

Figure 5. Falcon alert on detected and blocked ransomware activity for deleting VSS shadow copies (Click to enlarge)

Shown below is LockBit 2.0 executing on a system without Falcon protections. Here, vssadmin is used to list the shadow copies. Notice that the shadow copy has been deleted after execution.

Below is the same LockBit 2.0 execution, now with Falcon and VSS protection enabled. The shadow copy is not deleted even though the ransomware has run successfully. Please note, we specifically allowed the ransomware to run during this demonstration.

CrowdStrike prevents the destruction and tampering of shadow copies with volume shadow service backup protection, retaining the snapshots in a recoverable state regardless of whether threat actors use traditional or novel techniques. This allows for instant recovery of live systems post-attack through direct snapshot tools or system recovery.

VSS shadow copy protection is just one of the new improvements added to CrowdStrike’s layered approach. We remain committed to our mission to stop breaches, and constantly improving our machine learning and behavior-based detection and protection technologies enables the Falcon platform to identify and protect against tactics, techniques and procedures associated with sophisticated adversaries and threats.

CrowdStrike’s Layered Approach Provides Best-in-Class Protection

The Falcon platform unifies intelligence, technology and expertise to successfully detect and protect against ransomware. Artificial intelligence (AI)-powered machine learning and behavioral IOAs, fueled by a massive data set of trillions of events per week and threat actor intelligence, can identify and block ransomware. Coupled with expert threat hunters that proactively see and stop even the stealthiest of attacks, the Falcon platform uses a layered approach to protect the things that matter most to your organization from ransomware and other threats.

CrowdStrike Falcon endpoint protection packages unify the comprehensive technologies, intelligence and expertise needed to successfully stop breaches. For fully managed detection and response (MDR), Falcon Complete™ seasoned security professionals deliver 403% ROI and 100% confidence.

Indicators of Compromise (IOCs)

File: LockBit 2.0
SHA256: 0545f842ca2eb77bcac0fd17d6d0a8c607d7dbc8669709f3096e5c1828e1c049


✇ CrowdStrike

Unexpected Adventures in JSON Marshaling

By: Dylan Bourque

Recently, one of our engineering teams encountered what seemed like a fairly straightforward issue: When they attempted to store UUID values to a database, it produced an error claiming that the value was invalid. With a few tweaks to one of our internal libraries, our team was able to resolve the issue. Or did they?

Fast forward one month, and a different team noticed a peculiar problem. After deploying a new release, their service began logging strange errors alerting the team that the UUID values from the redrive queue could not be read.

So what went wrong? What we soon realized is that when we added a new behavior to our UUID library to solve our first problem, we inadvertently created a new one. In this blog post, we explore how adding seemingly benign new methods can actually be a breaking change, especially when working with JSON support in Go. We will walk through what we did wrong and how we were able to dig our way out of it. We’ll also outline some best practices for managing this type of change, along with some thoughts on how to avoid breaking things in the first place.

When Closing a Functional Gap Turns Into a Bug

This all started when one of our engineering teams added a new PostgreSQL database and ran into issues. They were attempting to store UUID values in a JSONB column in the PostgreSQL database using our internal csuuid library, which wraps a UUID value and adds some additional functionality specific to our systems. Strangely, the generated SQL being sent to the database always contained an empty string for that column, which is an invalid value.

INSERT INTO table (id, uuid_val) VALUES (42, '');

ERROR: invalid input syntax for type json

Checking the code, we saw that there was no specific logic for supporting database persistence.  Conveniently, the Go standard library already provides the scaffolding for making types compatible with database drivers in the form of the database/sql.Scanner and database/sql/driver.Valuer interfaces. The former is used when reading data from a database driver and the latter for writing values to the driver. Each interface is a single method and, since a csuuid.UUID wraps a github.com/gofrs/uuid.UUID value that already provides the correct implementations, extending the code was straightforward.
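
The library code itself isn’t reproduced here, but a minimal sketch of that delegation might look like the following, assuming csuuid.UUID wraps a *github.com/gofrs/uuid.UUID in an unexported pointer field (the field name is illustrative):

package csuuid

import (
    "database/sql/driver"

    "github.com/gofrs/uuid"
)

// UUID wraps a gofrs UUID value behind a pointer field (assumed layout).
type UUID struct {
    uuid *uuid.UUID
}

// Value implements database/sql/driver.Valuer by delegating to the wrapped value.
func (u UUID) Value() (driver.Value, error) {
    if u.uuid == nil {
        return nil, nil
    }
    return u.uuid.Value()
}

// Scan implements database/sql.Scanner by delegating to the wrapped value.
func (u *UUID) Scan(src interface{}) error {
    var inner uuid.UUID
    if err := inner.Scan(src); err != nil {
        return err
    }
    u.uuid = &inner
    return nil
}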

With this change, the team was now able to successfully store and retrieve csuuid.UUID values in the database.

Free Wins

As often happens, the temptation of “As long as we’re updating things …” crept in. We noticed that csuuid.UUID also did not include any explicit support for JSON marshaling. Like with the database driver support, the underlying github.com/gofrs/uuid.UUID type already provided the necessary functionality, so extending csuuid.UUID for this feature felt like a free win.

If a type can be represented as a string in a JSON document, then you can satisfy the encoding.TextMarshaler and encoding.TextUnmarshaler interfaces to convert your Go struct to/from a JSON string, rather than satisfying the potentially more complex Marshaler and Unmarshaler interfaces from the encoding/json package.

The excerpt from the documentation for the Go standard library’s json.Marshal() function below (emphasis mine) calls out this behavior:

Marshal traverses the value v recursively. If an encountered value implements the Marshaler interface and is not a nil pointer, Marshal calls its MarshalJSON method to produce JSON. If no MarshalJSON method is present but the value implements encoding.TextMarshaler instead, Marshal calls its MarshalText method and encodes the result as a JSON string. The nil pointer exception is not strictly necessary but mimics a similar, necessary exception in the behavior of UnmarshalJSON.

A UUID is a 128-bit value that can easily be represented as a 32-character string of hex digits; that string format is the typical way they are stored in JSON. Armed with this knowledge, extending csuuid.UUID to “correctly” support converting to/from JSON was another simple bit of code.

Other than a bit of logic to account for the pointer field within csuuid.UUID, these two new methods only had to delegate things to the inner github.com/gofrs/uuid.UUID value.
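
Continuing the same illustrative sketch (same assumed package, imports and field), the two methods need only delegate as well:

// MarshalText implements encoding.TextMarshaler.
func (u UUID) MarshalText() ([]byte, error) {
    if u.uuid == nil {
        return nil, nil
    }
    return u.uuid.MarshalText()
}

// UnmarshalText implements encoding.TextUnmarshaler.
func (u *UUID) UnmarshalText(text []byte) error {
    var inner uuid.UUID
    if err := inner.UnmarshalText(text); err != nil {
        return err
    }
    u.uuid = &inner
    return nil
}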

At this point, we felt like we had solved the original issue and gotten a clear bonus win. We danced a little jig and moved on to the next set of problems.

Celebrations all around!

A Trap Awaits

Unfortunately, all was not well in JSON Land. Several months after applying these changes, we deployed a new release of another of our services and started seeing errors logged about it not being able to read in values from its AWS Simple Queue Service (SQS) queue.  For system stability, we always do canary deployments of new services before rolling out changes to the entire fleet.  The new error logs started when the canary for this service was deployed.

Below are examples of the log messages:

From the new instances:
[ERROR] ..../sqs_client.go:42 - error unmarshaling Message from SQS: json: cannot unmarshal object into Go struct field event.trace_id of type *csuuid.UUID error='json: cannot unmarshal object into Go struct field event.trace_id of type *csuuid.UUID'

From both old and new instances:
[ERROR] ..../sqs_client.go:1138 - error unmarshaling Message from SQS: json: cannot unmarshal string into Go struct field event.trace_id of type csuuid.UUID error='json: cannot unmarshal string into Go struct field event.trace_id of type csuuid.UUID'

After some investigation, we were able to determine that the error was happening because we had inadvertently introduced an incompatibility in the JSON marshaling logic for csuuid.UUID. When one of the old instances wrote a message to the SQS queue and one of the new ones processed it, or vice versa, the code would fail to read in the JSON data, thus logging one of the above messages.

json.Marshal() and json.Unmarshal() Work, Even If by Accident

The hint that unlocked the mystery was noticing the slight difference in the two log messages. Some showed “cannot unmarshal object into Go struct field” and the others showed “cannot unmarshal string into Go struct field.” This difference triggered a memory of that “free win” we celebrated earlier.

The root cause of the bug was that, in prior versions of the csuuid module, the csuuid.UUID type contained only unexported fields, and it had no explicit support for converting to/from JSON. In this case, the fallback behavior of json.Marshal() is to output an empty JSON object, {}. Conversely, in the old code, json.Unmarshal() was able to use reflection to convert that same {} into an empty csuuid.UUID value.

The below example Go program displays this behavior:
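
A stand-in version of that program is shown here; oldUUID mimics the prior csuuid.UUID, with only an unexported field and no marshaling methods.

package main

import (
    "encoding/json"
    "fmt"

    "github.com/gofrs/uuid"
)

// oldUUID mimics the prior csuuid.UUID: unexported field, no JSON methods.
type oldUUID struct {
    uuid *uuid.UUID
}

func main() {
    id := uuid.Must(uuid.NewV4())

    // Marshaling silently drops the value and produces an empty JSON object.
    out, _ := json.Marshal(oldUUID{uuid: &id})
    fmt.Println(string(out)) // {}

    // Unmarshaling that same empty object "succeeds" and yields an empty value.
    var in oldUUID
    fmt.Println(json.Unmarshal([]byte(`{}`), &in)) // <nil>
}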

With the new code, we were trying to read that empty JSON object {} (which was produced by the old code on another node) as a string containing the hex digits of a UUID. This was because json.Unmarshal() was calling our new UnmarshalText() method and failing, which generated the log messages shown above. Similarly, the new code was producing a string of hex digits where the old code, without the new UnmarshalText() method, expected to get a JSON object.

We encountered a bit of serendipity here, though, because we accidentally discovered that the updated service had been losing those trace ID values called out in the logs for messages that went through the redrive logic. Fortunately, this hidden bug hadn’t caused any actual issues for us.

The snippet below highlights the behavior of the prior versions.
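
Continuing the stand-in program above, the hidden data loss can be illustrated as a message whose trace ID round-trips through JSON and comes back empty:

// event mirrors, illustratively, a queued message carrying a trace ID.
type event struct {
    TraceID oldUUID `json:"trace_id"`
    Payload string  `json:"payload"`
}

func demoDataLoss() {
    id := uuid.Must(uuid.NewV4())
    in := event{TraceID: oldUUID{uuid: &id}, Payload: "work item"}

    b, _ := json.Marshal(in) // {"trace_id":{},"payload":"work item"}

    var out event
    _ = json.Unmarshal(b, &out)
    fmt.Println(out.TraceID.uuid == nil) // true: the trace ID is gone
}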

With this bug identified, we were in a quandary. The new code is correct and even fixes the data loss bug illustrated above. However, it  was unable to read in JSON data produced by the old code. As a result, it was dropping those events from the service’s SQS queue, which was not an acceptable option. Additionally, this same issue could be extant in many other services.

A Way Out Presents Itself

Since a Big Bang, deploy-everything-at-once-and-lose-data solution wasn’t tenable, we needed to find a way for csuuid.UUID to support both the existing, invalid JSON data and the new, correct format.

Going back to the documentation for JSON marshaling, UnmarshalText() is the second option for converting from JSON. If a type satisfies encoding/json.Unmarshaler, by providing UnmarshalJSON([]byte) error, then json.Unmarshal() will call that method, passing in the bytes of the JSON data. By implementing that method and using a json.Decoder to process the raw bytes of the JSON stream, we were able to accomplish what we needed.

The core of the solution relied on taking advantage of the previously unknown bug where the prior versions of csuuid.UUID always generated an empty JSON object when serialized. Using that knowledge, we created a json.Decoder to inspect the contents of the raw bytes before populating the csuuid.UUID value.
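
A sketch of that approach, again on the illustrative wrapper from earlier (and assuming the bytes, encoding/json, fmt and github.com/gofrs/uuid imports), might look like this:

// UnmarshalJSON accepts both the legacy "{}" payload written by old versions
// and the correct hex-string form written by new versions.
func (u *UUID) UnmarshalJSON(data []byte) error {
    dec := json.NewDecoder(bytes.NewReader(data))
    tok, err := dec.Token()
    if err != nil {
        return err
    }
    switch v := tok.(type) {
    case json.Delim:
        // Legacy payload: a JSON object. Validate it and leave the UUID unset.
        if v == '{' {
            var legacy map[string]interface{}
            return json.Unmarshal(data, &legacy)
        }
        return fmt.Errorf("csuuid: unexpected JSON delimiter %v", v)
    case string:
        // Correct payload: the UUID rendered as a string of hex digits.
        var inner uuid.UUID
        if err := inner.UnmarshalText([]byte(v)); err != nil {
            return err
        }
        u.uuid = &inner
        return nil
    case nil:
        // JSON null: leave the UUID unset.
        return nil
    default:
        return fmt.Errorf("csuuid: unexpected JSON token %v", tok)
    }
}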

With this code in place, we were able to: 

  1. Confirm that the service could successfully queue and process messages across versions 
  2. Ensure any csuuid.UUID values are “correctly” marshaled to JSON as hex strings
  3. Write csuuid.UUID values to a database and read them back

Time to celebrate!

Lessons for the Future

Now that our team has resolved this issue, and all is well once again in JSON Land, let’s review a few lessons that we learned from our adventure:

  1. Normally, adding new methods to a type would not be a breaking change, as no consumers would be affected. Unfortunately, some special methods, like those that are involved in JSON marshaling, can generate breaking behavioral changes despite not breaking the consumer-facing API. This is something we overlooked when we got excited about our “free win.”
  2. Even if you don’t do it yourself, future consumers that you never thought of may decide to write values of your type to JSON. If you don’t consider what that representation should look like, the default behavior of Go’s encoding/json package may well do something that is deterministic but most definitely wrong, as was the case when generating {} as the JSON value for our csuuid.UUID type. Take some time to think about what your type should look like when written to JSON, especially if the type is exported outside of the local module/package.
  3. Don’t forget that the simple, straightforward solutions are not the only ones available. In this scenario, introducing the new MarshalText()/UnmarshalText() methods was the simple, well documented way to correctly support converting csuuid.UUID values to/from JSON. However, doing the simple thing is what introduced the bug. By switching to the lower-level json.Decoder, we were able to extend csuuid.UUID to be backwards compatible with the previous code while also providing the “correct” behavior going forward.

Do you love solving technical challenges and want to embark on exciting engineering adventures? Browse our Engineering job listings and hear from some of the world’s most talented engineers.

✇ CrowdStrike

Credentials, Authentications and Hygiene: Supercharging Incident Response with Falcon Identity Threat Detection

By: Tim Parisi
  • CrowdStrike Incident Response teams leverage Falcon Identity Threat Detection (ITD) for Microsoft Active Directory (AD) and Azure AD account authentication visibility, credential hygiene and multifactor authentication implementation
  • Falcon ITD is integrated into the CrowdStrike Falcon® platform and provides alerts, dashboards and custom templates to identify compromised accounts and areas to reduce the attack surface and implement additional security measures
  • Falcon ITD allows our Incident Response teams to quickly identify malicious activity that would have previously only been visible through retroactive log review and audits, helping organizations eradicate threats faster and more efficiently

Incident responders and internal security teams have historically had limited visibility into Microsoft AD and Azure AD during an investigation, which has made containment and remediation more difficult and reliant on the victim organization to provide historical logs for retrospective analysis and perform manual authentication and hygiene audits. Since CrowdStrike acquired Preempt in 2020, the Services team has leveraged a new module in the Falcon platform, Falcon Identity Threat Detection (ITD), to gain timely and rich visibility throughout incident response investigations related to Active Directory, specifically account authentication visibility, credential hygiene and multifactor authentication implementation. This blog highlights the importance of Falcon ITD in incident response and how our incident response teams use Falcon ITD today.

How Falcon ITD Is Leveraged During Incident Response

It’s no secret that one of CrowdStrike’s key differentiators in delivering high-quality, lower-cost investigations to victim organizations is the Falcon platform. Throughout 2021, we have included Falcon ITD in the arsenal of Falcon modules when performing incident response. This new module provides both clients and responders with the following critical data points during a response:

  • Suspicious logins/authentication activity
  • Failed login activity, including password spraying and brute force attempts
  • Inventory of all identities across the enterprise, including stale accounts, with password hygiene scores
  • Identity store (e.g., Active Directory, LDAP/S) verification and assessment to discover any vulnerabilities across multiple domains
  • Consolidated events around user, device, activity and more for improved visibility and pattern identification
  • Creation of a “Watch List” of specific accounts of interest

In a typical incident response investigation, our teams work with clients to understand the high-level Active Directory topology numbers (e.g., domains, accounts, endpoints and domain controllers). Once the domain controllers are identified, the Falcon ITD sensor is installed to begin baselining and assessing accounts, privileges, authentications and AD hygiene, which typically completes within five to 24 hours. Once complete, Falcon ITD telemetry and results are displayed in the Falcon platform for our responders and clients to analyze.  

Figure 1 shows the Falcon ITD Overview dashboard, which features attack surface risk categories and assesses the severity as Low, Medium or High. CrowdStrike responders use this data to understand highly exploitable ways an attacker could escalate privileges, such as non-privileged accounts that have attack paths to privileged accounts, accounts that can be traversed to compromise the privileged accounts’ credentials, or if the current password policies allow accounts with passwords that can be easily cracked.

Figure 1. Overview dashboard in Falcon ITD (Click to enlarge)

Figure 2 shows the main Incidents dashboard. This dashboard highlights suspicious events based on baseline patterns and indicators of authentication activity, and also includes any custom detection patterns the CrowdStrike incident response teams have configured, such as alerting when an account authenticates to a specific system.

Figure 2. Incidents main dashboard in Falcon ITD (Click to enlarge)

CrowdStrike responders leverage this information to understand and confirm findings such as the following scenarios:

  • Credentials were used to perform unusual LDAP activity that fits Service Principal Name (SPN) enumeration patterns 
  • An account entered the wrong two-factor verification code or the identity verification timeout was reached
  • Credentials used are consistent with “pass the hash” (PtH) techniques
  • Unusual LDAP search queries known to be used by the BloodHound reconnaissance tool were performed by an account

In addition to the above built-in policies, CrowdStrike responders, in consultation with clients, may also configure custom rules that will trigger alerts and even enforce controls within Falcon ITD, such as the following:

  • Alert if a specific account or group of accounts authenticates to any system or specific ones
  • Enforce a block for specific accounts from authenticating to any system or specific ones
  • Enforce a block for specific authentication protocols being used 
  • Implement identity verification from a 2FA provider such as Google, Duo or Azure for any account or for a specific one attempting to authenticate via Kerberos, LDAP or NTLM protocols
  • Implement a password reset for any account that has a compromised password

In other cases, responders are looking for additional information on accounts of interest that were observed performing suspicious activity. Typically, incident responders would have to coordinate with the client and have the client’s team provide information about that account (e.g., what group memberships it belongs to, what privileges the account has, and if it is a service or human account). Figure 3 shows how Falcon ITD displays this information and more, including password last change date, password strength and historical account activity. This is another example of how CrowdStrike responders are able to streamline the investigation, allowing our client to focus on getting back to business in a safe and secure manner.

Figure 3. Account information displayed in Falcon ITD

Hygiene and Reconnaissance Case Study

During a recent incident response investigation, CrowdStrike Services identified an eCrime threat actor that maintained intermittent access to the victim’s environment for years. The threat actor leveraged multiple privileged accounts and created a domain administrator account — undetected — to perform reconnaissance, move laterally and gather information from the environment.

CrowdStrike incident responders leveraged Falcon ITD to quickly map out permissions associated with the accounts compromised by the threat actor, and identify password hygiene issues that aided the threat actor. By importing a custom password list into Falcon ITD, incident responders were able to identify accounts that were likely leveraged by the threat actor with the same organizational default or easily guessed password.

Falcon ITD also allowed CrowdStrike’s incident response teams to track the threat actor’s reconnaissance of SMB shares across the victim environment. The threat actor leveraged a legitimate administrative account on a system that did not have Falcon installed. Fortunately, the visibility provided by Falcon ITD still alerted incident responders to this reconnaissance activity, and we coordinated with the client to implement remediations to eradicate the threat actor. 

Multifactor Authentication and Domain Replication Case Study

During another investigation, CrowdStrike incident responders identified a nation-state threat actor that compromised an environment and remained persistent for multiple years. Given the threat actor's sophistication and knowledge of the victim environment's network, Active Directory structure and privileged credential usage, no malware was needed to achieve their objectives.

In light of the multiyear undetected access, CrowdStrike incident responders leveraged Falcon ITD to aid in limiting the threat actor’s mobility by enforcing MFA validation for two scenarios, vastly reducing unauthorized lateral movement capabilities:

  • Enforce MFA (via Duo) for administrator usage of RDP to servers
  • Enforce MFA (via Duo) for any user to RDP from any server to a workstation

Falcon ITD’s detection capabilities were also paramount in identifying the threat actor’s resurgence in the victim network by alerting defenders to a domain replication attack. This allowed defenders to swiftly identify the source of the replication attack, which emanated from the victim’s VPN pool, and take corrective action on the VPN, impacted accounts and remote resources that were accessed by the threat actor.

Conclusion

Falcon Identity Threat Detection provides CrowdStrike incident response teams with another advantage when investigating eCrime or nation-state attacks: increased visibility and control in Active Directory at a speed and scale that was previously unachievable.

✇ CrowdStrike

A Principled Approach to Monitoring Streaming Data Infrastructure at Scale

By: Praveen Yedidi

Virtually every aspect of a modern business depends on having a reliable, secure, real-time, high-quality data stream. So how do organizations design, build and maintain a data processing pipeline that delivers? 

In creating a comprehensive monitoring strategy for CrowdStrike’s data processing pipelines, we found it helpful to consider four main attributes: observability, operability, availability and quality.

We model these attributes along two axes, complexity of implementation and engineer experience, which lets us classify them into four quadrants.

This model helps us consider the challenges involved in building a comprehensive monitoring system and the iterative approach engineers can take to realize benefits while advancing their monitoring strategy.

For example, in the lower-left quadrant, we start with basic observability, which is relatively easy to address and helpful in creating a positive developer experience. As we move along the X axis and up the Y axis, measuring these attributes becomes more challenging and may require significant development effort.

In this post, we explore each of the four quadrants, starting with observability, which focuses on inferring the operational state of our data streaming infrastructure from knowledge of its external outputs. We then explore availability and discuss how we make sure that data keeps flowing end-to-end through our streaming infrastructure without interruption. Next, we discuss simple, repeatable processes for dealing with issues, and the auto-remediations we created to improve operability. Finally, we explore how we improved the efficiency of our processing pipelines and established key indicators and enforceable service level agreements (SLAs) for quality.

Observability

Apache Kafka is a distributed, replicated messaging service platform that serves as a highly scalable, reliable and fast data ingestion and streaming tool. At CrowdStrike, we use Apache Kafka as the main component of our near real-time data processing systems to handle over a trillion events per day.

Ensuring Kafka Cluster Is Operational

When we create a new Kafka cluster, we must establish that it is reachable and operational. We can verify this with a simple external service that continuously produces heartbeat messages to the Kafka cluster and, at the same time, consumes them. If the messages it produces match the messages it consumes, we can be confident that the Kafka cluster is truly operational.
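
As a sketch of what such a heartbeat checker can look like, the following uses the open source segmentio/kafka-go client purely for illustration; the broker address, topic name, interval and timeout are placeholders rather than CrowdStrike's production configuration.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/segmentio/kafka-go"
)

func main() {
	ctx := context.Background()
	brokers := []string{"broker:9092"} // placeholder broker address

	writer := &kafka.Writer{Addr: kafka.TCP(brokers...), Topic: "heartbeat"}
	defer writer.Close()

	reader := kafka.NewReader(kafka.ReaderConfig{
		Brokers: brokers,
		Topic:   "heartbeat",
		GroupID: "heartbeat-checker",
	})
	defer reader.Close()

	for range time.Tick(30 * time.Second) {
		// Produce a heartbeat carrying the current timestamp...
		payload := []byte(time.Now().Format(time.RFC3339Nano))
		if err := writer.WriteMessages(ctx, kafka.Message{Value: payload}); err != nil {
			log.Printf("UNHEALTHY: heartbeat produce failed: %v", err)
			continue
		}

		// ...and confirm the same payload comes back out of the cluster.
		// (A production checker would tolerate older in-flight heartbeats.)
		readCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
		msg, err := reader.ReadMessage(readCtx)
		cancel()
		if err != nil {
			log.Printf("UNHEALTHY: heartbeat consume failed: %v", err)
			continue
		}
		if string(msg.Value) != string(payload) {
			log.Printf("UNHEALTHY: heartbeat mismatch: got %q", msg.Value)
			continue
		}
		log.Println("cluster healthy: heartbeat round trip OK")
	}
}
```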

Once we establish that the cluster is operational, we check on other key metrics, such as the consumer group lag. 

Kafka Lag Monitoring

One of the key metrics to monitor when working with Kafka, as a data pipeline or a streaming platform, is consumer group lag.

When an application consumes messages from Kafka, it commits its offset in order to keep its position in the partition. When a consumer gets stuck for any reason — for example, an error, rebalance or even a complete stop — it can resume from the last committed offset and continue from the same point in time.

Therefore, lag is the delta between the last committed offset and the last produced offset. In other words, lag indicates how far behind your application is in processing up-to-date information. For example, if the latest offset produced to a partition is 1,050 and the consumer group's committed offset is 1,000, that consumer is 50 messages behind on that partition. Also, Kafka persistence is based on retention, meaning that if lag keeps growing, the consumer eventually falls outside the retention window and loses data. The goal is to keep lag to a minimum.
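
To make the arithmetic concrete, here is a small sketch (in Go, purely for illustration) that computes per-partition and total lag from offsets that have already been fetched; how those offsets are obtained, for example from Burrow as described below, is outside the scope of the snippet.

```go
package main

import "fmt"

// partitionLag returns how far a consumer group is behind on one partition.
// logEndOffset is the offset of the next message to be produced; committed is
// the group's last committed offset. Lag is clamped at zero because a freshly
// committed offset can briefly appear ahead of a stale log-end reading.
func partitionLag(logEndOffset, committed int64) int64 {
	lag := logEndOffset - committed
	if lag < 0 {
		return 0
	}
	return lag
}

// totalLag sums lag across all partitions of a topic for one consumer group.
func totalLag(logEndOffsets, committedOffsets map[int]int64) int64 {
	var total int64
	for partition, end := range logEndOffsets {
		total += partitionLag(end, committedOffsets[partition])
	}
	return total
}

func main() {
	logEnd := map[int]int64{0: 1050, 1: 980, 2: 2010}
	committed := map[int]int64{0: 1000, 1: 980, 2: 1900}
	fmt.Println("total lag:", totalLag(logEnd, committed)) // total lag: 160
}
```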

We use Burrow for monitoring Kafka consumer group lag. Burrow is an open source monitoring solution for Kafka that provides consumer lag checking as a service. It monitors committed offsets for all consumers and calculates the status of those consumers on demand. The metrics are exposed via an HTTP endpoint.

It also has configurable notifiers that can send status updates via email or HTTP if a partition status has changed based on predefined lag evaluation rules.

Burrow exposes both status and consumer group lag information in a structured format for a given consumer group across all of the partitions of the topic from which it is consuming. However, there is one drawback with this system: It will only present us with a snapshot of consumer group lag. Having the ability to look back in time and analyze historical trends in this data for a given consumer group is important for us.

To address this, we built a system called Kafka monitor. Kafka monitor fetches the metrics exposed by Burrow and stores them in a time series database. This enables us to analyze historical trends and even perform velocity calculations, such as the mean time for a Kafka consumer to recover from lag.
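
The polling half of such a system can be sketched roughly as follows, assuming Burrow's v3 HTTP endpoint (/v3/kafka/&lt;cluster&gt;/consumer/&lt;group&gt;/lag and its totallag field) and using the Prometheus Go client as a stand-in for the time series database. The addresses, cluster and group names are placeholders; the real Kafka monitor is an internal CrowdStrike service.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Subset of the JSON returned by Burrow's consumer lag endpoint; we only
// need the computed status and the total lag for the group.
type burrowLagResponse struct {
	Status struct {
		Status   string `json:"status"`
		TotalLag int64  `json:"totallag"`
	} `json:"status"`
}

var lagGauge = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "kafka_consumer_group_total_lag",
		Help: "Total consumer group lag as reported by Burrow.",
	},
	[]string{"cluster", "group"},
)

func pollBurrow(burrowAddr, cluster, group string) error {
	url := fmt.Sprintf("%s/v3/kafka/%s/consumer/%s/lag", burrowAddr, cluster, group)
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	var body burrowLagResponse
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return err
	}
	// Recording the value as a gauge gives us the time series needed for
	// historical trend analysis and recovery-time calculations.
	lagGauge.WithLabelValues(cluster, group).Set(float64(body.Status.TotalLag))
	return nil
}

func main() {
	prometheus.MustRegister(lagGauge)
	go func() {
		for range time.Tick(30 * time.Second) {
			if err := pollBurrow("http://burrow:8000", "main-cluster", "event-processor"); err != nil {
				log.Printf("burrow poll failed: %v", err)
			}
		}
	}()
	// Expose the collected metrics for a time series database to scrape.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```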

In the next section, we explore how we implemented auto-remediations, using the consumer group status information from Burrow, to improve the availability and operability in our data infrastructure.

Availability and Operability

Kafka Cluster High Availability 

Initially, our organization relied on one very large Kafka cluster to process incoming events. Over time, we expanded that cluster to manage our truly enormous data stream.

However, as our company continues to grow, scaling our clusters vertically has become both problematic and impractical. Our recent blog post, Sharding Kafka for Increased Scale and Reliability, explores this issue and our solution in greater detail. 

Improved Availability and Operability for Stream Processing Jobs

For our stateless streaming jobs, we noticed that simply relaunching a job when its consumer gets stuck has a good chance of getting the consumer out of the stuck state. However, it is not practical at our scale to relaunch these jobs manually. So we created a tool called AlertResponder. As the name implies, it automatically relaunches a stateless job upon receiving the first consumer-stuck alert.

Of course, we still investigate the root cause afterward. And when the relaunch does not fix the problem, or fails for some reason, AlertResponder escalates by paging an on-call engineer.
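
AlertResponder itself is internal to CrowdStrike, but the remediation policy it implements can be sketched roughly as follows; the Relauncher and Pager interfaces and the stillStuck check are hypothetical stand-ins for the job orchestrator, the paging provider and the Burrow-based status check.

```go
package remediation

import (
	"log"
	"time"
)

// Alert is a simplified consumer-stuck alert for one streaming job.
type Alert struct {
	Job      string
	Consumer string
}

// Relauncher and Pager are hypothetical interfaces standing in for the
// job orchestrator and the paging provider used in production.
type Relauncher interface {
	Relaunch(job string) error
}

type Pager interface {
	Page(msg string)
}

// HandleAlert implements the remediation policy described above: try an
// automatic relaunch first, and escalate to an on-call engineer only if the
// relaunch fails or the consumer is still stuck afterward.
func HandleAlert(a Alert, r Relauncher, p Pager, stillStuck func(consumer string) bool) {
	if err := r.Relaunch(a.Job); err != nil {
		p.Page("relaunch of " + a.Job + " failed: " + err.Error())
		return
	}
	// Give the consumer time to rejoin its group and start committing offsets.
	time.Sleep(2 * time.Minute)
	if stillStuck(a.Consumer) {
		p.Page("consumer " + a.Consumer + " still stuck after relaunch of " + a.Job)
		return
	}
	log.Printf("auto-remediated stuck consumer %s by relaunching %s", a.Consumer, a.Job)
}
```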

The second useful automation we derive from our consumer lag monitoring is streaming job autoscaling. For most of our streams, traffic fluctuates on a daily basis, so using a fixed capacity for all streaming jobs is very inefficient. During peak hours, once traffic exceeds a certain threshold, consumer lag increases dramatically. The direct impact is that customers see increased processing delays and latency at peak hours.

This is where auto-scaling helps. We use two auto-scaling strategies:

  1. Scheduled scaling: For stream processing jobs for which we can reliably predict traffic patterns over the course of a day, we implemented a scheduled auto-scaling strategy. With this strategy, we scale the consumer groups to a predetermined capacity at a known point in time to match the traffic patterns.
  2. Scaling based on consumer lag: For jobs running on our Kubernetes platform, we use KEDA (Kubernetes-based Event Driven Autoscaler) to scale the consumer groups. With KEDA, you can drive the scaling of any container in Kubernetes based on the number of events needing to be processed. We use KEDA's Prometheus scaler. Using the consumer lag metrics available in Prometheus, KEDA calculates the number of containers needed for the streaming jobs and works with the Horizontal Pod Autoscaler (HPA) to scale the deployment accordingly (a simplified version of this calculation is sketched below).
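
Conceptually, the lag-based scaling decision boils down to dividing total lag by a target lag per replica and clamping the result within configured bounds. The sketch below shows that calculation; the threshold and replica bounds are illustrative values, and in practice KEDA and the HPA perform this work for us.

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas mirrors the essence of lag-driven autoscaling: scale the
// consumer group so that each replica is responsible for at most
// targetLagPerReplica messages of backlog, within fixed bounds.
func desiredReplicas(totalLag, targetLagPerReplica int64, minReplicas, maxReplicas int) int {
	n := int(math.Ceil(float64(totalLag) / float64(targetLagPerReplica)))
	if n < minReplicas {
		return minReplicas
	}
	if n > maxReplicas {
		return maxReplicas
	}
	return n
}

func main() {
	// Example: 1.2M messages of lag with a 100k-per-replica target
	// scales the deployment to 12 replicas (within a 3..50 bound).
	fmt.Println(desiredReplicas(1_200_000, 100_000, 3, 50)) // 12
}
```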

Quality

When we talk about the quality of streaming data infrastructure, we are essentially considering two things: 

  1. Efficiency
  2. Conformance to service level agreements (SLAs)

Improving Efficiency Through Redistribution

When lag is uniform across a topic’s partitions, that is typically addressed by horizontal scaling of consumers as discussed above; however, when lag is not evenly distributed across a topic, scaling is much less effective.

Unfortunately, there is no out-of-the-box way to address the issue of lag hotspots on certain partitions of a topic within Kafka. In our recent post, Addressing Uneven Partition Lag in Kafka, we explore our solution and how we coordinate it across our complex ecosystem of more than 300 microservices.

SLA-based Monitoring

It is almost impossible to measure the quality of a service correctly, let alone well, without understanding which behaviors really matter for that service and how to measure and evaluate those behaviors.

Service level indicators (SLIs), like data loss rate and end-to-end latency, are useful to measure the quality of our streaming data infrastructure. 

As an example, we track end-to-end latency through external observation (black-box analysis).

We deploy monitors that submit sample input data to the data pipeline and observe its outputs. These monitors emit end-to-end processing latency metrics that, combined with our alerting framework, are used to raise SLA-based alerts.
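
A minimal sketch of such a black-box probe, again using the segmentio/kafka-go client for illustration; the broker address and topic names are placeholders, and the measured latency is simply logged rather than shipped to our alerting framework.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/segmentio/kafka-go"
)

func main() {
	ctx := context.Background()

	// Write one timestamped probe message to the pipeline's input topic.
	writer := &kafka.Writer{Addr: kafka.TCP("broker:9092"), Topic: "pipeline-input"}
	defer writer.Close()

	sent := time.Now()
	key := []byte("latency-probe")
	if err := writer.WriteMessages(ctx, kafka.Message{
		Key:   key,
		Value: []byte(sent.Format(time.RFC3339Nano)),
	}); err != nil {
		log.Fatalf("probe write failed: %v", err)
	}

	// Wait for the probe to come out the other end of the pipeline.
	reader := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"broker:9092"},
		Topic:   "pipeline-output",
		GroupID: "latency-probe",
	})
	defer reader.Close()

	for {
		msg, err := reader.ReadMessage(ctx)
		if err != nil {
			log.Fatalf("probe read failed: %v", err)
		}
		if string(msg.Key) != string(key) {
			continue // skip regular pipeline traffic
		}
		// End-to-end latency is the SLI fed to our alerting framework.
		log.Printf("end-to-end latency: %s", time.Since(sent))
		return
	}
}
```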

Conclusion

These four attributes — observability, availability, operability and quality — are each important in their own right for designing, operating and maintaining streaming data infrastructure at scale. As discussed above, these attributes have a symbiotic relationship. The four-quadrant model not only exposes this relationship but also offers an intuitive mental model that helps us build a comprehensive monitoring solution for streaming data applications that operate at scale.

Have ideas to share about how you create a high-functioning data processing pipeline? Share your thoughts with @CrowdStrike via social media.
