End-to-end Testing: How a Modular Testing Model Increases Efficiency and Scalability

3 December 2021 at 09:00

In our last post, Testing Data Flows using Python and Remote Functions, we discussed how organizations can use remote functions in Python to create an end-to-end testing and validation strategy. Here we build on that concept and discuss how it is possible to design the code to be more flexible.  

For our purposes, flexible code means two things:

  1. Writing the code in such a way that most of it can be reused
  2. Creating a pool of functionalities that can be combined to create tests that are bigger and more complex.
What is a flow?
A flow is any unique complex sequence of steps and interactions that is independently testable.
Flows can mimic business or functional requirements.
Flows can be combined in any way between themselves to create higher level or specific flows.

The Need to Update the Classic Testing View

A classical testing view is defined by a sequence of steps that do a particular action on the system. This typically contains:

  • A setup that prepares the test environment for the actual test, e.g., creating users or populating data into a DB
  • A series of actions that modify the current state of the system and check the outcome of the performed actions
  • A teardown that returns the system to its initial state before the test

Each of these actions is separate but dependent on the others. This means that for each test, we can assume that the setup has run successfully and, in the end, that the teardown has run to clean it up. In complex test scenarios this can become cumbersome and difficult to orchestrate or reuse.
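
To make those three phases concrete, here is a minimal, self-contained sketch of the classical model written with pytest fixtures. The tiny in-memory "system" and the helper names are illustrative only and are not part of the product discussed later in this post.

import pytest

# A tiny in-memory "system" so the example is self-contained.
USERS = {}


def create_user(name):
    USERS[name] = {"status": "active"}
    return name


def delete_user(name):
    USERS.pop(name, None)


def deactivate_user(name):
    USERS[name]["status"] = "inactive"


@pytest.fixture
def test_user():
    user = create_user("qa-user")   # setup: prepare the test environment
    yield user
    delete_user(user)               # teardown: return the system to its initial state


def test_user_can_be_deactivated(test_user):
    deactivate_user(test_user)                       # action: modify the system state
    assert USERS[test_user]["status"] == "inactive"  # check the outcome of the action

Notice that the setup and teardown are welded to this one test; reusing them for a different scenario usually means writing another fixture or helper, which is exactly the friction the modular approach below tries to remove.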

In a classical testing model, unless the developer writes helpers that are used inside the test, the code typically cannot be reused to build other tests. In addition, when helpers are written, they tend to be specific to certain use cases and scenarios, making them irrelevant in most other situations. On the other hand, some helpers are so generic that they will still require the implementation of additional logic when using them with certain test data.

Finally, using test-centric development means that many test sequences or test scenario steps must be rewritten every time you need them in different combinations of functionality and test data.

CrowdStrike’s Approach: A Modular Testing Model

To avoid these issues, we take a modularized approach to testing. You can imagine each test component as a Lego block, wherein each piece can be made to fit together in order to create something bigger or more complex. 

These flows can map to a specific functionality and should be atomic. As more complex architectures are built, unless the functionality has changed, you don’t need to rewrite existing functions to fit. Rather, you can combine them to follow business context and business use cases. 

The second part of our approach relates to functional programming, which means we create independent, testable functions. We can then separate the data into payloads passed to those functions, making them easy to process in parallel.
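
As a minimal sketch of what that separation can look like (the payload fields and flow name here are illustrative, not the product's actual structures), each flow is a plain function over a list of payloads, so the same function runs unchanged against any dataset that follows the structure:

from dataclasses import dataclass
from typing import List


@dataclass
class Payload:
    host_id: str
    os_version: str


def flow_register_host(payloads: List[Payload]) -> List[str]:
    # An independent, testable unit: no hidden state, everything the flow
    # needs arrives in its payload list.
    return [f"registered {p.host_id} ({p.os_version})" for p in payloads]


smoke_dataset = [Payload("host-1", "10.0")]
regression_dataset = [Payload("host-2", "11.0"), Payload("host-3", "12.0")]

# The same flow works unchanged against either dataset, which is also what
# makes it straightforward to fan the work out across processes later.
print(flow_register_host(smoke_dataset))
print(flow_register_host(regression_dataset))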

Business Case: A Product That Identifies Security Vulnerabilities

To illustrate this point, let’s take a business use case for a product that identifies vulnerabilities in certain applications installed on a PC. The evaluation is based on information about supported installed applications sent by agents installed on the PCs. This information could include the name of the application, the installed version and the architecture (32- or 64-bit). The use case dictates that if a computer is online, the agent sends all relevant information to the cloud, where it is processed and evaluated against a publicly available database of vulnerabilities, the National Vulnerability Database (NVD). (If you are unfamiliar with common vulnerabilities and exposures, or CVEs, learn more here.)

Our testing flows will be designed around the actual product and business flows. You can see below a basic diagram of the architecture for this proposed product.

You can see the following highlights from the diagram above:

  • A database of profiles for different versions of OSes and applications, to cover a wide range of configurations
  • An orchestrator for the tests, called the Test Controller, with the following functionalities:
    • An algorithm for selecting datasets based on the particularities of the scenario it has to run
    • Support for creating constructs for the simulated data
    • Constructs that will be used to create our expected data in order to do our validations post-processing

There is a special section for generating and sending data points to the cloud. This can be distributed and every simulated agent can be run in parallel and scaled horizontally to cover performance use cases.

Interaction with post-processed data is done through internal and external flows, each with its own capabilities and access to data through auto-generated REST/GRPC clients.

Below you can see a diagram of the flows designed to test the product and interaction between them.

A closer look at the flows diagram and code

Flows are organized into separate packages based on actual business needs. They can be differentiated into public and internal flows. A general rule of thumb is that public flows can be used to design test scenarios, whereas internal flows should only be used as helpers inside other flows. Public flows should always implement the same parameters, which are the structures for test data (in our case, simulated hosts and services).

# Build Simulated Hosts/Services Constructs

In this example all data is stored in a simulated host construct. This is created at the beginning of the test based on meaningful data selection algorithms and encapsulates data relevant to the test executed, which may relate to a particular combination of OS or application data.

from __future__ import annotations

from dataclasses import dataclass
from typing import List

import agent_primitives
from api_helpers import (VulnerabilitiesDbApiHelperRest,
                         InternalVulnerabilitiesApiHelperGrpc,
                         ExternalVulnerabilitiesApiHelperRest)

# ArchitectureEnum and get_test_data_from_s3 are internal helpers whose
# definitions are omitted from this post.


@dataclass
class TestDataApp:
    name: str
    version: str
    architecture: str
    vendor: str


@dataclass
class TestDataDevice:
    architecture: ArchitectureEnum
    os_version: str
    kernel: str


@dataclass
class TestDataProfile:
    app: TestDataApp
    device: TestDataDevice


@dataclass
class SimulatedHost:
    id: str
    device: TestDataDevice
    apps: List[TestDataApp]
    agent: agent_primitives.SimulatedAgent


def flow_build_test_data(profile: TestDataProfile, count: int) -> List[SimulatedHost]:
    # Select `count` configurations matching the requested profile and wrap each
    # one in a SimulatedHost construct together with its simulated agent.
    test_data_configurations = get_test_data_from_s3(profile, count)
    simulated_hosts = []
    for configuration in test_data_configurations:
        agent = agent_primitives.get_simulated_agent(configuration)
        host = SimulatedHost(id=configuration.get('id'),
                             device=TestDataDevice(**configuration.get('device')),
                             apps=[TestDataApp(**app) for app in configuration.get('apps')],
                             agent=agent)
        simulated_hosts.append(host)
    return simulated_hosts

Once the Simulated Host construct is created, it can be passed to any public flows that accept that construct. This will be our container of test data to be used in all other testing flows. 

If you need to mutate state or other information related to that construct, any particular flow can return the host constructs so they can be used by another, higher-level flow.
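
For example, a higher-level flow might simply chain the public agent flows defined later in this post and hand the (possibly mutated) constructs back to its caller; a rough sketch of that composition:

def flow_onboard_hosts_and_send_inventory(simulated_hosts: List[SimulatedHost]) -> List[SimulatedHost]:
    # A higher-level public flow built purely by combining existing public
    # flows. It accepts and returns the same SimulatedHost constructs so an
    # even higher-level flow (or the test itself) can keep chaining.
    flow_agent_ready_for_cloud(simulated_hosts)
    flow_agent_send_device_information(simulated_hosts)
    flow_agent_application_information(simulated_hosts)
    return simulated_hosts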

The TestServices construct encompasses all the REST/GRPC service clients needed to interact with cloud services to perform queries, get post-processing data, etc. It is initialized once and passed around wherever it is needed.

@dataclass
class TestServices:
    vulnerabilities_db_api: VulnerabilitiesDbApiHelperRest
    internal_vulnerabilities_api: InternalVulnerabilitiesApiHelperGrpc
    external_vulnerabilities_api: ExternalVulnerabilitiesApiHelperRest
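
A minimal sketch of how that one-time initialization might look; the constructor arguments and endpoint values below are assumptions for illustration, not the actual helper signatures:

def build_test_services() -> TestServices:
    # Built once per test session and passed to every flow that needs cloud
    # access; the endpoints shown here are placeholders.
    return TestServices(
        vulnerabilities_db_api=VulnerabilitiesDbApiHelperRest(base_url="https://nvd-mirror.example.internal"),
        internal_vulnerabilities_api=InternalVulnerabilitiesApiHelperGrpc(target="vuln-api.internal:443"),
        external_vulnerabilities_api=ExternalVulnerabilitiesApiHelperRest(base_url="https://api.example.com"),
    )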

Function + data constructs = flow. Separating data from functionality is crucial in this approach. Besides letting a function work with a large number of payloads that implement the same structure, it also makes curating datasets much easier, because the complex logic for selecting data for particular scenarios stays independent of the function implementation.

# Agent Flows

def flow_agent_ready_for_cloud(simulated_hosts: List[SimulatedHost]):
    for host in simulated_hosts:
        host.agent.ping_cloud()
        host.agent.keepalive()
        host.agent.connect_to_cloud()


def flow_agent_send_device_information(simulated_hosts: List[SimulatedHost]):
    for host in simulated_hosts:
        host.agent.send_device_data(host.id)  # host identifier, standing in for the device name
        host.agent.send_device_data(host.device.architecture)
        host.agent.send_device_data(host.device.os_version)
        host.agent.send_device_data(host.device.kernel)


def flow_agent_application_information(simulated_hosts: List[SimulatedHost]):
    for host in simulated_hosts:
        for app in host.apps:
            host.agent.send_application_data(app.name)
            host.agent.send_application_data(app.version)

Notice how each function’s name captures its main purpose: for example, flow_agent_send_device_information sends device information such as the host identifier, OS version and kernel.

# Internal API Flows

Internal flows are mainly used to gather information from services and perform validations. For validations we use a Python library called PyHamcrest and a generic validation method that compares our internal structures with the expected outcome built at the beginning of the test.

import logging

# The retry decorator is assumed to come from the `retry` package;
# build_expected_flows, test_utils and validate_response_call_and_get_json
# are internal helpers referenced in the text.
from retry import retry


@retry(AssertionError, tries=8, delay=3, backoff=2, logger=logging)
def flow_validate_simulated_hosts_for_vulnerabilities(simulated_hosts: List[SimulatedHost], services: TestServices):
    expected_data = build_expected_flows.evaluate_simulated_hosts_for_vulnerabilities(simulated_hosts, services)
    for host in simulated_hosts:
        actual_data = flow_get_simulated_host_vulnerabilities(host, services)
        augmented_expected = {
            "total": sum([expected_data.total_open,
                          expected_data.total_reopen,
                          expected_data.total_closed])
        }
        actual, expected, missing_fields = test_utils.create_validation_map(actual_data, [expected_data],
                                                                            augmented_expected)
        test_utils.assert_on_validation_map(actual, expected, missing_fields)


def flow_get_simulated_host_vulnerabilities(simulated_host: SimulatedHost, services: TestServices):
    response = services.internal_vulnerabilities_api.get_vulnerabilities(simulated_host)
    response_json = validate_response_call_and_get_json(response, fail_on_errors=True)
    return response_json

We first use a method called create_validation_map, which takes a list of structures containing the data relevant to the actual response from the services. It normalizes all the structures and creates an actual and an expected validation map, which are then used in the assert_on_validation_map method together with a specific PyHamcrest matcher called has_entries to perform the assertion.
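
The internal test_utils helpers are not shown in this post, but the core of the idea can be sketched with PyHamcrest directly: normalize actual and expected data into dictionaries and let has_entries verify that every expected entry is present in the actual map (extra keys in the response are ignored). The dictionaries below are illustrative.

from hamcrest import assert_that, has_entries

actual_response = {"host_id": "host-1", "total": 3, "status": "processed"}
expected_entries = {"host_id": "host-1", "total": 3}

# Passes if every expected key/value pair appears in the actual response;
# raises AssertionError otherwise, which triggers the @retry shown above.
assert_that(actual_response, has_entries(expected_entries))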

The Advantages of a Modular Testing Approach 

There are several key advantages to this approach: 

  1. Increased testing versatility and simplicity. Developers are not dependent on a certain implementation because everything is unique to that function. Modules are independent and can work in any combination. The code base is independently testable. As such, if you have two flows that do the same thing, it means that one can be removed. 
  2. Improved efficiency. It is possible to “double track” most of the flows so that they can be processed in parallel. This means they can run in any sequence and that the load can be spread across a distributed infrastructure. Because there are no dependencies between the flows, you can also parallelize them locally across multiple threads or processes (see the sketch after this list).
  3. Enhanced testing maturity. Taken together, these principles mean that developers can build more and more complex tests by reusing common elements and building on top of what exists. Test modules can be developed in parallel because they don’t have dependencies between them. Every flow covers a small part of functionality.
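
As a minimal sketch of that local parallelization (assuming the flows and SimulatedHost constructs from the earlier examples), the hosts can simply be partitioned across worker threads, because each flow depends only on the payloads passed to it:

from concurrent.futures import ThreadPoolExecutor


def run_flow_in_parallel(flow, simulated_hosts, workers: int = 4):
    # Split the hosts into one chunk per worker and run the same flow on every
    # chunk concurrently; the flows share no state, so no locking is needed.
    chunks = [simulated_hosts[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as executor:
        list(executor.map(flow, chunks))


# Example: fan the agent onboarding flow out across four threads.
# run_flow_in_parallel(flow_agent_ready_for_cloud, hosts, workers=4)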

Final Thoughts: When to Use Flow-based Testing

Flow-based testing works well in end-to-end tests for complex products and distributed architectures because it draws on best practices for writing and testing code at scale. Testing and validation have a basis in experimental science, and implementing a simulated version of the product inside the validation engine is still one of the most comprehensive ways to test the quality of a product. Flow-based testing helps reduce the complexity of building this and makes it more scalable and easier to maintain than the classical testing approach.

However, it is not ideal when testing a single service due to the complexity that exists at the beginning of the process related to data separation and creation of structures to serialize data. In those instances, the team would probably be better served by a classical testing approach. 

Finally, in complex interactions between multiple components, functionality needs to be compartmentalized in order to run it at scale. In that case, flow-based testing is one of the best approaches you can take.

When do you use flow-based testing — and what questions do you have? Sound off on our social media channels @CrowdStrike.

Why Actionable Logs Require Sufficient History

2 December 2021 at 05:16

This blog was originally published Oct. 26, 2021 on humio.com. Humio is a CrowdStrike Company.

Improve visibility and increase insights by logging everything

ITOps, DevOps and SecOps teams need historical log data to ensure the security, performance and availability of IT systems and applications. Detailed historical log data is fundamental for understanding system behavior, mitigating security threats, troubleshooting problems and isolating service quality issues.

But when it comes to indexing, structuring, and maintaining log data, traditional log management solutions are notoriously inefficient and costly. Many businesses today simply can’t afford to gather and retain massive volumes of log data from all their networking gear, security products and other IT platforms using conventional log management solutions.

To make matters worse, many log management vendors use volume-based software licensing schemes that are prohibitively expensive for most businesses. For all these reasons, most organizations limit the types of log records they collect or periodically age out log data, leaving security and IT operations professionals in the dark.

So what can be done about it?

Comprehensive historical log data is fundamental for IT and security operations

Whether you work in DevOps, ITOps or SecOps, comprehensive historical log records are essential tools of the trade. They are critical for:

  • Troubleshooting and root cause analysis. Historical data is fundamental for identifying IT infrastructure issues, pinpointing faults and resolving problems. By going back in time and analyzing detailed log records, you can correlate network and system issues with configuration changes, software upgrades or other adds, moves and changes that might have affected IT infrastructure and impacted applications.
  • Mitigating security threats. Historical data is also fundamental for isolating security breaches and remediating threats. By examining access logs and investigating changes to firewall rules or other security settings, you can pinpoint attacks, take corrective actions and avoid extensive data loss or system downtime.
  • Optimizing performance and service quality. Historical data is vital for identifying compute, storage and networking performance bottlenecks and for optimizing user experience. By analyzing detailed performance data from a variety of sources, development and operations teams can gain insights into design and implementation issues impairing application service quality or response time.

Log everything with Humio

Humio’s flexible, modern architecture improves the log management experience for organizations by transforming massive volumes of historical log data into meaningful and actionable insights, enabling complete observability to answer any question, explore threats and vulnerabilities, and gain valuable insights from all logs in real time. Many organizations still struggle with cost constraints dictating their log strategies, but unlike conventional log management systems, Humio cost-effectively ingests any amount of data at any throughput, providing the full visibility needed to identify, isolate and resolve the most complex issues. The TCO Estimator is a quick and easy way to see this value.

With Humio’s innovative index-free design, organizations are no longer forced to make difficult decisions about which data to log and how long to retain it. By logging everything, organizations gain the holistic insights needed to investigate and mitigate any issue.

CrowdStrike Is Working to Strengthen the U.S. Government’s Cybersecurity Posture

1 December 2021 at 09:30

The United States and like-minded nations face unprecedented threats from today’s adversaries. Continuous cyberattacks on critical infrastructure, supply chains, government agencies and more present significant ongoing threats to national security and to the critical services millions of citizens rely on every day. At CrowdStrike, we are on a mission to stop breaches, and we rise to the challenge by protecting many of the most critically important organizations around the globe from some of the most sophisticated adversaries. This is why I am especially enthusiastic about recent initiatives in our work to help strengthen the cybersecurity posture of departments and agencies at all levels (federal, state, local, tribal and territorial) of government by empowering key defenders of U.S. critical infrastructure with our innovative technologies and services.

Earlier this year, the Administration issued an Executive Order to help address these threats, emphasizing the use of capabilities like endpoint detection and response (EDR) and Zero Trust. Based on our experience in preventing some of the world’s most sophisticated threat actors from impacting customers representing just about every industry, we believe that these measures stand to help. We also know that the road to protecting the nation’s most critical assets and infrastructure will require a strong partnership between government and private sector. Only by working together can we prevail.  

CrowdStrike has long been committed to working with federal, state, local, tribal and territorial governments to furnish them with the world-class technology and elite human expertise required to stay ahead of today’s attackers. Strategic partnerships with the U.S. Department of Homeland Security (DHS) Cybersecurity and Infrastructure Security Agency (CISA) and the Center for Internet Security (CIS) are key milestones that continue to enhance CrowdStrike’s efforts to protect the public sector and its partners.

Today, we’re proud to announce that CISA and CrowdStrike are strengthening their partnership to secure our nation’s critical infrastructure and assets. CISA will deploy the CrowdStrike Falcon® platform to secure CISA’s critical endpoints and workloads as well as multiple federal agencies. This partnership directly operationalizes the president’s Executive Order on Improving the Nation’s Cybersecurity, the landmark guidance that unifies several initiatives and policies to strengthen the U.S. national and federal government cybersecurity posture.

By applying CrowdStrike’s unique combination of intelligence, managed threat hunting and endpoint detection and response (EDR), CISA will strengthen its Continuous Diagnostics and Mitigation (CDM) program, advancing CISA’s mission to secure civilian “.gov” networks. This partnership also further improves CISA’s capabilities to better understand and manage cyber and physical risks to the nation’s critical infrastructure.

Validation to Fulfill the Mission

CrowdStrike Falcon is a FedRAMP-authorized endpoint protection platform (EPP) that rapidly enables agencies to detect and prevent cyberattacks, a goal of the cybersecurity Executive Order.

Importantly, CrowdStrike has recently been prioritized by the FedRAMP Joint Advisory Board (JAB) to begin work toward achieving a Provisional Authority to Operate (P-ATO). FedRAMP JAB is composed of major departments in the U.S. government, including Department of Defense (DoD), DHS and the General Services Administration (GSA). The FedRAMP JAB prioritizes only the most used and demanded cloud services within the U.S. government, selecting only approximately 12 cloud service offerings a year. This prioritization and our commitment to the FedRAMP JAB demonstrates CrowdStrike’s continued support and commitment to deliver our best-of-breed Falcon platform to help defend some of the most targeted departments and agencies in the world.

Strengthening Cyber Defenses for State, Local, Tribal and Territorial (SLTT) Governments

CrowdStrike’s work in the SLTT government space is not only critical to supporting these agencies but also vital to protecting critical infrastructure and ensuring the resilience of the communities they serve. In fact, CrowdStrike Falcon is currently being leveraged by more than a third of all U.S. state governments. Despite our success in this space, there is still more work to do. That is why after many years of partnership, CrowdStrike and CIS are taking our work to protect SLTT governments to the next level. CIS’s new fully managed endpoint security services (ESS) solution is now powered exclusively by CrowdStrike.

CrowdStrike brings direct deployment to endpoint devices with the cloud-native, intelligent single agent of the CrowdStrike Falcon platform. This provides CIS with a full suite of solutions to protect CIS managed endpoints, including next-generation antivirus (NGAV), EDR, asset and software inventory, USB device monitoring, user account monitoring and host-based firewall management.

Previously, CIS chose CrowdStrike to protect its Elections Infrastructure Information Sharing and Analysis Center® (EI-ISAC®). The new solution expands on the existing partnership, providing a new, fully managed 24/7/365 next-generation cybersecurity offering exclusively tailored to SLTT organizations. This includes more than 12,000 Multi-State Information Sharing and Analysis Center® (MS-ISAC®) members across the U.S., with more than 14 million endpoints in total.

Moving the Needle Forward for the Public Sector

CrowdStrike has operated a FedRAMP-authorized government cloud since 2018, giving SLTT governments a secure and compliant service that provides innovative and best-of-breed technology to secure their digital assets. Since then, more than one-third of states have standardized on CrowdStrike as their EPP vendor of choice.

To deepen our relationship, we continue to build partnerships with CIS, while formalizing our federal government partnership by becoming an industry launch partner to CISA’s Joint Cyber Defense Collaborative (JCDC). We continue to gain the trust of our government customers as they seek best-of-breed technology to defend their infrastructure and begin their journey to Zero Trust. Our prioritization by, and commitment to, the FedRAMP JAB will only bolster this trust and partnership. Put simply, empowering government defenders with the very technologies successfully embraced by complex private sector organizations is an important step in thwarting adversaries that target governments and, consequently, the functions upon which citizens depend. 

George Kurtz is Chief Executive Officer and Co-founder of CrowdStrike.

Four Key Factors When Selecting a Cloud Workload Protection Platform

1 December 2021 at 09:24

Security budgets are not infinite. Every dollar spent must produce a return on investment (ROI) in the form of better detection or prevention. 

Getting the highest ROI for security purchases is a key consideration for any IT leader. But the path to achieving that goal is not always easy to find. It is tempting for CISOs and CIOs to succumb to “shiny toy” syndrome: to buy the newest tool claiming to address the security challenges facing their hybrid environment. With cloud adoption on the rise, securing cloud assets will be a critical aspect of supporting digital transformation efforts and the continuous delivery of applications and services to customers well into the future. 

However, embracing the cloud widens the attack surface, which now includes private, public and hybrid environments. A traditional approach simply doesn’t provide the level of security needed to protect this environment, which requires granular visibility over cloud events. Organizations need a new approach — one that provides them with the visibility and control they need while also supporting the continuous integration/continuous delivery (CI/CD) pipeline.

Where to Start

To address these challenges head on, organizations are turning to cloud workload protection platforms. But how do IT and business leaders know which boxes these solutions should check? Which solution is best in addressing cloud security threats based on the changing adversary landscape? 

To help guide the decision-making process, CrowdStrike has prepared a buyer’s guide with advice on choosing the right solution for your organization. In this guide, we discuss different aspects of these solutions that customers should consider in the buying process, including detection, prevention and CI/CD integration. Here are four key evaluation points highlighted in the buyer’s guide: 

  • Cloud Protection as an Extension of Endpoint Security: Focusing on endpoint security alone is not sufficient to secure the hybrid environments many organizations now have to protect. For those organizations, choosing the right cloud workload protection platform is vital.
  • Understanding Adversary Actions Against Your Cloud Workloads: Real-time, up-to-date threat intelligence is a critical consideration when evaluating CWP platforms. As adversaries ramp up actions to exploit cloud services, having the latest information about attacker tactics and applying it successfully is a necessary part of breach prevention. For example, CrowdStrike researchers noted seeing adversaries targeting neglected cloud infrastructure slated for retirement that still contains sensitive data as well as adversaries leveraging common cloud services as a way to obfuscate malicious activity (learn more in our CrowdStrike cloud security eBook, Adversaries Have Their Heads In the Cloud and Are Targeting Your Weak Points). A proper approach to securing cloud resources leverages enriched threat intelligence to deliver a visual representation of relationships across account roles, workloads and APIs to provide deeper context for a faster, more effective response. 
  • Complete Visibility into Misconfigurations, Vulnerabilities and More: Closing the door on attackers also involves identifying the vulnerabilities and misconfigurations they’re most likely to exploit. A strong approach to cloud security will weave these capabilities into the CI/CD pipeline, enabling organizations to catch vulnerabilities early. For example, they can create verified image policies to guarantee that only approved images are allowed to pass through the pipeline. By continuously scanning container images for known vulnerabilities and configuration issues and integrating security with developer toolchains, organizations can accelerate application delivery and empower DevOps teams. Catching vulnerabilities is also the job of cloud security posture management technology. These solutions allow organizations to continuously monitor the compliance of all of their cloud resources. This ability is critical because misconfigurations are at the heart of many data leaks and breaches. Having these solutions bolstering your cloud security strategy will enable you to reduce risk and embrace the cloud with more confidence.
  • Managed Threat Hunting: Technology alone is not enough. As adversaries refine their tradecraft to avoid detection, access to MDR and advanced threat hunting services for the cloud can be the difference in stopping a breach. Managed services should be able to leverage up-to-the-minute threat intelligence to search for stealthy and sophisticated attacks. This human touch adds a team of experts that can augment existing security capabilities and improve customers’ ability to detect and respond to threats.

Making the Right Decision

Weighing the differences between security vendors is not always simple. However, there are some must-haves for cloud security solutions. From detection to prevention to integration with DevOps tools, organizations need to adopt the capabilities that put them in the best position to take advantage of cloud computing as securely as possible. 

To learn more, download the CrowdStrike Cloud Workload Protection Platform Buyers Guide.

CrowdStrike Announces Expanded Partnership at AWS re:Invent 2021

30 November 2021 at 09:05

We’re ready to meet you in person in Las Vegas! CrowdStrike is a proud Gold sponsor of AWS re:Invent 2021, being held Nov. 29 through Dec. 3. Stop by Booth #152 at the Venetian for a chance to obtain one of our new limited-edition adversary figures while supplies last. (More details below.) Plus, connect 1:1 with a CrowdStrike expert in person. Register today so you don’t miss out on CrowdStrike in action! Check out what else we have to offer here.

Here’s a sneak peek.

What’s New 

At AWS re:Invent 2021, we are announcing expansions to our strategic partnership with AWS to provide breach protection and control for edge computing workloads running on cloud and customer-managed infrastructure, providing simplified infrastructure management and security consolidation, without impact to productivity. 

Build with AWS, Secure with CrowdStrike

AWS Outposts Rack (42U), AWS Outposts Servers (1U and 2U) 

CrowdStrike is proud to be a launch partner of AWS Outposts 1U and 2U servers and is now compatible with the AWS Outposts rack. AWS Outposts is a fully managed service that offers the same AWS infrastructure, AWS services, APIs and tools to on-premises data centers, co-location space, or edge locations like retail stores, branch offices, factories and office locations for a truly consistent hybrid experience. AWS Outposts is ideal for workloads that require low latency access to on-premises systems, local data processing, data residency and migration of applications with local system interdependencies. As a launch partner, CrowdStrike can provide complete end-to-end visibility and protection for customers’ AWS hybrid environments as well as Internet of Things (IoT) and edge computing use cases.

CrowdStrike Achieves EKS Anywhere Certification

Amazon EKS Anywhere is a new deployment option for Amazon EKS that allows customers to create and operate Kubernetes clusters on customer-managed infrastructure, supported by AWS. Starting today, AWS customers can now run Amazon EKS Anywhere on their own on-premises infrastructure using VMware vSphere. Now, with the Amazon EKS Anywhere certification, joint CrowdStrike and AWS solutions deliver end-to-end protection from the host to the cloud, delivering greater visibility, compliance, and threat detection and response to outsmart the adversary. CrowdStrike supports development and production of Amazon EKS workloads across Amazon EKS, Amazon EKS with AWS Fargate, and now Amazon EKS Anywhere.

Humio Log Management Integrations with AWS Services 

Humio’s purpose-built, large-scale log management platform is now more tightly integrated with a number of AWS services, including AWS Quick Starts and AWS FireLens.

  • AWS Quick Starts for Humio: AWS Quick Starts are automated reference deployments built by AWS solutions architects and AWS Partners. AWS Quick Starts help you deploy popular technologies on AWS according to AWS best practices. Joint customers will be able to initiate Humio clusters via AWS Quick Starts Templates to reduce manual procedures to just a few steps, empowering customers to start attaining Humio’s streaming observability at scale and with consistency, within minutes.
  • Humio Integration with AWS FireLens: Customers are now able to ingest AWS service and event data into Humio via AWS FireLens container log router for Amazon ECS and AWS Fargate. Humio customers will now have greater extensibility to use the breadth of services at AWS to simplify routing of logs to Humio, enabling accelerated threat hunting and search across their AWS footprint for novel and advanced cyber threats.

AWS Security Hub Integration Now Supports AWS GovCloud 

CrowdStrike Falcon already integrates with AWS Security Hub to enable a comprehensive, real-time view of high-priority security alerts. CrowdStrike’s API-first approach sends alerts back into AWS Security Hub and accelerates investigation, ultimately helping to automate security tasks. 

We have now extended this integration to publish detections identified by CrowdStrike Falcon for workloads residing within AWS GovCloud to AWS Security Hub to assist customers operating in highly regulated environments, such as the U.S. public sector. This will allow customers’ security operations center (SOC) and DevOps team to streamline communications and simultaneously view and access the same cybersecurity event data. 

CrowdStrike and AWS Partnership 

CrowdStrike is an Advanced Technology Partner in the AWS Partner Network (APN), a global partner program for leveraging AWS business, technical and marketing support to build solutions for customers. In addition, CrowdStrike has passed the technical review for the AWS Well-Architected ISV Certification. By achieving this certification, CrowdStrike has proven it adopts AWS best practices to lower costs, drive better security and performance, adopt cloud-native architectures, drive industry compliance and scale to meet traffic demands. CrowdStrike product offerings are available in the AWS Marketplace.

The Powerful Benefits of CrowdStrike and AWS 

Our joint solutions and integrations in various AWS services are powered by CrowdStrike Threat Graph®, which captures trillions of high-fidelity signals per day in real time from across the globe. Customers benefit from better protection, better performance and immediate time-to-value delivered by the cloud-native Falcon platform, designed to stop breaches. With over 14 service level integrations available, joint AWS and CrowdStrike customers are provided a consistent security posture between their on-premises workloads and those running in the AWS Cloud.

  • Unified, hybrid security experience: To reiterate, CrowdStrike supports development and production of Amazon EKS workloads across Amazon EKS, Amazon EKS with AWS Fargate, and Amazon EKS Anywhere. With a single lightweight agent and single management console, customers can experience a unified, end-to-end experience from the host to the cloud. No matter where the compute workloads are located, customers benefit from visibility, compliance, and threat detection and response to outsmart the adversary.
  • Real-time observability at enterprise scale: Humio offers the freedom to log hundreds of terabytes a day with no compromises. Now with the direct integration with AWS FireLens, customers have complete visibility to see anomalies, threats and problems to get to the root of anything nefarious that has happened across their AWS infrastructure in real time.
  • A modern and consistent security approach: The latest integrations, support and certifications from CrowdStrike for AWS allow organizations to implement a modern enterprise security approach where protection is provided across your AWS infrastructure to defend against sophisticated threat activity. 

Visit CrowdStrike at Booth #152

Come by Booth #152 for a chance to win your own adversary figure, engage in product demos and chat with CrowdStrike experts.

How to Obtain Your Own Adversary Figure 

Earn a limited-edition adversary collectable card for each step you complete. Then show your three collectable cards to a CrowdStrike representative at our giveaway station in our booth, and you’ll be rewarded with your very own adversary figure while supplies last! 

  1. Listen to a theater presentation at the CrowdStrike booth 
  2. Engage in a product demo at one of our demo stations
  3. Snap a selfie and tag #GoCrowdStrike (we will have adversary masks in the booth)

Meet 1:1 with a CrowdStrike Executive

CrowdStrike will have executives and leaders attending AWS re:Invent in person. If you’re interested in a 1:1 onsite meeting, please fill out the form here.

Questions? Please contact [email protected]. We look forward to seeing you at AWS re:Invent 2021!

What Is a Hypervisor (VMM)?

30 November 2021 at 09:43

This blog was originally published on humio.com. Humio is a CrowdStrike Company.

What is a hypervisor?

A hypervisor, or virtual machine monitor (VMM), is virtualization software that creates and manages multiple virtual machines (VMs) from a single physical host machine.

Acting as a VMM, the hypervisor monitors, pools and allocates resources — like CPU, memory and storage — across all guest VMs. By centralizing these assets, it’s possible to significantly reduce each VM’s energy consumption, space allocation and maintenance requirements while optimizing overall system performance.

Why should you use a hypervisor?

In addition to helping the IT team better monitor and utilize all available resources, a hypervisor unlocks a wide range of benefits. These include:

  • Speed and scalability: Hypervisors can create new VMs instantly, which allows organizations to quickly scale to meet changing business needs. In the event an application needs more processing power, the hypervisor can also access additional machines on a different server to address this demand.
  • Cost and energy efficiency: Using a hypervisor to create and run several VMs from a common host is far more cost- and energy-efficient than running several physical machines to complete the same tasks.
  • Flexibility: A hypervisor separates the OS from underlying physical hardware. As a result, the guest VM can run a variety of software and applications since the system does not rely on specific hardware.
  • Mobility and resiliency: Hypervisors logically isolate VMs from the host hardware. VMs can therefore be moved freely from one server to another without risk of disruption. Hypervisors can also isolate one guest virtual machine from another; this eliminates the risk of a “domino effect” if one virtual machine crashes.
  • Replication: Replicating a VM manually is a time-intensive and potentially complex process. Hypervisors automate the replication process for VMs, allowing staff to focus on more high-value tasks.
  • Restoration: A hypervisor has built-in stability and security features, including the ability to take a snapshot of a VM’s current state. Once this snapshot is taken, the VM can revert to this state if needed. This is particularly useful when carrying out system upgrades or maintenance as the VM can be restored to its previous functioning state if the IT team encounters an error.

Types of hypervisors

There are two main types of hypervisors:

  1. Type 1 hypervisor: Native or bare metal hypervisor
  2. Type 2 hypervisor: Hosted or embedded hypervisor

Type 1 hypervisor: native or bare metal hypervisor

A type 1 hypervisor is virtualization software installed directly on the hardware, hence the name bare metal hypervisor.

In this model, the hypervisor takes the place of the OS. As a result, these hypervisors are typically faster since all computing power can be dedicated to guest virtual machines, as well as more secure since adversaries cannot target vulnerabilities within the OS.

That said, a native hypervisor tends to be more complex to set up and operate. Further, a type 1 hypervisor has somewhat limited functionality since the hypervisor itself basically serves as an OS.

Type 2 hypervisor: hosted or embedded hypervisor

Unlike bare-metal hypervisors, a hosted hypervisor is deployed as an added software layer on top of the host operating system. Multiple operating systems can then be installed as a new layer on top of the host OS.

In this model, the OS acts as a way station between the hardware and the hypervisor. As a result, a type 2 hypervisor tends to have higher latency and slower performance. The presence of the OS also makes this type more vulnerable to cyberattacks.

Embedded hypervisors are generally more convenient to build and launch than a Type 1 hypervisor since they do not require a management console or dedicated machine to set up and oversee the VMs. A hosted hypervisor may also be a good choice for use cases where latency is not a concern, such as software testing.

Cloud hypervisors

The shift to the cloud and cloud computing is prompting the need for cloud hypervisors. The cloud hypervisor focuses exclusively on running VMs in a cloud environment (rather than on physical devices).

Due to the cloud’s flexibility, speed and cost savings, businesses are increasingly migrating their VMs to the cloud. A cloud hypervisor can provide the tools to migrate them more efficiently, allowing companies to make a faster return on investment on their transformation efforts.

Differences between containers and hypervisors

Containers and hypervisors both ensure applications run more efficiently by logically isolating them within the system. However, there are significant differences between how the two are structured, how they scale and their respective use cases.

A container is a package of only software and its dependencies, such as code, system tools, settings and libraries. It can run reliably on any operating system and infrastructure. A container consists of an entire runtime environment, enabling applications to move between a variety of computing environments, such as from a physical machine to the cloud, or from a developer’s test environment to staging and then production.

Hypervisors vs containers

Hypervisors host one or more VMs that mimic a collection of physical machines. Each VM has its own independent OS and is effectively isolated from others.

While VMs are larger and generally slower compared to containers, they can run several applications and different operating systems simultaneously. This makes them a good solution for organizations that need to run multiple applications or legacy software that requires an outdated OS.

Containers, on the other hand, often share an OS kernel or base image. While each container can run individual applications or microservices, it is still linked to the underlying kernel or base image.

Containers are typically used to host a single app or microservice without any other overhead. This makes them more lightweight and flexible than VMs. As such, they are often used for tasks that require a high level of scalability, portability and speed, such as application development.

Understanding hypervisor security

On one hand, by isolating VMs from one another, a hypervisor effectively contains attacks on an individual VM. Also, in the case of type 1 or bare metal hypervisors, the absence of an operating system significantly reduces the risk of an attack since adversaries cannot exploit vulnerabilities within the OS.

At the same time, the hypervisor host itself can be subject to an attack. In that case, each guest machine and their associated data could be vulnerable to a breach.

Best practices for improving hypervisor security

Here are some best practices to consider when integrating a hypervisor within the organization’s IT architecture:

  • Minimize the attack surface by limiting a host’s role to only operating VMs
  • Conduct regular and timely patching for all software applications and the OS
  • Leverage other security measures, such as encryption, zero trust and multi-factor authentication (MFA) to ensure user credentials remain secure
  • Limit administrative privileges and the number of users in the system
  • Incorporate the hypervisor within the organization’s cybersecurity architecture for maximum protection

Hypervisors and modern log management

With the growth of microservices and migration to disparate cloud environments, maintaining observability has become increasingly difficult. Additionally, challenges such as application availability, bugs and vulnerabilities, resource use, and performance changes in virtual machines and containers continue to affect end-user experience. Organizations operating with a continuous delivery model are further challenged to capture and understand the dependencies within the application environment.

Humio’s streaming log management solution can access and ingest real-time data streaming from diverse platforms and accurately log network issues, database connections and availability, and information about what’s happening in a container that the application relies on. In addition to providing visibility across the entire infrastructure, developers can benefit from comprehensive root cause investigation and analysis. Humio enables search across all relevant data with longer data-retention and long-term storage.

Humio Community Edition

Try Humio’s log management solution at no cost with ongoing access here!

Nowhere to Hide: Detecting SILENT CHOLLIMA’s Custom Tooling

29 November 2021 at 09:25

CrowdStrike Falcon OverWatch™ recently released its annual threat hunting report, detailing the interactive intrusion activity observed by hunters over the course of the past year. The tactics, techniques and procedures (TTPs) an adversary uses serve as key indicators to threat hunters of who might be behind an intrusion. OverWatch threat hunters uncovered an intrusion against a pharmaceuticals organization that bore all of the hallmarks of a Democratic People’s Republic of Korea (DPRK) threat actor group: SILENT CHOLLIMA. For further detail, download the CrowdStrike 2021 Threat Hunting Report today.

Threat Hunters Uncover SILENT CHOLLIMA’s Custom Tooling

OverWatch threat hunters detected a burst of suspicious reconnaissance activity in which the threat actor used the Smbexec tool under a Windows service account. Originally designed as a penetration testing tool, Smbexec enables covert execution by creating a Windows service that is then used to redirect a command shell operation to a remote location over Server Message Block (SMB) protocol. This approach is valuable to threat actors, as they can perform command execution under a semi-interactive shell and run commands remotely, ultimately making the activity less likely to trigger automated detections.

As OverWatch continued to investigate the reconnaissance activity, the threat actor used Smbexec to remotely copy low-prevalence executables to disk and execute them. The threat hunters quickly called on CrowdStrike Intelligence, who together were able to quickly determine the files were an updated variant of Export Control — a malware dropper unique to SILENT CHOLLIMA.

SILENT CHOLLIMA then proceeded to load two further custom tools. The first was an information stealer, named GifStealer, which runs a variety of host and network reconnaissance commands and archives the output within individual compressed files. The second was Valefor, a remote access tool (RAT) that uses Windows API functions and utilities to enable file transfer and data collection capabilities.

OverWatch Contains Adversary Activity

Throughout the investigation, OverWatch threat hunters alerted the victim organization to the malicious activity occurring in the environment. As the situation developed, OverWatch continued to alert the organization, eventually informing them of the emerging attribution of this activity to SILENT CHOLLIMA. 

Because this activity originated from a host without the CrowdStrike Falcon® sensor, OverWatch next worked with the organization to expand the rollout of the Falcon sensor so the full scope of threat actor activity could be assessed. Increasing the organization’s coverage and visibility into the intrusion, threat hunters identified six additional compromised hosts. Through further collaboration with the organization, OverWatch was able to relay their findings in a timely manner, empowering the organization to contain and remove SILENT CHOLLIMA from their network. 

OverWatch discovered a service creation event that was configured to execute the Export Control loader every time the system reboots, allowing the threat actor to maintain persistence if they temporarily lose connection.

sc create [REDACTED] type= own type= interact start= auto error=ignore binpath= "cmd /K start C:\Windows\Resources\[REDACTED].exe"

The threat actor was also mindful to evade detection by storing their Export Control droppers and archived reconnaissance data within legitimate local directories, attempting to disguise the files as benign. The threat actor continued its evasion techniques, removing traces of the collected GifStealer archives by deleting them and overwriting the GifStealer binary itself using the command below. This technique is another hallmark of SILENT CHOLLIMA activity.

"C:\Windows\system32\cmd.exe" /c ping -n 3 127.0.0.1 >NUL & echo EEEE > "C:\Windows\Temp\[REDACTED]"

Conclusions and Recommendations

The OverWatch team exposed multiple signs of malicious tradecraft in the early stages of this intrusion, which proved to be vital to the victim organization’s ability to successfully contain the campaign and remove the threat actor from its networks. In this instance, OverWatch worked with the organization to rapidly expand Falcon sensor coverage. Though the Falcon sensor can be deployed and operational in just seconds, OverWatch strongly recommends that defenders roll out endpoint protection consistently and comprehensively across their environment from the start to ensure maximum coverage and visibility for threat hunters. OverWatch routinely sees security blind spots become a safe haven from which adversaries can launch their intrusions.  The Falcon sensor was built with scalability in mind, allowing an organization to reach a strong security posture by protecting all enterprise endpoints in mere moments.

The expertise of OverWatch’s human threat hunters was pivotal in this instance: it was their expertise that allowed them to discern that the SMB activity was indeed malicious.

For defenders concerned about this type of activity, OverWatch recommends monitoring: 

  • Service account activity, limiting access where possible
  • Service creation events within Windows event logs to hunt for malicious SMB commands
  • Remote users connecting to administrator shares, as well as other commands and tools that can be used to connect to network shares

Ultimately, threat hunting is a full-time job. Defenders should also consider hiring a professional managed threat hunting service, like OverWatch, to secure their networks 24/7/365.

Shift Left Security: The Magic Elixir for Securing Cloud-Native Apps

24 November 2021 at 09:35

Developing applications quickly has always been the goal of development teams. Traditionally, that often puts them at odds with the need for testing. Developers might code up to the last minute, leaving little time to find and fix vulnerabilities in time to meet deadlines. 

During the past decade, this historical push-pull between security and developers led many organizations to look to build security deeper into the application development lifecycle. This new approach, “shift-left security,” is a pivotal part of supporting the DevOps methodology. By focusing on finding and remediating vulnerabilities earlier, organizations can streamline the development process and improve velocity. 

Cloud computing empowers the adoption of DevOps. It offers DevOps teams a centralized platform for testing and deployment. But for DevOps teams to embrace the cloud, security has to be at the forefront of their considerations. For developers, that means making security a part of the continuous integration/continuous delivery (CI/CD) pipeline that forms the cornerstone of DevOps practices.

Out with the Old and In with the New

The CI/CD pipeline is vital to supporting DevOps through the automation of building, testing and deploying applications. It is not enough to just scan applications after they are live. A shift-left approach to security should start the same second that DevOps teams begin developing the application and provisioning infrastructure. By using APIs, developers can integrate security into their toolsets and enable security teams to find problems early. 

Speedy delivery of applications is not the enemy of security, though it can seem that way. Security is meant to be an enabler, an elixir that helps organizations use technology to reach their business goals. Making that a reality, however, requires making it a foundational part of the development process. 

In our Buyer’s Guide for Cloud Workload Protection Platforms, we provide a list of key features we believe organizations should look for to help secure their cloud environments. Automation is crucial. In research from CrowdStrike and Enterprise Strategy Group (ESG), 41% of respondents said that automating the introduction of controls and processes via integration with the software development lifecycle and CI/CD tools is a top priority. Using automation, organizations can keep pace with the elastic, dynamic nature of cloud-native applications and infrastructure.

Better Security, Better Apps

At CrowdStrike, we focus on integrating security into the CI/CD pipeline. As part of the functionality of CrowdStrike’s Falcon Cloud Workload Protection (CWP), customers have the ability to create verified image policies to ensure that only approved images are allowed to progress through the CI/CD pipeline and run in their hosts or Kubernetes clusters. 

The tighter the integration between security and the pipeline, the earlier threats can be identified, and the more the speed of delivery can be accelerated. By seamlessly integrating with Jenkins, Bamboo, GitLab and others, Falcon CWP allows DevOps teams to respond and remediate incidents even faster within the toolsets they use. 

Falcon CWP also continuously scans container images for known vulnerabilities, configuration issues, secrets/keys and OSS licensing issues, and streamlines visibility for security operations by providing insights and context for misconfigurations and compliance violations. It also uses reporting and dashboards to drive alignment across the security operations, DevOps and infrastructure teams. 

Hardening the CI/CD pipeline allows DevOps teams to move fast without sacrificing security. The automation and integration of security into the CI/CD pipeline transforms the DevOps culture into its close relative, DevSecOps, which extends the methodology of DevOps by focusing on building security into the process. As businesses continue to adopt cloud services and infrastructure, forgetting to keep security top of mind is not an option. The CI/CD pipeline represents an attractive target for threat actors. Its criticality means that a compromise could have a significant impact on business and IT operations. 

Baking security into the CI/CD pipeline enables businesses to pursue their digital initiatives with confidence and security. By shifting security left, organizations can identify misconfigurations and other security risks before they impact users. Given the role that cloud computing plays in enabling DevOps, protecting cloud environments and workloads will only take on a larger role in defending the CI/CD pipeline, your applications and, ultimately, your customers. 

To learn more about how to choose security solutions to protect your CI/CD pipeline, download the CrowdStrike Cloud Workload Protection Platform Buyers Guide.

Managing Dead Letter Messages: Three Best Practices to Effectively Capture, Investigate and Redrive Failed Messages

24 November 2021 at 08:06

In a recent blog post, Sharding Kafka for Increased Scale and Reliability, the CrowdStrike Engineering Site and Reliability Team shared how it overcame scaling limitations within Apache Kafka so that they could quickly and effectively process trillions of events daily. In this post, we focus on the other side of this equation: What happens when one of those messages inevitably fails? 

When a message cannot be processed, it becomes what is known as a “dead letter.” The service attempts to process the message by normal means several times to eliminate intermittent failures. However, when all of those attempts fail, the message is ultimately “dead lettered.” In highly scalable systems, these failed messages must be dealt with so that processing can continue on subsequent messages. To retain the dead letter’s information and continue processing messages, the message is stored so that it can be later addressed manually or by an automated tool.

In Best Practices: Improving Fault-Tolerance in Apache Kafka Consumer, we go into great detail about the different failure types and techniques for recovery, which include redriving and dead letters. Here our aim is to solidify those terms and expound upon the processes surrounding these mechanisms. 

Processing dead letters can be a fairly time-consuming and error-prone process. So what can be done to expedite this task and improve its outcome? Here we explore three steps organizations can take to develop the code and infrastructure needed to more effectively and efficiently capture, investigate and redrive dead letter messages.

Dead Letter Basics
What is a message? A message is the record of any communication between two or more services.
Why does a message fail? Messages can fail for a variety of reasons, some of the most common being incompatible message format, unavailable dependent services, or a bug in the service processing the message.
Why does it matter if a message fails? In most cases, a message is being sent because it is sharing important information with another service. Without that knowledge, the service that should be receiving the message can have outdated or inaccurate information and make bad decisions or be completely unable to act.

Three Best Practices for Resolving Dead Letter Messages

1. Define the infrastructure and code to capture and redrive dead letters

As explained above, a dead letter occurs when a service cannot process a message. Most systems have some mechanism in place, such as a log or object storage, to capture the message, review it, identify the issue, resolve the issue and then retry the message once it’s more likely to succeed. This act of replaying the message is known as “redriving.” 

To enable the redrive process, organizations need two basic things: 1) the necessary infrastructure to capture and store the dead letter messages, and 2) the right code to redrive that message.

Since there could potentially be hundreds of millions of dead letters that need to be stored, we recommend using a storage option that meets these four criteria: low cost (especially critical as your data scales), abundant space (no concerns around running out of storage space), durability (no data loss or corruption) and availability (the data is available to restore during disaster recovery). We use Amazon S3. 

For short-term storage and alerting, we recommend using a message queue technology that allows the user to send messages to be processed at a later point. Then your service can be configured to read from the message queue to begin processing the redrive messages. We use Amazon SQS and Kafka as our message queues.
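
To make the redrive path concrete, below is a minimal sketch in Go (using the AWS SDK) of the core step: fetch a stored dead letter from S3 and re-submit its body to an SQS redrive queue. The bucket, key and queue URL are hypothetical placeholders rather than our production configuration, and error handling is reduced to the essentials.

package main

import (
	"fmt"
	"io"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
	"github.com/aws/aws-sdk-go/service/sqs"
)

func main() {
	sess := session.Must(session.NewSession())

	// Fetch the stored dead letter from long-term storage (S3).
	obj, err := s3.New(sess).GetObject(&s3.GetObjectInput{
		Bucket: aws.String("example-dead-letter-bucket"), // hypothetical bucket
		Key:    aws.String("orders-service/msg-12345"),   // hypothetical key
	})
	if err != nil {
		log.Fatalf("fetch dead letter: %v", err)
	}
	defer obj.Body.Close()

	body, err := io.ReadAll(obj.Body)
	if err != nil {
		log.Fatalf("read dead letter body: %v", err)
	}

	// Re-submit the original message body to the redrive queue (SQS) so the
	// owning service can pick it up and process it again.
	_, err = sqs.New(sess).SendMessage(&sqs.SendMessageInput{
		QueueUrl:    aws.String("https://sqs.us-east-1.amazonaws.com/123456789012/orders-redrive"), // hypothetical queue
		MessageBody: aws.String(string(body)),
	})
	if err != nil {
		log.Fatalf("redrive message: %v", err)
	}
	fmt.Println("message redriven")
}

Keeping the stored message and the redrive queue separate from the original processing path means the redrive can happen long after the failure, once the underlying issue has been fixed.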

2. Put tooling in place to make remediation foolproof 

The process outlined above can be very error-prone when done manually, as it involves many steps: finding the message, copying its contents, pasting it into a new message and submitting that message to the queue. If the user misses even one character when copying the message, then it will fail again — and the process will need to be repeated. This process must be done for every failed message, making it potentially time-consuming as well. 

Since the process is the same for every dead letter, it is possible to automate. To that end, organizations should develop a command-line tool to automate common actions with dead letters such as viewing the dead letter, putting the message in the redrive queue and having the service consume messages from the queue for reprocessing. Engineers will use this command-line tool to diagnose and resolve dead letters the same way — this, in turn, will help reduce the risk of human error.
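
One possible shape for such a tool is sketched below: a small command-line wrapper with view and redrive subcommands keyed by a message identifier. The helper functions are stubs standing in for the S3 and SQS logic shown earlier, and the command names and arguments are illustrative, not those of our internal tool.

package main

import (
	"fmt"
	"os"
)

// viewDeadLetter and redriveDeadLetter are placeholders for the S3/SQS calls
// shown in the earlier sketch; a real tool would share a common client package.
func viewDeadLetter(id string) error    { fmt.Println("would fetch and print dead letter", id); return nil }
func redriveDeadLetter(id string) error { fmt.Println("would redrive dead letter", id); return nil }

func main() {
	if len(os.Args) < 3 {
		fmt.Fprintln(os.Stderr, "usage: dlq-tool <view|redrive> <message-id>")
		os.Exit(1)
	}
	cmd, id := os.Args[1], os.Args[2]

	var err error
	switch cmd {
	case "view":
		err = viewDeadLetter(id)
	case "redrive":
		err = redriveDeadLetter(id)
	default:
		err = fmt.Errorf("unknown command %q", cmd)
	}
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
}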

3. Standardize and document the process to ensure ease-of-use 

Our third best practice is around standardization. Because not all engineers will be familiar with the process the organization has for dealing with dead letter messages, it is important to document all aspects of the procedure. Some basic questions your documentation should address include: 

  • How does the organization know when a dead letter message occurs? Is an alert set up? Will an email be sent?
  • How does the team investigate the root cause of the error? Is there a specific phrase they can search for in the logs to find the errors associated with a dead letter?
  • Once it has been investigated and a fix has been deployed, how is the message reprocessed or redrived?

Documenting and standardizing the process in this way ensures that anyone on the team can pick up, solve and redrive dead letters. Ideally, the documentation will be relatively short and intuitive, outlining the following steps:

  • How to read the content of the message and review the logs to help figure out what happened
  • How to run the commands for your dead letter tool
  • How to put the message in the redrive queue to be reprocessed
  • What to do if the message is rejected again

It’s important to have this “cradle-to-grave” mentality when dealing with dead letter messages — pun intended — since a disconnect anywhere within the process could prevent the organization from successfully reprocessing the message.

Conclusion

While many organizations focus on processing massive amounts of messages and scaling those capabilities, it is equally important to ensure errors are captured and solved efficiently and effectively. 

In this blog, we shared our three best practices for organizations to develop the infrastructure and tooling to ensure that any engineer can properly manage a dead letter. But we certainly have more to share! We would be happy to address any specific questions or explore related topics of interest to the community in future blog posts. 

Got a question, comment or idea? Feel free to share your thoughts for future posts on social media via @CrowdStrike.

Mean Time to Repair (MTTR) Explained

23 November 2021 at 08:30

This blog was originally published Oct. 28, 2021 on humio.com. Humio is a CrowdStrike Company.

Definition of MTTR

Mean time to repair (MTTR) is a key performance indicator (KPI) that represents the average time required to restore a system to functionality after an incident. MTTR is used along with other incident metrics to assess the performance of DevOps and ITOps, gauge the effectiveness of security processes, evaluate the effectiveness of security solutions, and measure the maintainability of systems.

Service level agreements with third-party providers typically set expectations for MTTR, although repair times are not guaranteed because some incidents are more complex than others. Along the same lines, comparing the MTTR of different organizations is not fruitful because MTTR is highly dependent on unique factors relating to the size and type of the infrastructure and the size and skills of the ITOps and DevOps team. Every business has to determine which metrics will best serve its purposes and how it will put them into action in its unique environment.

Difference Between Common Failure Metrics

Modern enterprise systems are complicated and they can fail in numerous ways. For these reasons, there is no one set of incident metrics every business should use — but there are many to choose from, and the differences can be nuanced.

Mean Time to Detect (MTTD)

Also called mean time to discover, MTTD is the average time between the beginning of a system failure and its detection. As a KPI, MTTD is used to measure the effectiveness of the tools and processes used by DevOps teams.

To calculate MTTD, select a period of time, such as a month, and track the times between the beginning of system outages and their discovery, then add up the total time and divide it by the number of incidents to find the average. MTTD should be low. If it continues to take longer to detect or discover system failures (an upward trend), an immediate review should be conducted of the existing incident response management tools and processes.

Mean Time to Identify (MTTI)

This measurement tracks the number of business hours between the moment an alert is triggered and the moment the cybersecurity team begins to investigate that alert. MTTI is helpful in understanding if alert systems are effective and if cybersecurity teams are staffed to the necessary capacity. A high MTTI or an MTTI that is trending in the wrong direction can be an indicator that the cybersecurity team is suffering from alert fatigue.

Mean Time to Recovery (MTTR)

Mean time to recovery is the average time it takes in business hours between the start of an incident and the complete recovery back to normal operations. This incident metric is used to understand the effectiveness of the DevOps and ITOps teams and identify opportunities to improve their processes and capabilities.

Mean Time to Resolve (MTTR)

Mean time to resolve is the average time between the first alert through the post-incident analysis, including the time spent ensuring the failure will not re-occur. It is measured in business hours.

Mean Time Between Failures (MTBF)

Mean time between failures is a key performance metric that measures system reliability and availability. ITOps teams use MTBF to understand which systems or components are performing well and which need to be evaluated for repair or replacement. Knowing MTBF enables preventative maintenance, minimizes reactive maintenance, reduces total downtime and enables teams to prioritize their workload effectively. Historical MTBF data can be used to make better decisions about scheduling maintenance downtime and resource allocation.

MTBF is calculated by tracking the number of hours that elapse between system failures in the ordinary course of operations over a period of time and then finding the average.

Mean Time to Failure (MTTF)

Mean time to failure is a way of looking at uptime vs. downtime. Unlike MTBF, an incident metric that focuses on repairability, MTTF focuses on failures that cannot be repaired. It is used to predict the lifespan of systems. MTTF is not a good fit for every system. For example, systems with long lifespans, such as core banking systems or many industrial control systems, are not good subjects for MTTF metrics because they have such a long lifespan that when they are finally replaced, the replacement will be an entirely different type of system due to technological advances. In cases like that, MTTF is moot.

Conversely, tracking the MTTF of systems with more typical lifespans is a good way to gain insight into which brands perform best or which environmental factors most strongly influence a product’s durability.

MTTR is intended to reduce unplanned downtime and shorten breakout time. But its use also supports a better culture within ITOps teams. When incidents are repaired before users are impacted, DevOps and ITOps are seen as efficient and effective. Resilient system design is encouraged because when DevOps knows its performance will be measured by MTTR, the team will build apps that can be repaired faster, such as by developing apps that are populated by discrete web services so one service failure will not crash the entire app. MTTR, when done properly, includes post-incident analysis, which should be used to inform a feedback loop that leads to better software builds in the future and encourages the fixing of bugs early in the SDLC process.

How to Calculate Mean Time to Repair

The MTTR formula is straightforward: Simply add up the total unplanned repair time spent on a system within a certain time frame and divide the results by the total number of relevant incidents.

For example, if you have a system that fails four times in one workday and you spend a total of one hour repairing those four failures, your MTTR would be 15 minutes (60 minutes / 4 = 15 minutes).
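
As a quick illustration of the formula (not any particular monitoring product), a small helper like the following computes MTTR from a list of recorded repair durations:

package main

import (
	"fmt"
	"time"
)

// meanTimeToRepair returns the average of the unplanned repair durations
// recorded within a given window; zero incidents yields zero.
func meanTimeToRepair(repairs []time.Duration) time.Duration {
	if len(repairs) == 0 {
		return 0
	}
	var total time.Duration
	for _, r := range repairs {
		total += r
	}
	return total / time.Duration(len(repairs))
}

func main() {
	// Four failures in one workday and one hour of total repair time,
	// matching the example above.
	repairs := []time.Duration{
		15 * time.Minute,
		20 * time.Minute,
		10 * time.Minute,
		15 * time.Minute,
	}
	fmt.Println("MTTR:", meanTimeToRepair(repairs)) // MTTR: 15m0s
}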

However, not all outages are equal. The time spent repairing a failed component or a customer-facing system that goes down during peak hours is more expensive in terms of lost sales, productivity or brand damage than time spent repairing a non-critical outage in the middle of the night. Organizations can establish an “error budget” that specifies that each minute spent repairing the most impactful systems is worth an hour spent repairing less impactful ones. This level of granularity helps expose the true costs of downtime and provides a better understanding of what MTTR means to the particular organization.

How to Reduce MTTR

There are three elements to reducing MTTR:

  1. Manage resolution process. The first is a defined strategy for managing the resolution process, which should include a post-incident analysis to capture lessons learned.
  2. Build defenses. Technology plays a crucial role, of course, and the best solution will provide visibility, monitoring and corrective maintenance to help root out problems and build defenses against future attacks.
  3. Mitigate the incident. Lastly, the skills necessary to mitigate the incident have to be available.

MTTR can be reduced by increasing budget or headcount, but that isn’t always realistic. Instead, deploy artificial intelligence (AI) and machine learning (ML) to automate as much of the repair process as possible. Those steps include rapid detection, minimization of false positives, smart escalation, and automated remediation that includes workflows that reduce MTTR.

MTTR can be a helpful metric to reduce downtime and streamline your DevOps and ITOps teams, but improving it shouldn’t be the end goal. After all, the point of using metrics is not simply improving numbers but, in this instance, the practical matter of keeping systems running and protecting the business and its customers. Use MTTR in a way that helps your teams protect customers and optimize system uptime.

Improve MTTR With a Modern Log Management Solution

Logs are invaluable for any kind of incident response. Humio’s platform enables complete observability for all streaming logs and event data to help IT organizations better prepare for the unknown and quickly find the root cause of any incident.

Humio leverages modern technologies, including data streaming, index-free architecture and hybrid deployments, to optimize compute resources and minimize storage costs. Because of this, Humio can collect structured and unstructured data in memory to make exploring and investigating data of any size blazing fast.

Humio Community Edition

With a modern log management platform, you can monitor and improve your MTTR. Try it out at no cost!

Securing the Application Lifecycle with Scale and Speed: Achieving Holistic Workload Security with CrowdStrike and Nutanix

22 November 2021 at 22:17

With virtualization in the data center and further adoption of cloud infrastructure, it’s no wonder that IT, DevOps and security teams grapple with new and evolving security challenges. An increase in virtualized applications and desktops has caused organizations’ attack surfaces to expand quickly, enabling highly sophisticated attackers to take advantage of the minimal visibility and control these teams hold.

The question remains: How can your organization secure your production environments and cloud workloads to ensure that you can build and run apps at speed and with confidence? The answer: CrowdStrike Falcon® on the Nutanix Cloud Platform.

Delivered through CrowdStrike’s single lightweight Falcon agent, this protection enables your team to take an adversary-focused approach to securing your Nutanix cloud workloads — all without impacting performance. With scalable and holistic security, your team can achieve comprehensive workload protection and visibility across virtual environments to meet compliance requirements and prevent breaches effectively and efficiently. 

Secure All of Your Cloud Workloads with CrowdStrike and Nutanix

By extending CrowdStrike’s world-class security capabilities into the Nutanix Cloud Platform, you can prevent attacks on virtualized workloads and endpoints on or off the network. The Nutanix-validated, cloud-native Falcon sensor enhances Nutanix’s native security posture for workloads running on Nutanix AHV without compromising your team’s output. By extending CrowdStrike protection to Nutanix deployments, including virtual machines and virtual desktop infrastructure (VDI), you get scalable and comprehensive workload and container breach protection to streamline operations and optimize performance.

CrowdStrike and Nutanix provide your DevOps and Security teams with layered security, so they can build, run and secure applications with confidence at every stage of the application lifecycle. Easily deploy and use the CrowdStrike Falcon sensor without hassle for your Nutanix AHV workloads and environment. 

CrowdStrike’s intelligent cloud-native Falcon agent is powered by the proprietary CrowdStrike Threat Graph®, which captures trillions of high-fidelity signals per day in real time from across the globe, fueling one of the world’s most advanced data platforms for security. The Falcon platform helps you gain real-time protection and visibility across your enterprise, preventing attacks on workloads on and off the network. 

Get Started and Secure Your Linux Workloads in the Cloud

With Nutanix and CrowdStrike, you can feel confident that your Linux workloads are secure on creation by using CrowdStrike’s Nutanix Terraform script built on Nutanix’s Terraform Provider. By deploying the CrowdStrike Falcon sensor during Linux instance creation, the lifecycle of building and securing workloads before they are operational in the cloud is made simple and secure, without operational friction. 

Get started with CrowdStrike and Nutanix by deploying Linux workloads securely with CrowdStrike’s Nutanix Terraform script.

Gain Holistic Security Coverage Without Compromising Performance

With CrowdStrike and Nutanix, you can seamlessly secure your end-to-end production environment, streamline operations and optimize application performance; easily manage storage and virtualization securely with CrowdStrike’s lightweight Falcon agent on the Nutanix Cloud Platform; and secure your Linux workloads with CrowdStrike’s Nutanix Terraform solution. Building, running and securing applications on the Nutanix Cloud Platform takes the burden of managing and securing your production environment off your team and ensures confidence.


Introduction to the Humio Marketplace

18 November 2021 at 08:56

This blog was originally published Oct. 11, 2021 on humio.com. Humio is a CrowdStrike Company.

Humio is a powerful and super flexible platform that allows customers to log everything and answer anything. Users can choose how to ingest their data and choose how to create and manage their data with Humio. The goal of Humio’s marketplace is to provide a variety of packages that power our customers with faster and more convenient ways to get more from their data across a variety of use cases.

What is the Humio Marketplace?

The Humio Marketplace is a collection of prebuilt packages created by Humio, partners and customers that Humio customers can access within the Humio product interface.

These packages are relevant to popular log sources and typically contain a parser and some dashboards and/or saved queries. The package documentation includes advice and guidance on how to best ingest the data into Humio to start getting immediate value from logs.

What is a package?

The Marketplace contains prebuilt packages that are essentially YAML files that describe the Humio assets included in the package. A package can include any or all of: a parser, saved searches, alerts, dashboards, lookup files and labels. The package also includes YAML files for the metadata of the package (such as descriptions and tags, support status and author), and a README file which contains a full description and explanation of any prerequisites, etc.

Packages can be configured as either a Library type package — which means, once installed, the assets are available as templates to build from — or an Application package, which means, once installed, the assets are instantiated and are live immediately.

By creating prebuilt content that is quick and simple to install, we want to make it easier for customers to onboard new log sources to Humio to quickly get value from that data. With this prebuilt content, customers won’t have to work out the best way of ingesting the logs and won’t have to create parsers and dashboards from scratch.

How do I make a package?

Packages are a great way to mitigate manual work, whether that’s taking advantage of prebuilt packages or making your own so you don’t have to rebuild the same content from scratch each time.

Anyone can create a Humio package straight from Humio’s interface. We actively encourage customers and partners to create packages and submit those packages for inclusion in the Marketplace if they think they could benefit other customers. Humio will work with package creators to make sure the package meets our standards for inclusion in the Marketplace. By sharing your package with all Humio customers through the Marketplace, you are strengthening the community and allowing others to benefit from your expertise while you, likewise, benefit from others’ expertise.

For some customers, the package will be exactly what they want, but for others, it will be a useful starting point for further customization. All Humio packages are provided under an Apache 2.0 license, so customers are free to adapt and reuse the package as needed.

If I install a package, will it get updated?

Package creators can develop updates in response to changes in log formats or to introduce new functionality and improvements. Updates will be advertised as available in the Marketplace and users can choose to accept the update. The update process will check to see if any local changes have been made to assets installed from the package and, if so, will prompt the user to either overwrite the changes with the standard version from the updated package or to keep the local changes.

Are packages free?

Yes, all Humio packages in the Marketplace are free to use!

Can I use packages to manage my own private Humio content?

Absolutely! Packages are a convenient way for customers to manage their own private Humio content. Packages can be created in the Humio product interface and can be downloaded as a ZIP file and uploaded into a different Humio repository or a different instance of Humio (cloud or hybrid). Customers can also store their Humio packages in a code repository and use their CI/CD tools and the Humio API to deploy and manage Humio assets as they would their own code. This streamlines Humio support and operations and delivers a truly agile approach to log management.

Get started today

Getting started with packages is simple. All you need is access to a Humio Cloud service, or if running Humio self-hosted, you need to be on V1.21 or later. To create and install packages, you need the “Change Packages” permission assigned to your Humio user role.

Access the Marketplace from within the Humio product UI (Go to Settings, Packages, then Marketplace to browse the available packages or to create your own package). Try creating a package and uploading it to a different repository. If you create a nice complex dashboard and want to recreate it in a different repository, you know what to do: Create a package; export/import it, and then you don’t need to spend time recreating it!

Let us know what else you want to see in the Marketplace by connecting with us at The Nest or emailing [email protected].


Ransomware (R)evolution Plagues Organizations, But CrowdStrike Protection Never Wavers

  • ECrime activities dominate the threat landscape, with ransomware as the main driver
  • Ransomware operators constantly refine their code and the efficacy of their operations
  • CrowdStrike uses improved behavior-based detections to prevent ransomware from tampering with Volume Shadow Copies
  • Volume Shadow Copy Service (VSS) backup protection nullifies attackers’ deletion attempts, retaining snapshots in a recoverable state

Ransomware is dominating the eCrime landscape and is a significant concern for organizations, as it can cause major disruptions. ECrime accounted for over 75% of interactive intrusion activity from July 2020 to June 2021, according to the recent CrowdStrike 2021 Threat Hunting Report. The continually evolving big game hunting (BGH) business model has seen widespread adoption, with access brokers facilitating access and dedicated leak sites applying pressure for victim compliance. Ransomware continues to evolve, with threat actors implementing components and features that make it more difficult for victims to recover their data. 

Lockbit 2.0 Going for the Popularity Vote

The LockBit ransomware family has constantly been adding new capabilities, including tampering with Microsoft Server Volume Shadow Copy Service (VSS) by interacting with the legitimate vssadmin.exe Windows tool. Capabilities such as lateral movement or destruction of shadow copies are some of the most effective and pervasive tactics ransomware uses.

Figure 1. LockBit 2.0 ransom note (Click to enlarge)

The LockBit 2.0 ransomware has similar capabilities to other ransomware families, including the ability to bypass UAC (User Account Control), self-terminate or check the victim’s system language before encryption to ensure that it’s not in a Russian-speaking country. 

For example, LockBit 2.0 checks the default language of the system and the current user by using the Windows API calls GetSystemDefaultUILanguage and GetUserDefaultUILanguage. If the language code identifier matches the one specified, the program will exit. Figure 2 shows how the language validation is performed (function call 49B1C0).

Figure 2. LockBit 2.0 performing system language validation

LockBit can even perform a silent UAC bypass without triggering any alerts or the UAC popup, enabling it to encrypt silently. It begins by checking whether it’s running under Admin privileges, using specific API functions to get the process token (NTOpenProcessToken), create a SID identifier to check the permission level (CreateWellKnownSid), and then check whether the current process has sufficient admin privileges (CheckTokenMembership and ZwQueryInformationToken functions).

Figure 3. Group SID permissions for running process

If the process is not running under Admin, it will attempt to do so by initializing a COM object with elevation of the COM interface by using the elevation moniker COM initialization method with guid: Elevation:Administrator!new:{3E5FC7F9-9A51-4367-9063-A120244FBEC7}. A similar elevation trick has been used by DarkSide and REvil ransomware families in the past.

LockBit 2.0 also has lateral movement capabilities and can scan for other hosts to spread to other network machines. For example, it calls the GetLogicalDrives function to retrieve a bitmask of currently available drives to list all available drives on the system. If the found drive is a network share, it tries to identify the name of the resource and connect to it using API functions, such as WNetGetConnectionW, PathRemoveBackslashW, OpenThreadToken and DuplicateToken.

In essence, it’s no longer about targeting and compromising individual machines but entire networks. REvil and LockBit are just some of the recent ransomware families that feature this capability, while others such as Ryuk and WastedLocker share the same functionality. The CrowdStrike Falcon OverWatch™ team found that in 36% of intrusions, adversaries can move laterally to additional hosts in less than 30 minutes, according to the CrowdStrike 2021 Threat Hunting Report.

Another interesting feature of LockBit 2.0 is that it prints out the ransom note message on all connected printers found in the network, adding public shaming to its encryption and data exfiltration capabilities.

VSS Tampering: An Established Ransomware Tactic

The tampering and deletion of VSS shadow copies is a common tactic to prevent data recovery. Adversaries will often abuse legitimate Microsoft administrator tools to disable and remove VSS shadow copies. Common tools include Windows Management Instrumentation (WMI), BCDEdit (a command-line tool for managing Boot Configuration Data) and vssadmin.exe. LockBit 2.0 utilizes the following WMI command line for deleting shadow copies:

C:\Windows\System32\cmd.exe /c vssadmin delete shadows /all /quiet & wmic shadowcopy delete & bcdedit /set {default} bootstatuspolicy ignoreallfailures & bcdedit /set {default} recoveryenabled no

The use of preinstalled operating system tools, such as WMI, is not new. Still, adversaries have started abusing them as part of the initial access tactic to perform tasks without requiring a malicious executable file to be run or written to the disk on the compromised system. Adversaries have moved beyond malware by using increasingly sophisticated and stealthy techniques tailor-made to evade autonomous detections, as revealed by CrowdStrike Threat Graph®, which showed that 68% of detections indexed in April-June 2021 were malware-free.

VSS Protection with CrowdStrike

CrowdStrike Falcon takes a layered approach to detecting and preventing ransomware by using behavior-based indicators of attack (IOAs) and advanced machine learning, among other capabilities. We are committed to continually improving the efficacy of our technologies against known and unknown threats and adversaries. 

CrowdStrike’s enhanced IOA detections accurately distinguish malicious behavior from benign, resulting in high-confidence detections. This is especially important when ransomware shares similar capabilities with legitimate software, like backup solutions: both can enumerate directories and write files that on the surface may seem inconsequential but, when correlated with other indicators on the endpoint, can identify a legitimate attack. Correlating seemingly ordinary behaviors allows us to identify opportunities for coverage across a wide range of malware families. For example, a single IOA can provide coverage for multiple families and previously unseen ones.

CrowdStrike’s recent innovation involves protecting shadow copies from being tampered with, adding another protection layer to mitigate ransomware attacks. Protecting shadow copies helps potentially compromised systems restore encrypted data with much less time and effort. Ultimately, this helps reduce operational costs associated with person-hours spent spinning up encrypted systems post-compromise.

The Falcon platform can prevent suspicious processes from tampering with shadow copies and performing actions such as changing file size to render the backup useless. For instance, should a LockBit 2.0 ransomware infection occur and attempt to use the legitimate Microsoft administrator tool (vssadmin.exe) to manipulate shadow copies, Falcon immediately detects this behavior and prevents the ransomware from deleting or tampering with them, as shown in Figure 4.

Figure 4. Falcon detects and blocks vssadmin.exe manipulation by LockBit 2.0 ransomware (Click to enlarge)

In essence, while a ransomware infection might be able to encrypt files on a compromised endpoint, Falcon can prevent ransomware from tampering with shadow copies and potentially expedite data recovery for your organization.

Figure 5. Falcon alert on detected and blocked ransomware activity for deleting VSS shadow copies (Click to enlarge)

Shown below is LockBit 2.0 executing on a system without Falcon protections. Here, vssadmin is used to list the shadow copies. Notice the shadow copy has been deleted after execution.

Below is the same LockBit 2.0 execution, now with Falcon and VSS protection enabled. The shadow copy is not deleted even though the ransomware has run successfully. Please note, we specifically allowed the ransomware to run during this demonstration.

CrowdStrike prevents the destruction and tampering of shadow copies with volume shadow service backup protection, retaining the snapshots in a recoverable state regardless of whether threat actors use traditional or novel techniques. This allows for instant recovery of live systems post-attack through direct snapshot tools or system recovery.

VSS shadow copy protection is just one of the new improvements added to CrowdStrike’s layered approach. We remain committed to our mission to stop breaches, and constantly improving our machine learning and behavior-based detection and protection technologies enables the Falcon platform to identify and protect against tactics, techniques and procedures associated with sophisticated adversaries and threats.

CrowdStrike’s Layered Approach Provides Best-in-Class Protection

The Falcon platform unifies intelligence, technology and expertise to successfully detect and protect against ransomware. Artificial intelligence (AI)-powered machine learning and behavioral IOAs, fueled by a massive data set of trillions of events per week and threat actor intelligence, can identify and block ransomware. Coupled with expert threat hunters that proactively see and stop even the stealthiest of attacks, the Falcon platform uses a layered approach to protect the things that matter most to your organization from ransomware and other threats.

CrowdStrike Falcon endpoint protection packages unify the comprehensive technologies, intelligence and expertise needed to successfully stop breaches. For fully managed detection and response (MDR), Falcon Complete™ seasoned security professionals deliver 403% ROI and 100% confidence.

Indicators of Compromise (IOCs)

File: LockBit 2.0
SHA256: 0545f842ca2eb77bcac0fd17d6d0a8c607d7dbc8669709f3096e5c1828e1c049


Unexpected Adventures in JSON Marshaling

17 November 2021 at 09:29

Recently, one of our engineering teams encountered what seemed like a fairly straightforward issue: When they attempted to store UUID values to a database, it produced an error claiming that the value was invalid. With a few tweaks to one of our internal libraries, our team was able to resolve the issue. Or did they?

Fast forward one month later, and a different team noticed a peculiar problem. After deploying a new release, their service began logging strange errors alerting the team that the UUID values from the redrive queue could not be read.

So what went wrong? What we soon realized is that when we added a new behavior to our UUID library to solve our first problem, we inadvertently created a new one. In this blog post, we explore how adding seemingly benign new methods can actually be a breaking change, especially when working with JSON support in Go. We will walk through what we did wrong and how we were able to dig our way out of it. We’ll also outline some best practices for managing this type of change, along with some thoughts on how to avoid breaking things in the first place.

When Closing a Functional Gap Turns Into a Bug

This all started when one of our engineering teams added a new PostgreSQL database and ran into issues. They were attempting to store UUID values in a JSONB column in the PostgreSQL database using our internal csuuid library, which wraps a UUID value and adds some additional functionality specific to our systems. Strangely, the generated SQL being sent to the database always contained an empty string for that column, which is an invalid value.

INSERT INTO table (id, uuid_val) VALUES (42, '');

ERROR: invalid input syntax for type json

Checking the code, we saw that there was no specific logic for supporting database persistence.  Conveniently, the Go standard library already provides the scaffolding for making types compatible with database drivers in the form of the database/sql.Scanner and database/sql/driver.Valuer interfaces. The former is used when reading data from a database driver and the latter for writing values to the driver. Each interface is a single method and, since a csuuid.UUID wraps a github.com/gofrs/uuid.UUID value that already provides the correct implementations, extending the code was straightforward.
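
As a rough illustration of that delegation (a simplified stand-in rather than the actual csuuid source, with an assumed field layout), the wrapper only needs to forward Scan and Value to the embedded github.com/gofrs/uuid.UUID:

package csuuid

import (
	"database/sql/driver"

	"github.com/gofrs/uuid"
)

// UUID wraps a github.com/gofrs/uuid.UUID value; the single pointer field
// here is illustrative, not the real internal layout.
type UUID struct {
	id *uuid.UUID
}

// Scan implements database/sql.Scanner by delegating to the inner UUID.
func (u *UUID) Scan(src interface{}) error {
	var inner uuid.UUID
	if err := inner.Scan(src); err != nil {
		return err
	}
	u.id = &inner
	return nil
}

// Value implements database/sql/driver.Valuer by delegating to the inner UUID.
func (u UUID) Value() (driver.Value, error) {
	if u.id == nil {
		return nil, nil
	}
	return u.id.Value()
}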

With this change, the team was now able to successfully store and retrieve csuuid.UUID values in the database.

Free Wins

As often happens, the temptation of “As long as we’re updating things …” crept in. We noticed that csuuid.UUID also did not include any explicit support for JSON marshaling. Like with the database driver support, the underlying github.com/gofrs/uuid.UUID type already provided the necessary functionality, so extending csuuid.UUID for this feature felt like a free win.

If a type can be represented as a string in a JSON document, then you can satisfy the encoding.TextMarshaler and encoding.TextUnmarshaler interfaces to convert your Go struct to/from a JSON string, rather than satisfying the potentially more complex Marshaler and Unmarshaler interfaces from the encoding/json package.

The excerpt from the documentation for the Go standard library’s json.Marshal() function below (emphasis mine) calls out this behavior:

Marshal traverses the value v recursively. If an encountered value implements the Marshaler interface and is not a nil pointer, Marshal calls its MarshalJSON method to produce JSON. If no MarshalJSON method is present but the value implements encoding.TextMarshaler instead, Marshal calls its MarshalText method and encodes the result as a JSON string. The nil pointer exception is not strictly necessary but mimics a similar, necessary exception in the behavior of UnmarshalJSON.

A UUID is a 128-bit value that can easily be represented as a 32-character string of hex digits; that string format is the typical way they are stored in JSON. Armed with this knowledge, extending csuuid.UUID to “correctly” support converting to/from JSON was another simple bit of code.

Other than a bit of logic to account for the pointer field within csuuid.UUID, these two new methods only had to delegate things to the inner github.com/gofrs/uuid.UUID value.
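
Continuing the illustrative wrapper from the sketch above (again, an approximation rather than the real library code), the two methods look roughly like this:

// Continues the illustrative csuuid.UUID sketch above; assumes the same
// package and imports (github.com/gofrs/uuid).

// MarshalText implements encoding.TextMarshaler, emitting the inner UUID's
// canonical string form so the value serializes to a JSON string.
func (u UUID) MarshalText() ([]byte, error) {
	if u.id == nil {
		return []byte(""), nil
	}
	return u.id.MarshalText()
}

// UnmarshalText implements encoding.TextUnmarshaler by delegating the parsing
// of the hex string to the inner github.com/gofrs/uuid.UUID.
func (u *UUID) UnmarshalText(text []byte) error {
	var inner uuid.UUID
	if err := inner.UnmarshalText(text); err != nil {
		return err
	}
	u.id = &inner
	return nil
}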

At this point, we felt like we had solved the original issue and gotten a clear bonus win. We danced a little jig and moved on to the next set of problems.

Celebrations all around!

A Trap Awaits

Unfortunately, all was not well in JSON Land. Several months after applying these changes, we deployed a new release of another of our services and started seeing errors logged about it not being able to read in values from its AWS Simple Queue Service (SQS) queue.  For system stability, we always do canary deployments of new services before rolling out changes to the entire fleet.  The new error logs started when the canary for this service was deployed.

Below are examples of the log messages:

From the new instances:
[ERROR] ..../sqs_client.go:42 - error unmarshaling Message from SQS: json: cannot unmarshal object into Go struct field event.trace_id of type *csuuid.UUID error='json: cannot unmarshal object into Go struct field event.trace_id of type *csuuid.UUID'

From both old and new instances:
[ERROR] ..../sqs_client.go:1138 - error unmarshaling Message from SQS: json: cannot unmarshal string into Go struct field event.trace_id of type csuuid.UUID error='json: cannot unmarshal string into Go struct field event.trace_id of type csuuid.UUID'

After some investigation, we were able to determine that the error was happening because we had inadvertently introduced an incompatibility in the JSON marshaling logic for csuuid.UUID. When one of the old instances wrote a message to the SQS queue and one of the new ones processed it, or vice versa, the code would fail to read in the JSON data, thus logging one of the above messages.

json.Marshal() and json.Unmarshal() Work, Even If by Accident

The hint that unlocked the mystery was noticing the slight difference in the two log messages. Some showed “cannot unmarshal object into Go struct field” and the others showed “cannot unmarshal string into Go struct field.” This difference triggered a memory of that “free win” we celebrated earlier.

The root cause of the bug was that, in prior versions of the csuuid module, the csuuid.UUID type contained only unexported fields, and it had no explicit support for converting to/from JSON. In this case, the fallback behavior of json.Marshal() is to output an empty JSON object, {}. Conversely, in the old code, json.Unmarshal() was able to use reflection to convert that same {} into an empty csuuid.UUID value.

The below example Go program displays this behavior:
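
A minimal stand-in for that example, using an illustrative type that, like the old csuuid.UUID, has only unexported fields and no marshaling methods:

package main

import (
	"encoding/json"
	"fmt"
)

// oldUUID mimics the prior csuuid.UUID: only unexported fields and no
// MarshalJSON/MarshalText methods. The UUID string below is made up.
type oldUUID struct {
	id string
}

func main() {
	u := oldUUID{id: "a9c0ad24-7a94-4b22-a9a4-4e3dfd37a1d2"}

	// With no marshaling methods and no exported fields, encoding/json
	// falls back to an empty JSON object.
	out, _ := json.Marshal(u)
	fmt.Println(string(out)) // {}

	// And {} unmarshals "successfully" back into an empty value.
	var round oldUUID
	err := json.Unmarshal([]byte("{}"), &round)
	fmt.Println(round, err) // {} <nil>
}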

With the new code, we were trying to read that empty JSON object {} (which was produced by the old code on another node) as a string containing the hex digits of a UUID. This was because json.Unmarshal() was calling our new UnmarshalText() method and failing, which generated the log messages shown above. Similarly, the new code was producing a string of hex digits where the old code, without the new UnmarshalText() method, expected to get a JSON object.

We encountered a bit of serendipity here, though: we accidentally discovered that, prior to these changes, the service had been losing the trace ID values called out in the logs for messages that went through the redrive logic. Fortunately, this hidden bug hadn’t caused any actual issues for us.

The snippet below highlights the behavior of the prior versions.
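
A stand-in for that snippet, showing how a message's trace ID silently degrades to an empty value on a JSON round trip under the old behavior (the struct shapes and UUID value are illustrative):

package main

import (
	"encoding/json"
	"fmt"
)

// oldUUID again mimics the prior csuuid.UUID, with only an unexported field.
type oldUUID struct{ id string }

// event mirrors the general shape of a queued message carrying a trace ID.
type event struct {
	TraceID oldUUID `json:"trace_id"`
}

func main() {
	e := event{TraceID: oldUUID{id: "a9c0ad24-7a94-4b22-a9a4-4e3dfd37a1d2"}}

	// The trace ID silently serializes as an empty object ...
	payload, _ := json.Marshal(e)
	fmt.Println(string(payload)) // {"trace_id":{}}

	// ... and deserializes into an empty value: the ID is gone.
	var got event
	_ = json.Unmarshal(payload, &got)
	fmt.Printf("%+v\n", got) // {TraceID:{id:}}
}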

With this bug identified, we were in a quandary. The new code was correct and even fixed the data loss bug illustrated above. However, it was unable to read in JSON data produced by the old code. As a result, it was dropping those events from the service’s SQS queue, which was not an acceptable option. Additionally, this same issue could be extant in many other services.

A Way Out Presents Itself

Since a Big Bang, deploy-everything-at-once-and-lose-data solution wasn’t tenable, we needed to find a way for csuuid.UUID to support both the existing, invalid JSON data and the new, correct format.

Going back to the documentation for JSON marshaling, UnmarshalText() is the second option for converting from JSON. If a type satisfies encoding/json.Unmarshaler, by providing UnmarshalJSON([]byte) error, then json.Unmarshal() will call that method, passing in the bytes of the JSON data. By implementing that method and using a json.Decoder to process the raw bytes of the JSON stream, we were able to accomplish what we needed.

The core of the solution relied on taking advantage of the previously unknown bug where the prior versions of csuuid.UUID always generated an empty JSON object when serialized. Using that knowledge, we created a json.Decoder to inspect the contents of the raw bytes before populating the csuuid.UUID value.
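
A simplified version of that approach, continuing the illustrative wrapper from the earlier sketches (the real csuuid implementation differs in its details): peek at the first JSON token to decide whether the payload is the legacy empty object or the new hex string.

// Assumes the earlier sketch's UUID type plus imports of bytes, encoding/json,
// fmt and github.com/gofrs/uuid.

// UnmarshalJSON handles both the legacy empty-object form ({}) produced by
// older versions and the new hex-string form.
func (u *UUID) UnmarshalJSON(data []byte) error {
	dec := json.NewDecoder(bytes.NewReader(data))
	tok, err := dec.Token()
	if err != nil {
		return err
	}
	switch v := tok.(type) {
	case json.Delim:
		if v != json.Delim('{') {
			return fmt.Errorf("unexpected JSON delimiter %v for UUID", v)
		}
		// Legacy payload: an empty object produced by older versions.
		// Treat it as an empty UUID so old messages can still be processed.
		u.id = nil
		return nil
	case string:
		// New payload: the UUID's hex-string form.
		var inner uuid.UUID
		if err := inner.UnmarshalText([]byte(v)); err != nil {
			return err
		}
		u.id = &inner
		return nil
	case nil:
		// JSON null: leave the value unset.
		return nil
	default:
		return fmt.Errorf("unexpected JSON token %v for UUID", v)
	}
}

Once every producer has been upgraded to emit the string form, the legacy branch can eventually be retired.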

With this code in place, we were able to: 

  1. Confirm that the service could successfully queue and process messages across versions 
  2. Ensure any csuuid.UUID values are “correctly” marshaled to JSON as hex strings
  3. Write csuuid.UUID values to a database and read them back

Time to celebrate!

Lessons for the Future

Now that our team has resolved this issue, and all is well once again in JSON Land, let’s review a few lessons that we learned from our adventure:

  1. Normally, adding new methods to a type would not be a breaking change, as no consumers would be affected. Unfortunately, some special methods, like those that are involved in JSON marshaling, can generate breaking behavioral changes despite not breaking the consumer-facing API. This is something we overlooked when we got excited about our “free win.”
  2. Even if you don’t do it yourself, future consumers that you never thought of may decide to write values of your type to JSON. If you don’t consider what that representation should look like, the default behavior of Go’s encoding/json package may well do something that is deterministic but most definitely wrong, as was the case when generating {} as the JSON value for our csuuid.UUID type. Take some time to think about what your type should look like when written to JSON, especially if the type is exported outside of the local module/package.
  3. Don’t forget that the simple, straightforward solutions are not the only ones available. In this scenario, introducing the new MarshalText()/UnmarshalText() methods was the simple, well documented way to correctly support converting csuuid.UUID values to/from JSON. However, doing the simple thing is what introduced the bug. By switching to the lower-level json.Decoder we were able to extend csuuid.UUID to be backwards compatible with the previous code while also providing the “correct” behavior going forward.

Do you love solving technical challenges and want to embark on exciting engineering adventures? Browse our Engineering job listings and hear from some of the world’s most talented engineers.

Credentials, Authentications and Hygiene: Supercharging Incident Response with Falcon Identity Threat Detection

17 November 2021 at 09:17
  • CrowdStrike Incident Response teams leverage Falcon Identity Threat Detection (ITD) for Microsoft Active Directory (AD) and Azure AD account authentication visibility, credential hygiene and multifactor authentication implementation
  • Falcon ITD is integrated into the CrowdStrike Falcon® platform and provides alerts, dashboards and custom templates to identify compromised accounts and areas to reduce the attack surface and implement additional security measures
  • Falcon ITD allows our Incident Response teams to quickly identify malicious activity that would have previously only been visible through retroactive log review and audits, helping organizations eradicate threats faster and more efficiently

Incident responders and internal security teams have historically had limited visibility into Microsoft AD and Azure AD during an investigation, which has made containment and remediation more difficult and reliant on the victim organization to provide historical logs for retrospective analysis and perform manual authentication and hygiene audits. Since CrowdStrike acquired Preempt in 2020, the Services team has leveraged a new module in the Falcon platform, Falcon Identity Threat Detection (ITD), to gain timely and rich visibility throughout incident response investigations related to Active Directory, specifically account authentication visibility, credential hygiene and multifactor authentication implementation. This blog highlights the importance of Falcon ITD in incident response and how our incident response teams use Falcon ITD today.

How Falcon ITD Is Leveraged During Incident Response

It’s no secret that one of CrowdStrike’s key differentiators in delivering high-quality, lower-cost investigations to victim organizations is the Falcon platform. Throughout 2021, we have included Falcon ITD in the arsenal of Falcon modules when performing incident response. This new module provides both clients and responders with the following critical data points during a response:

  • Suspicious logins/authentication activity
  • Failed login activity, including password spraying and brute force attempts
  • Inventory of all identities across the enterprise, including stale accounts, with password hygiene scores
  • Identity store (e.g., Active Directory, LDAP/S) verification and assessment to discover any vulnerabilities across multiple domains
  • Consolidated events around user, device, activity and more for improved visibility and pattern identification
  • Creation of a “Watch List” of specific accounts of interest

In a typical incident response investigation, our teams work with clients to understand the high-level Active Directory topology numbers (e.g., domains, accounts, endpoints and domain controllers). Once the domain controllers are identified, the Falcon ITD sensor is installed to begin baselining and assessing accounts, privileges, authentications and AD hygiene, which typically completes within five to 24 hours. Once complete, Falcon ITD telemetry and results are displayed in the Falcon platform for our responders and clients to analyze.  

Figure 1 shows the Falcon ITD Overview dashboard, which features attack surface risk categories and assesses the severity as Low, Medium or High. CrowdStrike responders use this data to understand highly exploitable ways an attacker could escalate privileges, such as non-privileged accounts that have attack paths to privileged accounts, accounts that can be traversed to compromise the privileged accounts’ credentials, or if the current password policies allow accounts with passwords that can be easily cracked.

Figure 1. Overview dashboard in Falcon ITD (Click to enlarge)

Figure 2 shows the main Incidents dashboard. This dashboard highlights suspicious events based on baseline patterns and indicators of authentication activity, and also includes any custom detection patterns the CrowdStrike incident response teams have configured, such as alerting when an account authenticates to a specific system.

Figure 2. Incidents main dashboard in Falcon ITD (Click to enlarge)

CrowdStrike responders leverage this information to understand and confirm findings such as the following scenarios:

  • Credentials were used to perform unusual LDAP activity that fits Service Principal Name (SPN) enumeration patterns 
  • An account entered the wrong two-factor verification code or the identity verification timeout was reached
  • Credentials used are consistent with “pass the hash” (PtH) techniques
  • Unusual LDAP search queries known to be used by the BloodHound reconnaissance tool were performed by an account

In addition to the above built-in policies, CrowdStrike responders, in consultation with clients, may also configure custom rules that will trigger alerts and even enforce controls within Falcon ITD, such as the following:

  • Alert if a specific account or group of accounts authenticates to any system or specific ones
  • Enforce a block for specific accounts from authenticating to any system or specific ones
  • Enforce a block for specific authentication protocols being used 
  • Implement identity verification from a 2FA provider such as Google, Duo or Azure for any account or for a specific one attempting to authenticate via Kerberos, LDAP or NTLM protocols
  • Implement a password reset for any account that has a compromised password

In other cases, responders are looking for additional information on accounts of interest that were observed performing suspicious activity. Typically, incident responders would have to coordinate with the client and have the client’s team provide information about that account (e.g., what group memberships it belongs to, what privileges the account has, and if it is a service or human account). Figure 3 shows how Falcon ITD displays this information and more, including password last change date, password strength and historical account activity. This is another example of how CrowdStrike responders are able to streamline the investigation, allowing our client to focus on getting back to business in a safe and secure manner.

Figure 3. Account information displayed in Falcon ITD (Click to enlarge)

Hygiene and Reconnaissance Case Study

During a recent incident response investigation, CrowdStrike Services identified an eCrime threat actor that maintained intermittent access to the victim’s environment for years. The threat actor leveraged multiple privileged accounts and created a domain administrator account — undetected — to perform reconnaissance, move laterally and gather information from the environment.

CrowdStrike incident responders leveraged Falcon ITD to quickly map out permissions associated with the accounts compromised by the threat actor, and identify password hygiene issues that aided the threat actor. By importing a custom password list into Falcon ITD, incident responders were able to identify accounts that were likely leveraged by the threat actor with the same organizational default or easily guessed password.

Falcon ITD also allowed CrowdStrike’s incident response teams to track the threat actor’s reconnaissance of SMB shares across the victim environment. The threat actor leveraged a legitimate administrative account on a system that did not have Falcon installed. Fortunately, the visibility provided by Falcon ITD still alerted incident responders to this reconnaissance activity, and we coordinated with the client to implement remediations to eradicate the threat actor. 

Multifactor Authentication and Domain Replication Case Study

During another investigation, CrowdStrike incident responders identified a nation-state threat actor that compromised an environment and had remained persistent for multiple years. Given this level of sophistication and the threat actor’s knowledge of the victim environment’s network, Active Directory structure and privileged credential usage, no malware was needed to achieve their objectives.

In light of the multiyear undetected access, CrowdStrike incident responders leveraged Falcon ITD to aid in limiting the threat actor’s mobility by enforcing MFA validation for two scenarios, vastly reducing unauthorized lateral movement capabilities:

  • Enforce MFA (via Duo) for administrator usage of RDP to servers
  • Enforce MFA (via Duo) for any user to RDP from any server to a workstation

Falcon ITD’s detection capabilities were also paramount in identifying the threat actor’s resurgence in the victim network by alerting defenders to a domain replication attack. This allowed defenders to swiftly identify the source of the replication attack, which emanated from the victim’s VPN pool, and take corrective action on the VPN, impacted accounts and remote resources that were accessed by the threat actor.

Conclusion

Falcon Identity Threat Detection provides CrowdStrike incident response teams with another advantage when performing investigations into eCrime or nation-state attacks by providing increased visibility and control in Active Directory, which had previously been unachievable at speed and scale. 


A Principled Approach to Monitoring Streaming Data Infrastructure at Scale

17 November 2021 at 07:30

Virtually every aspect of a modern business depends on having a reliable, secure, real-time, high-quality data stream. So how do organizations design, build and maintain a data processing pipeline that delivers? 

In creating a comprehensive monitoring strategy for CrowdStrike’s data processing pipelines, we found it helpful to consider four main attributes: observability, operability, availability and quality.

As illustrated above, we’re modeling these attributes along two axes — complexity of implementation and engineer experience — which enables us to classify these attributes into four quadrants.

In using this model, it is possible to consider the challenges involved in building a comprehensive monitoring system and the iterative approach engineers can take to realize benefits while advancing their monitoring strategy.

For example, in the lower left quadrant, we start with basic observability, which is relatively easy to address and is helpful in terms of creating a positive developer experience. As we move along the X axis and up the Y axis, measuring these attributes becomes challenging and might need a significant development effort.

In this post, we explore each of the four quadrants, starting with observability, which focuses on inferring the operational state of our data streaming infrastructure from the knowledge of external outputs. We will then explore availability and discuss how we make sure that the data keeps flowing end-to-end in our streaming data infrastructure without interruption. Next, we will discuss simple and repeatable processes to deal with issues, along with the auto-remediations we created to help improve operability. Finally, we will explore how we improved the efficiency of our processing pipelines and established key indicators and enforceable service level agreements (SLAs) for quality.

Observability

Apache Kafka is a distributed, replicated messaging service platform that serves as a highly scalable, reliable and fast data ingestion and streaming tool. At CrowdStrike, we use Apache Kafka as the main component of our near real-time data processing systems to handle over a trillion events per day.

Ensuring Kafka Cluster Is Operational

When we create a new Kafka cluster, we must establish that it is reachable and operational. We can check that with a simple external service that constantly sends heartbeat messages to the Kafka cluster and, at the same time, consumes those messages. We can then verify that the messages it produces match the messages it consumes. By doing that, we gain confidence that the Kafka cluster is truly operational.
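
A minimal sketch of such a heartbeat check, written against the kafka-python client, is shown below; the broker addresses, topic name and timeout are illustrative assumptions rather than our production configuration.

# Sketch of a heartbeat check: produce a unique message and verify it can be consumed.
# Broker addresses, topic name and timeout are placeholder values.
import uuid

from kafka import KafkaConsumer, KafkaProducer

BROKERS = ["kafka-broker:9092"]   # placeholder bootstrap servers
TOPIC = "cluster-heartbeat"       # placeholder heartbeat topic


def kafka_cluster_is_operational(timeout_ms=10000):
    payload = uuid.uuid4().hex.encode()

    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers=BROKERS,
        auto_offset_reset="latest",
        consumer_timeout_ms=timeout_ms,
    )
    consumer.poll(timeout_ms=1000)  # force partition assignment before producing

    producer = KafkaProducer(bootstrap_servers=BROKERS)
    producer.send(TOPIC, payload)
    producer.flush()

    # The cluster is considered operational if the heartbeat we produced comes back.
    return any(message.value == payload for message in consumer)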

Once we establish that the cluster is operational, we check on other key metrics, such as the consumer group lag. 

Kafka Lag Monitoring

One of the key metrics to monitor when working with Kafka, as a data pipeline or a streaming platform, is consumer group lag.

When an application consumes messages from Kafka, it commits its offset in order to keep its position in the partition. When a consumer gets stuck for any reason — for example, an error, rebalance or even a complete stop — it can resume from the last committed offset and continue from the same point in time.

Therefore, lag is the delta between the last committed message and the last produced message. In other words, lag indicates how far behind your application is in processing up-to-date information. Also, Kafka persistence is based on retention, meaning that if your lag persists, you will eventually lose data. The goal is to keep lag to a minimum.

We use Burrow for monitoring Kafka consumer group lag. Burrow is an open source monitoring solution for Kafka that provides consumer lag checking as a service. It monitors committed offsets for all consumers and calculates the status of those consumers on demand. The metrics are exposed via an HTTP endpoint.

It also has configurable notifiers that can send status updates via email or HTTP if a partition status has changed based on predefined lag evaluation rules.

Burrow exposes both status and consumer group lag information in a structured format for a given consumer group across all of the partitions of the topic from which it is consuming. However, there is one drawback with this system: It will only present us with a snapshot of consumer group lag. Having the ability to look back in time and analyze historical trends in this data for a given consumer group is important for us.

To address this, we built a system called Kafka monitor. Kafka monitor fetches these metrics that are exposed by Burrow and stores them in a time series database. This enabled us to analyze historical trends and even perform velocity calculations like mean recovery time from lag for a Kafka consumer, for example.
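
A simplified sketch of that fetch-and-store loop is shown below. It assumes Burrow's v3 HTTP API; the Burrow URL, cluster and consumer group names, and the write_metric() helper standing in for the time series database client are illustrative.

# Sketch: poll Burrow for consumer group lag and persist it as time series data.
# The Burrow URL, cluster/group names and write_metric() helper are illustrative.
import time

import requests

BURROW_URL = "http://burrow.internal:8000"   # placeholder Burrow endpoint
CLUSTER = "example-cluster"                  # placeholder Kafka cluster name
GROUP = "example-consumer-group"             # placeholder consumer group


def write_metric(name, value, tags):
    # Placeholder for a write into the time series database.
    print(name, value, tags)


def poll_consumer_lag(interval_seconds=60):
    lag_url = f"{BURROW_URL}/v3/kafka/{CLUSTER}/consumer/{GROUP}/lag"
    while True:
        status = requests.get(lag_url, timeout=10).json().get("status", {})
        write_metric(
            "kafka.consumer.totallag",
            status.get("totallag", 0),
            {"cluster": CLUSTER, "group": GROUP, "state": status.get("status", "UNKNOWN")},
        )
        time.sleep(interval_seconds)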

In the next section, we explore how we implemented auto-remediations, using the consumer group status information from Burrow, to improve the availability and operability in our data infrastructure.

Availability and Operability

Kafka Cluster High Availability 

Initially, our organization relied on one very large cluster in Kafka to process incoming events. Over time, we expanded that cluster to manage our truly enormous data stream. 

However, as our company continues to grow, scaling our clusters vertically has become both problematic and impractical. Our recent blog post, Sharding Kafka for Increased Scale and Reliability, explores this issue and our solution in greater detail. 

Improved Availability and Operability for Stream Processing Jobs

For our stateless streaming jobs, we noticed that simply relaunching a job when it gets stuck has a good chance of getting the consumer out of the stuck state. However, it is not practical at our scale to relaunch these jobs manually, so we created a tool called AlertResponder. As the name implies, it automatically relaunches a stateless job upon the first consumer-stuck alert.

Of course, we’ll still investigate the root cause afterward. Also, when the relaunch does not fix the problem or if it fails to relaunch for some reason, AlertResponder will then escalate this to an on-call engineer by paging them.
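
Purely as an illustration of that flow (and not our actual implementation), the remediation logic boils down to something like the following, where relaunch_job(), consumer_recovered() and page_on_call() are hypothetical stand-ins for internal tooling.

# Illustrative sketch only: relaunch first, escalate to the on-call engineer if that fails.
# The three helpers below are hypothetical stand-ins for internal job and paging tooling.
def relaunch_job(job_name): ...
def consumer_recovered(job_name): ...
def page_on_call(job_name, reason): ...


def handle_consumer_stuck_alert(job_name):
    try:
        # First response: relaunch the stateless streaming job.
        relaunch_job(job_name)
    except Exception as exc:
        # The relaunch itself failed: escalate immediately.
        page_on_call(job_name, reason=f"relaunch failed: {exc}")
        return

    if not consumer_recovered(job_name):
        # Relaunch did not clear the stuck consumer: escalate as well.
        page_on_call(job_name, reason="consumer still stuck after relaunch")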

The second useful automation that we derive from our consumer lag monitoring is streaming job autoscaling. For most of our streams, traffic fluctuates on a daily basis, so using a fixed capacity for all streaming jobs is very inefficient. During peak hours, once traffic exceeds a certain threshold, consumer lag increases dramatically, and the direct impact is that customers see increased processing delays and latency.

This is where auto-scaling helps. We use two auto-scaling strategies:

  1. Scheduled scaling: For stream processing jobs for which we are able to reliably predict the traffic patterns over the course of a day, we implemented a scheduled auto scaling strategy. With this strategy, we scale the consumer groups to a predetermined capacity at a known point in time to match the traffic patterns.
  2. Scaling based on consumer lag: For jobs running on our Kubernetes platform, we use KEDA (Kubernetes-based Event Driven Autoscaler) to scale the consumer groups. With KEDA, you can drive the scaling of any container in Kubernetes based on the number of events needing to be processed. We use KEDA’s Prometheus scaler: using the consumer lag metrics available in Prometheus, KEDA calculates the number of containers needed for the streaming jobs and works with the Horizontal Pod Autoscaler (HPA) to scale the deployment accordingly, as sketched below.
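
The core of the lag-based calculation can be sketched as follows. In practice KEDA and the HPA perform this logic for us; the lag threshold per replica, the bounds and the get_total_consumer_lag() helper are illustrative assumptions.

# Sketch of lag-based scaling: derive the desired replica count from total consumer lag.
# The threshold, bounds and get_total_consumer_lag() helper are illustrative assumptions.
import math


def get_total_consumer_lag(group):
    # Placeholder for a query against the lag metrics (e.g., from Prometheus).
    ...


def desired_replicas(group, lag_per_replica=10000, min_replicas=1, max_replicas=50):
    total_lag = get_total_consumer_lag(group) or 0
    wanted = math.ceil(total_lag / lag_per_replica)
    # Clamp to the configured bounds so a lag spike cannot scale the job without limit.
    return max(min_replicas, min(max_replicas, wanted))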

Quality

When we talk about the quality of streaming data infrastructure, we are essentially considering two things: 

  1. Efficiency
  2. Conformance to service level agreements (SLAs)

Improving Efficiency Through Redistribution

When lag is uniform across a topic’s partitions, it can typically be addressed by horizontally scaling consumers, as discussed above; however, when lag is not evenly distributed across partitions, scaling is much less effective.

Unfortunately, there is no out-of-the box way to address the issue of lag hotspots on certain partitions of a topic within Kafka. In our recent post, Addressing Uneven Partition Lag in Kafka, we explore our solution and how we can coordinate it across our complex ecosystem of more than 300 microservices. 

SLA-based Monitoring

It is almost impossible to measure the quality of a service correctly, let alone well, without understanding which behaviors really matter for that service and how to measure and evaluate those behaviors.

Service level indicators (SLIs), like data loss rate and end-to-end latency, are useful to measure the quality of our streaming data infrastructure. 

As an example, we track end-to-end latency through external observation (black-box analysis), as described below.

We deploy monitors that submit sample input data to the data pipeline and observe the outputs from the pipeline. These monitors submit end-to-end processing latency metrics that, combined with our alerting framework, will be used to emit SLA-based alerts.
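
A stripped-down sketch of such a monitor is shown below; submit_sample(), output_contains() and emit_metric() are hypothetical stand-ins for the pipeline entry point, the output lookup and our metrics client.

# Illustrative black-box latency monitor: submit a tagged sample and time how long it
# takes to appear at the end of the pipeline. The three helpers are hypothetical stand-ins.
import time
import uuid


def submit_sample(marker): ...
def output_contains(marker): ...
def emit_metric(name, value): ...


def measure_end_to_end_latency(timeout_seconds=300.0):
    marker = uuid.uuid4().hex
    start = time.monotonic()
    submit_sample(marker)

    while time.monotonic() - start < timeout_seconds:
        if output_contains(marker):
            # Sample made it through the pipeline: record the end-to-end latency.
            emit_metric("pipeline.end_to_end_latency_seconds", time.monotonic() - start)
            return
        time.sleep(1)

    # No output within the timeout: emit a signal that feeds the SLA-based alerts.
    emit_metric("pipeline.end_to_end_timeout", 1)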

Conclusion

These four attributes — observability, availability, operability and quality — are each important in their own right for designing, working in and maintaining the streaming data infrastructure at scale. As discussed in our post, these attributes have a symbiotic relationship. The four-quadrant model not only exposes this relationship but also offers an intuitive mental model that helps us build a comprehensive monitoring solution for streaming data applications that operate at scale.

Have ideas to share about how you create a high-functioning data processing pipeline? Share your thoughts with @CrowdStrike via social media.

A Foray into Fuzzing

16 November 2021 at 20:41

One useful method in a security researcher’s toolbox for discovering new bugs in software is called “fuzz testing,” or just “fuzzing.” Fuzzing is an automatic software testing approach where the software that is to be tested (the target) is automatically fed with input data and its behavior during execution is analyzed and checked for any errors. For the CrowdStrike Intelligence Advanced Research Team, fuzzing is one of our crucial tools to perform bug hunting.

In fuzzing, a fuzzing engine generates suitable inputs, passes them to the target and monitors its execution. The goal is to find an input for which the target behaves undesirably, for example by crashing (e.g., with a segmentation fault). Figure 1 shows the main steps of a fuzzing run.

Figure 1. Steps a fuzzing engine performs during execution

Some of the most popular fuzzing engines are American Fuzzy Lop (AFL) and its successor AFL++; libFuzzer; and Honggfuzz. They are known not only to be very efficient in fuzzing but also to have a remarkable trophy case to show. 

Fuzzing can be quite successful because of its minimal overhead compared to other dynamic testing methods — in both compilation and preparation, and also in execution. It typically requires only lightweight instrumentation (e.g., a fixed number of instructions per basic block), and can therefore achieve close to native execution speed. One important disadvantage to consider is that the fuzzing engine usually tests only a fraction of all possible inputs, and bugs may remain undetected.

Automatically generating inputs that trigger some kind of a bug in a reasonable amount of time is therefore one of the main challenges of fuzzing. On one hand, the number of inputs of a certain length is typically very large. On the other hand, testing all possible inputs is usually not necessary or even desirable, especially if the data must follow a certain format to actually reach relevant code paths. 

One simple example is a target that considers an input to be valid if and only if it starts with a hard-coded string, aka a magic string. Therefore, many fuzzing engines expect a small set of valid inputs and then start deriving new inputs with different mutation strategies (e.g., flipping bits, or adding and removing arbitrary data). For some engines, this mutation is driven by instrumenting the target to measure the execution path that a certain input has triggered. The general assumption is that a change in the input that triggers a new execution path is more likely to discover crashes than a change that exercises a code path that was previously observed.

During fuzzing, inputs that crash or hang the fuzzing target indicate that a bug was triggered. Such inputs (or samples) are collected for further analysis. Provided the target behaves deterministically, any input can be easily passed to the target again to try to reproduce the result observed during fuzzing.

It is common for a fuzzing run to generate many samples that trigger the same bug in the target. For example, an input of 160 characters might trigger the same buffer overflow in the target as an input with 162 characters. To be able to handle the many potential samples generated during a fuzzing run and not have to analyze each individually, good tooling is crucial to triage them. While some targets require custom tooling, we found several strategies to be generally applicable, and we will introduce a few of them next.

Fuzzing

Instrumentation

Modern fuzzing approaches mutate those inputs that have shown particular promise of leading to new behavior. For example, coverage-guided fuzzing engines preferentially mutate those inputs that lead to undiscovered execution paths within the target. To be able to detect new execution paths, the fuzzing engine tries to measure the code coverage of a specific execution run. To do so, instrumentation is usually compiled into the target. 

The fuzzing tools usually provide their own compilation toolchains as a wrapper around common compilers (e.g., gcc or clang) to inject instrumentation at compile time. It should be noted that some fuzzing engines are also capable of fuzzing non-instrumented binaries — one popular method for fuzzing closed-source binaries is to use QEMU’s dynamic binary translation to transparently instrument them without recompilation. For the rest of this blog post, though, we’ll assume that we have access to the source code of a software and are allowed and able to modify and compile it.

Hunting for security vulnerabilities is important for estimating potential risks and exploitability of a software. However, only some types of vulnerabilities can be covered by a generic fuzzing engine. While a null pointer dereference would normally result in a crash of the fuzzing target and thus a report from the fuzzing engine, a logic flaw where a function returns a type-correct but wrong result is likely to go undetected. 

AFL++ and Honggfuzz, for example, report those inputs for which the execution of the fuzzing target was terminated with a signal (e.g., SIGSEGV or SIGABRT). A generic fuzzing engine has no way of knowing whether a function such as int isdigit(int c) was correct when 1 is returned for a given character c. In addition, not every memory issue leads to a crash. Unless it triggers an access violation, an out-of-bounds read operation might not be detected at all, and an out-of-bounds write operation may cause the fuzzing target to crash only if the overwritten data is subsequently used in some way (e.g., a return address on the stack or an allocated heap memory segment).

There are two general solutions to address these issues. For one, Address Sanitizer (ASan) can be used to find and report memory errors that would not normally cause a program to crash. ASan, if specified via the environment setting ASAN_OPTIONS="abort_on_error=1", terminates a program with signal SIGABRT if an error is detected. In addition, small function wrappers can be implemented to introduce application-specific checks or other optimizations, as shown next.

The Harness

Fuzzing a library or a program typically requires researchers to write a bit of wrapper code that implements an entry point for the fuzzer, potentially executes some setup code, and passes the fuzzing input to the function that is to be fuzzed. This wrapper code is typically called a harness. In addition to passing along the fuzzer input, a harness can also provide several other features.

First, a harness can normalize the input from the fuzzer to the target. Especially when targeting a function, wrapping it into an executable that presents a standardized interface to the fuzzing engine is necessary. This type of wrapper sometimes needs to do more than simply passing the input to the target function, because fuzzing engines in general would not be aware of any requirements or format that the target expects. For example, when targeting a function such as

int some_api(char *data, size_t data_length);

which expects a string and the length of this string as arguments, a wrapper such as the following can be used to make sure that the data generated by the fuzzing engine is passed to the function in the proper format:

int fuzz_some_api(char *fuzz_data)
    {
        return some_api(fuzz_data, strlen(fuzz_data));
    }

Other types of wrappers can aid the fuzzer by ignoring certain inputs that are known to cause false positives, for example because the target detects them as erroneous and reacts with a (not security-relevant) crash. 

For example, having a target function char *encode(char *data) that is known to safely crash if the input string contains certain characters, a wrapper such as the following could be used to avoid such false positive reports:

char *fuzz_encode(char *fuzz_data)
    {
        for (char *ptr = fuzz_data; *ptr != '\0'; ptr++)
            if (!fuzz_is_allowed_character(*ptr))
                return NULL;

        return encode(fuzz_data);
    }

Conversely, a wrapper can also be used to detect and signal unexpected behavior, even if the fuzzing target does not crash. For example, given two functions

  1. char *encode(char *data);
  2. char *decode(char *data);

(where decode() is expected to implement the reverse function of encode(), and vice versa) a wrapper function can ensure that for any generated input, the string returned by decode(encode(fuzz_data)) is equal to the input fuzz_data. The wrapper function, and entry point for the fuzzer, might implement a routine as follows:

void fuzz_functions(char *fuzz_data)
    {
        if (strcmp(decode(encode(fuzz_data)), fuzz_data) != 0)
            trigger_crash(); // force a crash, e.g. via *((int *) 0) = -1;
    }

In summary, wrapping the fuzzing target can often reduce the number of false positives by a considerable amount. When implementing wrappers, we found it to be very useful to integrate the wrapping code into the original codebase using #ifdef statements as shown below:

int main(int argc, char **argv)
    {
    #ifdef FUZZING
        return func(get_fuzz_data());
    #endif
        // original codebase:
        data = get_data_via_options(argc, argv);
        return func(data);
    }

Utilizing All Cores

Since fuzzing is resource-intensive, it would be ideal to utilize all processor cores that are available in a modern multi-core system. While Honggfuzz is a multi-process and multi-threaded fuzzing engine out of the box, AFL++ needs manual setup to do so.

To fuzz in parallel using AFL++, the fuzzing is started with one “master instance” (flagged with -M), and all other instances will be created as “secondary instances” (flagged with -S). The following excerpt is part of a script that can be used to spawn multiple instances of afl-fuzz, each one inside a screen session to be able to log out of the fuzzing host without interrupting the instances:

for i in $(seq -w 1 ${NUM_PROCS}); do
        if [ "$i" -eq "1" ]; then
            PARAM="-M fuzzer${i}"
        else
            PARAM="-S fuzzer${i}"
        fi

        # Double quotes so ${DIR_IN}, ${DIR_OUT}, ${PARAM} and ${BINARY} expand here.
        CMD="afl-fuzz -i ${DIR_IN} -o ${DIR_OUT} ${PARAM} ${BINARY}"

        echo "[+] starting fuzzing instance ${i} (parameter ${PARAM})..."
        screen -dmS "fuzzer-${i}" ${CMD}
    done

Crash Triage

After running for a while, the fuzzer will ideally have generated a number of inputs (samples) that crashed the target during fuzzing. We now aim to automatically aggregate this set of samples and enrich each sample and the corresponding potential crash with information to drive any further analysis. 

One of our strategies is to group (cluster) the samples with respect to the behavior of the fuzzing target during execution. Since the resulting clusters then represent different behavior of the fuzzing target, they are easier to triage for (exploitable) bugs. This clustering strategy requires information about the crash and the code path that led to it, which can be collected automatically, using debuggers, for example.

Other information that can be automatically collected is whether a crash is deterministically reproducible, and whether the build configuration (affecting, for example, function addresses or variable order) or the runtime environment (e.g., environment variables or network properties) have an impact on whether the target crashes on a particular sample. 

Given a sample from the fuzzing run, we can replay that input against the target compiled with different configurations (e.g., both with and without the fuzzer’s instrumentation, with different build-time options, or with and without ASan) and see whether the executions crash. The idea is to have different binaries with different configurations to capture circumstances of a crash with respect to a sample that was generated during a fuzzing run. 

For example, if only the instrumented version crashes, then the bug is potentially in the fuzzing-specific code and therefore a false positive. Another example is a sample generated by a fuzzing run on a target with ASan support, where a crash cannot be reproduced with a non-ASan version. In this case, there might be a bug that does not crash the target but could potentially be used to engineer an exploit (e.g., out-of-bound read access to leak sensitive information). 

Collecting all of this information will help us better understand why the samples collected by the fuzzer crash the target, under which circumstances, and whether they may have triggered an exploitable bug. Good strategies and tooling are essential to reduce the required amount of manual analysis.

Sample Collection

Since the different fuzzing engines save samples in different ways, another simple but necessary post-processing step to implement is sample collection. AFL++ and Honggfuzz default to storing each sample in its own file and using the filename to save information about the termination signal, the program counter, a stack signature, and the address and disassembled instruction at which the fault occurred. 

Unfortunately, both fuzzers use a different format out of the box, so the first step in our post-processing pipeline is to collect and move all samples to a shared folder, extract and store information from the filenames, and then rename them to a standard format. Renaming samples to a hash of their contents has worked well for us, because it allows a quick and easy merging of samples from different fuzzing runs.
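
A minimal sketch of this collection step, assuming a simple directory layout and SHA-256 as the content hash, might look like the following; the original, metadata-bearing filenames are preserved in an index next to the renamed samples.

# Sketch: collect crash samples from several fuzzer output directories into one folder,
# rename each to a hash of its contents and keep the original (metadata-bearing) names.
# Directory layout and hash choice are illustrative.
import hashlib
import json
import shutil
from pathlib import Path


def collect_samples(crash_dirs, out_dir):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    index = {}

    for crash_dir in crash_dirs:
        for sample in Path(crash_dir).iterdir():
            if not sample.is_file():
                continue
            digest = hashlib.sha256(sample.read_bytes()).hexdigest()
            index.setdefault(digest, []).append(sample.name)   # duplicates merge here
            shutil.copyfile(sample, out / digest)

    # Keep a mapping from content hash back to the original fuzzer filenames.
    (out / "samples.json").write_text(json.dumps(index, indent=2))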

Information Collection Using GDB

For each sample, we automatically collect information as already indicated. One of the building blocks is an analysis module that uses gdb to collect various information about the crash of a target on a given sample. For the sake of simplicity, we’ll assume that the target expects data either from STDIN or as a command line argument and is unaffected by other vectors that could affect the execution of a binary (e.g., network, environment variables, file system properties). The module invokes gdb as follows:

/usr/bin/gdb -q -nx $binary

The -nx flag is used to avoid loading a gdbinit file, while the -q flag is used to stop gdb printing its version string. After invoking gdb, the following gdb commands are executed automatically:

(gdb) set width unlimited
    (gdb) run {run_payload}
    (gdb) backtrace -frame-info short-location

The first command prevents gdb from breaking long lines, e.g., when printing backtraces. The second command executes the fuzzing target, feeding it either the path to the sample or the actual sample content. The third command generates the backtrace. If the execution of the binary finishes without a crash, or times out, the evaluation is stopped and no backtrace is generated.

The backtrace in general is a summary of the program’s path to the current point in execution. It consists of stack frames, where each stack frame relates to one nested function call. For example, having a function f() that calls the function g(), and the function g() calls a function h(), and a backtrace is generated inside h(), that backtrace might look as follows:
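
The exact formatting depends on the gdb version and backtrace options used, but an illustrative, simplified backtrace for this nesting (with made-up addresses and source locations) would be:

#0  h () at example.c:9
#1  0x0000555555555151 in g () at example.c:14
#2  0x0000555555555161 in f () at example.c:19
#3  0x0000555555555171 in main () at example.c:24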

In summary, by executing the binary once on each sample, gdb will tell us whether the binary crashed at all, and if it did, gdb will yield the signal that terminated the process as well as a backtrace. The backtrace provides the names of invoked functions, their addresses, variable states and additional debugging information.

This information is then parsed, stored in a database and subsequently used to cluster all of the samples in order to reduce the overhead of identifying interesting bugs.

Clustering

After gathering information about all of our samples and adding them to a database, the next step is to try and sort the samples into clusters. Obviously, there are many possible approaches to do that. One method that works very well for us, while being exceedingly simple, is based on hashing the list of addresses of the backtrace. The following source code excerpt shows this approach:

import hashlib


def compute_clusterhash(backtrace):
    bt_addresses = [frame["addr"] for frame in backtrace]
    return hashlib.md5('.'.join(bt_addresses).encode()).hexdigest()

For each sample, there is an entry in a database that looks as follows:

{
        "sample": "e5f3438438270583ff09cd84790ee46e",
        "crash": true,
        "signal": "SIGSEGV",
        "signal_description": "Segmentation fault",
        "backtrace": [
            {
                "addr": "0x00007ffff7f09592",
                "func": "__memmove_avx_unaligned_erms", [...]
            },
            {
                "addr": "0x00007ffff7fb3524",
                "func": "ReadFromRFBServer", [...]
            },
            {
                "addr": "0x00007ffff7fae7da",
                "func": "HandleTRLE24", [...]
            },
            {
                "addr": "0x00007ffff7f9c9ba",
                "func": "HandleRFBServerMessage", [...]
            },
            {
                "addr": "0x0000555555555762",
                "func": "spin_fuzzing", [...]
            },
            {
                "addr": "0x00005555555558e5",
                "func": "main", [...]
            }
        ]
    }

This information is now transformed into a hash using compute_clusterhash(), as shown below:

>>> compute_clusterhash(example["backtrace"])
'3c3f5e47c2c59c8ce0272262f87dc7aa'

We can now cluster our samples by these hashes, hoping samples that trigger different bugs yield different hashes, and samples that trigger the same bug yield the same hash. The next step would be to examine the different clusters to better understand the underlying bugs and learn how to trigger and potentially exploit them. In the best case, just one or only very few samples from each cluster would need to be reviewed.

In our experience, deriving clusters based on the backtrace — generated at the point where the crash occurs — is more useful than considering the full execution path that led to the crash, because usually software is complex enough that there are often different execution paths leading to the same position in code. However, if the analysis reveals that the same bug is reached through different backtraces for a certain target, the described method could be changed to trim the backtrace to only a number of most recent frames during clustering under the assumption that code closer to the crash is more relevant than code that was executed earlier in the program.
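
If such trimming turns out to be necessary for a target, the change to the clustering code is small. A sketch building on compute_clusterhash() above, with the frame depth as a tunable parameter, might look like this:

# Sketch: cluster on only the N most recent frames of the backtrace (frame 0 is the
# innermost), under the assumption that code closer to the crash is more relevant.
import hashlib


def compute_clusterhash_trimmed(backtrace, max_frames=5):
    bt_addresses = [frame["addr"] for frame in backtrace[:max_frames]]
    return hashlib.md5('.'.join(bt_addresses).encode()).hexdigest()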

Conclusion

Fuzzing is a well-established technique for discovering new vulnerabilities in software. With this blog, we hope to give you an overview of what is required to successfully fuzz a target, from implementing a harness, to gathering crash information, to using this information for clustering inputs and corresponding crashes for further analysis.

Additional Resources

Everything You Need To Know About Log Analysis

16 November 2021 at 09:51

This blog was originally published Sept. 30, 2021 on humio.com. Humio is a CrowdStrike Company.

What Is Log Analysis?

Log analysis is the process of reviewing computer-generated event logs to proactively identify bugs, security threats, factors affecting system or application performance, or other risks. Log analysis can also be used more broadly to ensure compliance with regulations or review user behavior.

A log is a comprehensive file that captures activity within the operating system, software applications or devices. Logs automatically document any information designated by the system administrators, including: messages, error reports, file requests, file transfers and sign-in/out requests. The activity is also time-stamped, which helps IT professionals and developers establish an audit trail in the event of a system failure, breach or other outlying event.

Why Is Log Analysis Important?

In some cases, log analysis is critical for compliance since organizations must adhere to specific regulations that dictate how data is archived and analyzed. It can also help predict the useful lifespan of hardware and software. In addition, log analysis can help IT teams amplify four key factors that help deliver greater business value and customer-centric solutions: agility, efficiency, resilience and customer value.

Log analysis can unlock many additional benefits for the business. These include:

  • Improved troubleshooting. Organizations that regularly review and analyze logs are typically able to identify errors more quickly. With an advanced log analysis tool, the business may even be able to pinpoint problems before they occur, which greatly reduces the time and cost of remediation. Logs also help the log analyzer review the events leading up to the error, which may make the issue easier to troubleshoot and prevent in the future.
  • Enhanced cybersecurity. Effective log analysis dramatically strengthens the organization’s cybersecurity capabilities. Regular review and analysis of logs helps organizations more quickly detect anomalies, contain threats and prioritize responses.
  • Improved customer experience. Log analysis helps businesses ensure that all customer-facing applications and tools are fully operational and secure. The consistent and proactive review of log events helps the organization quickly identify disruptions or even prevent such issues — improving satisfaction and reducing turnover.
  • Agility. Log analysis helps organizations predict the useful life span of hardware and software and prepare for scale, providing a competitive edge in the marketplace.

How Is Log Analysis Performed?

Log analysis is typically done within a log management system, a software solution that gathers, sorts and stores log data and event logs from a variety of sources.

Log management platforms allow the IT team and security professionals to establish a single point from which to access all relevant endpoint, network and application data. Typically, logs are searchable, which means the log analyzer can easily access the data they need to make decisions about network health, resource allocation or security. Traditional log management uses indexing, which can slow down search and analysis. Modern log management uses index-free search; it is less expensive, faster and can deliver 50-100x savings in required disk space.

Log analysis typically includes:

Ingestion: Installing a log collector to gather data from a variety of sources, including the OS, applications, servers, hosts and each endpoint, across the network infrastructure.

Centralization: Aggregating all log data in a single location as well as a standardized format regardless of the log source. This helps simplify the analysis process and increase the speed at which data can be applied throughout the business.

Search and analysis: Leveraging a combination of AI/ML-enabled log analytics and human resources to review and analyze known errors, suspicious activity or other anomalies within the system. Given the vast amount of data available within the log, it is important to automate as much of the log analysis process as possible. It is also recommended to create a graphical representation of data, through knowledge graphing or other techniques, to help the IT team visualize each log entry, its timing and interrelations.

Monitoring and alerts: The log management system should leverage advanced log analytics to continuously monitor the log for any log event that requires attention or human intervention. The system can be programmed to automatically issue alerts when certain events take place or certain conditions are or are not met.

Reporting: Finally, the log management system should provide a streamlined report of all events as well as an intuitive interface that the log analyzer can use to get additional information from the log.

The Limitations of Indexing

Many log management software solutions rely on indexing to organize the log. While this was considered an effective solution in the past, indexing can be a very computationally expensive activity, causing latency between data entering a system and then being included in search results and visualizations. As the speed at which data is produced and consumed increases, this is a limitation that could have devastating consequences for organizations that need real-time insight into system performance and events.

Further, with index-based solutions, search patterns are also defined based on what was indexed. This is another critical limitation, particularly when an investigation is needed and the available data can’t be searched because it wasn’t properly indexed.

Leading solutions offer free-text search, which allows the IT team to search any field in any log. This capability helps improve the speed at which the team can work without compromising performance.

Log Analysis Methods

Given the massive amount of data being created in today’s digital world, it has become impossible for IT professionals to manually manage and analyze logs across a sprawling tech environment. As such, they require an advanced log management system and techniques that automate key aspects of the data collection, formatting and analysis processes.

These techniques include:

  • Normalization. Normalization is a data management technique that ensures all data and attributes, such as IP addresses and timestamps, within the transaction log are formatted in a consistent way.
  • Pattern recognition. Pattern recognition refers to filtering events based on a pattern book in order to separate routine events from anomalies.
  • Classification and tagging. Classification and tagging is the process of tagging events with key words and classifying them by group so that similar or related events can be reviewed together.
  • Correlation analysis. Correlation analysis is a technique that gathers log data from several different sources and reviews the information as a whole using log analytics.
  • Artificial ignorance. Artificial ignorance refers to the active disregard for entries that are not material to system health or performance.

Log Analysis Use Case Examples

Effective log analysis has use cases across the enterprise. Some of the most useful applications include:

  • Development and DevOps. Log analysis tools and log analysis software are invaluable to DevOps teams, as they require comprehensive observability to see and address problems across the infrastructure. Further, because developers are creating code for increasingly complex environments, they need to understand how code impacts the production environment after deployment. An advanced log analysis tool will help developers and DevOps organizations easily aggregate data from any source to gain instant visibility into their entire system. This allows the team to identify and address concerns, as well as seek deeper information.
  • Security, SecOps and Compliance. Log analysis increases visibility, which grants cybersecurity, SecOps and compliance teams continuous insights needed for immediate actions and data-driven responses. This in turn helps strengthen the performance across systems, prevent infrastructure breakdowns, protect against attacks and ensure compliance with complex regulations. Advanced technology also allows the cybersecurity team to automate much of the log file analysis process and set up detailed alerts based on suspicious activity, thresholds or logging rules. This allows the organization to allocate limited resources more effectively and enable human threat hunters to remain hyper-focused on critical activity.
  • Information Technology and ITOps. Visibility is also important to IT and ITOps teams as they require a comprehensive view across the enterprise in order to identify and address concerns or vulnerabilities. For example, one of the most common use cases for log analysis is in troubleshooting application errors or system failures. An effective log analysis tool allows the IT team to access large amounts of data to proactively identify performance issues and prevent interruptions.

Log Analysis Solutions From Humio

Humio is purpose-built to help any organization achieve the benefits of large-scale logging and analysis. The Humio difference:

  • Virtually no latency regardless of ingestion, even in the case of data bursts
  • Index-free logging that enables full search of any log, including metrics, traces and any other kind of data
  • Real-time data streaming and streaming analytics with an in-memory state machine
  • Ability to join datasets and create a joint query that searches multiple data sets for enriched insights
  • Easily configured, sharable dashboards and alerts power live system visibility across the organization
  • High data compression to reduce hardware costs and create more storage capacity, enabling both more detailed analysis and traceability over longer time periods

Additional Resources

CrowdStrike Falcon’s Autonomous Detection and Prevention Wins Best EDR Award and Earns Another AAA Rating in SE Labs Evaluations

19 November 2021 at 09:06
  • CrowdStrike wins the prestigious SE Labs “Best Endpoint Detection and Response” 2021 award. 
  • This marks CrowdStrike’s second consecutive year winning Best EDR from SE Labs, the highly regarded independent testing organization, based on stellar EDR performance and testing results observed over the past 12 months.
  • Earlier this week, CrowdStrike once again earned the highest AAA rating in the SE Labs Enterprise Endpoint Protection, Q3 2021 report, achieving detection scores of 99% total accuracy and 100% legitimate accuracy.
  • This is the 12th AAA rating in EPP for the CrowdStrike Falcon® platform, dating back to March 2018.

CrowdStrike Falcon has been named best Endpoint Detection and Response, winning the award for the second time since independent third-party testing organization SE Labs first introduced it in 2020. The achievement speaks directly to Falcon’s outstanding automated detection and prevention capabilities in tracking elements of sophisticated attack chains and protecting customers from breaches.

CrowdStrike also received a new AAA rating from SE Labs in its recent Endpoint Protection report, demonstrating consistent achievements in SE Labs testing in terms of automated protection and remediation capabilities using on-sensor indicators of attack (IOAs) and machine learning. This latest achievement underscores our commitment to transparency and constant improvement of our capabilities. 

The Falcon platform achieved a 99% Total Accuracy rating in protecting against both in-the-wild commodity threats and targeted attacks, according to the recent Q3 SE Labs Enterprise Endpoint Protection report. In this evaluation, CrowdStrike, a next-generation cloud endpoint detection and response (EDR) vendor, outperformed legacy vendors such as Microsoft, Symantec and McAfee. Falcon achieved outstanding test results, with CrowdStrike placing among the top three vendors in overall final score, nearly tied with the other top solutions tested.

Regularly participating in independent third-party tests drives us to build relevant, meaningful and valuable capabilities that can protect against sophisticated adversaries and threats as well as commodity malware. 

Falcon Once Again Wins Highest AAA Ranking from SE Labs

In the latest report, CrowdStrike Falcon was awarded the highest AAA rating, speaking to Falcon’s capability of automated detection and protection against sophisticated adversaries and unrelenting effectiveness in neutralizing and blocking threats.

SE Labs testing aims to offer a complete view of the capabilities of endpoint security solutions by using common attack tools typical of early stages of attempted breaches and in-the-wild commodity malware that is representative of the current threat landscape. CrowdStrike Falcon has consistently participated in SE Labs testing, with an excellent track record of AAA ratings in SE Labs Enterprise Endpoint Protection reports dating back to March 2018. This marks the 12th time Falcon has been awarded an impressive AAA rating in Enterprise Endpoint Protection evaluations from SE Labs and the third time in 2021. 

Testing scenarios for detection and protection from general threats involved the ability to accurately identify web-based threats, such as URLs that attackers commonly use to trick users into downloading threats or executing malicious scripts. Identifying and blocking exploits and accurately identifying legitimate applications are also part of the testing scenario, with CrowdStrike Falcon achieving an AAA award with 99% Total Accuracy and 100% Legitimate Accuracy rating. False positives generated by incorrectly identifying legitimate applications and websites as malicious can create serious disruptions in business operations. A 100% legitimate accuracy rating means businesses will spend less time, effort and money on remediating false positives and bringing systems back into production. 

Testing every layer of detection and protection against typical stages of an attack employed by sophisticated adversaries measures how the security solution responds to each stage of the attack. CrowdStrike Falcon achieved a 99 Protection Score, which reflects the overall level of protection across multiple attack stages. This SE Labs score assesses the ability to protect systems by detecting, blocking or neutralizing threats based on how severe the outcomes of an attack could be. 

Products that detect and neutralize threats during the early stages of an attack are rated better and will protect systems from sophisticated threats. Conversely, the test severely penalizes security software that blocks legitimate applications, creating false positives. Blocking threats early in the attack chain enabled CrowdStrike Falcon to achieve excellent results in automatically detecting and protecting against incidents.

CrowdStrike Falcon Testing Achievements

By repeatedly participating in independent third-party cybersecurity testing, CrowdStrike demonstrates transparency in Falcon capabilities, and public results serve as a track record for validating consistency in automated protection and remediation. Since there is no single independent third-party test to determine an industry leader, Falcon’s capabilities are validated by our ongoing participation in tests and evaluations from leading organizations, and by obtaining verifiable and repeatable detection and protection results. 

Falcon has demonstrated a superior track record for participating and excelling in third-party independent tests, with consistent results in terms of automated protection and remediation capabilities. For example, CrowdStrike was named a strategic leader in AV-Comparatives Endpoint Protection and Response tests and a leader in the Gartner Magic Quadrant for Endpoint Protection Platforms (EPP). With awards and certifications from leading testing organizations including AV-Comparatives, SE Labs and MITRE, CrowdStrike remains fully committed to supporting independent third-party efforts.

While these are only a handful of achievements, CrowdStrike has never been more committed to its mission to stop breaches.

Additional Resources
