Mean Time to Repair (MTTR) Explained

23 November 2021 at 08:30

This blog was originally published Oct. 28, 2021 on humio.com. Humio is a CrowdStrike Company.

Definition of MTTR

Mean time to repair (MTTR) is a key performance indicator (KPI) that represents the average time required to restore a system to functionality after an incident. MTTR is used along with other incident metrics to assess the performance of DevOps and ITOps, gauge the effectiveness of security processes, evaluate the effectiveness of security solutions, and measure the maintainability of systems.

Service level agreements with third-party providers typically set expectations for MTTR, although repair times are not guaranteed because some incidents are more complex than others. Along the same lines, comparing the MTTR of different organizations is not fruitful because MTTR is highly dependent on unique factors relating to the size and type of the infrastructure and the size and skills of the ITOps and DevOps teams. Every business has to determine which metrics will best serve its purposes and how it will put them into action in its unique environment.

Difference Between Common Failure Metrics

Modern enterprise systems are complicated, and they can fail in numerous ways. For these reasons, there is no one set of incident metrics every business should use, but there are many to choose from, and the differences can be nuanced.

Mean Time to Detect (MTTD)

Also called mean time to discover, MTTD is the average time between the beginning of a system failure and its detection. As a KPI, MTTD is used to measure the effectiveness of the tools and processes used by DevOps teams.

To calculate MTTD, select a period of time, such as a month, and track the time between the beginning of each system outage and its discovery. Then add up the total time and divide it by the number of incidents to find the average. MTTD should be low. If detection times are trending upward, conduct an immediate review of the existing incident response management tools and processes.
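The MTTD calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `mean_time_to_detect` and the sample timestamps are assumptions made up for the example.

```python
from datetime import datetime, timedelta


def mean_time_to_detect(incidents):
    """Average gap between the start of a failure and its detection.

    `incidents` is a list of (started_at, detected_at) datetime pairs
    collected over the chosen period, such as one month.
    """
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (detected - started for started, detected in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical incidents recorded during one month.
incidents = [
    (datetime(2021, 10, 3, 9, 0), datetime(2021, 10, 3, 9, 20)),     # 20 min
    (datetime(2021, 10, 12, 14, 0), datetime(2021, 10, 12, 14, 5)),  # 5 min
    (datetime(2021, 10, 25, 23, 30), datetime(2021, 10, 26, 0, 5)),  # 35 min
]

mttd = mean_time_to_detect(incidents)
print(mttd)  # 0:20:00
```

Running the same calculation for each month makes an upward trend easy to spot.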

Mean Time to Identify (MTTI)

This measurement tracks the number of business hours between the moment an alert is triggered and the moment the cybersecurity team begins to investigate that alert. MTTI is helpful in understanding if alert systems are effective and if cybersecurity teams are staffed to the necessary capacity. A high MTTI or an MTTI that is trending in the wrong direction can be an indicator that the cybersecurity team is suffering from alert fatigue.

Mean Time to Recovery (MTTR)

Mean time to recovery is the average time it takes in business hours between the start of an incident and the complete recovery back to normal operations. This incident metric is used to understand the effectiveness of the DevOps and ITOps teams and identify opportunities to improve their processes and capabilities.

Mean Time to Resolve (MTTR)

Mean time to resolve is the average time from the first alert through the post-incident analysis, including the time spent ensuring the failure will not recur. It is measured in business hours.

Mean Time Between Failures (MTBF)

Mean time between failures is a key performance metric that measures system reliability and availability. ITOps teams use MTBF to understand which systems or components are performing well and which need to be evaluated for repair or replacement. Knowing MTBF enables preventative maintenance, minimizes reactive maintenance, reduces total downtime and enables teams to prioritize their workload effectively. Historical MTBF data can be used to make better decisions about scheduling maintenance downtime and resource allocation.

MTBF is calculated by tracking the number of hours that elapse between system failures in the ordinary course of operations over a period of time and then finding the average.
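As a rough sketch of that calculation, the Python below averages the operating hours between consecutive failures. It assumes each gap runs from the moment one outage is restored to the moment the next one begins, which is one common convention. The function name and the sample outages are hypothetical.

```python
from datetime import datetime


def mean_time_between_failures(outages):
    """Average operating hours between consecutive failures.

    `outages` is a chronologically sorted list of
    (failed_at, restored_at) datetime pairs.
    """
    if len(outages) < 2:
        raise ValueError("need at least two failures to measure a gap")
    gaps = [
        (next_failed - restored).total_seconds() / 3600
        for (_, restored), (next_failed, _) in zip(outages, outages[1:])
    ]
    return sum(gaps) / len(gaps)


# Hypothetical outages for one system.
outages = [
    (datetime(2021, 10, 1, 8, 0), datetime(2021, 10, 1, 10, 0)),
    (datetime(2021, 10, 5, 10, 0), datetime(2021, 10, 5, 11, 0)),    # 96 h of uptime before this failure
    (datetime(2021, 10, 11, 11, 0), datetime(2021, 10, 11, 12, 0)),  # 144 h of uptime before this failure
]

mtbf_hours = mean_time_between_failures(outages)
print(mtbf_hours)  # 120.0
```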

Mean Time to Failure (MTTF)

Mean time to failure is a way of looking at uptime vs. downtime. Unlike MTBF, an incident metric that focuses on repairability, MTTF focuses on failures that cannot be repaired. It is used to predict the lifespan of systems. MTTF is not a good fit for every system. For example, systems with long lifespans, such as core banking systems or many industrial control systems, are not good subjects for MTTF metrics because they have such a long lifespan that when they are finally replaced, the replacement will be an entirely different type of system due to technological advances. In cases like that, MTTF is moot.

Conversely, tracking the MTTF of systems with more typical lifespans is a good way to gain insight into which brands perform best or which environmental factors most strongly influence a product's durability.

MTTR is intended to reduce unplanned downtime and shorten breakout time. But its use also supports a better culture within ITOps teams. When incidents are repaired before users are impacted, DevOps and ITOps are seen as efficient and effective. Resilient system design is encouraged because when DevOps knows its performance will be measured by MTTR, the team will build apps that can be repaired faster, such as apps composed of discrete web services so that one service failure will not crash the entire app. MTTR, when done properly, includes post-incident analysis, which should inform a feedback loop that leads to better software builds in the future and encourages fixing bugs early in the SDLC.

How to Calculate Mean Time to Repair

The MTTR formula is straightforward: Simply add up the total unplanned repair time spent on a system within a certain time frame and divide the results by the total number of relevant incidents.

For example, if you have a system that fails four times in one workday and you spend a total of one hour repairing those failures, your MTTR would be 15 minutes (60 minutes / 4 = 15 minutes).
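The same arithmetic can be written as a short Python sketch. It assumes the four failures together took 60 minutes of unplanned repair time. The individual durations and the function name are made up for illustration.

```python
def mean_time_to_repair(repair_minutes):
    """Total unplanned repair time divided by the number of incidents."""
    if not repair_minutes:
        raise ValueError("need at least one incident")
    return sum(repair_minutes) / len(repair_minutes)


# Four failures in one workday, 60 minutes of repair time in total.
repair_minutes = [10, 20, 5, 25]

mttr = mean_time_to_repair(repair_minutes)
print(mttr)  # 15.0
```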

However, not all outages are equal. The time spent repairing a failed component or a customer-facing system that goes down during peak hours is more expensive in terms of lost sales, productivity or brand damage than time spent repairing a non-critical outage in the middle of the night. Organizations can establish an "error budget" that specifies that each minute spent repairing the most impactful systems is worth an hour spent repairing less impactful ones. This level of granularity will help expose the true costs of downtime and provide a better understanding of what MTTR means to the particular organization.
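One way to picture this weighting is the sketch below, which counts each repair minute on a high-impact system as 60 minutes on a low-impact one. The tier names, the `IMPACT_WEIGHTS` table and the sample repairs are assumptions chosen for the example, and each organization would set its own weights.

```python
# Hypothetical weights: one minute on a critical system counts as
# one hour (60 minutes) on a non-critical system.
IMPACT_WEIGHTS = {"critical": 60, "non_critical": 1}


def weighted_repair_minutes(repairs):
    """Sum repair minutes after scaling each by its impact weight.

    `repairs` is a list of (tier, minutes) pairs.
    """
    return sum(minutes * IMPACT_WEIGHTS[tier] for tier, minutes in repairs)


repairs = [
    ("critical", 5),        # customer-facing outage at peak hours
    ("non_critical", 120),  # overnight batch job
    ("non_critical", 30),   # internal tool
]

raw_minutes = sum(minutes for _, minutes in repairs)
weighted_minutes = weighted_repair_minutes(repairs)
print(raw_minutes)       # 155
print(weighted_minutes)  # 450
```

In this made-up data set, the five-minute critical outage accounts for two thirds of the weighted total even though it is a small fraction of the raw repair time.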

How to Reduce MTTR

There are three elements to reducing MTTR:

  1. Manage resolution process. The first is a defined strategy for managing the resolution process, which should include a post-incident analysis to capture lessons learned.
  2. Build defenses. Technology plays a crucial role, of course, and the best solution will provide visibility, monitoring and corrective maintenance to help root out problems and build defenses against future attacks.
  3. Mitigate the incident. Lastly, the skills necessary to mitigate the incident have to be available.

MTTR can be reduced by increasing budget or headcount, but that isn't always realistic. Instead, deploy artificial intelligence (AI) and machine learning (ML) to automate as much of the repair process as possible. Those steps include rapid detection, minimization of false positives, smart escalation, and automated remediation that includes workflows that reduce MTTR.

MTTR can be a helpful metric to reduce downtime and streamline your DevOps and ITOps teams, but improving it shouldn't be the end goal. After all, the point of using metrics is not simply improving numbers but, in this instance, the practical matter of keeping systems running and protecting the business and its customers. Use MTTR in a way that helps your teams protect customers and optimize system uptime.

Improve MTTR With a Modern Log Management Solution

Logs are invaluable for any kind of incident response. Humio's platform enables complete observability for all streaming logs and event data to help IT organizations better prepare for the unknown and quickly find the root cause of any incident.

Humio leverages modern technologies, including data streaming, index-free architecture and hybrid deployments, to optimize compute resources and minimize storage costs. Because of this, Humio can collect structured and unstructured data in memory to make exploring and investigating data of any size blazing fast.

Humio Community Edition

With a modern log management platform, you can monitor and improve your MTTR. Try it out at no cost!

Introduction to the Humio Marketplace

18 November 2021 at 08:56

This blog was originally published Oct. 11, 2021 on humio.com. Humio is a CrowdStrike Company.

Humio is a powerful and super flexible platform that allows customers to log everything and answer anything. Users can choose how to ingest their data and choose how to create and manage their data with Humio. The goal of Humio's marketplace is to provide a variety of packages that power our customers with faster and more convenient ways to get more from their data across a variety of use cases.

What is the Humio Marketplace?

The Humio Marketplace is a collection of prebuilt packages created by Humio, partners and customers that Humio customers can access within the Humio product interface.

These packages are relevant to popular log sources and typically contain a parser and some dashboards and/or saved queries. The package documentation includes advice and guidance on how to best ingest the data into Humio to start getting immediate value from logs.

What is a package?

The Marketplace contains prebuilt packages that are essentially YAML files that describe the Humio assets included in the package. A package can include any or all of: a parser, saved searches, alerts, dashboards, lookup files and labels. The package also includes YAML files for the metadata of the package (such as descriptions and tags, support status and author), and a README file which contains a full description and explanation of any prerequisites, etc.

Packages can be configured as either a Library package, which means that once installed, the assets are available as templates to build from, or an Application package, which means that once installed, the assets are instantiated and live immediately.

By creating prebuilt content that is quick and simple to install, we want to make it easier for customers to onboard new log sources to Humio to quickly get value from that data. With this prebuilt content, customers won't have to work out the best way of ingesting the logs and won't have to create parsers and dashboards from scratch.

How do I make a package?

Packages are a great way to reduce manual work, whether that's taking advantage of prebuilt packages or making your own packages so you don't have to start new processes from scratch.

Anyone can create a Humio package straight from Humio's interface. We actively encourage customers and partners to create packages and submit those packages for inclusion in the Marketplace if they think they could benefit other customers. Humio will work with package creators to make sure the package meets our standards for inclusion in the Marketplace. By sharing your package with all Humio customers through the Marketplace, you are strengthening the community and allowing others to benefit from your expertise while you, likewise, benefit from others' expertise.

For some customers, the package will be exactly what they want, but for others, it will be a useful starting point for further customization. All Humio packages are provided under an Apache 2.0 license, so customers are free to adapt and reuse the package as needed.

If I install a package, will it get updated?

Package creators can develop updates in response to changes in log formats or to introduce new functionality and improvements. Updates will be advertised as available in the Marketplace and users can choose to accept the update. The update process will check to see if any local changes have been made to assets installed from the package and, if so, will prompt the user to either overwrite the changes with the standard version from the updated package or to keep the local changes.
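The local-change check can be pictured with the Python sketch below. It is purely illustrative and does not describe Humio's actual implementation. It assumes a content hash was recorded for each asset when the package was installed and compares it with the asset's current content. All names, including `plan_update`, are hypothetical.

```python
import hashlib


def content_hash(text):
    """SHA-256 fingerprint of an asset's text content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_update(installed_hashes, current_assets, updated_assets):
    """Decide what to do with each asset in an updated package.

    Returns a dict mapping asset name to one of:
      "install" - asset is new in the update
      "update"  - asset is unchanged locally, safe to replace
      "prompt"  - asset was edited locally, ask the user
    """
    plan = {}
    for name in updated_assets:
        local = current_assets.get(name)
        if local is None:
            plan[name] = "install"
        elif content_hash(local) == installed_hashes.get(name):
            plan[name] = "update"
        else:
            plan[name] = "prompt"
    return plan


# Hypothetical state: the dashboard was edited after installation.
installed_hashes = {
    "parser": content_hash("v1 parser"),
    "dashboard": content_hash("v1 dashboard"),
}
current_assets = {
    "parser": "v1 parser",
    "dashboard": "v1 dashboard, edited locally",
}
updated_assets = {
    "parser": "v2 parser",
    "dashboard": "v2 dashboard",
    "alert": "v2 alert",
}

plan = plan_update(installed_hashes, current_assets, updated_assets)
print(plan)  # {'parser': 'update', 'dashboard': 'prompt', 'alert': 'install'}
```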

Are packages free?

Yes, all Humio packages in the Marketplace are free to use!

Can I use packages to manage my own private Humio content?

Absolutely! Packages are a convenient way for customers to manage their own private Humio content. Packages can be created in the Humio product interface and can be downloaded as a ZIP file and uploaded into a different Humio repository or a different instance of Humio (cloud or hybrid). Customers can also store their Humio packages in a code repository and use their CI/CD tools and the Humio API to deploy and manage Humio assets as they would their own code. This streamlines Humio support and operations and delivers a truly agile approach to log management.

Get started today

Getting started with packages is simple. All you need is access to a Humio Cloud service or, if you are running Humio self-hosted, version 1.21 or later. To create and install packages, you need the "Change Packages" permission assigned to your Humio user role.

Access the Marketplace from within the Humio product UI (go to Settings, Packages, then Marketplace to browse the available packages or to create your own package). Try creating a package and uploading it to a different repository. If you create a nice complex dashboard and want to recreate it in a different repository, you know what to do: create a package, export it and import it, and then you don't need to spend time recreating it!

Let us know what else you want to see in the Marketplace by connecting with us at The Nest or emailing [email protected].
