- The CrowdStrike Falcon® platform takes full advantage of the power of the CrowdStrike Security Cloud to reduce high-cost false positives and maximize detection efficacy to stop breaches
- CrowdStrike continuously explores novel approaches to improve machine learning automated detection and protection capabilities for Falcon customers
- CrowdStrike’s cloud-based machine learning model automation can run predictions on 500,000 feature vectors every second, covering 10TB of files per second, to find detections
At CrowdStrike, we combine cloud scale with machine learning expertise to improve the efficacy of our machine learning models. One method for achieving that involves scanning massive numbers of files that we may not even have in our sample collections before we release our machine learning models. This prerelease scan allows us to maximize the efficacy of our machine learning models while minimizing the negative impact of new or updated model releases.
It’s important to understand that machine learning models take over when discrete algorithms fall short. CrowdStrike machine learning does an excellent job of creating models that can detect impactful in-the-wild novel threats like NotPetya, BadRabbit or HermeticWiper along with other malware families. CrowdStrike’s comprehensive detection capabilities have been consistently validated in independent third-party testing from leading organizations including AV-Comparatives. However, machine learning views the world through probabilities, and those probabilities can make it difficult to predict or explain why an incorrect detection was made.
Incorrect detections, also known as false positives, are a concern with any endpoint security solution and exacerbate the ongoing skills shortage most organizations face. Any incorrect assessment of a clean file as malicious can immediately trigger remediation procedures that can take down services, disrupt workflows and distract analysts from hunting down legitimate threats. However, not all false positives are created equal: the cost of any mistake must be weighed against the benefit of correct detections. CrowdStrike has implemented novel solutions to the false positive predicament.
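To make the cost asymmetry concrete, here is a minimal sketch of choosing a detection threshold that weighs false positives against missed detections. The cost values, scores and function name are hypothetical illustrations, not CrowdStrike's actual tuning process:

```python
def choose_threshold(scores_clean, scores_dirty, fp_cost, fn_cost):
    """Pick the score cutoff that minimizes expected cost.

    scores_clean / scores_dirty: model probabilities for labeled samples.
    fp_cost / fn_cost: hypothetical per-mistake costs; a high fp_cost
    reflects the operational damage of quarantining a clean file.
    """
    candidates = sorted(set(scores_clean) | set(scores_dirty))
    best_t, best_cost = 1.0, float("inf")
    for t in candidates:
        fps = sum(s >= t for s in scores_clean)   # clean files flagged dirty
        fns = sum(s < t for s in scores_dirty)    # dirty files missed
        cost = fps * fp_cost + fns * fn_cost
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t

# With false positives costed 10x higher than misses, the chosen
# threshold shifts upward to avoid flagging clean files.
threshold = choose_threshold([0.1, 0.2, 0.6], [0.7, 0.9], fp_cost=10, fn_cost=1)
```

In practice the cost of each mistake depends on file prevalence and the affected environment, which is exactly why prevalence data (discussed below) matters.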
Clean or Dirty: Know the Difference
One approach involves accumulating billions of files in our cloud. These files come from various sources, ranging from protected environments to public malware collections, at a rate of approximately 86 million new hashes a day. The collection includes malicious code, clean code and unwanted code, such as potentially unwanted programs.
To build our machine learning models, we carefully curate both clean and “dirty” (i.e., malicious) samples from this collection, resulting in a labeled collection that is growing by tens of millions of new examples every training cycle.
Extract the Right Features
To ensure the quality of the resulting models, we also gather the most interesting files from live environments. While some customers use the Falcon platform to share files with us so we can improve our coverage capabilities, others keep their files in-house for a variety of reasons. As a consequence, to build an effective model, we must ensure that it performs well both on in-house files not shared with us and on those that have been shared. However, to train a machine learning model, each file must first be reduced to a long list of transformed numeric values, called a feature vector, that represents various properties of the file.
As humans, we learn to use our senses to extract features from the surrounding environment and then infer probable outcomes based on past experience. For example, if it’s cloudy outside and there’s a damp breeze, we infer there’s a high chance of rain and we need to grab an umbrella. In this case, cloudy and damp can be considered data points that form part of the feature vector describing the chances of rain.
Of course, the feature list for files contains thousands of decimal numbers that humans can’t read but our artificial intelligence (AI) understands. That feature vector is uploaded to the cloud by the Falcon sensor, making it possible for us to observe what a new model would say about the underlying file by running predictions over that stored feature vector.
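The idea of reducing a file to numbers can be sketched in a few lines. The features below (size, byte entropy, null-byte fraction, a PE magic check) are hypothetical stand-ins chosen for illustration; a production model uses thousands of far more sophisticated features:

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0-8.0).
    High entropy often indicates packed or encrypted content."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(data).values())

def extract_features(data: bytes) -> list[float]:
    """Reduce a file's raw bytes to a fixed-length numeric feature vector.
    These four features are illustrative only."""
    size = len(data)
    return [
        float(size),                            # file size in bytes
        shannon_entropy(data),                  # byte-level entropy
        data.count(0) / max(size, 1),           # fraction of null bytes
        float(data[:2] == b"MZ"),               # starts with PE 'MZ' magic?
    ]

# A toy "file": PE magic followed by padding.
vec = extract_features(b"MZ" + b"\x00" * 98)
```

Only this compact vector, not the file itself, needs to travel to the cloud, which is what makes predictions over unshared in-house files possible.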
Figure 1. This flow describes how feature vectors and metadata are sent to the CrowdStrike Security Cloud and used against our machine learning model to help build better predictions.
Returning to the rain example, the feature vector with the two data points of cloudy and damp is assessed against what we know from experience to be signs of rain. If our experience has taught us that these two particular data points have a high probability of describing chances of rain, then we grab an umbrella. Otherwise, we assess the chance of rain as low. Much like machine learning models, it comes down to how well we are trained in spotting and recognizing signs of rain.
Measure Efficacy, Get It Right!
The same file feature vector can also be combined with additional information, such as file prevalence data, contained within our Security Cloud. This means we can virtually scan all prevalent files in protected environments to measure efficacy and test for false positives.
The results of this virtual scan are important for a number of reasons. First, it enables us to identify important files that will have a high impact in the next model release. Second, we can minimize potential high-cost false positives prior to deployment. Finally, this information is used to train future models.
For example, based on a prevalence threshold, we advance our scan to include all files found on a significant number of devices. We then consider all of our detections. Incorrect detections are resolved in the cloud, which replies to our sensors to prevent future detections, and the files that triggered them are included in the next retrain of our model. Correct detections, on the other hand, are added both to our cloud for immediate detection and to the files used in training our models in the future.
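The triage loop described above can be sketched as follows. The threshold value, function names and the analyst oracle are hypothetical simplifications of the real workflow:

```python
PREVALENCE_THRESHOLD = 1_000  # hypothetical: files seen on this many devices

def triage_scan(files, predict, is_actually_clean):
    """Virtually scan prevalent feature vectors and sort the detections.

    files: iterable of (sha256, prevalence, feature_vector)
    predict: candidate model; returns True if the vector looks malicious
    is_actually_clean: ground truth (analyst review / sample analysis)
    """
    suppress, block = set(), set()      # cloud replies sent to sensors
    clean_train, dirty_train = [], []   # labeled data for the next retrain
    for sha256, prevalence, vec in files:
        if prevalence < PREVALENCE_THRESHOLD or not predict(vec):
            continue  # not prevalent enough, or not detected
        if is_actually_clean(sha256):
            suppress.add(sha256)        # resolve the false positive
            clean_train.append(vec)     # teach the next model it is clean
        else:
            block.add(sha256)           # enable immediate cloud detection
            dirty_train.append(vec)     # reinforce the correct detection
    return suppress, block, clean_train, dirty_train

# Toy run: one false positive, one true positive, two skipped files.
files = [
    ("a", 5_000, [1.0]),  # prevalent, detected, actually clean  -> suppress
    ("b", 5_000, [9.0]),  # prevalent, detected, actually dirty  -> block
    ("c", 10,    [9.0]),  # below prevalence threshold           -> skipped
    ("d", 5_000, [0.0]),  # prevalent but not detected           -> skipped
]
suppress, block, ct, dt = triage_scan(
    files,
    predict=lambda v: v[0] >= 1.0,
    is_actually_clean=lambda h: h == "a",
)
```

The key property is that every mistake the candidate model makes on prevalent files is caught before release and converted into training data, so the next model cycle starts from a stronger baseline.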
Again, returning to the rain example, this virtual scan is like checking multiple weather forecasting websites as soon as we have the two signs — cloudy and damp — before leaving the house with or without an umbrella. Some of those websites may be correct in predicting rain, others may not, but the next time it’s cloudy and damp we will know which websites are reliable before we go outside and risk being caught in the rain without an umbrella.
CrowdStrike’s Automated Cloud-Based Machine Learning Model Maximizes Efficacy
While CrowdStrike analysts inspect millions of files, the number of files detected as malicious is small enough that they can be analyzed by hand. Because our analysts and processes work better on samples that we have, rather than on information about samples, we start our analysis with those detections we can also find in our massive sample store.
Using feature vectors, the Falcon platform enables us to know quite a bit about the files we don’t have, and also allows us to use the power of the cloud to enhance detection or resolve incorrect detections of files not contained in our sample store.
Comparing global virtual scans of prevalent files against all of our static detection models is critical in pushing the accuracy and efficacy of our machine learning models to help secure our customers and stop breaches.
In essence, the power of the Falcon platform lies in its ability to take full advantage of the massive data fabric we call the CrowdStrike Security Cloud, which correlates trillions of security events from protected endpoints with threat intelligence and enterprise telemetry. The Falcon platform uses machine learning and AI to automate and maximize the efficacy of detecting and protecting against threats, to stop breaches.