In Tenable.io, we are heavy users of Datadog custom metrics. Millions of metrics are sent through Dogstatsd, providing deep insights into the complex platform. As the platform grew, we found that a significant number of metrics sent by legacy apps were obsolete. We tried to hunt down these obsoleted metrics in the codebase, but modifying legacy applications was extremely time consuming and risky.
To address this, we deployed a StatsD filter as a Datadog agent sidecar to filter out unnecessary metrics. The filter is a simple UDP datagram forwarder written in Node.js (sample, not actual code). We chose Node.js because in our environment, its network performance outstripped other languages that equalled its speed to production. We were able to implement, test and deploy this change across all of the T.io platform within a week.
While this worked for many months, performance issues began to crop up. As the platform continued to scale up, we were sending more and more metrics through the filter. During the first quarter of 2021, we added over 1.4 million new metrics as an effort to improve our observability. The filters needed more CPU resources to keep up with the new metrics. At this scale, even a minor inefficiency can lead to large wastage. Over time, we were consuming over 1000 CPUs and 400GB of memory on these filters. The overhead had become unacceptable.
We analyzed the performance metrics and decided to rewrite the filter in a more efficient language. We chose Rust for its high performance and safety characteristics. (See our other post on Rust evaluations) The source code of the new Rust-based filter is available here.
The Rust-based filter is much more efficient than the original implementation. With the ability to fully manage the heap allocations, Rust’s memory allocation for handling each datagram is kept to a minimum. This means that the Rust-based filter only needs a few MB of memory to operate. As a result, we saw a 75% reduction in CPU usage and a 95% reduction in memory usage in production.
In addition to reclaiming compute resources, the latency per packet has also dropped by over 50%. While latency isn’t a key performance indicator for this application, it is rewarding to see that we are running twice as fast for a fraction of the resources.
With this small change, we were able to optimize away over 700 CPU and 300GB of memory. This was all implemented, tested and deployed in a single sprint (two weeks). Once the new filter was deployed, we were able to confirm the resource reduction in Datadog metrics.
Since the inception of Tenable.io, keeping up with data pressure has been a continuous challenge. This data pressure comes from two dimensions: the growth of the customer base and the growth of usage from each customer. This challenge has been most notable in Elasticsearch, since it is one of the most important stages in our petabytes-scale SaaS pipeline.
When customers run vulnerability scans, the Nessus scanners upload the scan data to Tenable.io. There, the data is broken down into documents detailing vulnerability information, including data such as asset information and cyber exposure details. These documents are then aggregated into an Elasticsearch index. However, when the index reached the scale of hundreds of nodes per cluster, the team discovered that further horizontal scaling would affect overall stability. We would encounter more hot shard problems, leading to uneven load across the index and affecting the user experience. This post will detail the re-architecture that both solved this scaling problem and achieved massive performance improvements for our customers.
Incremental Scaling from Site to Cell
Each point of presence, called a site, contains a multi-tenant Elasticsearch instance to be used by geographically similar customers. As data pressure increases, however, horizontal scaling will cause instability, which will in turn cause instability at the site level.
To overcome this challenge, our overall strategy was to break out the site-wide (monolith) Elasticsearch cluster into multiple smaller, more manageable clusters. We call these smaller clusters cells. The rule is simple: If a customer has over 100 million documents, they will be isolated into their own cell. Smaller customers will be moved to one of the general population (GP) clusters. We came up with a technique to achieve zero downtime migration with massive performance gains.
Request Routing and Backfill
To achieve a zero downtime migration, we implemented two key pieces of software:
An Elasticsearch proxy that can:
Transparently proxy any Elasticsearch request to any Elasticsearch cluster
Intelligently tee any write request (e.g. Index, Bulk) to one or more clusters
Using parallelized scrolls, read the Spark dataframes from the monolith cluster
Map the Spark dataframes from the monolith cluster directly to the cell-based cluster
To start, we reconfigured micro-services to communicate with Elasticsearch through the proxy. Based on the targeted customer (more on this later), the proxy performed dual write to the old monolith cluster and the new cell-based cluster. Once the dual write began, all new documents started flowing to the new cell cluster. For all older documents, we ran a Spark job to pull old data from the monolith cluster to the new cell cluster. Finally, after the Spark job completed, we cut all new queries over to the new cell cluster.
With the cell architecture, we see a future where migrating customers from one Elasticsearch cluster to another is a common event. Customers in a multi-tenant cluster can easily outgrow the cluster’s capacity over time and require migration to other clusters. In addition, we need to reindex the data from time to time to adjust immutable settings (e.g. shard count). With this in mind, we want to make sure this type of migration is completely transparent to all the micro services. This is why we built a proxy to encapsulate all customer routing logic such that all data allocation is completely transparent to client services.
For the proxy to be able to route requests to the correct Elasticsearch clusters, it needs the customer ID to be sent along with each request. To achieve this, we injected a X-CUSTOMER-ID HTTP header in each search and index request. The proxy inspected the X-CUSTOMER-ID header in each request, looked up the customer to cluster mapping, and forwarded the request on to the correct cluster.
While search and index requests always target a single customer, a bulk request contains a large number of documents for numerous customers. A single X-CUSTOMER-ID HTTP header would not provide sufficient routing information for the request. To overcome this, we found an interesting hack in Elasticsearch.
A bulk request body is encoded in a newline-delimited JSON (NDJSON) structure. Each action line is an operation to be performed on a document. This is an example directly copied from Elasticsearch documentation:
We found that within an action line, you can append any amount of metadata to the line as long as it is outside the action body. Elasticsearch seems to accept the request and ignore the extra content with no side effects (verified with ES2 to ES7). With this technique, we modified all clients of the Summary index to append customer IDs to every action.
With this modification, the proxy has enough information to break down a bulk request into subrequests for each customer.
To backfill old data after dual writes were enabled, we used AWS EMR with the elasticsearch-hadoop SDK to perform parallel scrolls against every shard from the source index. As Spark retrieves the data in the Resilient Distributed Dataset (RDD) format, the same RDD can be written directly to the destination index. Since we’re backfilling old data, we want to make sure we don’t overwrite anything that’s already been written. To accomplish this, we set es.write.operation to “create”. (Look for an upcoming blog post about how Tenable uses Kotlin with EMR and Spark!)
Here’s some high level sample code:
To optimize the backfill performance, we performed steps similar to the ones taken by Soundcloud. Specifically, we found the following settings the most impactful:
Setting the index replica to 0
Setting the refresh interval to 5 minutes
However, since we are migrating data using a live production system, our primary goal is to minimize performance impact. In the end, we settled on indexing 9000 documents per second as the sweet spot. At this rate, migrating a large customer takes 10–20 hours, which is fast enough for this effort.
Since we started this effort, we have noticed drastic performance improvement. Elasticsearch scroll speed saw up to 15X performance improvement, and queries decreased in latency of up to several orders of magnitude.
The chart below is a large scroll request that goes through millions of vulnerabilities. Prior to the cell migration, it could take over 24 hours to run the full scroll. The scroll from the monolith cluster suffers slow performance from the frequent resource contention with other customers, and it is further slowed by our fairness algorithm’s throttling. After the customer is migrated to the cell cluster, the same scroll request completes in just over 1.5 hours. Not only is this a large improvement for this customer, but other customers also reap the benefits of the decrease in contention.
Our change in scaling strategy has resulted in large performance improvements for the Tenable.io platform. The new request routing layer and backfill process gave us new powerful tools to shard customer data. The resharding process is streamlined to an easy, safe and zero downtime operation.
Overall, the team is thrilled with the end result. It took a lot of ingenuity, dedication, and teamwork to execute a zero downtime migration of this scale.
Exponential customer growth on the Tenable.io platform led to a huge increase in the data stored in a monolithic Elasticsearch cluster to the point where it was becoming a challenge to scale further with the existing architecture.
We broke down the site monolith cluster to cell clusters to improve performance.
We migrated customer data through a custom proxy and Spark job, all with zero downtime.
Scrolls performance improved by 15x, and queries latency reduced by several orders of magnitude.
Brought to you by the Sharders team: Alan Ning, Alex Barbour, Ciaran Gaffney, Jagan Kondapalli, Johnny Mao, Shannon Prickett, Ted O’Meara, Tristan Burch
Special thanks to Jack Matheson and Vincent Gilcreest for all the help with editing.