
Optimizing Threat Stack’s Data Pipeline with Apache Spark and Amazon EMR

F5
Published August 11, 2022

Threat Stack is now F5 Distributed Cloud App Infrastructure Protection (AIP). Start using Distributed Cloud AIP with your team today.

Threat Stack collects tens of billions of events per day, which helps customers understand their environment, identify undesirable activity, and enhance their security posture. This data originates from different operating systems, containers, cloud provider APIs, and other sources. The more efficiently we process all this data, the more value we can deliver to customers.

In this post, I provide some context on the before and after states of Threat Stack’s data pipeline and explain how the redesign allowed us to increase the stability of our platform and improve our operating efficiency.

Operating the data platform

Threat Stack has scalable and robust back-end systems which handle data at high throughput to allow near-real-time detection of security threats. To support this detailed stream of security data, the Threat Stack platform is broken down into many specialized microservices. This architecture enables easy horizontal scaling of our data pipelining infrastructure as our customer base grows and simplifies troubleshooting of issues by segmenting different data processing responsibilities across dedicated services.

The Threat Stack engineering team typically gets ahead of platform inefficiencies or potential scalability issues before they become a customer-facing problem. When a member of the engineering team is paged in the middle of the night, we identify the problem, propose a plan to scale, and implement a remediation. System interruptions have a significant human and opportunity cost, diverting teams from new projects that deliver additional customer value. At Threat Stack, we take system health and its correlation with human impact seriously, and we do our best to keep our engineering team healthy and productive.

One of our newest areas of investment has been the Threat Stack Data Platform, which enables customers and internal users to leverage security telemetry in exciting new ways like reporting, analytics, and machine learning.

The data platform consists of a storage layer and a number of services which ingest, process, and consume the stored data to power various analytics and enable our data portability feature. With Threat Stack Data Portability, customers can export rich security telemetry from our platform and load it into their own systems like Splunk for SIEM, Amazon Redshift for data warehousing, or Amazon S3 Glacier for deep archiving. Some of this data also enables the security analysts in our Cloud SecOps Program℠ who staff our SOC 24/7 to monitor customer environments and identify suspicious activity patterns.

For the purposes of this post, we’ll zoom in on the data portability aspect of the data platform. Data portability is powered by a number of services that retrieve data from the data platform and partition it by organization and time. In the destination S3 bucket, data is organized by date and an organization identifier.
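To make that layout concrete, here is a rough sketch of how partitioned objects might be keyed in the destination bucket. The bucket name and exact key scheme shown here are illustrative, not our actual layout:

    s3://example-export-bucket/date=2022-08-11/organization_id=1234/hour=13/events-part-0000.json.gz
    s3://example-export-bucket/date=2022-08-11/organization_id=1234/hour=14/events-part-0000.json.gz
    s3://example-export-bucket/date=2022-08-11/organization_id=5678/hour=13/events-part-0000.json.gz

Keying objects this way means downstream consumers can fetch exactly one organization and time range without listing or scanning the rest of the bucket.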

With growing utilization of the Data Platform and the features that it powers, we needed to redesign parts of our back-end to support Threat Stack Data Portability. Now that you know a little bit about how we work and the feature we’re supporting, let’s look at the state of the platform before and after.

Threat Stack before Spark

The previous iteration of our processing applications used Apache Flink, a memory-intensive distributed processing engine for stateful computations over both batch and streaming data. Flink was chosen because it was already powering other parts of the platform for streaming data and telemetry aggregation. To support the new feature, we built a new Flink cluster that ran a single job, which consumed an Apache Kafka topic, partitioned the events, and wrote them out to S3. Eventually, the cluster grew to more than a hundred Flink Task Managers and required significant manual maintenance. To give you an idea of the scale of the processing we support, Threat Stack’s platform handles half a million events per second. All services in our back-end need to keep up with this pace.

The ingestion cluster running on Apache Flink underwent multiple rounds of tuning: adjusting stage parallelism settings, salting the event data to achieve an evenly balanced workload (sketched below), and experimenting with various infrastructure instance sizes. Unfortunately, the cluster reached a point where it became increasingly expensive for our team to operate, taking a toll on our engineers and our cloud-provider bill alike.
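For readers unfamiliar with salting: the idea is to append a small random suffix to a hot key (an organization identifier, in our case) so that one busy organization’s events spread across several partitions and workers instead of piling up on a single one. Here is a minimal Python sketch of the idea, with illustrative field names rather than our actual schema:

    import random

    NUM_SALT_BUCKETS = 8  # how many synthetic keys each hot key is spread across

    def salted_key(org_id: str) -> str:
        # Route events for one organization across up to NUM_SALT_BUCKETS partitions.
        return f"{org_id}#{random.randrange(NUM_SALT_BUCKETS)}"

    def unsalted_key(salted: str) -> str:
        # Strip the salt when results need to be regrouped per organization downstream.
        return salted.split("#", 1)[0]

    # salted_key("1234") yields "1234#0" through "1234#7", so up to eight workers share that organization's load.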

Over time, we began noticing some anti-patterns, one of which was cascading worker failures. When a single worker was starved for resources, it eventually became unresponsive, which caused the rest of the nodes to process more messages from the Apache Kafka topic. Nodes then became starved for resources one at a time. While we designed redundancies into the system, cascading failure events adversely affected performance and required manual intervention from individuals on call.

As an aside, I should note that Apache YARN was not in use to manage cluster resources. Apache YARN would have solved the cascading failures among cluster workers, but it would not have helped with our cost and service efficiency concerns. Platform stability incidents attributed to this ingestion service made up as much as 30% of total service events per month.

The other side of the challenge was the cost of running the infrastructure to support the data ingestion service. This one service made up roughly 25% of our monthly cloud-provider bill. It also had a large human impact, waking individuals up at night and cutting into their ability to do their regular work. Given the amount of maintenance the service required and the interruptions it caused, it was time to replace it with something timely, scalable, cost-effective, and stable.

Moving from Flink to Spark

We went back to the requirements: write raw event data, partitioned by organization and chunked by time, in preparation for delivery to customers’ S3 buckets. We didn’t need the stateful processing capabilities of Apache Flink; we just needed a simple ETL pipeline. Our engineering team had already experimented with Apache Spark in other areas of our architecture and had seen promising results.

We decided to port the business logic from Apache Flink to Apache Spark, running on Amazon EMR. This switch enabled us to use better-supported, more widely adopted, industry-standard tools. With EMR, we started using the AWS-provided Hadoop S3 connector instead of the community-supported implementation. Moving to a managed service, EMR, instead of running our own Apache Flink cluster in-house eliminated most of the operational complexity involved in keeping the old pipeline running. Moreover, the new ETL pipeline is stateless and doesn’t rely on the checkpointing or intermediary writes that caused failures to go undetected in the old pipeline. In addition, the new pipeline runs discrete jobs at short, predefined intervals, and it fully retries whole batches on any partial failure. Compared to how we previously worked with streaming data, processing is still timely for our customers, but now with the added durability that Spark micro-batch processing provides.
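To illustrate the shape of such a stateless batch job, here is a minimal PySpark sketch. The bucket names, paths, column names, and output format are assumptions for the example, not our production configuration:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("event-export-batch").getOrCreate()

    # Each run processes one discrete interval of raw events staged in S3 (hypothetical path).
    raw = spark.read.json("s3://example-staging-bucket/raw-events/2022-08-11T13/")

    partitioned = (
        raw
        # Assume an epoch-milliseconds "timestamp" field; derive date and hour partition columns.
        .withColumn("event_ts", (F.col("timestamp") / 1000).cast("timestamp"))
        .withColumn("date", F.to_date("event_ts"))
        .withColumn("hour", F.hour("event_ts"))
    )

    (
        partitioned.write
        # One output directory per organization / date / hour chunk.
        .partitionBy("organization_id", "date", "hour")
        # A retried batch rewrites this interval's prefix wholesale, so a partial failure leaves no half-written state behind.
        .mode("overwrite")
        .json("s3://example-export-bucket/partitioned-events/run=2022-08-11T13/")
    )

    spark.stop()

Because each run is a self-contained batch over a fixed interval, retrying after a failure is simply a matter of running the same job again.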

My colleague Lucas DuBois describing our evolving data architecture prior to re:Invent 2019

The new Spark job runs on infrastructure that costs us only 10% of what it cost to operate the Flink cluster. After multiple rounds of tuning, the job was capable of handling the event load and offered an easy scalability pattern: adding more worker nodes (EMR core or task nodes) to the cluster.
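As an example of how lightweight that scaling operation can be, here is a hedged boto3 sketch; the cluster and instance group IDs are placeholders you would look up with describe_cluster and list_instance_groups:

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    # Placeholder IDs; in practice they come from describe_cluster / list_instance_groups.
    emr.modify_instance_groups(
        ClusterId="j-EXAMPLECLUSTER",
        InstanceGroups=[
            {
                "InstanceGroupId": "ig-EXAMPLETASKGROUP",
                "InstanceCount": 12,  # desired node count for the task instance group
            }
        ],
    )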

The retirement of the Apache Flink cluster and the switch to a managed Apache Spark service reduced feature-related interruption incidents by 90%. The reduction in interruption incidents and delays is attributable to EMR’s managed nature and its ability to handle the event load. Moreover, EMR gave us atomic, batch-level job retry logic in case of failures, which reduced the number of manual interventions needed to keep the service running. This gave the engineering team time and energy back to innovate and focus on other areas of the platform.

More to come for the new Threat Stack Data Platform

While I’m proud of the operational benefits and stability we’ve achieved as an engineering team, I’m even more excited about the new features the data platform will enable for customers. Watch this space for more ways you’ll soon be able to work with Threat Stack data, as we have plans for added analytics and a more guided experience as you interact with the Threat Stack Cloud Security Platform.
