This is the fourth post in a series of blogs covering various aspects of what it took for us to build and operate our SaaS service:
In the previous blog, we provided insights on the challenge of using cryptographic techniques to secure our platform (infrastructure, apps, and data). This blog deals with the techniques we use to secure the platform against targeted attacks from the network, both from the Internet and from inside. Since apps are no longer constrained to any physical location, traditional perimeter-based firewalls and signature-based security solutions are no longer effective. We will outline the shortcomings of our initial zero-trust security implementation, and explain why and how we augmented it with machine learning and algorithmic techniques to properly secure our distributed infrastructure and app clusters.
Our platform runs a large number of apps across multiple teams that operate their own clusters at the edge, in our global network, and in AWS and Azure public clouds. While the majority of workloads are microservices orchestrated using Kubernetes, we have a handful of very large-scale monoliths (e.g., Elasticsearch) that we manage using Terraform. Figure 1 demonstrates the distributed nature of our platform. For example, in each of our 18+ global network PoPs (with a few tens to slightly more than a hundred physical servers), we run thousands of app pods. However, on the edge, we have individual customer deployments today with 3000+ active locations (each with one to seven computes) running a few tens of app pods.
The platform is fully multi-tenant, with each node running workloads from different customers (and our own). Since some of these apps are exposed to the public Internet, we need to ensure that all communication to and from the apps is secured. As we outlined in the previous two blogs, we built a robust identity, authentication, and authorization system along with our own L3-L7+ network datapath (VoltMesh) that is used to power our service mesh and API gateway. As shown in Figure 2, this has allowed us to deliver transport-level security across app clusters (mTLS), from users (TLS/mTLS), and from employees (mTLS), as well as access control based on authentication and authorization.
While this zero-trust implementation provides a lot of benefits, it did not automatically solve several security problems:
Over the last 2.5 years of development on this platform, we also realized that often our developers will incorporate open source apps, containerize them and ask our DevOps team to deploy them on the platform. However, they often lack details on API-level interactions within these apps that are needed by our security team to create policies to whitelist the communication. This is a big roadblock for our zero-trust security implementation as it mandates whitelist policies that only allow APIs used by the apps and block all other traffic. Whenever we made exceptions to this requirement, it left some apps with very basic network-level segmentation, thereby increasing the attack surface.
As a result, we needed to augment our existing zero-trust security solution with additional security capabilities to handle the issues listed above. We identified a list of additional security capabilities that we had to build into the platform:
We decided to use a combination of traditional signature-based techniques, statistical algorithms, and more dynamic machine learning approaches to solve these problems. This required us to make changes to our SaaS backend as well as add new capabilities in our network datapath.
In order to lock down the platform, we only allow network connections based on the whitelist of APIs for every app. This requires our security team to coordinate with developers and ensure that our programmable policy engine is fed with the right API information. We quickly realized that it was impossible for our developers to provide this information for apps that were not built using our service framework.
Since our service mesh proxy is in the network path of every access to the app, we decided to learn APIs and static resources that are exposed by the app by doing run-time analysis of every access that goes through the proxy. The challenge with this approach is to identify API endpoints by inspecting URLs and separating out components that are dynamically generated. For example, for an API “api/user/<user_id>/vehicle/”, the proxy will see accesses like:
There can be millions of such requests, which makes the underlying endpoints very challenging to decipher. As a result, the identification of dynamic components in these related requests is done using deep learning and graph analysis. We represent the entire URL component set as a graph and then perform graph clustering to find sub-graphs with similar properties, using feature sets that capture specific properties of dynamically generated components such as:
As a result, the dynamic components get classified and the output from the system looks like:
Using these machine-learned APIs, we can easily and automatically generate a policy that can be enforced by our service mesh proxy. Based on the API endpoints discovered, we also learn other properties, such as which apps use which APIs to talk to other apps and the typical behavior of these APIs. This allows us to build a service graph that helps our security team visualize service-to-service interaction for forensics, discovery, and API-level micro-segmentation.
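To make the idea of collapsing dynamic URL components into endpoints more concrete, here is a minimal sketch. It is not our production approach: instead of deep learning and graph clustering, it swaps in two simple heuristics (ID-like patterns and high cardinality under a shared prefix). The `<dyn>` placeholder, the `max_distinct` threshold, the function names, and the sample paths are all hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict

# Assumed heuristics for "looks dynamically generated". The real system
# learns such properties; these patterns are illustrative only.
_ID_PATTERNS = [
    re.compile(r"^\d+$"),  # plain integers
    re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$"),  # UUIDs
    re.compile(r"^[0-9a-fA-F]{16,}$"),  # long hex tokens
]


def looks_dynamic(segment):
    """Return True if a path segment matches one of the ID-like patterns."""
    return any(p.match(segment) for p in _ID_PATTERNS)


def learn_templates(paths, max_distinct=3):
    """Collapse observed URL paths into endpoint templates.

    A segment becomes "<dyn>" if it looks like an ID, or if more than
    `max_distinct` different values were seen under the same prefix.
    """
    split = [[s for s in p.split("/") if s] for p in paths]
    # Pass 1: pattern-based replacement.
    split = [["<dyn>" if looks_dynamic(s) else s for s in segs] for segs in split]
    # Pass 2: cardinality-based replacement, one depth at a time so that
    # prefixes already reflect segments collapsed at shallower depths.
    depth = max((len(segs) for segs in split), default=0)
    for i in range(depth):
        children = defaultdict(set)
        for segs in split:
            if len(segs) > i:
                children[tuple(segs[:i])].add(segs[i])
        for segs in split:
            if len(segs) > i and len(children[tuple(segs[:i])]) > max_distinct:
                segs[i] = "<dyn>"
    return sorted({"/" + "/".join(segs) for segs in split})


def compile_allowlist(templates):
    """Turn templates into anchored regexes a proxy could match against."""
    patterns = []
    for template in templates:
        parts = [
            "[^/]+" if seg == "<dyn>" else re.escape(seg)
            for seg in template.strip("/").split("/")
        ]
        patterns.append(re.compile("^/" + "/".join(parts) + "/?$"))
    return patterns


def is_allowed(path, allowlist):
    """Whitelist check: allow only paths matching a learned endpoint."""
    return any(p.match(path) for p in allowlist)


observed = [
    "/api/user/alice/vehicle/",
    "/api/user/bob/vehicle/",
    "/api/user/carol/vehicle/",
    "/api/user/dave/vehicle/",
    "/api/user/1001/vehicle/",
    "/api/health",
]
templates = learn_templates(observed)
allowlist = compile_allowlist(templates)
print(templates)
print(is_allowed("/api/user/zed/vehicle/", allowlist))
print(is_allowed("/api/user/zed/admin", allowlist))
```

A pure cardinality threshold like this would wrongly collapse a prefix that legitimately has many static children, which is one reason a learned, feature-based classification is preferable at scale.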
While this capability is important for our web traffic, we also need to serve a growing amount of API and machine-to-machine traffic in our environment. To solve this, our security team would have to write app-specific rules that don’t fall under known typical web rules (like OWASP CRS). Security administrators usually know little about the apps, and with the dynamic nature of the environment, it becomes even harder to keep track of the app types and structure needed to write those app-specific rules. As a result, while our platform team provides this capability in our network datapath, it is not often used by our security team.
Another problem, for which we have a significant amount of data from our network, is that app attacks are becoming a lot more sophisticated over time. The attacker spends days performing reconnaissance to determine the nuts and bolts of the APIs, the app, the underlying infrastructure, and the OS type by looking at HTTP/TCP signatures, etc. Traditional signature and rules-based approaches are of very limited use in these situations, so we decided to continue with our AI-based approach to automatically learn user behavior and enforce good vs bad behavior.
Most apps have certain workflows (sequences of APIs) and context (data within the APIs) around which different use cases and deployments are designed, and which the users of the apps typically follow. We exploit these properties and train our machine learning algorithms to model “valid” behavioral patterns in a typical user interaction with the app.
Our datapath samples requests/responses for each API along with associated data and sends it to our central learning engine as shown in Figure 3. This engine continuously generates and updates the model of valid behavioral patterns that is then used by the inference engine running in the datapath to alert/block suspicious behavior.
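As a rough illustration of the split between a learning engine and an inference engine, the sketch below uses a first-order transition model over API sequences. This is a deliberate simplification and an assumption on our part for the sake of the example, not a description of the actual model; the API names, `min_count`, and the 0.5 threshold are hypothetical.

```python
from collections import defaultdict

START = "<start>"  # hypothetical marker for the beginning of a session


def learn_transitions(sessions, min_count=2):
    """Learning side: keep API-to-API transitions seen at least `min_count` times."""
    counts = defaultdict(int)
    for session in sessions:
        prev = START
        for api in session:
            counts[(prev, api)] += 1
            prev = api
    return {pair for pair, n in counts.items() if n >= min_count}


def score_session(session, valid_transitions):
    """Inference side: fraction of transitions in a session the model has not learned."""
    if not session:
        return 0.0
    prev, unseen = START, 0
    for api in session:
        if (prev, api) not in valid_transitions:
            unseen += 1
        prev = api
    return unseen / len(session)


def classify(session, valid_transitions, threshold=0.5):
    """Label a session; a datapath could alert or block on "suspicious"."""
    score = score_session(session, valid_transitions)
    return "suspicious" if score >= threshold else "ok"


# Sampled sessions as they might arrive at a central learning engine.
training = [
    ["POST /login", "GET /cart", "POST /checkout"],
    ["POST /login", "GET /cart", "POST /checkout"],
    ["POST /login", "GET /cart"],
]
model = learn_transitions(training)

normal = ["POST /login", "GET /cart", "POST /checkout"]
odd = ["POST /checkout", "POST /checkout", "POST /checkout"]
print(classify(normal, model))
print(classify(odd, model))
```

In this sketch only the compact set of valid transitions needs to be pushed to each node, which mirrors the idea of keeping learning centralized while inference runs in the datapath.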
The learning engine looks at many metrics like the sequence of APIs, gaps between requests, repeated requests to the same APIs, authentication failures, etc. These metrics are analyzed for each user and on an aggregate basis to classify good vs bad behavior. We also perform behavior clustering to identify multiple different sequences of “good behavior.” Let’s take an example to illustrate this:
The following sequence of APIs will get flagged by the system as suspicious/bad behavior, which will either be automatically mitigated by the system or generate an alert for an admin to intervene:
Since we put this system into production over a year ago, we have continuously refined the model based on usage and customer feedback. We have been able to successfully identify the following types of attacks:
That said, we also realized that there are some problems with this approach: it cannot uncover low-and-slow attacks (brute force, app denial of service, scanner), for which we need to apply anomaly detection techniques.
Sometimes, we see highly sophisticated attacks that use large distributed botnets that pass under the radar of our behavior analysis technique. Examples of such attacks are:
Since our network datapath collects information from each node across our global network, it becomes relatively easy to analyze a particular app’s aggregate metrics like request rate, error rate, response throughput, etc. This analysis allows us to detect distributed attacks and mitigate them (at every node) by identifying the users that could be part of such botnets. Let’s take an example where we are trying to detect anomalies across different time windows (last 5 mins, 30 mins, 4 hours, 24 hours) by looking at request rates. If the request rate is high within a given time window, the system performs the following deeper analysis of access logs:
While anomaly detection has always been an important technique for intrusion detection and prevention (IDS/IPS) in firewall appliances, these appliances are unable to mitigate global app-layer attacks. With our ability to perform API markup and learning across our global platform, we are now able to suppress attacks at the source across our distributed network.
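The first stage of the multi-window check described above, deciding whether the request rate is high in any of the 5 minute, 30 minute, 4 hour, or 24 hour windows, can be sketched as follows. The z-score test, the `z_threshold` value, and the shape of the per-window history are assumptions made for this example; the deeper access-log analysis that follows a flagged window is not shown.

```python
from bisect import bisect_right
from statistics import mean, pstdev

# Time windows from the example above, in seconds.
WINDOWS_SECONDS = {"5m": 300, "30m": 1800, "4h": 14400, "24h": 86400}


def rate_in_window(sorted_ts, now, window):
    """Requests per second with timestamps in (now - window, now]."""
    lo = bisect_right(sorted_ts, now - window)
    hi = bisect_right(sorted_ts, now)
    return (hi - lo) / window


def is_anomalous(current, history, z_threshold=3.0):
    """Flag `current` if it sits more than `z_threshold` std devs above the history mean."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu, sigma = mean(history), pstdev(history)
    if sigma == 0:
        return current > mu
    return (current - mu) / sigma > z_threshold


def anomalous_windows(sorted_ts, now, history_by_window, z_threshold=3.0):
    """Return the names of windows whose current request rate looks anomalous."""
    flagged = []
    for name, seconds in WINDOWS_SECONDS.items():
        rate = rate_in_window(sorted_ts, now, seconds)
        if is_anomalous(rate, history_by_window.get(name, []), z_threshold):
            flagged.append(name)
    return flagged


# Hypothetical aggregate for one app: 600 requests in the last 5 minutes.
now = 100000.0
timestamps = [now - 300 + 0.5 * i for i in range(1, 601)]
history = {
    "5m": [0.10, 0.12, 0.11],
    "30m": [0.30, 0.35, 0.32],
    "4h": [0.040, 0.050, 0.045],
    "24h": [0.010, 0.012, 0.011],
}
print(anomalous_windows(timestamps, now, history))
```

Because the input is an aggregate across nodes, a burst that is spread thinly over many sources can still stand out here even when no single user or node looks unusual on its own.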
While we were extremely satisfied with our zero-trust implementation based on service mesh and API gateway, we realized that it was not comprehensive enough to secure distributed app clusters from vulnerabilities and malicious attacks. We had to augment it with machine learning for behavior analysis and anomaly detection, alongside traditional signature and rule-based techniques, to provide a better security solution.
We have seen three significant gains from the addition of distributed inference in our L3-L7+ network datapath along with the learning core running in our centralized SaaS:
Network and app security is a never-ending runway and it looks like we still have a long backlog of new features to add. We will come back in the near future to share additional insights into the incremental algorithms and techniques we have implemented.
This series of blogs will cover various aspects of what it took for us to build and operate our globally distributed SaaS service with many app clusters in public clouds, our private network PoPs, and edge sites. Next up will be “Observability across our Globally Distributed Platform” (coming soon)…