We provide a broad portfolio of distributed cloud services that gives our customers the ability to build, deploy, secure, and operate distributed applications and data. Using our SaaS-based offerings, we enable our customers to focus on their business logic while we solve the problem of distributed infrastructure, application management, and the operation of a large fleet of distributed application clusters across multiple cloud locations and/or edge sites.
These customers are building a complex and diverse set of business solutions, such as smart manufacturing, large-scale video forensics for public safety, algorithmic trading for hedge funds, and 5G transitions for telecom operators.
A good example is one of the top three global automotive companies, which needed to perform video analytics, data collection, and local processing at each EV charging station prior to sending some of the collected data and metadata back to their centralized cloud locations. Each EV charging station consists of multiple nodes (converged compute, storage, and network) with associated sensors that are used to collect data and run applications. Since this customer operates more than 3000 charging stations, their goal was to manage these 3000 locations as a "logical cloud" to reduce operational overhead and secure valuable data. They also needed to divide this installation across Japan into four regions and two availability zones, giving them the ability to manage the entire deployment as eight "logical clouds" -- eight API endpoints instead of 3000.
Another good example is a global bank that started with two VPCs in AWS (in 2016) across two regions and quickly grew to 800 VPCs in four regions, with more than 1000 Kubernetes clusters that needed to be managed by their DevOps team. While the cloud gave them flexibility and agility, it quickly became a big challenge to operate such a large number of clusters while ensuring reliability, consistency of policy, and compliance with internal and regulatory requirements.
Building and operating distributed applications across a large set of edge locations or multiple VPCs in public/private clouds is a big challenge, and customers are looking for a cloud-like operational model for these distributed systems. They want a SaaS delivery model and not a traditional software delivery model. Many of them prefer to focus on their business logic while someone else solves the problem of distributed infrastructure, application management, and operating a large fleet of distributed application clusters across multiple cloud locations and/or edge sites.
In order to handle such a variety of requirements from our customers across edge sites or multi-cloud locations, and to support the needs of our globally distributed engineering team that is building and maintaining our SaaS service, we needed to develop a platform that provided a consistent set of services across heterogeneous environments. Consistency was critical to reduce duplication of effort, improve productivity, scalability, and reliability, and allow us to eat our own dog food.
As a result, the team defined the following goals that our platform needed to deliver:
Once we established the goals of the platform, we collected baseline requirements from our product team, internal developers and several of our customers. The outcome of this exercise gave us clarity that the system needed the following high-level capabilities:
In a nutshell, it's a hybrid environment with many application clusters running on a variety of hardware/infrastructure across heterogeneous locations.
As is typical in any SaaS environment, the platform needed to handle a variety of apps like web front-ends, application back-ends, data pipelines, machine learning, security, and dev-test pipelines, and do so in the cloud, in private data centers, on developer laptops, and in some cases, at the edge. As if this were not enough, the platform also needed to support secure multi-tenancy -- a very tall order.
In order to make all of this a reality without running into massive scaling or architectural issues, at the very onset of our engineering efforts we decided to start with a clean slate and build the following capabilities:
An intent-based system to automate deployment, security, and operations of multiple application clusters using Kubernetes as a base platform. This has significantly simplified our DevOps and SRE teams' ability to manage large numbers of distributed, multi-tenant clusters across any cloud provider, private clouds, our global network, or customers' edge locations
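At its core, an intent-based system continuously compares a declared desired state against what is actually running and emits corrective actions -- the same reconcile pattern Kubernetes controllers use. Here is a minimal, hypothetical sketch of that loop; the function and state shapes are illustrative, not our actual API:

```python
# Minimal sketch of the intent-based "reconcile" pattern (illustrative only;
# desired_state/observed_state shapes are hypothetical, not a real API).

def reconcile(desired_state: dict, observed_state: dict) -> list[str]:
    """Compare desired vs. observed state and emit corrective actions."""
    actions = []
    # Create anything declared but not yet running; update anything that drifted.
    for name, spec in desired_state.items():
        if name not in observed_state:
            actions.append(f"create {name}")
        elif observed_state[name] != spec:
            actions.append(f"update {name}")
    # Remove anything running that is no longer declared.
    for name in observed_state:
        if name not in desired_state:
            actions.append(f"delete {name}")
    return actions

desired = {"web": {"replicas": 3}, "api": {"replicas": 2}}
observed = {"web": {"replicas": 1}, "worker": {"replicas": 1}}
print(reconcile(desired, observed))
# → ['update web', 'create api', 'delete worker']
```

Because operators declare only the end state, the same loop works unchanged whether a cluster sits in a public cloud, a private data center, or an edge site.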
Zero-trust security and application connectivity across distributed application clusters without granting network-level access -- a massive simplification for our and our customers' security and compliance teams. We use a combination of our global private backbone, a new L3-L7+ network datapath in application clusters or our network PoPs, and a distributed control plane to deliver secure connectivity
Resources spread across different environments present significant challenges to the security of infrastructure, applications, and data; as a result, we had to build a set of capabilities like uniform PKI-based identity, the ability to store and distribute secrets and keys to workloads in a secure manner, and encryption and decryption-as-a-service based on policies and identity
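To make the idea of identity- and policy-gated secret distribution concrete, here is a hedged, stdlib-only sketch. The SPIFFE-style identity strings, the `SecretStore` class, and its methods are all hypothetical stand-ins for a real PKI-backed service; the point is only that a secret is released when, and only when, the caller's workload identity matches policy:

```python
# Hypothetical sketch of policy-gated secret distribution: a workload presents
# a PKI-derived identity (here just a SPIFFE-style string) and the store
# releases a secret only if an access policy allows that identity.
# All names here are illustrative, not an actual service API.
import fnmatch

class SecretStore:
    def __init__(self) -> None:
        self._secrets: dict[str, bytes] = {}
        # secret name -> glob patterns of identities allowed to read it
        self._policies: dict[str, list[str]] = {}

    def put(self, name: str, value: bytes, allowed_identities: list[str]) -> None:
        self._secrets[name] = value
        self._policies[name] = allowed_identities

    def get(self, name: str, identity: str) -> bytes:
        # Release the secret only when the caller's identity matches policy.
        if not any(fnmatch.fnmatch(identity, p) for p in self._policies.get(name, [])):
            raise PermissionError(f"{identity} may not read {name}")
        return self._secrets[name]

store = SecretStore()
store.put("db-password", b"s3cret", ["spiffe://acme/prod/*"])
print(store.get("db-password", "spiffe://acme/prod/api"))  # → b's3cret'
```

In a production system the identity would be proven with an mTLS certificate rather than passed as a string, and the stored value would itself be encrypted at rest.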
A globally distributed system needs to deliver a multi-layer security solution across transport, network, application, and APIs; as a result, we had to build capabilities into our global network and our new L3-L7+ network datapath using a combination of algorithmic techniques, machine learning, and a programmable policy engine
Since we operate hundreds of microservices and very large storage clusters, fine-grained metrics, logs, traces, and audit trails are important for our developers to debug issues, measure SLAs, and create metrics for our billing systems. Because our services or our customers' services can run across hundreds of thousands of clusters around the globe, we had to build a hierarchical system to collect and analyze data, as it is not feasible to send thousands of time series from each cluster to a centralized location
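The hierarchical idea can be sketched as follows: each cluster collapses its raw samples into a compact, fixed-size summary, and a regional tier merges per-cluster summaries without ever seeing the raw data. The function and field names below are hypothetical, and a real summary would also carry quantile sketches and timestamps:

```python
# Illustrative sketch of hierarchical metric collection: each cluster
# pre-aggregates raw samples into a compact summary (count/sum/min/max) and
# ships only that upstream, instead of forwarding every raw time-series point.

def summarize(samples: list[float]) -> dict:
    """Collapse raw samples from one cluster into a fixed-size summary."""
    return {
        "count": len(samples),
        "sum": sum(samples),
        "min": min(samples),
        "max": max(samples),
    }

def merge(summaries: list[dict]) -> dict:
    """A regional tier merges per-cluster summaries without raw samples."""
    return {
        "count": sum(s["count"] for s in summaries),
        "sum": sum(s["sum"] for s in summaries),
        "min": min(s["min"] for s in summaries),
        "max": max(s["max"] for s in summaries),
    }

cluster_a = summarize([10.0, 12.0, 11.0])  # e.g. request latencies in ms
cluster_b = summarize([30.0, 25.0])
region = merge([cluster_a, cluster_b])
print(region)  # → {'count': 5, 'sum': 88.0, 'min': 10.0, 'max': 30.0}
```

The key property is that `merge` is associative, so the same operation can be applied again at a global tier over regional summaries, keeping upstream traffic constant no matter how many clusters sit below.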
In order to operate a SaaS service with a global backbone and a number of network PoPs, as well as manage upwards of 100K+ app clusters for our customers, we had to build a significant amount of process and automation to ensure that we can perform continuous upgrades, apply patches, roll out new features, and monitor and troubleshoot this distributed system
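One piece of that automation worth illustrating is staged rollout: upgrades go out in small waves across the fleet, and the rollout halts as soon as a wave fails its health check, bounding the blast radius. This is a simplified sketch; `upgrade` and `healthy` are hypothetical stand-ins for real automation hooks:

```python
# Illustrative sketch of a staged rollout across many clusters: upgrade in
# small waves and stop early if any cluster in a wave fails its health check.
# upgrade()/healthy() are hypothetical stand-ins for real automation hooks.

def rollout(clusters: list[str], upgrade, healthy, wave_size: int = 2) -> list[str]:
    """Upgrade clusters wave by wave; return those confirmed healthy."""
    done = []
    for i in range(0, len(clusters), wave_size):
        wave = clusters[i:i + wave_size]
        for c in wave:
            upgrade(c)
        # Halt the rollout if any freshly upgraded cluster is unhealthy.
        if not all(healthy(c) for c in wave):
            break
        done.extend(wave)
    return done

clusters = ["us-1", "us-2", "eu-1", "eu-2", "ap-1"]
ok = rollout(clusters, upgrade=lambda c: None, healthy=lambda c: c != "eu-2")
print(ok)  # → ['us-1', 'us-2'] -- halted when eu-2 failed its check
```

A production version would also roll back the failed wave and widen wave sizes as confidence grows, but the early-halt structure is the core of the safety argument.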
The goal was to use a common framework to build the distributed and resilient microservices that power all six of the services described above; this schema-first approach provides all the tooling to automatically generate client, server, test, and documentation code from the schema, plus a runtime for API handling, security, and storage of these objects in a database -- this allows our developers to focus on business logic rather than routine, repetitive tasks
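The schema-first idea can be sketched in miniature: declare an API object once, then derive behavior (here just validation; in a real framework also client/server stubs and docs) from that single definition. The schema format below is invented purely for illustration:

```python
# Hedged sketch of schema-first development: one declarative schema drives
# validation automatically, so handlers never hand-roll field checks.
# The SCHEMA format and names here are invented for illustration.

SCHEMA = {
    "Cluster": {
        "name": str,
        "region": str,
        "replicas": int,
    }
}

def validate(kind: str, obj: dict) -> dict:
    """Check an object against its schema before it reaches business logic."""
    spec = SCHEMA[kind]
    for field, ftype in spec.items():
        if field not in obj:
            raise ValueError(f"{kind}: missing field {field!r}")
        if not isinstance(obj[field], ftype):
            raise TypeError(f"{kind}.{field}: expected {ftype.__name__}")
    unknown = set(obj) - set(spec)
    if unknown:
        raise ValueError(f"{kind}: unknown fields {sorted(unknown)}")
    return obj

validate("Cluster", {"name": "edge-1", "region": "tokyo", "replicas": 3})
```

Because every generated artifact -- validator, client, server, documentation -- flows from the same schema, the pieces can never drift out of sync, which is what frees developers from the repetitive glue work.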
Since each of these is a complex topic that needs more detailed elaboration, we decided to write a series of blogs to cover each of them. These blogs will discuss the problems we faced and how we solved them, starting with Control Plane for Distributed Kubernetes PaaS.