At Volterra, the SRE team’s job is to operate a global SaaS-based edge platform. We have to solve various challenges in managing a large number of application clusters in various states (e.g. online, offline, admin-down) and we do this by leveraging the Kubernetes ecosystem and tooling with a declarative pull-based model using GitOps.
In this blog, we will describe:
Using GitOps to effectively manage and monitor a large fleet of infrastructure (physical or cloud hosts) and K8s clusters
We will dive deeper into lessons learned at scale (3000 edge sites), which was also covered in my recent talk at Cloud Native Rejekts in San Diego.
The architecture diagram (Figure 1) above shows logical connectivity between our REs and CEs, where each CE establishes redundant (IPSec or SSL VPN) connections to the closest RE.
When we started designing our platform about two years ago, our product team asked us to solve the following challenges:
Considering our requirements and the challenges of operating a highly distributed system, we set several principles for our SRE team to follow in order to reduce downstream issues:
As part of the edge site lifecycle management, we had to solve how to provision the host OS, do basic configurations (e.g. user management, certificate authority, hugepages, etc.), bring up K8s, deploy workloads, and manage ongoing configuration changes.
One of the options we considered but eventually rejected was to use KubeSpray+Ansible (to manage OS and deploy K8s) and Helm/Spinnaker (to deploy workloads). The reason we rejected this was that it would have required us to manage 2–3 open source tools and then do significant modifications to meet our requirements, which continued to grow as we added more features like auto-scaling of edge clusters, support for secure TPM modules, differential upgrades, etc.
Since our goal was to keep it simple and minimize the number of components running directly in the OS (outside of Kubernetes), we decided to write a lightweight Golang daemon called Volterra Platform Manager (VPM). This is the only systemd-managed Docker container in the OS and it acts as a Swiss Army knife that performs many functions:
VPM is responsible for managing the lifecycle of the host operating system, including installation, upgrades, patches, and configuration. There are many aspects that need to be configured (e.g. hugepages allocation, /etc/hosts).
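To make this concrete, here is a minimal sketch (not VPM’s actual implementation) of how a daemon could apply two such host settings; the hugepages count, IP address, and hostname below are placeholders chosen for the example:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// configureHugepages sets the number of persistent hugepages via sysctl.
// The count is a placeholder; the real value depends on the workloads.
func configureHugepages(count int) error {
	cmd := exec.Command("sysctl", "-w", fmt.Sprintf("vm.nr_hugepages=%d", count))
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("sysctl failed: %v: %s", err, out)
	}
	return nil
}

// addHostsEntry appends a name/IP mapping to /etc/hosts.
func addHostsEntry(ip, hostname string) error {
	f, err := os.OpenFile("/etc/hosts", os.O_APPEND|os.O_WRONLY, 0644)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = fmt.Fprintf(f, "%s %s\n", ip, hostname)
	return err
}

func main() {
	if err := configureHugepages(1024); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
	if err := addHostsEntry("10.0.0.10", "registry.local"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```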
VPM also provides lifecycle management for Kubernetes manifests. Instead of using Helm, we decided to use the K8s client-go library, which we integrated into VPM, making use of several features from this library:
Optimistic = create the resource and don’t wait for its status. This is very similar to the kubectl apply command, where you do not know whether the actual pods start successfully.
Pessimistic = wait for the Kubernetes resource to reach its desired state. For instance, a deployment waits until all of its pods are ready. This is similar to the new kubectl wait command (see the sketch below).
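To illustrate the two modes, here is a minimal sketch using client-go. This is not VPM’s actual code; the kubeconfig-based client, namespace, image, and timeout are assumptions made for the example:

```go
package main

import (
	"context"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// applyOptimistic creates the deployment and returns immediately,
// without waiting for the pods to come up (similar to kubectl apply).
func applyOptimistic(cs kubernetes.Interface, d *appsv1.Deployment) error {
	_, err := cs.AppsV1().Deployments(d.Namespace).Create(context.TODO(), d, metav1.CreateOptions{})
	return err
}

// applyPessimistic creates the deployment and then polls its status
// until all replicas report ready (similar to kubectl wait).
func applyPessimistic(cs kubernetes.Interface, d *appsv1.Deployment, timeout time.Duration) error {
	if err := applyOptimistic(cs, d); err != nil {
		return err
	}
	return wait.PollImmediate(5*time.Second, timeout, func() (bool, error) {
		cur, err := cs.AppsV1().Deployments(d.Namespace).Get(context.TODO(), d.Name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		return cur.Spec.Replicas != nil && cur.Status.ReadyReplicas == *cur.Spec.Replicas, nil
	})
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	replicas := int32(2)
	labels := map[string]string{"app": "demo"}
	d := &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{Name: "demo", Namespace: "default"},
		Spec: appsv1.DeploymentSpec{
			Replicas: &replicas,
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{Name: "demo", Image: "nginx:1.19"}},
				},
			},
		},
	}
	// Pessimistic mode: fail loudly if the rollout does not finish in time.
	if err := applyPessimistic(cs, d, 5*time.Minute); err != nil {
		panic(err)
	}
}
```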
In addition to configurations related to K8s manifests, we also need to configure various Volterra services via their APIs. One example is IPsec/SSL VPN configs: VPM receives the configuration from our global control plane and programs it on the individual nodes.
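Conceptually, the flow is: the global control plane hands VPM a desired configuration, and VPM pushes it into the node-local service over its API. Purely as an illustration (the endpoint, port, and payload shape below are hypothetical and not Volterra’s actual API):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// tunnelConfig is a hypothetical shape for an IPsec/SSL VPN tunnel definition.
type tunnelConfig struct {
	PeerAddress string `json:"peer_address"`
	Protocol    string `json:"protocol"` // "ipsec" or "ssl"
}

// applyTunnelConfig POSTs the desired config to a node-local service API.
// The URL is a placeholder for whatever endpoint the local service exposes.
func applyTunnelConfig(cfg tunnelConfig) error {
	body, err := json.Marshal(cfg)
	if err != nil {
		return err
	}
	resp, err := http.Post("http://127.0.0.1:8505/api/config/tunnels", "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}
	return nil
}

func main() {
	// The config would normally arrive from the global control plane; hard-coded here for illustration.
	if err := applyTunnelConfig(tunnelConfig{PeerAddress: "203.0.113.10", Protocol: "ipsec"}); err != nil {
		fmt.Println(err)
	}
}
```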
This feature allows us to remotely reset a box to its original state and run the whole installation and registration process again. It is a critical feature for recovering a site without the need for console or physical access.
Even though K8s lifecycle management may appear to be a big discussion topic for many folks, for our team it is probably just 40–50% of the overall work volume.
Zero-touch provisioning of the edge site in any location (cloud, on-premises or nomadic edge) is critical functionality, as we cannot expect to have access to individual sites nor do we want to staff that many Kubernetes experts to install and manage individual sites. It just does not scale to thousands.
The following diagram (Figure 2) shows how VPM is involved in the process of registering a new site:
As you can see, the whole process is fully automated and the user does not need to know anything about the detailed configuration or execute any manual steps. It takes around 5 minutes to get the whole device into an online state, ready to serve customer apps and requests.
Upgrade is one of the most complicated things that we had to solve. Let’s define what is being upgraded in edge sites:
There are two known methods that could be used to deliver updates to edge sites:
Our goal for upgrade was to maximize simplicity and reliability — similar to standard cell phone upgrades. In addition, there are other considerations that the upgrade strategy had to satisfy — the upgrade context may only be with the operator of the site, or the device may be offline or unavailable for some time because of connectivity issues, etc. These requirements could be more easily satisfied with the pull method and thus we decided to adopt it to meet our needs.
In addition, we chose GitOps as it made it easier to provide a standard operating model for managing Kubernetes clusters, workflows and audit changes to our SRE team.
In order to solve the scaling problems of thousands of sites, we came up with the architecture for SRE shown in Figure 3:
First, I want to emphasize that we don’t use Git for just storing state or manifests. The reason is that our platform has to handle not only K8s manifests but also ongoing API configurations, K8s versions, etc. In our case, K8s manifests are about 60% of the entire declarative configuration. For this reason we had to come up with our own DSL abstraction on top of them, which is stored in Git. Also, since Git does not provide an API or any parameter-merging capabilities, we had to develop additional Golang daemons for SRE: Config API, Executor and VP Controller.
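As a rough illustration of what such an abstraction can look like (the field names and values below are invented for this example and do not reflect Volterra’s actual DSL), a single site record kept in Git might combine the K8s version, manifest references, and non-K8s API configuration in one declarative entry:

```go
package main

import (
	"fmt"

	"gopkg.in/yaml.v2"
)

// SiteConfig is a hypothetical, simplified stand-in for a per-site DSL record
// kept in Git: it mixes K8s manifests with non-K8s configuration in one place.
type SiteConfig struct {
	Site       string            `yaml:"site"`
	K8sVersion string            `yaml:"k8sVersion"`
	Manifests  []string          `yaml:"manifests"` // references to manifest bundles
	APIConfig  map[string]string `yaml:"apiConfig"` // non-K8s service configuration
	Labels     map[string]string `yaml:"labels"`
}

func main() {
	cfg := SiteConfig{
		Site:       "ce-example-01",
		K8sVersion: "1.18.3",
		Manifests:  []string{"base/vpm", "addons/monitoring"},
		APIConfig:  map[string]string{"tunnelProtocol": "ipsec"},
		Labels:     map[string]string{"region": "us-west"},
	}
	out, _ := yaml.Marshal(cfg)
	fmt.Print(string(out)) // this rendered YAML is what would live in Git
}
```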
Let’s go through the workflow of releasing a new software version at the customer edge using our SaaS platform:
You can watch a demo of entire workflow here:
In previous sections, we described how our tooling is used to deploy and manage the lifecycle of edge sites. To validate our design, we decided to build a large-scale environment with three thousand customer edge sites (as shown in Figure 4).
We used Terraform to provision 3000 VMs across AWS, Azure, Google and our own on-premises bare metal cloud to simulate scale. All those VMs were independent CEs (customer edge sites) that established redundant tunnels to our regional edge sites (aka PoPs).
The screenshot below is from our SRE dashboard and shows the number of edge sites per location, represented by the size of each circle. At the time the screenshot was taken, we had around 2711 healthy and 356 unhealthy edge sites.
As part of scaling, we did find a few issues on the configuration and operational side that required us to make modifications to our software daemons. In addition, we ran into many issues with a cloud provider that led us to open multiple support tickets, for example API response latency, the inability to obtain more than 500 VMs in a single region, etc.
Observability across a distributed system posed a much greater set of challenges as we scaled the system.
Initially, for metrics we started with Prometheus federation: a central Prometheus in the global control plane federating the Promethei in the regional edges (REs), each of which scrapes its own service metrics and federates metrics from its connected CEs. The top-level Prometheus evaluated alerts and served as a metric source for further analysis. We hit the limits of this approach quickly (around 1000 CEs) and tried to minimize the impact of the growing number of CEs. We started to generate pre-calculated series for histograms and other high-cardinality metrics. This saved us for a day or two, and then we had to employ whitelists for metrics. In the end, we were able to reduce the number of time series from around 60,000 to 2000 for each CE site.
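For reference, Prometheus federation works by scraping the downstream instance’s /federate endpoint with match[] selectors, which is also the natural place to apply a metric whitelist. A small sketch of that mechanism (the address and metric names are placeholders, not our actual whitelist):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
)

func main() {
	// Whitelist: only federate explicitly named series instead of everything.
	q := url.Values{}
	q.Add("match[]", `{__name__=~"up|node_cpu_seconds_total|apiserver_request_total"}`)

	// Placeholder address of a downstream (CE) Prometheus.
	resp, err := http.Get("http://ce-prometheus.example:9090/federate?" + q.Encode())
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer resp.Body.Close()

	// The response is in the Prometheus exposition format, ready to be re-ingested upstream.
	io.Copy(os.Stdout, resp.Body)
}
```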
Eventually, after continued scaling beyond 3000 CE sites and running for many days in production, it was clear that this was not scalable and we had to rethink our monitoring infrastructure. We decided to drop the top-level Prometheus (in global control) and split the Prometheus in each RE into two separate instances: one responsible for scraping local service metrics and the other for federating CE metrics. Both generate alerts and push metrics to central storage in Cortex. Cortex is used as an analytical and visualization source and is not part of the core monitoring and alerting flow. We tested several centralized metrics solutions, namely Thanos and M3DB, and found that Cortex best suited our needs.
The following screenshot (Figure 7) shows the memory consumption of prometheus-cef when scraping 3000 endpoints. The interesting thing is the 29.7GB of RAM consumed, which is not actually that much given the scale of the system. It can be further optimized by splitting the scraping across multiple instances or by moving the remote write to Cortex directly into the edge itself.
The next screenshot (Figure 8) shows how much memory and CPU we needed for the Cortex ingesters (max 19GB RAM) and distributors at this scale. The biggest advantage of Cortex is horizontal scaling, which allows us to add more replicas, in contrast to Prometheus, where scaling has to happen vertically.
For the logging infrastructure in CEs and REs, we run a Fluentbit service per node, which collects the service and system log events and forwards them to Fluentd in the connected RE. Fluentd forwards the data to the ElasticSearch instance in that RE. The data in ElasticSearch is evaluated by ElastAlert, where we set up rules to create Alertmanager alerts. We use a custom integration from ElastAlert to Alertmanager to produce alerts with the same labels that Prometheus produces.
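Since Alertmanager accepts alerts over its HTTP API, such a bridge essentially posts JSON alerts whose labels mirror what Prometheus would attach, so routing and deduplication behave the same. A minimal sketch of that idea (the Alertmanager URL and label values are placeholders; this is not our actual integration code):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// alert mirrors the shape accepted by Alertmanager's /api/v2/alerts endpoint.
type alert struct {
	Labels      map[string]string `json:"labels"`
	Annotations map[string]string `json:"annotations"`
	StartsAt    time.Time         `json:"startsAt"`
}

// postAlert sends a single alert with Prometheus-style labels so it is
// routed and deduplicated exactly like alerts coming from Prometheus.
func postAlert(alertmanagerURL string, a alert) error {
	body, err := json.Marshal([]alert{a})
	if err != nil {
		return err
	}
	resp, err := http.Post(alertmanagerURL+"/api/v2/alerts", "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("alertmanager returned %s", resp.Status)
	}
	return nil
}

func main() {
	a := alert{
		Labels: map[string]string{
			"alertname": "LogErrorRate",  // same label keys our Prometheus alerts use
			"severity":  "warning",
			"site":      "ce-example-01", // placeholder site label
		},
		Annotations: map[string]string{"summary": "elevated error rate in service logs"},
		StartsAt:    time.Now(),
	}
	if err := postAlert("http://alertmanager.example:9093", a); err != nil {
		fmt.Println(err)
	}
}
```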
The key points in our monitoring journey:
- Initially we had around 50,000 time series per CE with an average of 15 labels
- We optimized it to 2000 per CE on average
- Simple whitelists for metric names and blacklists for label names
- A centralized Prometheus scraped all REs’ and CEs’ Prometheus instances
- At 1000 CEs, it became unsustainable to manage the quantity of metrics
- Currently we have a Prometheus at each RE (federating the connected CEs’ Promethei) with remote write to Cortex
- Decentralized logging architecture
- Fluentbit as collector on each node forwards logs into Fluentd (aggregator) in RE
- ElasticSearch is deployed in every RE, using cross-cluster search to query logs from a single Kibana instance
I hope this blog gives you an insight into everything that has to be considered to manage thousands of edge sites and clusters deployed across the globe. Even though we have been able to meet and validate most of our initial design requirements, lots of improvements are still ahead of us…