We want servers to fail simultaneously

A client from the entertainment video streaming sector approached us with an intriguing problem: their servers were failing, but not simultaneously, and they wanted them to fail in sync.

The servers that puzzled the client served as the backend for storing video files. Essentially, this was a large number of nodes holding tens of terabytes of video files that converters had transcoded into various resolutions. These millions of files were then delivered to the outside world via nginx + kaltura, which repackaged mp4 into DASH/HLS segments on the fly. The setup handled even high loads well: the player requested only the segments it needed, so there were no sudden surges.

Problems appeared when the growing load raised questions of geographic redundancy and scaling. Servers within the same redundancy group failed asynchronously, because the fleet was a diverse collection of hardware from different providers, with varying bandwidth, disks, and RAID controllers. We were tasked with auditing the whole setup and rebuilding the monitoring and resource management methodology almost from scratch.

Architecture Nuances
The first question that arose during the audit was why a full-fledged commercial CDN wasn't used. The answer was quite mundane - it was cheaper. The architecture of the video storage allowed for decent horizontal scaling, and relatively stable traffic made renting physical servers with robust channels more cost-effective than third-party boxed solutions. A paid CDN was eventually set up, but with the scales of content delivery reaching petabyte levels, it was quite expensive for the client and only activated in emergency situations. Such occurrences were rare, as the increase in visitor numbers was generally anticipated in advance, allowing for the leisurely addition of another server to the overloaded group.

All video content was divided into dozens of groups. Each group contained 3-8 servers with identical content. Moreover, every group except the last operated entirely in read-only mode. The last group operated in a mixed mode - new videos were simultaneously written to it and content was distributed from it. Once a group's servers ran out of space, writing to it ceased and data from it was only read. The scheme was quite functional, but there were several very unpleasant aspects:
  • The servers were diverse. More than that, they were a complete mixed bag. The diversification policy required several independent providers with multiple locations around the world. Naturally, the hardware was leased at different times, with different disks, RAID controllers, and even different bandwidths.
  • For several reasons, full load balancing with health checks between servers was not achievable at this stage. If one server in the cluster failed, users would still receive its IP address from DNS. Until the node was manually taken out of rotation, some users kept hitting the dead node and got errors when loading content.
  • Because servers within one group could have bandwidths of 1 and 10 gigabits per second, it was utterly unclear when it was time to purchase new capacity.
Ultimately, as it turned out later, some server groups had been expanded more than necessary, while others had too few servers.

Servers Fail in Sequence
One of the old dashboards with absolute values
Look at the graph above. Can you tell how well the load is balanced between individual nodes? Bear in mind that this group mixes servers with 1, 2.5, and 10 gigabit channels. One node also limps along: it has a quirky RAID controller that can cause sudden IOPS spikes out of nowhere.
During our analysis, we divided the servers into two large groups:
  • WAN-limited server group
This group includes servers with a throughput capacity of less than 2.5G. The threshold was empirically determined based on peak values.
As the load climbs to critical levels, these servers never reach critical CPU iowait - the DoS happens because the network channel saturates first. However, within this group failure occurs at different loads: a 1G node goes into DoS earlier than a 2G one. So traffic metrics cannot be used in their raw form without normalization.
  • IOPS-limited server group
This group includes servers with a throughput capacity of more than 2.5G.
As the load rises to critical levels, these servers never reach critical upload traffic values - DoS occurs due to disk subsystem overload.
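To make the split concrete, here is a minimal Python sketch of such a classification under the 2.5 Gbit/s threshold described above; the node names and field names are hypothetical, and the real division was of course done on top of monitoring data rather than in application code.

```python
# Hypothetical sketch: tag each server with the resource that will give
# out first, based on the empirically chosen 2.5 Gbit/s threshold.
WAN_THRESHOLD_GBPS = 2.5

servers = [
    {"name": "node-01", "link_gbps": 1.0},
    {"name": "node-02", "link_gbps": 2.0},
    {"name": "node-03", "link_gbps": 10.0},
]

for srv in servers:
    # Below the threshold the uplink saturates before the disks do;
    # above it the disk subsystem becomes the bottleneck first.
    srv["limiting_resource"] = (
        "network" if srv["link_gbps"] < WAN_THRESHOLD_GBPS else "disk"
    )

print(servers)
```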

Let's Normalize!
When data comes in different scales, it has to be normalized before anything can be done with it. That's why we immediately decided to drop absolute values from the main dashboard. Instead, we picked two indicators and scaled each from 0 to 100%:
  • CPU iowait - effectively indicates disk subsystem overload. From historical data we determined empirically that the client's systems behave acceptably as long as iowait stays below 60%.
  • Bandwidth - here everything is straightforward: as soon as the load hits the channel limit, users start to suffer because of provider-side restrictions.
We took 100% to be the state in which a server hit its IOPS or channel limit and went into DoS.
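As an illustration, here is a minimal Python sketch of this kind of normalization. The 60% iowait ceiling and the per-server channel capacity come from the description above; the function names and the exact shape of the formulas are our assumptions - the real dashboards expressed this with monitoring-system formulas rather than code.

```python
def normalize_iowait(iowait_percent: float, critical_iowait: float = 60.0) -> float:
    """Map raw CPU iowait onto a 0-100% scale, where 100% means the
    empirically derived critical value at which the disk subsystem
    effectively goes into DoS (60% iowait for this client)."""
    return min(iowait_percent / critical_iowait, 1.0) * 100.0


def normalize_bandwidth(tx_mbps: float, link_capacity_mbps: float) -> float:
    """Map current upload traffic onto a 0-100% scale relative to the
    physical capacity of this particular server's uplink."""
    return min(tx_mbps / link_capacity_mbps, 1.0) * 100.0


# A 1 Gbit/s node pushing 800 Mbit/s and a 10 Gbit/s node pushing 8 Gbit/s
# both show 80% on the normalized scale, even though the absolute numbers
# differ by an order of magnitude.
print(normalize_bandwidth(800, 1_000))     # 80.0
print(normalize_bandwidth(8_000, 10_000))  # 80.0
```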

Normalization by IOPS
Normalization by Bandwidth
Thus, the scale became relative, and we could see when a server was hitting its personal limits. Moreover, it didn't matter whether it was 1 gigabit or 10 - the data was now normalized.

A second problem arose: a server can fail for either of the two reasons, and the final diagnostics had to account for both. So we added another layer of abstraction that showed the worst value across the two resources.
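That extra layer boils down to taking the worst (maximum) of the two normalized values. A small, self-contained sketch of the idea, with hypothetical numbers:

```python
def combined_load(norm_iowait: float, norm_bandwidth: float) -> tuple[float, str]:
    """Return the worst of the two normalized values (0-100%) together
    with the resource that is currently closest to its limit."""
    if norm_iowait >= norm_bandwidth:
        return norm_iowait, "disk"
    return norm_bandwidth, "network"


# e.g. 75% of the critical iowait vs. 9% of the uplink capacity
load, bottleneck = combined_load(75.0, 9.0)
print(f"{load:.0f}% ({bottleneck})")  # 75% (disk)
```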

Now we could balance the load so that servers in one group would fail simultaneously in the event of excessive load. This would indicate that the resources were being utilized as efficiently as possible. Of course, normally, the average load should not exceed the calculated thresholds.
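Purely as an illustration of that balancing logic (the exact mechanism on the client's side was DNS-based and is not reproduced here), each node can be given a share of traffic proportional to its estimated capacity, so the normalized load rises in step on all of them:

```python
def traffic_shares(capacities_mbps: dict[str, float]) -> dict[str, float]:
    """Split incoming traffic proportionally to each node's estimated
    capacity, so the normalized load grows at the same rate everywhere
    and, under extreme load, all nodes hit 100% at roughly the same time."""
    total = sum(capacities_mbps.values())
    return {name: cap / total for name, cap in capacities_mbps.items()}


# A hypothetical group with 1G, 2.5G and 10G uplinks; for IOPS-limited
# nodes the capacity estimate would come from the disk subsystem instead.
print(traffic_shares({"node-01": 1_000, "node-02": 2_500, "node-03": 10_000}))
# {'node-01': ~0.074, 'node-02': ~0.185, 'node-03': ~0.741}
```
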
What We Achieved
The final solution for one group. Ideally, the normalized load should be nearly identical.
We created a multi-layered sandwich of formulas that did not show the real absolute load on the server, but accurately predicted when it would fail.
Imagine a sled team with both strong and weak horses. It doesn't matter exactly how much horsepower each one delivers; what we need is for the whole team to pull evenly, with each horse loaded according to its capabilities.

Here's what we ended up with:
  • We automated the entire process of adding new nodes using terraform/ansible, at the same time generating dashboards whose normalization formulas take each server's physical limits into account (a sketch of this generation step follows after this list).
  • We built a solid capacity model that let us add new nodes at the right moment, avoiding situations where "three servers are half-loaded while the fourth is already down."
  • The servers indeed began to fail completely synchronously. A real test occurred some time later during a powerful DDoS by bot scrapers. The load almost simultaneously hit the limits on all nodes - somewhere in IOPS, somewhere in bandwidth.
  • The client was able to keep their model of expanding the infrastructure with hardware of widely varying power.
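To illustrate the point about automation above, here is a hypothetical sketch of the generation step: per-host limits from the inventory are substituted into the normalization expressions of a dashboard template. Host names, fields, and the template itself are invented; in reality this was wired into the terraform/ansible pipeline.

```python
# Hypothetical sketch: render per-host normalization expressions for the
# dashboards from inventory data; in the real setup this lived in
# terraform/ansible templates rather than Python.
hosts = [
    {"name": "node-01", "link_mbps": 1_000,  "critical_iowait": 60},
    {"name": "node-02", "link_mbps": 10_000, "critical_iowait": 60},
]

PANEL_TEMPLATE = (
    "{name}: bandwidth_pct = tx_mbps / {link_mbps} * 100, "
    "iowait_pct = iowait / {critical_iowait} * 100"
)

for host in hosts:
    print(PANEL_TEMPLATE.format(**host))
```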

Overall, this was not an ideal solution from an architectural standpoint, but it quickly and effectively closed the main issues and allowed maximum use of existing resources. In particular, thanks to the new monitoring and normalization, we were able to achieve concrete cost efficiency for every node in the group.
Later on, this let us keep refining the setup: we implemented caching with bcache, significantly increased the performance of individual nodes, and retired more than half of the servers. But more on that another time.

Gumeniuk Ivan
DevOps Engineer