One of the old dashboards with absolute values
Look at the graph above. Can you tell from it how well the load is balanced between individual nodes? To make things harder, this group contains servers with 1, 2.5, and 10 gigabit channels. One more node is hobbled by a strange controller that can cause sudden IOPS overloads out of nowhere.
During our analysis, we divided the servers into two large groups:
- Network-limited server group
This group includes servers with a throughput capacity of less than 2.5G. The threshold was determined empirically from peak values.
As the load increases to critical levels, these servers never reach critical CPU iowait; DoS occurs due to network overload instead. Within this group, however, failure occurs at different loads: a 1G server goes into DoS earlier than a 2G one. Raw traffic metrics therefore cannot be compared across servers without normalization.
- IOPS-limited server group
This group includes servers with a throughput capacity of more than 2.5G.
As the load rises to critical levels, these servers never reach critical upload traffic values; DoS occurs due to disk subsystem overload instead.
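
The split and the normalization can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code behind the dashboards: the names `Server`, `group_of`, and `network_utilization` are hypothetical, "throughput capacity" is taken to mean channel capacity in Gbit/s, and a server at exactly 2.5G is assumed to fall into the IOPS-limited group, since the text does not say where the boundary case belongs.

```python
from dataclasses import dataclass

# Threshold from the text: 2.5G, determined empirically from peak values.
THRESHOLD_GBPS = 2.5


@dataclass
class Server:
    name: str
    link_gbps: float    # channel capacity (assumed meaning of "throughput capacity")
    upload_gbps: float  # current upload traffic, an absolute value


def group_of(server: Server) -> str:
    """Assign a server to one of the two groups by its channel capacity."""
    # Assumption: exactly 2.5G counts as IOPS-limited.
    if server.link_gbps < THRESHOLD_GBPS:
        return "network-limited"
    return "iops-limited"


def network_utilization(server: Server) -> float:
    """Normalize absolute traffic to a fraction of the channel capacity."""
    if server.link_gbps <= 0:
        raise ValueError("link capacity must be positive")
    return server.upload_gbps / server.link_gbps


# Two servers carrying the same absolute traffic are in very different states.
small = Server("node-1g", link_gbps=1.0, upload_gbps=0.8)
large = Server("node-2g", link_gbps=2.0, upload_gbps=0.8)
for s in (small, large):
    print(s.name, group_of(s), f"{network_utilization(s):.0%}")
```

With the same absolute traffic on both nodes, the 1G server sits much closer to its limit than the 2G one, which is the difference a dashboard of absolute values hides.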