After the audit, we came to several conclusions:
- The client had delegated load balancing, health checks, and other critical functions entirely to CloudFlare and its built-in logic.
- Most outages were caused either by failures on CloudFlare's side or by the failure of a small number of origin content sources on the client's side.
Following a financial market analysis, we proposed an alternative solution: a diversified plan that included:
- Renting bare-metal servers from several independent providers
- Migrating content from standalone storage systems to a network of servers in data centers across the regions where the client operates
- Abandonment of most of CloudFlare's paid features
- Development of a proprietary system for load balancing and automatic traffic management based on the load and health of source nodes
- Fixing a software bug that had caused deleted and unused content, about 23% of the total, to accumulate in the database and storage
- Development of a DRP (Disaster Recovery Plan) for the new architectural approach
- Load testing of the new architecture, plus resilience testing using a system that simulates complex failures in individual regions
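The idea of routing traffic by load and node health can be illustrated with a minimal Python sketch. It is not the system built for the client: the `Node` fields, the 0.0 to 1.0 load scale, and the spare-capacity weighting are all assumptions chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical view of a source node as seen by the balancer."""
    name: str
    healthy: bool  # result of the most recent health check
    load: float    # current utilization: 0.0 (idle) to 1.0 (saturated)


def pick_node(nodes, rng=random):
    """Pick a healthy node at random, favoring nodes with more spare capacity.

    Returns None when no node is both healthy and below saturation,
    which is the signal for the caller to fail over elsewhere.
    """
    candidates = [n for n in nodes if n.healthy and n.load < 1.0]
    if not candidates:
        return None
    # Weight each candidate by its spare capacity (assumed policy).
    weights = [1.0 - n.load for n in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

With this weighting, a node at 20% load is chosen four times as often as a node at 80% load, while unhealthy or saturated nodes receive no traffic at all.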
We successfully developed a custom solution, using Python and Terraform for traffic flow management. Commercial CDNs were no longer used under normal operating conditions, which cut infrastructure maintenance and content delivery costs sixfold. We also made it possible to plug in two different commercial CDNs as modules if our own nodes fail, so a commercial CDN can be rented for a short period whenever our own infrastructure is overloaded. Overall availability did not merely hold steady; it rose to 99.2%.
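The fallback behavior described above can be sketched in Python as a simple routing decision. This is a hypothetical illustration only: the `FallbackCDN` type, the `choose_route` function, and both threshold values are assumptions, not details of the deployed system.

```python
from dataclasses import dataclass


@dataclass
class FallbackCDN:
    """Hypothetical handle for a commercial CDN that can be plugged in."""
    name: str
    available: bool


def choose_route(healthy_nodes, total_nodes, avg_load, fallbacks,
                 min_healthy_ratio=0.5, max_load=0.9):
    """Return "own" or the name of the fallback CDN that should serve traffic.

    The thresholds are illustrative assumptions: fail over when fewer than
    half of our nodes are healthy or when average load exceeds 90%.
    """
    healthy_ratio = healthy_nodes / total_nodes if total_nodes else 0.0
    if healthy_ratio >= min_healthy_ratio and avg_load <= max_load:
        return "own"
    # Own infrastructure is degraded: use the first available commercial CDN.
    for cdn in fallbacks:
        if cdn.available:
            return cdn.name
    # No fallback is reachable, so keep serving from our own nodes.
    return "own"
```

Keeping two independent fallbacks in the list means that an outage at one commercial provider does not remove the safety net, which matches the diversification goal of the plan.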