We did extensive work to bring the entire infrastructure under code:
- The entire server configuration is now deployed fully automatically using Ansible roles.
- Integrated monitoring is based on DataDog. All necessary checks, including availability checks for each of the hundreds of domains, are deployed with Ansible. The terragrunt/terraform templates we developed automatically update all diagnostic dashboards whenever the server composition changes, with the Ansible configuration as the single source of truth.
- We developed Terraform modules that manage the client's CDN and automatically switch to a backup in an emergency.
As a result, the average time to add a new server group was 45-70 minutes per node, which met the client's requirements.
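The single-source-of-truth idea from the monitoring item above can be sketched as follows. This is a minimal illustration, not the project's actual code: the `INVENTORY` structure, the group and domain names, and the shape of the generated check dictionaries are all hypothetical and do not follow DataDog's or Ansible's real schemas. The point is only that both the per-domain checks and the dashboard groupings are derived from one inventory, so adding a server group requires a change in one place.

```python
from typing import Dict, List

# Hypothetical inventory standing in for the Ansible configuration:
# group name -> the hosts in the group and the domains they serve.
INVENTORY: Dict[str, Dict[str, List[str]]] = {
    "edge_eu": {"hosts": ["eu-2", "eu-1"], "domains": ["a.example.com", "b.example.com"]},
    "edge_us": {"hosts": ["us-1"], "domains": ["c.example.com"]},
}


def build_domain_checks(inventory: Dict[str, Dict[str, List[str]]]) -> List[Dict[str, str]]:
    """Derive one availability check per domain from the inventory.

    The returned dictionaries use an illustrative layout, not a real
    monitoring API payload.
    """
    checks = []
    for group, spec in sorted(inventory.items()):
        for domain in spec["domains"]:
            checks.append(
                {
                    "name": f"availability:{domain}",
                    "url": f"https://{domain}/",
                    "group": group,
                }
            )
    return checks


def build_dashboard_groups(inventory: Dict[str, Dict[str, List[str]]]) -> Dict[str, List[str]]:
    """Derive the host list shown on each group's dashboard from the same inventory."""
    return {group: sorted(spec["hosts"]) for group, spec in inventory.items()}


if __name__ == "__main__":
    for check in build_domain_checks(INVENTORY):
        print(check)
    print(build_dashboard_groups(INVENTORY))
```

In a setup like the one described, the output of such a step would feed the Ansible and terragrunt/terraform templates rather than being printed.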
By changing the infrastructure and diversifying across providers and geographic regions, we raised the availability of the infrastructure to the required 99.9% without increasing the unit cost of data storage and distribution.
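To make the 99.9% target concrete, the sketch below converts an availability percentage into an allowed-downtime budget and shows why diversification across providers helps. The second function assumes that providers fail independently, which is a simplification; the 99% per-provider figure in the usage example is illustrative and not taken from the project.

```python
from typing import Iterable


def downtime_budget_minutes(availability_pct: float, period_days: float) -> float:
    """Allowed downtime in minutes over a period for a given availability target."""
    return (1 - availability_pct / 100) * period_days * 24 * 60


def combined_availability(availabilities: Iterable[float]) -> float:
    """Availability of redundant providers, ASSUMING independent failures.

    The service is down only when every provider is down at the same time.
    Inputs and output are fractions in [0, 1].
    """
    all_down = 1.0
    for a in availabilities:
        all_down *= 1 - a
    return 1 - all_down


if __name__ == "__main__":
    # 99.9% over 30 days allows roughly 43.2 minutes of downtime.
    print(round(downtime_budget_minutes(99.9, 30), 1))
    # Two independent providers at 99% each give roughly 99.99% combined.
    print(round(combined_availability([0.99, 0.99]), 4))
```

Over a full year, the same 99.9% target corresponds to about 8.76 hours of allowed downtime.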
We also ran a large-scale study on test nodes, simulating various emergencies, bot attacks, DDoS attacks, and other failure scenarios. From the results, we built normalized dashboards that show an integral measure of a server group's reserve resources. This methodology allowed us to plan cluster expansion precisely, with minimal overspending on reserve capacity. We reduced spending on reserves by 38.4% while increasing the overall reliability of the system.
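The text does not specify how the integral reserve measure is computed. One plausible way to build such a measure, shown here purely as an assumption, is to normalize each resource's free capacity to a 0..1 headroom value and take the smallest one, since the tightest resource limits how much extra load a group can absorb. The resource names and numbers are hypothetical.

```python
from typing import Dict


def normalized_headroom(usage: Dict[str, float], capacity: Dict[str, float]) -> Dict[str, float]:
    """Per-resource free capacity as a fraction of total capacity (0 = full, 1 = idle)."""
    headroom = {}
    for resource, cap in capacity.items():
        if cap <= 0:
            raise ValueError(f"capacity for {resource!r} must be positive")
        # Resources missing from `usage` are treated as unused.
        headroom[resource] = 1 - usage.get(resource, 0.0) / cap
    return headroom


def group_reserve(usage: Dict[str, float], capacity: Dict[str, float]) -> float:
    """Single reserve figure for a group: the headroom of its bottleneck resource.

    Taking the minimum is an assumed aggregation, not the project's documented formula.
    """
    return min(normalized_headroom(usage, capacity).values())


if __name__ == "__main__":
    # Hypothetical peak usage observed during a simulated attack, in percent of capacity.
    usage = {"cpu": 60.0, "memory": 40.0, "network": 75.0}
    capacity = {"cpu": 100.0, "memory": 100.0, "network": 100.0}
    print(normalized_headroom(usage, capacity))
    print(group_reserve(usage, capacity))  # network is the bottleneck here
```

A figure like this, tracked per group under simulated load, is the kind of input that can guide when to expand a cluster without holding excess reserve.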