IaC Development for Video Hosting
Project Description
The client provides high-load video hosting services. The entire infrastructure runs on bare-metal servers in Europe and Canada, with UCDN as a backup CDN that absorbs load spikes in emergencies. The content is divided into several independent blocks, each served by its own independent group of servers. Most nodes were configured manually and were covered only by basic monitoring.

The client's main requests were to improve manageability, speed up the addition of new nodes, make monitoring more transparent, and reduce the frequency of emergencies.
Key Metrics
  • 60+ manually configured servers
  • 3+ petabytes of monthly traffic for each server group
  • 400+ domain names associated with various projects
  • 97.2% availability, below the client's target service level
Project Goals
  • To cover the entire infrastructure with code, following Infrastructure-as-Code principles.
  • To reduce the deployment time of a new node from 5 working days to 1 working hour.
  • To increase infrastructure availability from 97.2% to 99.9% and above.
  • To develop a methodology for forecasting changes in reserve capacity and the need for storage expansion.
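To put the availability goal in concrete terms, the short sketch below converts an availability percentage into a downtime budget. It assumes a 30-day month as the measurement period, which is an illustrative choice and not something stated in the project description; the function name is likewise hypothetical.

```python
# Downtime budget implied by an availability target.
# Assumption: a 30-day month is used as the measurement period.
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes


def allowed_downtime_minutes(availability_pct: float,
                             period_minutes: int = MINUTES_PER_30_DAY_MONTH) -> float:
    """Return how many minutes of downtime fit within the given availability."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return period_minutes * (1 - availability_pct / 100)


before = allowed_downtime_minutes(97.2)  # initial availability from the case study
after = allowed_downtime_minutes(99.9)   # target availability from the case study

print(f"97.2%: about {before / 60:.1f} hours of downtime per 30-day month")
print(f"99.9%: about {after:.1f} minutes of downtime per 30-day month")
```

Under that assumption, moving from 97.2% to 99.9% shrinks the monthly downtime budget from roughly 20 hours to about 43 minutes.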
Key Challenges and Results
We covered the entire infrastructure with code:
  • The entire server configuration is now deployed fully automatically using Ansible roles developed for the project.
  • Monitoring is built on Datadog. All necessary checks, including availability checks for each of the hundreds of domains, are deployed with Ansible. The Terragrunt/Terraform templates we developed automatically update all diagnostic dashboards as the server composition changes, using the Ansible configuration as the single source of truth.
  • We developed Terraform modules that manage the client's CDN and switch to the backup CDN automatically in an emergency.
As a result, adding servers to a group now takes 45-70 minutes per node on average, which meets the client's requirements.
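The single-source-of-truth idea behind the monitoring setup can be illustrated with a small sketch: one inventory describes server groups and their domains, and the per-domain availability checks are derived from it instead of being maintained by hand. In the project this was done with Ansible and Terragrunt/Terraform; the Python below only illustrates the principle. The inventory layout, group names, and domains are hypothetical, and the output fields (name, url, timeout, tags) merely follow the general shape of a Datadog Agent HTTP check instance.

```python
import json
from typing import Dict, List

# Hypothetical inventory: in the project, the Ansible configuration plays this role.
INVENTORY: Dict[str, dict] = {
    "group_a": {
        "hosts": ["node-a1.example.com", "node-a2.example.com"],
        "domains": ["videos-a.example.com", "static-a.example.com"],
    },
    "group_b": {
        "hosts": ["node-b1.example.com"],
        "domains": ["videos-b.example.com"],
    },
}


def build_http_checks(inventory: Dict[str, dict],
                      timeout_seconds: int = 5) -> List[dict]:
    """Derive one availability check per domain from the inventory."""
    checks = []
    for group, spec in sorted(inventory.items()):
        for domain in sorted(spec["domains"]):
            checks.append({
                "name": f"{group}:{domain}",
                "url": f"https://{domain}/",
                "timeout": timeout_seconds,
                "tags": [f"server_group:{group}"],
            })
    return checks


checks = build_http_checks(INVENTORY)
print(json.dumps({"instances": checks}, indent=2))
```

Because the checks are generated, adding a domain or a server group to the inventory is the only manual step; the monitoring definition follows automatically.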

Thanks to our infrastructure changes and diversification across providers and geographies, we increased infrastructure availability to the required 99.9% without increasing the unit cost of data storage and distribution.

We also conducted a large-scale study on test nodes, simulating various emergency situations, bot attacks, DDoS, and other types of problems. Based on this study, we created normalized dashboards that show an integral characteristic of each group's reserve resources. Our methodology allowed for precise planning of cluster expansion with minimal overspending on reserve capacity. In the end, we reduced spending on reserves by 38.4% while simultaneously increasing the overall reliability of the system.
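The case study does not describe how the integral reserve characteristic is calculated. As one possible illustration of a normalized metric, the sketch below expresses each resource as a fraction of the group's total capacity and reports the reserve of the tightest resource. The choice of resources (network and storage), the class and function names, and the sample numbers are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeLoad:
    """Hypothetical per-node measurements for one server group."""
    net_gbps_used: float
    net_gbps_capacity: float
    storage_tb_used: float
    storage_tb_capacity: float


def group_reserve(nodes: List[NodeLoad]) -> float:
    """Return the group's reserve as a fraction between 0 and 1.

    Each resource is normalized to group-wide utilization; the reserve is
    limited by the most heavily used resource.
    """
    net_capacity = sum(n.net_gbps_capacity for n in nodes)
    storage_capacity = sum(n.storage_tb_capacity for n in nodes)
    if net_capacity <= 0 or storage_capacity <= 0:
        raise ValueError("group capacity must be positive")

    net_util = sum(n.net_gbps_used for n in nodes) / net_capacity
    storage_util = sum(n.storage_tb_used for n in nodes) / storage_capacity
    return max(0.0, 1.0 - max(net_util, storage_util))


# Illustrative numbers only, not measurements from the project.
group = [
    NodeLoad(net_gbps_used=6, net_gbps_capacity=10,
             storage_tb_used=70, storage_tb_capacity=100),
    NodeLoad(net_gbps_used=4, net_gbps_capacity=10,
             storage_tb_used=50, storage_tb_capacity=100),
]
reserve = group_reserve(group)
print(f"Group reserve: {reserve:.0%}")
```

A single normalized number per group makes groups of different sizes comparable on one dashboard, which is the property the expansion planning relies on.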