The most challenging part of this project was gathering information. To prepare, we interviewed every employee to collect as much data as possible about the systems they had worked with personally or had heard about from colleagues who had since left. We then processed and systematized this data and described it in a standard format. A further problem was that the software components' code, configuration files, and other data were spread across 27 different repositories. To make all of the accumulated data searchable, we built an index using the open on-premise solution Sourcegraph. The source code was a mix of Golang, Python, Bash, PHP, and numerous plain-text configuration files.

Next, we began the painstaking work of separating individual components from a highly interconnected architecture. The work was carried out in several stages:
- Describing the new architectural component in Ansible and Terraform.
- Deploying the node in a test environment, then running integration tests and equivalence tests against its manually configured counterpart.
- Fully redeploying the component using IaC.
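
The equivalence-testing stage above can be illustrated with a minimal Python sketch. It assumes that the observable state of a node has already been collected into a snapshot; the `NodeSnapshot` fields (packages, listening ports, configuration digests) and the `diff_snapshots` helper are hypothetical names chosen for illustration, not the tooling actually used in the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSnapshot:
    """Observable state of one node. All fields are illustrative assumptions."""
    packages: dict[str, str]         # package name -> installed version
    listening_ports: frozenset[int]  # TCP ports the node listens on
    config_digests: dict[str, str]   # config file path -> content digest


def diff_snapshots(manual: NodeSnapshot, iac: NodeSnapshot) -> list[str]:
    """Return human-readable differences; an empty list means equivalent."""
    differences = []
    keyed_sections = (
        ("package", manual.packages, iac.packages),
        ("config", manual.config_digests, iac.config_digests),
    )
    for label, left, right in keyed_sections:
        # Walk the union of keys so items missing on either side are reported.
        for key in sorted(left.keys() | right.keys()):
            if left.get(key) != right.get(key):
                differences.append(
                    f"{label} {key}: manual={left.get(key)!r} iac={right.get(key)!r}"
                )
    # Symmetric difference: ports open on exactly one of the two nodes.
    for port in sorted(manual.listening_ports ^ iac.listening_ports):
        side = "manual" if port in manual.listening_ports else "iac"
        differences.append(f"port {port}: only on {side}")
    return differences
```

In a setup like this, a component would move on to full redeployment only once `diff_snapshots` returns an empty list for the manually configured node and its IaC-built replacement.
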
For components that had compatibility issues and no available source code, we rewrote them on a more modern technology stack, in agreement with the client. Some critical components were left to operate as a "black box," but we provided them with backups, automated deployment, and documentation. In the final iteration of our work, we successfully tested deployment of the client's core architecture at two independent sites. During integration testing, we successfully switched the production load to the backup, both partially and fully.

The result of our work was a disaster recovery plan (DRP) with simple, clear documentation, together with the ability to deploy the entire infrastructure with a few commands using Terraform and Ansible. We also modified the client's architecture to achieve graceful degradation in the event of a complete core failure: clients experience a partial reduction in available service functionality, but overall operability is maintained. Meanwhile, the backup infrastructure is deployed at the backup sites within 20 hours. As a result, the backup infrastructure incurs no standing costs, because it is deployed from scratch only in the event of a total failure.
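
The graceful degradation described above can be sketched in Python as follows. This is only an illustration of the general pattern, under the assumption that a service can fall back to a reduced, locally available result when the core is unreachable; `CoreUnavailableError`, `Response`, and `with_graceful_degradation` are hypothetical names, not part of the client's actual code.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class CoreUnavailableError(Exception):
    """Raised by a (hypothetical) core client when the core cannot be reached."""


@dataclass
class Response:
    data: Optional[dict]
    degraded: bool  # True when the reduced-functionality path served the request


def with_graceful_degradation(
    call_core: Callable[[], dict],
    read_fallback: Callable[[], Optional[dict]],
) -> Response:
    """Serve from the core when possible, otherwise from the fallback source."""
    try:
        return Response(data=call_core(), degraded=False)
    except CoreUnavailableError:
        # The core is down: keep the service operable with reduced functionality
        # instead of failing the whole request.
        return Response(data=read_fallback(), degraded=True)
```

Callers can inspect the `degraded` flag to hide or disable features that depend on the core while the backup infrastructure is being deployed.
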