Spoiled by web interfaces: why do you need Infrastructure-as-Code?

Developers often forget about infrastructure, relying on the magic of the cloud. And it really is simple and convenient: everything can be set up in a few clicks in the cloud provider's control panel. Right up until it irreversibly dies along with your data, or produces an interesting bill with a lot of zeros at the end of the month. The idyll of the easy-to-use web interface is ruined.

Therefore, today we will talk about Infrastructure-as-Code (IaC):
  • Web interfaces lower the barrier to entry at first, but hurt your business later.
  • What Terraform and Ansible are, and how to describe the desired state of your infrastructure in code. The magic of declarative configuration.
  • Why you should be able to decommission any configured server at a moment's notice and deploy identical ones in new regions within minutes. Migration in a few commands, if IaC is set up correctly.
  • No time to write infrastructure documentation? Your code does it for you.
  • What a source of truth is and why it makes life easier for your engineers.
Your application needs to run somewhere
Teams that consist mainly of developers are primarily focused on their own part of the job: writing and maintaining the application itself. The infrastructure all of this runs on is often perceived as something that exists on its own. "Let's put part of the data in the database, and the rest will be thrown into that S3 bucket." And then it should all somehow work by magic.

The problem is that there is no cloud. There are only other people's servers, which you do not control. They have their own limitations and quirks, and they need to be configured properly. For example, you quickly created an account with AWS or another cloud, linked a credit card, created an S3 bucket, and started dumping intermediate application data into it from multiple threads. A few months later, you suddenly notice that the storage bill keeps growing while the amount of data seems unchanged. The cause may be that you forgot to tell the cloud what to do with partially uploaded data. If uploads constantly arrive over unstable channels, for example from users' phones on mobile networks, the bucket keeps accumulating incomplete multipart fragments.

By default, clouds almost never delete data unless explicitly asked to. As a result, volumes grow, and you pay more and more for junk whose cleanup you forgot to configure. In S3, this is fixed by setting the DaysAfterInitiation parameter of an AbortIncompleteMultipartUpload lifecycle rule, which automatically removes such orphaned fragments after the specified period.
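
Jumping ahead a little to Terraform, which we introduce below, such a rule is only a few lines of code. This is a minimal sketch; the bucket name and the seven-day window are illustrative:

# lifecycle.tf

resource "aws_s3_bucket" "uploads" {
  bucket = "example-uploads-bucket"  # hypothetical name
}

resource "aws_s3_bucket_lifecycle_configuration" "uploads" {
  bucket = aws_s3_bucket.uploads.id

  rule {
    id     = "abort-incomplete-multipart-uploads"
    status = "Enabled"

    # Apply the rule to the whole bucket
    filter {}

    # Remove fragments of uploads that never completed
    abort_incomplete_multipart_upload {
      days_after_initiation = 7
    }
  }
}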

There are thousands of such nuances, from virtual machine settings and load balancing to the peculiarities of redeploying your application after the next update. With modern cloud providers, it is usually very easy to start building something, but it will only work until the first serious failure, or until a big bill caused by a missed checkbox in one of the many tabs of the web interface.

Code is more transparent than a web interface
You can glue an application together with chewing gum and baling wire by ticking boxes in the control panel. But, as I already mentioned, this only works while the application is very small and has not yet run into even minimal migration and scaling problems. As your project grows, you will face problems like these more and more often:
  1. An employee once configured a pile of policies, ticked a million checkboxes in different places, and now the same setup is needed in another region. Or with another provider. It is impossible to reproduce.
  2. During an audit, you notice that some feature worked fine a couple of months ago, but then something was accidentally broken in the control panel settings. When, by whom, and how to roll it back is unclear.
  3. You dutifully wrote an internal guide on how to order and configure additional resources, but following it still burns dozens of expensive engineer-hours every month.
This is usually the moment teams realize that something has to change to increase transparency, manageability, and reproducibility. When you start treating your infrastructure just like your application, everything changes. Repositories now store not only the code that tells your application how to function, but also the code that describes your current infrastructure, virtual machine configurations, environments, and the ways your application is delivered, as in the layout sketched below.
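
For illustration, such a repository might be laid out like this (a hypothetical structure; the directory names are ours, not a standard):

.
├── app/          # the application source code itself
├── terraform/    # infrastructure layer: instances, networks, buckets
│   ├── main.tf
│   └── variables.tf
└── ansible/      # OS and application configuration layer
    ├── inventory/
    └── playbooks/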

Multi-layered sandwich
At the core of everything lies the infrastructure layer, which describes virtual machines, attached volumes, load balancers, cloud firewalls, CDNs, and other building blocks.

At this lowest level, we will use Terraform. This tool lets you create and configure practically any resource exposed by your cloud provider's API. All configurations are written in a declarative style. We do not say “do this and that”; we describe “I want it to look like this”. The tool then evaluates the current state and works out the steps needed to reach the desired one.

For an engineer, everything looks extremely transparent. For example, we add a few lines declaring that we need an instance of a certain size in the desired region:

# main.tf

# Specify the provider
provider "aws" {
  region = "us-east-1"
}

# Define the EC2 instance
resource "aws_instance" "example" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t2.micro"

  # Tag the instance
  tags = {
    Name = "example-instance"
  }
}


After you run terraform apply, the utility determines that the configuration describes a virtual machine that does not yet exist. It then proposes a plan to create the described machine and deploys it in the desired region once you confirm.

If we remove these lines from the configuration, Terraform will destroy the machine to bring the real infrastructure back in line with the described state.
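
On the command line, this cycle looks roughly like this (output trimmed to the summary lines):

$ terraform plan
Plan: 1 to add, 0 to change, 0 to destroy.

$ terraform apply
...
Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

# After deleting the resource block from main.tf:
$ terraform plan
Plan: 0 to add, 0 to change, 1 to destroy.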

The next layer is OS and application configuration. We have successfully described and physically created all the necessary infrastructure, but our virtual machines contain nothing except the base operating system image. Now we need to describe, in the same declarative style, what the environment for our application should look like.

For this, we will use the next tool, Ansible. It does not require a special agent on the target system: the machine only needs to be reachable over SSH and to have Python installed, which almost every Linux distribution ships out of the box. In Ansible, we again describe the desired state, but at the level of the OS and applications. For example, in the playbook below we declare that all nodes in the frontend_server group should have nginx installed and running and PHP configured.

- name: Setup sample environment
  gather_facts: true
  become: true  # root privileges are needed to manage packages and /etc/nginx
  hosts: frontend_server
  collections:
    - nginxinc.nginx_core
  pre_tasks:
    - name: Create nginx directory
      file:
        path: /etc/nginx
        state: directory
        owner: root
        group: root
        mode: '0755'
      tags: nginx_config
    - name: Generate DH Parameters (4096)
      openssl_dhparam:
        path: /etc/nginx/dhparams.pem
        size: 4096
      tags: nginx_config
  roles:
    - role: php
      tags: php
    - role: nginx
      tags: nginx
    - role: nginx_config
      tags: nginx_config


The declarative approach means we do not have to worry about the current state of the OS. If the configuration says that the required version of nginx must be installed and the daemon must be running, Ansible works out how to bring the node to that target state.
Is the package not installed? It installs it. Is nginx installed but was stopped manually? It starts it.
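
Individually, such idempotent building blocks are tiny. A minimal sketch of two tasks of this kind, using the standard ansible.builtin modules:

- name: Ensure nginx is installed
  ansible.builtin.package:
    name: nginx
    state: present

- name: Ensure nginx is running and enabled at boot
  ansible.builtin.service:
    name: nginx
    state: started
    enabled: true

Running these tasks a second time changes nothing: once the node matches the description, Ansible simply reports "ok" and moves on.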

What we achieved
Step by step, we form the so-called source of truth: a single, authoritative description of the desired and, with some reservations, the actual state of your infrastructure. This immediately brings many advantages:
  1. Your infrastructure code is stored in a repository just like your application code. You can roll it back to previous versions if something breaks, inspect the change history, and apply all the usual CI/CD practices to it.
  2. Your code is your documentation. Yes, you still need to describe in plain language how your creation works, but now any engineer can read the Terraform or Ansible code to see exactly how everything is configured. Everything your project needs is already written down.
  3. Has one of your machines been attacked and compromised? Did an engineer manually tweak something on a virtual machine and break it? You no longer need to spend valuable man-hours figuring out the details. Just destroy it and recreate everything from scratch in a few minutes. And it will definitely work, because your repository already contains tested Terraform and Ansible code.
  4. Did an excavator sever a trunk cable, and is the data center on fire? You don't have to explain to clients why you are spending a second week manually migrating your application to other sites in a panic. You open the red envelope of your disaster recovery plan, enter a few commands, and the pre-built emergency procedures deploy and configure a duplicate infrastructure at another site. While the first data center is still burning, you are already back up and running with near-zero downtime.
  5. You no longer need to worry about how a specific server is configured. You can destroy it in seconds and deploy ten copies of it in other regions.

Thus, your environment and the layers beneath it are always in a known and predictable state. There is a caveat, however. This only works under the following conditions:
  1. Your team does not configure anything by hand. If someone manually tweaks, say, the nginx configuration, the next Ansible run will revert it to the state described in the code.
  2. You regularly deploy the changes described in the code to your infrastructure, manually or automatically. This keeps the configuration and the actual state of your machines consistent. If you don't, so-called drift accumulates over time; a simple way to catch it is shown below.
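
One common trick for catching drift early is to run a read-only plan on a schedule in CI. A sketch, assuming an already configured Terraform project; the flags are standard Terraform options:

# -detailed-exitcode makes terraform plan exit with 0 when there are
# no changes, and 2 when the real infrastructure has drifted from the code.
$ terraform plan -input=false -detailed-exitcode
...
$ echo $?
2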

Ultimately, the Infrastructure-as-Code approach takes more effort to set up initially, but it saves the business a lot of money in the long run, paying for itself through the automation of long routine tasks, greater transparency and reproducibility, and predictable behavior during critical incidents.

If you need help with these tasks, contact us. We will describe in code everything you once configured manually and then forgot to mention in the documentation. Or we will design everything from the ground up, making it both elegant and efficient, so your project avoids these problems in the future.
Gumeniuk Ivan
DevOps Engineer