Building a resilient HashiCorp Vault cluster with indirect replication from scratch

HashiCorp Vault is a great product for centralized storage of all company passwords and other secrets. However, many know that a convenient key holder is also an ideal way to lose all your keys at once. When I worked at a large telecom, the data-recovery DRP procedures even included a ban on gathering more than two key keepers in one place, just in case of a very unfortunate corporate event involving a joint hot air balloon flight, a tasting of homegrown mushrooms, or other similar factors. In short, if you're implementing such a system, you need to be very careful not only with its operation but also with backup and recovery.

Today, I won't delve deeply into organizing proper storage of Shamir's key fragments. Instead, I'll try to explain how to set up a resilient HashiCorp Vault cluster from scratch using the Community edition. For this, we'll launch a main and a test Vault cluster across several regions and data centers. The test cluster will also serve as a backup as part of the DRP procedure.

To make it more interesting, we'll set up the process so that the test cluster is a one-way replica of production with a delay of a few days. Of course, all deployment will be done in the Infrastructure-as-Code paradigm, with Terraform and Ansible as the main tools.
I'll explain when this can be useful and which Ansible roles and modules can be used for it.

Vault backend options
First, let's choose the data storage backend. Architecturally, it is a simple key-value store on top of which the HashiCorp Vault binary implements the actual magic: encryption, secrets lifecycle management, tokens, and other similar things.

Choosing a backend largely depends on the size of the organization and the expected load. Among the available options, we have Consul, ZooKeeper, etcd, and a few others. But to put it simply, in all likelihood you need the most deployment-friendly option: integrated storage. This is the choice recommended by the vendor for Vault 1.4 and later under moderate loads. With integrated storage, each Vault instance handles not only access management for secrets but also replication of the key-value store itself using the Raft protocol.

Normally, the CPU load on the Vault nodes themselves is minimal. At my last telecom job I was handling an incident in which roughly 60 Mbit/s of API requests swamped our cluster. No CPU problems were observed on medium-sized virtual machines, but the backend providing request logging started begging for mercy after a few dozen minutes, causing a number of cascading problems. In normal situations, the bottleneck is I/O between Vault and its key-value storage backend.

So if you're a huge enterprise or you plan to spin up and terminate thousands of software instances that use Vault, you should consider Consul as the backend instead: it runs as a separate cluster and copes better with heavy read/write loads, at the cost of extra operational overhead.

In our tutorial, we will be using embedded integrated storage.

General architecture
First, let's describe the overall architecture. I've shown the details in the diagram above. If anyone needs it, here are the draw.io sources. Adapt it to your company's needs when you write the information system passport in your documentation. And I really do hope that you describe your mission-critical production systems in detail.

Here is a list of our main wishes for the installation:
  1. The system must survive the crash of any Vault instance in such a way that no customer notices the failure. The minimum operable number of instances equals the Raft quorum: floor(n/2) + 1, where n is the total number of nodes in the cluster.
  2. The system must use several different, independent providers. I've written in a previous post that we've developed some sort of healthy paranoia and always consider the risks that an ISP might suspend an account or suddenly move all data from a data center to the cloud due to an unplanned thermal impact on the hardware.
  3. We need a test environment that is completely identical to the production one. You wouldn't want to update the cluster without prior testing, right? It's also not advisable to test scripts and software that has write access to the database in a production environment.
  4. We should be able to quickly and conveniently retrieve removed secrets/policies/other data that were permanently deleted without the possibility of rollback. This data should be accessible within a couple of minutes, not after an hour-long DRP procedure.
  5. Various scripts and applications should use AppRole authentication and connect directly to the Vault instances, without intermediate nginx or other balancers (a sketch follows right after this list).
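For illustration, here is a hedged sketch of what such an AppRole could look like, with logins and issued tokens restricted to trusted subnets (the role name, policy, TTLs, and CIDRs are placeholders):

# Enable the auth method once per cluster, then create a CIDR-bound role for an application
vault auth enable approle
vault write auth/approle/role/billing-app \
    token_policies="billing-app" \
    secret_id_bound_cidrs="122.85.77.183/32,44.68.2.226/32" \
    token_bound_cidrs="122.85.77.183/32,44.68.2.226/32" \
    token_ttl=20m token_max_ttl=1h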

Since we are, after all, standing up one of the key components of the company's information security, let's define the access rules:
  1. All privileged access is possible only from the corporate VPN or a bastion host.
  2. All Vault nodes are as closed to external access as possible.
  3. Port 22 is accessible only from trusted networks.
  4. Port 8201 should either be completely inaccessible on the public interface, or allowed only for cluster members. It's needed for data replication within the cluster.
  5. Port 8200, accessible only from trusted networks, hosts the web interface and API of our Vault.
  6. Port 80 is open globally; certbot occasionally runs on it in standalone mode to renew the certificate.
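If you prefer to enforce these rules at the OS level instead of (or in addition to) the provider firewall, a rough firewalld sketch could look like this (the zone name and addresses are examples only):

# Trusted zone for bastion/VPN sources; add the peer node IPs here as well so 8201 stays reachable for the cluster
firewall-cmd --permanent --new-zone=vault-trusted
firewall-cmd --permanent --zone=vault-trusted --add-source=144.85.22.183/32   # bastion
firewall-cmd --permanent --zone=vault-trusted --add-source=122.85.77.183/32   # corporate VPN
firewall-cmd --permanent --zone=vault-trusted --add-port=22/tcp               # SSH
firewall-cmd --permanent --zone=vault-trusted --add-port=8200/tcp             # Vault API/UI
firewall-cmd --permanent --zone=vault-trusted --add-port=8201/tcp             # Raft replication
# Port 80 stays open globally for the certbot standalone challenge
firewall-cmd --permanent --zone=public --add-port=80/tcp
firewall-cmd --reload
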
So, we will be deploying on Hetzner Cloud and Digital Ocean with one-way data replication. You can choose any other providers, or build maximally independent clusters on top of your company's internal infrastructure.

You should also choose whatever backup storage is most convenient for you. The same storage will simultaneously serve as the vehicle for delayed replication to the test/backup cluster. In my example it's Amazon S3, but you can use anything that suits you better; S3 compatibility is not strictly necessary. SFTP, ssh + rsync, an NFS share, or any other convenient option will do. The main thing is that you can reliably store backups in it and retrieve them from it.

The choice of tool for DNS-level fault tolerance is also up to you. Here are the options:
  1. DNS points to a load balancer, which monitors the status of the Vault backend on port 8200 and directs requests only to healthy, unsealed nodes. However, this scheme requires extra work with forwarded headers and adding your load balancer to the whitelist of trusted nodes; without that, restricting AppRole authentication to whitelisted subnets won't work.
  2. The DNS service itself checks the backend's status and returns the IPs of healthy Vault nodes only. In our case, this is AWS Route 53.
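Whichever option you choose, the health check itself can simply poll Vault's sys/health endpoint, which reports the node state through HTTP status codes. A quick manual check might look like this (the hostname is an example):

# 200 = active and unsealed, 429 = unsealed standby, 503 = sealed, 501 = not initialized
curl -s -o /dev/null -w '%{http_code}\n' https://vault-nl-1.example.com:8200/v1/sys/health
# Add standbyok=true if unsealed standbys should also be treated as healthy
curl -s -o /dev/null -w '%{http_code}\n' 'https://vault-nl-1.example.com:8200/v1/sys/health?standbyok=true'
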
Let's get our clusters up
Let's start building our clusters. More precisely, let's start with preparing the deployment code. In this deployment we will limit ourselves to ordinary virtual machines without Kubernetes, which would often be overkill here. For our setup it's enough to provision one virtual machine per data center and make sure the nodes can communicate with each other; since they sit in different data centers and with different providers, in our case they talk over their public interfaces, protected by TLS (more on that below). This also lets you migrate to any other provider quickly and with minimal effort.

For this part of the tutorial, I assume that you already have a master image for your VM deployment. We use HashiCorp Packer for this, building an Oracle Linux image with our public keys and the system user that Ansible will run as.
Terraform
Now, let's start by describing the resources for creating the virtual machines. In the example below, I used the Terraform provider for Digital Ocean.

Let's describe our virtual machines in main.tf:

# Vault cluster
resource "digitalocean_droplet" "vault-nl-1" {
  image          	= var.ol7_base_image
  name           	= "vault-nl-1.example.com"
  region         	= "ams3"
  size           	= var.size
  tags           	= ["vault"]
  monitoring   	= true
}

resource "digitalocean_droplet" "vault-de-1" {
  image          	= var.ol7_base_image
  name           	= "vault-de-1.example.com"
  region         	= "fra1"
  size           	= var.size
  tags           	= ["vault"]
  monitoring    	= true
}

resource "digitalocean_droplet" "vault-uk-1" {
  image          	= var.ol7_base_image
  name           	= "vault-uk-1.example.com"
  region         	= "lon1"
  size           	= var.size
  tags           	= ["vault"]
  monitoring    	= true
}
Let's also set the default region and VM image ID in variables.tf to follow the DRY principle:

variable "size" {
  description = "Droplet size"
  default 	= "s-1vcpu-1gb"
}

variable "ol7_base_image" {
  description = "Base ol7 DO image made by packer"
  default 	= 312166483
  type = number
}
In the requirements above we described the need to restrict network access. This can be done either at the OS level or with the provider's own firewall. In the latter case we also need to describe all the ports in main.tf. Here I strongly recommend following the classic software development principle that condemns magic values without any description: use named IP address ranges instead of throwing raw addresses in unlabeled. A couple of years and a dozen edit iterations down the line, you won't remember what an address was for or whether it even still belongs to you.

resource "digitalocean_firewall" "general-firewall" {
  name = "general-firewall"
  tags = [ "vault" ]

 inbound_rule {
	protocol     	= "icmp"
	source_addresses = ["0.0.0.0/0", "::/0"]
  }

  inbound_rule {
	protocol   = "tcp"
	port_range = "22"
	source_addresses = var.bastion_ip_add
  }
  outbound_rule {
	protocol          	= "udp"
	port_range        	= "all"
	destination_addresses = ["0.0.0.0/0", "::/0"]
  }
  outbound_rule {
	protocol          	= "icmp"
	destination_addresses = ["0.0.0.0/0", "::/0"]
  }
  outbound_rule {
	protocol          	= "tcp"
	port_range        	= "all"
	destination_addresses = ["0.0.0.0/0", "::/0"]
  }
}
Also, let's set those same named subnets like var.bastion_ip_add in variables.tf:

variable "bastion_ip_add" {
  description = "List of bastion IP addresses"
  default 	= ["144.85.22.183/32",
               "12.68.22.226/32",
            	]
}


variable "internal_vpn_ip_add" {
  description = "List of internal VPN IP addresses"
  default 	= ["122.85.77.183/32",
               "44.68.2.226/32",
            	]
}

variable "customer_customer_1_ip_add" {
  description = "List of customer_1 IP addresses"
  default 	= ["158.63.250.15/32",
               "158.63.250.16/32",
            	]
}
To allow several named ranges at once in main.tf, you can use a construct like this:

  inbound_rule {
    protocol         = "tcp"
    port_range       = "8200"
    source_addresses = setunion(var.internal_vpn_ip_add, var.bastion_ip_add, var.customer_customer_1_ip_add)
  }
Now we need to export the necessary tokens and secrets as environment variables so that provider authorization works. In our case, this is the Digital Ocean token.
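For Digital Ocean this boils down to exporting the token before running Terraform (the variable name is the one the provider reads; the value is a placeholder):

# Token for the Digital Ocean Terraform provider; keep it out of version control
export DIGITALOCEAN_TOKEN="dop_v1_xxxxxxxxxxxxxxxx"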

terraform init
terraform plan
If everything looks good, we apply our changes and sip tea while terraform spins up the virtual machines and sets up the network filters.

terraform apply
Ansible
So, we have two groups of virtual machines from different providers. They are pristine, they already accept your private keys, and they are waiting for the applications to be deployed. At this stage we describe the cluster and node configuration in Ansible.

We have chosen the convenient role https://github.com/ansible-community/ansible-vault. It lets us describe the whole architecture simply and is flexible enough for our purposes. Unfortunately, there was one flaw, which I fixed in a pull request that the author had not yet merged upstream.

The change is small, but it adjusts the Vault configuration template for the case where we want TLS encryption but don't want a local CA. This is relevant for deployments where certbot obtains certificates for all nodes using the DNS challenge.

Update: The author has accepted the PR, everything is fine.

Let's start by obtaining certificates for all our nodes. I'm not a big fan of bash-inside-Ansible inserts, but in this case it's necessary. In normal operation, the Vault instance runs as its own independent user, which does not have access to the private keys in /etc/letsencrypt. To get around this, certbot retrieves the certificates and then uses a renew hook to copy them to a location accessible to Vault. For the first playbook run, we need to do this once in a separate task:


- name: Prepare certificates
  hosts: hashicorp_vault
  gather_facts: true
  become: yes
  roles:
    - role: letsencrypt-ssl
      tags: letsencrypt
  post_tasks:
    - name: Copy the certificates to the vault config dir for the first time
      shell: |
        rsync -L /etc/letsencrypt/live/{{ inventory_hostname }}/privkey.pem /etc/vault.d/privkey.pem
        rsync -L /etc/letsencrypt/live/{{ inventory_hostname }}/fullchain.pem /etc/vault.d/fullchain.pem
        chown vault:vault /etc/vault.d/privkey.pem
        chown vault:vault /etc/vault.d/fullchain.pem
        # Vault is not running yet on the very first pass, so don't fail on pkill
        pkill -SIGHUP vault || true
      args:
        creates: /etc/vault.d/fullchain.pem
      tags: letsencrypt
In the next step, we will deploy the application itself:

- name: Deploy hashicorp vault cluster
  hosts: hashicorp_vault
  gather_facts: true
  become: yes
  roles:
    - role: ansible-community.ansible-vault
      tags: vault_install
Here are just some of the key variables from group_vars that you should pay attention to:

vault_version: "1.12.0"
vault_install_hashi_repo: true
vault_harden_file_perms: true
vault_service_restart: false

# listeners configuration
vault_api_addr: "{{ vault_protocol }}://{{ inventory_hostname }}:{{ vault_port }}"
vault_tls_disable: false
vault_tls_certs_path: /etc/vault.d
vault_tls_private_path: /etc/vault.d
vault_tls_cert_file: fullchain.pem
vault_tls_key_file: privkey.pem
vault_tls_min_version: "tls12"

vault_raft_cluster_members:
  - peer: hasd-vault-nl-1.itsts.net
    api_addr: https://vault-nl-1.example.com:8200
  - peer: hasd-vault-de-1.itsts.net
    api_addr: https://vault-de-1.example.com:8200
  - peer: hasd-vault-uk-1.itsts.net
    api_addr: https://vault-uk-1.example.com:8200
Here we pin the software version and tell the role to add the HashiCorp repository as the package source.

vault_harden_file_perms automatically tightens file permissions for the vault user according to the Production Hardening recommendations.

vault_service_restart set to false prevents the role from restarting the daemon after an update. Restarts are best done manually: at startup Vault does not know the encryption key for its backend and requires a manual unseal with the Shamir key shares. If you simply restart everything at once, the entire cluster will seal itself and the service will be unavailable until it is unsealed. The correct approach is to update the nodes and then restart them one at a time to maintain service continuity, as sketched below.
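A minimal sketch of that rolling restart on a single node (assuming the systemd unit is named vault and a key threshold of three shares):

sudo systemctl restart vault
vault status                    # Sealed: true right after the restart
vault operator unseal           # repeat until the key-share threshold is reached
vault operator raft list-peers  # with a valid token: confirm the node rejoined before touching the next one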

Next comes the block of key parameters used by the template to generate each Vault instance's configuration. In particular, vault_raft_cluster_members lists all the nodes and their addresses, which are substituted into the Jinja2 template so that the cluster can assemble itself into a single unit.

As a result, the nodes should generate configuration file /etc/vault.d/vault_main.hcl with approximately this content:


# Ansible managed

cluster_name = "dc1"
max_lease_ttl = "768h"
default_lease_ttl = "768h"

disable_clustering = "False"
cluster_addr = "https://333.222.10.107:8201"
api_addr = "https://vault-nl-1.example.com:8200"

plugin_directory = "/usr/local/lib/vault/plugins"

listener "tcp" {
  address = "333.222.10.107:8200"
  cluster_address = "333.222.10.107:8201"
  tls_cert_file = "/etc/vault.d/fullchain.pem"
  tls_key_file = "/etc/vault.d/privkey.pem"
  tls_min_version  = "tls12"
  tls_disable = "false"
  }
listener "tcp" {
  address = "127.0.0.1:8200"
  cluster_address = "333.222.10.107:8201"
  tls_cert_file = "/etc/vault.d/fullchain.pem"
  tls_key_file = "/etc/vault.d/privkey.pem"
  tls_min_version  = "tls12"
  tls_disable = "false"
  }

storage "raft" {
  path = "/opt/vault/data"
  node_id = "hasd-vault-nl-1"
  retry_join {
	leader_api_addr =  "https://vault-de-1.example.com:8200"
  }
  retry_join {
	leader_api_addr =  "https://vault-uk-1.example.com:8200"
  }
    	}
// HashiCorp recommends disabling mlock when using Raft.
disable_mlock = true
ui = true
Let's break down a few key points.
  1. api_addr - this parameter must match the node address specified in the listener section, unless you are using intermediate balancers between the clients and the Vault instances. The instance uses it to tell other cluster members the address at which it is ready to serve clients; this is required for the internal request-forwarding mechanism to work.
  2. cluster_addr - similar to the previous one, but it tells the cluster neighbours the address and port on which the node performs intra-cluster replication.
  3. tls_cert_file, tls_key_file - file names of the key and certificate obtained with certbot. In our case, the nodes of one cluster are located in different data centers, and the easiest approach is to let them communicate over the public interface while protecting the traffic with TLS. By default, the nodes use TLS 1.3 to talk to each other.
  4. retry_join - when a node starts, it tries to connect to its neighbours. This parameter gives it the initial bootstrap list for finding them, so that the nodes can quickly assemble into a single cluster.

At this point, after applying the playbook, we have a running cluster; the first connection to any node (through the UI or vault operator init) triggers initialization and the Shamir key generation procedure.
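A hedged sketch of that initialization step (the share and threshold counts are just an example):

export VAULT_ADDR=https://vault-nl-1.example.com:8200
# Run once against any node; prints the unseal key shares and the initial root token
vault operator init -key-shares=5 -key-threshold=3
# Unseal every node with any three of the shares, then check that the cluster has assembled
vault operator unseal
vault operator raft list-peers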

At one of the companies where we integrated this product into business processes from scratch, we had already run into this kind of problem. In short, a few years earlier the client had set up a minimally functional single-node instance with default settings, but forgot to save the keys. Everything ran on a server with a couple of years of uptime until it was accidentally rebooted. After that, it became clear that the data was irreversibly encrypted and could not be extracted. Fortunately, by that time we had already conducted an audit and migrated the data to a new cluster; the transition to production use just turned out to be a bit more sudden than planned.

Be extremely careful at this stage, it is vitally important!

Lost Shamir keys = permanently locked storage after a daemon restart.


Important aspects of HTTPS
When you issue TLS certificates, make sure that the nodes function correctly not only under normal conditions but also in various emergency situations:
  1. You should be able to access the general load-balanced name vault.example.com. DNS will direct you to the nearest live node of the production cluster.
  2. You should be able to forcibly connect to a specific node of your cluster. For example, vault-nl-1.example.com.
  3. You should have the ability to urgently switch the load to the test cluster, which will become the production cluster during the emergency. Its nodes should successfully respond to both vault-test.example.com and vault.example.com.
Plan your certificate issuance according to the requirements above:

Production cluster nodes:
CN = vault.example.com.
The SAN list contains the individual names of the cluster nodes, such as vault-nl-1.example.com.

Test cluster nodes:
CN = vault-test.example.com.
The SAN list contains vault.example.com (for DRP) and the individual node names, such as vault-test-nl-1.example.com.

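A hedged certbot example for one production node, putting both the shared and the personal name into a single certificate (shown with the Route 53 DNS plugin; adapt it to whatever challenge type you actually use):

certbot certonly --dns-route53 \
  --cert-name vault-nl-1.example.com \
  -d vault.example.com \
  -d vault-nl-1.example.com
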
Don't forget to configure a renew hook for Let's Encrypt with the following script:


rsync -L /etc/letsencrypt/live/{{ inventory_hostname }}/privkey.pem /etc/vault.d/privkey.pem
rsync -L /etc/letsencrypt/live/{{ inventory_hostname }}/fullchain.pem /etc/vault.d/fullchain.pem
chown vault:vault /etc/vault.d/privkey.pem
chown vault:vault /etc/vault.d/fullchain.pem
pkill -SIGHUP vault
Note that pkill -SIGHUP vault makes Vault reload its certificates on the fly, without a restart and without having to unseal it manually after every renewal.

I also recommend adding a CAA record to DNS to pin the allowed certificate issuer.
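The record itself is tiny. Assuming Let's Encrypt is your only issuer, it could look like the commented zone-file line below, and you can verify it with dig:

# example.com.  3600  IN  CAA  0 issue "letsencrypt.org"
dig +short CAA example.com
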
Backing up
Why do we need indirect delayed replication
Now let's look at our delayed replication scheme. Once a day, a task in crontab runs a backup of the production cluster data and sends the data to S3 storage. We use duplicity for this, but you can do it any way you are used to.
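The schedule can be as simple as two cron entries, offset so that the restore on the standby side always picks up a finished backup (paths, script names, and times are placeholders):

# Production leader: take a Raft snapshot and upload it to S3
0 2 * * * /root/scripts/vault_snapshot_backup.sh >> /var/log/vault-backup.log 2>&1
# Test/standby leader: fetch the latest snapshot from S3 and restore it
0 6 * * * /root/scripts/vault_snapshot_restore.sh >> /var/log/vault-restore.log 2>&1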

Then, also once a day, the reverse process is done on the test/standby cluster by downloading the desired snapshot and restoring the cluster to that state. Here's what we accomplish with this approach:
  1. The engineer maintaining Vault can run whatever experiments they see fit. Communication between the clusters is strictly one-way, and no change to the test cluster will affect production.
  2. Once a day, the test zone is completely wiped and brought back to a pristine state. If necessary, you can quickly run the same script manually. As a result, all tests always run on an almost exact, clean copy of the production cluster.
  3. Data can be retrieved quickly in case of accidental deletion or irreversible damage. There is nothing extra to configure; a replica lagging by a couple of days is always nearby.
  4. The small time lag lets you quickly retrieve data whose deletion was not noticed immediately. Usually a couple of days is more than enough to recover passwords that a tired engineer irreversibly deleted yesterday.
  5. You immediately join the category of admins who not only make backups but also verify that they can be restored, and you now do it automatically every day. In effect, the scheme doubles as a daily DRP-style rehearsal of restoring from backup.
Sample script
The backup script itself can be anything you deem reasonable. Below is an example rendered from a Jinja2 template:

#!/bin/bash

LEADER=$(vault status -format=json | jq -r '.leader_address')
HOSTNAME=$(hostname)
ROLE_ID_VAULT_MAINTENANCE={{ vault_maintenance_role_id }}

# Get the secret ID for the vault_maintenance_snapshot AppRole
SECRET_ID_VAULT_MAINTENANCE=$(cat /root/approle_vault_maintenance_snapshot_secret_id)

# Run the backup only if this node is the current raft leader
if [[ "$LEADER" == "https://$HOSTNAME:8200" ]]; then
  # Note: in the Vault CLI, flags such as -format must come before the path
  TOKEN=$(vault write -format=json auth/approle/login role_id="$ROLE_ID_VAULT_MAINTENANCE" secret_id="$SECRET_ID_VAULT_MAINTENANCE" | jq -r '.auth.client_token')
  vault login "$TOKEN" >/dev/null 2>&1
  vault operator raft snapshot save {{ vault_snapshot_location }}
  # Send the result to the cloud backup. This will happen on the leader node only
  /root/scripts/duplicity-backup.sh -c /root/scripts/{{ duplicity_config_name_s3 }}.conf -b > /dev/null
fi
The script assumes that it runs under a fairly limited AppRole in Vault, which only has the rights needed to work with snapshots. It is expected that you have previously placed the secret_id of that role into /root/approle_vault_maintenance_snapshot_secret_id. The script then obtains a token, logs in, and performs the snapshot save (or restore, on the standby cluster).

To prevent the snapshot save and restore from running simultaneously on all nodes in the cluster, we first verify that the node is the leader. This way, we ensure that at any given time, the backup will only be saved or restored on one node in the cluster.
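For completeness, here is a hedged sketch of the restore counterpart that runs on the test/standby cluster; the paths are placeholders and the AppRole login is the same as in the backup script above:

#!/bin/bash
LEADER=$(vault status -format=json | jq -r '.leader_address')
HOSTNAME=$(hostname)

if [[ "$LEADER" == "https://$HOSTNAME:8200" ]]; then
  # 1. Fetch the latest production snapshot from S3 with your backup tool (duplicity in our case)
  # 2. Log in with the maintenance AppRole exactly as in the backup script
  # -force is needed because the snapshot comes from a different cluster; note that the forced
  # restore goes through sys/storage/raft/snapshot-force, so the policy may need that path too,
  # and after the restore the cluster expects the production unseal keys
  vault operator raft snapshot restore -force /tmp/vault-prod.snap
fi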

Here is an example of a minimal policy for that AppRole, sufficient for taking snapshots and restoring them later:

path "sys/storage/raft/snapshot"

{
  capabilities = ["read", "update"]
}

path "auth/token/revoke"

{
  capabilities = ["update"]
}
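For reference, here is roughly how the maintenance AppRole used by the backup script could be wired up (the names match the script above, the TTLs are arbitrary, and the approle auth method is assumed to be enabled already):

# Load the policy and create the role
vault policy write vault-maintenance-snapshot vault-maintenance-snapshot.hcl
vault write auth/approle/role/vault_maintenance_snapshot \
    token_policies="vault-maintenance-snapshot" \
    token_ttl=15m token_max_ttl=30m
# role_id goes into the Ansible variable, secret_id into the file the script reads
vault read -field=role_id auth/approle/role/vault_maintenance_snapshot/role-id
vault write -f -field=secret_id auth/approle/role/vault_maintenance_snapshot/secret-id \
    > /root/approle_vault_maintenance_snapshot_secret_id
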
Brief summary
The tutorial is complete. If everything was done right, you now have the following:
  1. Ansible and Terraform code for fully automated deployment of Vault clusters. All in the best traditions of Infrastructure-as-Code, just the way we like it.
  2. Two independent fault-tolerant clusters, where each node automatically connects to the others and forms a quorum on startup.
  3. Delayed replication is set up from the production cluster to the test cluster. You can quickly extract data in case of a fatal error in production and minimize losses.
  4. You have prepared the platform for future DRP.
Don't forget to add monitoring on top of all this, and you'll get a full testing cycle for every process, from the health check of each node to daily verification that restoration from backup works correctly.

In future posts, I'll try to explain how to properly organize a Disaster Recovery Plan using our cluster as an example. All in the best traditions of DevOps paranoia, just the way we like it, with even a meteorite strike factored into our considerations.

And if needed, we can conduct an audit of how you're currently storing secrets and help set up Vault to make everything secure. Check this out here.
Gumeniuk Ivan
DevOps Engineer