Rescue Vault from a broken raft setup
All code can be found on GitHub
This is a simple demonstration of how to rescue Vault when Raft quorum is lost due to a bad setup (2 nodes), and, finally, what happens if you follow the suggested setup of 3 or more nodes and one node fails.
This is meant to be a purely educational experience; do not use this for production, or on a production cluster.
Requirements
The setup expects the VirtualBox host-only network to allow the following IP range: 192.168.56.10-192.168.56.12
- A virtual machine software (VirtualBox) https://www.virtualbox.org/
- HashiCorp Vagrant https://www.vagrantup.com/
- Ansible https://www.ansible.com/
- Python 3 https://www.python.org/
- Public SSH key saved in ~/.ssh/id_rsa.pub (the matching private key at ~/.ssh/id_rsa is used by Ansible)
- Strong enough machine to run up to 3 headless Ubuntu 22.04 servers
Usage
Note: The Ansible code from the pre-setup stage can be used to configure Vault on any setup of your choosing; however, it is not meant for production use.
Pre-setup - Server installation
For the sake of ease, create these files in a subfolder named servers. Create a new Vagrantfile with the following contents:
Vagrant.configure("2") do |config|
  config.vm.define :first do |first|
    first.vm.box = "ubuntu/jammy64"
    first.vm.hostname = "first"
    first.vm.network :private_network, ip: "192.168.56.10"
    first.vm.synced_folder ".", "/vagrant", create: true
    ssh_pub_key = File.readlines("#{Dir.home}/.ssh/id_rsa.pub").first.strip
    first.vm.provision 'shell', inline: "echo #{ssh_pub_key} >> /home/vagrant/.ssh/authorized_keys", privileged: false
    first.vm.provision "ansible_local" do |ansible|
      ansible.playbook = "install_vault.yml"
      ansible.extra_vars = {
        server_ip: "192.168.56.10"
      }
      ansible.pip_install_cmd = "sudo apt-get install -y python3-distutils && curl -s https://bootstrap.pypa.io/get-pip.py | sudo python3"
      ansible.install_mode = "pip"
    end
  end
  config.vm.define :second do |second|
    second.vm.box = "ubuntu/jammy64"
    second.vm.hostname = "second"
    second.vm.network :private_network, ip: "192.168.56.11"
    second.vm.synced_folder ".", "/vagrant", create: true
    ssh_pub_key = File.readlines("#{Dir.home}/.ssh/id_rsa.pub").first.strip
    second.vm.provision 'shell', inline: "echo #{ssh_pub_key} >> /home/vagrant/.ssh/authorized_keys", privileged: false
    second.vm.provision "ansible_local" do |ansible|
      ansible.playbook = "install_vault.yml"
      ansible.extra_vars = {
        server_ip: "192.168.56.11"
      }
      ansible.pip_install_cmd = "sudo apt-get install -y python3-distutils && curl -s https://bootstrap.pypa.io/get-pip.py | sudo python3"
      ansible.install_mode = "pip"
    end
  end
  config.vm.define :third do |third|
    third.vm.box = "ubuntu/jammy64"
    third.vm.hostname = "third"
    third.vm.network :private_network, ip: "192.168.56.12"
    third.vm.synced_folder ".", "/vagrant", create: true
    ssh_pub_key = File.readlines("#{Dir.home}/.ssh/id_rsa.pub").first.strip
    third.vm.provision 'shell', inline: "echo #{ssh_pub_key} >> /home/vagrant/.ssh/authorized_keys", privileged: false
    third.vm.provision "ansible_local" do |ansible|
      ansible.playbook = "install_vault.yml"
      ansible.extra_vars = {
        server_ip: "192.168.56.12"
      }
      ansible.pip_install_cmd = "sudo apt-get install -y python3-distutils && curl -s https://bootstrap.pypa.io/get-pip.py | sudo python3"
      ansible.install_mode = "pip"
    end
  end
end
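Before booting anything, you can let Vagrant check the file for syntax errors; a quick, optional sanity check from the servers folder:
# Optional: check the Vagrantfile for errors before booting
vagrant validate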
Also ensure you create the playbook file as well as the Jinja2 template used for the Vault configuration:
vault.hcl.j2
storage "raft" {
path = "/opt/vault/data/"
node_id = ""
retry_join {
leader_api_addr = "http://192.168.56.10:8200"
}
retry_join {
leader_api_addr = "http://192.168.56.11:8200"
}
}
listener "tcp" {
address = "0.0.0.0:8200"
tls_disable = "true"
}
api_addr = "http://:8200"
cluster_addr = "http://:8201"
ui = true
cluster_name = "vault"
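Once a box has been provisioned, it can be worth checking what the template actually rendered to; a minimal check, assuming the first box is up:
# Inspect the rendered config on the first node; expect node_id = "first"
# and api_addr/cluster_addr pointing at 192.168.56.10
vagrant ssh first -c 'cat /etc/vault.d/vault.hcl'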
install_vault.yml
- name: Setup remote Vault servers
  hosts: all
  become: true
  tasks:
    - name: Add the HashiCorp apt signing key
      ansible.builtin.apt_key:
        url: https://apt.releases.hashicorp.com/gpg
        state: present

    - name: Add the HashiCorp repository to the sources list
      ansible.builtin.apt_repository:
        repo: deb [arch=amd64] https://apt.releases.hashicorp.com jammy main
        state: present

    - name: Update repositories cache and install Vault
      apt:
        pkg:
          - vault
          - jq
        update_cache: yes
        state: present

    - name: Create the Vault data directory
      file:
        path: /opt/vault/data/
        state: directory
        owner: vault
        group: vault
        mode: 0775

    - name: Write the vault.hcl file
      template:
        src: vault.hcl.j2
        dest: /etc/vault.d/vault.hcl
        owner: vault
        group: vault
      vars:
        hostname: "{{ ansible_hostname }}"
        ip: "{{ server_ip }}"

    - name: Recursively change ownership of the Vault directory
      ansible.builtin.file:
        path: /opt/vault/
        state: directory
        recurse: yes
        owner: vault
        group: vault

    - name: Enable the Vault service (leave it stopped for now)
      ansible.builtin.service:
        name: vault
        enabled: yes
        state: stopped

    - name: Add the Vault address export to .profile
      ansible.builtin.lineinfile:
        path: "/home/vagrant/.profile"
        line: 'export VAULT_ADDR=http://127.0.0.1:8200'
        state: present
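To confirm the playbook did its job on a box (Vault installed, service enabled but not yet started), a quick check from the servers folder:
# Vault binary present, service enabled but still inactive
vagrant ssh first -c 'vault version && systemctl is-enabled vault'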
Pre-setup - Ansible files
In this section, you will create four files: two playbooks, one inventory file, and one configuration file for Ansible. To ensure you don't mix these up with the server setup files, create them in a folder named ansible, next to the servers folder you used in the previous steps.
Let's start with the configuration file, ansible.cfg:
[defaults]
inventory = ./inventory.yml
host_key_checking = False
This ensures two things: first, that the inventory we use is the one for the demo setup; second, that strict SSH host key checking is skipped for the freshly created VMs. Change this file to suit your own needs if you diverge from the files created in the pre-setup server stage.
Second, we will create the inventory file, inventory.yml. This is based on the Vagrantfile configuration above.
vault:
  hosts:
    first:
      ansible_host: 192.168.56.10
      ansible_ssh_private_key_file: ~/.ssh/id_rsa
      ansible_user: vagrant
    second:
      ansible_host: 192.168.56.11
      ansible_ssh_private_key_file: ~/.ssh/id_rsa
      ansible_user: vagrant
    third:
      ansible_host: 192.168.56.12
      ansible_ssh_private_key_file: ~/.ssh/id_rsa
      ansible_user: vagrant
local:
  hosts:
    localhost:
      ansible_host: 127.0.0.1
      ansible_connection: local
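With ansible.cfg and inventory.yml in place, you can verify SSH connectivity before running any playbook:
# Ping all Vault VMs defined in the inventory (only the ones that are up will answer)
ansible vault -m ping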
Finally, we will create two playbooks. The first will initialize and unseal Vault on your Vagrant boxes: initialize_and_unseal_vault.yml
- name: Initialize and unseal first Vault
  hosts: first
  become: true
  gather_facts: true
  environment:
    VAULT_ADDR: "http://127.0.0.1:8200"
  tasks:
    - name: Start Vault
      ansible.builtin.service:
        name: vault
        enabled: yes
        state: started

    - name: Check if Vault is initialized
      shell: vault status -format=json
      register: vault_status
      failed_when: ( vault_status.rc not in [ 0, 2 ] )

    - name: Set Vault status output as a variable
      set_fact:
        vault_status: "{{ vault_status.stdout | from_json }}"

    - name: Initialize Vault if necessary
      shell: vault operator init -key-shares=1 -key-threshold=1 -format=json
      register: vault
      when: not vault_status.initialized

    - name: Save Vault response as variable
      set_fact:
        vault_response: "{{ vault.stdout | from_json }}"
      when: not vault_status.initialized

    - name: Write root token to file
      copy:
        content: "{{ vault_response.root_token }}"
        dest: "root_token"
      when: not vault_status.initialized

    - name: Save root token to variable
      slurp:
        src: "root_token"
      register: root_token

    - name: Write unseal keys to file
      copy:
        content: "{{ vault_response.unseal_keys_b64 }}"
        dest: "unseal_keys"
      when: not vault_status.initialized

    - name: Read unseal_keys from file
      slurp:
        src: "unseal_keys"
      register: slurpfile

    - name: Decode and clean the unseal key
      set_fact:
        unseal_key: "{{ slurpfile['content'] | b64decode | regex_replace('[\\[\\]]', '') }}"

    - name: Check if Vault is initialized
      shell: vault status -format=json
      register: vault_status
      failed_when: ( vault_status.rc not in [ 0, 2 ] )

    - name: Set Vault status output as a variable
      set_fact:
        vault_status: "{{ vault_status.stdout | from_json }}"

    - name: Unseal Vault
      shell: vault operator unseal {{ unseal_key }}
      when: vault_status.initialized and vault_status.sealed

    - name: Write the root_token to a local file
      delegate_to: localhost
      become: false
      copy:
        content: "{{ root_token['content'] | b64decode }}"
        dest: "root_token.txt"

    - name: Write the unseal_key to a local file
      delegate_to: localhost
      become: false
      copy:
        content: "{{ unseal_key }}"
        dest: "unseal_key.txt"

    - name: Enable secret engine
      shell: vault secrets enable kv
      environment:
        VAULT_TOKEN: "{{ root_token['content'] | b64decode }}"
        VAULT_ADDR: "http://127.0.0.1:8200"
      register: enable_kv
      failed_when: ( enable_kv.rc not in [ 0, 2 ] )

- name: Initialize and unseal second/third Vault
  hosts: second third
  become: true
  gather_facts: true
  environment:
    VAULT_ADDR: "http://127.0.0.1:8200"
  tasks:
    - name: Read unseal_keys from file
      set_fact:
        unseal_key: "{{ lookup('file', 'unseal_key.txt') }}"

    - name: Start Vault
      ansible.builtin.service:
        name: vault
        enabled: yes
        state: started

    - name: Check if Vault is initialized
      shell: vault status -format=json
      register: vault_status
      failed_when: ( vault_status.rc not in [ 0, 2 ] )

    - name: Set Vault status output as a variable
      set_fact:
        second_vault_status: "{{ vault_status.stdout | from_json }}"

    - name: Initialize Vault if necessary
      shell: vault operator init -key-shares=1 -key-threshold=1 -format=json
      register: second_vault
      when: not second_vault_status.initialized

    - name: Unseal Vault
      shell: vault operator unseal {{ unseal_key }}
      when: second_vault_status.sealed

    - name: Pause for 10 seconds
      ansible.builtin.pause:
        seconds: 10

- name: List enabled secrets engines
  hosts: vault
  become: false
  gather_facts: false
  tasks:
    - name: Read root token from file
      set_fact:
        root_token: "{{ lookup('file', 'root_token.txt') }}"

    - name: List secrets engines and show that the accessor is the same on both nodes
      shell: vault secrets list -format=json | jq '."kv/".accessor'
      environment:
        VAULT_TOKEN: "{{ root_token }}"
        VAULT_ADDR: "http://127.0.0.1:8200"
      register: secret_list

    - name: Print listed accessors
      debug:
        msg: "{{ secret_list.stdout }}"
The last file we need to create is the one that causes the second Vault server to fail. It does this by forcibly removing the Raft database, simulating corruption.
cause_quorum_break.yml
- name: Split the brain by killing second
  hosts: second
  become: true
  gather_facts: false
  environment:
    VAULT_ADDR: "http://127.0.0.1:8200"
  tasks:
    - name: Stop Vault
      ansible.builtin.systemd:
        name: vault
        state: stopped

    - name: Show that there are two folders
      ansible.builtin.find:
        paths: /opt/vault/
        file_type: directory
      register: vault_folder

    - name: Print nr of folders
      debug:
        msg: "{{ vault_folder.matched }}"

    - name: Delete Vault raft storage
      ansible.builtin.file:
        dest: /opt/vault/data/
        state: absent

    - name: Ensure the folder is deleted
      ansible.builtin.find:
        paths: /opt/vault/
        file_type: directory
      register: vault_folder

    - name: Print nr of folders
      debug:
        msg: "{{ vault_folder.matched }}"

    - name: Pause for 5 seconds
      ansible.builtin.pause:
        seconds: 5

    - name: Recreate raft storage folder
      file:
        path: /opt/vault/data/
        state: directory
        owner: vault
        group: vault
        mode: 0775

    - name: Start Vault
      ansible.builtin.systemd:
        name: vault
        state: started

    - name: Pause for 5 seconds
      ansible.builtin.pause:
        seconds: 5

    - name: Check if Vault is initialized
      shell: vault status -format=json
      register: vault_status
      failed_when: ( vault_status.rc not in [ 0, 2 ] )

    - name: Set Vault status output as a variable
      set_fact:
        vault_status: "{{ vault_status.stdout | from_json }}"

    - name: Initialize Vault if necessary
      shell: vault operator init -key-shares=1 -key-threshold=1 -format=json
      register: vault
      when: not vault_status.initialized

    - name: Save Vault response as variable
      set_fact:
        vault_response: "{{ vault.stdout | from_json }}"
      when: not vault_status.initialized

    - name: Read unseal_keys from file
      set_fact:
        unseal_key: "{{ lookup('file', 'unseal_key.txt') }}"

    - name: Unseal Vault
      shell: vault operator unseal {{ unseal_key }}
      ignore_errors: true

- name: Verify split
  hosts: first
  become: false
  gather_facts: false
  environment:
    VAULT_ADDR: "http://127.0.0.1:8200"
  tasks:
    - name: Read root token from file
      set_fact:
        root_token: "{{ lookup('file', 'root_token.txt') }}"

    - name: Check Vault peer list
      shell: vault operator raft list-peers -format=json
      register: vault_peers
      failed_when: ( vault_peers.rc not in [ 0, 2 ] )
      environment:
        VAULT_TOKEN: "{{ root_token }}"

    - name: Print Vault Peers list when error
      debug:
        msg: "{{ vault_peers.stderr }}"
      when: vault_peers.stderr | length > 0

    - name: Extract node_id values using json_query
      vars:
        query: "data.config.servers[*].node_id"
      set_fact:
        node_ids: "{{ vault_peers.stdout | from_json | json_query(query) }}"
      when: vault_peers.stderr | length == 0

    - name: Print Vault Peers list when no error
      debug:
        msg: "{{ node_ids }}"
      when: vault_peers.stderr | length == 0
Setup initial Cluster (ansible)
Start by running vagrant up first second inside the servers folder.
This will create two servers and provision them with Ansible inside the virtual machines.
Once the cluster is up and running, jump into the ansible folder and run ansible-playbook initialize_and_unseal_vault.yml.
This will do the following:
- Start Vault on the first server
- Initialize it with 1 key-share
- Record the unseal key and root token and save them to your local machine
- Unseal the first Vault
- Enable the KV secrets engine in the cluster
- Start Vault on the second server
- Unseal the second Vault
- List the accessor of the KV secrets engine from both VMs to show that they are identical, proving that Raft replication is working (a manual version of this check is sketched below)
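As a manual version of the accessor comparison, you can query both nodes straight from your machine, using the token the playbook saved locally (this assumes the vault CLI and jq are installed on your machine):
# The kv accessor should be identical when queried via either node
for ip in 192.168.56.10 192.168.56.11; do
  VAULT_ADDR="http://$ip:8200" VAULT_TOKEN="$(cat root_token.txt)" \
    vault secrets list -format=json | jq '."kv/".accessor'
done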
Destroy your Raft Quorum (ansible)
With the cluster up and running from the previous stage, you will now destroy the quorum of the cluster. Since the current setup uses only 2 nodes, this renders Vault inoperable, and because we destroy the Raft database on the second node, it is as if that node had been completely corrupted.
To start, from the ansible folder, run: ansible-playbook cause_quorum_break.yml
This will do the following on the second server:
- Stop Vault
- List the number of folders found in /opt/vault/ (should be 2 from the start)
- Delete the data folder and show that /opt/vault/ now only contains one folder (the TLS certs generated on Vault installation)
- Recreate the data folder, start Vault, and try to unseal it; for readability, the error is ignored but printed to the terminal (a rough manual equivalent of these steps is sketched below)
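For reference, the playbook's destructive steps boil down to roughly the following on the second box; a manual sketch, for this demo only:
# Demo only: simulate Raft data corruption on the second node
sudo systemctl stop vault
ls /opt/vault/                                     # expect two folders: data and tls
sudo rm -rf /opt/vault/data/
sudo install -g vault -o vault -d /opt/vault/data/
sudo systemctl start vault
vault status                                       # the node now reports as uninitialized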
This next step will happen on the first server:
- It tries to list the Raft peers in the Vault cluster and prints the resulting error
Rescue your first server (manual)
The next steps are all manual, meant as practice for restoring quorum; a consolidated sketch of the commands follows the list below.
This will all be done on the first server. SSH into it using ssh vagrant@192.168.56.10.
- Create a JSON file named peers.json inside the folder /opt/vault/data/raft/ with the following content (adjust the values if you run a different server setup):
  [ { "id": "first", "address": "192.168.56.10:8201", "non_voter": false } ]
- Restart Vault using sudo systemctl restart vault
- Save the root token to an environment variable on the server using: export VAULT_TOKEN=$(cat root_token)
- Unseal the server using: vault operator unseal $(cat unseal_keys | tr -d '[]"')
- Check that the server is up and running, either with vault status (noting that HA Mode is "active") or with vault operator raft list-peers
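Put together, the recovery on the first server looks roughly like this (a sketch of the steps above; the chown is there so the vault user can read the recovery file):
# 1. Tell Raft that the cluster now consists of only this node
sudo tee /opt/vault/data/raft/peers.json > /dev/null <<'EOF'
[ { "id": "first", "address": "192.168.56.10:8201", "non_voter": false } ]
EOF
sudo chown vault:vault /opt/vault/data/raft/peers.json
# 2. Restart Vault so it applies the recovery configuration
sudo systemctl restart vault
# 3. Authenticate and unseal with the values saved during init
export VAULT_TOKEN=$(cat root_token)
vault operator unseal $(cat unseal_keys | tr -d '[]"')
# 4. Confirm the node is active again
vault status
vault operator raft list-peers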
Rescue the second server (manual)
After this, you can rescue the second server by doing the following on the second server (a consolidated sketch follows the list):
- Stop Vault using sudo systemctl stop vault.
- Delete the folder /opt/vault/data/ and recreate it using sudo install -g vault -o vault -d /opt/vault/data/.
- Start Vault using sudo systemctl start vault and wait a couple of seconds.
- Unseal Vault using vault operator unseal and the unseal key found in the file unseal_key.txt on your computer.
- Log in to Vault using vault login and the root token found in the file root_token.txt on your computer.
- Verify that the second server has been restored to the cluster using vault operator raft list-peers, or using vault secrets list and noting that the kv secrets engine is enabled (from the initial setup).
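As one consolidated sketch, run on the second server (the placeholders come from the files saved on your machine):
# Wipe and recreate the Raft storage, then rejoin via retry_join
sudo systemctl stop vault
sudo rm -rf /opt/vault/data/
sudo install -g vault -o vault -d /opt/vault/data/
sudo systemctl start vault && sleep 5
vault operator unseal <unseal key from unseal_key.txt>
vault login <root token from root_token.txt>
vault operator raft list-peers   # both nodes should be listed again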
Vault setup following the recommendation
Back in the servers folder, run vagrant destroy --force followed by vagrant up. This will start three servers with Vault installed.
Once they are up and running, jump into the ansible folder and run ansible-playbook initialize_and_unseal_vault.yml.
Once the playbook is done, run: ansible-playbook cause_quorum_break.yml.
Finally, it will print the list of Vault peers from the first server. This shows that even though the second server lost all of its Raft data, Vault was able to retain quorum and bring the second server back in.
In a real scenario, where you don't just delete the /opt/vault/data/ folder but the server is fully broken, you would remove the dead server with the vault operator raft remove-peer <node-id> command, as shown below.
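For reference, removing a dead peer uses the node_id that the config template wrote (here second), run against the surviving leader with a valid token:
# Remove the broken node from the Raft configuration, then verify
vault operator raft remove-peer second
vault operator raft list-peers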