BLOG POST

My First Nomad Cluster

My own notes — not a guide — on standing up my first Nomad cluster on AWS.

Kevin Wang · Nov. 20, 2022, 7:00 a.m. GMT-5

It’s a chilly 35 °F morning here in New York.

It has certainly been a minute since I last wrote anything. Work has been quite busy, and life has been, well... turbulent to say the least. But I’m nearly fully back in the regular swing of things.

This post is going to be about Nomad. Not the why, or the what, but rather the how. More specifically, how I went about standing up my first Nomad cluster on AWS.

This is meaningful, personally, because I have a grand goal of understanding and using all 8 of HashiCorp’s core products.

My progress thus far looks like...

Waypoint: Contributor — I’ve contributed several AWS Lambda and K8s plugin improvements to the project, created the lambda-function-url plugin, and am working on an app-runner plugin... which I should finish up 😅.
Terraform: User — I use Terraform semi-regularly on my team at HashiCorp for the usual infrastructure provisioning. I also created a module that manages a set of nearly identical GitHub files across all these 👆👇 core product repositories.
Nomad: Dabbler — I’ve stood up a production cluster, and run a handful of containerized jobs.
Consul: Dabbler — I’ve dabbled with the native integration with Nomad. However, I wouldn’t be able to explain why or when you would use Consul.
Vault: Dabbler — Vault is next on my list to learn. I’ve only interacted with a local dev server.
Boundary: Never touched — Don’t know what it does.
Packer: Never touched — Should look into this for EC2 AMI creation.
Vagrant: Never touched — Don’t know what it does.

Now, on to the actual write up.

Nomad#

I wanted to create a production Nomad cluster. No, not nomad agent -dev. A production cluster with multiple nodes — the whole shebang with a leader and followers.

Prior to this, my only experience with clusters has been with Kubernetes. And in those situations, the underlying nodes were always abstracted away from me by the platform, whether it was EKS or GKE. I never once had to think about leaders or followers, just declarative YAML files for creating pods.

Checklist#

My criteria for a fully qualified production cluster were simply:

Must have 3 server nodes in alive status, with one leader
Should be publicly accessible
Should be reasonably secure
Should be able to run a containerized job

I’ll reference this checklist throughout the rest of the post.

Initial Wall#

The first resource I reached for was this Nomad tutorial, but I quickly turned away due to the hefty prerequisites.

This particular tutorial called for Packer, Terraform, and Consul usage, to which I thought with a ton of skepticism, “Wtf? It cannot possibly require all these additional tools just to create a Nomad cluster...” That’s like looking for the quickest spaghetti recipe and finding one that requires you to make the pasta and sauce from scratch.

I eventually pulled from various scattered documentation pages and embarked on some good ol’ AWS click-ops. The process that ensued was ultimately what drove me to write this post.

Cluster Configuration#

My very surface-level knowledge of clusters and the fact that Nomad is meant to be run as a cluster led me to create 3 EC2 instances, and then pray that Nomad would magically join them somehow.

Clusters typically call for an odd number of members which helps with fault tolerance and preventing the “split brain” problem. ¹
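The arithmetic behind the odd-number advice can be sketched in a few lines of shell. This assumes a majority-based consensus protocol like the Raft that Nomad servers use; the function names are made up for illustration.

```shell
# A majority-based cluster needs a strict majority (quorum) of servers
# to elect a leader: quorum(n) = floor(n / 2) + 1.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of servers that can fail while a quorum still remains.
fault_tolerance() {
  echo $(( $1 - $(quorum "$1") ))
}

for n in 1 2 3 4 5; do
  echo "servers=$n quorum=$(quorum "$n") tolerates=$(fault_tolerance "$n")"
done
```

Going from 3 to 4 servers raises the quorum from 2 to 3 but still only tolerates one failure, which is why the even sizes buy nothing extra.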

3 Little EC2 Piggies#

Nomad suggests using fairly powerful specs for Nomad servers. However, for my discovery needs, the cheap t2.small instance size worked just fine.

I created three of these Amazon Linux EC2 instances, SSH’d into each and did some repetitive manual setup, such as installing nomad itself and zsh for productivity.

# SSH in
ssh -i ~/Downloads/my_key.pem ec2-user@ec2-XXX-XXX-XXX-XXX.compute-1.amazonaws.com

# Install Nomad & ZSH
sudo yum install -y yum-utils
sudo yum-config-manager --add-repo https://rpm.releases.hashicorp.com/AmazonLinux/hashicorp.repo
sudo yum -y install nomad
sudo yum -y install zsh

# Install oh-my-zsh
sudo yum -y install git
sh -c "$(curl -fsSL https://raw.github.com/ohmyzsh/ohmyzsh/master/tools/install.sh)"

# Install nomad autocompletion
nomad -autocomplete-install
source ~/.zshrc

# Verify
nomad version
# Nomad v1.4.2 (039d70eeef5888164cad05e7ddfd8b6f8220923b)

Nomad Config Files#

Nomad looks for a config file at /etc/nomad.d/nomad.hcl by default. These are the configurations that I arrived at for my leader and followers.

# Leader config file
datacenter = "dc1"
data_dir   = "/opt/nomad/data"
bind_addr  = "0.0.0.0"

server {
  enabled          = true
  bootstrap_expect = 3
  server_join {
    retry_join = [
      "<internal-ip-1>",
      "<internal-ip-2>",
      "<internal-ip-3>",
    ]
    retry_max      = 0
    retry_interval = "15s"
  }
}

client {
  enabled = true
  servers = ["127.0.0.1"]
}

acl {
  enabled = true
}

# Follower config file
datacenter = "dc1"
data_dir   = "/opt/nomad/data"
bind_addr  = "0.0.0.0"

server {
  enabled          = true
  bootstrap_expect = 1
}

client {
  enabled = true
  servers = [
    "<internal-ip-1>",
    "<internal-ip-2>",
    "<internal-ip-3>",
  ]
}

ui {
  enabled = false
}
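Hand-editing the same IP list on three machines is easy to get wrong, so here is a rough sketch of generating the server portion of the file instead. The helper name and the example IPs are hypothetical. It only covers the server stanza from the leader file (no client, acl, or ui stanzas), and it assumes bootstrap_expect should equal the number of IPs passed in, which is a choice of this sketch and not something the configs above do.

```shell
# Hypothetical helper: print a server config shaped like the leader file above,
# with retry_join filled in from the IPs passed as arguments.
# Assumption: bootstrap_expect is set to the number of IPs given.
render_server_config() {
  cat <<EOF
datacenter = "dc1"
data_dir   = "/opt/nomad/data"
bind_addr  = "0.0.0.0"

server {
  enabled          = true
  bootstrap_expect = $#
  server_join {
    retry_join = [
EOF
  for ip in "$@"; do
    printf '      "%s",\n' "$ip"
  done
  cat <<EOF
    ]
    retry_max      = 0
    retry_interval = "15s"
  }
}
EOF
}

# Example with made-up private IPs; review the output before writing it anywhere.
render_server_config 10.0.1.10 10.0.1.11 10.0.1.12
```

The output could then be redirected into /etc/nomad.d/nomad.hcl on each instance once it looks right.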

Running#

At this point I was ready to run Nomad.

This enterprise guide was pretty helpful in showing how to run Nomad as a background service using systemd/systemctl.

sudo systemctl enable nomad
sudo systemctl start nomad
sudo systemctl status nomad

# ● nomad.service - Nomad
#    Loaded: loaded (/usr/lib/systemd/system/nomad.service; enabled; vendor preset: disabled)
#    Active: active (running) since Thu 2022-11-17 02:31:34 UTC; 3 days ago
#      Docs: https://nomadproject.io/docs/
#  Main PID: 16415 (nomad)
#     Tasks: 25
#    Memory: 77.0M
#    CGroup: /system.slice/nomad.service
#            ├─ 5557 /usr/bin/nomad logmon
#            └─16415 /usr/bin/nomad agent -config /etc/nomad.d

There are some chmod commands for permitting file/folder modifications but I don’t remember if they're needed or not.

UI#

The Nomad web UI — viewable at port 4646 — was my next stop and the quickest way to validate cluster health.

Check ✅ — I had created my first, fully functional production Nomad cluster, reachable at a public URL like http://<ec2-public-ip>:4646/ui/jobs.

~~Must have 3 server nodes in alive status, with one leader~~
~~Should be publicly accessible~~
Should be reasonably secure
Should be able to run a containerized job
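The first checklist item can also be checked from a shell instead of the UI. This is a minimal sketch using Nomad's HTTP status endpoints (/v1/status/leader and /v1/status/peers). The function names are made up, NOMAD_ADDR defaults to a local agent, and the peer count is done with grep rather than a real JSON parser, so treat it as a quick sanity check only.

```shell
# Address of any Nomad agent; defaults to the local one.
NOMAD_ADDR="${NOMAD_ADDR:-http://127.0.0.1:4646}"

# Count the quoted entries in a JSON array of strings read from stdin,
# e.g. the body returned by GET /v1/status/peers.
count_json_strings() {
  grep -o '"[^"]*"' | wc -l | tr -d ' '
}

# Succeeds only if a leader is reported and the peer count equals $1.
check_cluster() {
  expected="$1"
  leader="$(curl -fsS "$NOMAD_ADDR/v1/status/leader")" || return 1
  peers="$(curl -fsS "$NOMAD_ADDR/v1/status/peers" | count_json_strings)"
  echo "leader=$leader peers=$peers"
  [ -n "$leader" ] && [ "$peers" -eq "$expected" ]
}

# Usage (run on one of the instances):
#   check_cluster 3 && echo "cluster looks healthy"
```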

Basic Auth#

Nomad, including the UI, is not secure by default, unlike, say, Waypoint and its UI.

Without diving into fine-grained access control, I wanted basic authentication to be required for all requests to my cluster. This was to keep out any trolls or bad actors.

My leader config file with an additional acl stanza:

acl {
  enabled = true
}

I restarted the process (sudo systemctl restart nomad) and ran nomad acl bootstrap to generate a basic token for authenticated CLI and UI requests.

nomad acl bootstrap

# Accessor ID  = abcdefgh-ijkl-mnop-qrst-uvwxyz123456
# Secret ID    = abcdefgh-ijkl-mnop-qrst-uvwxyz123456
# Name         = Bootstrap Token
# Type         = management
# Global       = true
# Create Time  = 2022-11-12 16:03:23.785510817 +0000 UTC
# Expiry Time  = <none>
# Create Index = 1823
# Modify Index = 1823
# Policies     = n/a
# Roles        = n/a

Note

Pass the Secret ID to either the UI or set it as an environment variable for the CLI.

export NOMAD_TOKEN=abcdefgh-ijkl-mnop-qrst-uvwxyz123456
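Copying the Secret ID by hand works, but it can also be pulled out of the bootstrap output directly. A small sketch, with a made-up helper name, that relies only on the "Secret ID = ..." line format shown above:

```shell
# Print the value of the "Secret ID" line from `nomad acl bootstrap`
# output read on stdin.
extract_secret_id() {
  awk '/^Secret ID/ { print $NF }'
}

# Usage (bootstrap only succeeds once per cluster, so capture it the first time):
#   export NOMAD_TOKEN="$(nomad acl bootstrap | extract_secret_id)"
#
# The same token works for raw HTTP requests via the X-Nomad-Token header:
#   curl -H "X-Nomad-Token: $NOMAD_TOKEN" "http://127.0.0.1:4646/v1/jobs"
```

Note that this discards the Accessor ID and the rest of the output, so it may be worth saving the full output somewhere safe first.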

Now my cluster was reasonably secure. Time to schedule a job.

~~Must have 3 server nodes in alive status, with one leader~~
~~Should be publicly accessible~~
~~Should be reasonably secure~~
Should be able to run a containerized job

Jobs#

The most mind-blowing thing I’ve witnessed with Nomad is the spawning of minecarts in Minecraft.

But I simply wanted to run a Docker container, similar to Kubernetes’ main use case.

Docker Driver#

The machines that Nomad runs on must have Docker installed for the docker driver to work.

# install docker on Amazon Linux 2
sudo yum -y install docker
sudo usermod -a -G docker ec2-user
id ec2-user
newgrp docker

# run docker as a background service
sudo systemctl enable docker.service
sudo systemctl start docker.service

After this is done, the UI should show the Docker driver status as Healthy.

Scheduling a Container#

I had a Docker image from a previous project handy, which I wanted to deploy with Nomad.

Note

Image info here: https://github.com/users/thiskevinwang/packages/container/package/wasm

Nomad has a handy nomad job init command that generates a sample job file, whose HCL stanzas definitely come with a bit of a learning curve.

I trimmed the sample file down to the following:

job "docker-wasm-job" {
  datacenters = ["dc1"]
  type        = "service"
  update {
    max_parallel      = 1
    min_healthy_time  = "10s"
    healthy_deadline  = "3m"
    progress_deadline = "10m"
    auto_revert       = false
    canary            = 0
  }

  migrate {
    max_parallel     = 1
    health_check     = "checks"
    min_healthy_time = "10s"
    healthy_deadline = "5m"
  }

  group "docker-wasm-grp" {
    count = 3
    network {
      port "http" {
        to     = 8888 # container port the app runs on
        static = 80   # host port to expose
      }
    }

    service {
      name     = "docker-wasm-svc"
      tags     = ["global", "docker-wasm-grp"]
      port     = "http"
      provider = "nomad"

      check {
        type     = "http"
        path     = "/"
        interval = "5s"
        timeout  = "5s"
        method   = "GET"
      }
    }

    restart {
      attempts = 2
      interval = "30m"
      delay    = "15s"
      mode     = "fail"
    }

    ephemeral_disk {
      size = 300
    }

    task "docker-wasm-task" {
      driver = "docker"
      config {
        image          = "ghcr.io/thiskevinwang/wasm:latest"
        ports          = ["http"]
        auth_soft_fail = true
        force_pull     = true
      }

      resources {
        cpu    = 500 # 500 MHz
        memory = 512 # 512MB
      }
    }
  }
}

This then gets run with nomad job run docker-wasm.nomad.
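A slightly more cautious way to submit the job, as a sketch: validate the file, look at the scheduler's dry-run plan, and only then run it. The wrapper name is made up; nomad job validate, nomad job plan, and nomad job run are the real subcommands. The plan step is where placement problems would likely surface, for example the static port 80 combined with count = 3, which needs three separate client nodes because only one allocation per host can hold that port.

```shell
# Validate the job file, show the scheduler's dry-run plan, then submit it.
deploy_job() {
  job_file="$1"
  nomad job validate "$job_file" || return 1
  # `nomad job plan` exits 1 when the plan would create or destroy
  # allocations, so only exit code 255 (an error) is treated as failure.
  nomad job plan "$job_file"
  [ "$?" -ne 255 ] || return 1
  nomad job run "$job_file"
}

# Usage:
#   deploy_job docker-wasm.nomad
#   nomad job status docker-wasm-job
```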

Voila! — It was working!

Mission complete!

~~Must have 3 server nodes in alive status, with one leader~~

~~Should be publicly accessible~~

~~Should be reasonably secure~~

~~Should be able to run a containerized job~~

AWS#

I was largely complete with my original goal... and inevitably went on to do some additional hardening of my cluster setup, like fronting my cluster with a load balancer. This was mostly so I could share a URL with my coworkers that wasn’t an IP address of an EC2 instance.

This is what the final architecture looked like:

Retrospective#

After having gone through the end-to-end flow, I thought about some things that I could have done differently, or could improve moving forward.

Ok, `packer`#

Could I use Packer to build an AMI with Nomad, ZSH, and Docker already installed? I guess it’s time to use Packer...

Ok, `terraform`#

All the click-ops I went through (not to mention the eventual teardown) could be replaced by Terraform infrastructure-as-code.

This would be a must for a team setting.

No Public IPs#

Given that my load balancer has a public URL, I could omit public IPs for all the EC2 instances. This is only adjustable at the time of instance creation though.

This would also help narrow down the inbound rules on my security group.

`ARM` is Cheaper than `x86`#

I had initially chosen x86 for my EC2 architecture because I didn’t want to run into any potentially obscure issues with the newer ARM architecture.

In the past, I’ve run into a “fun” bug with running an ARM Docker container on AWS App Runner. App Runner only supports x86 and would hang for up to 10 minutes only to eventually fail with an “exec format error”.

A quick cost comparison showed that ARM instance types are slightly cheaper than x86, and Nomad supports both so it probably would have been fine.

| Instance Type | Arch | vCPU | Memory (GiB) | Price (USD/hour) |
| --- | --- | --- | --- | --- |
| t4g.small | `ARM` | 2 | 2 | 0.0168 |
| t2.small | `x86` | 1 | 2 | 0.0230 |
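To put "slightly cheaper" into numbers, here is a quick back-of-the-envelope calculation. It assumes the prices in the table are hourly On-Demand rates, a 3-node cluster running all month, and 730 hours in an average month; those last two figures are assumptions for illustration.

```shell
# Hourly prices copied from the table above (assumed USD per hour).
ARM_HOURLY=0.0168   # t4g.small
X86_HOURLY=0.0230   # t2.small

# Assumptions for illustration: 3 instances, 730 hours in an average month.
NODES=3
HOURS=730

# Monthly cost of the whole cluster at a given hourly price.
monthly_cost() {
  awk -v price="$1" -v nodes="$NODES" -v hours="$HOURS" \
    'BEGIN { printf "%.2f\n", price * nodes * hours }'
}

# Percentage saved by paying the first price instead of the second.
savings_percent() {
  awk -v arm="$1" -v x86="$2" 'BEGIN { printf "%.1f\n", (x86 - arm) / x86 * 100 }'
}

echo "t4g.small x$NODES: \$$(monthly_cost "$ARM_HOURLY")/month"
echo "t2.small  x$NODES: \$$(monthly_cost "$X86_HOURLY")/month"
echo "ARM saves $(savings_percent "$ARM_HOURLY" "$X86_HOURLY")% at these prices"
```

At these rates the ARM option comes out roughly 27% cheaper, and the t4g.small also has one more vCPU.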

I believe because ARM is more energy-efficient than x86, AWS is charging you less for essentially less electricity used... 🤷‍♂️

Raft Voter Rejection#

There is this seemingly inconsequential yet non-stop warning in the Nomad logs that I want to figure out the root cause of:

nomad monitor
# 2022-11-20T20:46:25.270Z [WARN]  nomad.raft: rejecting vote request since node is not in configuration: from=XXX.XXX.XXX.XXX:4647
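One possible first step in chasing this down, offered as a sketch and not a diagnosis: pull the source address out of the warning and check whether it appears in the cluster's current Raft configuration. The helper names are made up; nomad operator raft list-peers is the real subcommand and needs a management token once ACLs are enabled.

```shell
# Pull the address after "from=" out of a "rejecting vote" log line on stdin.
vote_request_source() {
  sed -n '/rejecting vote/ s/.*from=\([^ ]*\).*/\1/p'
}

# Succeeds if the given address appears in the current Raft peer list.
is_raft_peer() {
  nomad operator raft list-peers | grep -qF "$1"
}

# Usage (paste one warning line from the logs in place of the placeholder):
#   addr="$(echo '<log line>' | vote_request_source)"
#   is_raft_peer "$addr" && echo "$addr is a peer" || echo "$addr is NOT a peer"
```

If the address turns out not to be a peer, comparing that instance's server stanza against the others would be the next thing to look at.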

¹ https://stackoverflow.com/questions/58823341/why-is-it-recommended-to-create-clusters-with-odd-number-of-nodes