8 changes: 8 additions & 0 deletions .envrc
@@ -0,0 +1,8 @@
# Tells antonbabenko/pre-commit-terraform's terraform_validate, terraform_fmt,
# and terraform_docs hooks to use OpenTofu rather than HashiCorp Terraform —
# matches the tfroot-runner CI image (which symlinks tofu→terraform) and is
# required because the s3 backend config uses tofu-only attributes
# (assume_role_duration_seconds) that the HashiCorp terraform binary rejects.
#
# Auto-sourced by direnv on cd. Non-direnv users: see AGENTS.md.
export PCT_TFPATH="$(command -v tofu)"
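A quick sanity check that the hooks will resolve to OpenTofu (a sketch; assumes `direnv` and `tofu` are installed, and the path shown is illustrative):

```bash
direnv allow              # trust this .envrc once per checkout
echo "$PCT_TFPATH"        # e.g. /opt/homebrew/bin/tofu
"$PCT_TFPATH" version     # should report OpenTofu, not Terraform
```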
5 changes: 3 additions & 2 deletions .github/workflows/opentofu.yml
@@ -16,8 +16,9 @@ jobs:
opentofu:
uses: makeitworkcloud/shared-workflows/.github/workflows/opentofu.yml@main
with:
runs-on: arc-dind
container: image-registry.openshift-image-registry.svc:5000/public-registry/tfroot-runner:latest
# Native tfroot-runner scale set in kustomize-cluster/workloads/arc.
# The runner pod IS the tfroot-runner image — no nested container.
runs-on: arc-tf
setup-ssh: true
secrets:
SOPS_AGE_KEY: ${{ secrets.SOPS_AGE_KEY }}
47 changes: 34 additions & 13 deletions AGENTS.md
@@ -21,32 +21,53 @@ make test

This automatically fetches the canonical config if not present.

### OpenTofu vs HashiCorp Terraform

The pre-commit-terraform hooks call `terraform` from PATH. In CI the
`tfroot-runner` image symlinks `tofu → terraform` so the call resolves to
OpenTofu. Locally, most developers have HashiCorp `terraform` from Homebrew,
which rejects tofu-only backend attributes (e.g. `assume_role_duration_seconds`).

`make test` already exports `PCT_TFPATH=$(command -v tofu)` so the hooks
invoke OpenTofu. For `git commit`-triggered pre-commit runs, either:

- use direnv: `direnv allow` will source the repo's `.envrc`; or
- export it manually: `export PCT_TFPATH=$(command -v tofu)` in your shell.
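For the manual route, a minimal sketch of a full hook run against OpenTofu (assumes `pre-commit` and `tofu` are on PATH):

```bash
export PCT_TFPATH="$(command -v tofu)"
pre-commit run -a    # terraform_validate/fmt/docs now invoke tofu
```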

## CI/CD

This repo uses the shared `opentofu.yml` workflow from `shared-workflows`, but with **custom configuration**:

- **Runner:** `arc-dind` (self-hosted, not `ubuntu-latest`)
- **Container:** `image-registry.openshift-image-registry.svc:5000/public-registry/tfroot-runner:latest` (internal OpenShift registry, not GHCR)
- **Container:** `ghcr.io/makeitworkcloud/tfroot-runner:latest`

This is required because the workflow needs SSH access to libvirt hosts, which is only available from the self-hosted runner network.
The self-hosted runner is required because the workflow needs SSH access to the libvirt host, which is only reachable from the runner network.

### Failure Modes
## Local apply

**"name unknown" or image pull failures:** The `tfroot-runner` image doesn't exist in the OpenShift internal registry. This happens when:
`make init` / `make plan` / `make apply` need:

1. The `images` repo Build workflow failed (check for transient network errors, re-run if needed)
2. The `images` repo Pull workflow failed to import (the `|| true` masks failures; check logs for "Unable to connect" errors)
- `sops` available locally with the team's age key (so `data.sops_file.secret_vars` decrypts)
- The Makefile's `libvirt-ssh` target (auto-run by `init`) materializes the qemu+ssh keypair from sops into `.terraform/libvirt-ssh/` — no `~/.ssh/id_rsa` needed
- `tofu` on PATH, plus `direnv` (recommended) so `.envrc` exports `PCT_TFPATH` for pre-commit
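Putting the list above together, a typical first run might look like the following sketch (assumes your age key is already where sops expects it):

```bash
direnv allow                                   # exports PCT_TFPATH from .envrc
sops decrypt secrets/secrets.yaml >/dev/null   # fails fast if the age key is missing
make init                                      # runs the libvirt-ssh target, then tofu init
make plan
make apply
```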

### SSH-ing into the VMs

Both VMs are behind the libvirt host. The cloud-init user is `user`, not your local username:

**To fix:** Re-run the Pull workflow in the `images` repo, or manually import:
```bash
oc import-image tfroot-runner:latest \
--from=ghcr.io/makeitworkcloud/tfroot-runner:latest \
-n public-registry \
--confirm \
--reference-policy=local
ssh -J user@hero.makeitwork.cloud user@192.168.102.2 # k3s
ssh -J user@hero.makeitwork.cloud user@192.168.102.12 # runner
```
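Not part of the change, but a convenience: a `~/.ssh/config` stanza along these lines (the `k3s-vm` alias is made up here) saves retyping the jump flag:

```bash
cat >> ~/.ssh/config <<'EOF'
Host k3s-vm
  HostName 192.168.102.2
  User user
  ProxyJump user@hero.makeitwork.cloud
EOF
ssh k3s-vm   # equivalent to the first command above
```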

**Pre-commit failures:** If hooks fail unexpectedly, the canonical config may have changed. Delete `.pre-commit-config.yaml` locally and re-run `make test` to fetch the latest.
### Common apply hiccups

- **`Volume Upload Failed: unexpected EOF`** while creating boot disks — flaky upload of the ~700 MB Fedora qcow2. Just re-run `make apply`; partial volumes get cleaned up automatically on retry. Boot-disk creation legitimately takes 5–7 minutes per VM.
- **`Storage volume X exists already`** on a fresh apply — host has stale volumes (e.g. from a previous failed apply). Delete via `ssh user@hero "sudo virsh -c qemu:///system vol-delete --pool <pool> <volname>"`; `sudo` is required. Run `pool-refresh <pool>` after (full sequence sketched after this list).
- **`Storage volume not found: no storage vol with matching path …`** during refresh — state references a volume that was deleted out-of-band. `tofu state rm <addr>` and re-apply to recreate.
- **Boot-disk filenames are a deterministic URL hash** (e.g. `k3s-94d57345.qcow2`). Tofu won't recreate them when the boot image content changes server-side or when cloud-init templates change. Force a rebuild with `tofu taint` on each of `module.<vm>.libvirt_volume.boot`, `module.<vm>.libvirt_volume.cloudinit`, and `module.<vm>.libvirt_cloudinit_disk.commoninit` (`taint` takes one address per invocation).
- **Cluster + runner state survives boot-disk replacement.** `/var/lib/rancher` (k3s) and `/opt/actions-runner` are on persistent xfs `extra` volumes (`overwrite: false`). Cloud-init scripts are idempotent against this — see the `[ ! -f .runner ]` check in the runner template and the `kubectl get … || create` in the k3s template.
- **Pre-commit failures** — the canonical config may have changed. `rm .pre-commit-config.yaml && make test` fetches the latest.
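A sketch of the stale-volume recovery path from the bullets above (`<pool>`, `<volname>`, and `<vm>` are placeholders):

```bash
# delete the leftover volume on the libvirt host, then refresh the pool
ssh user@hero.makeitwork.cloud \
  'sudo virsh -c qemu:///system vol-delete --pool <pool> <volname> &&
   sudo virsh -c qemu:///system pool-refresh <pool>'

# if state still references the deleted volume, drop it and re-apply
tofu state rm 'module.<vm>.libvirt_volume.boot'
make apply
```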

## Related Repositories

19 changes: 16 additions & 3 deletions Makefile
@@ -6,7 +6,7 @@ S3_KEY := $(shell sops decrypt secrets/secrets.yaml | grep ^s3_key
S3_ACCESS_KEY := $(shell sops decrypt secrets/secrets.yaml | grep ^s3_access_key | cut -d ' ' -f 2)
S3_SECRET_KEY := $(shell sops decrypt secrets/secrets.yaml | grep ^s3_secret_key | cut -d ' ' -f 2)

.PHONY: help init plan apply migrate test pre-commit-check-deps pre-commit-install-hooks clean
.PHONY: help init plan apply migrate test libvirt-ssh pre-commit-check-deps pre-commit-install-hooks clean

help:
@echo "General targets"
@@ -36,7 +36,20 @@ clean:

init: clean .terraform/terraform.tfstate

.terraform/terraform.tfstate:
# SSH key + known_hosts for the libvirt provider's qemu+ssh transport. Decrypted
# from sops at make-time so neither local users nor CI need a pre-provisioned
# key in ~/.ssh.
libvirt-ssh: .terraform/libvirt-ssh/id_ed25519 .terraform/libvirt-ssh/known_hosts

.terraform/libvirt-ssh/id_ed25519: secrets/secrets.yaml
@mkdir -p $(@D)
@sops --decrypt --extract '["ops_ssh_privkey"]' secrets/secrets.yaml > $@
@chmod 0600 $@

.terraform/libvirt-ssh/known_hosts: secrets/secrets.yaml
@mkdir -p $(@D)
@sops --decrypt --extract '["hero_known_hosts"]' secrets/secrets.yaml > $@

.terraform/terraform.tfstate: libvirt-ssh
@${TERRAFORM} init -reconfigure -upgrade -input=false -backend-config="key=${S3_KEY}" -backend-config="bucket=${S3_BUCKET}" -backend-config="region=${S3_REGION}" -backend-config="access_key=${S3_ACCESS_KEY}" -backend-config="secret_key=${S3_SECRET_KEY}"

plan: init .terraform/plan
@@ -56,7 +69,7 @@ migrate:
@${TERRAFORM} init -migrate-state -backend-config="key=${S3_KEY}" -backend-config="bucket=${S3_BUCKET}" -backend-config="region=${S3_REGION}" -backend-config="access_key=${S3_ACCESS_KEY}" -backend-config="secret_key=${S3_SECRET_KEY}"

test: .pre-commit-config.yaml .git/hooks/pre-commit
@pre-commit run -a
@PCT_TFPATH=$$(command -v tofu) pre-commit run -a

.pre-commit-config.yaml:
@curl -sSL -o .pre-commit-config.yaml \
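A hypothetical smoke test of the materialized transport credentials (host and user are taken from the SSH examples in AGENTS.md; assumes the remote user may talk to `qemu:///system`, as the provider does):

```bash
make libvirt-ssh
ssh -i .terraform/libvirt-ssh/id_ed25519 \
    -o UserKnownHostsFile=.terraform/libvirt-ssh/known_hosts \
    user@hero.makeitwork.cloud 'virsh -c qemu:///system list --all'
```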
12 changes: 7 additions & 5 deletions README.md
@@ -2,28 +2,30 @@
## Requirements

| Name | Version |
|------|---------|
| ---- | ------- |
| <a name="requirement_terraform"></a> [terraform](#requirement\_terraform) | >= 1.3 |
| <a name="requirement_aap"></a> [aap](#requirement\_aap) | ~> 1.4.0 |
| <a name="requirement_libvirt"></a> [libvirt](#requirement\_libvirt) | ~> 0.9.0 |
| <a name="requirement_sops"></a> [sops](#requirement\_sops) | ~> 1.3.0 |

## Providers

| Name | Version |
|------|---------|
| ---- | ------- |
| <a name="provider_libvirt"></a> [libvirt](#provider\_libvirt) | ~> 0.9.0 |
| <a name="provider_sops"></a> [sops](#provider\_sops) | ~> 1.3.0 |

## Modules

| Name | Source | Version |
|------|--------|---------|
| ---- | ------ | ------- |
| <a name="module_k3s"></a> [k3s](#module\_k3s) | git::https://github.com/makeitworkcloud/terraform-libvirt-domain.git | n/a |
| <a name="module_runner"></a> [runner](#module\_runner) | git::https://github.com/makeitworkcloud/terraform-libvirt-domain.git | n/a |

## Resources

| Name | Type |
|------|------|
| ---- | ---- |
| [libvirt_pool.cluster](https://registry.terraform.io/providers/dmacvicar/libvirt/latest/docs/resources/pool) | resource |
| [sops_file.secret_vars](https://registry.terraform.io/providers/carlpett/sops/latest/docs/data-sources/file) | data source |

## Inputs
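These tables are generated; the `terraform_docs` hook named in `.envrc` rewrites them on commit. A rough manual equivalent (flags per current terraform-docs releases):

```bash
terraform-docs markdown table --output-file README.md .
```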
143 changes: 143 additions & 0 deletions cloud-init/k3s/cloud_init.cfg
@@ -0,0 +1,143 @@
#cloud-config
# https://cloudinit.readthedocs.io/en/latest/topics/examples.html

# AGE private key for KSOPS, written to tmpfs so it does not survive reboot.
# Loaded into the argocd/sops-age-keys Secret in runcmd via --from-file.
write_files:
- path: /run/age-key
permissions: '0600'
content: |
${indent(6, sops_age_key)}
# k3s reads /etc/rancher/k3s/config.yaml.d/*.yaml on startup; this enables
# OIDC token validation by kube-apiserver. Headlamp/kubectl forward the
# user's Dex-issued ID token here, the apiserver validates it against the
# Dex issuer, and RBAC bindings in kustomize-cluster/bootstrap/oidc-rbac.yaml
# grant access by GitHub team membership (groups claim).
- path: /etc/rancher/k3s/config.yaml.d/oidc.yaml
permissions: '0600'
content: |
kube-apiserver-arg:
- oidc-issuer-url=https://argocd.makeitwork.cloud/api/dex
- oidc-client-id=headlamp
- oidc-username-claim=email
- oidc-groups-claim=groups

groups:
- default
- name: wheel

users:
- default
- name: user
groups: [wheel]
sudo: ['ALL=(ALL) NOPASSWD:ALL']
shell: /bin/bash
lock_passwd: true
ssh_authorized_keys:
- ${ssh_authorized_key}

packages:
- curl
- git

fs_setup:
- device: /dev/vdb
filesystem: xfs
overwrite: false

mounts:
- ["/dev/vdb", "/var/lib/rancher", "xfs", "defaults", "0", "0"]

runcmd:
- sed -i 's/^SELINUX=.*/SELINUX=permissive/' /etc/selinux/config
- setenforce 0
- mkdir -p /var/lib/rancher
- |
set -e
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION='${k3s_version}' \
sh -s - server --disable=traefik --disable=servicelb --write-kubeconfig-mode=0644
- |
set -e
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
until kubectl get nodes 2>/dev/null | grep -q ' Ready '; do sleep 3; done
# KSOPS in argocd's repo-server expects /sops-age-keys/key.txt; create the
# namespace + Secret BEFORE the ArgoCD CR is reconciled or the repo-server
# CrashLoops on missing volume mount.
- |
set -e
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
kubectl get ns argocd >/dev/null 2>&1 || kubectl create namespace argocd
kubectl -n argocd create secret generic sops-age-keys \
--from-file=key.txt=/run/age-key \
--dry-run=client -o yaml | kubectl apply -f -
# cert-manager — argocd-operator's deployment mounts a webhook-server-cert
# Secret that nothing in config/default actually creates (cert-manager bits
# are commented out in upstream's kustomization). Install cert-manager and
# provision the cert ourselves before the operator install.
- |
set -e
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
kubectl apply -f \
"https://github.com/cert-manager/cert-manager/releases/download/${cert_manager_version}/cert-manager.yaml"
kubectl -n cert-manager rollout status deployment/cert-manager-webhook --timeout=180s
# Cluster CoreDNS can't recursively resolve external domains for ACME
# DNS-01 challenges; force cert-manager to use public resolvers directly.
kubectl -n cert-manager patch deployment cert-manager --type=json -p='[
{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--dns01-recursive-nameservers=1.1.1.1:53,8.8.8.8:53"},
{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--dns01-recursive-nameservers-only"}
]'
kubectl -n cert-manager rollout status deployment/cert-manager --timeout=120s
# Install argocd-operator (community) which provides the
# argoproj.io/v1beta1 ArgoCD CRD consumed by kustomize-cluster's
# bootstrap/argocd-config.yaml.
#
# ARGOCD_CLUSTER_CONFIG_NAMESPACES grants cluster-config scope to ArgoCD CRs
# in the named namespace; without it the application-controller can only
# manage namespaced resources, blocking sync of any ClusterRole/CRB.
- |
set -e
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
kubectl apply --server-side -k \
'https://github.com/argoproj-labs/argocd-operator//config/default?ref=${argocd_operator_version}'
until kubectl get crd argocds.argoproj.io 2>/dev/null; do sleep 3; done
kubectl -n argocd-operator-system set env \
deployment/argocd-operator-controller-manager \
ARGOCD_CLUSTER_CONFIG_NAMESPACES=argocd
# Self-signed Issuer + Certificate for the operator's admission webhook.
# Service name comes from config/default's namePrefix + ../webhook/service.yaml.
- |
set -e
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
kubectl apply -f - <<'EOF'
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
name: argocd-operator-selfsigned
namespace: argocd-operator-system
spec:
selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: argocd-operator-serving-cert
namespace: argocd-operator-system
spec:
secretName: webhook-server-cert
dnsNames:
- argocd-operator-webhook-service.argocd-operator-system.svc
- argocd-operator-webhook-service.argocd-operator-system.svc.cluster.local
issuerRef:
kind: Issuer
name: argocd-operator-selfsigned
EOF
kubectl -n argocd-operator-system rollout status deployment/argocd-operator-controller-manager --timeout=180s
# Apply kustomize-cluster bootstrap path. This contains the ArgoCD CR
# (which the operator reconciles into a running argocd-server) plus the
# operators-app and workloads-app Applications. Once argocd-server starts,
# it picks up the Applications and self-manages from there.
- |
set -e
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
kubectl apply --server-side -k \
'${cluster_repo_url}//${cluster_repo_path}?ref=${cluster_repo_branch}'
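A hypothetical post-boot spot check of the bootstrap chain above (resource names follow the manifests in this file; none of this is in the PR):

```bash
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
kubectl -n argocd get secret sops-age-keys        # KSOPS key landed
kubectl get crd argocds.argoproj.io               # operator CRD installed
kubectl -n argocd get applications.argoproj.io    # operators-app / workloads-app synced
```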
63 changes: 61 additions & 2 deletions cloud-init/runner/cloud_init.cfg
@@ -34,12 +34,67 @@ write_files:
# Prune build cache older than 7 days
docker builder prune -f --filter "until=168h"

- path: /usr/local/bin/runner-work-cleanup.sh
permissions: '0755'
content: |
#!/bin/bash
find /opt/actions-runner/_work -mindepth 2 -maxdepth 2 -type d -mtime +1 \
-exec rm -rf {} \; 2>/dev/null || true

# PAT-bearing installer lives on tmpfs so it does not survive first reboot.
- path: /run/install-gha-runner.sh
permissions: '0700'
content: |
#!/bin/bash
set -euo pipefail

GITHUB_ORG='${github_org}'
GITHUB_TOKEN='${github_token}'

RUNNER_VERSION=$(curl -sSL https://api.github.com/repos/actions/runner/releases/latest | jq -r .tag_name)
RUNNER_VER_NUM="$${RUNNER_VERSION#v}"

cd /opt/actions-runner
curl -sSL -o runner.tar.gz \
"https://github.com/actions/runner/releases/download/$RUNNER_VERSION/actions-runner-linux-x64-$RUNNER_VER_NUM.tar.gz"
tar xzf runner.tar.gz
chown -R user:user /opt/actions-runner
rm -f runner.tar.gz

# Runner ships dotnet 6.0 binaries that need libicu / openssl-libs / zlib.
# Fedora's installdependencies.sh handles this for us.
./bin/installdependencies.sh

# /opt/actions-runner is a persistent xfs volume (overwrite:false) so a
# boot-disk replacement reuses the existing registration. Skip config.sh
# if already configured; svc.sh is always re-run since the systemd unit
# lives on the boot disk and is lost on rebuild.
if [ ! -f .runner ]; then
REG_TOKEN=$(curl -sSL -X POST \
-H "Authorization: Bearer $GITHUB_TOKEN" \
-H "Accept: application/vnd.github+json" \
"https://api.github.com/orgs/$GITHUB_ORG/actions/runners/registration-token" \
| jq -r .token)
RANDOM_ID=$(tr -dc 'a-z' </dev/urandom | head -c4 || true)  # || true: head's early exit SIGPIPEs tr, which pipefail would otherwise turn fatal
sudo -u user ./config.sh \
--name "libvirt-$RANDOM_ID" \
--unattended \
--labels libvirt \
--url "https://github.com/$GITHUB_ORG" \
--token "$REG_TOKEN"
fi
./svc.sh install user
./svc.sh start

groups:
- default
- name: wheel

packages:
- docker
- jq
- tar
- cronie

users:
- default
@@ -65,7 +120,11 @@ mounts:

runcmd:
- [ systemctl, daemon-reload ]
- [ systemctl, enable, docker.service ]
- [ systemctl, start, --no-block, docker.service ]
- [ systemctl, enable, --now, docker.service ]
- [ systemctl, enable, --now, crond.service ]
- sed -i 's/^SELINUX=.*/SELINUX=permissive/' /etc/selinux/config
- setenforce 0
- chown user:user /opt/actions-runner
- /run/install-gha-runner.sh
- echo "0 */6 * * * root /usr/local/bin/docker-cleanup.sh >> /var/log/docker-cleanup.log 2>&1" > /etc/cron.d/docker-cleanup
- echo "30 */6 * * * user /usr/local/bin/runner-work-cleanup.sh" > /etc/cron.d/runner-work-cleanup