feat!: migrate to k3s + cloud-init; drop CRC/AAP #2

Merged
xnoto merged 17 commits into main from chore/add-opencode-config on Apr 30, 2026

xnoto commented on Apr 24, 2026

Summary

Migrates tfroot-libvirt off the retired CRC/AAP stack onto a self-bootstrapping k3s VM, then iterates on the cloud-init bootstrap until the cluster reliably comes up green on Fedora 44.

Provisioning structure

  • BREAKING: drops the AAP/Ansible Automation Platform integration entirely — provider, awx_* sops keys, runner module's enable_aap arguments.
  • Runner VM: ports configure_runner.yml into cloud-init runcmd. Self-installs the GHA binary, registers via the org github_token PAT (shared with tfroot-github), runs as a systemd service. The PAT-bearing installer lives in /run/ so it doesn't survive reboot. installdependencies.sh is invoked because Fedora 44 doesn't ship libicu by default. Re-applies are idempotent against the persistent xfs /opt/actions-runner volume.
  • k3s VM: new module "k3s" (6 vCPU / 16 GiB / 100 GiB) on a dedicated libvirt_pool "cluster" backed by /mnt/nvme/cluster. Cloud-init sets SELinux to permissive, installs k3s with Traefik and ServiceLB disabled, then runs a four-step bootstrap chain: sops-age-keys Secret → upstream cert-manager (with DNS-01 nameservers patched onto the controller args) → argocd-operator config/default plus a self-signed Issuer/Certificate to provision its webhook-server-cert → kubectl apply -k of kustomize-cluster/bootstrap.
  • CI: runs on the arc-dind self-hosted runner with the GHCR tfroot-runner container.
  • SSH credentials for the libvirt provider come out of sops and are materialized into .terraform/libvirt-ssh/ by the make libvirt-ssh target (auto-run by make init). No ~/.ssh/id_rsa required.

Operator instructions

This PR does not auto-rotate state. After merge:

  1. Drop orphaned sops keys: awx_controller, awx_username, awx_password, proxyhost (already done in this branch).
  2. Add github_token matching tfroot-github/secrets/secrets.yaml, plus sops_age_key, ops_ssh_privkey, hero_known_hosts, runner_ip_addr (already done in this branch).
  3. One-time host setup: ssh user@hero 'sudo mkdir -p /mnt/nvme/cluster' (done).
  4. Run make init, then make apply. Boot-disk creation takes about 5–7 minutes per VM; if the apply fails with "Volume Upload Failed: unexpected EOF", re-run make apply.
  5. The old GHA runner registration is orphaned in the GitHub org on rebuild; delete it via the API.

See AGENTS.md (also refreshed in this PR) for SSH access patterns, common apply hiccups, and recovery operations.
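The "just re-run" advice for flaky boot-image uploads could be wrapped in a small retry helper. This is a minimal sketch and not part of this PR: `retry` is a hypothetical function, and the attempt count and the `RETRY_DELAY` default are arbitrary assumptions.

```shell
#!/bin/sh
# Hypothetical helper (not in this repo): run a command up to N times,
# stopping at the first success.
# Usage: retry <max_attempts> <command> [args...]
retry() {
  retry_max="$1"
  shift
  retry_n=1
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$retry_n" -ge "$retry_max" ]; then
      echo "giving up after $retry_n attempts: $*" >&2
      return 1
    fi
    retry_n=$((retry_n + 1))
    # Assumed pause between attempts; override with RETRY_DELAY.
    sleep "${RETRY_DELAY:-10}"
  done
}

# Example (commented out so sourcing this file has no side effects):
# retry 3 make apply
```

Note that this retries on any non-zero exit, not only on the upload error, so it is best kept to a small number of attempts.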

Test plan

  • Pre-commit (terraform_validate, tflint, terraform_fmt, terraform_docs, sundry hooks) passes locally
  • CI OpenTofu job passes
  • make init && make apply from a fresh checkout produces 12 resources
  • k3s VM boots, k3s service active, cert-manager + argocd-operator + ArgoCD CR + 3 Apps come up clean
  • bootstrap-secrets Application reaches Synced + Healthy
  • Runner VM registers a fresh runner with label libvirt, status online

Pairs with

  • kustomize-cluster PR: drops OLM artifacts so gitops-operators can sync on k3s.
  • images PR: gh-cli switches to numeric USER 1000 so runAsNonRoot passes for ci-token-sync.

🤖 Generated with Claude Code

xnoto and others added 3 commits April 24, 2026 14:00
The arc-dind runner pool and the OpenShift internal registry that
hosted the tfroot-runner image both depended on the CRC cluster,
which is offline during the libvirt-host migration. Swap to GitHub-
hosted ubuntu-latest runners and pull the canonical image from
ghcr.io. Revert when the new k3s cluster is up.

Drops the entire Ansible Automation Platform integration (the aap
provider, awx_* secrets, and runner module's enable_aap arguments)
and ports its sole consumer — the configure_runner playbook — into
the runner VM's cloud-init runcmd. The runner now self-installs the
GitHub Actions binary and registers via the existing github_token
secret (which is shared with tfroot-github). The PAT-bearing
installer is written to /run/ so it does not survive reboot.

Adds a new module "k3s" backed by a Fedora cloud image with cloud-
init that:
  - relaxes SELinux to permissive
  - installs k3s (Traefik + ServiceLB disabled)
  - installs upstream Argo CD into ns argocd
  - applies a root Application pointing at kustomize-cluster's main /,
    which then self-manages the cluster

Adds a dedicated libvirt_pool "cluster" backed by /mnt/nvme/cluster
on hero's RAID-1 NVMe, keeping cluster volumes off the root LV. The
host directory must be created once: ssh user@hero 'sudo mkdir -p
/mnt/nvme/cluster' (hero has SELinux disabled, so no fcontext step).

BREAKING CHANGE: tfroot-libvirt no longer requires the aap provider,
the awx_controller / awx_username / awx_password / proxyhost sops
keys, or the ansible-project-libvirt repo. Operators consuming this
TF root must remove those references and provide a github_token sops
key (matches the value in tfroot-github/secrets/secrets.yaml).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
xnoto changed the title from "chore: add repo-local opencode config" to "feat!: migrate to k3s + cloud-init; drop CRC/AAP" on Apr 29, 2026
xnoto and others added 2 commits April 29, 2026 15:05
The age private key in sops is stored as a YAML | (literal block
scalar), so data.sops_file...sops_age_key is a multiline string.
Threading it through --from-literal=key.txt='${sops_age_key}'
inside a cloud-init runcmd would inject literal newlines mid-YAML,
breaking cloud-init parsing.

Switch to:
  1. write_files entry that materialises /run/age-key on tmpfs,
     with indent(6, sops_age_key) so YAML block-scalar indentation
     is preserved across all lines of the secret.
  2. kubectl --from-file=key.txt=/run/age-key in runcmd.

The key file lives only on tmpfs and is reaped on first reboot.
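The indentation fix described above can be illustrated in shell. This is a sketch for illustration only: `indent_rest` and `render_write_files` are hypothetical helpers that mimic what OpenTofu's indent() does to a template line, the permissions value is an assumption, and the actual template from this PR is not reproduced here.

```shell
#!/bin/sh
# Indent every line except the first by N spaces, which is the
# behaviour of Terraform/OpenTofu's indent(N, s).
indent_rest() {
  indent_pad=""
  indent_i=0
  while [ "$indent_i" -lt "$1" ]; do
    indent_pad="$indent_pad "
    indent_i=$((indent_i + 1))
  done
  sed "2,\$s/^/${indent_pad}/"
}

# Render a write_files entry whose content is a literal block scalar.
# The first line of the secret follows six literal spaces in the
# "template"; indent_rest supplies the same six for the other lines,
# so the whole secret stays inside the block scalar.
render_write_files() {
  printf 'write_files:\n'
  printf '  - path: /run/age-key\n'
  printf '    permissions: "0600"\n'   # assumed mode, not from the PR
  printf '    content: |\n'
  printf '      %s\n' "$(printf '%s' "$1" | indent_rest 6)"
}
```

Without the indent step, the second and later lines of the key would start at column 0 and end the block scalar early, which is the parsing failure the commit message describes.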
…sion

Threads sops_age_key from secrets into the k3s cloud-init template (paired
with the multiline-safe write_files handling already in place), and renames
the ArgoCD version local to argocd_operator_version to match the operator-
based install (v0.14.0 of argoproj-labs/argocd-operator).

Regenerates README.md with terraform-docs v0.22.0 (now matching the
republished tfroot-runner image).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
xnoto self-assigned this on Apr 29, 2026
xnoto and others added 12 commits April 29, 2026 16:11
…sa dep

Generates a dedicated ed25519 keypair for the libvirt provider's qemu+ssh
transport, encrypts the private half + hero's host pubkeys into
secrets/secrets.yaml, and has the Makefile materialize both under
.terraform/libvirt-ssh/ before tofu init. providers.tf builds the URI from
the sops libvirt_uri base + the materialized keyfile/knownhosts paths.

Local users no longer need ~/.ssh/id_rsa (incompatible with bitwarden-agent
setups), and CI gets the same flow with no extra GHA secret. Host-key
rotations on hero become a sops re-encrypt instead of a per-machine
ssh-keygen -R + accept-new dance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The pre-commit-terraform hooks call `terraform` from PATH. In CI the
tfroot-runner image symlinks tofu→terraform, so it resolves correctly;
locally Homebrew's HashiCorp terraform binary rejects tofu-only backend
attributes (e.g. assume_role_duration_seconds) and aborts validation.

Sets PCT_TFPATH=$(command -v tofu) in three complementary spots:

- Makefile `test` target — covers `make test`.
- `.envrc` — direnv users get it auto-sourced via `direnv allow`.
- AGENTS.md — documents the manual export for non-direnv shells.
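For non-direnv shells, the manual export might look like the following. This is a sketch under assumptions: `require_tofu` is a hypothetical helper, and failing loudly when tofu is missing (rather than falling back to terraform) is a choice made here for illustration, not something the commit specifies.

```shell
#!/bin/sh
# Print the absolute path of the tofu binary, or fail with a message.
# Deliberately does not fall back to terraform, since a HashiCorp
# binary is what breaks validation in the scenario described above.
require_tofu() {
  if ! command -v tofu >/dev/null 2>&1; then
    echo "tofu not found on PATH; install OpenTofu first" >&2
    return 1
  fi
  command -v tofu
}

# Example (commented out so sourcing this file has no side effects):
# PCT_TFPATH="$(require_tofu)" && export PCT_TFPATH
```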

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reverts the temporary ubuntu-latest fallback that was needed while the
CRC cluster was decommissioned. Once kustomize-cluster's ARC stack is
running on k3s, the dind RunnerDeployment registers org-scoped runners
with label `arc-dind`, which this workflow now targets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…empotency

Fedora 44 cloud-base images don't ship git or libicu by default, which broke
both bootstrap flows on first apply against the new image:

- k3s: argocd-operator install via kubectl apply -k 'git+https://...' needs git
- runner: actions runner ships dotnet 6 binaries that need libicu/lttng-ust;
  config.sh --unattended fails with "Libicu's dependencies is missing"

Also make both runcmd flows idempotent against persistent extra volumes
(/var/lib/rancher and /opt/actions-runner have overwrite:false). Boot-disk
replacement now reuses cluster + runner state instead of erroring on
"namespace already exists" or "runner already configured".
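One common way to make a runcmd step idempotent against a persistent volume is a marker-file guard. The sketch below is illustrative and does not reproduce the PR's actual cloud-init: `run_once` and the marker path in the example are hypothetical.

```shell
#!/bin/sh
# Run a one-time setup step only if its marker file is absent.
# The marker is written only when the step succeeds, so a failed
# step is retried on the next boot.
# Usage: run_once <marker_file> <command> [args...]
run_once() {
  run_once_marker="$1"
  shift
  if [ -e "$run_once_marker" ]; then
    echo "skip: $run_once_marker exists"
    return 0
  fi
  "$@" && : > "$run_once_marker"
}

# Example (hypothetical marker path and script name):
# run_once /opt/actions-runner/.bootstrap-done sh /run/install-runner.sh
```

For Kubernetes objects, a guard is often unnecessary: piping `kubectl create namespace NAME --dry-run=client -o yaml` into `kubectl apply -f -` succeeds whether or not the namespace already exists.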
argocd-operator's config/default mounts a webhook-server-cert Secret in its
manager Deployment, but cert-manager bits are commented out upstream — so
the secret never materializes outside OLM. Pod hung in ContainerCreating
with FailedMount errors, blocking the entire ArgoCD bootstrap chain.

Bootstrap upstream cert-manager (pinned via local var) and provision a
self-signed Issuer + Certificate targeting webhook-server-cert in the
operator namespace. Wait for the operator deployment to roll out before
continuing the bootstrap so the ArgoCD CR has something to reconcile against.

cert-manager being a bootstrap dependency means kustomize-cluster's
operators/cert-manager/operator.yaml (an OpenShift OLM Subscription) is
now redundant for the operator install itself; that file becomes a Phase B
cleanup item — Issuer/ClusterIssuer resources can stay since they depend on
cert-manager being there, but the Subscription needs to go.

- Container ref points at GHCR (was the OpenShift internal registry)
- Drop the OpenShift-only failure-mode section (oc import-image, etc.)
- Add a Local apply section: sops/age requirement, libvirt-ssh target,
  PCT_TFPATH via direnv, SSH-into-VM one-liners (user is `user`, not the
  local login)
- Replace failure modes with the ones we actually hit on k3s: flaky
  boot-image uploads, stale volumes needing virsh vol-delete, state-rm
  for orphaned volumes, deterministic boot-disk hash names that need
  taint to rebuild on cloud-init changes, persistent extra volumes that
  require idempotent cloud-init scripts
…solvers

Cluster CoreDNS doesn't recursively resolve external domains, which breaks
ACME DNS-01 challenge validation. Pass --dns01-recursive-nameservers and
--dns01-recursive-nameservers-only to the cert-manager controller so it
queries 1.1.1.1 / 8.8.8.8 directly. Tighten the surrounding comment too.
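Appending those two flags to the controller can be expressed as a JSON Patch. This is a hedged sketch, not the PR's implementation: `dns01_patch` is a hypothetical helper, the `:53` ports are an assumption, and the namespace, Deployment name, and container index in the example assume the upstream static cert-manager manifest.

```shell
#!/bin/sh
# Build a JSON Patch that appends the two DNS-01 flags to the args of
# the first container in a Deployment.
# Usage: dns01_patch <comma-separated host:port list>
dns01_patch() {
  printf '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--dns01-recursive-nameservers=%s"},' "$1"
  printf '{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--dns01-recursive-nameservers-only"}]\n'
}

# Example (needs a cluster; namespace and names are assumptions):
# kubectl -n cert-manager patch deployment cert-manager --type=json \
#   -p "$(dns01_patch 1.1.1.1:53,8.8.8.8:53)"
```

Because the patch appends unconditionally, running it twice would duplicate the flags; a re-applied bootstrap would need a check or a declarative overlay instead.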
Without ARGOCD_CLUSTER_CONFIG_NAMESPACES on the argocd-operator deployment,
the spawned ArgoCD application-controller runs in namespaced mode and can't
manage cluster-scoped resources (ClusterRole/ClusterRoleBinding/etc.). Any
operator that ships those — tor-controller, cloudflare-operator, etc. —
fails to sync with `cannot be managed when in namespaced mode`.

Set the env var to `argocd` so the ArgoCD CR in that namespace gets
cluster-scope permissions on reconcile.

Drop a /etc/rancher/k3s/config.yaml.d/oidc.yaml that points the
kube-apiserver at ArgoCD's embedded Dex issuer. Headlamp (and any
OIDC-aware kubectl) forwards the user's Dex-issued ID token to the
apiserver; without these flags the apiserver treats the token as
unknown and 401s every request. Username comes from the email claim,
groups from Dex's GitHub team mapping.

RBAC binding for makeitworkcloud:admins -> cluster-admin lives in
kustomize-cluster/bootstrap/oidc-rbac.yaml.
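A drop-in of the kind described above might be generated as follows. This is a sketch under stated assumptions: `write_oidc_dropin` is a hypothetical helper, the issuer hostname and client ID in the example are placeholders, and `groups` as the groups claim is assumed rather than taken from the PR. The `oidc-*` keys are standard kube-apiserver flags passed through k3s's `kube-apiserver-arg` list.

```shell
#!/bin/sh
# Write a k3s config drop-in that passes OIDC flags to the
# kube-apiserver.
# Usage: write_oidc_dropin <config_dir> <issuer_url> <client_id>
write_oidc_dropin() {
  mkdir -p "$1"
  cat > "$1/oidc.yaml" <<EOF
kube-apiserver-arg:
  - oidc-issuer-url=$2
  - oidc-client-id=$3
  - oidc-username-claim=email
  - oidc-groups-claim=groups
EOF
}

# Example (placeholder hostname and client ID):
# write_oidc_dropin /etc/rancher/k3s/config.yaml.d \
#   https://argocd.example.com/api/dex headlamp
```

k3s reads the drop-in at service start, so the file has to be in place before k3s is installed or the service restarted.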
Drop the nested container override now that the arc-tf runner-set in
kustomize-cluster runs the tfroot-runner image directly.
@github-actions

OpenTofu Plan

OpenTofu will perform the following actions:

  # module.k3s.libvirt_cloudinit_disk.commoninit will be created
  + resource "libvirt_cloudinit_disk" "commoninit" {
      + id             = (known after apply)
      + meta_data      = <<-EOT
            instance-id: k3s
            local-hostname: k3s
        EOT
      + name           = "k3s_commoninit"
      + network_config = <<-EOT
            version: 2
            ethernets:
              enp1s0:
                dhcp4: true
              enp2s0:
                dhcp4: false
                addresses:
                  - 192.168.102.2/24
        EOT
      + path           = (known after apply)
      + size           = (known after apply)
      + user_data      = (sensitive value)
    }

  # module.k3s.libvirt_volume.cloudinit will be updated in-place
  ~ resource "libvirt_volume" "cloudinit" {
      ~ allocation = 49152 -> (known after apply)
      ~ capacity   = 47104 -> (known after apply)
      ~ create     = {
          ~ content = {
              ~ url = "/Users/hatch/.claude/plugins/cache/context-mode/context-mode/1.0.15/terraform-provider-libvirt-cloudinit/cloudinit-bfb5c6617dbb44ba.iso" -> (known after apply)
            }
        }
        id         = "/mnt/nvme/cluster/k3s_cloudinit.iso"
        name       = "k3s_cloudinit.iso"
      ~ physical   = 47104 -> (known after apply)
        # (3 unchanged attributes hidden)
    }

  # module.runner.libvirt_cloudinit_disk.commoninit will be created
  + resource "libvirt_cloudinit_disk" "commoninit" {
      + id             = (known after apply)
      + meta_data      = <<-EOT
            instance-id: runner
            local-hostname: runner
        EOT
      + name           = "runner_commoninit"
      + network_config = (sensitive value)
      + path           = (known after apply)
      + size           = (known after apply)
      + user_data      = (sensitive value)
    }

  # module.runner.libvirt_volume.cloudinit will be updated in-place
  ~ resource "libvirt_volume" "cloudinit" {
      ~ allocation = 49152 -> (known after apply)
      ~ capacity   = 47104 -> (known after apply)
      ~ create     = {
          ~ content = {
              ~ url = "/Users/hatch/.claude/plugins/cache/context-mode/context-mode/1.0.15/terraform-provider-libvirt-cloudinit/cloudinit-b063448360566560.iso" -> (known after apply)
            }
        }
        id         = "/var/lib/libvirt/images/runner_cloudinit.iso"
        name       = "runner_cloudinit.iso"
      ~ physical   = 47104 -> (known after apply)
        # (3 unchanged attributes hidden)
    }

Plan: 2 to add, 2 to change, 0 to destroy.

xnoto merged commit 11941ec into main on Apr 30, 2026
5 of 6 checks passed
xnoto deleted the chore/add-opencode-config branch on April 30, 2026 05:07