feat!: migrate to k3s + cloud-init; drop CRC/AAP #2
Merged
The arc-dind runner pool and the OpenShift internal registry that hosted the tfroot-runner image both depended on the CRC cluster, which is offline during the libvirt-host migration. Swap to GitHub-hosted ubuntu-latest runners and pull the canonical image from ghcr.io. Revert when the new k3s cluster is up.
Drops the entire Ansible Automation Platform integration (the aap
provider, awx_* secrets, and runner module's enable_aap arguments)
and ports its sole consumer — the configure_runner playbook — into
the runner VM's cloud-init runcmd. The runner now self-installs the
GitHub Actions binary and registers via the existing github_token
secret (which is shared with tfroot-github). The PAT-bearing
installer is written to /run/ so it does not survive reboot.
Adds a new module "k3s" backed by a Fedora cloud image with
cloud-init that:
- relaxes SELinux to permissive
- installs k3s (Traefik + ServiceLB disabled)
- installs upstream Argo CD into ns argocd
- applies a root Application pointing at kustomize-cluster's main /,
which then self-manages the cluster
Adds a dedicated libvirt_pool "cluster" backed by /mnt/nvme/cluster
on hero's RAID-1 NVMe, keeping cluster volumes off the root LV. The
host directory must be created once: ssh user@hero 'sudo mkdir -p
/mnt/nvme/cluster' (hero has SELinux disabled, so no fcontext step).
BREAKING CHANGE: tfroot-libvirt no longer requires the aap provider,
the awx_controller / awx_username / awx_password / proxyhost sops
keys, or the ansible-project-libvirt repo. Operators consuming this
TF root must remove those references and provide a github_token sops
key (matches the value in tfroot-github/secrets/secrets.yaml).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The age private key in sops is stored as a YAML | (literal block
scalar), so data.sops_file...sops_age_key is a multiline string.
Threading it through --from-literal=key.txt='${sops_age_key}'
inside a cloud-init runcmd would inject literal newlines mid-YAML,
breaking cloud-init parsing.
Switch to:
1. write_files entry that materialises /run/age-key on tmpfs,
with indent(6, sops_age_key) so YAML block-scalar indentation
is preserved across all lines of the secret.
2. kubectl --from-file=key.txt=/run/age-key in runcmd.
The key file lives only on tmpfs and is reaped on first reboot.
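The indentation problem above can be reproduced outside Terraform. This is a minimal sketch, assuming a hypothetical helper `indent_block` that mimics the behaviour of Terraform's `indent()` (first line untouched, later lines prefixed) and a placeholder string standing in for the real age key. The `permissions` value is likewise an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: mimic Terraform's indent(n, s), which leaves the first
# line alone and prefixes every later line with n spaces.
indent_block() {
  local n="$1" pad
  pad="$(printf '%*s' "$n" '')"
  sed "2,\$s/^/${pad}/"
}

# Placeholder secret, NOT a real age key: a comment line plus a key line.
fake_key='# created: (placeholder)
AGE-SECRET-KEY-PLACEHOLDER'

# Render a write_files entry. The template supplies the indent of the first
# content line; indent_block keeps the remaining lines inside the block scalar.
render_write_files() {
  printf 'write_files:\n'
  printf '  - path: /run/age-key\n'
  printf '    permissions: "0600"\n'
  printf '    content: |\n'
  printf '      %s\n' "$(printf '%s' "$fake_key" | indent_block 6)"
}

render_write_files
```

Without the helper, the second line of the key would start at column 0 and end the YAML block scalar early, which is the parsing failure described above.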
…sion
Threads sops_age_key from secrets into the k3s cloud-init template (paired with the multiline-safe write_files handling already in place), and renames the ArgoCD version local to argocd_operator_version to match the operator-based install (v0.14.0 of argoproj-labs/argocd-operator).
Regenerates README.md with terraform-docs v0.22.0 (now matching the republished tfroot-runner image).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sa dep
Generates a dedicated ed25519 keypair for the libvirt provider's qemu+ssh transport, encrypts the private half + hero's host pubkeys into secrets/secrets.yaml, and has the Makefile materialize both under .terraform/libvirt-ssh/ before tofu init. providers.tf builds the URI from the sops libvirt_uri base + the materialized keyfile/knownhosts paths.
Local users no longer need ~/.ssh/id_rsa (incompatible with bitwarden-agent setups), and CI gets the same flow with no extra GHA secret. Host-key rotations on hero become a sops re-encrypt instead of a per-machine ssh-keygen -R + accept-new dance.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
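The materialization step might look roughly like the following in shell. This is a sketch under assumptions, not the actual Makefile recipe: the output file names `id_ed25519` and `known_hosts` are hypothetical, the sops key names `ops_ssh_privkey` and `hero_known_hosts` are taken from this PR's operator notes, and the `SOPS` override variable exists only so the function can be exercised without real secrets.

```shell
#!/usr/bin/env bash
# Sketch: decrypt two sops values into .terraform/libvirt-ssh/ before `tofu init`.
# File names below are illustrative assumptions, not taken from the Makefile.

materialize_libvirt_ssh() (
  secrets="${1:-secrets/secrets.yaml}"
  out="${2:-.terraform/libvirt-ssh}"
  sops_bin="${SOPS:-sops}"   # override hook for dry runs (assumption)

  umask 077                  # directory and files are created private from the start
  mkdir -p "$out" || exit 1

  # `sops -d --extract` takes a JSON-style path to a single key in the document.
  "$sops_bin" -d --extract '["ops_ssh_privkey"]' "$secrets" > "$out/id_ed25519" || exit 1
  "$sops_bin" -d --extract '["hero_known_hosts"]' "$secrets" > "$out/known_hosts" || exit 1
)
```

Running the body in a subshell keeps the `umask` change from leaking into the caller, which matters if the function is sourced into an interactive shell.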
The pre-commit-terraform hooks call `terraform` from PATH. In CI the tfroot-runner image symlinks tofu→terraform, so it resolves correctly; locally Homebrew's HashiCorp terraform binary rejects tofu-only backend attributes (e.g. assume_role_duration_seconds) and aborts validation.
Sets PCT_TFPATH=$(command -v tofu) in three complementary spots:
- Makefile `test` target: covers `make test`.
- `.envrc`: direnv users get it auto-sourced via `direnv allow`.
- AGENTS.md: documents the manual export for non-direnv shells.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
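For non-direnv shells, a guarded variant of the export could look like this. The function name `set_pct_tfpath` is hypothetical, and the fallback behaviour (leaving the variable unchanged when tofu is absent) is an assumption, not something the commit specifies.

```shell
#!/usr/bin/env bash
# Point pre-commit-terraform at tofu when it is on PATH; otherwise leave the
# variable untouched so the hooks keep their default lookup.
set_pct_tfpath() {
  local tofu_path
  if tofu_path="$(command -v tofu)"; then
    export PCT_TFPATH="$tofu_path"
  else
    echo "tofu not found on PATH; PCT_TFPATH left unchanged" >&2
  fi
}

set_pct_tfpath
```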
Reverts the temporary ubuntu-latest fallback that was needed while the CRC cluster was decommissioned. Once kustomize-cluster's ARC stack is running on k3s, the dind RunnerDeployment registers org-scoped runners with label `arc-dind`, which this workflow now targets.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…empotency
Fedora 44 cloud-base images don't ship git or libicu by default, which broke both bootstrap flows on first apply against the new image:
- k3s: argocd-operator install via kubectl apply -k 'git+https://...' needs git
- runner: actions runner ships dotnet 6 binaries that need libicu/lttng-ust; config.sh --unattended fails with "Libicu's dependencies is missing"
Also make both runcmd flows idempotent against persistent extra volumes (/var/lib/rancher and /opt/actions-runner have overwrite:false). Boot-disk replacement now reuses cluster + runner state instead of erroring on "namespace already exists" or "runner already configured".
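One common way to get this kind of idempotency is a marker-file guard on the persistent volume. The commit does not say which mechanism it uses, so treat the following as an illustrative sketch: `run_once` and the `.configured` marker are hypothetical names, and a temp directory stands in for a persistent mount such as /opt/actions-runner.

```shell
#!/usr/bin/env bash
# Run a step only if its marker file is absent; create the marker on success.
run_once() {
  local marker="$1"
  shift
  if [ -e "$marker" ]; then
    echo "skip: $marker exists"
    return 0
  fi
  "$@" && touch "$marker"
}

# Demo: a temp dir stands in for the persistent volume.
state_dir="$(mktemp -d)"
run_once "$state_dir/.configured" echo "configuring runner"   # runs the step
run_once "$state_dir/.configured" echo "configuring runner"   # skipped

# For Kubernetes objects, a declarative apply avoids "already exists" errors:
#   kubectl create namespace argocd --dry-run=client -o yaml | kubectl apply -f -
```

Because the marker lives on the volume that survives boot-disk replacement, a rebuilt VM skips the step instead of failing on existing state.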
argocd-operator's config/default mounts a webhook-server-cert Secret in its manager Deployment, but cert-manager bits are commented out upstream — so the secret never materializes outside OLM. Pod hung in ContainerCreating with FailedMount errors, blocking the entire ArgoCD bootstrap chain. Bootstrap upstream cert-manager (pinned via local var) and provision a self-signed Issuer + Certificate targeting webhook-server-cert in the operator namespace. Wait for the operator deployment to roll out before continuing the bootstrap so the ArgoCD CR has something to reconcile against. cert-manager being a bootstrap dependency means kustomize-cluster's operators/cert-manager/operator.yaml (an OpenShift OLM Subscription) is now redundant for the operator install itself; that file becomes a Phase B cleanup item — Issuer/ClusterIssuer resources can stay since they depend on cert-manager being there, but the Subscription needs to go.
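A self-signed Issuer plus Certificate of the kind described could be generated as follows. This is a sketch only: the namespace `argocd-operator-system`, the Service name `argocd-operator-webhook-service`, and the Issuer name are assumptions. Only the Secret name `webhook-server-cert` comes from the commit.

```shell
#!/usr/bin/env bash
# Assumed names: adjust to match the operator's actual namespace and Service.
operator_ns="argocd-operator-system"
webhook_svc="argocd-operator-webhook-service"

webhook_cert_manifest() {
  cat <<EOF
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: selfsigned-issuer
  namespace: ${operator_ns}
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: webhook-server-cert
  namespace: ${operator_ns}
spec:
  secretName: webhook-server-cert   # the Secret the manager Deployment mounts
  dnsNames:
    - ${webhook_svc}.${operator_ns}.svc
    - ${webhook_svc}.${operator_ns}.svc.cluster.local
  issuerRef:
    kind: Issuer
    name: selfsigned-issuer
EOF
}

# On the VM this would be piped to the cluster, then the rollout awaited:
#   webhook_cert_manifest | kubectl apply -f -
#   kubectl -n "$operator_ns" rollout status deployment/<operator-deployment> --timeout=300s
webhook_cert_manifest
```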
- Container ref points at GHCR (was the OpenShift internal registry)
- Drop the OpenShift-only failure-mode section (oc import-image, etc.)
- Add a Local apply section: sops/age requirement, libvirt-ssh target, PCT_TFPATH via direnv, SSH-into-VM one-liners (user is `user`, not the local login)
- Replace failure modes with the ones we actually hit on k3s: flaky boot-image uploads, stale volumes needing virsh vol-delete, state-rm for orphaned volumes, deterministic boot-disk hash names that need taint to rebuild on cloud-init changes, persistent extra volumes that require idempotent cloud-init scripts
…solvers
Cluster CoreDNS doesn't recursively resolve external domains, which breaks ACME DNS-01 challenge validation. Pass --dns01-recursive-nameservers and --dns01-recursive-nameservers-only to the cert-manager controller so it queries 1.1.1.1 / 8.8.8.8 directly. Tighten the surrounding comment too.
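One way to append those flags is a JSON patch against the controller Deployment. The sketch below assumes the upstream default namespace and Deployment name (`cert-manager` for both) and adds an explicit `:53` port to each resolver, following the host:port form used in cert-manager's documentation. The commit itself does not show the exact mechanism.

```shell
#!/usr/bin/env bash
# Assumed resolver list in host:port form.
resolvers="1.1.1.1:53,8.8.8.8:53"

# Emit a JSON patch that appends both flags to the first container's args.
dns01_patch() {
  cat <<EOF
[
  {"op": "add", "path": "/spec/template/spec/containers/0/args/-",
   "value": "--dns01-recursive-nameservers=${resolvers}"},
  {"op": "add", "path": "/spec/template/spec/containers/0/args/-",
   "value": "--dns01-recursive-nameservers-only"}
]
EOF
}

# Not called here: requires a live cluster.
apply_dns01_patch() {
  kubectl -n cert-manager patch deployment cert-manager \
    --type=json -p "$(dns01_patch)"
}

dns01_patch
```

Note that a JSON `add` with a trailing `/-` appends on every run, so a re-applied bootstrap would need a guard (for example, checking the current args first) to stay idempotent.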
Without ARGOCD_CLUSTER_CONFIG_NAMESPACES on the argocd-operator deployment, the spawned ArgoCD application-controller runs in namespaced mode and can't manage cluster-scoped resources (ClusterRole/ClusterRoleBinding/etc.). Any operator that ships those — tor-controller, cloudflare-operator, etc. — fails to sync with `cannot be managed when in namespaced mode`. Set the env var to `argocd` so the ArgoCD CR in that namespace gets cluster-scope permissions on reconcile.
Drop a /etc/rancher/k3s/config.yaml.d/oidc.yaml that points the kube-apiserver at ArgoCD's embedded Dex issuer. Headlamp (and any OIDC-aware kubectl) forwards the user's Dex-issued ID token to the apiserver; without these flags the apiserver treats the token as unknown and 401s every request. Username comes from the email claim, groups from Dex's GitHub team mapping. RBAC binding for makeitworkcloud:admins -> cluster-admin lives in kustomize-cluster/bootstrap/oidc-rbac.yaml.
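A drop-in of this shape could be written as below. The issuer hostname `argocd.example.com` and the client ID `headlamp` are placeholders, not values from this PR. The `email` username claim follows the commit text, and `groups` is assumed to be the claim Dex uses for team membership.

```shell
#!/usr/bin/env bash
# Hypothetical values: replace with the real Argo CD hostname and Dex client ID.
issuer_url="https://argocd.example.com/api/dex"
client_id="headlamp"

# Write the k3s drop-in; each list entry becomes a kube-apiserver flag.
write_oidc_config() {
  local target="${1:-/etc/rancher/k3s/config.yaml.d/oidc.yaml}"
  mkdir -p "$(dirname "$target")"
  cat > "$target" <<EOF
kube-apiserver-arg:
  - "oidc-issuer-url=${issuer_url}"
  - "oidc-client-id=${client_id}"
  - "oidc-username-claim=email"
  - "oidc-groups-claim=groups"
EOF
}

# Demo against a scratch directory instead of the real /etc path.
demo_dir="$(mktemp -d)"
write_oidc_config "$demo_dir/oidc.yaml"
cat "$demo_dir/oidc.yaml"
```

k3s reads its configuration at startup, so the service has to be restarted before the apiserver picks up the new flags.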
Drop the nested container override now that the arc-tf runner-set in kustomize-cluster runs the tfroot-runner image directly.
Summary
Migrates `tfroot-libvirt` off the retired CRC/AAP stack onto a self-bootstrapping k3s VM, then iterates on the cloud-init bootstrap until the cluster reliably comes up green on Fedora 44.

Provisioning structure
- Drops the AAP integration: the `aap` provider, `awx_*` sops keys, runner module's `enable_aap` arguments.
- Ports `configure_runner.yml` into cloud-init `runcmd`. Self-installs the GHA binary, registers via the org `github_token` PAT (shared with `tfroot-github`), runs as a systemd service. The PAT-bearing installer lives in `/run/` so it doesn't survive reboot. `installdependencies.sh` is invoked because Fedora 44 doesn't ship libicu by default. Re-applies are idempotent against the persistent xfs `/opt/actions-runner` volume.
- Adds `module "k3s"` (6 vCPU / 16 GiB / 100 GiB) on a dedicated `libvirt_pool "cluster"` backed by `/mnt/nvme/cluster`. Cloud-init: SELinux permissive, k3s with Traefik+ServiceLB disabled, then a four-step bootstrap chain: sops-age-keys Secret → upstream cert-manager (with DNS-01 nameservers patched onto the controller args) → argocd-operator `config/default` + self-signed Issuer/Certificate to provision its `webhook-server-cert` → `kubectl apply -k` of `kustomize-cluster/bootstrap`.
- CI runs on the `arc-dind` self-hosted runner with the GHCR `tfroot-runner` container.
- The libvirt SSH key and known_hosts are materialized under `.terraform/libvirt-ssh/` by the `make libvirt-ssh` target (auto-run by `make init`). No `~/.ssh/id_rsa` required.

Operator instructions
This PR does not auto-rotate state. After merge:
1. Remove the sops keys `awx_controller`, `awx_username`, `awx_password`, `proxyhost` (already done in this branch).
2. Provide `github_token` matching `tfroot-github/secrets/secrets.yaml`, plus `sops_age_key`, `ops_ssh_privkey`, `hero_known_hosts`, `runner_ip_addr` (already done in this branch).
3. Create the pool directory: `ssh user@hero 'sudo mkdir -p /mnt/nvme/cluster'` (done).
4. Run `make init` then `make apply`. Boot-disk creation is ~5–7 minutes per VM; if you hit `Volume Upload Failed: unexpected EOF` just re-run.

See `AGENTS.md` (also refreshed in this PR) for SSH access patterns, common apply hiccups, and recovery operations.

Test plan
- `OpenTofu` job passes
- `make init && make apply` from a fresh checkout produces 12 resources
- `bootstrap-secrets` Application reaches Synced + Healthy
- `libvirt`, status `online`

Pairs with
- `kustomize-cluster` PR: drops OLM artifacts so `gitops-operators` can sync on k3s.
- `images` PR: `gh-cli` switches to numeric `USER 1000` so `runAsNonRoot` passes for `ci-token-sync`.

🤖 Generated with Claude Code