feat!(cozystack): wizard chain + 2-plugin consolidation (breaking)#11
Conversation
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (51)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request restructures the repository by consolidating individual skills into two primary plugin bundles, cozystack and linstor, and introduces a wizard orchestrator to streamline the installation process. It adds extensive documentation and reference guides for cluster installation, upgrades, debugging, and node bootstrapping for Talos and Ubuntu environments. A new validation script, check-refs.sh, is also included to maintain cross-reference integrity. Review feedback identifies a bug in the diagnostic bundle's tar command and points out that the skill count for the cozystack plugin needs to be updated for accuracy in the documentation.
| kubectl --context $CTX --namespace "$ns" describe hr "$name" > "$DUMP/hr-${ns}-${name}.txt" | ||
| done | ||
|
|
||
| tar -czf "${DUMP}.tar.gz" --directory /tmp "$(basename "$DUMP")" |
There was a problem hiding this comment.
The tar command for creating the diagnostic bundle has an incorrect --directory argument. It's set to /tmp, but the directory to be archived (diagnostics-${TS}) is created under $CONFIG_DIR. This will cause the command to fail if $CONFIG_DIR is not /tmp.
| tar -czf "${DUMP}.tar.gz" --directory /tmp "$(basename "$DUMP")" | |
| tar -czf "${DUMP}.tar.gz" --directory "$CONFIG_DIR" "$(basename "$DUMP")" |
| ### cozystack | ||
|
|
||
| | Plugin | Description | | ||
| Platform skills bundle. One install gives you nine skills, invoked as `/cozystack:<name>`. Start with `/cozystack:wizard` — it asks Talos / Ubuntu / Existing and picks the chain. |
There was a problem hiding this comment.
The description for the cozystack plugin states that it provides nine skills, but it actually includes ten. This should be updated for accuracy.
| Platform skills bundle. One install gives you nine skills, invoked as `/cozystack:<name>`. Start with `/cozystack:wizard` — it asks Talos / Ubuntu / Existing and picks the chain. | |
| Platform skills bundle. One install gives you ten skills, invoked as `/cozystack:<name>`. Start with `/cozystack:wizard` — it asks Talos / Ubuntu / Existing and picks the chain. |
|
|
||
| ```text | ||
| plugins/ | ||
| cozystack/ # platform bundle (9 skills) |
There was a problem hiding this comment.
Major refactor of the marketplace plus a new wizard chain that takes
an operator from raw hardware / cloud VMs to a working Cozystack
install end-to-end.
Repository refactor (BREAKING for existing operators):
Five separate plugins consolidated into two:
cozy-bump → cozystack:package-bump
cozy-deploy → cozystack:package-deploy
cozy-external-app → cozystack:external-app-create
cozystack-upgrade → cozystack:cluster-upgrade
drbd-recovery → linstor:recover
The four carried-over cozystack-* skills now live under one
cozystack plugin alongside six new skills; drbd-recovery becomes
linstor:recover under its own plugin.
Operators must uninstall the old single-skill plugins and install
the new bundles. Old install paths (`/plugin install cozy-deploy@…`)
and skill IDs (`/cozy-deploy:cozy-deploy`) no longer work.
Skill renames carry intent: cozy-bump → package-bump says what the
skill does; cozystack-upgrade → cluster-upgrade is consistent with
cluster-install; drbd-recovery → linstor:recover reflects what the
tool actually operates on (LINSTOR-side, with DRBD as one layer).
New skills under cozystack:
- wizard: orchestrator. Free-form intent intake, parses hints,
picks Talos / Ubuntu / Existing, dispatches the chain via a
cluster config directory the operator owns. Phase 4.5 runtime
research surfaces known landmines for the specific combination
against upstream docs and issue trackers.
- talos-bootstrap: Talos via talm. Maintenance-mode probe
(Talos-1.12-aware via `get disks --insecure`), NAT-provider
cert-SAN guardrail before first talm apply (auto-populates
values.yaml.certSANs with public IPs when OCI / GCP-NAT / AWS-EIP
signature is detected), multidoc machine-config with per-node
VIP-link IPv4 stubs, etcd bootstrap, kubeconfig fetch. Phase 11.5
auto-upgrades from base Talos to cozystack-tuned image when
nodes booted from the wrong image.
- talos-reset: cloud-provider terminate+relaunch helper for
OCI / AWS / GCP / Hetzner when nodes are unrecoverable from
inside the cluster. Preserves block volumes, secondary VNICs,
NSG memberships. Sequential per-node to maintain etcd quorum.
- ubuntu-bootstrap: wraps cozystack/ansible-cozystack for
Ubuntu / Debian k3s bootstrap.
- cluster-install: Cozystack on a ready cluster. Node-readiness
validation, ZFS pool provisioning via privileged DaemonSet on
Talos (hostNetwork: true because CNI is not up yet), extractedprism
for kube-apiserver HA, OCI-tag-normalized cozy-installer chart,
Platform Package apply, inline tenants/root ingress patch +
LINSTOR pool registration during the watch loop, Phase 8.6
default StorageClasses for v1.3.x, Phase 9.1 end-to-end
reachability probe.
- debug: investigate stuck or broken installs. Classifies operator
error / config drift / upstream bug / not-yet-supported, applies
fixes or workarounds, drafts upstream issues on approval (never
opens issues silently).
Renamed-skill changes beyond directory move:
- cluster-upgrade: 5 references files restructured with --context
discipline; multiple known-failure entries added.
- package-deploy: rewritten to surface ghcr.io/<your-username>
placeholder instead of a personal registry path.
- package-bump: path references updated; --context discipline.
- external-app-create: --context discipline.
- linstor:recover: every cluster-mutating kubectl call carries
--context $CTX.
CI + contributor infrastructure:
- tools/check-refs.sh — 5-check validator: references existence,
<plugin>:<skill> mention resolution, descriptions list every
shipped skill, kubectl / helm cluster-mutating invocations
carry --context / --kube-context (allow-list for read-only and
local ops), no private cluster identifiers in plugin content.
- .github/workflows/validate.yml — jq + check-refs.sh on push / PR.
- CLAUDE.md — contributor guide: layout, adding a skill, adding
a plugin, cross-reference discipline, semver bump policy.
- README.md rewritten with the new skills table and third-party
dependency section.
Real-install hardening (each finding ↔ guardrail in the right skill):
- cert-SAN trap on NAT'd providers (caught the same operator
twice across consecutive sessions). Workstation reaches nodes
via public IPs that the cloud fabric rewrites to internal IPs
before they hit the interface; Talos issues machine-cert SANs
from observed addresses (internal only); workstation TLS
handshake fails with no escape. talos-bootstrap Phase 6.3
detects the NAT signature and auto-adds public IPs to
machine.certSANs before the first apply.
- OCI 1:1 NAT externalIPs mismatch — Cilium's externalIPs BPF
matches on packet destination as seen by the host kernel.
cluster-install Phase 4 refuses `external` strategy on
NAT-fronted platforms unless opted in.
- `kubectl debug node --image=alpine:3 -- chroot /host` cannot
run zpool create on Talos: musl vs glibc loader; /bin/sh
absent on host rootfs; PSA baseline blocks the debug pod.
Phase 5.5 uses a privileged Pod from ubuntu:24.04 in a
cozy-storage-bootstrap namespace.
- LINSTOR registration race: HelmRelease Ready != Deployment
readyReplicas. STOP GATE 3 requires both no-HR-not-Ready AND
storage-pool count matches storage-node count.
- Tenants CRD landing after most HRs Ready: ingress patch is
inline in the Phase 8 watch loop, event-driven on CR existence.
- cozystack v1.3.x ships no StorageClasses: Phase 8.6 applies
`local` (1-replica) + `replicated` (3-replica, default) for
v1.3.x; skipped on v1.4+.
- OCI registry tag has no `v` prefix: ${INSTALLER_VERSION#v}
normalize before passing to helm.
- End-to-end reachability gate: Phase 9.1 curls the dashboard
from the workstation; on failure writes
failed_at: external-reachability with curl + DNS + node-addresses
detail.
Third-party dependency:
cozystack:cluster-install default-installs extractedprism on
`generic` variant clusters (k3s / kubeadm / RKE2) — per-node TCP
load balancer giving generic Linux Kubernetes the same
localhost:7445 shape Talos has built-in via KubePrism. BSD-3
(https://github.com/lexfrei/extractedprism), chart at
oci://ghcr.io/lexfrei/charts/extractedprism. Maintained
independently; reviewed and approved by the Cozystack platform
team. Opt-out via --no-extractedprism --api-host=<ip>. Talos and
hosted variants do not need extractedprism.
Validation:
- bash tools/check-refs.sh — all 5 checks green.
- Real-install end-to-end on a 3-node OCI Talos cluster, through
HTTP 200 on the dashboard ingress.
Signed-off-by: Aleksei Sviridkin <f@lex.la>
0f12037 to
1ff8fe0
Compare
TL;DR
Major refactor: 5 separate plugins consolidated into 2 (
cozystackwith 10 skills +linstorwith 1). 5 existing skills renamed; 6 new skills added; CI validator added; CLAUDE.md added; README rewritten. Breaking change for operators — old/plugin install <name>@cozystack-claude-pluginspaths and skill IDs no longer work.Repository refactor (BREAKING)
Before (main today):
marketplace.jsonlisted five plugins, each with one skill, each at its own install path.After:
Skill rename + install-path migration
/plugin install cozy-deploy@cozystack-claude-plugins→/cozy-deploy:cozy-deploy/plugin install cozystack@cozystack-claude-plugins→/cozystack:package-deploy/plugin install cozy-bump@cozystack-claude-plugins→/cozy-bump:cozy-bump/cozystack:package-bump(same plugin)/plugin install cozy-external-app@cozystack-claude-plugins→/cozy-external-app:cozy-external-app/cozystack:external-app-create(same plugin)/plugin install cozystack-upgrade@cozystack-claude-plugins→/cozystack-upgrade:cozystack-upgrade/cozystack:cluster-upgrade(same plugin)/plugin install drbd-recovery@cozystack-claude-plugins→/drbd-recovery:drbd-recovery/plugin install linstor@cozystack-claude-plugins→/linstor:recoverExisting operators should uninstall old plugins and install the new bundles once this lands.
Why consolidate
/cozystack:*namespace is the natural shape: every Cozystack-related skill ships from one plugin install, no per-skill/plugin installdance.cozystack:<name>; cross-plugin references add fragility.cozy-bump→package-bumpsays what the skill does;cozystack-upgrade→cluster-upgradeis consistent withcluster-install;drbd-recovery→linstor:recoverreflects what the tool actually operates on (LINSTOR-side, with DRBD as one of several layers).New skills (6)
talm apply, multidoc machine-config with per-node VIP-link IPv4 stubs,talm apply+talosctl bootstrap+ kubeconfig fetch. Phase 11.5 auto-upgrade to cozystack-tuned image when nodes booted from base Talos.cozystack/ansible-cozystackfor Ubuntu / Debian k3s bootstrap (OS prep, kernel modules, k3s install with cozystack-compatible flags).hostNetwork: truebecause CNI is not up yet), extractedprism for kube-apiserver HA, OCI-tag-normalized cozy-installer chart, Platform Package apply, inlinetenants/rootingress patch and LINSTOR pool registration during the watch loop, Phase 8.6 default StorageClasses for v1.3.x, Phase 9.1 end-to-end reachability probe.Renamed-skill changes (non-trivial)
Beyond directory rename, the 5 carried-over skills got real edits:
cozystack-upgrade): 5 references files restructured with--context/--kube-contextdiscipline; multiple known-failure entries added.cozy-deploy): rewritten to surfaceghcr.io/<your-username>placeholder instead of personal registry;--kube-contextdiscipline.cozy-bump): path references updated;--contextdiscipline; consistentprintf/trshell idioms.cozy-external-app):--contextdiscipline.drbd-recovery): every cluster-mutatingkubectlcall carries--context $CTX.CI + contributor infrastructure
tools/check-refs.sh— 5-check validator: references-file existence,<plugin>:<skill>mention resolution, descriptions list every shipped skill,kubectl/helmcluster-mutating invocations carry--context/--kube-context(allow-list for read-only / local operations), no private cluster identifiers in plugin / public content..github/workflows/validate.yml— jq +check-refs.shon push / PR.CLAUDE.md— contributor guidance: repository layout, adding a new skill, adding a new plugin, cross-reference discipline, semver bump policy.README.md— rewritten with the new skills table + chain diagram + third-party-dependency section.Real-install hardening
The wizard chain went through multiple end-to-end install runs on a 3-node OCI Talos cluster. Each session surfaced bugs that no code review could catch; each finding became a guardrail in the appropriate skill:
talos-bootstrapPhase 6.3 detects the NAT signature (reach_mode=public + external_ips_strategy=internal) and auto-adds public IPs tovalues.yaml.certSANsbefore the first apply.externalIPsBPF matches on packet destination as seen by the host kernel; on NAT'd providers public IPs are never present on any interface.cluster-installPhase 4 refusesexternalstrategy onintent_hints.platform ∈ {oci, gcp-with-nat, aws-with-eip}unless the operator opts in with--allow-external-on-nat-provider.kubectl debug node --image=alpine:3 -- chroot /host zpool createdoes not work on Talos. Three independent reasons (musl vs glibc loader, no/bin/shon host rootfs, PSAbaselineblocks the debug pod).cluster-installPhase 5.5 uses a privileged Pod fromubuntu:24.04in acozy-storage-bootstrapnamespace withpod-security.kubernetes.io/enforce=privileged.local(1-replica) +replicated(3-replica, default) for v1.3.x; skipped on v1.4+.vprefix.${INSTALLER_VERSION#v}normalize before passing.https://dashboard.<host>/from the workstation; on failure writesfailed_at: external-reachabilitywith curl + DNS + node-addresses detail.Third-party dependency
cozystack:cluster-installdefault-installs extractedprism ongenericvariant clusters (k3s / kubeadm / RKE2) — per-node TCP load balancer that gives generic Linux Kubernetes the samelocalhost:7445shape Talos has built-in via KubePrism. BSD-3-Clause; chart atoci://ghcr.io/lexfrei/charts/extractedprism. Maintained independently; reviewed and approved by the Cozystack platform team. Opt-out via--no-extractedprism --api-host=<ip>. Talos and hosted variants do not need extractedprism.Validation
bash tools/check-refs.sh— all 5 checks green.Reviewer notes
talos-bootstrap/SKILL.mdPhase 6.3 is load-bearing for OCI / GCP-NAT / AWS-EIP operators and has caught the same trap twice in test runs without it.tools/check-refs.shruns in CI; adding new skills requires updating the descriptions inplugin.json+marketplace.jsonso the validator stays green.