
controller: write config agent version info to ClickHouse #3578

Open
nikw9944 wants to merge 2 commits into main from controller-agent-versions-clickhouse
nikw9944 commented Apr 24, 2026

Contributor

Summary of Changes

  • Add a controller_agent_versions ClickHouse table using ReplacingMergeTree(updated_at) keyed by device_pubkey to track the latest agent and controller version per device
  • The controller writes on every GetConfig poll; ClickHouse merges rows down to one per device. Lake queries with FINAL for deduplicated results
  • Keep controller_grpc_getconfig_success lean (timestamp + device_pubkey only) — separates event tracking from version state
  • Companion: lake#534 (web/api: show agent versions on device detail page) adds controller_agent_versions to Lake's external remote table proxies
  • Allows malbeclabs/lake to display device config agent version info
  • Schema is still simple so not adding goose migrations at this point
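The read-side semantics described above can be sketched in Go. This is only a model of what a FINAL read over ReplacingMergeTree(updated_at) returns, assuming the table's sorting key is device_pubkey: one row per device, the one with the greatest updated_at, with the most recently inserted row winning a tie. The struct and its version field names are hypothetical; the PR text only names device_pubkey and updated_at.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// versionRow models one inserted row. Only DevicePubkey and UpdatedAt come
// from the PR description; the two version field names are hypothetical.
type versionRow struct {
	DevicePubkey      string
	AgentVersion      string // hypothetical column name
	ControllerVersion string // hypothetical column name
	UpdatedAt         time.Time
}

// latestPerDevice models a FINAL read: for each device_pubkey it keeps the
// row with the greatest updated_at. On equal timestamps the later insert wins.
func latestPerDevice(rows []versionRow) []versionRow {
	latest := make(map[string]versionRow)
	for _, r := range rows {
		cur, ok := latest[r.DevicePubkey]
		if !ok || !r.UpdatedAt.Before(cur.UpdatedAt) {
			latest[r.DevicePubkey] = r
		}
	}
	out := make([]versionRow, 0, len(latest))
	for _, r := range latest {
		out = append(out, r)
	}
	// Sort only to make the output deterministic for the example.
	sort.Slice(out, func(i, j int) bool { return out[i].DevicePubkey < out[j].DevicePubkey })
	return out
}

// sampleRows simulates three GetConfig polls from two devices.
func sampleRows() []versionRow {
	t0 := time.Date(2026, 4, 24, 0, 0, 0, 0, time.UTC)
	return []versionRow{
		{"devA", "1.0.0", "2.0.0", t0},
		{"devB", "1.0.0", "2.0.0", t0},
		{"devA", "1.1.0", "2.0.0", t0.Add(time.Minute)},
	}
}

func main() {
	for _, r := range latestPerDevice(sampleRows()) {
		fmt.Printf("%s agent=%s controller=%s\n", r.DevicePubkey, r.AgentVersion, r.ControllerVersion)
	}
}
```

Background merges in ClickHouse happen at an unspecified time, so a read without FINAL may still see several rows per device; that is why the design relies on FINAL for deduplicated results.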

Diff Breakdown

| Category | Files | Lines (+/-) | Net |
|---|---|---|---|
| Core logic | 2 | +98 / -12 | +86 |
| Scaffolding | 1 | +1 / -1 | +0 |
| Docs | 1 | +21 / -2 | +19 |
| Total | 4 | +120 / -15 | +105 |

Small feature — mostly core logic (ClickHouse writer + flush), with minimal scaffolding.

Testing Verification

  • go build ./controlplane/controller/... — builds clean
  • go test ./controlplane/controller/... — all existing tests pass
  • go vet ./controlplane/controller/... — no issues

Write agent and controller version info to a separate
ReplacingMergeTree table keyed by device_pubkey. ClickHouse
merges rows down to one per device; Lake queries with FINAL.

The existing controller_grpc_getconfig_success table stays
lean (timestamp + device_pubkey only) for event tracking.
nikw9944 force-pushed the controller-agent-versions-clickhouse branch from b027dcf to ca4f2fc April 24, 2026 00:42
nikw9944 requested review from packethog and snormore April 24, 2026 00:46
nikw9944 changed the title from "controller: add controller_agent_versions ClickHouse table" to "controller: write config agent version info to ClickHouse" Apr 24, 2026

ben-dz left a comment

Contributor

Small, additive, backward-compatible feature; build/vet/tests pass on the branch. No critical or high issues. One Medium worth addressing before merge: (M1) flushVersions does not reset the shared consecutiveErrors counter on success like flushEvents does, a copy-paste asymmetry that skews WARN→ERROR log escalation. Plus two Low design notes: write amplification (a near-static ReplacingMergeTree row per device per poll) and empty-field overwrite of a good version row. No injection, no secret leakage; schema change is additive.

		cw.recordFlushError("error closing clickhouse versions batch", err)
		return
	}
	cw.log.Debug("flushed version events to clickhouse", "count", len(versions))

flushVersions never resets consecutiveErrors = 0 on success, unlike flushEvents (line 192). Both paths share the counter that drives WARN→ERROR escalation. Common case is masked because flushEvents runs first and resets, but if the events flush errors early (e.g. PrepareBatch failure) while versions succeed, the counter stays elevated though writes are flowing. Reads as a copy-paste omission. Fix: reset here on success, or reset once in flush() after both sub-flushes succeed.
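The second suggested fix (reset once in flush() after both sub-flushes succeed) might look roughly like the sketch below. It is not the controller's code: the writer type, the injected function fields, and errFlush are hypothetical stand-ins so the example runs without ClickHouse, and it counts one error per flush cycle rather than reproducing whatever recordFlushError does. Only the names consecutiveErrors, flushEvents, flushVersions, and flush come from the review.

```go
package main

import (
	"errors"
	"fmt"
)

// errFlush is a hypothetical error used to simulate a failed sub-flush.
var errFlush = errors.New("flush failed")

// writer is a stand-in for the ClickHouse writer. The sub-flushes are
// injected as function fields (an assumption) so they can be simulated.
type writer struct {
	consecutiveErrors int
	flushEvents       func() error
	flushVersions     func() error
}

// flush runs both sub-flushes and owns the shared counter: it increments
// when either sub-flush fails and resets only when both succeed.
func (w *writer) flush() {
	errEvents := w.flushEvents()
	errVersions := w.flushVersions()
	if errEvents != nil || errVersions != nil {
		w.consecutiveErrors++
		return
	}
	w.consecutiveErrors = 0
}

func main() {
	failing := true
	w := &writer{
		flushEvents: func() error {
			if failing {
				return errFlush
			}
			return nil
		},
		flushVersions: func() error { return nil },
	}
	w.flush()
	w.flush()
	fmt.Println("after two failed cycles:", w.consecutiveErrors)
	failing = false
	w.flush()
	fmt.Println("after a clean cycle:", w.consecutiveErrors)
}
```

Keeping the reset in one place avoids the asymmetry between the two sub-flushes. The trade-off is that the counter stays elevated while either path is failing, even if the other is writing normally, which seems like the intended meaning of a shared escalation counter.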

	Timestamp:    reqStart,
	DevicePubkey: req.GetPubkey(),
})
c.clickhouse.RecordVersion(versionEvent{

Write amplification: RecordVersion runs on every GetConfig poll, but agent version changes very rarely, so this inserts a near-static ReplacingMergeTree row per device per poll. Intentional per the PR design (merges down, reads use FINAL) and matches the sibling getconfig_success pattern, but the churn-to-value ratio is high here. Optional: keep a per-device last-seen and only RecordVersion on change. Separately (L2): this runs even when all three agent fields are empty, so a blank report from an old/restarting agent overwrites the last good version row until the next non-empty poll — consider skipping when all agent fields are empty.
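Both optional suggestions could be combined in a small gate placed in front of RecordVersion, sketched below. Everything here is hypothetical: the agentVersion struct and its three field names are placeholders for the "three agent fields" mentioned above, and versionGate and shouldRecord are illustrative names, not part of the PR.

```go
package main

import (
	"fmt"
	"sync"
)

// agentVersion stands in for the three agent fields; the field names are
// hypothetical, since the review does not list them.
type agentVersion struct {
	Version string
	Commit  string
	Date    string
}

// empty reports whether all three agent fields are blank.
func (v agentVersion) empty() bool { return v == (agentVersion{}) }

// versionGate remembers the last version recorded per device so the caller
// can skip writes that would not change the stored row.
type versionGate struct {
	mu       sync.Mutex
	lastSeen map[string]agentVersion
}

func newVersionGate() *versionGate {
	return &versionGate{lastSeen: make(map[string]agentVersion)}
}

// shouldRecord returns false for a blank report (so it cannot overwrite a
// good row) and for a report identical to the last one recorded.
func (g *versionGate) shouldRecord(pubkey string, v agentVersion) bool {
	if v.empty() {
		return false
	}
	g.mu.Lock()
	defer g.mu.Unlock()
	if prev, ok := g.lastSeen[pubkey]; ok && prev == v {
		return false
	}
	g.lastSeen[pubkey] = v
	return true
}

func main() {
	g := newVersionGate()
	v1 := agentVersion{Version: "1.0.0", Commit: "abc123", Date: "2026-04-01"}
	fmt.Println(g.shouldRecord("devA", v1))             // first report
	fmt.Println(g.shouldRecord("devA", v1))             // unchanged report
	fmt.Println(g.shouldRecord("devA", agentVersion{})) // blank report
}
```

Two caveats with this approach: the cache lives in memory, so a controller restart causes one extra write per device, and updated_at would then mean "last version change" rather than "last poll". If the row also carries the controller version, the cached value would need to include it so a controller upgrade still triggers a write.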
