controller: write config agent version info to ClickHouse#3578
controller: write config agent version info to ClickHouse#3578nikw9944 wants to merge 2 commits into
Conversation
Write agent and controller version info to a separate ReplacingMergeTree table keyed by device_pubkey. ClickHouse merges rows down to one per device; Lake queries with FINAL. The existing controller_grpc_getconfig_success table stays lean (timestamp + device_pubkey only) for event tracking.
b027dcf to
ca4f2fc
Compare
ben-dz
left a comment
There was a problem hiding this comment.
Small, additive, backward-compatible feature; build/vet/tests pass on the branch. No critical or high issues. One Medium worth addressing before merge: (M1) flushVersions does not reset the shared consecutiveErrors counter on success like flushEvents does, a copy-paste asymmetry that skews WARN→ERROR log escalation. Plus two Low design notes: write amplification (a near-static ReplacingMergeTree row per device per poll) and empty-field overwrite of a good version row. No injection, no secret leakage; schema change is additive.
| cw.recordFlushError("error closing clickhouse versions batch", err) | ||
| return | ||
| } | ||
| cw.log.Debug("flushed version events to clickhouse", "count", len(versions)) |
There was a problem hiding this comment.
flushVersions never resets consecutiveErrors = 0 on success, unlike flushEvents (line 192). Both paths share the counter that drives WARN→ERROR escalation. Common case is masked because flushEvents runs first and resets, but if the events flush errors early (e.g. PrepareBatch failure) while versions succeed, the counter stays elevated though writes are flowing. Reads as a copy-paste omission. Fix: reset here on success, or reset once in flush() after both sub-flushes succeed.
| Timestamp: reqStart, | ||
| DevicePubkey: req.GetPubkey(), | ||
| }) | ||
| c.clickhouse.RecordVersion(versionEvent{ |
There was a problem hiding this comment.
Write amplification: RecordVersion runs on every GetConfig poll, but agent version changes very rarely, so this inserts a near-static ReplacingMergeTree row per device per poll. Intentional per the PR design (merges down, reads use FINAL) and matches the sibling getconfig_success pattern, but the churn-to-value ratio is high here. Optional: keep a per-device last-seen and only RecordVersion on change. Separately (L2): this runs even when all three agent fields are empty, so a blank report from an old/restarting agent overwrites the last good version row until the next non-empty poll — consider skipping when all agent fields are empty.
Summary of Changes
controller_agent_versionsClickHouse table usingReplacingMergeTree(updated_at)keyed bydevice_pubkeyto track the latest agent and controller version per deviceFINALfor deduplicated resultscontroller_grpc_getconfig_successlean (timestamp+device_pubkeyonly) — separates event tracking from version statecontroller_agent_versionsto Lake's external remote table proxiesDiff Breakdown
Small feature — mostly core logic (ClickHouse writer + flush), with minimal scaffolding.
Testing Verification
go build ./controlplane/controller/...— builds cleango test ./controlplane/controller/...— all existing tests passgo vet ./controlplane/controller/...— no issues