obs(proxy): classify secondary write failures by reason#585
Conversation
Adds proxy_secondary_write_errors_by_reason_total{cmd,reason} so OCC
"write conflict" incidents (e.g. the !txn|rb| races from the post-#581
log pattern) and other secondary-write failure classes are observable in
Prometheus/Grafana rather than only in logs. Keeps the existing
unlabelled counter so current dashboards keep working.
|
Warning Rate limit exceeded
Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 59 minutes and 42 seconds. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughThe changes add error classification and metrics tracking for secondary write failures in a dual-write proxy. A new helper function categorizes secondary errors into predefined reason labels, with metrics recorded by command and reason. Infrastructure changes include a new Prometheus counter vector, and comprehensive tests verify the classification logic and metric recording. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces more granular error tracking for secondary write failures by adding a new Prometheus metric, secondary_write_errors_by_reason_total. This metric categorizes errors into specific reasons such as write conflicts, retry limits, and transaction finalization. The changes include a new classification function and associated unit tests. Feedback was provided to ensure the metric's help string accurately reflects all possible error classifications.
| SecondaryWriteErrorsByReason: prometheus.NewCounterVec(prometheus.CounterOpts{ | ||
| Namespace: "proxy", | ||
| Name: "secondary_write_errors_by_reason_total", | ||
| Help: "secondary write failures broken out by redis command and error classification (write_conflict / retry_limit / not_leader / deadline_exceeded / other)", |
There was a problem hiding this comment.
The help string for secondary_write_errors_by_reason_total is missing the txn_already_finalized classification reason. Keeping the help text in sync with the implementation ensures that operators can correctly interpret the metric labels without diving into the source code.
| Help: "secondary write failures broken out by redis command and error classification (write_conflict / retry_limit / not_leader / deadline_exceeded / other)", | |
| Help: "secondary write failures broken out by redis command and error classification (write_conflict / retry_limit / not_leader / deadline_exceeded / txn_already_finalized / other)", |
|
/gemini review |
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
proxy/metrics.go (1)
52-56: Label-name and help-text consistency with existing counters.Two small consistency nits on the new CounterVec:
- Label is
"cmd"while the pre-existingCommandTotal(line 33) uses"command". Dashboards/joins across the two vectors now have to rename labels. Prefer"command"for consistency.Helpstarts lowercase ("secondary write failures...") and is long/parenthetical, whereas the other help strings in this file are short sentences starting with a capital (e.g."Total write errors from the secondary backend."). Minor style drift.♻️ Proposed change
SecondaryWriteErrorsByReason: prometheus.NewCounterVec(prometheus.CounterOpts{ Namespace: "proxy", Name: "secondary_write_errors_by_reason_total", - Help: "secondary write failures broken out by redis command and error classification (write_conflict / retry_limit / not_leader / deadline_exceeded / txn_already_finalized / other)", - }, []string{"cmd", "reason"}), + Help: "Secondary write failures broken out by command and error classification (write_conflict, retry_limit, not_leader, deadline_exceeded, txn_already_finalized, other).", + }, []string{"command", "reason"}),Also update the
WithLabelValues("SET", ...)call sites inproxy/dualwrite.goandproxy/metrics_test.goaccordingly.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@proxy/metrics.go` around lines 52 - 56, Change the CounterVec SecondaryWriteErrorsByReason to use the label name "command" (not "cmd") to match CommandTotal, and update its Help string to a short capitalized sentence (e.g., "Secondary write errors broken out by reason."). Then update all call sites that use SecondaryWriteErrorsByReason.WithLabelValues(...), including the WithLabelValues("SET", ...) calls in proxy/dualwrite.go and proxy/metrics_test.go, to pass the "command" label value in the same position as other metrics.proxy/metrics_test.go (1)
14-75: LGTM — table-driven coverage and both-counter assertion look good.Classifier cases cover the interesting precedence (retry-limit beating embedded write-conflict),
errors.Isvs substring paths for deadlines, and the unknown-fallback. The second test correctly asserts the unlabelled counter is preserved for backwards-compat and that distinct(cmd, reason)pairs accumulate independently.One optional nit: since these counters are whole-number increments,
assert.Equal(t, float64(1), testutil.ToFloat64(...))reads more intention-revealing thanInDelta(..., 0.001), but functionally equivalent.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@proxy/metrics_test.go` around lines 14 - 75, Tests use assert.InDelta to check whole-number counter increments; replace those InDelta assertions with exact equality asserts (assert.Equal) to make intent clearer. In TestRecordSecondaryWriteFailureEmitsBothCounters, update the assertions that call testutil.ToFloat64(metrics.SecondaryWriteErrors) and testutil.ToFloat64(metrics.SecondaryWriteErrorsByReason.WithLabelValues(...)) to use assert.Equal(t, float64(<expected>), testutil.ToFloat64(...)) instead of assert.InDelta(..., 0.001), keeping the same expected values and label pairs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@proxy/dualwrite.go`:
- Around line 494-520: Update classifySecondaryWriteError to detect the
additional transient and leadership error messages so they aren’t misclassified:
in the function classifySecondaryWriteError add checks (preserving order
semantics) for "leader not found" (or broaden the existing "not leader" check to
include that phrase), and for transaction/meta/transient error substrings such
as "txn locked", "txn commit ts required", "txn meta missing", "txn invalid
meta", and "txn timestamp overflow" (map them to appropriate labels like
"not_leader", "txn_locked", "txn_meta_error" or a single transient label as your
metrics require); keep the defensive nil branch intact and ensure the new cases
are placed before the generic default to avoid fallback to "other".
---
Nitpick comments:
In `@proxy/metrics_test.go`:
- Around line 14-75: Tests use assert.InDelta to check whole-number counter
increments; replace those InDelta assertions with exact equality asserts
(assert.Equal) to make intent clearer. In
TestRecordSecondaryWriteFailureEmitsBothCounters, update the assertions that
call testutil.ToFloat64(metrics.SecondaryWriteErrors) and
testutil.ToFloat64(metrics.SecondaryWriteErrorsByReason.WithLabelValues(...)) to
use assert.Equal(t, float64(<expected>), testutil.ToFloat64(...)) instead of
assert.InDelta(..., 0.001), keeping the same expected values and label pairs.
In `@proxy/metrics.go`:
- Around line 52-56: Change the CounterVec SecondaryWriteErrorsByReason to use
the label name "command" (not "cmd") to match CommandTotal, and update its Help
string to a short capitalized sentence (e.g., "Secondary write errors broken out
by reason."). Then update all call sites that use
SecondaryWriteErrorsByReason.WithLabelValues(...), including the
WithLabelValues("SET", ...) calls in proxy/dualwrite.go and
proxy/metrics_test.go, to pass the "command" label value in the same position as
other metrics.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 54b329f9-3599-4f6c-860b-dd2f8fa1b912
📒 Files selected for processing (3)
proxy/dualwrite.goproxy/metrics.goproxy/metrics_test.go
| // classifySecondaryWriteError maps a secondary-write error to a small fixed set | ||
| // of reason labels suitable for a Prometheus counter. The elastickv secondary | ||
| // backend is in-house, so matching on substrings of the error message is safe. | ||
| // | ||
| // Order matters: "retry limit exceeded" is checked before "write conflict" | ||
| // because the retry-limit message embeds the underlying conflict string, and | ||
| // we want the outer (retry_limit) classification to win. | ||
| func classifySecondaryWriteError(err error) string { | ||
| if err == nil { | ||
| return "other" | ||
| } | ||
| msg := err.Error() | ||
| switch { | ||
| case strings.Contains(msg, "retry limit exceeded"): | ||
| return "retry_limit" | ||
| case strings.Contains(msg, "write conflict"): | ||
| return "write_conflict" | ||
| case errors.Is(err, context.DeadlineExceeded) || strings.Contains(msg, "deadline exceeded"): | ||
| return "deadline_exceeded" | ||
| case strings.Contains(msg, "not leader"): | ||
| return "not_leader" | ||
| case strings.Contains(msg, "txn already committed") || strings.Contains(msg, "txn already aborted"): | ||
| return "txn_already_finalized" | ||
| default: | ||
| return "other" | ||
| } | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Confirm the exact wording of "retry limit" errors produced anywhere in the tree,
# and survey leadership-related error strings that the classifier should match.
rg -nP -C2 'retry[ _-]?limit' --type=go
rg -nP -C2 '"(not leader|leader not found|txn locked|txn already (committed|aborted)|write conflict)"' --type=goRepository: bootjp/elastickv
Length of output: 10524
Fix classifier to handle additional error types and prevent misclassification.
The substring matching approach for error classification has gaps:
-
"leader not found"containsadapter/internal.go:40'sErrLeaderNotFoundmessage, but the classifier checks only for"not leader"(line 513). This causes leadership-related failures to be misclassified as"other"despite being distinct from transient write conflicts. -
Several transient error types fall to
"other"and won't be visible on dashboards:ErrTxnLocked("txn locked") — common OCC contention signalErrTxnCommitTSRequired,ErrTxnMetaMissing,ErrTxnInvalidMeta,ErrTxnTimestampOverflow— metadata/timestamp issues
-
The
nilbranch (line 502–503) is unreachable fromrecordSecondaryWriteFailurebut safe to keep defensively.
Consider adding checks for the above error strings or extending the existing "not leader" check to match "leader not found".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@proxy/dualwrite.go` around lines 494 - 520, Update
classifySecondaryWriteError to detect the additional transient and leadership
error messages so they aren’t misclassified: in the function
classifySecondaryWriteError add checks (preserving order semantics) for "leader
not found" (or broaden the existing "not leader" check to include that phrase),
and for transaction/meta/transient error substrings such as "txn locked", "txn
commit ts required", "txn meta missing", "txn invalid meta", and "txn timestamp
overflow" (map them to appropriate labels like "not_leader", "txn_locked",
"txn_meta_error" or a single transient label as your metrics require); keep the
defensive nil branch intact and ensure the new cases are placed before the
generic default to avoid fallback to "other".
## Summary
- Adds a protocol-independent OCC write-conflict metric at the MVCC
store layer. Both `checkConflicts` (write-write) and
`checkReadConflicts` (read-write) in `store/lsm_store.go` /
`store/mvcc_store.go` now increment a bounded per-(kind, key_prefix)
counter immediately before returning `NewWriteConflictError`; a new
`monitoring.WriteConflictCollector` polls the stores every 5 s (matching
the existing `DispatchCollector` / `PebbleCollector` cadence) and
mirrors deltas into `elastickv_store_write_conflict_total{group, kind,
key_prefix}`.
- `kind` splits `read` (RW conflict via read set) from `write` (WW
conflict via mutation set) since ops implications differ. `key_prefix`
uses a bounded classification aligned with existing prefix constants
(`!txn|lock|`, `!txn|rb|`, `!redis|str|`, `!hs|`, `!zs|`, `!lst|`,
`!ddb|`, ...) with an `other` fallback so user-supplied keys cannot grow
cardinality.
## Motivation
`#585` adds the proxy-side view (per client request). This PR adds the
underlying store-side view (per Raft-applied proposal), so the signal is
visible regardless of protocol adapter (Redis, DynamoDB, raw KV). In the
`!txn|rb|` production incident the proxy counter spiked per user request
but a single request fanned out to many Raft proposals; the store-side
metric would have surfaced the real pressure (txn_rollback bucket)
directly.
## Sample Grafana query
```
sum by (key_prefix) (rate(elastickv_store_write_conflict_total[5m]))
```
Split by conflict kind:
```
sum by (kind, key_prefix) (rate(elastickv_store_write_conflict_total[5m]))
```
## Test plan
- [x] `go test -race -count=1 ./store/... ./monitoring/... ./adapter/...
./kv/...`
- [x] `make lint`
- [x] New unit tests:
- `store/write_conflict_counter_test.go`: table test for
`classifyWriteConflictKey` covering every bucket; counter record /
snapshot / nil-safety.
- `store/write_conflict_store_test.go`: integration tests for both
pebble-backed and in-memory MVCC stores exercising the WW and RW
conflict paths and asserting the right bucket increments.
- `monitoring/write_conflict_test.go`: collector delta behaviour,
source-reset (counter-down) rebasing, malformed-label skip, nil-safety.
|
@codex code review |
Address reviewer feedback on the secondary-write error classifier: - map "leader not found" (kv.ErrLeaderNotFound / adapter.ErrLeaderNotFound) to not_leader alongside the existing "not leader" substring check, so leadership failures stop leaking into the generic "other" bucket. - introduce a dedicated txn_locked reason for kv.ErrTxnLocked, a common OCC contention signal that was previously classified as other. - pull the substring table into a package-level slice and iterate it, keeping the function under the cyclomatic-complexity limit. - extend metrics_test and the metric help string accordingly.
|
/gemini review |
|
@codex review |
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
|
Codex Review: Didn't find any major issues. Swish! ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
|
@codex review |
|
Codex Review: Didn't find any major issues. What shall we delve into next? ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
Motivation
After #581 we still see occasional secondary-write failures in production with payloads like
!txn|rb|!redis|ttl|...(OCC write conflicts racing against rollback tombstones). Today the only signal is thesecondary write failedlog line fromrecordSecondaryWriteFailure— the unlabelledproxy_secondary_write_errors_totalcounter tells us something is wrong but gives no hint whether the failure is an OCC conflict, a retry-loop giving up, a lost leader, or a timeout. Ops have been grepping logs to triage.Change
proxy_secondary_write_errors_by_reason_total{cmd,reason}(CounterVec) alongside the existing unlabelled counter (kept for dashboard backwards-compat).classifySecondaryWriteError(err) stringmaps error strings to one of:retry_limit—"retry limit exceeded"(checked first, since the retry-limit message embeds"write conflict")write_conflict— raw OCC conflict, e.g.!txn|rb|racesdeadline_exceeded—errors.Is(err, context.DeadlineExceeded)or substringnot_leader— leadership lost mid-dispatchtxn_already_finalized—"txn already committed"/"txn already aborted"other— everything elserecordSecondaryWriteFailureincrements both counters.Suggested Grafana query
A per-reason stacked graph makes the
!txn|rb|conflict pattern (the motivating incident) show up as a spike inreason="write_conflict"without needing to touch logs.Test plan
go test -race -count=1 ./proxy/...make lintSummary by CodeRabbit
Chores
Tests