obs(proxy): classify secondary write failures by reason by bootjp · Pull Request #585 · bootjp/elastickv

bootjp · 2026-04-22T14:18:25Z

Motivation

After #581 we still see occasional secondary-write failures in production with payloads like !txn|rb|!redis|ttl|... (OCC write conflicts racing against rollback tombstones). Today the only signal is the secondary write failed log line from recordSecondaryWriteFailure — the unlabelled proxy_secondary_write_errors_total counter tells us something is wrong but gives no hint whether the failure is an OCC conflict, a retry-loop giving up, a lost leader, or a timeout. Ops have been grepping logs to triage.

Change

Adds proxy_secondary_write_errors_by_reason_total{cmd,reason} (CounterVec) alongside the existing unlabelled counter (kept for dashboard backwards-compat).
New helper classifySecondaryWriteError(err) string maps error strings to one of:
- retry_limit — "retry limit exceeded" (checked first, since the retry-limit message embeds "write conflict")
- write_conflict — raw OCC conflict, e.g. !txn|rb| races
- deadline_exceeded — errors.Is(err, context.DeadlineExceeded) or substring
- not_leader — leadership lost mid-dispatch
- txn_already_finalized — "txn already committed" / "txn already aborted"
- other — everything else
recordSecondaryWriteFailure increments both counters.
Tests: table test for the classifier + a test asserting both counters tick.

Suggested Grafana query

sum by (cmd, reason) (rate(proxy_secondary_write_errors_by_reason_total[5m]))

A per-reason stacked graph makes the !txn|rb| conflict pattern (the motivating incident) show up as a spike in reason="write_conflict" without needing to touch logs.

Test plan

go test -race -count=1 ./proxy/...
make lint

Summary by CodeRabbit

Chores
- Secondary write error monitoring now automatically classifies failures by specific error type to improve diagnostics and observability.
- New detailed metrics track write failure patterns across different error categories, enabling better visibility and root cause analysis.
Tests
- Added comprehensive unit tests validating error classification logic and metrics recording functionality across various failure scenarios.

Adds proxy_secondary_write_errors_by_reason_total{cmd,reason} so OCC "write conflict" incidents (e.g. the !txn|rb| races from the post-#581 log pattern) and other secondary-write failure classes are observable in Prometheus/Grafana rather than only in logs. Keeps the existing unlabelled counter so current dashboards keep working.

coderabbitai · 2026-04-22T14:19:29Z

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 59 minutes and 42 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 59 minutes and 42 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8f0864d1-d6d9-4241-92d8-13d26d768aa1

📥 Commits

Reviewing files that changed from the base of the PR and between f3d03e6 and f17add3.

📒 Files selected for processing (3)

proxy/dualwrite.go
proxy/metrics.go
proxy/metrics_test.go

📝 Walkthrough

Walkthrough

The changes add error classification and metrics tracking for secondary write failures in a dual-write proxy. A new helper function categorizes secondary errors into predefined reason labels, with metrics recorded by command and reason. Infrastructure changes include a new Prometheus counter vector, and comprehensive tests verify the classification logic and metric recording.

Changes

Cohort / File(s)	Summary
Metrics Infrastructure `proxy/metrics.go`	Added `SecondaryWriteErrorsByReason` counter vector to `ProxyMetrics` to track secondary write errors labeled by command and reason; maintained existing `SecondaryWriteErrors` counter.
Error Classification & Recording `proxy/dualwrite.go`	Implemented `classifySecondaryWriteError` helper to categorize errors into fixed reason labels (e.g., write_conflict, retry_limit, deadline_exceeded, not_leader, txn_already_finalized, other) using substring matching and `errors.Is` checks; `recordSecondaryWriteFailure` now increments the new labeled counter vector alongside existing counters.
Test Coverage `proxy/metrics_test.go`	Added two tests: `TestClassifySecondaryWriteError` verifies error classification logic across various error types including nil, deadline-related, and txn-related errors; `TestRecordSecondaryWriteFailureEmitsBothCounters` validates that both labeled and unlabeled metrics increment correctly for different command/reason pairs.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

fix(proxy): retry secondary writes on "read timestamp has been compacted" #528: Related through shared modification of proxy/dualwrite.go and secondary write error handling—this PR adds compacted-error detection and retry logic while the main PR implements error classification and metrics.

Poem

🐰 With whiskers twitched and errors caught,
We classify each secondary thought—
By reason, command, each path we chart,
Metrics bloom like clover in the heart! 📊

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'obs(proxy): classify secondary write failures by reason' directly and clearly describes the main change—adding classification of secondary write failures by reason through a new labeled Prometheus metric.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch obs/secondary-write-failure-reasons

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request introduces more granular error tracking for secondary write failures by adding a new Prometheus metric, secondary_write_errors_by_reason_total. This metric categorizes errors into specific reasons such as write conflicts, retry limits, and transaction finalization. The changes include a new classification function and associated unit tests. Feedback was provided to ensure the metric's help string accurately reflects all possible error classifications.

gemini-code-assist · 2026-04-22T14:24:41Z

+		SecondaryWriteErrorsByReason: prometheus.NewCounterVec(prometheus.CounterOpts{
+			Namespace: "proxy",
+			Name:      "secondary_write_errors_by_reason_total",
+			Help:      "secondary write failures broken out by redis command and error classification (write_conflict / retry_limit / not_leader / deadline_exceeded / other)",


The help string for secondary_write_errors_by_reason_total is missing the txn_already_finalized classification reason. Keeping the help text in sync with the implementation ensures that operators can correctly interpret the metric labels without diving into the source code.

Suggested change

Help: "secondary write failures broken out by redis command and error classification (write_conflict / retry_limit / not_leader / deadline_exceeded / other)",

Help: "secondary write failures broken out by redis command and error classification (write_conflict / retry_limit / not_leader / deadline_exceeded / txn_already_finalized / other)",

bootjp · 2026-04-22T15:43:39Z

/gemini review

gemini-code-assist · 2026-04-22T15:43:42Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

proxy/metrics.go (1)
52-56: Label-name and help-text consistency with existing counters.

Two small consistency nits on the new CounterVec:

Label is "cmd" while the pre-existing CommandTotal (line 33) uses "command". Dashboards/joins across the two vectors now have to rename labels. Prefer "command" for consistency.

Help starts lowercase ("secondary write failures...") and is long/parenthetical, whereas the other help strings in this file are short sentences starting with a capital (e.g. "Total write errors from the secondary backend."). Minor style drift.
♻️ Proposed change
 		SecondaryWriteErrorsByReason: prometheus.NewCounterVec(prometheus.CounterOpts{
 			Namespace: "proxy",
 			Name:      "secondary_write_errors_by_reason_total",
-			Help:      "secondary write failures broken out by redis command and error classification (write_conflict / retry_limit / not_leader / deadline_exceeded / txn_already_finalized / other)",
-		}, []string{"cmd", "reason"}),
+			Help:      "Secondary write failures broken out by command and error classification (write_conflict, retry_limit, not_leader, deadline_exceeded, txn_already_finalized, other).",
+		}, []string{"command", "reason"}),
Also update the WithLabelValues("SET", ...) call sites in proxy/dualwrite.go and proxy/metrics_test.go accordingly.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@proxy/metrics.go` around lines 52 - 56, Change the CounterVec
SecondaryWriteErrorsByReason to use the label name "command" (not "cmd") to
match CommandTotal, and update its Help string to a short capitalized sentence
(e.g., "Secondary write errors broken out by reason."). Then update all call
sites that use SecondaryWriteErrorsByReason.WithLabelValues(...), including the
WithLabelValues("SET", ...) calls in proxy/dualwrite.go and
proxy/metrics_test.go, to pass the "command" label value in the same position as
other metrics.
proxy/metrics_test.go (1)
14-75: LGTM — table-driven coverage and both-counter assertion look good.

Classifier cases cover the interesting precedence (retry-limit beating embedded write-conflict), errors.Is vs substring paths for deadlines, and the unknown-fallback. The second test correctly asserts the unlabelled counter is preserved for backwards-compat and that distinct (cmd, reason) pairs accumulate independently.

One optional nit: since these counters are whole-number increments, assert.Equal(t, float64(1), testutil.ToFloat64(...)) reads more intention-revealing than InDelta(..., 0.001), but functionally equivalent.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@proxy/metrics_test.go` around lines 14 - 75, Tests use assert.InDelta to
check whole-number counter increments; replace those InDelta assertions with
exact equality asserts (assert.Equal) to make intent clearer. In
TestRecordSecondaryWriteFailureEmitsBothCounters, update the assertions that
call testutil.ToFloat64(metrics.SecondaryWriteErrors) and
testutil.ToFloat64(metrics.SecondaryWriteErrorsByReason.WithLabelValues(...)) to
use assert.Equal(t, float64(<expected>), testutil.ToFloat64(...)) instead of
assert.InDelta(..., 0.001), keeping the same expected values and label pairs.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@proxy/dualwrite.go`:
- Around line 494-520: Update classifySecondaryWriteError to detect the
additional transient and leadership error messages so they aren’t misclassified:
in the function classifySecondaryWriteError add checks (preserving order
semantics) for "leader not found" (or broaden the existing "not leader" check to
include that phrase), and for transaction/meta/transient error substrings such
as "txn locked", "txn commit ts required", "txn meta missing", "txn invalid
meta", and "txn timestamp overflow" (map them to appropriate labels like
"not_leader", "txn_locked", "txn_meta_error" or a single transient label as your
metrics require); keep the defensive nil branch intact and ensure the new cases
are placed before the generic default to avoid fallback to "other".

---

Nitpick comments:
In `@proxy/metrics_test.go`:
- Around line 14-75: Tests use assert.InDelta to check whole-number counter
increments; replace those InDelta assertions with exact equality asserts
(assert.Equal) to make intent clearer. In
TestRecordSecondaryWriteFailureEmitsBothCounters, update the assertions that
call testutil.ToFloat64(metrics.SecondaryWriteErrors) and
testutil.ToFloat64(metrics.SecondaryWriteErrorsByReason.WithLabelValues(...)) to
use assert.Equal(t, float64(<expected>), testutil.ToFloat64(...)) instead of
assert.InDelta(..., 0.001), keeping the same expected values and label pairs.

In `@proxy/metrics.go`:
- Around line 52-56: Change the CounterVec SecondaryWriteErrorsByReason to use
the label name "command" (not "cmd") to match CommandTotal, and update its Help
string to a short capitalized sentence (e.g., "Secondary write errors broken out
by reason."). Then update all call sites that use
SecondaryWriteErrorsByReason.WithLabelValues(...), including the
WithLabelValues("SET", ...) calls in proxy/dualwrite.go and
proxy/metrics_test.go, to pass the "command" label value in the same position as
other metrics.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 54b329f9-3599-4f6c-860b-dd2f8fa1b912

📥 Commits

Reviewing files that changed from the base of the PR and between 20f97f4 and f3d03e6.

📒 Files selected for processing (3)

proxy/dualwrite.go
proxy/metrics.go
proxy/metrics_test.go

coderabbitai · 2026-04-22T15:46:51Z

+// classifySecondaryWriteError maps a secondary-write error to a small fixed set
+// of reason labels suitable for a Prometheus counter. The elastickv secondary
+// backend is in-house, so matching on substrings of the error message is safe.
+//
+// Order matters: "retry limit exceeded" is checked before "write conflict"
+// because the retry-limit message embeds the underlying conflict string, and
+// we want the outer (retry_limit) classification to win.
+func classifySecondaryWriteError(err error) string {
+	if err == nil {
+		return "other"
+	}
+	msg := err.Error()
+	switch {
+	case strings.Contains(msg, "retry limit exceeded"):
+		return "retry_limit"
+	case strings.Contains(msg, "write conflict"):
+		return "write_conflict"
+	case errors.Is(err, context.DeadlineExceeded) || strings.Contains(msg, "deadline exceeded"):
+		return "deadline_exceeded"
+	case strings.Contains(msg, "not leader"):
+		return "not_leader"
+	case strings.Contains(msg, "txn already committed") || strings.Contains(msg, "txn already aborted"):
+		return "txn_already_finalized"
+	default:
+		return "other"
+	}
+}


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Confirm the exact wording of "retry limit" errors produced anywhere in the tree, # and survey leadership-related error strings that the classifier should match. rg -nP -C2 'retry[ _-]?limit' --type=go rg -nP -C2 '"(not leader|leader not found|txn locked|txn already (committed|aborted)|write conflict)"' --type=go

Repository: bootjp/elastickv

Length of output: 10524

Fix classifier to handle additional error types and prevent misclassification.

The substring matching approach for error classification has gaps:

"leader not found" contains adapter/internal.go:40's ErrLeaderNotFound message, but the classifier checks only for "not leader" (line 513). This causes leadership-related failures to be misclassified as "other" despite being distinct from transient write conflicts.

Several transient error types fall to "other" and won't be visible on dashboards:

ErrTxnLocked ("txn locked") — common OCC contention signal

ErrTxnCommitTSRequired, ErrTxnMetaMissing, ErrTxnInvalidMeta, ErrTxnTimestampOverflow — metadata/timestamp issues

The nil branch (line 502–503) is unreachable from recordSecondaryWriteFailure but safe to keep defensively.

Consider adding checks for the above error strings or extending the existing "not leader" check to match "leader not found".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@proxy/dualwrite.go` around lines 494 - 520, Update classifySecondaryWriteError to detect the additional transient and leadership error messages so they aren’t misclassified: in the function classifySecondaryWriteError add checks (preserving order semantics) for "leader not found" (or broaden the existing "not leader" check to include that phrase), and for transaction/meta/transient error substrings such as "txn locked", "txn commit ts required", "txn meta missing", "txn invalid meta", and "txn timestamp overflow" (map them to appropriate labels like "not_leader", "txn_locked", "txn_meta_error" or a single transient label as your metrics require); keep the defensive nil branch intact and ensure the new cases are placed before the generic default to avoid fallback to "other".

## Summary - Adds a protocol-independent OCC write-conflict metric at the MVCC store layer. Both `checkConflicts` (write-write) and `checkReadConflicts` (read-write) in `store/lsm_store.go` / `store/mvcc_store.go` now increment a bounded per-(kind, key_prefix) counter immediately before returning `NewWriteConflictError`; a new `monitoring.WriteConflictCollector` polls the stores every 5 s (matching the existing `DispatchCollector` / `PebbleCollector` cadence) and mirrors deltas into `elastickv_store_write_conflict_total{group, kind, key_prefix}`. - `kind` splits `read` (RW conflict via read set) from `write` (WW conflict via mutation set) since ops implications differ. `key_prefix` uses a bounded classification aligned with existing prefix constants (`!txn|lock|`, `!txn|rb|`, `!redis|str|`, `!hs|`, `!zs|`, `!lst|`, `!ddb|`, ...) with an `other` fallback so user-supplied keys cannot grow cardinality. ## Motivation `#585` adds the proxy-side view (per client request). This PR adds the underlying store-side view (per Raft-applied proposal), so the signal is visible regardless of protocol adapter (Redis, DynamoDB, raw KV). In the `!txn|rb|` production incident the proxy counter spiked per user request but a single request fanned out to many Raft proposals; the store-side metric would have surfaced the real pressure (txn_rollback bucket) directly. ## Sample Grafana query ``` sum by (key_prefix) (rate(elastickv_store_write_conflict_total[5m])) ``` Split by conflict kind: ``` sum by (kind, key_prefix) (rate(elastickv_store_write_conflict_total[5m])) ``` ## Test plan - [x] `go test -race -count=1 ./store/... ./monitoring/... ./adapter/... ./kv/...` - [x] `make lint` - [x] New unit tests: - `store/write_conflict_counter_test.go`: table test for `classifyWriteConflictKey` covering every bucket; counter record / snapshot / nil-safety. - `store/write_conflict_store_test.go`: integration tests for both pebble-backed and in-memory MVCC stores exercising the WW and RW conflict paths and asserting the right bucket increments. - `monitoring/write_conflict_test.go`: collector delta behaviour, source-reset (counter-down) rebasing, malformed-label skip, nil-safety.

bootjp · 2026-04-22T16:14:14Z

@codex code review

Address reviewer feedback on the secondary-write error classifier: - map "leader not found" (kv.ErrLeaderNotFound / adapter.ErrLeaderNotFound) to not_leader alongside the existing "not leader" substring check, so leadership failures stop leaking into the generic "other" bucket. - introduce a dedicated txn_locked reason for kv.ErrTxnLocked, a common OCC contention signal that was previously classified as other. - pull the substring table into a package-level slice and iterate it, keeping the function under the cyclomatic-complexity limit. - extend metrics_test and the metric help string accordingly.

bootjp · 2026-04-22T18:36:18Z

/gemini review

bootjp · 2026-04-22T18:36:20Z

@codex review

gemini-code-assist · 2026-04-22T18:36:20Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

chatgpt-codex-connector · 2026-04-22T18:40:25Z

Codex Review: Didn't find any major issues. Swish!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

bootjp · 2026-04-22T18:43:11Z

@codex review

chatgpt-codex-connector · 2026-04-22T18:45:59Z

Codex Review: Didn't find any major issues. What shall we delve into next?

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

gemini-code-assist Bot reviewed Apr 22, 2026

View reviewed changes

docs(proxy): include txn_already_finalized in metric help string

f3d03e6

coderabbitai Bot reviewed Apr 22, 2026

View reviewed changes

bootjp merged commit 3a434c7 into main Apr 23, 2026
9 checks passed

bootjp deleted the obs/secondary-write-failure-reasons branch April 23, 2026 04:15

	Help: "secondary write failures broken out by redis command and error classification (write_conflict / retry_limit / not_leader / deadline_exceeded / other)",
	Help: "secondary write failures broken out by redis command and error classification (write_conflict / retry_limit / not_leader / deadline_exceeded / txn_already_finalized / other)",

Conversation

bootjp commented Apr 22, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Change

Suggested Grafana query

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

bootjp commented Apr 22, 2026

Uh oh!

gemini-code-assist Bot commented Apr 22, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

bootjp commented Apr 22, 2026

Uh oh!

bootjp commented Apr 22, 2026

Uh oh!

bootjp commented Apr 22, 2026

Uh oh!

gemini-code-assist Bot commented Apr 22, 2026

Uh oh!

chatgpt-codex-connector Bot commented Apr 22, 2026

Uh oh!

bootjp commented Apr 22, 2026

Uh oh!

chatgpt-codex-connector Bot commented Apr 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

bootjp commented Apr 22, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 22, 2026 •

edited

Loading