obs(proxy): classify secondary write failures by reason#585

Merged
bootjp merged 3 commits into main from obs/secondary-write-failure-reasons
Apr 23, 2026

@bootjp
Owner

@bootjp bootjp commented Apr 22, 2026

Motivation

After #581 we still see occasional secondary-write failures in production with payloads like !txn|rb|!redis|ttl|... (OCC write conflicts racing against rollback tombstones). Today the only signal is the secondary write failed log line from recordSecondaryWriteFailure — the unlabelled proxy_secondary_write_errors_total counter tells us something is wrong but gives no hint whether the failure is an OCC conflict, a retry-loop giving up, a lost leader, or a timeout. Ops have been grepping logs to triage.

Change

  • Adds proxy_secondary_write_errors_by_reason_total{cmd,reason} (CounterVec) alongside the existing unlabelled counter (kept for dashboard backwards-compat).
  • New helper classifySecondaryWriteError(err) string maps error strings to one of:
    • retry_limit: "retry limit exceeded" (checked first, since the retry-limit message embeds "write conflict")
    • write_conflict: raw OCC conflict, e.g. !txn|rb| races
    • deadline_exceeded: errors.Is(err, context.DeadlineExceeded) or the "deadline exceeded" substring
    • not_leader: leadership lost mid-dispatch
    • txn_already_finalized: "txn already committed" / "txn already aborted"
    • other: everything else
  • recordSecondaryWriteFailure increments both counters.
  • Tests: table test for the classifier + a test asserting both counters tick.

Suggested Grafana query

sum by (cmd, reason) (rate(proxy_secondary_write_errors_by_reason_total[5m]))

A per-reason stacked graph makes the !txn|rb| conflict pattern (the motivating incident) show up as a spike in reason="write_conflict" without needing to touch logs.

Test plan

  • go test -race -count=1 ./proxy/...
  • make lint

Summary by CodeRabbit

  • Chores

    • Secondary write error monitoring now automatically classifies failures by specific error type to improve diagnostics and observability.
    • New detailed metrics track write failure patterns across different error categories, enabling better visibility and root cause analysis.
  • Tests

    • Added comprehensive unit tests validating error classification logic and metrics recording functionality across various failure scenarios.

Adds proxy_secondary_write_errors_by_reason_total{cmd,reason} so OCC
"write conflict" incidents (e.g. the !txn|rb| races from the post-#581
log pattern) and other secondary-write failure classes are observable in
Prometheus/Grafana rather than only in logs. Keeps the existing
unlabelled counter so current dashboards keep working.
@coderabbitai

coderabbitai Bot commented Apr 22, 2026


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8f0864d1-d6d9-4241-92d8-13d26d768aa1

📥 Commits

Reviewing files that changed from the base of the PR and between f3d03e6 and f17add3.

📒 Files selected for processing (3)
  • proxy/dualwrite.go
  • proxy/metrics.go
  • proxy/metrics_test.go
Walkthrough

The changes add error classification and metrics tracking for secondary write failures in a dual-write proxy. A new helper function categorizes secondary errors into predefined reason labels, with metrics recorded by command and reason. Infrastructure changes include a new Prometheus counter vector, and comprehensive tests verify the classification logic and metric recording.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Metrics Infrastructure (proxy/metrics.go) | Added SecondaryWriteErrorsByReason counter vector to ProxyMetrics to track secondary write errors labeled by command and reason; maintained existing SecondaryWriteErrors counter. |
| Error Classification & Recording (proxy/dualwrite.go) | Implemented classifySecondaryWriteError helper to categorize errors into fixed reason labels (e.g., write_conflict, retry_limit, deadline_exceeded, not_leader, txn_already_finalized, other) using substring matching and errors.Is checks; recordSecondaryWriteFailure now increments the new labeled counter vector alongside existing counters. |
| Test Coverage (proxy/metrics_test.go) | Added two tests: TestClassifySecondaryWriteError verifies error classification logic across various error types including nil, deadline-related, and txn-related errors; TestRecordSecondaryWriteFailureEmitsBothCounters validates that both labeled and unlabeled metrics increment correctly for different command/reason pairs. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 With whiskers twitched and errors caught,
We classify each secondary thought—
By reason, command, each path we chart,
Metrics bloom like clover in the heart! 📊

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'obs(proxy): classify secondary write failures by reason' directly and clearly describes the main change: adding classification of secondary write failures by reason through a new labeled Prometheus metric. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces more granular error tracking for secondary write failures by adding a new Prometheus metric, secondary_write_errors_by_reason_total. This metric categorizes errors into specific reasons such as write conflicts, retry limits, and transaction finalization. The changes include a new classification function and associated unit tests. Feedback was provided to ensure the metric's help string accurately reflects all possible error classifications.

Comment thread proxy/metrics.go Outdated
SecondaryWriteErrorsByReason: prometheus.NewCounterVec(prometheus.CounterOpts{
Namespace: "proxy",
Name: "secondary_write_errors_by_reason_total",
Help: "secondary write failures broken out by redis command and error classification (write_conflict / retry_limit / not_leader / deadline_exceeded / other)",

medium

The help string for secondary_write_errors_by_reason_total is missing the txn_already_finalized classification reason. Keeping the help text in sync with the implementation ensures that operators can correctly interpret the metric labels without diving into the source code.

Suggested change
- Help: "secondary write failures broken out by redis command and error classification (write_conflict / retry_limit / not_leader / deadline_exceeded / other)",
+ Help: "secondary write failures broken out by redis command and error classification (write_conflict / retry_limit / not_leader / deadline_exceeded / txn_already_finalized / other)",

@bootjp
Owner Author

bootjp commented Apr 22, 2026

/gemini review

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
proxy/metrics.go (1)

52-56: Label-name and help-text consistency with existing counters.

Two small consistency nits on the new CounterVec:

  1. Label is "cmd" while the pre-existing CommandTotal (line 33) uses "command". Dashboards/joins across the two vectors now have to rename labels. Prefer "command" for consistency.
  2. Help starts lowercase ("secondary write failures...") and is long/parenthetical, whereas the other help strings in this file are short sentences starting with a capital (e.g. "Total write errors from the secondary backend."). Minor style drift.
♻️ Proposed change
 		SecondaryWriteErrorsByReason: prometheus.NewCounterVec(prometheus.CounterOpts{
 			Namespace: "proxy",
 			Name:      "secondary_write_errors_by_reason_total",
-			Help:      "secondary write failures broken out by redis command and error classification (write_conflict / retry_limit / not_leader / deadline_exceeded / txn_already_finalized / other)",
-		}, []string{"cmd", "reason"}),
+			Help:      "Secondary write failures broken out by command and error classification (write_conflict, retry_limit, not_leader, deadline_exceeded, txn_already_finalized, other).",
+		}, []string{"command", "reason"}),

Also update the WithLabelValues("SET", ...) call sites in proxy/dualwrite.go and proxy/metrics_test.go accordingly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@proxy/metrics.go` around lines 52 - 56, Change the CounterVec
SecondaryWriteErrorsByReason to use the label name "command" (not "cmd") to
match CommandTotal, and update its Help string to a short capitalized sentence
(e.g., "Secondary write errors broken out by reason."). Then update all call
sites that use SecondaryWriteErrorsByReason.WithLabelValues(...), including the
WithLabelValues("SET", ...) calls in proxy/dualwrite.go and
proxy/metrics_test.go, to pass the "command" label value in the same position as
other metrics.
proxy/metrics_test.go (1)

14-75: LGTM — table-driven coverage and both-counter assertion look good.

Classifier cases cover the interesting precedence (retry-limit beating embedded write-conflict), errors.Is vs substring paths for deadlines, and the unknown-fallback. The second test correctly asserts the unlabelled counter is preserved for backwards-compat and that distinct (cmd, reason) pairs accumulate independently.

One optional nit: since these counters are whole-number increments, assert.Equal(t, float64(1), testutil.ToFloat64(...)) reads more intention-revealing than InDelta(..., 0.001), but functionally equivalent.
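
A quick way to see why exact equality is safe here: whole numbers of this size are exactly representable in float64, so repeated +1 increments do not accumulate rounding error. The `counter` type below is a hypothetical stand-in for the value that testutil.ToFloat64 reads back, not the Prometheus client itself.

```go
package main

import "fmt"

// counter is a stand-in for a metric whose value is read back as a float64.
type counter struct{ v float64 }

// Inc adds exactly one, as a counter increment does.
func (c *counter) Inc() { c.v++ }

func main() {
	var c counter
	for i := 0; i < 1000; i++ {
		c.Inc()
	}
	// Integers below 2^53 are exact in float64, so == is reliable here.
	fmt.Println(c.v == float64(1000)) // true
}
```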

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@proxy/metrics_test.go` around lines 14 - 75, Tests use assert.InDelta to
check whole-number counter increments; replace those InDelta assertions with
exact equality asserts (assert.Equal) to make intent clearer. In
TestRecordSecondaryWriteFailureEmitsBothCounters, update the assertions that
call testutil.ToFloat64(metrics.SecondaryWriteErrors) and
testutil.ToFloat64(metrics.SecondaryWriteErrorsByReason.WithLabelValues(...)) to
use assert.Equal(t, float64(<expected>), testutil.ToFloat64(...)) instead of
assert.InDelta(..., 0.001), keeping the same expected values and label pairs.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 54b329f9-3599-4f6c-860b-dd2f8fa1b912

📥 Commits

Reviewing files that changed from the base of the PR and between 20f97f4 and f3d03e6.

📒 Files selected for processing (3)
  • proxy/dualwrite.go
  • proxy/metrics.go
  • proxy/metrics_test.go

Comment thread proxy/dualwrite.go
Comment on lines +494 to +520
// classifySecondaryWriteError maps a secondary-write error to a small fixed set
// of reason labels suitable for a Prometheus counter. The elastickv secondary
// backend is in-house, so matching on substrings of the error message is safe.
//
// Order matters: "retry limit exceeded" is checked before "write conflict"
// because the retry-limit message embeds the underlying conflict string, and
// we want the outer (retry_limit) classification to win.
func classifySecondaryWriteError(err error) string {
	if err == nil {
		return "other"
	}
	msg := err.Error()
	switch {
	case strings.Contains(msg, "retry limit exceeded"):
		return "retry_limit"
	case strings.Contains(msg, "write conflict"):
		return "write_conflict"
	case errors.Is(err, context.DeadlineExceeded) || strings.Contains(msg, "deadline exceeded"):
		return "deadline_exceeded"
	case strings.Contains(msg, "not leader"):
		return "not_leader"
	case strings.Contains(msg, "txn already committed") || strings.Contains(msg, "txn already aborted"):
		return "txn_already_finalized"
	default:
		return "other"
	}
}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Confirm the exact wording of "retry limit" errors produced anywhere in the tree,
# and survey leadership-related error strings that the classifier should match.
rg -nP -C2 'retry[ _-]?limit' --type=go
rg -nP -C2 '"(not leader|leader not found|txn locked|txn already (committed|aborted)|write conflict)"' --type=go

Repository: bootjp/elastickv

Length of output: 10524


Fix classifier to handle additional error types and prevent misclassification.

The substring matching approach for error classification has gaps:

  1. adapter/internal.go:40 defines ErrLeaderNotFound with the message "leader not found", but the classifier checks only for "not leader" (line 513), a substring that message does not contain. This causes leadership-related failures to be misclassified as "other" despite being distinct from transient write conflicts.

  2. Several transient error types fall to "other" and won't be visible on dashboards:

    • ErrTxnLocked ("txn locked"), a common OCC contention signal
    • ErrTxnCommitTSRequired, ErrTxnMetaMissing, ErrTxnInvalidMeta, ErrTxnTimestampOverflow: metadata/timestamp issues

  3. The nil branch (lines 502–503) is unreachable from recordSecondaryWriteFailure but safe to keep defensively.

Consider adding checks for the above error strings or extending the existing "not leader" check to match "leader not found".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@proxy/dualwrite.go` around lines 494 - 520, Update
classifySecondaryWriteError to detect the additional transient and leadership
error messages so they aren’t misclassified: in the function
classifySecondaryWriteError add checks (preserving order semantics) for "leader
not found" (or broaden the existing "not leader" check to include that phrase),
and for transaction/meta/transient error substrings such as "txn locked", "txn
commit ts required", "txn meta missing", "txn invalid meta", and "txn timestamp
overflow" (map them to appropriate labels like "not_leader", "txn_locked",
"txn_meta_error" or a single transient label as your metrics require); keep the
defensive nil branch intact and ensure the new cases are placed before the
generic default to avoid fallback to "other".

bootjp added a commit that referenced this pull request Apr 22, 2026
## Summary

- Adds a protocol-independent OCC write-conflict metric at the MVCC
store layer. Both `checkConflicts` (write-write) and
`checkReadConflicts` (read-write) in `store/lsm_store.go` /
`store/mvcc_store.go` now increment a bounded per-(kind, key_prefix)
counter immediately before returning `NewWriteConflictError`; a new
`monitoring.WriteConflictCollector` polls the stores every 5 s (matching
the existing `DispatchCollector` / `PebbleCollector` cadence) and
mirrors deltas into `elastickv_store_write_conflict_total{group, kind,
key_prefix}`.
- `kind` splits `read` (RW conflict via read set) from `write` (WW
conflict via mutation set) since ops implications differ. `key_prefix`
uses a bounded classification aligned with existing prefix constants
(`!txn|lock|`, `!txn|rb|`, `!redis|str|`, `!hs|`, `!zs|`, `!lst|`,
`!ddb|`, ...) with an `other` fallback so user-supplied keys cannot grow
cardinality.
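
A minimal sketch of the bounded `key_prefix` idea, assuming the label is simply the matched prefix itself. The summary's prefix list is truncated in the source, so only the prefixes it names are used; `knownPrefixes` and `prefixLabel` are hypothetical names, and the real `classifyWriteConflictKey` may map prefixes to different bucket names.

```go
package main

import (
	"fmt"
	"strings"
)

// knownPrefixes holds only the prefixes named in the summary; the real list
// is longer. None of these is a prefix of another, so order is irrelevant.
var knownPrefixes = []string{
	"!txn|lock|", "!txn|rb|", "!redis|str|", "!hs|", "!zs|", "!lst|", "!ddb|",
}

// prefixLabel returns one of a fixed set of label values for key. Anything
// that matches no known prefix collapses to "other", so user-supplied keys
// cannot create new label values.
func prefixLabel(key string) string {
	for _, p := range knownPrefixes {
		if strings.HasPrefix(key, p) {
			return p
		}
	}
	return "other"
}

func main() {
	fmt.Println(prefixLabel("!txn|rb|!redis|ttl|session")) // !txn|rb|
	fmt.Println(prefixLabel("user:42"))                    // other
}
```

The label set is capped at the number of known prefixes plus one, whatever keys clients write.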

## Motivation

`#585` adds the proxy-side view (per client request). This PR adds the
underlying store-side view (per Raft-applied proposal), so the signal is
visible regardless of protocol adapter (Redis, DynamoDB, raw KV). In the
`!txn|rb|` production incident the proxy counter spiked per user request
but a single request fanned out to many Raft proposals; the store-side
metric would have surfaced the real pressure (txn_rollback bucket)
directly.

## Sample Grafana query

```
sum by (key_prefix) (rate(elastickv_store_write_conflict_total[5m]))
```

Split by conflict kind:

```
sum by (kind, key_prefix) (rate(elastickv_store_write_conflict_total[5m]))
```

## Test plan

- [x] `go test -race -count=1 ./store/... ./monitoring/... ./adapter/...
./kv/...`
- [x] `make lint`
- [x] New unit tests:
  - `store/write_conflict_counter_test.go`: table test for
    `classifyWriteConflictKey` covering every bucket; counter record /
    snapshot / nil-safety.
  - `store/write_conflict_store_test.go`: integration tests for both
    pebble-backed and in-memory MVCC stores exercising the WW and RW
    conflict paths and asserting the right bucket increments.
  - `monitoring/write_conflict_test.go`: collector delta behaviour,
    source-reset (counter-down) rebasing, malformed-label skip, nil-safety.
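
The collector's delta and source-reset behaviour named in the test plan might look roughly like the sketch below. Everything here is an assumption for illustration: `deltaMirror`, `observe`, and the label string format are hypothetical, and the rebasing policy shown (treat the new value as the whole delta after a reset) is one plausible choice, not necessarily what `monitoring.WriteConflictCollector` does.

```go
package main

import "fmt"

// deltaMirror remembers the last cumulative value seen per label set.
type deltaMirror struct {
	last map[string]uint64
}

func newDeltaMirror() *deltaMirror {
	return &deltaMirror{last: map[string]uint64{}}
}

// observe takes the source's current cumulative count and returns how much to
// add to the exported counter. If the source went backwards (for example its
// in-memory counter restarted), the mirror rebases and counts the new value
// as the delta since the reset, so the exported counter never decreases.
func (d *deltaMirror) observe(label string, current uint64) uint64 {
	prev := d.last[label]
	d.last[label] = current
	if current < prev {
		return current
	}
	return current - prev
}

func main() {
	m := newDeltaMirror()
	fmt.Println(m.observe("write|!txn|rb|", 5)) // 5
	fmt.Println(m.observe("write|!txn|rb|", 8)) // 3
	fmt.Println(m.observe("write|!txn|rb|", 2)) // 2 (source reset, rebased)
}
```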
@bootjp
Owner Author

bootjp commented Apr 22, 2026

@codex code review

Address reviewer feedback on the secondary-write error classifier:
- map "leader not found" (kv.ErrLeaderNotFound / adapter.ErrLeaderNotFound)
  to not_leader alongside the existing "not leader" substring check, so
  leadership failures stop leaking into the generic "other" bucket.
- introduce a dedicated txn_locked reason for kv.ErrTxnLocked, a common
  OCC contention signal that was previously classified as other.
- pull the substring table into a package-level slice and iterate it,
  keeping the function under the cyclomatic-complexity limit.
- extend metrics_test and the metric help string accordingly.
@bootjp
Owner Author

bootjp commented Apr 22, 2026

/gemini review

@bootjp
Owner Author

bootjp commented Apr 22, 2026

@codex review

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Swish!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@bootjp
Owner Author

bootjp commented Apr 22, 2026

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. What shall we delve into next?

@bootjp bootjp merged commit 3a434c7 into main Apr 23, 2026
9 checks passed
@bootjp bootjp deleted the obs/secondary-write-failure-reasons branch April 23, 2026 04:15