
metric: add CI timeout observability #25178

Draft
VioletQwQ-0 wants to merge 1 commit into matrixorigin:main from VioletQwQ-0:violet/observe-ci-timeout-metrics


What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue #25074
issue #25164

What this PR does / why we need it:

This PR adds observability for CI/sysbench failures where clients see proxy all-CN-busy errors or backend/RPC timeout errors.

In simpler terms, it adds gauges/counters/dashboards that answer:

  • Are CN/DN pods already near memory or CPU limits?
  • Did the DN merge scheduler pause because available memory was too low?
  • Did proxy mark specific CNs unhealthy, and did all candidate CNs become busy?
  • Are proxy-to-CN handshakes slow or piling up?
  • Did MORPC/lockservice fail on connect/read/write/wait-response, and what error type was it?

Changes:

  • Add DN merge OOM governor metrics.
  • Add proxy CN health, backend handshake latency, and in-flight handshake metrics.
  • Add MORPC backend error and lockservice remote RPC error metrics with low-cardinality labels.
  • Add Grafana panels for pod resources, merge OOM governor, proxy health/handshake, and RPC timeout/backend errors.
  • Add an observability runbook with PromQL/LogQL queries for CI regression investigations.
  • Add focused tests for MORPC and lockservice error type classification.

This PR does not claim to fix the sysbench overload root cause. It makes the failure chain observable so future 20508/20505 failures can be diagnosed from metrics instead of only from logs.

Validation:

  • git diff --check
  • go test ./pkg/proxy
  • go test ./pkg/vm/engine/tae/db/merge
  • go test ./pkg/common/morpc
  • go test ./pkg/common/morpc ./pkg/lockservice -run 'TestRPCMetricErrorType|TestLockserviceRemoteRPCErrorType'
  • go test ./pkg/util/metric/v2 ./pkg/util/metric/v2/dashboard -run 'TestDoesNotExist' (compile-only check: the pattern matches no test, so both packages are built but no tests run)


Labels

  • kind/documentation: Improvements or additions to documentation
  • kind/enhancement
  • size/M: Denotes a PR that changes [100,499] lines
