diff --git a/docs/integration/boj-side-observability-spec.md b/docs/integration/boj-side-observability-spec.md
new file mode 100644
index 00000000..6d7407e4
--- /dev/null
+++ b/docs/integration/boj-side-observability-spec.md
@@ -0,0 +1,264 @@
+
+
+
+# BoJ-side observability spec — Phase E §3 prerequisite
+
+**Version:** 0.1 (scaffold, Phase E)
+**Date:** 2026-06-22
+**Status:** Phase E scaffold. Names the BoJ-side telemetry events and Prometheus metrics that the rollout-runbook §4.2 signals require, anchored to the exact instrumentation sites in `elixir/lib/boj_rest/router.ex` so a follow-up wiring PR has an unambiguous target. Until the events land, BoJ exposes no metrics; the runbook §1.4 prerequisite tracks this absence as a stop-the-rollout condition for §3.1 (10% traffic).
+**ADR:** [`docs/decisions/0004-adopt-http-capability-gateway.md`](../decisions/0004-adopt-http-capability-gateway.md)
+**Plan:** [`docs/integration/http-capability-gateway-plan.md`](http-capability-gateway-plan.md) (§ Phase E, E3 telemetry verification)
+**Contract:** [`docs/integration/http-capability-gateway-boj-contract.md`](http-capability-gateway-boj-contract.md)
+**Sister spec (gateway side):** [`docs/integration/gateway-observability-spec.md`](gateway-observability-spec.md) (§ 3 BoJ-side signal templates anchor here)
+**Rollout runbook:** [`docs/integration/hcg-tier2-rollout-runbook.md`](hcg-tier2-rollout-runbook.md) (§ 1.4 BoJ-side prerequisite, § 4.2 signals)
+**Tracking:** [`standards#91`](https://github.com/hyperpolymath/standards/issues/91) (parent), [`standards#100`](https://github.com/hyperpolymath/standards/issues/100) (Phase E)
+
+> **File-format note.** Matches sibling integration docs (`http-capability-gateway-{plan,audit,boj-contract,policy-authoring}.md`, `gateway-{load-profile,observability-spec}.md`, `hcg-tier2-rollout-runbook.md`); the estate `.adoc` default is deliberately overridden for the `docs/integration/` set so the integration plan can name documents by exact
path. + +--- + +## 0. Scope + +The gateway-side observability spec (`gateway-observability-spec.md` §3) lists three BoJ-side signals the rollout-runbook §4.2 expects on-call to watch: + +1. Per-route trust-class distribution from `BojRest.Router` decisions (runbook §4.2 bullet 1). +2. `X-Trust-Level` arriving from non-loopback peers — must remain zero (runbook §4.2 bullet 2, paired with rollback trigger §5.1 row 4). +3. BoJ 5xx rate, independent of the gateway's view (runbook §4.2 bullet 3). + +The sister spec gives PromQL templates for each but cannot anchor them to real metric names because BoJ today has no telemetry layer at all — `elixir/mix.exs` carries no `:telemetry_metrics_prometheus_core` dependency, `BojRest.Application` mounts no exporter, and `BojRest.Router` emits no `:telemetry.execute/3` events. The sister spec consequently leaves §3 templates qualified with `!OWNER: scaffold against the actual BojRest.Router instrumentation`. **This spec closes that qualification** by naming, normatively, the events to emit, the metrics to export, and the exact instrumentation sites in `BojRest.Router`. + +In scope: + +- The four telemetry events the BoJ side must emit (§ 1). +- The Prometheus metric names that the gateway-observability-spec §3 PromQL templates expect (§ 2). +- The instrumentation sites in `elixir/lib/boj_rest/router.ex`, with `file:line` anchors (§ 3). +- The `mix.exs` dependency, supervisor child, and `Plug` route the wiring PR must add (§ 4). +- The `/metrics` exposure policy — which gateway-policy rule must govern the new endpoint (§ 5). +- The acceptance criterion that the runbook §1.4 prerequisite checks against (§ 6). + +Out of scope: + +- The actual code (mix.exs / application.ex / router.ex edits). Implementation is a follow-up PR. +- Cartridge-level telemetry. The gateway sees BoJ as a single backend; cartridge-internal observability is downstream of this scope. +- Long-term storage / retention. 
The spec defines what BoJ scrapes expose; how long the scrapes are kept is an operator decision (the runbook §4.3 dashboard URL row already covers that scope).
+
+---
+
+## 1. Telemetry events — emission contract
+
+Four events. The first three each map 1:1 to a runbook §4.2 signal; the fourth, `[:boj_rest, :request, :received]`, is the duration anchor for the response event. Event names follow the prevailing `[:app_namespace, :component, :verb]` convention from `http-capability-gateway/lib/http_capability_gateway/application.ex` `telemetry_metrics/0`.
+
+| Event | Emitted when | Measurements | Metadata (tags) | Anchors |
+|---|---|---|---|---|
+| `[:boj_rest, :router, :decision]` | After every `BojRest.Router.check_trust/3` call, regardless of allow/deny outcome. | `count: 1` | `route`, `verb`, `trust_class`, `outcome` | runbook §4.2 bullet 1; sister spec §3.1 |
+| `[:boj_rest, :router, :trust_level_present]` | When `BojRest.Router` reads a non-empty `X-Trust-Level` header from the request. | `count: 1` | `remote_origin` ∈ `{loopback, non_loopback}` | runbook §4.2 bullet 2; sister spec §3.2; rollback trigger §5.1 row 4 |
+| `[:boj_rest, :http, :response]` | At every `json/3` response render in `BojRest.Router` (the single response site for the five governed routes). | `count: 1`, `duration: <µs>` (monotonic time delta from the `[:boj_rest, :request, :received]` timestamp to render) | `status` (3-digit), `route` | runbook §4.2 bullet 3; sister spec §3.3 |
+| `[:boj_rest, :request, :received]` | At `BojRest.Router`'s `match` plug, before dispatch. | `count: 1`, `received_at_ns: System.monotonic_time(:nanosecond)` | `verb` | duration anchor for `[:boj_rest, :http, :response]`; also the canonical "request entered BoJ" signal |
+
+### 1.1 Tag value vocabularies
+
+All tag vocabularies are bounded: no tag carries an operator-supplied free-form value, so total Prometheus time-series cardinality is bounded.
+ +- `route` ∈ the seven route patterns in `BojRest.Router` as of contract v1.0 + the `cartridge-sse-post` addition (boj-server#165): `"/.well-known/boj-node-pubkey"`, `"/health"`, `"/menu"`, `"/cartridges"`, `"/cartridge/:name"`, `"/cartridge/:name/invoke"`, `"/cartridge/:name/sse"`. **Route templates, not concrete paths.** New routes added to `BojRest.Router` must be added here in the same PR and reflected in the gateway policy (the §1.5 surface-drift script will catch the policy half). +- `verb` ∈ `{GET, POST, PUT, DELETE, PATCH, HEAD, OPTIONS}` — same seven-value allowlist the gateway uses (`http-capability-gateway/lib/http_capability_gateway/gateway.ex:65-77`). Unknown methods are dispatched by `Plug.Router` to the `match _` fall-through, where this event is **not** emitted (no trust check runs; nothing to bucket). +- `trust_class` ∈ `{public, authenticated, internal}` — the three values `BojRest.TrustPolicy.required_exposure/1` can return (`elixir/lib/boj_rest/trust_policy.ex:52-67`). This is the **required** exposure for the matched cartridge, not the **presented** trust header. +- `outcome` ∈ `{allow, deny_insufficient_trust, deny_unknown_route, deny_loopback_required}` — `allow` is the `:ok` branch of `check_trust/3`; the three deny variants split the `:error` branches per `router.ex:135-138` (the unknown-route deny is the `Plug.Router` `match _` 404 site, which also emits this event so the dashboards see the no-match traffic). +- `remote_origin` ∈ `{loopback, non_loopback}` — derived from `BojRest.Router.loopback?/1` (`router.ex:238-240`). +- `status` — 3-digit HTTP status. Naturally bounded; cardinality is the set of statuses BoJ actually returns (today: `{200, 400, 403, 404, 500, 502}`). 
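
The vocabularies above can be written down as data. What follows is a minimal sketch, not part of BoJ: the module name `TagVocabularySketch` and both of its functions are hypothetical, and the choice of atoms for `trust_class` and `outcome` and strings for `route` and `verb` is an assumption taken from the way §3.3 writes its tag values. It shows how the decision-counter bound follows from the vocabulary sizes and how a tag map could be checked before emission.

```elixir
defmodule TagVocabularySketch do
  @moduledoc """
  Hypothetical helper (not part of BoJ): encodes the §1.1 vocabularies so the
  decision-counter cardinality bound can be computed and tag maps checked.
  """

  # Route templates, not concrete paths.
  @routes [
    "/.well-known/boj-node-pubkey",
    "/health",
    "/menu",
    "/cartridges",
    "/cartridge/:name",
    "/cartridge/:name/invoke",
    "/cartridge/:name/sse"
  ]
  @verbs ~w(GET POST PUT DELETE PATCH HEAD OPTIONS)
  @trust_classes [:public, :authenticated, :internal]
  @outcomes [:allow, :deny_insufficient_trust, :deny_unknown_route, :deny_loopback_required]

  @doc "Upper bound on decision-counter series: the product of the vocabulary sizes."
  def decision_cardinality_bound do
    length(@routes) * length(@verbs) * length(@trust_classes) * length(@outcomes)
  end

  @doc "True only when every decision tag is drawn from its bounded vocabulary."
  def valid_decision_tags?(%{route: route, verb: verb, trust_class: trust_class, outcome: outcome}) do
    route in @routes and verb in @verbs and trust_class in @trust_classes and
      outcome in @outcomes
  end

  # Anything that is not a map with all four tags is rejected.
  def valid_decision_tags?(_other), do: false
end
```

With these lists, `TagVocabularySketch.decision_cardinality_bound()` evaluates to 588. A concrete path such as `"/cartridge/echo/invoke"` fails the check because only route templates are in the vocabulary. The empty-string `route` placeholder that §3.3 uses for unknown routes is deliberately left out of this sketch.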
+
+Total decision counter cardinality bound: `7 routes × 7 verbs × 3 trust_classes × 4 outcomes = 588` time series at saturation, of which only the realistically occupied combinations (`7 routes × 1-2 verbs each × 2 trust_classes × 2 outcomes ≈ 28-56`) are expected to materialise. Both figures are well below the threshold where Prometheus storage becomes a concern.
+
+### 1.2 What is **not** emitted from these sites
+
+The following are explicitly not emitted, to keep the contract sharp:
+
+- The events above do **not** carry the `X-Trust-Level` header value itself. The gateway-observability-spec §3.2 query relies on the presence-vs-absence + origin tag, not the header value, to keep the spec resilient to the future mTLS-primary path (where the trust value never arrives in a header at all).
+- The events above do **not** carry the `X-Node-Identity` value or any cartridge credential material. Audit-log payloads stay in the BoJ logs and VeriSimDB; metrics expose only the bounded-vocabulary tags above.
+- The events above do **not** carry per-request payload size. Throughput and latency are sufficient for the §4.2 signal set; payload-size histograms are out of scope.
+
+---
+
+## 2. Prometheus metrics — what the sister spec PromQL reads
+
+The `telemetry_metrics_prometheus_core` convention applies: dots become underscores; distribution metrics expose `_bucket`, `_count`, `_sum` series; counters expose a `_total` series. The names below are the metric prefixes the operator sees in `/metrics`.
+
+| Telemetry event | Prometheus metric prefix | Type | Tags |
+|---|---|---|---|
+| `[:boj_rest, :router, :decision]` | `boj_router_decision_count` | counter | `route`, `verb`, `trust_class`, `outcome` |
+| `[:boj_rest, :router, :trust_level_present]` | `boj_router_trust_level_present_count` | counter | `remote_origin` |
+| `[:boj_rest, :http, :response]` | `boj_http_responses` | counter | `status`, `route` |
+| `[:boj_rest, :http, :response]` | `boj_http_response_duration` | distribution | `status`, `route` |
+| `[:boj_rest, :request, :received]` | `boj_request_received_count` | counter | `verb` |
+
+### 2.1 Name alignment with the sister spec
+
+The sister spec (`gateway-observability-spec.md`) names three metrics as the targets of its §3 PromQL templates:
+
+- `boj_router_decision_count_total{route, trust_class}` — sister §3.1.
+- `boj_router_trust_level_present_count_total{remote_origin}` — sister §3.2.
+- `boj_http_responses_total{status=~"5.."}` — sister §3.3.
+
+This spec's §2 table covers that set and extends it in two ways. The decision counter gains the `outcome` and `verb` tags, which the sister-spec §3.1 query aggregates away via PromQL's `sum by`. The response side gains a duration histogram, which the sister-spec §3.3 query does not currently use but which is the natural BoJ-side equivalent of the gateway's `backend_response_duration`: useful for the dashboard, not strictly required for §3.1 sign-off.
+
+### 2.2 Histogram buckets
+
+`boj_http_response_duration` buckets, in microseconds, cover the BoJ side of the 60 ms ceiling the gateway uses for its `backend_response_duration` (sister spec §1.1):
+
+```
+[100, 500, 1_000, 5_000, 10_000, 30_000, 60_000]
+```
+
+These match the gateway-side bucket layout for `backend_response_duration` so the operator can read both axes against the same scale on a shared dashboard.
Wider upper bound for the BoJ side would only hide cartridge-attributable latency the gateway already buckets up to 60 ms; tighter bounds would lose the cartridge tail. + +--- + +## 3. Instrumentation sites + +`elixir/lib/boj_rest/router.ex` (256 lines as of 2026-06-22). The four emission sites: + +### 3.1 `[:boj_rest, :request, :received]` + +**Site:** `Router.match` plug entry, between line `29` (`plug :match`) and line `31` (`plug :dispatch`). Add a thin custom plug `:emit_request_received` before `:dispatch` that runs once per request and stashes the monotonic start time in `conn.private` for the response-site duration read. + +### 3.2 `[:boj_rest, :router, :trust_level_present]` + +**Sites:** `router.ex:109` (`POST /cartridge/:name/invoke`) and `router.ex:161` (`POST /cartridge/:name/sse`), immediately after the `trust_level = conn |> get_req_header(...) |> List.first()` read. Emit **only** when `trust_level` is non-nil; passing the empty case through is fine — the metric counts presences, not absences. The `remote_origin` tag derives from the already-computed `is_local = loopback?(conn.remote_ip)`. + +### 3.3 `[:boj_rest, :router, :decision]` + +**Site:** `Router.check_trust/3` (`router.ex:228-236`). Emit on **both** the `:ok` and `{:error, :insufficient_trust}` return branches, immediately before the function returns, so allow and deny outcomes go through the same code path. The `route` tag is the matched route template; `Router` already has it in scope via the macro expansion (the `:plug_route` private key — `conn.private[:plug_route]` gives `{template_string, fn}`). The unknown-route 404 case (the `Plug.Router` `match _` fall-through) is a separate emission site below. + +**Site (unknown-route):** the `match _` fall-through at the bottom of the router (existing 404 handler). 
Emit `[:boj_rest, :router, :decision]` with `route = ""`, `verb = conn.method`, `trust_class = :public` (placeholder — no cartridge matched, so no required exposure), `outcome = :deny_unknown_route`. This keeps the unknown-path traffic visible to the dashboard's deny-rate panel. + +### 3.4 `[:boj_rest, :http, :response]` + +**Site:** `Router.json/3` (`router.ex:213-217`). Emit immediately before `send_resp`. The `duration` measurement reads `received_at_ns` from `conn.private` (set by §3.1's plug); if absent (a code path that does not pass through the new request-received plug), emit `duration: 0` and rely on the dashboard alert to surface the wiring gap. + +### 3.5 Total diff size + +The wiring PR adds, approximately: + +- `elixir/mix.exs`: one dep line. +- `elixir/lib/boj_rest/application.ex`: one supervisor child (the Prometheus reporter), one new plug wiring (the `/metrics` route — see §5). +- `elixir/lib/boj_rest/router.ex`: four `:telemetry.execute/3` calls (or one helper invoked from four sites), one new plug, one `conn.private` stash. +- `elixir/lib/boj_rest/telemetry.ex` (new): the `metrics/0` declaration that names the five Prometheus metrics in §2 above. + +No structural change to `BojRest.TrustPolicy`, `BojRest.Invoker`, `BojRest.JsInvoker`, `BojRest.Catalog`, or the cartridges. The emissions slot into the existing `Router` plug chain without altering decision logic. + +--- + +## 4. Dependencies and supervisor wiring + +### 4.1 `mix.exs` + +Add to `defp deps`: + +```elixir +{:telemetry_metrics_prometheus_core, "~> 1.2"} +``` + +The `:telemetry` dep is already transitively pulled in by `:plug_cowboy` (which uses it for its own request lifecycle events). The reporter library is the smallest dependency that exports `:telemetry_metrics` definitions to Prometheus text format; it does **not** start its own HTTP server (the next dep up the stack, `:telemetry_metrics_prometheus`, does, which would conflict with the existing Cowboy listener). 
The `_core` variant is the one to pick for a Cowboy-already-listening process.
+
+### 4.2 `BojRest.Application`
+
+Add one child to the supervisor tree (between `BojRest.JsWorkerPool` and the Cowboy listener):
+
+```elixir
+{TelemetryMetricsPrometheus.Core, [
+  name: :boj_rest_prometheus,
+  metrics: BojRest.Telemetry.metrics()
+]}
+```
+
+This starts the reporter; it does **not** add a listener. The `/metrics` route is added inside `BojRest.Router` (§5).
+
+### 4.3 `BojRest.Telemetry`
+
+New module `elixir/lib/boj_rest/telemetry.ex` exporting `metrics/0` that returns the five `Telemetry.Metrics` definitions matching §2:
+
+```elixir
+# Assumes `import Telemetry.Metrics` at the top of the module.
+def metrics do
+  [
+    # Counter, per the §2 table (not last_value).
+    counter("boj_router_decision_count",
+      event_name: [:boj_rest, :router, :decision],
+      measurement: :count,
+      tags: [:route, :verb, :trust_class, :outcome]
+    ),
+    # … one entry per row in §2 table.
+  ]
+end
+```
+
+The exact `Telemetry.Metrics` type per row matches the §2 table (counter / distribution).
+
+---
+
+## 5. `/metrics` exposure — policy implication
+
+The `/metrics` endpoint is the new BoJ surface route the wiring PR adds. The gateway policy (`config/gateway-policy-boj.yaml`) must govern it. **The endpoint MUST NOT be public.** Two reasons:
+
+1. The metrics payload exposes route-level traffic shape — useful to an attacker probing for capability-existence, the exact threat the `stealth_profiles` mechanism in the policy guards against.
+2. The `decision` counter tags include the cartridge-trust-class distribution, which leaks information about which cartridges expect authenticated callers and which expect internal-only callers — a capability-discovery signal that defeats the audit's truthfulness invariant.
+
+Declared policy rule (additions, when the wiring PR lands):
+
+```yaml
+- id: "metrics-get"
+  description: "BoJ Prometheus scrape — internal-only, stealth 404 on untrusted access"
+  path: "/metrics"
+  verb: "GET"
+  trust_class: "internal"
+  stealth_profile: "internal-404"
+```
+
+Operator scrapes must reach `/metrics` from a trusted-proxy IP with `X-Trust-Level: internal` set by the gateway. Tier-1 (Cloudflare) MUST NOT route external traffic to `/metrics`; the gateway's stealth-profile rule is the second line of defence if a Cloudflare config drift exposes it.
+
+The `scripts/hcg-policy-smoke.sh` stealth-profile canary set (§1.5 of the rollout runbook) must add `/metrics` to its internal-route 404 canary list in the same PR that adds the policy rule — so a misconfiguration that demotes `metrics-get` from `internal+stealth` to `authenticated+403` is caught at smoke time, not at exposure time.
+
+---
+
+## 6. Phase E acceptance — how this spec gates §3.1
+
+This spec adds one new prerequisite to the rollout-runbook §1.4 "BoJ-side prerequisites" checklist:
+
+> **BoJ-side observability events emitted.** The four events declared in this spec § 1 are emitted by `BojRest.Router`, the five Prometheus metrics in § 2 appear in a `/metrics` scrape, the policy rule in § 5 governs the endpoint, and the smoke script's stealth canary covers it. Until this lands, runbook § 3.1 success criterion 4 ("No `X-Trust-Level` mismatches in BoJ access logs") is unobservable via Prometheus — the only signal path is BoJ structured logs, which the runbook § 4 dashboards do not currently consume.
+
+The runbook update (this PR's §1.4 edit) adds that checkbox to the existing list.
+
+**Strictness:** until the wiring PR lands, runbook § 3.1 (10% traffic shift) **cannot** be signed off against the Prometheus-anchored success criteria.
The rollback trigger § 5.1 row 4 ("BoJ access logs show `X-Trust-Level` from non-loopback peers") falls back to log-based detection — which is acceptable as a manual triage path but **not** acceptable as the only paged signal for a §3 invariant 4 violation in flight. + +--- + +## 7. Wiring PR — checklist for the follow-up + +The follow-up PR that lands the actual emission must: + +- [ ] Add `:telemetry_metrics_prometheus_core` to `elixir/mix.exs` `deps/0`. +- [ ] Add `BojRest.Telemetry` module exporting `metrics/0` per § 4.3. +- [ ] Add the `TelemetryMetricsPrometheus.Core` child to `BojRest.Application` per § 4.2. +- [ ] Add the four emission sites in `elixir/lib/boj_rest/router.ex` per § 3.1–3.4. +- [ ] Add a `GET /metrics` plug route in `BojRest.Router` that calls `TelemetryMetricsPrometheus.Core.scrape(:boj_rest_prometheus)` and returns the text-format payload. +- [ ] Add the `metrics-get` rule to `config/gateway-policy-boj.yaml` per § 5. +- [ ] Add `/metrics` to `scripts/hcg-policy-smoke.sh` stealth canary list per § 5. +- [ ] Add a unit test exercising each `:telemetry.execute/3` site (matching the existing test patterns under `elixir/test/`). +- [ ] Update `gateway-observability-spec.md` § 3.1, § 3.2, § 3.3 to drop the `!OWNER: scaffold` qualifier (the metric names now resolve against this repo's emitters). +- [ ] Update `hcg-tier2-rollout-runbook.md` § 1.4 to flip the new prerequisite checkbox from `[ ]` to `[x]`. + +The same PR closes this spec's `**Status:**` line from `scaffold` to `wired (Phase E §1.4 prereq satisfied)` with a date. + +--- + +## 8. References + +- Sister spec — `docs/integration/gateway-observability-spec.md` (§ 3 BoJ-side PromQL templates; this spec backs those templates with real metric names). +- Rollout runbook — `docs/integration/hcg-tier2-rollout-runbook.md` (§ 1.4 BoJ-side prerequisites, § 3.1 success criteria, § 4.2 signals, § 5.1 rollback triggers). 
+- Contract — `docs/integration/http-capability-gateway-boj-contract.md` (§ 3 trust-level invariants this spec's `decision` and `trust_level_present` events enforce telemetry coverage for). +- Audit — `docs/integration/http-capability-gateway-audit.md` (§ 5 gateway-side telemetry shape, mirrored on the BoJ side here). +- Plan — `docs/integration/http-capability-gateway-plan.md` (§ Phase E E3 telemetry verification). +- BoJ router — `elixir/lib/boj_rest/router.ex` (instrumentation sites § 3.1–3.4). +- BoJ trust policy — `elixir/lib/boj_rest/trust_policy.ex` (`required_exposure/1` source of `trust_class` tag values; `satisfies?/3` source of `outcome` tag values). +- BoJ application — `elixir/lib/boj_rest/application.ex` (supervisor tree to extend per § 4.2). +- Gateway metric definitions — `http-capability-gateway/lib/http_capability_gateway/application.ex` `telemetry_metrics/0` (lines 259–296; bucket-layout reference for § 2.2). diff --git a/docs/integration/gateway-observability-spec.md b/docs/integration/gateway-observability-spec.md index c0d015c6..458319ca 100644 --- a/docs/integration/gateway-observability-spec.md +++ b/docs/integration/gateway-observability-spec.md @@ -6,9 +6,9 @@ Copyright (c) Jonathan D.A. Jewell # HCG tier-2 — observability spec -**Version:** 0.1 (scaffold, Phase E) -**Date:** 2026-06-16 -**Status:** Phase E scaffold. Names the gateway-emitted Prometheus metrics, gives PromQL templates for every signal listed in the rollout runbook §4.1/§4.2, and binds alert thresholds to the rollback triggers in runbook §5.1 and the perf contract's tolerance ratios. Absolute-µs values are deliberately left as `Phase D-4` references — once `bench/baseline.json` `_status` flips to `active` the queries here read against real numbers without further edits. +**Version:** 0.2 (BoJ-side sister spec anchored, Phase E) +**Date:** 2026-06-22 (rev. from 2026-06-16) +**Status:** Phase E scaffold. 
Names the gateway-emitted Prometheus metrics, gives PromQL templates for every signal listed in the rollout runbook §4.1/§4.2, and binds alert thresholds to the rollback triggers in runbook §5.1 and the perf contract's tolerance ratios. §3 BoJ-side templates now anchor to the sister spec [`boj-side-observability-spec.md`](boj-side-observability-spec.md) (events, metric names, instrumentation sites in `BojRest.Router`); the `!OWNER:` scaffold qualifier on those templates is dropped — the wiring PR target is fixed. Absolute-µs values are deliberately left as `Phase D-4` references — once `bench/baseline.json` `_status` flips to `active` the queries here read against real numbers without further edits. **ADR:** [`docs/decisions/0004-adopt-http-capability-gateway.md`](../decisions/0004-adopt-http-capability-gateway.md) **Plan:** [`docs/integration/http-capability-gateway-plan.md`](http-capability-gateway-plan.md) (§ Phase E, E3 telemetry verification) **Contract:** [`docs/integration/http-capability-gateway-boj-contract.md`](http-capability-gateway-boj-contract.md) @@ -286,23 +286,22 @@ sum( ## 3. Signal → query mapping (rollout runbook §4.2, BoJ-side) -§4.2 names three BoJ-side signals. The first two require BoJ-emitted Prometheus metrics that are not in this repo's scope; the third is a network-layer signal. This section names the queries each owner needs to wire on the BoJ side; the metric naming convention follows the project's existing telemetry (`BojRest.Router` decisions, `boj-server` Prometheus exporter — both !OWNER: scaffolded since BoJ-side telemetry is outside this Phase E channel). +§4.2 names three BoJ-side signals. The first two require BoJ-emitted Prometheus metrics; the third is paired with the BoJ-emitted HTTP-response counter. 
The metric names referenced below are declared normatively in the sister spec [`boj-side-observability-spec.md`](boj-side-observability-spec.md) §2, which anchors them to telemetry events emitted from `elixir/lib/boj_rest/router.ex` — that spec is the contract for the BoJ-side wiring PR that lands the actual emission. Until that wiring PR lands, the queries here remain *templates* (the metric names do not yet resolve in the BoJ exporter, because no exporter is mounted); the rollout-runbook §1.4 prerequisite tracks the wiring as a stop-the-rollout condition for §3.1 sign-off. ### 3.1 Per-route trust-class distribution **Runbook signal:** "Per-route trust-class distribution from `BojRest.Router` decisions." -PromQL template (assumes a BoJ-side `boj_router_decision_count_total{route, trust_class}` counter — !OWNER: scaffold against the actual `BojRest.Router` instrumentation): +PromQL (against the `boj_router_decision_count_total{route, verb, trust_class, outcome}` counter declared in `boj-side-observability-spec.md` §1 + §2): ```promql -# Per-route trust-class distribution. Replace boj_router_decision_count_total -# with whatever name the BoJ exporter uses. +# Per-route trust-class distribution. sum by (route, trust_class) ( rate(boj_router_decision_count_total[5m]) ) ``` -If BoJ does not currently expose this metric, the runbook §4.2 signal list is the unambiguous request: BoJ-side observability for the rollout requires emitting this counter before Phase E §3.1 (10% traffic) can begin. **Follow-up:** open as a BoJ-side prereq tracking item; reference this section. +BoJ does not currently expose this metric. The wiring contract — events, metric names, instrumentation sites, and `/metrics` exposure policy — is `docs/integration/boj-side-observability-spec.md`. The rollout-runbook §1.4 prerequisite tracks the wiring PR as a stop-the-rollout condition for Phase E §3.1 (10% traffic). 
### 3.2 `X-Trust-Level` from non-loopback peers — should be zero @@ -310,11 +309,10 @@ If BoJ does not currently expose this metric, the runbook §4.2 signal list is t The strongest enforcement is at the network layer (NetworkPolicy, firewall — landed via boj-server#173, runbook §1.4). This signal verifies the *invariant*; non-zero is a deployment defect that NetworkPolicy did not catch. -PromQL template (assumes a BoJ-side counter that distinguishes loopback from non-loopback origin): +PromQL (against the `boj_router_trust_level_present_count_total{remote_origin}` counter declared in `boj-side-observability-spec.md` §1 + §2; `remote_origin` ∈ `{loopback, non_loopback}`): ```promql # X-Trust-Level from non-loopback peers — must remain at zero. -# Replace metric name with the BoJ-side equivalent. sum( rate(boj_router_trust_level_present_count_total{remote_origin!="loopback"}[5m]) ) @@ -331,11 +329,10 @@ Any non-zero rate is the trigger. Immediate page; this is a §3 contract invaria **Runbook signal:** "BoJ 5xx rate (independent of gateway's view)." -PromQL template: +PromQL (against the `boj_http_responses_total{status, route}` counter declared in `boj-side-observability-spec.md` §1 + §2): ```promql # BoJ-emitted 5xx rate as a fraction of BoJ-handled requests. -# Replace metric names with the BoJ-side equivalents. sum(rate(boj_http_responses_total{status=~"5.."}[5m])) / sum(rate(boj_http_responses_total[5m])) ``` @@ -406,6 +403,7 @@ Phase E §3.4 (decommission BoJ direct external access) further requires all que ## 7. References - Rollout runbook — `docs/integration/hcg-tier2-rollout-runbook.md` (§4 signal list, §5 rollback triggers, §6 Trustfile flip). +- Sister spec (BoJ side) — `docs/integration/boj-side-observability-spec.md` (events, metric names, instrumentation sites for the §3 templates above). - Load profile — `docs/integration/gateway-load-profile.md` (§2 SLO budgets, §3.4 bench harness reference). 
- Audit — `docs/integration/http-capability-gateway-audit.md` (§1.6 telemetry, §5 telemetry shape, §1.4 mTLS path notes). - Plan — `docs/integration/http-capability-gateway-plan.md` (§Phase E E3 telemetry verification). diff --git a/docs/integration/hcg-tier2-rollout-runbook.md b/docs/integration/hcg-tier2-rollout-runbook.md index 5c3e0aaa..70fc2fd3 100644 --- a/docs/integration/hcg-tier2-rollout-runbook.md +++ b/docs/integration/hcg-tier2-rollout-runbook.md @@ -6,9 +6,9 @@ Copyright (c) Jonathan D.A. Jewell # HCG tier-2 — rollout & rollback runbook -**Version:** 0.7 (smoke-script stealth-profile canary, Phase E in-progress) -**Date:** 2026-06-15 (rev. from 2026-06-14) -**Status:** Phase E deliverables E1 (deploy spec) + E5 (rollback runbook) drafted; live gateway policy (`config/gateway-policy-boj.yaml`) promoted from the worked example (§1.5); `scripts/hcg-policy-smoke.sh` lands as the checked-in §1.5 operator pre-check (deny-path covers gateway-alone; `--with-backend` adds allow-path coverage); §1.5 verb-canary block covers OPTIONS, regex-route DELETE, and wrong-verb-on-listed-path; a path-canary exercises the no-match default-deny branch (synthetic unknown path with a `global_verbs` verb); and a stealth-profile canary now pins the deny *status code* — internal+stealth routes must return 404 (capability existence hidden) and authenticated routes must return 403, so a missing-`:stealth_profiles` misconfiguration is caught instead of slipping past the generic any-4xx deny pattern. Owner-input markers (`!OWNER:`) remain to be filled before any traffic-shift action is taken. +**Version:** 0.8 (BoJ-side observability spec prereq, Phase E in-progress) +**Date:** 2026-06-22 (rev. 
from 2026-06-15) +**Status:** Phase E deliverables E1 (deploy spec) + E5 (rollback runbook) drafted; live gateway policy (`config/gateway-policy-boj.yaml`) promoted from the worked example (§1.5); `scripts/hcg-policy-smoke.sh` lands as the checked-in §1.5 operator pre-check (deny-path covers gateway-alone; `--with-backend` adds allow-path coverage); §1.5 verb-canary block covers OPTIONS, regex-route DELETE, and wrong-verb-on-listed-path; a path-canary exercises the no-match default-deny branch (synthetic unknown path with a `global_verbs` verb); a stealth-profile canary pins the deny *status code* — internal+stealth routes must return 404 (capability existence hidden) and authenticated routes must return 403, so a missing-`:stealth_profiles` misconfiguration is caught instead of slipping past the generic any-4xx deny pattern; and `docs/integration/boj-side-observability-spec.md` now declares the BoJ-side telemetry events + Prometheus metric names that back the §4.2 signals — wired into §1.4 as a stop-the-rollout prerequisite, with the actual `BojRest.Router` emission left as a follow-up PR per the spec's §7 checklist. Owner-input markers (`!OWNER:`) remain to be filled before any traffic-shift action is taken. **ADR:** [`docs/decisions/0004-adopt-http-capability-gateway.md`](../decisions/0004-adopt-http-capability-gateway.md) **Plan:** [`docs/integration/http-capability-gateway-plan.md`](http-capability-gateway-plan.md) (§ Phase E) **Contract:** [`docs/integration/http-capability-gateway-boj-contract.md`](http-capability-gateway-boj-contract.md) @@ -83,6 +83,7 @@ These cannot be inferred from the code/contract; the owner must fill them before - [x] BoJ `BojRest.TrustPolicy.satisfies?/3` non-loopback-deny clause present — verified at `elixir/lib/boj_rest/trust_policy.ex:73` (`def satisfies?(_required, _trust, false), do: false`). Phase C invariant 3 enforcement; landed in boj-server#106. 
- [x] Phase E NetworkPolicy hardening (back-side reachability restricted) — boj-server#173. - [x] HCG-policy SSE-route coverage (`POST /cartridge/:name/sse` governed alongside `cartridge-invoke-post`) — boj-server#165. +- [ ] **BoJ-side observability emitted.** Four telemetry events declared in [`boj-side-observability-spec.md`](boj-side-observability-spec.md) §1 are emitted by `BojRest.Router`; the five Prometheus metrics in that spec §2 appear in a `/metrics` scrape; the `metrics-get` policy rule in that spec §5 governs the endpoint; and the `scripts/hcg-policy-smoke.sh` stealth canary covers it. Until this lands, §3.1 success criterion 4 ("No `X-Trust-Level` mismatches in BoJ access logs") and rollback trigger §5.1 row 4 are unobservable via Prometheus — the only signal path is BoJ structured logs, which the §4 dashboards do not currently consume. Spec landed; the wiring PR follows the spec's §7 checklist. - [ ] `Trustfile.a2ml [CLOUDFLARE_EDGE_SECURITY].rate_limiting.tier_2_gateway.status` currently `"PENDING — http-capability-gateway wiring forthcoming"` (line 900). _The flip to a real status is the **last** action; see §6._ ### 1.5 Gateway-side prerequisites @@ -313,6 +314,7 @@ Also update `[HTTP_CAPABILITY_GATEWAY]` section per plan §E acceptance: `status - `docs/integration/http-capability-gateway-boj-contract.md` — HTTP boundary contract. - `docs/integration/http-capability-gateway-policy-authoring.md` — policy file authoring workflow. - `docs/integration/gateway-observability-spec.md` — Phase E PromQL templates + alert-threshold bindings for the §4 signals + §5 rollback triggers. +- `docs/integration/boj-side-observability-spec.md` — Phase E §1.4 prerequisite spec: BoJ-side telemetry events, Prometheus metric names, and `BojRest.Router` instrumentation sites that back the §4.2 BoJ-side signals. - `http-capability-gateway/docs/perf-contract.md` — Phase D perf-contract. - `elixir/lib/boj_rest/trust_policy.ex` — `satisfies?/3` Phase C enforcement. 
- `.machine_readable/contractiles/trust/Trustfile.a2ml` — `[CLOUDFLARE_EDGE_SECURITY].rate_limiting.tier_2_gateway` (current `PENDING` site; §6.4 flip target) + `[SEAMS]` (Phase C gateway↔BoJ-gnosis declaration).