feat(observability): add /api/metrics + opt-in OTLP metrics push #403

Draft
aryasaatvik wants to merge 1 commit into RhysSullivan:main from aryasaatvik:feat/otlp-http-observability

@aryasaatvik
Contributor

Stack

Independent of the execution-history stack. Can land in any order.


Summary

Two complementary ways to get Effect's in-process metrics out — one always-on (pull), one opt-in (push). Addresses the "metrics collected but never exported" gap. Ships with zero new external dependencies — reuses `@effect/opentelemetry` which is already in the workspace.

What ships

Pull: `GET /api/metrics` in `@executor/api`

  • New `MetricsApi` group + handler.
  • Returns Prometheus text exposition format built from `Metric.unsafeSnapshot()` — hand-rolled serializer in `packages/core/api/src/metrics/prometheus.ts` (~170 lines, no new deps). Handles counters, gauges, histograms (cumulative bucket form + `+Inf` + `_count` + `_sum`), summaries, and frequencies.
  • Registered in `CoreExecutorApi` + `CoreHandlers`, so both apps expose it. Local mounts unconditionally (self-host scrape); cloud inherits `OrgAuth` so each org only sees their own metrics.
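As a rough illustration of the cumulative bucket form mentioned above, the sketch below serializes one histogram into Prometheus text. It is not the code in `packages/core/api/src/metrics/prometheus.ts`: the `HistogramSample` shape, `serializeHistogram`, and the helper names are hypothetical, and the input is assumed to carry per-bucket (non-cumulative) counts, which may differ from what `Metric.unsafeSnapshot()` actually returns.

```typescript
// Hypothetical input shape for illustration only; this is NOT the type
// returned by Effect's Metric.unsafeSnapshot().
interface HistogramSample {
  readonly name: string;
  readonly labels: Readonly<Record<string, string>>;
  // Per-bucket (non-cumulative) counts, sorted by ascending upper bound.
  readonly buckets: ReadonlyArray<{ readonly le: number; readonly count: number }>;
  // Total observations, including those above the largest bound.
  readonly count: number;
  readonly sum: number;
}

// Escape label values for the text format: backslash, double quote, newline.
const escapeLabelValue = (value: string): string =>
  value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");

// Render a label set as {k="v",...}, or an empty string when there are none.
const formatLabels = (labels: Readonly<Record<string, string>>): string => {
  const pairs = Object.entries(labels).map(
    ([key, value]) => `${key}="${escapeLabelValue(value)}"`,
  );
  return pairs.length === 0 ? "" : `{${pairs.join(",")}}`;
};

const serializeHistogram = (h: HistogramSample): string => {
  const lines: string[] = [`# TYPE ${h.name} histogram`];
  let running = 0;
  for (const bucket of h.buckets) {
    // Each `le` bucket reports every observation at or below its bound.
    running += bucket.count;
    lines.push(
      `${h.name}_bucket${formatLabels({ ...h.labels, le: String(bucket.le) })} ${running}`,
    );
  }
  // The +Inf bucket always equals the total observation count.
  lines.push(`${h.name}_bucket${formatLabels({ ...h.labels, le: "+Inf" })} ${h.count}`);
  lines.push(`${h.name}_sum${formatLabels(h.labels)} ${h.sum}`);
  lines.push(`${h.name}_count${formatLabels(h.labels)} ${h.count}`);
  return lines.join("\n") + "\n";
};
```

The sketch does not sanitize metric names, so a name containing characters such as `.` would need to be rewritten before it is valid in the text format.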

Push: OTLP metrics layer in `apps/local` only

  • `apps/local/src/server/telemetry.ts` wires `@effect/opentelemetry/OtlpMetrics.layer` behind the OTel-standard `OTEL_EXPORTER_OTLP_METRICS_ENDPOINT` env var. Absent = no push, no cost.
  • Auth via `OTEL_EXPORTER_OTLP_METRICS_HEADERS` (comma-separated `key=value` pairs). For Axiom: `Authorization=Bearer xxx,X-Axiom-Dataset=executor-local`.
  • Installs in a module-scope `ManagedRuntime` so the exporter's `exportInterval` timer keeps ticking across requests. Per-request scoping would shut down the exporter before the first batch leaves — same constraint the existing cloud DO tracer install documents.
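The opt-in behaviour and header format described above could be resolved roughly as follows. This is a sketch under assumptions, not the contents of `apps/local/src/server/telemetry.ts`: `parseOtlpHeaders`, `resolveOtlpPushConfig`, and `OtlpPushConfig` are hypothetical names, and the sketch does not percent-decode header values.

```typescript
// Hypothetical config shape; the real layer options may differ.
interface OtlpPushConfig {
  readonly url: string;
  readonly headers: Record<string, string>;
}

// Parse "k1=v1,k2=v2" into a header map. Splits on the first "=" only,
// so values that themselves contain "=" (e.g. base64 tokens) stay intact.
const parseOtlpHeaders = (raw: string | undefined): Record<string, string> => {
  const headers: Record<string, string> = {};
  if (!raw) return headers;
  for (const pair of raw.split(",")) {
    const idx = pair.indexOf("=");
    if (idx <= 0) continue; // skip entries with no key or no "="
    const key = pair.slice(0, idx).trim();
    const value = pair.slice(idx + 1).trim();
    if (key) headers[key] = value;
  }
  return headers;
};

// Returns undefined when the endpoint variable is absent or blank,
// which the caller can treat as "do not install the push layer".
const resolveOtlpPushConfig = (
  env: Record<string, string | undefined>,
): OtlpPushConfig | undefined => {
  const url = env.OTEL_EXPORTER_OTLP_METRICS_ENDPOINT?.trim();
  if (!url) return undefined;
  return { url, headers: parseOtlpHeaders(env.OTEL_EXPORTER_OTLP_METRICS_HEADERS) };
};
```

A caller would pass `process.env` and only construct the exporter when the result is defined, which keeps the "absent = no push, no cost" property in one place.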

Cloud push path: deliberately not included

Cloud already has tracing via `@microlabs/otel-cf-workers`. Metrics push adds Axiom ingestion cost + cardinality risk that nobody's asked for. The pull endpoint (auth-gated) is there for ops who want to scrape; OTLP push is a follow-up if concretely needed.

Cardinality policy

Full `mcp.tool.name` dimensions kept (namespace + operation, e.g. `github.repos.get`). The primary consumer is self-host / local where cardinality is bounded by the operator's own plugin set. Any cloud-side cap is a follow-up.

Dependency alignment

  • `@effect/opentelemetry@^0.63.0` added to `apps/local`'s `package.json` (already in the workspace via `apps/cloud` at the same version — no new package fetched, bun.lock entry reuses existing resolution).
  • No removals from main's dep graph.
  • No changes to `apps/cloud` `package.json` or telemetry wiring.

Why Prometheus format for the pull endpoint

  • Local self-host UX: operator can scrape with a local Prometheus, a `curl` one-liner, or eventually a daemon UI panel that renders the text directly. No external collector needed.
  • Format simplicity: a ~170-line serializer, no dep pulled in for what's essentially `name + labels + value + newline`.
  • Compatibility: most observability tooling speaks Prometheus scrape. Grafana Agent, Alloy, OpenTelemetry Collector, VictoriaMetrics, and Datadog Agent can all scrape and forward.

Test plan

  • `bun x vitest run` in `@executor/api` — 8/8 passing (4 new Prometheus serializer tests + 4 existing observability tests).
  • `bun x vitest run` in `@executor/sdk` — 90/90.
  • `bun x vitest run` in `@executor/hosts/mcp` — 23/23.
  • `bun x tsc --noEmit` in `@executor/api`, `apps/local`, `apps/cloud` — all clean.
  • Dev-server smoke (reviewer): start the daemon, `curl http://localhost:4788/api/metrics`, verify Prometheus-format output. Optionally set the push env vars, trigger executions, verify metrics land in the backend.
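For the smoke step, "verify Prometheus-format output" can be made slightly more mechanical with a loose line check like the one below. This is a hypothetical reviewer helper, not part of the PR; `findInvalidLines` is an invented name, and the pattern is a simplification that rejects label values containing `}`.

```typescript
// Loose shape of a sample line: name, optional {labels}, value, optional timestamp.
// Simplified on purpose: label values containing "}" are not supported.
const SAMPLE_LINE = /^[a-zA-Z_:][a-zA-Z0-9_:]*(\{[^}]*\})? (\S+)( -?\d+)?$/;

// Sample values are numbers or one of the special tokens NaN, +Inf, -Inf.
const isValidSampleValue = (token: string): boolean =>
  token === "NaN" ||
  token === "+Inf" ||
  token === "-Inf" ||
  Number.isFinite(Number(token));

// Returns every non-blank, non-comment line that does not look like a sample.
const findInvalidLines = (body: string): string[] =>
  body
    .split("\n")
    .filter((line) => line.trim() !== "" && !line.startsWith("#"))
    .filter((line) => {
      const match = SAMPLE_LINE.exec(line);
      return match === null || !isValidSampleValue(match[2]);
    });
```

A reviewer could fetch `http://localhost:4788/api/metrics`, pass the response text to `findInvalidLines`, and expect an empty array.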

Follow-ups (out of scope)

  • Axiom / Grafana / Prometheus dashboards.
  • Cardinality guardrails if cloud OTLP push is ever added.
  • Small daemon UI panel that renders the metrics snapshot inline.

Wires two independent ways to get Effect's in-process metrics out —
a pull endpoint that's always on locally and auth-gated on cloud, plus
an opt-in OTLP push path that the self-host daemon enables via env var.

## What ships

### Pull: `GET /api/metrics` in `@executor/api`

- New `MetricsApi` group + handler. Returns Prometheus text exposition
  format built from `Metric.unsafeSnapshot()` — hand-rolled serializer
  in `packages/core/api/src/metrics/prometheus.ts` (~170 lines, no
  new deps). Handles counters, gauges, histograms (cumulative bucket
  form + `+Inf` + `_count` + `_sum`), summaries, and frequencies.
- Registered in `CoreExecutorApi` + `CoreHandlers`, so both
  `apps/local` and `apps/cloud` expose the endpoint. Local mounts
  unconditionally (self-host operator can scrape); cloud inherits the
  `OrgAuth` middleware so each org only sees their own metrics.

### Push: OTLP metrics layer in `apps/local`

- `apps/local/src/server/telemetry.ts` wires
  `@effect/opentelemetry/OtlpMetrics.layer` behind
  `OTEL_EXPORTER_OTLP_METRICS_ENDPOINT` (OTel-standard env var). Absent =
  no push, no cost. Auth via comma-separated key=value pairs in
  `OTEL_EXPORTER_OTLP_METRICS_HEADERS` — Axiom users set
  `Authorization=Bearer xxx,X-Axiom-Dataset=executor-local`.
- Installs in a module-scope `ManagedRuntime` so the exporter's
  `exportInterval` timer keeps ticking across requests. Per-request
  scoping would shut the exporter down before the first batch leaves.
- Booted once from `createServerHandlers` via `startMetricsExport()`.

### Cloud push path: deliberately not included

Cloud already has tracing via `@microlabs/otel-cf-workers`. Metrics
push would add cost + cardinality risk that nobody's asked for. The
pull endpoint (auth-gated) is there for ops teams who want to scrape;
OTLP push can be added later when there's a concrete need.

## Dependency alignment

- `@effect/opentelemetry@^0.63.0` added to `apps/local` (already in
  the workspace via `apps/cloud`; no new package fetched).
- No removals from main's dep graph.
- No changes to `apps/cloud` deps or wiring.

## Test plan

- [x] `bun x vitest run` in `@executor/api` — 8/8 (4 new prometheus
  serializer tests + 4 existing).
- [x] `bun x vitest run` in `@executor/sdk` — 90/90.
- [x] `bun x vitest run` in `@executor/hosts/mcp` — 23/23.
- [x] `bun x tsc --noEmit` in `@executor/api`, `apps/local`,
  `apps/cloud` — all clean.
- [ ] Dev-server smoke (reviewer): start the daemon, `curl
  localhost:4788/api/metrics` — verify Prometheus-format output.
  Optionally set `OTEL_EXPORTER_OTLP_METRICS_ENDPOINT` + headers,
  trigger some executions, verify metrics appear in the backend.

## Follow-ups

- Dashboards for Axiom / Grafana / Prometheus. Not blocking this PR.
- Cardinality guardrails if cloud push path is ever added.