diff --git a/pip/pip-471.md b/pip/pip-471.md
new file mode 100644
index 0000000000000..46278866de2ce
--- /dev/null
+++ b/pip/pip-471.md
@@ -0,0 +1,302 @@
+# PIP-471: Authorization Operation Metrics
+
+# Background knowledge
+
+Pulsar brokers perform authorization checks before allowing clients, proxies, and administrative callers to access
+topics, namespaces, tenants, brokers, clusters, and policy operations. These checks are handled through the broker-side
+`AuthorizationService`, which delegates decisions to the configured `AuthorizationProvider`.
+
+Pulsar already exposes security-related metrics, especially around authentication. These metrics help operators detect
+login failures, unhealthy clients, and changes in access patterns. However, Pulsar does not expose a generic broker-level
+metric stream for authorization outcomes. Authorization denials are mostly visible through request failures and logs,
+which makes them harder to alert on and harder to compare with successful authorization traffic.
+
+Pulsar also supports both Prometheus-compatible metrics and OpenTelemetry metrics. New broker observability features
+should keep those pipelines aligned when possible, so operators can consume equivalent signals regardless of their
+metrics backend.
+
+# Motivation
+
+Operators need a low-cardinality, broker-native signal that shows whether authorization checks are succeeding or failing.
+This is useful for security alerting, baseline monitoring, and compliance-oriented reporting.
+
+Without a dedicated authorization metric, operators have to infer authorization denials from logs, HTTP status codes, or
+client-side errors. That is brittle and does not support standard monitoring patterns such as:
+
+- Alerting on spikes in authorization failures.
+- Comparing authorization failures against successful authorizations.
+- Distinguishing authentication failures from authorization failures.
+- Building dashboards by authorization resource category.
+- Exporting equivalent authorization signals through both Prometheus and OpenTelemetry.
+
+A failure-only metric is also not sufficient. Operators often need success, failure, and error counts together to
+understand whether a denial spike reflects an attack, a rollout issue, a policy mistake, an authorization provider
+problem, or a normal traffic shift.
+
+# Goals
+
+## In Scope
+
+- Add a low-cardinality broker authorization metric for operation outcomes.
+- Record successful, failed, and errored authorization operations.
+- Expose the metric through the Prometheus-compatible broker metrics endpoint.
+- Expose the same metric through OpenTelemetry.
+- Centralize instrumentation in `AuthorizationService` so broker authorization paths share the same metric model.
+- Avoid identity-bearing or high-cardinality metric dimensions.
+
+## Out of Scope
+
+- Per-role, per-topic, per-tenant, per-namespace, or per-principal labels.
+- Audit-log payloads or structured security event streams.
+- New authorization APIs or binary protocol changes.
+- Alert rule definitions for downstream monitoring stacks.
+- Configuration to enable or disable this specific metric.
+
+# High Level Design
+
+Introduce a generic authorization operation counter that is incremented when the broker finishes an authorization
+decision or rejects an authorization request before invoking the configured provider.
+
+The metric is recorded centrally in `AuthorizationService`, which is the broker-side entry point for authorization checks
+across topic, namespace, tenant, broker, cluster, and policy operations. Each completed provider decision or direct
+authorization rejection emits one result with a small, fixed dimension set:
+
+- the resource type that was checked
+- the operation that was requested
+- whether the result was a success, failure, or error
+
+This metric is exported in two equivalent forms by the same helper class:
+
+- a Prometheus counter for the existing broker metrics endpoint
+- an OpenTelemetry `LongCounter` for OpenTelemetry metrics export
+
+Invalid original-principal combinations in proxied authorization flows are counted as authorization failures because the
+broker rejects the request during authorization handling. For valid proxied authorization flows, the broker evaluates
+both the proxy role and the original principal, and each completed authorization decision is recorded.
+
+# Detailed Design
+
+## Design & Implementation Details
+
+This proposal introduces `org.apache.pulsar.broker.authorization.metrics.AuthorizationMetrics`, a broker authorization
+metrics helper that owns:
+
+- a Prometheus `Counter` for broker metrics scraping
+- an OpenTelemetry `LongCounter` for OpenTelemetry metrics export
+
+The helper uses the following constants for metric names, instrumentation scope, label values, and OpenTelemetry
+attribute keys:
+
+| Constant | Value |
+|---|---|
+| `AUTHORIZATION_OPERATIONS_METRIC_NAME` | `pulsar_authorization_operations_total` |
+| `AUTHORIZATION_COUNTER_METRIC_NAME` | `pulsar.authorization.operation.count` |
+| `INSTRUMENTATION_SCOPE_NAME` | `org.apache.pulsar.authorization` |
+| `RESULT_SUCCESS` | `success` |
+| `RESULT_FAILURE` | `failure` |
+| `RESULT_ERROR` | `error` |
+| `RESOURCE_TYPE_KEY` | `pulsar.authorization.resource.type` |
+| `OPERATION_KEY` | `pulsar.authorization.operation` |
+| `RESULT_KEY` | `pulsar.authorization.result` |
+
+`AuthorizationMetrics` registers a static Prometheus counter with labels `resource_type`, `operation`, and `result`.
+It also builds an OpenTelemetry `LongCounter` from the `OpenTelemetry` instance passed to the constructor.
+
+The helper exposes three recording methods:
+
+| Method | Behavior |
+|---|---|
+| `recordSuccess(resourceType, operation)` | Increments the Prometheus counter with `result="success"` and adds `1` to the OpenTelemetry counter with `pulsar.authorization.result="success"`. |
+| `recordFailure(resourceType, operation)` | Increments the Prometheus counter with `result="failure"` and adds `1` to the OpenTelemetry counter with `pulsar.authorization.result="failure"`. |
+| `recordError(resourceType, operation)` | Increments the Prometheus counter with `result="error"` and adds `1` to the OpenTelemetry counter with `pulsar.authorization.result="error"`. |
+
+`AuthorizationService` owns one `AuthorizationMetrics` instance. The existing `AuthorizationService` constructor remains
+available and delegates to a new constructor with `OpenTelemetry.noop()`. `BrokerService` constructs
+`AuthorizationService` with `pulsar.getOpenTelemetry().getOpenTelemetry()` so the OpenTelemetry counter is exported by
+the broker's OpenTelemetry pipeline.
+
+`AuthorizationService` records a result after each completed authorization operation. If the provider returns `true`, the
+helper records a success. If the provider returns `false`, the helper records a failure. If the provider future completes
+exceptionally, the helper records an error because authorization evaluation failed before a boolean decision was returned.
+
+If `AuthorizationService` rejects a request before provider evaluation, such as an invalid original-principal combination
+for proxied requests, it records a failure directly and returns a completed `false` future. Existing
+authorization-disabled short-circuit behavior is preserved; operation methods that already return early when
+authorization is disabled do not emit this metric on that path.
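The recording rules described above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the proposed implementation: `AuthorizationRecordingSketch` and its nested `Metrics` class are hypothetical stand-ins that keep counts in memory instead of using the Prometheus `Counter` and OpenTelemetry `LongCounter`, and the method names `recordOutcome` and `rejectBeforeProvider` are invented for this example.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch of the outcome-recording rules; not the real AuthorizationService.
final class AuthorizationRecordingSketch {

    // Hypothetical in-memory stand-in for the proposed AuthorizationMetrics helper.
    static final class Metrics {
        private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

        void recordSuccess(String resourceType, String operation) {
            add(resourceType, operation, "success");
        }

        void recordFailure(String resourceType, String operation) {
            add(resourceType, operation, "failure");
        }

        void recordError(String resourceType, String operation) {
            add(resourceType, operation, "error");
        }

        long get(String resourceType, String operation, String result) {
            LongAdder adder = counts.get(key(resourceType, operation, result));
            return adder == null ? 0L : adder.sum();
        }

        private void add(String resourceType, String operation, String result) {
            counts.computeIfAbsent(key(resourceType, operation, result), k -> new LongAdder()).increment();
        }

        private static String key(String resourceType, String operation, String result) {
            return resourceType + "|" + operation + "|" + result;
        }
    }

    private final Metrics metrics;

    AuthorizationRecordingSketch(Metrics metrics) {
        this.metrics = metrics;
    }

    // Records exactly one outcome when the provider decision completes.
    CompletableFuture<Boolean> recordOutcome(String resourceType, String operation,
                                             CompletableFuture<Boolean> providerDecision) {
        return providerDecision.whenComplete((allowed, error) -> {
            if (error != null) {
                // Evaluation failed before a boolean decision was returned.
                metrics.recordError(resourceType, operation);
            } else if (Boolean.TRUE.equals(allowed)) {
                metrics.recordSuccess(resourceType, operation);
            } else {
                // Assumption of this sketch: a null result is treated like a denial.
                metrics.recordFailure(resourceType, operation);
            }
        });
    }

    // Rejection before the provider is invoked, e.g. an invalid original-principal combination.
    CompletableFuture<Boolean> rejectBeforeProvider(String resourceType, String operation) {
        metrics.recordFailure(resourceType, operation);
        return CompletableFuture.completedFuture(false);
    }

    public static void main(String[] args) {
        Metrics metrics = new Metrics();
        AuthorizationRecordingSketch service = new AuthorizationRecordingSketch(metrics);
        service.recordOutcome("topic", "produce", CompletableFuture.completedFuture(true)).join();
        service.rejectBeforeProvider("topic", "consume").join();
        System.out.println("topic/produce success = " + metrics.get("topic", "produce", "success"));
        System.out.println("topic/consume failure = " + metrics.get("topic", "consume", "failure"));
    }
}
```

Because `whenComplete` returns a future that carries the same result or exception as the provider future, callers in this sketch see unchanged authorization behavior; only the counter side effect is added.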
+
+The instrumentation applies to the following authorization flows:
+
+- superuser checks
+- tenant-admin checks
+- tenant operations
+- broker operations
+- cluster operations
+- cluster policy operations
+- namespace operations
+- namespace policy operations
+- topic operations
+- topic policy operations
+
+The metric dimensions are intentionally bounded. The resource type is selected from a fixed set of constants in
+`AuthorizationMetrics`. The operation is `check` for superuser and tenant-admin checks. For enum-backed operations, the
+operation is the lower-case enum name. If an existing authorization path does not provide an operation value, the metric
+uses a fixed `unknown` operation value rather than failing the request path or introducing dynamic labels.
+
+The metric does not include role names, topic names, tenant names, namespace names, client addresses, provider names,
+exception classes, or error messages.
+
+## Public-facing Changes
+
+### Public API
+
+No public client, admin, REST, or `AuthorizationProvider` API changes.
+
+### Binary protocol
+
+No binary protocol changes.
+
+### Configuration
+
+No new configuration is required.
+
+### CLI
+
+No CLI changes.
+
+### Metrics
+
+Prometheus metric:
+
+| Field | Value |
+|---|---|
+| Full name | `pulsar_authorization_operations_total` |
+| Description | Pulsar authorization operations |
+| Type | Counter |
+| Labels | `resource_type`, `operation`, `result` |
+| Unit | operations |
+
+OpenTelemetry metric:
+
+| Field | Value |
+|---|---|
+| Full name | `pulsar.authorization.operation.count` |
+| Description | The number of authorization operations |
+| Type | `LongCounter` |
+| Attributes | `pulsar.authorization.resource.type`, `pulsar.authorization.operation`, `pulsar.authorization.result` |
+| Unit | `{operation}` |
+
+Result values:
+
+| Value | Meaning |
+|---|---|
+| `success` | The authorization request was allowed. |
+| `failure` | The authorization request was denied or rejected by authorization handling. |
+| `error` | Authorization evaluation failed before an allow/deny decision was returned. |
+
+Resource type values:
+
+| Value | Meaning |
+|---|---|
+| `superuser` | Superuser authorization check. |
+| `tenant_admin` | Tenant-admin authorization check. |
+| `tenant` | Tenant operation authorization check. |
+| `broker` | Broker operation authorization check. |
+| `cluster` | Cluster operation authorization check. |
+| `cluster_policy` | Cluster policy operation authorization check. |
+| `namespace` | Namespace operation authorization check. |
+| `namespace_policy` | Namespace policy operation authorization check. |
+| `topic` | Topic operation authorization check. |
+| `topic_policy` | Topic policy operation authorization check. |
+
+Operation values are normalized authorization operation names. Examples include `produce`, `consume`, `lookup`,
+`packages`, and `read`. Superuser and tenant-admin checks use `check`. Existing authorization paths that do not provide
+a concrete operation value use `unknown`.
+
+# Monitoring
+
+Operators should monitor absolute authorization failures and errors, plus the relationship between failures and
+successes.
+Recommended patterns include:
+
+- Alert on sustained increases in `result="failure"`.
+- Alert on sustained increases in `result="error"`, which can indicate authorization provider failures or outages.
+- Build dashboards that show `success`, `failure`, and `error` together by `resource_type`.
+- Investigate rollout regressions by comparing failure rates before and after authorization policy changes.
+- Correlate authorization failures with authentication metrics to distinguish authentication incidents from
+  authorization incidents.
+
+This proposal enables ratio-based alerting because success, failure, and error outcomes are reported in the same metric
+family.
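The ratio behind such an alert is simple arithmetic over counter increases. The sketch below shows it in plain Java for illustration only; the class name `AuthorizationFailureRatioSketch`, its methods, and the sample readings are hypothetical, and in practice the monitoring backend would typically perform this calculation rather than broker code.

```java
// Hypothetical helper illustrating the arithmetic of a failure-ratio alert.
final class AuthorizationFailureRatioSketch {

    // Increase of a monotonically increasing counter between two readings.
    // If the later reading is smaller, this sketch assumes the counter restarted from zero.
    static long increase(long earlier, long later) {
        return later >= earlier ? later - earlier : later;
    }

    // Share of failed operations among all operations recorded in the window.
    // Returns 0.0 when nothing was recorded, to avoid dividing by zero.
    static double failureRatio(long successIncrease, long failureIncrease, long errorIncrease) {
        long total = successIncrease + failureIncrease + errorIncrease;
        return total == 0L ? 0.0 : (double) failureIncrease / total;
    }

    public static void main(String[] args) {
        // Made-up readings of the success, failure, and error series at two points in time.
        long success = increase(100L, 190L);
        long failure = increase(5L, 15L);
        long error = increase(0L, 0L);
        System.out.println("failure ratio = " + failureRatio(success, failure, error));
    }
}
```

Including `error` in the denominator is a choice made for this sketch; an operator could equally track errors as a separate ratio, since they point at provider problems rather than denials.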
+
+# Security Considerations
+
+This proposal improves security observability but does not change authorization semantics.
+
+Authorization decisions can be high volume and may involve sensitive identifiers. The metric therefore avoids
+identity-bearing labels and attributes. It does not include roles, principals, topics, namespaces, tenants, client
+addresses, or error messages. This keeps the metric useful for operations without turning it into an audit-log substitute
+or a high-cardinality data leak.
+
+Failed proxy original-principal validation is counted as an authorization failure because the broker rejects the request
+during authorization handling.
+
+# Backward & Forward Compatibility
+
+## Upgrade
+
+No special upgrade action is required. The new metrics appear automatically after upgrading brokers that include this
+feature.
+
+Monitoring systems should treat these as new metric series. Existing metrics and authorization behavior are unchanged.
+
+## Downgrade / Rollback
+
+Downgrading removes the new metrics. Monitoring systems should tolerate missing-series behavior during rollback.
+
+## Pulsar Geo-Replication Upgrade & Downgrade/Rollback Considerations
+
+No geo-replication protocol, metadata, or wire compatibility changes are introduced.
+
+# Alternatives
+
+- Failure-only counter:
+  Rejected because operators often need success, failure, and error counts to interpret changes correctly and to build
+  ratio-based alerts.
+
+- OpenTelemetry-only metric:
+  Rejected because Pulsar still exposes Prometheus-compatible broker metrics and many deployments rely on the broker
+  metrics endpoint.
+
+- Prometheus-only metric:
+  Rejected because Pulsar is adding OpenTelemetry support and new broker observability should keep equivalent
+  OpenTelemetry signals where practical.
+
+- Detailed identity labels such as role, tenant, namespace, or topic:
+  Rejected due to cardinality and privacy concerns.
+
+- Instrument each authorization call site independently:
+  Rejected because it would be error-prone and would likely produce inconsistent semantics across broker paths.
+
+- Cache Prometheus label children or prebuild OpenTelemetry attributes for every resource type, operation, and result
+  combination:
+  Deferred because the initial implementation keeps the dimension set bounded and simple. This can be added later if
+  profiling shows metric recording overhead is significant on hot authorization paths.
+
+# General Notes
+
+This proposal is intentionally limited to broker metrics. It does not replace audit logging or structured security
+events.
+
+The metric dimensions add some per-recording overhead because Prometheus label children and OpenTelemetry attributes
+must be resolved when recording. The proposed dimension set is deliberately small and bounded to keep this overhead
+predictable.
+
+The implementation includes focused test coverage for both metric export paths:
+
+- Prometheus samples are validated through `CollectorRegistry.defaultRegistry.getSampleValue(...)`.
+- OpenTelemetry samples are validated through the broker OpenTelemetry metric reader.
+
+# Links
+
+* Mailing List discussion thread:
+* Mailing List voting thread: