Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
302 changes: 302 additions & 0 deletions pip/pip-471.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,302 @@
# PIP-471: Authorization Operation Metrics

# Background knowledge

Pulsar brokers perform authorization checks before allowing clients, proxies, and administrative callers to access
topics, namespaces, tenants, brokers, clusters, and policy operations. These checks are handled through the broker-side
`AuthorizationService`, which delegates decisions to the configured `AuthorizationProvider`.

Pulsar already exposes security-related metrics, especially around authentication. These metrics help operators detect
login failures, unhealthy clients, and changes in access patterns. However, Pulsar does not expose a generic broker-level
metric stream for authorization outcomes. Authorization denials are mostly visible through request failures and logs,
which makes them harder to alert on and harder to compare with successful authorization traffic.

Pulsar also supports both Prometheus-compatible metrics and OpenTelemetry metrics. New broker observability features
should keep those pipelines aligned when possible, so operators can consume equivalent signals regardless of their
metrics backend.

# Motivation

Operators need a low-cardinality, broker-native signal that shows whether authorization checks are succeeding or failing.
This is useful for security alerting, baseline monitoring, and compliance-oriented reporting.

Without a dedicated authorization metric, operators have to infer authorization denials from logs, HTTP status codes, or
client-side errors. That is brittle and does not support standard monitoring patterns such as:

- Alerting on spikes in authorization failures.
- Comparing authorization failures against successful authorizations.
- Distinguishing authentication failures from authorization failures.
- Building dashboards by authorization resource category.
- Exporting equivalent authorization signals through both Prometheus and OpenTelemetry.

A failure-only metric is also not sufficient. Operators often need success, failure, and error counts together to
understand whether a denial spike reflects an attack, a rollout issue, a policy mistake, an authorization provider
problem, or a normal traffic shift.

# Goals

## In Scope

- Add a low-cardinality broker authorization metric for operation outcomes.
- Record successful, failed, and errored authorization operations.
- Expose the metric through the Prometheus-compatible broker metrics endpoint.
- Expose the same metric through OpenTelemetry.
- Centralize instrumentation in `AuthorizationService` so broker authorization paths share the same metric model.
- Avoid identity-bearing or high-cardinality metric dimensions.

## Out of Scope

- Per-role, per-topic, per-tenant, per-namespace, or per-principal labels.
- Audit-log payloads or structured security event streams.
- New authorization APIs or binary protocol changes.
- Alert rule definitions for downstream monitoring stacks.
- Configuration to enable or disable this specific metric.

# High Level Design

Introduce a generic authorization operation counter that is incremented when the broker finishes an authorization
decision or rejects an authorization request before invoking the configured provider.

The metric is recorded centrally in `AuthorizationService`, which is the broker-side entry point for authorization checks
across topic, namespace, tenant, broker, cluster, and policy operations. Each completed provider decision or direct
authorization rejection emits one result with a small, fixed dimension set:

- the resource type that was checked
- the operation that was requested
- whether the result was a success, failure, or error

This metric is exported in two equivalent forms by the same helper class:

- a Prometheus counter for the existing broker metrics endpoint
- an OpenTelemetry `LongCounter` for OpenTelemetry metrics export

Invalid original-principal combinations in proxied authorization flows are counted as authorization failures because the
broker rejects the request during authorization handling. For valid proxied authorization flows, the broker evaluates
both the proxy role and the original principal, and each completed authorization decision is recorded.

# Detailed Design

## Design & Implementation Details

This proposal introduces `org.apache.pulsar.broker.authorization.metrics.AuthorizationMetrics`, a broker authorization
metrics helper that owns:

- a Prometheus `Counter` for broker metrics scraping
- an OpenTelemetry `LongCounter` for OpenTelemetry metrics export

The helper uses the following constants for metric names, instrumentation scope, label values, and OpenTelemetry
attribute keys:

| Constant | Value |
|---|---|
| `AUTHORIZATION_OPERATIONS_METRIC_NAME` | `pulsar_authorization_operations_total` |
| `AUTHORIZATION_COUNTER_METRIC_NAME` | `pulsar.authorization.operation.count` |
| `INSTRUMENTATION_SCOPE_NAME` | `org.apache.pulsar.authorization` |
| `RESULT_SUCCESS` | `success` |
| `RESULT_FAILURE` | `failure` |
| `RESULT_ERROR` | `error` |
| `RESOURCE_TYPE_KEY` | `pulsar.authorization.resource.type` |
| `OPERATION_KEY` | `pulsar.authorization.operation` |
| `RESULT_KEY` | `pulsar.authorization.result` |

`AuthorizationMetrics` registers a static Prometheus counter with labels `resource_type`, `operation`, and `result`.
It also builds an OpenTelemetry `LongCounter` from the `OpenTelemetry` instance passed to the constructor.

The helper exposes three recording methods:

| Method | Behavior |
|---|---|
| `recordSuccess(resourceType, operation)` | Increments the Prometheus counter with `result="success"` and adds `1` to the OpenTelemetry counter with `pulsar.authorization.result="success"`. |
| `recordFailure(resourceType, operation)` | Increments the Prometheus counter with `result="failure"` and adds `1` to the OpenTelemetry counter with `pulsar.authorization.result="failure"`. |
| `recordError(resourceType, operation)` | Increments the Prometheus counter with `result="error"` and adds `1` to the OpenTelemetry counter with `pulsar.authorization.result="error"`. |

`AuthorizationService` owns one `AuthorizationMetrics` instance. The existing `AuthorizationService` constructor remains
available and delegates to a new constructor with `OpenTelemetry.noop()`. `BrokerService` constructs
`AuthorizationService` with `pulsar.getOpenTelemetry().getOpenTelemetry()` so the OpenTelemetry counter is exported by
the broker's OpenTelemetry pipeline.

`AuthorizationService` records a result after each completed authorization operation. If the provider returns `true`, the
helper records a success. If the provider returns `false`, the helper records a failure. If the provider future completes
exceptionally, the helper records an error because authorization evaluation failed before a boolean decision was returned.

If `AuthorizationService` rejects a request before provider evaluation, such as an invalid original-principal combination
for proxied requests, it records a failure directly and returns a completed `false` future. Existing
authorization-disabled short-circuit behavior is preserved; operation methods that already return early when
authorization is disabled do not emit this metric on that path.

The instrumentation applies to the following authorization flows:

- superuser checks
- tenant-admin checks
- tenant operations
- broker operations
- cluster operations
- cluster policy operations
- namespace operations
- namespace policy operations
- topic operations
- topic policy operations

The metric dimensions are intentionally bounded. The resource type is selected from a fixed set of constants in
`AuthorizationMetrics`. The operation is `check` for superuser and tenant-admin checks. For enum-backed operations, the
operation is the lower-case enum name. If an existing authorization path does not provide an operation value, the metric
uses a fixed `unknown` operation value rather than failing the request path or introducing dynamic labels.

The metric does not include role names, topic names, tenant names, namespace names, client addresses, provider names,
exception classes, or error messages.

## Public-facing Changes

### Public API

No public client, admin, REST, or `AuthorizationProvider` API changes.

### Binary protocol

No binary protocol changes.

### Configuration

No new configuration is required.

### CLI

No CLI changes.

### Metrics

Prometheus metric:

| Field | Value |
|---|---|
| Full name | `pulsar_authorization_operations_total` |
| Description | Pulsar authorization operations |
| Type | Counter |
| Labels | `resource_type`, `operation`, `result` |
| Unit | operations |

OpenTelemetry metric:

| Field | Value |
|---|---|
| Full name | `pulsar.authorization.operation.count` |
| Description | The number of authorization operations |
| Type | `LongCounter` |
| Attributes | `pulsar.authorization.resource.type`, `pulsar.authorization.operation`, `pulsar.authorization.result` |
| Unit | `{operation}` |

Result values:

| Value | Meaning |
|---|---|
| `success` | The authorization request was allowed. |
| `failure` | The authorization request was denied or rejected by authorization handling. |
| `error` | Authorization evaluation failed before an allow/deny decision was returned. |

Resource type values:

| Value | Meaning |
|---|---|
| `superuser` | Superuser authorization check. |
| `tenant_admin` | Tenant-admin authorization check. |
| `tenant` | Tenant operation authorization check. |
| `broker` | Broker operation authorization check. |
| `cluster` | Cluster operation authorization check. |
| `cluster_policy` | Cluster policy operation authorization check. |
| `namespace` | Namespace operation authorization check. |
| `namespace_policy` | Namespace policy operation authorization check. |
| `topic` | Topic operation authorization check. |
| `topic_policy` | Topic policy operation authorization check. |

Operation values are normalized authorization operation names. Examples include `produce`, `consume`, `lookup`,
`packages`, and `read`. Superuser and tenant-admin checks use `check`. Existing authorization paths that do not provide
a concrete operation value use `unknown`.

# Monitoring

Operators should monitor absolute authorization failures and errors, plus the relationship between failures and
successes.
Recommended patterns include:

- Alert on sustained increases in `result="failure"`.
- Alert on sustained increases in `result="error"`, which can indicate authorization provider failures or outages.
- Build dashboards that show `success`, `failure`, and `error` together by `resource_type`.
- Investigate rollout regressions by comparing failure rates before and after authorization policy changes.
- Correlate authorization failures with authentication metrics to distinguish authentication incidents from
authorization incidents.

This proposal enables ratio-based alerting because success, failure, and error outcomes are reported in the same metric
family.

# Security Considerations

This proposal improves security observability but does not change authorization semantics.

Authorization decisions can be high volume and may involve sensitive identifiers. The metric therefore avoids
identity-bearing labels and attributes. It does not include roles, principals, topics, namespaces, tenants, client
addresses, or error messages. This keeps the metric useful for operations without turning it into an audit-log substitute
or a high-cardinality data leak.

Failed proxy original-principal validation is counted as an authorization failure because the broker rejects the request
during authorization handling.

# Backward & Forward Compatibility

## Upgrade

No special upgrade action is required. The new metrics appear automatically after upgrading brokers that include this
feature.

Monitoring systems should treat these as new metric series. Existing metrics and authorization behavior are unchanged.

## Downgrade / Rollback

Downgrading removes the new metrics. Monitoring systems should tolerate missing-series behavior during rollback.

## Pulsar Geo-Replication Upgrade & Downgrade/Rollback Considerations

No geo-replication protocol, metadata, or wire compatibility changes are introduced.

# Alternatives

- Failure-only counter:
Rejected because operators often need success, failure, and error counts to interpret changes correctly and to build
ratio-based alerts.

- OpenTelemetry-only metric:
Rejected because Pulsar still exposes Prometheus-compatible broker metrics and many deployments rely on the broker
metrics endpoint.

- Prometheus-only metric:
Rejected because Pulsar is adding OpenTelemetry support and new broker observability should keep equivalent
OpenTelemetry signals where practical.

- Detailed identity labels such as role, tenant, namespace, or topic:
Rejected due to cardinality and privacy concerns.

- Instrument each authorization call site independently:
Rejected because it would be error-prone and would likely produce inconsistent semantics across broker paths.

- Cache Prometheus label children or prebuild OpenTelemetry attributes for every resource type, operation, and result
combination:
Deferred because the initial implementation keeps the dimension set bounded and simple. This can be added later if
profiling shows metric recording overhead is significant on hot authorization paths.

# General Notes

This proposal is intentionally limited to broker metrics. It does not replace audit logging or structured security
events.

The metric dimensions add some per-recording overhead because Prometheus label children and OpenTelemetry attributes
must be resolved when recording. The proposed dimension set is deliberately small and bounded to keep this overhead
predictable.

The implementation includes focused test coverage for both metric export paths:

- Prometheus samples are validated through `CollectorRegistry.defaultRegistry.getSampleValue(...)`.
- OpenTelemetry samples are validated through the broker OpenTelemetry metric reader.

# Links

* Mailing List discussion thread:
* Mailing List voting thread: