From c160138818fca39189d0a98c0c8401b01969c695 Mon Sep 17 00:00:00 2001
From: mattisonchao <mattisonchao@gmail.com>
Date: Sun, 26 Apr 2026 11:29:13 +0800
Subject: [PATCH 1/3] [improve][pip] PIP-471: Authorization operation metrics

---
 pip/pip-471.md | 249 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 249 insertions(+)
 create mode 100644 pip/pip-471.md

diff --git a/pip/pip-471.md b/pip/pip-471.md
new file mode 100644
index 0000000000000..93e0d061d70bb
--- /dev/null
+++ b/pip/pip-471.md
@@ -0,0 +1,249 @@
+# PIP-471: Authorization Operation Metrics
+
+# Background knowledge
+
+Pulsar brokers perform authorization checks before allowing clients, proxies, and administrative callers to access
+topics, namespaces, tenants, brokers, clusters, and policy operations. These checks are handled through the broker-side
+`AuthorizationService`, which delegates decisions to the configured `AuthorizationProvider`.
+
+Pulsar already exposes security-related metrics, especially around authentication. These metrics help operators detect
+login failures, unhealthy clients, and changes in access patterns. However, Pulsar does not expose a generic broker-level
+metric stream for authorization outcomes. Authorization denials are mostly visible through request failures and logs,
+which makes them harder to alert on and harder to compare with successful authorization traffic.
+
+Pulsar also supports both Prometheus-compatible metrics and OpenTelemetry metrics. New broker observability features
+should keep those pipelines aligned when possible, so operators can consume equivalent signals regardless of their
+metrics backend.
+
+# Motivation
+
+Operators need a low-cardinality, broker-native signal that shows whether authorization checks are succeeding or failing.
+This is useful for security alerting, baseline monitoring, and compliance-oriented reporting.
+
+Without a dedicated authorization metric, operators have to infer authorization denials from logs, HTTP status codes, or
+client-side errors. That is brittle and does not support standard monitoring patterns such as:
+
+- Alerting on spikes in authorization failures.
+- Comparing authorization failures against successful authorizations.
+- Distinguishing authentication failures from authorization failures.
+- Building dashboards by authorization resource category.
+- Exporting equivalent authorization signals through both Prometheus and OpenTelemetry.
+
+A failure-only metric is also not sufficient. Operators often need success and failure counts together to understand
+whether a denial spike reflects an attack, a rollout issue, a policy mistake, or a normal traffic shift.
+
+# Goals
+
+## In Scope
+
+- Add a low-cardinality broker authorization metric for operation outcomes.
+- Record both successful and failed authorization decisions.
+- Expose the metric through the Prometheus-compatible broker metrics endpoint.
+- Expose the same metric through OpenTelemetry.
+- Centralize instrumentation in `AuthorizationService` so broker authorization paths share the same metric model.
+- Avoid identity-bearing or high-cardinality metric dimensions.
+
+## Out of Scope
+
+- Per-role, per-topic, per-tenant, per-namespace, or per-principal labels.
+- Audit-log payloads or structured security event streams.
+- New authorization APIs or binary protocol changes.
+- Alert rule definitions for downstream monitoring stacks.
+- Configuration to enable or disable this specific metric.
+
+# High Level Design
+
+Introduce a generic authorization operation counter that is incremented when the broker finishes an authorization
+decision.
+
+The metric is recorded centrally in `AuthorizationService`, which is the broker-side entry point for authorization checks
+across topic, namespace, tenant, broker, cluster, and policy operations. Each authorization check emits one result with a
+small, fixed dimension set:
+
+- the resource type that was checked
+- the operation that was requested
+- whether the result was a success or failure
+
+This metric is exported in two equivalent forms:
+
+- a Prometheus counter for the existing broker metrics endpoint
+- an OpenTelemetry `LongCounter` for OpenTelemetry metrics export
+
+Invalid original-principal combinations in proxied authorization flows are counted as authorization failures because the
+broker rejects the request during authorization handling.
+
+# Detailed Design
+
+## Design & Implementation Details
+
+This proposal introduces a broker authorization metrics helper that owns:
+
+- a Prometheus `Counter` for broker metrics scraping
+- an OpenTelemetry `LongCounter` for OpenTelemetry metrics export
+
+The helper is instantiated by `AuthorizationService`. `AuthorizationService` records a result after each completed
+authorization decision. If the provider returns `true`, the helper records a success. If the provider returns `false`,
+the helper records a failure. If `AuthorizationService` rejects a request before provider evaluation, such as an invalid
+original-principal combination for proxied requests, the helper records a failure directly.
+
+The instrumentation applies to the following authorization flows:
+
+- superuser checks
+- tenant-admin checks
+- tenant operations
+- broker operations
+- cluster operations
+- cluster policy operations
+- namespace operations
+- namespace policy operations
+- topic operations
+- topic policy operations
+
+The metric dimensions are intentionally bounded. The resource type is selected from a fixed set of resource categories.
+The operation is the normalized authorization operation name, such as a lower-case enum name. If an existing authorization
+path does not provide an operation value, the metric uses a fixed `unknown` operation value rather than failing the
+request path or introducing dynamic labels.
+
+The metric does not include role names, topic names, tenant names, namespace names, client addresses, provider names,
+exception classes, or error messages.
+
+## Public-facing Changes
+
+### Public API
+
+No public API changes.
+
+### Binary protocol
+
+No binary protocol changes.
+
+### Configuration
+
+No new configuration is required.
+
+### CLI
+
+No CLI changes.
+
+### Metrics
+
+Prometheus metric:
+
+| Field | Value |
+|---|---|
+| Full name | `pulsar_authorization_operations_total` |
+| Description | Pulsar authorization operations |
+| Type | Counter |
+| Labels | `resource_type`, `operation`, `result` |
+| Unit | operations |
+
+OpenTelemetry metric:
+
+| Field | Value |
+|---|---|
+| Full name | `pulsar.authorization.operation.count` |
+| Description | The number of authorization operations |
+| Type | `LongCounter` |
+| Attributes | `pulsar.authorization.resource.type`, `pulsar.authorization.operation`, `pulsar.authorization.result` |
+| Unit | `{operation}` |
+
+Result values:
+
+| Value | Meaning |
+|---|---|
+| `success` | The authorization request was allowed. |
+| `failure` | The authorization request was denied or rejected by authorization handling. |
+
+Resource type values:
+
+| Value | Meaning |
+|---|---|
+| `superuser` | Superuser authorization check. |
+| `tenant_admin` | Tenant-admin authorization check. |
+| `tenant` | Tenant operation authorization check. |
+| `broker` | Broker operation authorization check. |
+| `cluster` | Cluster operation authorization check. |
+| `cluster_policy` | Cluster policy operation authorization check. |
+| `namespace` | Namespace operation authorization check. |
+| `namespace_policy` | Namespace policy operation authorization check. |
+| `topic` | Topic operation authorization check. |
+| `topic_policy` | Topic policy operation authorization check. |
+
+Operation values are normalized authorization operation names. Examples include `produce`, `consume`, `lookup`,
+`packages`, and `read`. Existing authorization paths that do not provide a concrete operation value use `unknown`.
+
+# Monitoring
+
+Operators should monitor both absolute authorization failures and the relationship between failures and successes.
+Recommended patterns include:
+
+- Alert on sustained increases in `result="failure"`.
+- Build dashboards that show `success` and `failure` together by `resource_type`.
+- Investigate rollout regressions by comparing failure rates before and after authorization policy changes.
+- Correlate authorization failures with authentication metrics to distinguish authentication incidents from
+  authorization incidents.
+
+This proposal enables ratio-based alerting because success and failure outcomes are reported in the same metric family.
+
+# Security Considerations
+
+This proposal improves security observability but does not change authorization semantics.
+
+Authorization decisions can be high volume and may involve sensitive identifiers. The metric therefore avoids
+identity-bearing labels and attributes. It does not include roles, principals, topics, namespaces, tenants, client
+addresses, or error messages. This keeps the metric useful for operations without turning it into an audit-log substitute
+or a high-cardinality data leak.
+
+Failed proxy original-principal validation is counted as an authorization failure because the broker rejects the request
+during authorization handling.
+
+# Backward & Forward Compatibility
+
+## Upgrade
+
+No special upgrade action is required. The new metrics appear automatically after upgrading brokers that include this
+feature.
+
+Monitoring systems should treat these as new metric series. Existing metrics and authorization behavior are unchanged.
+
+## Downgrade / Rollback
+
+Downgrading removes the new metrics. Monitoring systems should tolerate missing-series behavior during rollback.
+
+## Pulsar Geo-Replication Upgrade & Downgrade/Rollback Considerations
+
+No geo-replication protocol, metadata, or wire compatibility changes are introduced.
+
+# Alternatives
+
+- Failure-only counter:
+  Rejected because operators often need both success and failure counts to interpret changes correctly and to build
+  ratio-based alerts.
+
+- OpenTelemetry-only metric:
+  Rejected because Pulsar still exposes Prometheus-compatible broker metrics and many deployments rely on the broker
+  metrics endpoint.
+
+- Prometheus-only metric:
+  Rejected because Pulsar is adding OpenTelemetry support and new broker observability should keep equivalent OTel
+  signals where practical.
+
+- Detailed identity labels such as role, tenant, namespace, or topic:
+  Rejected due to cardinality and privacy concerns.
+
+- Instrument each authorization call site independently:
+  Rejected because it would be error-prone and would likely produce inconsistent semantics across broker paths.
+
+# General Notes
+
+This proposal is intentionally limited to broker metrics. It does not replace audit logging or structured security
+events.
+
+The metric dimensions add some per-recording overhead because Prometheus label children and OpenTelemetry attributes
+must be resolved when recording. The proposed dimension set is deliberately small and bounded to keep this overhead
+predictable.
+
+# Links
+
+* Mailing List discussion thread:
+* Mailing List voting thread:

From b7fd8bfd8a4783d66de95673c11ae7461d6c3b6c Mon Sep 17 00:00:00 2001
From: mattisonchao <mattisonchao@gmail.com>
Date: Sun, 26 Apr 2026 12:04:06 +0800
Subject: [PATCH 2/3] [improve][pip] Refine authorization metrics proposal

---
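Note on this revision (not part of the patch): the recording contract described for `AuthorizationMetrics` can be sketched roughly as below. This is a minimal illustration, assuming an in-memory map stands in for both the Prometheus `Counter` and the OpenTelemetry `LongCounter`. The names `AuthorizationMetricsSketch`, `operationName`, `count`, and `DemoTopicOperation` are hypothetical, and the `String` parameter types are an assumption because the proposal does not state the method signatures.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stand-in for the proposed helper. A map keyed by the bounded
// (resource_type, operation, result) triple replaces the real metric backends.
final class AuthorizationMetricsSketch {
    static final String RESULT_SUCCESS = "success";
    static final String RESULT_FAILURE = "failure";
    static final String OPERATION_UNKNOWN = "unknown";

    private final Map<List<String>, LongAdder> counts = new ConcurrentHashMap<>();

    void recordSuccess(String resourceType, String operation) {
        record(resourceType, operation, RESULT_SUCCESS);
    }

    void recordFailure(String resourceType, String operation) {
        record(resourceType, operation, RESULT_FAILURE);
    }

    // Lower-case enum name, or the fixed "unknown" value when no operation is available.
    static String operationName(Enum<?> operation) {
        return operation == null ? OPERATION_UNKNOWN : operation.name().toLowerCase(Locale.ROOT);
    }

    // Test/illustration accessor; returns 0 for a series that was never recorded.
    long count(String resourceType, String operation, String result) {
        LongAdder adder = counts.get(List.of(resourceType, operation, result));
        return adder == null ? 0L : adder.sum();
    }

    // Assumes resourceType is one of the fixed, non-null resource type constants.
    private void record(String resourceType, String operation, String result) {
        String op = operation == null ? OPERATION_UNKNOWN : operation;
        counts.computeIfAbsent(List.of(resourceType, op, result), key -> new LongAdder()).increment();
    }
}

// Illustrative enum only; the broker's real operation enums are not reproduced here.
enum DemoTopicOperation { PRODUCE, CONSUME, LOOKUP }
```

A caller would record `recordSuccess("topic", AuthorizationMetricsSketch.operationName(DemoTopicOperation.PRODUCE))` after an allowed produce check, and `recordFailure("superuser", "check")` after a denied superuser check. Because every key component comes from a fixed set, the number of series stays bounded regardless of how many roles or topics exist.
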
 pip/pip-471.md | 82 +++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 64 insertions(+), 18 deletions(-)

diff --git a/pip/pip-471.md b/pip/pip-471.md
index 93e0d061d70bb..8ab8580614f93 100644
--- a/pip/pip-471.md
+++ b/pip/pip-471.md
@@ -54,37 +54,72 @@ whether a denial spike reflects an attack, a rollout issue, a policy mistake, or
 # High Level Design
 
 Introduce a generic authorization operation counter that is incremented when the broker finishes an authorization
-decision.
+decision or rejects an authorization request before invoking the configured provider.
 
 The metric is recorded centrally in `AuthorizationService`, which is the broker-side entry point for authorization checks
-across topic, namespace, tenant, broker, cluster, and policy operations. Each authorization check emits one result with a
-small, fixed dimension set:
+across topic, namespace, tenant, broker, cluster, and policy operations. Each completed provider decision or direct
+authorization rejection emits one result with a small, fixed dimension set:
 
 - the resource type that was checked
 - the operation that was requested
 - whether the result was a success or failure
 
-This metric is exported in two equivalent forms:
+This metric is exported in two equivalent forms by the same helper class:
 
 - a Prometheus counter for the existing broker metrics endpoint
 - an OpenTelemetry `LongCounter` for OpenTelemetry metrics export
 
 Invalid original-principal combinations in proxied authorization flows are counted as authorization failures because the
-broker rejects the request during authorization handling.
+broker rejects the request during authorization handling. For valid proxied authorization flows, the broker evaluates
+both the proxy role and the original principal and records each completed decision, so one request can count twice.
 
 # Detailed Design
 
 ## Design & Implementation Details
 
-This proposal introduces a broker authorization metrics helper that owns:
+This proposal introduces `org.apache.pulsar.broker.authorization.metrics.AuthorizationMetrics`, a broker authorization
+metrics helper that owns:
 
 - a Prometheus `Counter` for broker metrics scraping
 - an OpenTelemetry `LongCounter` for OpenTelemetry metrics export
 
-The helper is instantiated by `AuthorizationService`. `AuthorizationService` records a result after each completed
-authorization decision. If the provider returns `true`, the helper records a success. If the provider returns `false`,
-the helper records a failure. If `AuthorizationService` rejects a request before provider evaluation, such as an invalid
-original-principal combination for proxied requests, the helper records a failure directly.
+The helper uses the following constants for metric names, instrumentation scope, label values, and OpenTelemetry
+attribute keys:
+
+| Constant | Value |
+|---|---|
+| `AUTHORIZATION_OPERATIONS_METRIC_NAME` | `pulsar_authorization_operations_total` |
+| `AUTHORIZATION_COUNTER_METRIC_NAME` | `pulsar.authorization.operation.count` |
+| `INSTRUMENTATION_SCOPE_NAME` | `org.apache.pulsar.authorization` |
+| `RESULT_SUCCESS` | `success` |
+| `RESULT_FAILURE` | `failure` |
+| `RESOURCE_TYPE_KEY` | `pulsar.authorization.resource.type` |
+| `OPERATION_KEY` | `pulsar.authorization.operation` |
+| `RESULT_KEY` | `pulsar.authorization.result` |
+
+`AuthorizationMetrics` registers a static Prometheus counter with labels `resource_type`, `operation`, and `result`.
+It also builds an OpenTelemetry `LongCounter` from the `OpenTelemetry` instance passed to the constructor.
+
+The helper exposes two recording methods:
+
+| Method | Behavior |
+|---|---|
+| `recordSuccess(resourceType, operation)` | Increments the Prometheus counter with `result="success"` and adds `1` to the OpenTelemetry counter with `pulsar.authorization.result="success"`. |
+| `recordFailure(resourceType, operation)` | Increments the Prometheus counter with `result="failure"` and adds `1` to the OpenTelemetry counter with `pulsar.authorization.result="failure"`. |
+
+`AuthorizationService` owns one `AuthorizationMetrics` instance. The existing `AuthorizationService` constructor remains
+available and delegates to a new constructor with `OpenTelemetry.noop()`. `BrokerService` constructs
+`AuthorizationService` with `pulsar.getOpenTelemetry().getOpenTelemetry()` so the OpenTelemetry counter is exported by
+the broker's OpenTelemetry pipeline.
+
+`AuthorizationService` records a result after each completed authorization decision. If the provider returns `true`, the
+helper records a success. If the provider returns `false`, the helper records a failure. If the provider future completes
+exceptionally, no success or failure metric is recorded because the provider did not produce an authorization decision.
+
+If `AuthorizationService` rejects a request before provider evaluation, such as an invalid original-principal combination
+for proxied requests, it records a failure directly and returns a completed `false` future. Existing
+authorization-disabled short-circuit behavior is preserved; operation methods that already return early when
+authorization is disabled do not emit this metric on that path.
 
 The instrumentation applies to the following authorization flows:
 
@@ -99,10 +134,10 @@ The instrumentation applies to the following authorization flows:
 - topic operations
 - topic policy operations
 
-The metric dimensions are intentionally bounded. The resource type is selected from a fixed set of resource categories.
-The operation is the normalized authorization operation name, such as a lower-case enum name. If an existing authorization
-path does not provide an operation value, the metric uses a fixed `unknown` operation value rather than failing the
-request path or introducing dynamic labels.
+The metric dimensions are intentionally bounded. The resource type is selected from a fixed set of constants in
+`AuthorizationMetrics`. The operation is `check` for superuser and tenant-admin checks. For enum-backed operations, the
+operation is the lower-case enum name. If an existing authorization path does not provide an operation value, the metric
+uses a fixed `unknown` operation value rather than failing the request path or introducing dynamic labels.
 
 The metric does not include role names, topic names, tenant names, namespace names, client addresses, provider names,
 exception classes, or error messages.
@@ -111,7 +146,7 @@ exception classes, or error messages.
 
 ### Public API
 
-No public API changes.
+No public client, admin, REST, or `AuthorizationProvider` API changes.
 
 ### Binary protocol
 
@@ -170,7 +205,8 @@ Resource type values:
 | `topic_policy` | Topic policy operation authorization check. |
 
 Operation values are normalized authorization operation names. Examples include `produce`, `consume`, `lookup`,
-`packages`, and `read`. Existing authorization paths that do not provide a concrete operation value use `unknown`.
+`packages`, and `read`. Superuser and tenant-admin checks use `check`. Existing authorization paths that do not provide
+a concrete operation value use `unknown`.
 
 # Monitoring
 
@@ -225,8 +261,8 @@ No geo-replication protocol, metadata, or wire compatibility changes are introdu
   metrics endpoint.
 
 - Prometheus-only metric:
-  Rejected because Pulsar is adding OpenTelemetry support and new broker observability should keep equivalent OTel
-  signals where practical.
+  Rejected because Pulsar is adding OpenTelemetry support and new broker observability should keep equivalent
+  OpenTelemetry signals where practical.
 
 - Detailed identity labels such as role, tenant, namespace, or topic:
   Rejected due to cardinality and privacy concerns.
@@ -234,6 +270,11 @@ No geo-replication protocol, metadata, or wire compatibility changes are introdu
 - Instrument each authorization call site independently:
   Rejected because it would be error-prone and would likely produce inconsistent semantics across broker paths.
 
+- Cache Prometheus label children or prebuild OpenTelemetry attributes for every resource type, operation, and result
+  combination:
+  Deferred because the initial implementation keeps the dimension set bounded and simple. This can be added later if
+  profiling shows metric recording overhead is significant on hot authorization paths.
+
 # General Notes
 
 This proposal is intentionally limited to broker metrics. It does not replace audit logging or structured security
@@ -243,6 +284,11 @@ The metric dimensions add some per-recording overhead because Prometheus label c
 must be resolved when recording. The proposed dimension set is deliberately small and bounded to keep this overhead
 predictable.
 
+The implementation includes focused test coverage for both metric export paths:
+
+- Prometheus samples are validated through `CollectorRegistry.defaultRegistry.getSampleValue(...)`.
+- OpenTelemetry samples are validated through the broker OpenTelemetry metric reader.
+
 # Links
 
 * Mailing List discussion thread:

From 766cf02f44aeba8d8d6cd135b982e87fde756d67 Mon Sep 17 00:00:00 2001
From: mattisonchao <mattisonchao@gmail.com>
Date: Sun, 26 Apr 2026 12:48:31 +0800
Subject: [PATCH 3/3] [improve][pip] Add authorization error metric result

---
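Note on this revision (not part of the patch): the three-way outcome mapping added here can be sketched as below. This is a minimal illustration, assuming the provider decision arrives as a `CompletableFuture<Boolean>` and that a `Consumer<String>` stands in for the `recordSuccess`, `recordFailure`, and `recordError` calls. The class `AuthorizationOutcomeSketch` and the method `instrument` are hypothetical names, and treating a `null` result as a failure is an assumption, not something the proposal specifies.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

// Hypothetical sketch of how one provider future maps to exactly one result label.
final class AuthorizationOutcomeSketch {
    static CompletableFuture<Boolean> instrument(CompletableFuture<Boolean> decision,
                                                 Consumer<String> recorder) {
        // whenComplete observes the outcome without changing it for the caller.
        return decision.whenComplete((allowed, error) -> {
            if (error != null) {
                // The provider failed before returning an allow/deny decision.
                recorder.accept("error");
            } else if (Boolean.TRUE.equals(allowed)) {
                recorder.accept("success");
            } else {
                // false, or (by assumption) a null result, counts as a denial.
                recorder.accept("failure");
            }
        });
    }
}
```

Keeping `error` separate from `failure` means a provider outage shows up as its own series instead of inflating the denial count, which is what allows the alerting patterns in the Monitoring section to distinguish the two.
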
 pip/pip-471.md | 29 ++++++++++++++++++-----------
 1 file changed, 18 insertions(+), 11 deletions(-)

diff --git a/pip/pip-471.md b/pip/pip-471.md
index 8ab8580614f93..46278866de2ce 100644
--- a/pip/pip-471.md
+++ b/pip/pip-471.md
@@ -29,15 +29,16 @@ client-side errors. That is brittle and does not support standard monitoring pat
 - Building dashboards by authorization resource category.
 - Exporting equivalent authorization signals through both Prometheus and OpenTelemetry.
 
-A failure-only metric is also not sufficient. Operators often need success and failure counts together to understand
-whether a denial spike reflects an attack, a rollout issue, a policy mistake, or a normal traffic shift.
+A failure-only metric is also not sufficient. Operators often need success, failure, and error counts together to
+understand whether a denial spike reflects an attack, a rollout issue, a policy mistake, an authorization provider
+problem, or a normal traffic shift.
 
 # Goals
 
 ## In Scope
 
 - Add a low-cardinality broker authorization metric for operation outcomes.
-- Record both successful and failed authorization decisions.
+- Record successful, failed, and errored authorization operations.
 - Expose the metric through the Prometheus-compatible broker metrics endpoint.
 - Expose the same metric through OpenTelemetry.
 - Centralize instrumentation in `AuthorizationService` so broker authorization paths share the same metric model.
@@ -62,7 +63,7 @@ authorization rejection emits one result with a small, fixed dimension set:
 
 - the resource type that was checked
 - the operation that was requested
-- whether the result was a success or failure
+- whether the result was a success, failure, or error
 
 This metric is exported in two equivalent forms by the same helper class:
 
@@ -93,6 +94,7 @@ attribute keys:
 | `INSTRUMENTATION_SCOPE_NAME` | `org.apache.pulsar.authorization` |
 | `RESULT_SUCCESS` | `success` |
 | `RESULT_FAILURE` | `failure` |
+| `RESULT_ERROR` | `error` |
 | `RESOURCE_TYPE_KEY` | `pulsar.authorization.resource.type` |
 | `OPERATION_KEY` | `pulsar.authorization.operation` |
 | `RESULT_KEY` | `pulsar.authorization.result` |
@@ -100,21 +102,22 @@ attribute keys:
 `AuthorizationMetrics` registers a static Prometheus counter with labels `resource_type`, `operation`, and `result`.
 It also builds an OpenTelemetry `LongCounter` from the `OpenTelemetry` instance passed to the constructor.
 
-The helper exposes two recording methods:
+The helper exposes three recording methods:
 
 | Method | Behavior |
 |---|---|
 | `recordSuccess(resourceType, operation)` | Increments the Prometheus counter with `result="success"` and adds `1` to the OpenTelemetry counter with `pulsar.authorization.result="success"`. |
 | `recordFailure(resourceType, operation)` | Increments the Prometheus counter with `result="failure"` and adds `1` to the OpenTelemetry counter with `pulsar.authorization.result="failure"`. |
+| `recordError(resourceType, operation)` | Increments the Prometheus counter with `result="error"` and adds `1` to the OpenTelemetry counter with `pulsar.authorization.result="error"`. |
 
 `AuthorizationService` owns one `AuthorizationMetrics` instance. The existing `AuthorizationService` constructor remains
 available and delegates to a new constructor with `OpenTelemetry.noop()`. `BrokerService` constructs
 `AuthorizationService` with `pulsar.getOpenTelemetry().getOpenTelemetry()` so the OpenTelemetry counter is exported by
 the broker's OpenTelemetry pipeline.
 
-`AuthorizationService` records a result after each completed authorization decision. If the provider returns `true`, the
+`AuthorizationService` records a result after each completed authorization operation. If the provider returns `true`, the
 helper records a success. If the provider returns `false`, the helper records a failure. If the provider future completes
-exceptionally, no success or failure metric is recorded because the provider did not produce an authorization decision.
+exceptionally, the helper records an error because authorization evaluation failed before a boolean decision was returned.
 
 If `AuthorizationService` rejects a request before provider evaluation, such as an invalid original-principal combination
 for proxied requests, it records a failure directly and returns a completed `false` future. Existing
@@ -188,6 +191,7 @@ Result values:
 |---|---|
 | `success` | The authorization request was allowed. |
 | `failure` | The authorization request was denied or rejected by authorization handling. |
+| `error` | Authorization evaluation failed before an allow/deny decision was returned. |
 
 Resource type values:
 
@@ -210,16 +214,19 @@ a concrete operation value use `unknown`.
 
 # Monitoring
 
-Operators should monitor both absolute authorization failures and the relationship between failures and successes.
+Operators should monitor the absolute counts of authorization failures and errors, as well as the ratio of failures
+to successes.
 Recommended patterns include:
 
 - Alert on sustained increases in `result="failure"`.
-- Build dashboards that show `success` and `failure` together by `resource_type`.
+- Alert on sustained increases in `result="error"`, which can indicate authorization provider failures or outages.
+- Build dashboards that show `success`, `failure`, and `error` together by `resource_type`.
 - Investigate rollout regressions by comparing failure rates before and after authorization policy changes.
 - Correlate authorization failures with authentication metrics to distinguish authentication incidents from
   authorization incidents.
 
-This proposal enables ratio-based alerting because success and failure outcomes are reported in the same metric family.
+This proposal enables ratio-based alerting because success, failure, and error outcomes are reported in the same metric
+family.
 
 # Security Considerations
 
@@ -253,7 +260,7 @@ No geo-replication protocol, metadata, or wire compatibility changes are introdu
 # Alternatives
 
 - Failure-only counter:
-  Rejected because operators often need both success and failure counts to interpret changes correctly and to build
+  Rejected because operators often need success, failure, and error counts to interpret changes correctly and to build
   ratio-based alerts.
 
 - OpenTelemetry-only metric: