Skip to content

feat(analytics): HMAC bearer auth #35776

Open
swicken wants to merge 18 commits into
mainfrom
feat/analytics-hmac-auth
Open

feat(analytics): HMAC bearer auth #35776
swicken wants to merge 18 commits into
mainfrom
feat/analytics-hmac-auth

Conversation

@swicken
Copy link
Copy Markdown
Member

@swicken swicken commented May 20, 2026

Closes #35775
Sub-issue of #35048 (Phase 3 — Security)

Summary

Replaces Basic auth on the dotCMS-side analytics proxy with per-tenant HMAC bearer tokens, adds a save-flow that exchanges admin credentials for a token (clearing the password from the app config after the exchange), tightens the proxy/health surface, and enforces per-site permissions on the GET read endpoint.

What changes

Auth model

  • Drop DOT_ANALYTICS_PASSWORD entirely. The proxy now uses bearer tokens sourced from the per-site Content Analytics App secret, with DOT_ANALYTICS_BEARER_TOKEN as a bootstrap fallback for non-site-bound contexts.
  • New ContentAnalyticsAppListener: when an admin saves the Content Analytics App with adminPassword populated, the listener calls POST {DOT_ANALYTICS_BASE_URL}/v1/admin/token (Basic auth: tenant:password), stores the returned bearerToken as a hidden secret on the app config, and clears the adminPassword field. The user-entered password is written to encrypted app-secret storage by the original save and then removed automatically once the token exchange completes; on a failed exchange (or a failed token-persist) the listener still attempts to clear the password as a best-effort cleanup.
  • The save flow re-fires AppSecretSavedEvent; the listener short-circuits via the "admin password not set" guard.
  • App YAML grows new fields (adminPassword, bearerToken) with operator hints explaining the one-time exchange.

Proxy hardening (EventAnalyticsProxyHelper)

  • Path allowlist: only event / event/... are forwarded. Path segments exactly equal to .. or ., encoded ?/#, backslashes, and non-event prefixes are rejected with 400.
  • Hard timeouts: 10s proxy / 2s health probe — prevents a slow event manager from exhausting the Jersey thread pool.
  • Caller project= wins over DOT_ANALYTICS_PROJECT; the configured value is no longer appended when the URL already has one.
  • RFC-6750-compliant buildAuthHeader returns Optional.empty() when no token is available; the proxy then omits Authorization instead of emitting a malformed Bearer (no token).
  • POST ingest: blank-body / literal-null JSON guard returns 400 with a clear error envelope. Empty-body POSTs drop Content-Type per RFC 9110 §8.3.

Per-site permission gating (EventAnalyticsProxyResource.proxyGetRequest)

  • Site is resolved as the logged-in user via HostAPI.find(siteId, user, DONT_RESPECT_FRONT_END_ROLES) for the explicit ?siteId= branch and HostWebAPI.getCurrentHost(request, user) for the session fallback, so PermissionAPI.READ gates access.
  • A backend user without READ on the resolved site gets 403 SITE_ACCESS_DENIED, even if they hand-craft ?siteId= to bypass the (already-permissioned) site picker.
  • The 403 path calls SecurityLogger.logInfo(...) to mirror what the global DotForbiddenExceptionMapper would log — keeps denied-site accesses visible in the audit trail.

Health monitor (MonitorHelper.isContentAnalytics)

  • Drops the DOT_ANALYTICS_PASSWORD requirement.
  • Null-safe site probe: monitor pings that land on the System Host (no current host in session) no longer NPE inside getAppSecrets.
  • Tightened /v1/health timeout via the same 2s ceiling so monitor probes fail fast.

Tests

  • ContentAnalyticsAppListenerTest: failure paths (missing tenant / missing base URL clear the password and notify the user) plus a direct happy-path test of persistTokenAndClearCredentials that asserts the atomic invariant — adminPassword absent, bearerToken replaced with the new value, unrelated secrets (siteAuth) preserved verbatim. Method relaxed to package-private specifically for this test.
  • EventAnalyticsProxyHelperTest: path allowlist matrix (traversal, smuggled ?/#, non-event prefixes), buildUpstreamUrl (trailing-slash strip, project dedup, query forwarding), and buildAuthHeader covering the four security-critical branches (per-site wins, env fallback with trim, null host → fallback, both absent → empty Optional — the RFC 6750 branch).

Files changed

  • dotCMS/src/main/java/com/dotcms/analytics/listener/ContentAnalyticsAppListener.java (new)
  • dotCMS/src/main/java/com/dotcms/rest/api/v1/analytics/event/EventAnalyticsProxyHelper.java
  • dotCMS/src/main/java/com/dotcms/rest/api/v1/analytics/event/EventAnalyticsProxyResource.java
  • dotCMS/src/main/java/com/dotcms/rest/api/v1/analytics/content/util/ContentAnalyticsUtil.java
  • dotCMS/src/main/java/com/dotcms/rest/api/v1/system/monitor/MonitorHelper.java
  • dotCMS/src/main/java/com/dotcms/system/event/local/business/LocalSystemEventSubscribersInitializer.java
  • dotCMS/src/main/resources/apps/dotContentAnalytics-config.yml
  • dotCMS/src/test/java/com/dotcms/analytics/listener/ContentAnalyticsAppListenerTest.java (new)
  • dotCMS/src/test/java/com/dotcms/rest/api/v1/analytics/event/EventAnalyticsProxyHelperTest.java (new)
  • justfile (import optional justfile.local)

Operator notes

  • DOT_ANALYTICS_PASSWORD is no longer read. Remove it from secrets/Terraform/k8s configs after merge.
  • DOT_ANALYTICS_TENANT is still required. Tenant must already exist on the upstream event manager; this PR exchanges credentials, it does not provision the tenant.
  • DOT_ANALYTICS_BEARER_TOKEN (optional) is a bootstrap fallback for non-site-bound contexts; production deployments should use the per-site App save flow instead.

Behavioral changes worth flagging to consumers

  • GET catch-all now returns 403 SITE_ACCESS_DENIED for backend users without READ on the resolved site. The dashboard already passes ?siteId= and the global site picker is already permissioned, so this only affects callers that hand-craft a siteId.
  • Empty-body POSTs to /api/v1/analytics/content/event now return 400 with "Request body is required". Postman tests that asserted on status alone still pass; tests that asserted on message text may need updates.
  • Health check semantics: NOT_CONFIGURED requires either global env vars OR per-site App secrets, no longer both.
  • Proxy hard timeout is 10s; previously inherited the CircuitBreakerUrl default.

Test plan

  • ./mvnw test -pl :dotcms-core -Dtest=ContentAnalyticsAppListenerTest,EventAnalyticsProxyHelperTest
  • Regenerate openapi.yaml: ./mvnw compile -pl :dotcms-core -DskipTests — verify the GET catch-all gains 400/403 entries and the POST endpoint description matches reality
  • Manual: save the Content Analytics App on a test site with a valid adminPassword; confirm bearerToken is populated and adminPassword is cleared in a single round-trip
  • Manual: save the App with a wrong password; confirm the password is still cleared (no silent retention) and a system message surfaces to the user
  • Manual: as a non-admin backend user with READ on Site A only, hit /api/v1/analytics/event/total-events?siteId=<Site B id> and confirm 403 SITE_ACCESS_DENIED
  • Manual: confirm /api/v1/analytics/event/... against an unconfigured deployment fails fast (timeout) rather than hanging

Scott Work and others added 16 commits May 19, 2026 12:54
…CS_PASSWORD

HMAC bearer is the only auth scheme. Token sourced from per-site app
secrets via the save-flow exchange; DOT_ANALYTICS_BEARER_TOKEN env var
kept as a bootstrap fallback for CI/smoke contexts. Basic-auth path and
DOT_ANALYTICS_AUTH_MODE switch removed; MonitorHelper no longer requires
DOT_ANALYTICS_PASSWORD.
If the credential exchange fails (event-manager unreachable, wrong
password, missing tenant/base-url config), the password was being left
in the app config. The YAML hints explicitly promise it is never
persisted, so clear it on every failure path. User sees an error
message and re-enters the password to retry.
Adds an INFO block above the bearerToken field telling site admins it's
managed by the save flow and must not be edited. Apps form framework
doesn't support read-only STRING fields, so the field stays hidden+STRING
to survive the mask-preservation path on unrelated saves — the social
signal compensates for the lack of a UI guard.
Replaces the stub test (empty body + tautological string compare) with
JUnit 5 + Mockito 5 MockedStatic coverage of the listener's critical
behaviors:

- null event is a no-op
- missing host identifier short-circuits before any HostAPI call
- re-entrancy guard skips when adminPassword is absent
- missing DOT_ANALYTICS_TENANT clears the password and notifies the user
- missing DOT_ANALYTICS_BASE_URL clears the password and notifies the user
- getKey() returns CONTENT_ANALYTICS_APP_KEY

Listener constructed via reflection on the (HostAPI) constructor to
avoid the APILocator singleton chain. HTTP success path remains an
integration-test concern (mocking the full CircuitBreakerUrl builder
chain isn't worth the complexity here).
… bearer

Addresses three review findings on the catch-all event-manager proxy:

- Whitelist relative paths to the event/* surface area; reject anything
  containing .. or \ segments. Prevents an authenticated backend user
  from probing upstream admin endpoints via path traversal.
- Set a 10s upstream timeout to keep a slow event manager from exhausting
  the Jersey thread pool and degrading the admin portal.
- Return Optional<String> from buildAuthHeader and omit the Authorization
  header entirely when no token is resolved, instead of sending the
  malformed 'Bearer ' (empty token) value that RFC 6750 forbids.

Adds unit coverage for isAllowedRelativePath including traversal,
non-event prefixes, null/blank input.
…conflict

Addresses four findings from the second-round review:

- Make ContentAnalyticsAppListener(HostAPI) package-private and drop the
  reflection helper in the test — same package, no need to bypass access.
- Simplify ALLOWED_PATH_PREFIXES to ['event'] (the '/' variant was
  redundant given the equals + startsWith(prefix + '/') matching logic).
- proxyEventRequest now short-circuits on null/blank body and also
  catches IllegalArgumentException alongside IOException, so malformed
  payloads produce a 400 with a clear message instead of a raw 500.
- Drop 'envvar: DOT_ANALYTICS_BEARER_TOKEN' from the bearerToken field
  in the App YAML to remove the double-binding with the helper's
  Config.getStringProperty fallback. The env var stays as a code-level
  bootstrap fallback only.

Deferred (per design / scope):
- Per-site tenant in App config: deliberate design choice, tenant stays
  global via DOT_ANALYTICS_TENANT.
- Async-save plaintext window: needs Apps framework pre-save hook,
  outside this branch's scope.
requiredBackendUser(true) lets any logged-in backend user (content
editor, translator, etc.) query raw analytics across all sites. Add
requireAdmin(true) so only CMS administrators reach the read proxy.
Event POST path is unchanged — it uses siteAuth gating, not backend-user
gating.
- ContentAnalyticsAppListener.notify now null-checks both getAppSecrets()
  and getSecrets() before reading, so a malformed AppSecretSavedEvent
  exits cleanly instead of NPE-ing through the subscriber chain.
- buildAuthHeader trims whitespace from both the per-site bearer token
  and the env-var fallback before emitting the Authorization header.
  Defends against copy-paste artifacts (trailing newlines, leading
  spaces) that would produce an RFC 7230-illegal header value.
…esolution, health

- Only set Content-Type on POST upstream calls (RFC 9110 §8.3): GETs without a body
  shouldn't carry a Content-Type, some gateways reject the request.
- Guard exchangeToken() against a null CircuitBreakerUrl.Response (returned when
  setThrowWhenError(false) is paired with a circuit-open / transport-level failure).
  Avoids a silent NPE that swallowed the real diagnostic.
- notifyError() now logs and returns when userId is blank instead of pushing a
  null-userId notification that would NPE in the subscriber chain and block
  downstream AppSecretSavedEvent listeners.
- proxyGetRequest() now resolves the target site from an explicit ?siteId= query
  param first, falling back to the session-bound host only when absent. Eliminates
  the multi-tab race where two admin tabs targeting different sites could leak
  data across tenant scopes via the shared session host, and unblocks stateless
  API clients.
- isContentAnalytics() health check no longer returns NOT_CONFIGURED when the
  probe lands on the System Host (typical for IP/LB-based monitoring probes).
  Global env-var config (BASE_URL + TENANT + PROJECT) is now the baseline; the
  reachability probe runs whenever the subsystem is globally configured, so
  outages of the Event Manager surface as CONFIGURATION_ERROR to monitoring.
Self-review walked back four review-driven additions that added noise without
preventing a real failure mode:

- Drop the appSecrets / getSecrets() null guard in notify(). The framework's
  constructor invariant guarantees both are non-null on dispatch; the guard
  was cargo-culted from a theoretical reviewer concern.
- Drop the containsKey(ADMIN_PASSWORD_KEY) early-return in clearAdminPassword.
  All call sites are reached only after notify()'s 'password is set' guard, so
  the inner check is dead by construction.
- Drop the trim() on the per-site bearer token. The token is produced by the
  upstream /v1/admin/token endpoint and JSON-parsed via JSONObject.optString;
  whitespace cannot land in it. The env-var fallback trim (real copy-paste
  risk) is kept.
- Drop the dedicated info_token_managed INFO block in the YAML. The field's
  own hint already says 'Managed by the save flow above. Leave the masked
  value alone.' The separate block was the same message restated.
…save-failure notify, drop empty-POST Content-Type

- buildUpstreamUrl now only appends the configured DOT_ANALYTICS_PROJECT when
  the caller didn't already include a project= query parameter. Eliminates
  the duplicate-key upstream URL (?project=foo&project=bar) whose decoding
  behavior is implementation-defined.

- proxyGetRequest now returns 400 INVALID_SITE_ID when an explicit ?siteId=
  cannot be resolved, instead of silently falling back to the session host.
  Surprising-data-leak failure mode replaced with a clean error.

- persistTokenAndClearCredentials and clearAdminPassword now notify the user
  via SystemMessage on saveSecrets failure, instead of swallowing the error
  with a log-only handler. Doesn't fix the underlying async-window issue
  (Apps framework pre-save hook still pending), but at least the operator
  sees something went wrong instead of a silent 'Save Successful'.

- Drop Content-Type: application/json from exchangeToken's POST headers —
  the call has no body, so per RFC 9110 §8.3 the Content-Type is meaningless
  and can trip strict gateways.
…sts, javadoc

- MonitorHelper.isContentAnalytics now null-guards the resolved Host before
  passing it to ContentAnalyticsUtil.getAppSecrets. IP / load-balancer probes
  where WebAPILocator.getCurrentHostNoThrow returns null no longer hit the
  catch-block NPE in the (pre-existing) utility, which would otherwise log a
  stack trace on every probe.

- Add 5 unit tests for buildUpstreamUrl covering trailing-slash stripping,
  query-param forwarding, configured-project append, the new caller-supplied
  project dedup (round 5 #1), and the empty-query path.

- Correct misleading proxyEventRequest javadoc. The example showed POSTing to
  /content/event/total-events which doesn't exist (the @path is an exact
  match and the upstream is hard-coded to event/ingest). Replace with the
  actual ingest example and point at proxyGetRequest for the read endpoints.
…list gap

- healthCheck() now sets a 2s timeout. The method is pre-existing on main, but
  round-4 #6 expanded the conditions under which it runs (now on every monitor
  probe, including IP/LB probes that land on the System Host). Without a
  timeout, a hung event manager would block the monitor's calling thread.

- proxyEventRequest now guards against a Java null bodyMap. JSON parsers
  treat the literal token 'null' as a valid value that deserializes to a
  null Map, which then NPEs on get(). Existing checks only caught
  null/blank request bodies, not the 'null' literal.

- isAllowedRelativePath now rejects relative paths containing '?' or '#'.
  JAX-RS URL-decodes %3F / %23 in @PathParam values, so a request like
  /api/v1/analytics/event/foo%3Fbar passes the startsWith('event/') check
  but produces a malformed upstream URL (double-? on appended query params)
  or strips them entirely (#). Reviewer correctly distinguished this from
  the previously-rejected event?injected case (still rejected because it
  doesn't start with event/).

- Extract BEARER_TOKEN_KEY constant to ContentAnalyticsUtil so the listener
  and the helper reference a single source-of-truth for the app-secret key
  name. Removes a divergence vector if the key ever gets renamed.

- Add a unit test for the new allowlist rejection cases.
…eanup

GET proxy now resolves the requested site as the logged-in user instead
of systemUser, so PermissionAPI gates access. Dropped the requireAdmin
gate that was a temporary fix; a backend user without READ on the site
gets 403 SITE_ACCESS_DENIED, even if they hand-craft ?siteId= to bypass
the (already-permissioned) site picker. Session fallback switched to
getCurrentHost(request, user) for the same reason. 403 path mirrors
the global DotForbiddenExceptionMapper by calling SecurityLogger.logInfo
so audit trails stay intact.

Other fixes:
- Listener env-var constants promoted to import static from
  EventAnalyticsProxyHelper (matches round-7 BEARER_TOKEN_KEY pattern,
  matches MonitorHelper).
- Dead 4-arg EventAnalyticsProxyHelper.proxy() overload removed; both
  callers use the 5-arg variant.
- @Operation/@parameter on proxyEventRequest rewritten to match reality
  (JSON body validation, exact path, hard-coded event/ingest upstream).
- @ApiResponses on proxyGetRequest gained 400/403 entries.
- DONT_RESPECT_FRONT_END_ROLES used instead of literal false (matches
  sibling generateSiteKey usage in the same file).
- persistTokenAndClearCredentials relaxed to package-private to enable
  a direct happy-path test of the atomic password-clear + token-persist
  invariant; new test asserts adminPassword absent, bearerToken
  replaced, unrelated secrets preserved.
- buildAuthHeader gained five tests covering the security-critical
  branches: per-site wins over fallback, env fallback with trim, null
  host fallback, both-absent returns empty (no malformed "Bearer "
  per RFC 6750), blank site token falls through.
@claude

This comment was marked as resolved.

@github-actions github-actions Bot added the Area : Backend PR changes Java/Maven backend code label May 20, 2026
@swicken swicken changed the title feat(analytics): HMAC bearer auth + JIT tenant provisioning save flow feat(analytics): HMAC bearer auth May 20, 2026
- ContentAnalyticsAppListener.persistTokenAndClearCredentials.onFailure
  now calls clearAdminPassword as best-effort cleanup. Without this,
  a failure mid-token-persist would leave the user's adminPassword
  sitting in encrypted storage indefinitely (a re-save would short-
  circuit on the "password not set" guard). Notification text updated
  to drop the misleading "password may still be stored" claim.
- EventAnalyticsProxyHelper.isAllowedRelativePath switched from a
  blanket contains("..") to a segment-equality check for "." and "..".
  The previous form would have wrongly rejected legitimate future
  endpoint names containing dots (e.g. "event/last..7..days").
  Backslash + ?/# guards retained as defense-in-depth.
- Regenerate openapi.yaml — Swagger annotation edits in the last
  commit (@Operation/@Parameter/@ApiResponses) drifted the committed
  YAML; CI's "no uncommitted changes after build" gate caught it.
- New tests: failure-cleanup path on the listener, regression test
  on the helper confirming dots inside segments are now accepted.

/** Relative paths allowed by the catch-all proxy. Anything outside this list — including
* attempts to escape via {@code ..} segments — is rejected with 400. */
private static final List<String> ALLOWED_PATH_PREFIXES = List.of("event");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

really we support more than just event url path in the event manager

Copy link
Copy Markdown
Member Author

@swicken swicken May 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes sorry, meant to expand the list, this is done.

List.of("event", "conversion", "session", "health");

Am I missing any @freddyDOTCMS ? I want to keep the allowlist otherwise the proxy means any authenticated backend user can tunnel through the auth boundary.

- ALLOWED_PATH_PREFIXES expands to event/conversion/session/health to
  match the actual dot-ca-event-manager surface area. The prior
  event-only list would have 400'd every dashboard call to the
  conversion, session, and health endpoints. Allowlist itself kept
  (rather than dropped) so the proxy doesn't auto-expose admin/* or
  tenants/* with the tenant-scoped bearer.
- proxyGetRequest's site-resolution + load-bearing javadoc extracted
  into a private resolveSiteForUser(request, user) helper. The request
  handler is now just the try/catch + response-building.
- Helper tests: positive cases for each new prefix; old
  rejectsNonEventPrefixes renamed to rejectsDisallowedPrefixes with
  the security intent made explicit (admin/, tenants/) plus
  boundary-substring checks (healthcheck, conversionfoo).

Per Freddy's PR review comments.
Comment thread justfile
@@ -1,4 +1,5 @@
set positional-arguments := true
import? 'justfile.local'
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change intended?

Copy link
Copy Markdown
Member Author

@swicken swicken May 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I have a local justfile with things to ease my personal dev process and its an optional import so it won't impact others. Just committing it to main now so I don't have to re-add it on every branch. There are some things I plan to push into the main justfile but I am fleshing them out locally.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

AI: Safe To Rollback Area : Backend PR changes Java/Maven backend code

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

Wire HMAC bearer auth + JIT tenant provisioning save flow into dotCMS analytics proxy

3 participants