fix(cert-manager): await profile slug resolution before certificate issuance#175

Open
jdoss wants to merge 2 commits into Infisical:main from jdoss:fix/cert-manager-slug-resolution-race

jdoss commented Apr 10, 2026

Summary

  • Fix race condition where certificate issuance POST fires before profile slug-to-UUID resolution completes, sending the raw slug as profileId and getting a 422
  • Merge the slug resolution and certificate monitoring goroutines into a single sequential flow (resolve first, then monitor) in both the cert-manager agent and regular agent commands
  • Reuse the existing lifecycle.max-failure-retries and lifecycle.failure-retry-interval config for initial issuance retries (previously used only for renewal). With the default max-failure-retries: 0, the agent retries indefinitely with exponential backoff capped at 5 minutes

Bug

The cert-manager agent had a race condition where POST /api/v1/cert-manager/certificates fired before the GET /api/v1/cert-manager/certificate-profiles/slug/{slug} response returned. This caused the agent to send the profile-name slug string (e.g. "crdb") as the profileId field instead of the resolved UUID, resulting in a 422:

{"validation":"uuid","code":"invalid_string","message":"Invalid uuid","path":["profileId"]}

Evidence from server-side logs

The Infisical server received these requests from the same client, 75ms apart:

| Timestamp (ms) | Request ID | Request |
| --- | --- | --- |
| 1775781765199 | req-LiX9p3LQCBiYXr | GET /api/v1/cert-manager/certificate-profiles/slug/crdb |
| 1775781765274 | req-u181LmbJP98hwB | POST /api/v1/cert-manager/certificates (profileId="crdb") |
| 1775781765293 | req-u181LmbJP98hwB | 422 response (Invalid uuid) |

The slug lookup hadn't returned yet when the cert issuance request was sent.

Behavior change

The existing lifecycle.max-failure-retries and lifecycle.failure-retry-interval config fields now also apply to initial certificate issuance, not just renewal. Previously, a failed initial issuance was never retried (the agent was permanently broken until restarted). Now:

  • max-failure-retries: 0 (default) → retry indefinitely with exponential backoff
  • max-failure-retries: N → retry up to N times then stop
  • failure-retry-interval → base delay for exponential backoff (default 2s, capped at 5m)

A follow-up docs PR can be opened to document this behavior change.

Test plan

  • TestResolveCertificateNameReferences — verifies slug resolution populates ProfileID with the UUID
  • TestConcurrentIssuanceBlocksOnSlugResolution — fires issuance concurrently with a delayed slug resolution, verifies the server rejects the early POST with unresolved ProfileID, proving the ordering guarantee matters
  • TestResolveCertificateNameReferences_MultipleProfiles — verifies resolution works for multiple certificates with different profiles
  • Full project builds cleanly (go build ./...)

…ssuance

The cert-manager agent had a race condition where two independent
goroutines started after authentication: one resolving profile-name
slugs to UUIDs via the API, and another (MonitorCertificates) issuing
certificates using those UUIDs. Under slow API responses, the issuance
POST fired before the slug resolution GET returned, sending the raw
slug string as profileId instead of the resolved UUID, causing a 422.

Fix by merging both goroutines into one: resolve slugs first, then
start certificate monitoring. This affects both the cert-manager agent
command and the regular agent command, which had the same pattern.

Also replace the sync.Once-based initial issuance with a retry loop
(3 attempts, exponential backoff). The sync.Once guaranteed exactly-one
execution regardless of outcome, so a failed first issuance was never
retried, leaving the agent permanently broken until restarted.
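
The difference between the two patterns can be seen in a small standalone sketch. The names `onceIssuer`, `issueWithRetry`, and `flakyIssue` are hypothetical and do not come from the agent's code, and the backoff sleep between attempts is omitted to keep the example short.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// onceIssuer mimics the old pattern: sync.Once runs the function exactly
// once, whether it succeeds or fails, so later calls never try again.
type onceIssuer struct {
	once sync.Once
	err  error
}

func (o *onceIssuer) Issue(issue func() error) error {
	o.once.Do(func() { o.err = issue() })
	return o.err
}

// issueWithRetry calls issue until it succeeds or attempts run out.
// A real loop would sleep between attempts; that is left out here.
func issueWithRetry(attempts int, issue func() error) error {
	if attempts < 1 {
		return errors.New("attempts must be at least 1")
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = issue(); err == nil {
			return nil
		}
	}
	return err
}

// flakyIssue returns an issue function that fails its first failFirst calls,
// plus a pointer to its call counter.
func flakyIssue(failFirst int) (func() error, *int) {
	calls := 0
	return func() error {
		calls++
		if calls <= failFirst {
			return errors.New("issuance failed")
		}
		return nil
	}, &calls
}

func main() {
	issue, calls := flakyIssue(1)
	var o onceIssuer
	fmt.Println("once, first call:", o.Issue(issue))
	fmt.Println("once, second call:", o.Issue(issue), "calls made:", *calls)

	issue, calls = flakyIssue(1)
	fmt.Println("retry loop:", issueWithRetry(3, issue), "calls made:", *calls)
}
```

With `sync.Once`, the second call returns the stored error without invoking the issuance function again, which matches the "permanently broken until restarted" behavior described in the commit message.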

@claude claude bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

x032205 (Member) commented Apr 10, 2026

@claude review

jdoss (Author) commented Apr 10, 2026

@x032205 I will fix up the tests.

jdoss (Author) commented Apr 10, 2026

Ahhh nevermind, it seems PRs from forks don't get these vars.

- Fix misleading log: "will retry on next renewal check" was wrong
  because failed certs (status="failed") are skipped by
  CheckCertificateRenewals. Removed the false promise.

- Reuse existing lifecycle config (max-failure-retries,
  failure-retry-interval) for initial issuance retries instead of
  hardcoded constants. Default max-failure-retries=0 means retry
  indefinitely with exponential backoff (base delay from config,
  capped at 5 minutes).

- Replace require.Equal with assert.Equal inside httptest handler
  goroutines to avoid runtime.Goexit masking real assertion failures.

- Rewrite race condition test to actually exercise concurrent
  goroutine ordering: fires issuance concurrently with a delayed
  slug resolution and verifies the server rejects the early POST.

jdoss (Author) commented Apr 10, 2026

@x032205 OK, I figured out that the initial cert issuance wasn't using the retry logic that renewal uses, so I wired it up to reuse config like

  lifecycle:
    renew-before-expiry: "24h"
    status-check-interval: "5m"
    max-failure-retries: 0
    failure-retry-interval: "5s"

and I fixed the review notes from Claude bot.


2 participants