fix(cert-manager): await profile slug resolution before certificate issuance by jdoss · Pull Request #175 · Infisical/cli

jdoss · 2026-04-10T16:39:09Z

Summary

Fix a race condition where the certificate issuance POST fires before profile slug-to-UUID resolution completes, sending the raw slug as profileId and getting a 422.
Merge the slug resolution and certificate monitoring goroutines into a single sequential flow (resolve first, then monitor) in both the cert-manager agent and regular agent commands.
Reuse the existing lifecycle.max-failure-retries and lifecycle.failure-retry-interval config for initial issuance retries (previously used only for renewal). With the default max-failure-retries: 0, the agent retries indefinitely with exponential backoff capped at 5 minutes.

Bug

The cert-manager agent had a race condition where POST /api/v1/cert-manager/certificates fired before the GET /api/v1/cert-manager/certificate-profiles/slug/{slug} response returned. This caused the agent to send the profile-name slug string (e.g. "crdb") as the profileId field instead of the resolved UUID, resulting in a 422:

{"validation":"uuid","code":"invalid_string","message":"Invalid uuid","path":["profileId"]}

Evidence from server-side logs

The Infisical server received these requests from the same client, 75ms apart:

Timestamp (ms)	Request ID	Request
1775781765199	req-LiX9p3LQCBiYXr	GET /api/v1/cert-manager/certificate-profiles/slug/crdb
1775781765274	req-u181LmbJP98hwB	POST /api/v1/cert-manager/certificates (profileId="crdb")
1775781765293	req-u181LmbJP98hwB	422 response — Invalid uuid

The slug lookup hadn't returned yet when the cert issuance request was sent.
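The ordering the fix enforces can be sketched as follows. This is a minimal illustration, not the agent's actual code: `certConfig`, `resolveProfileID`, `issueCertificate`, and `resolveThenIssue` are hypothetical names, the API calls are replaced by in-memory stand-ins, and the UUID is a made-up placeholder.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// certConfig is a hypothetical stand-in for one certificate entry in the agent config.
type certConfig struct {
	ProfileName string // slug from the config, e.g. "crdb"
	ProfileID   string // must hold a UUID by the time issuance runs
}

// looksLikeUUID does a rough shape check (36 chars, dashes at the usual positions).
func looksLikeUUID(s string) bool {
	if len(s) != 36 {
		return false
	}
	for _, i := range []int{8, 13, 18, 23} {
		if s[i] != '-' {
			return false
		}
	}
	return true
}

// resolveProfileID simulates the slug lookup GET; the sleep models a slow API.
func resolveProfileID(c *certConfig) error {
	time.Sleep(20 * time.Millisecond)
	known := map[string]string{"crdb": "123e4567-e89b-12d3-a456-426614174000"}
	id, ok := known[c.ProfileName]
	if !ok {
		return fmt.Errorf("unknown profile slug %q", c.ProfileName)
	}
	c.ProfileID = id
	return nil
}

// issueCertificate simulates the issuance POST; a non-UUID profileId is
// rejected, standing in for the server's 422.
func issueCertificate(c *certConfig) error {
	if !looksLikeUUID(c.ProfileID) {
		return errors.New("422: Invalid uuid")
	}
	return nil
}

// resolveThenIssue runs both steps in order, so issuance can never observe
// an unresolved slug.
func resolveThenIssue(c *certConfig) error {
	if err := resolveProfileID(c); err != nil {
		return fmt.Errorf("resolve profile: %w", err)
	}
	return issueCertificate(c)
}

func main() {
	c := &certConfig{ProfileName: "crdb", ProfileID: "crdb"}
	done := make(chan error, 1)
	// One goroutine, two sequential steps, instead of two independent goroutines.
	go func() { done <- resolveThenIssue(c) }()
	fmt.Println("issuance error:", <-done)
}
```

Keeping both steps in one goroutine makes the ordering structural, so no extra synchronization between resolver and issuer is needed.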

Behavior change

The existing lifecycle.max-failure-retries and lifecycle.failure-retry-interval config fields now also apply to initial certificate issuance, not just renewal. Previously, a failed initial issuance was never retried (the agent was permanently broken until restarted). Now:

max-failure-retries: 0 (default) → retry indefinitely with exponential backoff
max-failure-retries: N → retry up to N times then stop
failure-retry-interval → base delay for exponential backoff (default 2s; the backoff delay is capped at 5m)

A follow-up docs PR can be opened to document this behavior change.

Test plan

TestResolveCertificateNameReferences — verifies slug resolution populates ProfileID with the UUID
TestConcurrentIssuanceBlocksOnSlugResolution — fires issuance concurrently with a delayed slug resolution, verifies the server rejects the early POST with unresolved ProfileID, proving the ordering guarantee matters
TestResolveCertificateNameReferences_MultipleProfiles — verifies resolution works for multiple certificates with different profiles
Full project builds cleanly (go build ./...)

…ssuance

The cert-manager agent had a race condition where two independent goroutines started after authentication: one resolving profile-name slugs to UUIDs via the API, and another (MonitorCertificates) issuing certificates using those UUIDs. Under slow API responses, the issuance POST fired before the slug resolution GET returned, sending the raw slug string as profileId instead of the resolved UUID, causing a 422.

Fix by merging both goroutines into one: resolve slugs first, then start certificate monitoring. This affects both the cert-manager agent command and the regular agent command, which had the same pattern.

Also replace the sync.Once-based initial issuance with a retry loop (3 attempts, exponential backoff). The sync.Once guaranteed exactly-one execution regardless of outcome, so a failed first issuance was never retried, leaving the agent permanently broken until restarted.

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

x032205 · 2026-04-10T17:59:17Z

@claude review

jdoss · 2026-04-10T18:01:29Z

@x032205 I will fix up the tests.

jdoss · 2026-04-10T18:05:33Z

Ahhh nevermind, it seems PRs from forks don't get these vars.

- Fix misleading log: "will retry on next renewal check" was wrong because failed certs (status="failed") are skipped by CheckCertificateRenewals. Removed the false promise.
- Reuse existing lifecycle config (max-failure-retries, failure-retry-interval) for initial issuance retries instead of hardcoded constants. Default max-failure-retries=0 means retry indefinitely with exponential backoff (base delay from config, capped at 5 minutes).
- Replace require.Equal with assert.Equal inside httptest handler goroutines to avoid runtime.Goexit masking real assertion failures.
- Rewrite race condition test to actually exercise concurrent goroutine ordering: fires issuance concurrently with a delayed slug resolution and verifies the server rejects the early POST.

jdoss · 2026-04-10T18:18:53Z

@x032205 OK I figured out that the initial cert issuance wasn't using the retry logic for renewal, so I wired it up to reuse stuff like

  lifecycle:
    renew-before-expiry: "24h"
    status-check-interval: "5m"
    max-failure-retries: 0
    failure-retry-interval: "5s"

and I fixed the review notes from Claude bot.

claude Bot reviewed Apr 10, 2026

Comment thread packages/cmd/agent.go Outdated

Comment thread packages/cmd/agent_cert_resolution_test.go

Comment thread packages/cmd/agent_cert_resolution_test.go

jdoss wants to merge 2 commits into Infisical:main from jdoss:fix/cert-manager-slug-resolution-race
