fix(cert-manager): await profile slug resolution before certificate issuance#175
Open
jdoss wants to merge 2 commits intoInfisical:mainfrom
Open
fix(cert-manager): await profile slug resolution before certificate issuance#175jdoss wants to merge 2 commits intoInfisical:mainfrom
jdoss wants to merge 2 commits intoInfisical:mainfrom
Conversation
…ssuance The cert-manager agent had a race condition where two independent goroutines started after authentication: one resolving profile-name slugs to UUIDs via the API, and another (MonitorCertificates) issuing certificates using those UUIDs. Under slow API responses, the issuance POST fired before the slug resolution GET returned, sending the raw slug string as profileId instead of the resolved UUID, causing a 422. Fix by merging both goroutines into one: resolve slugs first, then start certificate monitoring. This affects both the cert-manager agent command and the regular agent command, which had the same pattern. Also replace the sync.Once-based initial issuance with a retry loop (3 attempts, exponential backoff). The sync.Once guaranteed exactly-one execution regardless of outcome, so a failed first issuance was never retried, leaving the agent permanently broken until restarted.
Member
|
@claude review |
Author
|
@x032205 I will fix up the tests. |
Author
|
Ahhh nevermind, it seems PRs from forks don't get these vars. |
- Fix misleading log: "will retry on next renewal check" was wrong because failed certs (status="failed") are skipped by CheckCertificateRenewals. Removed the false promise. - Reuse existing lifecycle config (max-failure-retries, failure-retry-interval) for initial issuance retries instead of hardcoded constants. Default max-failure-retries=0 means retry indefinitely with exponential backoff (base delay from config, capped at 5 minutes). - Replace require.Equal with assert.Equal inside httptest handler goroutines to avoid runtime.Goexit masking real assertion failures. - Rewrite race condition test to actually exercise concurrent goroutine ordering: fires issuance concurrently with a delayed slug resolution and verifies the server rejects the early POST.
Author
|
@x032205 OK I figured out that the initial cert issuance wasn't using the retry logic for renewal, so I wired it up to reuse stuff like lifecycle:
renew-before-expiry: "24h"
status-check-interval: "5m"
max-failure-retries: 0
failure-retry-interval: "5s"and I fixed the review notes from Claude bot. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
profileIdand getting a 422cert-manager agentand regularagentcommandslifecycle.max-failure-retriesandlifecycle.failure-retry-intervalconfig for initial issuance retries (previously only used for renewal). Defaultmax-failure-retries: 0retries indefinitely with exponential backoff capped at 5 minutesBug
The cert-manager agent had a race condition where
POST /api/v1/cert-manager/certificatesfired before theGET /api/v1/cert-manager/certificate-profiles/slug/{slug}response returned. This caused the agent to send the profile-name slug string (e.g."crdb") as theprofileIdfield instead of the resolved UUID, resulting in a 422:{"validation":"uuid","code":"invalid_string","message":"Invalid uuid","path":["profileId"]}Evidence from server-side logs
The Infisical server received these requests from the same client, 75ms apart:
The slug lookup hadn't returned yet when the cert issuance request was sent.
Behavior change
The existing
lifecycle.max-failure-retriesandlifecycle.failure-retry-intervalconfig fields now also apply to initial certificate issuance, not just renewal. Previously, a failed initial issuance was never retried (the agent was permanently broken until restarted). Now:max-failure-retries: 0(default) → retry indefinitely with exponential backoffmax-failure-retries: N→ retry up to N times then stopfailure-retry-interval→ base delay for exponential backoff (default 2s, capped at 5m)A follow-up docs PR can be opened to document this behavior change.
Test plan
TestResolveCertificateNameReferences— verifies slug resolution populatesProfileIDwith the UUIDTestConcurrentIssuanceBlocksOnSlugResolution— fires issuance concurrently with a delayed slug resolution, verifies the server rejects the early POST with unresolved ProfileID, proving the ordering guarantee mattersTestResolveCertificateNameReferences_MultipleProfiles— verifies resolution works for multiple certificates with different profilesgo build ./...)