What happened?
When stackit-cert-manager-webhook is used with expired credentials or is otherwise misconfigured, cert-manager retries the Challenge object indefinitely (without any exponential backoff!).
This happens in all error cases where the returned error contains any changing details (e.g., timestamp, request ID, ...), as the error returned by the webhook is persisted in the Challenge object (status.reason) and cert-manager reconciles the entire Challenge object if it detects any change (including the entire .status!).
This is also stated here in a comment inside the acmechallenges Sync function. Sadly, this entire thing isn't properly documented anywhere else 😔
How can we reproduce this?
We've noticed this when someone tried to use a removed service account key, so:
- Create new project
- Create a new service account (no need to actually add it to a project)
- Create a new service account key (save it for later use, then delete the key again)
- Deploy cert-manager
- Deploy stackit-cert-manager-webhook (helm install stackit-cert-manager-webhook -n cert-manager stackit-cert-manager-webhook/stackit-cert-manager-webhook --set stackitSaAuthentication.enabled=true and create the cert-manager/stackit-sa-authentication secret)
- Create an issuer and certificate
Observe the issue:
- Check the events of the Challenge resource
- kubectl get challenges.acme.cert-manager.io -w (see ~4 changes per second)
- Check the cert-manager logs
- Check the stackit-cert-manager-webhook logs
Additional context
To properly fix this, we must sanitize every error case where we don't have control of the error. We can still log the "original" error, so we should just state the general thing that failed and optionally tell the user that they should check the stackit-cert-manager-webhook logs (e.g., "failed fetching zone. See the stackit-cert-manager-webhook logs for more details.").