149 changes: 148 additions & 1 deletion pkg/controllers/kube-apiserver.go
@@ -23,6 +23,7 @@ import (
"io"
"net"
"os"
"os/exec"
"path/filepath"
"strconv"
"strings"
@@ -55,6 +56,16 @@ import (

const (
kubeAPIStartupTimeout = 60
// rbacHookDeadlockTimeout is the time to wait for the RBAC bootstrap hook
// before declaring a deadlock. This is shorter than kubeAPIStartupTimeout
// to allow for faster recovery.
rbacHookDeadlockTimeout = 15
// rbacHookCheckInterval is how often to check the RBAC hook status
rbacHookCheckInterval = 2
// rbacHookMaxWaitDuration is the absolute maximum time to wait for the RBAC hook
// regardless of etcd health state changes. This prevents flapping from extending
// detection indefinitely.
rbacHookMaxWaitDuration = 30 * time.Second
)

var (
@@ -348,7 +359,13 @@ func (s *KubeAPIServer) Run(ctx context.Context, ready chan<- struct{}, stopped
return err
}

// run readiness check
// Channel to signal RBAC hook deadlock detection
rbacDeadlockDetected := make(chan struct{})

// Run RBAC hook deadlock detector
go s.detectRBACHookDeadlock(ctx, restClient, rbacDeadlockDetected)

// Run standard readiness check
go func() {
err := wait.PollUntilContextTimeout(ctx, time.Second, kubeAPIStartupTimeout*time.Second, true, func(ctx context.Context) (bool, error) {
var status int
@@ -420,7 +437,137 @@ func (s *KubeAPIServer) Run(ctx context.Context, ready chan<- struct{}, stopped
return err
case perr := <-panicChannel:
panic(perr)
case <-rbacDeadlockDetected:
klog.Error("RBAC bootstrap hook deadlock detected - restarting microshift-etcd.scope to recover")
if err := restartMicroshiftEtcdScope(); err != nil {
klog.Errorf("Failed to restart microshift-etcd.scope: %v", err)
}
return fmt.Errorf("RBAC bootstrap hook deadlock detected after %d seconds", rbacHookDeadlockTimeout)
}
}

// detectRBACHookDeadlock monitors the RBAC bootstrap hook status and detects deadlock conditions.
// A deadlock is detected when:
// 1. The RBAC hook is not completing (stuck in "not finished" state)
// 2. etcd is healthy and responsive
// This indicates the circular dependency where the hook waits for API server
// while API server waits for the hook.
func (s *KubeAPIServer) detectRBACHookDeadlock(ctx context.Context, restClient rest.Interface, deadlockDetected chan<- struct{}) {
// Wait a few seconds before starting detection to allow normal startup
select {
case <-ctx.Done():
return
case <-time.After(5 * time.Second):
}

// Track wall-clock deadline to prevent flapping from extending detection indefinitely
startTime := time.Now()
checkCount := 0
maxChecks := (rbacHookDeadlockTimeout - 5) / rbacHookCheckInterval // Account for initial delay

for checkCount < maxChecks {
// Check absolute deadline first - this cannot be reset by etcd state changes
if time.Since(startTime) >= rbacHookMaxWaitDuration {
klog.Errorf("RBAC bootstrap hook exceeded maximum wait duration of %v", rbacHookMaxWaitDuration)
break
Comment on lines +469 to +472

⚠️ Potential issue | 🟠 Major

Only confirm deadlock after the healthy-etcd stall predicate is actually met.

checkCount advances before either probe proves the condition, isEtcdHealthy errors do not clear it, and the rbacHookMaxWaitDuration break still falls through to close(deadlockDetected). That means transport failures, health-probe errors, or a prolonged etcd outage can still be reported as a confirmed RBAC deadlock and trigger the stop path even though “hook unfinished while etcd is healthy” was never observed. Only increment on that predicate, and return on max-wait expiry unless it has already reached maxChecks.

Also applies to: 481-515

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controllers/kube-apiserver.go` around lines 469 - 472, The loop that
decides to signal deadlock is incrementing checkCount and falling through to
close(deadlockDetected) on rbacHookMaxWaitDuration expiry even when
isEtcdHealthy() or the RBAC probe errored; modify the loop (the block around
checkCount, isEtcdHealthy, rbacHookMaxWaitDuration, maxChecks, and
close(deadlockDetected)) so that checkCount is only incremented when the
"healthy-etcd stall" predicate is actually observed (i.e., when both the RBAC
probe is unfinished AND isEtcdHealthy() returns true), ignore/skip increments
when probes or health checks error, and on hitting rbacHookMaxWaitDuration
return early unless checkCount has already reached maxChecks; ensure
close(deadlockDetected) is only called after the predicate has been confirmed
maxChecks times.

}

select {
case <-ctx.Done():
return
case <-time.After(rbacHookCheckInterval * time.Second):
}

checkCount++

// Check RBAC hook status
var status int
err := restClient.Get().AbsPath("/readyz/poststarthook/rbac/bootstrap-roles").Do(ctx).StatusCode(&status).Error()

Comment on lines +483 to +486

⚠️ Potential issue | 🟠 Major

Add a per-probe timeout to the RBAC hook request.

Do(ctx) inherits the long-lived controller context. If the apiserver accepts the connection but never answers, this goroutine blocks inside the request and never re-evaluates either the 2s interval or the 30s max-wait guard.

Possible fix
-		var status int
-		err := restClient.Get().AbsPath("/readyz/poststarthook/rbac/bootstrap-roles").Do(ctx).StatusCode(&status).Error()
+		probeCtx, cancel := context.WithTimeout(ctx, time.Second)
+		var status int
+		err := restClient.Get().
+			AbsPath("/readyz/poststarthook/rbac/bootstrap-roles").
+			Do(probeCtx).
+			StatusCode(&status).
+			Error()
+		cancel()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controllers/kube-apiserver.go` around lines 483 - 486, The RBAC readiness
probe call using
restClient.Get().AbsPath("/readyz/poststarthook/rbac/bootstrap-roles").Do(ctx)...
uses the long-lived controller ctx and can hang; wrap that request in a short
per-probe context with timeout (e.g., context.WithTimeout(ctx, 2*time.Second)),
use the derived probeCtx in Do(probeCtx), and defer cancel() so the goroutine
unblocks and the 2s interval / 30s max-wait logic can re-evaluate; update the
call that sets status and err (StatusCode(&status).Error()) to use the probe
context.

// If hook is ready, no deadlock
if err == nil && status == 200 {
klog.V(4).Info("RBAC bootstrap hook completed successfully")
return
}

// Hook not ready - check if etcd is healthy
etcdHealthy, etcdErr := isEtcdHealthy(ctx)
if etcdErr != nil {
klog.V(4).Infof("Could not check etcd health: %v", etcdErr)
continue
}

if etcdHealthy {
klog.Warningf("RBAC bootstrap hook not ready (check %d/%d, elapsed %v), but etcd is healthy - potential deadlock",
checkCount, maxChecks, time.Since(startTime).Round(time.Second))
} else {
// etcd not healthy - not a deadlock, just waiting for etcd
klog.V(4).Infof("RBAC hook waiting, etcd not yet healthy (check %d/%d)", checkCount, maxChecks)
// Reset counter since this isn't a deadlock condition
// Note: wall-clock deadline (startTime) is NOT reset - flapping cannot extend indefinitely
checkCount = 0
}
Comment on lines +500 to +509

⚠️ Potential issue | 🟡 Minor

Counter reset can extend detection window indefinitely.

When etcd becomes unhealthy, checkCount resets to 0. If etcd flaps between healthy/unhealthy states, the detector may never reach maxChecks and the deadlock goes undetected indefinitely.

Consider adding a maximum wall-clock timeout regardless of resets:

🛡️ Suggested fix
+	deadline := time.Now().Add(time.Duration(rbacHookDeadlockTimeout+30) * time.Second)
+
 	for checkCount < maxChecks {
+		if time.Now().After(deadline) {
+			klog.Warning("RBAC deadlock detection exceeded maximum wall-clock time")
+			break
+		}
 		select {
 		case <-ctx.Done():
 			return
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controllers/kube-apiserver.go` around lines 488 - 496, The current RBAC
bootstrap loop resets checkCount to 0 when etcd is unhealthy which allows
flapping to prevent ever reaching maxChecks; modify the logic in the
kube-apiserver.go RBAC detection loop (around variables checkCount, maxChecks,
and etcdHealthy) to track a wall-clock deadline (e.g., startTime or deadline)
when the check begins and do not reset that deadline on etcd state
changes—always compare time.Since(startTime) (or time.Now().After(deadline))
against a configured maxWaitDuration and fail/exit the loop if exceeded; keep
the existing checkCount logic for deadlock detection but add this independent
timeout check so flapping cannot extend detection indefinitely.

}

// Reached max checks with etcd healthy but hook not completing - deadlock detected
klog.Errorf("RBAC bootstrap hook deadlock confirmed after %v: etcd healthy but hook not completing",
time.Since(startTime).Round(time.Second))
close(deadlockDetected)
}

// isEtcdHealthy checks if etcd is responsive by attempting to connect and get status.
func isEtcdHealthy(ctx context.Context) (bool, error) {
certsDir := cryptomaterial.CertsDirectory(config.DataDir)
etcdAPIServerClientCertDir := cryptomaterial.EtcdAPIServerClientCertDir(certsDir)

tlsInfo := transport.TLSInfo{
CertFile: cryptomaterial.ClientCertPath(etcdAPIServerClientCertDir),
KeyFile: cryptomaterial.ClientKeyPath(etcdAPIServerClientCertDir),
TrustedCAFile: cryptomaterial.CACertPath(cryptomaterial.EtcdSignerDir(certsDir)),
}
tlsConfig, err := tlsInfo.ClientConfig()
if err != nil {
return false, fmt.Errorf("failed to create TLS config: %w", err)
}

// Use a short timeout for health check
checkCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
defer cancel()

client, err := clientv3.New(clientv3.Config{
Endpoints: []string{"https://localhost:2379"},
DialTimeout: 1 * time.Second,
TLS: tlsConfig,
Context: checkCtx,
})
if err != nil {
return false, fmt.Errorf("failed to create etcd client: %w", err)
}
defer func() { _ = client.Close() }()

_, err = client.Status(checkCtx, "localhost:2379")
if err != nil {
return false, nil // etcd not healthy, but not an error condition
}

return true, nil
}

// restartMicroshiftEtcdScope restarts the microshift-etcd.scope to recover from deadlock.
// This forces a clean restart of etcd which can help break the circular dependency.
func restartMicroshiftEtcdScope() error {
klog.Info("Stopping microshift-etcd.scope for recovery")

stopCmd := exec.Command("systemctl", "stop", "microshift-etcd.scope")
if out, err := stopCmd.CombinedOutput(); err != nil {
return fmt.Errorf("failed to stop microshift-etcd.scope: %w, output: %s", err, string(out))
Comment on lines +558 to +563

⚠️ Potential issue | 🟠 Major

Bound the systemctl recovery command.

CombinedOutput() is unbounded here. If systemd or DBus stalls, the fail-fast recovery path hangs and Run never returns.

Possible fix
 func restartMicroshiftEtcdScope() error {
 	klog.Info("Stopping microshift-etcd.scope for recovery")
 
-	stopCmd := exec.Command("systemctl", "stop", "microshift-etcd.scope")
+	cmdCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
+	defer cancel()
+
+	stopCmd := exec.CommandContext(cmdCtx, "systemctl", "stop", "microshift-etcd.scope")
 	if out, err := stopCmd.CombinedOutput(); err != nil {
 		return fmt.Errorf("failed to stop microshift-etcd.scope: %w, output: %s", err, string(out))
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controllers/kube-apiserver.go` around lines 558 - 563, The call to
stopCmd.CombinedOutput() in restartMicroshiftEtcdScope is unbounded and can
hang; wrap the systemctl invocation with a context deadline (e.g., create a
context.WithTimeout and use exec.CommandContext) and use that context when
constructing stopCmd, then read CombinedOutput (or CombinedOutput on the
context-backed command) and return a timeout-aware error if the context expires;
reference restartMicroshiftEtcdScope, stopCmd and CombinedOutput and ensure you
cancel the context and propagate/annotate context timeout errors in the returned
fmt.Errorf message.

}

// Wait briefly for cleanup
time.Sleep(1 * time.Second)

klog.Info("microshift-etcd.scope stopped - MicroShift will restart")
return nil
}

func discoverEtcdServers(ctx context.Context, kubeconfigPath string) ([]string, error) {