Add retry mechanism and dynamic model loading to SONIC framework #26

Closed
kakwok wants to merge 1 commit into fastmachinelearning:master from kakwok:SonicRetryDML_CMSSW_17_0_0_pre1

kakwok commented Jun 24, 2026

PR description:

Introduces a pluggable retry mechanism for SONIC inference clients and dynamic
model loading/unloading on the Triton fallback server. This enables automatic
recovery when a remote Triton server becomes unavailable during event processing.

Retry mechanism (SonicCore)

  • RetryActionBase: Plugin factory base class for retry strategies, with
    retry(), start(), finish() interface
  • RetrySameServerAction: Retries inference on the same server (configurable
    allowedTries)
  • SonicClientBase::finish(): Drives the retry loop. On a retryable failure
    (finish(false)), it iterates through the registered retry actions; on a
    non-retryable failure (finish(false, eptr)), it propagates the exception
    directly
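
The retry loop described above can be sketched in a few lines. This is only an illustration of the control flow, not the C++ code in this PR: the `RetryAction` class, its `should_retry()` helper, and `finish_with_retries()` are hypothetical names, and the assumption that actions are tried in registration order until one succeeds is inferred from the description.

```python
class RetryAction:
    """Hypothetical stand-in for a RetryActionBase plugin."""

    def __init__(self, name, allowed_tries):
        self.name = name
        self.allowed_tries = allowed_tries
        self.tries_left = allowed_tries

    def start(self):
        # Reset the retry budget at the start of each inference call.
        self.tries_left = self.allowed_tries

    def should_retry(self):
        # Hypothetical helper: True while this action still has tries left.
        return self.tries_left > 0

    def retry(self, evaluate):
        # Consume one try and re-run the inference.
        self.tries_left -= 1
        return evaluate()


def finish_with_retries(evaluate, actions):
    """Run evaluate(); on a retryable failure (False), walk the actions in order.

    A non-retryable failure is modelled as an exception raised by evaluate(),
    which propagates to the caller without touching any retry action.
    """
    for action in actions:
        action.start()
    if evaluate():
        return True
    for action in actions:
        while action.should_retry():
            if action.retry(evaluate):
                return True
    return False
```

With one action configured as `RetryAction("same", 2)`, an inference that fails twice and then succeeds is evaluated three times in total before `finish_with_retries` returns `True`.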

Retry with server failover (SonicTriton)

  • RetryActionDiffServer: On failure, queries TritonService::getBestServer()
    for an alternative healthy remote server, calls updateServer() + eval() to
    retry on the new server
  • RetryFallbackServerAction: Last-resort action. When all other retries are
    exhausted, it lazily starts the fallback server (idempotent), dynamically
    loads the model onto it via TritonClient::switchToFallback(), and retries
    locally. Fires at most once per inference call.
  • Server health tracking: TritonService maintains per-server health stats
    (liveness, readiness, failure count, queue time) via updateServerHealth(),
    used by getBestServer() to select the best candidate
  • Server selection preference: resolveServerName() prefers remote (non-fallback)
    servers when multiple servers provide the same model; getBestServer() excludes
    fallback servers from the candidate pool
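
A minimal sketch of how health stats could drive server selection follows. The `ServerHealth` fields mirror the stats listed above, but the ranking rule (fewest failures first, then shortest queue time) and the `exclude` parameter are assumptions made for illustration; the actual getBestServer() criteria are not spelled out in this description.

```python
from dataclasses import dataclass


@dataclass
class ServerHealth:
    """Hypothetical per-server record, loosely modelled on the stats above."""
    name: str
    live: bool
    ready: bool
    failure_count: int
    queue_time_us: float
    is_fallback: bool = False


def get_best_server(servers, exclude=None):
    """Return the name of the best healthy remote server, or None.

    Fallback servers and the server that just failed (exclude) are left out
    of the candidate pool. The ranking key is an assumption.
    """
    candidates = [
        s for s in servers
        if s.live and s.ready and not s.is_fallback and s.name != exclude
    ]
    if not candidates:
        return None
    best = min(candidates, key=lambda s: (s.failure_count, s.queue_time_us))
    return best.name
```

Returning `None` when no candidate remains is what would let a later action, such as the fallback action, take over.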

Async retry redesign (holder-based)

  • evaluate() (Async mode) creates an inner WaitingTaskWithArenaHolder per
    inference attempt. The gRPC callback calls doneWaiting() and returns
    immediately; finish()/retry()/updateServer() are scheduled as TBB tasks
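
The pattern of a callback that only signals completion, with the follow-up work scheduled elsewhere, can be illustrated with a thread pool. This is an analogy only: `ThreadPoolExecutor` stands in for TBB task scheduling, and `run_attempts`, `send_request`, and `handle` are hypothetical names that do not appear in the PR.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_attempts(send_request, max_attempts, pool, timeout=5.0):
    """Issue up to max_attempts requests; callbacks never do the retry work."""
    done = threading.Event()
    outcome = {}

    def attempt(n):
        def on_response(ok):
            # Callback body: hand the follow-up work to the pool and return
            # at once, so the transport thread is never blocked.
            pool.submit(handle, n, ok)
        send_request(n, on_response)

    def handle(n, ok):
        # Runs as a pool task: either finish or start the next attempt.
        if ok or n + 1 >= max_attempts:
            outcome["ok"] = ok
            outcome["attempts"] = n + 1
            done.set()
        else:
            attempt(n + 1)

    attempt(0)
    done.wait(timeout)
    return outcome
```

Each attempt gets its own callback closure, which loosely mirrors creating one holder per inference attempt.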

Dynamic model loading (TritonService)

  • loadModel()/unloadModel() with reference counting for the fallback server
  • startFallbackServer() is idempotent; the fallback server is started at
    preBeginJob only for unassigned models (no remote server available)
  • Fallback server health entry properly registered in serversHealth_
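
Reference-counted loading can be sketched as below. The `FallbackModelRegistry` class and its injected `load_fn`/`unload_fn` callables are hypothetical; the sketch assumes the model is loaded on the first request and unloaded when the last user releases it, which is the usual meaning of reference counting but is not confirmed in detail by this description.

```python
class FallbackModelRegistry:
    """Hypothetical reference counter for models on the fallback server."""

    def __init__(self, load_fn, unload_fn):
        self._load = load_fn      # called once when a model gains its first user
        self._unload = unload_fn  # called once when a model loses its last user
        self._counts = {}

    def load_model(self, name):
        count = self._counts.get(name, 0)
        if count == 0:
            self._load(name)
        self._counts[name] = count + 1

    def unload_model(self, name):
        count = self._counts.get(name, 0)
        if count == 0:
            return  # nothing to release; ignore unbalanced calls
        if count == 1:
            self._unload(name)
            del self._counts[name]
        else:
            self._counts[name] = count - 1
```

Two clients that both call `load_model("m")` trigger a single load, and the model is unloaded only after both have called `unload_model("m")`.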

Configuration

  • Retry actions configurable via Retry VPSet in client config
    (retryType, allowedTries)
  • customize.py updated with retry options; by default, RetryFallbackServerAction
    is added as the last action
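
The default ordering can be illustrated with plain Python dictionaries standing in for the Retry VPSet entries. The `with_fallback_last` helper is hypothetical, and the `allowedTries` value of 1 for the fallback entry is an assumption based on the action firing at most once per inference call; the real customize.py may express this differently.

```python
def with_fallback_last(retry_actions, fallback_type="RetryFallbackServerAction"):
    """Return a copy of the retry list with the fallback action placed last.

    Each entry is a dict with the retryType and allowedTries keys named in
    the PR description; plain dicts stand in for cms.PSet objects here.
    """
    actions = [dict(a) for a in retry_actions if a["retryType"] != fallback_type]
    # Assumed value: the fallback action fires at most once per inference call.
    actions.append({"retryType": fallback_type, "allowedTries": 1})
    return actions


retry = with_fallback_last([
    {"retryType": "RetrySameServerAction", "allowedTries": 2},
    {"retryType": "RetryActionDiffServer", "allowedTries": 1},
])
```

Because actions are walked in order, keeping the fallback entry last means it is reached only after the same-server and different-server retries are exhausted.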

PR validation:

retry_diffServer_fallback.log

@kakwok kakwok marked this pull request as draft June 24, 2026 00:02
Co-authored-by: Trevin Lee <trevin-lee@users.noreply.github.com>
@kakwok kakwok force-pushed the SonicRetryDML_CMSSW_17_0_0_pre1 branch from 409afe2 to e23f6fb Compare June 24, 2026 00:17

kakwok commented Jun 24, 2026

superseded by #27

@kakwok kakwok closed this Jun 24, 2026