Skip to content

Add retry mechanism and dynamic model loading to SONIC framework#27

Open
kakwok wants to merge 1 commit into
fastmachinelearning:masterfrom
kakwok:SonicRetryDML_CMSSW_17_0_0_pre2
Open

Add retry mechanism and dynamic model loading to SONIC framework#27
kakwok wants to merge 1 commit into
fastmachinelearning:masterfrom
kakwok:SonicRetryDML_CMSSW_17_0_0_pre2

Conversation

@kakwok

@kakwok kakwok commented Jun 24, 2026

Copy link
Copy Markdown

PR description:

Introduces a pluggable retry mechanism for SONIC inference clients and dynamic
model loading/unloading on the Triton fallback server. This enables automatic
recovery when a remote Triton server becomes unavailable during event processing.

Retry mechanism (SonicCore)

  • RetryActionBase: Plugin factory base class for retry strategies, with
    retry(), start(), finish() interface
  • RetrySameServerAction: Retries inference on the same server (configurable
    allowedTries)
  • SonicClientBase::finish(): Drives retry loop — on retryable failure
    (finish(false)), iterates through registered retry actions; on non-retryable
    failure (finish(false, eptr)), propagates exception directly

Retry with server failover (SonicTriton)

  • RetryActionDiffServer: On failure, queries TritonService::getBestServer()
    for an alternative healthy remote server, calls updateServer() + eval() to
    retry on the new server
  • RetryFallbackServerAction: Last-resort action — when all other retries are
    exhausted, lazily starts the fallback server (idempotent), dynamically loads the
    model to the fallback server via TritonClient::switchToFallback(), and retries locally. Fires at most
    once per inference call.
  • Server health tracking: TritonService maintains per-server health stats
    (liveness, readiness, failure count, queue time) via updateServerHealth(),
    used by getBestServer() to select the best candidate
  • Server selection preference: resolveServerName() prefers remote (non-fallback)
    servers when multiple servers provide the same model; getBestServer() excludes
    fallback servers from candidate pool

Async retry redesign (holder-based)

  • evaluate() (Async mode) creates an inner WaitingTaskWithArenaHolder per
    inference attempt — the gRPC callback calls doneWaiting() and returns
    immediately; finish()/retry()/updateServer() are scheduled as TBB tasks

Dynamic model loading (TritonService)

  • loadModel()/unloadModel() with reference counting for the fallback server
  • startFallbackServer() is idempotent; only started at preBeginJob for
    unassigned models (no remote server available)
  • Fallback server health entry properly registered in serversHealth_

Configuration

  • Retry actions configurable via Retry VPSet in client config
    (retryType, allowedTries)
  • customize.py updated with retry options, defaulted to add RetryFallbackServerAction as the last action.

PR validation:

retry_diffServer_fallback.log

…olve cmsTriton conflict in CMSSW_17_0_0_pre2.

Co-authored-by: Trevin Lee <trl008@ucsd.edu>
@kakwok kakwok force-pushed the SonicRetryDML_CMSSW_17_0_0_pre2 branch from 0ea0d7a to 36ca055 Compare June 29, 2026 14:12
auto retryAction = RetryActionFactory::get()->create(actionType, retryPSet, this);
if (retryAction) {
//Convert to RetryActionPtr Type from raw pointer of retryAction
retryActions_.emplace_back(RetryActionPtr(retryAction.release()));

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the explicit RetryActionPtr() needed here? usually for emplace_back(), it shouldn't be

edm::LogInfo("SonicClientBase") << "Calling retry()";
// retry() must trigger eval() or finish()
action->retry();
// return because another finish() was already called inside client->evaluate()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this comment still correct for the task-based implementation?

Comment on lines +139 to +145
// helper functions to get server statistics?
// - getServerSideStatus()
// - updateServerStatus()
// - loop over servers_ get statistics
// - getBestServer(model)
// - call updateServerStatus()
// - loop over servers_ get their statistics, compute metric, return server name

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can be removed now?

parser.add_argument("--port", default=8001, type=int, help="server port")
parser.add_argument("--address", nargs=3, action="append", metavar=("NAME", "HOST", "PORT"),
dest="addresses", default=[],
help="Triton server entry: name host port (repeatable, e.g. --address server1 0.0.0.0 8011)")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the "e.g." should be --address server1 0.0.0.0 8011 --address server2 0.0.0.0 8021 or something like that to actually show how the repeatability works

parser.add_argument("--timeout", default=30, type=int, help="timeout for requests")
parser.add_argument("--timeoutUnit", default="seconds", type=str, help="unit for timeout")
parser.add_argument("--params", default="", type=str, help="json file containing server address/port")
parser.add_argument("--params", default="", type=str, help="json file containing server address/port(single-server)")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add space before "(" in help message

public:
TestTritonClient() : TritonClient() {}

void connectToServer(const std::string& url) override { lastConnectedUrl = url; }

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can also be removed?


void updateServer(const std::string& serverName) override { lastUpdatedServerName = serverName; }

const std::string& lastUrl() const { return lastConnectedUrl; }

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can be removed

void evaluate() override {}

private:
std::string lastConnectedUrl;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can be removed

<use name="catch2"/>
</bin>

<test name="TestHeterogeneousCoreSonicTritonRetryActionDiff_Log" command="retry_action_diff_log_test.sh ${LOCALTOP}"/>

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is no script retry_action_diff_log_test.sh committed in this branch

Comment on lines +3 to +4
<test name="TestHeterogeneousCoreSonicTritonRetryActionSame" command="cmsRun ${LOCALTOP}/src/HeterogeneousCore/SonicTriton/test/tritonTest_cfg.py --modules TritonGraphProducer --maxEvents 2 --unittest --device cpu --retryAction same"/>
<test name="TestHeterogeneousCoreSonicTritonRetryActionDiff" command="cmsRun ${LOCALTOP}/src/HeterogeneousCore/SonicTriton/test/tritonTest_cfg.py --modules TritonGraphProducer --maxEvents 2 --unittest --device cpu --retryAction diff"/>

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do these tests work without a driver script to enable/disable servers?

@kpedro88

Copy link
Copy Markdown

We will also have to update the various reco/miniaod algorithm configuration files that currently set allowedTries = cms.untracked.uint32(0)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants