Add retry mechanism and dynamic model loading to SONIC framework #27
kakwok wants to merge 1 commit into
…olve cmsTriton conflict in CMSSW_17_0_0_pre2. Co-authored-by: Trevin Lee <trl008@ucsd.edu>
0ea0d7a to 36ca055
auto retryAction = RetryActionFactory::get()->create(actionType, retryPSet, this);
if (retryAction) {
  //Convert to RetryActionPtr Type from raw pointer of retryAction
  retryActions_.emplace_back(RetryActionPtr(retryAction.release()));
is the explicit RetryActionPtr() needed here? usually for emplace_back(), it shouldn't be
edm::LogInfo("SonicClientBase") << "Calling retry()";
// retry() must trigger eval() or finish()
action->retry();
// return because another finish() was already called inside client->evaluate()
is this comment still correct for the task-based implementation?
// helper functions to get server statistics?
// - getServerSideStatus()
// - updateServerStatus()
// - loop over servers_ get statistics
// - getBestServer(model)
// - call updateServerStatus()
// - loop over servers_ get their statistics, compute metric, return server name
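The selection step outlined in these comments (loop over the servers, compute a metric, return a server name) might look roughly like the sketch below. It is written in Python for brevity and is not the PR's C++ code: the `ServerHealth` fields, the ranking metric (fewest failures, then shortest queue time), and the model name `gat_test` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ServerHealth:
    # Hypothetical per-server record; the field names are illustrative only.
    live: bool = True
    ready: bool = True
    failures: int = 0
    queue_time_us: float = 0.0


def get_best_server(servers, model, exclude=()):
    """Return the name of the healthiest server offering `model`, or None."""
    best_name, best_score = None, None
    for name, (models, health) in servers.items():
        if name in exclude or model not in models:
            continue
        if not (health.live and health.ready):
            continue
        # Assumed metric: fewer failures first, then shorter queue time.
        score = (health.failures, health.queue_time_us)
        if best_score is None or score < best_score:
            best_name, best_score = name, score
    return best_name


# Hypothetical server table: name -> (models served, health record).
servers = {
    "server1": ({"gat_test"}, ServerHealth(failures=2, queue_time_us=50.0)),
    "server2": ({"gat_test"}, ServerHealth(failures=0, queue_time_us=120.0)),
    "server3": ({"gat_test"}, ServerHealth(live=False)),
}
best = get_best_server(servers, "gat_test")
```

Passing the currently failing server in `exclude` is one way a retry action could ask for a different candidate; whether the real metric should weight queue time above failure count is a design choice this sketch does not settle.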
parser.add_argument("--port", default=8001, type=int, help="server port")
parser.add_argument("--address", nargs=3, action="append", metavar=("NAME", "HOST", "PORT"),
                    dest="addresses", default=[],
                    help="Triton server entry: name host port (repeatable, e.g. --address server1 0.0.0.0 8011)")
the "e.g." should be --address server1 0.0.0.0 8011 --address server2 0.0.0.0 8021 or something like that to actually show how the repeatability works
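To make the repeatability concrete, here is a minimal standalone sketch of how `argparse` handles an option declared with `nargs=3` and `action="append"`. The help text uses the two-server wording suggested in the comment above; the conversion into `(name, host, port)` tuples is an illustrative assumption, not code from the PR.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--address", nargs=3, action="append", metavar=("NAME", "HOST", "PORT"),
                    dest="addresses", default=[],
                    help="Triton server entry: name host port (repeatable, e.g. "
                         "--address server1 0.0.0.0 8011 --address server2 0.0.0.0 8021)")

# Each occurrence of --address appends one [NAME, HOST, PORT] list of strings.
args = parser.parse_args(["--address", "server1", "0.0.0.0", "8011",
                          "--address", "server2", "0.0.0.0", "8021"])

# The port arrives as a string because no type= is set, so convert it explicitly.
servers = [(name, host, int(port)) for name, host, port in args.addresses]
```

Leaving `type` unset matters here: with `nargs=3`, a `type=int` would be applied to all three values, including the name and host.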
parser.add_argument("--timeout", default=30, type=int, help="timeout for requests")
parser.add_argument("--timeoutUnit", default="seconds", type=str, help="unit for timeout")
parser.add_argument("--params", default="", type=str, help="json file containing server address/port")
parser.add_argument("--params", default="", type=str, help="json file containing server address/port(single-server)")
 public:
  TestTritonClient() : TritonClient() {}

  void connectToServer(const std::string& url) override { lastConnectedUrl = url; }

  void updateServer(const std::string& serverName) override { lastUpdatedServerName = serverName; }

  const std::string& lastUrl() const { return lastConnectedUrl; }
  void evaluate() override {}

 private:
  std::string lastConnectedUrl;
<use name="catch2"/>
</bin>

<test name="TestHeterogeneousCoreSonicTritonRetryActionDiff_Log" command="retry_action_diff_log_test.sh ${LOCALTOP}"/>
there is no script retry_action_diff_log_test.sh committed in this branch
<test name="TestHeterogeneousCoreSonicTritonRetryActionSame" command="cmsRun ${LOCALTOP}/src/HeterogeneousCore/SonicTriton/test/tritonTest_cfg.py --modules TritonGraphProducer --maxEvents 2 --unittest --device cpu --retryAction same"/>
<test name="TestHeterogeneousCoreSonicTritonRetryActionDiff" command="cmsRun ${LOCALTOP}/src/HeterogeneousCore/SonicTriton/test/tritonTest_cfg.py --modules TritonGraphProducer --maxEvents 2 --unittest --device cpu --retryAction diff"/>
do these tests work without a driver script to enable/disable servers?
We will also have to update the various reco/miniaod algorithm configuration files that currently set |
PR description:
Introduces a pluggable retry mechanism for SONIC inference clients and dynamic
model loading/unloading on the Triton fallback server. This enables automatic
recovery when a remote Triton server becomes unavailable during event processing.
Retry mechanism (SonicCore)
- RetryActionBase: Plugin factory base class for retry strategies, with retry(), start(), finish() interface
- RetrySameServerAction: Retries inference on the same server (configurable allowedTries)
- SonicClientBase::finish(): Drives retry loop: on retryable failure (finish(false)), iterates through registered retry actions; on non-retryable failure (finish(false, eptr)), propagates exception directly

Retry with server failover (SonicTriton)
- RetryActionDiffServer: On failure, queries TritonService::getBestServer() for an alternative healthy remote server, calls updateServer() + eval() to retry on the new server
- RetryFallbackServerAction: Last-resort action: when all other retries are exhausted, lazily starts the fallback server (idempotent), dynamically loads the model to the fallback server via TritonClient::switchToFallback(), and retries locally. Fires at most once per inference call.
- TritonService maintains per-server health stats (liveness, readiness, failure count, queue time) via updateServerHealth(), used by getBestServer() to select the best candidate
- resolveServerName() prefers remote (non-fallback) servers when multiple servers provide the same model; getBestServer() excludes fallback servers from candidate pool
Async retry redesign (holder-based)
- evaluate() (Async mode) creates an inner WaitingTaskWithArenaHolder per inference attempt: the gRPC callback calls doneWaiting() and returns immediately; finish()/retry()/updateServer() are scheduled as TBB tasks

Dynamic model loading (TritonService)
- loadModel()/unloadModel() with reference counting for the fallback server
- startFallbackServer() is idempotent; only started at preBeginJob for unassigned models (no remote server available)
- serversHealth_

Configuration
- Retry VPSet in client config (retryType, allowedTries)
- customize.py updated with retry options, defaulted to add RetryFallbackServerAction as the last action.

PR validation:
retry_diffServer_fallback.log
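
The reference counting and idempotent startup described under dynamic model loading can be sketched as follows. This is a hypothetical Python model of the bookkeeping only: the `FallbackModels` class, its method names, the choice to start the server from `load_model()`, and the model name `gat_test` are assumptions, and no real Triton server is contacted.

```python
class FallbackModels:
    """Hypothetical reference-counted registry of models on a fallback server."""

    def __init__(self):
        self.started = False
        self.start_calls = 0
        self.refs = {}

    def start(self):
        # Idempotent: only the first call does any work.
        if not self.started:
            self.started = True
            self.start_calls += 1

    def load_model(self, model):
        self.start()
        self.refs[model] = self.refs.get(model, 0) + 1
        return self.refs[model] == 1  # True when the model was actually loaded

    def unload_model(self, model):
        count = self.refs.get(model, 0)
        if count == 0:
            return False  # nothing to unload
        if count == 1:
            del self.refs[model]
            return True  # last user gone: the model is actually unloaded
        self.refs[model] = count - 1
        return False


registry = FallbackModels()
first_load = registry.load_model("gat_test")    # first user triggers the load
second_load = registry.load_model("gat_test")   # second user only bumps the count
first_unload = registry.unload_model("gat_test")
second_unload = registry.unload_model("gat_test")
```

Counting references this way keeps a model resident while any client still depends on it, so one client finishing its retry does not unload the model from under another.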