feat: auto failover APIs with LK Cloud #686
retries in alternative datacenters on 5xx and transport failures
🦋 Changeset detected. Latest commit: 0619fe1. The changes in this PR will be included in the next version bump. This PR includes changesets to release 2 packages.
Add auto failover APIs for LK Cloud in livekit-server-sdk.
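As a rough illustration of the behavior described above (retrying in alternative datacenters on 5xx and transport failures), here is a minimal sketch. The names `isRetryableStatus`, `callWithFailover`, and `AttemptResult` are hypothetical and are not taken from this PR. The sketch also assumes the list of origins is already known, whereas the SDK discovers regions separately.

```typescript
/** True for HTTP 5xx, the status class the PR description says triggers a retry. */
function isRetryableStatus(status: number): boolean {
  return status >= 500 && status <= 599;
}

// Hypothetical result shape for one attempt against one origin.
interface AttemptResult<T> {
  status: number;
  body: T;
}

/**
 * Tries each origin in order and returns the first non-5xx result.
 * Simplification: any non-5xx status (including 4xx) is returned to the caller
 * without a retry, since only 5xx and transport failures are retried.
 */
async function callWithFailover<T>(
  origins: string[],
  attempt: (origin: string) => Promise<AttemptResult<T>>,
): Promise<T> {
  let lastError: unknown = new Error('no origins provided');
  for (const origin of origins) {
    try {
      const result = await attempt(origin);
      if (!isRetryableStatus(result.status)) {
        return result.body;
      }
      lastError = new Error(`HTTP ${result.status} from ${origin}`);
    } catch (err) {
      // A thrown error stands in for a transport failure (DNS, reset, timeout).
      lastError = err;
    }
  }
  // Every origin failed: surface the most recent failure.
  throw lastError;
}
```

Keeping the retry decision in a small pure function makes the retry policy easy to unit test independently of the network.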
CI depends on livekit/livekit#4627
    const response = await fetch(new URL('/settings/regions', origin.origin), {
      method: 'GET',
      headers: fetchHeaders,
    });
🟡 Region discovery requests have no timeout, causing user API calls to hang far beyond the configured timeout
The region discovery request is issued without an abort signal (fetch(...) at packages/livekit-server-sdk/src/failover.ts:130) after the primary request has already timed out. If the same host also hangs on the discovery endpoint, the user's call blocks for the OS-level TCP timeout (often minutes) rather than the few seconds configured.
Impact: API calls to a slow LiveKit Cloud host can hang for minutes even though the user configured a short request timeout.
Mechanism: region discovery inherits no timeout from the caller
When the primary Twirp request times out via AbortSignal.timeout (packages/livekit-server-sdk/src/TwirpRPC.ts:118), the error is treated as a retryable transport error, and regionOrigins(origin, headers) is called at packages/livekit-server-sdk/src/TwirpRPC.ts:144. This delegates to fetchRegions (packages/livekit-server-sdk/src/failover.ts:119-142), which calls fetch(new URL('/settings/regions', origin.origin), ...) without setting signal on the request init. Since origin is the same host that just timed out, this second fetch can also hang for an OS-dependent duration (typically 60–120+ seconds for TCP keepalive/connect timeout), far exceeding the user-configured requestTimeout. Only after the OS timeout does the catch block in regionOrigins (packages/livekit-server-sdk/src/failover.ts:114) fire, returning an empty array and finally surfacing the original error.
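One possible way to address the mechanism above is to pass an abort signal to the discovery request as well. The following is a minimal sketch, not the fix adopted in this PR. `withTimeout`, `discoverRegions`, and the `*Like` types are hypothetical names, the fetch implementation is injected so that the sketch does not depend on DOM typings, and the response body is left as `unknown` because the format of `/settings/regions` is not shown in this thread.

```typescript
// Minimal structural types so the sketch is self-contained.
interface RequestInitLike {
  method?: string;
  headers?: Record<string, string>;
  signal?: AbortSignal;
}
interface ResponseLike {
  ok: boolean;
  json(): Promise<unknown>;
}
type FetchLike = (url: URL, init: RequestInitLike) => Promise<ResponseLike>;

/** Returns a copy of `init` whose request aborts after `timeoutMs` milliseconds. */
function withTimeout(init: RequestInitLike, timeoutMs: number): RequestInitLike {
  return { ...init, signal: AbortSignal.timeout(timeoutMs) };
}

/**
 * Region discovery bounded by the caller's timeout. Resolves to the parsed
 * body, or null on any failure so the caller can surface the original error.
 */
async function discoverRegions(
  origin: URL,
  headers: Record<string, string>,
  timeoutMs: number,
  fetchImpl: FetchLike,
): Promise<unknown> {
  try {
    const response = await fetchImpl(
      new URL('/settings/regions', origin.origin),
      withTimeout({ method: 'GET', headers }, timeoutMs),
    );
    return response.ok ? await response.json() : null;
  } catch {
    // Timeout or transport failure: treat as "no regions available".
    return null;
  }
}
```

With something like this in place, the worst case for a hanging host would be roughly the primary timeout plus the discovery timeout, rather than an OS-dependent TCP timeout. Whether discovery should reuse the full `requestTimeout` or a smaller budget is a design choice left to the maintainers.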
retries in alternative datacenters on 5xx and transport failures
also removed legacy camel-case, which was not needed since we switched to protobuf-es