feat: auto failover APIs with LK Cloud #686

Open
davidzhao wants to merge 5 commits into main from dz/region-failover

@davidzhao

Member

retries in alternative datacenters on 5xx and transport failures

also removed legacy camel-case, which was not needed since we switched to protobuf-es

@davidzhao davidzhao requested review from anunaym14 and lukasIO June 27, 2026 21:42
@changeset-bot

changeset-bot Bot commented Jun 27, 2026

🦋 Changeset detected

Latest commit: 0619fe1

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 2 packages

| Name | Type |
| --- | --- |
| livekit-server-sdk | Patch |
| agent-dispatch | Patch |

@davidzhao davidzhao requested a review from 1egoman June 27, 2026 21:42
Add auto failover APIs for LK Cloud in livekit-server-sdk.
@davidzhao

Member Author

CI depends on livekit/livekit#4627

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 2 potential issues.

Comment on lines +130 to +133
    const response = await fetch(new URL('/settings/regions', origin.origin), {
      method: 'GET',
      headers: fetchHeaders,
    });

@devin-ai-integration devin-ai-integration Bot Jun 27, 2026

🟡 Region discovery requests have no timeout, causing user API calls to hang far beyond the configured timeout

The region discovery request is issued without any abort signal (fetch(...) at packages/livekit-server-sdk/src/failover.ts:130) after the primary request already timed out, so if the same host also hangs on the discovery endpoint, the user's call blocks for the OS-level TCP timeout (often minutes) instead of the configured seconds.

Impact: API calls to a slow LiveKit Cloud host can hang for minutes even though the user configured a short request timeout.

Mechanism: region discovery inherits no timeout from the caller

When the primary Twirp request times out via AbortSignal.timeout (packages/livekit-server-sdk/src/TwirpRPC.ts:118), the error is treated as a retryable transport error, and regionOrigins(origin, headers) is called at packages/livekit-server-sdk/src/TwirpRPC.ts:144. This delegates to fetchRegions (packages/livekit-server-sdk/src/failover.ts:119-142), which calls fetch(new URL('/settings/regions', origin.origin), ...) without setting signal on the request init. Since origin is the same host that just timed out, this second fetch can also hang for an OS-dependent duration (typically 60–120+ seconds for TCP keepalive/connect timeout), far exceeding the user-configured requestTimeout. Only after the OS timeout does the catch block in regionOrigins (packages/livekit-server-sdk/src/failover.ts:114) fire, returning an empty array and finally surfacing the original error.
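
One way to bound the discovery request is to pass it an abort signal derived from the caller's timeout. The following is a minimal sketch, not the PR's code: `discoveryInit`, `fetchRegionsBounded`, the `timeoutMs` parameter, and the injectable `fetchImpl` are hypothetical names introduced for illustration, and the real `fetchRegions` signature and response handling may differ. It assumes a runtime where `AbortSignal.timeout` is available.

```typescript
// Hypothetical sketch: none of these names come from the PR itself.
type FetchLike = (input: URL, init?: RequestInit) => Promise<Response>;

// Builds the request init for region discovery, attaching a timeout signal
// so the request cannot outlive the caller's configured budget.
function discoveryInit(headers: Record<string, string>, timeoutMs: number): RequestInit {
  return {
    method: 'GET',
    headers,
    // The signal aborts on its own once timeoutMs has elapsed.
    signal: AbortSignal.timeout(timeoutMs),
  };
}

// Fetches the region list with a bounded wait. fetchImpl is injectable so the
// behaviour can be exercised without a network (an assumption of this sketch).
async function fetchRegionsBounded(
  origin: URL,
  headers: Record<string, string>,
  timeoutMs: number,
  fetchImpl: FetchLike = fetch,
): Promise<unknown> {
  const url = new URL('/settings/regions', origin.origin);
  const response = await fetchImpl(url, discoveryInit(headers, timeoutMs));
  if (!response.ok) {
    throw new Error(`region discovery failed with status ${response.status}`);
  }
  return response.json();
}
```

With a signal attached, a host that hangs on `/settings/regions` causes the discovery call to reject after roughly `timeoutMs`, so the existing catch path in `regionOrigins` could return its empty array promptly rather than waiting for an OS-level timeout. Whether discovery should reuse the full request timeout or only the remaining budget is a design choice this sketch leaves open.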

Comment thread packages/livekit-server-sdk/src/TwirpRPC.ts