console: add subscribe health atom #36760

Open
leedqin wants to merge 1 commit into MaterializeInc:main from leedqin:subscribe-health-check-atom

@leedqin
Contributor

@leedqin leedqin commented May 27, 2026

Environment health polls `SELECT mz_version()` on `mz_catalog_server` every 5s per tab. Once a global SUBSCRIBE is streaming, the environment is already proven reachable, so the poll is redundant.

Derive health from subscribe state (`subscribeDerivedHealthAtom`) and back the poll off to 30s while subscribes are healthy; keep 5s during bootstrap and when subscribes go down, so recovery and the crashed/blocked banner still surface.
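
The interval selection described above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `SubscribeHealth` union and the `healthPollIntervalMs` helper are hypothetical names, and only the 5s and 30s values come from the description.

```typescript
// Hypothetical model of what a subscribe-derived health atom might expose.
type SubscribeHealth = "bootstrapping" | "healthy" | "down";

const FAST_POLL_MS = 5_000; // bootstrap and recovery, per the description
const SLOW_POLL_MS = 30_000; // while subscribes are streaming

// Back off only when a subscribe already proves the environment reachable.
function healthPollIntervalMs(state: SubscribeHealth): number {
  return state === "healthy" ? SLOW_POLL_MS : FAST_POLL_MS;
}
```

Keeping the fast interval for every non-healthy state means a dropped subscribe immediately restores the 5s cadence.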

This frees up the browser connection slots that the 5-second health poll was occupying. Tested against a workload capture of a production analytics workload. In a 6-tab stress test, health check requests reached around 45 concurrent in-flight requests over 2 minutes; this change brought that down to 7-9 concurrent requests. The worst-case request queue time showed one health request stuck for 240 s behind a saturated pool; this change brought it down to 1 ms.
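
As a rough back-of-the-envelope check, the number of health requests issued scales inversely with the poll interval. Note that this counts requests issued, not the concurrent in-flight requests measured in the stress test. The helper below is illustrative only (`healthRequestsIssued` is a hypothetical name) and assumes every tab polls on a fixed interval for the whole window.

```typescript
// Count health polls issued across all tabs within a time window.
function healthRequestsIssued(
  tabs: number,
  windowMs: number,
  intervalMs: number,
): number {
  return tabs * Math.floor(windowMs / intervalMs);
}

// 6 tabs over 2 minutes, at the two intervals named in the description.
const issuedAt5s = healthRequestsIssued(6, 120_000, 5_000); // 144
const issuedAt30s = healthRequestsIssued(6, 120_000, 30_000); // 24
```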

Fixes CNS-83

@leedqin leedqin requested a review from a team as a code owner May 27, 2026 20:32
@leedqin leedqin added the A-CONSOLE Area: Console label May 27, 2026
@leedqin leedqin requested review from SangJunBak and jubrad and removed request for a team May 27, 2026 20:32
const mergeEnvironments = useSetAtom(mergeEnvironmentsWithHealth);
const cloudRegions = useAtomValue(cloudRegionsSelector);
const appConfig = useAtomValue(appConfigAtom);
const subscribeHealth = useAtomValue(subscribeDerivedHealthAtom);
Contributor

I do want to call out I think the region creation flow relies on polling fetchEnvironmentsWithHealth to transition from "booting" to "ready". I think 30 seconds is fine, but instead of relying on a subscribe, I wonder if it's simpler to change the overall health check poll to 30s or 1 minute and pass 30_000 into usePollEnvironmentHealth.

Contributor Author

I was just concerned in the event that suddenly we stop polling for 30 seconds and then the console doesn't try to reconnect in 5 seconds so the user doesn't immediately see the console being unable to connect to environmentd. Not necessarily the worst but I thought it was a slight regression so I used a subscribe atom too.

My initial idea was to just use the subscribe atom but then in case the request is blocked, an active subscribe will not catch that unless the socket disconnects and connects again.
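
The point here is that neither signal is sufficient alone: an open socket cannot reveal a blocked request, and a slow poll delays detection. One way to combine them might look like the sketch below. All names (`PollStatus`, `SubscribeSignal`, `effectiveHealth`) are hypothetical and not taken from the console codebase.

```typescript
// Hypothetical result of the latest health poll.
type PollStatus = "healthy" | "blocked" | "crashed" | "unknown";
// Hypothetical view of the global subscribe socket.
type SubscribeSignal = "streaming" | "connecting" | "disconnected";

function effectiveHealth(
  poll: PollStatus,
  subscribe: SubscribeSignal,
): PollStatus {
  // An explicit failure from the poll always wins: a live socket
  // cannot observe a blocked request.
  if (poll === "blocked" || poll === "crashed") return poll;
  // Otherwise a streaming subscribe is enough evidence of reachability.
  if (subscribe === "streaming") return "healthy";
  // With no subscribe evidence, fall back to whatever the poll reported.
  return poll;
}
```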

Contributor

> My initial idea was to just use the subscribe atom but then in case the request is blocked, an active subscribe will not catch that unless the socket disconnects and connects again.

Yeah I don't think this will work given we rely on the API call to the region controller which you can't subscribe on for the following flow:

> I do want to call out I think the region creation flow relies on polling fetchEnvironmentsWithHealth to transition from "booting" to "ready".

Regarding:

> I was just concerned in the event that suddenly we stop polling for 30 seconds and then the console doesn't try to reconnect in 5 seconds so the user doesn't immediately see the console being unable to connect to environmentd.

Yeah but how this actually displays in the Console is the "environment not ready" toast which we already deafen. All other queries need to connect to environmentd anyways so they don't actually require this health check to operate correctly.

Member

@SangJunBak doesn't that work if we just poll for the initial connection? Also, I think it would be better to do this by fetching the status of the environment from the region API rather than waiting for a subscribe to work, right?
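
Polling only until the initial connection could be sketched as below. This assumes the region API reports a status that moves from "booting" to "ready", as mentioned earlier in the thread. The `RegionStatus` type, both helper names, and the choice to stop polling entirely once ready are assumptions made for illustration.

```typescript
// Hypothetical status reported by the region API.
type RegionStatus = "booting" | "ready";

// Delay before the next poll, or null to stop polling.
function nextPollDelayMs(status: RegionStatus): number | null {
  return status === "booting" ? 5_000 : null;
}

// Simulate how many polls run for a given sequence of observed statuses.
function pollsUntilReady(observed: RegionStatus[]): number {
  let polls = 0;
  for (const status of observed) {
    polls += 1;
    if (nextPollDelayMs(status) === null) break;
  }
  return polls;
}
```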

@SangJunBak
Contributor

Left a comment. Lemme know what you think!
