console: add subscribe health atom by leedqin · Pull Request #36760 · MaterializeInc/materialize

leedqin · 2026-05-27T20:32:36Z

Environment health polls SELECT mz_version() on mz_catalog_server every 5s per tab. Once a global SUBSCRIBE is streaming, the environment is already proven reachable, so the poll is redundant.

Derive health from subscribe state (subscribeDerivedHealthAtom) and back the poll off to 30s while subscribes are healthy; keep 5s during bootstrap and when subscribes go down, so recovery and the crashed/blocked banner still surface.
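As a rough illustration of the backoff rule described above, the sketch below derives a health value from a set of subscribe states and picks the poll interval from it. The type names, state labels, and both helper functions are hypothetical and not taken from the console codebase; in particular, treating "any subscribe streaming" as healthy is an assumption based on the description, not a confirmed detail of `subscribeDerivedHealthAtom`.

```typescript
// Hypothetical types; the console's real subscribe state shape may differ.
type SubscribeState = "connecting" | "streaming" | "closed" | "error";
type DerivedHealth = "bootstrapping" | "healthy" | "down";

const FAST_POLL_MS = 5_000; // bootstrap and recovery
const SLOW_POLL_MS = 30_000; // subscribes already prove reachability

// Assumption: one streaming subscribe is enough to call the environment reachable.
function deriveHealth(states: readonly SubscribeState[]): DerivedHealth {
  if (states.length === 0) return "bootstrapping";
  if (states.some((s) => s === "streaming")) return "healthy";
  if (states.every((s) => s === "connecting")) return "bootstrapping";
  return "down";
}

// Back off only while healthy; otherwise keep the fast poll so recovery
// and the crashed/blocked banner still surface quickly.
function pollIntervalMs(health: DerivedHealth): number {
  return health === "healthy" ? SLOW_POLL_MS : FAST_POLL_MS;
}
```

With these definitions, `pollIntervalMs(deriveHealth(["streaming", "error"]))` yields the 30s interval, while an empty or fully failed set of subscribes keeps the 5s interval.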

This frees up the browser connection slots that the 5-second health poll was taking. Tested against a workload capture of a production analytics workload. In a 6-tab stress test, health check requests accounted for around 45 concurrent in-flight requests over 2 minutes; this change brought that down to 7-9 concurrent requests. The worst-case request queue time showed one health request stuck for 240s behind a saturated pool; this change brought that down to 1 ms.
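For a sense of scale, here is a back-of-the-envelope count of how many health polls are issued in the 2-minute, 6-tab scenario at each interval. It assumes every tab polls independently and stays on a single interval for the whole window. It counts requests issued, which is not the same quantity as the concurrent in-flight figures measured above. The helper name is hypothetical.

```typescript
// Hypothetical helper: number of polls issued by `tabs` independent tabs
// over `windowMs`, each polling once per `intervalMs`.
function pollsIssued(tabs: number, intervalMs: number, windowMs: number): number {
  return tabs * Math.floor(windowMs / intervalMs);
}

const WINDOW_MS = 120_000; // 2 minutes
const TABS = 6;

const atFiveSeconds = pollsIssued(TABS, 5_000, WINDOW_MS); // 6 * 24 = 144
const atThirtySeconds = pollsIssued(TABS, 30_000, WINDOW_MS); // 6 * 4 = 24

console.log({ atFiveSeconds, atThirtySeconds });
```

Under these assumptions the slower interval issues one sixth as many health requests, which leaves correspondingly fewer of them competing for the browser's connection pool.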

Fixes CNS-83


SangJunBak · 2026-05-27T21:14:20Z

  const mergeEnvironments = useSetAtom(mergeEnvironmentsWithHealth);
  const cloudRegions = useAtomValue(cloudRegionsSelector);
  const appConfig = useAtomValue(appConfigAtom);
+  const subscribeHealth = useAtomValue(subscribeDerivedHealthAtom);


I do want to call out I think the region creation flow relies on polling fetchEnvironmentsWithHealth to transition from "booting" to "ready". I think 30 seconds is fine, but instead of relying on a subscribe, I wonder if it's just simpler to change the overall health check poll to 30s/1 minute and pass in 30_000 into usePollEnvironmentHealth

I was just concerned in the event that suddenly we stop polling for 30 seconds and then the console doesn't try to reconnect in 5 seconds so the user doesn't immediately see the console being unable to connect to environmentd. Not necessarily the worst but I thought it was a slight regression so I used a subscribe atom too.

My initial idea was to just use the subscribe atom but then in case the request is blocked, an active subscribe will not catch that unless the socket disconnects and connects again.

My initial idea was to just use the subscribe atom but then in case the request is blocked, an active subscribe will not catch that unless the socket disconnects and connects again.

Yeah I don't think this will work given we rely on the API call to the region controller which you can't subscribe on for the following flow:

I do want to call out I think the region creation flow relies on polling fetchEnvironmentsWithHealth to transition from "booting" to "ready".

Regarding:

I was just concerned in the event that suddenly we stop polling for 30 seconds and then the console doesn't try to reconnect in 5 seconds so the user doesn't immediately see the console being unable to connect to environmentd.

Yeah but how this actually displays in the Console is the "environment not ready" toast which we already deafen. All other queries need to connect to environmentd anyways so they don't actually require this health check to operate correctly.

@SangJunBak doesn't that work if we just poll for the initial connection? Also, I think it would be better to do this by fetching the status of the environment from the region API rather than waiting for a subscribe to work, right?

SangJunBak · 2026-05-27T21:14:46Z

Left a comment. Lemme know what you think!

leedqin requested a review from a team as a code owner May 27, 2026 20:32

leedqin added the A-CONSOLE Area: Console label May 27, 2026

leedqin requested review from SangJunBak and jubrad and removed request for a team May 27, 2026 20:32

SangJunBak reviewed May 27, 2026

console: add subscribe health atom #36760

leedqin wants to merge 1 commit into MaterializeInc:main from leedqin:subscribe-health-check-atom

leedqin commented May 27, 2026

SangJunBak May 27, 2026

leedqin May 27, 2026

SangJunBak May 28, 2026

jubrad May 29, 2026

SangJunBak commented May 27, 2026

3 participants
