console: add subscribe health atom #36760
Environment health polls `SELECT mz_version()` on mz_catalog_server every 5s per tab. Once a global SUBSCRIBE is streaming, the environment is already proven reachable, so the poll is redundant.

Derive health from subscribe state (`subscribeDerivedHealthAtom`) and back the poll off to 30s while subscribes are healthy; keep 5s during bootstrap and when subscribes go down, so recovery and the crashed/blocked banner still surface.

This frees up the browser connection slots that the 5-second health poll was taking. Tested against a workload capture of a production analytics workload. In a 6-tab stress test, health checks reached around 45 concurrent in-flight requests within 2 minutes; this change brought that down to 7-9 concurrent requests. Worst-case request queue time showed one health request stuck for 240s behind a saturated pool; this change brought it down to 1 ms.
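The backoff rule described above can be sketched as a small pure function. This is only an illustration under stated assumptions, not the PR's actual code: the `SubscribeHealth` state names, the constants, and `healthPollIntervalMs` are hypothetical; only the 5s and 30s values and the three situations (bootstrap, healthy, down) come from the description.

```typescript
// Hypothetical state names: the description only distinguishes bootstrap,
// healthy subscribes, and subscribes that have gone down.
type SubscribeHealth = "bootstrapping" | "healthy" | "down";

const FAST_POLL_MS = 5_000; // bootstrap and recovery: keep the original 5s cadence
const SLOW_POLL_MS = 30_000; // subscribes are streaming: back off to 30s

// Returns the health poll interval to use for the given subscribe state.
function healthPollIntervalMs(subscribeHealth: SubscribeHealth): number {
  // Only a healthy subscribe proves the environment is reachable, so every
  // other state keeps the fast poll and failures still surface quickly.
  return subscribeHealth === "healthy" ? SLOW_POLL_MS : FAST_POLL_MS;
}
```

In the component, the value read from `subscribeDerivedHealthAtom` would presumably be mapped through a function like this and the result handed to the polling hook, assuming the hook accepts an interval.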
| const mergeEnvironments = useSetAtom(mergeEnvironmentsWithHealth); | ||
| const cloudRegions = useAtomValue(cloudRegionsSelector); | ||
| const appConfig = useAtomValue(appConfigAtom); | ||
| const subscribeHealth = useAtomValue(subscribeDerivedHealthAtom); |
I do want to call out I think the region creation flow relies on polling fetchEnvironmentsWithHealth to transition from "booting" to "ready". I think 30 seconds is fine, but instead of relying on a subscribe, I wonder if it's simpler to change the overall health check poll to 30s or 1 minute and pass 30_000 into usePollEnvironmentHealth.
I was just concerned in the event that suddenly we stop polling for 30 seconds and then the console doesn't try to reconnect in 5 seconds so the user doesn't immediately see the console being unable to connect to environmentd. Not necessarily the worst but I thought it was a slight regression so I used a subscribe atom too.
My initial idea was to just use the subscribe atom but then in case the request is blocked, an active subscribe will not catch that unless the socket disconnects and connects again.
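The reasoning for keeping both signals can be sketched as follows. This is a hypothetical illustration, not code from the PR: `PollResult`, `HealthSignals`, and `deriveEnvironmentHealth` are made-up names, and the rule that a failing poll overrides a streaming subscribe is an assumption drawn from the point that an open socket cannot reveal a blocked request.

```typescript
// Hypothetical result of the most recent `SELECT mz_version()` poll.
type PollResult = "healthy" | "blocked" | "crashed" | "unknown";

// Hypothetical pair of signals available to the console.
interface HealthSignals {
  subscribeStreaming: boolean; // a global SUBSCRIBE is currently delivering data
  lastPoll: PollResult; // outcome of the last health poll, if any
}

function deriveEnvironmentHealth(signals: HealthSignals): PollResult {
  // A failing poll always wins: an already-open subscribe socket will not
  // notice a blocked request until it disconnects and reconnects.
  if (signals.lastPoll === "blocked" || signals.lastPoll === "crashed") {
    return signals.lastPoll;
  }
  // Otherwise a streaming subscribe is enough to prove reachability.
  if (signals.subscribeStreaming) {
    return "healthy";
  }
  // No subscribe signal: fall back to whatever the poll last reported.
  return signals.lastPoll;
}
```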
> My initial idea was to just use the subscribe atom but then in case the request is blocked, an active subscribe will not catch that unless the socket disconnects and connects again.

Yeah I don't think this will work given we rely on the API call to the region controller which you can't subscribe on for the following flow:

> I do want to call out I think the region creation flow relies on polling fetchEnvironmentsWithHealth to transition from "booting" to "ready".

Regarding:

> I was just concerned in the event that suddenly we stop polling for 30 seconds and then the console doesn't try to reconnect in 5 seconds so the user doesn't immediately see the console being unable to connect to environmentd.

Yeah but how this actually displays in the Console is the "environment not ready" toast which we already deafen. All other queries need to connect to environmentd anyways so they don't actually require this health check to operate correctly.
@SangJunBak doesn't that work if we just poll for the initial connection? Also, I think it would be better to do this by fetching the status of the environment from the region API rather than waiting for a subscribe to work, right?
Left a comment. Lemme know what you think!
Fixes CNS-83