[fix][proxy] Avoid blocking the proxy IO thread on a cold broker cache#26052

Merged
lhotari merged 1 commit into apache:master from merlimat:mmerli/fix-proxy-available-brokers-blocking-io
Jun 18, 2026

Conversation

@merlimat

Motivation

MetadataStoreCacheLoader.getAvailableBrokers() returns the cached active-broker list, but on a cold/empty cache it fell back to a synchronous getChildren() metadata read (getChildrenAsync(...).get(operationTimeoutMs)). Its clean List signature hid that it could block.

It is reached on the proxy's Netty IO thread via ProxyConnection.completeConnect() (when checkActiveBrokers is enabled) → isBrokerActive() → getAvailableBrokers(), and also from BrokerDiscoveryProvider.nextBroker(). A cold cache could therefore stall a proxy IO thread on a metadata round-trip during connection setup.
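
To make the hazard concrete, here is a minimal sketch of the old shape, not the actual Pulsar code. `BlockingFallbackDemo`, its method names, and the never-completing future are hypothetical stand-ins; the only part taken from the description above is the pattern of calling `get` with a timeout on an async read behind a plain `List` return type.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical illustration of a List-returning getter that hides a blocking wait. */
class BlockingFallbackDemo {

    /**
     * Returns the cached list if it has entries; otherwise waits on the
     * (assumed) async metadata read for up to operationTimeoutMs.
     */
    static List<String> getAvailableBrokersBlocking(
            List<String> cached,
            CompletableFuture<List<String>> metadataRead,
            long operationTimeoutMs) throws Exception {
        if (!cached.isEmpty()) {
            return cached;
        }
        // The calling thread parks here until the read completes or times out.
        return metadataRead.get(operationTimeoutMs, TimeUnit.MILLISECONDS);
    }

    /** Measures how long the caller is held when the cache is cold and the read is slow. */
    static long measureStallMillis(long operationTimeoutMs) throws Exception {
        // Stand-in for a metadata read that has not finished yet.
        CompletableFuture<List<String>> neverCompletes = new CompletableFuture<>();
        long start = System.nanoTime();
        try {
            getAvailableBrokersBlocking(List.of(), neverCompletes, operationTimeoutMs);
        } catch (TimeoutException expected) {
            // The read did not finish within the timeout.
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }
}
```

If the caller is an event-loop thread, every connection multiplexed on that thread waits for the same duration, which is why the blocking fallback matters more on the proxy IO path than its signature suggests.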

Modifications

getAvailableBrokers() no longer performs a synchronous metadata read. It returns the snapshot maintained by init() and the metadata-store children/session listeners, and — only when the snapshot is empty — triggers a single background refresh (guarded by an AtomicBoolean so a burst of calls during an empty-cache window does not queue many reads). The former tryUpdate lambda is extracted into a reloadBrokers() method shared by the listeners, the initial fetch, and the background refresh.
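
The guarded refresh described above can be sketched roughly as follows. This is an illustration under assumptions, not the merged code: `BrokerSnapshotCache`, the `fetchBrokers` supplier (standing in for the metadata store's async children read), and the `String` broker entries are all hypothetical. Only the names `getAvailableBrokers()` and `reloadBrokers()` and the use of an `AtomicBoolean` guard come from the description.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

/** Hypothetical stand-in for the cache loader; not the actual Pulsar class. */
class BrokerSnapshotCache {
    // Assumed async source, standing in for the metadata store's children read.
    private final Supplier<CompletableFuture<List<String>>> fetchBrokers;
    // Snapshot served to callers; replaced wholesale on every reload.
    private volatile List<String> availableBrokers = List.of();
    // True while a background refresh triggered by an empty snapshot is in flight.
    private final AtomicBoolean refreshInProgress = new AtomicBoolean(false);

    BrokerSnapshotCache(Supplier<CompletableFuture<List<String>>> fetchBrokers) {
        this.fetchBrokers = fetchBrokers;
    }

    /** Never blocks: returns the snapshot and starts at most one refresh if it is empty. */
    List<String> getAvailableBrokers() {
        List<String> snapshot = availableBrokers;
        if (snapshot.isEmpty() && refreshInProgress.compareAndSet(false, true)) {
            CompletableFuture<Void> reload;
            try {
                reload = reloadBrokers();
            } catch (RuntimeException e) {
                // A synchronous failure must still release the guard below.
                reload = CompletableFuture.failedFuture(e);
            }
            reload.whenComplete((ignored, error) -> refreshInProgress.set(false));
        }
        return snapshot;
    }

    /** Shared reload path: listeners, the initial fetch, and the refresh would all call this. */
    CompletableFuture<Void> reloadBrokers() {
        return fetchBrokers.get()
                .thenAccept(brokers -> availableBrokers = List.copyOf(brokers));
    }
}
```

The `compareAndSet(false, true)` call is what collapses a burst of callers into one read: only the caller that flips the flag starts a reload, and the flag is cleared when that reload finishes, whether it succeeded or failed.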

The fail-fast isBrokerActive check (which already expects the client to retry after a back-off) now sees the current snapshot instead of blocking. An empty snapshot means there are genuinely no active brokers, since the cache is populated by init() and kept current by the listener.
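
As a rough sketch of what "sees the current snapshot" means for the caller, the check reduces to an in-memory lookup. `ActiveBrokerCheck`, the snapshot supplier, and the host:port string format are hypothetical; the real check may compare different broker fields.

```java
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical fail-fast check backed by a non-blocking snapshot. */
class ActiveBrokerCheck {
    // Assumed to return the current snapshot without any I/O.
    private final Supplier<List<String>> snapshot;

    ActiveBrokerCheck(Supplier<List<String>> snapshot) {
        this.snapshot = snapshot;
    }

    /** Pure in-memory lookup; a false result lets the caller reject and the client retry. */
    boolean isBrokerActive(String brokerHostPort) {
        return snapshot.get().contains(brokerHostPort);
    }
}
```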

Verifying this change

Added MetadataStoreCacheLoaderTest, which verifies that getAvailableBrokers() serves the cache without ever invoking the synchronous getChildren(...), and does not block on an empty cache.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

MetadataStoreCacheLoader.getAvailableBrokers() returned the cached active-broker
list, but on a cold/empty cache it fell back to a synchronous getChildren()
metadata read (getChildrenAsync(...).get(timeout)). Its clean List signature hid
that it could block. It is reached on the proxy Netty IO thread via
ProxyConnection.completeConnect() (when checkActiveBrokers is enabled) ->
isBrokerActive() -> getAvailableBrokers(), and from BrokerDiscoveryProvider
.nextBroker(). A cold cache could therefore stall a proxy IO thread on a metadata
round-trip during connection setup.

getAvailableBrokers() no longer performs a synchronous metadata read. It returns
the snapshot maintained by init() and the metadata-store children/session
listeners, and only when the snapshot is empty triggers a single background
refresh (guarded by an AtomicBoolean so a burst of calls does not queue many
reads). The former tryUpdate lambda is extracted into reloadBrokers(), shared by
the listeners, the initial fetch, and the background refresh.

The fail-fast isBrokerActive check (which already expects the client to retry on a
back-off) now sees the current snapshot instead of blocking; an empty snapshot
means there are genuinely no active brokers, since the cache is populated by
init() and kept current by the listener.
@lhotari lhotari merged commit 1d7ae01 into apache:master Jun 18, 2026
44 checks passed
@lhotari lhotari added this to the 5.0.0-M2 milestone Jun 18, 2026