Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
25 commits
Select commit Hold shift + click to select a range
adb81f2
chore: rationalize SessionImpl field visibility
igorbernstein2 May 27, 2026
0854204
chore: migrate pool scheduling to a hashed-wheel timer
igorbernstein2 Jun 12, 2026
54c287e
chore: introduce VOperation as the head of the middleware chain
igorbernstein2 Jun 12, 2026
c88862a
chore: introduce OpExecutor; plumb VRpcCallContext.getExecutor
igorbernstein2 Jun 12, 2026
8f99778
chore: move callback dispatch from RetryingVRpc to VRpcImpl
igorbernstein2 Jun 13, 2026
2105669
fix: arm heartbeat tick only while a vRPC is in flight
igorbernstein2 Jun 17, 2026
62af875
chore: add session SynchronizationContext alongside the existing lock
igorbernstein2 Jun 14, 2026
06151ea
chore: make startRpc and cancelRpc async via sessionSyncContext
igorbernstein2 Jun 14, 2026
3873aa7
chore: remove synchronized(lock) from SessionImpl
igorbernstein2 May 28, 2026
0f6a9e4
chore: abort session on uncaught exception in sessionSyncContext
igorbernstein2 Jun 4, 2026
df61652
chore: isolate user-callback executor on a cached thread pool
igorbernstein2 Jun 14, 2026
fc9aafd
chore: back OpExecutor with SerializingExecutor(userCallbackExecutor)
igorbernstein2 Jun 14, 2026
b32ff78
chore: configure gRPC session streams with DirectExecutor
igorbernstein2 May 28, 2026
8ba415d
test: add 30-second @Timeout to all session/pool integration tests
igorbernstein2 May 27, 2026
739f0e3
chore: route PendingVRpc per-op state through the op executor
igorbernstein2 May 28, 2026
62ba853
chore: consolidate cancel trampolines at VOperationImpl
igorbernstein2 Jun 14, 2026
bb3fd91
chore: replace OpExecutor backing with an inline-capable queue; tight…
igorbernstein2 Jun 14, 2026
3c521d7
fix: drain SessionPools before tearing down userCallbackExecutor on C…
igorbernstein2 Jun 15, 2026
d6992c3
fix: deliver terminal onClose to Scheduled retries pending at Client.…
igorbernstein2 Jun 15, 2026
e9ad19f
fix: serialize open*/close so racing opens cannot create orphan pools
igorbernstein2 Jun 15, 2026
1ffe8b6
fix: ShimImpl uses shutdownAndAwait for userCallbackExecutor
igorbernstein2 Jun 15, 2026
49dd066
fix: lift shim loader throws to failed futures via SessionPoolMap
igorbernstein2 Jun 16, 2026
40dab94
refactor: add State.onExit and localize per-attempt tracer pairing to…
igorbernstein2 Jun 16, 2026
c212e84
refactor: replace CleanupListener.closed flag with chain.isDone()
igorbernstein2 Jun 16, 2026
fe812c6
docs: capture deferred review findings and design rationale
igorbernstein2 Jun 16, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
59 changes: 59 additions & 0 deletions java-bigtable/TODO.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,59 @@
# java-bigtable — deferred work

## Three fallback closeReason synthesizers in SessionImpl

**File:** `google-cloud-bigtable/src/main/java/com/google/cloud/bigtable/data/v2/internal/session/SessionImpl.java`
**Symptom:** the invariant "a session reaching CLOSED carries a closeReason" is enforced by belt-and-suspenders synthesizers at three sites — `abortFromUncaughtException` (overwrites unconditionally), `startGracefulClose` (logs warning + synthesizes if null), `notifyTerminalClose` (added in `bfa99cd0d9e` as last-resort guard). Each was added in response to a discovered miss. A new code path that transitions toward CLOSED must remember to set `closeReason` or rely on a downstream synthesizer firing.

**Fix sketch:** couple setter to transition. Add an overload `updateState(SessionState newState, CloseSessionRequest closeReason)` for terminal-phase transitions that requires the reason as an argument. The plain `updateState(newState)` overload stays for NEW→STARTING→READY. The three synthesizers go away.

**Risk:** a bug that misses the setter becomes a loud `IllegalArgumentException` instead of silent metric corruption. Probably better but it's a behavior shift — any "should never happen" path that hit a synthesizer today would now crash.

## drainedFuture completion duplicated across two sites

**File:** `google-cloud-bigtable/src/main/java/com/google/cloud/bigtable/data/v2/internal/session/SessionPoolImpl.java`
**Symptom:** `drainedFuture.complete(null)` fires from two unrelated sites — `close()` (line 308) when the pool was already empty at close time, and `onSessionClose` (line 538-540) when the last session drains while `poolState == CLOSED`. Both re-check `sessions.getAllSessions().isEmpty()`. A future code path that removes the last session by any other route (admin force-drain, AFE shutdown, per-session abort that bypasses the standard listener chain) leaves `drainedFuture` uncompleted — `Client.close` hangs for the full `POOL_DRAIN_TIMEOUT` (6 min) per pool with no log indication.

**Fix sketch:** derive the completion from a single state event — e.g., a `maybeCompleteDrain()` helper called from every site that transitions a session out of the pool, with one check `poolState == CLOSED && sessions.getAllSessions().isEmpty()`. Removes the duplicated check and gives a single chokepoint to audit.

**Risk:** low. The condition lives in one place; future session-removal paths just need to call the helper.

## Client.shutdownAndAwait is a public-static helper in the wrong place

**File:** `google-cloud-bigtable/src/main/java/com/google/cloud/bigtable/data/v2/internal/api/Client.java`
**Symptom:** `shutdownAndAwait(ExecutorService)` is a public static helper on `Client` (line 394), duplicating Guava's `MoreExecutors.shutdownAndAwaitTermination` semantics. Promoted to public-static so `ShimImpl` (different package) can reuse it. The user-callback executor's lifecycle policy ("cached pool drained with 5s grace, then `shutdownNow`") belongs to the executor's owning type, not a free function on `Client` that every owner must remember to import.

**Fix sketch:** introduce a `UserCallbackExecutor` (or similar) type that wraps the cached `ExecutorService` and owns its lifecycle (`close()` does the shutdown-and-await dance). Both `Client.create` and `ShimImpl` construct one. `Client.shutdownAndAwait` goes away.

**Risk:** small ripple to construction sites. Reduces pressure to add more "shutdown subtleties" (configurable timeout, interrupt restoration variants, etc.) as more public statics on `Client`.

## RetryingVRpc.start guard relies on unenforced op-executor affinity

**File:** `google-cloud-bigtable/src/main/java/com/google/cloud/bigtable/data/v2/internal/middleware/RetryingVRpc.java`
**Symptom:** `started`, `isCancelling`, `currentState` are plain non-volatile fields. The comment claims "VOperationImpl trampolines every inbound call onto opExecutor, so no synchronization is needed here" — but `start()` and `cancel()` don't assert that contract. A caller that bypasses `VOperationImpl` (a test using `VRpcCallContext.create`'s 3-arg overload with a `t -> {}` swallowing handler, or any new direct consumer) gets torn reads and silent state corruption rather than a clear precondition failure. Added importance after the `chain.isDone()` change (#8 fix) — that field is also unsynchronized and now externally observable.

**Fix sketch:** add `context.getExecutor().throwIfNotInThisExecutor()` at the top of `start()` and `cancel()` (and `isDone()` if we want to be strict). Each becomes a one-line guard. Tests that exercise these methods directly must construct a real OpExecutor and trampoline through it, matching production usage.

**Risk:** test rewrites for any test that calls `RetryingVRpc.start()` directly without going through `VOperationImpl`. Check the existing `RetryingVRpcTest` and `VRpcTracerTest` for impacted call sites.

## Long-delay Scheduled retries are cancelled on close, not awaited

**File:** `google-cloud-bigtable/src/main/java/com/google/cloud/bigtable/data/v2/internal/middleware/RetryingVRpc.java` + `Client.java`
**Symptom:** Today (post `4a80a8284fb`) a `Scheduled` retry pending at `Client.close` is driven to a CANCELLED Done via the `BigtableTimer.onStop` hook. That delivers a terminal to the user — but it abandons whatever the retry was waiting for. A truly graceful shutdown would let the retry complete naturally if the deadline still has room. We explicitly chose cancel-on-close because per-op tracking at the Client level was rejected as too heavy.

**Fix sketch (if requirements change):** add a Client-level (or per-pool) registry of in-flight `VOperation`s. `Client.close` Phase 2 would extend `drainedFuture` to also wait for in-flight ops to terminate, bounded by `POOL_DRAIN_TIMEOUT`. Op registry maintained by `VOperationImpl.start` / `Done.onStart`.

**Risk:** real structural change — every `VOperation` registers/unregisters; Client gains a new responsibility. Only worth it if a customer reports the cancel-on-close behavior as wrong. Today the cancel path delivers a clean terminal, which is sufficient for shutdown correctness.

## Idle session heartbeat ticks burn CPU

**File:** `google-cloud-bigtable/src/main/java/com/google/cloud/bigtable/data/v2/internal/session/SessionImpl.java`
**Symptom:** `checkHeartbeat` re-arms a 100 ms wheel tick on every fire, regardless of whether a vRPC is in flight. For idle sessions, `nextHeartbeat` sits at `now + FUTURE_TIME` (30 min) but the tick still fires 10×/sec, syncs onto `sessionSyncContext`, compares, and re-arms. 100 idle sessions = ~1,000 wakeups/sec per Client of pure background noise.

**Fix sketch:** couple heartbeat arming to `currentRpc` lifecycle.
- In `startRpc` (after setting `nextHeartbeat = now + heartbeatInterval`): arm `scheduleHeartbeatCheck()` if `heartbeatTimeout == null`.
- In `handleVRpcResponse` / `handleVRpcErrorResponse` (after pushing `nextHeartbeat` to `FUTURE_TIME`): `cancelHeartbeatTimeout()`.
- In `checkHeartbeat`: re-arm only when `currentRpc != null`.
- Drop the unconditional `scheduleHeartbeatCheck()` from `handleOpenSessionResponse`.

**Risk:** heartbeat becomes per-vRPC instead of per-session. Future multiplexing or any state where `nextHeartbeat` is meaningful without `currentRpc != null` needs explicit arming. Stuck-session shutdown detection is already covered by the watchdog + `Client.POOL_DRAIN_TIMEOUT`.
Original file line number Diff line number Diff line change
Expand Up @@ -26,13 +26,14 @@
import com.google.cloud.bigtable.data.v2.internal.csm.Metrics;
import com.google.cloud.bigtable.data.v2.internal.csm.attributes.ClientInfo;
import com.google.cloud.bigtable.data.v2.internal.session.SessionPool;
import com.google.cloud.bigtable.data.v2.internal.session.BigtableTimer;
import com.google.cloud.bigtable.data.v2.internal.session.VRpcDescriptor;
import com.google.cloud.bigtable.data.v2.internal.util.ClientConfigurationManager;
import io.grpc.CallOptions;
import io.grpc.Deadline;
import java.io.Closeable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.Executor;

public class AuthorizedViewAsync implements AutoCloseable, Closeable {

Expand All @@ -48,7 +49,8 @@ static AuthorizedViewAsync createAndStart(
String viewId,
Permission permission,
Metrics metrics,
ScheduledExecutorService executorService) {
BigtableTimer timer,
Executor userCallbackExecutor) {

AuthorizedViewName viewName =
AuthorizedViewName.builder()
Expand Down Expand Up @@ -78,7 +80,8 @@ static AuthorizedViewAsync createAndStart(
callOptions,
viewName.toString(),
metrics,
executorService);
timer,
userCallbackExecutor);

return new AuthorizedViewAsync(base);
}
Expand Down
Loading
Loading