chore(gastown): promote gastown-staging to main #2974
Conversation
…e-dev git-token-service's wrangler env.dev overrides the worker name to 'git-token-service-dev', but gastown's env.dev.services binding was still referencing the base 'git-token-service' name. Wrangler's local dev registry does exact-name matching, so the binding showed as [not connected] whenever both workers were running side by side. Every other consumer in the repo (cloud-agent-next, security-sync, security-auto-analysis) already uses 'git-token-service-dev' in their env.dev block; gastown was the outlier.
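The consumer-side shape of that fix might look like the following `wrangler.jsonc` sketch. This is illustrative, not copied from the repo: the binding name and surrounding layout are assumptions, but it shows why the `env.dev` block must reference the overridden dev worker name for exact-name matching to succeed.

```jsonc
// Hypothetical sketch of gastown's wrangler.jsonc after the fix.
// The production binding (top-level "services") is untouched; only env.dev changes.
{
  "services": [
    { "binding": "GIT_TOKEN_SERVICE", "service": "git-token-service" }
  ],
  "env": {
    "dev": {
      "services": [
        // Must match the *overridden* dev worker name exactly, because
        // wrangler's local dev registry does exact-name matching.
        { "binding": "GIT_TOKEN_SERVICE", "service": "git-token-service-dev" }
      ]
    }
  }
}
```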
Code Review Summary
Status: 2 Issues Found | Recommendation: Address before merge
Overview
Issue Details
WARNING
Other Observations (not in diff)
Issues found in unchanged code that cannot receive inline comments:
Files Reviewed (23 files)
Reviewed by gpt-5.5-20260423 · 6,559,818 tokens
…2999) When a user changes the mayor's model in town settings, updateAgentModel restarts the SDK server with new KILO_CONFIG_CONTENT and resumes the existing session from kilo.db. Commit 9785570 intentionally stopped sending any session.prompt on resume to avoid duplicating the MAYOR_STARTUP_PROMPT, but that also dropped the model param, so the resumed session kept its prior per-session model until the user ran /model manually.

Extract the fresh vs. resumed session-prompt logic into applyModelToSession and, on resume, send a noReply: true prompt carrying only the new model param. This updates the SDK server's per-session model without replaying the startup prompt. Errors on the resume path are swallowed so the hot-swap still succeeds; the SDK server already fell back to the config-loaded model at startup, which was already updated. Add container tests covering both fresh and resumed paths.

Co-authored-by: John Fawcett <john@kilcoode.ai>
…vents (#3047) Two independent bugs compose to flood production logs every alarm tick with 'Bead <id> not found' errors:
1. deleteBead / deleteBeads did not clean up the town_events queue, leaving bead_cancelled and container_status rows pointing at deleted beads/agents.
2. applyEvent threw on missing beads and the drain loop never marked the failing event processed — so it retried forever.

Fix 1: purge town_events rows (by bead_id OR agent_id, since agents are beads) from deleteBead and the deleteBeads bulk path.
Fix 2a: reconciler.applyEvent('bead_cancelled') checks for the target bead up front and returns (with a warn) when it's missing, instead of throwing.
Fix 2b: the Town.do.ts drain loop recognises 'Bead/Agent <uuid> not found' terminal errors, logs them at warn, and marks the offending event processed so it stops retrying.

Adds debug RPCs (debugTownEvents, debugInsertTownEvent, debugRecordContainerStatus) and integration coverage in event-cleanup.test.ts.

Co-authored-by: John Fawcett <john@kilcoode.ai>
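Fix 2b can be sketched as a terminal-error check in the drain loop. This is a hypothetical reduction of the Town.do.ts logic; the function names (`isTerminalNotFound`, `drainEvent`) and the exact message shape are illustrative:

```typescript
// Assumed message shape: "Bead <uuid> not found" / "Agent <uuid> not found".
const NOT_FOUND_RE = /^(Bead|Agent) [0-9a-f-]{36} not found/;

export function isTerminalNotFound(err: unknown): boolean {
  return err instanceof Error && NOT_FOUND_RE.test(err.message);
}

export async function drainEvent(
  event: { id: string },
  apply: (e: { id: string }) => Promise<void>,
  markProcessed: (id: string) => Promise<void>,
  warn: (msg: string) => void,
): Promise<void> {
  try {
    await apply(event);
    await markProcessed(event.id);
  } catch (err) {
    if (isTerminalNotFound(err)) {
      // The target bead/agent was deleted; a retry can never succeed, so
      // log at warn and mark the event processed to stop the retry loop.
      warn(`dropping terminal event ${event.id}: ${(err as Error).message}`);
      await markProcessed(event.id);
      return;
    }
    throw err; // transient errors stay unprocessed and retry next tick
  }
}
```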
…#3055)

* feat(gastown-container): add crash visibility + per-agent start mutex

Diagnostic changes to investigate frequent container restarts for town 4d82f099-ccb7-4eaf-8676-73562e0a27eb (~1.5–2 min boot-hydration loops).
- main.ts: add unhandledRejection listener that logs full error/stack without exiting (Bun/Node silently drop rejections without a handler, making fire-and-forget failures like void saveDbSnapshot() / void subscribeToEvents() invisible). Include uptime and active-agent count for correlation.
- main.ts: improve uncaughtException log with name/uptime/agent count.
- main.ts: 30s periodic container.memory_usage log (rss/heap/external) so OOM-class failures (external SIGKILL from the Cloudflare Containers runtime when the memory ceiling is hit) become observable — these leave no exception behind.
- main.ts: wrap bootHydration() in try/catch so a rare synchronous throw before the first await doesn't crash the process.
- process-manager.ts: add a per-agentId mutex for startAgent. Production logs show two /agents/start requests for the same agentId logged at the same millisecond; both pass the re-entrancy check before either commits a 'starting' record, then race on startupAbortController, session creation, idle timers, and SDK sessionCount. Serialising per agentId makes the re-entrant path observe a consistent snapshot.
- process-manager.test.ts: three tests for the mutex — same-id serialisation, different-id concurrency, lock release on throw.

* fix(container): replace Promise.withResolvers with explicit new Promise

Promise.withResolvers is a newer API not available on older Bun runtimes. Since process-manager.ts is imported during container startup, a missing global would throw before crash handlers are registered and prevent the control server from starting. Use the same explicit new Promise pattern as the existing sdkServerLock.
* feat(gastown/container): include townId in crash and memory logs

Per review feedback, attach the container's GASTOWN_TOWN_ID to unhandled_rejection, uncaught_exception, cold_start, memory_usage, and boot_hydration_failed log entries so production crash logs can be correlated with a specific town without needing an agent to be registered.

---------

Co-authored-by: John Fawcett <john@kilcoode.ai>
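The per-agentId mutex with an explicit `new Promise` (instead of `Promise.withResolvers`) can be sketched like this. The names (`withAgentLock`, the `Map` of tails) are hypothetical; the key properties match the commit's tests: same-id serialisation, different-id concurrency, and lock release on throw:

```typescript
// Hypothetical per-agentId start mutex. Uses an explicit `new Promise`
// because Promise.withResolvers is unavailable on older Bun runtimes.
const locks = new Map<string, Promise<void>>();

export async function withAgentLock<T>(agentId: string, fn: () => Promise<T>): Promise<T> {
  // Chain behind whatever operation currently holds this agent's lock.
  const prev = locks.get(agentId) ?? Promise.resolve();
  let release!: () => void;
  const gate = new Promise<void>((resolve) => { release = resolve; });
  const tail = prev.then(() => gate);
  locks.set(agentId, tail);
  await prev; // wait for the previous holder to finish
  try {
    return await fn();
  } finally {
    release(); // released even when fn throws, so the lock cannot leak
    // Clean up only if no newer waiter has replaced our tail.
    if (locks.get(agentId) === tail) locks.delete(agentId);
  }
}
```

Different agentIds never share a tail, so they run concurrently; two calls for the same agentId observe each other's committed state because the second cannot enter `fn` until the first's `finally` has run.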
chore(gastown): fix format and lint CI failures on staging

Co-authored-by: John Fawcett <john@kilcoode.ai>
…dler (#3074)

Co-authored-by: John Fawcett <john@kilcoode.ai>
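After #3072 and #3074 together, the handler presumably ends up in roughly this shape: no `promise` parameter, with the rejection `reason` carrying the information. This is a hedged reconstruction, not the repo's actual code; `formatRejection` is an invented helper (logging the promise itself is what `no-base-to-string` flagged, since stringifying it yields nothing useful):

```typescript
// Hypothetical final shape of the unhandledRejection handler.
export function formatRejection(reason: unknown): string {
  // Prefer the stack for Error reasons; fall back to String() for primitives.
  return reason instanceof Error ? (reason.stack ?? reason.message) : String(reason);
}

process.on("unhandledRejection", (reason) => {
  // The unused `promise` parameter is dropped entirely; the reason field
  // already captures everything worth logging.
  console.error("unhandled_rejection", formatRejection(reason));
});
```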
* feat(gastown): add proactive idle-container stop in TownDO alarm

When a town has no active work and the mayor has been idle for >5min, the alarm handler now calls container.stop() to force a graceful SIGTERM drain instead of waiting for Cloudflare's port-idle timer (which gets reset by PTY WebSocket keep-alives). This targets the root cause of 300+ active containers for ~100 active users.
- Add stopContainerIfIdle() with dependency-injected logic in town/container-idle-stop.ts for testability
- Wire into the alarm handler just before the re-arm block
- Emit a container.idle_stop event with reason for observability
- 2min throttle prevents thrash; failed stops are retried next tick
- 13 unit tests covering all branches

* fix(gastown): remove non-null assertions from container-idle-stop

Replace mayor.last_activity_at! with null-safe mayor.last_activity_at != null guards, consistent with the coding style that forbids ! non-null assertions.

* fix: allow stopping healthy containers in idle-stop guard

The state guard only checked for 'running', but containers can also report 'healthy' as an active state (consistent with gastown.worker.ts). Added 'healthy' to the guard and a corresponding test.

---------

Co-authored-by: John Fawcett <john@kilcoode.ai>
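The idle-stop decision described above can be sketched as a pure, dependency-injected predicate. Field names and thresholds mirror the commit text (>5 min idle, 2 min throttle, running/healthy as active states), but the exact shape is an assumption:

```typescript
// Hypothetical reduction of the stopContainerIfIdle decision logic.
const IDLE_MS = 5 * 60_000;      // mayor must be idle this long
const THROTTLE_MS = 2 * 60_000;  // minimum gap between stop attempts

export interface IdleStopDeps {
  now: number;
  containerState: string;              // e.g. "running" | "healthy" | "stopped"
  activeBeadCount: number;             // active work in the town
  mayorLastActivityAt: number | null;  // null-safe: no ! assertions
  lastStopAttemptAt: number | null;    // for the 2-minute throttle
}

export function shouldStopContainer(d: IdleStopDeps): boolean {
  // 'healthy' counts as an active state alongside 'running'.
  if (d.containerState !== "running" && d.containerState !== "healthy") return false;
  if (d.activeBeadCount > 0) return false;
  if (d.mayorLastActivityAt == null) return false;
  if (d.now - d.mayorLastActivityAt < IDLE_MS) return false;
  // Throttle: failed stops are retried, but not more often than every 2 min.
  if (d.lastStopAttemptAt != null && d.now - d.lastStopAttemptAt < THROTTLE_MS) return false;
  return true;
}
```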
…#3119)

* feat(onboarding): redirect back to onboarding after GitHub app install

* fix: address PR review feedback — stable effect deps, parsed error state, URIError guard
- OnboardingStepRepo: use a stable refetch reference and a scalar param instead of the full query object to prevent duplicate toasts/refetches
- GitHub callback: parse the owner token from state in the error handler so the |return= suffix doesn't leak into org redirect URLs
- validate-return-path: catch URIError from malformed percent-encoding and treat it as an invalid return path (null) instead of throwing

---------

Co-authored-by: John Fawcett <john@kilcoode.ai>
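The validate-return-path behavior (same-origin internal path only; reject `//`, backslash, CRLF; treat malformed percent-encoding as invalid rather than throwing) might look like this sketch. The function name matches the file mentioned above, but the signature is assumed:

```typescript
// Hypothetical sketch of validate-return-path.
export function validateReturnPath(encoded: string): string | null {
  let path: string;
  try {
    path = decodeURIComponent(encoded);
  } catch (err) {
    if (err instanceof URIError) return null; // malformed %-encoding: invalid, not fatal
    throw err;
  }
  if (!path.startsWith("/")) return null;  // must be an internal path
  if (path.startsWith("//")) return null;  // protocol-relative escape to another origin
  if (path.includes("\\")) return null;    // backslash path tricks
  if (/[\r\n]/.test(path)) return null;    // CRLF header-injection characters
  return path;
}
```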
const installState = orgId ? `org_${orgId}` : `user_${user?.id}`;
const installUrl = `https://github.com/apps/${githubAppName}/installations/new?state=${installState}`;
window.open(installUrl, '_blank', 'noopener');
const owner = orgId ? `org_${orgId}` : `user_${user?.id}`;
WARNING: GitHub install can start with an undefined user id
For personal onboarding, this builds user_${user?.id} even while useUser() is still loading. If the user clicks Install before the user query resolves, GitHub returns with state=user_undefined, and the callback rejects it because user.id !== 'undefined', sending the user back to / instead of completing installation. Gate the button/handler on user?.id for the personal path, or derive the owner id from already-authenticated server state.
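One way to implement the guard the review asks for is to refuse to build the state string until the owner id is known, and disable the button while it is `null`. This is a sketch of the suggestion, not the repo's code; `buildInstallState` is an invented helper:

```typescript
// Hypothetical guard: never emit `user_undefined` into the GitHub state param.
export function buildInstallState(orgId: string | null, userId: string | undefined): string | null {
  if (orgId) return `org_${orgId}`;
  // Personal path: if the user query is still loading, signal "not ready"
  // so the caller can disable the Install button instead of opening GitHub.
  if (!userId) return null;
  return `user_${userId}`;
}
```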
* feat(gastown): prewarm mayor SDK server in bootHydration

Add mayor SDK server prewarming to bootHydration so the mayor's kilo serve instance is already running when the user's first /agents/start arrives after a container restart. Previously, the mayor was only resumed if it was in the registry (running/starting at shutdown), but idle-stop and stream-error teardowns leave the mayor unregistered.
- Export mayorWorkdirForTown() from agent-runner.ts
- Add prewarmMayorSDK() to process-manager.ts that fetches the mayor agent ID from a new worker endpoint, hydrates kilo.db from the KV snapshot, and starts the SDK server
- Add GET /api/towns/:townId/mayor-id endpoint to gastown.worker.ts (uses authMiddleware like container-registry/db-snapshot)
- Add getMayorAgentId() RPC method to Town.do.ts
- Add warm-cache detection in startAgentImpl: log phaseMs: 0 and prewarmed: true when the SDK instance was already cached
- bootHydration no longer returns early on an empty registry, so the mayor prewarm always runs

* feat(gastown): seed getMayorStatus cache from ensureMayor response

Instead of only invalidating the getMayorStatus query after ensureMayor succeeds (which forces a 3s polling wait before useXtermPty can start connecting), seed the React Query cache directly from the mutation result. The agentId and sessionStatus are already available in the ensureMayor response, so the terminal can begin connecting within ~50ms instead of waiting for the next poll tick. Still invalidate after seeding so the next poll catches up to authoritative state.

* fix(gastown): detect torn-down SDK in _ensureMayor short-circuit

When the container reports the mayor as 'running'/'starting' but the SDK instance has no serverPort or sessionId (torn down after stream errors or drain), _ensureMayor now falls through to a fresh dispatch instead of returning early. This eliminates the 'refresh fixes it' class of failures where the PTY gets a 404 because there's no SDK port to attach to. Also extend checkAgentContainerStatus to surface serverPort and sessionId from the container's agent status response.

* feat(gastown): add AE telemetry events for mayor startup optimization

Add three Analytics Engine event streams to measure the impact of the mayor startup optimizations:
1. agent.startup_phase — emitted for db_hydrated, sdk_ready, and session_created phases. Includes elapsedMs and phaseMs so we can P50/P95 per-town. The sdk_ready event includes phaseMs: 0 when the SDK was prewarmed (warm-cache hit).
2. mayor.prewarm_complete — emitted when the mayor SDK server is prewarmed during bootHydration, with durationMs.
3. mayor.ensure_decision — tracks the _ensureMayor decision tree: short_circuit_warm, short_circuit_idle, sdk_dead_redispatch, or fresh_dispatch. Measures the rate of the SDK-dead case that Change 3 fixes.

Container-side events are proxied to AE via a new POST /api/towns/:townId/container-events worker endpoint, since the container can't call writeEvent directly.

* test(gastown): add integration test for torn-down-SDK fall-through

Test that _ensureMayor falls through when the container status doesn't indicate a live SDK (no serverPort or sessionId). Covers:
1. Container not available in test env (baseline behavior)
2. sdkAlive validation logic: zero port, empty session, valid values
3. checkAgentContainerStatus returns 404 for unknown agents

* fix(gastown): set per-agent KILO_TEST_HOME/XDG_DATA_HOME in prewarm env

The prewarm function was copying KILO_TEST_HOME and XDG_DATA_HOME from process.env, but those are typically absent at the container level. Normal agent startup sets them per-agent via buildAgentEnv(). Without these, the prewarmed SDK server boots against the default data directory and bypasses the hydrated kilo.db snapshot. Now buildPrewarmEnv() sets KILO_TEST_HOME and XDG_DATA_HOME based on the mayorAgentId, matching what buildAgentEnv() does for regular agents.

* fix(gastown): generate KILO_CONFIG_CONTENT in prewarm env and handle config mismatch on cache hit

Prewarm now generates KILO_CONFIG_CONTENT/OPENCODE_CONFIG_CONTENT using buildKiloConfigContent() with the kilocode token and default models instead of copying them from process.env (where they're absent on cold start). When ensureSDKServer() finds a cached instance whose config differs from the incoming env, it evicts the old server and creates a new one so the SDK picks up the correct model/provider config. Also extracts PERSIST_ENV_KEYS to module level and updates process.env for those keys on cache hit when configs match.

---------

Co-authored-by: John Fawcett <john@kilcoode.ai>
return;
}

const workdir = mayorWorkdirForTown(townId);
WARNING: Mayor prewarm can fail before creating the workspace
prewarmMayorSDK calls ensureSDKServer with this derived workdir, but the normal mayor startup path first creates and git-initializes that directory via createMayorWorkspace / createLightweightWorkspace. On a cold container where /workspace/rigs/mayor-${townId}/mayor-workspace does not exist yet, ensureSDKServer will hit process.chdir(workdir) and throw, so boot hydration logs Mayor SDK prewarm failed and the optimization never takes effect. Create/init the mayor workspace before starting the SDK server, or reuse the normal mayor workspace helper.
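One fix matching the review's suggestion is to create and initialize the workspace before handing the path to ensureSDKServer. This is a sketch under assumed names: `ensureWorkdir` is invented, and the `gitInit` callback stands in for whatever createMayorWorkspace / createLightweightWorkspace actually do:

```typescript
import { existsSync, mkdirSync } from "node:fs";

// Hypothetical guard run before ensureSDKServer chdirs into the workdir.
export function ensureWorkdir(workdir: string, gitInit: (dir: string) => void): string {
  if (!existsSync(workdir)) {
    mkdirSync(workdir, { recursive: true });
    // Mirror the normal mayor startup path, which git-initializes the
    // workspace before the SDK server ever runs in it.
    gitInit(workdir);
  }
  return workdir;
}
```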
Summary

Batch promote of gastown-staging to main. Eight commits since the last promotion, addressing a mix of bug fixes, observability improvements, a new feature, and CI plumbing: 21a14d04, c532f40e, db584066, 47e95a8b, 0ebac0f1, e20d04f7, 62e1f135, 47933737.

Constituent changes
fix(gastown): point dev GIT_TOKEN_SERVICE binding at git-token-service-dev (21a14d04, no PR — direct push)

Local-dev binding fix. services/git-token-service/wrangler.jsonc overrides its worker name to git-token-service-dev in env.dev, but gastown's env.dev.services binding still referenced the base git-token-service name. Wrangler's local dev registry does exact-name matching, so the binding showed [not connected] whenever both workers ran side by side. The production binding (top-level services) is untouched; only the env.dev block was changed.

fix(gastown): push new model onto resumed mayor session on hot-swap (#2999)
Regression where changing the mayor's model in town settings persisted the config but the running mayor kept using the previous model — users had to run /model manually. Root cause: a prior fix (9785570b9) skipped the session.prompt(...) call on resumed sessions to avoid a duplicate startup turn, but that call was also responsible for pushing the new model field onto the session. Fix sends a session.prompt with noReply: true and the new model on resume, so the model updates without injecting a synthetic user turn.

fix(gastown): stop reconciler log spam from orphaned bead_cancelled events (#3047)

Production logs were filling with reconciler: applyEvent failed … Bead <id> not found repeating every alarm tick forever. Two cooperating bugs: (1) deleteBead cascaded cleanup to satellite tables but not the town_events reconciler queue, leaving orphan events; (2) the drain loop in Town.do.ts intentionally never marked failed events as processed, expecting retries to eventually succeed — but for a deleted bead they never can. Fix cleans up town_events in deleteBead/deleteBeads, pre-checks bead existence in applyEvent, and classifies Bead/Agent … not found errors as terminal in the drain loop.

feat(gastown-container): add crash visibility + per-agent start mutex (#3055)
Investigation hooks for repeated container restarts seen on a specific town (~1.5–2 min cadence). Adds an unhandledRejection listener with full stack logging (no process.exit), periodic RSS memory logging via the heartbeat path, and a per-agentId mutex in startAgent that fixes a real concurrency race exposed by duplicate /agents/start log lines arriving in the same millisecond for the same mayor.

chore(gastown): fix format and lint CI failures (#3072)
Plumbing fix for this batch PR. services/gastown/src/dos/Town.do.ts and services/gastown/test/integration/event-cleanup.test.ts needed oxfmt formatting; and String(promise) in services/gastown/container/src/main.ts (added by #3055's unhandledRejection handler) tripped no-base-to-string because a Promise stringifies to [object Object]. Replaced with a string marker in the log payload.

chore(gastown): drop unused promise param from unhandledRejection handler (#3074)
Follow-up to #3072. Removing the String(promise) call left the promise parameter declared but unused, tripping a different lint rule. Drops the parameter entirely; the rejection reason is already captured in the reason field.

feat(gastown): add proactive idle-container stop in TownDO alarm (#3113)
When a town has no active work and the mayor has been idle for >5 minutes, the alarm now calls container.stop() to force a graceful SIGTERM drain instead of waiting for Cloudflare's port-idle timer. PTY WebSocket keep-alives from open dashboard terminals were resetting the port-idle timer indefinitely, leaving 300+ active containers running for ~100 active users. Targets the root cause of that container-count bloat.

fix(onboarding): redirect back to onboarding after GitHub app install (#3119)
When a new gastown user without the Kilo GitHub app clicked "Install GitHub App" during the repo step of onboarding, the GitHub callback always redirected to /integrations/github?success=installed, dropping the user out of the wizard. Extends the GitHub install URL's state parameter with an optional |return=<urlencoded-path> suffix; the callback validates it (same-origin internal path; rejects //, backslash, CRLF) and redirects there with ?github_install=success when present. Onboarding now resumes on the repo step with the newly-installed repos available.

Verification
Each change was verified on gastown-staging independently — see PR links above for per-change verification details. 21a14d04 was manually verified locally (wrangler dev --env dev for both workers; the binding now connects).
Touched areas: services/gastown (container + DO/worker code), apps/web/src/app/(app)/gastown/... (onboarding wizard), apps/web/src/app/api/integrations/github/callback/route.ts (install callback), and the env.dev block of services/gastown/wrangler.jsonc. Installs without a |return=… suffix fall through to the existing default redirect.