Commit c8228e3

jahooma and claude committed
Export wasm chunks as a function so the bundler can't inline them away
Round 4 (chunked array literals) still failed on Windows: the build's own verification step caught the first chunk missing from the compiled binary. So either:

- Bun's bundler reads tree-sitter-wasm-bytes.ts at static-analysis time, sees `export const X = []` (the committed stub), inlines `X` into pre-init's call sites, then DCEs the conditional branch that would have referenced the chunks. Whatever my embed script wrote later is treated as unused and dropped.
- OR the file write doesn't propagate to disk before bun reads it on Windows.

Switch the export from `const` to a function. Function return values aren't statically inlinable: the bundler can't substitute a literal empty array at the call site. The chunks live inside the function body, only materialized when the pre-init calls `getTreeSitterWasmChunks()`.

Add a sanity re-read after writing the embed file: if NTFS buffers the write and bun reads the stale stub, the embed step itself fails *during the build*, with a clear "wrote N chunks but re-read does not contain chunk[0]" message, instead of letting the build silently produce a broken artifact.

Verified locally: build embeds 268 chunks, post-build verifies 3 chunks in the compiled binary, --smoke-tree-sitter exits 0, boot smoke passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
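The two export shapes the message contrasts can be sketched as follows. This is a minimal illustration, not the repository's code: `CHUNKS_AS_CONST`, `getChunks`, and `decodeEmbedded` are hypothetical names, and the `export` keywords are left off so the snippet runs as a single file. Whether a given bundler actually folds the constant is the commit's hypothesis, not something this sketch proves.

```typescript
import { Buffer } from 'node:buffer'

// Shape 1: a top-level constant. With the committed stub this is `[]`, a value
// that is visible to a bundler at static-analysis time.
const CHUNKS_AS_CONST: readonly string[] = []

// Shape 2: a function. The array literal lives in the body and is only
// created when a caller invokes the function at runtime.
function getChunks(): string[] {
  return []
}

// Hypothetical consumer: join and decode only when something was embedded.
function decodeEmbedded(provider: () => string[]): Uint8Array | undefined {
  const chunks = provider()
  if (chunks.length === 0) return undefined
  // Joining before decoding means chunk boundaries need no base64 alignment.
  return new Uint8Array(Buffer.from(chunks.join(''), 'base64'))
}

const fromStub = decodeEmbedded(getChunks)
const fromFake = decodeEmbedded(() => ['aGVs', 'bG8='])
console.log(CHUNKS_AS_CONST.length, fromStub, fromFake?.length)
```

With the stub in place the consumer takes the empty path, which matches the dev-mode behaviour the commit describes; only the generated build-time body returns real chunks.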
1 parent b0dc5de commit c8228e3

3 files changed

Lines changed: 68 additions & 43 deletions

cli/scripts/build-binary.ts

Lines changed: 19 additions & 5 deletions
@@ -294,13 +294,27 @@ function embedTreeSitterWasmAsChunks(): {
 
   const generated =
     `// AUTO-GENERATED by cli/scripts/build-binary.ts during \`bun build --compile\`.\n` +
-    `// Restored to the empty stub after the build finishes — do not commit a\n` +
-    `// non-empty value here.\n` +
-    `export const TREE_SITTER_WASM_BASE64_CHUNKS: readonly string[] = [\n` +
-    chunks.map((c) => ` ${JSON.stringify(c)},`).join('\n') +
-    `\n]\n`
+    `// Restored to an empty function after the build finishes — do not commit a\n` +
+    `// non-empty body here.\n` +
+    `export function getTreeSitterWasmChunks(): string[] {\n` +
+    ` return [\n` +
+    chunks.map((c) => ` ${JSON.stringify(c)},`).join('\n') +
+    `\n ]\n` +
+    `}\n`
 
   writeFileSync(stubPath, generated)
+  // Re-read what we just wrote so we can fail loudly if the OS buffered
+  // the write. On Windows, NTFS writes can lag, and bun --compile would
+  // then read the stale stub. Verifying here means the build fails
+  // *during embed* instead of producing a broken binary that surprises
+  // us later.
+  const onDisk = readFileSync(stubPath, 'utf8')
+  if (!onDisk.includes(chunks[0]!)) {
+    throw new Error(
+      `Embed wrote ${chunks.length} chunks but re-read of ${stubPath} ` +
+        `does not contain chunk[0]. File on disk: ${onDisk.slice(0, 200)}…`,
+    )
+  }
   logAlways(
     `Embedded tree-sitter.wasm from ${sourceWasm} (${wasmBytes.length} bytes → ${chunks.length} chunks of ~${CHUNK_SIZE} chars).`,
   )
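The write-then-re-read check in this hunk can be isolated into a small helper. The sketch below is an assumption-laden illustration: `writeAndVerify`, the temp directory, and the sample file contents are hypothetical, and only `writeFileSync`, `readFileSync`, `mkdtempSync`, `tmpdir`, and `join` are real Node APIs.

```typescript
import { mkdtempSync, readFileSync, writeFileSync } from 'node:fs'
import { tmpdir } from 'node:os'
import { join } from 'node:path'

// Hypothetical helper: write generated text, re-read it, and fail loudly if
// the expected marker (for example the first chunk) is not in what came back.
function writeAndVerify(path: string, contents: string, marker: string): void {
  writeFileSync(path, contents)
  const onDisk = readFileSync(path, 'utf8')
  if (!onDisk.includes(marker)) {
    throw new Error(
      `Wrote ${contents.length} chars to ${path} but the re-read does not ` +
        `contain the marker. File starts with: ${onDisk.slice(0, 200)}`,
    )
  }
}

// Usage against a throwaway directory so nothing in the repo is touched.
const dir = mkdtempSync(join(tmpdir(), 'embed-check-'))
const target = join(dir, 'stub.ts')
writeAndVerify(
  target,
  'export function getChunks(): string[] {\n  return ["QUJD"]\n}\n',
  '"QUJD"',
)
console.log(`verified ${target}`)
```

One caveat on the design: a re-read from the same process is likely to be served from the OS cache, so this mainly catches gross failures (wrong path, truncated write) rather than proving what a separate process would see.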
Lines changed: 17 additions & 12 deletions
@@ -1,14 +1,19 @@
-// Stub committed for dev mode and tests. The real wasm chunks are written
+// Stub committed for dev mode and tests. The real chunks are written
 // here by `cli/scripts/build-binary.ts` immediately before
-// `bun build --compile`, then restored to an empty array after the build
-// completes. Dev mode and unit tests see the empty stub and fall back to
-// path-based resolution in `packages/code-map/src/init-node.ts` (which
-// works locally because `node_modules/web-tree-sitter/tree-sitter.wasm`
-// exists on the filesystem).
+// `bun build --compile`, then restored to this empty stub after.
 //
-// Why an array of small chunks rather than one big string: a single
-// 274KB string literal got dropped/transformed by bun's Windows
-// minifier (the binary built clean but ran without the bytes). Many
-// small string literals slip under whatever threshold caused that. See
-// `cli/src/pre-init/tree-sitter-wasm.ts` for the full failure history.
-export const TREE_SITTER_WASM_BASE64_CHUNKS: readonly string[] = []
+// Why a *function* return rather than a top-level const: prior
+// approaches kept getting eliminated on Windows even with 268
+// individual chunks. The bundler appears to evaluate the imported
+// value at static-analysis time (we suspect either filesystem write
+// timing or an AST cache), inlines it as the empty stub, and DCEs
+// any conditional that depends on `.length > 0`. A function call's
+// return value is not statically inlinable in the same way — the
+// chunks live inside the function body, only materialized on call.
+//
+// Why a function instead of `export const X = (() => [...])()`:
+// same reason — IIFEs can be folded by aggressive minifiers, but
+// imported functions called at runtime are preserved.
+export function getTreeSitterWasmChunks(): string[] {
+  return []
+}
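For orientation, here is a rough sketch of how a build step might turn wasm bytes into the non-empty version of this stub. It assumes a hypothetical `toBase64Chunks` helper and an arbitrary chunk size; the repository's real `CHUNK_SIZE` value and chunking code are not shown in this diff, so treat every name except `getTreeSitterWasmChunks` as illustrative.

```typescript
import { Buffer } from 'node:buffer'

// Hypothetical helper: base64-encode the bytes and cut the string into
// fixed-size pieces. The last piece may be shorter.
function toBase64Chunks(bytes: Uint8Array, chunkSize: number): string[] {
  const b64 = Buffer.from(bytes).toString('base64')
  const chunks: string[] = []
  for (let i = 0; i < b64.length; i += chunkSize) {
    chunks.push(b64.slice(i, i + chunkSize))
  }
  return chunks
}

// Render module source whose function body returns the chunk literals,
// mirroring the shape of the committed stub.
function renderChunkModule(chunks: readonly string[]): string {
  const body = chunks.map((c) => `    ${JSON.stringify(c)},`).join('\n')
  return (
    `export function getTreeSitterWasmChunks(): string[] {\n` +
    `  return [\n${body}\n  ]\n` +
    `}\n`
  )
}

// Usage with ten fake bytes and a deliberately tiny chunk size.
const sampleBytes = Uint8Array.from([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
const sampleChunks = toBase64Chunks(sampleBytes, 4)
console.log(renderChunkModule(sampleChunks))
```

Because the consumer joins all chunks before decoding, the chunk size does not have to be a multiple of four.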

cli/src/pre-init/tree-sitter-wasm.ts

Lines changed: 32 additions & 26 deletions
@@ -1,41 +1,47 @@
 // Embed tree-sitter.wasm into the bun-compile binary so the SDK's tree-sitter
 // parser singleton can find it at runtime. Must be the very first import in
 // `index.tsx`: subsequent imports (the SDK / code-map) eagerly construct the
-// parser, and its init reads what we publish here on `globalThis` and via
-// the env var.
+// parser, and its init reads what we publish here on `globalThis`.
 //
-// History of failed approaches before this one:
+// History of failed approaches before this one (all worked on macOS/Linux,
+// failed on Windows in different ways):
 //
-// 1. `with { type: 'file' }` import of `web-tree-sitter/tree-sitter.wasm`
-// (node_modules subpath) — bun --compile on Windows embedded the
-// bytes but bound the import variable to undefined.
-// 2. `with { type: 'file' }` import of a copied-in relative wasm file —
-// same problem; this turns out to be a bun/Windows bug, not a
-// subpath-vs-relative thing.
-// 3. Single 274KB base64 string literal in a generated TS module —
-// bun's Windows minifier dropped/transformed the literal even
-// though the embed step wrote it.
+// 1. `with { type: 'file' }` of `web-tree-sitter/tree-sitter.wasm` (node_
+// modules subpath) — bytes ended up in the binary but the import
+// variable was undefined at runtime. Bun/Windows bug with the import-
+// attribute binding.
+// 2. `with { type: 'file' }` of a copied-in relative .wasm — same as #1,
+// so it's not subpath-vs-relative.
+// 3. Single 274KB base64 string literal in a generated TS module — the
+// literal didn't appear in the compiled binary at all. Probably the
+// minifier transforming "huge constant" literals.
+// 4. ~268 chunked base64 string literals — same fate; the bundler
+// appeared to evaluate the imported array as the empty stub at
+// static-analysis time and DCE'd the conditional that consumed it.
 //
-// What works: many small base64 chunks (each well under any plausible
-// minifier threshold) joined at runtime. The build script writes the
-// chunks; this module decodes them. The committed file ships an empty
-// stub array — dev-mode runs see no chunks and fall through to
-// path-based resolution in init-node.ts (which works locally because
-// `node_modules/web-tree-sitter/tree-sitter.wasm` exists on disk).
+// What this version does: import a *function* whose body returns the
+// chunks. Function return values aren't statically inlinable the way
+// `export const` values are, so the bundler can't substitute the empty
+// stub for the call site. Reference the result unconditionally so DCE
+// can't kick in even if some inliner does fold the function.
 
-import { TREE_SITTER_WASM_BASE64_CHUNKS } from './tree-sitter-wasm-bytes'
+import { getTreeSitterWasmChunks } from './tree-sitter-wasm-bytes'
 
-let embeddedWasm: Uint8Array | undefined
-if (TREE_SITTER_WASM_BASE64_CHUNKS.length > 0) {
-  // Joined string is up to ~275KB but only lives long enough to decode.
-  const buf = Buffer.from(TREE_SITTER_WASM_BASE64_CHUNKS.join(''), 'base64')
-  embeddedWasm = new Uint8Array(buf.buffer, buf.byteOffset, buf.byteLength)
+const chunks = getTreeSitterWasmChunks()
+if (chunks.length > 0) {
+  const buf = Buffer.from(chunks.join(''), 'base64')
   // globalThis is the only cross-bundle channel: the SDK pre-built bundle
   // inlines its own copy of `init-node.ts`, so a module-level variable
-  // here isn't visible to the singleton initialized via the SDK.
+  // here isn't visible to the singleton initialized via the SDK. Slice
+  // into a fresh Uint8Array view rather than handing over Buffer's shared
+  // underlying ArrayBuffer.
   ;(
     globalThis as { __CODEBUFF_TREE_SITTER_WASM_BINARY__?: Uint8Array }
-  ).__CODEBUFF_TREE_SITTER_WASM_BINARY__ = embeddedWasm
+  ).__CODEBUFF_TREE_SITTER_WASM_BINARY__ = new Uint8Array(
+    buf.buffer,
+    buf.byteOffset,
+    buf.byteLength,
+  )
 }
 
 // `--smoke-tree-sitter` is the deterministic CI gate. The handler lives at
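The `new Uint8Array(buf.buffer, buf.byteOffset, buf.byteLength)` call in this hunk is worth a small demonstration, because a Node `Buffer` may sit inside a larger pooled `ArrayBuffer`. The sketch below uses a made-up five-byte payload in place of the wasm bytes; it shows that the offset and length bound the view to the decoded bytes, and that the view still shares memory with the Buffer (a copy is shown for contrast).

```typescript
import { Buffer } from 'node:buffer'

// Stand-in for the joined chunks: base64 for the ASCII bytes of "hello".
const buf = Buffer.from(['aGVs', 'bG8='].join(''), 'base64')

// A view bounded to exactly the decoded bytes. The backing ArrayBuffer can
// be larger than the Buffer, so the offset and length arguments matter.
const view = new Uint8Array(buf.buffer, buf.byteOffset, buf.byteLength)

// For contrast: constructing from another typed array copies the bytes
// into a new, independent ArrayBuffer.
const copy = new Uint8Array(view)

// Writing through the Buffer is visible through the view, not the copy.
buf[0] = 72 // 'H'
console.log(view.length, view[0], copy[0], view.buffer === buf.buffer)
```

If a consumer needed bytes that are fully independent of the Buffer's backing store, the copying form would be the one to use, at the cost of one extra allocation of the wasm's size.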
