Commit 3ad502b

jahooma and claude committed
Embed tree-sitter wasm as ~268 chunked base64 string literals
Three previous approaches all failed on Windows in subtly different ways:

1. Single 274KB base64 string literal: bun's Windows minifier dropped or
   transformed it (build verified the prefix wasn't in the binary even
   though the embed step wrote the file).
2. `with { type: 'file' }` from a node_modules subpath: bytes ended up in
   the binary but the import variable was bound to undefined at runtime;
   bun on Windows mishandles the JS-level binding for that attribute.
3. `with { type: 'file' }` from a relative path (wasm copied into
   pre-init/): same as #2, which confirms it's not subpath-vs-relative,
   it's a bun/Windows bug with the import-attribute binding.

Round 4: write the base64 as ~268 small chunks (1024 chars each) in an
exported array, joined and decoded at runtime in the pre-init. Each chunk
is referenced unconditionally at runtime via .join(''), so DCE can't
eliminate it; each is small enough that no minifier heuristic would treat
it as a special "huge string literal" worth dropping.

- cli/scripts/build-binary.ts: embedTreeSitterWasmAsChunks() writes the
  full array, returns sample chunks (start/middle/end) for the post-build
  verification scan to look for in the compiled binary. Restores the
  empty stub eagerly + via process.on('exit').
- cli/src/pre-init/tree-sitter-wasm-bytes.ts: re-introduced as a stub
  exporting an empty readonly string[]. Dev-mode and unit tests see the
  empty stub; production builds get the real chunks written in by
  build-binary.ts.
- cli/src/pre-init/tree-sitter-wasm.ts: import the chunks, .join(''),
  Buffer.from(_, 'base64'), publish on globalThis. The if() guard remains
  because dev mode legitimately has zero chunks.

Verified locally: build embeds 268 chunks, post-build verifies 3 sample
chunks at distinct offsets in the compiled binary, --smoke-tree-sitter
exits 0 with "tree-sitter smoke ok (wasmBinary, 205488 bytes)", full
smoke passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
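
The chunk arithmetic in the message can be checked with a small round trip. This is a minimal sketch, not the commit's code: `toBase64Chunks` and `fromBase64Chunks` are hypothetical helper names, and the payload is synthetic. Only the 1024-character chunk size and the 205488-byte wasm size come from the message.

```typescript
import { Buffer } from 'node:buffer'

// Chunk size taken from the commit message; the helper names are hypothetical.
const CHUNK_SIZE = 1024

/** Base64-encode the bytes and split the text into chunks of at most chunkSize chars. */
function toBase64Chunks(bytes: Uint8Array, chunkSize: number = CHUNK_SIZE): string[] {
  const base64 = Buffer.from(bytes).toString('base64')
  const chunks: string[] = []
  for (let i = 0; i < base64.length; i += chunkSize) {
    chunks.push(base64.slice(i, i + chunkSize))
  }
  return chunks
}

/** Join the chunks and decode them back into the original bytes. */
function fromBase64Chunks(chunks: readonly string[]): Uint8Array {
  const buf = Buffer.from(chunks.join(''), 'base64')
  return new Uint8Array(buf.buffer, buf.byteOffset, buf.byteLength)
}

// Synthetic stand-in for tree-sitter.wasm: same size as reported, made-up contents.
const fakeWasm = new Uint8Array(205488)
for (let i = 0; i < fakeWasm.length; i++) {
  fakeWasm[i] = (i * 31) % 256
}

const chunks = toBase64Chunks(fakeWasm)
const roundTripped = fromBase64Chunks(chunks)
console.log(chunks.length, roundTripped.length) // prints "268 205488"
```

205488 bytes encode to 273984 base64 characters, which is where both the "274KB" literal and the 268 chunks (267 full chunks plus one of 576 characters) come from.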
1 parent e505cc7 commit 3ad502b

4 files changed

Lines changed: 149 additions & 140 deletions

cli/.gitignore

Lines changed: 0 additions & 4 deletions
@@ -7,7 +7,3 @@ debug/
 
 # Generated files
 src/agents/bundled-agents.generated.ts
-
-# Staged by build-binary.ts before `bun build --compile`, removed after.
-# See cli/src/pre-init/tree-sitter-wasm.ts for why we copy this in.
-src/pre-init/tree-sitter.wasm

cli/scripts/build-binary.ts

Lines changed: 102 additions & 77 deletions
@@ -145,9 +145,10 @@ async function main() {
   patchOpenTuiAssetPaths()
   await ensureOpenTuiNativeBundle(targetInfo)
 
-  const wasmCopy = stagePreInitWasm()
-  // Even on a build-script crash, leave the developer's working tree clean.
-  process.on('exit', wasmCopy.cleanup)
+  const treeSitterEmbed = embedTreeSitterWasmAsChunks()
+  // Even on a build-script crash, restore the empty stub so a developer's
+  // working tree doesn't end up with a multi-MB diff.
+  process.on('exit', treeSitterEmbed.restore)
 
   const outputFilename =
     targetInfo.platform === 'win32' ? `${binaryName}.exe` : binaryName
@@ -190,17 +191,16 @@ async function main() {
 
   runCommand('bun', buildArgs, { cwd: cliRoot })
 
-  // Remove the staged pre-init wasm now that the build has read it. Eager
-  // cleanup keeps a successful build clean; the exit handler above is a
-  // backstop for crashes between stage and now.
-  wasmCopy.cleanup()
+  // Restore the empty stub now that the build read the chunks. Eager
+  // cleanup keeps a successful build clean; the exit handler is a
+  // backstop for crashes between embed and now.
+  treeSitterEmbed.restore()
 
-  // Fail the build if the wasm asset didn't actually make it into the
-  // compiled binary. The pre-init imports tree-sitter.wasm with `with {
-  // type: 'file' }`, which Bun should embed; this scan catches silent
-  // regressions (e.g. tree-shaking eliminating the import) before we ship
-  // a broken artifact.
-  verifyTreeSitterWasmEmbedded(outputFile)
+  // Fail the build if the chunks didn't actually make it into the
+  // compiled binary. Catches silent regressions (tree-shaking, minifier
+  // dropping literals, file-write timing) before we upload an artifact
+  // that would crash for users.
+  verifyTreeSitterWasmEmbedded(outputFile, treeSitterEmbed.sampleChunks)
 
   if (targetInfo.platform !== 'win32') {
     chmodSync(outputFile, 0o755)
@@ -247,82 +247,107 @@ function findWebTreeSitterWasm(): string {
 }
 
 /**
- * Copy `tree-sitter.wasm` into `cli/src/pre-init/` so the pre-init module
- * can import it via a relative `with { type: 'file' }` path. We can't
- * import it directly as a node_modules subpath: on Windows, bun's
- * `with { type: 'file' }` resolution returned falsy at runtime for
- * `web-tree-sitter/tree-sitter.wasm` even though the bytes ended up in
- * the binary, breaking the pre-init's runtime path lookup. OpenTUI's own
- * tree-sitter assets work because they're imported relatively from
- * inside the package; same trick here.
+ * Inline `tree-sitter.wasm` into the binary as base64-encoded string
+ * literals, but split into many small chunks. A single 274KB string
+ * literal got dropped/transformed by bun's Windows minifier in an
+ * earlier attempt; small chunks are individually unremarkable to the
+ * minifier and survive intact. The pre-init joins them at runtime and
+ * decodes back to the wasm bytes.
  *
- * Returns a cleanup function. The build calls it eagerly after compile
- * and registers it as an exit handler so a mid-build crash doesn't leave
- * a multi-MB untracked file in the working tree.
+ * Returns a `restore` function (resets the stub) and a small set of
+ * `sampleChunks` for the post-build verification step to look for in
+ * the compiled binary. Always invoke `restore` (eagerly + on exit) so
+ * a developer's working tree doesn't end up with a multi-MB diff after
+ * a build.
  */
-function stagePreInitWasm(): { cleanup: () => void } {
-  const sourceWasm = findWebTreeSitterWasm()
-  const stagedPath = join(cliRoot, 'src', 'pre-init', 'tree-sitter.wasm')
-  let cleaned = false
-  const cleanup = (): void => {
-    if (cleaned) return
-    cleaned = true
-    if (existsSync(stagedPath)) {
-      try {
-        rmSync(stagedPath)
-      } catch (error) {
-        console.error('Failed to remove staged pre-init wasm:', error)
-      }
+function embedTreeSitterWasmAsChunks(): {
+  restore: () => void
+  sampleChunks: string[]
+} {
+  const stubPath = join(cliRoot, 'src', 'pre-init', 'tree-sitter-wasm-bytes.ts')
+  const originalStub = readFileSync(stubPath, 'utf8')
+  let restored = false
+  const restore = (): void => {
+    if (restored) return
+    restored = true
+    try {
+      writeFileSync(stubPath, originalStub)
+    } catch (error) {
+      console.error('Failed to restore tree-sitter-wasm-bytes stub:', error)
     }
   }
 
-  // Read + write rather than copyFile so we don't accidentally hardlink
-  // (some Windows hosts fail to delete hardlinks while bun has the file
-  // mmapped from the compile step).
-  writeFileSync(stagedPath, readFileSync(sourceWasm))
-  logAlways(`Staged pre-init wasm: ${sourceWasm}${stagedPath}`)
-  return { cleanup }
+  const sourceWasm = findWebTreeSitterWasm()
+  const wasmBytes = readFileSync(sourceWasm)
+  const fullBase64 = wasmBytes.toString('base64')
+
+  // ~1KB per chunk: well under any plausible minifier-dropped-literal
+  // threshold, and small enough that even a heavy-handed inliner would
+  // emit them as runtime references rather than evaluating the whole
+  // .join() at compile time. Keeps total chunk count manageable too
+  // (~270 chunks for a 205KB wasm).
+  const CHUNK_SIZE = 1024
+  const chunks: string[] = []
+  for (let i = 0; i < fullBase64.length; i += CHUNK_SIZE) {
+    chunks.push(fullBase64.slice(i, i + CHUNK_SIZE))
+  }
+
+  const generated =
+    `// AUTO-GENERATED by cli/scripts/build-binary.ts during \`bun build --compile\`.\n` +
+    `// Restored to the empty stub after the build finishes — do not commit a\n` +
+    `// non-empty value here.\n` +
+    `export const TREE_SITTER_WASM_BASE64_CHUNKS: readonly string[] = [\n` +
+    chunks.map((c) => ` ${JSON.stringify(c)},`).join('\n') +
+    `\n]\n`
+
+  writeFileSync(stubPath, generated)
+  logAlways(
+    `Embedded tree-sitter.wasm from ${sourceWasm} (${wasmBytes.length} bytes → ${chunks.length} chunks of ~${CHUNK_SIZE} chars).`,
+  )
+
+  // Pull a few sample chunks from the start, middle, and end for the
+  // post-build verification scan. If any one is missing in the compiled
+  // binary, something dropped or transformed the literals.
+  const samples = [
+    chunks[0],
+    chunks[Math.floor(chunks.length / 2)],
+    chunks[chunks.length - 1],
+  ].filter((c): c is string => Boolean(c))
+
+  return { restore, sampleChunks: samples }
 }
 
 /**
- * Sanity-check the compiled binary actually contains web-tree-sitter's
- * tree-sitter.wasm. The pre-init imports it via `with { type: 'file' }`,
- * which should bundle the asset at a bunfs path. If tree-shaking or a
- * future bundler change drops the import, the binary still compiles but
- * tree-sitter init fails at runtime; this scan fails the build before
- * we upload that artifact.
- *
- * Looks for the actual wasm bytes (a unique 64-byte chunk pulled from
- * the source file's interior), not just the wasm magic header; OpenTUI
- * embeds its own tree-sitter language wasms, so a magic-bytes-only scan
- * would false-pass even without our import. A literal bytes match
- * proves *this specific* wasm shipped.
+ * Sanity-check the compiled binary actually contains all the chunked
+ * base64 we just embedded. We pass in a few sample chunks from the
+ * start / middle / end of the array; each must appear in the binary.
+ * If any one is missing, the bundler dropped or inlined-away part of
+ * the literal table, and the runtime decode would produce garbage.
  */
-function verifyTreeSitterWasmEmbedded(outputFile: string): void {
-  const wasmPath = findWebTreeSitterWasm()
-  const wasm = readFileSync(wasmPath)
-  // Take a 64-byte slice from the middle of the file. The header has
-  // generic wasm magic + section markers; the tail can be padding. The
-  // middle is densely packed code/data unique to this specific wasm
-  // module.
-  const needleStart = Math.floor(wasm.length / 2)
-  const needle = wasm.subarray(needleStart, needleStart + 64)
-
+function verifyTreeSitterWasmEmbedded(
+  outputFile: string,
+  sampleChunks: string[],
+): void {
+  if (sampleChunks.length === 0) {
+    throw new Error('verifyTreeSitterWasmEmbedded called with no sample chunks')
+  }
   const binary = readFileSync(outputFile)
-  const idx = binary.indexOf(needle)
-  if (idx === -1) {
-    throw new Error(
-      `web-tree-sitter wasm content not found in ${outputFile}.\n` +
-        `Source wasm: ${wasmPath} (${wasm.length} bytes)\n` +
-        `Searched for 64 bytes from offset ${needleStart} of the source.\n` +
-        `Either the \`with { type: 'file' }\` import in the pre-init was\n` +
-        `tree-shaken out, or bun --compile didn't embed the asset on this\n` +
-        `platform. The runtime tree-sitter init would fail with\n` +
-        `"Internal error: tree-sitter.wasm not found".`,
-    )
+  for (const chunk of sampleChunks) {
+    const needle = Buffer.from(chunk, 'utf8')
+    const idx = binary.indexOf(needle)
+    if (idx === -1) {
+      throw new Error(
+        `Embedded tree-sitter wasm chunk not found in ${outputFile}.\n` +
+          `Missing chunk (first 80 chars): ${chunk.slice(0, 80)}…\n` +
+          `Either the \`tree-sitter-wasm-bytes.ts\` literals were tree-shaken,\n` +
+          `the minifier transformed them away, or the pre-init's import wasn't\n` +
+          `actually consumed. The runtime tree-sitter init would fail with\n` +
+          `"Internal error: tree-sitter.wasm not found".`,
+      )
+    }
   }
   logAlways(
-    `Verified embedded tree-sitter.wasm at offset ${idx} of compiled binary (source: ${wasmPath}).`,
+    `Verified ${sampleChunks.length} embedded base64 chunks in compiled binary.`,
   )
 }
 
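
The post-build scan in `verifyTreeSitterWasmEmbedded` can be illustrated in isolation. This is a rough sketch with hypothetical helper names (`pickSampleChunks`, `findMissingChunks`) and a made-up "binary" built from plain strings. Like the commit's scan, it assumes the compiled output stores each string literal as its raw ASCII bytes.

```typescript
import { Buffer } from 'node:buffer'

/** Pick the first, middle, and last chunk (hypothetical helper mirroring the diff). */
function pickSampleChunks(chunks: readonly string[]): string[] {
  return [
    chunks[0],
    chunks[Math.floor(chunks.length / 2)],
    chunks[chunks.length - 1],
  ].filter((c): c is string => Boolean(c))
}

/** Return the samples whose bytes do not occur anywhere in the binary. */
function findMissingChunks(binary: Buffer, samples: readonly string[]): string[] {
  return samples.filter((s) => binary.indexOf(Buffer.from(s, 'utf8')) === -1)
}

// Made-up chunk table and two fake "binaries": one intact, one missing the last chunk.
const chunks = ['AAAA1111', 'BBBB2222', 'CCCC3333', 'DDDD4444', 'EEEE5555']
const samples = pickSampleChunks(chunks)
const intact = Buffer.from(`header;${chunks.join('","')};trailer`, 'utf8')
const damaged = Buffer.from(`header;${chunks.slice(0, 4).join('","')};trailer`, 'utf8')

const missingFromIntact = findMissingChunks(intact, samples)
const missingFromDamaged = findMissingChunks(damaged, samples)
console.log(missingFromIntact, missingFromDamaged) // prints "[] [ 'EEEE5555' ]"
```

Sampling three positions is a spot check, not a proof that every chunk survived. A stricter variant could scan for all chunks at the cost of a longer build step.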

cli/src/pre-init/tree-sitter-wasm-bytes.ts

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+// Stub committed for dev mode and tests. The real wasm chunks are written
+// here by `cli/scripts/build-binary.ts` immediately before
+// `bun build --compile`, then restored to an empty array after the build
+// completes. Dev mode and unit tests see the empty stub and fall back to
+// path-based resolution in `packages/code-map/src/init-node.ts` (which
+// works locally because `node_modules/web-tree-sitter/tree-sitter.wasm`
+// exists on the filesystem).
+//
+// Why an array of small chunks rather than one big string: a single
+// 274KB string literal got dropped/transformed by bun's Windows
+// minifier (the binary built clean but ran without the bytes). Many
+// small string literals slip under whatever threshold caused that. See
+// `cli/src/pre-init/tree-sitter-wasm.ts` for the full failure history.
+export const TREE_SITTER_WASM_BASE64_CHUNKS: readonly string[] = []
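
The write-then-restore cycle this stub goes through during a build might look roughly like the following. This is a sketch under assumptions: `renderChunksModule` and `overwriteWithRestore` are hypothetical names, and the demo runs against a throwaway temp file rather than the real stub path.

```typescript
import { mkdtempSync, readFileSync, writeFileSync } from 'node:fs'
import { tmpdir } from 'node:os'
import { join } from 'node:path'

/** Render a TS module exporting the chunks as an array of string literals. */
function renderChunksModule(chunks: readonly string[]): string {
  return (
    `export const TREE_SITTER_WASM_BASE64_CHUNKS: readonly string[] = [\n` +
    chunks.map((c) => `  ${JSON.stringify(c)},`).join('\n') +
    `\n]\n`
  )
}

/** Overwrite a file and return an idempotent function that puts the original back. */
function overwriteWithRestore(path: string, contents: string): () => void {
  const original = readFileSync(path, 'utf8')
  let restored = false
  writeFileSync(path, contents)
  return () => {
    if (restored) return // safe to call eagerly and again from an exit handler
    restored = true
    writeFileSync(path, original)
  }
}

// Demo on a temp file standing in for tree-sitter-wasm-bytes.ts.
const stubPath = join(mkdtempSync(join(tmpdir(), 'chunks-demo-')), 'bytes.ts')
const emptyStub = 'export const TREE_SITTER_WASM_BASE64_CHUNKS: readonly string[] = []\n'
writeFileSync(stubPath, emptyStub)

const restore = overwriteWithRestore(stubPath, renderChunksModule(['QUJD', 'REVG']))
const duringBuild = readFileSync(stubPath, 'utf8')
restore()
restore() // second call is a no-op
const afterBuild = readFileSync(stubPath, 'utf8')
```

In a real build script the returned function would also be registered with `process.on('exit', restore)`, so that a crash between the write and the eager call still leaves the committed stub in place.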

cli/src/pre-init/tree-sitter-wasm.ts

Lines changed: 33 additions & 59 deletions
@@ -4,68 +4,42 @@
 // parser, and its init reads what we publish here on `globalThis` and via
 // the env var.
 //
-// Why `with { type: 'file' }` rather than embedding base64 in TS source:
-// the latter doesn't survive `bun --compile` on Windows. The base64 string
-// gets dropped or transformed somewhere in the bundle/minify pipeline, so
-// the runtime sees an empty stub even though the build script wrote the
-// real bytes. `with { type: 'file' }` is Bun's documented asset-embed
-// path: the file gets placed at a bunfs location the runtime can read.
-
-import { readFileSync } from 'fs'
-
-// Important: this is a *relative* import of a wasm file the build script
-// copies in from `web-tree-sitter/tree-sitter.wasm` immediately before
-// `bun build --compile`. On Windows, bun's `with { type: 'file' }`
-// returned falsy at runtime when this import was a node_modules subpath
-// (`web-tree-sitter/tree-sitter.wasm`) even though the bytes ended up in
-// the binary; OpenTUI works around the same issue by using relative
-// paths from inside its own package, which is what we're mirroring here.
+// History of failed approaches before this one:
 //
-// The `.wasm` lives at `./tree-sitter.wasm` next to this file. It is
-// .gitignored; build-binary.ts copies it in before compile and removes
-// it after, so dev-mode runs see no `.wasm` here and fall back to
-// path-based resolution via init-node.ts (which works locally).
+// 1. `with { type: 'file' }` import of `web-tree-sitter/tree-sitter.wasm`
+//    (node_modules subpath): bun --compile on Windows embedded the
+//    bytes but bound the import variable to undefined.
+// 2. `with { type: 'file' }` import of a copied-in relative wasm file:
+//    same problem; this turns out to be a bun/Windows bug, not a
+//    subpath-vs-relative thing.
+// 3. Single 274KB base64 string literal in a generated TS module:
+//    bun's Windows minifier dropped/transformed the literal even
+//    though the embed step wrote it.
 //
-// @ts-expect-error - TS has no loader for .wasm; bun's `with { type: 'file' }`
-// returns a string path at compile time.
-import treeSitterWasmPath from './tree-sitter.wasm' with { type: 'file' }
+// What works: many small base64 chunks (each well under any plausible
+// minifier threshold) joined at runtime. The build script writes the
+// chunks; this module decodes them. The committed file ships an empty
+// stub array; dev-mode runs see no chunks and fall through to
+// path-based resolution in init-node.ts (which works locally because
+// `node_modules/web-tree-sitter/tree-sitter.wasm` exists on disk).
 
-let embeddedWasm: Uint8Array | undefined
+import { TREE_SITTER_WASM_BASE64_CHUNKS } from './tree-sitter-wasm-bytes'
 
-if (treeSitterWasmPath) {
-  // Path stays for the locateFile fallback in init-node.ts. That fallback
-  // accepts bunfs-style paths (`/~BUN/root/...`) without checking
-  // fs.existsSync, because fs.existsSync misreports those paths on Windows.
-  // emscripten's wasm loader will fs.readFile them through its own runtime.
-  process.env.CODEBUFF_TREE_SITTER_WASM_PATH = treeSitterWasmPath
-
-  // Also try a synchronous read so we can hand the bytes straight to
-  // Parser.init via wasmBinary; bypassing locateFile entirely is the most
-  // robust path. If readFileSync of the bunfs path throws on this OS (we've
-  // seen this happen on Windows in some configurations), log it loudly so
-  // the smoke check / user reports include the diagnostic, then fall
-  // through to the locateFile flow.
-  try {
-    const buf = readFileSync(treeSitterWasmPath)
-    embeddedWasm = new Uint8Array(buf.buffer, buf.byteOffset, buf.byteLength)
-    ;(
-      globalThis as { __CODEBUFF_TREE_SITTER_WASM_BINARY__?: Uint8Array }
-    ).__CODEBUFF_TREE_SITTER_WASM_BINARY__ = embeddedWasm
-  } catch (err) {
-    console.error(
-      '[tree-sitter pre-init] readFileSync failed for embedded wasm at',
-      treeSitterWasmPath,
-      '—',
-      err instanceof Error ? err.message : String(err),
-    )
-  }
+let embeddedWasm: Uint8Array | undefined
+if (TREE_SITTER_WASM_BASE64_CHUNKS.length > 0) {
+  // Joined string is up to ~275KB but only lives long enough to decode.
+  const buf = Buffer.from(TREE_SITTER_WASM_BASE64_CHUNKS.join(''), 'base64')
+  embeddedWasm = new Uint8Array(buf.buffer, buf.byteOffset, buf.byteLength)
+  // globalThis is the only cross-bundle channel: the SDK pre-built bundle
+  // inlines its own copy of `init-node.ts`, so a module-level variable
+  // here isn't visible to the singleton initialized via the SDK.
+  ;(
+    globalThis as { __CODEBUFF_TREE_SITTER_WASM_BINARY__?: Uint8Array }
+  ).__CODEBUFF_TREE_SITTER_WASM_BINARY__ = embeddedWasm
 }
 
-// `--smoke-tree-sitter` is the deterministic CI gate. We can't handle it
-// here with top-level await: bun --compile on Windows didn't preserve the
-// blocking semantics in our last attempt, so commander still ran and
-// rejected the unknown flag. Instead, the handler lives at the top of
-// main() in cli/src/index.tsx (before parseArgs), where we can synchronously
-// short-circuit before commander parses argv. This module's job is just to
-// publish the wasm bytes / path on globalThis + process.env so that the
-// handler (and the SDK's eager Parser.init) can find them.
+// `--smoke-tree-sitter` is the deterministic CI gate. The handler lives at
+// the top of main() in cli/src/index.tsx (before parseArgs), not here:
+// top-level await in this module didn't actually pause subsequent module
+// evaluation under bun --compile on Windows. See the comment over the
+// handler in index.tsx for the full reasoning.
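
The producer/consumer handoff over `globalThis` can be sketched as below. The global key is the one used in the diff; `publishEmbeddedWasm` and `readEmbeddedWasm` are hypothetical names, and the consumer side is only a guess at what `init-node.ts` might do, since that file is not part of this commit.

```typescript
import { Buffer } from 'node:buffer'

type WasmGlobal = { __CODEBUFF_TREE_SITTER_WASM_BINARY__?: Uint8Array }

/** Producer: decode and publish the bytes, or do nothing when the stub is empty. */
function publishEmbeddedWasm(chunks: readonly string[]): Uint8Array | undefined {
  if (chunks.length === 0) return undefined // dev mode: committed empty stub
  const buf = Buffer.from(chunks.join(''), 'base64')
  const bytes = new Uint8Array(buf.buffer, buf.byteOffset, buf.byteLength)
  ;(globalThis as WasmGlobal).__CODEBUFF_TREE_SITTER_WASM_BINARY__ = bytes
  return bytes
}

/** Consumer (hypothetical): undefined means "fall back to path-based resolution". */
function readEmbeddedWasm(): Uint8Array | undefined {
  return (globalThis as WasmGlobal).__CODEBUFF_TREE_SITTER_WASM_BINARY__
}

// Dev mode: nothing is published, so the consumer sees undefined.
const devResult = publishEmbeddedWasm([])
const devSeen = readEmbeddedWasm()

// Production-like: the standard 8-byte wasm header ('\0asm' + version 1), split in two chunks.
const header = Buffer.from([0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00]).toString('base64')
const prodResult = publishEmbeddedWasm([header.slice(0, 6), header.slice(6)])
const prodSeen = readEmbeddedWasm()
console.log(devSeen, prodSeen?.length) // prints "undefined 8"
```

Chunk boundaries do not need to align with base64 quads, because the chunks are joined back into one string before decoding.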
