Merged
50 commits
353695e
Disabling walkthrough for web
cwebster-99 Apr 15, 2026
a2437aa
agentHost: show rich diffs when requesting write confirmations
connor4312 Apr 17, 2026
70eb8cf
Merge remote-tracking branch 'origin/main' into connor4312/readfile-d…
connor4312 Apr 17, 2026
b6453f2
comments
connor4312 Apr 17, 2026
a1427ee
Merge remote-tracking branch 'origin/main' into connor4312/readfile-d…
connor4312 Apr 17, 2026
9677ab2
Use 'sessionTypes' to filter chat customizations in UI, 'when' only i…
aeschli Apr 17, 2026
696552c
agentHost: correctly rewrite links in markdown for remote files
connor4312 Apr 17, 2026
2cc8b21
Updates component explorer
hediet Apr 17, 2026
82a6ad1
Merge branch 'connor4312/readfile-diffs' into connor4312/ra-links
connor4312 Apr 17, 2026
6476a36
build and tests:
connor4312 Apr 17, 2026
20298ed
Background - use merge-base commit for the left hand side of the diff…
lszomoru Apr 17, 2026
b82b184
sessions: add subtle border to GH profile image (#311036)
hawkticehurst Apr 17, 2026
e44fc34
Remove noisy log (#311034)
roblourens Apr 17, 2026
7c2a50c
Remove extensions step from onboarding
cwebster-99 Apr 17, 2026
1830427
more build
connor4312 Apr 17, 2026
d689175
Merge branch 'connor4312/readfile-diffs' into connor4312/ra-links
connor4312 Apr 17, 2026
251d099
Fix unescaped icon and undefined text in terminal tool confirmations …
meganrogge Apr 17, 2026
7d224d9
build(deps): bump rand from 0.8.5 to 0.9.3 in /cli (#309689)
dependabot[bot] Apr 17, 2026
c0a6ca5
comments
connor4312 Apr 17, 2026
377feb6
chat customizations: splash when MCP/plugins disabled (#310847)
joshspicer Apr 17, 2026
8c03967
Merge pull request #311025 from microsoft/connor4312/readfile-diffs
connor4312 Apr 17, 2026
0e3e05d
Merge pull request #310169 from microsoft/respectable-aardwolf
cwebster-99 Apr 17, 2026
15248d5
Merge pull request #311042 from microsoft/remove-onboarding-extension…
cwebster-99 Apr 17, 2026
c78e004
Warm-cache gate for inline background summarization (#311047)
bhavyaus Apr 17, 2026
960abe8
Always open overview from sidebar customization entrypoints (#311050)
joshspicer Apr 17, 2026
5027dde
Add back button to AI customization section screens (#310881)
joshspicer Apr 17, 2026
3704d95
Support frames / workers in browser CDP (#311049)
kycutler Apr 17, 2026
99a3906
Update distro commit (main) (#311059)
vs-code-engineering[bot] Apr 17, 2026
8dac599
fix(terminal-chat): dedupe terminal tool-session registrations to pre…
maruthang Apr 17, 2026
f2ba285
Merge branch 'main' into connor4312/ra-links
roblourens Apr 17, 2026
99c9ee1
agentHost: fix bugs around message handling (#311054)
connor4312 Apr 17, 2026
ea6aac9
Agents tunnels: auto-reconnect with backoff and wake-triggered retry …
osortega Apr 17, 2026
00a718e
Add incremental chat rendering experiment (#310801)
pwang347 Apr 17, 2026
c89a817
Browser shouldn't steal window focus (#311082)
kycutler Apr 17, 2026
6579de6
Joshspicer/add policy preview features docs (#311039)
joshspicer Apr 17, 2026
6a8889b
Move to InputState instead of metadata on sessions (#311060)
TylerLeonhardt Apr 17, 2026
22cc340
agentHost: surface data into the 'changes' view (#311067)
connor4312 Apr 17, 2026
41770c5
Fixes screenshots attributed to the wrong PR.
hediet Apr 17, 2026
0f93774
Bump dompurify from 3.3.2 to 3.4.0 in /extensions/copilot (#310375)
dependabot[bot] Apr 17, 2026
309dec5
Restore autopilot branch and shorten steering text in terminal tool r…
meganrogge Apr 17, 2026
78f1999
Match subdomains when checking for existing browser pages (#311084)
kycutler Apr 17, 2026
a1bec8c
feat: upload commit-to-version mapping for copilot source maps (#311086)
bryanchen-d Apr 17, 2026
a947515
Agents web: Fix for terminals (#311089)
osortega Apr 17, 2026
ec992ba
Add performance tests (#309700)
pwang347 Apr 17, 2026
5711368
Filter Copilot sessions by local data (#311097)
roblourens Apr 17, 2026
ab11a29
Rename remote agent host Copilot agent displayName to "Copilot CLI" (…
roblourens Apr 17, 2026
d431c96
Integrating sandboxing with network filter service (#310872)
dileepyavan Apr 17, 2026
b4bdfac
Merge pull request #311041 from microsoft/connor4312/ra-links
connor4312 Apr 17, 2026
1493a10
agentHost: make 'new terminal' button open a terminal in the remote b…
connor4312 Apr 17, 2026
840a3be
Fix Send to Terminal ignoring terminal session auto-approve (#311103)
meganrogge Apr 17, 2026
44 changes: 44 additions & 0 deletions .github/skills/add-policy/SKILL.md
Expand Up @@ -183,6 +183,50 @@ The file `src/vs/workbench/contrib/policyExport/test/node/extensionPolicyFixture
| `vscode-website` (`gulpfile.policies.js`) | `policyData.jsonc` | Enterprise policy reference table at code.visualstudio.com/docs/enterprise/policies |
| `vscode-docs` | Generated from website build | `docs/enterprise/policies.md` |

## GitHub Preview Features

If your setting is a **GitHub Preview Feature** — meaning it's a Copilot/chat feature that organizations can disable via their GitHub account-level policy — you **must** add a `value` function that checks `policyData.chat_preview_features_enabled`.

### When to add this flag

Add the `chat_preview_features_enabled` check when **all** of these apply:

- The setting controls a Copilot or chat feature (e.g., agent tools, hooks, MCP, auto-approve)
- The feature is in preview or experimental status (typically tagged `'preview'` or `'experimental'`)
- An organization admin should be able to disable it for all users in their org via GitHub account policy

### How it works

The `chat_preview_features_enabled` field on `IPolicyData` (defined in `src/vs/base/common/defaultAccount.ts`) is populated from the user's GitHub Copilot token entitlements. When an organization admin disables preview features, `chat_preview_features_enabled` is set to `false`.

### Pattern

Add a `value` function to the policy that returns a disabling value when `chat_preview_features_enabled === false`, and `undefined` otherwise (to fall through to the user's own setting):

```typescript
policy: {
	name: 'MyPreviewFeaturePolicy',
	category: PolicyCategory.InteractiveSession,
	minimumVersion: '1.xx', // Must match the first VS Code release that ships this policy.
	value: (policyData) => policyData.chat_preview_features_enabled === false ? false : undefined,
	localization: {
		description: {
			key: 'my.setting.description',
			value: nls.localize('my.setting.description', "Description of the setting."),
		}
	}
}
```

Key details:
- **Always compare with `=== false`**, not `!policyData.chat_preview_features_enabled` — the field is optional and `undefined` means "no policy data available", which should not disable the feature.
- **Return `undefined`** when the flag is not `false` so the account-level policy does not override the user's setting.
- **Return the disabling value** for the setting's type: `false` for booleans, a restrictive string/enum value for other types.
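
The difference between `=== false` and a plain negation can be seen in a small standalone sketch. The `PolicyDataSketch` interface, the function names, and the `'never'` enum value are hypothetical and exist only for illustration; the real `IPolicyData` has more fields.

```typescript
// Hypothetical, trimmed-down shape; the real IPolicyData has more fields.
interface PolicyDataSketch {
	chat_preview_features_enabled?: boolean;
}

// Boolean setting: force `false` only when the org explicitly disabled preview features.
function booleanPreviewGate(policyData: PolicyDataSketch): boolean | undefined {
	return policyData.chat_preview_features_enabled === false ? false : undefined;
}

// Buggy variant for contrast: `!undefined` is `true`, so missing policy data
// would disable the feature for users whose org set no policy at all.
function buggyPreviewGate(policyData: PolicyDataSketch): boolean | undefined {
	return !policyData.chat_preview_features_enabled ? false : undefined;
}

// Non-boolean setting: return a restrictive value ('never' is a made-up enum member).
function enumPreviewGate(policyData: PolicyDataSketch): 'never' | undefined {
	return policyData.chat_preview_features_enabled === false ? 'never' : undefined;
}
```

With empty policy data, `booleanPreviewGate({})` returns `undefined` and the user's setting applies, while `buggyPreviewGate({})` returns `false` and wrongly disables the feature.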

### Real-world examples

See `chat.tools.global.autoApprove` and `chat.useHooks` in `src/vs/workbench/contrib/chat/browser/chat.contribution.ts` for existing settings that use this pattern.

## Examples

Search the codebase for `policy:` to find all the examples of different policy configurations.
265 changes: 265 additions & 0 deletions .github/skills/chat-perf/SKILL.md
@@ -0,0 +1,265 @@
---
name: chat-perf
description: Run chat perf benchmarks and memory leak checks against the local dev build or any published VS Code version. Use when investigating chat rendering regressions, validating perf-sensitive changes to chat UI, or checking for memory leaks in the chat response pipeline.
---

# Chat Performance Testing

## When to use

- Before/after modifying chat rendering code (`chatListRenderer.ts`, `chatInputPart.ts`, markdown rendering)
- When changing the streaming response pipeline or SSE processing
- When modifying disposable/lifecycle patterns in chat components
- To compare performance between two VS Code releases
- In CI to gate PRs that touch chat UI code

## Quick start

```bash
# Run perf regression test (compares local dev build vs VS Code 1.115.0):
npm run perf:chat -- --scenario text-only --runs 3

# Run all scenarios with no baseline (just measure):
npm run perf:chat -- --no-baseline --runs 3

# Compare two local builds (apples-to-apples):
npm run perf:chat -- --build /path/to/build-A --baseline-build /path/to/build-B --runs 5

# Build a local production package and compare against a release:
npm run perf:chat -- --production-build --baseline-build 1.115.0 --runs 5

# Run memory leak check (10 messages in one session):
npm run perf:chat-leak

# Run leak check with more messages for accuracy:
npm run perf:chat-leak -- --messages 20 --verbose
```

## Perf regression test

**Script:** `scripts/chat-simulation/test-chat-perf-regression.js`
**npm:** `npm run perf:chat`

Launches VS Code via Playwright Electron, opens the chat panel, sends a message with a mock LLM response, and measures timing, layout, and rendering metrics. By default, it downloads VS Code 1.115.0 as a baseline, benchmarks it, then benchmarks the local dev build and compares the two.

### Key flags

| Flag | Default | Description |
|---|---|---|
| `--runs <n>` | `5` | Runs per scenario. More = more stable. Use 5+ for CI. |
| `--scenario <id>` / `-s` | all | Scenario to test (repeatable). See `common/perf-scenarios.js`. |
| `--build <path\|ver>` / `-b` | local dev | Build to test. Accepts path or version (`1.110.0`, `insiders`, commit hash). |
| `--baseline <path>` | — | Compare against a previously saved baseline JSON file. |
| `--baseline-build <path\|ver>` | `1.115.0` | Version or local path to benchmark as baseline. |
| `--no-baseline` | — | Skip baseline comparison entirely. |
| `--save-baseline` | — | Save results as the new baseline (requires `--baseline <path>`). |
| `--resume <path>` | — | Resume a previous run, adding more iterations to increase confidence. |
| `--threshold <frac>` | `0.2` | Regression threshold (0.2 = flag if 20% slower). |
| `--production-build` | — | Build a local bundled package via `gulp vscode` for comparison against a release baseline. |
| `--no-cache` | — | Ignore cached baseline data, always run fresh. |
| `--force` | — | Skip build mode mismatch confirmation prompt. |
| `--ci` | — | CI mode: write Markdown summary to `ci-summary.md` (implies `--no-cache`). |
| `--setting <k=v>` | — | Set a VS Code setting override for all builds (repeatable). |
| `--test-setting <k=v>` | — | Set a VS Code setting override for the test build only. |
| `--baseline-setting <k=v>` | — | Set a VS Code setting override for the baseline build only. |
| `--verbose` | — | Print per-run details including response content. |

### Comparing two remote builds

```bash
# Compare 1.110.0 against 1.115.0 (no local build needed):
npm run perf:chat -- --build 1.110.0 --baseline-build 1.115.0 --runs 5
```

### Comparing two local builds

Both `--build` and `--baseline-build` accept local paths to VS Code executables. This enables apples-to-apples comparisons between any two builds:

```bash
# Compare two dev builds (e.g. feature branch vs main):
npm run perf:chat -- \
    --build .build/electron/Code\ -\ OSS.app/Contents/MacOS/Code\ -\ OSS \
    --baseline-build /path/to/other/Code\ -\ OSS.app/Contents/MacOS/Code\ -\ OSS \
    --runs 5

# Compare two production builds:
npm run perf:chat -- \
    --build ../VSCode-darwin-arm64-feature/Code\ -\ OSS.app/Contents/MacOS/Code\ -\ OSS \
    --baseline-build ../VSCode-darwin-arm64-main/Code\ -\ OSS.app/Contents/MacOS/Code\ -\ OSS \
    --runs 5
```

Local path baselines are never cached (the build may change between runs). Version string baselines are cached for reuse.

### Build modes and mismatch detection

The tool classifies builds into three modes based on the executable path:

| Mode | Source | Characteristics |
|---|---|---|
| `dev` | `.build/electron/` (local dev) | Unbundled sources, `VSCODE_DEV=1`, `NODE_ENV=development`. Higher memory and startup overhead. |
| `production` | `../VSCode-<platform>-<arch>/` (from `gulp vscode`) | Bundled JS, no dev flags. Matches release characteristics but uses local source. |
| `release` | `.vscode-test/` (downloaded via `@vscode/test-electron`) | Official published build. |

When test and baseline builds have different modes (e.g. dev vs release), the tool shows a warning and prompts for confirmation. Use `--force` or `--ci` to skip the prompt.
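
The path-based classification from the table above can be sketched roughly as follows. This is an illustration, not the tool's actual code: the function names, the regular expression, and the extra `'unknown'` fallback are all assumptions.

```typescript
// 'unknown' is an illustrative fallback; the tool itself documents only three modes.
type BuildMode = 'dev' | 'production' | 'release' | 'unknown';

function classifyBuildMode(executablePath: string): BuildMode {
	// Normalize Windows separators so the same checks work on every platform.
	const p = executablePath.replace(/\\/g, '/');
	if (p.indexOf('.build/electron/') !== -1) { return 'dev'; }
	if (p.indexOf('.vscode-test/') !== -1) { return 'release'; }
	// Matches a path segment such as VSCode-darwin-arm64/ produced by `gulp vscode`.
	if (/(^|\/)VSCode-[^/]+-[^/]+\//.test(p)) { return 'production'; }
	return 'unknown';
}

// A prompt is only needed when the modes differ and neither --force nor --ci was passed.
function needsConfirmation(test: BuildMode, baseline: BuildMode, skipPrompt: boolean): boolean {
	return test !== baseline && !skipPrompt;
}
```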

Using `--production-build` builds a local bundled package via `gulp vscode` for fair comparison against a release baseline. This eliminates dev-mode overhead while still testing your local changes.

```bash
# Production build vs release baseline (fair comparison):
npm run perf:chat -- --production-build --baseline-build 1.115.0 --runs 5
```

### Settings overrides

Use `--setting`, `--test-setting`, and `--baseline-setting` to inject VS Code settings into the launched instance. This is useful for A/B testing experimental features:

```bash
# Enable a feature for the test build only:
npm run perf:chat -- --test-setting chat.experimental.incrementalRendering.enabled=true --runs 3

# Compare two builds with different settings:
npm run perf:chat -- \
    --baseline-build "../vscode2/.build/electron/Code - OSS.app/Contents/MacOS/Code - OSS" \
    --baseline-setting chat.experimental.incrementalRendering.enabled=true \
    --test-setting chat.experimental.incrementalRendering.enabled=false \
    --runs 3

# Set a value for both builds:
npm run perf:chat -- --setting chat.mcp.enabled=false --runs 3
```

Precedence: `--test-setting` / `--baseline-setting` override `--setting` for the same key. Values are auto-parsed: `true`/`false` become booleans, numbers become numbers, everything else stays a string.
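
A minimal sketch of the parsing and precedence rules described above. The function names are hypothetical, and the real script may handle edge cases (empty values, hex literals, quoting) differently.

```typescript
type SettingValue = boolean | number | string;

// Auto-parse: 'true'/'false' become booleans, numeric strings become numbers.
function parseSettingValue(raw: string): SettingValue {
	if (raw === 'true') { return true; }
	if (raw === 'false') { return false; }
	const n = Number(raw);
	if (raw.trim() !== '' && isFinite(n)) { return n; }
	return raw;
}

// Split a single `key=value` argument at the first '='.
function parseSettingFlag(arg: string): [string, SettingValue] {
	const eq = arg.indexOf('=');
	if (eq <= 0) { throw new Error('Expected key=value, got: ' + arg); }
	return [arg.slice(0, eq), parseSettingValue(arg.slice(eq + 1))];
}

// Build-specific flags are applied last, so they win over --setting for the same key.
function resolveSettings(shared: string[], buildSpecific: string[]): Record<string, SettingValue> {
	const out: Record<string, SettingValue> = {};
	for (const arg of shared.concat(buildSpecific)) {
		const [key, value] = parseSettingFlag(arg);
		out[key] = value;
	}
	return out;
}
```

For example, `resolveSettings(['chat.mcp.enabled=false'], ['chat.mcp.enabled=true'])` yields `chat.mcp.enabled` set to the boolean `true` for that build.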

### Resuming a run for more confidence

When results exceed the threshold but aren't statistically significant, the tool prints a `--resume` hint. Use it to add more iterations to an existing run:

```bash
# Initial run with 3 iterations — may be inconclusive:
npm run perf:chat -- --scenario text-only --runs 3

# Add 3 more runs to the same results file (both test + baseline):
npm run perf:chat -- --resume .chat-simulation-data/2026-04-14T02-15-14/results.json --runs 3

# Keep adding until confidence is reached:
npm run perf:chat -- --resume .chat-simulation-data/2026-04-14T02-15-14/results.json --runs 5
```

`--resume` loads the previous `results.json` and its associated `baseline-*.json`, runs N more iterations for both builds, merges the new samples into `rawRuns`, recomputes stats, and re-runs the comparison. The updated files are written back in place. You can resume multiple times; samples accumulate across resumes.

### Statistical significance

Regression detection uses **Welch's t-test** to avoid false positives from noisy measurements. A metric is only flagged as `REGRESSION` when it both exceeds the threshold AND is statistically significant (p < 0.05). Otherwise it's reported as `(likely noise — p=X, not significant)`.

With typical variance (cv ≈ 20%), you need:
- **n ≥ 5** per build to detect a 35% regression at 95% confidence
- **n ≥ 10** per build to detect a 20% regression reliably

Confidence levels reported: `high` (p < 0.01), `medium` (p < 0.05), `low` (p < 0.1), `none`.
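
For reference, the Welch statistic and its degrees of freedom (Welch-Satterthwaite) can be computed as in the sketch below. It assumes sample variance with an `n - 1` denominator and at least two samples per build; the function names are illustrative. Turning `t` and `df` into a p-value needs the Student t CDF, which is left out here.

```typescript
function mean(xs: number[]): number {
	return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Unbiased sample variance (n - 1 denominator); requires at least two samples.
function sampleVariance(xs: number[]): number {
	const m = mean(xs);
	return xs.reduce((acc, x) => acc + (x - m) * (x - m), 0) / (xs.length - 1);
}

// Welch's t statistic and Welch-Satterthwaite degrees of freedom.
// With (test, baseline) argument order, a positive t means the test build is slower.
function welch(test: number[], baseline: number[]): { t: number; df: number } {
	const vTest = sampleVariance(test) / test.length;
	const vBase = sampleVariance(baseline) / baseline.length;
	const t = (mean(test) - mean(baseline)) / Math.sqrt(vTest + vBase);
	const df = ((vTest + vBase) * (vTest + vBase)) /
		((vTest * vTest) / (test.length - 1) + (vBase * vBase) / (baseline.length - 1));
	return { t, df };
}
```

Unlike Student's t-test, this form does not assume the two builds have equal variance, which matters when a dev build is noisier than a release build.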

### Exit codes

- `0` — all metrics within threshold, or exceeding threshold but not statistically significant
- `1` — statistically significant regression detected, or all runs failed
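
The decision rules from the last two sections can be summarized in a short sketch. The type and function names are hypothetical; only the thresholds (`0.2`, `0.05`, and the confidence cut-offs) come from the text above.

```typescript
type Verdict = 'ok' | 'noise' | 'regression';

// A metric regresses only if it exceeds the threshold AND is statistically significant.
function classify(baseline: number, test: number, pValue: number, threshold: number = 0.2): Verdict {
	const change = (test - baseline) / baseline;
	if (change <= threshold) { return 'ok'; }
	return pValue < 0.05 ? 'regression' : 'noise';
}

// Confidence labels as listed under "Statistical significance".
function confidence(pValue: number): 'high' | 'medium' | 'low' | 'none' {
	if (pValue < 0.01) { return 'high'; }
	if (pValue < 0.05) { return 'medium'; }
	if (pValue < 0.1) { return 'low'; }
	return 'none';
}

// Exit 1 on any significant regression or when every run failed; otherwise 0.
function exitCode(verdicts: Verdict[], allRunsFailed: boolean): number {
	return allRunsFailed || verdicts.some(v => v === 'regression') ? 1 : 0;
}
```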

### Scenarios

Scenarios are defined in `scripts/chat-simulation/common/perf-scenarios.js` and registered via `registerPerfScenarios()`. There are three categories:

- **Content-only** — plain streaming responses (e.g. `text-only`, `large-codeblock`, `rapid-stream`)
- **Tool-call** — multi-turn scenarios with tool invocations (e.g. `tool-read-file`, `tool-edit-file`)
- **Multi-turn user** — multi-turn conversations with user follow-ups, thinking blocks (e.g. `thinking-response`, `multi-turn-user`, `long-conversation`)

Run `npm run perf:chat -- --help` to see the full list of registered scenario IDs.

### Metrics collected

- **Timing:** time to first token, time to complete, time to render complete (includes typewriter animation)
- **Rendering:** layout count, layout duration (ms), style recalculation count, forced reflows, long tasks (>50ms), long animation frame count and duration
- **Memory:** heap before/after, heap delta post-GC (informational, noisy for single requests)
- **Extension host:** heap before/after/delta via CDP inspector

### Regression triggers vs informational metrics

Only these metrics trigger a regression failure (when they exceed the threshold with statistical significance):
- `timeToFirstToken`, `timeToComplete` — user-perceived latency
- `forcedReflowCount` — forced synchronous layouts are always bad
- `longTaskCount`, `longAnimationFrameCount` — main thread jank

These are reported but **informational only** (won't fail CI):
- `layoutCount` — inflated by CSS animations; use `layoutDurationMs` instead
- `layoutDurationMs` — total layout time from trace (more meaningful than count)
- `recalcStyleCount` — inflated by CSS animations (compositor-driven, cheap)
- `timeToRenderComplete` — includes typewriter animation tail
- Memory/heap metrics — too noisy for single-request benchmarks

### Statistics

Results use **IQR-based outlier removal** and **median** (not mean) to handle startup jitter. The **coefficient of variation (cv)** is reported — under 15% is stable, over 15% gets a ⚠ warning. Baseline comparison uses **Welch's t-test** on raw run values to determine statistical significance before flagging regressions. Use 5+ runs to get stable results.
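
A rough sketch of these summary statistics, under stated assumptions: Tukey fences at 1.5 × IQR, linearly interpolated quartiles, and a sample standard deviation for cv. The script's exact quantile method and fence multiplier may differ.

```typescript
// Linearly interpolated quantile over an already sorted array.
function quantile(sorted: number[], q: number): number {
	const pos = (sorted.length - 1) * q;
	const lo = Math.floor(pos);
	const hi = Math.ceil(pos);
	return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

// Drop values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] (Tukey fences, assumed multiplier).
function removeOutliers(xs: number[]): number[] {
	const sorted = xs.slice().sort((a, b) => a - b);
	const q1 = quantile(sorted, 0.25);
	const q3 = quantile(sorted, 0.75);
	const iqr = q3 - q1;
	return sorted.filter(x => x >= q1 - 1.5 * iqr && x <= q3 + 1.5 * iqr);
}

function median(xs: number[]): number {
	return quantile(xs.slice().sort((a, b) => a - b), 0.5);
}

// cv = sample standard deviation / mean; below 0.15 counts as stable per the text above.
function coefficientOfVariation(xs: number[]): number {
	const m = xs.reduce((a, b) => a + b, 0) / xs.length;
	const variance = xs.reduce((acc, x) => acc + (x - m) * (x - m), 0) / (xs.length - 1);
	return Math.sqrt(variance) / m;
}
```

A single slow startup run (for example `400` among values near `100`) is dropped by the fences, so it does not move the median or inflate cv.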

## Memory leak check

**Script:** `scripts/chat-simulation/test-chat-mem-leaks.js`
**npm:** `npm run perf:chat-leak`

Launches one VS Code session, sends N messages sequentially, forces GC between each, and measures renderer heap and DOM node count. Uses **linear regression** on the samples to compute per-message growth rate, which is compared against a threshold.

### Key flags

| Flag | Default | Description |
|---|---|---|
| `--messages <n>` / `-n` | `10` | Number of messages to send. More = more accurate slope. |
| `--build <path\|ver>` / `-b` | local dev | Build to test. |
| `--threshold <MB>` | `2` | Max per-message heap growth in MB. |
| `--setting <k=v>` | — | Set a VS Code setting override (repeatable). |
| `--verbose` | — | Print per-message heap/DOM counts. |

### What it measures

- **Heap growth slope** (MB/message) — linear regression over forced-GC heap samples. A leak shows as sustained positive slope.
- **DOM node growth** (nodes/message) — catches rendering leaks where elements aren't cleaned up. Healthy chat virtualizes old messages so node count plateaus.
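
The per-message growth rate is an ordinary least-squares slope with the message index as x. A minimal sketch, assuming one post-GC sample per message and at least two samples; the function names are illustrative:

```typescript
// Least-squares slope of samples[i] against i (message index starting at 0).
function slopePerMessage(samples: number[]): number {
	const n = samples.length;
	const xMean = (n - 1) / 2;
	const yMean = samples.reduce((a, b) => a + b, 0) / n;
	let num = 0;
	let den = 0;
	for (let i = 0; i < n; i++) {
		num += (i - xMean) * (samples[i] - yMean);
		den += (i - xMean) * (i - xMean);
	}
	return num / den;
}

// Flag a leak when heap growth per message exceeds the threshold (default 2 MB).
function isLikelyLeak(heapSamplesMB: number[], thresholdMB: number = 2): boolean {
	return slopePerMessage(heapSamplesMB) > thresholdMB;
}
```

Fitting a line over all samples, rather than comparing only the first and last, makes the result less sensitive to one noisy GC measurement.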

### Interpreting results

- `0.3–1.0 MB/msg` — normal (V8 internal overhead, string interning)
- `>2.0 MB/msg` — likely leak, investigate retained objects
- DOM nodes stable after first message — normal (chat list virtualization working)
- DOM nodes growing linearly — rendering leak, check disposable cleanup

## Architecture

```
scripts/chat-simulation/
├── common/
│   ├── mock-llm-server.js        # Mock CAPI server matching @vscode/copilot-api URL structure
│   ├── perf-scenarios.js         # Built-in scenario definitions (content, tool-call, multi-turn)
│   └── utils.js                  # Shared: paths, env setup, stats, launch helpers
├── config.jsonc                  # Default config (baseline version, runs, thresholds)
├── fixtures/                     # TypeScript fixture files used by tool-call scenarios
├── test-chat-perf-regression.js
└── test-chat-mem-leaks.js
```

### Mock server

The mock LLM server (`common/mock-llm-server.js`) implements the full CAPI URL structure from `@vscode/copilot-api`'s `DomainService`:

- `GET /models` — returns model metadata
- `POST /models/session` — returns `AutoModeAPIResponse` with `available_models` and `session_token`
- `POST /models/session/intent` — model router
- `POST /chat/completions` — SSE streaming response matching the scenario
- Agent, session, telemetry, and token endpoints

The Copilot extension connects to this server in `IS_SCENARIO_AUTOMATION=1` mode, using the `overrideCapiUrl` and `overrideProxyUrl` settings. The `vscode-api-tests` extension must be disabled (`--disable-extension=vscode.vscode-api-tests`) because it contributes a duplicate `copilot` vendor that blocks the real extension's language model provider registration.

### Adding a scenario

1. Add a new entry to the appropriate object (`CONTENT_SCENARIOS`, `TOOL_CALL_SCENARIOS`, or `MULTI_TURN_SCENARIOS`) in `common/perf-scenarios.js` using the `ScenarioBuilder` API from `common/mock-llm-server.js`
2. The scenario is auto-registered by `registerPerfScenarios()` — no manual ID list to update
3. Run: `npm run perf:chat -- --scenario your-new-scenario --runs 1 --no-baseline --verbose`

## Related skills

- **heap-snapshot-analysis** — When a perf regression or leak check identifies high memory growth, use the heap-snapshot-analysis skill to dig deeper. It can parse `.heapsnapshot` files, compare before/after snapshots, group object deltas, and trace retainer paths to find what keeps disposed objects alive. The chat-perf leak check measures overall heap slope; heap-snapshot-analysis finds the specific objects responsible.
- **auto-perf-optimize** — For launching VS Code, driving a scenario, and capturing heap snapshots or CPU profiles automatically before doing low-level analysis.