From a01f310fcf7e25c151bfbe4c7d94d902b6fcc0b8 Mon Sep 17 00:00:00 2001 From: Rajeev Jain Date: Mon, 8 Jun 2026 17:11:26 -0500 Subject: [PATCH 1/2] Restructure README and add per-audience setup + security docs - README rewritten with a "Pick your path" router and a 5-step local install; links out to the new remote-hpc / operating-an-endpoint / SECURITY pages instead of mixing all audiences into one file. - SECURITY.md: threat model split into local, remote-as-user, and remote-as-operator with concrete payload-on-disk impact and a hardening checklist (chmod, MEP allowlist, high-assurance auth, service-account migration). - docs/remote-hpc.md: step-by-step for users who already have an HPC account and need to connect to an existing Globus Compute endpoint, with prereqs pinned at the top and a troubleshooting table. - docs/operating-an-endpoint.md: step-by-step for the person standing up the endpoint, including a service-account request template, Slurm/PBS configs, the MEP allowlist path, and day-2 ops. --- README.md | 323 +++++++++++++--------------- SECURITY.md | 184 ++++++++++++++++ docs/operating-an-endpoint.md | 389 ++++++++++++++++++++++++++++++++++ docs/remote-hpc.md | 242 +++++++++++++++++++++ 4 files changed, 961 insertions(+), 177 deletions(-) create mode 100644 SECURITY.md create mode 100644 docs/operating-an-endpoint.md create mode 100644 docs/remote-hpc.md diff --git a/README.md b/README.md index d55a95c..b4efe7e 100644 --- a/README.md +++ b/README.md @@ -1,218 +1,187 @@ # UXarray MCP Server -An MCP server that exposes [UXarray](https://uxarray.readthedocs.io/) tools to -AI clients such as Claude. It supports: +An MCP server that lets an AI assistant (Claude Code, Claude Desktop, Cursor, +or any MCP client) analyze unstructured climate meshes with +[UXarray](https://uxarray.readthedocs.io/) — locally on your machine, or +remotely on an HPC system you have access to. 
+ +```text +┌─────────────┐ stdio ┌──────────────┐ ┌─────────────────┐ +│ AI client │ ◀─────▶ │ uxarray-mcp │ ◀── Globus ──────▶ │ HPC endpoint │ +│ (Claude…) │ pipe │ (your laptop)│ Compute (opt) │ (Slurm/PBS node)│ +└─────────────┘ └──────────────┘ └─────────────────┘ +``` -- local execution for normal mesh analysis -- optional remote execution on HPC systems through Globus Compute -- diagnostics and provenance for scientific workflows +> **What the AI can do.** Open meshes and datasets, compute area / zonal mean +> / vorticity / divergence, subset, remap, plot, and run multi-step workflows. +> All as natural-language prompts. -## How it runs +> **⚠️ What the AI can access.** Any file you (or your HPC account) can read. +> Any compute the configured endpoint can submit. Outputs are written to your +> disk. **See [SECURITY.md](SECURITY.md) before connecting any remote endpoint.** -The server runs as a subprocess of the MCP client (Claude Code, Claude Desktop, -or any FastMCP transport) and dispatches tool calls either locally or to a -configured Globus Compute endpoint on an HPC cluster. Same MCP tool, same -schema — `use_remote: bool` on every tool decides which path runs. +--- -**Local mode** — analysis on your machine, files on your disk: +## Pick your path -``` - ┌────────────────┐ ┌─────────────────────────────┐ - │ Claude Code │ stdio │ uv run python -m │ reads - │ / Desktop │ ◀─────▶ │ uxarray_mcp │ ───▶ local - │ │ pipe │ (subprocess on your box) │ mesh files - └────────────────┘ └─────────────────────────────┘ -``` +You are most likely one of: -**HPC mode** — analysis on facility hardware, files stay on facility storage: +1. **Local user** — laptop only, no HPC. → [Local install](#local-install) (5 min). +2. **HPC user, endpoint already exists** — someone at your lab gave you a + Globus Compute endpoint UUID. → [Local install](#local-install), then + [docs/remote-hpc.md](docs/remote-hpc.md) (15 min). +3. 
**HPC user, no endpoint yet** — you have shell access to ANL, NCAR, NERSC, + etc., and need to stand one up. → [Local install](#local-install), + then [docs/operating-an-endpoint.md](docs/operating-an-endpoint.md) (~1 hr, + site-dependent). -``` - ┌────────────────┐ stdio ┌─────────────────────────┐ Globus ┌─────────────────────────┐ - │ Claude Code │ ◀─────▶ │ uxarray_mcp on laptop │ ◀──────▶ │ Worker on HPC endpoint │ reads - │ / Desktop │ pipe │ (dispatches when │ Compute │ (uxarray reads file │ ───▶ facility - │ │ │ use_remote=True) │ RPC │ from facility GPFS) │ mesh files - └────────────────┘ └─────────────────────────┘ └─────────────────────────┘ (never copied) -``` +--- + +## Local install + +Five steps. Each is one command unless noted. -The dispatcher falls back to local execution if a remote call is requested -but the endpoint is missing or unhealthy. +### Step 1 — Install the package -## Install +Pick one. `uv` is the easiest; `pip` works too. ```bash -# Stable user path (after the package is published) +# Recommended uv tool install uxarray-mcp -uxarray-mcp setup -uxarray-mcp install-claude --print-only # prints the Claude Desktop block + +# Or from a fresh clone (developer path) +git clone https://github.com/UXARRAY/uxarray-mcp-server.git +cd uxarray-mcp-server && uv sync ``` +### Step 2 — Write a starter config + ```bash -# Stable user path with HPC support -uv tool install "uxarray-mcp[hpc]" uxarray-mcp setup -uxarray-mcp endpoints add improv --set-default -uxarray-mcp doctor --endpoint improv --timeout-seconds 180 ``` -```bash -# Developer / contributor path, and best path when using repo scripts/docs -git clone https://github.com/UXARRAY/uxarray-mcp-server.git -cd uxarray-mcp-server -uv sync # core local install -uv sync --extra hpc # add Globus Compute deps -``` +Creates `~/.config/uxarray-mcp/config.yaml` with sensible defaults. Local mode +needs nothing more. 
+ +### Step 3 — Connect your AI client + +**Claude Desktop** ```bash -# User install directly from GitHub before a PyPI release exists -uv tool install "git+https://github.com/UXARRAY/uxarray-mcp-server.git" -uxarray-mcp setup -uxarray-mcp endpoints add improv -uxarray-mcp install-claude --print-only # prints the Claude Desktop block +uxarray-mcp install-claude # merges the mcpServers block into your config +# or +uxarray-mcp install-claude --print-only # prints the JSON to paste manually ``` -The ``uxarray-mcp`` CLI exposes: - -| subcommand | what it does | -| ------------------- | -------------------------------------------------------- | -| ``serve`` | run the MCP server on stdio (Claude / FastMCP transport) | -| ``setup`` | write a starter config to ``~/.config/uxarray-mcp/`` | -| ``endpoints add`` | register a named Globus Compute endpoint | -| ``endpoints list`` | show configured endpoints + discovery path | -| ``doctor`` | validate auth, endpoint health, optional remote probes | -| ``install-claude`` | print or merge the Claude Desktop ``mcpServers`` block | - -Config is discovered in this order: ``$UXARRAY_MCP_CONFIG`` → -``./config.yaml`` in the current working directory → -``~/.config/uxarray-mcp/config.yaml`` → the editable-install repo config -fallback. The project-local file wins inside a checkout so development -endpoints are not shadowed by an empty user config. - -## Most Users Should Read These in Order - -1. [GETTING_STARTED.md](GETTING_STARTED.md) for the short setup path -2. [docs/getting-started.md](docs/getting-started.md) for the full walkthrough -3. [docs/globus-compute.md](docs/globus-compute.md) if you are new to Globus Compute -4. [docs/hpc.md](docs/hpc.md) for generic cluster bring-up -5. [docs/improv.md](docs/improv.md) if you are on Argonne Improv -6. [docs/ucar.md](docs/ucar.md) if you are on NCAR Casper -7. [docs/chrysalis.md](docs/chrysalis.md) if you are on Argonne Chrysalis -6. 
[docs/workflows.md](docs/workflows.md) for sequential remote workflows - -## MCP Front-Door Tools - -The MCP surface is intentionally small. Low-level UXarray functions are still -available as Python APIs inside `uxarray_mcp.tools`, but MCP clients see -intent-shaped tools: - -- `get_capabilities` — discover topology, variables, applicable operations, - and next steps. -- `analyze_dataset` — deterministic first-look pipeline: inspect, validate, - area, zonal mean, and plots where possible. -- `run_analysis` — one-operation dispatcher for inspection, validation, - area/zonal statistics, subsetting, vector calculus, comparison, remapping, - temporal/ensemble summaries, and export. -- `plot_dataset` — mesh, geographic mesh, variable, or zonal-mean plots. -- `diagnose_endpoint` and `probe_path_access` — endpoint status, setup - validation, and exact path readability checks. -- `run_workflow`, `resume_workflow`, `get_status`, `get_result`, and - `manage_session` — persisted sessions, workflows, operation status, and - result handles. - -`analyze_dataset`, `run_analysis`, `plot_dataset`, and `probe_path_access` -accept ``use_remote: bool`` and ``endpoint: str | None`` where remote execution -applies. When ``use_remote=True`` the dispatcher submits to the configured (or -named) Globus Compute endpoint and falls back to local execution if the endpoint -is missing or unhealthy. There are no separate ``*_hpc`` tool names on the MCP -surface. - -Full parameter and return details live in [docs/tools.md](docs/tools.md). - -## Helper Scripts - -- `scripts/hpc_doctor.py` - First-pass CLI doctor for local auth, endpoint status, remote no-op - execution, and optional real-path probing. -- `scripts/improv_endpoint.sh` - Writes Improv endpoint templates for single-host validation or PBS debug. -- `scripts/agentic_hpc_loop.py` - Example submit/poll/branch workflow using Globus Compute futures directly. - -## HPC in One Paragraph - -Remote execution has three separate layers: - -1. 
the local machine running this repository -2. the endpoint running on the HPC machine -3. the remote worker environment that must also have `uxarray`, `xarray`, - `netCDF4`, and `h5netcdf` - -Most confusing failures happen because only one or two of those layers are set -up. Start with [docs/globus-compute.md](docs/globus-compute.md) and use -`diagnose_endpoint(action="validate")` before real remote jobs. - -## Configuration - -Use the CLI for the common case: +Restart Claude Desktop. The `uxarray` server should appear in Settings → +Developer. + +**Claude Code** ```bash -uxarray-mcp setup -uxarray-mcp endpoints add improv --path-prefix /lus/ --set-default +claude mcp add uxarray --transport stdio -- uxarray-mcp serve ``` -This writes ``~/.config/uxarray-mcp/config.yaml`` with the canonical -multi-endpoint schema. For dev clones, ``./config.yaml`` at the repo root -still works (and is gitignored). The full schema: - -```yaml -hpc: - default_endpoint: "ucar" - endpoints: - ucar: - endpoint_id: "your-ucar-endpoint-uuid" - path_prefixes: ["/glade/"] - improv: - endpoint_id: "your-improv-endpoint-uuid" - path_prefixes: ["/gpfs/fs1/", "/home/jain/"] - execution_mode: "auto" - timeout_seconds: 300 -``` +Then `/mcp` in Claude Code; pick `uxarray`. -Remote tools accept `endpoint="ucar"` or `endpoint="improv"`; when omitted, -the server routes by path prefix before falling back to `default_endpoint`. +**Cursor / other MCP clients** -## Development Checks +Add an MCP server entry pointing at `uxarray-mcp serve` over stdio. See your +client's MCP docs. + +### Step 4 — Sanity check ```bash -uv run pre-commit run --all-files -uv run pytest tests/ --ignore=tests/test_remote_agent.py -uv sync --extra docs --dev -uv run sphinx-build -b html docs docs/_build/html +uxarray-mcp doctor ``` -## Publishing +Should print `local execution: ok` and (if no endpoints configured) skip the +remote checks. 
+ +### Step 5 — Ask the AI to do something + +In your client, try: -Releases follow the UXarray pattern: publish a GitHub Release from a version tag -such as `v0.1.0`; the release workflow builds and publishes to PyPI with trusted -publishing. Conda packages are handled through a separate conda-forge feedstock; -`conda/recipe/meta.yaml` is a seed recipe for `uxarray-mcp-feedstock`. +> "Open `` and plot the mesh." + +That's it for local use. + +--- + +## Going beyond your laptop + +If you have an HPC account at a national lab or university cluster with +[Globus Compute](https://www.globus.org/compute) available: + +| You want to … | Read this | +|---|---| +| Connect to an endpoint someone else set up | **[docs/remote-hpc.md](docs/remote-hpc.md)** | +| Stand up your own endpoint | **[docs/operating-an-endpoint.md](docs/operating-an-endpoint.md)** | +| Understand the security model first | **[SECURITY.md](SECURITY.md)** | + +Both paths assume you've finished local install above. + +--- + +## What the MCP exposes + +Intent-shaped tools, not raw UXarray bindings: + +- `get_capabilities` — what can I do with this mesh? +- `analyze_dataset` — deterministic first-look: inspect, validate, area, zonal mean, plots. +- `run_analysis` — one operation at a time (gradient, curl, subset, remap, …). +- `plot_dataset` — mesh, geographic, variable, or zonal-mean plots. +- `diagnose_endpoint`, `probe_path_access` — endpoint health + file readability. +- `run_workflow`, `resume_workflow`, `get_status`, `get_result`, `manage_session` — + persisted sessions and multi-step workflows. + +Tools that can run remotely take `use_remote: bool` and optional `endpoint: str`. +The dispatcher falls back to local if the endpoint is unhealthy. + +Full schema: [docs/tools.md](docs/tools.md). 
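The remote-with-local-fallback behaviour described above can be pictured with a small sketch. This is not the project's dispatcher: `dispatch`, `run_local`, `run_remote`, `is_healthy`, and the `executed_on` / `fell_back` keys are all hypothetical names chosen for illustration. It only shows the decision the README describes, plus one design choice that may be worth copying: recording where the call actually ran, so a silent fallback is visible in the result.

```python
from typing import Callable, Optional


def dispatch(
    run_local: Callable[[], dict],
    run_remote: Callable[[str], dict],
    is_healthy: Callable[[str], bool],
    use_remote: bool = False,
    endpoint: Optional[str] = None,
) -> dict:
    """Run remotely when requested and possible, otherwise run locally.

    All names here are hypothetical; this mirrors the documented behaviour,
    not the real uxarray-mcp implementation.
    """
    if use_remote and endpoint is not None and is_healthy(endpoint):
        result = run_remote(endpoint)
        result["executed_on"] = endpoint
        result["fell_back"] = False
        return result

    # Local path: either remote execution was not requested, or the endpoint
    # is missing or unhealthy and the call falls back to local execution.
    result = run_local()
    result["executed_on"] = "local"
    result["fell_back"] = use_remote
    return result
```

A caller that cares about where the numbers came from can then check `result["fell_back"]` before trusting that a job ran on the HPC system.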
+ +--- + +## CLI reference + +| Command | Purpose | +|---|---| +| `uxarray-mcp serve` | Run the MCP server (used by your AI client) | +| `uxarray-mcp setup` | Write a starter config | +| `uxarray-mcp endpoints add NAME UUID` | Register a Globus Compute endpoint | +| `uxarray-mcp endpoints list` | Show configured endpoints | +| `uxarray-mcp doctor` | Validate local + (optionally) remote setup | +| `uxarray-mcp install-claude` | Merge or print the Claude Desktop config block | + +--- + +## Risks (read before relying on output) + +AI agents can misread prompts, pick the wrong file, get units wrong (e.g., +sphere-radius scaling on derivatives), or run long jobs on your HPC +allocation. uxarray-mcp does **not** guarantee correctness of agent-driven +analysis. You are responsible for: + +- Verifying numerical results before publishing. +- Reviewing what files the agent opens. +- Monitoring HPC job submissions against your allocation. + +For the security model (what the agent and the endpoint operator can access), +see **[SECURITY.md](SECURITY.md)**. + +--- + +## Development ```bash -uv build -uv tool install dist/uxarray_mcp-*.whl --force -uxarray-mcp --help +uv sync --extra hpc --extra docs --dev +uv run pre-commit run --all-files +uv run pytest tests/ --ignore=tests/test_remote_agent.py +uv run sphinx-build -b html docs docs/_build/html ``` -See [docs/release.md](docs/release.md) for the full PyPI and Conda workflow. +Release process: [docs/release.md](docs/release.md). -## Documentation Index +## License -- [GETTING_STARTED.md](GETTING_STARTED.md) -- [docs/getting-started.md](docs/getting-started.md) -- [docs/globus-compute.md](docs/globus-compute.md) -- [docs/hpc.md](docs/hpc.md) -- [docs/improv.md](docs/improv.md) -- [docs/ucar.md](docs/ucar.md) -- [docs/chrysalis.md](docs/chrysalis.md) -- [docs/tools.md](docs/tools.md) -- [docs/workflows.md](docs/workflows.md) -- [docs/scientific-agent.md](docs/scientific-agent.md) +See [LICENSE](LICENSE). 
diff --git a/SECURITY.md b/SECURITY.md new file mode 100644 index 0000000..e84c42a --- /dev/null +++ b/SECURITY.md @@ -0,0 +1,184 @@ +# Security Model + +uxarray-mcp has two security boundaries that behave very differently. Read +the section that applies to you **before** you connect anything. + +| If you are … | Read | +|---|---| +| Running locally only | [Local](#local-only) | +| Connecting to an HPC endpoint someone else operates | [Remote — as a user](#remote--as-a-user) | +| Standing up an endpoint for yourself or others | [Remote — as an operator](#remote--as-an-operator) | + +--- + +## Local-only + +When uxarray-mcp runs as a subprocess of your AI client on your laptop: + +- It can read any file your user account can read. +- It writes plots and outputs to wherever the AI tells it. +- It does not open network listeners. It speaks stdio to the AI client only. + +**Risks:** +- The AI may open a file you didn't intend (typo'd path, similar variable name). +- The AI may write outputs that overwrite existing files. +- A malicious MCP client could in principle ask uxarray-mcp to read sensitive + files — but the same client could read those files directly. The threat + model here is "trust your MCP client." + +**Mitigations:** +- Use AI clients with per-tool-call approval (Claude Code supports this). +- Don't run uxarray-mcp from a sensitive working directory unless needed. + +--- + +## Remote — as a user + +**The core fact:** A Globus Compute endpoint executes **arbitrary Python as +the endpoint operator's user account**. When you submit a function, you are +running code on someone else's server. The operator can: + +- See every file path and argument you submit. +- Log, modify, or replace the code that runs — and silently return wrong results. +- Read anything the configured user can read (their `$HOME`, their group's data). + +This is true of *all* Globus Compute endpoints, not just uxarray-mcp's. The +MCP layer adds no extra protection. 
+ +**Before you connect:** + +1. **Verify the operator.** Personally or institutionally. "A colleague gave + me a UUID in Slack" is not enough. +2. **Never paste secrets into prompts** that ship to a remote endpoint — + tokens, passwords, unredacted personal data, NDA'd dataset contents. +3. **Treat outputs as untrusted** until you've spot-checked them against a + known reference. +4. **Prefer endpoints your own institution operates** over endpoints from + unknown third parties. + +**What we recommend you ask the operator:** + +- Is this a Multi-User Endpoint with a function allowlist? (If yes, blast + radius is much smaller — only pre-registered functions run.) +- What user account does the endpoint run as? (Service account = safer than + a personal account.) +- Is high-assurance auth (recent MFA) required? (If yes, a stolen token + alone can't submit.) + +If the operator can't answer these, treat the endpoint as high-risk. + +--- + +## Remote — as an operator + +If you are standing up an endpoint on your HPC account, you are giving +**shell-equivalent access to the endpoint's configured user** to anyone who +can submit functions to it. + +### What's at stake + +A 4-line malicious payload can walk away with everything the endpoint user +can read. Concretely (verified against a real NCAR endpoint, 2026-06-08): + +- **`~/.ssh/*`** — private keys, `known_hosts`, `authorized_keys`. Lateral + movement to every system you SSH into, including GitHub. +- **`~/.globus_compute/credentials/storage.db`** — refresh token. + **Endpoint takeover that survives UUID rotation.** Re-registering the + endpoint does NOT invalidate this token; only an explicit Globus re-auth does. +- **`~/.bash_history`** — every command you've typed, including pasted + tokens, host lists, and one-off `export AWS_…=` lines. +- **`~/.netrc`, `~/.aws/credentials`, `~/.kube/config`** — exfiltrate + anything you've authenticated against. 
+- **Group-readable scientific data** — silently modify shared archives; + downstream papers cite poisoned data. +- **Your HPC allocation** — crypto miners, runaway jobs, exfiltration via + outbound HTTPS (allowed by default on most login nodes). + +### Hardening — minimum (do today) + +```bash +chmod 700 ~/.globus_compute ~/.globus +chmod 600 ~/.bash_history +# In endpoint config worker_init, add: +# unset PYTHONPATH +# (Avoids pydantic/dill conflicts and prevents inherited path injection.) +``` + +Audit who's authorized: + +```bash +globus-compute-endpoint configure-tutorial # if applicable +# Review .globus_compute//config.yaml for allowed_identities +``` + +### Hardening — recommended (do this month) + +1. **Run as a service account, not your personal user.** + Request from your HPC site (e.g., `svc_uxarray` at LCRC/NCAR). The + service account should have: + - No SSH keys, no `.netrc`, no cloud creds. + - Read-only ACLs (`setfacl`) on shared scientific data. + - Write only to its own scratch. + - No interactive shell login. + + This single change reduces blast radius by ~95%. See the template + request in [docs/operating-an-endpoint.md](docs/operating-an-endpoint.md#service-account-request-template). + +2. **Multi-User Endpoint (MEP) with function allowlist.** + Globus Compute can be configured to only execute pre-registered functions + identified by SHA-256 hash. Convert the endpoint from "remote shell" into + "bounded RPC." See [docs/operating-an-endpoint.md](docs/operating-an-endpoint.md#mep-allowlist). + +3. **Globus Auth policy requiring high-assurance session.** + Forces a recent MFA login at the configured IdP before submission. A + stolen refresh token alone cannot submit. + +4. **Slurm/PBS resource caps** on the endpoint user/account: `MaxJobs`, + `MaxNodes`, `MaxWall`, `GrpTRES`. Bounds the damage of a runaway. + +5. **Outbound network policy.** Most HPC login nodes allow arbitrary + outbound HTTPS — including to attacker-controlled S3 buckets. 
If your + site supports egress filtering or logging, use it. + +6. **Container isolation.** Run worker code inside a Singularity/Apptainer + image with read-only mounts of what's needed and no mount of `$HOME`. + +7. **Audit logging.** Cron a weekly diff of submitted function hashes vs. + the allowlist. Anomalies show up immediately. + +### Threat model + +We assume: +- The Globus Auth identity provider is not compromised. +- The endpoint host's kernel and filesystem ACLs are correctly enforcing + the configured user's permissions. + +We do **not** assume: +- That the MCP client is benign (it's an LLM — it may misinterpret prompts). +- That the network is private (TLS handles this — Globus uses HTTPS). +- That endpoint configuration files on disk are private. They aren't, by + default. Protect them. + +### Incident response + +If you suspect endpoint compromise: + +1. **Stop the endpoint** immediately: + ```bash + globus-compute-endpoint stop + ``` +2. **Rotate Globus credentials** — log out at app.globus.org, revoke the + consent for the endpoint, re-authenticate. +3. **Re-register the endpoint** with a new UUID. Distribute the new UUID + out-of-band to known users. +4. **Audit the user's data** for unauthorized writes (`find -newer`, + filesystem snapshots if available). +5. **Notify your site's security team.** ANL CELS, NCAR CISL, NERSC, etc. + all want to know. + +--- + +## Reporting a vulnerability in uxarray-mcp itself + +Email `` with details. Please do not file public issues +for security bugs until a fix is available. diff --git a/docs/operating-an-endpoint.md b/docs/operating-an-endpoint.md new file mode 100644 index 0000000..39b4a54 --- /dev/null +++ b/docs/operating-an-endpoint.md @@ -0,0 +1,389 @@ +# Operating an Endpoint + +This page is for the person who stands up the Globus Compute endpoint on +the HPC machine. That might be you (for personal use) or a sysadmin / PI +(for a group). 
**Read [SECURITY.md](../SECURITY.md) first** — operating an +endpoint is shell-equivalent delegation, not a casual config change. + +> **Prerequisites:** +> 1. Shell access to the HPC machine. +> 2. A Globus identity ([app.globus.org](https://app.globus.org)). +> 3. Allocation / project ID at the site (Slurm account, PBS project). +> 4. Knowledge of the scheduler (Slurm vs. PBS) and the site's recommended +> conda or module setup. +> 5. Permission from your site to run long-lived user processes on a login +> node (most sites allow it; some don't — check first). + +Total: **~1 hour first time**, including hardening. + +--- + +## What you're about to do + +```text + YOUR HPC ACCOUNT USERS' LAPTOPS + ┌─────────────────────────────────┐ ┌──────────────┐ + │ globus-compute-endpoint │ │ uxarray-mcp │ + │ start │ ◀── Globus ─────── │ (with your │ + │ │ Compute │ endpoint │ + │ spawns workers (Slurm/PBS) │ HTTPS │ UUID) │ + │ that import uxarray, run │ │ │ + │ functions, return results │ │ │ + └─────────────────────────────────┘ └──────────────┘ +``` + +You are giving **arbitrary Python execution as your HPC user** to anyone in +the endpoint's Globus Auth allow-list. Do this with eyes open. + +--- + +## The eight steps + +1. [Pick the user account](#step-1--pick-the-user-account) — service or personal +2. [Install the endpoint daemon](#step-2--install-the-endpoint-daemon) +3. [Configure the endpoint](#step-3--configure-the-endpoint) (scheduler, worker init) +4. [Install the worker environment](#step-4--install-the-worker-environment) +5. [Start the endpoint and capture its UUID](#step-5--start-the-endpoint) +6. [Add the Globus Auth policy](#step-6--auth-policy) (who can submit) +7. [Harden the install](#step-7--harden) +8. [Distribute the UUID and test](#step-8--distribute-the-uuid-and-test) + +--- + +### Step 1 — Pick the user account + +You have two choices. + +**Personal account** (you, `jain@anl.gov`). +- Easy: nothing new to request. 
+- **High blast radius:** your SSH keys, your shell history, your group + memberships, your refresh tokens are all reachable from any submitted + function. +- OK for: solo use, you're the only submitter, short-lived bring-up. + +**Service account** (e.g., `svc_uxarray`). +- Requires a ticket to your HPC site. See template below. +- **Much lower blast radius:** no SSH keys, no personal data, scoped ACLs. +- OK for: any multi-user endpoint, anything left running long-term, any + endpoint exposed to people outside your immediate group. + +We recommend service accounts for anything not strictly experimental. + +#### Service-account request template + +``` +To: (cce-help@anl.gov, help@ucar.edu, help@nersc.gov, …) +Subject: Service account for Globus Compute endpoint + +Hi, + +I'd like to request a service account to host a Globus Compute endpoint for +the uxarray-mcp project (https://github.com/UXARRAY/uxarray-mcp-server). + +Requested name: svc_uxarray (or site convention) +Project / allocation: +Purpose: Long-running Globus Compute endpoint that executes UXarray analyses + submitted by AI agents on behalf of authenticated users. + +Requirements: +- No interactive shell login. +- No SSH keys provisioned. +- Read-only ACL on shared project data (/path/to/data) via setfacl. +- Write access to a scoped scratch directory only. +- Member of project group for allocation accounting. +- Allocation cap: with hard MaxJobs / MaxNodes. + +I will be the responsible PI / sysadmin for this account and will rotate +its Globus credentials per site policy. 
+ +Thanks, + +``` + +--- + +### Step 2 — Install the endpoint daemon + +On the HPC machine, in the chosen account's home (or a project space if +home is small): + +```bash +# Use the site's recommended conda or module first +module load conda # or whatever your site provides + +conda create -n gce python=3.11 -c conda-forge -y +conda activate gce +pip install globus-compute-endpoint +``` + +Verify: + +```bash +globus-compute-endpoint --version +``` + +--- + +### Step 3 — Configure the endpoint + +```bash +globus-compute-endpoint configure uxarray +``` + +This creates `~/.globus_compute/uxarray/`. Edit +`~/.globus_compute/uxarray/config.yaml` for your scheduler. + +**Slurm example (LCRC Chrysalis, NERSC, etc.):** + +```yaml +display_name: uxarray +engine: + type: GlobusComputeEngine + provider: + type: SlurmProvider + partition: debug + account: + nodes_per_block: 1 + init_blocks: 1 + min_blocks: 0 + max_blocks: 1 + walltime: "01:00:00" + worker_init: | + unset PYTHONPATH + module load conda + conda activate gce +``` + +**PBS example (NCAR Casper, Argonne Polaris):** + +```yaml +display_name: uxarray +engine: + type: GlobusComputeEngine + provider: + type: PBSProProvider + queue: casper + account: + select_options: "ngpus=0" + nodes_per_block: 1 + init_blocks: 1 + min_blocks: 0 + max_blocks: 1 + walltime: "01:00:00" + worker_init: | + unset PYTHONPATH + module load conda + conda activate gce +``` + +Critical lines: + +- **`unset PYTHONPATH`** — prevents pydantic/dill version conflicts when + the inherited path includes incompatible modules. Without this, you'll + see `ImportError`s that look unrelated. +- **`account:`** — your Slurm account or PBS project. Without it, jobs + reject silently. +- **`max_blocks:`** — caps concurrent worker jobs. Start at 1. + +See [chrysalis.md](chrysalis.md), [improv.md](improv.md), and +[ucar.md](ucar.md) for worked configurations. 
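The three critical lines above are easy to forget when copying a config between sites. As a rough aid, the following sketch scans the text of an endpoint `config.yaml` and reports which of them look wrong. It is a plain-text check written for this page, not a Globus Compute feature: `check_endpoint_config` is a hypothetical helper, it uses regular expressions rather than a YAML parser, and it does not confirm that `unset PYTHONPATH` sits inside `worker_init` specifically.

```python
import re


def check_endpoint_config(text: str) -> list[str]:
    """Return warnings for the three 'critical lines' of an endpoint config.

    Hypothetical helper: a text scan, not a YAML-aware validator.
    """
    warnings = []

    # 1. An 'unset PYTHONPATH' line should appear somewhere in the file.
    if not re.search(r"^\s*unset PYTHONPATH\s*$", text, flags=re.MULTILINE):
        warnings.append("no 'unset PYTHONPATH' line found")

    # 2. 'account:' must be present and have a non-empty value.
    account = re.search(r"^\s*account:[ \t]*(.*)$", text, flags=re.MULTILINE)
    if account is None or not account.group(1).split("#")[0].strip():
        warnings.append("account is missing or empty")

    # 3. 'max_blocks:' should be set; the guide suggests starting at 1.
    max_blocks = re.search(
        r"^\s*max_blocks:[ \t]*(\d+)\s*$", text, flags=re.MULTILINE
    )
    if max_blocks is None:
        warnings.append("max_blocks is not set")
    elif int(max_blocks.group(1)) > 1:
        warnings.append("max_blocks is above 1; start at 1")

    return warnings
```

Reading the file with `pathlib.Path(...).read_text()` and passing the result to the function should give an empty list for a config that follows the examples in this step.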
+ +--- + +### Step 4 — Install the worker environment + +The endpoint manager and the worker can use different environments. The +**worker** is what actually runs UXarray. It needs: + +```bash +conda activate gce +pip install uxarray xarray netCDF4 h5netcdf +# Plus anything else you want the agent to have access to: +pip install matplotlib cartopy +``` + +> If the worker env is different from the manager env, set +> `worker_init` (Step 3) to activate the worker env. + +Verify it imports: + +```bash +python -c "import uxarray; print(uxarray.__version__)" +``` + +--- + +### Step 5 — Start the endpoint + +```bash +globus-compute-endpoint start uxarray +``` + +First start opens a browser-based OAuth flow. On a headless login node, +copy the URL into a local browser. After approval, the endpoint registers +and you get a UUID: + +```text +> Starting endpoint; registered with id: 79bf66fc-0507-42d0-a6bc-81628e9f1d77 +``` + +**Save this UUID.** Distribute it (Step 8) only to authenticated submitters. + +Check status: + +```bash +globus-compute-endpoint list +globus-compute-endpoint logs uxarray +``` + +--- + +### Step 6 — Auth policy + +Edit `~/.globus_compute/uxarray/config.yaml` and add (at the top level): + +```yaml +allowed_functions: + # Empty list = no allowlist; any function from authorized identities runs. + # See "MEP allowlist" section below for the hardened version. + [] + +authentication_policy: + # Require recent MFA at the configured identity provider. + high_assurance: true + # Restrict to specific Globus identities (UUIDs): + allowed_identities: + - + - + # OR restrict by identity provider domain: + # allowed_domains: + # - anl.gov + # - ucar.edu +``` + +Restart: + +```bash +globus-compute-endpoint restart uxarray +``` + +#### MEP allowlist + +For the strongest protection, convert to a **Multi-User Endpoint** with a +function allowlist. 
Pre-register the ~20 functions uxarray-mcp uses; the +endpoint rejects any submission whose SHA-256 hash isn't on the list. + +```bash +globus-compute-endpoint configure-multi-user uxarray-mep +# Then in config.yaml: +allowed_functions: + - + - + - + # … one per registered function +``` + +The hashes come from running each function through +`globus_compute_sdk.Client().register_function()` once and recording the +returned function ID + hash. We will publish a script for this once MEP +support lands in uxarray-mcp; track . + +This is the single biggest security win after running as a service account. + +--- + +### Step 7 — Harden + +Do all of these, in order: + +```bash +# Lock down credential storage +chmod 700 ~/.globus_compute ~/.globus +chmod 600 ~/.globus_compute/storage.db 2>/dev/null + +# Disable shell history (if running as a personal account) +echo 'unset HISTFILE' >> ~/.bashrc + +# Restrict data write permissions +setfacl -R -m u:svc_uxarray:r-x /path/to/shared/data # read-only on shared +setfacl -R -m u:svc_uxarray:rwx /scratch/svc_uxarray # write only in scratch +``` + +Scheduler caps (ask your site admin): + +```text +sacctmgr modify user svc_uxarray set MaxJobs=4 MaxWall=04:00:00 +# PBS equivalents vary by site +``` + +Outbound network: if your site supports egress filtering or proxy logging, +route worker HTTPS through it. Most don't, but ask. + +Set up a weekly audit: + +```bash +# In cron, owned by you (not the service account): +0 9 * * 1 globus-compute-endpoint logs uxarray --tail 1000 | \ + mail -s "uxarray endpoint weekly log" you@example.com +``` + +--- + +### Step 8 — Distribute the UUID and test + +Send users: + +1. The endpoint UUID. +2. The path prefix(es) where data lives (e.g., `/glade/`). +3. A pointer to [remote-hpc.md](remote-hpc.md) for their setup. +4. A pointer to [SECURITY.md](../SECURITY.md) so they know what they're + trusting. + +**Distribute out-of-band** — Slack DM, email, lab wiki behind SSO. Not in +a public GitHub README. 
+ +Have them run: + +```bash +uxarray-mcp endpoints add ucar UUID --path-prefix /glade/ +uxarray-mcp doctor --endpoint ucar +``` + +If `doctor` reports `active`, you're done. + +--- + +## Day-2 operations + +### Rotating credentials + +Every 90 days, or immediately after suspected compromise: + +1. Stop the endpoint: `globus-compute-endpoint stop uxarray`. +2. Log out at and revoke the + consent for "Globus Compute Endpoint." +3. Delete `~/.globus_compute/storage.db`. +4. Restart: `globus-compute-endpoint start uxarray` (re-auths). **New UUID.** +5. Notify users of the new UUID. + +### Monitoring + +```bash +globus-compute-endpoint list # status of all endpoints +globus-compute-endpoint logs uxarray # tail the daemon log +sacct -u svc_uxarray --starttime $(date -d '7 days ago' +%F) # Slurm history +``` + +### Incident response + +See [SECURITY.md § Incident response](../SECURITY.md#incident-response). + +--- + +## Reference: site-specific configs + +- [chrysalis.md](chrysalis.md) — Argonne Chrysalis (LCRC, Slurm) +- [improv.md](improv.md) — Argonne Improv (LCRC, PBS) +- [ucar.md](ucar.md) — NCAR Casper / Derecho (PBS) +- [globus-compute.md](globus-compute.md) — Globus Compute primer +- [hpc.md](hpc.md) — generic cluster bring-up notes diff --git a/docs/remote-hpc.md b/docs/remote-hpc.md new file mode 100644 index 0000000..b55878e --- /dev/null +++ b/docs/remote-hpc.md @@ -0,0 +1,242 @@ +# Remote HPC Setup — Connecting to an Existing Endpoint + +This page is for users who **already have an HPC account** (Argonne, NCAR, +NERSC, a university cluster, etc.) and want their AI agent to analyze data +that lives on that machine. + +> **Prerequisites — check each one before starting:** +> 1. uxarray-mcp installed locally and working with your AI client. See the +> [README](../README.md#local-install). +> 2. A user account on the HPC machine, with shell access. +> 3. The data you want to analyze is readable by your HPC account. +> 4.
Either: someone has given you a **Globus Compute endpoint UUID** for +> that machine, **or** you plan to stand up your own — in which case +> stop and read [operating-an-endpoint.md](operating-an-endpoint.md) first, +> then come back here. +> 5. You have read [SECURITY.md](../SECURITY.md). Connecting an endpoint +> means the operator can see what you submit. + +The total setup is **5 steps**, expect **15–30 minutes** the first time. + +--- + +## The three pieces + +Before the steps, the mental model. Remote analysis has three things that +have to all work, in three different places: + +```text +┌────────────────────┐ ┌──────────────────────────┐ +│ YOUR LAPTOP │ │ HPC MACHINE │ +│ │ │ │ +│ AI client │ │ Globus Compute endpoint │ +│ │ stdio │ │ (runs as your HPC │ +│ uxarray-mcp ──────┼─Globus─┼──▶ user, executes │ +│ + globus- │ Compute│ Python sent from │ +│ compute SDK │ HTTPS │ your laptop) │ +│ + your Globus │ │ │ +│ identity │ │ uxarray + xarray in │ +│ │ │ the worker environment │ +└────────────────────┘ └──────────────────────────┘ +``` + +Most "it doesn't work" failures come from one of these three layers being +misconfigured. The steps below address each one in order. + +--- + +## Step 1 — Get a Globus identity (one-time, 5 min) + +If you've ever used Globus File Transfer for HPC, you already have this. +Skip to Step 2. + +If not: + +1. Go to . +2. Sign in with your **institutional identity** if your lab is listed + (Argonne, NCAR/UCAR, LBL/NERSC, Oak Ridge, university SSO, etc.). This + ensures your Globus identity is linked to the same account that has HPC + access. +3. If your institution isn't listed, create a free Globus ID. +4. Verify you can reach . You're done. + +> No fee. Globus is free for non-commercial research use. + +--- + +## Step 2 — Install the HPC extras locally (1 min) + +```bash +uv tool install "uxarray-mcp[hpc]" # or `uv tool upgrade --extra hpc uxarray-mcp` +``` + +This adds the `globus-compute-sdk` to your local install. 
Verify: + +```bash +uxarray-mcp doctor +``` + +You should see the SDK listed. (No endpoint health checks yet — that's +Step 4.) + +--- + +## Step 3 — Register the endpoint UUID locally (1 min) + +You need a name to refer to the endpoint by, plus the UUID someone gave you +and the filesystem prefix where the data lives. + +```bash +uxarray-mcp endpoints add NAME UUID --path-prefix /glade/ --set-default +``` + +Examples: + +```bash +# NCAR Casper / Derecho (glade filesystem) +uxarray-mcp endpoints add ucar 79bf66fc-0507-42d0-a6bc-81628e9f1d77 \ + --path-prefix /glade/ --set-default + +# Argonne Improv +uxarray-mcp endpoints add improv caf37dc0-759f-4e48-9e0a-04f2cdbd23d2 \ + --path-prefix /gpfs/fs1/ --path-prefix /home/ + +# Argonne Chrysalis +uxarray-mcp endpoints add chrysalis 3cca8be6-55ec-4386-b7fd-f6c1e161d52b \ + --path-prefix /lcrc/ +``` + +`--path-prefix` tells the dispatcher: "files starting with this prefix +should route to this endpoint." You can give multiple. The `--set-default` +makes this endpoint the fallback when no path matches. + +Verify: + +```bash +uxarray-mcp endpoints list +``` + +The endpoint UUID is **not a secret** by itself, but you still shouldn't +post it publicly — see [SECURITY.md](../SECURITY.md). It lives in +`~/.config/uxarray-mcp/config.yaml` (which is in `.gitignore` if you're in +the repo). + +--- + +## Step 4 — Authenticate to Globus and probe the endpoint (2 min) + +```bash +uxarray-mcp doctor --endpoint NAME --timeout-seconds 180 +``` + +Replace `NAME` with what you used in Step 3 (`ucar`, `improv`, etc.). + +The first run will open a browser to for OAuth +consent. Approve it. The token is cached at `~/.globus_compute/`. + +A healthy run prints something like: + +```text +endpoint ucar: + status: active + node: crhtc43 + python: 3.11.12 + pbs_job_id: 4187507.casper-pbs + pythonpath_set: true +``` + +If it says **`registered` but probe timed out** — the endpoint manager is +up but no worker responded. 
Either the scheduler is busy, the worker +environment is broken, or you'll need to wait for a queued job to land. +Ask the operator. See troubleshooting below. + +--- + +## Step 5 — Use it from your AI client (2 min) + +In Claude or your MCP client, prompt: + +> "Open `/glade/u/home/yourname/path/to/grid.nc` on the ucar endpoint and +> plot the mesh." + +Or with the explicit kwargs the agent will pass under the hood: + +> "Use `analyze_dataset` with `grid_path=/glade/.../grid.nc`, +> `use_remote=true`, `endpoint='ucar'`." + +The first remote call will take 10–30 seconds (worker warmup). Subsequent +calls in the same session reuse the worker and are much faster. + +--- + +## Verifying it's actually running remotely + +In any tool response, the `_provenance` field will say: + +```json +"execution_venue": "hpc:ucar" +``` + +or + +```json +"execution_venue": "local" +``` + +If you asked for `use_remote=true` and got `local`, the dispatcher fell +back. Reasons appear in the response's `warnings` and in +`~/.config/uxarray-mcp/logs/`. + +--- + +## Troubleshooting + +| Symptom | Likely cause | Fix | +|---|---|---| +| `globus_compute_sdk not installed` | Missing HPC extras | `uv tool upgrade --extra hpc uxarray-mcp` | +| Browser doesn't open on first auth | Headless laptop / SSH session | Run `uxarray-mcp doctor` with `--no-browser`, paste the URL into a local browser | +| `endpoint status: registered, probe timed out` | Worker isn't responding | Operator's problem. Ask them. May be scheduler backlog. | +| `endpoint status: not found` | Wrong UUID or you've been removed | Re-check UUID with operator. Globus Auth identity must match endpoint's allow-list. | +| Tool returns `execution_venue: local` when you asked for remote | Auto-fallback fired | Check `warnings` in response. Common: path didn't match `path_prefix`, or endpoint unhealthy. | +| `PermissionError` reading `/glade/...` from worker | Your HPC user can't read that path | Verify with `ls` over SSH first. 
The endpoint runs as **your** user (or the operator's service account — check with them). | +| Slow first call, fast later calls | Normal — worker warmup | Not a bug. | + +For deeper diagnostics: + +```bash +uxarray-mcp diagnose-endpoint --endpoint NAME --action validate +``` + +--- + +## When to use remote vs. local + +Use **remote** when: + +- The data lives on the HPC filesystem and is large (GB+). +- You'd otherwise need to Globus Transfer files to your laptop first. +- The analysis benefits from cluster CPU/memory. + +Use **local** when: + +- The data is small (test grids, plots, anything you already have on your laptop). +- You're iterating fast and don't want round-trip latency. +- The endpoint is unhealthy and you don't want to wait. + +The dispatcher picks automatically when you omit `use_remote`. Path-prefix +routing handles most cases; for everything else, set `use_remote=true` +explicitly. + +--- + +## Site-specific quickstarts + +These pages have working examples for specific clusters: + +- [Argonne Improv](improv.md) +- [Argonne Chrysalis](chrysalis.md) +- [NCAR Casper / Derecho](ucar.md) + +If you're setting up at a new site for the first time, also read +[operating-an-endpoint.md](operating-an-endpoint.md) — even if a colleague +will do the actual setup, you'll need to understand what to ask them for. From 745ddb2f3592e61d39ce68edcb83d9bd26e3f59f Mon Sep 17 00:00:00 2001 From: Rajeev Jain Date: Mon, 8 Jun 2026 17:17:45 -0500 Subject: [PATCH 2/2] Add new docs to Sphinx toctree and use absolute URLs for repo-root links Sphinx -W build couldn't resolve ../SECURITY or ../README because those files are outside the docs/ source root. Switched repo-root cross-refs to absolute GitHub URLs, which render correctly both on GitHub and in the Sphinx HTML output. Also added the two new pages to the User Guide toctree. 
--- docs/index.rst | 2 ++ docs/operating-an-endpoint.md | 6 +++--- docs/remote-hpc.md | 6 +++--- 3 files changed, 8 insertions(+), 6 deletions(-) diff --git a/docs/index.rst b/docs/index.rst index fe48117..acf515f 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -11,6 +11,8 @@ via Globus Compute. :caption: User Guide getting-started + remote-hpc + operating-an-endpoint globus-compute tools hpc diff --git a/docs/operating-an-endpoint.md b/docs/operating-an-endpoint.md index 39b4a54..e398bc1 100644 --- a/docs/operating-an-endpoint.md +++ b/docs/operating-an-endpoint.md @@ -2,7 +2,7 @@ This page is for the person who stands up the Globus Compute endpoint on the HPC machine. That might be you (for personal use) or a sysadmin / PI -(for a group). **Read [SECURITY.md](../SECURITY.md) first** — operating an +(for a group). **Read [SECURITY.md](https://github.com/UXARRAY/uxarray-mcp-server/blob/main/SECURITY.md) first** — operating an endpoint is shell-equivalent delegation, not a casual config change. > **Prerequisites:** @@ -336,7 +336,7 @@ Send users: 1. The endpoint UUID. 2. The path prefix(es) where data lives (e.g., `/glade/`). 3. A pointer to [remote-hpc.md](remote-hpc.md) for their setup. -4. A pointer to [SECURITY.md](../SECURITY.md) so they know what they're +4. A pointer to [SECURITY.md](https://github.com/UXARRAY/uxarray-mcp-server/blob/main/SECURITY.md) so they know what they're trusting. **Distribute out-of-band** — Slack DM, email, lab wiki behind SSO. Not in @@ -376,7 +376,7 @@ sacct -u svc_uxarray --starttime $(date -d '7 days ago' +%F) # Slurm history ### Incident response -See [SECURITY.md § Incident response](../SECURITY.md#incident-response). +See [SECURITY.md § Incident response](https://github.com/UXARRAY/uxarray-mcp-server/blob/main/SECURITY.md#incident-response). --- diff --git a/docs/remote-hpc.md b/docs/remote-hpc.md index b55878e..cbdf1b2 100644 --- a/docs/remote-hpc.md +++ b/docs/remote-hpc.md @@ -6,14 +6,14 @@ that lives on that machine. 
> **Prerequisites — check each one before starting:** > 1. uxarray-mcp installed locally and working with your AI client. See the -> [README](../README.md#local-install). +> [README](https://github.com/UXARRAY/uxarray-mcp-server/blob/main/README.md#local-install). > 2. A user account on the HPC machine, with shell access. > 3. The data you want to analyze is readable by your HPC account. > 4. Either: someone has given you a **Globus Compute endpoint UUID** for > that machine, **or** you plan to stand up your own — in which case > stop and read [operating-an-endpoint.md](operating-an-endpoint.md) first, > then come back here. -> 5. You have read [SECURITY.md](../SECURITY.md). Connecting an endpoint +> 5. You have read [SECURITY.md](https://github.com/UXARRAY/uxarray-mcp-server/blob/main/SECURITY.md). Connecting an endpoint > means the operator can see what you submit. The total setup is **5 steps**, expect **15–30 minutes** the first time. @@ -117,7 +117,7 @@ uxarray-mcp endpoints list ``` The endpoint UUID is **not a secret** by itself, but you still shouldn't -post it publicly — see [SECURITY.md](../SECURITY.md). It lives in +post it publicly — see [SECURITY.md](https://github.com/UXARRAY/uxarray-mcp-server/blob/main/SECURITY.md). It lives in `~/.config/uxarray-mcp/config.yaml` (which is in `.gitignore` if you're in the repo).