diff --git a/docs/toolhive/guides-vmcp/optimizer.mdx b/docs/toolhive/guides-vmcp/optimizer.mdx index aa24b63e..a6b005cf 100644 --- a/docs/toolhive/guides-vmcp/optimizer.mdx +++ b/docs/toolhive/guides-vmcp/optimizer.mdx @@ -10,9 +10,9 @@ number of tools exposed to clients can grow quickly. The optimizer addresses this by filtering tools per request, reducing token usage and improving tool selection accuracy. -For the desktop/CLI approach using the MCP Optimizer container, see the +For a step-by-step tutorial that walks through the full setup, see the [MCP Optimizer tutorial](../tutorials/mcp-optimizer.mdx). This guide covers the -Kubernetes operator approach using VirtualMCPServer and EmbeddingServer CRDs. +configuration details for the VirtualMCPServer and EmbeddingServer CRDs. ## Benefits @@ -146,27 +146,10 @@ are: For the complete field reference, see the [EmbeddingServer CRD specification](../reference/crd-spec.md#apiv1alpha1embeddingserver). -:::warning[ARM64 compatibility] +:::tip[ARM64 support] -The default TEI CPU images depend on Intel MKL, which is x86_64-only. No -official ARM64 images exist yet. On ARM64 nodes (including Apple Silicon with -kind), you can run the amd64 image under emulation as a workaround. - -First, pull the amd64 image and load it into your cluster: - -```bash -docker pull --platform linux/amd64 \ - ghcr.io/huggingface/text-embeddings-inference:cpu-1.7 -kind load docker-image \ - ghcr.io/huggingface/text-embeddings-inference:cpu-1.7 -``` - -The `kind load` command is specific to kind. For other cluster distributions, -use the equivalent image-loading mechanism (for example, `ctr images import` for -containerd, or push the image to a registry your cluster can pull from). - -Then, pin the image in your EmbeddingServer so the operator uses the pre-pulled -tag instead of the default `cpu-latest`: +The default TEI image (`cpu-latest`) is x86_64-only. 
If you are running on ARM64
+nodes (for example, Apple Silicon), override the image in your EmbeddingServer:
 
 ```yaml title="embedding-server.yaml"
 apiVersion: toolhive.stacklok.dev/v1alpha1
@@ -175,13 +158,9 @@ metadata:
   name: my-embedding
   namespace: toolhive-system
 spec:
-  image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.7
+  image: ghcr.io/huggingface/text-embeddings-inference:cpu-arm64-latest
 ```
 
-Native ARM64 support is in progress upstream. Track the
-[TEI GitHub repository](https://github.com/huggingface/text-embeddings-inference)
-for updates.
-
 :::
 
 ## Tune the optimizer
 
@@ -314,7 +293,8 @@ spec:
 
 ## Related information
 
-- [MCP Optimizer tutorial](../tutorials/mcp-optimizer.mdx) - desktop/CLI setup
+- [MCP Optimizer tutorial](../tutorials/mcp-optimizer.mdx) - end-to-end
+  Kubernetes setup
 - [Optimizing LLM context](../concepts/tool-optimization.mdx) - background on
   tool filtering and context pollution
 - [Configure vMCP servers](./configuration.mdx)
diff --git a/docs/toolhive/tutorials/mcp-optimizer.mdx b/docs/toolhive/tutorials/mcp-optimizer.mdx
index d0ebaaad..337db246 100644
--- a/docs/toolhive/tutorials/mcp-optimizer.mdx
+++ b/docs/toolhive/tutorials/mcp-optimizer.mdx
@@ -1,54 +1,55 @@
 ---
 title: Reduce token usage with MCP Optimizer
 description:
-  Enable the MCP Optimizer to enhance tool selection and reduce token usage.
+  Deploy the MCP Optimizer on Kubernetes with Virtual MCP Server and an
+  EmbeddingServer to filter tools and reduce token usage.
 schema_type: tutorial
 ---
 
 ## Overview
 
-The ToolHive MCP Optimizer acts as an intelligent intermediary between AI
-clients and MCP servers. It provides tool discovery, unified access to multiple
-MCP servers through a single endpoint, and intelligent routing of requests to
-appropriate MCP tools.
+The MCP Optimizer acts as an intelligent intermediary between AI clients and MCP +servers. It provides tool discovery, unified access to multiple MCP servers +through a single endpoint, and intelligent routing of requests to appropriate +MCP tools. :::note[Moving to vMCP] The optimizer is now integrated into [Virtual MCP Server (vMCP)](../guides-vmcp/optimizer.mdx), which provides the same tool filtering and token reduction at the team level. You can deploy it in -Kubernetes today, and a local experience is coming soon. This tutorial covers -the standalone CLI approach in the meantime. +Kubernetes today, and a local experience is coming soon. This tutorial walks you +through the Kubernetes deployment. ::: -## About MCP Optimizer +In this tutorial, you deploy the optimizer on Kubernetes using Virtual MCP +Server (vMCP) and an EmbeddingServer for semantic tool search. -### Benefits +## What you'll learn -- **Reduced token usage**: Narrow down the toolset to only relevant tools for a - given task, minimizing context overload and token consumption -- **Improved tool selection**: Find the most appropriate tools across all - connected MCP servers -- **Simplified client configuration**: Connect to a single MCP Optimizer - endpoint instead of managing multiple MCP server connections +- How to create an MCPGroup with multiple backend MCP servers +- How to deploy an EmbeddingServer for semantic search +- How to create a VirtualMCPServer with the optimizer enabled +- How to connect your AI client to the optimized endpoint -### How it works +## About MCP Optimizer -Instead of flooding the model with all available tools, MCP Optimizer introduces -two lightweight primitives: +Instead of exposing every backend tool to the model, the optimizer introduces +two lightweight primitives: `find_tool` for semantic search and `call_tool` for +routing. This keeps context small and improves tool selection accuracy. 
For the +full parameter reference and configuration options, see +[Optimize tool discovery](../guides-vmcp/optimizer.mdx). -1. `find_tool`: Searches for the most relevant tools using hybrid semantic + - keyword search -2. `call_tool`: Routes the selected tool request to the appropriate MCP server +### How it works The workflow is as follows: -1. You send a prompt that requires tool assistance (for example, interacting - with a GitHub repo) +1. You send a prompt that requires tool assistance (for example, fetching a web + page) 2. The assistant calls `find_tool` with relevant keywords extracted from the prompt -3. MCP Optimizer returns the most relevant tools (up to 8 by default, but this +3. The optimizer returns the most relevant tools (up to 8 by default, but this is configurable) 4. Only those tools and their descriptions are included in the context sent to the model @@ -56,255 +57,337 @@ The workflow is as follows: ```mermaid flowchart TB - subgraph optimizerGroup["MCP Optimizer group (internal)"] + subgraph vmcpGroup["VirtualMCPServer"] direction TB - optimizer["MCP Optimizer"] + vmcp["vMCP (optimizer enabled)"] end - subgraph target["ToolHive group: default"] + subgraph embedding["EmbeddingServer"] + direction TB + tei["Text Embeddings Inference"] + end + subgraph backends["MCPGroup backends"] direction TB mcp1["MCP server"] mcp2["MCP server"] mcp3["MCP server"] end - client(["Client"]) <-- connects --> optimizerGroup - optimizer <-. discovers/routes .-> target + client(["Client"]) <-- "find_tool / call_tool" --> vmcpGroup + vmcp <-. "semantic search" .-> embedding + vmcp <-. 
"discovers / routes" .-> backends ``` ## Prerequisites -- One of the following container runtimes: - - macOS: Docker Desktop, Podman Desktop, or Rancher Desktop (using dockerd) - - Windows: Docker Desktop or Rancher Desktop (using dockerd) - - Linux: any container runtime (see [Linux setup](#linux-setup)) -- ToolHive CLI - -## Step 1: Install MCP servers in a ToolHive group - -Before you can use MCP Optimizer, you need to have one or more MCP servers -running in a ToolHive group. If you don't have any MCP servers set up yet, -follow these steps: - -Run one or more MCP servers in the `default` group. For this tutorial, you can -run the following example MCP servers: +Before starting this tutorial, make sure you have: + +- A Kubernetes cluster with the ToolHive operator installed (see + [Quickstart: Kubernetes Operator](../guides-k8s/quickstart.mdx)) +- `kubectl` configured to communicate with your cluster +- The [ToolHive CLI](../guides-cli/quickstart.mdx) installed on your local + machine (used in Step 4 to register the endpoint with your MCP clients) +- An MCP client (Visual Studio Code with GitHub Copilot is used in this + tutorial) + +## Step 1: Create an MCPGroup and deploy backend MCP servers + +Create an MCPGroup to organize the backend MCP servers that the optimizer will +index and route to: + +```yaml title="mcpgroup.yaml" +apiVersion: toolhive.stacklok.dev/v1alpha1 +kind: MCPGroup +metadata: + name: optimizer-demo + namespace: toolhive-system +spec: + description: Backend servers for the optimizer tutorial +``` -- `github`: Provides tools for interacting with GitHub repositories - ([guide](../guides-mcp/github.mdx?mode=cli)) -- `fetch`: Provides a web search tool to fetch recent news articles -- `time`: Provides a tool to get the current time in various time zones +Apply the resource: ```bash -thv run github -thv run fetch -thv run time +kubectl apply -f mcpgroup.yaml ``` -See the [Run MCP servers](../guides-cli/run-mcp-servers.mdx) guide for more 
-details. +Next, deploy two MCP servers in the group. Both reference `optimizer-demo` in +the `groupRef` field: + +```yaml {11-12,31-32} title="mcpservers.yaml" +apiVersion: toolhive.stacklok.dev/v1alpha1 +kind: MCPServer +metadata: + name: fetch + namespace: toolhive-system +spec: + image: ghcr.io/stackloklabs/gofetch/server + transport: streamable-http + proxyPort: 8080 + mcpPort: 8080 + groupRef: + name: optimizer-demo + resources: + limits: + cpu: '100m' + memory: '128Mi' + requests: + cpu: '50m' + memory: '64Mi' +--- +apiVersion: toolhive.stacklok.dev/v1alpha1 +kind: MCPServer +metadata: + name: osv + namespace: toolhive-system +spec: + image: ghcr.io/stackloklabs/osv-mcp/server + transport: streamable-http + proxyPort: 8080 + mcpPort: 8080 + groupRef: + name: optimizer-demo + resources: + limits: + cpu: '100m' + memory: '128Mi' + requests: + cpu: '50m' + memory: '64Mi' +``` -Verify the MCP servers are running: +Apply the resources and wait for both servers to be ready: ```bash -thv list +kubectl apply -f mcpservers.yaml +kubectl get mcpservers -n toolhive-system -w ``` -## Step 2: Connect your AI client - -Connect your AI client to the ToolHive group where the MCP servers are running -(for example, the `default` group). +You should see both servers with `Ready` status before continuing. :::note -For best results, connect your client to only the optimized group. If you -connect it to multiple groups, ensure there is no overlap in MCP servers between -the groups to avoid unpredictable behavior. 
- -::: - -Run the following command to register your AI client with the ToolHive group -where the MCP servers are running (for example, `default`): +If you still have an MCPServer left over from the +[K8s Operator Quickstart](../guides-k8s/quickstart.mdx), you can delete it first +to avoid confusion: ```bash -thv client setup +kubectl delete mcpserver fetch -n toolhive-system ``` -See the [Client configuration](../guides-cli/client-configuration.mdx) guide for -more details. - -Open your AI client and verify that it is connected to the correct MCP servers. -If you installed the `github`, `fetch`, and `time` servers, you should see -almost 50 tools available. +Then apply the YAML above, which creates a new `fetch` server with the correct +`groupRef`. -## Step 3: Enable MCP Optimizer +::: -If you are on Linux with native containers, follow the steps below but see -[Linux setup](#linux-setup) for the modified `thv run` command. +## Step 2: Deploy an EmbeddingServer -**Step 3.1: Run the API server** +The optimizer uses semantic search to find relevant tools. This requires an +EmbeddingServer, which runs a text embeddings inference (TEI) server. -MCP Optimizer uses the ToolHive API server to discover MCP servers and manage -client connections. +Create an EmbeddingServer with default settings. This deploys the +`BAAI/bge-small-en-v1.5` model. If you are running on ARM64 nodes (for example, +Apple Silicon with kind), uncomment the `image` line to use the ARM64 build: -You can run the API server in two ways. The simplest is to install and run the -ToolHive UI, which automatically starts the API server in the background. 
+```yaml title="embedding-server.yaml" +apiVersion: toolhive.stacklok.dev/v1alpha1 +kind: EmbeddingServer +metadata: + name: optimizer-embedding + namespace: toolhive-system +spec: + # Uncomment for Apple Silicon or other ARM64 platforms + # image: ghcr.io/huggingface/text-embeddings-inference:cpu-arm64-latest +``` -If you prefer to run the API server manually using the CLI, open a dedicated -terminal window and start it on a specific port: +Apply the resource: ```bash -thv serve --port 50100 +kubectl apply -f embedding-server.yaml ``` -Note the port number (`50100` in this example) for use in the next step. - -**Step 3.2: Create a dedicated group and run mcp-optimizer** +Wait for the EmbeddingServer to reach the `Ready` phase before proceeding. The +first startup may take a few minutes while the model downloads: ```bash -# Create the meta group -thv group create optimizer - -# Run mcp-optimizer in the dedicated group -thv run --group optimizer -e TOOLHIVE_PORT=50100 mcp-optimizer +kubectl get embeddingserver optimizer-embedding -n toolhive-system -w ``` -If you are running the API server using the ToolHive UI, omit the -`TOOLHIVE_PORT` environment variable. +:::info[What's happening?] + +The EmbeddingServer deploys a TEI container that generates vector embeddings +from text. The optimizer uses these embeddings to perform semantic search across +all backend tools, finding the most relevant tools for a given query even when +the exact keywords don't match. + +::: -**Step 3.3: Configure your AI client for the meta group** +## Step 3: Create a VirtualMCPServer with the optimizer + +Create a VirtualMCPServer that aggregates the backend servers and enables the +optimizer. 
Adding `embeddingServerRef` is the only change needed to enable the +optimizer - sensible defaults are applied automatically: + +```yaml title="virtualmcpserver.yaml" +apiVersion: toolhive.stacklok.dev/v1alpha1 +kind: VirtualMCPServer +metadata: + name: optimizer-vmcp + namespace: toolhive-system +spec: + # highlight-start + groupRef: + name: optimizer-demo + embeddingServerRef: + name: optimizer-embedding + # highlight-end + incomingAuth: + type: anonymous + serviceType: ClusterIP + config: + aggregation: + conflictResolution: prefix + conflictResolutionConfig: + prefixFormat: '{workload}_' +``` -Remove your client from the `default` group. For example, to unregister Cursor: +Apply the resource: ```bash -thv client remove cursor --group default +kubectl apply -f virtualmcpserver.yaml ``` -Then, register your client with the `optimizer` group: +Check the status: ```bash -# Run the group setup, select the optimizer group, and then select your client -thv client setup +kubectl get virtualmcpservers -n toolhive-system +``` -# Verify the configuration -thv client list-registered +After about 30 seconds, you should see output similar to: + +```text +NAME PHASE URL BACKENDS AGE READY +optimizer-vmcp Ready http://vmcp-optimizer-vmcp.toolhive-system.svc.cluster.local:4483 2 30s True ``` -:::note +:::info[What's happening?] -Your client now connects only to the `optimizer` group and sees only the -`mcp-optimizer` MCP server. +Setting `embeddingServerRef` tells the operator to enable the optimizer on this +VirtualMCPServer. Instead of exposing all backend tools directly, the optimizer +builds a semantic index of tools and exposes only `find_tool` and `call_tool` to +clients. This dramatically reduces the number of tools (and tokens) sent to the +model. 
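To build intuition for what `find_tool` does, here is a toy, self-contained
sketch of embedding-based tool search. It is not ToolHive's implementation: the
real optimizer gets dense vectors from the EmbeddingServer's model, while this
sketch hashes character trigrams, and the tool names, descriptions, and the
`find_tool` signature are invented for illustration:

```python
import hashlib
import math


def embed(text: str, dim: int = 256) -> list[float]:
    """Hash character trigrams into a fixed-size unit vector.

    A crude stand-in for a real embedding model such as
    BAAI/bge-small-en-v1.5, which the EmbeddingServer serves.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        padded = f" {token} "
        for i in range(len(padded) - 2):
            trigram = padded[i : i + 3].encode()
            bucket = int(hashlib.md5(trigram).hexdigest(), 16) % dim
            vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


# Invented tool names and descriptions, prefixed "{workload}_" the way the
# conflictResolution setting in this tutorial would name them.
TOOLS = {
    "fetch_fetch": "Fetch a URL and return the web page contents",
    "osv_query": "Query the OSV database for known vulnerabilities in a package",
}
INDEX = {name: embed(desc) for name, desc in TOOLS.items()}


def find_tool(query: str, limit: int = 8) -> list[str]:
    """Rank indexed tools by cosine similarity to the query (unit vectors)."""
    q = embed(query)

    def score(name: str) -> float:
        return sum(a * b for a, b in zip(q, INDEX[name]))

    return sorted(INDEX, key=score, reverse=True)[:limit]


print(find_tool("download the contents of a web page"))
# 'fetch_fetch' should rank first: it shares far more trigrams with the query.
```

A client never calls anything like this directly: it issues a normal MCP
`tools/call` for `find_tool`, then a second one for `call_tool` with whichever
backend tool came back.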
::: -The resulting configuration should look like this: +## Step 4: Connect your AI client -```mermaid -flowchart TB - subgraph meta["ToolHive group: optimizer"] - direction TB - optimizer["mcp-optimizer"] - end - subgraph def["ToolHive group: default"] - direction TB - mcp1["github"] - mcp2["fetch"] - mcp3["time"] - end +The vMCP service runs inside Kubernetes and is not directly reachable by desktop +AI clients. This tutorial uses `kubectl port-forward` because it works with any +cluster, but in production you would typically expose the service through an +Ingress, Gateway API, or LoadBalancer. See +[Expose the service](../guides-vmcp/configuration.mdx#expose-the-service) for +the available options. + +In a separate terminal, port-forward the vMCP service to your local machine: + +```bash +kubectl port-forward service/vmcp-optimizer-vmcp -n toolhive-system 4483:4483 +``` + +Test the health endpoint: - client(["Client"]) <-- connects --> meta - optimizer <-. discovers/routes .-> def - client x-. 🚫 .-x def +```bash +curl http://localhost:4483/health ``` -## Step 4: Sample prompts +You should see `{"status":"ok"}`. -After you configure and run MCP Optimizer, you can use the same prompts you -would normally use with individual MCP servers. The Optimizer automatically -discovers and routes to appropriate tools. +The ToolHive CLI bridges the remaining gap: it registers the port-forwarded +endpoint as a local workload and automatically updates your MCP client +configuration to point at it. -Using the example MCP servers above, here are some sample prompts: +Register the port-forwarded vMCP endpoint as a ToolHive-managed workload: -- "Get the details of GitHub issue 1911 from the stacklok/toolhive repo" -- "List recent PRs from the stacklok/toolhive repo" -- "Fetch the latest news articles about AI" -- "What is the current time in Tokyo?" 
+```bash +thv run http://localhost:4483/mcp --name optimizer-vmcp +``` -Watch how MCP Optimizer intelligently selects and routes to the relevant tools -across the connected MCP servers, reducing token usage and improving response -quality. +:::tip -To check your token savings, you can ask the optimizer: +If you haven't set up client configuration yet, run `thv client setup` to +register your MCP clients. See +[Client configuration](../guides-cli/client-configuration.mdx) for more details. -- "How many tokens did I save using MCP Optimizer?" +::: -## Linux setup +Open your AI client and check its MCP configuration. You should see only two +tools available: `find_tool` and `call_tool`. This confirms the optimizer is +working. -The setup depends on which type of container runtime you are using. +## Step 5: Test the optimizer -### VM-based container runtimes +Try these sample prompts to verify the optimizer is routing requests correctly +across both backend MCP servers: -If you are using a container runtime that runs containers inside a virtual -machine (such as Docker Desktop for Linux), the setup is the same as on macOS -and Windows. No additional configuration is needed - follow the steps above. +- "Fetch the contents of https://docs.stacklok.com and summarize the page" +- "Check if the Go package github.com/stacklok/toolhive has any known + vulnerabilities" -### Native containers (Docker, Podman, Rancher Desktop, and others) +Watch how the optimizer uses `find_tool` to locate the relevant tool across all +backends, then `call_tool` to execute it - all through a single endpoint. -:::note +To check your token savings, send this prompt to your AI client: -Before running the command below, complete the following: +- "How many tokens did I save using MCP Optimizer?" -1. [Step 1](#step-1-install-mcp-servers-in-a-toolhive-group) - install your MCP - servers -2. [Step 2](#step-2-connect-your-ai-client) - connect your AI client -3. 
[Step 3](#step-3-enable-mcp-optimizer) - start the API server, create the - optimizer group, and reconfigure your client. When you reach the - `thv run mcp-optimizer` command, use the Linux-specific command below - instead. +:::note + +With only two backend MCP servers and a small number of tools, the optimizer may +report minimal or no token savings. The benefit becomes more significant as you +add more backends and tools to your MCPGroup. ::: -Most Linux container runtimes run containers natively on the host kernel. -Because containers run directly on the host kernel, `host.docker.internal` is -not automatically configured - unlike on macOS and Windows, where Docker Desktop -sets it up to let containers reach the host from inside a virtual machine. -Instead, you need to pass a couple of extra flags: +## Clean up + +Remove the local workload and delete the Kubernetes resources when you're done: ```bash -# Run mcp-optimizer with host networking -thv run --group optimizer --network host \ - -e TOOLHIVE_HOST=127.0.0.1 \ - -e ALLOWED_GROUPS=default \ - mcp-optimizer +thv rm optimizer-vmcp +kubectl delete virtualmcpserver optimizer-vmcp -n toolhive-system +kubectl delete embeddingserver optimizer-embedding -n toolhive-system +kubectl delete mcpserver fetch osv -n toolhive-system +kubectl delete mcpgroup optimizer-demo -n toolhive-system ``` -- `--network host` lets the container reach the host directly, achieving the - same result as the automatic bridge Docker Desktop sets up on macOS and - Windows. -- `TOOLHIVE_PORT` specifies the port the API server is listening on. If you - started it manually with a custom port in Step 3.1, pass - `-e TOOLHIVE_PORT=` here as well. Omit it if you are using the ToolHive - UI to run the API server. -- `TOOLHIVE_HOST` tells `mcp-optimizer` to connect to `127.0.0.1` instead of - `host.docker.internal`. -- `ALLOWED_GROUPS` tells the optimizer which group's MCP servers to discover, - index, and route requests to. 
Replace `default` with the name of the group you - want to optimize. +To tear down the entire kind cluster from the K8s Quickstart: -To change which groups MCP Optimizer can optimize after initial setup, remove -the workload and run the command again with the updated `ALLOWED_GROUPS` value -(see [Remove a server](../guides-cli/manage-mcp-servers.mdx#remove-a-server)). - -See [Step 4: Sample prompts](#step-4-sample-prompts) to verify the setup. +```bash +kind delete cluster --name toolhive +``` -## What's next? +## Next steps -- Experiment with different MCP servers to see how MCP Optimizer enhances tool - selection and reduces token usage -- Explore the [vMCP optimizer](../guides-vmcp/optimizer.mdx) for team-level - optimization in Kubernetes +- [Tune the optimizer](../guides-vmcp/optimizer.mdx#tune-the-optimizer) to + adjust search parameters for your workload +- [Configure authentication](../guides-vmcp/authentication.mdx) for production + deployments +- [Monitor vMCP activity](../guides-vmcp/telemetry-and-metrics.mdx) with + OpenTelemetry tracing and metrics +- [Configure failure handling](../guides-vmcp/failure-handling.mdx) for circuit + breakers and partial failure modes +- Provide feedback on your experience on the + [Stacklok Discord community](https://discord.gg/stacklok) ## Related information -- [Optimize tool discovery in vMCP](../guides-vmcp/optimizer.mdx) - Kubernetes - operator approach +- [Optimize tool discovery](../guides-vmcp/optimizer.mdx) - full parameter + reference, high availability, and ARM64 support details - [Optimizing LLM context](../concepts/tool-optimization.mdx) - background on tool filtering and context pollution +- [Virtual MCP Server overview](../concepts/vmcp.mdx) - conceptual overview of + vMCP +- [MCP Optimizer UI guide](../guides-ui/mcp-optimizer.mdx) - standalone desktop + approach without Kubernetes (legacy, being replaced by the vMCP path) +- [Quickstart: Kubernetes Operator](../guides-k8s/quickstart.mdx) - prerequisite + 
tutorial