
fix(mcp): return tool images via langgraph Command (no more UUID side-channel)#2189

Closed
Mgczacki wants to merge 11 commits into dimensionalOS:main from Mgczacki:worktree-mcp-client-image-langgraph-command

Conversation

Collaborator

@Mgczacki Mgczacki commented May 20, 2026

Context

I hit this while developing #2188 (the memory2 agent + Space raster backend) and figured the fix would be worth porting to the MCP client, which had the same problem and was working around it in a less elegant way.

The underlying constraint

Many OpenAI chat-completion models don't accept image content inside a tool result — you can't just stuff an image into a ToolMessage. The canonical workaround is to deliver the image as a HumanMessage after the tool returns. That part isn't avoidable.

What we can avoid is delivering that HumanMessage in a separate, later agent turn — which is what the MCP client was doing.

What was happening before

When an MCP tool returned non-text content (an image), mcp_client.py:

  1. Minted a UUID for the image.
  2. Returned a placeholder ToolMessage to the agent: "Tool call started with UUID: X. You will be updated with the result soon".
  3. Queued the actual image as a fresh HumanMessage via add_message, which got picked up on the next agent turn.

The agent therefore saw a fake "wait for it" tool result, kept reasoning with no real data, and only got the image one turn later. Two extra round-trips per image, and the tool result the agent reasons over isn't the actual content.

What we do now

We're already on LangChain's create_agent, which is a thin wrapper over LangGraph. LangGraph lets a tool return a Command(update={...}) instead of a single message, so one tool call can append multiple messages to state as its result. The state's messages reducer (the tool-aware one in agent.py) then reorders them so that each image-carrying HumanMessage lands next to its matching ToolMessage, paired by additional_kwargs["tool_call_id"].

Concretely, the MCP tool now returns:

Command(update={
    "messages": [
        ToolMessage(content="<text part or stub>", tool_call_id=...),
        HumanMessage(content=[image_block], additional_kwargs={"tool_call_id": ...}),
    ]
})

Both messages land in the same turn. The HumanMessage-with-image workaround for OpenAI's API constraint is still there — but it's now part of the tool's response, not a fake-out + late delivery. The reducer guarantees ordering even when multiple parallel tool calls return images simultaneously.
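The shape of that update can be sketched without LangChain. This is a minimal illustration, not the code in mcp_client.py: `build_update` is a hypothetical name, plain dicts stand in for ToolMessage and HumanMessage, and any content item whose type is not "text" is assumed to be an image block. The fallback stub wording follows the "{name} returned N artefact(s)" sentinel mentioned in a later commit message on this PR.

```python
from typing import Any


def build_update(name: str, items: list[dict[str, Any]], tool_call_id: str) -> dict[str, Any]:
    """Split tool output into a text tool message plus an image-carrying human message."""
    texts = [item["text"] for item in items if item.get("type") == "text"]
    # Assumption: every non-text item is an image block.
    images = [item for item in items if item.get("type") != "text"]

    # Use the tool's own text when present; otherwise fall back to a short stub.
    tool_text = "\n".join(texts) if texts else f"{name} returned {len(images)} artefact(s)"
    messages: list[dict[str, Any]] = [
        {"role": "tool", "content": tool_text, "tool_call_id": tool_call_id}
    ]
    if images:
        messages.append(
            {
                "role": "human",
                "content": images,
                # The tag lets a reducer pair this message with its tool message.
                "additional_kwargs": {"tool_call_id": tool_call_id},
            }
        )
    return {"messages": messages}
```

In this sketch a text-only result yields a single tool message; whether the real client wraps that case in a Command as well is not stated here.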

Bonus fix: InjectedToolCallId for JSON-Schema tools

LangChain auto-injects InjectedToolCallId into tool arguments when args_schema is a Pydantic model. MCP gives us its schema as a plain JSON-Schema dict (the authoritative LLM contract on the MCP side), and LangChain does not auto-inject for that form. This PR adds _McpStructuredTool, a tiny StructuredTool subclass that does the injection for the JSON-Schema path. Without it, the tool wouldn't know its own tool_call_id and the reducer pairing would break.

Future steps

The same Command(update=...) pattern unlocks more than just "tools that return images". Two observations from driving the memory2 agent against this fix:

  • The agent skips steps. Skills that prescribe verification sub-procedures (e.g. "call frames_facing to refine each candidate position before merging") are only partially followed even when the prose is strict. The current harness has no way to enforce that a step ran before the agent can submit an answer; the LLM is the only authority. Closing this gap needs explicit task management in the harness: a way to gate the final-answer tool on whether the prerequisite tool calls actually fired.

  • Long-horizon tasks benefit from subagent delegation. Some questions (e.g. room segmentation across a 5-minute recording, exhaustive frontier exploration) are big enough that a single agent context fills up. The cleanest way to scale is to let the top-level agent dispatch self-contained sub-tasks to subagents — each with its own context window — and integrate their results. That keeps the parent agent's context focused on synthesis instead of bookkeeping.

These two together — harness-level task management + subagent delegation — are most naturally implemented as another module: a long-running, compute-heavier agent specialised for complex tasks, that the existing McpClient-style short-horizon agents can defer to in a true multi-agent setting. This PR is the prerequisite (image-returning tools work cleanly across agents); the multi-agent module is the follow-up.
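To make the first observation concrete, here is one way a harness-side gate could look. This is purely a sketch of the idea: `StepGate` and its methods are hypothetical names, nothing like it exists in this PR, and a real version would hook into the agent's tool-dispatch path rather than rely on manual `record` calls.

```python
class StepGate:
    """Hypothetical gate: blocks the final answer until required tools have run."""

    def __init__(self, required: set[str]) -> None:
        self.required = set(required)
        self.fired: set[str] = set()

    def record(self, tool_name: str) -> None:
        """Call this whenever the agent actually invokes a tool."""
        self.fired.add(tool_name)

    def missing(self) -> set[str]:
        """Names of required tools that have not fired yet."""
        return self.required - self.fired

    def submit(self, answer: str) -> str:
        """Return the answer, or an instruction to run the skipped steps first."""
        missing = self.missing()
        if missing:
            return "Cannot submit yet; run first: " + ", ".join(sorted(missing))
        return answer
```

Returning the refusal as a tool result, rather than raising, gives the LLM a chance to go back and run the skipped step.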

Test plan

  • pytest dimos/agents/mcp/test_mcp_client_unit.py — covers the new Command-return path and the multi-image / parallel-tool-call ordering.
  • Manual: drive an MCP tool that returns an image; verify the agent sees the image in the same turn as the tool call (no "started with UUID" placeholder).

Mario Garrido and others added 2 commits May 19, 2026 12:41
…de-channel

The MCP client previously handled non-text tool content (images) by minting
a UUID, returning a placeholder ToolMessage ("Tool call started with UUID:
X. You will be updated with the result soon"), and queuing the image as a
fresh HumanMessage via add_message. The image therefore arrived in a new
agent turn rather than as the tool call's actual result.

Switch to the langgraph pattern used by examples/memory2_agent/tools.py:
the tool returns Command(update={"messages": [ToolMessage, HumanMessage]})
so the image is applied within the same turn. The HumanMessage carries
additional_kwargs["tool_call_id"] so the state reducer can pair it with
its ToolMessage when multiple parallel tool calls return images at once.

Adds _McpStructuredTool, a small StructuredTool subclass that injects
InjectedToolCallId for tools whose args_schema is a JSON-Schema dict
(MCP's authoritative LLM contract) — langchain only handles this
automatically for Pydantic args_schemas.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Mgczacki Mgczacki marked this pull request as ready for review May 20, 2026 06:44

codecov Bot commented May 20, 2026

❌ 1 Tests Failed:

| Tests completed | Failed | Passed | Skipped |
| --- | --- | --- | --- |
| 2047 | 1 | 2046 | 69 |
View the top 1 failed test(s) by shortest run time
dimos.agents.mcp.test_mcp_client::test_image
Stack Traces | 24.9s run time
agent_setup = <function agent_setup.<locals>.fn at 0x74bab5ab37e0>

    @pytest.mark.self_hosted
    def test_image(agent_setup):
        history = agent_setup(
            blueprints=[Visualizer.blueprint()],
            messages=[
                HumanMessage(
                    "What do you see? Take a picture using your camera and describe it. "
                    "Please mention one of the words which best match the image: "
                    "'stadium', 'cafe', 'battleship'."
                )
            ],
            system_prompt="You are a helpful assistant that can use a camera to take pictures.",
        )
    
        response = history[-1].content.lower()
>       assert "cafe" in response
E       assert 'cafe' in "i've taken a picture. let me analyze and describe it for you.\nthe image features an expansive outdoor stadium. from the camera's perspective, the word 'stadium' best matches the image. is there anything else you'd like to know or do?"

agent_setup = <function agent_setup.<locals>.fn at 0x74bab5ab37e0>
history    = [HumanMessage(content="What do you see? Take a picture using your camera and describe it. Please mention one of the wo...s={}, response_metadata={}, id='lc_run--019e46c7-caee-7453-9c1c-69e83028cb11-0', tool_calls=[], invalid_tool_calls=[])]
response   = "i've taken a picture. let me analyze and describe it for you.\nthe image features an expansive outdoor stadium. from the camera's perspective, the word 'stadium' best matches the image. is there anything else you'd like to know or do?"

.../agents/mcp/test_mcp_client.py:199: AssertionError


Contributor

greptile-apps Bot commented May 20, 2026

Greptile Summary

This PR replaces the UUID side-channel workaround for image-returning MCP tools with LangGraph's Command(update={...}) pattern, delivering both the ToolMessage stub and the image-carrying HumanMessage in the same agent turn. It also introduces _McpStructuredTool (overriding public invoke/ainvoke) to inject tool_call_id for JSON-Schema-dict tools, and _fix_parallel_tool_batches to keep parallel ToolMessages contiguous as OpenAI requires.

  • call_tool now returns Command(update={"messages": [ToolMessage, HumanMessage]}) when the MCP response contains image content, eliminating the extra agent turn and the misleading "started with UUID" placeholder.
  • _McpStructuredTool.invoke/ainvoke injects tool_call_id via public Runnable API (not private _injected_args_keys), addressing the concern raised in a prior review thread.
  • _fix_parallel_tool_batches reorders interleaved [Tool₁, Human₁, Tool₂, Human₂, …] batches to [Tool₁, Tool₂, …, Human₁, Human₂, …], triggered only when all expected parallel tool responses are present, ensuring OpenAI never sees a non-contiguous parallel tool batch.
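The reordering in the last bullet can be sketched with plain dicts. This is an illustration under stated assumptions, not the actual _fix_parallel_tool_batches: `fix_parallel_batch` is a hypothetical name, messages are dicts with a "role" key, and "tool_calls" is simplified to a list of call-id strings.

```python
from typing import Any, Optional


def _tag(message: dict[str, Any]) -> Optional[str]:
    """Tool-call id a human message was tagged with, if any."""
    return message.get("additional_kwargs", {}).get("tool_call_id")


def fix_parallel_batch(messages: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """After the last tool-calling AI message, put tool results first, then images."""
    ai_index = None
    for i in range(len(messages) - 1, -1, -1):
        if messages[i]["role"] == "ai" and messages[i].get("tool_calls"):
            ai_index = i
            break
    if ai_index is None:
        return messages

    expected = list(messages[ai_index]["tool_calls"])
    tail = messages[ai_index + 1:]
    tools_by_id = {m["tool_call_id"]: m for m in tail if m["role"] == "tool"}
    # Only rewrite once every expected tool response has arrived.
    if any(call_id not in tools_by_id for call_id in expected):
        return messages

    tools = [tools_by_id[call_id] for call_id in expected]
    images = sorted(
        (m for m in tail if m["role"] == "human" and _tag(m) in expected),
        key=lambda m: expected.index(_tag(m)),
    )
    moved = {id(m) for m in tools + images}
    others = [m for m in tail if id(m) not in moved]
    return messages[: ai_index + 1] + tools + images + others
```

Because an already-ordered batch maps to itself, applying the function twice gives the same result as applying it once, which is the idempotence property the summary relies on.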

Confidence Score: 5/5

Safe to merge — the Command-based image delivery and parallel-batch reordering are both well-reasoned and thoroughly tested.

The rewrite replaces a two-turn UUID side-channel with a single-turn Command that delivers both the ToolMessage stub and the image-carrying HumanMessage atomically. The _fix_parallel_tool_batches reducer only rewrites a parallel batch when all expected ToolMessages are present, making it idempotent and safe when tool results arrive out of order. _McpStructuredTool correctly injects tool_call_id via the public invoke/ainvoke surface rather than private LangChain internals. All previously missing branches (text+image return, bare-dict error, parallel reorder variants) now have explicit test coverage. No incorrect filtering, missing required fields, or broken invariants were found.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| dimos/agents/mcp/mcp_client.py | Core rewrite: adds Command-based image delivery, _McpStructuredTool for public-API tool_call_id injection, and _fix_parallel_tool_batches reducer to maintain OpenAI contiguity invariant. Logic is sound across all traced parallel-tool orderings. |
| dimos/agents/mcp/test_mcp_client_unit.py | Adds thorough test coverage for the new Command return path (image-only, text+image, invoke injection, bare-dict error), parallel batch reordering, and the reducer composition. All previously untested branches are now covered. |

Sequence Diagram

sequenceDiagram
    participant User
    participant McpClient
    participant LangGraph
    participant LLM as OpenAI LLM
    participant ToolNode
    participant MCP as MCP Server

    User->>McpClient: human message
    McpClient->>LangGraph: "stream({messages: history})"
    LangGraph->>LLM: history (via _OrderedAgentState)
    LLM-->>LangGraph: "AIMessage(tool_calls=[{id:a},{id:b}])"

    par Parallel tool calls
        LangGraph->>ToolNode: invoke tool A (text-only)
        ToolNode->>MCP: tools/call A
        MCP-->>ToolNode: "{content:[{type:text}]}"
        ToolNode-->>LangGraph: ToolMessage(a, text result)
    and
        LangGraph->>ToolNode: invoke tool B (image)
        ToolNode->>MCP: tools/call B
        MCP-->>ToolNode: "{content:[{type:image_url}]}"
        ToolNode-->>LangGraph: "Command(update={messages:[ToolMessage(b), HumanMessage(image, tool_call_id=b)]})"
    end

    Note over LangGraph: _reorder_tool_responses reducer runs
    Note over LangGraph: [AI, Tool(a), Tool(b), Human(image)] ✓

    LangGraph->>LLM: ordered history with image in same turn
    LLM-->>LangGraph: final AIMessage
    LangGraph-->>McpClient: stream updates
    McpClient-->>User: response

Reviews (4): Last reviewed commit: "Adding proposed changes, fixing stash of..."

Comment thread dimos/agents/mcp/mcp_client.py Outdated
self, tool_input: str | dict[str, Any], tool_call_id: str | None
) -> tuple[tuple[Any, ...], dict[str, Any]]:
args, kwargs = super()._to_args_and_kwargs(tool_input, tool_call_id)
if "tool_call_id" in self._injected_args_keys:
Contributor

P2 Dependency on private _injected_args_keys API

self._injected_args_keys is a single-underscore property — a LangChain implementation detail not part of the public contract. The GitHub issue #36221 ("Typing issue in StructuredTool._injected_args_keys", March 2026) confirms it's actively being changed. If a LangChain version renames or redefines this property, _McpStructuredTool._to_args_and_kwargs silently skips the injection and every image-returning tool call fails with a TypeError (call_tool() missing required argument tool_call_id). Consider adding a guard-check at construction time (e.g., an assert hasattr(self, "_injected_args_keys")) or documenting the expected minimum langchain-core version.

Collaborator Author

We can update that if updating the langchain version changes the way the library does it. That's why pinned libraries exist.

Contributor

It's better to write a future-proof version now than wait for the private API to break and implement the fix then.

_to_args_and_kwargs and _injected_args_keys should not be used.

If the goal is to capture tool_call_id you can do so by defining something like this:

from typing import Any

from langchain_core.tools import StructuredTool


class _McpStructuredTool(StructuredTool):
    def run(
        self, tool_input: str | dict[str, Any], *args: Any, **kwargs: Any
    ) -> Any:
        tool_input = _inject_tool_call_id(tool_input, kwargs.get("tool_call_id"))
        return super().run(tool_input, *args, **kwargs)

    async def arun(
        self, tool_input: str | dict[str, Any], *args: Any, **kwargs: Any
    ) -> Any:
        tool_input = _inject_tool_call_id(tool_input, kwargs.get("tool_call_id"))
        return await super().arun(tool_input, *args, **kwargs)


def _inject_tool_call_id(
    tool_input: str | dict[str, Any], tool_call_id: str | None
) -> str | dict[str, Any]:
    if not isinstance(tool_input, dict):
        return tool_input
    if tool_call_id is None:
        raise ValueError(
            "MCP tool requires a tool_call_id; invoke via a ToolCall, not a bare dict."
        )
    return {**tool_input, "tool_call_id": tool_call_id}

Comment thread dimos/agents/mcp/test_mcp_client_unit.py
Mario Garrido and others added 2 commits May 20, 2026 00:58
Adds `narrate_picture` to the mock MCP server and a test asserting
that when a tool returns both text and an image, the ToolMessage
carries the tool's real text (not the
"{name} returned N artefact(s)" fallback sentinel) while the image
still rides back on the follow-up HumanMessage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread dimos/agents/mcp/mcp_client.py Outdated
# Vision content can't be embedded inside a ToolMessage for OpenAI
# (and others), so we use Command to append a follow-up HumanMessage
# carrying the image blocks within the same agent turn. Mirrors the
# pattern used by examples/memory2_agent/tools.py.
Contributor

What is memory2_agent? I don't see this in the codebase.

Collaborator Author

Apologies for the lack of context, it's the agent in this PR: #2188 - will clean up this comment and others that reference it.

Mario Garrido added 2 commits May 20, 2026 10:46
…com:mgczacki/dimos into worktree-mcp-client-image-langgraph-command
# carrying the image blocks within the same agent turn. Mirrors the
# pattern used by examples/memory2_agent/tools.py.
#
# The HumanMessage is tagged with `additional_kwargs["tool_call_id"]`
Contributor

Where is additional_kwargs["tool_call_id"] used?

In langchain_core it seems like it's meant to be used for additional fields used by some providers.

Collaborator Author

@Mgczacki Mgczacki May 20, 2026

It seems I stashed the reducer that reads it just before uploading the PR 😢. I'll update shortly once I reapply it and double-check everything. To answer your question: the reducer reads it to decide how to reorder incoming messages. A HumanMessage with tool_call_id in additional_kwargs is moved to just after the tool response that matches that tool_call_id.

Mario Garrido and others added 5 commits May 20, 2026 11:02
Add `dimos.telemetry`, a self-contained module that wraps the agent's
per-turn execution in an OTEL span. The base install is unaffected:
this package imports no opentelemetry packages at module load, and
`dimos.telemetry.span` is a silent no-op until tracing is wired up.

Wiring options:

  * Env-driven (recommended). Install the extra and set
    `OTEL_EXPORTER_OTLP_ENDPOINT`. `enable()` runs automatically on first
    import and configures the OTLP HTTP exporter plus, when available,
    `openinference-instrumentation-langchain` for LangChain auto-spans.

  * Caller-owned provider: `configure_tracing(my_provider)`.

  * Standard OTEL idiom: `DimosInstrumentor().instrument(...)`. The class
    is resolved lazily via module-level `__getattr__` so the heavy
    `opentelemetry.instrumentation` import only runs on attribute access.

Vendor-agnostic via OTLP: Langfuse, Arize Phoenix, LangSmith, and Opik
all accept the same pipeline; selection is by env var, not code.

Each McpClient instance now generates a UUID at construction and stamps
it on every `agent.turn` span via `session_attributes()`, which sets
both the OpenInference `session.id` (Langfuse, Phoenix) and
`langsmith.trace.session_id` (LangSmith). Backends group all per-turn
traces from one instance into a single session in their UI. Opik has no
OTEL→Threads mapping yet (comet-ml/opik#3441); use its native SDK there.
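The "silent no-op until tracing is wired up" behaviour and the session stamping can be sketched as follows. This is an assumption-laden illustration, not the contents of `dimos.telemetry`: `set_tracer` is a hypothetical hook, the `span` signature is guessed, and the tracer is any object exposing an OpenTelemetry-style `start_as_current_span`. Only the two attribute keys are taken from the commit message.

```python
import uuid
from contextlib import contextmanager
from typing import Any, Iterator, Optional

_tracer: Optional[Any] = None  # stays None until tracing is configured


def set_tracer(tracer: Any) -> None:
    """Hypothetical wiring hook: remember a tracer-like object for later spans."""
    global _tracer
    _tracer = tracer


@contextmanager
def span(name: str, attributes: Optional[dict[str, str]] = None) -> Iterator[None]:
    """Open a span when a tracer is configured; otherwise do nothing."""
    if _tracer is None:
        yield
        return
    with _tracer.start_as_current_span(name, attributes=attributes or {}):
        yield


def session_attributes(session_id: str) -> dict[str, str]:
    """Set both session keys so different backends can group turns together."""
    return {"session.id": session_id, "langsmith.trace.session_id": session_id}


session_id = str(uuid.uuid4())  # one id per client instance
with span("agent.turn", session_attributes(session_id)):
    pass  # run one agent turn here; a no-op span until set_tracer is called
```

Keeping the no-op path free of any opentelemetry import is what lets the base install stay unaffected.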
Two unrelated dev-only patches kept together so they can be reverted
as a single unit before opening any feature PR:

  - dimos/models/segmentation/edge_tam.py
    Make the inference device configurable, default to CUDA when
    available, fall back to CPU with a warning instead of hard-failing.
    (originally from Jeff Hykin)

  - dimos/robot/unitree/mujoco_connection.py
    Use DYLD_FALLBACK_LIBRARY_PATH so we don't shadow Apple's
    Accelerate with conda's stale libblas, and redirect mjpython
    stdout/stderr to files instead of subprocess.PIPE to avoid
    deadlock when the parent doesn't drain stderr.

Both unblock running agentic blueprints on Apple Silicon. Revert this
commit before opening the agent_graph_understanding feature PR.
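The stdout/stderr change in mujoco_connection.py follows a general pattern that can be shown in isolation. The sketch below is not the code from that file: `run_with_log_files` and the log file names are hypothetical, and a plain Python child process stands in for mjpython. It shows why writing to files avoids the deadlock: a child writing to a pipe blocks once the pipe buffer is full if the parent never reads it, whereas a file never fills up in that sense.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_with_log_files(cmd: list[str], log_dir: Path) -> tuple[int, Path, Path]:
    """Run cmd with stdout/stderr written to files, so no pipe buffer can fill up."""
    out_path = log_dir / "child.stdout.log"
    err_path = log_dir / "child.stderr.log"
    with open(out_path, "wb") as out, open(err_path, "wb") as err:
        proc = subprocess.Popen(cmd, stdout=out, stderr=err)
        returncode = proc.wait()  # safe: the child never blocks on a full pipe
    return returncode, out_path, err_path


# Usage: a child that writes far more to stderr than a typical pipe buffer holds.
with tempfile.TemporaryDirectory() as tmp:
    code, _, err_log = run_with_log_files(
        [sys.executable, "-c", "import sys; sys.stderr.write('x' * 200_000)"],
        Path(tmp),
    )
    print(code, err_log.stat().st_size)
```

With `stderr=subprocess.PIPE` and a bare `proc.wait()`, the same child could hang; the Python documentation warns about exactly this combination.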
@Mgczacki
Collaborator Author

Closing and reopening as a branch within the repo.

@Mgczacki Mgczacki closed this May 20, 2026