Skip to content

Mcp client image langgraph command#2195

Open
Mgczacki wants to merge 13 commits into
mainfrom
mcp-client-image-langgraph-command
Open

Mcp client image langgraph command#2195
Mgczacki wants to merge 13 commits into
mainfrom
mcp-client-image-langgraph-command

Conversation

@Mgczacki
Copy link
Copy Markdown
Collaborator

Context

Continuing #2189 within the repo so CI/CD can run.

I hit this while developing #2188 (the memory2 agent + Space raster backend) and figured the fix would be worth porting to the MCP client, which had the same problem and was working around it in a less elegant way.

The underlying constraint

Many OpenAI chat-completion models don't accept image content inside a tool result — you can't just stuff an image into a ToolMessage. The canonical workaround is to deliver the image as a HumanMessage after the tool returns. That part isn't avoidable.

What we can avoid is delivering that HumanMessage in a separate, later agent turn — which is what the MCP client was doing.

What was happening before

When an MCP tool returned non-text content (an image), mcp_client.py:

  1. Minted a UUID for the image.
  2. Returned a placeholder ToolMessage to the agent: "Tool call started with UUID: X. You will be updated with the result soon".
  3. Queued the actual image as a fresh HumanMessage via add_message, which got picked up on the next agent turn.

The agent therefore saw a fake "wait for it" tool result, kept reasoning with no real data, and only got the image one turn later. Two extra round-trips per image, and the tool result the agent reasons over isn't the actual content.

What we do now

We're already on langchain's create_agent, which is a thin wrapper over LangGraph. LangGraph lets a tool return a Command(update={...}) instead of a single message — i.e. the tool call can append multiple messages to state as the result of one call. The state's messages reducer (the tool-aware one in agent.py) then re-orders so each image-carrying HumanMessage lands next to its matching ToolMessage, paired by additional_kwargs["tool_call_id"].

Concretely, the MCP tool now returns:

Command(update={
    "messages": [
        ToolMessage(content="<text part or stub>", tool_call_id=...),
        HumanMessage(content=[image_block], additional_kwargs={"tool_call_id": ...}),
    ]
})

Both messages land in the same turn. The HumanMessage-with-image workaround for OpenAI's API constraint is still there — but it's now part of the tool's response, not a fake-out + late delivery. The reducer guarantees ordering even when multiple parallel tool calls return images simultaneously.

Bonus fix: InjectedToolCallId for JSON-Schema tools

LangChain auto-injects InjectedToolCallId into tool arguments when args_schema is a Pydantic model. MCP gives us its schema as a plain JSON-Schema dict (the authoritative LLM contract on the MCP side), which langchain doesn't auto-inject for. Added _McpStructuredTool, a tiny StructuredTool subclass that does the injection for the JSON-Schema path. Without this, the tool wouldn't know its own tool_call_id and the reducer pairing breaks.

Future steps

The same Command(update=...) pattern unlocks more than just "tools that return images". Two observations from driving the memory2 agent against this fix:

  • The agent skips steps. Skills that prescribe verification sub-procedures (e.g. "call frames_facing to refine each candidate position before merging") get partially-followed even when the prose is strict. The current harness has no way to enforce that a step ran before the agent can submit an answer — the LLM is the only authority. Closing this needs explicit task management in the harness: a way to gate the final-answer tool on whether the prerequisite tool calls actually fired.

  • Long-horizon tasks benefit from subagent delegation. Some questions (e.g. room segmentation across a 5-minute recording, exhaustive frontier exploration) are big enough that a single agent context fills up. The cleanest way to scale is to let the top-level agent dispatch self-contained sub-tasks to subagents — each with its own context window — and integrate their results. That keeps the parent agent's context focused on synthesis instead of bookkeeping.

These two together — harness-level task management + subagent delegation — are most naturally implemented as another module: a long-running, compute-heavier agent specialised for complex tasks, that the existing McpClient-style short-horizon agents can defer to in a true multi-agent setting. This PR is the prerequisite (image-returning tools work cleanly across agents); the multi-agent module is the follow-up.

Test plan

  • pytest dimos/agents/mcp/test_mcp_client_unit.py — covers the new Command-return path and the multi-image / parallel-tool-call ordering.
  • Manual: drive an MCP tool that returns an image; verify the agent sees the image in the same turn as the tool call (no "started with UUID" placeholder).

Mario Garrido and others added 11 commits May 19, 2026 12:41
…de-channel

The MCP client previously handled non-text tool content (images) by minting
a UUID, returning a placeholder ToolMessage ("Tool call started with UUID:
X. You will be updated with the result soon"), and queuing the image as a
fresh HumanMessage via add_message. The image therefore arrived in a new
agent turn rather than as the tool call's actual result.

Switch to the langgraph pattern used by examples/memory2_agent/tools.py:
the tool returns Command(update={"messages": [ToolMessage, HumanMessage]})
so the image is applied within the same turn. The HumanMessage carries
additional_kwargs["tool_call_id"] so the state reducer can pair it with
its ToolMessage when multiple parallel tool calls return images at once.

Adds _McpStructuredTool, a small StructuredTool subclass that injects
InjectedToolCallId for tools whose args_schema is a JSON-Schema dict
(MCP's authoritative LLM contract) — langchain only handles this
automatically for Pydantic args_schemas.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `narrate_picture` to the mock MCP server and a test asserting
that when a tool returns both text and an image, the ToolMessage
carries the tool's real text (not the
"{name} returned N artefact(s)" fallback sentinel) while the image
still rides back on the follow-up HumanMessage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…com:mgczacki/dimos into worktree-mcp-client-image-langgraph-command
Add `dimos.telemetry`, a self-contained module that wraps the agent's
per-turn execution in an OTEL span. The base install is unaffected:
this package imports no opentelemetry packages at module load, and
`dimos.telemetry.span` is a silent no-op until tracing is wired up.

Wiring options:

  * Env-driven (recommended). Install the extra and set
    `OTEL_EXPORTER_OTLP_ENDPOINT`. `enable()` runs automatically on first
    import and configures the OTLP HTTP exporter plus, when available,
    `openinference-instrumentation-langchain` for LangChain auto-spans.

  * Caller-owned provider: `configure_tracing(my_provider)`.

  * Standard OTEL idiom: `DimosInstrumentor().instrument(...)`. The class
    is resolved lazily via module-level `__getattr__` so the heavy
    `opentelemetry.instrumentation` import only runs on attribute access.

Vendor-agnostic via OTLP: Langfuse, Arize Phoenix, LangSmith, and Opik
all accept the same pipeline; selection is by env var, not code.

Each McpClient instance now generates a UUID at construction and stamps
it on every `agent.turn` span via `session_attributes()`, which sets
both the OpenInference `session.id` (Langfuse, Phoenix) and
`langsmith.trace.session_id` (LangSmith). Backends group all per-turn
traces from one instance into a single session in their UI. Opik has no
OTEL→Threads mapping yet (comet-ml/opik#3441); use its native SDK there.
Two unrelated dev-only patches kept together so they can be reverted
as a single unit before opening any feature PR:

  - dimos/models/segmentation/edge_tam.py
    Make the inference device configurable, default to CUDA when
    available, fall back to CPU with a warning instead of hard-failing.
    (originally from Jeff Hykin)

  - dimos/robot/unitree/mujoco_connection.py
    Use DYLD_FALLBACK_LIBRARY_PATH so we don't shadow Apple's
    Accelerate with conda's stale libblas, and redirect mjpython
    stdout/stderr to files instead of subprocess.PIPE to avoid
    deadlock when the parent doesn't drain stderr.

Both unblock running agentic blueprints on Apple Silicon. Revert this
commit before opening the agent_graph_understanding feature PR.
@Mgczacki Mgczacki requested a review from paul-nechifor May 20, 2026 18:59
@Mgczacki
Copy link
Copy Markdown
Collaborator Author

@paul-nechifor I removed the private-API dep . Took your direction but used invoke/ainvoke (documented to accept ToolCall) and pulled the id from input["id"] rather than reading tool_call_id from run() kwargs.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 20, 2026

Greptile Summary

This PR replaces the old UUID-placeholder workaround for image-returning MCP tools with a clean Command(update={...})-based approach that delivers the ToolMessage and image-carrying HumanMessage in the same agent turn, eliminating the two-round-trip overhead. It also introduces _McpStructuredTool to bridge LangChain's InjectedToolCallId auto-injection (only available for Pydantic schemas) to MCP's JSON-Schema tools, and adds _fix_parallel_tool_batches + _reorder_tool_responses to keep parallel ToolMessages contiguous per OpenAI's API requirement.

  • Command-based image delivery: call_tool now returns a Command with a paired [ToolMessage, HumanMessage] when non-text content is present, co-landing both in one agent turn; self._history is explicitly re-sorted via _fix_parallel_tool_batches after streaming so the pre-reducer node outputs don't corrupt the history fed back into stream() on the next turn.
  • _McpStructuredTool: Overrides invoke/ainvoke to inject tool_call_id from the ToolCall envelope into args before delegating, bridging the Pydantic-only auto-injection gap for JSON-Schema tools.
  • Test coverage: New unit tests cover the Command return path, multi-image/parallel-tool ordering, the ainvoke async path, and the _process_message history normalization, directly addressing previously identified gaps.

Confidence Score: 5/5

Safe to merge. The rewrite correctly co-delivers images in one agent turn, history normalisation after streaming is sound, and the previously identified ordering problem in _history is now explicitly fixed.

The central logic — Command-based image delivery, _fix_parallel_tool_batches, and _McpStructuredTool — is correct and well-tested across sync, async, single-tool, and parallel-tool scenarios. The explicit _fix_parallel_tool_batches call at the end of _process_message directly closes the gap where stream_mode='updates' raw node output bypassed the channel reducer, so _history is correctly ordered before it is fed back into the next stream() call. No regressions are introduced, and the fixture update aligns with the removed two-turn placeholder flow.

No files require special attention.

Important Files Changed

Filename Overview
dimos/agents/mcp/mcp_client.py Core logic rewrite: replaces UUID-placeholder side-channel with Command-based same-turn image delivery; adds _fix_parallel_tool_batches, _reorder_tool_responses, _OrderedAgentState, and _McpStructuredTool; applies history normalisation after each stream call. Logic is sound and well-commented.
dimos/agents/mcp/test_mcp_client_unit.py Comprehensive new tests cover Command return, text+image branching, ToolCall invoke/ainvoke injection, bare-dict rejection, and _process_message history normalisation post-streaming; mock fixtures extended accordingly.
dimos/agents/mcp/fixtures/test_image.json Fixture updated: removes the now-unnecessary intermediate AI response that mirrored the old two-turn placeholder flow; only the final answer turn remains.

Sequence Diagram

sequenceDiagram
    participant LG as LangGraph ToolNode
    participant MT as _McpStructuredTool.invoke
    participant CT as call_tool closure
    participant MCP as MCP Server
    participant State as Graph State (_OrderedAgentState)

    LG->>MT: "invoke(ToolCall{id, args, type})"
    MT->>MT: "_inject_tool_call_id(input)<br/>args[tool_call_id] = id"
    MT->>CT: super().invoke(modified_input)
    CT->>MCP: tools/call(name, kwargs)
    MCP-->>CT: "{content: [{type, ...}]}"
    alt text only
        CT-->>LG: plain str
        LG->>State: "ToolMessage(content=str, tool_call_id=id)"
    else has image blocks
        CT-->>LG: "Command(update={messages:[ToolMessage, HumanMessage]})"
        LG->>State: "ToolMessage(content=summary, tool_call_id=id) + HumanMessage([text,image], additional_kwargs={tool_call_id})"
    end
    State->>State: _reorder_tool_responses reducer
    Note over State: Parallel ToolMessages contiguous, Image HumanMessages follow
Loading

Reviews (2): Last reviewed commit: "[autofix.ci] apply automated fixes" | Re-trigger Greptile

Comment thread dimos/agents/mcp/mcp_client.py
@codecov
Copy link
Copy Markdown

codecov Bot commented May 20, 2026

Codecov Report

❌ Patch coverage is 97.71429% with 4 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
dimos/agents/mcp/mcp_client.py 96.49% 1 Missing and 1 partial ⚠️
dimos/agents/mcp/test_mcp_client_unit.py 98.30% 1 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

@Mgczacki
Copy link
Copy Markdown
Collaborator Author

This fixes #2054

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant