fix(mcp): return tool images via langgraph Command (no more UUID side-channel)#2189
fix(mcp): return tool images via langgraph Command (no more UUID side-channel)#2189Mgczacki wants to merge 11 commits into
Conversation
…de-channel
The MCP client previously handled non-text tool content (images) by minting
a UUID, returning a placeholder ToolMessage ("Tool call started with UUID:
X. You will be updated with the result soon"), and queuing the image as a
fresh HumanMessage via add_message. The image therefore arrived in a new
agent turn rather than as the tool call's actual result.
Switch to the langgraph pattern used by examples/memory2_agent/tools.py:
the tool returns Command(update={"messages": [ToolMessage, HumanMessage]})
so the image is applied within the same turn. The HumanMessage carries
additional_kwargs["tool_call_id"] so the state reducer can pair it with
its ToolMessage when multiple parallel tool calls return images at once.
Adds _McpStructuredTool, a small StructuredTool subclass that injects
InjectedToolCallId for tools whose args_schema is a JSON-Schema dict
(MCP's authoritative LLM contract) — langchain only handles this
automatically for Pydantic args_schemas.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
❌ 1 Tests Failed:
View the top 1 failed test(s) by shortest run time
To view more test analytics, go to the Test Analytics Dashboard |
Greptile SummaryThis PR replaces the UUID side-channel workaround for image-returning MCP tools with LangGraph's
Confidence Score: 5/5Safe to merge — the Command-based image delivery and parallel-batch reordering are both well-reasoned and thoroughly tested. The rewrite replaces a two-turn UUID side-channel with a single-turn Command that delivers both the ToolMessage stub and the image-carrying HumanMessage atomically. The _fix_parallel_tool_batches reducer only rewrites a parallel batch when all expected ToolMessages are present, making it idempotent and safe when tool results arrive out of order. _McpStructuredTool correctly injects tool_call_id via the public invoke/ainvoke surface rather than private LangChain internals. All previously missing branches (text+image return, bare-dict error, parallel reorder variants) now have explicit test coverage. No incorrect filtering, missing required fields, or broken invariants were found. No files require special attention. Important Files Changed
Sequence DiagramsequenceDiagram
participant User
participant McpClient
participant LangGraph
participant LLM as OpenAI LLM
participant ToolNode
participant MCP as MCP Server
User->>McpClient: human message
McpClient->>LangGraph: "stream({messages: history})"
LangGraph->>LLM: history (via _OrderedAgentState)
LLM-->>LangGraph: "AIMessage(tool_calls=[{id:a},{id:b}])"
par Parallel tool calls
LangGraph->>ToolNode: invoke tool A (text-only)
ToolNode->>MCP: tools/call A
MCP-->>ToolNode: "{content:[{type:text}]}"
ToolNode-->>LangGraph: ToolMessage(a, text result)
and
LangGraph->>ToolNode: invoke tool B (image)
ToolNode->>MCP: tools/call B
MCP-->>ToolNode: "{content:[{type:image_url}]}"
ToolNode-->>LangGraph: "Command(update={messages:[ToolMessage(b), HumanMessage(image, tool_call_id=b)]})"
end
Note over LangGraph: _reorder_tool_responses reducer runs
Note over LangGraph: [AI, Tool(a), Tool(b), Human(image)] ✓
LangGraph->>LLM: ordered history with image in same turn
LLM-->>LangGraph: final AIMessage
LangGraph-->>McpClient: stream updates
McpClient-->>User: response
Reviews (4): Last reviewed commit: "Adding proposed changes, fixing stash of..." | Re-trigger Greptile |
| self, tool_input: str | dict[str, Any], tool_call_id: str | None | ||
| ) -> tuple[tuple[Any, ...], dict[str, Any]]: | ||
| args, kwargs = super()._to_args_and_kwargs(tool_input, tool_call_id) | ||
| if "tool_call_id" in self._injected_args_keys: |
There was a problem hiding this comment.
Dependency on private
_injected_args_keys API
self._injected_args_keys is a single-underscore property — a LangChain implementation detail not part of the public contract. The GitHub issue #36221 ("Typing issue in StructuredTool._injected_args_keys", March 2026) confirms it's actively being changed. If a LangChain version renames or redefines this property, _McpStructuredTool._to_args_and_kwargs silently skips the injection and every image-returning tool call fails with a TypeError (call_tool() missing required argument tool_call_id). Consider adding a guard-check at construction time (e.g., an assert hasattr(self, "_injected_args_keys")) or documenting the expected minimum langchain-core version.
There was a problem hiding this comment.
We can update that if updating the langchain version changes the way the library does it. That's why pinned libraries exist.
There was a problem hiding this comment.
It's better to write a future-proof version now than wait for the private API to break and implement the fix then.
_to_args_and_kwargs and _injected_args_keys should not be used.
If the goal is to capture tool_call_id you can do so by defining something like this:
class _McpStructuredTool(StructuredTool):
def run(
self, tool_input: str | dict[str, Any], *args: Any, **kwargs: Any
) -> Any:
tool_input = _inject_tool_call_id(tool_input, kwargs.get("tool_call_id"))
return super().run(tool_input, *args, **kwargs)
async def arun(
self, tool_input: str | dict[str, Any], *args: Any, **kwargs: Any
) -> Any:
tool_input = _inject_tool_call_id(tool_input, kwargs.get("tool_call_id"))
return await super().arun(tool_input, *args, **kwargs)
def _inject_tool_call_id(
tool_input: str | dict[str, Any], tool_call_id: str | None
) -> str | dict[str, Any]:
if not isinstance(tool_input, dict):
return tool_input
if tool_call_id is None:
raise ValueError(
"MCP tool requires a tool_call_id; invoke via a ToolCall, not a bare dict."
)
return {**tool_input, "tool_call_id": tool_call_id}Adds `narrate_picture` to the mock MCP server and a test asserting
that when a tool returns both text and an image, the ToolMessage
carries the tool's real text (not the
"{name} returned N artefact(s)" fallback sentinel) while the image
still rides back on the follow-up HumanMessage.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| # Vision content can't be embedded inside a ToolMessage for OpenAI | ||
| # (and others), so we use Command to append a follow-up HumanMessage | ||
| # carrying the image blocks within the same agent turn. Mirrors the | ||
| # pattern used by examples/memory2_agent/tools.py. |
There was a problem hiding this comment.
What is memory2_agent? I don't see this in the codebase.
There was a problem hiding this comment.
Apologies for the lack of context, it's the agent in this PR: #2188 - will clean up this comment and others that reference it.
…com:mgczacki/dimos into worktree-mcp-client-image-langgraph-command
| # carrying the image blocks within the same agent turn. Mirrors the | ||
| # pattern used by examples/memory2_agent/tools.py. | ||
| # | ||
| # The HumanMessage is tagged with `additional_kwargs["tool_call_id"]` |
There was a problem hiding this comment.
Where is additional_kwargs["tool_call_id"] used?
In langchain_core it seems like it's meant to be used for additional fields used by some providers.
There was a problem hiding this comment.
It seems like I stashed my reducer that read it just before uploading the PR 😢. Will update shortly when I reapply it and double-check everything. Answering your question, it is read as a condition for the reducer to be able to re-order the messages once it gets them. Human Message with tool_call_id in additional_kwargs will be the indicator to reorder it just after the tool response that matches said tool_call_id.
|
Want your agent to iterate on Greptile's feedback? Try greploops. |
Add `dimos.telemetry`, a self-contained module that wraps the agent's
per-turn execution in an OTEL span. The base install is unaffected:
this package imports no opentelemetry packages at module load, and
`dimos.telemetry.span` is a silent no-op until tracing is wired up.
Wiring options:
* Env-driven (recommended). Install the extra and set
`OTEL_EXPORTER_OTLP_ENDPOINT`. `enable()` runs automatically on first
import and configures the OTLP HTTP exporter plus, when available,
`openinference-instrumentation-langchain` for LangChain auto-spans.
* Caller-owned provider: `configure_tracing(my_provider)`.
* Standard OTEL idiom: `DimosInstrumentor().instrument(...)`. The class
is resolved lazily via module-level `__getattr__` so the heavy
`opentelemetry.instrumentation` import only runs on attribute access.
Vendor-agnostic via OTLP: Langfuse, Arize Phoenix, LangSmith, and Opik
all accept the same pipeline; selection is by env var, not code.
Each McpClient instance now generates a UUID at construction and stamps
it on every `agent.turn` span via `session_attributes()`, which sets
both the OpenInference `session.id` (Langfuse, Phoenix) and
`langsmith.trace.session_id` (LangSmith). Backends group all per-turn
traces from one instance into a single session in their UI. Opik has no
OTEL→Threads mapping yet (comet-ml/opik#3441); use its native SDK there.
Two unrelated dev-only patches kept together so they can be reverted
as a single unit before opening any feature PR:
- dimos/models/segmentation/edge_tam.py
Make the inference device configurable, default to CUDA when
available, fall back to CPU with a warning instead of hard-failing.
(originally from Jeff Hykin)
- dimos/robot/unitree/mujoco_connection.py
Use DYLD_FALLBACK_LIBRARY_PATH so we don't shadow Apple's
Accelerate with conda's stale libblas, and redirect mjpython
stdout/stderr to files instead of subprocess.PIPE to avoid
deadlock when the parent doesn't drain stderr.
Both unblock running agentic blueprints on Apple Silicon. Revert this
commit before opening the agent_graph_understanding feature PR.
This reverts commit e20cfbd.
This reverts commit bc91121.
|
Closing and reopening as a branch within the repo. |
Context
I hit this while developing #2188 (the memory2 agent + Space raster backend) and figured the fix would be worth porting to the MCP client, which had the same problem and was working around it in a less elegant way.
The underlying constraint
Many OpenAI chat-completion models don't accept image content inside a tool result — you can't just stuff an image into a
ToolMessage. The canonical workaround is to deliver the image as aHumanMessageafter the tool returns. That part isn't avoidable.What we can avoid is delivering that HumanMessage in a separate, later agent turn — which is what the MCP client was doing.
What was happening before
When an MCP tool returned non-text content (an image),
mcp_client.py:ToolMessageto the agent: "Tool call started with UUID: X. You will be updated with the result soon".HumanMessageviaadd_message, which got picked up on the next agent turn.The agent therefore saw a fake "wait for it" tool result, kept reasoning with no real data, and only got the image one turn later. Two extra round-trips per image, and the tool result the agent reasons over isn't the actual content.
What we do now
We're already on langchain's
create_agent, which is a thin wrapper over LangGraph. LangGraph lets a tool return aCommand(update={...})instead of a single message — i.e. the tool call can append multiple messages to state as the result of one call. The state'smessagesreducer (the tool-aware one inagent.py) then re-orders so each image-carryingHumanMessagelands next to its matchingToolMessage, paired byadditional_kwargs["tool_call_id"].Concretely, the MCP tool now returns:
Both messages land in the same turn. The
HumanMessage-with-image workaround for OpenAI's API constraint is still there — but it's now part of the tool's response, not a fake-out + late delivery. The reducer guarantees ordering even when multiple parallel tool calls return images simultaneously.Bonus fix:
InjectedToolCallIdfor JSON-Schema toolsLangChain auto-injects
InjectedToolCallIdinto tool arguments whenargs_schemais a Pydantic model. MCP gives us its schema as a plain JSON-Schema dict (the authoritative LLM contract on the MCP side), which langchain doesn't auto-inject for. Added_McpStructuredTool, a tinyStructuredToolsubclass that does the injection for the JSON-Schema path. Without this, the tool wouldn't know its owntool_call_idand the reducer pairing breaks.Future steps
The same
Command(update=...)pattern unlocks more than just "tools that return images". Two observations from driving the memory2 agent against this fix:The agent skips steps. Skills that prescribe verification sub-procedures (e.g. "call
frames_facingto refine each candidate position before merging") get partially-followed even when the prose is strict. The current harness has no way to enforce that a step ran before the agent can submit an answer — the LLM is the only authority. Closing this needs explicit task management in the harness: a way to gate the final-answer tool on whether the prerequisite tool calls actually fired.Long-horizon tasks benefit from subagent delegation. Some questions (e.g. room segmentation across a 5-minute recording, exhaustive frontier exploration) are big enough that a single agent context fills up. The cleanest way to scale is to let the top-level agent dispatch self-contained sub-tasks to subagents — each with its own context window — and integrate their results. That keeps the parent agent's context focused on synthesis instead of bookkeeping.
These two together — harness-level task management + subagent delegation — are most naturally implemented as another module: a long-running, compute-heavier agent specialised for complex tasks, that the existing
McpClient-style short-horizon agents can defer to in a true multi-agent setting. This PR is the prerequisite (image-returning tools work cleanly across agents); the multi-agent module is the follow-up.Test plan
pytest dimos/agents/mcp/test_mcp_client_unit.py— covers the new Command-return path and the multi-image / parallel-tool-call ordering.