Mcp client image langgraph command#2195
Conversation
…de-channel
The MCP client previously handled non-text tool content (images) by minting
a UUID, returning a placeholder ToolMessage ("Tool call started with UUID:
X. You will be updated with the result soon"), and queuing the image as a
fresh HumanMessage via add_message. The image therefore arrived in a new
agent turn rather than as the tool call's actual result.
Switch to the langgraph pattern used by examples/memory2_agent/tools.py:
the tool returns Command(update={"messages": [ToolMessage, HumanMessage]})
so the image is applied within the same turn. The HumanMessage carries
additional_kwargs["tool_call_id"] so the state reducer can pair it with
its ToolMessage when multiple parallel tool calls return images at once.
Adds _McpStructuredTool, a small StructuredTool subclass that injects
InjectedToolCallId for tools whose args_schema is a JSON-Schema dict
(MCP's authoritative LLM contract) — langchain only handles this
automatically for Pydantic args_schemas.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `narrate_picture` to the mock MCP server and a test asserting
that when a tool returns both text and an image, the ToolMessage
carries the tool's real text (not the
"{name} returned N artefact(s)" fallback sentinel) while the image
still rides back on the follow-up HumanMessage.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…com:mgczacki/dimos into worktree-mcp-client-image-langgraph-command
Add `dimos.telemetry`, a self-contained module that wraps the agent's
per-turn execution in an OTEL span. The base install is unaffected:
this package imports no opentelemetry packages at module load, and
`dimos.telemetry.span` is a silent no-op until tracing is wired up.
Wiring options:
* Env-driven (recommended). Install the extra and set
`OTEL_EXPORTER_OTLP_ENDPOINT`. `enable()` runs automatically on first
import and configures the OTLP HTTP exporter plus, when available,
`openinference-instrumentation-langchain` for LangChain auto-spans.
* Caller-owned provider: `configure_tracing(my_provider)`.
* Standard OTEL idiom: `DimosInstrumentor().instrument(...)`. The class
is resolved lazily via module-level `__getattr__` so the heavy
`opentelemetry.instrumentation` import only runs on attribute access.
Vendor-agnostic via OTLP: Langfuse, Arize Phoenix, LangSmith, and Opik
all accept the same pipeline; selection is by env var, not code.
Each McpClient instance now generates a UUID at construction and stamps
it on every `agent.turn` span via `session_attributes()`, which sets
both the OpenInference `session.id` (Langfuse, Phoenix) and
`langsmith.trace.session_id` (LangSmith). Backends group all per-turn
traces from one instance into a single session in their UI. Opik has no
OTEL→Threads mapping yet (comet-ml/opik#3441); use its native SDK there.
Two unrelated dev-only patches kept together so they can be reverted
as a single unit before opening any feature PR:
- dimos/models/segmentation/edge_tam.py
Make the inference device configurable, default to CUDA when
available, fall back to CPU with a warning instead of hard-failing.
(originally from Jeff Hykin)
- dimos/robot/unitree/mujoco_connection.py
Use DYLD_FALLBACK_LIBRARY_PATH so we don't shadow Apple's
Accelerate with conda's stale libblas, and redirect mjpython
stdout/stderr to files instead of subprocess.PIPE to avoid
deadlock when the parent doesn't drain stderr.
Both unblock running agentic blueprints on Apple Silicon. Revert this
commit before opening the agent_graph_understanding feature PR.
This reverts commit e20cfbd.
This reverts commit bc91121.
|
@paul-nechifor I removed the private-API dep . Took your direction but used invoke/ainvoke (documented to accept ToolCall) and pulled the id from input["id"] rather than reading tool_call_id from run() kwargs. |
Greptile SummaryThis PR replaces the old UUID-placeholder workaround for image-returning MCP tools with a clean
Confidence Score: 5/5Safe to merge. The rewrite correctly co-delivers images in one agent turn, history normalisation after streaming is sound, and the previously identified ordering problem in _history is now explicitly fixed. The central logic — Command-based image delivery, _fix_parallel_tool_batches, and _McpStructuredTool — is correct and well-tested across sync, async, single-tool, and parallel-tool scenarios. The explicit _fix_parallel_tool_batches call at the end of _process_message directly closes the gap where stream_mode='updates' raw node output bypassed the channel reducer, so _history is correctly ordered before it is fed back into the next stream() call. No regressions are introduced, and the fixture update aligns with the removed two-turn placeholder flow. No files require special attention. Important Files Changed
Sequence DiagramsequenceDiagram
participant LG as LangGraph ToolNode
participant MT as _McpStructuredTool.invoke
participant CT as call_tool closure
participant MCP as MCP Server
participant State as Graph State (_OrderedAgentState)
LG->>MT: "invoke(ToolCall{id, args, type})"
MT->>MT: "_inject_tool_call_id(input)<br/>args[tool_call_id] = id"
MT->>CT: super().invoke(modified_input)
CT->>MCP: tools/call(name, kwargs)
MCP-->>CT: "{content: [{type, ...}]}"
alt text only
CT-->>LG: plain str
LG->>State: "ToolMessage(content=str, tool_call_id=id)"
else has image blocks
CT-->>LG: "Command(update={messages:[ToolMessage, HumanMessage]})"
LG->>State: "ToolMessage(content=summary, tool_call_id=id) + HumanMessage([text,image], additional_kwargs={tool_call_id})"
end
State->>State: _reorder_tool_responses reducer
Note over State: Parallel ToolMessages contiguous, Image HumanMessages follow
Reviews (2): Last reviewed commit: "[autofix.ci] apply automated fixes" | Re-trigger Greptile |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
|
This fixes #2054 |
Context
Continuing #2189 within the repo so CI/CD can run.
I hit this while developing #2188 (the memory2 agent + Space raster backend) and figured the fix would be worth porting to the MCP client, which had the same problem and was working around it in a less elegant way.
The underlying constraint
Many OpenAI chat-completion models don't accept image content inside a tool result — you can't just stuff an image into a
ToolMessage. The canonical workaround is to deliver the image as aHumanMessageafter the tool returns. That part isn't avoidable.What we can avoid is delivering that HumanMessage in a separate, later agent turn — which is what the MCP client was doing.
What was happening before
When an MCP tool returned non-text content (an image),
mcp_client.py:ToolMessageto the agent: "Tool call started with UUID: X. You will be updated with the result soon".HumanMessageviaadd_message, which got picked up on the next agent turn.The agent therefore saw a fake "wait for it" tool result, kept reasoning with no real data, and only got the image one turn later. Two extra round-trips per image, and the tool result the agent reasons over isn't the actual content.
What we do now
We're already on langchain's
create_agent, which is a thin wrapper over LangGraph. LangGraph lets a tool return aCommand(update={...})instead of a single message — i.e. the tool call can append multiple messages to state as the result of one call. The state'smessagesreducer (the tool-aware one inagent.py) then re-orders so each image-carryingHumanMessagelands next to its matchingToolMessage, paired byadditional_kwargs["tool_call_id"].Concretely, the MCP tool now returns:
Both messages land in the same turn. The
HumanMessage-with-image workaround for OpenAI's API constraint is still there — but it's now part of the tool's response, not a fake-out + late delivery. The reducer guarantees ordering even when multiple parallel tool calls return images simultaneously.Bonus fix:
InjectedToolCallIdfor JSON-Schema toolsLangChain auto-injects
InjectedToolCallIdinto tool arguments whenargs_schemais a Pydantic model. MCP gives us its schema as a plain JSON-Schema dict (the authoritative LLM contract on the MCP side), which langchain doesn't auto-inject for. Added_McpStructuredTool, a tinyStructuredToolsubclass that does the injection for the JSON-Schema path. Without this, the tool wouldn't know its owntool_call_idand the reducer pairing breaks.Future steps
The same
Command(update=...)pattern unlocks more than just "tools that return images". Two observations from driving the memory2 agent against this fix:The agent skips steps. Skills that prescribe verification sub-procedures (e.g. "call
frames_facingto refine each candidate position before merging") get partially-followed even when the prose is strict. The current harness has no way to enforce that a step ran before the agent can submit an answer — the LLM is the only authority. Closing this needs explicit task management in the harness: a way to gate the final-answer tool on whether the prerequisite tool calls actually fired.Long-horizon tasks benefit from subagent delegation. Some questions (e.g. room segmentation across a 5-minute recording, exhaustive frontier exploration) are big enough that a single agent context fills up. The cleanest way to scale is to let the top-level agent dispatch self-contained sub-tasks to subagents — each with its own context window — and integrate their results. That keeps the parent agent's context focused on synthesis instead of bookkeeping.
These two together — harness-level task management + subagent delegation — are most naturally implemented as another module: a long-running, compute-heavier agent specialised for complex tasks, that the existing
McpClient-style short-horizon agents can defer to in a true multi-agent setting. This PR is the prerequisite (image-returning tools work cleanly across agents); the multi-agent module is the follow-up.Test plan
pytest dimos/agents/mcp/test_mcp_client_unit.py— covers the new Command-return path and the multi-image / parallel-tool-call ordering.