fix(qwen-agent): improve token and nested agent tracing#220
Conversation
7fdc966 to
a304e11
Compare
ralf0131
left a comment
There was a problem hiding this comment.
Review by github-manager-bot
Summary
Three improvements to loongsuite-instrumentation-qwen-agent: (1) record token usage from DashScope response metadata on streaming and non-streaming chat spans, (2) roll up child LLM token usage to parent invoke_agent spans and preserve nested agent spans, (3) record only the final agent answer as output.
Findings
- [Info]
patch.py— The reentrancy guard is correctly upgraded from a boolean_in_agent_runto an instance-ID-based stack (_agent_run_instance_stack). This fixes the nested-agent suppression bug while still preventing Proxy/Wrapper super-call duplication. The_active_agent_invocationsContextVar tuple-based stack for token rollup is well-designed. - [Info]
patch.py— Token usage rollup via_accumulate_llm_usage_on_active_agentsis correctly transitive (parent agent accumulates nested LLM costs), and thecurrent_scoreguard in_apply_usage_to_llm_invocationhandles cumulative streaming chunks correctly. - [Warning]
utils.py:218-222— In_convert_qwen_agent_final_output_messages, the fallbackreturn _convert_qwen_messages_to_output_messages(messages)at the end of the function will include all messages (including tool-call messages) if no assistant message with text content is found. This could include intermediate tool-call/function messages in the output. Consider falling back to the last message instead of all messages, or logging a debug warning when no final answer is found.
Tests
Excellent test coverage: streaming/non-streaming token usage, cumulative usage retention across chunks, nested agent span creation with parent-child relationship verification, final-output-only message assertion, and agent token rollup verification. The tests are well-structured with realistic mock objects.
Overall, a solid and well-tested improvement. The one minor warning about the fallback path is non-blocking.
Automated review by github-manager-bot
a304e11 to
5088b41
Compare
ralf0131
left a comment
There was a problem hiding this comment.
Review by github-manager-bot
Summary
Re-verification after new commits. Substantially improves Qwen-Agent/DashScope token accounting and agent telemetry: usage extraction across many metadata shapes (with streaming "keep most complete" semantics), nested agent spans via a same-instance reentrancy stack, final-answer-only agent output, and agent-level LLM token rollup. Clean APPROVE.
Findings
- [Info]
patch.py(reentrancy) — Replacing the global_in_agent_runboolean with an_agent_run_instance_stackofid(instance)is the correct fix: nested runs on different instances are now preserved while same-instance super() calls are still deduplicated. The paired ContextVar tuple with proper.reset()in finally avoids leaks. - [Info]
utils.py(_extract_usage_values/_apply_usage_to_llm_invocation) — Robust recursive extraction across dict/SDK-object/JSON-string shapes. The_usage_scoreguard correctly keeps the most complete cumulative usage when later streaming chunks omit usage metadata, so partial updates can't regress already-captured totals. - [Info]
_convert_qwen_agent_final_output_messages— Walkingreversed(messages)and returning the first assistant text (skipping tool/function messages) cleanly isolates the final answer; empty-case returns[]. Good.
Cross-repo Note
util/opentelemetry-util-genai is untouched; changes are isolated to loongsuite-instrumentation-qwen-agent, so downstream instrumentation plugins are not affected.
Excellent test coverage (token variants, streaming preservation, final-output filtering, nested spans, agent rollup) plus thorough validation evidence. LGTM.
Automated review by github-manager-bot
Summary
Improve Qwen-Agent instrumentation so DashScope-backed chat/model calls and agent runs keep token usage and cleaner agent telemetry.
This extends the original DeepSeek/Qwen-Agent token usage fix with the related Qwen-Agent optimizations from the downstream branch: robust usage extraction across Qwen-Agent metadata shapes, streaming usage preservation, nested agent span support, final-answer-only agent output, and agent-level LLM token rollup.
Changes
usage,extra.usage,extra.model_service_info, and top-levelmodel_service_info.input_tokens/prompt_tokens,output_tokens/completion_tokens, cache-read tokens, and cache-creation tokens.invoke_agentspans by replacing the global agent-run boolean guard with a same-instance reentrancy stack.invoke_agentoutput, instead of storing intermediate tool calls and tool results as the final agent response.Unreleased > Fixed.Validation Evidence
loongsuite-instrumentation-qwen-agentsource, tests, and changelog.git fetch origin --prune && git rebase origin/main-> current branch is up to date.check_loongsuite_pr_readiness.py --repo .python3 -m py_compile .../utils.py .../patch.py .../test_spans.py;git diff --checkpytest -q instrumentation-loongsuite/loongsuite-instrumentation-qwen-agent/tests/test_spans.py(24 passed)tox -c tox-loongsuite.ini -e py312-test-loongsuite-instrumentation-qwen-agent-latest(44 passed)tox -c tox-loongsuite.ini -e py312-test-loongsuite-instrumentation-qwen-agent-oldest(44 passed)tox -c tox-loongsuite.ini -e lint-loongsuite-instrumentation-qwen-agenttox -e precommit/tmp/codex-claude-review/loongsuite-python-agent-fcd27d9b4e/run-20260617-153346; r1findings=0, r2 process finding resolved by amending the full implementation into HEAD, r3findings=0.Real E2E Matrix
deepseek-v3non-streaming smoke in the temp qwen-agent environment hit framework assertionuse_raw_api only support full stream; non-streaming token extraction is covered by focused tests and latest/oldest tox.deepseek-v3streaming smoke produced chat spans withgen_ai.usage.input_tokens,gen_ai.usage.output_tokens, andgen_ai.usage.total_tokens.deepseek-v3streaming calls produced isolated chat spans; every chat span had token usage.Telemetry Contract
invoke_agentspan parentage.SPAN_ONLYtest verifiesinvoke_agentoutput records only the final assistant answer, not intermediate tool calls/results.SPAN_ONLYfinal-output coverage and liveNO_CONTENTsmoke both kept token attributes independent from content capture.Notes
gen_ai.usage.total_tokensremains computed by the existing GenAI span finalization layer from input and output tokens. This PR sets the source input/output/cache fields and does not duplicate total-token computation.