Skip to content

fix(qwen-agent): improve token and nested agent tracing#220

Open
sipercai wants to merge 1 commit into
alibaba:mainfrom
sipercai:liuyu/fix-qwen-agent-token-usage
Open

fix(qwen-agent): improve token and nested agent tracing#220
sipercai wants to merge 1 commit into
alibaba:mainfrom
sipercai:liuyu/fix-qwen-agent-token-usage

Conversation

@sipercai

@sipercai sipercai commented Jun 17, 2026

Copy link
Copy Markdown
Collaborator

Summary

Improve Qwen-Agent instrumentation so DashScope-backed chat/model calls and agent runs keep token usage and cleaner agent telemetry.

This extends the original DeepSeek/Qwen-Agent token usage fix with the related Qwen-Agent optimizations from the downstream branch: robust usage extraction across Qwen-Agent metadata shapes, streaming usage preservation, nested agent span support, final-answer-only agent output, and agent-level LLM token rollup.

Changes

  • Extract token usage from multiple Qwen-Agent/DashScope response shapes, including usage, extra.usage, extra.model_service_info, and top-level model_service_info.
  • Support dict-like, SDK-object, JSON-string, and namespace-style metadata, including input_tokens/prompt_tokens, output_tokens/completion_tokens, cache-read tokens, and cache-creation tokens.
  • Apply usage while streaming chunks are consumed and keep the most complete cumulative usage if later chunks omit usage metadata.
  • Preserve nested invoke_agent spans by replacing the global agent-run boolean guard with a same-instance reentrancy stack.
  • Roll child LLM token usage onto active agent spans. Parent agent spans intentionally represent the total nested LLM cost of that agent run; global cost aggregation should use LLM spans or trace-level de-duplication rather than summing agent spans.
  • Record only the final assistant answer as the invoke_agent output, instead of storing intermediate tool calls and tool results as the final agent response.
  • Add focused coverage for token metadata variants, streaming usage preservation, final agent output filtering, nested agent spans, and agent-level token rollup.
  • Add a changelog entry under Unreleased > Fixed.

Validation Evidence

Check Status Evidence
Approved spec / waiver pass User requested direct implementation and PR update for Qwen-Agent DeepSeek token usage plus downstream Qwen-Agent optimizations.
Changed surface pass Limited to loongsuite-instrumentation-qwen-agent source, tests, and changelog.
Rebase pass git fetch origin --prune && git rebase origin/main -> current branch is up to date.
Static readiness pass check_loongsuite_pr_readiness.py --repo .
Syntax / whitespace pass python3 -m py_compile .../utils.py .../patch.py .../test_spans.py; git diff --check
Focused tests pass pytest -q instrumentation-loongsuite/loongsuite-instrumentation-qwen-agent/tests/test_spans.py (24 passed)
Latest package matrix pass tox -c tox-loongsuite.ini -e py312-test-loongsuite-instrumentation-qwen-agent-latest (44 passed)
Oldest package matrix pass tox -c tox-loongsuite.ini -e py312-test-loongsuite-instrumentation-qwen-agent-oldest (44 passed)
Lint pass tox -c tox-loongsuite.ini -e lint-loongsuite-instrumentation-qwen-agent
Precommit pass tox -e precommit
Privacy scan pass Diff regex scan found no credentials, bearer tokens, local user paths, or API-key literals.
Claude review loop pass /tmp/codex-claude-review/loongsuite-python-agent-fcd27d9b4e/run-20260617-153346; r1 findings=0, r2 process finding resolved by amending the full implementation into HEAD, r3 findings=0.
GitHub CI pending Updated head will trigger CI after force-push. Existing PR checks were still queued before this update.

Real E2E Matrix

Scenario Status Evidence
Non-streaming model call blocked Live deepseek-v3 non-streaming smoke in the temp qwen-agent environment hit framework assertion use_raw_api only support full stream; non-streaming token extraction is covered by focused tests and latest/oldest tox.
Streaming model call pass Live deepseek-v3 streaming smoke produced chat spans with gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and gen_ai.usage.total_tokens.
Concurrency pass Two concurrent live deepseek-v3 streaming calls produced isolated chat spans; every chat span had token usage.
Agent / tool / ReAct pass Qwen-Agent real/VCR-backed tests cover basic agent run, stream LLM TTFT, non-stream chat, agent non-stream run, multi-turn, ReAct, and tool-call flows in latest and oldest tox matrices.
Tool-heavy pass Existing tool-call, ReAct, and span hierarchy tests passed; this change does not alter tool dispatch or schema generation.
Error path pass Existing chat, agent, and tool error-path span tests passed; provider error responses do not yield successful usage metadata.

Telemetry Contract

Contract Status Evidence
Span kind and usage attributes pass Focused tests and live streaming smoke verified LLM chat spans contain usage attributes before span finalization.
Parent-child tree pass Span hierarchy tests verify agent -> chat/tool nesting; new nested-agent test verifies child invoke_agent span parentage.
Agent output content pass SPAN_ONLY test verifies invoke_agent output records only the final assistant answer, not intermediate tool calls/results.
Content capture modes pass SPAN_ONLY final-output coverage and live NO_CONTENT smoke both kept token attributes independent from content capture.
Concurrency isolation pass Live concurrent streaming smoke produced separate chat traces and token usage on every chat span.
Weaver live-check pass Weaver live-check on live exported Qwen-Agent spans reported no blocking violations; only non-blocking stability / enum informational findings.

Notes

gen_ai.usage.total_tokens remains computed by the existing GenAI span finalization layer from input and output tokens. This PR sets the source input/output/cache fields and does not duplicate total-token computation.

@sipercai sipercai force-pushed the liuyu/fix-qwen-agent-token-usage branch from 7fdc966 to a304e11 Compare June 17, 2026 07:48
@sipercai sipercai changed the title fix(qwen-agent): record chat token usage fix(qwen-agent): improve token and nested agent tracing Jun 17, 2026

@ralf0131 ralf0131 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review by github-manager-bot

Summary

Three improvements to loongsuite-instrumentation-qwen-agent: (1) record token usage from DashScope response metadata on streaming and non-streaming chat spans, (2) roll up child LLM token usage to parent invoke_agent spans and preserve nested agent spans, (3) record only the final agent answer as output.

Findings

  • [Info] patch.py — The reentrancy guard is correctly upgraded from a boolean _in_agent_run to an instance-ID-based stack (_agent_run_instance_stack). This fixes the nested-agent suppression bug while still preventing Proxy/Wrapper super-call duplication. The _active_agent_invocations ContextVar tuple-based stack for token rollup is well-designed.
  • [Info] patch.py — Token usage rollup via _accumulate_llm_usage_on_active_agents is correctly transitive (parent agent accumulates nested LLM costs), and the current_score guard in _apply_usage_to_llm_invocation handles cumulative streaming chunks correctly.
  • [Warning] utils.py:218-222 — In _convert_qwen_agent_final_output_messages, the fallback return _convert_qwen_messages_to_output_messages(messages) at the end of the function will include all messages (including tool-call messages) if no assistant message with text content is found. This could include intermediate tool-call/function messages in the output. Consider falling back to the last message instead of all messages, or logging a debug warning when no final answer is found.

Tests

Excellent test coverage: streaming/non-streaming token usage, cumulative usage retention across chunks, nested agent span creation with parent-child relationship verification, final-output-only message assertion, and agent token rollup verification. The tests are well-structured with realistic mock objects.

Overall, a solid and well-tested improvement. The one minor warning about the fallback path is non-blocking.


Automated review by github-manager-bot

@sipercai sipercai force-pushed the liuyu/fix-qwen-agent-token-usage branch from a304e11 to 5088b41 Compare June 18, 2026 02:34

@ralf0131 ralf0131 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review by github-manager-bot

Summary

Re-verification after new commits. Substantially improves Qwen-Agent/DashScope token accounting and agent telemetry: usage extraction across many metadata shapes (with streaming "keep most complete" semantics), nested agent spans via a same-instance reentrancy stack, final-answer-only agent output, and agent-level LLM token rollup. Clean APPROVE.

Findings

  • [Info] patch.py (reentrancy) — Replacing the global _in_agent_run boolean with an _agent_run_instance_stack of id(instance) is the correct fix: nested runs on different instances are now preserved while same-instance super() calls are still deduplicated. The paired ContextVar tuple with proper .reset() in finally avoids leaks.
  • [Info] utils.py (_extract_usage_values / _apply_usage_to_llm_invocation) — Robust recursive extraction across dict/SDK-object/JSON-string shapes. The _usage_score guard correctly keeps the most complete cumulative usage when later streaming chunks omit usage metadata, so partial updates can't regress already-captured totals.
  • [Info] _convert_qwen_agent_final_output_messages — Walking reversed(messages) and returning the first assistant text (skipping tool/function messages) cleanly isolates the final answer; empty-case returns []. Good.

Cross-repo Note

util/opentelemetry-util-genai is untouched; changes are isolated to loongsuite-instrumentation-qwen-agent, so downstream instrumentation plugins are not affected.

Excellent test coverage (token variants, streaming preservation, final-output filtering, nested spans, agent rollup) plus thorough validation evidence. LGTM.


Automated review by github-manager-bot

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants