
[WIP] support otel trace and jaeger viewer #1918

Open
YanhuiDua wants to merge 6 commits into InternLM:main from YanhuiDua:feat/otel-trace-jaeger-viewer

Conversation

@YanhuiDua

Collaborator

No description provided.

@YanhuiDua

Collaborator Author

@claude review

@YanhuiDua changed the title from "[Feat] support otel trace and jaeger viewer" to "[WIP] support otel trace and jaeger viewer" on Jun 16, 2026
Comment thread xtuner/v1/rl/trace.py
Comment on lines +999 to +1004
    finally:
        if token is not None:
            try:
                otel_context.detach(token)
            except Exception:
                return

Contributor

Claude: Bug — return inside a finally block of a @contextmanager silently swallows exceptions.

If the caller raises inside the with use_trace_context(...) block and otel_context.detach(token) also fails, this return suppresses the original caller exception. The caller's error disappears with no indication.

Suggested change
-    finally:
-        if token is not None:
-            try:
-                otel_context.detach(token)
-            except Exception:
-                return
+    finally:
+        if token is not None:
+            try:
+                otel_context.detach(token)
+            except Exception:
+                pass

Replace return with pass so the original exception continues to propagate. The same pattern appears in trace_baggage (line ~1033) and trace_task_context (line ~1083) — all three need the same fix.
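
A minimal sketch of the failure mode described above, using only the standard library. `failing_detach` is a hypothetical stand-in for `otel_context.detach(token)` raising, and the two context managers only mirror the shape of the code under review, not its actual body.

```python
from contextlib import contextmanager


def failing_detach(token):
    """Hypothetical stand-in for otel_context.detach(token) raising."""
    raise RuntimeError("detach failed")


@contextmanager
def buggy_context():
    token = object()
    try:
        yield
    finally:
        if token is not None:
            try:
                failing_detach(token)
            except Exception:
                # Returning here ends the generator normally, so the
                # exception thrown in at `yield` is discarded.
                return


@contextmanager
def fixed_context():
    token = object()
    try:
        yield
    finally:
        if token is not None:
            try:
                failing_detach(token)
            except Exception:
                # Ignore only the detach failure; the caller's exception
                # resumes propagating once the finally block ends.
                pass


def caller_error_escapes(context_factory):
    """Return True if a ValueError raised inside the with-block reaches the caller."""
    try:
        with context_factory():
            raise ValueError("caller error")
    except ValueError:
        return True
    return False
```

Under these assumptions, `caller_error_escapes(buggy_context)` returns False because the `ValueError` never leaves the `with` block, while `caller_error_escapes(fixed_context)` returns True.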

Comment thread xtuner/v1/rl/trace.py
Comment on lines +1029 to +1034
    finally:
        if token is not None:
            try:
                context.detach(token)
            except Exception:
                return

Contributor

Claude: Bug — Same return-in-finally issue as use_trace_context. This will suppress the caller's exception if context.detach(token) also raises.

Suggested change
-    finally:
-        if token is not None:
-            try:
-                context.detach(token)
-            except Exception:
-                return
+    finally:
+        if token is not None:
+            try:
+                context.detach(token)
+            except Exception:
+                pass

Comment thread xtuner/v1/rl/trace.py
Comment on lines +1080 to +1084
    finally:
        if token is not None and otel_context_module is not None:
            try:
                otel_context_module.detach(token)
            except Exception:

Contributor

Claude: Bug — Same return-in-finally issue as the other two context managers. Replace with pass.

Suggested change
-    finally:
-        if token is not None and otel_context_module is not None:
-            try:
-                otel_context_module.detach(token)
-            except Exception:
+    finally:
+        if token is not None:
+            try:
+                otel_context_module.detach(token)
+            except Exception:
+                pass

Comment thread xtuner/v1/rl/trace.py
Comment on lines +551 to +554
        try:
            otel_trace.set_tracer_provider(self._provider)
        except Exception:
            pass

Contributor

Claude: Warning — Silent except Exception: pass when setting the global tracer provider. If another TracerProvider is already registered (e.g. user code, a test, or double initialization), this silently fails and all subsequent spans go to the wrong provider. Consider logging a warning so operators can diagnose "traces aren't showing up" issues:

        try:
            otel_trace.set_tracer_provider(self._provider)
        except Exception:
            logger.warning("Failed to set global tracer provider; spans may not export", exc_info=True)

Comment thread xtuner/v1/rl/trace.py
return OTLPSpanExporter(endpoint=config.otel_endpoint)
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

return OTLPSpanExporter(endpoint=config.otel_endpoint, insecure=True)

Contributor

Claude: Warning — insecure=True is hardcoded, which disables TLS for all gRPC trace export. This is fine for localhost/cluster-internal Jaeger deployments, but if anyone points otel_endpoint at a remote collector over an untrusted network, trace data (which may include prompt content in span attributes) will be sent in plaintext.

Consider making this configurable via TraceConfig, e.g. otel_insecure: bool = True, or at minimum add a comment noting the security assumption.
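
One way the suggestion could look, as a sketch only. The `TraceConfig` below is a hypothetical subset (the real class in the PR is not shown here), and both `otel_insecure` and the default endpoint value are assumptions. The helper only builds keyword arguments so the example stays runnable without the OpenTelemetry packages installed.

```python
from dataclasses import dataclass


@dataclass
class TraceConfig:
    """Hypothetical subset of the PR's TraceConfig; field names are assumptions."""

    otel_endpoint: str = "localhost:4317"
    # Plaintext gRPC. Reasonable for localhost or cluster-internal collectors,
    # but should be set to False when exporting over an untrusted network.
    otel_insecure: bool = True


def otlp_exporter_kwargs(config: TraceConfig) -> dict:
    """Build keyword arguments for the gRPC OTLPSpanExporter from the config."""
    return {"endpoint": config.otel_endpoint, "insecure": config.otel_insecure}
```

The exporter construction would then read `OTLPSpanExporter(**otlp_exporter_kwargs(config))`, keeping the current behaviour by default while letting operators opt into TLS.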

Comment thread xtuner/v1/rl/trace.py
        "message": state.message,
        "repeat_index": repeat_index,
    }
    digest = hashlib.sha1(_json_dumps_stable(payload).encode("utf-8")).hexdigest()[:16]

Contributor

Claude: Nit — This uses SHA-1 while _stable_trace_id (line ~380) uses blake2b. The inconsistency is surprising. SHA-1 is fine for non-cryptographic trace IDs, but since blake2b is already available and used elsewhere in this file, consider unifying on blake2b for consistency (it's also faster for short inputs):

digest = hashlib.blake2b(_json_dumps_stable(payload).encode("utf-8"), digest_size=8).hexdigest()
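
A small sketch comparing the two digests side by side. `json_dumps_stable` is an assumed stand-in for the PR's `_json_dumps_stable` helper (its real implementation is not shown in this thread), and the payload keys in the usage note are illustrative only.

```python
import hashlib
import json


def json_dumps_stable(payload):
    """Assumed stand-in for _json_dumps_stable: key order does not affect output."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def sha1_digest(payload):
    """Current approach: SHA-1 hex digest truncated to 16 hex characters."""
    return hashlib.sha1(json_dumps_stable(payload).encode("utf-8")).hexdigest()[:16]


def blake2b_digest(payload):
    """Suggested approach: an 8-byte blake2b digest, which is 16 hex characters."""
    return hashlib.blake2b(json_dumps_stable(payload).encode("utf-8"), digest_size=8).hexdigest()
```

Both functions return 16 hex characters for a payload such as `{"message": "hi", "repeat_index": 0}`, so the ID length is unchanged. The two hashes produce different values for the same payload, so switching would change previously generated IDs.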

Comment on lines +750 to +755

__all__ = [
    "SessionServer",
    "SessionServerActor",
    "_choices_output_ids_len",
    "_extract_body_trace_context",

Contributor

Claude: Warning — Exporting private-looking names (_choices_output_ids_len, _extract_body_trace_context, _response_output_ids_len) in __all__ is unusual. Combined with the from .session_trace import _foo as _foo re-export pattern above, this looks like a workaround for type-checker visibility.

If these functions are part of the module's public API, drop the leading underscore. If they're internal, remove them from __all__ and import them normally — __all__ signals "this is the public interface" to both tools and humans.

parser.add_argument("--host", default="0.0.0.0")
parser.add_argument("--port", type=int, default=4318)
parser.add_argument("--root", default="/tmp/otelcol")
args = parser.parse_args()

Contributor

Claude: Nit — Default binding to 0.0.0.0 exposes the debug sink to all network interfaces. Since this is a development/debugging tool, consider defaulting to 127.0.0.1 and letting users explicitly opt into 0.0.0.0 when needed.
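
A sketch of the safer default, assuming the rest of the argument parser stays as quoted above. `build_parser` and the description string are hypothetical names introduced for illustration.

```python
import argparse


def build_parser():
    """Build the debug sink's CLI parser with a loopback-only default host."""
    parser = argparse.ArgumentParser(description="OTLP HTTP debug sink (sketch)")
    # Loopback by default; pass --host 0.0.0.0 to listen on all interfaces.
    parser.add_argument("--host", default="127.0.0.1")
    parser.add_argument("--port", type=int, default=4318)
    parser.add_argument("--root", default="/tmp/otelcol")
    return parser
```

With this default, `build_parser().parse_args([])` binds to loopback only, and `build_parser().parse_args(["--host", "0.0.0.0"])` is the explicit opt-in.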

Comment thread xtuner/v1/rl/trace.py
Comment on lines +548 to +549
        self._provider.add_span_processor(
            BatchSpanProcessor(exporter, schedule_delay_millis=100, max_export_batch_size=512)

Contributor

Claude: Nit — schedule_delay_millis=100 is 50× more aggressive than the OTel SDK default (5000ms). This means the exporter flushes every 100ms, which increases CPU overhead and network chatter — especially under high span volume during RL training. Unless there's a specific latency requirement for real-time trace viewing, consider using a more moderate value like 1000 or the default 5000.
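
The arithmetic behind the 50× figure, as a rough sketch. It assumes spans are always pending, so each timer tick triggers an export, and it ignores the early exports that happen when the batch size is reached. `timer_flushes_per_second` is a hypothetical helper, not part of the PR or the OTel SDK.

```python
def timer_flushes_per_second(schedule_delay_millis):
    """Upper bound on timer-driven exports per second for a given delay."""
    return 1000 / schedule_delay_millis


# Delays discussed in this thread: the PR's value, a moderate value, the SDK default.
rates = {delay: timer_flushes_per_second(delay) for delay in (100, 1000, 5000)}
```

Here `rates` maps 100 ms to 10 flushes per second, 1000 ms to 1 and 5000 ms to 0.2, which is the 50× gap between the PR's setting and the default.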

@claude

claude Bot commented Jun 16, 2026

Contributor

Claude: Summary

Large feature PR adding OpenTelemetry tracing and a Jaeger-backed trace viewer to the RL training pipeline. Replaces the old file-based per-sample tracing (sandbox_agent_loop/trace.py, deleted) with a unified OTel-based system. Instruments the full RL pipeline — sampling, agent loops, rollout workers/controllers, session server, and judgers — with spans and trace context propagation across Ray actors via environment variables. Includes a Jaeger query dashboard (jaeger_trace_dashboard.py), trace analysis tools, and comprehensive unit tests (~1640 lines). Also adds ~6000 lines of design/spec documents under docs/superpowers/.

ProduceBatchResult impact: not affected

The PR adds trace_id to RolloutState and propagates trace metadata through extra_fields (train_step, model_step, producer_future_step, produce_batch_id). These are purely additive — no ProduceBatchResult fields (batch status, returned groups, timing, leftovers, rewards, produced work counters) are modified. The flush_trace() call in agent_loop_manager.py runs after produce_batch completes and does not alter batch accounting.

RoutedExperts impact: not affected

The trace changes do not touch routed-experts handling, object-ref plumbing, or cleanup paths. extra_fields additions are string metadata only.

Issues

Critical

  • [trace.py:1004] return inside finally blocks in use_trace_context, trace_baggage, and trace_task_context context managers will silently swallow the caller's exception if otel_context.detach() also fails. This is a correctness bug — exceptions from traced code paths can vanish. Fix: replace return with pass in all three locations (lines 1004, 1034, 1084).

Warning

  • [trace.py:551-554] set_tracer_provider failure is silently swallowed with except Exception: pass. If another provider is already registered, all spans silently go nowhere. Should at minimum log a warning.
  • [trace.py:520] insecure=True is hardcoded for gRPC export. Trace spans may contain prompt/response content in attributes — sending these over plaintext to a remote collector is a data exposure risk. Consider making this configurable via TraceConfig.
  • [session_server.py:750-755] Private functions (_choices_output_ids_len, _extract_body_trace_context, _response_output_ids_len) exported in __all__. If these are public API, remove the underscore prefix; if internal, remove from __all__.
  • Missing PR description. A feature PR of this scope (~46 files, ~5000+ lines of production code) should have a body explaining the design, motivation, and how to enable/configure tracing.

Nit

  • [trace.py:363] build_rollout_trace_id uses SHA-1 while _stable_trace_id uses blake2b. Consider unifying on blake2b for consistency.
  • [trace.py:548-549] schedule_delay_millis=100 is 50× more aggressive than the OTel SDK default (5000ms). This increases export overhead significantly — consider whether real-time viewing truly requires sub-second flush.
  • [otlp_http_sink.py:115] Debug tool defaults to binding on 0.0.0.0; 127.0.0.1 is safer for a dev utility.
  • ~6000 lines of design documents under docs/superpowers/ could be split into a separate PR to keep this diff focused on the implementation.

Verdict

REQUEST_CHANGES — the return-in-finally bug is a correctness issue that can silently swallow exceptions in production code paths. The fix is mechanical (change return → pass in three locations). The remaining warnings are worth addressing but are not blocking.


1 participant