[WIP] support otel trace and jaeger viewer by YanhuiDua · Pull Request #1918 · InternLM/xtuner

YanhuiDua · 2026-06-16T10:12:19Z

No description provided.

YanhuiDua · 2026-06-16T10:12:30Z

claude · 2026-06-16T10:18:53Z

+    finally:
+        if token is not None:
+            try:
+                otel_context.detach(token)
+            except Exception:
+                return


Claude: Bug — return inside a finally block of a @contextmanager silently swallows exceptions.

If the caller raises inside the with use_trace_context(...) block and otel_context.detach(token) also fails, this return suppresses the original caller exception. The caller's error disappears with no indication.

Suggested change

-    finally:
-        if token is not None:
-            try:
-                otel_context.detach(token)
-            except Exception:
-                return
+    finally:
+        if token is not None:
+            try:
+                otel_context.detach(token)
+            except Exception:
+                pass

Replace return with pass so the original exception continues to propagate. The same pattern appears in trace_baggage (line ~1033) and trace_task_context (line ~1083) — all three need the same fix.
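The behaviour described above can be reproduced without OpenTelemetry. The following is a minimal sketch, assuming a stand-in helper `failing_detach` (hypothetical, not part of the PR) that always raises so the `except` branch inside `finally` is reached. `buggy_context` mirrors the pattern in the diff and `fixed_context` mirrors the suggested change.

```python
from contextlib import contextmanager


def failing_detach(token):
    """Hypothetical stand-in for otel_context.detach() that always fails."""
    raise RuntimeError("detach failed")


@contextmanager
def buggy_context():
    token = object()
    try:
        yield
    finally:
        try:
            failing_detach(token)
        except Exception:
            # Returning here ends the generator normally, so contextlib
            # treats the in-flight caller exception as handled.
            return


@contextmanager
def fixed_context():
    token = object()
    try:
        yield
    finally:
        try:
            failing_detach(token)
        except Exception:
            # Ignore only the detach failure; the caller's exception
            # resumes propagating once the finally block ends.
            pass


def caller_error_escapes(context_factory) -> bool:
    """Return True if an error raised inside the with-block reaches the caller."""
    try:
        with context_factory():
            raise ValueError("caller error")
    except ValueError:
        return True
    return False
```

Calling `caller_error_escapes(buggy_context)` reports that the `ValueError` never reaches the caller, while `caller_error_escapes(fixed_context)` reports that it does. Recent Python versions also emit a `SyntaxWarning` for `return` inside a `finally` block.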

claude · 2026-06-16T10:18:56Z

+    finally:
+        if token is not None:
+            try:
+                context.detach(token)
+            except Exception:
+                return


Claude: Bug — Same return-in-finally issue as use_trace_context. This will suppress the caller's exception if context.detach(token) also raises.

Suggested change

-    finally:
-        if token is not None:
-            try:
-                context.detach(token)
-            except Exception:
-                return
+    finally:
+        if token is not None:
+            try:
+                context.detach(token)
+            except Exception:
+                pass

claude · 2026-06-16T10:18:58Z

+    finally:
+        if token is not None and otel_context_module is not None:
+            try:
+                otel_context_module.detach(token)
+            except Exception:


Claude: Bug — Same return-in-finally issue as the other two context managers. Replace with pass.

Suggested change

-    finally:
-        if token is not None and otel_context_module is not None:
-            try:
-                otel_context_module.detach(token)
-            except Exception:
+    finally:
+        if token is not None:
+            try:
+                otel_context_module.detach(token)
+            except Exception:
+                pass

claude · 2026-06-16T10:19:02Z

+        try:
+            otel_trace.set_tracer_provider(self._provider)
+        except Exception:
+            pass


Claude: Warning — Silent except Exception: pass when setting the global tracer provider. If another TracerProvider is already registered (e.g. user code, a test, or double initialization), this silently fails and all subsequent spans go to the wrong provider. Consider logging a warning so operators can diagnose "traces aren't showing up" issues:

    try:
        otel_trace.set_tracer_provider(self._provider)
    except Exception:
        logger.warning("Failed to set global tracer provider; spans may not export", exc_info=True)

claude · 2026-06-16T10:19:07Z

+        return OTLPSpanExporter(endpoint=config.otel_endpoint)
+    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
+
+    return OTLPSpanExporter(endpoint=config.otel_endpoint, insecure=True)


Claude: Warning — insecure=True is hardcoded, which disables TLS for all gRPC trace export. This is fine for localhost/cluster-internal Jaeger deployments, but if anyone points otel_endpoint at a remote collector over an untrusted network, trace data (which may include prompt content in span attributes) will be sent in plaintext.

Consider making this configurable via TraceConfig, e.g. otel_insecure: bool = True, or at minimum add a comment noting the security assumption.
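One way the configurable flag could look is sketched below. The `TraceConfig` class here is a stand-in containing only the fields needed for the example, `otel_insecure` is the field name proposed in this comment (it does not exist in the PR), and `grpc_exporter_kwargs` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class TraceConfig:
    """Stand-in for the PR's TraceConfig; only the fields used in this sketch."""

    otel_endpoint: str = "localhost:4317"
    # Hypothetical field: keeps today's plaintext behaviour by default,
    # but lets operators turn TLS on for remote collectors.
    otel_insecure: bool = True


def grpc_exporter_kwargs(config: TraceConfig) -> dict:
    """Build the keyword arguments that would be passed to the gRPC OTLPSpanExporter."""
    return {"endpoint": config.otel_endpoint, "insecure": config.otel_insecure}
```

The exporter construction would then become `OTLPSpanExporter(**grpc_exporter_kwargs(config))`, so the security assumption is visible in the config rather than buried in the factory.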

claude · 2026-06-16T10:19:11Z

+            "message": state.message,
+            "repeat_index": repeat_index,
+        }
+    digest = hashlib.sha1(_json_dumps_stable(payload).encode("utf-8")).hexdigest()[:16]


Claude: Nit — This uses SHA-1 while _stable_trace_id (line ~380) uses blake2b. The inconsistency is surprising. SHA-1 is fine for non-cryptographic trace IDs, but since blake2b is already available and used elsewhere in this file, consider unifying on blake2b for consistency (it's also faster for short inputs):

digest = hashlib.blake2b(_json_dumps_stable(payload).encode("utf-8"), digest_size=8).hexdigest()
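To check that the swap keeps the ID length unchanged, here is a small sketch comparing the two forms. `stable_json` is an assumed stand-in for `_json_dumps_stable`, whose real implementation is not shown in this thread.

```python
import hashlib
import json


def stable_json(payload: dict) -> str:
    """Assumed stand-in for _json_dumps_stable: key order does not affect output."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":"))


def sha1_digest(payload: dict) -> str:
    """Current form: first 16 hex characters of a SHA-1 digest."""
    return hashlib.sha1(stable_json(payload).encode("utf-8")).hexdigest()[:16]


def blake2b_digest(payload: dict) -> str:
    """Suggested form: an 8-byte blake2b digest, which is also 16 hex characters."""
    return hashlib.blake2b(stable_json(payload).encode("utf-8"), digest_size=8).hexdigest()
```

Both functions return 16 hex characters, so downstream length assumptions hold. The values themselves will generally differ between the two hashes, so IDs generated before the switch would not match IDs generated after it.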

claude · 2026-06-16T10:19:16Z

+
+__all__ = [
+    "SessionServer",
+    "SessionServerActor",
+    "_choices_output_ids_len",
+    "_extract_body_trace_context",


Claude: Warning — Exporting private-looking names (_choices_output_ids_len, _extract_body_trace_context, _response_output_ids_len) in __all__ is unusual. Combined with the from .session_trace import _foo as _foo re-export pattern above, this looks like a workaround for type-checker visibility.

If these functions are part of the module's public API, drop the leading underscore. If they're internal, remove them from __all__ and import them normally — __all__ signals "this is the public interface" to both tools and humans.

claude · 2026-06-16T10:19:18Z

+    parser.add_argument("--host", default="0.0.0.0")
+    parser.add_argument("--port", type=int, default=4318)
+    parser.add_argument("--root", default="/tmp/otelcol")
+    args = parser.parse_args()


Claude: Nit — Default binding to 0.0.0.0 exposes the debug sink to all network interfaces. Since this is a development/debugging tool, consider defaulting to 127.0.0.1 and letting users explicitly opt into 0.0.0.0 when needed.
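A sketch of the loopback-by-default variant follows. `build_parser` is a hypothetical wrapper added for the example; the argument names and the other defaults are taken from the diff above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Argument parser for the debug sink, bound to loopback unless told otherwise."""
    parser = argparse.ArgumentParser(description="OTLP HTTP debug sink (sketch)")
    parser.add_argument(
        "--host",
        default="127.0.0.1",
        help="Interface to bind; pass 0.0.0.0 explicitly to accept remote connections.",
    )
    parser.add_argument("--port", type=int, default=4318)
    parser.add_argument("--root", default="/tmp/otelcol")
    return parser
```

With this default, running the tool with no arguments listens only on the local machine, and `--host 0.0.0.0` remains available as an explicit opt-in.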

claude · 2026-06-16T10:19:22Z

+        self._provider.add_span_processor(
+            BatchSpanProcessor(exporter, schedule_delay_millis=100, max_export_batch_size=512)


Claude: Nit — schedule_delay_millis=100 is 50× more aggressive than the OTel SDK default (5000ms). This means the exporter flushes every 100ms, which increases CPU overhead and network chatter — especially under high span volume during RL training. Unless there's a specific latency requirement for real-time trace viewing, consider using a more moderate value like 1000 or the default 5000.
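If a short delay is only needed while watching traces live, the value could be made overridable instead of hardcoded. The sketch below is one possible approach, not something the PR does: `resolve_schedule_delay_millis` is a hypothetical helper that reads the standard OpenTelemetry `OTEL_BSP_SCHEDULE_DELAY` environment variable and otherwise falls back to the SDK default of 5000 ms.

```python
import os
from typing import Mapping, Optional

OTEL_SDK_DEFAULT_DELAY_MS = 5000  # default batch span processor delay cited above


def resolve_schedule_delay_millis(env: Optional[Mapping[str, str]] = None) -> int:
    """Pick the batch export delay from the environment, else use the SDK default."""
    env = os.environ if env is None else env
    raw = env.get("OTEL_BSP_SCHEDULE_DELAY")
    if raw is None:
        return OTEL_SDK_DEFAULT_DELAY_MS
    try:
        value = int(raw)
    except ValueError:
        # Ignore malformed values rather than failing tracer setup.
        return OTEL_SDK_DEFAULT_DELAY_MS
    return value if value > 0 else OTEL_SDK_DEFAULT_DELAY_MS
```

The result could be passed as `schedule_delay_millis=resolve_schedule_delay_millis()`, so a 100 ms delay is set only in environments that ask for it.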

claude · 2026-06-16T10:19:56Z

Claude: Summary

Large feature PR adding OpenTelemetry tracing and a Jaeger-backed trace viewer to the RL training pipeline. Replaces the old file-based per-sample tracing (sandbox_agent_loop/trace.py, deleted) with a unified OTel-based system. Instruments the full RL pipeline — sampling, agent loops, rollout workers/controllers, session server, and judgers — with spans and trace context propagation across Ray actors via environment variables. Includes a Jaeger query dashboard (jaeger_trace_dashboard.py), trace analysis tools, and comprehensive unit tests (~1640 lines). Also adds ~6000 lines of design/spec documents under docs/superpowers/.

ProduceBatchResult impact: not affected

The PR adds trace_id to RolloutState and propagates trace metadata through extra_fields (train_step, model_step, producer_future_step, produce_batch_id). These are purely additive — no ProduceBatchResult fields (batch status, returned groups, timing, leftovers, rewards, produced work counters) are modified. The flush_trace() call in agent_loop_manager.py runs after produce_batch completes and does not alter batch accounting.

RoutedExperts impact: not affected

The trace changes do not touch routed-experts handling, object-ref plumbing, or cleanup paths. extra_fields additions are string metadata only.

Issues

Critical

[trace.py:1004] return inside finally blocks in use_trace_context, trace_baggage, and trace_task_context context managers will silently swallow the caller's exception if otel_context.detach() also fails. This is a correctness bug — exceptions from traced code paths can vanish. Fix: replace return with pass in all three locations (lines 1004, 1034, 1084).

Warning

[trace.py:551-554] set_tracer_provider failure is silently swallowed with except Exception: pass. If another provider is already registered, spans go to that provider instead of this one, with no indication. Should at minimum log a warning.
[trace.py:520] insecure=True is hardcoded for gRPC export. Trace spans may contain prompt/response content in attributes — sending these over plaintext to a remote collector is a data exposure risk. Consider making this configurable via TraceConfig.
[session_server.py:750-755] Private functions (_choices_output_ids_len, _extract_body_trace_context, _response_output_ids_len) exported in __all__. If these are public API, remove the underscore prefix; if internal, remove from __all__.
Missing PR description. A feature PR of this scope (~46 files, 5000+ lines of production code) should have a body explaining the design, motivation, and how to enable and configure tracing.

Nit

[trace.py:363] build_rollout_trace_id uses SHA-1 while _stable_trace_id uses blake2b. Consider unifying on blake2b for consistency.
[trace.py:548-549] schedule_delay_millis=100 is 50× more aggressive than the OTel SDK default (5000ms). This increases export overhead significantly — consider whether real-time viewing truly requires sub-second flush.
[otlp_http_sink.py:115] Debug tool defaults to binding on 0.0.0.0; 127.0.0.1 is safer for a dev utility.
~6000 lines of design documents under docs/superpowers/ could be split into a separate PR to keep this diff focused on the implementation.

Verdict

REQUEST_CHANGES — the return-in-finally bug is a correctness issue that can silently swallow exceptions in production code paths. The fix is mechanical (change return → pass in three locations). The rest of the warnings are worth addressing but not blocking.

YanhuiDua added 6 commits June 16, 2026 07:24

feat(trace): add OpenTelemetry task tracing runtime

e4fba02

feat(trace): instrument producer and rollout lifecycle

9492943

feat(trace): trace sandbox and localhost agent stages

cd296d1

feat(trace): add Jaeger-backed task dashboard

106c3ba

test(trace): add trace unit tests and smoke configs

c4b0fcb

docs(trace): document RL task tracing design

6a31d8e

YanhuiDua changed the title ~~[Feat] support otel trace and jaeger viewer~~ [WIP] support otel trace and jaeger viewer Jun 16, 2026

claude Bot reviewed Jun 16, 2026

YanhuiDua wants to merge 6 commits into InternLM:main from YanhuiDua:feat/otel-trace-jaeger-viewer
