[WIP] support otel trace and jaeger viewer#1918
Conversation
|
@claude review |
| finally: | ||
| if token is not None: | ||
| try: | ||
| otel_context.detach(token) | ||
| except Exception: | ||
| return |
There was a problem hiding this comment.
Claude: Bug — return inside a finally block of a @contextmanager silently swallows exceptions.
If the caller raises inside the with use_trace_context(...) block and otel_context.detach(token) also fails, this return suppresses the original caller exception. The caller's error disappears with no indication.
| finally: | |
| if token is not None: | |
| try: | |
| otel_context.detach(token) | |
| except Exception: | |
| return | |
| finally: | |
| if token is not None: | |
| try: | |
| otel_context.detach(token) | |
| except Exception: | |
| pass |
Replace return with pass so the original exception continues to propagate. The same pattern appears in trace_baggage (line ~1033) and trace_task_context (line ~1083) — all three need the same fix.
| finally: | ||
| if token is not None: | ||
| try: | ||
| context.detach(token) | ||
| except Exception: | ||
| return |
There was a problem hiding this comment.
Claude: Bug — Same return-in-finally issue as use_trace_context. This will suppress the caller's exception if context.detach(token) also raises.
| finally: | |
| if token is not None: | |
| try: | |
| context.detach(token) | |
| except Exception: | |
| return | |
| finally: | |
| if token is not None: | |
| try: | |
| context.detach(token) | |
| except Exception: | |
| pass |
| finally: | ||
| if token is not None and otel_context_module is not None: | ||
| try: | ||
| otel_context_module.detach(token) | ||
| except Exception: |
There was a problem hiding this comment.
Claude: Bug — Same return-in-finally issue as the other two context managers. Replace with pass.
| finally: | |
| if token is not None and otel_context_module is not None: | |
| try: | |
| otel_context_module.detach(token) | |
| except Exception: | |
| finally: | |
| if token is not None: | |
| try: | |
| otel_context_module.detach(token) | |
| except Exception: | |
| pass |
| try: | ||
| otel_trace.set_tracer_provider(self._provider) | ||
| except Exception: | ||
| pass |
There was a problem hiding this comment.
Claude: Warning — Silent except Exception: pass when setting the global tracer provider. If another TracerProvider is already registered (e.g. user code, a test, or double initialization), this silently fails and all subsequent spans go to the wrong provider. Consider logging a warning so operators can diagnose "traces aren't showing up" issues:
try:
otel_trace.set_tracer_provider(self._provider)
except Exception:
logger.warning("Failed to set global tracer provider; spans may not export", exc_info=True)| return OTLPSpanExporter(endpoint=config.otel_endpoint) | ||
| from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter | ||
|
|
||
| return OTLPSpanExporter(endpoint=config.otel_endpoint, insecure=True) |
There was a problem hiding this comment.
Claude: Warning — insecure=True is hardcoded, which disables TLS for all gRPC trace export. This is fine for localhost/cluster-internal Jaeger deployments, but if anyone points otel_endpoint at a remote collector over an untrusted network, trace data (which may include prompt content in span attributes) will be sent in plaintext.
Consider making this configurable via TraceConfig, e.g. otel_insecure: bool = True, or at minimum add a comment noting the security assumption.
| "message": state.message, | ||
| "repeat_index": repeat_index, | ||
| } | ||
| digest = hashlib.sha1(_json_dumps_stable(payload).encode("utf-8")).hexdigest()[:16] |
There was a problem hiding this comment.
Claude: Nit — This uses SHA-1 while _stable_trace_id (line ~380) uses blake2b. The inconsistency is surprising. SHA-1 is fine for non-cryptographic trace IDs, but since blake2b is already available and used elsewhere in this file, consider unifying on blake2b for consistency (it's also faster for short inputs):
digest = hashlib.blake2b(_json_dumps_stable(payload).encode("utf-8"), digest_size=8).hexdigest()|
|
||
| __all__ = [ | ||
| "SessionServer", | ||
| "SessionServerActor", | ||
| "_choices_output_ids_len", | ||
| "_extract_body_trace_context", |
There was a problem hiding this comment.
Claude: Warning — Exporting private-looking names (_choices_output_ids_len, _extract_body_trace_context, _response_output_ids_len) in __all__ is unusual. Combined with the from .session_trace import _foo as _foo re-export pattern above, this looks like a workaround for type-checker visibility.
If these functions are part of the module's public API, drop the leading underscore. If they're internal, remove them from __all__ and import them normally — __all__ signals "this is the public interface" to both tools and humans.
| parser.add_argument("--host", default="0.0.0.0") | ||
| parser.add_argument("--port", type=int, default=4318) | ||
| parser.add_argument("--root", default="/tmp/otelcol") | ||
| args = parser.parse_args() |
There was a problem hiding this comment.
Claude: Nit — Default binding to 0.0.0.0 exposes the debug sink to all network interfaces. Since this is a development/debugging tool, consider defaulting to 127.0.0.1 and letting users explicitly opt into 0.0.0.0 when needed.
| self._provider.add_span_processor( | ||
| BatchSpanProcessor(exporter, schedule_delay_millis=100, max_export_batch_size=512) |
There was a problem hiding this comment.
Claude: Nit — schedule_delay_millis=100 is 50× more aggressive than the OTel SDK default (5000ms). This means the exporter flushes every 100ms, which increases CPU overhead and network chatter — especially under high span volume during RL training. Unless there's a specific latency requirement for real-time trace viewing, consider using a more moderate value like 1000 or the default 5000.
|
Claude: ## Summary Large feature PR adding OpenTelemetry tracing and a Jaeger-backed trace viewer to the RL training pipeline. Replaces the old file-based per-sample tracing ( The PR adds The trace changes do not touch routed-experts handling, object-ref plumbing, or cleanup paths. IssuesCritical
Warning
Nit
VerdictREQUEST_CHANGES — the |
No description provided.