fix(dgw): improve session-level tracing#1822
Conversation
Correlate recording uploads with their session: carry the session id on the JREC push span and emit a "Recording push stream started" event, so a /jet/jrec/push stream and its /jet/rdp session can be matched with a single session-id grep (the push span previously only had client=<ip>). Distinguish a Gateway-initiated session kill from a natural or peer-driven end in the forwarding proxy. A kill (e.g. recording-policy violation, max session duration) previously logged the same "Forwarding ended" line as a normal close and was easily mistaken for a target-side reset during triage. Log the resolved peer address when the RDCleanPath TLS handshake to the target RDP server fails. The bubbled error only carries the target hostname; logging the IP we actually connected to distinguishes a wrong/split-horizon DNS resolution from a target-side reset without a packet capture. Tracing/logging only; no behavior change.
Let maintainers know that an action is required on their side
|
Use a ForwardingOutcome enum (derives Debug, logged as a structured field) instead of a true/false bool to record whether forwarding ended naturally or was killed by the Gateway. Carry session_id on the jmux span as well, the remaining WebSocket side-channel handler that only had client= (jrec was fixed in the previous commit). Both now match the session_id field already used by the fwd / generic_client / rdp_proxy spans.
When the RDCleanPath TLS handshake to the target RDP server fails, classify the failure from the error already in hand (server alert / non-TLS response / reset before any response / other) and log it as a structured field, alongside the resolved peer address. This lets a crypto/cert/version mismatch (server TLS alert) be told apart from a pre-ServerHello reset (in-path device, fingerprint filtering, or target drop) and from a non-TLS response on the port - the distinction that previously required a packet capture.
There was a problem hiding this comment.
Pull request overview
This PR improves Devolutions Gateway’s session-level observability for support/triage by enriching tracing spans and adding targeted log lines around common failure points (recording uploads, proxy termination, and RDCleanPath TLS handshakes).
Changes:
- Add
session_idcorrelation to/jet/jrec/pushand/jmuxtracing spans, plus a “Recording push stream started” log event on connect. - Add a forwarding “outcome” signal to distinguish Gateway-initiated termination from natural/peer-driven ends.
- Log the resolved target
server_addrand a TLS handshake failure classification when RDCleanPath’s target TLS handshake fails.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| devolutions-gateway/src/recording.rs | Adds a connect-time log event for recording push streams with session_id. |
| devolutions-gateway/src/api/jrec.rs | Adds session_id to the jrec span for correlation across logs. |
| devolutions-gateway/src/api/jmux.rs | Adds session_id to the jmux span for correlation across logs. |
| devolutions-gateway/src/rd_clean_path.rs | Adds target TLS handshake failure classification + logs resolved server_addr on failure. |
| devolutions-gateway/src/proxy.rs | Introduces ForwardingOutcome to distinguish completion vs kill-path termination in logs. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
The TlsHandshakeFailure alert field was only read through a derived Debug, which `cargo clippy --workspace --tests -- -D warnings` (the CI lint) rejects as a never-read field. Give the enum a manual Display impl (which reads the field, satisfying the lint) and log it via Display, and carry the rustls AlertDescription itself instead of a raw u8 so the alert reads as e.g. HandshakeFailure rather than 40.
Addresses review feedback: the forwarding-end log collapsed both outcomes into a single "Forwarding ended" message plus a structured field. Restore a distinct message for the kill path so the Gateway-initiated termination is obvious in plain-text logs and simple greps, while keeping the structured `outcome` field for log processing.
Co-authored-by: Benoît Cortier <3809077+CBenoit@users.noreply.github.com>
Co-authored-by: Benoît Cortier <3809077+CBenoit@users.noreply.github.com>
Co-authored-by: Benoît Cortier <3809077+CBenoit@users.noreply.github.com>
Builds on the accepted review suggestions (%session_id shorthand on the jrec/jmux spans) and finishes the trim: - Drop ForwardingOutcome: a Gateway-initiated kill is already logged with a more specific reason at the source (terminate endpoint, max session lifetime, recording policy), so the proxy-side distinction is redundant. - Drop the "Recording push stream started" event: the session id is already in the request URL and on the jrec span. - Drop the TLS handshake failure classification and the extra warn!: the error already bubbles up and is logged in full by the top-level error reporter. Instead carry the resolved server address in the TlsHandshake error so that log includes the IP, the one fact it did not already have.
What
Three small, behavior-preserving tracing/logging improvements that make Gateway session failures easier to triage from logs alone.
1. Session-id correlation for recording uploads
/jet/jrec/pushspan now carriessession_id, and aRecording push stream startedevent is emitted on connect./jet/rdpsession through a singlesession_idgrep (the push span previously only hadclient=<ip>).2. Distinguish a Gateway-initiated kill from a peer-driven end
Proxy::forwardnow logsForwarding ended because the session was killed by the Gatewaywhen the session is terminated via the kill path (recording-policy violation, max duration, ...), instead of the sameForwarding endedline used for a natural close.3. Log the resolved target IP on RDCleanPath TLS handshake failure
server_addralongside the target hostname and error.Why
These came out of recent support swarms around browser / web-client RDP: recording-required sessions getting killed when the recording stream drops, and target-side TLS resets where the resolved IP was the missing fact. In several of these the existing logs were correct but required expert correlation across multiple lines, or didn't surface the one fact needed.
Scope / risk
Tracing and logging only - no control-flow or behavioral change.
cargo checkandcargo clippy -p devolutions-gatewayare clean.Out of scope (possible follow-ups)