Fix false-positive DSL classloader leak warnings on idle heaps #13935

Merged
wu-sheng merged 2 commits into master from fix/dsl-classloader-leak-warn-evidence
Jul 2, 2026
Conversation

@wu-sheng wu-sheng commented Jul 2, 2026

Fix false-positive rule loader leak suspected warnings after DSL rule hot updates

  • Add a unit test to verify that the fix works.
  • Explain briefly why the bug exists and how to fix it.

Why the bug exists. When a runtime-rule hot update displaces a RuleClassLoader, DSLClassLoaderManager retires it into a phantom-reference graveyard and logs a WARN ("rule loader leak suspected") if the loader is still uncollected 5 minutes later. But wall-clock age is not evidence of a leak: a classloader that has defined classes can only be reclaimed by a GC cycle capable of class unloading (for G1, a concurrent mark or a full GC; young collections never unload classes), and an idle heap may not run one for hours. As a result, the WARN fired after essentially every hot update on a quiet OAP, alarming operators over what was plain GC inactivity.

How it is fixed. The graveyard now arms an unload probe whenever retired loaders are pending: a parent-less throwaway classloader that defines one empty class (UnloadProbePayload) and is immediately dereferenced. The probe has exactly the same collection requirement as a retired rule loader, so its collection (observed via its own phantom queue) proves that a class-unloading cycle completed after the probe's mint time. This is collector-agnostic, unlike counting collections through GC MXBeans. The leak WARN now fires only when such a cycle completed at least the 5-minute settle window after a loader's retirement and the loader still survived it, which proves the loader is strongly referenced rather than waiting on GC lag. The WARN message states that evidence and directs operators to heap-dump triage. Loaders that are pending without evidence are logged at DEBUG only, and the eventual "rule loader collected" INFO message notes when it clears an earlier warning. If the probe payload bytecode is ever unreadable, detection degrades to the previous wall-clock heuristic rather than going silent.
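
For readers unfamiliar with the probe idea, the following is a minimal sketch of how such an unload probe could look; it is not the code in ClassLoaderGc. The class name UnloadProbeSketch, its methods, and the payload name are hypothetical, and the sketch synthesizes the bytes of an empty class in memory, whereas the PR describes reading the bytecode of UnloadProbePayload.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;

// Hypothetical sketch of an unload probe; all names here are assumptions.
final class UnloadProbeSketch {
    static final String PAYLOAD_NAME = "UnloadProbePayloadSketch";

    /** Throwaway loader with no parent (bootstrap only) that exposes defineClass. */
    private static final class ProbeLoader extends ClassLoader {
        ProbeLoader() {
            super((ClassLoader) null);
        }

        Class<?> define(String name, byte[] bytes) {
            return defineClass(name, bytes, 0, bytes.length);
        }
    }

    private final ReferenceQueue<ClassLoader> queue = new ReferenceQueue<>();
    private PhantomReference<ClassLoader> armed; // holds the reference object, not the loader
    private long armedMintNanos;
    private boolean hasEvidence;
    private long evidenceMintNanos;

    /** Builds a minimal class file: a public class extending Object with no members. */
    static byte[] emptyClassBytes(String internalName) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeInt(0xCAFEBABE);               // magic
            out.writeShort(0);                      // minor version
            out.writeShort(52);                     // major version (Java 8)
            out.writeShort(5);                      // constant pool count (entries + 1)
            out.writeByte(7); out.writeShort(2);    // #1 Class -> #2
            out.writeByte(1); out.writeUTF(internalName);        // #2 Utf8
            out.writeByte(7); out.writeShort(4);    // #3 Class -> #4
            out.writeByte(1); out.writeUTF("java/lang/Object");  // #4 Utf8
            out.writeShort(0x0021);                 // ACC_PUBLIC | ACC_SUPER
            out.writeShort(1);                      // this_class
            out.writeShort(3);                      // super_class
            out.writeShort(0);                      // interfaces
            out.writeShort(0);                      // fields
            out.writeShort(0);                      // methods
            out.writeShort(0);                      // attributes
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Defines the empty payload class in a fresh parent-less loader. */
    static Class<?> defineInThrowawayLoader() {
        return new ProbeLoader().define(PAYLOAD_NAME, emptyClassBytes(PAYLOAD_NAME));
    }

    /** Arms one probe if none is in flight; loader and class become unreachable on return. */
    synchronized void armIfIdle() {
        if (armed != null) {
            return;
        }
        long mint = System.nanoTime();
        Class<?> payload = defineInThrowawayLoader();
        armed = new PhantomReference<>(payload.getClassLoader(), queue);
        armedMintNanos = mint;
    }

    /** Returns true if the armed probe was collected; records its mint time as evidence. */
    synchronized boolean poll() {
        Reference<? extends ClassLoader> ref = queue.poll();
        if (ref == null || ref != armed) {
            return false;
        }
        hasEvidence = true;
        evidenceMintNanos = armedMintNanos;
        armed = null;
        return true;
    }

    synchronized boolean isArmed() {
        return armed != null;
    }

    synchronized boolean hasEvidence() {
        return hasEvidence;
    }

    synchronized long evidenceMintNanos() {
        return evidenceMintNanos;
    }
}
```

A sweeper would call armIfIdle() while retired loaders are pending and poll() on each pass. Whether and when poll() returns true depends entirely on the collector, which is why the sketch never asserts on it.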

The gating logic is covered by deterministic, collector-independent unit tests (evidence injected directly; no System.gc() dependence, so no CI flake risk).
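
To illustrate what "evidence injected directly" can mean, here is a rough sketch of the gating rule with a manually supplied evidence timestamp. It is an assumption-laden stand-in, not the PR's leakSuspects(): LeakGateSketch, its method names, and the use of millisecond timestamps passed in by the caller are all hypothetical.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of evidence-gated leak suspicion; names are assumptions.
final class LeakGateSketch {
    private final long settleMillis;
    // Retired loaders that have not yet been observed as collected: name -> retirement time.
    private final Map<String, Long> pendingRetiredAt = new LinkedHashMap<>();
    // Latest time at which a class-unloading cycle is known to have completed; null if none.
    private Long evidenceMillis;

    LeakGateSketch(Duration settle) {
        this.settleMillis = settle.toMillis();
    }

    void retire(String loaderName, long retiredAtMillis) {
        pendingRetiredAt.put(loaderName, retiredAtMillis);
    }

    void collected(String loaderName) {
        pendingRetiredAt.remove(loaderName);
    }

    /** Injects evidence directly, as a test would; the watermark only moves forward. */
    void recordEvidence(long cycleCompletedAtMillis) {
        if (evidenceMillis == null || cycleCompletedAtMillis > evidenceMillis) {
            evidenceMillis = cycleCompletedAtMillis;
        }
    }

    /** Loaders that survived a class-unloading cycle at least one settle window after retirement. */
    List<String> leakSuspects() {
        List<String> suspects = new ArrayList<>();
        if (evidenceMillis == null) {
            return suspects; // no cycle observed: pending loaders are not suspects
        }
        for (Map.Entry<String, Long> e : pendingRetiredAt.entrySet()) {
            if (evidenceMillis >= e.getValue() + settleMillis) {
                suspects.add(e.getKey());
            }
        }
        return suspects;
    }
}
```

Because both the retirement time and the evidence time are plain arguments, a test can step through "no evidence", "evidence too early", and "evidence after the settle window" without triggering a real collection.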

  • If this pull request closes/resolves/fixes an existing issue, replace the issue number. Closes #.
  • Update the CHANGES log.

No CHANGES entry: the leak detector has not shipped in any release, so this fix has no released-behavior delta to document.

🤖 Generated with Claude Code

@wu-sheng wu-sheng requested a review from Copilot July 2, 2026 09:45
@wu-sheng wu-sheng added this to the 11.0.0 milestone Jul 2, 2026
@wu-sheng wu-sheng added the enhancement Enhancement on performance or codes label Jul 2, 2026

Copilot AI left a comment

Pull request overview

This PR refines the DSL rule classloader leak detector in OAP server-core to avoid false-positive “leak suspected” warnings on idle heaps by gating warnings on evidence of a class-unloading GC cycle (via an unload-probe), and adds unit tests to validate the gating logic deterministically.

Changes:

  • Add unload-probe mechanism to detect class-unloading GC cycles and gate leak warnings on that evidence.
  • Adjust DSLClassLoaderManager sweep logging to WARN only for evidence-backed suspects and DEBUG-log pending-without-evidence.
  • Add JUnit tests covering the evidence-gating behavior without relying on System.gc().

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/classloader/ClassLoaderGc.java | Adds unload-probe plumbing, evidence watermarking, and leakSuspects() gating logic. |
| oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/classloader/DSLClassLoaderManager.java | Switches sweeper WARNs to use evidence-backed suspects and adds DEBUG diagnostics for pending loaders. |
| oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/classloader/UnloadProbePayload.java | Introduces minimal bytecode payload class used by the unload-probe. |
| oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/classloader/ClassLoaderGcTest.java | Adds deterministic unit tests for evidence gating and watermark behavior. |

@wu-sheng wu-sheng merged commit 1432e5e into master Jul 2, 2026
440 of 444 checks passed
@wu-sheng wu-sheng deleted the fix/dsl-classloader-leak-warn-evidence branch July 2, 2026 13:32
wu-sheng added a commit that referenced this pull request Jul 2, 2026
…rtOnFailure wiring and leak-probe timing

- LAL json{} reads the JSON body first and falls back to parsing the text
  body as JSON (the OTLP log receiver delivers every string body as text,
  even JSON-shaped ones). On a successful text-fallback parse, the matching
  rule's context input is swapped to a JSON-bodied copy so that rule persists
  the log with content type JSON; the input shared with other rules is never
  mutated.
- Fix the abortOnFailure option being silently ignored by the v2 compiler:
  the rule's flag is now baked into the generated json/yaml/text-regexp
  parser calls (default true, as documented). Aborting parse failures WARN
  at most once per minute per parser with the suppressed count reported on
  the next emission; abortOnFailure=false failures are DEBUG-only and
  continue with a metadata-backed parsed map so parsed.* reads stay
  null-safe. Typed-proto inputs (Envoy ALS routing guard) stay quiet.
- Fix delayed classloader leak detection (follow-up to #13935): arm the
  unload probe on demand once a pending entry's settle window elapses and
  record the collected probe's mint time — sound evidence with single-GC
  detection and no drain-time overshoot.
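
The "WARN at most once per minute per parser with the suppressed count reported on the next emission" behaviour in the commit message above can be pictured with a small throttle like the one below. This is only a sketch of the general technique: WarnThrottleSketch and tryEmit are hypothetical names, the caller passes the clock value in explicitly, and nothing here is taken from the actual LAL code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalLong;

// Hypothetical per-key WARN throttle; not the project's implementation.
final class WarnThrottleSketch {
    private static final class State {
        boolean emittedOnce;
        long lastEmitMillis;
        long suppressed;
    }

    private final long intervalMillis;
    private final Map<String, State> states = new HashMap<>();

    WarnThrottleSketch(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    /**
     * Returns the number of failures suppressed since the last emission if a WARN
     * should be logged now, or empty if this failure should be counted silently.
     */
    synchronized OptionalLong tryEmit(String parser, long nowMillis) {
        State s = states.computeIfAbsent(parser, k -> new State());
        if (s.emittedOnce && nowMillis - s.lastEmitMillis < intervalMillis) {
            s.suppressed++;
            return OptionalLong.empty();
        }
        long toReport = s.suppressed;
        s.suppressed = 0;
        s.lastEmitMillis = nowMillis;
        s.emittedOnce = true;
        return OptionalLong.of(toReport);
    }
}
```

A caller would log the WARN only when tryEmit returns a value and append the returned count to the message; keying by parser name keeps a noisy json parser from silencing yaml or text-regexp warnings.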

3 participants