Fix false-positive DSL classloader leak warnings on idle heaps#13935
Merged
Conversation
There was a problem hiding this comment.
Pull request overview
This PR refines the DSL rule classloader leak detector in OAP server-core to avoid false-positive “leak suspected” warnings on idle heaps by gating warnings on evidence of a class-unloading GC cycle (via an unload-probe), and adds unit tests to validate the gating logic deterministically.
Changes:
- Add unload-probe mechanism to detect class-unloading GC cycles and gate leak warnings on that evidence.
- Adjust
DSLClassLoaderManagersweep logging to WARN only for evidence-backed suspects and DEBUG-log pending-without-evidence. - Add JUnit tests covering the evidence-gating behavior without relying on
System.gc().
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/classloader/ClassLoaderGc.java | Adds unload-probe plumbing, evidence watermarking, and leakSuspects() gating logic. |
| oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/classloader/DSLClassLoaderManager.java | Switches sweeper WARNs to use evidence-backed suspects and adds DEBUG diagnostics for pending loaders. |
| oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/classloader/UnloadProbePayload.java | Introduces minimal bytecode payload class used by the unload-probe. |
| oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/classloader/ClassLoaderGcTest.java | Adds deterministic unit tests for evidence gating and watermark behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…ock-free settle-window test
wankai123
approved these changes
Jul 2, 2026
9 tasks
wu-sheng
added a commit
that referenced
this pull request
Jul 2, 2026
…rtOnFailure wiring and leak-probe timing
- LAL json{} reads the JSON body first and falls back to parsing the text
body as JSON (the OTLP log receiver delivers every string body as text,
even JSON-shaped ones). On a successful text-fallback parse, the matching
rule's context input is swapped to a JSON-bodied copy so that rule persists
the log with content type JSON; the input shared with other rules is never
mutated.
- Fix the abortOnFailure option being silently ignored by the v2 compiler:
the rule's flag is now baked into the generated json/yaml/text-regexp
parser calls (default true, as documented). Aborting parse failures WARN
at most once per minute per parser with the suppressed count reported on
the next emission; abortOnFailure=false failures are DEBUG-only and
continue with a metadata-backed parsed map so parsed.* reads stay
null-safe. Typed-proto inputs (Envoy ALS routing guard) stay quiet.
- Fix delayed classloader leak detection (follow-up to #13935): arm the
unload probe on demand once a pending entry's settle window elapses and
record the collected probe's mint time — sound evidence with single-GC
detection and no drain-time overshoot.
wu-sheng
added a commit
that referenced
this pull request
Jul 2, 2026
…rtOnFailure wiring and leak-probe timing
- LAL json{} reads the JSON body first and falls back to parsing the text
body as JSON (the OTLP log receiver delivers every string body as text,
even JSON-shaped ones). On a successful text-fallback parse, the matching
rule's context input is swapped to a JSON-bodied copy so that rule persists
the log with content type JSON; the input shared with other rules is never
mutated.
- Fix the abortOnFailure option being silently ignored by the v2 compiler:
the rule's flag is now baked into the generated json/yaml/text-regexp
parser calls (default true, as documented). Aborting parse failures WARN
at most once per minute per parser with the suppressed count reported on
the next emission; abortOnFailure=false failures are DEBUG-only and
continue with a metadata-backed parsed map so parsed.* reads stay
null-safe. Typed-proto inputs (Envoy ALS routing guard) stay quiet.
- Fix delayed classloader leak detection (follow-up to #13935): arm the
unload probe on demand once a pending entry's settle window elapses and
record the collected probe's mint time — sound evidence with single-GC
detection and no drain-time overshoot.
wu-sheng
added a commit
that referenced
this pull request
Jul 2, 2026
…rtOnFailure wiring and leak-probe timing
- LAL json{} reads the JSON body first and falls back to parsing the text
body as JSON (the OTLP log receiver delivers every string body as text,
even JSON-shaped ones). On a successful text-fallback parse, the matching
rule's context input is swapped to a JSON-bodied copy so that rule persists
the log with content type JSON; the input shared with other rules is never
mutated.
- Fix the abortOnFailure option being silently ignored by the v2 compiler:
the rule's flag is now baked into the generated json/yaml/text-regexp
parser calls (default true, as documented). Aborting parse failures WARN
at most once per minute per parser with the suppressed count reported on
the next emission; abortOnFailure=false failures are DEBUG-only and
continue with a metadata-backed parsed map so parsed.* reads stay
null-safe. Typed-proto inputs (Envoy ALS routing guard) stay quiet.
- Fix delayed classloader leak detection (follow-up to #13935): arm the
unload probe on demand once a pending entry's settle window elapses and
record the collected probe's mint time — sound evidence with single-GC
detection and no drain-time overshoot.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fix false-positive
rule loader leak suspectedwarnings after DSL rule hot updatesWhy the bug exists. When a runtime-rule hot update displaces a
RuleClassLoader,DSLClassLoaderManagerretires it into a phantom-reference graveyard and WARNs (rule loader leak suspected) if the loader is still uncollected 5 minutes later. But wall-clock age is not evidence of a leak: a classloader that has defined classes can only be reclaimed by a class-unloading-capable GC cycle (G1 concurrent mark / full GC — young collections never unload classes), and an idle heap may not run one for hours. So the WARN fired after essentially every hot update on a quiet OAP, alarming operators for what was plain GC inactivity.How it is fixed. The graveyard now arms an unload probe whenever retired loaders are pending: a parent-less throwaway classloader that defines one empty class (
UnloadProbePayload) and is immediately dereferenced. The probe has the exact same collection requirement as a retired rule loader, so its collection (observed via its own phantom queue) proves a class-unloading cycle completed after the probe's mint time — collector-agnostic, unlike GC-MXBean counting. The leak WARN now fires only when such a cycle completed at least the 5-minute settle window after a loader's retirement and the loader still survived it — proof it is strongly referenced, not GC lag. The WARN message states that evidence and directs operators to heap-dump triage; pending-without-evidence loaders are DEBUG-only, and the eventualrule loader collectedINFO notes when it clears an earlier warning. If the probe payload bytecode is ever unreadable, detection degrades to the previous wall-clock heuristic rather than going silent.The gating logic is covered by deterministic, collector-independent unit tests (evidence injected directly; no
System.gc()dependence, so no CI flake risk).CHANGESlog.No CHANGES entry: the leak detector has not shipped in any release, so this fix has no released-behavior delta to document.
🤖 Generated with Claude Code