Make ExtractLicenses handle large expressions#139
Conversation
An expression with a lot of terms in it can cause ExtractLicences memory usage to blow up. This change alters the traversal to efficiently collect the licenses found. Includes a test that ensures the long expression found in the wild works. This test is a little ugly, because it runs the test in a subprocess to ensure it can be reliably killed. Fixes #138
There was a problem hiding this comment.
Pull request overview
This PR addresses excessive memory usage / hangs in spdxexp.ExtractLicenses when evaluating large compound SPDX license expressions by replacing the expansion-based traversal with a direct AST walk that collects unique licenses.
Changes:
- Reworked
ExtractLicensesto traverse the parsed expression tree and collect unique reconstructed license strings without generating expanded combinations. - Added tests for LicenseRef handling/dedup and a “real world” long expression, executed in a subprocess with a timeout to detect hangs/OOM.
Show a summary per file
| File | Description |
|---|---|
spdxexp/extracts.go |
Replaces expansion+flatten logic with a tree traversal that collects unique licenses in a map. |
spdxexp/extracts_test.go |
Adds regression tests for LicenseRef dedup and a long-expression hang/OOM scenario via subprocess + timeout. |
Copilot's findings
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Files reviewed: 2/2 changed files
- Comments generated: 1
| seen := map[string]struct{}{} | ||
| collectExtractedLicenses(node, seen) | ||
| return slices.Collect(maps.Keys(seen)), nil |
There was a problem hiding this comment.
ExtractLicenses now returns slices.Collect(maps.Keys(seen)), which depends on Go map iteration order and can produce nondeterministic output across runs. This is a behavior change from the previous implementation (which produced a deterministic order after expand(true) + dedup) and can lead to flaky downstream assertions or unstable output.
Consider returning a deterministic order (e.g., collect keys into a slice and sort before returning, or maintain insertion order while deduping).
There was a problem hiding this comment.
The previous implementation did not have any sorting and would therefore also have been non-deterministic.
62e3e21 to
e589960
Compare
An expression with a lot of terms in it can cause ExtractLicences
memory usage to blow up. This change alters the traversal to
efficiently collect the licenses found.
Includes a test that ensures the long expression found in the wild
works. This test is a little ugly, because it runs the test in a
subprocess to ensure it can be reliably killed.
Fixes #138