Avoid correlated SQLite plans for alert decision filters #4471

Open
jimstrang wants to merge 1 commit into crowdsecurity:master from jimstrang:experiment/alert-decision-in-subqueries

@jimstrang jimstrang commented May 17, 2026

Summary

This changes the positive /v1/alerts decision filters from Ent's generated HasDecisionsWith(...) predicate to a small helper that emits independent IN subqueries.

The affected filters are:

  • decision_type
  • origin
  • has_active_decision=true

The intent is to keep the existing filter semantics while avoiding SQLite plans that repeatedly probe decisions for each candidate alert.

Rationale

The current Ent shape is a correlated EXISTS, roughly:

WHERE EXISTS (
  SELECT 1
  FROM decisions
  WHERE decisions.alert_decisions = alerts.id
    AND decisions.origin = ?
)

When several decision filters are combined, SQLite plans this as an alert scan plus repeated correlated probes into decisions. That gets expensive when a matching alert has thousands of linked decisions, which is the shape seen in #4464 and #4470.

The new helper keeps each decision filter independent, but emits:

WHERE alerts.id IN (
  SELECT alert_decisions
  FROM decisions
  WHERE decisions.origin = ?
)

This lets SQLite build matching alert IDs from decisions first, then look up alerts by primary key. It is intentionally not a semantic change: different linked decision rows may still satisfy different filters, just like with the previous separate HasDecisionsWith(...) predicates.
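
The "different rows may satisfy different filters" point can be checked in isolation. The following is a minimal sketch, not the PR's helper: it assumes a simplified two-table schema (only `alert_decisions` and `origin` come from the PR; the `type` column and the fixture rows are illustrative) and compares independent IN subqueries with a single combined subquery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE alerts (id INTEGER PRIMARY KEY);
    CREATE TABLE decisions (
        id INTEGER PRIMARY KEY,
        alert_decisions INTEGER,  -- link to alerts.id
        origin TEXT,
        type TEXT                 -- assumed column name for decision_type
    );
    INSERT INTO alerts (id) VALUES (1), (2), (3);
    -- Alert 1: one row matches the origin filter, a different row matches the type filter.
    INSERT INTO decisions (alert_decisions, origin, type) VALUES
        (1, 'crowdsec', 'captcha'),
        (1, 'cscli',    'ban'),
        (2, 'crowdsec', 'ban'),
        (3, 'cscli',    'captcha');
""")

# One IN subquery per filter: each filter may be satisfied by a different row.
INDEPENDENT = """
    SELECT id FROM alerts
    WHERE id IN (SELECT alert_decisions FROM decisions WHERE origin = ?)
      AND id IN (SELECT alert_decisions FROM decisions WHERE type = ?)
    ORDER BY id
"""
# A single subquery would require one row to satisfy both filters at once.
COMBINED = """
    SELECT id FROM alerts
    WHERE id IN (
        SELECT alert_decisions FROM decisions WHERE origin = ? AND type = ?
    )
    ORDER BY id
"""

params = ("crowdsec", "ban")
independent_ids = [row[0] for row in conn.execute(INDEPENDENT, params)]
combined_ids = [row[0] for row in conn.execute(COMBINED, params)]

print(independent_ids)  # [1, 2]
print(combined_ids)     # [2]
```

Alert 1 only appears with the independent form, which is the behavior the separate HasDecisionsWith(...) predicates already had; merging the filters into one subquery would silently narrow the results.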

Implementation Notes

  • Scoped to positive decision filters only.
  • include_capi=false is unchanged because it uses negative NOT EXISTS predicates. A naive NOT IN rewrite is not null-safe when decisions.alert_decisions contains NULL; I verified this with unlinked CAPI/lists decisions.
  • scenario and ip/range are unchanged. They have similar correlated shapes, but fixture testing did not show the same benefit; the IN form for IP/range made SQLite scan larger parts of decisions.
  • The helper lives in a handwritten alert package file, so generated Ent files stay untouched and call sites remain alert.HasDecisionsMatching(...).
  • The composite indexes proposed in #4468 (Add composite indexes for alert decision filters) do not appear necessary for this fix.
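
The NOT IN caveat in the notes above can be reproduced with a few rows. This is a small sketch under an assumed, simplified schema (the fixture rows are made up; only the `alert_decisions` and `origin` column names come from the PR). It shows why a naive NOT IN rewrite is not equivalent to NOT EXISTS once an unlinked decision has a NULL `alert_decisions`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE alerts (id INTEGER PRIMARY KEY);
    CREATE TABLE decisions (
        id INTEGER PRIMARY KEY,
        alert_decisions INTEGER,
        origin TEXT
    );
    INSERT INTO alerts (id) VALUES (1), (2);
    INSERT INTO decisions (alert_decisions, origin) VALUES
        (1,    'CAPI'),
        (2,    'crowdsec'),
        (NULL, 'CAPI');  -- unlinked decision, no parent alert
""")

NOT_EXISTS = """
    SELECT id FROM alerts
    WHERE NOT EXISTS (
        SELECT 1 FROM decisions
        WHERE decisions.alert_decisions = alerts.id AND decisions.origin = ?
    )
    ORDER BY id
"""
# Naive rewrite: the subquery yields (1, NULL), so "2 NOT IN (1, NULL)" is NULL, not true.
NOT_IN_NAIVE = """
    SELECT id FROM alerts
    WHERE id NOT IN (SELECT alert_decisions FROM decisions WHERE origin = ?)
    ORDER BY id
"""
# Filtering NULLs out of the subquery restores the NOT EXISTS result.
NOT_IN_GUARDED = """
    SELECT id FROM alerts
    WHERE id NOT IN (
        SELECT alert_decisions FROM decisions
        WHERE origin = ? AND alert_decisions IS NOT NULL
    )
    ORDER BY id
"""

def ids(sql):
    return [row[0] for row in conn.execute(sql, ("CAPI",))]

not_exists_ids = ids(NOT_EXISTS)
not_in_naive_ids = ids(NOT_IN_NAIVE)
not_in_guarded_ids = ids(NOT_IN_GUARDED)

print(not_exists_ids)      # [2]
print(not_in_naive_ids)    # []
print(not_in_guarded_ids)  # [2]
```

The guarded variant is shown only to make the NULL behavior visible; the PR itself leaves the negative filters on NOT EXISTS.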

Testing

Validated on the reproduced SQLite DB from #4464, an expanded #4470-style SQLite fixture, a fresh production SQLite copy with WAL disabled, and the live production Postgres backend.

Result sets:

SQLite/API behavior:

  • On the reproduced DB, the Homepage-style query decision_type=ban&origin=crowdsec&has_active_decision=1 moved from a 50s+ path to roughly tens of milliseconds in the local reproducer.
  • On the expanded #4470 fixture (GET /v1/alerts?origin=X causes systematic timeout (25s+) while same query without origin= responds in 13ms), current code timed out at 35s for origin=cscli, origin=lists, and a large synthetic origin; the new shape returned those cases in ~160-410ms.
  • On a fresh production SQLite copy with WAL disabled (PRAGMA journal_mode=delete), current code still timed out on affected alert filters; the new shape returned quickly.

Expanded fixture comparison:

current code / WAL:
origin_cscli     timeout at 35s
origin_lists     timeout at 35s
origin_fixture   timeout at 35s

new IN-subquery shape / WAL:
origin_cscli     161ms
origin_lists     407ms
origin_fixture   413ms

Fresh DB / DELETE journal comparison:

current code / fresh DB / DELETE journal:
homepage_bans     timeout at 35s
origin_crowdsec   timeout at 35s
origin_CAPI       1068ms

new IN-subquery shape / same DB / DELETE journal:
homepage_bans     61ms
origin_crowdsec   29ms
origin_CAPI       568ms

I also reran the result-set comparison on that fresh DB; the current and new query shapes returned the same alert IDs for the tested cases.
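
A result-set comparison of this kind could look roughly like the sketch below. It is not the script used for the PR: the schema is simplified, the fixture rows are invented, and `find_mismatches` is a hypothetical helper name. Against a real database copy, the in-memory connection would be replaced with a connection to that file.

```python
import sqlite3

EXISTS_SQL = """
    SELECT id FROM alerts
    WHERE EXISTS (
        SELECT 1 FROM decisions
        WHERE decisions.alert_decisions = alerts.id AND decisions.origin = ?
    )
"""
IN_SQL = """
    SELECT id FROM alerts
    WHERE id IN (SELECT alert_decisions FROM decisions WHERE origin = ?)
"""

def alert_ids(conn, sql, origin):
    """Return the sorted alert IDs a query shape yields for one origin."""
    return sorted(row[0] for row in conn.execute(sql, (origin,)))

def find_mismatches(conn, origins):
    """Map each origin whose two result sets differ to (exists_ids, in_ids)."""
    mismatches = {}
    for origin in origins:
        old = alert_ids(conn, EXISTS_SQL, origin)
        new = alert_ids(conn, IN_SQL, origin)
        if old != new:
            mismatches[origin] = (old, new)
    return mismatches

# Illustrative fixture, including unlinked (NULL) decisions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE alerts (id INTEGER PRIMARY KEY);
    CREATE TABLE decisions (
        id INTEGER PRIMARY KEY,
        alert_decisions INTEGER,
        origin TEXT
    );
    INSERT INTO alerts (id) VALUES (1), (2), (3), (4);
    INSERT INTO decisions (alert_decisions, origin) VALUES
        (1, 'cscli'), (1, 'cscli'), (2, 'lists'), (3, 'CAPI'),
        (NULL, 'CAPI'), (NULL, 'lists'), (4, 'crowdsec');
""")

mismatches = find_mismatches(conn, ["cscli", "lists", "CAPI", "crowdsec", "absent"])
print(mismatches or "all result sets match")
```

Unlike the negative case, NULL link values do not affect the positive IN form here, since a NULL in the subquery can never make `id IN (...)` true.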

Postgres check:

  • Live production Postgres does not reproduce the SQLite timeout behavior.
  • Postgres planned both shapes as similar semi-join queries.
  • Result sets matched for tested homepage_bans, origin=lists, origin=CAPI, and origin=cscli cases.

Postgres EXPLAIN ANALYZE:

homepage_bans   EXISTS 43.181ms   IN 37.270ms
origin=lists    EXISTS 31.990ms   IN 31.730ms
origin=CAPI     EXISTS 24.060ms   IN 23.021ms

Go checks:

go test ./pkg/database
go test ./pkg/apiserver -run '^$'

Both pass locally.

Refs #4464
Refs #4470
Supersedes the index-only approach explored in #4468

@github-actions

@jimstrang: There is no 'kind' label on this PR. You need a 'kind' label to generate the release automatically.

  • /kind feature
  • /kind enhancement
  • /kind refactoring
  • /kind fix
  • /kind chore
  • /kind dependencies

I am a bot created to help the crowdsecurity developers manage community feedback and contributions. You can check out my manifest file to understand my behavior and what I can do. If you want to use this for your project, you can check out the BirthdayResearch/oss-governance-bot repository.

@github-actions

@jimstrang: There are no area labels on this PR. You can add as many areas as you see fit.

  • /area agent
  • /area local-api
  • /area cscli
  • /area appsec
  • /area security
  • /area configuration

@jimstrang
Author

/kind fix
/area local-api

@buixor
Contributor

buixor commented May 18, 2026

hey, thanks for the PR. We're going to investigate a bit to understand where the regression might come from tho.

@codecov

codecov Bot commented May 18, 2026

Codecov Report

❌ Patch coverage is 88.88889% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 63.88%. Comparing base (b4b2de2) to head (67c952a).

Files with missing lines      Patch %   Lines
pkg/database/alertfilter.go   66.66%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4471      +/-   ##
==========================================
- Coverage   63.89%   63.88%   -0.02%     
==========================================
  Files         478      479       +1     
  Lines       34298    34304       +6     
==========================================
  Hits        21915    21915              
- Misses      10227    10232       +5     
- Partials     2156     2157       +1     
Flag           Coverage Δ
bats           46.46% <77.77%> (+<0.01%) ⬆️
unit-linux     37.32% <88.88%> (+<0.01%) ⬆️
unit-windows   25.99% <0.00%> (-0.02%) ⬇️

@jimstrang
Author

I think it's due to how SQLite specifically handles the chunked streaming changes that were introduced in #4413, but candidly I haven't yet rolled that one back to A/B test. I can test that out later.

@jimstrang
Author

jimstrang commented May 18, 2026

I think I've narrowed the regression down: it is not the chunked streaming changes but the deps update in #4412, specifically the go-sqlite3 driver bump from 1.14.24 / SQLite 3.46.1 to 1.14.41 / SQLite 3.51.3.

With the newer SQLite, the existing correlated EXISTS alert filter query becomes highly predicate-order sensitive: origin -> until -> type stays fast, while type -> origin -> until times out on the same DB/query.

This PR avoids the EXISTS predicate ordering problem by switching the shape to explicitly use IN instead.

https://sqlite.org/changes.html lists several changes to EXISTS planning and optimization in the past few releases.
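
To compare planner behavior across SQLite versions, the plans for the two predicate orders can be dumped side by side. The sketch below is only a starting point under loud assumptions: the schema is simplified, the `type` and `until` column names are guesses based on the filter names above, no indexes are created, and the tables are empty, so the printed plans will not match a real database. Meaningful output requires running the same queries against a copy of the affected DB with each SQLite build.

```python
import sqlite3

# Version of the SQLite library linked into this Python build.
print("SQLite version:", sqlite3.sqlite_version)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE alerts (id INTEGER PRIMARY KEY);
    CREATE TABLE decisions (
        id INTEGER PRIMARY KEY,
        alert_decisions INTEGER,
        origin TEXT,
        type TEXT,   -- assumed column name
        until TEXT   -- assumed column name
    );
""")

def exists_clause(column, op="="):
    """Build one correlated EXISTS predicate on a fixed decisions column."""
    return (
        "EXISTS (SELECT 1 FROM decisions "
        "WHERE decisions.alert_decisions = alerts.id "
        f"AND decisions.{column} {op} ?)"
    )

def query_plan(conn, clauses, params):
    """Return the 'detail' column of EXPLAIN QUERY PLAN for the given clauses."""
    sql = "EXPLAIN QUERY PLAN SELECT id FROM alerts WHERE " + " AND ".join(clauses)
    return [row[3] for row in conn.execute(sql, params)]

NOW = "2026-05-18 00:00:00"  # placeholder timestamp for the 'until' filter

plan_origin_first = query_plan(
    conn,
    [exists_clause("origin"), exists_clause("until", ">"), exists_clause("type")],
    ("crowdsec", NOW, "ban"),
)
plan_type_first = query_plan(
    conn,
    [exists_clause("type"), exists_clause("origin"), exists_clause("until", ">")],
    ("ban", "crowdsec", NOW),
)

for label, plan in (("origin -> until -> type", plan_origin_first),
                    ("type -> origin -> until", plan_type_first)):
    print(label)
    for step in plan:
        print("   ", step)
```

The plan text format differs between SQLite releases, so the output is better compared by eye than parsed.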
