[WIP] - Upstream 16530 - Optimize HostList API: conditional DISTINCT + composite index on JobHostSummary #560
Open
cigamit wants to merge 1 commit into
Pull request overview
Optimizes the HostList API query plan by reducing unnecessary DISTINCT usage and improving the correlated “latest JobHostSummary per host” lookup via a composite index on JobHostSummary.
Changes:
- Makes `HostList.get_queryset()` apply `.distinct()` conditionally instead of unconditionally.
- Adds a composite DB index on `JobHostSummary (host_id, id DESC)` to speed up the `with_latest_summary_id()` correlated subquery.
- Adds the corresponding Django migration for the new index.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `awx/api/views/__init__.py` | Makes distinct() conditional in HostList.get_queryset() before annotating latest summary id. |
| awx/main/models/jobs.py | Adds a composite (host, -id) index on JobHostSummary to accelerate “latest summary per host” queries. |
| awx/main/migrations/0201_jobhostsummary_main_jobhostsumm_host_id_desc.py | Introduces the migration that adds the new composite index. |
Comment on lines 1662 to +1666

      filter_string = self.request.query_params.get('host_filter', None)
      if filter_string:
          filter_qs = SmartFilter.query_from_string(filter_string)
          qs &= filter_qs
    - return qs.distinct().with_latest_summary_id()
    +     qs = qs.distinct()
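The change above only applies `.distinct()` when `host_filter` is set. The reasoning is that a filter through a many-to-many join can repeat a host row, while an `IN` filter on a direct foreign key cannot. A minimal sketch of that difference follows, using Python's built-in `sqlite3` rather than Django or PostgreSQL. The tables (`inventory`, `host`, `grp`, `group_hosts`), the `visible` column, and the `LIKE 'prod-%'` filter are hypothetical stand-ins for illustration, not the AWX schema or the SQL that AWX generates.

```python
import sqlite3

# Hypothetical, simplified schema: NOT the real AWX tables.
SCHEMA_AND_DATA = """
    CREATE TABLE inventory (id INTEGER PRIMARY KEY, visible INTEGER);
    CREATE TABLE host (id INTEGER PRIMARY KEY, name TEXT, inventory_id INTEGER);
    CREATE TABLE grp (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE group_hosts (group_id INTEGER, host_id INTEGER);

    INSERT INTO inventory VALUES (1, 1), (2, 0);
    INSERT INTO host VALUES (1, 'web1', 1), (2, 'db1', 1), (3, 'other', 2);
    INSERT INTO grp VALUES (1, 'prod-east'), (2, 'prod-west');
    -- web1 belongs to both groups, db1 to one group only.
    INSERT INTO group_hosts VALUES (1, 1), (2, 1), (1, 2);
"""

# Stand-in for the RBAC filter: IN subquery on a direct foreign key.
FK_IN_SQL = """
    SELECT h.id FROM host h
    WHERE h.inventory_id IN (SELECT id FROM inventory WHERE visible = 1)
    ORDER BY h.id
"""

# Stand-in for a host_filter that traverses a many-to-many relationship.
M2M_JOIN_SQL = """
    SELECT h.id FROM host h
    JOIN group_hosts gh ON gh.host_id = h.id
    JOIN grp g ON g.id = gh.group_id
    WHERE g.name LIKE 'prod-%'
    ORDER BY h.id
"""


def build_demo_db():
    """Create an in-memory database with the hypothetical schema above."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA_AND_DATA)
    return conn


def host_ids(conn, sql):
    """Run a query and return the first column of every row as a list."""
    return [row[0] for row in conn.execute(sql)]


if __name__ == "__main__":
    conn = build_demo_db()
    print("FK IN filter:      ", host_ids(conn, FK_IN_SQL))
    print("M2M join filter:   ", host_ids(conn, M2M_JOIN_SQL))
    distinct_sql = M2M_JOIN_SQL.replace("SELECT h.id", "SELECT DISTINCT h.id", 1)
    print("M2M join + DISTINCT:", host_ids(conn, distinct_sql))
```

In this sketch the foreign-key filter returns each host at most once, because every host row has exactly one `inventory_id`. Only the join through `group_hosts` repeats a host, and that is the case where deduplication is needed.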
Comment on lines +12 to +16

    operations = [
        migrations.AddIndex(
            model_name='jobhostsummary',
            index=models.Index(fields=['host', '-id'], name='main_jobhostsumm_host_id_desc'),
        ),
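In Django's `models.Index`, a `-` prefix on a field name requests descending order for that column, so `fields=['host', '-id']` describes an index on `host_id` ascending and `id` descending. The sketch below approximates the resulting DDL by hand and inspects the column order and direction. It uses Python's built-in `sqlite3` with a two-column stand-in table, so the `CREATE INDEX` statement is an assumption about the general shape, not the exact SQL Django emits for PostgreSQL.

```python
import sqlite3

# Index name taken from the migration; the table is a simplified stand-in.
INDEX_NAME = "main_jobhostsumm_host_id_desc"


def create_demo_index(conn):
    """Create a minimal table plus a (host_id, id DESC) composite index."""
    conn.execute(
        "CREATE TABLE main_jobhostsummary (id INTEGER PRIMARY KEY, host_id INTEGER)"
    )
    # Hand-written approximation of the DDL, for illustration only.
    conn.execute(
        f'CREATE INDEX "{INDEX_NAME}" ON main_jobhostsummary (host_id, id DESC)'
    )


def index_columns(conn, index_name):
    """Return (column_name, is_descending) for each key column of the index."""
    rows = conn.execute(f"PRAGMA index_xinfo('{index_name}')").fetchall()
    # index_xinfo rows are (seqno, cid, name, desc, coll, key);
    # key == 1 marks declared key columns, key == 0 marks auxiliary ones.
    return [(name, bool(desc)) for _seq, _cid, name, desc, _coll, key in rows if key]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_demo_index(conn)
    for name, descending in index_columns(conn, INDEX_NAME):
        print(name, "DESC" if descending else "ASC")
```

A similar catalog check against the real database (for example via `pg_indexes` in PostgreSQL) would be one way to confirm that the migration created the index with the intended column order.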
Upstream Summary
- Make `.distinct()` conditional on the `host_filter` query parameter being set. Without it, the RBAC `IN` subquery on a direct FK (`inventory_id`) cannot produce duplicates, so `DISTINCT` is pure overhead that forces PostgreSQL to sort or hash-deduplicate the entire result set.
- Add a composite index `(host_id, id DESC)` on `main_jobhostsummary` so the `with_latest_summary_id()` correlated subquery can use an index-only top-1 scan instead of scanning and sorting per row.
The `host_list_rbac` query pattern is the #2 DB time consumer in Scale Lab testing at 2,871 seconds total, and it is worsening (+15%). Unlike AAP-81082, the RBAC subquery itself is clean (a simple `IN` on a direct FK); the cost is dominated by the unconditional `DISTINCT` and by the correlated subquery lacking a composite index.
Why `.distinct()` is safe to make conditional
`SmartFilter.query_from_string()` can filter through M2M relationships (e.g., `groups__name=...`), which produce duplicate rows via JOINs, so `.distinct()` is only needed when `host_filter` is set. Without `host_filter`, the queryset flows through `HostAccess.filtered_queryset()`, which filters on `inventory_id IN (RBAC subquery)`, a direct FK that cannot produce duplicates.
Why the composite index helps
`with_latest_summary_id()` adds a correlated subquery that looks up the latest JobHostSummary id for each host. This runs for every result row. The existing auto-index on `host_id` alone requires scanning all matching rows and then sorting them. The composite `(host_id, id DESC)` index enables a single index scan that fetches the top-1 result directly.
EXPLAIN analysis (Scale Lab, Jul 2)
Ran on the aap26-next read replica (37,824 hosts, 16.7M job host summaries, 320K role evaluations).
The correlated subquery is the smoking gun: it scans the entire 367MB PK index backward with a filter instead of using a targeted index, at a planner cost of 624,118 per host row. With ~23K summaries per host on average (max 92K), each probe is extremely expensive.
The correlated subquery adds +4,167 cost (56% of total). DISTINCT adds +8 in planner estimate but forces Unique + Sort on 138-byte rows.
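The effect described in this section can be reproduced in miniature. The sketch below uses Python's built-in `sqlite3`, not PostgreSQL, so the plan text and cost model differ from the Scale Lab output. The table `job_host_summary`, the index name `jhs_host_id_desc`, and the query are hypothetical stand-ins for the per-host "latest summary" probe. The demo table also has no index on `host_id` at all before the composite index is added, which is an assumption made to mimic the backward primary-key scan with a filter.

```python
import sqlite3

# Hypothetical stand-in for the per-host "latest summary id" probe.
LATEST_SUMMARY_SQL = """
    SELECT id FROM job_host_summary
    WHERE host_id = ?
    ORDER BY id DESC
    LIMIT 1
"""


def build_demo_db():
    """Create a tiny summary table; ids are assigned 1..5 in insert order."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE job_host_summary (id INTEGER PRIMARY KEY, host_id INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO job_host_summary (host_id) VALUES (?)",
        [(1,), (2,), (1,), (2,), (1,)],
    )
    return conn


def query_plan(conn, sql, params):
    """Return SQLite's EXPLAIN QUERY PLAN detail text as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(str(row[-1]) for row in rows)


def latest_summary_id(conn, host_id):
    """Run the probe and return the newest summary id for one host."""
    return conn.execute(LATEST_SUMMARY_SQL, (host_id,)).fetchone()[0]


def add_composite_index(conn):
    """Add the (host_id, id DESC) composite index under a hypothetical name."""
    conn.execute("CREATE INDEX jhs_host_id_desc ON job_host_summary (host_id, id DESC)")


if __name__ == "__main__":
    conn = build_demo_db()
    print("before:", query_plan(conn, LATEST_SUMMARY_SQL, (1,)))
    add_composite_index(conn)
    print("after: ", query_plan(conn, LATEST_SUMMARY_SQL, (1,)))
    print("latest summary id for host 1:", latest_summary_id(conn, 1))
```

Without the composite index, SQLite walks the table in primary-key order and filters on `host_id`. With it, the lookup seeks directly to the host's entries in the index, whose first entry is the newest id. The query result is the same in both cases; only the access path changes.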
Test plan
- `(host_id, id DESC)` index exists
- `host_list_rbac` mean query time improvement on Scale Lab
Classification
New or Enhanced Feature
Fixes: https://issues.redhat.com/browse/AAP-81517