[WIP] - Upstream 16530 - Optimize HostList API: conditional DISTINCT + composite index on JobHostSummary #560

Open
cigamit wants to merge 1 commit into main from upstream16530

cigamit commented Jul 2, 2026

Contributor

Upstream Summary

  • Make .distinct() conditional on the host_filter query parameter being set. Without host_filter, the RBAC IN subquery on a direct FK (inventory_id) cannot produce duplicates, so DISTINCT is pure overhead that forces PostgreSQL to sort or hash-dedup the entire result set
  • Add composite index (host_id, id DESC) on main_jobhostsummary so the with_latest_summary_id() correlated subquery can use an index-only top-1 scan instead of scanning and sorting per row

The host_list_rbac query pattern is the #2 DB time consumer in Scale Lab testing at 2,871 seconds total, and is worsening (+15%). Unlike AAP-81082, the RBAC subquery itself is clean (simple IN on a direct FK) — the cost is dominated by the unconditional DISTINCT and the correlated subquery lacking a composite index.

Why .distinct() is safe to make conditional

SmartFilter.query_from_string() can filter through M2M relationships (e.g., groups__name=...), which produce duplicate rows via JOINs. .distinct() is only needed when host_filter is set. Without host_filter, the queryset flows through HostAccess.filtered_queryset() which filters on inventory_id IN (RBAC subquery) — a direct FK that cannot produce duplicates.
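
The difference between the two filter shapes can be reproduced in miniature. The following is a minimal sketch using an in-memory SQLite database; the `host`, `grp`, and `grp_hosts` tables are hypothetical stand-ins, not the real AWX schema, and the literal subquery stands in for the RBAC subquery.

```python
import sqlite3

# Hypothetical, simplified schema: these names are illustrative stand-ins,
# not the real AWX tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE host (id INTEGER PRIMARY KEY, name TEXT, inventory_id INTEGER);
    CREATE TABLE grp (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE grp_hosts (group_id INTEGER, host_id INTEGER);
    INSERT INTO host VALUES (1, 'web1', 10), (2, 'db1', 10), (3, 'other', 20);
    INSERT INTO grp VALUES (1, 'prod-web'), (2, 'prod-all');
    -- web1 belongs to both groups, db1 to one.
    INSERT INTO grp_hosts VALUES (1, 1), (2, 1), (2, 2);
""")

# Direct FK filter: each host row is tested once against the IN list, so it
# appears at most once even when the subquery returns the same value twice.
fk_rows = conn.execute("""
    SELECT h.id FROM host h
    WHERE h.inventory_id IN (SELECT 10 UNION ALL SELECT 10)
    ORDER BY h.id
""").fetchall()

# M2M filter: the JOIN emits one row per matching (host, group) pair, so a
# host that matches through two groups is returned twice.
M2M_FILTER = """
    FROM host h
    JOIN grp_hosts gh ON gh.host_id = h.id
    JOIN grp g ON g.id = gh.group_id
    WHERE g.name LIKE 'prod-%'
    ORDER BY h.id
"""
m2m_rows = conn.execute("SELECT h.id " + M2M_FILTER).fetchall()
m2m_distinct = conn.execute("SELECT DISTINCT h.id " + M2M_FILTER).fetchall()

print(fk_rows)       # [(1,), (2,)]
print(m2m_rows)      # [(1,), (1,), (2,)]
print(m2m_distinct)  # [(1,), (2,)]
```

Only the JOIN-based query needs DISTINCT to return each host once, which mirrors the host_filter case.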

Why the composite index helps

with_latest_summary_id() adds a correlated subquery:

SELECT id FROM main_jobhostsummary WHERE host_id = <outer.id> ORDER BY id DESC LIMIT 1

This runs for every result row. The existing auto-index on host_id alone requires scanning all matching rows then sorting. The composite (host_id, id DESC) index stores each host's summaries in descending id order, so a single index scan can read the first entry for the host and fetch the top-1 result directly.
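
As a rough illustration of the top-1 lookup, the sketch below builds the same index shape in SQLite and inspects the plan. SQLite is only a stand-in here: its planner and plan output differ from PostgreSQL's, and the `jhs` table and `jhs_host_id_desc` index names are hypothetical. The real verification is EXPLAIN on the target database, as listed in the test plan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical miniature of main_jobhostsummary (assumed columns only).
    CREATE TABLE jhs (id INTEGER PRIMARY KEY, host_id INTEGER NOT NULL);
    CREATE INDEX jhs_host_id_desc ON jhs (host_id, id DESC);
""")

# 15 summaries spread over 3 hosts, with ids interleaved between hosts.
conn.executemany(
    "INSERT INTO jhs (id, host_id) VALUES (?, ?)",
    [(i, i % 3 + 1) for i in range(1, 16)],
)

# Same shape as the correlated subquery, with the outer host id as a parameter.
LATEST = "SELECT id FROM jhs WHERE host_id = ? ORDER BY id DESC LIMIT 1"

# The last column of each EXPLAIN QUERY PLAN row is the human-readable detail.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + LATEST, (1,))]
latest = {h: conn.execute(LATEST, (h,)).fetchone()[0] for h in (1, 2, 3)}

for detail in plan:
    print(detail)
print(latest)
```

If the plan detail names the composite index, the lookup is served by an index search on host_id rather than by a scan plus sort.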

EXPLAIN analysis (Scale Lab, Jul 2)

Ran on the aap26-next read replica (37,824 hosts, 16.7M job host summaries, 320K role evaluations).

The correlated subquery is the smoking gun — it scans the entire 367MB PK index backward with a filter instead of using a targeted index:

Index Scan Backward using main_jobhostsummary_pkey (cost=0.43..624,118)
      Filter: (host_id = $0)

Cost 624,118 per host row. With ~23K summaries per host on average (max 92K), each probe is extremely expensive.

| Query variant | Estimated cost | Delta |
|---|---|---|
| RBAC subquery alone | 3,113 | |
| Minimal (no DISTINCT, no subquery, no JOINs) | 3,131 | baseline |
| Full query WITHOUT DISTINCT | 7,376 | +4,245 |
| Full query WITH DISTINCT | 7,384 | +4,253 |

The correlated subquery adds +4,167 cost (56% of total). DISTINCT adds +8 in planner estimate but forces Unique + Sort on 138-byte rows.
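
The deltas in the table can be rechecked with a few lines of arithmetic. This sketch uses only the figures quoted above; treating the WITH DISTINCT cost as the "total" for the percentage is an assumption, since the description does not say which total was used.

```python
# Planner cost estimates copied from the table above.
costs = {
    "minimal": 3131,
    "full_without_distinct": 7376,
    "full_with_distinct": 7384,
}

baseline = costs["minimal"]
delta_without = costs["full_without_distinct"] - baseline
delta_with = costs["full_with_distinct"] - baseline
distinct_overhead = costs["full_with_distinct"] - costs["full_without_distinct"]

# 4,167 is the subquery cost quoted in the text; the denominator is assumed.
subquery_share = round(4167 / costs["full_with_distinct"] * 100)

print(delta_without, delta_with, distinct_overhead, subquery_share)
# 4245 4253 8 56
```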

Test plan

  • All existing tests pass (210 host-related tests verified locally)
  • EXPLAIN on Scale Lab confirms correlated subquery uses full PK scan (cost 624,118/row)
  • EXPLAIN confirms no composite (host_id, id DESC) index exists
  • Verify composite index is picked up by the correlated subquery after deployment
  • Measure host_list_rbac mean query time improvement on Scale Lab

Classification

New or Enhanced Feature

Fixes: https://issues.redhat.com/browse/AAP-81517

cigamit self-assigned this Jul 2, 2026
Copilot AI review requested due to automatic review settings July 2, 2026 22:03
cigamit added the enhancement label Jul 2, 2026
cigamit changed the title from "Upstream 16530 - Optimize HostList API: conditional DISTINCT + composite index on JobHostSummary" to "[WIP] - Upstream 16530 - Optimize HostList API: conditional DISTINCT + composite index on JobHostSummary" Jul 2, 2026

Copilot AI left a comment

Contributor


Pull request overview

Optimizes the HostList API query plan by reducing unnecessary DISTINCT usage and improving the correlated “latest JobHostSummary per host” lookup via a composite index on JobHostSummary.

Changes:

  • Makes HostList.get_queryset() apply .distinct() conditionally instead of unconditionally.
  • Adds a composite DB index on JobHostSummary (host_id, id DESC) to speed up the with_latest_summary_id() correlated subquery.
  • Adds the corresponding Django migration for the new index.
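
The control flow of the first change can be sketched without Django. `RecordingQuerySet`, `filter_with`, and `build_host_list_queryset` are hypothetical stand-ins invented for this illustration; the actual view combines querysets with `qs &= SmartFilter.query_from_string(...)`, which is not reproduced here.

```python
class RecordingQuerySet:
    """Hypothetical stand-in for a Django queryset that records chained calls."""

    def __init__(self, calls=()):
        self.calls = tuple(calls)

    def _chain(self, name):
        # Return a new object, as real querysets do, instead of mutating self.
        return RecordingQuerySet(self.calls + (name,))

    def filter_with(self, filter_string):
        # Stands in for intersecting with the SmartFilter queryset.
        return self._chain(f"host_filter({filter_string})")

    def distinct(self):
        return self._chain("distinct")

    def with_latest_summary_id(self):
        return self._chain("with_latest_summary_id")


def build_host_list_queryset(qs, query_params):
    """Apply distinct() only when a host_filter is present (assumed shape)."""
    filter_string = query_params.get("host_filter")
    if filter_string:
        qs = qs.filter_with(filter_string)
        qs = qs.distinct()  # JOINs through M2M relations may duplicate hosts
    return qs.with_latest_summary_id()


plain = build_host_list_queryset(RecordingQuerySet(), {})
filtered = build_host_list_queryset(
    RecordingQuerySet(), {"host_filter": "groups__name=web"}
)
print(plain.calls)     # ('with_latest_summary_id',)
print(filtered.calls)  # ('host_filter(groups__name=web)', 'distinct', 'with_latest_summary_id')
```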

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| `awx/api/views/__init__.py` | Makes distinct() conditional in HostList.get_queryset() before annotating latest summary id. |
| `awx/main/models/jobs.py` | Adds a composite (host, -id) index on JobHostSummary to accelerate “latest summary per host” queries. |
| `awx/main/migrations/0201_jobhostsummary_main_jobhostsumm_host_id_desc.py` | Introduces the migration that adds the new composite index. |

Comment thread awx/api/views/__init__.py
Comment on lines 1662 to +1666
          filter_string = self.request.query_params.get('host_filter', None)
          if filter_string:
              filter_qs = SmartFilter.query_from_string(filter_string)
              qs &= filter_qs
-         return qs.distinct().with_latest_summary_id()
+             qs = qs.distinct()
Comment on lines +12 to +16
    operations = [
        migrations.AddIndex(
            model_name='jobhostsummary',
            index=models.Index(fields=['host', '-id'], name='main_jobhostsumm_host_id_desc'),
        ),
Labels

enhancement New feature or request

2 participants