
docs: propose SQS-compatible adapter design #606

Merged
bootjp merged 31 commits into main from feat/sqs_compatible_adapter
Apr 24, 2026
Conversation

@bootjp (Owner) commented Apr 23, 2026

Add a Milestone-1 design doc for an SQS-compatible HTTP adapter under docs/design/, following the existing adapter pattern used for Redis, DynamoDB, and S3. Covers data model (queue catalog, visibility-indexed message keys, FIFO dedup/group lock), routing (one Raft group per queue), leader-bound mutations with lease-read fast paths for receive scans and read-only APIs, long-poll notifier on the FSM commit stream, and a Jepsen plan for standard + FIFO semantics.

Also add docs/design/README.md documenting the YYYY_MM_DD_<status>_<name> filename convention established by the docs/reorg-design-doc-layout refactor, so future design proposals can follow it without reverse-engineering the rule.

Summary by CodeRabbit

  • Documentation

    • Added a comprehensive SQS-compatible adapter design and a standardized design-doc process.
  • New Features

    • Introduced an SQS-compatible HTTP server with JSON API, health endpoints, SigV4 auth, long-polling, visibility/receipt semantics, FIFO support, and an operator console for queue inspection.
  • Refactor

    • Centralized SigV4 handling and standardized leader-proxy behavior.
  • Tests

    • Added extensive unit and integration tests covering auth, catalog, message lifecycle, FIFO rules, long-polling, and edge/error cases.

@gemini-code-assist (Contributor)

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@coderabbitai (Bot) commented Apr 23, 2026

📝 Walkthrough

Adds a new SQS-compatible adapter and supporting infrastructure: design docs, shared SigV4 utilities, an HTTP SQSServer with SigV4 auth, queue catalog and message keyspaces, long-poll/visibility/receipt semantics, leader-proxy helper, main/runtime wiring, routing changes, extensive tests, and test infra updates.

Changes

| Cohort | File(s) | Summary |
| --- | --- | --- |
| Design docs | docs/design/2026_04_24_proposed_sqs_compatible_adapter.md, docs/design/README.md | Adds a comprehensive SQS-compatible adapter design document and a design-docs README specifying filename/metadata/status conventions. |
| SQSServer core | adapter/sqs.go, adapter/sqs_auth.go, adapter/sqs_catalog.go, adapter/sqs_keys.go, adapter/sqs_messages.go | New SQSServer implementation: HTTP routing (X-Amz-Target JSON-1.0), SQS SigV4 auth, queue catalog persistence, key encoding, message storage and visibility index, receipt-handle format, OCC transactions, long-polling and visibility rotation logic. |
| SQS tests & infra | adapter/sqs_auth_test.go, adapter/sqs_catalog_test.go, adapter/sqs_messages_test.go, adapter/sqs_test.go, adapter/test_util.go | Adds extensive unit/integration tests for auth, catalog, message lifecycle, long-polling, retention, and the receipt-handle codec; extends test infra to allocate SQS ports and start a per-node SQSServer. |
| SigV4 utilities & creds | adapter/sigv4.go, main_sigv4_creds.go, adapter/s3_auth.go, adapter/s3_test.go, main_s3.go | Introduces shared SigV4 parsing/building and a shared static-credentials loader; refactors S3 auth/tests to reuse the SigV4 helpers and centralizes credential loading. |
| Leader proxy & Dynamo refactor | adapter/leader_http_proxy.go, adapter/dynamodb.go | Adds a proxyHTTPRequestToLeader helper and an httpLeaderErrorWriter type; refactors DynamoDB follower-to-leader proxying to delegate to the new helper and centralizes proxy error responses. |
| Routing / main / config | kv/shard_key.go, shard_config.go, main.go, main_sqs.go, main_bootstrap_e2e_test.go | Adds SQS route-key normalization, raft-to-SQS leader-map parsing, a new startSQSServer startup path, CLI/runtime flags for SQS region/credentials/leader mapping, and an updated bootstrap call site. |
| Keyspace constants | adapter/sqs_keys.go, adapter/sqs_messages.go | Introduces exported keyspace prefix constants for the queue metadata and message/visibility prefixes used by the adapter. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant SQSServer
    participant Coordinator
    participant RaftLeader
    participant KVStore

    Client->>SQSServer: POST (X-Amz-Target / SigV4)
    SQSServer->>SQSServer: authorizeSQSRequest (SigV4)
    alt Node is not raft leader
        SQSServer->>RaftLeader: proxyHTTPRequestToLeader(...)
        RaftLeader-->>SQSServer: proxied response
        SQSServer-->>Client: return response
    else Node is raft leader
        SQSServer->>Coordinator: locate shard / route key
        Coordinator->>RaftLeader: ensure leader / LeaseRead as needed
        SQSServer->>KVStore: OCC transaction (write message/meta or visibility swap)
        KVStore-->>SQSServer: commit/result
        SQSServer-->>Client: return SQS JSON response
    end
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Poem

🐰 I tunneled through bytes and left a trail of queues,
Messages hop, wait, and wake in orderly queues.
SigV4 keeps hats snug and receipt tokens true,
Long polls nap till a new message says "peek-a-boo!"
Hooray — a carrot-socketed SQS debut!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 28.38%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title accurately summarizes the primary change: proposing a design document for an SQS-compatible adapter, the main deliverable of this PR. |
| Linked Issues Check | ✅ Passed | Check skipped: no linked issues were found for this pull request. |
| Out of Scope Changes Check | ✅ Passed | Check skipped: no linked issues were found for this pull request. |



@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 330e0184a5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


### 4.1 Leader proxy reuse

The DynamoDB adapter already has the right proxy shape (`adapter/dynamodb.go:proxyToLeader`): small request bodies, JSON envelope, re-emit response. SQS message payloads cap at 256 KiB, so the existing proxy pattern is sufficient and the streaming HTTP proxy that S3 needed (`docs/s3_compatible_adapter_design.md §4.1`) is **not** required here. The SQS adapter should fork the DynamoDB proxy helper rather than reuse the S3 streaming proxy.


P2: Point S3 design references at the existing design doc

This proposal cites docs/s3_compatible_adapter_design.md, but that path does not exist in this repository after the design-doc reorganization (docs/design/2026_03_22_implemented_s3_compatible_adapter.md is the current file). Keeping the old path leaves dead references in the design, so readers cannot verify the cited proxy/routing sections and may implement from stale or missing context; update these references consistently across the document.


bootjp added 2 commits April 24, 2026 04:05
Codex review on #606 flagged that the SQS design cited
docs/s3_compatible_adapter_design.md, which was relocated to
docs/design/2026_03_22_implemented_s3_compatible_adapter.md by the
docs/reorg-design-doc-layout change. Update all four references so the
cited §4.1, §5.2, and §6.2 resolve.
@bootjp (Owner, Author) commented Apr 23, 2026

@claude[agent] review

@bootjp (Owner, Author) commented Apr 23, 2026

@codex[agent] review

Claude AI commented Apr 23, 2026

@claude[agent] review

Completed comprehensive review of the SQS adapter design doc. The design is technically sound and ready for implementation. All cross-references are accurate, the data model is well-thought-out (especially the visibility-indexed keys), and it properly reuses existing patterns from DynamoDB/S3/lease-read. Key strengths: eliminates background sweeper via visibility index design, provides forgery-resistant receipt handles, maintains FIFO consistency via one-queue-per-shard routing, and enables follower reads via lease-read fast path. Testing plan covers unit/integration/compatibility/Jepsen. No issues found - this is a solid foundation for Milestone 1 implementation.

@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 559dcbe744


3. **ChangeMessageVisibility:** same delete+insert swap as receive, validated against the supplied receipt handle.
4. **Delete:** remove both the data record and the visibility index entry.

Because `visible_at` leads the key, the next candidate message is always the first key `>= now` in the prefix. Invisible messages (`visible_at > now`) are naturally skipped without a sweeper.


P1: Scan visibility index with `<= now` candidates

This sentence inverts the visibility-index lookup condition: with keys ordered by visible_at, selecting the first key >= now will skip any messages whose visible_at is already in the past, so a queue with ready backlog can appear empty. The receive path described later correctly talks about visible_at ≤ now; this line should match that rule to avoid an implementation that drops immediately deliverable messages from consideration.

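The corrected lookup the reviewer describes is easy to pin down in code. A sketch under an assumed key layout (8-byte big-endian `visible_at` millis followed by the message id; only the ordering property comes from the design): because `visible_at` leads the key, the deliverable messages are exactly the prefix of the sorted index whose timestamp is `<= now`, i.e. the range `[prefix|0, prefix|now+1)`.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"sort"
)

// visKey encodes a visibility-index key: big-endian visible_at (ms), then id.
// The exact layout is an assumption for illustration.
func visKey(visibleAtMs uint64, id string) []byte {
	k := make([]byte, 8, 8+len(id))
	binary.BigEndian.PutUint64(k, visibleAtMs)
	return append(k, id...)
}

// dueMessages returns ids with visible_at <= now, in key order. Because
// visible_at leads the key, these form a contiguous prefix of the sorted
// index: scan forward and stop at the first future (invisible) key.
func dueMessages(keys [][]byte, nowMs uint64) []string {
	var due []string
	for _, k := range keys {
		if binary.BigEndian.Uint64(k[:8]) > nowMs { // first in-flight key: stop
			break
		}
		due = append(due, string(k[8:]))
	}
	return due
}

func main() {
	now := uint64(1_000_000)
	keys := [][]byte{
		visKey(now-5_000, "past"),   // already visible
		visKey(now+5_000, "future"), // still in flight
	}
	sort.Slice(keys, func(i, j int) bool { return string(keys[i]) < string(keys[j]) })

	// The inverted ">= now" rule would skip "past" and report false-empty.
	fmt.Println(dueMessages(keys, now))
}
```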

Add §13 "Console UI" to the SQS-compatible adapter design and thread
the companion references into §2.1 goals, §10 operational flags, §14
rollout phases, and §15 summary.

What the section covers:

- Goals / non-goals: read-only browsing from any node, leader-aware
  writes, zero JS build toolchain, operator-token auth distinct from
  SigV4. Explicitly out of scope: RBAC, graphical metrics, edit-in-place.
- Serving model: separate --consoleAddress listener (like
  --metricsAddress / --pprofAddress) so operators can firewall it
  independently; --consoleToken bearer auth enforced by
  monitoring.AddressRequiresToken; --consoleReadOnly gate for the
  mutating path.
- Architecture: adapter/console.go + go:embed static bundle under
  adapter/console_ui/; reuses SQSServer methods for catalog and
  message state; writes go through the shared
  proxyHTTPRequestToLeader helper; reads use LeaseReadThrough.
- REST API surface: non-AWS JSON with plain status codes; admin peek
  endpoint that scans the visibility index non-destructively (does
  not rotate receipt tokens or bump visibility), so the console
  never competes with workers.
- Front-end: single-page HTML + one hand-written JS file, no
  framework, readable over a slow SSH tunnel. Destructive actions
  (Purge / Delete) require the operator to type the queue name.
- Auth, failure, observability, and testing plans, plus a clear
  Milestone-1 out-of-scope list (SSO, per-queue ACLs, exports).

Rollout phases now include the console: Phase 1 ships read-only
views, Phase 2 unlocks mutating actions, Phase 3 adds UI polish
(in-flight countdown, dark mode).
@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c6158388af


Details:

1. Fence the snapshot scan with `coordinator.LeaseReadForKey(ctx, queueRoute)`. While the lease is valid (≤ `electionTimeout − leaseSafetyMargin`, currently 700 ms per `docs/design/2026_04_20_implemented_lease_read.md §3.2`) this returns immediately with no ReadIndex round-trip. Under sustained receive load this amortizes quorum confirmation across every receive in the lease window instead of paying one ReadIndex per call.
2. Snapshot-scan the visibility index for the first `k` keys with `visible_at ≤ now`, up to `MaxNumberOfMessages` (≤ 10). Use a bounded scan limit (e.g., `2 * k`) to tolerate races.


P1: Remove fixed `2*k` cap when scanning FIFO receive candidates

The proposed ReceiveMessage algorithm limits the visibility scan to 2 * k, but FIFO filtering can skip many visible entries due to held group locks; if the first 20 entries are blocked and a deliverable message appears later, the queue will be reported empty despite ready backlog. This can systematically starve available messages under concurrent consumers, so the scan bound should not be tied to k alone (e.g., continue until k deliverable candidates or the visible prefix is exhausted).

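The paging rule this finding asks for can be illustrated with an in-memory toy (real code would page a KV range scan; all names here are illustrative): keep pulling pages until `k` candidates survive the group-lock filter or the visible prefix is exhausted, instead of stopping at a fixed `2*k` window.

```go
package main

import "fmt"

type msg struct {
	ID    string
	Group string
}

// receiveFIFO pages through the visible prefix until k candidates survive
// the group-lock filter or the prefix is exhausted. A fixed 2*k window
// would stop early and report false-empty when the leading groups are locked.
func receiveFIFO(visible []msg, locked map[string]bool, k, pageSize int) []msg {
	var out []msg
	for start := 0; start < len(visible) && len(out) < k; start += pageSize {
		end := start + pageSize
		if end > len(visible) {
			end = len(visible)
		}
		for _, m := range visible[start:end] {
			if locked[m.Group] { // group lock held: skip, don't consume budget
				continue
			}
			locked[m.Group] = true // acquire the group lock on delivery
			out = append(out, m)
			if len(out) == k {
				break
			}
		}
	}
	return out
}

func main() {
	// 20 visible entries in a locked group, then one deliverable message.
	visible := make([]msg, 0, 21)
	for i := 0; i < 20; i++ {
		visible = append(visible, msg{ID: fmt.Sprintf("m%d", i), Group: "hot"})
	}
	visible = append(visible, msg{ID: "m20", Group: "cold"})

	// k=1: a fixed 2*k scan would never reach m20; paging forward finds it.
	got := receiveFIFO(visible, map[string]bool{"hot": true}, 1, 2)
	fmt.Println(got[0].ID)
}
```

A production version would also carry a soft wall-clock budget so a pathologically group-skewed queue cannot pin the scan forever.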


### 7.3 Long-poll notifier

Long polling is local to the leader: every node keeps a `map[queueRoute]*cond` and wakes waiters when a `SendMessage` commits on that queue. The notifier lives in the leader process; on failover, waiters on the old leader get `AbortedError` from the coordinator and clients retry against the new leader. AWS SDKs already retry receive on connection errors, so this is acceptable.


P2: Wake long-poll waiters on visibility transitions, not just sends

The notifier design only wakes waiters when SendMessage commits, but messages can become receivable without any send (for example when an in-flight message's visibility timeout expires, or visibility is set to zero). In those cases, receivers can sleep until WaitTimeSeconds elapses even though a message is available, adding avoidable latency and reducing effective throughput for idle/low-send queues.


@coderabbitai (Bot) left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/design/2026_04_24_proposed_sqs_compatible_adapter.md`:
- Around line 203-207: The markdown has unlabeled fenced code blocks causing
MD040 warnings; update the three affected fences by adding explicit languages:
label the receipt_handle block (the block containing "receipt_handle =
base64url( queue_gen_u64 | message_id_16 | receipt_token_16 )") as "text", label
the routing sample block (the block containing "!sqsroute|<queue-esc><gen-u64>")
as "text" as well, and label the CLI flags block (the block listing
"--consoleAddress", "--consoleToken", "--consoleReadOnly") as "bash"; apply the
same labeling to the other occurrences noted around the same snippets.
- Around line 357-363: The Markdown table cells include raw pipe characters in
the fenced key examples (e.g., `!sqs|queue|meta|<queue>`, `!sqs|queue|meta|...`,
`!sqs|queue|meta|<queue>.tags`) which breaks column parsing; fix by escaping
each pipe inside those cell values (or wrap the entire key examples in inline
code spans) so the pipes are not interpreted as column separators—update the
table rows for GetQueueUrl, GetQueueAttributes, ListQueues, and ListQueueTags
accordingly.

In `@docs/design/README.md`:
- Around line 7-9: Two fenced code blocks in the README use triple backticks
without a language, triggering MD040; update both fences to declare a language
(use "text") so markdownlint stops flagging them—specifically change the fence
around the snippet containing "YYYY_MM_DD_<status>_<name>.md" and the fence
around the block listing filenames starting with
"2026_04_20_implemented_lease_read.md" to use ```text instead of ```; no content
changes otherwise.
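The receipt-handle layout quoted above (`queue_gen_u64 | message_id_16 | receipt_token_16`, base64url) pins down a 40-byte payload. A hypothetical codec for it, with field widths taken from the quoted layout and everything else assumed:

```go
package main

import (
	"encoding/base64"
	"encoding/binary"
	"errors"
	"fmt"
)

// receiptHandle is the decoded form of the 40-byte payload:
// 8-byte queue generation, 16-byte message id, 16-byte receipt token.
type receiptHandle struct {
	QueueGen     uint64
	MessageID    [16]byte
	ReceiptToken [16]byte
}

func encodeReceiptHandle(h receiptHandle) string {
	buf := make([]byte, 40)
	binary.BigEndian.PutUint64(buf[0:8], h.QueueGen)
	copy(buf[8:24], h.MessageID[:])
	copy(buf[24:40], h.ReceiptToken[:])
	return base64.RawURLEncoding.EncodeToString(buf)
}

func decodeReceiptHandle(s string) (receiptHandle, error) {
	buf, err := base64.RawURLEncoding.DecodeString(s)
	if err != nil || len(buf) != 40 {
		// A structurally malformed handle is the only hard error case.
		return receiptHandle{}, errors.New("ReceiptHandleIsInvalid")
	}
	var h receiptHandle
	h.QueueGen = binary.BigEndian.Uint64(buf[0:8])
	copy(h.MessageID[:], buf[8:24])
	copy(h.ReceiptToken[:], buf[24:40])
	return h, nil
}

func main() {
	h := receiptHandle{QueueGen: 7}
	copy(h.MessageID[:], "msg-0000000000001")   // truncated to 16 bytes
	copy(h.ReceiptToken[:], "tok-000000000001") // exactly 16 bytes

	enc := encodeReceiptHandle(h)
	dec, err := decodeReceiptHandle(enc)
	fmt.Println(err == nil, dec.QueueGen, dec == h)
}
```

A fixed-width binary layout keeps the handle self-describing and forgery checks cheap: decode, then compare the embedded token against the stored one.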


📥 Commits

Reviewing files that changed from the base of the PR and between 99c0560 and c615838.

📒 Files selected for processing (2)
  • docs/design/2026_04_24_proposed_sqs_compatible_adapter.md
  • docs/design/README.md

Fix three active findings from the latest codex review plus the
coderabbit markdown polish:

- P1 (codex) §5.4: "first key >= now" was inverted; the correct
  statement is "first N keys with visible_at <= now", which maps to
  a scan over [prefix|0, prefix|now+1).
- P1 (codex) §7.2: drop the blanket "2 * k" FIFO scan cap. FIFO
  group locks can legitimately hide an entire page, so receive must
  page forward until MaxNumberOfMessages candidates survive the
  group-lock/dedup filters. Standard keeps the small page; FIFO
  uses continuation paging with a soft wall-clock budget.
- P2 (codex) §7.3: long-poll notifier now wakes on any transition
  that makes a message deliverable - send, delay expiry,
  ChangeMessageVisibility that moves visible_at backward, and
  natural visibility-timeout expiry - tracked with a per-queue
  timer wheel over the smallest in-flight visible_at. The
  sequence-diagram caption is updated accordingly.
- Markdown lint: add language tags to unlabeled fenced code blocks
  (MD040) and escape "|" in example keys inside table cells
  (MD056) so the table still parses correctly.
@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e098bcf74c


Comment on lines +335 to +336
- Load `!sqs|msg|data|<queue><gen><message_id>`. Missing → `ReceiptHandleIsInvalid`.
- Compare `current_receipt_token`. Mismatch → `InvalidReceiptHandle`.


P1: Preserve DeleteMessage success for stale receipt handles

The design maps token mismatch (and missing message records) to InvalidReceiptHandle, but SQS DeleteMessage treats an old receipt handle as a successful no-op that may not delete the message. In real workloads this happens when consumers retry deletes after a redelivery/handle rotation, so returning hard errors here would break compatibility expectations for SDK clients and batch workers that rely on idempotent delete acknowledgements.


1. Fence the snapshot scan with `coordinator.LeaseReadForKey(ctx, queueRoute)`. While the lease is valid (≤ `electionTimeout − leaseSafetyMargin`, currently 700 ms per `docs/design/2026_04_20_implemented_lease_read.md §3.2`) this returns immediately with no ReadIndex round-trip. Under sustained receive load this amortizes quorum confirmation across every receive in the lease window instead of paying one ReadIndex per call.
2. Snapshot-scan the visibility index for keys with `visible_at ≤ now`. Standard queues read in page-sized chunks (page limit 1024) and stop once either `MaxNumberOfMessages` candidates survive the filter or the scan is exhausted; a small `2 * k` page is fine because Standard has no filter that can hide an unbounded prefix. FIFO queues **cannot** cap the scan at `2 * k` because an entire page can legitimately belong to groups whose locks are already held, hiding deliverable messages that sit further along in the index. FIFO receive instead continues paging forward (resuming from the last key returned) until `MaxNumberOfMessages` candidates survive the group-lock and dedup filters, or the end-of-range sentinel `visible_at = now+1` is reached. A soft wall-clock budget (e.g., 100 ms) on the overall scan caps latency when the queue is pathologically group-skewed, in which case the receive returns whatever it has accumulated.
3. Filter FIFO candidates: skip any message whose group lock is held by another message; acquire the group lock on delivery. Candidates rejected by this filter must not consume the `MaxNumberOfMessages` budget.
4. Run one OCC transaction per batch: for each candidate, verify the visibility entry still matches, rotate receipt token, set `visible_at_hlc = now + effective_timeout`, insert the new visibility entry, bump `receive_count` and `approximate_first_receive_timestamp`, acquire FIFO group lock if applicable. A successful `Propose` also refreshes the lease (per `§3.3` of the lease-read doc), so a busy queue keeps its fast path warm without any extra work.


P2: Set first-receive timestamp only on initial delivery

This step says to bump approximate_first_receive_timestamp on every receive, but that attribute is defined as the time the message was first received. Updating it on retries/redeliveries would erase the true first-seen time and diverge from SQS metadata semantics used by operators and consumers for age/debug analysis.


Two follow-up findings on the SQS design doc:

- P1 §7.4: DeleteMessage must be idempotent against stale receipt
  handles. AWS returns 200 no-op when the handle token no longer
  matches (a redelivery rotated it) or the record is already gone;
  SDK retry paths and batch workers rely on this. Rewritten §7.4 to
  spell out the three cases — structural error vs missing-record
  no-op vs token-mismatch no-op vs valid delete — and to note that
  ErrWriteConflict retries must re-apply the same classification.
  §7.5 explicitly contrasts ChangeMessageVisibility (AWS still
  errors on stale/missing handles, so we do too).
- P2 §7.2 step 4: clarify that approximate_first_receive_timestamp
  is set only on the FIRST delivery, not bumped on every receive.
  The attribute is defined as "time of first receive" and
  overwriting it on redeliveries corrupts age/debug analytics the
  operators rely on.

No behavior change at the storage or routing layer; only clarifies
what the adapter must do when it encounters these cases.
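The three-way classification the commit message describes can be made concrete with a small decision function (the outcome names and boolean inputs are illustrative; the AWS-visible behavior is what the finding specifies):

```go
package main

import "fmt"

type deleteOutcome string

const (
	errInvalidHandle deleteOutcome = "ReceiptHandleIsInvalid" // malformed handle: hard error
	okNoopMissing    deleteOutcome = "200 no-op (record already gone)"
	okNoopStale      deleteOutcome = "200 no-op (token rotated by redelivery)"
	okDeleted        deleteOutcome = "200 deleted"
)

// classifyDelete mirrors SQS DeleteMessage semantics: only a structurally
// invalid handle errors; stale or missing records succeed as no-ops, so
// SDK retries and batch workers stay idempotent. An ErrWriteConflict retry
// would re-run this same classification.
func classifyDelete(handleWellFormed, recordExists, tokenMatches bool) deleteOutcome {
	switch {
	case !handleWellFormed:
		return errInvalidHandle
	case !recordExists:
		return okNoopMissing
	case !tokenMatches:
		return okNoopStale
	default:
		return okDeleted
	}
}

func main() {
	// A consumer retries a delete after its message was redelivered
	// (receipt token rotated): SQS answers 200, not an error.
	fmt.Println(classifyDelete(true, true, false))
}
```

ChangeMessageVisibility is the deliberate contrast: there, a stale or missing handle stays a hard error, so it would not route through this function.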
@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4206e63eb6


Comment on lines +310 to +314
2. Snapshot-scan the visibility index for keys with `visible_at ≤ now`. Standard queues read in page-sized chunks (page limit 1024) and stop once either `MaxNumberOfMessages` candidates survive the filter or the scan is exhausted; a small `2 * k` page is fine because Standard has no filter that can hide an unbounded prefix. FIFO queues **cannot** cap the scan at `2 * k` because an entire page can legitimately belong to groups whose locks are already held, hiding deliverable messages that sit further along in the index. FIFO receive instead continues paging forward (resuming from the last key returned) until `MaxNumberOfMessages` candidates survive the group-lock and dedup filters, or the end-of-range sentinel `visible_at = now+1` is reached. A soft wall-clock budget (e.g., 100 ms) on the overall scan caps latency when the queue is pathologically group-skewed, in which case the receive returns whatever it has accumulated.
3. Filter FIFO candidates: skip any message whose group lock is held by another message; acquire the group lock on delivery. Candidates rejected by this filter must not consume the `MaxNumberOfMessages` budget.
4. Run one OCC transaction per batch: for each candidate, verify the visibility entry still matches, rotate receipt token, set `visible_at_hlc = now + effective_timeout`, insert the new visibility entry, bump `receive_count`, **set `approximate_first_receive_timestamp` only on the first delivery** (subsequent redeliveries must preserve the original value — it is defined as the time the message was first received), and acquire FIFO group lock if applicable. A successful `Propose` also refreshes the lease (per `§3.3` of the lease-read doc), so a busy queue keeps its fast path warm without any extra work.
5. If the batch is empty and `WaitTimeSeconds > 0`, register a waiter on the queue's in-process condition variable and re-run step 1 on wake or timeout. Do not poll in a tight loop. On re-scan, the `LeaseRead` fence from step 1 stays local for the whole long-poll window — empty long polls cost O(1) quorum operations, not O(ticks).
6. If `receive_count > redrive_policy.maxReceiveCount`, move the message to the DLQ (see §7.6) atomically within the receive transaction and do not return it to the caller.


P2: Remove fixed standard scan cap when redrive skips entries

This still treats a bounded 2 * k scan as safe for Standard queues, but the same flow later drops messages that exceed maxReceiveCount by moving them to the DLQ and not returning them. When the visible prefix contains many over-limit messages, a fixed 2 * k window can return an empty or undersized batch even though deliverable messages exist deeper in the index, which creates false-empty receives and extra latency under poison-message backlogs.


Comment on lines +321 to +325

1. A `SendMessage` (or `SendMessageBatch` entry) commits — the obvious case.
2. A `SendMessage` whose `DelaySeconds` has now elapsed — the notifier schedules a one-shot timer keyed off `available_at` when the delayed message commits, and the timer wakes the queue's waiters when it fires.
3. A `ChangeMessageVisibility` commits a `VisibilityTimeout` that moves the record's `visible_at` backward (including the AWS-permitted value of zero) — the message becomes immediately deliverable even though no send happened.
4. A message's visibility timeout naturally expires. Because each message has a known `visible_at`, the notifier tracks the smallest `visible_at` in-flight per queue and wakes waiters at that deadline (same timer wheel as the delay case).


P2: Rebuild visibility wake timers after leader failover

The notifier is defined as waking waiters based on commit-time transitions and timers scheduled from those commits, but there is no bootstrap path for a newly elected leader that inherits already in-flight messages. After failover, those preexisting visibility deadlines can pass without any new commit to schedule a wake, so long-poll receivers may sleep until WaitTimeSeconds expires instead of being notified when messages become visible.


Two more P2 findings on §7.2 / §7.3:

- §7.2 (step 2/3): the earlier revision only removed the 2*k scan
  cap for FIFO, but Standard queues can also hide an unbounded
  prefix via DLQ redrive — messages that exceeded maxReceiveCount
  are moved to the DLQ inside the same receive transaction and not
  returned, so a poison-message backlog masks deliverable messages
  deeper in the visibility index. The scan rule is now: both
  queue types keep paging forward until MaxNumberOfMessages
  candidates survive the filters OR the end-of-range sentinel is
  reached, with a soft wall-clock budget to cap latency. Step 3's
  filter is generalized to list both the FIFO group-lock rule and
  the DLQ-threshold rule.
- §7.3: add a leader-failover bootstrap step. Timer-wheel entries
  scheduled on the old leader do not Raft-replicate, so the newly
  elected leader must scan !sqs|msg|vis| at LinearizableRead time
  and repopulate timers from in-flight visible_at values. Past
  deadlines fire an immediate wake. Bootstrap is single-shot per
  leader term with a lazy-populate fallback for extreme-fanout
  clusters.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 559d1d4968

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


FIFO queues add three invariants on top of the standard schema:

1. **Group lock**: `!sqs|msg|group|<queue><gen><group>` is held by at most one message at a time. `ReceiveMessage` skips the entire group while the lock is held. The lock is released when the locking message is deleted, expires its visibility (and the next receive within the group rotates ownership), or the message retention expires.

P1 Badge Preserve FIFO head lock across visibility timeout

Releasing the FIFO group lock when the in-flight message’s visibility expires can violate strict per-MessageGroupId ordering. With visible_at-ordered scans, a later message in the same group can be selected before the timed-out head message once this lock is dropped, contradicting the FIFO guarantees stated elsewhere in the doc. Keep the lock tied to the head message until it is deleted (or explicitly moved), so later group messages cannot overtake it.


1. Validate message size (`MaximumMessageSize` from queue attrs, capped at 262144).
2. Compute `md5_of_body` and `md5_of_message_attributes`.
3. Resolve queue generation; if queue is tombstoned return `QueueDoesNotExist`.
4. Compute `available_at_hlc = now + max(queue.delay, per-message DelaySeconds)`.

P2 Badge Reject per-message DelaySeconds on FIFO queues

The send flow applies max(queue.delay, per-message DelaySeconds) generically, but SQS FIFO APIs reject per-message DelaySeconds and only allow queue-level delay. Accepting message-level delay on FIFO would diverge from AWS compatibility and let invalid client behavior pass silently, so this path should explicitly fail FIFO sends that include per-message delay.


| `StartMessageMoveTask` and friends | Not planned | |
| Server-side KMS | Not planned | |

Payload limits match AWS defaults: 256 KiB per message, 120,000 in-flight messages per standard queue, 20,000 in-flight per FIFO queue. These should be enforced at the adapter, not silently truncated.

P2 Badge Use current SQS FIFO in-flight quota default

This section states FIFO queues default to a 20,000 in-flight cap and says the adapter should enforce AWS defaults, but current AWS FIFO quota is 120,000 in-flight messages. Enforcing 20,000 would make Elastickv throttle valid FIFO workloads far earlier than SQS, causing avoidable compatibility and throughput regressions.


### 13.6 Authentication

1. Loopback + no token configured: the console is open. This is the "one-liner dev" case.
2. Non-loopback or token configured: every `/console/*` request must carry `Authorization: Bearer <token>`. Missing or wrong token returns `401 Unauthorized`. No cookie / session; the token is supplied by the operator (typed into a small login screen and stored in `sessionStorage`).

P1 Badge Exempt console bootstrap page from bearer auth

Requiring Authorization: Bearer for every /console/* request makes the browser flow unusable when a token is configured, because the initial HTML/JS load cannot attach a header before the login UI runs. That conflicts with the documented “type token into a login screen” behavior and would lock operators out of the console in secured deployments unless unauthenticated bootstrap assets are allowed.


Four findings from codex round 5:

- §3 (P2): FIFO in-flight quota was stated as 20,000 — the current
  AWS quota is 120,000. Updated so the adapter does not throttle
  valid FIFO workloads far earlier than SQS.
- §5.6 (P1): FIFO group lock was described as releasing on
  visibility-timeout expiry, which lets a later message in the same
  MessageGroupId overtake the head message on the next receive
  (vis-index scans by visible_at ascending). Rewritten so the lock
  stays pinned to the head message across visibility expiries; it
  is only released by delete, DLQ move, or retention expiry.
- §7.1 step 4 (P2): FIFO sends now reject per-message DelaySeconds
  with InvalidParameterValue — AWS only permits queue-level delay
  on FIFO, and silently accepting a per-message value would let
  misbehaving clients ship out of order.
- §13.6 (P1): console bearer-auth required a header on every
  /console/* request, which breaks the documented login-UI flow
  because the browser cannot attach Authorization before the
  static HTML/JS that renders the login screen has loaded. Split
  routes: /console/, /console/index.html, /console/assets/** serve
  unauthenticated so the bundle can bootstrap; /console/api/**
  requires the bearer token. /console/api/health is also
  unauthenticated so the login page can show leader/follower
  status before sign-in.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review


Reviewed commit: feec3782e2


1. Validate message size (`MaximumMessageSize` from queue attrs, capped at 262144).
2. Compute `md5_of_body` and `md5_of_message_attributes`.
3. Resolve queue generation; if queue is tombstoned return `QueueDoesNotExist`.
4. Compute `available_at_hlc`. Standard queues use `now + max(queue.delay, per-message DelaySeconds)`. **FIFO queues reject per-message `DelaySeconds` with `InvalidParameterValue`** — AWS only allows queue-level `DelaySeconds` on FIFO, and silently accepting a per-message value would diverge from real SQS behavior and let misbehaving clients ship out of order. FIFO sends therefore always compute `available_at_hlc = now + queue.delay`.

P1 Badge Honor per-message DelaySeconds override for standard queues

This step makes standard queues compute availability as max(queue.delay, per-message DelaySeconds), but SQS message timers override the queue delay for that message (they are not merged via max). With the current rule, a queue configured with a longer default delay cannot send an explicitly immediate message (DelaySeconds=0), which breaks AWS-compatible behavior for SDK clients that rely on per-message overrides.



## 11. Failure Handling and Cleanup

1. **Retention expiry**: a background reaper on the leader scans the visibility index for entries older than `send_timestamp + message_retention_seconds` and deletes them in bounded batches. Because visible_at ≤ send_time + retention implies the message is past retention in the common case, the reaper can mostly piggy-back on `ReceiveMessage` scans.

P1 Badge Enforce retention by send age, not visibility index position

Driving retention cleanup primarily from the visibility index can leave expired messages behind when visibility is extended far into the future (for example via repeated ChangeMessageVisibility), because those records may never appear in the scanned visible_at window even after send_timestamp + retention has passed. SQS retention is based on message age, including in-flight messages, so this approach can violate configured retention limits and keep data longer than intended.



1. Unit: REST-handler table tests using `httptest.NewRecorder` for auth / validation.
2. Integration: single-node cluster via `createNode(t, 1)`, drive the console API through `http.Client`, assert queue create → peek → purge → delete round-trip.
3. UI smoke: a `testdata/console_ui_smoke_test.go` compiles the static bundle through `httptest.NewServer` and runs a handful of headless HTTP GETs to confirm the assets actually ship. (No Chromium; too heavy for CI.)

P2 Badge Move console UI smoke test out of testdata to execute in CI

Placing the proposed smoke test at testdata/console_ui_smoke_test.go means go test will not run it, because the Go tool ignores directories named testdata. That leaves the “assets actually ship” check unenforced in normal CI runs; this test should live in a regular package test file path so regressions are caught.


bootjp added 3 commits April 24, 2026 17:17
Three findings:

- §7.1 step 4 (P1): per-message DelaySeconds on Standard queues is
  an OVERRIDE of queue.delay, not max'd with it. AWS lets a client
  that knows the queue default is 5 min send an immediate message
  by passing DelaySeconds=0. The previous "max(queue.delay,
  per-message)" rule broke that compatibility. Rewritten so the
  effective delay is the per-message value when supplied, else
  queue.delay. FIFO still rejects per-message DelaySeconds.
- §11 item 1 (P1): retention cannot drive off the visibility index
  because ChangeMessageVisibility can push visible_at past
  send_timestamp + retention, so the message never enters the
  scan window. Added a dedicated send-age index
  !sqs|msg|byage|<queue><gen><send_timestamp><msg-id> that the
  send path writes in the same txn and the reaper scans by age.
  ReceiveMessage also skips (and files for reaping) any
  candidate whose age has elapsed so a slow reaper cannot cause
  stale-retention deliveries.
- §13.9 item 3 (P2): the proposed console UI smoke test lived
  under testdata/ — `go test` ignores testdata/ so the check
  would silently not run. The test file now lives at the
  package root (adapter/console_ui_smoke_test.go); fixtures it
  consumes can still live under adapter/testdata/console_ui/.
Start the SQS-compatible HTTP adapter per
docs/design/2026_04_24_proposed_sqs_compatible_adapter.md. This first
increment wires the server skeleton, dispatch table, error envelope,
health endpoints, and leader-proxy path so subsequent PRs can drop real
handlers into a tested frame.

Included:

- adapter/sqs.go: SQSServer struct, NewSQSServer with SQSServerOption,
  Run/Stop, /sqs_health + /sqs_leader_health, X-Amz-Target dispatch for
  all Milestone-1/2 SQS operations (CreateQueue ... ListQueueTags), and
  the JSON-1.0 error envelope (__type + x-amzn-ErrorType + message).
  Every target currently replies 501 NotImplemented; handlers will be
  filled in separately.
- adapter/sqs_test.go: health endpoints, method gating, error-envelope
  shape, unknown-target -> InvalidAction, every known target ->
  NotImplemented, and a Stop-unblocks-Run regression guard.
- adapter/leader_http_proxy.go: shared proxyHTTPRequestToLeader helper
  factored out of the DynamoDB adapter. DynamoDB and SQS both delegate
  to it via per-adapter error writers, removing the structural dup the
  linter flagged.

No main.go wiring or flag changes in this PR; that lands alongside the
first real handler so the binary does not expose a 501-only endpoint.
Add the minimal plumbing to bring an SQSServer online next to the
existing DynamoDB and S3 servers. No new handlers yet - every target
still returns NotImplemented - but the binary and the test harness can
now open an SQS endpoint so subsequent PRs that add real operations
have somewhere to land.

- shard_config.go: ErrInvalidRaftSQSMapEntry + parseRaftSQSMap.
- main.go: --sqsAddress / --raftSqsMap flags, leaderSQS in
  runtimeConfig, buildLeaderSQS, and startSQSServer wired into
  runtimeServerRunner.start. sqsAddress defaults to empty.
- main_sqs.go: startSQSServer mirrors main_s3.go two-goroutine
  shutdown pattern.
- adapter/test_util.go: portsAdress / listeners / Node / newNode /
  setupNodes carry an sqsAddress and *SQSServer per node, shutdown
  calls Stop via an extracted shutdownNode helper, portSQS starts at
  29000 so tests do not collide with the dynamo range at 28000.
- main_bootstrap_e2e_test.go: update parseRuntimeConfig signature.
@bootjp
Owner Author

bootjp commented Apr 24, 2026

@codex review


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review


Reviewed commit: c692465631


Comment thread adapter/sqs_messages.go Outdated
Comment on lines +411 to +413
candidates, err := s.scanVisibleMessageCandidates(ctx, queueName, meta.Generation, max*sqsReceiveScanOverfetchFactor, readTS)
if err != nil {
writeSQSErrorFromErr(w, err)

P1 Badge Honor WaitTimeSeconds before returning empty receives

This handler parses WaitTimeSeconds in the input type but never uses it, and immediately performs a single visibility scan/return cycle. On an empty queue, requests that ask for long polling (for example WaitTimeSeconds=20) return right away instead of waiting up to the requested duration, which breaks expected SQS receive semantics and can create high-rate hot polling in production clients.


Comment thread adapter/sqs_catalog.go Outdated
Comment on lines +247 to +248
IsFIFO: strings.HasSuffix(name, sqsFIFOQueueNameSuffix),
VisibilityTimeoutSeconds: sqsDefaultVisibilityTimeoutSeconds,

P2 Badge Require explicit FifoQueue=true for FIFO queue creation

Queue type is currently inferred from the .fifo suffix when FifoQueue is omitted, because IsFIFO is initialized from the name before attribute validation runs. That makes requests without FifoQueue create FIFO queues implicitly, whereas SQS-compatible behavior expects omission to create a standard queue and only treat a queue as FIFO when FifoQueue=true is explicitly set.


codex round 8 on #610 flagged two findings, both addressed here:

- P1 sqs_messages.go: ReceiveMessage parsed WaitTimeSeconds but
  never used it, so clients asking for long polling got an
  immediate empty return and hot-polled the server. Milestone 1
  now does a polling-based long poll: if the first scan returns
  nothing and WaitTimeSeconds > 0, the handler re-scans every
  200 ms until a message arrives, the wait elapses, or the
  request context is canceled. The full commit-stream notifier
  from §7.3 of the design is still deferred, but the observable
  client contract now matches AWS.
  resolveReceiveWaitSeconds applies per-request value -> queue
  default -> clamp to [0, 20].
- P2 sqs_catalog.go: IsFIFO was pre-initialized from the .fifo
  name suffix before attribute validation ran, so a CreateQueue
  without FifoQueue=true silently became a FIFO queue. AWS
  requires FifoQueue=true to be explicitly set. IsFIFO now
  defaults to false; a .fifo-suffixed name without the attribute
  returns InvalidParameterValue ("FIFO queue name requires
  FifoQueue=true attribute").

New tests: TestSQSServer_LongPollWaitsForArrival schedules a send
mid-wait and asserts the receive returns promptly after arrival
(and waited at all); TestSQSServer_LongPollTimesOutOnEmptyQueue
asserts WaitTimeSeconds on an empty queue returns empty at
approximately the requested time.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review


Reviewed commit: 3b1d429f8d


Comment thread adapter/sqs_messages.go Outdated
Comment on lines +479 to +480
if err != nil {
return nil

P1 Badge Propagate ReceiveMessage scan failures instead of empty success

scanAndDeliverOnce drops scanVisibleMessageCandidates errors and returns nil, which receiveMessage serializes as a normal 200 with an empty Messages list. In any backend failure path (for example routing/storage errors), clients get a false-empty poll instead of an actionable error, which can silently stall consumers and hide production incidents.


Comment thread adapter/sqs_catalog.go
Comment on lines +524 to +525
ReadKeys: [][]byte{metaKey, genKey},
Elems: []*kv.Elem[kv.OP]{

P1 Badge Add shard-routing normalization for SQS internal keys

SQS mutations are dispatched using raw !sqs|... keys, but routing normalization only covers Redis/Dynamo/S3/list keys (kv/shard_key.go), not SQS. In multi-shard deployments where shardRanges are defined around user-key ranges, these internal keys can become unroutable and fail with no route for key, making queue creation and message APIs unusable. SQS needs a route-key normalization path before dispatching these internal keys.


Comment thread adapter/sqs_messages.go Outdated
Comment on lines +243 to +247
type sqsSendMessageInput struct {
QueueUrl string `json:"QueueUrl"`
MessageBody string `json:"MessageBody"`
DelaySeconds *int64 `json:"DelaySeconds,omitempty"`
MessageAttributes map[string]string `json:"MessageAttributes,omitempty"`

P1 Badge Enforce FIFO SendMessage required fields

The send input schema omits FIFO-only parameters (MessageGroupId/deduplication id), so FIFO queues can accept sends that AWS would reject, and those messages are processed with standard-queue behavior. This breaks SQS compatibility and FIFO ordering expectations for any client creating a queue with FifoQueue=true.


Three P1s:

- sqs_messages.go: scanAndDeliverOnce was swallowing scan errors
  and returning nil, which the handler serialized as a normal 200
  with an empty Messages list. A backend / routing / storage
  failure would silently stall consumers instead of surfacing. Now
  scanAndDeliverOnce returns (result, error), longPollReceive
  propagates the error, and receiveMessage returns an actionable
  5xx via writeSQSErrorFromErr.
- sqs_messages.go: SendMessage now validates FIFO-specific fields.
  FIFO queues REQUIRE MessageGroupId, and without
  ContentBasedDeduplication they REQUIRE MessageDeduplicationId
  too. Standard queues REJECT both fields. Per-message
  DelaySeconds is rejected on FIFO (AWS only accepts queue-level
  delay on FIFO). The record struct and input schema now carry
  MessageGroupId / MessageDeduplicationId so Milestone 2's group
  lock has the data it needs.
- kv/shard_key.go: added sqsRouteKey so !sqs|... internal keys
  normalize to a stable !sqs|route|global prefix. Without this,
  multi-shard deployments with shardRanges defined around user-
  key ranges could leave SQS internal keys unroutable and fail
  every queue mutation with "no route for key". Milestone 1
  collapses all SQS keys to one route (single control-plane
  group); per-queue routing lands with FIFO in Milestone 2.

Tests: TestSQSServer_SendFIFOValidation exercises every branch
of validateSendFIFOParams; TestSQSServer_CreateQueueRequiresExplicitFifoFlag
asserts that a .fifo suffix alone is rejected when FifoQueue=true
is not set.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review


Reviewed commit: 0d69c90a8d


Comment thread adapter/sqs_messages.go
Comment on lines +782 to +783
if errors.Is(err, store.ErrKeyNotFound) {
return nil, nil, readTS, sqsDeleteNoOp, nil

P2 Badge Reject delete no-op when QueueUrl does not match message key

loadMessageForDelete maps any missing data key to sqsDeleteNoOp, so DeleteMessage returns 200 even when the caller supplies a wrong/deleted QueueUrl (or a handle from another queue) and nothing can actually be deleted. That silently acknowledges deletes for misrouted requests and can leave the original message in-flight or visible while consumers believe cleanup succeeded.


Comment thread adapter/sqs_messages.go Outdated
Comment on lines +488 to +492
if w < 0 {
w = 0
}
if w > sqsMaxReceiveMessageWaitSeconds {
w = sqsMaxReceiveMessageWaitSeconds

P2 Badge Return validation error for out-of-range WaitTimeSeconds

ReceiveMessage currently clamps invalid WaitTimeSeconds values into [0,20] instead of rejecting the request, so client-side bugs are silently masked and can change behavior (e.g., negative values become immediate polling, oversized values become long polls). For an AWS-compatible API this should be an explicit InvalidAttributeValue path, not implicit coercion.


Comment thread adapter/sqs_catalog.go
Comment on lines +332 to +338
"ContentBasedDeduplication": func(m *sqsQueueMeta, v string) error {
b, err := strconv.ParseBool(v)
if err != nil {
return newSQSAPIError(http.StatusBadRequest, sqsErrInvalidAttributeValue, "ContentBasedDeduplication must be a boolean")
}
m.ContentBasedDedup = b
return nil

P2 Badge Disallow ContentBasedDeduplication on non-FIFO queues

The ContentBasedDeduplication applier unconditionally sets meta.ContentBasedDedup without checking queue type, and this same path is used by both create and set-attributes flows. That lets standard queues accept a FIFO-only attribute and advertise unsupported behavior, which can mislead clients and diverges from expected SQS attribute validation.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 10

🧹 Nitpick comments (17)
kv/shard_key.go (1)

134-142: Precompute the constant return value.

sqsRouteKey always returns the same bytes (!sqs|route|global) whenever the prefix matches, yet currently allocates a fresh slice on every hit. On a route-heavy path this is avoidable garbage. Since callers of normalizeRouteKey don't mutate the returned slice (the Redis/Dynamo/S3 variants return sub-slices of the input), a shared package-level slice is safe.

♻️ Proposed refactor
 var (
 	...
 	sqsRoutePrefixBytes              = []byte(sqsRoutePrefix)
 	sqsInternalPrefixBytes           = []byte(sqsInternalPrefix)
+	sqsGlobalRouteKeyBytes           = []byte(sqsRoutePrefix + "global")
 )
 func sqsRouteKey(key []byte) []byte {
 	if !bytes.HasPrefix(key, sqsInternalPrefixBytes) {
 		return nil
 	}
-	out := make([]byte, 0, len(sqsRoutePrefixBytes)+len("global"))
-	out = append(out, sqsRoutePrefixBytes...)
-	out = append(out, []byte("global")...)
-	return out
+	return sqsGlobalRouteKeyBytes
 }
main_bootstrap_e2e_test.go (1)

343-343: Positional-empty-string call is getting brittle.

parseRuntimeConfig is now up to 11 positional string arguments, most of them "" here. Every new adapter (Redis, S3, Dynamo, SQS, …) forces mechanical churn at every call site and risks silent arg-order mistakes (swapping raftRedisMap for raftSqsMap would compile). Consider refactoring parseRuntimeConfig to accept an options struct in a follow-up so future adapters don't keep widening this call site.

adapter/leader_http_proxy.go (1)

57-57: Hardcoded http scheme assumes plain intra-cluster HTTP across proxies.

target := &url.URL{Scheme: "http", Host: targetAddr} locks leader proxying to plaintext. This same pattern appears in adapter/s3.go:2264 for S3 proxying. If the cluster adopts TLS between peers (adapter HTTPS listeners), these proxies will silently fail or connect incorrectly.

Consider refactoring both locations to accept the scheme as a parameter (or a fully-parsed *url.URL) to future-proof TLS support consistently across the codebase.

main_sqs.go (1)

49-65: Defer runDoneCancel() to make the waiter panic-safe.

If srv.Run() panics, runDoneCancel() is never invoked and the other goroutine blocks forever on <-runDoneCtx.Done() until ctx is cancelled. A defer keeps the cleanup guaranteed and the code slightly easier to follow.

♻️ Proposed tweak
 	eg.Go(func() error {
-		err := sqsServer.Run()
-		runDoneCancel()
+		defer runDoneCancel()
+		err := sqsServer.Run()
 		if err == nil || errors.Is(err, net.ErrClosed) {
 			return nil
 		}
 		return errors.WithStack(err)
 	})
adapter/sqs_test.go (1)

196-218: Consider listener-based readiness instead of time.Sleep.

time.Sleep(20 * time.Millisecond) to "give Serve a chance to enter its accept loop" can be flaky on overloaded CI. Since Run() returns nil on graceful shutdown even if Stop() is called before Serve() entered its loop, you can remove the sleep entirely and rely purely on the done channel + time.After guard.

♻️ Proposed minor simplification
 	done := make(chan error, 1)
 	go func() { done <- srv.Run() }()
-	// Give Serve a chance to enter its accept loop.
-	time.Sleep(20 * time.Millisecond)
 	srv.Stop()
adapter/test_util.go (1)

255-273: Consider dialing the SQS (and dynamo) listeners here too.

waitForNodeListeners only verifies the gRPC and Redis ports are dialable. Since SQS is now part of node startup, TestSQSServer_* tests that race against createNode return could theoretically hit a yet-to-accept listener. In practice lc.Listen has already bound, so connections queue rather than fail, but aligning this check with bindListeners (gRPC + Redis + Dynamo + SQS) removes the footgun and keeps the readiness contract complete.

♻️ Proposed change
 	for _, n := range nodes {
 		require.Eventually(t, func() bool {
-			conn, err := d.DialContext(ctx, "tcp", n.grpcAddress)
-			if err != nil {
-				return false
+			for _, addr := range []string{n.grpcAddress, n.redisAddress, n.dynamoAddress, n.sqsAddress} {
+				if addr == "" {
+					continue
+				}
+				conn, err := d.DialContext(ctx, "tcp", addr)
+				if err != nil {
+					return false
+				}
+				_ = conn.Close()
 			}
-			_ = conn.Close()
-			conn, err = d.DialContext(ctx, "tcp", n.redisAddress)
-			if err != nil {
-				return false
-			}
-			_ = conn.Close()
 			return true
 		}, waitTimeout, waitInterval)
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@adapter/test_util.go` around lines 255 - 273, waitForNodeListeners currently
only dials gRPC and Redis; add similar net.Dialer DialContext checks for each
node's SQS and Dynamo listeners so the readiness check matches bindListeners.
Update the loop in waitForNodeListeners to also attempt dialing n.sqsAddress and
n.dynamoAddress (using the same ctx, d.DialContext pattern and closing
connections) inside the require.Eventually closure so tests wait for all four
listeners (grpcAddress, redisAddress, sqsAddress, dynamoAddress) to be reachable
before proceeding.
adapter/sqs_messages_test.go (1)

380-452: Time-based long-poll assertions may flake on slow CI.

TestSQSServer_LongPollWaitsForArrival guards elapsed between 300ms and 3s, and TestSQSServer_LongPollTimesOutOnEmptyQueue between 900ms and 2500ms. These windows are generous enough in practice, but under -race on small runners the upper bound on "wait timed out promptly" can occasionally be exceeded. If this proves flaky, consider widening the upper bounds or gating these with a testing.Short() / environment escape hatch.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@adapter/sqs_messages_test.go` around lines 380 - 452, The time-window
assertions in TestSQSServer_LongPollWaitsForArrival and
TestSQSServer_LongPollTimesOutOnEmptyQueue are prone to flakiness under slow
CI/race builds; either widen the upper bounds on the elapsed checks (e.g. change
the 3s check in TestSQSServer_LongPollWaitsForArrival to 5s and the 2500ms check
in TestSQSServer_LongPollTimesOutOnEmptyQueue to 5s) or skip/short-circuit the
timing assertions when tests run in short/CI/race mode (use testing.Short() or
an env var check at the top of those tests) so the checks on elapsed do not fail
spuriously; update the comparisons that reference elapsed in both test functions
accordingly.
adapter/sqs_auth.go (1)

151-159: Nit: surface "body too large" separately from "IncompleteSignature".

If io.ReadAll(http.MaxBytesReader(...)) returns because the body exceeded sqsMaxRequestBodyBytes, the client currently sees a 403 with IncompleteSignature, which is misleading during client debugging (the signature is fine — the body just didn't fit). Consider distinguishing the limit-exceeded error and returning a 413 / RequestTooLarge-style code instead, while keeping IncompleteSignature for genuine read failures.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@adapter/sqs_auth.go` around lines 151 - 159, The handler drainAndHashSQSBody
should distinguish a MaxBytesReader overflow from other read failures: when
io.ReadAll on http.MaxBytesReader returns an error that indicates the request
body exceeded sqsMaxRequestBodyBytes (detect via errors.As with a
*http.MaxBytesError target; net/http has exported that error type since Go
1.19, and there is no http.ErrBodyTooLarge sentinel to errors.Is against),
return a 413 status and an sqsAuthError with Code "RequestTooLarge"
(instead of IncompleteSignature); for other read errors keep the existing 403 /
"IncompleteSignature" behavior. Update logic in drainAndHashSQSBody and reuse
sqsMaxRequestBodyBytes, http.MaxBytesReader and sqsAuthError symbols to locate
the code.
adapter/sigv4.go (1)

107-110: Remove redundant Header.Clone() after r.Clone().

http.Request.Clone already deep-copies Header via http.Header.Clone(), making the subsequent clone.Header = clone.Header.Clone() a no-op allocation. This pattern appears in both buildSigV4AuthorizationHeader (lines 107–110) and buildSigV4AuthorizationHeaderRestricted (lines 175–178).

♻️ Proposed cleanup
 	clone := r.Clone(context.Background())
 	clone.Host = r.Host
-	clone.Header = clone.Header.Clone()
 	clone.Header.Del("Authorization")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@adapter/sigv4.go` around lines 107 - 110, The code redundantly calls
clone.Header = clone.Header.Clone() after using r.Clone(); remove that no-op
allocation in both buildSigV4AuthorizationHeader and
buildSigV4AuthorizationHeaderRestricted so the request clone is built with clone
:= r.Clone(context.Background()), clone.Host = r.Host, and then call
clone.Header.Del("Authorization") (keep the Header.Del call) without re-cloning
the headers.
adapter/sqs_catalog.go (3)

646-657: filterByPrefix mutates the input slice via names[:0] aliasing.

out := names[:0] shares backing storage with names, so iterating names while appending into out works only because the loop reads names[i] before the filtered write to position <= i overwrites it. This is a well-known Go idiom but surprising for a general-purpose helper. Today's caller at Line 615 uses the returned slice immediately, so it's safe, but any future reuse that keeps names alive gets silently corrupted data.

Allocating a fresh slice (or renaming the helper to filterByPrefixInPlace) removes the footgun at negligible cost for the bounded list sizes involved.

Proposed change
 func filterByPrefix(names []string, prefix string) []string {
 	if prefix == "" {
 		return names
 	}
-	out := names[:0]
+	out := make([]string, 0, len(names))
 	for _, n := range names {
 		if strings.HasPrefix(n, prefix) {
 			out = append(out, n)
 		}
 	}
 	return out
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@adapter/sqs_catalog.go` around lines 646 - 657, The helper filterByPrefix
currently does in-place mutation via out := names[:0], which aliases names'
backing array and can silently corrupt the original slice; change filterByPrefix
to allocate a new slice for results (e.g., make([]string, 0, len(names)) or
similar) and append matching elements into that new slice, keeping the original
names unmodified, so callers of filterByPrefix receive a fresh slice and no
aliasing occurs.

209-226: queueNameFromURL doesn't validate the extracted name.

The extracted name (after account-id stripping) is handed to downstream loadQueueMetaAt unchecked. Names containing characters outside [a-zA-Z0-9_-]{1,80}(\.fifo)? cannot match any persisted queue, so the call fails cleanly with NonExistentQueue, but the error message is misleading ("queue does not exist" when the real issue is a malformed URL). Running validateQueueName here too would return a clearer InvalidParameterValue early and align with how getQueueUrl validates the bare QueueName at Line 719.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@adapter/sqs_catalog.go` around lines 209 - 226, queueNameFromURL extracts the
queue name but doesn't validate it, causing malformed names to surface as
NonExistentQueue downstream; after stripping the account-id prefix in
queueNameFromURL call validateQueueName(name) (same helper used by getQueueUrl)
and if it fails return a newSQSAPIError with the InvalidParameterValue error
code (and appropriate http.StatusBadRequest) so callers get an early
InvalidParameterValue instead of a misleading NonExistentQueue when the URL
contained an invalid queue name before calling loadQueueMetaAt.

678-711: scanQueueNames loads every queue name on each ListQueues call.

Every ListQueues request scans the entire !sqs|queue|meta| prefix end-to-end, sorts in memory, then slices by NextToken. For the 1000-queue AWS cap this is bounded, but it ignores MaxResults and NextToken during the scan itself — a 100-result page still reads all N records. Consider pushing the cursor into the scan: decode NextToken to a starting key, scan up to maxResults+1 records, and stop. That matches the AWS pagination cost model and keeps ListQueues O(page size) regardless of total queue count.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@adapter/sqs_catalog.go` around lines 678 - 711, The current scanQueueNames
implementation always scans the entire SqsQueueMetaPrefix range; change it to
honor pagination by decoding the provided NextToken into the initial start
cursor (if present) or falling back to the prefix, then scan at most
min(maxResults+1, sqsQueueScanPageLimit) records per loop and stop once you have
maxResults+1 items so you can set the next token; specifically, modify
scanQueueNames (and its caller) to accept maxResults and nextToken, use
queueNameFromMetaKey to extract names, use
nextScanCursorAfter(kvs[len(kvs)-1].Key) only if you need to continue, and
ensure you break early when you collected enough entries (using readTS as
before) rather than scanning the full keyspace.
adapter/sqs.go (1)

33-34: Naming inconsistency: sqsChangeMessageVisibilityBatchTgt vs. the *Target convention elsewhere.

Every other operation constant in this block ends in Target; only this one truncates to Tgt. Consider renaming to sqsChangeMessageVisibilityBatchTarget for consistency. Go doesn't care about line length here and golangci-lint won't complain — the asymmetry is just distracting when grepping.

Proposed rename
-	sqsChangeMessageVisibilityBatchTgt = sqsTargetPrefix + "ChangeMessageVisibilityBatch"
+	sqsChangeMessageVisibilityBatchTarget = sqsTargetPrefix + "ChangeMessageVisibilityBatch"

(and the single reference at Line 106)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@adapter/sqs.go` around lines 33 - 34, Rename the constant
sqsChangeMessageVisibilityBatchTgt to sqsChangeMessageVisibilityBatchTarget to
match the *Target naming convention used by the other SQS operation constants;
update the declaration and every usage (search for
sqsChangeMessageVisibilityBatchTgt and replace with
sqsChangeMessageVisibilityBatchTarget, including the single reference in the
ChangeMessageVisibilityBatch handling code) so identifiers remain consistent
across the file.
adapter/sqs_messages.go (2)

437-446: LeaseReadThrough fences only the first scan; subsequent long-poll iterations run without re-fencing.

The design (§7.3) frames empty long polls as "O(1) quorum operations, not O(ticks)" by relying on the lease staying warm across the whole wait window. But the lease TTL in docs/design/2026_04_20_implemented_lease_read.md §3.2 is ~700 ms, so a 20 s WaitTimeSeconds can straddle ~28 lease windows without any re-fence. If this node silently loses leadership partway through the poll, scanAndDeliverOnce continues reading local MVCC at the stale readTS taken per iteration; the OCC Dispatch will reject the write so no corruption occurs, but the receive could burn most of its wait window doing useless scans on a now-follower.

Consider calling kv.LeaseReadThrough (or LeaseReadForKey keyed on the queue route) at the top of each scanAndDeliverOnce, or at least on every K-th iteration, so lease expiry surfaces as a retryable 5xx instead of hanging on a dead leader.

Also applies to: 508-536

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@adapter/sqs_messages.go` around lines 437 - 446, The current LeaseReadThrough
call only fences the first long-poll iteration, so subsequent scanAndDeliverOnce
loops can hang on a node that lost leadership; move the lease check into
scanAndDeliverOnce (or invoke kv.LeaseReadThrough/kv.LeaseReadForKey keyed by
the queue route at the top of each scanAndDeliverOnce) or at least call it every
K iterations tied to WaitTimeSeconds, and if it fails return a retryable 5xx via
writeSQSErrorFromErr so lease expiry is surfaced promptly instead of allowing
wasted local scans.

620-691: Silent false returns conflate races with infrastructure failures in tryDeliverCandidate.

newReceiptToken (Line 638), decodeSQSMessageRecord (Line 633), encodeSQSMessageRecord (Line 650), and encodeReceiptHandle (Line 676) all return (nil, false) on error, identical to the "concurrent rotation, skip candidate" path at Line 672. The comment on rotateMessagesForDelivery (Line 592-596) says failures are skipped because AWS allows partial batches, but skipping a crypto/rand failure or a JSON-marshal failure masks real bugs — the receive will look like the queue is empty to the caller.

Consider propagating these as errors so scanAndDeliverOnce surfaces them as a 5xx, while keeping the "race/write-conflict → skip" behavior that actually matches AWS semantics.

Sketch
// tryDeliverCandidate returns (msg, true, nil) on delivery,
// (nil, false, nil) on a benign race, and (nil, false, err) on a
// hard failure the caller should surface.
func (s *SQSServer) tryDeliverCandidate(...) (map[string]any, bool, error) {
    ...
    newToken, err := newReceiptToken()
    if err != nil {
        return nil, false, errors.WithStack(err)
    }
    ...
    if _, err := s.coordinator.Dispatch(ctx, req); err != nil {
        // write-conflict / concurrent rotation: benign skip
        return nil, false, nil
    }
    ...
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@adapter/sqs_messages.go` around lines 620 - 691, Change tryDeliverCandidate
to return (map[string]any, bool, error) and propagate hard failures from
decodeSQSMessageRecord, newReceiptToken, encodeSQSMessageRecord, and
encodeReceiptHandle as non-nil errors so they surface as 5xx; keep the
benign-skip behavior for concurrent rotation/write-conflict by treating
coordinator.Dispatch errors that indicate an OCC/write conflict (e.g.,
kv.ErrWriteConflict or whatever sentinel your coordinator uses) as (nil, false,
nil) but return (nil, false, err) for other Dispatch errors. Update callers
(e.g., scanAndDeliverOnce / rotateMessagesForDelivery) to handle the new error
return and surface hard errors to the client. Ensure you reference the symbols:
tryDeliverCandidate, newReceiptToken, decodeSQSMessageRecord,
encodeSQSMessageRecord, encodeReceiptHandle, and coordinator.Dispatch when
making these changes.
adapter/sqs_catalog_test.go (2)

217-223: Consider asserting the error envelope on the second DeleteQueue call as well.

The first post-delete GetQueueUrl check validates both the 400 status and __type = NonExistentQueue, but the second DeleteQueue only checks status. Asserting out["__type"] == sqsErrQueueDoesNotExist here would catch regressions where deleteQueue starts returning a different error code (e.g., InternalFailure) while still returning 400.

Proposed addition
-	// DeleteQueue on an unknown queue also returns NonExistentQueue.
-	status, _ = callSQS(t, node, sqsDeleteQueueTarget, map[string]any{
-		"QueueUrl": url,
-	})
-	if status != http.StatusBadRequest {
-		t.Fatalf("second delete: got %d want 400", status)
-	}
+	// DeleteQueue on an unknown queue also returns NonExistentQueue.
+	status, out = callSQS(t, node, sqsDeleteQueueTarget, map[string]any{
+		"QueueUrl": url,
+	})
+	if status != http.StatusBadRequest {
+		t.Fatalf("second delete: got %d want 400; body %v", status, out)
+	}
+	if got, _ := out["__type"].(string); got != sqsErrQueueDoesNotExist {
+		t.Fatalf("second delete error type: got %q want %q", got, sqsErrQueueDoesNotExist)
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@adapter/sqs_catalog_test.go` around lines 217 - 223, The second DeleteQueue
call currently only asserts the HTTP status; update the test to capture the
response envelope from callSQS (instead of discarding it) and assert that
out["__type"] == sqsErrQueueDoesNotExist (the same error constant used earlier)
so the test verifies both status and the specific NonExistentQueue error for the
second delete; locate the call using callSQS, sqsDeleteQueueTarget and the
variables status/out in the same test.

269-294: Consider a table-driven form for the key-encoding round-trips.

The segment round-trip loop is already inline-table-shaped. Extracting a []struct{name, input string} (or just keeping the current []string) with explicit t.Run(name, ...) subtests would give you localized failure output when a single input fails, and matches the project convention of preferring table-driven tests.

As per coding guidelines, "Unit tests must be co-located with packages in *_test.go files; prefer table-driven test cases".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@adapter/sqs_catalog_test.go` around lines 269 - 294, Replace the inline loop
in TestSQSServer_CatalogKeyEncoding with a table-driven subtest structure:
create a []struct{ name, input string } (or []string) of test cases and call
t.Run for each case to run the encodeSQSSegment/decodeSQSSegment round-trip
assertions so failures are localized; keep t.Parallel() semantics by either
calling it in the parent test or inside each subtest as appropriate. Retain the
existing checks for queueNameFromMetaKey(sqsQueueMetaKey(...)) and the
unknown-prefix rejection; just move the segment round-trip assertions into the
new table-driven t.Run subtests referencing encodeSQSSegment and
decodeSQSSegment.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@adapter/sqs_catalog.go`:
- Around line 121-128: writeSQSErrorFromErr currently returns err.Error() on the
500 path which can leak internal details; instead, use slog to log the full
error (structured, e.g. slog.Error("sqs internal error", "error", err)) and call
writeSQSError with http.StatusInternalServerError, sqsErrInternalFailure and a
generic message like "internal failure" (keep the sqsAPIError branch unchanged).
Ensure you reference writeSQSErrorFromErr, sqsAPIError, writeSQSError and
sqsErrInternalFailure when making the change.

In `@adapter/sqs_messages_test.go`:
- Around line 391-397: The scheduled send using time.AfterFunc (sendAt) calls
callSQS which may call t.Fatalf from the timer goroutine—avoid running
test-failing assertions off the main test goroutine. Change the callback so it
does not call t.Fatalf: have the timer goroutine capture errors from callSQS and
either swallow/log them (t.Logf) or send the result on a channel/WaitGroup that
the main test goroutine waits on before returning; ensure the test defers
waiting for that completion instead of relying only on defer sendAt.Stop().
Target symbols: sendAt, time.AfterFunc, callSQS, and the defer sendAt.Stop()
usage.

In `@adapter/sqs_messages.go`:
- Around line 266-270: sqsChangeVisibilityInput currently uses a non-pointer
int64 for VisibilityTimeout so a missing field decodes to 0; change
VisibilityTimeout to *int64 in the sqsChangeVisibilityInput struct (matching
sqsReceiveMessageInput) and update changeMessageVisibility to explicitly check
for nil and return a MissingParameter error when VisibilityTimeout is nil, while
still allowing a decoded 0 value if the client explicitly sent 0; also update
any other places that construct or decode sqsChangeVisibilityInput (e.g., the
other handler around the similar code block) to handle the pointer type.
- Around line 915-937: md5OfAttributesHex currently returns a non-canonical MD5
which breaks SDKs; as a short-term fix, update the SendMessage request handling
to explicitly reject any request with non-empty MessageAttributes by returning
an InvalidParameterValue error (so callers get a clear failure), and leave
md5OfAttributesHex unchanged for now; locate where SendMessage is processed (the
SendMessage handler/function) and add a check that if the incoming message's
MessageAttributes map is non-empty, return an InvalidParameterValue response
explaining message attributes are not supported yet, referencing
md5OfAttributesHex in a TODO to implement AWS canonical MD5 later.

In `@adapter/sqs.go`:
- Around line 54-59: The constant sqsErrMalformedRequest currently set to
"MalformedQueryString" is being used for JSON-1.0 errors; add a new constant
(e.g. sqsErrSerialization = "SerializationException") and keep
sqsErrMalformedRequest for Query-protocol only, then update error returns: in
the HTTP Method Not Allowed handler (where you currently return
sqsErrMalformedRequest) return sqsErrInvalidAction instead, and in
adapter/sqs_catalog.go replace uses of sqsErrMalformedRequest for JSON
unmarshal/body-read failures with sqsErrSerialization so JSON-1.0 responses use
"SerializationException".

In `@docs/design/2026_04_24_proposed_sqs_compatible_adapter.md`:
- Around line 307-320: The design doc describes a commit-stream notifier and
leader-transition bootstrap that the current codebase does not implement (the
code uses a polling long-poll loop in sqs_messages.go instead of a
leader-transition hook), so update the doc and code annotations: add a
one-sentence note in §7.3 stating that Milestone 1 uses the polling fallback
(e.g., 200 ms poll) and that the commit-stream notifier and failover bootstrap
will land in Milestone 2, and also add a brief inline comment in sqs_messages.go
near the polling long-poll loop to reference the planned
RegisterLeaderLossCallback / leader-transition bootstrap to avoid confusion
until the bootstrap is implemented.
- Around line 470-476: The docs describe a send-age index key pattern
"!sqs|msg|byage|<queue-esc><gen-u64><send_timestamp-u64><message-id-esc>" but
the send path in the PR doesn't emit that index; either (A) update the design
text to explicitly mark the byage index and its write as "deferred to Phase 2"
in §11.1 so readers won't expect it in the current code, or (B) implement the
byage write in the send path so every send creates the byage entry alongside the
data and visibility entries (ensure the same unique key pattern is used and that
ReceiveMessage logic treats expired byage entries as reaping candidates). Choose
one option and update the doc or code accordingly.
- Around line 175-185: The doc's timestamp suffixes (_hlc) don't match the
implementation which uses wall-clock millis in adapter/sqs_messages.go (fields
SendTimestampMillis, AvailableAtMillis, VisibleAtMillis, FirstReceiveMillis);
update the design text in §5.3 and §11.4 to describe the wall-clock millis
choice and the single-leader invariant that makes it safe (rename the listed
fields from *_hlc to *_millis and explicitly state time.Now().UnixMilli() is
used), or if you prefer option (a) instead convert the implementation to HLC
timestamps across the same fields and ensure all mutating paths use the cluster
HLC—pick one approach and make the doc and code consistent.

In `@kv/shard_key.go`:
- Around line 27-39: The doc comments currently reference exported names but the
actual identifiers are unexported; update the comments so they begin with the
exact declared identifiers "sqsRoutePrefix" and "sqsInternalPrefix" (e.g.,
"sqsRoutePrefix is the logical route prefix..." and "sqsInternalPrefix is the
shared prefix...") following Go godoc conventions and keep the rest of the
explanatory text unchanged.

In `@main_sigv4_creds.go`:
- Around line 31-43: In loadSigV4StaticCredentialsFile, wrap the errors from
os.Open and json.NewDecoder(...).Decode so they include the provided
labelForErrors context (e.g. "s3"/"sqs") before returning; update the error
returns in the os.Open error branch and the JSON decode branch to include
labelForErrors (preserving the stack via errors.WithStack or equivalent) so
startup/misconfiguration failures show which adapter label caused the bad path
or malformed JSON.

---

Nitpick comments:
In `@adapter/leader_http_proxy.go`:
- Line 57: The code hardcodes the URL scheme when constructing the proxy target
(target := &url.URL{Scheme: "http", Host: targetAddr}), which breaks TLS-capable
deployments; change the proxy setup to accept the scheme or a full parsed
*url.URL from the caller instead of hardcoding "http" (update the function that
builds/returns target and any callers), and mirror the same refactor in the
similar S3 proxy site (adapter/s3.go around the target construction at the noted
location) so both leader proxying and S3 proxying can use "https" when
configured.

In `@adapter/sigv4.go`:
- Around line 107-110: The code redundantly calls clone.Header =
clone.Header.Clone() after using r.Clone(); remove that no-op allocation in both
buildSigV4AuthorizationHeader and buildSigV4AuthorizationHeaderRestricted so the
request clone is built with clone := r.Clone(context.Background()), clone.Host =
r.Host, and then call clone.Header.Del("Authorization") (keep the Header.Del
call) without re-cloning the headers.

In `@adapter/sqs_auth.go`:
- Around line 151-159: The handler drainAndHashSQSBody should distinguish a
MaxBytesReader overflow from other read failures: when io.ReadAll on
http.MaxBytesReader returns an error that indicates the request body exceeded
sqsMaxRequestBodyBytes (detect via errors.As with a *http.MaxBytesError
target; net/http exports that type since Go 1.19 and defines no
http.ErrBodyTooLarge sentinel), return a 413 status and an
sqsAuthError with Code "RequestTooLarge" (instead of IncompleteSignature); for
other read errors keep the existing 403 / "IncompleteSignature" behavior. Update
logic in drainAndHashSQSBody and reuse sqsMaxRequestBodyBytes,
http.MaxBytesReader and sqsAuthError symbols to locate the code.

In `@adapter/sqs_catalog_test.go`:
- Around line 217-223: The second DeleteQueue call currently only asserts the
HTTP status; update the test to capture the response envelope from callSQS
(instead of discarding it) and assert that out["__type"] ==
sqsErrQueueDoesNotExist (the same error constant used earlier) so the test
verifies both status and the specific NonExistentQueue error for the second
delete; locate the call using callSQS, sqsDeleteQueueTarget and the variables
status/out in the same test.
- Around line 269-294: Replace the inline loop in
TestSQSServer_CatalogKeyEncoding with a table-driven subtest structure: create a
[]struct{ name, input string } (or []string) of test cases and call t.Run for
each case to run the encodeSQSSegment/decodeSQSSegment round-trip assertions so
failures are localized; keep t.Parallel() semantics by either calling it in the
parent test or inside each subtest as appropriate. Retain the existing checks
for queueNameFromMetaKey(sqsQueueMetaKey(...)) and the unknown-prefix rejection;
just move the segment round-trip assertions into the new table-driven t.Run
subtests referencing encodeSQSSegment and decodeSQSSegment.

In `@adapter/sqs_catalog.go`:
- Around line 646-657: The helper filterByPrefix currently does in-place
mutation via out := names[:0], which aliases names' backing array and can
silently corrupt the original slice; change filterByPrefix to allocate a new
slice for results (e.g., make([]string, 0, len(names)) or similar) and append
matching elements into that new slice, keeping the original names unmodified, so
callers of filterByPrefix receive a fresh slice and no aliasing occurs.
- Around line 209-226: queueNameFromURL extracts the queue name but doesn't
validate it, causing malformed names to surface as NonExistentQueue downstream;
after stripping the account-id prefix in queueNameFromURL call
validateQueueName(name) (same helper used by getQueueUrl) and if it fails return
a newSQSAPIError with the InvalidParameterValue error code (and appropriate
http.StatusBadRequest) so callers get an early InvalidParameterValue instead of
a misleading NonExistentQueue when the URL contained an invalid queue name
before calling loadQueueMetaAt.
- Around line 678-711: The current scanQueueNames implementation always scans
the entire SqsQueueMetaPrefix range; change it to honor pagination by decoding
the provided NextToken into the initial start cursor (if present) or falling
back to the prefix, then scan at most min(maxResults+1, sqsQueueScanPageLimit)
records per loop and stop once you have maxResults+1 items so you can set the
next token; specifically, modify scanQueueNames (and its caller) to accept
maxResults and nextToken, use queueNameFromMetaKey to extract names, use
nextScanCursorAfter(kvs[len(kvs)-1].Key) only if you need to continue, and
ensure you break early when you collected enough entries (using readTS as
before) rather than scanning the full keyspace.

In `@adapter/sqs_messages_test.go`:
- Around line 380-452: The time-window assertions in
TestSQSServer_LongPollWaitsForArrival and
TestSQSServer_LongPollTimesOutOnEmptyQueue are prone to flakiness under slow
CI/race builds; either widen the upper bounds on the elapsed checks (e.g. change
the 3s check in TestSQSServer_LongPollWaitsForArrival to 5s and the 2500ms check
in TestSQSServer_LongPollTimesOutOnEmptyQueue to 5s) or skip/short-circuit the
timing assertions when tests run in short/CI/race mode (use testing.Short() or
an env var check at the top of those tests) so the checks on elapsed do not fail
spuriously; update the comparisons that reference elapsed in both test functions
accordingly.

In `@adapter/sqs_messages.go`:
- Around line 437-446: The current LeaseReadThrough call only fences the first
long-poll iteration, so subsequent scanAndDeliverOnce loops can hang on a node
that lost leadership; move the lease check into scanAndDeliverOnce (or invoke
kv.LeaseReadThrough/kv.LeaseReadForKey keyed by the queue route at the top of
each scanAndDeliverOnce) or at least call it every K iterations tied to
WaitTimeSeconds, and if it fails return a retryable 5xx via writeSQSErrorFromErr
so lease expiry is surfaced promptly instead of allowing wasted local scans.
- Around line 620-691: Change tryDeliverCandidate to return (map[string]any,
bool, error) and propagate hard failures from decodeSQSMessageRecord,
newReceiptToken, encodeSQSMessageRecord, and encodeReceiptHandle as non-nil
errors so they surface as 5xx; keep the benign-skip behavior for concurrent
rotation/write-conflict by treating coordinator.Dispatch errors that indicate an
OCC/write conflict (e.g., kv.ErrWriteConflict or whatever sentinel your
coordinator uses) as (nil, false, nil) but return (nil, false, err) for other
Dispatch errors. Update callers (e.g., scanAndDeliverOnce /
rotateMessagesForDelivery) to handle the new error return and surface hard
errors to the client. Ensure you reference the symbols: tryDeliverCandidate,
newReceiptToken, decodeSQSMessageRecord, encodeSQSMessageRecord,
encodeReceiptHandle, and coordinator.Dispatch when making these changes.

In `@adapter/sqs_test.go`:
- Around line 196-218: The test TestSQSServer_StopShutsDown uses time.Sleep(20 *
time.Millisecond) to wait for srv.Run() to enter its accept loop which is flaky;
remove that sleep and instead rely on the done channel (reading from the
goroutine that runs srv.Run()) combined with the existing time.After timeout to
observe completion after calling srv.Stop(); keep the existing goroutine that
does done <- srv.Run(), call srv.Stop() immediately, and let the select on done
/ time.After assert behavior.

In `@adapter/sqs.go`:
- Around line 33-34: Rename the constant sqsChangeMessageVisibilityBatchTgt to
sqsChangeMessageVisibilityBatchTarget to match the *Target naming convention
used by the other SQS operation constants; update the declaration and every
usage (search for sqsChangeMessageVisibilityBatchTgt and replace with
sqsChangeMessageVisibilityBatchTarget, including the single reference in the
ChangeMessageVisibilityBatch handling code) so identifiers remain consistent
across the file.

In `@adapter/test_util.go`:
- Around line 255-273: waitForNodeListeners currently only dials gRPC and Redis;
add similar net.Dialer DialContext checks for each node's SQS and Dynamo
listeners so the readiness check matches bindListeners. Update the loop in
waitForNodeListeners to also attempt dialing n.sqsAddress and n.dynamoAddress
(using the same ctx, d.DialContext pattern and closing connections) inside the
require.Eventually closure so tests wait for all four listeners (grpcAddress,
redisAddress, sqsAddress, dynamoAddress) to be reachable before proceeding.

In `@kv/shard_key.go`:
- Around line 134-142: sqsRouteKey currently allocates a new slice for the
constant value every call; replace that by precomputing the constant return
bytes as a package-level variable (e.g., sqsRouteKeyConst or similar)
constructed from sqsRoutePrefixBytes + "global" and return that shared slice
when bytes.HasPrefix(key, sqsInternalPrefixBytes) is true; update sqsRouteKey to
return the precomputed slice and keep callers like normalizeRouteKey behavior
the same (ensure callers do not mutate the returned slice).
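The allocation-free version amounts to hoisting the concatenation of sqsRoutePrefixBytes + "global" into a package-level variable; this sketch inlines the resulting literal (variable names are illustrative):

```go
package main

import (
	"bytes"
	"fmt"
)

var (
	sqsInternalPrefixBytes = []byte("!sqs|")
	// sqsGlobalRouteKey is the precomputed route (sqsRoutePrefix + "global"),
	// built once instead of allocated on every call. Callers must treat the
	// returned slice as read-only.
	sqsGlobalRouteKey = []byte("!sqs|route|global")
)

// sqsRouteKey maps every SQS-owned key to the single shared route so the
// distribution engine colocates all SQS storage on one Raft group.
func sqsRouteKey(key []byte) ([]byte, bool) {
	if bytes.HasPrefix(key, sqsInternalPrefixBytes) {
		return sqsGlobalRouteKey, true
	}
	return nil, false // not an SQS key; caller routes it normally
}

func main() {
	route, ok := sqsRouteKey([]byte("!sqs|msg|vis|orders"))
	fmt.Println(string(route), ok) // !sqs|route|global true
}
```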

In `@main_bootstrap_e2e_test.go`:
- Line 343: The call to parseRuntimeConfig is brittle due to 11 positional
string args; refactor parseRuntimeConfig to accept a single options struct
(e.g., RuntimeConfigOptions) or functional option type instead of many
positional strings, update its signature and internal usage accordingly, then
change this call site (and other callers) to pass a clear struct literal with
named fields for raftAddr and redisAddr (e.g., RuntimeConfigOptions{RaftAddr:
ep.raftAddr, RedisAddr: ep.redisAddr}) so future adapters can add fields without
breaking existing call sites.
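The options-struct shape looks like this; the field set and return value are illustrative, not the real parseRuntimeConfig signature:

```go
package main

import "fmt"

// RuntimeConfigOptions replaces 11 positional string arguments with named
// fields; future adapters can add fields without touching existing call
// sites, and zero values mean "not configured".
type RuntimeConfigOptions struct {
	RaftAddr   string
	RedisAddr  string
	SQSAddr    string
	DynamoAddr string
}

// parseRuntimeConfig is a sketch: the real function would build the node's
// runtime config; here it just renders the named fields.
func parseRuntimeConfig(o RuntimeConfigOptions) string {
	return fmt.Sprintf("raft=%s redis=%s sqs=%s dynamo=%s",
		o.RaftAddr, o.RedisAddr, o.SQSAddr, o.DynamoAddr)
}

func main() {
	// The call site names only the fields it cares about.
	cfg := parseRuntimeConfig(RuntimeConfigOptions{
		RaftAddr:  "127.0.0.1:7000",
		RedisAddr: "127.0.0.1:6379",
	})
	fmt.Println(cfg)
}
```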

In `@main_sqs.go`:
- Around line 49-65: The goroutine that calls sqsServer.Run() should defer
runDoneCancel() so cleanup happens even if Run panics; inside the eg.Go
anonymous func that currently does "err := sqsServer.Run(); runDoneCancel();
...", move the runDoneCancel() into a defer at the start of that func (and
remove the explicit post-Run call) so runDoneCancel() runs on normal return,
error paths, or panics; reference runDoneCtx/runDoneCancel, the eg.Go anonymous
function wrapping sqsServer.Run(), and the existing error handling that checks
net.ErrClosed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8d96e36b-588c-4b00-aeff-dacc84012030

📥 Commits

Reviewing files that changed from the base of the PR and between c615838 and 0d69c90.

📒 Files selected for processing (24)
  • adapter/dynamodb.go
  • adapter/leader_http_proxy.go
  • adapter/s3_auth.go
  • adapter/s3_test.go
  • adapter/sigv4.go
  • adapter/sqs.go
  • adapter/sqs_auth.go
  • adapter/sqs_auth_test.go
  • adapter/sqs_catalog.go
  • adapter/sqs_catalog_test.go
  • adapter/sqs_keys.go
  • adapter/sqs_messages.go
  • adapter/sqs_messages_test.go
  • adapter/sqs_test.go
  • adapter/test_util.go
  • docs/design/2026_04_24_proposed_sqs_compatible_adapter.md
  • docs/design/README.md
  • kv/shard_key.go
  • main.go
  • main_bootstrap_e2e_test.go
  • main_s3.go
  • main_sigv4_creds.go
  • main_sqs.go
  • shard_config.go
✅ Files skipped from review due to trivial changes (2)
  • adapter/s3_test.go
  • docs/design/README.md

Comment thread adapter/sqs_catalog.go
Comment thread adapter/sqs_messages_test.go Outdated
Comment thread adapter/sqs_messages.go
Comment thread adapter/sqs_messages.go
Comment thread adapter/sqs.go
Comment on lines +54 to +59
sqsErrInvalidAction = "InvalidAction"
sqsErrNotImplemented = "NotImplemented"
sqsErrInternalFailure = "InternalFailure"
sqsErrServiceUnavailable = "ServiceUnavailable"
sqsErrMalformedRequest = "MalformedQueryString"
)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

AWS SQS JSON-1.0 protocol error code for malformed request body and method not allowed

💡 Result:

AWS SQS using the JSON-1.0 protocol (application/x-amz-json-1.0 Content-Type) does not use fixed numeric "error codes" like some protocols. Instead, errors are identified by specific exception names in the JSON response body's "type" field (e.g., "com.amazonaws.sqs#QueueDoesNotExist") and described in the "message" field. The HTTP status code is typically 400 for client errors.

For a malformed request body (invalid JSON or invalid structure/parameters):
  • "SerializationException": the body payload isn't valid JSON (e.g., missing/extra comma or brace). HTTP 400. Seen in API Gateway integrations with the SQS JSON protocol.
  • "ValidationError": input fails constraints. HTTP 400 (common AWS error).
  • "InvalidMessageContents": SendMessage-specific, when the message body has disallowed characters. HTTP 400.

For method not allowed (e.g., GET instead of POST):
  • "InvalidAction": the action/operation is invalid (can apply if the method mismatch affects action parsing). HTTP 400.
  • HTTP 405 may occur if the HTTP method is unsupported (SQS JSON requires POST), but documented errors use 400 with exception names. "UnknownOperationException" signals a missing or incorrect X-Amz-Target header.

Error responses look like:

HTTP/1.1 400 Bad Request
x-amzn-RequestId: ...
Content-Type: application/x-amz-json-1.0
{ "type": "com.amazonaws.sqs#SerializationException", "message": "Invalid JSON..." }

There is no single "JSON-1.0 protocol error code"; exceptions are service-specific. Check the per-API docs (e.g., SendMessage lists InvalidMessageContents). AWS SDKs handle named exceptions automatically.


🏁 Script executed:

# Examine the constant definition and usage in adapter/sqs.go
cd adapter && head -60 sqs.go | tail -10

Repository: bootjp/elastickv

Length of output: 435


🏁 Script executed:

# Check lines 143-146 in sqs.go for the usage context
sed -n '140,150p' adapter/sqs.go

Repository: bootjp/elastickv

Length of output: 394


🏁 Script executed:

# Check the sqs_catalog.go for the JSON input decoding mentioned
if [ -f adapter/sqs_catalog.go ]; then
  sed -n '168,180p' adapter/sqs_catalog.go
fi

Repository: bootjp/elastickv

Length of output: 575


🏁 Script executed:

# Get more context on writeSQSError to see what error format is returned
sed -n '250,270p' adapter/sqs.go

Repository: bootjp/elastickv

Length of output: 600


MalformedQueryString is a Query-protocol code, but it is being used here for JSON-1.0 protocol errors.

sqsErrMalformedRequest = "MalformedQueryString" is the SQS Query-protocol error code for malformed application/x-www-form-urlencoded bodies. However, it is used for JSON-1.0 protocol errors in this codebase:

  • Line 145: HTTP 405 Method Not Allowed → should use InvalidAction
  • adapter/sqs_catalog.go lines 172, 178: JSON unmarshal and body read errors → should use SerializationException

AWS SQS JSON-1.0 protocol uses distinct exception names in the __type response field: SerializationException for malformed JSON bodies and InvalidAction for unsupported operations. Returning MalformedQueryString for these cases is semantically incorrect.

This does not actively break SDKs since they key off the HTTP status code and x-amzn-ErrorType header first, but aligning with AWS JSON-1.0 semantics would improve compatibility and clarity.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@adapter/sqs.go` around lines 54 - 59, The constant sqsErrMalformedRequest
currently set to "MalformedQueryString" is being used for JSON-1.0 errors; add a
new constant (e.g. sqsErrSerialization = "SerializationException") and keep
sqsErrMalformedRequest for Query-protocol only, then update error returns: in
the HTTP Method Not Allowed handler (where you currently return
sqsErrMalformedRequest) return sqsErrInvalidAction instead, and in
adapter/sqs_catalog.go replace uses of sqsErrMalformedRequest for JSON
unmarshal/body-read failures with sqsErrSerialization so JSON-1.0 responses use
"SerializationException".
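The constant split plus the dispatch the prompt describes can be sketched like this (errorCodeFor is a hypothetical helper condensing the two call sites; the constant names follow the suggestion):

```go
package main

import "fmt"

const (
	sqsErrInvalidAction    = "InvalidAction"
	sqsErrSerialization    = "SerializationException" // JSON-1.0 malformed body
	sqsErrMalformedRequest = "MalformedQueryString"   // Query protocol only
)

// errorCodeFor picks the JSON-1.0 exception name for the two cases the
// review flagged: a rejected HTTP method and a JSON decode failure.
func errorCodeFor(methodNotAllowed, jsonDecodeFailed bool) string {
	switch {
	case methodNotAllowed:
		return sqsErrInvalidAction // was sqsErrMalformedRequest
	case jsonDecodeFailed:
		return sqsErrSerialization // was sqsErrMalformedRequest
	default:
		return sqsErrMalformedRequest // reserved for the Query protocol
	}
}

func main() {
	fmt.Println(errorCodeFor(true, false)) // InvalidAction
	fmt.Println(errorCodeFor(false, true)) // SerializationException
}
```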

Comment on lines +175 to +185
5. `message_attributes` (user-defined)
6. `message_system_attributes`
7. `send_timestamp_hlc`
8. `available_at_hlc` (send time + `DelaySeconds`)
9. `visible_at_hlc` (current visibility deadline; equals `available_at_hlc` while the message is available)
10. `receive_count`
11. `first_receive_timestamp_hlc`
12. `current_receipt_token` (opaque 16-byte random token; rotated on each receive)
13. `message_group_id` (FIFO only)
14. `message_deduplication_id` (FIFO only)
15. `sequence_number` (FIFO only, monotonic per queue)

⚠️ Potential issue | 🟡 Minor

Timestamp suffixes in the design (_hlc) diverge from the implementation's _millis wall-clock fields.

The design specifies HLC-based timestamps throughout §5.3 (send_timestamp_hlc, available_at_hlc, visible_at_hlc, first_receive_timestamp_hlc), but adapter/sqs_messages.go persists wall-clock millis (SendTimestampMillis, AvailableAtMillis, VisibleAtMillis, FirstReceiveMillis) via time.Now().UnixMilli(). Because every mutation currently routes through a single leader, wall-clock is fine for correctness, but the doc's §11.4 clock-skew argument ("visibility timeouts use the cluster HLC, not wall clock") does not match the code as-of this PR.

Consider either (a) switching the fields to HLC to match the design, or (b) updating §5.3 and §11.4 to describe the wall-clock choice and the single-leader invariant it relies on. Milestone-1 docs typically land option (b) and leave (a) for a follow-up.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/2026_04_24_proposed_sqs_compatible_adapter.md` around lines 175 -
185, The doc's timestamp suffixes (_hlc) don't match the implementation which
uses wall-clock millis in adapter/sqs_messages.go (fields SendTimestampMillis,
AvailableAtMillis, VisibleAtMillis, FirstReceiveMillis); update the design text
in §5.3 and §11.4 to describe the wall-clock millis choice and the single-leader
invariant that makes it safe (rename the listed fields from *_hlc to *_millis
and explicitly state time.Now().UnixMilli() is used), or if you prefer option
(a) instead convert the implementation to HLC timestamps across the same fields
and ensure all mutating paths use the cluster HLC—pick one approach and make the
doc and code consistent.

Comment on lines +307 to +320
Details:

1. Fence the snapshot scan with `coordinator.LeaseReadForKey(ctx, queueRoute)`. While the lease is valid (≤ `electionTimeout − leaseSafetyMargin`, currently 700 ms per `docs/design/2026_04_20_implemented_lease_read.md §3.2`) this returns immediately with no ReadIndex round-trip. Under sustained receive load this amortizes quorum confirmation across every receive in the lease window instead of paying one ReadIndex per call.
2. Snapshot-scan the visibility index for keys with `visible_at ≤ now`. Both Standard and FIFO queues read in page-sized chunks (page limit 1024) and keep paging forward (resuming from the last key returned) until `MaxNumberOfMessages` candidates survive the per-queue filters, or the end-of-range sentinel `visible_at = now+1` is reached. A small fixed-size first page (e.g., `2 * k`) is adequate for the common case, but the scan **must not** hard-cap at any fixed multiple of `k`, because the per-queue filters can legitimately hide an unbounded visible prefix:
- FIFO queues hide every message whose `MessageGroupId` already holds an in-flight record (group lock).
- Both queue types hide messages that have reached `redrive_policy.maxReceiveCount` — those are moved to the DLQ inside the same receive transaction (see step 6) and **not** returned to the caller, so a poison-message backlog can mask deliverable messages that sit further along in the index.
A soft wall-clock budget (e.g., 100 ms) caps the overall scan latency when the queue is pathologically group-skewed or DLQ-saturated; the receive then returns whatever survived the filters.
3. Filter candidates. FIFO: skip any message whose group lock is held by another message; acquire the group lock on delivery. Standard + FIFO: skip (and DLQ-redrive in step 6) any message whose `receive_count` would exceed `redrive_policy.maxReceiveCount`. Filter-rejected candidates must **not** consume the `MaxNumberOfMessages` budget.
4. Run one OCC transaction per batch: for each candidate, verify the visibility entry still matches, rotate receipt token, set `visible_at_hlc = now + effective_timeout`, insert the new visibility entry, bump `receive_count`, **set `approximate_first_receive_timestamp` only on the first delivery** (subsequent redeliveries must preserve the original value — it is defined as the time the message was first received), and acquire FIFO group lock if applicable. A successful `Propose` also refreshes the lease (per `§3.3` of the lease-read doc), so a busy queue keeps its fast path warm without any extra work.
5. If the batch is empty and `WaitTimeSeconds > 0`, register a waiter on the queue's in-process condition variable and re-run step 1 on wake or timeout. Do not poll in a tight loop. On re-scan, the `LeaseRead` fence from step 1 stays local for the whole long-poll window — empty long polls cost O(1) quorum operations, not O(ticks).
6. If `receive_count > redrive_policy.maxReceiveCount`, move the message to the DLQ (see §7.6) atomically within the receive transaction and do not return it to the caller.

Lease invalidation handling: if the leader loses quorum during a long poll, `RegisterLeaderLossCallback` fires, the lease is cleared, and the next `LeaseReadForKey` falls back to `LinearizableRead`. That call fails if no quorum is available, and the receive returns a retryable error rather than quietly serving stale data.


⚠️ Potential issue | 🟡 Minor

§7.2 bootstrap for long-poll notifier assumes state this PR doesn't implement.

§7.3's leader-failover bootstrap (re-populate per-queue timer wheels from a scan of !sqs|msg|vis| on non-Leader → Leader transitions) is a prerequisite for the "no wait-time inflation after failover" guarantee. The implementation in this PR uses a polling long-poll loop (sqs_messages.go Line 517) and no leader-transition hook, so none of the bootstrap mechanics exist yet.

If Milestone 1 intentionally ships the polling fallback first, consider adding a one-sentence marker near §7.3 saying "the commit-stream notifier and failover bootstrap in this section land in Milestone 2; M1 uses a 200 ms poll". The doc currently reads as if it describes what the code does.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/2026_04_24_proposed_sqs_compatible_adapter.md` around lines 307 -
320, The design doc describes a commit-stream notifier and leader-transition
bootstrap that the current codebase does not implement (the code uses a polling
long-poll loop in sqs_messages.go instead of a leader-transition hook), so
update the doc and code annotations: add a one-sentence note in §7.3 stating
that Milestone 1 uses the polling fallback (e.g., 200 ms poll) and that the
commit-stream notifier and failover bootstrap will land in Milestone 2, and also
add a brief inline comment in sqs_messages.go near the polling long-poll loop to
reference the planned RegisterLeaderLossCallback / leader-transition bootstrap
to avoid confusion until the bootstrap is implemented.

Comment thread docs/design/2026_04_24_proposed_sqs_compatible_adapter.md
Comment thread kv/shard_key.go
Comment on lines +27 to +39
const (
// SqsRoutePrefix is the logical route prefix all SQS internal keys
// normalize to. The adapter stores queue metadata and per-queue
// message keys under several !sqs|... families; routing must map
// every one of them to a single stable prefix so the distribution
// engine colocates a queue's storage on one Raft group.
sqsRoutePrefix = "!sqs|route|"

// SqsInternalPrefix is the shared prefix of every SQS-owned key
// family (!sqs|queue|meta|, !sqs|msg|vis|, etc.). Used by
// sqsRouteKey to dispatch the routing decision.
sqsInternalPrefix = "!sqs|"
)

⚠️ Potential issue | 🟡 Minor

Doc comments reference wrong identifier names.

The comments open with SqsRoutePrefix and SqsInternalPrefix (capitalized), but the declared identifiers are sqsRoutePrefix and sqsInternalPrefix (unexported). Per Go's godoc convention, doc comments must begin with the declared name; stylecheck/revive will flag this.

✏️ Proposed fix
-	// SqsRoutePrefix is the logical route prefix all SQS internal keys
-	// normalize to. The adapter stores queue metadata and per-queue
-	// message keys under several !sqs|... families; routing must map
-	// every one of them to a single stable prefix so the distribution
-	// engine colocates a queue's storage on one Raft group.
+	// sqsRoutePrefix is the logical route prefix all SQS internal keys
+	// normalize to. The adapter stores queue metadata and per-queue
+	// message keys under several !sqs|... families; routing must map
+	// every one of them to a single stable prefix so the distribution
+	// engine colocates a queue's storage on one Raft group.
 	sqsRoutePrefix = "!sqs|route|"

-	// SqsInternalPrefix is the shared prefix of every SQS-owned key
-	// family (!sqs|queue|meta|, !sqs|msg|vis|, etc.). Used by
-	// sqsRouteKey to dispatch the routing decision.
+	// sqsInternalPrefix is the shared prefix of every SQS-owned key
+	// family (!sqs|queue|meta|, !sqs|msg|vis|, etc.). Used by
+	// sqsRouteKey to dispatch the routing decision.
 	sqsInternalPrefix = "!sqs|"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const (
// SqsRoutePrefix is the logical route prefix all SQS internal keys
// normalize to. The adapter stores queue metadata and per-queue
// message keys under several !sqs|... families; routing must map
// every one of them to a single stable prefix so the distribution
// engine colocates a queue's storage on one Raft group.
sqsRoutePrefix = "!sqs|route|"
// SqsInternalPrefix is the shared prefix of every SQS-owned key
// family (!sqs|queue|meta|, !sqs|msg|vis|, etc.). Used by
// sqsRouteKey to dispatch the routing decision.
sqsInternalPrefix = "!sqs|"
)
const (
// sqsRoutePrefix is the logical route prefix all SQS internal keys
// normalize to. The adapter stores queue metadata and per-queue
// message keys under several !sqs|... families; routing must map
// every one of them to a single stable prefix so the distribution
// engine colocates a queue's storage on one Raft group.
sqsRoutePrefix = "!sqs|route|"
// sqsInternalPrefix is the shared prefix of every SQS-owned key
// family (!sqs|queue|meta|, !sqs|msg|vis|, etc.). Used by
// sqsRouteKey to dispatch the routing decision.
sqsInternalPrefix = "!sqs|"
)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv/shard_key.go` around lines 27 - 39, The doc comments currently reference
exported names but the actual identifiers are unexported; update the comments so
they begin with the exact declared identifiers "sqsRoutePrefix" and
"sqsInternalPrefix" (e.g., "sqsRoutePrefix is the logical route prefix..." and
"sqsInternalPrefix is the shared prefix...") following Go godoc conventions and
keep the rest of the explanatory text unchanged.

Comment thread main_sigv4_creds.go
Comment on lines +31 to +43
func loadSigV4StaticCredentialsFile(path string, labelForErrors string) (map[string]string, error) {
path = strings.TrimSpace(path)
if path == "" {
return nil, nil
}
f, err := os.Open(path)
if err != nil {
return nil, errors.WithStack(err)
}
defer f.Close()
file := sigV4CredentialsFile{}
if err := json.NewDecoder(f).Decode(&file); err != nil {
return nil, errors.WithStack(err)

⚠️ Potential issue | 🟡 Minor

Wrap open/decode failures with the adapter label.

labelForErrors is only used for row validation today. Bad paths or malformed JSON come back without the s3/sqs context this helper promises, which makes shared startup failures harder to diagnose.

Suggested fix
 func loadSigV4StaticCredentialsFile(path string, labelForErrors string) (map[string]string, error) {
 	path = strings.TrimSpace(path)
 	if path == "" {
 		return nil, nil
 	}
 	f, err := os.Open(path)
 	if err != nil {
-		return nil, errors.WithStack(err)
+		return nil, errors.Wrapf(err, "%s credentials file %q", labelForErrors, path)
 	}
 	defer f.Close()
 	file := sigV4CredentialsFile{}
 	if err := json.NewDecoder(f).Decode(&file); err != nil {
-		return nil, errors.WithStack(err)
+		return nil, errors.Wrapf(err, "%s credentials file %q", labelForErrors, path)
 	}
📝 Committable suggestion


Suggested change
func loadSigV4StaticCredentialsFile(path string, labelForErrors string) (map[string]string, error) {
path = strings.TrimSpace(path)
if path == "" {
return nil, nil
}
f, err := os.Open(path)
if err != nil {
return nil, errors.WithStack(err)
}
defer f.Close()
file := sigV4CredentialsFile{}
if err := json.NewDecoder(f).Decode(&file); err != nil {
return nil, errors.WithStack(err)
func loadSigV4StaticCredentialsFile(path string, labelForErrors string) (map[string]string, error) {
path = strings.TrimSpace(path)
if path == "" {
return nil, nil
}
f, err := os.Open(path)
if err != nil {
return nil, errors.Wrapf(err, "%s credentials file %q", labelForErrors, path)
}
defer f.Close()
file := sigV4CredentialsFile{}
if err := json.NewDecoder(f).Decode(&file); err != nil {
return nil, errors.Wrapf(err, "%s credentials file %q", labelForErrors, path)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@main_sigv4_creds.go` around lines 31 - 43, In loadSigV4StaticCredentialsFile,
wrap the errors from os.Open and json.NewDecoder(...).Decode so they include the
provided labelForErrors context (e.g. "s3"/"sqs") before returning; update the
error returns in the os.Open error branch and the JSON decode branch to include
labelForErrors (preserving the stack via errors.WithStack or equivalent) so
startup/misconfiguration failures show which adapter label caused the bad path
or malformed JSON.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e9f2d0c648

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread adapter/sqs_catalog.go
// FifoQueue is applied after applyAttributes returns (see
// parseAttributesIntoMeta) because it needs cross-field validation
// against the queue name.
"FifoQueue": func(_ *sqsQueueMeta, _ string) error { return nil },

P2 Badge Reject FifoQueue updates in SetQueueAttributes

The FifoQueue attribute applier is a no-op, and this same applier map is used by SetQueueAttributes, so requests that include FifoQueue return success without changing or validating queue type. That creates a false-positive control-plane result (clients think the queue type mutation/validation succeeded) while runtime behavior remains unchanged, which is hard to diagnose in automation.

Useful? React with 👍 / 👎.

Comment thread adapter/sqs_catalog.go
Comment on lines +795 to +796
if v, ok := all[k]; ok {
out[k] = v

P2 Badge Return InvalidAttributeName for unknown GetQueueAttributes keys

GetQueueAttributes currently ignores unrecognized requested attribute names because queueMetaToAttributes only copies known keys and silently drops the rest. This means typos or unsupported names produce a 200 with partial/empty attributes instead of an explicit validation error, masking client misconfiguration and diverging from the adapter’s own strict attribute validation behavior elsewhere.

Useful? React with 👍 / 👎.
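The suggested behavior change — fail loudly on unknown names instead of silently dropping them — can be sketched as follows (selectAttributes is a hypothetical condensation of queueMetaToAttributes' selection step):

```go
package main

import "fmt"

// selectAttributes copies only the requested keys; an unrecognized name
// now yields an explicit error instead of a 200 with partial attributes.
func selectAttributes(all map[string]string, requested []string) (map[string]string, error) {
	out := make(map[string]string, len(requested))
	for _, k := range requested {
		v, ok := all[k]
		if !ok {
			return nil, fmt.Errorf("InvalidAttributeName: %s", k)
		}
		out[k] = v
	}
	return out, nil
}

func main() {
	all := map[string]string{"VisibilityTimeout": "30", "DelaySeconds": "0"}
	if _, err := selectAttributes(all, []string{"VisibilityTimeut"}); err != nil {
		fmt.Println(err) // InvalidAttributeName: VisibilityTimeut
	}
	got, _ := selectAttributes(all, []string{"DelaySeconds"})
	fmt.Println(got["DelaySeconds"]) // 0
}
```

(The real handler would also need to keep special names like "All" working; that case is omitted here.)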

7 findings across two reviews; all addressed:

codex P2:
- sqs_messages.go DeleteMessage: loadMessageForDelete no longer
  treats a cross-queue or deleted-queue handle as a silent no-op.
  We first resolve the queue by the supplied QueueUrl; missing
  queue returns AWS.SimpleQueueService.NonExistentQueue; generation
  mismatch (handle belongs to a prior/different queue incarnation)
  returns ReceiptHandleIsInvalid. Only same-queue same-generation
  handles qualify for the AWS-compat stale-handle no-op.
- sqs_messages.go ReceiveMessage: resolveReceiveWaitSeconds now
  rejects out-of-range per-request WaitTimeSeconds with
  InvalidParameterValue instead of silently clamping; queue-level
  default is trusted because it was validated at SetQueueAttributes
  time.
- sqs_catalog.go: ContentBasedDeduplication is FIFO-only. Rejected
  on both CreateQueue and SetQueueAttributes when the queue is not
  FIFO. resolveFifoQueueFlag extracted so parseAttributesIntoMeta
  stays below the cyclomatic limit.

coderabbit Minor:
- sqs_catalog.go writeSQSErrorFromErr: internal errors no longer
  echo err.Error() (which can wrap Pebble paths, Raft peer ids,
  and stack frames). Raw detail goes to slog, clients see a
  generic 500 "internal error".
- sqs_messages.go ChangeMessageVisibility: VisibilityTimeout is now
  *int64 so a missing field returns MissingParameter per AWS,
  rather than silently meaning "make visible immediately".
- sqs_messages.go SendMessage: reject non-empty MessageAttributes
  with InvalidParameterValue until Milestone 2 ships the canonical
  AWS MD5-of-attributes encoder. The current hand-rolled hash
  would produce a non-canonical MD5 and every AWS SDK client
  would fail SendMessage with MessageAttributeMD5Mismatch — better
  to fail clearly at the caller with a descriptive error.
- sqs_messages_test.go: TestSQSServer_LongPollWaitsForArrival's
  time.AfterFunc now coordinates with a sync.WaitGroup so the
  scheduled callSQS cannot call t.Fatalf after the test has
  returned (go test flags that as misuse).
@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@bootjp
Owner Author

bootjp commented Apr 24, 2026

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 698270e444


Comment thread adapter/sqs_messages.go
Comment on lines +320 to +323
req := &kv.OperationGroup[kv.OP]{
IsTxn: true,
Elems: []*kv.Elem[kv.OP]{
{Op: kv.Put, Key: dataKey, Value: recordBytes},

P1 Badge Fence SendMessage against concurrent queue deletion

sendMessage builds a transactional write with only Elems and no StartTS/ReadKeys, even though queue metadata was read earlier in loadQueueMetaForSend. If DeleteQueue commits between that read and this dispatch, the send still returns 200 and stores the message under the old generation, making it unreachable after the generation bump (effectively acknowledged message loss). Include queue meta/generation in the same OCC transaction (or equivalent compare-and-check) so sends fail when the queue was deleted/recreated concurrently.

Useful? React with 👍 / 👎.

Comment thread adapter/sqs_messages.go
Comment on lines +928 to +931
func (s *SQSServer) loadAndVerifyMessage(ctx context.Context, queueName string, handle *decodedReceiptHandle) (*sqsMessageRecord, []byte, uint64, error) {
readTS := s.nextTxnReadTS(ctx)
dataKey := sqsMsgDataKey(queueName, handle.QueueGeneration, handle.MessageIDHex)
raw, err := s.store.GetAt(ctx, dataKey, readTS)

P2 Badge Reject stale handles in ChangeMessageVisibility

loadAndVerifyMessage trusts handle.QueueGeneration and reads the message record directly without checking current queue metadata/generation. Because DeleteQueue currently leaves old message keys behind, a receipt handle from a deleted (or recreated) queue generation can still be accepted and mutated by ChangeMessageVisibility instead of being rejected as invalid for this queue. This should mirror loadMessageForDelete by validating queue existence and generation before operating on the message key.

Useful? React with 👍 / 👎.

…-state races

Two new codex findings on the "concurrent DeleteQueue race" axis:

- SendMessage (P1): the transaction wrote data + vis but had no
  StartTS or ReadKeys. If DeleteQueue committed between
  loadQueueMetaForSend's read and this dispatch, the send still
  returned 200 and wrote the message under the old generation —
  which the generation bump had just made unreachable via routing,
  effectively silent acknowledged loss. loadQueueMetaForSend now
  returns the snapshot readTS alongside the metadata, and
  sendMessage pins StartTS=readTS and ReadKeys=[metaKey, genKey]
  on its dispatch. A concurrent DeleteQueue / PurgeQueue commits
  before ours → ErrWriteConflict, which surfaces to the client as
  a retryable error.
- ChangeMessageVisibility (P2): loadAndVerifyMessage previously
  trusted handle.QueueGeneration and jumped straight to the
  message key. If the queue was deleted (or deleted-and-recreated
  with a fresh generation), a stale handle could still mutate
  the orphan in-flight record left under the old generation.
  loadAndVerifyMessage now mirrors loadMessageForDelete: it
  resolves queue meta first, rejects a missing queue with
  QueueDoesNotExist, and rejects a generation mismatch with
  ReceiptHandleIsInvalid. Matching generations then proceed
  exactly as before.

No behavior change on the happy path. Single-writer tests still
pass; the added checks only fire when a concurrent catalog
mutation is in flight.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d2d5696cd1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread adapter/sqs_messages.go Outdated
Comment on lines +585 to +590
if requested <= 0 {
return sqsReceiveDefaultMaxMessages
}
if requested > sqsReceiveHardMaxMessages {
return sqsReceiveHardMaxMessages
}

P2 Badge Reject out-of-range MaxNumberOfMessages values

ReceiveMessage currently coerces invalid MaxNumberOfMessages inputs (<=0 to 1, >10 to 10) instead of failing validation. AWS documents this field as valid only in the 1..10 range, so accepting bad values changes client-visible behavior and masks caller bugs (for example, 0 silently becoming active polling). This should return an InvalidParameterValue-style error rather than being clamped.


Comment thread adapter/sqs_catalog.go
Comment on lines +790 to +792
if len(req) == 0 {
return nil
}

P2 Badge Return empty attributes when no AttributeNames are requested

GetQueueAttributes treats an omitted AttributeNames list as “All” by returning nil from selectedAttributeNames, which queueMetaToAttributes expands to every attribute. AWS behavior for omitted AttributeNames is an empty attributes result, so this over-returns metadata and breaks compatibility for clients that rely on omission returning nothing.


Comment thread adapter/sqs_catalog.go
Comment on lines +838 to +842
if err := s.setQueueAttributesWithRetry(r.Context(), name, in.Attributes); err != nil {
writeSQSErrorFromErr(w, err)
return
}
writeSQSJSON(w, map[string]any{})

P2 Badge Validate required Attributes in SetQueueAttributes

SetQueueAttributes forwards requests even when Attributes is omitted, and the downstream apply path no-ops successfully, returning HTTP 200 with no change. The SQS API marks Attributes as required, so this creates false-positive control-plane success for malformed requests and can hide automation errors.


Three P2 findings on AWS-compat request-validation semantics:

- ReceiveMessage.MaxNumberOfMessages: was silently clamping <=0
  to 1 and >10 to 10, which hides caller bugs (0 becoming active
  polling, 11+ becoming 10). AWS rejects out-of-range with
  InvalidParameterValue. Input struct changed to *int so omission
  can still default to 1; explicit values must be in [1, 10] or
  the request fails. Same policy already applied to
  WaitTimeSeconds in round 10.
- GetQueueAttributes: omitting AttributeNames was treated as
  "All", over-returning every queue attribute. AWS returns NO
  attributes on omission; "All" must be explicit. Replaced the
  map-or-nil signal with a tri-state sqsAttributeSelection
  { expandAll bool; names map[string]bool } so the semantics
  (empty / All / specific names) are each a distinct branch.
- SetQueueAttributes.Attributes: was optional and a missing field
  silently 200'd with no change, hiding automation bugs. AWS
  marks the field as required. Omission (or an empty map) now
  returns MissingParameter.

New tests: TestSQSServer_GetQueueAttributesOmittedReturnsEmpty
asserts the two AWS cases (omitted -> empty, ["All"] -> full);
TestSQSServer_SetQueueAttributesRequiresAttributes covers both
the omitted-field and empty-map paths;
TestSQSServer_ReceiveMessageRejectsOutOfRangeMax iterates
{0, -1, 11, 100} expecting 400, plus the omitted-and-in-range
happy paths.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b9d472711c


Comment thread adapter/sqs_messages.go Outdated
Comment on lines +715 to +716
if _, err := s.coordinator.Dispatch(ctx, req); err != nil {
return nil, false

P1 Badge Propagate receive rotation dispatch failures

This branch turns every Dispatch failure into a silent candidate skip, so ReceiveMessage can return 200 with an empty/undersized Messages array even when the backend is failing. In scenarios like leader step-down, shard routing failures, or storage/coordinator outages, callers will see a false-empty queue instead of an actionable error, which can stall consumers and hide incidents. Only expected race conditions (e.g., write conflicts) should be skipped; non-retryable dispatch errors should be propagated.


codex P1 (new): tryDeliverCandidate collapsed every Dispatch failure
into a silent candidate skip, so ReceiveMessage returned 200 with an
empty or undersized Messages array even when the coordinator had
stepped down, a shard was unroutable, or storage was failing. That
hides production incidents and stalls consumers.

Refactor the delivery path to tri-state returns:

- tryDeliverCandidate returns (msg, skip bool, err error):
    * (msg, false, nil)  — delivered.
    * (nil, true,  nil)  — expected race; skip this candidate.
      Covers only ErrKeyNotFound on the initial GetAt (another
      worker deleted the record between our scan and load) and
      ErrWriteConflict on dispatch (another receive rotated the
      same record first).
    * (nil, false, err)  — non-retryable; propagate.
- rotateMessagesForDelivery returns ([]msgs, error). On
  non-retryable error it returns the messages already rotated in
  this batch (committed, so they must be delivered) together with
  the error.
- scanAndDeliverOnce propagates, longPollReceive propagates,
  receiveMessage renders the error through writeSQSErrorFromErr
  (generic 500 body; full detail logged via slog, as in round 10).

No behavior change on the happy path or on genuine races; only
coordinator / routing / storage outages stop being masked as
false-empty 200s.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 94e3a3feec


Comment thread adapter/sqs_messages.go
Comment on lines +730 to +733
IsTxn: true,
StartTS: readTS,
ReadKeys: [][]byte{cand.visKey, dataKey},
Elems: []*kv.Elem[kv.OP]{

P1 Badge Fence receive rotation against concurrent queue deletion

tryDeliverCandidate commits a receive rotation while only checking cand.visKey and dataKey for OCC conflicts. DeleteQueue only mutates queue meta/generation keys, so a delete/recreate that races with this path will not conflict and the receive can still return a message from the old generation after deletion was acknowledged. This breaks the generation fence used elsewhere and can surface stale messages from a deleted queue; include queue meta/generation in the read set (or revalidate generation before commit) so the receive aborts on queue-state changes.


Comment thread adapter/sqs_messages.go
Comment on lines +711 to +714
rec.VisibleAtMillis = newVisibleAt
rec.CurrentReceiptToken = newToken
rec.ReceiveCount++
if rec.FirstReceiveMillis == 0 {

P1 Badge Enforce RedrivePolicy during receive delivery

The receive path increments ReceiveCount and immediately returns the message without consulting any DLQ policy, even though queue metadata accepts and persists RedrivePolicy. As a result, queues configured with maxReceiveCount will never move poison messages to their dead-letter queue and will redeliver them indefinitely, which diverges from expected SQS behavior and can block consumers under retry storms.


Two new P1s in codex review 4171957563:

- tryDeliverCandidate's OCC ReadKeys covered only cand.visKey +
  dataKey, but DeleteQueue only mutates queue meta + generation
  keys — so a DeleteQueue that raced with an in-flight receive
  would NOT produce a write conflict, letting the rotation commit
  and return a message from an already-deleted queue generation.
  ReadKeys now additionally include sqsQueueMetaKey / sqsQueueGenKey
  so concurrent DeleteQueue / PurgeQueue aborts the receive with
  ErrWriteConflict, which propagates as a retryable 5xx rather
  than surfacing stale messages from a dead queue.
- RedrivePolicy was silently accepted as a queue attribute, but
  receive has no DLQ-redrive logic — messages that hit
  maxReceiveCount never move to the DLQ and redeliver
  indefinitely. Milestone 1 cannot honor the contract, so the
  attribute now returns 501 NotImplemented ("DLQ redrive is
  tracked for Milestone 2") from both CreateQueue and
  SetQueueAttributes, matching the earlier decision to reject
  MessageAttributes until the canonical MD5 encoder lands.

New test TestSQSServer_CreateQueueRejectsRedrivePolicy covers the
attribute-rejection behavior on both Create and SetAttributes
paths.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 55fed695a9


Comment thread adapter/sqs_messages.go Outdated
return
}

delivered, err := s.longPollReceive(ctx, queueName, meta.Generation, max, visibilityTimeout, waitSeconds)

P1 Badge Revalidate queue generation before each receive attempt

This handler reads meta.Generation once and then reuses it for the entire receive flow, including long-poll retries. Because DeleteQueue explicitly leaves old message keys in place while only deleting meta and bumping generation, a queue that is deleted/recreated after this initial read can still have old-generation messages scanned and delivered from the stale generation, even though the old queue was already deleted. Re-reading/validating current generation on each scan pass is needed to prevent post-delete stale delivery.


Comment thread adapter/sqs_messages.go
Comment on lines +949 to +952
IsTxn: true,
StartTS: readTS,
ReadKeys: [][]byte{dataKey, oldVisKey},
Elems: []*kv.Elem[kv.OP]{

P1 Badge Fence ChangeMessageVisibility on queue meta/generation

The visibility-change transaction only uses dataKey and oldVisKey in ReadKeys, but DeleteQueue races mutate only queue meta/generation keys. If a queue is deleted between loadAndVerifyMessage and this dispatch, OCC won't detect the delete and the visibility update can still commit against orphaned old-generation records, returning success for a queue that has already been deleted. Include queue meta/generation in the read set (or revalidate generation at commit time) to close this race.


…sibility

codex round 15 P1×2:

- receiveMessage loaded queue metadata once and reused the cached
  generation for every long-poll scan pass. A DeleteQueue that
  committed mid-wait would leave orphan !sqs|msg|vis| entries
  behind under the old generation (DeleteQueue only mutates meta +
  generation, never the message keys), and the cached-gen scan
  would happily find and deliver them even after ack.
  scanAndDeliverOnce now re-reads queue metadata on every pass — the
  fresh read sees the generation bump immediately and returns
  QueueDoesNotExist instead of scanning stale keys. The outer
  handler no longer passes a gen parameter; it only passes the
  queue name.
- changeVisibilityWithRetry and deleteMessageWithRetry's OCC
  ReadKeys covered only the message record and its vis entry,
  but DeleteQueue / PurgeQueue races mutate queue meta +
  generation keys. A delete that commits between loadMessageFor*
  and Dispatch would slip past the conflict check and mutate
  orphan records on an already-deleted queue. Both dispatches now
  include sqsQueueMetaKey + sqsQueueGenKey in ReadKeys (same fix
  tryDeliverCandidate got one round earlier), so a DeleteQueue
  race aborts the in-flight delete / change-visibility with
  ErrWriteConflict which the retry loop classifies as
  ReceiptHandleIsInvalid on the next pass.

No behavior change for uncontended callers. The race windows
these findings flagged now close via OCC conflict instead of
returning success against a dead generation.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fa7fdbfaf9


Comment thread adapter/sqs_messages.go
Comment on lines +365 to +367
if int64(len(body)) > meta.MaximumMessageSize {
return nil, readTS, newSQSAPIError(http.StatusBadRequest, sqsErrMessageTooLong, "message body exceeds MaximumMessageSize")
}

P2 Badge Reject empty MessageBody in SendMessage

SendMessage currently only enforces the upper size bound and accepts a zero-length MessageBody, because loadQueueMetaForSend checks len(body) > MaximumMessageSize but never validates a minimum length. This means requests with omitted/empty bodies are enqueued successfully instead of returning a validation error, which breaks SQS compatibility and can hide producer serialization bugs.


Comment thread adapter/sqs_messages.go Outdated
Comment on lines +593 to +597
candidates, err := s.scanVisibleMessageCandidates(ctx, queueName, meta.Generation, max*sqsReceiveScanOverfetchFactor, readTS)
if err != nil {
return nil, err
}
return s.rotateMessagesForDelivery(ctx, queueName, meta.Generation, candidates, visibilityTimeout, max, readTS)

P1 Badge Enforce message retention before delivery

The receive path loads queue metadata but does not apply MessageRetentionPeriod when scanning or delivering messages, so messages older than send_timestamp + retention can still be returned indefinitely. Fresh evidence in this commit: scanAndDeliverOnce forwards directly to visibility-based selection/rotation without using meta.MessageRetentionSeconds, and there is no runtime expiry path in the adapter/sqs_*.go message flow.


bootjp added 2 commits April 25, 2026 02:00
codex round 16, two findings:

- P2 SendMessage: loadQueueMetaForSend only checked the upper body
  size bound; an omitted or empty MessageBody would be accepted and
  enqueued silently. AWS rejects empty bodies as
  InvalidParameterValue, and silently admitting them would let
  producer serialization bugs land on the queue disguised as real
  messages. Added the lower-bound check right next to the size
  cap.
- P1 ReceiveMessage: the receive path loaded MessageRetentionPeriod
  via queue metadata but never applied it, so messages older than
  send_timestamp + retention could be re-delivered indefinitely on
  every visibility-timeout cycle. tryDeliverCandidate now takes the
  retention seconds, compares against rec.SendTimestampMillis, and
  when expired runs a dedicated expireMessage OCC txn that drops
  the data + vis entries (idempotent on ErrWriteConflict — another
  worker racing us to delete or rotate is fine). The candidate is
  then skipped for this batch, so expired messages never reach
  clients. A Milestone-2 send-age reaper can still sweep in bulk
  when the queue is idle. tryDeliverCandidate is split into
  loadCandidateRecord / handleRetentionExpiry / commitReceiveRotation
  so the retention branch stays under the cyclomatic limit.

Tests: TestSQSServer_SendMessageRejectsEmptyBody covers the two
empty-body request shapes; TestSQSServer_ReceiveDropsRetentionExpiredMessages
seeds a queue with MessageRetentionPeriod=60 s, backdates the
stored SendTimestampMillis via a new backdateSQSMessageForTest
helper, and asserts that two consecutive receives return empty
(message was reaped inline, not just hidden).
@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 818c27cb66


Comment thread adapter/sqs_messages.go
if time.Now().After(deadline) {
return delivered, nil
}
delivered, err = s.scanAndDeliverOnce(ctx, queueName, max, visibilityTimeout)

P1 Badge Revalidate leadership on each long-poll receive iteration

receiveMessage does a single lease check before entering long polling, but the loop then keeps scanning local state without re-checking leader status. If leadership changes while a WaitTimeSeconds poll is in flight (e.g., failover/partition), the old leader can return stale/false-empty results instead of failing or handing off to the new leader, which can hide available messages from consumers during outages.


Comment thread main_sigv4_creds.go
}
out[accessKeyID] = secretAccessKey
}
return out, nil

P2 Badge Reject empty credentials files instead of silently disabling auth

When a credentials file is provided but contains zero entries, this loader returns an empty map and no error. The SQS startup path then treats empty credentials as “auth off”, so a misconfigured --sqsCredentialsFile can unintentionally expose an unauthenticated API instead of failing closed. Returning a validation error when a non-empty file yields no valid credentials would prevent this security footgun.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (6)
adapter/sqs_messages_test.go (2)

133-149: Ignored errors in test setup mask failures at the wrong line.

Calls like _, _ = callSQS(...) for the initial SendMessage here (and similarly at Lines 245-249, 286-289, 343-346, 482-485, 604-607) discard the status code. If SendMessage fails (e.g. due to a regression in validation), the subsequent ReceiveMessage returns 0 messages and the test fails with "no messages received" — misattributing the failure to the wrong assertion. Asserting status == http.StatusOK at the send site makes regressions diagnosable at a glance without adding many lines of boilerplate.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@adapter/sqs_messages_test.go` around lines 133 - 149, The test currently
ignores the status return from the initial SendMessage calls (via callSQS with
sqsSendMessageTarget), which hides failures; update each send invocation (the
one shown and the similar occurrences around the file) to capture and assert the
HTTP status equals http.StatusOK right after callSQS before continuing, so
failures surface at the send call rather than later when ReceiveMessage returns
no messages; look for callSQS usages with sqsSendMessageTarget and add a status
check/assert for http.StatusOK immediately after each send.

385-440: Wall-clock-dependent bounds may flake under CI contention.

elapsed < 300ms / elapsed > 3s assume the poll ticker (200 ms), the scheduled send (400 ms), and callSQS RTT all stay within a narrow envelope. On a heavily loaded CI box (or when -race doubles goroutine scheduling latency), I have seen similar bounds flake. The existing logic is correct; widening the bounds (e.g. elapsed < 200ms lower and elapsed > 4500ms upper) or gating the test behind testing.Short would reduce flakes without weakening the intent.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@adapter/sqs_messages_test.go` around lines 385 - 440, The elapsed-time
assertions in TestSQSServer_LongPollWaitsForArrival are too tight and can flake
under CI; update the two checks around elapsed (after calling callSQS) to use
wider bounds (for example change the lower check from 300ms to 200ms and the
upper check from 3s to 4500ms) or alternatively skip the timing assertions in
short mode by gating with testing.Short(); modify the assertions near the
existing elapsed comparisons in TestSQSServer_LongPollWaitsForArrival (the block
that references sendAt, sendWG, callSQS and WaitTimeSeconds) accordingly so the
test remains robust under CI scheduling variability.
adapter/sqs_messages.go (2)

443-455: MessageAttributes is persisted into the record even though sends with attributes are rejected upstream.

buildSendRecord copies in.MessageAttributes into rec.MessageAttributes, but sendMessage rejects any request with non-empty MessageAttributes at Line 298-302 before reaching buildSendRecord. The copy is effectively dead code until the canonical MD5 encoder lands. This is harmless today but means the stored-record schema and the wire behavior have drifted — reviewers reading just buildSendRecord will think attributes are supported end-to-end. Consider either leaving a // TODO(milestone-2): enabled when canonical MD5 lands next to the assignment, or dropping it until the encoder is ready to avoid latent inconsistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@adapter/sqs_messages.go` around lines 443 - 455, buildSendRecord currently
assigns in.MessageAttributes into rec.MessageAttributes even though sendMessage
rejects non-empty MessageAttributes (see sendMessage), causing a schema/behavior
drift; update buildSendRecord (sqsMessageRecord assignment of MessageAttributes)
to either remove the assignment or annotate it with a clear TODO noting it will
be enabled when the canonical MD5 encoder is implemented (e.g., "//
TODO(milestone-2): enable when canonical MD5 lands"), so readers of
buildSendRecord aren't misled about end-to-end support for MessageAttributes.

555-583: Long-poll deadline excludes the time spent in the first scan.

scanAndDeliverOnce runs before deadline := time.Now().Add(waitSeconds * Second), so a slow initial scan (e.g. 500 ms under storage backpressure) causes the overall receive to sleep up to waitSeconds on top of the first pass, not inclusive of it. AWS treats WaitTimeSeconds as the total server-side wait budget. Anchoring deadline at function entry (before the first scan) would give callers a predictable worst-case latency and keep TestSQSServer_LongPollTimesOutOnEmptyQueue-style bounds tight.

♻️ Proposed refactor
 func (s *SQSServer) longPollReceive(ctx context.Context, queueName string, max int, visibilityTimeout, waitSeconds int64) ([]map[string]any, error) {
+	deadline := time.Now().Add(time.Duration(waitSeconds) * time.Second)
 	delivered, err := s.scanAndDeliverOnce(ctx, queueName, max, visibilityTimeout)
 	if err != nil {
 		return nil, err
 	}
 	if len(delivered) > 0 || waitSeconds <= 0 {
 		return delivered, nil
 	}
-	deadline := time.Now().Add(time.Duration(waitSeconds) * time.Second)
 	ticker := time.NewTicker(sqsLongPollInterval)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@adapter/sqs_messages.go` around lines 555 - 583, The long-poll deadline is
currently set after the initial scan, so a slow first call to scanAndDeliverOnce
can push total latency beyond waitSeconds; fix longPollReceive by anchoring the
deadline at function entry (compute deadline :=
time.Now().Add(time.Duration(waitSeconds)*time.Second) before calling
scanAndDeliverOnce), keep the early-return when waitSeconds <= 0, and then use
that deadline for all subsequent waits/loops so the total server-side wait
budget includes the first scan.
adapter/sqs_catalog.go (1)

680-691: filterByPrefix aliases the input slice via names[:0].

out := names[:0] reuses the caller's backing array and the returned slice overwrites entries the caller still holds. In the one current caller (listQueues), the original names is not used after the call, so this is safe today — but the function reads like a pure helper and the aliasing will bite the next caller who assumes independence. Consider allocating a fresh slice (or at minimum renaming to make the in-place semantic explicit).

♻️ Proposed refactor
 func filterByPrefix(names []string, prefix string) []string {
 	if prefix == "" {
 		return names
 	}
-	out := names[:0]
+	out := make([]string, 0, len(names))
 	for _, n := range names {
 		if strings.HasPrefix(n, prefix) {
 			out = append(out, n)
 		}
 	}
 	return out
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@adapter/sqs_catalog.go` around lines 680 - 691, The helper filterByPrefix
currently aliases the input slice by using out := names[:0], which reuses the
caller's backing array and can unexpectedly mutate caller data; change
filterByPrefix to allocate and return a fresh slice (e.g., make([]string, 0,
len(names)) and append matches) so callers (like listQueues) get an independent
result, or alternatively rename the function to indicate in-place behavior —
prefer allocating a new slice to preserve helper purity.
adapter/sqs_catalog_test.go (1)

269-294: LGTM — table-driven key-encoding round-trip.

Nice coverage of edge cases (empty, single char, whitespace, .fifo suffix, 80-char). As per coding guidelines: "prefer table-driven test cases". One nit: "hello world" and "round.trip.fifo" exercise the segment encoder but would be rejected by validateQueueName; worth a short comment clarifying this asserts the codec in isolation (not producer/consumer agreement with the validator).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@adapter/sqs_catalog_test.go` around lines 269 - 294, Add a clarifying comment
in TestSQSServer_CatalogKeyEncoding explaining that some test names (e.g.,
"hello world", "round.trip.fifo") would be rejected by validateQueueName but are
intentionally included to exercise the encodeSQSSegment/decodeSQSSegment codec
in isolation (not to assert producer/consumer agreement with validateQueueName);
mention the test also checks queueNameFromMetaKey/sqsQueueMetaKey round-trip to
avoid confusion.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@adapter/sqs_catalog.go`:
- Around line 538-541: The call to s.coordinator.Clock() in tryCreateQueueOnce
is performed without a nil-check and can panic; mirror the guard used in
nextTxnReadTS: check if s.coordinator != nil before calling Clock() (or
otherwise handle the nil case), and only set requested.CreatedAtHLC when Clock()
is non-nil; update tryCreateQueueOnce accordingly (referencing
tryCreateQueueOnce, s.coordinator, Clock(), requested.CreatedAtHLC, and
nextTxnReadTS).
- Around lines 409-421: attributesEqual currently ignores Tags, which causes
createQueue to silently accept differing tags; update attributesEqual (used by
createQueue) to include tag comparison and return false when Tags differ so
createQueue can surface a QueueNameExists error or handle tag reconciliation.
Specifically, compare a.Tags and b.Tags (order-agnostic, nil vs empty-aware, and
key/value equality) inside attributesEqual (and ensure the place that checks its
result in createQueue treats a mismatch as a name conflict), so tag diffs are
detected and the existing QueueNameExists behavior for differing tags is
preserved.

In `@adapter/sqs_messages.go`:
- Around lines 344-348: the SendMessage response currently always includes
"MD5OfMessageAttributes" with an empty string because
md5OfAttributesHex(in.MessageAttributes) is called even when no attributes were
supplied; update the writeSQSJSON call in the SendMessage handling so it only
adds the "MD5OfMessageAttributes" key when in.MessageAttributes is non-empty (or
when md5OfAttributesHex returns a non-empty string) instead of unconditionally
including it alongside rec.MessageID and rec.MD5OfBody; adjust the map
construction around writeSQSJSON/md5OfAttributesHex to conditionally insert that
key so SDK clients won't see a spurious empty hash.

---

Nitpick comments:
In `@adapter/sqs_catalog_test.go`:
- Around lines 269-294: add a clarifying comment in
TestSQSServer_CatalogKeyEncoding explaining that some test names (e.g., "hello
world", "round.trip.fifo") would be rejected by validateQueueName but are
intentionally included to exercise the encodeSQSSegment/decodeSQSSegment codec
in isolation (not to assert producer/consumer agreement with validateQueueName);
mention the test also checks queueNameFromMetaKey/sqsQueueMetaKey round-trip to
avoid confusion.

In `@adapter/sqs_catalog.go`:
- Around lines 680-691: the helper filterByPrefix currently aliases the input
slice by using out := names[:0], which reuses the caller's backing array and can
unexpectedly mutate caller data; change filterByPrefix to allocate and return a
fresh slice (e.g., make([]string, 0, len(names)) and append matches) so callers
(like listQueues) get an independent result, or alternatively rename the
function to indicate in-place behavior — prefer allocating a new slice to
preserve helper purity.

In `@adapter/sqs_messages_test.go`:
- Around lines 133-149: the test currently ignores the status return from the
initial SendMessage calls (via callSQS with sqsSendMessageTarget), which hides
failures; update each send invocation (the one shown and the similar occurrences
around the file) to capture and assert the HTTP status equals http.StatusOK
right after callSQS before continuing, so failures surface at the send call
rather than later when ReceiveMessage returns no messages; look for callSQS
usages with sqsSendMessageTarget and add a status check/assert for http.StatusOK
immediately after each send.
- Around lines 385-440: the elapsed-time assertions in
TestSQSServer_LongPollWaitsForArrival are too tight and can flake under CI;
update the two checks around elapsed (after calling callSQS) to use wider bounds
(for example change the lower check from 300ms to 200ms and the upper check from
3s to 4500ms) or alternatively skip the timing assertions in short mode by
gating with testing.Short(); modify the assertions near the existing elapsed
comparisons in TestSQSServer_LongPollWaitsForArrival (the block that references
sendAt, sendWG, callSQS and WaitTimeSeconds) accordingly so the test remains
robust under CI scheduling variability.

In `@adapter/sqs_messages.go`:
- Around lines 443-455: buildSendRecord currently assigns in.MessageAttributes
into rec.MessageAttributes even though sendMessage rejects non-empty
MessageAttributes (see sendMessage), causing a schema/behavior drift; update
buildSendRecord (sqsMessageRecord assignment of MessageAttributes) to either
remove the assignment or annotate it with a clear TODO noting it will be enabled
when the canonical MD5 encoder is implemented (e.g., "// TODO(milestone-2):
enable when canonical MD5 lands"), so readers of buildSendRecord aren't misled
about end-to-end support for MessageAttributes.
- Around lines 555-583: the long-poll deadline is currently set after the initial
scan, so a slow first call to scanAndDeliverOnce can push total latency beyond
waitSeconds; fix longPollReceive by anchoring the deadline at function entry
(compute deadline := time.Now().Add(time.Duration(waitSeconds)*time.Second)
before calling scanAndDeliverOnce), keep the early-return when waitSeconds <= 0,
and then use that deadline for all subsequent waits/loops so the total
server-side wait budget includes the first scan.
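The deadline-anchoring fix can be sketched as below; longPoll and its scan callback are simplified stand-ins for longPollReceive and scanAndDeliverOnce, and the sleep stands in for waiting on the commit-stream notifier:

```go
package main

import (
	"fmt"
	"time"
)

// longPoll anchors the deadline at function entry, before the first scan,
// so a slow initial scan consumes the waitSeconds budget instead of
// extending the total server-side wait.
func longPoll(waitSeconds int, scan func() bool) bool {
	deadline := time.Now().Add(time.Duration(waitSeconds) * time.Second)
	if scan() {
		return true
	}
	if waitSeconds <= 0 {
		return false // early return preserved for WaitTimeSeconds = 0
	}
	for time.Now().Before(deadline) {
		time.Sleep(5 * time.Millisecond) // stand-in for the notifier wait
		if scan() {
			return true
		}
	}
	return false
}

func main() {
	start := time.Now()
	calls := 0
	got := longPoll(1, func() bool {
		calls++
		time.Sleep(100 * time.Millisecond) // a slow scan still counts against the budget
		return calls >= 3
	})
	fmt.Println(got, time.Since(start) < 2*time.Second)
}
```

Had the deadline been computed after the first scan, each slow initial scan would silently add its own latency on top of waitSeconds.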
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 003a4419-6978-456c-933f-cfe7398961c6

📥 Commits

Reviewing files that changed from the base of the PR and between 0d69c90 and 2ce85e0.

📒 Files selected for processing (5)
  • adapter/sqs_catalog.go
  • adapter/sqs_catalog_test.go
  • adapter/sqs_messages.go
  • adapter/sqs_messages_test.go
  • adapter/test_util.go
✅ Files skipped from review due to trivial changes (1)
  • adapter/test_util.go

Comment thread adapter/sqs_catalog.go
Comment on lines +409 to +421
func attributesEqual(a, b *sqsQueueMeta) bool {
	if a == nil || b == nil {
		return false
	}
	return a.IsFIFO == b.IsFIFO &&
		a.ContentBasedDedup == b.ContentBasedDedup &&
		a.VisibilityTimeoutSeconds == b.VisibilityTimeoutSeconds &&
		a.MessageRetentionSeconds == b.MessageRetentionSeconds &&
		a.DelaySeconds == b.DelaySeconds &&
		a.ReceiveMessageWaitSeconds == b.ReceiveMessageWaitSeconds &&
		a.MaximumMessageSize == b.MaximumMessageSize &&
		a.RedrivePolicy == b.RedrivePolicy
}
⚠️ Potential issue | 🟡 Minor

attributesEqual ignores Tags, so idempotent CreateQueue silently drops tag diffs.

requested.Tags = in.Tags is set on every createQueue call, but attributesEqual omits Tags from the comparison. A second CreateQueue with the same name and identical attributes but different Tags returns 200 with the original QueueUrl, never persists the new tags, and never surfaces a QueueNameExists error. AWS SQS returns QueueNameExists when tags differ, so a client that relies on this idempotency contract for tag reconciliation (e.g. IaC drift detection) will see a spurious success.

Either compare Tags here and return QueueNameExists on mismatch, or document the deviation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@adapter/sqs_catalog.go` around lines 409-421: attributesEqual currently
ignores Tags which causes createQueue to silently accept differing tags; update
attributesEqual (used by createQueue) to include tag comparison and return false
when Tags differ so createQueue can surface a QueueNameExists error or handle
tag reconciliation. Specifically, compare a.Tags and b.Tags (order-agnostic, nil
vs empty-aware, and key/value equality) inside attributesEqual (and ensure the
place that checks its result in createQueue treats a mismatch as a name
conflict), so tag diffs are detected and the existing QueueNameExists behavior
for differing tags is preserved.
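The order-agnostic, nil-vs-empty-aware tag comparison the review asks for can be sketched with the standard library's maps package (Go 1.21+); tagsEqual is a hypothetical helper name, not the project's:

```go
package main

import (
	"fmt"
	"maps"
)

// tagsEqual compares tag maps by key/value equality. maps.Equal already
// treats a nil map and an empty map as equal and ignores insertion order,
// which matches the semantics the review describes for attributesEqual.
func tagsEqual(a, b map[string]string) bool {
	return maps.Equal(a, b)
}

func main() {
	fmt.Println(tagsEqual(nil, map[string]string{})) // true: nil vs empty
	fmt.Println(tagsEqual(
		map[string]string{"env": "prod"},
		map[string]string{"env": "dev"},
	)) // false: createQueue should surface QueueNameExists
}
```

Folding `tagsEqual(a.Tags, b.Tags)` into the conjunction in attributesEqual would make a tag mismatch fail the idempotency check, as AWS SQS does.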

Comment thread adapter/sqs_catalog.go
Comment on lines +538 to +541
requested.Generation = lastGen + 1
if clock := s.coordinator.Clock(); clock != nil {
	requested.CreatedAtHLC = clock.Current()
}
⚠️ Potential issue | 🟡 Minor

Unchecked s.coordinator deref; nextTxnReadTS guards against nil here.

s.nextTxnReadTS at Line 432 explicitly checks if s.coordinator != nil before calling .Clock(), but tryCreateQueueOnce calls s.coordinator.Clock() unconditionally on Line 539. If s.coordinator can legitimately be nil in any construction path (test doubles, partial init during startup), this panics. Either drop the nil guard in nextTxnReadTS (if coordinator is invariant) or add the same guard here for consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@adapter/sqs_catalog.go` around lines 538-541: the call to
s.coordinator.Clock() in tryCreateQueueOnce is performed without a nil-check and
can panic; mirror the guard used in nextTxnReadTS: check if s.coordinator != nil
before calling Clock() (or otherwise handle the nil case), and only set
requested.CreatedAtHLC when Clock() is non-nil; update tryCreateQueueOnce
accordingly (referencing tryCreateQueueOnce, s.coordinator, Clock(),
requested.CreatedAtHLC, and nextTxnReadTS).
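The guarded pattern can be sketched as follows; all types here are hypothetical stand-ins for the adapter's real coordinator, clock, and HLC types:

```go
package main

import "fmt"

type hlc struct{ wall int64 }

type clock struct{ now int64 }

func (c *clock) Current() hlc { return hlc{wall: c.now} }

type coordinator struct{ c *clock }

func (co *coordinator) Clock() *clock { return co.c }

type server struct{ coordinator *coordinator }

// createdAtHLC mirrors the guard used in nextTxnReadTS: check the
// coordinator for nil before calling Clock(), so a partially initialised
// server (e.g. a test double) cannot panic.
func (s *server) createdAtHLC() (hlc, bool) {
	if s.coordinator == nil {
		return hlc{}, false
	}
	clk := s.coordinator.Clock()
	if clk == nil {
		return hlc{}, false
	}
	return clk.Current(), true
}

func main() {
	var bare server // no coordinator, as in a partial-init path
	if _, ok := bare.createdAtHLC(); !ok {
		fmt.Println("nil coordinator handled, no panic")
	}
	full := server{coordinator: &coordinator{c: &clock{now: 42}}}
	ts, _ := full.createdAtHLC()
	fmt.Println("hlc wall:", ts.wall)
}
```

The alternative the comment mentions, dropping the guard in nextTxnReadTS, is equally consistent if a non-nil coordinator is a construction invariant.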

Comment thread adapter/sqs_messages.go
Comment on lines +344 to +348
writeSQSJSON(w, map[string]string{
	"MessageId":              rec.MessageID,
	"MD5OfMessageBody":       rec.MD5OfBody,
	"MD5OfMessageAttributes": md5OfAttributesHex(in.MessageAttributes),
})
⚠️ Potential issue | 🟡 Minor

MD5OfMessageAttributes: "" is echoed unconditionally in SendMessage responses.

The AWS SendMessage response schema only includes MD5OfMessageAttributes when the request carried attributes. Since Line 298-302 already rejects any send with non-empty MessageAttributes, md5OfAttributesHex is always called with an empty map here and returns "". Emitting the field with an empty string makes SDK clients that do a presence check (or treat "" as a hash to verify against their own canonical MD5) see a spurious mismatch. Prefer omitting the field entirely when no attributes were supplied.

🛠️ Proposed fix
-	writeSQSJSON(w, map[string]string{
-		"MessageId":              rec.MessageID,
-		"MD5OfMessageBody":       rec.MD5OfBody,
-		"MD5OfMessageAttributes": md5OfAttributesHex(in.MessageAttributes),
-	})
+	resp := map[string]string{
+		"MessageId":        rec.MessageID,
+		"MD5OfMessageBody": rec.MD5OfBody,
+	}
+	if md5 := md5OfAttributesHex(in.MessageAttributes); md5 != "" {
+		resp["MD5OfMessageAttributes"] = md5
+	}
+	writeSQSJSON(w, resp)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@adapter/sqs_messages.go` around lines 344-348: the SendMessage response
currently always includes "MD5OfMessageAttributes" with an empty string because
md5OfAttributesHex(in.MessageAttributes) is called even when no attributes were
supplied; update the writeSQSJSON call in the SendMessage handling so it only
adds the "MD5OfMessageAttributes" key when in.MessageAttributes is non-empty (or
when md5OfAttributesHex returns a non-empty string) instead of unconditionally
including it alongside rec.MessageID and rec.MD5OfBody; adjust the map
construction around writeSQSJSON/md5OfAttributesHex to conditionally insert that
key so SDK clients won't see a spurious empty hash.

@bootjp bootjp merged commit bc6746b into main Apr 24, 2026
8 checks passed
@bootjp bootjp deleted the feat/sqs_compatible_adapter branch April 24, 2026 18:32