
feat(sqs): SQS-compatible adapter Milestone 1 #610

Merged
bootjp merged 9 commits into feat/sqs_compatible_adapter from
feat/sqs_adapter_scaffold
Apr 24, 2026

Conversation

@bootjp
Owner

@bootjp bootjp commented Apr 24, 2026

Summary

Implements Milestone 1 of the SQS-compatible adapter described in
docs/design/2026_04_24_proposed_sqs_compatible_adapter.md (stacked on #606).

  • Standard queues: CreateQueue / DeleteQueue / ListQueues / GetQueueUrl /
    Get/SetQueueAttributes, and the core message path SendMessage /
    ReceiveMessage / DeleteMessage / ChangeMessageVisibility.
  • Storage layout per the design doc: !sqs|queue|meta|, !sqs|queue|gen|,
    !sqs|msg|data|, and a visibility-indexed !sqs|msg|vis| keyspace so
    ReceiveMessage scans the next visible messages with a single bounded
    prefix scan — no background sweeper.
  • Every receive rotates the message's receipt token inside a Raft-committed
    OCC transaction, so receipt handles are forgery-resistant without HMAC.
  • ReceiveMessage fences its scan with kv.LeaseReadThrough so the local
    snapshot stays within the lease window and falls back to
    LinearizableRead when the lease is cold.
  • SigV4 primitives extracted into adapter/sigv4.go and shared with the
    S3 adapter. SQS-side authorizeSQSRequest handles the JSON-protocol
    flow (no presigned URLs, no streaming payloads).
  • Leader-proxy helper factored into adapter/leader_http_proxy.go and
    reused by DynamoDB and SQS.
  • Wiring: --sqsAddress / --raftSqsMap / --sqsRegion /
    --sqsCredentialsFile flags, startSQSServer mirrors startS3Server,
    adapter/test_util.go carries an *SQSServer per Node.

Five commits, each self-contained:

  1. feat(sqs): scaffold SQS-compatible adapter — server skeleton,
    dispatch table, JSON error envelope, shared leader-proxy helper.
  2. feat(sqs): wire SQS adapter into main.go and test harness
    flags, runtimeServerRunner, test harness port allocation.
  3. feat(sqs): implement queue catalog CRUD — idempotent Create,
    DeleteQueue with generation bump, ListQueues pagination,
    Get/SetQueueAttributes, FIFO name/attribute validation.
  4. feat(sqs): SigV4 verifier shared with S3, wire static credentials
    — extract SigV4 into adapter/sigv4.go, --sqsCredentialsFile
    support, full SigV4 verifier with happy/failure coverage.
  5. feat(sqs): SendMessage, ReceiveMessage, DeleteMessage, ChangeMessageVisibility — message record, visibility-index scan
    and rotation, receipt handle codec, lease-read fenced scan.

Not in this PR (deferred to Milestone 2+ as per the design):
PurgeQueue, batch APIs (Send/Delete/ChangeVisibility), tag APIs, FIFO,
DLQ redrive, long polling, Jepsen workload, dedicated SQS Prometheus
metrics, Console UI (§13 of the design).

Test plan

  • go test -race -count=1 ./adapter/ — full pass, incl. new SQS
    catalog / auth / message tests (end-to-end through createNode).
  • golangci-lint run --config=.golangci.yaml — 0 issues.
  • DynamoDB / S3 / Redis adapter regressions — green after the
    shared leader-proxy and SigV4 refactors.
  • aws sqs CLI smoke test against a running node (manual, done
    locally on each commit; worth re-running before merge).

@coderabbitai

coderabbitai Bot commented Apr 24, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: aef79cb6-9f44-48ad-bad8-a40fe2d72d8e

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request implements a new SQS-compatible adapter, providing support for queue management and message operations such as Send, Receive, and Delete. It also refactors the DynamoDB and S3 adapters to utilize shared logic for leader proxying and SigV4 authentication. A review comment suggests replacing a manual selection sort with the standard library's more efficient sorting function for message attributes.

Comment thread adapter/sqs_messages.go Outdated
Comment on lines +689 to +695
for i := 0; i < len(keys); i++ {
for j := i + 1; j < len(keys); j++ {
if keys[j] < keys[i] {
keys[i], keys[j] = keys[j], keys[i]
}
}
}
Contributor

medium

The current implementation uses a selection sort (O(n²)) to order attribute keys. For better performance and code clarity, it's recommended to use the standard library's sort.Strings() function, which is more efficient (O(n log n)).

	sort.Strings(keys)
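For reference, the suggested change in runnable form. The real md5OfAttributesHex hashes typed attribute fields per the SQS wire format; this sketch only demonstrates the sorting swap and that the digest becomes independent of map iteration order:

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"sort"
)

// md5OfSortedKeysHex replaces the O(n^2) selection sort with the
// stdlib's O(n log n) sort.Strings before hashing.
func md5OfSortedKeysHex(attrs map[string]string) string {
	keys := make([]string, 0, len(attrs))
	for k := range attrs {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic order regardless of map iteration
	h := md5.New()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte(attrs[k]))
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := md5OfSortedKeysHex(map[string]string{"b": "2", "a": "1"})
	b := md5OfSortedKeysHex(map[string]string{"a": "1", "b": "2"})
	fmt.Println(a == b) // true
}
```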


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a49266185a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread adapter/sqs_messages.go Outdated
Comment on lines +523 to +527
IsTxn: true,
Elems: []*kv.Elem[kv.OP]{
{Op: kv.Del, Key: cand.visKey},
{Op: kv.Put, Key: newVisKey, Value: []byte(cand.messageID)},
{Op: kv.Put, Key: dataKey, Value: recordBytes},

P1: Add read-set checks to message delivery rotation

tryDeliverCandidate reads the message at readTS and then commits a write transaction without any ReadKeys, so OCC won't reject stale reads. Under concurrent ReceiveMessage calls, a second worker can scan an old visibility row, commit after the first worker, and overwrite dataKey/visibility state from stale data, which can return the same message twice and immediately invalidate the first receipt handle. Include at least dataKey (and ideally the old visibility key) in OperationGroup.ReadKeys or perform the token/visibility check atomically in the write transaction.


Comment thread adapter/sqs_messages.go Outdated
Comment on lines +568 to +572
req := &kv.OperationGroup[kv.OP]{
IsTxn: true,
Elems: []*kv.Elem[kv.OP]{
{Op: kv.Del, Key: dataKey},
{Op: kv.Del, Key: visKey},

P1: Make receipt-token verification atomic with delete writes

deleteMessage validates the receipt token via loadAndVerifyMessage before building this transaction, but the transaction itself has no read-set/condition on the validated record. If another consumer rotates the token between validation and commit, this stale handle can still delete the message because OCC only checks write-write conflicts when ReadKeys are absent. This breaks the intended guarantee that old receipt handles are invalid after a new receive; the same pattern also affects changeMessageVisibility.


@bootjp bootjp force-pushed the feat/sqs_adapter_scaffold branch from 3e9c1d2 to 715c2f9 on April 24, 2026 at 07:42
bootjp added a commit that referenced this pull request Apr 24, 2026
Address the P1 findings from the codex review on #610 and the gemini
micro-nit:

- tryDeliverCandidate now passes ReadKeys=[old_vis_key, data_key] on
  its commit so a concurrent ReceiveMessage that rotates the same
  message commits a newer version and our Dispatch returns
  ErrWriteConflict. Skipping the candidate on conflict prevents the
  "two workers both think they delivered" duplicate.
- deleteMessage and changeMessageVisibility lift their single-shot
  validate-and-commit into retry loops that mirror the catalog
  handlers. Both pass ReadKeys covering the data record and the
  current vis entry; on ErrWriteConflict we re-validate the token,
  which either succeeds (the state we just saw is still current) or
  returns InvalidReceiptHandle (someone else rotated the token first,
  which is the correct AWS semantics for a stale handle).
- md5OfAttributesHex: drop the hand-rolled O(n^2) selection sort in
  favor of sort.Strings.

No behavior change on the happy path; these are purely
correctness-under-contention fixes. Existing tests still cover the
single-writer flows, and the retry loops inherit the same test-wide
timeout knobs the catalog code uses.
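The retry-loop shape this commit describes, reduced to its control flow. The load/commit callbacks stand in for the real MVCC read and Dispatch; all names here are illustrative, only the validate-commit-revalidate structure comes from the commit message:

```go
package main

import (
	"errors"
	"fmt"
)

var ErrWriteConflict = errors.New("write conflict")

// deleteWithRetry sketches the lifted validate-and-commit loop: on
// ErrWriteConflict it re-validates the token against current state,
// which either succeeds (retry) or fails (stale handle).
func deleteWithRetry(load func() (token string, ok bool), commit func() error, handleToken string, maxRetries int) error {
	for i := 0; i < maxRetries; i++ {
		token, ok := load()
		if !ok || token != handleToken {
			return errors.New("InvalidReceiptHandle") // someone else rotated first
		}
		err := commit()
		if err == nil {
			return nil
		}
		if !errors.Is(err, ErrWriteConflict) {
			return err
		}
		// conflict: loop around and re-validate against current state
	}
	return errors.New("retry budget exhausted")
}

func main() {
	calls := 0
	commit := func() error {
		calls++
		if calls == 1 {
			return ErrWriteConflict // first attempt loses the OCC race
		}
		return nil
	}
	err := deleteWithRetry(func() (string, bool) { return "tok", true }, commit, "tok", 3)
	fmt.Println(err, calls) // <nil> 2
}
```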
@bootjp bootjp force-pushed the feat/sqs_adapter_scaffold branch from 1ec56dd to 9067886 on April 24, 2026 at 07:55
@bootjp bootjp force-pushed the feat/sqs_adapter_scaffold branch from 9067886 to 5d4bd8d on April 24, 2026 at 08:01
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5d4bd8db1b


Comment thread adapter/sqs_messages.go
Comment on lines +527 to +530
req := &kv.OperationGroup[kv.OP]{
IsTxn: true,
ReadKeys: [][]byte{cand.visKey, dataKey},
Elems: []*kv.Elem[kv.OP]{

P1: Set StartTS when rotating message visibility

tryDeliverCandidate reads dataKey at readTS, but this transaction leaves StartTS unset so the coordinator assigns a newer start timestamp at dispatch. If another ReceiveMessage commits a rotation between the read and dispatch, that commit can fall before the auto-assigned start TS, letting ReadKeys validation pass and allowing this stale transaction to overwrite the newer token/visibility state (reintroducing duplicate delivery and invalidating the first receipt handle). Set req.StartTS = readTS so OCC validates against the same snapshot used to build recordBytes.


Comment thread adapter/sqs_messages.go
Comment on lines +599 to +602
req := &kv.OperationGroup[kv.OP]{
IsTxn: true,
ReadKeys: [][]byte{dataKey, visKey},
Elems: []*kv.Elem[kv.OP]{

P1: Pin delete OCC check to the validated read snapshot

deleteMessageWithRetry validates the receipt token in loadMessageForDelete, then dispatches with ReadKeys but no StartTS. A concurrent receive can rotate the token after validation and still commit before the coordinator-assigned start TS, so this delete may execute from stale validation data and remove a message with an old receipt handle. Carry the load timestamp into the transaction (req.StartTS) so the read-set check actually guards the token/visibility state; the same pattern affects change-visibility.


@bootjp bootjp force-pushed the feat/sqs_adapter_scaffold branch from 5d4bd8d to 944d81a on April 24, 2026 at 08:08
bootjp added 8 commits on April 24, 2026 at 17:17
Start the SQS-compatible HTTP adapter per
docs/design/2026_04_24_proposed_sqs_compatible_adapter.md. This first
increment wires the server skeleton, dispatch table, error envelope,
health endpoints, and leader-proxy path so subsequent PRs can drop real
handlers into a tested frame.

Included:

- adapter/sqs.go: SQSServer struct, NewSQSServer with SQSServerOption,
  Run/Stop, /sqs_health + /sqs_leader_health, X-Amz-Target dispatch for
  all Milestone-1/2 SQS operations (CreateQueue ... ListQueueTags), and
  the JSON-1.0 error envelope (__type + x-amzn-ErrorType + message).
  Every target currently replies 501 NotImplemented; handlers will be
  filled in separately.
- adapter/sqs_test.go: health endpoints, method gating, error-envelope
  shape, unknown-target -> InvalidAction, every known target ->
  NotImplemented, and a Stop-unblocks-Run regression guard.
- adapter/leader_http_proxy.go: shared proxyHTTPRequestToLeader helper
  factored out of the DynamoDB adapter. DynamoDB and SQS both delegate
  to it via per-adapter error writers, removing the structural dup the
  linter flagged.

No main.go wiring or flag changes in this PR; that lands alongside the
first real handler so the binary does not expose a 501-only endpoint.
Add the minimal plumbing to bring an SQSServer online next to the
existing DynamoDB and S3 servers. No new handlers yet - every target
still returns NotImplemented - but the binary and the test harness can
now open an SQS endpoint so subsequent PRs that add real operations
have somewhere to land.

- shard_config.go: ErrInvalidRaftSQSMapEntry + parseRaftSQSMap.
- main.go: --sqsAddress / --raftSqsMap flags, leaderSQS in
  runtimeConfig, buildLeaderSQS, and startSQSServer wired into
  runtimeServerRunner.start. sqsAddress defaults to empty.
- main_sqs.go: startSQSServer mirrors main_s3.go two-goroutine
  shutdown pattern.
- adapter/test_util.go: portsAdress / listeners / Node / newNode /
  setupNodes carry an sqsAddress and *SQSServer per node, shutdown
  calls Stop via an extracted shutdownNode helper, portSQS starts at
  29000 so tests do not collide with the dynamo range at 28000.
- main_bootstrap_e2e_test.go: update parseRuntimeConfig signature.
Wire up the first real SQS handlers on top of the existing MVCC /
Coordinator primitives:

- CreateQueue (idempotent when attributes match, QueueNameExists
  otherwise; FIFO suffix + FifoQueue attribute cross-validated; every
  supported attribute validated against its AWS range)
- DeleteQueue (tombstones the meta record and bumps the generation
  counter so a subsequent CreateQueue with the same name starts a new
  incarnation; actual message-keyspace reclaim lands with Stage 4)
- ListQueues (prefix-scans !sqs|queue|meta|, supports
  QueueNamePrefix / MaxResults / NextToken)
- GetQueueUrl
- GetQueueAttributes / SetQueueAttributes

Design choices:

- Metadata is stored as JSON with a four-byte magic prefix so future
  encoding migrations can switch formats without reading back garbage.
- Segment encoding uses base64 raw URL for queue names, matching the
  DynamoDB adapter's encodeDynamoSegment so operators reading raw
  Pebble keys see the same shape across adapters.
- Attribute validation is table-driven via sqsAttributeAppliers so
  adding an attribute does not grow applyAttributes's cyclomatic
  complexity.
- The OCC retry loop is split into tryCreateQueueOnce so each retry
  pass is self-contained and the outer loop only deals with backoff.
- All mutating operations go through the leader via the shared
  proxyHTTPRequestToLeader helper, matching the DynamoDB adapter.
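The table-driven applier idea from the design choices above can be sketched like this. The queueMeta struct and function names are illustrative; the numeric ranges are the documented SQS limits for these two attributes (VisibilityTimeout 0-43200 s, DelaySeconds 0-900 s):

```go
package main

import (
	"fmt"
	"strconv"
)

type queueMeta struct {
	VisibilityTimeout int64 // seconds
	DelaySeconds      int64 // seconds
}

type applier func(m *queueMeta, v string) error

// intInRange builds one table entry: parse, range-check, apply.
func intInRange(lo, hi int64, set func(*queueMeta, int64)) applier {
	return func(m *queueMeta, v string) error {
		n, err := strconv.ParseInt(v, 10, 64)
		if err != nil || n < lo || n > hi {
			return fmt.Errorf("InvalidAttributeValue: %q", v)
		}
		set(m, n)
		return nil
	}
}

// Adding an attribute is a map entry, not another branch in
// applyAttributes, so cyclomatic complexity stays flat.
var appliers = map[string]applier{
	"VisibilityTimeout": intInRange(0, 43_200, func(m *queueMeta, n int64) { m.VisibilityTimeout = n }),
	"DelaySeconds":      intInRange(0, 900, func(m *queueMeta, n int64) { m.DelaySeconds = n }),
}

func applyAttributes(m *queueMeta, attrs map[string]string) error {
	for k, v := range attrs {
		ap, ok := appliers[k]
		if !ok {
			return fmt.Errorf("InvalidAttributeName: %s", k)
		}
		if err := ap(m, v); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	var m queueMeta
	fmt.Println(applyAttributes(&m, map[string]string{"DelaySeconds": "30"}))  // <nil>
	fmt.Println(applyAttributes(&m, map[string]string{"DelaySeconds": "901"})) // non-nil range error
}
```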

Tests (adapter/sqs_catalog_test.go) spin up a real single-node cluster
via createNode(t, 1) and exercise the end-to-end JSON-protocol path:
Create/Get/List, idempotent Create, QueueNameExists on attribute
change, Get/SetAttributes, Delete + follow-up NonExistentQueue, FIFO
name/attribute validation, and key-encoding round trips.

The scaffold "unknown target returns NotImplemented" test was updated
to skip the now-implemented catalog targets.
Refactor S3's SigV4 primitives into adapter/sigv4.go and implement the
SQS JSON-protocol verifier on top of them.

- adapter/sigv4.go: service-agnostic SigV4 helpers
  (parseSigV4AuthorizationHeader, buildSigV4AuthorizationHeader,
  buildSigV4AuthorizationHeaderRestricted, extractSigV4Signature /
  SignedHeaders / Auth fields). The *Restricted builder strips
  transport-added headers that are not listed in SignedHeaders, so
  re-signing a server-side request does not fail because Go's http
  client added Accept-Encoding after the client signed.
- adapter/s3_auth.go: delegates parse/build/extract to the shared
  helpers. Error ordering and the presigned-URL path are unchanged to
  keep the existing S3 tests green.
- adapter/sqs_auth.go: WithSQSRegion / WithSQSStaticCredentials
  options and the JSON-protocol verifier, split into
  validateSQSAuthScope / validateSQSAuthDate / drainAndHashSQSBody /
  verifySQSSignatureMatches so each pass is independently testable.
  When no credentials are configured the endpoint stays open
  (matching S3's scaffold-test-friendly behavior).
- adapter/sqs.go: SQSServer gains region + staticCreds fields and
  calls authorizeSQSRequest after the leader-proxy check, before
  dispatching to the handler map.
- adapter/sqs_auth_test.go: happy-path SigV4, missing Authorization,
  wrong-service scope, unknown access key, tampered body, clock
  skew, and the no-credentials open-endpoint case. All tests drive
  requests through the actual AWS SDK v4 signer so the adapter stays
  compatible with real SDK callers.
- main_sigv4_creds.go: loadSigV4StaticCredentialsFile, the shared
  JSON credentials loader; main_s3.go.loadS3StaticCredentials now
  delegates to it.
- main_sqs.go + main.go: --sqsRegion / --sqsCredentialsFile flags
  and per-runner fields, passed through to WithSQSRegion /
  WithSQSStaticCredentials.
…isibility

Implement the core message path against the visibility-indexed keyspace
described in docs/design/2026_04_24_proposed_sqs_compatible_adapter.md.

Storage layout:

- !sqs|msg|data|<queue-esc><gen-u64><msg-id-esc> - the full message
  record: body, MD5, attributes, send / available / visible
  timestamps, receive count, and the current rotating receipt token.
- !sqs|msg|vis|<queue-esc><gen-u64><visible-at-u64><msg-id-esc> - a
  visibility index keyed by visible_at. ReceiveMessage scans the
  range [0, now+1) to find the next visible messages without any
  background sweeper.
- Receipt handles are base64url(version || queue_gen || message_id ||
  receipt_token) and carry everything Delete / ChangeVisibility need
  to locate the record and verify ownership.
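A sketch of that codec. Only the base64url(version || queue_gen || message_id || receipt_token) shape comes from the description above; the length prefix for the message id and the 16-byte token are assumptions:

```go
package main

import (
	"encoding/base64"
	"encoding/binary"
	"errors"
	"fmt"
)

const handleVersion = 1

// encodeHandle packs version, generation, a length-prefixed message id,
// and the rotating receipt token into an opaque base64url string.
func encodeHandle(gen uint64, msgID string, token []byte) string {
	buf := []byte{handleVersion}
	buf = binary.BigEndian.AppendUint64(buf, gen)
	buf = append(buf, byte(len(msgID)))
	buf = append(buf, msgID...)
	buf = append(buf, token...)
	return base64.RawURLEncoding.EncodeToString(buf)
}

// decodeHandle recovers everything Delete / ChangeVisibility need to
// locate the record and verify ownership; structural damage maps to
// ReceiptHandleIsInvalid.
func decodeHandle(h string) (gen uint64, msgID string, token []byte, err error) {
	raw, err := base64.RawURLEncoding.DecodeString(h)
	if err != nil || len(raw) < 10 || raw[0] != handleVersion {
		return 0, "", nil, errors.New("ReceiptHandleIsInvalid")
	}
	gen = binary.BigEndian.Uint64(raw[1:9])
	n := int(raw[9])
	if len(raw) < 10+n {
		return 0, "", nil, errors.New("ReceiptHandleIsInvalid")
	}
	return gen, string(raw[10 : 10+n]), raw[10+n:], nil
}

func main() {
	h := encodeHandle(3, "msg-42", []byte("0123456789abcdef"))
	gen, id, tok, err := decodeHandle(h)
	fmt.Println(gen, id, string(tok), err) // 3 msg-42 0123456789abcdef <nil>
}
```

Because the token rotates inside a Raft-committed transaction on every receive, the handle needs no HMAC: a guessed or stale token simply fails the ownership check.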

Handlers:

- sendMessage: validates body size against MaximumMessageSize,
  resolves effective DelaySeconds, then writes the data record and
  the matching visibility entry in one OCC transaction.
- receiveMessage: fences the scan with LeaseReadThrough so the
  snapshot scan stays local on the leader inside the lease window
  and falls back to LinearizableRead when the lease is cold. For
  each candidate it runs a single-message OCC transaction that
  deletes the old vis entry, inserts a new one at (now + visibility
  timeout), rotates the receipt token, and bumps ReceiveCount. Race
  losers are skipped so a batch returns whatever it could deliver.
- deleteMessage: verifies receipt token on the data record, then
  atomically drops the data + vis keys.
- changeMessageVisibility: swaps the vis entry to a new visible_at
  and updates the record; rejects messages whose visibility already
  expired with MessageNotInflight.

Tests (adapter/sqs_messages_test.go) spin up a single-node cluster
and cover:

- Send -> Receive -> Delete happy path, with MD5 and MessageId checks.
- Tampered receipt handle rejected with InvalidReceiptHandle.
- MaxNumberOfMessages returns a partial batch across two receives.
- DelaySeconds defers delivery until the delay elapses.
- Visibility-timeout expiry re-delivers the same message with
  ReceiveCount=2.
- ChangeMessageVisibility extends the in-flight window.
- Receipt-handle codec round trips for several (queue_gen,
  message_id, token) combinations.

Scaffold "returns NotImplemented" test updated to skip the newly
implemented targets (Send/Receive/Delete/ChangeMessageVisibility).
PurgeQueue, batch APIs, and tag APIs remain NotImplemented for now.
Address the P1 findings from the codex review on #610 and the gemini
micro-nit:

- tryDeliverCandidate now passes ReadKeys=[old_vis_key, data_key] on
  its commit so a concurrent ReceiveMessage that rotates the same
  message commits a newer version and our Dispatch returns
  ErrWriteConflict. Skipping the candidate on conflict prevents the
  "two workers both think they delivered" duplicate.
- deleteMessage and changeMessageVisibility lift their single-shot
  validate-and-commit into retry loops that mirror the catalog
  handlers. Both pass ReadKeys covering the data record and the
  current vis entry; on ErrWriteConflict we re-validate the token,
  which either succeeds (the state we just saw is still current) or
  returns InvalidReceiptHandle (someone else rotated the token first,
  which is the correct AWS semantics for a stale handle).
- md5OfAttributesHex: drop the hand-rolled O(n^2) selection sort in
  favor of sort.Strings.

No behavior change on the happy path; these are purely
correctness-under-contention fixes. Existing tests still cover the
single-writer flows, and the retry loops inherit the same test-wide
timeout knobs the catalog code uses.
AWS SQS semantics: DeleteMessage with a stale receipt handle (token
rotated under us, or record already gone) is a 200 success no-op, NOT
an InvalidReceiptHandle. This is relied on by SDK retry paths and
batch workers, and also by clients that do a visibility-expiry
redelivery dance.

- deleteMessageWithRetry now calls a new loadMessageForDelete helper
  that returns a sqsDeleteOutcome tag. Missing record and token
  mismatch both map to sqsDeleteNoOp -> return nil (handler renders
  200 empty body). Token-match proceeds to the OCC delete txn.
  Only structural errors (malformed handle) and retry-budget
  exhaustion still propagate as errors.
- ChangeMessageVisibility keeps the strict behavior — AWS also
  errors there when the handle is stale, so changeVisibilityWithRetry
  is unchanged.
- Test rename: TestSQSServer_DeleteWithWrongReceiptRejected ->
  TestSQSServer_DeleteWithStaleReceiptIsIdempotentNoOp. The test now
  asserts that a token-mismatched delete returns 200, that the real
  handle still works afterwards (the stale no-op must not have
  "stolen" the in-flight message), and that a structurally malformed
  handle still fails with ReceiptHandleIsInvalid.

The ReceiveCount / first-receive-timestamp behavior that the same codex
review flagged is already implemented correctly
(tryDeliverCandidate: "if rec.FirstReceiveMillis == 0 { ... = now }"),
so no code change was needed for that finding beyond the design-doc
clarification that landed on feat/sqs_compatible_adapter.
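The outcome-tag classification this commit describes, reduced to a sketch. Names mirror the commit's sqsDeleteOutcome description; the inputs to classify are illustrative stand-ins for the real loadMessageForDelete result:

```go
package main

import "fmt"

type deleteOutcome int

const (
	deleteProceed deleteOutcome = iota // token matches: run the OCC delete txn
	deleteNoOp                         // record gone or token rotated: 200 empty body
	deleteInvalid                      // malformed handle: ReceiptHandleIsInvalid
)

// classify maps the load-and-verify result onto AWS delete semantics:
// a stale receipt handle is a silent success, only structural damage
// is an error.
func classify(recordExists bool, storedToken, handleToken string, handleOK bool) deleteOutcome {
	switch {
	case !handleOK:
		return deleteInvalid
	case !recordExists, storedToken != handleToken:
		return deleteNoOp // SDK retry paths rely on this being a 200
	default:
		return deleteProceed
	}
}

func main() {
	fmt.Println(classify(true, "t2", "t1", true) == deleteNoOp)     // rotated token
	fmt.Println(classify(false, "", "t1", true) == deleteNoOp)      // record already gone
	fmt.Println(classify(true, "t1", "t1", true) == deleteProceed)  // happy path
	fmt.Println(classify(true, "t1", "t1", false) == deleteInvalid) // malformed handle
}
```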
codex round 2 on #610 flagged two P1 OCC safety holes: the
ReceiveMessage rotation and the DeleteMessage commit both left
StartTS=0, so the coordinator auto-assigned a newer StartTS at
Dispatch time. A concurrent rotation that committed between our
load and the auto-assigned StartTS would fall below the ReadKeys
check and slip through, allowing stale-snapshot writes to land —
double delivery in the receive path, stealing an in-flight
message in the delete path.

Fix: carry the snapshot readTS we actually loaded the record at
through to the Dispatch call and set it as StartTS on the
OperationGroup. The existing ReadKeys then fence correctly against
any commit in (readTS, now].

- tryDeliverCandidate: StartTS=readTS on the rotation txn.
- loadMessageForDelete: now returns the readTS it took; caller
  (deleteMessageWithRetry) threads it into the delete txn.
- loadAndVerifyMessage: now returns the readTS too;
  changeVisibilityWithRetry threads it into the swap txn.

No behavior change under no-contention. Existing tests still
pass; a Jepsen-style concurrent receive/delete test that exposes
the race will land in the follow-up workload PR alongside FIFO.
@bootjp bootjp force-pushed the feat/sqs_adapter_scaffold branch from 944d81a to 1846fd0 on April 24, 2026 at 08:18
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 944d81a17d


Comment thread adapter/sqs_catalog.go Outdated
Comment on lines +505 to +509
req := &kv.OperationGroup[kv.OP]{
IsTxn: true,
Elems: []*kv.Elem[kv.OP]{
{Op: kv.Put, Key: sqsQueueMetaKey(requested.Name), Value: metaBytes},
{Op: kv.Put, Key: sqsQueueGenKey(requested.Name), Value: []byte(strconv.FormatUint(requested.Generation, 10))},

P1: Add snapshot fencing to CreateQueue transaction

tryCreateQueueOnce reads queue absence/generation at readTS and then dispatches a write-only transaction without StartTS or ReadKeys, so the coordinator can assign a newer start timestamp and let stale decisions commit. If two CreateQueue calls race with different attributes, the second call can still commit after the first and overwrite metadata, causing both requests to return 200 even though one should fail with QueueNameExists. Include the queue meta/gen keys in ReadKeys and pin StartTS to the read snapshot.


Comment thread adapter/sqs_catalog.go
Comment on lines +815 to +819
req := &kv.OperationGroup[kv.OP]{
IsTxn: true,
Elems: []*kv.Elem[kv.OP]{
{Op: kv.Put, Key: sqsQueueMetaKey(queueName), Value: metaBytes},
},

P1: Fence SetQueueAttributes against stale read overwrites

setQueueAttributesWithRetry loads the full queue metadata, mutates selected fields, then writes the entire record in a transaction with no StartTS/ReadKeys. Under concurrent updates, both writers can read the same old metadata and the later dispatch can overwrite the earlier successful update (even when they changed different attributes), silently dropping one change instead of retrying on conflict. Add read-set validation on the metadata key and bind the write to the original read timestamp.


codex round 3 on #610 flagged the same StartTS-missing hole the
message path already has fixed on the catalog CRUD path:

- tryCreateQueueOnce reads queue absence + generation at readTS,
  then dispatches Put without StartTS or ReadKeys. Two races can
  both observe "queue missing" and both commit, and the coordinator
  auto-assigns StartTS so OCC never rejects the second one.
- deleteQueueWithRetry and setQueueAttributesWithRetry have the
  same pattern (read at readTS, write with no StartTS/ReadKeys),
  which lets a concurrent create/delete/set clobber a just-
  committed change.

Fix: carry readTS through to the dispatch as StartTS and add
ReadKeys covering the meta and (for create/delete) generation
records. The existing retry loops already handle ErrWriteConflict,
so contention now retries instead of silently committing stale
state.

No behavior change at the single-writer level; the existing
idempotent-create test and delete-generation-bump test still pass.

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 13e75f4e93


Comment thread adapter/sqs_messages.go
Comment on lines +293 to +297
req := &kv.OperationGroup[kv.OP]{
	IsTxn: true,
	Elems: []*kv.Elem[kv.OP]{
		{Op: kv.Put, Key: dataKey, Value: recordBytes},
		{Op: kv.Put, Key: visKey, Value: []byte(rec.MessageID)},

P1: Fence SendMessage against concurrent queue deletion

sendMessage reads queue metadata at readTS but commits a write-only transaction without StartTS or ReadKeys, so a concurrent DeleteQueue/recreate can advance the generation after the read and this stale send still commits to the old generation keyspace. The API returns 200, but the message becomes unreachable because subsequent receives route by the newer generation; this is an acknowledged-message loss window under delete/recreate races.


Comment thread adapter/sqs_messages.go
Comment on lines +532 to +534
StartTS: readTS,
ReadKeys: [][]byte{cand.visKey, dataKey},
Elems: []*kv.Elem[kv.OP]{

P1: Block ReceiveMessage commits after queue generation changes

tryDeliverCandidate only validates the message visibility/data keys, not queue metadata/generation, so a ReceiveMessage that scanned candidates before DeleteQueue can still rotate visibility and return a receipt after the queue has been deleted or recreated. Because delete updates only queue meta/gen records, this transaction sees no conflict and can leak deliveries from a retired generation.


Comment thread adapter/sqs_catalog.go
Comment on lines +787 to +790
for k := range selection {
	if v, ok := all[k]; ok {
		out[k] = v
	}

P2: Reject unknown names in GetQueueAttributes requests

The attribute-selection path silently drops unknown AttributeNames instead of returning InvalidAttributeName: selectedAttributeNames accepts arbitrary names and queueMetaToAttributes just skips missing keys. That makes typos look successful and diverges from SQS error semantics, which can break client-side validation and troubleshooting.


@bootjp merged commit c692465 into feat/sqs_compatible_adapter on Apr 24, 2026
4 of 5 checks passed
@bootjp deleted the feat/sqs_adapter_scaffold branch April 24, 2026 08:42
bootjp added a commit that referenced this pull request Apr 24, 2026
codex round 8 on #610 flagged two findings, both addressed here:

- P1 sqs_messages.go: ReceiveMessage parsed WaitTimeSeconds but
  never used it, so clients asking for long polling got an
  immediate empty return and hot-polled the server. Milestone 1
  now does a polling-based long poll: if the first scan returns
  nothing and WaitTimeSeconds > 0, the handler re-scans every
  200 ms until a message arrives, the wait elapses, or the
  request context is canceled. The full commit-stream notifier
  from §7.3 of the design is still deferred, but the observable
  client contract now matches AWS.
  resolveReceiveWaitSeconds applies per-request value -> queue
  default -> clamp to [0, 20].
- P2 sqs_catalog.go: IsFIFO was pre-initialized from the .fifo
  name suffix before attribute validation ran, so a CreateQueue
  without FifoQueue=true silently became a FIFO queue. AWS
  requires FifoQueue=true to be explicitly set. IsFIFO now
  defaults to false; a .fifo-suffixed name without the attribute
  returns InvalidParameterValue ("FIFO queue name requires
  FifoQueue=true attribute").

New tests: TestSQSServer_LongPollWaitsForArrival schedules a send
mid-wait and asserts that the receive actually waited and then
returned promptly after the arrival;
TestSQSServer_LongPollTimesOutOnEmptyQueue asserts WaitTimeSeconds
on an empty queue returns empty after approximately the requested
time.