feat(sqs): SQS-compatible adapter Milestone 1#610
#610 · bootjp merged 9 commits into feat/sqs_compatible_adapter
Conversation
Code Review
This pull request implements a new SQS-compatible adapter, providing support for queue management and message operations such as Send, Receive, and Delete. It also refactors the DynamoDB and S3 adapters to utilize shared logic for leader proxying and SigV4 authentication. A review comment suggests replacing a manual selection sort with the standard library's more efficient sorting function for message attributes.
```go
for i := 0; i < len(keys); i++ {
	for j := i + 1; j < len(keys); j++ {
		if keys[j] < keys[i] {
			keys[i], keys[j] = keys[j], keys[i]
		}
	}
}
```
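The review's suggested fix is the standard library's `sort.Strings`. A minimal sketch of how the key ordering could look (the function name `sortedAttributeKeys` is illustrative, not the PR's actual helper):

```go
package main

import (
	"fmt"
	"sort"
)

// sortedAttributeKeys returns the map's keys in ascending order using the
// standard library instead of a hand-rolled O(n^2) selection sort.
func sortedAttributeKeys(attrs map[string]string) []string {
	keys := make([]string, 0, len(attrs))
	for k := range attrs {
		keys = append(keys, k)
	}
	sort.Strings(keys) // O(n log n), replaces the nested-loop swap
	return keys
}

func main() {
	fmt.Println(sortedAttributeKeys(map[string]string{"b": "2", "a": "1", "c": "3"}))
}
```

Besides the asymptotic win, `sort.Strings` makes the intent (deterministic attribute ordering before hashing) obvious at a glance.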
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a49266185a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```go
IsTxn: true,
Elems: []*kv.Elem[kv.OP]{
	{Op: kv.Del, Key: cand.visKey},
	{Op: kv.Put, Key: newVisKey, Value: []byte(cand.messageID)},
	{Op: kv.Put, Key: dataKey, Value: recordBytes},
```
Add read-set checks to message delivery rotation
tryDeliverCandidate reads the message at readTS and then commits a write transaction without any ReadKeys, so OCC won't reject stale reads. Under concurrent ReceiveMessage calls, a second worker can scan an old visibility row, commit after the first worker, and overwrite dataKey/visibility state from stale data, which can return the same message twice and immediately invalidate the first receipt handle. Include at least dataKey (and ideally the old visibility key) in OperationGroup.ReadKeys or perform the token/visibility check atomically in the write transaction.
```go
req := &kv.OperationGroup[kv.OP]{
	IsTxn: true,
	Elems: []*kv.Elem[kv.OP]{
		{Op: kv.Del, Key: dataKey},
		{Op: kv.Del, Key: visKey},
```
Make receipt-token verification atomic with delete writes
deleteMessage validates the receipt token via loadAndVerifyMessage before building this transaction, but the transaction itself has no read-set/condition on the validated record. If another consumer rotates the token between validation and commit, this stale handle can still delete the message because OCC only checks write-write conflicts when ReadKeys are absent. This breaks the intended guarantee that old receipt handles are invalid after a new receive; the same pattern also affects changeMessageVisibility.
Force-pushed from 3e9c1d2 to 715c2f9.
Address the P1 findings from the codex review on #610 and the gemini micro-nit:
- tryDeliverCandidate now passes ReadKeys=[old_vis_key, data_key] on its commit, so a concurrent ReceiveMessage that rotates the same message commits a newer version and our Dispatch returns ErrWriteConflict. Skipping the candidate on conflict prevents the "two workers both think they delivered" duplicate.
- deleteMessage and changeMessageVisibility lift their single-shot validate-and-commit into retry loops that mirror the catalog handlers. Both pass ReadKeys covering the data record and the current vis entry; on ErrWriteConflict we re-validate the token, which either succeeds (the state we just saw is still current) or returns InvalidReceiptHandle (someone else rotated the token first, which is the correct AWS semantics for a stale handle).
- md5OfAttributesHex: drop the hand-rolled O(n^2) selection sort in favor of sort.Strings.

No behavior change on the happy path; these are purely correctness-under-contention fixes. Existing tests still cover the single-writer flows, and the retry loops inherit the same test-wide timeout knobs the catalog code uses.
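The validate-and-commit retry loop described above can be sketched with stand-in functions. `deleteWithRetry`, `validate`, and `commit` here are illustrative placeholders for the PR's loadAndVerifyMessage/Dispatch pair, not the adapter's real API:

```go
package main

import (
	"errors"
	"fmt"
)

var ErrWriteConflict = errors.New("write conflict")

// deleteWithRetry sketches the retry pattern: re-validate the receipt token
// after every OCC conflict, succeed only once a commit lands cleanly.
func deleteWithRetry(maxAttempts int, validate func() (bool, error), commit func() error) error {
	for attempt := 0; attempt < maxAttempts; attempt++ {
		ok, err := validate()
		if err != nil {
			return err
		}
		if !ok {
			// Someone else rotated the token first.
			return errors.New("InvalidReceiptHandle")
		}
		err = commit()
		if err == nil {
			return nil
		}
		if !errors.Is(err, ErrWriteConflict) {
			return err
		}
		// Conflict: loop and re-validate against the newer state.
	}
	return fmt.Errorf("retry budget exhausted")
}

func main() {
	attempts := 0
	err := deleteWithRetry(3,
		func() (bool, error) { return true, nil },
		func() error {
			attempts++
			if attempts == 1 {
				return ErrWriteConflict
			}
			return nil
		})
	fmt.Println(err, attempts)
}
```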
Force-pushed from 1ec56dd to 9067886.
Force-pushed from 9067886 to 5d4bd8d.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5d4bd8db1b
```go
req := &kv.OperationGroup[kv.OP]{
	IsTxn:    true,
	ReadKeys: [][]byte{cand.visKey, dataKey},
	Elems:    []*kv.Elem[kv.OP]{
```
Set StartTS when rotating message visibility
tryDeliverCandidate reads dataKey at readTS, but this transaction leaves StartTS unset so the coordinator assigns a newer start timestamp at dispatch. If another ReceiveMessage commits a rotation between the read and dispatch, that commit can fall before the auto-assigned start TS, letting ReadKeys validation pass and allowing this stale transaction to overwrite the newer token/visibility state (reintroducing duplicate delivery and invalidating the first receipt handle). Set req.StartTS = readTS so OCC validates against the same snapshot used to build recordBytes.
```go
req := &kv.OperationGroup[kv.OP]{
	IsTxn:    true,
	ReadKeys: [][]byte{dataKey, visKey},
	Elems:    []*kv.Elem[kv.OP]{
```
Pin delete OCC check to the validated read snapshot
deleteMessageWithRetry validates the receipt token in loadMessageForDelete, then dispatches with ReadKeys but no StartTS. A concurrent receive can rotate the token after validation and still commit before the coordinator-assigned start TS, so this delete may execute from stale validation data and remove a message with an old receipt handle. Carry the load timestamp into the transaction (req.StartTS) so the read-set check actually guards the token/visibility state; the same pattern affects change-visibility.
Force-pushed from 5d4bd8d to 944d81a.
Start the SQS-compatible HTTP adapter per docs/design/2026_04_24_proposed_sqs_compatible_adapter.md. This first increment wires the server skeleton, dispatch table, error envelope, health endpoints, and leader-proxy path so subsequent PRs can drop real handlers into a tested frame.

Included:
- adapter/sqs.go: SQSServer struct, NewSQSServer with SQSServerOption, Run/Stop, /sqs_health + /sqs_leader_health, X-Amz-Target dispatch for all Milestone-1/2 SQS operations (CreateQueue ... ListQueueTags), and the JSON-1.0 error envelope (__type + x-amzn-ErrorType + message). Every target currently replies 501 NotImplemented; handlers will be filled in separately.
- adapter/sqs_test.go: health endpoints, method gating, error-envelope shape, unknown-target -> InvalidAction, every known target -> NotImplemented, and a Stop-unblocks-Run regression guard.
- adapter/leader_http_proxy.go: shared proxyHTTPRequestToLeader helper factored out of the DynamoDB adapter. DynamoDB and SQS both delegate to it via per-adapter error writers, removing the structural dup the linter flagged.

No main.go wiring or flag changes in this PR; that lands alongside the first real handler so the binary does not expose a 501-only endpoint.
Add the minimal plumbing to bring an SQSServer online next to the existing DynamoDB and S3 servers. No new handlers yet — every target still returns NotImplemented — but the binary and the test harness can now open an SQS endpoint so subsequent PRs that add real operations have somewhere to land.
- shard_config.go: ErrInvalidRaftSQSMapEntry + parseRaftSQSMap.
- main.go: --sqsAddress / --raftSqsMap flags, leaderSQS in runtimeConfig, buildLeaderSQS, and startSQSServer wired into runtimeServerRunner.start. sqsAddress defaults to empty.
- main_sqs.go: startSQSServer mirrors main_s3.go's two-goroutine shutdown pattern.
- adapter/test_util.go: portsAdress / listeners / Node / newNode / setupNodes carry an sqsAddress and *SQSServer per node, shutdown calls Stop via an extracted shutdownNode helper, and portSQS starts at 29000 so tests do not collide with the dynamo range at 28000.
- main_bootstrap_e2e_test.go: update parseRuntimeConfig signature.
Wire up the first real SQS handlers on top of the existing MVCC / Coordinator primitives:
- CreateQueue (idempotent when attributes match, QueueNameExists otherwise; FIFO suffix + FifoQueue attribute cross-validated; every supported attribute validated against its AWS range)
- DeleteQueue (tombstones the meta record and bumps the generation counter so a subsequent CreateQueue with the same name starts a new incarnation; actual message-keyspace reclaim lands with Stage 4)
- ListQueues (prefix-scans !sqs|queue|meta|, supports QueueNamePrefix / MaxResults / NextToken)
- GetQueueUrl
- GetQueueAttributes / SetQueueAttributes

Design choices:
- Metadata is stored as JSON with a four-byte magic prefix so future encoding migrations can switch formats without reading back garbage.
- Segment encoding uses base64 raw URL for queue names, matching the DynamoDB adapter's encodeDynamoSegment so operators reading raw Pebble keys see the same shape across adapters.
- Attribute validation is table-driven via sqsAttributeAppliers so adding an attribute does not grow applyAttributes's cyclomatic complexity.
- The OCC retry loop is split into tryCreateQueueOnce so each retry pass is self-contained and the outer loop only deals with backoff.
- All mutating operations go through the leader via the shared proxyHTTPRequestToLeader helper, matching the DynamoDB adapter.

Tests (adapter/sqs_catalog_test.go) spin up a real single-node cluster via createNode(t, 1) and exercise the end-to-end JSON-protocol path: Create/Get/List, idempotent Create, QueueNameExists on attribute change, Get/SetAttributes, Delete + follow-up NonExistentQueue, FIFO name/attribute validation, and key-encoding round trips. The scaffold "unknown target returns NotImplemented" test was updated to skip the now-implemented catalog targets.
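The "JSON with a four-byte magic prefix" idea can be sketched like this. The magic value `SQJ1` and the `queueMeta` fields are invented for illustration; the adapter's actual constants are not shown in the PR:

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
)

// sqsMetaMagic is a stand-in 4-byte magic. A future decoder can reject or
// route records written in a different encoding by inspecting it first.
var sqsMetaMagic = []byte{'S', 'Q', 'J', '1'}

type queueMeta struct {
	Name       string `json:"name"`
	Generation uint64 `json:"generation"`
}

func encodeQueueMeta(m queueMeta) ([]byte, error) {
	body, err := json.Marshal(m)
	if err != nil {
		return nil, err
	}
	return append(append([]byte{}, sqsMetaMagic...), body...), nil
}

func decodeQueueMeta(raw []byte) (queueMeta, error) {
	var m queueMeta
	if len(raw) < len(sqsMetaMagic) || !bytes.Equal(raw[:len(sqsMetaMagic)], sqsMetaMagic) {
		return m, errors.New("unknown queue-meta encoding")
	}
	err := json.Unmarshal(raw[len(sqsMetaMagic):], &m)
	return m, err
}

func main() {
	raw, _ := encodeQueueMeta(queueMeta{Name: "jobs", Generation: 2})
	m, err := decodeQueueMeta(raw)
	fmt.Println(m.Name, m.Generation, err)
}
```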
Refactor S3's SigV4 primitives into adapter/sigv4.go and implement the SQS JSON-protocol verifier on top of them.
- adapter/sigv4.go: service-agnostic SigV4 helpers (parseSigV4AuthorizationHeader, buildSigV4AuthorizationHeader, buildSigV4AuthorizationHeaderRestricted, extractSigV4Signature / SignedHeaders / Auth fields). The *Restricted builder strips transport-added headers that are not listed in SignedHeaders, so re-signing a server-side request does not fail because Go's http client added Accept-Encoding after the client signed.
- adapter/s3_auth.go: delegates parse/build/extract to the shared helpers. Error ordering and the presigned-URL path are unchanged to keep the existing S3 tests green.
- adapter/sqs_auth.go: WithSQSRegion / WithSQSStaticCredentials options and the JSON-protocol verifier, split into validateSQSAuthScope / validateSQSAuthDate / drainAndHashSQSBody / verifySQSSignatureMatches so each pass is independently testable. When no credentials are configured the endpoint stays open (matching S3's scaffold-test-friendly behavior).
- adapter/sqs.go: SQSServer gains region + staticCreds fields and calls authorizeSQSRequest after the leader-proxy check, before dispatching to the handler map.
- adapter/sqs_auth_test.go: happy-path SigV4, missing Authorization, wrong-service scope, unknown access key, tampered body, clock skew, and the no-credentials open-endpoint case. All tests drive requests through the actual AWS SDK v4 signer so the adapter stays compatible with real SDK callers.
- main_sigv4_creds.go: loadSigV4StaticCredentialsFile, the shared JSON credentials loader; main_s3.go's loadS3StaticCredentials now delegates to it.
- main_sqs.go + main.go: --sqsRegion / --sqsCredentialsFile flags and per-runner fields, passed through to WithSQSRegion / WithSQSStaticCredentials.
Implement the core message path against the visibility-indexed keyspace described in docs/design/2026_04_24_proposed_sqs_compatible_adapter.md.

Storage layout:
- !sqs|msg|data|<queue-esc><gen-u64><msg-id-esc> — the full message record: body, MD5, attributes, send / available / visible timestamps, receive count, and the current rotating receipt token.
- !sqs|msg|vis|<queue-esc><gen-u64><visible-at-u64><msg-id-esc> — a visibility index keyed by visible_at. ReceiveMessage scans the range [0, now+1) to find the next visible messages without any background sweeper.
- Receipt handles are base64url(version || queue_gen || message_id || receipt_token) and carry everything Delete / ChangeVisibility need to locate the record and verify ownership.

Handlers:
- sendMessage: validates body size against MaximumMessageSize, resolves effective DelaySeconds, then writes the data record and the matching visibility entry in one OCC transaction.
- receiveMessage: fences the scan with LeaseReadThrough so the snapshot scan stays local on the leader inside the lease window and falls back to LinearizableRead when the lease is cold. For each candidate it runs a single-message OCC transaction that deletes the old vis entry, inserts a new one at (now + visibility timeout), rotates the receipt token, and bumps ReceiveCount. Race losers are skipped so a batch returns whatever it could deliver.
- deleteMessage: verifies the receipt token on the data record, then atomically drops the data + vis keys.
- changeMessageVisibility: swaps the vis entry to a new visible_at and updates the record; rejects messages whose visibility already expired with MessageNotInflight.

Tests (adapter/sqs_messages_test.go) spin up a single-node cluster and cover:
- Send -> Receive -> Delete happy path, with MD5 and MessageId checks.
- Tampered receipt handle rejected with InvalidReceiptHandle.
- MaxNumberOfMessages returns a partial batch across two receives.
- DelaySeconds defers delivery until the delay elapses.
- Visibility-timeout expiry re-delivers the same message with ReceiveCount=2.
- ChangeMessageVisibility extends the in-flight window.
- Receipt-handle codec round trips for several (queue_gen, message_id, token) combinations.

Scaffold "returns NotImplemented" test updated to skip the newly implemented targets (Send/Receive/Delete/ChangeMessageVisibility). PurgeQueue, batch APIs, and tag APIs remain NotImplemented for now.
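The receipt-handle layout base64url(version || queue_gen || message_id || receipt_token) can be sketched as a round-trip codec. The length prefix on message_id is an assumption added here to make decoding unambiguous; the PR's actual byte layout may differ:

```go
package main

import (
	"encoding/base64"
	"encoding/binary"
	"errors"
	"fmt"
)

// encodeReceiptHandle packs version || queue_gen || message_id || token into
// base64url, loosely following the layout the commit message describes.
func encodeReceiptHandle(gen uint64, messageID, token string) string {
	buf := []byte{1} // version
	buf = binary.BigEndian.AppendUint64(buf, gen)
	buf = binary.BigEndian.AppendUint16(buf, uint16(len(messageID)))
	buf = append(buf, messageID...)
	buf = append(buf, token...)
	return base64.RawURLEncoding.EncodeToString(buf)
}

// decodeReceiptHandle reverses encodeReceiptHandle; any structural problem
// maps to a single ReceiptHandleIsInvalid-style error.
func decodeReceiptHandle(h string) (gen uint64, messageID, token string, err error) {
	raw, err := base64.RawURLEncoding.DecodeString(h)
	if err != nil || len(raw) < 11 || raw[0] != 1 {
		return 0, "", "", errors.New("ReceiptHandleIsInvalid")
	}
	gen = binary.BigEndian.Uint64(raw[1:9])
	n := int(binary.BigEndian.Uint16(raw[9:11]))
	if len(raw) < 11+n {
		return 0, "", "", errors.New("ReceiptHandleIsInvalid")
	}
	return gen, string(raw[11 : 11+n]), string(raw[11+n:]), nil
}

func main() {
	h := encodeReceiptHandle(3, "msg-123", "tok-abc")
	gen, id, tok, err := decodeReceiptHandle(h)
	fmt.Println(gen, id, tok, err)
}
```

Because the handle carries the rotating receipt token, verifying it against the stored record is what makes stale handles detectable without an HMAC.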
AWS SQS semantics: DeleteMessage with a stale receipt handle (token
rotated under us, or record already gone) is a 200 success no-op, NOT
an InvalidReceiptHandle. This is relied on by SDK retry paths and
batch workers, and also by clients that do a visibility-expiry
redelivery dance.
- deleteMessageWithRetry now calls a new loadMessageForDelete helper
that returns a sqsDeleteOutcome tag. Missing record and token
mismatch both map to sqsDeleteNoOp -> return nil (handler renders
200 empty body). Token-match proceeds to the OCC delete txn.
Only structural errors (malformed handle) and retry-budget
exhaustion still propagate as errors.
- ChangeMessageVisibility keeps the strict behavior — AWS also
errors there when the handle is stale, so changeVisibilityWithRetry
is unchanged.
- Test rename: TestSQSServer_DeleteWithWrongReceiptRejected ->
TestSQSServer_DeleteWithStaleReceiptIsIdempotentNoOp. The test now
asserts that a token-mismatched delete returns 200, that the real
handle still works afterwards (the stale no-op must not have
"stolen" the in-flight message), and that a structurally malformed
handle still fails with ReceiptHandleIsInvalid.
The ReceiveCount / first-receive-timestamp behavior that the same codex
review flagged is already implemented correctly
(tryDeliverCandidate: "if rec.FirstReceiveMillis == 0 { ... = now }"),
so no code change was needed for that finding beyond the design-doc
clarification that landed on feat/sqs_compatible_adapter.
codex round 2 on #610 flagged two P1 OCC safety holes: the ReceiveMessage rotation and the DeleteMessage commit both left StartTS=0, so the coordinator auto-assigned a newer StartTS at Dispatch time. A concurrent rotation that committed between our load and the auto-assigned StartTS would fall below the ReadKeys check and slip through, allowing stale-snapshot writes to land — double delivery in the receive path, stealing an in-flight message in the delete path.

Fix: carry the snapshot readTS we actually loaded the record at through to the Dispatch call and set it as StartTS on the OperationGroup. The existing ReadKeys then fence correctly against any commit in (readTS, now].
- tryDeliverCandidate: StartTS=readTS on the rotation txn.
- loadMessageForDelete: now returns the readTS it took; the caller (deleteMessageWithRetry) threads it into the delete txn.
- loadAndVerifyMessage: now returns the readTS too; changeVisibilityWithRetry threads it into the swap txn.

No behavior change under no-contention. Existing tests still pass; a Jepsen-style concurrent receive/delete test that exposes the race will land in the follow-up workload PR alongside FIFO.
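Why pinning StartTS matters can be shown with a toy read-set validator. This is a simplified model of OCC snapshot validation, not the project's kv package: a commit is rejected when any read key was committed after the transaction's start timestamp, so a StartTS pinned to the load snapshot catches concurrent rotations that a later auto-assigned StartTS would miss:

```go
package main

import (
	"errors"
	"fmt"
)

var ErrWriteConflict = errors.New("write conflict")

// validateOCC models read-set validation: commitTS records the latest commit
// timestamp per key; a transaction started at startTS conflicts if any of
// its read keys was committed after that snapshot.
func validateOCC(commitTS map[string]uint64, readKeys []string, startTS uint64) error {
	for _, k := range readKeys {
		if ts, ok := commitTS[k]; ok && ts > startTS {
			return ErrWriteConflict // key changed after our snapshot
		}
	}
	return nil
}

func main() {
	// A concurrent rotation committed the key at ts=10.
	commits := map[string]uint64{"msg/data/1": 10}
	// StartTS pinned to the load snapshot (5): the rotation is detected.
	fmt.Println(validateOCC(commits, []string{"msg/data/1"}, 5))
	// Auto-assigned later StartTS (15): the stale write would slip through.
	fmt.Println(validateOCC(commits, []string{"msg/data/1"}, 15))
}
```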
Force-pushed from 944d81a to 1846fd0.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 944d81a17d
```go
req := &kv.OperationGroup[kv.OP]{
	IsTxn: true,
	Elems: []*kv.Elem[kv.OP]{
		{Op: kv.Put, Key: sqsQueueMetaKey(requested.Name), Value: metaBytes},
		{Op: kv.Put, Key: sqsQueueGenKey(requested.Name), Value: []byte(strconv.FormatUint(requested.Generation, 10))},
```
Add snapshot fencing to CreateQueue transaction
tryCreateQueueOnce reads queue absence/generation at readTS and then dispatches a write-only transaction without StartTS or ReadKeys, so the coordinator can assign a newer start timestamp and let stale decisions commit. If two CreateQueue calls race with different attributes, the second call can still commit after the first and overwrite metadata, causing both requests to return 200 even though one should fail with QueueNameExists. Include the queue meta/gen keys in ReadKeys and pin StartTS to the read snapshot.
```go
req := &kv.OperationGroup[kv.OP]{
	IsTxn: true,
	Elems: []*kv.Elem[kv.OP]{
		{Op: kv.Put, Key: sqsQueueMetaKey(queueName), Value: metaBytes},
	},
```
Fence SetQueueAttributes against stale read overwrites
setQueueAttributesWithRetry loads the full queue metadata, mutates selected fields, then writes the entire record in a transaction with no StartTS/ReadKeys. Under concurrent updates, both writers can read the same old metadata and the later dispatch can overwrite the earlier successful update (even when they changed different attributes), silently dropping one change instead of retrying on conflict. Add read-set validation on the metadata key and bind the write to the original read timestamp.
codex round 3 on #610 flagged the same StartTS-missing hole the message path already has fixed, on the catalog CRUD path:
- tryCreateQueueOnce reads queue absence + generation at readTS, then dispatches Put without StartTS or ReadKeys. Two racers can both observe "queue missing" and both commit, and the coordinator auto-assigns StartTS so OCC never rejects the second one.
- deleteQueueWithRetry and setQueueAttributesWithRetry have the same pattern (read at readTS, write with no StartTS/ReadKeys), which lets a concurrent create/delete/set clobber a just-committed change.

Fix: carry readTS through to the dispatch as StartTS and add ReadKeys covering the meta and (for create/delete) generation records. The existing retry loops already handle ErrWriteConflict, so contention now retries instead of silently committing stale state. No behavior change at the single-writer level; the existing idempotent-create test and delete-generation-bump test still pass.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 13e75f4e93
```go
req := &kv.OperationGroup[kv.OP]{
	IsTxn: true,
	Elems: []*kv.Elem[kv.OP]{
		{Op: kv.Put, Key: dataKey, Value: recordBytes},
		{Op: kv.Put, Key: visKey, Value: []byte(rec.MessageID)},
```
Fence SendMessage against concurrent queue deletion
sendMessage reads queue metadata at readTS but commits a write-only transaction without StartTS or ReadKeys, so a concurrent DeleteQueue/recreate can advance the generation after the read and this stale send still commits to the old generation keyspace. The API returns 200, but the message becomes unreachable because subsequent receives route by the newer generation; this is an acknowledged-message loss window under delete/recreate races.
```go
StartTS:  readTS,
ReadKeys: [][]byte{cand.visKey, dataKey},
Elems:    []*kv.Elem[kv.OP]{
```
Block ReceiveMessage commits after queue generation changes
tryDeliverCandidate only validates the message visibility/data keys, not queue metadata/generation, so a ReceiveMessage that scanned candidates before DeleteQueue can still rotate visibility and return a receipt after the queue has been deleted or recreated. Because delete updates only queue meta/gen records, this transaction sees no conflict and can leak deliveries from a retired generation.
```go
for k := range selection {
	if v, ok := all[k]; ok {
		out[k] = v
	}
```
Reject unknown names in GetQueueAttributes requests
The attribute-selection path silently drops unknown AttributeNames instead of returning InvalidAttributeName: selectedAttributeNames accepts arbitrary names and queueMetaToAttributes just skips missing keys. That makes typos look successful and diverges from SQS error semantics, which can break client-side validation and troubleshooting.
codex round 8 on #610 flagged two findings, both addressed here:
- P1 sqs_messages.go: ReceiveMessage parsed WaitTimeSeconds but never used it, so clients asking for long polling got an immediate empty return and hot-polled the server. Milestone 1 now does a polling-based long poll: if the first scan returns nothing and WaitTimeSeconds > 0, the handler re-scans every 200 ms until a message arrives, the wait elapses, or the request context is canceled. The full commit-stream notifier from §7.3 of the design is still deferred, but the observable client contract now matches AWS. resolveReceiveWaitSeconds applies per-request value -> queue default -> clamp to [0, 20].
- P2 sqs_catalog.go: IsFIFO was pre-initialized from the .fifo name suffix before attribute validation ran, so a CreateQueue without FifoQueue=true silently became a FIFO queue. AWS requires FifoQueue=true to be explicitly set. IsFIFO now defaults to false; a .fifo-suffixed name without the attribute returns InvalidParameterValue ("FIFO queue name requires FifoQueue=true attribute").

New tests: TestSQSServer_LongPollWaitsForArrival schedules a send mid-wait and asserts the receive returns promptly after arrival (and waited at all); TestSQSServer_LongPollTimesOutOnEmptyQueue asserts WaitTimeSeconds on an empty queue returns empty at approximately the requested time.
Summary
Implements Milestone 1 of the SQS-compatible adapter described in docs/design/2026_04_24_proposed_sqs_compatible_adapter.md (stacked on #606): the queue catalog (CreateQueue, DeleteQueue, ListQueues, GetQueueUrl, Get/SetQueueAttributes) and the core message path SendMessage / ReceiveMessage / DeleteMessage / ChangeMessageVisibility.
- Storage uses !sqs|queue|meta|, !sqs|queue|gen|, !sqs|msg|data|, and a visibility-indexed !sqs|msg|vis| keyspace, so ReceiveMessage scans the next visible messages with a single bounded prefix scan — no background sweeper.
- Receipt tokens rotate inside the ReceiveMessage OCC transaction, so receipt handles are forgery-resistant without HMAC.
- ReceiveMessage scans are fenced with kv.LeaseReadThrough so the local snapshot stays within the lease window, falling back to LinearizableRead when the lease is cold.
- SigV4 primitives are extracted into adapter/sigv4.go and shared with the S3 adapter. The SQS-side authorizeSQSRequest handles the JSON-protocol flow (no presigned URLs, no streaming payloads).
- Leader proxying is factored into adapter/leader_http_proxy.go and reused by DynamoDB and SQS.
- New --sqsAddress / --raftSqsMap / --sqsRegion / --sqsCredentialsFile flags; startSQSServer mirrors startS3Server; adapter/test_util.go carries an *SQSServer per Node.

Five commits, each self-contained:
- feat(sqs): scaffold SQS-compatible adapter — server skeleton, dispatch table, JSON error envelope, shared leader-proxy helper.
- feat(sqs): wire SQS adapter into main.go and test harness — flags, runtimeServerRunner, test harness port allocation.
- feat(sqs): implement queue catalog CRUD — idempotent Create, DeleteQueue with generation bump, ListQueues pagination, Get/SetQueueAttributes, FIFO name/attribute validation.
- feat(sqs): SigV4 verifier shared with S3, wire static credentials — extract SigV4 into adapter/sigv4.go, --sqsCredentialsFile support, full SigV4 verifier with happy/failure coverage.
- feat(sqs): SendMessage, ReceiveMessage, DeleteMessage, ChangeMessageVisibility — message record, visibility-index scan and rotation, receipt handle codec, lease-read fenced scan.
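The visibility-index design above relies on the visible-at timestamp being embedded big-endian in the key, so lexicographic key order equals time order and "next visible" is one bounded prefix scan. A minimal sketch (the exact byte layout is illustrative, not the adapter's encoding):

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// visKey sketches a visibility-index key: prefix + queue + big-endian
// visible_at + message id. Big-endian u64 encoding is what makes byte-wise
// key comparison agree with numeric timestamp comparison.
func visKey(queue string, visibleAtMillis uint64, msgID string) []byte {
	k := []byte("!sqs|msg|vis|" + queue + "|")
	k = binary.BigEndian.AppendUint64(k, visibleAtMillis)
	return append(k, msgID...)
}

// visibleBefore reports whether key a sorts before key b; for keys in the
// same queue this means an earlier visible_at.
func visibleBefore(a, b []byte) bool { return bytes.Compare(a, b) < 0 }

func main() {
	fmt.Println(visibleBefore(visKey("q", 100, "x"), visKey("q", 200, "a")))
}
```

With little-endian encoding the same scan would return messages out of time order, which is why key encodings like this conventionally use big-endian integers.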
Not in this PR (deferred to Milestone 2+ as per the design):
PurgeQueue, batch APIs (Send/Delete/ChangeVisibility), tag APIs, FIFO,
DLQ redrive, long polling, Jepsen workload, dedicated SQS Prometheus
metrics, Console UI (§13 of the design).
Test plan
- go test -race -count=1 ./adapter/ — full pass, including the new SQS catalog / auth / message tests (end-to-end through createNode).
- golangci-lint run --config=.golangci.yaml — 0 issues, including the shared leader-proxy and SigV4 refactors.
- aws sqs CLI smoke test against a running node (manual, done locally on each commit; worth re-running before merge).