Skip to content

[WIP] address ably-cocoa push notification token storage issues#441

Draft
lawrence-forooghian wants to merge 10 commits intomainfrom
2026-03-20-investigating-ably-cocoa-push-registration-failures
Draft

[WIP] address ably-cocoa push notification token storage issues#441
lawrence-forooghian wants to merge 10 commits intomainfrom
2026-03-20-investigating-ably-cocoa-push-registration-failures

Conversation

@lawrence-forooghian
Copy link
Copy Markdown
Collaborator

@lawrence-forooghian lawrence-forooghian commented Mar 23, 2026

Partially addresses the issue of ably-cocoa storing device secret in keychain and not wiping the device token when it fails to load the secret. WIP, and largely Claude-gnerated. There are still things to address here and need to think about whether we want to try and fetch a new token instead of just validating one (see commit messages for why haven't done this yet). Also we need to decide whether we're happy migrating to a storage for token and secret that's always available. Will finish add references to existing conversations once back from illness; we should discuss

lawrence-forooghian and others added 6 commits March 20, 2026 10:08
The note in RSH8k2 mentioned RSH3a2b three times but only linked
the first reference. Link the other two for consistency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The note previously said "our implementations" all generate eagerly,
but this isn't accurate. ably-java follows RSH3a2b (lazy generation
on CalledActivate); ably-cocoa and ably-js generate eagerly at device
fetch time.

Checked against:
- ably-cocoa 745e7b7
- ably-java da4c60f
- ably-js 17be43e

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RSH8j describes the failure path of RSH8a's loading step, so it
belongs as a subpoint of RSH8a rather than a standalone point that
reaches across into the state machine. Move it to RSH8a1, keeping
the original text, and leave a replacement stub at RSH8j.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Document the problems with this spec point:

- It doesn't clear the deviceIdentityToken, so RSH3a2a will try to
  use a stale token against an absent or non-matching device ID
- It's ambiguous whether "generate new" is a new instruction or a
  restatement of RSH3a2b, and if the latter, RSH3a2a runs first
- It bypasses the state machine's event-driven model and doesn't
  specify what should happen to an in-flight activate() call
- No implementations currently implement this spec point

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add RSH3h to require loading and verifying LocalDevice details at
state machine init time, before processing any events. If the load
fails, the state machine starts in NotActivated with all device
state discarded (RSH3h1).

Add RSH8a2 to document the atomicity requirement for the (id,
deviceSecret, deviceIdentityToken) tuple, with allowance for
split storage (e.g. platform keychain) provided the implementation
can detect when the loaded tuple diverges from what was persisted.
RSH8a2a provides a fallback for legacy data without an atomicity
mechanism.

Update RSH8a to reference RSH3h for the state machine init trigger.
Replace RSH8a1 with a stub pointing to RSH3h.

TODO: squash this branch's commits before merging, and consider
removing the RSH8a1 and RSH8j replacement stubs entirely (since
neither was in a released spec version).

Note: this change does not take a position on whether implementations
should or should not use secure-but-sometimes-unavailable storage
(e.g. a platform keychain) for sensitive device details. The intent
is to meet ably-cocoa where it is and not rule out this possibility
in the future (customers have requested the ability to store device
secrets securely [1]). We should still consider how to handle the
case where such storage is temporarily unavailable (e.g. iOS
Keychain before first unlock after reboot) — the current spec
treats this as a load failure and discards the data, but a future
enhancement could allow the state machine to wait for the storage
to become available.

[1] ably/ably-java#593

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add RSH3i (ValidatingDeviceIdentityToken state) for validating
legacy ably-cocoa data where the deviceIdentityToken may not belong
to the current device id. On CalledActivate, a non-mutating GET
to /push/deviceRegistrations/:deviceId authenticated with the token
confirms whether it matches the device id. Success re-persists with
the atomicity mechanism; server rejection discards everything and
transitions to NotActivated; transient errors remain in state.

Add RSH8a2a with ably-cocoa-specific handling for legacy data:
- RSH8a2a1: invariant checks (id/secret both present or absent,
  token only with id+secret) — violation is a load failure
- RSH8a2a2: if invariants pass and all three present, indicate to
  RSH3h2 that the token needs server validation

Update RSH3h2 to enter ValidatingDeviceIdentityToken when RSH8a2a
indicates validation is needed.

The context for this state is: ably-cocoa stores the deviceSecret
in the Keychain keyed by device id, so if a secret is present it
is guaranteed to belong to the current id. The deviceIdentityToken
is stored under a fixed key in NSUserDefaults and may belong to a
previous device id. The token is opaque to the client so we cannot
verify the binding locally.

We considered three options for handling this:

1. Validate the token with a non-mutating GET; if invalid, discard
   everything and re-register. Simple, no custom callback needed,
   but orphans the server-side registration.

2. Recreate the token by PATCHing with the deviceSecret; the server
   would return a fresh token. Preserves the registration, but
   the spec's custom registerCallback has no way to distinguish
   between POST/PUT/PATCH semantics, making it unclear whether the
   callback would do the right thing.

3. Abandon the token and use the deviceSecret for all subsequent
   authentication. Simple, but unknown consequences — the token
   exists for a reason and this would diverge from the spec's
   authentication model.

We chose option 1. The trade-off (orphaning a registration) is
acceptable: it's what already happens today with the bug, the
orphaned registration is already unusable, and it only occurs
once per device during migration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot temporarily deployed to staging/pull/441 March 23, 2026 10:34 Inactive
@lawrence-forooghian lawrence-forooghian changed the title [WIP] address ably-cocoa [WIP] address ably-cocoa push notification token storage issues Mar 23, 2026
lawrence-forooghian and others added 4 commits April 13, 2026 17:05
Refines the error conditions in RSH3i1c and RSH3i1d introduced in
faa0fb8.

RSH3i1c now triggers only on a 401 status code, rather than the
vague "e.g. 401 or 404". This is the specific error returned by
the realtime server's DeviceAuth.validateToken() (error code
40100, status 401) when the deviceIdentityToken's embedded device
ID does not match the requested device ID. Only this case
definitively indicates the token is invalid for the device.

RSH3i1d now handles all other failures (including 404, 5xx, and
network errors) by remaining in ValidatingDeviceIdentityToken,
allowing the user to retry. "Transient error" was too vague and
assumed we could classify errors we haven't seen.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The previous wording left the user with no indication of what
happened when the validation request failed for a non-401 reason.
Call the activate() callback with the error, consistent with how
RSH3i1b and RSH3i1c handle their outcomes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Refactor the ValidatingDeviceIdentityToken state (introduced in
faa0fb8, refined in 6583b0a and 2254ad5) to follow the same
event-driven pattern used by the rest of the state machine.

Previously, RSH3i handled the GET request and all its outcomes
inline within the CalledActivate handler. Now:

- RSH3i (ValidatingDeviceIdentityToken): on CalledActivate,
  starts the GET, fires one of three events, and transitions to
  WaitingForDeviceIdentityTokenValidation.
- RSH3j (WaitingForDeviceIdentityTokenValidation): handles the
  result events (GotValidationResponse, GotValidationRejection,
  GotValidationFailure).

This matches the pattern of RSH3b3/RSH3c (POST registration then
handle result events in WaitingForDeviceRegistration).

Functionally equivalent to the previous inline version — same
three outcomes, same status code conditions, same callbacks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of transitioning to NotActivated and requiring the user to
call activate() again, proceed directly with a fresh registration
(RSH3a2b onwards) as part of the same activate() call.

By the time RSH3j2a has discarded all device details, the state is
equivalent to a fresh device. The previous persisted state doesn't
matter because the data it depended on is gone. Proceeding with
fresh registration means the user's activate() call completes
successfully (with a new device id) rather than failing and
requiring a retry.

The rejection is logged as an error so the user can see that their
old registration was discarded, but it's an internal recovery, not
something they need to act on.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

1 participant