CCM-16073 - Enhanced callbacks by rhyscoxnhs · Pull Request #145 · NHSDigital/nhs-notify-client-callbacks

rhyscoxnhs · 2026-04-15T10:34:23Z

Description

Context

Type of changes

Refactoring (non-breaking change)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would change existing functionality)
Bug fix (non-breaking change which fixes an issue)

Checklist

I am familiar with the contributing guidelines
I have followed the code style of the project
I have added tests to cover my changes
I have updated the documentation accordingly
This PR is a result of pair or mob programming

Sensitive Information Declaration

To ensure the utmost confidentiality and protect your and others privacy, we kindly ask you to NOT including PII (Personal Identifiable Information) / PID (Personal Identifiable Data) or any other sensitive data in this PR (Pull Request) and the codebase changes. We will remove any PR that do contain any sensitive information. We really appreciate your cooperation in this matter.

I confirm that neither PII/PID nor sensitive data are included in this PR and the codebase changes.

Copilot

Pull request overview

Introduces enhanced callback delivery security and operational controls by adding mTLS + certificate pinning configuration to callback targets, and shifting delivery from direct EventBridge API Destinations to an SQS + per-client HTTPS delivery Lambda model (with Redis-backed gating for rate limiting/circuit breaking).

Changes:

Add mtls, certPinning, and delivery fields to callback target model + schema validation, and update fixtures/tests accordingly.
Add CLI commands to manage mTLS, certificate pinning enable/disable, and SPKI hash extraction/storage for targets.
Add new https-client-lambda (delivery, signing, retries, DLQ handling, Redis/Lua gate) and new shared config-cache package; update Terraform to provision per-client delivery infra (SQS/Lambda/ElastiCache) and mock mTLS ALB.

Reviewed changes

Copilot reviewed 104 out of 107 changed files in this pull request and generated 7 comments.

Show a summary per file

File	Description
tools/client-subscriptions-management/src/entrypoint/cli/targets-set-pinning.ts	New CLI command to enable/disable certificate pinning for a target.
tools/client-subscriptions-management/src/entrypoint/cli/targets-set-mtls.ts	New CLI command to enable/disable mTLS for a target.
tools/client-subscriptions-management/src/entrypoint/cli/targets-set-certificate.ts	New CLI command to extract/store SPKI hash from PEM for a target.
tools/client-subscriptions-management/src/entrypoint/cli/clients-put.ts	Minor cleanup in CLI file handling comment.
tools/client-subscriptions-management/src/domain/client-subscription-builder.ts	Adds default `mtls`/`certPinning` and emits security warnings when building targets.
tools/client-subscriptions-management/src/tests/helpers/client-subscription-fixtures.ts	Updates test target fixtures with required `mtls`/`certPinning` fields.
tools/client-subscriptions-management/src/tests/entrypoint/cli/targets-set-pinning.test.ts	Unit tests for targets-set-pinning CLI behavior.
tools/client-subscriptions-management/src/tests/entrypoint/cli/targets-set-mtls.test.ts	Unit tests for targets-set-mtls CLI behavior.
tools/client-subscriptions-management/src/tests/entrypoint/cli/targets-set-certificate.test.ts	Unit tests for targets-set-certificate CLI behavior.
tools/client-subscriptions-management/src/tests/domain/client-subscription-builder.test.ts	Adds tests around security warning emission in target builder.
tools/client-subscriptions-management/package.json	Adds `picocolors` dependency for warning output.
tests/integration/helpers/mock-client-config.ts	Adds a new integration fixture key for an mTLS-enabled mock client.
tests/integration/helpers/event-factories.ts	Adds factory for delivery messages compatible with the new delivery flow.
tests/integration/fixtures/subscriptions/mock-client-rate-limit.json	New integration fixture including required `mtls`/`certPinning`.
tests/integration/fixtures/subscriptions/mock-client-mtls.json	New integration fixture for mTLS + pinning enabled target.
tests/integration/fixtures/subscriptions/mock-client-circuit-breaker.json	New integration fixture including delivery circuit breaker configuration.
tests/integration/fixtures/subscriptions/mock-client-2.json	Updates existing integration fixture targets to include `mtls`/`certPinning`.
tests/integration/fixtures/subscriptions/mock-client-1.json	Updates existing integration fixture targets to include `mtls`/`certPinning`.
src/models/src/client-config.ts	Extends callback target type with `mtls`, `certPinning`, and optional `delivery`.
src/models/src/client-config-schema.ts	Adds Zod validation for `mtls`, `certPinning` (incl. SPKI hash constraints), and `delivery`.
src/models/src/tests/client-config-schema.test.ts	Adds/updates schema tests for new target fields and constraints.
src/config-cache/tsconfig.json	New package tsconfig for config-cache workspace.
src/config-cache/src/index.ts	Exports ConfigCache from new workspace package.
src/config-cache/src/config-cache.ts	Implements TTL-based in-memory config cache.
src/config-cache/src/tests/config-cache.test.ts	Unit tests for ConfigCache TTL and behaviors.
src/config-cache/package.json	New workspace package definition for config-cache.
src/config-cache/jest.config.ts	Jest config for config-cache package.
scripts/config/pre-commit.yaml	Adjusts pre-commit detect-private-key exclusions for a test file.
pnpm-workspace.yaml	Adds workspace catalog entries for `@redis/client`, `picocolors`, and Secrets Manager SDK.
lambdas/mock-webhook-lambda/src/index.ts	Adds ALB mTLS client-cert header validation for mock webhook endpoint.
lambdas/mock-webhook-lambda/src/tests/index.test.ts	Adds unit tests for ALB mTLS certificate verification flow.
lambdas/https-client-lambda/tsconfig.json	New lambda workspace tsconfig.
lambdas/https-client-lambda/src/services/ssm-applications-map.ts	Loads and caches clientId→applicationId map from SSM.
lambdas/https-client-lambda/src/services/sqs-visibility.ts	SQS ChangeMessageVisibility helper.
lambdas/https-client-lambda/src/services/record-result.lua	Redis Lua script to update circuit breaker state.
lambdas/https-client-lambda/src/services/payload-signer.ts	Adjusts payload signer function signature (parameter order).
lambdas/https-client-lambda/src/services/logger.ts	Re-exports shared logger for lambda local imports.
lambdas/https-client-lambda/src/services/endpoint-gate.ts	Redis admission + circuit-breaker integration (EVALSHA/EVAL Lua execution).
lambdas/https-client-lambda/src/services/dlq-sender.ts	DLQ send helper.
lambdas/https-client-lambda/src/services/delivery/tls-agent-factory.ts	Builds TLS agent with optional mTLS material and SPKI pinning.
lambdas/https-client-lambda/src/services/delivery/retry-policy.ts	Retry/backoff and Retry-After parsing helpers.
lambdas/https-client-lambda/src/services/delivery/https-client.ts	HTTPS delivery client + result classification.
lambdas/https-client-lambda/src/services/delivery-metrics.ts	Embedded metrics emission for delivery and circuit-breaker events.
lambdas/https-client-lambda/src/services/config-loader.ts	Loads client config/targets from S3 with TTL caching and schema validation.
lambdas/https-client-lambda/src/services/admit.lua	Redis Lua script for token-bucket rate limiting + CB admission.
lambdas/https-client-lambda/src/lua.d.ts	Type declaration for importing `.lua` as text.
lambdas/https-client-lambda/src/index.ts	Lambda entrypoint delegating to record processor.
lambdas/https-client-lambda/src/handler.ts	Main SQS batch handler: load config, sign, gate, deliver, retry/DLQ, metrics.
lambdas/https-client-lambda/src/tests/tls-agent-factory.test.ts	Unit tests for TLS agent factory behavior (S3/SecretsManager/pinning).
lambdas/https-client-lambda/src/tests/ssm-applications-map.test.ts	Unit tests for SSM applications map loading/caching/errors.
lambdas/https-client-lambda/src/tests/sqs-visibility.test.ts	Unit tests for SQS visibility changes.
lambdas/https-client-lambda/src/tests/retry-policy.test.ts	Unit tests for retry policy helpers.
lambdas/https-client-lambda/src/tests/payload-signer.test.ts	Unit tests for payload signing.
lambdas/https-client-lambda/src/tests/index.test.ts	Unit test for lambda entrypoint wiring.
lambdas/https-client-lambda/src/tests/https-client.test.ts	Unit tests for delivery HTTP behavior classification.
lambdas/https-client-lambda/src/tests/handler.test.ts	Unit tests for SQS record processing paths (DLQ/retry/gate/CB).
lambdas/https-client-lambda/src/tests/endpoint-gate.test.ts	Unit tests for Redis Lua invocation paths and client creation.
lambdas/https-client-lambda/src/tests/dlq-sender.test.ts	Unit tests for DLQ sender.
lambdas/https-client-lambda/src/tests/delivery-metrics.test.ts	Unit tests for embedded metrics behavior.
lambdas/https-client-lambda/src/tests/config-loader.test.ts	Unit tests for S3 config loading + TTL cache.
lambdas/https-client-lambda/package.json	New lambda workspace package definition.
lambdas/https-client-lambda/lua-transform.js	Jest transform to load `.lua` scripts as strings.
lambdas/https-client-lambda/jest.config.ts	Jest config enabling `.lua` transform.
lambdas/client-transform-filter-lambda/src/services/observability.ts	Removes callback signing observability (signing moved downstream).
lambdas/client-transform-filter-lambda/src/services/config-loader.ts	Switches to shared config-cache package import.
lambdas/client-transform-filter-lambda/src/services/config-loader-service.ts	Switches to shared config-cache package import.
lambdas/client-transform-filter-lambda/src/index.ts	Removes ApplicationsMapService wiring from handler creation.
lambdas/client-transform-filter-lambda/src/handler.ts	Removes per-target signatures from output; outputs deliverable payload + subscriptions only.
lambdas/client-transform-filter-lambda/src/tests/services/payload-signer.test.ts	Removes tests for payload signing in transform-filter lambda.
lambdas/client-transform-filter-lambda/src/tests/services/config-update.component.test.ts	Updates imports to new config-cache package.
lambdas/client-transform-filter-lambda/src/tests/services/config-loader.test.ts	Updates imports to new config-cache package.
lambdas/client-transform-filter-lambda/src/tests/services/config-cache.test.ts	Updates imports to new config-cache package.
lambdas/client-transform-filter-lambda/src/tests/index.test.ts	Updates tests to reflect removal of signatures and apps map dependency.
lambdas/client-transform-filter-lambda/src/tests/index.component.test.ts	Updates component tests to reflect new payload shape and removed SSM dependency.
lambdas/client-transform-filter-lambda/src/tests/helpers/client-subscription-fixtures.ts	Updates target fixtures to include required `mtls`/`certPinning`.
lambdas/client-transform-filter-lambda/package.json	Adds config-cache workspace dependency.
knip.ts	Updates Knip workspace config (new workspaces and integration entrypoints).
infrastructure/terraform/modules/client-destination/variables.tf	Removes legacy API Destination-based module.
infrastructure/terraform/modules/client-destination/module_target_dlq.tf	Removes legacy per-target DLQ module.
infrastructure/terraform/modules/client-destination/locals.tf	Removes legacy locals for old module.
infrastructure/terraform/modules/client-destination/iam_role_api_target_role.tf	Removes legacy IAM role/policy for API destinations.
infrastructure/terraform/modules/client-destination/cloudwatch_event_rule_main.tf	Removes legacy EventBridge rule/target to API destination setup.
infrastructure/terraform/modules/client-destination/cloudwatch_event_connection_main.tf	Removes legacy EventBridge Connection resources.
infrastructure/terraform/modules/client-destination/cloudwatch_event_api_destination_this.tf	Removes legacy API Destination resources.
infrastructure/terraform/modules/client-destination/README.md	Removes docs for legacy module.
infrastructure/terraform/modules/client-delivery/variables.tf	Adds new per-client delivery module variables.
infrastructure/terraform/modules/client-delivery/outputs.tf	Adds outputs for per-client queues and lambda details.
infrastructure/terraform/modules/client-delivery/module_sqs_per_client.tf	Provisions per-client delivery SQS queue.
infrastructure/terraform/modules/client-delivery/module_https_client_lambda.tf	Provisions per-client https-client lambda + SQS event source mapping.
infrastructure/terraform/modules/client-delivery/module_dlq_per_client.tf	Provisions per-client DLQ + DLQ depth alarm.
infrastructure/terraform/modules/client-delivery/locals.tf	Defines naming/tagging locals for per-client module.
infrastructure/terraform/modules/client-delivery/iam_role_sqs_target.tf	IAM policy for https-client lambda (SQS/S3/SSM/KMS and optional SecretsManager/S3 cert).
infrastructure/terraform/modules/client-delivery/cloudwatch_event_rule_per_subscription.tf	Defines EventBridge rules/targets routing to per-client SQS delivery queue.
infrastructure/terraform/modules/client-delivery/README.md	Adds docs for new per-client delivery module.
infrastructure/terraform/components/callbacks/variables.tf	Enables X-Ray by default and adds mTLS-related component variables.
infrastructure/terraform/components/callbacks/pipes_pipe_main.tf	Removes signatures from Pipe output template.
infrastructure/terraform/components/callbacks/module_mock_webhook_alb_mtls.tf	Adds internal ALB + ACM import + passthrough mTLS wiring for mock webhook.
infrastructure/terraform/components/callbacks/module_client_destination.tf	Removes legacy client_destination module usage.
infrastructure/terraform/components/callbacks/module_client_delivery.tf	Adds per-client delivery module instantiation.
infrastructure/terraform/components/callbacks/locals.tf	Reworks locals for per-client subscriptions/targets and mTLS mock endpoint selection.
infrastructure/terraform/components/callbacks/elasticache_delivery_state.tf	Adds ElastiCache Serverless Redis for delivery state + security groups/alarms.
infrastructure/terraform/components/callbacks/cloudwatch_metric_alarm_dlq_depth.tf	Removes legacy per-target DLQ alarm (replaced by per-client alarms).
infrastructure/terraform/components/callbacks/cloudwatch_eventbus_main.tf	Adds EventBridge archive for 7-day retention.
infrastructure/terraform/components/callbacks/README.md	Updates docs to reflect new modules, variables, and defaults.
eslint.config.mjs	Ignores lua-transform.js and expands rule scope to include src workspaces.
.gitleaksignore	Adds ignore for test private key fixture in tls-agent-factory tests.

Files not reviewed (1)

pnpm-lock.yaml: Language not supported

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

aidenvaines-cgi

the TF stuff all looks cool from an rough check

aidenvaines-cgi · 2026-04-15T11:18:25Z

@@ -0,0 +1,178 @@
+resource "aws_elasticache_serverless_cache" "delivery_state" {
+  name                 = "${local.csi}-delivery-state"
+  engine               = "redis"


Valkey is redis compatible, is half the cost and faster. Can we use that?

Just waiting on agreement.

aidenvaines-cgi · 2026-04-15T11:20:20Z

@@ -0,0 +1,39 @@
+module "sqs_delivery" {
+  source = "https://github.com/NHSDigital/nhs-notify-shared-modules/releases/download/3.0.7/terraform-sqs.zip"


Suggested change

source = "https://github.com/NHSDigital/nhs-notify-shared-modules/releases/download/3.0.7/terraform-sqs.zip"

source = "https://github.com/NHSDigital/nhs-notify-shared-modules/releases/download/3.0.9/terraform-sqs.zip"

3.0.9 exists, 3.0.7 is being used here. minor, ignore if you want

aidenvaines-cgi · 2026-04-15T11:21:55Z

  type        = bool
  description = "Enable AWS X-Ray active tracing for Lambda functions"
-  default     = false
+  default     = true


i think weve historically had xray off by default in prod as its quite spendy, and enabling it as and when needed.

All variables should be prod defaults with PTL envs having overrides

aidenvaines-cgi · 2026-04-15T11:22:25Z

+  log_destination_arn       = var.log_destination_arn
+  log_subscription_role_arn = var.log_subscription_role_arn
+
+  lambda_env_vars = {


please sort alphabetically, it makes me sad

gparry333 · 2026-04-15T14:39:35Z

+import type { SQSRecord } from "aws-lambda";
+
+import { processRecords } from "handler";
+


There is a lot of mock setup at the beginning of this. Could we extract this out into a test helper or maybe a fixture so that the test file is cleaner and more readable?

rhyscoxnhs · 2026-04-15T14:15:36Z

+  firstReceivedMs: number,
+  maxRetryDurationMs: number,
+): boolean {
+  return Date.now() - firstReceivedMs >= maxRetryDurationMs;


This looks a little hard to read, could do with parens around the first operation

rhyscoxnhs · 2026-04-15T14:16:18Z

+const { MTLS_CERT_SECRET_ARN } = process.env;
+const { MTLS_TEST_CERT_S3_BUCKET } = process.env;
+const { MTLS_TEST_CERT_S3_KEY } = process.env;
+const { MTLS_TEST_CA_S3_KEY } = process.env;


These destructures should be merged ideally

rhyscoxnhs · 2026-04-15T14:17:44Z

+  // eslint-disable-next-line sonarjs/null-dereference -- loadS3Object always returns a string
+  const parts = pem.split(/(?=-----BEGIN )/);
+  // eslint-disable-next-line sonarjs/null-dereference -- .find returns string|undefined, fallback handles it
+  const key = parts.find((p) => p.includes("PRIVATE KEY")) ?? "";
+  // eslint-disable-next-line sonarjs/null-dereference -- as above
+  const cert = parts.find((p) => p.includes("CERTIFICATE")) ?? "";


Ideally we either disable this lint at the top or find ways to fix this lint properly.

rhyscoxnhs · 2026-04-15T14:37:21Z

+  /* eslint-disable sonarjs/null-dereference -- refillPerSec is typed as number, cannot be null */
+  const args = [
+    now,
+    refillPerSec.toString(),
+    config.burstCapacity.toString(),
+    config.cbProbeIntervalMs.toString(),
+    cbEnabled ? "1" : "0",
+    config.decayPeriodMs.toString(),
+  ];
+  /* eslint-enable sonarjs/null-dereference */


Not sure why we're disabling this lint?

Do we need to try it without to see if it fails?

gparry333 · 2026-04-15T14:59:05Z

@@ -0,0 +1,474 @@
+import type { SQSRecord } from "aws-lambda";
+


We could do with a test being added that tests that an unexpected error being thrown on one record doesn't prevent a subsequent record being processed e.g. An unexpected error thrown by an upstream dependency like loadTargetConfig which would be caught by the try/catch in processRecords - loadTargetConfig throwing "S3 unavailable" for record 1, while record 2 should still be processed normally.

gparry333 · 2026-04-15T15:09:07Z

+      { type: "MessageStatus", attributes: { messageStatus: "delivered" } },
+    ],
+  }) as Parameters<typeof signPayload>[2];
+


Maybe also have a negative test for this that verifies an empty payload produces a deterministic (non-empty) signature, ensuring the signing function doesn't silently fail on edge-case payloads?

mjewildnhs

As far as reviewing the terraform

mjewildnhs · 2026-04-15T11:11:06Z

@@ -0,0 +1,178 @@
+resource "aws_elasticache_serverless_cache" "delivery_state" {
+  name                 = "${local.csi}-delivery-state"
+  engine               = "redis"


For cost saving I think we should switch to valkey.
You are billed in gigabyte-hours (GB-hrs) and the minimum for redis is 1GB vs 100mb in valkey.
Not sure we'll go above 100mb even in prod.
https://aws.amazon.com/elasticache/pricing/

Just waiting on agreement.

mjewildnhs · 2026-04-15T11:22:38Z

Not related to the phase 2 work but I think we can delete lines 3-6 - those secrets have never been in our repo at any point

mjewildnhs · 2026-04-15T11:53:09Z

+
+  cache_usage_limits {
+    data_storage {
+      maximum = 1


1GB is the minimum with redis but can go down to 100mb if we make the valkey switch.
Keeping this low for dev/test environments is good for cost saving.
We should see how much each client will take in storage.

mjewildnhs · 2026-04-15T11:57:11Z

Do we want an alarm on storage - e.g. 80% used

mjewildnhs · 2026-04-15T14:23:40Z

+  certificate_arn   = aws_acm_certificate_import.mock_webhook_server[0].arn
+
+  mutual_authentication {
+    mode = "passthrough"


With passthrough we should be able route mTLS and non mTLS traffic through and remove the aws_lambda_permission config in module_mock_webhook_lambda.
We'd then need to modify the lambda to output whether it was mTLS or not which the tests can then verify based on their expectation.
This would simplify things a lot and also get us the hardening we are looking for on the mock webhook.

mjewildnhs · 2026-04-15T15:10:20Z

+    MAX_RETRY_DURATION_SECONDS            = tostring(var.max_retry_duration_seconds)
+    MTLS_CERT_SECRET_ARN                  = var.mtls_cert_secret_arn
+    MTLS_TEST_CERT_S3_BUCKET              = var.mtls_test_cert_s3_bucket
+    MTLS_TEST_CERT_S3_KEY                 = var.mtls_test_cert_s3_key


Looks like the lambda is also missing MTLS_TEST_CA_S3_KEY

mjewildnhs · 2026-04-15T15:13:15Z

I don't think any of these are referenced?
AI seems to like adding outputs for no reason.

mjewildnhs · 2026-04-15T15:14:54Z

+
+variable "mtls_cert_secret_arn" {
+  type        = string
+  description = "Secrets Manager ARN for the mTLS client certificate (production)"


As per my other comment I think we need to review the production vs non-production terminology we are using. Production should be assumed and non-production should be "dev" - this aligns with the tfvar groups.

mjewildnhs · 2026-04-15T15:16:34Z

+variable "deploy_mock_clients" {
+  type        = bool
+  description = "Whether mock client infrastructure is deployed (non-prod only)"
+  default     = false
+}


Suggested change

variable "deploy_mock_clients" {

type = bool

description = "Whether mock client infrastructure is deployed (non-prod only)"

default = false

}

Not used in the module.

mjewildnhs

Done transform filter lambda changes - in middle of http lambda

mjewildnhs · 2026-04-15T15:32:33Z

Delete now its in a shared module. This is just a duplicate of src/config-cache/src/__tests__/config-cache.test.ts

mjewildnhs · 2026-04-15T15:44:25Z


 export const configLoaderService = new ConfigLoaderService();

-export const applicationsMapService = new ApplicationsMapService();


Its never very good at cleaning up after itself. Theres a bunch of dead code to remove. AI summary:

Files to delete:

File Reason

src/services/ssm-applications-map.ts Dead code — no longer imported in index.ts or handler.ts

src/__tests__/services/ssm-applications-map.test.ts Tests the deleted service

Knock-on changes required:

package.json — remove @aws-sdk/client-ssm from dependencies (only used in ssm-applications-map.ts and its test)

src/__tests__/index.component.test.ts — needs cleanup:

Remove the jest.mock("@aws-sdk/client-ssm", ...) block (lines 18–26)

Remove mockSsmSend variable and its usages

Remove APPLICATIONS_MAP_PARAMETER from process.env setup/teardown

Remove CLIENT_SUBSCRIPTION_CONFIG_BUCKET, CLIENT_SUBSCRIPTION_CONFIG_PREFIX, CLIENT_SUBSCRIPTION_CACHE_TTL_SECONDS env vars — these are now the wrong names (should be CLIENT_CONFIG_BUCKET etc. per the earlier fix), and the SSM-related ones are dead

Remove mockSsmSend.mockResolvedValue(...) from beforeEach

Remove mockSsmSend.mockClear() from beforeEach

Note: the SSM mock was used to simulate ApplicationsMapService loading the applications map — since that's gone from this lambda, it's entirely redundant. The applicationsMap const and mockSsmSend usages can all go.

Also need to remove APPLICATIONS_MAP_PARAMETER from the terraform module for the lambda.

mjewildnhs · 2026-04-15T15:44:46Z

@@ -2,6 +2,7 @@
  "dependencies": {
    "@aws-sdk/client-s3": "catalog:aws",
    "@aws-sdk/client-ssm": "catalog:aws",


mjewildnhs · 2026-04-15T15:54:31Z

We need a .luarc.json file at the root level like we had on the PoC branch - otherwise the IDE won't be happy with the undefined globals
The following works:

{ "diagnostics": { "globals": [ "KEYS", "ARGV", "redis" ] } }

mjewildnhs · 2026-04-15T16:10:04Z

+  const maxRetryDurationMs =
+    (target.delivery?.maxRetryDurationSeconds ??
+      DEFAULT_MAX_RETRY_DURATION_MS / 1000) * 1000;


Lol this is crazy to read

Suggested change

const maxRetryDurationMs =

(target.delivery?.maxRetryDurationSeconds ??

DEFAULT_MAX_RETRY_DURATION_MS / 1000) * 1000;

const maxRetryDurationMs =

target.delivery?.maxRetryDurationSeconds !== undefined

? target.delivery.maxRetryDurationSeconds * 1000

: DEFAULT_MAX_RETRY_DURATION_MS;

mjewildnhs · 2026-04-15T16:23:16Z

+async function processRecord(record: SQSRecord): Promise<void> {
+  const { CLIENT_ID } = process.env;
+  if (!CLIENT_ID) {
+    throw new Error("CLIENT_ID is required");


If this is misconfigured then the behaviour will be undesirable - SQS will just keep retrying it 100 times.
However it will never happen so I don't think its worth explicitly handling it and treating the events as permanent failures (by sending to the DLQ).

mjewildnhs · 2026-04-15T16:28:56Z

+    if (cbEnabled) {
+      await recordResult(redis, targetId, true, gateConfig);
+    }
+    emitDeliverySuccess(targetId);


I preferred the observability stuff we had in the client transform filter lambda (lambdas/client-transform-filter-lambda/src/services/observability.ts) which wrapped the logging and metric calls into one. It cuts down on the clutter a bit by reducing it into 1 call.

rhyscoxnhs · 2026-04-16T09:14:25Z

+      {
+        hostname: url.hostname,
+        port: url.port || 443,
+        path: url.pathname + url.search,


We may face URL encoding issues here. https.request does not perform URL escaping, so spaces may result in malformed HTTPS requests.

Instead, since https.request supports a URL object for path, we should pass through the url object we already construct before making the request.

rhyscoxnhs · 2026-04-16T09:17:56Z

+const SQS_MAX_VISIBILITY_SECONDS = 43_200;
+
+export function jitteredBackoffSeconds(receiveCount: number): number {
+  const ceiling = Math.min(5 * 2 ** (receiveCount - 1), BACKOFF_CAP_SECONDS);


We should have proper constant naming here for 5, 2, etc so it's understandable at a high level

rhyscoxnhs · 2026-04-16T09:19:18Z

+    MTLS_TEST_CERT_S3_KEY,
+  );
+  // eslint-disable-next-line sonarjs/null-dereference -- loadS3Object always returns a string
+  const parts = pem.split(/(?=-----BEGIN )/);


I think we should make use of a proper npm package for pem file parsing, that way we're not hand rolling our own parsing code which could result in security vulnerabilities from malformed data.

I agree - there must be something off the shelf somewhere for this.

If we use secrets manager for everything we won't have this problem.

rhyscoxnhs · 2026-04-16T09:21:54Z

+let cache: ConfigCache | undefined;
+
+function getCache(): ConfigCache {
+  if (!cache) {


This could be simplified like so:

cache ??= new ConfigCache((Number(process.env.CLIENT_SUBSCRIPTION_CACHE_TTL_SECONDS) || 300) * 1000);

rhyscoxnhs · 2026-04-16T09:24:36Z

@@ -0,0 +1 @@
+export * from "@nhs-notify-client-callbacks/logger";


I don't think we should be re-exporting like this, we should instead directly use the @nhs-notify-client-callbacks/logger import.

rhyscoxnhs · 2026-04-16T09:26:15Z

+  const target = await loadTargetConfig(CLIENT_ID, targetId);
+  const maxRetryDurationMs =
+    (target.delivery?.maxRetryDurationSeconds ??
+      DEFAULT_MAX_RETRY_DURATION_MS / 1000) * 1000;


Why is it we're dividing by 1000 just to then multiply by 1000 later? It may make more sense to make the default be in seconds to simplify all of this.

rhyscoxnhs · 2026-04-16T09:27:28Z

+    return;
+  }
+
+  if ("retryAfterHeader" in result) {


Why are we doing an in check here? Can we strongly type result instead to avoid this?

Its the discriminated union type ok and permanent are both false so it doesn't know if its a rate limited result or other status code result.
I've suggested an alternative type which is easier to understand and uses a "type" as a discriminator.

mjewildnhs

Middle of http client lambda review - done with the handler, client, retry policy

mjewildnhs · 2026-04-16T08:34:28Z

+): Promise<SQSBatchItemFailure[]> {
+  const failures: SQSBatchItemFailure[] = [];
+
+  for (const record of records) {


We should try and processRecords in parallel up to a configurable concurrency limit.

mjewildnhs · 2026-04-16T08:45:50Z

+  throw new Error("Rate limited — requeue");
+}
+
+async function processRecord(record: SQSRecord): Promise<void> {


This function is too long and needs refactoring as its hard to follow. Especially with the early returns if result.ok == true.
Can it be refactored so we broadly have (with other bits between):

checkAdmission() deliverPayload() handleDeliveryResult()

mjewildnhs · 2026-04-16T08:51:35Z

+  targetId: string;
+};
+
+async function handleRateLimited(


I think a lot of this should be refactored to lambdas/https-client-lambda/src/services/delivery/retry-policy.ts we then wouldn't need to export/import as much from retry-policy and keep retry specific calculations out of the root handler.

mjewildnhs · 2026-04-16T08:56:03Z

+  }
+
+  if ("retryAfterHeader" in result) {
+    await handleRateLimited(


It doesn't look like any metrics are recorded when we get a 429.
Feels like we should record this.
Are there any other important gaps in the flow where metrics aren't recorded?

mjewildnhs · 2026-04-16T09:10:02Z

+  return cachedMaterial;
+}
+
+const PERMANENT_TLS_ERROR_CODES = new Set([


Bit of a nit-pick but this needs to be somewhere near the top of the file.
And shouldn't ERR_CERT_PINNING_FAILED be included in this list?

mjewildnhs · 2026-04-16T09:19:02Z

+    ELASTICACHE_IAM_USERNAME              = var.elasticache_iam_username
+  }
+
+  vpc_config = var.lambda_security_group_id != "" ? {


Where is the security group defined?
Am I missing something?
We need to consider what the lambda can talk to and on what ports - the lambda supports a port being in the webhook URL.

mjewildnhs · 2026-04-16T10:06:34Z

+  return new Promise((resolve) => {
+    const url = new URL(target.invocationEndpoint);
+
+    const req = https.request(


Doesn't look like we've any kind of timeout handling in here which I think we need. We don't want to wait for a 5m socket timeout or something.

We don't want to run the risk of the lambda timing out on request 1000/1000

mjewildnhs · 2026-04-16T11:22:58Z

+export function jitteredBackoffSeconds(receiveCount: number): number {
+  const ceiling = Math.min(5 * 2 ** (receiveCount - 1), BACKOFF_CAP_SECONDS);
+  // eslint-disable-next-line sonarjs/pseudo-random -- jitter for backoff, not security-sensitive
+  return Math.floor(Math.random() * ceiling);


Do we want to potentially have 0 seconds retry here or do we want a minimum of 1 second?

Suggested change

return Math.floor(Math.random() * ceiling);

return Math.max(1, Math.floor(Math.random() * ceiling))

mjewildnhs · 2026-04-16T11:26:20Z

+const CERT_EXPIRY_THRESHOLD_MS =
+  Number(process.env.CERT_EXPIRY_THRESHOLD_MS) || 86_400_000;


Suggested change

const CERT_EXPIRY_THRESHOLD_MS =

Number(process.env.CERT_EXPIRY_THRESHOLD_MS) || 86_400_000;

const CERT_EXPIRY_THRESHOLD_MS =

Number(process.env.CERT_EXPIRY_THRESHOLD_MS) || 86_400_000; // 24 hours

mjewildnhs · 2026-04-16T11:30:18Z

+    MTLS_TEST_CERT_S3_KEY,
+  );
+  // eslint-disable-next-line sonarjs/null-dereference -- loadS3Object always returns a string
+  const parts = pem.split(/(?=-----BEGIN )/);


I agree - there must be something off the shelf somewhere for this.

mjewildnhs · 2026-04-16T11:43:07Z

+}
+
+async function loadCertMaterial(): Promise<CertMaterial> {
+  const isProduction = Boolean(MTLS_CERT_SECRET_ARN);


I think we should revisit the decision to have the difference in cert handling between environments.
It adds a lot of complexity and unless we always test for it in non-prod there's a big danger we won't realize its broken until we've released it.
I seem to remember it was because were were concerned about costs for ephemeral environments.
Its priced per secret per month (at $0.40) - doesn't that mean we'll only be billed for a fraction of the based on how long the env was deployed?
https://aws.amazon.com/secrets-manager/pricing/

At least loadFromSecretsManager is trivial but the loadFromS3 function is currently awful.
Theres a tonne of env vars as well backing the S3 solution.

mjewildnhs · 2026-04-16T11:44:52Z

+    MTLS_TEST_CERT_S3_KEY,
+  );
+  // eslint-disable-next-line sonarjs/null-dereference -- loadS3Object always returns a string
+  const parts = pem.split(/(?=-----BEGIN )/);


If we use secrets manager for everything we won't have this problem.

mjewildnhs

Reviewed the http-lambda (minus unit tests)

mjewildnhs · 2026-04-16T13:52:08Z

A lot of this is similar to the transform filter lambdas config-loader (get from s3, parse, validate?, set in the cache, return).
Maybe we should replace @nhs-notify-client-callbacks/config-cache with @nhs-notify-client-callbacks/config-subscription-cache?

mjewildnhs · 2026-04-16T14:04:30Z

This looks quite a bit different to the spec implementation.
I'm going to ignore for now while rate limiting not decided but the spec implementation I think is better with:

TTL

batched HMGET read

~~structured return~~

better commenting

bunch of other minor algorithm things

mjewildnhs · 2026-04-16T14:05:32Z

As with the admit script (1 to come back to - https://github.com/NHSDigital/nhs-notify-client-callbacks/pull/145/changes#r3093774270) but I think this is worse than the spec impl.

mjewildnhs · 2026-04-16T14:21:20Z

+    admitSha = computeSha1(admitLuaSrc);
+  }
+
+  try {


Can we refactor this into a evalScript function as both admit and recordResult do similar things (EVALSHA, NOSCRIPT error handling)

mjewildnhs · 2026-04-16T14:35:28Z

We should have a TTL for the cache.
Not sure why it hasn't just copied the file from the transform filter lambda.

mjewildnhs · 2026-04-16T14:44:41Z

+    });
+  }

  logger.info("Mock webhook invoked", {


We should log out whether the cert was present so we can assert mTLS vs non mTLS clients are behaving correctly in the integration tests (by parsing the logging as we do for other stuff).

mjewildnhs · 2026-04-16T14:46:26Z

+    const pem = decodeURIComponent(certHeader);
+    const cert = new X509Certificate(pem);
+    const now = new Date();
+    if (now < new Date(cert.validFrom) || now > new Date(cert.validTo)) {


Do we actually care about this? Its not as though we are checking the cert is even trusted/revoked/etc?
Why check expiry?

mjewildnhs · 2026-04-16T14:54:30Z

    headerName: string;
    headerValue: string;
  };
+  mtls: {


Structure is a bit weird here - aren't mtls and certPinning delivery concerns too?
The impl matches the spec but i think we should go back and change it to:

delivery?: { maxRetryDurationSeconds?: number; // per-target retry window; default 7200 (2 h) circuitBreaker?: { enabled: boolean; // default true }; mtls?: { enabled: boolean; // present client cert; default true certPinning?: { enabled: boolean; // pin server cert by SPKI hash; default true spkiHash?: string; // base64 SHA-256 of server cert SPKI // absent when pinning disabled or hash not yet configured }; }; };

mjewildnhs · 2026-04-16T14:56:10Z

+  certPinning: certPinningSchema,
+  delivery: z
+    .object({
+      maxRetryDurationSeconds: z.number().min(60).max(43_200).optional(),


Suggested change

maxRetryDurationSeconds: z.number().min(60).max(43_200).optional(),

maxRetryDurationSeconds: z.number().min(60).max(43_200).optional(), // 12 hours

mjewildnhs · 2026-04-16T15:05:50Z

+    warnings.push("mTLS is enabled but certificate pinning is disabled");
+  }
+
+  if (certPinning.enabled && !certPinning.spkiHash) {


I think this has come from the spec but the schema validation in the models package will fail in the http-lambda if pinning is on and there is no hash.
I think this should be an error and the user has to set the hash 1st then turn on pinning.

mjewildnhs · 2026-04-16T15:09:35Z

+      ...commonOptions,
+      ...clientIdOption,
+      ...writeOptions,
+      "target-id": {


This target-id block is duplicated 3 times.
We should add targetIdOption to tools/client-subscriptions-management/src/entrypoint/cli/helper.ts like we do for the other options and also consume it in target-del which also has it (albeit with slightly different wording that can change).

mjewildnhs · 2026-04-16T15:19:02Z

+  if (!config) {
+    throw new Error(`No configuration found for client: ${argv["client-id"]}`);
+  }
+
+  const target = config.targets.find((t) => t.targetId === argv["target-id"]);
+
+  if (!target) {
+    throw new Error(
+      `Target '${argv["target-id"]}' not found for client '${argv["client-id"]}'`,
+    );
+  }


We've got this code pattern repeated on a lot of the targets (get client, throw if no client, get target, throw if no target).
We need the full config so can't refactor all of it away and in some places we expect no config/target.
Could we at least have requireClientConfig, requireTargetConfig.
Its minor but theres a lot of code in this tool so every bit helps

mjewildnhs · 2026-04-16T15:25:55Z

+        description: "Enable mTLS for this target",
+        conflicts: "disable",
+      },
+      disable: {


In yargs for booleans you can do --enable and --no-enable which we could make use of (https://yargs.js.org/docs/#api-reference-booleankey) which I think would simplify things (the conflict check and enabled declaration at L54)

mjewildnhs · 2026-04-16T15:27:01Z

+        demandOption: true,
+        description: "Target identifier to update",
+      },
+      enable: {


As per my other comment - using yargs built in --no-enable for this would simplify things.

mjewildnhs · 2026-04-16T15:29:26Z

  }

-  // Safe as this is an internal tool and this CLI option we are expecting the user will run locally and manually
  // eslint-disable-next-line security/detect-non-literal-fs-filename


Suggested change

// eslint-disable-next-line security/detect-non-literal-fs-filename

// eslint-disable-next-line security/detect-non-literal-fs-filename -- path is provided directly by the operator via CLI arg

Not sure why the comment was removed but we've added a similar disclaimer in target-set-certificate.

mjewildnhs · 2026-04-16T15:57:30Z

+export type DeliveryResult =
+  | { ok: true }
+  | { ok: false; permanent: true }
+  | {
+      ok: false;
+      permanent: false;
+      statusCode: 429;
+      retryAfterHeader: string | undefined;
+    }
+  | { ok: false; permanent: false; statusCode: number };


Suggested change

export type DeliveryResult =

| { ok: true }

| { ok: false; permanent: true }

| {

ok: false;

permanent: false;

statusCode: 429;

retryAfterHeader: string | undefined;

}

| { ok: false; permanent: false; statusCode: number };

export type DeliveryResult =

| { outcome: "success" }

| { outcome: "permanent_failure" }

| { outcome: "rate_limited"; retryAfterHeader: string | undefined }

| { outcome: "transient_failure"; statusCode: number };

mjewildnhs · 2026-04-16T16:08:59Z

All looks very promising - but where are the tests?!
Happy to leave them out till we get something reasonable looking deployed.

CCM-16073 - Enhanced callbacks [skip ci]

d9dfecc

rhyscoxnhs requested review from a team as code owners April 15, 2026 10:34

rhyscoxnhs changed the title ~~CCM-16073 - Enhanced callbacks [skip ci]~~ CCM-16073 - Enhanced callbacks Apr 15, 2026

rhyscoxnhs requested a review from Copilot April 15, 2026 10:35

Copilot started reviewing on behalf of rhyscoxnhs April 15, 2026 10:35 View session

Copilot AI reviewed Apr 15, 2026

View reviewed changes

mjewildnhs marked this pull request as draft April 15, 2026 10:45

aidenvaines-cgi reviewed Apr 15, 2026

View reviewed changes

CCM-16073 - PR feedback

e2ea305

gparry333 reviewed Apr 15, 2026

View reviewed changes

rhyscoxnhs commented Apr 15, 2026

View reviewed changes

gparry333 reviewed Apr 15, 2026

View reviewed changes

mjewildnhs reviewed Apr 15, 2026

View reviewed changes

rhyscoxnhs commented Apr 16, 2026

View reviewed changes

mjewildnhs reviewed Apr 16, 2026

View reviewed changes

		@@ -0,0 +1,39 @@
		module "sqs_delivery" {
		source = "https://github.com/NHSDigital/nhs-notify-shared-modules/releases/download/3.0.7/terraform-sqs.zip"

		import type { SQSRecord } from "aws-lambda";

		import { processRecords } from "handler";

		@@ -0,0 +1,474 @@
		import type { SQSRecord } from "aws-lambda";


		export const configLoaderService = new ConfigLoaderService();

		export const applicationsMapService = new ApplicationsMapService();

File	Reason
`src/services/ssm-applications-map.ts`	Dead code — no longer imported in index.ts or handler.ts
`src/__tests__/services/ssm-applications-map.test.ts`	Tests the deleted service

		@@ -0,0 +1 @@
		export * from "@nhs-notify-client-callbacks/logger";

	return Math.floor(Math.random() * ceiling);
	return Math.max(1, Math.floor(Math.random() * ceiling))

		const CERT_EXPIRY_THRESHOLD_MS =
		Number(process.env.CERT_EXPIRY_THRESHOLD_MS) \|\| 86_400_000;

	maxRetryDurationSeconds: z.number().min(60).max(43_200).optional(),
	maxRetryDurationSeconds: z.number().min(60).max(43_200).optional(), // 12 hours

	// eslint-disable-next-line security/detect-non-literal-fs-filename
	// eslint-disable-next-line security/detect-non-literal-fs-filename -- path is provided directly by the operator via CLI arg

Conversation

rhyscoxnhs commented Apr 15, 2026

Description

Context

Type of changes

Checklist

Sensitive Information Declaration

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

aidenvaines-cgi left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gparry333 Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mjewildnhs left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mjewildnhs left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

gparry333 Apr 15, 2026 •

edited

Loading