Conversation
There was a problem hiding this comment.
Pull request overview
Introduces enhanced callback delivery security and operational controls by adding mTLS + certificate pinning configuration to callback targets, and shifting delivery from direct EventBridge API Destinations to an SQS + per-client HTTPS delivery Lambda model (with Redis-backed gating for rate limiting/circuit breaking).
Changes:
- Add
mtls,certPinning, anddeliveryfields to callback target model + schema validation, and update fixtures/tests accordingly. - Add CLI commands to manage mTLS, certificate pinning enable/disable, and SPKI hash extraction/storage for targets.
- Add new
https-client-lambda(delivery, signing, retries, DLQ handling, Redis/Lua gate) and new sharedconfig-cachepackage; update Terraform to provision per-client delivery infra (SQS/Lambda/ElastiCache) and mock mTLS ALB.
Reviewed changes
Copilot reviewed 104 out of 107 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| tools/client-subscriptions-management/src/entrypoint/cli/targets-set-pinning.ts | New CLI command to enable/disable certificate pinning for a target. |
| tools/client-subscriptions-management/src/entrypoint/cli/targets-set-mtls.ts | New CLI command to enable/disable mTLS for a target. |
| tools/client-subscriptions-management/src/entrypoint/cli/targets-set-certificate.ts | New CLI command to extract/store SPKI hash from PEM for a target. |
| tools/client-subscriptions-management/src/entrypoint/cli/clients-put.ts | Minor cleanup in CLI file handling comment. |
| tools/client-subscriptions-management/src/domain/client-subscription-builder.ts | Adds default mtls/certPinning and emits security warnings when building targets. |
| tools/client-subscriptions-management/src/tests/helpers/client-subscription-fixtures.ts | Updates test target fixtures with required mtls/certPinning fields. |
| tools/client-subscriptions-management/src/tests/entrypoint/cli/targets-set-pinning.test.ts | Unit tests for targets-set-pinning CLI behavior. |
| tools/client-subscriptions-management/src/tests/entrypoint/cli/targets-set-mtls.test.ts | Unit tests for targets-set-mtls CLI behavior. |
| tools/client-subscriptions-management/src/tests/entrypoint/cli/targets-set-certificate.test.ts | Unit tests for targets-set-certificate CLI behavior. |
| tools/client-subscriptions-management/src/tests/domain/client-subscription-builder.test.ts | Adds tests around security warning emission in target builder. |
| tools/client-subscriptions-management/package.json | Adds picocolors dependency for warning output. |
| tests/integration/helpers/mock-client-config.ts | Adds a new integration fixture key for an mTLS-enabled mock client. |
| tests/integration/helpers/event-factories.ts | Adds factory for delivery messages compatible with the new delivery flow. |
| tests/integration/fixtures/subscriptions/mock-client-rate-limit.json | New integration fixture including required mtls/certPinning. |
| tests/integration/fixtures/subscriptions/mock-client-mtls.json | New integration fixture for mTLS + pinning enabled target. |
| tests/integration/fixtures/subscriptions/mock-client-circuit-breaker.json | New integration fixture including delivery circuit breaker configuration. |
| tests/integration/fixtures/subscriptions/mock-client-2.json | Updates existing integration fixture targets to include mtls/certPinning. |
| tests/integration/fixtures/subscriptions/mock-client-1.json | Updates existing integration fixture targets to include mtls/certPinning. |
| src/models/src/client-config.ts | Extends callback target type with mtls, certPinning, and optional delivery. |
| src/models/src/client-config-schema.ts | Adds Zod validation for mtls, certPinning (incl. SPKI hash constraints), and delivery. |
| src/models/src/tests/client-config-schema.test.ts | Adds/updates schema tests for new target fields and constraints. |
| src/config-cache/tsconfig.json | New package tsconfig for config-cache workspace. |
| src/config-cache/src/index.ts | Exports ConfigCache from new workspace package. |
| src/config-cache/src/config-cache.ts | Implements TTL-based in-memory config cache. |
| src/config-cache/src/tests/config-cache.test.ts | Unit tests for ConfigCache TTL and behaviors. |
| src/config-cache/package.json | New workspace package definition for config-cache. |
| src/config-cache/jest.config.ts | Jest config for config-cache package. |
| scripts/config/pre-commit.yaml | Adjusts pre-commit detect-private-key exclusions for a test file. |
| pnpm-workspace.yaml | Adds workspace catalog entries for @redis/client, picocolors, and Secrets Manager SDK. |
| lambdas/mock-webhook-lambda/src/index.ts | Adds ALB mTLS client-cert header validation for mock webhook endpoint. |
| lambdas/mock-webhook-lambda/src/tests/index.test.ts | Adds unit tests for ALB mTLS certificate verification flow. |
| lambdas/https-client-lambda/tsconfig.json | New lambda workspace tsconfig. |
| lambdas/https-client-lambda/src/services/ssm-applications-map.ts | Loads and caches clientId→applicationId map from SSM. |
| lambdas/https-client-lambda/src/services/sqs-visibility.ts | SQS ChangeMessageVisibility helper. |
| lambdas/https-client-lambda/src/services/record-result.lua | Redis Lua script to update circuit breaker state. |
| lambdas/https-client-lambda/src/services/payload-signer.ts | Adjusts payload signer function signature (parameter order). |
| lambdas/https-client-lambda/src/services/logger.ts | Re-exports shared logger for lambda local imports. |
| lambdas/https-client-lambda/src/services/endpoint-gate.ts | Redis admission + circuit-breaker integration (EVALSHA/EVAL Lua execution). |
| lambdas/https-client-lambda/src/services/dlq-sender.ts | DLQ send helper. |
| lambdas/https-client-lambda/src/services/delivery/tls-agent-factory.ts | Builds TLS agent with optional mTLS material and SPKI pinning. |
| lambdas/https-client-lambda/src/services/delivery/retry-policy.ts | Retry/backoff and Retry-After parsing helpers. |
| lambdas/https-client-lambda/src/services/delivery/https-client.ts | HTTPS delivery client + result classification. |
| lambdas/https-client-lambda/src/services/delivery-metrics.ts | Embedded metrics emission for delivery and circuit-breaker events. |
| lambdas/https-client-lambda/src/services/config-loader.ts | Loads client config/targets from S3 with TTL caching and schema validation. |
| lambdas/https-client-lambda/src/services/admit.lua | Redis Lua script for token-bucket rate limiting + CB admission. |
| lambdas/https-client-lambda/src/lua.d.ts | Type declaration for importing .lua as text. |
| lambdas/https-client-lambda/src/index.ts | Lambda entrypoint delegating to record processor. |
| lambdas/https-client-lambda/src/handler.ts | Main SQS batch handler: load config, sign, gate, deliver, retry/DLQ, metrics. |
| lambdas/https-client-lambda/src/tests/tls-agent-factory.test.ts | Unit tests for TLS agent factory behavior (S3/SecretsManager/pinning). |
| lambdas/https-client-lambda/src/tests/ssm-applications-map.test.ts | Unit tests for SSM applications map loading/caching/errors. |
| lambdas/https-client-lambda/src/tests/sqs-visibility.test.ts | Unit tests for SQS visibility changes. |
| lambdas/https-client-lambda/src/tests/retry-policy.test.ts | Unit tests for retry policy helpers. |
| lambdas/https-client-lambda/src/tests/payload-signer.test.ts | Unit tests for payload signing. |
| lambdas/https-client-lambda/src/tests/index.test.ts | Unit test for lambda entrypoint wiring. |
| lambdas/https-client-lambda/src/tests/https-client.test.ts | Unit tests for delivery HTTP behavior classification. |
| lambdas/https-client-lambda/src/tests/handler.test.ts | Unit tests for SQS record processing paths (DLQ/retry/gate/CB). |
| lambdas/https-client-lambda/src/tests/endpoint-gate.test.ts | Unit tests for Redis Lua invocation paths and client creation. |
| lambdas/https-client-lambda/src/tests/dlq-sender.test.ts | Unit tests for DLQ sender. |
| lambdas/https-client-lambda/src/tests/delivery-metrics.test.ts | Unit tests for embedded metrics behavior. |
| lambdas/https-client-lambda/src/tests/config-loader.test.ts | Unit tests for S3 config loading + TTL cache. |
| lambdas/https-client-lambda/package.json | New lambda workspace package definition. |
| lambdas/https-client-lambda/lua-transform.js | Jest transform to load .lua scripts as strings. |
| lambdas/https-client-lambda/jest.config.ts | Jest config enabling .lua transform. |
| lambdas/client-transform-filter-lambda/src/services/observability.ts | Removes callback signing observability (signing moved downstream). |
| lambdas/client-transform-filter-lambda/src/services/config-loader.ts | Switches to shared config-cache package import. |
| lambdas/client-transform-filter-lambda/src/services/config-loader-service.ts | Switches to shared config-cache package import. |
| lambdas/client-transform-filter-lambda/src/index.ts | Removes ApplicationsMapService wiring from handler creation. |
| lambdas/client-transform-filter-lambda/src/handler.ts | Removes per-target signatures from output; outputs deliverable payload + subscriptions only. |
| lambdas/client-transform-filter-lambda/src/tests/services/payload-signer.test.ts | Removes tests for payload signing in transform-filter lambda. |
| lambdas/client-transform-filter-lambda/src/tests/services/config-update.component.test.ts | Updates imports to new config-cache package. |
| lambdas/client-transform-filter-lambda/src/tests/services/config-loader.test.ts | Updates imports to new config-cache package. |
| lambdas/client-transform-filter-lambda/src/tests/services/config-cache.test.ts | Updates imports to new config-cache package. |
| lambdas/client-transform-filter-lambda/src/tests/index.test.ts | Updates tests to reflect removal of signatures and apps map dependency. |
| lambdas/client-transform-filter-lambda/src/tests/index.component.test.ts | Updates component tests to reflect new payload shape and removed SSM dependency. |
| lambdas/client-transform-filter-lambda/src/tests/helpers/client-subscription-fixtures.ts | Updates target fixtures to include required mtls/certPinning. |
| lambdas/client-transform-filter-lambda/package.json | Adds config-cache workspace dependency. |
| knip.ts | Updates Knip workspace config (new workspaces and integration entrypoints). |
| infrastructure/terraform/modules/client-destination/variables.tf | Removes legacy API Destination-based module. |
| infrastructure/terraform/modules/client-destination/module_target_dlq.tf | Removes legacy per-target DLQ module. |
| infrastructure/terraform/modules/client-destination/locals.tf | Removes legacy locals for old module. |
| infrastructure/terraform/modules/client-destination/iam_role_api_target_role.tf | Removes legacy IAM role/policy for API destinations. |
| infrastructure/terraform/modules/client-destination/cloudwatch_event_rule_main.tf | Removes legacy EventBridge rule/target to API destination setup. |
| infrastructure/terraform/modules/client-destination/cloudwatch_event_connection_main.tf | Removes legacy EventBridge Connection resources. |
| infrastructure/terraform/modules/client-destination/cloudwatch_event_api_destination_this.tf | Removes legacy API Destination resources. |
| infrastructure/terraform/modules/client-destination/README.md | Removes docs for legacy module. |
| infrastructure/terraform/modules/client-delivery/variables.tf | Adds new per-client delivery module variables. |
| infrastructure/terraform/modules/client-delivery/outputs.tf | Adds outputs for per-client queues and lambda details. |
| infrastructure/terraform/modules/client-delivery/module_sqs_per_client.tf | Provisions per-client delivery SQS queue. |
| infrastructure/terraform/modules/client-delivery/module_https_client_lambda.tf | Provisions per-client https-client lambda + SQS event source mapping. |
| infrastructure/terraform/modules/client-delivery/module_dlq_per_client.tf | Provisions per-client DLQ + DLQ depth alarm. |
| infrastructure/terraform/modules/client-delivery/locals.tf | Defines naming/tagging locals for per-client module. |
| infrastructure/terraform/modules/client-delivery/iam_role_sqs_target.tf | IAM policy for https-client lambda (SQS/S3/SSM/KMS and optional SecretsManager/S3 cert). |
| infrastructure/terraform/modules/client-delivery/cloudwatch_event_rule_per_subscription.tf | Defines EventBridge rules/targets routing to per-client SQS delivery queue. |
| infrastructure/terraform/modules/client-delivery/README.md | Adds docs for new per-client delivery module. |
| infrastructure/terraform/components/callbacks/variables.tf | Enables X-Ray by default and adds mTLS-related component variables. |
| infrastructure/terraform/components/callbacks/pipes_pipe_main.tf | Removes signatures from Pipe output template. |
| infrastructure/terraform/components/callbacks/module_mock_webhook_alb_mtls.tf | Adds internal ALB + ACM import + passthrough mTLS wiring for mock webhook. |
| infrastructure/terraform/components/callbacks/module_client_destination.tf | Removes legacy client_destination module usage. |
| infrastructure/terraform/components/callbacks/module_client_delivery.tf | Adds per-client delivery module instantiation. |
| infrastructure/terraform/components/callbacks/locals.tf | Reworks locals for per-client subscriptions/targets and mTLS mock endpoint selection. |
| infrastructure/terraform/components/callbacks/elasticache_delivery_state.tf | Adds ElastiCache Serverless Redis for delivery state + security groups/alarms. |
| infrastructure/terraform/components/callbacks/cloudwatch_metric_alarm_dlq_depth.tf | Removes legacy per-target DLQ alarm (replaced by per-client alarms). |
| infrastructure/terraform/components/callbacks/cloudwatch_eventbus_main.tf | Adds EventBridge archive for 7-day retention. |
| infrastructure/terraform/components/callbacks/README.md | Updates docs to reflect new modules, variables, and defaults. |
| eslint.config.mjs | Ignores lua-transform.js and expands rule scope to include src workspaces. |
| .gitleaksignore | Adds ignore for test private key fixture in tls-agent-factory tests. |
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
aidenvaines-cgi
left a comment
There was a problem hiding this comment.
the TF stuff all looks cool from an rough check
| @@ -0,0 +1,178 @@ | |||
| resource "aws_elasticache_serverless_cache" "delivery_state" { | |||
| name = "${local.csi}-delivery-state" | |||
| engine = "redis" | |||
There was a problem hiding this comment.
Valkey is redis compatible, is half the cost and faster. Can we use that?
| @@ -0,0 +1,39 @@ | |||
| module "sqs_delivery" { | |||
| source = "https://github.com/NHSDigital/nhs-notify-shared-modules/releases/download/3.0.7/terraform-sqs.zip" | |||
There was a problem hiding this comment.
| source = "https://github.com/NHSDigital/nhs-notify-shared-modules/releases/download/3.0.7/terraform-sqs.zip" | |
| source = "https://github.com/NHSDigital/nhs-notify-shared-modules/releases/download/3.0.9/terraform-sqs.zip" |
3.0.9 exists, 3.0.7 is being used here. minor, ignore if you want
| type = bool | ||
| description = "Enable AWS X-Ray active tracing for Lambda functions" | ||
| default = false | ||
| default = true |
There was a problem hiding this comment.
i think weve historically had xray off by default in prod as its quite spendy, and enabling it as and when needed.
All variables should be prod defaults with PTL envs having overrides
| log_destination_arn = var.log_destination_arn | ||
| log_subscription_role_arn = var.log_subscription_role_arn | ||
|
|
||
| lambda_env_vars = { |
There was a problem hiding this comment.
please sort alphabetically, it makes me sad
| import type { SQSRecord } from "aws-lambda"; | ||
|
|
||
| import { processRecords } from "handler"; | ||
|
|
There was a problem hiding this comment.
There is a lot of mock setup at the beginning of this. Could we extract this out into a test helper or maybe a fixture so that the test file is cleaner and more readable?
| firstReceivedMs: number, | ||
| maxRetryDurationMs: number, | ||
| ): boolean { | ||
| return Date.now() - firstReceivedMs >= maxRetryDurationMs; |
There was a problem hiding this comment.
This looks a little hard to read, could do with parens around the first operation
| const { MTLS_CERT_SECRET_ARN } = process.env; | ||
| const { MTLS_TEST_CERT_S3_BUCKET } = process.env; | ||
| const { MTLS_TEST_CERT_S3_KEY } = process.env; | ||
| const { MTLS_TEST_CA_S3_KEY } = process.env; |
There was a problem hiding this comment.
These destructures should be merged ideally
| // eslint-disable-next-line sonarjs/null-dereference -- loadS3Object always returns a string | ||
| const parts = pem.split(/(?=-----BEGIN )/); | ||
| // eslint-disable-next-line sonarjs/null-dereference -- .find returns string|undefined, fallback handles it | ||
| const key = parts.find((p) => p.includes("PRIVATE KEY")) ?? ""; | ||
| // eslint-disable-next-line sonarjs/null-dereference -- as above | ||
| const cert = parts.find((p) => p.includes("CERTIFICATE")) ?? ""; |
There was a problem hiding this comment.
Ideally we either disable this lint at the top or find ways to fix this lint properly.
| /* eslint-disable sonarjs/null-dereference -- refillPerSec is typed as number, cannot be null */ | ||
| const args = [ | ||
| now, | ||
| refillPerSec.toString(), | ||
| config.burstCapacity.toString(), | ||
| config.cbProbeIntervalMs.toString(), | ||
| cbEnabled ? "1" : "0", | ||
| config.decayPeriodMs.toString(), | ||
| ]; | ||
| /* eslint-enable sonarjs/null-dereference */ |
There was a problem hiding this comment.
Not sure why we're disabling this lint?
There was a problem hiding this comment.
Do we need to try it without to see if it fails?
| @@ -0,0 +1,474 @@ | |||
| import type { SQSRecord } from "aws-lambda"; | |||
|
|
|||
There was a problem hiding this comment.
We could do with a test being added that tests that an unexpected error being thrown on one record doesn't prevent a subsequent record being processed e.g. An unexpected error thrown by an upstream dependency like loadTargetConfig which would be caught by the try/catch in processRecords - loadTargetConfig throwing "S3 unavailable" for record 1, while record 2 should still be processed normally.
| { type: "MessageStatus", attributes: { messageStatus: "delivered" } }, | ||
| ], | ||
| }) as Parameters<typeof signPayload>[2]; | ||
|
|
There was a problem hiding this comment.
Maybe also have a negative test for this that verifies an empty payload produces a deterministic (non-empty) signature, ensuring the signing function doesn't silently fail on edge-case payloads?
mjewildnhs
left a comment
There was a problem hiding this comment.
As far as reviewing the terraform
| @@ -0,0 +1,178 @@ | |||
| resource "aws_elasticache_serverless_cache" "delivery_state" { | |||
| name = "${local.csi}-delivery-state" | |||
| engine = "redis" | |||
There was a problem hiding this comment.
For cost saving I think we should switch to valkey.
You are billed in gigabyte-hours (GB-hrs) and the minimum for redis is 1GB vs 100mb in valkey.
Not sure we'll go above 100mb even in prod.
https://aws.amazon.com/elasticache/pricing/
There was a problem hiding this comment.
Not related to the phase 2 work but I think we can delete lines 3-6 - those secrets have never been in our repo at any point
|
|
||
| cache_usage_limits { | ||
| data_storage { | ||
| maximum = 1 |
There was a problem hiding this comment.
1GB is the minimum with redis but can go down to 100mb if we make the valkey switch.
Keeping this low for dev/test environments is good for cost saving.
We should see how much each client will take in storage.
There was a problem hiding this comment.
Do we want an alarm on storage - e.g. 80% used
| certificate_arn = aws_acm_certificate_import.mock_webhook_server[0].arn | ||
|
|
||
| mutual_authentication { | ||
| mode = "passthrough" |
There was a problem hiding this comment.
With passthrough we should be able route mTLS and non mTLS traffic through and remove the aws_lambda_permission config in module_mock_webhook_lambda.
We'd then need to modify the lambda to output whether it was mTLS or not which the tests can then verify based on their expectation.
This would simplify things a lot and also get us the hardening we are looking for on the mock webhook.
| MAX_RETRY_DURATION_SECONDS = tostring(var.max_retry_duration_seconds) | ||
| MTLS_CERT_SECRET_ARN = var.mtls_cert_secret_arn | ||
| MTLS_TEST_CERT_S3_BUCKET = var.mtls_test_cert_s3_bucket | ||
| MTLS_TEST_CERT_S3_KEY = var.mtls_test_cert_s3_key |
There was a problem hiding this comment.
Looks like the lambda is also missing MTLS_TEST_CA_S3_KEY
There was a problem hiding this comment.
I don't think any of these are referenced?
AI seems to like adding outputs for no reason.
|
|
||
| variable "mtls_cert_secret_arn" { | ||
| type = string | ||
| description = "Secrets Manager ARN for the mTLS client certificate (production)" |
There was a problem hiding this comment.
As per my other comment I think we need to review the production vs non-production terminology we are using. Production should be assumed and non-production should be "dev" - this aligns with the tfvar groups.
| variable "deploy_mock_clients" { | ||
| type = bool | ||
| description = "Whether mock client infrastructure is deployed (non-prod only)" | ||
| default = false | ||
| } |
There was a problem hiding this comment.
| variable "deploy_mock_clients" { | |
| type = bool | |
| description = "Whether mock client infrastructure is deployed (non-prod only)" | |
| default = false | |
| } |
Not used in the module.
mjewildnhs
left a comment
There was a problem hiding this comment.
Done transform filter lambda changes - in middle of http lambda
There was a problem hiding this comment.
Delete now its in a shared module. This is just a duplicate of src/config-cache/src/__tests__/config-cache.test.ts
|
|
||
| export const configLoaderService = new ConfigLoaderService(); | ||
|
|
||
| export const applicationsMapService = new ApplicationsMapService(); |
There was a problem hiding this comment.
Its never very good at cleaning up after itself. Theres a bunch of dead code to remove. AI summary:
Files to delete:
| File | Reason |
|---|---|
src/services/ssm-applications-map.ts |
Dead code — no longer imported in index.ts or handler.ts |
src/__tests__/services/ssm-applications-map.test.ts |
Tests the deleted service |
Knock-on changes required:
-
package.json — remove
@aws-sdk/client-ssmfrom dependencies (only used in ssm-applications-map.ts and its test) -
src/__tests__/index.component.test.ts— needs cleanup:- Remove the
jest.mock("@aws-sdk/client-ssm", ...)block (lines 18–26) - Remove
mockSsmSendvariable and its usages - Remove
APPLICATIONS_MAP_PARAMETERfromprocess.envsetup/teardown - Remove
CLIENT_SUBSCRIPTION_CONFIG_BUCKET,CLIENT_SUBSCRIPTION_CONFIG_PREFIX,CLIENT_SUBSCRIPTION_CACHE_TTL_SECONDSenv vars — these are now the wrong names (should beCLIENT_CONFIG_BUCKETetc. per the earlier fix), and the SSM-related ones are dead - Remove
mockSsmSend.mockResolvedValue(...)frombeforeEach - Remove
mockSsmSend.mockClear()frombeforeEach
Note: the SSM mock was used to simulate
ApplicationsMapServiceloading the applications map — since that's gone from this lambda, it's entirely redundant. TheapplicationsMapconst andmockSsmSendusages can all go. - Remove the
There was a problem hiding this comment.
Also need to remove APPLICATIONS_MAP_PARAMETER from the terraform module for the lambda.
| @@ -2,6 +2,7 @@ | |||
| "dependencies": { | |||
| "@aws-sdk/client-s3": "catalog:aws", | |||
| "@aws-sdk/client-ssm": "catalog:aws", | |||
There was a problem hiding this comment.
We need a .luarc.json file at the root level like we had on the PoC branch - otherwise the IDE won't be happy with the undefined globals
The following works:
{
"diagnostics": {
"globals": [
"KEYS",
"ARGV",
"redis"
]
}
}| const maxRetryDurationMs = | ||
| (target.delivery?.maxRetryDurationSeconds ?? | ||
| DEFAULT_MAX_RETRY_DURATION_MS / 1000) * 1000; |
There was a problem hiding this comment.
Lol this is crazy to read
| const maxRetryDurationMs = | |
| (target.delivery?.maxRetryDurationSeconds ?? | |
| DEFAULT_MAX_RETRY_DURATION_MS / 1000) * 1000; | |
| const maxRetryDurationMs = | |
| target.delivery?.maxRetryDurationSeconds !== undefined | |
| ? target.delivery.maxRetryDurationSeconds * 1000 | |
| : DEFAULT_MAX_RETRY_DURATION_MS; |
| async function processRecord(record: SQSRecord): Promise<void> { | ||
| const { CLIENT_ID } = process.env; | ||
| if (!CLIENT_ID) { | ||
| throw new Error("CLIENT_ID is required"); |
There was a problem hiding this comment.
If this is misconfigured then the behaviour will be undesirable - SQS will just keep retrying it 100 times.
However it will never happen so I don't think its worth explicitly handling it and treating the events as permanent failures (by sending to the DLQ).
| if (cbEnabled) { | ||
| await recordResult(redis, targetId, true, gateConfig); | ||
| } | ||
| emitDeliverySuccess(targetId); |
There was a problem hiding this comment.
I preferred the observability stuff we had in the client transform filter lambda (lambdas/client-transform-filter-lambda/src/services/observability.ts) which wrapped the logging and metric calls into one. It cuts down on the clutter a bit by reducing it into 1 call.
| { | ||
| hostname: url.hostname, | ||
| port: url.port || 443, | ||
| path: url.pathname + url.search, |
There was a problem hiding this comment.
We may face URL encoding issues here. https.request does not perform URL escaping, so spaces may result in malformed HTTPS requests.
Instead, since https.request supports a URL object for path, we should pass through the url object we already construct before making the request.
| const SQS_MAX_VISIBILITY_SECONDS = 43_200; | ||
|
|
||
| export function jitteredBackoffSeconds(receiveCount: number): number { | ||
| const ceiling = Math.min(5 * 2 ** (receiveCount - 1), BACKOFF_CAP_SECONDS); |
There was a problem hiding this comment.
We should have proper constant naming here for 5, 2, etc so it's understandable at a high level
| MTLS_TEST_CERT_S3_KEY, | ||
| ); | ||
| // eslint-disable-next-line sonarjs/null-dereference -- loadS3Object always returns a string | ||
| const parts = pem.split(/(?=-----BEGIN )/); |
There was a problem hiding this comment.
I think we should make use of a proper npm package for pem file parsing, that way we're not hand rolling our own parsing code which could result in security vulnerabilities from malformed data.
There was a problem hiding this comment.
I agree - there must be something off the shelf somewhere for this.
There was a problem hiding this comment.
If we use secrets manager for everything we won't have this problem.
| let cache: ConfigCache | undefined; | ||
|
|
||
| function getCache(): ConfigCache { | ||
| if (!cache) { |
There was a problem hiding this comment.
This could be simplified like so:
cache ??= new ConfigCache((Number(process.env.CLIENT_SUBSCRIPTION_CACHE_TTL_SECONDS) || 300) * 1000);| @@ -0,0 +1 @@ | |||
| export * from "@nhs-notify-client-callbacks/logger"; | |||
There was a problem hiding this comment.
I don't think we should be re-exporting like this, we should instead directly use the @nhs-notify-client-callbacks/logger import.
| const target = await loadTargetConfig(CLIENT_ID, targetId); | ||
| const maxRetryDurationMs = | ||
| (target.delivery?.maxRetryDurationSeconds ?? | ||
| DEFAULT_MAX_RETRY_DURATION_MS / 1000) * 1000; |
There was a problem hiding this comment.
Why is it we're dividing by 1000 just to then multiply by 1000 later? It may make more sense to make the default be in seconds to simplify all of this.
| return; | ||
| } | ||
|
|
||
| if ("retryAfterHeader" in result) { |
There was a problem hiding this comment.
Why are we doing an in check here? Can we strongly type result instead to avoid this?
There was a problem hiding this comment.
Its the discriminated union type ok and permanent are both false so it doesn't know if its a rate limited result or other status code result.
I've suggested an alternative type which is easier to understand and uses a "type" as a discriminator.
mjewildnhs
left a comment
There was a problem hiding this comment.
Middle of http client lambda review - done with the handler, client, retry policy
| ): Promise<SQSBatchItemFailure[]> { | ||
| const failures: SQSBatchItemFailure[] = []; | ||
|
|
||
| for (const record of records) { |
There was a problem hiding this comment.
We should try and processRecords in parallel up to a configurable concurrency limit.
| throw new Error("Rate limited — requeue"); | ||
| } | ||
|
|
||
| async function processRecord(record: SQSRecord): Promise<void> { |
There was a problem hiding this comment.
This function is too long and needs refactoring as its hard to follow. Especially with the early returns if result.ok == true.
Can it be refactored so we broadly have (with other bits between):
checkAdmission()
deliverPayload()
handleDeliveryResult()
| targetId: string; | ||
| }; | ||
|
|
||
| async function handleRateLimited( |
There was a problem hiding this comment.
I think a lot of this should be refactored to lambdas/https-client-lambda/src/services/delivery/retry-policy.ts we then wouldn't need to export/import as much from retry-policy and keep retry specific calculations out of the root handler.
| } | ||
|
|
||
| if ("retryAfterHeader" in result) { | ||
| await handleRateLimited( |
There was a problem hiding this comment.
It doesn't look like any metrics are recorded when we get a 429.
Feels like we should record this.
Are there any other important gaps in the flow where metrics aren't recorded?
| return cachedMaterial; | ||
| } | ||
|
|
||
| const PERMANENT_TLS_ERROR_CODES = new Set([ |
There was a problem hiding this comment.
Bit of a nit-pick but this needs to be somewhere near the top of the file.
And shouldn't ERR_CERT_PINNING_FAILED be included in this list?
| ELASTICACHE_IAM_USERNAME = var.elasticache_iam_username | ||
| } | ||
|
|
||
| vpc_config = var.lambda_security_group_id != "" ? { |
There was a problem hiding this comment.
Where is the security group defined?
Am I missing something?
We need to consider what the lambda can talk to and on what ports - the lambda supports a port being in the webhook URL.
| return new Promise((resolve) => { | ||
| const url = new URL(target.invocationEndpoint); | ||
|
|
||
| const req = https.request( |
There was a problem hiding this comment.
Doesn't look like we've any kind of timeout handling in here which I think we need. We don't want to wait for a 5m socket timeout or something.
There was a problem hiding this comment.
We don't want to run the risk of the lambda timing out on request 1000/1000
| export function jitteredBackoffSeconds(receiveCount: number): number { | ||
| const ceiling = Math.min(5 * 2 ** (receiveCount - 1), BACKOFF_CAP_SECONDS); | ||
| // eslint-disable-next-line sonarjs/pseudo-random -- jitter for backoff, not security-sensitive | ||
| return Math.floor(Math.random() * ceiling); |
There was a problem hiding this comment.
Do we want to potentially have 0 seconds retry here or do we want a minimum of 1 second?
| return Math.floor(Math.random() * ceiling); | |
| return Math.max(1, Math.floor(Math.random() * ceiling)) |
| const CERT_EXPIRY_THRESHOLD_MS = | ||
| Number(process.env.CERT_EXPIRY_THRESHOLD_MS) || 86_400_000; |
There was a problem hiding this comment.
| const CERT_EXPIRY_THRESHOLD_MS = | |
| Number(process.env.CERT_EXPIRY_THRESHOLD_MS) || 86_400_000; | |
| const CERT_EXPIRY_THRESHOLD_MS = | |
| Number(process.env.CERT_EXPIRY_THRESHOLD_MS) || 86_400_000; // 24 hours |
| MTLS_TEST_CERT_S3_KEY, | ||
| ); | ||
| // eslint-disable-next-line sonarjs/null-dereference -- loadS3Object always returns a string | ||
| const parts = pem.split(/(?=-----BEGIN )/); |
There was a problem hiding this comment.
I agree - there must be something off the shelf somewhere for this.
| } | ||
|
|
||
| async function loadCertMaterial(): Promise<CertMaterial> { | ||
| const isProduction = Boolean(MTLS_CERT_SECRET_ARN); |
There was a problem hiding this comment.
I think we should revisit the decision to have the difference in cert handling between environments.
It adds a lot of complexity and unless we always test for it in non-prod there's a big danger we won't realize its broken until we've released it.
I seem to remember it was because were were concerned about costs for ephemeral environments.
Its priced per secret per month (at $0.40) - doesn't that mean we'll only be billed for a fraction of the based on how long the env was deployed?
https://aws.amazon.com/secrets-manager/pricing/
There was a problem hiding this comment.
At least loadFromSecretsManager is trivial but the loadFromS3 function is currently awful.
Theres a tonne of env vars as well backing the S3 solution.
| MTLS_TEST_CERT_S3_KEY, | ||
| ); | ||
| // eslint-disable-next-line sonarjs/null-dereference -- loadS3Object always returns a string | ||
| const parts = pem.split(/(?=-----BEGIN )/); |
There was a problem hiding this comment.
If we use secrets manager for everything we won't have this problem.
mjewildnhs
left a comment
There was a problem hiding this comment.
Reviewed the http-lambda (minus unit tests)
There was a problem hiding this comment.
A lot of this is similar to the transform filter lambdas config-loader (get from s3, parse, validate?, set in the cache, return).
Maybe we should replace @nhs-notify-client-callbacks/config-cache with @nhs-notify-client-callbacks/config-subscription-cache?
There was a problem hiding this comment.
This looks quite a bit different to the spec implementation.
I'm going to ignore for now while rate limiting not decided but the spec implementation I think is better with:
- TTL
- batched HMGET read
structured return- better commenting
- bunch of other minor algorithm things
There was a problem hiding this comment.
As with the admit script (1 to come back to - https://github.com/NHSDigital/nhs-notify-client-callbacks/pull/145/changes#r3093774270) but I think this is worse than the spec impl.
| admitSha = computeSha1(admitLuaSrc); | ||
| } | ||
|
|
||
| try { |
There was a problem hiding this comment.
Can we refactor this into a evalScript function as both admit and recordResult do similar things (EVALSHA, NOSCRIPT error handling)
There was a problem hiding this comment.
We should have a TTL for the cache.
Not sure why it hasn't just copied the file from the transform filter lambda.
| }); | ||
| } | ||
|
|
||
| logger.info("Mock webhook invoked", { |
There was a problem hiding this comment.
We should log out whether the cert was present so we can assert mTLS vs non mTLS clients are behaving correctly in the integration tests (by parsing the logging as we do for other stuff).
| const pem = decodeURIComponent(certHeader); | ||
| const cert = new X509Certificate(pem); | ||
| const now = new Date(); | ||
| if (now < new Date(cert.validFrom) || now > new Date(cert.validTo)) { |
There was a problem hiding this comment.
Do we actually care about this? Its not as though we are checking the cert is even trusted/revoked/etc?
Why check expiry?
| headerName: string; | ||
| headerValue: string; | ||
| }; | ||
| mtls: { |
There was a problem hiding this comment.
Structure is a bit weird here - aren't mtls and certPinning delivery concerns too?
The impl matches the spec but i think we should go back and change it to:
delivery?: {
maxRetryDurationSeconds?: number; // per-target retry window; default 7200 (2 h)
circuitBreaker?: {
enabled: boolean; // default true
};
mtls?: {
enabled: boolean; // present client cert; default true
certPinning?: {
enabled: boolean; // pin server cert by SPKI hash; default true
spkiHash?: string; // base64 SHA-256 of server cert SPKI
// absent when pinning disabled or hash not yet configured
};
};
};
| certPinning: certPinningSchema, | ||
| delivery: z | ||
| .object({ | ||
| maxRetryDurationSeconds: z.number().min(60).max(43_200).optional(), |
There was a problem hiding this comment.
| maxRetryDurationSeconds: z.number().min(60).max(43_200).optional(), | |
| maxRetryDurationSeconds: z.number().min(60).max(43_200).optional(), // 12 hours |
| warnings.push("mTLS is enabled but certificate pinning is disabled"); | ||
| } | ||
|
|
||
| if (certPinning.enabled && !certPinning.spkiHash) { |
There was a problem hiding this comment.
I think this has come from the spec but the schema validation in the models package will fail in the http-lambda if pinning is on and there is no hash.
I think this should be an error and the user has to set the hash 1st then turn on pinning.
| ...commonOptions, | ||
| ...clientIdOption, | ||
| ...writeOptions, | ||
| "target-id": { |
There was a problem hiding this comment.
This target-id block is duplicated 3 times.
We should add targetIdOption to tools/client-subscriptions-management/src/entrypoint/cli/helper.ts like we do for the other options and also consume it in target-del which also has it (albeit with slightly different wording that can change).
| if (!config) { | ||
| throw new Error(`No configuration found for client: ${argv["client-id"]}`); | ||
| } | ||
|
|
||
| const target = config.targets.find((t) => t.targetId === argv["target-id"]); | ||
|
|
||
| if (!target) { | ||
| throw new Error( | ||
| `Target '${argv["target-id"]}' not found for client '${argv["client-id"]}'`, | ||
| ); | ||
| } |
There was a problem hiding this comment.
We've got this code pattern repeated on a lot of the targets (get client, throw if no client, get target, throw if no target).
We need the full config so can't refactor all of it away and in some places we expect no config/target.
Could we at least have requireClientConfig, requireTargetConfig.
Its minor but theres a lot of code in this tool so every bit helps
| description: "Enable mTLS for this target", | ||
| conflicts: "disable", | ||
| }, | ||
| disable: { |
There was a problem hiding this comment.
In yargs for booleans you can do --enable and --no-enable which we could make use of (https://yargs.js.org/docs/#api-reference-booleankey) which I think would simplify things (the conflict check and enabled declaration at L54)
| demandOption: true, | ||
| description: "Target identifier to update", | ||
| }, | ||
| enable: { |
There was a problem hiding this comment.
As per my other comment - using yargs built in --no-enable for this would simplify things.
| } | ||
|
|
||
| // Safe as this is an internal tool and this CLI option we are expecting the user will run locally and manually | ||
| // eslint-disable-next-line security/detect-non-literal-fs-filename |
There was a problem hiding this comment.
| // eslint-disable-next-line security/detect-non-literal-fs-filename | |
| // eslint-disable-next-line security/detect-non-literal-fs-filename -- path is provided directly by the operator via CLI arg |
Not sure why the comment was removed but we've added a similar disclaimer in target-set-certificate.
| export type DeliveryResult = | ||
| | { ok: true } | ||
| | { ok: false; permanent: true } | ||
| | { | ||
| ok: false; | ||
| permanent: false; | ||
| statusCode: 429; | ||
| retryAfterHeader: string | undefined; | ||
| } | ||
| | { ok: false; permanent: false; statusCode: number }; |
There was a problem hiding this comment.
| export type DeliveryResult = | |
| | { ok: true } | |
| | { ok: false; permanent: true } | |
| | { | |
| ok: false; | |
| permanent: false; | |
| statusCode: 429; | |
| retryAfterHeader: string | undefined; | |
| } | |
| | { ok: false; permanent: false; statusCode: number }; | |
| export type DeliveryResult = | |
| | { outcome: "success" } | |
| | { outcome: "permanent_failure" } | |
| | { outcome: "rate_limited"; retryAfterHeader: string | undefined } | |
| | { outcome: "transient_failure"; statusCode: number }; |
There was a problem hiding this comment.
All looks very promising - but where are the tests?!
Happy to leave them out till we get something reasonable looking deployed.
Description
Context
Type of changes
Checklist
Sensitive Information Declaration
To ensure the utmost confidentiality and protect your and others privacy, we kindly ask you to NOT including PII (Personal Identifiable Information) / PID (Personal Identifiable Data) or any other sensitive data in this PR (Pull Request) and the codebase changes. We will remove any PR that do contain any sensitive information. We really appreciate your cooperation in this matter.