Skip to content

Retry KMS requests on transient errors#1953

Draft
rozza wants to merge 8 commits into
mongodb:mainfrom
rozza:JAVA-5391
Draft

Retry KMS requests on transient errors#1953
rozza wants to merge 8 commits into
mongodb:mainfrom
rozza:JAVA-5391

Conversation

@rozza
Copy link
Copy Markdown
Member

@rozza rozza commented Apr 29, 2026

Add libmongocrypt CAPI bindings for KMS retry support and wire retry logic through the sync and reactive driver stacks. Transient KMS HTTP and network errors are retried with backoff delays managed by libmongocrypt; retry is enabled unconditionally.

  • Add native bindings: mongocrypt_setopt_retry_kms, mongocrypt_kms_ctx_usleep, mongocrypt_kms_ctx_feed_with_retry, mongocrypt_kms_ctx_fail
  • Add sleepMicroseconds(), feedAndRetry(), fail() to MongoKeyDecryptor
  • Enable KMS retry unconditionally in MongoCryptImpl
  • Rewrite sync Crypt.decryptKey() with retry loop, timeout-aware
  • Add retry logic to reactive KeyManagementService.decryptKey()
  • Fix TlsChannelImpl.read() to preserve bytes delivered alongside close_notify (already fixed upstream in marianobarrios/tls-channel)
  • Add spec Section 24 KMS retry integration tests (sync + reactive)
  • Add Evergreen CI task for KMS retry tests

JAVA-5391

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

Add libmongocrypt CAPI bindings for KMS retry support and wire retry
logic through the sync and reactive driver stacks. Transient KMS HTTP
and network errors are retried with backoff delays managed by
libmongocrypt; retry is enabled unconditionally.

- Add native bindings: mongocrypt_setopt_retry_kms,
  mongocrypt_kms_ctx_usleep, mongocrypt_kms_ctx_feed_with_retry,
  mongocrypt_kms_ctx_fail
- Add sleepMicroseconds(), feedAndRetry(), fail() to MongoKeyDecryptor
- Enable KMS retry unconditionally in MongoCryptImpl
- Rewrite sync Crypt.decryptKey() with retry loop, timeout-aware
- Add retry logic to reactive KeyManagementService.decryptKey()
- Fix TlsChannelImpl.read() to preserve bytes delivered alongside
  close_notify (already fixed upstream in marianobarrios/tls-channel)
- Add spec Section 24 KMS retry integration tests (sync + reactive)
- Add Evergreen CI task for KMS retry tests

JAVA-5391
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds libmongocrypt-backed retry support for KMS requests and wires it through the sync and reactive driver encryption flows, including a small TLS-channel fix needed for correct KMS response handling and new CI coverage for the retry prose tests.

Changes:

  • Introduce new libmongocrypt CAPI/JNA bindings and surface them via MongoKeyDecryptor to drive sleep/backoff and retry decisions.
  • Implement retry loops in sync Crypt.decryptKey() and reactive KeyManagementService.decryptKey() using libmongocrypt’s retry signals and operation-timeout awareness.
  • Add KMS retry prose tests (sync + reactive) and an Evergreen task/script to run them.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
mongodb-crypt/src/main/com/mongodb/internal/crypt/capi/MongoKeyDecryptorImpl.java Implements new retry-related native calls (usleep, feed_with_retry, fail) on the KMS ctx wrapper.
mongodb-crypt/src/main/com/mongodb/internal/crypt/capi/MongoKeyDecryptor.java Extends decryptor API with retry hooks and a default initial KMS read size constant.
mongodb-crypt/src/main/com/mongodb/internal/crypt/capi/MongoCryptImpl.java Enables KMS retry option in libmongocrypt during initialization.
mongodb-crypt/src/main/com/mongodb/internal/crypt/capi/CAPI.java Adds JNA native bindings for KMS retry APIs.
driver-sync/src/main/com/mongodb/client/internal/Crypt.java Reworks KMS decryption to loop retries with backoff and timeout checks.
driver-reactive-streams/src/main/com/mongodb/reactivestreams/client/internal/crypt/KeyManagementService.java Adds reactive retry flow with libmongocrypt-provided delay and retry decisions.
driver-core/src/main/com/mongodb/internal/connection/tlschannel/impl/TlsChannelImpl.java Preserves bytes produced alongside TLS close_notify instead of immediately returning -1.
driver-sync/src/test/functional/com/mongodb/client/AbstractClientSideEncryptionKmsRetryProseTest.java Adds shared prose tests for KMS retry behaviors (TCP/HTTP retry, exhausted retries, timeout mid-retry).
driver-sync/src/test/functional/com/mongodb/client/ClientSideEncryptionKmsRetryProseTest.java Sync concrete test wiring for the shared retry prose tests.
driver-reactive-streams/src/test/functional/com/mongodb/reactivestreams/client/ClientSideEncryptionKmsRetryProseTest.java Reactive concrete test wiring (via sync adapter) for the shared retry prose tests.
.evergreen/run-kms-retry-tests.sh Adds CI script to run sync + reactive KMS retry prose tests with required trust material.
.evergreen/.evg.yml Adds Evergreen function/task/buildvariant wiring to execute the KMS retry test script.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread .evergreen/run-kms-retry-tests.sh
@rozza rozza marked this pull request as ready for review May 26, 2026 13:58
@rozza rozza requested a review from a team as a code owner May 26, 2026 13:58
@rozza rozza requested a review from strogiyotec May 26, 2026 13:58
@codeowners-service-app
Copy link
Copy Markdown

Assigned stIncMale for team dbx-java because strogiyotec is out of office.

@stIncMale stIncMale removed the request for review from strogiyotec May 29, 2026 21:38
Copy link
Copy Markdown
Member

@stIncMale stIncMale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some questions and suggestions after a partial review (the smaller part of it).

* @param crypt The @ref mongocrypt_t object to update
* @param enable Whether to enable KMS retry
* @return A boolean indicating success. If false, an error status is set.
* @since 5.8
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we specify @since on an internal API? (the same applies to MongoKeyDecryptor) I fail to see an answer myself.

I see that some existing program elements in CAPI have @since specified, while some other do not (like mongocrypt_setopt_enable_multiple_collinfo, which is relatively new).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just convention I suppose and I suppose its slightly informative - we don't publish the javadocs either yet write them.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolving as its a Nit / question and not worth changing

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ended up removing all @since tags here - as you mentioned the usage was mixed. Also its really tied to the libmongocrypt version.

*
* @param crypt The @ref mongocrypt_t object to update
* @param enable Whether to enable KMS retry
* @return A boolean indicating success. If false, an error status is set.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add the

Retrieve it with @ref mongocrypt_ctx_status

sentence from the documentation in mongocrypt.h to the @return text. It is useful, and we seem to have it in the documentation of other methods. Though it should be formatted properly according to the Java documentation comment rules.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated all Javadocs to bring inline with mongocrypt.h as of 1.18.1

* Opt-into handling the MONGOCRYPT_CTX_NEED_KMS state with retry logic.
*
* <p>If opted in, KMS requests will include retry information accessible via
* {@link #mongocrypt_kms_ctx_usleep}, {@link #mongocrypt_kms_ctx_feed_with_retry},
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thus, we do not have to use mongocrypt_kms_ctx_feed_with_retry. If we decide to introduce this method and use it, then we should
also

  1. fully implement https://jira.mongodb.org/browse/MONGOCRYPT-763 in this PR (it seems, it mostly was implemented?);
  2. specify JAVA-5772 in addition to JAVA-5391 in the description of the current PR
  3. resolve https://jira.mongodb.org/browse/JAVA-5772 (split from https://jira.mongodb.org/browse/MONGOCRYPT-763) once this PR is merged.

I wonder how the PR ended up using a method from a different change, why nothing was explicitly said about that to help orienting a reviewer, and what was the plan for JAVA-5772 given that this PR implements most of what should be done in JAVA-5772.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - Let me review JAVA-5772 and ensure all is covered there. I followed integrating.md but that most likely was the latest version and effectively have done both tickets.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed this includes JAVA-5772. I didn't see a drivers ticket for it and missed it completely. Not that its very clear either.

Comment on lines +1196 to +1210
/**
* Feed bytes from the HTTP response, with retry support.
*
* <p>Requires {@link #mongocrypt_setopt_retry_kms} to be enabled.
*
* @param kms The @ref mongocrypt_kms_ctx_t.
* @param bytes The bytes to feed.
* @param should_retry Receives whether the driver should retry the KMS request.
* @return A boolean indicating success.
* @since 5.8
*/
public static native boolean
mongocrypt_kms_ctx_feed_with_retry(mongocrypt_kms_ctx_t kms,
mongocrypt_binary_t bytes,
ByteByReference should_retry);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason the documentation of this method in the PR omits parts of the documentation from mongocrypt.h. Let's bring them all here, unless there is a good reason to omit any specific part.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated all javadocs.

Comment on lines +239 to +244
bytesToReturn = res.bytesProduced;
if (res.wasClosed) {
return -1;
// JAVA-5391: return any bytes produced alongside close_notify; the next read
// sees shutdownReceived and returns -1. Fixed in upstream marianobarrios/tls-channel;
// this is the minimal patch until the vendored snapshot is refreshed.
return bytesToReturn > 0 ? bytesToReturn : -1;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please

  • Either explain the problem in this GitHub thread, or link to an existing explanation / bug report.
    • Also explain why the fix is required for the current PR.
  • Share a link to the upstream PR/commit with the fix.

I think, given that this change is in a vendored code, we should do it in a separate PR, to avoid complicating the future work of updating the vendored code.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it was causing the tests to fail (I can't recall which one) but it was not returning bytes that needed to be read. Most likely the mock kms server.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes it was:

ClientSideEncryptionKmsRetryProseTest > testCreateDataKeyAndEncryptWithHttpRetry(String) > Case 2: HTTP retry with aws FAILED

If causes: MongoClientException: Exception in encryption library: Allocation size must be greater than zero.

It doesn't happen on sync because it uses SSLSocket/InputStream, which correctly delivers data before signalling EOF on close_notify.

It isn't triggered by gcp or azure because the failpoint fires on the OAuth token request. So the actual key operation uses another connection and succeeds. As AWS doesn't use a token, and fails with data unconsumed on the wire.

According to Claude:

The TLS spec allows close_notify to arrive in the same flight as application data. > Any real-world TLS server can legitimately:

  • Send Connection: close and the response in one write (common for HTTP/1.0 or error responses)
  • Send a response and immediately initiate TLS shutdown
  • Have the OS coalesce the response and close_notify into a single TCP segment due to Nagle's algorithm or buffering

TLSChannelImpl fixed this behaviour in 0.5.0

}

@Override
public boolean feedAndRetry(final ByteBuffer bytes) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current name feedAndRetry suggests that the method also retries, which is not true. Let's find a name that is less misleading. Maybe feedWithRetry (as is in CAPI), or feedWithRetrySupport?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines +27 to +34
/**
* Initial read size after re-sending a KMS request. Matches libmongocrypt's DEFAULT_MAX_KMS_BYTE_REQUEST
* and is used when {@link #bytesNeeded()} still returns 0 because libmongocrypt's should_retry flag has
* not yet been cleared by {@link #feedAndRetry}.
*
* @since 5.8
*/
int DEFAULT_KMS_READ_SIZE = 1024;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does this constant and the rules accompanying it come from? It seem to be in conflict with both the current integration.md instructions and the API documentation of the methods mongocrypt_kms_ctx_bytes_needed, mongocrypt_kms_ctx_feed_with_retry.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per the comment above - it matches DEFAULT_MAX_KMS_BYTE_REQUEST.

The fail() function resets the parser and sets should_retry = true, which deliberately makes bytes_needed return 0. This creates an awkward contract: the driver must somehow know to do a read even though bytes_needed says "I don't need anything."

The integrating guide doesn't document this gap — it just says "loop while bytes_needed > 0" and "call fail() on network error, retry if it returns true," without acknowledging that those two instructions are incompatible without a fallback.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I asked Claude to compare my implementation to that of the python binding. Seems I over complicated it (or shortcutted it) and I can just revisit the State machine loop. Will update.

mongocrypt_setopt_bypass_query_analysis (mongocrypt_t crypt);

/**
* Opt-into handling the MONGOCRYPT_CTX_NEED_KMS state with retry logic.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make MONGOCRYPT_CTX_NEED_KMS a @link to the corresponding program element.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated all javadocs

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Let's update the documentation of the mongocrypt_kms_ctx_bytes_needed method that existed before this PR, such that it mentions not only to the mongocrypt_kms_ctx_feed method, but also to the mongocrypt_kms_ctx_feed_with_retry method.
  • Let's express those notions using @link.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated all javadocs

Comment on lines +1212 to +1222
/**
* Signal to libmongocrypt that a network error occurred on this KMS request.
*
* <p>Requires {@link #mongocrypt_setopt_retry_kms} to be enabled.
*
* @param kms The @ref mongocrypt_kms_ctx_t.
* @return True if the request should be retried, false if retries are exhausted.
* @since 5.8
*/
public static native boolean
mongocrypt_kms_ctx_fail(mongocrypt_kms_ctx_t kms);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True if the request should be retried

The documentation of the native method says "may be retried", not "should", which is different. Let's make our documentation consistent with that of the native API, and update the documentation of MongoKeyDecryptor.fail accordingly.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've got Claude to update them all - as most are very out of date.

I've also added JAVA-6225 to remove any unused statics / methods that either we never or no longer implement

Copy link
Copy Markdown
Member

@stIncMale stIncMale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am pausing the review until we address the somewhat structurally important concerns:

Comment on lines +49 to 54
/**
* Note: Not thread-safe: methods mutate the underlying native {@code mongocrypt_kms_ctx_t} and must be invoked serially.
* Callers perform retries sequentially — the sync driver in a {@code while} loop and the reactive driver via
* {@code Mono.flatMap} — so no external synchronization is required.
*/
class MongoKeyDecryptorImpl implements MongoKeyDecryptor {
Copy link
Copy Markdown
Member

@stIncMale stIncMale Jun 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. We have @NotThreadSafe for this, let's use it.
    • Also, given that for some non-obvious reason (do you know what it is?) we use MongoKeyDecryptorImpl via the MongoKeyDecryptor interface, we should document the interface as non-thread-safe.
  2. If you want to document explicitly that the wrapped mongocrypt_kms_ctx_t is not thread-safe, then let's do that in the documentation of that type, not here.
  3. The "methods mutate the underlying..." text is not needed - it does not add anything useful to the "not thread-safe" information, nor does it on its own imply non-thread-safety.
  4. "Callers perform retries sequentially ..." - all that information may be a comment on the calling code, if it's really needed there. I think, the documentation of MongoKeyDecryptorImpl is a wrong place for the comments for the code that uses the type.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opted to remove it. Unfortunately, @NotThreadSafe isn't available to use here.

Comment thread .evergreen/.evg.yml
@rozza rozza marked this pull request as draft June 3, 2026 10:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants