operator: KBS API for LUKS key registration#248
Conversation
Jakob-Naucke
left a comment
There was a problem hiding this comment.
Thank you for creating this, and thank you for already creating a test. Make sure that linting & build/unit tests pass.
| pub async fn launch_trustee_sync_controller(client: Client) { | ||
| let deployments: Api<Deployment> = Api::default_namespaced(client.clone()); | ||
| let watcher_config = watcher::Config { | ||
| label_selector: Some("app=kbs".to_string()), |
There was a problem hiding this comment.
nit: maybe a constant also used in generate_kbs_{service,deployment} is better
| use serde::{Serialize, Serializer}; | ||
| use serde_json::{Value::String as JsonString, json}; | ||
| use std::collections::BTreeMap; | ||
| use kbs_client; |
There was a problem hiding this comment.
Is this going to be https://github.com/confidential-containers/trustee/tree/main/tools/kbs-client?
dc726d8 to
c5ec81d
Compare
|
The CI is failing because we need to have some Perl dependency. I solved this by adding this |
|
@iroykaufman these tests run in (Ubuntu) GHA containers that aren't affected by the Containerfile (this would get picked up in integration tests once I manage to fix them). You'll need to add them to the steps in steps:
- name: "Install OpenSSL dependencies"
run: apt-get install -y … |
|
@iroykaufman whoops, you can set your own build container, and we do, and it's a Fedora container, so it should be |
|
@iroykaufman thanks for the updates. integration tests are failing because these packages are missing there too. in my intuition, those shouldn't need to be built on the host at all so let me look into that first. e: before the next push you can also still look into the lint failures |
Thanks, let me know and then I'll push with the lint fix |
|
@Jakob-Naucke maybe update the container file in buildroot with the openssl dependency is better then adding them in the github workflow files. WDYT? |
Yes. I got confused by first thinking we were on GHA's Ubuntu containers, but the buildroot is the better place. |
|
and the CI host does need these to compile the integration tests, which is slightly not how I thought cargo dependencies worked, but alas, these packages are now installed |
|
I deleted the changes for the github workflow and opened PR#9. When this is merged, the CI should work |
|
@sourcery-ai review |
Reviewer's GuideImplements KBS API–based LUKS key management so trustee no longer needs its deployment patched on each key update, introduces an auth keypair/secret for KBS, adds a controller to resync machine LUKS keys when trustee pods restart, switches KBS storage to kvstorage, and adds an end‑to‑end test covering LUKS key sync and deletion. Sequence diagram for LUKS key registration via KBS APIsequenceDiagram
actor User as ClusterOperator
participant K8s as KubernetesAPI
participant RegSrv as RegisterServerController
participant Trustee as TrusteeModule
participant KBS as KBSService
User->>K8s: Create/Update Machine
K8s-->>RegSrv: Machine event
RegSrv->>RegSrv: keygen_reconcile
RegSrv->>Trustee: generate_secret(client, id, owner_reference)
Trustee->>K8s: Create Secret id (root key)
K8s-->>Trustee: Secret id created
RegSrv->>Trustee: send_secret(client, id)
Trustee->>K8s: Get Secret id
K8s-->>Trustee: Secret id (root data)
Trustee->>K8s: Get Secret trustee-auth
K8s-->>Trustee: Secret trustee-auth (private.key)
Trustee->>KBS: set_resource(url, auth_key, resource_bytes, path, [])
KBS-->>Trustee: 201 Created
Trustee-->>RegSrv: Ok
RegSrv-->>K8s: Reconcile result Action::await_change()
Sequence diagram for LUKS key resync on trustee restartsequenceDiagram
participant K8s as KubernetesAPI
participant SyncCtl as TrusteeSyncController
participant TrusteeDep as TrusteeDeployment
participant Trustee as TrusteeModule
participant KBS as KBSService
Note over TrusteeDep,K8s: TrusteeDeployment restarted
K8s-->>SyncCtl: Deployment event (label app=kbs)
SyncCtl->>SyncCtl: trustee_deployment_reconcile
SyncCtl->>TrusteeDep: Read status (ready_replicas, replicas)
TrusteeDep-->>SyncCtl: Status ready
SyncCtl->>Trustee: sync_all_machine_luks_key(client)
Trustee->>K8s: List Secrets (default namespace)
K8s-->>Trustee: Secrets owned by Machines
loop For each machine Secret id
Trustee->>Trustee: send_secret(client, id)
Trustee->>K8s: Get Secret id
K8s-->>Trustee: Secret id (root data)
Trustee->>K8s: Get Secret trustee-auth
K8s-->>Trustee: Secret trustee-auth (private.key)
Trustee->>KBS: set_resource(url, auth_key, resource_bytes, path, [])
KBS-->>Trustee: Response
end
Trustee-->>SyncCtl: Ok
SyncCtl-->>K8s: Action::await_change()
Sequence diagram for LUKS key deletion via KBS APIsequenceDiagram
participant K8s as KubernetesAPI
participant RegSrv as RegisterServerController
participant Trustee as TrusteeModule
participant KBS as KBSService
K8s-->>RegSrv: Machine deletion event
RegSrv->>RegSrv: keygen_reconcile finalizer
RegSrv->>Trustee: delete_secret(client, id)
Trustee->>K8s: Get Secret trustee-auth
K8s-->>Trustee: Secret trustee-auth (private.key)
Trustee->>KBS: delete_resource(url, auth_key, path, [])
KBS-->>Trustee: 200 OK
Trustee-->>RegSrv: Ok
RegSrv-->>K8s: Finalizer completed (Action::await_change())
Class diagram for new Ed25519 key pair and trustee auth secret generationclassDiagram
class Ed25519KeyPair {
+Vec~u8~ private_key_pem
+Vec~u8~ public_key_pem
}
class TrusteeModule {
+generate_ed25519_key_pair() Result~Ed25519KeyPair~
+generate_trustee_auth_keys_secret(client, owner_reference) Result~()~
+get_auth_key(client) Result~String~
+send_secret(client, id) Result~()~
+delete_secret(client, id) Result~()~
+sync_all_machine_luks_key(client) Result~()~
+launch_trustee_sync_controller(client) void
}
TrusteeModule --> Ed25519KeyPair : creates
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've found 2 issues, and left some high level feedback:
- In
launch_trustee_sync_controller, the watcherlabel_selectoris set as"app={KBS_LABEL_SELECTOR}", which will not expand the constant and thus never match your KBS pods; this should be built withformat!("app={KBS_LABEL_SELECTOR}")or similar. - The new
test_luks_key_syncrelies on grepping operator logs for specific hard-coded message substrings (including the exact count"Syncing 2 machine luks key to KBS"), which makes the test brittle to benign log or wording changes; consider asserting behavior via the KBS state or Kubernetes resources instead of log contents where possible.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `launch_trustee_sync_controller`, the watcher `label_selector` is set as `"app={KBS_LABEL_SELECTOR}"`, which will not expand the constant and thus never match your KBS pods; this should be built with `format!("app={KBS_LABEL_SELECTOR}")` or similar.
- The new `test_luks_key_sync` relies on grepping operator logs for specific hard-coded message substrings (including the exact count `"Syncing 2 machine luks key to KBS"`), which makes the test brittle to benign log or wording changes; consider asserting behavior via the KBS state or Kubernetes resources instead of log contents where possible.
## Individual Comments
### Comment 1
<location path="operator/src/trustee.rs" line_range="244" />
<code_context>
+pub async fn launch_trustee_sync_controller(client: Client) {
+ let deployments: Api<Deployment> = Api::default_namespaced(client.clone());
+ let watcher_config = watcher::Config {
+ label_selector: Some("app={KBS_LABEL_SELECTOR}".to_string()),
+ ..Default::default()
+ };
</code_context>
<issue_to_address>
**issue (bug_risk):** The label selector string is not interpolating `KBS_LABEL_SELECTOR` and will not match any pods.
Because the selector is the literal string `"app={KBS_LABEL_SELECTOR}"`, the controller will never match any `kbs`-labelled resources. Build the selector string dynamically (e.g. `format!("app={}", KBS_LABEL_SELECTOR)` or a const with the expanded value) so it actually aligns with the deployment and service labels.
</issue_to_address>
### Comment 2
<location path="tests/trusted_execution_cluster.rs" line_range="604-613" />
<code_context>
+ // Delete machine1 and verify its secret is removed from both K8s and KBS
</code_context>
<issue_to_address>
**issue (testing):** Also assert that the machine Secret is deleted from Kubernetes, not only from KBS logs
Right now the test only checks the operator logs for `"Secret {id1} deleted successfully"`, which covers the KBS side. To match the comment and fully validate behavior, please also assert that the corresponding Kubernetes Secret no longer exists (for example, `secrets_api.get(&machine1_uuid).await` returns `NOT_FOUND` or fails via the Poller).
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
a751ea3 to
eeb4005
Compare
Jakob-Naucke
left a comment
There was a problem hiding this comment.
- If changes to the base branch made a rebase necessary, the usual practice is to fix and keep the existing commits, so I think you can squash in this instance.
- Pushing new buildroots got auto-disabled due to inactivity (the more you know). I re-enabled it so FindBin etc. are hopefully there on next attempt.
|
|
||
| [admin] | ||
| type = "DenyAll" | ||
| type = "Simple" |
There was a problem hiding this comment.
Sorry about this one, didn't see it coming but you fixed it :)
This is still WIP so I'll squash it into one commit after everything is fixed. |
9c68c4b to
7dd60c7
Compare
|
@iroykaufman can you please split the git history into multiple commits to make the review easier |
| ) | ||
| .await?; | ||
|
|
||
| test_ctx.wait_for_deployment_ready(&deployments, TRUSTEE_DEPLOYMENT, 120).await?; |
There was a problem hiding this comment.
Instead of polling here, can we replace it with await_condition?
Something like:
let is_available = |obj: Option<&Deployment>| -> bool {
obj.and_then(|d| d.status.as_ref())
.map(|r| r >= 1)
.unwrap_or(false)
};
let ready = await_condition(deployments.clone(), TRUSTEE_DEPLOYMENT, is_available);
timeout(Duration::from_secs(120), ready).await.context("waiting for trustee to be ready")??;There was a problem hiding this comment.
@Jakob-Naucke probably we could replace several polling operation with that. WDYT?
There was a problem hiding this comment.
Don't we need to look for a condition with ( type==Available && status == "True" ) ?
Alternatively we can check for r.available_replicas >= 1
There was a problem hiding this comment.
The poller here is inside the wait_for_deployment_ready this is what you mean?
There was a problem hiding this comment.
yes, I would completely remove the wait_for_deployment_ready
There was a problem hiding this comment.
Remove wait_for_deployment_ready and instead use await_condition? because I have to check that the object is ready.
There was a problem hiding this comment.
Yes, à la
operator/test_utils/src/lib.rs
Line 925 in 103fa3f
There was a problem hiding this comment.
@alicefr @Jakob-Naucke it is fine to just change wait_for_deployment_ready to use await_condition insted of polling?
There was a problem hiding this comment.
yes, most preferably in a commit before the other ones, create the function and start using it from apply_cr_manifests, and then use it like you're already doing
| .with_interval(Duration::from_millis(500)) | ||
| .with_error_message("AttestationKeys not approved with secrets".to_string()); | ||
|
|
||
| poller |
There was a problem hiding this comment.
This probably can be refactor with the await_condition
There was a problem hiding this comment.
and the same above using the polling
There was a problem hiding this comment.
The polling where the condition uses the pod logs can't be replaced with await_condition. There are two poller that doesn't use logs so I'll refactor them.
There was a problem hiding this comment.
we shouldn't relay on logs for validating the test. That also needs to be changed
There was a problem hiding this comment.
Instead of checking the logs we could exec in the pod and verify that the secret is gone if that's really necessary. It seems to me more reliable.
There was a problem hiding this comment.
Does trustee save the secrets and AKs in its filesystem since the API introduction?
There was a problem hiding this comment.
With the current kbs config the storage backend is configured to be on the filesystem.
| secret | ||
| .data | ||
| .as_ref() | ||
| .and_then(|d| String::from_utf8(d.get("public_key").unwrap().0.clone()).ok()) |
There was a problem hiding this comment.
what if the secret doesn't have the field public_key, this will panic:
| .and_then(|d| String::from_utf8(d.get("public_key").unwrap().0.clone()).ok()) | |
| .filter_map(|secret| { | |
| secret | |
| .data | |
| .as_ref() | |
| .and_then(|d| d.get("public_key")) | |
| .and_then(|pk| String::from_utf8(pk.0.clone()).ok()) | |
| }) |
| match reference_values::create_pcrs_config_map(client.clone()).await { | ||
| Ok(_) => info!("Created bare configmap for PCRs"), | ||
| Err(e) => error!("Failed to create the PCRs configmap: {e}"), | ||
| } |
There was a problem hiding this comment.
This changes the previous behavior where we were returning an error
| match trustee::generate_trustee_auth_keys_secret(client.clone(), owner_reference.clone()).await | ||
| { | ||
| Ok(_) => info!("Generate auth keys for the KBS API",), | ||
| Err(e) => error!("Failed to create the auth keys: {e}"), | ||
| } | ||
|
|
||
| match trustee::generate_rv_data(client.clone(), owner_reference.clone()).await { | ||
| Ok(_) => info!("Created configmap for reference values"), | ||
| Err(e) => error!("Failed to create the reference values configmap: {e}"), | ||
| } |
There was a problem hiding this comment.
Should here return an error instead of just printing it?
There was a problem hiding this comment.
👍 I know it wasn't out when this was raised but we changed this in #285
There was a problem hiding this comment.
Please fix similarly the call to generate_trustee_auth_keys_secret() above
| let c = Arc::unwrap_or_clone(client.clone()); | ||
| if let Err(e) = sync_resource_policy(c.clone()).await { | ||
| warn!("Failed to sync resource policy to KBS: {e}"); | ||
| } | ||
| if let Err(e) = sync_attestation_policy(c.clone()).await { | ||
| warn!("Failed to sync attestation policy to KBS: {e}"); | ||
| } | ||
| if let Err(e) = sync_reference_values_from_configmap(&c).await { | ||
| warn!("Failed to sync reference values to KBS: {e}"); | ||
| } |
There was a problem hiding this comment.
why if those fails we only print a warning and not return an error?
| .into_iter() | ||
| pub async fn update_attestation_keys(client: Client) -> Result<()> { | ||
| let secrets: Api<Secret> = Api::default_namespaced(client.clone()); | ||
| let secret_list = secrets.list(&Default::default()).await?; |
There was a problem hiding this comment.
This doesn't use the secret store any longer. This was introduced by #267 to avoid direct APi calls to the API server
| } | ||
|
|
||
| #[tokio::test] | ||
| async fn test_adopt_approved_images() { |
There was a problem hiding this comment.
Why those tests don't apply any longer?
There was a problem hiding this comment.
Because we no longer need the function adopt_approved_images, we only use and test (above) adopt_approved_image
| let secret_api: Api<Secret> = Api::default_namespaced(client.clone()); | ||
| let auth_secret = secret_api.get(TRUSTEE_AUTH_SECRET).await?; |
There was a problem hiding this comment.
Ideally, this should use the secret store and not calling the API server directly
| let secret_api: Api<Secret> = Api::default_namespaced(client.clone()); | ||
|
|
There was a problem hiding this comment.
Same here, please use the secret store
|
@iroykaufman can you please also split the commit 774f99b into multiple It does too many things in one. I would at least split the bump of the kbs-client version and the configuration into a separate commit. Also I would separate the testing in its own commit. Then, it becomes easier to read |
|
I would also personally move the 512baf6 into a separate PR. It really solves a bug and it isn't part of this feature |
@iroykaufman was the problem that the kbs-client won't work as used here with 17? if it does work, I agree. if it doesn't, I prefer big commits over breaking bisects @alicefr |
| match trustee::generate_trustee_auth_keys_secret(client.clone(), owner_reference.clone()).await | ||
| { | ||
| Ok(_) => info!("Generate auth keys for the KBS API",), | ||
| Err(e) => error!("Failed to create the auth keys: {e}"), | ||
| } | ||
|
|
||
| match trustee::generate_rv_data(client.clone(), owner_reference.clone()).await { | ||
| Ok(_) => info!("Created configmap for reference values"), | ||
| Err(e) => error!("Failed to create the reference values configmap: {e}"), | ||
| } |
There was a problem hiding this comment.
Please fix similarly the call to generate_trustee_auth_keys_secret() above
| } | ||
|
|
||
| fn generate_ed25519_key_pair() -> Result<Ed25519KeyPair> { | ||
| let key = openssl::pkey::PKey::generate_ed25519()?; |
There was a problem hiding this comment.
For now it's OK, but we should later prefer PQC keys (if available) over Ed25519
| ) | ||
| .await?; | ||
|
|
||
| test_ctx.wait_for_deployment_ready(&deployments, TRUSTEE_DEPLOYMENT, 120).await?; |
There was a problem hiding this comment.
Don't we need to look for a condition with ( type==Available && status == "True" ) ?
Alternatively we can check for r.available_replicas >= 1
| kbs-client = {git = "https://github.com/confidential-containers/trustee.git", rev = "e65897a9ad4eb3ac69fa2ec75ed831200eb2acd7", default-features = false, features = ["native-tls"] } | ||
| jsonwebtoken = { version = "10.4.0", default-features = false, features = ["use_pem"] } | ||
| jsonwebtoken-openssl = "1.0.0" | ||
| jiff = "0.2.29" |
There was a problem hiding this comment.
nit: can we use chrono instead of jiff (for deletion_timestamp) -- it's already required.
Correct. When bumping to trustee v0.20.0, it didn't work as is because the trustee storage backend changed. The idea was to use the API as an abstraction layer that could make it much easier to bump trustee to a later version or change the storage backend in the future. |
|
@iroykaufman do you know where trustee stores the secrets and attestations keys after they have been configured via the API? I hope it stores it somewhere in its filesystem. If that's the case we need to persist them, otherwise if trustee restarts we are going to lose them |
|
@alicefr The current solution does not persist the trustee-data on the filesystem - when the Trustee pod restarts, the operator uploads all the keys (and other configuration) to the new pod. Currently Trustee does keep those on its file-system, but that may change in the future. For example there is an idea of storing the data in a database. |
valid comments have been raised and I would like to re-review and prevent an accidental merge
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: iroykaufman The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
But is the old behavior of using config map for secrets, AK and RV still supported by 0.20? Or did they suddenly move and only support adding the various objects via API? |
no, file system (NB it doesn't matter if it's a configmap) is no longer supported AFAICT, see confidential-containers/trustee#1190 |
Trustee v0.20.0 replaced file-based configuration with a KV storage backend that Trustee manages internally it reads and writes to its own storage directories (e.g. |
I didn't ask because I want to keep the old behavior, but if the mounting of the secrets and other objects still works, we could gradually add each component one by one in separate commits. In this way, we keep the git bisect working but the PR becomes a bit more readable |
I understand, initially I tried to do exactly that, but the file formats also changed (you can choose to use either JSON or individual files). If I remember correctly, we use both formats: JSON for reference values and individual files for secrets. Because of this, it will create some extra work that will need to be reverted later. Maybe I missed things but this is what I stumbled upon. |
|
Trustee v0.20.0 can still keep configuration files on its file-system, so it is possible to update to 0.20.0 first |
|
@uril @iroykaufman thanks for the explication! No, I don't want you to do useless extra work. Let's keep it in this way then. |
Currently, every time the LUKS key is updated, the operator patches the trustee deployment, which causes a restart of the pod. This PR introduces a way to avoid this by setting the LUKS key using the KBS API.
Core implementation points:
Tests:
Summary by Sourcery
Integrate KBS API–based LUKS key management for trustee, replacing deployment patching with authenticated API calls and adding automatic resync on trustee restarts.
New Features:
Enhancements:
Build:
Tests: