Docs/pipelines mlflow integration #263
- Narrow scope to Claude Code only; remove opencode and Codex CLI sections
- Add how to configure reasoning effort when starting the InferenceService (server-side --reasoning-effort flag and request-time override)
- Update Claude Code section with corrected proxy setup for LiteLLM and claude-code-router (config-driven, ccr code startup command)
- Qwen3.6 and Gemma 4 recommendations and Unsloth quantized model list already present; no change needed
The flag does not exist in vLLM. Replaced with accurate guidance about server-wide control via --chat-template and request-level parameters.
…/coding-agents-inference-service
- Remove list preceding code block to avoid remark-lint-code-block-split-list
- Replace Python dict literals with dict() constructor to avoid JSX parsing
…/pipelines-mlflow-integration
Walkthrough
Adds comprehensive documentation for integrating MLflow with Kubeflow (KFP) and the MLflow Python SDK using Kubernetes user identity tokens for authentication and RBAC. Includes a complete KFP pipeline example, Trainer v2 integration patterns, best practices, and troubleshooting. Also adds an e2e smoke-test script validating the identity-token-based integration and improves kubectl retry logic for transient failures.
Changes
MLflow Integration Guides
e2e Testing Infrastructure
Sequence Diagram(s)
sequenceDiagram
participant Developer
participant OAuthProxy
participant KubeAPI
participant MLflowServer
Developer->>OAuthProxy: Request with _oauth2_proxy cookie (interactive)
OAuthProxy->>MLflowServer: Forward request with identity
MLflowServer-->>Developer: Run/metrics under user identity
Developer->>KubeAPI: Exchange Dex refresh token for id_token (headless)
KubeAPI-->>Developer: JWT id_token
Developer->>OAuthProxy: Request with Authorization: Bearer {id_token}
OAuthProxy->>MLflowServer: Forward request with identity
MLflowServer-->>Developer: Run/metrics under token owner
sequenceDiagram
participant PipelineComponent
participant KubeAPI
participant MLflowServer
PipelineComponent->>KubeAPI: Query mlflow-tracking-server pod
KubeAPI-->>PipelineComponent: Pod location + proxy endpoint
PipelineComponent->>MLflowServer: MLflow REST call + X-Forwarded-Access-Token
MLflowServer->>MLflowServer: Derive owner from token claims
MLflowServer-->>PipelineComponent: Create/log run as component owner
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 4
🧹 Nitpick comments (1)
e2e/lib.sh (1)
171-171: ⚡ Quick win
Extract the duplicated error pattern to avoid maintenance burden.
The retryable error pattern is duplicated at lines 171 and 198. Consider extracting it to a shared constant or helper function to ensure consistency and reduce maintenance overhead when the pattern needs to be updated.
♻️ Proposed refactor to extract the pattern
+# Transient errors that warrant kubectl retry
+_KUBECTL_RETRY_PATTERN='failed calling webhook|x509|connection refused|EOF|context deadline exceeded|webhook.* connect: connection refused|failed to download openapi|openapi'
+
 # Run a kubectl verb (create / apply) reading YAML from stdin, retrying on
 # transient webhook TLS failures from the kubeflow-trainer cert-rotator.
 # Args: kctl_fn verb [extra-kubectl-args ...]
@@ -168,7 +170,7 @@
       return 0
     fi
     rc=$?
-    if ! echo "${out}" | grep -qE 'failed calling webhook|x509|connection refused|EOF|context deadline exceeded|webhook.* connect: connection refused|failed to download openapi|openapi'; then
+    if ! echo "${out}" | grep -qE "${_KUBECTL_RETRY_PATTERN}"; then
       printf '%s\n' "${out}" >&2
       return "${rc}"
     fi
@@ -195,7 +197,7 @@
       return 0
     fi
     rc=$?
-    if ! echo "${out}" | grep -qE 'failed calling webhook|x509|connection refused|EOF|context deadline exceeded|webhook.* connect: connection refused|failed to download openapi|openapi'; then
+    if ! echo "${out}" | grep -qE "${_KUBECTL_RETRY_PATTERN}"; then
       printf '%s\n' "${out}" >&2
       return "${rc}"
     fi
Also applies to: 198-198
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@e2e/lib.sh` at line 171, The retryable error pattern used in the grep command at line 171 is duplicated at line 198. Extract this pattern to a shared constant or helper function at the beginning of the file. Define a variable that contains the full error pattern string (including all the pipe-separated error messages like 'failed calling webhook|x509|connection refused|EOF|context deadline exceeded|webhook.* connect: connection refused|failed to download openapi|openapi'), then replace both occurrences of the duplicated grep pattern with references to this shared constant. This ensures consistency and makes future updates to the pattern require only a single change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/en/training_guides/pipelines-mlflow-integration.mdx`:
- Around line 153-169: The `client.get_run_id()` method does not exist in the
KFP SDK. After calling `create_run_from_pipeline_package()`, which returns a
RunPipelineResult object, access the run ID directly using the object's
attribute instead of calling a non-existent client method. Replace the line
containing `client.get_run_id(run.name)` with `run.run_id` to retrieve the run
ID from the returned run object.
In `@e2e/lib.sh`:
- Line 171: The bare `openapi` pattern in the grep condition is overly broad and
will match any error containing "openapi" as a substring, potentially treating
non-transient errors as retryable. Additionally, the pattern is case-sensitive
and won't match "OpenAPI" (capitalized). Replace the bare `openapi` pattern in
the grep regular expression with a more specific pattern such as using word
boundaries like `\bopenapi\b` to match only complete words, or use the more
specific pattern `'failed to download openapi'` if that is the specific error
you want to catch. This will ensure only relevant transient OpenAPI errors are
treated as retryable.
- Around line 186-208: The _retry_kubectl_stdin_novalidate function is defined
but has no callers and no public wrapper functions (unlike the base
_retry_kubectl_stdin which has retry_create and retry_apply wrappers). Determine
if this function is needed: if it was intended for future use or there are plans
to call it, add public wrapper functions (such as retry_apply_novalidate and
retry_create_novalidate) and integrate them at appropriate call sites in the
codebase; otherwise, remove the _retry_kubectl_stdin_novalidate function
definition entirely to keep the codebase clean.
- Line 164: The _retry_kubectl_stdin() function uses an excessive delay value of
120 seconds per retry attempt (max 20 attempts = up to 40 minutes total), which
significantly slows down the e2e test suite. Reduce the delay parameter to align
with the shorter retry parameters already used in
_retry_kubectl_stdin_novalidate() which uses delay=10 (50 seconds total). Update
the delay value in the local variable declaration at the start of
_retry_kubectl_stdin() to match the faster retry pattern, such as delay=10 or
similar, to accelerate the test suite without sacrificing resilience for
transient kubectl failures.
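The `openapi` finding in the inline comments above can be illustrated with a small Python sketch. It is not the shell code from `e2e/lib.sh`; it only re-uses the alternation quoted in the review with Python's `re` module, and both sample kubectl messages are invented for illustration.

```python
import re

# Pattern as quoted in the review, including the bare `openapi` alternative.
BROAD = (
    r"failed calling webhook|x509|connection refused|EOF|"
    r"context deadline exceeded|webhook.* connect: connection refused|"
    r"failed to download openapi|openapi"
)

# Narrowed variant: the bare `openapi` alternative is dropped.
NARROW = (
    r"failed calling webhook|x509|connection refused|EOF|"
    r"context deadline exceeded|webhook.* connect: connection refused|"
    r"failed to download openapi"
)


def is_retryable(output: str, pattern: str) -> bool:
    """Return True if the command output matches the transient-error pattern."""
    return re.search(pattern, output) is not None


# Hypothetical kubectl messages, invented for this example.
transient = "error: failed to download openapi: connection reset"
permanent = 'error validating data: unknown field "foo" in openapi schema'

if __name__ == "__main__":
    for label, pattern in (("broad", BROAD), ("narrow", NARROW)):
        print(label, is_retryable(transient, pattern), is_retryable(permanent, pattern))
```

With the broad pattern the validation error is treated as retryable; with the narrowed one only the download failure is. Neither pattern matches the capitalized spelling "OpenAPI", since the match is case-sensitive.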
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ec1a636f-c0f0-4e40-96c1-314c5fa68e79
📒 Files selected for processing (5)
- docs/en/agentic_mlops/index.mdx
- docs/en/training_guides/index.mdx
- docs/en/training_guides/pipelines-mlflow-integration.mdx
- docs/en/training_guides/qwen3_finetune_verify.ipynb
- e2e/lib.sh
   local data
   data="$(cat)"
-  local attempts=0 max=20 delay=30 rc out
+  local attempts=0 max=20 delay=120 rc out
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Search for related documentation or comments about webhook retry timing
rg -n -C5 'webhook|cert-rotator|retry.*delay' --type=sh --type=md
Repository: alauda/aml-docs
Length of output: 2650
🏁 Script executed:
sed -n '160,175p' e2e/lib.sh
Repository: alauda/aml-docs
Length of output: 652
🏁 Script executed:
git log --all -p --follow -S 'delay=120' -- e2e/lib.sh | head -100
Repository: alauda/aml-docs
Length of output: 3212
🏁 Script executed:
rg -n 'retry_apply|retry_create|_retry_kubectl_stdin' e2e/ --type=sh
Repository: alauda/aml-docs
Length of output: 2314
Reconsider the 120-second retry delay; consider reducing it to align with the faster alternatives already in use.
The _retry_kubectl_stdin() function at line 164 retries with max=20 and delay=120, allowing up to 40 minutes of total wait time per kubectl operation. This is excessive for transient webhook/TLS failures. The codebase already includes _retry_kubectl_stdin_novalidate() with max=5 and delay=10 (50 seconds total), suggesting that significantly shorter delays are viable. For a busy e2e test suite with 12+ cases using retry_apply() and retry_create(), a 120-second delay per operation will substantially slow down CI/CD pipelines. Document the rationale for the 120-second choice or align it with the shorter retry parameters used in the validation-bypass variant.
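The wait-time figures in this comment follow from simple arithmetic, sketched below in Python. The helper name is hypothetical, and the sketch assumes one sleep per attempt, which gives an upper bound rather than the exact behavior of the shell loop.

```python
def worst_case_wait_seconds(max_attempts: int, delay_seconds: int) -> int:
    """Upper bound on time spent sleeping, assuming one sleep per attempt."""
    return max_attempts * delay_seconds


# Values taken from the review comment and the diff it is attached to.
current = worst_case_wait_seconds(20, 120)   # _retry_kubectl_stdin after the change
previous = worst_case_wait_seconds(20, 30)   # _retry_kubectl_stdin before the change
novalidate = worst_case_wait_seconds(5, 10)  # _retry_kubectl_stdin_novalidate

if __name__ == "__main__":
    print(f"current:    {current} s ({current // 60} min)")
    print(f"previous:   {previous} s ({previous // 60} min)")
    print(f"novalidate: {novalidate} s")
```

Under that assumption the changed function can sleep for up to 2400 seconds (40 minutes) per kubectl operation, compared with 600 seconds before the change and 50 seconds for the no-validate variant.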
Deploying alauda-ai
Latest commit: cdf097c
Status: ✅ Deploy successful!
Preview URL: https://ed1c10dc.alauda-ai.pages.dev
Branch Preview URL: https://docs-pipelines-mlflow-integr.alauda-ai.pages.dev
The pipelines-mlflow-integration example did not run as written. Fixes verified against MLflow + KFP on g1-c1-x86:
- Import mlflow inside each @dsl.component (KFP v2 packages components from their own source; a module-level import raises NameError at runtime).
- Replace dsl.RUN_ID_PLACEHOLDER (removed in KFP v2) with dsl.PIPELINE_JOB_ID_PLACEHOLDER, passed in as a component argument.
- Document the secured-install access path: the mlflow-tracking-server Service fronts oauth2-proxy (302s headless clients), so components need a direct in-cluster Service, a ServiceAccount bearer token (MLFLOW_TRACKING_TOKEN), workspace RBAC, and a warm-up retry.
- Fix the Trainer v2 example (trainer.kubeflow.org/v1alpha1 TrainJob with runtimeRef/trainer, not TrainingJob/v1 with a raw pod template).
- Fix client.get_run_id -> run.run_id and the Tools menu path.

Also:
- Drop files unrelated to this PR's scope (agentic_mlops index + nav row, qwen3 finetune notebook) carried in from the coding-agents base branch.
- Remove dead _retry_kubectl_stdin_novalidate() from e2e/lib.sh.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ethod
Cross-checked against mlflow-plugin/mlflow-kubernetes-plugins:
- Name the canonical mechanism: the server's `kubernetes-auth` plugin authorizes via Kubernetes RBAC and accepts a ServiceAccount bearer token (Authorization / X-Forwarded-Access-Token) + X-MLFLOW-WORKSPACE.
- Fix caller RBAC resources to the plugin's API group set (experiments / datasets / registeredmodels); `runs` is not a resource (run writes authorize against `experiments`).
- Add the canonical out-of-cluster token path (`kubectl create token`) alongside the in-pod projected token.
- Document workspace selection via set_workspace() / MLFLOW_WORKSPACE.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Per mlflow-plugin/mlflow-kubernetes-plugins/docs/authorization-plugin.md:
- Lead with the identity-token method: the server's `kubernetes-auth` plugin (user_identity_token mode) authenticates the caller from the bearer token's identity claims, authorizes that identity, and records it as the MLflow run owner. The client authenticates with the token before any API call.
- Note the credential is a Kubernetes ServiceAccount token (the platform-wide `kubectl create token` pattern; sub claim is the identity).
- Add a security warning: because user_identity_token reads claims unverified (the oauth2-proxy is the verifier), a direct endpoint must be network-restricted / not exposed via ingress, or run the server in self_subject_access_review mode.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e test
Reworks the KFP + MLflow guide to authenticate with a platform user identity token only (no ServiceAccount, no per-workspace RBAC, no extra in-cluster Service):
- The MLflow kubernetes-auth plugin (user_identity_token mode) takes the caller identity from the bearer token's claims and records it as the run owner.
- Components reach MLflow through the platform Kubernetes API (…/kubernetes/<cluster>/…/pods/<pod>:5000/proxy/…) and forward identity via X-Forwarded-Access-Token; the shipped Service only exposes the browser OAuth proxy, so this avoids it without creating anything.
- Removed the direct-Service, ServiceAccount-token, and RBAC sections.
- KFP example now uses a stdlib REST helper (no mlflow SDK install needed) and passes the token as a parameter (source from a Secret).

Adds e2e/mlflow-user-identity-smoke.sh: logs a run with a user token and asserts the run owner equals the token identity. Verified on g1-c1-x86 (run owner admin@cpaas.io); the pipeline example compiles with kfp 2.11.0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
New how_to/mlflow-python-sdk.mdx: how to drive the stock mlflow>=3.10 SDK against the auth + multi-tenant Alauda AI MLflow server with a platform user identity token (no ServiceAccount, no per-workspace RBAC, no extra Service). Covers MLFLOW_TRACKING_TOKEN auth, mlflow.set_workspace, the port-forward connection to the app port (raw tunnel preserves Authorization), model registry, the smoke test, and troubleshooting (302 / token-newline / 401 / 403). Verified on g1-c1-x86: runs are owned by the token identity. Cross-linked from mlflow.mdx Client Configuration.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Actionable comments posted: 1
🧹 Nitpick comments (1)
e2e/mlflow-user-identity-smoke.sh (1)
36-38: Use jq first() instead of piping to head -1 for cleaner selection.
At lines 38 and 77, the pipeline jq ... | head -1 works but is non-idiomatic. Replace with jq 'first(...) // empty' to select the first matching item directly within jq without consuming the pipeline. This is clearer and avoids unnecessary process overhead.
Suggested refactor
-POD="$(curl -fsSk -H "Authorization: Bearer ${TOKEN}" \
-  "${KAPI}/api/v1/namespaces/${MLFLOW_NS}/pods?labelSelector=app%3Dmlflow-tracking-server" \
-  | jq -r '.items[] | select(.status.phase=="Running") | .metadata.name' | head -1)"
+POD="$(curl -fsSk -H "Authorization: Bearer ${TOKEN}" \
+  "${KAPI}/api/v1/namespaces/${MLFLOW_NS}/pods?labelSelector=app%3Dmlflow-tracking-server" \
+  | jq -r 'first(.items[] | select(.status.phase=="Running") | .metadata.name) // empty')"

-METRIC="$(printf '%s' "${RUN}" | jq -r '.run.data.metrics[] | select(.key=="loss") | .key' | head -1)"
+METRIC="$(printf '%s' "${RUN}" | jq -r 'first(.run.data.metrics[]? | select(.key=="loss") | .key) // empty')"
Also applies to: 77-77
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@e2e/mlflow-user-identity-smoke.sh` around lines 36 - 38, Replace the non-idiomatic `| head -1` piping pattern with jq's built-in `first()` function at two locations in e2e/mlflow-user-identity-smoke.sh (lines 36-38 and line 77). In both cases, refactor the jq command to use `first(...) // empty` to select the first matching item directly within the jq filter, eliminating the need to pipe to an external head command. This makes the code cleaner and more idiomatic while avoiding unnecessary process overhead.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/en/kubeflow/how_to/mlflow-python-sdk.mdx`:
- Line 12: The prerequisite documentation on line 12 states that the JWT must
have an `email` claim, but this over-restricts the actual valid tokens since the
implementation supports fallback identity claims (preferred_username, name, sub)
as documented elsewhere. Update the wording on line 12 to indicate that email is
the primary claim but clarify that the platform also accepts fallback claims
like preferred_username, name, and sub for token identity validation, aligning
the documentation with the actual behavior documented on line 18 and in the
referenced shell script.
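The claim fallback described in this comment can be sketched in Python as follows. The function name is hypothetical, and the lookup order (email first, then preferred_username, name, sub) is an assumption based on the order in which the comment lists the claims, not a confirmed detail of the plugin.

```python
from typing import Mapping, Optional

# Assumed precedence: email is primary, the rest are fallbacks.
CLAIM_ORDER = ("email", "preferred_username", "name", "sub")


def resolve_identity(claims: Mapping[str, object]) -> Optional[str]:
    """Return the first non-empty string claim in CLAIM_ORDER, or None."""
    for key in CLAIM_ORDER:
        value = claims.get(key)
        if isinstance(value, str) and value.strip():
            return value.strip()
    return None


if __name__ == "__main__":
    print(resolve_identity({"email": "admin@cpaas.io", "sub": "abc"}))
    print(resolve_identity({"sub": "abc"}))
```

A token without an `email` claim still resolves to an identity under this scheme, which is why a prerequisite that demands `email` would be stricter than the behavior the comment describes.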
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 31c3c128-90fb-4521-ac7a-77b60083b26d
📒 Files selected for processing (4)
- docs/en/kubeflow/how_to/mlflow-python-sdk.mdx
- docs/en/kubeflow/how_to/mlflow.mdx
- docs/en/training_guides/pipelines-mlflow-integration.mdx
- e2e/mlflow-user-identity-smoke.sh
✅ Files skipped from review due to trivial changes (2)
- docs/en/kubeflow/how_to/mlflow.mdx
- docs/en/training_guides/pipelines-mlflow-integration.mdx
…cess)
Rework mlflow-python-sdk.mdx so the MLflow Python client always goes through the oauth2-proxy (the platform MLflow route) instead of port-forwarding to the container port:
- Interactive: present the browser SSO session. Copy the _oauth2_proxy cookie and attach it via a runtime-registered RequestHeaderProvider (verified: the provider injects the header and the run is owned by the caller identity).
- Headless/automation: admin enables oauth2-proxy --skip-jwt-bearer-tokens, then the client uses MLFLOW_TRACKING_TOKEN with a platform OIDC token.

Removes the kubectl port-forward / app-port connection entirely.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- SDK guide "Headless / automation": mint a short-lived Dex id token from a long-lived refresh token (refresh-token grant at /dex/token), then use it as MLFLOW_TRACKING_TOKEN through the OAuth proxy. Refresh before the 24h id-token expiry instead of carrying a static token.
- Rework the smoke test to the same method: refresh token -> id token -> log to MLflow via the platform route (through oauth2-proxy, no container-port access), asserting the run owner equals the token identity. Requires the proxy's --skip-jwt-bearer-tokens.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
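The "refresh before expiry" advice in this commit can be sketched with the Python standard library. This is an illustration only: the helper names and the five-minute margin are assumptions, and the payload is decoded without signature verification, which is acceptable for scheduling a refresh but not for trusting the claims.

```python
import base64
import json
import time
from typing import Optional


def jwt_claims_unverified(token: str) -> dict:
    """Decode the payload segment of a JWT without verifying the signature."""
    payload = token.strip().split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64url padding
    return json.loads(base64.urlsafe_b64decode(payload))


def needs_refresh(token: str, now: Optional[float] = None,
                  margin_seconds: int = 300) -> bool:
    """Return True once the token is within margin_seconds of its exp claim."""
    exp = jwt_claims_unverified(token)["exp"]
    current = time.time() if now is None else now
    return current >= exp - margin_seconds
```

A long-running job could call `needs_refresh` before each batch of MLflow calls and request a new id token when it returns True. The `strip()` call also tolerates a trailing newline, which is a common artifact when a token is read from a file or a Secret.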
🧹 Nitpick comments (1)
e2e/mlflow-user-identity-smoke.sh (1)
43-48: 💤 Low value
curl -k disables certificate verification.
The -k flag is used throughout the script, which is typical for e2e tests against self-signed certificates. This is acceptable for testing but should not be used in production code.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@e2e/mlflow-user-identity-smoke.sh` around lines 43 - 48, Add an explanatory comment in the script to document why the `-k` flag is included in the curl command within the api() function. The comment should clarify that the `-k` flag disables certificate verification and is intentionally used here for e2e testing against self-signed certificates, making it clear to future developers that this is a deliberate choice specific to the e2e test environment and should not be replicated in production code.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a7659b8a-ae68-42a7-9661-16283824aaad
📒 Files selected for processing (2)
- docs/en/kubeflow/how_to/mlflow-python-sdk.mdx
- e2e/mlflow-user-identity-smoke.sh
✅ Files skipped from review due to trivial changes (1)
- docs/en/kubeflow/how_to/mlflow-python-sdk.mdx
- SDK guide "Headless / automation": mint a Dex id token with the OAuth2 password grant (grant_type=password at /dex/token) in one call, with no browser or cookie, then use it as MLFLOW_TRACKING_TOKEN through the OAuth proxy. Requires a Dex client whose grantTypes include "password" + the proxy's --skip-jwt-bearer-tokens. Warns to use a dedicated service account (ROPC sends the password) and store creds in a Secret.
- Rework the smoke test to ROPC: username/password -> Dex id token -> log to MLflow via the platform route (through oauth2-proxy), asserting run owner == token identity.

Verified ROPC mints a valid Dex id token (iss=dex, aud=alauda-auth, key in Dex JWKS) on g1-c1-x86.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
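The password-grant request described in this commit can be sketched with the Python standard library. This is a hedged illustration, not the guide's code: the function name, the host, the client ID, the account name, and the scope string are all hypothetical, and the sketch only builds the request without sending it.

```python
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request


def build_password_grant_request(token_url: str, client_id: str,
                                 username: str, password: str,
                                 scope: str = "openid email profile",
                                 client_secret: Optional[str] = None) -> Request:
    """Build (but do not send) an OAuth2 password-grant POST for a token endpoint."""
    form = {
        "grant_type": "password",
        "client_id": client_id,
        "username": username,
        "password": password,
        "scope": scope,
    }
    if client_secret is not None:
        form["client_secret"] = client_secret
    body = urlencode(form).encode("ascii")  # urlencode percent-escapes non-ASCII
    return Request(
        token_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Hypothetical values for illustration; real ones would come from a Secret.
req = build_password_grant_request(
    "https://platform.example.com/dex/token",
    client_id="example-client",
    username="svc-mlflow",
    password="p@ss w/ord",
)
```

Sending the request (for example with `urllib.request.urlopen`) should return a JSON body; in a standard OIDC token response the `id_token` field is the value to export as MLFLOW_TRACKING_TOKEN. Because the password travels in the request body, the commit's advice to use a dedicated account and keep the credentials in a Secret applies directly.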
mlflow-python-sdk.mdx now leads with the OAuth2 password grant: mint a Dex id token from a username/password at /dex/token, then use it as MLFLOW_TRACKING_TOKEN through the OAuth proxy. Adds an admin "Platform setup" section (--skip-jwt-bearer-tokens + a password-grant Dex client). The browser session-cookie flow is kept as a secondary "interactive alternative". Verified end-to-end on g1-c1-x86 (run owner = the token's user identity).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- SDK guide: set_tracking_uri now uses the in-cluster Service http://mlflow-tracking-server.kubeflow:5000 (still via the OAuth proxy) for in-cluster clients; note the platform route for outside-the-cluster use.
- Pipelines guide: rewritten to use the MLflow Python client against the in-cluster Service with MLFLOW_TRACKING_TOKEN injected from a Secret (kfp-kubernetes use_secret_as_env), and reference the SDK guide for auth/RBAC and minting the token (password grant). Drops the raw-REST/container-port helper. Trainer v2 example points MLFLOW_TRACKING_URI at the in-cluster Service. Example compiles with kfp 2.11 + kfp-kubernetes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The MLflow usage docs under training_guides now point to how_to/mlflow-python-sdk.mdx for authentication (MLFLOW_TRACKING_TOKEN) and workspace/RBAC on secured installs, where the bare MLFLOW_TRACKING_URI / report_to: mlflow setup is not sufficient:
- fine-tuning-using-notebooks.mdx (Experiment tracking sections)
- fine-tune-with-trainer-v2.ipynb (Step 5: View Training Metrics in MLflow)

Also corrects the menu path to Alauda AI -> Tools -> MLFlow.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary by CodeRabbit
Documentation
Bug Fixes
Tests