alauda · typhoonzero · Jun 9, 2026
diff --git a/docs/en/kubeflow/how_to/mlflow-python-sdk.mdx b/docs/en/kubeflow/how_to/mlflow-python-sdk.mdx
@@ -0,0 +1,135 @@
+---
+weight: 46
+---
+
+# Using the MLflow Python SDK with Authentication and RBAC
+
+On Alauda AI the [MLflow Tracking Server](./mlflow.mdx) runs behind single sign-on and multi-tenancy: an OAuth proxy authenticates every caller, and the server records each run under the calling user and authorizes it against Kubernetes RBAC. This guide shows how to drive the stock **MLflow Python SDK** through that OAuth proxy with your own identity, using the OAuth2 **password grant** to obtain a token from a username and password. No browser is involved, and the client never connects to the MLflow container port directly.
+
+## Platform setup (administrator, one-time) \{#platform-setup-administrator-one-time}
+
+The password grant needs two settings, which an administrator enables once:
+
+- **Accept bearer tokens at the proxy.** Add `--skip-jwt-bearer-tokens=true` to the MLflow OAuth proxy so it accepts a Dex OIDC token alongside browser sessions:
+
+  ```yaml
+  # MLflow plugin values
+  auth:
+    oauth:
+      extraArgs:
+        - --skip-jwt-bearer-tokens=true
+  ```
+
+- **Allow the password grant.** Dex must have the password connector enabled (`enablePasswordDB: true`), and the OAuth client you authenticate with must list `password` in its `grantTypes`. Register a **dedicated** client for this rather than the platform's interactive-login client.
+
+## Prerequisites
+
+- `mlflow` **3.10 or later** (`pip install "mlflow>=3.10"`). Workspace selection (`mlflow.set_workspace`) is a 3.10+ feature.
+- A platform **username and password** — ideally a dedicated service account, not a person's login — that can access the target workspace (see [Workspace Access](./mlflow.mdx)).
+- The Dex **client id and secret** allowed to use the password grant (from your administrator).
+
+## How authentication works
+
+Two layers sit in front of your runs:
+
+1. The **OAuth proxy** (`oauth2-proxy`) authenticates the request. With `--skip-jwt-bearer-tokens`, it accepts a Dex-issued OIDC **id token** sent as `Authorization: Bearer …`.
+2. The MLflow server's `kubernetes-auth` plugin reads your identity from that token, records it as the run **owner**, and authorizes it against your Kubernetes permissions in the workspace.
+
+The client always goes through the OAuth proxy — never connect to the MLflow container port directly.
+
+## Connect the SDK
+
+### 1. Mint an id token with the password grant
+
+Exchange the username and password for a Dex **id token** in a single call (no browser, no cookie):
+
+```bash
+export ID_TOKEN=$(curl -sk "https://<platform>/dex/token" \
+  -d grant_type=password \
+  --data-urlencode "username=$MLFLOW_USERNAME" \
+  --data-urlencode "password=$MLFLOW_PASSWORD" \
+  -d scope="openid email groups" \
+  -d client_id="$DEX_CLIENT_ID" --data-urlencode "client_secret=$DEX_CLIENT_SECRET" \
+  | jq -r .id_token)
+```
+
+### 2. Point the SDK at the MLflow route with the token
+
+The SDK reads `MLFLOW_TRACKING_TOKEN` and sends it as `Authorization: Bearer …`:
+
+```python
+import os
+import mlflow
+
+os.environ["MLFLOW_TRACKING_TOKEN"] = os.environ["ID_TOKEN"].strip()  # → Authorization: Bearer
+mlflow.set_tracking_uri("http://mlflow-tracking-server.kubeflow:5000")  # in-cluster Service (fronted by the OAuth proxy)
+mlflow.set_workspace("team-a")                 # workspace namespace → X-MLFLOW-WORKSPACE
+mlflow.set_experiment("my-experiment")
+
+with mlflow.start_run(run_name="sdk-quickstart") as run:
+    mlflow.log_param("learning_rate", 2e-4)
+    mlflow.log_metric("loss", 0.123)
+    print("run:", run.info.run_id)
+```
+
+The run appears under **Alauda AI → Tools → MLFlow**, owned by the username you authenticated as. (Verified end-to-end on a secured install: the run owner is the token's user identity.)
+
+Use the in-cluster Service URL `http://mlflow-tracking-server.kubeflow:5000` when the client runs **inside** the cluster (pipeline components, Workbench notebooks). From **outside** the cluster, point at the platform route `https://<platform>/clusters/<cluster>/mlflow` instead — both reach the same OAuth proxy.
+
+:::warning
+The password grant sends the password to the token endpoint, so use a **dedicated service account** and keep the credentials and client secret in a Kubernetes `Secret`, never in code. Always `.strip()` the token: a trailing newline produces `Invalid … character(s) in header value: 'Bearer …\n'`. The id token expires (24 h by default), so re-run step 1 to refresh it in long-running jobs. If you use the external HTTPS route and your machine does not trust the platform certificate, set `MLFLOW_TRACKING_INSECURE_TLS=true`.
+:::
+
+## Selecting a workspace
+
+Runs are recorded in the workspace you select; if you select none, the server's default workspace is used. Either of the following sets it (the SDK turns it into the `X-MLFLOW-WORKSPACE` header):
+
+- `mlflow.set_workspace("team-a")` in code,
+- the `MLFLOW_WORKSPACE=team-a` environment variable.
+
+You can only use a workspace your account has access to; see [Workspace Access](./mlflow.mdx).
+
+## Registering models
+
+The model registry is workspace-scoped and authorized the same way, so the usual SDK calls work once connected:
+
+```python
+mlflow.set_workspace("team-a")
+with mlflow.start_run():
+    mlflow.sklearn.log_model(sk_model, name="model", registered_model_name="fraud-detector")  # sk_model: a fitted scikit-learn estimator defined earlier
+```
+
+Promote the registered version to **Staging** or **Production** from the MLflow UI.
+
+## Interactive alternative: browser session
+
+If you cannot use the password grant (for example you only have an interactive SSO login), present your browser session instead — this works without the `--skip-jwt-bearer-tokens` setting. Sign in at **Alauda AI → Tools → MLFlow**, copy the `_oauth2_proxy` cookie from the browser developer tools (**Application/Storage → Cookies**; include any `_oauth2_proxy_N` chunks, joined with `; `), and attach it to every request with a header provider:
+
+```python
+import os, mlflow
+from mlflow.tracking.request_header.abstract_request_header_provider import RequestHeaderProvider
+from mlflow.tracking.request_header.registry import _request_header_provider_registry
+
+class ProxySessionHeader(RequestHeaderProvider):
+    def in_context(self):
+        return bool(os.environ.get("MLFLOW_PROXY_COOKIE"))      # export MLFLOW_PROXY_COOKIE='_oauth2_proxy=<value>'
+    def request_headers(self):
+        return {"Cookie": os.environ["MLFLOW_PROXY_COOKIE"]}
+
+_request_header_provider_registry.register(ProxySessionHeader)
+mlflow.set_tracking_uri("https://<platform>/clusters/<cluster>/mlflow")
+mlflow.set_workspace("team-a")
+```
+
+The session cookie expires — copy a fresh one when calls start returning a login redirect.
+
+## Troubleshooting
+
+| Symptom | Check |
+|---------|-------|
+| `/dex/token` returns `unsupported_grant_type` / "password grant … not allowed" | The Dex client does not permit the password grant. Use a client whose `grantTypes` include `password` (see [Platform setup](#platform-setup-administrator-one-time)). |
+| Call returns HTML or a redirect (`302` to the login page) | The OAuth proxy rejected the bearer token. Confirm `--skip-jwt-bearer-tokens` is enabled and the token is a valid Dex id token (`aud` = the proxy's client). For the cookie alternative, your `_oauth2_proxy` value is missing or expired. |
+| `Invalid … character(s) in header value: 'Bearer …\n'` | The token has trailing whitespace. Set `MLFLOW_TRACKING_TOKEN` to the `.strip()`-ed value. |
+| `Failed to query /api/3.0/mlflow/server-info` | The SDK could not reach the server through the proxy. Verify that the tracking URI is the platform MLflow route (or the in-cluster Service URL when the client runs inside the cluster) and that the token is valid. |
+| `403 PERMISSION_DENIED` | Your account lacks access to the workspace namespace. Request access to the workspace (see [Workspace Access](./mlflow.mdx)); no ServiceAccount is involved. |
+| Run shows the wrong owner or workspace | The owner is your authenticated identity; the workspace is `set_workspace()` / `MLFLOW_WORKSPACE` (else the server default). Check both. |
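
Reviewer sketch (not part of the patch): the token flow from "Connect the SDK" step 1 and the expiry note in the warning could also be expressed in Python, which may be convenient inside a notebook or pipeline component. This is a minimal sketch under stated assumptions, not the guide's method. The helper names `build_token_request`, `fetch_id_token`, and `seconds_until_expiry` are hypothetical. The form fields mirror the `curl` command in the guide. Unlike `curl -k`, this sketch keeps default TLS verification. The `exp` claim is read without verifying the signature, so it is only a hint for when to refresh, not a validity check.

```python
import base64
import json
import time
import urllib.parse
import urllib.request


def build_token_request(token_url, username, password, client_id, client_secret):
    """Build the form-encoded POST that the curl command in step 1 sends."""
    form = {
        "grant_type": "password",
        "username": username,
        "password": password,
        "scope": "openid email groups",
        "client_id": client_id,
        "client_secret": client_secret,
    }
    body = urllib.parse.urlencode(form).encode("ascii")
    return urllib.request.Request(token_url, data=body, method="POST")


def fetch_id_token(request, timeout=30):
    """Send the request and return the id token with whitespace stripped."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload["id_token"].strip()


def seconds_until_expiry(id_token, now=None):
    """Return seconds left before the token's (unverified) `exp` claim."""
    payload_b64 = id_token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore base64url padding
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    current = time.time() if now is None else now
    return claims["exp"] - current


# Hypothetical usage (placeholders as in the guide; performs a network call):
#   request = build_token_request("https://<platform>/dex/token", user, pw, cid, secret)
#   os.environ["MLFLOW_TRACKING_TOKEN"] = fetch_id_token(request)
```

A long-running job could call `seconds_until_expiry` before each logging phase and mint a new token once the remaining time drops below a margin of its choosing.
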
diff --git a/docs/en/kubeflow/how_to/mlflow.mdx b/docs/en/kubeflow/how_to/mlflow.mdx
@@ -69,6 +69,8 @@ subjects:
 
 ## Client Configuration
 
+For authenticating the MLflow Python SDK with a user identity token — including the in-cluster connection details and RBAC — see [Using the MLflow Python SDK with Authentication and RBAC](./mlflow-python-sdk.mdx).
+
 Set the MLflow tracking URI to the platform route and select the workspace:
 
 ```python

diff --git a/docs/en/training_guides/fine-tune-with-trainer-v2.ipynb b/docs/en/training_guides/fine-tune-with-trainer-v2.ipynb
@@ -947,15 +947,7 @@
    "cell_type": "markdown",
    "id": "27d2b476",
    "metadata": {},
-   "source": [
-    "## Step 5: View Training Metrics in MLflow\n",
-    "\n",
-    "If `MLFLOW_TRACKING_URI` is set and the MLflow server is reachable from the training pod, LlamaFactory will log metrics (loss, learning rate, etc.) to MLflow automatically via `report_to: mlflow` in the training config.\n",
-    "\n",
-    "To open the MLflow UI, go to **Alauda AI** - **Tools** - **MLFlow** (need MLFlow Cluster plugin installed). Look for the experiment named by `MLFLOW_EXPERIMENT_NAME`.\n",
-    "\n",
-    "Each `TrainJob` run will appear as a separate MLflow **run** under the same experiment, making it easy to compare training curves across different models and hyperparameters."
-   ]
+   "source": "## Step 5: View Training Metrics in MLflow\n\nIf `MLFLOW_TRACKING_URI` is set and the MLflow server is reachable from the training pod, LlamaFactory will log metrics (loss, learning rate, etc.) to MLflow automatically via `report_to: mlflow` in the training config.\n\nOn a secured (SSO + multi-tenant) MLflow install the trainer must also authenticate: set `MLFLOW_TRACKING_TOKEN` and select a workspace. See [Using the MLflow Python SDK with Authentication and RBAC](../kubeflow/how_to/mlflow-python-sdk.mdx) for how to obtain the token and how authorization/RBAC work.\n\nTo open the MLflow UI, go to **Alauda AI → Tools → MLFlow** (requires the MLFlow Cluster plugin to be installed). Look for the experiment named by `MLFLOW_EXPERIMENT_NAME`.\n\nEach `TrainJob` run will appear as a separate MLflow **run** under the same experiment, making it easy to compare training curves across different models and hyperparameters."
   },
   {
    "cell_type": "markdown",
@@ -1060,4 +1052,4 @@
  },
  "nbformat": 4,
  "nbformat_minor": 5
-}
+}
diff --git a/docs/en/training_guides/fine-tuning-using-notebooks.mdx b/docs/en/training_guides/fine-tuning-using-notebooks.mdx
@@ -325,7 +325,9 @@ After success the merged model is pushed to a date-stamped branch (`sft-YYYYMMDD
 
 ## 8. Experiment tracking
 
-Setting `report_to: mlflow` in the LLaMA-Factory config plus the `MLFLOW_TRACKING_URI` / `MLFLOW_EXPERIMENT_NAME` env vars routes metrics to MLflow. Find runs in **Alauda AI → Advanced → MLFlow**, compare loss curves, and pin the winning run.
+Setting `report_to: mlflow` in the LLaMA-Factory config plus the `MLFLOW_TRACKING_URI` / `MLFLOW_EXPERIMENT_NAME` env vars routes metrics to MLflow. Find runs in **Alauda AI → Tools → MLFlow**, compare loss curves, and pin the winning run.
+
+On a secured (SSO + multi-tenant) MLflow install the job must also authenticate — supply an `MLFLOW_TRACKING_TOKEN` and select a workspace. See [Using the MLflow Python SDK with Authentication and RBAC](../kubeflow/how_to/mlflow-python-sdk.mdx) for how to obtain the token and configure the client.
 
 ## 9. Publish the fine-tuned model
 
@@ -412,4 +414,4 @@ spec:
 
 ### Experiment tracking on other devices
 
-LLaMA-Factory and Transformers integrate with MLflow / wandb directly. Set the destination in the framework config (e.g. `report_to: mlflow` for LLaMA-Factory) and supply `MLFLOW_TRACKING_URI` and `MLFLOW_EXPERIMENT_NAME` env vars. View results under **Alauda AI → Advanced → MLFlow**.
+LLaMA-Factory and Transformers integrate with MLflow / wandb directly. Set the destination in the framework config (e.g. `report_to: mlflow` for LLaMA-Factory) and supply `MLFLOW_TRACKING_URI` and `MLFLOW_EXPERIMENT_NAME` env vars (plus `MLFLOW_TRACKING_TOKEN` on a secured install — see [Using the MLflow Python SDK with Authentication and RBAC](../kubeflow/how_to/mlflow-python-sdk.mdx)). View results under **Alauda AI → Tools → MLFlow**.
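
Reviewer sketch (not part of the patch): since the updated training guides now ask for `MLFLOW_TRACKING_TOKEN` on a secured install in addition to `MLFLOW_TRACKING_URI` and `MLFLOW_EXPERIMENT_NAME`, a training entrypoint might check these before starting, so a missing setting fails fast instead of surfacing later as a login redirect. The function name `missing_mlflow_settings` and the `secured` flag are hypothetical. The workspace is left out of the check because the guide says the server default applies when none is selected.

```python
import os

# Variables the training guides name for MLflow reporting.
REQUIRED_VARS = ("MLFLOW_TRACKING_URI", "MLFLOW_EXPERIMENT_NAME")
# Additional variable the guides name for a secured (SSO + multi-tenant) install.
SECURED_VARS = ("MLFLOW_TRACKING_TOKEN",)


def missing_mlflow_settings(env, secured=True):
    """Return the names of MLflow variables that are unset or blank in `env`."""
    wanted = REQUIRED_VARS + (SECURED_VARS if secured else ())
    return [name for name in wanted if not env.get(name, "").strip()]


# Hypothetical usage at the top of a training script:
#   missing = missing_mlflow_settings(os.environ)
#   if missing:
#       raise SystemExit("MLflow settings missing: " + ", ".join(missing))
```

Treating a whitespace-only value as missing also catches a token variable that was populated from an empty file or a failed token request.
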