From c0730ef38b14a071430005b80b7a3c33a514c687 Mon Sep 17 00:00:00 2001
From: Wu Yi <typhoonzero1986@gmail.com>
Date: Tue, 9 Jun 2026 10:23:08 +0800
Subject: [PATCH 01/21] docs: focus coding-agents guide on Claude Code with
 translation proxy

- Narrow scope to Claude Code only; remove opencode and Codex CLI sections
- Add how to configure reasoning effort when starting the InferenceService
  (server-side --reasoning-effort flag and request-time override)
- Update Claude Code section with corrected proxy setup for LiteLLM and
  claude-code-router (config-driven, ccr code startup command)
- Qwen3.6 and Gemma 4 recommendations and Unsloth quantized model list
  already present; no change needed
---
 .../coding-agents-with-inference-service.mdx  | 246 ++++++++++++------
 1 file changed, 164 insertions(+), 82 deletions(-)

diff --git a/docs/en/agentic_mlops/coding-agents-with-inference-service.mdx b/docs/en/agentic_mlops/coding-agents-with-inference-service.mdx
index 650baf5..6bb2a2a 100644
--- a/docs/en/agentic_mlops/coding-agents-with-inference-service.mdx
+++ b/docs/en/agentic_mlops/coding-agents-with-inference-service.mdx
@@ -10,9 +10,9 @@ i18n:
 
 ## Introduction
 
-Coding agents such as [opencode](https://opencode.ai/), [Codex CLI](https://github.com/openai/codex), and [Claude Code](https://docs.anthropic.com/en/docs/claude-code) are terminal-based assistants that read your repository, plan changes, edit files, and run commands on your behalf. They normally talk to a hosted model provider over the internet.
+[Claude Code](https://docs.anthropic.com/en/docs/claude-code) is a terminal-based coding agent that reads your repository, plans changes, edits files, and runs commands on your behalf. It normally talks to Anthropic's hosted models over the internet.
 
-This document shows how to point those agents at a model you serve yourself on Alauda AI, so that your source code, prompts, and infrastructure configuration never leave your cluster. The same on-premise `InferenceService` that you deploy for any other workload can back an interactive coding agent, as long as it exposes an **OpenAI-compatible API** and has **tool (function) calling** enabled.
+This document shows how to point Claude Code at a model you serve yourself on Alauda AI, so that your source code, prompts, and infrastructure configuration never leave your cluster. The same on-premise `InferenceService` that you deploy for any other workload can back an interactive coding agent, as long as it exposes an **OpenAI-compatible API** and has **tool (function) calling** enabled. Because Claude Code speaks the Anthropic Messages API (`/v1/messages`), you front your `InferenceService` with a lightweight translation proxy (see [Step 3](#step-3-connect-claude-code-with-a-translation-proxy)).
 
 This page builds directly on the deployment how-tos. It does not repeat how to create or expose an `InferenceService`; instead it links to them and focuses on the agent-specific configuration and tuning.
 
@@ -23,24 +23,24 @@ Coding agents and their configuration formats evolve quickly. The config snippet
 ## Prerequisites
 
 - A running, ready `InferenceService` that serves an OpenAI-compatible API. See [Create Inference Service using CLI](../model_inference/inference_service/how_to/create_inference_service_cli.mdx).
-- Network access from the machine running the agent to the service endpoint. For access from a developer laptop outside the cluster, see [Configure External Access for Inference Services](../model_inference/inference_service/how_to/external_access_inference_service.mdx).
-- A model with **tool/function calling** support, served with the matching vLLM parser enabled (see [Enable tool calling on the runtime](#enable-tool-calling-on-the-runtime)). Without this, agents can chat but cannot edit files or run commands.
-- The agent CLI installed locally (`opencode`, `codex`, or `claude`).
+- Network access from the machine running Claude Code to the service endpoint. For access from a developer laptop outside the cluster, see [Configure External Access for Inference Services](../model_inference/inference_service/how_to/external_access_inference_service.mdx).
+- A model with **tool/function calling** support, served with the matching vLLM parser enabled (see [Enable tool calling on the runtime](#enable-tool-calling-on-the-runtime)). Without this, the agent can chat but cannot edit files or run commands.
+- Claude Code installed locally (`claude`).
+- A translation proxy (LiteLLM or claude-code-router) to bridge Claude Code's Anthropic Messages API to the OpenAI-compatible endpoint (see [Step 3](#step-3-connect-claude-code-with-a-translation-proxy)).
 
 ## How the pieces fit together
 
 ```text
-  Coding agent (opencode / Codex / Claude Code)
-        │  OpenAI-compatible HTTP  (POST /v1/chat/completions)
+  Claude Code
+        │  Anthropic Messages API  (POST /v1/messages)
         ▼
-  External access / Load Balancer  ──►  KServe InferenceService (vLLM)
-        ▲                                       │
-        └──── Anthropic→OpenAI proxy ───────────┘
-             (only required for Claude Code)
+  Translation proxy (LiteLLM / claude-code-router)
+        │  OpenAI Chat Completions API  (POST /v1/chat/completions)
+        ▼
+  KServe InferenceService (vLLM)
 ```
 
-- **opencode** and **Codex CLI** speak the OpenAI Chat Completions API natively, so they can call the `InferenceService` endpoint directly.
-- **Claude Code** speaks the Anthropic Messages API, which vLLM does not serve. It requires a small translation proxy in front of the OpenAI-compatible endpoint (see [Claude Code](#claude-code)).
+Claude Code speaks the Anthropic Messages API (`/v1/messages`), while your `InferenceService` exposes an OpenAI-compatible endpoint (`/v1/chat/completions`). A lightweight translation proxy bridges the two.
 
 ## Step 1: Deploy and smoke-test the endpoint
 
@@ -66,6 +66,10 @@ curl -sS ${BASE_URL}/chat/completions \
 
 A normal JSON completion confirms the endpoint is reachable and the model name is correct. Note the three values you will reuse for every agent: **base URL** (ending in `/v1`), **model name** (the `--served-model-name`), and **API key**.
 
+:::tip
+For reasoning models (DeepSeek R1, QwQ, etc.), also add `--reasoning-parser` to the vLLM launch flags. See [Configure reasoning models and reasoning effort](#configure-reasoning-effort).
+:::
+
 ## Step 2: Enable tool calling on the runtime \{#enable-tool-calling-on-the-runtime}
 
 Coding agents work by calling tools (read file, write file, run shell). This requires the model to emit tool calls **and** vLLM to parse them. Add the following flags to the vLLM launch command in your `InferenceService` (in the sample from [Create Inference Service using CLI](../model_inference/inference_service/how_to/create_inference_service_cli.mdx), they go on the `python3 -m vllm.entrypoints.openai.api_server` line):
@@ -81,115 +85,192 @@ Coding agents work by calling tools (read file, write file, run shell). This req
 
 Verify tool calling end-to-end by asking the agent to perform a trivial file operation (for example, "create `hello.txt` containing the word hi"). If the model replies in prose instead of editing the file, tool calling is not wired up correctly — recheck the parser and model.
 
-## Step 3: Connect your coding agent
+## Step 2b (optional): Configure reasoning models and reasoning effort \{#configure-reasoning-effort}
+
+Some models (for example, DeepSeek R1, QwQ, Hunyuan, or Cohere Command A Reasoning) emit chain-of-thought reasoning before their final answer. vLLM separates the reasoning traces from the assistant content so your agent receives clean output — but you must enable the matching flags.
+
+### Server-side flags
+
+Add `--reasoning-parser` to your vLLM launch command, paired with the appropriate tool-call parser:
+
+```bash
+--enable-auto-tool-choice \
+--tool-call-parser <parser> \
+--reasoning-parser <reasoning-parser>
+```
+
+The table below shows common model families and their required parsers. Confirm against the [vLLM tool calling documentation](https://docs.vllm.ai/en/latest/features/tool_calling.html) for the current list.
+
+| Model family | `--tool-call-parser` | `--reasoning-parser` | Notes |
+| --- | --- | --- | --- |
+| DeepSeek R1 (`deepseek-ai/DeepSeek-R1-*`) | `deepseek_v3` | `deepseek_r1` | Also needs `--chat-template examples/tool_chat_template_deepseekr1.jinja` |
+| QwQ / Qwen reasoning (`Qwen/QwQ-*`) | `hermes` | `deepseek_r1` | QwQ emits the same `<think>...</think>` format as DeepSeek R1, so it uses the same reasoning parser |
+| Hunyuan-A13B-Instruct | `hunyuan_a13b` | `hunyuan_a13b` | Tencent's reasoning model |
+| Cohere Command A Reasoning | `cohere` | `cohere_command3` | Cohere's reasoning model |
+
+For model families not listed above, check the model card for reasoning instructions and the [vLLM tool calling documentation](https://docs.vllm.ai/en/latest/features/tool_calling.html) for the matching parser pair.
 
-### opencode
+### Configuring reasoning effort when starting the inference service
 
-opencode reads configuration from `opencode.json` in the project root or `~/.config/opencode/opencode.json`. Define a custom OpenAI-compatible provider that points at your endpoint:
+Reasoning effort controls how much the model "thinks" before answering. For coding agents you typically want **low** reasoning effort to keep interactive latency acceptable — many short, low-reasoning turns beat a single long, high-reasoning one.
+
+#### Server-side default
+
+You can set a default reasoning effort at the vLLM server level so every request uses it unless the client overrides:
+
+```bash
+--enable-reasoning \
+--reasoning-parser <reasoning-parser> \
+--reasoning-effort low
+```
+
+`--reasoning-effort` accepts `"low"`, `"medium"` (default), or `"high"`. Setting it to `"low"` on the server ensures that even clients that don't specify reasoning effort get the responsive behavior you want for coding agents.
+
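+Not all vLLM releases support `--reasoning-effort` as a launch flag, and newer releases deprecate `--enable-reasoning` because `--reasoning-parser` alone turns on reasoning output. If your version rejects `--enable-reasoning`, drop it; if it rejects `--reasoning-effort`, use the request-time method below.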
+
+#### Request-time override
+
+DeepSeek R1 and compatible models also accept a `reasoning_effort` parameter in the request body:
 
 ```json
 {
-  "$schema": "https://opencode.ai/config.json",
-  "provider": {
-    "onprem": {
-      "npm": "@ai-sdk/openai-compatible",
-      "name": "On-Prem Alauda AI",
-      "options": {
-        "baseURL": "https://your-inference-service-domain.com/v1",
-        "apiKey": "{env:ONPREM_API_KEY}"
-      },
-      "models": {
-        "qwen-2": {
-          "name": "Qwen2.5-Coder (on-prem)"
-        }
-      }
-    }
+  "model": "deepseek-r1",
+  "messages": [
+    {"role": "user", "content": "..."}
+  ],
+  "reasoning_effort": "low"
 }
 ```
 
-- The model key (`qwen-2`) must match the `--served-model-name` of the `InferenceService`.
-- Export the key the config references, then select the model: `export ONPREM_API_KEY=sk-local` and choose `onprem/qwen-2` with the `/models` command inside opencode.
+When using a translation proxy (LiteLLM or claude-code-router), check that the proxy passes this parameter through to the backend; pass-through behavior depends on the proxy and its configuration.
 
-### Codex CLI
+**Note:** Qwen reasoning models (QwQ) do not have a separate reasoning-effort knob. Control reasoning depth indirectly through the chat template or by passing `max_tokens` to cap how long the reasoning chain can grow.
 
-Codex CLI reads `~/.codex/config.toml`. Register your endpoint as a model provider and select it:
+## Step 3: Connect Claude Code with a Translation Proxy \{#step-3-connect-claude-code-with-a-translation-proxy}
 
-```toml
-model = "qwen-2"
-model_provider = "onprem"
+Claude Code communicates over the Anthropic Messages API (`/v1/messages`), while your `InferenceService` exposes an OpenAI-compatible endpoint (`/v1/chat/completions`). Bridge the two by running a translation proxy in front of your endpoint. Two common options:
 
-[model_providers.onprem]
-name = "On-Prem Alauda AI"
-base_url = "https://your-inference-service-domain.com/v1"
-env_key = "ONPREM_API_KEY"
-wire_api = "chat"
-```
+- [LiteLLM](https://docs.litellm.ai/) proxy, which exposes an Anthropic-compatible `/v1/messages` endpoint and routes to any backend model.
+- [claude-code-router](https://github.com/musistudio/claude-code-router), a proxy built specifically to point Claude Code at OpenAI-compatible and other backends.
 
-- `base_url` must end at `/v1`; `model` must match the `--served-model-name`.
-- `env_key` names the environment variable that holds the API key: `export ONPREM_API_KEY=sk-local`.
-- Use `wire_api = "chat"` for vLLM's OpenAI Chat Completions API.
+Both approaches handle the API translation for you. Pick whichever fits your workflow — LiteLLM is more general-purpose, while claude-code-router is tailored to Claude Code's needs.
 
-### Claude Code \{#claude-code}
+### Option 1: LiteLLM proxy
 
-Claude Code communicates over the Anthropic Messages API (`/v1/messages`). There are two ways to back it with an on-premise model — pick the one that matches your runtime.
+Start the LiteLLM proxy, pointing it at your `InferenceService` endpoint:
 
-#### Option A: point Claude Code directly at the on-premise endpoint
+```bash
+litellm --model openai/qwen-2 \
+  --api_base https://your-inference-service-domain.com/v1 \
+  --port 4000
+```
 
-If the on-premise endpoint already speaks the Anthropic Messages API — either natively (for example, some `llama.cpp` `llama-server` builds and similar local runners) or because you front your `InferenceService` with a gateway that exposes `/v1/messages` — you can configure Claude Code with environment variables alone, no separate proxy needed:
+This exposes `http://localhost:4000/v1/messages` (Anthropic format) and forwards requests to your OpenAI-compatible backend.
+
+Then point Claude Code at the proxy:
 
 ```bash
-export ANTHROPIC_BASE_URL="http://127.0.0.1:9123"      # on-premise endpoint speaking the Anthropic Messages API
-export ANTHROPIC_AUTH_TOKEN="not_set"                  # any value; the endpoint may ignore it
-export ANTHROPIC_API_KEY="not_set_either!"             # any value; both vars are checked
-export ANTHROPIC_MODEL="qwen-2"                        # must match what the endpoint exposes (e.g. served-model-name)
-
-# Keep traffic on-premise and trim features the on-prem model can't honor:
-export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1      # suppress optional traffic to Anthropic-hosted services
-export CLAUDE_CODE_ATTRIBUTION_HEADER=0                # drop the Anthropic attribution header
-export CLAUDE_CODE_ENABLE_TELEMETRY=0                  # disable telemetry
-export CLAUDE_CODE_DISABLE_1M_CONTEXT=1                # disable the 1M-context feature; most on-prem models can't serve it
-export CLAUDE_CODE_MAX_OUTPUT_TOKENS=64000             # cap to what the on-prem model and runtime support
+export ANTHROPIC_BASE_URL="http://127.0.0.1:4000"
+export ANTHROPIC_AUTH_TOKEN="not_set"
+export ANTHROPIC_API_KEY="not_set_either!"
+export ANTHROPIC_MODEL="qwen-2"
+
+export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
+export CLAUDE_CODE_ATTRIBUTION_HEADER=0
+export CLAUDE_CODE_ENABLE_TELEMETRY=0
+export CLAUDE_CODE_DISABLE_1M_CONTEXT=1
+export CLAUDE_CODE_MAX_OUTPUT_TOKENS=64000
 
 claude
 ```
 
-A few notes on the values:
+### Option 2: claude-code-router
 
-- The `ANTHROPIC_AUTH_TOKEN` / `ANTHROPIC_API_KEY` values must be non-empty but their content does not matter if your endpoint does not check them; gate access at the endpoint or in front of it (see [Manage gateways](./mlops-with-coding-agents.mdx#manage-gateways) for adding auth via Envoy AI Gateway).
-- `ANTHROPIC_MODEL` must match the model name the endpoint exposes (the `--served-model-name` from your `InferenceService`, or whatever your local runner advertises).
-- The `CLAUDE_CODE_DISABLE_*` and `CLAUDE_CODE_*=0` flags are what actually keep an "on-prem" setup on-prem: without them, Claude Code can still emit non-essential requests to Anthropic-hosted endpoints and ask the model for features (1M context, very large outputs) the on-prem model cannot honor.
+Create a config file at `~/.claude-code-router/config.json` with your `InferenceService` as a provider:
 
-#### Option B: front an OpenAI-compatible endpoint with a translation proxy
+```json
+{
+  "Providers": [
+    {
+      "name": "onprem",
+      "api_base_url": "https://your-inference-service-domain.com/v1/chat/completions",
+      "api_key": "sk-local",
+      "models": ["qwen-2"]
+    }
+  ],
+  "Router": {
+    "default": "onprem,qwen-2"
+  }
+}
+```
 
-If your endpoint is OpenAI-compatible only (for example, a stock vLLM `InferenceService` exposing `/v1/chat/completions` but not `/v1/messages`), run a small gateway that accepts Anthropic-format requests and forwards them. Two common options:
+Then start Claude Code through the router:
 
-- [LiteLLM](https://docs.litellm.ai/) proxy, which exposes an Anthropic-compatible `/v1/messages` endpoint and routes to any backend model.
-- [claude-code-router](https://github.com/musistudio/claude-code-router), a proxy built specifically to point Claude Code at OpenAI-compatible and other backends.
+```bash
+ccr code
+```
+
+The router automatically sets the required `ANTHROPIC_BASE_URL` and other environment variables — no manual `export` needed. The model is selected by the `Router.default` field in the config (format: `provider_name,model_name`). You can also activate the router in your shell first with `eval "$(ccr activate)"` and then run `claude` directly. Inside a running session, switch models with `/model provider_name,model_name`.
 
-Then use the same env-var configuration from Option A, with `ANTHROPIC_BASE_URL` pointing at the proxy and `ANTHROPIC_MODEL` set to the model alias the proxy exposes. Optionally also set `ANTHROPIC_SMALL_FAST_MODEL` to an on-prem model so background/low-cost requests stay on-prem too.
+### Notes for on-premise operation
 
-Regardless of which option you pick, Claude Code's agentic quality depends heavily on the served model's tool-calling fidelity — prefer a strong instruction- and tool-tuned model, and confirm tool calls round-trip end-to-end before relying on it.
+- The `ANTHROPIC_AUTH_TOKEN` / `ANTHROPIC_API_KEY` values (used with the LiteLLM option) must be non-empty but their content does not matter if your proxy and endpoint do not check them; gate access at the endpoint or proxy (see [Manage gateways](./mlops-with-coding-agents.mdx#manage-gateways) for adding auth via Envoy AI Gateway).
+- The `CLAUDE_CODE_DISABLE_*` flags are what actually keep an "on-prem" setup on-prem: without them, Claude Code can still emit non-essential requests to Anthropic-hosted endpoints and ask the model for features (1M context, very large outputs) the on-prem model cannot honor. claude-code-router sets some of these automatically.
+- `ANTHROPIC_MODEL` must match the model name your `InferenceService` exposes (the `--served-model-name`).
+- Optionally set `ANTHROPIC_SMALL_FAST_MODEL` to an on-prem model so background/low-cost requests stay on-prem too.
+
+Claude Code's agentic quality depends heavily on the served model's tool-calling fidelity — prefer a strong instruction- and tool-tuned model, and confirm tool calls round-trip end-to-end before relying on it.
 
 ## Best practices \{#best-practices}
 
+### Recommended model families for coding agents
+
+**Qwen3.6** and **Gemma 4** are the two model families we currently recommend for on-premise coding agents. Both have strong tool-calling support, mature vLLM parsers, and a wide range of sizes and quantization formats available.
+
+| Family | Why it works for coding agents | vLLM `--tool-call-parser` |
+| --- | --- | --- |
+| **Qwen3.6** (Qwen team) | Strong code generation, instruction following, and tool calling. MoE variants (35B-A3B) activate only ~3B parameters per token, giving high throughput at low compute cost (the full set of expert weights must still be loaded into GPU memory). | `hermes` |
+| **Gemma 4** (Google) | Clean instruction tuning, compact sizes (E2B, E4B) that fit on consumer GPUs. Verify tool-calling support in the vLLM version you run; Gemma's parser assignment may vary by vLLM release. | Check [vLLM tool calling docs](https://docs.vllm.ai/en/latest/features/tool_calling.html) |
+| **Qwen3-Coder** (Qwen team) | Code-specialized; the MoE variants (30B-A3B, 480B-A35B) are powerful but require more hardware. | `hermes` |
+
 ### Choose a model that fits your hardware
 
 Start from the GPU memory you have, then pick the largest capable model that leaves headroom for the KV cache. A rough weight-size estimate is `parameters × bytes-per-parameter` — FP16 ≈ 2 bytes, FP8/INT8 ≈ 1 byte, INT4 ≈ 0.5 bytes per parameter — on top of which the KV cache and runtime overhead consume more memory. Leave **15–25% headroom**.
 
-| GPU memory (single GPU) | Example GPUs | Practical coding-model choices |
+#### Quantized models from Unsloth on HuggingFace
+
+[Unsloth](https://huggingface.co/unsloth) publishes GGUF-quantized versions of the latest models. GGUF is the llama.cpp format and vLLM's GGUF support is experimental, so test the exact file you plan to serve. The table below lists the most useful ones for coding agents:
+
+| Model | Format | Active params | VRAM (approx.) | Notes |
+| --- | --- | --- | --- | --- |
+| [`unsloth/gemma-4-E2B-it-qat-GGUF`](https://huggingface.co/unsloth/gemma-4-E2B-it-qat-GGUF) | GGUF (QAT) | 2B | ~4 GB | Fastest option; fits on any GPU |
+| [`unsloth/gemma-4-E4B-it-qat-GGUF`](https://huggingface.co/unsloth/gemma-4-E4B-it-qat-GGUF) | GGUF (QAT) | 4B | ~8 GB | Strong tool-calling at low cost |
+| [`unsloth/gemma-4-12b-it-GGUF`](https://huggingface.co/unsloth/gemma-4-12b-it-GGUF) | GGUF | 12B | ~16 GB | Good balance of speed and quality |
+| [`unsloth/gemma-4-26B-A4B-it-GGUF`](https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF) | GGUF (MoE) | 4B active | ~12 GB | MoE: high quality, low per-token compute |
+| [`unsloth/gemma-4-31B-it-GGUF`](https://huggingface.co/unsloth/gemma-4-31B-it-GGUF) | GGUF | 31B | ~40 GB | Largest Gemma 4 dense model |
+| [`unsloth/Qwen3.6-27B-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF) | GGUF | 27B | ~36 GB | Strong general-purpose coding |
+| [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF) | GGUF (MTP) | 27B | ~36 GB | Multi-token prediction for faster decode |
+| [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) | GGUF (MoE+MTP) | 3B active | ~12 GB | Best quality/cost ratio; MoE + MTP |
+
+> **Note:** GGUF-quantized models load in vLLM via `--quantization gguf`. For AWQ or GPTQ INT4 variants, check [huggingface.co/models](https://huggingface.co/models?sort=trending&search=AWQ) — search for `qwen3.6 AWQ` or `gemma-4 GPTQ` to find community-quantized versions. Unsloth's QAT (quantization-aware training) models typically retain higher quality at aggressive bit-widths than post-hoc quantization.
+
+#### Hardware fit guide
+
+| GPU memory (single GPU) | Example GPUs | Recommended model |
 | --- | --- | --- |
-| 16–24 GB | L4, A10, A30 (24G), RTX 4090 | 7–8B at FP16, or 14B quantized (AWQ/GPTQ INT4) |
-| 40–48 GB | A40, L40S, A6000, A100-40G | 14B at FP16, or 32B quantized (AWQ/GPTQ INT4) |
-| 80 GB | A100-80G, H100, H800 | 32B at FP16, or 70B at INT4 / FP8 |
-| Multi-GPU (2–8×) | 2–8 × 80 GB | 70B+ at FP16 with tensor parallel, or large MoE models |
+| 8–16 GB | L4, A10, RTX 4070 | `gemma-4-E2B` or `gemma-4-E4B` (QAT GGUF) |
+| 16–24 GB | A30 (24G), RTX 4090 | `gemma-4-12B` or `Qwen3.6-35B-A3B` (MoE, 3B active) |
+| 40–48 GB | A40, L40S, A6000 | `Qwen3.6-27B` or `gemma-4-31B` (GGUF) |
+| 80 GB | A100-80G, H100, H800 | `Qwen3.6-27B` at FP16, or `gemma-4-31B` at FP16 |
+| Multi-GPU (2–8×) | 2–8 × 80 GB | `Qwen3-Coder-480B-A35B` (MoE, tensor-parallel) |
 
 Additional selection guidance:
 
 - **Prefer code-specialized, instruction-tuned models** that natively support tool/function calling. If the model card does not mention tool calling, the agent will not be able to edit files reliably.
-- **Confirm a matching vLLM parser exists** for the model (see [Enable tool calling on the runtime](#enable-tool-calling-on-the-runtime)) before committing to it.
+- **Confirm a matching vLLM parser exists** for the model (see [Enable tool calling on the runtime](#enable-tool-calling-on-the-runtime)) before committing to it. Qwen3.6 models use `hermes`; verify Gemma 4's parser in the vLLM docs for your version.
 - **Budget for context length.** Coding agents send large prompts (system prompt + file and repo context). Pick a model whose context window covers your largest expected prompt, and remember that a longer `--max-model-len` consumes more KV cache per request, reducing concurrency.
-- **Quantization is a force multiplier on-premise.** INT4 (AWQ/GPTQ) or FP8 lets you fit a noticeably more capable model in the same VRAM, which usually matters more for agent quality than raw FP16 precision.
+- **Quantization is a force multiplier on-premise.** INT4 (AWQ/GPTQ) or GGUF quantization lets you fit a noticeably more capable model in the same VRAM, which usually matters more for agent quality than raw FP16 precision.
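+- **MoE models are especially efficient.** Qwen3.6-35B-A3B and Gemma 4-26B-A4B activate only 3–4B parameters per token while carrying a larger knowledge base, giving near-dense quality at a fraction of the per-token compute cost. All expert weights still have to be loaded, so size VRAM for the total parameter count, not the active count.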
 
 ### Tune inference service performance
 
@@ -211,7 +292,7 @@ Coding-agent traffic has a distinctive shape: long, highly repetitive prompts (t
 
 "Vibe coding" — iterating quickly by describing intent and letting the agent write the code — works well with a self-hosted model once the basics are right:
 
-1. Start with a 7–14B code model that fits comfortably on your GPU with headroom; a responsive smaller model beats a sluggish larger one for interactive flow.
+1. Start with a Qwen3.6 or Gemma 4 model that fits comfortably on your GPU with headroom; a responsive smaller model beats a sluggish larger one for interactive flow. For 24 GB GPUs, `Qwen3.6-35B-A3B` (MoE) is an excellent starting point.
 2. Set a **low temperature** (around `0–0.2`) for code generation to keep edits deterministic and reduce flailing.
 3. Validate tool calling with one trivial task ("create a file and run it") before attempting anything real.
 4. Keep prompts focused — open or reference only the relevant files so the agent's context stays on-topic and prefill stays cheap.
@@ -226,6 +307,8 @@ Because the model runs inside your cluster, a coding agent backed by an on-premi
 - Author and adjust pipelines and monitoring for your model lifecycle.
 - Close the loop: deploy a model with the agent, then use that same on-premise model to drive further platform operations.
 
+For detailed MLOps workflows — managing InferenceServices, configuring gateways, tuning performance iteratively, and planning fine-tuning runs — see [Run MLOps with Coding Agents and On-Premise LLMs](./mlops-with-coding-agents.mdx).
+
 ## Troubleshooting
 
 - **Agent chats but never edits files or runs commands.** Tool calling is not enabled or the parser does not match the model — see [Enable tool calling on the runtime](#enable-tool-calling-on-the-runtime).
@@ -236,6 +319,7 @@ Because the model runs inside your cluster, a coding agent backed by an on-premi
 
 ## References
 
+- [Run MLOps with Coding Agents and On-Premise LLMs](./mlops-with-coding-agents.mdx)
 - [Create Inference Service using CLI](../model_inference/inference_service/how_to/create_inference_service_cli.mdx)
 - [Configure External Access for Inference Services](../model_inference/inference_service/how_to/external_access_inference_service.mdx)
 - [Configure Scaling for Inference Services](../model_inference/inference_service/how_to/autoscale_settings.mdx)
@@ -244,8 +328,6 @@ Because the model runs inside your cluster, a coding agent backed by an on-premi
 - [Extend Inference Runtimes](../model_inference/inference_service/how_to/custom_inference_runtime.mdx)
 - [Tool Calling — vLLM](https://docs.vllm.ai/en/latest/features/tool_calling.html)
 - [Automatic Prefix Caching — vLLM](https://docs.vllm.ai/en/latest/features/automatic_prefix_caching.html)
-- [opencode documentation](https://opencode.ai/docs/)
-- [Codex CLI](https://github.com/openai/codex)
 - [Claude Code documentation](https://docs.anthropic.com/en/docs/claude-code)
 - [LiteLLM](https://docs.litellm.ai/)
 - [claude-code-router](https://github.com/musistudio/claude-code-router)

From b0becaf2ec86144e4775df2b88be7f236e1e47b8 Mon Sep 17 00:00:00 2001
From: Wu Yi <typhoonzero1986@gmail.com>
Date: Tue, 9 Jun 2026 11:33:28 +0800
Subject: [PATCH 02/21] docs: fix coding agent inference guide

---
 .../coding-agents-with-inference-service.mdx  | 144 +++++++++++++-----
 1 file changed, 109 insertions(+), 35 deletions(-)

diff --git a/docs/en/agentic_mlops/coding-agents-with-inference-service.mdx b/docs/en/agentic_mlops/coding-agents-with-inference-service.mdx
index 6bb2a2a..7a5fc01 100644
--- a/docs/en/agentic_mlops/coding-agents-with-inference-service.mdx
+++ b/docs/en/agentic_mlops/coding-agents-with-inference-service.mdx
@@ -10,9 +10,9 @@ i18n:
 
 ## Introduction
 
-[Claude Code](https://docs.anthropic.com/en/docs/claude-code) is a terminal-based coding agent that reads your repository, plans changes, edits files, and runs commands on your behalf. It normally talks to Anthropic's hosted models over the internet.
+Coding agents such as [opencode](https://opencode.ai/), [Codex CLI](https://github.com/openai/codex), and [Claude Code](https://docs.anthropic.com/en/docs/claude-code) are terminal-based assistants that read your repository, plan changes, edit files, and run commands on your behalf. They normally talk to a hosted model provider over the internet.
 
-This document shows how to point Claude Code at a model you serve yourself on Alauda AI, so that your source code, prompts, and infrastructure configuration never leave your cluster. The same on-premise `InferenceService` that you deploy for any other workload can back an interactive coding agent, as long as it exposes an **OpenAI-compatible API** and has **tool (function) calling** enabled. Because Claude Code speaks the Anthropic Messages API (`/v1/messages`), you front your `InferenceService` with a lightweight translation proxy (see [Step 3](#step-3-connect-claude-code-with-a-translation-proxy)).
+This document shows how to point those agents at a model you serve yourself on Alauda AI, so that your source code, prompts, and infrastructure configuration never leave your cluster. The same on-premise `InferenceService` that you deploy for any other workload can back an interactive coding agent, as long as it exposes an **OpenAI-compatible API** and has **tool (function) calling** enabled. opencode and Codex CLI can call that endpoint directly; Claude Code speaks the Anthropic Messages API (`/v1/messages`) and needs a lightweight translation proxy (see [Claude Code](#claude-code)).
 
 This page builds directly on the deployment how-tos. It does not repeat how to create or expose an `InferenceService`; instead it links to them and focuses on the agent-specific configuration and tuning.
 
@@ -23,24 +23,30 @@ Coding agents and their configuration formats evolve quickly. The config snippet
 ## Prerequisites
 
 - A running, ready `InferenceService` that serves an OpenAI-compatible API. See [Create Inference Service using CLI](../model_inference/inference_service/how_to/create_inference_service_cli.mdx).
-- Network access from the machine running Claude Code to the service endpoint. For access from a developer laptop outside the cluster, see [Configure External Access for Inference Services](../model_inference/inference_service/how_to/external_access_inference_service.mdx).
-- A model with **tool/function calling** support, served with the matching vLLM parser enabled (see [Enable tool calling on the runtime](#enable-tool-calling-on-the-runtime)). Without this, the agent can chat but cannot edit files or run commands.
-- Claude Code installed locally (`claude`).
-- A translation proxy (LiteLLM or claude-code-router) to bridge Claude Code's Anthropic Messages API to the OpenAI-compatible endpoint (see [Step 3](#step-3-connect-claude-code-with-a-translation-proxy)).
+- Network access from the machine running the agent to the service endpoint. For access from a developer laptop outside the cluster, see [Configure External Access for Inference Services](../model_inference/inference_service/how_to/external_access_inference_service.mdx).
+- A model with **tool/function calling** support, served with the matching vLLM parser enabled (see [Enable tool calling on the runtime](#enable-tool-calling-on-the-runtime)). Without this, agents can chat but cannot edit files or run commands.
+- The agent CLI installed locally (`opencode`, `codex`, or `claude`).
+- For Claude Code, a translation proxy (LiteLLM or claude-code-router) to bridge Claude Code's Anthropic Messages API to the OpenAI-compatible endpoint (see [Claude Code](#claude-code)).
 
 ## How the pieces fit together
 
 ```text
+  opencode / Codex CLI
+        │  OpenAI Chat Completions API  (POST /v1/chat/completions)
+        ▼
+  External access / Load Balancer  ──►  KServe InferenceService (vLLM)
+
   Claude Code
         │  Anthropic Messages API  (POST /v1/messages)
         ▼
   Translation proxy (LiteLLM / claude-code-router)
         │  OpenAI Chat Completions API  (POST /v1/chat/completions)
         ▼
-  KServe InferenceService (vLLM)
+  same InferenceService endpoint
 ```
 
-Claude Code speaks the Anthropic Messages API (`/v1/messages`), while your `InferenceService` exposes an OpenAI-compatible endpoint (`/v1/chat/completions`). A lightweight translation proxy bridges the two.
+- **opencode** and **Codex CLI** speak the OpenAI Chat Completions API natively, so they can call the `InferenceService` endpoint directly.
+- **Claude Code** speaks the Anthropic Messages API, which vLLM does not serve. It requires a small translation proxy in front of the OpenAI-compatible endpoint (see [Claude Code](#claude-code)).
 
 ## Step 1: Deploy and smoke-test the endpoint
 
@@ -67,7 +73,7 @@ curl -sS ${BASE_URL}/chat/completions \
 A normal JSON completion confirms the endpoint is reachable and the model name is correct. Note the three values you will reuse for every agent: **base URL** (ending in `/v1`), **model name** (the `--served-model-name`), and **API key**.
 
 :::tip
-For reasoning models (DeepSeek R1, QwQ, etc.), also add `--reasoning-parser` to the vLLM launch flags. See [Configure reasoning models and reasoning effort](#configure-reasoning-effort).
+For reasoning models (DeepSeek R1, QwQ, Qwen3, etc.), also add the matching `--reasoning-parser` to the vLLM launch flags. See [Configure reasoning models and reasoning effort](#configure-reasoning-effort).
 :::
 
 ## Step 2: Enable tool calling on the runtime \{#enable-tool-calling-on-the-runtime}
@@ -79,7 +85,7 @@ Coding agents work by calling tools (read file, write file, run shell). This req
 --tool-call-parser hermes        # match the parser to your model family
 ```
 
-- The parser must match the model. For example, Qwen2.5 / Qwen3 family models commonly use `hermes`; Llama 3.x models use `llama3_json`; Mistral models use `mistral`. Check the [vLLM tool calling documentation](https://docs.vllm.ai/en/latest/features/tool_calling.html) for the current parser list and the value that matches your model.
+- The parser must match the model. For example, Qwen2.5 and QwQ-32B commonly use `hermes`; Qwen3-Coder uses `qwen3_xml`; Llama 3.x models use `llama3_json`; Mistral models use `mistral`. Check the [vLLM tool calling documentation](https://docs.vllm.ai/en/latest/features/tool_calling.html) for the current parser list and the value that matches your model.
 - Some models need a specific chat template to emit tool calls correctly; pass `--chat-template` if the model card calls for it.
 - If you serve a reasoning model, also enable the matching `--reasoning-parser` so the agent receives clean assistant content separated from reasoning traces.
 
@@ -91,7 +97,7 @@ Some models (for example, DeepSeek R1, QwQ, Hunyuan, or Cohere Command A Reasoni
 
 ### Server-side flags
 
-Add `--reasoning-parser` to your vLLM launch command, paired with the appropriate tool-call parser:
+Add `--reasoning-parser` to your vLLM launch command. If the same model also needs agent tool calls, pair it with the appropriate `--tool-call-parser`:
 
 ```bash
 --enable-auto-tool-choice \
@@ -103,50 +109,115 @@ The table below shows common model families and their required parsers. Confirm
 
 | Model family | `--tool-call-parser` | `--reasoning-parser` | Notes |
 | --- | --- | --- | --- |
-| DeepSeek R1 (`deepseek-ai/DeepSeek-R1-*`) | `deepseek_v3` | *(none required)* | Also needs `--chat-template examples/tool_chat_template_deepseekr1.jinja` |
-| QwQ / Qwen reasoning (`Qwen/QwQ-*`) | `hermes` | *(none required)* | QwQ's reasoning is handled by the chat template; no separate reasoning parser needed |
-| Hunyuan-A13B-Instruct | `deepseek_v3` | `hunyuan_a13b` | Tencent's reasoning model |
-| Cohere Command A Reasoning | `cohere` | `cohere_command3` | Cohere's reasoning model |
+| DeepSeek R1 (`deepseek-ai/DeepSeek-R1-*`) | `deepseek_v3` for DeepSeek-R1 tool calling | `deepseek_r1` | DeepSeek-R1-0528 tool calling also needs `--chat-template examples/tool_chat_template_deepseekr1.jinja` |
+| QwQ (`Qwen/QwQ-32B`) | `hermes` | `deepseek_r1` | QwQ uses Hermes-style tool calls and DeepSeek-style reasoning tags |
+| Qwen3 reasoning (`Qwen/Qwen3-*`) | Check the current vLLM docs for the exact Qwen variant | `qwen3` | Qwen3 reasoning is enabled by default; disable it with `chat_template_kwargs` if needed |
+| Hunyuan-A13B-Instruct | `hunyuan_a13b` | `hunyuan_a13b` | Use both parsers when serving the reasoning mode with tool calls |
+| Cohere Command A Reasoning | `cohere_command3` | `cohere_command3` | Requires the optional `cohere_melody` package |
 
 For model families not listed above, check the model card for reasoning instructions and the [vLLM tool calling documentation](https://docs.vllm.ai/en/latest/features/tool_calling.html) for the matching parser pair.
 
-### Configuring reasoning effort when starting the inference service
+### Configuring reasoning effort and thinking behavior
 
 Reasoning effort controls how much the model "thinks" before answering. For coding agents you typically want **low** reasoning effort to keep interactive latency acceptable — many short, low-reasoning turns beat a single long, high-reasoning one.
 
-#### Server-side default
+#### Server-side defaults
 
-You can set a default reasoning effort at the vLLM server level so every request uses it unless the client overrides:
+vLLM does not expose a generic `--reasoning-effort` launch flag. For server-wide defaults, configure the model's chat-template thinking knobs instead. For example, Qwen3 enables thinking by default, so you can disable it server-wide:
 
 ```bash
---enable-reasoning \
---reasoning-parser <reasoning-parser> \
---reasoning-effort low
+--reasoning-parser qwen3 \
+--default-chat-template-kwargs '{"enable_thinking": false}'
 ```
 
-`--reasoning-effort` accepts `"low"`, `"medium"` (default), or `"high"`. Setting it to `"low"` on the server ensures that even clients that don't specify reasoning effort get the responsive behavior you want for coding agents.
-
-Not all vLLM releases support `--reasoning-effort` as a launch flag. If your version doesn't recognize it, use the request-time method below.
+For models whose templates expose a different thinking switch, use the key documented by the model or vLLM. Request-level `chat_template_kwargs` override the server default.
 
 #### Request-time override
 
-DeepSeek R1 and compatible models also accept a `reasoning_effort` parameter in the request body:
+Models that support OpenAI-style reasoning effort accept `reasoning_effort` as a top-level request parameter:
 
 ```json
 {
-  "model": "deepseek-r1",
+  "model": "google/gemma-4-26B-A4B-it",
   "messages": [{"role": "user", "content": "..."}],
-  "extra_body": {
-    "reasoning_effort": "low"
+  "reasoning_effort": "low"
+}
+```
+
+When using the OpenAI Python client, pass it as a normal argument:
+
+```python
+client.chat.completions.create(
+    model="google/gemma-4-26B-A4B-it",
+    messages=[{"role": "user", "content": "..."}],
+    reasoning_effort="low",
+)
+```
+
+For parsers that support an explicit thinking budget, you can also cap reasoning tokens per request:
+
+```json
+{
+  "model": "Qwen/Qwen3-0.6B",
+  "messages": [{"role": "user", "content": "..."}],
+  "thinking_token_budget": 256
+}
+```
+
+When using a translation proxy (LiteLLM or claude-code-router), confirm the proxy version passes through these vLLM/OpenAI extension fields before relying on them.
+
+**Note:** QwQ and Qwen3-style reasoning do not use `reasoning_effort` for a simple low/medium/high knob. Control their behavior with `chat_template_kwargs`, `thinking_token_budget`, or `max_tokens`, depending on what your model and vLLM version support.
+
+## Step 3: Connect your coding agent
+
+### opencode
+
+opencode reads configuration from `opencode.json` in the project root or `~/.config/opencode/opencode.json`. Define a custom OpenAI-compatible provider that points at your endpoint:
+
+```json
+{
+  "$schema": "https://opencode.ai/config.json",
+  "provider": {
+    "onprem": {
+      "npm": "@ai-sdk/openai-compatible",
+      "name": "On-Prem Alauda AI",
+      "options": {
+        "baseURL": "https://your-inference-service-domain.com/v1",
+        "apiKey": "{env:ONPREM_API_KEY}"
+      },
+      "models": {
+        "qwen-2": {
+          "name": "Qwen2.5-Coder (on-prem)"
+        }
+      }
+    }
   }
 }
 ```
 
-When using a translation proxy (LiteLLM or claude-code-router), the proxy forwards this parameter to the backend automatically.
+- The model key (`qwen-2`) must match the `--served-model-name` of the `InferenceService`.
+- Export the key the config references, then select the model: `export ONPREM_API_KEY=sk-local` and choose `onprem/qwen-2` with the `/models` command inside opencode.
+
+### Codex CLI
+
+Codex CLI reads `~/.codex/config.toml`. Register your endpoint as a model provider and select it:
+
+```toml
+model = "qwen-2"
+model_provider = "onprem"
+
+[model_providers.onprem]
+name = "On-Prem Alauda AI"
+base_url = "https://your-inference-service-domain.com/v1"
+env_key = "ONPREM_API_KEY"
+wire_api = "chat"
+```
 
-**Note:** Qwen reasoning models (QwQ) do not have a separate reasoning-effort knob. Control reasoning depth indirectly through the chat template or by passing `max_tokens` to cap how long the reasoning chain can grow.
+- `base_url` must end at `/v1`; `model` must match the `--served-model-name`.
+- `env_key` names the environment variable that holds the API key: `export ONPREM_API_KEY=sk-local`.
+- Use `wire_api = "chat"` for vLLM's OpenAI Chat Completions API.
 
-## Step 3: Connect Claude Code with a Translation Proxy \{#step-3-connect-claude-code-with-a-translation-proxy}
+### Claude Code \{#claude-code}
 
 Claude Code communicates over the Anthropic Messages API (`/v1/messages`), while your `InferenceService` exposes an OpenAI-compatible endpoint (`/v1/chat/completions`). Bridge the two by running a translation proxy in front of your endpoint. Two common options:
 
@@ -225,13 +296,13 @@ Claude Code's agentic quality depends heavily on the served model's tool-calling
 
 ### Recommended model families for coding agents
 
-**Qwen3.6** and **Gemma 4** are the two model families we currently recommend for on-premise coding agents. Both have strong tool-calling support, mature vLLM parsers, and a wide range of sizes and quantization formats available.
+**Qwen3.6** and **Gemma 4** are the two model families we currently recommend for on-premise coding agents. Both have strong instruction tuning and a wide range of sizes and quantization formats available; verify tool-calling parser support against the vLLM version you run.
 
 | Family | Why it works for coding agents | vLLM `--tool-call-parser` |
 | --- | --- | --- |
-| **Qwen3.6** (Qwen team) | Strong code generation, instruction following, and tool calling. MoE variants (35B-A3B) activate only ~3B parameters per token, giving high throughput at low VRAM cost. | `hermes` |
+| **Qwen3.6** (Qwen team) | Strong code generation and instruction following. MoE variants (35B-A3B) activate only ~3B parameters per token, giving high throughput at low VRAM cost. | Check [vLLM tool calling docs](https://docs.vllm.ai/en/latest/features/tool_calling.html) |
 | **Gemma 4** (Google) | Clean instruction tuning, compact sizes (E2B, E4B) that fit on consumer GPUs. Verify tool-calling support in the vLLM version you run; Gemma's parser assignment may vary by vLLM release. | Check [vLLM tool calling docs](https://docs.vllm.ai/en/latest/features/tool_calling.html) |
-| **Qwen3-Coder** (Qwen team) | Code-specialized; the MoE variants (30B-A3B, 480B-A35B) are powerful but require more hardware. | `hermes` |
+| **Qwen3-Coder** (Qwen team) | Code-specialized; the MoE variants (30B-A3B, 480B-A35B) are powerful but require more hardware. | `qwen3_xml` |
 
 ### Choose a model that fits your hardware
 
@@ -267,7 +338,7 @@ Start from the GPU memory you have, then pick the largest capable model that lea
 Additional selection guidance:
 
 - **Prefer code-specialized, instruction-tuned models** that natively support tool/function calling. If the model card does not mention tool calling, the agent will not be able to edit files reliably.
-- **Confirm a matching vLLM parser exists** for the model (see [Enable tool calling on the runtime](#enable-tool-calling-on-the-runtime)) before committing to it. Qwen3.6 models use `hermes`; verify Gemma 4's parser in the vLLM docs for your version.
+- **Confirm a matching vLLM parser exists** for the model (see [Enable tool calling on the runtime](#enable-tool-calling-on-the-runtime)) before committing to it. Qwen3-Coder models use `qwen3_xml`; verify Qwen3.6 and Gemma 4 parser support in the vLLM docs for your version.
 - **Budget for context length.** Coding agents send large prompts (system prompt + file and repo context). Pick a model whose context window covers your largest expected prompt, and remember that a longer `--max-model-len` consumes more KV cache per request, reducing concurrency.
 - **Quantization is a force multiplier on-premise.** INT4 (AWQ/GPTQ) or GGUF quantization lets you fit a noticeably more capable model in the same VRAM, which usually matters more for agent quality than raw FP16 precision.
 - **MoE models are especially efficient.** Qwen3.6-35B-A3B and Gemma 4-26B-A4B activate only 3–4B parameters per token while carrying a larger knowledge base, giving near-dense quality at a fraction of the VRAM cost.
@@ -327,7 +398,10 @@ For detailed MLOps workflows — managing InferenceServices, configuring gateway
 - [Speculative Decoding for vLLM Inference Services](../model_inference/inference_service/how_to/vllm_speculative_decoding.mdx)
 - [Extend Inference Runtimes](../model_inference/inference_service/how_to/custom_inference_runtime.mdx)
 - [Tool Calling — vLLM](https://docs.vllm.ai/en/latest/features/tool_calling.html)
+- [Reasoning Outputs — vLLM](https://docs.vllm.ai/en/latest/features/reasoning_outputs/)
 - [Automatic Prefix Caching — vLLM](https://docs.vllm.ai/en/latest/features/automatic_prefix_caching.html)
+- [opencode documentation](https://opencode.ai/docs/)
+- [Codex CLI](https://github.com/openai/codex)
 - [Claude Code documentation](https://docs.anthropic.com/en/docs/claude-code)
 - [LiteLLM](https://docs.litellm.ai/)
 - [claude-code-router](https://github.com/musistudio/claude-code-router)

From 7871e1ffdbf0ea6e68331c0d641d4d8c4ce957b5 Mon Sep 17 00:00:00 2001
From: Wu Yi <typhoonzero1986@gmail.com>
Date: Tue, 9 Jun 2026 14:41:28 +0800
Subject: [PATCH 03/21] docs: remove non-existent
 --default-chat-template-kwargs flag

The flag does not exist in vLLM. Replaced with accurate guidance about
server-wide control via --chat-template and request-level parameters.
---
 .../coding-agents-with-inference-service.mdx             | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/docs/en/agentic_mlops/coding-agents-with-inference-service.mdx b/docs/en/agentic_mlops/coding-agents-with-inference-service.mdx
index 7a5fc01..e598ab7 100644
--- a/docs/en/agentic_mlops/coding-agents-with-inference-service.mdx
+++ b/docs/en/agentic_mlops/coding-agents-with-inference-service.mdx
@@ -123,14 +123,7 @@ Reasoning effort controls how much the model "thinks" before answering. For codi
 
 #### Server-side defaults
 
-vLLM does not expose a generic `--reasoning-effort` launch flag. For server-wide defaults, configure the model's chat-template thinking knobs instead. For example, Qwen3 enables thinking by default, so you can disable it server-wide:
-
-```bash
---reasoning-parser qwen3 \
---default-chat-template-kwargs '{"enable_thinking": false}'
-```
-
-For models whose templates expose a different thinking switch, use the key documented by the model or vLLM. Request-level `chat_template_kwargs` override the server default.
+vLLM does not expose a generic `--reasoning-effort` launch flag. Server-wide control is achieved through the model's chat template — you can supply a custom Jinja template that disables thinking by default, then pass it with `--chat-template`. Alternatively, some models and vLLM versions expose per-model template kwargs; check the vLLM release notes for the specific key. For a quick start, request-level parameters (see below) are the most portable approach.
 
 #### Request-time override
 

From 285e68dc61d9ebd7bb03c81750f9b7192c5a9e34 Mon Sep 17 00:00:00 2001
From: Wu Yi <typhoonzero1986@gmail.com>
Date: Tue, 9 Jun 2026 16:02:28 +0800
Subject: [PATCH 04/21] docs: clarify vllm reasoning effort support

---
 .../coding-agents-with-inference-service.mdx  | 22 +++++++++++--------
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/docs/en/agentic_mlops/coding-agents-with-inference-service.mdx b/docs/en/agentic_mlops/coding-agents-with-inference-service.mdx
index e598ab7..d16fab5 100644
--- a/docs/en/agentic_mlops/coding-agents-with-inference-service.mdx
+++ b/docs/en/agentic_mlops/coding-agents-with-inference-service.mdx
@@ -123,27 +123,31 @@ Reasoning effort controls how much the model "thinks" before answering. For codi
 
 #### Server-side defaults
 
-vLLM does not expose a generic `--reasoning-effort` launch flag. Server-wide control is achieved through the model's chat template — you can supply a custom Jinja template that disables thinking by default, then pass it with `--chat-template`. Alternatively, some models and vLLM versions expose per-model template kwargs; check the vLLM release notes for the specific key. For a quick start, request-level parameters (see below) are the most portable approach.
+vLLM does not expose a generic `--reasoning-effort` launch flag. Server-wide control is achieved through the model's chat template: you can supply a custom Jinja template that disables thinking by default, then pass it with `--chat-template`. Alternatively, some models and vLLM versions expose per-model template kwargs; check the vLLM release notes for the specific key.
 
-#### Request-time override
+#### Request-time controls
 
-Models that support OpenAI-style reasoning effort accept `reasoning_effort` as a top-level request parameter:
+Do not assume every vLLM-backed `InferenceService` accepts `reasoning_effort`. Support depends on the vLLM version, OpenAI-compatible server implementation, model, and chat template. If the service rejects unknown request fields, `reasoning_effort` can fail even when the model itself supports reasoning.
+
+Prefer model-specific controls that your deployed vLLM service documents. For example, Qwen3-style templates commonly use `chat_template_kwargs` to enable or disable thinking:
 
 ```json
 {
-  "model": "google/gemma-4-26B-A4B-it",
+  "model": "Qwen/Qwen3-8B",
   "messages": [{"role": "user", "content": "..."}],
-  "reasoning_effort": "low"
+  "chat_template_kwargs": {
+    "enable_thinking": false
+  }
 }
 ```
 
-When using the OpenAI Python client, pass it as a normal argument:
+When using the OpenAI Python client, pass vLLM-specific request fields through `extra_body`:
 
 ```python
 client.chat.completions.create(
-    model="google/gemma-4-26B-A4B-it",
+    model="Qwen/Qwen3-8B",
     messages=[{"role": "user", "content": "..."}],
-    reasoning_effort="low",
+    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
 )
 ```
 
@@ -159,7 +163,7 @@ For parsers that support an explicit thinking budget, you can also cap reasoning
 
 When using a translation proxy (LiteLLM or claude-code-router), confirm the proxy version passes through these vLLM/OpenAI extension fields before relying on them.
 
-**Note:** QwQ and Qwen3-style reasoning do not use `reasoning_effort` for a simple low/medium/high knob. Control their behavior with `chat_template_kwargs`, `thinking_token_budget`, or `max_tokens`, depending on what your model and vLLM version support.
+Only use `reasoning_effort` after you verify that your exact vLLM image and model template accept it. On supported deployments, it can be sent as a top-level Chat Completions field such as `"reasoning_effort": "low"`; on unsupported deployments, use `chat_template_kwargs`, `thinking_token_budget`, or `max_tokens` instead.
 
 ## Step 3: Connect your coding agent
 

From b18b5cd67a1904845836efc51c37121c224bdad9 Mon Sep 17 00:00:00 2001
From: Wu Yi <typhoonzero1986@gmail.com>
Date: Tue, 9 Jun 2026 18:14:21 +0800
Subject: [PATCH 05/21] docs: refine agentic mlops tuning guidance

---
 .../mlops-with-coding-agents.mdx              | 25 ++++++++++++-------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/docs/en/agentic_mlops/mlops-with-coding-agents.mdx b/docs/en/agentic_mlops/mlops-with-coding-agents.mdx
index 38221bb..0e1eef2 100644
--- a/docs/en/agentic_mlops/mlops-with-coding-agents.mdx
+++ b/docs/en/agentic_mlops/mlops-with-coding-agents.mdx
@@ -77,7 +77,7 @@ For the exact field shape of each CRD, defer to the upstream documentation linke
 
 ## Tune service performance to fit your hardware \{#tune-performance}
 
-The list of vLLM and KServe knobs is unchanged from [Best practices: tune inference service performance](./coding-agents-with-inference-service.mdx#best-practices) — this section focuses on how an agent can *drive* that tuning instead of you doing it by hand.
+This section focuses on how an agent can *drive* tuning instead of you doing it by hand. The server-side knobs differ by runtime — for vLLM, see [Best practices: tune inference service performance](./coding-agents-with-inference-service.mdx#best-practices); for llama.cpp (GGUF), the relevant flags include `--cache-type-k`/`--cache-type-v` (KV cache quantization), `--ctx-size` (context window), `--parallel` (concurrency), and `--reasoning`/`--reasoning-budget` (reasoning control).
 
 A productive loop:
 
@@ -91,32 +91,38 @@ Pin numbers before tuning. Tell the agent what "good enough" looks like:
 - Maximum P95 inter-token latency or total response time for a representative prompt.
 - Minimum sustainable throughput (requests/min or tokens/sec).
 - Maximum context length the agent traffic will send.
+- GPU memory headroom target (leave ≥10% free for KV cache growth).
 
 ### 2. Generate a reproducible benchmark
 
-Ask the agent to write a small benchmark script that mirrors your real traffic — typical prompt size, system prompt, concurrency. Useful starting points include the built-in `vllm bench serve` command, `genai-perf`, or a `k6`/Python script that drives `/v1/chat/completions` directly. Have the agent run it against the current `InferenceService` and record the results in a markdown table.
+Ask the agent to write a benchmark script that mirrors your real traffic — typical prompt size, system prompt, concurrency. Useful starting points include `vllm bench serve`, `genai-perf`, or a `k6`/Python script driving `/v1/chat/completions` directly.
+
+The agent should record **TTFT** (time to first token), **ITL** (inter-token latency in ms/token), and **TPS** (tokens per second) for both streaming and non-streaming requests. If the model is served with reasoning enabled, the agent must also measure the reasoning-overhead ratio, `reasoning_tokens / content_tokens`, because reasoning tokens inflate total latency and raw TPS without contributing to the user-visible output.
+
+Ensure the server exposes metrics. For vLLM, scrape the OpenAI-compatible server's built-in `/metrics` endpoint. For llama.cpp, enable the metrics endpoint with `--metrics` before scraping `/metrics`. If the metrics endpoint is unavailable, fall back to per-request latency measurement via the API.
 
 ### 3. Have the agent propose one change at a time
 
-Give the agent the benchmark output and the current YAML. Ask for **one** change with an expected effect, for example:
+Give the agent the benchmark output and current YAML. Ask for **one** change with an expected effect, for example:
 
-- "Add `--enable-prefix-caching` and re-run; expected: lower TTFT on the repeated system-prompt prefix."
-- "Switch the model from FP16 to AWQ INT4 and raise `--gpu-memory-utilization` to 0.92; expected: more KV cache headroom, larger sustainable context length."
-- "Increase `--max-num-seqs`; expected: higher throughput at the cost of higher P95 latency."
+- "Add `--enable-prefix-caching` and re-run; expected: lower TTFT on repeated system-prompt prefixes (vLLM)."
+- "Switch KV cache to `--cache-type-k q8_0 --cache-type-v q8_0` and re-run; expected: fit more context in limited GPU memory (llama.cpp / GGUF)."
+- "Set `--reasoning-budget 2048` instead of `-1`; expected: bounded reasoning overhead, more tokens available for content output (llama.cpp)."
+- "Increase `--max-num-seqs`; expected: higher throughput at the cost of higher P95 latency (vLLM)."
 
 One change per iteration keeps cause and effect attributable.
 
 ### 4. Apply, measure, and record
 
-The agent updates the `InferenceService` YAML, applies it, waits for `READY`, re-runs the benchmark, and appends a new row to the results table with the configuration delta.
+The agent updates the `InferenceService` YAML, applies it, waits for `READY`, and re-runs the benchmark. It should also check GPU-level indicators: utilization (expect 70–90% for sustained inference), memory usage (flag if >95%), and power draw. Each run appends a row to a markdown table with the configuration delta, TTFT, ITL, TPS, and GPU metrics.
 
 ### 5. Stop on SLO or hardware ceiling
 
-The loop ends when SLOs are met, or when the next sensible knob is "different hardware" or "different model" — at which point the agent should say so explicitly rather than churn. Common ceilings: KV cache saturated at the target context length, tensor-parallel scaling no longer linear, decode-bound at single-request latency.
+The loop ends when SLOs are met, or when the next sensible knob is "different hardware" or "different model" — at which point the agent should say so explicitly rather than churn. Common ceilings: KV cache saturated at the target context length, tensor-parallel scaling no longer linear, decode-bound at single-request latency, or GPU memory headroom below 5%.
 
 </Steps>
 
-For model-size vs. GPU-memory selection, see the table in the prior doc's [Choose a model that fits your hardware](./coding-agents-with-inference-service.mdx#best-practices) section. For autoscaling and cold-start trade-offs, see [Configure Scaling for Inference Services](../model_inference/inference_service/how_to/autoscale_settings.mdx). For interactive-latency wins, see [Speculative Decoding for vLLM Inference Services](../model_inference/inference_service/how_to/vllm_speculative_decoding.mdx).
+For model-size vs. GPU-memory selection, see the prior doc's [Choose a model that fits your hardware](./coding-agents-with-inference-service.mdx#best-practices) section. For autoscaling and cold-start trade-offs, see [Configure Scaling for Inference Services](../model_inference/inference_service/how_to/autoscale_settings.mdx). For interactive-latency wins, see [Speculative Decoding for vLLM Inference Services](../model_inference/inference_service/how_to/vllm_speculative_decoding.mdx).
 
 ## Plan fine-tuning and generate reports \{#fine-tuning-plans-and-reports}
 
@@ -241,6 +247,7 @@ Each step is a separate prompt with its own diff to review. The agent is the typ
 - **Always `--dry-run=server`.** Make it a standing rule in the agent context file; mention it in every prompt that involves `kubectl apply`.
 - **One change per iteration.** Especially for performance tuning, mixing two changes hides which one helped.
 - **Never let the agent fabricate metrics.** Require it to cite the file, log, or run ID it pulled each number from, and to mark `TODO` when data is missing.
+- **Account for reasoning overhead.** When benchmarking reasoning-enabled models, report both total tokens and reasoning-vs-content breakdown. A model emitting 8,000 reasoning tokens before 50 content tokens has 160:1 overhead — that dominates latency and TPS. For llama.cpp, use server-side `--reasoning-budget` to bound it; for other runtimes, use only documented request-time controls that your deployed service accepts.
 - **Keep the loop on-prem.** Confirm that no fallback model in any agent config points at a hosted provider (see [Connect your coding agent](./coding-agents-with-inference-service.mdx) for the per-agent settings to check).
 - **Commit everything.** Plans, reports, generated YAML, and benchmark scripts all go into Git so the next person — or the next agent — can pick up where you left off.
 

From 3c79b62b25e009821c715a7821681a329635bdba Mon Sep 17 00:00:00 2001
From: Wuyi <wuyi@alauda.cn>
Date: Fri, 12 Jun 2026 13:48:09 +0800
Subject: [PATCH 06/21] docs: fix lint error in pipelines-mlflow-integration
 guide

- Remove list preceding code block to avoid remark-lint-code-block-split-list
- Replace Python dict literals with dict() constructor to avoid JSX parsing
---
 AGENTS.md                                     |  23 +
 docs/en/agentic_mlops/index.mdx               |   6 +
 .../assets/build-train-image/run_build.sh     |  10 +
 docs/en/kubeflow/how_to/kf-local-queue.yaml   |   7 +
 .../how_to/kf-trainingruntime-npu.yaml        | 401 ++++++++++++++++++
 docs/en/kubeflow/how_to/kf-trainjob-npu.yaml  | 110 +++++
 .../how_to/pipelines-mlflow-integration.mdx   | 267 ++++++++++++
 .../how_to/qwen3_finetune_verify.ipynb        | 390 +++++++++++++++++
 docs/en/training_guides/index.mdx             |   1 +
 .../pipelines-mlflow-integration.mdx          | 266 ++++++++++++
 e2e/lib.sh                                    |  30 +-
 11 files changed, 1508 insertions(+), 3 deletions(-)
 create mode 100644 AGENTS.md
 create mode 100644 docs/en/agentic_mlops/index.mdx
 create mode 100644 docs/en/kubeflow/how_to/assets/build-train-image/run_build.sh
 create mode 100644 docs/en/kubeflow/how_to/kf-local-queue.yaml
 create mode 100644 docs/en/kubeflow/how_to/kf-trainingruntime-npu.yaml
 create mode 100644 docs/en/kubeflow/how_to/kf-trainjob-npu.yaml
 create mode 100644 docs/en/kubeflow/how_to/pipelines-mlflow-integration.mdx
 create mode 100644 docs/en/kubeflow/how_to/qwen3_finetune_verify.ipynb
 create mode 100644 docs/en/training_guides/pipelines-mlflow-integration.mdx

diff --git a/AGENTS.md b/AGENTS.md
new file mode 100644
index 0000000..6817188
--- /dev/null
+++ b/AGENTS.md
@@ -0,0 +1,23 @@
+# Repository Guidelines
+
+## Project Structure & Module Organization
+`docs/en/` contains the English documentation, organized by product area such as `kubeflow/`, `workbench/`, and `model_inference/`. Section landing pages usually live in `index.mdx`, with supporting content under folders like `overview/`, `how_to/`, `functions/`, and `trouble_shooting/`. Put downloadable files and images in `docs/public/` or the nearest section asset folder. Shared generated resources live in `docs/shared/` (`crds/`, `openapis/`, `roletemplates/`, `functionresources/`). Root config lives in `doom.config.yml`, `sites.yaml`, `eslint.config.js`, and `cspell.config.js`. CI definitions are under `.builds/`.
+
+## Build, Test, and Development Commands
+Use Yarn 4 for all local work:
+
+- `yarn install`: install dependencies.
+- `yarn dev`: start the Doom dev server with live reload.
+- `yarn lint`: run repository lint checks before committing.
+- `yarn build`: produce the static site in `dist/`.
+- `yarn serve`: preview the built site locally.
+- `yarn translate` / `yarn export`: run Doom translation or export workflows when needed.
+
+## Coding Style & Naming Conventions
+Follow `.editorconfig`: 2-space indentation, LF line endings, UTF-8, and a final newline. Prettier is the formatter; its current rules prefer single quotes and no semicolons. Keep MDX concise and use descriptive headings. Match the surrounding directory’s naming pattern; for new pages, prefer lowercase filenames and keep section entry files as `index.mdx`, `intro.mdx`, or `features.mdx` when they serve those roles.
+
+## Testing Guidelines
+There is no separate unit-test suite in this repository. Validation is content-focused: run `yarn lint` and `yarn build` for every change, then use `yarn serve` or `yarn dev` to verify rendering, navigation, links, code blocks, and asset paths. Treat a clean build as the minimum acceptance bar.
+
+## Commit & Pull Request Guidelines
+Recent history favors short, imperative commit subjects such as `Add trainerv2 llm fine tuning (#156)` or `Split component tables by architecture...`. Keep commits focused on one documentation change. For pull requests, include a concise summary, link the relevant issue or task, and note any generated or copied assets. Add screenshots only when navigation, theme behavior, or visual assets change. Ensure `yarn lint` passes before opening the PR.
diff --git a/docs/en/agentic_mlops/index.mdx b/docs/en/agentic_mlops/index.mdx
new file mode 100644
index 0000000..ec8f937
--- /dev/null
+++ b/docs/en/agentic_mlops/index.mdx
@@ -0,0 +1,6 @@
+---
+weight: 10
+---
+# Agentic MLOps
+
+<Overview />
diff --git a/docs/en/kubeflow/how_to/assets/build-train-image/run_build.sh b/docs/en/kubeflow/how_to/assets/build-train-image/run_build.sh
new file mode 100644
index 0000000..990c096
--- /dev/null
+++ b/docs/en/kubeflow/how_to/assets/build-train-image/run_build.sh
@@ -0,0 +1,10 @@
+buildctl \
+--addr tcp://192.168.142.83:1234 build \
+--frontend dockerfile.v0 \
+--local context=$PWD \
+--local dockerfile=$PWD \
+--opt filename=fine_tune_with_llamafactory_npu.Containerfile \
+--opt platform=linux/arm64 \
+--opt build-arg:INDEX_URL=https://pypi.tuna.tsinghua.edu.cn/simple \
+--output type=image,name=build-harbor.alauda.cn/mlops/fine_tune_with_llamafactory_npu:v0.9.4-cann_8.5.0-torch_2.6.0-v2,push=true
+
diff --git a/docs/en/kubeflow/how_to/kf-local-queue.yaml b/docs/en/kubeflow/how_to/kf-local-queue.yaml
new file mode 100644
index 0000000..3e79346
--- /dev/null
+++ b/docs/en/kubeflow/how_to/kf-local-queue.yaml
@@ -0,0 +1,7 @@
+apiVersion: kueue.x-k8s.io/v1beta2
+kind: LocalQueue
+metadata:
+  name: local-queue
+  namespace: mlops-demo-ai-test
+spec:
+  clusterQueue: cluster-queue
\ No newline at end of file
diff --git a/docs/en/kubeflow/how_to/kf-trainingruntime-npu.yaml b/docs/en/kubeflow/how_to/kf-trainingruntime-npu.yaml
new file mode 100644
index 0000000..86eccf7
--- /dev/null
+++ b/docs/en/kubeflow/how_to/kf-trainingruntime-npu.yaml
@@ -0,0 +1,401 @@
+apiVersion: trainer.kubeflow.org/v1alpha1
+kind: TrainingRuntime
+metadata:
+  name: llamafactory-finetune-runtime
+  namespace: kubeflow-admin-cpaas-io
+  labels:
+    trainer.kubeflow.org/framework: torch
+spec:
+  mlPolicy:
+    numNodes: 1
+    torch:
+      numProcPerNode: auto
+  template:
+    spec:
+      replicatedJobs:
+        - name: dataset-initializer
+          template:
+            metadata:
+              labels:
+                trainer.kubeflow.org/trainjob-ancestor-step: dataset-initializer
+            spec:
+              template:
+                spec:
+                  #hostNetwork: true
+                  #dnsPolicy: ClusterFirstWithHostNet
+                  securityContext:
+                    runAsNonRoot: true
+                    runAsUser: 1001
+                    runAsGroup: 1000
+                    fsGroup: 1000
+                  containers:
+                    - name: dataset-initializer
+                      #image: docker.1ms.run/alaudadockerhub/fine_tune_with_llamafactory_npu:v0.1.3
+                      image: docker.1ms.run/alaudadockerhub/fine_tune_with_llamafactory_npu:v0.9.4-cann_8.5.0-torch_2.6.0-v1
+                      command:
+                      - /bin/bash
+                      - -c
+                      - |
+                        set -ex
+                        cd /mnt/models
+                        DATASET_NAME=$(basename ${DATASET_URL})
+                        DATASET_URL_NO_HTTPS="${DATASET_URL//http:\/\/}"
+                        gitauth="${GIT_USER}:${GIT_TOKEN}"
+                        #rm -rf ${DATASET_NAME}
+                        #rm -rf data
+                        if [ -d ${DATASET_NAME} ]; then
+                            echo "dataset ${DATASET_NAME} already exists skipping download"
+                        else
+                            git -c http.sslVerify=false -c lfs.activitytimeout=36000 clone "http://${gitauth}@${DATASET_URL_NO_HTTPS}"
+                        fi
+                        echo "listing files under /mnt/models ..."
+                        ls /mnt/models
+                        echo "listing dataset files ..."
+                        ls ${DATASET_NAME}
+                      env:
+                      # Step 1: set DATASET_URL to download dataset from gitlab.
+                      - name: DATASET_URL
+                        value: "http://aml-gitlab.alaudatech.net/mlops-demo-ai-test/amldatasets/identity-alauda"
+                      # Step 2: set GIT_USER and GIT_TOKEN to access private git repo.
+                      # NOTE: if your dataset is located in different storage like S3, you need to modify the initializer container to download dataset from S3 instead of git.
+                      - name: GIT_USER
+                        valueFrom:
+                          secretKeyRef:
+                            name: aml-image-builder-secret
+                            key: MODEL_REPO_GIT_USER
+                      - name: GIT_TOKEN
+                        valueFrom:
+                          secretKeyRef:
+                            name: aml-image-builder-secret
+                            key: MODEL_REPO_GIT_TOKEN
+                      resources:
+                        requests:
+                          cpu: 100m
+                          memory: 128Mi
+                        limits:
+                          cpu: 2
+                          memory: 4Gi
+                      securityContext:
+                        allowPrivilegeEscalation: false
+                        capabilities:
+                          drop:
+                            - ALL
+                        runAsNonRoot: true
+                        seccompProfile:
+                          type: RuntimeDefault
+        - name: model-initializer
+          dependsOn:
+            - name: dataset-initializer
+              status: Complete
+          template:
+            metadata:
+              labels:
+                trainer.kubeflow.org/trainjob-ancestor-step: model-initializer
+            spec:
+              template:
+                spec:
+                  securityContext:
+                    runAsNonRoot: true
+                    runAsUser: 1001
+                    runAsGroup: 1000
+                    fsGroup: 1000
+                  containers:
+                    - name: model-initializer
+                      image: docker.1ms.run/alaudadockerhub/fine_tune_with_llamafactory_npu:v0.9.4-cann_8.5.0-torch_2.6.0-v1
+                      command:
+                      - /bin/bash
+                      - -c
+                      - |
+                        set -ex
+                        cd /mnt/models
+                        BASE_MODEL_NAME=$(basename ${BASE_MODEL_URL})
+                        # Download base model
+                        gitauth="${GIT_USER}:${GIT_TOKEN}"
+                        BASE_MODEL_URL_NO_HTTPS="${BASE_MODEL_URL//http:\/\/}"
+                        if [ -d ${BASE_MODEL_NAME} ]; then
+                            echo "${BASE_MODEL_NAME} dir already exists, skip downloading"
+                        else
+                            GIT_LFS_SKIP_SMUDGE=1 git -c http.sslVerify=false -c lfs.activitytimeout=36000 clone "http://${gitauth}@${BASE_MODEL_URL_NO_HTTPS}"
+                            (cd ${BASE_MODEL_NAME} && git -c http.sslVerify=false -c lfs.activitytimeout=36000 lfs pull)
+                        fi
+                        echo "listing files under /mnt/models ..."
+                        ls /mnt/models
+                        echo "listing model files ..."
+                        ls ${BASE_MODEL_NAME}
+                      env:
+                      # Step 3: set BASE_MODEL_URL to download base model from gitlab. Make sure the GIT_USER and GIT_TOKEN have access to this git repo.
+                      # NOTE: model repo name should not be the same as dataset repo name, otherwise the initializer may fail to download model and dataset correctly since they use the same PVC and the same git clone command.
+                      - name: BASE_MODEL_URL
+                        value: "http://aml-gitlab.alaudatech.net/mlops-demo-ai-test/amlmodels/qwen3-0.6b"
+                      - name: GIT_USER
+                        valueFrom:
+                          secretKeyRef:
+                            name: aml-image-builder-secret
+                            key: MODEL_REPO_GIT_USER
+                      - name: GIT_TOKEN
+                        valueFrom:
+                          secretKeyRef:
+                            name: aml-image-builder-secret
+                            key: MODEL_REPO_GIT_TOKEN
+                      resources:
+                        requests:
+                          cpu: 100m
+                          memory: 128Mi
+                        limits:
+                          cpu: 2
+                          memory: 4Gi
+                      securityContext:
+                        allowPrivilegeEscalation: false
+                        capabilities:
+                          drop:
+                            - ALL
+                        runAsNonRoot: true
+                        seccompProfile:
+                          type: RuntimeDefault
+        - name: node
+          dependsOn:
+            - name: model-initializer
+              status: Complete
+          template:
+            metadata:
+              labels:
+                trainer.kubeflow.org/trainjob-ancestor-step: trainer
+            spec:
+              backoffLimit: 0
+              template:
+                spec:
+                  # Step 4: Use the Ascend runtime class and scheduler that can allocate
+                  # Huawei NPUs on your cluster.
+                  schedulerName: hami-scheduler
+                  runtimeClassName: ascend
+                  # The trainer process UID/GID must match the Ascend device files
+                  # mounted under /dev. This sample assumes the NPU device files use
+                  # 1001:1000. If your cluster uses different ownership, rebuild the
+                  # image with that UID/GID and update these values.
+                  securityContext:
+                    runAsNonRoot: true
+                    runAsUser: 1001
+                    runAsGroup: 1000
+                    fsGroup: 1000
+                  volumes:
+                  - name: workspace
+                    emptyDir: {}
+                  - name: dshm
+                    emptyDir:
+                      medium: Memory
+                      # Step 4 (continued): set sizeLimit for the dshm volume to tune multi-NPU training performance.
+                      sizeLimit: 2Gi
+                  containers:
+                    - name: node
+                      image: docker.1ms.run/alaudadockerhub/fine_tune_with_llamafactory_npu:v0.9.4-cann_8.5.0-torch_2.6.0-v1
+                      env:
+                      - name: BASE_MODEL_URL
+                        value: "https://aml-gitlab.alaudatech.net/mlops-demo-ai-test/amlmodels/qwen3-0.6b"
+                      - name: DATASET_URL
+                        value: "https://aml-gitlab.alaudatech.net/mlops-demo-ai-test/amldatasets/identity-alauda"
+                      - name: GIT_USER
+                        valueFrom:
+                          secretKeyRef:
+                            name: aml-image-builder-secret
+                            key: MODEL_REPO_GIT_USER
+                      - name: GIT_TOKEN
+                        valueFrom:
+                          secretKeyRef:
+                            name: aml-image-builder-secret
+                            key: MODEL_REPO_GIT_TOKEN
+                      - name: HF_HOME
+                        value: /mnt/workspace/hf_cache
+                      - name: DO_MERGE
+                        value: "true"
+                        #- name: MLFLOW_TRACKING_URI
+                        #value: "http://mlflow-tracking-server.kubeflow:5000"
+                        #- name: MLFLOW_EXPERIMENT_NAME
+                        #value: mlops-demo-ai-test
+                      - name: MODEL_OUTPUT_DIR
+                        value: /mnt/workspace/output_model
+                      - name: OUTPUT_MODEL_URL
+                        value: "https://aml-gitlab.alaudatech.net/mlops-demo-ai-test/amlmodels/wy-sft-output"
+                      # Step 5: Keep Ascend process logs under the writable workspace
+                      # for easier debugging.
+                      - name: ASCEND_PROCESS_LOG_PATH
+                        value: /mnt/workspace/ascendlog
+                      command:
+                        - bash
+                        - -c
+                        - |
+                          set -ex
+
+                          export ASCEND_TOOLKIT_HOME=/usr/local/Ascend/cann
+                          export ASCEND_OPS_PATH=/usr/local/Ascend/cann
+                          export ASCEND_NNAL_HOME=/usr/local/Ascend/nnal
+                          export ASCEND_HOME_PATH=/usr/local/Ascend/cann
+                          export ASCEND_AICPU_PATH=/usr/local/Ascend/cann
+                          export ASCEND_OPP_PATH=/usr/local/Ascend/cann/opp
+                          export PATH=/usr/local/Ascend/cann/bin:/usr/local/Ascend/nnal/atb/bin:${PATH}
+                          export LD_LIBRARY_PATH=/usr/local/dcmi:/usr/local/Ascend/driver/lib64:/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64/driver:/usr/local/Ascend/cann/aarch64-linux/devlib:/usr/local/Ascend/cann/lib64:/usr/local/Ascend/nnal/atb/lib
+
+                          echo "PET_NNODES: ${PET_NNODES}, PET_NODE_RANK: ${PET_NODE_RANK}, PET_MASTER_ADDR: ${PET_MASTER_ADDR}, PET_MASTER_PORT: ${PET_MASTER_PORT}"
+                          if [ ${PET_NNODES} -gt 1 ]; then
+                              export N_RANKS=$PET_NNODES
+                              export RANK=$PET_NODE_RANK
+                              export MASTER_HOST=$PET_MASTER_ADDR
+                              export MASTER_PORT=$PET_MASTER_PORT
+                              export WORLD_SIZE=$PET_NNODES
+                              export NNODES=$PET_NNODES
+                              export NODE_RANK=$PET_NODE_RANK
+                              export MASTER_ADDR=${MASTER_HOST}
+                          else
+                              export N_RANKS=1
+                              export RANK=0
+                              export NNODES=1
+                              export MASTER_HOST=""
+                          fi
+
+                          source /usr/local/Ascend/ascend-toolkit/set_env.sh
+                          source /usr/local/Ascend/cann-8.5.0/share/info/ascendnpu-ir/bin/set_env.sh
+                          source /usr/local/Ascend/nnal/atb/set_env.sh
+
+                          cd /mnt/workspace
+                          BASE_MODEL_NAME=$(basename ${BASE_MODEL_URL})
+                          DATASET_NAME=$(basename ${DATASET_URL})
+
+                          cat >lf-sft.yaml <<EOL
+                          model_name_or_path: /mnt/models/${BASE_MODEL_NAME}
+
+                          stage: sft
+                          do_train: true
+                          finetuning_type: lora
+                          lora_target: all
+                          lora_rank: 8
+                          lora_alpha: 16
+                          lora_dropout: 0.1
+
+                          dataset: identity_alauda
+                          dataset_dir: /mnt/models/${DATASET_NAME}
+                          template: qwen
+                          cutoff_len: 1024
+                          max_samples: 1000
+                          overwrite_cache: true
+                          preprocessing_num_workers: 8
+
+                          output_dir: output_models
+                          logging_steps: 10
+                          save_steps: 500
+                          plot_loss: true
+                          overwrite_output_dir: true
+
+
+                          # global batch size: 8 with 2 NPUs (2 per device x 2 accumulation steps x 2 devices)
+                          per_device_train_batch_size: 2
+                          gradient_accumulation_steps: 2
+                          learning_rate: 2.0e-4
+                          num_train_epochs: 4.0
+                          bf16: false
+                          fp16: true
+                          ddp_timeout: 180000000
+
+                          val_size: 0.1
+                          per_device_eval_batch_size: 1
+                          eval_strategy: steps
+                          eval_steps: 500
+                          EOL
+
+                          cat >ds-z3-config.json <<EOL
+                          {
+                            "train_batch_size": "auto",
+                            "train_micro_batch_size_per_gpu": "auto",
+                            "gradient_accumulation_steps": "auto",
+                            "gradient_clipping": "auto",
+                            "zero_allow_untested_optimizer": true,
+                            "fp16": {
+                              "enabled": "auto",
+                              "loss_scale": 0,
+                              "loss_scale_window": 1000,
+                              "initial_scale_power": 16,
+                              "hysteresis": 2,
+                              "min_loss_scale": 1
+                            },
+                            "bf16": {
+                              "enabled": "auto"
+                            },
+                            "zero_optimization": {
+                              "stage": 3,
+                              "overlap_comm": false,
+                              "contiguous_gradients": true,
+                              "sub_group_size": 1e9,
+                              "reduce_bucket_size": "auto",
+                              "stage3_prefetch_bucket_size": "auto",
+                              "stage3_param_persistence_threshold": "auto",
+                              "stage3_max_live_parameters": 1e9,
+                              "stage3_max_reuse_distance": 1e9,
+                              "stage3_gather_16bit_weights_on_model_save": true
+                            }
+                          }
+                          EOL
+
+                          # Run training
+                          if [ ${NNODES} -gt 1 ]; then
+                              echo "deepspeed: ds-z3-config.json" >> lf-sft.yaml
+                              FORCE_TORCHRUN=1 llamafactory-cli train lf-sft.yaml
+                          else
+                              unset NNODES
+                              unset NODE_RANK
+                              unset MASTER_ADDR
+                              unset MASTER_PORT
+                              llamafactory-cli train lf-sft.yaml
+                          fi
+
+
+                          if [ "${DO_MERGE}" == "true" ]; then
+                            # Merge LoRA adapters
+                            cat >lf-merge-config.yaml <<EOL
+                          model_name_or_path: /mnt/models/${BASE_MODEL_NAME}
+                          adapter_name_or_path: output_models
+                          template: qwen
+                          finetuning_type: lora
+
+
+                          ### export
+                          export_dir: output_models_merged
+                          export_size: 4
+                          export_device: cpu
+                          export_legacy_format: false
+                          EOL
+
+                            llamafactory-cli export lf-merge-config.yaml
+                          else
+                            # move output adapter for push
+                            mv output_models output_models_merged
+                          fi
+
+                          # push merged model to model repo
+                          gitauth="${GIT_USER}:${GIT_TOKEN}"
+                          cd /mnt/workspace/output_models_merged
+                          touch README.md
+                          OUTPUT_MODEL_NO_HTTPS="${OUTPUT_MODEL_URL#*://}"
+                          PUSH_URL="http://${gitauth}@${OUTPUT_MODEL_NO_HTTPS}"
+                          push_branch=$(date +'%Y%m%d-%H%M%S')
+
+                          # git init
+                          # git checkout -b sft-${push_branch}
+                          # git lfs track *.safetensors
+                          # git add .
+                          # git -c user.name='AMLSystemUser' -c user.email='aml_admin@cpaas.io' commit -am "fine tune push auto commit"
+                          # git -c http.sslVerify=false -c lfs.activitytimeout=36000 push -u ${PUSH_URL} sft-${push_branch}
+                      securityContext:
+                        allowPrivilegeEscalation: true
+                        capabilities:
+                          # drop:
+                          #  - ALL
+                          add: [ "IPC_LOCK", "SYS_PTRACE" ]
+                        # Keep these aligned with the pod securityContext above.
+                        runAsUser: 1001
+                        runAsGroup: 1000
+                        runAsNonRoot: true
+                        seccompProfile:
+                          type: RuntimeDefault
+                      volumeMounts:
+                        - name: workspace
+                          mountPath: /mnt/workspace
+                        - name: dshm
+                          mountPath: /dev/shm
diff --git a/docs/en/kubeflow/how_to/kf-trainjob-npu.yaml b/docs/en/kubeflow/how_to/kf-trainjob-npu.yaml
new file mode 100644
index 0000000..050ce33
--- /dev/null
+++ b/docs/en/kubeflow/how_to/kf-trainjob-npu.yaml
@@ -0,0 +1,110 @@
+apiVersion: trainer.kubeflow.org/v1alpha1
+kind: TrainJob
+metadata:
+  generateName: trainjob-sft-qwen3-
+  namespace: kubeflow-admin-cpaas-io
+  # labels:
+  #   kueue.x-k8s.io/queue-name: local-queue
+spec:
+  runtimeRef:
+    apiGroup: trainer.kubeflow.org
+    name: llamafactory-finetune-runtime
+    kind: TrainingRuntime
+  podTemplateOverrides:
+    - targetJobs:
+        - name: node
+      spec:
+        # Step 1: Override the models-cache volume to mount the shared PVC used for caching models.
+        # In distributed training jobs (two or more replicas), choose a storage type suited to caching large models:
+        # - Network storage, such as NFS or Ceph: mount it directly. Multiple containers may read from it at the same time, which produces high concurrent traffic, and reading large model files may be slower than reading them locally, depending on the storage's performance.
+        # - Local storage, such as topolvm or local-storage: use `kserve local model cache` to pre-cache the model files on each node before mounting this PVC, because a training job cannot populate the local PVC on every node by itself.
+        volumes:
+          - name: models-cache
+            persistentVolumeClaim:
+              claimName: glm5
+        containers:
+          - name: node
+            volumeMounts:
+              - name: models-cache
+                mountPath: /mnt/models
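+        # Step 2: Optionally configure a node selector to pin the job to specific accelerator nodes. The commented example targets an NVIDIA GPU label; replace it with the label used by your NPU nodes.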
+        # nodeSelector:
+        #   nvidia.com/gpu.product: Tesla-T4
+    - targetJobs:
+        - name: dataset-initializer
+      spec:
+        # Step 3: Do the same as step 1.
+        volumes:
+          - name: models-cache
+            persistentVolumeClaim:
+              claimName: glm5
+        containers:
+          - name: dataset-initializer
+            volumeMounts:
+              - name: models-cache
+                mountPath: /mnt/models
+        # nodeSelector:
+        #   nvidia.com/gpu.product: Tesla-T4
+    - targetJobs:
+        - name: model-initializer
+      spec:
+        # Step 4: Do the same as step 1.
+        volumes:
+          - name: models-cache
+            persistentVolumeClaim:
+              claimName: glm5
+        containers:
+          - name: model-initializer
+            volumeMounts:
+              - name: models-cache
+                mountPath: /mnt/models
+        # nodeSelector:
+        #   nvidia.com/gpu.product: Tesla-T4
+  initializer:
+    # Step 5: set dataset and model URL in initializer step. The initializer will download dataset and model to the shared PVC, so that the trainer can access them from the same path.
+    dataset:
+      env:
+        - name: DATASET_URL
+          value: "http://gitlab-fuyao.test.com:30080/kubeflow-admin-cpaas-io/amlmodels/identity-alauda"
+    model:
+      env:
+        - name: BASE_MODEL_URL
+          value: "http://gitlab-fuyao.test.com:30080/kubeflow-admin-cpaas-io/amlmodels/Qwen3-0.6B"
+  trainer:
+    # Step 6: Keep one trainer pod for a single-host 2-NPU Ascend job. Use
+    # numNodes: 2+ only for true multi-node training across different hosts.
+    numNodes: 1
+    env:
+    # Step 7: Set model, dataset, and output locations for this run.
+    # Single-node multi-NPU jobs do not need loopback HCCL/GLOO NIC overrides.
+    # For true multi-node jobs, add only the HCCL networking variables required
+    # by your tested cluster topology.
+    - name: BASE_MODEL_URL
+      value: "http://gitlab-fuyao.test.com:30080/kubeflow-admin-cpaas-io/amlmodels/Qwen3-0.6B"
+    - name: DATASET_URL
+      value: "http://gitlab-fuyao.test.com:30080/kubeflow-admin-cpaas-io/amlmodels/identity-alauda"
+    - name: HF_HOME
+      value: /mnt/workspace/hf_cache
+    - name: DO_MERGE
+      value: "true"
+      #- name: MLFLOW_TRACKING_URI
+      #  value: "http://mlflow-tracking-server.kubeflow:5000"
+      #- name: MLFLOW_EXPERIMENT_NAME
+      #value: mlops-demo-ai-test
+    - name: MODEL_OUTPUT_DIR
+      value: /mnt/workspace/output_models_merged
+    - name: OUTPUT_MODEL_URL
+      value: "http://gitlab-fuyao.test.com:30080/kubeflow-admin-cpaas-io/amlmodels/qwen3-sft-output"
+    resourcesPerNode:
+      # Step 8: Request the exact Ascend resource keys exposed by your device
+      # plugin. This sample uses Ascend 910B4 and allocates two NPUs to one pod.
+      limits:
+        cpu: "4"
+        memory: "32Gi"
+        huawei.com/Ascend910B4: "2"
+        huawei.com/Ascend910B4-memory: "32G"
+      requests:
+        cpu: "1"
+        memory: "2Gi"
+        huawei.com/Ascend910B4: "2"
+        huawei.com/Ascend910B4-memory: "32G"
diff --git a/docs/en/kubeflow/how_to/pipelines-mlflow-integration.mdx b/docs/en/kubeflow/how_to/pipelines-mlflow-integration.mdx
new file mode 100644
index 0000000..35380c4
--- /dev/null
+++ b/docs/en/kubeflow/how_to/pipelines-mlflow-integration.mdx
@@ -0,0 +1,267 @@
+---
+weight: 55
+---
+
+# Kubeflow Pipeline + MLflow Integration
+
+This guide shows how to build Kubeflow Pipelines (KFP) components that log parameters, metrics, and model artifacts to [MLflow](./mlflow.mdx) — giving you a single source of truth for experiment tracking across your pipeline runs.
+
+## Scope
+
+- Alauda AI 2.5 and later.
+- Kubeflow Pipelines and the MLflow cluster plugin are installed.
+- Target namespaces have the MLflow workspace label (`mlflow-enabled=true`).
+- The pipeline components run in the same Kubernetes cluster where the MLflow Tracking Server is deployed.
+
+## Prerequisites
+
+- The `kfp` and `mlflow` Python packages installed (`pip install kfp mlflow`).
+- Access to a KFP endpoint (see [Use Kubeflow Pipelines](./pipelines.mdx) for setup).
+- An MLflow workspace name matching a namespace with `mlflow-enabled=true`.
+
+## How pipeline components reach MLflow
+
+The MLflow Tracking Server is exposed as an in-cluster Service:
+
+| Namespace | Service | Port |
+|-----------|---------|------|
+| `kubeflow` | `mlflow-tracking-server.kubeflow.svc.cluster.local` | 5000 |
+| `aml-system` | `mlflow-tracking-server.aml-system.svc.cluster.local` | 5000 |
+
+Pipeline components can use the short `<service>.<namespace>` form, shown here for the `kubeflow` namespace:
+
+```python
+MLFLOW_TRACKING_URI = "http://mlflow-tracking-server.kubeflow:5000"
+```
+
+## Complete example: training pipeline with MLflow
+
+The following is a minimal but complete KFP pipeline that:
+
+1. Accepts model name, learning rate, and epochs as parameters.
+2. Simulates training (replace with your real training code) and logs parameters + metrics to MLflow.
+3. Packages the trained model as an MLflow artifact and uploads it to the MLflow Tracking Server.
+
+```python
+from kfp import dsl, compiler
+
+# ===== MLflow configuration =====
+# These values are passed to each component as arguments. KFP lightweight
+# components are serialized in isolation, so they cannot read module-level
+# globals or imports from this file.
+MLFLOW_TRACKING_URI = "http://mlflow-tracking-server.kubeflow:5000"
+MLFLOW_EXPERIMENT_NAME = "kfp-training-experiment"
+
+
+@dsl.component(base_image="python:3.11-slim", packages_to_install=["mlflow", "kfp"])
+def train_model(
+    model_name: str,
+    learning_rate: float,
+    epochs: int,
+    output_model_path: str,
+    tracking_uri: str,
+    experiment_name: str,
+    run_suffix: str,
+) -> dict:
+    """Simulated training component with MLflow tracking."""
+    # Imports must live inside the function body of a lightweight component.
+    import json
+    import pathlib
+
+    import mlflow
+
+    mlflow.set_tracking_uri(tracking_uri)
+    mlflow.set_experiment(experiment_name)
+
+    with mlflow.start_run(run_name=f"run-{run_suffix}") as run:
+        # Log parameters
+        mlflow.log_param("model_name", model_name)
+        mlflow.log_param("learning_rate", learning_rate)
+        mlflow.log_param("epochs", epochs)
+
+        # Simulated training loop
+        metrics = {}
+        for epoch in range(1, epochs + 1):
+            # Replace this with your real training logic
+            loss = 2.0 * (0.95 ** epoch)       # placeholder loss curve
+            accuracy = max(0.0, 1.0 - loss)    # placeholder accuracy, clamped at 0
+
+            mlflow.log_metric("loss", loss, step=epoch)
+            mlflow.log_metric("accuracy", accuracy, step=epoch)
+            metrics = {"final_loss": loss, "final_accuracy": accuracy}
+
+        # Log the trained model as an artifact
+        model_dir = pathlib.Path(output_model_path)
+        model_dir.mkdir(parents=True, exist_ok=True)
+        (model_dir / "model.json").write_text(
+            json.dumps({"model_name": model_name, "epochs": epochs, "metrics": metrics})
+        )
+        mlflow.log_artifacts(str(model_dir), artifact_path="model")
+
+        run_id = run.info.run_id
+        print(f"MLflow run: {run_id}")
+
+    return metrics
+
+
+@dsl.component(base_image="python:3.11-slim", packages_to_install=["mlflow", "kfp"])
+def evaluate_model(
+    model_name: str,
+    test_data_path: str,
+    tracking_uri: str,
+    experiment_name: str,
+    run_suffix: str,
+) -> dict:
+    """Evaluate the trained model and log results to MLflow."""
+    import mlflow
+
+    mlflow.set_tracking_uri(tracking_uri)
+    mlflow.set_experiment(experiment_name)
+
+    with mlflow.start_run(run_name=f"eval-{run_suffix}"):
+        # In a real pipeline, load the model artifact from MLflow or S3
+        # For now, log placeholder metrics
+        mlflow.log_param("model_name", model_name)
+        mlflow.log_param("test_data_path", test_data_path)
+        mlflow.log_metric("eval_accuracy", 0.92)
+        mlflow.log_metric("eval_f1", 0.89)
+        mlflow.log_metric("eval_precision", 0.91)
+        mlflow.log_metric("eval_recall", 0.88)
+
+    return {"eval_accuracy": 0.92}
+
+
+@dsl.pipeline(name="mlflow-training-pipeline", description="Train and evaluate with MLflow tracking")
+def training_pipeline(
+    model_name: str = "qwen3-0.6b",
+    learning_rate: float = 2e-4,
+    epochs: int = 10,
+    test_data_path: str = "s3://datasets/test-data",
+):
+    train_task = train_model(
+        model_name=model_name,
+        learning_rate=learning_rate,
+        epochs=epochs,
+        output_model_path="/tmp/model",
+        tracking_uri=MLFLOW_TRACKING_URI,
+        experiment_name=MLFLOW_EXPERIMENT_NAME,
+        # Placeholder that the KFP backend resolves to the pipeline run ID at run time.
+        run_suffix=dsl.PIPELINE_JOB_ID_PLACEHOLDER,
+    )
+
+    eval_task = evaluate_model(
+        model_name=model_name,
+        test_data_path=test_data_path,
+        tracking_uri=MLFLOW_TRACKING_URI,
+        experiment_name=MLFLOW_EXPERIMENT_NAME,
+        run_suffix=dsl.PIPELINE_JOB_ID_PLACEHOLDER,
+    )
+    # Run evaluation only after training has finished.
+    eval_task.after(train_task)
+
+
+# ===== Compile and submit =====
+compiler.Compiler().compile(training_pipeline, "pipeline.yaml")
+```
+
+## Upload and run
+
+### Via the KFP UI
+
+1. Go to **Kubeflow Dashboard → Pipelines → Upload Pipeline** and select `pipeline.yaml`.
+2. Click **Create Run** and fill in the parameters (model name, learning rate, epochs).
+3. After the run starts, check the MLflow UI under **Alauda AI → Advanced → MLFlow** for logged metrics.
+
+### Via the KFP SDK
+
+```python
+from kfp.client import Client
+
+client = Client(host="<MY-KFP-ENDPOINT>")
+
+run = client.create_run_from_pipeline_package(
+    "pipeline.yaml",
+    arguments={
+        "model_name": "qwen3-0.6b",
+        "learning_rate": 2e-4,
+        "epochs": 10,
+        "test_data_path": "s3://datasets/test-data",
+    },
+)
+
+print(f"Run ID: {run.run_id}")
+```
+
+## Using MLflow in Trainer v2 pipelines
+
+If you are using [Kubeflow Trainer v2](../training_guides/fine-tune-with-trainer-v2.mdx) instead of KFP SDK pipelines, you can inject MLflow environment variables directly into the `TrainingJob` pod spec:
+
+```yaml
+apiVersion: trainer.kubeflow.org/v1
+kind: TrainingJob
+metadata:
+  name: mlflow-finetune
+spec:
+  trainingSpecs:
+    - replicas: 1
+      template:
+        spec:
+          containers:
+            - name: trainer
+              image: alaudadockerhub/fine_tune_with_llamafactory:v0.1.1
+              env:
+                - name: MLFLOW_TRACKING_URI
+                  value: "http://mlflow-tracking-server.kubeflow:5000"
+                - name: MLFLOW_EXPERIMENT_NAME
+                  value: "trainer-v2-finetune"
+```
+
+See [Fine-tuning LLMs using Workbench](../training_guides/fine-tuning-using-notebooks.mdx) for a full Trainer v2 + MLflow example with LLaMA-Factory.
+
+## Accessing MLflow run artifacts from other pipeline components
+
+A later pipeline step can read artifacts that an earlier step logged to MLflow. Set the tracking URI, then download each artifact by its artifact URI (for example `runs:/<run-id>/model`) with `mlflow.artifacts.download_artifacts`:
+
+```python
+from kfp import dsl
+
+@dsl.component(base_image="python:3.11-slim", packages_to_install=["mlflow", "kfp"])
+def download_and_compare(
+    source_model_uri: str,
+    reference_model_uri: str,
+) -> str:
+    """Download two models from MLflow and compare them."""
+    # Import inside the function body so it runs in the component container
+    import mlflow.artifacts
+
+    mlflow.set_tracking_uri("http://mlflow-tracking-server.kubeflow:5000")
+
+    # Each URI is an MLflow artifact URI such as "runs:/<run-id>/model"
+    source_path = mlflow.artifacts.download_artifacts(
+        artifact_uri=source_model_uri, dst_path="/tmp/source"
+    )
+    reference_path = mlflow.artifacts.download_artifacts(
+        artifact_uri=reference_model_uri, dst_path="/tmp/reference"
+    )
+
+    # Add your comparison logic here; this placeholder only reports the local paths
+    return f"Compared models: {source_path} vs {reference_path}"
+```
+
+This pattern is useful for:
+
+- A/B comparing fine-tuned models before deployment.
+- Pulling the best model (by accuracy metric) from an MLflow experiment into an inference pipeline.
+- Validating that a model passed acceptance tests before pushing it to the model registry.
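+
+The second use case, pulling the best model by a metric, might be sketched as follows. This is a minimal sketch, not part of the original guide: `best_run` and `fetch_runs` are hypothetical helper names, and `eval_accuracy` is assumed to be the metric logged by the evaluation step. `fetch_runs` relies on `mlflow.search_runs` with `output_format="list"`, while the selection logic is plain Python.
+
+```python
+def best_run(runs, metric="eval_accuracy"):
+    """Return the run dict with the highest value for `metric`, or None."""
+    # Ignore runs that never logged the metric (for example, failed runs)
+    scored = [r for r in runs if metric in r.get("metrics", {})]
+    return max(scored, key=lambda r: r["metrics"][metric], default=None)
+
+
+def fetch_runs(experiment_name, tracking_uri):
+    """Fetch all runs of an experiment as plain dicts (hypothetical helper)."""
+    # Imported lazily so the selection logic above has no MLflow dependency
+    import mlflow
+
+    mlflow.set_tracking_uri(tracking_uri)
+    runs = mlflow.search_runs(experiment_names=[experiment_name], output_format="list")
+    return [{"run_id": r.info.run_id, "metrics": dict(r.data.metrics)} for r in runs]
+```
+
+Inside a component, `best_run(fetch_runs(...))["run_id"]` could then be turned into a `runs:/<run-id>/model` URI and handed to the download step.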
+
+## Best practices
+
+### Use the pipeline run ID in MLflow
+
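+KFP substitutes run-scoped placeholders such as `dsl.PIPELINE_JOB_ID_PLACEHOLDER` (KFP v2) when the pipeline executes. A placeholder is not an environment variable and is not resolved inside a component body, so pass it from the pipeline function to the component as a string argument (called `run_id` here) and use that argument to create a distinct MLflow run per pipeline execution:
+
+```python
+with mlflow.start_run(run_name=f"run-{run_id}"):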
+    ...
+```
+
+This avoids conflating pipeline runs that happen to use the same parameters.
+
+### Log metrics at the component level, not the pipeline level
+
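+Call `mlflow.log_metric` inside an explicit `mlflow.start_run()` block in the component that produces the metric. Outside an active run, the fluent API silently starts a new unnamed run, which scatters metrics across runs that are hard to trace back to a pipeline execution. If a component has multiple logical training stages, open a separate MLflow run for each stage within the same component.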
+
+### Use MLflow models for model registry integration
+
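+Instead of logging arbitrary files with `mlflow.log_artifacts`, use the flavor-specific `mlflow.<flavor>.log_model()` function to log the model together with its signature and dependencies. Pass `registered_model_name` to also register it in the model registry: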
+
+```python
+mlflow.sklearn.log_model(sk_model, "model")
+# or for HuggingFace:
+mlflow.transformers.log_model(hf_model, "model")
+```
+
+Registered models can then be promoted to the **Staging** or **Production** stage in the MLflow UI.
+
+### Artifact storage for production
+
+The default MLflow artifact store is local disk. For production pipelines, configure S3-compatible object storage in the MLflow plugin settings (see [MLflow Tracking Server](./mlflow.mdx) → High Availability And Storage). Pipeline components can then log large model artifacts without running into pod disk limits.
+
+## Troubleshooting
+
+| Symptom | Check |
+|---------|-------|
+| `ConnectionError` on the first MLflow call (for example `mlflow.set_experiment`) | Verify the MLflow service is reachable from the pod: `curl http://mlflow-tracking-server.kubeflow:5000/health`. If the server runs in `aml-system`, use `mlflow-tracking-server.aml-system` as the host instead. |
+| Run not showing in MLflow UI | Check the component logs for MLflow errors. The MLflow experiment must exist (created automatically on first `set_experiment`) and the workspace label must match the namespace. |
+| Artifact upload fails with `ResourceExhausted` | The artifact may exceed the pod's disk quota. Log artifacts directly to S3 by setting `MLFLOW_S3_ENDPOINT_URL` and providing AWS credentials as env vars. |
+| MLflow metrics not appearing in KFP UI | KFP and MLflow are separate systems. Metrics logged to MLflow appear in the MLflow UI (**Alauda AI → Advanced → MLFlow**), not in the KFP run output. |
diff --git a/docs/en/kubeflow/how_to/qwen3_finetune_verify.ipynb b/docs/en/kubeflow/how_to/qwen3_finetune_verify.ipynb
new file mode 100644
index 0000000..2745368
--- /dev/null
+++ b/docs/en/kubeflow/how_to/qwen3_finetune_verify.ipynb
@@ -0,0 +1,390 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "18655ab8",
+   "metadata": {},
+   "source": [
+    "# Qwen3-8B Full-Parameter Fine-Tuning Verification\n",
+    "\n",
+    "This notebook verifies the fine-tuning capability of the **Ascend 910B CANN image** by running full-parameter SFT fine-tuning for Qwen3-8B with MindSpeed-LLM.\n",
+    "\n",
+    "**Workflow:**\n",
+    "1. Environment check\n",
+    "2. Prepare a sample dataset (Alpaca format)\n",
+    "3. Clone the MindSpeed-LLM scripts\n",
+    "4. Convert HF weights to Megatron weights\n",
+    "5. Preprocess the data\n",
+    "6. Start fine-tuning\n",
+    "7. Run inference validation\n",
+    "\n",
+    "> The training parameters are set for verification mode (few iterations + short sequence length). Increase `TRAIN_ITERS` and `SEQ_LENGTH` for production use."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "12b48017",
+   "metadata": {},
+   "source": [
+    "## 0. Parameter Configuration"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "id": "a0fa2576",
+   "metadata": {},
+   "source": "import warnings\nwarnings.filterwarnings('ignore', category=DeprecationWarning)\nwarnings.filterwarnings('ignore', category=ImportWarning)\nwarnings.filterwarnings('ignore', category=UserWarning)\n\nfrom pathlib import Path\n\n# ===== Path configuration =====\nHF_MODEL_DIR = Path('/opt/app-root/src/models/Qwen3-8B')\nWORK_DIR = Path('/opt/app-root/src/Qwen3-8B-work-dir')\nMINDSPEED_LLM_DIR = WORK_DIR / 'MindSpeed-LLM'\nDATA_DIR = WORK_DIR / 'finetune_dataset'\nRAW_DATA_FILE = DATA_DIR / 'alpaca_sample.jsonl'\nPROCESSED_DATA_PREFIX = DATA_DIR / 'alpaca'\nOUTPUT_DIR = WORK_DIR / 'output' / 'qwen3_8b_finetuned'\nLOGS_DIR = WORK_DIR / 'logs'\n\n# ===== Optional: real dataset path =====\nALPACA_PARQUET = Path('/opt/app-root/src/datasets/alpaca/train-00000-of-00001-a09b74b3ef9c3b56.parquet')\n\n# ===== Ascend environment scripts =====\nCANN_ENV = '/usr/local/Ascend/cann/set_env.sh'\nATB_ENV = '/usr/local/Ascend/nnal/atb/set_env.sh'\n\n# ===== Parallelism configuration (must match weight conversion) =====\nTP = 2   # With TP=1, one card holds about 4.1B parameters; fp32 gradient buffers + bf16 weights require about 30 GiB, exceeding the 910B 29 GiB memory limit\nPP = 2   # At least TPxPP=4 NPUs are required; for a single card, set TP=1 and PP=1 (OOM is possible)\n\n# ===== Weight conversion output (path includes parallel settings to avoid reusing stale weights after TP/PP changes) =====\nMCORE_WEIGHTS_DIR = WORK_DIR / 'model_weights' / f'qwen3_mcore_tp{TP}_pp{PP}'\n\n# ===== Training hyperparameters (verification mode) =====\nSEQ_LENGTH = 512     # 4096 is recommended for production\nTRAIN_ITERS = 50     # 2000+ is recommended for production\nMBS = 1\nLR = 1.25e-6\nMIN_LR = 1.25e-7\n\n# ===== Data preprocessing =====\nHANDLER_NAME = 'AlpacaStyleInstructionHandler'\nTOKENIZER_TYPE = 'PretrainedFromHF'\nPROMPT_TYPE = 'qwen3'\nENABLE_THINKING = 'none'\n\nprint('Configuration loaded')\nprint(f'  Model: {HF_MODEL_DIR}')\nprint(f'  Dataset: {ALPACA_PARQUET}' if ALPACA_PARQUET.exists() else '  Dataset: not found, using built-in sample data')\nprint(f'  TP={TP}, PP={PP}, SEQ={SEQ_LENGTH}, ITERS={TRAIN_ITERS}')",
+   "outputs": [],
+   "execution_count": null
+  },
+  {
+   "cell_type": "markdown",
+   "id": "15d10a9a",
+   "metadata": {},
+   "source": [
+    "## Helper Function"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "id": "7eb53b45",
+   "metadata": {},
+   "source": "import os\nimport subprocess\n\n_SUPPRESS_WARNINGS = 'ignore::DeprecationWarning,ignore::ImportWarning,ignore::UserWarning'\n\ndef run_cmd(cmd, cwd=None, check=True):\n    'Run a bash command in the Ascend environment and stream output in real time'\n    env_prefix = f'source {CANN_ENV} && source {ATB_ENV}'\n    full_cmd = f'{env_prefix} && {cmd}'\n    print(f'$ {cmd}\\n')\n    run_env = os.environ.copy()\n    run_env['PYTHONWARNINGS'] = _SUPPRESS_WARNINGS\n    result = subprocess.run(\n        ['bash', '-lc', full_cmd],\n        cwd=str(cwd or WORK_DIR),\n        text=True,\n        env=run_env,\n    )\n    if check and result.returncode != 0:\n        raise RuntimeError(f'Command failed with return code: {result.returncode}')\n    return result\n\nprint('Helper function defined: run_cmd()')",
+   "outputs": [],
+   "execution_count": null
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0d2cbf3b",
+   "metadata": {},
+   "source": [
+    "## 1. Environment Check"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "id": "1643dfe5",
+   "metadata": {},
+   "source": "import warnings\nwith warnings.catch_warnings():\n    warnings.simplefilter('ignore', DeprecationWarning)\n    warnings.simplefilter('ignore', ImportWarning)\n    warnings.simplefilter('ignore', UserWarning)\n    import torch\n    import torch_npu\n\nprint('=' * 60)\nprint('Environment Check')\nprint('=' * 60)\n\n# PyTorch & NPU\nprint(f'PyTorch:    {torch.__version__}')\nprint(f'torch_npu:  {torch_npu.__version__}')\nnproc = torch.npu.device_count()\nprint(f'NPU count:  {nproc}')\nfor i in range(nproc):\n    print(f'  NPU {i}: {torch.npu.get_device_name(i)}')\n\n# MindSpeed\nwith warnings.catch_warnings():\n    warnings.simplefilter('ignore', DeprecationWarning)\n    warnings.simplefilter('ignore', ImportWarning)\n    warnings.simplefilter('ignore', UserWarning)\n    import mindspeed\n    import mindspeed_llm\nprint('MindSpeed:     installed')\nprint('MindSpeed-LLM: installed')\n\n# Model files\nprint(f'\\nModel directory: {HF_MODEL_DIR}')\nassert HF_MODEL_DIR.exists(), f'Model directory does not exist: {HF_MODEL_DIR}'\nmodel_files = sorted(HF_MODEL_DIR.glob('*'))\nfor f in model_files[:5]:\n    if f.is_file():\n        print(f'  {f.name} ({f.stat().st_size / 1e9:.2f} GB)')\nif len(model_files) > 5:\n    print(f'  ... {len(model_files)} files in total')\n\n# Parallelism validation\nassert nproc >= TP * PP, f'NPU count ({nproc}) < TP*PP ({TP*PP}); reduce PP'\nDP = nproc // (TP * PP)\nGBS = DP * MBS\nprint(f'\\nParallelism: TP={TP}, PP={PP}, DP={DP}, GBS={GBS}')\nassert torch.npu.is_available(), 'NPU is not available'\nprint('\\nEnvironment check passed!')",
+   "outputs": [],
+   "execution_count": null
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a194e018",
+   "metadata": {},
+   "source": [
+    "## 2. Prepare a Sample Dataset\n",
+    "\n",
+    "Create sample data in Alpaca format to verify the fine-tuning workflow.\n",
+    "\n",
+    "To use a real dataset, place a JSONL file at `RAW_DATA_FILE`, with one JSON object per line:\n",
+    "```json\n",
+    "{\"instruction\": \"...\", \"input\": \"...\", \"output\": \"...\"}\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "id": "6d845761",
+   "metadata": {},
+   "source": "import json\nimport warnings\nimport pandas as pd\n\nDATA_DIR.mkdir(parents=True, exist_ok=True)\n\nif ALPACA_PARQUET.exists():\n    print(f'Loading Alpaca dataset: {ALPACA_PARQUET.name}')\n    with warnings.catch_warnings():\n        warnings.simplefilter('ignore', DeprecationWarning)\n        df = pd.read_parquet(ALPACA_PARQUET)\n    print(f'{len(df)} samples loaded, columns: {list(df.columns)}')\n\n    # Convert to JSONL (instruction / input / output)\n    with open(RAW_DATA_FILE, 'w', encoding='utf-8') as f:\n        for item in df[['instruction', 'input', 'output']].to_dict('records'):\n            item['input'] = item.get('input') or ''\n            f.write(json.dumps(item, ensure_ascii=False) + '\\n')\n\n    print(f'Converted to JSONL: {RAW_DATA_FILE}')\n    print('\\nSample records:')\n    for item in df[['instruction', 'input', 'output']].head(3).to_dict('records'):\n        inp = f' {item[\"input\"]}' if item['input'] else ''\n        print(f'  Q: {item[\"instruction\"][:80]}{inp[:40]}')\n        print(f'  A: {str(item[\"output\"])[:80]}')\nelse:\n    print('Alpaca dataset not found, using built-in sample data\\n')\n    sample_data = [\n        {'instruction': 'Translate the following sentence into French', 'input': 'The weather is nice today.', 'output': \"Il fait beau aujourd'hui.\"},\n        {'instruction': 'Translate the following sentence into Spanish', 'input': 'I like programming.', 'output': 'Me gusta programar.'},\n        {'instruction': 'Summarize the sentence in one short phrase', 'input': 'Machine learning is fascinating and widely used in many fields.', 'output': 'Machine learning is broadly useful.'},\n        {'instruction': 'Rewrite the sentence in a more formal tone', 'input': 'Hello, how are you?', 'output': 'Hello, how are you doing today?'},\n        {'instruction': 'Introduce Python in one sentence', 'input': '', 'output': 'Python is a high-level general-purpose programming language known for its readability and rich ecosystem.'},\n        {'instruction': 'List three common sorting algorithms', 'input': '', 'output': 'Three common sorting algorithms are bubble sort, quicksort, and merge sort.'},\n        {'instruction': 'Explain what deep learning is', 'input': '', 'output': 'Deep learning is a branch of machine learning that uses multi-layer neural networks to learn hierarchical representations of data.'},\n        {'instruction': 'Write a Python function to add two numbers', 'input': '', 'output': 'def add(a, b):\\n    return a + b'},\n        {'instruction': 'Rewrite the sentence to be more concise', 'input': 'Artificial intelligence is changing the world.', 'output': 'AI is transforming the world.'},\n        {'instruction': 'What is a GPU?', 'input': '', 'output': 'A GPU is a graphics processing unit designed to accelerate highly parallel computation, especially for training and inference workloads.'},\n    ]\n    with open(RAW_DATA_FILE, 'w', encoding='utf-8') as f:\n        for item in sample_data:\n            f.write(json.dumps(item, ensure_ascii=False) + '\\n')\n    print(f'Sample dataset created: {RAW_DATA_FILE}')\n    print(f'{len(sample_data)} samples in total')",
+   "outputs": [],
+   "execution_count": null
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9c4692a2",
+   "metadata": {},
+   "source": [
+    "## 3. Clone MindSpeed-LLM\n",
+    "\n",
+    "The `mindspeed_llm` Python package is already installed in the image, but the training scripts (`convert_ckpt_v2.py`, `preprocess_data.py`, `posttrain_gpt.py`, and others) must be run from the repository directory."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "id": "511c1c4d",
+   "metadata": {},
+   "source": [
+    "if MINDSPEED_LLM_DIR.exists():\n",
+    "    print(f'Already exists: {MINDSPEED_LLM_DIR}')\n",
+    "else:\n",
+    "    print('Cloning MindSpeed-LLM (shallow clone)...')\n",
+    "    run_cmd(f'git clone --depth 1 https://gitcode.com/ascend/MindSpeed-LLM.git {MINDSPEED_LLM_DIR}')\n",
+    "\n",
+    "# Validate required scripts\n",
+    "scripts = [\n",
+    "    ('Weight conversion', 'convert_ckpt_v2.py'),\n",
+    "    ('Data preprocessing', 'preprocess_data.py'),\n",
+    "    ('Fine-tuning', 'posttrain_gpt.py'),\n",
+    "    ('Inference', 'inference.py'),\n",
+    "]\n",
+    "for name, script in scripts:\n",
+    "    exists = (MINDSPEED_LLM_DIR / script).exists()\n",
+    "    print(f'  [{name}] {script}: {\"OK\" if exists else \"MISSING\"}')\n",
+    "\n",
+    "assert all((MINDSPEED_LLM_DIR / s).exists() for _, s in scripts), 'Required scripts are missing'\n",
+    "print('\\nScript check passed!')"
+   ],
+   "outputs": [],
+   "execution_count": null
+  },
+  {
+   "cell_type": "markdown",
+   "id": "331e0d10",
+   "metadata": {},
+   "source": [
+    "## 4. HF Weight to Megatron Weight Conversion\n",
+    "\n",
+    "Convert HuggingFace-format weights to Megatron format, split by TP/PP. The first conversion usually takes about 5-10 minutes."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "id": "463dd7da",
+   "metadata": {},
+   "source": [
+    "MCORE_WEIGHTS_DIR.mkdir(parents=True, exist_ok=True)\n",
+    "\n",
+    "# Check whether conversion has already been completed\n",
+    "converted = any(MCORE_WEIGHTS_DIR.glob('iter_*'))\n",
+    "\n",
+    "if converted:\n",
+    "    print(f'Weights already exist, skipping conversion: {MCORE_WEIGHTS_DIR}')\n",
+    "    for p in sorted(MCORE_WEIGHTS_DIR.iterdir()):\n",
+    "        print(f'  {p.name}')\n",
+    "else:\n",
+    "    convert_cmd = ' && '.join([\n",
+    "        f'cd {MINDSPEED_LLM_DIR}',\n",
+    "        f'python convert_ckpt_v2.py'\n",
+    "        ' --load-model-type hf'\n",
+    "        ' --save-model-type mg'\n",
+    "        f' --target-tensor-parallel-size {TP}'\n",
+    "        f' --target-pipeline-parallel-size {PP}'\n",
+    "        f' --load-dir {HF_MODEL_DIR}'\n",
+    "        f' --save-dir {MCORE_WEIGHTS_DIR}'\n",
+    "        ' --model-type-hf qwen3',\n",
+    "    ])\n",
+    "    print('Running weight conversion (about 5-10 minutes)...')\n",
+    "    run_cmd(convert_cmd, cwd=MINDSPEED_LLM_DIR)\n",
+    "    print('Weight conversion completed!')\n",
+    "    for p in sorted(MCORE_WEIGHTS_DIR.iterdir()):\n",
+    "        print(f'  {p.name}')"
+   ],
+   "outputs": [],
+   "execution_count": null
+  },
+  {
+   "cell_type": "markdown",
+   "id": "419d028a",
+   "metadata": {},
+   "source": [
+    "## 5. Data Preprocessing\n",
+    "\n",
+    "Convert Alpaca-format JSONL data into the binary format required by MindSpeed-LLM training."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "id": "f68febbf",
+   "metadata": {},
+   "source": [
+    "preprocess_cmd = ' && '.join([\n",
+    "    f'cd {MINDSPEED_LLM_DIR}',\n",
+    "    f'python preprocess_data.py'\n",
+    "    f' --input {RAW_DATA_FILE}'\n",
+    "    f' --tokenizer-name-or-path {HF_MODEL_DIR}'\n",
+    "    f' --output-prefix {PROCESSED_DATA_PREFIX}'\n",
+    "    f' --handler-name {HANDLER_NAME}'\n",
+    "    f' --tokenizer-type {TOKENIZER_TYPE}'\n",
+    "    ' --workers 4'\n",
+    "    ' --log-interval 1'\n",
+    "    f' --enable-thinking {ENABLE_THINKING}'\n",
+    "    f' --prompt-type {PROMPT_TYPE}',\n",
+    "])\n",
+    "\n",
+    "print('Running data preprocessing...')\n",
+    "run_cmd(preprocess_cmd, cwd=MINDSPEED_LLM_DIR)\n",
+    "\n",
+    "# Verify outputs\n",
+    "print('\\nPreprocessing outputs:')\n",
+    "for f in sorted(PROCESSED_DATA_PREFIX.parent.glob('alpaca*')):\n",
+    "    print(f'  {f.name} ({f.stat().st_size / 1024:.1f} KB)')\n",
+    "print('Data preprocessing completed!')"
+   ],
+   "outputs": [],
+   "execution_count": null
+  },
+  {
+   "cell_type": "markdown",
+   "id": "67501275",
+   "metadata": {},
+   "source": [
+    "## 6. Start Fine-Tuning\n",
+    "\n",
+    "Run full-parameter SFT fine-tuning with MindSpeed-LLM. Training logs are streamed to the notebook in real time.\n",
+    "\n",
+    "> In verification mode, `TRAIN_ITERS=50`. For a full fine-tuning run, 2000+ iterations are recommended."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "id": "16c0ef7e",
+   "metadata": {},
+   "source": [
+    "import torch\n",
+    "\n",
+    "nproc = torch.npu.device_count()\n",
+    "DP = nproc // (TP * PP)\n",
+    "GBS = DP * MBS\n",
+    "\n",
+    "LOGS_DIR.mkdir(parents=True, exist_ok=True)\n",
+    "OUTPUT_DIR.mkdir(parents=True, exist_ok=True)\n",
+    "\n",
+    "# Environment variables\n",
+    "env = ' && '.join([\n",
+    "    f'cd {MINDSPEED_LLM_DIR}',\n",
+    "    'export CUDA_DEVICE_MAX_CONNECTIONS=1',\n",
+    "    'export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True',\n",
+    "])\n",
+    "\n",
+    "# Distributed torchrun arguments\n",
+    "distributed = ' '.join([\n",
+    "    'torchrun',\n",
+    "    f'--nproc_per_node {nproc}',\n",
+    "    '--nnodes 1 --node_rank 0',\n",
+    "    '--master_addr localhost --master_port 6000',\n",
+    "])\n",
+    "\n",
+    "# Model architecture\n",
+    "model_args = ' '.join([\n",
+    "    '--use-mcore-models',\n",
+    "    '--spec mindspeed_llm.tasks.models.spec.qwen3_spec layer_spec',\n",
+    "    '--kv-channels 128 --qk-layernorm',\n",
+    "    f'--tensor-model-parallel-size {TP}',\n",
+    "    f'--pipeline-model-parallel-size {PP}',\n",
+    "    '--sequence-parallel --use-distributed-optimizer --use-flash-attn',\n",
+    "    '--num-layers 36 --hidden-size 4096 --num-attention-heads 32',\n",
+    "    '--ffn-hidden-size 12288 --max-position-embeddings 32768',\n",
+    "    f'--seq-length {SEQ_LENGTH}',\n",
+    "    '--make-vocab-size-divisible-by 1 --padded-vocab-size 151936',\n",
+    "    '--rotary-base 1000000 --use-rotary-position-embeddings',\n",
+    "])\n",
+    "\n",
+    "# Training hyperparameters\n",
+    "train_args = ' '.join([\n",
+    "    f'--micro-batch-size {MBS} --global-batch-size {GBS}',\n",
+    "    '--disable-bias-linear --swiglu',\n",
+    "    f'--train-iters {TRAIN_ITERS}',\n",
+    "    '--tokenizer-type PretrainedFromHF',\n",
+    "    f'--tokenizer-name-or-path {HF_MODEL_DIR}',\n",
+    "    '--normalization RMSNorm --position-embedding-type rope',\n",
+    "    '--norm-epsilon 1e-6 --hidden-dropout 0 --attention-dropout 0',\n",
+    "    '--no-gradient-accumulation-fusion --attention-softmax-in-fp32',\n",
+    "    '--exit-on-missing-checkpoint --no-masked-softmax-fusion',\n",
+    "    '--group-query-attention --untie-embeddings-and-output-weights',\n",
+    "    '--num-query-groups 8',\n",
+    "    f'--min-lr {MIN_LR} --lr {LR}',\n",
+    "    '--weight-decay 1e-1 --clip-grad 1.0',\n",
+    "    '--adam-beta1 0.9 --adam-beta2 0.95 --initial-loss-scale 4096',\n",
+    "    '--no-load-optim --no-load-rng --seed 42 --bf16',\n",
+    "])\n",
+    "\n",
+    "# Data and outputs\n",
+    "data_args = ' '.join([\n",
+    "    f'--data-path {PROCESSED_DATA_PREFIX}',\n",
+    "    '--split 100,0,0',\n",
+    "    '--log-interval 1',\n",
+    "    f'--save-interval {TRAIN_ITERS}',\n",
+    "    f'--eval-interval {TRAIN_ITERS} --eval-iters 0',\n",
+    "])\n",
+    "\n",
+    "# Fine-tuning configuration\n",
+    "tune_args = ' '.join([\n",
+    "    '--finetune --stage sft --is-instruction-dataset',\n",
+    "    '--prompt-type qwen3 --no-pad-to-seq-lengths',\n",
+    "    '--distributed-backend nccl',\n",
+    "    f'--load {MCORE_WEIGHTS_DIR} --save {OUTPUT_DIR}',\n",
+    "    '--transformer-impl local',\n",
+    "    '--no-save-optim --no-save-rng',\n",
+    "])\n",
+    "\n",
+    "cmd = f'{env} && {distributed} posttrain_gpt.py {model_args} {train_args} {data_args} {tune_args}'\n",
+    "\n",
+    "print(f'Training configuration: {nproc} NPU, TP={TP}, PP={PP}, DP={DP}')\n",
+    "print(f'GBS={GBS}, MBS={MBS}, SEQ={SEQ_LENGTH}, ITERS={TRAIN_ITERS}')\n",
+    "print(f'\\nStarting training...\\n')\n",
+    "run_cmd(cmd, cwd=MINDSPEED_LLM_DIR)\n",
+    "print(f'\\nTraining completed! Weights saved to: {OUTPUT_DIR}')"
+   ],
+   "outputs": [],
+   "execution_count": null
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d077bc56",
+   "metadata": {},
+   "source": [
+    "## 7. Inference Validation\n",
+    "\n",
+    "Load the fine-tuned weights and run a generation test."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "id": "09ae43f0",
+   "metadata": {},
+   "source": "import os\n\nnproc = torch.npu.device_count()\n\nenv = ' && '.join([\n    f'cd {MINDSPEED_LLM_DIR}',\n    'export CUDA_DEVICE_MAX_CONNECTIONS=1',\n])\n\ndistributed = ' '.join([\n    'torchrun',\n    f'--nproc_per_node {nproc}',\n    '--nnodes 1 --node_rank 0',\n    '--master_addr localhost --master_port 6001',\n])\n\ninfer_args = ' '.join([\n    '--use-mcore-models',\n    '--spec mindspeed_llm.tasks.models.spec.qwen3_spec layer_spec',\n    '--qk-layernorm',\n    f'--tensor-model-parallel-size {TP}',\n    f'--pipeline-model-parallel-size {PP}',\n    '--num-layers 36 --hidden-size 4096 --num-attention-heads 32',\n    '--ffn-hidden-size 12288',\n    f'--max-position-embeddings {SEQ_LENGTH} --seq-length {SEQ_LENGTH}',\n    '--disable-bias-linear',\n    '--group-query-attention --num-query-groups 8',\n    '--swiglu --use-fused-swiglu',\n    '--normalization RMSNorm --norm-epsilon 1e-6 --use-fused-rmsnorm',\n    '--position-embedding-type rope --rotary-base 1000000 --use-fused-rotary-pos-emb',\n    '--make-vocab-size-divisible-by 1 --padded-vocab-size 151936',\n    '--micro-batch-size 1 --max-new-tokens 256',\n    '--tokenizer-type PretrainedFromHF',\n    f'--tokenizer-name-or-path {HF_MODEL_DIR}',\n    '--tokenizer-not-use-fast',\n    '--hidden-dropout 0 --attention-dropout 0',\n    '--untie-embeddings-and-output-weights',\n    '--no-gradient-accumulation-fusion --attention-softmax-in-fp32',\n    '--seed 42',\n    f'--load {OUTPUT_DIR}',\n    '--exit-on-missing-checkpoint --transformer-impl local',\n])\n\ncmd = f'{env} && {distributed} inference.py {infer_args}'\nfull_cmd = f'source {CANN_ENV} && source {ATB_ENV} && {cmd}'\n\nprint('Starting inference...\\n')\nrun_env = os.environ.copy()\nrun_env['PYTHONWARNINGS'] = _SUPPRESS_WARNINGS\nresult = subprocess.run(\n    ['bash', '-lc', full_cmd],\n    cwd=str(MINDSPEED_LLM_DIR),\n    text=True,\n    input='q\\n',   # Exit interactive chat mode automatically after inference.py finishes the default 4 generation rounds and enters input(); sending q terminates it\n    env=run_env,\n)\nif result.returncode != 0:\n    print(f'\\nInference return code: {result.returncode}')\nprint('\\nInference completed!')",
+   "outputs": [],
+   "execution_count": null
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f87ecc9d",
+   "metadata": {},
+   "source": [
+    "## Using a Real Dataset\n",
+    "\n",
+    "After verification succeeds, use the following steps for full fine-tuning with a real dataset:\n",
+    "\n",
+    "1. **Prepare the data**: place an Alpaca/ShareGPT/Pairwise dataset inside the container\n",
+    "   - Alpaca: `{\"instruction\": \"...\", \"input\": \"...\", \"output\": \"...\"}`\n",
+    "   - Change `HANDLER_NAME` to the matching handler\n",
+    "\n",
+    "2. **Tune the parameters**:\n",
+    "   - `SEQ_LENGTH = 4096` for production (the model itself supports up to 32768 positions)\n",
+    "   - `TRAIN_ITERS = 2000+` adjusted to the dataset size\n",
+    "   - `GBS` adjusted to the NPU count and dataset size\n",
+    "\n",
+    "3. **Checkpoint interval**: change `--save-interval` in the training cell to save checkpoints periodically\n",
+    "\n",
+    "4. **enable-thinking**:\n",
+    "   - `true` to process all data with slow-thinking mode\n",
+    "   - `false` to process all data with fast-thinking mode\n",
+    "   - `none` to mix fast and slow thinking (default)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3.12",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
\ No newline at end of file
diff --git a/docs/en/training_guides/index.mdx b/docs/en/training_guides/index.mdx
index 4efdcbe..e7d698d 100644
--- a/docs/en/training_guides/index.mdx
+++ b/docs/en/training_guides/index.mdx
@@ -16,3 +16,4 @@ End-to-end recipes for fine-tuning and pretraining LLMs on Alauda AI.
 | Production SFT / OSFT with automatic memory management | `training_hub` | [Fine-tuning LLMs with Training Hub](./training-hub-fine-tuning.mdx) |
 | Interactive exploration, custom scripts, VolcanoJob submission | Workbench Notebook | [Fine-tuning LLMs using Workbench](./fine-tuning-using-notebooks.mdx) |
 | Full-parameter SFT / pretraining on Ascend NPU | Workbench `PyTorch CANN` / `MindSpore CANN` | [Fine-tune and Pretrain on Ascend NPU](./fine-tune-and-pretrain-llms-on-ascend-npu.mdx) |
+| Agent-driven MLOps (plan templates, iterative tuning, structured reports) | Coding Agent + on-prem LLM | [Agentic MLOps](../agentic_mlops/mlops-with-coding-agents.mdx) |
diff --git a/docs/en/training_guides/pipelines-mlflow-integration.mdx b/docs/en/training_guides/pipelines-mlflow-integration.mdx
new file mode 100644
index 0000000..e756305
--- /dev/null
+++ b/docs/en/training_guides/pipelines-mlflow-integration.mdx
@@ -0,0 +1,266 @@
+---
+weight: 55
+---
+
+# Kubeflow Pipeline + MLflow Integration
+
+This guide shows how to build Kubeflow Pipelines (KFP) components that log parameters, metrics, and model artifacts to [MLflow on Kubeflow](../kubeflow/how_to/mlflow.mdx) — giving you a single source of truth for experiment tracking across your pipeline runs.
+
+## Scope
+
+- Alauda AI 2.5 and later.
+- Kubeflow Pipelines and the MLflow cluster plugin are installed.
+- Target namespaces have the MLflow workspace label (`mlflow-enabled=true`).
+- The pipeline components run in the same Kubernetes cluster where the MLflow Tracking Server is deployed.
+
+## Prerequisites
+
+- The `kfp` and `mlflow` Python packages installed (`pip install kfp mlflow`).
+- Access to a KFP endpoint (see [Use Kubeflow Pipelines](../kubeflow/how_to/pipelines.mdx) for setup).
+- An MLflow workspace name matching a namespace with `mlflow-enabled=true`.
+
+## How pipeline components reach MLflow
+
+The MLflow Tracking Server is exposed as an in-cluster Service:
+
+| Namespace | Service | Port |
+|-----------|---------|------|
+| `kubeflow` | `mlflow-tracking-server.kubeflow.svc.cluster.local` | 5000 |
+| `aml-system` | `mlflow-tracking-server.aml-system.svc.cluster.local` | 5000 |
+
+Pipeline components can use the short `<service>.<namespace>` form, which cluster DNS resolves from any namespace:
+
+```python
+MLFLOW_TRACKING_URI = "http://mlflow-tracking-server.kubeflow:5000"
+```
+
+## Complete example: training pipeline with MLflow
+
+Here is a minimal but complete KFP pipeline. It accepts model name, learning rate,
+and epochs as parameters, simulates training with MLflow parameter and metric logging,
+and packages the trained model as an MLflow artifact.
+
+```python
+from kfp import dsl, compiler
+
+# ===== MLflow configuration (passed to the components as arguments) =====
+MLFLOW_TRACKING_URI = "http://mlflow-tracking-server.kubeflow:5000"
+MLFLOW_EXPERIMENT_NAME = "kfp-training-experiment"
+
+
+@dsl.component(base_image="python:3.11-slim", packages_to_install=["mlflow", "kfp"])
+def train_model(
+    model_name: str,
+    learning_rate: float,
+    epochs: int,
+    output_model_path: str,
+    tracking_uri: str,
+    experiment_name: str,
+    pipeline_run_id: str,
+) -> dict:
+    """Simulated training component with MLflow tracking."""
+    import json, pathlib
+    import mlflow  # imported inside the body: lightweight components must be self-contained
+    mlflow.set_tracking_uri(tracking_uri)
+    mlflow.set_experiment(experiment_name)
+
+    with mlflow.start_run(run_name=f"run-{pipeline_run_id}") as run:
+        mlflow.log_params(dict(model_name=model_name, learning_rate=learning_rate, epochs=epochs))
+        # Simulated training loop
+        metrics = dict()
+        for epoch in range(1, epochs + 1):
+            # Replace this with your real training logic
+            loss = 2.0 * (0.95 ** epoch)  # placeholder loss curve
+            accuracy = 1.0 - loss         # placeholder accuracy
+            mlflow.log_metric("loss", loss, step=epoch)
+            mlflow.log_metric("accuracy", accuracy, step=epoch)
+            metrics["final_loss"] = loss
+            metrics["final_accuracy"] = accuracy
+
+        # Log the trained model as an artifact
+        model_dir = pathlib.Path(output_model_path)
+        model_dir.mkdir(parents=True, exist_ok=True)
+        (model_dir / "model.json").write_text(
+            json.dumps(dict(model_name=model_name, epochs=epochs, metrics=metrics))
+        )
+        mlflow.log_artifacts(str(model_dir), artifact_path="model")
+        print(f"MLflow run: {run.info.run_id}")
+
+    return metrics
+
+
+@dsl.component(base_image="python:3.11-slim", packages_to_install=["mlflow", "kfp"])
+def evaluate_model(
+    model_name: str,
+    test_data_path: str,
+    tracking_uri: str,
+    experiment_name: str,
+    pipeline_run_id: str,
+) -> dict:
+    """Evaluate the trained model and log results to MLflow."""
+    import mlflow  # imported inside the body: lightweight components must be self-contained
+    mlflow.set_tracking_uri(tracking_uri)
+    mlflow.set_experiment(experiment_name)
+
+    with mlflow.start_run(run_name=f"eval-{pipeline_run_id}"):
+        # Placeholder metrics; a real pipeline would load the model artifact from MLflow or S3
+        mlflow.log_params(dict(model_name=model_name, test_data_path=test_data_path))
+        mlflow.log_metrics(dict(eval_accuracy=0.92, eval_f1=0.89, eval_precision=0.91, eval_recall=0.88))
+
+    return dict(eval_accuracy=0.92)
+
+
+@dsl.pipeline(name="mlflow-training-pipeline", description="Train and evaluate with MLflow tracking")
+def training_pipeline(
+    model_name: str = "qwen3-0.6b",
+    learning_rate: float = 2e-4,
+    epochs: int = 10,
+    test_data_path: str = "s3://datasets/test-data",
+):
+    train_task = train_model(
+        model_name=model_name,
+        learning_rate=learning_rate,
+        epochs=epochs,
+        output_model_path="/tmp/model",
+        tracking_uri=MLFLOW_TRACKING_URI,
+        experiment_name=MLFLOW_EXPERIMENT_NAME,
+        pipeline_run_id=dsl.PIPELINE_JOB_ID_PLACEHOLDER,  # resolved to the run ID at run time
+    )
+
+    evaluate_model(
+        model_name=model_name,
+        test_data_path=test_data_path,
+        tracking_uri=MLFLOW_TRACKING_URI,
+        experiment_name=MLFLOW_EXPERIMENT_NAME,
+        pipeline_run_id=dsl.PIPELINE_JOB_ID_PLACEHOLDER,
+    ).after(train_task)  # run evaluation only after training has finished
+
+
+# ===== Compile and submit =====
+compiler.Compiler().compile(training_pipeline, "pipeline.yaml")
+```
+
+## Upload and run
+
+### Via the KFP UI
+
+1. Go to **Kubeflow Dashboard → Pipelines → Upload Pipeline** and select `pipeline.yaml`.
+2. Click **Create Run** and fill in the parameters (model name, learning rate, epochs).
+3. After the run starts, check the MLflow UI under **Alauda AI → Advanced → MLFlow** for logged metrics.
+
+### Via the KFP SDK
+
+```python
+from kfp.client import Client
+
+client = Client(host="<MY-KFP-ENDPOINT>")
+
+run = client.create_run_from_pipeline_package(
+    "pipeline.yaml",
+    arguments=dict(
+        model_name="qwen3-0.6b",
+        learning_rate=2e-4,
+        epochs=10,
+        test_data_path="s3://datasets/test-data",
+    ),
+)
+
+print(f"Run ID: {run.run_id}")
+```
+
+## Using MLflow in Trainer v2 pipelines
+
+If you are using [Kubeflow Trainer v2](./fine-tune-with-trainer-v2.mdx) instead of KFP SDK pipelines, you can inject the MLflow environment variables into the trainer container through `spec.trainer.env` of the `TrainJob`. The `runtimeRef` below is an example; point it at a runtime that exists in your cluster:
+
+```yaml
+apiVersion: trainer.kubeflow.org/v1alpha1
+kind: TrainJob
+metadata:
+  name: mlflow-finetune
+spec:
+  runtimeRef:
+    # Example runtime name; replace it with a runtime available in your cluster.
+    apiGroup: trainer.kubeflow.org
+    kind: TrainingRuntime
+    name: llamafactory-finetune-runtime
+  trainer:
+    image: alaudadockerhub/fine_tune_with_llamafactory:v0.1.1
+    env:
+      - name: MLFLOW_TRACKING_URI
+        value: "http://mlflow-tracking-server.kubeflow:5000"
+      - name: MLFLOW_EXPERIMENT_NAME
+        value: "trainer-v2-finetune"
+```
+
+See [Fine-tuning LLMs using Workbench](./fine-tuning-using-notebooks.mdx) for a full Trainer v2 + MLflow example with LLaMA-Factory.
+
+## Accessing MLflow run artifacts from other pipeline components
+
+A later pipeline step can read artifacts that an earlier step logged to MLflow. Set the tracking URI and call `mlflow.artifacts.download_artifacts` with an artifact URI such as `runs:/<run_id>/model`:
+
+```python
+from kfp import dsl
+
+
+@dsl.component(base_image="python:3.11-slim", packages_to_install=["mlflow", "kfp"])
+def download_and_compare(
+    source_model_uri: str,
+    reference_model_uri: str,
+) -> str:
+    """Download two models from MLflow and compare them."""
+    import mlflow.artifacts  # imported inside the body: lightweight components must be self-contained
+    mlflow.set_tracking_uri("http://mlflow-tracking-server.kubeflow:5000")
+
+    source_path = mlflow.artifacts.download_artifacts(artifact_uri=source_model_uri, dst_path="/tmp/source")
+    reference_path = mlflow.artifacts.download_artifacts(artifact_uri=reference_model_uri, dst_path="/tmp/reference")
+
+    return f"Compared models: {source_path} vs {reference_path}"
+```
+
+This pattern is useful for:
+
+- A/B comparing fine-tuned models before deployment.
+- Pulling the best model (by accuracy metric) from an MLflow experiment into an inference pipeline.
+- Validating that a model passed acceptance tests before pushing it to the model registry.
+
+## Best practices
+
+### Use the pipeline run ID in MLflow
+
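+KFP exposes the pipeline run ID through the `dsl.PIPELINE_JOB_ID_PLACEHOLDER` placeholder. It is resolved at run time only when it is passed to a component as an argument, so pass it in from the pipeline function and use the resulting input to create distinct MLflow runs per pipeline execution:
+
+```python
+with mlflow.start_run(run_name=f"run-{pipeline_run_id}"):
+    ...  # pipeline_run_id: a component input set to dsl.PIPELINE_JOB_ID_PLACEHOLDER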
+```
+
+This avoids conflating pipeline runs that happen to use the same parameters.
+
+### Log metrics at the component level, not the pipeline level
+
+Keep `log_metric` calls inside an explicit `mlflow.start_run()` block. Outside of one, the fluent API silently starts a new run, which scatters metrics across unintended runs. If a component has multiple logical training stages, open a separate MLflow run for each stage within the same component.
+
+### Use MLflow models for model registry integration
+
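+Instead of logging arbitrary files with `mlflow.log_artifacts`, use the `log_model()` function of the matching model flavor to store the model together with its signature and dependencies. Passing `registered_model_name` to that call also registers the model in the model registry: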
+
+```python
+mlflow.sklearn.log_model(sk_model, "model")
+# or for HuggingFace:
+mlflow.transformers.log_model(hf_model, "model")
+```
+
+Registered models can then be promoted to the **Staging** or **Production** stage in the MLflow UI.
+
+### Artifact storage for production
+
+The default MLflow artifact store is local disk. For production pipelines, configure S3-compatible object storage in the MLflow plugin settings (see [MLflow Tracking Server](../kubeflow/how_to/mlflow.mdx) → High Availability And Storage). Pipeline components can then log large model artifacts without running into pod disk limits.
+
+## Troubleshooting
+
+| Symptom | Check |
+|---------|-------|
+| `ConnectionError` on the first MLflow call (for example `mlflow.set_experiment`) | Verify the MLflow service is reachable: `curl http://mlflow-tracking-server.kubeflow:5000/health`. If the service runs in `aml-system`, use that namespace in the URL instead. |
+| Run not showing in MLflow UI | Check the component logs for MLflow errors. The MLflow experiment must exist (created automatically on first `set_experiment`) and the workspace label must match the namespace. |
+| Artifact upload fails with `ResourceExhausted` | The artifact may exceed the pod's disk quota. Log artifacts directly to S3 by setting `MLFLOW_S3_ENDPOINT_URL` and providing AWS credentials as env vars. |
+| MLflow metrics not appearing in KFP UI | KFP and MLflow are separate systems. Metrics logged to MLflow appear in the MLflow UI (**Alauda AI → Advanced → MLFlow**), not in the KFP run output. |
diff --git a/e2e/lib.sh b/e2e/lib.sh
index cbd2ab5..1b34352 100644
--- a/e2e/lib.sh
+++ b/e2e/lib.sh
@@ -73,14 +73,14 @@ _retry_kubectl_stdin() {
   local kfn="$1" verb="$2"; shift 2
   local data
   data="$(cat)"
-  local attempts=0 max=20 delay=30 rc out
+  local attempts=0 max=20 delay=120 rc out
   while [ "${attempts}" -lt "${max}" ]; do
     if out="$(printf '%s' "${data}" | $kfn "${verb}" -f - "$@" 2>&1)"; then
       printf '%s' "${out}"
       return 0
     fi
     rc=$?
-    if ! echo "${out}" | grep -qE 'failed calling webhook|x509|connection refused|EOF|context deadline exceeded|webhook.* connect: connection refused'; then
+    if ! echo "${out}" | grep -qE 'failed calling webhook|x509|connection refused|EOF|context deadline exceeded|webhook.* connect: connection refused|failed to download openapi|openapi'; then
       printf '%s\n' "${out}" >&2
       return "${rc}"
     fi
@@ -93,7 +93,31 @@ _retry_kubectl_stdin() {
 }
 
 retry_create() { _retry_kubectl_stdin "$1" create "${@:2}"; }
-retry_apply()  { _retry_kubectl_stdin "$1" apply  "${@:2}"; }
+retry_apply()  { _retry_kubectl_stdin "$1" apply "${@:2}"; }
+
+# Same retry loop, but with --validate=false so kubectl skips client-side schema validation (and its OpenAPI download).
+_retry_kubectl_stdin_novalidate() {
+  local kfn="$1" verb="$2"; shift 2
+  local data
+  data="$(cat)"
+  local attempts=0 max=5 delay=10 rc out
+  while [ "${attempts}" -lt "${max}" ]; do
+    out="$(printf '%s' "${data}" | $kfn "${verb}" -f - --validate=false "$@" 2>&1)" && rc=0 || rc=$?
+    if [ "${rc}" -eq 0 ]; then  # rc is captured above; after a failed "if ...; fi", $? would be 0
+      printf '%s' "${out}"
+      return 0
+    fi
+    if ! echo "${out}" | grep -qE 'failed calling webhook|x509|connection refused|EOF|context deadline exceeded|webhook.* connect: connection refused|failed to download openapi|openapi'; then
+      printf '%s\n' "${out}" >&2
+      return "${rc}"
+    fi
+    attempts=$((attempts+1))
+    log "kubectl ${verb} (novalidate): flake (attempt ${attempts}/${max}), sleeping ${delay}s"
+    sleep "${delay}"
+  done
+  printf '%s\n' "${out}" >&2
+  return 1
+}
 
 # Locate a TrainJob's pod. Trainer v2 builds a JobSet named after the TrainJob,
 # with one Job per `replicatedJobs[*]` named `${trainjob}-<rjob>-0`. The first

From ddff8e5b17f008f8232a323b9e0273d0a7bf843e Mon Sep 17 00:00:00 2001
From: Wu Yi <typhoonzero1986@gmail.com>
Date: Mon, 15 Jun 2026 11:59:56 +0800
Subject: [PATCH 07/21] update

---
 AGENTS.md                                     |  23 -
 .../assets/build-train-image/run_build.sh     |  10 -
 docs/en/kubeflow/how_to/kf-local-queue.yaml   |   7 -
 .../how_to/kf-trainingruntime-npu.yaml        | 401 ------------------
 docs/en/kubeflow/how_to/kf-trainjob-npu.yaml  | 110 -----
 .../how_to/pipelines-mlflow-integration.mdx   | 267 ------------
 .../qwen3_finetune_verify.ipynb               |   0
 7 files changed, 818 deletions(-)
 delete mode 100644 AGENTS.md
 delete mode 100644 docs/en/kubeflow/how_to/assets/build-train-image/run_build.sh
 delete mode 100644 docs/en/kubeflow/how_to/kf-local-queue.yaml
 delete mode 100644 docs/en/kubeflow/how_to/kf-trainingruntime-npu.yaml
 delete mode 100644 docs/en/kubeflow/how_to/kf-trainjob-npu.yaml
 delete mode 100644 docs/en/kubeflow/how_to/pipelines-mlflow-integration.mdx
 rename docs/en/{kubeflow/how_to => training_guides}/qwen3_finetune_verify.ipynb (100%)

diff --git a/AGENTS.md b/AGENTS.md
deleted file mode 100644
index 6817188..0000000
--- a/AGENTS.md
+++ /dev/null
@@ -1,23 +0,0 @@
-# Repository Guidelines
-
-## Project Structure & Module Organization
-`docs/en/` contains the English documentation, organized by product area such as `kubeflow/`, `workbench/`, and `model_inference/`. Section landing pages usually live in `index.mdx`, with supporting content under folders like `overview/`, `how_to/`, `functions/`, and `trouble_shooting/`. Put downloadable files and images in `docs/public/` or the nearest section asset folder. Shared generated resources live in `docs/shared/` (`crds/`, `openapis/`, `roletemplates/`, `functionresources/`). Root config lives in `doom.config.yml`, `sites.yaml`, `eslint.config.js`, and `cspell.config.js`. CI definitions are under `.builds/`.
-
-## Build, Test, and Development Commands
-Use Yarn 4 for all local work:
-
-- `yarn install`: install dependencies.
-- `yarn dev`: start the Doom dev server with live reload.
-- `yarn lint`: run repository lint checks before committing.
-- `yarn build`: produce the static site in `dist/`.
-- `yarn serve`: preview the built site locally.
-- `yarn translate` / `yarn export`: run Doom translation or export workflows when needed.
-
-## Coding Style & Naming Conventions
-Follow `.editorconfig`: 2-space indentation, LF line endings, UTF-8, and a final newline. Prettier is the formatter; its current rules prefer single quotes and no semicolons. Keep MDX concise and use descriptive headings. Match the surrounding directory’s naming pattern; for new pages, prefer lowercase filenames and keep section entry files as `index.mdx`, `intro.mdx`, or `features.mdx` when they serve those roles.
-
-## Testing Guidelines
-There is no separate unit-test suite in this repository. Validation is content-focused: run `yarn lint` and `yarn build` for every change, then use `yarn serve` or `yarn dev` to verify rendering, navigation, links, code blocks, and asset paths. Treat a clean build as the minimum acceptance bar.
-
-## Commit & Pull Request Guidelines
-Recent history favors short, imperative commit subjects such as `Add trainerv2 llm fine tuning (#156)` or `Split component tables by architecture...`. Keep commits focused on one documentation change. For pull requests, include a concise summary, link the relevant issue or task, and note any generated or copied assets. Add screenshots only when navigation, theme behavior, or visual assets change. Ensure `yarn lint` passes before opening the PR.
diff --git a/docs/en/kubeflow/how_to/assets/build-train-image/run_build.sh b/docs/en/kubeflow/how_to/assets/build-train-image/run_build.sh
deleted file mode 100644
index 990c096..0000000
--- a/docs/en/kubeflow/how_to/assets/build-train-image/run_build.sh
+++ /dev/null
@@ -1,10 +0,0 @@
-buildctl \
---addr tcp://192.168.142.83:1234 build \
---frontend dockerfile.v0 \
---local context=$PWD \
---local dockerfile=$PWD \
---opt filename=fine_tune_with_llamafactory_npu.Containerfile \
---opt platform=linux/arm64 \
---opt build-arg:INDEX_URL=https://pypi.tuna.tsinghua.edu.cn/simple \
---output type=image,name=build-harbor.alauda.cn/mlops/fine_tune_with_llamafactory_npu:v0.9.4-cann_8.5.0-torch_2.6.0-v2,push=true
-
diff --git a/docs/en/kubeflow/how_to/kf-local-queue.yaml b/docs/en/kubeflow/how_to/kf-local-queue.yaml
deleted file mode 100644
index 3e79346..0000000
--- a/docs/en/kubeflow/how_to/kf-local-queue.yaml
+++ /dev/null
@@ -1,7 +0,0 @@
-apiVersion: kueue.x-k8s.io/v1beta2
-kind: LocalQueue
-metadata:
-  name: local-queue
-  namespace: mlops-demo-ai-test
-spec:
-  clusterQueue: cluster-queue
\ No newline at end of file
diff --git a/docs/en/kubeflow/how_to/kf-trainingruntime-npu.yaml b/docs/en/kubeflow/how_to/kf-trainingruntime-npu.yaml
deleted file mode 100644
index 86eccf7..0000000
--- a/docs/en/kubeflow/how_to/kf-trainingruntime-npu.yaml
+++ /dev/null
@@ -1,401 +0,0 @@
-apiVersion: trainer.kubeflow.org/v1alpha1
-kind: TrainingRuntime
-metadata:
-  name: llamafactory-finetune-runtime
-  namespace: kubeflow-admin-cpaas-io
-  labels:
-    trainer.kubeflow.org/framework: torch
-spec:
-  mlPolicy:
-    numNodes: 1
-    torch:
-      numProcPerNode: auto
-  template:
-    spec:
-      replicatedJobs:
-        - name: dataset-initializer
-          template:
-            metadata:
-              labels:
-                trainer.kubeflow.org/trainjob-ancestor-step: dataset-initializer
-            spec:
-              template:
-                spec:
-                  #hostNetwork: true
-                  #dnsPolicy: ClusterFirstWithHostNet
-                  securityContext:
-                    runAsNonRoot: true
-                    runAsUser: 1001
-                    runAsGroup: 1000
-                    fsGroup: 1000
-                  containers:
-                    - name: dataset-initializer
-                      #image: docker.1ms.run/alaudadockerhub/fine_tune_with_llamafactory_npu:v0.1.3
-                      image: docker.1ms.run/alaudadockerhub/fine_tune_with_llamafactory_npu:v0.9.4-cann_8.5.0-torch_2.6.0-v1
-                      command:
-                      - /bin/bash
-                      - -c
-                      - |
-                        set -ex
-                        cd /mnt/models
-                        DATASET_NAME=$(basename ${DATASET_URL})
-                        DATASET_URL_NO_HTTPS="${DATASET_URL//http:\/\/}"
-                        gitauth="${GIT_USER}:${GIT_TOKEN}"
-                        #rm -rf ${DATASET_NAME}
-                        #rm -rf data
-                        if [ -d ${DATASET_NAME} ]; then
-                            echo "dataset ${DATASET_NAME} already exists skipping download"
-                        else
-                            git -c http.sslVerify=false -c lfs.activitytimeout=36000 clone "http://${gitauth}@${DATASET_URL_NO_HTTPS}"
-                        fi
-                        echo "listing files under /mnt/models ..."
-                        ls /mnt/models
-                        echo "listing dataset files ..."
-                        ls ${DATASET_NAME}
-                      env:
-                      # Step 1: set DATASET_URL to download dataset from gitlab.
-                      - name: DATASET_URL
-                        value: "http://aml-gitlab.alaudatech.net/mlops-demo-ai-test/amldatasets/identity-alauda"
-                      # Step 2: set GIT_USER and GIT_TOKEN to access private git repo.
-                      # NOTE: if your dataset is located in different storage like S3, you need to modify the initializer container to download dataset from S3 instead of git.
-                      - name: GIT_USER
-                        valueFrom:
-                          secretKeyRef:
-                            name: aml-image-builder-secret
-                            key: MODEL_REPO_GIT_USER
-                      - name: GIT_TOKEN
-                        valueFrom:
-                          secretKeyRef:
-                            name: aml-image-builder-secret
-                            key: MODEL_REPO_GIT_TOKEN
-                      resources:
-                        requests:
-                          cpu: 100m
-                          memory: 128Mi
-                        limits:
-                          cpu: 2
-                          memory: 4Gi
-                      securityContext:
-                        allowPrivilegeEscalation: false
-                        capabilities:
-                          drop:
-                            - ALL
-                        runAsNonRoot: true
-                        seccompProfile:
-                          type: RuntimeDefault
-        - name: model-initializer
-          dependsOn:
-            - name: dataset-initializer
-              status: Complete
-          template:
-            metadata:
-              labels:
-                trainer.kubeflow.org/trainjob-ancestor-step: model-initializer
-            spec:
-              template:
-                spec:
-                  securityContext:
-                    runAsNonRoot: true
-                    runAsUser: 1001
-                    runAsGroup: 1000
-                    fsGroup: 1000
-                  containers:
-                    - name: model-initializer
-                      image: docker.1ms.run/alaudadockerhub/fine_tune_with_llamafactory_npu:v0.9.4-cann_8.5.0-torch_2.6.0-v1
-                      command:
-                      - /bin/bash
-                      - -c
-                      - |
-                        set -ex
-                        cd /mnt/models
-                        BASE_MODEL_NAME=$(basename ${BASE_MODEL_URL})
-                        # Download base model
-                        gitauth="${GIT_USER}:${GIT_TOKEN}"
-                        BASE_MODEL_URL_NO_HTTPS="${BASE_MODEL_URL//http:\/\/}"
-                        if [ -d ${BASE_MODEL_NAME} ]; then
-                            echo "${BASE_MODEL_NAME} dir already exists, skip downloading"
-                        else
-                            GIT_LFS_SKIP_SMUDGE=1 git -c http.sslVerify=false -c lfs.activitytimeout=36000 clone "http://${gitauth}@${BASE_MODEL_URL_NO_HTTPS}"
-                            (cd ${BASE_MODEL_NAME} && git -c http.sslVerify=false -c lfs.activitytimeout=36000 lfs pull)
-                        fi
-                        echo "listing files under /mnt/models ..."
-                        ls /mnt/models
-                        echo "listing model files ..."
-                        ls ${BASE_MODEL_NAME}
-                      env:
-                      # Step 3: set BASE_MODEL_URL to download base model from gitlab. Make sure the GIT_USER and GIT_TOKEN have access to this git repo.
-                      # NOTE: model repo name should not be the same as dataset repo name, otherwise the initializer may fail to download model and dataset correctly since they use the same PVC and the same git clone command.
-                      - name: BASE_MODEL_URL
-                        value: "http://aml-gitlab.alaudatech.net/mlops-demo-ai-test/amlmodels/qwen3-0.6b"
-                      - name: GIT_USER
-                        valueFrom:
-                          secretKeyRef:
-                            name: aml-image-builder-secret
-                            key: MODEL_REPO_GIT_USER
-                      - name: GIT_TOKEN
-                        valueFrom:
-                          secretKeyRef:
-                            name: aml-image-builder-secret
-                            key: MODEL_REPO_GIT_TOKEN
-                      resources:
-                        requests:
-                          cpu: 100m
-                          memory: 128Mi
-                        limits:
-                          cpu: 2
-                          memory: 4Gi
-                      securityContext:
-                        allowPrivilegeEscalation: false
-                        capabilities:
-                          drop:
-                            - ALL
-                        runAsNonRoot: true
-                        seccompProfile:
-                          type: RuntimeDefault
-        - name: node
-          dependsOn:
-            - name: model-initializer
-              status: Complete
-          template:
-            metadata:
-              labels:
-                trainer.kubeflow.org/trainjob-ancestor-step: trainer
-            spec:
-              backoffLimit: 0
-              template:
-                spec:
-                  # Step 4: Use the Ascend runtime class and scheduler that can allocate
-                  # Huawei NPUs on your cluster.
-                  schedulerName: hami-scheduler
-                  runtimeClassName: ascend
-                  # The trainer process UID/GID must match the Ascend device files
-                  # mounted under /dev. This sample assumes the NPU device files use
-                  # 1001:1000. If your cluster uses different ownership, rebuild the
-                  # image with that UID/GID and update these values.
-                  securityContext:
-                    runAsNonRoot: true
-                    runAsUser: 1001
-                    runAsGroup: 1000
-                    fsGroup: 1000
-                  volumes:
-                  - name: workspace
-                    emptyDir: {}
-                  - name: dshm
-                    emptyDir:
-                      medium: Memory
-                      # Step 4: set sizeLimit for dshm volume to tune the performance of multi GPU training.
-                      sizeLimit: 2Gi
-                  containers:
-                    - name: node
-                      image: docker.1ms.run/alaudadockerhub/fine_tune_with_llamafactory_npu:v0.9.4-cann_8.5.0-torch_2.6.0-v1
-                      env:
-                      - name: BASE_MODEL_URL
-                        value: "https://aml-gitlab.alaudatech.net/mlops-demo-ai-test/amlmodels/qwen3-0.6b"
-                      - name: DATASET_URL
-                        value: "https://aml-gitlab.alaudatech.net/mlops-demo-ai-test/amldatasets/identity-alauda"
-                      - name: GIT_USER
-                        valueFrom:
-                          secretKeyRef:
-                            name: aml-image-builder-secret
-                            key: MODEL_REPO_GIT_USER
-                      - name: GIT_TOKEN
-                        valueFrom:
-                          secretKeyRef:
-                            name: aml-image-builder-secret
-                            key: MODEL_REPO_GIT_TOKEN
-                      - name: HF_HOME
-                        value: /mnt/workspace/hf_cache
-                      - name: DO_MERGE
-                        value: "true"
-                        #- name: MLFLOW_TRACKING_URI
-                        #value: "http://mlflow-tracking-server.kubeflow:5000"
-                        #- name: MLFLOW_EXPERIMENT_NAME
-                        #value: mlops-demo-ai-test
-                      - name: MODEL_OUTPUT_DIR
-                        value: /mnt/workspace/output_model
-                      - name: OUTPUT_MODEL_URL
-                        value: "https://aml-gitlab.alaudatech.net/mlops-demo-ai-test/amlmodels/wy-sft-output"
-                      # Step 5: Keep Ascend process logs under the writable workspace
-                      # for easier debugging.
-                      - name: ASCEND_PROCESS_LOG_PATH
-                        value: /mnt/workspace/ascendlog
-                      command:
-                        - bash
-                        - -c
-                        - |
-                          set -ex
-
-                          export ASCEND_TOOLKIT_HOME=/usr/local/Ascend/cann
-                          export ASCEND_OPS_PATH=/usr/local/Ascend/cann
-                          export ASCEND_NNAL_HOME=/usr/local/Ascend/nnal
-                          export ASCEND_HOME_PATH=/usr/local/Ascend/cann
-                          export ASCEND_AICPU_PATH=/usr/local/Ascend/cann
-                          export ASCEND_OPP_PATH=/usr/local/Ascend/cann/opp
-                          export PATH=/usr/local/Ascend/cann/bin:/usr/local/Ascend/nnal/atb/bin:${PATH}
-                          export LD_LIBRARY_PATH=/usr/local/dcmi:/usr/local/Ascend/driver/lib64:/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64/driver:/usr/local/Ascend/cann/aarch64-linux/devlib:/usr/local/Ascend/cann/lib64:/usr/local/Ascend/nnal/atb/lib
-
-                          echo "PET_NNODES: ${PET_NNODES}, PET_NODE_RANK: ${PET_NODE_RANK}, PET_MASTER_ADDR: ${PET_MASTER_ADDR}, PET_MASTER_PORT: ${PET_MASTER_PORT}"
-                          if [ ${PET_NNODES} -gt 1 ]; then
-                              export N_RANKS=$PET_NNODES
-                              export RANK=$PET_NODE_RANK
-                              export MASTER_HOST=$PET_MASTER_ADDR
-                              export MASTER_PORT=$PET_MASTER_PORT
-                              export WORLD_SIZE=$PET_NNODES
-                              export NNODES=$PET_NNODES
-                              export NODE_RANK=$PET_NODE_RANK
-                              export MASTER_ADDR=${MASTER_HOST}
-                          else
-                              export N_RANKS=1
-                              export RANK=0
-                              export NNODES=1
-                              export MASTER_HOST=""
-                          fi
-
-                          source /usr/local/Ascend/ascend-toolkit/set_env.sh
-                          source /usr/local/Ascend/cann-8.5.0/share/info/ascendnpu-ir/bin/set_env.sh
-                          source /usr/local/Ascend/nnal/atb/set_env.sh
-
-                          cd /mnt/workspace
-                          BASE_MODEL_NAME=$(basename ${BASE_MODEL_URL})
-                          DATASET_NAME=$(basename ${DATASET_URL})
-
-                          cat >lf-sft.yaml <<EOL
-                          model_name_or_path: /mnt/models/${BASE_MODEL_NAME}
-
-                          stage: sft
-                          do_train: true
-                          finetuning_type: lora
-                          lora_target: all
-                          lora_rank: 8
-                          lora_alpha: 16
-                          lora_dropout: 0.1
-
-                          dataset: identity_alauda
-                          dataset_dir: /mnt/models/${DATASET_NAME}
-                          template: qwen
-                          cutoff_len: 1024
-                          max_samples: 1000
-                          overwrite_cache: true
-                          preprocessing_num_workers: 8
-
-                          output_dir: output_models
-                          logging_steps: 10
-                          save_steps: 500
-                          plot_loss: true
-                          overwrite_output_dir: true
-
-
-                          # global batch size: 8
-                          per_device_train_batch_size: 2
-                          gradient_accumulation_steps: 2
-                          learning_rate: 2.0e-4
-                          num_train_epochs: 4.0
-                          bf16: false
-                          fp16: true
-                          ddp_timeout: 180000000
-
-                          val_size: 0.1
-                          per_device_eval_batch_size: 1
-                          eval_strategy: steps
-                          eval_steps: 500
-                          EOL
-
-                          cat >ds-z3-config.json <<EOL
-                          {
-                            "train_batch_size": "auto",
-                            "train_micro_batch_size_per_gpu": "auto",
-                            "gradient_accumulation_steps": "auto",
-                            "gradient_clipping": "auto",
-                            "zero_allow_untested_optimizer": true,
-                            "fp16": {
-                              "enabled": "auto",
-                              "loss_scale": 0,
-                              "loss_scale_window": 1000,
-                              "initial_scale_power": 16,
-                              "hysteresis": 2,
-                              "min_loss_scale": 1
-                            },
-                            "bf16": {
-                              "enabled": "auto"
-                            },
-                            "zero_optimization": {
-                              "stage": 3,
-                              "overlap_comm": false,
-                              "contiguous_gradients": true,
-                              "sub_group_size": 1e9,
-                              "reduce_bucket_size": "auto",
-                              "stage3_prefetch_bucket_size": "auto",
-                              "stage3_param_persistence_threshold": "auto",
-                              "stage3_max_live_parameters": 1e9,
-                              "stage3_max_reuse_distance": 1e9,
-                              "stage3_gather_16bit_weights_on_model_save": true
-                            }
-                          }
-                          EOL
-
-                          # Run training
-                          if [ ${NNODES} -gt 1 ]; then
-                              echo "deepspeed: ds-z3-config.json" >> lf-sft.yaml
-                              FORCE_TORCHRUN=1 llamafactory-cli train lf-sft.yaml
-                          else
-                              unset NNODES
-                              unset NODE_RANK
-                              unset MASTER_ADDR
-                              unset MASTER_PORT
-                              llamafactory-cli train lf-sft.yaml
-                          fi
-
-
-                          if [ "${DO_MERGE}" == "true" ]; then
-                            # Merge LoRA adapters
-                            cat >lf-merge-config.yaml <<EOL
-                          model_name_or_path: /mnt/models/${BASE_MODEL_NAME}
-                          adapter_name_or_path: output_models
-                          template: qwen
-                          finetuning_type: lora
-
-
-                          ### export
-                          export_dir: output_models_merged
-                          export_size: 4
-                          export_device: cpu
-                          export_legacy_format: false
-                          EOL
-
-                            llamafactory-cli export lf-merge-config.yaml
-                          else
-                            # move output adapter for push
-                            mv output_models output_models_merged
-                          fi
-
-                          # push merged model to model repo
-                          gitauth="${GIT_USER}:${GIT_TOKEN}"
-                          cd /mnt/workspace/output_models_merged
-                          touch README.md
-                          OUTPUT_MODEL_NO_HTTPS="${OUTPUT_MODEL_URL//http:\/\/}"
-                          PUSH_URL="http://${gitauth}@${OUTPUT_MODEL_NO_HTTPS}"
-                          push_branch=$(date +'%Y%m%d-%H%M%S')
-
-                          # git init
-                          # git checkout -b sft-${push_branch}
-                          # git lfs track *.safetensors
-                          # git add .
-                          # git -c user.name='AMLSystemUser' -c user.email='aml_admin@cpaas.io' commit -am "fine tune push auto commit"
-                          # git -c http.sslVerify=false -c lfs.activitytimeout=36000 push -u ${PUSH_URL} sft-${push_branch}
-                      securityContext:
-                        allowPrivilegeEscalation: true
-                        capabilities:
-                          # drop:
-                          #  - ALL
-                          add: [ "IPC_LOCK", "SYS_PTRACE" ]
-                        # Keep these aligned with the pod securityContext above.
-                        runAsUser: 1001
-                        runAsGroup: 1000
-                        runAsNonRoot: true
-                        seccompProfile:
-                          type: RuntimeDefault
-                      volumeMounts:
-                        - name: workspace
-                          mountPath: /mnt/workspace
-                        - name: dshm
-                          mountPath: /dev/shm
diff --git a/docs/en/kubeflow/how_to/kf-trainjob-npu.yaml b/docs/en/kubeflow/how_to/kf-trainjob-npu.yaml
deleted file mode 100644
index 050ce33..0000000
--- a/docs/en/kubeflow/how_to/kf-trainjob-npu.yaml
+++ /dev/null
@@ -1,110 +0,0 @@
-apiVersion: trainer.kubeflow.org/v1alpha1
-kind: TrainJob
-metadata:
-  generateName: trainjob-sft-qwen3-
-  namespace: kubeflow-admin-cpaas-io
-  # labels:
-  #   kueue.x-k8s.io/queue-name: local-queue
-spec:
-  runtimeRef:
-    apiGroup: trainer.kubeflow.org
-    name: llamafactory-finetune-runtime
-    kind: TrainingRuntime
-  podTemplateOverrides:
-    - targetJobs:
-        - name: node
-      spec:
-        # Step 1: Configure models-cache volume override to mount the shared PVC for caching models
-        # In distributed training tasks (with >= 2 replicas), ensure that you use the appropriate storage type for caching large models:
-        # - Network storage, such as NFS or Ceph: Simply mount the network storage. Note that multiple containers may access this network storage simultaneously, resulting in high concurrent traffic. Furthermore, reading large model files may be slower than reading them locally (depending on the network storage's performance).
-        # - Local storage, such as topolvm or local-storage: Use `kserve local model cache` to pre-cache the model file on each node before mounting this PVC. Training tasks cannot cache each local PVC.
-        volumes:
-          - name: models-cache
-            persistentVolumeClaim:
-              claimName: glm5
-        containers:
-          - name: node
-            volumeMounts:
-              - name: models-cache
-                mountPath: /mnt/models
-        # Step 2: Configure node selector to ensure the job runs on GPU nodes if needed.
-        # nodeSelector:
-        #   nvidia.com/gpu.product: Tesla-T4
-    - targetJobs:
-        - name: dataset-initializer
-      spec:
-        # Step 3: Do the same as step 1.
-        volumes:
-          - name: models-cache
-            persistentVolumeClaim:
-              claimName: glm5
-        containers:
-          - name: dataset-initializer
-            volumeMounts:
-              - name: models-cache
-                mountPath: /mnt/models
-        # nodeSelector:
-        #   nvidia.com/gpu.product: Tesla-T4
-    - targetJobs:
-        - name: model-initializer
-      spec:
-        # Step 4: Do the same as step 1.
-        volumes:
-          - name: models-cache
-            persistentVolumeClaim:
-              claimName: glm5
-        containers:
-          - name: model-initializer
-            volumeMounts:
-              - name: models-cache
-                mountPath: /mnt/models
-        # nodeSelector:
-        #   nvidia.com/gpu.product: Tesla-T4
-  initializer:
-    # Step 5: set dataset and model URL in initializer step. The initializer will download dataset and model to the shared PVC, so that the trainer can access them from the same path.
-    dataset:
-      env:
-        - name: DATASET_URL
-          value: "http://gitlab-fuyao.test.com:30080/kubeflow-admin-cpaas-io/amlmodels/identity-alauda"
-    model:
-      env:
-        - name: BASE_MODEL_URL
-          value: "http://gitlab-fuyao.test.com:30080/kubeflow-admin-cpaas-io/amlmodels/Qwen3-0.6B"
-  trainer:
-    # Step 6: Keep one trainer pod for a single-host 2-NPU Ascend job. Use
-    # numNodes: 2+ only for true multi-node training across different hosts.
-    numNodes: 1
-    env:
-    # Step 7: Set model, dataset, and output locations for this run.
-    # Single-node multi-NPU jobs do not need loopback HCCL/GLOO NIC overrides.
-    # For true multi-node jobs, add only the HCCL networking variables required
-    # by your tested cluster topology.
-    - name: BASE_MODEL_URL
-      value: "http://gitlab-fuyao.test.com:30080/kubeflow-admin-cpaas-io/amlmodels/Qwen3-0.6B"
-    - name: DATASET_URL
-      value: "http://gitlab-fuyao.test.com:30080/kubeflow-admin-cpaas-io/amlmodels/identity-alauda"
-    - name: HF_HOME
-      value: /mnt/workspace/hf_cache
-    - name: DO_MERGE
-      value: "true"
-      #- name: MLFLOW_TRACKING_URI
-      #  value: "http://mlflow-tracking-server.kubeflow:5000"
-      #- name: MLFLOW_EXPERIMENT_NAME
-      #value: mlops-demo-ai-test
-    - name: MODEL_OUTPUT_DIR
-      value: /mnt/workspace/output_models_merged
-    - name: OUTPUT_MODEL_URL
-      value: "http://gitlab-fuyao.test.com:30080/kubeflow-admin-cpaas-io/amlmodels/qwen3-sft-output"
-    resourcesPerNode:
-      # Step 8: Request the exact Ascend resource keys exposed by your device
-      # plugin. This sample uses Ascend 910B4 and allocates two NPUs to one pod.
-      limits:
-        cpu: "4"
-        memory: "32Gi"
-        huawei.com/Ascend910B4: "2"
-        huawei.com/Ascend910B4-memory: "32G"
-      requests:
-        cpu: "1"
-        memory: "2Gi"
-        huawei.com/Ascend910B4: "2"
-        huawei.com/Ascend910B4-memory: "32G"
diff --git a/docs/en/kubeflow/how_to/pipelines-mlflow-integration.mdx b/docs/en/kubeflow/how_to/pipelines-mlflow-integration.mdx
deleted file mode 100644
index 35380c4..0000000
--- a/docs/en/kubeflow/how_to/pipelines-mlflow-integration.mdx
+++ /dev/null
@@ -1,267 +0,0 @@
----
-weight: 55
----
-
-# Kubeflow Pipeline + MLflow Integration
-
-This guide shows how to build Kubeflow Pipelines (KFP) components that log parameters, metrics, and model artifacts to [MLflow](./mlflow.mdx) — giving you a single source of truth for experiment tracking across your pipeline runs.
-
-## Scope
-
-- Alauda AI 2.5 and later.
-- Kubeflow Pipelines and the MLflow cluster plugin are installed.
-- Target namespaces have the MLflow workspace label (`mlflow-enabled=true`).
-- The pipeline components run in the same Kubernetes cluster where the MLflow Tracking Server is deployed.
-
-## Prerequisites
-
-- `kfp` Python SDK installed (`pip install kfp mlflow`).
-- Access to a KFP endpoint (see [Use Kubeflow Pipelines](./pipelines.mdx) for setup).
-- An MLflow workspace name matching a namespace with `mlflow-enabled=true`.
-
-## How pipeline components reach MLflow
-
-The MLflow Tracking Server is exposed as an in-cluster Service:
-
-| Namespace | Service | Port |
-|-----------|---------|------|
-| `kubeflow` | `mlflow-tracking-server.kubeflow.svc.cluster.local` | 5000 |
-| `aml-system` | `mlflow-tracking-server.aml-system.svc.cluster.local` | 5000 |
-
-Pipeline components use the short form:
-
-```python
-MLFLOW_TRACKING_URI = "http://mlflow-tracking-server.kubeflow:5000"
-```
-
-## Complete example: training pipeline with MLflow
-
-The following is a minimal but complete KFP pipeline that:
-
-1. Accepts model name, learning rate, and epochs as parameters.
-2. Simulates training (replace with your real training code) and logs parameters + metrics to MLflow.
-3. Packages the trained model as an MLflow artifact and uploads it to the MLflow Tracking Server.
-
-```python
-from kfp import dsl, compiler
-import mlflow
-import mlflow.tracking
-
-# ===== MLflow configuration =====
-MLFLOW_TRACKING_URI = "http://mlflow-tracking-server.kubeflow:5000"
-MLFLOW_EXPERIMENT_NAME = "kfp-training-experiment"
-
-
-@dsl.component(base_image="python:3.11-slim", packages_to_install=["mlflow", "kfp"])
-def train_model(
-    model_name: str,
-    learning_rate: float,
-    epochs: int,
-    output_model_path: str,
-) -> dict:
-    """Simulated training component with MLflow tracking."""
-    mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)
-    mlflow.set_experiment(MLFLOW_EXPERIMENT_NAME)
-
-    with mlflow.start_run(run_name=f"run-{dsl.RUN_ID_PLACEHOLDER}") as run:
-        # Log parameters
-        mlflow.log_param("model_name", model_name)
-        mlflow.log_param("learning_rate", learning_rate)
-        mlflow.log_param("epochs", epochs)
-
-        # Simulated training loop
-        metrics = {}
-        for epoch in range(1, epochs + 1):
-            # Replace this with your real training logic
-            loss = 2.0 * (0.95 ** epoch)  # placeholder loss curve
-            accuracy = 1.0 - loss         # placeholder accuracy
-
-            mlflow.log_metric("loss", loss, step=epoch)
-            mlflow.log_metric("accuracy", accuracy, step=epoch)
-            metrics = {"final_loss": loss, "final_accuracy": accuracy}
-
-        # Log the trained model as an artifact
-        import json, pathlib
-        model_dir = pathlib.Path(output_model_path)
-        model_dir.mkdir(parents=True, exist_ok=True)
-        (model_dir / "model.json").write_text(
-            json.dumps({"model_name": model_name, "epochs": epochs, "metrics": metrics})
-        )
-        mlflow.log_artifacts(str(model_dir), artifact_path="model")
-
-        run_id = run.info.run_id
-        print(f"MLflow run: {run_id}")
-
-    return metrics
-
-
-@dsl.component(base_image="python:3.11-slim", packages_to_install=["mlflow", "kfp"])
-def evaluate_model(
-    model_name: str,
-    test_data_path: str,
-) -> dict:
-    """Evaluate the trained model and log results to MLflow."""
-    mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)
-    mlflow.set_experiment(MLFLOW_EXPERIMENT_NAME)
-
-    with mlflow.start_run(run_name=f"eval-{dsl.RUN_ID_PLACEHOLDER}"):
-        # In a real pipeline, load the model artifact from MLflow or S3
-        # For now, log placeholder metrics
-        mlflow.log_param("model_name", model_name)
-        mlflow.log_param("test_data_path", test_data_path)
-        mlflow.log_metric("eval_accuracy", 0.92)
-        mlflow.log_metric("eval_f1", 0.89)
-        mlflow.log_metric("eval_precision", 0.91)
-        mlflow.log_metric("eval_recall", 0.88)
-
-    return {"eval_accuracy": 0.92}
-
-
-@dsl.pipeline(name="mlflow-training-pipeline", description="Train and evaluate with MLflow tracking")
-def training_pipeline(
-    model_name: str = "qwen3-0.6b",
-    learning_rate: float = 2e-4,
-    epochs: int = 10,
-    test_data_path: str = "s3://datasets/test-data",
-):
-    train_task = train_model(
-        model_name=model_name,
-        learning_rate=learning_rate,
-        epochs=epochs,
-        output_model_path="/tmp/model",
-    )
-
-    evaluate_model(
-        model_name=model_name,
-        test_data_path=test_data_path,
-    )
-
-
-# ===== Compile and submit =====
-compiler.Compiler().compile(training_pipeline, "pipeline.yaml")
-```
-
-## Upload and run
-
-### Via the KFP UI
-
-1. Go to **Kubeflow Dashboard → Pipelines → Upload Pipeline** and select `pipeline.yaml`.
-2. Click **Create Run** and fill in the parameters (model name, learning rate, epochs).
-3. After the run starts, check the MLflow UI under **Alauda AI → Advanced → MLFlow** for logged metrics.
-
-### Via the KFP SDK
-
-```python
-from kfp.client import Client
-
-client = Client(host="<MY-KFP-ENDPOINT>")
-
-run = client.create_run_from_pipeline_package(
-    "pipeline.yaml",
-    arguments={
-        "model_name": "qwen3-0.6b",
-        "learning_rate": 2e-4,
-        "epochs": 10,
-        "test_data_path": "s3://datasets/test-data",
-    },
-)
-
-print(f"Run URL: {client.get_run_id(run.name)}")
-```
-
-## Using MLflow in Trainer v2 pipelines
-
-If you are using [Kubeflow Trainer v2](../training_guides/fine-tune-with-trainer-v2.mdx) instead of KFP SDK pipelines, you can inject MLflow environment variables directly into the `TrainingJob` pod spec:
-
-```yaml
-apiVersion: trainer.kubeflow.org/v1
-kind: TrainingJob
-metadata:
-  name: mlflow-finetune
-spec:
-  trainingSpecs:
-    - replicas: 1
-      template:
-        spec:
-          containers:
-            - name: trainer
-              image: alaudadockerhub/fine_tune_with_llamafactory:v0.1.1
-              env:
-                - name: MLFLOW_TRACKING_URI
-                  value: "http://mlflow-tracking-server.kubeflow:5000"
-                - name: MLFLOW_EXPERIMENT_NAME
-                  value: "trainer-v2-finetune"
-```
-
-See [Fine-tuning LLMs using Workbench](../training_guides/fine-tuning-using-notebooks.mdx) for a full Trainer v2 + MLflow example with LLaMA-Factory.
-
-## Accessing MLflow run artifacts from other pipeline components
-
-Pipeline components can read MLflow artifacts from within a subsequent step. Use the MLflow Python client with the tracking URI to download artifacts:
-
-```python
-from kfp import dsl, compiler
-import mlflow
-
-@dsl.component(base_image="python:3.11-slim", packages_to_install=["mlflow", "kfp"])
-def download_and_compare(
-    source_model_uri: str,
-    reference_model_uri: str,
-) -> str:
-    """Download two models from MLflow and compare them."""
-    mlflow.set_tracking_uri("http://mlflow-tracking-server.kubeflow:5000")
-    client = mlflow.tracking.MlflowClient()
-
-    source_path = client.download_artifacts(source_model_uri, path="/tmp/source")
-    reference_path = client.download_artifacts(reference_model_uri, path="/tmp/reference")
-
-    return f"Compared models: {source_path} vs {reference_path}"
-```
-
-This pattern is useful for:
-
-- A/B comparing fine-tuned models before deployment.
-- Pulling the best model (by accuracy metric) from an MLflow experiment into an inference pipeline.
-- Validating that a model passed acceptance tests before pushing it to the model registry.
-
-## Best practices
-
-### Use the pipeline run ID in MLflow
-
-KFP provides `dsl.RUN_ID_PLACEHOLDER` as an environment variable inside each component. Use it to create distinct MLflow runs per pipeline execution:
-
-```python
-with mlflow.start_run(run_name=f"run-{dsl.RUN_ID_PLACEHOLDER}"):
-    ...
-```
-
-This avoids conflating pipeline runs that happen to use the same parameters.
-
-### Log metrics at the component level, not the pipeline level
-
-MLflow `log_metric` calls must happen inside a `mlflow.start_run()` block. If a component has multiple logical training stages, open separate MLflow runs within the same component — do not try to log metrics from outside a run context.
-
-### Use MLflow models for model registry integration
-
-Instead of logging arbitrary files as `mlflow.log_artifacts`, use `mlflow.log_model()` to register the model with its signature and dependencies:
-
-```python
-mlflow.sklearn.log_model(sk_model, "model")
-# or for HuggingFace:
-mlflow.transformers.log_model(hf_model, "model")
-```
-
-Registered models can then be promoted to the **Staging** or **Production** stage in the MLflow UI.
-
-### Artifact storage for production
-
-The default MLflow artifact store is local disk. For production pipelines, configure S3-compatible object storage in the MLflow plugin settings (see [MLflow Tracking Server](./mlflow.mdx) → High Availability And Storage). Pipeline components can then log large model artifacts without running into pod disk limits.
-
-## Troubleshooting
-
-| Symptom | Check |
-|---------|-------|
-| `ConnectionError` from `mlflow.set_tracking_uri` | Verify the MLflow service is reachable: `curl http://mlflow-tracking-server.kubeflow:5000/api/2.0/mlflow/ping`. If the pod is in `aml-system`, use that namespace instead. |
-| Run not showing in MLflow UI | Check the component logs for MLflow errors. The MLflow experiment must exist (created automatically on first `set_experiment`) and the workspace label must match the namespace. |
-| Artifact upload fails with `ResourceExhausted` | The artifact may exceed the pod's disk quota. Log artifacts directly to S3 by setting `MLFLOW_S3_ENDPOINT_URL` and providing AWS credentials as env vars. |
-| MLflow metrics not appearing in KFP UI | KFP and MLflow are separate systems. Metrics logged to MLflow appear in the MLflow UI (**Alauda AI → Advanced → MLFlow**), not in the KFP run output. |
diff --git a/docs/en/kubeflow/how_to/qwen3_finetune_verify.ipynb b/docs/en/training_guides/qwen3_finetune_verify.ipynb
similarity index 100%
rename from docs/en/kubeflow/how_to/qwen3_finetune_verify.ipynb
rename to docs/en/training_guides/qwen3_finetune_verify.ipynb

From 02774a44d738619473c3408e303e714606ff99fc Mon Sep 17 00:00:00 2001
From: Wu Yi <typhoonzero1986@gmail.com>
Date: Mon, 15 Jun 2026 05:42:42 +0000
Subject: [PATCH 08/21] docs: rewrite KFP+MLflow integration guide to a
 cluster-verified example

The pipelines-mlflow-integration example did not run as written. Fixes
verified against MLflow + KFP on g1-c1-x86:

- Import mlflow inside each @dsl.component (KFP v2 packages components from
  their own source, so a name imported at module level is undefined inside
  the component and raises NameError at runtime).
- Replace dsl.RUN_ID_PLACEHOLDER (removed in KFP v2) with
  dsl.PIPELINE_JOB_ID_PLACEHOLDER, passed in as a component argument.
- Document the secured-install access path: the mlflow-tracking-server
  Service fronts oauth2-proxy (302s headless clients), so components need a
  direct in-cluster Service, a ServiceAccount bearer token
  (MLFLOW_TRACKING_TOKEN), workspace RBAC, and a warm-up retry.
- Fix the Trainer v2 example (trainer.kubeflow.org/v1alpha1 TrainJob with
  runtimeRef/trainer, not TrainingJob/v1 with a raw pod template).
- Fix client.get_run_id -> run.run_id and the Tools menu path.

Also:
- Drop files unrelated to this PR's scope (agentic_mlops index + nav row,
  qwen3 finetune notebook) carried in from the coding-agents base branch.
- Remove dead _retry_kubectl_stdin_novalidate() from e2e/lib.sh.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 docs/en/agentic_mlops/index.mdx               |   6 -
 docs/en/training_guides/index.mdx             |   1 -
 .../pipelines-mlflow-integration.mdx          | 265 +++++++-----
 .../qwen3_finetune_verify.ipynb               | 390 ------------------
 e2e/lib.sh                                    |  24 --
 5 files changed, 173 insertions(+), 513 deletions(-)
 delete mode 100644 docs/en/agentic_mlops/index.mdx
 delete mode 100644 docs/en/training_guides/qwen3_finetune_verify.ipynb
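
Note (below the "---" marker, so not part of the commit message): both
components in the rewritten guide repeat the same warm-up retry loop around
mlflow.set_experiment. The sketch below shows one way that loop could be
factored into a helper. The name call_with_retry and its parameters are
hypothetical; they are not part of the guide, MLflow, or KFP. In a KFP v2
lightweight component the helper would still need to be defined inside the
component body, for the same reason the imports are.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    attempts: int = 5,
    delay_seconds: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn until it succeeds; re-raise the error from the last attempt."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            # Out of attempts: surface the original error to the caller.
            if attempt == attempts - 1:
                raise
            sleep(delay_seconds)
    raise AssertionError("unreachable")  # the loop always returns or raises
```

Inside a component this might be used as
call_with_retry(lambda: mlflow.set_experiment(experiment)), with the same
5 attempts and 3-second delay the guide uses.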

diff --git a/docs/en/agentic_mlops/index.mdx b/docs/en/agentic_mlops/index.mdx
deleted file mode 100644
index ec8f937..0000000
--- a/docs/en/agentic_mlops/index.mdx
+++ /dev/null
@@ -1,6 +0,0 @@
----
-weight: 10
----
-# Agentic MLOps
-
-<Overview />
diff --git a/docs/en/training_guides/index.mdx b/docs/en/training_guides/index.mdx
index 4825d7a..27a0933 100644
--- a/docs/en/training_guides/index.mdx
+++ b/docs/en/training_guides/index.mdx
@@ -17,4 +17,3 @@ End-to-end recipes for fine-tuning and pretraining LLMs on Alauda AI.
 | Production SFT / OSFT with automatic memory management | `training_hub` | [Fine-tuning LLMs with Training Hub](./training-hub-fine-tuning.mdx) |
 | Interactive exploration, custom scripts, VolcanoJob submission | Workbench Notebook | [Fine-tuning LLMs using Workbench](./fine-tuning-using-notebooks.mdx) |
 | Full-parameter SFT / pretraining on Ascend NPU | Workbench `PyTorch CANN` / `MindSpore CANN` | [Fine-tune and Pretrain on Ascend NPU](./fine-tune-and-pretrain-llms-on-ascend-npu.mdx) |
-| Agent-driven MLOps (plan templates, iterative tuning, structured reports) | Coding Agent + on-prem LLM | [Agentic MLOps](../agentic_mlops/mlops-with-coding-agents.mdx) |
diff --git a/docs/en/training_guides/pipelines-mlflow-integration.mdx b/docs/en/training_guides/pipelines-mlflow-integration.mdx
index e756305..da76161 100644
--- a/docs/en/training_guides/pipelines-mlflow-integration.mdx
+++ b/docs/en/training_guides/pipelines-mlflow-integration.mdx
@@ -19,121 +19,206 @@ This guide shows how to build Kubeflow Pipelines (KFP) components that log param
 - Access to a KFP endpoint (see [Use Kubeflow Pipelines](../kubeflow/how_to/pipelines.mdx) for setup).
 - An MLflow workspace name matching a namespace with `mlflow-enabled=true`.
 
-## How pipeline components reach MLflow
+## How pipeline components reach MLflow \{#how-pipeline-components-reach-mlflow}
 
-The MLflow Tracking Server is exposed as an in-cluster Service:
+This is the part most KFP + MLflow examples get wrong, so read it before copying the code below.
 
-| Namespace | Service | Port |
-|-----------|---------|------|
-| `kubeflow` | `mlflow-tracking-server.kubeflow.svc.cluster.local` | 5000 |
-| `aml-system` | `mlflow-tracking-server.aml-system.svc.cluster.local` | 5000 |
+When the MLflow plugin is installed with single sign-on and multi-tenancy enabled (the default on Alauda AI), the `mlflow-tracking-server` Service does **not** point straight at MLflow. It points at an `oauth2-proxy` sidecar, and the MLflow server itself requires a per-request identity token:
 
-Pipeline components use the short form:
+| What a pipeline component does | What happens on a secured install |
+|--------------------------------|-----------------------------------|
+| `GET http://mlflow-tracking-server.kubeflow:5000/...` | **HTTP 302** redirect to the SSO login page — the proxy expects an interactive browser, so a bearer token does not help |
+| Reach the MLflow server directly with **no** token | **HTTP 401** `UNAUTHENTICATED` |
+| Reach it directly with a token but **no** workspace permissions | reads succeed, writes return **HTTP 403** `PERMISSION_DENIED` |
+| Reach it directly **with** a token **and** workspace RBAC | logging works |
+
+So a headless component needs three things: a direct in-cluster endpoint that skips the OAuth proxy, a bearer token, and write permission in the workspace.
+
+:::tip
+If your MLflow plugin is installed **without** SSO/multi-tenancy (no `oauth2-proxy` sidecar on the `mlflow-tracking-server` pod, and `MLFLOW_MULTITENANCY_ENABLED` unset), the `mlflow-tracking-server` Service already targets MLflow directly and accepts unauthenticated in-cluster traffic. In that case skip steps 1–2 below and use `http://mlflow-tracking-server.kubeflow:5000` with no token.
+:::
+
+### 1. Expose a direct in-cluster endpoint
+
+Create a Service that targets the MLflow server's application port (`mlflow-http`) instead of the proxy port, so in-cluster clients bypass the browser login flow:
+
+```yaml
+apiVersion: v1
+kind: Service
+metadata:
+  name: mlflow-tracking-direct
+  namespace: kubeflow
+spec:
+  selector:
+    app: mlflow-tracking-server   # verify against the running pod's labels
+  ports:
+    - name: http
+      port: 5000
+      targetPort: mlflow-http     # the server container's port, not "proxy"
+```
+
+Pipeline components then use `http://mlflow-tracking-direct.kubeflow:5000`.
+
+### 2. Grant the component's ServiceAccount MLflow permissions
+
+The MLflow server authorizes writes against Kubernetes RBAC in the workspace namespace. Bind the ServiceAccount that your pipeline pods run as (for KFP multi-user this is usually `default-editor` in the profile namespace) to a Role that allows writing MLflow resources:
+
+```yaml
+apiVersion: rbac.authorization.k8s.io/v1
+kind: Role
+metadata:
+  name: mlflow-writer
+  namespace: team-a            # the MLflow workspace namespace
+rules:
+  - apiGroups: ["mlflow.kubeflow.org"]
+    resources: ["experiments", "runs", "registeredmodels"]
+    verbs: ["get", "list", "create", "update", "delete"]
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: RoleBinding
+metadata:
+  name: mlflow-writer
+  namespace: team-a
+roleRef:
+  apiGroup: rbac.authorization.k8s.io
+  kind: Role
+  name: mlflow-writer
+subjects:
+  - kind: ServiceAccount
+    name: default-editor       # the SA your KFP component pods use
+    namespace: team-a
+```
+
+See [MLflow Tracking Server → Workspace Access](../kubeflow/how_to/mlflow.mdx) for the workspace/RBAC model.
+
+### 3. Authenticate from the component
+
+In a KFP v2 lightweight component, send the pod's ServiceAccount token as the MLflow bearer token by setting `MLFLOW_TRACKING_TOKEN`. The token is mounted at the standard projected-token path (the snippet assumes `import os` earlier in the component body):
 
 ```python
-MLFLOW_TRACKING_URI = "http://mlflow-tracking-server.kubeflow:5000"
+with open("/var/run/secrets/kubernetes.io/serviceaccount/token") as f:
+    os.environ["MLFLOW_TRACKING_TOKEN"] = f.read().strip()
 ```
 
+Runs land in the server's **default workspace** (`MLFLOW_K8S_DEFAULT_WORKSPACE`). The stock `mlflow` client cannot send the per-request workspace header, so to target a *different* workspace use the Alauda MLflow client's `mlflow.set_workspace("team-a")` (see [Client Configuration](../kubeflow/how_to/mlflow.mdx)).
+
 ## Complete example: training pipeline with MLflow
 
-Here is a minimal but complete KFP pipeline. It accepts model name, learning rate,
-and epochs as parameters, simulates training with MLflow parameter and metric logging,
-and packages the trained model as an MLflow artifact.
+Every mechanism in this pipeline was checked against a secured (SSO + multi-tenant) MLflow install on Alauda AI: the SDK compiles the pipeline, and the ServiceAccount-token auth, workspace RBAC, and warm-up retry were confirmed against the live tracking server. Note the three corrections over a naive example: **imports are inside each component** (KFP v2 packages components from their own source, so module-level imports are not in scope), the **ServiceAccount token** is set before logging, and the first MLflow call is **retried** to absorb the brief authorization warm-up after a pod starts.
 
 ```python
 from kfp import dsl, compiler
-import mlflow
-import mlflow.tracking
 
-# ===== MLflow configuration =====
-MLFLOW_TRACKING_URI = "http://mlflow-tracking-server.kubeflow:5000"
+MLFLOW_TRACKING_URI = "http://mlflow-tracking-direct.kubeflow:5000"
 MLFLOW_EXPERIMENT_NAME = "kfp-training-experiment"
 
 
-@dsl.component(base_image="python:3.11-slim", packages_to_install=["mlflow", "kfp"])
+@dsl.component(base_image="python:3.11-slim", packages_to_install=["mlflow"])
 def train_model(
+    tracking_uri: str,
+    experiment: str,
     model_name: str,
     learning_rate: float,
     epochs: int,
-    output_model_path: str,
+    run_id: str,
 ) -> dict:
     """Simulated training component with MLflow tracking."""
-    mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)
-    mlflow.set_experiment(MLFLOW_EXPERIMENT_NAME)
-
-    with mlflow.start_run(run_name=f"run-{dsl.RUN_ID_PLACEHOLDER}") as run:
-        # Log parameters
+    import os, time, json, pathlib
+    import mlflow                       # MUST be imported inside the component
+
+    # Authenticate to the multi-tenant MLflow server with the pod's SA token.
+    with open("/var/run/secrets/kubernetes.io/serviceaccount/token") as f:
+        os.environ["MLFLOW_TRACKING_TOKEN"] = f.read().strip()
+    mlflow.set_tracking_uri(tracking_uri)
+
+    # The per-request authorization cache can lag a few seconds after a pod
+    # starts; retry the first call so the component does not fail on a 403.
+    for attempt in range(5):
+        try:
+            mlflow.set_experiment(experiment)
+            break
+        except Exception:
+            if attempt == 4:
+                raise
+            time.sleep(3)
+
+    metrics = {}
+    with mlflow.start_run(run_name=f"run-{run_id}"):
         mlflow.log_param("model_name", model_name)
         mlflow.log_param("learning_rate", learning_rate)
         mlflow.log_param("epochs", epochs)
 
-        # Simulated training loop
-        metrics = dict()
+        # Simulated training loop — replace with your real training logic.
         for epoch in range(1, epochs + 1):
-            # Replace this with your real training logic
             loss = 2.0 * (0.95 ** epoch)  # placeholder loss curve
             accuracy = 1.0 - loss         # placeholder accuracy
-
             mlflow.log_metric("loss", loss, step=epoch)
             mlflow.log_metric("accuracy", accuracy, step=epoch)
-            metrics["final_loss"] = loss
-            metrics["final_accuracy"] = accuracy
+            metrics = {"final_loss": loss, "final_accuracy": accuracy}
 
-        # Log the trained model as an artifact
-        import json, pathlib
-        model_dir = pathlib.Path(output_model_path)
+        # Log the trained model as an artifact (requires configured artifact
+        # storage — see "Artifact storage for production" below).
+        model_dir = pathlib.Path("/tmp/model")
         model_dir.mkdir(parents=True, exist_ok=True)
         (model_dir / "model.json").write_text(
             json.dumps(dict(model_name=model_name, epochs=epochs, metrics=metrics))
         )
         mlflow.log_artifacts(str(model_dir), artifact_path="model")
 
-        run_id = run.info.run_id
-        print(f"MLflow run: {run_id}")
-
     return metrics
 
 
-@dsl.component(base_image="python:3.11-slim", packages_to_install=["mlflow", "kfp"])
-def evaluate_model(
-    model_name: str,
-    test_data_path: str,
-) -> dict:
+@dsl.component(base_image="python:3.11-slim", packages_to_install=["mlflow"])
+def evaluate_model(tracking_uri: str, experiment: str, model_name: str, run_id: str) -> dict:
     """Evaluate the trained model and log results to MLflow."""
-    mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)
-    mlflow.set_experiment(MLFLOW_EXPERIMENT_NAME)
-
-    with mlflow.start_run(run_name=f"eval-{dsl.RUN_ID_PLACEHOLDER}"):
-        # In a real pipeline, load the model artifact from MLflow or S3
-        # For now, log placeholder metrics
+    import os, time
+    import mlflow
+
+    with open("/var/run/secrets/kubernetes.io/serviceaccount/token") as f:
+        os.environ["MLFLOW_TRACKING_TOKEN"] = f.read().strip()
+    mlflow.set_tracking_uri(tracking_uri)
+    for attempt in range(5):
+        try:
+            mlflow.set_experiment(experiment)
+            break
+        except Exception:
+            if attempt == 4:
+                raise
+            time.sleep(3)
+
+    with mlflow.start_run(run_name=f"eval-{run_id}"):
         mlflow.log_param("model_name", model_name)
-        mlflow.log_param("test_data_path", test_data_path)
         mlflow.log_metric("eval_accuracy", 0.92)
         mlflow.log_metric("eval_f1", 0.89)
-        mlflow.log_metric("eval_precision", 0.91)
-        mlflow.log_metric("eval_recall", 0.88)
 
     return dict(eval_accuracy=0.92)
 
 
 @dsl.pipeline(name="mlflow-training-pipeline", description="Train and evaluate with MLflow tracking")
 def training_pipeline(
+    tracking_uri: str = MLFLOW_TRACKING_URI,
+    experiment: str = MLFLOW_EXPERIMENT_NAME,
     model_name: str = "qwen3-0.6b",
     learning_rate: float = 2e-4,
     epochs: int = 10,
-    test_data_path: str = "s3://datasets/test-data",
 ):
     train_task = train_model(
+        tracking_uri=tracking_uri,
+        experiment=experiment,
         model_name=model_name,
         learning_rate=learning_rate,
         epochs=epochs,
-        output_model_path="/tmp/model",
+        # PIPELINE_JOB_ID_PLACEHOLDER resolves to the run's job id at runtime;
+        # pass it in as an argument (a component cannot reference dsl.* itself).
+        run_id=dsl.PIPELINE_JOB_ID_PLACEHOLDER,
     )
 
     evaluate_model(
+        tracking_uri=tracking_uri,
+        experiment=experiment,
         model_name=model_name,
-        test_data_path=test_data_path,
-    )
+        run_id=dsl.PIPELINE_JOB_ID_PLACEHOLDER,
+    ).after(train_task)
 
 
 # ===== Compile and submit =====
@@ -146,7 +231,7 @@ compiler.Compiler().compile(training_pipeline, "pipeline.yaml")
 
 1. Go to **Kubeflow Dashboard → Pipelines → Upload Pipeline** and select `pipeline.yaml`.
 2. Click **Create Run** and fill in the parameters (model name, learning rate, epochs).
-3. After the run starts, check the MLflow UI under **Alauda AI → Advanced → MLFlow** for logged metrics.
+3. After the run starts, check the MLflow UI under **Alauda AI → Tools → MLFlow** for logged metrics.
 
 ### Via the KFP SDK
 
@@ -161,59 +246,55 @@ run = client.create_run_from_pipeline_package(
         model_name="qwen3-0.6b",
         learning_rate=2e-4,
         epochs=10,
-        test_data_path="s3://datasets/test-data",
     ),
 )
 
-print(f"Run URL: {client.get_run_id(run.name)}")
+print(f"Run ID: {run.run_id}")
 ```
 
 ## Using MLflow in Trainer v2 pipelines
 
-If you are using [Kubeflow Trainer v2](./fine-tune-with-trainer-v2.mdx) instead of KFP SDK pipelines, you can inject MLflow environment variables directly into the `TrainingJob` pod spec:
+If you are using [Kubeflow Trainer v2](./fine-tune-with-trainer-v2.mdx) instead of KFP SDK pipelines, set the MLflow environment variables on the `TrainJob`'s trainer. Trainer v2 uses `apiVersion: trainer.kubeflow.org/v1alpha1` and `kind: TrainJob`, and the spec is built from `spec.runtimeRef` plus `spec.trainer` rather than a raw pod template.
 
 ```yaml
-apiVersion: trainer.kubeflow.org/v1
-kind: TrainingJob
+apiVersion: trainer.kubeflow.org/v1alpha1
+kind: TrainJob
 metadata:
   name: mlflow-finetune
 spec:
-  trainingSpecs:
-    - replicas: 1
-      template:
-        spec:
-          containers:
-            - name: trainer
-              image: alaudadockerhub/fine_tune_with_llamafactory:v0.1.1
-              env:
-                - name: MLFLOW_TRACKING_URI
-                  value: "http://mlflow-tracking-server.kubeflow:5000"
-                - name: MLFLOW_EXPERIMENT_NAME
-                  value: "trainer-v2-finetune"
+  runtimeRef:
+    name: torch-distributed        # a TrainingRuntime / ClusterTrainingRuntime
+  trainer:
+    image: alaudadockerhub/fine_tune_with_llamafactory:v0.1.1
+    env:
+      - name: MLFLOW_TRACKING_URI
+        value: "http://mlflow-tracking-direct.kubeflow:5000"
+      - name: MLFLOW_EXPERIMENT_NAME
+        value: "trainer-v2-finetune"
 ```
 
-See [Fine-tuning LLMs using Workbench](./fine-tuning-using-notebooks.mdx) for a full Trainer v2 + MLflow example with LLaMA-Factory.
+On a secured install the trainer pod also needs a token and workspace RBAC (see [How pipeline components reach MLflow](#how-pipeline-components-reach-mlflow)): set `MLFLOW_TRACKING_TOKEN` from the pod's ServiceAccount token, and bind that ServiceAccount to the `mlflow-writer` Role. See [Fine-tuning LLMs using Workbench](./fine-tuning-using-notebooks.mdx) for a full Trainer v2 + MLflow example with LLaMA-Factory.
 
 ## Accessing MLflow run artifacts from other pipeline components
 
-Pipeline components can read MLflow artifacts from within a subsequent step. Use the MLflow Python client with the tracking URI to download artifacts:
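+A later component can read MLflow artifacts logged by an earlier one. Authenticate the same way as the training component (import `mlflow` inside the function body and set `MLFLOW_TRACKING_TOKEN` from the ServiceAccount token), then use the MLflow client: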
 
 ```python
-from kfp import dsl, compiler
-import mlflow
+from kfp import dsl
 
-@dsl.component(base_image="python:3.11-slim", packages_to_install=["mlflow", "kfp"])
-def download_and_compare(
-    source_model_uri: str,
-    reference_model_uri: str,
-) -> str:
+@dsl.component(base_image="python:3.11-slim", packages_to_install=["mlflow"])
+def download_and_compare(tracking_uri: str, source_model_uri: str, reference_model_uri: str) -> str:
     """Download two models from MLflow and compare them."""
-    mlflow.set_tracking_uri("http://mlflow-tracking-server.kubeflow:5000")
+    import os
+    import mlflow
+
+    with open("/var/run/secrets/kubernetes.io/serviceaccount/token") as f:
+        os.environ["MLFLOW_TRACKING_TOKEN"] = f.read().strip()
+    mlflow.set_tracking_uri(tracking_uri)
     client = mlflow.tracking.MlflowClient()
 
     source_path = client.download_artifacts(source_model_uri, path="/tmp/source")
     reference_path = client.download_artifacts(reference_model_uri, path="/tmp/reference")
-
     return f"Compared models: {source_path} vs {reference_path}"
 ```
 
@@ -225,16 +306,15 @@ This pattern is useful for:
 
 ## Best practices
 
-### Use the pipeline run ID in MLflow
+### Use the pipeline job ID in MLflow
 
-KFP provides `dsl.RUN_ID_PLACEHOLDER` as an environment variable inside each component. Use it to create distinct MLflow runs per pipeline execution:
+KFP v2 provides `dsl.PIPELINE_JOB_ID_PLACEHOLDER` (the v1 `dsl.RUN_ID_PLACEHOLDER` was removed). It is a pipeline-level placeholder, so pass it into the component as an argument — a component cannot reference `dsl.*` from inside its own body:
 
 ```python
-with mlflow.start_run(run_name=f"run-{dsl.RUN_ID_PLACEHOLDER}"):
-    ...
+train_model(..., run_id=dsl.PIPELINE_JOB_ID_PLACEHOLDER)
 ```
 
-This avoids conflating pipeline runs that happen to use the same parameters.
+Then use the received string in the run name to keep MLflow runs distinct per pipeline execution.
 
 ### Log metrics at the component level, not the pipeline level
 
@@ -242,7 +322,7 @@ MLflow `log_metric` calls must happen inside a `mlflow.start_run()` block. If a
 
 ### Use MLflow models for model registry integration
 
-Instead of logging arbitrary files as `mlflow.log_artifacts`, use `mlflow.log_model()` to register the model with its signature and dependencies:
+Instead of logging arbitrary files with `mlflow.log_artifacts`, use a flavor's `log_model()` to log the model together with its signature and dependencies (pass `registered_model_name` to also register it):
 
 ```python
 mlflow.sklearn.log_model(sk_model, "model")
@@ -254,13 +334,14 @@ Registered models can then be promoted to the **Staging** or **Production** stag
 
 ### Artifact storage for production
 
-The default MLflow artifact store is local disk. For production pipelines, configure S3-compatible object storage in the MLflow plugin settings (see [MLflow Tracking Server](../kubeflow/how_to/mlflow.mdx) → High Availability And Storage). Pipeline components can then log large model artifacts without running into pod disk limits.
+The default MLflow artifact store is local to the MLflow pod. For production pipelines, configure S3-compatible object storage in the MLflow plugin settings (see [MLflow Tracking Server](../kubeflow/how_to/mlflow.mdx) → High Availability And Storage). Pipeline components can then log large model artifacts without running into pod disk limits.
 
 ## Troubleshooting
 
 | Symptom | Check |
 |---------|-------|
-| `ConnectionError` from `mlflow.set_tracking_uri` | Verify the MLflow service is reachable: `curl http://mlflow-tracking-server.kubeflow:5000/api/2.0/mlflow/ping`. If the pod is in `aml-system`, use that namespace instead. |
-| Run not showing in MLflow UI | Check the component logs for MLflow errors. The MLflow experiment must exist (created automatically on first `set_experiment`) and the workspace label must match the namespace. |
-| Artifact upload fails with `ResourceExhausted` | The artifact may exceed the pod's disk quota. Log artifacts directly to S3 by setting `MLFLOW_S3_ENDPOINT_URL` and providing AWS credentials as env vars. |
-| MLflow metrics not appearing in KFP UI | KFP and MLflow are separate systems. Metrics logged to MLflow appear in the MLflow UI (**Alauda AI → Advanced → MLFlow**), not in the KFP run output. |
+| Component hangs or fails with an HTML/redirect (`302`) response | You are hitting the `oauth2-proxy` in front of MLflow. Use a direct in-cluster Service (`mlflow-tracking-direct`) that targets the `mlflow-http` port, not `http://mlflow-tracking-server.kubeflow:5000`. |
+| `401 UNAUTHENTICATED` / "Missing Authorization header" | The component sent no token. Set `MLFLOW_TRACKING_TOKEN` from `/var/run/secrets/kubernetes.io/serviceaccount/token` before the first MLflow call. |
+| `403 PERMISSION_DENIED` | The component's ServiceAccount lacks write RBAC in the workspace namespace — apply the `mlflow-writer` Role/RoleBinding. A single 403 right after pod start is the authorization cache warming up; the retry loop in the example absorbs it. |
+| Run shows up in the wrong workspace | The stock `mlflow` client logs to the server's default workspace. Use the Alauda client's `mlflow.set_workspace(...)` to target another workspace. |
+| MLflow metrics not appearing in KFP UI | KFP and MLflow are separate systems. Metrics logged to MLflow appear in the MLflow UI (**Alauda AI → Tools → MLFlow**), not in the KFP run output. |
diff --git a/docs/en/training_guides/qwen3_finetune_verify.ipynb b/docs/en/training_guides/qwen3_finetune_verify.ipynb
deleted file mode 100644
index 2745368..0000000
--- a/docs/en/training_guides/qwen3_finetune_verify.ipynb
+++ /dev/null
@@ -1,390 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "markdown",
-   "id": "18655ab8",
-   "metadata": {},
-   "source": [
-    "# Qwen3-8B Full-Parameter Fine-Tuning Verification\n",
-    "\n",
-    "This notebook verifies the fine-tuning capability of the **Ascend 910B CANN image** by running full-parameter SFT fine-tuning for Qwen3-8B with MindSpeed-LLM.\n",
-    "\n",
-    "**Workflow:**\n",
-    "1. Environment check\n",
-    "2. Prepare a sample dataset (Alpaca format)\n",
-    "3. Clone the MindSpeed-LLM scripts\n",
-    "4. Convert HF weights to Megatron weights\n",
-    "5. Preprocess the data\n",
-    "6. Start fine-tuning\n",
-    "7. Run inference validation\n",
-    "\n",
-    "> The training parameters are set for verification mode (few iterations + short sequence length). Increase `TRAIN_ITERS` and `SEQ_LENGTH` for production use."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "12b48017",
-   "metadata": {},
-   "source": [
-    "## 0. Parameter Configuration"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "id": "a0fa2576",
-   "metadata": {},
-   "source": "import warnings\nwarnings.filterwarnings('ignore', category=DeprecationWarning)\nwarnings.filterwarnings('ignore', category=ImportWarning)\nwarnings.filterwarnings('ignore', category=UserWarning)\n\nfrom pathlib import Path\n\n# ===== Path configuration =====\nHF_MODEL_DIR = Path('/opt/app-root/src/models/Qwen3-8B')\nWORK_DIR = Path('/opt/app-root/src/Qwen3-8B-work-dir')\nMINDSPEED_LLM_DIR = WORK_DIR / 'MindSpeed-LLM'\nDATA_DIR = WORK_DIR / 'finetune_dataset'\nRAW_DATA_FILE = DATA_DIR / 'alpaca_sample.jsonl'\nPROCESSED_DATA_PREFIX = DATA_DIR / 'alpaca'\nOUTPUT_DIR = WORK_DIR / 'output' / 'qwen3_8b_finetuned'\nLOGS_DIR = WORK_DIR / 'logs'\n\n# ===== Optional: real dataset path =====\nALPACA_PARQUET = Path('/opt/app-root/src/datasets/alpaca/train-00000-of-00001-a09b74b3ef9c3b56.parquet')\n\n# ===== Ascend environment scripts =====\nCANN_ENV = '/usr/local/Ascend/cann/set_env.sh'\nATB_ENV = '/usr/local/Ascend/nnal/atb/set_env.sh'\n\n# ===== Parallelism configuration (must match weight conversion) =====\nTP = 2   # With TP=1, one card holds about 4.1B parameters; fp32 gradient buffers + bf16 weights require about 30 GiB, exceeding the 910B 29 GiB memory limit\nPP = 2   # At least TPxPP=4 NPUs are required; for a single card, set TP=1 and PP=1 (OOM is possible)\n\n# ===== Weight conversion output (path includes parallel settings to avoid reusing stale weights after TP/PP changes) =====\nMCORE_WEIGHTS_DIR = WORK_DIR / 'model_weights' / f'qwen3_mcore_tp{TP}_pp{PP}'\n\n# ===== Training hyperparameters (verification mode) =====\nSEQ_LENGTH = 512     # 4096 is recommended for production\nTRAIN_ITERS = 50     # 2000+ is recommended for production\nMBS = 1\nLR = 1.25e-6\nMIN_LR = 1.25e-7\n\n# ===== Data preprocessing =====\nHANDLER_NAME = 'AlpacaStyleInstructionHandler'\nTOKENIZER_TYPE = 'PretrainedFromHF'\nPROMPT_TYPE = 'qwen3'\nENABLE_THINKING = 'none'\n\nprint('Configuration loaded')\nprint(f'  Model: {HF_MODEL_DIR}')\nprint(f'  Dataset: {ALPACA_PARQUET}' if 
ALPACA_PARQUET.exists() else '  Dataset: not found, using built-in sample data')\nprint(f'  TP={TP}, PP={PP}, SEQ={SEQ_LENGTH}, ITERS={TRAIN_ITERS}')",
-   "outputs": [],
-   "execution_count": null
-  },
-  {
-   "cell_type": "markdown",
-   "id": "15d10a9a",
-   "metadata": {},
-   "source": [
-    "## Helper Function"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "id": "7eb53b45",
-   "metadata": {},
-   "source": "import os\nimport subprocess\n\n_SUPPRESS_WARNINGS = 'ignore::DeprecationWarning,ignore::ImportWarning,ignore::UserWarning'\n\ndef run_cmd(cmd, cwd=None, check=True):\n    'Run a bash command in the Ascend environment and stream output in real time'\n    env_prefix = f'source {CANN_ENV} && source {ATB_ENV}'\n    full_cmd = f'{env_prefix} && {cmd}'\n    print(f'$ {cmd}\\n')\n    run_env = os.environ.copy()\n    run_env['PYTHONWARNINGS'] = _SUPPRESS_WARNINGS\n    result = subprocess.run(\n        ['bash', '-lc', full_cmd],\n        cwd=str(cwd or WORK_DIR),\n        text=True,\n        env=run_env,\n    )\n    if check and result.returncode != 0:\n        raise RuntimeError(f'Command failed with return code: {result.returncode}')\n    return result\n\nprint('Helper function defined: run_cmd()')",
-   "outputs": [],
-   "execution_count": null
-  },
-  {
-   "cell_type": "markdown",
-   "id": "0d2cbf3b",
-   "metadata": {},
-   "source": [
-    "## 1. Environment Check"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "id": "1643dfe5",
-   "metadata": {},
-   "source": "import warnings\nwith warnings.catch_warnings():\n    warnings.simplefilter('ignore', DeprecationWarning)\n    warnings.simplefilter('ignore', ImportWarning)\n    warnings.simplefilter('ignore', UserWarning)\n    import torch\n    import torch_npu\n\nprint('=' * 60)\nprint('Environment Check')\nprint('=' * 60)\n\n# PyTorch & NPU\nprint(f'PyTorch:    {torch.__version__}')\nprint(f'torch_npu:  {torch_npu.__version__}')\nnproc = torch.npu.device_count()\nprint(f'NPU count:  {nproc}')\nfor i in range(nproc):\n    print(f'  NPU {i}: {torch.npu.get_device_name(i)}')\n\n# MindSpeed\nwith warnings.catch_warnings():\n    warnings.simplefilter('ignore', DeprecationWarning)\n    warnings.simplefilter('ignore', ImportWarning)\n    warnings.simplefilter('ignore', UserWarning)\n    import mindspeed\n    import mindspeed_llm\nprint('MindSpeed:     installed')\nprint('MindSpeed-LLM: installed')\n\n# Model files\nprint(f'\\nModel directory: {HF_MODEL_DIR}')\nassert HF_MODEL_DIR.exists(), f'Model directory does not exist: {HF_MODEL_DIR}'\nmodel_files = sorted(HF_MODEL_DIR.glob('*'))\nfor f in model_files[:5]:\n    if f.is_file():\n        print(f'  {f.name} ({f.stat().st_size / 1e9:.2f} GB)')\nif len(model_files) > 5:\n    print(f'  ... {len(model_files)} files in total')\n\n# Parallelism validation\nassert nproc >= TP * PP, f'NPU count ({nproc}) < TP*PP ({TP*PP}); reduce PP'\nDP = nproc // (TP * PP)\nGBS = DP * MBS\nprint(f'\\nParallelism: TP={TP}, PP={PP}, DP={DP}, GBS={GBS}')\nassert torch.npu.is_available(), 'NPU is not available'\nprint('\\nEnvironment check passed!')",
-   "outputs": [],
-   "execution_count": null
-  },
-  {
-   "cell_type": "markdown",
-   "id": "a194e018",
-   "metadata": {},
-   "source": [
-    "## 2. Prepare a Sample Dataset\n",
-    "\n",
-    "Create sample data in Alpaca format to verify the fine-tuning workflow.\n",
-    "\n",
-    "To use a real dataset, place a JSONL file at `RAW_DATA_FILE`, with one JSON object per line:\n",
-    "```json\n",
-    "{\"instruction\": \"...\", \"input\": \"...\", \"output\": \"...\"}\n",
-    "```"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "id": "6d845761",
-   "metadata": {},
-   "source": "import json\nimport warnings\nimport pandas as pd\n\nDATA_DIR.mkdir(parents=True, exist_ok=True)\n\nif ALPACA_PARQUET.exists():\n    print(f'Loading Alpaca dataset: {ALPACA_PARQUET.name}')\n    with warnings.catch_warnings():\n        warnings.simplefilter('ignore', DeprecationWarning)\n        df = pd.read_parquet(ALPACA_PARQUET)\n    print(f'{len(df)} samples loaded, columns: {list(df.columns)}')\n\n    # Convert to JSONL (instruction / input / output)\n    with open(RAW_DATA_FILE, 'w', encoding='utf-8') as f:\n        for item in df[['instruction', 'input', 'output']].to_dict('records'):\n            item['input'] = item.get('input') or ''\n            f.write(json.dumps(item, ensure_ascii=False) + '\\n')\n\n    print(f'Converted to JSONL: {RAW_DATA_FILE}')\n    print('\\nSample records:')\n    for item in df[['instruction', 'input', 'output']].head(3).to_dict('records'):\n        inp = f' {item[\"input\"]}' if item['input'] else ''\n        print(f'  Q: {item[\"instruction\"][:80]}{inp[:40]}')\n        print(f'  A: {str(item[\"output\"])[:80]}')\nelse:\n    print('Alpaca dataset not found, using built-in sample data\\n')\n    sample_data = [\n        {'instruction': 'Translate the following sentence into French', 'input': 'The weather is nice today.', 'output': \"Il fait beau aujourd'hui.\"},\n        {'instruction': 'Translate the following sentence into Spanish', 'input': 'I like programming.', 'output': 'Me gusta programar.'},\n        {'instruction': 'Summarize the sentence in one short phrase', 'input': 'Machine learning is fascinating and widely used in many fields.', 'output': 'Machine learning is broadly useful.'},\n        {'instruction': 'Rewrite the sentence in a more formal tone', 'input': 'Hello, how are you?', 'output': 'Hello, how are you doing today?'},\n        {'instruction': 'Introduce Python in one sentence', 'input': '', 'output': 'Python is a high-level general-purpose programming language known for its readability and rich 
ecosystem.'},\n        {'instruction': 'List three common sorting algorithms', 'input': '', 'output': 'Three common sorting algorithms are bubble sort, quicksort, and merge sort.'},\n        {'instruction': 'Explain what deep learning is', 'input': '', 'output': 'Deep learning is a branch of machine learning that uses multi-layer neural networks to learn hierarchical representations of data.'},\n        {'instruction': 'Write a Python function to add two numbers', 'input': '', 'output': 'def add(a, b):\\n    return a + b'},\n        {'instruction': 'Rewrite the sentence to be more concise', 'input': 'Artificial intelligence is changing the world.', 'output': 'AI is transforming the world.'},\n        {'instruction': 'What is a GPU?', 'input': '', 'output': 'A GPU is a graphics processing unit designed to accelerate highly parallel computation, especially for training and inference workloads.'},\n    ]\n    with open(RAW_DATA_FILE, 'w', encoding='utf-8') as f:\n        for item in sample_data:\n            f.write(json.dumps(item, ensure_ascii=False) + '\\n')\n    print(f'Sample dataset created: {RAW_DATA_FILE}')\n    print(f'{len(sample_data)} samples in total')",
-   "outputs": [],
-   "execution_count": null
-  },
-  {
-   "cell_type": "markdown",
-   "id": "9c4692a2",
-   "metadata": {},
-   "source": [
-    "## 3. Clone MindSpeed-LLM\n",
-    "\n",
-    "The `mindspeed_llm` Python package is already installed in the image, but the training scripts (`convert_ckpt_v2.py`, `preprocess_data.py`, `posttrain_gpt.py`, and others) must be run from the repository directory."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "id": "511c1c4d",
-   "metadata": {},
-   "source": [
-    "if MINDSPEED_LLM_DIR.exists():\n",
-    "    print(f'Already exists: {MINDSPEED_LLM_DIR}')\n",
-    "else:\n",
-    "    print('Cloning MindSpeed-LLM (shallow clone)...')\n",
-    "    run_cmd(f'git clone --depth 1 https://gitcode.com/ascend/MindSpeed-LLM.git {MINDSPEED_LLM_DIR}')\n",
-    "\n",
-    "# Validate required scripts\n",
-    "scripts = [\n",
-    "    ('Weight conversion', 'convert_ckpt_v2.py'),\n",
-    "    ('Data preprocessing', 'preprocess_data.py'),\n",
-    "    ('Fine-tuning', 'posttrain_gpt.py'),\n",
-    "    ('Inference', 'inference.py'),\n",
-    "]\n",
-    "for name, script in scripts:\n",
-    "    exists = (MINDSPEED_LLM_DIR / script).exists()\n",
-    "    print(f'  [{name}] {script}: {\"OK\" if exists else \"MISSING\"}')\n",
-    "\n",
-    "assert all((MINDSPEED_LLM_DIR / s).exists() for _, s in scripts), 'Required scripts are missing'\n",
-    "print('\\nScript check passed!')"
-   ],
-   "outputs": [],
-   "execution_count": null
-  },
-  {
-   "cell_type": "markdown",
-   "id": "331e0d10",
-   "metadata": {},
-   "source": [
-    "## 4. HF Weight to Megatron Weight Conversion\n",
-    "\n",
-    "Convert HuggingFace-format weights to Megatron format, split by TP/PP. The first conversion usually takes about 5-10 minutes."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "id": "463dd7da",
-   "metadata": {},
-   "source": [
-    "MCORE_WEIGHTS_DIR.mkdir(parents=True, exist_ok=True)\n",
-    "\n",
-    "# Check whether conversion has already been completed\n",
-    "converted = any(MCORE_WEIGHTS_DIR.glob('iter_*'))\n",
-    "\n",
-    "if converted:\n",
-    "    print(f'Weights already exist, skipping conversion: {MCORE_WEIGHTS_DIR}')\n",
-    "    for p in sorted(MCORE_WEIGHTS_DIR.iterdir()):\n",
-    "        print(f'  {p.name}')\n",
-    "else:\n",
-    "    convert_cmd = ' && '.join([\n",
-    "        f'cd {MINDSPEED_LLM_DIR}',\n",
-    "        f'python convert_ckpt_v2.py'\n",
-    "        ' --load-model-type hf'\n",
-    "        ' --save-model-type mg'\n",
-    "        f' --target-tensor-parallel-size {TP}'\n",
-    "        f' --target-pipeline-parallel-size {PP}'\n",
-    "        f' --load-dir {HF_MODEL_DIR}'\n",
-    "        f' --save-dir {MCORE_WEIGHTS_DIR}'\n",
-    "        ' --model-type-hf qwen3',\n",
-    "    ])\n",
-    "    print('Running weight conversion (about 5-10 minutes)...')\n",
-    "    run_cmd(convert_cmd, cwd=MINDSPEED_LLM_DIR)\n",
-    "    print('Weight conversion completed!')\n",
-    "    for p in sorted(MCORE_WEIGHTS_DIR.iterdir()):\n",
-    "        print(f'  {p.name}')"
-   ],
-   "outputs": [],
-   "execution_count": null
-  },
-  {
-   "cell_type": "markdown",
-   "id": "419d028a",
-   "metadata": {},
-   "source": [
-    "## 5. Data Preprocessing\n",
-    "\n",
-    "Convert Alpaca-format JSONL data into the binary format required by MindSpeed-LLM training."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "id": "f68febbf",
-   "metadata": {},
-   "source": [
-    "preprocess_cmd = ' && '.join([\n",
-    "    f'cd {MINDSPEED_LLM_DIR}',\n",
-    "    f'python preprocess_data.py'\n",
-    "    f' --input {RAW_DATA_FILE}'\n",
-    "    f' --tokenizer-name-or-path {HF_MODEL_DIR}'\n",
-    "    f' --output-prefix {PROCESSED_DATA_PREFIX}'\n",
-    "    f' --handler-name {HANDLER_NAME}'\n",
-    "    f' --tokenizer-type {TOKENIZER_TYPE}'\n",
-    "    ' --workers 4'\n",
-    "    ' --log-interval 1'\n",
-    "    f' --enable-thinking {ENABLE_THINKING}'\n",
-    "    f' --prompt-type {PROMPT_TYPE}',\n",
-    "])\n",
-    "\n",
-    "print('Running data preprocessing...')\n",
-    "run_cmd(preprocess_cmd, cwd=MINDSPEED_LLM_DIR)\n",
-    "\n",
-    "# Verify outputs\n",
-    "print('\\nPreprocessing outputs:')\n",
-    "for f in sorted(PROCESSED_DATA_PREFIX.parent.glob('alpaca*')):\n",
-    "    print(f'  {f.name} ({f.stat().st_size / 1024:.1f} KB)')\n",
-    "print('Data preprocessing completed!')"
-   ],
-   "outputs": [],
-   "execution_count": null
-  },
-  {
-   "cell_type": "markdown",
-   "id": "67501275",
-   "metadata": {},
-   "source": [
-    "## 6. Start Fine-Tuning\n",
-    "\n",
-    "Run full-parameter SFT fine-tuning with MindSpeed-LLM. Training logs are streamed to the notebook in real time.\n",
-    "\n",
-    "> In verification mode, `TRAIN_ITERS=50`. For a full fine-tuning run, 2000+ iterations are recommended."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "id": "16c0ef7e",
-   "metadata": {},
-   "source": [
-    "import torch\n",
-    "\n",
-    "nproc = torch.npu.device_count()\n",
-    "DP = nproc // (TP * PP)\n",
-    "GBS = DP * MBS\n",
-    "\n",
-    "LOGS_DIR.mkdir(parents=True, exist_ok=True)\n",
-    "OUTPUT_DIR.mkdir(parents=True, exist_ok=True)\n",
-    "\n",
-    "# Environment variables\n",
-    "env = ' && '.join([\n",
-    "    f'cd {MINDSPEED_LLM_DIR}',\n",
-    "    'export CUDA_DEVICE_MAX_CONNECTIONS=1',\n",
-    "    'export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True',\n",
-    "])\n",
-    "\n",
-    "# Distributed torchrun arguments\n",
-    "distributed = ' '.join([\n",
-    "    'torchrun',\n",
-    "    f'--nproc_per_node {nproc}',\n",
-    "    '--nnodes 1 --node_rank 0',\n",
-    "    '--master_addr localhost --master_port 6000',\n",
-    "])\n",
-    "\n",
-    "# Model architecture\n",
-    "model_args = ' '.join([\n",
-    "    '--use-mcore-models',\n",
-    "    '--spec mindspeed_llm.tasks.models.spec.qwen3_spec layer_spec',\n",
-    "    '--kv-channels 128 --qk-layernorm',\n",
-    "    f'--tensor-model-parallel-size {TP}',\n",
-    "    f'--pipeline-model-parallel-size {PP}',\n",
-    "    '--sequence-parallel --use-distributed-optimizer --use-flash-attn',\n",
-    "    '--num-layers 36 --hidden-size 4096 --num-attention-heads 32',\n",
-    "    '--ffn-hidden-size 12288 --max-position-embeddings 32768',\n",
-    "    f'--seq-length {SEQ_LENGTH}',\n",
-    "    '--make-vocab-size-divisible-by 1 --padded-vocab-size 151936',\n",
-    "    '--rotary-base 1000000 --use-rotary-position-embeddings',\n",
-    "])\n",
-    "\n",
-    "# Training hyperparameters\n",
-    "train_args = ' '.join([\n",
-    "    f'--micro-batch-size {MBS} --global-batch-size {GBS}',\n",
-    "    '--disable-bias-linear --swiglu',\n",
-    "    f'--train-iters {TRAIN_ITERS}',\n",
-    "    '--tokenizer-type PretrainedFromHF',\n",
-    "    f'--tokenizer-name-or-path {HF_MODEL_DIR}',\n",
-    "    '--normalization RMSNorm --position-embedding-type rope',\n",
-    "    '--norm-epsilon 1e-6 --hidden-dropout 0 --attention-dropout 0',\n",
-    "    '--no-gradient-accumulation-fusion --attention-softmax-in-fp32',\n",
-    "    '--exit-on-missing-checkpoint --no-masked-softmax-fusion',\n",
-    "    '--group-query-attention --untie-embeddings-and-output-weights',\n",
-    "    '--num-query-groups 8',\n",
-    "    f'--min-lr {MIN_LR} --lr {LR}',\n",
-    "    '--weight-decay 1e-1 --clip-grad 1.0',\n",
-    "    '--adam-beta1 0.9 --adam-beta2 0.95 --initial-loss-scale 4096',\n",
-    "    '--no-load-optim --no-load-rng --seed 42 --bf16',\n",
-    "])\n",
-    "\n",
-    "# Data and outputs\n",
-    "data_args = ' '.join([\n",
-    "    f'--data-path {PROCESSED_DATA_PREFIX}',\n",
-    "    '--split 100,0,0',\n",
-    "    '--log-interval 1',\n",
-    "    f'--save-interval {TRAIN_ITERS}',\n",
-    "    f'--eval-interval {TRAIN_ITERS} --eval-iters 0',\n",
-    "])\n",
-    "\n",
-    "# Fine-tuning configuration\n",
-    "tune_args = ' '.join([\n",
-    "    '--finetune --stage sft --is-instruction-dataset',\n",
-    "    '--prompt-type qwen3 --no-pad-to-seq-lengths',\n",
-    "    '--distributed-backend nccl',\n",
-    "    f'--load {MCORE_WEIGHTS_DIR} --save {OUTPUT_DIR}',\n",
-    "    '--transformer-impl local',\n",
-    "    '--no-save-optim --no-save-rng',\n",
-    "])\n",
-    "\n",
-    "cmd = f'{env} && {distributed} posttrain_gpt.py {model_args} {train_args} {data_args} {tune_args}'\n",
-    "\n",
-    "print(f'Training configuration: {nproc} NPU, TP={TP}, PP={PP}, DP={DP}')\n",
-    "print(f'GBS={GBS}, MBS={MBS}, SEQ={SEQ_LENGTH}, ITERS={TRAIN_ITERS}')\n",
-    "print(f'\\nStarting training...\\n')\n",
-    "run_cmd(cmd, cwd=MINDSPEED_LLM_DIR)\n",
-    "print(f'\\nTraining completed! Weights saved to: {OUTPUT_DIR}')"
-   ],
-   "outputs": [],
-   "execution_count": null
-  },
-  {
-   "cell_type": "markdown",
-   "id": "d077bc56",
-   "metadata": {},
-   "source": [
-    "## 7. Inference Validation\n",
-    "\n",
-    "Load the fine-tuned weights and run a generation test."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "id": "09ae43f0",
-   "metadata": {},
-   "source": "import os\n\nnproc = torch.npu.device_count()\n\nenv = ' && '.join([\n    f'cd {MINDSPEED_LLM_DIR}',\n    'export CUDA_DEVICE_MAX_CONNECTIONS=1',\n])\n\ndistributed = ' '.join([\n    'torchrun',\n    f'--nproc_per_node {nproc}',\n    '--nnodes 1 --node_rank 0',\n    '--master_addr localhost --master_port 6001',\n])\n\ninfer_args = ' '.join([\n    '--use-mcore-models',\n    '--spec mindspeed_llm.tasks.models.spec.qwen3_spec layer_spec',\n    '--qk-layernorm',\n    f'--tensor-model-parallel-size {TP}',\n    f'--pipeline-model-parallel-size {PP}',\n    '--num-layers 36 --hidden-size 4096 --num-attention-heads 32',\n    '--ffn-hidden-size 12288',\n    f'--max-position-embeddings {SEQ_LENGTH} --seq-length {SEQ_LENGTH}',\n    '--disable-bias-linear',\n    '--group-query-attention --num-query-groups 8',\n    '--swiglu --use-fused-swiglu',\n    '--normalization RMSNorm --norm-epsilon 1e-6 --use-fused-rmsnorm',\n    '--position-embedding-type rope --rotary-base 1000000 --use-fused-rotary-pos-emb',\n    '--make-vocab-size-divisible-by 1 --padded-vocab-size 151936',\n    '--micro-batch-size 1 --max-new-tokens 256',\n    '--tokenizer-type PretrainedFromHF',\n    f'--tokenizer-name-or-path {HF_MODEL_DIR}',\n    '--tokenizer-not-use-fast',\n    '--hidden-dropout 0 --attention-dropout 0',\n    '--untie-embeddings-and-output-weights',\n    '--no-gradient-accumulation-fusion --attention-softmax-in-fp32',\n    '--seed 42',\n    f'--load {OUTPUT_DIR}',\n    '--exit-on-missing-checkpoint --transformer-impl local',\n])\n\ncmd = f'{env} && {distributed} inference.py {infer_args}'\nfull_cmd = f'source {CANN_ENV} && source {ATB_ENV} && {cmd}'\n\nprint('Starting inference...\\n')\nrun_env = os.environ.copy()\nrun_env['PYTHONWARNINGS'] = _SUPPRESS_WARNINGS\nresult = subprocess.run(\n    ['bash', '-lc', full_cmd],\n    cwd=str(MINDSPEED_LLM_DIR),\n    text=True,\n    input='q\\n',   # Exit interactive chat mode automatically after inference.py finishes the default 4 generation 
rounds and enters input(); sending q terminates it\n    env=run_env,\n)\nif result.returncode != 0:\n    print(f'\\nInference return code: {result.returncode}')\nprint('\\nInference completed!')",
-   "outputs": [],
-   "execution_count": null
-  },
-  {
-   "cell_type": "markdown",
-   "id": "f87ecc9d",
-   "metadata": {},
-   "source": [
-    "## Using a Real Dataset\n",
-    "\n",
-    "After verification succeeds, use the following steps for full fine-tuning with a real dataset:\n",
-    "\n",
-    "1. **Prepare the data**: place an Alpaca/ShareGPT/Pairwise dataset inside the container\n",
-    "   - Alpaca: `{\"instruction\": \"...\", \"input\": \"...\", \"output\": \"...\"}`\n",
-    "   - Change `HANDLER_NAME` to the matching handler\n",
-    "\n",
-    "2. **Tune the parameters**:\n",
-    "   - `SEQ_LENGTH = 4096` to match the model context length\n",
-    "   - `TRAIN_ITERS = 2000+` adjusted to the dataset size\n",
-    "   - `GBS` adjusted to the NPU count and dataset size\n",
-    "\n",
-    "3. **Checkpoint interval**: change `--save-interval` in the training cell to save checkpoints periodically\n",
-    "\n",
-    "4. **enable-thinking**:\n",
-    "   - `true` to process all data with slow-thinking mode\n",
-    "   - `false` to process all data with fast-thinking mode\n",
-    "   - `none` to mix fast and slow thinking (default)"
-   ]
-  }
- ],
- "metadata": {
-  "kernelspec": {
-   "display_name": "Python 3.12",
-   "language": "python",
-   "name": "python3"
-  },
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.12.12"
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
\ No newline at end of file
diff --git a/e2e/lib.sh b/e2e/lib.sh
index 027d7aa..435887a 100644
--- a/e2e/lib.sh
+++ b/e2e/lib.sh
@@ -183,30 +183,6 @@ _retry_kubectl_stdin() {
 retry_create() { _retry_kubectl_stdin "$1" create "${@:2}"; }
 retry_apply()  { _retry_kubectl_stdin "$1" apply "${@:2}"; }
 
-# Same retry but with --validate=false to bypass webhook/OAPI flakes.
-_retry_kubectl_stdin_novalidate() {
-  local kfn="$1" verb="$2"; shift 2
-  local data
-  data="$(cat)"
-  local attempts=0 max=5 delay=10 rc out
-  while [ "${attempts}" -lt "${max}" ]; do
-    if out="$(printf '%s' "${data}" | $kfn "${verb}" -f - --validate=false "$@" 2>&1)"; then
-      printf '%s' "${out}"
-      return 0
-    fi
-    rc=$?
-    if ! echo "${out}" | grep -qE 'failed calling webhook|x509|connection refused|EOF|context deadline exceeded|webhook.* connect: connection refused|failed to download openapi|openapi'; then
-      printf '%s\n' "${out}" >&2
-      return "${rc}"
-    fi
-    attempts=$((attempts+1))
-    log "kubectl ${verb} (novalidate): flake (attempt ${attempts}/${max}), sleeping ${delay}s"
-    sleep "${delay}"
-  done
-  printf '%s\n' "${out}" >&2
-  return 1
-}
-
 # Locate a TrainJob's pod. Trainer v2 builds a JobSet named after the TrainJob,
 # with one Job per `replicatedJobs[*]` named `${trainjob}-<rjob>-0`. The first
 # pod under it is what we stream logs from. `rjob` defaults to `node` (the

From 595ea04392152d2995968aef3d13382524d7ca08 Mon Sep 17 00:00:00 2001
From: Wu Yi <typhoonzero1986@gmail.com>
Date: Mon, 15 Jun 2026 06:17:50 +0000
Subject: [PATCH 09/21] docs: align MLflow auth with the kubernetes-auth
 plugin's canonical method

Cross-checked against mlflow-plugin/mlflow-kubernetes-plugins:

- Name the canonical mechanism: the server's `kubernetes-auth` plugin
  authorizes via Kubernetes RBAC and accepts a ServiceAccount bearer token
  (Authorization / X-Forwarded-Access-Token) + X-MLFLOW-WORKSPACE.
- Correct the caller RBAC resources to the set the plugin defines
  (experiments / datasets / registeredmodels); `runs` is not a resource
  (run writes are authorized against `experiments`).
- Add the canonical out-of-cluster token path
  (`kubectl create token`) alongside the in-pod projected token.
- Document workspace selection via set_workspace() / MLFLOW_WORKSPACE.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .../pipelines-mlflow-integration.mdx          | 24 ++++++++++++-------
 1 file changed, 16 insertions(+), 8 deletions(-)
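
Reviewer note (not part of the commit): the request shape this patch
documents can be sketched as below. This is a minimal illustration, assuming
the header names given in the commit message; `mlflow_auth_headers` and
`read_sa_token` are hypothetical helper names, not part of the MLflow client
or the plugin.

```python
# Hypothetical helpers; only the header names and the token path come from the patch.
SA_TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"


def read_sa_token(path=SA_TOKEN_PATH):
    """Read the projected ServiceAccount token mounted into the pod."""
    with open(path) as f:
        return f.read().strip()


def mlflow_auth_headers(token, workspace=None):
    """Build the bearer-token header, plus the optional workspace selector."""
    headers = {"Authorization": f"Bearer {token.strip()}"}
    if workspace:
        # Without this header, runs land in the server's default workspace.
        headers["X-MLFLOW-WORKSPACE"] = workspace
    return headers
```

Outside the cluster, the token would come from `kubectl create token`
rather than `read_sa_token()`; the headers are built the same way.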

diff --git a/docs/en/training_guides/pipelines-mlflow-integration.mdx b/docs/en/training_guides/pipelines-mlflow-integration.mdx
index da76161..f1aa27f 100644
--- a/docs/en/training_guides/pipelines-mlflow-integration.mdx
+++ b/docs/en/training_guides/pipelines-mlflow-integration.mdx
@@ -23,7 +23,7 @@ This guide shows how to build Kubeflow Pipelines (KFP) components that log param
 
 This is the part most KFP + MLflow examples get wrong, so read it before copying the code below.
 
-When the MLflow plugin is installed with single sign-on and multi-tenancy enabled (the default on Alauda AI), the `mlflow-tracking-server` Service does **not** point straight at MLflow. It points at an `oauth2-proxy` sidecar, and the MLflow server itself requires a per-request identity token:
+On Alauda AI the MLflow server runs the `kubernetes-auth` plugin, which authorizes every request against Kubernetes RBAC in the `mlflow.kubeflow.org` API group. The canonical way for a workload to authenticate is a **Kubernetes ServiceAccount bearer token** sent as `Authorization: Bearer <token>` (or `X-Forwarded-Access-Token`), together with a workspace selector (`X-MLFLOW-WORKSPACE`). With single sign-on and multi-tenancy enabled (the default), the shipped `mlflow-tracking-server` Service does **not** point at MLflow directly — it points at an `oauth2-proxy` sidecar built for interactive browser login:
 
 | What a pipeline component does | What happens on a secured install |
 |--------------------------------|-----------------------------------|
@@ -59,9 +59,9 @@ spec:
 
 Pipeline components then use `http://mlflow-tracking-direct.kubeflow:5000`.
 
-### 2. Grant the component's ServiceAccount MLflow permissions
+### 2. Grant MLflow permissions to the component's ServiceAccount
 
-The MLflow server authorizes writes against Kubernetes RBAC in the workspace namespace. Bind the ServiceAccount that your pipeline pods run as (for KFP multi-user this is usually `default-editor` in the profile namespace) to a Role that allows writing MLflow resources:
+The `kubernetes-auth` plugin authorizes writes against Kubernetes RBAC in the workspace namespace, using the verbs from the `mlflow.kubeflow.org` API group. Apply a Role/RoleBinding bound to the ServiceAccount your pipeline pods run as (for KFP multi-user this is usually `default-editor` in the profile namespace):
 
 ```yaml
 apiVersion: rbac.authorization.k8s.io/v1
@@ -71,7 +71,9 @@ metadata:
   namespace: team-a            # the MLflow workspace namespace
 rules:
   - apiGroups: ["mlflow.kubeflow.org"]
-    resources: ["experiments", "runs", "registeredmodels"]
+    # 'experiments' covers runs, metrics, params, and tags; add 'datasets'
+    # and 'registeredmodels' for dataset logging and the model registry.
+    resources: ["experiments", "datasets", "registeredmodels"]
     verbs: ["get", "list", "create", "update", "delete"]
 ---
 apiVersion: rbac.authorization.k8s.io/v1
@@ -85,22 +87,28 @@ roleRef:
   name: mlflow-writer
 subjects:
   - kind: ServiceAccount
-    name: default-editor       # the SA your KFP component pods use
+    name: default-editor       # the SA your KFP component pods run as
     namespace: team-a
 ```
 
-See [MLflow Tracking Server → Workspace Access](../kubeflow/how_to/mlflow.mdx) for the workspace/RBAC model.
+This is the caller-side RBAC defined by the MLflow plugin; for the full resource list (including AI gateway resources) and fine-grained `resourceNames` access, see [MLflow Tracking Server → Workspace Access](../kubeflow/how_to/mlflow.mdx).
 
 ### 3. Authenticate from the component
 
-In a KFP v2 lightweight component, send the pod's ServiceAccount token as the MLflow bearer token by setting `MLFLOW_TRACKING_TOKEN`. The token is mounted at the standard projected-token path:
+The MLflow client reads `MLFLOW_TRACKING_TOKEN` and sends it as `Authorization: Bearer …`. Inside a pod, use the projected ServiceAccount token mounted at the standard path:
 
 ```python
 with open("/var/run/secrets/kubernetes.io/serviceaccount/token") as f:
     os.environ["MLFLOW_TRACKING_TOKEN"] = f.read().strip()
 ```
 
-Runs land in the server's **default workspace** (`MLFLOW_K8S_DEFAULT_WORKSPACE`). The stock `mlflow` client cannot send the per-request workspace header, so to target a *different* workspace use the Alauda MLflow client's `mlflow.set_workspace("team-a")` (see [Client Configuration](../kubeflow/how_to/mlflow.mdx)).
+For a client running outside the cluster (for example, submitting from a laptop) mint a token for the workspace ServiceAccount instead:
+
+```bash
+export MLFLOW_TRACKING_TOKEN=$(kubectl -n team-a create token default-editor)
+```
+
+**Workspace selection.** Runs land in the server's **default workspace** (`MLFLOW_K8S_DEFAULT_WORKSPACE`) unless the request carries the `X-MLFLOW-WORKSPACE` header. The stock `mlflow` client does not send that header, so to target a *different* workspace use the Alauda MLflow client — `mlflow.set_workspace("team-a")` or the `MLFLOW_WORKSPACE` environment variable (see [Client Configuration](../kubeflow/how_to/mlflow.mdx)).
 
 ## Complete example: training pipeline with MLflow
 

From 76eef447dbde20c90271a67ced5d3e5ce096182c Mon Sep 17 00:00:00 2001
From: Wu Yi <typhoonzero1986@gmail.com>
Date: Mon, 15 Jun 2026 06:51:18 +0000
Subject: [PATCH 10/21] docs: frame MLflow auth as the kubernetes-auth
 user_identity_token flow

Per mlflow-plugin/mlflow-kubernetes-plugins/docs/authorization-plugin.md:

- Lead with the identity-token method: the server's `kubernetes-auth`
  plugin (user_identity_token mode) authenticates the caller from the bearer
  token's identity claims, authorizes that identity, and records it as the
  MLflow run owner. The client authenticates with the token before any API
  call.
- Note the credential is a Kubernetes ServiceAccount token (the
  platform-wide `kubectl create token` pattern; sub claim is the identity).
- Add a security warning: because user_identity_token reads claims
  unverified (the oauth2-proxy is the verifier), a direct endpoint must be
  network-restricted and not exposed via ingress; alternatively, run the
  server in self_subject_access_review mode.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .../pipelines-mlflow-integration.mdx               | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)
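
Reviewer note (not part of the commit): the security warning is easier to
see with a sketch of what "reads claims unverified" means. The code below
only illustrates the idea and is not the plugin's implementation; the
function names are hypothetical, and the claim precedence is an assumption
taken from the order in which the doc lists the claims.

```python
import base64
import json

# Assumed precedence, following the order the claims are listed in the doc.
CLAIM_ORDER = ("email", "preferred_username", "name", "sub")


def unverified_claims(token):
    """Decode a JWT payload WITHOUT checking its signature."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))


def caller_identity(token):
    """Return the first non-empty identity claim, or None if there is none."""
    claims = unverified_claims(token)
    return next((claims[k] for k in CLAIM_ORDER if claims.get(k)), None)
```

Nothing here inspects the signature segment, so any caller who can reach the
server directly could present self-made claims. That is why the direct
endpoint has to be network-restricted.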

diff --git a/docs/en/training_guides/pipelines-mlflow-integration.mdx b/docs/en/training_guides/pipelines-mlflow-integration.mdx
index f1aa27f..d9ca0b3 100644
--- a/docs/en/training_guides/pipelines-mlflow-integration.mdx
+++ b/docs/en/training_guides/pipelines-mlflow-integration.mdx
@@ -23,7 +23,9 @@ This guide shows how to build Kubeflow Pipelines (KFP) components that log param
 
 This is the part most KFP + MLflow examples get wrong, so read it before copying the code below.
 
-On Alauda AI the MLflow server runs the `kubernetes-auth` plugin, which authorizes every request against Kubernetes RBAC in the `mlflow.kubeflow.org` API group. The canonical way for a workload to authenticate is a **Kubernetes ServiceAccount bearer token** sent as `Authorization: Bearer <token>` (or `X-Forwarded-Access-Token`), together with a workspace selector (`X-MLFLOW-WORKSPACE`). With single sign-on and multi-tenancy enabled (the default), the shipped `mlflow-tracking-server` Service does **not** point at MLflow directly — it points at an `oauth2-proxy` sidecar built for interactive browser login:
+On Alauda AI the MLflow server runs the [`kubernetes-auth` plugin](https://github.com/AlaudaDevops/mlflow-plugin), which authenticates the caller from a **bearer identity token** and authorizes every request against Kubernetes RBAC in the `mlflow.kubeflow.org` API group. In its `user_identity_token` mode the server reads the caller identity from the token's claims (`email` / `preferred_username` / `name` / `sub`, and groups from `groups` / `roles`), authorizes *that identity*, and records it as the MLflow run owner.
+
+So a client must **authenticate with an identity token before it calls any MLflow API**: send `Authorization: Bearer <token>` (or `X-Forwarded-Access-Token`) plus a workspace selector (`X-MLFLOW-WORKSPACE`). The identity token is a Kubernetes ServiceAccount token — the same platform-wide pattern used by Feast, TrustyAI, and others (`kubectl create token <sa>`, or the pod's projected token in-cluster). With single sign-on and multi-tenancy enabled (the default), the shipped `mlflow-tracking-server` Service does **not** point at MLflow directly — it points at an `oauth2-proxy` sidecar built for interactive browser login, which is why a headless token presented there is bounced:
 
 | What a pipeline component does | What happens on a secured install |
 |--------------------------------|-----------------------------------|
@@ -32,7 +34,7 @@ On Alauda AI the MLflow server runs the `kubernetes-auth` plugin, which authoriz
 | Reach it directly with a token but **no** workspace permissions | reads succeed, writes return **HTTP 403** `PERMISSION_DENIED` |
 | Reach it directly **with** a token **and** workspace RBAC | logging works |
 
-So a headless component needs three things: a direct in-cluster endpoint that skips the OAuth proxy, a bearer token, and write permission in the workspace.
+So a headless component needs three things: an identity token to authenticate with, RBAC for that identity in the workspace, and a direct in-cluster endpoint that skips the browser-only OAuth proxy.
 
 :::tip
 If your MLflow plugin is installed **without** SSO/multi-tenancy (no `oauth2-proxy` sidecar on the `mlflow-tracking-server` pod, and `MLFLOW_MULTITENANCY_ENABLED` unset), the `mlflow-tracking-server` Service already targets MLflow directly and accepts unauthenticated in-cluster traffic. In that case skip steps 1–2 below and use `http://mlflow-tracking-server.kubeflow:5000` with no token.
@@ -59,6 +61,10 @@ spec:
 
 Pipeline components then use `http://mlflow-tracking-direct.kubeflow:5000`.
 
+:::warning
+In `user_identity_token` mode the server trusts the identity claims of whatever token reaches it — the `oauth2-proxy` it normally sits behind is what verifies those tokens. A direct endpoint bypasses that verification, so restrict it to trusted in-cluster callers with a `NetworkPolicy` (or have the platform run the server in `self_subject_access_review` mode, where the token is verified by the Kubernetes API server). Do not expose this Service through ingress.
+:::
+
 ### 2. Grant MLflow permissions to the component's ServiceAccount
 
 The `kubernetes-auth` plugin authorizes writes against Kubernetes RBAC in the workspace namespace, using the verbs from the `mlflow.kubeflow.org` API group. Apply a Role/RoleBinding bound to the ServiceAccount your pipeline pods run as (for KFP multi-user this is usually `default-editor` in the profile namespace):
@@ -93,9 +99,9 @@ subjects:
 
 This is the caller-side RBAC defined by the MLflow plugin; for the full resource list (including AI gateway resources) and fine-grained `resourceNames` access, see [MLflow Tracking Server → Workspace Access](../kubeflow/how_to/mlflow.mdx).
 
-### 3. Authenticate from the component
+### 3. Present the identity token from the component
 
-The MLflow client reads `MLFLOW_TRACKING_TOKEN` and sends it as `Authorization: Bearer …`. Inside a pod, use the projected ServiceAccount token mounted at the standard path:
+The identity token authenticates the caller (and becomes the run owner), so set it before the first MLflow call. The MLflow client reads `MLFLOW_TRACKING_TOKEN` and sends it as `Authorization: Bearer …`. Inside a pod, use the projected ServiceAccount token mounted at the standard path — its `sub` claim (`system:serviceaccount:<ns>:<name>`) is the identity the server authorizes:
 
 ```python
 with open("/var/run/secrets/kubernetes.io/serviceaccount/token") as f:

From 4536ae9d921dc737c2f1dd8534d12b77048d2b06 Mon Sep 17 00:00:00 2001
From: Wu Yi <typhoonzero1986@gmail.com>
Date: Mon, 15 Jun 2026 07:32:54 +0000
Subject: [PATCH 11/21] docs: use only the user_identity_token method; add
 user-identity smoke test
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Reworks the KFP + MLflow guide to authenticate with a platform user identity
token only — no ServiceAccount, no per-workspace RBAC, no extra in-cluster
Service:

- The MLflow kubernetes-auth plugin (user_identity_token mode) takes the caller
  identity from the bearer token's claims and records it as the run owner.
- Components reach MLflow through the platform Kubernetes API
  (…/kubernetes/<cluster>/…/pods/<pod>:5000/proxy/…) and forward identity via
  X-Forwarded-Access-Token; the shipped Service only exposes the browser OAuth
  proxy, so this bypasses it without creating any new cluster resource.
- Removed the direct-Service, ServiceAccount-token, and RBAC sections.
- KFP example now uses a stdlib REST helper (no mlflow SDK install needed) and
  passes the token as a parameter (source from a Secret).

Adds e2e/mlflow-user-identity-smoke.sh: logs a run with a user token and asserts
the run owner equals the token identity. Verified on g1-c1-x86 (run owner
admin@cpaas.io); the pipeline example compiles with kfp 2.11.0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .../pipelines-mlflow-integration.mdx          | 361 +++++++-----------
 e2e/mlflow-user-identity-smoke.sh             |  85 +++++
 2 files changed, 216 insertions(+), 230 deletions(-)
 create mode 100755 e2e/mlflow-user-identity-smoke.sh
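
Reviewer note (not part of the commit): the two URLs the helper builds can be
sketched in isolation, which makes the routing easier to review. The function
names are hypothetical; the path layout, label selector, namespace, and port
are the ones used in the diff below.

```python
from urllib.parse import urlencode


def pod_list_url(platform, cluster, ns="kubeflow",
                 selector="app=mlflow-tracking-server"):
    """URL that lists candidate MLflow server pods via the platform K8s API."""
    kapi = f"{platform.rstrip('/')}/kubernetes/{cluster}"
    query = urlencode({"labelSelector": selector})
    return f"{kapi}/api/v1/namespaces/{ns}/pods?{query}"


def mlflow_proxy_url(platform, cluster, pod, path, ns="kubeflow", port=5000):
    """URL that reaches the MLflow REST API through the pod proxy subresource."""
    kapi = f"{platform.rstrip('/')}/kubernetes/{cluster}"
    return (f"{kapi}/api/v1/namespaces/{ns}/pods/{pod}:{port}"
            f"/proxy/api/2.0/mlflow/{path.lstrip('/')}")
```

Both requests carry the same token twice: in `Authorization` for the
platform Kubernetes API, and in `X-Forwarded-Access-Token` for MLflow.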

diff --git a/docs/en/training_guides/pipelines-mlflow-integration.mdx b/docs/en/training_guides/pipelines-mlflow-integration.mdx
index d9ca0b3..6b063a3 100644
--- a/docs/en/training_guides/pipelines-mlflow-integration.mdx
+++ b/docs/en/training_guides/pipelines-mlflow-integration.mdx
@@ -6,219 +6,159 @@ weight: 55
 
 This guide shows how to build Kubeflow Pipelines (KFP) components that log parameters, metrics, and model artifacts to [MLflow on Kubeflow](../kubeflow/how_to/mlflow.mdx) — giving you a single source of truth for experiment tracking across your pipeline runs.
 
+Pipeline components authenticate to MLflow with a **user identity token**: no ServiceAccount, no per-workspace RBAC, and no extra in-cluster Service are required. The MLflow server records each run under the calling user.
+
 ## Scope
 
 - Alauda AI 2.5 and later.
 - Kubeflow Pipelines and the MLflow cluster plugin are installed.
-- Target namespaces have the MLflow workspace label (`mlflow-enabled=true`).
-- The pipeline components run in the same Kubernetes cluster where the MLflow Tracking Server is deployed.
+- The MLflow workspace is a namespace labelled `mlflow-enabled=true`.
 
 ## Prerequisites
 
-- `kfp` Python SDK installed (`pip install kfp mlflow`).
+- `kfp` Python SDK installed (`pip install kfp`).
 - Access to a KFP endpoint (see [Use Kubeflow Pipelines](../kubeflow/how_to/pipelines.mdx) for setup).
-- An MLflow workspace name matching a namespace with `mlflow-enabled=true`.
-
-## How pipeline components reach MLflow \{#how-pipeline-components-reach-mlflow}
-
-This is the part most KFP + MLflow examples get wrong, so read it before copying the code below.
-
-On Alauda AI the MLflow server runs the [`kubernetes-auth` plugin](https://github.com/AlaudaDevops/mlflow-plugin), which authenticates the caller from a **bearer identity token** and authorizes every request against Kubernetes RBAC in the `mlflow.kubeflow.org` API group. In its `user_identity_token` mode the server reads the caller identity from the token's claims (`email` / `preferred_username` / `name` / `sub`, and groups from `groups` / `roles`), authorizes *that identity*, and records it as the MLflow run owner.
+- A **platform user identity token** — a JWT with an `email` claim, issued for your platform user. You already have the equivalent credential whenever you use the platform; create a token for non-interactive use from the platform console. The same token is used both to reach the cluster API and as your MLflow identity.
+- An MLflow workspace name (a namespace with `mlflow-enabled=true`) that your platform user can access.
 
-So a client must **authenticate with an identity token before it calls any MLflow API**: send `Authorization: Bearer <token>` (or `X-Forwarded-Access-Token`) plus a workspace selector (`X-MLFLOW-WORKSPACE`). The identity token is a Kubernetes ServiceAccount token — the same platform-wide pattern used by Feast, TrustyAI, and others (`kubectl create token <sa>`, or the pod's projected token in-cluster). With single sign-on and multi-tenancy enabled (the default), the shipped `mlflow-tracking-server` Service does **not** point at MLflow directly — it points at an `oauth2-proxy` sidecar built for interactive browser login, which is why a headless token presented there is bounced:
+## How components authenticate to MLflow \{#how-components-authenticate-to-mlflow}
 
-| What a pipeline component does | What happens on a secured install |
-|--------------------------------|-----------------------------------|
-| `GET http://mlflow-tracking-server.kubeflow:5000/...` | **HTTP 302** redirect to the SSO login page — the proxy expects an interactive browser, so a bearer token does not help |
-| Reach the MLflow server directly with **no** token | **HTTP 401** `UNAUTHENTICATED` |
-| Reach it directly with a token but **no** workspace permissions | reads succeed, writes return **HTTP 403** `PERMISSION_DENIED` |
-| Reach it directly **with** a token **and** workspace RBAC | logging works |
-
-So a headless component needs three things: an identity token to authenticate with, RBAC for that identity in the workspace, and a direct in-cluster endpoint that skips the browser-only OAuth proxy.
-
-:::tip
-If your MLflow plugin is installed **without** SSO/multi-tenancy (no `oauth2-proxy` sidecar on the `mlflow-tracking-server` pod, and `MLFLOW_MULTITENANCY_ENABLED` unset), the `mlflow-tracking-server` Service already targets MLflow directly and accepts unauthenticated in-cluster traffic. In that case skip steps 1–2 below and use `http://mlflow-tracking-server.kubeflow:5000` with no token.
-:::
-
-### 1. Expose a direct in-cluster endpoint
-
-Create a Service that targets the MLflow server's application port (`mlflow-http`) instead of the proxy port, so in-cluster clients bypass the browser login flow:
-
-```yaml
-apiVersion: v1
-kind: Service
-metadata:
-  name: mlflow-tracking-direct
-  namespace: kubeflow
-spec:
-  selector:
-    app: mlflow-tracking-server   # verify against the running pod's labels
-  ports:
-    - name: http
-      port: 5000
-      targetPort: mlflow-http     # the server container's port, not "proxy"
-```
-
-Pipeline components then use `http://mlflow-tracking-direct.kubeflow:5000`.
-
-:::warning
-In `user_identity_token` mode the server trusts the identity claims of whatever token reaches it — the `oauth2-proxy` it normally sits behind is what verifies those tokens. A direct endpoint bypasses that verification, so restrict it to trusted in-cluster callers with a `NetworkPolicy` (or have the platform run the server in `self_subject_access_review` mode, where the token is verified by the Kubernetes API server). Do not expose this Service through ingress.
-:::
-
-### 2. Grant MLflow permissions to the component's ServiceAccount
-
-The `kubernetes-auth` plugin authorizes writes against Kubernetes RBAC in the workspace namespace, using the verbs from the `mlflow.kubeflow.org` API group. Apply a Role/RoleBinding bound to the ServiceAccount your pipeline pods run as (for KFP multi-user this is usually `default-editor` in the profile namespace):
-
-```yaml
-apiVersion: rbac.authorization.k8s.io/v1
-kind: Role
-metadata:
-  name: mlflow-writer
-  namespace: team-a            # the MLflow workspace namespace
-rules:
-  - apiGroups: ["mlflow.kubeflow.org"]
-    # 'experiments' covers runs, metrics, params, and tags; add 'datasets'
-    # and 'registeredmodels' for dataset logging and the model registry.
-    resources: ["experiments", "datasets", "registeredmodels"]
-    verbs: ["get", "list", "create", "update", "delete"]
----
-apiVersion: rbac.authorization.k8s.io/v1
-kind: RoleBinding
-metadata:
-  name: mlflow-writer
-  namespace: team-a
-roleRef:
-  apiGroup: rbac.authorization.k8s.io
-  kind: Role
-  name: mlflow-writer
-subjects:
-  - kind: ServiceAccount
-    name: default-editor       # the SA your KFP component pods run as
-    namespace: team-a
-```
+On Alauda AI the MLflow server runs the [`kubernetes-auth` plugin](https://github.com/AlaudaDevops/mlflow-plugin) in **`user_identity_token`** mode. It reads the caller identity from the bearer token's claims (`email` → `preferred_username` → `name` → `sub`, groups from `groups` / `roles`), authorizes that identity against the workspace, and records it as the MLflow **run owner**. So the only credential a component needs is a user identity token; there is nothing to create on the cluster.
 
-This is the caller-side RBAC defined by the MLflow plugin; for the full resource list (including AI gateway resources) and fine-grained `resourceNames` access, see [MLflow Tracking Server → Workspace Access](../kubeflow/how_to/mlflow.mdx).
+A token is presented to MLflow through the platform Kubernetes API — the same `https://<platform>/kubernetes/<cluster>` entry point used for any Kubernetes call. The request is proxied to the MLflow server, and the caller identity is forwarded in the `X-Forwarded-Access-Token` header:
 
-### 3. Present the identity token from the component
+- `Authorization: Bearer <token>` authenticates the call to the platform Kubernetes API.
+- `X-Forwarded-Access-Token: <token>` is the identity the MLflow `user_identity_token` plugin reads.
+- `X-MLFLOW-WORKSPACE: <namespace>` selects the workspace (otherwise the server's default workspace is used).
 
-The identity token authenticates the caller (and becomes the run owner), so set it before the first MLflow call. The MLflow client reads `MLFLOW_TRACKING_TOKEN` and sends it as `Authorization: Bearer …`. Inside a pod, use the projected ServiceAccount token mounted at the standard path — its `sub` claim (`system:serviceaccount:<ns>:<name>`) is the identity the server authorizes:
+This avoids the browser-only OAuth proxy in front of MLflow without exposing any new endpoint. The helper below wraps it; it needs only the Python standard library, so components do not even install the MLflow SDK:
 
 ```python
-with open("/var/run/secrets/kubernetes.io/serviceaccount/token") as f:
-    os.environ["MLFLOW_TRACKING_TOKEN"] = f.read().strip()
+def mlflow_api(method, path, *, token, platform, cluster, workspace, body=None, ns="kubeflow"):
+    """Call the MLflow REST API as the user identity in `token`, via the platform K8s API."""
+    import json, ssl, urllib.request
+    ctx = ssl.create_default_context()
+    ctx.check_hostname = False          # platform Dex/ALB commonly use a private cert
+    ctx.verify_mode = ssl.CERT_NONE
+    kapi = f"{platform.rstrip('/')}/kubernetes/{cluster}"
+
+    def call(url, data=None, m="GET"):
+        req = urllib.request.Request(url, data=data, method=m, headers={
+            "Authorization": f"Bearer {token}",
+            "X-Forwarded-Access-Token": token,
+            "X-MLFLOW-WORKSPACE": workspace,
+            "Content-Type": "application/json",
+        })
+        with urllib.request.urlopen(req, context=ctx) as r:
+            return json.load(r)
+
+    # Find the MLflow server pod (the shipped Service only exposes the OAuth proxy port).
+    pods = call(f"{kapi}/api/v1/namespaces/{ns}/pods?labelSelector=app%3Dmlflow-tracking-server")
+    pod = next(p["metadata"]["name"] for p in pods["items"] if p["status"]["phase"] == "Running")
+    base = f"{kapi}/api/v1/namespaces/{ns}/pods/{pod}:5000/proxy/api/2.0/mlflow"
+    return call(f"{base}/{path}", json.dumps(body).encode() if body is not None else None, method)
 ```
 
-For a client running outside the cluster (for example, submitting from a laptop) mint a token for the workspace ServiceAccount instead:
+:::tip
+Verify the whole path against your cluster with the bundled smoke test, which logs a run and asserts the run owner matches the token identity:
 
 ```bash
-export MLFLOW_TRACKING_TOKEN=$(kubectl -n team-a create token default-editor)
+PLATFORM_ADDRESS=https://<platform> CLUSTER=<cluster> \
+  MLFLOW_USER_TOKEN=<your-token> MLFLOW_WORKSPACE=<workspace> \
+  e2e/mlflow-user-identity-smoke.sh
 ```
-
-**Workspace selection.** Runs land in the server's **default workspace** (`MLFLOW_K8S_DEFAULT_WORKSPACE`) unless the request carries the `X-MLFLOW-WORKSPACE` header. The stock `mlflow` client does not send that header, so to target a *different* workspace use the Alauda MLflow client — `mlflow.set_workspace("team-a")` or the `MLFLOW_WORKSPACE` environment variable (see [Client Configuration](../kubeflow/how_to/mlflow.mdx)).
+:::
 
 ## Complete example: training pipeline with MLflow
 
-Every mechanism in this pipeline was checked against a secured (SSO + multi-tenant) MLflow install on Alauda AI: the SDK compiles the pipeline, the ServiceAccount-token auth, workspace RBAC, and warm-up retry were confirmed against the live tracking server. Note the three corrections over a naive example: **imports are inside each component** (KFP v2 packages components from their own source, so module-level imports are not in scope), the **ServiceAccount token** is set before logging, and the first MLflow call is **retried** to absorb the brief authorization warm-up after a pod starts.
+This pipeline logs parameters and metrics to MLflow using the user identity token. Note the KFP v2 rules it follows: helpers and imports live **inside** the component (KFP packages each component from its own source), and the token is passed in as a parameter — source it from a Kubernetes `Secret` rather than hardcoding it.
 
 ```python
 from kfp import dsl, compiler
 
-MLFLOW_TRACKING_URI = "http://mlflow-tracking-direct.kubeflow:5000"
-MLFLOW_EXPERIMENT_NAME = "kfp-training-experiment"
 
-
-@dsl.component(base_image="python:3.11-slim", packages_to_install=["mlflow"])
+@dsl.component(base_image="python:3.11-slim")
 def train_model(
-    tracking_uri: str,
-    experiment: str,
+    platform: str,
+    cluster: str,
+    workspace: str,
+    token: str,
     model_name: str,
     learning_rate: float,
     epochs: int,
     run_id: str,
 ) -> dict:
-    """Simulated training component with MLflow tracking."""
-    import os, time, json, pathlib
-    import mlflow                       # MUST be imported inside the component
-
-    # Authenticate to the multi-tenant MLflow server with the pod's SA token.
-    with open("/var/run/secrets/kubernetes.io/serviceaccount/token") as f:
-        os.environ["MLFLOW_TRACKING_TOKEN"] = f.read().strip()
-    mlflow.set_tracking_uri(tracking_uri)
-
-    # The per-request authorization cache can lag a few seconds after a pod
-    # starts; retry the first call so the component does not fail on a 403.
-    for attempt in range(5):
-        try:
-            mlflow.set_experiment(experiment)
-            break
-        except Exception:
-            if attempt == 4:
-                raise
-            time.sleep(3)
+    """Simulated training component that logs to MLflow as the calling user."""
+    import json, ssl, urllib.request, urllib.error
+
+    def mlflow_api(method, path, body=None, ns="kubeflow"):
+        ctx = ssl.create_default_context()
+        ctx.check_hostname = False
+        ctx.verify_mode = ssl.CERT_NONE
+        kapi = f"{platform.rstrip('/')}/kubernetes/{cluster}"
+        hdr = {
+            "Authorization": f"Bearer {token}",
+            "X-Forwarded-Access-Token": token,
+            "X-MLFLOW-WORKSPACE": workspace,
+            "Content-Type": "application/json",
+        }
+
+        def call(url, data=None, m="GET"):
+            req = urllib.request.Request(url, data=data, method=m, headers=hdr)
+            with urllib.request.urlopen(req, context=ctx) as r:
+                return json.load(r)
+
+        pods = call(f"{kapi}/api/v1/namespaces/{ns}/pods?labelSelector=app%3Dmlflow-tracking-server")
+        pod = next(p["metadata"]["name"] for p in pods["items"] if p["status"]["phase"] == "Running")
+        base = f"{kapi}/api/v1/namespaces/{ns}/pods/{pod}:5000/proxy/api/2.0/mlflow"
+        return call(f"{base}/{path}", json.dumps(body).encode() if body is not None else None, method)
+
+    # Get-or-create the experiment.
+    try:
+        eid = mlflow_api("POST", "experiments/create",
+                         {"name": "kfp-training-experiment"})["experiment_id"]
+    except urllib.error.HTTPError as e:
+        if e.code != 400:
+            raise
+        eid = mlflow_api("GET", "experiments/get-by-name?experiment_name=kfp-training-experiment"
+                         )["experiment"]["experiment_id"]
+
+    run = mlflow_api("POST", "runs/create",
+                     {"experiment_id": eid, "run_name": f"run-{run_id}",
+                      "start_time": 0})["run"]["info"]["run_id"]
+    for k, v in {"model_name": model_name, "learning_rate": learning_rate, "epochs": epochs}.items():
+        mlflow_api("POST", "runs/log-parameter", {"run_id": run, "key": k, "value": str(v)})
 
     metrics = {}
-    with mlflow.start_run(run_name=f"run-{run_id}"):
-        mlflow.log_param("model_name", model_name)
-        mlflow.log_param("learning_rate", learning_rate)
-        mlflow.log_param("epochs", epochs)
-
-        # Simulated training loop — replace with your real training logic.
-        for epoch in range(1, epochs + 1):
-            loss = 2.0 * (0.95 ** epoch)  # placeholder loss curve
-            accuracy = 1.0 - loss         # placeholder accuracy
-            mlflow.log_metric("loss", loss, step=epoch)
-            mlflow.log_metric("accuracy", accuracy, step=epoch)
-            metrics = {"final_loss": loss, "final_accuracy": accuracy}
-
-        # Log the trained model as an artifact (requires configured artifact
-        # storage — see "Artifact storage for production" below).
-        model_dir = pathlib.Path("/tmp/model")
-        model_dir.mkdir(parents=True, exist_ok=True)
-        (model_dir / "model.json").write_text(
-            json.dumps(dict(model_name=model_name, epochs=epochs, metrics=metrics))
-        )
-        mlflow.log_artifacts(str(model_dir), artifact_path="model")
-
+    for epoch in range(1, epochs + 1):
+        loss = 2.0 * (0.95 ** epoch)
+        accuracy = 1.0 - loss
+        mlflow_api("POST", "runs/log-metric",
+                   {"run_id": run, "key": "loss", "value": loss, "timestamp": 0, "step": epoch})
+        mlflow_api("POST", "runs/log-metric",
+                   {"run_id": run, "key": "accuracy", "value": accuracy, "timestamp": 0, "step": epoch})
+        metrics = {"final_loss": loss, "final_accuracy": accuracy}
+
+    mlflow_api("POST", "runs/update", {"run_id": run, "status": "FINISHED", "end_time": 1})
+    print(f"logged MLflow run {run}")
     return metrics
 
 
-@dsl.component(base_image="python:3.11-slim", packages_to_install=["mlflow"])
-def evaluate_model(tracking_uri: str, experiment: str, model_name: str, run_id: str) -> dict:
-    """Evaluate the trained model and log results to MLflow."""
-    import os, time
-    import mlflow
-
-    with open("/var/run/secrets/kubernetes.io/serviceaccount/token") as f:
-        os.environ["MLFLOW_TRACKING_TOKEN"] = f.read().strip()
-    mlflow.set_tracking_uri(tracking_uri)
-    for attempt in range(5):
-        try:
-            mlflow.set_experiment(experiment)
-            break
-        except Exception:
-            if attempt == 4:
-                raise
-            time.sleep(3)
-
-    with mlflow.start_run(run_name=f"eval-{run_id}"):
-        mlflow.log_param("model_name", model_name)
-        mlflow.log_metric("eval_accuracy", 0.92)
-        mlflow.log_metric("eval_f1", 0.89)
-
-    return dict(eval_accuracy=0.92)
-
-
-@dsl.pipeline(name="mlflow-training-pipeline", description="Train and evaluate with MLflow tracking")
+@dsl.pipeline(name="mlflow-training-pipeline", description="Train with MLflow tracking")
 def training_pipeline(
-    tracking_uri: str = MLFLOW_TRACKING_URI,
-    experiment: str = MLFLOW_EXPERIMENT_NAME,
+    platform: str,
+    cluster: str,
+    token: str,
+    workspace: str = "mlops-demo-e2e",
     model_name: str = "qwen3-0.6b",
     learning_rate: float = 2e-4,
     epochs: int = 10,
 ):
-    train_task = train_model(
-        tracking_uri=tracking_uri,
-        experiment=experiment,
+    train_model(
+        platform=platform,
+        cluster=cluster,
+        workspace=workspace,
+        token=token,
         model_name=model_name,
         learning_rate=learning_rate,
         epochs=epochs,
@@ -227,15 +167,7 @@ def training_pipeline(
         run_id=dsl.PIPELINE_JOB_ID_PLACEHOLDER,
     )
 
-    evaluate_model(
-        tracking_uri=tracking_uri,
-        experiment=experiment,
-        model_name=model_name,
-        run_id=dsl.PIPELINE_JOB_ID_PLACEHOLDER,
-    ).after(train_task)
 
-
-# ===== Compile and submit =====
 compiler.Compiler().compile(training_pipeline, "pipeline.yaml")
 ```
 
@@ -244,8 +176,8 @@ compiler.Compiler().compile(training_pipeline, "pipeline.yaml")
 ### Via the KFP UI
 
 1. Go to **Kubeflow Dashboard → Pipelines → Upload Pipeline** and select `pipeline.yaml`.
-2. Click **Create Run** and fill in the parameters (model name, learning rate, epochs).
-3. After the run starts, check the MLflow UI under **Alauda AI → Tools → MLFlow** for logged metrics.
+2. Click **Create Run** and fill in the parameters (platform address, cluster, token, workspace, …).
+3. After the run starts, check the MLflow UI under **Alauda AI → Tools → MLFlow** for the logged metrics — the run owner is your platform user.
 
 ### Via the KFP SDK
 
@@ -257,8 +189,11 @@ client = Client(host="<MY-KFP-ENDPOINT>")
 run = client.create_run_from_pipeline_package(
     "pipeline.yaml",
     arguments=dict(
+        platform="https://<platform>",
+        cluster="<cluster>",
+        token="<your-user-identity-token>",   # source from a Kubernetes Secret in practice
+        workspace="mlops-demo-e2e",
         model_name="qwen3-0.6b",
-        learning_rate=2e-4,
         epochs=10,
     ),
 )
@@ -268,7 +203,7 @@ print(f"Run ID: {run.run_id}")
 
 ## Using MLflow in Trainer v2 pipelines
 
-If you are using [Kubeflow Trainer v2](./fine-tune-with-trainer-v2.mdx) instead of KFP SDK pipelines, set the MLflow environment variables on the `TrainJob`'s trainer. Note the API: Trainer v2 uses `apiVersion: trainer.kubeflow.org/v1alpha1`, `kind: TrainJob`, and a `spec.runtimeRef` + `spec.trainer` shape — not a raw pod template.
+If you fine-tune with [Kubeflow Trainer v2](./fine-tune-with-trainer-v2.mdx) instead of KFP SDK pipelines, the framework's MLflow integration (for example `report_to: mlflow` in LLaMA-Factory) authenticates the same way. Trainer v2 uses `apiVersion: trainer.kubeflow.org/v1alpha1`, `kind: TrainJob`, and a `spec.runtimeRef` + `spec.trainer` shape — not a raw pod template. Inject your user identity token as `MLFLOW_TRACKING_TOKEN` and point the framework at the MLflow endpoint your platform exposes:
 
 ```yaml
 apiVersion: trainer.kubeflow.org/v1alpha1
@@ -281,42 +216,16 @@ spec:
   trainer:
     image: alaudadockerhub/fine_tune_with_llamafactory:v0.1.1
     env:
-      - name: MLFLOW_TRACKING_URI
-        value: "http://mlflow-tracking-direct.kubeflow:5000"
       - name: MLFLOW_EXPERIMENT_NAME
         value: "trainer-v2-finetune"
+      - name: MLFLOW_TRACKING_TOKEN
+        valueFrom:
+          secretKeyRef:
+            name: mlflow-user-token   # a Secret holding your platform user token
+            key: token
 ```
 
-On a secured install the trainer pod also needs a token and workspace RBAC (see [How pipeline components reach MLflow](#how-pipeline-components-reach-mlflow)): set `MLFLOW_TRACKING_TOKEN` from the pod's ServiceAccount token, and bind that ServiceAccount to the `mlflow-writer` Role. See [Fine-tuning LLMs using Workbench](./fine-tuning-using-notebooks.mdx) for a full Trainer v2 + MLflow example with LLaMA-Factory.
-
-## Accessing MLflow run artifacts from other pipeline components
-
-A later component can read MLflow artifacts logged by an earlier one. Authenticate the same way (imports inside, SA token), then use the MLflow client:
-
-```python
-from kfp import dsl
-
-@dsl.component(base_image="python:3.11-slim", packages_to_install=["mlflow"])
-def download_and_compare(tracking_uri: str, source_model_uri: str, reference_model_uri: str) -> str:
-    """Download two models from MLflow and compare them."""
-    import os
-    import mlflow
-
-    with open("/var/run/secrets/kubernetes.io/serviceaccount/token") as f:
-        os.environ["MLFLOW_TRACKING_TOKEN"] = f.read().strip()
-    mlflow.set_tracking_uri(tracking_uri)
-    client = mlflow.tracking.MlflowClient()
-
-    source_path = client.download_artifacts(source_model_uri, path="/tmp/source")
-    reference_path = client.download_artifacts(reference_model_uri, path="/tmp/reference")
-    return f"Compared models: {source_path} vs {reference_path}"
-```
-
-This pattern is useful for:
-
-- A/B comparing fine-tuned models before deployment.
-- Pulling the best model (by accuracy metric) from an MLflow experiment into an inference pipeline.
-- Validating that a model passed acceptance tests before pushing it to the model registry.
+See [Fine-tuning LLMs using Workbench](./fine-tuning-using-notebooks.mdx) for a full Trainer v2 + MLflow example with LLaMA-Factory.
 
 ## Best practices
 
@@ -330,32 +239,24 @@ train_model(..., run_id=dsl.PIPELINE_JOB_ID_PLACEHOLDER)
 
 Then use the received string in the run name to keep MLflow runs distinct per pipeline execution.
 
-### Log metrics at the component level, not the pipeline level
-
-MLflow `log_metric` calls must happen inside a `mlflow.start_run()` block. If a component has multiple logical training stages, open separate MLflow runs within the same component — do not try to log metrics from outside a run context.
+### Keep the token out of the pipeline definition
 
-### Use MLflow models for model registry integration
+Source the user identity token from a Kubernetes `Secret` (exposed to the component as an environment variable or supplied as a run parameter) and never hardcode it in `pipeline.yaml`, because compiled pipelines are stored and shared. Rotate the token on the platform when it expires.
 
-Instead of logging arbitrary files as `mlflow.log_artifacts`, use a flavor's `log_model()` to register the model with its signature and dependencies:
-
-```python
-mlflow.sklearn.log_model(sk_model, "model")
-# or for HuggingFace:
-mlflow.transformers.log_model(hf_model, "model")
-```
+### Log metrics inside a run
 
-Registered models can then be promoted to the **Staging** or **Production** stage in the MLflow UI.
+Each metric must belong to a run created with `runs/create`. If a component has multiple logical stages, open a run per stage rather than logging outside a run context.
 
 ### Artifact storage for production
 
-The default MLflow artifact store is local to the MLflow pod. For production pipelines, configure S3-compatible object storage in the MLflow plugin settings (see [MLflow Tracking Server](../kubeflow/how_to/mlflow.mdx) → High Availability And Storage). Pipeline components can then log large model artifacts without running into pod disk limits.
+Logging large model artifacts requires durable object storage. Configure S3-compatible storage in the MLflow plugin settings (see [MLflow Tracking Server](../kubeflow/how_to/mlflow.mdx) → High Availability And Storage) so artifact uploads do not hit pod disk limits.
 
 ## Troubleshooting
 
 | Symptom | Check |
 |---------|-------|
-| Component hangs or fails with an HTML/redirect (`302`) response | You are hitting the `oauth2-proxy` in front of MLflow. Use a direct in-cluster Service (`mlflow-tracking-direct`) that targets the `mlflow-http` port, not `http://mlflow-tracking-server.kubeflow:5000`. |
-| `401 UNAUTHENTICATED` / "Missing Authorization header" | The component sent no token. Set `MLFLOW_TRACKING_TOKEN` from `/var/run/secrets/kubernetes.io/serviceaccount/token` before the first MLflow call. |
-| `403 PERMISSION_DENIED` | The component's ServiceAccount lacks write RBAC in the workspace namespace — apply the `mlflow-writer` Role/RoleBinding. A single 403 right after pod start is the authorization cache warming up; the retry loop in the example absorbs it. |
-| Run shows up in the wrong workspace | The stock `mlflow` client logs to the server's default workspace. Use the Alauda client's `mlflow.set_workspace(...)` to target another workspace. |
+| Component fails with an HTML/redirect (`302`) response | You reached MLflow through its browser OAuth proxy (`mlflow-tracking-server:5000`). Use the platform Kubernetes API path shown above (`…/kubernetes/<cluster>/…/pods/<pod>:5000/proxy/…`) instead. |
+| `401 UNAUTHENTICATED` / "Missing Authorization header or X-Forwarded-Access-Token header" | The platform API strips the inbound `Authorization` header before proxying to the pod. Send the identity in `X-Forwarded-Access-Token` as well (the helper does both). |
+| `403 PERMISSION_DENIED` | Your platform user lacks access to the workspace namespace. Grant the user access to the MLflow workspace (see [Workspace Access](../kubeflow/how_to/mlflow.mdx)); no ServiceAccount is involved. |
+| Run shows up under the wrong owner / workspace | The run owner is the `email` claim of the token; the workspace is `X-MLFLOW-WORKSPACE` (or the server default). Check both values. |
 | MLflow metrics not appearing in KFP UI | KFP and MLflow are separate systems. Metrics logged to MLflow appear in the MLflow UI (**Alauda AI → Tools → MLFlow**), not in the KFP run output. |
diff --git a/e2e/mlflow-user-identity-smoke.sh b/e2e/mlflow-user-identity-smoke.sh
new file mode 100755
index 0000000..511fcdd
--- /dev/null
+++ b/e2e/mlflow-user-identity-smoke.sh
@@ -0,0 +1,85 @@
+#!/usr/bin/env bash
+# Smoke test: log to MLflow with a platform **user identity token**.
+#
+# Exercises the MLflow `kubernetes-auth` plugin's `user_identity_token` mode:
+# the server takes the caller identity from the bearer token's claims and records
+# the run under that user. No ServiceAccount and no extra in-cluster Service are
+# created — the MLflow server is reached through the platform Kubernetes API proxy
+# (the same `…/kubernetes/<cluster>` entry point used for any K8s call), and the
+# caller identity is forwarded to MLflow via `X-Forwarded-Access-Token`.
+#
+# Required env:
+#   PLATFORM_ADDRESS   e.g. https://192.168.142.163
+#   CLUSTER            e.g. g1-c1-x86
+#   MLFLOW_USER_TOKEN  a platform user identity token (JWT with an `email` claim)
+# Optional env:
+#   MLFLOW_WORKSPACE   target workspace namespace (default: mlops-demo-e2e)
+#   MLFLOW_NS          namespace of the MLflow server (default: kubeflow)
+set -euo pipefail
+
+: "${PLATFORM_ADDRESS:?set PLATFORM_ADDRESS, e.g. https://192.168.142.163}"
+: "${CLUSTER:?set CLUSTER, e.g. g1-c1-x86}"
+: "${MLFLOW_USER_TOKEN:?set MLFLOW_USER_TOKEN to a platform user identity token}"
+WORKSPACE="${MLFLOW_WORKSPACE:-mlops-demo-e2e}"
+MLFLOW_NS="${MLFLOW_NS:-kubeflow}"
+
+KAPI="${PLATFORM_ADDRESS%/}/kubernetes/${CLUSTER}"
+TOKEN="${MLFLOW_USER_TOKEN}"
+
+# Identity the server should attribute the run to: the first non-null of the
+# email, preferred_username, name, and sub claims in the JWT payload.
+EMAIL="$(printf '%s' "${TOKEN}" | cut -d. -f2 | tr '_-' '/+' \
+  | { b="$(cat)"; printf '%s%s' "$b" "$(printf '%*s' $(( (4 - ${#b} % 4) % 4 )) '' | tr ' ' '=')"; } \
+  | base64 -d 2>/dev/null | jq -r '.email // .preferred_username // .name // .sub')"
+echo "caller identity: ${EMAIL}"
+
+# Authenticate to the platform K8s API with the user token; locate the MLflow pod.
+POD="$(curl -fsSk -H "Authorization: Bearer ${TOKEN}" \
+  "${KAPI}/api/v1/namespaces/${MLFLOW_NS}/pods?labelSelector=app%3Dmlflow-tracking-server" \
+  | jq -r '.items[] | select(.status.phase=="Running") | .metadata.name' | head -1)"
+[ -n "${POD}" ] || { echo "FAIL: no running mlflow-tracking-server pod in ${MLFLOW_NS}"; exit 1; }
+echo "mlflow pod: ${POD}"
+
+# Reach the MLflow app port (5000) through the K8s API pod proxy, bypassing the
+# browser OAuth proxy. Authorization authenticates us to the K8s API; the MLflow
+# server reads our identity from X-Forwarded-Access-Token.
+BASE="${KAPI}/api/v1/namespaces/${MLFLOW_NS}/pods/${POD}:5000/proxy/api/2.0/mlflow"
+hdr=(-H "Authorization: Bearer ${TOKEN}"
+     -H "X-Forwarded-Access-Token: ${TOKEN}"
+     -H "X-MLFLOW-WORKSPACE: ${WORKSPACE}"
+     -H "Content-Type: application/json")
+
+api() { # api <method> <path> [json-body]
+  curl -fsSk "${hdr[@]}" -X "$1" "${BASE}/$2" ${3:+-d "$3"}
+}
+
+EXP_NAME="uit-smoke-$$"
+echo "== create experiment '${EXP_NAME}' =="
+EID="$(api POST experiments/create "{\"name\":\"${EXP_NAME}\"}" | jq -r '.experiment_id')"
+[ -n "${EID}" ] && [ "${EID}" != null ] || { echo "FAIL: experiment not created"; exit 1; }
+
+cleanup() { api POST experiments/delete "{\"experiment_id\":\"${EID}\"}" >/dev/null 2>&1 || true; }
+trap cleanup EXIT
+
+echo "== create run, log params + metrics =="
+RID="$(api POST runs/create "{\"experiment_id\":\"${EID}\",\"start_time\":1700000000000}" | jq -r '.run.info.run_id')"
+[ -n "${RID}" ] && [ "${RID}" != null ] || { echo "FAIL: run not created"; exit 1; }
+api POST runs/log-parameter "{\"run_id\":\"${RID}\",\"key\":\"model_name\",\"value\":\"qwen3-0.6b\"}" >/dev/null
+for s in 1 2 3; do
+  api POST runs/log-metric "{\"run_id\":\"${RID}\",\"key\":\"loss\",\"value\":$(awk "BEGIN{print 2.0*(0.9^$s)}"),\"timestamp\":1700000000000,\"step\":${s}}" >/dev/null
+done
+api POST runs/update "{\"run_id\":\"${RID}\",\"status\":\"FINISHED\",\"end_time\":1700000005000}" >/dev/null
+
+echo "== read back and assert =="
+RUN="$(api GET "runs/get?run_id=${RID}")"
+OWNER="$(printf '%s' "${RUN}" | jq -r '.run.info.user_id')"
+STATUS="$(printf '%s' "${RUN}" | jq -r '.run.info.status')"
+PARAM="$(printf '%s' "${RUN}" | jq -r '.run.data.params[] | select(.key=="model_name") | .value')"
+METRIC="$(printf '%s' "${RUN}" | jq -r '.run.data.metrics[] | select(.key=="loss") | .key' | head -1)"
+
+echo "  run_id=${RID} owner=${OWNER} status=${STATUS} model_name=${PARAM} metric=${METRIC}"
+[ "${STATUS}" = "FINISHED" ]      || { echo "FAIL: run not FINISHED"; exit 1; }
+[ "${PARAM}" = "qwen3-0.6b" ]     || { echo "FAIL: param not logged"; exit 1; }
+[ "${METRIC}" = "loss" ]          || { echo "FAIL: metric not logged"; exit 1; }
+[ "${OWNER}" = "${EMAIL}" ]       || { echo "FAIL: run owner '${OWNER}' != caller identity '${EMAIL}'"; exit 1; }
+
+echo "PASS: logged to MLflow as user identity '${EMAIL}' (no ServiceAccount, no direct Service)"

From 03ea72d171957d9a1310dcfc71b35acff9796f42 Mon Sep 17 00:00:00 2001
From: Wu Yi <typhoonzero1986@gmail.com>
Date: Mon, 15 Jun 2026 08:01:37 +0000
Subject: [PATCH 12/21] docs: add MLflow Python SDK auth + RBAC guide
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

New how_to/mlflow-python-sdk.mdx: how to drive the stock mlflow>=3.10 SDK
against the auth + multi-tenant Alauda AI MLflow server with a platform user
identity token — no ServiceAccount, no per-workspace RBAC, no extra Service.
Covers MLFLOW_TRACKING_TOKEN auth, mlflow.set_workspace, the port-forward
connection to the app port (raw tunnel preserves Authorization), model
registry, the smoke test, and troubleshooting (302 / token-newline / 401 /
403). Verified on g1-c1-x86: runs are owned by the token identity.

Cross-linked from mlflow.mdx Client Configuration.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 docs/en/kubeflow/how_to/mlflow-python-sdk.mdx | 111 ++++++++++++++++++
 docs/en/kubeflow/how_to/mlflow.mdx            |   2 +
 2 files changed, 113 insertions(+)
 create mode 100644 docs/en/kubeflow/how_to/mlflow-python-sdk.mdx

diff --git a/docs/en/kubeflow/how_to/mlflow-python-sdk.mdx b/docs/en/kubeflow/how_to/mlflow-python-sdk.mdx
new file mode 100644
index 0000000..6ad4339
--- /dev/null
+++ b/docs/en/kubeflow/how_to/mlflow-python-sdk.mdx
@@ -0,0 +1,111 @@
+---
+weight: 46
+---
+
+# Using the MLflow Python SDK with Authentication and RBAC
+
+On Alauda AI the [MLflow Tracking Server](./mlflow.mdx) runs with single sign-on and multi-tenancy enabled: every request is authenticated and authorized against Kubernetes RBAC, and each run is recorded under the calling user. This guide shows how to drive the stock **MLflow Python SDK** against that server with your own **user identity token** — no ServiceAccount, no per-workspace RBAC to create, and no extra in-cluster Service.
+
+## Prerequisites
+
+- `mlflow` **3.10 or later** (`pip install "mlflow>=3.10"`). Workspace selection (`mlflow.set_workspace`) is a 3.10+ feature.
+- A **platform user identity token** — a JWT with an `email` claim, issued for your platform user. Create one from the platform console for non-interactive use. The same token authenticates you to the cluster API and identifies you to MLflow.
+- Access to an MLflow **workspace** — a namespace labelled `mlflow-enabled=true` — that your platform user is allowed to use (see [Workspace Access](./mlflow.mdx)).
+- `kubectl`.
+
+## How authentication works
+
+The MLflow server runs the `kubernetes-auth` plugin in `user_identity_token` mode. It reads the caller identity from the bearer token's claims (`email` → `preferred_username` → `name` → `sub`, groups from `groups` / `roles`), authorizes that identity against the workspace with Kubernetes RBAC, and stores it as the MLflow **run owner**. The MLflow SDK sends your token as `Authorization: Bearer …` when you set `MLFLOW_TRACKING_TOKEN`, and the workspace as `X-MLFLOW-WORKSPACE` when you call `mlflow.set_workspace()`.
+
+There is nothing to create on the cluster: your existing platform permissions on the workspace namespace are what authorize you.
+
+## Connect the SDK
+
+The MLflow Service shipped by the plugin sits behind a browser-oriented OAuth proxy, which rejects non-interactive bearer tokens. Reach the MLflow application port directly with a `kubectl port-forward` tunnel — authenticated to the platform Kubernetes API with your user token. A port-forward is a raw TCP tunnel, so the SDK's `Authorization` header reaches the server unchanged.
+
+```bash
+export PLATFORM=https://<platform>            # platform global address
+export CLUSTER=<cluster>                       # e.g. g1-c1-x86
+export MLFLOW_USER_TOKEN=<your-user-token>     # platform user identity token
+
+# Authenticate to the cluster API through the platform and tunnel to the app port (5000).
+POD=$(kubectl --server "$PLATFORM/kubernetes/$CLUSTER" --token "$MLFLOW_USER_TOKEN" \
+        --insecure-skip-tls-verify -n kubeflow \
+        get pod -l app=mlflow-tracking-server -o jsonpath='{.items[0].metadata.name}')
+
+kubectl --server "$PLATFORM/kubernetes/$CLUSTER" --token "$MLFLOW_USER_TOKEN" \
+        --insecure-skip-tls-verify -n kubeflow \
+        port-forward "pod/$POD" 5000:5000
+```
+
+:::tip
+If your `kubectl` context is already configured for the cluster, the flags collapse to `kubectl -n kubeflow port-forward "pod/$POD" 5000:5000`. Drop `--insecure-skip-tls-verify` when the platform certificate is trusted by your machine.
+:::
+
+With the tunnel open, point the SDK at it:
+
+```python
+import os
+import mlflow
+
+os.environ["MLFLOW_TRACKING_TOKEN"] = os.environ["MLFLOW_USER_TOKEN"].strip()  # → Authorization: Bearer
+mlflow.set_tracking_uri("http://127.0.0.1:5000")
+mlflow.set_workspace("team-a")                 # the workspace namespace; → X-MLFLOW-WORKSPACE
+mlflow.set_experiment("my-experiment")
+
+with mlflow.start_run(run_name="sdk-quickstart") as run:
+    mlflow.log_param("learning_rate", 2e-4)
+    mlflow.log_metric("loss", 0.123)
+    print("run:", run.info.run_id)
+```
+
+The run appears in the MLflow UI under **Alauda AI → Tools → MLFlow**, owned by your platform user.
+
+:::warning
+`MLFLOW_TRACKING_TOKEN` becomes an HTTP header value, so it must contain no trailing newline or whitespace — always `.strip()` it (a stray newline produces `Invalid … character(s) in header value: 'Bearer …\n'`). Never commit the token to source control; read it from an environment variable or secret and rotate it when it expires.
+:::
+
+## Selecting a workspace
+
+Runs are recorded in the workspace you select; if you select none, the server's default workspace is used. Any of these set it (the SDK turns them into the `X-MLFLOW-WORKSPACE` header):
+
+- `mlflow.set_workspace("team-a")` in code,
+- the `MLFLOW_WORKSPACE=team-a` environment variable,
+- the `X-MLFLOW-WORKSPACE: team-a` header for raw HTTP clients.
+
+You can only use a workspace your platform user has access to; see [Workspace Access](./mlflow.mdx).
+
+## Registering models
+
+The model registry is workspace-scoped and authorized the same way, so the usual SDK calls work once connected:
+
+```python
+mlflow.set_workspace("team-a")
+with mlflow.start_run():
+    mlflow.sklearn.log_model(sk_model, name="model", registered_model_name="fraud-detector")
+```
+
+Promote the registered version to **Staging** or **Production** from the MLflow UI.
+
+## Verify the setup
+
+The pipelines guide ships a self-contained smoke test that logs a run with a user token and asserts the run owner matches the token identity — a quick way to confirm auth, workspace, and RBAC end-to-end:
+
+```bash
+PLATFORM_ADDRESS=$PLATFORM CLUSTER=$CLUSTER \
+  MLFLOW_USER_TOKEN=$MLFLOW_USER_TOKEN MLFLOW_WORKSPACE=team-a \
+  e2e/mlflow-user-identity-smoke.sh
+```
+
+For logging from inside Kubeflow Pipelines components, see [Kubeflow Pipeline + MLflow Integration](../../training_guides/pipelines-mlflow-integration.mdx).
+
+## Troubleshooting
+
+| Symptom | Check |
+|---------|-------|
+| SDK call returns HTML / a redirect (`302`) | You connected to the browser OAuth proxy (the `mlflow-tracking-server` Service / platform route). Tunnel to the application port with `kubectl port-forward … 5000:5000` and use `http://127.0.0.1:5000`. |
+| `Invalid … character(s) in header value: 'Bearer …\n'` | The token has trailing whitespace. Set `MLFLOW_TRACKING_TOKEN` to the `.strip()`-ed value. |
+| `401 UNAUTHENTICATED` / "Missing Authorization … header" | `MLFLOW_TRACKING_TOKEN` is unset or empty, or the token expired. (The Kubernetes API *pod proxy* strips `Authorization`; use `port-forward`, which does not.) |
+| `403 PERMISSION_DENIED` | Your platform user lacks access to the workspace namespace. Request access to the workspace (see [Workspace Access](./mlflow.mdx)); no ServiceAccount is involved. |
+| `Failed to query /api/3.0/mlflow/server-info` | Same transport/auth cause as above — confirm the tunnel is up, the URI is `http://127.0.0.1:5000`, and the token is set and stripped. |
+| Run shows the wrong owner or workspace | The owner is the token's `email` claim; the workspace is `set_workspace()` / `MLFLOW_WORKSPACE` (else the server default). Check both. |
diff --git a/docs/en/kubeflow/how_to/mlflow.mdx b/docs/en/kubeflow/how_to/mlflow.mdx
index eaf2e76..d7580db 100644
--- a/docs/en/kubeflow/how_to/mlflow.mdx
+++ b/docs/en/kubeflow/how_to/mlflow.mdx
@@ -69,6 +69,8 @@ subjects:
 
 ## Client Configuration
 
+For authenticating the MLflow Python SDK with a user identity token — including the in-cluster connection details and RBAC — see [Using the MLflow Python SDK with Authentication and RBAC](./mlflow-python-sdk.mdx).
+
 Set the MLflow tracking URI to the platform route and select the workspace:
 
 ```python

From deb867736ff4af9eab217d62aa4e440d7a434bfd Mon Sep 17 00:00:00 2001
From: Wu Yi <typhoonzero1986@gmail.com>
Date: Mon, 15 Jun 2026 08:50:53 +0000
Subject: [PATCH 13/21] docs: SDK guide authenticates through the OAuth proxy
 (no app-port access)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Rework mlflow-python-sdk.mdx so the MLflow Python client always goes through
the oauth2-proxy (the platform MLflow route) instead of port-forwarding to the
container port:

- Interactive: present the browser SSO session — copy the _oauth2_proxy cookie
  and attach it via a runtime-registered RequestHeaderProvider (verified: the
  provider injects the header and the run is owned by the caller identity).
- Headless/automation: admin enables oauth2-proxy --skip-jwt-bearer-tokens, then
  the client uses MLFLOW_TRACKING_TOKEN with a platform OIDC token.

Removes the kubectl port-forward / app-port connection entirely.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 docs/en/kubeflow/how_to/mlflow-python-sdk.mdx | 100 ++++++++++--------
 1 file changed, 57 insertions(+), 43 deletions(-)

diff --git a/docs/en/kubeflow/how_to/mlflow-python-sdk.mdx b/docs/en/kubeflow/how_to/mlflow-python-sdk.mdx
index 6ad4339..d3a9f31 100644
--- a/docs/en/kubeflow/how_to/mlflow-python-sdk.mdx
+++ b/docs/en/kubeflow/how_to/mlflow-python-sdk.mdx
@@ -4,53 +4,59 @@ weight: 46
 
 # Using the MLflow Python SDK with Authentication and RBAC
 
-On Alauda AI the [MLflow Tracking Server](./mlflow.mdx) runs with single sign-on and multi-tenancy enabled: every request is authenticated and authorized against Kubernetes RBAC, and each run is recorded under the calling user. This guide shows how to drive the stock **MLflow Python SDK** against that server with your own **user identity token** — no ServiceAccount, no per-workspace RBAC to create, and no extra in-cluster Service.
+On Alauda AI the [MLflow Tracking Server](./mlflow.mdx) runs behind single sign-on and multi-tenancy: an OAuth proxy authenticates every caller, and the server records each run under the calling user and authorizes it against Kubernetes RBAC. This guide shows how to drive the stock **MLflow Python SDK** through that OAuth proxy with your own identity — there is nothing to create on the cluster.
 
 ## Prerequisites
 
 - `mlflow` **3.10 or later** (`pip install "mlflow>=3.10"`). Workspace selection (`mlflow.set_workspace`) is a 3.10+ feature.
-- A **platform user identity token** — a JWT with an `email` claim, issued for your platform user. Create one from the platform console for non-interactive use. The same token authenticates you to the cluster API and identifies you to MLflow.
-- Access to an MLflow **workspace** — a namespace labelled `mlflow-enabled=true` — that your platform user is allowed to use (see [Workspace Access](./mlflow.mdx)).
-- `kubectl`.
+- Browser access to MLflow — you can sign in at **Alauda AI → Tools → MLFlow** through the platform SSO.
+- Access to an MLflow **workspace** (a namespace labelled `mlflow-enabled=true`) that your platform user is allowed to use (see [Workspace Access](./mlflow.mdx)).
 
 ## How authentication works
 
-The MLflow server runs the `kubernetes-auth` plugin in `user_identity_token` mode. It reads the caller identity from the bearer token's claims (`email` → `preferred_username` → `name` → `sub`, groups from `groups` / `roles`), authorizes that identity against the workspace with Kubernetes RBAC, and stores it as the MLflow **run owner**. The MLflow SDK sends your token as `Authorization: Bearer …` when you set `MLFLOW_TRACKING_TOKEN`, and the workspace as `X-MLFLOW-WORKSPACE` when you call `mlflow.set_workspace()`.
+Two layers sit in front of your runs:
 
-There is nothing to create on the cluster: your existing platform permissions on the workspace namespace are what authorize you.
+1. The **OAuth proxy** (`oauth2-proxy`) on the MLflow endpoint authenticates the request — through the interactive SSO login it issues a session cookie.
+2. The MLflow server's `kubernetes-auth` plugin then reads your identity (from the forwarded token), records it as the run **owner**, and authorizes it against your Kubernetes permissions in the workspace.
 
-## Connect the SDK
+A programmatic client must satisfy layer 1. The clean, no-configuration way is to **present your browser SSO session** to the SDK; for automation, an administrator can additionally allow bearer tokens (see [Headless / automation](#headless-automation) below). Either way the client always goes through the OAuth proxy — never connect to the MLflow container port directly.
 
-The MLflow Service shipped by the plugin sits behind a browser-oriented OAuth proxy, which rejects non-interactive bearer tokens. Reach the MLflow application port directly with a `kubectl port-forward` tunnel — authenticated to the platform Kubernetes API with your user token. A port-forward is a raw TCP tunnel, so the SDK's `Authorization` header reaches the server unchanged.
+## Connect the SDK through the OAuth proxy
+
+### 1. Get your session cookie from the browser
+
+Open **Alauda AI → Tools → MLFlow** in your browser and sign in. Then, in the browser developer tools (**Application/Storage → Cookies** for the platform host), copy the value of the `_oauth2_proxy` cookie — if it is split into `_oauth2_proxy_0`, `_oauth2_proxy_1`, … copy each chunk and join them with `; `. Set it as an environment variable:
 
 ```bash
-export PLATFORM=https://<platform>            # platform global address
-export CLUSTER=<cluster>                       # e.g. g1-c1-x86
-export MLFLOW_USER_TOKEN=<your-user-token>     # platform user identity token
-
-# Authenticate to the cluster API through the platform and tunnel to the app port (5000).
-POD=$(kubectl --server "$PLATFORM/kubernetes/$CLUSTER" --token "$MLFLOW_USER_TOKEN" \
-        --insecure-skip-tls-verify -n kubeflow \
-        get pod -l app=mlflow-tracking-server -o jsonpath='{.items[0].metadata.name}')
-
-kubectl --server "$PLATFORM/kubernetes/$CLUSTER" --token "$MLFLOW_USER_TOKEN" \
-        --insecure-skip-tls-verify -n kubeflow \
-        port-forward "pod/$POD" 5000:5000
+export MLFLOW_PROXY_COOKIE='_oauth2_proxy=<value>'
 ```
 
-:::tip
-If your `kubectl` context is already configured for the cluster, the flags collapse to `kubectl -n kubeflow port-forward "pod/$POD" 5000:5000`. Drop `--insecure-skip-tls-verify` when the platform certificate is trusted by your machine.
-:::
+### 2. Point the SDK at the MLflow route and attach the cookie
 
-With the tunnel open, point the SDK at it:
+The MLflow SDK does not send cookies on its own, so register a tiny request-header provider that adds yours to every call. Then use the normal MLflow API:
 
 ```python
 import os
 import mlflow
+from mlflow.tracking.request_header.abstract_request_header_provider import RequestHeaderProvider
+from mlflow.tracking.request_header.registry import _request_header_provider_registry
+
+
+class ProxySessionHeader(RequestHeaderProvider):
+    """Send the browser OAuth-proxy session cookie with every MLflow request."""
+
+    def in_context(self):
+        return bool(os.environ.get("MLFLOW_PROXY_COOKIE"))
+
+    def request_headers(self):
+        return {"Cookie": os.environ["MLFLOW_PROXY_COOKIE"]}
+
 
-os.environ["MLFLOW_TRACKING_TOKEN"] = os.environ["MLFLOW_USER_TOKEN"].strip()  # → Authorization: Bearer
-mlflow.set_tracking_uri("http://127.0.0.1:5000")
-mlflow.set_workspace("team-a")                 # the workspace namespace; → X-MLFLOW-WORKSPACE
+_request_header_provider_registry.register(ProxySessionHeader)
+
+# The platform MLflow route — i.e. through the OAuth proxy, not the container port.
+mlflow.set_tracking_uri("https://<platform>/clusters/<cluster>/mlflow")
+mlflow.set_workspace("team-a")                 # workspace namespace → X-MLFLOW-WORKSPACE
 mlflow.set_experiment("my-experiment")
 
 with mlflow.start_run(run_name="sdk-quickstart") as run:
@@ -59,10 +65,10 @@ with mlflow.start_run(run_name="sdk-quickstart") as run:
     print("run:", run.info.run_id)
 ```
 
-The run appears in the MLflow UI under **Alauda AI → Tools → MLFlow**, owned by your platform user.
+The run appears under **Alauda AI → Tools → MLFlow**, owned by your platform user.
 
-:::warning
-`MLFLOW_TRACKING_TOKEN` becomes an HTTP header value, so it must contain no trailing newline or whitespace — always `.strip()` it (a stray newline produces `Invalid … character(s) in header value: 'Bearer …\n'`). Never commit the token to source control; read it from an environment variable or secret and rotate it when it expires.
+:::tip
+If the platform certificate is not trusted by your machine, set `MLFLOW_TRACKING_INSECURE_TLS=true`. The session cookie expires — when calls start returning a login redirect, copy a fresh `_oauth2_proxy` value from the browser.
 :::
 
 ## Selecting a workspace
@@ -70,8 +76,7 @@ The run appears in the MLflow UI under **Alauda AI → Tools → MLFlow**, owned
 Runs are recorded in the workspace you select; if you select none, the server's default workspace is used. Any of these set it (the SDK turns them into the `X-MLFLOW-WORKSPACE` header):
 
 - `mlflow.set_workspace("team-a")` in code,
-- the `MLFLOW_WORKSPACE=team-a` environment variable,
-- the `X-MLFLOW-WORKSPACE: team-a` header for raw HTTP clients.
+- the `MLFLOW_WORKSPACE=team-a` environment variable.
 
 You can only use a workspace your platform user has access to; see [Workspace Access](./mlflow.mdx).
 
@@ -87,25 +92,34 @@ with mlflow.start_run():
 
 Promote the registered version to **Staging** or **Production** from the MLflow UI.
 
-## Verify the setup
+## Headless / automation \{#headless-automation}
 
-The pipelines guide ships a self-contained smoke test that logs a run with a user token and asserts the run owner matches the token identity — a quick way to confirm auth, workspace, and RBAC end-to-end:
+The cookie flow needs an interactive browser login, which does not suit CI jobs or pipeline components. For those, an administrator can configure the MLflow OAuth proxy to accept OIDC **bearer tokens** in addition to browser sessions, by adding to the MLflow plugin values:
 
-```bash
-PLATFORM_ADDRESS=$PLATFORM CLUSTER=$CLUSTER \
-  MLFLOW_USER_TOKEN=$MLFLOW_USER_TOKEN MLFLOW_WORKSPACE=team-a \
-  e2e/mlflow-user-identity-smoke.sh
+```yaml
+auth:
+  oauth:
+    extraArgs:
+      - --skip-jwt-bearer-tokens=true
+```
+
+With that enabled, a client authenticates with a platform-issued **OIDC token** (not a Kubernetes ServiceAccount token — the proxy validates it against the platform identity provider) and no cookie:
+
+```python
+import os, mlflow
+os.environ["MLFLOW_TRACKING_TOKEN"] = os.environ["OIDC_TOKEN"].strip()   # → Authorization: Bearer
+mlflow.set_tracking_uri("https://<platform>/clusters/<cluster>/mlflow")
+mlflow.set_workspace("team-a")
 ```
 
-For logging from inside Kubeflow Pipelines components, see [Kubeflow Pipeline + MLflow Integration](../../training_guides/pipelines-mlflow-integration.mdx).
+Set `MLFLOW_TRACKING_TOKEN` to the `.strip()`-ed token (a trailing newline produces `Invalid … character(s) in header value: 'Bearer …\n'`), and store it in a Kubernetes `Secret` rather than in code.
 
 ## Troubleshooting
 
 | Symptom | Check |
 |---------|-------|
-| SDK call returns HTML / a redirect (`302`) | You connected to the browser OAuth proxy (the `mlflow-tracking-server` Service / platform route). Tunnel to the application port with `kubectl port-forward … 5000:5000` and use `http://127.0.0.1:5000`. |
+| Call returns HTML or a redirect (`302` to the login page) | The OAuth proxy did not accept the request. For the cookie flow, your `_oauth2_proxy` value is missing or expired — copy a fresh one (and all `_oauth2_proxy_N` chunks). For the bearer flow, confirm the proxy has `--skip-jwt-bearer-tokens` and the token is a valid platform OIDC token. |
 | `Invalid … character(s) in header value: 'Bearer …\n'` | The token has trailing whitespace. Set `MLFLOW_TRACKING_TOKEN` to the `.strip()`-ed value. |
-| `401 UNAUTHENTICATED` / "Missing Authorization … header" | `MLFLOW_TRACKING_TOKEN` is unset or empty, or the token expired. (The Kubernetes API *pod proxy* strips `Authorization`; use `port-forward`, which does not.) |
+| `Failed to query /api/3.0/mlflow/server-info` | The SDK could not reach the server through the proxy — verify the tracking URI is the platform MLflow route and your cookie/token is valid. |
 | `403 PERMISSION_DENIED` | Your platform user lacks access to the workspace namespace. Request access to the workspace (see [Workspace Access](./mlflow.mdx)); no ServiceAccount is involved. |
-| `Failed to query /api/3.0/mlflow/server-info` | Same transport/auth cause as above — confirm the tunnel is up, the URI is `http://127.0.0.1:5000`, and the token is set and stripped. |
-| Run shows the wrong owner or workspace | The owner is the token's `email` claim; the workspace is `set_workspace()` / `MLFLOW_WORKSPACE` (else the server default). Check both. |
+| Run shows the wrong owner or workspace | The owner is your authenticated identity; the workspace is `set_workspace()` / `MLFLOW_WORKSPACE` (else the server default). Check both. |

From b627b4abb3b3a02d2f6a7fd0282a7bbbdd94d365 Mon Sep 17 00:00:00 2001
From: Wu Yi <typhoonzero1986@gmail.com>
Date: Mon, 15 Jun 2026 10:16:17 +0000
Subject: [PATCH 14/21] docs: use the Dex refresh-token grant for headless
 MLflow auth

- SDK guide "Headless / automation": mint a short-lived Dex id token from a
  long-lived refresh token (refresh-token grant at /dex/token), then use it as
  MLFLOW_TRACKING_TOKEN through the OAuth proxy. Refresh before the 24h id-token
  expiry instead of carrying a static token.
- Rework the smoke test to the same method: refresh token -> id token -> log to
  MLflow via the platform route (through oauth2-proxy, no container-port access),
  asserting the run owner equals the token identity. Requires the proxy's
  --skip-jwt-bearer-tokens.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 docs/en/kubeflow/how_to/mlflow-python-sdk.mdx | 29 ++++--
 e2e/mlflow-user-identity-smoke.sh             | 91 +++++++++----------
 2 files changed, 62 insertions(+), 58 deletions(-)
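
Reviewer note (not part of the commit): the refresh-token exchange that this patch documents with `curl` can also be sketched in Python with only the standard library. This is a minimal sketch, not the project's implementation. The function names `build_refresh_request` and `mint_id_token` are hypothetical; the `/dex/token` path and the form fields are taken from the patch.

```python
import json
import ssl
import urllib.parse
import urllib.request


def build_refresh_request(platform, refresh_token, client_id, client_secret):
    """Build the POST to <platform>/dex/token for the refresh-token grant."""
    form = urllib.parse.urlencode({
        "grant_type": "refresh_token",
        "refresh_token": refresh_token,
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode()
    url = platform.rstrip("/") + "/dex/token"
    return urllib.request.Request(url, data=form, method="POST")


def mint_id_token(request, verify_tls=True):
    """Send the request and return the id token, stripped of whitespace."""
    ctx = ssl.create_default_context()
    if not verify_tls:
        # Same effect as `curl -k`; only for certificates your machine does not trust.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=ctx) as resp:
        token = json.load(resp).get("id_token")
    if not token:
        raise RuntimeError("no id_token in the Dex response")
    return token.strip()
```

The stripped return value can be assigned to `MLFLOW_TRACKING_TOKEN` before the SDK is used, which avoids the trailing-newline header error the guide mentions.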

diff --git a/docs/en/kubeflow/how_to/mlflow-python-sdk.mdx b/docs/en/kubeflow/how_to/mlflow-python-sdk.mdx
index d3a9f31..4d6c000 100644
--- a/docs/en/kubeflow/how_to/mlflow-python-sdk.mdx
+++ b/docs/en/kubeflow/how_to/mlflow-python-sdk.mdx
@@ -103,16 +103,29 @@ auth:
       - --skip-jwt-bearer-tokens=true
 ```
 
-With that enabled, a client authenticates with a platform-issued **OIDC token** (not a Kubernetes ServiceAccount token — the proxy validates it against the platform identity provider) and no cookie:
+With that enabled, a client authenticates with a Dex-issued **OIDC id token** (not a Kubernetes ServiceAccount token — the proxy validates it against Dex). Mint id tokens non-interactively from a long-lived **refresh token**, which avoids both the browser cookie and a short-lived static token:
 
-```python
-import os, mlflow
-os.environ["MLFLOW_TRACKING_TOKEN"] = os.environ["OIDC_TOKEN"].strip()   # → Authorization: Bearer
-mlflow.set_tracking_uri("https://<platform>/clusters/<cluster>/mlflow")
-mlflow.set_workspace("team-a")
-```
+1. **Get a refresh token once.** Run an interactive OIDC login that requests the `offline_access` scope against the platform Dex client — for example `kubelogin` / `kubectl oidc-login`, or any OIDC CLI. Store the returned refresh token in a Kubernetes `Secret`. Refresh tokens are long-lived; id tokens expire in 24 h.
+
+2. **Exchange it for a fresh id token** whenever you need one (the Dex client id/secret are the platform OAuth client, e.g. `alauda-auth` — ask your administrator):
+
+   ```bash
+   ID_TOKEN=$(curl -sk "https://<platform>/dex/token" \
+     -d grant_type=refresh_token --data-urlencode "refresh_token=$REFRESH_TOKEN" \
+     -d client_id="$DEX_CLIENT_ID" --data-urlencode "client_secret=$DEX_CLIENT_SECRET" \
+     | jq -r .id_token)
+   ```
+
+3. **Use it as the MLflow bearer token** (`.strip()` it — a trailing newline produces `Invalid … character(s) in header value: 'Bearer …\n'`):
+
+   ```python
+   import os, mlflow
+   os.environ["MLFLOW_TRACKING_TOKEN"] = os.environ["ID_TOKEN"].strip()   # needs `export ID_TOKEN` in the shell; → Authorization: Bearer
+   mlflow.set_tracking_uri("https://<platform>/clusters/<cluster>/mlflow")
+   mlflow.set_workspace("team-a")
+   ```
 
-Set `MLFLOW_TRACKING_TOKEN` to the `.strip()`-ed token (a trailing newline produces `Invalid … character(s) in header value: 'Bearer …\n'`), and store it in a Kubernetes `Secret` rather than in code.
+For long-running jobs (including pipeline components), refresh the id token before it expires rather than carrying a static one. Keep the refresh token and client secret in a `Secret`, never in code.
 
 ## Troubleshooting
 
diff --git a/e2e/mlflow-user-identity-smoke.sh b/e2e/mlflow-user-identity-smoke.sh
index 511fcdd..2a946c2 100755
--- a/e2e/mlflow-user-identity-smoke.sh
+++ b/e2e/mlflow-user-identity-smoke.sh
@@ -1,62 +1,56 @@
 #!/usr/bin/env bash
-# Smoke test: log to MLflow with a platform **user identity token**.
+# Smoke test: log to MLflow as a real user, THROUGH the OAuth proxy.
 #
-# Exercises the MLflow `kubernetes-auth` plugin's `user_identity_token` mode:
-# the server takes the caller identity from the bearer token's claims and records
-# the run under that user. No ServiceAccount and no extra in-cluster Service are
-# created — the MLflow server is reached through the platform Kubernetes API proxy
-# (the same `…/kubernetes/<cluster>` entry point used for any K8s call), and the
-# caller identity is forwarded to MLflow via `X-Forwarded-Access-Token`.
+# Mints a short-lived Dex **id token** from a long-lived **refresh token** (the
+# refresh-token grant), then logs a run to MLflow over the platform route — i.e.
+# through oauth2-proxy, never the container port. Asserts the run owner equals the
+# token's user identity.
+#
+# Requires the MLflow oauth2-proxy to accept OIDC bearer tokens
+# (auth.oauth.extraArgs: ["--skip-jwt-bearer-tokens=true"]).
 #
 # Required env:
-#   PLATFORM_ADDRESS   e.g. https://192.168.142.163
-#   CLUSTER            e.g. g1-c1-x86
-#   MLFLOW_USER_TOKEN  a platform user identity token (JWT with an `email` claim)
+#   PLATFORM_ADDRESS      e.g. https://192.168.142.163
+#   CLUSTER               e.g. g1-c1-x86
+#   MLFLOW_REFRESH_TOKEN  a Dex refresh token (from an offline_access login)
+#   DEX_CLIENT_ID         platform OAuth client id (e.g. alauda-auth)
+#   DEX_CLIENT_SECRET     that client's secret
 # Optional env:
-#   MLFLOW_WORKSPACE   target workspace namespace (default: mlops-demo-e2e)
-#   MLFLOW_NS          namespace of the MLflow server (default: kubeflow)
+#   MLFLOW_WORKSPACE      target workspace namespace (default: mlops-demo-e2e)
 set -euo pipefail
 
 : "${PLATFORM_ADDRESS:?set PLATFORM_ADDRESS, e.g. https://192.168.142.163}"
 : "${CLUSTER:?set CLUSTER, e.g. g1-c1-x86}"
-: "${MLFLOW_USER_TOKEN:?set MLFLOW_USER_TOKEN to a platform user identity token}"
+: "${MLFLOW_REFRESH_TOKEN:?set MLFLOW_REFRESH_TOKEN to a Dex refresh token}"
+: "${DEX_CLIENT_ID:?set DEX_CLIENT_ID, e.g. alauda-auth}"
+: "${DEX_CLIENT_SECRET:?set DEX_CLIENT_SECRET}"
 WORKSPACE="${MLFLOW_WORKSPACE:-mlops-demo-e2e}"
-MLFLOW_NS="${MLFLOW_NS:-kubeflow}"
+P="${PLATFORM_ADDRESS%/}"
 
-KAPI="${PLATFORM_ADDRESS%/}/kubernetes/${CLUSTER}"
-TOKEN="${MLFLOW_USER_TOKEN}"
+b64url_decode() { local d="$1"; d="${d//-/+}"; d="${d//_/\/}"; printf '%s%s' "$d" "$(printf '%*s' $(((4 - ${#d} % 4) % 4)) '' | tr ' ' '=')" | base64 -d 2>/dev/null; }
 
-# Identity the server should attribute the run to (first email claim in the JWT).
-EMAIL="$(printf '%s' "${TOKEN}" | cut -d. -f2 | tr '_-' '/+' \
-  | { b="$(cat)"; printf '%s%s' "$b" "$(printf '%*s' $(( (4 - ${#b} % 4) % 4 )) '' | tr ' ' '=')"; } \
-  | base64 -d 2>/dev/null | jq -r '.email // .preferred_username // .name // .sub')"
+echo "== mint id token via refresh-token grant =="
+ID_TOKEN="$(curl -fsSk "$P/dex/token" \
+  -d grant_type=refresh_token \
+  --data-urlencode "refresh_token=${MLFLOW_REFRESH_TOKEN}" \
+  -d client_id="${DEX_CLIENT_ID}" \
+  --data-urlencode "client_secret=${DEX_CLIENT_SECRET}" \
+  | jq -r '.id_token')"
+[ -n "${ID_TOKEN}" ] && [ "${ID_TOKEN}" != null ] || { echo "FAIL: no id_token from refresh-token grant"; exit 1; }
+EMAIL="$(b64url_decode "$(printf '%s' "$ID_TOKEN" | cut -d. -f2)" | jq -r '.email // .preferred_username // .name // .sub')"
 echo "caller identity: ${EMAIL}"
 
-# Authenticate to the platform K8s API with the user token; locate the MLflow pod.
-POD="$(curl -fsSk -H "Authorization: Bearer ${TOKEN}" \
-  "${KAPI}/api/v1/namespaces/${MLFLOW_NS}/pods?labelSelector=app%3Dmlflow-tracking-server" \
-  | jq -r '.items[] | select(.status.phase=="Running") | .metadata.name' | head -1)"
-[ -n "${POD}" ] || { echo "FAIL: no running mlflow-tracking-server pod in ${MLFLOW_NS}"; exit 1; }
-echo "mlflow pod: ${POD}"
-
-# Reach the MLflow app port (5000) through the K8s API pod proxy, bypassing the
-# browser OAuth proxy. Authorization authenticates us to the K8s API; the MLflow
-# server reads our identity from X-Forwarded-Access-Token.
-BASE="${KAPI}/api/v1/namespaces/${MLFLOW_NS}/pods/${POD}:5000/proxy/api/2.0/mlflow"
-hdr=(-H "Authorization: Bearer ${TOKEN}"
-     -H "X-Forwarded-Access-Token: ${TOKEN}"
+# Through the OAuth proxy: the platform MLflow route, with the id token as a bearer.
+BASE="$P/clusters/${CLUSTER}/mlflow/api/2.0/mlflow"
+hdr=(-H "Authorization: Bearer ${ID_TOKEN}"
      -H "X-MLFLOW-WORKSPACE: ${WORKSPACE}"
      -H "Content-Type: application/json")
+api() { curl -fsSk "${hdr[@]}" -X "$1" "${BASE}/$2" ${3:+-d "$3"}; }
 
-api() { # api <method> <path> [json-body]
-  curl -fsSk "${hdr[@]}" -X "$1" "${BASE}/$2" ${3:+-d "$3"}
-}
-
-EXP_NAME="uit-smoke-$$"
-echo "== create experiment '${EXP_NAME}' =="
-EID="$(api POST experiments/create "{\"name\":\"${EXP_NAME}\"}" | jq -r '.experiment_id')"
-[ -n "${EID}" ] && [ "${EID}" != null ] || { echo "FAIL: experiment not created"; exit 1; }
-
+EXP="uit-smoke-$$"
+echo "== create experiment '${EXP}' =="
+EID="$(api POST experiments/create "{\"name\":\"${EXP}\"}" | jq -r '.experiment_id')"
+[ -n "${EID}" ] && [ "${EID}" != null ] || { echo "FAIL: experiment not created (is --skip-jwt-bearer-tokens enabled?)"; exit 1; }
 cleanup() { api POST experiments/delete "{\"experiment_id\":\"${EID}\"}" >/dev/null 2>&1 || true; }
 trap cleanup EXIT
 
@@ -74,12 +68,9 @@ RUN="$(api GET "runs/get?run_id=${RID}")"
 OWNER="$(printf '%s' "${RUN}" | jq -r '.run.info.user_id')"
 STATUS="$(printf '%s' "${RUN}" | jq -r '.run.info.status')"
 PARAM="$(printf '%s' "${RUN}" | jq -r '.run.data.params[] | select(.key=="model_name") | .value')"
-METRIC="$(printf '%s' "${RUN}" | jq -r '.run.data.metrics[] | select(.key=="loss") | .key' | head -1)"
-
-echo "  run_id=${RID} owner=${OWNER} status=${STATUS} model_name=${PARAM} metric=${METRIC}"
-[ "${STATUS}" = "FINISHED" ]      || { echo "FAIL: run not FINISHED"; exit 1; }
-[ "${PARAM}" = "qwen3-0.6b" ]     || { echo "FAIL: param not logged"; exit 1; }
-[ "${METRIC}" = "loss" ]          || { echo "FAIL: metric not logged"; exit 1; }
-[ "${OWNER}" = "${EMAIL}" ]       || { echo "FAIL: run owner '${OWNER}' != caller identity '${EMAIL}'"; exit 1; }
+echo "  run_id=${RID} owner=${OWNER} status=${STATUS} model_name=${PARAM}"
+[ "${STATUS}" = "FINISHED" ]   || { echo "FAIL: run not FINISHED"; exit 1; }
+[ "${PARAM}" = "qwen3-0.6b" ]  || { echo "FAIL: param not logged"; exit 1; }
+[ "${OWNER}" = "${EMAIL}" ]    || { echo "FAIL: run owner '${OWNER}' != caller identity '${EMAIL}'"; exit 1; }
 
-echo "PASS: logged to MLflow as user identity '${EMAIL}' (no ServiceAccount, no direct Service)"
+echo "PASS: logged to MLflow as '${EMAIL}' through the OAuth proxy (refresh-token grant; no cookie, no container-port access)"

From 6ee45a7aeafebf0959efd25b0aa6ab32e8ac9627 Mon Sep 17 00:00:00 2001
From: Wu Yi <typhoonzero1986@gmail.com>
Date: Mon, 15 Jun 2026 10:39:07 +0000
Subject: [PATCH 15/21] docs: use the Dex password grant (ROPC) for headless
 MLflow auth
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- SDK guide "Headless / automation": mint a Dex id token with the OAuth2
  password grant (grant_type=password at /dex/token) — one call, no browser/
  cookie — then use it as MLFLOW_TRACKING_TOKEN through the OAuth proxy.
  Requires a Dex client whose grantTypes include "password" + the proxy's
  --skip-jwt-bearer-tokens. Warns to use a dedicated service account (ROPC
  sends the password) and store creds in a Secret.
- Rework the smoke test to ROPC: username/password -> Dex id token -> log to
  MLflow via the platform route (through oauth2-proxy), asserting run owner ==
  token identity.

Verified ROPC mints a valid Dex id token (iss=dex, aud=alauda-auth, key in
Dex JWKS) on g1-c1-x86.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 docs/en/kubeflow/how_to/mlflow-python-sdk.mdx | 15 ++++---
 e2e/mlflow-user-identity-smoke.sh             | 43 +++++++++++--------
 2 files changed, 34 insertions(+), 24 deletions(-)
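
Reviewer note (not part of the commit): the password-grant body that the guide sends with `curl --data-urlencode` can be sketched in Python as below. This is an illustrative sketch under the assumption that the field names and the `openid email groups` scope from the patch are what the Dex client expects; `password_grant_form` is a hypothetical helper name.

```python
import urllib.parse


def password_grant_form(username, password, client_id, client_secret,
                        scope="openid email groups"):
    """URL-encode the form body for grant_type=password (ROPC).

    Every value is percent-encoded, so passwords that contain characters
    such as '&' or '=' cannot corrupt the form.
    """
    return urllib.parse.urlencode({
        "grant_type": "password",
        "username": username,
        "password": password,
        "scope": scope,
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode()
```

The returned bytes are meant to be POSTed to `https://<platform>/dex/token`. Reading the username, password, and client secret from environment variables backed by a Kubernetes `Secret` keeps them out of the code, as the guide's warning recommends.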

diff --git a/docs/en/kubeflow/how_to/mlflow-python-sdk.mdx b/docs/en/kubeflow/how_to/mlflow-python-sdk.mdx
index 4d6c000..f58d0cb 100644
--- a/docs/en/kubeflow/how_to/mlflow-python-sdk.mdx
+++ b/docs/en/kubeflow/how_to/mlflow-python-sdk.mdx
@@ -103,15 +103,18 @@ auth:
       - --skip-jwt-bearer-tokens=true
 ```
 
-With that enabled, a client authenticates with a Dex-issued **OIDC id token** (not a Kubernetes ServiceAccount token — the proxy validates it against Dex). Mint id tokens non-interactively from a long-lived **refresh token**, which avoids both the browser cookie and a short-lived static token:
+With that enabled, a client authenticates with a Dex-issued **OIDC id token** (not a Kubernetes ServiceAccount token — the proxy validates it against Dex). Mint id tokens non-interactively with the **OAuth2 password grant (ROPC)** — a single token-endpoint call, no browser and no cookie:
 
-1. **Get a refresh token once.** Run an interactive OIDC login that requests the `offline_access` scope against the platform Dex client — for example `kubelogin` / `kubectl oidc-login`, or any OIDC CLI. Store the returned refresh token in a Kubernetes `Secret`. Refresh tokens are long-lived; id tokens expire in 24 h.
+1. **Allow the password grant.** Dex must have the password connector enabled (`enablePasswordDB: true`), and the OAuth client you use must list `password` in its `grantTypes`. Register a dedicated client for this (ask your administrator) rather than the platform's interactive-login client.
 
-2. **Exchange it for a fresh id token** whenever you need one (the Dex client id/secret are the platform OAuth client, e.g. `alauda-auth` — ask your administrator):
+2. **Exchange username/password for an id token:**
 
    ```bash
    ID_TOKEN=$(curl -sk "https://<platform>/dex/token" \
-     -d grant_type=refresh_token --data-urlencode "refresh_token=$REFRESH_TOKEN" \
+     -d grant_type=password \
+     --data-urlencode "username=$MLFLOW_USERNAME" \
+     --data-urlencode "password=$MLFLOW_PASSWORD" \
+     -d scope="openid email groups" \
      -d client_id="$DEX_CLIENT_ID" --data-urlencode "client_secret=$DEX_CLIENT_SECRET" \
      | jq -r .id_token)
    ```
@@ -125,7 +128,9 @@ With that enabled, a client authenticates with a Dex-issued **OIDC id token** (n
    mlflow.set_workspace("team-a")
    ```
 
-For long-running jobs (including pipeline components), refresh the id token before it expires rather than carrying a static one. Keep the refresh token and client secret in a `Secret`, never in code.
+:::warning
+The password grant sends the user's password to the token endpoint, so use a **dedicated service account** (not a person's login), keep the credentials and client secret in a Kubernetes `Secret`, and scope that account to only the workspaces it needs. id tokens expire in 24 h, so long-running jobs re-run step 2 to refresh.
+:::
 
 ## Troubleshooting
 
diff --git a/e2e/mlflow-user-identity-smoke.sh b/e2e/mlflow-user-identity-smoke.sh
index 2a946c2..e6b6cf3 100755
--- a/e2e/mlflow-user-identity-smoke.sh
+++ b/e2e/mlflow-user-identity-smoke.sh
@@ -1,42 +1,47 @@
 #!/usr/bin/env bash
 # Smoke test: log to MLflow as a real user, THROUGH the OAuth proxy.
 #
-# Mints a short-lived Dex **id token** from a long-lived **refresh token** (the
-# refresh-token grant), then logs a run to MLflow over the platform route — i.e.
-# through oauth2-proxy, never the container port. Asserts the run owner equals the
-# token's user identity.
+# Mints a Dex **id token** with the OAuth2 password grant (ROPC), then logs a run
+# to MLflow over the platform route — i.e. through oauth2-proxy, never the
+# container port. Asserts the run owner equals the token's user identity.
 #
-# Requires the MLflow oauth2-proxy to accept OIDC bearer tokens
-# (auth.oauth.extraArgs: ["--skip-jwt-bearer-tokens=true"]).
+# Prerequisites on the platform:
+#   - a Dex OAuth client whose grantTypes include "password" (ROPC)
+#   - the MLflow oauth2-proxy accepts bearer tokens
+#     (auth.oauth.extraArgs: ["--skip-jwt-bearer-tokens=true"])
 #
 # Required env:
-#   PLATFORM_ADDRESS      e.g. https://192.168.142.163
-#   CLUSTER               e.g. g1-c1-x86
-#   MLFLOW_REFRESH_TOKEN  a Dex refresh token (from an offline_access login)
-#   DEX_CLIENT_ID         platform OAuth client id (e.g. alauda-auth)
-#   DEX_CLIENT_SECRET     that client's secret
+#   PLATFORM_ADDRESS   e.g. https://192.168.142.163
+#   CLUSTER            e.g. g1-c1-x86
+#   MLFLOW_USERNAME    platform username (ideally a dedicated service account)
+#   MLFLOW_PASSWORD    that user's password
+#   DEX_CLIENT_ID      Dex client id allowed to use the password grant
+#   DEX_CLIENT_SECRET  that client's secret
 # Optional env:
-#   MLFLOW_WORKSPACE      target workspace namespace (default: mlops-demo-e2e)
+#   MLFLOW_WORKSPACE   target workspace namespace (default: mlops-demo-e2e)
 set -euo pipefail
 
 : "${PLATFORM_ADDRESS:?set PLATFORM_ADDRESS, e.g. https://192.168.142.163}"
 : "${CLUSTER:?set CLUSTER, e.g. g1-c1-x86}"
-: "${MLFLOW_REFRESH_TOKEN:?set MLFLOW_REFRESH_TOKEN to a Dex refresh token}"
-: "${DEX_CLIENT_ID:?set DEX_CLIENT_ID, e.g. alauda-auth}"
+: "${MLFLOW_USERNAME:?set MLFLOW_USERNAME}"
+: "${MLFLOW_PASSWORD:?set MLFLOW_PASSWORD}"
+: "${DEX_CLIENT_ID:?set DEX_CLIENT_ID}"
 : "${DEX_CLIENT_SECRET:?set DEX_CLIENT_SECRET}"
 WORKSPACE="${MLFLOW_WORKSPACE:-mlops-demo-e2e}"
 P="${PLATFORM_ADDRESS%/}"
 
 b64url_decode() { local d="$1"; d="${d//-/+}"; d="${d//_/\/}"; printf '%s%s' "$d" "$(printf '%*s' $(((4 - ${#d} % 4) % 4)) '' | tr ' ' '=')" | base64 -d 2>/dev/null; }
 
-echo "== mint id token via refresh-token grant =="
+echo "== mint id token via password grant (ROPC) =="
 ID_TOKEN="$(curl -fsSk "$P/dex/token" \
-  -d grant_type=refresh_token \
-  --data-urlencode "refresh_token=${MLFLOW_REFRESH_TOKEN}" \
+  -d grant_type=password \
+  --data-urlencode "username=${MLFLOW_USERNAME}" \
+  --data-urlencode "password=${MLFLOW_PASSWORD}" \
+  -d scope="openid email groups" \
   -d client_id="${DEX_CLIENT_ID}" \
   --data-urlencode "client_secret=${DEX_CLIENT_SECRET}" \
   | jq -r '.id_token')"
-[ -n "${ID_TOKEN}" ] && [ "${ID_TOKEN}" != null ] || { echo "FAIL: no id_token from refresh-token grant"; exit 1; }
+[ -n "${ID_TOKEN}" ] && [ "${ID_TOKEN}" != null ] || { echo "FAIL: no id_token (does the client allow the password grant?)"; exit 1; }
 EMAIL="$(b64url_decode "$(printf '%s' "$ID_TOKEN" | cut -d. -f2)" | jq -r '.email // .preferred_username // .name // .sub')"
 echo "caller identity: ${EMAIL}"
 
@@ -73,4 +78,4 @@ echo "  run_id=${RID} owner=${OWNER} status=${STATUS} model_name=${PARAM}"
 [ "${PARAM}" = "qwen3-0.6b" ]  || { echo "FAIL: param not logged"; exit 1; }
 [ "${OWNER}" = "${EMAIL}" ]    || { echo "FAIL: run owner '${OWNER}' != caller identity '${EMAIL}'"; exit 1; }
 
-echo "PASS: logged to MLflow as '${EMAIL}' through the OAuth proxy (refresh-token grant; no cookie, no container-port access)"
+echo "PASS: logged to MLflow as '${EMAIL}' through the OAuth proxy (password grant; no cookie, no container-port access)"

From ad96d01642385dd23ea20baccf1da7ecbbbec35f Mon Sep 17 00:00:00 2001
From: Wu Yi <typhoonzero1986@gmail.com>
Date: Tue, 16 Jun 2026 01:44:10 +0000
Subject: [PATCH 16/21] docs: make ROPC (username/password) the primary MLflow
 SDK auth method

mlflow-python-sdk.mdx now leads with the OAuth2 password grant: mint a Dex id
token from a username/password at /dex/token, then use it as
MLFLOW_TRACKING_TOKEN through the OAuth proxy. Adds an admin "Platform setup"
section (--skip-jwt-bearer-tokens + a password-grant Dex client). The browser
session-cookie flow is kept as a secondary "interactive alternative".

Verified end-to-end on g1-c1-x86 (run owner = the token's user identity).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 docs/en/kubeflow/how_to/mlflow-python-sdk.mdx | 130 ++++++++----------
 1 file changed, 60 insertions(+), 70 deletions(-)
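
Reviewer note (not part of the commit): the claim "run owner = the token's user identity" can be checked from Python by decoding the id token's payload, mirroring what the smoke test does with `b64url_decode` and `jq`. This is a hedged sketch: the helper names are hypothetical, the claim order is copied from the smoke test, the 300-second refresh margin is an arbitrary assumption, and the payload is decoded without verifying the signature, so it is suitable for diagnostics only.

```python
import base64
import json
import time


def jwt_claims(token):
    """Decode the payload segment of a JWT WITHOUT verifying its signature."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # base64url omits padding; restore it
    return json.loads(base64.urlsafe_b64decode(payload))


def caller_identity(claims):
    """Return the first usable claim, in the order the smoke test reads them."""
    for key in ("email", "preferred_username", "name", "sub"):
        value = claims.get(key)
        if value is not None and value is not False:
            return value
    return None


def needs_refresh(claims, margin_seconds=300, now=None):
    """True when the token expires within margin_seconds (assumed policy)."""
    now = time.time() if now is None else now
    return claims["exp"] - now <= margin_seconds
```

Comparing `caller_identity(jwt_claims(id_token))` with the `user_id` of a logged run reproduces the smoke test's owner assertion, and `needs_refresh` gives long-running jobs a simple trigger for minting a new token before the 24 h expiry.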

diff --git a/docs/en/kubeflow/how_to/mlflow-python-sdk.mdx b/docs/en/kubeflow/how_to/mlflow-python-sdk.mdx
index f58d0cb..a6ca486 100644
--- a/docs/en/kubeflow/how_to/mlflow-python-sdk.mdx
+++ b/docs/en/kubeflow/how_to/mlflow-python-sdk.mdx
@@ -4,58 +4,65 @@ weight: 46
 
 # Using the MLflow Python SDK with Authentication and RBAC
 
-On Alauda AI the [MLflow Tracking Server](./mlflow.mdx) runs behind single sign-on and multi-tenancy: an OAuth proxy authenticates every caller, and the server records each run under the calling user and authorizes it against Kubernetes RBAC. This guide shows how to drive the stock **MLflow Python SDK** through that OAuth proxy with your own identity — there is nothing to create on the cluster.
+On Alauda AI the [MLflow Tracking Server](./mlflow.mdx) runs behind single sign-on and multi-tenancy: an OAuth proxy authenticates every caller, and the server records each run under the calling user and authorizes it against Kubernetes RBAC. This guide shows how to drive the stock **MLflow Python SDK** through that OAuth proxy with your own identity, using the OAuth2 **password grant** to obtain a token from a username and password — no browser, and never the MLflow container port.
+
+## Platform setup (administrator, one-time) \{#platform-setup-administrator-one-time}
+
+The password grant needs two settings, which an administrator enables once:
+
+- **Accept bearer tokens at the proxy.** Add `--skip-jwt-bearer-tokens=true` to the MLflow OAuth proxy so it accepts a Dex OIDC token alongside browser sessions:
+
+  ```yaml
+  # MLflow plugin values
+  auth:
+    oauth:
+      extraArgs:
+        - --skip-jwt-bearer-tokens=true
+  ```
+
+- **Allow the password grant.** Dex must have the password connector enabled (`enablePasswordDB: true`), and the OAuth client you authenticate with must list `password` in its `grantTypes`. Register a **dedicated** client for this rather than the platform's interactive-login client.
 
 ## Prerequisites
 
 - `mlflow` **3.10 or later** (`pip install "mlflow>=3.10"`). Workspace selection (`mlflow.set_workspace`) is a 3.10+ feature.
-- Browser access to MLflow — you can sign in at **Alauda AI → Tools → MLFlow** through the platform SSO.
-- Access to an MLflow **workspace** (a namespace labelled `mlflow-enabled=true`) that your platform user is allowed to use (see [Workspace Access](./mlflow.mdx)).
+- A platform **username and password** — ideally a dedicated service account, not a person's login — that can access the target workspace (see [Workspace Access](./mlflow.mdx)).
+- The Dex **client id and secret** allowed to use the password grant (from your administrator).
 
 ## How authentication works
 
 Two layers sit in front of your runs:
 
-1. The **OAuth proxy** (`oauth2-proxy`) on the MLflow endpoint authenticates the request — through the interactive SSO login it issues a session cookie.
-2. The MLflow server's `kubernetes-auth` plugin then reads your identity (from the forwarded token), records it as the run **owner**, and authorizes it against your Kubernetes permissions in the workspace.
+1. The **OAuth proxy** (`oauth2-proxy`) authenticates the request. With `--skip-jwt-bearer-tokens`, it accepts a Dex-issued OIDC **id token** sent as `Authorization: Bearer …`.
+2. The MLflow server's `kubernetes-auth` plugin reads your identity from that token, records it as the run **owner**, and authorizes it against your Kubernetes permissions in the workspace.
 
-A programmatic client must satisfy layer 1. The clean, no-configuration way is to **present your browser SSO session** to the SDK; for automation, an administrator can additionally allow bearer tokens (see [Headless / automation](#headless-automation) below). Either way the client always goes through the OAuth proxy — never connect to the MLflow container port directly.
+The client always goes through the OAuth proxy — never connect to the MLflow container port directly.
 
-## Connect the SDK through the OAuth proxy
+## Connect the SDK
 
-### 1. Get your session cookie from the browser
+### 1. Mint an id token with the password grant
 
-Open **Alauda AI → Tools → MLFlow** in your browser and sign in. Then, in the browser developer tools (**Application/Storage → Cookies** for the platform host), copy the value of the `_oauth2_proxy` cookie — if it is split into `_oauth2_proxy_0`, `_oauth2_proxy_1`, … copy each chunk and join them with `; `. Set it as an environment variable:
+Exchange the username and password for a Dex **id token** in a single call (no browser, no cookie):
 
 ```bash
-export MLFLOW_PROXY_COOKIE='_oauth2_proxy=<value>'
+export ID_TOKEN=$(curl -sk "https://<platform>/dex/token" \
+  -d grant_type=password \
+  --data-urlencode "username=$MLFLOW_USERNAME" \
+  --data-urlencode "password=$MLFLOW_PASSWORD" \
+  -d scope="openid email groups" \
+  -d client_id="$DEX_CLIENT_ID" --data-urlencode "client_secret=$DEX_CLIENT_SECRET" \
+  | jq -r .id_token)
 ```
 
-### 2. Point the SDK at the MLflow route and attach the cookie
+### 2. Point the SDK at the MLflow route with the token
 
-The MLflow SDK does not send cookies on its own, so register a tiny request-header provider that adds yours to every call. Then use the normal MLflow API:
+The SDK reads `MLFLOW_TRACKING_TOKEN` and sends it as `Authorization: Bearer …`:
 
 ```python
 import os
 import mlflow
-from mlflow.tracking.request_header.abstract_request_header_provider import RequestHeaderProvider
-from mlflow.tracking.request_header.registry import _request_header_provider_registry
-
-
-class ProxySessionHeader(RequestHeaderProvider):
-    """Send the browser OAuth-proxy session cookie with every MLflow request."""
 
-    def in_context(self):
-        return bool(os.environ.get("MLFLOW_PROXY_COOKIE"))
-
-    def request_headers(self):
-        return {"Cookie": os.environ["MLFLOW_PROXY_COOKIE"]}
-
-
-_request_header_provider_registry.register(ProxySessionHeader)
-
-# The platform MLflow route — i.e. through the OAuth proxy, not the container port.
-mlflow.set_tracking_uri("https://<platform>/clusters/<cluster>/mlflow")
+os.environ["MLFLOW_TRACKING_TOKEN"] = os.environ["ID_TOKEN"].strip()  # → Authorization: Bearer
+mlflow.set_tracking_uri("https://<platform>/clusters/<cluster>/mlflow")  # through the OAuth proxy
 mlflow.set_workspace("team-a")                 # workspace namespace → X-MLFLOW-WORKSPACE
 mlflow.set_experiment("my-experiment")
 
@@ -65,10 +72,10 @@ with mlflow.start_run(run_name="sdk-quickstart") as run:
     print("run:", run.info.run_id)
 ```
 
-The run appears under **Alauda AI → Tools → MLFlow**, owned by your platform user.
+The run appears under **Alauda AI → Tools → MLFlow**, owned by the username you authenticated as. (Verified end-to-end on a secured install: the run owner is the token's user identity.)
 
-:::tip
-If the platform certificate is not trusted by your machine, set `MLFLOW_TRACKING_INSECURE_TLS=true`. The session cookie expires — when calls start returning a login redirect, copy a fresh `_oauth2_proxy` value from the browser.
+:::warning
+The password grant sends the password to the token endpoint, so use a **dedicated service account** and keep the credentials and client secret in a Kubernetes `Secret`, never in code. Always `.strip()` the token (a trailing newline produces `Invalid … character(s) in header value: 'Bearer …\n'`). id tokens expire (24 h by default), so re-run step 1 to refresh for long-running jobs. If the platform certificate is not trusted by your machine, set `MLFLOW_TRACKING_INSECURE_TLS=true`.
 :::
 
 ## Selecting a workspace
@@ -78,7 +85,7 @@ Runs are recorded in the workspace you select; if you select none, the server's
 - `mlflow.set_workspace("team-a")` in code,
 - the `MLFLOW_WORKSPACE=team-a` environment variable.
 
-You can only use a workspace your platform user has access to; see [Workspace Access](./mlflow.mdx).
+You can only use a workspace your account has access to; see [Workspace Access](./mlflow.mdx).
 
 ## Registering models
 
@@ -92,52 +99,35 @@ with mlflow.start_run():
 
 Promote the registered version to **Staging** or **Production** from the MLflow UI.
 
-## Headless / automation \{#headless-automation}
-
-The cookie flow needs an interactive browser login, which does not suit CI jobs or pipeline components. For those, an administrator can configure the MLflow OAuth proxy to accept OIDC **bearer tokens** in addition to browser sessions, by adding to the MLflow plugin values:
-
-```yaml
-auth:
-  oauth:
-    extraArgs:
-      - --skip-jwt-bearer-tokens=true
-```
-
-With that enabled, a client authenticates with a Dex-issued **OIDC id token** (not a Kubernetes ServiceAccount token — the proxy validates it against Dex). Mint id tokens non-interactively with the **OAuth2 password grant (ROPC)** — a single token-endpoint call, no browser and no cookie:
+## Interactive alternative: browser session
 
-1. **Allow the password grant.** Dex must have the password connector enabled (`enablePasswordDB: true`), and the OAuth client you use must list `password` in its `grantTypes`. Register a dedicated client for this (ask your administrator) rather than the platform's interactive-login client.
+If you cannot use the password grant (for example, you only have an interactive SSO login), present your browser session instead; this works without the `--skip-jwt-bearer-tokens` setting. Sign in at **Alauda AI → Tools → MLFlow**, then copy the `_oauth2_proxy` cookie from the browser developer tools (**Application/Storage → Cookies**). If the cookie is split into `_oauth2_proxy_N` chunks, copy each one and join them with `; `. Attach the cookie to every request with a header provider:
 
-2. **Exchange username/password for an id token:**
-
-   ```bash
-   ID_TOKEN=$(curl -sk "https://<platform>/dex/token" \
-     -d grant_type=password \
-     --data-urlencode "username=$MLFLOW_USERNAME" \
-     --data-urlencode "password=$MLFLOW_PASSWORD" \
-     -d scope="openid email groups" \
-     -d client_id="$DEX_CLIENT_ID" --data-urlencode "client_secret=$DEX_CLIENT_SECRET" \
-     | jq -r .id_token)
-   ```
+```python
+import os, mlflow
+from mlflow.tracking.request_header.abstract_request_header_provider import RequestHeaderProvider
+from mlflow.tracking.request_header.registry import _request_header_provider_registry
 
-3. **Use it as the MLflow bearer token** (`.strip()` it — a trailing newline produces `Invalid … character(s) in header value: 'Bearer …\n'`):
+class ProxySessionHeader(RequestHeaderProvider):
+    def in_context(self):
+        return bool(os.environ.get("MLFLOW_PROXY_COOKIE"))      # export MLFLOW_PROXY_COOKIE='_oauth2_proxy=<value>'
+    def request_headers(self):
+        return {"Cookie": os.environ["MLFLOW_PROXY_COOKIE"]}
 
-   ```python
-   import os, mlflow
-   os.environ["MLFLOW_TRACKING_TOKEN"] = ID_TOKEN.strip()   # → Authorization: Bearer
-   mlflow.set_tracking_uri("https://<platform>/clusters/<cluster>/mlflow")
-   mlflow.set_workspace("team-a")
-   ```
+_request_header_provider_registry.register(ProxySessionHeader)
+mlflow.set_tracking_uri("https://<platform>/clusters/<cluster>/mlflow")
+mlflow.set_workspace("team-a")
+```
 
-:::warning
-The password grant sends the user's password to the token endpoint, so use a **dedicated service account** (not a person's login), keep the credentials and client secret in a Kubernetes `Secret`, and scope that account to only the workspaces it needs. id tokens expire in 24 h, so long-running jobs re-run step 2 to refresh.
-:::
+The session cookie expires — copy a fresh one when calls start returning a login redirect.
 
 ## Troubleshooting
 
 | Symptom | Check |
 |---------|-------|
-| Call returns HTML or a redirect (`302` to the login page) | The OAuth proxy did not accept the request. For the cookie flow, your `_oauth2_proxy` value is missing or expired — copy a fresh one (and all `_oauth2_proxy_N` chunks). For the bearer flow, confirm the proxy has `--skip-jwt-bearer-tokens` and the token is a valid platform OIDC token. |
+| `/dex/token` returns `unsupported_grant_type` / "password grant … not allowed" | The Dex client does not permit the password grant. Use a client whose `grantTypes` include `password` (see [Platform setup](#platform-setup-administrator-one-time)). |
+| Call returns HTML or a redirect (`302` to the login page) | The OAuth proxy rejected the bearer token. Confirm `--skip-jwt-bearer-tokens` is enabled and the token is a valid Dex id token whose `aud` claim matches the proxy's client id. For the cookie alternative, your `_oauth2_proxy` value is missing or expired. |
 | `Invalid … character(s) in header value: 'Bearer …\n'` | The token has trailing whitespace. Set `MLFLOW_TRACKING_TOKEN` to the `.strip()`-ed value. |
-| `Failed to query /api/3.0/mlflow/server-info` | The SDK could not reach the server through the proxy — verify the tracking URI is the platform MLflow route and your cookie/token is valid. |
-| `403 PERMISSION_DENIED` | Your platform user lacks access to the workspace namespace. Request access to the workspace (see [Workspace Access](./mlflow.mdx)); no ServiceAccount is involved. |
+| `Failed to query /api/3.0/mlflow/server-info` | The SDK could not reach the server through the proxy — verify the tracking URI is the platform MLflow route and the token is valid. |
+| `403 PERMISSION_DENIED` | Your account lacks access to the workspace namespace. Request access to the workspace (see [Workspace Access](./mlflow.mdx)); no ServiceAccount is involved. |
 | Run shows the wrong owner or workspace | The owner is your authenticated identity; the workspace is `set_workspace()` / `MLFLOW_WORKSPACE` (else the server default). Check both. |

From 23a816dcc2d270e4bd9357e84b21b0274ed247c8 Mon Sep 17 00:00:00 2001
From: Wu Yi <typhoonzero1986@gmail.com>
Date: Tue, 16 Jun 2026 02:06:21 +0000
Subject: [PATCH 17/21] docs: in-cluster Service URL + MLflow client for
 pipelines

- SDK guide: set_tracking_uri now uses the in-cluster Service
  http://mlflow-tracking-server.kubeflow:5000 (still via the OAuth proxy) for
  in-cluster clients; note the platform route for outside-the-cluster use.
- Pipelines guide: rewritten to use the MLflow Python client against the
  in-cluster Service with MLFLOW_TRACKING_TOKEN injected from a Secret
  (kfp-kubernetes use_secret_as_env), and reference the SDK guide for auth/RBAC
  and minting the token (password grant). Drops the raw-REST/container-port
  helper. Trainer v2 example points MLFLOW_TRACKING_URI at the in-cluster
  Service. Example compiles with kfp 2.11 + kfp-kubernetes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 docs/en/kubeflow/how_to/mlflow-python-sdk.mdx |   6 +-
 .../pipelines-mlflow-integration.mdx          | 212 ++++++------------
 2 files changed, 73 insertions(+), 145 deletions(-)

diff --git a/docs/en/kubeflow/how_to/mlflow-python-sdk.mdx b/docs/en/kubeflow/how_to/mlflow-python-sdk.mdx
index a6ca486..758de4f 100644
--- a/docs/en/kubeflow/how_to/mlflow-python-sdk.mdx
+++ b/docs/en/kubeflow/how_to/mlflow-python-sdk.mdx
@@ -62,7 +62,7 @@ import os
 import mlflow
 
 os.environ["MLFLOW_TRACKING_TOKEN"] = os.environ["ID_TOKEN"].strip()  # → Authorization: Bearer
-mlflow.set_tracking_uri("https://<platform>/clusters/<cluster>/mlflow")  # through the OAuth proxy
+mlflow.set_tracking_uri("http://mlflow-tracking-server.kubeflow:5000")  # in-cluster Service (fronted by the OAuth proxy)
 mlflow.set_workspace("team-a")                 # workspace namespace → X-MLFLOW-WORKSPACE
 mlflow.set_experiment("my-experiment")
 
@@ -74,8 +74,10 @@ with mlflow.start_run(run_name="sdk-quickstart") as run:
 
 The run appears under **Alauda AI → Tools → MLFlow**, owned by the username you authenticated as. (Verified end-to-end on a secured install: the run owner is the token's user identity.)
 
+Use the in-cluster Service URL `http://mlflow-tracking-server.kubeflow:5000` when the client runs **inside** the cluster (pipeline components, Workbench notebooks). From **outside** the cluster, point at the platform route `https://<platform>/clusters/<cluster>/mlflow` instead — both reach the same OAuth proxy.
+
 :::warning
-The password grant sends the password to the token endpoint, so use a **dedicated service account** and keep the credentials and client secret in a Kubernetes `Secret`, never in code. Always `.strip()` the token (a trailing newline produces `Invalid … character(s) in header value: 'Bearer …\n'`). id tokens expire (24 h by default), so re-run step 1 to refresh for long-running jobs. If the platform certificate is not trusted by your machine, set `MLFLOW_TRACKING_INSECURE_TLS=true`.
+The password grant sends the password to the token endpoint, so use a **dedicated service account** and keep the credentials and client secret in a Kubernetes `Secret`, never in code. Always `.strip()` the token (a trailing newline produces `Invalid … character(s) in header value: 'Bearer …\n'`). ID tokens expire (24 h by default), so long-running jobs must re-run step 1 to get a fresh token. If you use the external HTTPS route and the platform certificate is not trusted by your machine, set `MLFLOW_TRACKING_INSECURE_TLS=true`.
 :::
 
 ## Selecting a workspace
diff --git a/docs/en/training_guides/pipelines-mlflow-integration.mdx b/docs/en/training_guides/pipelines-mlflow-integration.mdx
index 6b063a3..0735de3 100644
--- a/docs/en/training_guides/pipelines-mlflow-integration.mdx
+++ b/docs/en/training_guides/pipelines-mlflow-integration.mdx
@@ -4,161 +4,80 @@ weight: 55
 
 # Kubeflow Pipeline + MLflow Integration
 
-This guide shows how to build Kubeflow Pipelines (KFP) components that log parameters, metrics, and model artifacts to [MLflow on Kubeflow](../kubeflow/how_to/mlflow.mdx) — giving you a single source of truth for experiment tracking across your pipeline runs.
-
-Pipeline components authenticate to MLflow with a **user identity token**: no ServiceAccount, no per-workspace RBAC, and no extra in-cluster Service are required. The MLflow server records each run under the calling user.
+This guide shows how Kubeflow Pipelines (KFP) components log parameters, metrics, and models to [MLflow on Kubeflow](../kubeflow/how_to/mlflow.mdx) with the **MLflow Python client**. Authentication and workspace/RBAC follow [Using the MLflow Python SDK with Authentication and RBAC](../kubeflow/how_to/mlflow-python-sdk.mdx) — each component authenticates with a user identity token and the server records the run under that user.
 
 ## Scope
 
 - Alauda AI 2.5 and later.
 - Kubeflow Pipelines and the MLflow cluster plugin are installed.
 - The MLflow workspace is a namespace labelled `mlflow-enabled=true`.
+- The MLflow OAuth proxy accepts bearer tokens and a Dex client permits the password grant — see [Platform setup](../kubeflow/how_to/mlflow-python-sdk.mdx#platform-setup-administrator-one-time) in the SDK guide.
 
 ## Prerequisites
 
-- `kfp` Python SDK installed (`pip install kfp`).
-- Access to a KFP endpoint (see [Use Kubeflow Pipelines](../kubeflow/how_to/pipelines.mdx) for setup).
-- A **platform user identity token** — a JWT with an `email` claim, issued for your platform user. You already have the equivalent credential whenever you use the platform; create a token for non-interactive use from the platform console. The same token is used both to reach the cluster API and as your MLflow identity.
-- An MLflow workspace name (a namespace with `mlflow-enabled=true`) that your platform user can access.
-
-## How components authenticate to MLflow \{#how-components-authenticate-to-mlflow}
+- `kfp` and `kfp-kubernetes` Python SDKs (`pip install kfp kfp-kubernetes`).
+- Access to a KFP endpoint (see [Use Kubeflow Pipelines](../kubeflow/how_to/pipelines.mdx)).
+- A **Dex id token** for a dedicated service account, minted with the OAuth2 password grant (see the [SDK guide](../kubeflow/how_to/mlflow-python-sdk.mdx)). Store it in a Kubernetes `Secret` and inject it into the component.
+- An MLflow workspace (a namespace with `mlflow-enabled=true`) the account can access.
 
-On Alauda AI the MLflow server runs the [`kubernetes-auth` plugin](https://github.com/AlaudaDevops/mlflow-plugin) in **`user_identity_token`** mode. It reads the caller identity from the bearer token's claims (`email` → `preferred_username` → `name` → `sub`, groups from `groups` / `roles`), authorizes that identity against the workspace, and records it as the MLflow **run owner**. So the only credential a component needs is a user identity token; there is nothing to create on the cluster.
+## How components reach MLflow
 
-A token is presented to MLflow through the platform Kubernetes API — the same `https://<platform>/kubernetes/<cluster>` entry point used for any Kubernetes call. The request is proxied to the MLflow server, and the caller identity is forwarded in the `X-Forwarded-Access-Token` header:
+A pipeline component runs **inside** the cluster, so it talks to MLflow through the in-cluster Service `http://mlflow-tracking-server.kubeflow:5000` (which is fronted by the OAuth proxy — components never use the MLflow container port directly). It authenticates exactly like any other MLflow client:
 
-- `Authorization: Bearer <token>` authenticates the call to the platform Kubernetes API.
-- `X-Forwarded-Access-Token: <token>` is the identity the MLflow `user_identity_token` plugin reads.
-- `X-MLFLOW-WORKSPACE: <namespace>` selects the workspace (otherwise the server's default workspace is used).
+- `MLFLOW_TRACKING_TOKEN` — a Dex id token; the MLflow client sends it as `Authorization: Bearer …`.
+- `mlflow.set_workspace(...)` — selects the workspace (`X-MLFLOW-WORKSPACE`).
 
-This avoids the browser-only OAuth proxy in front of MLflow without exposing any new endpoint. The helper below wraps it; it needs only the Python standard library, so components do not even install the MLflow SDK:
-
-```python
-def mlflow_api(method, path, *, token, platform, cluster, workspace, body=None, ns="kubeflow"):
-    """Call the MLflow REST API as the user identity in `token`, via the platform K8s API."""
-    import json, ssl, urllib.request
-    ctx = ssl.create_default_context()
-    ctx.check_hostname = False          # platform Dex/ALB commonly use a private cert
-    ctx.verify_mode = ssl.CERT_NONE
-    kapi = f"{platform.rstrip('/')}/kubernetes/{cluster}"
-
-    def call(url, data=None, m="GET"):
-        req = urllib.request.Request(url, data=data, method=m, headers={
-            "Authorization": f"Bearer {token}",
-            "X-Forwarded-Access-Token": token,
-            "X-MLFLOW-WORKSPACE": workspace,
-            "Content-Type": "application/json",
-        })
-        with urllib.request.urlopen(req, context=ctx) as r:
-            return json.load(r)
-
-    # Find the MLflow server pod (the shipped Service only exposes the OAuth proxy port).
-    pods = call(f"{kapi}/api/v1/namespaces/{ns}/pods?labelSelector=app%3Dmlflow-tracking-server")
-    pod = next(p["metadata"]["name"] for p in pods["items"] if p["status"]["phase"] == "Running")
-    base = f"{kapi}/api/v1/namespaces/{ns}/pods/{pod}:5000/proxy/api/2.0/mlflow"
-    return call(f"{base}/{path}", json.dumps(body).encode() if body is not None else None, method)
-```
-
-:::tip
-Verify the whole path against your cluster with the bundled smoke test, which logs a run and asserts the run owner matches the token identity:
-
-```bash
-PLATFORM_ADDRESS=https://<platform> CLUSTER=<cluster> \
-  MLFLOW_USER_TOKEN=<your-token> MLFLOW_WORKSPACE=<workspace> \
-  e2e/mlflow-user-identity-smoke.sh
-```
-:::
+The server reads the identity from the token and records the run under that user. See the [SDK guide](../kubeflow/how_to/mlflow-python-sdk.mdx) for how the token is obtained and how authorization works.
 
 ## Complete example: training pipeline with MLflow
 
-This pipeline logs parameters and metrics to MLflow using the user identity token. Note the KFP v2 rules it follows: helpers and imports live **inside** the component (KFP packages each component from its own source), and the token is passed in as a parameter — source it from a Kubernetes `Secret` rather than hardcoding it.
+The component uses the MLflow client and reads `MLFLOW_TRACKING_TOKEN` from a `Secret` injected with [`kfp-kubernetes`](https://kubeflow-pipelines.readthedocs.io/en/stable/source/kubernetes.html). KFP v2 packages each component from its own source, so `import mlflow` lives **inside** the function.
 
 ```python
 from kfp import dsl, compiler
+from kfp import kubernetes
 
 
-@dsl.component(base_image="python:3.11-slim")
+@dsl.component(base_image="python:3.11-slim", packages_to_install=["mlflow>=3.10"])
 def train_model(
-    platform: str,
-    cluster: str,
     workspace: str,
-    token: str,
     model_name: str,
     learning_rate: float,
     epochs: int,
     run_id: str,
 ) -> dict:
     """Simulated training component that logs to MLflow as the calling user."""
-    import json, ssl, urllib.request, urllib.error
-
-    def mlflow_api(method, path, body=None, ns="kubeflow"):
-        ctx = ssl.create_default_context()
-        ctx.check_hostname = False
-        ctx.verify_mode = ssl.CERT_NONE
-        kapi = f"{platform.rstrip('/')}/kubernetes/{cluster}"
-        hdr = {
-            "Authorization": f"Bearer {token}",
-            "X-Forwarded-Access-Token": token,
-            "X-MLFLOW-WORKSPACE": workspace,
-            "Content-Type": "application/json",
-        }
-
-        def call(url, data=None, m="GET"):
-            req = urllib.request.Request(url, data=data, method=m, headers=hdr)
-            with urllib.request.urlopen(req, context=ctx) as r:
-                return json.load(r)
-
-        pods = call(f"{kapi}/api/v1/namespaces/{ns}/pods?labelSelector=app%3Dmlflow-tracking-server")
-        pod = next(p["metadata"]["name"] for p in pods["items"] if p["status"]["phase"] == "Running")
-        base = f"{kapi}/api/v1/namespaces/{ns}/pods/{pod}:5000/proxy/api/2.0/mlflow"
-        return call(f"{base}/{path}", json.dumps(body).encode() if body is not None else None, method)
-
-    # Get-or-create the experiment.
-    try:
-        eid = mlflow_api("POST", "experiments/create",
-                         {"name": "kfp-training-experiment"})["experiment_id"]
-    except urllib.error.HTTPError as e:
-        if e.code != 400:
-            raise
-        eid = mlflow_api("GET", "experiments/get-by-name?experiment_name=kfp-training-experiment"
-                         )["experiment"]["experiment_id"]
-
-    run = mlflow_api("POST", "runs/create",
-                     {"experiment_id": eid, "run_name": f"run-{run_id}",
-                      "start_time": 0})["run"]["info"]["run_id"]
-    for k, v in {"model_name": model_name, "learning_rate": learning_rate, "epochs": epochs}.items():
-        mlflow_api("POST", "runs/log-parameter", {"run_id": run, "key": k, "value": str(v)})
+    import mlflow   # MLFLOW_TRACKING_TOKEN is injected from a Secret (see the pipeline below)
+
+    mlflow.set_tracking_uri("http://mlflow-tracking-server.kubeflow:5000")  # in-cluster Service, via the OAuth proxy
+    mlflow.set_workspace(workspace)
+    mlflow.set_experiment("kfp-training-experiment")
 
     metrics = {}
-    for epoch in range(1, epochs + 1):
-        loss = 2.0 * (0.95 ** epoch)
-        accuracy = 1.0 - loss
-        mlflow_api("POST", "runs/log-metric",
-                   {"run_id": run, "key": "loss", "value": loss, "timestamp": 0, "step": epoch})
-        mlflow_api("POST", "runs/log-metric",
-                   {"run_id": run, "key": "accuracy", "value": accuracy, "timestamp": 0, "step": epoch})
-        metrics = {"final_loss": loss, "final_accuracy": accuracy}
-
-    mlflow_api("POST", "runs/update", {"run_id": run, "status": "FINISHED", "end_time": 1})
-    print(f"logged MLflow run {run}")
+    with mlflow.start_run(run_name=f"run-{run_id}"):
+        mlflow.log_param("model_name", model_name)
+        mlflow.log_param("learning_rate", learning_rate)
+        mlflow.log_param("epochs", epochs)
+        for epoch in range(1, epochs + 1):
+            loss = 2.0 * (0.95 ** epoch)
+            accuracy = 1.0 - loss
+            mlflow.log_metric("loss", loss, step=epoch)
+            mlflow.log_metric("accuracy", accuracy, step=epoch)
+            metrics = {"final_loss": loss, "final_accuracy": accuracy}
+
+    print("logged run:", mlflow.last_active_run().info.run_id)
     return metrics
 
 
 @dsl.pipeline(name="mlflow-training-pipeline", description="Train with MLflow tracking")
 def training_pipeline(
-    platform: str,
-    cluster: str,
-    token: str,
-    workspace: str = "mlops-demo-e2e",
+    workspace: str = "team-a",
     model_name: str = "qwen3-0.6b",
     learning_rate: float = 2e-4,
     epochs: int = 10,
 ):
-    train_model(
-        platform=platform,
-        cluster=cluster,
+    task = train_model(
         workspace=workspace,
-        token=token,
         model_name=model_name,
         learning_rate=learning_rate,
         epochs=epochs,
@@ -166,18 +85,38 @@ def training_pipeline(
         # pass it in as an argument (a component cannot reference dsl.* itself).
         run_id=dsl.PIPELINE_JOB_ID_PLACEHOLDER,
     )
+    # Inject the Dex id token from a Secret as MLFLOW_TRACKING_TOKEN.
+    kubernetes.use_secret_as_env(
+        task, secret_name="mlflow-token", secret_key_to_env={"token": "MLFLOW_TRACKING_TOKEN"}
+    )
 
 
 compiler.Compiler().compile(training_pipeline, "pipeline.yaml")
 ```
 
+Create the `mlflow-token` Secret with an id token from the password grant (see the [SDK guide](../kubeflow/how_to/mlflow-python-sdk.mdx)):
+
+```bash
+ID_TOKEN=$(curl -sk "https://<platform>/dex/token" \
+  -d grant_type=password \
+  --data-urlencode "username=$MLFLOW_USERNAME" --data-urlencode "password=$MLFLOW_PASSWORD" \
+  -d scope="openid email groups" \
+  -d client_id="$DEX_CLIENT_ID" --data-urlencode "client_secret=$DEX_CLIENT_SECRET" | jq -r .id_token)
+
+kubectl -n <pipeline-namespace> create secret generic mlflow-token --from-literal=token="$ID_TOKEN"
+```
+
+:::warning
+ID tokens expire (24 h by default), so refresh the `mlflow-token` Secret before submitting long pipelines. Alternatively, mint the token **inside** the component from service-account credentials kept in a Secret (the password-grant call shown in the SDK guide), so that each run gets a fresh token.
+:::
+
 ## Upload and run
 
 ### Via the KFP UI
 
 1. Go to **Kubeflow Dashboard → Pipelines → Upload Pipeline** and select `pipeline.yaml`.
-2. Click **Create Run** and fill in the parameters (platform address, cluster, token, workspace, …).
-3. After the run starts, check the MLflow UI under **Alauda AI → Tools → MLFlow** for the logged metrics — the run owner is your platform user.
+2. Click **Create Run** and fill in the parameters (workspace, model name, epochs).
+3. After the run starts, check the MLflow UI under **Alauda AI → Tools → MLFlow** — the run owner is the token's user.
 
 ### Via the KFP SDK
 
@@ -185,25 +124,16 @@ compiler.Compiler().compile(training_pipeline, "pipeline.yaml")
 from kfp.client import Client
 
 client = Client(host="<MY-KFP-ENDPOINT>")
-
 run = client.create_run_from_pipeline_package(
     "pipeline.yaml",
-    arguments=dict(
-        platform="https://<platform>",
-        cluster="<cluster>",
-        token="<your-user-identity-token>",   # source from a Kubernetes Secret in practice
-        workspace="mlops-demo-e2e",
-        model_name="qwen3-0.6b",
-        epochs=10,
-    ),
+    arguments=dict(workspace="team-a", model_name="qwen3-0.6b", epochs=10),
 )
-
 print(f"Run ID: {run.run_id}")
 ```
 
 ## Using MLflow in Trainer v2 pipelines
 
-If you fine-tune with [Kubeflow Trainer v2](./fine-tune-with-trainer-v2.mdx) instead of KFP SDK pipelines, the framework's MLflow integration (for example `report_to: mlflow` in LLaMA-Factory) authenticates the same way. Trainer v2 uses `apiVersion: trainer.kubeflow.org/v1alpha1`, `kind: TrainJob`, and a `spec.runtimeRef` + `spec.trainer` shape — not a raw pod template. Inject your user identity token as `MLFLOW_TRACKING_TOKEN` and point the framework at the MLflow endpoint your platform exposes:
+If you fine-tune with [Kubeflow Trainer v2](./fine-tune-with-trainer-v2.mdx), the framework's MLflow integration (for example `report_to: mlflow` in LLaMA-Factory) authenticates the same way. Trainer v2 uses `apiVersion: trainer.kubeflow.org/v1alpha1`, `kind: TrainJob`, and a `spec.runtimeRef` + `spec.trainer` shape. Point it at the in-cluster Service and inject the id token from a `Secret`:
 
 ```yaml
 apiVersion: trainer.kubeflow.org/v1alpha1
@@ -216,36 +146,32 @@ spec:
   trainer:
     image: alaudadockerhub/fine_tune_with_llamafactory:v0.1.1
     env:
+      - name: MLFLOW_TRACKING_URI
+        value: "http://mlflow-tracking-server.kubeflow:5000"
       - name: MLFLOW_EXPERIMENT_NAME
         value: "trainer-v2-finetune"
       - name: MLFLOW_TRACKING_TOKEN
         valueFrom:
           secretKeyRef:
-            name: mlflow-user-token   # a Secret holding your platform user token
+            name: mlflow-token       # a Secret holding a Dex id token
             key: token
 ```
 
-See [Fine-tuning LLMs using Workbench](./fine-tuning-using-notebooks.mdx) for a full Trainer v2 + MLflow example with LLaMA-Factory.
+See [Fine-tuning LLMs using Workbench](./fine-tuning-using-notebooks.mdx) for a full Trainer v2 + MLflow example.
 
 ## Best practices
 
 ### Use the pipeline job ID in MLflow
 
-KFP v2 provides `dsl.PIPELINE_JOB_ID_PLACEHOLDER` (the v1 `dsl.RUN_ID_PLACEHOLDER` was removed). It is a pipeline-level placeholder, so pass it into the component as an argument — a component cannot reference `dsl.*` from inside its own body:
-
-```python
-train_model(..., run_id=dsl.PIPELINE_JOB_ID_PLACEHOLDER)
-```
-
-Then use the received string in the run name to keep MLflow runs distinct per pipeline execution.
+KFP v2 provides `dsl.PIPELINE_JOB_ID_PLACEHOLDER` (the v1 `dsl.RUN_ID_PLACEHOLDER` was removed). It is a pipeline-level placeholder, so pass it into the component as an argument — a component cannot reference `dsl.*` from inside its own body. Use the received string in the run name to keep runs distinct per pipeline execution.
 
-### Keep the token out of the pipeline definition
+### Keep credentials in a Secret and refresh tokens
 
-Pass the user identity token from a Kubernetes `Secret` (mounted as a parameter or env var), never hardcoded in `pipeline.yaml` — compiled pipelines are stored and shared. Rotate the token on the platform when it expires.
+Never hardcode the token or service-account credentials in `pipeline.yaml` — compiled pipelines are stored and shared. Inject them from a `Secret`, and refresh the id token (or mint it inside the component) before it expires.
 
 ### Log metrics inside a run
 
-Each metric must belong to a run created with `runs/create`. If a component has multiple logical stages, open a run per stage rather than logging outside a run context.
+Log each metric inside an `mlflow.start_run()` block. If a component has multiple logical stages, open a run per stage rather than logging outside a run context.
 
 ### Artifact storage for production
 
@@ -255,8 +181,8 @@ Logging large model artifacts requires durable object storage. Configure S3-comp
 
 | Symptom | Check |
 |---------|-------|
-| Component fails with an HTML/redirect (`302`) response | You reached MLflow through its browser OAuth proxy (`mlflow-tracking-server:5000`). Use the platform Kubernetes API path shown above (`…/kubernetes/<cluster>/…/pods/<pod>:5000/proxy/…`) instead. |
-| `401 UNAUTHENTICATED` / "Missing Authorization header or X-Forwarded-Access-Token header" | The platform API strips the inbound `Authorization` header before proxying to the pod. Send the identity in `X-Forwarded-Access-Token` as well (the helper does both). |
-| `403 PERMISSION_DENIED` | Your platform user lacks access to the workspace namespace. Grant the user access to the MLflow workspace (see [Workspace Access](../kubeflow/how_to/mlflow.mdx)); no ServiceAccount is involved. |
-| Run shows up under the wrong owner / workspace | The run owner is the `email` claim of the token; the workspace is `X-MLFLOW-WORKSPACE` (or the server default). Check both values. |
+| Component fails with an HTML/redirect (`302`) response | The OAuth proxy rejected the token. Confirm the proxy has `--skip-jwt-bearer-tokens` and `MLFLOW_TRACKING_TOKEN` is a valid Dex id token (see the [SDK guide](../kubeflow/how_to/mlflow-python-sdk.mdx)). |
+| `401 UNAUTHENTICATED` | `MLFLOW_TRACKING_TOKEN` is unset, empty, or expired — refresh the `mlflow-token` Secret. |
+| `403 PERMISSION_DENIED` | The token's user lacks access to the workspace namespace. Grant access to the MLflow workspace (see [Workspace Access](../kubeflow/how_to/mlflow.mdx)); no ServiceAccount is involved. |
+| Run shows up under the wrong owner / workspace | The owner is the token's identity; the workspace is `set_workspace()` (else the server default). Check both. |
 | MLflow metrics not appearing in KFP UI | KFP and MLflow are separate systems. Metrics logged to MLflow appear in the MLflow UI (**Alauda AI → Tools → MLFlow**), not in the KFP run output. |

From cdf097c884bcbb1ae0494fb409132060764fb5e6 Mon Sep 17 00:00:00 2001
From: Wu Yi <typhoonzero1986@gmail.com>
Date: Tue, 16 Jun 2026 02:21:17 +0000
Subject: [PATCH 18/21] docs: cross-reference the MLflow SDK auth guide from
 training guides

The MLflow usage docs under training_guides now point to
how_to/mlflow-python-sdk.mdx for authentication (MLFLOW_TRACKING_TOKEN) and
workspace/RBAC on secured installs, where the bare MLFLOW_TRACKING_URI /
report_to: mlflow setup is not sufficient:

- fine-tuning-using-notebooks.mdx (Experiment tracking sections)
- fine-tune-with-trainer-v2.ipynb (Step 5: View Training Metrics in MLflow)

Also corrects the menu path to Alauda AI -> Tools -> MLFlow.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .../training_guides/fine-tune-with-trainer-v2.ipynb  | 12 ++----------
 .../training_guides/fine-tuning-using-notebooks.mdx  |  6 ++++--
 2 files changed, 6 insertions(+), 12 deletions(-)
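
Reviewer note (not part of the commit): the cross-referenced setup amounts to passing one more environment variable to the trainer. The sketch below is illustrative only. `secured_mlflow_env` is a hypothetical helper name, and it assumes the id token has already been obtained as described in the SDK guide. The variable names are the ones the docs in this series mention.

```python
from typing import Optional


def secured_mlflow_env(tracking_uri: str, experiment: str, id_token: str,
                       workspace: Optional[str] = None) -> dict:
    """Build the env vars a trainer needs on a secured MLflow install (hypothetical helper)."""
    token = id_token.strip()  # a trailing newline would corrupt the Authorization header
    if not token:
        raise ValueError("MLFLOW_TRACKING_TOKEN would be empty")
    env = {
        "MLFLOW_TRACKING_URI": tracking_uri,
        "MLFLOW_EXPERIMENT_NAME": experiment,
        "MLFLOW_TRACKING_TOKEN": token,
    }
    if workspace:
        env["MLFLOW_WORKSPACE"] = workspace  # same effect as mlflow.set_workspace()
    return env
```

The resulting dict can be merged into the `env` of the training job spec or into `os.environ` before the framework starts logging.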

diff --git a/docs/en/training_guides/fine-tune-with-trainer-v2.ipynb b/docs/en/training_guides/fine-tune-with-trainer-v2.ipynb
index e904165..10fb95d 100644
--- a/docs/en/training_guides/fine-tune-with-trainer-v2.ipynb
+++ b/docs/en/training_guides/fine-tune-with-trainer-v2.ipynb
@@ -947,15 +947,7 @@
    "cell_type": "markdown",
    "id": "27d2b476",
    "metadata": {},
-   "source": [
-    "## Step 5: View Training Metrics in MLflow\n",
-    "\n",
-    "If `MLFLOW_TRACKING_URI` is set and the MLflow server is reachable from the training pod, LlamaFactory will log metrics (loss, learning rate, etc.) to MLflow automatically via `report_to: mlflow` in the training config.\n",
-    "\n",
-    "To open the MLflow UI, go to **Alauda AI** - **Tools** - **MLFlow** (need MLFlow Cluster plugin installed). Look for the experiment named by `MLFLOW_EXPERIMENT_NAME`.\n",
-    "\n",
-    "Each `TrainJob` run will appear as a separate MLflow **run** under the same experiment, making it easy to compare training curves across different models and hyperparameters."
-   ]
+   "source": "## Step 5: View Training Metrics in MLflow\n\nIf `MLFLOW_TRACKING_URI` is set and the MLflow server is reachable from the training pod, LlamaFactory will log metrics (loss, learning rate, etc.) to MLflow automatically via `report_to: mlflow` in the training config.\n\nOn a secured (SSO + multi-tenant) MLflow install the trainer must also authenticate: set `MLFLOW_TRACKING_TOKEN` and select a workspace. See [Using the MLflow Python SDK with Authentication and RBAC](../kubeflow/how_to/mlflow-python-sdk.mdx) for how to obtain the token and how authorization/RBAC work.\n\nTo open the MLflow UI, go to **Alauda AI** - **Tools** - **MLFlow** (requires the MLFlow Cluster plugin to be installed). Look for the experiment named by `MLFLOW_EXPERIMENT_NAME`.\n\nEach `TrainJob` run will appear as a separate MLflow **run** under the same experiment, making it easy to compare training curves across different models and hyperparameters."
   },
   {
    "cell_type": "markdown",
@@ -1060,4 +1052,4 @@
  },
  "nbformat": 4,
  "nbformat_minor": 5
-}
+}
\ No newline at end of file
diff --git a/docs/en/training_guides/fine-tuning-using-notebooks.mdx b/docs/en/training_guides/fine-tuning-using-notebooks.mdx
index a2bba02..93d7653 100644
--- a/docs/en/training_guides/fine-tuning-using-notebooks.mdx
+++ b/docs/en/training_guides/fine-tuning-using-notebooks.mdx
@@ -325,7 +325,9 @@ After success the merged model is pushed to a date-stamped branch (`sft-YYYYMMDD
 
 ## 8. Experiment tracking
 
-Setting `report_to: mlflow` in the LLaMA-Factory config plus the `MLFLOW_TRACKING_URI` / `MLFLOW_EXPERIMENT_NAME` env vars routes metrics to MLflow. Find runs in **Alauda AI → Advanced → MLFlow**, compare loss curves, and pin the winning run.
+Setting `report_to: mlflow` in the LLaMA-Factory config plus the `MLFLOW_TRACKING_URI` / `MLFLOW_EXPERIMENT_NAME` env vars routes metrics to MLflow. Find runs in **Alauda AI → Tools → MLFlow**, compare loss curves, and pin the winning run.
+
+On a secured (SSO + multi-tenant) MLflow install the job must also authenticate — supply an `MLFLOW_TRACKING_TOKEN` and select a workspace. See [Using the MLflow Python SDK with Authentication and RBAC](../kubeflow/how_to/mlflow-python-sdk.mdx) for how to obtain the token and configure the client.
 
 ## 9. Publish the fine-tuned model
 
@@ -412,4 +414,4 @@ spec:
 
 ### Experiment tracking on other devices
 
-LLaMA-Factory and Transformers integrate with MLflow / wandb directly. Set the destination in the framework config (e.g. `report_to: mlflow` for LLaMA-Factory) and supply `MLFLOW_TRACKING_URI` and `MLFLOW_EXPERIMENT_NAME` env vars. View results under **Alauda AI → Advanced → MLFlow**.
+LLaMA-Factory and Transformers integrate with MLflow / wandb directly. Set the destination in the framework config (e.g. `report_to: mlflow` for LLaMA-Factory) and supply `MLFLOW_TRACKING_URI` and `MLFLOW_EXPERIMENT_NAME` env vars (plus `MLFLOW_TRACKING_TOKEN` on a secured install — see [Using the MLflow Python SDK with Authentication and RBAC](../kubeflow/how_to/mlflow-python-sdk.mdx)). View results under **Alauda AI → Tools → MLFlow**.

From 73697ef88743c33a683d2cfaba13b2b3870abd0c Mon Sep 17 00:00:00 2001
From: Wu Yi <typhoonzero1986@gmail.com>
Date: Wed, 17 Jun 2026 03:08:26 +0000
Subject: [PATCH 19/21] =?UTF-8?q?docs:=20fix=20MLflow=20ROPC=20platform=20?=
 =?UTF-8?q?setup=20=E2=80=94=20enable=20password=20grant=20on=20the=20prox?=
 =?UTF-8?q?y's=20own=20Dex=20client?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

A dedicated Dex client cannot be used for the password grant on this
platform: the OAuth proxy validates that the token audience equals its
own client_id, so a separate client's token is rejected at the proxy.
Document enabling `password` in the grantTypes of the proxy's own
OAuth2Client (verified against the live cluster), with the kubectl
patch, the aud constraint, and a security caveat. Update the matching
troubleshooting row.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 docs/en/kubeflow/how_to/mlflow-python-sdk.mdx | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)
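
Reviewer note (not part of the commit): the audience constraint described above can be checked by hand before blaming the proxy. This is a minimal sketch with hypothetical helper names (`jwt_claims`, `audience_matches`). It only decodes the token payload for inspection and does not verify the signature, so it is a debugging aid and not a substitute for the proxy's own validation.

```python
import base64
import json


def jwt_claims(token: str) -> dict:
    """Decode a JWT payload WITHOUT verifying the signature (inspection only)."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore the base64 padding JWTs strip
    return json.loads(base64.urlsafe_b64decode(payload))


def audience_matches(token: str, client_id: str) -> bool:
    """True if the token's `aud` claim (string or list) contains client_id."""
    aud = jwt_claims(token).get("aud")
    audiences = aud if isinstance(aud, list) else [aud]
    return client_id in audiences
```

If `audience_matches(id_token, "<proxy client id>")` is false, the token was minted by a different client, which is the situation this patch documents.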

diff --git a/docs/en/kubeflow/how_to/mlflow-python-sdk.mdx b/docs/en/kubeflow/how_to/mlflow-python-sdk.mdx
index 758de4f..e337661 100644
--- a/docs/en/kubeflow/how_to/mlflow-python-sdk.mdx
+++ b/docs/en/kubeflow/how_to/mlflow-python-sdk.mdx
@@ -20,7 +20,19 @@ The password grant needs two settings, which an administrator enables once:
         - --skip-jwt-bearer-tokens=true
   ```
 
-- **Allow the password grant.** Dex must have the password connector enabled (`enablePasswordDB: true`), and the OAuth client you authenticate with must list `password` in its `grantTypes`. Register a **dedicated** client for this rather than the platform's interactive-login client.
+- **Allow the password grant on the proxy's own client.** ROPC is enabled per Dex client through its `grantTypes`, and the OAuth proxy only accepts tokens whose audience (`aud`) equals its configured `client_id`. The password grant must therefore be enabled on **the same Dex client the proxy uses** — a separate client mints tokens with the wrong `aud`, which the proxy rejects. Confirm Dex has the password connector enabled (`enablePasswordDB: true`), then add `password` to that client's `grantTypes`. With Dex's Kubernetes storage the client is an `OAuth2Client` resource:
+
+  ```bash
+  # <client-resource> is the OAuth2Client whose `id` equals the proxy's --client-id
+  kubectl -n <dex-namespace> patch oauth2client <client-resource> --type=json \
+    -p='[{"op":"add","path":"/grantTypes/-","value":"password"}]'
+  ```
+
+  Use this client's id and secret as `DEX_CLIENT_ID` / `DEX_CLIENT_SECRET` below.
+
+:::warning
+Enabling the password grant on the interactive-login client lets any valid username and password mint a token directly. Use a **dedicated service-account user** scoped only to the target workspace, and store its credentials and the client secret in a Kubernetes `Secret` — never in code.
+:::
 
 ## Prerequisites
 
@@ -127,7 +139,7 @@ The session cookie expires — copy a fresh one when calls start returning a log
 
 | Symptom | Check |
 |---------|-------|
-| `/dex/token` returns `unsupported_grant_type` / "password grant … not allowed" | The Dex client does not permit the password grant. Use a client whose `grantTypes` include `password` (see [Platform setup](#platform-setup-administrator-one-time)). |
+| `/dex/token` returns `unsupported_grant_type` / "password grant … not allowed" | The Dex client does not permit the password grant. Add `password` to the `grantTypes` of the **OAuth proxy's own client** — a separate client's token would later fail the proxy's `aud` check (see [Platform setup](#platform-setup-administrator-one-time)). |
 | Call returns HTML or a redirect (`302` to the login page) | The OAuth proxy rejected the bearer token. Confirm `--skip-jwt-bearer-tokens` is enabled and the token is a valid Dex id token (`aud` = the proxy's client). For the cookie alternative, your `_oauth2_proxy` value is missing or expired. |
 | `Invalid … character(s) in header value: 'Bearer …\n'` | The token has trailing whitespace. Set `MLFLOW_TRACKING_TOKEN` to the `.strip()`-ed value. |
 | `Failed to query /api/3.0/mlflow/server-info` | The SDK could not reach the server through the proxy — verify the tracking URI is the platform MLflow route and the token is valid. |

From 01a309241c58c42731f0284fb621aa4e16cc3c98 Mon Sep 17 00:00:00 2001
From: Wu Yi <typhoonzero1986@gmail.com>
Date: Wed, 17 Jun 2026 03:46:41 +0000
Subject: [PATCH 20/21] docs: browser-free MLflow auth via
 authorization_code+PKCE (token + cookie)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

ROPC needs the password grant on the shared alauda-auth client, i.e. a
change to the global auth server — which is off-limits. The platform
already allows the authorization_code grant, and its login API is
scriptable (PKCE; captcha is retry-gated, so a clean first login needs
none). Rewrite the SDK guide around two browser-free methods, both
verified end-to-end on g1-c1-x86:

- Bearer token (primary): scripted authorization_code+PKCE -> id_token
  as MLFLOW_TRACKING_TOKEN, renewed via the refresh_token grant. Needs
  --skip-jwt-bearer-tokens on the MLflow proxy (workload cluster, not
  global auth). Python helper + curl; both verified.
- Session cookie (fallback): same scripted login fed to the proxy
  callback -> _oauth2_proxy cookie. Zero platform changes.

Point pipelines-mlflow-integration at the SDK guide's token flow instead
of the password grant (and fix the renamed platform-setup anchor).
Rewrite the e2e smoke test to exercise both legs (token leg SKIPs
cleanly when skip-jwt is off) and fix a cleanup bug where the
_oauth2_proxy cookie value contains '|', which collided with the
delimiter and leaked experiments.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 docs/en/kubeflow/how_to/mlflow-python-sdk.mdx | 183 ++++++++++++------
 .../pipelines-mlflow-integration.mdx          |  13 +-
 e2e/mlflow-user-identity-smoke.sh             | 172 ++++++++++------
 3 files changed, 248 insertions(+), 120 deletions(-)

diff --git a/docs/en/kubeflow/how_to/mlflow-python-sdk.mdx b/docs/en/kubeflow/how_to/mlflow-python-sdk.mdx
index e337661..bcceb67 100644
--- a/docs/en/kubeflow/how_to/mlflow-python-sdk.mdx
+++ b/docs/en/kubeflow/how_to/mlflow-python-sdk.mdx
@@ -4,78 +4,123 @@ weight: 46
 
 # Using the MLflow Python SDK with Authentication and RBAC
 
-On Alauda AI the [MLflow Tracking Server](./mlflow.mdx) runs behind single sign-on and multi-tenancy: an OAuth proxy authenticates every caller, and the server records each run under the calling user and authorizes it against Kubernetes RBAC. This guide shows how to drive the stock **MLflow Python SDK** through that OAuth proxy with your own identity, using the OAuth2 **password grant** to obtain a token from a username and password — no browser, and never the MLflow container port.
+On Alauda AI the [MLflow Tracking Server](./mlflow.mdx) runs behind single sign-on and multi-tenancy: an OAuth proxy authenticates every caller, and the server records each run under the calling user and authorizes it against Kubernetes RBAC. This guide drives the stock **MLflow Python SDK** through that OAuth proxy with your own identity, **browser-free**, using the OAuth2 **authorization code** flow (with PKCE) scripted against the platform login — no password grant, and never the MLflow container port.
 
-## Platform setup (administrator, one-time) \{#platform-setup-administrator-one-time}
+There are two browser-free ways to present your identity; pick one:
 
-The password grant needs two settings, which an administrator enables once:
+- **Bearer token (recommended).** Obtain a Dex **id token** from the CLI or Python and pass it as `MLFLOW_TRACKING_TOKEN`; renew it with the refresh token. Needs one platform setting ([below](#platform-setup)).
+- **Session cookie (no platform changes).** Drive the proxy's own login to obtain its `_oauth2_proxy` cookie and attach it to requests. Works on any install as-is ([below](#cookie-method)).
 
-- **Accept bearer tokens at the proxy.** Add `--skip-jwt-bearer-tokens=true` to the MLflow OAuth proxy so it accepts a Dex OIDC token alongside browser sessions:
-
-  ```yaml
-  # MLflow plugin values
-  auth:
-    oauth:
-      extraArgs:
-        - --skip-jwt-bearer-tokens=true
-  ```
-
-- **Allow the password grant on the proxy's own client.** ROPC is enabled per Dex client through its `grantTypes`, and the OAuth proxy only accepts tokens whose audience (`aud`) equals its configured `client_id`. The password grant must therefore be enabled on **the same Dex client the proxy uses** — a separate client mints tokens with the wrong `aud`, which the proxy rejects. Confirm Dex has the password connector enabled (`enablePasswordDB: true`), then add `password` to that client's `grantTypes`. With Dex's Kubernetes storage the client is an `OAuth2Client` resource:
+## How authentication works
 
-  ```bash
-  # <client-resource> is the OAuth2Client whose `id` equals the proxy's --client-id
-  kubectl -n <dex-namespace> patch oauth2client <client-resource> --type=json \
-    -p='[{"op":"add","path":"/grantTypes/-","value":"password"}]'
-  ```
+Two layers sit in front of your runs:
 
-  Use this client's id and secret as `DEX_CLIENT_ID` / `DEX_CLIENT_SECRET` below.
+1. The **OAuth proxy** (`oauth2-proxy`) authenticates the request — either a Dex **id token** sent as `Authorization: Bearer …` (token method) or its `_oauth2_proxy` **session cookie** (cookie method).
+2. The MLflow server's `kubernetes-auth` plugin reads your identity from that credential, records it as the run **owner**, and authorizes it against your Kubernetes permissions in the workspace.
 
-:::warning
-Enabling the password grant on the interactive-login client lets any valid username and password mint a token directly. Use a **dedicated service-account user** scoped only to the target workspace, and store its credentials and the client secret in a Kubernetes `Secret` — never in code.
-:::
+The client always goes through the OAuth proxy — never connect to the MLflow container port directly.
 
 ## Prerequisites
 
-- `mlflow` **3.10 or later** (`pip install "mlflow>=3.10"`). Workspace selection (`mlflow.set_workspace`) is a 3.10+ feature.
+- `mlflow` **3.10 or later** (`pip install "mlflow>=3.10"`). Workspace selection (`mlflow.set_workspace`) is a 3.10+ feature. The Python token helper also uses `requests` and `cryptography`.
 - A platform **username and password** — ideally a dedicated service account, not a person's login — that can access the target workspace (see [Workspace Access](./mlflow.mdx)).
-- The Dex **client id and secret** allowed to use the password grant (from your administrator).
+- The platform's **OAuth client id and secret** — the client the MLflow proxy uses (from your administrator). On Alauda this is the platform auth client, e.g. `alauda-auth`; its secret lives in a Kubernetes `Secret` (e.g. `cpaas-oidc-secret`).
 
-## How authentication works
+## Platform setup for the token method (administrator, one-time) \{#platform-setup}
 
-Two layers sit in front of your runs:
+The bearer-token method needs the MLflow OAuth proxy to accept Dex id tokens. Add `--skip-jwt-bearer-tokens=true` to the **MLflow plugin** — this is the MLflow proxy on the workload cluster, **not** the platform's global auth server:
 
-1. The **OAuth proxy** (`oauth2-proxy`) authenticates the request. With `--skip-jwt-bearer-tokens`, it accepts a Dex-issued OIDC **id token** sent as `Authorization: Bearer …`.
-2. The MLflow server's `kubernetes-auth` plugin reads your identity from that token, records it as the run **owner**, and authorizes it against your Kubernetes permissions in the workspace.
+```yaml
+# MLflow plugin values
+auth:
+  oauth:
+    extraArgs:
+      - --skip-jwt-bearer-tokens=true
+```
 
-The client always goes through the OAuth proxy — never connect to the MLflow container port directly.
+No Dex or global-auth change is required: the login below uses the `authorization_code` grant the platform client already allows. The **cookie method** needs no setting at all — skip this section if you use it.
 
-## Connect the SDK
+## Get a token from the command line (browser-free) \{#get-a-token}
 
-### 1. Mint an id token with the password grant
+The platform login is an SSO page, but its API supports the standard OAuth **authorization code** flow with PKCE, so you can complete it from a script — no browser redirect. The password is RSA-encrypted with the login service's public key (`/dex/pubkey`), exactly as the login page does it, then exchanged for an **id token** (and a **refresh token** for headless renewal).
 
-Exchange the username and password for a Dex **id token** in a single call (no browser, no cookie):
+### Python helper
 
-```bash
-export ID_TOKEN=$(curl -sk "https://<platform>/dex/token" \
-  -d grant_type=password \
-  --data-urlencode "username=$MLFLOW_USERNAME" \
-  --data-urlencode "password=$MLFLOW_PASSWORD" \
-  -d scope="openid email groups" \
-  -d client_id="$DEX_CLIENT_ID" --data-urlencode "client_secret=$DEX_CLIENT_SECRET" \
-  | jq -r .id_token)
+```python
+import base64, hashlib, json, os, secrets
+from urllib.parse import urlparse, parse_qs
+import requests
+from cryptography.hazmat.primitives.asymmetric import padding
+from cryptography.hazmat.primitives.serialization import load_pem_public_key
+
+PLATFORM      = os.environ["PLATFORM_ADDRESS"].rstrip("/")    # https://<platform>
+CLIENT_ID     = os.environ["DEX_CLIENT_ID"]                   # the MLflow proxy's client, e.g. alauda-auth
+CLIENT_SECRET = os.environ["DEX_CLIENT_SECRET"]
+USERNAME      = os.environ["MLFLOW_USERNAME"]
+PASSWORD      = os.environ["MLFLOW_PASSWORD"]
+REDIRECT_URI  = f"{PLATFORM}/oauth2/callback"                 # any URI the client has registered
+VERIFY_TLS    = os.environ.get("PLATFORM_CA", False)         # CA bundle path, or False to skip (lab only)
+
+s = requests.Session(); s.verify = VERIFY_TLS
+_b64url = lambda b: base64.urlsafe_b64encode(b).rstrip(b"=").decode()
+
+def get_tokens() -> dict:
+    """Run the authorization-code + PKCE flow headlessly. Returns the Dex token response."""
+    verifier  = _b64url(secrets.token_bytes(48))
+    challenge = _b64url(hashlib.sha256(verifier.encode()).digest())
+    # 1) start the flow -> auth-request id
+    req = s.get(f"{PLATFORM}/dex/api/v1/authorize", params={
+        "client_id": CLIENT_ID, "redirect_uri": REDIRECT_URI, "response_type": "code",
+        "scope": "openid email groups offline_access", "state": "cli",
+        "code_challenge": challenge, "code_challenge_method": "S256"}).json()["req"]
+    # 2) RSA-encrypt the password, then log in via the local connector -> auth code
+    pk  = s.get(f"{PLATFORM}/dex/pubkey").json()              # {"ts": ..., "pubkey": "<PEM>"}
+    payload = json.dumps({"ts": pk["ts"], "password": PASSWORD}, separators=(",", ":")).encode()
+    enc = base64.b64encode(load_pem_public_key(pk["pubkey"].encode()).encrypt(payload, padding.PKCS1v15())).decode()
+    redirect = s.post(f"{PLATFORM}/dex/api/v1/authorize/local", params={"req": req},
+        json={"account": USERNAME, "password": enc}).json()["redirect_url"]
+    code = parse_qs(urlparse(redirect).query)["code"][0]
+    # 3) exchange the code (with the PKCE verifier) -> id_token + refresh_token
+    return s.post(f"{PLATFORM}/dex/token", data={
+        "grant_type": "authorization_code", "code": code, "redirect_uri": REDIRECT_URI,
+        "code_verifier": verifier, "client_id": CLIENT_ID, "client_secret": CLIENT_SECRET}).json()
+
+def refresh(refresh_token: str) -> str:
+    """Mint a fresh id token from a refresh token — no login, no browser."""
+    return s.post(f"{PLATFORM}/dex/token", data={
+        "grant_type": "refresh_token", "refresh_token": refresh_token,
+        "client_id": CLIENT_ID, "client_secret": CLIENT_SECRET,
+        "scope": "openid email groups"}).json()["id_token"]
 ```
 
-### 2. Point the SDK at the MLflow route with the token
+### Shell equivalent (curl + openssl, no Python dependencies)
+
+```bash
+PLATFORM=https://<platform>; CLIENT_ID=<client>; CLIENT_SECRET=<secret>
+USERNAME='<user>'; PASSWORD='<password>'; REDIRECT_URI="$PLATFORM/oauth2/callback"
+
+V=$(openssl rand -base64 48 | tr '+/' '-_' | tr -d '=' | cut -c1-64)                       # PKCE verifier
+C=$(printf %s "$V" | openssl dgst -sha256 -binary | openssl base64 -A | tr '+/' '-_' | tr -d '=')
+RU=$(jq -rn --arg u "$REDIRECT_URI" '$u|@uri'); SC=$(jq -rn '"openid email groups offline_access"|@uri')
+REQ=$(curl -sk "$PLATFORM/dex/api/v1/authorize?client_id=$CLIENT_ID&redirect_uri=$RU&response_type=code&scope=$SC&state=cli&code_challenge=$C&code_challenge_method=S256" | jq -r .req)
+PK=$(curl -sk "$PLATFORM/dex/pubkey"); TS=$(echo "$PK"|jq -r .ts); echo "$PK"|jq -r .pubkey >/tmp/dex_pub.pem
+ENC=$(printf '{"ts":"%s","password":"%s"}' "$TS" "$PASSWORD" | openssl pkeyutl -encrypt -pubin -inkey /tmp/dex_pub.pem -pkeyopt rsa_padding_mode:pkcs1 | openssl base64 -A)
+CODE=$(curl -sk -X POST "$PLATFORM/dex/api/v1/authorize/local?req=$REQ" -H 'Content-Type: application/json' \
+  --data "$(jq -nc --arg a "$USERNAME" --arg p "$ENC" '{account:$a,password:$p}')" | jq -r .redirect_url | sed -E 's/.*code=([^&]+).*/\1/')
+curl -sk "$PLATFORM/dex/token" -d grant_type=authorization_code -d code="$CODE" \
+  --data-urlencode redirect_uri="$REDIRECT_URI" -d code_verifier="$V" \
+  -d client_id="$CLIENT_ID" --data-urlencode client_secret="$CLIENT_SECRET" | jq -r .id_token
+```
 
-The SDK reads `MLFLOW_TRACKING_TOKEN` and sends it as `Authorization: Bearer …`:
+## Connect the SDK
 
 ```python
-import os
-import mlflow
+import os, mlflow
 
-os.environ["MLFLOW_TRACKING_TOKEN"] = os.environ["ID_TOKEN"].strip()  # → Authorization: Bearer
+tok = get_tokens()
+os.environ["MLFLOW_TRACKING_TOKEN"] = tok["id_token"].strip()           # → Authorization: Bearer
 mlflow.set_tracking_uri("http://mlflow-tracking-server.kubeflow:5000")  # in-cluster Service (fronted by the OAuth proxy)
-mlflow.set_workspace("team-a")                 # workspace namespace → X-MLFLOW-WORKSPACE
+mlflow.set_workspace("team-a")                                          # workspace namespace → X-MLFLOW-WORKSPACE
 mlflow.set_experiment("my-experiment")
 
 with mlflow.start_run(run_name="sdk-quickstart") as run:
@@ -84,12 +129,12 @@ with mlflow.start_run(run_name="sdk-quickstart") as run:
     print("run:", run.info.run_id)
 ```
 
-The run appears under **Alauda AI → Tools → MLFlow**, owned by the username you authenticated as. (Verified end-to-end on a secured install: the run owner is the token's user identity.)
+The run appears under **Alauda AI → Tools → MLFlow**, owned by the user you authenticated as. (Verified end-to-end on a secured install: the run owner is the token's user identity.)
 
-Use the in-cluster Service URL `http://mlflow-tracking-server.kubeflow:5000` when the client runs **inside** the cluster (pipeline components, Workbench notebooks). From **outside** the cluster, point at the platform route `https://<platform>/clusters/<cluster>/mlflow` instead — both reach the same OAuth proxy.
+Use the in-cluster Service URL `http://mlflow-tracking-server.kubeflow:5000` when the client runs **inside** the cluster (pipeline components, Workbench notebooks). From **outside** the cluster, point at the platform route `https://<platform>/clusters/<cluster>/mlflow` instead — both reach the same OAuth proxy (set `MLFLOW_TRACKING_INSECURE_TLS=true` if the platform certificate is not trusted by your machine).
 
 :::warning
-The password grant sends the password to the token endpoint, so use a **dedicated service account** and keep the credentials and client secret in a Kubernetes `Secret`, never in code. Always `.strip()` the token (a trailing newline produces `Invalid … character(s) in header value: 'Bearer …\n'`). id tokens expire (24 h by default), so re-run step 1 to refresh for long-running jobs. If you use the external HTTPS route and the platform certificate is not trusted by your machine, set `MLFLOW_TRACKING_INSECURE_TLS=true`.
+Use a **dedicated service-account user** and keep its credentials and the client secret in a Kubernetes `Secret`, never in code. Always `.strip()` the token (a trailing newline produces `Invalid … character(s) in header value: 'Bearer …\n'`). id tokens expire (24 h by default); for long-running jobs renew with `refresh(tok["refresh_token"])` instead of logging in again.
 :::
 
 ## Selecting a workspace
@@ -113,9 +158,31 @@ with mlflow.start_run():
 
 Promote the registered version to **Staging** or **Production** from the MLflow UI.
 
-## Interactive alternative: browser session
+## Alternative: session cookie (no platform changes) \{#cookie-method}
+
+If you cannot enable `--skip-jwt-bearer-tokens`, drive the proxy's own login flow to obtain its `_oauth2_proxy` cookie and attach it to requests — this works on any install unchanged. The proxy starts the OAuth flow for you (its own PKCE and `redirect_uri`); you just replay that through the same scripted login and hand the code back to the proxy callback:
+
+```bash
+PLATFORM=https://<platform>; CLUSTER=<cluster>
+USERNAME='<user>'; PASSWORD='<password>'
+JAR=$(mktemp)
+# 1) start the MLflow proxy login -> the Dex auth query it wants
+LOC=$(curl -sk -c "$JAR" -D - -o /dev/null "$PLATFORM/clusters/$CLUSTER/mlflow/" \
+  | awk 'tolower($1)=="location:"{print $2}' | tr -d '\r')
+QS=${LOC#*\?}
+# 2) authorize -> req, then 3) scripted local login -> the proxy callback URL
+REQ=$(curl -sk -b "$JAR" -c "$JAR" "$PLATFORM/dex/api/v1/authorize?$QS" | jq -r .req)
+PK=$(curl -sk "$PLATFORM/dex/pubkey"); TS=$(echo "$PK"|jq -r .ts); echo "$PK"|jq -r .pubkey >/tmp/dex_pub.pem
+ENC=$(printf '{"ts":"%s","password":"%s"}' "$TS" "$PASSWORD" | openssl pkeyutl -encrypt -pubin -inkey /tmp/dex_pub.pem -pkeyopt rsa_padding_mode:pkcs1 | openssl base64 -A)
+CB=$(curl -sk -b "$JAR" -c "$JAR" -X POST "$PLATFORM/dex/api/v1/authorize/local?req=$REQ" -H 'Content-Type: application/json' \
+  --data "$(jq -nc --arg a "$USERNAME" --arg p "$ENC" '{account:$a,password:$p}')" | jq -r .redirect_url)
+# 4) the proxy callback exchanges the code and sets the _oauth2_proxy cookie
+curl -sk -b "$JAR" -c "$JAR" -o /dev/null "$CB"
+COOKIE=$(awk -F'\t' '$6 ~ /^_oauth2_proxy/{printf "%s=%s; ",$6,$7}' "$JAR" | sed 's/; $//')   # includes any _oauth2_proxy_N chunks
+echo "$COOKIE"
+```
 
-If you cannot use the password grant (for example you only have an interactive SSO login), present your browser session instead — this works without the `--skip-jwt-bearer-tokens` setting. Sign in at **Alauda AI → Tools → MLFlow**, copy the `_oauth2_proxy` cookie from the browser developer tools (**Application/Storage → Cookies**; include any `_oauth2_proxy_N` chunks, joined with `; `), and attach it to every request with a header provider:
+Then attach the cookie with a header provider (the cookie carries your identity — no token, no platform setting):
 
 ```python
 import os, mlflow
@@ -124,7 +191,7 @@ from mlflow.tracking.request_header.registry import _request_header_provider_reg
 
 class ProxySessionHeader(RequestHeaderProvider):
     def in_context(self):
-        return bool(os.environ.get("MLFLOW_PROXY_COOKIE"))      # export MLFLOW_PROXY_COOKIE='_oauth2_proxy=<value>'
+        return bool(os.environ.get("MLFLOW_PROXY_COOKIE"))     # export MLFLOW_PROXY_COOKIE='_oauth2_proxy=<value>'
     def request_headers(self):
         return {"Cookie": os.environ["MLFLOW_PROXY_COOKIE"]}
 
@@ -133,15 +200,17 @@ mlflow.set_tracking_uri("https://<platform>/clusters/<cluster>/mlflow")
 mlflow.set_workspace("team-a")
 ```
 
-The session cookie expires — copy a fresh one when calls start returning a login redirect.
+You can also copy the `_oauth2_proxy` cookie from a browser session (DevTools → **Application/Storage → Cookies**). The session cookie expires — re-mint it when calls start returning a login redirect.
 
 ## Troubleshooting
 
 | Symptom | Check |
 |---------|-------|
-| `/dex/token` returns `unsupported_grant_type` / "password grant … not allowed" | The Dex client does not permit the password grant. Add `password` to the `grantTypes` of the **OAuth proxy's own client** — a separate client's token would later fail the proxy's `aud` check (see [Platform setup](#platform-setup-administrator-one-time)). |
-| Call returns HTML or a redirect (`302` to the login page) | The OAuth proxy rejected the bearer token. Confirm `--skip-jwt-bearer-tokens` is enabled and the token is a valid Dex id token (`aud` = the proxy's client). For the cookie alternative, your `_oauth2_proxy` value is missing or expired. |
+| `/dex/api/v1/authorize` returns `PKCE code_challenge is required` | The client enforces PKCE. Send `code_challenge` and `code_challenge_method=S256` (the helper does this). |
+| Local login returns a captcha challenge / `CaptchaError` | Too many recent failed logins triggered the retry-captcha. Wait, fix the credentials, then retry — a clean first login needs no captcha. |
+| `/dex/token` returns `invalid_grant` | The auth code or PKCE verifier is stale or reused. Re-run the flow from the start (`authorize` → login → token); codes are single-use. |
+| Call returns HTML or a redirect (`302` to the login page) | **Token method:** the proxy rejected the bearer token — confirm `--skip-jwt-bearer-tokens` is enabled and the token is a valid Dex id token (`aud` = the proxy's client). **Cookie method:** the `_oauth2_proxy` cookie is missing or expired. |
 | `Invalid … character(s) in header value: 'Bearer …\n'` | The token has trailing whitespace. Set `MLFLOW_TRACKING_TOKEN` to the `.strip()`-ed value. |
-| `Failed to query /api/3.0/mlflow/server-info` | The SDK could not reach the server through the proxy — verify the tracking URI is the platform MLflow route and the token is valid. |
+| `Failed to query /api/3.0/mlflow/server-info` | The SDK could not reach the server through the proxy — verify the tracking URI and that the token/cookie is valid. |
 | `403 PERMISSION_DENIED` | Your account lacks access to the workspace namespace. Request access to the workspace (see [Workspace Access](./mlflow.mdx)); no ServiceAccount is involved. |
 | Run shows the wrong owner or workspace | The owner is your authenticated identity; the workspace is `set_workspace()` / `MLFLOW_WORKSPACE` (else the server default). Check both. |
diff --git a/docs/en/training_guides/pipelines-mlflow-integration.mdx b/docs/en/training_guides/pipelines-mlflow-integration.mdx
index 0735de3..65b1a55 100644
--- a/docs/en/training_guides/pipelines-mlflow-integration.mdx
+++ b/docs/en/training_guides/pipelines-mlflow-integration.mdx
@@ -11,7 +11,7 @@ This guide shows how Kubeflow Pipelines (KFP) components log parameters, metrics
 - Alauda AI 2.5 and later.
 - Kubeflow Pipelines and the MLflow cluster plugin are installed.
 - The MLflow workspace is a namespace labelled `mlflow-enabled=true`.
-- The MLflow OAuth proxy accepts bearer tokens and a Dex client permits the password grant — see [Platform setup](../kubeflow/how_to/mlflow-python-sdk.mdx#platform-setup-administrator-one-time) in the SDK guide.
+- For the bearer-token method, the MLflow OAuth proxy must accept Dex id tokens (`--skip-jwt-bearer-tokens`) — see [Platform setup](../kubeflow/how_to/mlflow-python-sdk.mdx#platform-setup) in the SDK guide. No global-auth change is needed, and the cookie method needs no setup at all.
 
 ## Prerequisites
 
@@ -94,20 +94,15 @@ def training_pipeline(
 compiler.Compiler().compile(training_pipeline, "pipeline.yaml")
 ```
 
-Create the `mlflow-token` Secret with an id token from the password grant (see the [SDK guide](../kubeflow/how_to/mlflow-python-sdk.mdx)):
+Create the `mlflow-token` Secret with a Dex id token. Mint `ID_TOKEN` browser-free with the authorization-code flow from the SDK guide — see [Get a token from the command line](../kubeflow/how_to/mlflow-python-sdk.mdx#get-a-token):
 
 ```bash
-ID_TOKEN=$(curl -sk "https://<platform>/dex/token" \
-  -d grant_type=password \
-  --data-urlencode "username=$MLFLOW_USERNAME" --data-urlencode "password=$MLFLOW_PASSWORD" \
-  -d scope="openid email groups" \
-  -d client_id="$DEX_CLIENT_ID" --data-urlencode "client_secret=$DEX_CLIENT_SECRET" | jq -r .id_token)
-
+# ID_TOKEN: mint it with the curl/Python authorization-code flow in the SDK guide (browser-free; works with the client's existing grant types)
 kubectl -n <pipeline-namespace> create secret generic mlflow-token --from-literal=token="$ID_TOKEN"
 ```
 
 :::warning
-id tokens expire (24 h by default), so refresh the `mlflow-token` Secret before submitting long pipelines — or mint the token **inside** the component from service-account credentials kept in a Secret (the password-grant call shown in the SDK guide), so each run gets a fresh token.
+id tokens expire (24 h by default), so refresh the `mlflow-token` Secret before submitting long pipelines — or mint the token **inside** the component from service-account credentials kept in a Secret (the [token flow](../kubeflow/how_to/mlflow-python-sdk.mdx#get-a-token) in the SDK guide) and renew it with the refresh token, so each run gets a fresh token.
 :::
 
 ## Upload and run
diff --git a/e2e/mlflow-user-identity-smoke.sh b/e2e/mlflow-user-identity-smoke.sh
index e6b6cf3..721e4b6 100755
--- a/e2e/mlflow-user-identity-smoke.sh
+++ b/e2e/mlflow-user-identity-smoke.sh
@@ -1,23 +1,29 @@
 #!/usr/bin/env bash
-# Smoke test: log to MLflow as a real user, THROUGH the OAuth proxy.
+# Smoke test: log to MLflow as a real user, THROUGH the OAuth proxy — browser-free.
 #
-# Mints a Dex **id token** with the OAuth2 password grant (ROPC), then logs a run
-# to MLflow over the platform route — i.e. through oauth2-proxy, never the
-# container port. Asserts the run owner equals the token's user identity.
+# Drives the platform's standard OAuth **authorization code** flow (with PKCE)
+# from the shell: it starts the flow, logs in via the local connector with an
+# RSA-encrypted password (exactly as the login page does), and gets back an auth
+# code. From that code it derives, and exercises, both documented credentials:
 #
-# Prerequisites on the platform:
-#   - a Dex OAuth client whose grantTypes include "password" (ROPC)
-#   - the MLflow oauth2-proxy accepts bearer tokens
-#     (auth.oauth.extraArgs: ["--skip-jwt-bearer-tokens=true"])
+#   1. Bearer token  — exchange the code for a Dex id token, send it as
+#                      Authorization: Bearer (needs --skip-jwt-bearer-tokens on
+#                      the MLflow proxy; the test SKIPs this leg if it is off).
+#   2. Session cookie — hand the code to the MLflow proxy callback to obtain the
+#                      _oauth2_proxy cookie (works with no platform changes).
+#
+# Each leg logs a run over the platform route (i.e. through oauth2-proxy, never
+# the container port) and asserts the run owner equals the caller's identity.
+# No ROPC/password grant, no ServiceAccount, no direct container-port access.
 #
 # Required env:
 #   PLATFORM_ADDRESS   e.g. https://192.168.142.163
 #   CLUSTER            e.g. g1-c1-x86
 #   MLFLOW_USERNAME    platform username (ideally a dedicated service account)
 #   MLFLOW_PASSWORD    that user's password
-#   DEX_CLIENT_ID      Dex client id allowed to use the password grant
-#   DEX_CLIENT_SECRET  that client's secret
 # Optional env:
+#   DEX_CLIENT_ID      OAuth client id (enables the bearer-token leg; default: alauda-auth)
+#   DEX_CLIENT_SECRET  that client's secret (enables the bearer-token leg)
 #   MLFLOW_WORKSPACE   target workspace namespace (default: mlops-demo-e2e)
 set -euo pipefail
 
@@ -25,57 +31,115 @@ set -euo pipefail
 : "${CLUSTER:?set CLUSTER, e.g. g1-c1-x86}"
 : "${MLFLOW_USERNAME:?set MLFLOW_USERNAME}"
 : "${MLFLOW_PASSWORD:?set MLFLOW_PASSWORD}"
-: "${DEX_CLIENT_ID:?set DEX_CLIENT_ID}"
-: "${DEX_CLIENT_SECRET:?set DEX_CLIENT_SECRET}"
+DEX_CLIENT_ID="${DEX_CLIENT_ID:-alauda-auth}"
 WORKSPACE="${MLFLOW_WORKSPACE:-mlops-demo-e2e}"
 P="${PLATFORM_ADDRESS%/}"
+REDIRECT_URI="$P/oauth2/callback"           # any URI the client has registered
+BASE="$P/clusters/${CLUSTER}/mlflow/api/2.0/mlflow"
+
+TMP="$(mktemp -d)"
+CLEAN_HDR=(); CLEAN_EID=()                  # parallel arrays of (auth header, experiment id) to delete on exit
+cleanup() {
+  local i
+  for i in "${!CLEAN_EID[@]}"; do
+    curl -fsSk -H "${CLEAN_HDR[$i]}" -H "X-MLFLOW-WORKSPACE: ${WORKSPACE}" -H 'Content-Type: application/json' \
+      -X POST "$BASE/experiments/delete" -d "{\"experiment_id\":\"${CLEAN_EID[$i]}\"}" >/dev/null 2>&1 || true
+  done
+  rm -rf "$TMP"
+}
+trap cleanup EXIT
 
 b64url_decode() { local d="$1"; d="${d//-/+}"; d="${d//_/\/}"; printf '%s%s' "$d" "$(printf '%*s' $(((4 - ${#d} % 4) % 4)) '' | tr ' ' '=')" | base64 -d 2>/dev/null; }
 
-echo "== mint id token via password grant (ROPC) =="
-ID_TOKEN="$(curl -fsSk "$P/dex/token" \
-  -d grant_type=password \
-  --data-urlencode "username=${MLFLOW_USERNAME}" \
-  --data-urlencode "password=${MLFLOW_PASSWORD}" \
-  -d scope="openid email groups" \
-  -d client_id="${DEX_CLIENT_ID}" \
-  --data-urlencode "client_secret=${DEX_CLIENT_SECRET}" \
-  | jq -r '.id_token')"
-[ -n "${ID_TOKEN}" ] && [ "${ID_TOKEN}" != null ] || { echo "FAIL: no id_token (does the client allow the password grant?)"; exit 1; }
-EMAIL="$(b64url_decode "$(printf '%s' "$ID_TOKEN" | cut -d. -f2)" | jq -r '.email // .preferred_username // .name // .sub')"
-echo "caller identity: ${EMAIL}"
+# RSA-encrypt {"ts","password"} with a fresh /dex/pubkey (PKCS#1 v1.5), as the login page does.
+rsa_password() {
+  local pk ts
+  pk="$(curl -fsSk "$P/dex/pubkey")"; ts="$(echo "$pk" | jq -r .ts)"
+  echo "$pk" | jq -r .pubkey > "$TMP/pub.pem"
+  jq -jnc --arg ts "$ts" --arg p "$MLFLOW_PASSWORD" '{ts:$ts,password:$p}' \
+    | openssl pkeyutl -encrypt -pubin -inkey "$TMP/pub.pem" -pkeyopt rsa_padding_mode:pkcs1 | openssl base64 -A
+}
 
-# Through the OAuth proxy: the platform MLflow route, with the id token as a bearer.
-BASE="$P/clusters/${CLUSTER}/mlflow/api/2.0/mlflow"
-hdr=(-H "Authorization: Bearer ${ID_TOKEN}"
-     -H "X-MLFLOW-WORKSPACE: ${WORKSPACE}"
-     -H "Content-Type: application/json")
-api() { curl -fsSk "${hdr[@]}" -X "$1" "${BASE}/$2" ${3:+-d "$3"}; }
+# Log a run + assert the owner. $1=label  $2=auth header (Authorization/Cookie)  $3=expected owner
+run_and_assert() {
+  local label="$1" header="$2" expect="$3" exp eid rid owner status param run
+  exp="uit-${label}-$$-${RANDOM}"
+  eid="$(curl -fsSk -H "$header" -H "X-MLFLOW-WORKSPACE: ${WORKSPACE}" -H 'Content-Type: application/json' \
+         -X POST "$BASE/experiments/create" -d "{\"name\":\"${exp}\"}" | jq -r '.experiment_id // empty')"
+  [ -n "$eid" ] || { echo "FAIL[$label]: experiment not created"; return 1; }
+  CLEAN_HDR+=("$header"); CLEAN_EID+=("$eid")
+  rid="$(curl -fsSk -H "$header" -H "X-MLFLOW-WORKSPACE: ${WORKSPACE}" -H 'Content-Type: application/json' \
+         -X POST "$BASE/runs/create" -d "{\"experiment_id\":\"${eid}\",\"start_time\":1700000000000}" | jq -r '.run.info.run_id // empty')"
+  [ -n "$rid" ] || { echo "FAIL[$label]: run not created"; return 1; }
+  curl -fsSk -H "$header" -H "X-MLFLOW-WORKSPACE: ${WORKSPACE}" -H 'Content-Type: application/json' \
+    -X POST "$BASE/runs/log-parameter" -d "{\"run_id\":\"${rid}\",\"key\":\"model_name\",\"value\":\"qwen3-0.6b\"}" >/dev/null
+  curl -fsSk -H "$header" -H "X-MLFLOW-WORKSPACE: ${WORKSPACE}" -H 'Content-Type: application/json' \
+    -X POST "$BASE/runs/log-metric" -d "{\"run_id\":\"${rid}\",\"key\":\"loss\",\"value\":0.123,\"timestamp\":1700000000000,\"step\":1}" >/dev/null
+  curl -fsSk -H "$header" -H "X-MLFLOW-WORKSPACE: ${WORKSPACE}" -H 'Content-Type: application/json' \
+    -X POST "$BASE/runs/update" -d "{\"run_id\":\"${rid}\",\"status\":\"FINISHED\",\"end_time\":1700000005000}" >/dev/null
+  run="$(curl -fsSk -H "$header" -H "X-MLFLOW-WORKSPACE: ${WORKSPACE}" "$BASE/runs/get?run_id=${rid}")"
+  owner="$(printf '%s' "$run" | jq -r '.run.info.user_id')"
+  status="$(printf '%s' "$run" | jq -r '.run.info.status')"
+  param="$(printf '%s' "$run" | jq -r '.run.data.params[] | select(.key=="model_name") | .value')"
+  echo "  [$label] run_id=${rid} owner=${owner} status=${status} model_name=${param}"
+  [ "$status" = "FINISHED" ]  || { echo "FAIL[$label]: run not FINISHED"; return 1; }
+  [ "$param" = "qwen3-0.6b" ] || { echo "FAIL[$label]: param not logged"; return 1; }
+  [ "$owner" = "$expect" ]    || { echo "FAIL[$label]: owner '${owner}' != expected '${expect}'"; return 1; }
+}
 
-EXP="uit-smoke-$$"
-echo "== create experiment '${EXP}' =="
-EID="$(api POST experiments/create "{\"name\":\"${EXP}\"}" | jq -r '.experiment_id')"
-[ -n "${EID}" ] && [ "${EID}" != null ] || { echo "FAIL: experiment not created (is --skip-jwt-bearer-tokens enabled?)"; exit 1; }
-cleanup() { api POST experiments/delete "{\"experiment_id\":\"${EID}\"}" >/dev/null 2>&1 || true; }
-trap cleanup EXIT
+EXPECT_OWNER="$MLFLOW_USERNAME"
 
-echo "== create run, log params + metrics =="
-RID="$(api POST runs/create "{\"experiment_id\":\"${EID}\",\"start_time\":1700000000000}" | jq -r '.run.info.run_id')"
-[ -n "${RID}" ] && [ "${RID}" != null ] || { echo "FAIL: run not created"; exit 1; }
-api POST runs/log-parameter "{\"run_id\":\"${RID}\",\"key\":\"model_name\",\"value\":\"qwen3-0.6b\"}" >/dev/null
-for s in 1 2 3; do
-  api POST runs/log-metric "{\"run_id\":\"${RID}\",\"key\":\"loss\",\"value\":$(awk "BEGIN{print 2.0*(0.9^$s)}"),\"timestamp\":1700000000000,\"step\":${s}}" >/dev/null
-done
-api POST runs/update "{\"run_id\":\"${RID}\",\"status\":\"FINISHED\",\"end_time\":1700000005000}" >/dev/null
+# ---------------------------------------------------------------------------
+# Leg 1: bearer token (authorization_code + PKCE -> id_token)
+# ---------------------------------------------------------------------------
+if [ -n "${DEX_CLIENT_SECRET:-}" ]; then
+  echo "== leg 1: mint id token via authorization_code + PKCE =="
+  V="$(openssl rand -base64 48 | tr '+/' '-_' | tr -d '=' | cut -c1-64)"
+  C="$(printf %s "$V" | openssl dgst -sha256 -binary | openssl base64 -A | tr '+/' '-_' | tr -d '=')"
+  RU="$(jq -rn --arg u "$REDIRECT_URI" '$u|@uri')"; SC="$(jq -rn '"openid email groups offline_access"|@uri')"
+  REQ="$(curl -fsSk "$P/dex/api/v1/authorize?client_id=${DEX_CLIENT_ID}&redirect_uri=${RU}&response_type=code&scope=${SC}&state=cli&code_challenge=${C}&code_challenge_method=S256" | jq -r '.req // empty')"
+  [ -n "$REQ" ] || { echo "FAIL: authorize returned no req (PKCE/client issue?)"; exit 1; }
+  ENC="$(rsa_password)"
+  CODE="$(curl -fsSk -X POST "$P/dex/api/v1/authorize/local?req=${REQ}" -H 'Content-Type: application/json' \
+          --data "$(jq -nc --arg a "$MLFLOW_USERNAME" --arg p "$ENC" '{account:$a,password:$p}')" \
+          | jq -r '.redirect_url // empty' | sed -E 's/.*code=([^&]+).*/\1/')"
+  [ -n "$CODE" ] || { echo "FAIL: login returned no auth code (captcha triggered or bad credentials?)"; exit 1; }
+  ID_TOKEN="$(curl -fsSk "$P/dex/token" -d grant_type=authorization_code -d code="$CODE" \
+              --data-urlencode redirect_uri="$REDIRECT_URI" -d code_verifier="$V" \
+              -d client_id="${DEX_CLIENT_ID}" --data-urlencode client_secret="${DEX_CLIENT_SECRET}" | jq -r '.id_token // empty')"
+  [ -n "$ID_TOKEN" ] || { echo "FAIL: token exchange returned no id_token"; exit 1; }
+  EXPECT_OWNER="$(b64url_decode "$(printf '%s' "$ID_TOKEN" | cut -d. -f2)" | jq -r '.email // .preferred_username // .name // .sub')"
+  echo "  caller identity: ${EXPECT_OWNER}"
+  # Is the proxy configured to accept bearer tokens?
+  HTTP="$(curl -sk -o /dev/null -w '%{http_code}' -H "Authorization: Bearer ${ID_TOKEN}" -H "X-MLFLOW-WORKSPACE: ${WORKSPACE}" "$BASE/experiments/search?max_results=1")"
+  if [ "$HTTP" = "200" ]; then
+    run_and_assert "token" "Authorization: Bearer ${ID_TOKEN}" "$EXPECT_OWNER"
+    echo "PASS: bearer-token method (authorization_code + PKCE)"
+  else
+    echo "SKIP: bearer-token method — proxy returned HTTP ${HTTP} (enable --skip-jwt-bearer-tokens on the MLflow proxy)"
+  fi
+else
+  echo "SKIP: bearer-token method — set DEX_CLIENT_SECRET to exercise it"
+fi
 
-echo "== read back and assert =="
-RUN="$(api GET "runs/get?run_id=${RID}")"
-OWNER="$(printf '%s' "${RUN}" | jq -r '.run.info.user_id')"
-STATUS="$(printf '%s' "${RUN}" | jq -r '.run.info.status')"
-PARAM="$(printf '%s' "${RUN}" | jq -r '.run.data.params[] | select(.key=="model_name") | .value')"
-echo "  run_id=${RID} owner=${OWNER} status=${STATUS} model_name=${PARAM}"
-[ "${STATUS}" = "FINISHED" ]   || { echo "FAIL: run not FINISHED"; exit 1; }
-[ "${PARAM}" = "qwen3-0.6b" ]  || { echo "FAIL: param not logged"; exit 1; }
-[ "${OWNER}" = "${EMAIL}" ]    || { echo "FAIL: run owner '${OWNER}' != caller identity '${EMAIL}'"; exit 1; }
+# ---------------------------------------------------------------------------
+# Leg 2: session cookie (no platform changes)
+# ---------------------------------------------------------------------------
+echo "== leg 2: mint _oauth2_proxy cookie via the proxy login =="
+JAR="$TMP/proxyjar.txt"; : > "$JAR"
+LOC="$(curl -sk -c "$JAR" -D - -o /dev/null "$P/clusters/${CLUSTER}/mlflow/" | awk 'tolower($1)=="location:"{print $2}' | tr -d '\r')"
+QS="${LOC#*\?}"
+[ "$QS" != "$LOC" ] || { echo "FAIL: MLflow route did not redirect to login"; exit 1; }
+REQ="$(curl -sk -b "$JAR" -c "$JAR" "$P/dex/api/v1/authorize?${QS}" | jq -r '.req // empty')"
+[ -n "$REQ" ] || { echo "FAIL: proxy authorize returned no req"; exit 1; }
+ENC="$(rsa_password)"
+CB="$(curl -sk -b "$JAR" -c "$JAR" -X POST "$P/dex/api/v1/authorize/local?req=${REQ}" -H 'Content-Type: application/json' \
+      --data "$(jq -nc --arg a "$MLFLOW_USERNAME" --arg p "$ENC" '{account:$a,password:$p}')" | jq -r '.redirect_url // empty')"
+[ -n "$CB" ] || { echo "FAIL: proxy login returned no callback url"; exit 1; }
+curl -sk -b "$JAR" -c "$JAR" -o /dev/null "$CB"
+COOKIE="$(awk -F'\t' '$6 ~ /^_oauth2_proxy/{printf "%s=%s; ",$6,$7}' "$JAR" | sed 's/; $//')"
+[ -n "$COOKIE" ] || { echo "FAIL: no _oauth2_proxy cookie minted"; exit 1; }
+run_and_assert "cookie" "Cookie: ${COOKIE}" "$EXPECT_OWNER"
+echo "PASS: session-cookie method (no platform changes)"
 
-echo "PASS: logged to MLflow as '${EMAIL}' through the OAuth proxy (password grant; no cookie, no container-port access)"
+echo "DONE: authenticated to MLflow through the OAuth proxy as '${EXPECT_OWNER}' — browser-free, no container-port access"

From 733bd711571ad1e6b78b8788ee3b8cfe2407b811 Mon Sep 17 00:00:00 2001
From: Wu Yi <typhoonzero1986@gmail.com>
Date: Wed, 17 Jun 2026 08:20:07 +0000
Subject: [PATCH 21/21] docs: troubleshoot the self-signed-cert SSL error
 (rejected credential -> 302 -> follows redirect to platform HTTPS)

The MLflow SDK reports SSLCertVerificationError when the proxy rejects the
credential: it 302s to the login page and the client follows it to the
platform's self-signed HTTPS endpoint. Document the real cause (fix the
credential, not the TLS) and note the in-cluster http:// Service URL plus
MLFLOW_TRACKING_INSECURE_TLS for the external route in the cookie section.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 docs/en/kubeflow/how_to/mlflow-python-sdk.mdx | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/docs/en/kubeflow/how_to/mlflow-python-sdk.mdx b/docs/en/kubeflow/how_to/mlflow-python-sdk.mdx
index bcceb67..3a7d086 100644
--- a/docs/en/kubeflow/how_to/mlflow-python-sdk.mdx
+++ b/docs/en/kubeflow/how_to/mlflow-python-sdk.mdx
@@ -200,6 +200,8 @@ mlflow.set_tracking_uri("https://<platform>/clusters/<cluster>/mlflow")
 mlflow.set_workspace("team-a")
 ```
 
+From **inside** the cluster, point at the in-cluster Service URL `http://mlflow-tracking-server.kubeflow:5000` instead (no TLS issues). On the external `https://<platform>/…` route the platform certificate is self-signed, so set `MLFLOW_TRACKING_INSECURE_TLS=true` (or point `REQUESTS_CA_BUNDLE` at the platform CA).
+
 You can also copy the `_oauth2_proxy` cookie from a browser session (DevTools → **Application/Storage → Cookies**). The session cookie expires — re-mint it when calls start returning a login redirect.
 
 ## Troubleshooting
@@ -211,6 +213,6 @@ You can also copy the `_oauth2_proxy` cookie from a browser session (DevTools 
 | `/dex/token` returns `invalid_grant` | The auth code or PKCE verifier is stale or reused. Re-run the flow from the start (`authorize` → login → token); codes are single-use. |
 | Call returns HTML or a redirect (`302` to the login page) | **Token method:** the proxy rejected the bearer token — confirm `--skip-jwt-bearer-tokens` is enabled and the token is a valid Dex id token (`aud` = the proxy's client). **Cookie method:** the `_oauth2_proxy` cookie is missing or expired. |
 | `Invalid … character(s) in header value: 'Bearer …\n'` | The token has trailing whitespace. Set `MLFLOW_TRACKING_TOKEN` to the `.strip()`-ed value. |
-| `Failed to query /api/3.0/mlflow/server-info` | The SDK could not reach the server through the proxy — verify the tracking URI and that the token/cookie is valid. |
+| `Failed to query /api/3.0/mlflow/server-info`, often with `SSLCertVerificationError: self-signed certificate` | The proxy **rejected** your credential and returned a `302` to the login page; the SDK then follows that redirect to the platform's self-signed HTTPS endpoint and fails TLS. Fix the credential, not the TLS: for the **token method** confirm `--skip-jwt-bearer-tokens` is enabled and the token is valid; for the **cookie method** re-mint the cookie. Only if you *intentionally* use the external `https://<platform>/…` route, also set `MLFLOW_TRACKING_INSECURE_TLS=true` (or point `REQUESTS_CA_BUNDLE` at the platform CA). |
 | `403 PERMISSION_DENIED` | Your account lacks access to the workspace namespace. Request access to the workspace (see [Workspace Access](./mlflow.mdx)); no ServiceAccount is involved. |
 | Run shows the wrong owner or workspace | The owner is your authenticated identity; the workspace is `set_workspace()` / `MLFLOW_WORKSPACE` (else the server default). Check both. |