Document Ascend runtime examples for custom inference services#194

Open
fyuan1316 wants to merge 1 commit into master from vllm-ascend

Conversation

Contributor

@fyuan1316 fyuan1316 commented Apr 20, 2026

Summary by CodeRabbit

  • Documentation
    • Added comprehensive documentation for vLLM-ascend (Ascend NPU) runtime with complete configuration examples and security context best practices.
    • Clarified inference service setup instructions, including explicit framework selection requirements.
    • Added hardware validation details and compatibility information for Ascend processors.
    • Standardized terminology and updated runtime references for consistency.


coderabbitai Bot commented Apr 20, 2026

Walkthrough

This documentation update revises the custom inference runtime guide, adding a new vLLM-ascend runtime section for Ascend NPU support, standardizing terminology, updating examples with security-context fields, reordering the MindIE section, and updating the runtime comparison table with validation notes for Ascend 310P and 910B4 processors.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Documentation Updates**<br/>`docs/en/model_inference/inference_service/how_to/custom_inference_runtime.mdx` | Added vLLM-ascend (Ascend NPU) runtime section with full ClusterServingRuntime YAML and InferenceService examples including security-context fields; updated Xinference guidance and terminology standardization (vLLM, Triton Inference Server); reordered MindIE section; added validation notes for Ascend 310P and 910B4; updated runtime comparison table with explicit rows for vLLM-ascend and MindIE. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested reviewers

  • typhoonzero
  • zhaomingkun1030

Poem

🐰 Down the rabbit hole of runtimes we hop,
vLLM-ascend takes the top!
With Ascend NPU shining so bright,
New examples guide us right,
Documentation's perfectly clear,
Inference magic is finally here! ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The title accurately describes the main change: adding documentation for Ascend runtime examples (vLLM-ascend and MindIE) for custom inference services. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch vllm-ascend

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/en/model_inference/inference_service/how_to/custom_inference_runtime.mdx (1)

865-873: ⚠️ Potential issue | 🟡 Minor

Wording says "annotations" (plural) but only one annotation is listed.

The paragraph states MindIE must include "the following annotations" in the InferenceService metadata, but the table below only lists a single key (storage.kserve.io/readonly). Either use singular phrasing or add the other required NPU-related annotations that the comparison table on line 891 alludes to ("the required NPU annotations").

✏️ Suggested wording fix
-Unlike other runtimes, MindIE **must** include the following annotations in the
-`InferenceService` metadata during the final publishing step. This ensures that
-the platform scheduler correctly binds the NPU hardware to the service.
+Unlike other runtimes, MindIE **must** include the following annotation in the
+`InferenceService` metadata during the final publishing step. This ensures that
+the platform scheduler correctly binds the NPU hardware to the service.
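
For reference, a minimal sketch of the metadata block the singular wording would describe (the service name and annotation value are assumptions for illustration; only the key comes from the table in the doc):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: mindie-example   # hypothetical service name, for illustration only
  annotations:
    # the single key the doc's table lists; value assumed here
    storage.kserve.io/readonly: "false"
```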
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@docs/en/model_inference/inference_service/how_to/custom_inference_runtime.mdx`
around lines 865 - 873, The heading and paragraph currently say "annotations"
(plural) but only list a single key (`storage.kserve.io/readonly`) under the
"2.Mandatory Annotations for InferenceService" section; update the text to use
singular phrasing (e.g., "Mandatory Annotation for InferenceService") or expand
the table to include the other NPU-related annotations referenced elsewhere
(ensure consistency with the comparison table that mentions "required NPU
annotations"); adjust the section title and the sentence that follows to match
whichever option you choose so the wording and listed keys are consistent.
🧹 Nitpick comments (1)
docs/en/model_inference/inference_service/how_to/custom_inference_runtime.mdx (1)

378-432: Consider using YAML literal block (|) instead of folded (>) for the bash script.

The command entry uses folded scalar style (- >) with blank lines to force newlines. It works, but it's error-prone for multi-line shell scripts — a single forgotten blank line silently joins two statements with a space. The MindIE example in the same file uses | (literal block) for the same purpose, which is more robust and easier to maintain. Consider switching this block to | for consistency.
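
To make the failure mode concrete, a minimal generic sketch (not taken from the doc under review; the key names are illustrative):

```yaml
# Folded scalar (>): newlines between equally indented lines fold into spaces,
# so the two statements below silently merge into one invalid shell line.
command-folded: >
  if [ -d "$MODEL_DIR" ]; then echo "model found"; fi
  echo "next step"

# Literal scalar (|): every newline is preserved, so each statement runs as written.
command-literal: |
  if [ -d "$MODEL_DIR" ]; then echo "model found"; fi
  echo "next step"
```

With `>`, the folded value ends in `...fi echo "next step"`, a bash syntax error; only a blank line between the statements would restore the newline, which is exactly the fragile behavior described above.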

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@docs/en/model_inference/inference_service/how_to/custom_inference_runtime.mdx`
around lines 378 - 432, The YAML uses a folded scalar (`- >`) for the multi-line
shell under the command entry (the block starting with - bash - -c and
containing MODEL_DIR/MODEL_PATH checks and the python3 -m
vllm.entrypoints.openai.api_server launch), which can silently join lines when
blank lines are present; change the folded style to a literal block (|) for that
command so newlines are preserved, keep the same indentation and all blank lines
unmodified, and ensure the bash script content (including MODEL_DIR, MODEL_PATH,
the gguf detection logic and the python3 -m vllm.entrypoints.openai.api_server
invocation) remains exactly the same inside the literal block.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ac3aa6ee-68f5-4667-92a6-8ee2af60acbc

📥 Commits

Reviewing files that changed from the base of the PR and between f3af7cd and c10ce17.

📒 Files selected for processing (1)
  • docs/en/model_inference/inference_service/how_to/custom_inference_runtime.mdx

Comment on lines +494 to +524
```yaml
kind: InferenceService
apiVersion: serving.kserve.io/v1beta1
metadata:
  name: qwen35
  namespace: demo
  annotations:
    aml-model-repo: Qwen3.5-0.8B
    modelFormat: transformers
    serving.kserve.io/deploymentMode: Standard
  labels:
    aml.cpaas.io/runtime-type: vllm
spec:
  predictor:
    model:
      env:
        - name: HOME # [!code callout]
          value: /tmp
      modelFormat:
        name: transformers
      protocolVersion: v2
      resources:
        limits:
          cpu: "4"
          huawei.com/Ascend910B4: "1"
          memory: 16Gi
        requests:
          cpu: "2"
          memory: 8Gi
      runtime: aml-vllm-ascend-0.18.0rc1
      storageUri: pvc://qwen35/Qwen3.5-0.8B
```

⚠️ Potential issue | 🔴 Critical

Runtime name in the InferenceService example does not match the ClusterServingRuntime defined above.

On line 523 the example references runtime: aml-vllm-ascend-0.18.0rc1, but the ClusterServingRuntime defined in the preceding YAML (line 375) is named aml-vllm-ascend-cann-8.5.1. Users copy-pasting this example will hit a runtime-not-found error. Please align the two names (and, ideally, rename one of them so they encode a single identifier — either the CANN version or the image version, not both).

🔧 Proposed fix (pick one naming scheme and use it in both places)
-      runtime: aml-vllm-ascend-0.18.0rc1
+      runtime: aml-vllm-ascend-cann-8.5.1
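
As a trimmed sketch of the aligned pair under the CANN-version naming scheme (most fields elided; only the matching names matter here):

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: aml-vllm-ascend-cann-8.5.1   # the name the InferenceService must reference
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
spec:
  predictor:
    model:
      runtime: aml-vllm-ascend-cann-8.5.1   # must match metadata.name above
```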
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@docs/en/model_inference/inference_service/how_to/custom_inference_runtime.mdx`
around lines 494 - 524, The InferenceService example uses runtime:
aml-vllm-ascend-0.18.0rc1 but the ClusterServingRuntime defined earlier is named
aml-vllm-ascend-cann-8.5.1, causing a runtime-not-found error; pick a single
identifier scheme and make both the ClusterServingRuntime resource name and the
InferenceService.runtime field match (e.g., rename the ClusterServingRuntime to
aml-vllm-ascend-0.18.0rc1 or change InferenceService.runtime to
aml-vllm-ascend-cann-8.5.1) and ensure any references/annotations that embed the
version string are updated consistently.

@liuwei-2622

/test-pass
