Document Ascend runtime examples for custom inference services#194

Open
fyuan1316 wants to merge 1 commit into master from vllm-ascend

Conversation

Contributor

@fyuan1316 fyuan1316 commented Apr 20, 2026

Summary by CodeRabbit

  • Documentation
    • Added comprehensive documentation for vLLM-ascend (Ascend NPU) runtime with complete configuration examples and security context best practices.
    • Clarified inference service setup instructions, including explicit framework selection requirements.
    • Added hardware validation details and compatibility information for Ascend processors.
    • Standardized terminology and updated runtime references for consistency.


coderabbitai Bot commented Apr 20, 2026

Walkthrough

This documentation update revises the custom inference runtime guide, adding a new vLLM-ascend runtime section for Ascend NPU support, standardizing terminology, updating examples with security-context fields, reordering the MindIE section, and updating the runtime comparison table with validation notes for Ascend 310P and 910B4 processors.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Documentation Updates**<br/>`docs/en/model_inference/inference_service/how_to/custom_inference_runtime.mdx` | Added vLLM-ascend (Ascend NPU) runtime section with full ClusterServingRuntime YAML and InferenceService examples including security-context fields; updated Xinference guidance and terminology standardization (vLLM, Triton Inference Server); reordered MindIE section; added validation notes for Ascend 310P and 910B4; updated runtime comparison table with explicit rows for vLLM-ascend and MindIE. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested reviewers

  • typhoonzero
  • zhaomingkun1030

Poem

🐰 Down the rabbit hole of runtimes we hop,
vLLM-ascend takes the top!
With Ascend NPU shining so bright,
New examples guide us right,
Documentation's perfectly clear,
Inference magic is finally here! ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The title accurately describes the main change: adding documentation for Ascend runtime examples (vLLM-ascend and MindIE) for custom inference services. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch vllm-ascend

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/en/model_inference/inference_service/how_to/custom_inference_runtime.mdx (1)

865-873: ⚠️ Potential issue | 🟡 Minor

Wording says "annotations" (plural) but only one annotation is listed.

The paragraph states MindIE must include "the following annotations" in the InferenceService metadata, but the table below only lists a single key (storage.kserve.io/readonly). Either use singular phrasing or add the other required NPU-related annotations that the comparison table on line 891 alludes to ("the required NPU annotations").

✏️ Suggested wording fix
-Unlike other runtimes, MindIE **must** include the following annotations in the
-`InferenceService` metadata during the final publishing step. This ensures that
-the platform scheduler correctly binds the NPU hardware to the service.
+Unlike other runtimes, MindIE **must** include the following annotation in the
+`InferenceService` metadata during the final publishing step. This ensures that
+the platform scheduler correctly binds the NPU hardware to the service.
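
For reference, a minimal sketch of the metadata block the singular wording would describe (the service name and annotation value are assumptions for illustration; only the key comes from the table in the doc):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: mindie-example   # hypothetical service name, for illustration only
  annotations:
    # the single key the doc's table lists; value assumed here
    storage.kserve.io/readonly: "false"
```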
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@docs/en/model_inference/inference_service/how_to/custom_inference_runtime.mdx`
around lines 865 - 873, The heading and paragraph currently say "annotations"
(plural) but only list a single key (`storage.kserve.io/readonly`) under the
"2.Mandatory Annotations for InferenceService" section; update the text to use
singular phrasing (e.g., "Mandatory Annotation for InferenceService") or expand
the table to include the other NPU-related annotations referenced elsewhere
(ensure consistency with the comparison table that mentions "required NPU
annotations"); adjust the section title and the sentence that follows to match
whichever option you choose so the wording and listed keys are consistent.
🧹 Nitpick comments (1)
docs/en/model_inference/inference_service/how_to/custom_inference_runtime.mdx (1)

378-432: Consider using YAML literal block (|) instead of folded (>) for the bash script.

The command entry uses folded scalar style (- >) with blank lines to force newlines. It works, but it's error-prone for multi-line shell scripts — a single forgotten blank line silently joins two statements with a space. The MindIE example in the same file uses | (literal block) for the same purpose, which is more robust and easier to maintain. Consider switching this block to | for consistency.
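
To make the failure mode concrete, a minimal generic sketch (not taken from the doc under review; the key names are illustrative):

```yaml
# Folded scalar (>): newlines between equally indented lines fold into spaces,
# so the two statements below silently merge into one invalid shell line.
command-folded: >
  if [ -d "$MODEL_DIR" ]; then echo "model found"; fi
  echo "next step"

# Literal scalar (|): every newline is preserved, so each statement runs as written.
command-literal: |
  if [ -d "$MODEL_DIR" ]; then echo "model found"; fi
  echo "next step"
```

With `>`, the folded value ends in `...fi echo "next step"`, a bash syntax error; only a blank line between the statements would restore the newline, which is exactly the fragile behavior described above.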

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@docs/en/model_inference/inference_service/how_to/custom_inference_runtime.mdx`
around lines 378 - 432, The YAML uses a folded scalar (`- >`) for the multi-line
shell under the command entry (the block starting with - bash - -c and
containing MODEL_DIR/MODEL_PATH checks and the python3 -m
vllm.entrypoints.openai.api_server launch), which can silently join lines when
blank lines are present; change the folded style to a literal block (|) for that
command so newlines are preserved, keep the same indentation and all blank lines
unmodified, and ensure the bash script content (including MODEL_DIR, MODEL_PATH,
the gguf detection logic and the python3 -m vllm.entrypoints.openai.api_server
invocation) remains exactly the same inside the literal block.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ac3aa6ee-68f5-4667-92a6-8ee2af60acbc

📥 Commits

Reviewing files that changed from the base of the PR and between f3af7cd and c10ce17.

📒 Files selected for processing (1)
  • docs/en/model_inference/inference_service/how_to/custom_inference_runtime.mdx

Comment on lines +494 to +524
```yaml
kind: InferenceService
apiVersion: serving.kserve.io/v1beta1
metadata:
  name: qwen35
  namespace: demo
  annotations:
    aml-model-repo: Qwen3.5-0.8B
    modelFormat: transformers
    serving.kserve.io/deploymentMode: Standard
  labels:
    aml.cpaas.io/runtime-type: vllm
spec:
  predictor:
    model:
      env:
        - name: HOME # [!code callout]
          value: /tmp
      modelFormat:
        name: transformers
      protocolVersion: v2
      resources:
        limits:
          cpu: "4"
          huawei.com/Ascend910B4: "1"
          memory: 16Gi
        requests:
          cpu: "2"
          memory: 8Gi
      runtime: aml-vllm-ascend-0.18.0rc1
      storageUri: pvc://qwen35/Qwen3.5-0.8B
```

⚠️ Potential issue | 🔴 Critical

Runtime name in the InferenceService example does not match the ClusterServingRuntime defined above.

On line 523 the example references runtime: aml-vllm-ascend-0.18.0rc1, but the ClusterServingRuntime defined in the preceding YAML (line 375) is named aml-vllm-ascend-cann-8.5.1. Users copy-pasting this example will hit a runtime-not-found error. Please align the two names (and, ideally, rename one of them so they encode a single identifier — either the CANN version or the image version, not both).

🔧 Proposed fix (pick one naming scheme and use it in both places)
-      runtime: aml-vllm-ascend-0.18.0rc1
+      runtime: aml-vllm-ascend-cann-8.5.1
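
As a trimmed sketch of the aligned pair under the CANN-version naming scheme (most fields elided; only the matching names matter here):

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: aml-vllm-ascend-cann-8.5.1   # the name the InferenceService must reference
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
spec:
  predictor:
    model:
      runtime: aml-vllm-ascend-cann-8.5.1   # must match metadata.name above
```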
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@docs/en/model_inference/inference_service/how_to/custom_inference_runtime.mdx`
around lines 494 - 524, The InferenceService example uses runtime:
aml-vllm-ascend-0.18.0rc1 but the ClusterServingRuntime defined earlier is named
aml-vllm-ascend-cann-8.5.1, causing a runtime-not-found error; pick a single
identifier scheme and make both the ClusterServingRuntime resource name and the
InferenceService.runtime field match (e.g., rename the ClusterServingRuntime to
aml-vllm-ascend-0.18.0rc1 or change InferenceService.runtime to
aml-vllm-ascend-cann-8.5.1) and ensure any references/annotations that embed the
version string are updated consistently.

@liuwei-2622

/test-pass
