Document Ascend runtime examples for custom inference services #194
Conversation
Walkthrough

This documentation update revises the custom inference runtime guide: it adds a new vLLM-ascend runtime section for Ascend NPU support, standardizes terminology, updates examples with security-context fields, reorders the MindIE section, and updates the runtime comparison table with validation notes for the Ascend 310P and 910B4 processors.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 1
Caution: Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
docs/en/model_inference/inference_service/how_to/custom_inference_runtime.mdx (1)
865-873: ⚠️ Potential issue | 🟡 Minor: Wording says "annotations" (plural) but only one annotation is listed.

The paragraph states MindIE must include "the following annotations" in the `InferenceService` metadata, but the table below only lists a single key (`storage.kserve.io/readonly`). Either use singular phrasing or add the other required NPU-related annotations that the comparison table on line 891 alludes to ("the required NPU annotations").

✏️ Suggested wording fix
```diff
-Unlike other runtimes, MindIE **must** include the following annotations in the
-`InferenceService` metadata during the final publishing step. This ensures that
-the platform scheduler correctly binds the NPU hardware to the service.
+Unlike other runtimes, MindIE **must** include the following annotation in the
+`InferenceService` metadata during the final publishing step. This ensures that
+the platform scheduler correctly binds the NPU hardware to the service.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/en/model_inference/inference_service/how_to/custom_inference_runtime.mdx` around lines 865 - 873, The heading and paragraph currently say "annotations" (plural) but only list a single key (`storage.kserve.io/readonly`) under the "2.Mandatory Annotations for InferenceService" section; update the text to use singular phrasing (e.g., "Mandatory Annotation for InferenceService") or expand the table to include the other NPU-related annotations referenced elsewhere (ensure consistency with the comparison table that mentions "required NPU annotations"); adjust the section title and the sentence that follows to match whichever option you choose so the wording and listed keys are consistent.
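If the singular phrasing is adopted, the guide's metadata example would carry just that one key. A minimal sketch, assuming `storage.kserve.io/readonly` is the only required key; the service name and the value `"false"` are illustrative, not taken from the guide:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: mindie-demo # hypothetical name, not from the guide
  annotations:
    # The single key listed in the guide's table; "false" (writable
    # model volume) is an assumed value chosen for illustration.
    storage.kserve.io/readonly: "false"
```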
🧹 Nitpick comments (1)
docs/en/model_inference/inference_service/how_to/custom_inference_runtime.mdx (1)
378-432: Consider using a YAML literal block (`|`) instead of a folded block (`>`) for the bash script.

The `command` entry uses folded scalar style (`- >`) with blank lines to force newlines. It works, but it's error-prone for multi-line shell scripts — a single forgotten blank line silently joins two statements with a space. The MindIE example in the same file uses `|` (literal block) for the same purpose, which is more robust and easier to maintain. Consider switching this block to `|` for consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/en/model_inference/inference_service/how_to/custom_inference_runtime.mdx` around lines 378 - 432, The YAML uses a folded scalar (->) for the multi-line shell under the command entry (the block starting with - bash - -c and containing MODEL_DIR/MODEL_PATH checks and the python3 -m vllm.entrypoints.openai.api_server launch), which can silently join lines when blank lines are present; change the folded style to a literal block (|) for that command so newlines are preserved, keep the same indentation and all blank lines unmodified, and ensure the bash script content (including MODEL_DIR, MODEL_PATH, the gguf detection logic and the python3 -m vllm.entrypoints.openai.api_server invocation) remains exactly the same inside the literal block.
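To illustrate the difference, a minimal sketch with a stand-in script body (the path and echo line are placeholders, not the guide's actual command):

```yaml
# Folded style (>): adjacent lines are joined with a space, so a blank
# line is required after each statement to force a newline; forgetting
# one silently merges two statements into a single command.
command:
  - bash
  - -c
  - >
    MODEL_DIR=/mnt/models

    echo "serving from $MODEL_DIR"
---
# Literal style (|): every newline is preserved exactly as written,
# so the script reads the same in YAML as it would in a shell file.
command:
  - bash
  - -c
  - |
    MODEL_DIR=/mnt/models
    echo "serving from $MODEL_DIR"
```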
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ac3aa6ee-68f5-4667-92a6-8ee2af60acbc
📒 Files selected for processing (1)
docs/en/model_inference/inference_service/how_to/custom_inference_runtime.mdx
```yaml
kind: InferenceService
apiVersion: serving.kserve.io/v1beta1
metadata:
  name: qwen35
  namespace: demo
  annotations:
    aml-model-repo: Qwen3.5-0.8B
    modelFormat: transformers
    serving.kserve.io/deploymentMode: Standard
  labels:
    aml.cpaas.io/runtime-type: vllm
spec:
  predictor:
    model:
      env:
        - name: HOME # [!code callout]
          value: /tmp
      modelFormat:
        name: transformers
      protocolVersion: v2
      resources:
        limits:
          cpu: "4"
          huawei.com/Ascend910B4: "1"
          memory: 16Gi
        requests:
          cpu: "2"
          memory: 8Gi
      runtime: aml-vllm-ascend-0.18.0rc1
      storageUri: pvc://qwen35/Qwen3.5-0.8B
```
Runtime name in the InferenceService example does not match the ClusterServingRuntime defined above.
On line 523 the example references `runtime: aml-vllm-ascend-0.18.0rc1`, but the ClusterServingRuntime defined in the preceding YAML (line 375) is named `aml-vllm-ascend-cann-8.5.1`. Users copy-pasting this example will hit a runtime-not-found error. Please align the two names (and, ideally, rename one of them so they encode a single identifier — either the CANN version or the image version, not both).
🔧 Proposed fix (pick one naming scheme and use it in both places)
```diff
-      runtime: aml-vllm-ascend-0.18.0rc1
+      runtime: aml-vllm-ascend-cann-8.5.1
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@docs/en/model_inference/inference_service/how_to/custom_inference_runtime.mdx`
around lines 494 - 524, The InferenceService example uses runtime:
aml-vllm-ascend-0.18.0rc1 but the ClusterServingRuntime defined earlier is named
aml-vllm-ascend-cann-8.5.1, causing a runtime-not-found error; pick a single
identifier scheme and make both the ClusterServingRuntime resource name and the
InferenceService.runtime field match (e.g., rename the ClusterServingRuntime to
aml-vllm-ascend-0.18.0rc1 or change InferenceService.runtime to
aml-vllm-ascend-cann-8.5.1) and ensure any references/annotations that embed the
version string are updated consistently.
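For reference, a minimal sketch of the aligned pair under the CANN-version scheme; most fields are omitted, and the abbreviated layout is illustrative — only the name linkage matters:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  # The name the InferenceService below must reference verbatim.
  name: aml-vllm-ascend-cann-8.5.1
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: qwen35
spec:
  predictor:
    model:
      # Must match the ClusterServingRuntime's metadata.name exactly,
      # otherwise KServe reports a runtime-not-found error.
      runtime: aml-vllm-ascend-cann-8.5.1
```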
/test-pass