feat(turbomind): support priority schedule policy by 4mengy · Pull Request #4614 · InternLM/lmdeploy

4mengy · 2026-05-22T11:10:56Z

Motivation

TurboMind currently schedules requests in FIFO order, which makes it hard to differentiate latency-sensitive online traffic from background or lower-priority workloads in shared serving scenarios.

This PR introduces request-level priority scheduling for the TurboMind backend, allowing high-priority requests to be admitted earlier while keeping the default FIFO behavior unchanged.

Modification

- Add priority validation and plumbing through GenerationConfig, OpenAI-compatible request protocols, API server handling, pybind, and TurboMind GenerationConfig.
- Add schedule_policy support for TurboMind engines, with fifo as the default policy and priority as the new priority scheduling policy.
- Implement FIFO and non-preemptive priority request queues in TurboMind.
- Use request priority during engine materialization while preserving already scheduled requests to avoid unnecessary preemption and KV cache swapping.
- Add Python validation tests, C++ request queue tests, and priority scheduling documentation in both English and Chinese.
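
The validation rule described in this PR's documentation (an integer in `[0, 255]`, with strings, floats, and booleans rejected) can be sketched as below. This is a minimal illustration, not the code added by the PR: `validate_priority` is a hypothetical helper name, and the choice of `TypeError` versus `ValueError` is an assumption.

```python
def validate_priority(value):
    """Return `value` if it is an int in [0, 255]; raise otherwise.

    Hypothetical helper for illustration only; not the PR's actual code.
    """
    # bool is a subclass of int in Python, so it must be excluded explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError(f"priority must be an integer, got {type(value).__name__}")
    if not 0 <= value <= 255:
        raise ValueError(f"priority must be in [0, 255], got {value}")
    return value
```

The explicit `bool` check matters because `isinstance(True, int)` is true in Python, so a plain integer check would silently accept `True` as priority `1`.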

BC-breaking (Optional)

No backward compatibility break is introduced.

The default scheduling policy remains fifo, so existing users keep the same scheduling behavior unless they explicitly enable schedule_policy='priority'.

Use cases (Optional)

This feature is useful for serving workloads that mix different traffic classes, for example:

- prioritizing interactive or latency-sensitive requests over background batch jobs;
- serving high-priority users or internal system requests ahead of regular traffic;
- allowing delay-tolerant requests to use lower priority while sharing the same TurboMind service.

Documentation has been added to describe how to enable --schedule-policy priority and how to set request-level priority.

Checklist

Pre-commit or other linting tools are used to fix the potential lint issues.
The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
The documentation has been modified accordingly, like docstring or example tutorials.

Validation

Ran the focused tests added by this PR, including Python validation tests for request priority and C++ request queue tests.

The full unit test suite was also attempted, but some existing tests failed. These failures are not introduced by this PR and are unrelated to the priority scheduling changes.

Add request priority validation and plumbing through Python configs, OpenAI protocol, pybind, and TurboMind GenerationConfig. Introduce schedule_policy for TurboMind engines, implement FIFO and non-preemptive priority request queues, and use request priority in engine materialization ordering while preserving already scheduled requests. Add focused Python validation tests, C++ request queue tests, and priority scheduling docs.

4mengy · 2026-05-29T02:05:22Z

Hi @windreamer, to move this PR forward, could you please let me know if any further work or adjustments are needed? Thanks!

windreamer · 2026-05-29T02:06:38Z

cc @lvhan028

windreamer · 2026-05-29T02:07:44Z

Do you think there is any risk of request starvation?

4mengy · 2026-05-29T02:18:01Z

Yes, there is a potential starvation risk for lower-priority requests under sustained high-priority traffic.

The current policy is strict priority when admitting requests from the waiting queue: lower priority values are always selected first. Once a request has already started, we try to keep it ahead of new requests, so an already-running low-priority request should not be starved by newly-arriving high-priority requests. The main risk is for low-priority requests that are still queued and have not entered the engine yet.
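
The starvation risk for queued requests can be seen in a small simulation. This is a sketch under stated assumptions, not TurboMind's scheduler: it models only the waiting queue, admits one request per tick, and uses made-up request names. `simulate_strict_priority` is a hypothetical function.

```python
import heapq


def simulate_strict_priority(ticks):
    """Admit one request per tick from a strict-priority waiting queue.

    One low-priority request arrives first; after that, one high-priority
    request arrives every tick. Returns (admitted_ids, still_waiting_ids).
    """
    waiting = []  # min-heap of (priority, arrival_order, request_id)
    heapq.heappush(waiting, (255, 0, "batch-0"))
    admitted = []
    for tick in range(ticks):
        heapq.heappush(waiting, (0, tick + 1, f"online-{tick}"))
        # The smallest priority value wins; arrival order only breaks ties.
        admitted.append(heapq.heappop(waiting)[2])
    return admitted, [request_id for _, _, request_id in waiting]


admitted, still_waiting = simulate_strict_priority(3)
```

For as long as a high-priority request arrives every tick, `batch-0` stays in `still_waiting` even though it arrived first.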

This is opt-in because the default policy remains fifo. With schedule_policy='priority', the behavior is intentional, but it does not currently include aging, quotas, deadlines, or weighted fairness. So if we need eventual service guarantees for low-priority traffic, we should add one of those mechanisms or enforce admission limits for high-priority requests.

We intentionally kept the initial implementation as strict priority scheduling. We did consider the fairness mechanisms above, but did not include them in the first version because they would add scheduler complexity. These mechanisms are also highly business-dependent; fully adapting them to different traffic patterns may require quite a few tunable parameters. If we need stronger eventual-service guarantees, I can help implement one of those strategies.
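
For reference, aging (one of the mechanisms that is not part of this PR) could look roughly like the sketch below. Everything here is an assumption made for illustration: the `Waiting` record, the logical tick clock, and the `aging_interval` parameter are all hypothetical, and the linear scan is chosen for clarity rather than speed.

```python
from dataclasses import dataclass


@dataclass
class Waiting:
    request_id: str
    priority: int     # 0 is highest, 255 is lowest
    enqueued_at: int  # logical tick at which the request arrived


def effective_priority(req, now, aging_interval):
    """Improve priority by one level per `aging_interval` ticks spent waiting."""
    return max(0, req.priority - (now - req.enqueued_at) // aging_interval)


def pick_next(waiting, now, aging_interval=10):
    """Remove and return the request with the best effective priority (O(n) scan)."""
    best = min(
        waiting,
        key=lambda r: (effective_priority(r, now, aging_interval), r.enqueued_at),
    )
    waiting.remove(best)
    return best
```

With `aging_interval=10`, a request submitted at priority `3` reaches effective priority `0` after 30 ticks and then wins ties against newer priority-`0` requests because it arrived earlier. The interval is exactly the kind of business-dependent tunable mentioned above.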

windreamer · 2026-05-29T02:53:17Z

Quite reasonable — thanks for the detailed analysis. The maintainers are pretty swamped with the next LLM release right now, so we’ll likely need a little time to sync up and push this forward. To help us prioritize when we do pick this up: do you have a specific deadline or target timeline on your end? Also, could you share a bit more about the use case or business context driving this — e.g., is there a particular workload or SLA constraint this is meant to address? That would help us focus the discussion when we get to it.

lvhan028 · 2026-05-29T04:02:34Z

I will discuss with @lzhangzz asap

4mengy · 2026-05-29T09:55:41Z

Thanks, there is no hard deadline on our side, so it is fine to wait until the maintainers have bandwidth.

The main use case is mixed online/offline deployment on the same inference cluster. Online requests are latency-sensitive, while offline batch jobs are throughput-oriented and only need to meet a loose T+1 requirement. Priority scheduling lets us give online traffic higher priority without splitting clusters or over-provisioning resources.

Copilot

Pull request overview

This PR adds request-level priority scheduling to the TurboMind backend, enabling non-preemptive priority admission while keeping the default FIFO behavior unchanged unless explicitly enabled.

Changes:

Add priority to request/GenerationConfig plumbing (OpenAI protocol + API server + Python/TurboMind bindings + C++ config serialization).
Add schedule_policy (fifo/priority) to TurboMind engine config and propagate it into Gateway/RequestQueue and engine materialization scheduling.
Add unit tests for Python validation/OpenAI protocol and C++ request queue behavior, plus EN/ZH documentation for priority scheduling.

Reviewed changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tests/test_lmdeploy/test_openai_protocol_priority.py | Adds OpenAI protocol validation tests for `priority`. |
| tests/test_lmdeploy/test_messages.py | Adds validation tests for `GenerationConfig.priority` and `TurbomindEngineConfig.schedule_policy`. |
| src/turbomind/turbomind.cc | Parses schedule policy from config and passes it to Gateway. |
| src/turbomind/python/bind.cpp | Exposes TurboMind `GenerationConfig.priority` to Python. |
| src/turbomind/engine/test_request_queue.cc | Adds Catch2 tests for FIFO vs priority queue ordering and schedule policy parsing. |
| src/turbomind/engine/schedule_policy.h | Introduces `SchedulePolicy` enum + string parser. |
| src/turbomind/engine/request.h | Adds C++ `GenerationConfig.priority` + request cache `scheduled` flag + serialization. |
| src/turbomind/engine/request.cc | Updates `GenerationConfig` debug printing to include priority. |
| src/turbomind/engine/request_queue.h | Refactors request queue into FIFO vs priority implementations behind a factory. |
| src/turbomind/engine/request_queue.cc | Implements `RequestQueue::create` factory. |
| src/turbomind/engine/gateway.h | Extends Gateway to accept/store schedule policy. |
| src/turbomind/engine/gateway.cc | Creates per-queue request queues based on schedule policy. |
| src/turbomind/engine/engine.cc | Applies priority (and “already scheduled” preference) during sequence materialization. |
| src/turbomind/engine/engine_config.h | Adds `schedule_policy` to engine config with default `fifo`. |
| src/turbomind/engine/CMakeLists.txt | Adds `test_request_queue` executable under `BUILD_TEST`. |
| lmdeploy/turbomind/turbomind.py | Plumbs schedule policy into EngineConfig and priority into TurboMind GenerationConfig. |
| lmdeploy/serve/openai/protocol.py | Adds `priority` field to OpenAI-compatible request schemas with strict int validation. |
| lmdeploy/serve/openai/api_server.py | Passes `priority` into `GenerationConfig` for chat/completions endpoints. |
| lmdeploy/messages.py | Adds `GenerationConfig.priority` and `TurbomindEngineConfig.schedule_policy` with validation/docs. |
| lmdeploy/cli/utils.py | Adds `--schedule-policy` CLI flag. |
| lmdeploy/cli/serve.py | Wires CLI arg into `TurbomindEngineConfig`. |
| lmdeploy/cli/cli.py | Adds schedule policy flag to chat CLI path. |
| docs/zh_cn/index.rst | Adds new TurboMind priority scheduling doc page to TOC. |
| docs/zh_cn/advance/turbomind_priority_scheduling.md | New Chinese documentation for priority scheduling usage/semantics. |
| docs/en/index.rst | Adds new TurboMind priority scheduling doc page to TOC. |
| docs/en/advance/turbomind_priority_scheduling.md | New English documentation for priority scheduling usage/semantics. |

+
+## Set Request Priority
+
+Set request priority with `priority`. The valid range is `[0, 255]`. Smaller values have higher priority; `0` is the highest priority and `255` is the lowest priority. The default value is `0`. `priority` must be an integer; out-of-range values and strings, floats, booleans, or other non-integer types are rejected by validation.


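+
+## Set Request Priority
+
+Request priority is set with `priority`; the valid range is `[0, 255]`. The smaller the value, the higher the priority; `0` is the highest priority and `255` is the lowest. When unset, the default is `0`. `priority` must be an integer; out-of-range values and types such as strings, floats, or booleans are rejected by validation.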


+std::unique_ptr<RequestQueue> RequestQueue::create(SchedulePolicy schedule_policy)
+{
+    if (schedule_policy == SchedulePolicy::kFifo) {
+        return std::make_unique<FifoRequestQueue>();
+    }
+    return std::make_unique<PriorityRequestQueue>();
+}
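
The factory in the hunk above selects one of two queue implementations behind a shared interface. A rough Python analogue is sketched below, assuming a simple `push`/`pop` interface and string policy names; these are stand-ins for illustration, not the real `RequestQueue` API.

```python
import heapq
import itertools
from collections import deque


class FifoRequestQueue:
    """Pops requests strictly in arrival order."""

    def __init__(self):
        self._items = deque()

    def push(self, request_id, priority=0):
        self._items.append(request_id)  # priority is ignored under FIFO

    def pop(self):
        return self._items.popleft()


class PriorityRequestQueue:
    """Pops the smallest priority value first; FIFO within a priority level."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()

    def push(self, request_id, priority=0):
        heapq.heappush(self._heap, (priority, next(self._arrival), request_id))

    def pop(self):
        return heapq.heappop(self._heap)[2]


def create_request_queue(schedule_policy="fifo"):
    """Build the queue matching `schedule_policy` ('fifo' or 'priority')."""
    queues = {"fifo": FifoRequestQueue, "priority": PriorityRequestQueue}
    if schedule_policy not in queues:
        raise ValueError(f"unknown schedule policy: {schedule_policy!r}")
    return queues[schedule_policy]()
```

One difference from the C++ hunk: there, any value other than `kFifo` falls through to the priority queue, whereas this sketch rejects unknown policy names. Either choice is defensible; rejecting makes a misconfiguration visible earlier.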


+uint64_t make_schedule_key(SchedulePolicy policy, bool scheduled, uint8_t priority, uint64_t arrival_order)
+{
+    switch (policy) {
+        case SchedulePolicy::kFifo:
+            return arrival_order;
+        case SchedulePolicy::kPriority:
+            const uint64_t order           = arrival_order & ((uint64_t{1} << 48) - 1);
+            const uint64_t scheduled_field = scheduled ? 0x00 : 0x0F;
+            return (scheduled_field << 56) | (uint64_t{priority} << 48) | order;
+    }
+    return arrival_order;
+}
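
To make the bit layout of the key above easier to check, here is a direct Python port of the same packing. Bits 56 to 59 hold the scheduled field (`0x0F` when the request has not been scheduled yet), bits 48 to 55 hold the priority, and bits 0 to 47 hold the arrival order. The string policy names are stand-ins for the `SchedulePolicy` enum.

```python
ORDER_BITS = 48
ORDER_MASK = (1 << ORDER_BITS) - 1


def make_schedule_key(policy, scheduled, priority, arrival_order):
    """Pack (scheduled, priority, arrival_order) into one sortable integer.

    Smaller keys are served first. `policy` is 'fifo' or 'priority'.
    """
    if policy == "fifo":
        return arrival_order
    # Already-scheduled requests get 0x00 here, so they sort ahead of all new ones.
    scheduled_field = 0x00 if scheduled else 0x0F
    return (scheduled_field << 56) | (priority << 48) | (arrival_order & ORDER_MASK)


# An already-scheduled low-priority request still sorts ahead of a new
# high-priority request, which is the non-preemptive behaviour described in the PR.
running_low = make_schedule_key("priority", True, 255, 5)
new_high = make_schedule_key("priority", False, 0, 1)
assert running_low < new_high
```

Because the arrival order is masked to 48 bits, tie-breaking within one priority level wraps around after 2^48 arrivals.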


4mengy marked this pull request as ready for review May 22, 2026 11:12

lvhan028 requested a review from Copilot June 16, 2026 06:07

Copilot started reviewing on behalf of lvhan028 June 16, 2026 06:08

Copilot AI reviewed Jun 16, 2026

4mengy wants to merge 1 commit into InternLM:main from 4mengy:feat-priority-request-scheduling
