[Improve]: Remove dlblas from lmdeploy by RunningLeon · Pull Request #4682 · InternLM/lmdeploy

RunningLeon · 2026-06-16T11:42:16Z

Motivation

Remove dlblas from lmdeploy

Modification

Please briefly describe what modification is made in this PR.

BC-breaking (Optional)

Does the modification introduce changes that break the backward-compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.

Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.

Checklist

Pre-commit or other linting tools are used to fix the potential lint issues.
The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
The documentation has been modified accordingly, like docstring or example tutorials.

Copilot

Pull request overview

This PR removes the dlblas dependency from LMDeploy’s PyTorch CUDA MoE/EP stack by internalizing the previously dlblas-provided pieces (EPLB + EP FP8 kernel path), switching environment checks to DeepEP/DeepGEMM, and wiring a computed DeepEP token-limit through the build context/config.

Changes:

Replace dlblas environment checks/imports with deep_ep + deep_gemm checks and LMDeploy-owned DeepEP token dispatcher/buffer facade.
Add internal EPLB metadata + logical/physical expert mapping utilities and an EP FP8 fused MoE kernel implementation.
Thread max_batch_size through backend/build context to infer DeepEP max dispatch tokens per rank; add regression tests to ensure no dlblas imports.
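The "no dlblas imports" regression check mentioned above can be approximated with a static scan of the source. The following is a minimal sketch, not the test added in this PR: the helper name `find_forbidden_imports` is hypothetical, and it only inspects `import` and `from ... import` statements, so it would miss dynamic imports such as `importlib.import_module('dlblas')`.

```python
import ast


def find_forbidden_imports(source: str, forbidden: str = "dlblas") -> list[int]:
    """Return the line numbers of import statements that reference `forbidden`.

    Hypothetical helper for illustration; matches the package itself and any
    of its submodules, but not unrelated packages sharing the same prefix.
    """
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # `node.module` is None for relative imports like `from . import x`.
            names = [node.module or ""]
        else:
            continue
        if any(n == forbidden or n.startswith(forbidden + ".") for n in names):
            hits.append(node.lineno)
    return sorted(hits)
```

A test could read each `.py` file under the package directory and assert that the returned list is empty for every file.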

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| tests/pytorch/test_remove_dlblas.py | Adds regression/unit tests covering removal of `dlblas` imports and new DeepEP token-limit behavior. |
| lmdeploy/utils.py | Replaces `is_dlblas_installed()` with `is_deep_ep_installed()` / `is_deep_gemm_installed()`. |
| lmdeploy/serve/openai/responses/protocol.py | Adds compatibility fallback for `StreamOptions` import across OpenAI SDK versions. |
| lmdeploy/pytorch/nn/moe/default.py | Passes inferred DeepEP dispatch token-limit into MoE backend builder. |
| lmdeploy/pytorch/nn/moe/blocked_fp8.py | Passes inferred DeepEP dispatch token-limit + fp8 dtype into blocked-FP8 MoE backend builder and forwards quant args. |
| lmdeploy/pytorch/nn/eplb.py | Introduces LMDeploy-owned EPLB implementation replacing `dlblas` EPLB usage. |
| lmdeploy/pytorch/model_inputs.py | Adds `max_batch_size` and computes `deep_ep_max_tokens_per_rank`. |
| lmdeploy/pytorch/kernels/cuda/fused_moe_ep_fp8.py | Adds EP FP8 fused MoE kernel path (ported from `dlblas`). |
| lmdeploy/pytorch/kernels/cuda/blocked_gemm_fp8.py | Adds `per_token_group_quant_fp8()` utility. |
| lmdeploy/pytorch/kernels/cuda/activation.py | Adds masked SiLU+mul post-quant helper for FP8 path. |
| lmdeploy/pytorch/envs.py | Removes the old `DEEPEP_MAX_TOKENS_PER_RANK` read path; keeps other DeepEP envs. |
| lmdeploy/pytorch/engine/model_agent/agent.py | Passes backend `max_batch_size` into `BuildModelContext`. |
| lmdeploy/pytorch/engine/config_builder.py | Propagates `max_batch_size` into `BackendConfig`. |
| lmdeploy/pytorch/config.py | Adds `max_batch_size` to `BackendConfig`. |
| lmdeploy/pytorch/check_env/dist.py | Updates EP>1 validation to require DeepEP + DeepGEMM instead of `dlblas`. |
| lmdeploy/pytorch/backends/moe.py | Extends MoE builder interfaces with token-limit and FP8 dtype parameters. |
| lmdeploy/pytorch/backends/cuda/token_dispatcher.py | Adds LMDeploy-owned DeepEP buffer facade + normal dispatcher; adjusts low-latency dispatcher to accept explicit token limit. |
| lmdeploy/pytorch/backends/cuda/moe/default.py | Switches DeepEP dispatcher imports from `dlblas` to LMDeploy implementation and threads token-limit through. |
| lmdeploy/pytorch/backends/cuda/moe/blocked_fp8.py | Internalizes EP FP8 path and DeepEP dispatchers; introduces new normal/low-latency EP FP8 implementations. |
| lmdeploy/pytorch/backends/cuda/graph_runner.py | Switches DeepEP buffer imports from `dlblas` to LMDeploy implementation. |
| docker/install.sh | Removes `dlblas` install from the Docker image build. |
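The summary describes availability checks for `deep_ep` and `deep_gemm` and an EP>1 validation that requires both. One way such checks are commonly written is sketched below. This is an illustration under assumptions, not the code in `lmdeploy/utils.py` or `lmdeploy/pytorch/check_env/dist.py`: the names `is_module_available` and `check_ep_requirements` are hypothetical, and only the module names `deep_ep` and `deep_gemm` come from the PR summary.

```python
import importlib.util


def is_module_available(name: str) -> bool:
    """Report whether `name` can be located without actually importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec can raise for dotted names with a missing parent package,
        # or for modules whose __spec__ is unset.
        return False


def check_ep_requirements(ep_size: int, available=is_module_available) -> list[str]:
    """Return the modules still missing for expert parallelism (hypothetical).

    With ep_size <= 1 nothing extra is required; otherwise both DeepEP and
    DeepGEMM must be present, mirroring the rule stated in the summary.
    """
    if ep_size <= 1:
        return []
    return [m for m in ("deep_ep", "deep_gemm") if not available(m)]
```

Passing the probe function as a parameter keeps the rule testable on machines where neither package is installed.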

+        if not torch.compiler.is_compiling():
+            refcount = sys.getrefcount(self._value)
+            assert refcount == 2, f'refcount={refcount}'


+        cls._deepep_sms = int(os.getenv('DEEPEP_SMS', cls._deepep_sms))
+        cls._allow_mnnvl = os.getenv('DEEPEP_ENABLE_MNNVL', '1') != '0'
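The behavior of the two environment reads in the hunk above can be reproduced in isolation. The sketch below assumes a hypothetical helper `read_deepep_env` and an arbitrary default SM count; only the variable names `DEEPEP_SMS` and `DEEPEP_ENABLE_MNNVL` and the parsing expressions are taken from the diff.

```python
import os


def read_deepep_env(default_sms: int, env=None):
    """Parse the two DeepEP settings the same way the hunk does.

    `env` defaults to os.environ; a plain dict can be passed for testing.
    """
    env = os.environ if env is None else env
    # int() accepts both the string from the environment and the int default.
    sms = int(env.get("DEEPEP_SMS", default_sms))
    # Only the literal string '0' disables MNNVL; any other value keeps it on.
    allow_mnnvl = env.get("DEEPEP_ENABLE_MNNVL", "1") != "0"
    return sms, allow_mnnvl
```

Two consequences follow from this parsing: a non-numeric `DEEPEP_SMS` raises `ValueError`, and values such as `false` or an empty string for `DEEPEP_ENABLE_MNNVL` leave the feature enabled.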


+def init_global_eplb_metadata(ep_size: int, num_routed_experts: int, num_hidden_layers: int):
+    global _global_eplb_metadata
+    assert _global_eplb_metadata is None
+    _global_eplb_metadata = EPLBMetadata.init(ep_size=ep_size,
+                                             num_routed_experts=num_routed_experts,
+                                             num_hidden_layers=num_hidden_layers)


+def get_global_eplb_metadata():
+    global _global_eplb_metadata
+    assert _global_eplb_metadata is not None
+    return _global_eplb_metadata
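The two functions above form an initialize-once, read-many module global. A self-contained sketch of the same pattern follows, with two deliberate differences that are assumptions rather than the PR's code: a hypothetical `DemoMetadata` dataclass stands in for `EPLBMetadata` (whose fields are not shown in the diff), and explicit exceptions replace the `assert` statements, since asserts are skipped when Python runs with `-O`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DemoMetadata:
    """Hypothetical stand-in for EPLBMetadata; the real fields may differ."""
    ep_size: int
    num_routed_experts: int
    num_hidden_layers: int


_global_metadata: Optional[DemoMetadata] = None


def init_global_metadata(ep_size: int, num_routed_experts: int, num_hidden_layers: int) -> None:
    """Create the process-wide metadata exactly once."""
    global _global_metadata
    if _global_metadata is not None:
        raise RuntimeError("global metadata is already initialized")
    _global_metadata = DemoMetadata(ep_size, num_routed_experts, num_hidden_layers)


def get_global_metadata() -> DemoMetadata:
    """Return the metadata, failing loudly if init was never called."""
    # No `global` statement is needed here because the name is only read.
    if _global_metadata is None:
        raise RuntimeError("global metadata has not been initialized")
    return _global_metadata
```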


+        if cls._buffer_common is not None:
+            return cls._buffer_common

-    num_rdma_bytes = max(
-        Buffer.get_low_latency_rdma_size_hint(num_max_dispatch_tokens_per_rank, hidden, group.size(), num_experts),
-        num_rdma_bytes)
+        cls.update_parameters(hidden, num_experts)
+        num_max_dispatch_tokens_per_rank = num_max_dispatch_tokens_per_rank or cls._num_max_dispatch_tokens_per_rank


 from lmdeploy.pytorch.kernels.cuda.fused_moe import _renormalize
+from lmdeploy.pytorch.kernels.cuda.fused_moe_ep_fp8 import fused_moe_v3_fp8
 from lmdeploy.pytorch.model_inputs import get_step_ctx_manager


remove dlblas

e7bf0ec

Copilot AI review requested due to automatic review settings June 16, 2026 11:42

Copilot started reviewing on behalf of RunningLeon June 16, 2026 11:42

Copilot AI reviewed Jun 16, 2026

RunningLeon changed the title ~~[WIP]: Remove dlblas from lmdeploy~~ [Improve]: Remove dlblas from lmdeploy Jun 18, 2026

RunningLeon wants to merge 1 commit into InternLM:main from RunningLeon:remove-dlbas

2 participants
