
[Improve]: Remove dlblas from lmdeploy #4682

Open
RunningLeon wants to merge 1 commit into InternLM:main from RunningLeon:remove-dlbas

Conversation

@RunningLeon

Collaborator

Motivation

Remove dlblas from lmdeploy

Modification

Please briefly describe what modification is made in this PR.

BC-breaking (Optional)

Does the modification introduce changes that break the backward-compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.

Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.

Checklist

  1. Pre-commit or other linting tools are used to fix the potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  3. If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
  4. The documentation has been modified accordingly, like docstring or example tutorials.

Copilot AI review requested due to automatic review settings June 16, 2026 11:42

Copilot AI left a comment

Contributor


Pull request overview

This PR removes the dlblas dependency from LMDeploy’s PyTorch CUDA MoE/EP stack by internalizing the previously dlblas-provided pieces (EPLB + EP FP8 kernel path), switching environment checks to DeepEP/DeepGEMM, and wiring a computed DeepEP token-limit through the build context/config.

Changes:

  • Replace dlblas environment checks/imports with deep_ep + deep_gemm checks and LMDeploy-owned DeepEP token dispatcher/buffer facade.
  • Add internal EPLB metadata + logical/physical expert mapping utilities and an EP FP8 fused MoE kernel implementation.
  • Thread max_batch_size through backend/build context to infer DeepEP max dispatch tokens per rank; add regression tests to ensure no dlblas imports.
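The switch from a single dlblas check to separate DeepEP and DeepGEMM checks can be illustrated with a minimal sketch. This is not the code from the PR: the helper names `is_module_available` and `missing_ep_dependencies` are hypothetical, and the sketch only assumes that the two packages are importable as `deep_ep` and `deep_gemm`, as the overview states.

```python
import importlib.util


def is_module_available(module_name: str) -> bool:
    """Return True if `module_name` can be located without importing it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec can raise for broken parent packages or modules whose
        # __spec__ is unset; treat both cases as "not available".
        return False


def missing_ep_dependencies(required=('deep_ep', 'deep_gemm')) -> list:
    """List the required modules that cannot be found (hypothetical helper)."""
    return [name for name in required if not is_module_available(name)]


if __name__ == '__main__':
    missing = missing_ep_dependencies()
    if missing:
        print(f'Expert parallelism would need these packages: {missing}')
    else:
        print('deep_ep and deep_gemm are both importable')
```

Probing with `find_spec` avoids importing heavy CUDA extensions just to test for their presence, which is one plausible reason to prefer it over a bare `try: import ...` in an environment check.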

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| tests/pytorch/test_remove_dlblas.py | Adds regression/unit tests covering removal of dlblas imports and new DeepEP token-limit behavior. |
| lmdeploy/utils.py | Replaces is_dlblas_installed() with is_deep_ep_installed() / is_deep_gemm_installed(). |
| lmdeploy/serve/openai/responses/protocol.py | Adds compatibility fallback for StreamOptions import across OpenAI SDK versions. |
| lmdeploy/pytorch/nn/moe/default.py | Passes inferred DeepEP dispatch token-limit into MoE backend builder. |
| lmdeploy/pytorch/nn/moe/blocked_fp8.py | Passes inferred DeepEP dispatch token-limit + fp8 dtype into blocked-FP8 MoE backend builder and forwards quant args. |
| lmdeploy/pytorch/nn/eplb.py | Introduces LMDeploy-owned EPLB implementation replacing dlblas EPLB usage. |
| lmdeploy/pytorch/model_inputs.py | Adds max_batch_size and computes deep_ep_max_tokens_per_rank. |
| lmdeploy/pytorch/kernels/cuda/fused_moe_ep_fp8.py | Adds EP FP8 fused MoE kernel path (ported from dlblas). |
| lmdeploy/pytorch/kernels/cuda/blocked_gemm_fp8.py | Adds per_token_group_quant_fp8() utility. |
| lmdeploy/pytorch/kernels/cuda/activation.py | Adds masked SiLU+mul post-quant helper for FP8 path. |
| lmdeploy/pytorch/envs.py | Removes the old DEEPEP_MAX_TOKENS_PER_RANK read path; keeps other DeepEP envs. |
| lmdeploy/pytorch/engine/model_agent/agent.py | Passes backend max_batch_size into BuildModelContext. |
| lmdeploy/pytorch/engine/config_builder.py | Propagates max_batch_size into BackendConfig. |
| lmdeploy/pytorch/config.py | Adds max_batch_size to BackendConfig. |
| lmdeploy/pytorch/check_env/dist.py | Updates EP>1 validation to require DeepEP + DeepGEMM instead of dlblas. |
| lmdeploy/pytorch/backends/moe.py | Extends MoE builder interfaces with token-limit and FP8 dtype parameters. |
| lmdeploy/pytorch/backends/cuda/token_dispatcher.py | Adds LMDeploy-owned DeepEP buffer facade + normal dispatcher; adjusts low-latency dispatcher to accept explicit token limit. |
| lmdeploy/pytorch/backends/cuda/moe/default.py | Switches DeepEP dispatcher imports from dlblas to LMDeploy implementation and threads token-limit through. |
| lmdeploy/pytorch/backends/cuda/moe/blocked_fp8.py | Internalizes EP FP8 path and DeepEP dispatchers; introduces new normal/low-latency EP FP8 implementations. |
| lmdeploy/pytorch/backends/cuda/graph_runner.py | Switches DeepEP buffer imports from dlblas to LMDeploy implementation. |
| docker/install.sh | Removes dlblas install from the Docker image build. |
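The regression tests are described as ensuring that no dlblas imports remain. One way such a check could work is sketched below; it is an illustration under assumptions, not the contents of tests/pytorch/test_remove_dlblas.py. The function name `find_forbidden_imports` is hypothetical. It parses source text with the standard `ast` module, so a mention of dlblas in a comment or string does not count as an import.

```python
import ast


def find_forbidden_imports(source: str, forbidden: str = 'dlblas') -> list:
    """Return line numbers of statements importing `forbidden` or its submodules."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Relative imports (level > 0) cannot name a top-level package.
            names = [node.module] if node.module and node.level == 0 else []
        else:
            continue
        # Match the package itself or any dotted submodule, but not
        # unrelated packages that merely share the prefix.
        if any(n == forbidden or n.startswith(forbidden + '.') for n in names):
            hits.append(node.lineno)
    return sorted(hits)


if __name__ == '__main__':
    sample = (
        'import os\n'
        'from dlblas.layers.moe import something\n'
        'import dlblas_extra\n'
        '\n'
        'def lazy():\n'
        '    import dlblas\n'
    )
    print(find_forbidden_imports(sample))
```

Because the walk visits nested nodes, imports placed inside functions are caught as well as module-level ones. A real test would presumably apply a check like this to every `.py` file under the package directory.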


Comment on lines +51 to +53
if not torch.compiler.is_compiling():
    refcount = sys.getrefcount(self._value)
    assert refcount == 2, f'refcount={refcount}'
Comment on lines +147 to +148
cls._deepep_sms = int(os.getenv('DEEPEP_SMS', cls._deepep_sms))
cls._allow_mnnvl = os.getenv('DEEPEP_ENABLE_MNNVL', '1') != '0'
Comment on lines +250 to +255
def init_global_eplb_metadata(ep_size: int, num_routed_experts: int, num_hidden_layers: int):
    global _global_eplb_metadata
    assert _global_eplb_metadata is None
    _global_eplb_metadata = EPLBMetadata.init(ep_size=ep_size,
                                              num_routed_experts=num_routed_experts,
                                              num_hidden_layers=num_hidden_layers)
Comment on lines +258 to +261
def get_global_eplb_metadata():
    global _global_eplb_metadata
    assert _global_eplb_metadata is not None
    return _global_eplb_metadata
Comment on lines +183 to +187
if cls._buffer_common is not None:
    return cls._buffer_common

num_rdma_bytes = max(
    Buffer.get_low_latency_rdma_size_hint(num_max_dispatch_tokens_per_rank, hidden, group.size(), num_experts),
    num_rdma_bytes)
cls.update_parameters(hidden, num_experts)
num_max_dispatch_tokens_per_rank = num_max_dispatch_tokens_per_rank or cls._num_max_dispatch_tokens_per_rank
Comment on lines 21 to 23
from lmdeploy.pytorch.kernels.cuda.fused_moe import _renormalize
from lmdeploy.pytorch.kernels.cuda.fused_moe_ep_fp8 import fused_moe_v3_fp8
from lmdeploy.pytorch.model_inputs import get_step_ctx_manager
@RunningLeon changed the title from "[WIP]: Remove dlblas from lmdeploy" to "[Improve]: Remove dlblas from lmdeploy" on Jun 18, 2026

2 participants