[Improve]: Remove dlblas from lmdeploy #4682
Open
RunningLeon wants to merge 1 commit into
Pull request overview
This PR removes the dlblas dependency from LMDeploy’s PyTorch CUDA MoE/EP stack by internalizing the previously dlblas-provided pieces (EPLB + EP FP8 kernel path), switching environment checks to DeepEP/DeepGEMM, and wiring a computed DeepEP token-limit through the build context/config.
Changes:
- Replace dlblas environment checks/imports with deep_ep + deep_gemm checks and an LMDeploy-owned DeepEP token dispatcher/buffer facade.
- Add internal EPLB metadata + logical/physical expert mapping utilities and an EP FP8 fused MoE kernel implementation.
- Thread max_batch_size through backend/build context to infer DeepEP max dispatch tokens per rank; add regression tests to ensure no dlblas imports.
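The environment-check change in the first bullet can be sketched as follows. The function names is_deep_ep_installed() and is_deep_gemm_installed() and the module names deep_ep and deep_gemm come from this PR; the function bodies, the helper is_module_installed, and check_ep_requirements are assumptions made for illustration and may differ from what lmdeploy/utils.py and lmdeploy/pytorch/check_env/dist.py actually do.

```python
import importlib.util


def is_module_installed(name: str) -> bool:
    """Hypothetical helper: report whether `name` is importable, without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec can raise for broken or half-initialized modules.
        return False


def is_deep_ep_installed() -> bool:
    # Name taken from the PR; this body is an assumption.
    return is_module_installed('deep_ep')


def is_deep_gemm_installed() -> bool:
    # Name taken from the PR; this body is an assumption.
    return is_module_installed('deep_gemm')


def check_ep_requirements(ep_size: int) -> None:
    """Hypothetical EP>1 validation: require both DeepEP and DeepGEMM."""
    if ep_size <= 1:
        return
    missing = []
    if not is_deep_ep_installed():
        missing.append('deep_ep')
    if not is_deep_gemm_installed():
        missing.append('deep_gemm')
    if missing:
        raise RuntimeError(f'ep_size={ep_size} requires: {", ".join(missing)}')
```

Using find_spec keeps the check cheap, since neither package is imported just to learn whether it exists.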
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tests/pytorch/test_remove_dlblas.py | Adds regression/unit tests covering removal of dlblas imports and new DeepEP token-limit behavior. |
| lmdeploy/utils.py | Replaces is_dlblas_installed() with is_deep_ep_installed() / is_deep_gemm_installed(). |
| lmdeploy/serve/openai/responses/protocol.py | Adds compatibility fallback for StreamOptions import across OpenAI SDK versions. |
| lmdeploy/pytorch/nn/moe/default.py | Passes inferred DeepEP dispatch token-limit into MoE backend builder. |
| lmdeploy/pytorch/nn/moe/blocked_fp8.py | Passes inferred DeepEP dispatch token-limit + fp8 dtype into blocked-FP8 MoE backend builder and forwards quant args. |
| lmdeploy/pytorch/nn/eplb.py | Introduces LMDeploy-owned EPLB implementation replacing dlblas EPLB usage. |
| lmdeploy/pytorch/model_inputs.py | Adds max_batch_size and computes deep_ep_max_tokens_per_rank. |
| lmdeploy/pytorch/kernels/cuda/fused_moe_ep_fp8.py | Adds EP FP8 fused MoE kernel path (ported from dlblas). |
| lmdeploy/pytorch/kernels/cuda/blocked_gemm_fp8.py | Adds per_token_group_quant_fp8() utility. |
| lmdeploy/pytorch/kernels/cuda/activation.py | Adds masked SiLU+mul post-quant helper for FP8 path. |
| lmdeploy/pytorch/envs.py | Removes the old DEEPEP_MAX_TOKENS_PER_RANK read path; keeps other DeepEP envs. |
| lmdeploy/pytorch/engine/model_agent/agent.py | Passes backend max_batch_size into BuildModelContext. |
| lmdeploy/pytorch/engine/config_builder.py | Propagates max_batch_size into BackendConfig. |
| lmdeploy/pytorch/config.py | Adds max_batch_size to BackendConfig. |
| lmdeploy/pytorch/check_env/dist.py | Updates EP>1 validation to require DeepEP + DeepGEMM instead of dlblas. |
| lmdeploy/pytorch/backends/moe.py | Extends MoE builder interfaces with token-limit and FP8 dtype parameters. |
| lmdeploy/pytorch/backends/cuda/token_dispatcher.py | Adds LMDeploy-owned DeepEP buffer facade + normal dispatcher; adjusts low-latency dispatcher to accept explicit token limit. |
| lmdeploy/pytorch/backends/cuda/moe/default.py | Switches DeepEP dispatcher imports from dlblas to LMDeploy implementation and threads token-limit through. |
| lmdeploy/pytorch/backends/cuda/moe/blocked_fp8.py | Internalizes EP FP8 path and DeepEP dispatchers; introduces new normal/low-latency EP FP8 implementations. |
| lmdeploy/pytorch/backends/cuda/graph_runner.py | Switches DeepEP buffer imports from dlblas to LMDeploy implementation. |
| docker/install.sh | Removes dlblas install from the Docker image build. |
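A regression check of the kind described for tests/pytorch/test_remove_dlblas.py could look roughly like the sketch below. It is not the test from this PR: find_forbidden_imports is a hypothetical helper, and parsing with ast rather than importing the modules is an assumption chosen so the check runs without CUDA or DeepEP installed.

```python
import ast


def find_forbidden_imports(source: str, forbidden: str = 'dlblas') -> list[str]:
    """Hypothetical helper: list import statements that reference `forbidden`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # node.module is None for relative imports such as `from . import x`.
            names = [node.module or '']
        else:
            continue
        for name in names:
            if name == forbidden or name.startswith(forbidden + '.'):
                hits.append(f'line {node.lineno}: {name}')
    return hits
```

A test could read each .py file under lmdeploy/ and assert that the returned list is empty. Because only import nodes are inspected, a mention of dlblas in a comment or string does not trigger a failure.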
Comment on lines +51 to +53
    if not torch.compiler.is_compiling():
        refcount = sys.getrefcount(self._value)
        assert refcount == 2, f'refcount={refcount}'
Comment on lines +147 to +148
    cls._deepep_sms = int(os.getenv('DEEPEP_SMS', cls._deepep_sms))
    cls._allow_mnnvl = os.getenv('DEEPEP_ENABLE_MNNVL', '1') != '0'
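For reference, the two environment reads in this snippet can be expressed with small helpers that fail with a readable message when DEEPEP_SMS is not an integer. This is a sketch only: env_int and env_flag are hypothetical names that do not appear in the PR, and the default values are left to the caller.

```python
import os


def env_int(name: str, default: int) -> int:
    """Hypothetical helper: read an integer env var, falling back to `default`."""
    raw = os.getenv(name)
    if raw is None or raw.strip() == '':
        return default
    try:
        return int(raw)
    except ValueError as exc:
        raise ValueError(f'{name} must be an integer, got {raw!r}') from exc


def env_flag(name: str, default: bool) -> bool:
    """Hypothetical helper: treat every value except '0' as enabled.

    This mirrors the `!= '0'` comparison in the snippet under review.
    """
    raw = os.getenv(name)
    if raw is None:
        return default
    return raw.strip() != '0'
```

With these helpers the two assignments would read `env_int('DEEPEP_SMS', cls._deepep_sms)` and `env_flag('DEEPEP_ENABLE_MNNVL', True)`.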
Comment on lines +250 to +255
    def init_global_eplb_metadata(ep_size: int, num_routed_experts: int, num_hidden_layers: int):
        global _global_eplb_metadata
        assert _global_eplb_metadata is None
        _global_eplb_metadata = EPLBMetadata.init(ep_size=ep_size,
                                                  num_routed_experts=num_routed_experts,
                                                  num_hidden_layers=num_hidden_layers)
Comment on lines +258 to +261
    def get_global_eplb_metadata():
        global _global_eplb_metadata
        assert _global_eplb_metadata is not None
        return _global_eplb_metadata
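Both accessors guard the module-level singleton with assert, which Python strips when run with -O. One possible alternative, sketched below, raises explicit exceptions instead. EPLBMetadataStub is a hypothetical stand-in for the PR's EPLBMetadata class, whose real fields and init() behavior are not shown on this page, and this is not necessarily the change a reviewer requested.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EPLBMetadataStub:
    """Hypothetical stand-in for EPLBMetadata; the real fields are unknown here."""
    ep_size: int
    num_routed_experts: int
    num_hidden_layers: int


_global_eplb_metadata: Optional[EPLBMetadataStub] = None


def init_global_eplb_metadata(ep_size: int, num_routed_experts: int, num_hidden_layers: int) -> None:
    global _global_eplb_metadata
    if _global_eplb_metadata is not None:
        # An explicit error survives `python -O`, unlike an assert.
        raise RuntimeError('EPLB metadata is already initialized')
    _global_eplb_metadata = EPLBMetadataStub(ep_size=ep_size,
                                             num_routed_experts=num_routed_experts,
                                             num_hidden_layers=num_hidden_layers)


def get_global_eplb_metadata() -> EPLBMetadataStub:
    # Reading a module global needs no `global` statement.
    if _global_eplb_metadata is None:
        raise RuntimeError('EPLB metadata is not initialized; call init_global_eplb_metadata first')
    return _global_eplb_metadata
```

Calling the getter before init, or calling init twice, then fails with a message that names the problem rather than a bare AssertionError.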
Comment on lines +183 to +187
    if cls._buffer_common is not None:
        return cls._buffer_common

    num_rdma_bytes = max(
        Buffer.get_low_latency_rdma_size_hint(num_max_dispatch_tokens_per_rank, hidden, group.size(), num_experts),
        num_rdma_bytes)
    cls.update_parameters(hidden, num_experts)
    num_max_dispatch_tokens_per_rank = num_max_dispatch_tokens_per_rank or cls._num_max_dispatch_tokens_per_rank
Comment on lines 21 to 23
    from lmdeploy.pytorch.kernels.cuda.fused_moe import _renormalize
    from lmdeploy.pytorch.kernels.cuda.fused_moe_ep_fp8 import fused_moe_v3_fp8
    from lmdeploy.pytorch.model_inputs import get_step_ctx_manager
Motivation
Remove dlblas from lmdeploy
Modification
Please briefly describe what modification is made in this PR.
BC-breaking (Optional)
Does the modification introduce changes that break the backward-compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.
Use cases (Optional)
If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.
Checklist