
Bump vllm from 0.17.1 to 0.19.0 #223

Merged
XkunW merged 2 commits into main from dependabot/uv/vllm-0.19.0 on Apr 9, 2026

Conversation

dependabot bot (Contributor) commented on behalf of GitHub on Apr 3, 2026

Bumps vllm from 0.17.1 to 0.19.0.

Release notes

Sourced from vllm's releases.

vLLM v0.19.0

Highlights

This release features 448 commits from 197 contributors (54 new)!

  • Gemma 4 support: Full Google Gemma 4 architecture support including MoE, multimodal, reasoning, and tool-use capabilities (#38826, #38847). Requires transformers>=5.5.0. We recommend the pre-built Docker image vllm/vllm-openai:gemma4 for out-of-the-box usage (a minimal usage sketch follows this list).
  • Zero-bubble async scheduling + speculative decoding: Async scheduling now supports speculative decoding with zero-bubble overlap, significantly improving throughput (#32951).
  • Model Runner V2 maturation: MRV2 gains piecewise CUDA graphs for pipeline parallelism (#35162), spec decode rejection sampler with greedy/logprobs support (#37238, #37237), multi-modal embeddings for spec decode (#36097), streaming inputs (#37028), and EPLB support (#37488).
  • ViT Full CUDA Graphs: Vision encoders (ViT) now support full CUDA graph capture for reduced overhead (#35963).
  • General CPU KV cache offloading: A simple yet general CPU KV cache offloading mechanism for V1, with pluggable cache policy and block-level preemption handling (#37160, #37874, #34805, #36642, #37853).
  • DBO (Dual-Batch Overlap) generalization: The microbatch optimization (DBO) now works with general models, not just specific architectures (#37926).
  • NVIDIA B300/GB300 (SM 10.3) support: Allreduce fusion enabled by default with tuned all-reduce communicator (#37755, #37756).
  • Transformers v5 compatibility: Broad compatibility fixes across many models for HuggingFace Transformers v5 (#37681, #38127, #38090, #38247, #38410).
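
For readers who want to try the Gemma 4 support above, here is a minimal sketch using vLLM's offline Python API rather than the Docker image. The model id is a placeholder (the actual Hugging Face repo name is not given in these notes); transformers>=5.5.0 is required per the highlight.

```python
# Minimal sketch: loading a Gemma 4 model with vLLM's offline API.
# Assumption: "google/gemma-4-9b-it" is a placeholder model id; check the
# actual Hugging Face repo name. Requires vllm>=0.19.0, transformers>=5.5.0.
from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-4-9b-it")  # placeholder model id
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["What is speculative decoding?"], params)
print(outputs[0].outputs[0].text)
```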

Model Support

  • New architectures: Gemma 4 (#38826), Cohere ASR (#35809), Cohere Transcribe (#38120), ColQwen3.5 4.5B (#36887), LFM2-ColBERT-350M (#37528), Granite 4.0 1B Speech (#38019), Qwen3-ForcedAligner (#35367).
  • Speculative decoding: Eagle3 for Pixtral (#37182), EagleMistralLarge3 fix (#37232).
  • LoRA expansion: H2OVL tower/connector LoRA (#31696), --lora-target-modules to restrict LoRA to specific modules (#34984; see the hedged sketch after this list), language_model_only respected (#37375), Mistral3 fix (#36928), Qwen3.5 fix (#36976), out-of-tree ops replacement (#37181).
  • Model fixes: NemotronH MTP + Chunked Prefill (#35447), Qwen3-VL video timestamps (#37439), Qwen3.5 GDN quantized models (#37448), Qwen3Next A_log FP32 (#37810), JAIS ALiBi (#37820), RoBERTa CUDA graph position IDs (#37873), AudioFlamingo3/MusicFlamingo (#37643), Music Flamingo loading (#35535), bge-m3 task selection (#37632), Nemotron Parse loading (#37407), GLM OCR patch merger (#37962), PaddleOCR checkpoint compat (#38232), DeepSeek v3.2 params (#33703), MiniMax NVFP4 weight loading (#37214), gated model HF token (#37920), Parakeet OOM on long audio (#36671).
  • Features: Temporal compression for Nemotron-3-VL videos (#36808), NemotronH Puzzle + MTP (#37803), torch.compile for InternVL vision encoder (#38049), multiple embedding types in single call (#35829).
  • Performance: GLM-4.xv ViT optimization (#37779).
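
As a sketch of the LoRA flow that the new --lora-target-modules flag (#34984) restricts, the following uses vLLM's existing Python LoRA API. The adapter name and path are placeholders, and the Python-side equivalent of the new CLI flag is deliberately not shown, since its parameter name is not stated in these notes.

```python
# Minimal sketch of vLLM's LoRA request flow. Adapter name/path are
# placeholders. Per #34984 the CLI gains --lora-target-modules to restrict
# LoRA to specific modules; its Python-side parameter is not shown here.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="mistralai/Mistral-7B-v0.1", enable_lora=True)
outputs = llm.generate(
    ["Summarize the 0.19.0 release in one line."],
    SamplingParams(max_tokens=64),
    lora_request=LoRARequest("my-adapter", 1, "/path/to/adapter"),  # placeholder
)
print(outputs[0].outputs[0].text)
```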

Engine Core

  • Zero-bubble async scheduling + speculative decoding (#32951); see the hedged sketch after this list.
  • Model Runner V2: PP CUDA graphs (#35162), spec decode rejection sampler greedy (#37238) + logprobs (#37237), multimodal embeddings for spec decode (#36097), streaming inputs (#37028), configurable acceptance rate (#38045), FP32 draft logits (#37526), FP64 Gumbel noise (#37798), warmup with spec decode (#37812).
  • ViT Full CUDA Graph capture (#35963).
  • General CPU KV cache offloading with pluggable CachePolicy (#37160, #37874), block-level preemption (#34805), multiple KV groups (#36642), hybrid model support (#37853).
  • DBO for general models: Microbatch optimization generalized beyond specific architectures (#37926).
  • Compilation: Mega AOT artifact for torch 2.12+ (#37198), lazy graph module to defer recompile (#37609), remove model tag requirement for compile cache (#37345), Triton autotuning disk cache enabled by default (#37188), inductor runtime asserts disabled by default (#37485).
  • FlexAttention: Custom mask modification support (#37692).
  • Attention: Distinguish short extends vs decodes (#37303), allow qk_nope_head_dim=192 in FlashInfer MLA (#37475), skip sliding window attention layers with FP8 KV cache (#33695).
  • Scheduling: Schedule requests based on full input sequence length (#37307).
  • Spec decode: Per-draft-model MoE backend via --speculative-config (#37880), Eagle3 drafter quant_config propagation (#37280), Eagle3 norm_before_fc propagation (#38111).
  • Extensibility: PluggableLayer for CustomQwen2Decoder (#37293), tensor IPC transfer for multimodal data (#32104).
  • Performance: Optimize top-k in Triton sampler (#37225), optimize token_embed for pooling models with 1% improvement (#37347), fix slow hasattr in CUDAGraphWrapper (#37425), NFS prefetch auto-enabled with RAM guard (#37673), pybase64 replacement (#37290), optimize swap_states for hybrid models (#34733).
  • Bugfixes: Fix gibberish from FP8 MLA KV scale inconsistency (#37054), Mamba state corruption (#37728), deadlock with pause/resume (#37024), FlashInfer MNNVL socket collisions (#36674), multimodal prefix cache key collisions (#36708), DP coordinator ZMQ TOCTOU (#37452), CUDA graph memory double-counting (#37426), pooling non-determinism (#37775), AllReduce Fusion shutdown crash (#36955), FlashInfer allreduce workspace (#37461), async spec decoding with hybrid models (#38556), MLA sparse indexer prefill chunking (#36178), KV offloading + MLA (#37536), async scheduling extra CUDA context (#37449), DP MTP dummy run (#35243), offloading+prefetch for GLM-4.7-FP8 (#37178), max memory for multiple KV-cache groups (#36030).
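
To make the zero-bubble pairing above concrete, here is a hedged sketch of enabling speculative decoding together with async scheduling via the offline API. Both model ids are placeholders, and async_scheduling is an assumption about the engine argument behind the feature; verify both against the 0.19.0 docs.

```python
# Minimal sketch: speculative decoding combined with async scheduling
# (zero-bubble overlap, #32951). Model ids are placeholders; the
# async_scheduling engine argument is an assumption, verify before use.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",         # target model (placeholder)
    speculative_config={
        "model": "meta-llama/Llama-3.2-1B-Instruct",  # draft model (placeholder)
        "num_speculative_tokens": 4,
    },
    async_scheduling=True,  # assumed flag name for async scheduling
)
out = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```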

Hardware & Performance

  • NVIDIA:
    • B300/GB300 (SM 10.3): Allreduce fusion enabled by default (#37755), tuned all-reduce communicator (#37756).
    • Blackwell: Optimized SM120 CUTLASS blockwise FP8 GEMM (#37970), fix NVFP4 NaN on desktop Blackwell (#37725), fix DeepGEMM E8M0 accuracy for Qwen3.5 FP8 (#38083), restore FP8 FlashMLA CUDA graph persistent buffers (#35175), DGX Spark fix (#38126).
    • FlashInfer sparse MLA as the default for FP8 KV cache (#37252; see the sketch after this list).
    • Tuned prefill configs for FP8 FA3 (#36265), tuned Triton MoE config for Qwen3.5 on H200 with 9.9% E2E improvement (#37340), H800 MoE configs (#31201).
    • GPT-OSS: Router GEMM kernel (#37205), eliminate padding with FlashInfer MXFP4/MXFP8 MoE (#30647), reduce redundant SparseMatrix creation (#37683).
    • NVFP4 CUTLASS MoE non-gated support (#37320), fuse pack topk in TRTLLM MoE via torch.compile (#37695).
    • Non-contiguous KV cache in TRTLLM FP8 dequant kernel (#36867), Qwen3 dual stream input projection (#36795).
  • AMD ROCm:
    • ROCm 7.2.1, torch 2.10, triton 3.6 (#38252).
    • DeepEP as all2all backend (#34692).
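
Related to the FlashInfer sparse MLA default above (#37252), the sketch below enables an FP8 KV cache through the existing kv_cache_dtype engine argument. The model id is illustrative, and backend selection (sparse MLA on supported hardware) is assumed to happen automatically.

```python
# Minimal sketch: FP8 KV cache via the existing kv_cache_dtype argument.
# The model id is illustrative; per these notes, FlashInfer sparse MLA is
# the default attention path for FP8 KV cache on MLA models in 0.19.0.
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",  # illustrative MLA model
    kv_cache_dtype="fp8",
)
print(llm.generate(["ping"])[0].outputs[0].text)
```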

... (truncated)

Commits
  • 2a69949 [Bugfix]: Fix Gemma4ToolParser.__init__() missing tools parameter (#38847)
  • 8adcf8c feat(models): implement Google Gemma 4 architecture support (MoE, Multimodal,...
  • cfad6a5 Revert "[Bugfix] Restrict TRTLLM attention to SM100, fixing GB300 (SM103) han...
  • c284a66 [Bugfix] Restrict TRTLLM attention to SM100, fixing GB300 (SM103) hang (#38730)
  • 3a30a1a [Misc] Rename think_start_str/think_end_str to reasoning_start_str/reasoning_...
  • 29982d4 (security) Enforce frame limit in VideoMediaIO (#38636)
  • 1dbbafd [Feat][v1] Simple yet General CPU KV Cache Offloading (#37160)
  • 0ee3b7f [Bugfix][MLA] Add logits size budget to sparse indexer prefill chunking (#36178)
  • 268bed9 [Bugfix][Async] Fix async spec decoding with hybrid models (#38556)
  • bcc0fdd [CI] fix LM Eval Qwen3.5 Models (B200) (#38632)
  • Additional commits viewable in compare view

@dependabot dependabot bot added the dependencies (pull requests that update a dependency file) and python:uv (pull requests that update python:uv code) labels on Apr 3, 2026
Bumps [vllm](https://github.com/vllm-project/vllm) from 0.17.1 to 0.19.0.
- [Release notes](https://github.com/vllm-project/vllm/releases)
- [Changelog](https://github.com/vllm-project/vllm/blob/main/RELEASE.md)
- [Commits](vllm-project/vllm@v0.17.1...v0.19.0)

---
updated-dependencies:
- dependency-name: vllm
  dependency-version: 0.19.0
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
@dependabot dependabot bot force-pushed the dependabot/uv/vllm-0.19.0 branch from 098a759 to b379e9f on April 8, 2026 at 22:18
@XkunW XkunW merged commit 602fa0f into main on Apr 9, 2026 (9 checks passed)
@XkunW XkunW deleted the dependabot/uv/vllm-0.19.0 branch on April 9, 2026 at 15:37