
Bump vllm from 0.17.1 to 0.19.0 #223

Merged
XkunW merged 2 commits into main from dependabot/uv/vllm-0.19.0 on Apr 9, 2026

Conversation

dependabot bot (Contributor) commented on behalf of GitHub on Apr 3, 2026

Bumps vllm from 0.17.1 to 0.19.0.

Release notes

Sourced from vllm's releases.

vLLM v0.19.0

Highlights

This release features 448 commits from 197 contributors (54 new)!

  • Gemma 4 support: Full Google Gemma 4 architecture support including MoE, multimodal, reasoning, and tool-use capabilities (#38826, #38847). Requires transformers>=5.5.0. We recommend the pre-built Docker image vllm/vllm-openai:gemma4 for out-of-the-box usage (a minimal usage sketch follows this list).
  • Zero-bubble async scheduling + speculative decoding: Async scheduling now supports speculative decoding with zero-bubble overlap, significantly improving throughput (#32951).
  • Model Runner V2 maturation: MRV2 gains piecewise CUDA graphs for pipeline parallelism (#35162), spec decode rejection sampler with greedy/logprobs support (#37238, #37237), multi-modal embeddings for spec decode (#36097), streaming inputs (#37028), and EPLB support (#37488).
  • ViT Full CUDA Graphs: Vision encoders (ViT) now support full CUDA graph capture for reduced overhead (#35963).
  • General CPU KV cache offloading: A simple yet general CPU KV cache offloading mechanism for V1, with pluggable cache policy and block-level preemption handling (#37160, #37874, #34805, #36642, #37853).
  • DBO (Dual-Batch Overlap) generalization: The microbatch optimization (DBO) now works with general models, not just specific architectures (#37926).
  • NVIDIA B300/GB300 (SM 10.3) support: Allreduce fusion enabled by default with tuned all-reduce communicator (#37755, #37756).
  • Transformers v5 compatibility: Broad compatibility fixes across many models for HuggingFace Transformers v5 (#37681, #38127, #38090, #38247, #38410).
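
For readers who want to try the Gemma 4 support above, here is a minimal sketch using vLLM's offline Python API rather than the Docker image. The model id is a placeholder (the actual Hugging Face repo name is not given in these notes); transformers>=5.5.0 is required per the highlight.

```python
# Minimal sketch: loading a Gemma 4 model with vLLM's offline API.
# Assumption: "google/gemma-4-9b-it" is a placeholder model id; check the
# actual Hugging Face repo name. Requires vllm>=0.19.0, transformers>=5.5.0.
from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-4-9b-it")  # placeholder model id
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["What is speculative decoding?"], params)
print(outputs[0].outputs[0].text)
```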

Model Support

  • New architectures: Gemma 4 (#38826), Cohere ASR (#35809), Cohere Transcribe (#38120), ColQwen3.5 4.5B (#36887), LFM2-ColBERT-350M (#37528), Granite 4.0 1B Speech (#38019), Qwen3-ForcedAligner (#35367).
  • Speculative decoding: Eagle3 for Pixtral (#37182), EagleMistralLarge3 fix (#37232).
  • LoRA expansion: H2OVL tower/connector LoRA (#31696), --lora-target-modules to restrict LoRA to specific modules (#34984; see the hedged sketch after this list), language_model_only respected (#37375), Mistral3 fix (#36928), Qwen3.5 fix (#36976), out-of-tree ops replacement (#37181).
  • Model fixes: NemotronH MTP + Chunked Prefill (#35447), Qwen3-VL video timestamps (#37439), Qwen3.5 GDN quantized models (#37448), Qwen3Next A_log FP32 (#37810), JAIS ALiBi (#37820), RoBERTa CUDA graph position IDs (#37873), AudioFlamingo3/MusicFlamingo (#37643), Music Flamingo loading (#35535), bge-m3 task selection (#37632), Nemotron Parse loading (#37407), GLM OCR patch merger (#37962), PaddleOCR checkpoint compat (#38232), DeepSeek v3.2 params (#33703), MiniMax NVFP4 weight loading (#37214), gated model HF token (#37920), Parakeet OOM on long audio (#36671).
  • Features: Temporal compression for Nemotron-3-VL videos (#36808), NemotronH Puzzle + MTP (#37803), torch.compile for InternVL vision encoder (#38049), multiple embedding types in single call (#35829).
  • Performance: GLM-4.xv ViT optimization (#37779).
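
As a sketch of the LoRA flow that the new --lora-target-modules flag (#34984) restricts, the following uses vLLM's existing Python LoRA API. The adapter name and path are placeholders, and the Python-side equivalent of the new CLI flag is deliberately not shown, since its parameter name is not stated in these notes.

```python
# Minimal sketch of vLLM's LoRA request flow. Adapter name/path are
# placeholders. Per #34984 the CLI gains --lora-target-modules to restrict
# LoRA to specific modules; its Python-side parameter is not shown here.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="mistralai/Mistral-7B-v0.1", enable_lora=True)
outputs = llm.generate(
    ["Summarize the 0.19.0 release in one line."],
    SamplingParams(max_tokens=64),
    lora_request=LoRARequest("my-adapter", 1, "/path/to/adapter"),  # placeholder
)
print(outputs[0].outputs[0].text)
```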

Engine Core

  • Zero-bubble async scheduling + speculative decoding (#32951); see the hedged sketch after this list.
  • Model Runner V2: PP CUDA graphs (#35162), spec decode rejection sampler greedy (#37238) + logprobs (#37237), multimodal embeddings for spec decode (#36097), streaming inputs (#37028), configurable acceptance rate (#38045), FP32 draft logits (#37526), FP64 Gumbel noise (#37798), warmup with spec decode (#37812).
  • ViT Full CUDA Graph capture (#35963).
  • General CPU KV cache offloading with pluggable CachePolicy (#37160, #37874), block-level preemption (#34805), multiple KV groups (#36642), hybrid model support (#37853).
  • DBO for general models: Microbatch optimization generalized beyond specific architectures (#37926).
  • Compilation: Mega AOT artifact for torch 2.12+ (#37198), lazy graph module to defer recompile (#37609), remove model tag requirement for compile cache (#37345), Triton autotuning disk cache enabled by default (#37188), inductor runtime asserts disabled by default (#37485).
  • FlexAttention: Custom mask modification support (#37692).
  • Attention: Distinguish short extends vs decodes (#37303), allow qk_nope_head_dim=192 in FlashInfer MLA (#37475), skip sliding window attention layers with FP8 KV cache (#33695).
  • Scheduling: Schedule requests based on full input sequence length (#37307).
  • Spec decode: Per-draft-model MoE backend via --speculative-config (#37880), Eagle3 drafter quant_config propagation (#37280), Eagle3 norm_before_fc propagation (#38111).
  • Extensibility: PluggableLayer for CustomQwen2Decoder (#37293), tensor IPC transfer for multimodal data (#32104).
  • Performance: Optimize top-k in Triton sampler (#37225), optimize token_embed for pooling models with 1% improvement (#37347), fix slow hasattr in CUDAGraphWrapper (#37425), NFS prefetch auto-enabled with RAM guard (#37673), pybase64 replacement (#37290), optimize swap_states for hybrid models (#34733).
  • Bugfixes: Fix gibberish from FP8 MLA KV scale inconsistency (#37054), Mamba state corruption (#37728), deadlock with pause/resume (#37024), FlashInfer MNNVL socket collisions (#36674), multimodal prefix cache key collisions (#36708), DP coordinator ZMQ TOCTOU (#37452), CUDA graph memory double-counting (#37426), pooling non-determinism (#37775), AllReduce Fusion shutdown crash (#36955), FlashInfer allreduce workspace (#37461), async spec decoding with hybrid models (#38556), MLA sparse indexer prefill chunking (#36178), KV offloading + MLA (#37536), async scheduling extra CUDA context (#37449), DP MTP dummy run (#35243), offloading+prefetch for GLM-4.7-FP8 (#37178), max memory for multiple KV-cache groups (#36030).
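
To make the zero-bubble pairing above concrete, here is a hedged sketch of enabling speculative decoding together with async scheduling via the offline API. Both model ids are placeholders, and async_scheduling is an assumption about the engine argument behind the feature; verify both against the 0.19.0 docs.

```python
# Minimal sketch: speculative decoding combined with async scheduling
# (zero-bubble overlap, #32951). Model ids are placeholders; the
# async_scheduling engine argument is an assumption, verify before use.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",         # target model (placeholder)
    speculative_config={
        "model": "meta-llama/Llama-3.2-1B-Instruct",  # draft model (placeholder)
        "num_speculative_tokens": 4,
    },
    async_scheduling=True,  # assumed flag name for async scheduling
)
out = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```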

Hardware & Performance

  • NVIDIA:
    • B300/GB300 (SM 10.3): Allreduce fusion enabled by default (#37755), tuned all-reduce communicator (#37756).
    • Blackwell: Optimized SM120 CUTLASS blockwise FP8 GEMM (#37970), fix NVFP4 NaN on desktop Blackwell (#37725), fix DeepGEMM E8M0 accuracy for Qwen3.5 FP8 (#38083), restore FP8 FlashMLA CUDA graph persistent buffers (#35175), DGX Spark fix (#38126).
    • FlashInfer sparse MLA as the default for FP8 KV cache (#37252; see the sketch after this list).
    • Tuned prefill configs for FP8 FA3 (#36265), tuned Triton MoE config for Qwen3.5 on H200 with 9.9% E2E improvement (#37340), H800 MoE configs (#31201).
    • GPT-OSS: Router GEMM kernel (#37205), eliminate padding with FlashInfer MXFP4/MXFP8 MoE (#30647), reduce redundant SparseMatrix creation (#37683).
    • NVFP4 CUTLASS MoE non-gated support (#37320), fuse pack topk in TRTLLM MoE via torch.compile (#37695).
    • Non-contiguous KV cache in TRTLLM FP8 dequant kernel (#36867), Qwen3 dual stream input projection (#36795).
  • AMD ROCm:
    • ROCm 7.2.1, torch 2.10, triton 3.6 (#38252).
    • DeepEP as all2all backend (#34692).
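
Related to the FlashInfer sparse MLA default above (#37252), the sketch below enables an FP8 KV cache through the existing kv_cache_dtype engine argument. The model id is illustrative, and backend selection (sparse MLA on supported hardware) is assumed to happen automatically.

```python
# Minimal sketch: FP8 KV cache via the existing kv_cache_dtype argument.
# The model id is illustrative; per these notes, FlashInfer sparse MLA is
# the default attention path for FP8 KV cache on MLA models in 0.19.0.
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",  # illustrative MLA model
    kv_cache_dtype="fp8",
)
print(llm.generate(["ping"])[0].outputs[0].text)
```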

... (truncated)

Commits
  • 2a69949 [Bugfix]: Fix Gemma4ToolParser.__init__() missing tools parameter (#38847)
  • 8adcf8c feat(models): implement Google Gemma 4 architecture support (MoE, Multimodal,...
  • cfad6a5 Revert "[Bugfix] Restrict TRTLLM attention to SM100, fixing GB300 (SM103) han...
  • c284a66 [Bugfix] Restrict TRTLLM attention to SM100, fixing GB300 (SM103) hang (#38730)
  • 3a30a1a [Misc] Rename think_start_str/think_end_str to reasoning_start_str/reasoning_...
  • 29982d4 (security) Enforce frame limit in VideoMediaIO (#38636)
  • 1dbbafd [Feat][v1] Simple yet General CPU KV Cache Offloading (#37160)
  • 0ee3b7f [Bugfix][MLA] Add logits size budget to sparse indexer prefill chunking (#36178)
  • 268bed9 [Bugfix][Async] Fix async spec decoding with hybrid models (#38556)
  • bcc0fdd [CI] fix LM Eval Qwen3.5 Models (B200) (#38632)
  • Additional commits viewable in compare view

@dependabot dependabot bot added the dependencies (pull requests that update a dependency file) and python:uv (pull requests that update python:uv code) labels on Apr 3, 2026
Bumps [vllm](https://github.com/vllm-project/vllm) from 0.17.1 to 0.19.0.
- [Release notes](https://github.com/vllm-project/vllm/releases)
- [Changelog](https://github.com/vllm-project/vllm/blob/main/RELEASE.md)
- [Commits](vllm-project/vllm@v0.17.1...v0.19.0)

---
updated-dependencies:
- dependency-name: vllm
  dependency-version: 0.19.0
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
@dependabot dependabot bot force-pushed the dependabot/uv/vllm-0.19.0 branch from 098a759 to b379e9f on April 8, 2026 at 22:18
@XkunW XkunW merged commit 602fa0f into main on Apr 9, 2026 (9 checks passed)
@XkunW XkunW deleted the dependabot/uv/vllm-0.19.0 branch on April 9, 2026 at 15:37