63 commits in total, touching 375 files, with +13551/-4389 lines changed.
Summary
| Metric | Value |
|---|---|
| Commits | 63 |
| Files changed | 375 |
| Lines added | +13551 |
| Lines deleted | -4389 |
Commit list
🐛 Bug Fix
- 99ef652 #44057 — [Bugfix] Reject non-positive values for ParallelConfig int knobs (#44057)
- Author: JianweiZheng | +20/-18 | 1 file
Adds Pydantic lower-bound constraints to the parallelism-size knobs in ParallelConfig so that obviously invalid values (zero, negative) fail fast at construction time instead of producing nonsensical world_size or surfacing as opaque errors later in torch.distributed. ## Problem vllm/config/parallel.py declares the parallelism size fields as bare int = 1 with no validation: This means: - ParallelC…
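The fail-fast idea in this fix can be illustrated with a minimal sketch. It uses a plain dataclass as a stand-in, not vLLM's actual Pydantic-based `ParallelConfig`; the class name `ParallelKnobs` is hypothetical and the two fields are chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ParallelKnobs:
    """Hypothetical stand-in for a config with parallelism-size fields."""

    tensor_parallel_size: int = 1
    pipeline_parallel_size: int = 1

    def __post_init__(self) -> None:
        # Reject zero and negative sizes at construction time, so the error
        # points at the bad field instead of surfacing later.
        for name in ("tensor_parallel_size", "pipeline_parallel_size"):
            value = getattr(self, name)
            if value < 1:
                raise ValueError(f"{name} must be >= 1, got {value}")

    @property
    def world_size(self) -> int:
        # With validated inputs the product is always a positive integer.
        return self.tensor_parallel_size * self.pipeline_parallel_size
```

In Pydantic, the same lower bound is typically declared on the field itself, for example `Field(default=1, ge=1)`, which moves the check into the model's own validation.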
- 3dbb4e0 #44509 — [Bugfix] MiniCPM-V-4.6 video inference crash: placeholder count mismatches visual embedding count (#44509)
- Author: tc-mb | +60/-2 | 2 files
Sending a video request to openbmb/MiniCPM-V-4_6 causes EngineDeadError — the engine core crashes because the number of <|video_pad|> placeholder tokens does not match the number of visual embeddings produced by the vision tower. Image inference works correctly; only video triggers this. ## Root Cause Three issues conspire: 1. VideoProcessorItems.get_frame_size (parse.py) hardcodes (C, H, W) s…
- 9354fb1 #44476 — [Bugfix][Compile] Guard per_token_group_fp8_quant lookup on non-CUDA platforms (#44476)
- Author: QiliangCui2023 | +7/-4 | 2 files
PR #42758 (“Enable perf_token_group_quant/_C_stable_libtorch for ROCm”) moved two QUANT_OPS dict entries — kFp8Dynamic128Sym and kFp8Dynamic64Sym, both indexing torch.ops._C.per_token_group_fp8_quant.default — out of the if current_platform.is_cuda(): guard and into the unconditional dict literal at module top-level. This breaks any platform whose vLLM build does not register per_token_group_fp8_q…
- 4b87b3e #44205 — [Bugfix] fix EVS for qwen3-vl (#44205)
- Author: Rui “Garry” Gao | +4/-4 | 1 file
Fixes EVS for Qwen3-VL by reverting the changes that PR #34246 made to qwen3_vl.py. See issue #44204 for details. Test plan: launch a service with EVS on (--video_pruning_rate 0.5) and send a request with video. Other tests should not be necessary, since this only reverts the change. Test result: the fix works on vllm 0.20.2. Duplicate check: searched for "evs" and found no duplicate PR.
- 0c1e6f6 #44410 — [Bugfix] Fix VLLMNotFoundError when using LoRA adapter name in poolin… (#44410)
- Author: Ted Mostly | +62/-0 | 2 files
…g/embed endpoint. With --runner pooling and --lora-modules, accessing a LoRA adapter returned 404. This syncs the behavior with OpenAIServing._maybe_get_adapters, introduced by #36110.
- 128adab #43982 — [Bugfix] Fix Gemma4 MTP block_table batch_size mismatch under concurrent load (#43982)
- Author: Dima | +6/-1 | 1 file
Fix RuntimeError: batch_size must be equal to batch_size_k that occurs with Gemma4 + MTP + FlashAttention under concurrent load when the batch is partially occupied. Gemma4Proposer.set_per_group_block_table() captures block tables with shape (num_reqs_padded, max_blocks) during _prepare_inputs. Later, spec_decode_common_attn_metadata is unpadded to num_reqs via .unpadded(), but the per-group b…
- 2b237c7 #42752 — [Bugfix] Honor tool_choice=“none” in Chat Completions streaming (#42752)
- Author: hoobnn | +40/-0 | 2 files
Fixes #42747. Streaming Chat Completions with tool_choice="none" (or explicitly disabled via JSON null, where request.tool_choice resolves to None) could still produce delta.tool_calls and finish with finish_reason="tool_calls" whenever the server was launched with a --tool-call-parser and the model output happened to match that parser's tool-call format. Non-streaming Chat Completions already h…
- 209709a #44348 — [Bugfix] Fix unstreamed tool call args dropped in Responses API streaming (#44348)
- Author: Flora Feng | +15/-4 | 3 files
When tool parsers stream arguments incrementally, they may buffer chunks that haven’t been sent to the client yet. On the final streaming delta, parse_delta(finished=True) calls _append_unstreamed_tool_args(), which computes the diff between the fully-parsed arguments and what was actually streamed, and appends the remainder. Without this flush, the client receives truncated or empty tool call arg…
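The flush step described above amounts to a prefix diff. The following is a minimal sketch of that idea only; `unstreamed_suffix` is a hypothetical helper, not vLLM's `_append_unstreamed_tool_args`, and the fallback for diverging text is an assumption of this sketch.

```python
def unstreamed_suffix(full_args: str, streamed_args: str) -> str:
    """Return the part of full_args that has not reached the client yet."""
    if full_args.startswith(streamed_args):
        # Everything after the already-streamed prefix still has to be sent.
        return full_args[len(streamed_args):]
    # The streamed text diverged from the final parse. This sketch sends
    # nothing extra rather than guessing (an assumption, not vLLM behavior).
    return ""
```

For example, if the fully parsed arguments are `{"city": "Paris"}` and the client has only received `{"city": "Pa`, the final delta carries `ris"}`.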
- ace95c9 #44347 — [Bugfix] Update TrtLLM MoE routing methods (#44347)
- Author: Wei Zhao | +24/-24 | 6 files
The PR introduces various fixes related to Trtllm MoE routing methods: - Revert _supports_router_logits_dtype change from https://github.com/vllm-project/vllm/pull/43859, which causes regression in nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8, see CI failure - Update RoutingMethodType in correspondence with flashinfer. - Refine get_routing_method_type to prevent stepfun-ai/Step-3.7-Flash from being c…
📦 Other
- 4cc78c9 #44363 — [Core] Freeze garbage collector in workers after model initialization (#44363)
- Author: Tyler Michael Smith | +8/-0 | 1 file
Drastically reduce P99 ITL in some cases (especially WideEP) by freezing the garbage collector at the end of compile_and_warm_up_model
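As a rough illustration of the technique, Python's standard `gc.freeze()` moves every currently tracked object into a permanent generation that later collections skip. The sketch below shows only that stdlib mechanism; the function name `freeze_after_warmup` is hypothetical and this is not the code added to `compile_and_warm_up_model`.

```python
import gc


def freeze_after_warmup() -> int:
    """Collect once, then exempt the surviving objects from future GC passes."""
    # Clear out garbage first so that only live, long-lived objects get frozen.
    gc.collect()
    # Move all tracked objects into the permanent generation.
    gc.freeze()
    # Report how many objects are now frozen.
    return gc.get_freeze_count()
```

Because objects created during model initialization tend to live for the whole process, later collections have far fewer objects to scan, which is the plausible source of the tail-latency improvement.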
- b21443e #43519 — Add model support for granite speech plus (#43519)
- Author: Zvi Kons | +106/-3 | 6 files
Adds support for the GraniteSpeechPlus architecture (GraniteSpeechPlusForConditionalGeneration) to vLLM, enabling inference for models such as ibm-granite/granite-speech-4.1-2b-plus. The implementation extends the existing GraniteSpeech model: - Refactors GraniteSpeechForConditionalGeneration to expose a _build_encoder hook so subclasses can swap in a custom encoder without duplicating the rest of…
- 06ee2d8 #44340 — [Quant] Support compressed-tensors WNA8O8Int linears and WNInt embeddings (#44340)
- Author: Michael Goin | +744/-27 | 14 files
Requires compressed-tensors bump for embedding support with https://github.com/vllm-project/compressed-tensors/pull/718 Introduces two new specialized methods for compressed-tensors to dispatch to: * CompressedTensorsWNA8O8Int to support any linear layer that has INT1-8 weights with static per-tensor scales for INT8 inputs and outputs. Currently we are running these as fake quants on the input+out…
- b5235fc #43827 — [DSv4] Adding TRTLLM gen attention kernel (#43827)
- Author: Yongye Zhu | +2971/-398 | 20 files
Rebase of @PerkzZheng’s #42316 onto current main, plus a few materially new pieces: - Once-per-step C128A metadata caching — adds FlashInferMLASparseMetadata + FlashInferMLASparseMetadataBuilder so that for compress_ratio == 128 layers the mixed-sparse-index Triton kernel runs once per step instead of once per layer. The SWA-baked combine is materialized lazily on first access by ensure_sp…
- 0c96dd6 #43625 — [ROCm] Bump fastsafetensors to v0.3.2 from PyPI, remove git source build (#43625)
- Author: Turner Jabbour | +14/-9 | 8 files
Bumps fastsafetensors from a pinned git commit source build to the PyPI release v0.3.2 across all requirements files. Previously, ROCm required installing fastsafetensors directly from git (git+https://…@) because earlier PyPI releases only shipped CUDA wheels. v0.3.2 ships a universal wheel with CUDA/ROCm runtime detection (foundation-model-stack/fastsafetensors#78), so the source build…
- 68f5e56 #42554 — [PD][Nixl] Mamba prefix caching mode support (#42554)
- Author: Nicolò Lucchesi | +97/-6 | 3 files
This PR adds support for PD Mamba setups to make use of prefix caching (“all”, “align” as well as the upcoming https://github.com/vllm-project/vllm/pull/37898). It merely adds the logic to handle the result of a prefix cache hit, so it is agnostic to the actual caching implementation. Running without this PR with prefix caching enabled will run into this assertion as prefix caching in mamba will…
- f35b557 #44534 — Add GH token to docs build pre run check (#44534)
- Author: Harry Mellor | +7/-1 | 1 file
Increase the rate limit for the docs build skip check from 60/hour to 5000/hour
- e68988a #42443 — Refactor CT NVFP4 linear to use a single class (#42443)
- Author: Dipika Sikka | +55/-162 | 6 files
- Remove vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w4a16_nvfp4.py and use the singular compressed_tensors_w4a4_nvfp4.py class - Update / expand linear layer unit test test_compressed_tensors_nvfp4 ## Test Result - Passes for
- 9061935 #43556 — [Attention] Mamba attention module refactor - LINEAR (#43556)
- Author: wangxiyuan | +505/-551 | 7 files
Follows https://github.com/vllm-project/vllm/pull/41126/. This is the second PR in the Mamba attention module refactor. It merges BailingMoELinearAttention and MiniMaxText01LinearAttention into model_executor/layers/mamba/linear. After this PR: |Model| mamba type| pluggable|location| Used by| |-|-|-|-|-| |BailingMoELinearAttention|linaer_attention|Yes|model_executor/layers/mamba/gdn/bailing_linear_…
- 1bdc60e #44493 — Fix Kimi-K2.5 FlashInfer ViT metadata (#44493)
- Author: Kevin_Xiong | +109/-28 | 2 files
Fixes Kimi-K2.5 ViT metadata handling when using FlashInfer attention, which previously raised an error. Also removes an unexpected device synchronization by keeping grid_thws on CPU. Tested by running OCRbench: the score is consistent with Kimi2.6's official score.
- a618356 #43447 — [Prefix Caching] DeepSeekv4 - Support selective prefix-cache retention for sliding-window KV cache (#43447)
- Author: Wei Zhao | +792/-45 | 7 files
Co-author: @ivanium DeepSeek v4 now exhibits very low effective prefix cache capacity. For example, on TP8 with 8xB300, the reported KV cache capacity is ~14.5x concurrency. However, a microbenchmark that sends 1M-context requests sequentially shows that after the second request is sent, replaying the first request already begins to miss the prefix cache. This means the practical prefix-cache rete…
- b4b4aaa #42129 — [Inductor] Fast-path Inductor fallback for vllm::/vllm_aiter:: custom ops (#42129)
- Author: Oxana Korzh | +348/-0 | 2 files
When Inductor encounters a custom op without a registered lowering or decomposition (e.g. vllm::all_reduce, vllm_aiter::fused_add_rms_norm) it correctly creates an implicit fallback that calls into the eager Python impl. However, unless the op’s base_name (e.g. vllm::all_reduce) is a member of torch._inductor.lowering.FALLBACK_ALLOW_LIST, GraphLowering.call_function (torch/_inductor/graph.py, arou…
- 4f423bd #41633 — [EPLB] Nixl communicator optimization. Zero-copy transfers (#41633)
- Author: Ilya Markov | +287/-219 | 12 files
Follow-up to #40013. This PR eliminates all intermediate send/recv buffers in NixlEplbCommunicator by implementing zero-copy RDMA transfers. Instead of copying expert weights into a send buffer and reading into a recv buffer, we now register the model’s expert_weights (all layers) directly as NIXL send sources and a pre-allocated expert_buffer as the receive destination. Peers pull data directly f…
- f0cd590 #44230 — optimize the compressor 128 split cutedsl kernel (#44230)
- Author: Jie Fang | +233/-228 | 1 file
This PR optimizes the compressor 128 split cutedsl kernel which utilizes the block_size=8 of state_cache for dpsk v4. This PR optimizes the DeepSeek V4 C128 CuTeDSL sparse-attention compressor path. The optimized path is intentionally scoped to the real vLLM C128 layout: - head_size = 512 - state_width = 512 - compress_ratio = 128 - overlap = False - compressor state-cache block size = 8 - paged K…
- b58e082 #42865 — [KV Connector] Update lmcache kv_offloading_backend to use LMCacheMPConnector (#42865)
- Author: maobaolong | +47/-18 | 2 files
Switch the default kv_offloading_backend=lmcache path to LMCache's multi-process (MP) mode so that vLLM talks to a standalone LMCache server via LMCacheMPConnector instead of the legacy in-process LMCacheConnectorV1. Use LMCacheMPConnector as the connector for --kv-offloading-backend lmcache. Drop the now-irrelevant lmcache.local_cpu / lmcache.max_local_cpu_size extra config; KV capacity is owned…
- ceb0111 #43241 — [Model Runner V2][Spec Decode] Add Gemma4 MTP support (#43241)
- Author: Giancarlo Delfin | +1243/-942 | 14 files
Context Gemma4 MTP is currently not supported in MRV2, but was added to MRV1 in this PR. The minimal changes needed for MRV2 include: - Constant positions across draft steps - Wiring up draft layers to reuse the KV cache of the final target model layer of the same attention group. - Returning tuple of tensors: (draft_hidden_states, backbone_hidden_states) from draft model forward, where the form…
- 0414d75 #44289 — [XPU] skip unapplied UT in test_gpu_model_runner.py (#44289)
- Author: Yan Ma | +4/-4 | 1 file
Skips inapplicable unit tests in test_gpu_model_runner.py.
- bdbf08f #35078 — Bump actions/stale from 10.1.1 to 10.2.0 (#35078)
- Author: dependabot[bot] | +1/-1 | 1 file
Bumps actions/stale from 10.1.1 to 10.3.0. > Note > Automatic rebases have been disabled on this pull request as it has been open for over 30 days.
- 91945b6 #44253 — [Bug Fix][Model Runner V2][Spec Decode] Warmup & capture with different attention states for speculator prefill (#44253)
- Author: Giancarlo Delfin | +59/-34 | 4 files
Context I noticed an issue previously for MRV2 DSV4 where the model would output gibberish for simple prompts. The root cause was that the model was running both the warmup and capture forward passes using the same attention metadata. This was problematic for attention backends (e.g. FlashMLA) that lazily initialize metadata state. The warmup pass (run eagerly) triggers the lazy init and flips a…
- a248b45 #44429 — [Model] Add Gemma4 Unified (encoder-free) support (#44429)
- Author: Luciano Martins | +791/-31 | 14 files
Summary Adds Gemma4UnifiedForConditionalGeneration support for the encoder-free Gemma 4 12B model family. Unlike the tower-based Gemma 4 variants (E2B, E4B, 26B-A4B, 31B), the unified variant has no vision encoder and no audio encoder: raw pixel patches and audio waveform frames are projected directly into LM space. ### Architecture - Gemma4UnifiedForConditionalGeneration subclasses Gemma4ForC…
- 271328e #44413 — [LoRA] Fix dedup for post-replacement module aliases (#44413)
- Author: linitra24 | +1/-0 | 1 file
Follow up on #42757 by fixing another shared-alias LoRA registration case. #42757 deduplicated LoRA wrapping when the same original module is reachable through multiple attribute paths, such as the MoE gate alias. However, it only tracked the id of the original module before replacement. This misses alias paths that resolve to the already-replaced LoRA wrapper. For example, in Gemma4, self_decoder…
- 59d0236 #44365 — [10b/n] Migrate custom all-reduce, DeepSeek V4 fused MLA, MiniMax reduce-RMS, and MXFP8 MoE to libtorch stable ABI (#44365)
- Author: Chris Leonard | +568/-481 | 18 files
Continues the libtorch stable ABI migration by moving several kernels out of legacy _C and into _C_stable_libtorch. This PR migrates custom all-reduce, DeepSeek V4 fused MLA, MiniMax reduce-RMS, and MXFP8 MoE kernels from legacy _C to _C_stable_libtorch, converting host code to stable torch APIs and moving CMake/bindings accordingly. QuickReduce stays on legacy _C for ROCm-only builds cc @janeyx99…
- 0a5cbf6 #43659 — Handle spinloop ext load failure gracefully (#43659)
- Author: pschlan-amd | +15/-6 | 2 files
Handle load failures gracefully and log a warning. See the discussion in the comments of https://github.com/vllm-project/vllm/pull/36517.
- 51e0c57 #44207 — fix(config): validate max_num_scheduled_tokens >= 0 on all paths (#44207)
- Author: Willow Lopez | +2/-2 | 2 files
Fixes #44123 — Negative max_num_scheduled_tokens bypasses validation (guard gated behind speculative decoding). ## Root Cause SchedulerConfig.max_num_scheduled_tokens had no field-level constraint and the only <= 0 guard lived inside _set_max_num_scheduled_tokens() which is gated behind if self.speculative_config is not None. Without speculative decoding, a negative value: 1. Survives config const…
- 0c6631f #37505 — [KVCache] Support Pluggable KVCacheSpec (#37505)
- Author: Mengqing Cao | +698/-59 | 11 files
Supports a pluggable KVCacheSpec; see #36668 for details. Test result: 1. Ran tests/v1/test_kv_cache_spec_registry.py locally. 2. Registered a custom MLAAttentionSpec in vllm-ascend locally; the test passes with deepseek-v3.2.
- 4d1fd13 #44425 — [CI/Build] Fix LoRA testing (#44425)
- Author: Jee Jee Li | +10/-4 | 1 file
Attempts to fix https://buildkite.com/vllm/ci/builds/69686#019e8c27-d9d9-417b-9ab4-c98c5d850905.
- ec8d60b #42472 — [Model Runner V2] Use FlashInfer sampler (#42472)
- Author: Nick Hill | +141/-69 | 3 files
Only used if: - No greedy requests - No requests with user-provided seed - At least one request with top-k and/or top-p - No requests require processed logprobs
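The four conditions above can be read as a single batch-level predicate. The sketch below is illustrative only: `SamplingRequest`, its fields, and `can_use_fast_sampler` are hypothetical names, and treating `temperature == 0` as "greedy" is an assumption rather than something stated in the PR.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SamplingRequest:
    """Hypothetical per-request sampling settings."""

    temperature: float = 1.0      # 0.0 is treated as greedy in this sketch
    seed: Optional[int] = None    # user-provided seed, if any
    top_k: int = 0                # 0 means top-k is disabled here
    top_p: float = 1.0            # 1.0 means top-p is disabled here
    needs_processed_logprobs: bool = False


def can_use_fast_sampler(batch: List[SamplingRequest]) -> bool:
    """Return True only if every listed condition holds for the batch."""
    no_greedy = all(r.temperature > 0.0 for r in batch)
    no_seed = all(r.seed is None for r in batch)
    any_topk_topp = any(r.top_k > 0 or r.top_p < 1.0 for r in batch)
    no_processed_logprobs = not any(r.needs_processed_logprobs for r in batch)
    return no_greedy and no_seed and any_topk_topp and no_processed_logprobs
```

A single greedy or seeded request disqualifies the whole batch, while top-k or top-p only needs to appear on one request.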
- e523267 #39968 — [XPU] Add XPU block-scaled W8A8 fp8 path (#39968)
- Author: Xiaochang Wu | +45/-5 | 4 files
This PR adds the XPU block-scaled W8A8 FP8 path and updates FP8 block kernel selection so XPU can fall back to Triton when the native XPU FP8 block kernel is unavailable. Changes included in this update: - Enable TritonFp8BlockScaledMMKernel.is_supported() on XPU (in addition to CUDA-like). - Add TritonFp8BlockScaledMMKernel to the XPU FP8 block kernel candidate list as fallback. - Add unit tests….
- 309385a #43942 — [Rust Frontend] Add /server_info to Rust frontend (#43942)
- Author: Xunzhuo | +274/-15 | 11 files
Adds /server_info to the Rust HTTP frontend alongside the existing /version route. The endpoint follows the Python frontend's top-level response shape with vllm_config, vllm_env, and system_env. It returns a text config snapshot by default and supports ?config_format=json for structured config fields. ## Tests - cargo fmt --all --check (from rust/) - git diff --check - cargo check -p vllm-server -…
- 3d76f39 #43689 — [SharedOffloadRegion] Align blocks to page-size (#43689)
- Author: Varun Sundar Rabindranath | +83/-59 | 6 files
Align blocks in SharedOffloadRegion to page_size so that O_DIRECT succeeds. Changes: - Update CPUOffloadingSpec to account for alignment - Update SharedOffloadRegion to compute aligned row_strides - Update test_fs_tier.py to use SharedOffloadRegion for robust testing and testing the interplay between fs_tier and SharedOffloadRegion Interface change: - CPUOffloadingSpec constructor inputs a block_s…
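The alignment itself is a round-up to the next multiple of the page size. The following is a minimal sketch of that arithmetic; `align_up` is a hypothetical helper and the 4096-byte page size in the usage note is only an example value, not taken from the PR.

```python
def align_up(nbytes: int, page_size: int) -> int:
    """Round nbytes up to the nearest multiple of page_size."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    # Ceiling division written with floor division on the negated value.
    return -(-nbytes // page_size) * page_size
```

With a 4096-byte page, a 5000-byte block would get an 8192-byte row stride, so every block starts on a page boundary, which is the kind of alignment O_DIRECT I/O generally requires.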
- 823d271 #44393 — [Attention][CPU] Standardize kv layout to blocks first (#44393)
- Author: Li, Jiang | +14/-7 | 2 files
For #42082. Makes the CPU attention backend's KV cache follow the standard logical shape. Tested with unit tests.
- 02564b4 #43759 — [XPU]fallback to TRITON_ATTN for vit attn on xpu when use float32 dtype (#43759)
- Author: Yan Ma | +7/-0 | 1 file
Tested with: pytest -s -v tests/models/multimodal/generation/test_whisper.py::test_models[True-5-float-openai/whisper-large-v3-turbo]
- 449be4f #44311 — [Rust Frontend] Fix several hf chat template rendering issues (#44311)
- Author: Bugen Zhao | +183/-56 | 9 files
This PR includes two separate fixes for HF chat template rendering, both caused by gaps between minijinja and Python jinja2. 1. Revert serde_json/arbitrary_precision (introduced in #43582) to avoid leaking $serde_json::private::Number through MiniJinja serialization. This is inevitable when numbers are rendered directly via {{ .. }}. See upstream issue: htt…
- 6550ff1 #43778 — [Rust Frontend] Add dynamic LoRA endpoints (#43778)
- Author: Xunzhuo | +1079/-47 | 24 files
This adds Rust frontend support for dynamic LoRA adapter management on the OpenAI-compatible server path. Highlights: - Adds /v1/load_lora_adapter and /v1/unload_lora_adapter, gated by VLLM_ALLOW_RUNTIME_LORA_UPDATING. - Adds a Rust protocol representation for LoRARequest and forwards add_lora utility calls to engine-core. - Adds a Rust remove_lora utility wrapper and has /v1/unload_lora_adapter r…
- 4aaed4c #43774 — [Rust Frontend] Add server router extension hook (#43774)
- Author: NolanHo | +17/-2 | 1 file
This draft adds a minimal Rust vllm-server router extension hook discussed in #43641. - Add serve_with_router_extension(config, shutdown, extend_router). - Keep the existing serve(config, shutdown) API and behavior by delegating through an identity closure. - Apply the extension after build_router(state.clone()), so the hook composes the finalized vLLM router without exposing AppState or other int…
- 7268457 #44287 — [KV Offloading] Enable HMA models for Tiering Offloading (#44287)
- Author: Varun Sundar Rabindranath | +0/-1 | 1 file
On main, running models with multiple KV cache groups under Tiered Offloading is blocked by an assert. This PR removes that assert so that tiered offloading is more widely available. Some model failures were observed (see the Failure section of the PR), but they do not appear to be related to HMA; they seem to stem from an incorrect interaction between the tiers. vllm serve: lm_eval command: - Launch server once - Run lm-eval…
🧪 CI/Tests
- 3e77036 #44255 — [ROCm][CI] Specifying time outs for the lm eval models (#44255)
- Author: Andreas Karatzas | +19/-4 | 4 files
This PR addresses a ROCm-specific failure that appears to be caused by the GSM8K client timeout, not by incorrect model outputs. On the MI300 Buildkite job, these two models reached the eval's fixed 600s aiohttp timeout, and timed-out requests were returned as empty strings, which then counted as invalid answers and drove accuracy down. CUDA/H200 passed the same configs on the same commit, and …
- 6f68ca3 #44046 — [ROCm][CI] Stabilize memory-release in the Hybrid model generation tests (#44046)
- Author: Andreas Karatzas | +76/-41 | 1 file
Stabilizes some intermittent memory-release failures observed on ROCm in the Hybrid model generation tests by making helper-created VllmRunner instances use an explicit context manager and adding a stricter ROCm-only memory settle between large APC engine lifetimes. The failures were seen in the Buildkite Hybrid 2 shard, for example: - https://buildkite.com/vllm/ci/builds/68818/canvas?jid=019e71c2…
- 22c2e87 #44497 — [CI] Reverted gitignore changes (#44497)
- Author: Andreas Karatzas | +0/-14 | 3 files
Reverting dockerignore changes completely to unblock release. Also resolving a SCCACHE_ENDPOINT problem inside the ROCm build: sccache sees SCCACHE_ENDPOINT= and crashes with InvalidUri(Empty).
- 5e2af28 #44463 — [CI] Resolve release V2 docker build after ROCm CI wheels change (#44463)
- Author: Andreas Karatzas | +8/-2 | 2 files
Release Docker builds were failing with Repo is dirty because tracked files excluded by .dockerignore were missing from the container worktree while .git was still available for the cleanliness check. This keeps .clang-format and .gitattributes in the Docker context, while continuing to ignore non-build paths such as docs/, .github/, .pre-commit-config.yaml, and format.sh. To make that safe, tools…
- 5b2a2be #44370 — [ROCm][CI] Move Model Executor test step from MI250 to MI300 (gfx942) (#44370)
- Author: JartX | +23/-23 | 1 file
Set parallelism: 4 and shard with --shard-id/--num-shards (same pattern as LoRA %N). Gate the single-file tensorizer test to shard 0.
- df7252c #44174 — [CI] Align PD tests to HMA on by default (#44174)
- Author: Nicolò Lucchesi | +9/-43 | 5 files
Connectors that support HMA now have it turned on by default unless otherwise specified. This PR follows up on that change so that CI for the PD-related tests matches it, testing the auto behavior rather than explicitly turning HMA on with --no-disable-hybrid-kv-cache-manager.
⚡ Performance
- d0975a4 #42646 — [perf] Add gemma RMS AR fusion (#42646)
- Author: Jiahan Chang (Cyrus) | +225/-16 | 3 files
Integrates FlashInfer's Gemma RMS AR fusion (https://github.com/flashinfer-ai/flashinfer/pull/3322), which is used in Gemma, Qwen3-next, and Qwen3.5. Perf: Model: Qwen/Qwen3.5-397B-A17B-FP8 ISL/OSL = 1024/1024 TP=4 | conc | Output tok/s (Base) | Mean TTFT ms (Base) | Mean TPOT ms (Base) | Output tok/s (Fusion) | Mean TTFT ms (Fusion) | Mean TPOT ms (Fusion) | |------|------:|------:|------:|------:|--…
- f25952e #41759 — [MM][Perf][CG] Support ViT full CUDA graph for InternVL (#41759)
- Author: Oğuzhan KIR | +183/-2 | 4 files
Add ViT CUDA Graph support for InternVL models (InternVL3, InternVL2.5, InternVL2), following #38061 (Qwen3-VL). Part of #38175. InternVL’s InternVisionModel uses standard ViT attention with no rotary embeddings or variable-length metadata, so no extra buffer keys are needed. - Unit tests: pytest tests/v1/cudagraph/test_encoder_cudagraph.py -v - Added InternVL entry to tests/models/multimodal/gene…
- 95b1615 #44212 — [Perf] Improve multimodal item handling from O(n) to O(log n) per step (#44212)
- Author: Andy Lo | +92/-44 | 6 files
We observed that voxtral-realtime slows down a lot on long transcription sessions. This boils down to the fact that vLLM is not very efficient at handling a large number of multimodal items. (For voxtral-realtime this can go up to 32K multimodal items.) ### Proposed fix Use bisect to narrow the iteration range over multimodal features in the scheduler and model runner from O(n) to O(log n). This si…
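The bisect idea can be sketched as follows, assuming the multimodal items are half-open token ranges that are sorted and non-overlapping. The function `overlapping_range` and its argument layout are hypothetical and are not the scheduler's actual data structures.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple


def overlapping_range(
    starts: List[int], ends: List[int], lo: int, hi: int
) -> Tuple[int, int]:
    """Return (first, last) so that items[first:last] overlap [lo, hi).

    Item i covers the half-open range [starts[i], ends[i]); both lists are
    sorted, which holds when the items are sorted and non-overlapping.
    """
    # First item whose end lies strictly after lo.
    first = bisect_right(ends, lo)
    # First item whose start is at or beyond hi, which is excluded.
    last = bisect_left(starts, hi)
    return first, last
```

Two binary searches replace a scan over every item, so locating the relevant slice costs O(log n) per step; only the items inside the returned slice are then visited.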
- 1fa9ea0 #42212 — [Perf] Triton fast path for small CPU→GPU swap_blocks_batch in the offloading connector (#42212)
- Author: Itay Etelis | +180/-11 | 3 files
OffloadingConnector copies KV between host and device via cuMemcpyBatchAsync. That call saturates PCIe for large contiguous copies, but on the CPU→GPU (onload / “read”) direction it collapses for small per-descriptor payloads — the regime KV offload actually runs in. This PR adds a small Triton kernel (_swap_blocks_kernel) that takes over the CPU→GPU direction, gated on batch size (n ≥ 16) a…
🦀 Rust Frontend
- d01d0b4 #44479 — [Frontend] Consolidate online serving utils. (#44479)
- Author: wang.yuqi | +466/-435 | 81 files
- Move all online serving utils to vllm/entrypoints/serve/utils - Move vllm/entrypoints/sagemaker/ to vllm/entrypoints/serve/sagemaker/ - Move tests/entrypoints/offline_mode to entrypoints/llm/offline_mode
- 27f1d34 #43590 — [Frontend][Responses API] Move developer-to-system conversion into HF renderer (#43590)
- Author: Chauncey | +391/-1 | 2 files
OpenAI's Responses API allows role: "developer" items in the input array. Clients such as Codex send developer messages for harness / policy text. vLLM's non-harmony path previously forwarded those items into the chat template, which on…
🔧 Refactor
- e6018c6 #41471 — [Refactor] Remove dead code in tests and parallel_state (#41471)
- Author: Wentao Ye | +7/-63 | 4 files
Remove dead code
- 2b91012 #44122 — [Refactor] Remove dead code fp quant (#44122)
- Author: Wentao Ye | +0/-21 | 1 file
Remove dead code fp quant
- e3e132d #44346 — [Refactor] Suppress SyntaxWarning from ast.literal_eval in tool parsers (#44346)
- Author: Flora Feng | +20/-15 | 7 files
Python 3.12+ emits SyntaxWarning for invalid escape sequences (e.g. \p in C:\path). LLM-generated tool call arguments can contain these, producing noisy warnings in server logs. This PR suppresses the warning by using safe_literal_eval().
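One way such a wrapper could look is sketched below with the standard `warnings` module. The name `quiet_literal_eval` is hypothetical and this is not the PR's `safe_literal_eval()` implementation, only an illustration of scoping the suppression to a single call.

```python
import ast
import warnings
from typing import Any


def quiet_literal_eval(text: str) -> Any:
    """Evaluate a Python literal without emitting escape-sequence warnings."""
    with warnings.catch_warnings():
        # Python 3.12+ reports invalid escapes such as "\p" as SyntaxWarning.
        warnings.simplefilter("ignore", category=SyntaxWarning)
        # Older interpreters report the same issue as DeprecationWarning.
        warnings.simplefilter("ignore", category=DeprecationWarning)
        return ast.literal_eval(text)
```

The filters apply only inside the `with` block, so unrelated warnings elsewhere in the server remain visible.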
🔩 Misc
- 6bad553 #44442 — [Minor] Remove FlashInfer version check in topk_topp_sampler (#44442)
- Author: Woosuk Kwon | +5/-20 | 1 file
✨ New Feature
- dad95e3 #42453 — [Feature] Support batch invariant rms norm with residual (#42453)
- Author: Wentao Ye | +39/-41 | 3 files
Supports batch-invariant RMS norm with residual so that the code in class RMSNorm(CustomOp) is clearer. No functional change, as we go into the same kernel path. variance_size_override is only used in intern_vit.py and shouldn't be used in batch invariance. pytest tests/v1/determinism/test_batch_invariance.py -xvs ===================== 13 passed, 30 warnings in 471.33s (0:07:51) ===============…
📖 Documentation
- 0e2b131 #44388 — [Doc] Update ViT CUDA graph interfaces (#44388)
- Author: Shanshan Shen | +12/-16 | 1 file
Addresses https://github.com/vllm-project/vllm/pull/41234#issuecomment-4608749739. Updates the ViT CUDA graph doc following the changes in https://github.com/vllm-project/vllm/pull/41234 and https://github.com/vllm-project/vllm/pull/42288.