79 commits in total, touching 371 files, with +13084/-2832 lines changed.
Summary
| Metric | Value |
|---|---|
| Commits | 79 |
| Files changed | 371 |
| Lines added | +13084 |
| Lines deleted | -2832 |
Commit list
📦 Other
- 4d1fd13 #44425 — [CI/Build] Fix LoRA testing (#44425)
- Author: Jee Jee Li | +10/-4 | 1 file
Attempts to fix https://buildkite.com/vllm/ci/builds/69686#019e8c27-d9d9-417b-9ab4-c98c5d850905.
- ec8d60b #42472 — [Model Runner V2] Use FlashInfer sampler (#42472)
- Author: Nick Hill | +141/-69 | 3 files
The FlashInfer sampler is used only if all of the following hold: no greedy requests; no requests with a user-provided seed; at least one request with top-k and/or top-p; no requests that require processed logprobs.
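The eligibility conditions above can be read as a single batch-level predicate. The following is a minimal sketch, not vLLM's implementation: `SamplingRequest`, its fields, and `can_use_flashinfer_sampler` are hypothetical names, and the sketch assumes that a temperature of 0 marks a greedy request, that `top_k == 0` means top-k is disabled, and that `top_p == 1.0` means top-p is disabled.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class SamplingRequest:
    """Hypothetical per-request sampling settings (illustration only)."""
    temperature: float = 1.0
    top_k: int = 0            # assumption: 0 means top-k is disabled
    top_p: float = 1.0        # assumption: 1.0 means top-p is disabled
    seed: Optional[int] = None
    needs_processed_logprobs: bool = False


def can_use_flashinfer_sampler(batch: Sequence[SamplingRequest]) -> bool:
    """Return True only if the whole batch satisfies all four conditions."""
    if any(r.temperature == 0.0 for r in batch):       # a greedy request
        return False
    if any(r.seed is not None for r in batch):         # a user-provided seed
        return False
    if any(r.needs_processed_logprobs for r in batch):
        return False
    # At least one request must actually use top-k and/or top-p.
    return any(r.top_k > 0 or r.top_p < 1.0 for r in batch)
```

A single disqualifying request rules out the fast path for the whole batch, which is why the first three checks use `any` over the batch before the final positive check.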
- e523267 #39968 — [XPU] Add XPU block-scaled W8A8 fp8 path (#39968)
- Author: Xiaochang Wu | +45/-5 | 4 files
This PR adds the XPU block-scaled W8A8 FP8 path and updates FP8 block kernel selection so XPU can fall back to Triton when the native XPU FP8 block kernel is unavailable. Changes included in this update: - Enable TritonFp8BlockScaledMMKernel.is_supported() on XPU (in addition to CUDA-like). - Add TritonFp8BlockScaledMMKernel to the XPU FP8 block kernel candidate list as fallback. - Add unit tests….
- 309385a #43942 — [Rust Frontend] Add /server_info to Rust frontend (#43942)
- Author: Xunzhuo | +274/-15 | 11 files
Adds /server_info to the Rust HTTP frontend alongside the existing /version route. The endpoint follows the Python frontend's top-level response shape with vllm_config, vllm_env, and system_env. It returns a text config snapshot by default and supports ?config_format=json for structured config fields. Tests: cargo fmt --all --check (from rust/), git diff --check, cargo check -p vllm-server -…
- 3d76f39 #43689 — [SharedOffloadRegion] Align blocks to page-size (#43689)
- Author: Varun Sundar Rabindranath | +83/-59 | 6 files
Align blocks in SharedOffloadRegion to page_size so that O_DIRECT succeeds. Changes: - Update CPUOffloadingSpec to account for alignment - Update SharedOffloadRegion to compute aligned row_strides - Update test_fs_tier.py to use SharedOffloadRegion for robust testing and testing the interplay between fs_tier and SharedOffloadRegion Interface change: - CPUOffloadingSpec constructor inputs a block_s…
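O_DIRECT I/O generally requires buffer addresses, file offsets, and transfer sizes to be multiples of a device-dependent alignment, which is why each block's row stride is rounded up to the page size. The arithmetic can be sketched as follows; `align_up`, `aligned_row_stride`, and `block_offset` are hypothetical helpers for illustration and are not taken from `SharedOffloadRegion`.

```python
import mmap


def align_up(nbytes: int, alignment: int) -> int:
    """Round nbytes up to the next multiple of alignment."""
    if alignment <= 0:
        raise ValueError("alignment must be positive")
    return (nbytes + alignment - 1) // alignment * alignment


def aligned_row_stride(block_bytes: int, page_size: int = mmap.PAGESIZE) -> int:
    """Bytes reserved per block so that every block starts on a page boundary."""
    return align_up(block_bytes, page_size)


def block_offset(block_index: int, block_bytes: int,
                 page_size: int = mmap.PAGESIZE) -> int:
    """Byte offset of a block inside the region; always a multiple of page_size."""
    return block_index * aligned_row_stride(block_bytes, page_size)
```

The cost of this layout is padding: a 6000-byte block with 4096-byte pages occupies 8192 bytes, so any capacity calculation has to use the aligned stride rather than the raw block size.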
- 823d271 #44393 — [Attention][CPU] Standardize kv layout to blocks first (#44393)
- Author: Li, Jiang | +14/-7 | 2 files
For #42082. Makes the CPU attention backend's KV cache follow the standard logical shape. Tested with unit tests.
- 02564b4 #43759 — [XPU]fallback to TRITON_ATTN for vit attn on xpu when use float32 dtype (#43759)
- Author: Yan Ma | +7/-0 | 1 file
Test command: pytest -s -v tests/models/multimodal/generation/test_whisper.py::test_models[True-5-float-openai/whisper-large-v3-turbo]
- 449be4f #44311 — [Rust Frontend] Fix several hf chat template rendering issues (#44311)
- Author: Bugen Zhao | +183/-56 | 9 files
This PR includes two separate fixes for HF chat template rendering, both caused by gaps between minijinja and Python jinja2. 1. Revert serde_json/arbitrary_precision (introduced in #43582) to avoid leaking $serde_json::private::Number through MiniJinja serialization. The leak is unavoidable when numbers are rendered directly via {{ .. }}. See upstream issue: htt…
- 6550ff1 #43778 — [Rust Frontend] Add dynamic LoRA endpoints (#43778)
- Author: Xunzhuo | +1079/-47 | 24 files
This adds Rust frontend support for dynamic LoRA adapter management on the OpenAI-compatible server path. Highlights: - Adds /v1/load_lora_adapter and /v1/unload_lora_adapter, gated by VLLM_ALLOW_RUNTIME_LORA_UPDATING. - Adds a Rust protocol representation for LoRARequest and forwards add_lora utility calls to engine-core. - Adds a Rust remove_lora utility wrapper and has /v1/unload_lora_adapter r…
- 4aaed4c #43774 — [Rust Frontend] Add server router extension hook (#43774)
- Author: NolanHo | +17/-2 | 1 file
This draft adds a minimal Rust vllm-server router extension hook discussed in #43641. - Add serve_with_router_extension(config, shutdown, extend_router). - Keep the existing serve(config, shutdown) API and behavior by delegating through an identity closure. - Apply the extension after build_router(state.clone()), so the hook composes the finalized vLLM router without exposing AppState or other int…
- 7268457 #44287 — [KV Offloading] Enable HMA models for Tiering Offloading (#44287)
- Author: Varun Sundar Rabindranath | +0/-1 | 1 file
On main, running models with multiple KV cache groups with Tiered Offloading is blocked by an assert. This PR removes that assert so that Tiered Offloading can be made more widely available. I observe some model failures (see the Failure section), but they don't seem to be related to HMA; rather, they appear to come from an incorrect interaction between the tiers. vllm serve: lm_eval command: - Launch server once - Run lm-eval…
- 71df063 #42758 — Enable perf_token_group_quant/_C_stable_libtorch for ROCm (#42758)
- Author: Charlie Fu | +146/-110 | 12 files
This PR enables the per_token_group_quant kernels for ROCm. Because those kernels are defined by the compilation target _C_stable_libtorch, this PR also enables that target. - Modify the per_token_group_quant kernels to support ROCm (warp sizes 64 and 32). - Change the GroupReduceMax device function so that it works with a warp size of 64. - Change hipify.py to handle a CMake issue when hipifying cu …
- e0081ef #44244 — [Benchmark] Enable reasoning-model (thinking) benchmarking via
`--chat-template-kwargs` for client-rendered datasets (#44244)
- Author: Albert Cheng | +88/-0 | 3 files
The custom and speed_bench datasets render prompts client-side via tokenizer.apply_chat_template() and, by default, post the already-rendered text to the /v1/completions endpoint. Because that apply_chat_template() call received no chat_template_kwargs, there was no way to benchmark reasoning models in their reasoning (“thinking”) mode on these datasets: - --extra-body only attaches fields…
- 597bc15 #44236 — fix: resolve CUTLASS fmin compatibility for DeepSeek-V4 init (#44236)
- Author: Willow Lopez | +4/-4 | 1 file
Replace cute.arch.fmin with cutlass.min in sparse_attn_compress_cutedsl.py to fix an AttributeError during the JIT compilation of the TileLang/CUTLASS kernel used by DeepSeek-V4 initialization. ## Root Cause nvidia-cutlass-dsl-libs-base==4.5.2 (the default for CUDA 12.x installs) does not export a standalone fmin() function under cute.arch. The nvvm_wrappers.py module defines fmax() but the co…
- 3f0a91b #44293 — Nit Changes in Tiered KV Offload (#44293)
- Author: Rotem Shavitt | +8/-0 | 3 files
Adds documentation to the fs tier manager explaining how to enable cross-process sharing of KV.
- 02a0149 #43838 — [Platform] Add is_cumem_allocator_available (#43838)
- Author: wangxiyuan | +12/-11 | 2 files
Moves is_cumem_allocator_available to the platform interface so that custom platforms can enable the cumem allocator feature as well.
- 27a93cd #44366 — [docker] Stop using extra-index-url for flashinfer-jit-cache (#44366)
- Author: Kevin H. Luu | +1/-1 | 1 file
flashinfer-jit-cache is currently quarantined on PyPI. Credit to @jstawinsky.
- b254e04 #44367 — [DSV4] Minor cleanup for DeepseekV4MegaMoEExperts (#44367)
- Author: Woosuk Kwon | +1/-18 | 1 file
Remove redundant _run_mega_moe
- a4ac746 #43332 — [MoE/b12x] Accept W4A16 (kNvfp4Static, None) in FlashInferB12xExperts supports check (#43332)
- Author: Junhao Shen | +27/-4 | 1 file
FlashInferB12xExperts._supports_quant_scheme (introduced by PR #40082) currently requires the activation key to be kNvfp4Dynamic, which makes the dispatcher reject every W4A16 NVFP4 checkpoint (activation_key == None) — e.g. nvidia/Qwen3.6-35B-A3B-2.06GB-per-token. This forces such checkpoints onto Marlin, even though the b12x kernel itself is W4A16-compatible. PR #42566 (“W4A16 NVFP4 fused Mo…
- 8b3b71e #44036 — [CI/Build] Bump flashinfer to v0.6.12 (#44036)
- Author: Vadim Gimpelson | +6/-6 | 4 files
Bump flashinfer to v0.6.12
- 0917a00 #44345 — Fix sparse NCCL weight transfer test construction (#44345)
- Author: Siddharth Bedekar | +7/-2 | 2 files
Forward-fixes https://github.com/vllm-project/vllm/pull/44272 for the tests that failed in the nightly CI. The production path already passes the model through WeightTransferEngineFactory.create_engine(…); the breakage was in stale test/example-style call sites that still constructed NCCLWeightTransferEngine(config, parallel_config) directly and that run only when more than 2 GPUs are available. This updates: - Ran targete…
- e15f202 #42187 — [ModelRunnerV2] Avoid pipeline parallel bubbles (#42187)
- Author: Nick Hill | +615/-144 | 19 files
Reorganizes PP scheduling and the broadcast of sampled tokens so that pipelining works properly for decode and (chunked) prefill. Decode tokens for a given request are now scheduled every pp_size steps. The broadcast of sampled tokens from the last stage to prior stages uses a different stream and process group, so it can happen in parallel with consecutive-stage p2p comms. For async scheduling we…
- e4a2e58 #44338 — [MRV2] Remove assignment of graph_pool in cudagraph_utils (#44338)
- Author: Woosuk Kwon | +0/-1 | 1 file
The line is redundant since both use current_platform.get_global_graph_pool(). Followup from #44078
- b8b49e2 #39667 — Bump actions/github-script from 8.0.0 to 9.0.0 (#39667)
- Author: dependabot[bot] | +7/-7 | 4 files
Bumps actions/github-script from 8.0.0 to 9.0.0.
- da107a5 #43458 — [MRV2] Also enable MRV2 for Llama and Mistral dense models (#43458)
- Author: Nick Hill | +59/-9 | 5 files
This is a combination of @yewentao256’s https://github.com/vllm-project/vllm/pull/42665 with additional fixes after iterating on the CI failures. For testing what CI issues remain.
- ed9a752 #44283 — [Anthropic] Support system role messages inside messages array (#44283)
- Author: Chauncey | +173/-17 | 3 files
Supports system role messages inside the messages array. Fixes https://github.com/vllm-project/vllm/issues/44000.
- 3f3e270 #43963 — [XPU] Enable rms_norm/act quant fusions (#43963)
- Author: liuzhenwei | +31/-14 | 4 files
Enables norm/act quant fusions and disables the warning when compile mode is not used.
- cab5c9a #44274 — [Core] Move
`max_concurrent_batches` to `VllmConfig` (#44274)
- Author: Nick Hill | +25/-37 | 11 files
The model executor interface has a max_concurrent_batches method but this logically fits better as a centralized config-derived method which isn’t executor-specific. It’s used by the core engine and may soon also be consumed by the V2 model runner.
- 774e552 #44025 — [compressed-tensors] Asymmetric support for MoE WNA16 marlin (#44025)
- Author: Brian Dellabetta | +126/-12 | 5 files
Prior to this PR, asymmetric WNA16 quantization schemes for MoEs were not supported through the compressed-tensors quant method. This PR removes that constraint. Resolves https://github.com/vllm-project/llm-compressor/issues/2628. Validated that W4A16_ASYM improves on wikitext PPL over the W4A16 (symmetric) baseline. Test result, W4A16 baseline: lm_eval --model vllm --model_args “pretrained=…
- 4d93bc3 #44013 — Migrate header files to torch stable abi (#44013)
- Author: Chris Leonard | +13/-16 | 18 files
While moving kernels to the libtorch stable ABI and into the csrc/libtorch_stable directory, we noticed that many header files were being left behind. This PR fixes that by moving headers that are ‘stable’ (i.e. no dependency on libtorch at all or only use stable headers, i.e. header files only used by stable kernels) into the libtorch_stable directory ~~and changing the r…
- 586201e #44320 — [Rust Frontend] Cover different thinking modes in roundtrip tests (#44320)
- Author: Bugen Zhao | +75/-21 | 1 file
Adds per-model thinking behavior to the roundtrip test fixtures so that reasoning_and_content covers all the different initial states of the reasoning parser: - explicit thinking enabled - explicit thinking disabled (if supported) - unspecified request behavior based on the real template default. The roundtrip tests were updated and all tests passed.
- 88f1721 #44308 — [ROCm] Fix AITER RMSNormQuantFusion for Kimi-Linear (#44308)
- Author: pschlan-amd | +9/-2 | 1 file
Launching vLLM with Kimi-Linear models (moonshotai/Kimi-Linear-48B-A3B-Base) and enabled AITER support crashes with ‘KimiGatedDeltaNetAttention’ object has no attribute ’num_v_heads’ in rocm_aiter_fusion.py. Apparently, we have to use num_heads instead for KimiGatedDeltaNetAttention. Run moonshotai/Kimi-Linear-48B-A3B-Base with vLLM and enabled AITER, perform basic inference testing. ## Test Resul…
- 880fc03 #44299 — [Rust Frontend] Support recursive tool parameter conversion (#44299)
- Author: Bugen Zhao | +435/-79 | 2 files
Extends the Rust tool-parameter conversion utility so that it can handle both existing raw string parameters and future parser-produced structured parameter inputs for nested argument fields. The main changes are: - add a parser-neutral ParamInput representation for raw text or named structured elements - make normalized parameter schema handling recursiv…
- 6314de8 #44168 — [XPU] [Bug] remove xpuw4a16 output size check (#44168)
- Author: zofia | +0/-7 | 1 file
- ea0d045 #44065 — [FlashAttention] Sync FA with upstream (#44065)
- Author: Matthew Bonanni | +2/-2 | 1 file
Corresponding PR: https://github.com/vllm-project/flash-attention/pull/141. Test plan: CI. Test result: TBD.
- 0cbc48c #42958 — Support ModelOpt MXFP8 non-gated MoE (#42958)
- Author: TomerBN-Nvidia | +14/-5 | 1 file
Summary: - FlashInfer now supports TRTLLM-GEN MXFP8 MoE kernels. - Allow ModelOpt MXFP8 TRTLLM MoE to run non-gated Relu2 activations through that backend. - Forward FlashInfer activation_type into the MXFP8 TRTLLM MoE call.
- 0eeba5e #42971 — Fix DFlash prefix cache corruption due to missing lookahead block (#42971)
- Author: Shreyas Kulkarni | +171/-3 | 2 files
Fixes a KV cache corruption bug in DFlash that causes persistent MGL degradation under prefix caching with concurrency. ## How we found this While trying out DFlash, under high concurrency with prefix caching enabled, we observed MGL degradation on requests that were otherwise running correctly. Their KV cache was being corrupted by other newly-arriving prefix-hit requests. We saw a steady decreas…
- f69ede4 #43421 — [XPU][Mamba] Triton-based selective scan forward op for XPU (#43421)
- Author: Marceli Fylcek | +523/-22 | 2 files
Adds a Triton implementation of the Mamba selective scan forward pass (selective_scan_fwd) to enable Mamba1 prefill on Intel XPU devices. Test result for tiiuae/falcon-mamba-7b:
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---:|---|---:|---|---|---:|---|---:|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.5208 | ± | 0.0138 |
| | | strict-match | 5 | exact_… |
- 2a2b5ca #44206 — [KV Offload] Add
`on_schedule_end()` hook to separate step lifecycle from event draining (#44206)
- Author: Ronen Schaffer | +54/-39 | 5 files
- Adds an on_schedule_end() lifecycle hook to OffloadingManager and SecondaryTierManager, called from OffloadingConnectorScheduler.build_connector_meta() at the end of each scheduler step - Separates step-lifecycle work (processing finished jobs, flushing batched promotions, resetting the per-step gate) from pure event observation in take_events()
- 689b0ee #43754 — [HARDWARE][POWER] Enable SHM communicator support for PowerPC (#43754)
- Author: Rukhaiya2004 | +125/-10 | 6 files
Enables SHM communicator support for PowerPC systems. Test result:
| Metric | Main Branch (Without SHM) | SHM Enabled Branch |
|---|---|---|
| Output Token Throughput (tok/s) | 97.92 | 100.26 |
| Total Token Throughput (tok/s) | 195.85 | 200.52 |
| Mean TTFT (ms) | 27585.48 | 26814.95 |
| Mean TPOT (ms) | 8…
- f8e9c56 #44126 — [Multimodal] Automatically select registered video loader for VLM (#44126)
- Author: Isotr0py | +188/-9 | 5 files
- Currently, there are various VideoProcessor implementations with different frame sampling algorithms. - However, fixed frame sampling is always used by default, and users have to specify the correct video loader themselves through VLLM_VIDEO_LOADER_BACKEND, which makes for a poor user experience. - This PR enhances the current video loader registration mechanism to tie the registry with VideoPr…
- e303132 #42977 — [Parser] Migrate
`ResponsesParser` to unified `Parser` interface (#42977)
- Author: alberto | +442/-94 | 4 files
Migrates ResponsesParser (used by ParsableContext in the Responses API) to use the unified Parser interface introduced in #32712, addressing a TODO from RFC #32713. Currently, ResponsesParser directly instantiates separate ReasoningParser and ToolParser instances, bypassing the unified Parser class. Meanwhile, the streaming path in responses/serving.py already uses the unified Parser correctly. Th…
- d247a9d #41627 — [EC Connector] Non blocking EC Connector lookup (#41627)
- Author: omerpaz95 | +154/-0 | 3 files
Introduce a non-blocking deferral mechanism for EC connector lookups in scheduler. I added ensure_cache_available() API to ECConnectorBase that the scheduler can poll in a non-invasive way. When a waiting request has multimodal features whose encoder cache is still being staged (e.g., remote → CPU prefetch in progress), the scheduler now skips the request for the current step and re-queues it in s…
- b817b23 #43883 — [Rust Frontend] add --enable-request-id-headers flag support. (#43883)
- Author: Maria Guevara | +170/-12 | 11 files
Migrates --enable-request-id-headers from the Rust frontend's unsupported-args list into a supported server option. This matches the behavior of XRequestIdMiddleware in the Python vLLM frontend: when enabled, the Rust frontend appends X-Request-Id to HTTP responses, using the incoming X-Request-Id request header if present or generating a uuid4 hex value otherwise. [x] cargo fmt --check [x] cargo clippy -p vllm-cmd -…
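The header behavior described above (reuse the caller's ID if one was sent, otherwise mint one) can be sketched in a few lines. This is an illustration in Python rather than the Rust or Python frontend code; `resolve_request_id` is a hypothetical helper, and treating an empty header value as absent is an assumption of this sketch.

```python
import uuid
from typing import Mapping


def resolve_request_id(request_headers: Mapping[str, str]) -> str:
    """Pick the value to send back in the X-Request-Id response header."""
    # HTTP header names are case-insensitive, so compare in lowercase.
    for name, value in request_headers.items():
        if name.lower() == "x-request-id" and value:
            return value
    # No usable incoming header: generate a uuid4 and use its 32-char hex form.
    return uuid.uuid4().hex
```

For example, `resolve_request_id({"x-request-id": "abc123"})` returns `"abc123"`, while a request without the header gets a freshly generated 32-character hex string.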
- 93da882 #44177 — [kv_offload] Add
`@override` decorators to subclass method implementations (#44177)
- Author: Ronen Schaffer | +68/-0 | 10 files
Adds @override (from typing_extensions) to all overriding methods in vllm/v1/kv_offload. This turns silent breakage into a static-checker error when a base-class method is renamed or removed without updating its overriders - a real risk as this subsystem grows (multi-tier offloading, new cache policies, additional secondary tiers). This is purely a static-typing annotation pass - no behavior chan…
🦀 Rust Frontend
- 27f1d34 #43590 — [Frontend][Responses API] Move developer-to-system conversion into HF renderer (#43590)
- Author: Chauncey | +391/-1 | 2 files
OpenAI's Responses API allows role: “developer” items in the input array. Clients such as Codex send developer messages for harness / policy text. vLLM's non-harmony path previously forwarded those items into the chat template, which on…
- b623f7e #44170 — [Frontend] Consolidate dev entrypoints. (#44170)
- Author: wang.yuqi | +90/-79 | 22 files
Follows #41907. Consolidates the dev entrypoints: - Update the documentation for the Server in development mode in online_serving. - Consolidate all dev_mode entrypoints related code into vllm/entrypoints/serve/dev/. - Tests appear to be missing for the Cache Management APIs, Weight Transfer APIs (RL Training), and Server Info. Test plan: tests/entrypoints/serve/dev/. Test result: pass.
🔧 Refactor
- e3e132d #44346 — [Refactor] Suppress SyntaxWarning from ast.literal_eval in tool parsers (#44346)
- Author: Flora Feng | +20/-15 | 7 files
Python 3.12+ emits SyntaxWarning for invalid escape sequences (e.g. \p in C:\path). LLM-generated tool call arguments can contain these, producing noisy warnings in server logs. This PR suppresses the warning by using safe_literal_eval().
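One way such a wrapper could work is to evaluate the literal inside a scoped warnings filter. The sketch below is an assumption about the general technique, not the body of vLLM's `safe_literal_eval()`; the name `literal_eval_quiet` is hypothetical.

```python
import ast
import warnings
from typing import Any


def literal_eval_quiet(source: str) -> Any:
    """Evaluate a Python literal without surfacing invalid-escape warnings."""
    with warnings.catch_warnings():
        # Python 3.12+ reports invalid escapes such as "\p" as SyntaxWarning;
        # older versions report the same problem as DeprecationWarning.
        warnings.simplefilter("ignore", SyntaxWarning)
        warnings.simplefilter("ignore", DeprecationWarning)
        return ast.literal_eval(source)
```

The filter is restored when the `with` block exits, so warnings raised elsewhere in the server are unaffected. The parsed value is unchanged: an unrecognized escape like `\p` is kept as a backslash followed by `p`.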
- 478b49d #44279 — [Refactor] Remove dead code from parser infrastructure (#44279)
- Author: Flora Feng | +35/-328 | 5 files
- Remove _WrappedParser and move sub-parser instantiation into Parser.__init__ - Delete MiniMaxM2Parser, which just composed two already-registered parsers - Strip unused ParserManager registry machinery (register_module, register_lazy_module, get_parser_internal, list_registered, import_parser, lazy-registration table)
- 7c37096 #44165 — [Core][Refactor]: thread
`scheduler_block_size` into KVCacheManager and KVCacheCoordinator (#44165)
- Author: Yifan Qiao | +99/-50 | 9 files
This is a small, behavior-preserving refactor that threads an explicit scheduler_block_size through KVCacheManager → KVCacheCoordinator → SingleTypeKVCacheManager, instead of having HybridKVCacheCoordinator recompute the LCM of group block sizes internally. Today the scheduler already resolves the scheduling-alignment granularity via resolve_kv_cache_block_sizes (returned as scheduler_block_size, …
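The shape of this refactor (compute the alignment granularity once, then pass it down instead of recomputing it) can be sketched as follows. The class and function names below are hypothetical stand-ins, not the vLLM classes, and the sketch assumes the granularity is the least common multiple of the per-group block sizes, as the description states for the previous internal computation.

```python
import math
from typing import Sequence


def lcm_block_size(group_block_sizes: Sequence[int]) -> int:
    """Smallest size that is a multiple of every group's block size."""
    return math.lcm(*group_block_sizes)


class SketchCoordinator:
    """Hypothetical coordinator that receives the value instead of deriving it."""

    def __init__(self, scheduler_block_size: int) -> None:
        self.scheduler_block_size = scheduler_block_size

    def align_down(self, num_tokens: int) -> int:
        # Round a token count down to a multiple of the scheduling granularity.
        return num_tokens // self.scheduler_block_size * self.scheduler_block_size


class SketchManager:
    """Hypothetical manager that threads the value through to the coordinator."""

    def __init__(self, scheduler_block_size: int) -> None:
        self.coordinator = SketchCoordinator(scheduler_block_size)
```

With group block sizes of 16 and 24, the resolved value is 48, and a count of 100 tokens aligns down to 96. Passing the value explicitly keeps a single source of truth for the granularity.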
- 68dafcc #44267 — [Refactor] Unify reasoning + tool-call parsing behind Parser.parse() (#44267)
- Author: Flora Feng | +409/-152 | 5 files
Consolidates reasoning extraction and tool-call extraction for the non-streaming chat completions path into a single entry point. Previously, OpenAIServingChat.chat_completion_full_generator did two separate steps: 1. reasoning_parser.extract_reasoning(…) → (reasoning, content) 2. OpenAIServing._parse_tool_calls_from_content(…) (a large static method on the base serving class) → (tool_calls, c…
⚡ Performance
- 95b1615 #44212 — [Perf] Improve multimodal item handling from O(n) to O(log n) per step (#44212)
- Author: Andy Lo | +92/-44 | 6 files
We observed that voxtral-realtime slows down a lot on long transcription sessions. This boils down to the fact that vLLM is not very efficient at handling a large number of multimodal items (for voxtral-realtime this can go up to 32K multimodal items). Proposed fix: use bisect to narrow the iteration range over multimodal features in the scheduler and model runner from O(n) to O(log n). This si…
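The bisect idea can be illustrated with a small sketch. It assumes multimodal features are kept sorted by position and do not overlap, so that both their start and end offsets form sorted lists; `features_in_window` and its parameters are hypothetical and not taken from the vLLM scheduler.

```python
import bisect
from typing import List


def features_in_window(starts: List[int], ends: List[int],
                       window_start: int, window_end: int) -> range:
    """Indices of features that overlap the token window [window_start, window_end).

    starts[i] and ends[i] (exclusive) describe feature i; both lists must be
    sorted ascending, which holds for sorted, non-overlapping features.
    """
    # First feature whose end lies beyond the start of the window.
    lo = bisect.bisect_right(ends, window_start)
    # First feature that begins at or after the end of the window.
    hi = bisect.bisect_left(starts, window_end)
    return range(lo, hi)
```

Two binary searches replace a scan over every feature, so locating the relevant slice costs O(log n) per step; only the features inside the returned range are then visited.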
- 1fa9ea0 #42212 — [Perf] Triton fast path for small CPU→GPU
`swap_blocks_batch` in the offloading connector (#42212)
- Author: Itay Etelis | +180/-11 | 3 files
OffloadingConnector copies KV between host and device via cuMemcpyBatchAsync. That call saturates PCIe for large contiguous copies, but on the CPU→GPU (onload / “read”) direction it collapses for small per-descriptor payloads — the regime KV offload actually runs in. This PR adds a small Triton kernel (_swap_blocks_kernel) that takes over the CPU→GPU direction, gated on batch size (n ≥ 16) a…
- 9af53a3 #44251 — [Perf] Add tuned selective_state_update configs for H200 and RTX PRO … (#44251)
- Author: Majid | +348/-0 | 4 files
Follow up to PR #43083 Add tuned selective_state_update configs for two additional GPUs not yet covered: - NVIDIA H200 (SM 9.0) - NVIDIA RTX PRO 6000 Blackwell Server Edition (SM 12.0) The merged configs in #43083 cover B200, GB200, and H100_80GB_HBM3. On the two devices above the loader falls back to the Triton built-in heuristic and leaves measurable performance on the table. Generate co…
- ca17b6b #42191 — [Perf] Apply single-pass min_larger finding and binary search in Triton Top-p path. (#42191)
- Author: Jongseok Park | +68/-178 | 1 file
Apply single-pass min_larger finding helper function introduced in PR #37225 to the Top-p path of the Triton top-k, top-p kernel. The helper function increases register pressure. To alleviate register use, the search algorithm of the Top-p path is now binary, instead of ternary. Bugs in the logit calculation and masking are also fixed. Correctness tested using tests/v1/sample/test_topk_topp_sample…
- 0b25cf4 #43534 — [CPU][Perf] Enable fused kernels for GDN’s gated delta rules (#43534)
- Author: Fadi Arafeh | +812/-585 | 11 files
- makes the gated delta rule impls from sglang-kernels ISA agnostic - unifies AMX impl for GDN with other CPU ISAs. - for non-x86 ISAs (which lack fast brgemm kernels): uses openblas (if available in pytorch wheel) or pytorch's blas fallbacks for GEMMs in chunk gated delta rule - fixes incorrect beta ptr access in sigmoid calculation in the fused_sigmoid_gating_delta_rule_update kernel - adds test…
- dcdfe66 #44220 — [Perf] use triton moe backend on hopper by default (#44220)
- Author: Jiangyun Zhu | +6/-0 | 1 file
vLLM currently uses the flashinfer MoE backend by default, which is slower than the triton backend. Tested on H200. For more benchmarks, see https://github.com/vllm-project/vllm/issues/41306.
🐛 Bug Fix
- 209709a #44348 — [Bugfix] Fix unstreamed tool call args dropped in Responses API streaming (#44348)
- Author: Flora Feng | +15/-4 | 3 files
When tool parsers stream arguments incrementally, they may buffer chunks that haven’t been sent to the client yet. On the final streaming delta, parse_delta(finished=True) calls _append_unstreamed_tool_args(), which computes the diff between the fully-parsed arguments and what was actually streamed, and appends the remainder. Without this flush, the client receives truncated or empty tool call arg…
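The flush step described above amounts to sending whatever part of the final argument string the client has not yet seen. A minimal sketch of that diff, assuming arguments are streamed as a growing prefix of the final JSON string; `unstreamed_remainder` is a hypothetical helper, not the `_append_unstreamed_tool_args()` implementation.

```python
def unstreamed_remainder(full_args: str, streamed_args: str) -> str:
    """Return the tail of full_args that has not been sent to the client yet."""
    if full_args.startswith(streamed_args):
        return full_args[len(streamed_args):]
    # Assumption of this sketch: if the streamed text is not a prefix of the
    # final arguments, there is no tail that can be appended safely.
    return ""
```

For instance, if the parser produced `{"city": "Paris"}` but only `{"city": ` was streamed, the final delta would carry `"Paris"}`.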
- ace95c9 #44347 — [Bugfix] Update TrtLLM MoE routing methods (#44347)
- Author: Wei Zhao | +24/-24 | 6 files
The PR introduces various fixes related to Trtllm MoE routing methods: - Revert _supports_router_logits_dtype change from https://github.com/vllm-project/vllm/pull/43859, which causes regression in nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8, see CI failure - Update RoutingMethodType in correspondence with flashinfer. - Refine get_routing_method_type to prevent stepfun-ai/Step-3.7-Flash from being c…
- f020435 #43862 — [Bugfix] fix crash in postprocess for null tool args (#43862)
- Author: William Rom | +30/-1 | 2 files
Closes #43851. When a request sends ‘argument: “null”’, json.loads(“null”) returns None, which gets stored in the message and crashes Jinja templates (e.g. GLM 5.1). This PR adds a None check to the __postprocess_messages argument parsing step, making the previously untrue comment “if arguments is None or empty string, set to {}” true. Claude Opus 4.7 was used for tracking down the bug and drafting the test. Regres…
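The root cause is easy to reproduce: the JSON document `null` is valid and decodes to Python's `None`, so a check performed only on the raw string is not enough. The sketch below shows the kind of normalization the description implies; `normalize_tool_arguments` is a hypothetical function, not the vLLM message post-processing code.

```python
import json
from typing import Any, Optional


def normalize_tool_arguments(raw: Optional[str]) -> Any:
    """Decode tool-call arguments, mapping missing or null values to {}."""
    if raw is None or raw == "":
        return {}
    parsed = json.loads(raw)
    # json.loads("null") yields None; check again after decoding.
    return {} if parsed is None else parsed
```

Checking after decoding covers the case the pre-decode check misses, so templates that iterate over the arguments always receive a mapping for null input.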
- 969aec4 #44356 — [Bugfix] Fix Deepseek v4 non-mega-moe model init error (#44356)
- Author: Wei Zhao | +8/-0 | 1 file
Fixes an init error introduced by https://github.com/vllm-project/vllm/pull/43339, which only updated _init_mega_moe_experts but not _init_fused_moe_experts.
- e9e08c4 #44082 — [Bugfix] Cache the EAGLE/MTP lookahead block in the SWA prefix-cache mask (#44082)
- Author: Yifan Qiao | +338/-82 | 5 files
PR #42258 added SlidingWindowManager._cache_block_mask() to skip caching SWA blocks that can never serve a prefix-cache hit. When EAGLE/MTP speculative decoding is active and the cache-hit alignment (the LCM of per-group block sizes) is larger than the SWA window, that mask is too aggressive: EAGLE’s lookup needs tail + 1 contiguous cached blocks, and the extra +1 block lives at the first po…
- fe32e78 #43669 — [Bugfix] flashinfer: fail fast when --kv-cache-dtype nvfp4 used on unsupported arch (#43669)
- Author: Kartavya sonar | +7/-0 | 1 file
Problem: --kv-cache-dtype nvfp4 is silently accepted on architectures without a trtllm-gen FP4 FMHA kernel. The engine starts cleanly, captures graphs, then dies on the first request with either: - AttributeError: module 'torch' has no attribute 'nvfp4' (flashinfer dtype resolution), or - RuntimeError: Unsupported architecture deep in trtllm-gen FMHA The server appears healthy until the first to…
- afcb580 #43100 — [BugFix] Fix Humming MoE deploy error (#43100)
- Author: Alireza Dadgarnia | +3/-2 | 1 file
We attempted to deploy the model ISTA-DASLab/Qwen3.6-35B-A3B-2Bit-GSQ, which uses Humming packed quantization. With the latest version of vLLM, the model weights load successfully, but execution immediately fails with an assertion error at line 871 in: vllm/model_executor/layers/quantization/humming.py Specifically: The issue is that self.moe_quant_config is never initialized because self.get_fuse…
- c91a87f #43978 — [BugFix] [GDN] Read linear_key_head_dim from hf_text_config for multimodal models (#43978)
- Author: IdoAtadTD | +2/-2 | 1 file
For multimodal Qwen3.5 models (e.g. Qwen3.5-397B-A17B), linear_key_head_dim (128) lives on hf_text_config, not hf_config. GDN prefill backend selection only read hf_config, so head_k_dim was None and CuteDSL/FlashInfer on Blackwell (SM100) was never enabled: INFO 05-29 16:47:49 [qwen_gdn_linear_attn.py:228] Using Triton/FLA GDN prefill kernel (requested=auto, head_k_dim=None). This mirrors the pat…
- 0bdfd5e #44282 — [Bugfix] Vendor MiniCPMV/MiniCPMO processors to unblock Transformers v5 (#44282)
- Author: 王金旭 | +965/-16 | 7 files
Rebased and continued from #38437. Vendors MiniCPMVProcessor and MiniCPMOProcessor into vLLM to unblock the Transformers v5 upgrade. Fixes #38385 and the MiniCPMO skip in #30566. The original commit by @guanwei-wu is preserved verbatim via git cherry-pick -x, so author signature, sign-off, and source-commit reference are retained. Handoff coordinated with @tc-mb (MiniCPM-V team) per the discussion…
- 2fd0e52 #44232 — [Bugfix] Fix Gemma4 startup crash with recent transformers multimodal processor (#44232)
- Author: Luciano Martins | +19/-0 | 1 file
Fixes a StopIteration → ValueError crash during Gemma4 server startup (profiling phase) when using recent versions of transformers that refactored ProcessorMixin.__call__ to enforce 1:1 matching between multimodal placeholder tokens and replacement data. - During KV cache profiling, vLLM calls _apply_hf_processor_text_only to tokenize the dummy prompt (<|video|>) without multimodal data - The base im…
- 654bd2b #42967 — [Bugfix] Sync block_size from EngineCore to frontend for hybrid Mamba… (#42967)
- Author: gruner | +32/-1 | 4 files
… models - For hybrid Mamba models (Qwen3_5MoeForConditionalGeneration), _align_hybrid_block_size() enlarges block_size in the worker process (e.g. to 528 or 1056 tokens) but this update was never synced back to the parent APIServer process via EngineCoreReadyResponse - This caused vllm:cache_config_info to report block_size=16 (stale default) instead of the actual runtime value - Fix adds block_s…
📖 Documentation
- 0e2b131 #44388 — [Doc] Update ViT CUDA graph interfaces (#44388)
- Author: Shanshan Shen | +12/-16 | 1 file
Addresses https://github.com/vllm-project/vllm/pull/41234#issuecomment-4608749739. Updates the ViT CUDA graph doc following the changes in https://github.com/vllm-project/vllm/pull/41234 and https://github.com/vllm-project/vllm/pull/42288.
🧪 CI/Tests
- 87954eb #36949 — [ROCm][CI] Optimize ROCm Docker build: registry cache, DeepEP, and ci-bake script (#36949)
- Author: Andreas Karatzas | +2745/-157 | 10 files
Implements the three-tier Docker build for ROCm CI. Every PR currently rebuilds RIXL, DeepEP, rocshmem, torchcodec, and RDMA libraries from scratch, costing a total of 26 minutes on average per build. This PR introduces a pre-built Tier-1 ci_base image that absorbs those stable layers. Per-PR builds then only rebuild the thin vLLM wheel + workspace layer. Image registry layout after this PR: |…
- e670638 #44352 — [CI] Add missing vllm/parser/ CI trigger and fix test_parse.py (#44352)
- Author: Flora Feng | +7/-4 | 2 files
#44279 removed _WrappedParser from vllm/parser/abstract_parser.py and updated test_streaming.py to use DelegatingParser, but test_parse.py was added by a separate PR around the same time and still referenced _WrappedParser. The CI didn’t catch this because vllm/parser/ wasn’t listed as a trigger path for the job — only tests/parser was. So the job only ran when test files changed, not when the sou…
- 53b88d1 #44042 — [CI] Reject out-of-vocabulary before they reach the GPU logprob path (#44042)
- Author: Andreas Karatzas | +82/-9 | 4 files
This stabilizes a flakiness in ROCm CI seen at “AMD: Entrypoints Integration (API Server openai - Part 3) (mi325_1)”: - https://buildkite.com/vllm/ci/builds/68845/canvas?sid=019e7252-a1c0-4bdd-a643-d3d57f1b41f1&tab=output The first real failure came from the Schemathesis-generated /inference/v1/generate request: After that request, the engine hit a ROCm/HSA hardware exception and died. The later f…
- 7b476c8 #44369 — [ROCm][CI] Skip fp8 reload tests on gfx90a (MI250) (#44369)
- Author: JartX | +19/-3 | 1 file
supports_fp8() returns True on gfx90a for the general (upcast) quant paths, so the fp8 reload/online-quantize/kv-scale tests were not skipped on MI250 even though it has no native fp8 and cannot run these models. Add a gfx90a guard so they skip there, without touching supports_fp8(). @AndreasKaratzas
- 4454a18 #44368 — [ROCm][CI] Fix stale wvSplitK GEMM fallback test for N=5 (#44368)
- Author: JartX | +4/-2 | 1 file
PR #40687 raised the wvSplitK skinny-GEMM cutoff from n<=4 to n<=5, but test_rocm_unquantized_gemm_gfx1x_n_gt_4_falls_back still used n=5 and asserted a fallback, so it fails (wvSplitK is now called). Use n=6 and rename to …_n_gt_5_falls_back to test the real fallback boundary. Tested on gfx1100: fails before / passes after against the n<=5 impl. @AndreasKaratzas
🔩 Misc
- bd98e97 #44128 — [Misc] Remove dead VLLM_RPC_TIMEOUT env var and fix profiling doc that references it (#44128)
- Author: Daoyuan Li | +1/-15 | 11 files
VLLM_RPC_TIMEOUT is documented in vllm/envs.py (“Time in ms for the zmq client to wait for a response from the backend server for simple data operations”, default 10000), but it has no consumers anywhere in the tree — it is a V0 leftover. In V1, the engine-core client waits on utility RPCs without any timeout: - SyncMPClient.call_utility → future.result() (no timeout arg) - AsyncMPClient._call…
- 5577811 #44350 — [Misc] Remove stray empty file (#44350)
- Author: Matthew Bonanni | +0/-0 | 1 file
Remove stray empty file accidentally introduced by #43754
- 53fa09d #43843 — [Misc] Support local image encoding in benchmarks (#43843)
- Author: XiaoZ | +238/-15 | 3 files
This PR adds an option for custom_image benchmark datasets to encode local image files before sending requests to the serving endpoint. By default, image references are sent to the serving endpoint directly and must be readable by the server. This works when the benchmark client and server share the same env, but may fail when they run on different machines or containers. With this PR, users can p…
🖥️ Kernel
- 3099de3 #42027 — [Kernel][MoE] Add GELU_TANH to CPU, CUTLASS, and WNA16 MoE backends (#42027)
- Author: SeongJun Lee | +119/-7 | 7 files
Core GELU_TANH MoE activation is already on main, but three backend paths still reject it:
| Backend | Problem |
|---|---|
| CPU fused MoE | No GELU_TANH entry in activation map, KeyError at runtime |
| CUTLASS FP8/FP4 | Not listed in _supports_activation, unnecessary fallback |
| WNA16 MoE | Hard-asserts SiLU, crash on any non-SiLU model |
Added GELU_TANH to CPU and CUTLASS activation lists. Remo…
✨ New Feature
- 2427094 #43339 — [Feature] Support EPLB for DeepSeek v4 Mega Moe (#43339)
- Author: Wei Zhao | +232/-46 | 4 files
Supports EPLB for DeepSeek v4 Mega MoE. Tested with gsm8k and GPQA evals on DeepSeek v4 Mega MoE with EPLB. Perf was measured on 8xB200 with and without --enable-eplb; the perf results are consistent across runs.