vLLM Daily — 2026-06-05 | Richelieu's Blog

58 commits touching 345 files, with +12156/-3907 lines changed.

Summary

Metric	Value
Commits	58
Files changed	345
Lines added	+12156
Lines deleted	-3907

Commit list

📦 Other

91e17d4 #38804 — Fix sarvam forward compatibility with transformers v5 (#38804)
- Author: Flame | +34/-1 | 1 file
This PR fixes issue #38734. Root cause: sarvam_mla has a custom PretrainedConfig that is incompatible with transformers v5, because the ignore_keys parameter, which previous versions of transformers supported, was removed from the validate_rope() method. It was replaced with a class variable in PretrainedConfig, ignore_keys_at_rope_validation. The problematic line is at line 134 in the sarvam config…
6a89457 #41968 — Add objectstore as a secondary tier to multi-tier kv cache offloading (#41968)
- Author: Effi Ofer | +659/-2 | 6 files
This PR adds an object store secondary tier to multi-tier KV cache offloading. The implementation uses NVIDIA NIXL to access the object storage. With the object store secondary tier, any S3-type object store can be used by passing the object store configuration to kv-connector-extra-config. For more details on multi-tier see the multi-tier RFC here: https://github.com/vllm…
7f003a1 #44609 — Support MiniCPMV batched preprocessing (#44609)
- Author: Yan Ma | +63/-56 | 1 file
This PR refines the vendored MiniCPMVProcessor to support batched text+image preprocessing, following the upstream MiniCPM-V processor code. It fixes an error raised when running python3 ./examples/generate/multimodal/vision_language_offline.py -m minicpmv.
a80af24 #44635 — Speed up docs build (#44635)
- Author: Harry Mellor | +234/-159 | 32 files
In my local environment, the combination of these changes reduces the build time from 376s to 275s. This is a ~27% decrease, which will noticeably improve build and queue times in CI. The performance-affecting changes are: - Exclude vendored HF processor & config classes from API reference - these model specific classes are not important to include in the API reference - Removing separate_signature and show_sig…
6a11d72 #44588 — [Reasoning][Structured Outputs] Add Command A plus tags for structural tags (#44588)
- Author: rishitdholakia13 | +1/-1 | 1 file
This PR adds the Command A+ reasoning parser to the Cohere2MoeForCausalLM architecture.
02d2da0 #44561 — [DSV4] Move more ops out of eager breakpoint (#44561)
- Author: Woosuk Kwon | +30/-14 | 1 file
Same config each run (MTP-3, 8192-in/1024-out, low_entropy), 3 full sweeps per version, mean±std (population stdev):
62215e7 #43167 — Remove KV cache scale boilerplate from model weight loading methods (#43167)
- Author: Harry Mellor | +88/-731 | 56 files
The general changes in this PR are: - Converts get_cache_scale into get_cache_scale_mapper - Use this new mapper at the top level of AutoWeightsLoader.load_weights - Add KVCacheScaleParameter for BaseKVCacheMethod so that coercion to scalar happens automatically This allows us to: - Remove ~10 lines of boilerplate from every load_weights method in modelling code - Every new model going forward wil…
7fe7800 #43150 — [BUG] Fix FP64 Gumbel precision coverage (#43150)
- Author: Tianyu Zhang | +391/-21 | 11 files
The existing --use-fp64-gumbel flag only covered the explicit Triton Gumbel sampler. V1 sampling and spec decode also use the equivalent exponential-race form q.exponential_(); probs / q; argmax, so those paths still used fp32 exponential noise even when the precision flag was enabled. Thread use_fp64_gumbel through the Python V1 sampler, TopKTopPSampler, rejection sampler recovery sampling, and L…
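The exponential-race form mentioned above can be sketched in plain Python. This is an illustration only, not vLLM's sampler: the function name is hypothetical, and it relies on the standard fact that if q is Exponential(1) then -log(q) is Gumbel(0, 1), so argmax(probs / q) picks index i with probability probs[i]. Python floats are 64-bit, so the noise here is drawn at fp64 precision.

```python
import math
import random


def sample_exponential_race(probs, rng):
    """Return argmax_i probs[i] / q_i, where each q_i ~ Exponential(1).

    Hypothetical helper for illustration; it is not part of vLLM.
    """
    best_index, best_score = -1, -math.inf
    for i, p in enumerate(probs):
        q = rng.expovariate(1.0)
        while q == 0.0:  # vanishingly rare, but avoids division by zero
            q = rng.expovariate(1.0)
        score = p / q
        if score > best_score:
            best_index, best_score = i, score
    return best_index


def empirical_frequencies(probs, num_samples, seed):
    """Draw num_samples indices and return the observed frequency of each."""
    rng = random.Random(seed)
    counts = [0] * len(probs)
    for _ in range(num_samples):
        counts[sample_exponential_race(probs, rng)] += 1
    return [c / num_samples for c in counts]
```

With enough samples the observed frequencies should approach the input probabilities, which is a quick way to sanity-check that a sampling path draws from the intended distribution.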
8a83e6f #44591 — [Rust Frontend] Batch auto-abort requests by engine (#44591)
- Author: HueCodes | +118/-25 | 2 files
Coalesce Rust frontend auto-abort requests by engine before sending Abort messages. This reduces IPC round trips when many live streams are dropped at once. The abort worker still skips inactive requests and keeps the existing per-cause logging behavior. Test result: all passed. cargo test: 76 passed, 0 failed, 0 ignored, 0 doctests.
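The coalescing idea can be sketched as follows. The real change is in the Rust frontend; this Python version is only an analogue, and the function name, the (engine_id, request_id) pair shape, and the active-set argument are assumptions made for illustration.

```python
from collections import defaultdict


def coalesce_aborts(pending, active_request_ids):
    """Group pending aborts into one batch per engine.

    pending: iterable of (engine_id, request_id) pairs (hypothetical shape).
    active_request_ids: set of request ids that are still live.
    Inactive requests are skipped, mirroring the behavior described above.
    """
    batches = defaultdict(list)
    for engine_id, request_id in pending:
        if request_id not in active_request_ids:
            continue  # nothing to abort for a request that is already gone
        batches[engine_id].append(request_id)
    return dict(batches)
```

One message per engine then replaces one message per request, so dropping N streams that share an engine costs a single round trip instead of N.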
efc347f #44066 — docs: fix tokenizer optimization typo (#44066)
- Author: Chunyang Wen | +1/-1 | 1 file
Fix a typo in the doc: Modes -> Models. Test plan: build and check the pages. Test result: as expected.
d98b8f3 #43874 — [NixlConnector] Initiate deprecation cycle for kv_both role (#43874)
- Author: Nicolò Lucchesi | +94/-24 | 10 files
Implement “Phase1” of the deprecation strategy described in https://github.com/vllm-project/vllm/issues/43807. That is a “soft deprecation” in which: - we change all official examples to refer users to using kv_producer/consumer roles - we warn users that kv_both is deprecated when detected - no effective functional change is expected with this PR, roles are not yet used/assumed We’ll follow up wi…
e64237a #44391 — [Rust Frontend] Support include_reasoning=false (#44391)
- Author: Chao-Ju Chen | +544/-26 | 4 files
This PR adds Rust frontend support for include_reasoning=false in OpenAI chat completions. - Accept include_reasoning=false during chat request compatibility validation. - Preserve the request flag through chat request preparation. - Suppress reasoning output in non-streaming responses and streaming delta.reasoning chunks when disabled. - Add focused unit and route coverage for the non-streaming a…
d2f70da #44603 — fix: pad dummy run query_start_loc (#44603)
- Author: Uranus | +3/-0 | 1 file
Hi from the novita.ai team 👋 The decoder instance crashes when running GLM-5.1-FP8 in a disaggregated setup. The crash was diagnosed from a CUDA coredump and by logging before every torch.repeat_interleave call. The root cause is that, during a dummy run, query_start_loc is not a monotonic sequence. Test plan: run the same requests with the fixed decoder. Test result: the decoder worked fine for several hours.
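Why a non-monotonic query_start_loc is a problem can be shown with a small pure-Python sketch. It assumes, as an illustration only, that per-request token counts are taken as differences of consecutive offsets and that padding repeats the last valid offset; both helper names are hypothetical and the actual three-line fix in the PR may differ.

```python
def tokens_per_request(query_start_loc):
    """Per-request token counts as differences of consecutive offsets."""
    return [b - a for a, b in zip(query_start_loc, query_start_loc[1:])]


def pad_query_start_loc(query_start_loc, padded_len):
    """Pad cumulative offsets with the last value so they stay non-decreasing.

    Hypothetical helper: padded slots then describe zero-length requests.
    """
    if not query_start_loc:
        raise ValueError("query_start_loc must not be empty")
    if padded_len < len(query_start_loc):
        raise ValueError("padded_len is shorter than the input")
    fill = query_start_loc[-1]
    return list(query_start_loc) + [fill] * (padded_len - len(query_start_loc))
```

A zero-filled buffer such as [0, 4, 8, 0, 0] yields a negative count for the first padded slot, which is the kind of value a repeat operation cannot accept; padding with the last offset yields zeros instead.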
ef3af56 #44617 — Fix LLM.wait_for_completion output type docstring (#44617)
- Author: Vic Wen | +2/-1 | 1 file
Fixes #44616. The LLM.wait_for_completion() docstring currently describes output_type as defaulting to RequestOutput, but the implementation accepts both RequestOutput and PoolingRequestOutput when output_type is not provided. This PR updates the public API docstring so it matches the actual behavior.
c505cd9 #44605 — [CI/Build] Disable CPU-Compatibility Tests (#44605)
- Author: Li, Jiang | +13/-12 | 1 file
The test dependencies can’t be downloaded from CI hosts because of a new firewall policy.
96229fa #43720 — [KVConnector][1/N] PP-aware handshake aggregation and intermediate-PP output plumbing (#43720)
- Author: qizixi | +459/-26 | 9 files
The goal of this PR is to make kv connector / engine / model runner / gpu worker PP-aware, to lay the foundations to support PP (pipeline parallelism) + PD disaggregation. - This PR is connector agnostic, the changes here are needed for both NIXL connector and Mooncake connector to support PP + PD. - This PR is a pure PP-aware refactor and does not introduce any behavior changes. - Widen \Transf…
4efd6ff #44569 — [DSV4] Refactor DeepseekV4Attention (#44569)
- Author: Woosuk Kwon | +521/-918 | 8 files
1. Merge DeepseekV4MLA and DeepseekV4MLAAttention into DeepseekV4Attention as the two classes were essentially no-ops. 2. Better code sharing between ROCm and NVIDIA. The dispatching logic is implemented cleanly through class inheritance. This also reduces imports in the shared code. This PR is pure code re-organization and does not change anything logically. Therefore, the performance and accurac…
56aff0d #44334 — [10/n] Migrate cuda_view and silu_and_mul_per_block_quant kernels to torch stale ABI. (#44334)
- Author: Chris Leonard | +187/-164 | 25 files
Continues the libtorch stable ABI migration by moving several kernels out of legacy _C and into _C_stable_libtorch. Ops migrated: - get_cuda_view_from_cpu_tensor: CPU pinned/UVA tensor → CUDA view; uses a version-guarded deleter (supported for 2.11; the 2.10 fallback copies to device). - silu_and_mul_per_block_quant: fused SiLU+Mul + per-block FP8/INT8 quant. - Removed cuda_utils_kernels.cu and …
063ce98 #42139 — [XPU][MoE] support block_fp8_moe on xpu (#42139)
- Author: zofia | +35/-2 | 3 files
Adds a new XPUExpertsBlockFp8 class for block FP8 MoE models. Tested with: Qwen/Qwen3-30B-A3B-Instruct-2507-FP8
62d6f06 #44500 — [Rust Frontend] Skip loading multimodal processor if --language-model-only is specified (#44500)
- Author: Bugen Zhao | +87/-8 | 8 files
If --language-model-only is specified, we can stop eagerly loading multimodal processor metadata, which avoids failures during startup for unsupported or newly shaped multimodal models. This isn’t actually how the Python frontend works, but I believe it’s good to have. We’ll also still forward the argument in managed engine mode. Test plan: unit tests and e2e tests. Test result: All targeted Rust tests and checks…
b7c5baf #43926 — fix: keep DeepSeek V4 RoPE cache on inv_freq device (#43926)
- Author: Schwinn Saereesitthipitak | +1/-1 | 1 file
Summary: DeepseekV4ScalingRotaryEmbedding._compute_cos_sin_cache() currently creates inv_freq on the active/default torch device, but creates the position arange with device=current_platform.device_type. That can mix meta tensors with CPU/CUDA tensors during meta-device model construction: - inv_freq: follows torch.device("meta") - t: forced to current_platform.device_type The constructor then…
a55fccf #44539 — [mamba] unify KDA conv states into one cache to match 2-state SSM layout (#44539)
- Author: Jiangyun Zhu | +16/-30 | 3 files
Aligns with GDN for future PD support; see https://github.com/vllm-project/vllm/pull/44064. Test result:
|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|3|flexible-extract|5|exact_match|↑|0.8893|±|0.0086|
| | |strict-match|5|exact_match|↑|0.8749|±|0.0091|
41a4829 #43707 — [Logs Refactor] Optimize shutdown logs, easier to follow and consistent (#43707)
- Author: Wentao Ye | +139/-23 | 7 files
Thanks to @robertgshaw2-redhat for the context. The shutdown log is currently complicated and hard to follow; this PR fixes that. Test command: vllm serve Qwen/Qwen2.5-1.5B-Instruct --served-model-name qwen2.5-1.5b --tensor-parallel-size 4 --max-model-len 4096 --gpu-memory-utilization 0.6 --max-num-seqs 16 --port 9256 --shutdown-timeout 15, then shut down.
38fd240 #41980 — use split_group for pytorch process group creation (#41980)
- Author: Tushar Jain | +527/-50 | 11 files
Summary: Use torch.distributed.split_group for process-group creation. This PR replaces torch.distributed.new_group with torch.distributed.split_group for cpu/device subgroup creation in GroupCoordinator. split_group is required by both the deprecation of lazy NCCL initialization and the planned migration to torchcomms. 1. Lazy init is going away; eager init is now the recommended path. P…
4cc78c9 #44363 — [Core] Freeze garbage collector in workers after model initialization (#44363)
- Author: Tyler Michael Smith | +8/-0 | 1 file
Drastically reduces P99 ITL in some cases (especially WideEP) by freezing the garbage collector at the end of compile_and_warm_up_model.
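A minimal sketch of the technique, using only the standard library gc module. The helper name is hypothetical, and whether the PR runs a collection before freezing is an assumption; gc.freeze() itself moves every currently tracked object into a permanent generation that later collections skip.

```python
import gc


def freeze_after_warmup():
    """Freeze long-lived objects so later GC passes do not rescan them.

    Hypothetical helper; intended to be called once, after initialization
    and warm-up have allocated the objects that live for the whole process.
    """
    gc.collect()  # assumption: drop init-time garbage before freezing
    gc.freeze()   # move all tracked objects to the permanent generation
    return gc.get_freeze_count()
```

Objects created after the call are still collected normally. The gain comes from full collections no longer walking the large, effectively immortal object graph built during model initialization, which is where long pauses tend to originate.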
b21443e #43519 — Add model support for granite speech plus (#43519)
- Author: Zvi Kons | +106/-3 | 6 files
Adds support for the GraniteSpeechPlus architecture (GraniteSpeechPlusForConditionalGeneration) to vLLM, enabling inference for models such as ibm-granite/granite-speech-4.1-2b-plus. The implementation extends the existing GraniteSpeech model: - Refactors GraniteSpeechForConditionalGeneration to expose a _build_encoder hook so subclasses can swap in a custom encoder without duplicating the rest of…
06ee2d8 #44340 — [Quant] Support compressed-tensors WNA8O8Int linears and WNInt embeddings (#44340)
- Author: Michael Goin | +744/-27 | 14 files
Requires compressed-tensors bump for embedding support with https://github.com/vllm-project/compressed-tensors/pull/718 Introduces two new specialized methods for compressed-tensors to dispatch to: * CompressedTensorsWNA8O8Int to support any linear layer that has INT1-8 weights with static per-tensor scales for INT8 inputs and outputs. Currently we are running these as fake quants on the input+out…
b5235fc #43827 — [DSv4] Adding TRTLLM gen attention kernel (#43827)
- Author: Yongye Zhu | +2971/-398 | 20 files
Rebase of @PerkzZheng’s #42316 onto current main, plus a few materially new pieces: - Once-per-step C128A metadata caching — adds FlashInferMLASparseMetadata + FlashInferMLASparseMetadataBuilder so that for compress_ratio == 128 layers the mixed-sparse-index Triton kernel runs once per step instead of once per layer. The SWA-baked combine is materialized lazily on first access by ensure_sp…
0c96dd6 #43625 — [ROCm] Bump fastsafetensors to v0.3.2 from PyPI, remove git source build (#43625)
- Author: Turner Jabbour | +14/-9 | 8 files
Bumps fastsafetensors from a pinned git commit source build to the PyPI release v0.3.2 across all requirements files. Previously, ROCm required installing fastsafetensors directly from git (git+https://…@) because earlier PyPI releases only shipped CUDA wheels. v0.3.2 ships a universal wheel with CUDA/ROCm runtime detection (foundation-model-stack/fastsafetensors#78), so the source build…
68f5e56 #42554 — [PD][Nixl] Mamba prefix caching mode support (#42554)
- Author: Nicolò Lucchesi | +97/-6 | 3 files
This PR adds support for PD Mamba setups to make use of prefix caching (“all”, “align” as well as the upcoming https://github.com/vllm-project/vllm/pull/37898). It merely adds the logic to handle the result of a prefix cache hit, so it is agnostic to the actual caching implementation. Running without this PR with prefix caching enabled will run into this assertion as prefix caching in mamba will…
f35b557 #44534 — Add GH token to docs build pre run check (#44534)
- Author: Harry Mellor | +7/-1 | 1 file
Increases the rate limit for the docs build skip check from 60/hour to 5000/hour.
e68988a #42443 — Refactor CT NVFP4 linear to use a single class (#42443)
- Author: Dipika Sikka | +55/-162 | 6 files
- Remove vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w4a16_nvfp4.py and use the singular compressed_tensors_w4a4_nvfp4.py class - Update / expand linear layer unit test test_compressed_tensors_nvfp4 ## Test Result - Passes for
9061935 #43556 — [Attention] Mamba attention module refactor - LINEAR (#43556)
- Author: wangxiyuan | +505/-551 | 7 files
Follows https://github.com/vllm-project/vllm/pull/41126/. This is the 2nd PR for the mamba attention module refactor. This PR merges BailingMoELinearAttention and MiniMaxText01LinearAttention into model_executor/layers/mamba/linear. After this PR:
|Model|mamba type|pluggable|location|Used by|
|-|-|-|-|-|
|BailingMoELinearAttention|linear_attention|Yes|model_executor/layers/mamba/gdn/bailing_linear_…
1bdc60e #44493 — Fix Kimi-K2.5 FlashInfer ViT metadata (#44493)
- Author: Kevin_Xiong | +109/-28 | 2 files
Fix Kimi-K2.5 ViT metadata handling when using FlashInfer attention; previously it raised an error. This PR also removes an unexpected device synchronization by keeping grid_thws on CPU. Test plan: run OCRbench. Test result: the score is consistent with Kimi2.6’s official score.
a618356 #43447 — [Prefix Caching] DeepSeekv4 - Support selective prefix-cache retention for sliding-window KV cache (#43447)
- Author: Wei Zhao | +792/-45 | 7 files
Co-author: @ivanium. DeepSeek v4 now exhibits very low effective prefix cache capacity. For example, on TP8 with 8xB300, the reported KV cache capacity is ~14.5x concurrency. However, a microbenchmark that sends 1M-context requests sequentially shows that after the second request is sent, replaying the first request already begins to miss the prefix cache. This means the practical prefix-cache rete…

🐛 Bug Fix

aa6fb8a #44648 — [Bugfix] [ROCm] [Critical] fallback to regular abi for ROCm (#44648)
- Author: TJian | +127/-25 | 6 files
Test plan: 1. Build successfully. 2. Pass test_uva.py. Test result: 1. Built successfully. 2. Ran pytest -svvvvv tests/kernels/core/test_uva.py. Extra details: the silu_and_mul_per_block_quant kernel that was moved into the stable ABI in https://github.com/vllm-project/vllm/pull/44334 works on ROCm. pytest tests/kernels/core/test_fused_silu_mul_block_quant.py test results: =============…
bbb6c27 #44615 — [Bugfix] Fix gemma4 crash on CPU: guard mem_get_info call (#44615)
- Author: adhithyamulticoreware | +5/-0 | 1 file
What this fixes: Fixes #44039. On CPU, current_platform.mem_get_info returns None (via the Platform.__getattr__ fallback, since torch.cpu has no mem_get_info attribute). Both _process_image_input and _process_video_input in gemma4_mm.py called it unconditionally, crashing with TypeError: ‘NoneType’ object is not callable during model warmup. Root cause: Platform.__getattr__ delegates attribute lookups t…
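The failure mode and the guard can be sketched with stand-in classes. Everything here is hypothetical (the class names, the helper, and the returned numbers); it only illustrates how a __getattr__ fallback that returns None turns a missing method into a "not callable" error, and how checking before the call avoids it.

```python
class CpuLikePlatform:
    """Hypothetical platform whose attribute fallback returns None."""

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        return None


class GpuLikePlatform:
    """Hypothetical platform that does provide mem_get_info."""

    def mem_get_info(self):
        return (1024, 4096)  # (free_bytes, total_bytes), made-up values


def free_memory_bytes(platform):
    """Return free memory in bytes, or None when the platform cannot report it."""
    mem_get_info = platform.mem_get_info
    if mem_get_info is None:
        return None  # guard: calling None would raise TypeError
    free_bytes, _total_bytes = mem_get_info()
    return free_bytes
```

Callers then treat None as "no memory information available" and skip any logic that depends on it.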
d61d856 #44622 — [Bugfix] Update mistral tokenizer test for continue_final_message fix (#44622)
- Author: XuZhou | +2/-2 | 1 file
Fix flaky test caused by upstream mistral-common bug fix. mistral-common PR #233 fixed a bug where Tekken normalizers (V7/V15) did not forward continue_final_message, causing a spurious EOS token (</s>, token id 2) to be appended even when continue_final_message=True. After upgrading mistral-common, the expected values in tests/tokenizers_/test_mistral.py became stale. This PR removes the trailing…
6542d48 #44618 — [Bugfix] Fix test_invocations flaky failure with newer openai SDK (#44618)
- Author: XuZhou | +8/-2 | 1 file
Fix flaky test_invocations failure caused by openai SDK version mismatch. The test compared keys from client.chat.completions.create(…).model_dump() (openai SDK) against raw JSON from the /invocations endpoint (via requests). Starting with openai SDK >=2.32, model_dump() injects extra client-side fields (e.g. moderation) that are not present in the raw server response, causing: AssertionError: a…
ca73293 #44620 — [Bugfix][Rust Frontend] Fix UTF-8 char-boundary panic in incremental detokenizer (#44620)
- Author: Ting SUN | +25/-0 | 1 file
The Rust frontend’s incremental detokenizer (DecodeStream::next_chunk in rust/src/tokenizer/src/incremental.rs) slices cumulative_output at a hold-back cutoff computed as a raw byte offset (len - min_bytes_to_buffer, where min_bytes_to_buffer comes from the request’s stop-string length), without aligning to a UTF-8 char boundary — unlike push_token and flush, which both use floor_char_boundary. Wh…
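The boundary alignment can be sketched in Python as an analogue of Rust's str::floor_char_boundary. The fix itself is in Rust; split_with_holdback is a hypothetical helper that mirrors the hold-back cutoff described above, not the frontend's code.

```python
def floor_char_boundary(data: bytes, index: int) -> int:
    """Largest position <= index that does not split a UTF-8 character."""
    if index >= len(data):
        return len(data)
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    while index > 0 and (data[index] & 0xC0) == 0x80:
        index -= 1
    return index


def split_with_holdback(text: str, min_bytes_to_buffer: int):
    """Split text into (emitted, held_back), holding back at least
    min_bytes_to_buffer trailing bytes without cutting a character."""
    data = text.encode("utf-8")
    cutoff = max(len(data) - min_bytes_to_buffer, 0)
    cutoff = floor_char_boundary(data, cutoff)
    return data[:cutoff].decode("utf-8"), data[cutoff:].decode("utf-8")
```

Rounding the cutoff down rather than up keeps the guarantee that at least min_bytes_to_buffer bytes stay buffered for stop-string matching, at the cost of occasionally holding back a few extra bytes.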
da1daf4 #44571 — [Bugfix] Exclude vision embedder from quantization in Gemma4 Unified (#44571)
- Author: Luciano Martins | +3/-1 | 1 file
Fix W4A16 compressed-tensors checkpoint loading for Gemma4UnifiedForConditionalGeneration (encoder-free Gemma4 variant). Gemma4UnifiedVisionEmbedder.patch_dense is constructed as a ColumnParallelLinear with quant_config passed through. This causes vLLM to create quantized parameters (weight_packed, weight_scale) at module init time. However, the compressed-tensors ignore list — which correctly mar…
439203d #44380 — [Bugfix] Fix test_cutlass_moe.py (#44380)
- Author: bnellnm | +20/-9 | 3 files
Fix test_cutlass_moe.py. - The FusedMoEConfig that was being used for the tests was incorrect/incomplete and was tripping assertions in the permute cache. - The fp8 cutlass experts should still support the option of an external expert_map if one is provided. Failure log: ci_build_69591_kernels-fp8-moe-test-1-h100.log. Test plan: ran pytest tests/kernels/moe/test_cutlass_moe.py.
99ef652 #44057 — [Bugfix] Reject non-positive values for ParallelConfig int knobs (#44057)
- Author: JianweiZheng | +20/-18 | 1 file
Adds Pydantic lower-bound constraints to the parallelism-size knobs in ParallelConfig so that obviously invalid values (zero, negative) fail fast at construction time instead of producing nonsensical world_size or surfacing as opaque errors later in torch.distributed. Problem: vllm/config/parallel.py declares the parallelism size fields as bare int = 1 with no validation. This means: - ParallelC…
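The fail-fast idea can be sketched with a standard-library dataclass. This is a stand-in, not the PR's approach: the PR uses Pydantic lower-bound constraints, whereas this sketch checks bounds in __post_init__, and the ParallelSizes class is a hypothetical two-field subset.

```python
from dataclasses import dataclass, fields


@dataclass
class ParallelSizes:
    """Hypothetical subset of parallelism knobs, validated at construction."""

    tensor_parallel_size: int = 1
    pipeline_parallel_size: int = 1

    def __post_init__(self):
        # Reject zero and negative sizes before they can reach world_size.
        for f in fields(self):
            value = getattr(self, f.name)
            if value < 1:
                raise ValueError(f"{f.name} must be >= 1, got {value}")

    @property
    def world_size(self) -> int:
        return self.tensor_parallel_size * self.pipeline_parallel_size
```

Without the check, a zero size would silently produce a world size of zero, and the mistake would only surface much later in distributed setup.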
3dbb4e0 #44509 — [Bugfix] MiniCPM-V-4.6 video inference crash: placeholder count mismatches visual embedding count (#44509)
- Author: tc-mb | +60/-2 | 2 files
Sending a video request to openbmb/MiniCPM-V-4_6 causes EngineDeadError: the engine core crashes because the number of <|video_pad|> placeholder tokens does not match the number of visual embeddings produced by the vision tower. Image inference works correctly; only video triggers this. Root cause: three issues conspire. 1. VideoProcessorItems.get_frame_size (parse.py) hardcodes (C, H, W) s…
9354fb1 #44476 — [Bugfix][Compile] Guard per_token_group_fp8_quant lookup on non-CUDA platforms (#44476)
- Author: QiliangCui2023 | +7/-4 | 2 files
PR #42758 (“Enable perf_token_group_quant/_C_stable_libtorch for ROCm”) moved two QUANT_OPS dict entries — kFp8Dynamic128Sym and kFp8Dynamic64Sym, both indexing torch.ops._C.per_token_group_fp8_quant.default — out of the if current_platform.is_cuda(): guard and into the unconditional dict literal at module top-level. This breaks any platform whose vLLM build does not register per_token_group_fp8_q…
4b87b3e #44205 — [Bugfix] fix EVS for qwen3-vl (#44205)
- Author: Rui “Garry” Gao | +4/-4 | 1 file
Fix EVS for Qwen3-VL by reverting the changes that PR #34246 made to qwen3_vl.py. See Issue #44204 for detailed descriptions. Test plan: launch a service with EVS on (--video_pruning_rate 0.5) and send in a request with video. Other tests should not be necessary, as we are just reverting the change. Test result: the fix works on vllm 0.20.2. Duplicate check: searched for “evs” and did not find any duplicate PR.

🧪 CI/Tests

ef0df7d #44647 — [CI] Bump mypy version 1.19.1 -> 1.20.2 (#44647)
- Author: Harry Mellor | +18/-34 | 9 files
In my local environment this speeds up pre-commit run -a --hook-stage manual mypy-3.10 from ~11s to ~8s, which is a ~27% improvement. Also, the import follow skipping that happens in local mypy was removed a few weeks ago, so it’s no longer true that mypy runs differently in CI. I’ve updated the documentation/comments/mypy.py to reflect this.
c66b198 #44649 — [CI] Bump mistral-common (#44649)
- Author: Harry Mellor | +7/-7 | 7 files
This PR updates mistral-common in CI.
a947f7a #43307 — [Kernel][Test] Extend lightning_attn and awq_triton kernel tests to XPU (#43307)
- Author: Agata Dobrzyniewicz | +37/-19 | 3 files
Make the Lightning Attention Triton tests (tests/kernels/attention/test_lightning_attn.py) and the AWQ Triton GEMM/dequantize tests (tests/kernels/quantization/test_awq_triton.py) runnable on Intel XPU in addition to CUDA/ROCm. The underlying kernels are pure Triton and already work on XPU per Intel-side validation - only the test harness was pinning device="cuda". Changes: - tests/kernels/attenti…
06f9463 #44436 — [ROCm][CI] Add test for Aiter unified attn kernel (#44436)
- Author: Divakar Verma | +339/-0 | 1 file
This PR introduces a test to compare the output of the ROCm aiter_unified_attn kernel with a reference implementation (from the existing test_triton_unified_attention). - Decode, Prefill and Mixed-Prefill Sequences - FP16, BF16 and FP8 (kv-cache, query and kv-cache+query) - Requires Aiter. Test command: pytest -s -v tests/kernels/attention/test_rocm_aiter_unified_attn.py
3e77036 #44255 — [ROCm][CI] Specifying time outs for the lm eval models (#44255)
- Author: Andreas Karatzas | +19/-4 | 4 files
This PR addresses a ROCm-specific failure that appears to be caused by the GSM8K client timeout, not by incorrect model outputs. On the MI300 Buildkite job, these two models reached the eval’s fixed 600s aiohttp timeout and timed-out requests were returned as empty strings, which then counted as invalid answers and drove accuracy down. CUDA/H200 passed the same configs on the same commit, and …
6f68ca3 #44046 — [ROCm][CI] Stabilize memory-release in the Hybrid model generation tests (#44046)
- Author: Andreas Karatzas | +76/-41 | 1 file
Stabilizes some intermittent memory-release failures observed on ROCm in the Hybrid model generation tests by making helper-created VllmRunner instances use an explicit context manager and adding a stricter ROCm-only memory settle between large APC engine lifetimes. The failures were seen in the Buildkite Hybrid 2 shard, for example: - https://buildkite.com/vllm/ci/builds/68818/canvas?jid=019e71c2…
22c2e87 #44497 — [CI] Reverted gitignore changes (#44497)
- Author: Andreas Karatzas | +0/-14 | 3 files
Reverting dockerignore changes completely to unblock release. Also resolving a SCCACHE_ENDPOINT problem inside the ROCm build: sccache sees SCCACHE_ENDPOINT= and crashes with InvalidUri(Empty).

⚡ Performance

b4a6f26 #41002 — [ROCm][perf] Use workspace manager for sparse indexer allocations (#41002)
- Author: Tuukka Sarvi | +55/-24 | 1 file
Replace dynamic per-call allocations in the ROCm sparse attention indexer with workspace manager allocations. The sparse attention indexer uses temporary buffers in both prefill and decode: k_fp8 / k_scale in prefill and decode logits buffers in rocm_fp8_paged_mqa_logits. These buffers are runtime scratch space, and their sizes vary with request shape, batch size, and speculative decoding state. T…
d0975a4 #42646 — [perf] Add gemma RMS AR fusion (#42646)
- Author: Jiahan Chang (Cyrus) | +225/-16 | 3 files
Integrates the flashinfer gemma RMS AR fusion (https://github.com/flashinfer-ai/flashinfer/pull/3322), which is used in gemma, Qwen3-next, and Qwen3.5. Perf: Model: Qwen/Qwen3.5-397B-A17B-FP8, ISL/OSL = 1024/1024, TP=4 | conc | Output tok/s (Base) | Mean TTFT ms (Base) | Mean TPOT ms (Base) | Output tok/s (Fusion) | Mean TTFT ms (Fusion) | Mean TPOT ms (Fusion) | |---|---:|---:|---:|---:|--…

✨ New Feature

165b786 #40426 — [ROCM] [FEAT] Integrate Aiter hipBLASLt GEMM online tuning (#40426)
- Author: Han Lin | +591/-0 | 6 files
Enable ROCm AITER hipBLASLt online tuning via a vLLM env var, and add ROCm tests covering the online-tuning flow and kernel gating behavior. This PR adds support for enabling hipBLASLt online tuning in vLLM through VLLM_ROCM_USE_AITER_LINEAR_HIPBMM, which forwards to HIP_ONLINE_TUNING=1 early in ROCm platform initialization so tuning is available before hipBLASLt is initialized. It also tightens t…

🔩 Misc

8d9536a #44471 — [Misc] Add unit tests for pooler head classes (#44471)
- Author: Taneem Ibrahim | +481/-0 | 1 file
Adds unit tests for the four pooler head classes in vllm/model_executor/layers/pooler/. Test plan: pytest tests/model_executor/layers/test_pooler_heads.py -v and pre-commit run --all-files (all hooks). Test result: all pass.

📖 Documentation

3da29aa #34894 — [DOC] Add INT8 W4A8 docs and Arm’s supported quantization schemes (#34894)
- Author: Fadi Arafeh | +372/-155 | 7 files
- Add documentation for INT8 W4A8 quantization scheme - Update Arm’s supported quantization scheme - Moves llm-compressor recipes under an llm-compressor specific section upon @dsikka’s request. Fixes: https://github.com/vllm-project/vllm/issues/25169 Test plan: evaluate quantized models using the provided INT8 W4A8 recipe using gsm8k and llama 3.1-8b. Test result: llama 3.1-8b quantized using the INT8 W4A…
